How to Use Proxies to Rotate IP Addresses in Python

Learn how to scrape the web at scale without getting your IP address banned, using different proxy methods in Python.
  · 7 min read · Updated jul 2020 · Web Scraping · Sponsored


A proxy is a server application that acts as an intermediary for requests between a client and the server from which the client is requesting a certain service (HTTP, SSL, etc.).

When you use a proxy server, instead of connecting directly to the target server and requesting whatever you want, you send the request to the proxy server, which evaluates it, performs it on your behalf, and returns the response. Wikipedia's article on proxy servers has a simple diagram of this flow.

Web scraping experts often use more than one proxy to prevent websites from banning their IP address. Proxies have several other benefits too, including bypassing filters and censorship and hiding your real IP address.

In this tutorial, you will learn how to use proxies in Python with the requests library. We will also use stem, a Python controller library for Tor. Let's install them:

pip3 install bs4 requests stem

Related: How to Make a Subdomain Scanner in Python.

Using Free Available Proxies

First, some websites offer free proxy lists. I have built a function to grab such a list automatically:

import requests
import random
from bs4 import BeautifulSoup as bs

def get_free_proxies():
    url = "https://free-proxy-list.net/"
    # get the HTTP response and construct soup object
    soup = bs(requests.get(url).content, "html.parser")
    proxies = []
    for row in soup.find("table", attrs={"id": "proxylisttable"}).find_all("tr")[1:]:
        tds = row.find_all("td")
        try:
            ip = tds[0].text.strip()
            port = tds[1].text.strip()
            host = f"{ip}:{port}"
            proxies.append(host)
        except IndexError:
            continue
    return proxies

However, when I tried to use them, most of them timed out. I kept only the ones that worked:

proxies = [
    '167.172.248.53:3128',
    '194.226.34.132:5555',
    '203.202.245.62:80',
    '141.0.70.211:8080',
    '118.69.50.155:80',
    '201.55.164.177:3128',
    '51.15.166.107:3128',
    '91.205.218.64:80',
    '128.199.237.57:8080',
]

This list will not stay valid for long. In fact, most of these proxies will probably have stopped working by the time you read this tutorial, so you should run the function above each time you want fresh proxy servers.
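The filtering step mentioned above is not shown in the tutorial. One way it could be done is sketched here, under a few assumptions: the helper names is_proxy_alive and filter_working_proxies are hypothetical, the 3-second timeout and 20 worker threads are arbitrary choices, and icanhazip.com is reused as the test URL.

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def is_proxy_alive(proxy, test_url="http://icanhazip.com", timeout=3):
    """Return True if a request through `proxy` ("ip:port") succeeds in time."""
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    try:
        response = requests.get(test_url, proxies=proxies, timeout=timeout)
        return response.ok
    except requests.RequestException:
        # timeouts, refused connections and proxy errors all count as "dead"
        return False


def filter_working_proxies(proxies, check=is_proxy_alive, max_workers=20):
    """Check all proxies concurrently and keep those that pass `check`, in order."""
    if not proxies:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order, so results line up with proxies
        results = list(executor.map(check, proxies))
    return [proxy for proxy, alive in zip(proxies, results) if alive]


# usage (performs real network requests):
# proxies = filter_working_proxies(get_free_proxies())
```

Checking in threads matters here because dead proxies are mostly discovered by waiting for a timeout; checking them one by one would take roughly the timeout multiplied by the number of dead proxies.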

The function below accepts a list of proxies and creates a requests session that uses one of them, chosen at random:

def get_session(proxies):
    # construct an HTTP session
    session = requests.Session()
    # choose one random proxy
    proxy = random.choice(proxies)
    session.proxies = {"http": proxy, "https": proxy}
    return session

Let's test this by making a request to a website that returns our IP address:

for i in range(5):
    s = get_session(proxies)
    try:
        print("Request page with IP:", s.get("http://icanhazip.com", timeout=1.5).text.strip())
    except Exception as e:
        continue

Here is my output:

Request page with IP: 45.64.134.198
Request page with IP: 141.0.70.211
Request page with IP: 94.250.248.230
Request page with IP: 46.173.219.2
Request page with IP: 201.55.164.177

As you can see, these are some IP addresses of the working proxy servers and not our real IP address (try to visit this website in your browser and you'll see your real IP address).
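The test loop above silently skips any request whose proxy fails. When you need a result for a specific URL, you can instead retry with a different proxy until one works. The following is only a sketch: fetch_with_rotation is a hypothetical helper, and the `fetch(url, proxy)` callable it takes is an assumption of this example that keeps the retry logic separate from the HTTP client.

```python
import random


def fetch_with_rotation(url, proxies, fetch, max_attempts=5):
    """Try `url` through randomly ordered proxies until one succeeds.

    `fetch(url, proxy)` is supplied by the caller; it returns a result or
    raises an exception. Returns (proxy, result), or raises RuntimeError
    if every attempt fails.
    """
    # pick up to max_attempts distinct proxies in random order
    candidates = random.sample(proxies, k=min(max_attempts, len(proxies)))
    last_error = None
    for proxy in candidates:
        try:
            return proxy, fetch(url, proxy)
        except Exception as error:  # broad on purpose: any failure means "try the next one"
            last_error = error
    raise RuntimeError(f"all {len(candidates)} proxies failed") from last_error


# usage with requests (performs real network requests):
# import requests
# def fetch(url, proxy):
#     mapping = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
#     return requests.get(url, proxies=mapping, timeout=1.5).text.strip()
# proxy, ip = fetch_with_rotation("http://icanhazip.com", proxies, fetch)
```

Sampling without replacement means the same dead proxy is never tried twice within one call, which random.choice in a loop would not guarantee.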

Free proxies tend to die very quickly, usually within days or even hours, and often before a scraping project ends. To avoid that, use premium proxies for large-scale data extraction projects; many providers will rotate IP addresses for you. One well-known solution is Crawlera, which we will discuss in the last section of this tutorial.

Using Tor as a Proxy

You can also use the Tor network to rotate IP addresses:

import requests
from stem.control import Controller
from stem import Signal

def get_tor_session():
    # initialize a requests Session
    session = requests.Session()
    # setting the proxy of both http & https to the localhost:9050 
    # this requires a running Tor service in your machine and listening on port 9050 (by default)
    session.proxies = {"http": "socks5://localhost:9050", "https": "socks5://localhost:9050"}
    return session

def renew_connection():
    with Controller.from_port(port=9051) as c:
        c.authenticate()
        # send NEWNYM signal to establish a new clean connection through the Tor network
        c.signal(Signal.NEWNYM)

if __name__ == "__main__":
    s = get_tor_session()
    ip = s.get("http://icanhazip.com").text
    print("IP:", ip)
    renew_connection()
    s = get_tor_session()
    ip = s.get("http://icanhazip.com").text
    print("IP:", ip)

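Note: The above code works only if Tor is installed on your machine (head to this link to install it properly) and configured correctly (ControlPort 9051 must be enabled; check this stackoverflow answer for further details). The socks5:// proxy URLs also require SOCKS support in requests, which you can install with pip3 install "requests[socks]".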

This creates a session with a Tor IP address and makes an HTTP request. It then renews the connection by sending the NEWNYM signal (which tells Tor to establish a new clean connection) to change the IP address, and makes another request. Here is the output:

IP: 185.220.101.49

IP: 109.70.100.21

Great! However, once you start scraping through the Tor network, you'll soon notice that it is quite slow most of the time, which is why the approach in the next section is the recommended one.
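Before moving on, one caveat about the Tor script above: Tor may rate-limit NEWNYM signals, and a new circuit is not guaranteed to leave through a different exit node, so the second request can occasionally report the same IP address. A small polling helper can make that explicit. This is a sketch only; wait_for_new_ip is a hypothetical name, and the attempt count and delay are arbitrary.

```python
import time


def wait_for_new_ip(get_ip, old_ip, attempts=10, delay=2.0, sleep=time.sleep):
    """Poll `get_ip()` until it returns an address different from `old_ip`.

    Returns the new address, or None if it never changed within `attempts` polls.
    """
    for attempt in range(attempts):
        current_ip = get_ip()
        if current_ip != old_ip:
            return current_ip
        if attempt < attempts - 1:
            # give Tor some time to build a new circuit before asking again
            sleep(delay)
    return None


# usage with the functions defined above (needs a running Tor service):
# old_ip = get_tor_session().get("http://icanhazip.com").text.strip()
# renew_connection()
# new_ip = wait_for_new_ip(
#     lambda: get_tor_session().get("http://icanhazip.com").text.strip(), old_ip
# )
```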

Using Crawlera

Scrapinghub's Crawlera allows you to crawl quickly and reliably. It manages and rotates proxies internally, so if you get banned, it automatically detects that and rotates the IP address for you.

Crawlera is a smart proxy network, specifically designed for web scraping and crawling. Its job is clear: making your life easier as a web scraper. It helps you get successful requests and extract data at scale from any website using any web scraping tool.

With its simple API, the requests you make when scraping are routed through a pool of high-quality proxies. When necessary, it automatically introduces delays between requests and removes or adds IP addresses to overcome different crawling challenges.

Here is how you can use Crawlera with the requests library in Python:

import requests

url = "http://icanhazip.com"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = "<APIKEY>:"
proxies = {
    "https": f"https://{proxy_auth}@{proxy_host}:{proxy_port}/",
    "http": f"http://{proxy_auth}@{proxy_host}:{proxy_port}/",
}

r = requests.get(url, proxies=proxies, verify=False)

Once you register for a plan, you'll be provided with an API key, which replaces the <APIKEY> placeholder in proxy_auth.
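Rather than pasting the key into the script, you may prefer to read it from the environment. The sketch below builds the same proxies mapping as the snippet above. Both function names and the CRAWLERA_APIKEY variable name are hypothetical choices for illustration, not something Crawlera prescribes.

```python
import os


def crawlera_proxies(api_key, host="proxy.crawlera.com", port="8010"):
    """Build the same proxies mapping as the snippet above from an API key."""
    # the key sits in the username position with an empty password,
    # hence the trailing colon
    auth = f"{api_key}:"
    return {
        "https": f"https://{auth}@{host}:{port}/",
        "http": f"http://{auth}@{host}:{port}/",
    }


def crawlera_proxies_from_env(variable="CRAWLERA_APIKEY"):
    """Read the key from an environment variable (the name is an assumption)."""
    api_key = os.environ.get(variable)
    if not api_key:
        raise RuntimeError(f"environment variable {variable} is not set")
    return crawlera_proxies(api_key)


# usage (performs a real network request through Crawlera):
# r = requests.get(url, proxies=crawlera_proxies_from_env(), verify=False)
```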

So, here is what Crawlera does for you:

  • You send the HTTP request using its single endpoint API.
  • It automatically selects, rotates, throttles and blacklists IPs to retrieve the target data.
  • It handles request headers and maintains sessions.
  • You receive a successful response.

Conclusion

There are several proxy types, including transparent, anonymous, and elite proxies. If your goal is to prevent websites from banning your scrapers, elite proxies are the best choice: they make you look like a regular internet user who is not using a proxy at all.

Furthermore, an extra measure against anti-scraping defenses is rotating user agents: with each request you send a different spoofed User-Agent header claiming that you're a regular browser.
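A minimal sketch of user-agent rotation with requests follows. The User-Agent strings are only illustrative examples and random_headers is a hypothetical helper name; in a real project you would maintain a larger, up-to-date list.

```python
import random

# illustrative browser User-Agent strings; keep your own list current
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
]


def random_headers(user_agents=USER_AGENTS):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(user_agents)}


# usage with requests (performs a real network request):
# import requests
# r = requests.get("http://icanhazip.com", headers=random_headers(), timeout=5)
```

This combines naturally with proxy rotation: pass both `headers=` and `proxies=` on each request so that neither the IP address nor the browser signature stays constant.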

Finally, Crawlera saves you time and energy by managing proxies for you automatically. It also provides a 14-day free trial, so you can try it out without any risk. If you need a proxy solution, I highly suggest you try Crawlera.

Learn also: How to Extract All Website Links in Python.

Happy Coding ♥
