How to Use Proxies to Rotate IP Addresses in Python

Prevent websites from banning your IP address while scraping, or anonymize your browsing, using different proxy servers and methods in Python.
Abdou Rockikz · 6 min read · Updated Feb 2020 · Web Scraping · Sponsored


A proxy is a server application that acts as an intermediary for requests between a client and the server from which the client is requesting a certain service (HTTP, SSL, etc.).

When using a proxy server, instead of connecting directly to the target server and requesting whatever you want, you direct the request to the proxy server, which evaluates the request, performs it, and returns the response. Wikipedia has a simple diagram that demonstrates how proxy servers work.

Web scraping experts often use more than one proxy to prevent websites from banning their IP addresses. Proxies have several other benefits as well, including bypassing filters and censorship and hiding your real IP address.

In this tutorial, you will learn how to use proxies in Python with the requests library. We will also be using the stem library, which is a Python controller library for Tor. Let's install them:

pip3 install bs4 requests stem

Related: How to Make a Subdomain Scanner in Python.

Using Freely Available Proxies

First, there are some websites that offer free proxy lists. I have built a function to automatically grab one of these lists:

import requests
import random
from bs4 import BeautifulSoup as bs

def get_free_proxies():
    url = "https://free-proxy-list.net/"
    # get the HTTP response and construct soup object
    soup = bs(requests.get(url).content, "html.parser")
    proxies = []
    for row in soup.find("table", attrs={"id": "proxylisttable"}).find_all("tr")[1:]:
        tds = row.find_all("td")
        try:
            # the first two columns of each row are the IP address and the port
            ip = tds[0].text.strip()
            port = tds[1].text.strip()
            host = f"{ip}:{port}"
            proxies.append(host)
        except IndexError:
            # skip rows that don't have the expected columns
            continue
    return proxies
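
For example, you can call it like this as a quick sanity check (assuming the site's table layout hasn't changed):

free_proxies = get_free_proxies()
print(f"Got {len(free_proxies)} free proxies, for example: {free_proxies[:3]}")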

However, when I tried to use them, most were timing out, so I filtered out some working ones:

proxies = [
    '167.172.248.53:3128',
    '194.226.34.132:5555',
    '203.202.245.62:80',
    '141.0.70.211:8080',
    '118.69.50.155:80',
    '201.55.164.177:3128',
    '51.15.166.107:3128',
    '91.205.218.64:80',
    '128.199.237.57:8080',
]

This list won't stay viable for long; in fact, most of these proxies will stop working very soon, so you should run the above function whenever you need fresh proxy servers.

The function below accepts a list of proxies and creates a requests session that randomly selects one of them:

def get_session(proxies):
    # construct an HTTP session
    session = requests.Session()
    # choose one random proxy
    proxy = random.choice(proxies)
    session.proxies = {"http": proxy, "https": proxy}
    return session

Let's test this by making a request to a website that returns our IP address:

for i in range(5):
    s = get_session(proxies)
    try:
        print("Request page with IP:", s.get("http://icanhazip.com", timeout=1.5).text.strip())
    except Exception:
        # the chosen proxy is dead or too slow, skip it and try another
        continue

Here is my output:

Request page with IP: 45.64.134.198
Request page with IP: 141.0.70.211
Request page with IP: 94.250.248.230
Request page with IP: 46.173.219.2
Request page with IP: 201.55.164.177

As you can see, these are the IP addresses of the working proxy servers, not our real IP address (visit icanhazip.com in your browser and you'll see your real IP address).
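
Since any single free proxy may already be dead, it can help to wrap the request in a small retry helper that keeps drawing new random sessions until one responds. Here is a minimal sketch built on the functions above (the retry count and timeout values are arbitrary choices):

def get_with_retries(proxies, url, max_retries=10):
    # try up to max_retries different random proxies until one responds
    for _ in range(max_retries):
        session = get_session(proxies)
        try:
            return session.get(url, timeout=1.5)
        except Exception:
            # this proxy is dead or too slow, try another one
            continue
    return None

response = get_with_retries(proxies, "http://icanhazip.com")
if response:
    print("Got response with IP:", response.text.strip())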

Free proxies tend to die very quickly, often within days or even hours, and will frequently stop working before your scraping project ends. To prevent that, you should use premium proxies for large-scale data extraction projects; there are many providers that rotate IP addresses for you. One of the well-known solutions is Crawlera.

Using Crawlera

Scrapinghub's Crawlera allows you to crawl quickly and reliably; it manages and rotates proxies internally, so if you get banned, it automatically detects that and rotates the IP address for you.

Here is how you can use Crawlera with the requests library:

import requests

url = "http://icanhazip.com"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = "<APIKEY>:"
proxies = {
    "https": f"https://{proxy_auth}@{proxy_host}:{proxy_port}/",
    "http": f"http://{proxy_auth}@{proxy_host}:{proxy_port}/"
}

r = requests.get(url, proxies=proxies, verify=False)

Once you register for a plan, you'll be provided with an API key.

So, here is what Crawlera does for you:

  • You send the HTTP request using its single endpoint API.
  • It automatically selects, rotates, throttles and blacklists IPs to retrieve the target data.
  • You receive a successful response with the target data.
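
In a real scraper, you would typically set this up once on a requests.Session so that every request is routed through Crawlera. Here is a minimal sketch (the target URLs are just illustrative, and proxies is the dictionary defined above):

import requests

session = requests.Session()
session.proxies = proxies  # the Crawlera proxy configuration from above
session.verify = False     # certificate verification is disabled, as in the snippet above

# every request made through this session goes through Crawlera
for url in ["http://icanhazip.com", "http://example.com"]:
    r = session.get(url)
    print(url, r.status_code)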

Using Tor as a Proxy

You can also use the Tor network to rotate IP addresses:

import requests
from stem.control import Controller
from stem import Signal

def get_tor_session():
    # initialize a requests Session
    session = requests.Session()
    # set the proxy of both http & https to localhost:9050
    # this requires a running Tor service on your machine, listening on port 9050 (the default)
    session.proxies = {"http": "socks5://localhost:9050", "https": "socks5://localhost:9050"}
    return session

def renew_connection():
    with Controller.from_port(port=9051) as c:
        c.authenticate()
        # send NEWNYM signal to establish a new clean connection through the Tor network
        c.signal(Signal.NEWNYM)

if __name__ == "__main__":
    s = get_tor_session()
    ip = s.get("http://icanhazip.com").text
    print("IP:", ip)
    renew_connection()
    s = get_tor_session()
    ip = s.get("http://icanhazip.com").text
    print("IP:", ip)

Note: The above code will only work if you have Tor installed on your machine (head to this link to install it properly) and properly configured (ControlPort 9051 enabled; check this Stack Overflow answer for further details). You will also need SOCKS support in requests, which you can install with pip3 install requests[socks].

This creates a session with a Tor IP address and makes an HTTP request, then renews the connection by sending the NEWNYM signal (which tells Tor to establish a new clean circuit) to change the IP address, and makes another request. Here is the output:

IP: 185.220.101.49

IP: 109.70.100.21
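
If you want to change the exit IP periodically during a scrape, you can combine the two helpers above in a loop. Here is a minimal sketch that renews the circuit every few requests (the counts and the sleep duration are arbitrary; the short pause just gives Tor time to build a new circuit):

import time

for i in range(9):
    if i % 3 == 0:
        # ask Tor for a new circuit every 3 requests
        renew_connection()
        time.sleep(5)
    s = get_tor_session()
    print("Request", i, "with IP:", s.get("http://icanhazip.com").text.strip())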

Conclusion

There are several proxy types, including transparent proxies, anonymous proxies, and elite proxies. If your goal in using proxies is to prevent websites from banning your scrapers, then elite proxies are the optimal choice; they make you appear to be a regular internet user who is not using a proxy at all.

However, if you're more concerned about your privacy on the Internet, then anonymous proxies are the way to go. An anonymous proxy identifies itself as a proxy server but does not reveal your original IP address.

Furthermore, an extra measure against getting blocked is rotating user agents, in which you send a different spoofed User-Agent header with each request, claiming to be a regular browser.
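
Here is a minimal sketch of that idea, using a small hard-coded list of User-Agent strings (in practice, you would use a larger, up-to-date list or a library such as fake-useragent):

import random
import requests

# a few example User-Agent strings; in practice, use a larger, up-to-date list
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.4 Safari/605.1.15",
]

def get_with_random_user_agent(url):
    # send a different spoofed User-Agent header on each request
    headers = {"User-Agent": random.choice(user_agents)}
    return requests.get(url, headers=headers)

print(get_with_random_user_agent("http://example.com").status_code)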

Finally, Crawlera saves you time and energy by automatically managing proxies for you; if you want to try Crawlera, click here.

Learn also: How to Extract All Website Links in Python.

Happy Coding ♥
