
Learn how to configure and utilize proxies with Python requests module

August 22, 2019

Sending HTTP requests in Python is not necessarily easy. The standard library offers modules like urllib and urllib2 to deal with HTTP requests, and there are third-party tools like Requests. Many developers use Requests because it is high level and designed to make sending HTTP requests extremely easy. That is why it is worth knowing how to configure and utilize proxies with the Python Requests module.

But choosing the most suitable tool for your needs is just one step. In the web scraping world, there are many obstacles to overcome, and one huge challenge is when your scraper gets blocked. To solve this problem, you need to use proxies. In this article, I'm going to show you how to utilize proxies with the Python Requests module so your scraper will not get banned.

Python Requests and proxies

In this part, we're going to cover how to configure proxies in Requests. To get started, we need a working proxy and a URL we want to send the request to.

Basic usage

import requests

proxies = {
    "http": "",
    "https": "",
}
r = requests.get("http://toscrape.com", proxies=proxies)

The proxies dictionary must follow this scheme: each protocol is mapped to a proxy URL. It is not enough to define only the proxy address and port; you also need to specify the protocol. You can use the same proxy for multiple protocols. If you need authentication, use this syntax for your proxy:
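A minimal sketch of the authenticated form, using the `user:password@host:port` URL syntax that Requests supports (the credentials and address below are placeholders, not real values):

```python
# Placeholder credentials and proxy address, for illustration only:
# embed the username and password directly in the proxy URL.
proxies = {
    "http": "http://user:password@10.10.1.10:3128",
    "https": "http://user:password@10.10.1.10:3128",
}
```

You would then pass this dictionary via the proxies argument exactly as in the basic example.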


Environment variables

In the above example, you define proxies for each individual request. If you don't need this kind of per-request customization, you can just set these environment variables:

export HTTP_PROXY=""
export HTTPS_PROXY=""

This way you don’t need to define any proxies in your code. Just make the request and it will work.
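To see what Requests will pick up from the environment, you can ask its `requests.utils.get_environ_proxies` helper (a small sketch; the proxy address below is a placeholder set only for demonstration):

```python
import os
import requests

# Placeholder proxy address, set here purely to demonstrate the lookup.
os.environ["http_proxy"] = "http://127.0.0.1:8080"

# Requests consults these variables for every request (unless trust_env
# is disabled), so this shows the proxy mapping it would use for the URL:
proxies = requests.utils.get_environ_proxies("http://toscrape.com")
print(proxies.get("http"))
```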

Proxy with session

Sometimes you need to create a session and use a proxy at the same time to request a page. In this case, you first create a new session object, add the proxies to it, and finally send the request through the session object:

import requests

s = requests.Session()
s.proxies = {
    "http": "",
    "https": "",
}
r = s.get("http://toscrape.com")

IP rotation

As discussed earlier, a common problem we encounter while extracting data from the web is that our scraper gets blocked. It is frustrating because if we can't even reach the website, we can't scrape it. The solution is to use some kind of proxy, or rather multiple rotating proxies. A proxy solution will let us get around the IP ban.

To be able to rotate IPs, we first need a pool of IP addresses. We can use free proxies found on the internet, or we can use a commercial solution. Be aware that if your product or service relies on scraped data, a free proxy solution will probably not be enough for your needs. If a high success rate and data quality are important to you, you should choose a paid proxy solution like Zyte Smart Proxy Manager (formerly Crawlera).

IP rotation with Python requests

So let’s say we have a list of proxies. Something like this:

ip_addresses = ["", "", "", "", ""]

Then, we can randomly pick a proxy to use for our request. If the proxy works properly we can access the given site. If there’s a connection error we might want to delete this proxy from the list and retry the same URL with another proxy.

import random
import requests

proxy_index = random.randint(0, len(ip_addresses) - 1)
proxies = {"http": ip_addresses[proxy_index], "https": ip_addresses[proxy_index]}
try:
    requests.get(url, proxies=proxies)
except requests.exceptions.ConnectionError:
    # implement here what to do when there's a connection error
    # for example: remove the used proxy from the pool and retry the request using another one
    pass

There are multiple ways you can handle connection errors, because sometimes the proxy you are trying to use is simply banned. In that case, there's not much you can do other than removing it from the pool and retrying with another proxy. Other times, if the proxy isn't banned, you may just have to wait a little before using it again.
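One possible shape for the "remove and retry" branch is sketched below. The function name, pool handling, and retry limit are assumptions for illustration, not a prescribed implementation: pick a random proxy, and on a connection error drop it from the pool and try another.

```python
import random
import requests


def fetch_with_rotation(url, proxy_pool, max_attempts=3):
    """Illustrative sketch: try `url` through random proxies from
    `proxy_pool` (a list of "http://host:port" strings), removing
    any proxy that raises a connection error."""
    for _ in range(max_attempts):
        if not proxy_pool:
            raise RuntimeError("proxy pool exhausted")
        proxy = random.choice(proxy_pool)
        try:
            return requests.get(
                url, proxies={"http": proxy, "https": proxy}, timeout=10
            )
        except requests.exceptions.RequestException:
            proxy_pool.remove(proxy)  # treat the proxy as dead or banned
    raise RuntimeError("all attempts failed")
```

For proxies that are merely rate limited rather than banned, you might instead re-queue them after a delay rather than removing them outright.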

Implementing your own smart proxy solution which finds the best way to deal with errors is very hard to do. That’s why you should consider using a managed solution, like Zyte Smart Proxy Manager (formerly Crawlera), to avoid all the unnecessary pains with proxies.

Using Zyte Proxy Manager (formerly Crawlera) with Python requests

As a closing note, I want to show you how to solve proxy issues in the easiest way with Zyte Proxy Manager.

import requests

url = "http://httpbin.org/ip"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = ":"  # your API key goes before the colon
proxies = {
    "https": "https://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
    "http": "http://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
}
r = requests.get(url, proxies=proxies, verify=False)

What does this piece of code do? It sends a successful HTTP request through Zyte Proxy Manager. When you use Zyte Proxy Manager, you don't need to deal with proxy rotation manually; everything is taken care of internally.

If you find that managing proxies on your own is too complex and you’re looking for an easy solution, give Zyte Smart Proxy Manager (formerly Crawlera) a go. It has a free trial!

Written by Attila Toth