A proxy is an intermediary server that hides your IP address, so you can browse the web anonymously and securely. Proxies have many interesting use cases, the most prominent being web scraping for pricing intelligence, SEO monitoring, data collection for market research, and so on. Using rotating proxies correctly is a key ingredient of all of these.
If you want to know more about proxies for web scraping and how proxies work, feel free to skim through our recent blog.
So let’s get started!
python -m pip install requests
pip install scrapy
pip install scrapy-rotating-proxies
To use Smart Proxy Manager with Scrapy, you need to install the `scrapy-zyte-smartproxy` middleware:
pip install scrapy-zyte-smartproxy
First, import the Requests library, then create a proxy dictionary that maps each protocol (HTTP and HTTPS) to a proxy URL. Finally, pass the proxy dictionary to the requests.get method to make the request to a URL through the proxy. For example:
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get('http://example.org', proxies=proxies)
You can configure proxies for individual URLs even if the scheme is the same. This comes in handy when you want to use different proxies for the different websites you wish to scrape.
import requests

proxies = {
    'http://example.org': 'http://10.10.1.10:3128',
    'http://something.test': 'http://10.10.1.10:1080',
}
requests.get('http://something.test/some/url', proxies=proxies)
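To make the matching rule concrete, here is a simplified illustration of how a proxy is chosen for a URL: the most specific key (scheme plus host) wins over a scheme-only key. The helper name `pick_proxy` is hypothetical and this is not Requests' actual code. Requests also understands `all` keys, which this sketch leaves out.

```python
from urllib.parse import urlparse


def pick_proxy(url, proxies):
    """Return the proxy URL that applies to `url`, or None (illustrative only)."""
    parts = urlparse(url)
    candidates = [
        f"{parts.scheme}://{parts.hostname}",  # most specific: scheme + host
        parts.scheme,                          # fallback: any host for this scheme
    ]
    for key in candidates:
        if key in proxies:
            return proxies[key]
    return None


proxies = {
    'http': 'http://10.10.1.10:3128',
    'http://something.test': 'http://10.10.1.10:1080',
}
print(pick_proxy('http://something.test/some/url', proxies))
print(pick_proxy('http://example.org/', proxies))
```

The first call matches the host-specific key, while the second falls back to the scheme-only entry.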
`requests.get` essentially uses a `requests.Session` under the hood.
Sometimes you need to create a session and use a proxy at the same time to request a page. In this case, you first create a new session object, add the proxies to it, and then send the request through the session object:
import requests

s = requests.Session()
s.proxies = {
    "http": "http://10.10.10.10:8000",
    "https": "http://10.10.10.10:8000",
}
r = s.get("http://toscrape.com")
On the internet, your IP address is your identity, and you can only make a limited number of requests to a website from one IP. Think of websites as some sort of regulator: they get suspicious of requests coming from the same IP over and over again. This is known as IP rate limiting, and it can lead to blocking, throttling, or CAPTCHAs. One way to overcome this is to rotate proxies. Read more about why you need rotating proxies.
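Now let's get to the “how” part. This tutorial demonstrates three ways to work with rotating proxies: writing your own rotation logic with the Requests library, using the scrapy-rotating-proxies middleware with Scrapy, and using Zyte Smart Proxy Manager.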
Note: You don’t need any different proxies to run the code demonstrated in this tutorial. If your product/service relies on web scraped data, a free proxy solution will probably not be enough for your needs.
Let’s discuss them one by one:
In the code shown below, we first create a pool of proxies. Then we randomly pick a proxy to use for our request. If the proxy works properly, we can access the given site. If there’s a connection error, we may have to delete this proxy from the pool and retry the same URL with another proxy.
import requests

s = requests.Session()
s.proxies = {
    "http": "http://10.10.10.10:8000",
    "https": "http://10.10.10.10:8000",
}
r = s.get("http://toscrape.com")
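The snippet above pins a single proxy to a session. A minimal sketch of the pool-and-retry logic described earlier might look like the following. The names `fetch_with_rotation` and `send_via_requests`, the `send` parameter, and the extra proxy addresses are illustrative assumptions, not part of Requests. The sketch relies on the fact that Requests' connection, proxy, and timeout exceptions derive from `OSError`.

```python
import random


def send_via_requests(url, proxy, timeout=10):
    """Send one GET through `proxy` using Requests (imported lazily)."""
    import requests

    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=timeout)


def fetch_with_rotation(url, proxy_pool, send=send_via_requests, rng=random):
    """Try randomly chosen proxies until one works, dropping the ones that fail.

    `proxy_pool` is a list and is modified in place.
    """
    while proxy_pool:
        proxy = rng.choice(proxy_pool)
        try:
            return send(url, proxy)
        except OSError:
            # Connection-level failure: remove this proxy and retry with another.
            proxy_pool.remove(proxy)
    raise RuntimeError(f"all proxies failed for {url}")


if __name__ == "__main__":
    # Placeholder addresses; replace them with proxies you actually control.
    pool = ["http://10.10.10.10:8000", "http://10.10.10.11:8000"]
    response = fetch_with_rotation("http://toscrape.com", pool)
    print(response.status_code)
```

Passing the sender as a parameter keeps the rotation logic separate from the HTTP client, so it can be exercised without network access.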
In your settings.py, list the proxies to rotate through:
ROTATING_PROXY_LIST = [
    'Proxy_IP:port',
    'Proxy_IP:port',
    # ...
]
If you want more external control over the IPs, you can even load them from a file like this.
ROTATING_PROXY_LIST_PATH = 'listofproxies.txt'
DOWNLOADER_MIDDLEWARES = {
    # ...
    # Priorities follow the scrapy-rotating-proxies README (610 and 620)
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    # ...
}
That’s it! Now all your requests will automatically be routed randomly between the proxies.
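The file referenced by ROTATING_PROXY_LIST_PATH holds one proxy per line. If you also want to reuse that file with the Requests-based approach, a small loader could look like this sketch. The function name `load_proxies` is hypothetical, and skipping blank lines is a choice of this sketch rather than documented middleware behavior.

```python
from pathlib import Path


def load_proxies(path):
    """Read proxies from a text file, one per line, skipping blank lines."""
    proxies = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:
            proxies.append(line)
    return proxies
```

For example, `load_proxies('listofproxies.txt')` returns a list you can hand to your own rotation logic.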
Note: Sometimes the proxy you are trying to use is simply banned. In that case, there’s not much you can do other than remove it from the pool and retry with another proxy. Other times the proxy isn’t banned, and you just have to wait a little before using it again.
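If you manage the pool yourself, the two cases in the note can be modeled separately: banned proxies are removed for good, while throttled ones are rested for a while. The following is a minimal sketch under that assumption. The class name `CoolingProxyPool`, its methods, and the 60-second default are all hypothetical choices, not part of any library mentioned here.

```python
import random
import time


class CoolingProxyPool:
    """Proxy pool that separates permanent bans from temporary cooldowns."""

    def __init__(self, proxies, cooldown=60.0, clock=time.monotonic):
        # Map each proxy to the time at which it may be used again.
        self._ready_at = {proxy: float("-inf") for proxy in proxies}
        self._cooldown = cooldown
        self._clock = clock

    def pick(self):
        """Return a random proxy that is not cooling down, or None."""
        now = self._clock()
        ready = [p for p, t in self._ready_at.items() if t <= now]
        return random.choice(ready) if ready else None

    def rest(self, proxy):
        """Put a throttled proxy aside for `cooldown` seconds."""
        if proxy in self._ready_at:
            self._ready_at[proxy] = self._clock() + self._cooldown

    def ban(self, proxy):
        """Remove a banned proxy from the pool permanently."""
        self._ready_at.pop(proxy, None)
```

A caller would `pick()` a proxy before each request, call `ban()` when the proxy is clearly blocked, and call `rest()` when the site only asks it to slow down.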
The approaches discussed above work well for building demos and minimum viable products. But things can get tricky as soon as you decide to scale your data extraction project. Managing proxy pool infrastructure is challenging, time-consuming, and resource-intensive. You will soon find yourself refreshing proxies to keep the pool healthy, managing bans and sessions, rotating user agents, and so on. Proxy infrastructure also needs to be configured to work with headless browsers to crawl JavaScript-heavy websites. Phew! It’s not surprising how quickly a data extraction project turns into a proxy management project.
Thanks to Zyte Smart Proxy Manager, you don't need to rotate and manage any proxies yourself. It is all done automatically, so you can focus on extracting quality data. Let’s see how easy it is to integrate with your Scrapy project.
# enable the middleware
DOWNLOADER_MIDDLEWARES = {'scrapy_zyte_smartproxy.ZyteSmartProxyMiddleware': 610}
# enable Zyte Proxy
ZYTE_SMARTPROXY_ENABLED = True
# the API key you get with your subscription
ZYTE_SMARTPROXY_APIKEY = '<your_zyte_proxy_apikey>'
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Spider-level Smart Proxy Manager settings; use your own API key here
    zyte_smartproxy_enabled = True
    zyte_smartproxy_apikey = '<your_zyte_proxy_apikey>'
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            "X-Crawlera-Profile": "desktop",
            "X-Crawlera-Cookies": "disable",
        }
    }

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Save each page to quotes-<page number>.html
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
This spider requests two pages from https://quotes.toscrape.com/ through Smart Proxy Manager and saves each response to a local HTML file.
When you use Zyte Smart Proxy Manager, you don’t need to deal with proxy rotation manually. Everything is taken care of internally through the use of our rotating proxies.
You can try Zyte Smart Proxy Manager for 14 days for free.