When scraping the web at any reasonable scale, using proxies is an absolute must. However, it is common for managing and troubleshooting proxy issues to consume more time than building and maintaining the spiders themselves.
In this guide, we will cover everything you need to know about the best proxies for web scraping and how they will make your life easier.
Zyte Smart Proxy Manager (formerly Crawlera) is a proxy manager designed specifically for web crawling and scraping. It routes requests through a pool of IPs, throttles access by introducing delays, and discards proxies from the pool when they get banned or run into similar problems accessing certain domains. Users can give it instructions through an API, enabling features such as setting a browser profile or using IPs from a certain region to help mimic requests from real users. Using Zyte Smart Proxy Manager lets you offload the proxy management of your scraping project and focus on building your scraping and crawling logic.
Zyte Smart Proxy Manager (formerly Crawlera) selects proxies and browser profiles from pools when users are trying to access websites. It monitors the responses to detect when bans occur, either by checking the response status or by following site-specific rules that classify unexpected responses as bans. When a ban is detected, it tries again using a new proxy/profile. The number of retries, the kinds of browser profiles, and other settings can be selected by users through an API, which can help cut down on bans if you know which settings are reliable for a given site. Zyte Smart Proxy Manager handles your proxy management for you, allowing you to focus more on building your scraping and crawling logic.
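To make the detect-and-retry idea concrete, here is a minimal sketch in Python. It is not how Zyte Smart Proxy Manager is implemented. The ban status codes, the "captcha" body marker, and the names `is_ban`, `fetch_with_retries`, and `fetch` are all assumptions made for illustration.

```python
import itertools
from typing import Callable, Iterable, Optional, Tuple

# Assumption: these status codes signal a ban. Real sites vary widely.
BAN_STATUSES = {403, 429, 503}


def is_ban(status: int, body: str) -> bool:
    """Classify a response as a ban by status code or a hypothetical body marker."""
    return status in BAN_STATUSES or "captcha" in body.lower()


def fetch_with_retries(
    url: str,
    proxies: Iterable[str],
    fetch: Callable[[str, str], Tuple[int, str]],
    max_attempts: int = 3,
) -> Optional[Tuple[str, str]]:
    """Try the URL through successive proxies until a response is not a ban.

    `fetch(url, proxy)` is a caller-supplied function returning (status, body).
    Returns (proxy, body) on success, or None if every attempt looked banned.
    """
    for proxy in itertools.islice(proxies, max_attempts):
        status, body = fetch(url, proxy)
        if not is_ban(status, body):
            return proxy, body
    return None
```

Passing the download function in as `fetch` keeps the retry logic separate from the HTTP client, so the same loop works with whichever library your spiders already use.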
Before we discuss what a proxy is, we first need to understand what an IP address is and how it works.
An IP address is a numerical address assigned to every device that connects to an Internet Protocol network like the internet, giving each device a unique identity. Most IP addresses look like this:
A proxy is a 3rd party server that enables you to route your request through their servers and use their IP address in the process. When using a proxy, the website you are making the request to no longer sees your IP address but the IP address of the proxy, giving you the ability to scrape the web anonymously if you choose.
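As a rough illustration of routing a request through a proxy, the sketch below uses only Python's standard library. The proxy URL is a hypothetical placeholder, and `build_proxy_opener` is a name chosen for this example.

```python
import urllib.request

# Hypothetical proxy endpoint: replace with one supplied by your provider.
PROXY_URL = "http://user:password@proxy.example.com:8000"


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests via the given proxy."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(proxy_handler)


opener = build_proxy_opener(PROXY_URL)
# opener.open("https://example.com", timeout=10) would now go through the proxy,
# so the target website sees the proxy's IP address instead of yours.
```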
Currently, the world is transitioning from IPv4 to a newer standard called IPv6. This newer version allows for the creation of far more IP addresses. However, IPv6 is not yet widely adopted in the proxy business, so most proxy IPs still use the IPv4 standard.
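To make the two address formats concrete, here is a small sketch using Python's standard `ipaddress` module. The two addresses come from ranges reserved for documentation, so they are examples only, and `address_version` is a helper name chosen for this sketch.

```python
import ipaddress


def address_version(text: str) -> int:
    """Return 4 or 6 depending on which IP standard the address uses."""
    return ipaddress.ip_address(text).version


for example in ("203.0.113.7", "2001:db8::1"):
    print(example, "is an IPv%d address" % address_version(example))

# IPv4 addresses are 32 bits long and IPv6 addresses are 128 bits long,
# which is why IPv6 allows for so many more addresses.
ipv4_total = 2 ** 32
ipv6_total = 2 ** 128
```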
When scraping a website, we recommend that you use a 3rd party proxy and set your company name as the user agent so the website owner can contact you if your scraping is overburdening their servers or if they would like you to stop scraping the data displayed on their website.
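One possible way to identify yourself is shown below, again with the standard library. The company name, URL, and email address in the user agent string are placeholders, and the exact format is an assumption rather than a required convention.

```python
import urllib.request

# Placeholder identity: use your own company name and a contact the site owner can reach.
USER_AGENT = "ExampleCorpBot/1.0 (+https://example.com/bot; scraping@example.com)"


def build_request(url: str) -> urllib.request.Request:
    """Create a request that announces who is scraping via the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("https://example.com/products")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```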
There are a number of reasons why proxies are important for web scraping:
A proxy service for scraping is used to manage proxies for a scraping project. A simple proxy service could be just a set of proxies used in parallel to create the appearance of separate users accessing the site at the same time. A more complex proxy service would be something like Zyte Smart Proxy Manager (formerly Crawlera), which detects proxies that may be “burnt” by antibot systems and cycles them out. Proxy services are important for large scraping projects, both for mitigating antibot defences and for speeding up the handling of requests sent in parallel.
A VPN is a type of proxy server that routes all your web traffic through a (typically) encrypted server. The purpose of a VPN is to anonymise web traffic: an ISP will only see a VPN user sending requests to their VPN, while any service being connected to will see connections coming from the VPN rather than from the user’s own machine. Some network proxies may not provide this anonymising feature and may only operate on certain kinds of requests.
Ok, we now know what proxies are, but how do you use them as part of your web scraping?
Just as scraping a website from only your own IP address limits you, using only one proxy will reduce your crawling reliability, geotargeting options, and the number of concurrent requests you can make.
As a result, you need to build a pool of proxies that you can route your requests through, splitting the traffic over a large number of proxies.
The size of your proxy pool will depend on a number of factors:
All five of these factors have a big impact on the effectiveness of your proxy pool. If you don’t properly configure your pool of proxies for your specific web scraping project you can often find that your proxies are being blocked and you’re no longer able to access the target website.
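As a back-of-the-envelope illustration of sizing a pool, the sketch below divides your request volume by the number of requests you are willing to send through each IP. Both numbers are assumptions you would need to tune per target website, and this simple ratio ignores the other factors that affect pool size.

```python
import math


def estimate_pool_size(requests_per_hour: int, max_requests_per_ip_per_hour: int) -> int:
    """Rough lower bound on the number of proxies needed for a given request rate."""
    if max_requests_per_ip_per_hour <= 0:
        raise ValueError("max_requests_per_ip_per_hour must be positive")
    return math.ceil(requests_per_hour / max_requests_per_ip_per_hour)


# Assumed figures: 100,000 requests per hour, at most 500 requests per IP per hour.
print(estimate_pool_size(100_000, 500))  # 200
```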
In the next section, we will look at the different types of IPs you can use as proxies.
If you’ve done any level of research into your proxy options, you will probably have realized that this can be a confusing topic. Every proxy provider is shouting from the rafters that they have the best website proxy IPs, with very little explanation as to why, which makes it very hard to assess which is the best proxy solution for your particular project.
So in this section of the guide, we will break down the key differences between the available proxy solutions and help you decide which solution is best for your needs. First, let’s talk about the fundamentals of proxies - the underlying IPs.
As mentioned already, a proxy is just a 3rd party IP address that you can route your request through. However, there are 3 main types of IPs to choose from, each with its own pros and cons.
Datacenter IPs are the most common type of proxy IP. They are the IPs of servers housed in data centers, and they are the cheapest to buy. With the right proxy management solution, you can build a very robust web crawling solution for your business on top of them.
Residential IPs are the IPs of private residences, enabling you to route your request through a residential network. As residential IPs are harder to obtain, they are also much more expensive. In a lot of situations, they are overkill as you could easily achieve the same results with cheaper data center IPs. They also raise legal/consent issues due to the fact you are using a person’s personal network to scrape the web.
How long a residential proxy lasts depends on whether you are rotating your proxies. During a sticky session, one IP can last for up to 1, 10, or 30 minutes. However, if you choose a rotating session, the IP will change with every request.
Most ISPs by default provide users with a rotating IP address. This means that every time you unplug your modem you can be given a brand new IP address. Some ISPs will offer the choice of having a static IP address, which means that the same IP will always be assigned to your connection. This can be limited for commercial use and is typically only needed when a user needs incoming web requests that target their IP.
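The difference between sticky and rotating sessions can be sketched as follows. `StickySession` is a hypothetical helper written for this guide, not part of any provider's API, and the injectable `clock` parameter exists only so the behaviour is easy to test.

```python
import itertools
import time


class StickySession:
    """Keep the same proxy for `ttl_seconds`, then move to the next one.

    With ttl_seconds=0 every call returns a new proxy, which behaves
    like a rotating session.
    """

    def __init__(self, proxies, ttl_seconds, clock=time.monotonic):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._ttl = ttl_seconds
        self._clock = clock
        self._current = None
        self._expires_at = 0.0

    def get_proxy(self):
        now = self._clock()
        # Pick a new proxy on first use or once the sticky window has elapsed.
        if self._current is None or now >= self._expires_at:
            self._current = next(self._cycle)
            self._expires_at = now + self._ttl
        return self._current


# A 10 minute sticky session over two placeholder proxies.
session = StickySession(["203.0.113.10:8080", "203.0.113.11:8080"], ttl_seconds=600)
```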
Mobile IPs are the IPs of private mobile devices. As you can imagine, acquiring the IPs of mobile devices is quite difficult so they are very expensive. For most web scraping projects mobile IPs are overkill unless you want to only scrape the results shown to mobile users. But more significantly they raise even trickier legal/consent issues as oftentimes the device owner isn't fully aware that you are using their GSM network for web scraping.
Our recommendation is to go with data center IPs and put in place a robust proxy management solution. In the vast majority of cases, this approach will generate the best results for the lowest cost. With proper proxy management, data center IPs give similar results to residential or mobile IPs without the legal concerns and at a fraction of the cost.
A residential proxy uses an IP that an ISP will identify as connected to a home address. A datacenter proxy uses an IP that is connected to a corporation or datacenter. When a residential proxy is used, the request is more likely to appear as though it comes from a normal user, which can help prevent identification by some antibot measures.
The other consideration we need to discuss is whether you should use public, shared, or dedicated proxies.
As a general rule, you should always stay well clear of public proxies, or "open proxies". Not only are these proxies of very low quality, but they can also be very dangerous. These proxies are open for anyone to use, so they quickly get used to slam websites with huge amounts of dubious requests, which inevitably results in them getting blacklisted and blocked by websites very quickly. What makes them even worse is that these proxies are often infected with malware and other viruses. As a result, when using a public proxy you run the risk of spreading any malware that is present, infecting your own machines, and even making your web scraping activities public if you haven't properly configured your security (SSL certs, etc.).
The decision between shared and dedicated proxies is a bit more intricate. Depending on the size of your project, your need for performance, and your budget, a web scraping IP rotation service where you pay for access to a shared pool of IPs might be the right option for you. However, if you have a larger budget and performance is a high priority, then paying for a dedicated pool of proxies might be the better option.
Ok, by now you should have a good idea of what proxies are and what the pros and cons are of the different types of IPs you can use in your proxy pool. However, picking the right type of proxy is only part of the battle. The really tricky part is managing your pool of proxies so they don’t get banned.
If you are planning on scraping at any reasonable scale, just purchasing a pool of proxies and routing your requests through them likely won’t be sustainable long term. Your proxies will inevitably get banned and stop returning high-quality data.
Here are some of the main challenges that you will face when managing your proxy pool:
Managing a pool of 5-10 proxies is ok, but when you have 100s or 1,000s it can get messy fast. To overcome these challenges you have three core solutions: Do It Yourself, Proxy Rotators, and Done For You Solutions.
The courts have determined that scraping public data is legal. As long as the data is in the public domain and is not copyright protected, it can be legally scraped regardless of whether a proxy is being used. The scraped data should, however, be used within the confines of the law.
In this situation, you purchase a pool of shared or dedicated proxies, then build and tweak a proxy management solution yourself to overcome all the challenges you run into. This can be the cheapest option but can be the most wasteful in terms of time and resources. Often it is best to only take this option if you have a dedicated web scraping team who have the bandwidth to manage your proxy pool, or if you have zero budget and can’t afford anything better.
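To give a feel for what a do-it-yourself solution involves, here is a deliberately small sketch of a proxy pool that skips banned proxies and enforces a minimum delay between uses of the same proxy. `ProxyPool` and its methods are hypothetical names, and a production system would also need ban detection, retries, session handling, and more.

```python
import time


class ProxyPool:
    """Hand out proxies that are not banned and have rested long enough."""

    def __init__(self, proxies, min_delay_seconds=1.0, clock=time.monotonic):
        self._clock = clock
        self._min_delay = min_delay_seconds
        # Earliest time at which each proxy may be used again.
        self._ready_at = {proxy: float("-inf") for proxy in proxies}
        self._banned = set()

    def acquire(self):
        """Return the first usable proxy, or None if all are banned or resting."""
        now = self._clock()
        for proxy, ready_at in self._ready_at.items():
            if proxy not in self._banned and now >= ready_at:
                self._ready_at[proxy] = now + self._min_delay
                return proxy
        return None

    def mark_banned(self, proxy):
        """Stop handing out a proxy that the target website has banned."""
        self._banned.add(proxy)


# Placeholder proxies, each used at most once every 5 seconds.
pool = ProxyPool(["203.0.113.10:8080", "203.0.113.11:8080"], min_delay_seconds=5.0)
```

Even this toy version has to track state per proxy, which hints at why managing 100s or 1,000s of proxies by hand gets messy.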
A proxy rotator is a system used to change proxies for each request sent by a scraper or crawler. It is typically called a rotator because after the last available proxy is used it goes back to the start of the proxy pool. Using a rotator to cycle through your pool of proxies can prevent batches of requests from being sent from the same IP, a pattern that antibot systems can treat as a sign of automation.
A proxy rotator will either be something you’ve built for yourself from scratch or part of a service you have purchased. How you use it will vary, and you will need to consult the documentation of the solution for in-depth instructions.
Once you have the list of proxy IPs to rotate, the rest is easy. The function get_proxies will return a set of proxy strings that can be passed to the request object as proxy config. Now that we have the list of proxy IP addresses in the variable proxies, we'll go ahead and rotate it using a round-robin method.
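A minimal sketch of that round-robin rotation might look like this. The `get_proxies` function here is a hypothetical stand-in that returns hard-coded placeholder addresses, and `round_robin` and `to_proxy_config` are names chosen for this example. The resulting dictionary uses the scheme-to-URL shape that HTTP clients such as requests accept for their `proxies` argument.

```python
import itertools


def get_proxies():
    """Hypothetical stand-in: return a set of proxy strings from your own source."""
    return {"203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:3128"}


def round_robin(proxies):
    """Cycle through the proxies forever, restarting after the last one."""
    # Sets are unordered, so sort them to get a stable rotation order.
    return itertools.cycle(sorted(proxies))


def to_proxy_config(proxy):
    """Build a proxy config mapping for one 'host:port' proxy string."""
    return {"http": f"http://{proxy}", "https": f"http://{proxy}"}


proxies = get_proxies()
rotation = round_robin(proxies)

urls = [f"https://example.com/page/{n}" for n in range(1, 5)]
for url, proxy in zip(urls, rotation):
    config = to_proxy_config(proxy)
    # A real spider would send the request here, passing `config` to its HTTP client.
    print(url, "->", config["http"])
```

With three proxies and four URLs, the fourth request reuses the first proxy, which is the wrap-around behaviour that gives the rotator its name.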
A typical way that antibot systems detect automation is by seeing a large number of requests coming from the same IP address in a short period of time. When you use a web scraping IP rotation service, your requests will cycle through a number of addresses, making it harder to detect that all the requests are coming from the same place.
The final solution is to completely outsource your proxy management. Solutions such as Zyte Smart Proxy Manager (formerly Crawlera), which is basically a rotating proxy for scraping, are designed as smart downloaders: your spiders just have to make a request to its API and it will return the data you require. It manages all the proxy rotation, throttling, blacklists, session management, etc. under the hood so you don’t have to.
Each one of these approaches has its own pros and cons, so the best solution will depend on your specific priorities and constraints.
Whether you set a proxy on or off depends on a lot of factors. Typically smart proxy managers will have a cost per request, so if you don’t need a proxy for a project it can be wasteful to always use one. The decision to set a proxy should be based on whether you need your requests to appear to come from a specific region or whether you need to make multiple requests appear to come from different users.
Your device's proxy settings are basically split into two configurations: automatic or manual proxy setup. In 99% of cases, everything should be set to Off. If anything is turned on, your web traffic could be going through a proxy.
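If you want to check from code whether a proxy is currently configured, one option is Python's standard `urllib.request.getproxies()`, which reads proxy settings from environment variables such as `http_proxy` and, on some platforms, from the system configuration. The helper names below are assumptions made for this sketch.

```python
import urllib.request


def system_proxies():
    """Return the proxy settings Python detects, as a scheme-to-URL mapping."""
    return urllib.request.getproxies()


def proxy_is_on():
    """True if at least one proxy is configured for this environment."""
    return bool(system_proxies())


if proxy_is_on():
    print("Traffic may be going through a proxy:", system_proxies())
else:
    print("No proxy is configured.")
```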
Here at Zyte (formerly Scrapinghub), we have been in the web scraping industry for 12 years. We have helped extract web data for more than 1,000 clients ranging from Government agencies and Fortune 100 companies to early-stage startups and individuals. During this time we gained a tremendous amount of experience and expertise in web data extraction.
Here are some of our best resources if you want to deepen your proxy management knowledge: