If you are serious about web scraping, you’ll quickly realize that proxy management is a critical component of any web scraping project.
When scraping the web at any reasonable scale, using proxies is an absolute must. However, it is common for managing and troubleshooting proxy issues to consume more time than building and maintaining the spiders themselves.
In this guide, we will cover everything you need to know about proxies for web scraping and how they will make your life easier.
Before we discuss what a proxy is, we first need to understand what an IP address is and how it works.
An IP address is a numerical address assigned to every device that connects to an Internet Protocol network like the internet, giving each device a unique identity. Most IP addresses look like this:
A proxy is a 3rd party server that enables you to route your request through their servers and use their IP address in the process. When using a proxy, the website you are making the request to no longer sees your IP address but the IP address of the proxy, giving you the ability to scrape the web anonymously if you choose.
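To make the routing idea concrete, here is a minimal sketch using only Python's standard library. The proxy URL and credentials are placeholders we made up for illustration; substitute whatever endpoint your provider gives you.

```python
import urllib.request


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical proxy endpoint (assumption, not a real service).
opener = build_proxy_opener("http://user:pass@proxy.example.com:8080")

# With a real proxy, the target site would see the proxy's IP, not yours:
# response = opener.open("https://example.com/", timeout=10)
```

The actual request is left commented out so the snippet runs without a live proxy.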
Currently, the world is transitioning from IPv4 to a newer standard called IPv6. IPv6 uses 128-bit addresses instead of IPv4's 32-bit addresses, which allows for the creation of vastly more IP addresses. However, IPv6 is still not widely adopted in the proxy business, so most proxy IPs still use the IPv4 standard.
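If you need to tell the two standards apart in code, for example when auditing a proxy list, Python's built-in `ipaddress` module can do it. The addresses below come from ranges reserved for documentation and are only illustrative.

```python
import ipaddress


def ip_version(raw: str) -> int:
    """Return 4 or 6 depending on which standard the address uses."""
    return ipaddress.ip_address(raw).version


# Example addresses from documentation-only ranges (RFC 5737 and RFC 3849).
for raw in ("192.0.2.1", "2001:db8::1"):
    addr = ipaddress.ip_address(raw)
    print(f"{raw} is IPv{addr.version} ({addr.max_prefixlen}-bit)")
```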
When scraping a website, we recommend that you use a 3rd party proxy and set your company name as the user agent so the website owner can contact you if your scraping is overburdening their servers or if they would like you to stop scraping the data displayed on their website.
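One way to do this, sketched with the standard library, is to put your company name and a contact address in the `User-Agent` header. The company name, contact address, and the `name (+contact)` layout are assumptions for illustration, not a required format.

```python
import urllib.request


def build_request(url: str, company: str, contact: str) -> urllib.request.Request:
    """Build a request whose User-Agent identifies who is scraping."""
    user_agent = f"{company} (+{contact})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical company and contact details.
req = build_request(
    "https://example.com/",
    "ExampleCorp-crawler",
    "mailto:scraping@example.com",
)
```

The resulting `req` can be passed to any opener, including one configured to use a proxy.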
There are a number of reasons why proxies are important for web scraping:
Ok, we now know what proxies are, but how do you use them as part of your web scraping?
Just as scraping from only your own IP address limits you, using a single proxy to scrape a website reduces your crawling reliability, geotargeting options, and the number of concurrent requests you can make.
As a result, you need to build a pool of proxies that you can route your requests through, splitting the traffic across a large number of proxies.
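The simplest way to split traffic across a pool is round-robin rotation. The sketch below is one possible approach, not a full proxy manager; the class name and proxy URLs are hypothetical.

```python
import itertools


class ProxyPool:
    """Hand out proxies in round-robin order so traffic is spread evenly."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("proxy pool must contain at least one proxy")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self) -> str:
        """Return the next proxy in the rotation."""
        return next(self._cycle)


# Hypothetical proxy endpoints.
pool = ProxyPool([
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
])
for _ in range(4):
    print(pool.next_proxy())  # the fourth call wraps back to proxy1
```

Round-robin ignores proxy health, so in practice you would combine it with some form of ban detection.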
The size of your proxy pool will depend on a number of factors:
All five of these factors have a big impact on the effectiveness of your proxy pool. If you don’t properly configure your pool of proxies for your specific web scraping project you can often find that your proxies are being blocked and you’re no longer able to access the target website.
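As a rough starting point, you can estimate pool size from your request volume and a per-IP request budget. This is a back-of-the-envelope sketch: the per-IP budget is an assumption you have to determine for each target website, and the numbers below are purely illustrative.

```python
import math


def estimate_pool_size(requests_per_hour: int, max_requests_per_ip_per_hour: int) -> int:
    """Estimate how many proxies are needed to stay under a per-IP budget."""
    if requests_per_hour < 0 or max_requests_per_ip_per_hour <= 0:
        raise ValueError("request volume must be >= 0 and per-IP budget must be > 0")
    return math.ceil(requests_per_hour / max_requests_per_ip_per_hour)


# Illustrative numbers only: 50,000 requests/hour at 500 requests per IP per hour.
print(estimate_pool_size(50_000, 500))  # 100
```

Treat the result as a lower bound, since some proxies will inevitably be banned or unavailable at any given time.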
In the next section, we will look at the different types of IPs you can use as proxies.
If you’ve done any level of research into your proxy options, you will probably have realized that this can be a confusing topic. Every proxy provider is shouting from the rafters that they have the best website proxy IPs, with very little explanation as to why, which makes it very hard to assess which proxy solution is best for your particular project.
So in this section of the guide, we will break down the key differences between the available proxy solutions and help you decide which solution is best for your needs. First, let’s talk about the fundamentals of proxies - the underlying IPs.
As mentioned already, a proxy is just a 3rd party IP address that you can route your request through. However, there are three main types of IPs to choose from, each with its own pros and cons.
Datacenter IPs are the most common type of proxy IP. They are the IPs of servers housed in data centers, and they are the cheapest to buy. With the right proxy management solution, you can build a very robust web crawling solution for your business on top of them.
Residential IPs are the IPs of private residences, enabling you to route your request through a residential network. As residential IPs are harder to obtain, they are also much more expensive. In a lot of situations, they are overkill, as you could easily achieve the same results with cheaper data center IPs. They also raise legal/consent issues because you are using a person’s personal network to scrape the web.
Mobile IPs are the IPs of private mobile devices. As you can imagine, acquiring the IPs of mobile devices is quite difficult, so they are very expensive. For most web scraping projects mobile IPs are overkill unless you only want to scrape the results shown to mobile users. More significantly, they raise even trickier legal/consent issues, as oftentimes the device owner isn't fully aware that you are using their GSM network for web scraping.
Our recommendation is to go with data center IPs and put in place a robust proxy management solution. In the vast majority of cases, this approach will generate the best results for the lowest cost. With proper proxy management, data center IPs give similar results to residential or mobile IPs without the legal concerns and at a fraction of the cost.
The other consideration we need to discuss is whether you should use public, shared, or dedicated proxies.
As a general rule, you should always stay well clear of public proxies, or "open proxies". Not only are these proxies of very low quality, but they can also be very dangerous. These proxies are open for anyone to use, so they quickly get used to slam websites with huge amounts of dubious requests, which inevitably gets them blacklisted and blocked by websites very quickly. What makes them even worse is that these proxies are often infected with malware and other viruses. As a result, when using a public proxy you run the risk of spreading any malware that is present, infecting your own machines, and even making your web scraping activities public if you haven't properly configured your security (SSL certs, etc.).
The decision between shared and dedicated proxies is a bit more intricate. Depending on the size of your project, your need for performance, and your budget, a web scraping IP rotation service where you pay for access to a shared pool of IPs might be the right option for you. However, if you have a larger budget and performance is a high priority, then paying for a dedicated pool of proxies might be the better option.
Ok, by now you should have a good idea of what proxies are and what the pros and cons are of the different types of IPs you can use in your proxy pool. However, picking the right type of proxy is only part of the battle. The really tricky part is managing your pool of proxies so they don’t get banned.
If you are planning on scraping at any reasonable scale, just purchasing a pool of proxies and routing your requests through them likely won’t be sustainable long term. Your proxies will inevitably get banned and stop returning high-quality data.
Here are some of the main challenges that you will face when managing your proxy pool:
Managing a pool of 5-10 proxies is ok, but when you have 100s or 1,000s it can get messy fast. To overcome these challenges you have three core solutions: Do It Yourself, Proxy Rotators, and Done For You Solutions.
In this situation, you purchase a pool of shared or dedicated proxies, then build and tweak a proxy management solution yourself to overcome all the challenges you run into. This can be the cheapest option but can be the most wasteful in terms of time and resources. Often it is best to only take this option if you have a dedicated web scraping team who have the bandwidth to manage your proxy pool, or if you have zero budget and can’t afford anything better.
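To give a feel for what building it yourself involves, here is a minimal sketch of one piece of a DIY manager: counting suspected bans per proxy and resting a proxy once it crosses a threshold. Treating HTTP 403 and 429 as ban signals is an assumption; real bans can also show up as captchas, redirects, or empty pages. All names and thresholds are hypothetical.

```python
import time

# Assumption: these status codes indicate a block. Adjust per target website.
BAN_STATUSES = {403, 429}


class ManagedPool:
    """Track suspected bans per proxy and rest proxies that fail repeatedly."""

    def __init__(self, proxies, max_failures=3, cooldown=300.0, clock=time.monotonic):
        self._failures = {p: 0 for p in proxies}
        self._benched_until = {p: 0.0 for p in proxies}
        self._max_failures = max_failures
        self._cooldown = cooldown
        self._clock = clock

    def available(self):
        """Return the proxies that are not currently resting."""
        now = self._clock()
        return [p for p in self._failures if self._benched_until[p] <= now]

    def report(self, proxy, status_code):
        """Record the outcome of a request made through proxy."""
        if status_code in BAN_STATUSES:
            self._failures[proxy] += 1
            if self._failures[proxy] >= self._max_failures:
                # Rest the proxy for the cooldown period, then start fresh.
                self._benched_until[proxy] = self._clock() + self._cooldown
                self._failures[proxy] = 0
        else:
            self._failures[proxy] = 0
```

A spider would call `available()` before each request and `report()` after it. The `clock` parameter exists only so the cooldown logic can be tested without waiting.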
The middle-of-the-park solution is to purchase your proxies from a provider that also provides proxy rotation and geographical targeting. In this situation, the provider takes care of the more basic proxy management issues, leaving you to develop and manage session management, throttling, ban identification logic, etc.
The final solution is to completely outsource your proxy management. Solutions such as Zyte Smart Proxy Manager (formerly Crawlera), which is basically a rotating proxy for scraping, are designed as smart downloaders: your spiders just make a request to its API and it returns the data you require, managing all the proxy rotation, throttling, blacklists, session management, etc. under the hood so you don’t have to.
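Throttling is one of the pieces you would still own with a proxy rotator. A minimal sketch, assuming a fixed minimum delay between requests through the same proxy, could look like this. The class name and the two-second default are hypothetical, and a suitable delay depends on the target website.

```python
import time


class ProxyThrottle:
    """Enforce a minimum delay between requests sent through the same proxy."""

    def __init__(self, min_delay=2.0, clock=time.monotonic):
        self._min_delay = min_delay
        self._clock = clock
        self._last_used = {}

    def wait_time(self, proxy) -> float:
        """Return how many seconds to wait before using proxy again."""
        last = self._last_used.get(proxy)
        if last is None:
            return 0.0
        return max(0.0, self._min_delay - (self._clock() - last))

    def mark_used(self, proxy) -> None:
        """Record that a request was just sent through proxy."""
        self._last_used[proxy] = self._clock()
```

Before each request you would sleep for `wait_time(proxy)` seconds, send the request, then call `mark_used(proxy)`.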
Each one of these approaches has its own pros and cons, so the best solution will depend on your specific priorities and constraints.
Here at Zyte (formerly Scrapinghub), we have been in the web scraping industry for 12 years. We have helped extract web data for more than 1,000 clients ranging from Government agencies and Fortune 100 companies to early-stage startups and individuals. During this time we gained a tremendous amount of experience and expertise in web data extraction.
Here are some of our best resources if you want to deepen your proxy management knowledge: