
In-house vs off-the-shelf proxy management?

Proxy management is the thorn in the side of most web scrapers. Without a robust and fully featured proxy infrastructure, you will face constant reliability issues and spend hours putting out proxy fires, a situation no web scraping professional wants to deal with. As web scrapers, we are interested in extracting and using web data, not managing proxies.

In this article, we’re going to tackle the great proxy question: should you build your own proxy infrastructure in-house or use an off-the-shelf proxy solution?

But first, let’s talk about...

Your proxy infrastructure requirements

Although every individual web scraping project is different, proxy requirements remain remarkably similar. Your proxy infrastructure needs to be able to reliably return successful responses at the desired frequency. Anything else is a suboptimal proxy solution.

To achieve this, at a minimum your proxy infrastructure needs to contain a sufficient number of proxies to process the desired number of requests per minute and the ability to rotate the proxies to lower the risk of bans.
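As a rough illustration of that minimum, here is a sketch of pool sizing and simple rotation. The names `proxies_needed` and `RoundRobinPool` are hypothetical, and the "safe rate per proxy" is an assumption you would have to measure for each target site; it is not a fixed number.

```python
import itertools
import math


def proxies_needed(requests_per_minute: int, safe_rate_per_proxy: int) -> int:
    """Smallest pool size that keeps every proxy at or under its safe rate."""
    return math.ceil(requests_per_minute / safe_rate_per_proxy)


class RoundRobinPool:
    """Hands out proxies in a fixed, repeating order."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("proxy pool must not be empty")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self) -> str:
        return next(self._cycle)


# Example: 600 requests per minute at an assumed 50 requests per proxy per minute.
pool_size = proxies_needed(600, 50)
pool = RoundRobinPool(f"10.0.0.{i}:8080" for i in range(1, pool_size + 1))
```

Round-robin is only the simplest possible rotation policy; it spreads load evenly but knows nothing about which proxies are banned or slow.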

However, most web scrapers quickly discover that this rudimentary proxy infrastructure simply won’t cut it at any reasonable level of scale. Very quickly the list of requirements grows even longer to enable your crawlers to reliably retrieve the data they need:

  • Ban identification - Your proxy solution needs to be able to detect more than 100 types of bans (captchas, redirects, blocks, cloaking, etc.) so that you can troubleshoot and fix the underlying problem. Making things more difficult, your solution also needs to create and manage a ban database for every single website you scrape, which is not a trivial task.
  • Retry errors - If a request hits an error, ban, timeout, etc., your infrastructure needs to be able to retry it with a different proxy.
  • Request headers - Managing and rotating user agents, cookies, etc. is crucial to having a healthy crawl.
  • Session management - Some scraping projects require you to keep a session with the same proxy, so you’ll need to configure your proxy pool to allow for this.
  • Headless browsers - Some web scraping projects require you to use headless browsers to extract your target data. As a result, your proxy infrastructure needs to be configured to work seamlessly with your chosen headless browser.
  • Add delays - Automatically randomize delays and adjust request throttling to help cloak the fact that you are scraping and to access difficult sites. Your proxy management system should also be able to dynamically select delays based on the known characteristics of the target website and real-time feedback on optimal crawl rates, so that you get the highest request throughput without running the risk of bans or overloading the site’s servers.
  • Geographical targeting - Sometimes you’ll need to be able to configure your pool so that only some proxies are used on certain websites.
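To make the first two requirements more concrete, here is a minimal sketch of ban identification combined with retries through a different proxy. Everything in it is an assumption for illustration: the `Response` type, the status codes and text markers treated as bans, and the injected `fetch(url, proxy)` callable that stands in for your HTTP client. Real ban signatures differ from site to site, which is exactly why a per-website ban database is needed.

```python
import random
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    body: str = ""


# Hypothetical heuristics; a real system keeps rules per target website.
BAN_STATUSES = {403, 429, 503}
BAN_MARKERS = ("captcha", "access denied", "unusual traffic")


def looks_banned(response: Response) -> bool:
    """Return True if the response matches one of the assumed ban signatures."""
    if response.status in BAN_STATUSES:
        return True
    body = response.body.lower()
    return any(marker in body for marker in BAN_MARKERS)


def fetch_with_retries(url, proxies, fetch, max_attempts=3):
    """Try `url` through up to `max_attempts` different proxies.

    Returns the first response that does not look banned, or None if
    every attempt was banned or failed at the network level.
    """
    candidates = list(proxies)
    random.shuffle(candidates)
    for proxy in candidates[:max_attempts]:
        try:
            response = fetch(url, proxy)
        except (TimeoutError, ConnectionError):
            continue  # network failure: move on to the next proxy
        if not looks_banned(response):
            return response
    return None
```

Passing `fetch` in as a parameter keeps the retry logic independent of any particular HTTP library, and makes it easy to exercise with a fake fetcher before pointing it at real sites.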

As a result, web scrapers need to design robust management logic within their proxy infrastructure to ensure it can reliably rotate IPs, select geographically specific IPs, throttle requests, identify bans and captchas, automate retries, manage sessions and user agents, and maintain blacklisting logic.

This turns an ancillary part of your web scraping project into a large development and maintenance undertaking.
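Two pieces of that management logic, geographical selection and session handling, can be sketched as follows. The `Proxy` and `SessionAwarePool` names are hypothetical, and the sketch assumes you already know each proxy's country; it is one possible design, not a reference implementation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Proxy:
    address: str
    country: str  # assumed to be a two-letter country code such as "US"


class SessionAwarePool:
    """Picks proxies by country and pins a session to a single proxy."""

    def __init__(self, proxies, rng=None):
        self._proxies = list(proxies)
        self._sessions = {}  # session id -> Proxy
        self._rng = rng or random.Random()

    def pick(self, country=None, session_id=None):
        # An existing session always keeps the proxy it started with.
        if session_id is not None and session_id in self._sessions:
            return self._sessions[session_id]
        candidates = [
            p for p in self._proxies if country is None or p.country == country
        ]
        if not candidates:
            raise LookupError(f"no proxy available for country {country!r}")
        proxy = self._rng.choice(candidates)
        if session_id is not None:
            self._sessions[session_id] = proxy
        return proxy

    def end_session(self, session_id):
        """Release the proxy pinned to `session_id`, if any."""
        self._sessions.pop(session_id, None)
```

A production pool would also need to expire stale sessions and replace a pinned proxy once it gets banned, which this sketch leaves out.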

Your proxy management options: Build in-house or use an off-the-shelf solution

When it comes to choosing a proxy management solution you really only have two options:

  1. Build the entire infrastructure in-house; or,
  2. Use an off-the-shelf proxy management solution.

First, let’s look at your first option…

Build your proxy infrastructure in-house

A common approach a lot of developers take when first getting started scraping the web is building their own proxy management solution from scratch.

This approach often works very well when scraping simple websites at small scales. With a relatively simple proxy infrastructure (pool of IPs, simple rotation logic & throttling, etc.) you can achieve a reasonable level of reliability from such a solution.
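The throttling part of such a simple setup might look like the sketch below: a randomized minimum gap between requests to the same domain. The `Throttle` class and its parameters are hypothetical; the `clock` and `sleep` arguments are injectable only so the logic can be tested without actually waiting.

```python
import random
import time


class Throttle:
    """Enforces a randomized minimum gap between requests to one domain."""

    def __init__(self, base_delay, jitter=0.5, clock=time.monotonic, sleep=time.sleep):
        self.base_delay = base_delay  # seconds
        self.jitter = jitter          # 0.5 means each delay is 50%-150% of base_delay
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}       # domain -> earliest time of the next request

    def wait(self, domain):
        """Block until `domain` may be requested again; return seconds waited."""
        now = self._clock()
        pause = max(0.0, self._next_allowed.get(domain, now) - now)
        if pause:
            self._sleep(pause)
        delay = self.base_delay * random.uniform(1 - self.jitter, 1 + self.jitter)
        self._next_allowed[domain] = now + pause + delay
        return pause
```

Calling `throttle.wait("example.com")` before each request is enough for a small crawl. It does not adapt to feedback from the target site, which is one of the gaps that shows up as a project scales.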

However, when developers scale up their web scraping or start scraping more complex websites, they often find they increasingly run into proxy issues. That kicks off the arduous process of troubleshooting each issue, obtaining more IPs, upgrading the proxy management logic, etc.

It is rare for developers to build an extremely robust proxy infrastructure from the get-go. Typically, it is an iterative process of running into proxy issues and patching together an adequate solution to get the crawlers back up and running.

Over time the sophistication and robustness of the proxy infrastructure do improve, but not without sucking in significant development resources and countless late nights trying to fix the latest proxy issue.

In recent times, at Zyte (formerly Scrapinghub) we’ve increasingly noticed the trend of companies looking to jump straight to large scale web scraping as a result of the ever-growing appetite for web data in business decision making and data-driven products.

In cases like these, it would be a massive understatement to say building a proxy management infrastructure designed to handle millions of requests per month is complex. Building this kind of infrastructure is a significant development project, requiring months of development hours and careful planning.

Proxies aren’t a priority

The thing is, for most developers and companies proxy management is at the bottom of their list of priorities. They are interested in extracting the target data as efficiently and quickly as possible so they can get on with their main interests: analyzing and making decisions based on the data, incorporating the data into their products and services, and growing their businesses.

In nearly every situation web scrapers have very little to gain by building their own proxy management infrastructure from scratch, other than the learning experience of developing the proxy management logic or saving a small amount of money on the direct costs of proxies (oftentimes, the indirect engineering costs far outweigh the direct savings).

That is why we always recommend to our community that they should at the very least outsource some element of their proxy management infrastructure. That could mean obtaining their proxies from a provider that also offers proxy rotation or other configurations or, our recommended method, using a proxy management API that completely removes the hassle of managing proxies.

Use an off-the-shelf proxy management solution

When it comes to web scraping, especially scraping at scale, our recommendation is to use a proven fully-featured off-the-shelf proxy management solution.

It will save your team countless weeks in development time, allow you to start extracting the data you need immediately, and dramatically increase the reliability of your crawlers.

Developing crawlers, post-processing, and analyzing the data is time-intensive enough without trying to reinvent the wheel by developing and maintaining your own internal proxy management infrastructure.

By using an off-the-shelf proxy management solution you can get access to a highly robust & configurable proxy infrastructure from day 1. No need to spend weeks delaying your data extraction building your proxy management system and troubleshooting proxy issues that will inevitably arise.

If you are interested in using an off-the-shelf proxy management solution then we strongly recommend that you consider Zyte Smart Proxy Manager (formerly Crawlera), the complete proxy solution developed by Zyte.

Zyte Smart Proxy Manager (formerly Crawlera) is the world's smartest proxy network built by and for web scrapers. Instead of having to manage a pool of IPs, your crawler just sends a request to Zyte Smart Proxy Manager’s single endpoint API and gets a successful response in return.
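From the crawler's side, a single-endpoint proxy service can be used roughly as sketched below with Python's standard library. The endpoint `proxy.example.com:8011`, the key `YOUR_API_KEY`, and the scheme of sending the API key as the proxy username with an empty password are all placeholders and assumptions for illustration; check your provider's documentation for the real values.

```python
import urllib.request

# Hypothetical placeholders: substitute the endpoint and key your provider issues.
PROXY_ENDPOINT = "proxy.example.com:8011"
API_KEY = "YOUR_API_KEY"


def proxy_settings(endpoint=PROXY_ENDPOINT, api_key=API_KEY):
    """Build the proxy mapping, assuming the API key is the proxy username."""
    proxy_url = f"http://{api_key}:@{endpoint}"
    return {"http": proxy_url, "https": proxy_url}


def make_opener(endpoint=PROXY_ENDPOINT, api_key=API_KEY):
    """Return an opener that routes every request through the single endpoint."""
    handler = urllib.request.ProxyHandler(proxy_settings(endpoint, api_key))
    return urllib.request.build_opener(handler)
```

With real credentials in place, `make_opener().open("https://example.com", timeout=30).read()` would fetch a page through the service. The crawler never sees individual IPs, so rotation, throttling, and ban handling all happen on the provider's side.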

Zyte Smart Proxy Manager manages a massive pool of proxies, carefully rotating, throttling, blacklisting, and selecting the optimal IPs to use for any individual request to give the optimal results at the lowest cost, completely removing the hassle of managing IPs.

Users love Zyte Smart Proxy Manager because it completely removes the hassle of managing proxies, freeing them up to work on more important areas of their business.

If you’d like to learn more about how Zyte Smart Proxy Manager only returns successful responses to its users, then be sure to check out "A Sneak Peek Inside Zyte Smart Proxy Manager (formerly Crawlera)" to get an inside look at how Zyte Smart Proxy Manager works.

How to pick the best proxy solution for your project?

Deciding on an approach to building and managing your proxy pool can be a headache. In this section we will outline some of the questions you need to be asking yourself when picking the best proxy solution for your needs:

  1. What’s your budget?

If you have a very limited or virtually non-existent budget then managing your own proxy pool is going to be the cheapest option. However, if you have even a small budget of $20 per month then you should seriously consider outsourcing your proxy management to a dedicated solution that manages everything.

  2. What is your #1 priority?

If learning about proxies and everything web scraping is your #1 priority then buying your own pool of proxies and managing them yourself is probably your best option. However, if your #1 priority is getting the web data you need and achieving maximum performance from your web scraping, as is the case for most companies, then it is nearly always better to outsource your proxy management to a done-for-you solution. Or at the very least, use a proxy rotator.

  3. What is your technical skill level and your available resources?

To be able to manage your own proxy pool for a reasonably sized web scraping project you will need at least a basic level of software development expertise and the bandwidth to build and maintain your spider’s proxy management logic. If you don’t have this expertise, or can’t devote engineering resources to it, then you are often better off either pairing a proxy rotator with your own proxy management logic or using a done-for-you proxy management solution.

Your answers to these questions will quickly help you decide which approach to proxy management best suits your needs.

Learn more about proxies for web scraping

Here at Zyte, we have been in the web scraping industry for 12 years. We have helped extract web data for more than 1,000 clients ranging from Government agencies and Fortune 100 companies to early-stage startups and individuals. During this time we gained a tremendous amount of experience and expertise in web data extraction. 

Here are some of our best resources if you want to deepen your proxy management knowledge:

Written by John Campbell
1/3 Marketer, 1/3 Ops, 1/3 Techie. Currently Demand Generation Manager at Zyte. Data analytics, knowledge graph enthusiast with a particular taste for its applications in financial services, cybersecurity, law enforcement & intelligence sectors.