These days, web scraping is ubiquitous among big e-commerce companies because of the edge data-driven decision making provides in such a tight-margin business.
E-commerce companies are increasingly using web data to fuel their competitor research, dynamic pricing, and new product research.
For these e-commerce companies, the most important considerations are the reliability of their data feed and its ability to return the data they need at the required frequency.
As a result, these e-commerce sites face big challenges in managing their proxies so that they can reliably scrape the web without disruption.
In this article, we’re going to talk about those challenges and how the best web scrapers get around them.
The sheer number of requests being made (upwards of 20 million successful requests per day) is a huge challenge in itself. At millions of requests per day, companies also need thousands of IPs in their proxy pools to cope with the volume.
Not only do they need a large pool size, but a pool that contains a wide range of proxy types (location, datacenter/residential, etc.) to enable them to reliably scrape the precise data they need.
However, managing proxy pools of this scale can be very time-consuming. Developers and data scientists often report spending more time managing proxies and troubleshooting data quality issues than analyzing the extracted data.
To cope with this level of complexity and scrape the web at this scale, you will need to add a robust intelligence layer to your proxy management logic.
The more sophisticated and automated your proxy management layer, the more efficient and hassle-free managing your proxy pool will be.
On that note, let’s dive deeper into proxy management layers and how the best e-commerce companies solve the challenges associated with them.
When scraping the web at a relatively small scale (a couple of thousand pages per day), you can get away with a simple proxy management infrastructure, provided your spiders are well designed and your pool is large enough.
However, when you are scraping the web at scale, this simply won't cut it. You will very quickly run into bans, captchas, and throttling when building a large-scale web scraper.
As a result, companies need to implement robust proxy management logic: rotating IPs, selecting geographically specific IPs, throttling requests, identifying bans and captchas, automating retries, and managing sessions, user agents, and blacklisting, all to prevent their proxies from getting blocked and their data feed from being disrupted.
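To make the pieces of that logic concrete, here is a minimal sketch in Python of the rotation, ban-detection, and retry portions. The proxy addresses, ban status codes, and backoff strategy are all illustrative assumptions; a production system would track per-proxy health, respect per-site throttling rules, and manage sessions as well.

```python
import random
import time
import urllib.error
import urllib.request

# Hypothetical proxy pool -- production pools typically hold thousands of IPs.
PROXY_POOL = [
    "http://203.0.113.10:8000",
    "http://203.0.113.11:8000",
    "http://203.0.113.12:8000",
]

# Status codes commonly treated as soft bans (an assumption; tune per target site).
BAN_CODES = {403, 429, 503}

def choose_proxy(pool):
    """Pick a proxy at random; real systems weight by health and ban history."""
    return random.choice(pool)

def fetch(url, pool=PROXY_POOL, max_retries=3):
    """Rotate proxies, detect bans, and retry with exponential backoff."""
    for attempt in range(max_retries):
        proxy = choose_proxy(pool)
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
        try:
            return opener.open(url, timeout=10).read()
        except urllib.error.HTTPError as err:
            if err.code not in BAN_CODES:
                raise  # a genuine error, not a ban -- surface it
            time.sleep(2 ** attempt)  # back off, then rotate to another IP
        except urllib.error.URLError:
            time.sleep(2 ** attempt)  # connection problem -- rotate and retry
    raise RuntimeError(f"all retries exhausted for {url}")
```

Even this toy version hints at why the real thing is a significant engineering effort: every branch (ban detection, backoff, retry budget) needs to be tuned per target site.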
The problem is that most solutions on the market only sell proxies, or at best proxies with simple rotation logic. So oftentimes companies need to build and refine this intelligent proxy management layer themselves, which requires significant development effort.
The other option is to use a proxy solution that takes care of all the proxy management for you. More on this later.
As is often the case with e-commerce product data, the prices and specifications of products vary depending on the location of the user.
As a result, to get the most accurate picture of a product's pricing or feature data, companies often want to request each product from different locations/zip codes. This adds another layer of complexity to an e-commerce web scraping proxy pool: you now need a pool that contains proxies from different locations and implements the logic to select the correct proxies for the target locations.
At lower volumes, it is often fine to manually configure a proxy pool to only use certain proxies for specific web scraping projects. However, this becomes very complex as the number and complexity of the web scraping projects increase. That is why an automated approach to proxy selection is key when scraping at scale.
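A minimal sketch of automated, location-aware proxy selection might look like this. The region keys, proxy addresses, and flat region-to-list mapping are illustrative assumptions; real pools carry much richer metadata (city, datacenter vs. residential, health statistics, and so on).

```python
import random

# Hypothetical pool keyed by region -- addresses are illustrative placeholders.
PROXIES_BY_REGION = {
    "us": ["http://198.51.100.1:8000", "http://198.51.100.2:8000"],
    "de": ["http://192.0.2.1:8000"],
}

def proxy_for(region):
    """Select a proxy matching the target region, failing loudly if none exist."""
    pool = PROXIES_BY_REGION.get(region)
    if not pool:
        raise LookupError(f"no proxies available for region {region!r}")
    return random.choice(pool)

# Each scraping job then declares its target region instead of a hard-coded proxy:
# proxy = proxy_for("us")
```

The design point is that jobs declare *intent* ("give me a US view of this product page") and the selection layer resolves that to a concrete IP, so adding a new region or retiring a proxy never touches the spiders themselves.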
As stated at the start of this article, the most important consideration in the development of any proxy management solution for large-scale e-commerce web scraping is that it is robust/reliable and returns high-quality data for analysis.
Oftentimes, the data these e-commerce companies are extracting is mission-critical to the success of the business and its ability to remain competitive in the marketplace. As a result, any disruptions or reliability issues with the data feed are a huge area of concern for most companies conducting large-scale web scraping.
Even a disruption of a couple of hours can prevent them from having up-to-date product data when setting product pricing for the next day.
The other issue is cloaking, the practice of e-commerce websites feeding incorrect product data to requests if they believe them to be from web scrapers. This can cause huge headaches for the data scientists working in these companies as there will always be a question mark over the validity of their data.
It sows a seed of doubt in their minds as to whether they can make decisions based on what the data is telling them.
This is where having a robust and reliable proxy management infrastructure along with an automated QA process in place really helps. Not only does it remove a lot of the headaches of having to manually configure and troubleshoot proxy issues, it also gives companies a high degree of confidence in the reliability of their data feed.
Ok, we've discussed the challenges of managing proxies for enterprise web scraping projects. But how do you overcome these challenges and build a proxy management system for your own large-scale web scraping projects?
In reality, enterprise web scrapers have two options when it comes to building their proxy infrastructure for their web scraping projects.
One solution is to build a robust proxy management solution in-house that takes care of all the necessary IP rotation, request throttling, session management, and blacklisting logic to prevent your spiders from being blocked.
There is nothing wrong with this approach, provided that you have the available resources and expertise to build and maintain such an infrastructure. To say that a proxy management infrastructure designed to handle 300 million requests per month (the scale a lot of e-commerce sites scrape at) is complex is an understatement. This kind of infrastructure is a significant development project.
For most companies their #1 priority is the data, not proxy management. As a result, a lot of the largest e-commerce companies completely outsource proxy management using a single endpoint proxy solution.
Our recommendation is to go with a proxy provider who can provide a single endpoint for proxy configuration and hide all the complexities of managing your proxies. Scraping at scale is resource-intensive enough without trying to reinvent the wheel by developing and maintaining your own internal proxy management infrastructure.
This is the approach most of the large e-commerce retailers take. Three of the world's five largest e-commerce companies use Zyte Smart Proxy Manager (formerly Crawlera), the smart downloader developed by Zyte, as their primary proxy solution, completely outsourcing their proxy management. In total, Zyte Smart Proxy Manager processes 8 billion requests per month.
The beauty of Zyte Smart Proxy Manager is that instead of having to manage a pool of IPs, your spiders just send a request to Zyte Smart Proxy Manager's single endpoint API where Zyte Smart Proxy Manager retrieves and returns the desired data.
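In practice, a single-endpoint proxy integration can be as small as pointing your HTTP client at one proxy URL. The endpoint address and placeholder credentials below are illustrative assumptions, not Zyte's actual configuration; consult Zyte's own documentation for the real host, port, and authentication scheme for your account.

```python
import urllib.request

# Illustrative endpoint and placeholder API key -- check the provider's
# documentation for the actual host, port, and authentication details.
SPM_ENDPOINT = "http://YOUR_API_KEY:@example-proxy-endpoint:8011"

def build_spm_opener(endpoint=SPM_ENDPOINT):
    """Route all requests through the single endpoint; the service behind it
    handles IP rotation, throttling, ban detection, and retries."""
    handler = urllib.request.ProxyHandler({"http": endpoint, "https": endpoint})
    return urllib.request.build_opener(handler)

# Usage (not executed here):
# opener = build_spm_opener()
# html = opener.open("http://example.com/product/123").read()
```

Contrast this with the rotation-and-retry logic sketched earlier: from the spider's point of view, all of that complexity collapses into a single proxy URL.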
Under the hood, Zyte Smart Proxy Manager manages a massive pool of proxies, carefully rotating, throttling, blacklisting, and selecting the optimal IPs for each individual request to deliver the best results at the lowest cost. This completely removes the hassle of managing IPs and enables users to focus on the data, not the proxies.
The huge advantage of this approach is that it is extremely scalable. Zyte Smart Proxy Manager can scale from a few hundred requests per day to millions of requests per day without any additional workload from the user. Simply increase the number of requests you are making and Zyte Smart Proxy Manager will take care of the rest.
Better yet, with Zyte Smart Proxy Manager you only pay for successful requests that return your desired data, not IPs or the amount of bandwidth you use.
Zyte Smart Proxy Manager also comes with global support. Clients know that they can get expert input on any issue that may arise 24 hours a day, 7 days a week, no matter where they are in the world.
If you'd like to learn more about Zyte Smart Proxy Manager then, be sure to talk to our team about your project.
As you have seen, there are a lot of challenges associated with managing proxies for large-scale web scraping projects. However, they are surmountable if you have adequate resources and expertise to implement a robust proxy management infrastructure. If not, you should seriously consider a single-endpoint proxy solution such as Zyte Smart Proxy Manager.
For those of you who are interested in scraping the web at scale but are wrestling with whether to build a dedicated web scraping team in-house or outsource it to a dedicated web scraping firm, be sure to check out our guide, Enterprise Web Scraping: The Build In-House or Outsource Decision.
At Zyte (formerly Scrapinghub) we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.