Let’s face it, managing your proxy pool can be an absolute pain and the biggest bottleneck to the reliability of your web scraping!
Nothing annoys developers more than crawlers failing because their proxies are continuously being banned.
Not only do you find yourself constantly firefighting proxy issues, but the people who rely on this web data also get increasingly frustrated with you because of the unreliability of the data feed.
Zyte (formerly Scrapinghub) had the same issues for years until we hit our breaking point and decided to solve this problem forever.
At the time, Zyte was about 3 years in business, providing web scraping consultancy services to companies looking to outsource their data extraction.
We had established ourselves as the leading provider of web scraping consultancy services. However, as the scale and complexity of our projects grew, our crawl engineers increasingly ran into serious proxy issues.
We’d start a project, but our development timelines were constantly delayed because we’d run into proxy issue after proxy issue as soon as we deployed our spiders.
First, we’d configure a proxy pool and tell the client that everything was working. Then within a couple of days, everything would be broken. The proxy pool would no longer return requests at the target RPM (requests per minute).
We’d then acquire new proxies, increase the pool size, rotate the proxies, and create a new pool to route the requests through. This would work for a while but before long we were back where we started. A proxy pool full of banned IPs and our crawlers unable to make successful requests.
It was a never-ending cycle of swapping and rotating proxies. We couldn’t reliably predict how long it would take us to get a project into full production, leading to frustrated engineers and plenty of hard conversations with customers.
We were getting sick of firefighting proxy issues, but then one project came along that forced us to say “enough is enough” and commit to fixing this problem permanently...
The client wanted us to build a web scraping infrastructure to scrape product data from 20 e-commerce sites, roughly 1 million requests per day. Which, in 2011, was a big deal!
Everything started off great. We developed the spiders, ran a number of pilot crawls, and delivered the proof of concept data to the customer.
However, as was all too common we ran into serious problems scaling the crawls.
Although our spiders were well designed and configured to crawl at a polite speed, when we moved the project from proof of concept to production our proxies were being banned at an alarming rate.
We started the normal process of switching out proxies to try and get the crawlers back up and running.
However, it eventually got to the point where we couldn’t scale the crawl any further because we couldn’t put out the proxy fires fast enough.
Initially, we told the client that we’d have the issue fixed in 1 or 2 days “as it was just a matter of swapping out the banned IPs”.
However, the days kept ticking by and we still hadn’t found a permanent solution.
Finally, nearly a month later, we fixed it!
We stopped focusing on the underlying IPs and put all our energy into intelligently managing them. This meant we could not only scrape reliably without the fear of being banned, but also keep to more predictable development schedules and reduce the time and cost of running our crawls.
We found that without an intelligent proxy management layer, our requests were continuously being blocked and our proxies burned. Leaving us constantly scrambling to find new proxies and get our crawlers back up and running again.
However, when managed intelligently we could reliably scrape the web with little fear of our IPs being banned and the accompanying development/crawl delays.
This breakthrough was a game-changer for us. With this new proxy management layer, we were able to exponentially scale our crawls and completely remove the headache of managing proxies.
Once configured for a project, this new proxy management layer would automatically select the best proxy for the target website and handle all the proxy rotation, throttling, blacklisting, etc., ensuring that we could reliably extract the data we needed.
All without any manual intervention from our crawl engineers!
As we continued to scale, people were constantly asking us how we were managing our proxies, as they were facing the same reliability issues we had encountered as they scaled their web scraping. It was at this point that Zyte Smart Proxy Manager (formerly Crawlera) was born...
In 2012, we decided to make this technology available to everyone in the form of Zyte Smart Proxy Manager, a proxy management solution specifically designed for web scraping.
Zyte Smart Proxy Manager enabled web scrapers to reliably crawl at scale, managing thousands of proxies internally, so you didn’t have to.
Users never needed to worry about rotating or swapping proxies again.
Users loved Zyte Smart Proxy Manager! It removed the frustrations their engineers had with managing their web scraping proxies.
With Zyte Smart Proxy Manager, instead of having to manage a pool of IPs, the user's spiders send requests directly to Zyte Smart Proxy Manager's single endpoint API.
Zyte Smart Proxy Manager then selects the best IP and proxy configuration (user agents, request delay, etc.) for that particular website to retrieve the target data.
If a request is blocked, Zyte Smart Proxy Manager automatically selects the next best IP and adjusts the proxy configuration before making another request. This process continues until Zyte Smart Proxy Manager obtains a successful response or a predefined request limit is reached.
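To make the idea concrete, the retry-and-rotate loop described above can be sketched in a few lines of Python. This is an illustrative simplification of the general technique, not Zyte's actual implementation; the proxy pool, the `make_request` callback, and the attempt limit are all placeholders.

```python
import random

def fetch_with_rotation(url, proxy_pool, make_request, max_attempts=5):
    """Try proxies from the pool until one succeeds or the attempt limit is hit.

    make_request(url, proxy) should return a response object with a
    status_code attribute; proxies that fail are treated as blocked and
    dropped from the rotation for this fetch.
    """
    pool = list(proxy_pool)
    for attempt in range(max_attempts):
        if not pool:
            break
        proxy = random.choice(pool)        # pick a candidate proxy
        response = make_request(url, proxy)
        if response.status_code == 200:
            return response                # success: hand the data back
        pool.remove(proxy)                 # treat the proxy as blocked; try the next one
    raise RuntimeError(f"All attempts to fetch {url} failed")
```

A real proxy manager layers much more on top of this (per-site throttling, cooldown periods, ban detection beyond status codes), but the core control flow is the same.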
All this functionality happened under the hood. The user just made the request to Zyte Smart Proxy Manager's API and Zyte Smart Proxy Manager would take care of everything else. Enabling users to focus on the data, not proxies.
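From the client's side, routing traffic through a single managed endpoint typically looks something like the sketch below. The host, port, and API-key-as-proxy-username convention shown here are illustrative assumptions about how such a proxy API is addressed, not guaranteed values; check the current Zyte documentation for the exact endpoint and authentication details.

```python
def smart_proxy_config(api_key, host="proxy.zyte.com", port=8011):
    """Build a requests-style proxies mapping that routes all traffic
    through a single proxy endpoint, authenticating with the API key
    as the proxy username (a common pattern for proxy APIs).

    NOTE: host and port defaults are illustrative placeholders.
    """
    proxy_url = f"http://{api_key}:@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}

# Usage with the third-party requests library:
#   import requests
#   response = requests.get(
#       "https://example.com/products",
#       proxies=smart_proxy_config("YOUR_API_KEY"),
#   )
```

The point is that the spider's code stays trivial: one proxies mapping, one endpoint, and all the rotation logic lives on the other side of that connection.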
Zyte Smart Proxy Manager achieved this by managing a massive pool of proxies: carefully rotating, throttling, and blacklisting them, and selecting the optimal IP for each individual request to deliver the best results at the lowest cost, completely removing the hassle of managing IPs.
The huge advantage of this approach is that it is extremely reliable and scalable. Zyte Smart Proxy Manager can scale from a few hundred requests per day to millions of requests per day without any additional workload on your part.
Better yet, because Zyte Smart Proxy Manager was built by web scrapers for web scrapers, we know users only care about successful requests, not the number of proxies. As a result, with Zyte Smart Proxy Manager you only pay for successful requests that return your desired data, not for IPs or the amount of bandwidth you use.
This is a huge benefit for users of Zyte Smart Proxy Manager as they can accurately predict the cost of their proxy solution as they scale.
For Zyte, having Zyte Smart Proxy Manager at our disposal was a game-changer for our business. Now our crawl engineers could focus on what they really enjoyed: building crawlers and delivering accurate, reliable data for our customers, instead of constantly putting out proxy fires just to keep data feeds up and running. This led to happier, more motivated teams, and happier customers.
Since its original launch, Zyte Smart Proxy Manager has undergone numerous redesigns and improvements to keep pace with the changes in web scraping technologies and cope with the ever more complex challenges experienced when scraping the web.
We’ve added highly targeted geographical support (city-level granularity), residential IPs, headless browser support, and custom user agents, to name just a few features, making Zyte Smart Proxy Manager the most feature-rich and robust proxy solution for web scraping.
Zyte Smart Proxy Manager is for web scraping teams (or individual developers) that are tired of managing their own proxy pools and are ready to integrate an off-the-shelf proxy API into their web scraping stack that only charges for successful requests.
It is also a perfect fit for larger organizations with mission-critical web crawling requirements looking for a dedicated crawling partner whose tools and team of crawl consultants can help them crawl more reliably at scale, build custom solutions for their specific requirements, help debug any issues they may run into when scraping the web, and offer enterprise SLAs.
Zyte Smart Proxy Manager also comes with global support. Clients know they can get expert input on any proxy issue that arises, 24 hours a day, 7 days a week, no matter where they are in the world. This gives them immense peace of mind, knowing they will never be left alone if they can’t get access to the data they need.
At Zyte all our products are 100% designed with web scraping in mind. We are committed to helping the web scraping community extract what they need to grow their businesses.
If you’re tired of troubleshooting proxy issues and would like to give Zyte Smart Proxy Manager a try, then sign up today (you can cancel within 14 days if Zyte Smart Proxy Manager isn’t for you), or schedule a call with our crawl consultant team.
At Zyte we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.