Web scraping can look deceptively easy these days. There are numerous open-source libraries and frameworks, visual scraping tools, and data extraction tools that make it very easy to scrape data from a website. However, when you want to scrape websites at scale, things start to get very tricky, very fast, especially when it comes to price intelligence, where scale and quality matter a lot.
In this series of articles, we will share with you the lessons we’ve learned scraping over 100 billion product pages since 2010, give you an in-depth look at the challenges you will face when extracting product data from e-commerce stores at scale, and share with you some of the best practices to address those challenges.
In this article, the first of the series, we will give you an overview of the main challenges you will face scraping product data at scale and the lessons Zyte (formerly Scrapinghub) has learned from scraping 100 billion product pages.
Unlike your standard web scraping application, scraping e-commerce product data at scale has a unique set of challenges that make web scraping vastly more difficult.
At their core, these challenges can be boiled down to two things: speed and data quality.
As time is usually a limiting constraint, scraping at scale requires your crawlers to scrape the web at very high speeds without compromising data quality. This need for speed makes scraping large volumes of product data very challenging.
It might be obvious and it might not be the sexiest challenge, but sloppy and constantly changing website formats are by far the biggest challenge you will face when extracting data at scale. Not necessarily because of the complexity of the task, but because of the time and resources you will spend dealing with it.
Sloppy website code not only makes writing your spider a pain, it can also make visual scraping tools or automatic extraction tools unviable.
When scraping at scale, not only do you have to navigate potentially hundreds of websites with sloppy code, you will also have to deal with constantly evolving websites. A good rule of thumb is to expect your target website to make changes that will break your spider (a drop in data extraction coverage or quality) every 2-3 months.
That mightn’t sound like too big a deal but when you are scraping at scale, those incidents really add up. For example, one of Zyte's larger e-commerce projects has ~4,000 spiders targeting about 1,000 e-commerce websites, meaning they can experience 20-30 spiders failing per day.
Variations in website layouts from regional and multilingual websites, A/B split testing, and packaging/pricing variants also create a world of problems that routinely break spiders.
Unfortunately, there is no magic bullet that will completely solve these problems. A lot of the time it is just a matter of committing more resources to your project as you scale. To take the previous project as an example again, that project has a team of 18 full-time crawl engineers and 3 dedicated QA engineers to ensure the client always has a reliable data feed.
With experience, however, your team will learn to create ever more robust spiders that can detect and deal with quirks in your target website's format.
Instead of having multiple spiders for all the possible layouts a target website might use, it is best practice to have only one product extraction spider that can deal with all the possible rules and schemes used by different page layouts. The more configurable your spiders are, the better.
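To make the idea of a single configurable spider more concrete, here is a minimal sketch. It assumes two made-up page layouts (`desktop_v1` and `desktop_v2`) and uses regular expressions purely to stay within the Python standard library; in a real spider you would more likely express the rules as CSS or XPath selectors in your scraping framework. All names, patterns, and the HTML are hypothetical.

```python
import re

# Hypothetical layout rules: each layout maps a field name to a regex with
# one capture group. Adding a new layout means adding a rule set, not a spider.
LAYOUT_RULES = {
    "desktop_v1": {
        "name": r'<h1 class="product-title">(.*?)</h1>',
        "price": r'<span class="price">\s*\$?([\d.,]+)\s*</span>',
    },
    "desktop_v2": {
        "name": r'<h1 itemprop="name">(.*?)</h1>',
        "price": r'<meta itemprop="price" content="([\d.]+)"',
    },
}


def extract_product(html, rules=LAYOUT_RULES):
    """Try each layout in turn and return the first one that yields every field."""
    for layout, fields in rules.items():
        item = {}
        for field, pattern in fields.items():
            match = re.search(pattern, html, re.DOTALL)
            if match is None:
                break  # this layout does not fit the page, try the next one
            item[field] = match.group(1).strip()
        else:
            # Every field matched: record which layout was used for monitoring.
            item["layout"] = layout
            return item
    return None  # no known layout matched, which is itself worth alerting on
```

Recording the matched layout on each item is a deliberate choice: a sudden shift in which layout matches, or a rise in pages where none does, is often the first sign that the target website has changed.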
Although these practices will make your spiders more complex (some of our spiders are thousands of lines long), they will ensure that your spiders are easier to maintain.
The next challenge you will face is building a crawling infrastructure that will scale as the number of requests per day increases, without degrading in performance.
When extracting product data at scale a simple web crawler that crawls and scrapes data serially just won’t cut it. Typically, a serial web scraper will make requests in a loop, one after the other, with each request taking 2-3 seconds to complete.
This approach is fine if your crawler is only required to make <40,000 requests per day (one request every 2 seconds equals 43,200 requests per day). However, past this point, you will need to transition to a crawling architecture that will allow you to make millions of requests per day with no decrease in performance.
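The arithmetic behind that ceiling is easy to check. The sketch below assumes, as a simplification, that every request takes the same fixed time and that throughput scales linearly with the number of concurrent requests; real crawls are also limited by bans, retries, and politeness delays. The function names are ours, chosen for illustration.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def serial_requests_per_day(seconds_per_request):
    """Upper bound on daily requests for a crawler that makes one request at a time."""
    return SECONDS_PER_DAY / seconds_per_request


def concurrency_needed(target_requests_per_day, seconds_per_request):
    """Minimum number of requests in flight at once to hit a daily target."""
    return math.ceil(target_requests_per_day * seconds_per_request / SECONDS_PER_DAY)


print(serial_requests_per_day(2))           # 43200.0
print(serial_requests_per_day(3))           # 28800.0
print(concurrency_needed(20_000_000, 2))    # 463
```

Under these assumptions, a crawler making 20 million requests per day at 2 seconds per request needs several hundred requests in flight at any moment, which is why a serial loop stops being an option.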
As this topic warrants an article in itself, in the coming weeks we will publish a dedicated article discussing how to design and build your own high throughput scraping architecture. However, for the remainder of this section, we will discuss some of the higher-level principles and best practices.
As we’ve discussed, speed is key when it comes to scraping product data at scale. You need to ensure that you can find and scrape all the required product pages in the time allotted (often one day). To do this you need to do the following:
To scrape product data at scale you need to separate your product discovery spiders from your product extraction spiders.
The goal of the product discovery spider should be for it to navigate to the target product category (or “shelf”) and store URLs of the products in that category for the product extraction spiders. As the product discovery spider adds product URLs to the queue the product extraction spiders scrape the target data from that product page.
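The split between discovery and extraction is essentially a producer/consumer pattern. Here is a minimal sketch of it using an in-memory queue and threads from the Python standard library. The `FAKE_SHELVES` data and the "scraping" step are stand-ins for real HTTP requests, and in production the queue would be a crawl frontier or message broker rather than `queue.Queue`.

```python
import queue
import threading

# Hypothetical in-memory "site": each category page lists its product URLs.
FAKE_SHELVES = {
    "/shelf/mugs": ["/p/1", "/p/2", "/p/3"],
    "/shelf/plates": ["/p/4", "/p/5"],
}


def discovery_spider(shelf_urls, url_queue):
    """Walk category pages and enqueue product URLs, nothing else."""
    for shelf in shelf_urls:
        for product_url in FAKE_SHELVES[shelf]:  # stand-in for fetch + link extraction
            url_queue.put(product_url)


def extraction_spider(url_queue, results, lock):
    """Consume product URLs and extract data until a None sentinel arrives."""
    while True:
        url = url_queue.get()
        if url is None:
            break
        item = {"url": url, "name": "product " + url.rsplit("/", 1)[-1]}  # stand-in for scraping
        with lock:
            results.append(item)


def run(shelf_urls, n_extractors=3):
    """Run one discovery spider feeding several extraction spiders."""
    url_queue = queue.Queue()
    results, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=extraction_spider, args=(url_queue, results, lock))
        for _ in range(n_extractors)
    ]
    for worker in workers:
        worker.start()
    discovery_spider(shelf_urls, url_queue)
    for _ in workers:
        url_queue.put(None)  # one sentinel per worker so each one stops
    for worker in workers:
        worker.join()
    return results


items = run(list(FAKE_SHELVES))
print(len(items))  # 5
```

Because the two roles only communicate through the queue, you can scale the number of extraction spiders independently of discovery.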
This can be accomplished with the aid of a crawl frontier such as Frontera, the open-source crawl frontier developed by Zyte. While Frontera was originally designed for use with Scrapy, it’s completely agnostic and can be used with any other crawling framework or standalone project. In this guide, we share how you can use Frontera to scrape at scale.
As each product category “shelf” can contain anywhere from 10 to 100 products and extracting product data is more resource-heavy than extracting a product URL, discovery spiders typically run faster than product extraction spiders. When this is the case, you need to have multiple extraction spiders for every discovery spider. A good rule of thumb is to create a separate extraction spider for each ~100,000-page bucket.
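As a rough illustration of that rule of thumb, the sketch below works out how many extraction spiders a job would need and assigns each product URL to a bucket. Hashing the URL is our own assumption about how you might split the work (it keeps the assignment stable between runs); the helper names are hypothetical.

```python
import hashlib
import math

PAGES_PER_BUCKET = 100_000  # rule of thumb from the text


def extraction_spiders_needed(total_product_pages, pages_per_bucket=PAGES_PER_BUCKET):
    """One extraction spider per bucket of roughly `pages_per_bucket` pages."""
    return max(1, math.ceil(total_product_pages / pages_per_bucket))


def bucket_for(url, n_buckets):
    """Deterministically map a URL to a bucket so reruns hit the same spider."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


n = extraction_spiders_needed(250_000)
print(n)  # 3
print(0 <= bucket_for("https://example.com/p/1", n) < n)  # True
```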
Scraping at scale can easily be compared to Formula 1 where your goal is to shave every unnecessary gram of weight from your car and squeeze that last fraction of horsepower from the engine all in the name of speed. The same is true for web scraping at scale.
When extracting large volumes of data you are always on the lookout for ways to minimize the request cycle time and maximize your spiders’ performance on the available hardware resources. All in the hope that you can shave a couple of milliseconds off each request.
To do this your team will need to develop a deep understanding of the web scraping framework, proxy management, and hardware you are using so you can tune them for optimal performance. You will also need to focus on:
When scraping at scale you should always be focused on solely extracting the exact data you need in as few requests as possible. Any additional requests or data extraction slow the pace at which you can crawl a website. Keep these tips in mind when designing your spiders:
If you are scraping e-commerce sites at scale, you are guaranteed to run into websites employing anti-bot countermeasures.
For most smaller websites, their anti-bot countermeasures will be quite basic (banning IPs that make excessive requests). However, larger e-commerce websites such as Amazon make use of sophisticated anti-bot countermeasures such as Distil Networks, Incapsula, or Akamai, which make extracting data significantly more difficult.
With that in mind, the first and most essential requirement for any project scraping product data at scale is to use proxy IPs. When scraping at scale you will need a sizeable list of proxies and will need to implement the necessary IP rotation, request throttling, session management, and blacklisting logic to prevent your proxies from getting blocked.
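To give a feel for what that logic involves, here is a deliberately small sketch of a proxy pool that covers three of those pieces: round-robin rotation, per-proxy throttling, and blacklisting after repeated failures. Session management is left out. The class, its parameters, and the thresholds are all hypothetical; a production setup has to handle far more cases, which is much of the argument for not building it yourself.

```python
import itertools
import time


class ProxyPool:
    """Minimal illustrative proxy pool: rotation, throttling, blacklisting."""

    def __init__(self, proxies, max_failures=3, min_delay=1.0, clock=time.monotonic):
        self.proxies = list(proxies)
        self.max_failures = max_failures  # consecutive failures before a ban
        self.min_delay = min_delay        # seconds between uses of one proxy
        self.clock = clock                # injectable so the pool can be tested
        self.failures = {p: 0 for p in self.proxies}
        self.last_used = {p: None for p in self.proxies}
        self.banned = set()
        self._cycle = itertools.cycle(self.proxies)

    def get(self):
        """Return the next usable proxy, or None if all are banned or throttled."""
        now = self.clock()
        for _ in range(len(self.proxies)):
            proxy = next(self._cycle)
            if proxy in self.banned:
                continue
            last = self.last_used[proxy]
            if last is not None and now - last < self.min_delay:
                continue  # used too recently, skip to respect the throttle
            self.last_used[proxy] = now
            return proxy
        return None

    def report(self, proxy, ok):
        """Record the outcome of a request made through `proxy`."""
        if ok:
            self.failures[proxy] = 0
            return
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures:
            self.banned.add(proxy)
```

A caller would take a proxy from `get()`, make the request through it, and then call `report()` with whether the response looked healthy, backing off when `get()` returns `None`.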
Unless you already have or are willing to commit a sizeable team to manage your proxies you should outsource this part of the scraping process. There are a huge number of proxy services available that provide varying levels of service.
However, our recommendation is to go with a proxy provider who can provide a single endpoint for proxy configuration and hide all the complexities of managing your proxies. Scraping at scale is resource-intensive enough without trying to reinvent the wheel by developing and maintaining your own internal proxy management infrastructure.
This is the approach most of the large e-commerce companies use. A number of the world's largest e-commerce companies use Zyte Smart Proxy Manager (formerly Crawlera), the smart downloader developed by Zyte, to completely outsource their proxy management. When your crawlers are making 20 million requests per day, it makes much more sense to focus on analyzing the data, not managing proxies.
Unfortunately, just using a proxy service won’t be enough to ensure you can evade bot countermeasures on larger e-commerce websites. More and more websites are using sophisticated anti-bot countermeasures that monitor your crawler’s behavior to detect that it isn’t a real human visitor.
Not only do these anti-bot countermeasures make scraping e-commerce sites more difficult, overcoming them can significantly dent your crawler’s performance if done incorrectly.
This means that to ensure you can achieve the necessary throughput from your spiders to deliver daily product data you often need to painstakingly reverse engineer the anti-bot countermeasures used on the site and design your spider to counteract them without using a headless browser.
From a data scientist's perspective, the most important consideration of any web scraping project is the quality of the data being extracted. Scraping at scale only makes this focus on data quality even more important.
When extracting millions of data points every single day, it is impossible to manually verify that all your data is clean and intact. It is very easy for dirty or incomplete data to creep into your data feeds and disrupt your data analysis efforts.
This is especially true when scraping products on multiple versions of the same store (different languages, regions, etc.) or separate stores.
Outside of a careful QA process during the design phase of building the spider, where the code of the spider is peer-reviewed and tested to ensure that it is extracting the desired data in the most reliable way possible, the best method of ensuring the highest possible data quality is the development of an automated QA monitoring system.
As part of any data extraction project, you need to plan and develop a monitoring system that will alert you to any data inconsistencies and spider errors. At Zyte we’ve developed machine learning algorithms designed to detect:
All of which we will discuss in a later article dedicated to automated quality assurance.
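In the meantime, even a simple rule-based check catches a lot. The sketch below is not the machine learning approach mentioned above, just a minimal illustration of automated monitoring: it validates each item against a hypothetical schema and raises alerts when coverage or quality drops below thresholds we have picked arbitrarily.

```python
REQUIRED_FIELDS = ("name", "price", "url")  # hypothetical item schema


def validate_item(item):
    """Return a list of problems found in one scraped item (empty if clean)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if item.get(field) in (None, ""):
            problems.append(f"missing {field}")
    price = item.get("price")
    if price not in (None, ""):
        try:
            if float(price) <= 0:
                problems.append("non-positive price")
        except (TypeError, ValueError):
            problems.append("unparseable price")
    return problems


def check_batch(items, expected_count, min_coverage=0.9, max_error_rate=0.05):
    """Return alerts for one crawl: too few items, or too many bad ones."""
    alerts = []
    if expected_count and len(items) / expected_count < min_coverage:
        alerts.append("coverage drop")
    bad = sum(1 for item in items if validate_item(item))
    if items and bad / len(items) > max_error_rate:
        alerts.append("quality drop")
    return alerts


good = {"name": "Mug", "price": "12.50", "url": "/p/1"}
print(check_batch([good] * 5, expected_count=10))  # ['coverage drop']
```

Running a check like this after every crawl, and comparing against the previous run's item count, is often enough to notice a broken spider within a day rather than weeks later.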
As you have seen, scraping product data at scale creates its own unique set of challenges. Hopefully, this article has made you more aware of the challenges you will face and how you should go about solving them.
However, this is just the first article in this series so if you are interested in reading the next articles as soon as they are published be sure to sign up for our email list.
For those of you who are interested in scraping the web at scale but are wrestling with the decision of whether or not you should build up a dedicated web scraping team in-house or outsource it to a dedicated web scraping firm then be sure to check out our guide, Enterprise Web Scraping: Build In-House or Outsource.
At Zyte we specialize in turning unstructured web data into structured data. If you would like to learn more about how you can use web scraped product data in your business then feel free to contact our sales team, who will talk you through the services we offer startups right through to Fortune 100 companies.
At Zyte we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.