When it comes to web scraping, one key element is often overlooked until it becomes a big problem.
That is data quality.
Getting consistent, high-quality data is critical to the success of any web scraping project, particularly when scraping at scale or extracting mission-critical data where accuracy is paramount.
Data quality can be the difference between a project being discontinued and one that gives your business a huge competitive edge in its market.
In this article we’re going to talk about data quality assurance for web scrapers, give you a sneak peek at some of the tools and techniques Zyte (formerly Scrapinghub) has developed, and share some big news: we are open-sourcing one of our most powerful quality assurance tools. These QA processes enable us to verify the quality of our clients’ data at scale and to confidently give all our clients strong data quality and coverage guarantees.
From a business perspective, the most important consideration of any web scraping project is the quality of the data being extracted. Without a consistent high quality data feed your web scraping infrastructure will never be able to help your business achieve its objectives.
Today, with the growing prevalence of big data, artificial intelligence and data driven decision making, a reliable source of rich and clean data is a major competitive advantage. Compounding this is the fact that many companies are now directly integrating web scraped data into their own customer-facing products, making real-time data QA a huge priority for them.
Scraping at scale only magnifies the importance of data quality. Poor data accuracy or coverage in a small web scraping project is a nuisance, but usually manageable. However, when you are scraping hundreds of thousands or millions of web pages per day, even a small drop in accuracy or coverage could have huge consequences for your business.
From the start of any web scraping project, you need to think about how you are going to achieve the high level of data quality you require.
We know that getting high quality data when scraping the web is often of critical importance to your business, but what makes it so complex?
It comes down to a combination of factors:
#1 Requirements - The first and most important aspect of data quality verification is clearly defined requirements. Without knowing what data you require, what the final data should look like and what accuracy/coverage level you need, it is very hard to verify the quality of your data. Quite often companies come to Zyte without clear data requirements laid out, so we need to work with the client to properly define what those requirements are. We find that a good question to ask is:
“What effect would a 5% data quality inaccuracy have on your engineers or downstream systems?”
In order to make your data quality targets realistic and achievable, it is important that you specify your requirements clearly and that they be “testable”.
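One way to make requirements “testable” is to express each one as a check that any extracted item either passes or fails, so that accuracy becomes a measurable number rather than an impression. A minimal sketch in Python (the field names and rules here are hypothetical examples, not a real client specification):

```python
# Each requirement is a plain predicate over one field of an extracted item.
# Hypothetical fields and rules for illustration only.
REQUIREMENTS = {
    "title": lambda v: isinstance(v, str) and len(v.strip()) > 0,
    "price": lambda v: isinstance(v, (int, float)) and v > 0,
    "url": lambda v: isinstance(v, str) and v.startswith("http"),
}

def check_item(item):
    """Return the fields of an item that fail their requirement."""
    return [
        field for field, rule in REQUIREMENTS.items()
        if field not in item or not rule(item[field])
    ]

def accuracy(items):
    """Fraction of items that satisfy every requirement."""
    if not items:
        return 0.0
    return sum(1 for item in items if not check_item(item)) / len(items)

items = [
    {"title": "Widget", "price": 9.99, "url": "https://example.com/w"},
    {"title": "", "price": 9.99, "url": "https://example.com/x"},  # empty title
]
print(accuracy(items))  # 0.5
```

With requirements written down this way, a question like the 5% one above has a concrete answer: you can measure whether a dataset meets the agreed threshold.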
#2 Efficiency at Scale - The beauty of web scraping is that it has an unmatched ability to scale very easily compared to other data gathering techniques. However, data QA often isn’t able to match the scalability of your web scraping spiders, particularly when it involves only manual inspection of a sample of the data and visual comparison with the scraped pages.
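When manual inspection is unavoidable, it can at least be kept tractable by reviewing a fixed-size random sample of the output rather than the whole dataset. A minimal sketch (the sample size and seed are arbitrary illustrative choices):

```python
import random

def sample_for_review(items, k=100, seed=42):
    """Draw a reproducible random sample of scraped items for manual QA.

    A fixed-size sample keeps the manual effort constant as the crawl grows,
    and the seed makes the sample repeatable between reviewers.
    """
    if len(items) <= k:
        return list(items)
    return random.Random(seed).sample(items, k)

crawl_output = [{"id": i} for i in range(100_000)]  # stand-in for a large crawl
print(len(sample_for_review(crawl_output, k=100)))  # 100
```

Spot-checking a sample only estimates quality, of course; automated checks are still needed for the guarantees discussed below.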
#3 Website Changes - Perhaps the biggest cause of poor data coverage or accuracy is changes to the underlying structure of all or parts of the target website. With the increasing usage of A/B split testing, seasonal promotions and regional/multilingual variations, large websites are constantly making small tweaks to the structure of their web pages that can break web scraping spiders. As a result, it is very common for the coverage and accuracy of the data from your spiders to degrade over time unless you have continuous monitoring and maintenance processes in place.
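One common way to catch this degradation early is to track per-field fill rates from run to run. The sketch below (plain Python, with hypothetical field names) compares the current run's coverage against a baseline and flags any field that has dropped beyond a tolerance:

```python
def field_coverage(items):
    """Per-field fill rate: the share of items with a non-empty value."""
    counts = {}
    for item in items:
        for field, value in item.items():
            if value not in (None, "", []):
                counts[field] = counts.get(field, 0) + 1
    total = max(len(items), 1)
    return {field: n / total for field, n in counts.items()}

def coverage_drops(baseline, current, tolerance=0.05):
    """Fields whose fill rate fell more than `tolerance` versus the baseline."""
    return {
        field: (rate, current.get(field, 0.0))
        for field, rate in baseline.items()
        if rate - current.get(field, 0.0) > tolerance
    }

# Hypothetical runs: a site change has broken extraction of the price field.
yesterday = field_coverage([{"title": "A", "price": 9.99},
                            {"title": "B", "price": 4.50}])
today = field_coverage([{"title": "A", "price": None},
                        {"title": "B"}])
print(coverage_drops(yesterday, today))  # {'price': (1.0, 0.0)}
```

A check like this won't tell you *why* a field stopped extracting, but it turns a silent, gradual degradation into an immediate alert.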
#4 Semantics - Verifying the semantics of textual information, or the meaning of the data being scraped, remains a challenge for automated QA. While we and others are developing technologies to assist in verifying the semantics of data extracted from websites, no system is 100% perfect. As a result, manual QA is often required to ensure the accuracy of the data.
At a high level, your QA system is trying to assess the quality/correctness of your data along with the coverage of the data you have scraped.
Depending on the scale, number of spiders, and the degree of complexity of your web scraping requirements, there are different approaches you can take when developing an automated quality assurance system for your web scraping.
Due to the number of clients we scrape the web for and the wide variety of web scraping projects we have in production at any one time, Zyte has experience with both approaches. We’ve developed bespoke, project-specific automated test frameworks for individual projects with unique requirements. Principally, though, we rely on the generic automated test framework we’ve developed, which can be used to validate the data scraped by any spider.
When used alongside Spidermon (more on this below), this framework allows us to quickly add a quality assurance layer to any new web scraping project we undertake.
The other key component of any web scraping quality assurance system is a reliable system for monitoring the status and output of your spiders in real-time.
A spider monitoring system allows you to detect sources of potential quality issues immediately after spider execution completes.
At Zyte we’ve developed Spidermon, which allows developers (and indeed other stakeholders such as QA personnel and project managers) to automatically monitor spider execution. It verifies the scraped data against a schema that defines the expected structure, data types and value restrictions. It can also monitor bans, errors, and item coverage drops, among other aspects of a typical spider execution. In addition to this post-execution validation, we often apply real-time data-validation techniques, particularly for long-running spiders, so that a spider can be stopped as soon as it is detected to be scraping unusable data.
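To give a flavour of what schema-based validation of scraped items looks like, here is a sketch using the widely used jsonschema library. The item fields and restrictions are hypothetical, and this is plain jsonschema rather than Spidermon's own API:

```python
from jsonschema import Draft7Validator

# A hypothetical item schema describing expected structure, data types and
# value restrictions, in the spirit of the checks described above.
ITEM_SCHEMA = {
    "type": "object",
    "required": ["name", "price", "in_stock"],
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "price": {"type": "number", "minimum": 0},
        "in_stock": {"type": "boolean"},
    },
}

validator = Draft7Validator(ITEM_SCHEMA)

def validate_item(item):
    """Return a human-readable message for every schema violation."""
    return [error.message for error in validator.iter_errors(item)]

print(validate_item({"name": "Widget", "price": 9.99, "in_stock": True}))  # []
print(validate_item({"name": "", "price": -1}))  # several violations
```

Running such a check over every item after a crawl gives you a per-field breakdown of exactly where the data deviates from expectations.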
This brings us to our big news: Zyte is delighted to announce that in the coming weeks we are going to open-source Spidermon, making it an easy-to-use add-on for all your Scrapy spiders. It can also be used with spiders developed using other Python libraries and frameworks, such as BeautifulSoup.
Spidermon is an extremely powerful and robust spider monitoring add-on that has been developed and tested on millions of spiders over its lifetime. To be the first to get access to Spidermon, be sure to get yourself on our email list (below) so we can let you know as soon as Spidermon is released.
Next we’ll take a look at Zyte’s quality assurance process to see how all these elements fit together in an enterprise-scale QA system.
To ensure the highest data quality from our web scraping, Zyte applies a four-layer QA process to all the projects we undertake with clients.
Only after passing through all four of these layers is the dataset then delivered to the client.
For a detailed behind-the-scenes look at how Zyte’s quality system works, the exact data validation tests we conduct, and how you can build your own quality system, click here to download our Web Scraping Quality Assurance Guide.
As you have seen, there is often quite a bit of work in ensuring your web scraping projects are actually yielding the high quality data you need to grow your business. Hopefully, this article has made you more aware of the challenges you will face and how you could go about solving them.
At Zyte we specialize in turning unstructured web data into structured data. If you would like to learn more about how you can use web scraped data in your business, feel free to contact our Sales team, who will talk you through the services we offer to everyone from startups right through to Fortune 100 companies.
At Zyte we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on right now.