Web scraping at scale comes with a set of challenges you need to overcome just to extract the data. But once you are able to get it, you still have work to do: you need a data QA process in place. Data quality becomes especially crucial if you regularly extract high volumes of data from the web and your team’s success depends on the quality of that scraped data.
This article is the first of a four-part series on how to maximize web scraped data quality. We are going to share with you all the techniques, tricks, and technologies we use at Scrapinghub to extract web data from billions of pages every month while keeping data quality high.
The first step is to understand the business requirements of the web scraping project and define clear, testable rules that will help you detect data quality problems. A clear understanding of the requirements is essential before you can develop an effective data quality process.
Requirements are often incomplete, ambiguous, or vague. Here you can find some general tips for defining good requirements:
In order to show an actual example, in this article we are going to work with product data that was extracted from an e-commerce site. Here is a sample of what two typical scraped records are intended to look like:
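The original sample is not reproduced here, but two hypothetical records, consistent with the fields validated later in this article and with all values invented purely for illustration, might look like this:

{
  "productId": 101,
  "productName": "Example Running Shoe",
  "price": 49.99,
  "tags": ["shoes", "running"],
  "available": true,
  "date": "2019-06-15T14:30:00Z",
  "url": "http://example.com/product/101"
}
{
  "productId": 102,
  "productName": "Example Rain Jacket",
  "price": 79.50,
  "tags": ["jackets", "outdoor"],
  "available": false,
  "date": "2019-06-16T09:05:00Z",
  "url": "http://example.com/product/102"
}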
In addition to these sample records, the business requirements provided to the QA Engineer are as follows:
Can you find some potential problems with the requirements above?
The stipulation of a data type for the price field seems sufficient at first glance. Not quite. Is the string "2.6" valid, or should it be the number 2.6? The answer matters if we want to validate properly: “We scraped the right thing, but did we scrape it right?”
Similarly, there are several different date formats that satisfy ISO 8601. Should we report warnings if any of the following were scraped?
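For instance, all of the following (illustrative values) are valid under ISO 8601:

2019-06-15
20190615
2019-06-15T14:30:00Z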
Take a minute and try to visually validate the example records below against the rules above. Then see how your manual check fares against the automated validation techniques outlined in the rest of the article.
Below are some example scraped records for this scraper and its requirements. For illustrative purposes, only the first record can be deemed to be of good quality; the other four each exhibit one or more data quality issues. Later on in the article we will show how each of these issues can be uncovered with one or more automated data validation techniques.
Based on the requirements outlined above, we are going to define a JSON schema that will help us to validate data.
If a schema is not created by hand in advance of spider development, one can be inferred from a set of known representative records using a tool like GenSON (a minimal sketch follows the list below). It’s worth pointing out that although such inference is convenient, the generated schemas often lack the robustness needed to fully validate the web scraping requirements. This is where the experience of the QA Engineer comes into play: taking advantage of more advanced features of the JSON Schema standard, as well as adding regexes and other more stringent validation, such as:
By default, all fields are marked as required; in the final schema, only the fields requested by the client remain required.
The current version of the JSON Schema standard cannot enforce uniqueness of a field’s value across records. While future drafts may support it, for now we work around this by adding a keyword that our automated data validation framework recognizes; in our case, the keyword “unique”.
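As referenced above, here is a minimal sketch of schema inference with GenSON; the sample record passed to it is invented for illustration:

import json
from genson import SchemaBuilder

builder = SchemaBuilder()
# Feed one or more known representative records; GenSON infers field types
# and which fields are required
builder.add_object({"productId": 101, "productName": "Example Running Shoe", "price": 49.99})
print(json.dumps(builder.to_schema(), indent=2))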
Some examples:
Price:
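A sketch of a stricter price definition: requiring a JSON number rules out string values such as "2.6", and the non-negative lower bound is an assumption added for illustration:

"price": {
  "type": "number",
  "minimum": 0
}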
Date:
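A sketch of a stricter date definition: the format keyword covers ISO 8601 date-times, while the (illustrative) regex pins the field to a single representation:

"date": {
  "type": "string",
  "format": "date-time",
  "pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}"
}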
This is what the final schema looks like:
{ "$schema": "http://json-schema.org/draft-07/schema#", "title": "product.json", "definitions": { "url": { "type": "string", "pattern": "^https?://(www\.)?[a-z0-9.-]*\.[a-z]{2,}([^>%\x20\x00-\x1f\x7F]|%[0-9a-fA-F]{2})*$" } }, "additionalProperties": true, "type": "object", "properties": { "productId": { "type": "integer", "unique": "yes" }, "productName": { "type": "string" }, "price": { "type": "number" }, "tags": { "type": "array", "items": { "type": "string" } }, "available": { "type": "boolean" }, "date": { "type": "string", "format": "datetime" }, "url": { "$ref": "#/definitions/url", "unique": "yes" } }, "required": [ "price", "productId", "productName", "url" ] }
With the requirements clarified and subsequently mapped to a robust schema with stringent validation, the core ingredient for automated data validation is now in place. The Python library jsonschema will be used as part of a broader automated data validation framework built on Robot Framework and leveraging Pandas for additional, more advanced data analysis.
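As a minimal, standalone sketch of the schema validation step (outside the Robot Framework setup), assuming the schema above has been saved as product.json and the scraped items are available as a list of dicts called records:

import json
from jsonschema import Draft7Validator, FormatChecker

with open("product.json") as f:
    schema = json.load(f)

validator = Draft7Validator(schema, format_checker=FormatChecker())

# Note: jsonschema ignores unknown keywords such as our custom "unique" flag;
# uniqueness has to be checked separately across the whole dataset.
for record in records:
    for error in validator.iter_errors(record):
        print(record.get("productId"), "->", error.message)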
Given the schema and sample data defined above, the validation process clearly shows us the data quality issues that need to be investigated:
Let's discuss some of them in more detail:
Although schema validation takes care of a lot of the heavy lifting when checking the quality of a dataset, it is often necessary to wrangle and analyze the data further, whether to sense-check schema validation errors, discern edge cases and patterns, or test more complex spider requirements.
Pandas is often the first port of call, for its ability to concisely:
In the following examples, df is a scraped dataset represented as a Pandas DataFrame.
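For context, a minimal sketch of how such a DataFrame might be created, assuming the scraped items were exported as JSON Lines to a hypothetical items.jl file:

import pandas as pd

# One scraped record per line, serialized as JSON
df = pd.read_json("items.jl", lines=True)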
Before manipulating the data, it is often useful to see a high-level overview of it. One way is to list the top values for all fields using value_counts() in conjunction with head():
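A sketch of such an overview; the cast to str is there so that list-valued fields like tags do not raise "unhashable type" errors:

for column in df.columns:
    print(f"--- {column} ---")
    print(df[column].astype(str).value_counts().head())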
There are several problems with the price field:
The first step is to get all prices scraped as numeric:
# Coerce prices to numeric; values that cannot be parsed become NaN
prices = pd.to_numeric(df.price, errors="coerce")
We then determine mean and standard deviation:
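For instance (variable names are our own):

price_mean = prices.mean()  # NaNs produced by the coercion step are ignored
price_std = prices.std()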
And finally, to find all values that lie too many standard deviations from the mean:
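Continuing the sketch, records whose price lies more than three standard deviations from the mean (the 3-sigma threshold is an assumption for illustration):

outliers = df[(prices - price_mean).abs() > 3 * price_std]
print(outliers[["productId", "productName", "price"]])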
The next possible source of error is the tags and product names. Let's check whether there are any cases where a tag is not part of the product name. To do this, we will expand the nested tag lists and iterate over the values.
# Expand each record's list of tags into separate columns, one per tag position
tags = df.tags.apply(pd.Series)
Then we can access the first tag thus:
tags[0]
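Putting it together, a sketch of one way to perform the check described above, working row-wise on the original tag lists; the helper name missing_tags and the case-insensitive comparison are our own assumptions:

def missing_tags(row):
    # Handle records where the name or the tag list is missing entirely
    name = row.productName.lower() if isinstance(row.productName, str) else ""
    row_tags = row.tags if isinstance(row.tags, list) else []
    return [tag for tag in row_tags if tag.lower() not in name]

# For each record, collect the tags that do not appear in the product name
df["missing_tags"] = df.apply(missing_tags, axis=1)
print(df[df.missing_tags.str.len() > 0][["productName", "missing_tags"]])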
In this article, our goal was to give an overview of how data extracted from the web can be validated using whole-dataset automated techniques. Everything we’ve written about is based on our experience validating millions of records on a daily basis.
In the next post in the series, we’ll discuss more advanced data analysis techniques using Pandas as well as jq, with more real-world examples. We’ll also give an introduction to visualization as a way of uncovering data quality issues. Stay tuned!
Data quality assurance is just one small (but important!) piece of the puzzle. If you need help with your whole web data extraction project and you’re looking for a reliable data partner, have a look at our solutions or contact us to get started!