In case you missed them, here’s the first part and second part of the series.
The manual way or the highway...
In software testing and QA circles, the topic of whether automated or manual testing is superior remains a hotly debated one. For data QA and validation specifically, they are not mutually exclusive. Indeed, for data, manual QA can inform automated QA, and vice versa. In this post, we’ll give some examples.
It is rare that data extracted from the web can be adequately validated with automated techniques alone; additional manual inspections are often needed. The optimal blend of manual and automated tests depends on factors including:
When considered in isolation, each has its benefits and drawbacks:
Pros:
Cons:
Pros:
Cons:
Automated testing is most suitable for repetitive tasks and regression testing when rules are clearly defined and relatively static, this includes things like:
Manual tests, on the other hand, are invaluable for a deeper understanding of suspected data quality problems, particularly for data extracted from dynamic e-commerce websites and marketplaces.
From a practical point of view, the validation process should start with an understanding of the data and its characteristics. Next, define what rules are needed to validate the data, and automate them. The results of the automation will be warnings and possible false alarms that need to be verified using manual inspection. After the improvement of the rules, the second iteration of automated checks can be executed.
Let's suppose we have the task of verifying the extraction coverage and correctness for this website: http://quotes.toscrape.com/
If you try to achieve this task in a fully manual way, then you usually have the following options:
The same task can be easily done with an automation tool. You will need to spend some time investigating what needs to be extracted, do some tests, and voila. However, there are some points to be aware of:
To mitigate most of the cons of the manual and automated approaches, we can tackle the task using a semi-automated approach.
Step 1: Study the page, noting that there are 10 results per page and clear pagination: http://quotes.toscrape.com/page/2/
The last page is 10.
Step 2: Open all pages. You can use browser extensions like Open Multiple URLs
If you’d like to build such a list you can use excel: defining a template and simple formula:
=$B$1 & A3 & "/"
Open all links with the above extension.
Step 3: Extracting the data. Now we can extract all quotes per page with a few clicks. For this purpose, we are going to use another browser extension like Scraper.
Upon installing the extension, for example, this is how we can extract all authors:
Select the name of the first author of the page you are in, right-click on it and then click on “Scrape similar…”:
Then you will get the following window opened. Export it elsewhere or simply within the window, use it to compare with the data previously extracted:
Given a scenario of having failing data quality checks towards product data extracted from the web with only tests for prices, the tools and approaches we have covered so far since the beginning of our series are capable of detecting different errors like:
That said, the automated tests can fail to validate and report the wrong price because of a combination of different factors. Just to start, the total dependence on automated validation leads to a false sense of “no errors”, not to mention that if crucial care is not taken such as following the steps we covered in our series so far will lead to lower-than-possible test coverage.
The key lesson here is that even when automation is done in the best way possible, it can still fail us due to nuances on the websites, or miss some edge cases like on less than 1% of the data - that’s why it’s important to maintain and support automated tests with manual validations.
While building a JSON Schema we can try to be so strict on the data validation rules using as much as the validation definitions rules there are to assert the data to the best possible such as the following for price, for example:
{ "type": "object", "properties": { "price": { "type": "number", "minimum": 0.01, "maximum": 20000, "pattern": "^[0-9]\.[0-9]{2}$" }, "productName": { "type": "string", "pattern": "^[\S ]+$" }, "available": { "type": "boolean" }, "url": { "format": "uri", "unique": "yes" } }, "required": [ "price", "productName", "available", "url" ] }
However, every website has a different behaviour, many won’t have the price at all whenever the product is out of stock and it’s expected for our extraction tool to set the price to 0 for such cases so, what’s better? Be more lenient with our tests removing the "minimum" validation? No! There’s another approach that we can take with more knowledge of the JSON Schema validation possibilities and that is conditionals:
{ "type": "object", "properties": { "productName": { "type": "string", "pattern": "^[\S ]+$" }, "available": { "type": "boolean" }, "url": { "format": "uri", "unique": "yes" } }, "if": { "properties": { "available": { "const": false } } }, "then": { "properties": { "price": { "const": 0 } } }, "else": { "properties": { "price": { "type": "number", "minimum": 0.01, "maximum": 20000, "pattern": "^[0-9]\.[0-9]{2}$" } } }, "required": [ "productName", "price", "available", "url" ] }
So with this new second schema, we can prevent both the situations of having too many false-positive errors being raised (Happens when using 1st JSON Schema shown above) and also a misleading absence of errors (if we simply removed the minimum tag for price) which could lead us to miss the extraction of price even when the product was in stock due to malfunctioning or changes on the website.
It’s clear that a manual+automated approach is the way to go.
Let’s have an example of receiving the following sample data for the extraction of http://quotes.toscrape.com/page/1/ to assess its quality:
Automated tests are able to catch every single one of the values that failed to be extracted that are highlighted in green below (Null/NaN values), however, the following issues in red won’t be caught without an additional step of manual or semi-automated approach:
So taking on the possibility of using e.g. Selenium and the data on hands, we can build a script to check the coverage and that if the data extracted was indeed the one available for every single one of the cells:
# Preparing the Selenium WebDriver import time from selenium import webdriver from selenium.webdriver.chrome.options import Options driver = webdriver.Chrome(executable_path=r"chromedriver", options=Options()) driver.get("http://quotes.toscrape.com/page/1/") time.sleep(15) # Extraction coverage test is_item_coverage_ok = df.shape[0] == len( driver.find_elements_by_xpath("//*[@class='quote']") ) if is_item_coverage_ok: print("Extraction coverage is perfect!") else: print( "The page had " + str(len(driver.find_elements_by_xpath("//*[@class='quote']"))) + " items, however, only " + str(df.shape[0]) + " were extracted." ) # Testing if each column for each row matches the data available on the website # And setting the IS_OK column of them accordingly for row, x in df.iterrows(): is_quote_ok = ( x["quote"] == driver.find_element_by_xpath( "//div[" + str(x["position"] + 1) + "]/span[contains(@class, 'text')]" ).text ) is_author_ok = ( x["author"] == driver.find_elements_by_xpath("//small")[x["position"]].text ) are_tags_ok = True if isinstance(x["tags"], list): for tag in x["tags"]: if isinstance(tag, str): tags = driver.find_elements_by_xpath("//*[@class='tags']")[ x["position"] ].text if tag not in tags: are_tags_ok = False break else: are_tags_ok = False else: are_tags_ok = False df.at[row, "IS_OK"] = is_quote_ok and is_author_ok and are_tags_ok driver.close() df.style.hide_index()
Which returns us the following Pandas DataFrame:
This combined approach allowed us to detect 4 additional issues that would have slipped past standard automated data validation. The full Jupyter notebook can be downloaded here.
In this post, we showed how automated and manual techniques can be combined to compensate for the drawbacks of each and provide a more holistic data validation methodology.
In the next post of our series, we’ll discuss some additional data validation techniques that straddle the line between automated and manual.
Check out part 4 and part 5 of the series.
At Zyte (formerly Scrapinghub), we extract billions of records from the web everyday. Our clients use our web data extraction services for price intelligence, lead generation, building a product and market research, among other things.
If you and your business’s success depends on web data, reach out to us and let’s discover how we can help you with web data!