Ivan Ivanov, Warley Lopes
7 Mins
September 3, 2020

A practical guide to web data QA part IV: Complementing semi-automated techniques

If you haven’t read the previous ones, here are the first, second, and third parts of the series.

In this article, we build upon some of the semi-automated techniques and tools introduced in the previous part of the series.

1. Text editor search-and-replace techniques

Let’s say that the data we work with is separated by commas and line breaks:



However, the number of comma-separated words per line is inconsistent. To make the data easier to work with, we can transform the dataset so there’s only one word per line:







In order to make this transformation, we can use the search-and-replace functionality of code text editors such as Sublime Text, Notepad++, or Visual Studio Code.

This is how you can do it using Sublime Text:

  1. Open the data file or paste it into a new tab in the program
  2. Open the Search and Replace dialog with the shortcut CTRL + H
    1. Enable regex through the “.*” button in the bottom left corner
    2. Search for line breaks using the control character \n
    3. Replace with a comma (“,”)
  3. Click on “Replace All”

Once the process finishes, all the words will be in a single row, separated by commas:

Finally, we replace the commas with \n (newline):

Once the replacing is done, we have a normalized dataset with only one word per line.
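The same two-step normalization can also be scripted. Here is a minimal Python sketch, using a made-up sample of comma-separated words for illustration:

```python
import re

# Hypothetical sample: words separated by commas, with an
# inconsistent number of words per line.
raw = "quotes,love\ninspirational,life,humor\nbooks"

# Step 1: replace line breaks with commas (everything on one row).
one_row = re.sub(r"\r?\n", ",", raw)

# Step 2: replace commas with newlines (one word per line).
normalized = re.sub(r",", "\n", one_row)

print(normalized)
```

This mirrors exactly what the editor does: first collapse the lines into one comma-separated row, then split that row back apart on every comma.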

2. Approaches with spreadsheets

Let's work again with scraped data. Our goal for this example is to make sure that the Top Ten tags stated on the website are indeed the top ten tags present in the scraped data. The web page looks like this:

After scraping the data from the page and loading it into a spreadsheet, this is what it looks like:

We will be using Google Sheets for this example. The first step will be to split the tags column into several columns so that we can count each word individually and analyze the results better:

  • Select the tags column
  • Click on the Data menu option
  • Then Split text to columns
  • In the small Separator window shown in the spreadsheet:
    • Choose “Detect automatically” (if it correctly picks the comma)
    • Or change the selected separator to Comma

Then your spreadsheet will look like this:

The next step will be to convert the multiple tag columns into a single one. This can be done by defining a new range:

  • Select the range with the tags - in my case it is D2:K101
  • Click on the Data menu option
  • Click on Named ranges
  • Then enter the name for it, e.g: Tags

Then apply the following formula to our data, expanded to the total expected length (in this case, 8 columns by 100 rows = 800 rows in total):


When you reach the end, you will see a “#NUM!” error, as shown below:

Next, we create a new column next to the tags populated with 1:

You can enter 1 in cells N2 and N3, select them, and then double-click the fill handle at the bottom-right corner of the selection to copy the value down to the rest of the rows.

The final step is to create a Pivot table from columns M and N:

  • Select the entire range, in this case, M1:N801
  • Click on the Data menu option
  • Then click on Pivot Table
  • Click on Create:

  • For Rows - add the Tags column
  • For Values - add the Count column:

Then, this should be the result:

Finally, we can sort our data by occurrence count from the Pivot table editor, which gives us the following:
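The whole spreadsheet pipeline - split each tags cell, flatten into one column, count, and sort - can also be sketched in a few lines of Python, shown here with a made-up sample of rows:

```python
from collections import Counter

# Hypothetical scraped rows: each row's "tags" cell holds
# comma-separated tags, just like the spreadsheet column.
rows = [
    "love,inspirational,life",
    "life,love",
    "humor,books,love",
]

# Split each cell and flatten into one list (the single-column step),
# then count occurrences (the pivot-table step).
tags = [tag for cell in rows for tag in cell.split(",")]
counts = Counter(tags)

# Sort by occurrence count, descending - the top tags.
top = counts.most_common(10)
print(top)
```

`Counter.most_common` does the counting and the descending sort in one call, so there is no equivalent of the #NUM! padding step to worry about.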

3. Manually checking multiple pages

The top tags were verified in the previous example. But what if we need to open the URLs of the top 10 (or of random tags) and visually check them? In this case, we can use a combination of tools such as a text editor (Sublime Text or Notepad++) and Chrome extensions like “Open Multiple URLs”, “Copy All Urls”, and “Session Buddy”:

If we take a closer look at the tag links, they follow this structure:

Let’s say we are picking the following tags:





To visually check them, we can:

  1. Paste them into Sublime Text or Notepad++, for example
  2. Copy the initial part of the URL
  3. Use block selection to select the start of all the tag rows
    1. Shift + Right Mouse drag - move the mouse to the start of the text in the rows, before the first letter
  4. Then paste the copied initial part of the URL

The result will be:


Now we can open all of them simultaneously by using the Chrome extension we mentioned, “Open Multiple URLs”.
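The URL-building step can also be scripted instead of done with block selection. Below is a minimal Python sketch; the base URL is an assumption for illustration, following the tag-link structure described above:

```python
# Hypothetical base URL following the observed tag-link structure.
base = "https://quotes.toscrape.com/tag/"

# Tags picked for the visual check.
tags = ["love", "inspirational", "life"]

# Prefix every tag with the base URL - the scripted equivalent of
# pasting the copied URL start via block selection.
urls = [base + tag + "/" for tag in tags]

for url in urls:
    print(url)
```

The resulting list can then be pasted straight into the “Open Multiple URLs” extension.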

After checking the results page by page, comparing them with the scraped data, we can close the pages without problems and keep the bad ones by:

  • Copying all URLs still open through the extension “Copy All Urls” and saving them somewhere
  • Saving the open tabs in a Google Chrome session for later through the extension “Session Buddy”

The final result you’ll have for analysis or issue reporting will be:

  • List of all pages
  • List of pages with problems 

4. Difference check

Whenever a side-by-side comparison looks helpful, diff check tools (simple text comparison tools such as WinMerge) can be used. For this example, let’s say we received the task of making sure all category links of books from Books to Scrape were indeed scraped.

The data would look similar to this:



The website has a listing of the categories just like this:

Since it’s pretty simple to select all the categories on the page, we can copy them and then compare them side by side with the extracted data.

First, after copying all the categories from the website, we need to normalize them the same way the scraped category links will be normalized. So let’s order them alphabetically and transform them to lowercase.

In order to transform everything to lowercase, I’ll use Visual Studio Code’s built-in “Transform to Lowercase” command (through the command palette, opened with the F1 key):

Then I’ll use a Visual Studio Code extension called Sort lines, which, through its available commands, gives us the following result that will later be compared with the category_links scraped data:

Now we normalize the category_links from the scraped data, removing everything not related to the category name with search-and-replace approaches similar to those shown in the first method:

Removal of the start part of the URLs:

Then removing the header and the final part:

Now we wrap up with everything else (replacing “-” with a whitespace and removing a final leftover part):

Then we go over to WinMerge, open a new window with CTRL + N, paste the content copied from the website on one side and the normalized categories from the links on the other. Upon hitting F5, both will be compared, and if they are equal, every single option shown on the website was indeed scraped:
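The normalization and the comparison can also be sketched in Python. The category link format below is an assumption for illustration, as is the small sample of data:

```python
# Hypothetical scraped category links and the category names
# copied from the website.
scraped_links = [
    "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
    "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
]
site_categories = ["Travel", "Mystery"]

def normalize_link(url):
    # Keep only the category slug, drop the trailing "_<id>" and
    # "index.html", and replace "-" with a space - mirroring the
    # search-and-replace steps done in the editor.
    slug = url.rsplit("/", 2)[-2]   # e.g. "travel_2"
    name = slug.rsplit("_", 1)[0]   # e.g. "travel"
    return name.replace("-", " ")

# Lowercase and sort both sides, then compare.
from_links = sorted(normalize_link(u) for u in scraped_links)
from_site = sorted(c.lower() for c in site_categories)

# If the two normalized lists match, every category was scraped.
print(from_links == from_site)  # True for this sample
```

Printing the two lists side by side when they differ gives you roughly what WinMerge shows, without leaving the terminal.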

5. Structured data testing tool

Some scraping methods rely on data and structures that may only be available in the page source code, like microdata and JSON-LD (JSON linked data). To make the job of assuring data quality easier, these can be extracted and compared with the scraped data using the Google Structured Data Testing Tool.
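You can also pull JSON-LD blocks out of a page yourself for comparison. Here is a minimal sketch using only the Python standard library, fed with a made-up page source:

```python
import json
from html.parser import HTMLParser

# Made-up sample page source containing one JSON-LD block.
html_source = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "A Light in the Attic",
 "offers": {"price": "51.77"}}
</script>
</head><body></body></html>
"""

class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of application/ld+json scripts."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.items.append(json.loads(data))

parser = JsonLdExtractor()
parser.feed(html_source)
print(parser.items[0]["name"])
```

Each entry in `parser.items` is a plain dictionary, so its fields can be asserted against the scraped records directly.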

6. SQL

SQL has long been a key tool in the QA engineer’s toolbelt. Let’s suppose we have 1000 rows of data scraped from Books to Scrape, following the pattern below, and want to assess their data quality:






All products Books to Scrape - Sandbox1

Apart from field names not matching the scraped data nicely (like a “title” column actually containing URLs instead), with SQL we can:

  • Investigate minimum and maximum prices for anomalies (e.g: negative or zero values):

  • Investigate aggregation of values for fields that could indicate issues following a distinguishable pattern:

However, let’s suppose we expected the field “_type” to hold the genre/category of the books:

All the data had only one value for “_type”, so the scraper could be corrected to capture the genre of the books instead, making the field more useful.
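The checks above can be sketched end to end with SQLite’s in-memory database, using hypothetical column names and a small made-up sample:

```python
import sqlite3

# In-memory table with hypothetical columns (_type, title, price).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (_type TEXT, title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("Item", "A Light in the Attic", 51.77),
        ("Item", "Tipping the Velvet", 53.74),
        ("Item", "Soumission", 0.0),  # anomaly: zero price
    ],
)

# Minimum and maximum prices - spotting negative or zero values.
low, high = conn.execute(
    "SELECT MIN(price), MAX(price) FROM books"
).fetchone()
print(low, high)  # 0.0 53.74

# Aggregate the values of a field - a single value of "_type" for
# every row suggests the field is not capturing the book genre.
type_counts = conn.execute(
    "SELECT _type, COUNT(*) FROM books GROUP BY _type"
).fetchall()
print(type_counts)
```

The same `MIN`/`MAX` and `GROUP BY` queries run unchanged against any SQL database holding the real 1000 scraped rows.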


In this post, we showcased multiple semi-automated techniques which, combined with the approaches shown in the previous posts of the series, will hopefully bring creative ideas into your data quality assurance process and help you test your data better.

Want to learn more about web data QA? Read the Practical guide to web data QA part V: Broad crawls

Learn more about Enterprise Web Scraping

Web scraping can look deceptively easy when you're starting out. There are numerous open-source libraries/frameworks and data extraction tools that make it very easy to scrape data from a website. But in reality, it can be very hard to extract web data at scale. Read our whitepaper and learn how to build a scalable web scraping infrastructure for your business or project.