A practical guide to web data QA part IV: Complementing semi-automated techniques
If you haven’t read the previous installments, here are the first, second, and third parts of the series.
In this article, we build upon some of the semi-automated techniques and tools introduced in the previous part of the series.
1. Text editor search-and-replace techniques
Let’s say that the data we work with is separated by commas and line breaks:
change,deep-thoughts,thinking,world
abilities,choices
However, the number of comma-separated words per line is not consistent. To make the data easier to work with, we can transform the dataset so that there is only one word per line:
change
deep-thoughts
thinking
world
abilities
choices
To make this transformation, we can use the search-and-replace functionality of code editors such as Sublime Text, Notepad++, or Visual Studio Code.
This is how you can do it using Sublime Text:
- Open the data file or paste it into a new tab in the program
- Open Search and Replace dialog by using the shortcut CTRL + H
- Enable Regex through the button “.*” in the bottom left corner
- Search for line breaks using the control character \n
- Replace with a comma (“,”)
- Click on “Replace All”
Once the process finishes, all the words will be in a single row, separated by commas.
Finally, we replace commas with \n (newline).
Once the replacing is done, we have a normalized dataset with only one word per line.
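If you prefer to script this normalization instead of doing it in the editor, the same two replacements can be expressed in a few lines of Python. The snippet below is a minimal sketch; the file names tags.txt and tags_normalized.txt are placeholders for your own files.

```python
import re

# Read the raw data (file name is a placeholder for your own dataset)
with open("tags.txt", encoding="utf-8") as f:
    raw = f.read()

# Split on commas and line breaks, dropping empty entries,
# so we end up with exactly one word per line
words = [w.strip() for w in re.split(r"[,\n]+", raw) if w.strip()]

with open("tags_normalized.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(words))
```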
2. Approaches with spreadsheets
Let's work again with data from http://quotes.toscrape.com/. Our goal for this example is to make sure that the Top Ten tags stated on the website are indeed the top ten tags present in the scraped data.
After scraping the data from the page, we load it into a spreadsheet; we will be using Google Sheets for this example.
The first step will be to split the tags column into several columns so that we can count each word individually and analyze the results better:
- Select the tags column
- Click on the Data menu option
- Then Split text to columns
- In the Separator dropdown, choose “Comma”
The next step will be to convert the multiple tag columns into a single one. This can be done by defining a new range:
- Select the range with tags (e.g., D2:K101)
- Click on the Data menu option → Named ranges
- Enter a name, e.g., “Tags”
Then apply the following formula and drag it down to the expected total length (e.g., 800 rows):
=INDEX(Tags,1+INT((ROW(A1)-1)/COLUMNS(Tags)),MOD(ROW(A1)-1+COLUMNS(Tags),COLUMNS(Tags))+1)
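The formula simply walks the rectangular Tags range row by row and returns one tag per output row. Here is the same index arithmetic written out in Python, purely as an illustration of what the formula computes (the sample values come from the example above):

```python
# A 2 x 4 stand-in for the named "Tags" range; blanks appear where a
# quote has fewer tags than the widest row
tags = [
    ["change", "deep-thoughts", "thinking", "world"],
    ["abilities", "choices", "", ""],
]
ncols = len(tags[0])  # COLUMNS(Tags)

flattened = []
for i in range(len(tags) * ncols):  # i plays the role of ROW(A1)-1
    row = i // ncols                # 1+INT((ROW(A1)-1)/COLUMNS(Tags)), 0-based here
    col = i % ncols                 # MOD(...)+1, 0-based here
    flattened.append(tags[row][col])

print([t for t in flattened if t])  # one tag per entry, blanks removed
```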
Next, create a new column next to the tags populated with 1 (this will be used for counting).
The final step is to create a Pivot table:
- Select the entire range (tags + count column)
- Data → Pivot Table → Create
- Rows: add the Tags column
- Values: add the Count column (summarized by SUM)
Sort the pivot table by count descending to verify the top tags match the website.
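The same top-ten check can also be scripted directly against the scraped data. Below is a minimal sketch, assuming the items were exported to a CSV file with a tags column holding comma-separated tags (quotes.csv and the column name are assumptions about your export):

```python
import csv
from collections import Counter

counts = Counter()
# "quotes.csv" and the "tags" column are assumptions about your export
with open("quotes.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for tag in row["tags"].split(","):
            tag = tag.strip()
            if tag:
                counts[tag] += 1

# The ten most frequent tags, to compare against the "Top Ten tags" box on the site
for tag, count in counts.most_common(10):
    print(f"{tag}: {count}")
```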
3. Manually checking multiple pages
The top tags were verified in the previous example. What if we need to open the URLs of the top 10 tags and visually check them? In this case, we can use a combination of tools such as:
- Open multiple URLs (Chrome extension)
- Session Buddy
- Copy all URLs
- Sublime Text or Notepad++ to generate tag links
Tag links follow this structure: http://quotes.toscrape.com/tag/<tag>/
To visually check selected tags (e.g., change, deep-thoughts, thinking, world):
- Paste them into Sublime Text or Notepad++
- Copy the base URL: http://quotes.toscrape.com/tag/
- Use block selection to prepend the base URL to each line
Result:
- http://quotes.toscrape.com/tag/change/
- http://quotes.toscrape.com/tag/deep-thoughts/
- http://quotes.toscrape.com/tag/thinking/
- http://quotes.toscrape.com/tag/world/
Open all simultaneously using the “Open Multiple URLs” extension. After manual review, keep problematic tabs open and use “Copy All URLs” and “Session Buddy” to save lists of good/bad pages.
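If you prefer to avoid browser extensions, the same tag URLs can be generated and opened from a short script. A sketch, assuming you already know which tags to inspect:

```python
import webbrowser

BASE_URL = "http://quotes.toscrape.com/tag/"
tags_to_check = ["change", "deep-thoughts", "thinking", "world"]

for tag in tags_to_check:
    url = f"{BASE_URL}{tag}/"
    print(url)
    webbrowser.open_new_tab(url)  # opens each tag page in the default browser
```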
4. Difference check
Whenever side-by-side comparison is helpful, diff tools like WinMerge can be used. For example, to verify all category links from books.toscrape.com were scraped:
Normalize both lists (website categories and scraped links) by converting to lowercase, sorting alphabetically, and extracting only the category name. Then compare them in WinMerge – identical lists confirm complete extraction.
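The normalization step itself is easy to script before the two files ever reach WinMerge, and a quick set comparison already highlights what is missing. A minimal sketch, assuming both lists are plain text files with one category per line (the file names are placeholders):

```python
def normalize(path):
    """Lowercase and strip each line; extracting the bare category name
    from a full link will depend on the link format you exported."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

site = normalize("site_categories.txt")        # placeholder file names
scraped = normalize("scraped_categories.txt")

print("Missing from scraped data:", sorted(site - scraped))
print("Unexpected in scraped data:", sorted(scraped - site))
```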
5. Structured data testing tool
Some scraping methods rely on internal structures like microdata or JSON-LD. To validate these, use Google’s Structured Data Testing Tool and compare its output with your scraped data.
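If you want to pull the same structured data programmatically rather than pasting URLs into the testing tool, JSON-LD blocks can be read straight from the page. A sketch using requests and BeautifulSoup, assuming the page you are validating actually exposes JSON-LD (libraries such as extruct can also handle microdata):

```python
import json

import requests
from bs4 import BeautifulSoup

# Example URL; swap in the page whose structured data you are validating
html = requests.get("http://quotes.toscrape.com/").text
soup = BeautifulSoup(html, "html.parser")

# JSON-LD is embedded in <script type="application/ld+json"> blocks
for script in soup.find_all("script", type="application/ld+json"):
    data = json.loads(script.string)
    print(json.dumps(data, indent=2))  # compare this against your scraped items
```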
6. SQL
SQL is powerful for spotting anomalies in larger datasets. For example, with scraped book data, you can:
- Check min/max prices for invalid values
- Aggregate fields to detect unexpected patterns (e.g., a constant “_type” value indicating a scraping mistake); see the sketch after this list
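As a minimal sketch of both checks, assuming the scraped book data has already been loaded into a SQLite database with a books table containing price and _type columns (all of these names are placeholders for your own schema):

```python
import sqlite3

# "books.db", the "books" table and its columns are assumptions about your setup
conn = sqlite3.connect("books.db")

# 1. Check min/max prices for invalid values (negative, zero, absurdly high, ...)
min_price, max_price = conn.execute(
    "SELECT MIN(price), MAX(price) FROM books"
).fetchone()
print(f"Price range: {min_price} to {max_price}")

# 2. Aggregate a field to spot unexpected patterns, e.g. a single constant value
for value, count in conn.execute("SELECT _type, COUNT(*) FROM books GROUP BY _type"):
    print(value, count)

conn.close()
```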
Conclusions
In this post, we showcased multiple semi-automated techniques which, combined with the approaches shown in the previous posts of the series, will hopefully spark creative ideas for your data quality assurance process and help you test your data better.
Want to learn more about web data QA? Read the Practical guide to web data QA part V: Broad crawls
Learn more about Enterprise Web Scraping
Web scraping can look deceptively easy when you're starting out. There are numerous open-source libraries/frameworks and data extraction tools that make it very easy to scrape data from a website. But in reality, it can be very hard to extract web data at scale. Read our whitepaper and learn how to build a scalable web scraping infrastructure for your business or project.
