During a broad crawl, you might be extracting data from thousands or tens of thousands of websites with different layouts. When you scrape this many websites using a single spider, analyzing and validating the extracted data can be challenging. One important question to answer is what criteria should be used to determine the overall quality of the dataset.
In the following article, which is the final part of our QA series, we’re going to work with 20,000 different sites and go through all steps with detailed explanations about the process. In this example, we are interested in finding specific keywords on the sites - if the keyword is found on a page then it should be flagged for further analysis.
A large subset of heterogeneous sites should be processed to collect information for a specific keyword. We are going to crawl recursively starting from the domain and check all found links on the landing page. Several types of limits can be introduced here, like:
The goal is to get as many pages as possible and verify the existence of keywords in them.
The first step is inspired by two popular topics in programming:
Simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner.
The idea is to identify good representations of 1-5% of the dataset and work with it. So for 20,000 sites, we can start with 200 sites that represent best the dataset. A random selection might work as well when it’s not possible to pick by other methods.
At this point, these should be clear:
Let's assume that our sites are named like this:
We are going to use a semi-automated approach to verify the first n sites and the results extracted from the spider:
In this step, a summary of the first n spiders plus the semi-automated checks can be represented in tabular form:
In columns with the yellow header, you can see the extraction from the spiders, while in the blue columns it’s the result from the semi-automation. If you want to learn more about semi-automation then you can read the previous blog post from our QA series.
Pivot tables are handy to group results and show meaningful data for conclusions about coverage. Example for status:
The evaluation criteria for the first subset is the number of:
This is the high-level evaluation of the first run. In the next step, we will check how to identify potential problems.
The remaining parts of the data validations:
The combination of the previous two results will be the criteria to decide if this first run is successful or not. If there are any problems, a new full run of the subset might be required (after fixes and corrections) or only on the sites which have problems.
Unexpected problems are common for broad crawls and one of the ways to deal with them is by using deductive reasoning and analysis.
Let say that for site_1 we know:
Then we can ask several questions until we found an answer:
The diagram below will help whether to report a problem or mark the site as successfully crawled:
We divided the big problem into smaller tasks and defined an algorithm for successful spiders. Now we can work with the whole dataset by using tools like Pandas, Google Sheets, and our intuition.
First, we recommend checking the initial source of URLs for:
This should be done to verify what is the expected number of processed sites - which should be something like this:
final = initial URLs - duplications - wrong URLs - empty
Checks can be done in several ways. For example, the check for duplicated data can be done in Google Sheets/Excel by using pivot tables but to do quality check we need to have in mind that data should be cleaned because:
Will not be shown as duplicated unless we remove the extra parts like:
This can be done by using:
Pandas can be used as well:
df['website'] = df['website'].str.replace('http://', '') .str.replace('https://', '') .str.replace('www.', '')
# remove last / and lowercase all records def remove_backslash(s): return s[:-1] if s.endswith('/') else s df_init['website'] = df_init['website'].apply(remove_backslash)
df_init['website'] = df_init['website'].str.lower()
It’s always a good idea to compare apples with apples!
Once data for the execution is collected n in the form of:
We can divide the jobs into several groups:
Each of those groups will be analyzed for different validations. We can start with the jobs without items and investigate if:
Next, let’s focus on the errors. In this case, the best is to collect logs for all jobs and process them to find top errors and after that analyze top errors one by one. Below we can find the top errors for this execution:
Each error should be analyzed and reported.
How to use Pandas to analyze the logs. First, we need to clean them by:
df['message'] = df['message'].str.split(']' , expand=True) df['message_no_digits'] = df['message'].str.replace('d+', '') df['message_short'] = df['message_no_digits'] .str.slice(0, 50)
So finally we will get from
Next we can extract URLs from the logs:
df['url'] = df['message'].str.extract(pat = '(https:.*)')
For the group with the reached limit jobs, we need to find out why this is the case. Possible reasons:
All jobs in this group should have an explanation and be categorized in one of the groups above.
Let’s do the URL inspection. In this case, we can use a small and simple library like https://pypi.org/project/url-summary/
And get results like:
Or explore URLs by Pandas and get a summary as:
Or do site per site analysis, counting the first level after domain:
Below you can find part of the code used for this analysis:
Pandas can help too:
df['website'].str.split('/', expand=True) .str.split('?', expand=True).value_counts().head(-1)
The code above will extract everything from the 3rd / to next? and will return count for each group. Example:
Will extract - test. The output will be:
This can be done per site with slight modifications(if the frames is a list of dataframes for each website):
for df in frames: print(df.iloc['website']) print('-' * 80) print(df['website'].str.split('/', expand=True) .str.split('?', expand=True).value_counts().head(-1))
If data is in a single dataframe then you can use domain information if it’s exists or create it by:
df['domain'] = df['website'].str.split('/', expand=True)
then you can split the DataFrame into several by:
frames = [pd.DataFrame(y) for x, y in df.groupby('domain', as_index=False)]
Or process the information by groups. The final expected result is:
Based on the analysis done in the previous step we can start with our evaluation. The evaluation is based on the groups above after the applied fixes and correction and re-execution of the run.
The success of the run will depend on different factors and the importance of data requirements. Pivot tables and diagrams will help:
To justify the progress between the different runs and the coverage in the final one. Pandas can be used to draw plots for hundreds and thousands of rows. Drawing can be one per group or for problematic sites and factors.
Below is shown the code for plotting two plots next to each other:
import matplotlib.pyplot as plt sites = df.url.unique() for n, i in enumerate(range(0, len(sites), 2)): fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(14, 6)) plt.xticks(rotation='vertical') ax1, ax2 = axes temp_df = df.copy() temp_df = df[df['url'] == sites[i]] sites_df = temp_df[['errors', 'items', 'key']].sort_values(by='key', ascending=True)[ ['errors', 'items']].fillna(0) sites_df.plot(ax=ax1, title='Plot for site ' + str(sites[i])) fig.autofmt_xdate() temp_df = df.copy() temp_df = df[df['url'] == sites[i + 1]] sites_df = temp_df[['errors', 'items', 'key']].sort_values(by='key', ascending=True)[ ['errors', 'items']].fillna(0) sites_df.plot(ax=ax2, title='Plot for site ' + str(sites[i + 1])) fig.autofmt_xdate() tempDf = df.copy() tempDf = df[df['url'] == sites[i + 2]] country_df = tempDf[['errors', 'vcard', 'items', 'key']].sort_values(by='key', ascending=True)[ ['errors', 'vcard', 'items']].fillna(0) country_df.plot(ax=ax3, title='Plot for site ' + str(sites[i + 2])) fig.autofmt_xdate() except IndexError: print('out of index')
The final report will have two parts. Detailed table for all jobs from the final run:
And tab with pivot tables:
Reporting should include the most important information about the problems and should be easy to be generated.
Creating a report like the one above, when all the steps are clear, takes a few hours and a big part of the time is spent on data extraction.
In this article, we demonstrated how to evaluate data coming from a large number of different websites. This can be a good starting point to validate heterogeneous data and how to report the results in understandable ways for a wider audience.