The internet is full of useful information. At the same time, it’s full of noise that can harm data analysis. For example, if you load this Wikipedia page, you’ll see structured content that you can easily scan for the information you need on that topic. Computers, however, see things differently. To lay out the content in a readable way, the browser interprets the underlying HTML source and renders it so that you see that structure. If you look at the source of that page (right-click -> view source), you’ll see what the browser interprets to render the content.
Everything in that source helps the browser present the content in a pleasant way. However, most of it is irrelevant if you just want to perform some data analysis. Say that you want to find the most common words in an article’s headers. You don’t need to store all the Wikipedia data for that. You do still need to download the pages, but you can parse them, extract only the relevant information (the section headers), and store only that piece of data. This makes the analysis a lot easier: you only need to deal with strings, not with HTML and where the headers sit inside it. This strategy can be seen as noise removal, since it reduces the data to only what is important to you, and parsing plays an important role in it.
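As a rough illustration of the header example, the sketch below uses Python’s standard-library `html.parser` to keep only header text and count words across pages. The class name, the helper functions, and the two sample pages are made up for this example; a real pipeline would first download the pages.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class HeaderTextParser(HTMLParser):
    """Collects the text of h1-h6 elements and ignores everything else."""

    HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headers = []      # finished header strings
        self._open_tag = None  # header tag currently being read, if any
        self._chunks = []      # text pieces of the current header

    def handle_starttag(self, tag, attrs):
        if self._open_tag is None and tag in self.HEADER_TAGS:
            self._open_tag = tag
            self._chunks = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self.headers.append("".join(self._chunks).strip())
            self._open_tag = None


def extract_headers(html):
    """Return only the header strings of one page (the noise is dropped)."""
    parser = HeaderTextParser()
    parser.feed(html)
    parser.close()
    return parser.headers


def most_common_header_words(pages, top_n=3):
    """Count lowercase words over the headers of all pages."""
    counts = Counter()
    for html in pages:
        for header in extract_headers(html):
            counts.update(re.findall(r"[a-z]+", header.lower()))
    return counts.most_common(top_n)


# Invented sample pages standing in for downloaded HTML.
SAMPLE_PAGES = [
    "<html><body><h1>Web scraping</h1><p>Long text...</p>"
    "<h2>History</h2><img src='a.png'><h2>Legal issues</h2></body></html>",
    "<html><body><h1>Data parsing</h1><h2>History</h2>"
    "<h2>Web formats</h2></body></html>",
]

if __name__ == "__main__":
    print(extract_headers(SAMPLE_PAGES[0]))
    print(most_common_header_words(SAMPLE_PAGES, top_n=2))
```

Only the short header strings need to be stored; paragraphs, images, and the rest of the markup never leave the parser.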
From Wikipedia, “parsing is the process of analyzing a string of symbols conforming to the rules of a formal grammar”. Here we use a looser interpretation: we extract only the relevant information from a web page (a string of symbols) and leave out what is irrelevant to our goal. This serves our goal of reducing noise and putting the data into some sort of structure. Of course, we could work with raw HTML pages directly, but usually we’re only interested in a subset of the content that is available on the page.
For example, if you want to perform price analysis on products, you may consider the images irrelevant and leave them out, reducing storage and processing costs by focusing only on what you really need.
So the first step in data parsing is to decide what information you need from a page. Once that is clear, we can move on to building a parser. Let’s assume we want to extract the main header (the title) of a Wikipedia page. You can achieve that by:
This is a simple parsing rule to find the title of any Wikipedia page. document.querySelector is a function built into browsers that finds an HTML element in the page based on a CSS selector query (another common approach is to use XPath). We give it a selector that matches the page title, and then ask for the text content of that tag with .innerText.
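Outside the browser, the same kind of rule can be approximated in a script. The sketch below is not the browser API: it is a small stand-in built on Python’s standard-library `html.parser` that returns the text of the first element with a given tag name, roughly what a tag-only selector followed by a text lookup would give. The class, the function, and the sample markup are invented for illustration and do not claim to match Wikipedia’s real page structure.

```python
from html.parser import HTMLParser


class FirstElementText(HTMLParser):
    """Captures the text of the first element with the given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.text = None      # stays None if no element matches
        self._inside = False  # True while reading the matched element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Only the first match counts, like a single-element selector query.
        if tag == self.tag and self.text is None and not self._inside:
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._inside:
            self.text = "".join(self._chunks).strip()
            self._inside = False


def first_text(html, tag="h1"):
    """Return the text of the first <tag> element, or None if absent."""
    parser = FirstElementText(tag)
    parser.feed(html)
    parser.close()
    return parser.text


# Invented sample markup, not copied from a real page.
PAGE = """<html><head><title>Parsing - Example</title></head>
<body><h1><span>Parsing</span></h1>
<h2>Human languages</h2><h1>Second</h1></body></html>"""

if __name__ == "__main__":
    print(first_text(PAGE))         # text of the first h1
    print(first_text(PAGE, "h2"))   # text of the first h2
```

Full CSS selector or XPath support needs a dedicated parsing library; this sketch only matches on the tag name.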
The example in the previous section is a common case of parsing data from a web page. Since HTML doesn’t follow a standard structure across sites, we need to repeat this inspection for every parser we build. As you can imagine, this quickly becomes impractical once we want to scale. To overcome this, semantic markers have been created to tell parsers where the relevant information on a page is. One early attempt was HTML Microdata, a specification for adding metadata to the tags that contain information useful to computers. This way, pages can mark the content that we can parse automatically. Other approaches include RDFa, JSON-LD, and Facebook’s Open Graph.
To give you an idea, when you share a URL on Facebook and it shows a thumbnail and some other information about the content, that information probably comes from some of these markers on the page. However, not all the data you need may be present in these markers, and sometimes developers decide not to include them at all. Then we are back to our common case of writing parsing rules, or to building new technologies to parse the data.
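To make the idea of semantic markers concrete, here is a minimal sketch that reads two of them from a page: Open Graph `<meta property="og:...">` tags and JSON-LD `<script type="application/ld+json">` blocks. It relies only on Python’s standard library; the class name, the function, and the sample product page are assumptions made for this example.

```python
import json
from html.parser import HTMLParser


class SemanticMarkupParser(HTMLParser):
    """Collects Open Graph meta tags and JSON-LD blocks from a page."""

    def __init__(self):
        super().__init__()
        self.open_graph = {}     # e.g. {"og:title": "..."}
        self.json_ld = []        # decoded JSON-LD objects
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content") is not None:
                self.open_graph[prop] = attrs["content"]
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed block: skip it rather than fail the page


def extract_semantic_data(html):
    """Return (open_graph_dict, json_ld_list) for one HTML document."""
    parser = SemanticMarkupParser()
    parser.feed(html)
    parser.close()
    return parser.open_graph, parser.json_ld


# Invented sample page with both kinds of markers.
PAGE = """<html><head>
<meta property="og:title" content="Blue Widget">
<meta property="og:image" content="https://example.com/widget.png">
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Blue Widget",
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}}
</script>
</head><body><h1>Blue Widget</h1></body></html>"""

if __name__ == "__main__":
    og, ld = extract_semantic_data(PAGE)
    print(og)
    print(ld[0]["offers"]["price"])
```

When a page carries these markers, the same code works across sites without any site-specific selector. When it doesn’t, both results simply come back empty.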
As we can’t rely on every website to provide semantic data, one solution is to build tools that perform automatic data extraction.
So how do we, at Zyte, make crawling work for any website? Zyte Automatic Extraction API leverages artificial intelligence techniques to automatically extract the relevant information on a page. For example, when doing a category crawl for products, we start on the first page of the product list, and our job is to detect the links to individual products as well as the link to the next page. We do that by issuing a request to our Product List Extraction API, which returns a list of products. Among other fields, each product has a “URL” field pointing to its detailed product page. The URL of the next page is also present in the Product List API response, as part of pagination support. Once we have the URLs of the individual products, we pass them to the Product Extraction API, collect detailed information about each product, and then proceed to the next page. This is just one of the crawling strategies we have, but they are all generic and rely on the Extraction API to adapt to individual websites.
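The category crawl described above can be sketched as a simple loop. This is not the Zyte client or its real response format: the two extraction steps are passed in as plain callables, and the field names `products`, `url`, and `next_page_url`, along with the in-memory fake site, are hypothetical placeholders chosen for the example.

```python
def crawl_category(start_url, extract_product_list, extract_product,
                   max_pages=100):
    """Walk a paginated product list and collect product details.

    extract_product_list(url) -> {"products": [{"url": ...}, ...],
                                  "next_page_url": str or None}
    extract_product(url)      -> dict with the product details
    Both callables and their field names are hypothetical stand-ins
    for whatever extraction service is actually used.
    """
    products = []
    seen_pages = set()
    page_url = start_url
    while page_url and page_url not in seen_pages and len(seen_pages) < max_pages:
        seen_pages.add(page_url)
        listing = extract_product_list(page_url)
        for item in listing["products"]:
            products.append(extract_product(item["url"]))
        page_url = listing.get("next_page_url")  # None ends the crawl
    return products


# Invented in-memory "website" so the sketch runs without network access.
FAKE_LISTINGS = {
    "https://shop.example/cat?page=1": {
        "products": [{"url": "https://shop.example/p/1"},
                     {"url": "https://shop.example/p/2"}],
        "next_page_url": "https://shop.example/cat?page=2",
    },
    "https://shop.example/cat?page=2": {
        "products": [{"url": "https://shop.example/p/3"}],
        "next_page_url": None,
    },
}
FAKE_PRODUCTS = {
    "https://shop.example/p/1": {"name": "Widget", "price": "9.99"},
    "https://shop.example/p/2": {"name": "Gadget", "price": "19.99"},
    "https://shop.example/p/3": {"name": "Gizmo", "price": "4.50"},
}

if __name__ == "__main__":
    result = crawl_category(
        "https://shop.example/cat?page=1",
        FAKE_LISTINGS.__getitem__,
        FAKE_PRODUCTS.__getitem__,
    )
    for product in result:
        print(product["name"], product["price"])
```

Because the loop only depends on the shape of the two responses, the same code can serve any site for which the extraction step returns product URLs and a next-page URL.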
Another common case is extraction of all articles from a specific news source. In that case, we follow all the links, and rely on the News and Article Extraction API to discover which pages are articles, and deliver only those.
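The article case can be sketched in the same spirit: follow every link breadth-first and keep only the pages that the extraction step recognizes as articles. As before, `get_links` and `extract_article` are hypothetical callables standing in for link discovery and for an article extraction service, and the tiny site graph is invented.

```python
from collections import deque


def crawl_articles(start_url, get_links, extract_article, max_pages=1000):
    """Follow all links from start_url and return only the articles.

    get_links(url)       -> iterable of URLs linked from that page
    extract_article(url) -> dict for article pages, None for anything else
    Both callables are hypothetical placeholders.
    """
    queue = deque([start_url])
    seen = {start_url}
    articles = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        article = extract_article(url)
        if article is not None:  # deliver only pages detected as articles
            articles.append(article)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return articles


# Invented site: a home page, a section page, and two articles.
FAKE_LINKS = {
    "/": ["/world", "/story-1"],
    "/world": ["/story-2", "/"],
    "/story-1": ["/"],
    "/story-2": ["/story-1"],
}
FAKE_ARTICLES = {
    "/story-1": {"headline": "First story"},
    "/story-2": {"headline": "Second story"},
}

if __name__ == "__main__":
    found = crawl_articles("/", FAKE_LINKS.__getitem__, FAKE_ARTICLES.get)
    print([a["headline"] for a in found])
```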