Article data extraction | How to Maximize Quality

Maximize the quality of news and article data extraction

Posted on

November 4, 2022

Obtaining access to high-quality and reliable content through news and article scraping is essential to keep up with today’s quickly evolving marketplace.

Konstantin Lopukhin

Industry trends and consumer behavior are constantly changing. These are key to driving impactful decisions that can make or break your business.

However, at the speed and volume at which news articles are published on the net, scraping data from them can feel like a daunting task.

You’ll need to get data fast — and it must also be accurate and of high quality.

Organizations that fail to grasp the complexities of data management end up wasting time and resources without deriving useful insights from the data they gather.

So, how can you maximize data quality when performing article extraction? What are the stumbling blocks, and what are the solutions so you can overcome these challenges?

This article will expand your knowledge on article data extraction, why it’s important, and the tools and procedures involved.

We’ll also dive into different tools used for news and article extraction — including Zyte’s Automatic Data Extraction API — with the advantages and disadvantages of each, as well as how these can be used to extract news and article data at scale.

What is article data extraction?

Article data extraction involves extracting data fields from an article page, and converting them into a structured, machine-readable format such as JSON.

Most of the time, the targeted site is a news page, but it can also cover other formats.

Extraction typically targets a set of standard fields.

Properties and attributes:

headline: the title of the article
articleBody, articleBodyHtml: the text and HTML of the article body
authors: authors of the article
datePublished, dateModified: date and time of publication and latest edits
images, mainImage: all images and the main image of the article

There can also be other minor attributes, such as language, breadcrumbs, description, and audio or video URLs.
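
As a rough illustration of what "structured, machine-readable" output could look like, here is a minimal Python sketch that models the attributes listed above and serializes them to JSON. The class name `Article`, the helper `to_json`, the field types, and the sample values are assumptions made for this example; they are not the exact schema of any particular extraction service.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Article:
    """Hypothetical container for the article attributes listed above."""
    headline: str
    articleBody: str
    articleBodyHtml: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    datePublished: Optional[str] = None  # assumed to be an ISO 8601 string
    dateModified: Optional[str] = None   # assumed to be an ISO 8601 string
    images: List[str] = field(default_factory=list)
    mainImage: Optional[str] = None


def to_json(article: Article) -> str:
    """Serialize an Article record to a JSON string."""
    return json.dumps(asdict(article), ensure_ascii=False, indent=2)


# Made-up sample values, for illustration only.
example = Article(
    headline="Example headline",
    articleBody="First paragraph.\n\nSecond paragraph.",
    authors=["Jane Doe"],
    datePublished="2022-11-04T09:00:00Z",
    images=["https://example.com/lead.jpg"],
    mainImage="https://example.com/lead.jpg",
)
print(to_json(example))
```

Fields that could not be extracted stay `None` (or an empty list) instead of being dropped, so downstream code can rely on a stable set of keys.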

Why is quality important for news and article data extraction?

Demand for structured data has skyrocketed in recent years, as more information is disseminated through the internet.

This data is important as it can be applied to a variety of use cases, such as market research, analytics, brand monitoring, competitive intelligence, customer personalization, and many more.

As such, news data has the potential to be a gold mine — and the ability to leverage it in the right manner will provide businesses with a solid advantage.

Quality article data extraction enables businesses to:

Make smart decisions with information that is backed by data
Pivot quickly with data that is close to real-time
Have an edge over competitors who do not have the same knowledge

If you’re looking to tap into article data extraction as a resource to give your business a boost and drive growth, it is crucial to have a solution that delivers high-quality news and article data extraction.

Challenges to extract high-quality data from articles

Most of the important attributes of a news article are at the top, such as the title, date published, author, and the main image, followed by text.

However, the page also contains unrelated content such as “most popular” and “editors’ picks” blocks. These provide a good user experience but are of little use for data extraction, and they add complexity to the process.

Here is an example of what a news article page might look like, with its attributes:

*Generic Sample to explain article data extraction*

The article body is usually the meat of the article extraction process, and it is also the most difficult to get right. In many cases, the body contains several kinds of content, some of which should not appear in the output if the goal is good quality.

Consider the example on the left. There are block quotes within the body that look different from the rest of the article but are part of it.

Meanwhile, in the example on the right, there is a block that appears to be part of the article — but is unrelated and is simply there to keep readers on the platform.

Keeping these blocks can affect the quality of your article extraction.

If you have a downstream application that is performing sentiment analysis, having an unrelated link with text in the middle of the article can throw your systems off.

Therefore, the benchmark for quality would be to obtain all the desired content and exclude undesired blocks.

News & article data extraction tools - a complete analysis

Because article extraction has a variety of use cases, the ability to glean quality news data will be relevant to many fields. It is important to find the right tools to suit your project’s needs or your organization’s goals.

An important criterion for any article data extraction tool is that it should work for most websites, without having to write site-specific code. Writing custom rule-based extraction code requires a lot of maintenance, especially if you’re extracting data from thousands of domains.

In general, there are two types of article extraction solutions: open-source libraries (free) and commercial tools (paid).

Here are some examples:

Open-source libraries: Readability, Newspaper3k, Dragnet, Boilerpipe, Html-text
Commercial tools: Zyte Automatic Data Extraction API, Diffbot

Open-source libraries

To use an open-source library for article extraction, you first need to download the HTML of the article page yourself.

The library then parses the HTML to find the elements that correspond to the article body, using heuristics, machine learning, or a combination of both. Finally, it extracts the text of the detected elements, typically returning the title and the body of the article as text.
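
To make the parse, detect, and extract steps concrete, here is a deliberately simple sketch using only the Python standard library. It groups `<p>` text by the nearest enclosing container element and keeps the group with the most characters. This toy heuristic is an assumption chosen for illustration; it is not how Readability, Newspaper3k, or the other libraries named here work internally. The names `ParagraphGrouper`, `extract_article`, and `SAMPLE_HTML` are hypothetical, and the HTML is assumed to be already downloaded.

```python
from html.parser import HTMLParser

# Elements treated as candidate containers for the article body (an assumption).
CONTAINERS = {"article", "div", "section", "main", "body"}


class ParagraphGrouper(HTMLParser):
    """Group <p> text by the nearest enclosing container element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.groups = {}     # container id -> list of paragraph strings
        self._stack = [0]    # ids of open containers; 0 is the document root
        self._next_id = 1
        self._in_title = False
        self._in_p = False
        self._buf = []

    def flush_paragraph(self):
        """Store the current paragraph (if any) under the innermost container."""
        if self._in_p:
            text = " ".join("".join(self._buf).split())
            if text:
                self.groups.setdefault(self._stack[-1], []).append(text)
        self._in_p, self._buf = False, []

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.flush_paragraph()
            self._stack.append(self._next_id)
            self._next_id += 1
        elif tag == "p":
            self.flush_paragraph()  # tolerate a missing </p>
            self._in_p = True
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in CONTAINERS:
            self.flush_paragraph()
            if len(self._stack) > 1:
                self._stack.pop()
        elif tag == "p":
            self.flush_paragraph()
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)
        elif self._in_title:
            self.title += data


def extract_article(html: str) -> dict:
    """Return the page title and the text of the densest paragraph group."""
    parser = ParagraphGrouper()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()
    best = max(parser.groups.values(),
               key=lambda ps: sum(len(p) for p in ps), default=[])
    return {"headline": parser.title.strip(), "articleBody": "\n\n".join(best)}


SAMPLE_HTML = """
<html><head><title>Example headline</title></head><body>
<div class="nav"><p>Home</p><p>Most popular</p></div>
<article>
  <p>The first paragraph of the story carries most of the text on the page.</p>
  <p>The second paragraph continues the story with further detail.</p>
</article>
<div class="related"><p>Editors' picks: another story</p></div>
</body></html>
"""

result = extract_article(SAMPLE_HTML)
print(result["headline"])
print(result["articleBody"])
```

On this sample the navigation and "related" blocks are dropped because they hold less paragraph text than the `<article>` element. Real pages are far messier, which is why production libraries rely on richer heuristics or trained models.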

Open-source libraries can be useful, especially if your goal is simply to get all the content without missing anything, but their quality falls short of commercial solutions.

Advantages:
• Cheap to run
• Flexible in integration
• You only need HTML

Disadvantages:
• Lower quality of extraction
• Fewer attributes supported
• Manual content downloading

Commercial tools

A commercial solution such as Zyte’s Automatic Data Extraction API approaches article data extraction differently.

The page is rendered in a headless browser, which gives a richer representation of the page: not just text, but also screenshots, HTML, and CSS properties.

This is closer to how a human reads the page. These modalities are passed into a single neural network, which leads to very high-quality extraction and more supported attributes.

Advantages:
• High quality extraction
• More supported attributes
• Downloading is handled by service
• Includes anti-ban protection
• Intuitive to use

Disadvantages:
• Paid service

Quality test: open-source vs commercial

Using three metrics — precision, recall, and F1 — we performed an evaluation to compare the quality of data from article extraction, with multiple open-source and commercial solutions.
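
For readers less familiar with these metrics, the sketch below shows one simple way they can be computed for extracted text, assuming a bag-of-tokens comparison between the extracted body and a hand-labelled reference body. This token-overlap definition and the names `tokenize` and `extraction_scores` are assumptions for illustration; the exact procedure used in our evaluation is described in the whitepaper, not here.

```python
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def extraction_scores(extracted: str, reference: str) -> dict:
    """Token-level precision, recall, and F1 of extracted text vs. a reference."""
    ext = Counter(tokenize(extracted))
    ref = Counter(tokenize(reference))
    overlap = sum((ext & ref).values())  # tokens present in both, with counts
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the quick brown fox"

# An extractor that returns the whole page keeps every desired token
# but also drags in navigation and promo text.
whole_page = extraction_scores(
    "home the quick brown fox most popular subscribe", reference)

# An extractor that is too strict returns only desired tokens but misses some.
partial = extraction_scores("the quick brown", reference)

print(whole_page)
print(partial)
```

Under this definition, `1 - precision` is the share of undesired content in the output and `1 - recall` is the share of desired content that was missed. Here the whole-page extractor reaches a recall of 1.0 with a precision of only 0.5, while the strict extractor reaches a precision of 1.0 with a recall of 0.75.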

The goal was to collect a representative and unbiased dataset, taking 181 pages from a diverse set of URLs.

These pages contained news articles from popular domains, as well as less popular and less typical pages, such as blogs and non-news articles.

Results

In terms of the ratio of undesired content, Auto Extract (Zyte’s Automatic Data Extraction API) performed significantly better than both the best competing commercial service we tested and the open-source libraries.

The gap in the ratio of missed content was smaller, but Auto Extract again performed best among the solutions tested.

The table below summarizes F1, precision and recall for the different open-source and commercial solutions.

Auto Extract got the best results.

Meanwhile, Html-text, a library that extracts text from the whole web page regardless of whether it is related to the article, had poor precision and poor F1 but excellent recall.

If you’re happy with achieving perfect recall at the cost of precision, this open-source library might be something worth looking into.

Results summary:

Auto Extract vs. Readability (open source): 5x less undesired content, 1.5x less missed content
Auto Extract vs. Diffbot: 2.5x less undesired content, 1.3x less missed content

If you’d like a deep dive into our methodology, how the dataset was collected, and how we achieved these results, download our free whitepaper.

Conclusion

As the importance of data continues to grow, the quality of news and article data extraction will play an increasingly significant role in the decision-making process of many businesses.

While open-source libraries offer a cost-effective solution, data quality might not be up to par — especially when you’re doing article extraction at scale.

Commercial solutions might seem like a costlier alternative, but what you’re paying for is essentially better-quality data.

Using a comprehensive article extraction solution such as Zyte’s Automatic Data Extraction API helps ensure that you consistently get fast, accurate, and reliable results.

Verify the data quality on the pages you want to test, and you can see for yourself how it all works.

Try Zyte’s Automatic Data Extraction API for free.
