CASE STUDY

Debunk EU scrapes millions of news articles with Zyte

Checking authenticity of Baltic region news stories

DebunkEU.org uses Zyte API with its Extraction service to monitor and expose disinformation campaigns spread across media outlets in the Baltic region and further afield. To achieve this, Debunk EU currently scrapes news websites worldwide in over 40 languages, including Russian, Chinese, Iranian, Arabic, German, French, Ukrainian, Georgian, Balkan and Baltic languages. With the help of our easy-to-use Automatic Extraction API, plus friendly technical support from the Zyte team, Debunk EU scrapes around 1.5 million news articles every month from thousands of news sources.

We’re really happy with the quality of Zyte’s Automatic Extraction. We are also very satisfied by the level of technical support we get. Without Zyte we simply wouldn’t be able to do what we do.

CTO at DebunkEU.org

About

DebunkEU.org is an independently-funded think tank and non-governmental organization that tracks disinformation and misinformation campaigns across media outlets in Baltic countries and Poland, as well as in the United States and North Macedonia.

Its team of over 50 analysts and active volunteers conducts detailed fact-checking and research into disinformation concerns in the Baltic countries and Poland. The think-tank reports on topics including misinformation about COVID-19 and vaccines, political turmoil in Belarus and Russia, and attempts to target NATO activities.

Debunk EU publishes over 100 reports per year and runs a programme of educational media literacy campaigns. It also works closely with national institutions in partner countries, which provide further valuable insights on the situation in the Baltics.

Challenges

Debunk EU aims to counter disinformation and information campaigns, with the goal of providing insights into complex issues in a concise, understandable and informative way.


In 2017, Debunk EU started exploring options for collecting news articles from various sources. “At that time all the commercial options were really expensive, so we developed our own extraction solution based on Scrapy,” explains Debunk EU CTO Girius Merkys. “It was OK, but we had something like 200 domains to monitor and it required a lot of maintenance.”

As time passed, Debunk EU faced the growing challenge of monitoring more and more domains. “Some small countries that we’re interested in might have over a thousand news outlets,” states Girius. “In the disinformation space it’s common to see lots of simple WordPress-based websites controlled by one entity, all running the same story to give the impression that ‘it must be true’.”
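The pattern Girius describes, one story repeated across many sites, can be illustrated with a small sketch. This is not Debunk EU's method. It is a minimal example that assumes each article is a dictionary with hypothetical `url` and `body` keys, and it only catches copies whose wording is identical once case and punctuation are ignored. Lightly edited copies would need a near-duplicate technique instead.

```python
import hashlib
import re
from collections import defaultdict
from urllib.parse import urlparse


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and dropping punctuation and extra spacing."""
    words = re.findall(r"\w+", text.lower())
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


def repeated_stories(articles, min_domains=2):
    """Return sorted domain lists for stories that appear on several domains.

    `articles` is an iterable of dicts with hypothetical "url" and "body" keys.
    """
    domains_by_story = defaultdict(set)
    for article in articles:
        domain = urlparse(article["url"]).netloc
        domains_by_story[fingerprint(article["body"])].add(domain)
    return [
        sorted(domains)
        for domains in domains_by_story.values()
        if len(domains) >= min_domains
    ]
```

A group of many domains sharing one fingerprint is only a hint that the sites may be coordinated; an analyst would still need to check who runs them.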

Girius also notes that the process of debunking false or misleading content online can be both costly and time consuming. “It’s difficult to fact-check a piece of information if you do not know where to start. What’s more, debunking disinformation costs way more than creating it.”

In parallel with the constantly increasing number of media outlets to monitor, Girius observes that the process of extracting online news articles efficiently is becoming steadily more resource-intensive: “To analyze so much data is quite a challenge. Page designs are also changing more and more frequently, and JavaScript-based sites are becoming more popular. It’s very difficult to scrape that kind of content – sometimes it’s impossible.”

Solution

To deal with the rapidly growing scale and complexity of extracting millions of news articles, Debunk EU approached Zyte for a cost-effective and easy-to-use automated article extraction solution that would minimize development overheads for the busy Debunk EU team.

With the help of Zyte API, Debunk EU is able to track the evolution of disinformation campaigns by monitoring over 1.5 million online articles every month.

“As we’ve scaled up we didn’t want the hassle of having to keep maintaining Scrapy” says Girius. “Also, because we are a non-commercial NGO we needed an affordable solution – and that’s something Zyte has been able to offer us, plus technical assistance because of the sheer volume of requests we have every month.”

As well as the quality and reliability of article extraction, Girius also welcomes the efficient support offered by the Zyte team: “We’re very happy with the help we get. Without it, we wouldn’t be able to do our work and publish more than 100 reports every year. I really like the article list service. It really just makes everything much easier for us. We just give the link of the domain, then we get the article list and we just scrape it with your API. It’s automatic and it’s really convenient.”
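The two-step workflow Girius describes (fetch the article list for a domain, then extract each article) might look roughly like the sketch below. It is not Debunk EU's actual code. The endpoint, the `articleList` and `article` request fields, and the response layout are assumptions based on Zyte API's public documentation and should be checked against the current reference. `scrape_domain` and `post_json` are hypothetical helper names.

```python
import base64
import json
import urllib.request
from typing import Callable

# Assumption: endpoint taken from Zyte API's public documentation.
ZYTE_ENDPOINT = "https://api.zyte.com/v1/extract"


def post_json(payload: dict, api_key: str) -> dict:
    """Send one extraction request, passing the API key as the Basic auth username."""
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    request = urllib.request.Request(
        ZYTE_ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def scrape_domain(listing_url: str, post: Callable[[dict], dict]) -> list[dict]:
    """Fetch the article list for one page, then extract each listed article."""
    # Assumed response layout: {"articleList": {"articles": [{"url": ...}, ...]}}
    listing = post({"url": listing_url, "articleList": True})
    items = listing.get("articleList", {}).get("articles", [])
    articles = []
    for item in items:
        url = item.get("url")
        if not url:
            continue  # skip list entries that carry no link
        # Assumed response layout: {"article": {...}}
        result = post({"url": url, "article": True})
        articles.append(result.get("article", {}))
    return articles
```

With a real key, the call would be `scrape_domain(url, lambda payload: post_json(payload, api_key))`. Passing the sender in as a function keeps the loop easy to test without network access.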

Results - web scraping at scale

Data extraction at scale: 1+ million articles per month
World-wide coverage: 42 languages covered
Global reach: 60,000 domains monitored

Summary

With help from Zyte API, Debunk EU is able to access millions of news articles every year, with the capacity to grow smoothly as it monitors a greater range of media outlets in more territories.
