Marie Moynihan
4 Mins
March 31, 2023

How to (safely) extract data from social media platforms and news sites

Data extraction from news sites and social media platforms is an increasingly common practice. Popular use cases range from making better-informed investment decisions to protecting brand reputation.

However, if your core business isn’t focused on news aggregation or analysis, it can be difficult to know how to scrape news articles and social posts effectively without breaking the law or unintentionally disrupting websites. While web scrapers can help you manage bans, they don’t remove your obligation to stay legally compliant.

To help you overcome the common dilemmas faced when developing a data feed, our team here at Zyte hosted a webinar on how to build news and social media data schemas successfully (and what to avoid!). 

With guest speakers including Sanaea Daruwalla (Chief Legal Officer) and Konstantin Lopukhin (Head of Data Science), the webinar includes helpful advice on improving data coverage, choosing the best data fields to scrape, and abiding by key regulatory considerations.

If you haven’t watched the webinar yet, here’s a breakdown of what to expect.

Disclaimer: The recommendations in this guide do not constitute legal advice. Our Chief Legal Officer is a lawyer, but she’s not your lawyer, so none of her opinions or recommendations in this guide constitute legal advice from her to you. The commentary and recommendations outlined below are based on Zyte’s experience helping our clients (from startups to Fortune 100 companies) maintain GDPR compliance whilst scraping 7 billion web pages per month. If you want assistance with your specific situation, you should consult a lawyer.

Which data fields should you extract for news schemas (aka article schemas) and why?

Extracting data from the right fields is crucial to ensure what you’ve collected is relevant and reliable. Zyte’s Head of Data Science (Konstantin Lopukhin) provides a detailed overview of which data fields you should include in your article schema. These include:

  • The URL - Great for identifying the source of the data and tracking changes over time.
  • Headline - Very beneficial for sentiment analysis, or to identify the topic of the article.
  • Author and publication date - Good for understanding when the article was published and avoiding old news.
  • Article body - This is the main and, arguably, the most important field in most use cases, but also the one with the strongest legal protections. Konstantin says: “This field provides you with the text of the article, and the quality of this field must be really high for you to trust the rest of the data feed.”
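As a rough illustration of the fields above, here is a minimal Python sketch of an article record. The class name, field names, the `is_stale` helper, and the seven-day cutoff are all assumptions made for this example; they are not Zyte's actual schema.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Article:
    """Hypothetical article record; field names are illustrative only."""

    url: str                                      # identifies the source, helps track changes
    headline: str                                 # input for topic or sentiment analysis
    article_body: str                             # main text; strongest legal protections
    author: Optional[str] = None                  # personal data, handle with care
    publication_date: Optional[datetime] = None   # used to filter out old news


def is_stale(article: Article, now: datetime, max_age_days: int = 7) -> bool:
    """Return True if the article has no date or is older than max_age_days."""
    if article.publication_date is None:
        return True
    return (now - article.publication_date).days > max_age_days


example = Article(
    url="https://example.com/news/1",
    headline="Example headline",
    article_body="Example body text.",
    author="Jane Doe",
    publication_date=datetime(2023, 3, 30),
)
```

Freezing the dataclass is one way to keep records immutable once extracted, and `asdict(example)` converts a record to a plain dictionary for storage or export.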

What are the legal implications of news data extraction?

Extracting data from news articles requires that businesses remain vigilant to ensure compliance. Zyte’s Chief Legal Officer, Sanaea Daruwalla, provided some expert insight on the legal factors to consider when developing a news data extraction schema.

One data field that many assume is legally risky, but that can be more innocuous, is the author’s name. While the author’s name is personal data, many jurisdictions, such as the EU, have an exemption for journalistic use. If you're not sure whether your use case falls under the exemption, you need to carry out a data protection impact assessment (DPIA).

The article body, however, is a field that you need to be much more careful with, due to copyright law. Sanaea says: “The article body is always going to hold copyright protection, so your use of it has to fall under an exception to copyright law. Use cases are important. If you want to gain an exception, you can't be republishing. If using the content for investment decisions or sentiment analysis, you’re generally safe.” If you want to scrape the full article body, you should check with your lawyer to see if your use case qualifies as an exception under the applicable copyright laws.

Sanaea also explains that, as a general rule, you should avoid scraping articles that require a subscription or login to access them. That’s because when you sign up for that particular service, you agree to abide by its terms and conditions, which usually include a ban on web scraping. “Once you accept the terms and conditions, you're bound by them,” said Sanaea.

What are the legal implications of social media data extraction?

You might think, ‘Well, social media is public, so it must be legal to scrape social content’, but this isn’t always the case.

While social media pages owned by businesses are generally safe, you need to be far more cautious when scraping content from individuals. Sanaea says: “You need to be really cautious when you're taking personal data. Even if it's out there in the public, GDPR still applies.”

There are some exceptions, such as anonymizing personal data. For example, Zyte can help you anonymize the user tag and remove all personal indicators while retaining data such as the number of followers, likes, and reposts. Anonymized data is not considered personal data under the GDPR.
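One possible way to approach this is to keep only an allowlist of non-personal metrics and drop everything else. The sketch below is a simplified illustration: the field names and the `strip_personal_fields` helper are hypothetical, it does not show how Zyte performs anonymization, and whether a given output counts as anonymized under the GDPR is a question for your lawyer.

```python
from typing import Any, Dict

# Assumed field names for illustration; real platform fields will differ.
RETAINED_FIELDS = ("follower_count", "like_count", "repost_count")


def strip_personal_fields(post: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only allowlisted engagement metrics and drop every other field."""
    return {key: post[key] for key in RETAINED_FIELDS if key in post}


raw_post = {
    "user_tag": "@example_user",      # personal indicator, dropped
    "display_name": "Example User",   # personal indicator, dropped
    "follower_count": 1200,
    "like_count": 45,
    "repost_count": 7,
}

anonymized = strip_personal_fields(raw_post)
```

An allowlist is used here rather than a blocklist so that any new or unexpected field is dropped by default instead of slipping through.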

As with news scraping, businesses should never scrape social media content behind a login. Doing so could be considered a breach of contract, and you would be putting yourself in extremely dangerous territory.

How to design and develop a news schema that works

For those looking to design and develop their own article schema, there are many considerations to weigh to ensure efficacy and compliance. Zyte has done extensive work in this field, and we’re happy to share our expertise to help you avoid common pitfalls. Here are a few of the key pointers mentioned in the webinar:

  • Consistency is everything. You need to be able to do your longitudinal analysis, especially when it comes to making investment decisions. The last thing you want to do is to change your schema in a few months' time.
  • Be cost-effective. Scraping news data can get expensive, so make sure you set expectations beforehand. 
  • Consider copyright and data protection implications. We’ve covered this already, but it’s absolutely imperative that you’re smart about navigating legal matters.
  • You need robust monitoring in place. If you're building your business or your products around this data, you need to know that your design is reliable. 
  • Ensure you have the right resources. Ask yourself whether you have the right team with the right skills in place to manage the constantly shifting environment. 
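To make the monitoring point more concrete, here is a minimal sketch of one possible check: measuring how often each required field is actually filled in a batch of extracted records. The field names, the helper functions, and the 95% threshold are assumptions chosen for this example, not recommendations from the webinar.

```python
from typing import Any, Dict, Iterable, List, Mapping, Sequence

# Assumed required fields; adjust to match your own schema.
REQUIRED_FIELDS = ("url", "headline", "article_body")


def fill_rates(
    records: Iterable[Mapping[str, Any]],
    fields: Sequence[str] = REQUIRED_FIELDS,
) -> Dict[str, float]:
    """Return the share of records with a non-empty value for each field."""
    batch = list(records)
    if not batch:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for record in batch if record.get(field)) / len(batch)
        for field in fields
    }


def fields_below_threshold(
    rates: Mapping[str, float], threshold: float = 0.95
) -> List[str]:
    """List the fields whose fill rate falls below the threshold."""
    return [field for field, rate in rates.items() if rate < threshold]


batch = [
    {"url": "https://example.com/1", "headline": "A", "article_body": "Text"},
    {"url": "https://example.com/2", "headline": "B", "article_body": "Text"},
    {"url": "https://example.com/3", "headline": "C", "article_body": ""},
    {"url": "https://example.com/4", "headline": "D", "article_body": "Text"},
]

rates = fill_rates(batch)
alerts = fields_below_threshold(rates)
```

Running a check like this on every batch can surface a silent extraction failure, such as a site redesign that empties the article body, before it affects downstream analysis.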

Extract news and social media data safely with Zyte

News data extraction is an intricate and legally complex practice that shouldn’t be managed alone. Without the right resources or the right legal team in place, you risk collecting unreliable or unlawful data.

Zyte is here to help. As the world’s leading web scraping service, we are experts at finding, extracting, and formatting datasets so you don't have to.

We provide reliable and legally compliant web data, with competitive upfront pricing structures and ongoing support at every stage of the process. Try us for free today. 

To watch the full webinar, with all insights and advice on news and social media schemas, click here.