Balancing innovation and regulation in data scraping

Read time: 10 mins
Posted on October 14, 2025

For anyone involved in data gathering, the legal landscape can often feel like a waiting game, as protracted legal cases play out before becoming case law.


Recently, however, we have finally started to see exactly that happen.


For web data access, the changes are good news. Innovators must always balance what they do against regulation, but recent cases have confirmed a growing scope for innovation.

Public web data

This is the foundational element for so much of the innovation happening today, but it’s also where the regulatory story begins.


Innovation: Public data fuels creativity


The value of public web data is undeniable. On the innovation side of the scale, the arguments are clear:


  • Public web data is the largest data set in the world, and its potential uses are vast.

  • Web data can be used for countless business intelligence purposes, driving smarter decisions and creating new opportunities.

  • AI isn’t going anywhere, and we need good data to train it. Public data is the fuel for this technological revolution.

  • Fundamentally, we believe that public data should remain public.


Regulation: Logged-out public data capture may be permitted


Historically, the primary legal threat to web scraping came from the Computer Fraud and Abuse Act (CFAA), a US anti-hacking law. This was concerning because violations carried not only civil penalties (money) but also potential criminal penalties.


However, a few years ago, landmark court rulings in cases like hiQ Labs, Inc. v. LinkedIn Corp. and Van Buren v. United States clarified the landscape. The courts stated that if you have lawful access to the data (meaning anyone can go to a public website and see it), you are not violating the CFAA.


So, the question then became: “Can it nevertheless be a violation of a site’s Terms of Service (ToS)?” In January 2024, we saw a major ruling in the Meta v. Bright Data case that answers this question. The court ruled that Bright Data did not violate Meta's ToS.


However, while many headlines declared that all public data scraping is now okay, that's not quite what the case said. The court's decision was specific to the facts: Bright Data was scraping data that was not behind a login, and its activity did not violate Meta’s ToS.
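
Because the logged-out distinction carried so much weight in that ruling, some teams add a technical guard so that a crawler never sends credentials by accident. The snippet below is a minimal sketch of that idea, not a legal test: the function name `is_logged_out` and the header list are assumptions made for illustration, and a site can also identify a logged-in session in ways this check does not cover.

```python
# Header names that commonly carry a logged-in identity.
# This list is illustrative and not exhaustive (an assumption for this sketch).
CREDENTIAL_HEADERS = {"authorization", "cookie"}


def is_logged_out(headers: dict[str, str]) -> bool:
    """Return True if none of the request headers carries credentials.

    Header names are compared case-insensitively, as HTTP requires.
    """
    return not any(name.lower() in CREDENTIAL_HEADERS for name in headers)


# Example: a plain request passes, a request with a session cookie does not.
plain = {"User-Agent": "example-crawler", "Accept": "text/html"}
with_session = {"User-Agent": "example-crawler", "Cookie": "session=abc123"}
print(is_logged_out(plain), is_logged_out(with_session))
```

A guard like this only documents intent in code. Whether a given site's terms apply to your activity still depends on the facts, as the court's reasoning shows.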


Following this, we saw X (formerly Twitter) settle its lawsuit against Bright Data. While the terms are confidential, one can make an educated guess that X saw the outcome of the Meta case and decided it wasn't worth pursuing. The courts are favoring innovation.


Takeaway: Not everything is fair game


Just because the courts have been ruling in favor of scraping public data doesn’t mean it’s all fair game. What you do with the data still matters a lot, and what type of public data matters too. We're seeing courts look more closely at data usage, especially when it involves pirated or illegally obtained content, which leads us to our next topic.

Personal data

Scraping personal data is always a hot topic, and while there haven't been massive legal shifts recently, the existing rules are more important than ever, especially with the integration of data into AI.


Innovation: Creating personalized and powerful datasets


The goals here are clear: obtaining vast and diverse data to build out various types of datasets, creating robust LLMs, fine-tuning models, and creating tools for brand monitoring and social listening. Personal data can be a component of this, but it requires extreme care.


Regulation: The US vs. EU divide


There is a huge distinction between how the US and the EU treat personal data.


  • United States: In the US, public personal data is typically okay to scrape. If data is "manifestly made public," then no consent or other type of legitimate interest is generally required.

  • European Union: In the EU, under GDPR, there is no exception for public personal data. You must have a legitimate interest or consent, even for data that is publicly accessible. This applies even if you are in the US but are scraping the personal data of people located in the EU.


When incorporating data into AI, it's crucial to ensure you are not violating prohibited uses under new regulations like the EU AI Act, which restricts applications like facial recognition and automated decision-making for employment, housing, or loans.


Takeaways: When is public personal data okay?


The rules differ significantly by jurisdiction. In the EU, even with public data, you must consider:


  1. Data retention: How long do you keep the data?

  2. Anonymization: Can you anonymize the data to remove personal identifiers?

  3. Minimization: Are you only taking the data you absolutely need?

  4. Notices: Do you need to provide notice to data subjects?

  5. Opt-outs: Is there a mechanism for individuals to opt out?
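
Several of these considerations can be enforced in the data pipeline itself. The following is a minimal sketch, assuming records arrive as dictionaries with a `profile_id` and a timezone-aware `collected_at` timestamp. The field names, the 90-day retention window, and the secret key are all hypothetical placeholders. Note that a keyed hash is pseudonymization rather than full anonymization, and that notices are a process question this code does not address.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical policy values; real ones depend on your legal basis and use case.
ALLOWED_FIELDS = {"profile_id", "job_title", "company"}
RETENTION = timedelta(days=90)
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(identifier: str) -> str:
    """Replace an identifier with a keyed SHA-256 hash (pseudonymization only)."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_record(raw: dict, opted_out: set, now: datetime) -> Optional[dict]:
    """Return a minimized, pseudonymized copy of raw, or None if it must be dropped."""
    if raw["profile_id"] in opted_out:
        return None  # opt-outs: the individual asked to be excluded
    if now - raw["collected_at"] > RETENTION:
        return None  # retention: the record is older than the policy allows
    # Minimization: keep only the fields on the allow-list.
    record = {key: value for key, value in raw.items() if key in ALLOWED_FIELDS}
    record["profile_id"] = pseudonymize(record["profile_id"])
    return record


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
raw = {
    "profile_id": "user-1",
    "job_title": "Engineer",
    "company": "Acme",
    "email": "someone@example.com",  # not on the allow-list, so it is dropped
    "collected_at": now - timedelta(days=1),
}
print(prepare_record(raw, opted_out=set(), now=now))
```

Keeping the policy values in one place makes it easier to show a reviewer what the pipeline retains and for how long.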


Be cautious about using personal data when building an LLM: ensure it is obtained compliantly, and design use cases that do not run afoul of the AI Act or other regulations.

Key takeaways for the road ahead

The legal changes this year have been overwhelmingly positive for the web scraping community. The courts are increasingly ruling that scraping public web data is acceptable and are even recognizing fair use in the context of training AI.


However, this freedom comes with responsibility. Here are the most important principles to guide your data scraping activities:


  • Ensure data comes from reputable, legally compliant websites.

  • Avoid scraping websites with pirated or illegal content. The potential damages are enormous, as seen in the Anthropic case.

  • Do not build directly competitive products unless the data is materially transformed. Add your own analysis and intelligence.

  • Ensure you handle personal data according to jurisdictional requirements, paying close attention to the stringent rules of the EU if you collect data on people located there.

  • Do not use scraped data for AI products prohibited under emerging regulations like the EU AI Act.
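
One way to make the principles above routine is to record them as a pre-project review that a team fills in before any crawl starts. The sketch below is an illustration only: the `ScrapePlan` fields and the `review` function are hypothetical names, each answer still has to come from a person (ideally with legal input), and the code simply reports which principles a plan appears to miss.

```python
from dataclasses import dataclass


@dataclass
class ScrapePlan:
    """Answers a team records before starting a scraping project (hypothetical schema)."""
    source_is_reputable: bool
    source_hosts_pirated_content: bool
    competes_directly_with_source: bool
    output_is_materially_transformed: bool
    includes_personal_data: bool
    jurisdiction_rules_addressed: bool
    feeds_prohibited_ai_use: bool


def review(plan: ScrapePlan) -> list:
    """Return a list of principles the plan appears to violate (empty if none)."""
    issues = []
    if not plan.source_is_reputable:
        issues.append("Source is not a reputable, legally compliant website.")
    if plan.source_hosts_pirated_content:
        issues.append("Source hosts pirated or illegal content.")
    if plan.competes_directly_with_source and not plan.output_is_materially_transformed:
        issues.append("Directly competitive product without material transformation.")
    if plan.includes_personal_data and not plan.jurisdiction_rules_addressed:
        issues.append("Personal data collected without addressing jurisdictional rules.")
    if plan.feeds_prohibited_ai_use:
        issues.append("Data would feed an AI use prohibited by regulation.")
    return issues


plan = ScrapePlan(
    source_is_reputable=True,
    source_hosts_pirated_content=False,
    competes_directly_with_source=True,
    output_is_materially_transformed=False,
    includes_personal_data=False,
    jurisdiction_rules_addressed=False,
    feeds_prohibited_ai_use=False,
)
for issue in review(plan):
    print(issue)
```

An empty list does not mean a project is lawful. It only means none of the listed principles was flagged.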


The more the web scraping industry unites around ethical standards, the more we can influence regulators to continue making positive decisions that favor innovation. The law is finally catching up, and for those who proceed ethically, the future of data scraping looks bright.
