
AI’s Legal Frontier: What Europe’s Privacy Regulators Say About Scraping Personal Data

Read time: 5 mins
Posted on: December 12, 2025
Explore how EU privacy regulators view AI web scraping, lawful bases like legitimate interest, risks of collecting personal data, and compliance best practices.

DISCLAIMER: This blog post is for informational purposes only. The content is not legal advice and does not create an attorney-client relationship.


This is a pivotal moment for web data collectors. Zyte has been in the web scraping industry for over a decade, but never before has web scraping attracted such widespread global attention. 


We have seen a surge of lawsuits and regulations concerning AI and the web scraping methods often used to obtain AI training data.


Companies that collect web data for AI training must follow all applicable laws and regulations, including copyright laws, data protection laws, and new AI-specific regulations such as the EU AI Act.


But while much of the focus has been on copyright as it pertains to publicly available web data, data collectors must also be keenly aware of the regulations governing the use of personal data.

What is ‘personal data’?

Under the General Data Protection Regulation (GDPR), personal data is defined as “any information relating to an identified or identifiable natural person,” whether that person can be identified directly or indirectly.


In practice, this covers a broad range of identifiers such as name, identification number, location data, username, user ID, email address, and phone number. 


And it is not hard to see that such personal data can be found on the public web.


To collect personal data, you must have what is called a “lawful basis.” There are six available lawful bases for processing personal data:


  1. Consent - the individual has given clear, informed permission.

  2. Contract - processing is necessary to perform or enter into a contract with the individual.

  3. Legal obligation - required to comply with a statutory duty.

  4. Vital interests - necessary to protect someone’s life.

  5. Public task - carried out in the public interest or under official authority.

  6. Legitimate interests - necessary for a controller’s legitimate aims, provided these don’t override the individual’s rights.

Legitimate interest for AI collection

For AI developers to rely on “legitimate interest” as the lawful basis for collecting personal data, three conditions must be met:


  1. The existence of a legitimate interest.

  2. That the processing is necessary to achieve that legitimate interest.

  3. That your legitimate interest is balanced against the fundamental rights and freedoms of the individuals whose data you are processing.


It is important to take into account the volume of data that will be collected and all the possible uses for the AI system.


Weighing use of collected data


When personal data is collected via scraping, the individuals typically have no direct knowledge that their data is being used. The risks to their rights vary depending on whether the database and any generated data are publicly searchable or for internal use only.


While some individuals may not have an issue with their data being collected for lead generation or data analytics purposes, they may be more concerned about their data being used to train AI models. However, as pointed out by the EDPB, there could be cases where including their data in an AI model may be in an individual’s best interests, such as an AI system that helps provide better access to healthcare or education, or a model that benefits the individual professionally or financially.


Making AI use cases clear


Under the GDPR, anonymized data is no longer personal data. But anonymization is a lot more difficult to achieve with AI. The sheer computing power of AI and the amount of data a system can hold make it extremely difficult to limit the inferences about an individual, even if you anonymize certain data points.


Generative AI poses an additional concern with its ability to create deepfakes or chatbots based on an individual’s data. Because AI is so complex, it may be difficult for individuals to form a reasonable expectation of how their data is likely to be used.


Of course, privacy policies typically state how an individual’s data is to be used, but merely stating “to train AI models” could be too vague for many individuals.


Finally, any personal data that is collected but deemed unnecessary for training should be deleted or anonymized as soon as possible.
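
As a rough illustration of this last point, the sketch below masks email addresses and phone-like strings in scraped text before it is stored. The `redact` function and both regular expressions are hypothetical and deliberately simple. Pattern-based masking like this reduces exposure but misses many identifiers (the name in the sample survives), so on its own it should not be assumed to meet the GDPR standard for anonymization.

```python
import re

# Illustrative patterns only (assumptions for this sketch);
# real-world identifiers are far more varied than these cover.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like strings with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    sample = "Contact Jane at jane.doe@example.com or +44 20 7946 0958."
    print(redact(sample))
```

In practice, a step like this would be one layer among several, alongside dropping unneeded fields entirely and reviewing samples of the output for identifiers the patterns did not catch.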

Re-using scraped data for AI

It’s important to note that, even if you have a pre-existing dataset that had a lawful basis under the GDPR generally, you will need to re-assess whether it can be lawfully reused for AI training.


France’s Commission Nationale de l'Informatique et des Libertés (CNIL) refers to this as the “compatibility test”. In that test, you must assess issues such as:


  • The existence of a link between the initial objective and that of building a dataset for training an AI system.

  • The context in which the personal data was collected.

  • Whether you have adequate safeguards when training an AI system on the personal data.


However, the EDPB noted that, even if the data used to train the model was not lawfully obtained (i.e. the GDPR was not complied with while scraping the data), as long as it is properly anonymized later, the AI system itself is not immediately illegal.


Companies looking for personal data to train AI may find value in an alternative data source. There are open source models that will supply synthetic personal data that may achieve the same purpose.

Limit data scope to limit liability

Data protection authorities have varying levels of concern regarding the ability of scrapers to limit the content that they scrape.


  • The Dutch Data Protection Authority, for example, believes there is a high likelihood that scrapers will unintentionally scrape personal data, including special-category sensitive data.

  • However, CNIL and the EDPB seem to understand that compliant web scraping can often come down to how you code your spiders and/or crawlers.


Define minimum collection requirements


If you clearly define your collection criteria, including your data schema and defined list of websites, you can mitigate the risk of scraping personal data.


For example, if you scrape product sites and omit the seller’s name or reviews, the likelihood of scraping personal data will be significantly lower than if you scrape public social media profiles.
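
One way to make such criteria concrete is to encode them in the scraper itself. The following is a minimal sketch, assuming a hypothetical allowlist of domains and a hypothetical product schema; the names `ALLOWED_DOMAINS`, `ALLOWED_FIELDS`, `in_scope`, and `apply_schema` are illustrative and not part of any particular scraping framework.

```python
from urllib.parse import urlparse

# Hypothetical collection criteria: which sites and which fields are in scope.
ALLOWED_DOMAINS = {"shop.example.com"}
ALLOWED_FIELDS = {"product_name", "price", "currency", "sku"}


def in_scope(url: str) -> bool:
    """Return True only if the URL's host is on the allowlist."""
    host = urlparse(url).hostname or ""
    return host.lower() in ALLOWED_DOMAINS


def apply_schema(record: dict) -> dict:
    """Keep only fields named in the schema; drop everything else."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}


if __name__ == "__main__":
    raw = {
        "product_name": "Desk lamp",
        "price": "24.99",
        "currency": "EUR",
        "seller_name": "J. Smith",        # out of schema, dropped
        "reviews": ["Great! - Anna K."],  # out of schema, dropped
    }
    if in_scope("https://shop.example.com/items/42"):
        print(apply_schema(raw))
```

Filtering by an explicit allowlist rather than a blocklist means that any field or site not deliberately approved is excluded by default, which is generally the safer failure mode for data minimization.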


Handling special-category personal data


If you are scraping sites that may have special-category personal data (e.g. personal data that relates to an individual's race, ethnic origin, or religious belief), you will also need to meet one of 10 GDPR conditions before it can be processed:


  • Explicit consent for specified purposes.

  • Employment / social protection obligations or rights, authorised by law.

  • Vital interests where the person cannot consent.

  • Non-profit bodies processing data of members/regular contacts, with no external disclosure.

  • Data made public by the data subject.

  • Legal claims or courts acting in a judicial role.

  • Substantial public interest under Union or Member State law with safeguards.

  • Healthcare: medical diagnosis, care, treatment, or health-system management, under professional secrecy.

  • Public health: protecting against serious health threats or ensuring quality/safety of healthcare/medicines.

  • Research / statistics / archiving in the public interest.




Social media sites and forums are particularly risky sources, as they can contain photos of individuals (including children), someone’s political opinions, religious beliefs, union memberships, sexual orientation, etc.


The EDPB Task Force suggested that it could be beneficial to exclude public social media profiles. However, if this type of data is necessary for your project, you will need to determine whether the individuals intended that the information be “manifestly made public” and what their reasonable expectations may be.


This can be a difficult task that may require that you engage a lawyer or privacy professional to perform a Data Protection Impact Assessment (DPIA) on your behalf. 

Further considerations

If you already have a compliant web scraping plan in place, you are probably already aware of the possible scraping counter-measures some websites put into place and have a policy on how to proceed when blocked by one of them.


You probably also know which sites still have accessible public data and where to find specialised data for your project.


If you don’t know where to start, we are here to help. Zyte has a team of legal and compliance experts who can help guide you on your web scraping compliance journey. Just reach out at legal@zyte.com.
