Is your AI breaking the law? Legal experts’ advice for web scrapers

AI promises to revolutionise the practice of web data extraction - but, when it comes to using that data, it also raises pressing questions about copyright, liability, and compliance that businesses cannot afford to ignore.

At Zyte’s recent Web Data Extract Summit, a legal panel hosted by Zyte’s Chief Legal & People Officer, Sanaea Daruwalla, convened to chew over these challenges:

Dr. Nikos Minas, Global IP Counsel at Wesco International.
Dr. Bernd Justin Jütte, Associate Professor in intellectual property law at University College Dublin.
Callum Henry, Senior Legal Counsel at Zyte.

Navigating the EU's new AI rulebook

The European Union's landmark AI Act began coming into force in August 2024, but is being implemented over a three-year period to 2027.

As a legal academic specializing in the field, Dr. Jütte explained that the regulation is not actually a blanket law:

“The idea of the Act is to prevent and address certain risks that AI poses to larger society. We call that a risk-based regulation.”
– Dr. Bernd Justin Jütte, Associate Professor in intellectual property law, University College Dublin

This framework categorizes AI systems into different tiers of scrutiny. Some practices, like using subliminal techniques, are prohibited outright. At the other end are "low or no-risk" AI systems, which face minimal regulation beyond encouraging voluntary codes of practice.

The real complexity lies in the middle, he said. "High-risk" systems and "general purpose AI models" (GPAI) are subject to significant obligations, including transparency, record-keeping, and oversight.

Dr. Minas warned that companies outside the EU need to understand that these rules have extra-territorial reach:

“Once you deploy something in the EU, you have to comply.”
– Dr. Nikos Minas, Global IP Counsel, Wesco International

He also noted a key challenge: the AI Act relies on self-assessment. “It's up to you to define where you are in the tier,” he said. “We are human beings, and we tend, when there is a risk, to maybe sometimes minimize that risk and especially go for the reward portion.”

Copyright's collision with AI

The most contentious area today is the use of web-scraped data to inform AI models, which can bring the vast data needs of AI development into close proximity with copyright laws.

In the European Union, the Copyright Directive includes a specific exception for text and data mining (TDM).

Dr. Jütte explained that, while this allows for scraping, it comes with a critical caveat: rightsholders can opt out. He said a recent German court ruling found that a website’s published opt-out message doesn't need to be machine-readable, meaning a simple sentence in a website's terms and conditions statement could suffice.

This, Jütte said, creates a complex scenario:

“That makes it much more complicated for you guys because you have to have a web crawler that not only looks for robots.txt, but also speaks … 27 different languages within the European Union to identify whether somebody, in non-legal language, says: ‘Please don't copy the stuff that I have on my website’.”
– Dr. Bernd Justin Jütte, Associate Professor in intellectual property law at University College Dublin

Zyte’s Sanaea Daruwalla clarified the legal status of robots.txt files: “This isn't binding law. For the most part, robots.txt are still looked at as advisory and not legally binding. So, it's still a very gray area.”
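To make the two checks Dr. Jütte describes concrete, here is a minimal Python sketch, not a compliance tool. It uses the standard library's urllib.robotparser to read a robots.txt file, plus a deliberately naive phrase scan of a page's text. The phrase list, the function names, and the "example-bot" user agent are all hypothetical. A real crawler would need far broader, multilingual detection, which is exactly the difficulty the panel raised.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, English-only phrase list for illustration. Real opt-outs may be
# worded in any language and in non-legal terms, so this will miss many of them.
OPT_OUT_PHRASES = (
    "text and data mining",
    "do not copy",
    "don't copy",
    "no scraping",
)


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def find_opt_out_phrases(page_text: str) -> list[str]:
    """Return the phrases from OPT_OUT_PHRASES found in page_text (case-insensitive)."""
    lowered = page_text.lower()
    return [phrase for phrase in OPT_OUT_PHRASES if phrase in lowered]


if __name__ == "__main__":
    robots_txt = "User-agent: *\nDisallow: /private/\n"
    print(robots_allows(robots_txt, "example-bot", "https://example.com/blog/post"))
    print(find_opt_out_phrases("Please don't copy the stuff that I have on my website."))
```

A match from either check is only a signal for human review. As the panel noted, robots.txt is generally treated as advisory, while a plain-language reservation in a site's terms may carry legal weight in the EU.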

The United States takes a different approach with its "fair use" doctrine - a more flexible, case-by-case analysis. Recent lawsuits are beginning to test its application to AI.

Zyte’s Callum Henry pointed to the recent Anthropic case brought by book authors, in which the company trained its model on two sets of data:

Books it had lawfully purchased and scanned.
Books it had downloaded from pirate sites.

A judge indicated that training on the lawfully obtained data from the purchased books was likely fair use. The pirated data, however, was a different story.

“We've got this distinction now between lawfully obtained data and unlawfully obtained data,” Henry explained. “For the pirated books, the judge indicated that it likely fell against fair use”. Anthropic has settled the case.

Live lessons from the courtroom

Panellists also discussed the ongoing case between Getty Images and Stability AI in the UK. Getty sued Stability AI, alleging its image-generation model, Stable Diffusion, ingested Getty’s library and began reproducing images.

As Extract Summit got underway in Dublin, a judge in London delivered a much-awaited ruling. While the core issue of whether the initial training constituted copyright infringement remains to be decided (Stability AI successfully argued its model was trained outside of the UK), the judge did find that “importing” the trained model to the UK could potentially be an act of secondary copyright infringement.

However, Zyte’s Henry said the judge ruled that what the AI model actually contains is not images but, rather, a collection of mathematical weights and parameters:

“The model never saw an image in the way that you and I understand an image, and it never retained an image. It's effectively a mathematical representation.”
– Callum Henry, Senior Legal Counsel at Zyte

As Henry put it, the key takeaway is that “a model can be something which is capable of infringing copyright” - but the finer points of both jurisdiction and the presence of actual copying matter greatly.

Optimize for provenance

So, how can businesses navigate all this? All panellists agreed on one fundamental principle: accountability. The source of your data matters immensely. “Where do you get your data? That should be your primary concern,” Dr. Minas stressed.

As Callum Henry advised, even if using an AI tool to generate scraping code, “you are still responsible for what you do on the internet”.

Zyte’s Daruwalla echoed the view: “It does show you how important it is to stay away from those pirated sites, to be cognizant of what you're scraping.”

“Even if you're using these AI scrapers and they're going to build these really cool spiders for you, you still have to think about where that's going and what that's doing.”
– Sanaea Daruwalla, Chief Legal & People Officer, Zyte
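One practical way to act on the panel's "where do you get your data?" question is to store a small provenance record alongside every page a spider collects. The Python sketch below is one possible shape for such a record, under stated assumptions: the ProvenanceRecord class, its field names, and the make_record helper are hypothetical and do not come from the panel or from any Zyte product.

```python
import hashlib
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical audit record describing where and how one page was obtained."""
    url: str
    fetched_at: str        # ISO 8601 timestamp, UTC
    robots_allowed: bool   # outcome of the crawler's robots.txt check
    opt_out_found: bool    # whether any opt-out wording was detected
    content_sha256: str    # fingerprint of the exact bytes collected


def make_record(
    url: str,
    body: bytes,
    robots_allowed: bool,
    opt_out_found: bool,
    fetched_at: Optional[datetime] = None,
) -> ProvenanceRecord:
    """Build a provenance record for one fetched page."""
    when = fetched_at or datetime.now(timezone.utc)
    return ProvenanceRecord(
        url=url,
        fetched_at=when.isoformat(),
        robots_allowed=robots_allowed,
        opt_out_found=opt_out_found,
        content_sha256=hashlib.sha256(body).hexdigest(),
    )


if __name__ == "__main__":
    record = make_record("https://example.com/blog/post", b"<html>...</html>", True, False)
    print(asdict(record))
```

Keeping records like this does not settle any legal question, but it does mean that when someone asks where a piece of training data came from, there is an answer on file.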

The legal frameworks around data and AI may still be developing, but common sense around classic copyright compliance, coupled with keeping tabs on emerging case law, can help create a secure foundation for sustainable success.