Is Web & Data Scraping Legally Allowed?

The Ultimate Guide to Legal Web Scraping

DISCLAIMER: This post is for informational purposes only. The content is not legal advice and does not create an attorney-client relationship.

One question everyone asks before starting a web scraping project is “Is web scraping legal?”

The short answer is that web scraping itself is not illegal. There are no specific regulations that explicitly prohibit web scraping in the US, UK, or the EU. However, how you scrape, what data you scrape, and how you use that data can each take you into territory that is not legal. In this article we discuss the legal considerations you need to take into account before you start scraping.

The most important laws that intersect with web scraping are (1) copyright, (2) contract, (3) data protection, and (4) anti-hacking laws. There are other legal considerations as well, but these are the biggest factors to take into account, so we’ll walk through each of them below.

1. Copyright Law

One of the first considerations to take into account when scraping a website owned by another party is whether the information you are scraping is copyrighted.

Because you are scraping data from someone else’s website, it is likely that they own the rights to the content on the site. For example, a news website owns the copyright to the articles it publishes (or in some cases has licensed the articles from the copyright holder). As such, you need to comply with the copyright laws of the relevant jurisdiction if you are scraping this data.

However, not all data on a website is copyrighted, because not all information is copyrightable. Copyright law only protects original works of authorship, so factual data, like product prices and names, typically cannot be copyrighted. So if your scraping is limited to non-copyrightable information then your web scraping will fall outside the bounds of copyright law.

If you determine that the data is copyrighted, the next step of the analysis is to determine whether your use of that data infringes the owner’s copyright, whether database rights are involved, and whether your use falls into a copyright exception, such as fair use under US law (note that different jurisdictions may have different exceptions). You will need to make these determinations on a case-by-case basis, as they depend heavily on your use case for the data.

In order to determine if your use of copyrighted data falls within the fair use exception, US courts look at four factors:

the purpose and character of your use;
the nature of the copyrighted work;
the amount and substantiality of the portion taken; and
the effect of the use upon the potential market.

What the court will really be looking at here is whether your use directly competes with the copyright owner and diverts business from them, and whether your use transforms the data in some way. If you’re simply using the copyrighted data for internal business purposes or analysis, or only posting short snippets that link back to the original, this will likely be considered fair use. If you are directly copying and reposting the data for your own business gain, this is unlikely to be fair use and you may be liable for copyright infringement.

2. Contract Law

The next set of laws that come into play for web scraping are contract laws. Most websites contain terms and conditions and they are either browsewrap or clickwrap.

Browsewrap terms are the ones that are linked somewhere on a webpage but that you do not explicitly agree to.

Clickwrap terms are terms that you need to explicitly agree to: you click “agree”, check a box, register on the site, or use some other mechanism by which you actively accept the terms.

We make this distinction because many courts have found it vital in assessing whether the terms are enforceable. Browsewrap terms have been enforced less frequently, while clickwrap terms are almost always enforced, as you are considered to have entered into a contract with the target website when you explicitly agree to its terms. As a result, when agreeing to clickwrap terms, you should always read and comply with them, so that you do not breach the contract you have entered into.

Various provisions in the terms may prohibit your web scraping, but the most common and clearest is a statement that you may not scrape, crawl, or use any other automated means to access the site. In that case, if you have agreed to the terms, you should not scrape the site unless you have the site’s permission.

Furthermore, there may be restrictions around commercial use of the data on the website, login sharing, and/or other use restrictions in the terms that you agree to. If you agree to those terms, all the terms apply, so read and abide by them all.

3. Data Protection Laws

Data protection laws vary from country to country, but the main laws we will focus on here are GDPR and the various US state laws, as they are the ones talked about most frequently.

GDPR

If your web scraping project involves personal data of people within the EU, then you must comply with GDPR, which means you must have a lawful basis for scraping the personal data. There are various lawful bases under GDPR, but the ones you will most likely need to rely on for a web scraping project involving personal data are legitimate interest or consent. For consent, you would need clear written consent from the data subjects to scrape their data, and this applies whether or not the data is publicly available. To avail of legitimate interest, you would need to conduct a legitimate interest analysis (LIA) and in some cases also a Data Protection Impact Assessment (DPIA), which we discuss in further detail in our GDPR blog post.

US State Laws

In the US, the main state law people are focussed on is the California Consumer Privacy Act (CCPA), but there are also 12 other states with their own unique data protection laws. This makes navigating US privacy laws very complex, but there is one overarching theme that helps web scrapers – all the state laws have an exception for public personal data. So long as you are only scraping personal data that has been clearly made public, you will likely fall under the various US law exceptions.

Be mindful that there are many other jurisdictions with strict data protection laws, so you will need to consult with your legal counsel to ensure that any scraping you conduct complies with the relevant local laws.

4. Computer Fraud and Abuse Act

There have been several cases in the US in which companies have brought actions against web scrapers under the Computer Fraud and Abuse Act (CFAA). The CFAA was originally designed to target hackers, but those bringing lawsuits under it for web scraping have tried to broaden its applicability.

In recent years we’ve seen two landmark decisions that support the view that the CFAA does not apply to most web scraping.

Van Buren

In the Supreme Court case Van Buren v United States, the Court held that if you lawfully access a computer system but do so for an improper purpose, that is not a violation of the CFAA, as the access itself was not unlawful. So accessing a public website through web scraping isn’t a CFAA violation, but it can still violate other laws such as copyright, contract, or privacy law, so be mindful of this. The Court also left open whether circumventing technological measures to get data could be a CFAA violation, and made no ruling on what technological measures might create risk. So while this is a great ruling for ethical web scrapers, it still leaves open the other causes of action we discussed above and doesn’t fully settle the CFAA issue.

hiQ Labs v LinkedIn

This may be the most miscited case in all of web scraping. Many articles will tell you that the ruling in this case said that web scraping is fully legal. In fact, ChatGPT itself gets this wrong on many occasions because it’s trained on so much misinformation about the case. While a ruling like that would be great, it’s not the reality; the decision is much more nuanced. So what does this case really say?

As part of a preliminary injunction ruling, the court ruled that hiQ was likely to win on the CFAA cause of action, as its access to LinkedIn’s website was not beyond its authorized access (it was accessing information that it had a lawful right to get from LinkedIn). This decision was affirmed by the Ninth Circuit Court of Appeals and then appealed to the Supreme Court. The Supreme Court did not rule on the merits but sent the case back to the Ninth Circuit to review it in light of Van Buren. The Ninth Circuit then affirmed its earlier ruling, again deciding that hiQ’s scraping was not beyond its authorized access, consistent with the reasoning in Van Buren.

While this was a positive ruling that confirms the likelihood that scraping public data is not a violation of the CFAA, it was merely a ruling on a preliminary injunction, with no final ruling on this cause of action. The case subsequently settled via a confidential settlement agreement, so we received informative and positive authority from the courts, but nothing definitive and nothing about the other causes of action in the case.

5. Other Considerations

Robots.txt

While robots.txt files are informative, they are typically not legally binding. You will need to determine on a case-by-case basis whether a given robots.txt is something you need to follow, taking all the other legal factors listed above into consideration.
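If you decide to honor a site's robots.txt, the check can be automated before each request. The following is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content, the "example-bot" user agent, and the example.com URLs are hypothetical and used only for illustration; in a real project you would load the file from the target site instead of from a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer access questions."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "example-bot" is an assumed user agent name, not a real crawler.
public_ok = is_allowed(parser, "example-bot", "https://example.com/products/1")
private_ok = is_allowed(parser, "example-bot", "https://example.com/private/data")

# Crawl-delay is a non-standard directive; this returns None when it is absent.
delay_seconds = parser.crawl_delay("example-bot")
```

A check like this only tells you what the site operator has requested. It does not answer the copyright, contract, or data protection questions discussed above.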

Rate Limits

Ensure that you set reasonable rate limits and/or only work with providers that set those limits for you. Web scraping must be done responsibly and you want to ensure that you are not interfering with the operations of the websites you are interacting with. And if you do receive an abuse report from a website, you should always reduce the speed and frequency at which you are accessing their site to ensure you are respecting their operations.
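As a rough illustration of the advice above, the sketch below enforces a minimum gap between requests and lets you widen that gap if a site complains. The class name PoliteThrottle, its parameters, and the doubling factor are assumptions made for this example, not a standard API, and the right interval depends on the site you are accessing.

```python
import time


class PoliteThrottle:
    """Enforce a minimum interval between consecutive requests (illustrative only)."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock    # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last_request_at = None

    def wait(self) -> float:
        """Pause until min_interval_s has passed since the last call.

        Returns the number of seconds slept.
        """
        delay = 0.0
        if self._last_request_at is not None:
            elapsed = self._clock() - self._last_request_at
            delay = max(0.0, self.min_interval_s - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last_request_at = self._clock()
        return delay

    def slow_down(self, factor: float = 2.0) -> None:
        """Widen the interval, for example after receiving an abuse report."""
        self.min_interval_s *= factor
```

In use, you would call wait() immediately before each request to a given site and call slow_down() whenever the site signals that your traffic is a problem.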

Industry-Specific Considerations

Some web scraping use cases will have nuances based on the industry you are in. For example, if you are using web data to make investment decisions, you will need to comply with securities laws and consider factors like material non-public information (MNPI) to avoid any insider trading issues. In this case, you will always want to ensure that you are only collecting public data that isn’t behind a login, paywall, or other barrier.

Other industry specific examples would be government entities, law enforcement, or healthcare – there are additional regulations in these sectors and if you are working in one of these areas you need to take those into account.

Residential IPs

If you are using residential IPs to conduct your web scraping, you must ensure that they were legally and ethically obtained. While there are many reputable residential IP providers who obtain proper consents for the IPs, there are still a lot of providers who are not doing this. Always make sure you are working with a provider who can show you the clear and explicit consents that they gather in order to use the residential IPs. Without this, you may be utilizing someone’s IP address without their permission or knowledge, which runs afoul of many data protection laws.

AI Laws

As we know, most large language models (LLMs) and generative AI systems (GenAI) are trained on data scraped from the web. There is a host of case law and regulation pending on this topic, so we will be publishing a legal tracker soon to help navigate these laws.

So far what we have seen is: (1) beware if you are using AI for higher-risk purposes like facial recognition, making job or housing decisions, and law enforcement, (2) it’s yet to be determined whether using copyrighted material like images or books to train your LLM is copyright infringement, and (3) if the subsequent output from the GenAI system is distinct enough from the original work, it likely isn’t copyright infringement. We’ll keep you updated as we learn more from the pending cases and laws. This will be an exciting topic to follow!

For more assistance in assessing your web scraping needs, check out our Compliant Web Scraping Checklist or contact Zyte’s legal team at legal@zyte.com