Callum Henry
6 mins
March 15, 2024

Compliant web scraping with AI

DISCLAIMER: This post is for information purposes only. The content is not legal advice and does not create an attorney-client relationship. 


Zyte’s flagship product, Zyte API, now includes built-in features that automate crawling with spider templates, along with our patented AI-powered automatic extraction, which gives you quality structured data quickly without writing custom parsing code. For scraping product data, the two work together as a complete solution: a Zyte AI-Powered Spider template runs the crawl and calls Zyte API’s AI models to extract the data.

AI web scraping with Zyte API

While these tools facilitate efficient web scraping, it is important to keep in mind the basic principles of compliant web scraping. All projects should start with a compliance assessment that considers the key web scraping legal and compliance risk areas as they apply to your project. You can use our Compliant Web Scraping Checklist to help with this. 


To help you navigate these issues, we have also integrated a number of compliance-focused protections into our AI-powered web scraping solutions.


Agreement to terms, login and non-public data


If the data you want to extract is not publicly available on the internet — for example, it is behind a paywall, or a login page, or is not generally available to members of the public online — you need to conduct a thorough review of the website terms, or you might need to obtain permission from the website before extracting any data. 


Likewise, if you explicitly agree to any Terms of Service, Terms and Conditions or other policies — for example, by creating an account, by logging into a site, or by clicking ‘ok’ or ‘I agree’ to the site’s terms — you must comply with the policies that you have agreed to. 
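As a rough illustration of the two points above, a project could record these conditions for each target site and flag the ones that need a terms review before any extraction starts. This is a minimal sketch only: the `ScrapeTarget` class, its field names and the `review_reasons` function are hypothetical, are not part of Zyte API, and do not replace a legal review.

```python
from dataclasses import dataclass


@dataclass
class ScrapeTarget:
    """Hypothetical record of what we know about a target site."""
    url: str
    requires_login: bool = False
    behind_paywall: bool = False
    terms_explicitly_accepted: bool = False


def review_reasons(target: ScrapeTarget) -> list[str]:
    """Return the reasons (if any) this target needs a terms review first."""
    reasons = []
    if target.requires_login:
        reasons.append("data sits behind a login page")
    if target.behind_paywall:
        reasons.append("data sits behind a paywall")
    if target.terms_explicitly_accepted:
        reasons.append("site terms were explicitly accepted")
    return reasons


# Usage: an empty list means none of these particular flags were raised.
public_site = ScrapeTarget(url="https://example.com/products")
member_site = ScrapeTarget(url="https://example.com/members", requires_login=True)
```

An empty result only means that none of these three flags apply; the other risk areas still need to be assessed separately.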


While this requires a site-by-site analysis for all projects, Zyte API protects against some of these risks by automatically blocking login for a large number of sites whose Terms of Service prohibit web scraping. This significantly reduces the risk of breaching website terms or policies, as Zyte API will not permit any attempt to access content behind the login page of those restricted sites.


Recently, a court in California made a significant ruling dealing with some of these issues in the ongoing litigation between Meta and Bright Data. For our analysis of this ruling, see our blog post: Court Rules Meta's Terms Do Not Prohibit Scraping of Public Data.


Personal data


By now, you should all be familiar with the EU’s General Data Protection Regulation (the GDPR). However, this area is becoming increasingly complex as other countries around the world bring in their own jurisdiction-specific personal data regulations. In particular, we are seeing a number of US state laws coming into effect this year.


It is important to stay on top of these developments to ensure that your project complies with the applicable personal data laws. 


In order to help you remain compliant, we have designed the AI-powered automatic extraction functionality in Zyte API so that it does not extract personal data fields in most cases. This means that, if you are using our smart spiders or our automatic extraction features, you shouldn’t end up with personal data that you weren’t expecting in your dataset. 


Where personal data is included within a schema, it is restricted to publicly available personal data for which the lawful basis for processing and a balancing of the data subjects’ rights have been considered. For example, if you are scraping articles, the author field is included in the schema but the names of commenters on an article are not. You will still need to conduct your own analysis based on the jurisdiction you are in, but our AI-powered automatic extraction provides a good level of protection against data protection concerns.
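If you add your own post-processing, the same idea can be applied as an extra safeguard: keep only the fields you expected and log anything else for review. The sketch below is an assumption-laden illustration, not a description of how Zyte API works internally. The field names in `ALLOWED_ARTICLE_FIELDS` and the `keep_expected_fields` helper are hypothetical and do not reflect Zyte's actual schema.

```python
from typing import Any

# Hypothetical allowlist for an article record; the field names are
# illustrative only and are not taken from Zyte's schema.
ALLOWED_ARTICLE_FIELDS = frozenset({"headline", "author", "datePublished", "url"})


def keep_expected_fields(
    record: dict[str, Any],
    allowed: frozenset[str] = ALLOWED_ARTICLE_FIELDS,
) -> tuple[dict[str, Any], list[str]]:
    """Split a record into the fields we expected and the names we dropped."""
    kept = {key: value for key, value in record.items() if key in allowed}
    dropped = sorted(key for key in record if key not in allowed)
    return kept, dropped


# Usage: the commenter names are removed and reported for review.
raw = {
    "headline": "Example headline",
    "author": "A. Writer",
    "url": "https://example.com/article",
    "commenterNames": ["user1", "user2"],
}
clean, dropped_fields = keep_expected_fields(raw)
```

An allowlist is used here rather than a blocklist so that an unexpected new field is dropped by default instead of slipping through.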


Copyright


One of the first factors to consider when assessing a web scraping project is whether or not the information you are seeking is protected by copyright. By its nature, data on someone else’s website is likely to be owned by them, but not all data is subject to copyright protection. Factual data (for example, a product name and price) is unlikely to be protected by copyright. But a creative or original work (for example, an article or image) is very likely to be protected by copyright.


If the data you are seeking includes copyrighted material, you need to determine if your use would constitute an infringement of that copyright. If so, you need to assess whether your use falls within an exception. Zyte’s Terms of Service also set out restrictions relating to the external use of web data. By complying with our Terms of Service, you are also more likely to stay on the right side of copyright laws. 


However, the simplest way of dealing with copyrighted material is to descope it from your project. To this end, we have excluded the most common types of potentially copyrighted data, including image, video, PDF and music downloads, from our AI automatic extraction feature. This means that you shouldn’t inadvertently infringe someone’s copyright.
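For custom crawls that sit outside automatic extraction, a similar descoping step can be approximated by filtering download URLs by file extension. This is a minimal sketch under stated assumptions: the extension list and the `is_descoped` and `filter_download_queue` helpers are hypothetical, and an extension check is only a heuristic, since it says nothing about whether a given file is actually protected by copyright.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical list of file types to keep out of scope (images, video,
# PDFs, music). Adjust to your own compliance assessment.
DESCOPED_EXTENSIONS = frozenset({
    ".jpg", ".jpeg", ".png", ".gif", ".webp",
    ".mp4", ".webm", ".mov",
    ".pdf",
    ".mp3", ".wav", ".flac",
})


def is_descoped(url: str) -> bool:
    """Return True if the URL path ends in a descoped file extension."""
    path = urlparse(url).path  # ignores the query string and fragment
    return PurePosixPath(path).suffix.lower() in DESCOPED_EXTENSIONS


def filter_download_queue(urls: list[str]) -> list[str]:
    """Keep only the URLs that are not descoped, preserving order."""
    return [url for url in urls if not is_descoped(url)]


# Usage: the PDF and the image are removed from the queue.
queue = [
    "https://example.com/product/123",
    "https://example.com/files/report.PDF?download=1",
    "https://example.com/img/photo.jpg",
]
in_scope = filter_download_queue(queue)
```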


Compliance partner for enterprise customers


We have extensive experience in web scraping best practices, with lawyers qualified in three key jurisdictions (US, UK and EU) who review hundreds of web scraping projects each year.


All Zyte API Enterprise customers receive compliance onboarding at the outset of a project. We provide a risk assessment to identify compliance risks and provide customers with information on the best next steps. We work with customers on any adjustments or preparatory work required to ensure compliance and, as customers expand their projects, we continue to work alongside them to help assess and mitigate risks along the way. 


Other risk areas


While there are no specific web scraping laws or regulations that tell you what you can and can’t do, there are a number of key risk areas and associated laws to navigate before commencing a web scraping project. Zyte API has been designed to help mitigate some of these risks, but there are other potential risk areas to be aware of, and each project needs to be assessed on a case-by-case basis. Most of these are set out in our Compliant Web Scraping Checklist, but we always recommend getting independent legal advice.


Zyte has a team of web scraping legal and compliance experts who can help guide you on your compliance journey. Just reach out at legal@zyte.com.