In the fifth and final post of this solution architecture series, we will share how we architect a web scraping solution, the core components every well-optimized solution needs, and the resources required to execute it.
To show this process in action, we will also walk through examples of real projects we've scoped for our clients.
But first, let’s take a look at the main components you need for every web scraping project…
Disclaimer: I am not a lawyer, and the recommendations in this guide do not constitute legal advice. Our Head of Legal is a lawyer, but she’s not your lawyer, so none of her opinions or recommendations in this guide constitute legal advice from her to you. The commentary and recommendations outlined below are based on Zyte's (formerly Scrapinghub) experience helping our clients (startups to Fortune 100s) maintain GDPR compliance while scraping billions of web pages each month. If you want assistance with your specific situation then you should consult a lawyer.
There are a few core components to every web scraping project that you need to have in place if you want to reliably extract high-quality data from the web at scale:
However, depending on your project requirements, you might also need other technologies to extract the data you need:
The resources required to develop and maintain a project are determined by the type and frequency of the data needed and the complexity of the project.
Talking about the building blocks of web scraping projects is all well and good, but the best way to see how to scope a solution is to look at real examples.
In the first example, we'll look at one of the most common web scraping use cases: product monitoring. Every day Zyte (formerly Scrapinghub) receives numerous requests from companies looking to develop internal product intelligence capabilities through web scraped data. Here is a typical example:
Project requirements: The customer wanted to extract product data from specific product pages on Amazon.com. They would provide a batch of search terms (~500 keywords per day), and the crawlers would search for those keywords and extract all products associated with them.
The extracted data would be used in a customer-facing product intelligence tool for consumer brands looking to monitor their own products alongside those of their competitors.
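To make the crawl flow concrete, here is a minimal sketch of what a keyword-driven product spider might look like in Scrapy. It is illustrative only: the search URL pattern, CSS selectors, and output fields are hypothetical placeholders, not the selectors or code used in the actual project.

```python
import scrapy


class ProductSearchSpider(scrapy.Spider):
    """Illustrative keyword-driven product spider (all selectors are placeholders)."""
    name = "product_search"

    def __init__(self, keywords_file="keywords.txt", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The daily batch of ~500 search terms supplied by the client.
        with open(keywords_file) as f:
            self.keywords = [line.strip() for line in f if line.strip()]

    def start_requests(self):
        for keyword in self.keywords:
            # Hypothetical search URL pattern for the target site.
            url = f"https://www.example.com/s?k={keyword}"
            yield scrapy.Request(url, callback=self.parse_results, cb_kwargs={"keyword": keyword})

    def parse_results(self, response, keyword):
        # Placeholder selectors: follow every product link on the results page.
        for href in response.css("a.product-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product, cb_kwargs={"keyword": keyword})

        # Paginate through the remaining result pages, if any.
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_results, cb_kwargs={"keyword": keyword})

    def parse_product(self, response, keyword):
        yield {
            "keyword": keyword,
            "url": response.url,
            "title": response.css("h1::text").get(default="").strip(),
            "price": response.css(".price::text").get(),
        }
```

In a production setup, spiders like this are typically paired with proxy rotation, request retries, and data validation pipelines, but those concerns are left out of the sketch for brevity.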
Legal assessment: Typically, extracting product data poses very few legal issues, provided that the crawler (1) doesn't have to scrape behind a login (which it usually doesn't), (2) only scrapes factual or non-copyrightable information, and (3) the client doesn't want to recreate the target website's whole store, which could bring database rights into question. In this case, the project posed little to no legal challenge.
Technical feasibility: Although this project required large-scale crawling of a complex and ever-changing website, projects like this are Zyte's bread and butter. We have considerable experience delivering similar (and more complex) projects for clients, so this was a very manageable project. We would also be able to reuse a considerable amount of existing code, allowing us to get the project up and running very quickly for the client.
Solution: After assessing the project, the solution architect developed a custom solution to meet the client's requirements. The solution consisted of three main parts:
Outcome: Zyte (formerly Scrapinghub) successfully implemented this project for the client. The crawlers developed now extract ~500,000 products per day from the site, which the client inputs directly into their customer-facing product monitoring application.
In the next example, we're going to take a look at a more complex web scraping project that required us to use artificial intelligence to extract article data from more than 300 news sources.
Project requirements: The customer wanted to develop a news aggregator app that would curate news content for their specific industries and interests. They provided an initial list of 300 news sites they wanted to crawl; however, they indicated that this number was likely to rise as their company grew. The client required every article in specific categories to be extracted from all the target sites, crawling each site every 15 minutes to every hour depending on the time of day. The client needed to extract the following data from every article:
Once extracted, this data would be fed directly into their customer-facing app, so high-quality, reliable data was a critical requirement.
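The "every 15 minutes to every hour, depending on the time of day" requirement can be expressed as a simple schedule table that a crawl scheduler consults before queuing the next run of each site. The hour ranges and intervals below are illustrative assumptions, not the client's actual schedule.

```python
from datetime import datetime, timedelta

# Illustrative schedule: crawl more often during peak news hours.
# (Hour ranges and intervals are placeholders, not the client's real settings.)
CRAWL_SCHEDULE = [
    (range(6, 22), timedelta(minutes=15)),  # daytime: every 15 minutes
    (range(0, 6), timedelta(hours=1)),      # overnight: hourly
    (range(22, 24), timedelta(hours=1)),
]


def next_crawl_interval(now: datetime) -> timedelta:
    """Return how long to wait before the next crawl of a site."""
    for hours, interval in CRAWL_SCHEDULE:
        if now.hour in hours:
            return interval
    return timedelta(hours=1)  # conservative default


if __name__ == "__main__":
    print(next_crawl_interval(datetime.now()))
```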
Legal assessment: With article extraction, you always want to be cognizant of the fact that the articles are copyrighted material of the target website. You must ensure that you are not simply copying an entire article and republishing it. In this case, since the customer was aggregating the content internally and only republishing headlines and short snippets, it was deemed that this project could fall under the fair use doctrine of copyright law. There are various copyright considerations and use cases to take into account when dealing with article extraction, so it is always best to consult your legal counsel first.
Technical feasibility: Although the project was technically feasible, the scale of the project (developing high-frequency crawlers for 300+ websites) raised the natural concern that it would be financially unviable to pursue.
As a rule of thumb, it takes an experienced crawl engineer 1-2 days to develop a robust and scalable crawler for one website. A rough calculation quickly shows that manually developing 300+ crawlers would be very costly, even at a single workday per crawler.
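Putting rough numbers on that rule of thumb makes the problem obvious:

```python
# Back-of-the-envelope effort estimate for hand-coding one crawler per site,
# using the 1-2 days-per-crawler rule of thumb and the initial list of 300 sites.
sites = 300
days_per_crawler = (1, 2)

low, high = (sites * d for d in days_per_crawler)
print(f"{low}-{high} engineer-days of development effort")
# => 300-600 engineer-days, i.e. well over a year of one engineer's time,
#    before accounting for ongoing maintenance as the sites change.
```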
With this in mind, our solution architecture team explored the use of AI-enabled intelligent crawlers that would remove the need to code custom crawlers for every website.
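In practice, this means sending each article URL to a generic, machine-learning-based extraction service instead of writing per-site parsing code. The sketch below shows the general shape of such a call; the endpoint, authentication scheme, request format, and response fields are hypothetical stand-ins for illustration, not a specific Zyte API.

```python
import requests

# Hypothetical AI extraction endpoint and API key (placeholders for illustration).
EXTRACTION_API = "https://extraction.example.com/v1/extract"
API_KEY = "your-api-key"


def extract_article(url: str) -> dict:
    """Send a URL to an AI-based extraction service and return structured article data."""
    response = requests.post(
        EXTRACTION_API,
        auth=(API_KEY, ""),
        json=[{"url": url, "pageType": "article"}],
        timeout=30,
    )
    response.raise_for_status()
    # Hypothetical response shape: a list with one result per requested URL.
    return response.json()[0].get("article", {})


if __name__ == "__main__":
    data = extract_article("https://news.example.com/some-story")
    print(data.get("headline"), data.get("datePublished"))
```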
Solution: After conducting the technical feasibility assessment, the solution architect developed a custom solution to meet the client's requirements. The solution consisted of three main parts:
Outcome: Zyte successfully implemented this project for the client, who is now able to extract 100,000-200,000 articles per day from the target websites for their news aggregation app.
So there you have it: this is the four-step process Zyte uses to architect solutions for our clients' web scraping projects. At Zyte we have extensive experience architecting and developing data extraction solutions for every possible use case.
Our legal and engineering teams work with clients to evaluate the technical and legal feasibility of every project and develop data extraction solutions that enable them to reliably extract the data they need.
If you need to start or scale your web scraping project, our solution architecture team is available for a free consultation, where we will evaluate and architect a data extraction solution to meet your data and compliance requirements.
At Zyte we always love to hear what our readers think of our content and to answer any questions you might have. So please leave a comment below telling us what you thought of the article and what you are working on.