
The Scraper’s System: a secret sauce to architect scalable web scraping applications

Dear Web Scraping Developers, 

Suppose you have been assigned to develop a competitive intelligence system for an ecommerce company.

Staring at the Statement of Work opened on your laptop screen, your thoughts must be wandering in all directions - where do you start?  

A scalable web scraping project is a classic example of a complex system: a dynamic situation made up of changing problems that interact with each other. When you try to understand any complex system, you start by looking into its elements, then its sub-elements, and the sub-elements of its sub-elements, and soon you are lost in this multiverse.

“We become what we behold. We shape our tools and then our tools shape us.” - Marshall McLuhan. And what we behold is dependent on how we perceive the world around us. 

This quote always makes me wonder: isn’t perception the key that makes or breaks everything we do? Can we train ourselves to become better at perceiving the problems around us? How do we think and act better?

Systems Thinking

If we know how to think about and perceive these complex systems, we can optimize our energy, time and resources to increase the probability of success in anything we do. The key is to be able to see that “a system isn’t just any old collection of things. Everything is interconnected in a System that is coherently organized in a way that achieves something.”

Well, that was a brief introduction to Systems Thinking. Ever since I came to understand System and Design Thinking, I have been fascinated by the systems and processes around me that operate at their best, and I love to analyze what makes them effective!

At Zyte, developers are at the core of the system that scrapes websites and delivers quality data. So I started talking to experienced developers to understand how the system at Zyte manages to scrape close to 13 billion pages per month.

To my surprise, I found that although the developers differ in their style of coding, the thought process behind designing the web scraping project to scrape data at scale is almost the same.

Understanding this process gave birth to a framework that I have called “The Scraper’s System - a secret sauce”.

A framework for scalable web scraping

The Scraper’s System is an eight-step framework that guides you in building a large-scale web scraping application with the highest possible success rate.

I have tried to simplify the complex process of architecting a scalable web scraping project by applying System Design Thinking, to help you plan better for your data extraction projects. The framework does this by:

  1. Preparing you for the most common, and some uncertain, challenges that arise from the dynamic nature of the ever-changing web.
  2. Sharing some best practices.
  3. Nudging you to answer some basic yet important questions.

Welcome to Part One of the Seven-Part Blog Series.

This part discusses in detail the first step of the framework: Clarify The Goal.

Part 1 - Clarify The Goal

This graph represents the results of a survey of our customers. It gave us a shocking revelation: most of them fall into Level 1, the Ad-hoc Maturity Level, when it comes to creating the business case.

The graph shows:

  • No documented business use case.
  • Poor understanding of the costs of web data.
  • Inappropriate success KPIs.

This often leads to more chaos while scaling a web scraping project.

Examples: 

  • Lack of clarity on where to find useful data and how to use it.
  • Not deciding the scale up front, so resources cannot be planned accordingly.
  • Not being able to prioritize project attributes.
  • No defined data schema.
  • Struggling to determine the quality of the data.

Defining a clear purpose for your web scraping system helps you design your crawlers better and prepares you for the uncertainty involved in a large-scale web scraping project.

How to start a scalable web scraping project

Start your project by answering these basic yet important questions that lay a smooth path for the entire project. 

  1. What do you want the data for?
  2. What is the business problem at hand that needs this data?
    There are many reasons why you might need data, each tied to the business problem you want to solve. The blog post “What is web scraping used for?” describes multiple use cases and problems in detail. If you are specifically interested in e-commerce, there is also a detailed blog post on scraping e-commerce websites and leveraging product data at scale.
  3. Where would you find that data?
    Do you know which sites have suitable data to extract? This question forces you to spend time defining the target websites for your project.
  4. What kind of data do you need to solve it?

After defining the target websites, the next step is to clearly capture what data we want to extract from the target web pages.

One of the best methods to clearly capture the scope of the data extraction is to take screenshots of the target web pages and mark them with fields that need to be extracted.

Oftentimes during calls with our solution architects, we will run through this process with customers to ensure everyone understands exactly what data is to be extracted.
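Once the fields have been marked on the screenshots, it can help to write them down as an explicit data schema. The following is a minimal sketch in Python, not a description of how Zyte does it: the `ProductRecord` class, the `validate` helper, and the `url` and `currency` fields are all hypothetical names chosen for illustration, while the product name, price and details fields follow the pricing example used in this post.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Optional


@dataclass
class ProductRecord:
    """Hypothetical schema for one scraped product page."""
    url: str
    product_name: str
    product_price: Decimal
    currency: str
    product_details: Optional[str] = None  # optional: not every page has it


# Fields that must be present and non-empty for a record to be usable.
REQUIRED_FIELDS = ("url", "product_name", "product_price", "currency")


def validate(raw: dict) -> ProductRecord:
    """Turn a raw scraped dict into a ProductRecord, or raise ValueError."""
    missing = [name for name in REQUIRED_FIELDS if raw.get(name) in (None, "")]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    try:
        price = Decimal(str(raw["product_price"]))
    except InvalidOperation:
        raise ValueError(f"price is not a number: {raw['product_price']!r}")
    # Reject NaN, infinities and negative prices.
    if not price.is_finite() or price < 0:
        raise ValueError(f"price is out of range: {price}")
    return ProductRecord(
        url=raw["url"],
        product_name=raw["product_name"].strip(),
        product_price=price,
        currency=raw["currency"].upper(),
        product_details=raw.get("product_details"),
    )
```

A schema like this gives everyone the same definition of a "complete" record, which also makes the later question of data quality easier to answer.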


Interviewing more than 40 industry representatives (and our internal teams, which serve over 5000 customers) about this step gave birth to the Web Data Maturity Model - a detailed guide to clarify where you are in the journey of extracting data.

You can also watch the entire webinar on the Web Data Maturity Model.

Conclusion

To summarize, try filling in the "purpose statement" before you start the project.


Example:

The purpose is to build a product pricing comparison that will optimize pricing strategy by monitoring competitors’ prices, for which I need to extract the product name, product price and product details from these websites.
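If you prefer to keep the purpose statement next to your project code, the example above can be captured as a small structure that also tells you which parts are still unanswered. This is only a sketch under my own assumptions: the `PurposeStatement` class, its field names and the `example` domains are hypothetical and not part of any Zyte tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PurposeStatement:
    """Hypothetical container for the answers to the questions above."""
    what: str                 # what you are building
    goal: str                 # the business problem it solves
    method: str               # how the data helps solve it
    data_fields: List[str]    # which fields to extract
    target_sites: List[str]   # where the data lives

    def missing_parts(self) -> List[str]:
        """Names of the parts that are still empty."""
        return [name for name, value in vars(self).items() if not value]

    def render(self) -> str:
        """Render the statement as one sentence."""
        return (
            f"The purpose is to build {self.what} that will {self.goal} "
            f"by {self.method}, for which I need to extract "
            f"{', '.join(self.data_fields)} from {', '.join(self.target_sites)}."
        )


statement = PurposeStatement(
    what="a product pricing comparison",
    goal="optimize pricing strategy",
    method="monitoring competitors' prices",
    data_fields=["product name", "product price", "product details"],
    target_sites=["shop-a.example", "shop-b.example"],
)
print(statement.missing_parts())
print(statement.render())
```

An empty `missing_parts()` list is a quick check that every question from this part has at least a first answer before any crawler is written.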

I hope this has helped you better understand the first step of the framework, and that you are now better prepared to start a scalable web scraping project.
