
Solution architecture part 4: Assessing the technical feasibility of your web scraping project

June 13, 2019

In the fourth post of this solution architecture series, we will share with you our step-by-step process for evaluating the technical feasibility of a web scraping project.

After completing the legal review of a potential project, the next step in every project should be to assess the technical feasibility and complexity of executing the project successfully.

A bit of upfront testing and planning can save you countless hours that would otherwise be wasted if you start developing a fully featured solution only to hit a technical brick wall.

Technical review process

The technical review process focuses on the four key parts of the web scraping development process:

  1. Data discovery
  2. Data extraction
  3. Extraction scale
  4. Data delivery

We will look at each one of these individually, breaking down the technical questions you should be asking yourself at each stage.

Using the project requirements we gathered in the requirement gathering process, you should have all the information you need to start assessing the technical feasibility of the project.

Step 1: Data discovery

The first step of the technical feasibility process is investigating whether it is possible for your crawlers to accurately discover the desired data as defined in the project requirements.

For most projects, where you know the exact websites you want to extract data from and can manually navigate to the desired data, there are usually no technical challenges in developing crawlers to discover this data. The crawler just needs to be designed to replicate the sequence of user actions you perform to reach the data.
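To make the idea of replicating a navigation sequence concrete, here is a minimal sketch using only the Python standard library. It assumes a hypothetical listing page (inlined as a string so it runs without network access) where detail pages live under "/product/" and pagination uses a "page=" query parameter. The URLs, the markers, and the names LinkCollector and discover are illustrative assumptions, not part of any real site or of our tooling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs of <a> tags whose href contains a marker."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def discover(html, base_url, marker):
    """Return every link on the page whose href contains `marker`."""
    collector = LinkCollector(base_url, marker)
    collector.feed(html)
    return collector.links


# Hypothetical listing page, inlined so the sketch needs no network access.
LISTING_HTML = """
<ul>
  <li><a href="/product/1">Widget</a></li>
  <li><a href="/product/2">Gadget</a></li>
</ul>
<a href="/listing?page=2">Next</a>
"""

BASE = "https://example.com/listing?page=1"

# Step 1 of the manual sequence: open each item on the listing page.
product_urls = discover(LISTING_HTML, BASE, "/product/")
# Step 2 of the manual sequence: move on to the next listing page.
next_pages = discover(LISTING_HTML, BASE, "page=")
```

A real crawler would repeat these two steps for every page it fetches, but the feasibility question is the same: if a person can follow the links by hand, a crawler can usually be written to follow them too.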

Typically, technical challenges arise at the data discovery phase when you don’t know how to discover all the data manually yourself:

  • you don’t know which websites contain your desired data; or,
  • you can’t filter the data on the target websites to only extract the desired data.

For example, the most common technical challenge we run into during the data discovery stage is when the client would like to extract a specific type of data but either doesn’t know the specific websites they would like to extract from or has a massive list of 100+ sites that they would like to scrape.

In cases like these, we can investigate the viability of discovering the data they require using a number of approaches we have at our disposal:

  • Data aggregators - as a result of our extensive experience of extracting data from the web, we can sometimes work with the client to identify existing websites that aggregate online web data and that, if scraped, could fulfill their data requirements. This is an ideal solution if the client knows what data they want to extract but doesn’t have a clear idea of which websites contain the required data.
  • Broadcrawls - in certain cases Zyte can configure a crawl that will search a specific region of the internet (a country domain, for example) for the type of data the client is looking for. This can be viable for certain data types (company or personal data). However, as this solution uses a generic data extraction crawler, the accuracy of the data extraction is much lower than the accuracy achievable with custom-designed crawlers. We often only recommend this solution if the data quality requirement is low.
  • Artificial intelligence - if the client has a list of 100+ websites they would like data extracted from but doesn’t have the budget to develop custom spiders for each, then we often recommend that they consider using an AI-assisted data extraction approach like Zyte Automatic Extraction API. This solution enables users to extract data from known websites without the need to develop custom code for each website, significantly decreasing the cost of the project. Currently, the Zyte Automatic Extraction API is only available for product and article data extraction; however, there are plans to add coverage for additional data types in the coming months.

Each of these approaches has its limitations, most notably data quality and coverage. As a result, we generally recommend that we develop custom crawlers for each website if data quality is a priority.

Step 2: Data extraction

The next step is to verify the technical feasibility of the data extraction phase.

Here the focus is on verifying that the data can be accurately extracted from the target websites and assessing the complexity involved. Our solution architecture team will run a series of tests to enable them to design the extraction process and verify that it is technically possible to extract the data at the required quality. These tests check for:

  • JavaScript/AJAX - as modern websites are increasingly using JavaScript and AJAX to dynamically display data on web pages there might be a requirement to use a headless browser to render this data for the crawlers or develop custom code to execute parts of the JavaScript without using a headless browser. The solution architect will run a series of tests to determine if the target data is rendered using JavaScript or AJAX.
  • The number of steps required to extract the data - in some cases the target data isn’t all available on a single page, and the crawler has to make multiple requests to obtain it. In cases like these, the solution architect will determine the number of requests that will need to be made, which in turn determines the amount of infrastructure the project will require.
  • The complexity of iterating through records - certain sites have more complex pagination (infinite scrolling pages, etc.) or formatting structures that can require a headless browser or complex crawl logic. The solution architect will determine the type of pagination and crawl logic required to access all the available records.
  • Data validation - a key component of every web scraping project is maintaining high data quality and coverage. As a result, before the project even starts our solution architect will make an assessment of the complexity of guaranteeing data quality and coverage as the project scales.
  • Difficulty to maintain - not only will the solution architect evaluate the difficulty of extracting the target data for the current website, but they will also look at the website's history and the trends in that industry to determine the likelihood of disruptive website changes occurring that would break the crawlers.
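One simple way to run the JavaScript/AJAX test from the list above is to check whether values you can see in the browser are already present in the raw HTML response. The sketch below is only an illustration under stated assumptions: the two sample pages are made up, the function names are hypothetical, and a plain substring check is a rough heuristic rather than a definitive test.

```python
import re


def strip_scripts(html):
    """Remove <script> blocks so only server-rendered markup remains."""
    return re.sub(
        r"<script\b.*?</script>", "", html, flags=re.DOTALL | re.IGNORECASE
    )


def missing_from_static_html(html, expected_values):
    """Return the expected values that are NOT in the server-rendered markup.

    An empty result suggests the data can be extracted without rendering
    JavaScript; a non-empty result suggests the values are injected client-side.
    """
    markup = strip_scripts(html)
    return [value for value in expected_values if value not in markup]


# Hypothetical responses: one rendered on the server, one built in the browser.
STATIC_PAGE = '<h1>Widget</h1><span class="price">19.99</span>'
DYNAMIC_PAGE = (
    '<div id="app"></div>'
    '<script>render({"name": "Widget", "price": "19.99"})</script>'
)

expected = ["Widget", "19.99"]  # values seen when viewing the page in a browser

static_missing = missing_from_static_html(STATIC_PAGE, expected)
dynamic_missing = missing_from_static_html(DYNAMIC_PAGE, expected)
```

Note that in the second sample the values still appear inside the script block. When that happens, it is often possible to parse the embedded data directly instead of using a headless browser, which keeps the crawl cheaper.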

Once this step is complete the solution architect will then investigate the feasibility of extracting the data at the required scale...

Step 3: Extraction scale

With the ability to discover and extract the data on a small scale verified, the next step is to verify that the project can be executed at the required scale and speed to meet the project requirements. This is often the most difficult and troublesome area when it comes to web scraping, and the area where a good crawl engineer's experience and expertise really shine.

Provided there are no legal or glaring data discovery/extraction issues, it is normally feasible to develop a small-scale crawler to extract data without running into any issues. However, as the scale and speed requirements of the crawl increase, you can quickly run into trouble:

  • Test Crawls - our solution architect will typically run a series of test crawls to investigate whether there will be any bottlenecks regarding maximum crawl speeds. This can often be an issue for smaller websites or when the project requires a tight time window to complete the data extraction but there is a risk it could overload the site's servers (for example hourly crawls).
  • Anti-Bot Countermeasures - our solution architect will run the target websites through our internal analysis tools to identify the presence of any anti-bot countermeasures, captchas, or CDNs that will increase the complexity of the project and limit the potential to extract the data at the required scale or frequency. The presence of these technologies can increase the risks of bans or reliability issues.

With this information, our solution architecture team is able to get a deep understanding of the complexity of the project and the difficulty of delivering and maintaining it at scale.
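A quick back-of-the-envelope calculation shows why scale and time windows matter. The sketch below assumes made-up figures (100,000 records, 3 requests per record, a sustainable rate of 2 requests per second) and hypothetical function names; real numbers would come from the project requirements and the test crawls.

```python
def estimate_crawl_hours(records, requests_per_record, requests_per_second):
    """Hours needed to finish one full crawl at a steady request rate."""
    total_requests = records * requests_per_record
    return total_requests / requests_per_second / 3600


def min_requests_per_second(records, requests_per_record, window_hours):
    """Steady request rate needed to finish one crawl inside the window."""
    total_requests = records * requests_per_record
    return total_requests / (window_hours * 3600)


# Assumed figures, for illustration only.
RECORDS = 100_000
REQUESTS_PER_RECORD = 3   # e.g. listing page, detail page, one AJAX call
SUSTAINABLE_RATE = 2.0    # requests per second the site handles comfortably

# How long does a full crawl take at the sustainable rate?
hours_needed = estimate_crawl_hours(RECORDS, REQUESTS_PER_RECORD, SUSTAINABLE_RATE)

# What rate would an hourly crawl of the same data require?
rate_for_hourly = min_requests_per_second(RECORDS, REQUESTS_PER_RECORD, 1)
```

With these assumptions a full crawl takes well over a day at the sustainable rate, while an hourly refresh would need a rate dozens of times higher. A gap like that is exactly the kind of bottleneck the test crawls are meant to surface before development starts.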

Step 4: Data delivery

The final technical feasibility step is verifying that the data delivery format and method are viable. In most cases, there are very few issues at this stage as we have numerous data formats and delivery methods available to us to satisfy any customer requirement.

However, certain options might be more complex than others as some require more development time (for example, developing a custom API).

The most common source of added complexity is a project that requires data post-processing or data science to meet its requirements. Either can significantly increase the complexity of the project.
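As a small illustration of what post-processing before delivery can involve, the sketch below normalizes scraped records, removes duplicates, and serializes the result as JSON Lines (one JSON object per line). The records, field names, and functions are hypothetical and chosen only for this example; the right cleaning rules and delivery format depend on the project requirements.

```python
import json


def normalize(record):
    """Trim whitespace and turn a price string such as "$1,299.50" into a float."""
    return {
        "name": record["name"].strip(),
        "price": float(record["price"].replace("$", "").replace(",", "")),
        "url": record["url"],
    }


def dedupe(records, key="url"):
    """Keep the first record seen for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


def to_json_lines(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


# Hypothetical raw output of a crawl, including one duplicate.
RAW = [
    {"name": " Widget ", "price": "$1,299.50", "url": "https://example.com/product/1"},
    {"name": "Widget", "price": "$1,299.50", "url": "https://example.com/product/1"},
    {"name": "Gadget", "price": "$5", "url": "https://example.com/product/2"},
]

clean = dedupe([normalize(record) for record in RAW])
output = to_json_lines(clean)
```

Even a simple pipeline like this adds rules that have to be agreed on, tested, and maintained, which is why post-processing requirements are worth identifying during the feasibility review.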

So there you have it: those are the four steps to conducting a technical feasibility review of your web scraping project. In the next article in the series, we will share with you how we take the project requirements and the complexity analysis and develop a custom solution.

Your web scraping project

At Zyte we have extensive experience architecting and developing data extraction solutions for every possible use case.

Our legal and engineering teams work with clients to evaluate the technical and legal feasibility of every project and develop data extraction solutions that enable them to reliably extract the data they need.

If you need to start or scale your web scraping project, our solution architecture team is available for a free consultation, where we will evaluate and architect a data extraction solution to meet your data and compliance requirements.

At Zyte (formerly Scrapinghub) we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.

Written by Zyte team