In the fourth post of this solution architecture series, we will share with you our step-by-step process for evaluating the technical feasibility of a web scraping project.
After completing the legal review of a potential project, the next step should be to assess the technical feasibility and complexity of executing it successfully.
A bit of upfront testing and planning can save countless wasted hours down the line, compared to developing a fully featured solution only to hit a technical brick wall.
The technical review process focuses on the four key parts of the web scraping development process:
- Data discovery
- Data extraction
- Extraction at scale
- Data delivery
We will look at each one of these individually, breaking down the technical questions you should be asking yourself at each stage.
Using the project requirements gathered during the requirement gathering process, you should have all the information you need to start assessing the technical feasibility of the project.
The first step of the technical feasibility process is investigating whether it is possible for your crawlers to accurately discover the desired data as defined in the project requirements.
For most projects, where you know the exact websites you want to extract data from and can manually navigate to the desired data, there are usually no technical challenges in developing crawlers to discover this data. The crawler just needs to replicate the sequence of user actions you would take to reach the data.
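For illustration, here is a minimal sketch in Scrapy of what such a crawler might look like, assuming a hypothetical e-commerce site where the product details sit two clicks away from a category page. The URLs and CSS selectors are placeholders, not taken from a real project; the point is simply that the spider mirrors the clicks a user would make.

```python
import scrapy


class ProductSpider(scrapy.Spider):
    """Sketch of a crawler that replicates a simple user navigation:
    category page -> product listing -> product detail page."""
    name = "product_discovery"
    # Placeholder start URL -- in practice this comes from the project requirements.
    start_urls = ["https://example.com/categories"]

    def parse(self, response):
        # Step 1: follow each category link, as a user would click through.
        for href in response.css("a.category-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_listing)

    def parse_listing(self, response):
        # Step 2: open every product on the listing page.
        for href in response.css("a.product-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)
        # Step 3: paginate the listing, mirroring the "next page" click.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_listing)

    def parse_product(self, response):
        # The detail page is where the target data actually lives.
        yield {
            "url": response.url,
            "name": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }
```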
Typically, the main reason for technical challenges at the data discovery phase is that you don't know how to discover all the data manually yourself.
For example, the most common technical challenge we run into during the data discovery stage is when a client would like to extract a specific type of data but either doesn't know which websites to extract it from or has a massive list of 100+ sites they would like scraped.
In cases like these, we can investigate the viability of discovering the data they require using a number of approaches we have at our disposal. Each of these approaches has its limitations, most notably around data quality and coverage. As a result, we generally recommend developing custom crawlers for each website when data quality is a priority.
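To make that trade-off concrete, here is a rough sketch of one generic discovery approach (not necessarily the one we would choose for a given project): a link-following spider that crawls a list of seed sites and surfaces candidate pages by URL pattern. The domains and the URL pattern are placeholder assumptions.

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BroadDiscoverySpider(CrawlSpider):
    """Illustrative broad crawler: follows internal links on a list of seed
    sites and flags pages whose URLs look like product pages."""
    name = "broad_discovery"
    # Placeholder seed list -- in a real project this could be the client's 100+ sites.
    start_urls = ["https://site-one.example", "https://site-two.example"]
    allowed_domains = ["site-one.example", "site-two.example"]

    rules = (
        # Hand URLs matching a product-like pattern to the callback;
        # the pattern itself is an assumption for this sketch.
        Rule(LinkExtractor(allow=r"/(product|item)/"), callback="parse_candidate", follow=True),
        # Keep following all other internal links to discover more pages.
        Rule(LinkExtractor(), follow=True),
    )

    def parse_candidate(self, response):
        # Record candidate pages; quality and coverage checks happen downstream.
        yield {"url": response.url, "title": response.css("title::text").get()}
```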
The next step is to verify the technical feasibility of the data extraction phase.
Here the focus is on verifying that the data can be accurately extracted from the target websites and on assessing the complexity involved. Our solution architecture team will run a series of tests that let them design the extraction process and verify that it is technically possible to extract the data at the required quality.
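As one example of the kind of quick probe involved, the snippet below checks whether the target fields are present in the raw HTML at all, or only appear after JavaScript rendering. The URL and selectors are placeholder assumptions for the sake of illustration.

```python
import requests
from parsel import Selector

# Placeholder URL and selectors -- in a real review these come from the
# project requirements and a manual inspection of the page.
URL = "https://example.com/product/123"
SELECTORS = {
    "name": "h1::text",
    "price": ".price::text",
    "sku": "[itemprop=sku]::text",
}

response = requests.get(URL, timeout=30)
sel = Selector(text=response.text)

for field, css in SELECTORS.items():
    value = sel.css(css).get()
    status = "OK" if value else "MISSING (may need JS rendering or another source)"
    print(f"{field:>6}: {status} -> {value!r}")
```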
Once this step is complete, the solution architect will then investigate the feasibility of extracting the data at the required scale.
With the ability to discover and extract the data at a small scale verified, the next step is to confirm that the project can be executed at the required scale and speed to meet the project requirements. This is often the most difficult and troublesome area of web scraping, and the area where a good crawl engineer's experience and expertise really shines.
Provided there are no legal issues or glaring data discovery/extraction problems, it is normally feasible to develop a small-scale crawler without running into difficulties. However, as the scale and speed requirements of the crawl increase, you can quickly run into trouble.
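To give a flavour of the tuning involved, these are the kinds of knobs a crawl engineer typically reaches for first in a Scrapy project when scaling up while staying polite. The numbers are illustrative placeholders, not recommendations for any particular site.

```python
# Illustrative Scrapy settings for scaling a crawl while staying polite.
# The values are placeholders; real numbers depend on the target sites,
# the required throughput, and the project's compliance constraints.
CONCURRENT_REQUESTS = 32            # overall parallelism across all sites
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap the load on any single site
DOWNLOAD_DELAY = 0.5                # base delay between requests to a domain

# Let Scrapy adapt concurrency to how the sites respond.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Retry transient failures rather than silently dropping pages.
RETRY_ENABLED = True
RETRY_TIMES = 3
```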
With this assessment, our solution architecture team is able to build a deep understanding of the complexity of the project and the difficulty of delivering and maintaining it at scale.
The final technical feasibility step is verifying that the data delivery format and method are viable. In most cases, there are very few issues at this stage, as we have numerous data formats and delivery methods available to satisfy any customer requirement.
However, certain options are more complex than others because they require more development time (for example, developing a custom API).
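As an illustration of how flexible the simpler options are, Scrapy's feed exports can deliver the same items in several formats and to several destinations from configuration alone. The bucket name and paths below are placeholders.

```python
# Illustrative Scrapy feed-export configuration: the same scraped items can be
# delivered in several formats and to several destinations at once.
FEEDS = {
    "output/items.csv": {"format": "csv"},
    "output/items.jsonl": {"format": "jsonlines"},
    # Delivery straight to cloud storage (requires botocore and S3 credentials);
    # the bucket and key are placeholders.
    "s3://example-client-bucket/exports/items-%(time)s.jsonl": {
        "format": "jsonlines",
    },
}
```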
The most common source of added complexity is a requirement for data post-processing or data science work, which can significantly increase the scope and effort of the project.
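For a sense of what that post-processing can look like, here is a minimal sketch of a Scrapy item pipeline that normalises raw price strings into a numeric value plus a currency code. The field names and parsing rules are assumptions made purely for the example.

```python
class NormalisePricePipeline:
    """Illustrative post-processing step: turn a raw price string such as
    '1.299,00 €' or '$1,299.00' into a float plus a currency code.
    Field names and parsing rules are placeholder assumptions."""

    CURRENCY_SYMBOLS = {"€": "EUR", "$": "USD", "£": "GBP"}

    def process_item(self, item, spider):
        raw = (item.get("price") or "").strip()
        for symbol, code in self.CURRENCY_SYMBOLS.items():
            if symbol in raw:
                item["currency"] = code
                raw = raw.replace(symbol, "").strip()
                break
        # Keep only digits and separators, then normalise the decimal mark.
        cleaned = "".join(ch for ch in raw if ch.isdigit() or ch in ",.")
        if "," in cleaned and "." in cleaned:
            # Assume the right-most separator is the decimal mark.
            if cleaned.rfind(",") > cleaned.rfind("."):
                cleaned = cleaned.replace(".", "").replace(",", ".")
            else:
                cleaned = cleaned.replace(",", "")
        elif "," in cleaned:
            # Assume a lone comma is a decimal mark (a simplifying assumption).
            cleaned = cleaned.replace(",", ".")
        item["price"] = float(cleaned) if cleaned else None
        return item
```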
So there you have it: those are the four steps to conducting a technical feasibility review of your web scraping project. In the next article in the series, we will share how we take the project requirements and the complexity analysis and develop a custom solution.
At Zyte we have extensive experience architecting and developing data extraction solutions for a wide range of use cases.
Our legal and engineering teams work with clients to evaluate the technical and legal feasibility of every project and develop data extraction solutions that enable them to reliably extract the data they need.
If you need to start or scale a web scraping project, our solution architecture team is available for a free consultation, where we will evaluate and architect a data extraction solution to meet your data and compliance requirements.
At Zyte (formerly Scrapinghub) we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.