Welcome to part two of "The Scraper’s System" series.
If you haven’t read the introductory part yet, you can do so here.
In the first part, we discussed a template for defining the clear purpose of your web scraping system, one that helps you design your crawlers better and prepares you for the uncertainty involved in a large-scale web scraping project.
Step 1 clarifies the three W’s of a large-scale web scraping project: Why, What, and Where. These answers become the guiding North Star throughout the development process.
Step 2 of the framework helps you answer: “How do we extract the data?”
I also like to call this phase The Explorer’s Compass: like a navigational tool, it is something you must understand, along with some best practices, before you set sail through the target websites.
At Zyte, developers spend days analyzing target websites against four parameters that help design the high-level crawl logic and choose the most suitable technology stack for your project.
Web Scraping vs API – Which is the better option?
My answer to this question: it’s subjective and depends primarily on your business goals.
If you need to collect data from the same website repeatedly, an API is a suitable choice.
A good idea is to always check for API availability first and note all the data fields that can be extracted through the website’s APIs, rather than jumping straight to scraping.
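As a concrete way to do that comparison, you can collect the field names an API response actually exposes and check them against your target field list. The sketch below is a minimal illustration; the `"results"` key is an assumption about the payload shape, not a universal convention:

```python
def fields_from_payload(payload) -> set:
    """Collect the field names exposed by an API payload, so they can be
    compared against the project's target field list before any HTML
    scraping is written.

    Assumes the payload is either a list of records or a dict holding a
    "results" list (a common, but not universal, convention)."""
    records = payload if isinstance(payload, list) else payload.get("results", [])
    fields = set()
    for record in records:
        fields.update(record.keys())
    return fields

# Example: fields exposed by a hypothetical product API response
sample = {"results": [{"name": "A", "price": 9.99}, {"name": "B", "sku": "X1"}]}
print(sorted(fields_from_payload(sample)))  # ['name', 'price', 'sku']
```

If the resulting set covers every field your project needs, the API route is worth serious consideration before writing any crawler.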
Keep in mind that this may not hold true in every scenario. For example, if the target websites are e-commerce aggregators, an API would make sense; if the target websites are flight aggregators, maybe not.
When deciding between an API and web scraping, it’s also important to check how dynamic the target websites are, and to look at each website’s history, along with trends in that specific industry, to determine the likelihood of disruptive website changes that could break the crawlers.
Once again, clarify the business goal and make a list of all the data fields that can be extracted using the APIs provided by the target websites.
Let me give you two examples as a quick introduction to interactive elements on websites.
Once you see this, little red markers will light up in your head every time you interact with your favorite web or mobile applications.
If you feel like eating your favorite cake… open Google Maps and type in “Bakery near me”.
Did you notice those little red markers appear?
Open your favorite e-commerce website, check the exact availability of any product you want to buy – select the delivery location and enter the pin code.
Did you notice that this entire interaction happened without the page reloading?
After this, you cannot unsee such interactions across many applications over the web.
This is why the regular approach of fetching and parsing static HTML fails when it comes to scraping dynamic websites.
There are two alternative approaches:

- Render the page in a headless browser so the JavaScript executes and the dynamic content appears in the DOM.
- Inspect the background (XHR/fetch) requests the page makes and call those endpoints directly.
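A common alternative to rendering the page is to call the background endpoint the page itself uses, such as the one behind the pin-code availability check above. The sketch below is purely illustrative: the endpoint path and parameter names are hypothetical, and you would find the real ones in your browser’s Network tab:

```python
from urllib.parse import urlencode

def build_availability_url(base: str, product_id: str, pincode: str) -> str:
    """Build the URL of the background request a product page fires when a
    delivery pin code is entered. The "/availability" path and the
    "product"/"pincode" parameter names are hypothetical; discover the
    real ones in the browser's developer tools."""
    query = urlencode({"product": product_id, "pincode": pincode})
    return f"{base}/availability?{query}"

url = build_availability_url("https://example.com/api", "B07XYZ", "560001")
print(url)  # https://example.com/api/availability?product=B07XYZ&pincode=560001
```

Calling such endpoints directly is usually cheaper and more reliable than driving a headless browser, when the site’s terms and the endpoint’s stability allow it.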
This part of the analysis is boolean: either your requests go through, or you start seeing HTTP 429 (Too Many Requests) responses.
It’s hard to pinpoint what goes on behind the scenes to block your requests.
Read this blog post, in which Akshay Philar covers the most common measures websites use to block requests and shows how to overcome them with Zyte Data API Smart Browser.
To summarize, these are some of the defensive measures that can get you blocked:

- IP rate limiting and blacklisting
- CAPTCHAs
- Browser and TLS fingerprinting
- Request header validation
- Honeypot traps
In this process, try to figure out the level of protection the target website uses. This helps you answer whether a rotating proxy solution will be enough, or whether the project needs an advanced anti-ban solution like Zyte API, which takes care of bans of all types.
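Whatever anti-ban solution you choose, a standard first mitigation when requests start failing with 429s is to slow down using exponential backoff. Here is a minimal sketch of computing such a retry schedule (in production you would usually add random jitter; it is omitted here to keep the example deterministic):

```python
def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list:
    """Exponential backoff schedule, in seconds, for retrying requests
    rejected with HTTP 429. Each delay doubles the previous one, capped
    at `cap` so waits never grow unbounded."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]

print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Your crawler would sleep for each delay in turn before re-sending the failed request, and give up once the schedule is exhausted.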
Lastly, consider the number of steps required to extract the data: in some cases, all the target data isn’t available on a single page; instead, the crawler must make multiple requests to obtain it.

In these cases, estimate the number of requests that will need to be made, as this determines how much infrastructure the project will require.
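For a common listing-plus-detail-page crawl, that estimate is simple arithmetic. This sketch assumes one request per listing page plus, optionally, one request per record for its detail page; your site’s structure may differ:

```python
import math

def estimate_requests(num_records: int, records_per_page: int,
                      needs_detail_page: bool = True) -> int:
    """Rough request budget for a listing + detail-page crawl: one
    request per listing page, plus (optionally) one request per record
    to fetch its detail page."""
    listing_requests = math.ceil(num_records / records_per_page)
    detail_requests = num_records if needs_detail_page else 0
    return listing_requests + detail_requests

# e.g. 10,000 products, 50 per listing page, one detail page each
print(estimate_requests(10_000, 50))  # 10200
```

Even a back-of-the-envelope number like this tells you whether you are provisioning for thousands or millions of requests per crawl.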
The complexity of iterating through records: certain sites have more complex pagination (infinite scrolling, for example) or formatting structures that can require a headless browser or complex crawl logic. This helps you determine the type of pagination and crawl logic required to access all the available records.
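For the straightforward next-link case, the crawl logic reduces to a loop that follows pages until there is no next page. Here is a minimal, framework-agnostic sketch; the `fetch` callable stands in for whatever actually performs the HTTP request:

```python
def crawl_pages(first_page, fetch):
    """Generic pagination loop: `fetch` is any callable that takes a page
    token (for example, a URL) and returns (records, next_page_or_None).
    Yields records until pagination ends."""
    page = first_page
    while page is not None:
        records, page = fetch(page)
        yield from records

# Simulated three-page site standing in for real HTTP fetches
pages = {
    "page1": ([1, 2], "page2"),
    "page2": ([3, 4], "page3"),
    "page3": ([5], None),
}
print(list(crawl_pages("page1", pages.__getitem__)))  # [1, 2, 3, 4, 5]
```

Infinite scrolling and JavaScript-driven pagination don’t expose a next link this cleanly, which is exactly when a headless browser or XHR inspection becomes necessary.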
To conclude, try the following exercise first.

Answer the questions below to ensure you’ve fully understood "The Explorer’s Compass" and are ready to move forward with The Scraper’s System.