In our article last week, we answered some of the best questions we got during Extract Summit. In today’s post, we share with you the second part of this series. We are covering questions on web scraping infrastructure and how machine learning can be used in web scraping.
How do you do unit testing and integration testing in web scraping?
Spider unit testing is difficult because it depends on the actual website, which you don't control. You can use spider contracts for testing.
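Spider contracts check a callback against the live site. A complementary approach, sketched here as an assumption rather than something covered in the answer above, is to keep the parsing logic in a plain function and test it against a saved copy of the page, so the test does not depend on the website being reachable. The function name `extract_title`, the `SAVED_PAGE` fixture, and the use of the standard library's `html.parser` are all illustrative choices.

```python
from html.parser import HTMLParser

# Hypothetical saved response; in practice this would be read from a fixture file.
SAVED_PAGE = """
<html><head><title>Blue Widget | Example Shop</title></head>
<body><h1 class="product-name"> Blue Widget </h1></body></html>
"""


class _FirstH1Parser(HTMLParser):
    """Collects the text inside the first <h1> element of a page."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only enter the first <h1>; later ones are ignored.
        if tag == "h1" and self.title is None:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.title = (self.title or "") + data


def extract_title(html):
    """Return the stripped text of the first <h1>, or None if there is none."""
    parser = _FirstH1Parser()
    parser.feed(html)
    return parser.title.strip() if parser.title else None


# Tests written as plain functions so a runner such as pytest can collect them.
def test_title_from_saved_page():
    assert extract_title(SAVED_PAGE) == "Blue Widget"


def test_page_without_h1():
    assert extract_title("<html><body></body></html>") is None
```

Because the fixture is frozen, a test like this only tells you that the parsing code still works on the page as it looked when you saved it; it will not notice that the live site has changed.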
Can you tell us more about what you use for storing the crawled information: a NoSQL or an SQL database? What are the advantages and disadvantages?
In Scrapy Cloud we use a NoSQL database for data storage. One of the advantages of NoSQL is the lack of a schema, which means you can change the Scrapy item definition at any time without any problems.
Do you have more information on the machines you use and the throughput each spider/machine has?
Here you can read more about the Scrapy Cloud architecture.
How much of your team’s time is taken up by data quality checking, given the dynamic nature of the sites you have to harvest?
It depends on the project. In some projects there is a dedicated QA team that checks data; in others there is no such need. Sometimes it is enough to add automated checks for typical problems and only do manual data checks after a major spider update.
How could you handle cookies and sessions without getting a headache?
If you don't want to develop your own implementation, use a high-level tool like Scrapy, which takes care of it for you. Otherwise, you have to build your own solution.
How do you parse price values when scraping multiple websites that use different formats? (E.g. 1000, 1.000, 1,000, 1 000)
Our price-parser open-source library handles these formats.
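To illustrate the kind of problem such a library solves, here is a deliberately simplified sketch that uses only the standard library. It is not price-parser's algorithm or API. The function name `parse_amount` and the rule "a final separator followed by exactly three digits is a thousands separator" are assumptions made for this example, and real prices have more edge cases (currencies, or "1.000" genuinely meaning one with three decimals).

```python
import re
from decimal import Decimal


def parse_amount(text):
    """Parse a numeric amount, guessing the role of '.', ',' and spaces."""
    # Keep only digits and the two separator characters; this also removes
    # spaces used for digit grouping and any currency symbols.
    cleaned = re.sub(r"[^\d.,]", "", text)
    if not re.search(r"\d", cleaned):
        return None

    last_sep = max(cleaned.rfind("."), cleaned.rfind(","))
    if last_sep == -1:
        return Decimal(cleaned)

    integer_part = cleaned[:last_sep]
    fraction = cleaned[last_sep + 1:]
    integer_digits = re.sub(r"[.,]", "", integer_part)

    # Assumption: exactly three trailing digits means the last separator
    # groups thousands rather than marking decimals.
    if len(fraction) == 3:
        return Decimal(integer_digits + fraction)
    return Decimal(f"{integer_digits or '0'}.{fraction or '0'}")


if __name__ == "__main__":
    for raw in ["1000", "1.000", "1,000", "1 000", "1.234,56", "19.99"]:
        print(repr(raw), "->", parse_amount(raw))
```

Under this heuristic all four formats from the question come out as 1000. The trade-off is that a value with three real decimal places would be misread, which is one reason to prefer a maintained library over a home-grown rule.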
How do you handle website layout changes?
When the layout changes, the data field locators need to be updated as well. Alternatively, use an AI tool like Zyte Automatic Extraction (formerly AutoExtract), which handles the changes for you.
What storage would you recommend for storing the visited and to-be-visited links?
If you mean storing URL strings, you can use anything: an HTTP API that returns a list of URLs to visit, a text file, or a database that the spider connects to in order to obtain the list of URLs to crawl. If you want some sort of delta crawl, where each page is visited only once and never revisited, you can use scrapy-deltafetch.
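As one concrete instance of the database option, here is a minimal sketch of a URL store built on SQLite from the standard library. The class name `UrlFrontier`, the table layout, and the visit-once policy are assumptions for illustration; this is not how scrapy-deltafetch works internally.

```python
import sqlite3


class UrlFrontier:
    """Tracks to-be-visited and already-visited URLs in a SQLite database."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " url TEXT UNIQUE NOT NULL,"
            " visited INTEGER NOT NULL DEFAULT 0)"
        )

    def add(self, url):
        """Queue a URL. Returns False if the URL was already known."""
        try:
            with self.conn:  # commits on success, rolls back on error
                self.conn.execute("INSERT INTO urls (url) VALUES (?)", (url,))
        except sqlite3.IntegrityError:
            # The UNIQUE constraint rejected a duplicate URL.
            return False
        return True

    def next_url(self):
        """Return the oldest unvisited URL and mark it visited, or None."""
        row = self.conn.execute(
            "SELECT id, url FROM urls WHERE visited = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with self.conn:
            self.conn.execute(
                "UPDATE urls SET visited = 1 WHERE id = ?", (row[0],)
            )
        return row[1]


if __name__ == "__main__":
    frontier = UrlFrontier()
    frontier.add("https://example.com/")
    frontier.add("https://example.com/")  # duplicate, ignored
    print(frontier.next_url())
```

Passing a file path instead of `":memory:"` keeps the visited set across runs, which is what gives the visit-once behaviour between crawls.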
Which database type do you recommend for storing HTML pages?
HTML responses can be huge, so storing them in a database can be challenging. It depends on your use case, but sometimes it makes sense to store responses in files rather than in a database.
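A minimal sketch of the file-based option, assuming you want one compressed file per URL. The function names, the SHA-256 based file naming, and the gzip compression are illustrative choices, not a description of any particular product.

```python
import gzip
import hashlib
from pathlib import Path


def response_path(root, url):
    """Map a URL to a stable file path under the given root directory."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Fan out into subdirectories by the first two hex characters so that
    # no single directory ends up holding every file.
    return Path(root) / digest[:2] / f"{digest}.html.gz"


def save_response(root, url, body):
    """Write the raw response body (bytes) to a gzip-compressed file."""
    path = response_path(root, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wb") as fh:
        fh.write(body)
    return path


def load_response(root, url):
    """Read back the raw response body previously saved for this URL."""
    with gzip.open(response_path(root, url), "rb") as fh:
        return fh.read()
```

A database can still hold the small, queryable parts (URL, fetch time, file path) while the bulky HTML stays on disk or in object storage.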
What is a preferred spider management tool?
There's only one spider management tool available to the public today, and that is Scrapy Cloud.
How do you monitor your spiders to get an alert when they become broken?
You can check whether the required data fields are populated, or whether the produced JSON matches the schema you expect. If something doesn't look right, send an alert.
Yes, there is an open-source Selenium middleware available. Using a headless browser takes considerably more resources, so it takes longer to execute.
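The field-population check can be sketched in a few lines. The function names, the 10% threshold, and the rule that an empty crawl is suspicious are assumptions for this example; how the alert is actually delivered (email, chat, pager) is left out.

```python
def find_problems(item, required_fields):
    """Return a list of human-readable problems found in one scraped item."""
    problems = []
    for field in required_fields:
        value = item.get(field)
        # Treat missing values and blank strings as unpopulated.
        if value is None or (isinstance(value, str) and not value.strip()):
            problems.append(f"missing or empty field: {field}")
    return problems


def should_alert(items, required_fields, max_bad_ratio=0.1):
    """Decide whether a finished crawl looks broken enough to alert on."""
    if not items:
        return True  # a spider that produced nothing is suspicious
    bad = sum(1 for item in items if find_problems(item, required_fields))
    return bad / len(items) > max_bad_ratio


if __name__ == "__main__":
    scraped = [
        {"name": "Blue Widget", "price": "19.99"},
        {"name": "Red Widget", "price": ""},
    ]
    if should_alert(scraped, ["name", "price"]):
        print("ALERT: too many items with missing fields")
```

Using a ratio rather than failing on the first bad item avoids alerts for the occasional page that legitimately lacks a field.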
How do you deal with websites hanging onto requests for a long time? 10min for example?
There is no easy way to deal with that. You need to determine why the request hangs. Does it hang because of the rate limit? If yes, use proxies. Does it hang because of a poor quality server? If this is the case, you cannot do much.
Is it recommended to use regular expressions when scraping data? Is there any alternative?
You could use CSS selectors or XPath instead. They are much easier to read and maintain, and there are libraries that support them. Only fall back to regex if the HTML is very messy and you cannot use CSS or XPath. Or use Zyte Automatic Extraction (formerly AutoExtract), and you don't have to deal with any of these.
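A small comparison shows why selectors tend to be easier to maintain. This sketch uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and needs well-formed markup; for real-world HTML you would normally use a dedicated selector library. The snippet and variable names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment. The two spans carry the same
# attributes but list them in a different order.
SNIPPET = """
<div>
  <span class="price" data-currency="EUR">19.99</span>
  <span data-currency="EUR" class="price old">24.99</span>
</div>
"""

# Regex: depends on the exact attribute order and class string, so it
# silently misses the second span.
regex_prices = re.findall(r'<span class="price"[^>]*>([^<]+)</span>', SNIPPET)

# XPath: selects by attribute value regardless of where it appears in the tag.
root = ET.fromstring(SNIPPET.strip())
xpath_prices = [el.text for el in root.findall(".//span[@data-currency='EUR']")]

if __name__ == "__main__":
    print("regex:", regex_prices)
    print("xpath:", xpath_prices)
```

The regex finds only the first price, while the XPath query finds both, because it works on the parsed structure instead of the raw characters.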
What are the advantages of using Scrapy over Apify SDK?
Scrapy is more mature: it has been on the market for several years, it has more configuration options, and it is easier to extend.
Are there any good open-source libraries that simplify parsing and extraction of human names?
There are useful libraries made by datamade.
What are some eminent researchers/papers on the topic of fully automatic web information retrieval?
It looks like most of the good research happens behind closed doors at commercial entities (search engines, web data extraction companies). You need to do your own research to get good results.
Can it handle NER? Pre-learned entities or arbitrary entities?
No, Zyte Automatic Extraction (formerly AutoExtract) doesn't have this feature yet.
How many annotations do you need to make to train your own model?
We use hundreds of thousands of pages from different domains. It could work with thousands as well.
What ML frameworks do you use?
Do your models keep working after website layout changes? Do you need to retrain on those cases?
It keeps working. No need for additional training.
Can the model learn how to extract data from a website that it has never seen before?
Yes, it can.
Do you create different models based on the type of website (social, news media, blog...), or does one model work with all sites?
We create models for different website types, but the approach/architecture is the same. It is possible to have a single model, as a performance optimization.
How does the ML-based extraction approach at Zyte (formerly Scrapinghub) differ from tools available in libraries such as moz/dragnet? Do you collaborate?
We evaluated our article extraction against dragnet and other open-source libraries; they were not useful for us in the end, because the quality is much worse. We're not contributing to them, because the approach is very different.
Is Zyte Automatic Extraction built on top of YOLO?
What about meta-ML challenges such as explaining model outputs to humans?
ML explainability is an area of active research and development. To contribute, we created this open source library which includes many explanation methods. Recently image support was added to this library.
If you feel like we missed an important question, leave a comment below and we will try our best to answer it. Also, if you missed the Extract Summit but are interested in the talks, you can access the recordings here.