Getting hold of clean, accurate web data – quickly, and in a format that’s easy to manage – is a struggle for many organizations. One solution is hiring a couple of enthusiastic interns to copy and paste the information you’re looking for, but they’ll soon be struggling on larger-scale projects. Alternatively, you might use commercially available data extractor software or scraping apps. Or if you’re feeling brave you could try writing your own script for web scraping.
Let’s imagine you need to get product names or prices from an e-commerce marketplace for comparison purposes. Web scraping traditionally means identifying product pages on a site and then extracting relevant information from those pages. To achieve this, your developer needs to inspect the site’s source code and then write code to pick out the relevant bits, such as links, product names, and prices. There are some excellent tools for this, including CSS and XPath selectors and free, open-source frameworks like Scrapy.
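To give a feel for what that site-specific code looks like, here is a minimal sketch using only Python's standard library. The markup, class names and the `extract_products` helper are all hypothetical, and the snippet assumes well-formed markup; on a real site you would more likely reach for an HTML-aware tool such as Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product listing markup. Real pages are messier
# and are usually parsed with an HTML-aware library instead.
PAGE = """
<html><body>
  <div class="product">
    <a class="name" href="/p/1">Watercolour set</a>
    <span class="price">12.50</span>
  </div>
  <div class="product">
    <a class="name" href="/p/2">Sketch pad</a>
    <span class="price">4.99</span>
  </div>
</body></html>
"""

def extract_products(markup):
    """Pick out the link, name and price for each product block."""
    root = ET.fromstring(markup)
    products = []
    # ElementTree supports a small XPath subset, including attribute tests.
    for block in root.findall(".//div[@class='product']"):
        link = block.find("a[@class='name']")
        price = block.find("span[@class='price']")
        products.append({
            "url": link.get("href"),
            "name": link.text.strip(),
            "price": float(price.text),
        })
    return products

products = extract_products(PAGE)
```

Note how tightly the selectors are tied to this one page layout: if the site renames the `product` class or moves the price, the code has to change too.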
At Zyte, we believe every business should be able to access the web data it needs, quickly and cost-effectively. Released last year, our Automatic Extraction API makes it easy to extract structured data from web pages without having to write site-specific code. Just feed in the URLs of the individual pages you want to scrape, and the API serves up your requested data in a standard schema.
AI-powered Automatic Extraction harnesses deep learning methods, helping you to retrieve clean, accurate data in seconds rather than days or weeks. And it already supports more than 40 languages, making it easy to scrape web data literally all over the world.
Your own data extractor scripts can break if a web page changes, but Automatic Extraction reliably gets the data you want… even from dynamic sites. It’s a huge time saver, taking away the pain of having to maintain your own code. Our API also makes it possible to get data from many different web page types. Say you’re building a comparison tool, allowing your customers to browse prices and availability for high street fashions or automotive parts across lots of different sources. With Automatic Extraction, it’s easy to get reliable data aggregated from e-commerce sites, news articles, blogs and more.
At Zyte we always want to go one better. And now we’re excited to introduce our friendly self-serve interface for Automatic Extraction, letting you convert whole websites or specific pages to datasets in just a few clicks.
Wouldn’t it be fantastic if you could focus on profitable business activities instead of writing spiders to collect all those relevant page URLs? Automatic Extraction neatly meets this need, handling both extraction and crawling without manual intervention. It’s effectively a spidering service based on the same AI/machine learning technology used in our API. And it also features built-in ban management, using automatic IP rotation to prevent blocking so you don’t need to babysit every crawl.
Just select the data type you want to extract, enter your URLs, and let Automatic Extraction do its thing. Leave it running while you have another meeting. Then come back, check everything’s been OK, and fetch your clean, usable data. That’s all there is to it. Currently, we support news and article data and product data extraction, with more data types coming later this year.
Our Automatic Extraction API has helped Zyte clients get data far faster than manual extraction, but it still means pulling some developer resources away from other tasks.
Let’s say you want to extract products from the 'Arts and Crafts' category of an online marketplace. You’ll be faced with creating a list of all the URLs you wish to scrape. And that means writing code to crawl the web and collect individual page URLs for feeding into our API for extraction.
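As a rough illustration of that URL-collecting step, the sketch below pulls product links out of a single category page using Python's standard library. The sample HTML, the `marketplace.example` domain, the `/product/` URL pattern and the `ProductLinkCollector` class are all assumptions made for the example; a real crawler would also need to download pages, follow pagination and cope with blocking.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ProductLinkCollector(HTMLParser):
    """Collects absolute URLs of links whose href contains a marker."""

    def __init__(self, base_url, marker="/product/"):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # hypothetical URL pattern for product pages
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            url = urljoin(self.base_url, href)
            if url not in self.urls:  # skip duplicates, keep page order
                self.urls.append(url)

# Hypothetical category page fragment.
CATEGORY_HTML = """
<ul>
  <li><a href="/product/101">Acrylic paints</a></li>
  <li><a href="/product/102">Brush set</a></li>
  <li><a href="/about">About us</a></li>
  <li><a href="/product/101">Acrylic paints (again)</a></li>
</ul>
"""

collector = ProductLinkCollector("https://marketplace.example/arts-and-crafts")
collector.feed(CATEGORY_HTML)
product_urls = collector.urls
```

The resulting `product_urls` list is the kind of input you would then feed to an extraction API, one page URL at a time.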
Our smart automated crawler capability is accessed via the ‘Datasets’ tab on the left-side navigation. It’s easy to use, letting you get the data you want without any coding. Just select the data type you want and you’re ready to go. There are some customization options if you need them. Extraction Strategy lets you collect all products from a website (Full Extraction), or just fetch products within a particular category starting from the specified page (Category Extraction). You can also tweak extraction request limits. A low limit lets you experiment without worrying about running out of credits, while larger limits enable scalability and production crawls. Select how you want to receive your data by setting up S3 export, or skip this option and select built-in JSON export. And now you’re all ready to start getting data back. It’s as quick and easy as that.
Hit ‘Start’ and you’ll see products start appearing in seconds. There’s a preview of the items being extracted, plus crawl statistics showing the number of requests used out of your specified limit, crawl speed in requests per minute, and field coverage.
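If you export the data and want to sanity-check similar numbers yourself, the sketch below shows one way to compute them. It assumes that field coverage means the share of extracted items with a non-empty value for a given field, which may not match how the interface defines it; the `crawl_stats` function and the sample items are hypothetical.

```python
def crawl_stats(items, fields, requests_used, request_limit, elapsed_seconds):
    """Summarise a crawl: request budget used, speed and per-field coverage."""
    coverage = {}
    for field in fields:
        # Assumption: a field counts as covered when it is present and non-empty.
        filled = sum(1 for item in items if item.get(field) not in (None, ""))
        coverage[field] = filled / len(items) if items else 0.0
    minutes = elapsed_seconds / 60
    return {
        "requests_used": requests_used,
        "requests_remaining": request_limit - requests_used,
        "requests_per_minute": requests_used / minutes if minutes else 0.0,
        "field_coverage": coverage,
    }

# Hypothetical extracted items; one has no price, one has an empty price.
items = [
    {"name": "Acrylic paints", "price": "9.99"},
    {"name": "Brush set", "price": None},
    {"name": "Canvas", "price": "14.00"},
    {"name": "Easel"},
]
stats = crawl_stats(items, ["name", "price"], requests_used=4,
                    request_limit=100, elapsed_seconds=120)
```

A low coverage figure for a field you care about, such as price, is a useful early signal to check the crawl on a small request limit before scaling up.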
This friendly new web interface doesn’t displace our original API that lets developers seamlessly integrate Automatic Extraction into their own applications. And while fully automated crawling is super-easy, the API lets you deploy refined custom crawling strategies tailored to your own specific business needs.
Our powerful AI/machine learning model ‘sees’ a web page much as a human does, with a screenshot of the entire page and its visual elements as they’re rendered by an actual browser. These inputs are combined with the source code and processed by a deep neural network, allowing the model to understand relationships between different elements on a page. Trained on a wide range of different domains, the model also understands the inner structure of a page much as a web developer would.
We’ve recently switched to a new downloading and rendering backend that powers all Automatic Extraction requests, using modern browser engines and refined ban avoidance techniques. This has significantly reduced problems like banning and poor rendering, improving the quality of extraction results, especially for product pages.
Here’s a glimpse of some areas we’re looking at right now to improve Automatic Extraction even further: