To accurately extract data from a web page, developers usually need to write custom code for each website. This is manageable, and recommended, for tens or hundreds of websites where data quality is of the utmost importance. But if you need to extract data from thousands of sites, or rapidly extract data from sites that are not yet covered by existing code, writing site-specific code is often an insurmountable challenge.
The complex and resource-intensive nature of developing code for each individual website acts as a bottleneck, severely curtailing the scope of companies' data extraction and analysis capabilities.
Nowhere is real-time data extraction at scale more needed than in e-commerce and media monitoring, where the ability to monitor products on any online store or news from thousands of media outlets would take a company’s business intelligence capabilities to a completely new level.
Zyte’s new Automatic Extraction API has been specifically designed for real-time e-commerce & article extraction at scale, and we’re now opening it up to beta users for a limited time period.
At the core of the Zyte Automatic Extraction API is an AI-enabled data extraction engine able to extract data from a web page without the need to design custom code. Through the use of deep learning, computer vision, and Zyte Smart Proxy Manager (formerly Crawlera), the data engine is able to automatically identify common items on product and article web pages and extract them without the need to develop and maintain extraction rules for each site.
With this AI technology, developers and companies now have the ability to extract product data from e-commerce sites without having to write custom data extraction code for each website.
As with any machine learning-based solution, the output is likely to have lower coverage and accuracy than the output of custom-developed code.
However, after much testing and refinement with alpha users, our data science team has improved our machine learning technology and operational processes to the point that the data extraction engine is capable of yielding commercially viable data quality for users.
"When Kinzen was faced with regularly obtaining quality news articles from the open web we knew we had to get people in who had done it before. After an evaluation of the market, Zyte (formerly Scrapinghub) were an obvious choice with their years of experience in web scraping. We have not regretted that decision.
Not only have they lived up to their promises, but the quality of their output and their responsiveness have exceeded our expectations. Their technology and know-how is without par on the market. We have no hesitation in recommending them."
Paul Watson - CTO & Co-Founder - Kinzen, the news app that puts you in control
The key to this success has been Zyte's (formerly Scrapinghub) 10+ years of experience being at the forefront of web scraping technologies and extracting over 8 billion pages per month. This experience and scale have enabled us to overcome a lot of the technical challenges faced by AI-enabled data extraction engines and design a solution that is viable for commercial applications.
Ideally suited for developers, the API offers a flexible and highly scalable data extraction engine for large-scale data analysis and visualization applications.
The AI-enabled web scraping technology used as part of the API has the potential to unlock the web's full potential, turning the web into the world’s largest structured database.
Now, instead of having to manually develop and maintain code for each new website, you can simply configure your applications to send their queries to the Zyte Automatic Extraction API and receive structured data, ready for analysis, in response.
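To make that request and response flow concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not the documented interface: the endpoint URL is a placeholder, and the `url` and `pageType` field names and the use of HTTP Basic authentication with the API key are assumptions. The documentation issued with your API key is the authority on all of these.

```python
import base64
import json
import urllib.request

# Placeholder endpoint (assumption): use the URL from the beta documentation.
API_ENDPOINT = "https://api.example.com/v1/extract"


def build_queries(urls, page_type):
    """Build one query per URL. The field names are assumptions for illustration."""
    return [{"url": url, "pageType": page_type} for url in urls]


def build_request(api_key, queries, endpoint=API_ENDPOINT):
    """Prepare (but do not send) a JSON POST request.

    Assumes the API key is passed as the username in HTTP Basic auth.
    """
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    return urllib.request.Request(
        endpoint,
        data=json.dumps(queries).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def extract(api_key, urls, page_type):
    """Send the queries and return the decoded JSON response."""
    request = build_request(api_key, build_queries(urls, page_type))
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

With a real endpoint and key in place, a call such as `extract(api_key, ["https://example.com/some-product"], "product")` would return whatever structured data the service sends back for that page.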
Not only does this capability enable developer teams to build highly scalable data extraction pipelines, but it also enables data science teams to rapidly prototype and test the value of data science projects, and it can serve as a backup to your existing custom-built code if that code were ever to break.
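The backup scenario can be sketched as a small wrapper that tries your own extractor first and falls back to the API only when that extractor fails or returns an incomplete item. This is one possible design, not a prescribed one: `custom_extractor` and `api_extractor` are hypothetical callables that take a URL and return a dictionary, and the `required_fields` check is an assumed definition of "broken".

```python
def extract_with_fallback(url, custom_extractor, api_extractor,
                          required_fields=("name",)):
    """Return (item, source), preferring the custom extractor.

    Falls back to the API extractor when the custom code raises or
    returns an item that is missing any of the required fields.
    """
    try:
        item = custom_extractor(url)
    except Exception:
        # Treat any failure in site-specific code as "broken".
        item = None

    if item and all(item.get(field) for field in required_fields):
        return item, "custom"
    return api_extractor(url), "api"
```

Returning the source alongside the item makes it easy to log how often the fallback is used, which is a useful signal that a site-specific extractor needs repair.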
Currently, there are two versions of the API, designed for two separate use cases: product extraction and article extraction.
Although we are initially focused on providing the API for product and article extraction, over time we plan to expand the types of data the API can automatically extract to include company/people profile data, real estate, reviews, etc., further enhancing the accessibility of the web’s data.
If you are interested in e-commerce or media monitoring and would like to get early access to the Zyte Automatic Extraction API, be sure to sign up for the public beta program.
When you sign up, you will be issued an API key, along with documentation on how to use the API. From there you are free to use the Zyte Automatic Extraction API for your own projects, and you retain ownership of the data you have extracted when the beta program closes.
What's even better, there is zero cost involved with the beta program. You will be assigned a daily/monthly request quota which you are free to consume as you wish.
The beta program will run until July 9th, 2019, so if you’d like to be involved, be sure to sign up today, as places are limited.