Zyte's journey – from skepticism to transformation
Zyte has used AI to power data acquisition since 2019. We fed annotated page data to machine learning models, training our engine to extract data from certain content types automatically.
That allowed us to launch features that help our users quickly extract data from specific content types – including articles, jobs, products, ecommerce listings and search results – in standardized schemas. Many of our customers love using these capabilities today.
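For a concrete picture, here is a minimal sketch of requesting one of those standardized records. The endpoint and the `product` parameter follow Zyte API's public documentation; the URL, the placeholder key, and the fields printed at the end are illustrative.

```python
import requests

# Minimal sketch of requesting a standardized product record from Zyte API.
# The endpoint and "product" parameter follow the documented extract call;
# the placeholder key, URL, and printed fields are illustrative.
response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_ZYTE_API_KEY", ""),  # API key as HTTP Basic username
    json={
        "url": "https://example.com/product/123",  # hypothetical page
        "product": True,  # ask for the standardized product schema
    },
)
product = response.json()["product"]
print(product.get("name"), product.get("price"))
```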
At the time, this was the cheapest and most effective way to access data from the web without parsing it manually: it accelerated time to data across huge numbers of site types, reduced the maintenance cost of parsing, and opened up avenues to large horizontal scale.
Easy data in custom schemas
But those standard schemas don't cover everything. Data gatherers often need very specific page data – say, an ingredients list, a delivery cost, a list of product tags, or even an analysis of a product's target audience – that doesn't fit neatly into any pre-existing schema. Meeting such requirements could entail custom code, retrained models, and a constant race against mark-up changes.
Many people have theorised that a language-centric AI engine could do a fine job of finding, understanding and extracting structured data directly from a given page source, without needing to rely on selectors. The finding and understanding are straightforward; with generative AI, however, returning the identified content still consumes LLM output tokens, even though that content was already present in the supplied page mark-up. "Generating" long attributes – like product descriptions or article bodies – with an LLM can burn through a lot of tokens, so the trick is to do it efficiently.
Data acquisition solutions need to scale to handle billions of records, so the quest was on to build a low-token method of extracting page content. We began by instructing an LLM to extract text directly. For instance, if a certain product characteristic is buried within a long product description, our LLM engine does a fantastic job of pulling out that property by generating the extracted value.
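As an illustration of that first step – the model generating a short value itself – here is a minimal sketch using the OpenAI Python client. The model name, prompt, and page text are placeholders; Zyte's internal engine is not shown here.

```python
# Sketch of short-attribute extraction: the LLM *generates* the value, which
# is cheap when the value is only a few tokens long. Model name, prompt, and
# page text are illustrative.
from openai import OpenAI

client = OpenAI()
page_text = "Crafted from 100% organic cotton, this tee is soft and durable."

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # any capable chat model
    messages=[
        {"role": "system",
         "content": "Extract the requested attribute. Reply with the value only."},
        {"role": "user",
         "content": f"Attribute: material\n\nPage text:\n{page_text}"},
    ],
)
print(completion.choices[0].message.content)  # e.g. "100% organic cotton"
```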
But when the task requires a larger text to be returned – say, the whole product description or the complete body of a news article – we found LLMs to be costly and slow.
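A back-of-the-envelope comparison shows why. The token counts and per-token price below are illustrative assumptions, not measured figures, but the gap between generating content and merely referencing it holds at any realistic prices.

```python
# Rough cost of "generating" a long attribute vs. emitting a node reference.
# All numbers are illustrative assumptions, not Zyte's actual figures.
description_tokens = 800                   # a typical long product description
node_reference_tokens = 3                  # e.g. a short ID like "n41"
price_per_output_token = 10 / 1_000_000    # $10 per million output tokens

records = 1_000_000_000                    # billion-record scale
generate_cost = records * description_tokens * price_per_output_token
reference_cost = records * node_reference_tokens * price_per_output_token
print(f"generate: ${generate_cost:,.0f}  vs  reference: ${reference_cost:,.0f}")
# generate: $8,000,000  vs  reference: $30,000
```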
To tackle these cases, we developed an internal HTML cleaning and annotation pre-processing strategy that, coupled with fine-tuning of the LLM, allows the system to extract data by asking the LLM to identify only the most relevant document nodes.
These node references use only two or three tokens each, and act as signposts that guide the parsing of the target content from our original page source. The final value is precisely the relevant piece of HTML – but the LLM never had to spend tokens "generating" it.
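Here is a toy sketch of the idea: number the document's nodes, ask the model only for the ID of the node that holds the target content, then slice that node's HTML out locally. The annotation scheme is made up, and the stubbed `ask_llm` stands in for the fine-tuned model; Zyte's actual cleaning and annotation pipeline differs.

```python
# Toy sketch of node-reference extraction with lxml. The data-n annotation
# scheme and ask_llm() stub are illustrative, not Zyte's implementation.
from lxml import html

raw_html = (
    "<html><body><h1>Organic Tee</h1>"
    "<div class='desc'>Crafted from 100% organic cotton…</div>"
    "</body></html>"
)

def annotate(tree):
    """Tag every element with a short numeric ID the LLM can point at."""
    nodes = {}
    for i, el in enumerate(tree.iter()):
        el.set("data-n", str(i))
        nodes[str(i)] = el
    return nodes

def ask_llm(prompt):
    """Stand-in for the fine-tuned model: it answers with a node ID only."""
    return "3"  # a 2-3 token reference instead of the full description

doc = html.fromstring(raw_html)
nodes = annotate(doc)
annotated = html.tostring(doc, encoding="unicode")

node_id = ask_llm(f"Which data-n node holds the product description?\n{annotated}")
print(html.tostring(nodes[node_id], encoding="unicode"))
# <div class="desc" data-n="3">Crafted from 100% organic cotton…</div>
```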
Conjuring data from words
Then came the "aha!" moment – or rather, a series of small realisations.
In 2023, we started experimenting with LLMs for free-form extraction of unpredictable, longer content from pages. Our first attempts were messy: the LLM would sometimes hallucinate or go off on irrelevant tangents. But there was something intriguing – a glimmer of genuine understanding.
We realised we could use natural language to describe exactly what we wanted to extract from a page, and the LLM, more often than not, would deliver.
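To illustrate, a request can be as simple as a set of plain-English field descriptions handed to the model. The field names, prompt shape, and page text below are illustrative; this is not Zyte API's request format.

```python
# Sketch: describe the wanted fields in plain English and ask the model to
# fill them in as JSON. Field names, prompt shape, and page_text are all
# illustrative.
import json

page_text = "Granola – oats, honey, almonds. Ships for $4.99 within 3 days."
wanted = {
    "ingredients": "the list of ingredients, as printed on the page",
    "deliveryCost": "the cheapest delivery price, as a number",
}
prompt = (
    "Extract these fields from the page. Reply with JSON only.\n"
    f"Fields: {json.dumps(wanted, indent=2)}\n\nPage:\n{page_text}"
)
# prompt would then go to a chat model, as in the earlier sketch.
```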
The key insight was combining the power of LLMs with our existing, robust infrastructure. We didn't need to reinvent web scraping; we needed to give it a powerful new brain.
For those cases where directly returning the intended content was impractical, we developed a method to target specific HTML elements on the page. Think of it like this: instead of asking the LLM to rewrite an entire book, we asked it to first create a detailed table of contents, complete with page numbers. From there, our existing systems could quickly and efficiently retrieve the referenced content.
This approach dramatically reduced the number of tokens the LLM needed to generate, making extraction far more efficient, accurate, and cost-effective.
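Concretely, that "table of contents" can be as small as a field-to-node mapping. The field names and IDs below are made up:

```python
# Illustrative model output: a field -> node-reference "table of contents".
# Each reference costs only 2-3 output tokens; the parser (not shown) then
# slices the referenced nodes out of the original HTML, as sketched earlier.
toc = {
    "articleBody": "n17",   # long content the LLM never had to generate
    "author": "n5",
    "publishedAt": "n8",
}
```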
The result? Since January 2025, Zyte API's AI Extraction has been smart enough to find complex combinations of non-standard on-page content.