CASE STUDY

Global retailer leverages Zyte AI Scraping and LLMs to collect unstructured data

Understanding the push and pull of supply and demand is fundamental in ecommerce. More businesses are recognizing that web data can provide data-driven, impactful and highly accurate product and pricing insights. With web data, you can widen the scope of your research targets and surface strategic product trends at scale. The result is a larger, more complete view of the industry, giving you an edge over the competition and a well-researched understanding of industry trends.


About


Our Zyte Data client is a large retailer operating in over 30 countries worldwide. They want to maintain a competitive advantage by offering competitively priced products in each of the markets they operate in.


Challenges


The client needs data-driven pricing intelligence to maintain competitiveness, affordability and profitability. Pricing products for different countries, each with its own market demand and competitors, is difficult. The client has a large number of target websites and requires the data delivered weekly. In addition to traditional structured product data, the client needs relevant product-related data buried in unstructured text. Unlike most product data, these data points aren’t consistently included in structured text, which makes them difficult for traditional web scraping efforts to access.


Solution


Creating and maintaining spiders for hundreds of websites, all with different configurations and anti-bot challenges, is a big task. On top of that, the traditional cat-and-mouse game of ban handling would require a large capital investment from the client.


Zyte Data was engaged by the client to:


  • gather product data from hundreds of different retail domains in over 30 countries,

  • gather data buried in unstructured text for each product,

  • transform certain data to the client’s preferred standard,

  • ensure data is ethically and legally acquired, and

  • deliver the data to the client’s storage on a weekly schedule.


Zyte Data took care of everything, and the client only had to provide the list of target domains. Zyte API’s AI Scraping reduced the time Zyte Data’s developers spent writing selectors and parsing code from scratch. The machine learning model powering AI Scraping accelerated spider implementation and successfully extracted, on average, 80% of the client’s data needs. The spiders used the standard Product data schema, with additional custom data fields created to fit the client’s needs.
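As a rough illustration of what requesting AI-extracted product data can look like, the sketch below builds a Zyte API extract request with the `product` field enabled and merges extra fields into the returned record. It is a minimal sketch, not the spiders Zyte Data built for this client: the helper names (`build_product_request`, `fetch_product`, `with_custom_fields`) and the example custom field are assumptions made for illustration only.

```python
import base64
import json
import urllib.request

ZYTE_API_URL = "https://api.zyte.com/v1/extract"


def build_product_request(url: str) -> dict:
    """Build the JSON body asking Zyte API for AI-extracted product data."""
    return {"url": url, "product": True}


def fetch_product(url: str, api_key: str) -> dict:
    """Send the request and return the extracted product record.

    Zyte API uses HTTP Basic auth with the API key as the username
    and an empty password.
    """
    body = json.dumps(build_product_request(url)).encode("utf-8")
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        ZYTE_API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response).get("product", {})


def with_custom_fields(product: dict, custom: dict) -> dict:
    """Hypothetical helper: return a copy of the product record with
    client-specific fields added on top of the standard ones."""
    merged = dict(product)
    merged.update(custom)
    return merged
```

A call such as `fetch_product("https://example.com/item/1", api_key)` would return the product record for that page; the custom fields themselves would come from a separate step, since the client's actual field list is not described here.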


Capturing certain very specific data from unstructured text needed a different solution. Training the patented ML model powering AI Scraping to correctly extract this data across all the differently configured websites would have taken too long. Instead, Zyte Data integrated a large language model (LLM)-based solution into the web scraping stack to gather the data.


After the unstructured description text is extracted using AI Scraping, it’s passed to the LLM, which extracts the data points. With their understanding of natural language, LLMs are well suited to this task. The applicable data is then converted to the client’s preferred standard.
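The flow described above (description text in, LLM-extracted data points out, then conversion to a preferred standard) could be sketched roughly as follows. Everything specific here is an assumption for illustration: the case study does not name the data points, the LLM provider, or the target standard, so the example uses a hypothetical net-weight attribute normalized to grams, and the LLM is passed in as a plain callable that takes a prompt and returns text.

```python
import json
from typing import Callable, Dict

# Hypothetical prompt; the real prompts and data points are not public.
PROMPT_TEMPLATE = (
    "Read the product description below and reply with a JSON object "
    "containing the keys weight_value (number) and weight_unit (string). "
    "Use null for anything that is not stated.\n\n"
    "Description: {description}"
)

# Assumed target standard: grams.
GRAMS_PER_UNIT = {"g": 1.0, "kg": 1000.0, "oz": 28.3495, "lb": 453.592}


def extract_attributes(description: str, llm: Callable[[str], str]) -> Dict:
    """Ask the LLM for the data points and parse its JSON reply."""
    raw = llm(PROMPT_TEMPLATE.format(description=description))
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # Unparseable reply: treat as "nothing extracted".
    return data if isinstance(data, dict) else {}


def enrich(product: Dict, llm: Callable[[str], str]) -> Dict:
    """Return a copy of the product with a normalized net_weight_g field
    when the LLM found a weight in a known unit."""
    attrs = extract_attributes(product.get("description", ""), llm)
    enriched = dict(product)
    value, unit = attrs.get("weight_value"), attrs.get("weight_unit")
    if isinstance(value, (int, float)) and isinstance(unit, str):
        factor = GRAMS_PER_UNIT.get(unit.lower())
        if factor is not None:
            enriched["net_weight_g"] = value * factor
    return enriched
```

Passing the LLM in as a callable keeps the sketch independent of any particular provider and makes the step easy to check with a stubbed response before real model calls are wired in.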

Results

  • Data Extraction at Scale: 100+ million requests/month

  • Data Extraction at Scale: 3+ million data items delivered/month

  • Development Time Saved: 25% reduced development time per spider

  • Quality and Accessibility: 99.9% success rate

Summary


With the help of Zyte Data, the client successfully acquired the product data needed for their price intelligence efforts. Zyte API’s AI Scraping was key to reducing development time and gathering the structured data for the client with little effort. LLMs strategically complemented Zyte API’s AI Scraping to gather the hard-to-get data accurately and quickly.