
Teaching AI to scrape like a pro: how we measure LLMs’ data quality

Read Time
10 min
Posted on
February 23, 2026
AI-enabled code editors can now conjure scraping code on command. But is it any good? Here’s how Zyte re-engineered LLMs with Web Scraping Copilot to drive best-in-class output.

In the past couple of years, AI coding assistants have gone from magic power to business-as-usual. You open your code editor, type a comment, and a Large Language Model (LLM) fills in the blanks.


But there's a problem. When you ask a general AI assistant to write code, it's pulling from billions of lines of examples. Very few of them show what good code looks like.


For specialist coding tasks like web scraping, that's problematic: scraping code, especially high-quality scraping code, is relatively under-represented in the global training data.


So, how can AI help write good scraping code? This was a key design consideration when we built Web Scraping Copilot, an AI-powered Visual Studio Code extension that specializes in generating and managing web scraping code.

We wanted to help web scraping developers not just code faster, but also to ship good scraping code, fast.


So, what does "good" scraping code actually look like, and how do you get it?

What makes scraping code ‘good’?

Before teaching Web Scraping Copilot how to generate quality code, we first had to define what constitutes quality in the specific case of web scraping.


The data industry has never really formalized such a standard, so Zyte needed to build a measurement system where none existed.


We turned to our team of hundreds of Scrapy experts, distilling their experience in creating accurate and maintainable Scrapy code into three quantifiable dimensions covering data accuracy and code maintainability.

| Variable | Area | Measurement test |
| --- | --- | --- |
| ROUGE-1 F1 adj | Data accuracy | The code extracts the right data with the right values. |
| Source lines of code (SLOC) | Code complexity | The code is tight, with nothing superfluous. |
| Cyclomatic complexity | Code complexity | The logic is simple and understandable. |

If we could define what “accuracy” and “maintainability” actually mean, we could score them.


1. The accuracy challenge: Measuring messy data


Measuring the accuracy of extracted web data is trickier than it seems, because the desired output is not always fully specified by the prompt.


So we adapted a metric from natural language processing called ROUGE-1 F1, which measures token-level overlap between texts, and extended it to handle structured web data.


This metric gives partial credit for values that differ in formatting but are semantically equivalent (like “24.99”, “$24.99”, and “24.99 USD”) – letting us score thousands of extraction attempts without penalizing harmless variations.


ROUGE-1 F1 scores fall on a sliding scale from 0 to 1, with higher values indicating higher accuracy. With this benchmark in our toolset, we could be confident that we were not skipping relevant data.
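Zyte's exact adjustment to ROUGE-1 F1 isn't spelled out here, but the core token-overlap score can be sketched in plain Python. The `tokens` normalizer below is an assumption for illustration: it lowercases and splits on non-alphanumeric characters, which is what lets values like "$24.99" and "24.99 USD" partially match.

```python
import re

def tokens(value: str) -> list[str]:
    # Lowercase and split on non-alphanumeric runs, so "$24.99" and
    # "24.99 USD" both yield the tokens "24" and "99" (plus "usd").
    return [t for t in re.split(r"[^0-9a-z]+", str(value).lower()) if t]

def rouge1_f1(reference: str, candidate: str) -> float:
    # Unigram-overlap F1 between a known-correct value and an extracted one.
    ref, cand = tokens(reference), tokens(candidate)
    if not ref or not cand:
        return 1.0 if ref == cand else 0.0
    remaining = ref.copy()
    overlap = 0
    for t in cand:
        if t in remaining:          # clipped counting: each reference
            remaining.remove(t)     # token can be matched only once
            overlap += 1
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# "$24.99" vs "24.99 USD": precision 2/3, recall 1.0 -> F1 = 0.8
```

Scoring a structured record then reduces to applying this per field and averaging, which is how partial credit for harmless formatting variations falls out naturally.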

2. The maintainability challenge: The leaner, the better


Scraping code that picks up accurate data also needs to be easily understood and adapted as websites change.


The first signal we look at for maintainability is the length of the code generated.


We record the source lines of code (SLOC) in each generated spider.


Fewer lines generally mean less surface area for bugs and lower maintenance cost over time. Keeping SLOC low encourages spiders that are focused, declarative, and easier to reason about.
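As a rough illustration (not Zyte's internal tooling), SLOC can be approximated by counting non-blank, non-comment lines; the `spider_src` sample below is hypothetical:

```python
def sloc(source: str) -> int:
    # Count lines that carry actual code: skip blanks and full-line comments.
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )

spider_src = '''\
import scrapy

# a full-line comment is not counted
class QuotesSpider(scrapy.Spider):
    name = "quotes"
'''
# sloc(spider_src) -> 3
```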


3. The complexity challenge: Going deep without getting lost


Yet, code length alone doesn’t tell the full story. Two spiders with the same number of lines can vary in how easy they are to understand.


That’s where cyclomatic complexity comes in.


Cyclomatic complexity measures how many independent decision paths exist in a piece of code - essentially, how many branches, conditionals, and forks a reader has to keep in their head at once.


Lower values are generally better: they indicate linear, predictable logic that is easier to test and modify. Higher values suggest brittle code where small changes can have unintended side effects.

| Cyclomatic complexity score range | Interpretation |
| --- | --- |
| 1–10 | Simple to moderate complexity. Low risk. |
| 11–20 | Moderate complexity. Warrants careful review. |
| 21–40 | Complex. Difficult to test and maintain. |
| Above 40 | Unmaintainable. |

A well-structured spider would typically land at around five to 15.
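For Python, tools such as radon report cyclomatic complexity out of the box; the idea itself can be sketched with the standard-library ast module. This is a simplified McCabe-style count, not the exact formula any particular tool uses:

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    # Start at 1 (straight-line code has one path), then add one for each
    # construct that introduces an independent branch.
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While,
                             ast.ExceptHandler, ast.IfExp)):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" short-circuits twice: n operands -> n - 1 branches.
            complexity += len(node.values) - 1
    return complexity
```

A spider whose parse logic scores above the mid-teens on a count like this is usually a candidate for splitting into smaller callbacks.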


Bringing it together


Taken together, these metrics let us evaluate a scraper from multiple angles at once. Here’s what that looks like for a single spider:

| Scraper name | rouge1_f1_adj | SLOC | Complexity |
| --- | --- | --- | --- |
| Product scraper for website A | 0.7955 | 35 | 6.25 |

In this example:


  • A ROUGE-1 F1 adj score of ~0.8 indicates good extraction accuracy, with minor acceptable variations in formatting.

  • 35 source lines of code suggests the scraper is compact.

  • A cyclomatic complexity of 6.25 means the logic is straightforward, with intuitive branching.


Together, they give us a practical, repeatable way to judge whether a scraper has good quality.

Iterating toward production quality

With our scoring system in place, we could move toward building a Visual Studio Code extension that reliably produces good scraping code.


For Web Scraping Copilot, that meant perfecting our own extension code and crafting embedded prompts that it uses to turn mass-market LLMs into expert spider generators.


We used the following process to establish target thresholds for each score:


  • Data accuracy: The team produced a source-of-truth dataset - a pre-assembled list of 1,250 on-page data fields, from hundreds of URLs, that are known to be correct. By comparing output from our LLM-produced spider code against the values known to be correct, we could make changes to nudge that rouge1_f1_adj score ever closer to 1.

  • Code complexity: Zyte specialists reviewed the SLOC and cyclomatic complexity scores for LLM-produced spiders, to assess whether the generated code met their expectations for clarity and structure.
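The accuracy step above amounts to a simple aggregation: score each extracted field against its known-correct value and average. The data shapes and the `score_fn` hook below are illustrative assumptions (in practice `score_fn` would be the adjusted ROUGE-1 F1 scorer):

```python
def dataset_accuracy(ground_truth: dict, extracted: dict, score_fn) -> float:
    # ground_truth / extracted map (url, field_name) -> value.
    # Fields the spider failed to extract score against the empty string.
    scores = [
        score_fn(expected, extracted.get(key, ""))
        for key, expected in ground_truth.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```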


After a couple of iterations, it became clear that good LLM-generated scraping code, on average, has a scorecard like this:

| rouge1_f1_adj | SLOC | Complexity |
| --- | --- | --- |
| 0.8+ | 30 to 40 | < 12 |

With these targets in place, improvements followed a reliable process: adjust prompts or tooling, re-run code generation, and check whether changes moved quality in the right direction across all metrics.
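Checked automatically, those targets become a regression gate for every prompt or tooling change. A minimal sketch, with threshold constants mirroring the scorecard (the names and exact comparisons are assumptions, not Zyte's internal tooling):

```python
# Target thresholds taken from the average scorecard for good spiders.
TARGETS = {"rouge1_f1_adj_min": 0.8, "sloc_max": 40, "complexity_max": 12}

def meets_quality_bar(rouge1_f1_adj: float, sloc: int, complexity: float) -> bool:
    # A generated spider passes only if every metric clears its threshold.
    return (rouge1_f1_adj >= TARGETS["rouge1_f1_adj_min"]
            and sloc <= TARGETS["sloc_max"]
            and complexity < TARGETS["complexity_max"])
```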


Sometimes, gains were obvious. Other times, they revealed trade-offs: a change might reduce the number of generation attempts needed to produce working code (good), while slightly hurting extraction accuracy (bad). In those cases, we only accepted changes when the overall outcome clearly delivered more value than it cost.

AI code quality is real, today

Today, Web Scraping Copilot consistently generates scraping code that meets the quality bar we set during development and does so in a measurable, repeatable way.


Just as importantly, these scores are not treated as a one-time gate. They are monitored continuously. Every prompt change, tooling adjustment, or model upgrade is evaluated against the same metrics to ensure quality does not regress as the system evolves. When we see improvements, we raise expectations. When tradeoffs appear, we consider them holistically.


Every iteration brings Web Scraping Copilot closer to thinking less like a generic AI coding assistant and more like a colleague who has spent years writing production scrapers.


And the beauty of scoring our own product’s output in this way is that we can apply the same approach to rating the relative quality of scraping code produced by any of the LLM models usable by the extension.

For instance, when Anthropic released Sonnet 4.6 in February 2026, Zyte’s research and development team was able to crunch the numbers to show how it beat all rival models in most of the score areas.


In other words, at the time, Sonnet 4.6, guided by Web Scraping Copilot's best-in-class scraping know-how, produced the very best auto-generated scraping code.


We are excited to see where these scores go next, as frontier models get better and better.

Where general AI stops, Web Scraping Copilot begins

Most of today’s general-purpose AI coding assistants optimize for plausibility and speed, not for long-term accuracy or maintainability.


Zyte has “taught” the AI to code like our best scraping engineers by defining, measuring, and iteratively improving quality along the axes that matter most: accuracy and complexity.


We believe that gaining and maintaining access to web data should be hassle-free, no matter who, or what, is writing the code.
