PINGDOM_CHECK

#ExtractSummit2026 The world's largest web scraping conference returns. Austin Oct 7–8 · Dublin Nov 10–11.

Register now
Data Services
Pricing
Login
Try Zyte APIContact Sales
  • Unblocking and Extraction

    Zyte API

    The ultimate API for web scraping. Avoid website bans and access a headless browser or AI Parsing

    Ban Handling

    Headless Browser

    AI Extraction

    Enterprise

    DocumentationSupport

    Hosting and Deployment

    Scrapy Cloud

    Run, monitor, and control your Scrapy spiders however you want to.

    Coding Agent Add-Ons

    Agentic Web Data

    Plugins that give coding agents the context to build production Scrapy projects. Starts with Claude Code.

  • Data Services
  • Pricing
  • Blog

    Learn

    Case Studies

    Webinars

    Videos

    White Papers

    Join our Community
    Web scraping APIs vs proxies: A head-to-head comparison
    Blog Post
    The seven habits of highly effective data teams
    Blog Post
  • Product and E-commerce

    From e-commerce and online marketplaces

    Data for AI

    Collect and structure web data to feed AI

    Job Posting

    From job boards and recruitment websites

    Real Estate

    From Listings portals and specialist websites

    News and Article

    From online publishers and news websites

    Search

    Search engine results page data (SERP)

    Social Media

    From social media platforms online

  • Meet Zyte

    Our story, people and values

    Contact us

    Get in touch

    Support

    Knowledge base and raise support tickets

    Terms and Policies

    Accept our terms and policies

    Open Source

    Our open source projects and contributions

    Web Data Compliance

    Guidelines and resources for compliant web data collection

    Join the team building the future of web data
    We're Hiring
    Trust Center
    Security, compliance & certifications
Login
Try Zyte APIContact Sales

Zyte Developers

Coding tools & hacks straight to your inbox

Become part of the community and receive a bi-weekly dosage of all things code.

Join us
    • Zyte Data
    • News & Articles
    • Search
    • Social Media
    • Product
    • Data for AI
    • Job Posting
    • Real Estate
    • Zyte API - Ban Handling
    • Zyte API - Headless Browser
    • Zyte API - AI Extraction
    • Web Scraping Copilot
    • Zyte API Enterprise
    • Scrapy Cloud
    • Solution Overview
    • Blog
    • Webinars
    • Case Studies
    • White Papers
    • Documentation
    • Web Scraping Maturity Self-Assesment
    • Web Data compliance
    • Meet Zyte
    • Jobs
    • Terms and Policies
    • Trust Center
    • Support
    • Contact us
    • Pricing
    • Do not sell
    • Cookie settings
    • Sign up
    • Talk to us
    • Cost estimator
Home
Blog
AI won’t fix your data quality (until you answer these three questions)
Light
Dark

AI won’t fix your data quality (until you answer these three questions)

Read Time
10 min
Posted on
May 13, 2026
Use case
In our interview, a QA expert warns - before you delegate web scraping quality assurance to AI, make sure you can describe what ‘good’ looks like for yourself.
By
Neha Setia
IntroductionWhat is data quality?1. Item coverage: ‘Did we catch every item?’2. Field coverage: ‘Did we fill in every detail?’3. Field accuracy: ‘Is the information actually correct?’The DQ before AI1. The importance of JSON Schema2. Testing your data3. Automated testsQuality as a processDiscovering the value LLMs bringSpidermon to the rescueFrom anxiety to autonomyA developer’s data quality checklistTry it yourself
×

Try Zyte API

Zyte proxies and smart browser tech rolled into a single API.
Start FreeFind out more
Subscribe to our Blog
Table of Contents

Data engineers are increasingly handing the data quality imperative to LLMs. You plug in an AI checker, let it verify your scraped data, and move on. Sure, it sounds like progress - but a recent conversation made me question whether we’re skipping a step.

I sat down with Tomasz Lesiak, a quality assurance (QA) engineer at Zyte who has spent four years verifying scraped data across hundreds of projects, and asked him a simple question: “If you had to explain the data quality (DQ) process to a developer without mentioning AI at all, how would you describe it?”

The short answer: every quality check, manual or automated, AI or not, comes down to the same three questions. Until you can write those three questions down for your own project, you don’t have data quality; you have hope.

What is data quality?

Neha: “If you had to explain data quality without mentioning AI at all, how would you describe it?”

Tomasz: “I don’t even know how to put it.”

That paradox is common to anyone who has so mastered their field to the point of automatic instinct that articulating the process feels challenging. Tomasz opens a website, compares it to the scraped data, item by item, field by field, and knows instantly when something is off. But, when I pushed him to explain further, Tomasz’s framework became clear. Data quality comes down to three questions:

1. Item coverage: ‘Did we catch every item?’

If a site has 1,450 products and your spider returned 800, something broke. Maybe pagination stopped early. Maybe a category page didn’t load. Maybe the spider hit a rate limit and gave up silently. The data looks clean, it’s just incomplete.

2. Field coverage: ‘Did we fill in every detail?’

You may have collected 1,450 items, but 40% of them have no price and 60% are missing images. The spider didn’t error out, it just delivered hollow records. This one is sneaky.

3. Field accuracy: ‘Is the information actually correct?’

You have the item, you have the field populated, but the value is wrong. The product title says “pet-friendly” when the actual title is something else. The price shows $0.00. The description contains raw HTML instead of text. This is the hardest to catch because the data looks structurally fine. It just isn’t true.

D1

Every tool Tomasz has used, from manual side-by-side checks to schema validators to AI-powered copilots, exists to answer these three questions. The tooling changes, but the questions don’t.

That is the whole data quality thesis in three lines. Every blog post you read about AI for data quality, every vendor pitching you an LLM-powered validator, is doing some version of these three things. If you can’t articulate them for your own project, you can’t evaluate any of those tools, and you certainly can’t trust their output.

The DQ before AI

Neha: “Walk me through how Zyte’s QA team operated before LLMs entered the picture.”

Tomasz: “Three layers: JSON Schema first, then manual checks, then automated tests in notebooks.”

D2

1. The importance of JSON Schema

The schema is the contract between the spider and the client.

Every project has one. It defines the fields, their types, their allowed values, their patterns. If the data doesn’t match the schema, it’s wrong.

2. Testing your data

Manual testing comes next. Zyte has had an internal side-by-side comparison tool, called Manual and Automated Testing Tool (MATT), for over a decade: scraped data on the left, the live website on the right. The QA engineer opens item after item.

  • Product title matches? Check.
  • Price correct? Check.
  • Image present? Check.
  • Dimensions populated? Empty, but the website shows measurements. That’s a bug. Report it.

“Checking items one by one was the worst way to do it,” Tomasz says, “but also the best way, because there was no other way.” Manual QA catches everything, but scaling it is nearly impossible.

3. Automated tests

The third layer is Jupyter notebooks with Pandas. Run automated tests for predictable issues like duplicate URLs, empty required fields, and prices that don’t match regular expression patterns.

This was our first real automation at Zyte, and it still handles anything you can express mathematically.

Quality as a process

Notice the sequence. Schema, then manual, then automation. This is the same shape Zyte’s wider extraction team uses to guarantee data quality on managed projects.

There’s a reason it’s stayed stable for years:

  • People tend to want to skip straight to automation.
  • But the manual step is where the rules get encoded into the QA engineer’s instinct, and automation just makes those encoded rules cheap to run at scale.

If you skip the manual step, your automation reproduces whatever you guessed instead of what you actually verified.

D3

Discovering the value LLMs bring

Neha: “Where did large language models (LLMs) first prove genuinely useful for you?”

Tomasz: “In testing fields that resist mathematical rules.”

The example he reaches for is job postings. Job listings tend to contain salary ranges, buried somewhere in an unstructured description:

  • One listing writes it as “$50,000-$70,000.”
  • Another writes “50-70K annually.”
  • A third puts it in a sentence: “competitive salary starting at fifty thousand.”

A regex won’t catch all three. For Tomasz, this is where LLMs became essential - not as a replacement for the QA process, but as the layer that handles fields you can’t reduce to a formula.

This is the right way to think about where AI sits in the stack. The LLM is the layer for unstructured fields and ambiguity, not the layer for “is this data good?”

Once you hand the real work to an LLM, measuring its data quality becomes its own discipline. Hand it the whole problem and you skip every step that defines what “good” actually means for your project. The LLM ends up guessing, too, just with more confidence.

Spidermon to the rescue

Neha: “How did all of this become automated, before AI got involved?”

Tomasz: “Spidermon.”

Spidermon is a Scrapy add-on for spider monitoring and validation. It’s open source, it’s been at Zyte for years, and any developer using Scrapy can set it up today.

Crucially, it’s structured around the same three questions Tomasz asks every day:

Target Description
Item coverage If a spider usually collects 2,700 items and suddenly returns 500, Spidermon flags it.
Field coverage If images are normally present in 60% of items and that drops to 30%, it creates a ticket.
Field accuracy Monitors schema violations, unexpected nulls, and abnormal run duration.

Developers set thresholds based on historical data. If a field’s coverage drops below the threshold, Spidermon, in Tomasz’s words, “starts yelling.”

Most of the QA tickets at Zyte come from Spidermon, not from humans spotting issues.

Here’s a basic Spidermon setup for field coverage monitoring:

For item coverage, track item count against historical averages:

For field accuracy, combine Spidermon with schema validation:

from spidermon.contrib.validation import JSONSchemaValidator

class ProductValidator(JSONSchemaValidator):
schema_url = "https://your-project.com/schema/product.json"

Three tools, three questions:

Question Tool What it checks
Did we catch every item? Item count monitor Spider output vs historical average
Did we fill in every detail? Field coverage monitor % of items with each field populated
Is the data correct? Schema validator Field types, patterns, allowed values

All of this works today, on any Scrapy project, without an LLM anywhere in the loop.

Spidermon isn’t really a monitoring tool, it’s an articulation tool. It forces you to encode “good” into a number, a rule, a threshold.

Writing those thresholds is the thing the LLM can’t do for you. Once you’ve done it, an AI layer can sit on top and do useful things.

Skip the encoding step, however, and AI is just a more confident way of being wrong.

From anxiety to autonomy

Neha: “We’re moving toward autonomous, agent-built pipelines. What worries you, and what gives you hope?”

Tomasz: “The context window problem.”

When an AI copilot checks a spider, it needs the schema, the HTML of the page, the prompt, the spider stats, Tomasz said. All of that has to fit in the large language model’s context window. Even with everything loaded in, the AI still misses things a human would catch in five seconds of visual comparison.

The hope is on the other side of that same problem. Encoded thresholds, defined schemas, established baselines: all of it is context the AI can read. The work doesn’t go away, it moves earlier in the process.

That’s where the industry is heading:

  • Coding agents generate scrapers.
  • AI extracts data from pages.
  • Pipelines deliver datasets without a human touching them end-to-end.

We’ve written about that shift in “dDawn of the autonomous data pipeline” and “Web data for scraping developers in 2026: AI fuels the agentic future”. Autonomous maintenance requires autonomous quality checks, and right now the bottleneck isn’t the AI’s capability, it’s our ability to describe what “good” looks like clearly enough for the AI to verify it.

The thresholds you set in Spidermon, the schemas you define, the baselines you establish: that’s the context your future AI copilot will consume.

A developer’s data quality checklist

Tomasz’s data quality process can be distilled into something you can run on your next spider:

  1. Define your schema before you deploy. Every field gets a type, a pattern, a description of what “correct” looks like. If you can’t describe it for a schema, you can’t check it, manually or with AI.

  2. Set your baselines from real runs. Run the spider a few times. Note typical item count, typical field coverage. These become your Spidermon thresholds. Don’t guess -use real numbers from real runs.

  3. Automate the obvious first. Schema validation catches structural issues. Field coverage monitors catch silent failures. Item count monitors catch pagination breaks. None of this requires AI; it requires knowing what your data should look like.

  4. Check accuracy manually, at least once. Open the website and your the scraped data. Compare them side by side - examine 30 items - enough to catch patterns that automation can’t. Look for a product title that’s a valid string but incorrect, a price that’s a valid number but belongs to a different product. This is the step most developers skip, and it’s the step that catches the worst bugs.

  5. Then, and only then, add AI. Once you have the schema, the thresholds, and a manual baseline, an LLM can help with the fields that resist rules: salary extraction from unstructured text, descriptions that look right but don’t match the product on the page. The LLM needs the same context you built in steps one through four. Without that foundation, it’s guessing.

Try it yourself

Spidermon is open source: https://github.com/scrapinghub/spidermon.

If you’re building spiders with Scrapy and you don’t have quality monitoring set up, start with the three questions, define your schema, set your thresholds, and run Spidermon on your next crawl.

If you’re already using AI for data quality, ask yourself: “Can I describe what “good” looks like without the AI?”

If you can’t, the AI can’t either.

×

Try Zyte API

Zyte proxies and smart browser tech rolled into a single API.
Start FreeFind out more

Get the latest posts straight to your inbox

No matter what data type you're looking for, we've got you

G2.com

Capterra.com

Proxyway.com

EWDCI logoMost loved workplace certificateZyte rewardISO 27001 iconG2 rewardG2 rewardG2 reward

© Zyte Group Limited 2026
1rom spidermon import Monitor, monitors
2from spidermon.contrib.scrapy.monitors import FieldCoverageMonitor
3 
4@monitors.name("Field Coverage")
5class ProductFieldCoverage(Monitor, FieldCoverageMonitor):
6    field_coverage_rules = {
7        "product_name": 0.95,
8        "price": 0.90,
9        "image_url": 0.50,
10        "description": 0.60,
11    }
Copy
1@monitors.name("Item Count")
2class ItemCountMonitor(Monitor):
3    def test_minimum_items(self):
4        item_count = self.data.stats.get("item_scraped_count", 0)
5        minimum = 100
6        self.assertTrue(
7            item_count >= minimum,
8            f"Only {item_count} items scraped (minimum: {minimum})"
9        )
Copy