How Zyte’s extraction experts guarantee data quality

Ensuring web data quality at scale means moving beyond fragile scripts and spot checks to robust validation that keeps business decisions accurate and reliable.

Artur Sadurski · QA Data Scientist

2 min read · September 1, 2025

Web data is fragile by default. Websites can change their structure, add new layouts without notice, and often present information with inconsistent formatting.

As a result, you can’t assume extracted content is necessarily correct.

When a value returned by extraction is present but malformed or logically invalid, it’s a risk to your business.

Scaling data quality control

De-risking data collection is the bread and butter of the quality assurance (QA) team at Zyte Data, Zyte’s done-for-you data extraction service. We help make sure the billions of records delivered to our customers on a monthly basis – whether it’s product data, job listings, or articles – are of good quality. This requires us to inspect and validate data projects that often contain hundreds of thousands of records.

Traditionally, QA relied on ad-hoc Python scripts, spot checks, and human intuition – effective, but difficult to measure, repeat, or scale. Close to a decade ago, we reached a point where the traditional approach couldn’t keep pace. 

When I took on the role of QA Data Scientist at Zyte, my focus was clear: how do we ensure that the data delivered to customers is accurate, valid, complete, consistent, and timely, and how do we do it at scale?

From quality inspection to quality management

Over the past 15 years, Zyte’s approach to data quality (DQ) has evolved from manual data inspection to a scalable data quality management system.

This transformation has taken shape through three core initiatives.

1. Going all-in on JSON Schema

You can’t strive for data quality without first defining what good data looks like.

Once upon a time, aligning on that was difficult. Required fields would be mentioned in a sales call and described imprecisely in the statement of work. Undocumented assumptions were baked right into the crawlers and validated by judgement, only for us to find that customers had something else in mind.

So, we adopted JSON Schema as the common standard for expressing what high-quality data looks like.

JSON Schema is the foundation of our approach. It defines what the collected data should look like for every customer, field by field and value by value: the list of fields, acceptable value ranges, data types, lengths, formats, and every validation rule that applies.

JSON Schema files are the operational source of truth for every Zyte Data project: version-controlled, accessible to all teams, and kept in sync with the customer’s expectations.

By clearly documenting expected outcomes in a JSON Schema file, we ensure every project is kept on the straight and narrow.
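
To make the idea concrete, here is a minimal sketch of a schema and a field-by-field check. The schema uses standard JSON Schema keywords, but the product fields, the `PRODUCT_SCHEMA` name, and the `validate_record` helper are all hypothetical, and the hand-rolled checker covers only a small subset of keywords. It is an illustration, not Zyte’s tooling, and a real project would more likely rely on a full JSON Schema validator.

```python
import re

# Hypothetical schema for a product record, using standard JSON Schema keywords.
PRODUCT_SCHEMA = {
    "type": "object",
    "required": ["name", "price", "currency"],
    "properties": {
        "name": {"type": "string", "minLength": 1, "maxLength": 200},
        "price": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "availability": {"type": "string", "enum": ["in_stock", "out_of_stock"]},
    },
}

# Python types accepted for the two JSON Schema types this sketch supports.
_PYTHON_TYPES = {"string": str, "number": (int, float)}


def validate_record(record, schema):
    """Return a list of violations; an empty list means the record passed.

    Only a small subset of keywords is handled: required, type (string or
    number), minLength, maxLength, minimum, pattern, and enum.
    """
    errors = []
    for field in schema.get("required", []):
        if field not in record:
            errors.append(f"{field}: required field is missing")
    for field, rules in schema.get("properties", {}).items():
        if field not in record:
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _PYTHON_TYPES[rules["type"]]):
            errors.append(f"{field}: expected {rules['type']}")
            continue
        if "minLength" in rules and len(value) < rules["minLength"]:
            errors.append(f"{field}: shorter than {rules['minLength']} characters")
        if "maxLength" in rules and len(value) > rules["maxLength"]:
            errors.append(f"{field}: longer than {rules['maxLength']} characters")
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{field}: below minimum {rules['minimum']}")
        if "pattern" in rules and not re.search(rules["pattern"], value):
            errors.append(f"{field}: does not match {rules['pattern']}")
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{field}: not one of {rules['enum']}")
    return errors


print(validate_record({"name": "", "price": -1, "currency": "usd"}, PRODUCT_SCHEMA))
```

The rules live in the schema rather than in the checking code, so tightening an expectation means editing a version-controlled file instead of a crawler.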

2. Creating our own QA tool suite

With a common schema standard in place, the next step was building the infrastructure to apply it at scale.

No off-the-shelf tool could service data quality assurance at Zyte’s scale, so we built our own.

Our homegrown internal QA automation toolbox, called Manual and Automated Testing Tool (MATT), grew out of years of accumulated techniques we developed to meet our quality standards. Over time, we consolidated these scripts and cheatsheets – many of which leveraged well-known data analysis and validation frameworks like Pandas and Robot Framework – into a unified interface, a treasure trove of utilities.

It now powers the whole QA lifecycle from data inspection and validation, to monitoring and reporting.

The team uses MATT to generate JSON Schemas for each project, equipped with predefined text-matching patterns, and to validate each field against those rules. This accelerates project execution and quickly aligns data with the customer’s expectations.

It assists the team in making sure that the data includes all required fields and available records, while validating data integrity by detecting unwanted duplicates through configurable identifiers and its dataset-comparison feature.
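
MATT is internal, so its interface is not shown here. As a rough illustration of duplicate detection through configurable identifiers, the sketch below counts how often each identifier combination occurs. The `find_duplicates` helper and the `sku` and `seller` fields are hypothetical.

```python
from collections import Counter


def find_duplicates(records, key_fields):
    """Return identifier tuples that occur more than once, with their counts.

    key_fields is the configurable part: the fields that, taken together,
    are expected to identify a record uniquely in this project.
    """
    counts = Counter(tuple(r.get(f) for f in key_fields) for r in records)
    return {key: n for key, n in counts.items() if n > 1}


records = [
    {"sku": "A1", "seller": "north", "price": 10},
    {"sku": "A1", "seller": "south", "price": 11},
    {"sku": "A1", "seller": "north", "price": 10},
]
print(find_duplicates(records, ["sku", "seller"]))  # one duplicated pair
print(find_duplicates(records, ["sku"]))            # stricter key, more hits
```

Choosing the key per project matters: the same three records contain one duplicate under `("sku", "seller")` and a triple under `("sku",)` alone.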

MATT includes a built-in visual diff tool that helps the team verify the crawler picks up exactly what’s on the website by running a comparison on a representative sample.

Finally, the toolbox automates the generation of QA reports, providing transparency and traceability into what was checked and what passed or failed. If your dataset arrives complete, clean, and clearly documented, you can thank MATT.

3. Using Python notebooks as our launchpad

With our homebrew testing suite facilitating much of the core QA workflow, Python notebooks – a web-based interactive platform to work with Python code – become our orchestration layer.

Notebooks are where we keep project-specific logic, write custom validations, test hypotheses, and explore edge cases. This is where most experimental logic begins, before it gets activated in MATT.

MATT can help us detect a range of predictable issues early, narrowing the scope of what needs manual attention. But it’s the QA engineer’s experience and judgement that ultimately ensure the data is right. That’s why you will also find us interrogating the project through notebooks.

With these foundations in place, we can reliably deliver on data quality, and do it at scale.

Delivering on quality

When we set out to define what “quality” means in the context of web scraping operations, we evaluated established data quality frameworks against the real-world needs of our customers. Over time, we formalized five dimensions that guide our QA processes: accuracy, validity, completeness, consistency, and timeliness.

With our tools at our service, here’s a look at the practices we use to ensure every dataset we deliver meets those standards.

Accuracy: validating against the webpage

Accuracy means scraping the right value from the right page element.

We run automated checks against each project’s JSON Schemas. We ensure that required fields are non-null, that data is clean, and that data types match expectations.

But automation only goes so far. We also manually inspect a statistically significant sample of records to spot problems such as incorrect selectors. For instance, when working on an ecommerce site, we’ll ensure that product names aren’t picking up breadcrumb text, or that seller names aren’t being inferred from unrelated metadata.

We calculate z-scores – a measure of how many standard deviations a numeric value lies from the mean – for anomaly detection. If most product prices are near $500 and one shows up at $19,000, it gets flagged for manual review.
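
A minimal sketch of that check, using only the standard library. The `flag_outliers` name, the threshold of 3, and the sample prices are assumptions chosen for illustration. The article does not state which threshold is used in production.

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold.

    Needs at least two values, because the sample standard deviation is
    undefined for a single observation.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Twenty prices near $500 plus one suspicious value.
prices = [480, 495, 500, 505, 510, 490, 515, 498, 502, 507] * 2 + [19000]
print(flag_outliers(prices))
```

One caveat: in a very small sample, a single extreme value inflates the standard deviation so much that its own z-score may never cross the threshold, so a check like this works best on reasonably large batches.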

These days, we also embed large language models (LLMs) into our workflow to perform semantic checks, like recognizing whether a string is a legitimate color name or determining if a person’s name has been properly segmented into first names and surnames.

Validity: validating against the project

“Valid” data means data that conforms to the project’s agreed rules, both at the field level and in how fields might relate to each other.

Validity checks enforce field-level rules such as whether dates follow the desired format, whether missing fields should be represented as empty strings (rather than null or omitted), and whether enumerated fields like country_code conform to the ISO 3166 standard.

Paired fields are validated together using conditional modifiers to ensure logical consistency. For example:

  • If in_stock is false, then inventory_count must be zero.

  • If price is present, currency must be, too.

  • A discounted_price should never exceed the regular_price.

  • If subcategory is present, category must be as well.

These rules help catch issues that simple data type checks might miss.
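
The four rules above can be sketched as plain Python checks. The article describes them as conditional modifiers in the schema. This sketch swaps in ordinary functions to keep the logic visible, and it assumes that an absent field and a `None` value both mean "not present". The `check_paired_fields` name is hypothetical.

```python
def check_paired_fields(record):
    """Return violations of the four paired-field rules for one record."""
    errors = []

    # Rule 1: out-of-stock items must report zero inventory.
    if record.get("in_stock") is False and record.get("inventory_count") != 0:
        errors.append("in_stock is false but inventory_count is not zero")

    # Rule 2: a price without a currency is ambiguous.
    if record.get("price") is not None and record.get("currency") is None:
        errors.append("price is present but currency is missing")

    # Rule 3: a discount must not raise the price.
    regular = record.get("regular_price")
    discounted = record.get("discounted_price")
    if regular is not None and discounted is not None and discounted > regular:
        errors.append("discounted_price exceeds regular_price")

    # Rule 4: a subcategory only makes sense inside a category.
    if record.get("subcategory") is not None and record.get("category") is None:
        errors.append("subcategory is present but category is missing")

    return errors


broken = {
    "in_stock": False,
    "inventory_count": 3,
    "price": 10,
    "regular_price": 10,
    "discounted_price": 12,
    "subcategory": "Lamps",
}
for problem in check_paired_fields(broken):
    print(problem)
```

Every field in `broken` has a plausible type and value on its own, which is why per-field checks would let the record through.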

Consistency: validating across records

Consistency means that the data holds together across records. A price listed as $49.99 in one row and 49,99 USD in another might both be accurate and valid, but they’re not consistent.

Here we enforce baseline normalization, ensuring that decimal separators, currency symbols, casing, and units are aligned with what’s defined at the schema level.
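
As an illustration of what such normalization might involve, the sketch below reduces both price strings from the example above to the same amount and currency code. The `normalize_price` helper, the symbol table (including the assumption that `$` means USD), and the decimal-comma heuristic are all hypothetical simplifications.

```python
import re
from decimal import Decimal

# Assumed mapping; "$" in particular is ambiguous across real-world sites.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}


def normalize_price(raw):
    """Parse a price string into (Decimal amount, currency code or None)."""
    text = raw.strip()
    currency = None
    for symbol, code in CURRENCY_SYMBOLS.items():
        if symbol in text:
            currency = code
            text = text.replace(symbol, "")
    # An explicit three-letter code takes precedence over a symbol.
    match = re.search(r"\b[A-Z]{3}\b", text)
    if match:
        currency = match.group()
        text = text.replace(match.group(), "")
    number = text.strip()
    if re.search(r",\d{2}$", number):
        # A comma followed by exactly two final digits is read as a decimal
        # separator, so any dots are thousands separators.
        number = number.replace(".", "").replace(",", ".")
    else:
        # Otherwise commas are read as thousands separators.
        number = number.replace(",", "")
    return Decimal(number), currency


print(normalize_price("$49.99"))
print(normalize_price("49,99 USD"))
```

Strings the heuristic cannot parse raise `decimal.InvalidOperation`, which is arguably the right outcome for a QA check: an unexpected format should surface rather than be silently coerced.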

We also run consistency checks across deliveries. If the format of an ecommerce product price changes unexpectedly, or a field starts including new units or categories, it gets automatically flagged for manual review.

Consistency checks are what keep downstream systems from breaking on edge cases.

If you’re familiar with traditional data quality frameworks, you might notice we don’t list uniqueness as a separate dimension. That’s because duplicate handling depends on the project’s context. In some projects, customers actually need the same records delivered across different datasets. In others, uniqueness is enforced within each delivery. These requirements are captured at the project level and validated as part of our consistency checks.

Completeness: validating against the website

Completeness means every record and field that could be collected is there – no gaps, no duplicates. We track this using project-specific baselines: expected record counts and field coverage percentages from previous deliveries. If any metric drops below a defined threshold, the team gets an alert and investigates immediately.

Field-level completeness can be enforced via custom modifiers in the schema. Fields that are listed as required yet are missing from the scraped results will fail validation. Optional fields are monitored and, if missing, block delivery only when coverage falls below acceptable levels.
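
A coverage check of this kind could look roughly like the following. The helper names, the example fields, and the threshold values are hypothetical. In practice the thresholds would come from each project’s baselines.

```python
def field_coverage(records, fields):
    """Return the share of records (0.0 to 1.0) where each field is filled.

    A field counts as filled when it is present and is neither None nor an
    empty string.
    """
    total = len(records)
    if total == 0:
        return {f: 0.0 for f in fields}
    return {
        f: sum(1 for r in records if r.get(f) not in (None, "")) / total
        for f in fields
    }


def coverage_alerts(coverage, minimums):
    """Return the fields whose coverage is below their required minimum."""
    return sorted(f for f, m in minimums.items() if coverage.get(f, 0.0) < m)


records = [
    {"name": "Lamp", "brand": "Acme"},
    {"name": "Desk", "brand": ""},
    {"name": "Chair", "brand": "Acme"},
    {"name": "Shelf"},
]
coverage = field_coverage(records, ["name", "brand"])
# "name" is required (100%); "brand" is optional but expected in 80% of records.
print(coverage_alerts(coverage, {"name": 1.0, "brand": 0.8}))
```

Treating required fields as a 100% threshold lets one mechanism cover both cases, with optional fields simply carrying a lower bar.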

For record-level completeness, we adapted a biology-inspired technique: mark-release-recapture, in which researchers estimate wild animal populations by marking and recapturing individuals. We apply the same principle to web data: crawl a site, tag record IDs, crawl again later, and measure overlap. If the overlap is low, we’re likely missing records.
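
The classic textbook form of this technique is the Lincoln-Petersen estimator, which estimates the population as the product of the two sample sizes divided by the number of recaptured individuals. The sketch below applies it to two sets of record IDs. Whether Zyte uses this exact estimator is not stated in the article, and the function name is hypothetical. The estimator also assumes the two crawls are independent samples of an unchanged site, which real crawls only approximate.

```python
def lincoln_petersen(first_ids, second_ids):
    """Estimate the total number of records from two crawls of the same site.

    Returns None when the crawls share no IDs, because the estimate is
    undefined without any recaptures.
    """
    first, second = set(first_ids), set(second_ids)
    recaptured = len(first & second)
    if recaptured == 0:
        return None
    return len(first) * len(second) / recaptured


# Two crawls of 80 records each, sharing 40 IDs.
first_crawl = range(0, 80)
second_crawl = range(40, 120)
estimate = lincoln_petersen(first_crawl, second_crawl)
seen = len(set(first_crawl) | set(second_crawl))
print(f"estimated total: {estimate:.0f}, seen so far: {seen}")
```

Here the two crawls together saw 120 distinct records against an estimated 160, a gap that would justify a closer look at discovery.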

We also inspect sitemaps, spin up dedicated discovery crawlers, and investigate external information sources when needed. Coverage can’t always be proven – but we apply every signal, tool, and technique available to minimize blind spots.

Timeliness: validating against the schedule

Timeliness means the data is delivered as promised, which can be affected by the need to avoid overburdening websites. We’ve built internal monitors that track delivery frequency against project expectations. If a dataset is scheduled weekly and no new job is detected after eight days, the team gets notified to investigate.
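
The weekly example reduces to a small comparison of timestamps. In this sketch, the `is_overdue` name and the split into a seven-day interval plus a one-day grace period are assumptions. The article only gives the eight-day total.

```python
from datetime import datetime, timedelta


def is_overdue(last_job_finished, expected_interval, grace, now):
    """Return True when no new job has landed within interval plus grace."""
    return now - last_job_finished > expected_interval + grace


last_job = datetime(2025, 9, 1, 6, 0)
weekly = timedelta(days=7)
grace = timedelta(days=1)

print(is_overdue(last_job, weekly, grace, now=datetime(2025, 9, 8, 6, 0)))   # 7 days later
print(is_overdue(last_job, weekly, grace, now=datetime(2025, 9, 10, 6, 0)))  # 9 days later
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test and to replay against historical job logs.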

Engineering trust

At Zyte, we’ve learned that data quality isn’t a checkbox at the end of a pipeline. Quality comes from building systems that surface issues early and consistently.

This operational discipline requires alignment, speed, repeatability, and flexibility.

As the ecosystem matures – and as tools like LLMs become more capable – we’re continuing to push our QA processes forward: scaling the things we can automate, and sharpening our judgment where we can’t.

Because at this scale, quality isn’t about luck. It’s about design.
