
How Zyte’s extraction experts guarantee data quality

Read Time
2 mins
Posted on
September 1, 2025
Ensuring web data quality at scale means moving beyond fragile scripts and spot checks to robust validation that keeps business decisions accurate and reliable.
By
Artur Sadurski

Web data is fragile by default. Websites can change their structure, add new layouts without notice, and often present information with inconsistent formatting.


As a result, you can’t assume extracted content is necessarily correct.


When a value returned by extraction is present but malformed or logically invalid, it’s a risk to your business.

Scaling data quality control


De-risking data collection is the bread and butter of the quality assurance (QA) team at Zyte Data, Zyte’s done-for-you data extraction service. We help make sure the billions of records delivered to our customers on a monthly basis – whether it’s product data, job listings, or articles – are of good quality. This requires us to inspect and validate data projects that often contain hundreds of thousands of records.


Traditionally, QA relied on ad-hoc Python scripts, spot checks, and human intuition – effective, but difficult to measure, repeat, or scale. Close to a decade ago, we reached a point where the traditional approach couldn’t keep pace. 


When I took on the role of QA Data Scientist at Zyte, my focus was clear: how do we ensure that the data we deliver to customers is accurate, valid, complete, consistent, and timely – and do it at scale?

From quality inspection to quality management


Over the past 15 years, Zyte’s approach to data quality (DQ) has evolved from manual data inspection to a scalable data quality management system.


This transformation has taken shape through three core initiatives.


1. Going all-in on JSON Schema


You can’t strive for data quality without first defining what quality means for the data you aim to collect.


Once upon a time, aligning on that was difficult. Required fields would be mentioned in a sales call, described imprecisely in the statement of work; undocumented assumptions were baked right into the crawlers and validated by judgement – only to find that the customer had something else in mind.


So, we adopted JSON Schema as the common standard for expressing what high-quality data looks like.


JSON Schema is the foundation of our industry. It defines what the collected data should look like for every customer: field by field, value by value. It specifies a list of fields, acceptable value ranges, data types, lengths, formats, and all validation rules that apply.
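
To make that concrete, here’s a minimal sketch of what such a schema might look like, validated with the open-source jsonschema library. The field names, ranges, and patterns below are illustrative assumptions, not taken from a real project.

```python
# Illustrative per-project schema (field names and rules are hypothetical).
from jsonschema import Draft202012Validator

product_schema = {
    "type": "object",
    "required": ["name", "price", "currency", "url"],
    "properties": {
        "name": {"type": "string", "minLength": 1, "maxLength": 500},
        "price": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},  # ISO 4217 code
        "url": {"type": "string", "format": "uri"},
        "in_stock": {"type": "boolean"},
    },
    "additionalProperties": False,
}

record = {
    "name": "Desk Lamp",
    "price": 24.99,
    "currency": "USD",
    "url": "https://example.com/p/123",
    "in_stock": True,
}

# Collect every violation instead of stopping at the first one.
for err in Draft202012Validator(product_schema).iter_errors(record):
    print(f"{list(err.path)}: {err.message}")
```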


JSON Schema files are the operational source of truth for every Zyte Data project: version-controlled, accessible to all teams, and kept in sync with the customer’s expectations.


By clearly documenting expected outcomes in a JSON Schema file, we ensure every project is kept on the straight and narrow.


2. Creating our own QA tool suite


With a common schema standard in place, the next step was building the infrastructure to apply it at scale.


Off-the-shelf tools for data quality assurance at Zyte’s scale simply didn’t exist – so we built our own.


Our homegrown internal QA automation toolbox, called Manual and Automated Testing Tool (MATT), grew out of years of accumulated techniques we developed to meet our quality standards. Over time, we consolidated these scripts and cheatsheets – many of which leveraged well-known data analysis and validation frameworks like Pandas and Robot Framework – into a unified interface, a treasure trove of utilities.


It now powers the whole QA lifecycle from data inspection and validation, to monitoring and reporting.


The team uses MATT to generate JSON Schemas for each project – equipped with predefined text-matching patterns – and to validate each field against those rules. This accelerates project execution and quickly aligns data with the customer’s expectations.


It assists the team in making sure that the data includes all required fields and available records, while validating data integrity by detecting unwanted duplicates through configurable identifiers and its dataset-comparison feature.
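
As a rough illustration of that duplicate check, here’s how it might look in Pandas; the identifier columns are treated as a hypothetical per-project setting.

```python
import pandas as pd

# Hypothetical delivery sample; in practice the identifier columns are
# configured per project (e.g. a product URL plus a variant SKU).
records = pd.DataFrame([
    {"url": "https://example.com/p/1", "sku": "A-1", "price": 19.99},
    {"url": "https://example.com/p/1", "sku": "A-1", "price": 19.99},  # duplicate
    {"url": "https://example.com/p/2", "sku": "B-7", "price": 45.00},
])

identifier_columns = ["url", "sku"]  # configurable per project
duplicates = records[records.duplicated(subset=identifier_columns, keep="first")]
print(f"{len(duplicates)} unwanted duplicate(s) found")
```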


MATT includes a built-in visual diff tool that helps the team verify the crawler picks up exactly what’s on the website by running a comparison on a representative sample.


Finally, the toolbox automates the generation of QA reports, providing transparency and traceability into what was checked and what passed or failed. If your dataset arrives complete, clean, and clearly documented, you can thank MATT.


3. Using Python notebooks as our launchpad


With our homebrew testing suite facilitating much of the core QA workflow, Python notebooks – web-based interactive environments for working with Python code – became our orchestration layer.


Notebooks are where we keep project-specific logic, write custom validations, test hypotheses, and explore edge cases. This is where most experimental logic begins, before it gets activated in MATT.


MATT can help us detect a range of predictable issues early, narrowing the scope of what needs manual attention. But it’s the QA engineer’s experience and judgement that ultimately ensure the data is right. That’s why you will also find us interrogating the project through notebooks.


With these foundations in place, we can reliably deliver on data quality, and do it at scale.

Delivering on quality


When we set out to define what “quality” means in the context of web scraping operations, we evaluated established data quality frameworks against the real-world needs of our customers. Over time, we formalized five dimensions that guide our QA processes: accuracy, validity, completeness, consistency, and timeliness.


With these tools in place, here’s a look at the practices we use to ensure every dataset we deliver meets those standards.


Accuracy: validating against the webpage


Accuracy means scraping the right value from the right page element.


We run automated checks against each project’s JSON Schemas. We ensure that required fields are non-null, that data is clean, and that data types match expectations.


But automation only goes so far. We manually inspect a statistically significant sample of records to spot issues such as incorrect selectors. For instance, when working on an ecommerce site, we’ll ensure that product names aren’t picking up breadcrumb text, or that seller names aren’t being inferred from unrelated metadata.


We calculate z-scores – a measure of how many standard deviations a numeric value sits from the mean – for anomaly detection. If most product prices are near $500 and one shows up at $19,000, it gets flagged for manual review.
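
A minimal sketch of that kind of z-score flagging in Pandas is shown below; the cutoff of 3 standard deviations is a common default, used here purely for illustration.

```python
import pandas as pd

# Hypothetical price sample: most values cluster near $500, one is way off.
prices = pd.Series([499.0, 505.5, 489.99, 510.0, 495.25] * 4 + [19000.0])

# z-score: how many standard deviations each value sits from the mean.
z_scores = (prices - prices.mean()) / prices.std()

# Flag anything beyond the chosen cutoff for manual review.
flagged = prices[z_scores.abs() > 3]
print(flagged)  # the $19,000 record is surfaced for review
```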


These days, we also embed large language models (LLMs) into our workflow to perform semantic checks, like recognizing whether a string is a legitimate color name or determining if a person’s name has been properly segmented into first names and surnames.


Validity: validating against the project


“Valid” data means data that conforms to the project’s agreed rules, both at the field level and in how fields might relate to each other.


Validity checks enforce field-level rules such as whether dates follow the desired format, whether missing fields should be represented as empty strings (rather than null or omitted), and whether enumerated fields like country_code conform to the ISO 3166 standard.


Paired fields are validated together using conditional modifiers to ensure logical consistency. For example:


  • If in_stock is false, then inventory_count must be zero.

  • If price is present, currency must be, too.

  • A discounted_price should never exceed the regular_price.

  • If subcategory is present, category must be as well.


These rules help catch issues that simple data type checks might miss.
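
As an illustration, the first two rules could be expressed with JSON Schema conditionals (if/then and dependentRequired). The schema below is a sketch under those assumptions, not an actual project schema.

```python
from jsonschema import Draft202012Validator

# Illustrative cross-field rules (field names are hypothetical):
#   - if in_stock is false, inventory_count must be 0
#   - if price is present, currency is required too
rules = {
    "type": "object",
    "properties": {
        "in_stock": {"type": "boolean"},
        "inventory_count": {"type": "integer", "minimum": 0},
        "price": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
    },
    "if": {"properties": {"in_stock": {"const": False}}, "required": ["in_stock"]},
    "then": {"properties": {"inventory_count": {"const": 0}}},
    "dependentRequired": {"price": ["currency"]},
}

bad_record = {"in_stock": False, "inventory_count": 12, "price": 9.99}
for err in Draft202012Validator(rules).iter_errors(bad_record):
    print(err.message)
# Reports that inventory_count is not 0 and that currency is missing.
```

Cross-field numeric comparisons, such as a discounted_price never exceeding the regular_price, are typically checked in code rather than in plain JSON Schema, which has no built-in way to compare the values of two fields.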


Consistency: validating across records


Consistency means that the data holds together across records. A price listed as $49.99 in one row and 49,99 USD in another might both be accurate and valid, but they’re not consistent.


Here we enforce baseline normalization: ensuring decimal separators, currency symbols, casing, and units are aligned with what’s defined at the schema level.


We also run consistency checks across deliveries. If the format of an ecommerce product price changes unexpectedly, or a field starts including new units or categories, it gets automatically flagged for manual review.
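
One way to picture such a cross-delivery check: reduce each raw value to a coarse format signature and compare the signatures seen in consecutive deliveries. The snippet below is an illustrative sketch with made-up price strings, not our production logic.

```python
import pandas as pd

# Hypothetical deliveries: raw price strings as scraped.
previous = pd.Series(["$49.99", "$12.00", "$105.50"])
current = pd.Series(["$49.99", "49,99 USD", "$105.50"])

def format_signatures(series: pd.Series) -> set[str]:
    # Collapse digits to "9" and letters to "A" to capture only the format.
    return set(
        series.str.replace(r"\d", "9", regex=True)
              .str.replace(r"[A-Za-z]", "A", regex=True)
    )

new_formats = format_signatures(current) - format_signatures(previous)
if new_formats:
    print("Unexpected price formats appeared:", new_formats)
```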


Consistency checks are what keep downstream systems from breaking on edge cases.


For those familiar with traditional data quality frameworks, you might notice we don’t list uniqueness as a separate dimension. That’s because duplicate handling depends on the project’s context. In some projects, customers actually need the same records delivered across different datasets. In others, uniqueness is enforced within each delivery. These requirements are captured at the project level and validated as part of our consistency checks.


Completeness: validating against the website


Completeness means every record and field that could be collected is there – no gaps, no duplicates. We track this using project-specific baselines: expected record counts and field coverage percentages from previous deliveries. If any metric drops below a defined threshold, the team gets an alert and investigates immediately.


Field-level completeness can be enforced via custom modifiers in the schema. Fields that are listed as required yet missing from the scraped results will fail validation. Optional fields are monitored but only block delivery when their coverage falls below acceptable levels.
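
A simplified sketch of such a field-coverage check is shown below; the thresholds and column names are illustrative assumptions.

```python
import pandas as pd

delivery = pd.DataFrame({
    "name": ["Lamp", "Chair", "Desk"],
    "price": [24.99, None, 89.00],
    "brand": [None, None, "Acme"],
})

# Hypothetical per-field coverage thresholds (required fields demand 100%).
thresholds = {"name": 1.0, "price": 0.95, "brand": 0.30}

coverage = delivery.notna().mean()  # fraction of non-null values per field
for field, minimum in thresholds.items():
    if coverage[field] < minimum:
        print(f"{field}: coverage {coverage[field]:.0%} is below threshold {minimum:.0%}")
```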


For record-level completeness, we adapted a biology-inspired technique: mark-release-recapture, in which researchers estimate wild animal populations by marking and recapturing individuals. We apply the same principle to web data: crawl a site, tag record IDs, crawl again later, and measure overlap. If the overlap is low, we’re likely missing records.
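
A minimal sketch of the idea uses the Lincoln-Petersen estimator, N ≈ n1 × n2 / overlap, on record IDs from two crawls; the IDs and numbers below are made up for illustration.

```python
def estimate_total_records(first_crawl_ids: set, second_crawl_ids: set) -> float:
    """Lincoln-Petersen estimate: N ~= (n1 * n2) / m, where m is the overlap."""
    overlap = len(first_crawl_ids & second_crawl_ids)
    if overlap == 0:
        raise ValueError("No overlap between crawls; cannot estimate coverage.")
    return len(first_crawl_ids) * len(second_crawl_ids) / overlap

# Hypothetical example: two crawls of the same site a few days apart.
crawl_1 = {f"product-{i}" for i in range(0, 8000)}
crawl_2 = {f"product-{i}" for i in range(2000, 10000)}

estimated_total = estimate_total_records(crawl_1, crawl_2)
coverage = len(crawl_1) / estimated_total
print(f"Estimated catalogue size: {estimated_total:,.0f}; crawl 1 covered ~{coverage:.0%}")
```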


We also inspect sitemaps, spin up dedicated discovery crawlers, and investigate external information sources when needed. Coverage can’t always be proven – but we apply every signal, tool, and technique available to minimize blind spots.


Timeliness: validating against the schedule


Timeliness means the data is delivered as promised – something that can be affected by the need to avoid over-burdening the websites we crawl. We’ve built internal monitors that track delivery frequency against project expectations. If a dataset is scheduled weekly and no new job is detected after eight days, the team gets notified to investigate.
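
A stripped-down version of such a monitor might look like the sketch below; the interval, grace period, and timestamp are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical project configuration: weekly deliveries, one-day grace period.
expected_interval = timedelta(days=7)
grace_period = timedelta(days=1)
last_delivery = datetime(2025, 8, 20, tzinfo=timezone.utc)

overdue_by = datetime.now(timezone.utc) - last_delivery - expected_interval
if overdue_by > grace_period:
    print(f"Delivery overdue by {overdue_by.days} day(s); notify the QA team.")
```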

Engineering trust


At Zyte, we’ve learned that data quality isn’t a checkbox at the end of a pipeline. Quality comes from building systems that surface issues early and consistently.


This operational discipline requires alignment, speed, repeatability, and flexibility.


As the ecosystem matures – and as tools like LLMs become more capable – we’re continuing to push our QA processes forward: scaling the things we can automate, and sharpening our judgment where we can’t.


Because at this scale, quality isn’t about luck. It’s about design.

