Delivering on quality
When we set out to define what “quality” means in the context of web scraping operations, we evaluated established data quality frameworks against the real-world needs of our customers. Over time, we formalized five dimensions that guide our QA processes: accuracy, validity, consistency, completeness, and timeliness.
With our tooling in place, here’s a look at the practices we use to ensure every dataset we deliver meets those standards.
Accuracy: validating against the webpage
Accuracy means scraping the right value from the right page element.
We run automated checks against each project’s JSON Schemas, ensuring that required fields are non-null, values are clean, and data types match expectations.
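Here’s a minimal sketch of what that layer of checks can look like, using the Python jsonschema library; the schema and field names below are illustrative rather than taken from a real project.

```python
# Minimal sketch of schema-level checks with the jsonschema library.
# The schema and field names here are illustrative, not a real project schema.
from jsonschema import Draft7Validator

PRODUCT_SCHEMA = {
    "type": "object",
    "required": ["name", "price", "url"],
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "price": {"type": "number", "minimum": 0},
        "url": {"type": "string"},
        "in_stock": {"type": "boolean"},
    },
}

validator = Draft7Validator(PRODUCT_SCHEMA)

def schema_errors(record: dict) -> list[str]:
    """Return human-readable schema violations for a single scraped record."""
    return [f"{'/'.join(map(str, e.path)) or '<root>'}: {e.message}"
            for e in validator.iter_errors(record)]

print(schema_errors({"name": "", "price": -5}))  # missing url, empty name, negative price
```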
But automation only goes so far. We also manually inspect a statistically significant sample of records to spot issues like incorrect selectors. For instance, when working on an ecommerce site, we’ll ensure that product names aren’t picking up breadcrumb text, or that seller names aren’t being inferred from unrelated metadata.
We calculate z-scores, a measure of how many standard deviations a value sits from the mean, for anomaly detection. If most product prices are near $500 and one shows up at $19,000, it gets flagged for manual review.
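As a rough illustration of that check, here’s a small z-score filter; the three-standard-deviation threshold and the sample prices are just for demonstration.

```python
# Sketch of a z-score outlier check over a batch of prices (threshold is illustrative).
from statistics import mean, stdev

def flag_outliers(prices: list[float], threshold: float = 3.0) -> list[float]:
    """Return prices whose z-score exceeds the threshold, for manual review."""
    if len(prices) < 2:
        return []
    mu, sigma = mean(prices), stdev(prices)
    if sigma == 0:
        return []
    return [p for p in prices if abs(p - mu) / sigma > threshold]

prices = [500.0 + i for i in range(20)] + [19000.0]
print(flag_outliers(prices))  # -> [19000.0]
```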
These days, we also embed large language models (LLMs) into our workflow to perform semantic checks, like recognizing whether a string is a legitimate color name or determining whether a person’s name has been properly split into first name and surname.
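Wiring up such a check might look something like the sketch below. The provider, model name, and prompt are all assumptions here (any chat-completion API would do), and in practice the prompt and pass/fail convention would be tuned per project.

```python
# Illustrative semantic check via an LLM. The provider, model name, and prompt
# are assumptions; any chat-completion API could stand in here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def looks_like_color(value: str) -> bool:
    """Ask the model whether a scraped string is a plausible color name."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap for whatever you use
        messages=[{
            "role": "user",
            "content": f'Is "{value}" a plausible product color name? Answer YES or NO.',
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

print(looks_like_color("Midnight Blue"))   # expected: True
print(looks_like_color("Free shipping"))   # expected: False
```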
Validity: validating against the project
“Valid” data means data that conforms to the project’s agreed rules, both at the field level and in how fields might relate to each other.
Validity checks enforce field-level rules such as whether dates follow the desired format, whether missing fields should be represented as empty strings (rather than null or omitted), and whether enumerated fields like country_code conform to the ISO 3166 standard.
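A couple of those field-level rules, sketched in Python; the date format and the abbreviated country-code list are illustrative.

```python
# Sketch of field-level validity rules (the date format and country codes are examples).
from datetime import datetime

ISO_3166_ALPHA2 = {"US", "GB", "DE", "FR", "ES", "NL"}  # subset for illustration

def valid_date(value: str, fmt: str = "%Y-%m-%d") -> bool:
    """True if the string parses under the project's agreed date format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

def valid_country_code(value: str) -> bool:
    return value in ISO_3166_ALPHA2

print(valid_date("2024-02-30"))      # False: not a real date
print(valid_country_code("UK"))      # False: ISO 3166 uses GB
```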
Paired fields are validated together using conditional modifiers to ensure logical consistency. For example:
If in_stock is false, then inventory_count must be zero.
If price is present, currency must be, too.
A discounted_price should never exceed the regular_price.
If subcategory is present, category must be as well.
These rules help catch issues that simple data type checks might miss.
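Expressed in code, those paired-field rules might look like the following sketch, with field names mirroring the examples above.

```python
# Sketch of cross-field (paired) validation rules; field names mirror the examples above.
def cross_field_errors(record: dict) -> list[str]:
    errors = []
    if record.get("in_stock") is False and record.get("inventory_count", 0) != 0:
        errors.append("out-of-stock item has a non-zero inventory_count")
    if record.get("price") is not None and record.get("currency") is None:
        errors.append("price present without a currency")
    if (record.get("discounted_price") is not None
            and record.get("regular_price") is not None
            and record["discounted_price"] > record["regular_price"]):
        errors.append("discounted_price exceeds regular_price")
    if record.get("subcategory") is not None and record.get("category") is None:
        errors.append("subcategory present without a parent category")
    return errors

# -> flags the stale inventory_count and the missing currency
print(cross_field_errors({"in_stock": False, "inventory_count": 3, "price": 19.99}))
```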
Consistency: validating across records
Consistency means that the data holds together across records. Prices listed as $49.99 in one row and 49,99 USD in another might both be accurate and valid, but they’re not consistent.
Here we enforce baseline normalization: ensuring that decimal separators, currency symbols, casing, and units are aligned with what’s defined at the schema level.
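Here’s a sketch of what price normalization can look like, assuming the schema calls for a numeric amount plus an ISO 4217 currency code; the symbol map is a small illustrative subset.

```python
# Sketch of price normalization, assuming the schema wants a float amount plus
# an ISO 4217 currency code. The symbol map is a small illustrative subset.
import re

CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}

def normalize_price(raw: str) -> tuple[float, str | None]:
    """Turn '$49.99', '49,99 USD', or '€1.299,00' into (amount, currency)."""
    raw = raw.strip()
    currency = next((code for sym, code in CURRENCY_SYMBOLS.items() if sym in raw), None)
    if currency is None:
        match = re.search(r"\b[A-Z]{3}\b", raw)
        currency = match.group(0) if match else None
    digits = re.sub(r"[^\d.,]", "", raw)
    # If a comma appears after the last dot, treat it as the decimal separator.
    if "," in digits and digits.rfind(",") > digits.rfind("."):
        digits = digits.replace(".", "").replace(",", ".")
    else:
        digits = digits.replace(",", "")
    return float(digits), currency

print(normalize_price("$49.99"))      # (49.99, 'USD')
print(normalize_price("49,99 USD"))   # (49.99, 'USD')
```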
We also run consistency checks across deliveries. If the format of an ecommerce product price changes unexpectedly, or a field starts including new units or categories, it gets automatically flagged for manual review.
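One way to catch that kind of drift is to compare the “shape” of a field’s values between the previous delivery and the current one; the pattern function and threshold below are illustrative.

```python
# Sketch of a cross-delivery drift check: compare the distribution of value
# "shapes" for a field between two deliveries and flag new patterns.
from collections import Counter
import re

def value_shape(value: str) -> str:
    """Reduce a value to a coarse pattern, e.g. '$49.99' -> '$99.99'."""
    return re.sub(r"\d", "9", value)

def new_patterns(previous: list[str], current: list[str], min_share: float = 0.01) -> set[str]:
    """Patterns that appear in the current delivery but never appeared before."""
    seen = set(map(value_shape, previous))
    counts = Counter(map(value_shape, current))
    total = sum(counts.values())
    return {p for p, n in counts.items() if p not in seen and n / total >= min_share}

prev = ["$49.99", "$500.00"]
curr = ["$49.99", "49,99 USD"]
print(new_patterns(prev, curr))  # {'99,99 USD'} -> flag for manual review
```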
Consistency checks are what keep downstream systems from breaking on edge cases.
If you’re familiar with traditional data quality frameworks, you might notice we don’t list uniqueness as a separate dimension. That’s because duplicate handling depends on the project’s context. In some projects, customers actually need the same records delivered across different datasets. In others, uniqueness is enforced within each delivery. These requirements are captured at the project level and validated as part of our consistency checks.
Completeness: validating against the website
Completeness means every record and field that could be collected is there – no gaps, no duplicates. We track this using project-specific baselines: expected record counts and field coverage percentages from previous deliveries. If any metric drops below a defined threshold, the team gets an alert and investigates immediately.
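A simplified version of that baseline comparison is sketched below; the 10% tolerance and the metric names are illustrative.

```python
# Sketch of a record-count / field-coverage check against the previous delivery.
# The 10% tolerance and the metric names are illustrative.
def completeness_alerts(current: dict, baseline: dict, tolerance: float = 0.10) -> list[str]:
    """Compare this delivery's metrics against the previous baseline."""
    alerts = []
    if current["record_count"] < baseline["record_count"] * (1 - tolerance):
        alerts.append(f"record count dropped to {current['record_count']} "
                      f"(baseline {baseline['record_count']})")
    for field, expected in baseline["field_coverage"].items():
        observed = current["field_coverage"].get(field, 0.0)
        if observed < expected * (1 - tolerance):
            alerts.append(f"coverage of '{field}' fell to {observed:.0%} (baseline {expected:.0%})")
    return alerts

baseline = {"record_count": 120_000, "field_coverage": {"price": 0.99, "brand": 0.85}}
current = {"record_count": 96_500, "field_coverage": {"price": 0.99, "brand": 0.61}}
print(completeness_alerts(current, baseline))  # flags the record count and 'brand' coverage
```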
Field-level completeness can be enforced via custom modifiers in the schema. Fields listed as required yet missing from the scraped results will fail validation. Optional fields are monitored but only block delivery when their coverage falls below acceptable levels.
For record-level completeness, we adapted a biology-inspired technique: mark-release-recapture, in which researchers estimate wild animal populations by marking and recapturing individuals. We apply the same principle to web data: crawl a site, tag record IDs, crawl again later, and measure overlap. If the overlap is low, we’re likely missing records.
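The classic way to turn that overlap into an estimate is the Lincoln-Petersen formula from mark-recapture studies; the sketch below applies it to two sets of scraped record IDs (the IDs and numbers are illustrative).

```python
# Sketch of the recapture check. The Lincoln-Petersen estimate (a standard
# mark-recapture formula) gives a rough idea of the total catalogue size.
def recapture_report(first_crawl: set[str], second_crawl: set[str]) -> dict:
    recaptured = len(first_crawl & second_crawl)
    if recaptured == 0:
        return {"overlap": 0.0, "estimated_total": None}  # no overlap: investigate
    estimated_total = len(first_crawl) * len(second_crawl) / recaptured
    return {
        "overlap": recaptured / len(first_crawl),
        "estimated_total": round(estimated_total),
    }

first = {f"sku-{i}" for i in range(0, 8000)}        # IDs seen in the first crawl
second = {f"sku-{i}" for i in range(2000, 10000)}   # IDs seen in the second crawl
print(recapture_report(first, second))
# overlap 0.75, estimated total ~10,667 -> each crawl is likely missing records
```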
We also inspect sitemaps, spin up dedicated discovery crawlers, and investigate external information sources when needed. Coverage can’t always be proven – but we apply every signal, tool, and technique available to minimize blind spots.
Timeliness: validating against the schedule
Timeliness means the data is delivered on the promised schedule, something that has to be balanced against the need to avoid over-burdening the websites we crawl. We’ve built internal monitors that track delivery frequency against project expectations. If a dataset is scheduled weekly and no new job is detected after eight days, the team gets notified to investigate.
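A freshness monitor along those lines can be as simple as the sketch below, where the one-day grace period mirrors the “weekly plus a day” rule above.

```python
# Sketch of a delivery-freshness monitor: weekly schedule plus a one-day grace
# period, matching the "no new job after eight days" rule above.
from datetime import datetime, timedelta, timezone

def is_overdue(last_delivery: datetime,
               interval: timedelta = timedelta(days=7),
               grace: timedelta = timedelta(days=1),
               now: datetime | None = None) -> bool:
    """True if the next delivery should already have arrived."""
    now = now or datetime.now(timezone.utc)
    return now - last_delivery > interval + grace

last = datetime(2024, 5, 1, tzinfo=timezone.utc)
print(is_overdue(last, now=datetime(2024, 5, 10, tzinfo=timezone.utc)))  # True: 9 days
```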