Such a schema articulates the requirements for your output to data engineers and systems alike, serving as both the map for your data journey and the enforcer of your data project.
3. Monitoring for data quality
With the specifications in place and web data flowing in, your team can configure variables and checks to actually monitor for DQ.
In web data extraction, a “monitor” is like a “test case” in software development at large – a set of instructions for ensuring proper functioning.
Teams using Scrapy, one of the most widely used open-source frameworks for web data extraction, can also harness the Spidermon monitoring framework, a natural complement for verifying extracted data against your mandated schema.
Spidermon supports data validation against JSON Schema rules and can trigger alerts when a crawl breaches those rules or fails outright.
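For Scrapy projects, wiring this up is mostly configuration. Below is a minimal sketch of Spidermon's item validation settings, assuming a hypothetical schemas/product.json file that encodes your output spec:

```python
# settings.py -- a minimal sketch of enabling Spidermon schema validation.
SPIDERMON_ENABLED = True

ITEM_PIPELINES = {
    # Validates every scraped item against the schemas listed below.
    "spidermon.contrib.scrapy.pipelines.ItemValidationPipeline": 800,
}

# Hypothetical path to the JSON Schema file encoding your output spec.
SPIDERMON_VALIDATION_SCHEMAS = ["schemas/product.json"]

# Attach validation errors to failing items so they can be inspected later.
SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS = True
```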
4. Collecting DQ indicators
With monitors in place, what signals actually add up to data quality? Your team can check for quality by testing extracted data against our five dimensions.
Accuracy: validating against the webpage
Does the data being output match the actual content on the target page, or was it garbled during extraction? Your team can calculate the match rate by running a side-by-side visual comparison of a representative sample of records against the live pages.
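A minimal sketch of the match-rate calculation, using hypothetical records from such a manual review session:

```python
# Compute an accuracy match rate from a manually reviewed sample.
# `reviewed` pairs each extracted value with what a human read off
# the live page (both values here are hypothetical).
reviewed = [
    {"field": "price", "extracted": "19.99", "on_page": "19.99"},
    {"field": "title", "extracted": "Blue Widget", "on_page": "Blue Widget!"},
]

matches = sum(1 for r in reviewed if r["extracted"] == r["on_page"])
match_rate = matches / len(reviewed)  # 0.5 here -> 50% accuracy
print(f"Accuracy: {match_rate:.0%}")
```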
Completeness: validating against the website
Did extraction get all the target records? Overall completeness compares the latest collected record count with the expected number. Field completeness shows how often each field is populated in the collected data. Summary stats for key fields can be reported in the crawl job metadata to make issues easier to spot.
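A minimal sketch of both completeness checks, with hypothetical records and an assumed expected count:

```python
# Overall and per-field completeness for a crawl. `records` and
# EXPECTED_COUNT are hypothetical stand-ins for your collected items
# and the record count you expected the crawl to produce.
records = [
    {"title": "Blue Widget", "price": "19.99", "sku": None},
    {"title": "Red Widget", "price": None, "sku": "RW-1"},
]
EXPECTED_COUNT = 4

overall = len(records) / EXPECTED_COUNT  # 0.5 -> half the records arrived
print(f"Overall completeness: {overall:.0%}")

for field in ("title", "price", "sku"):
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    print(f"{field}: {filled / len(records):.0%} populated")
```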
Validity: validating against the format requirements
Do extracted data fields appear in the correct format? Each field is tested against the rules supplied in your schema to calculate the percentage of rows that conform to the agreed patterns.
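A minimal sketch of a validity check, with illustrative patterns standing in for your real schema rules:

```python
import re

# Test each field against a pattern from the schema. These patterns
# are hypothetical examples, not your actual schema rules.
RULES = {
    "price": re.compile(r"^\d+\.\d{2}$"),   # e.g. "19.99"
    "sku": re.compile(r"^[A-Z]{2}-\d+$"),   # e.g. "RW-1"
}

rows = [
    {"price": "19.99", "sku": "RW-1"},
    {"price": "free", "sku": "rw1"},
]

valid = sum(
    1 for row in rows
    if all(p.match(str(row.get(f, ""))) for f, p in RULES.items())
)
print(f"Validity: {valid / len(rows):.0%} of rows conform")  # 50% here
```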
Consistency: validating across records
Do values stay consistent across records or between extracts? What if the reviews you are extracting move from a five-star system to a 10-point scale? Calculate z-scores (how many standard deviations a value lies from the mean) for key numeric fields to spot such outliers.
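A minimal sketch of the z-score check, using hypothetical review scores:

```python
from statistics import mean, stdev

# Flag review scores whose z-score suggests a scale change or outlier.
# The sample values are hypothetical.
scores = [4.5, 3.8, 4.9, 4.2, 9.5]  # the 9.5 hints at a 10-point scale

mu, sigma = mean(scores), stdev(scores)
for s in scores:
    z = (s - mu) / sigma
    if abs(z) > 1.5:  # the threshold is a judgment call; tune per field
        print(f"{s} looks inconsistent (z = {z:.2f})")
```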
Timeliness: validating against the schedule
Is the data delivered soon enough to be valuable? Compare the recorded delivery completion timestamp to the target delivery time, then tally the current job's on-time status against the total number of deliveries so far to get a percentage success rate.
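A minimal sketch of the timeliness calculation, with hypothetical completion and due timestamps:

```python
from datetime import datetime, timezone

# Score on-time delivery as a running percentage. Each tuple is a
# hypothetical (completed_at, due_at) pair for one delivery job.
deliveries = [
    (datetime(2024, 1, 1, 5, 50, tzinfo=timezone.utc),
     datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)),
    (datetime(2024, 1, 2, 6, 20, tzinfo=timezone.utc),
     datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)),
]

on_time = sum(1 for completed, due in deliveries if completed <= due)
print(f"Timeliness: {on_time / len(deliveries):.0%} on time")  # 50% here
```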
By testing the indicators inherent in your data against these yardsticks, you can begin to get a real sense of your data's quality.
5. Scoring data quality
What separates a one-off quality inspection from a systematic commitment to improving data quality is ongoing measurement and the refinement it enables.
Many people have a gut sense that they know “quality” when they see it. But what gets measured gets managed.
Gartner estimates that poor-quality data costs organizations an average of $15 million a year, yet 59% of organizations do not measure data quality.
MIT’s TDQM Cycle recommends you “develop a set of metrics that measure the important dimensions of data quality for the organization and that can be linked to the organization’s general goals and objectives”.
There is no single, universally agreed-upon score for data quality across all industries. But, having monitored the key indicators above, you can compute a simple weighted average as your overall data quality score.
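A minimal sketch of such a weighted average, with hypothetical indicator values and weights (tune the weights to what matters most for your use case):

```python
# Combine the five indicators into one score. Both the indicator
# values and the weights below are hypothetical examples.
indicators = {
    "accuracy": 0.98,
    "completeness": 0.95,
    "validity": 0.99,
    "consistency": 0.97,
    "timeliness": 0.90,
}
weights = {  # must sum to 1.0
    "accuracy": 0.3,
    "completeness": 0.25,
    "validity": 0.2,
    "consistency": 0.15,
    "timeliness": 0.1,
}

score = sum(indicators[k] * weights[k] for k in indicators)
print(f"Overall data quality: {score:.1%}")  # 96.5% with these inputs
```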