Giving spidey-senses to your web scraping spiders using Spidermon

Read Time
5 min
Posted on
April 27, 2026
Learn how Spidermon helps you monitor web scraping data quality in real time. Validate items, track field coverage, and get alerts before bad data impacts your pipeline.
By
Ayan Pahwa
Your spider ran overnight. It returned 10,000 items. The logs say "finished." Everything looks fine, so you move on with your day.

Three days later, someone downstream notices that 40% of the price fields are empty. The site changed a CSS class, your spider kept running, and nobody knew anything was wrong until the damage was done.

Artur Sadurski, a data quality expert with more than 10 years at Zyte, put it well on the Extract Data podcast: data quality is like your car; you don't know it's broken until it breaks down - and the further you go before checking, the more expensive the fix becomes.

This is the silent failure problem, and it is the most expensive kind of bug in web scraping. Don't blame the spider: it did its job; the data just wasn't any good.

Spidermon exists to solve exactly this. It is an open-source monitoring framework for Scrapy that gives your spiders the equivalent of Peter Parker's spidey-sense: the ability to detect problems as they happen, validate every item against a schema, and alert your team before bad data flows downstream.

Set your expectations before you scrape

Data quality is not a single discipline. As Artur described on the podcast, think of it as a set of sliders.

Some projects need broad coverage: scrape as much as possible, tolerate some gaps. Others need surgical precision: fewer items, but every field must be pristine. Most projects fall somewhere in between, and the right balance depends entirely on your use case.

A pricing aggregator cannot afford a single wrong price. A product catalog builder might tolerate 10% missing descriptions as long as coverage across categories is high. A machine learning team training on scraped text needs consistency above all.

The framework you use to monitor data quality should let you configure these trade-offs, not hard-code them. It should let you say: "I need 100% of items to have a price, but 70% coverage on descriptions is fine." It should let you set those sliders per field, per spider, per project.

That is what Spidermon does.

Meet Spidermon: your spider's first line of defense

Spidermon is a monitoring framework built by Zyte and battle-tested on hundreds of production spiders before being released as an open-source library. It integrates directly into your Scrapy project and provides three core capabilities:

Data validation checks every scraped item against a schema you define, catching type mismatches, missing fields, and malformed values in real time.

Stats monitoring lets you write health checks (called "monitors") that run at key moments during spider execution, verifying things like item counts, error rates, finish reasons, and field coverage.

Automated notifications alert your team through Slack, Telegram, Discord, email, or Sentry when something goes wrong, so you find out before your client does.

At Zyte, monitors that run during scraping are considered the first line of defense. Artur described this layered approach on the podcast: simple checks catch problems early and cheaply, while more complex inspections happen further down the pipeline. Spidermon handles that critical first layer.

Start with the cornerstone: JSON schema validation

If you take one thing from this blog post, make it this: set up JSON schema validation. Artur was emphatic about this point: every project at Zyte starts with basic JSON schema validation. It is the foundation. It might seem surface-level, but it catches a surprising number of real issues, from wrong data types to missing required fields to malformed URLs.

Here is how to set it up from scratch.

Install and enable Spidermon

```shell
pip install "spidermon[monitoring,validation]"
```

In your settings.py, enable the extension and the validation pipeline:

```python
SPIDERMON_ENABLED = True
EXTENSIONS = {
    "spidermon.contrib.scrapy.extensions.Spidermon": 500,
}
ITEM_PIPELINES = {
    "spidermon.contrib.scrapy.pipelines.ItemValidationPipeline": 800,
}
```

Keep the validation pipeline last so no subsequent pipeline changes the content of an item after validation.
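For instance, if you also run cleaning or normalization pipelines, give them lower priority numbers than the validation pipeline so validation sees each item in its final shape. A sketch of such an ordering, where the cleaning pipeline path is a hypothetical placeholder:

```python
# settings.py -- sketch of pipeline ordering; the
# "myproject.pipelines.PriceNormalizationPipeline" path is a
# hypothetical placeholder, not part of Spidermon.
ITEM_PIPELINES = {
    # Cleaning runs first (lower number = earlier in the pipeline chain).
    "myproject.pipelines.PriceNormalizationPipeline": 300,
    # Validation runs last, so no later pipeline can alter a validated item.
    "spidermon.contrib.scrapy.pipelines.ItemValidationPipeline": 800,
}
```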

Define your schema

Say you are scraping products from an e-commerce site. Your items have a name, price, URL, and a list of image URLs. Create a JSON schema that defines what a valid item looks like:

```json
{
    "$schema": "http://json-schema.org/draft-07/schema",
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "minLength": 1
        },
        "price": {
            "type": "number",
            "minimum": 0
        },
        "url": {
            "type": "string",
            "pattern": "^https?://"
        },
        "image_urls": {
            "type": "array",
            "items": {
                "type": "string",
                "pattern": "^https?://"
            }
        }
    },
    "required": ["name", "price", "url"]
}
```

Save this as product_schema.json in your project, and point Spidermon to it:

```python
SPIDERMON_VALIDATION_SCHEMAS = ["./product_schema.json"]
```

That is it. Every item your spider yields will now be validated against this schema. Run your spider and check the logs. You will see new stats like:

```
'spidermon/validation/fields': 3000,
'spidermon/validation/fields/errors': 12,
'spidermon/validation/items': 1000,
'spidermon/validation/items/errors': 8,
```

Those 12 field errors across eight items? Those are problems you would have shipped silently without Spidermon. Now you know about them before anyone else does.
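As a quick sanity check, those counters translate directly into error rates. A standalone sketch using the numbers above:

```python
# Sketch: deriving error rates from the Spidermon validation stats above.
stats = {
    "spidermon/validation/fields": 3000,
    "spidermon/validation/fields/errors": 12,
    "spidermon/validation/items": 1000,
    "spidermon/validation/items/errors": 8,
}

field_error_rate = stats["spidermon/validation/fields/errors"] / stats["spidermon/validation/fields"]
item_error_rate = stats["spidermon/validation/items/errors"] / stats["spidermon/validation/items"]

print(f"{field_error_rate:.2%} of fields, {item_error_rate:.2%} of items failed validation")
# 0.40% of fields, 0.80% of items failed validation
```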

If you want validation errors attached directly to each item, add this setting:

```python
# settings.py
SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS = True
```

Items that fail validation will include a _validation field showing exactly what went wrong:

```json
{
    "_validation": {"url": ["Invalid URL"]},
    "name": "Wireless Headphones",
    "price": 49.99,
    "url": "not_a_valid_url"
}
```

You can even tell Spidermon to drop invalid items entirely with SPIDERMON_VALIDATION_DROP_ITEMS_WITH_ERRORS = True, though in most cases you will want to collect the errors first and decide how to handle them.
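One common middle ground, sketched below in plain Python (the helper function is hypothetical, not part of Spidermon), is to keep everything but route flagged items aside for review:

```python
# Sketch: separating items that carry Spidermon's "_validation" field
# (added when SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS is enabled)
# from clean ones. split_by_validity is a hypothetical helper.
def split_by_validity(items):
    valid, invalid = [], []
    for item in items:
        (invalid if "_validation" in item else valid).append(item)
    return valid, invalid

items = [
    {"name": "Wireless Headphones", "price": 49.99, "url": "https://example.com/p/1"},
    {"_validation": {"url": ["Invalid URL"]}, "name": "Speaker", "price": 29.99, "url": "not_a_valid_url"},
]
valid, invalid = split_by_validity(items)
print(len(valid), len(invalid))  # 1 1
```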

Tune your sliders: field coverage and thresholds

Not every item will have every field, and that is fine. But you need to know and control how much missing data is acceptable. This is where Spidermon's field coverage feature comes in.


On the podcast, Artur made the point that different clients care about different fields. One client might need prices to be perfect while descriptions are optional; another might be building a service that reads product descriptions, making that the most critical field. Spidermon lets you encode exactly these priorities.

First, enable field coverage tracking:

```python
# settings.py
SPIDERMON_ADD_FIELD_COVERAGE = True
```

Then define your rules:
# settings.py

1SPIDERMON_FIELD_COVERAGE_RULES = {
2    "dict/name": 0.95,       # 95% of items must have a name
3    "dict/price": 1.0,       # 100% must have a price
4    "dict/url": 1.0,         # 100% must have a URL
5    "dict/image_urls": 0.70, # 70% coverage is acceptable for images
6}
Copy

With this configuration, Spidermon will fail the FieldCoverageMonitor if fewer than 95% of your items have a name, or if even a single item is missing a price. But it will tolerate up to 30% of items lacking image URLs.

This is the "set of sliders" Artur described, implemented as configuration.
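Conceptually, the check is simple arithmetic: the fraction of items containing a field is compared against its threshold. A standalone sketch of that logic (not Spidermon's actual implementation, and using bare field names rather than the dict/ prefix):

```python
# Sketch of the field-coverage check: for each rule, the fraction of
# items containing the field must meet the configured threshold.
def coverage_failures(items, rules):
    total = len(items)
    failures = {}
    for field, threshold in rules.items():
        coverage = sum(1 for it in items if it.get(field) is not None) / total
        if coverage < threshold:
            failures[field] = coverage
    return failures

items = [
    {"name": "A", "price": 10.0, "url": "https://x/1", "image_urls": ["https://x/i1"]},
    {"name": "B", "price": 12.5, "url": "https://x/2"},                # no images
    {"name": "C", "price": 7.0, "url": "https://x/3"},                 # no images
    {"price": 3.0, "url": "https://x/4", "image_urls": ["https://x/i4"]},  # no name
]
rules = {"name": 0.95, "price": 1.0, "url": 1.0, "image_urls": 0.70}
print(coverage_failures(items, rules))
# {'name': 0.75, 'image_urls': 0.5}
```

Here name coverage is 75% (below the 95% slider) and image coverage is 50% (below 70%), so both would fail, while price and url pass at 100%.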

Your first monitor in five minutes

Validation happens automatically through the pipeline. Monitors are the layer on top: health checks that run at defined moments during spider execution, inspect the stats, and decide whether the run was healthy or not.

Here is a minimal monitors.py that checks item count and validation errors:

```python
# monitors.py
from spidermon import Monitor, MonitorSuite, monitors
from spidermon.contrib.scrapy.monitors import (
    FieldCoverageMonitor,
    FinishReasonMonitor,
    ErrorCountMonitor,
)

@monitors.name("Item count")
class ItemCountMonitor(Monitor):

    @monitors.name("Minimum number of items")
    def test_minimum_number_of_items(self):
        item_extracted = getattr(
            self.data.stats, "item_scraped_count", 0
        )
        minimum_threshold = 100
        msg = "Extracted fewer than {} items".format(
            minimum_threshold
        )
        self.assertTrue(
            item_extracted >= minimum_threshold, msg=msg
        )


class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [
        ItemCountMonitor,
        FieldCoverageMonitor,
        FinishReasonMonitor,
        ErrorCountMonitor,
    ]
```

Wire it up in settings.py:

```python
# settings.py
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    "myproject.monitors.SpiderCloseMonitorSuite",
)
```

Run your spider. At the end of the crawl, you will see output like this in your logs:

```
INFO: [Spidermon] -------------------- MONITORS --------------------
INFO: [Spidermon] Item count/Minimum number of items... OK
INFO: [Spidermon] FieldCoverageMonitor... OK
INFO: [Spidermon] FinishReasonMonitor... OK
INFO: [Spidermon] ErrorCountMonitor... OK
INFO: [Spidermon] 4 monitors in 0.003s
INFO: [Spidermon] OK
```

If anything fails, Spidermon reports exactly what went wrong and triggers the actions you have configured.

The monitors you get for free

Writing custom monitors is powerful, but Spidermon ships with a set of built-in monitors that cover the most common checks. You just configure them:

FinishReasonMonitor verifies that your spider finished for an expected reason (typically "finished"), not because it was banned or hit an error.

```python
# settings.py
SPIDERMON_EXPECTED_FINISH_REASONS = ["finished"]
```

ErrorCountMonitor fails if the spider logs too many errors.

```python
# settings.py
SPIDERMON_MAX_ERRORS = 10
```

ItemValidationMonitor checks that validation error rates stay within acceptable bounds. No additional configuration needed if you have already set up schema validation.

FieldCoverageMonitor enforces the per-field coverage rules you defined earlier.

PeriodicItemCountMonitor runs at intervals during long crawls to make sure items are still flowing, catching stalls before the spider finishes.

DownloaderExceptionMonitor flags excessive downloader exceptions, which often indicate anti-bot measures or network problems.
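Periodic monitors run on a schedule rather than at spider close, configured via a setting that maps a suite to a time interval in seconds. A sketch, assuming a hypothetical PeriodicMonitorSuite defined in your project:

```python
# settings.py -- sketch: run the given suite every 300 seconds while the
# spider is crawling. "myproject.monitors.PeriodicMonitorSuite" is a
# hypothetical suite name; substitute your own.
SPIDERMON_PERIODIC_MONITORS = {
    "myproject.monitors.PeriodicMonitorSuite": 300,  # interval in seconds
}
```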

For projects that want quick, no-code monitors, Spidermon also supports expression monitors defined entirely in settings:

```python
# settings.py
SPIDERMON_SPIDER_CLOSE_EXPRESSION_MONITORS = [
    {
        "name": "QuickChecks",
        "tests": [
            {
                "name": "minimum_items",
                "expression": "stats.get('item_scraped_count', 0) >= 100",
            },
            {
                "name": "low_error_rate",
                "expression": "stats.get('log_count/ERROR', 0) < 10",
            },
        ],
    },
]
```

These are especially useful for operations teams who want to add monitoring rules without modifying Python code.

Do not just log it, alert the right people

Seeing "FAILED" in the spider logs is useful during development. In production, where you might be running hundreds of spiders, it is useless unless someone is watching. Spidermon integrates with the notification tools your team already uses.

Here is how to set up Slack notifications. Add a failed action to your monitor suite:

```python
# monitors.py
from spidermon.contrib.actions.slack.notifiers import (
    SendSlackMessageSpiderFinished,
)

class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [
        ItemCountMonitor,
        FieldCoverageMonitor,
        FinishReasonMonitor,
    ]
    monitors_failed_actions = [
        SendSlackMessageSpiderFinished,
    ]
```

Configure your Slack credentials:

```python
# settings.py
SPIDERMON_SLACK_SENDER_TOKEN = "<YOUR_SLACK_BOT_TOKEN>"
SPIDERMON_SLACK_RECIPIENTS = ["#scraping-alerts"]
```

Now, when any monitor fails, your team gets a Slack message with the details. Spidermon also supports Telegram, Discord, email through Amazon Simple Email Service (SES), Sentry, and Amazon Simple Notification Service (SNS). You can also write custom actions for anything else.

The goal, as Artur emphasized on the podcast, is visibility. The sooner you know something is wrong, the less it costs to fix.

Where to go from here

This blog post covered the essentials: schema validation, field coverage, monitors, and notifications. That is already enough to catch the majority of data quality issues in a production scraping setup. But Spidermon goes deeper:

  • Periodic monitors let you run checks at intervals during long crawls, catching stalls and anomalies before the spider finishes.
  • Custom actions can do anything: tag jobs, generate reports, push alerts to custom dashboards.
  • Comparing spider executions lets you detect regressions by checking current stats against previous runs, answering questions like: "Did this spider return significantly fewer items than last time?"
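To give a flavor of custom actions, here is a minimal self-contained sketch. The Action base class below is a stand-in so the example runs on its own; in a real project you would subclass spidermon.core.actions.Action and implement run_action(), and Spidermon would invoke the action for you when the suite finishes:

```python
# Sketch of a custom Spidermon-style action. The Action base below is a
# stand-in so this runs standalone; everything here is illustrative,
# not Spidermon's real internals.
class Action:  # stand-in for spidermon.core.actions.Action
    def run(self, results):
        self.results = results
        self.run_action()

class TagFailedRun(Action):
    """Hypothetical action: collect failed monitor names, e.g. to tag a
    job or push to a custom dashboard."""
    def __init__(self):
        self.tags = []

    def run_action(self):
        for monitor_name, passed in self.results.items():
            if not passed:
                self.tags.append(f"failed:{monitor_name}")

action = TagFailedRun()
action.run({"Item count": True, "FieldCoverageMonitor": False})
print(action.tags)  # ['failed:FieldCoverageMonitor']
```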

And looking ahead, the future of data quality extends beyond structural validation. Artur described on the podcast how large language models (LLMs) are opening a new frontier of semantic testing, the kind of checks that were previously impossible or required purpose-built models.

An LLM can read an entire record and notice that a color field contains a product name, or that a description seems to be a mix of two different products.

  • Spidermon handles the structural and statistical layer.
  • LLMs can handle the semantic layer.

They are complementary, not competing, and together they represent the full spectrum of data quality checks that modern web scraping demands.

For now, start with the foundation. Install Spidermon, add a JSON schema for your items, enable field coverage, wire up a notification channel, and let your spiders tell you when something is wrong instead of waiting for someone else to find out.

Your spiders are already running. Give them spidey-senses.

Get started:

  • Spidermon documentation
  • Spidermon on GitHub
  • Getting started tutorial
  • Listen to the Extract Data podcast episode on data quality

More on data quality:

  • How Zyte’s extraction experts guarantee data quality
  • The DQ playbook: How ‘data quality’ fuels business’ pursuit of precision
© Zyte Group Limited 2026