Your spider ran overnight. It returned 10,000 items. The logs say "finished." Everything looks fine, so you move on with your day.
Three days later, someone downstream notices that 40% of the price fields are empty. The site changed a CSS class, your spider kept running, and nobody knew anything was wrong until the damage was done.
Artur Sadurski, a data quality expert with more than 10 years at Zyte, put it well on the Extract Data podcast: data quality is like your car; you don't know it's broken until it breaks down - and the further you go before checking, the more expensive the fix becomes.
This is the silent failure problem, and it is the most expensive kind of bug in web scraping. Don't blame the spider - it did its job; the data just wasn't any good.
Spidermon exists to solve exactly this. It is an open-source monitoring framework for Scrapy that gives your spiders the equivalent of Peter Parker's spidey-sense: the ability to detect problems as they happen, validate every item against a schema, and alert your team before bad data flows downstream.
Set your expectations before you scrape
Data quality is not a single discipline. As Artur described on the podcast, think of it as a set of sliders.
Some projects need broad coverage: scrape as much as possible, tolerate some gaps. Others need surgical precision: fewer items, but every field must be pristine. Most projects fall somewhere in between, and the right balance depends entirely on your use case.
A pricing aggregator cannot afford a single wrong price. A product catalog builder might tolerate 10% missing descriptions as long as coverage across categories is high. A machine learning team training on scraped text needs consistency above all.
The framework you use to monitor data quality should let you configure these trade-offs, not hard-code them. It should let you say: "I need 100% of items to have a price, but 70% coverage on descriptions is fine." It should let you set those sliders per field, per spider, per project.
That is what Spidermon does.
Meet Spidermon: your spider's first line of defense
Spidermon is a monitoring framework built by Zyte and battle-tested on hundreds of production spiders before being released as an open-source library. It integrates directly into your Scrapy project and provides three core capabilities:
Data validation checks every scraped item against a schema you define, catching type mismatches, missing fields, and malformed values in real time.
Stats monitoring lets you write health checks (called "monitors") that run at key moments during spider execution, verifying things like item counts, error rates, finish reasons, and field coverage.
Automated notifications alert your team through Slack, Telegram, Discord, email, or Sentry when something goes wrong, so you find out before your client does.
At Zyte, monitors that run during scraping are considered the first line of defense. Artur described this layered approach on the podcast: simple checks catch problems early and cheaply, while more complex inspections happen further down the pipeline. Spidermon handles that critical first layer.
Start with the cornerstone: JSON schema validation
If you take one thing from this blog post, make it this: set up JSON schema validation. Artur was emphatic about this point: every project at Zyte starts with basic JSON schema validation. It is the foundation. It might seem surface-level, but it catches a surprising number of real issues, from wrong data types to missing required fields to malformed URLs.
Here is how to set it up from scratch.
Install and enable Spidermon
```shell
pip install "spidermon[monitoring,validation]"
```

In your settings.py, enable the extension and the validation pipeline:

```python
# settings.py
SPIDERMON_ENABLED = True

EXTENSIONS = {
    "spidermon.contrib.scrapy.extensions.Spidermon": 500,
}

ITEM_PIPELINES = {
    "spidermon.contrib.scrapy.pipelines.ItemValidationPipeline": 800,
}
```

Keep the validation pipeline last so no subsequent pipeline changes the content of an item after validation.
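If you do have other pipelines, give them a lower order number so they run before validation. A minimal sketch, assuming a hypothetical cleaning pipeline of your own called PriceCleanupPipeline (only the Spidermon pipeline path is real):

```python
# settings.py
ITEM_PIPELINES = {
    # Hypothetical pipeline of your own that normalizes prices; runs first (lower number)
    "myproject.pipelines.PriceCleanupPipeline": 300,
    # Spidermon validation runs last, so it sees the final version of each item
    "spidermon.contrib.scrapy.pipelines.ItemValidationPipeline": 800,
}
```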
Define your schema
Say you are scraping products from an e-commerce site. Your items have a name, price, URL, and a list of image URLs. Create a JSON schema that defines what a valid item looks like:
```json
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "type": "object",
  "properties": {
    "name": {
      "type": "string",
      "minLength": 1
    },
    "price": {
      "type": "number",
      "minimum": 0
    },
    "url": {
      "type": "string",
      "pattern": "^https?://"
    },
    "image_urls": {
      "type": "array",
      "items": {
        "type": "string",
        "pattern": "^https?://"
      }
    }
  },
  "required": ["name", "price", "url"]
}
```

Save this as product_schema.json in your project, and point Spidermon to it:
```python
# settings.py
SPIDERMON_VALIDATION_SCHEMAS = ["./product_schema.json"]
```

That is it. Every item your spider yields will now be validated against this schema. Run your spider and check the logs. You will see new stats like:
```
'spidermon/validation/fields': 3000,
'spidermon/validation/fields/errors': 12,
'spidermon/validation/items': 1000,
'spidermon/validation/items/errors': 8,
```

Those 12 field errors across eight items? Those are problems you would have shipped silently without Spidermon. Now you know about them before anyone else does.
If you want validation errors attached directly to each item, add this setting:
```python
# settings.py
SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS = True
```

Items that fail validation will include a _validation field showing exactly what went wrong:
```json
{
  "_validation": {"url": ["Invalid URL"]},
  "name": "Wireless Headphones",
  "price": 49.99,
  "url": "not_a_valid_url"
}
```

You can even tell Spidermon to drop invalid items entirely with SPIDERMON_VALIDATION_DROP_ITEMS_WITH_ERRORS = True, though in most cases you will want to collect the errors first and decide how to handle them.
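One lightweight way to do that triage is to scan your exported feed for _validation fields and tally what went wrong. A minimal sketch, assuming a JSON Lines export at items.jl (the path is illustrative) and SPIDERMON_VALIDATION_ADD_ERRORS_TO_ITEMS enabled:

```python
# triage_validation_errors.py
import json
from collections import Counter

error_counts = Counter()

with open("items.jl", encoding="utf-8") as feed:
    for line in feed:
        item = json.loads(line)
        # _validation maps field names to lists of error messages
        for field, messages in item.get("_validation", {}).items():
            for message in messages:
                error_counts[(field, message)] += 1

# Most frequent problems first, e.g. "  120  url: Invalid URL"
for (field, message), count in error_counts.most_common():
    print(f"{count:>5}  {field}: {message}")
```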
Tune your sliders: field coverage and thresholds
Not every item will have every field, and that is fine. But you need to know and control how much missing data is acceptable. This is where Spidermon's field coverage feature comes in.

On the podcast, Artur made the point that different clients care about different fields. One client might need prices to be perfect while descriptions are optional; another might be building a service that reads product descriptions, which makes that field the most critical one. Spidermon lets you encode exactly these priorities.
First, enable field coverage tracking:
```python
# settings.py
SPIDERMON_ADD_FIELD_COVERAGE = True
```

Then define your rules:
```python
# settings.py
SPIDERMON_FIELD_COVERAGE_RULES = {
    "dict/name": 0.95,        # 95% of items must have a name
    "dict/price": 1.0,        # 100% must have a price
    "dict/url": 1.0,          # 100% must have a URL
    "dict/image_urls": 0.70,  # 70% coverage is acceptable for images
}
```

With this configuration, Spidermon will fail the FieldCoverageMonitor if fewer than 95% of your items have a name, or if even a single item is missing a price. But it will tolerate up to 30% of items lacking image URLs.
This is the "set of sliders" Artur described, implemented as configuration.
Your first monitor in five minutes
Validation happens automatically through the pipeline. Monitors are the layer on top: health checks that run at defined moments during spider execution, inspect the stats, and decide whether the run was healthy or not.
Here is a minimal monitors.py that checks item count and validation errors:
```python
# monitors.py
from spidermon import Monitor, MonitorSuite, monitors
from spidermon.contrib.scrapy.monitors import (
    FieldCoverageMonitor,
    FinishReasonMonitor,
    ErrorCountMonitor,
)


@monitors.name("Item count")
class ItemCountMonitor(Monitor):

    @monitors.name("Minimum number of items")
    def test_minimum_number_of_items(self):
        item_extracted = getattr(
            self.data.stats, "item_scraped_count", 0
        )
        minimum_threshold = 100
        msg = "Extracted fewer than {} items".format(
            minimum_threshold
        )
        self.assertTrue(
            item_extracted >= minimum_threshold, msg=msg
        )


class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [
        ItemCountMonitor,
        FieldCoverageMonitor,
        FinishReasonMonitor,
        ErrorCountMonitor,
    ]
```

Wire it up in settings.py:
```python
# settings.py
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    "myproject.monitors.SpiderCloseMonitorSuite",
)
```

Run your spider. At the end of the crawl, you will see output like this in your logs:
```
INFO: [Spidermon] -------------------- MONITORS --------------------
INFO: [Spidermon] Item count/Minimum number of items... OK
INFO: [Spidermon] FieldCoverageMonitor... OK
INFO: [Spidermon] FinishReasonMonitor... OK
INFO: [Spidermon] ErrorCountMonitor... OK
INFO: [Spidermon] 4 monitors in 0.003s
INFO: [Spidermon] OK
```

If anything fails, Spidermon reports exactly what went wrong and triggers the actions you have configured.
The monitors you get for free
Writing custom monitors is powerful, but Spidermon ships with a set of built-in monitors that cover the most common checks. You just configure them:
FinishReasonMonitor verifies that your spider finished for an expected reason (typically "finished"), not because it was banned or hit an error.
```python
# settings.py
SPIDERMON_EXPECTED_FINISH_REASONS = ["finished"]
```
ErrorCountMonitor fails if the spider logs too many errors.
```python
# settings.py
SPIDERMON_MAX_ERRORS = 10
```
ItemValidationMonitor checks that validation error rates stay within acceptable bounds. No additional configuration needed if you have already set up schema validation.
FieldCoverageMonitor enforces the per-field coverage rules you defined earlier.
PeriodicItemCountMonitor runs at intervals during long crawls to make sure items are still flowing, catching stalls before the spider finishes.
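Periodic monitors live in their own suite, registered separately from the close-time suite. A minimal sketch, assuming SPIDERMON_PERIODIC_MONITORS maps a suite path to an interval in seconds and SPIDERMON_ITEM_COUNT_INCREASE sets the minimum number of new items expected per interval (check the Spidermon docs for the exact option names in your version):

```python
# monitors.py
from spidermon import MonitorSuite
from spidermon.contrib.scrapy.monitors import PeriodicItemCountMonitor


class PeriodicMonitorSuite(MonitorSuite):
    monitors = [PeriodicItemCountMonitor]
```

```python
# settings.py
SPIDERMON_PERIODIC_MONITORS = {
    "myproject.monitors.PeriodicMonitorSuite": 300,  # run every 5 minutes
}
# Assumption: minimum number of new items expected between periodic checks
SPIDERMON_ITEM_COUNT_INCREASE = 10
```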
DownloaderExceptionMonitor flags excessive downloader exceptions, which often indicate anti-bot measures or network problems.
For projects that want quick, no-code monitors, Spidermon also supports expression monitors defined entirely in settings:
```python
# settings.py
SPIDERMON_SPIDER_CLOSE_EXPRESSION_MONITORS = [
    {
        "name": "QuickChecks",
        "tests": [
            {
                "name": "minimum_items",
                "expression": "stats.get('item_scraped_count', 0) >= 100",
            },
            {
                "name": "low_error_rate",
                "expression": "stats.get('log_count/ERROR', 0) < 10",
            },
        ],
    },
]
```

These are especially useful for operations teams who want to add monitoring rules without modifying Python code.
Do not just log it, alert the right people
Seeing "FAILED" in the spider logs is useful during development. In production, where you might be running hundreds of spiders, it is useless unless someone is watching. Spidermon integrates with the notification tools your team already uses.
Here is how to set up Slack notifications. Add a failed action to your monitor suite:
```python
# monitors.py
from spidermon.contrib.actions.slack.notifiers import (
    SendSlackMessageSpiderFinished,
)


class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [
        ItemCountMonitor,
        FieldCoverageMonitor,
        FinishReasonMonitor,
    ]
    monitors_failed_actions = [
        SendSlackMessageSpiderFinished,
    ]
```

Configure your Slack credentials:
```python
# settings.py
SPIDERMON_SLACK_SENDER_TOKEN = "<YOUR_SLACK_BOT_TOKEN>"
SPIDERMON_SLACK_RECIPIENTS = ["#scraping-alerts"]
```

Now, when any monitor fails, your team gets a Slack message with the details. Spidermon also supports Telegram, Discord, email through Amazon Simple Email Service (SES), Sentry, and Amazon Simple Notification Service (SNS). You can also write custom actions for anything else.
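Custom actions follow the same pattern: subclass Spidermon's Action and implement run_action. A minimal sketch, assuming you want to push failed-run details to an internal dashboard (the endpoint and payload shape are illustrative; only the Action base class and run_action hook come from Spidermon):

```python
# actions.py
import json
from urllib.request import Request, urlopen

from spidermon.core.actions import Action


class NotifyDashboard(Action):
    def run_action(self):
        # self.data exposes the spider and stats, like it does inside monitors
        payload = {
            "spider": self.data.spider.name,
            "items": getattr(self.data.stats, "item_scraped_count", 0),
        }
        request = Request(
            "https://dashboard.example.com/api/spider-runs",  # hypothetical endpoint
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urlopen(request)
```

Hook it into the suite the same way as the Slack action, through monitors_failed_actions.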
The goal, as Artur emphasized on the podcast, is visibility. The sooner you know something is wrong, the less it costs to fix.
Where to go from here
This blog post covered the essentials: schema validation, field coverage, monitors, and notifications. That is already enough to catch the majority of data quality issues in a production scraping setup. But Spidermon goes deeper:
- Periodic monitors let you run checks at intervals during long crawls, catching stalls and anomalies before the spider finishes.
- Custom actions can do anything: tag jobs, generate reports, push alerts to custom dashboards.
- Comparing spider executions lets you detect regressions by checking current stats against previous runs, answering questions like: "Did this spider return significantly fewer items than last time?"
And looking ahead, the future of data quality extends beyond structural validation. Artur described on the podcast how large language models (LLMs) are opening a new frontier of semantic testing: checks that were previously impossible or required purpose-built models.
An LLM can read an entire record and notice that a color field contains a product name, or that a description seems to be a mix of two different products.
- Spidermon handles the structural and statistical layer.
- LLMs can handle the semantic layer.
They are complementary, not competing, and together they represent the full spectrum of data quality checks that modern web scraping demands.
For now, start with the foundation. Install Spidermon, add a JSON schema for your items, enable field coverage, wire up a notification channel, and let your spiders tell you when something is wrong instead of waiting for someone else to find out.
Your spiders are already running. Give them spidey-senses.
Get started:
- Spidermon documentation
- Spidermon on GitHub
- Getting started tutorial
- Listen to the Extract Data podcast episode on data quality