
Web scraping is often taught using scripts that invariably dump data into a JSON or CSV file. That’s fine for learning the basics, but it doesn’t reflect how scraping works at scale in real-world systems.

In practice:

  • Scrapers usually run as scheduled jobs, not as always-on scripts or daemons.
  • Data needs to be stored reliably.
  • Environments must be reproducible.
  • Scaling and maintenance should be easy.

In this blog post, I'll walk through a demo project called scrape2postgresql, which shows how to:

  1. Use Scrapy to scrape structured data.
  2. Store results in PostgreSQL.
  3. Run everything using docker-compose.
  4. Keep spiders and database in separate containers.

This project uses books.toscrape.com, a safe demo website, to scrape book titles and prices, but the structure applies to almost any scraping use case.


Why Scrapy?

Scrapy is a full-featured web scraping framework, not just a request library.

It gives you:

  • A crawling engine.
  • Request scheduling.
  • Built-in support for pagination.
  • Structured item pipelines.
  • Retry and error handling.
  • Clear project structure.

Instead of hand-rolling while loops with requests and BeautifulSoup, Scrapy encourages you to think in terms of spiders, items, and pipelines. It scales much better as projects grow.


Why Docker (and docker-compose)?

A very common beginner setup looks like this:

  • Scrapy installed locally.
  • PostgreSQL installed locally.
  • Different Python versions and virtual environments.

This becomes painful fast: hard to scale, manage, and maintain. Enter Docker, which solves this by:

  • Packaging dependencies into a container.
  • Making environments consistent and reproducible.
  • Isolating concerns cleanly and sandboxing local networking.
  • Want to swap PostgreSQL for MongoDB? Just pull it from Docker Hub and plug it in.
  • Want to chart the data? Add another container, such as Grafana.

docker-compose goes one step further, letting us run multiple containers together as a bundle and handling the networking between them:

  • One container for Scrapy.
  • One container for PostgreSQL.

Each container does one thing well, making it easier to maintain the project and to isolate bugs.


High-level architecture

Before we dive into code, let’s understand the architecture.

We’re using docker-compose to fire up two Docker containers:

  1. The first container holds our Scrapy spider, whose sole job is to scrape the web page we provide and store the data in the database. It starts only when we need it.
  2. The other container is a PostgreSQL database with a persistent volume mounted on the host. Whatever information our spider scrapes gets stored in this database.

Since this is docker-compose, the networking between these two containers is already sorted; we just need to supply the authentication credentials.

Scrapy Container (one-shot job) ───────> PostgreSQL Container (persistent service)


Key philosophy:

  • Scrapy is a job
    • starts
    • crawls
    • stores data
    • exits
  • PostgreSQL is a service
    • stays running
    • persists data
    • can be queried anytime

This separation is extremely important for scaling and maintenance.


Project Structure

Here’s the structure of scrape2postgresql:


.
β”œβ”€β”€ docker-compose.yml
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ Makefile
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ run_spider.sh
β”‚
└── bookscraper/
    β”œβ”€β”€ scrapy.cfg
    └── bookscraper/
        β”œβ”€β”€ items.py
        β”œβ”€β”€ pipelines.py
        β”œβ”€β”€ settings.py
        └── spiders/
            └── books.py

Let’s go through each part and understand why it exists.


Scrapy project

The spider (/spiders/books.py)

The spider is where the website's crawling logic lives.

At a high level, our spider:

  • Accepts a URL dynamically.
  • Extracts book titles and prices.
  • Follows pagination links.
  • Yields structured data.

Initializing the spider


def __init__(self, url=None, max_pages=None, *args, **kwargs):
    super().__init__(*args, **kwargs)

    if not url:
        raise ValueError("You must pass a URL")

    self.start_urls = [url]
    # Optional cap on how many pages to crawl
    self.max_pages = int(max_pages) if max_pages else None

Instead of hard-coding URLs, we pass them at runtime. This makes the spider reusable for different categories or sites with similar structure.


CSS selectors

Scrapy supports both XPath and CSS selectors. CSS selectors are usually simpler and more readable.

Example:


for book in response.css("article.product_pod"):
    title = book.css("h3 a::attr(title)").get()
    price = book.css("p.price_color::text").get()

What this means:

  • article.product_pod selects each book card
  • h3 a::attr(title) extracts the book title
  • p.price_color::text extracts the price text

CSS selectors map directly to how the HTML is structured, making them easy to debug in browser DevTools.



Handling pagination

Pagination is one of the most important parts of any crawler. Basically it’s a logic using which you can navigate a website and move to the next page if/when needed.


next_page = response.css("li.next a::attr(href)").get()
if next_page:
    yield response.follow(next_page, callback=self.parse)

Scrapy handles relative URLs automatically with response.follow(), so you don’t have to manually build full URLs.

This approach ensures:

  • All pages in a category are crawled
  • No duplicate requests.
  • No infinite loops.
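The `__init__` shown earlier also accepts a `max_pages` argument. One way it could gate pagination (a sketch; it assumes the spider stores the cap and counts pages as it goes) is a small pure function deciding whether to follow the next link:

```python
def should_follow(next_page, pages_seen, max_pages):
    """Decide whether to keep paginating.

    next_page:  href extracted from "li.next a", or None on the last page.
    pages_seen: how many pages the spider has already parsed.
    max_pages:  optional cap; None means crawl every page.
    """
    if not next_page:
        return False
    return max_pages is None or pages_seen < max_pages
```

Inside `parse()`, the spider would then only `yield response.follow(...)` when `should_follow(next_page, self.pages_seen, self.max_pages)` is true.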

The pipeline (pipelines.py)

Spiders extract data, but pipelines store data. This separation is intentional.

Our pipeline:

  • Opens a PostgreSQL connection.
  • Creates a table if needed.
  • Inserts each scraped item.

class PostgresPipeline:
    def open_spider(self, spider):
        self.conn = psycopg2.connect(...)
        self.cur = self.conn.cursor()

The open_spider() method runs once, when the spider starts.
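The "create a table if needed" step fits naturally next to open_spider(). Here is a sketch of the DDL and a small helper; the exact column types are assumptions, since the post only mentions title and price:

```python
# Idempotent DDL: safe to run on every spider start.
CREATE_BOOKS_TABLE = """
CREATE TABLE IF NOT EXISTS books (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    price TEXT
)
"""


def ensure_books_table(cur):
    """Create the books table if it does not exist yet."""
    cur.execute(CREATE_BOOKS_TABLE)
```

open_spider() would call `ensure_books_table(self.cur)` right after opening the connection.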


Inserting data


def process_item(self, item, spider):
    self.cur.execute(
        "INSERT INTO books (title, price) VALUES (%s, %s)",
        (item["title"], item["price"])
    )
    self.conn.commit()
    return item

Each item yielded by the spider passes through the pipeline.

This makes it easy to:

  • Add validation.
  • Normalize data.
  • Store in different backends later.
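For example, the scraped price arrives as a string like "Β£51.77". A normalization step in process_item() could convert it to a Decimal before insertion; this helper is a sketch, not part of the repo:

```python
from decimal import Decimal, InvalidOperation


def normalize_price(raw):
    """Turn a scraped price string like 'Β£51.77' into a Decimal.

    Returns None when the input is missing or not a parsable price.
    """
    if raw is None:
        return None
    cleaned = raw.strip().lstrip("Β£$€")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

process_item() would then insert `normalize_price(item["price"])` instead of the raw string (with the column typed NUMERIC rather than TEXT).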

Dockerizing the Scraper

Dockerfile

The Dockerfile defines how the Scrapy container is built.


FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY bookscraper /app/bookscraper
COPY run_spider.sh /app/run_spider.sh
RUN chmod +x /app/run_spider.sh
# Run the spider script when the container starts
CMD ["/app/run_spider.sh"]

Key points:

  • Uses a lightweight Python base image.
  • Installs dependencies once.
  • Copies the Scrapy project into the container.
  • Includes a run script.

The run script (run_spider.sh)

This script is what actually runs when the container starts.


#!/bin/sh
set -e

if [ -z "$URL" ]; then
  echo "ERROR: URL not provided"
  exit 1
fi

# scrapy crawl must run from the directory containing scrapy.cfg
cd /app/bookscraper
scrapy crawl books -a url="$URL"

Why a script?

  • Easier debugging.
  • Clearer error messages.
  • Simpler command invocation.
  • Easier to extend later (cron, retries, etc.).

Docker Compose

Docker Compose ties everything together.


services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: postgres   # the postgres image refuses to start without one
    volumes:
      - pgdata:/var/lib/postgresql/data

  scrapy:
    build: .
    depends_on:
      - postgres

volumes:
  pgdata:

Important concepts here:

  • Separate containers.
  • Shared network.
  • Persistent volumes.
  • Explicit dependencies.

Scrapy can talk to PostgreSQL using the service name (postgres) as hostname.
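In practice the pipeline would read its connection details from environment variables, defaulting the host to the compose service name. The variable names below are illustrative assumptions, not necessarily what the repo uses:

```python
import os


def pg_settings():
    """Connection settings for psycopg2.connect(**pg_settings()).

    Inside docker-compose, the service name 'postgres' resolves as a
    hostname on the shared network, so it is a sensible default.
    """
    return {
        "host": os.environ.get("POSTGRES_HOST", "postgres"),
        "port": int(os.environ.get("POSTGRES_PORT", "5432")),
        "dbname": os.environ.get("POSTGRES_DB", "books"),
        "user": os.environ.get("POSTGRES_USER", "postgres"),
        "password": os.environ.get("POSTGRES_PASSWORD", ""),
    }
```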


Makefile

Instead of typing long Docker commands, I've created a Makefile.

Clone the project from https://github.com/apscrapes/scrape2postgresql and use the make commands to set it up:

Example:


make db
make scrape url="https://books.toscrape.com/..."
make psql

Why this design scales well

This setup scales well because each component is isolated and replaceable:

  • Want more spiders? Add more Scrapy spiders.
  • Want scheduled scraping? Trigger make scrape via cron or CI.
  • Want another DB? Swap PostgreSQL for another Docker image (e.g., MongoDB).
  • Want to plot data points? Add a Grafana container.

Final thoughts

scrape2postgresql is intentionally simple, but architecturally solid.

It demonstrates:

  • How Scrapy is meant to be used.
  • How Docker simplifies environments.
  • Why separating spiders and databases matters.
  • How real scraping pipelines are structured.

If you’re new to web scraping, this project gives you a strong foundation. If you’re experienced, it gives you a clean starting template.


Next steps you could explore

  • Add item validation.
  • Include Zyte to avoid bans.
  • Store historical price changes.
  • Add retries and throttling.
  • Expose data via an API.
  • Schedule scraping jobs.

Once you understand this setup, you can build data pipelines at scale.

Happy scraping πŸš€.