
Web scraping is often taught using scripts that invariably dump data into a JSON or CSV file. That’s fine for learning the basics, but it doesn’t reflect how scraping works at scale in real-world systems, where data lands in a proper database and jobs run in reproducible environments.
In this blog post, I’ll walk through a demo project called scrape2postgresql, which shows how to run a Scrapy spider inside Docker and store its results in PostgreSQL.
This project uses books.toscrape.com, a safe demo website, to scrape book titles and prices, but the structure applies to almost any scraping use case.
Scrapy is a full-featured web scraping framework, not just a request library. It gives you request scheduling, automatic retries, throttling, and an item pipeline system out of the box.
Instead of writing while loops with requests and BeautifulSoup, Scrapy encourages you to think in terms of spiders, items, and pipelines. It scales much better as projects grow.
A very common beginner setup installs Python, Scrapy, and a database directly on the host machine.
This becomes painful fast: it is difficult to scale, manage, and maintain. Enter Docker, which solves this by packaging each component, along with its dependencies, into an isolated, reproducible container.
docker-compose goes one step further by allowing us to run multiple containers together as a bundle, taking care of the networking between them.
Each container does its individual thing well, making it easier to maintain the project and isolate bugs.
Before we dive into code, let’s understand the architecture.
We’re using docker-compose to fire up two Docker containers:
Since we’re using docker-compose, the networking between these two containers is handled automatically; we just need to supply the authentication credentials.
Scrapy Container (one-shot job) ───────> PostgreSQL Container (persistent service)
Key philosophy: the scraper runs as a short-lived, one-shot job, while the database runs as a persistent service. This separation is extremely important for scaling and maintenance.
Here’s the structure of scrape2postgresql:

```
.
├── docker-compose.yml
├── Dockerfile
├── Makefile
├── requirements.txt
├── run_spider.sh
│
└── bookscraper/
    ├── scrapy.cfg
    └── bookscraper/
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            └── books.py
```
Let’s go through each part and understand why it exists.
The spider is where the website's crawling logic lives.
At a high level, our spider accepts a start URL at runtime, extracts the title and price of each book, and follows pagination links until it runs out of pages.
```python
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"

    def __init__(self, url=None, max_pages=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if not url:
            raise ValueError("You must pass a URL")
        self.start_urls = [url]
```
Instead of hard-coding URLs, we pass them at runtime. This makes the spider reusable for different categories or sites with similar structure.
Scrapy supports both XPath and CSS selectors. CSS selectors are usually simpler and more readable.
Example:
```python
for book in response.css("article.product_pod"):
    title = book.css("h3 a::attr(title)").get()
    price = book.css("p.price_color::text").get()
```
What this means:

- article.product_pod selects each book card
- h3 a::attr(title) extracts the book title
- p.price_color::text extracts the price text

CSS selectors map directly to how the HTML is structured, making them easy to debug in browser DevTools.
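Because the markup is so regular, the same extraction can be sketched with the standard library alone, no Scrapy required. The product card below is hand-written stand-in markup mimicking a books.toscrape.com card, trimmed to be XML-parseable; it is not the site’s exact HTML:

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for one books.toscrape.com product card,
# trimmed to well-formed XML so the stdlib parser can handle it.
CARD = """
<article class="product_pod">
  <h3><a title="A Light in the Attic" href="a-light-in-the-attic_1000/index.html">A Light in ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""

root = ET.fromstring(CARD)

# Equivalent of the CSS selector h3 a::attr(title)
title = root.find("h3/a").attrib["title"]
# Equivalent of the CSS selector p.price_color::text
price = root.find("p[@class='price_color']").text

print(title, price)
```

In the real spider, Scrapy’s selectors do this work against live HTML, which the stdlib XML parser cannot handle in general; this is only to show what each selector maps to.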
Pagination is one of the most important parts of any crawler: it is the logic that lets the spider move through a site page by page, advancing to the next page if/when one exists.
```python
next_page = response.css("li.next a::attr(href)").get()
if next_page:
    yield response.follow(next_page, callback=self.parse)
```
Scrapy handles relative URLs automatically with response.follow(), so you don’t have to manually build full URLs.
This approach ensures the spider keeps following pages until none remain, without hard-coding URLs or page counts.
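To see what that relative-URL resolution looks like, here is a stdlib sketch. The example URLs follow books.toscrape.com’s pagination pattern, but they are illustrative, not captured spider output:

```python
from urllib.parse import urljoin

# response.follow() resolves a relative href against the current
# page URL, much like urljoin does:
base = "https://books.toscrape.com/catalogue/page-2.html"
next_href = "page-3.html"  # the kind of value li.next a::attr(href) returns
full_url = urljoin(base, next_href)

print(full_url)  # https://books.toscrape.com/catalogue/page-3.html
```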
Spiders extract data, but pipelines store data. This separation is intentional.
Our pipeline:
```python
import psycopg2

class PostgresPipeline:
    def open_spider(self, spider):
        self.conn = psycopg2.connect(...)
        self.cur = self.conn.cursor()
```
The open_spider() method runs once, when the spider starts.
```python
    def process_item(self, item, spider):
        self.cur.execute(
            "INSERT INTO books (title, price) VALUES (%s, %s)",
            (item["title"], item["price"])
        )
        self.conn.commit()
        return item
```
Each item yielded by the spider passes through the pipeline.
This makes it easy to validate, transform, or deduplicate items, or to swap the storage backend, without touching the spider.
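To exercise the full pipeline lifecycle without a running PostgreSQL server, here is a stand-in sketch using Python’s built-in sqlite3. The class shape and table mirror the pipeline above, but sqlite3 is my substitution for illustration, not part of the project:

```python
import sqlite3

class SQLitePipeline:
    """Illustrative stand-in for PostgresPipeline, backed by an
    in-memory SQLite database instead of a PostgreSQL server."""

    def open_spider(self, spider):
        # Runs once, when the spider starts
        self.conn = sqlite3.connect(":memory:")
        self.cur = self.conn.cursor()
        self.cur.execute("CREATE TABLE books (title TEXT, price TEXT)")

    def process_item(self, item, spider):
        # Runs for every item the spider yields
        self.cur.execute(
            "INSERT INTO books (title, price) VALUES (?, ?)",
            (item["title"], item["price"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Runs once, when the spider finishes
        self.conn.close()

# Drive the lifecycle by hand, the way Scrapy would:
pipeline = SQLitePipeline()
pipeline.open_spider(None)
pipeline.process_item({"title": "A Light in the Attic", "price": "£51.77"}, None)
pipeline.cur.execute("SELECT COUNT(*) FROM books")
count = pipeline.cur.fetchone()[0]
print(count)  # 1
pipeline.close_spider(None)
```

Note the placeholder parameter style changes too: sqlite3 uses `?` where psycopg2 uses `%s`.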
The Dockerfile defines how the Scrapy container is built.
```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY bookscraper /app/bookscraper
COPY run_spider.sh /app/run_spider.sh
RUN chmod +x /app/run_spider.sh
```
Key points: the slim base image keeps the container small, and requirements.txt is copied and installed before the source code, so Docker can cache the dependency layer between builds.
This script is what actually runs when the container starts.
```bash
if [ -z "$URL" ]; then
  echo "ERROR: URL not provided"
  exit 1
fi

scrapy crawl books -a url="$URL"
```
Why a script? It validates the input before launching the spider and gives the container a single, easy-to-extend entrypoint.
Docker Compose ties everything together.
```yaml
services:
  postgres:
    image: postgres:15
    volumes:
      - pgdata:/var/lib/postgresql/data
  scrapy:
    build: .
    depends_on:
      - postgres

volumes:
  pgdata:
```
Important concepts here: the named pgdata volume persists the database across container restarts, depends_on ensures PostgreSQL starts before the scraper, and Compose’s built-in networking means Scrapy can talk to PostgreSQL using the service name (postgres) as the hostname.
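Inside the Scrapy container, the connection settings might be assembled from environment variables along these lines. The variable names and defaults here are illustrative assumptions, not the project’s exact configuration:

```python
import os

def postgres_dsn():
    """Build a libpq-style DSN from environment variables.
    Note the default host: "postgres" is the Compose service
    name, which doubles as the hostname on the Compose network.
    (Password handling is omitted from this sketch.)"""
    host = os.environ.get("POSTGRES_HOST", "postgres")
    port = os.environ.get("POSTGRES_PORT", "5432")
    db = os.environ.get("POSTGRES_DB", "books")
    user = os.environ.get("POSTGRES_USER", "postgres")
    return f"host={host} port={port} dbname={db} user={user}"

print(postgres_dsn())
```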
Instead of typing long Docker commands, I’ve created a Makefile.
Clone the project from https://github.com/apscrapes/scrape2postgresql and use the make commands to set it up:
Example:
```bash
make db
make scrape url="https://books.toscrape.com/..."
make psql
```
This setup scales because each component is isolated and replaceable:
| What? | How? |
|---|---|
| Want more spiders? | Add more Scrapy spiders |
| Want scheduled scraping? | Trigger make scrape via cron or CI |
| Want another DB? | Swap PostgreSQL for another Docker image (e.g., MongoDB) |
| Want to plot data-points? | Add Grafana container |
scrape2postgresql is intentionally simple, but architecturally solid.
It demonstrates a clean separation between crawling and storage, reproducible container-based environments, and a workflow that is easy to automate.
If you’re new to web scraping, this project gives you a strong foundation. If you’re experienced, it gives you a clean starting template.
Once you understand this setup, you can build data pipelines at scale.
Happy scraping 🚀.