Frontera: The Brain Behind The Crawls

Frontera, formerly Crawl Frontier, is an open-source framework for managing crawling logic and sharing it between spiders in our Scrapy projects.

Pablo Hoffman

5 min read · April 22, 2015

At Zyte we're always building and running large crawls: last year we had 11 billion requests made on Scrapy Cloud alone. Crawling millions of pages from the internet requires more sophistication than getting a few contacts from a list, as we need to make sure that we get reliable data and up-to-date lists of item pages, and that we optimize our crawl as much as possible.

From these complex projects emerge technologies that can be used across all of our spiders, and we're very pleased to release Frontera, a flexible frontier for web crawlers.

Frontera, formerly Crawl Frontier, is an open-source framework we developed to facilitate building a crawl frontier. It helps us manage our crawling logic and share it between spiders in our Scrapy projects.

What is a crawl frontier?

A crawl frontier is the system in charge of the logic and policies to follow when crawling websites, and plays a key role in more sophisticated crawling systems. It allows us to set rules about what pages should be crawled next, visiting priorities and ordering, how often pages are revisited, and any behaviour we may want to build into the crawl.
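To make "visiting priorities and ordering" concrete, here is a minimal sketch of a frontier that always hands out the best-scored URL first and never queues the same URL twice. The class name, the score convention (lower is crawled sooner), and the example.com URLs are all assumptions made for illustration; this is not Frontera's API.

```python
import heapq
import itertools


class PriorityFrontier:
    """Toy frontier: lowest score is crawled first, each URL is queued once."""

    def __init__(self):
        self._heap = []                   # entries are (score, order, url)
        self._order = itertools.count()   # tie-breaker that keeps insertion order
        self._seen = set()                # URLs that were already queued

    def add(self, url, score):
        """Queue a URL unless the frontier has seen it before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (score, next(self._order), url))
        return True

    def next_url(self):
        """Return the best-scored pending URL, or None when nothing is left."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


frontier = PriorityFrontier()
frontier.add("https://example.com/archive", score=5)
frontier.add("https://example.com/", score=0)
frontier.add("https://example.com/news", score=1)
frontier.add("https://example.com/news", score=0)  # duplicate, ignored

order = []
url = frontier.next_url()
while url is not None:
    order.append(url)
    url = frontier.next_url()
print(order)
```

A revisit policy could be layered on top of the same structure, for example by re-adding a URL with a new score once enough time has passed since it was last crawled.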

While Frontera was originally designed for use with Scrapy, it’s completely framework-agnostic and can be used with any other crawling framework or standalone project.

In this post, we’re going to demonstrate how Frontera can improve the way you crawl using Scrapy. We’ll show you how you can use Scrapy to scrape articles from Hacker News while using Frontera to ensure the same articles aren’t visited again in subsequent crawls.

The frontier needs to be initialized with a set of starting URLs (seeds), and then the crawler asks the frontier which pages it should visit next. As the crawler visits pages, it reports each page’s response and extracted URLs back to the frontier.

The frontier decides how to use this information according to the defined logic. This process continues until an end condition is reached. Some crawlers may never stop; we refer to these as continuous crawls.
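The exchange described above can be sketched as a small loop. This is a toy model under stated assumptions, not Frontera itself: the FifoFrontier class, the crawl function, the FAKE_SITE link graph, and the request limits are hypothetical, and pages are "fetched" from a dictionary instead of the network.

```python
from collections import deque

# Hypothetical site: each page maps to the links found on it.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_SITE.get(url, [])


class FifoFrontier:
    """Toy frontier: first in, first out, and every URL is handed out once."""

    def __init__(self, max_requests):
        self.max_requests = max_requests  # end condition: total requests allowed
        self.queue = deque()
        self.known = set()
        self.handed_out = 0

    def add_seeds(self, urls):
        self._enqueue(urls)

    def get_next_requests(self, max_next_requests):
        """Return up to max_next_requests URLs, respecting the overall limit."""
        batch = []
        while (self.queue and len(batch) < max_next_requests
               and self.handed_out < self.max_requests):
            batch.append(self.queue.popleft())
            self.handed_out += 1
        return batch

    def page_crawled(self, url, links):
        """The crawler reports back; newly discovered links join the queue."""
        self._enqueue(links)

    def _enqueue(self, urls):
        for url in urls:
            if url not in self.known:
                self.known.add(url)
                self.queue.append(url)


def crawl(frontier, seeds):
    frontier.add_seeds(seeds)
    visited = []
    while True:
        batch = frontier.get_next_requests(max_next_requests=2)
        if not batch:  # nothing left, or the request limit was reached
            break
        for url in batch:
            visited.append(url)
            frontier.page_crawled(url, fetch_links(url))
    return visited


visited = crawl(FifoFrontier(max_requests=10), ["https://example.com/"])
print(visited)
```

Because the frontier owns both the queue and the set of known URLs, swapping in a different ordering or revisit rule does not require touching the crawl loop.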

Creating a Spider for HackerNews

Hopefully, you're now familiar with what Frontera does. If not, take a look at this textbook section for more theory on how a crawl frontier works.

You can check out the project we'll be developing in this example from GitHub.

Let’s start by creating a new project and spider:

    scrapy startproject hn_scraper
    cd hn_scraper
    scrapy genspider HackerNews news.ycombinator.com

You should have a directory structure similar to the following:

    hn_scraper
    hn_scraper/hn_scraper
    hn_scraper/hn_scraper/__init__.py
    hn_scraper/hn_scraper/__init__.pyc
    hn_scraper/hn_scraper/items.py
    hn_scraper/hn_scraper/pipelines.py
    hn_scraper/hn_scraper/settings.py
    hn_scraper/hn_scraper/settings.pyc
    hn_scraper/hn_scraper/spiders
    hn_scraper/hn_scraper/spiders/__init__.py
    hn_scraper/hn_scraper/spiders/__init__.pyc
    hn_scraper/hn_scraper/spiders/HackerNews.py
    hn_scraper/scrapy.cfg

Due to the way the spider template is set up, your start_urls in spiders/HackerNews.py will look like this:

    start_urls = (
        'http://www.news.ycombinator.com/',
    )

So you will want to correct it like so:

    start_urls = (
        'https://news.ycombinator.com/',
    )

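We also need to create an item definition for the article we're scraping, in items.py. The field names must match the keys the spider assigns later (url, title, author, and id), because Scrapy items reject undeclared fields:

    import scrapy


    class HnArticleItem(scrapy.Item):
        url = scrapy.Field()
        title = scrapy.Field()
        author = scrapy.Field()
        id = scrapy.Field()

Here the url field refers to the outbound URL, title to the article's title, author to the submitter's username, and id to HN's item ID.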

We then need to define a link extractor so Scrapy will know which links to follow and extract data from.

Hacker News doesn’t use CSS classes for each item row. Another problem is that the article's item URL, author, and comment count are on a separate row from the article title and outbound URL. We’ll need to use XPath in this case.

First, let's gather all of the rows containing a title and outbound URL. If you inspect the DOM, you will notice these rows contain 3 cells, whereas the subtext rows contain 2 cells. So we can use something like the following:

    selector = Selector(response)
    rows = selector.xpath('//table[@id="hnmain"]//td[count(table) = 1]'
                          '//table[count(tr) > 1]//tr[count(td) = 3]')

We then iterate over each row, retrieving the article URL and title. We also need the item URL and author from the subtext row, which we can find using the following-sibling axis. You should create a method similar to the following:

    def parse_item(self, response):
        selector = Selector(response)
        rows = selector.xpath('//table[@id="hnmain"]//td[count(table) = 1]'
                              '//table[count(tr) > 1]//tr[count(td) = 3]')
        for row in rows:
            item = HnArticleItem()
            article = row.xpath('td[@class="title" and count(a) = 1]//a')
            article_url = self.extract_one(article, './@href', '')
            article_title = self.extract_one(article, './text()', '')
            item['url'] = article_url
            item['title'] = article_title
            subtext = row.xpath(
                './following-sibling::tr[1]//td[@class="subtext" and count(a) = 3]')
            if subtext:
                item_author = self.extract_one(subtext, './/a[1]/@href', '')
                item_id = self.extract_one(subtext, './/a[2]/@href', '')
                item['author'] = item_author[8:]
                item['id'] = int(item_id[8:])
            yield item
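The two [8:] slices in parse_item drop an 8-character prefix from each href. That works if the links look like user?id=<name> and item?id=<number>, which is an assumption inferred from the slicing rather than something stated here. As a hedged alternative, the query string can be parsed explicitly so the code does not depend on the prefix length. The query_param helper and the sample values below are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def query_param(href, name, default=""):
    """Return the first value of a query-string parameter from a relative href."""
    values = parse_qs(urlsplit(href).query).get(name)
    return values[0] if values else default


# Sample hrefs in the assumed format; the values are made up.
author = query_param("user?id=pg", "id")
item_id = int(query_param("item?id=12345", "id"))

print(author, item_id)
# Slicing gives the same result as long as the prefix stays 8 characters long.
print("user?id=pg"[8:] == author)
```

The explicit version also returns a default when the parameter is missing, instead of silently producing a truncated string.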

The extract_one method is a helper function to extract the first result:

    def extract_one(self, selector, xpath, default=None):
        extracted = selector.xpath(xpath).extract()
        if extracted:
            return extracted[0]
        return default

There’s currently a bug in Frontera's SQLAlchemy middleware where callbacks aren’t called, so for now we need to inherit from Spider, override the parse method, and have it call our parse_item function. Here's an example of what the spider should look like:

spiders/HackerNews.py

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.http import Request
    from scrapy.spider import Spider
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector

    from hn_scraper.items import HnArticleItem


    class HackernewsSpider(Spider):
        name = "HackerNews"
        allowed_domains = ["news.ycombinator.com"]
        start_urls = ('https://news.ycombinator.com/', )

        link_extractor = SgmlLinkExtractor(
            allow=('news', ),
            restrict_xpaths=('//a[text()="More"]', ))

        def extract_one(self, selector, xpath, default=None):
            extracted = selector.xpath(xpath).extract()
            if extracted:
                return extracted[0]
            return default

        def parse(self, response):
            for link in self.link_extractor.extract_links(response):
                request = Request(url=link.url)
                request.meta.update(link_text=link.text)
                yield request

            for item in self.parse_item(response):
                yield item

        def parse_item(self, response):
            selector = Selector(response)
            rows = selector.xpath('//table[@id="hnmain"]//td[count(table) = 1]'
                                  '//table[count(tr) > 1]//tr[count(td) = 3]')
            for row in rows:
                item = HnArticleItem()
                article = row.xpath('td[@class="title" and count(a) = 1]//a')
                article_url = self.extract_one(article, './@href', '')
                article_title = self.extract_one(article, './text()', '')
                item['url'] = article_url
                item['title'] = article_title
                subtext = row.xpath(
                    './following-sibling::tr[1]//td[@class="subtext" and count(a) = 3]')
                if subtext:
                    item_author = self.extract_one(subtext, './/a[1]/@href', '')
                    item_id = self.extract_one(subtext, './/a[2]/@href', '')
                    item['author'] = item_author[8:]
                    item['id'] = int(item_id[8:])
                yield item

Enabling Frontera in Our Project

Now, all we need to do is configure the Scrapy project to use Frontera with the SQLAlchemy middleware. First, install Frontera:

    pip install frontera

Next, enable Frontera's middlewares and scheduler by adding the following to settings.py:

    SPIDER_MIDDLEWARES = {}
    DOWNLOADER_MIDDLEWARES = {}
    SPIDER_MIDDLEWARES.update({
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 999
    })
    DOWNLOADER_MIDDLEWARES.update({
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 999
    })
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
    FRONTERA_SETTINGS = 'hn_scraper.frontera_settings'

Then, create a file named frontera_settings.py inside the hn_scraper package, matching the module path given above in FRONTERA_SETTINGS, to store any settings related to the frontier:

    BACKEND = 'frontera.contrib.backends.sqlalchemy.FIFO'
    SQLALCHEMYBACKEND_ENGINE = 'sqlite:///hn_frontier.db'
    MAX_REQUESTS = 2000
    MAX_NEXT_REQUESTS = 10
    DELAY_ON_EMPTY = 0.0

Here we specify hn_frontier.db as the SQLite database file, which is where Frontera will store pages it has crawled.

Running the Spider

Let’s run the spider:

    scrapy crawl HackerNews -o results.csv -t csv

You can review the items being scraped in results.csv while the spider is running.

You will notice that the hn_frontier.db file we specified earlier has been created. You can browse it using the sqlite3 command-line tool:

    sqlite> attach "hn_frontier.db" as hns;
    sqlite> .tables
    hns.pages
    sqlite> select * from hns.pages;
    https://news.ycombinator.com/|f1f3bd09de659fc955d2db1e439e3200802c4645|0|20150413231805460038|200|CRAWLED|
    https://news.ycombinator.com/news?p=2|e273a7bbcf16fdcdb74191eb0e6bddf984be6487|1|20150413231809316300|200|CRAWLED|
    https://news.ycombinator.com/news?p=3|f804e8cd8ff236bb0777220fb241fcbad6bf0145|2|20150413231810321708|200|CRAWLED|
    https://news.ycombinator.com/news?p=4|5dfeb8168e126c5b497dfa48032760ad30189454|3|20150413231811333822|200|CRAWLED|
    https://news.ycombinator.com/news?p=5|2ea8685c1863fca3075c4f5d451aa286f4af4261|4|20150413231812425024|200|CRAWLED|
    https://news.ycombinator.com/news?p=6|b7ca907cc8b5d1f783325d99bc3a8d5ae7dcec58|5|20150413231813312731|200|CRAWLED|
    https://news.ycombinator.com/news?p=7|81f45c4153cc8f2a291157b10bdce682563362f1|6|20150413231814324002|200|CRAWLED|
    https://news.ycombinator.com/news?p=8|5fbe397d005c2f79829169f2ec7858b2a7d0097d|7|20150413231815443002|200|CRAWLED|
    https://news.ycombinator.com/news?p=9|14ee3557a2920b62be3fd521893241c43864c728|8|20150413231816426616|200|CRAWLED|

As shown above, the database has one table, pages, which stores each URL along with its fingerprint, what appears to be its crawl depth, a timestamp, the response code, and the crawl state. This schema is specific to the SQLAlchemy backend. Different backends may use different schemas, and some don't persist crawled pages at all.
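The same inspection can be done from Python with the standard sqlite3 module. The sketch below is self-contained, so it builds a small stand-in table instead of opening hn_frontier.db. The column names (url, fingerprint, state), the QUEUED value, and the fingerprints are assumptions made for illustration; the table_columns helper reads the real column names from whatever database it is pointed at.

```python
import sqlite3


def table_columns(conn, table):
    """Return the column names of a table using SQLite's table_info pragma."""
    # PRAGMA statements cannot take bound parameters, so pass trusted names only.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def count_by_column(conn, table, column):
    """Count rows per distinct value of a column, e.g. pages per crawl state."""
    query = f"SELECT {column}, COUNT(*) FROM {table} GROUP BY {column}"
    return dict(conn.execute(query).fetchall())


# Stand-in data so the sketch runs without a crawl.
# For a real run, use sqlite3.connect("hn_frontier.db") instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, fingerprint TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("https://news.ycombinator.com/", "aaa", "CRAWLED"),
        ("https://news.ycombinator.com/news?p=2", "bbb", "CRAWLED"),
        ("https://news.ycombinator.com/news?p=3", "ccc", "QUEUED"),
    ],
)

columns = table_columns(conn, "pages")
states = count_by_column(conn, "pages", "state")
print(columns)
print(states)
```

Listing the columns first avoids guessing the schema, which matters because it can differ between backends.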

Frontera backends aren't limited to storing crawled pages. They're the core component of Frontera and hold all the crawl frontier logic you wish to make use of, so which backend you use is heavily tied to what you want to achieve with Frontera.

In many cases, you will want to create your own backend. This is a lot easier than it sounds, and you can find all the information you need in the documentation.

Hopefully, this tutorial has given you a good insight into Frontera and how you can use it to improve the way you manage your crawling logic. Feel free to check out the code and docs. If you run into a problem, please report it on the issue tracker.
