We’ve made a change. Scrapinghub is now Zyte! 

Open source at our heart with Scrapy and friends

Where it all started

Make building spiders a breeze

Scrapy is an open source Python framework built specifically for web scraping by Zyte co-founders Pablo Hoffman and Shane Evans. Out of the box, Scrapy spiders are designed to download HTML, parse and process the data, and save it in CSV, JSON or XML format.

Powerful open source technology

Robust Web Scraping Capabilities

Scrapy ships with a wide range of built-in extensions and middlewares for handling cookies and sessions, as well as HTTP features such as compression, authentication, caching, user agents, robots.txt and crawl depth restriction. It is also easy to extend: you can add custom middlewares or pipelines to your web scraping projects to get the specific functionality you require.
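
As a rough illustration of the extension point mentioned above, here is a minimal sketch of a custom item pipeline. In Scrapy, an item pipeline is a plain Python class with a `process_item(item, spider)` method that returns the (possibly modified) item. The class name `PriceCleanupPipeline` and the `price` field are hypothetical and chosen only for this example.

```python
class PriceCleanupPipeline:
    """Hypothetical item pipeline that normalizes a 'price' field.

    Scrapy calls process_item() once for every item a spider yields.
    """

    def process_item(self, item, spider):
        # Read the raw value as text; a missing field becomes an empty string.
        raw = str(item.get("price", "")).strip()
        # Drop a leading currency symbol and any thousands separators.
        cleaned = raw.lstrip("$€£").replace(",", "")
        # Store a float, or None when no price was scraped.
        item["price"] = float(cleaned) if cleaned else None
        return item


if __name__ == "__main__":
    # Stand-alone check with a plain dict in place of a Scrapy item.
    pipeline = PriceCleanupPipeline()
    print(pipeline.process_item({"name": "Widget", "price": "$1,299.50"}, spider=None))
```

In a real project, a pipeline like this would typically be enabled through the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {"myproject.pipelines.PriceCleanupPipeline": 300}`, where the module path `myproject.pipelines` is an assumed placeholder.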

Our open source projects

Giving you the power of Data Extraction

Scrapy is our open source web crawling framework written in Python. It is one of the most widely used and highly regarded frameworks of its kind: powerful yet easy to use.

Splash

Github

Splash is our lightweight, scriptable browser as a service with an HTTP-based API.
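
To give a feel for the HTTP API, the sketch below builds a request URL for Splash's `render.html` endpoint, which returns the HTML of a page after it has been rendered. It assumes a Splash instance listening on `http://localhost:8050`. The helper name `splash_render_url` is hypothetical, and the snippet only constructs the URL without sending a request.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, splash_base="http://localhost:8050", wait=0.5):
    """Build a URL for Splash's render.html endpoint (hypothetical helper).

    page_url    -- the page Splash should load and render
    splash_base -- address of the Splash instance (assumed local default)
    wait        -- seconds to wait after page load before returning the HTML
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


if __name__ == "__main__":
    # Fetching this URL with any HTTP client would return the rendered HTML.
    print(splash_render_url("https://example.com/?q=1"))
```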

Spidermon

Github

Spidermon is our battle-tested open source spider monitoring library for Scrapy.

DateParser

Github

DateParser is our library for parsing human-readable dates and times. It supports 18 languages.

Portia

Github

Portia is our tool for building spiders through a friendly, visual user interface. No programming knowledge required.

Eli5

Github

A library for debugging machine learning classifiers and explaining their predictions.

Scrapely

Github

Scrapely is a library for generating parsers for web pages.

ScrapyJS

Github

ScrapyJS is our middleware for Splash, making it easy to use Splash in your Scrapy projects.

Frontera

Github

Frontera is a framework for managing your crawl logic and policies.

Formasaurus

Github

Formasaurus uses machine learning to figure out the type of an HTML form: is it a login, search, sign-up, password recovery or contact form, or something else?

W3lib

Github

W3lib provides a number of useful web-related functions for your web scraping projects.

ScrapyRT

Github

ScrapyRT lets you reuse your spider’s logic to extract data from web pages through a single HTTP request.

Loginform

Github

Loginform is a library that detects and fills login forms on specified URLs.

Webstruct

Github

Webstruct is our library for building NER systems that work with HTML.

Queuelib

Github

Queuelib lets you create disk-based queues in Python.

Adblockparser

Github

Adblockparser is a library for parsing and matching against Adblock Plus filters.

MDR

MDR is a library for detecting and extracting list data from web pages.

Webpager

Github

Webpager is a library for classifying whether a link on a web page is a pagination link or not.

Skinfer

Github

Skinfer is a tool we developed to infer schemas from a sample of JSON data.

Scrapy-StreamItem

Github

Scrapy-StreamItem provides support for working with streamcorpus’ StreamItems.

Wappalyzer-Python

Github

Wappalyzer-Python is a Python-based wrapper for Wappalyzer.

We know web data

Trusted by leading brands

Mercado Libre
T-Mobile
Chubb
Allegis
Gartner
Sodatone

Google Summer of Code 2021

We love open source and know the community can build amazing things. Google Summer of Code is a global program that offers students stipends to write code for open source projects. Zyte has been part of it since 2014.

Our 2021 Program is coming soon - stay tuned for updates.

Start Scraping The Web In Minutes

Extra Developer tools to make it easier for you

Learn more
