PINGDOM_CHECK

#ExtractSummit2026 The world's largest web scraping conference returns. Austin Oct 7–8 · Dublin Nov 10–11.

Register now
Data Services
Pricing
Login
Try Zyte APIContact Sales
  • Unblocking and Extraction

    Zyte API

    The ultimate API for web scraping. Avoid website bans and access a headless browser or AI Parsing

    Ban Handling

    Headless Browser

    AI Extraction

    SERP

    Enterprise

    DocumentationSupport

    Hosting and Deployment

    Scrapy Cloud

    Run, monitor, and control your Scrapy spiders however you want to.

    Coding Agent Add-Ons

    Agentic Web Data

    Plugins that give coding agents the context to build production Scrapy projects. Starts with Claude Code.

  • Data Services
  • Pricing
  • Browse

    • BlogArticles, podcasts, videos
    • Case studiesCustomer outcomes
    • White papersIn-depth reports
    • EventsConferences, webinars, recordings

    Subscribe

    • NewsletterSwiftly delivered
    • Discord communityExtract Data community
  • Product and E-commerce

    From e-commerce and online marketplaces

    Data for AI

    Collect and structure web data to feed AI

    Job Posting

    From job boards and recruitment websites

    Real Estate

    From Listings portals and specialist websites

    News and Article

    From online publishers and news websites

    Search

    Search engine results page data (SERP)

    Social Media

    From social media platforms online

  • Meet Zyte

    Our story, people and values

    Contact us

    Get in touch

    Support

    Knowledge base and raise support tickets

    Terms and Policies

    Accept our terms and policies

    Open Source

    Our open source projects and contributions

    Web Data Compliance

    Guidelines and resources for compliant web data collection

    Join the team building the future of web data
    We're Hiring
    Trust Center
    Security, compliance & certifications
Login
Try Zyte APIContact Sales

Zyte Developers

Coding tools & hacks straight to your inbox

Become part of the community and receive a bi-weekly dosage of all things code.

Join us
    • Zyte Data
    • News & Articles
    • Search
    • Social Media
    • Product
    • Data for AI
    • Job Posting
    • Real Estate
    • Zyte API - Ban Handling
    • Zyte API - Headless Browser
    • Zyte API - AI Extraction
    • Web Scraping Copilot
    • Zyte API Enterprise
    • Scrapy Cloud
    • Solution Overview
    • Blog
    • Webinars
    • Case Studies
    • White Papers
    • Documentation
    • Web Scraping Maturity Self-Assesment
    • Web Data compliance
    • Meet Zyte
    • Jobs
    • Terms and Policies
    • Trust Center
    • Support
    • Contact us
    • Pricing
    • Do not sell
    • Cookie settings
    • Sign up
    • Talk to us
    • Cost estimator
All articles
AI60, 60 articles
Data quality13, 13 articles
Developer interest57, 57 articles
Integration2, 2 articles
Open-source40, 40 articles
Proxies29, 29 articles
Scraping practice17, 17 articles
Scraping strategy26, 26 articles
Web data60, 60 articles
Web scraping APIs33, 33 articles
Zyte API59, 59 articles
Scrapy48, 48 articles
Scrapy Cloud10, 10 articles
Web Scraping Copilot12, 12 articles
AI & Machine Learning1, 1 articles
Automotive2, 2 articles
E-commerce & retail26, 26 articles
Entertainment & Streaming2, 2 articles
Financial Services8, 8 articles
Government2, 2 articles
Market Research & Intelligence3, 3 articles
Media & publishing8, 8 articles
Real Estate2, 2 articles
Recruitment & HR3, 3 articles
Transportation & Logistics2, 2 articles
Travel & hospitality2, 2 articles
Extract Summit25, 25 articles
PyCon1, 1 articles

Appearance

Discord Community
BlogProduct UpdateIntroducing Dash
ArticleProduct announcementProduct Update

Introducing Dash

Introducing Dash - Discover Dash, a new tool designed to simplify web scraping. Learn how to leverage its capabilities for better data extraction.

S

Shane Evans

4 min read · July 27, 2013

Introducing Dash

Introducing Dash

We're excited to introduce Dash, a major update to our scraping platform.

This release is the final step in migrating to our new storage back end and contains improvements to almost every part of our infrastructure. In this post I'd like to introduce some of the highlights.

Improved reliability and performance

Our new storage back end is based on HBase[1] and attempts to address some of the problems we had with mongodb.

The areas with noticeable improvements are:

  • Downloading data is much faster. In fact, for several months, data downloads from finished jobs redirected to this new back end
  • Jobs with millions of items are just as fast as small jobs in the UI
  • There are no pauses caused by other resource-intensive tasks on the server
  • We no longer need frequent scheduled maintenance[2]

During our testing and parallel run, we had 2 disk failures that had no impact on users. Machine failures caused errors until the machine was automatically marked as unavailable. Running crawls were never affected.

View downloaded pages in Autoscraping

When you run an Autoscraping spider in Dash, all pages are saved and available under the new Pages tab:
sh\_pages\_snapshot

Here you can see the requests made by the spider, our cache of the page and what is extracted with the current templates. It's much easier to analyze a scraping run and jump directly into fixing issues - no more annotating mode[3].

Storage API

There is a new API for our storage back end. It's not yet stable and our documentation isn't published, but the python library is available. This API makes it possible to access functionality not currently available elsewhere, such as:

  • All crawl information displayed in the UI and written by our bots
  • Issue a single query to filter and retrieve data (items, logs, pages) for all jobs from a spider or an entire project
  • A new Collections API for storing and filtering JSON objects, supporting optional auto-expiration and versioning
  • A crawl frontier[4], suitable for large distributed crawls

Because our bots now use this API, they can run from any location. We expect to open up new locations in the near future and support non scrapy-based jobs.

The UI is using this API to load data asynchronously, making it more responsive and giving us a cleaner separation between our UI and back end. We’re building on this refactor now and hope to see more UI improvements coming soon!

If you wish to use this new API before we make a stable release, please let us know!

Fair scheduling

Some projects in Dash regularly schedule tens of thousands of jobs at once, which previously degraded performance and caused other projects' jobs to be stuck in a pending state for a very long time.

Each project now has two settings that control running jobs:

Setting

Description

bot groups

When dedicated servers are purchased they are assigned to a bot group. This setting specifies the bot groups that will run jobs from the current project. If no bot groups are configured, a default bot group is used and billed using our "Pay as you go" model.

max slots

The maximum number of running jobs allowed in the project. This defaults to 4 if the project has no bot groups, otherwise it is limited by the number of available slots in the group.

When there is capacity to run another job in a given bot group, a new job is selected by choosing the next job from the project with the fewest running jobs in that bot group.

Infinite scrolling

We're using infinite scrolling for crawl data in Dash. Initially the feedback was mixed, but now everybody prefers the new version. It makes navigating crawl data more fluid as you don’t need to alternate between scrolling and clicking “next” - just keep scrolling and data appears. Give it a try and let us know what you think!

We have some changes coming soon to allow jumping ahead (and avoid scrolling) and various usability and performance tweaks.

Data ordering

Items, pages and logs are returned in insertion order. This means (at last!) logs are always ordered correctly, even when filtering.

Pending jobs are ordered by priority, then by age. So the next job to be run is at the bottom of the list. Running and finished jobs are ordered by the time they entered that state.

Incompatibilities

We are doing our best to minimize disruption, however, some changes may be noticeable:

  • When a project is being moved, jobs running on the old system (Panel) will not be visible in Dash until after they complete
  • Requests to panel.scrapinghub.com are now redirected to app.scrapinghub.com.
  • Previously, _jobid and _id fields containing ObjectIds were available in exported data. These private fields are no longer present
  • Jobs are now identified by a key of the format PROJECTID/SPIDERID/COUNTER. For example, 78/1/4 is the 4th run of spider id 1 in project 78. Items in this job are identified by 78/1/4/0, 78/1/4/1, etc. This has proven more useful than ObjectIds
  • All timestamps are now in milliseconds since 1st Jan 1970
  • Some incompatibilities exist in the Jobs API. We are working to minimize them, however, if you encounter any problems let us know and we can suggest a solution

Please visit our support forum if you have suggestions or bug reports, we're looking forward to your feedback!


  1. We use CDH, Cloudera's distribution of hadoop. I will follow up with a more technical blog post describing the architecture.
  2. This was previously needed to run database maintenance tasks, such as mongodb compact
  3. When an "annotating" tag is present in an Autoscraping job, we say that the spider is running in annotating mode. In this configuration, the spider saves pages instead of extracted data. It is now redundant since pages are always saved and usable in Fash.
  4. The crawl frontier is a datastructure holding information about all discovered URLs and us used to control which URLs are crawled.

Try Zyte API

Build your first scraper in minutes

Free trial, no credit card. From a single request to production in an afternoon.

Get started
Product Update
S

Shane Evans

More from this author

In this article

  • Improved reliability and performance
  • View downloaded pages in Autoscraping
  • Storage API
  • Fair scheduling
  • Infinite scrolling
  • Data ordering
  • Incompatibilities

Follow

Get the latest

Zyte and the data web in your inbox — or wherever you already are.

Subscribe

Or follow elsewhere

Continue reading

Play Before You Scrape: Explore Zyte API Settings with Playground
Product Update

Play Before You Scrape: Explore Zyte API Settings with Playground

Discover the best way to configure your scrapers using Zyte API Playground

Cleber Alexandre·10 Mins·February 10, 2025
New in Zyte: Scroll Control, Lower Costs, and More
Product Update

New in Zyte: Scroll Control, Lower Costs, and More

As the web continues to evolve, Zyte API is evolving right alongside it—adding powerful new features and refinements designed to make data extraction smarter, faster, and more adaptable than ever.

Daniel Cave·5 min·June 27, 2025
From products to SERPs: AI scraping now does it all
Product Update

From products to SERPs: AI scraping now does it all

Scale data extraction with Zyte’s composite AI, combining accuracy, flexibility, and cost-efficiency in one powerful scraping solution, now available for the most common data types.

Cleber Alexandre·10 mins·April 3, 2025

The Community · Newsletter

The best of Zyte and the data web, in your inbox.

One curated edition — new articles, product updates, and the stories shaping the data web. No noise.

G2.com

Capterra.com

Proxyway.com

EWDCI logoMost loved workplace certificateZyte rewardISO 27001 iconG2 rewardG2 rewardG2 reward

© Zyte Group Limited 2026