We're excited to introduce Dash, a major update to our scraping platform.
This release is the final step in migrating to our new storage back end and contains improvements to almost every part of our infrastructure. In this post I'd like to introduce some of the highlights.
Our new storage back end is based on HBase[1] and attempts to address some of the problems we had with MongoDB.
The areas with the most noticeable improvements are described below.
During our testing and parallel run, we had two disk failures with no impact on users. Machine failures caused errors only until the affected machine was automatically marked as unavailable, and running crawls were never affected.
When you run an Autoscraping spider in Dash, all pages are saved and available under the new Pages tab:
Here you can see the requests made by the spider, our cache of the page and what is extracted with the current templates. It's much easier to analyze a scraping run and jump directly into fixing issues - no more annotating mode![3]
There is a new API for our storage back end. It's not yet stable and our documentation isn't published, but the Python library is available. This API makes it possible to access functionality that isn't currently available anywhere else.
Because our bots now use this API, they can run from any location. We expect to open up new locations in the near future and to support non-Scrapy-based jobs.
The UI uses this API to load data asynchronously, making it more responsive and giving us a cleaner separation between our UI and back end. We're building on this refactor now and hope to see more UI improvements soon!
If you wish to use this new API before we make a stable release, please let us know!
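To give a feel for what client code could look like, here is a minimal sketch that reads the items of a job over HTTP. The endpoint URL, item path and authentication scheme below are illustrative assumptions only, not the actual interface, which is still unstable and undocumented.

```python
import json
import requests

# All names below are placeholders: the real endpoint, paths and auth
# scheme are not yet documented, so treat this purely as a sketch.
API_ROOT = "https://storage.example.com"
API_KEY = "YOUR_API_KEY"

# 78/1/4 identifies the 4th run of spider 1 in project 78 (see the note
# on job and item IDs at the end of this post).
resp = requests.get(f"{API_ROOT}/items/78/1/4", auth=(API_KEY, ""))
resp.raise_for_status()

# Assuming the response streams one JSON-encoded item per line.
for line in resp.iter_lines():
    if line:
        item = json.loads(line)
        print(item)
```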
Some projects in Dash regularly schedule tens of thousands of jobs at once, which previously degraded performance and caused other projects' jobs to be stuck in a pending state for a very long time.
Each project now has two settings that control running jobs:
| Setting | Description |
| --- | --- |
| bot groups | When dedicated servers are purchased they are assigned to a bot group. This setting specifies the bot groups that will run jobs from the current project. If no bot groups are configured, a default bot group is used and billed using our "Pay as you go" model. |
| max slots | The maximum number of running jobs allowed in the project. This defaults to 4 if the project has no bot groups, otherwise it is limited by the number of available slots in the group. |
When a bot group has capacity to run another job, the next job to start is taken from the project with the fewest jobs currently running in that group.
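The following Python sketch illustrates that selection rule together with the two settings above. The data structures and the `pick_next_job` function are hypothetical; Dash's actual scheduler is internal and may differ.

```python
from collections import Counter

def pick_next_job(running, pending, max_slots):
    """Pick the next pending job to start in a bot group (illustrative only).

    running:   list of (project_id, job_id) pairs currently running in the group
    pending:   dict mapping project_id -> list of pending jobs, each list
               already ordered by priority, then age (oldest first)
    max_slots: dict mapping project_id -> maximum number of running jobs
    """
    running_count = Counter(project for project, _ in running)
    # Only projects with pending work and a free slot are candidates;
    # 4 is the default limit for projects without bot groups.
    candidates = [p for p, queue in pending.items()
                  if queue and running_count[p] < max_slots.get(p, 4)]
    if not candidates:
        return None
    # Fairness rule: the project with the fewest jobs running in this
    # bot group gets to start the next one.
    project = min(candidates, key=lambda p: running_count[p])
    return pending[project].pop(0)
```

The practical effect is that a project scheduling tens of thousands of jobs at once can no longer starve the other projects sharing its bot group.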
We're using infinite scrolling for crawl data in Dash. Initially the feedback was mixed, but now everybody prefers the new version. It makes navigating crawl data more fluid as you don’t need to alternate between scrolling and clicking “next” - just keep scrolling and data appears. Give it a try and let us know what you think!
We have some changes coming soon to allow jumping ahead (and avoid scrolling), along with various usability and performance tweaks.
Items, pages and logs are returned in insertion order. This means (at last!) logs are always ordered correctly, even when filtering.
Pending jobs are ordered by priority, then by age. So the next job to be run is at the bottom of the list. Running and finished jobs are ordered by the time they entered that state.
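As a small illustration of the pending ordering (the field names and priority values here are assumptions, not Dash's actual schema):

```python
from operator import itemgetter

# Hypothetical pending jobs; assume a larger "priority" means the job
# should run sooner, and "age_seconds" measures how long it has waited.
pending = [
    {"key": "78/1/7", "priority": 1, "age_seconds": 120},
    {"key": "78/3/2", "priority": 3, "age_seconds": 45},
    {"key": "78/2/5", "priority": 3, "age_seconds": 600},
]

# Ordered by priority, then by age: the next job to run ("78/2/5", the
# highest-priority and oldest job) ends up at the bottom of the list.
pending.sort(key=itemgetter("priority", "age_seconds"))
print([job["key"] for job in pending])  # ['78/1/7', '78/3/2', '78/2/5']
```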
We are doing our best to minimize disruption; however, some changes may be noticeable:

* `_jobid` and `_id` fields containing ObjectIds were available in exported data. These private fields are no longer present.
* `78/1/4` is the 4th run of spider id 1 in project 78. Items in this job are identified by `78/1/4/0`, `78/1/4/1`, etc. This has proven more useful than ObjectIds.

Please visit our support forum if you have suggestions or bug reports. We're looking forward to your feedback!