Welcome to This Month in Open Source at Scrapinghub! In this monthly column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera.
If you’re interested in learning more or even becoming a contributor, reach out to us by email at opensource@zyte.com or on Twitter @scrapinghub.
The big news for Scrapy lately is that Python 3 is now supported for the majority of use cases, the exceptions being FTP and email. We are very proud of the work done by our community of users and contributors, both old and new. It was a long ride, but we’re finally here. You all made it happen!
Check out all the cool stuff that we packed into this release, and please pay close attention to the backward-incompatible changes listed in the release notes before you upgrade:
Scrapy 1.1 is not officially released yet (we’re aiming for the end of March), but Release Candidate 3 is available for you to test. It’s the last mile, so we’d really appreciate it if you could report any issues you run into with Scrapy 1.1.0rc3 so that we can do our best to fix them.
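If you’d like to help us test the RC under Python 3, a minimal spider is all you need. Here’s a quick sketch (the target site and selectors are placeholders; point it at a site you’re allowed to crawl):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]  # placeholder target

    def parse(self, response):
        # Emit one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }

Save it as quotes_spider.py, run scrapy runspider quotes_spider.py -o quotes.json under both Python 2 and Python 3, and the results should match.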
Oh, and for those who want to stay on stabler (and less shiny) ground, we released Scrapy 1.0.5 with a few bug fixes.
Splash 2.0 is out! (Actually, we’re already at v2.0.3, and 2.1 will be released soon.)
Check out the repository here.
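If you haven’t played with Splash yet, it’s a JavaScript rendering service with a simple HTTP API. Here’s a minimal sketch using the render.html endpoint from Python with the requests library; it assumes Splash is running locally on the default port (e.g. via docker run -p 8050:8050 scrapinghub/splash):

import requests

# Ask Splash to load the page, execute its JavaScript, and return the final HTML
resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "http://example.com", "wait": 0.5},
)
print(resp.text)  # the DOM as rendered by Splash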
This is our third year of participating in Google Summer of Code and we’ve got plenty of possible project ideas for Scrapy, Portia, Splash, and Frontera. This program is open to students who are interested in working on open source projects with professional mentors. We’ve actually hired two of our previous participants, so you might even get a job out of this opportunity!
Scrapinghub is participating under the Python Software Foundation umbrella, so please take the time to read through their guidelines before applying.
Applications opened on March 14 and close on March 25. We’re looking forward to working with you!
Dateparser has also seen a string of releases since 0.3.1 (last October), bringing both new features and improvements; check the changelog for the full list.
Note that 0.3.4 pins python-dateutil at 2.4.2 or earlier; it does not work with python-dateutil 2.5.
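If you manage the dependency yourself, make sure the pin is respected when installing, for example:

pip install "python-dateutil<=2.4.2"

or add the equivalent line to your requirements.txt.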
The beta version of Portia 2.0 is out! This major release comes with a completely overhauled UI and plenty of fancy new tricks (including multiple item extraction) to help make automatic data extraction even easier. Stay tuned for the official release; in the meantime, try out the Portia 2.0 beta and let us know what you think.
The other big news in the Portia camp is the closure of Kimono Labs. For those affected, we offer a Kimono Labs-to-Portia migration so that you don't lose any of your work.
We released Frontera version 0.4 in January; however, we feel it deserves more coverage than it got at the time.
Let us know what you think! (use v0.4.1 from PyPI)
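If you want to kick the tires, Frontera hooks into an ordinary Scrapy project through a handful of settings. Here’s a rough sketch; see the Frontera documentation for the authoritative setup, and note that the FRONTERA_SETTINGS module name below is a placeholder for your own:

# In your Scrapy project's settings.py
SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}
# Let Frontera drive the crawl instead of Scrapy's default scheduler
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
# Placeholder: point this at your own Frontera settings module
FRONTERA_SETTINGS = 'myproject.frontera_settings'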
The Scrapinghub command line client, Shub, has long lived as merely a fork of scrapyd-client, the command line client for scrapyd. Last January, we freed it in the form of Shub v2.0! This release brings many new features and major improvements in usability.
If you work with multiple Scrapinghub projects, or even multiple API keys, you were probably irritated by the amount of repetition you needed to put into your scrapy.cfg file.
Shub v2.0 now reads from its own configuration file, scrapinghub.yml, where you can configure multiple projects and API keys in one place. You don’t need to worry about migrating your configuration, as Shub will automatically generate new configuration files from your old ones. To avoid storing your API keys in version control, you can run shub login, which will take your API key and create a configuration file, .scrapinghub.yml, in your home directory. Shub will read this file by default, so you don’t need to specify the API key in future deployments.
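For illustration, a minimal project-level scrapinghub.yml might look like this (the project IDs are made up, and your API key lives in the global .scrapinghub.yml that shub login creates, so it stays out of version control):

projects:
  default: 12345
  prod: 33333

You can then deploy to the production project with shub deploy prod, while a plain shub deploy targets the default.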
If you’re new to deploying your projects to Scrapinghub, or have just started a new project, running shub deploy in the project folder will launch a wizard that guides you through the setup and automatically generates your configuration files. No need to copy-and-paste from our web interface anymore!
Deploying projects and onboarding new users aren’t the only improvements. Shub now provides a much nicer shell experience, with a dedicated help page for every command (try shub schedule --help) and extensive error messages. If you’re not used to installing Python packages from the command line, our new stand-alone binaries (including for Windows) might be for you.
A particularly long-awaited new feature is the ability to view log entries, or scraped items, live as they come in. Just run shub log -f JOBID and watch your spiders at work. Shub will tell you the JOBID when you schedule a run via shub schedule; alternatively, you can simply look it up on the web interface.
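Putting it together, a typical session might look like this (the job ID shown here is made up; real IDs follow the project/spider/job pattern):

shub schedule myspider
shub log -f 12345/2/15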
Find the full documentation here. You can install Shub v2.0.2 via pip install -U shub, or get the binaries here.
Don’t forget to tell us what you think!
Thus concludes the March edition of This Month in Open Source at Scrapinghub. We’re always looking for new contributors, so please explore our GitHub. And remember, students: there are plenty of project ideas available across our open source projects, so apply to work with Zyte (formerly Scrapinghub) on Google Summer of Code 2016.