Welcome to This Month in Open Source at Scrapinghub! In this regular column, we share all the latest updates on our open source projects including Scrapy, Splash, Portia, and Frontera.
If you’re interested in learning more or even becoming a contributor, reach out to us by email at opensource@zyte.com or on Twitter @scrapinghub.
For those who missed the big news, Scrapy 1.1 is live! It’s the first official release that comes with Python 3 support, so you can go ahead and move your stack over.
The major changes in this release since the RC1 we announced in February include improved HTTPS connections (with proxy support) and better handling of URLs with non-ASCII characters. Make sure you upgrade w3lib to 1.14.2.
We’re very grateful for the feedback we received during the release candidate phase. A huge thanks to all the reporters, reviewers and code/documentation contributors.
If you find anything that’s not working, please take a few minutes to report the issue(s) on GitHub.
Notable limitations still present in this release include:
Splash 2.1 now lets you:
If you’re using the Scrapy-Splash plugin (formerly “scrapyjs”), we encourage you to upgrade to the latest version, v0.7. It includes many goodies that make integrating with Scrapy much easier. Check the latest README for details, especially the scrapy_splash.SplashRequest utility.
We’re thrilled to have 5 students this year:
We’d like to thank the Python Software Foundation for again having Scrapinghub as a sub-org this year!
Scrapy relies on lxml and cssselect for all the XPath and CSS selection awesomeness that we use each and every day at Scrapinghub. We learned that Simon Sapin, author of the cssselect package, was looking for new maintainers. So we put ourselves forward, and cssselect is now hosted under the Scrapy organization on GitHub. Don’t worry though, Simon is still involved! We’re planning on fixing a few corner cases and maybe working on CSS Selectors Level 4. We’ll definitely need assistance with this task, so please reach out if you’re interested in helping out!
We released Dateparser 0.3.5 with support for dates in Danish and Japanese. It also handles dates with accents much better and works with the latest version of python-dateutil.
Check the full release notes here.
This side project of mine is now hosted under Scrapinghub’s organization on GitHub. It’s a little helper library to convert JavaScript code into an XML tree. This means you can use XPath and CSS selectors to extract data (strings, objects, function arguments, etc.) from HTML-embedded JavaScript (this does not interpret it though). You’d be amazed at how much valuable data is “hidden” in JavaScript inside web pages.
It’s on PyPI and is now Python 3-compatible.
Check this Jupyter/IPython notebook for an overview of what you can do with it, and make sure to let us know what you think.
We updated our w3lib library to handle non-ASCII URLs better, as part of adding Python 3 support to Scrapy 1.1. We recommend that you upgrade to the latest 1.14.2 version.
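To give a feel for what “handling non-ASCII URLs” involves, here is a minimal sketch using only the Python 3 standard library. It is not w3lib’s implementation (w3lib exposes this kind of handling through functions such as `w3lib.url.safe_url_string`); the helper name `to_ascii_url` is made up for this illustration, and the sketch assumes UTF-8 for the path and query and ignores user credentials in the URL.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Return an ASCII-only version of `url` (hypothetical helper, UTF-8 assumed)."""
    parts = urlsplit(url)

    # Host names use IDNA ("punycode"), not percent-encoding.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii")
    if parts.port:
        netloc = "{}:{}".format(netloc, parts.port)

    # Path, query and fragment are percent-encoded as UTF-8 bytes.
    # "%" is kept safe so already-encoded sequences are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(to_ascii_url("http://www.example.com/résumé?q=café"))
print(to_ascii_url("http://bücher.example/katalog"))
```

In a real project you would let w3lib (and therefore Scrapy) do this for you; the sketch only shows why the host and the rest of the URL need different treatment.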
If you’re using Scrapy 1.1, you’re using parsel under the hood. Parsel is Scrapy Selectors as an independent package. There’s a new release of parsel that fixes the hiding of XPath exceptions.
We’ve made some changes to Slybot, the Portia crawler, that include:
For Portia itself:
Most of the recent developments have been taking place in the Portia beta.
The big changes include:
Try out the beta using the nui-develop branch.
Frontera 0.5 introduces an improved crawling strategy, new logging and better test coverage.
Scrapy-mosquitera is a library that helps Scrapy spiders crawl more efficiently. In its basic form, it’s a collection of matchers and a mixin to narrow down the crawl to a specific date range. However, you can extend it to apply to any domain (URL paths, location filtering, etc.). You can find more details about how it works and how you can create your own matchers in the documentation.
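The idea of a matcher plus a mixin can be sketched in a few lines of plain Python. This is an illustration of the concept only, not Scrapy-mosquitera’s API: the names `url_date`, `DateRangeMixin`, `should_follow` and `ArchiveCrawler` are all hypothetical, and the sketch assumes dates appear in URL paths as `/YYYY/MM/DD/`.

```python
import re
from datetime import date

# Assumed URL layout for this sketch: .../2016/05/12/some-post
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def url_date(url):
    """Hypothetical matcher: return the date embedded in `url`, or None."""
    match = DATE_IN_PATH.search(url)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits that look like a date but are not one (e.g. month 13).
        return None


class DateRangeMixin:
    """Hypothetical mixin: expects `since` and `until` date attributes."""

    def should_follow(self, url):
        found = url_date(url)
        if found is None:
            # No date in the URL: follow it rather than risk missing pages.
            return True
        return self.since <= found <= self.until


class ArchiveCrawler(DateRangeMixin):
    def __init__(self, since, until):
        self.since = since
        self.until = until


crawler = ArchiveCrawler(date(2016, 5, 1), date(2016, 5, 31))
for link in ("http://example.com/2016/05/12/post",
             "http://example.com/2015/12/01/old",
             "http://example.com/about"):
    print(link, crawler.should_follow(link))
```

In a spider, a check like `should_follow` would sit in front of the code that yields new requests, so out-of-range pages are never downloaded. See the library’s documentation for its actual matchers and mixin.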
This concludes the June edition of This Month in Open Source at Scrapinghub. We’re always looking for new contributors, so if you’re interested, feel free to explore our GitHub.