Here at Zyte, we love open source. We love using it and contributing to it. Over the years we have open sourced a number of projects that we use over and over, in the hope that they will make others' lives easier. Writing reusable code is harder than it sounds, but it enforces good practices such as accurate documentation, extensive testing and attention to backwards compatibility. In the end it produces better software and keeps programmers happier. This is why we open source as much as we can and always deliver the complete source code to our clients, so they can run everything on their own machines if they ever want or need to.
Here is a list of open source projects we currently maintain, most of them born and raised at Scrapinghub:
Scrapy is the most popular web crawling framework for Python, used by thousands of companies around the world to power their web crawlers. At Scrapinghub we use it to crawl millions of pages daily. We use Scrapy Cloud for running our Scrapy crawlers without having to manage servers or plan capacity beforehand.
Scrapely is a supervised learning library for extracting structured data from HTML pages. You train it with a few annotated examples, and Scrapely automatically extracts the same data from similar pages. It powers the data extraction engine of our Automatic Extraction service.
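The idea behind this kind of instance-based extraction can be sketched in a few lines: learn the HTML immediately surrounding an annotated value on a training page, then locate the same landmarks on similar pages. This toy version is not Scrapely's API — the real library tokenizes the HTML and tolerates small differences between pages — but it illustrates the principle:

```python
def train(page, value, context=20):
    """Record the HTML immediately before and after `value` on a training page."""
    i = page.index(value)
    prefix = page[max(0, i - context):i]
    suffix = page[i + len(value):i + len(value) + context]
    return prefix, suffix


def scrape(page, template):
    """Extract whatever sits between the learned prefix and suffix."""
    prefix, suffix = template
    start = page.index(prefix) + len(prefix)
    end = page.index(suffix, start)
    return page[start:end]


# Annotate one example...
template = train('<li>Price: <b class="price">$15.99</b> only</li>', "$15.99")
# ...and extract from a similar page.
print(scrape('<li>Price: <b class="price">$7.50</b> only</li>', template))  # $7.50
```

Exact string matching breaks as soon as the markup varies slightly, which is exactly the robustness problem Scrapely solves.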
Slybot combines the power of Scrapy and Scrapely into a standalone web crawler application. We are currently working on a new version that will include a fully featured visual annotation tool (the one used so far by Zyte Automatic Extraction was never open sourced).
Pydepta is used to extract repeated data (such as records in tables) automatically. It is based on the paper “Web Data Extraction Based on Partial Tree Alignment”.
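In the simplest case — an actual HTML table — repeated records can be pulled out with the standard library alone; pydepta's contribution is generalising this to arbitrary repeated regions via partial tree alignment. A toy sketch of the table case (not pydepta's API):

```python
from html.parser import HTMLParser


class TableRecords(HTMLParser):
    """Collect each <tr> as a list of its cell texts — the simplest instance
    of the repeated-record extraction that pydepta generalises."""

    def __init__(self):
        super().__init__()
        self.records = []
        self.row = None       # cells of the <tr> currently open, if any
        self.in_cell = False  # inside a <td>/<th>?

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.records.append(self.row)
            self.row = None
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())


parser = TableRecords()
parser.feed("<table><tr><td>Alice</td><td>30</td></tr>"
            "<tr><td>Bob</td><td>25</td></tr></table>")
print(parser.records)  # [['Alice', '30'], ['Bob', '25']]
```

Real pages rarely use clean tables, which is why pydepta aligns repeated subtrees instead of relying on fixed tags.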
Webstruct is a framework for creating machine-learning-based named-entity recognition systems that work on HTML data. Trained Webstruct models work across many different websites, whereas Scrapely shines when you need to extract data from a single website. Webstruct models require much more training data than Scrapely ones, but the training is done once per task (e.g. "contact extraction"), not once per website, so it scales better to a large number of websites. A big refactoring is in the works and due to be merged soon.
Loginform is used for filling website login forms given just the login page URL, username and password. Which form and fields to submit are inferred automatically.
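A toy version of that inference, using only the standard library: find the form that contains a password input, fill text/email fields with the username and keep hidden fields (CSRF tokens) untouched. The real loginform also handles pages with several forms, form methods and more field types; this sketch is not its API:

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collect every <form> with its <input> fields (name, type, value)."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self.current = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self.current = {"action": a.get("action", ""), "fields": []}
        elif tag == "input" and self.current is not None:
            self.current["fields"].append(
                (a.get("name"), a.get("type", "text"), a.get("value", "")))

    def handle_endtag(self, tag):
        if tag == "form" and self.current is not None:
            self.forms.append(self.current)
            self.current = None


def fill_login_form(html, username, password):
    """Return (action, data) for the first form that looks like a login form."""
    finder = FormFinder()
    finder.feed(html)
    for form in finder.forms:
        if "password" not in [t for _, t, _ in form["fields"]]:
            continue  # no password field: not a login form
        data = {}
        for name, ftype, value in form["fields"]:
            if ftype == "password":
                data[name] = password
            elif ftype in ("text", "email"):
                data[name] = username
            elif ftype == "hidden":
                data[name] = value  # keep CSRF tokens and the like
        return form["action"], data
    return None


page = ('<form action="/login">'
        '<input type="hidden" name="csrf" value="t0k3n">'
        '<input type="text" name="user">'
        '<input type="password" name="pass">'
        '</form>')
print(fill_login_form(page, "alice", "s3cret"))
# ('/login', {'csrf': 't0k3n', 'user': 'alice', 'pass': 's3cret'})
```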
Webpager is used to paginate search results automatically, without having to specify where the “next” button is.
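A crude heuristic for finding the next-page link — prefer rel="next", otherwise take an anchor whose text looks like a “next” button — sketches the problem webpager solves by learning from examples instead. Again a stdlib toy, not webpager's API:

```python
import re
from html.parser import HTMLParser

# Anchor texts that usually mean "go to the next page" (assumed word list).
NEXT_TEXT = re.compile(r"^(next|more|older|»|›|>)", re.I)


class NextFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rel_next = None   # href from rel="next", the strongest signal
        self.text_next = None  # href of an anchor whose text says "next"
        self._href = None      # href of the anchor currently being read

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("a", "link") and "href" in a:
            if "next" in a.get("rel", "").split():
                self.rel_next = a["href"]
            if tag == "a":
                self._href = a["href"]

    def handle_data(self, data):
        if self._href and self.text_next is None and NEXT_TEXT.match(data.strip()):
            self.text_next = self._href

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def next_page_url(html):
    finder = NextFinder()
    finder.feed(html)
    return finder.rel_next or finder.text_next


html = '<a href="/page/1">1</a> <a href="/page/2">2</a> <a href="/page/2">Next »</a>'
print(next_page_url(html))  # /page/2
```

Hand-written heuristics like these break on localised sites and unusual markup, which is the gap a trained model closes.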
If you are working on web data mining, take a moment to review these projects: there is a good chance you will need one of them for your next project, and they are not the kind of wheel you would want to reinvent.