
Introducing w3lib and scrapely

April 20, 2011

In an effort to make Scrapy code smaller and more reusable, we’ve been working on splitting the Scrapy codebase into two different modules:

  1. w3lib
  2. scrapely

w3lib

A library of simple, reusable functions for working with URLs, HTML, forms, and HTTP that aren’t found in the Python standard library. It has no external dependencies.

For more info see:

scrapely

Scrapely is a library for extracting structured data from HTML pages. What makes it different from other Python web scraping libraries is that it doesn’t depend on lxml or libxml2. Instead, it uses an internal pure-Python parser that can accept poorly formed HTML. The HTML is converted into an array of token IDs, which is used for matching the items to be extracted.

Scrapely depends on numpy (which it uses to speed up calculations) and on w3lib.

You can find more info, or try it out, on the GitHub page.

Scrapy codebase

After these changes, the Scrapy codebase has been reduced by 4,574 lines of Python, including blank lines and comments (according to cloc).

Before:

$ cloc /tmp/scrapy2/scrapy
     333 text files.
     332 unique files.
      18 files ignored.

http://cloc.sourceforge.net v 1.09  T=0.5 s (628.0 files/s, 66050.0 lines/s)
-------------------------------------------------------------------------------
Language          files     blank   comment      code    scale   3rd gen. equiv
-------------------------------------------------------------------------------
Python              301      5819      5341     20663 x   4.20 =       86784.60
HTML                 11       117        93       792 x   1.90 =        1504.80
XML                   2         1         0       199 x   1.90 =         378.10
-------------------------------------------------------------------------------
SUM:                314      5937      5434     21654 x   4.09 =       88667.50
-------------------------------------------------------------------------------

After:

$ cloc /tmp/scrapy/scrapy
     308 text files.
     307 unique files.
      14 files ignored.

http://cloc.sourceforge.net v 1.09  T=0.5 s (586.0 files/s, 55136.0 lines/s)
-------------------------------------------------------------------------------
Language          files     blank   comment      code    scale   3rd gen. equiv
-------------------------------------------------------------------------------
Python              284      5206      3801     18242 x   4.20 =       76616.40
XML                   2         1         0       199 x   1.90 =         378.10
HTML                  7        17         0       102 x   1.90 =         193.80
-------------------------------------------------------------------------------
SUM:                293      5224      3801     18543 x   4.16 =       77188.30
-------------------------------------------------------------------------------

Scrapy dependencies

Scrapy 0.14 will depend on w3lib. Scrapy 0.13 (the current development version) already depends on it, and w3lib is already packaged and provided in the official APT repos (as python-w3lib). So if you’re using Scrapy 0.13 on Ubuntu, you can upgrade safely. Otherwise, you can always install or upgrade it with easy_install or pip. The stable version (Scrapy 0.12) is not affected by this change at all.

If you have any comments or questions feel free to post them in the scrapy-users group.

Written by Kevin McKinless
Web scraping specialist with over 10 years’ experience. An expert in Python and Rocket League. Join me on social media and we can talk all things data extraction.