An Introduction to Web Scraping with the Python lxml Library
Whether you're trying to analyze market trends or gather data for research, web scraping can be a useful skill to have. This technique allows you to extract specific pieces of data from websites automatically and process them for further analysis or use.
In this blog post, we'll introduce the concept of web scraping and the lxml library for parsing and extracting data from XML and HTML documents using Python.
Additionally, we'll touch upon Parsel, an extension of lxml that is a key component of the Scrapy web scraping framework, offering even more advanced capabilities for handling complex web tasks.
What is Web Scraping?
Web scraping extracts structured data from websites automatically. It involves fetching pages, selecting elements, and capturing the desired information for purposes like data mining, data harvesting, competitor analysis, market research, social media monitoring, and more.
While web scraping can be done manually by copying and pasting information from a website, this approach is often time-consuming and error-prone.
Automating the process using programming languages like Python allows for faster, more accurate, and more efficient data collection with a web scraper.
What is lxml?
Python offers a wide range of libraries and tools for web scraping, such as Scrapy, Beautiful Soup, and Selenium. Each library has its own strengths and weaknesses, depending on the specific use case and requirements. lxml stands out due to its simplicity, efficiency, and flexibility when it comes to processing XML and HTML. lxml is designed for high-performance parsing and easy integration with other libraries. It combines the best of two worlds: the simplicity of Python's standard module xml.etree.ElementTree and the speed and flexibility of the C libraries libxml2 and libxslt.
HTML and XML files
HTML (HyperText Markup Language) is the standard markup language for creating web pages and web applications. Like XML (covered below), it is a hierarchical markup language, but its primary purpose is to structure and display content on the web.
HTML data consists of elements that browsers use to render the content on web pages. These elements, also referred to as HTML tags, have opening and closing parts (e.g., <tagname> and </tagname>) that enclose the content they represent. Each HTML tag has a specific purpose, such as defining headings, paragraphs, lists, links, or images, and they work together to create the structure and appearance of a web page.
Here's a simple HTML document example:
```html
<!DOCTYPE html>
<html>
<head>
  <title>Bookstore</title>
</head>
<body>
  <h1>Bookstore</h1>
  <ul>
    <li>
      <h2>A Light in the Attic</h2>
      <p>Author: Shel Silverstein</p>
      <p>Price: 51.77</p>
    </li>
    <li>
      <h2>Tipping the Velvet</h2>
      <p>Author: Sarah Waters</p>
      <p>Price: 53.74</p>
    </li>
  </ul>
</body>
</html>
```
XML (eXtensible Markup Language) is a markup language designed to store and transport data in a structured, readable format. It uses a hierarchical structure, with elements defined by opening and closing tags. Each element can have attributes, which provide additional information about the element, and can contain other elements or text.
Here's a simple XML document example:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book id="1">
    <title>A Light in the Attic</title>
    <author>Shel Silverstein</author>
    <price>51.77</price>
  </book>
  <book id="2">
    <title>Tipping the Velvet</title>
    <author>Sarah Waters</author>
    <price>53.74</price>
  </book>
</books>
```
Both XML and HTML documents are structured in a tree-like format, often referred to as the Document Object Model (DOM). This hierarchical organization allows for a clear and logical representation of data, where elements (nodes) are nested within parent nodes, creating branches and sub-branches.
The topmost element, called the root, contains all other elements in the document. Each element can have child elements, attributes, and text content.
The tree structure enables efficient navigation, manipulation, and extraction of data, making it particularly suitable for web scraping and other data processing tasks.
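To make the tree structure concrete, here is a minimal sketch that walks the XML example above using Python's standard xml.etree.ElementTree module (the same API that lxml mirrors). The root element is <books>, and each <book> is a child node with attributes and nested elements:

```python
import xml.etree.ElementTree as ET

xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book id="1">
    <title>A Light in the Attic</title>
    <author>Shel Silverstein</author>
    <price>51.77</price>
  </book>
  <book id="2">
    <title>Tipping the Velvet</title>
    <author>Sarah Waters</author>
    <price>53.74</price>
  </book>
</books>"""

root = ET.fromstring(xml_doc)  # the root node of the tree: <books>
print(root.tag)                # books

# Each child of the root is a <book> element with an "id" attribute
for book in root:
    print(book.get("id"), book.find("title").text)
```

Navigating from the root down through child nodes like this is exactly what XPath and CSS selectors automate for you.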
XPath vs. CSS Selectors
XPath and CSS selectors are two popular methods for selecting elements within an HTML or XML document. While both methods can be used with lxml, they have their own advantages and drawbacks.
XPath is a powerful language for selecting nodes in an XML or HTML document based on their hierarchical structure, attributes, or content. It is generally more expressive than CSS selectors, especially when dealing with complex documents, but it can have a steeper learning curve for those unfamiliar with its syntax.
CSS selectors, on the other hand, are a simpler and more familiar method for selecting elements, especially for those with experience in web development. They are based on CSS rules used to style HTML elements, which makes them more intuitive for web developers. While they may not be as powerful as XPath, they are often sufficient for most web scraping tasks.
Ultimately, the choice between XPath and CSS selectors depends on your personal preference, familiarity with each method, and the complexity of your web scraping project.
Using lxml for web scraping
Let's look at an example of how to web scrape with Python lxml. Suppose we want to extract the title and price of the books on the Books to Scrape web page, a sandbox website created by Zyte for you to test your web scraping projects.
First, we need to install the Python lxml module by running the following command:
```shell
pip install lxml
```
To perform web scraping using Python and lxml, create a Python file for your web scraping script. Save the file with a ".py" extension, like "web_scraping_example.py". You can write and execute the script using a text editor and a terminal, or an integrated development environment (IDE).
Next, we can use the requests module to retrieve the HTML content of the page from the website:
```python
import requests

url = "https://books.toscrape.com"
response = requests.get(url)
content = response.content
```
After retrieving the HTML content, use the html submodule from lxml to parse it:
```python
from lxml import html

parsed_content = html.fromstring(content)
```
Then, employ lxml's xpath method to extract the desired data from the web page:
```python
# Parsing the HTML to gather all books
books_raw = parsed_content.xpath('//article[@class="product_pod"]')
```
books_raw is a list of <article> Element objects, which we can parse individually. Although we could extract the titles and prices directly with two document-wide queries, parsing book by book keeps each title reliably paired with its price and scales better to more advanced extraction cases.
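For contrast, the direct approach mentioned above can be sketched against a trimmed-down, illustrative copy of the page's structure (the real page has the same element classes but much more markup):

```python
from lxml import html

# A simplified fragment mimicking the structure of books.toscrape.com
snippet = """
<div>
  <article class="product_pod">
    <a href="b1.html"><img alt="A Light in the Attic"/></a>
    <p class="price_color">£51.77</p>
  </article>
  <article class="product_pod">
    <a href="b2.html"><img alt="Tipping the Velvet"/></a>
    <p class="price_color">£53.74</p>
  </article>
</div>
"""
doc = html.fromstring(snippet)

# Direct approach: two independent queries whose results
# are only matched up by position in the two lists
titles = doc.xpath('//article[@class="product_pod"]//a/img/@alt')
prices = doc.xpath('//article[@class="product_pod"]//p[@class="price_color"]/text()')
print(list(zip(titles, prices)))
```

If one book were missing its price element, the two lists would silently fall out of alignment, which is why iterating per book is the safer pattern.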
Before proceeding, create a NamedTuple to store book information for improved readability with the following code:
```python
from typing import NamedTuple

class Book(NamedTuple):
    title: str
    price: str
```
Using NamedTuple is not necessary, but it can be a good approach for organizing and managing the extracted data. NamedTuples are lightweight, easy to read, and can make the code more maintainable. By using NamedTuple in this example, we provide a clear structure for the book data, which can be especially helpful when dealing with more complex data extraction tasks.
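As a quick illustration of why NamedTuple helps readability, fields can be accessed by name instead of by index, and instances convert cleanly to dictionaries for export:

```python
from typing import NamedTuple

class Book(NamedTuple):
    title: str
    price: str

book = Book(title="A Light in the Attic", price="£51.77")
print(book.title)      # access by name reads better than book[0]
print(book._asdict())  # {'title': 'A Light in the Attic', 'price': '£51.77'}
```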
With the NamedTuple Book defined, iterate through books_raw and create a list of Book instances:
```python
books = []
for book_raw in books_raw:
    title = book_raw.xpath('.//a/img/@alt')
    price = book_raw.xpath('.//p[@class="price_color"]/text()')
    book = Book(title=title, price=price)
    books.append(book)
```
Printing the books list displays the following output:
```python
[Book(title=['A Light in the Attic'], price=['£51.77']),
 Book(title=['Tipping the Velvet'], price=['£53.74']),
 Book(title=['Soumission'], price=['£50.10']),
 Book(title=['Sharp Objects'], price=['£47.82']),
 Book(title=['Sapiens: A Brief History of Humankind'], price=['£54.23']),
 Book(title=['The Requiem Red'], price=['£22.65']),
 ...]
```
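Note that each field holds a one-element list, because xpath() always returns a list of matches. If you prefer plain values, a small post-processing step (a hypothetical normalize helper, not part of lxml) can unwrap the lists and parse the price:

```python
def normalize(raw_title, raw_price):
    """Unwrap one-element xpath() result lists and parse the price string."""
    title = raw_title[0] if raw_title else ""
    price = float(raw_price[0].lstrip("£")) if raw_price else 0.0
    return title, price

print(normalize(['A Light in the Attic'], ['£51.77']))
# ('A Light in the Attic', 51.77)
```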
You can execute your web scraping script from the same Python console or terminal where you installed the lxml library. This way, you can run the script and observe the output directly in the console, or store the scraped data in a file or a database, depending on your project requirements.
Extending lxml with Parsel and Scrapy
Parsel allows you to parse HTML and XML documents, extract information, and traverse the parsed structure. It is built on top of the lxml library and provides additional functionality, like handling character encoding and convenient methods for working with CSS and XPath selectors.
The following code is an example of using Parsel's CSS method:
```python
from parsel import Selector

sel = Selector(text=u"""<html>
<body>
    <h1>Hello, Parsel!</h1>
    <ul>
        <li><a href="http://example.com">Link 1</a></li>
        <li><a href="http://scrapy.org">Link 2</a></li>
    </ul>
</body>
</html>""")

sel.css('h1::text').get()
# Output: 'Hello, Parsel!'
```
It is also possible to apply regular expressions to Parsel's selectors after the CSS or XPath extraction:
```python
sel.css('h1::text').re(r'\w+')
# Output: ['Hello', 'Parsel']
```
Web scraping is a powerful technique that enables users to collect valuable data from websites for various purposes. By understanding the fundamentals of HTML and XML documents and leveraging the Python lxml library, users can efficiently parse and extract data from web pages for simple data extraction tasks.
However, it's important to note that lxml on its own may not be the best fit for more complex projects. In those cases, Parsel, a key component of Scrapy, offers a stronger foundation. Scrapy brings numerous benefits, including built-in support for handling cookies, redirects, and concurrency, as well as advanced data processing and storage capabilities. By using Parsel to parse both HTML and XML documents, Scrapy provides a powerful and efficient way to traverse the parsed structure and extract the necessary information, letting you tackle even the most complex web scraping projects with confidence.
By understanding the principles and techniques discussed in this blog post, you'll be prepared to tackle web scraping projects using either lxml or a comprehensive solution like Scrapy, harnessing data to achieve your objectives.