
How does a headless browser help with web scraping and data extraction?

September 15, 2021

If you’re involved in any kind of web data extraction project, you’ve probably heard about headless browsers. Maybe you’re wondering what they are, and whether you need to use them. Here I’d like to tackle a few basic questions about headless browsers and how they’re used.

Let’s start by looking at what happens when you access a web page in the context of how most scraping frameworks work.

To read this blog you’re almost certainly using some sort of web browser on your computer or mobile device. In essence, a browser is a piece of software that renders a web page for viewing on a target device. It turns code sent from the server into something that’s readable on your screen, with text and images adorned by beautiful fonts, pop-ups, animations, and all the other pretty stuff. What’s more, the browser also allows you to interact with the contents of the page by clicking, scrolling, hovering, and swiping.

It’s your computer that actually does the donkey work of rendering, something that typically involves hundreds of HTTP requests being sent by the browser to the server. Your browser will first request the initial ‘raw’ HTML page content. Then it will make a string of further requests to the server for additional elements like stylesheets and images.
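To make that string of follow-up requests concrete, here is a minimal sketch that lists the extra resources a browser would go on to fetch after receiving the initial HTML. It uses only Python’s standard library; the sample markup and the names SubresourceCollector and list_subresources are made up for illustration, and a real browser fetches far more than these three tag types (fonts, XHR calls, and so on).

```python
from html.parser import HTMLParser


class SubresourceCollector(HTMLParser):
    """Collects URLs a browser would request after the initial HTML."""

    def __init__(self):
        super().__init__()
        self.resources = []  # (kind, url) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel and attrs.get("href"):
            self.resources.append(("stylesheet", attrs["href"]))
        elif tag == "script" and attrs.get("src"):
            self.resources.append(("script", attrs["src"]))
        elif tag == "img" and attrs.get("src"):
            self.resources.append(("image", attrs["src"]))


def list_subresources(html):
    """Return the (kind, url) pairs referenced by an HTML document."""
    collector = SubresourceCollector()
    collector.feed(html)
    collector.close()
    return collector.resources


# Hypothetical page used only for illustration.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="/static/app.js"></script>
</head><body>
<img src="/images/hero.png" alt="">
<p>Hello</p>
</body></html>
"""

if __name__ == "__main__":
    for kind, url in list_subresources(SAMPLE_HTML):
        print(kind, url)
```

Each entry in that list is one more round trip to a server that a browser makes and a plain HTTP client does not.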

In the early days of the web, sites were built entirely on HTML and CSS. Now they’re designed to provide a much richer, more interactive user experience. And that means modern sites are often heavily reliant on JavaScript that renders all that beautiful content in near-real-time for the viewer’s benefit. You can see what’s happening when a site loads slowly over a sluggish Internet connection. Bare-bones elements of the page appear first. Then a few seconds later dull-looking text is re-rendered in snazzy custom fonts, and other elements of visual tinsel pop into being as JavaScript does its thing.

Most websites these days also serve some kind of tracking code, user analytics code, social media code, and myriad other things. The browser needs to download all this information, decide what needs to be done with it, and actually render it.

Now let’s say you want to write a scraping script to automate the process of extracting data from some websites. At this point, you may well be wondering if you need to use some kind of browser to achieve this. Let’s say you’re writing some code to compare product pricing on a number of different online marketplaces. The price for a certain item may not even be contained in the raw HTML code for the product page. It doesn’t exist as a visible element on that page until it’s been rendered by JavaScript code executed by the client, i.e. the browser that’s made the page request to the server.
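One rough way to find out whether a value such as a price is already in the raw HTML is to look for it before any JavaScript runs. The sketch below does that with the standard library only. The element id "price" and both sample pages are hypothetical, and real product pages will need their own selectors.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ElementTextFinder(HTMLParser):
    """Gathers the text inside the first element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0      # > 0 while we are inside the target element
        self.found = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.element_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_in_raw_html(html, element_id):
    """Return the text of the element as served, before any script runs."""
    finder = ElementTextFinder(element_id)
    finder.feed(html)
    finder.close()
    return "".join(finder.chunks).strip()


# Hypothetical pages: one ships the price in the HTML, one fills it in later.
STATIC_PAGE = '<div><span id="price">$19.99</span></div>'
DYNAMIC_PAGE = (
    '<div><span id="price"></span></div>'
    '<script>document.getElementById("price").textContent = "$19.99";</script>'
)

if __name__ == "__main__":
    print(repr(text_in_raw_html(STATIC_PAGE, "price")))
    print(repr(text_in_raw_html(DYNAMIC_PAGE, "price")))
```

An empty result for the second page is the hint that the value only appears once a browser has executed the page’s JavaScript.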

To extract information at scale from thousands or even millions of web pages, you’re certainly going to need some kind of automated solution. It’s prohibitively time-consuming and costly to hire a roomful of people, sit them in front of lots of computers, and have them jot down notes about what they can see on screen. That’s what headless browsers are there for. And what’s ‘headless’ all about, by the way? It simply means that the browser runs without a graphical interface and isn’t under the control of a human operator clicking and scrolling with a mouse; it’s driven by code instead.

Instead of using humans to interact with and copy information from a website, you simply write some code that tells the headless browser where to go and what to get from a page. This way you can have a page rendered automatically and get the information you need. There are several programmatic interfaces to browsers out there – the most popular being Puppeteer, Playwright, and Selenium. They all do a broadly similar job, letting you write some code that tells the browser to visit a page, click a link, click a button, hover over an image, and take a screenshot of what it sees.
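As a rough illustration of what ‘telling the browser where to go and what to get’ can look like, here is a minimal sketch using Playwright’s synchronous Python API. The URL, the selector, and the function names scrape_rendered_text and run_headless are assumptions made for the example, not part of any particular site or product.

```python
def scrape_rendered_text(page, url, selector):
    """Load `url` in a browser page and return the rendered text at `selector`.

    `page` is expected to behave like a Playwright Page object.
    """
    page.goto(url)
    page.wait_for_selector(selector)  # wait until the element exists in the DOM
    return page.inner_text(selector).strip()


def run_headless(url, selector, screenshot_path=None):
    """Start headless Chromium, scrape one value, optionally save a screenshot."""
    # Imported lazily so the helper above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            text = scrape_rendered_text(page, url, selector)
            if screenshot_path:
                page.screenshot(path=screenshot_path)
            return text
        finally:
            browser.close()
```

With Playwright and a Chromium build installed, something like run_headless("https://example.com/product", "#price") would return the text as a user would see it after rendering. Clicking and hovering work along the same lines through page.click and page.hover.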

But is that really the best way to do scraping? The answer is rarely a clear-cut yes or no; more often it’s a case of ‘it depends’.

Most popular scraping frameworks don’t use headless browsers under the hood. That’s because headless browsers are not the most efficient way to get your information for most use cases.

Let’s say you just want to extract the text from this article you’re reading right now. To see it on screen, a browser needs to make hundreds of requests. But if you try to make a request to our URL with some command-line tool such as cURL, you’ll see that this text is actually available in the initial response. In other words, you don’t actually need to bother about styling, images, user tracking, and social media buttons to get the bit you’re really interested in: the text itself.
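The same idea can be sketched in a few lines of Python using only the standard library: make one request, then pull the readable text out of the response. The function names, the User-Agent string, and the sample markup are assumptions for illustration; a scraping framework would normally handle this with a proper HTML parser and selectors.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring script, style and noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the human-readable text of an HTML document, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_html(url, timeout=10):
    """Fetch a page with a single HTTP request; no rendering, no sub-resources."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{}</style></head><body>"
        "<h1>Title</h1><p>Body text</p><script>track()</script>"
        "</body></html>"
    )
    print(extract_text(sample))
```

Calling extract_text(fetch_html(url)) on a page whose content is in the initial response costs one request, where a browser would make many.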

All these things are there for the benefit of humans and their interaction with websites. However, scrapers don’t really care whether there is some nice image on a page. They don’t click on social media sharing buttons - unless they have their own bot social network, but AI isn’t quite that advanced yet. The scraper will just see raw HTML code: this isn’t easy for humans to read, but it’s quite sufficient for a machine. And it’s actually all your programme needs if it’s just hunting for this blog post.

For many use cases, it’s vastly more efficient to make a single request to one URL than to render the whole page with a headless browser. Instead of making a hundred requests for things your programme doesn’t need - like images and stylesheets - you just ask for the critical bits of relevant information. Still, there might be use cases where you need a headless browser.

Having said that, rendering and interaction with a real browser are increasingly needed to counter antibot systems. While these technologies are mainly used to deter bad actors from attacking and potentially exploiting vulnerabilities on a site, antibots can also block legitimate users. In other use cases such as quality assurance, you actually need to simulate a real user: indeed that’s the whole objective of QA, albeit with automation coming into play to achieve this consistently and at scale. Here we’re talking about actions like clicking a sign-in button, adding items to a cart, and transitioning through pages.

Even if your data extraction efforts don’t need headless browsers right now, it’s still worth getting to know them better. If you’re a developer, have your own crawlers, and need a smart proxy network to get them going at scale, head over to our headless browsers docs and try writing some programmes with them. Similarly, if you’re into new and shiny stuff, read this article by our Head of R&D, Akshay Philar.

Written by Pawel Miech
Paweł is a Technical Team Lead in the Delivery Department at Zyte. He has several years of experience developing advanced crawling solutions using the Scrapy framework, contributes to open source, is one of the authors of the ScrapyRT framework, and has contributed to Splash.