
How to extract data from a website?

July 15, 2021

It's a 21st century truism that web data touches virtually every aspect of our daily lives. We create, consume and interact with it while we’re working, shopping, travelling and relaxing. It’s not surprising that web data makes the difference for companies to innovate and get ahead of their competitors. But how can you actually get data from websites? And what’s this thing called ‘web scraping’?

Why would you want to extract data from a website?

Up-to-date, trustworthy data from other websites is the rocket fuel that can power every organisation’s successful growth, including your own.

You might want to compare pricing of competitors’ products across popular ecommerce sites. You could be monitoring customer sentiment by trawling for name-checks for your brand – favourable or otherwise – in news articles and blogs. Or you might be gleaning information about a particular industry or market sector to guide critical investment decisions.

A concrete example of where web data plays an increasingly valuable role in the financial services industry is insurance underwriting and credit scoring. There are billions of ‘credit invisibles’ around the world, in both developing and mature markets. Although these individuals don’t possess a standard credit history, there’s a huge range of ‘alternative data’ sources out there, helping lenders assess risk and potentially take these individuals on as clients. These sources range from debit card transactions and utility payments to survey responses, social media posts on a particular topic and product reviews. Read our blog that explains how public web data can provide financial services providers with a precise, insightful alternative dataset.

Also in the financial sector, hedge fund managers are turning to alternative data – beyond the scope of conventional sources like company reports and bulletins – to help inform their investment decisions. We’ve blogged recently about the value of web data in this space, and how Zyte can help deliver standards-compliant custom data feeds that complement traditional research methodologies.

Data, in short, is the differentiating factor for companies when it comes to understanding customers, knowing what competitors are up to – or making just about any kind of commercial decisions based on hard facts rather than intuition.

The web holds answers to all these questions, and countless more. Think of it as the world’s biggest and fastest-growing research library. There are billions of web pages out there. Unlike a static library, however, many of those pages present a moving target when details like product pricing can change regularly. Whether you’re a developer or a marketing manager, getting your hands on reliable, timely web data might seem like searching for a needle in a huge, ever-changing digital haystack.

What is web scraping?

So you know your business needs web data. What happens next? There’s nothing to stop you collecting data from any website manually by cutting and pasting the relevant bits you need from other websites. But it’s easy to make errors, and it’s going to be fiddly, repetitive and time consuming for whoever’s been tasked with the job. And by the time you’ve gathered all the data you need, there’s no guarantee that the price or availability of a particular product hasn’t changed.

For all but the smallest projects you’ll need to turn to some kind of automated extraction solution. Often referred to as ‘web scraping’, data extraction is the art and science of grabbing relevant web data – maybe from a handful of pages, or hundreds of thousands – and serving it up in a neatly organised structure that your business can make sense of.

So how does data extraction work? In a nutshell, it makes use of computers to mimic the actions of a human being when they’re finding specific information on a website, quickly, accurately and at scale. Webpages are designed primarily for the benefit of humans. They tend to present information in ways that we can easily process, understand and interact with. If it’s a product page, for example, the name of a book or a pair of trainers is likely to be shown pretty near the top, with the price nearby and probably with an image of the product too. Along with a host of other clues lurking in the HTML code of that webpage, these visual pointers can help a machine pinpoint the data you’re after with impressive accuracy.
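To make the idea of "clues lurking in the HTML" a little more concrete, here is a minimal sketch that uses only Python's standard library. The markup in SAMPLE_HTML is a hypothetical, heavily simplified product page, and ProductFieldParser is an illustrative name, not part of any library. It simply shows how a tag name (h1) and a class attribute (price_color) can point a machine at the title and the price.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified product page markup (an assumption for illustration).
SAMPLE_HTML = """
<div class="product_main">
  <h1>It's Only the Himalayas</h1>
  <p class="price_color">£45.17</p>
</div>
"""


class ProductFieldParser(HTMLParser):
    """Collect the text of the <h1> and of the element with class 'price_color'."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # name of the field we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1":
            self._current = "title"
        elif "price_color" in classes:
            self._current = "price"

    def handle_data(self, data):
        # Only keep non-empty text that belongs to a field we care about.
        if self._current and data.strip():
            self.fields[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


parser = ProductFieldParser()
parser.feed(SAMPLE_HTML)
print(parser.fields)
```

Real pages are far messier than this, which is why dedicated parsing libraries are normally used instead of hand-written parsers.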

There are various practical ways to attack the extraction challenge. The crudest is to make use of the wide range of open source scraping tools that are out there. In essence, these are chunks of ready-written code that scan the HTML content of a webpage, pull out the bits you need and file them into some kind of structured output. Going down the open source route has the obvious appeal of being ‘free’. But it’s not a task for the faint hearted, and your own developers will spend a fair amount of time writing scripts and tweaking off-the-shelf code to meet the needs of a specific job.

Step by step: how to extract web data from a product page

OK – it’s time to put all this web scraping theory into practice. Here’s a worked example that illustrates the three key steps in a real-world extraction project.

1. Create extraction script

To keep things simple, we are going to use the requests and BeautifulSoup libraries to create our script.

As an example, I will be extracting product data from this website: books.toscrape.com

The extraction script will contain two functions:

  1. A crawler to find product URLs
  2. A scraper which will actually extract the data

Making requests is an important part of the script, both for finding the product URLs and for fetching the product HTML files. So first, let’s start off by creating a new class and adding the base URL of the website:

class ProductExtractor(object):
	BASE_URL = 'http://books.toscrape.com'

Then, let’s create a simple function that will help us make requests:

import requests

# Method of ProductExtractor: the single place where HTTP requests are made
def make_request(self, url):
    return requests.get(url)

Calling requests.get() is simple in itself, but wrapping it in one function means that if you later want to scale up your requests with proxies, you only need to modify this part of your code and not every place where you invoke requests.get().

Extract product URLs

I will only extract products from one category called Travel to get some sample data. Here, the task is basically to find all product URLs on this category page and return them in some kind of iterable format so we have each URL to make a request to:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_urls(self, start_url):
    response = self.make_request(start_url)
    parser = BeautifulSoup(response.text, 'html.parser')
    product_links = parser.select('article.product_pod > h3 > a')
    for link in product_links:
        relative_url = link.attrs.get('href')
        # hrefs on the category page start with '../../..'; map that prefix onto /catalogue
        absolute_url = urljoin(self.BASE_URL, relative_url.replace('../../..', 'catalogue'))
        yield absolute_url

This is what this function does, line by line:

We make a normal request to get to the category page (start_url)

Create a BeautifulSoup object which will help us parse the HTML of the category page

Select every product link on the page using the CSS selector article.product_pod > h3 > a

Iterate over the extracted links, which at this point are <a> elements

Extract the relative URL from the <a> element, by parsing the href attribute

Convert the relative URL to absolute URL

Return a generator with the absolute URLs
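The relative-to-absolute conversion is the least obvious of these steps, so here is a small standalone sketch of it. The href value below is a hypothetical example of what a link on the category page might look like. The sketch compares the prefix-replacement approach used in extract_urls() with an alternative that resolves the link against the URL of the page it was found on. Both should give the same result here.

```python
from urllib.parse import urljoin

BASE_URL = "http://books.toscrape.com"
category_url = "http://books.toscrape.com/catalogue/category/books/travel_2/index.html"

# Hypothetical href, as it might appear on the category page (an assumption).
relative_url = "../../../its-only-the-himalayas_981/index.html"

# Approach used in extract_urls(): swap the '../../..' prefix for 'catalogue'.
via_replace = urljoin(BASE_URL, relative_url.replace("../../..", "catalogue"))

# Alternative: let urljoin resolve the '..' segments against the category page URL.
via_page_url = urljoin(category_url, relative_url)

print(via_replace)
print(via_page_url)
```

Resolving against the page URL avoids hard-coding the '../../..' prefix, which could matter if you crawl pages at a different depth in the site.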

2. Extract product fields

The other important part of our script is the product extractor function.

def extract_product(self, url):
    response = self.make_request(url)
    # Pass the raw bytes so BeautifulSoup can detect the page encoding itself;
    # this avoids garbled apostrophes and currency symbols in the extracted text
    parser = BeautifulSoup(response.content, 'html.parser')
    book_title = parser.select_one('div.product_main > h1').text
    price_text = parser.select_one('p.price_color').text
    stock_info = parser.select_one('p.availability').text.strip()
    product_data = {
        'title': book_title,
        'price': self.clean_price(price_text),
        'stock': stock_info
    }
    return product_data

As you can see above, the price field needs some cleaning because the raw text contains a currency symbol and other characters as well. Luckily, there’s an open source library that does the heavy lifting of parsing the price value: price_parser (created by Zyte).

from price_parser import Price

def clean_price(self, price_text):
    return Price.fromstring(price_text).amount_float

This function returns the price of the product - extracted from text - as a float value.
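To give a feel for what "cleaning" a price involves, here is a deliberately simplified stand-in written with the standard library only. It is not how price_parser works internally, and clean_price_sketch is a hypothetical name. It assumes prices of the form "£45.17", with a dot as the decimal separator and no thousands separators.

```python
import re


def clean_price_sketch(price_text):
    """Return the first number found in price_text as a float, or None.

    Simplifying assumption: a dot is the decimal separator and there are
    no thousands separators, as in '£45.17'.
    """
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        return None
    return float(match.group(0))


print(clean_price_sketch("£45.17"))
print(clean_price_sketch("Price: £45.17 incl. tax"))
print(clean_price_sketch("free"))
```

A dedicated library is still the better choice in practice, since real-world price strings vary much more than this sketch allows for.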

3. The main function

And finally – this is the main function where we put together extract_urls() and extract_product().

def main():
    extractor = ProductExtractor()
    product_urls = extractor.extract_urls('http://books.toscrape.com/catalogue/category/books/travel_2/index.html')
    extracted_data = []
    for url in product_urls:
        product_data = extractor.extract_product(url)
        extracted_data.append(product_data)
    extractor.export_json(extracted_data, 'data.json')

And the export_json() function:

import json

def export_json(self, data, file_name):
    with open(file_name, 'w') as f:
        json.dump(data, f)

The end result is a clean JSON data file, something like this:

[
  {
    "title": "It's Only the Himalayas",
    "price": 45.17,
    "stock": "In stock (19 available)"
  },
  {
    "title": "Full Moon over Noah\u2019s Ark: An Odyssey to Mount Ararat and Beyond",
    "price": 49.43,
    "stock": "In stock (15 available)"
  },
  {
    "title": "See America: A Celebration of Our National Parks & Treasured Sites",
    "price": 48.87,
    "stock": "In stock (14 available)"
  },
  {
    "title": "Vagabonding: An Uncommon Guide to the Art of Long-Term World Travel",
    "price": 36.94,
    "stock": "In stock (8 available)"
  },
  {
    "title": "Under the Tuscan Sun",
    "price": 37.33,
    "stock": "In stock (7 available)"
  },
  {
    "title": "A Summer In Europe",
    "price": 44.34,
    "stock": "In stock (7 available)"
  },
  {
    "title": "The Great Railway Bazaar",
    "price": 30.54,
    "stock": "In stock (6 available)"
  },
  {
    "title": "A Year in Provence (Provence #1)",
    "price": 56.88,
    "stock": "In stock (6 available)"
  },
  {
    "title": "The Road to Little Dribbling: Adventures of an American in Britain (Notes From a Small Island #2)",
    "price": 23.21,
    "stock": "In stock (3 available)"
  },
  {
    "title": "Neither Here nor There: Travels in Europe",
    "price": 38.95,
    "stock": "In stock (3 available)"
  },
  {
    "title": "1,000 Places to See Before You Die",
    "price": 26.08,
    "stock": "In stock (1 available)"
  }
]
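With its default arguments, json.dump() writes everything on a single line and escapes non-ASCII characters as \uXXXX sequences. If you would rather have a file that is easy to read by eye, a variant along the following lines may help. The function name export_json_readable, the sample record and the temporary file path are all illustrative assumptions, not part of the script above.

```python
import json
import os
import tempfile


def export_json_readable(data, file_name):
    """Write data as indented UTF-8 JSON, keeping non-ASCII characters as-is."""
    with open(file_name, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)


# Illustrative sample record containing a non-ASCII apostrophe.
sample = [{"title": "Full Moon over Noah\u2019s Ark", "price": 49.43}]
path = os.path.join(tempfile.gettempdir(), "data_sketch.json")
export_json_readable(sample, path)

# Read the file back to confirm that it round-trips.
with open(path, encoding="utf-8") as f:
    raw_text = f.read()
loaded = json.loads(raw_text)
print(raw_text)
```

Opening the file with an explicit encoding matters here, because with ensure_ascii=False the output can contain characters that a platform's default encoding may not be able to write.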

Why do you need to use a smart proxy for web scraping?

There are plenty of pitfalls to negotiate during the course of any web scraping project. One of the biggest challenges comes when you’re trying to extract data at scale.

At Zyte we often talk to clients who’ve successfully extracted data from a hundred webpages a day, or a thousand. Surely, they ask, it must be just as easy getting data from a million pages daily?

In practice, it rarely is. Many websites use ‘anti-bot’ technology to discourage automated scraping, and high request volumes are far more likely to trigger it. There are ways round this, the most effective being the use of smart rotating proxies. This is a technique that effectively lulls a target website into thinking it’s being visited innocuously by a human, rather than an extraction script.

Here’s an illustration of how Zyte’s Smart Proxy Manager can be integrated into a data extraction script to reduce your chances of getting banned.

Remember that we created a make_request() function at the beginning so it handles all the requests in the script? Now if we want to use Smart Proxy Manager, we only need to make a small change in this function. Everything else will work just fine. To integrate Smart Proxy Manager, change this function:

def make_request(self, url):
    return requests.get(url)

to this:

def make_request(self, url):
    zyte_apikey = 'apikey'  # replace with your own API key
    proxy_url = 'proxy.zyte.com:8011'
    return requests.get(
        url,
        proxies={
            "http": "http://{}:@{}".format(zyte_apikey, proxy_url),
        },
    )

In this code, we add the Smart Proxy Manager endpoint as a proxy and authenticate using the Zyte API key, which is passed as the proxy username with an empty password.
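Hard-coding the key is fine for a quick test, but you may prefer to keep it out of your source code. The sketch below builds the same proxies mapping from an environment variable. The variable name ZYTE_APIKEY and the helper build_proxies() are hypothetical choices for illustration, not names defined by Zyte.

```python
import os


def build_proxies(api_key, proxy_url="proxy.zyte.com:8011"):
    """Build the proxies mapping for requests.get(), in the same format as above."""
    return {"http": "http://{}:@{}".format(api_key, proxy_url)}


# ZYTE_APIKEY is a hypothetical environment variable name (an assumption);
# fall back to the placeholder used in the article if it is not set.
api_key = os.environ.get("ZYTE_APIKEY", "apikey")
proxies = build_proxies(api_key)
print(sorted(proxies))
```

Inside make_request() you could then call requests.get(url, proxies=build_proxies(api_key)), and the rest of the script would stay unchanged.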

If you want to learn more about Smart Proxy Manager and how it can help you scale, check out our webinar.

How can Zyte help you extract the web data you want?

At Zyte we’ve spent the best part of a decade focused on extracting the all-important web data that companies need. Our international team of developers and data scientists includes some of the biggest brains in analytics, AI and machine learning. And along the way we’ve developed some powerful tools – several of them protected by international patents – to help our customers achieve their data extraction goals quickly, reliably and cost efficiently.

Written by Sarah Lang