
Scraping e-commerce websites & leveraging product data at scale

March 30, 2022

TL;DR

Product data is the most sought-after type of web data for e-commerce companies, with seven common use cases ranging from price intelligence to improving the seller experience.

Introduction

This article gives a very high-level overview of:

  1. Who needs e-commerce product data?
  2. What strategies does e-commerce product data help fuel?
  3. What e-commerce product data is available?
  4. How to get e-commerce product data?

We’ll focus on the perspective of e-commerce companies. This article is your map of this amazing and dynamic world of e-commerce product data.

Background

This article draws on more than a decade of our experience scraping e-commerce websites, and on a recently completed independent body of research that involved interviewing 30+ e-commerce industry representatives.

Who wants e-commerce product data?

Typically it’s organizations that also sell the same or similar products. For simplicity we break these organizations into three groups:

  1. Marketplaces
  2. Retail - e-commerce
  3. Retail - Brick & Mortar

Marketplaces

These organizations are often digital natives (i.e. were born digital) and data indigenous (i.e. data has always been at their core). Marketplaces have the largest teams of data scientists and, at least in the case of C2C marketplaces, are at heart almost entirely data companies.

Because they are not significantly hindered by the inertia of stock and traditional merchandising models, they can move more freely in the world of data. Marketplaces' need for e-commerce product data certainly overlaps with their brethren in retail, but they also have some nuanced data desires that we’ll look at below. 

Retail - e-commerce

These are e-commerce-first retailers. Think B2C: they have the warehouse but not the shopfront. Naturally, these are also digital natives, but because they have often evolved out of traditional retail backgrounds they may not be data indigenous yet, though they are getting there. Interestingly, in the last few years we have seen some of these companies start branching out into Brick & Mortar stores.

Retail - Brick & Mortar

These are traditional, often long-established, retailers that over the past decade have invested heavily in not being left behind. Because these organizations are neither digital natives nor data indigenous, you often see them accelerating their e-commerce journey by purchasing data companies (e.g. Home Depot bought BlackLocus in 2012, and Lowe's bought Boomerang Commerce in 2019).

Alternative Data - Finance

Another group that we see seeking e-commerce product data is financial organizations, which treat it as Alternative Data. That is a story for another day; for today, let’s focus on e-commerce companies.

What strategies does e-commerce product data help fuel?

From the perspective of e-commerce organizations the primary use cases for this data are as follows:

  1. Price intelligence
  2. Competitor intelligence
  3. Market analysis
  4. Vendor Management
  5. Compliance
  6. Improving the sellers’ experience
  7. Internal barriers to data

Below we’ll touch briefly on each of these use cases and the questions they help answer, but we will go into much more depth in separate use case articles.

1. Price intelligence

If your pricing strategy is in any way relative to your competitors (e.g. always match the price, or always be 1 cent cheaper, etc.) then you need to get their prices. 
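As a rough illustration, the "always be 1 cent cheaper" rule could be sketched as follows. The function name `reprice` and the `floor` parameter (a minimum price you are not willing to go below) are assumptions added for this example; they are not part of any particular pricing tool.

```python
from decimal import Decimal


def reprice(competitor_prices, floor, undercut=Decimal("0.01")):
    """Price one `undercut` below the cheapest competitor, never below `floor`.

    `floor` is a hypothetical safeguard (e.g. your cost price) added for this sketch.
    """
    if not competitor_prices:
        raise ValueError("need at least one competitor price")
    target = min(competitor_prices) - undercut
    return max(target, floor)


# Hypothetical competitor prices collected for one product.
prices = [Decimal("19.99"), Decimal("21.50"), Decimal("20.25")]
print(reprice(prices, floor=Decimal("15.00")))  # 19.98
```

`Decimal` is used instead of `float` so that one-cent differences are computed exactly. To match the lowest competitor price rather than undercut it, pass `undercut=Decimal("0")`.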

2. Competitor intelligence

Who is moving in on your market (by geography, category, etc.)? Who is doing well that you could learn from?

3. Market analysis

What products or sellers are trending? What gaps are there in product offerings? Do you have a dominant stock position that you can leverage?

4. Vendor Management

Are your vendors providing you with their full range of products, their most competitive price, and equally high-quality collateral? If you are not sure, you need to get that data.

5. Compliance

This more often applies to brands, where they want to ensure compliance with MAP (minimum advertised pricing), branding guidelines, and marketing copy. Other considerations could also be the discoverability of products via categories or keyword searches.
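A MAP check can be as simple as comparing each advertised price against the brand's minimum for that product. The sketch below assumes listings have already been collected into dictionaries with `seller`, `sku` and `price` keys; that layout, the function name and the sample data are all hypothetical.

```python
from decimal import Decimal


def find_map_violations(listings, map_prices):
    """Return the listings whose advertised price is below the MAP for their SKU.

    SKUs with no MAP on record are skipped rather than flagged.
    """
    violations = []
    for listing in listings:
        minimum = map_prices.get(listing["sku"])
        if minimum is not None and listing["price"] < minimum:
            violations.append(listing)
    return violations


# Hypothetical MAP table and scraped listings.
map_prices = {"KETTLE-1": Decimal("25.00")}
listings = [
    {"seller": "shop-a", "sku": "KETTLE-1", "price": Decimal("24.49")},
    {"seller": "shop-b", "sku": "KETTLE-1", "price": Decimal("25.00")},
    {"seller": "shop-c", "sku": "TOASTER-9", "price": Decimal("9.99")},
]
for violation in find_map_violations(listings, map_prices):
    print(violation["seller"], violation["sku"], violation["price"])  # shop-a KETTLE-1 24.49
```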

6. Improving the sellers’ experience

Make it easier for your sellers by pre-populating your database with products they might sell, so as to speed up the uploading process and ensure better metadata quality. You can also recommend a selling price based on competitor data. Another fountain of information here is buyer reviews: are you getting better reviews than competitors for the same product? If so, what is it you are doing well?

7. Internal barriers to data

One of the most surprising uses is getting around internal barriers to data within an organization. Many clients actually scrape data from their own websites, both to avoid the internal friction around accessing data and to ensure the data better matches the structure of product data from target websites.

What e-commerce product data is available?

There are four main types of e-commerce product data that are collected:

  1. Product details
  2. Product lists
  3. Product reviews
  4. Product seller details

Product details

This is the essence of product data and is usually called the “Product Details Page” (PDP). It’s where you find information like the following:

  • name
  • price
  • availability (e.g. in stock)
  • unique identifier (e.g. SKU, MPN, GTIN)
  • brand
  • description
  • rating
  • physical properties (e.g. color, size, style, dimensions, weight, material type, etc.)
  • seller information
  • variants available
  • delivery information (e.g. costs, timelines)
  • features
  • etc.

You can see a more detailed list at this link.
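To make the field list above concrete, here is one way a product details record might be modelled. This is a minimal sketch and not a standard schema: the class name `ProductRecord`, the choice of which fields are optional, and the separate `currency` field are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field
from decimal import Decimal
from typing import Optional


@dataclass
class ProductRecord:
    """One scraped Product Details Page (hypothetical schema)."""

    name: str
    price: Decimal
    currency: str  # assumption: stored alongside the price
    in_stock: bool
    sku: Optional[str] = None
    gtin: Optional[str] = None
    brand: Optional[str] = None
    description: Optional[str] = None
    rating: Optional[float] = None
    properties: dict = field(default_factory=dict)  # color, size, weight, ...
    variants: list = field(default_factory=list)


# Hypothetical product used only to show the shape of a record.
record = ProductRecord(
    name="Example kettle",
    price=Decimal("24.99"),
    currency="USD",
    in_stock=True,
    brand="ExampleBrand",
    properties={"color": "black"},
)
print(asdict(record)["properties"])  # {'color': 'black'}
```

Most fields default to `None` or an empty container because not every site exposes every field; only the name, price, currency and availability are treated as required here.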

Product lists

Think of this as the product category page, or search result page, where you can see a list (or grid) of products from a handful to hundreds. It includes limited data, but perhaps enough depending on your use case. Common data fields would be:

  • name
  • price
  • page
  • position
  • rating (e.g. stars)
  • delivery information

Product reviews

Though the top, most recent, or most popular reviews may be on the “Product Details Page” (PDP), the full list is typically elsewhere and contains much more information. The available fields are few but can offer rich pickings.

  • rating
  • date
  • review
  • reviewer
  • Was this helpful Y/N

Product seller details

These pages are not as commonly targeted, but they could be where you find your next power seller or identify what’s becoming hot. There is huge variation here in terms of the availability of these pages, the types of data fields, sub-pages, etc.

How to get e-commerce product data?

At a simple level, a computer replicates a website visitor; it goes to a web page and collects just the data you want. It’s relatively easy, at least until you’re scraping e-commerce websites for 20,000 products per hour, every day.

If we look at the process of obtaining this data there are several steps and, depending on a number of factors, these may be completed in-house or outsourced.

  1. Identify the pages that you need to collect data from
  2. Visit these pages and collect the data
  3. Extract the data into a format you can process (e.g. CSV, JSON, etc.)
  4. Process the data (e.g. cleaning, deduping, matching, etc.)
  5. Visualize the data via BI tools (though not always)
  6. Connect the data into your business systems

Naturally the above is simplified and overlooks a lot of complexities, especially how hard it can be to get the raw data in the first place. Many popular e-commerce sites have advanced systems to ensure their website is always accessible, and these systems can sometimes mistakenly interfere with legitimate and ethical web scraping. This is yet one more reason to ensure that scraping e-commerce websites is always done in a compliant and sustainable way that in no way negatively impacts target sites.
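As a small illustration of steps 3 and 4 in the list above, the sketch below cleans a few raw rows, dedupes them by SKU and writes them out as CSV. It assumes the pages have already been fetched and parsed into dictionaries; the field names, the "keep the last-seen row" dedupe rule and the sample rows are assumptions chosen for the example.

```python
import csv
import io


def clean(row):
    """Normalise one raw scraped row (step 4: cleaning)."""
    return {
        "sku": row["sku"].strip().upper(),
        "name": " ".join(row["name"].split()),  # collapse repeated whitespace
        "price": float(row["price"].replace("$", "").replace(",", "")),
    }


def dedupe(rows):
    """Keep the last-seen row for each SKU (step 4: deduping)."""
    by_sku = {}
    for row in rows:
        by_sku[row["sku"]] = row
    return list(by_sku.values())


def to_csv(rows):
    """Serialise rows to CSV text (step 3: a format you can process)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["sku", "name", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical raw rows, as they might come out of a parser.
raw = [
    {"sku": " ab-1 ", "name": "Example   kettle", "price": "$24.99"},
    {"sku": "AB-1", "name": "Example kettle", "price": "$23.99"},
    {"sku": "cd-2", "name": "Example toaster", "price": "1,049.00"},
]
products = dedupe(clean(row) for row in raw)
print(to_csv(products))
```

Real pipelines also need product matching across sites, currency handling and validation, which this sketch leaves out.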

Conclusion

The above points are of course just an overview of what in fact is an established and growing industry. Ultimately the goal of scraping e-commerce websites is to sell more products, and this is the success metric that you should always try to measure yourself against.

There are many shortcuts and trap doors when it comes to the world of leveraging product data and scraping e-commerce websites. Trust us, we have been collecting e-commerce data for hundreds of marketplaces and retailers for years. If you have an e-commerce project in the works, tell us more about it. We have teams of data scientists and project managers with years of experience who can give you a fast assessment of your best course of action.

Written by James Kehoe
Senior Product Manager at Zyte. Passionate about Product Management and value creation from web data at scale.