In pursuit of perfection: measuring web product data quality

We put Zyte’s own Automatic Extraction API head-to-head with a commercial rival - and an open-source alternative - to find out who’s product extraction top dog.

From price intelligence to making investment decisions or building data-driven products, we often need to extract product data from multiple websites. This typically means writing custom code for each website, which becomes costly and time-consuming to develop and maintain as the number of websites grows.

Wouldn’t it be great if we could use AI instead? Can we even use it some of the time?

Fully automated extraction works great for articles. However, product data extraction is more complex than article extraction, and we were all eager to see data on how well our solution performed. But as Konstantin Lopuhin, our chief data scientist, soon discovered, even evaluation is more challenging for product data.

The goal was to stress-test our own AI-powered Automatic Extraction API against a well-known commercial tool and an open-source baseline. For the baseline, we chose a relatively crude wrapper around extruct, a widely used open-source tool that extracts embedded metadata from HTML markup. To give our own Automatic Extraction API a serious workout, we pitted it against Diffbot, another commercial offering that had already set a high bar for extraction quality. We fed each of the three a carefully curated set of real-world product page URLs to find out which solution yielded the best results when extracting product price, availability, and SKU (Stock Keeping Unit) information.
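To give a feel for what "extracting embedded metadata" means in practice, here is a minimal sketch using only the Python standard library. It is not the extruct wrapper used in the experiment: it assumes the page embeds a schema.org Product as JSON-LD, and the names JsonLdCollector, extract_product, and SAMPLE_HTML are hypothetical, chosen purely for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # ignore malformed metadata
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_product(html):
    """Return price, availability and SKU from the first JSON-LD Product found."""
    collector = JsonLdCollector()
    collector.feed(html)
    for item in collector.items:
        if isinstance(item, dict) and item.get("@type") == "Product":
            offers = item.get("offers") or {}
            if isinstance(offers, list):
                offers = offers[0] if offers else {}
            if not isinstance(offers, dict):
                offers = {}
            return {
                "price": offers.get("price"),
                "availability": offers.get("availability"),
                "sku": item.get("sku"),
            }
    # No usable markup: every attribute is missing.
    return {"price": None, "availability": None, "sku": None}


# Made-up sample page, for illustration only.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Desk lamp", "sku": "LAMP-001",
 "offers": {"@type": "Offer", "price": "19.99",
            "availability": "https://schema.org/InStock"}}
</script>
</head><body></body></html>
"""

result = extract_product(SAMPLE_HTML)
print(result)
```

A baseline of this kind only works when a site publishes clean markup, and it returns nothing when the metadata is missing or malformed, which is part of what makes it a useful lower bound for comparison.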

Let’s play fairly

It’s time to address the elephant in the room. To make the experiment as fair as possible we took pains to minimize factors that could undermine the credibility of our test and the results.

Rather than cherry-picking web domains to evaluate, we asked two extraction experts outside our data science team to propose an unbiased set of popular consumer product domains. Their selections ranged from big marketplaces like Amazon, eBay, and Alibaba to mono-brand sites including Ikea and John Lewis. To make things tougher we threw in some sites from more obscure brands and vendors in a variety of languages. From these domains, we selected a broad spectrum of URLs, including front page products, more deeply hidden items, and discounted and out-of-stock products.

We also took other precautions, like taking a ‘snapshot’ of our chosen target URLs and feeding them into each extraction engine. That way we could be sure that page content hadn’t altered in any way in the short intervals between each test run, and was always the same regardless of the download location.

And the winner is…

Using the F1 score, which combines precision and recall, as the measure, we found that the product extraction quality of Zyte’s Automatic Extraction was significantly better than Diffbot’s for the price and SKU attributes. For availability, the two solutions produced comparable results. Both Diffbot and Zyte’s Automatic Extraction were far better than the extruct baseline.
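For readers less familiar with the F1 score, the sketch below shows one way it could be computed for a single attribute such as price. The counting rules here are an assumption made for illustration (a correct value is a true positive, a wrong or spurious value is a false positive, and a missed or wrong value is a false negative), and the function names and sample values are hypothetical. The project's released evaluation code is the authority on how the published numbers were actually produced.

```python
def count_outcomes(ground_truth, predictions):
    """Count true positives, false positives and false negatives for one attribute.

    Both arguments are lists with one entry per page; None means "no value".
    The matching rules are an illustrative assumption, not the project's own.
    """
    tp = fp = fn = 0
    for truth, predicted in zip(ground_truth, predictions):
        if predicted is not None and predicted == truth:
            tp += 1
        else:
            if predicted is not None:
                fp += 1  # extracted a value that is wrong or should not exist
            if truth is not None:
                fn += 1  # failed to extract the correct value
    return tp, fp, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when nothing was correct."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up values for four pages, for illustration only.
true_prices = ["9.99", "25.00", None, "4.50"]
predicted_prices = ["9.99", "24.00", "1.00", None]

tp, fp, fn = count_outcomes(true_prices, predicted_prices)
price_f1 = f1_score(tp, fp, fn)
print(tp, fp, fn, round(price_f1, 3))
```

Because F1 is a harmonic mean, a tool cannot score well by being cautious and extracting very little (high precision, low recall) or by guessing a value on every page (high recall, low precision); it has to do both well.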

Don’t just take our word for it: we’ve open-sourced the whole project to show there is nothing to hide. We’ve released the entire dataset for the experiment, including web archive files, test methodology, screenshots of chosen pages, ground truth annotations, evaluation code, and baseline open-source extraction code.

Rising to the product data challenge

We were delighted – and just a bit relieved – to discover that our own Zyte-powered extraction solution won the day against its commercial and open-source rivals. Having already conducted a similar experiment with the easier task of article extraction, we were hoping we could get similar results with product extraction... but that didn’t stop a few butterflies on the big day!

Nothing stands still in the world of the web. Product page design is evolving constantly, making the accurate parsing and interpretation of an HTML page a moving target for our data science team. Recent trends include the increasing use of JavaScript and the popularity of ‘infinite’ pages that continuously render new content as you scroll downwards.

At Zyte it’s our business to keep a close eye on these trends, and we’re continuously improving our Automatic Extraction API so that we can deliver the best solution to our customers. You’re more than welcome to get in touch with your own data extraction challenges. We love tough problems - almost as much as we love solving them for our web scraping customers.

Next steps

If you’re interested in hearing more or have questions, please watch our on-demand webinar to hear from Konstantin about how he undertook the whole evaluation process, what problems he faced, and his conclusions and suggestions.

You can also try our Automatic Extraction API for free and see how you get on.