Finding Similar Items

This post describes an approach to the problem of finding similar items among crawled items and how this was implemented at Zyte.

Shane Evans

6 min read · July 23, 2012

Near duplicate content is everywhere on the web and needs to be considered in any web crawling project.

Web pages might differ only in a small portion of their content, such as advertising, timestamps, or counters. Crawlers can exploit this to improve quality and performance, and there are efficient ways to detect these near-duplicate web pages[1].

However, there are times when you need to identify similar items in the extracted data. This could be unwanted duplication, or we may want to find the same product, artist, business, holiday package, book, etc. from different sources.

As an example, let’s say we’re interested in compiling information on tourist attractions in Ireland and would like to output each attraction, with links to various websites where more information can be found.  Consider the following examples of records that could be crawled:

| Name | Summary | Location |
|---|---|---|
| Saint Fin Barre's Cathedral | Begun in 1863, the cathedral was the first major work of the Victorian architect William Burges... | 51.8944, -8.48064 |
| St. Finbarr’s Cathedral Cork | Designed by William Burges and consecrated in 1870, .. | 51.894401550293, -8.48064041137695 |

Although they are obviously the same place, there’s not much textual similarity between these two records. “Cathedral” is the only word in common in the name, and we have a lot of those in Ireland!

It turns out there are other common spellings of St. Fin Barre's, many websites list it multiple times (not realizing they have duplicates) and not all websites have the location listed.

Finding Similar Text

A common way to implement this is to first produce a set of tokens (could be hashes, words, shingles, sketches, etc.) from each item, measure the similarity between each pair of sets and if it’s above a threshold then the items are said to be near duplicates.

Consider the name fields in the example above. If we split each into words, then we can say that they have one word in common out of a total of 7 unique words - a similarity of 14% [2].
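The calculation above can be reproduced in a few lines. This is a minimal sketch, not the library described in this post; the helper names `tokenize` and `jaccard` are made up for illustration, and tokens are simply lowercase whitespace-separated words.

```python
def tokenize(text):
    """Split a string into a set of lowercase word tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


name_a = tokenize("Saint Fin Barre's Cathedral")
name_b = tokenize("St. Finbarr's Cathedral Cork")
similarity = jaccard(name_a, name_b)

shared = len(name_a & name_b)
unique = len(name_a | name_b)
print(f"{shared} shared of {unique} unique words: {similarity:.0%}")
```

Only "cathedral" appears in both sets, so the score is 1/7.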

Unfortunately, comparing all pairs of items is only feasible when we have very few items[3]. It may work (eventually) for tourist attractions in Ireland, but we need to use this on hundreds of millions of items. Instead of comparing all pairs, we restrict it to only pairs with at least one token in common (e.g. by using an inverted index). The performance of this approach depends mainly on how many “candidate pairs” are generated.
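To illustrate the inverted-index idea, here is a small in-memory sketch. It assumes each item has already been reduced to a set of tokens; the function name `candidate_pairs` and the sample items are hypothetical, and a system handling hundreds of millions of items would distribute this work rather than hold it in one dictionary.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(items):
    """Return pairs of item ids that share at least one token.

    `items` maps an item id to its set of tokens.
    """
    index = defaultdict(set)  # token -> ids of the items containing it
    for item_id, tokens in items.items():
        for token in tokens:
            index[token].add(item_id)

    pairs = set()
    for ids in index.values():
        # Only items listed under the same token are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


items = {
    "a": {"saint", "fin", "barre's", "cathedral"},
    "b": {"st.", "finbarr's", "cathedral", "cork"},
    "c": {"blarney", "castle"},
}
pairs = candidate_pairs(items)
print(pairs)
```

Of the three possible pairs, only ("a", "b") becomes a candidate, because item "c" shares no token with the others. A very common token would put many items in the same index entry, which is why the number of candidate pairs depends so heavily on the choice of tokens.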

The quality of the similarity function can be improved by generating better tokens. Firstly, we could make a database of common synonyms, and recognize that “St.” is an abbreviation for “Saint” (and for “Street”, so be careful!). Now we have 2 in common out of 6 unique words - up to 33%! In addition to synonyms, it’s a good idea to remove markup, ignore case, remove stop words (this reduces false positives and the number of candidate pairs), and stem words. There are other ways to generate tokens instead of using words, such as taking into account position or adjacent words, using the characters that make up the words, extracting “entities”, etc.
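A sketch of the synonym step, assuming a tiny hand-written synonym table (`SYNONYMS` is hypothetical, and a real table would be much larger and would need to handle the "Saint" versus "Street" ambiguity). It also ignores case and punctuation, but leaves out stemming and markup removal.

```python
import re

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"st": "saint"}


def normalize_tokens(text):
    """Lowercase the text, strip punctuation, and map known synonyms."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = set()
    for word in words:
        word = word.strip("'")
        if word:
            tokens.add(SYNONYMS.get(word, word))
    return tokens


name_a = normalize_tokens("Saint Fin Barre's Cathedral")
name_b = normalize_tokens("St. Finbarr's Cathedral Cork")
common = name_a & name_b
score = len(common) / len(name_a | name_b)
print(sorted(common), f"{score:.0%}")
```

After normalization both names contain "saint" and "cathedral", giving 2 shared tokens out of 6.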

For a more detailed description, please see the excellent book Introduction to Information Retrieval. Section 2.2 covers tokenization and linguistic processing, and Section 19.6 covers near-duplicate detection.

Extending Similarity to Items

Including more fields in the similarity calculation will make it more accurate. Returning to our example, there are many cathedrals named after saints, but the location can narrow it down. Additionally, the description can help disambiguate from other crawled items (e.g. hotels near St. Fin Barre’s Cathedral).

The same techniques as described above can be used to make tokens for the description. We just have to be aware that longer text generates more candidate matches and is more likely to be similar to other random text, so taking multiple words together as the tokens is a good idea. For example, instead of:

    begun, in, 1863, the, cathedral, ...

We take each 3 adjacent words:

    begun in 1863, in 1863 the, 1863 the cathedral, ...

This technique is called Shingling and is commonly used when calculating similarity.
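A minimal word-shingling sketch follows. The function name `shingles` and the default size of 3 are choices made for this example; punctuation handling is left out to keep it short.

```python
def shingles(text, size=3):
    """Return the set of `size`-word shingles found in `text`."""
    words = text.lower().split()
    if len(words) < size:
        # Text shorter than one shingle: keep it as a single token.
        return {" ".join(words)} if words else set()
    return {
        " ".join(words[i:i + size])
        for i in range(len(words) - size + 1)
    }


summary = "Begun in 1863 the cathedral was the first major work"
summary_shingles = shingles(summary)
for shingle in sorted(summary_shingles):
    print(shingle)
```

A text of n words yields at most n - 2 three-word shingles, and each one is far less likely to occur in unrelated text than any single word.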

The location needs to be treated differently. The two examples given above happen to share many digits, but that won’t always be the case. We use geohash to convert the coordinates into a bucket and use both the bucket and its neighbors as the tokens. For our first example record, we generate the following buckets:
[Figure: example geohash for the location of St. Fin Barre's Cathedral, generated from David Troy's Geohash Demonstrator]

Locations that are close will share some buckets.
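The bucket-plus-neighbors idea can be sketched with a small geohash encoder. This is an illustration under assumptions: the precision of 6 characters is an arbitrary choice, the function names are hypothetical, and neighbors are found by re-encoding the point shifted by one cell in each direction, which may differ from how a dedicated geohash library does it.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, value, bit_count = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = value * 2 + 1
            rng[0] = mid
        else:
            value = value * 2
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            value, bit_count = 0, 0
    return "".join(chars)


def geohash_buckets(lat, lon, precision=6):
    """Return the cell containing the point plus its eight neighbours."""
    lon_bits = (5 * precision + 1) // 2
    lat_bits = (5 * precision) // 2
    cell_w = 360.0 / 2 ** lon_bits
    cell_h = 180.0 / 2 ** lat_bits
    buckets = set()
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nlat = lat + dy * cell_h
            if not -90.0 <= nlat <= 90.0:
                continue  # no cell beyond the poles
            # Wrap longitude at the antimeridian.
            nlon = (lon + dx * cell_w + 180.0) % 360.0 - 180.0
            buckets.add(geohash_encode(nlat, nlon, precision))
    return buckets


a = geohash_buckets(51.8944, -8.48064)
b = geohash_buckets(51.894401550293, -8.48064041137695)
print(len(a), "buckets, shared with second record:", len(a & b))
```

Because each record contributes its own cell and the surrounding cells, two points that fall just either side of a cell boundary still end up with tokens in common.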

Counting tokens in common across multiple fields gives poor results; the field with the most tokens (description in our example) will completely dominate the calculation. Instead, we calculate the similarity between each field in common (name, description, location) individually and combine these similarities into a total similarity for the pair of items. When combining scores, we can give more weight to more important fields.
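One way to combine per-field scores is a weighted average, sketched below. The weights in `FIELD_WEIGHTS` are invented for the example, the bucket ids are placeholders, and the use of Jaccard for every field is an assumption; the post does not say which combination function was used in practice.

```python
def jaccard(a, b):
    """Jaccard index of two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


# Hypothetical weights: more important fields get more weight.
FIELD_WEIGHTS = {"name": 0.5, "description": 0.2, "location": 0.3}


def item_similarity(item_a, item_b, weights=FIELD_WEIGHTS):
    """Weighted average of per-field scores over the fields both items have."""
    total, weight_sum = 0.0, 0.0
    for field, weight in weights.items():
        if field in item_a and field in item_b:
            total += weight * jaccard(item_a[field], item_b[field])
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


record_a = {
    "name": {"saint", "fin", "barre's", "cathedral"},
    "location": {"bucket-1", "bucket-2"},
}
record_b = {
    "name": {"saint", "finbarr's", "cathedral", "cork"},
    "location": {"bucket-1", "bucket-2"},
    "description": {"designed by william", "by william burges"},
}
score = item_similarity(record_a, record_b)
print(f"{score:.2f}")
```

Only the fields present in both records contribute, so a record without a description is not penalized for the missing field, and a long description cannot swamp the name and location scores.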

Super Tokens[4]

With a high enough similarity threshold, we can prove that some similarity in multiple fields is necessary for the final score to be above the threshold. Therefore, we can generate tokens that combine the token data of more than one field into “super tokens”. Although more tokens are generated, each is much rarer, which reduces the number of candidate pairs and improves performance significantly.

We observed that requiring a match on multiple fields can improve the quality and we can use super tokens to encode these rules. In our example, we could add a rule saying that we must match on either (name, description) or (name, location) and generate super tokens for these field combinations. For (name, description) with shingling of description, the first example record would have the following tokens:

    st. begun in 1863, st. in 1863 the, st. 1863 the cathedral, ...
    fin begun in 1863, fin in 1863 the, fin 1863 the cathedral, ...
    barre begun in 1863, barre in 1863 the, barre 1863 the cathedral, ...
    ...
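The token listing above is the cross product of the name tokens and the description shingles. A short sketch, with the hypothetical helper name `super_tokens` and the tokens taken as given:

```python
from itertools import product


def super_tokens(name_tokens, description_shingles):
    """Pair every name token with every description shingle."""
    return {
        f"{name} {shingle}"
        for name, shingle in product(name_tokens, description_shingles)
    }


name_tokens = ["st.", "fin", "barre"]
description_shingles = ["begun in 1863", "in 1863 the", "1863 the cathedral"]
tokens = super_tokens(name_tokens, description_shingles)
print(len(tokens), "super tokens")
```

Two items become a candidate pair on this rule only if they share a name token and a description shingle at the same time, which is much rarer than sharing either one alone.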

Producing the Same Item

Once we have found pairs of similar items, the next step is to merge these into a single item. In our case, we want to find all St. Fin Barre's Cathedrals in our data and output a single record with links.

The first thing to notice is that similarity is not transitive - if A is similar to B and B is similar to C, our function may not find A similar to C. So, we connect these into clusters and generate our output from the connected-up items.
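Connecting similar pairs into clusters amounts to finding connected components. The sketch below uses a union-find structure, which is one common way to do this and is an assumption here, not a description of the MapReduce implementation; the function name `cluster` is hypothetical.

```python
def cluster(pairs):
    """Group items into clusters by following similar-pair links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())


similar_pairs = [("A", "B"), ("B", "C"), ("D", "E")]
clusters = cluster(similar_pairs)
print(clusters)
```

A and C land in the same cluster through B even though the pair (A, C) was never found similar directly.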

In practice, these clusters can sometimes get too large as a weak similarity between clusters can cause them to be merged, resulting in large clusters of unrelated items. We found that algorithms designed for detecting communities in social networks are able to efficiently generate good clusters. The following example is from a large crawl, where a single item has some similarity with many different items, causing their respective clusters to get connected:

[Figure: A large cluster containing different items]

[Figure: Unwanted links are identified]

[Figure: The cluster is split into many smaller clusters]

It's like someone who friends everyone on Facebook!
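The post does not name the community-detection algorithm that was used. As a simple stand-in for the idea, the sketch below drops links whose two endpoints share few neighbors (a "neighborhood overlap" heuristic from social network analysis) and then recomputes the clusters. The function name, the `min_overlap` threshold, and the sample graph are all hypothetical, and this heuristic also detaches items that have only a single link into a larger cluster.

```python
from collections import defaultdict


def split_weak_links(edges, min_overlap=0.25):
    """Drop links whose endpoints share few neighbours, then re-cluster."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    def overlap(a, b):
        shared = neighbours[a] & neighbours[b]
        either = (neighbours[a] | neighbours[b]) - {a, b}
        return len(shared) / len(either) if either else 1.0

    kept = defaultdict(set)
    for a, b in edges:
        if overlap(a, b) >= min_overlap:
            kept[a].add(b)
            kept[b].add(a)

    # Connected components over the links that were kept.
    seen, clusters = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(kept[node] - seen)
        clusters.append(component)
    return clusters


# Two tight clusters joined only through the item "hub".
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("hub", "a"), ("hub", "d"),
]
clusters = split_weak_links(edges)
print(clusters)
```

The links from "hub" are removed because "hub" has no neighbors in common with "a" or "d", so the two triangles come out as separate clusters.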

The final step is to output a record for each cluster. In our example, we would output the attraction name, and a link to each page we found it on.

Summary

We have described a system for finding similar items in scraped data. We have implemented it as a library based on MapReduce, which has been in use for over a year and has proven successful on many scraping projects.

If you are interested in using this on your crawling project, please contact our Professional Services team.


  1. We implemented a version of Charikar's simhash, as described in the WWW 2007 paper "Detecting Near-Duplicates for Web Crawling" by Gurmeet Manku, Arvind Jain, and Anish Sarma. The performance of this algorithm is excellent.
  2. This method of calculating similarity is called the Jaccard Index.
  3. The running time is proportional to the square of the number of items. If we have 10x the number of items, it takes 100 times longer to process.
  4. Super Tokens are tokens of tokens. This is analogous to the Super Shingles proposed by Broder et al.
