date: 2024-11-08
title: "BDA-Find Similar Items"
status: DONE
author:
  - AllenYGY
tags:
  - NOTE
  - BDA
publish: True

BDA-Find Similar Items

For High dimensional data

LSH
Clustering
Dimensionality Reduction

Finding Similar Items

Similarity of documents
- Plagiarism
- Mirror pages
- Articles from the same source
Collaborative filtering
- Online purchase
- Movie ratings

Distance Measures

Goal: Find near-neighbors in high-dim. space

We formally define “near neighbors” as points that are a “small distance” apart

For each application, we first need to define what “distance” means

Jaccard 相似度 (Jaccard Similarity)

The Jaccard Similarity measures the similarity between two sets and is defined as the size of the intersection divided by the size of the union of the sets.

Finding Similar Documents

Goal: Given a large number (𝑵 in the millions or billions) of documents, find “near duplicate” pairs

Applications:

Mirror websites, or approximate mirrors
- Don’t want to show both in search results
Similar news articles at many news sites

Problems:

Many small pieces of one document can appear out of order in another
Too many documents to compare all pairs
Documents are so large or so many that they cannot fit in main memory

3 Essential Steps for Similar Docs

Shingling: Convert documents to sets
Min-Hashing: Convert large sets to short signatures, while preserving similarity
Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents

Candidate pairs!

Minhash/LSH

Convert large sets to short signatures, while preserving similarity.

Encoding Sets as Bit Vectors

Many similarity problems can be formalized as finding subsets that have significant intersection

Encode sets using 0/1 (bit, boolean) vectors
- One dimension per element in the universal set
Interpret set intersection as bitwise AND, and set union as bitwise OR

Example: ;

Size of intersection = 3; size of union = 4,
Jaccard similarity (not distance) = 3/4
Distance: d(C1 ,C2 ) = 1 – (Jaccard similarity) = 1/4

From Sets to Boolean Matrices

Rows = elements (shingles)
Columns = sets (documents)
- 1 in row e and column s if and only if e is a member of s
- Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
Typical matrix is sparse!

Each document is a column:

BpGOnL

Example: =0.5

So far:

Documents Sets of shingles
Represent sets as boolean vectors in a matrix

Next goal: Find similar columns while computing small signatures

Similarity of columns == similarity of signatures