BDA-Find Similar Items

For High dimensional data

  • LSH
  • Clustering
  • Dimensionality Reduction

Finding Similar Items

  • Similarity of documents
    • Plagiarism
    • Mirror pages
    • Articles from the same source
  • Collaborative filtering
    • Online purchase
    • Movie ratings

Distance Measures

Goal: Find near-neighbors in high-dim. space

We formally define “near neighbors” as points that are a “small distance” apart

For each application, we first need to define what “distance” means

 Jaccard 相似度 (Jaccard Similarity)

The Jaccard Similarity measures the similarity between two sets and is defined as the size of the intersection divided by the size of the union of the sets.

Finding Similar Documents

Goal: Given a large number (𝑵 in the millions or billions) of documents, find “near duplicate” pairs

Applications:

  • Mirror websites, or approximate mirrors
    • Don’t want to show both in search results
  • Similar news articles at many news sites

Problems:

  • Many small pieces of one document can appear out of order in another
  • Too many documents to compare all pairs
  • Documents are so large or so many that they cannot fit in main memory

3 Essential Steps for Similar Docs

  1. Shingling: Convert documents to sets
  2. Min-Hashing: Convert large sets to short signatures, while preserving similarity
  3. Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents

Candidate pairs!

Minhash/LSH

Convert large sets to short signatures, while preserving similarity.

Encoding Sets as Bit Vectors

Many similarity problems can be formalized as finding subsets that have significant intersection

  • Encode sets using 0/1 (bit, boolean) vectors
    • One dimension per element in the universal set
  • Interpret set intersection as bitwise AND, and set union as bitwise OR

Example: ;

  • Size of intersection = 3; size of union = 4,
  • Jaccard similarity (not distance) = 3/4
  • Distance: d(C1 ,C2 ) = 1 – (Jaccard similarity) = 1/4

From Sets to Boolean Matrices

  • Rows = elements (shingles)

  • Columns = sets (documents)

    • 1 in row e and column s if and only if e is a member of s
    • Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
  • Typical matrix is sparse!

Each document is a column:

BpGOnL

Example: =0.5

So far:

  • Documents Sets of shingles
  • Represent sets as boolean vectors in a matrix

Next goal: Find similar columns while computing small signatures

  • Similarity of columns == similarity of signatures