date: 2024-11-08
title: "BDA-Find Similar Items"
status: DONE
author:
- AllenYGY
tags:
- NOTE
- BDA
publish: True
BDA-Find Similar Items
For high-dimensional data
Goal: Find near neighbors in high-dimensional space
We formally define “near neighbors” as points that are a “small distance” apart
For each application, we first need to define what “distance” means
Jaccard Similarity
The Jaccard Similarity measures the similarity between two sets and is defined as the size of the intersection divided by the size of the union of the sets.
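The definition above can be sketched in a few lines of Python. The function name `jaccard_similarity` and the example sets `c1` and `c2` are illustrative, and treating two empty sets as identical is an assumption, since the definition leaves that case open.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b| for two sets."""
    union = a | b
    if not union:
        # Assumption: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


c1 = {"a", "b", "c", "d"}
c2 = {"c", "d", "e", "f", "g"}

sim = jaccard_similarity(c1, c2)  # intersection has 2 elements, union has 7
distance = 1.0 - sim              # the corresponding Jaccard distance
```

Defining distance as one minus similarity gives the "small distance" notion used for near neighbors when the items are sets.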
Goal: Given a large number (𝑵 in the millions or billions) of documents, find “near duplicate” pairs
Applications:
Problems:
3 Essential Steps for Finding Similar Documents
Candidate pairs!
Convert large sets to short signatures, while preserving similarity.
Many similarity problems can be formalized as finding subsets that have significant intersection
Example:
Rows = elements (shingles)
Columns = sets (documents)
Typical matrix is sparse!
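As a rough sketch of this matrix view, the snippet below builds a small 0/1 matrix whose rows are shingles and whose columns are documents, then computes the Jaccard similarity of two columns. Character-level shingles of length `k`, the helper names, and the two toy documents are all assumptions made for illustration. A real system would store only the non-zero entries because the matrix is sparse.

```python
def shingles(text: str, k: int = 2) -> set:
    """Return the set of character k-shingles of text (assumed shingle type)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def characteristic_matrix(docs: list, k: int = 2):
    """Rows are shingles in sorted order, columns are documents."""
    doc_sets = [shingles(d, k) for d in docs]
    rows = sorted(set().union(*doc_sets))
    matrix = [[1 if s in ds else 0 for ds in doc_sets] for s in rows]
    return rows, matrix


def column_similarity(matrix: list, i: int, j: int) -> float:
    """Jaccard similarity of columns i and j: rows with 1 in both / rows with 1 in either."""
    both = sum(1 for row in matrix if row[i] and row[j])
    either = sum(1 for row in matrix if row[i] or row[j])
    return both / either if either else 1.0


docs = ["abcd", "bcde"]
rows, matrix = characteristic_matrix(docs, k=2)
similarity = column_similarity(matrix, 0, 1)
```

Here the rows are `ab`, `bc`, `cd`, `de`; the two columns share `bc` and `cd`, so two of the four rows have a 1 in both columns.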
Each document is a column:
Example:
So far:
Next goal: Find similar columns while computing small signatures
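The notes stop before naming a technique for this step. One common choice is MinHash, sketched below as an assumption about where the notes are heading rather than as the author's method. Each column is represented by the set of row indices where it has a 1, and random linear hash functions `(a * r + b) % PRIME` stand in for random row permutations. The names `PRIME`, `make_hash_params`, `minhash_signature`, and `signature_similarity` are hypothetical, and columns are assumed to be non-empty.

```python
import random

PRIME = 2_147_483_647  # a prime larger than any row index used here


def make_hash_params(n: int, seed: int = 0) -> list:
    """Draw n (a, b) pairs for hash functions h(r) = (a * r + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(rows_with_one: set, params: list) -> list:
    """For each hash function, keep the smallest hash over the column's rows."""
    return [min((a * r + b) % PRIME for r in rows_with_one) for a, b in params]


def signature_similarity(s1: list, s2: list) -> float:
    """Fraction of signature positions on which the two columns agree."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)


col1 = set(range(0, 80))    # rows 0..79 contain a 1
col2 = set(range(20, 100))  # rows 20..99 contain a 1; true Jaccard is 60/100

params = make_hash_params(200)
sig1 = minhash_signature(col1, params)
sig2 = minhash_signature(col2, params)
estimate = signature_similarity(sig1, sig2)
```

Each signature has 200 integers regardless of how many rows the column has, and the fraction of agreeing positions serves as an estimate of the columns' Jaccard similarity. The linear hashes only approximate true random permutations, so the estimate is noisy.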