date: 2024-12-24
title: ML-Distance Measure
status: DONE
author:
  - AllenYGY
tags:
  - NOTE
  - DistanceMeasure
  - Clustering
  - ML
publish: True

ML-Distance Measure

Measuring the distance of two clusters

A few ways to measure the distance of two clusters.

Results in different variations of the algorithm.

Single Link

Definition: Distance between two clusters is defined as the shortest distance between any two points in the clusters.
Formula:
Characteristics:
- Forms "chain-like" clusters, suitable for finding non-convex shapes.
- Disadvantage: Sensitive to noise and outliers.
Use Case: Suitable for datasets where the clusters are non-convex.

Definition: Distance between two clusters is defined as the longest distance between any two points in the clusters.
Formula:
Characteristics:
- Creates compact clusters with limited spread.
- Disadvantage: May over-split clusters; not suitable for complex distributions.
Use Case: Preferred when tight clusters are required.

Definition: Distance between two clusters is defined as the average of all pairwise distances between points in the two clusters.
Formula:
Characteristics:
- Balances single link and complete link approaches.
- Robust to noise compared to single link but less than complete link.
Use Case: Suitable for evenly distributed data and balanced clustering.

Definition: Distance between two clusters is defined as the distance between their centroids (average points).
Formula:
Where and are the centroids of clusters and .
Characteristics:
- Simple to compute but may cause "reversal" (merged clusters may separate due to centroid movement).
- Disadvantage: Not suitable for complex shapes.
Use Case: Effective for spherical or isotropic clusters.

Method	Advantage	Disadvantage	Suitable Use Case
Single Link	Detects non-convex clusters	Sensitive to noise and outliers	Chain-like, non-convex clusters
Complete Link	Creates tight clusters	Struggles with complex distributions	Compact, tight clustering
Average Link	Balances between single and complete	Higher computational cost	Balanced clustering
Centroids	Computationally efficient	Not suitable for irregular clusters	Spherical, fast computation

All the algorithms are at least . is the number of data points.

Single link can be done in .
Complete and average links can be done in .
Due the complexity, hard to use for large data sets.
- Sampling
- Scale-up methods (e.g., BIRCH).