ML-Distance Measure

Measuring the distance of two clusters

A few ways to measure the distance of two clusters.

Results in different variations of the algorithm.

  • Single link
  • Complete link
  • Average link
  • Centroids
  • Definition: Distance between two clusters is defined as the shortest distance between any two points in the clusters.
  • Formula:
  • Characteristics:
    • Forms "chain-like" clusters, suitable for finding non-convex shapes.
    • Disadvantage: Sensitive to noise and outliers.
  • Use Case: Suitable for datasets where the clusters are non-convex.

  • Definition: Distance between two clusters is defined as the longest distance between any two points in the clusters.
  • Formula:
  • Characteristics:
    • Creates compact clusters with limited spread.
    • Disadvantage: May over-split clusters; not suitable for complex distributions.
  • Use Case: Preferred when tight clusters are required.

  • Definition: Distance between two clusters is defined as the average of all pairwise distances between points in the two clusters.
  • Formula:
  • Characteristics:
    • Balances single link and complete link approaches.
    • Robust to noise compared to single link but less than complete link.
  • Use Case: Suitable for evenly distributed data and balanced clustering.

Centroids

  • Definition: Distance between two clusters is defined as the distance between their centroids (average points).
  • Formula:
    Where and are the centroids of clusters and .
  • Characteristics:
    • Simple to compute but may cause "reversal" (merged clusters may separate due to centroid movement).
    • Disadvantage: Not suitable for complex shapes.
  • Use Case: Effective for spherical or isotropic clusters.

Comparison Table

Method Advantage Disadvantage Suitable Use Case
Single Link Detects non-convex clusters Sensitive to noise and outliers Chain-like, non-convex clusters
Complete Link Creates tight clusters Struggles with complex distributions Compact, tight clustering
Average Link Balances between single and complete Higher computational cost Balanced clustering
Centroids Computationally efficient Not suitable for irregular clusters Spherical, fast computation

Time complexity

All the algorithms are at least . is the number of data points.

  • Single link can be done in .
  • Complete and average links can be done in .
  • Due the complexity, hard to use for large data sets.
    • Sampling
    • Scale-up methods (e.g., BIRCH).