date: 2024-12-24
title: ML-Distance Measure
status: DONE
author:
- AllenYGY
tags:
- NOTE
publish: True
ML-Distance Measure
Distance measures are key to clustering. "Similarity" and "dissimilarity" are also commonly used terms.
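There are numerous distance functions for different types of data (for example numeric, binary, and nominal attributes, and text documents).
We denote the distance between two data points $x_i$ and $x_j$ with $dist(x_i, x_j)$.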
The most commonly used functions for numeric attributes are Euclidean distance and Manhattan (city block) distance.
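As a minimal sketch of the two numeric distances named above, the following assumes each data point is a plain sequence of numbers of equal length. The function names `euclidean` and `manhattan` and the sample points are illustrative choices, not part of the original notes.

```python
import math


def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """City-block distance: sum of the absolute differences per attribute."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical points chosen only for illustration.
p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

Manhattan distance is never smaller than Euclidean distance for the same pair of points, which is visible in the example.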
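Simple Matching Coefficient (SMC)
Definition: Measures the proportion of matching elements (both `1`s and `0`s) between two binary vectors.
Formula:
$$SMC = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
- $f_{11}$: number of matches (both `1`).
- $f_{00}$: number of matches (both `0`).
- $f_{10}$: number of mismatches (vector 1 is `1`, vector 2 is `0`).
- $f_{01}$: number of mismatches (vector 1 is `0`, vector 2 is `1`).
Range: $[0, 1]$, where `1` indicates perfect matching.
Use Case: Suitable when both `1`s and `0`s carry equal importance.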
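The simple matching coefficient can be sketched in a few lines, assuming the binary vectors are equal-length sequences of `0`/`1` integers. The name `smc` and the sample vectors are hypothetical.

```python
def smc(x, y):
    """Simple matching coefficient: matching positions / total positions."""
    if len(x) != len(y) or not x:
        raise ValueError("vectors must be non-empty and of equal length")
    # A position matches when both entries are 1 or both are 0.
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


# Hypothetical vectors: positions 0, 3 and 4 match, positions 1 and 2 differ.
x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]
print(smc(x, y))  # 0.6
```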
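Jaccard Similarity
Definition: Measures the similarity between two binary vectors by considering only the matches for `1`s. Ignores matches for `0`s.
Formula:
$$J = \frac{f_{11}}{f_{11} + f_{10} + f_{01}}$$
- $f_{11}$: number of matches (both `1`).
- $f_{10}$: number of mismatches (vector 1 is `1`, vector 2 is `0`).
- $f_{01}$: number of mismatches (vector 1 is `0`, vector 2 is `1`).
Range: $[0, 1]$, where `1` indicates perfect similarity.
Use Case: Ideal for sparse data or cases where `1`s are more significant than `0`s.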
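A minimal sketch of Jaccard similarity for binary vectors follows, again assuming equal-length sequences of `0`/`1` integers. The name `jaccard` is illustrative, and returning `0.0` when neither vector contains a `1` is a convention assumed here, since the ratio is undefined in that case.

```python
def jaccard(x, y):
    """Jaccard similarity: shared 1s / positions where at least one vector is 1."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    both_one = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    denominator = both_one + mismatches
    if denominator == 0:
        # Both vectors are all zeros; 0.0 is an assumed convention.
        return 0.0
    return both_one / denominator


# Hypothetical vectors: two shared 1s, two mismatches, one shared 0 (ignored).
x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]
print(jaccard(x, y))  # 0.5
```

Appending any number of shared `0`s to both vectors leaves the result unchanged, which is why the measure suits sparse data.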
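Hamming Distance
Definition: Measures the total number of differing bits between two binary vectors. It counts mismatched positions.
Formula:
$$d_H(x, y) = \sum_{k=1}^{n} |x_k - y_k|$$
where $n$ is the vector length and $x_k$, $y_k$ are the $k$-th bits of the two vectors.
Range: $[0, n]$, where `0` indicates no differences (identical vectors).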
Use Case: Suitable for measuring the difference between binary strings or vectors.
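Counting mismatched positions takes one line. The sketch below assumes equal-length inputs; the name `hamming` is illustrative. Because it only compares elements for equality, it works on bit lists and on strings alike.

```python
def hamming(x, y):
    """Hamming distance: number of positions at which the two inputs differ."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)


# Hypothetical inputs chosen only for illustration.
print(hamming([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))  # 2
print(hamming("10110", "11010"))                  # 2
```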
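Measure | Formula | Focus | Range | Best Use Case |
---|---|---|---|---|
SMC | $\frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$ | Matches for `1`s and `0`s | $[0, 1]$ | Equal importance for `1`s and `0`s |
Jaccard | $\frac{f_{11}}{f_{11} + f_{10} + f_{01}}$ | Matches for `1`s only | $[0, 1]$ | Sparse data, where `1`s matter more |
Hamming | $f_{10} + f_{01}$ | Mismatched positions | $[0, n]$ | Binary strings or sequences |
Here $f_{ab}$ counts the positions where vector 1 is $a$ and vector 2 is $b$, and $n$ is the vector length.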
Given two binary vectors:
Simple Matching Coefficient (SMC):
Jaccard Similarity:
Hamming Distance:
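As a worked illustration, take two hypothetical vectors (chosen here for the arithmetic, not taken from the notes): $x = (1, 0, 0, 1, 1, 0)$ and $y = (1, 1, 0, 0, 1, 0)$. Writing $f_{ab}$ for the number of positions where $x$ is $a$ and $y$ is $b$, the counts are $f_{11} = 2$, $f_{00} = 2$, $f_{10} = 1$, $f_{01} = 1$, so:

$$
\begin{aligned}
SMC &= \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}} = \frac{2 + 2}{6} \approx 0.667 \\
J &= \frac{f_{11}}{f_{11} + f_{10} + f_{01}} = \frac{2}{4} = 0.5 \\
d_H &= f_{10} + f_{01} = 1 + 1 = 2
\end{aligned}
$$

SMC is higher than Jaccard here because the two shared `0`s count as matches for SMC but are ignored by Jaccard.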
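Nominal attributes: attributes with more than two states or values.
The commonly used distance measure is also based on the simple matching method.
Given two data points $x_i$ and $x_j$, let $r$ be the number of attributes and $q$ the number of attributes on which $x_i$ and $x_j$ have the same value:
$$dist(x_i, x_j) = \frac{r - q}{r}$$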
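A simple-matching distance for nominal data can be sketched as the fraction of attributes on which two data points disagree, that is $(r - q) / r$ with $r$ attributes in total and $q$ matching ones. The function name `nominal_distance` and the sample records are hypothetical.

```python
def nominal_distance(x, y):
    """Simple-matching distance for nominal attributes: (r - q) / r."""
    if len(x) != len(y) or not x:
        raise ValueError("data points must be non-empty and of equal length")
    r = len(x)                                  # number of attributes
    q = sum(1 for a, b in zip(x, y) if a == b)  # attributes with equal values
    return (r - q) / r


# Hypothetical records with three nominal attributes (colour, size, shape).
a = ("red", "small", "round")
b = ("red", "large", "round")
print(nominal_distance(a, b))  # one of three attributes differs
```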
This section explains how text documents are represented and how distances or similarities between them are measured.
Term | Document 1 | Document 2 |
---|---|---|
aid | 0 | 1 |
back | 1 | 0 |
dog | 1 | 0 |
men | 0 | 1 |
... | ... | ... |
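Each column of the table can be read as a term vector for one document. The sketch below compares the two documents using cosine similarity, which is one common choice for text and is an assumption here, since the notes do not name a specific measure. Only the four terms visible in the table are used; the elided rows are left out.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an empty document
    return dot / (norm_u * norm_v)


# Term order: aid, back, dog, men (the rows shown in the table).
doc1 = [0, 1, 1, 0]
doc2 = [1, 0, 0, 1]
print(cosine_similarity(doc1, doc2))  # 0.0, the documents share none of these terms
```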
In the Euclidean space, standardization of attributes is recommended so that all attributes can have equal impact on the computation of distances.
Standardizing attributes forces them onto a common value range.
Interval-scaled attributes: their values are real numbers following a linear scale.
There are two main approaches to standardizing interval-scaled attributes: range and z-score.
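Z-score: transforms the attribute values so that they have a mean of zero and a mean absolute deviation of 1. The mean absolute deviation of attribute $f$, denoted $s_f$, is computed from the attribute's mean $m_f$ over the $n$ data points:
$$s_f = \frac{1}{n} \sum_{i=1}^{n} |x_{if} - m_f|, \qquad m_f = \frac{1}{n} \sum_{i=1}^{n} x_{if}$$
Z-score:
$$z(x_{if}) = \frac{x_{if} - m_f}{s_f}$$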
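Both standardization approaches can be sketched for a single attribute given as a list of numbers. The range approach is assumed here to be min-max scaling onto $[0, 1]$, and the z-score divides by the mean absolute deviation rather than the standard deviation, following the definition above. The function names and sample values are hypothetical.

```python
def range_scale(values):
    """Min-max scaling (assumed form of the range approach): maps values onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("attribute is constant; range scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]


def zscore_mad(values):
    """Z-score using the mean absolute deviation as the scale."""
    n = len(values)
    mean = sum(values) / n
    mad = sum(abs(v - mean) for v in values) / n
    if mad == 0:
        raise ValueError("attribute is constant; z-score is undefined")
    return [(v - mean) / mad for v in values]


# Hypothetical attribute values: mean is 5.0, mean absolute deviation is 2.0.
attribute = [2.0, 4.0, 6.0, 8.0]
print(range_scale(attribute))
print(zscore_mad(attribute))  # [-1.5, -0.5, 0.5, 1.5]
```

After the z-score transform the values average to zero and their mean absolute deviation is 1, so attributes measured on very different scales contribute comparably to a Euclidean distance.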