date: 2024-12-24
title: ML-Distance Measure
status: DONE
author:
  - AllenYGY
tags:
  - NOTE
publish: True

ML-Distance Measure

Distance measures are key to clustering. “Similarity” and “dissimilarity” are also commonly used terms.

There are numerous distance functions for:

- Different types of data
  - Numeric data
  - Nominal data
- Different specific applications

Minkowski Distance

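We denote distance with $dist(x_i, x_j)$, where $x_i$ and $x_j$ are data points (vectors) with $r$ attributes. The Minkowski distance of order $h$ (a positive integer) is:

$$
dist(x_i, x_j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \dots + |x_{ir} - x_{jr}|^h \right)^{1/h}
$$

The most commonly used functions are Euclidean distance and Manhattan (city block) distance.

If $h = 1$, it is the Manhattan distance:

$$
dist(x_i, x_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ir} - x_{jr}|
$$

If $h = 2$, it is the Euclidean distance:

$$
dist(x_i, x_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2}
$$

Weighted Euclidean distance:

$$
dist(x_i, x_j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \dots + w_r (x_{ir} - x_{jr})^2}
$$

Chebychev distance:

$$
dist(x_i, x_j) = \max\left( |x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \dots, |x_{ir} - x_{jr}| \right)
$$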
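
The distances in this section can be sketched in plain Python. This is a minimal illustration that assumes the standard definitions: the Minkowski distance of order `h` is the `h`-th root of the summed `h`-th powers of the absolute coordinate differences, the weighted Euclidean distance multiplies each squared difference by a weight, and the Chebychev distance takes the largest absolute difference. The function names, the parameter name `h`, and the sample vectors are assumptions made for this example.

```python
import math


def minkowski(a, b, h):
    """Minkowski distance of order h between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1.0 / h)


def manhattan(a, b):
    """Minkowski distance with h = 1."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Minkowski distance with h = 2."""
    return minkowski(a, b, 2)


def weighted_euclidean(a, b, w):
    """Euclidean distance where attribute k contributes with weight w[k]."""
    if not (len(a) == len(b) == len(w)):
        raise ValueError("vectors and weights must have the same length")
    return math.sqrt(sum(wk * (x - y) ** 2 for wk, x, y in zip(w, a, b)))


def chebyshev(a, b):
    """Largest absolute difference over all attributes."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


# Illustrative vectors (not taken from the note).
p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]

print(manhattan(p, q))   # 7.0
print(euclidean(p, q))   # 5.0
print(chebyshev(p, q))   # 4.0
```

Setting every weight to 1 in `weighted_euclidean` gives the plain Euclidean distance, which is a quick sanity check on the weighted variant.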

Distance functions for binary attributes

Simple Matching Coefficient (SMC)

Definition: Measures the proportion of matching elements (both 1s and 0s) between two binary vectors.
Formula: $SMC = \dfrac{TP + TN}{TP + TN + FP + FN}$
Where:
- $TP$: True Positives (both vectors are 1).
- $TN$: True Negatives (both vectors are 0).
- $FP$: False Positives (vector 1 is 1, vector 2 is 0).
- $FN$: False Negatives (vector 1 is 0, vector 2 is 1).
Range: $[0, 1]$, where 1 indicates perfect matching.
Use Case: Suitable when both 1s and 0s carry equal importance.
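
As a rough sketch, SMC can be computed by counting the four kinds of position pairs and dividing the matches by the total. This assumes the usual definition, (TP + TN) / (TP + TN + FP + FN); the helper name `confusion_counts` and the sample vectors are made up for illustration.

```python
def confusion_counts(u, v):
    """Return (TP, TN, FP, FN) for two equal-length 0/1 vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    tp = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    tn = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    fp = sum(1 for a, b in zip(u, v) if a == 1 and b == 0)
    fn = sum(1 for a, b in zip(u, v) if a == 0 and b == 1)
    return tp, tn, fp, fn


def smc(u, v):
    """Simple Matching Coefficient: share of positions where u and v agree."""
    tp, tn, fp, fn = confusion_counts(u, v)
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("SMC is undefined for empty vectors")
    return (tp + tn) / total


# Illustrative vectors (not taken from the note).
u = [1, 0, 1, 1, 0, 0]
v = [1, 1, 1, 0, 0, 0]

print(confusion_counts(u, v))  # (2, 2, 1, 1)
print(smc(u, v))               # 4 matches out of 6 positions
```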

Jaccard Similarity Coefficient

Definition: Measures the similarity between two binary vectors by considering only the matches for 1s. Ignores 0s.
Formula: $J = \dfrac{TP}{TP + FP + FN}$
Where:
- $TP$: True Positives (both vectors are 1).
- $FP$: False Positives (vector 1 is 1, vector 2 is 0).
- $FN$: False Negatives (vector 1 is 0, vector 2 is 1).
Range: $[0, 1]$, where 1 indicates perfect similarity.
Use Case: Ideal for sparse data or cases where 1s are more significant than 0s.
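
A minimal sketch of the Jaccard coefficient for 0/1 vectors, assuming the usual definition TP / (TP + FP + FN): positions where both vectors are 1, divided by positions where at least one is 1. The function name and sample vectors are illustrative. When neither vector contains a 1 the ratio is 0/0, so this sketch raises an error rather than picking a convention.

```python
def jaccard(u, v):
    """Jaccard similarity of two equal-length 0/1 vectors; 0-0 matches are ignored."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    both = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)    # TP
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)   # TP + FP + FN
    if either == 0:
        # Assumption for this sketch: treat the all-zero case as undefined.
        raise ValueError("Jaccard similarity is undefined when neither vector has a 1")
    return both / either


# Illustrative vectors (not taken from the note).
u = [1, 0, 1, 1, 0, 0]
v = [1, 1, 1, 0, 0, 0]

print(jaccard(u, v))  # 0.5
```

Appending more shared zeros to both vectors leaves the result unchanged, which is why the measure suits sparse data.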

Hamming Distance

Definition: Measures the total number of differing bits between two binary vectors. It counts mismatched positions.
Formula: $d_H(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
Where $x_i$ and $y_i$ are the corresponding elements in the two vectors, and $n$ is the vector length.
Range: $[0, n]$, where 0 indicates no differences (identical vectors).
Use Case: Suitable for measuring the difference between binary strings or vectors.
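
A short sketch of the Hamming distance as a count of positions where two equal-length sequences differ. The function name and inputs are illustrative; the same function works for lists of bits and for bit strings.

```python
def hamming(u, v):
    """Number of positions at which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(u, v) if a != b)


# Illustrative inputs (not taken from the note).
print(hamming([1, 0, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]))  # 2
print(hamming("10110", "10011"))                        # 2
```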

Comparison Table

| Measure | Focus | Best Use Case |
| --- | --- | --- |
| SMC | Matches for `1`s and `0`s | Equal importance for `1`s and `0`s |
| Jaccard | Matches for `1`s only | Sparse data, where `1`s matter more |
| Hamming | Mismatched positions | Binary strings or sequences |

Example for Binary Vectors

Given two binary vectors:

Simple Matching Coefficient (SMC):
- , , ,
Jaccard Similarity:
- , ,
Hamming Distance:
- Mismatched positions: index 2 and index 5, so the Hamming distance is 2.

Distance functions for nominal attributes

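Nominal attributes are attributes with more than two states or values.

The commonly used distance measure is also based on the simple matching method.

Given two data points $x_i$ and $x_j$, let the number of attributes be $r$, and the number of values that match in $x_i$ and $x_j$ be $q$. The distance is the proportion of attributes that do not match:

$$
dist(x_i, x_j) = \frac{r - q}{r}
$$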
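
The simple matching idea for nominal attributes can be sketched as follows, assuming the distance is the fraction of attributes whose values differ, that is (r - q) / r with r attributes and q matches. The function name and the attribute values are invented for the example.

```python
def nominal_distance(x, y):
    """Fraction of attributes on which two nominal data points disagree."""
    if len(x) != len(y):
        raise ValueError("data points must have the same number of attributes")
    r = len(x)                                   # number of attributes
    if r == 0:
        raise ValueError("data points must have at least one attribute")
    q = sum(1 for a, b in zip(x, y) if a == b)   # number of matching values
    return (r - q) / r


# Illustrative data points: (colour, shape, size, material).
a = ["red", "circle", "small", "metal"]
b = ["red", "square", "small", "wood"]

print(nominal_distance(a, b))  # 0.5, since 2 of 4 attributes match
```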

Distance Function for Text Documents

This section explains how text documents are represented and how distances or similarities between them are measured.

Representing Text Documents

Definition:
- A text document consists of a sequence of sentences, and each sentence is a sequence of words.
Simplification:
- To simplify, a document is usually represented as a Bag of Words (BOW) in document clustering.
- In the Bag of Words model:
  - The sequence and position of words are ignored.
  - Focus is on word occurrence or frequency.
Vector Representation:
- A document is converted into a vector, where each dimension corresponds to a specific term (word), and the value represents the frequency or presence of the term.

Example:

| Term | Document 1 | Document 2 |
| --- | --- | --- |
| aid | 0 | 1 |
| back | 1 | 0 |
| dog | 1 | 0 |
| men | 0 | 1 |
| ... | ... | ... |
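
A term-by-document table like the one above can be produced with a small Bag of Words sketch. The two sample sentences, the whitespace tokenization, and the function names are assumptions for illustration; real pipelines usually add steps such as punctuation handling and stop-word removal.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring word order and position."""
    return Counter(text.lower().split())


def to_vectors(docs):
    """Return a sorted vocabulary and one term-frequency vector per document."""
    bags = [bag_of_words(d) for d in docs]
    vocab = sorted(set().union(*bags))
    # A Counter returns 0 for terms that do not occur in a document.
    vectors = [[bag[term] for term in vocab] for bag in bags]
    return vocab, vectors


# Illustrative documents (not the ones behind the table in the note).
docs = ["the dog ran back", "men came to aid the men"]
vocab, vectors = to_vectors(docs)

print(vocab)       # ['aid', 'back', 'came', 'dog', 'men', 'ran', 'the', 'to']
print(vectors[0])  # [0, 1, 0, 1, 0, 1, 1, 0]
print(vectors[1])  # [1, 0, 1, 0, 2, 0, 1, 1]
```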

Measuring Distance or Similarity

Similarity vs. Distance:

Instead of using distance, it is common to use similarity to compare text documents.
Most Common Similarity Measure:
- Cosine Similarity: Measures the cosine of the angle between two vectors.

Cosine Similarity:

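Formula: $\cos(\theta) = \dfrac{A \cdot B}{\|A\| \, \|B\|}$
Where:
- $A$ and $B$: Vector representations of two documents.
- $\|A\|$: The Euclidean norm of vector $A$.
Range (for non-negative term-frequency vectors, $[0, 1]$):
- $1$: Documents are identical (their vectors point in the same direction).
- $0$: Documents are completely dissimilar (they share no terms).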
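
Cosine similarity can be sketched in a few lines, assuming the standard definition: the dot product of the two vectors divided by the product of their Euclidean norms. The function name and the term-frequency vectors are illustrative. Note that the measure depends only on direction, so a document and a copy with every count doubled score (up to floating-point rounding) 1.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Illustrative term-frequency vectors over the same four terms.
d1 = [0, 1, 1, 0]
d2 = [1, 0, 0, 1]   # shares no terms with d1
d3 = [0, 2, 2, 0]   # same terms as d1, doubled counts

print(cosine_similarity(d1, d2))  # 0.0
print(cosine_similarity(d1, d3))  # very close to 1.0
```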

Data Standardization

In the Euclidean space, standardization of attributes is recommended so that all attributes can have equal impact on the computation of distances.

Standardize attributes: to force the attributes to have a common value range

Interval-scaled attributes

Their values are real numbers following a linear scale.

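There are two main approaches to standardizing interval-scaled attributes: range and z-score. Let $f$ be an attribute, and let $x_{if}$ be the value of $f$ for the $i$-th of $n$ data points.

Range Standardization

$$
range(x_{if}) = \frac{x_{if} - \min(f)}{\max(f) - \min(f)}
$$

where $\min(f)$ and $\max(f)$ are the minimum and maximum values of attribute $f$ in the data set, so the standardized values fall in $[0, 1]$.

Z-Score Standardization

Z-score: transforms the attribute values so that they have a mean of zero and a mean absolute deviation of 1. The mean absolute deviation of attribute $f$, denoted by $s_f$, is computed as follows.

$$
s_f = \frac{1}{n} \left( |x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f| \right), \qquad m_f = \frac{1}{n} \left( x_{1f} + x_{2f} + \dots + x_{nf} \right)
$$

Z-score:

$$
z(x_{if}) = \frac{x_{if} - m_f}{s_f}
$$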
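
Both standardizations can be sketched for a single attribute, assuming range standardization maps values to (x - min) / (max - min) and the z-score here divides by the mean absolute deviation (as described above) rather than the standard deviation. The function names and the sample values are illustrative.

```python
def range_standardize(values):
    """Rescale the values of one attribute to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("range standardization is undefined for a constant attribute")
    return [(v - lo) / (hi - lo) for v in values]


def z_score_mad(values):
    """Centre on the mean and scale by the mean absolute deviation."""
    n = len(values)
    mean = sum(values) / n
    mad = sum(abs(v - mean) for v in values) / n
    if mad == 0:
        raise ValueError("z-score is undefined for a constant attribute")
    return [(v - mean) / mad for v in values]


# Illustrative values of one attribute (for example, age).
ages = [20.0, 30.0, 40.0, 50.0]

print(range_standardize(ages))  # first value maps to 0.0, last to 1.0
print(z_score_mad(ages))        # [-1.5, -0.5, 0.5, 1.5]
```

After the z-score transform the values have mean 0 and mean absolute deviation 1, so attributes measured on very different scales contribute comparably to a Euclidean distance.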