ML-Distance Measure

Distance measures are key to clustering. “Similarity” and “dissimilarity” are also commonly used terms.

There are numerous distance functions for

  • Different types of data
    • Numeric data
    • Nominal data
  • Different specific applications

Minkowski Distance

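We denote the distance by $dist(x_i, x_j)$, where $x_i$ and $x_j$ are data points (vectors) with $r$ attributes.

The most commonly used functions are the Euclidean distance and the Manhattan (city block) distance. Both are special cases of the Minkowski distance, where $h$ is a positive integer:

$$dist(x_i, x_j) = \left( \sum_{k=1}^{r} |x_{ik} - x_{jk}|^h \right)^{1/h}$$

  • If $h = 1$, it is the Manhattan distance: $dist(x_i, x_j) = \sum_{k=1}^{r} |x_{ik} - x_{jk}|$
  • If $h = 2$, it is the Euclidean distance: $dist(x_i, x_j) = \sqrt{\sum_{k=1}^{r} (x_{ik} - x_{jk})^2}$
  • Weighted Euclidean distance: $dist(x_i, x_j) = \sqrt{\sum_{k=1}^{r} w_k (x_{ik} - x_{jk})^2}$, where $w_k$ is the weight of attribute $k$
  • Chebychev distance: $dist(x_i, x_j) = \max_{k} |x_{ik} - x_{jk}|$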
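The distance functions listed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the sample vectors are assumptions chosen for the example, and the order parameter is called `h` here.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def weighted_euclidean(x, y, w):
    """Euclidean distance with one non-negative weight per attribute."""
    return math.sqrt(sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, y)))

def chebychev(x, y):
    """Largest absolute difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical data points, chosen only for illustration.
x_i = [1.0, 2.0, 3.0]
x_j = [4.0, 6.0, 3.0]

manhattan = minkowski(x_i, x_j, 1)   # |1-4| + |2-6| + |3-3| = 7
euclidean = minkowski(x_i, x_j, 2)   # sqrt(9 + 16 + 0) = 5
cheb = chebychev(x_i, x_j)           # max(3, 4, 0) = 4
```

With all weights set to 1, `weighted_euclidean` reduces to the ordinary Euclidean distance.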

Distance functions for binary attributes

Simple Matching Coefficient (SMC)

  • Definition: Measures the proportion of matching elements (both 1s and 0s) between two binary vectors.

  • Formula: $SMC = \frac{f_{11} + f_{00}}{f_{11} + f_{00} + f_{10} + f_{01}}$
    Where:

    • $f_{11}$: True Positives (both vectors are 1).
    • $f_{00}$: True Negatives (both vectors are 0).
    • $f_{10}$: False Positives (vector 1 is 1, vector 2 is 0).
    • $f_{01}$: False Negatives (vector 1 is 0, vector 2 is 1).
  • Range: $[0, 1]$, where 1 indicates perfect matching.

  • Use Case: Suitable when both 1s and 0s carry equal importance.


Jaccard Similarity Coefficient

  • Definition: Measures the similarity between two binary vectors by considering only the matches for 1s. Ignores 0s.

  • Formula: $J = \frac{f_{11}}{f_{11} + f_{10} + f_{01}}$
    Where:

    • $f_{11}$: True Positives (both vectors are 1).
    • $f_{10}$: False Positives (vector 1 is 1, vector 2 is 0).
    • $f_{01}$: False Negatives (vector 1 is 0, vector 2 is 1).
  • Range: $[0, 1]$, where 1 indicates perfect similarity.

  • Use Case: Ideal for sparse data or cases where 1s are more significant than 0s.


Hamming Distance

  • Definition: Measures the total number of differing bits between two binary vectors. It counts mismatched positions.

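  • Formula: $d_H(x, y) = \sum_{k=1}^{n} \lvert x_k - y_k \rvert$
    Where $x_k$ and $y_k$ are the corresponding elements in the two vectors, and $n$ is the vector length.

  • Range: $[0, n]$, where 0 indicates no differences (identical vectors).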

  • Use Case: Suitable for measuring the difference between binary strings or vectors.


Comparison Table

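| Measure | Formula | Focus | Range | Best Use Case |
|---|---|---|---|---|
| SMC | $\frac{f_{11} + f_{00}}{f_{11} + f_{00} + f_{10} + f_{01}}$ | Matches for 1s and 0s | $[0, 1]$ | Equal importance for 1s and 0s |
| Jaccard | $\frac{f_{11}}{f_{11} + f_{10} + f_{01}}$ | Matches for 1s only | $[0, 1]$ | Sparse data, where 1s matter more |
| Hamming | $\sum_{k=1}^{n} \lvert x_k - y_k \rvert$ | Mismatched positions | $[0, n]$ | Binary strings or sequences |

Here $f_{ab}$ counts the positions where the first vector is $a$ and the second is $b$, and $n$ is the vector length.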

Example for Binary Vectors

Given two binary vectors:

  1. Simple Matching Coefficient (SMC):

    • , , ,
  2. Jaccard Similarity:

    • , ,
  3. Hamming Distance:

    • Mismatched positions: index 2 and index 5, so the Hamming distance is 2.
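The three binary measures can be computed from the same four counts. The sketch below is only illustrative: the vectors are hypothetical (they are not the values of the example above, although they also differ at positions 2 and 5 when counting from 1), the helper names are assumptions, and returning 1.0 for the Jaccard similarity of two all-zero vectors is an assumed convention.

```python
def binary_counts(x, y):
    """Return (f11, f00, f10, f01) for two equal-length 0/1 vectors."""
    f11 = f00 = f10 = f01 = 0
    for a, b in zip(x, y):
        if a == 1 and b == 1:
            f11 += 1
        elif a == 0 and b == 0:
            f00 += 1
        elif a == 1 and b == 0:
            f10 += 1
        else:
            f01 += 1
    return f11, f00, f10, f01

def smc(x, y):
    """Simple Matching Coefficient: matches (1s and 0s) over all positions."""
    f11, f00, f10, f01 = binary_counts(x, y)
    return (f11 + f00) / (f11 + f00 + f10 + f01)

def jaccard(x, y):
    """Jaccard similarity: 1-1 matches over positions where at least one is 1."""
    f11, _, f10, f01 = binary_counts(x, y)
    denom = f11 + f10 + f01
    # Assumed convention: two all-zero vectors are treated as identical.
    return f11 / denom if denom else 1.0

def hamming(x, y):
    """Number of positions at which the two vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

# Hypothetical vectors, chosen only for illustration.
x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 1, 1, 1, 0]

counts = binary_counts(x, y)   # (3, 1, 0, 2)
smc_value = smc(x, y)          # (3 + 1) / 6
jaccard_value = jaccard(x, y)  # 3 / (3 + 0 + 2)
hamming_value = hamming(x, y)  # 2
```

Note how the single 0-0 match raises the SMC but leaves the Jaccard similarity unaffected.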

Distance functions for nominal attributes

Nominal attributes: attributes with more than two states or values.

The commonly used distance measure is also based on the simple matching method.

Given two data points $x_i$ and $x_j$, let the number of attributes be $r$, and the number of values that match in $x_i$ and $x_j$ be $q$:

$$dist(x_i, x_j) = \frac{r - q}{r}$$
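As a small sketch of the simple matching distance for nominal data, assuming it takes the usual form (number of mismatching attributes divided by the total number of attributes), the function and sample records below are hypothetical:

```python
def nominal_distance(x_i, x_j):
    """Simple matching distance: fraction of attributes whose values differ."""
    if len(x_i) != len(x_j):
        raise ValueError("data points must have the same number of attributes")
    r = len(x_i)                                     # number of attributes
    q = sum(1 for a, b in zip(x_i, x_j) if a == b)   # number of matching values
    return (r - q) / r

# Hypothetical records with three nominal attributes each.
p1 = ["red", "small", "round"]
p2 = ["red", "large", "round"]

d = nominal_distance(p1, p2)  # one mismatch out of three attributes
```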

Distance Function for Text Documents

This section explains how text documents are represented and how distances or similarities between them are measured.


Representing Text Documents

  • Definition:
    • A text document consists of a sequence of sentences, and each sentence is a sequence of words.
  • Simplification:
    • To simplify, a document is usually represented as a Bag of Words (BOW) in document clustering.
    • In the Bag of Words model:
      • The sequence and position of words are ignored.
      • Focus is on word occurrence or frequency.
  • Vector Representation:
    • A document is converted into a vector, where each dimension corresponds to a specific term (word), and the value represents the frequency or presence of the term.

Example:

| Term | Document 1 | Document 2 |
|---|---|---|
| aid | 0 | 1 |
| back | 1 | 0 |
| dog | 1 | 0 |
| men | 0 | 1 |
| ... | ... | ... |
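A term-frequency Bag of Words representation can be sketched as follows. The documents, the whitespace tokenizer, and the function name are assumptions made for illustration; real pipelines typically also handle punctuation, stop words, and stemming.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, vectors): one term-frequency vector per document."""
    # Naive tokenization (assumption): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # One dimension per distinct term, in alphabetical order.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

# Hypothetical documents, chosen only for illustration.
docs = ["the dog came back", "men came to the aid"]
vocab, vectors = bag_of_words(docs)
# vocab      -> ['aid', 'back', 'came', 'dog', 'men', 'the', 'to']
# vectors[0] -> [0, 1, 1, 1, 0, 1, 0]
# vectors[1] -> [1, 0, 1, 0, 1, 1, 1]
```

Word order is discarded: any permutation of the words in a document yields the same vector.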

Measuring Distance or Similarity

Similarity vs. Distance:

  • Instead of using distance, it is common to use similarity to compare text documents.
  • Most Common Similarity Measure:
    • Cosine Similarity: Measures the cosine of the angle between two vectors.

Cosine Similarity:

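  • Formula: $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$
    Where:
    • $d_1$ and $d_2$: Vector representations of two documents.
    • $\|d\|$: The Euclidean norm of vector $d$.
  • Range (for non-negative term vectors, the value lies in $[0, 1]$):
    • $1$: Documents are identical in their term distribution (the vectors point in the same direction).
    • $0$: Documents are completely dissimilar (they share no terms).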
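Cosine similarity, the dot product of two document vectors divided by the product of their Euclidean norms, can be sketched as below. The function name and sample vectors are hypothetical, and returning 0.0 when one vector is all zeros is an assumed convention, since the cosine is undefined in that case.

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length document vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        # Assumed convention: an empty document is similar to nothing.
        return 0.0
    return dot / (norm1 * norm2)

# Hypothetical term-frequency vectors.
same_direction = cosine_similarity([1, 1, 0], [2, 2, 0])  # close to 1
no_shared_terms = cosine_similarity([1, 0], [0, 1])       # exactly 0
```

Because only the angle matters, a document and a copy of it repeated twice have a similarity of 1 even though their Euclidean distance is not zero.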

Data Standardization

In the Euclidean space, standardization of attributes is recommended so that all attributes can have equal impact on the computation of distances.

Standardize attributes: to force the attributes to have a common value range

Interval-scaled attributes

Their values are real numbers following a linear scale.

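There are two main approaches to standardizing interval-scaled attributes: range and z-score. Let $f$ be an attribute, and let $x_{if}$ be the value of $f$ for data point $x_i$, with $n$ data points in total.

Range Standardization

$$range(x_{if}) = \frac{x_{if} - \min(f)}{\max(f) - \min(f)}$$

where $\min(f)$ and $\max(f)$ are the minimum and maximum values of attribute $f$ in the data set.

Z-Score Standardization

Z-score: transforms the attribute values so that they have a mean of zero and a mean absolute deviation of 1. The mean absolute deviation of attribute $f$, denoted by $s_f$, is computed as follows, where $m_f$ is the mean of $f$:

$$s_f = \frac{1}{n} \sum_{i=1}^{n} |x_{if} - m_f|, \qquad m_f = \frac{1}{n} \sum_{i=1}^{n} x_{if}$$

Z-score:

$$z(x_{if}) = \frac{x_{if} - m_f}{s_f}$$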
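Both standardizations can be sketched for a single attribute as follows. This is an illustration under stated assumptions: the function names and sample values are hypothetical, the z-score divides by the mean absolute deviation as described above (not the standard deviation), and mapping a constant attribute to all zeros is an assumed convention to avoid division by zero.

```python
def range_standardize(values):
    """Rescale one attribute's values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant attribute maps to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def zscore_mad(values):
    """Z-score using the mean absolute deviation as the scale."""
    n = len(values)
    m = sum(values) / n                       # mean of the attribute
    s = sum(abs(v - m) for v in values) / n   # mean absolute deviation
    if s == 0:
        # Assumed convention: a constant attribute maps to all zeros.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]

# Hypothetical attribute values, chosen only for illustration.
ages = [20.0, 30.0, 40.0, 50.0]

ranged = range_standardize(ages)  # 0, 1/3, 2/3, 1
z = zscore_mad(ages)              # mean 35, deviation 10 -> -1.5, -0.5, 0.5, 1.5
```

After the z-score transform the values have mean 0 and mean absolute deviation 1, which is the property stated in the text.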