Similarity and Dissimilarity Measures
1 Introduction
In Cluster Analysis, the objective is to partition a set of n objects into groups such that objects within a group
are more similar to each other than to objects in other groups. This requires a formal mathematical definition of
distance (dissimilarity) and proximity (similarity).
2 The Data Matrix
Let X be an n × p data matrix:
\[
X = \begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1p} \\
x_{21} & x_{22} & \dots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \dots & x_{np}
\end{pmatrix}
\]
where xi = (xi1 , xi2 , . . . , xip )⊤ is the p-dimensional vector representing the i-th observation.
3 Dissimilarity Measures for Continuous Variables
3.1 Minkowski Distance (Lr Norm)
The Minkowski distance between observations i and j is defined as:
p
!1/r
X
r
dr (i, j) = |xik − xjk |
k=1
Special cases include:
• Euclidean Distance (r = 2): The most common metric.
\[
d_2(i, j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} = \sqrt{(x_i - x_j)^\top (x_i - x_j)}
\]
• Manhattan/City Block Distance (r = 1): Robust to outliers.
\[
d_1(i, j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}|
\]
• Chebyshev Distance (r → ∞):
\[
d_\infty(i, j) = \max_{k} |x_{ik} - x_{jk}|
\]
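The three special cases above can be checked with a minimal NumPy sketch (the vectors `x` and `y` and the function name are illustrative, not part of the notes):

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski distance d_r between two p-dimensional vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(r):
        # Chebyshev distance as the r -> infinity limit
        return float(np.max(np.abs(x - y)))
    return float(np.sum(np.abs(x - y) ** r) ** (1.0 / r))

x = [1.0, 4.0, 2.0]
y = [3.0, 1.0, 2.0]
print(minkowski(x, y, 1))        # Manhattan: 2 + 3 + 0 = 5.0
print(minkowski(x, y, 2))        # Euclidean: sqrt(4 + 9)
print(minkowski(x, y, np.inf))   # Chebyshev: max(2, 3, 0) = 3.0
```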
3.2 Mahalanobis Distance
To account for correlations and differing variances between variables, we use the Mahalanobis distance. Let S be
the sample covariance matrix:
\[
d_M(i, j) = \sqrt{(x_i - x_j)^\top S^{-1} (x_i - x_j)}
\]
Derivation: If x is transformed by z = S^{-1/2} x, then z_i − z_j = S^{-1/2}(x_i − x_j), so ∥z_i − z_j∥² = (x_i − x_j)^⊤ S^{-1}(x_i − x_j); hence the Euclidean distance in z-space is identical to the Mahalanobis distance in x-space.
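A minimal NumPy sketch of this computation (the data matrix here is randomly generated for illustration; solving the linear system avoids forming S^{-1} explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))       # illustrative n = 50, p = 3 data matrix
S = np.cov(X, rowvar=False)        # sample covariance matrix S (rows = observations)

def mahalanobis(xi, xj, S):
    """Mahalanobis distance d_M between observations xi and xj."""
    diff = xi - xj
    # Solve S u = diff instead of computing S^{-1} directly (better conditioned).
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))

print(mahalanobis(X[0], X[1], S))
```

With S replaced by the identity matrix, the function reduces to the ordinary Euclidean distance.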
4 Similarity Measures for Binary Data
For binary variables (0 or 1), similarity is based on a 2 × 2 contingency table for objects i and j:
Object i \ j    1 (Present)    0 (Absent)    Total
1 (Present)     a              b             a + b
0 (Absent)      c              d             c + d
Total           a + c          b + d         p
4.1 Standard Coefficients
1. Simple Matching Coefficient (SMC):
\[
S_{SMC} = \frac{a + d}{a + b + c + d}
\]
2. Jaccard Coefficient (SJ ): Used when “double-zeros” (d) carry no information.
\[
S_J = \frac{a}{a + b + c}
\]
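Both coefficients follow directly from the counts a, b, c, d. A short NumPy sketch (the example vectors and helper names are illustrative):

```python
import numpy as np

def binary_counts(xi, xj):
    """Counts (a, b, c, d) from the 2x2 contingency table for two binary vectors."""
    xi, xj = np.asarray(xi), np.asarray(xj)
    a = int(np.sum((xi == 1) & (xj == 1)))  # both present
    b = int(np.sum((xi == 1) & (xj == 0)))  # present in i only
    c = int(np.sum((xi == 0) & (xj == 1)))  # present in j only
    d = int(np.sum((xi == 0) & (xj == 0)))  # both absent
    return a, b, c, d

def simple_matching(xi, xj):
    a, b, c, d = binary_counts(xi, xj)
    return (a + d) / (a + b + c + d)

def jaccard(xi, xj):
    a, b, c, d = binary_counts(xi, xj)
    return a / (a + b + c)

xi = [1, 1, 0, 0, 1]
xj = [1, 0, 0, 1, 1]
# Here a = 2, b = 1, c = 1, d = 1, so SMC = 3/5 and S_J = 2/4.
print(simple_matching(xi, xj), jaccard(xi, xj))
```

Note how the single double-zero (d = 1) raises the SMC but leaves the Jaccard coefficient untouched.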
5 Similarity via Correlation
The Pearson Correlation Coefficient measures similarity in the “shape” of profiles:
\[
\rho_{ij} = \frac{\sum_{k=1}^{p} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}
{\sqrt{\sum_{k=1}^{p} (x_{ik} - \bar{x}_i)^2 \sum_{k=1}^{p} (x_{jk} - \bar{x}_j)^2}}
\]
where x̄i is the mean of the p variables for object i.
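Because ρ_ij compares centered profiles, two objects with proportional profiles are perfectly similar even if their levels differ. A minimal sketch (profiles chosen for illustration):

```python
import numpy as np

def profile_correlation(xi, xj):
    """Pearson correlation between two observation profiles (rows of X)."""
    xi = np.asarray(xi, dtype=float) - np.mean(xi)   # center by the row mean
    xj = np.asarray(xj, dtype=float) - np.mean(xj)
    return float(np.sum(xi * xj) / np.sqrt(np.sum(xi ** 2) * np.sum(xj ** 2)))

# Same "shape", different scale: rho = 1 despite different Euclidean distance.
print(profile_correlation([2.0, 4.0, 6.0], [1.0, 2.0, 3.0]))
```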
Problem 1: Invariance of Mahalanobis Distance
Task: Prove that the Mahalanobis distance between two vectors xi and xj is invariant under any non-singular
linear transformation y = Ax + b.
Proof. Let Sx be the covariance matrix of the original data X. The squared Mahalanobis distance is defined as:
\[
d_M^2(x_i, x_j) = (x_i - x_j)^\top S_x^{-1} (x_i - x_j)
\]
Consider the linear transformation yi = Axi + b, where A is a non-singular p × p matrix. The difference vector
in the transformed space is:
yi − yj = (Axi + b) − (Axj + b) = A(xi − xj )
The covariance matrix of the transformed variables Y is given by:
Sy = Var(Ax + b) = ASx A⊤
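The invariance claimed above can be verified numerically; in this sketch the data, A, and b are randomly generated (A is non-singular with probability one), and the sample covariances of X and Y satisfy S_y = A S_x A^⊤ exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))   # illustrative data matrix
A = rng.normal(size=(3, 3))     # almost surely non-singular
b = rng.normal(size=3)
Y = X @ A.T + b                 # apply y = Ax + b to every row

def mahal(u, v, S):
    diff = u - v
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))

Sx = np.cov(X, rowvar=False)
Sy = np.cov(Y, rowvar=False)

dx = mahal(X[0], X[1], Sx)      # distance in the original space
dy = mahal(Y[0], Y[1], Sy)      # distance after the affine transformation
print(dx, dy)                   # equal up to floating-point error
```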