1. Introduction
Cluster analysis is an unsupervised learning technique from multivariate statistics that groups a
set of observations into clusters such that objects within the same cluster are more similar to each
other than to objects in different clusters. Unlike classification methods, cluster analysis does not
rely on predefined labels; instead, it discovers structure directly from the data.
The primary objective of cluster analysis is to identify natural groupings in data. These groupings
may represent hidden patterns, subpopulations, or structures that are not immediately apparent.
Cluster analysis is widely used in data mining, biology, marketing, social sciences, image
processing, and machine learning.
Clustering is particularly useful for:
• Exploratory data analysis
• Pattern recognition
• Market segmentation
• Anomaly detection
• Data summarization
Because clustering results depend strongly on the choice of similarity measure and algorithm,
careful methodological decisions are essential for meaningful outcomes.
2. Similarity Measures
Similarity measures quantify how alike two observations are. The choice of similarity or distance
measure directly influences the clustering result.
Distance-Based Measures
Euclidean Distance
• The most commonly used distance measure
• Measures straight-line distance between two points
• Sensitive to scale and outliers
Manhattan Distance
• Sums the absolute differences along each axis (also called city-block or L1 distance)
• More robust to outliers than Euclidean distance, because differences are not squared
Minkowski Distance
• A generalization of Euclidean and Manhattan distances
• Allows flexibility through an order parameter p (p = 1 gives Manhattan, p = 2 gives Euclidean)
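The three distance-based measures above can be sketched with a single function, assuming the usual definition of Minkowski distance as the p-th root of the summed p-th powers of the absolute coordinate differences. The function names and the sample points are illustrative assumptions, not part of the original notes.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1 for the result to be a metric")
    # Sum |x_i - y_i|^p over all coordinates, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan(x, y):
    """Special case p = 1: sum of absolute differences along each axis."""
    return minkowski(x, y, 1)


def euclidean(x, y):
    """Special case p = 2: straight-line distance."""
    return minkowski(x, y, 2)


# Hypothetical sample points chosen for easy checking.
a = (0.0, 0.0)
b = (3.0, 4.0)
print(manhattan(a, b))  # 7.0
print(euclidean(a, b))  # 5.0
```

Expressing both special cases through one function makes the role of the parameter explicit: larger values of p give more weight to the largest coordinate difference.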
Similarity Measures for Categorical Data
Hamming Distance
• Counts the number of attributes on which two observations differ
Jaccard Coefficient
• Measures similarity as the number of attributes shared by both observations divided by the number present in at least one of them
• Commonly used for binary data
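A minimal sketch of the two categorical measures, assuming observations are stored as equal-length sequences and, for the Jaccard coefficient, that a value of 1 marks an attribute as present. The function names, the sample vectors, and the convention of returning 1.0 when both vectors are all zeros are assumptions made for illustration.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)


def jaccard(x, y):
    """Jaccard coefficient for binary vectors: |both 1| / |at least one 1|."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Assumed convention: two all-zero vectors count as identical.
    return both / either if either else 1.0


# Hypothetical binary observations (1 = attribute present).
u = [1, 1, 0, 1, 0]
v = [1, 0, 0, 1, 1]
print(hamming(u, v))  # 2
print(jaccard(u, v))  # 0.5
```

Note that the Jaccard coefficient ignores attributes absent from both observations, which is why it suits sparse binary data where shared absences carry little information.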
Correlation-Based Measures
• Used when the shape or trend of data matters more than magnitude
• Useful in time-series or gene expression analysis
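One common way to turn correlation into a dissimilarity is to take one minus the Pearson correlation coefficient; this particular choice is an assumption here, since the notes do not name a specific formula. The sketch below, with illustrative function names and sample series, shows that two series with the same trend but very different magnitudes have a correlation distance of about zero.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """Assumed definition: 1 - r, ranging from 0 (same trend) to 2 (opposite)."""
    return 1.0 - pearson(x, y)


# Hypothetical series: same shape, magnitudes differ by a factor of ten.
low = [1.0, 2.0, 3.0, 4.0]
high = [10.0, 20.0, 30.0, 40.0]
falling = [4.0, 3.0, 2.0, 1.0]

print(round(correlation_distance(low, high), 6))     # 0.0
print(round(correlation_distance(low, falling), 6))  # 2.0
```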
Proper data preprocessing, including standardization and normalization, is critical before
computing similarity measures.
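The two preprocessing steps can be sketched as follows, assuming standardization means z-scoring (subtract the mean, divide by the standard deviation) and normalization means min-max rescaling to the range [0, 1]. The function names, the use of the population standard deviation, and the handling of constant columns are illustrative assumptions.

```python
import math


def standardize(values):
    """Z-score a list of numbers: zero mean, unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        # Assumed handling: a constant column carries no information.
        return [0.0] * n
    return [(v - mean) / std for v in values]


def min_max(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature measured on a large scale.
income = [20000.0, 40000.0, 60000.0]
print(min_max(income))      # [0.0, 0.5, 1.0]
print(standardize(income))  # values centered on 0 with unit spread
```

Applying one of these transformations to each feature separately keeps a variable with large raw values, such as income, from dominating a distance that also includes small-valued variables.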
3. Hierarchical Clustering
Hierarchical clustering builds a hierarchy of clusters without requiring the number of clusters to
be specified in advance.
Types of Hierarchical Clustering
Agglomerative Clustering
• Bottom-up approach