1. Motivation
In modern data analysis, datasets often consist of a large number of variables, sometimes ranging from
tens to thousands of features. While having more variables can increase the richness of information, it
also introduces several significant challenges.
One major issue is the curse of dimensionality, where the volume of the data space increases
exponentially with the number of dimensions. As dimensionality grows, data points become sparse,
making it difficult to identify meaningful patterns or relationships. This negatively affects statistical
modeling, machine learning performance, and computational efficiency.
Another challenge is redundancy among variables. In many datasets, variables are highly correlated,
meaning they carry overlapping information. For example, in socioeconomic data, income, education
level, and occupation may all reflect similar underlying trends. Analyzing all correlated variables
separately can be inefficient and may distort results.
Additionally, high-dimensional data is difficult to visualize. Humans can easily interpret one-, two-, or
three-dimensional plots, but beyond that, direct visualization becomes impossible. This limits
exploratory data analysis and intuitive understanding.
Principal Component Analysis (PCA) was developed to address these challenges. Its main goals are:
• To reduce dimensionality while retaining as much relevant information as possible
• To eliminate redundancy by transforming correlated variables into uncorrelated components
• To simplify data structure, making analysis, visualization, and modeling more efficient
In summary, PCA provides a mathematically sound method to compress data without significantly
sacrificing the underlying structure or variability.
2. The Idea
The fundamental idea behind Principal Component Analysis is to re-express the data in a new coordinate
system where the axes represent directions of maximum variance.
Instead of analyzing the original variables directly, PCA constructs new variables known as principal
components. These components are linear combinations of the original variables: each principal
component is formed by multiplying each original variable by a coefficient and summing the results.
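As a minimal sketch of this construction (the data values here are hypothetical), the coefficients of the first principal component can be taken from the leading eigenvector of the covariance matrix, and each component score is then just the weighted sum described above:

```python
import numpy as np

# Toy data: 5 observations of 3 variables (hypothetical values).
X = np.array([[2.5, 2.4, 1.1],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.0],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4]])

# Center each variable at zero (PCA operates on centered data).
Xc = X - X.mean(axis=0)

# The coefficients come from the eigenvectors of the covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
w = eigvecs[:, -1]  # coefficients of the first principal component

# Each score is a linear combination: multiply each variable by its
# coefficient and sum the results.
scores = Xc @ w
```

The variance of these scores equals the largest eigenvalue of the covariance matrix, which is why this direction is "first" in the ordering discussed below.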
Key Properties of Principal Components:
1. Orthogonality
o Each principal component is orthogonal (perpendicular) to all others.
o This ensures that components are uncorrelated and capture distinct information.
2. Variance Maximization
o The first principal component captures the maximum possible variance in the data.
o Each subsequent component captures the maximum remaining variance subject to
being orthogonal to previous components.
3. Ordered Importance
o Components are ordered by importance based on how much variance they explain.
o Typically, only the first few components are required to represent most of the
information.
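All three properties can be checked numerically. The sketch below (using randomly generated correlated data, so the exact numbers are arbitrary) rotates the data into the principal-component basis and inspects the covariance matrix of the resulting scores: the off-diagonal entries are zero (orthogonality, hence uncorrelated components) and the diagonal entries are the eigenvalues in decreasing order (variance maximization and ordered importance):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated toy data: 200 observations of 3 variables.
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]  # sort by explained variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs  # scores on all principal components

# Covariance matrix of the component scores:
#  - off-diagonal entries ~0 -> components are uncorrelated (orthogonality)
#  - diagonal entries are the eigenvalues in decreasing order
#    (variance maximization and ordered importance)
S = np.cov(scores, rowvar=False)
```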
Geometrically, PCA can be understood as rotating the coordinate axes to align them with the directions
of greatest data spread. Instead of measuring variability along arbitrary original axes, PCA identifies the
most informative directions inherent in the data.
This transformation allows complex datasets to be represented in a more compact and interpretable
form without altering the relative relationships between observations.
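The claim that relative relationships are preserved can be made concrete: because the matrix of eigenvectors is orthogonal, rotating the data into the principal-component basis leaves every pairwise distance between observations unchanged. A small check (with arbitrary random data) might look like:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)

_, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ eigvecs  # full rotation into the principal axes

def pairwise_dists(A):
    # Euclidean distance between every pair of rows.
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# The rotation is orthogonal, so all pairwise distances are identical
# before and after the change of coordinates.
```

Information is lost only when components are subsequently dropped, and the ordering of components guarantees that what is dropped is the least informative part.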
3. The Method
Principal Component Analysis follows a systematic mathematical procedure. Each step is essential to
ensure accurate and meaningful results.
3.1 Data Standardization
Before applying PCA, the data is usually standardized: