At its core, dimensionality reduction is about finding a lower-dimensional subspace that preserves the maximum amount of 'information'—which we mathematically define as variance. Imagine a cloud of data points in 3D space that roughly forms a flattened pancake. While the data exists in three dimensions, most of its behavior can be captured by a 2D plane. The goal is to identify the axes along which the data stretches the most; these axes represent the primary signals, while the directions with minimal stretch represent noise or redundancy.
To formalize this, we begin with the Singular Value Decomposition (SVD). For any $m \\× n$ matrix $A$, the SVD decomposes it into three constituent matrices: $A = U \Sigma V^T$. Here, $U$ is an $m \\× m$ orthogonal matrix containing the left singular vectors, $\Sigma$ is an $m \\× n$ diagonal matrix containing non-negative singular values $\sigma_i$, and $V^T$ is the transpose of an $n \\× n$ orthogonal matrix containing the right singular vectors. Geometrically, this tells us that any linear transformation can be broken down into a rotation, a scaling along the principal axes, and another rotation.
The connection to Eigenvalue Decomposition (EVD) emerges when we examine the covariance matrix of our data. If $A$ is centered (mean subtracted), the covariance matrix is proportional to $A^T A$. By substituting the SVD into this product, we get $A^T A = (V \Sigma^T U^T)(U \Sigma V^T)$. Since $U$ is orthogonal, $U^T U = I$, simplifying the expression to $A^T A = V \Sigma^T \Sigma V^T$. This is exactly the form of the EVD, where the columns of $V$ are the eigenvectors and the diagonal elements of $\Sigma^T \Sigma$ are the eigenvalues $\lambda_i = \sigma_i^2$.
The eigenvalues $\lambda_i$ are the critical metrics for dimensionality reduction. Each eigenvalue represents the amount of variance captured by its corresponding eigenvector (or principal component). Because the SVD sorts singular values such that $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n$, the first few columns of $V$ correspond to the directions of maximum variance. In practical terms, these are the 'most important' features of the dataset, while the smallest singular values correspond to directions that contribute little to the overall structure.
To perform dimensionality reduction, we employ a technique called 'Truncated SVD'. We keep only the top $k$ singular values and set the rest to zero, effectively projecting the data from $n$ dimensions down to $k$. The reduced-rank approximation is $\hat{A} = U_k \Sigma_k V_k^T$. According to the Eckart-Young-Mirsky theorem, this truncated matrix is the best possible rank-$k$ approximation of the original matrix $A$ in terms of the Frobenius norm, ensuring that we lose the minimum amount of information possible.
In summary, while the EVD of a covariance matrix provides the theoretical foundation for Principal Component Analysis (PCA), SVD is the numerically stable engine that implements it. By decomposing a dataset into its singular values and vectors, we can strip away the noise, reduce computational complexity, and uncover the latent structure of high-dimensional data. The transition from $A$ to $U_k \Sigma_k V_k^T$ transforms a dense, complex sea of numbers into a streamlined set of primary signals.