At its core, dimensionality reduction is about identifying the 'axes of maximum variance' in a dataset. Imagine a cloud of data points in a 3D space that roughly forms a flattened pancake; while the data exists in three dimensions, the most critical information is captured by the plane of the pancake. By identifying the directions where the data spreads the most, we can project high-dimensional noise onto a lower-dimensional subspace without losing the fundamental structure of the signal.
The mathematical foundation for this begins with Eigen-decomposition. For a square, symmetric matrix $C$ (such as a covariance matrix), an eigenvector $v$ is a vector that, when multiplied by $C$, only changes in scale, not direction. This is expressed as $$Cv = \\lambda v$$, where $\\lambda$ is the eigenvalue. In the context of data, the eigenvector represents a principal direction of variance, and the eigenvalue $\\lambda$ quantifies how much variance exists along that axis. Larger eigenvalues correspond to the most 'important' features of the data.
While Eigen-decomposition requires a square matrix, Singular Value Decomposition (SVD) generalizes this to any $m \\× n$ matrix $A$. SVD factorizes the matrix into three distinct components: $$A = U\Sigma V^T$$. Here, $U$ is an $m \\× m$ orthogonal matrix representing the left singular vectors, $V^T$ is an $n \\× n$ orthogonal matrix representing the right singular vectors, and $\\Sigma$ is a diagonal matrix containing the singular values $\\sigma_i$ in descending order. This decomposition effectively rotates the data into a new coordinate system where the dimensions are uncorrelated.
The deep connection between SVD and Eigenvalues is revealed when we look at the product $A^T A$. If we substitute the SVD form, we get $$(U\Sigma V^T)^T (U\Sigma V^T) = V\Sigma^T U^T U\Sigma V^T$$. Since $U$ is orthogonal, $U^T U = I$, simplifying the expression to $V(\Sigma^T \Sigma)V^T$. This is the exact form of Eigen-decomposition for the covariance matrix. Consequently, the singular values $\\sigma_i$ are the square roots of the eigenvalues $\\lambda_i$ of $A^T A$, meaning SVD directly extracts the principal components of the data's variance.
To perform dimensionality reduction via Truncated SVD (the engine behind Principal Component Analysis), we discard the smallest singular values. By keeping only the top $k$ values in $\\Sigma$ and zeroing out the rest, we create a low-rank approximation $\\hat{A}$. According to the Eckart-Young-Mirsky theorem, this $\\hat{A}$ is the best possible rank-$k$ approximation of the original matrix $A$ in terms of the Frobenius norm: $$\min_{\text{rank}(\\hat{A})=k} ||A - \hat{A}||_F$$.
In practice, this process allows us to transform a feature space of thousands of dimensions into one with a handful of latent variables. For example, in Natural Language Processing, SVD is used in Latent Semantic Analysis (LSA) to group synonyms by projecting word-document matrices into a lower-dimensional 'concept space.' By focusing on the dominant singular values, we filter out the 'stochastic noise' of specific word choices and uncover the underlying thematic structure of the text.