To understand dimensionality reduction, we must first grasp the intuition of 'information density.' In a high-dimensional dataset, many features are often redundant or highly correlated. The goal of dimensionality reduction is to project this data onto a lower-dimensional subspace while preserving as much variance as possible. Imagine a 3D cloud of points shaped like a pancake; while the points exist in three dimensions, the 'essence' of the data lies on a 2D plane. By finding the axes along which the data stretches the most, we can discard the dimensions with the least variation without losing the structural integrity of the signal.
The mathematical foundation begins with the Eigenvalue Decomposition (EVD). For a square, symmetric matrix $C$ (such as a covariance matrix), the eigendecomposition is expressed as $$C = V \\Lambda V^T$$. Here, $V$ is an orthogonal matrix whose columns are the eigenvectors, and $\\Lambda$ is a diagonal matrix containing the eigenvalues $\\lambda_i$. These eigenvalues represent the magnitude of variance along their corresponding eigenvectors. In the context of Principal Component Analysis (PCA), we sort $\\lambda_i$ in descending order; the largest eigenvalues correspond to the 'Principal Components,' which are the directions of maximum variance in the feature space.
While EVD requires a square matrix, Singular Value Decomposition (SVD) is a generalization that applies to any $m \\× n$ matrix $A$. SVD decomposes the matrix into three distinct components: $$A = U \\Sigma V^T$$. In this formulation, $U$ is an $m \\× m$ orthogonal matrix representing the left singular vectors (related to the rows of $A$), $V$ is an $n \\× n$ orthogonal matrix representing the right singular vectors (related to the columns of $A$), and $\\Sigma$ is an $m \\× n$ rectangular diagonal matrix containing the singular values $\\sigma_i$. These singular values are the square roots of the eigenvalues of $A^T A$, effectively quantifying the 'strength' of each latent dimension.
The deep connection between SVD and Eigenvalues becomes apparent when we examine the covariance matrix $\\Sigma_{cov} = \frac{1}{n-1} A^T A$. If we substitute the SVD of $A$ into this expression, we find that the right singular vectors $V$ of $A$ are exactly the eigenvectors of the covariance matrix. Furthermore, the singular values $\\sigma_i$ are directly proportional to the eigenvalues $\\lambda_i$ of the covariance matrix via the relationship $\\lambda_i = \frac{\sigma_i^2}{n-1}$. This means SVD provides a numerically stable way to perform PCA without explicitly computing the covariance matrix, which can be computationally expensive for large datasets.
For dimensionality reduction, we employ 'Truncated SVD.' Instead of keeping all $k$ singular values, we retain only the top $r$ values (where $r < k$) and set the rest to zero. This creates the best low-rank approximation of the original matrix $A$ in terms of the Frobenius norm, known as the Eckart-Young-Mirsky theorem. The reduced representation is given by $A_r = U_r \\Sigma_r V_r^T$. By discarding the small singular values, we essentially filter out noise and keep only the most significant patterns, effectively compressing the data while maintaining its primary geometric structure.
Ultimately, the power of SVD and Eigenvalues in ML lies in their ability to decouple complex linear dependencies. By transforming the original coordinate system into the basis of singular vectors, we move from a space of correlated features to a space of uncorrelated components. This not only reduces the memory footprint and computational cost for downstream models but also mitigates the 'curse of dimensionality,' leading to better generalization and faster convergence in training deep neural networks.