At its core, dimensionality reduction is about identifying the 'essence' of a dataset by discarding noise and redundancy. Imagine a high-dimensional cloud of data points; often, these points don't fill the space uniformly but instead lie along a few dominant directions. If we can identify these directions—the axes of maximum variance—we can project the data onto a lower-dimensional subspace without losing significant information. This process is fundamentally an exercise in linear algebra, where we seek to decompose a complex transformation into its most basic constituent parts.
To understand this, we first look at Eigen-decomposition. For a square, symmetric matrix $C$ (such as a covariance matrix), we seek vectors $v$ that, when multiplied by $C$, only change in scale, not direction. This is expressed as $Cv = \\lambda v$, where $\\lambda$ is the eigenvalue and $v$ is the eigenvector. In the context of data, the eigenvectors represent the principal axes of the data's distribution, and the magnitude of the corresponding eigenvalue $\\lambda$ quantifies how much variance exists along that axis. By keeping only the eigenvectors associated with the largest eigenvalues, we perform Principal Component Analysis (PCA).
While eigen-decomposition is powerful, it is limited to square matrices. Singular Value Decomposition (SVD) generalizes this concept to any $m \\× n$ matrix $A$. SVD factorizes the matrix into three components: $A = U \Sigma V^T$. Here, $U$ is an $m \\× m$ orthogonal matrix representing the left singular vectors, $\Sigma$ is an $m \\× n$ diagonal matrix containing the singular values $\sigma_i$, and $V^T$ is the transpose of an $n \\× n$ orthogonal matrix representing the right singular vectors. Geometrically, SVD tells us that any linear transformation can be broken down into a rotation, a scaling, and another rotation.
The relationship between SVD and eigenvalues is profound. If we consider the covariance matrix $C = A^T A$, its eigenvalues are exactly the squares of the singular values of $A$, i.e., $\\lambda_i = \\sigma_i^2$. The right singular vectors $V$ of $A$ are the eigenvectors of $A^T A$. This means SVD provides a direct numerical path to finding the principal components of a dataset without explicitly computing the covariance matrix, which is often computationally expensive and numerically unstable for large datasets.
For dimensionality reduction, we employ the Eckart-Young-Mirsky theorem, which states that the best low-rank approximation of a matrix is obtained by keeping the $k$ largest singular values and setting the rest to zero. We define the truncated SVD as: $$A_k = \\sum_{i=1}^k \\sigma_i u_i v_i^T$$ This $A_k$ is the closest matrix to $A$ (in terms of the Frobenius norm) that has rank $k$. By choosing $k \ll n$, we effectively compress the data, retaining only the most significant structural patterns while filtering out low-energy noise.
In practice, this translates to transforming our original features into a new coordinate system where the features are uncorrelated. If we project our data $A$ onto the first $k$ columns of $V$, we obtain a reduced representation $Z = A V_k$. This transformation ensures that we maximize the captured variance in the reduced space. Whether you are compressing images, reducing noise in gene expression data, or latent semantic indexing in NLP, you are leveraging the spectral properties of matrices to find the hidden, low-dimensional manifold where the true signal resides.