At its core, dimensionality reduction is the art of finding a lower-dimensional subspace that captures the most 'important' variability of a dataset. Imagine a cloud of data points in 3D space that roughly forms a flattened ellipsoid. While the data exists in three dimensions, most of the information is concentrated along two primary axes. Dimensionality reduction seeks to identify these axes, allowing us to project the data onto a plane without losing significant structural information. This process relies on the fundamental ability of linear algebra to decompose a matrix into constituent parts that represent scaling, rotation, and projection.
To understand this, we first look at Eigen-decomposition. For a square, symmetric matrix $A$ (such as a covariance matrix), an eigenvector $v$ is a vector that, when multiplied by $A$, only changes in scale, not direction. This is expressed as $Av = \\lambda v$, where $\\lambda$ is the eigenvalue. In the context of data, the eigenvectors of the covariance matrix $\\Sigma$ represent the principal components—the directions of maximum variance. The corresponding eigenvalue $\\lambda$ quantifies how much variance exists along that direction. By keeping only the eigenvectors associated with the largest eigenvalues, we effectively discard noise and retain the signal.
However, eigen-decomposition is limited to square matrices. In real-world machine learning, our data matrix $X$ is typically rectangular, with $m$ samples and $n$ features. This is where Singular Value Decomposition (SVD) becomes essential. SVD generalizes the notion of eigen-decomposition to any matrix. It factors $X$ into three distinct matrices: $X = U\\Sigma V^T$. Here, $U$ is an orthogonal matrix representing the left singular vectors, $\\Sigma$ is a diagonal matrix of singular values, and $V^T$ is the transpose of an orthogonal matrix representing the right singular vectors.
The mathematical beauty of SVD lies in the interpretation of $U, \\Sigma,$ and $V$. The columns of $V$ are the principal directions of the data (identical to the eigenvectors of $X^T X$). The diagonal elements of $\\Sigma$, denoted as $\\sigma_i$, are the singular values, which are the square roots of the eigenvalues of $X^T X$: $\\sigma_i = \\sqrt{\\lambda_i}$. These singular values are sorted in descending order, $\\sigma_1 \\ge \\sigma_2 \\ge \\dots \\ge \\sigma_n \\ge 0$, providing a direct measure of the 'importance' of each corresponding dimension.
For dimensionality reduction, we employ a technique called Truncated SVD. Instead of keeping all $n$ singular values, we keep only the top $k$ values and set the rest to zero. This results in a low-rank approximation of the original matrix: $X \\approx U_k \\Sigma_k V_k^T$. According to the Eckart-Young-Mirsky theorem, this truncated matrix is the best possible rank-$k$ approximation of $X$ in terms of the Frobenius norm, meaning it minimizes the reconstruction error $\\|X - X_k\\|_F^2$.
The relationship between SVD and Principal Component Analysis (PCA) is intimate. PCA is essentially the application of SVD to a centered data matrix. By centering the data (subtracting the mean), the SVD of $X$ provides the principal components directly. The projection of the data into the reduced space is achieved by calculating $X_{reduced} = X V_k$, where $V_k$ contains the first $k$ columns of $V$. This transformation rotates the coordinate system so that the first axis aligns with the direction of greatest variance, the second axis with the next greatest, and so on.
In conclusion, the transition from eigenvalues to SVD allows us to move from analyzing simple square operators to analyzing complex, high-dimensional datasets. By decomposing a matrix into its singular components, we can separate the signal from the noise. Whether you are compressing an image, reducing the feature set of a genomic dataset, or implementing latent semantic analysis in NLP, you are leveraging the fact that the most critical information is stored in the largest singular values of the underlying linear transformation.