Singular Value Decomposition and Eigenvalues: The Bedrock of Dimensionality Reduction

At its core, dimensionality reduction is about finding a lower-dimensional subspace that preserves the most significant patterns of a dataset. Imagine a cloud of data points in a 3D space that mostly lie along a slanted plane; it is more efficient to describe that data using two coordinates on that plane rather than three coordinates in space. The central intuition is that not all dimensions are created equal. Some directions capture the bulk of the variance (the signal), while others capture negligible noise. SVD and Eigen-analysis are the tools we use to identify these 'principal directions'.

To understand the mathematics, we first look at Eigendecomposition. For a square, symmetric matrix $A$ (such as a covariance matrix), we seek a vector $v$ and a scalar $\lambda$ such that $Av = \lambda v$. This equation tells us that the transformation $A$ only scales the vector $v$ without changing its direction. The scalar $\lambda$ is the eigenvalue, and $v$ is the eigenvector. In a high-dimensional dataset, the eigenvector associated with the largest eigenvalue represents the direction of maximal variance. By projecting data onto these top $k$ eigenvectors, we reduce dimensions while minimizing information loss.

While Eigendecomposition requires square matrices, Singular Value Decomposition (SVD) generalizes this to any $m \\× n$ matrix $X$. SVD factorizes $X$ into three distinct matrices: $$X = U\Sigma V^T$$ Here, $U$ is an $m \\× m$ orthogonal matrix representing the left singular vectors, $\Sigma$ is an $m \\× n$ diagonal matrix containing the singular values $\sigma_i$, and $V^T$ is the transpose of an $n \\× n$ orthogonal matrix representing the right singular vectors. This decomposition effectively decomposes the data into a sum of rank-1 matrices, ordered by their importance.

The connection between SVD and eigenvalues is profound. If we consider the covariance matrix $C = X^T X$ (assuming $X$ is centered), the eigenvalues of $C$ are exactly the squares of the singular values of $X$, i.e., $\lambda_i = \sigma_i^2$. This means the right singular vectors $V$ from SVD are the same as the eigenvectors of the covariance matrix. SVD is numerically more stable than explicitly computing $X^T X$, which is why it is the preferred engine for implementing Principal Component Analysis (PCA).

For dimensionality reduction, we apply 'Truncated SVD'. Since the singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n$ are sorted in descending order, the first few values capture the majority of the data's energy. By keeping only the top $k$ singular values and zeroing out the rest, we obtain the best rank-$k$ approximation of the original matrix $X$ in terms of the Frobenius norm: $$X_k = \sum_{i=1}^k \sigma_i u_i v_i^T$$ This process filters out noise and reduces the feature space from $n$ to $k$.

In practical machine learning, this mathematical framework allows us to handle the 'curse of dimensionality'. By transforming the original features into a set of orthogonal principal components, we eliminate multicollinearity and reduce the computational burden on downstream models. Whether it is Latent Semantic Analysis in NLP or image compression in computer vision, the ability to project high-dimensional noise into a low-dimensional signal via $\Sigma$ is what enables efficient and scalable AI.