The Geometry of Variance: SVD and Eigen-Decomposition in Dimensionality Reduction

At its core, dimensionality reduction is about finding a lower-dimensional subspace that preserves the 'essence' of the original data. Imagine a cloud of points in 3D space that mostly lies along a tilted plane; dimensionality reduction allows us to project these points onto that 2D plane while losing minimal information. The intuition behind Singular Value Decomposition (SVD) and Eigen-decomposition is to identify the directions of maximum variance. By aligning our coordinate system with these directions, we can discard the dimensions that contribute the least to the overall structure of the data.

To begin mathematically, consider a data matrix $A$ of size $m \\× n$. The Singular Value Decomposition factorizes $A$ into three distinct matrices: $$A = U\Sigma V^T$$ Here, $U$ is an $m \\× m$ orthogonal matrix whose columns are the left-singular vectors, $V$ is an $n \\× n$ orthogonal matrix whose columns are the right-singular vectors, and $\Sigma$ is an $m \\× n$ diagonal matrix. The diagonal entries of $\Sigma$, denoted as $\sigma_i$, are the singular values, typically ordered such that $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0$.

The relationship between SVD and Eigen-decomposition is profound. If we look at the covariance matrix of our data, $C = A^T A$ (assuming $A$ is centered), the eigenvectors of $C$ are exactly the right-singular vectors $V$ from the SVD. Furthermore, the eigenvalues $\lambda_i$ of the covariance matrix are the squares of the singular values: $\lambda_i = \sigma_i^2$. While Eigen-decomposition only applies to square matrices, SVD generalizes this power to any rectangular matrix, making it the workhorse of practical machine learning.

In the context of dimensionality reduction, we are interested in the 'energy' or variance captured by each dimension. The total variance of the dataset is proportional to the sum of the squared singular values: $\sum \sigma_i^2$. To reduce dimensionality from $n$ to $k$, we keep only the first $k$ columns of $U$ and $V$ and the first $k \\× k$ submatrix of $\Sigma$. This results in the best rank-$k$ approximation of the original matrix $A$ in terms of the Frobenius norm, a result known as the Eckart-Young-Mirsky theorem.

Mathematically, the truncated SVD projection is represented as $A_k = U_k \Sigma_k V_k^T$. This transformation effectively filters out the noise. If a singular value $\sigma_i$ is very small, it indicates that the data varies very little in the direction of the corresponding singular vector $v_i$. By setting these small values to zero, we eliminate dimensions that contain more noise than signal, leading to more robust models and reduced computational overhead.

Principal Component Analysis (PCA) is the most famous application of these concepts. PCA is essentially SVD applied to a mean-centered data matrix. By projecting the data onto the first few principal components (the columns of $V$), we transform the data into a new coordinate system where the features are uncorrelated and ordered by their importance. The projection of a data point $x$ onto the reduced space is given by $z = V_k^T x$, where $V_k$ contains the top $k$ eigenvectors of the covariance matrix.

To summarize, while Eigen-decomposition helps us understand the internal stability and variance of square operators, SVD provides a universal tool for decomposing any linear transformation. In dimensionality reduction, these tools allow us to move from a high-dimensional space where data is sparse and noisy to a compact representation that captures the underlying latent structure. This is the foundation for everything from image compression to latent semantic analysis in NLP.