The Geometry of Reduction: SVD and Eigenvalues

At its core, dimensionality reduction is the quest to find a lower-dimensional subspace that captures the maximum variance of a dataset. Imagine a cloud of points in 3D space that mostly lies on a tilted 2D plane; by identifying the orientation of that plane, we can describe every point using only two coordinates instead of three without losing significant information. This process relies on identifying 'principal axes'—directions along which the data stretches the most—which is precisely what Eigen-decomposition and Singular Value Decomposition (SVD) provide.

To understand this, we start with the Covariance Matrix, $\\Sigma$, of a centered data matrix $X$. The Eigen-decomposition of the covariance matrix is given by $\\Sigma = Q \\Lambda Q^T$, where $Q$ is an orthogonal matrix containing the eigenvectors and $\\Lambda$ is a diagonal matrix of eigenvalues $\\lambda_i$. Each eigenvector represents a direction of variance, and the corresponding eigenvalue $\\lambda_i$ quantifies the amount of variance along that axis. In dimensionality reduction, we keep only the $k$ eigenvectors associated with the largest eigenvalues, projecting the data onto a $k$-dimensional hyperplane.

While Eigen-decomposition is powerful, it is limited to square matrices. This is where Singular Value Decomposition (SVD) becomes the gold standard. SVD generalizes the concept to any $m \\× n$ matrix $X$, decomposing it into three distinct matrices: $X = U \\Sigma V^T$. Here, $U$ represents the left singular vectors (orthogonal), $\\Sigma$ is a diagonal matrix of singular values $\\sigma_i$, and $V^T$ represents the right singular vectors (orthogonal). Mathematically, the singular values $\\sigma_i$ are the square roots of the eigenvalues of $X^T X$.

The intuition behind $X = U \\Sigma V^T$ is that any linear transformation can be broken down into a rotation ($V^T$), a scaling ($\\Sigma$), and another rotation ($U$). In the context of data, $V$ tells us about the relationships between features, while $U$ tells us about the relationships between observations. The singular values in $\\Sigma$ are sorted in descending order, $\\sigma_1 \ge \\sigma_2 \ge \dots \ge \\sigma_r$, allowing us to see exactly how much 'energy' or information is contained in each dimension.

To perform dimensionality reduction via SVD, we use the Eckart-Young-Mirsky theorem, which states that the best rank-$k$ approximation of a matrix $X$ is obtained by setting all singular values $\\sigma_i = 0$ for $i > k$. This results in a truncated SVD: $$X_k = \\sum_{i=1}^k \\sigma_i u_i v_i^T$$ This approximation minimizes the Frobenius norm $\\| X - X_k \\|_F$, ensuring that the reduced representation is the closest possible linear approximation of the original high-dimensional space.

The deep connection between these two methods is revealed when we look at Principal Component Analysis (PCA). PCA is effectively the application of SVD to a centered data matrix. While Eigen-decomposition of the covariance matrix is the theoretical definition of PCA, computing SVD on the raw data $X$ is numerically more stable and computationally efficient. By selecting the top $k$ singular values, we effectively filter out the noise (small $\\sigma_i$) and retain the signal (large $\\sigma_i$), transforming a bloated feature set into a compact, latent representation.