The Geometry of Information: SVD, Eigenvalues, and Dimensionality Reduction

To understand dimensionality reduction, we must first conceptualize data as a cloud of points in high-dimensional space. Often, this data is redundant; features are correlated, meaning the 'true' information lives on a lower-dimensional manifold. The intuition behind Singular Value Decomposition (SVD) is to find a new coordinate system—a set of orthogonal axes—where the first axis captures the maximum variance, the second captures the remaining maximum variance, and so on. By discarding axes that contribute negligible variance, we project the data into a lower-dimensional space while preserving its essential structure.

Mathematically, SVD generalizes the concept of eigendecomposition to any $m \\× n$ matrix $A$. The decomposition is expressed as: $$A = U \Sigma V^T$$ where $U$ is an $m \\× m$ orthogonal matrix representing the left singular vectors, $\Sigma$ is an $m \\× n$ diagonal matrix containing the singular values $\sigma_i$, and $V^T$ is the transpose of an $n \\× n$ orthogonal matrix containing the right singular vectors. The singular values in $\Sigma$ are non-negative and typically sorted in descending order: $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0$.

The connection between SVD and eigenvalues is profound. Consider the covariance matrix of centered data, $C = \frac{1}{n-1} A^T A$. The eigenvectors of this square, symmetric matrix are exactly the right singular vectors $V$ from the SVD. Furthermore, the eigenvalues $\lambda_i$ of $A^T A$ are the squares of the singular values of $A$, such that $\lambda_i = \sigma_i^2$. This implies that the singular values represent the 'magnitude' of the data's spread along the principal axes defined by the eigenvectors.

In the context of Principal Component Analysis (PCA), dimensionality reduction is achieved through the Eckart-Young-Mirsky theorem. This theorem states that the best rank-$k$ approximation of a matrix $A$ (in terms of the Frobenius norm) is obtained by keeping only the top $k$ singular values and setting the rest to zero. The truncated SVD is represented as: $$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$$ This effectively filters out 'noise' and retains only the signal corresponding to the most significant directions of variance.

The computational power of this approach lies in the projection. To reduce $n$ features to $k$ dimensions, we project the original data $A$ onto the subspace spanned by the first $k$ columns of $V$. The reduced representation $Z$ is computed as: $$Z = A V_k$$ where $V_k$ consists of the first $k$ eigenvectors of $A^T A$. This transformation decorrelates the features, ensuring that each new principal component is orthogonal to the others, thus eliminating multicollinearity.

Finally, it is crucial to recognize the trade-off between compression and information loss. The 'energy' or variance captured by the first $k$ components is given by the ratio $\frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^n \sigma_i^2}$. When this ratio is close to 1, we can discard the remaining dimensions with minimal loss of information. SVD and eigenvalue decomposition thus provide a rigorous mathematical framework for distilling complexity, enabling efficient machine learning on datasets that would otherwise be computationally intractable.