The Geometry of Compression: SVD and Eigen-Decomposition in Dimensionality Reduction

At its core, dimensionality reduction is the art of finding a lower-dimensional subspace that preserves the 'essence' of a dataset. Imagine a cloud of points in 3D space that roughly forms a flattened disk; while the data technically requires three coordinates, most of the variation happens along two directions. The goal of techniques like Principal Component Analysis (PCA) is to identify these directions of maximal variance, effectively stripping away redundant or noisy dimensions without losing the underlying structure of the information.

To understand this mathematically, we begin with Eigen-decomposition. For a square, symmetric matrix $C$ (such as a covariance matrix), an eigenvector $\mathbf{v}$ and its corresponding eigenvalue $\lambda$ satisfy the equation $C\mathbf{v} = \lambda\mathbf{v}$. In the context of data, the eigenvectors represent the principal axes of the data distribution, and the eigenvalues $\lambda$ quantify the amount of variance captured along those axes. By sorting eigenvalues in descending order, we can discard the dimensions associated with small $\lambda$ values, as they contribute little to the overall signal.

However, Eigen-decomposition is limited because it requires a square matrix. Real-world data matrices $A$ of size $m \\× n$ (observations by features) are rarely square. This is where Singular Value Decomposition (SVD) becomes essential. SVD generalizes the concept of eigen-decomposition to any matrix. It factorizes $A$ into three distinct matrices: $A = U\Sigma V^T$. Here, $U$ is an $m \\× m$ orthogonal matrix of left singular vectors, $\Sigma$ is an $m \\× n$ diagonal matrix of singular values, and $V^T$ is the transpose of an $n \\× n$ orthogonal matrix of right singular vectors.

The power of SVD lies in the diagonal matrix $\Sigma$. The singular values $\sigma_i$ are the square roots of the eigenvalues of $A^T A$. They are arranged in descending order $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0$, providing a direct measure of the 'energy' or importance of each dimension. The vectors in $V$ represent the principal directions in the original feature space, while $U$ represents the projection of the data onto these directions. Together, they provide a complete geometric description of the transformation represented by $A$.

Dimensionality reduction is achieved through a process called 'Truncated SVD'. By keeping only the top $k$ largest singular values and zeroing out the rest, we create an approximation $\hat{A} = U_k \Sigma_k V_k^T$. According to the Eckart-Young-Mirsky theorem, this truncated matrix is the best possible rank-$k$ approximation of the original matrix $A$ in terms of the Frobenius norm. This effectively compresses the data while minimizing the mean squared reconstruction error.

The relationship between SVD and PCA is intrinsic. If we center our data $A$ by subtracting the mean of each feature, the covariance matrix is proportional to $A^T A$. The eigenvectors of the covariance matrix are exactly the right singular vectors $V$ from the SVD of $A$. Thus, SVD provides a more numerically stable and computationally efficient way to perform PCA without explicitly calculating the costly covariance matrix $A^T A$, making it the engine behind modern data compression and noise filtering.