Singular Value Decomposition and Eigendecomposition: The Engines of Dimensionality Reduction

At its core, dimensionality reduction is about finding a 'simpler' representation of data that retains the most important information. Imagine a cloud of data points in 3D space that mostly lies along a flat, tilted plane. While the data is technically three-dimensional, its primary structure is two-dimensional. The goal of techniques like Principal Component Analysis (PCA) is to identify the axes along which the data varies the most. These axes are the 'principal components,' and they are derived using the concepts of eigenvalues and Singular Value Decomposition (SVD).

To understand this mathematically, we start with the eigendecomposition of a square, symmetric matrix, such as the covariance matrix $C$. If we have a data matrix $X$ whose $n$ rows are samples (centered so that each column has a mean of zero), the covariance matrix is defined as $$C = \frac{1}{n-1} X^T X$$. An eigenvector $v$ and its corresponding eigenvalue $\lambda$ satisfy the eigenvalue equation: $$Cv = \lambda v$$. Intuitively, $v$ is a direction that $C$ only stretches or shrinks without rotating it, and, for a unit-length $v$, $\lambda$ is the variance of the data along that direction.
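
As a minimal sketch of the definitions above, the following NumPy snippet builds a covariance matrix from synthetic data and checks the eigenvalue equation numerically. The data, the random seed, and the variable names are illustrative assumptions, not part of the original discussion.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3  # assumed sizes: 200 samples, 3 features

# Synthetic data with very different spread along each axis
X = rng.normal(size=(n, d)) * np.array([3.0, 1.0, 0.1])
X = X - X.mean(axis=0)  # center each column

# Covariance matrix C = X^T X / (n - 1)
C = X.T @ X / (n - 1)

# eigh is intended for symmetric matrices; it returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]  # re-sort to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Residual of C v = lambda v for all eigenpairs at once
residual = np.linalg.norm(C @ eigvecs - eigvecs * eigvals)

# Sample variance of the data projected onto each eigenvector
projected_var = (X @ eigvecs).var(axis=0, ddof=1)
```

Here `projected_var` should agree with `eigvals`, which is the numerical counterpart of saying that each eigenvalue is the variance along its eigenvector.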

While eigendecomposition is powerful, it requires a square matrix. Singular Value Decomposition (SVD) generalizes this to any $m \times n$ matrix $A$. SVD decomposes the matrix into three factors: $$A = U \Sigma V^T$$. Here, $U$ is an $m \times m$ orthogonal matrix whose columns are the left singular vectors, $\Sigma$ is an $m \times n$ rectangular diagonal matrix holding the non-negative singular values $\sigma_i$, and $V^T$ is the transpose of an $n \times n$ orthogonal matrix whose columns are the right singular vectors. The nonzero singular values are the square roots of the nonzero eigenvalues of $A^T A$, which coincide with those of $AA^T$.
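
The shapes and properties described above can be checked with a small sketch. The matrix sizes and random input are assumptions chosen for illustration; `np.linalg.svd` returns the singular values as a 1-D array, so $\Sigma$ is rebuilt by hand.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 3  # assumed sizes for a small rectangular example
A = rng.normal(size=(m, n))

# Full SVD: U is m x m, Vt is n x n, s holds min(m, n) singular values
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the m x n rectangular diagonal matrix Sigma
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

reconstruction = U @ Sigma @ Vt

# Eigenvalues of A^T A, sorted descending and clipped at zero to guard
# against tiny negative values caused by round-off
eig_AtA = np.clip(np.linalg.eigvalsh(A.T @ A)[::-1], 0.0, None)
```

With these names, `reconstruction` should match `A`, and `np.sqrt(eig_AtA)` should match `s`.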

The magic of SVD for dimensionality reduction lies in the ordering of the singular values in $\Sigma$. By convention, singular values are sorted such that $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r > 0$, where $r$ is the rank of the matrix. The squared singular value $\sigma_i^2$ tells us how much of the data's total energy (or variance, for centered data) is captured by the $i$-th singular direction; as a fraction of the total, this is $\sigma_i^2 / \sum_j \sigma_j^2$. To reduce dimensions, we perform a 'truncated SVD' by keeping only the top $k$ singular values and setting the rest to zero, effectively projecting the data onto the $k$ most significant directions.
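
One way to sketch truncation in code is to pick the smallest $k$ whose cumulative energy passes a threshold. The 99% threshold, the synthetic low-rank-plus-noise matrix, and all names below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic matrix: a rank-4 signal plus a small amount of noise
A = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 20))
A = A + 0.01 * rng.normal(size=(50, 20))

# Thin SVD; singular values come back sorted in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Fraction of total energy captured by each singular direction
energy = s**2 / np.sum(s**2)
cumulative = np.cumsum(energy)

# Smallest k whose cumulative energy reaches the (assumed) 99% threshold
k = int(np.searchsorted(cumulative, 0.99) + 1)

# Truncated reconstruction: keep the top-k singular triplets only
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Reduced k-dimensional coordinates of each row of A
Z = A @ Vt[:k, :].T
```

The reduced coordinates `Z` can equivalently be computed as `U[:, :k] * s[:k]`, which avoids a second multiplication by the data.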

The relationship between SVD and PCA is direct. When we apply SVD to a centered data matrix $X$, the columns of $V$ (the right singular vectors) are eigenvectors of $X^T X$, and therefore also of the covariance matrix $C = \frac{1}{n-1} X^T X$, since scaling a matrix does not change its eigenvectors. The singular values $\sigma_i$ are related to the eigenvalues $\lambda_i$ of $C$ by the formula $\lambda_i = \frac{\sigma_i^2}{n-1}$. This means SVD provides a numerically more stable way to compute the principal components without explicitly forming $X^T X$, a step that can be expensive when the number of features is large and that squares the condition number of the problem, which costs precision.
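
The equivalence can be verified numerically. The sketch below computes the principal components both ways on assumed synthetic data; eigenvectors are only defined up to sign, so the comparison uses the absolute value of their dot products.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 4  # assumed sizes: 100 samples, 4 features

# Correlated synthetic data, then centered
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
C = X.T @ X / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order

# Route 2: SVD of the centered data, without forming X^T X
_, s, Vt = np.linalg.svd(X, full_matrices=False)
lam_from_svd = s**2 / (n - 1)

# |v_i . v_i'| should be 1 when the two routes find the same direction
alignment = np.abs(np.sum(eigvecs * Vt.T, axis=0))
```

Both routes should give the same eigenvalues (`lam_from_svd` versus `eigvals`) and the same directions (`alignment` close to 1 in every entry).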

In a practical setting, these techniques help mitigate the 'curse of dimensionality.' For instance, in image compression, a grayscale image can be viewed as an $m \times n$ matrix of pixel intensities. By applying SVD and keeping only the top $k$ components, we store $k(m + n + 1)$ numbers instead of $mn$, a large saving when $k$ is much smaller than $m$ and $n$, while maintaining the visible structure. The approximation $\hat{A} = \sum_{i=1}^k \sigma_i u_i v_i^T$ is the best possible rank-$k$ approximation of the original matrix $A$ in the Frobenius norm (and also in the spectral norm), a result known as the Eckart-Young-Mirsky theorem.
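
As a rough illustration of the compression idea, the sketch below applies a rank-$k$ approximation to a synthetic stand-in for a grayscale image. The image pattern, its size, the choice $k = 5$, and the helper name `rank_k_approx` are all assumptions made for the example. It also checks the known consequence of the theorem that the Frobenius error equals the square root of the sum of the discarded $\sigma_i^2$.

```python
import numpy as np

def rank_k_approx(A, k):
    """Return the rank-k truncated-SVD approximation of A and all singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :], s

# Synthetic "image": smooth pattern plus mild noise (stand-in for real pixels)
m, n = 64, 96
y, x = np.mgrid[0:m, 0:n]
noise = 0.05 * np.random.default_rng(4).normal(size=(m, n))
image = np.sin(x / 8.0) + np.cos(y / 5.0) + noise

k = 5
approx, s = rank_k_approx(image, k)

# Actual Frobenius error versus the value predicted from the discarded singular values
frob_error = np.linalg.norm(image - approx, "fro")
predicted_error = np.sqrt(np.sum(s[k:] ** 2))

# Numbers stored: full matrix versus k left vectors, k right vectors, k singular values
original_storage = m * n
compressed_storage = k * (m + n + 1)
```

For a real photograph the singular values usually decay more slowly than in this smooth pattern, so the $k$ needed for acceptable quality has to be chosen by inspecting the error or the image itself.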