Singular Value Decomposition (SVD) and Eigenvalues in Dimensionality Reduction

At its core, dimensionality reduction is about finding a lower-dimensional subspace that captures the maximum variance of a dataset. Imagine a 3D cloud of points that roughly forms a flattened pancake; while the data exists in three dimensions, most of the 'information' is contained in the 2D plane of that pancake. The goal of methods like Principal Component Analysis (PCA) is to rotate the coordinate system so that the first axis aligns with the direction of maximum spread, the second axis aligns with the next most significant spread, and so on. This allows us to discard dimensions that contribute little to the overall structure of the data.

To understand this mathematically, we begin with the Eigenvalue Decomposition (EVD). For a square, symmetric matrix $C$ (such as a covariance matrix), the EVD is defined as $C = V \\Lambda V^T$, where $V$ is an orthogonal matrix whose columns are eigenvectors and $\\Lambda$ is a diagonal matrix of eigenvalues. An eigenvector $\\mathbf{v}$ satisfies the equation $C\\mathbf{v} = \\lambda \\mathbf{v}$, implying that the transformation $C$ only scales the vector $\\mathbf{v}$ by a factor of $\\lambda$. In dimensionality reduction, the eigenvalue $\\lambda$ represents the amount of variance captured along the direction of its corresponding eigenvector $\\mathbf{v}$.

While EVD requires square matrices, Singular Value Decomposition (SVD) generalizes this to any matrix $A$ of size $m \\× n$. SVD factors the matrix into three components: $A = U \\Sigma V^T$. Here, $U$ is an $m \\× m$ orthogonal matrix (left singular vectors), $\\Sigma$ is an $m \\× n$ diagonal matrix containing non-negative singular values $\\sigma_i$ in descending order, and $V^T$ is an $n \\× n$ orthogonal matrix (right singular vectors). Essentially, SVD decomposes a linear transformation into a rotation, a scaling, and another rotation.

The deep connection between SVD and EVD lies in the construction of the covariance matrix. If $A$ is a centered data matrix, the covariance matrix is proportional to $A^T A$. Substituting the SVD of $A$ into this product gives: $$A^T A = (V \\Sigma^T U^T)(U \\Sigma V^T) = V \\Sigma^T (U^T U) \\Sigma V^T = V (\\Sigma^T \\Sigma) V^T$$ Because $U^T U = I$, the right singular vectors $V$ of $A$ are exactly the eigenvectors of $A^T A$, and the squared singular values $\\sigma_i^2$ are the eigenvalues $\\lambda_i$ of the covariance matrix.

For dimensionality reduction, we apply the Eckart-Young-Mirsky theorem, which states that the best rank-$k$ approximation of a matrix $A$ (in terms of the Frobenius norm) is obtained by keeping only the top $k$ singular values and setting the rest to zero. This truncated SVD produces a matrix $A_k = \\sum_{i=1}^{k} \\sigma_i \\mathbf{u}_i \\mathbf{v}_i^T$. By projecting the data onto the subspace spanned by the first $k$ vectors in $V$, we reduce the feature space while minimizing the mean squared reconstruction error.

In practice, this means we can compress massive datasets by retaining only the 'dominant' singular values. If the singular values decay rapidly, a very small $k$ can represent the vast majority of the data's variance. This process not only reduces storage and computational costs but also acts as a form of regularization, filtering out high-frequency noise that typically resides in the smaller singular values, thereby improving the generalization of machine learning models.