At its core, dimensionality reduction is about finding a 'simpler' representation of data that preserves its essential characteristics. Imagine a 3D cloud of points that roughly forms a flattened pancake; while the data exists in three dimensions, most of the information is contained within a 2D plane. The intuition behind Singular Value Decomposition (SVD) and Eigen-decomposition is to identify the directions of maximum variance—the axes along which the data 'spreads' the most—and discard the directions where the variation is negligible, effectively compressing the data without losing its signal.
To understand this mathematically, we start with the Eigen-decomposition of a covariance matrix. For a centered data matrix $X$, the covariance matrix is defined as $\\Sigma = rac{1}{n-1} X^T X$. An eigenvector $v$ and its corresponding eigenvalue $\\lambda$ satisfy the equation $\\Sigma v = \\lambda v$. Geometrically, $v$ represents a direction that remains unchanged in orientation when transformed by $\\Sigma$, and $\\lambda$ quantifies the variance of the data along that specific direction. By sorting eigenvalues in descending order, we can identify the principal components of the dataset.
While Eigen-decomposition requires a square, positive semi-definite matrix, Singular Value Decomposition (SVD) is a more general framework applicable to any $m \\× n$ matrix $A$. SVD factorizes the matrix into three components: $A = U \\Sigma V^T$. Here, $U$ is an $m \\× m$ orthogonal matrix containing the left singular vectors, $\\Sigma$ is an $m \\× n$ diagonal matrix containing the singular values $\\sigma_i$, and $V^T$ is the transpose of an $n \\× n$ orthogonal matrix containing the right singular vectors. This decomposition essentially rotates, scales, and rotates the data back, revealing the intrinsic geometry of the transformation.
The deep connection between SVD and Eigen-decomposition lies in their relationship via the covariance matrix. Specifically, the right singular vectors $V$ of $A$ are the eigenvectors of $A^T A$, and the singular values $\\sigma_i$ are the square roots of the eigenvalues $\\lambda_i$ of $A^T A$, such that $\\sigma_i = \\sqrt{\\lambda_i}$. This means that SVD provides a direct route to Principal Component Analysis (PCA) without explicitly calculating the covariance matrix, which is computationally advantageous and more numerically stable for large datasets.
In the context of dimensionality reduction, we use 'Truncated SVD.' Instead of keeping all singular values, we retain only the top $k$ largest singular values and set the rest to zero. This creates a low-rank approximation $A_k = \\sum_{i=1}^k \\sigma_i u_i v_i^T$. According to the Eckart-Young-Mirsky theorem, $A_k$ is the best possible rank-$k$ approximation of the original matrix $A$ in terms of the Frobenius norm, meaning it minimizes the reconstruction error $\\| A - A_k \\|_F$.
Practically, this process transforms our original high-dimensional features into a new set of uncorrelated latent variables. By projecting the original data onto the subspace spanned by the first $k$ singular vectors, we reduce the noise and eliminate redundancy. This is why SVD is the engine behind techniques like Latent Semantic Analysis (LSA) in NLP and image compression, where we discard the 'small' singular values that correspond to high-frequency noise, leaving behind the robust, structural features of the data.