The Geometry of Latent Space: Bayesian Inference and the Variational Autoencoder

At its core, Bayesian inference is about updating our beliefs about a set of hidden parameters $\theta$ given some observed data $x$. According to Bayes' Theorem, the posterior distribution is expressed as $p(\theta | x) = \frac{p(x | \theta) p(\theta)}{p(x)}$. The fundamental challenge in high-dimensional spaces is the 'evidence' or marginal likelihood, $p(x) = \int p(x | \theta) p(\theta) \, d\theta$. In most complex models, this integral is computationally intractable, meaning we cannot analytically determine the exact distribution of the latent variables that generated our data.
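
To make the evidence integral concrete, here is a minimal sketch on a toy model chosen purely for illustration (it is an assumption, not part of the discussion above): a prior $p(\theta) = \mathcal{N}(0, 1)$ and a likelihood $p(x | \theta) = \mathcal{N}(x; \theta, 1)$. This model is conjugate, so the exact answer $p(x) = \mathcal{N}(x; 0, 2)$ is available for comparison; in realistic models no such closed form exists, and naive sampling from the prior tends to need far more samples as the dimension of $\theta$ grows.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def evidence_monte_carlo(x, num_samples=100_000, seed=0):
    """Estimate p(x) = E_{theta ~ p(theta)}[p(x | theta)] by sampling the prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.gauss(0.0, 1.0)         # theta ~ p(theta) = N(0, 1)
        total += normal_pdf(x, theta, 1.0)  # p(x | theta) = N(x; theta, 1)
    return total / num_samples

def evidence_exact(x):
    """Closed form, available only because this toy model is conjugate."""
    return normal_pdf(x, 0.0, 2.0)

x_obs = 1.5  # illustrative observation
estimate = evidence_monte_carlo(x_obs)
exact = evidence_exact(x_obs)
print(f"Monte Carlo: {estimate:.4f}, exact: {exact:.4f}")
```

The two numbers should agree closely in one dimension; the point of the sketch is only to show what the integral computes, not to suggest this estimator scales.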

To circumvent this intractability, we use Variational Inference (VI). Instead of computing the exact posterior $p(\theta | x)$, we introduce a simpler, parameterized distribution $q_{\phi}(\theta | x)$ and attempt to make it as close as possible to the true posterior. A standard measure of dissimilarity between two distributions is the Kullback-Leibler (KL) divergence, which is zero only when the two distributions coincide and is not symmetric in its arguments. Our goal is to minimize $\text{KL}(q_{\phi}(\theta | x) || p(\theta | x))$, which effectively turns an inference problem (integration) into an optimization problem (differentiation).
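
As a small illustration of the quantity being minimized, the sketch below evaluates the closed-form KL divergence between two univariate Gaussians. The function name and the example parameters are illustrative assumptions; the formula itself is the standard one for Gaussians. It also shows that swapping the arguments changes the value, which is why the direction $\text{KL}(q || p)$ matters.

```python
import math

def kl_gaussian(mean_q, std_q, mean_p, std_p):
    """KL(q || p) for univariate Gaussians q = N(mean_q, std_q^2), p = N(mean_p, std_p^2)."""
    return (
        math.log(std_p / std_q)
        + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
        - 0.5
    )

# Illustrative parameters: q = N(0, 1), p = N(1, 4).
forward = kl_gaussian(0.0, 1.0, 1.0, 2.0)  # KL(q || p)
reverse = kl_gaussian(1.0, 2.0, 0.0, 1.0)  # KL(p || q)
print(f"KL(q||p) = {forward:.4f}, KL(p||q) = {reverse:.4f}")
```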

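However, minimizing this KL divergence directly is not possible because it requires knowing $p(\theta | x)$, the very thing we are trying to find. Substituting $p(\theta | x) = p(x, \theta) / p(x)$ into the definition of the KL divergence, we can decompose the log-marginal likelihood as: $\log p(x) = \text{KL}(q_{\phi}(\theta | x) || p(\theta | x)) + \mathcal{L}(\phi, \theta)$. Since the KL divergence is always non-negative, the term $\mathcal{L}(\phi, \theta)$ acts as a lower bound on the log-likelihood of the data. This term is known as the Evidence Lower Bound, or ELBO. Because $\log p(x)$ does not depend on $\phi$, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence to the true posterior, and the ELBO involves only quantities we can evaluate or estimate by sampling.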
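
For readers who want the intermediate steps, the decomposition can be written out as follows. This is a sketch in the notation used above; the only ingredients are the definition of the KL divergence, the identity $p(\theta | x) = p(x, \theta) / p(x)$, and the fact that $\log p(x)$ is constant with respect to $\theta$ and so passes through the expectation.

```latex
\begin{aligned}
\text{KL}\big(q_{\phi}(\theta | x) \,\|\, p(\theta | x)\big)
  &= \mathbb{E}_{q_{\phi}(\theta | x)}\big[\log q_{\phi}(\theta | x) - \log p(\theta | x)\big] \\
  &= \mathbb{E}_{q_{\phi}(\theta | x)}\big[\log q_{\phi}(\theta | x) - \log p(x, \theta)\big] + \log p(x) \\
  &= -\mathcal{L}(\phi, \theta) + \log p(x),
\end{aligned}
\qquad \text{where} \qquad
\mathcal{L}(\phi, \theta) := \mathbb{E}_{q_{\phi}(\theta | x)}\big[\log p(x, \theta) - \log q_{\phi}(\theta | x)\big].
```

Rearranging the last line gives $\log p(x) = \text{KL} + \mathcal{L}$, and the definition of $\mathcal{L}$ makes clear that it never references the intractable posterior.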

The ELBO decomposes into two intuitive components: $\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_{\phi}(\theta | x)}[\log p_{\theta}(x | \theta)] - \text{KL}(q_{\phi}(\theta | x) || p(\theta))$. (The notation is overloaded here: the subscript on $p_{\theta}$ and the second argument of $\mathcal{L}$ refer to the parameters of the likelihood model, which share the symbol $\theta$ with the latent variable.) The first term is the 'reconstruction term'; it encourages the model to maximize the probability of the data given the latent variables (ensuring the output looks like the input). The second term is the 'regularization term'; it pulls the approximate posterior toward the prior distribution $p(\theta)$, typically a standard normal distribution $\mathcal{N}(0, I)$, which discourages the model from memorizing each input as its own isolated point in latent space.
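
The two terms can be checked numerically on a toy model where everything has a closed form. The sketch below assumes, purely for illustration, a one-dimensional model with prior $\mathcal{N}(0, 1)$, likelihood $p(x | \theta) = \mathcal{N}(x; \theta, 1)$, and a Gaussian $q$; the function names are hypothetical. In this model the exact posterior is $\mathcal{N}(x/2, 1/2)$, so we can confirm that the ELBO sits below $\log p(x)$ for an arbitrary $q$ and touches it when $q$ equals the posterior.

```python
import math

LOG_2PI = math.log(2.0 * math.pi)

def reconstruction_term(x, mean_q, std_q):
    """E_q[log p(x | theta)] for p(x | theta) = N(x; theta, 1), q = N(mean_q, std_q^2)."""
    return -0.5 * LOG_2PI - 0.5 * ((x - mean_q) ** 2 + std_q ** 2)

def regularization_term(mean_q, std_q):
    """KL(q || p(theta)) with a standard normal prior p(theta) = N(0, 1)."""
    return 0.5 * (std_q ** 2 + mean_q ** 2 - 1.0 - math.log(std_q ** 2))

def elbo(x, mean_q, std_q):
    """ELBO = reconstruction term minus regularization term."""
    return reconstruction_term(x, mean_q, std_q) - regularization_term(mean_q, std_q)

def log_evidence(x):
    """Exact log p(x) = log N(x; 0, 2), known only because the toy model is conjugate."""
    return -0.5 * math.log(4.0 * math.pi) - x ** 2 / 4.0

x_obs = 1.5  # illustrative observation
loose = elbo(x_obs, mean_q=0.3, std_q=0.8)                     # an arbitrary q
tight = elbo(x_obs, mean_q=x_obs / 2.0, std_q=math.sqrt(0.5))  # q equals the true posterior
print(f"log p(x) = {log_evidence(x_obs):.4f}, loose ELBO = {loose:.4f}, tight ELBO = {tight:.4f}")
```

The gap between `loose` and `log_evidence(x_obs)` is exactly the KL divergence from that $q$ to the true posterior, which is what optimization over $\phi$ is meant to shrink.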

The Variational Autoencoder (VAE) implements this framework using neural networks. The 'Encoder' network, with weights $\phi$, outputs the mean $\mu$ and variance $\sigma^2$ that parameterize $q_{\phi}(\theta | x)$, typically a diagonal Gaussian. To allow gradients to flow through the random sampling process of $\theta \sim q_{\phi}(\theta | x)$, we use the 'reparameterization trick'. We express $\theta$ as $\theta = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. This shifts the stochasticity to an external variable $\epsilon$, making the mapping from the encoder outputs to the latent vector $\theta$ deterministic and differentiable.
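
The effect of the reparameterization trick can be seen without any neural network. The sketch below is a one-dimensional illustration under stated assumptions: the function $f(\theta) = \theta^2$ is a hypothetical stand-in for whatever loss the decoder would produce, chosen because $\mathbb{E}[\theta^2] = \mu^2 + \sigma^2$ gives exact gradients $2\mu$ and $2\sigma$ to compare against. Because $\theta = \mu + \sigma \epsilon$ is a deterministic function of $\mu$ and $\sigma$ once $\epsilon$ is drawn, the chain rule applies sample by sample.

```python
import random

def reparameterize(mu, sigma, eps):
    """Deterministic map from encoder outputs (mu, sigma) and noise eps to a latent sample."""
    return mu + sigma * eps

def pathwise_gradients(mu, sigma, num_samples=100_000, seed=0):
    """Monte Carlo gradients of E[theta^2] with respect to mu and sigma.

    f(theta) = theta^2 is a hypothetical stand-in for a decoder loss.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)       # eps ~ N(0, 1), independent of mu and sigma
        theta = reparameterize(mu, sigma, eps)
        df_dtheta = 2.0 * theta         # derivative of f evaluated at the sample
        grad_mu += df_dtheta            # chain rule: d theta / d mu = 1
        grad_sigma += df_dtheta * eps   # chain rule: d theta / d sigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

g_mu, g_sigma = pathwise_gradients(mu=0.5, sigma=1.2)
print(f"d/dmu approx {g_mu:.3f} (exact 1.0), d/dsigma approx {g_sigma:.3f} (exact 2.4)")
```

In a real VAE an automatic differentiation framework performs the same chain-rule step through the decoder; the hand-written derivatives here only make that mechanism visible.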

The 'Decoder' network then takes this sampled $\theta$ and attempts to reconstruct the original input, effectively modeling $p_{\theta}(x | \theta)$. By maximizing the ELBO, the VAE learns a structured latent space in which similar data points tend to lie close together. Unlike a standard Autoencoder, which learns a deterministic mapping from each input to a single latent point, the VAE learns a distribution over latent codes for each input, and the regularization toward the prior encourages a smooth, continuous latent space. This enables us to generate entirely new, synthetic data by sampling from the prior $p(\theta)$ and passing the sample through the decoder.
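
The generation step described above can be sketched as follows. Everything about the decoder here is a placeholder: a fixed linear layer followed by a sigmoid, with made-up weights, standing in for a trained network. Only the overall procedure (draw $\theta \sim \mathcal{N}(0, I)$, then decode) reflects the text.

```python
import math
import random

# Hypothetical stand-in for a trained decoder: a fixed linear layer plus sigmoid,
# mapping a 2-D latent vector to 3 output probabilities. Weights are illustrative.
WEIGHTS = [[1.0, -0.5], [0.3, 0.8], [-1.2, 0.4]]
BIASES = [0.0, 0.1, -0.2]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode(latent):
    """Map a latent vector to the mean of p(x | theta), here Bernoulli probabilities."""
    return [
        sigmoid(sum(w * t for w, t in zip(row, latent)) + b)
        for row, b in zip(WEIGHTS, BIASES)
    ]

def generate(num_samples, latent_dim=2, seed=0):
    """Draw theta ~ N(0, I) from the prior and decode each draw."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        samples.append(decode(latent))
    return samples

for x_new in generate(3):
    print([round(p, 3) for p in x_new])
```

Note that no encoder appears in this step: once training is done, the prior alone supplies the latent codes.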