The Geometry of Latent Spaces: Bayesian Inference and the VAE Framework

At its core, Bayesian inference is about updating our beliefs about a set of latent variables $\mathbf{z}$ given some observed data $\mathbf{x}$. In a generative context, we assume the data is produced by some underlying process $p(\mathbf{x}|\mathbf{z})$. Our goal is to find the posterior distribution $p(\mathbf{z}|\mathbf{x})$, which tells us which latent codes are most likely to have generated a specific observation. Bayes' rule gives $p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}$, where $p(\mathbf{z})$ is the prior over latent codes. The denominator, $p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z}) \, d\mathbf{z}$, is the marginal likelihood (or evidence). When $\mathbf{z}$ is high-dimensional and the likelihood is a nonlinear function such as a neural network, this integral has no closed form and is too expensive to approximate accurately, so the exact posterior cannot be computed.
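
To make the evidence integral concrete, here is a minimal sketch on a toy model where it can still be computed. The model is an assumption made for illustration and is not part of the VAE itself: a one-dimensional latent with prior $\mathcal{N}(0, 1)$ and likelihood $\mathcal{N}(x; z, s^2)$, for which the evidence has the closed form $\mathcal{N}(x; 0, 1 + s^2)$. The function names and the value of `noise_std` are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    var = std * std
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def evidence_monte_carlo(x, noise_std, num_samples, rng):
    """Estimate p(x) = integral of p(x|z) p(z) dz by averaging p(x|z) over z ~ N(0, 1)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)               # draw a latent code from the prior
        total += normal_pdf(x, z, noise_std)  # toy likelihood p(x|z) = N(x; z, noise_std^2)
    return total / num_samples


def evidence_exact(x, noise_std):
    """Closed form for this toy model only: p(x) = N(x; 0, 1 + noise_std^2)."""
    return normal_pdf(x, 0.0, math.sqrt(1.0 + noise_std ** 2))


rng = random.Random(0)
x_obs, noise_std = 1.5, 0.5
estimate = evidence_monte_carlo(x_obs, noise_std, 100_000, rng)
exact = evidence_exact(x_obs, noise_std)
print(f"Monte Carlo estimate: {estimate:.4f}   closed form: {exact:.4f}")
```

In one dimension the two numbers agree closely. With a neural-network likelihood and many latent dimensions there is no closed form to check against, and sampling from the prior rarely lands on codes that explain a given $\mathbf{x}$, which is the difficulty the rest of the framework works around.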

To circumvent this intractability, Variational Autoencoders (VAEs) employ Variational Inference. Instead of computing the true posterior $p(\mathbf{z}|\mathbf{x})$, we introduce a surrogate distribution $q_{\phi}(\mathbf{z}|\mathbf{x})$, parameterized by a neural network (the encoder). We then treat the problem as an optimization task: we want to make $q_{\phi}(\mathbf{z}|\mathbf{x})$ as close as possible to $p(\mathbf{z}|\mathbf{x})$. A standard measure of the discrepancy between two probability distributions is the Kullback-Leibler (KL) divergence, defined as: $$\mathrm{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x})) = E_{q_{\phi}(\mathbf{z}|\mathbf{x})} [\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{z}|\mathbf{x})]$$ It is zero only when the two distributions coincide, and it is not symmetric in its arguments.
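
The definition can be checked numerically in a case where both distributions are known. The sketch below is an illustration under stated assumptions, not the VAE setting: both distributions are one-dimensional Gaussians with hypothetical parameters, so the expectation can be estimated by sampling from the first argument and compared with the standard closed form for two Gaussians.

```python
import math
import random


def log_normal_pdf(z, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) at z."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (z - mean) ** 2 / (2.0 * std * std)


def kl_monte_carlo(q_mean, q_std, p_mean, p_std, num_samples, rng):
    """Estimate KL(q || p) = E_q[log q(z) - log p(z)] with samples drawn from q."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)
        total += log_normal_pdf(z, q_mean, q_std) - log_normal_pdf(z, p_mean, p_std)
    return total / num_samples


def kl_closed_form(q_mean, q_std, p_mean, p_std):
    """Closed-form KL(q || p) between two 1-D Gaussians."""
    return (math.log(p_std / q_std)
            + (q_std ** 2 + (q_mean - p_mean) ** 2) / (2.0 * p_std ** 2)
            - 0.5)


rng = random.Random(0)
forward_mc = kl_monte_carlo(1.0, 0.5, 0.0, 1.0, 50_000, rng)
forward = kl_closed_form(1.0, 0.5, 0.0, 1.0)   # KL(q || p)
reverse = kl_closed_form(0.0, 1.0, 1.0, 0.5)   # KL(p || q), arguments swapped
print(f"KL(q||p): sampled {forward_mc:.3f}, closed form {forward:.3f}")
print(f"KL(p||q): closed form {reverse:.3f}")
```

Swapping the arguments gives a different value, which is why the direction $\mathrm{KL}(q \,\|\, p)$ used in variational inference matters: the expectation is taken under $q_{\phi}$, the distribution we can actually sample from.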

Since we cannot minimize this divergence directly (it depends on the intractable posterior $p(\mathbf{z}|\mathbf{x})$), we derive the Evidence Lower Bound (ELBO). By manipulating the log-marginal likelihood, we can show that: $\log p(\mathbf{x}) = \mathrm{ELBO}(\phi, \theta, \mathbf{x}) + \mathrm{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x}))$. Because the KL divergence is always non-negative, the ELBO serves as a lower bound on the log-evidence: $\log p(\mathbf{x}) \geq \mathrm{ELBO}$. Since $\log p(\mathbf{x})$ does not depend on $\phi$, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence between our approximate posterior and the true posterior, while maximizing it with respect to the decoder parameters $\theta$ pushes up a lower bound on the log-evidence itself.
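
The manipulation behind this identity takes only a few lines. As a sketch, take the ELBO to be defined as $E_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}, \mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})]$ with joint $p(\mathbf{x}, \mathbf{z}) = p(\mathbf{x}|\mathbf{z})p(\mathbf{z})$, and use the fact that $\log p(\mathbf{x})$ is constant in $\mathbf{z}$, so taking its expectation under $q_{\phi}$ changes nothing:

$$
\begin{aligned}
\log p(\mathbf{x})
&= E_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x})\right] \\
&= E_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{z}|\mathbf{x})}\right] \\
&= E_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\right]
 + E_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{q_{\phi}(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\mathbf{x})}\right] \\
&= \mathrm{ELBO}(\phi, \theta, \mathbf{x}) + \mathrm{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x}))
\end{aligned}
$$

The second line applies $p(\mathbf{x}) = p(\mathbf{x}, \mathbf{z}) / p(\mathbf{z}|\mathbf{x})$, and the third multiplies and divides by $q_{\phi}(\mathbf{z}|\mathbf{x})$ inside the logarithm.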

The ELBO can be decomposed into two intuitive components: the reconstruction term and the regularization term. Written out, it is: $$\mathrm{ELBO} = E_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - \mathrm{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}))$$ The first term, $E_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})]$, encourages the decoder to reconstruct the input $\mathbf{x}$ accurately from the sampled latent code $\mathbf{z}$. The second term, the KL divergence between the approximate posterior and a prior $p(\mathbf{z})$ (usually a standard Gaussian $\mathcal{N}(0, I)$), acts as a regularizer: it keeps each encoded distribution close to the prior, so that codes for different inputs overlap rather than scatter and samples drawn from $p(\mathbf{z})$ fall in regions the decoder has been trained on.
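
The two terms can be sketched in a few lines of plain Python. Several pieces here are assumptions made for illustration and do not come from the text above: the approximate posterior is a diagonal Gaussian described by `mu` and `log_var`, the likelihood is a unit-variance Gaussian, and `toy_decoder` is a hypothetical fixed linear map standing in for a trained decoder network. With a standard normal prior, the KL term for a diagonal Gaussian has the closed form $\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$, while the reconstruction term is estimated by sampling.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def gaussian_log_likelihood(x, x_mean, std=1.0):
    """log N(x; x_mean, std^2 I), summed over the dimensions of x."""
    var = std * std
    return sum(-0.5 * math.log(2.0 * math.pi * var) - (xi - mi) ** 2 / (2.0 * var)
               for xi, mi in zip(x, x_mean))


def toy_decoder(z):
    """Hypothetical stand-in for a decoder network: a fixed linear map from 2-D z to 3-D x."""
    weights = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]


def elbo_estimate(x, mu, log_var, num_samples, rng):
    """Monte Carlo reconstruction term minus the closed-form KL term."""
    recon = 0.0
    for _ in range(num_samples):
        # Draw z from the approximate posterior N(mu, diag(exp(log_var))).
        z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
        recon += gaussian_log_likelihood(x, toy_decoder(z))
    recon /= num_samples
    return recon - kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
x = [1.0, -0.5, 0.75]
mu, log_var = [0.8, -0.4], [-1.0, -1.0]   # illustrative encoder outputs for this x
print("KL term:", kl_to_standard_normal(mu, log_var))
print("ELBO estimate:", elbo_estimate(x, mu, log_var, 1_000, rng))
```

Training frameworks usually minimize the negative of this quantity, averaged over a mini-batch, with `mu` and `log_var` produced by the encoder rather than fixed by hand.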

A significant technical challenge arises during training: the expectation $E_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})]$ is taken over $q_{\phi}$, so estimating it requires sampling $\mathbf{z}$ from a distribution that itself depends on $\phi$. The sampling step is not a differentiable function of $\phi$, so backpropagation cannot pass gradients through it directly. To solve this, VAEs use the reparameterization trick. Instead of sampling $\mathbf{z} \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ directly, we express $\mathbf{z}$ as a deterministic transformation of a noise variable $\epsilon \sim \mathcal{N}(0, I)$: $$\mathbf{z} = \mu + \sigma \odot \epsilon$$ where $\mu$ and $\sigma$ are outputs of the encoder and $\odot$ denotes element-wise multiplication. This shifts the stochasticity to $\epsilon$, which does not depend on $\phi$, allowing gradients to flow back through $\mu$ and $\sigma$ to the network parameters $\phi$.
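
The effect of the reparameterization can be verified on a scalar example where the answer is known. The sketch below makes a simplifying assumption: the toy objective $f(z) = z^2$ stands in for $\log p_{\theta}(\mathbf{x}|\mathbf{z})$, and the chain rule is applied by hand instead of by an automatic differentiation library. For $z \sim \mathcal{N}(\mu, \sigma^2)$ we have $E[z^2] = \mu^2 + \sigma^2$, so the exact gradients are $2\mu$ and $2\sigma$.

```python
import random


def df_dz(z):
    """Derivative of the toy objective f(z) = z^2 with respect to z."""
    return 2.0 * z


def reparam_gradients(mu, sigma, num_samples, rng):
    """Pathwise estimates of d/dmu and d/dsigma of E_{z ~ N(mu, sigma^2)}[f(z)]."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)     # noise that does not depend on mu or sigma
        z = mu + sigma * eps          # deterministic in (mu, sigma) once eps is fixed
        grad_mu += df_dz(z) * 1.0     # chain rule with dz/dmu = 1
        grad_sigma += df_dz(z) * eps  # chain rule with dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples


rng = random.Random(0)
mu, sigma = 1.5, 0.8
g_mu, g_sigma = reparam_gradients(mu, sigma, 100_000, rng)
print(f"d/dmu:    estimated {g_mu:.3f}, exact {2.0 * mu:.3f}")
print(f"d/dsigma: estimated {g_sigma:.3f}, exact {2.0 * sigma:.3f}")
```

In a real VAE the same computation is carried out by the framework's automatic differentiation: once $\mathbf{z}$ is written as $\mu + \sigma \odot \epsilon$, the sample is an ordinary differentiable function of the encoder outputs.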

In summary, the VAE transforms a Bayesian inference problem into a deep learning objective. By maximizing the ELBO, the model learns an encoder that maps data to a structured latent space and a decoder that maps points sampled from this space to realistic synthetic data. The tension between the reconstruction term (pushing for distinct, informative codes) and the KL divergence (pulling every code toward the standard normal prior) is what gives the VAE its generative properties: it encourages a latent space that is continuous, so that nearby codes decode to similar outputs and interpolating between two codes yields plausible intermediate samples.