The Geometry of Latent Spaces: Bayesian Inference and the VAE

An exploration of how Variational Autoencoders transform the intractable problem of posterior inference into a manageable optimization task. We bridge the gap between classical Bayesian statistics and modern deep generative modeling.

At its core, Bayesian inference is about updating our beliefs about a parameter $\theta$ given some observed data $x$. According to Bayes' Theorem, the posterior distribution is $p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)}$. In generative modeling, the unknown quantity is a hidden, low-dimensional latent variable $z$ from which we imagine the data $x$ to be generated, so the same rule gives $p(z|x) = \frac{p(x|z)p(z)}{p(x)}$. The goal of a Variational Autoencoder (VAE) is to learn this latent representation. However, the evidence $p(x) = \int p(x|z)p(z)\,dz$ has no closed form when $p(x|z)$ is defined by a neural network, and integrating numerically over all possible $z$ is intractable in more than a few dimensions.
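To make the evidence integral concrete, here is a minimal sketch for a one-dimensional toy model where the answer is known. The model is an assumption chosen for illustration and is not part of the lesson: prior $p(z) = \mathcal{N}(0, 1)$ and likelihood $p(x|z) = \mathcal{N}(z, 1)$, for which the marginal is $\mathcal{N}(0, 2)$. The function names are hypothetical.

```python
import math

def normal_pdf(v, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) evaluated at v."""
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def evidence_on_grid(x, lo=-10.0, hi=10.0, n=2001):
    """Approximate p(x) = integral of p(x|z) p(z) dz with the trapezoidal rule.

    Toy assumptions: prior p(z) = N(0, 1), likelihood p(x|z) = N(z, 1).
    """
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0)
    return total * step

x = 1.3
approx = evidence_on_grid(x)
exact = normal_pdf(x, 0.0, math.sqrt(2.0))  # marginal is N(0, 2) for this toy model
print(approx, exact)
```

In one dimension the grid and the closed form agree closely. The same grid with `n` points per axis would need `n ** d` evaluations for a $d$-dimensional latent, which is why this approach does not carry over to realistic models.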

To bypass this intractability, we introduce Variational Inference. Instead of calculating the true posterior $p(z|x)$, we approximate it with a simpler, parameterized distribution $q_{\phi}(z|x)$, typically a Gaussian. We want $q_{\phi}(z|x)$ to be as close to $p(z|x)$ as possible. The standard measure of dissimilarity between two probability distributions is the Kullback-Leibler (KL) divergence, which is zero only when the two distributions coincide and is not symmetric in its arguments. Our objective is to minimize: $$D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) = \int q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p(z|x)} dz$$
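As a small illustration of how the KL divergence behaves, the sketch below uses the closed-form expression for two univariate Gaussians. The function name `kl_gaussians` and the specific means and standard deviations are assumptions picked for the example.

```python
import math

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)) in nats."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

kl_same = kl_gaussians(0.0, 1.0, 0.0, 1.0)      # identical distributions
kl_forward = kl_gaussians(0.0, 1.0, 1.0, 2.0)   # KL(q || p)
kl_reverse = kl_gaussians(1.0, 2.0, 0.0, 1.0)   # KL(p || q), arguments swapped
print(kl_same, kl_forward, kl_reverse)
```

The divergence vanishes for identical distributions and changes when the arguments are swapped, so the direction $D_{KL}(q \parallel p)$ used in variational inference is a deliberate choice rather than an interchangeable one.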

Since we cannot compute $p(z|x)$ directly, we rearrange the KL divergence using the definition of the posterior. By substituting $p(z|x) = \frac{p(x,z)}{p(x)}$, we can derive a relationship involving the log-marginal likelihood (the evidence), $\log p(x)$. This leads us to the decomposition: $\log p(x) = D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) + \mathcal{L}(\phi, \theta, x)$, where $\mathcal{L}$ is the Evidence Lower Bound, or ELBO. Because the KL divergence is always non-negative, $\mathcal{L} \le \log p(x)$, so $\mathcal{L}$ is a lower bound on the log-likelihood of our data. And because $\log p(x)$ does not depend on $\phi$, raising $\mathcal{L}$ with respect to $\phi$ lowers the KL divergence by the same amount, which lets us pursue the original objective without ever evaluating $p(z|x)$.
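For readers who want the intermediate steps, the decomposition can be worked out as follows. This is the standard derivation written in the notation used above; it assumes only that $q_{\phi}(z|x)$ integrates to one over $z$ and that the joint factorizes as $p(x,z) = p_{\theta}(x|z)p(z)$.

$$D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) = \mathbb{E}_{q_{\phi}(z|x)}\left[\log q_{\phi}(z|x) - \log p(x,z) + \log p(x)\right]$$

$$= \mathbb{E}_{q_{\phi}(z|x)}\left[\log q_{\phi}(z|x) - \log p(x,z)\right] + \log p(x)$$

The term $\log p(x)$ leaves the expectation because it does not depend on $z$. Moving the remaining expectation to the other side gives

$$\log p(x) = D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) + \underbrace{\mathbb{E}_{q_{\phi}(z|x)}\left[\log p(x,z) - \log q_{\phi}(z|x)\right]}_{\mathcal{L}(\phi, \theta, x)}$$

Writing $\log p(x,z) = \log p_{\theta}(x|z) + \log p(z)$ inside the expectation then splits $\mathcal{L}$ into an expected log-likelihood minus $D_{KL}(q_{\phi}(z|x) \parallel p(z))$.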

The ELBO is the central objective function we maximize during training. It is formulated as: $$\mathcal{L}(\phi, \theta, x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) \parallel p(z))$$ Here, the first term is the 'reconstruction term,' ensuring the decoder can reconstruct $x$ from $z$. The second term is the 'regularization term,' forcing the approximate posterior to stay close to the prior $p(z)$, which is usually a standard normal distribution $\mathcal{N}(0, I)$.
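The two terms can be checked numerically on a toy model where everything has a closed form. The sketch below assumes a one-dimensional model that is not part of the lesson: prior $p(z) = \mathcal{N}(0, 1)$, likelihood $p(x|z) = \mathcal{N}(z, 1)$, and approximate posterior $q(z|x) = \mathcal{N}(m, s^2)$. For this model the true posterior is $\mathcal{N}(x/2, 1/2)$ and the marginal is $\mathcal{N}(0, 2)$. The function names are hypothetical.

```python
import math

def elbo(x, m, s):
    """ELBO for the toy model p(z)=N(0,1), p(x|z)=N(z,1), q(z|x)=N(m, s^2)."""
    # Reconstruction term: E_q[log p(x|z)], available in closed form here.
    reconstruction = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    # Regularization term: KL(q(z|x) || p(z)) against the standard normal prior.
    kl_to_prior = -math.log(s) + 0.5 * (s ** 2 + m ** 2) - 0.5
    return reconstruction - kl_to_prior

def log_evidence(x):
    """log p(x) for the same toy model, where the marginal is N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

x = 1.3
loose = elbo(x, m=0.0, s=1.0)                   # q set equal to the prior
tight = elbo(x, m=x / 2.0, s=math.sqrt(0.5))    # q set equal to the true posterior
print(loose, tight, log_evidence(x))
```

With $q$ equal to the prior the bound is strictly below $\log p(x)$; with $q$ equal to the true posterior the gap closes. In a real VAE neither term is available in closed form for the decoder, so the reconstruction term is estimated from samples.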

A significant challenge arises when we try to backpropagate through the expectation $\mathbb{E}_{q_{\phi}(z|x)}$. The distribution we sample from depends on $\phi$, and a raw draw $z \sim q_{\phi}(z|x)$ is not a differentiable function of $\phi$, so gradients cannot flow back through the sampling step. To solve this, the VAE employs the 'Reparameterization Trick.' We express $z$ as a deterministic function of the parameters and a noise variable $\epsilon \sim \mathcal{N}(0, I)$ that does not depend on $\phi$. Specifically, $z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon$. All the randomness now lives in $\epsilon$, so the gradient can flow through $\mu$ and $\sigma$ to update the encoder parameters $\phi$.
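The effect of the trick can be seen without a neural network. The sketch below is a minimal illustration under stated assumptions: scalar `mu` and `sigma` stand in for the encoder outputs, and $f(z) = z^2$ is a stand-in integrand chosen because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ has known gradients $2\mu$ and $2\sigma$. The function name is hypothetical.

```python
import random

def reparam_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo gradient of E[z^2] w.r.t. mu and sigma, with z = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise drawn independently of the parameters
        z = mu + sigma * eps        # deterministic in (mu, sigma) once eps is fixed
        df_dz = 2.0 * z             # derivative of the stand-in integrand f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.8)
print(g_mu, g_sigma)  # analytic gradients are 2 * mu = 3.0 and 2 * sigma = 1.6
```

The estimates should land near the analytic values, up to Monte Carlo noise. An automatic differentiation framework performs the same chain-rule step implicitly once $z$ is written as `mu + sigma * eps`.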

In summary, the VAE transforms a Bayesian inference problem into a stochastic optimization problem. By maximizing the ELBO, we simultaneously learn a mapping from data to a structured latent space (the encoder) and a mapping from that space back to the data distribution (the decoder). This allows us to generate new, synthetic data by simply sampling $z \sim p(z)$ and passing it through the decoder, which works because the regularization term keeps the encoded points close to the prior we sample from.