The Geometry of Generative Models: Bayesian Inference and the ELBO

An exploration of how Variational Autoencoders leverage the Evidence Lower Bound to approximate intractable posterior distributions. This lesson bridges the gap between Bayesian theory and deep generative architectures.

At its heart, Bayesian inference is about updating our beliefs about a latent parameter $\mathbf{z}$ given some observed data $\mathbf{x}$. We seek the posterior distribution $p(\mathbf{z}|\mathbf{x})$, which represents the probability of the latent cause given the observation. According to Bayes' Theorem, this is defined as $p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}$. While the numerator is often tractable, the denominator $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$ (the evidence) is an integral over all possible latent states, which is computationally intractable for high-dimensional data like images.
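
To make the evidence integral concrete, here is a minimal sketch for a toy one-dimensional model where the answer is known in closed form. The model ($z \sim \mathcal{N}(0, 1)$, $x|z \sim \mathcal{N}(z, 1)$), the grid bounds, and the function names are all assumptions chosen for illustration; they are not part of a VAE.

```python
import math


def normal_pdf(value, mean, var):
    """Density of a univariate Gaussian N(mean, var) at `value`."""
    return math.exp(-((value - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def evidence_by_grid(x, num_points=2001, z_min=-10.0, z_max=10.0):
    """Approximate p(x) = integral of p(x|z) p(z) dz with a Riemann sum over a z grid.

    Toy model (an assumption for illustration): z ~ N(0, 1), x | z ~ N(z, 1).
    """
    step = (z_max - z_min) / (num_points - 1)
    total = 0.0
    for i in range(num_points):
        z = z_min + i * step
        total += normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0) * step
    return total


x = 1.0
numeric = evidence_by_grid(x)
exact = normal_pdf(x, 0.0, 2.0)  # for this toy model, p(x) = N(x; 0, 2)
print(numeric, exact)

# The same grid resolution in d latent dimensions needs num_points ** d evaluations.
print(2001 ** 2, 2001 ** 32)
```

In one dimension the grid sum agrees closely with the closed form. The last line hints at why this approach does not scale: the number of grid evaluations grows exponentially with the latent dimension.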

To bypass this intractability, Variational Autoencoders (VAEs) treat inference as an optimization problem rather than an integration problem. We introduce a 'variational' distribution $q_{\phi}(\mathbf{z}|\mathbf{x})$, parameterized by a neural network (the encoder), to approximate the true posterior $p(\mathbf{z}|\mathbf{x})$. The goal is to make $q_{\phi}$ as close as possible to $p$ by minimizing the Kullback-Leibler (KL) divergence: $D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}|\mathbf{x}))$. This shifts our focus from calculating a distribution to finding the optimal parameters $\phi$ that define a distribution.
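
As a small illustration of the quantity being minimized, the sketch below evaluates the KL divergence between two univariate Gaussians using its closed form. The function name and the specific means and standard deviations are hypothetical choices for the example; a real VAE works with multivariate distributions produced by the encoder.

```python
import math


def kl_univariate_gaussians(mean_q, std_q, mean_p, std_p):
    """Closed-form KL(q || p) for q = N(mean_q, std_q^2) and p = N(mean_p, std_p^2)."""
    return (
        math.log(std_p / std_q)
        + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
        - 0.5
    )


kl_same = kl_univariate_gaussians(0.0, 1.0, 0.0, 1.0)     # identical distributions
kl_forward = kl_univariate_gaussians(0.0, 1.0, 1.0, 2.0)  # KL(q || p)
kl_reverse = kl_univariate_gaussians(1.0, 2.0, 0.0, 1.0)  # KL(p || q)
print(kl_same, kl_forward, kl_reverse)
```

The divergence is zero only when the two distributions coincide, and it is not symmetric: swapping the arguments gives a different value. The VAE objective uses the direction $D_{KL}(q_{\phi} || p)$ specifically.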

Since we cannot calculate the KL divergence directly (because it requires knowing the unknown $p(\mathbf{z}|\mathbf{x})$), we derive a surrogate objective called the Evidence Lower Bound, or ELBO. By rearranging the log-marginal likelihood, we find: $\log p(\mathbf{x}) = D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}|\mathbf{x})) + \mathcal{L}(\phi, \theta; \mathbf{x})$. Because the KL divergence is always non-negative, the term $\mathcal{L}(\phi, \theta; \mathbf{x})$ serves as a lower bound on the log-evidence: $\log p(\mathbf{x}) \geq \mathcal{L}(\phi, \theta; \mathbf{x})$. Because $\log p(\mathbf{x})$ does not depend on $\phi$, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the divergence between our approximation and the true posterior.
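
The rearrangement mentioned above can be written out in a few lines. This is a sketch of the standard derivation; it assumes the ELBO is defined as $\mathcal{L}(\phi, \theta; \mathbf{x}) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}, \mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})]$, a definition the text uses implicitly.

$$
\begin{aligned}
D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \,||\, p(\mathbf{z}|\mathbf{x}))
&= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{z}|\mathbf{x})\right] \\
&= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p_{\theta}(\mathbf{x}, \mathbf{z})\right] + \log p(\mathbf{x}) \\
&= -\mathcal{L}(\phi, \theta; \mathbf{x}) + \log p(\mathbf{x})
\end{aligned}
$$

The second line substitutes $p(\mathbf{z}|\mathbf{x}) = p_{\theta}(\mathbf{x}, \mathbf{z}) / p(\mathbf{x})$ and pulls $\log p(\mathbf{x})$ out of the expectation, since it does not depend on $\mathbf{z}$. Moving $\mathcal{L}$ to the left-hand side gives the identity stated in the text.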

The ELBO can be decomposed into two interpretable terms: $\mathcal{L}(\phi, \theta; \mathbf{x}) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}))$. The first term is the 'reconstruction likelihood,' which encourages the decoder $p_{\theta}$ to reconstruct the input $\mathbf{x}$ from the sampled $\mathbf{z}$. The second term is a regularization penalty that forces the approximate posterior to remain close to a prior distribution $p(\mathbf{z})$, typically a standard Gaussian $\mathcal{N}(0, I)$. This prevents the model from simply assigning a unique point in space for every single image, thereby ensuring a smooth latent space.
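
The two terms can be computed in a few lines once the encoder and decoder outputs are available. The sketch below is a minimal illustration under stated assumptions: a diagonal Gaussian $q_{\phi}$ parameterized by `mu` and `log_var`, a standard Gaussian prior, and a Bernoulli decoder over binary pixels. The function names and the hard-coded values standing in for network outputs are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def bernoulli_log_likelihood(x, x_probs, eps=1e-7):
    """log p(x | z) for binary pixels x given decoder output probabilities."""
    total = 0.0
    for target, prob in zip(x, x_probs):
        prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob)
    return total


def elbo_estimate(x, x_probs, mu, log_var):
    """Single-sample ELBO estimate: reconstruction term minus KL penalty."""
    return bernoulli_log_likelihood(x, x_probs) - kl_to_standard_normal(mu, log_var)


# Stand-ins for one data point and the corresponding network outputs (assumed values).
x = [1.0, 0.0, 1.0]          # observed binary pixels
x_probs = [0.9, 0.2, 0.7]    # decoder probabilities for one sampled z
mu = [0.5, -0.5]             # encoder mean
log_var = [0.0, 0.0]         # encoder log-variance (sigma = 1)

kl = kl_to_standard_normal(mu, log_var)
elbo = elbo_estimate(x, x_probs, mu, log_var)
print(kl, elbo)
```

In training, the negative of this quantity is typically averaged over a mini-batch and minimized. The KL term here has a closed form only because both $q_{\phi}$ and the prior are assumed Gaussian.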

A critical challenge arises when calculating the gradient of the ELBO: the expectation is taken over a distribution $q_{\phi}$ that depends on the parameters we are optimizing, and gradients cannot flow through a raw sampling step. To solve this, we use the 'reparameterization trick.' Instead of sampling $\mathbf{z} \sim \mathcal{N}(\mu, \sigma^2)$ directly, we express $\mathbf{z}$ as a deterministic transformation of a noise variable $\epsilon \sim \mathcal{N}(0, I)$: $\mathbf{z} = \mu + \sigma \odot \epsilon$. This moves the stochasticity outside the gradient path, allowing us to use standard backpropagation to optimize both the encoder parameters $\phi$ and decoder parameters $\theta$.
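
The effect of the trick can be checked numerically without a deep learning framework. The sketch below is an illustration under assumptions: a scalar latent variable, and $f(z) = z^2$ as a hypothetical stand-in for the downstream loss. Because $z = \mu + \sigma \epsilon$ is deterministic given $\epsilon$, the chain rule yields per-sample gradients that can be averaged.

```python
import random


def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps: a deterministic function of (mu, sigma) given noise eps."""
    return mu + sigma * eps


def pathwise_gradients(mu, sigma, num_samples=100_000, seed=0):
    """Monte Carlo gradients of E[z^2] with respect to mu and sigma.

    f(z) = z^2 is a hypothetical stand-in for the decoder loss. By the chain rule,
    df/dmu = 2 z * dz/dmu = 2 z and df/dsigma = 2 z * dz/dsigma = 2 z * eps.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)  # noise drawn independently of mu and sigma
        z = reparameterize(mu, sigma, eps)
        grad_mu += 2.0 * z
        grad_sigma += 2.0 * z * eps
    return grad_mu / num_samples, grad_sigma / num_samples


mu, sigma = 1.5, 0.5
grad_mu, grad_sigma = pathwise_gradients(mu, sigma)
# Analytic check: E[z^2] = mu^2 + sigma^2, so the exact gradients are 2*mu and 2*sigma.
print(grad_mu, 2 * mu)
print(grad_sigma, 2 * sigma)
```

The Monte Carlo estimates should land close to the analytic values $2\mu$ and $2\sigma$. In a framework with automatic differentiation, the same structure lets the gradient pass through $\mu$ and $\sigma$ while $\epsilon$ is treated as a constant input.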

In summary, the VAE is a marriage of Bayesian inference and deep learning. By maximizing the ELBO, we simultaneously learn a compressed representation of data and a generative model capable of sampling new instances. The balance between the reconstruction term and the KL term creates a latent manifold where similar data points are clustered together, enabling meaningful interpolation and controlled generation of high-dimensional data.