Bridging Bayesian Inference and Generative Modeling: The Mechanics of VAEs and the ELBO

An exploration of how variational inference transforms intractable posterior distributions into optimization problems. We derive the Evidence Lower Bound (ELBO) as the fundamental objective for training Variational Autoencoders.

At its heart, Bayesian inference is about updating our beliefs about a set of latent variables $\mathbf{z}$ given some observed data $\mathbf{x}$. We seek the posterior distribution $p(\mathbf{z}|\mathbf{x})$, which, by Bayes' Theorem, is expressed as $p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}$. In many deep learning contexts, the marginal likelihood $p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z}) \, d\mathbf{z}$ is computationally intractable because the integral must be evaluated over all possible configurations of $\mathbf{z}$. This is the 'bottleneck' that necessitates Variational Inference (VI).
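To make the role of the marginal likelihood concrete, here is a minimal sketch of Bayes' theorem for a discrete latent variable, where the integral reduces to a sum. The three-state prior and the likelihood values are made-up numbers for illustration, and the function name `posterior` is hypothetical.

```python
def posterior(prior, likelihood):
    """Return (p(z|x), p(x)) for a discrete latent z and one fixed observation x."""
    # Joint p(x, z) = p(x|z) * p(z) for each latent state.
    joint = [lik * pz for lik, pz in zip(likelihood, prior)]
    # Marginal likelihood p(x): a sum here, an integral in the continuous case.
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence


prior = [0.5, 0.3, 0.2]          # illustrative p(z) over three latent states
likelihood = [0.10, 0.40, 0.70]  # illustrative p(x|z) for the observed x

post, evidence = posterior(prior, likelihood)
print("p(x)   =", evidence)
print("p(z|x) =", post)
```

With three states the sum is trivial. With $d$ binary latent dimensions the same sum would have $2^d$ terms, and for continuous $\mathbf{z}$ it becomes the integral above, which is where exact computation stops being feasible.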

Variational Inference recasts the problem from integration to optimization. Instead of computing the exact posterior $p(\mathbf{z}|\mathbf{x})$, we introduce a proxy distribution $q_{\phi}(\mathbf{z}|\mathbf{x})$, parameterized by $\phi$ (typically the weights of a neural network), and attempt to make $q_{\phi}$ as similar as possible to $p(\mathbf{z}|\mathbf{x})$. The standard measure of dissimilarity used here is the Kullback-Leibler (KL) divergence: $D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x})) = \int q_{\phi}(\mathbf{z}|\mathbf{x}) \log \frac{q_{\phi}(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\mathbf{x})} \, d\mathbf{z}$. The KL divergence is non-negative and is zero only when the two distributions coincide, but it is not symmetric in its arguments, so it is not a true distance. Minimizing it with respect to $\phi$ drives our approximation toward the true posterior.
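The KL divergence is easiest to inspect in the discrete case, where the integral becomes a sum. The sketch below uses two arbitrary example distributions (not taken from any dataset) and a hypothetical helper `kl_divergence` to show that the divergence vanishes for identical distributions and changes when the arguments are swapped.

```python
import math


def kl_divergence(q, p):
    """D_KL(q || p) for discrete distributions given as equal-length lists.

    Assumes p[i] > 0 wherever q[i] > 0; terms with q[i] == 0 contribute zero.
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


q = [0.8, 0.1, 0.1]  # illustrative approximate distribution
p = [0.4, 0.4, 0.2]  # illustrative target distribution

forward = kl_divergence(q, p)  # D_KL(q || p), the direction used in VI
reverse = kl_divergence(p, q)  # D_KL(p || q), generally a different number
print("D_KL(q || p) =", forward)
print("D_KL(p || q) =", reverse)
print("D_KL(q || q) =", kl_divergence(q, q))
```

Variational inference uses the first direction, $D_{KL}(q \parallel p)$, because the expectation is then taken under $q_{\phi}$, the distribution we can sample from.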

Since we cannot compute $p(\mathbf{z}|\mathbf{x})$ directly, we cannot evaluate this KL divergence either, so we derive a surrogate objective: the Evidence Lower Bound, or ELBO. Through algebraic manipulation of the KL divergence, we find that $\log p(\mathbf{x}) = D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x})) + \mathcal{L}(\phi, \theta)$, where $\mathcal{L}(\phi, \theta)$ is the ELBO. Because the KL divergence is always non-negative, $\log p(\mathbf{x}) \geq \mathcal{L}(\phi, \theta)$. For fixed $\theta$, $\log p(\mathbf{x})$ does not depend on $\phi$, so maximizing the ELBO over $\phi$ is equivalent to minimizing the KL divergence between our approximate and true posteriors, which tightens the bound. Maximizing over $\theta$ in turn raises a lower bound on the log-likelihood of the data.
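The algebraic manipulation can be spelled out in a few lines. The derivation below is a sketch that writes the joint as $p(\mathbf{x}, \mathbf{z}) = p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})$ (notation introduced here for convenience) and substitutes Bayes' theorem, $p(\mathbf{z}|\mathbf{x}) = p(\mathbf{x}, \mathbf{z}) / p(\mathbf{x})$, into the definition of the KL divergence:

$$
\begin{aligned}
D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x}))
&= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{z}|\mathbf{x})\right] \\
&= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{x}, \mathbf{z})\right] + \log p(\mathbf{x}) \\
&= -\mathcal{L}(\phi, \theta) + \log p(\mathbf{x}),
\end{aligned}
$$

where $\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}, \mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\right]$. The term $\log p(\mathbf{x})$ leaves the expectation in the second line because it does not depend on $\mathbf{z}$. Rearranging the last line gives $\log p(\mathbf{x}) = D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x})) + \mathcal{L}(\phi, \theta)$. Every quantity inside $\mathcal{L}$ can be evaluated without knowing the true posterior, which is what makes it usable as a training objective.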

The ELBO can be decomposed into two intuitive terms: $\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}))$. The first term is the 'reconstruction term,' which rewards the model for decoding latent samples $\mathbf{z}$ back into the original data $\mathbf{x}$. The second term is the 'regularization term,' which penalizes the approximate posterior $q_{\phi}(\mathbf{z}|\mathbf{x})$ for straying from a simple prior $p(\mathbf{z})$, typically a standard Gaussian $\mathcal{N}(0, I)$. This discourages the encoder from mapping each input to its own isolated point in latent space, and so encourages a smooth latent space.
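When $q_{\phi}(\mathbf{z}|\mathbf{x})$ is a diagonal Gaussian and the prior is $\mathcal{N}(0, I)$, the regularization term has the closed form $\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$. The sketch below combines it with a reconstruction term into a single-sample ELBO estimate. Two choices here are assumptions and not something the text prescribes: the encoder is taken to output the log-variance (a common parameterization), and the decoder is taken to be a unit-variance Gaussian. All function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def gaussian_log_likelihood(x, x_hat):
    """log N(x; x_hat, I), assuming a unit-variance Gaussian decoder."""
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * sq_err - 0.5 * len(x) * math.log(2.0 * math.pi)


def elbo_single_sample(x, x_hat, mu, log_var):
    """One-sample ELBO estimate: reconstruction term minus KL regularizer."""
    return gaussian_log_likelihood(x, x_hat) - kl_to_standard_normal(mu, log_var)


x = [0.2, 0.9, 0.4]      # illustrative input
x_hat = [0.25, 0.8, 0.5]  # illustrative decoder output for one sampled z
mu = [0.3, -0.1]          # illustrative encoder mean
log_var = [-0.5, 0.2]     # illustrative encoder log-variance

print("KL term:", kl_to_standard_normal(mu, log_var))
print("ELBO estimate:", elbo_single_sample(x, x_hat, mu, log_var))
```

In training, the negative of this quantity would be averaged over a batch and minimized. The KL term is zero exactly when the encoder output matches the prior, that is, when `mu` is all zeros and `log_var` is all zeros.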

In a Variational Autoencoder (VAE), the encoder network outputs the parameters of $q_{\phi}(\mathbf{z}|\mathbf{x})$, usually a mean $\mu$ and variance $\sigma^2$. However, drawing a sample $\mathbf{z} \sim q_{\phi}(\mathbf{z}|\mathbf{x})$ is a stochastic operation, and backpropagation cannot differentiate through it with respect to $\mu$ and $\sigma$. To solve this, we employ the 'Reparameterization Trick.' We express $\mathbf{z}$ as a deterministic transformation of a noise variable $\epsilon \sim \mathcal{N}(0, I)$: $\mathbf{z} = \mu + \sigma \odot \epsilon$, where $\odot$ denotes the elementwise product. This moves the stochasticity into the input $\epsilon$, which does not depend on $\phi$, so gradients can propagate to $\mu$ and $\sigma$ via the ordinary chain rule.
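The effect of the trick can be checked numerically without any autodiff library. The sketch below is an illustration under simple assumptions: a scalar latent, a toy objective $f(z) = z^2$ chosen because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ has known gradients $2\mu$ and $2\sigma$, and hypothetical function names. It applies the chain rule by hand through $z = \mu + \sigma\epsilon$.

```python
import random


def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, elementwise; eps is the only source of randomness."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


def pathwise_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2] for scalar mu, sigma."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)
        (z,) = reparameterize([mu], [sigma], [eps])
        df_dz = 2.0 * z            # derivative of the toy objective f(z) = z^2
        grad_mu += df_dz * 1.0     # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


g_mu, g_sigma = pathwise_gradients(mu=1.5, sigma=0.5)
print("estimated:", g_mu, g_sigma)
print("analytic: ", 2 * 1.5, 2 * 0.5)
```

The estimates should land close to the analytic values, up to Monte Carlo noise. In a real VAE the same chain rule is applied automatically by the framework's autodiff, with $f$ replaced by the reconstruction term of the ELBO.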

Once trained, the VAE serves as a generative model. We discard the encoder, sample $\mathbf{z}$ directly from the prior $p(\mathbf{z})$, and pass these samples through the decoder $p_{\theta}(\mathbf{x}|\mathbf{z})$ to generate new data points. The balance between the ELBO's reconstruction and regularization terms encourages a structured latent space in which nearby points in $\mathbf{z}$-space correspond to semantically similar images or signals in $\mathbf{x}$-space. In this way the VAE bridges the gap between rigorous Bayesian statistics and modern deep generative modeling.
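The generation step can be sketched as follows. The decoder here is a stand-in: a single linear layer with a sigmoid, using made-up weights in place of trained parameters $\theta$. The names `toy_decoder` and `generate` are hypothetical, and the sigmoid outputs are read as per-dimension Bernoulli means, which is one common choice and not the only one.

```python
import math
import random


def toy_decoder(z, weights, bias):
    """Hypothetical decoder: one linear layer followed by a sigmoid."""
    out = []
    for row, b in zip(weights, bias):
        pre = sum(w * zi for w, zi in zip(row, z)) + b
        out.append(1.0 / (1.0 + math.exp(-pre)))
    return out


def generate(n, latent_dim, weights, bias, seed=0):
    """Sample z ~ N(0, I) from the prior and decode each sample; no encoder is used."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        samples.append(toy_decoder(z, weights, bias))
    return samples


# Made-up parameters: 2 latent dimensions decoded to 3 output dimensions.
weights = [[0.5, -1.0], [1.2, 0.3], [-0.7, 0.8]]
bias = [0.0, -0.2, 0.1]

samples = generate(n=4, latent_dim=2, weights=weights, bias=bias)
for s in samples:
    print(s)
```

Sampling from the prior produces sensible outputs only to the extent that the regularization term has kept the encoder's outputs close to $p(\mathbf{z})$ during training, so that the decoder has seen latent codes from the same region.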