From Bayesian Inference to Generative Modeling: Unpacking VAEs and the ELBO

An exploration of how we approximate intractable posterior distributions using variational inference. We bridge the gap between probabilistic graphical models and deep neural networks via the Evidence Lower Bound.

At its heart, Bayesian inference is about updating our beliefs about a latent variable $\mathbf{z}$ given some observed data $\mathbf{x}$. According to Bayes' Theorem, the posterior distribution $p(\mathbf{z}|\mathbf{x})$ is proportional to the product of the likelihood $p(\mathbf{x}|\mathbf{z})$ and the prior $p(\mathbf{z})$. However, for complex models, the normalizing denominator, the evidence $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$, is often an intractable integral. This makes exact computation of the posterior infeasible in high-dimensional settings, necessitating approximate inference methods.
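
To make Bayes' Theorem concrete, here is a minimal sketch with a discrete latent variable, where the evidence is a short sum rather than an integral. The three latent states and all probabilities are made-up numbers for illustration; they do not come from any particular model.

```python
# Toy model: a discrete latent z in {0, 1, 2} and a single observed x.
# All numbers below are illustrative assumptions.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.40, 0.70]  # p(x | z) evaluated at the observed x

# Evidence p(x) = sum_z p(x | z) p(z). For continuous z this sum becomes
# the integral that is usually intractable.
evidence = sum(l * p for l, p in zip(likelihood, prior))

# Posterior p(z | x) = p(x | z) p(z) / p(x).
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]

print("evidence:", evidence)
print("posterior:", posterior)
```

With only three latent states the normalizer is trivial to compute. The difficulty described above appears when $\mathbf{z}$ is continuous and high-dimensional, so that no such enumeration is possible.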

Variational Autoencoders (VAEs) solve this intractability by treating inference as an optimization problem. Instead of computing $p(\mathbf{z}|\mathbf{x})$ directly, we introduce a proxy distribution $q_{\phi}(\mathbf{z}|\mathbf{x})$, parameterized by a neural network (the encoder), to approximate the true posterior. The goal is to make $q_{\phi}(\mathbf{z}|\mathbf{x})$ as similar as possible to $p(\mathbf{z}|\mathbf{x})$. We quantify this similarity using the Kullback-Leibler (KL) divergence: $\text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x}))$.
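
As a rough illustration of how the KL divergence scores a candidate approximation, the sketch below evaluates it for discrete distributions. The function name `kl_divergence` and the example probabilities are assumptions chosen for this illustration.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions given as probability lists."""
    # Terms with q_i == 0 contribute nothing, by the usual convention.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

target = [0.2, 0.5, 0.3]     # stands in for the true posterior (made-up numbers)
good_q = [0.25, 0.45, 0.30]  # a close approximation
poor_q = [0.70, 0.20, 0.10]  # a poor approximation

print(kl_divergence(target, target))  # zero when the two distributions match
print(kl_divergence(good_q, target))  # small positive value
print(kl_divergence(poor_q, target))  # larger positive value
print(kl_divergence(target, poor_q))  # differs from the previous line: KL is asymmetric
```

The asymmetry matters here: the VAE objective uses $\text{KL}(q \parallel p)$, with the approximation $q$ in the first argument.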

Minimizing this divergence directly is not possible, because the KL term itself depends on the unknown $p(\mathbf{z}|\mathbf{x})$. We resolve this by deriving the Evidence Lower Bound (ELBO). Expanding the KL divergence and rearranging shows that $\log p(\mathbf{x}) = \text{ELBO}(\phi, \theta) + \text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x}))$. Since the KL divergence is always non-negative, the ELBO serves as a lower bound on the log-likelihood of the data: $\log p(\mathbf{x}) \geq \text{ELBO}(\phi, \theta)$. And because $\log p(\mathbf{x})$ does not depend on $\phi$, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence, without ever evaluating the true posterior.
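
The decomposition of $\log p(\mathbf{x})$ can be obtained in a few lines. The sketch below assumes the ELBO is defined as $\text{ELBO}(\phi, \theta) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x}, \mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})]$ and uses only the identity $p(\mathbf{z}|\mathbf{x}) = p(\mathbf{x}, \mathbf{z}) / p(\mathbf{x})$:

$$
\begin{aligned}
\text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x}))
&= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{z}|\mathbf{x})\right] \\
&= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{x}, \mathbf{z})\right] + \log p(\mathbf{x}) \\
&= -\text{ELBO}(\phi, \theta) + \log p(\mathbf{x}).
\end{aligned}
$$

The second line pulls $\log p(\mathbf{x})$ out of the expectation because it does not depend on $\mathbf{z}$. Moving the ELBO to the left-hand side gives the stated identity.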

The ELBO is mathematically decomposed into two competing terms: $\text{ELBO} = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - \text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}))$. The first term is the 'reconstruction log-likelihood,' which encourages the decoder ($p_{\theta}$) to accurately reconstruct the input $\mathbf{x}$ from the sampled latent code $\mathbf{z}$. The second term is a regularizer that forces the approximate posterior to stay close to the prior $p(\mathbf{z})$, typically a standard Gaussian $\mathcal{N}(0, \mathbf{I})$.
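
The two terms can be sketched numerically. The example below assumes a diagonal Gaussian encoder (so the KL term to a standard Gaussian prior has a closed form), a unit-variance Gaussian decoder, and a hypothetical fixed linear map `toy_decode` standing in for a trained decoder network. None of these choices are prescribed by the text above; they only keep the example small.

```python
import math
import random

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def gaussian_log_pdf(x, mean, std):
    """Log density of a scalar Gaussian N(mean, std^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2

def toy_decode(z):
    """Hypothetical decoder mean: a fixed linear map from a 2-D latent to a scalar."""
    return 1.5 * z[0] - 0.5 * z[1]

def elbo_estimate(x, mu, sigma, num_samples=1000, seed=0):
    """Monte Carlo reconstruction term minus the analytic KL regularizer."""
    rng = random.Random(seed)
    recon = 0.0
    for _ in range(num_samples):
        # Draw z ~ q(z | x) = N(mu, diag(sigma^2)).
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        # Assumed decoder likelihood: p(x | z) = N(x; toy_decode(z), 1).
        recon += gaussian_log_pdf(x, toy_decode(z), 1.0)
    return recon / num_samples - kl_to_standard_normal(mu, sigma)

print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # q equals the prior
print(elbo_estimate(x=1.0, mu=[0.6, -0.2], sigma=[0.5, 0.8]))
```

The tension between the terms is visible in the code: shrinking `sigma` and moving `mu` to fit `x` improves the reconstruction term but increases the KL penalty.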

A significant technical hurdle arises when we try to differentiate the ELBO with respect to the encoder parameters $\phi$. Because the expectation is taken over samples $\mathbf{z} \sim q_{\phi}(\mathbf{z}|\mathbf{x})$, and the sampling operation itself is not differentiable with respect to $\phi$, we cannot backpropagate through the stochastic node directly. To overcome this, we use the 'reparameterization trick.' For a Gaussian encoder, we express $\mathbf{z}$ as a deterministic transformation of a noise variable $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, such that $\mathbf{z} = \mu_{\phi}(\mathbf{x}) + \sigma_{\phi}(\mathbf{x}) \odot \epsilon$. This moves the randomness into an auxiliary input that does not depend on $\phi$, allowing gradients to flow through $\mu_{\phi}$ and $\sigma_{\phi}$.
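
A minimal sketch of the trick in one dimension, written without an autodiff library so the chain rule is explicit. The test function $f(z) = z^2$ is an assumption chosen because its expected value under $\mathcal{N}(\mu, \sigma^2)$ is $\mu^2 + \sigma^2$, which gives analytic gradients to compare against; in a real VAE, $f$ would be the ELBO integrand and the gradients would come from backpropagation.

```python
import random

def reparam_grad_estimate(mu, sigma, num_samples=100_000, seed=0):
    """Estimate the gradients of E_{z ~ N(mu, sigma^2)}[z^2] w.r.t. mu and sigma."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)  # noise drawn independently of mu and sigma
        z = mu + sigma * eps       # deterministic and differentiable in mu, sigma
        df_dz = 2.0 * z            # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0     # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

g_mu, g_sigma = reparam_grad_estimate(mu=1.0, sigma=0.5)
# Analytic gradients of mu^2 + sigma^2 are 2 * mu = 2.0 and 2 * sigma = 1.0.
print(g_mu, g_sigma)
```

The estimates are Monte Carlo averages, so they match the analytic values only up to sampling noise.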

Once trained, the VAE provides a powerful generative mechanism. While the encoder is used during training to learn the latent space, the decoder $p_{\theta}(\mathbf{x}|\mathbf{z})$ can be used independently for synthesis. By sampling $\mathbf{z}$ directly from the prior $p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$ and passing it through the decoder, we can generate novel data points that share the structural characteristics of the training set. Thus, the VAE transforms the daunting task of Bayesian integration into a scalable deep learning objective.
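
The generation step can be sketched as follows. `toy_decoder` is a hypothetical stand-in with made-up fixed weights, not a trained network; the point is only that synthesis needs a prior sample and the decoder, and never calls the encoder.

```python
import math
import random

def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: 2-D latent to 3 output means."""
    weights = [[0.8, -0.3], [0.1, 0.9], [-0.5, 0.4]]  # made-up, not learned
    return [math.tanh(sum(w * zi for w, zi in zip(row, z))) for row in weights]

def generate(num_samples, latent_dim=2, seed=0):
    """Draw z ~ N(0, I) and decode each draw; the encoder is not involved."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        samples.append(toy_decoder(z))
    return samples

for x in generate(3):
    print([round(v, 3) for v in x])
```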