Bayesian Inference and the Mechanics of Variational Autoencoders

An exploration of approximating intractable posterior distributions through variational inference and the optimization of the Evidence Lower Bound. We bridge the gap between probabilistic graphical models and deep generative networks.

At its heart, Bayesian inference is the process of updating our beliefs about a set of latent parameters $\theta$ given some observed data $x$. According to Bayes' Theorem, the posterior distribution $p(\theta|x)$ is proportional to the product of the likelihood $p(x|\theta)$ and the prior $p(\theta)$. However, in complex models, especially deep neural networks, the marginal likelihood $p(x) = \int p(x|\theta)p(\theta) d\theta$, known as the evidence, is often computationally intractable because the integral must be solved over a high-dimensional parameter space.
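To make the role of the evidence concrete, here is a minimal sketch for a one-dimensional case where the integral can still be approximated on a grid. The model is an assumption chosen for illustration, not something taken from the lesson: a prior $\theta \sim \mathcal{N}(0, 1)$, a likelihood $x_i \sim \mathcal{N}(\theta, 1)$, and three made-up observations. For this conjugate pair the exact posterior mean is known, so the grid result can be checked.

```python
import numpy as np

def log_normal_pdf(value, mean, var):
    """Log density of a univariate Gaussian N(mean, var) evaluated at value."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (value - mean) ** 2 / var)

# Illustrative data (assumed): three observations of x_i ~ N(theta, 1).
x = np.array([0.5, 1.2, 0.8])

# Grid over the single latent parameter theta.
theta = np.linspace(-6.0, 6.0, 4001)
dtheta = theta[1] - theta[0]

# Unnormalized posterior: likelihood times prior, evaluated at each grid point.
log_prior = log_normal_pdf(theta, 0.0, 1.0)
log_lik = log_normal_pdf(x[:, None], theta[None, :], 1.0).sum(axis=0)
unnormalized = np.exp(log_lik + log_prior)

# Evidence p(x): the integral over theta, approximated by a Riemann sum.
evidence = unnormalized.sum() * dtheta
posterior = unnormalized / evidence

# Compare the grid posterior mean with the known conjugate result sum(x) / (n + 1).
grid_mean = (theta * posterior).sum() * dtheta
exact_mean = x.sum() / (len(x) + 1)
print(evidence, grid_mean, exact_mean)
```

With 4001 grid points per dimension, the same approach would need 4001 to the power $d$ evaluations for a $d$-dimensional $\theta$, which is why it stops being practical almost immediately.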

To circumvent this intractability, we employ Variational Inference (VI). Instead of computing the exact posterior $p(\theta|x)$, we introduce a simpler, parameterized distribution $q_{\phi}(\theta)$ (the variational distribution) and adjust its parameters $\phi$ to make it as close as possible to the true posterior. The standard measure of closeness in VI is the Kullback-Leibler (KL) divergence, which is zero only when the two distributions coincide and is not symmetric in its arguments. Our goal is to minimize $KL(q_{\phi}(\theta) || p(\theta|x))$, which turns an inference problem into an optimization problem.
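As a small illustration of the KL divergence, the sketch below uses the closed-form expression for two univariate Gaussians. The function name `kl_gaussian` and the example parameters are hypothetical choices for this illustration. It shows that the divergence vanishes for identical distributions and that swapping the arguments changes the value.

```python
import numpy as np

def kl_gaussian(mean_q, std_q, mean_p, std_p):
    """KL(q || p) for q = N(mean_q, std_q^2) and p = N(mean_p, std_p^2)."""
    return (np.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)

# Identical distributions: the divergence is zero.
kl_same = kl_gaussian(0.0, 1.0, 0.0, 1.0)

# Swapping the arguments gives a different number: KL is not symmetric.
kl_q_to_p = kl_gaussian(0.0, 1.0, 1.0, 2.0)
kl_p_to_q = kl_gaussian(1.0, 2.0, 0.0, 1.0)
print(kl_same, kl_q_to_p, kl_p_to_q)
```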

The challenge is that the KL divergence between $q$ and $p$ involves the term $\log p(\theta|x)$, which contains the intractable evidence $p(x)$. By rearranging the terms of the KL divergence, we derive the Evidence Lower Bound, or ELBO. Mathematically, the relationship is expressed as: $\log p(x) = ELBO(\phi) + KL(q_{\phi}(\theta) || p(\theta|x))$. Since the KL divergence is always non-negative, the ELBO serves as a lower bound on the log evidence: $\log p(x) \geq ELBO(\phi)$. Because $\log p(x)$ does not depend on $\phi$, maximizing the ELBO is equivalent to minimizing the KL divergence to the true posterior, and the ELBO itself can be computed without knowing $p(x)$.
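The identity $\log p(x) = ELBO(\phi) + KL(q_{\phi}(\theta) || p(\theta|x))$ can be checked numerically in a case where every term has a closed form. The sketch below assumes a toy model that is not part of the lesson: prior $\theta \sim \mathcal{N}(0, 1)$, likelihood $x|\theta \sim \mathcal{N}(\theta, 1)$, a single observation, and a Gaussian $q = \mathcal{N}(m, s^2)$ with arbitrarily chosen $m$ and $s$. Under these assumptions the evidence is $\mathcal{N}(x; 0, 2)$ and the posterior is $\mathcal{N}(x/2, 1/2)$.

```python
import numpy as np

LOG_2PI = np.log(2.0 * np.pi)

def kl_gaussian(mean_q, std_q, mean_p, std_p):
    """KL(q || p) between two univariate Gaussians."""
    return (np.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)

x = 1.0          # single observation (assumed for illustration)
m, s = 0.3, 0.8  # variational parameters phi = (m, s), chosen arbitrarily

# ELBO = E_q[log p(x|theta)] + E_q[log p(theta)] + entropy of q, all in closed form.
expected_log_lik = -0.5 * LOG_2PI - 0.5 * ((x - m) ** 2 + s ** 2)
expected_log_prior = -0.5 * LOG_2PI - 0.5 * (m ** 2 + s ** 2)
entropy_q = 0.5 * (LOG_2PI + 1.0 + 2.0 * np.log(s))
elbo = expected_log_lik + expected_log_prior + entropy_q

# Exact quantities available for this conjugate model.
log_evidence = -0.5 * (np.log(2.0 * np.pi * 2.0) + x ** 2 / 2.0)
kl_to_posterior = kl_gaussian(m, s, x / 2.0, np.sqrt(0.5))

print(log_evidence, elbo + kl_to_posterior)
```

Setting `m = x / 2` and `s = np.sqrt(0.5)` makes $q$ equal to the true posterior, in which case the KL term is zero and the bound is tight.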

The ELBO can be decomposed into two intuitive components: the expected log-likelihood and the KL divergence from the prior. The objective function is written as: $$ELBO(\phi) = E_{q_{\phi}(\theta)}[\log p(x|\theta)] - KL(q_{\phi}(\theta) || p(\theta))$$ The first term measures 'reconstruction' quality, that is, how well the latent variables explain the data, while the second term acts as a 'regularizer' that penalizes the variational distribution for drifting away from the prior $p(\theta)$, typically a standard Gaussian $\mathcal{N}(0, I)$.
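When the variational distribution is a diagonal Gaussian and the prior is $\mathcal{N}(0, I)$, the KL regularizer has a well-known closed form, $\frac{1}{2}\sum_j (\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2)$. The sketch below assumes that setting. The function name, the log-variance parameterization, and the example values are illustrative choices. It compares the formula against a Monte Carlo estimate of $E_q[\log q - \log p]$.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# Illustrative variational parameters for a 2-dimensional latent (assumed values).
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -1.0])
kl_exact = kl_to_standard_normal(mu, log_var)

# Monte Carlo check: average log q(z) - log p(z) over samples z ~ q.
rng = np.random.default_rng(0)
eps = rng.standard_normal((200_000, mu.size))
z = mu + np.exp(0.5 * log_var) * eps
log_q = -0.5 * np.sum(np.log(2.0 * np.pi) + log_var + eps ** 2, axis=1)
log_p = -0.5 * np.sum(np.log(2.0 * np.pi) + z ** 2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_exact, kl_mc)
```

The penalty is zero exactly when `mu` is all zeros and `log_var` is all zeros, that is, when $q$ matches the prior.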

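Variational Autoencoders (VAEs) operationalize this framework using neural networks. Here the latent quantity is a per-example code $z$, which plays the role of $\theta$ above. The encoder network, with weights $\phi$, acts as the variational distribution $q_{\phi}(z|x)$, mapping input $x$ to the parameters of a distribution (usually the mean $\mu$ and variance $\sigma^2$ of a diagonal Gaussian). The decoder network represents the likelihood $p(x|z)$, reconstructing the data from a sampled latent vector $z$. Sampling $z \sim q_{\phi}(z|x)$ directly is not differentiable with respect to $\mu$ and $\sigma$, so to allow backpropagation we use the 'reparameterization trick,' expressing $z$ as $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. The randomness is then isolated in $\epsilon$, and $z$ becomes a deterministic, differentiable function of $\mu$ and $\sigma$.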
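The reparameterization $z = \mu + \sigma \odot \epsilon$ can be sketched in a few lines. NumPy has no automatic differentiation, so the example below writes the chain rule out by hand for a stand-in objective $f(z) = z^2$ (a hypothetical choice, not the VAE loss), for which $E[f(z)] = \mu^2 + \sigma^2$ and the true gradients are $2\mu$ and $2\sigma$. In a real VAE an autodiff framework would carry out the same computation through the decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7  # illustrative encoder outputs (assumed values)
n_samples = 200_000

# Reparameterized sampling: the noise eps does not depend on mu or sigma.
eps = rng.standard_normal(n_samples)
z = mu + sigma * eps

# Pathwise gradients of E[f(z)] with f(z) = z**2, via the chain rule:
#   df/dmu    = f'(z) * dz/dmu    = 2z * 1
#   df/dsigma = f'(z) * dz/dsigma = 2z * eps
df_dz = 2.0 * z
grad_mu = np.mean(df_dz)
grad_sigma = np.mean(df_dz * eps)

# Analytic gradients of mu**2 + sigma**2 for comparison.
print(grad_mu, 2.0 * mu)
print(grad_sigma, 2.0 * sigma)
```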

By maximizing the ELBO, the VAE learns a latent space that is both continuous and structured. The reconstruction term pushes the model to preserve information, while the KL term prevents the model from assigning a unique, point-like code to every input, which would amount to memorizing the training data and would leave regions of the latent space that the decoder has never been trained on. This balance allows the VAE to generate new, realistic data by sampling $z$ from the prior and passing it through the decoder, effectively performing generative modeling via approximate Bayesian inference.