From Bayesian Principles to Generative Models: Deciphering VAEs and the ELBO

At its heart, Bayesian inference is about updating our beliefs about a hidden parameter $\theta$ after observing some data $x$. According to Bayes' Theorem, the posterior distribution $p(\theta | x)$ equals the product of the likelihood $p(x | \theta)$ and the prior $p(\theta)$, divided by the evidence $p(x)$: $p(\theta | x) = \frac{p(x | \theta) p(\theta)}{p(x)}$. In complex high-dimensional spaces, calculating the evidence $p(x) = \int p(x, \theta) d\theta$ is often computationally intractable: the integral usually has no closed form, and numerical integration scales poorly with the dimension of $\theta$. This is the fundamental problem that necessitates Variational Inference.
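
To make the evidence integral concrete, here is a minimal sketch for a one-dimensional toy model where the answer is also known in closed form. The model (prior $\theta \sim \mathcal{N}(0, 1)$, likelihood $x | \theta \sim \mathcal{N}(\theta, 1)$), the function names, and the grid settings are illustrative assumptions, not something prescribed by the text above.

```python
import numpy as np

def normal_pdf(v, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at v."""
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def evidence_on_grid(x, lo=-10.0, hi=10.0, num_points=2001):
    """Approximate p(x) = integral of p(x | theta) p(theta) d theta with a Riemann sum.

    Assumed toy model: theta ~ N(0, 1) and x | theta ~ N(theta, 1).
    """
    theta = np.linspace(lo, hi, num_points)
    step = theta[1] - theta[0]
    joint = normal_pdf(x, theta, 1.0) * normal_pdf(theta, 0.0, 1.0)
    return float(np.sum(joint) * step)

x_obs = 1.5
numeric = evidence_on_grid(x_obs)
# For this particular model the marginal is x ~ N(0, 2), so we can compare.
exact = float(normal_pdf(x_obs, 0.0, 2.0))
print(numeric, exact)
```

In one dimension the grid works well, but a grid with $G$ points per axis needs $G^d$ evaluations for a $d$-dimensional $\theta$, which is why this brute-force approach breaks down for the high-dimensional latent spaces discussed here.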

Variational Inference (VI) transforms the problem of integration into a problem of optimization. Instead of computing the exact posterior $p(\theta | x)$, we introduce a simpler, parameterized distribution $q_{\phi}(\theta | x)$, called the variational distribution, and attempt to make it as similar as possible to the true posterior. We measure the similarity using the Kullback-Leibler (KL) divergence: $\text{KL}(q_{\phi}(\theta | x) || p(\theta | x)) = \int q_{\phi}(\theta | x) \log \frac{q_{\phi}(\theta | x)}{p(\theta | x)} d\theta$. Our goal is to find the parameters $\phi$ that minimize this divergence.
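
As a small illustration of the KL integral, the sketch below evaluates it for two univariate Gaussians, once with the well-known closed form and once by numerical integration of the definition. The choice of Gaussians and the helper names `kl_gaussians` and `kl_numeric` are assumptions made for this example only.

```python
import numpy as np

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) for q = N(mu_q, sigma_q^2) and p = N(mu_p, sigma_p^2)."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def kl_numeric(mu_q, sigma_q, mu_p, sigma_p, num_points=20001):
    """Riemann-sum approximation of the integral of q(t) * log(q(t) / p(t)) dt."""
    # Integrate over a wide window around q; q is negligible outside it.
    t = np.linspace(mu_q - 12.0 * sigma_q, mu_q + 12.0 * sigma_q, num_points)
    step = t[1] - t[0]
    log_q = -0.5 * ((t - mu_q) / sigma_q) ** 2 - np.log(sigma_q * np.sqrt(2.0 * np.pi))
    log_p = -0.5 * ((t - mu_p) / sigma_p) ** 2 - np.log(sigma_p * np.sqrt(2.0 * np.pi))
    return float(np.sum(np.exp(log_q) * (log_q - log_p)) * step)

forward = float(kl_gaussians(0.0, 1.0, 1.0, 2.0))   # KL(q || p)
reverse = float(kl_gaussians(1.0, 2.0, 0.0, 1.0))   # KL(p || q)
print(forward, kl_numeric(0.0, 1.0, 1.0, 2.0), reverse)
```

The forward and reverse values differ, a reminder that the KL divergence is not symmetric; variational inference as described here minimizes the direction with $q_{\phi}$ in the first argument.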

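Since we cannot minimize the KL divergence directly (as it depends on the unknown $p(\theta | x)$), we derive a surrogate objective called the Evidence Lower Bound, or ELBO. By expanding the KL divergence and substituting $p(\theta | x) = p(x, \theta) / p(x)$, we can show that $\log p(x) = \text{ELBO}(\phi) + \text{KL}(q_{\phi}(\theta | x) || p(\theta | x))$, where $\text{ELBO}(\phi) = \mathbb{E}_{q_{\phi}(\theta | x)}[\log p(x, \theta) - \log q_{\phi}(\theta | x)]$. Because the KL divergence is always non-negative, the ELBO serves as a lower bound on the log-evidence of the data: $\log p(x) \geq \mathbb{E}_{q_{\phi}(\theta | x)}[\log p(x, \theta) - \log q_{\phi}(\theta | x)]$. And because $\log p(x)$ does not depend on $\phi$, maximizing the ELBO over $\phi$ is equivalent to minimizing the KL divergence to the true posterior.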
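
For readers who want the intermediate steps, the decomposition can be written out as follows. This is a sketch of the standard argument, using only the definition of the KL divergence and the identity $p(\theta | x) = p(x, \theta) / p(x)$; the second line pulls $\log p(x)$ out of the expectation because it does not depend on $\theta$.

```latex
\begin{aligned}
\text{KL}(q_{\phi}(\theta | x) || p(\theta | x))
  &= \mathbb{E}_{q_{\phi}(\theta | x)}[\log q_{\phi}(\theta | x) - \log p(\theta | x)] \\
  &= \mathbb{E}_{q_{\phi}(\theta | x)}[\log q_{\phi}(\theta | x) - \log p(x, \theta)] + \log p(x) \\
  &= -\text{ELBO} + \log p(x)
\end{aligned}
```

Moving the ELBO to the other side gives $\log p(x) = \text{ELBO} + \text{KL}(q_{\phi}(\theta | x) || p(\theta | x))$.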

To make the ELBO intuition clearer, we often rewrite it into two interpretable terms: $\text{ELBO} = \mathbb{E}_{q_{\phi}(\theta | x)}[\log p(x | \theta)] - \text{KL}(q_{\phi}(\theta | x) || p(\theta))$. The first term is the 'reconstruction' term; it encourages the model to choose latent variables $\theta$ that explain the observed data well. The second term is a regularization term; it forces the approximate posterior to stay close to the prior distribution, preventing the model from simply assigning a unique point in space to every single data sample.
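
The bound and the two-term form can be checked numerically on a toy problem. The sketch below assumes a one-dimensional model with prior $\theta \sim \mathcal{N}(0, 1)$, likelihood $x | \theta \sim \mathcal{N}(\theta, 1)$, and a Gaussian $q = \mathcal{N}(m, s^2)$; under these assumptions the exact posterior is $\mathcal{N}(x/2, 1/2)$ and the evidence is $\mathcal{N}(x; 0, 2)$. All function and variable names are illustrative.

```python
import numpy as np

LOG_2PI = np.log(2.0 * np.pi)

def log_normal(v, mean, var):
    """Log-density of a univariate Gaussian N(mean, var) at v."""
    return -0.5 * (LOG_2PI + np.log(var) + (v - mean) ** 2 / var)

def elbo_closed_form(x, m, s):
    """Reconstruction term minus KL(q || prior); both are analytic for this toy model."""
    reconstruction = -0.5 * LOG_2PI - 0.5 * ((x - m) ** 2 + s ** 2)
    kl_to_prior = -np.log(s) + 0.5 * (s ** 2 + m ** 2) - 0.5
    return float(reconstruction - kl_to_prior)

def elbo_monte_carlo(x, m, s, num_samples=200_000, seed=0):
    """Sample estimate of E_q[log p(x, theta) - log q(theta | x)]."""
    rng = np.random.default_rng(seed)
    theta = m + s * rng.standard_normal(num_samples)
    log_joint = log_normal(x, theta, 1.0) + log_normal(theta, 0.0, 1.0)
    log_q = log_normal(theta, m, s ** 2)
    return float(np.mean(log_joint - log_q))

x_obs = 1.5
log_evidence = float(log_normal(x_obs, 0.0, 2.0))
loose = elbo_closed_form(x_obs, m=0.0, s=1.0)                     # q equal to the prior
tight = elbo_closed_form(x_obs, m=x_obs / 2.0, s=np.sqrt(0.5))    # q equal to the posterior
sampled = elbo_monte_carlo(x_obs, m=0.0, s=1.0)
print(log_evidence, loose, tight, sampled)
```

With $q$ set to the prior the ELBO sits strictly below $\log p(x)$, with $q$ set to the exact posterior the gap closes, and the Monte Carlo estimate of the joint-minus-entropy form agrees with the two-term form up to sampling noise.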

A Variational Autoencoder (VAE) is a neural network implementation of this framework. The 'Encoder' network parameterizes $q_{\phi}(\theta | x)$ by outputting the parameters (usually the mean $\mu$ and variance $\sigma^2$) of a Gaussian distribution. To allow gradients to flow through the random sampling of $\theta$, we use the 'Reparameterization Trick'. We express $\theta$ as $\theta = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\odot$ denotes element-wise multiplication. This moves the stochasticity to an input node, so $\theta$ becomes a deterministic, differentiable function of $\mu$ and $\sigma$ and gradients can be backpropagated into the encoder.
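
A minimal sketch of why the trick helps: once $\theta$ is written as a deterministic function of $\mu$, $\sigma$, and external noise, the chain rule can be applied sample by sample. The test function $f(\theta) = \theta^2$ and the helper names are assumptions chosen because the exact gradients are easy to verify by hand; a real VAE would obtain these gradients from an automatic differentiation library.

```python
import numpy as np

def reparameterize(mu, sigma, eps):
    """Deterministic map from noise eps ~ N(0, I) to a sample theta ~ N(mu, sigma^2)."""
    return mu + sigma * eps

def pathwise_gradients(mu, sigma, num_samples=200_000, seed=0):
    """Estimate d/dmu and d/dsigma of E[theta^2] by differentiating through the sample.

    With f(theta) = theta^2 and theta = mu + sigma * eps, the chain rule gives
    df/dmu = 2 * theta and df/dsigma = 2 * theta * eps for each noise draw.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(num_samples)
    theta = reparameterize(mu, sigma, eps)
    grad_mu = float(np.mean(2.0 * theta))
    grad_sigma = float(np.mean(2.0 * theta * eps))
    return grad_mu, grad_sigma

grad_mu, grad_sigma = pathwise_gradients(mu=1.0, sigma=0.5)
# Analytic check: E[theta^2] = mu^2 + sigma^2, so the exact gradients are 2*mu and 2*sigma.
print(grad_mu, grad_sigma)
```

The estimates should land close to $2\mu$ and $2\sigma$, up to Monte Carlo noise.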

The 'Decoder' network then implements $p(x | \theta)$, mapping the sampled latent vector back into the original data space. During training, the VAE maximizes the ELBO with respect to the parameters of both networks. This encourages the latent space to be continuous and structured. Because of the KL penalty against the prior $\mathcal{N}(0, I)$, the model learns a compact representation where similar data points are mapped to nearby regions, allowing us to generate new, synthetic data by sampling directly from the prior and passing the result through the decoder.
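
The pieces above can be put together in a forward-pass sketch of the training loss and of generation. Everything here is a simplifying assumption for illustration: the "networks" are untrained random linear maps, the encoder outputs a log-variance rather than a variance, the decoder likelihood is a unit-variance Gaussian, and no gradient step is taken.

```python
import numpy as np

rng = np.random.default_rng(0)
DATA_DIM, LATENT_DIM = 6, 2

# Hypothetical, untrained linear "networks"; a real VAE would use learned nonlinear layers.
W_mu = rng.normal(scale=0.1, size=(DATA_DIM, LATENT_DIM))
W_log_var = rng.normal(scale=0.1, size=(DATA_DIM, LATENT_DIM))
W_dec = rng.normal(scale=0.1, size=(LATENT_DIM, DATA_DIM))

def encode(x):
    """Return mean and log-variance of q(theta | x) for a batch x of shape (batch, DATA_DIM)."""
    return x @ W_mu, x @ W_log_var

def decode(theta):
    """Return the mean of p(x | theta), assumed Gaussian with unit variance."""
    return theta @ W_dec

def negative_elbo(x, eps):
    """Single-sample estimate of -ELBO and of the KL term, both averaged over the batch."""
    mu, log_var = encode(x)
    theta = mu + np.exp(0.5 * log_var) * eps  # reparameterized sample
    x_mean = decode(theta)
    # -log p(x | theta) for a unit-variance Gaussian, including the normalizing constant
    recon = 0.5 * np.sum((x - x_mean) ** 2 + np.log(2.0 * np.pi), axis=1)
    # Closed-form KL(N(mu, diag(sigma^2)) || N(0, I))
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)
    return float(np.mean(recon + kl)), float(np.mean(kl))

x_batch = rng.normal(size=(4, DATA_DIM))
loss, kl_value = negative_elbo(x_batch, rng.standard_normal((4, LATENT_DIM)))

# Generation: sample theta from the prior N(0, I) and decode it.
new_samples = decode(rng.standard_normal((3, LATENT_DIM)))
print(loss, kl_value, new_samples.shape)
```

Minimizing this loss with a gradient-based optimizer is the same as maximizing the ELBO; the generation step at the end uses only the decoder and the prior, with no encoder involved.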