Bridging Bayesian Inference and Generative Modeling: The Mechanics of VAEs and the ELBO

At its core, Bayesian inference is the process of updating our beliefs about a latent variable $z$ after observing some data $x$. We are interested in the posterior distribution $p(z|x)$, which describes the probability of the hidden causes given the observed effects. According to Bayes' Rule, this is expressed as: $$p(z|x) = \frac{p(x|z)p(z)}{p(x)}$$ The primary challenge is the marginal likelihood (or evidence) $p(x) = \int p(x|z)p(z) dz$, which is typically intractable when $z$ is high-dimensional, since it requires integrating over all possible configurations of $z$ and rarely has a closed form.

To bypass this intractability, we use Variational Inference (VI). Instead of calculating the true posterior $p(z|x)$, we introduce a proxy distribution $q_{\phi}(z|x)$, parameterized by $\phi$, and attempt to make it as close to the true posterior as possible. A standard measure of dissimilarity between two probability distributions is the Kullback-Leibler (KL) divergence, which is non-negative, zero only when the two distributions coincide, and not symmetric in its arguments. Our goal is to minimize $\text{KL}(q_{\phi}(z|x) || p(z|x))$, effectively transforming an integration problem into an optimization problem. This objective cannot be evaluated directly, because it still contains the unknown posterior.

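When we expand the KL divergence term, we discover a critical relationship. The KL divergence is defined as: $$\text{KL}(q_{\phi}(z|x) || p(z|x)) = \int q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p(z|x)} dz$$ Using the identity $p(z|x) = \frac{p(x,z)}{p(x)}$, we have $\log p(z|x) = \log p(x,z) - \log p(x)$. Since $\log p(x)$ does not depend on $z$, it can be pulled out of the expectation over $q_{\phi}(z|x)$, and rearranging the terms gives: $\log p(x) = \text{KL}(q_{\phi}(z|x) || p(z|x)) + \mathbb{E}_{q_{\phi}(z|x)}[\log p(x,z) - \log q_{\phi}(z|x)]$. The second term on the right-hand side is known as the Evidence Lower Bound, or ELBO. Unlike the KL term, it involves only the joint $p(x,z)$ and $q_{\phi}(z|x)$, both of which we can evaluate.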
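The decomposition can be checked numerically on a model small enough that the evidence is a finite sum rather than an integral. The following is a minimal sketch, assuming a hypothetical three-state discrete latent variable; the probabilities are made-up illustrative values, not taken from any real model.

```python
import math

# Hypothetical toy model: z takes three values and x is one fixed observation.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.40, 0.70]  # p(x | z) evaluated at the observed x
q = [0.2, 0.3, 0.5]              # an arbitrary approximate posterior q(z | x)

joint = [pz * px_z for pz, px_z in zip(prior, likelihood)]  # p(x, z)
evidence = sum(joint)                                       # p(x), a sum here
posterior = [j / evidence for j in joint]                   # p(z | x) via Bayes' Rule

# KL(q || p(z|x)) and the ELBO, both as expectations under q.
kl = sum(qz * math.log(qz / pz_x) for qz, pz_x in zip(q, posterior))
elbo = sum(qz * (math.log(j) - math.log(qz)) for qz, j in zip(q, joint))

# The two printed numbers should agree up to floating-point error.
print(math.log(evidence), kl + elbo)
```

Changing `q` moves probability mass between the KL term and the ELBO, but their sum stays fixed at $\log p(x)$. Setting `q` equal to `posterior` drives the KL term to zero and makes the bound tight.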

The ELBO is a lower bound on the log-marginal likelihood: because the KL divergence is always non-negative, the identity above implies $\log p(x) \geq \text{ELBO}$. Moreover, $\log p(x)$ does not depend on $\phi$, so maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence between our approximation and the true posterior. Factoring the joint as $p_{\theta}(x,z) = p_{\theta}(x|z)p(z)$, where $\theta$ denotes the parameters of the likelihood, we can rewrite the ELBO as: $$\text{ELBO}(\phi, \theta) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - \text{KL}(q_{\phi}(z|x) || p(z))$$ Here, the first term is the 'reconstruction' term (how well the model reconstructs $x$ from $z$), and the second is a 'regularization' term (how close the approximate posterior is to the prior $p(z)$).
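The regularization term has a closed form in one commonly used special case. The sketch below assumes that $q_{\phi}(z|x)$ is a Gaussian with diagonal covariance and that the prior is $p(z) = \mathcal{N}(0, I)$; both are assumptions made for illustration, and the function name is hypothetical. Under those assumptions, $\text{KL}(q_{\phi}(z|x) || p(z)) = \frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$.

```python
import math

def kl_diag_gaussian_to_std_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.

    mu and sigma are equal-length sequences of floats; every sigma must be > 0.
    """
    return 0.5 * sum(
        m * m + s * s - 1.0 - math.log(s * s)
        for m, s in zip(mu, sigma)
    )

# The divergence vanishes when q already equals the prior ...
print(kl_diag_gaussian_to_std_normal([0.0, 0.0], [1.0, 1.0]))
# ... and grows as the mean drifts from 0 or the scale drifts from 1.
print(kl_diag_gaussian_to_std_normal([1.0, 0.0], [0.5, 1.0]))
```

Because this term is available analytically under these assumptions, only the reconstruction term needs to be estimated by sampling.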

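The Variational Autoencoder (VAE) implements this framework using neural networks. An encoder network with weights $\phi$ maps each input $x$ to the parameters (typically the mean $\mu$ and variance $\sigma^2$) of a Gaussian distribution $q_{\phi}(z|x)$, and a decoder network with weights $\theta$ maps a latent code $z$ to the parameters of the likelihood $p_{\theta}(x|z)$. However, drawing a sample $z \sim q_{\phi}(z|x)$ is not a differentiable operation with respect to $\phi$, so gradients cannot be backpropagated through it directly. To solve this, we use the 'reparameterization trick', expressing $z$ as $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. The randomness now enters only through $\epsilon$, which does not depend on $\phi$, and $z$ becomes a deterministic, differentiable function of $\mu$ and $\sigma$.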
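The reparameterization step itself is only a few lines. The following is a minimal sketch in plain Python, assuming a diagonal Gaussian and using hypothetical values for $\mu$ and $\sigma$ in place of real encoder outputs; in practice this would be written in an automatic-differentiation framework so that gradients flow through it.

```python
import random

def reparameterize(mu, sigma, rng):
    """Return z = mu + sigma * eps together with the noise eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]              # noise independent of mu, sigma
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]   # elementwise shift and scale
    return z, eps

rng = random.Random(0)
mu, sigma = [1.0, -2.0], [0.5, 2.0]   # placeholder values standing in for encoder outputs
z, eps = reparameterize(mu, sigma, rng)

# With eps held fixed, z is a deterministic function of (mu, sigma):
# dz/dmu = 1 and dz/dsigma = eps, elementwise. A finite difference confirms the latter.
h = 1e-6
z_shifted = [m + (s + h) * e for m, s, e in zip(mu, sigma, eps)]
dz_dsigma = [(a - b) / h for a, b in zip(z_shifted, z)]
print(eps, dz_dsigma)
```

The two printed lists should match closely, which is the property that lets gradients of the reconstruction term reach the encoder weights.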

The reparameterization trick allows us to train the VAE end-to-end with stochastic gradient methods, typically by minimizing the negative ELBO averaged over mini-batches of data. By maximizing the ELBO, the model learns a latent space that is both informative (via the reconstruction term) and structured (via the KL term). The result is a generative model capable of sampling new data points by drawing $z$ from the prior $p(z)$ and passing it through the decoder, which maps regions of the latent space to plausible data points.
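The generation procedure can be sketched as follows. This is an illustration only: it assumes a standard normal prior and uses a hypothetical one-layer decoder with placeholder weights standing in for a trained network, with Bernoulli means as the output parameters.

```python
import math
import random

def toy_decoder(z, weights, bias):
    """Hypothetical stand-in for a trained decoder: linear layer plus sigmoid.

    Returns one Bernoulli mean per output dimension, i.e. the parameters
    of p(x | z) under the assumptions stated above.
    """
    logits = [sum(w * zi for w, zi in zip(row, z)) + b
              for row, b in zip(weights, bias)]
    return [1.0 / (1.0 + math.exp(-logit)) for logit in logits]

def generate(weights, bias, latent_dim, rng):
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]  # z ~ p(z) = N(0, I)
    return toy_decoder(z, weights, bias)                  # decode into data space

rng = random.Random(0)
weights = [[0.8, -0.3], [0.1, 1.2], [-0.5, 0.4]]  # placeholder values, not learned
bias = [0.0, -0.1, 0.2]
x_mean = generate(weights, bias, latent_dim=2, rng=rng)
print(x_mean)
```

Note that the encoder plays no role at generation time: only the prior and the decoder are needed.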