The Mechanics of Generative Modeling: Bayesian Inference, VAEs, and the ELBO

At its heart, Bayesian inference is about updating our beliefs about a hidden cause $z$ given some observed data $x$. In a generative context, we assume the data is generated by a latent variable $z$ through a conditional distribution $p(x|z)$. To perform inference, we want to find the posterior distribution $p(z|x)$. By Bayes' Rule, this is expressed as: $$p(z|x) = \frac{p(x|z)p(z)}{p(x)}$$ where $p(x) = \int p(x|z)p(z) \, dz$. For complex neural networks, this integral is computationally intractable because it requires integrating over all possible configurations of $z$, making exact Bayesian inference impossible for high-dimensional data.
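To make the role of the evidence integral concrete, here is a minimal sketch in a one-dimensional toy model where everything can be checked by hand. The model is an assumption chosen for illustration, not something taken from a real VAE: $p(z) = \mathcal{N}(0, 1)$ and $p(x|z) = \mathcal{N}(z, 0.5^2)$, for which the evidence is known to be $\mathcal{N}(0, 1 + 0.5^2)$. The function names and grid settings are likewise hypothetical.

```python
import math

def normal_pdf(v, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) evaluated at v."""
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def evidence_on_grid(x, noise_std=0.5, lo=-8.0, hi=8.0, n=4001):
    """Approximate p(x) = integral of p(x|z) p(z) dz with a Riemann sum over z.

    Toy model (assumed for illustration): p(z) = N(0, 1), p(x|z) = N(z, noise_std^2).
    """
    dz = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * dz
        total += normal_pdf(x, z, noise_std) * normal_pdf(z, 0.0, 1.0) * dz
    return total

def posterior_density(z, x, noise_std=0.5):
    """Bayes' rule: p(z|x) = p(x|z) p(z) / p(x), with p(x) taken from the grid."""
    joint = normal_pdf(x, z, noise_std) * normal_pdf(z, 0.0, 1.0)
    return joint / evidence_on_grid(x, noise_std)

x_obs = 1.2
p_x_numeric = evidence_on_grid(x_obs)
# Closed-form evidence for this toy model: x ~ N(0, 1 + noise_std^2)
p_x_exact = normal_pdf(x_obs, 0.0, math.sqrt(1.0 + 0.5 ** 2))
print("p(x) on grid:", p_x_numeric, "closed form:", p_x_exact)
```

The grid here uses 4001 points for a single latent dimension. Applying the same resolution to a $d$-dimensional $z$ would need $4001^d$ evaluations, which is one way to see why the integral becomes impractical as the latent dimension grows.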

To overcome this intractability, Variational Autoencoders (VAEs) employ Variational Inference. Instead of calculating the true posterior $p(z|x)$, we introduce a simpler, parameterized distribution $q_{\phi}(z|x)$ (usually a Gaussian) to approximate it. The goal is to make $q_{\phi}(z|x)$ as close as possible to $p(z|x)$ by minimizing the Kullback-Leibler (KL) divergence between them: $$KL(q_{\phi}(z|x) \parallel p(z|x)) = \int q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p(z|x)} \, dz$$ The divergence is non-negative and equals zero only when the two distributions coincide (almost everywhere), so driving it toward zero pulls the approximate distribution toward the true underlying posterior. Note that it is not symmetric: $KL(q \parallel p)$ and $KL(p \parallel q)$ generally differ.
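As a small illustration of the KL divergence, the sketch below evaluates it for two univariate Gaussians in two ways: with the standard closed-form expression, and with a Monte Carlo average of $\log q(z) - \log p(z)$ over samples from $q$. The specific distributions, sample count, and function names are assumptions made for the example.

```python
import math
import random

def kl_gauss(mu_q, std_q, mu_p, std_p):
    """Closed-form KL(N(mu_q, std_q^2) || N(mu_p, std_p^2))."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)

def log_normal_pdf(v, mean, std):
    """Log-density of N(mean, std^2) at v."""
    return -0.5 * ((v - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def kl_monte_carlo(mu_q, std_q, mu_p, std_p, n=100_000, seed=0):
    """Estimate KL(q || p) as the sample mean of log q(z) - log p(z), z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(mu_q, std_q)
        total += log_normal_pdf(z, mu_q, std_q) - log_normal_pdf(z, mu_p, std_p)
    return total / n

# q = N(0, 1) and p = N(1, 2^2), chosen arbitrarily for the demonstration
kl_qp = kl_gauss(0.0, 1.0, 1.0, 2.0)
kl_pq = kl_gauss(1.0, 2.0, 0.0, 1.0)   # reversed arguments give a different value
kl_mc = kl_monte_carlo(0.0, 1.0, 1.0, 2.0)
print("KL(q||p) closed form:", kl_qp)
print("KL(q||p) Monte Carlo:", kl_mc)
print("KL(p||q) closed form:", kl_pq)
```

The Monte Carlo version only needs samples from $q$ and pointwise log-densities, which is the same pattern used when an expectation under $q_{\phi}(z|x)$ has no closed form.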

Since we cannot minimize the KL divergence directly (as it involves the unknown $p(z|x)$), we derive the Evidence Lower Bound, or ELBO. Substituting $p(z|x) = p(x|z)p(z)/p(x)$ into the KL divergence and rearranging, we find: $$\log p(x) = KL(q_{\phi}(z|x) \parallel p(z|x)) + \mathcal{L}(\phi, \theta; x)$$ where $\mathcal{L}$ is the ELBO. Because the KL divergence is always non-negative, the ELBO serves as a lower bound on the log evidence $\log p(x)$. For fixed generative parameters $\theta$, $\log p(x)$ does not depend on $\phi$, so maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence between the approximate and true posteriors. Maximizing it with respect to $\theta$ instead pushes up the lower bound on the evidence itself.
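The rearrangement can be spelled out as a short derivation. This is a sketch in the notation above, where $p(x, z) = p(x|z)p(z)$ denotes the joint density (a symbol introduced here for brevity) and the ELBO is taken to be $\mathcal{L}(\phi, \theta; x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p(x, z) - \log q_{\phi}(z|x)]$:

$$\begin{aligned} KL(q_{\phi}(z|x) \parallel p(z|x)) &= \mathbb{E}_{q_{\phi}(z|x)}\left[\log q_{\phi}(z|x) - \log p(z|x)\right] \\ &= \mathbb{E}_{q_{\phi}(z|x)}\left[\log q_{\phi}(z|x) - \log p(x, z) + \log p(x)\right] \\ &= \log p(x) - \mathbb{E}_{q_{\phi}(z|x)}\left[\log p(x, z) - \log q_{\phi}(z|x)\right] \\ &= \log p(x) - \mathcal{L}(\phi, \theta; x) \end{aligned}$$

The second line uses $\log p(z|x) = \log p(x, z) - \log p(x)$, and the third uses the fact that $\log p(x)$ does not depend on $z$, so it passes through the expectation unchanged. Moving $\mathcal{L}$ to the left-hand side gives the identity.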

The ELBO can be decomposed into two interpretable components: a reconstruction term and a regularization term. The expression is: $$\mathcal{L}(\phi, \theta; x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - KL(q_{\phi}(z|x) \parallel p(z))$$ The first term, the expected log-likelihood, encourages the decoder (parameterized by $\theta$) to reconstruct the input $x$ accurately from the sampled $z$. The second term, the KL divergence between the approximate posterior and a prior $p(z)$ (typically $\mathcal{N}(0, I)$), acts as a regularizer: it penalizes encodings that stray far from the prior, which discourages the model from simply memorizing data points in isolated regions and encourages a smooth, densely covered latent space.
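The decomposition can be checked numerically in a toy model where both terms have closed forms. The sketch below assumes a one-dimensional linear-Gaussian model, $p(z) = \mathcal{N}(0, 1)$ and $p(x|z) = \mathcal{N}(z, s^2)$, with a Gaussian $q(z|x) = \mathcal{N}(m, v^2)$. These choices, and all function names, are illustrative assumptions rather than part of any particular VAE implementation. For this model the true posterior and $\log p(x)$ are known exactly, so the gap between $\log p(x)$ and the ELBO can be compared with $KL(q \parallel p(z|x))$.

```python
import math

def kl_gauss(mu_q, std_q, mu_p, std_p):
    """Closed-form KL(N(mu_q, std_q^2) || N(mu_p, std_p^2))."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)

def elbo(x, m, v, noise_std):
    """ELBO for q(z|x) = N(m, v^2) in the toy model described above.

    Returns (elbo, reconstruction_term, regularization_term).
    """
    # Reconstruction term E_q[log p(x|z)]; uses E_q[(x - z)^2] = (x - m)^2 + v^2
    recon = (-0.5 * math.log(2.0 * math.pi * noise_std ** 2)
             - ((x - m) ** 2 + v ** 2) / (2.0 * noise_std ** 2))
    # Regularization term KL(q(z|x) || p(z)) with p(z) = N(0, 1)
    reg = kl_gauss(m, v, 0.0, 1.0)
    return recon - reg, recon, reg

def log_evidence(x, noise_std):
    """Exact log p(x) for the toy model: x ~ N(0, 1 + noise_std^2)."""
    var = 1.0 + noise_std ** 2
    return -0.5 * math.log(2.0 * math.pi * var) - x ** 2 / (2.0 * var)

x_obs, noise_std = 1.2, 0.5
# Exact posterior of the toy model, used only to check the bound
post_mean = x_obs / (1.0 + noise_std ** 2)
post_std = math.sqrt(noise_std ** 2 / (1.0 + noise_std ** 2))

gaps = []
for m, v in [(0.0, 1.0), (0.5, 0.8), (post_mean, post_std)]:
    value, recon, reg = elbo(x_obs, m, v, noise_std)
    gap = log_evidence(x_obs, noise_std) - value
    gaps.append((gap, kl_gauss(m, v, post_mean, post_std)))
    print(f"m={m:.3f} v={v:.3f} recon={recon:.4f} reg={reg:.4f} "
          f"ELBO={value:.4f} gap={gap:.4f}")
```

In this setting the gap shrinks to zero when $q$ matches the exact posterior, and for any other $q$ it equals the KL divergence to that posterior. A real VAE has no access to the exact posterior, so only the ELBO itself is observable during training.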

A significant hurdle in optimizing the ELBO is that the expectation $\mathbb{E}_{q_{\phi}(z|x)}$ is taken over a distribution that itself depends on $\phi$. Drawing $z \sim q_{\phi}(z|x)$ directly is not a differentiable function of $\phi$, so gradients cannot be backpropagated through the sampling step. To solve this, we use the Reparameterization Trick. We express $z$ as a deterministic transformation of a noise variable $\epsilon \sim \mathcal{N}(0, I)$: $$z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon$$ This shifts the randomness to $\epsilon$, which does not depend on $\phi$, allowing gradients to flow from the loss function through $z$ and the encoder outputs $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$ into the encoder parameters $\phi$.
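The effect of the reparameterization can be seen without any neural network. The sketch below uses a stand-in objective $f(z) = z^2$ (a hypothetical placeholder for the decoder loss) and treats $\mu$ and $\sigma$ as plain scalars rather than encoder outputs. Since $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, the exact gradients are $2\mu$ and $2\sigma$, which gives something to compare the sampled estimates against.

```python
import random

def reparam_grads(mu, sigma, n=100_000, seed=0):
    """Monte Carlo gradients of E_{z ~ N(mu, sigma^2)}[z^2] w.r.t. mu and sigma.

    With z = mu + sigma * eps and eps ~ N(0, 1), the chain rule gives, per sample,
    d(z^2)/dmu = 2 * z and d(z^2)/dsigma = 2 * z * eps.
    """
    rng = random.Random(seed)
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)   # randomness that does not depend on mu or sigma
        z = mu + sigma * eps        # deterministic in (mu, sigma) once eps is drawn
        g_mu += 2.0 * z             # df/dz * dz/dmu, with dz/dmu = 1
        g_sigma += 2.0 * z * eps    # df/dz * dz/dsigma, with dz/dsigma = eps
    return g_mu / n, g_sigma / n

mu, sigma = 1.5, 0.7
grad_mu, grad_sigma = reparam_grads(mu, sigma)
print("estimated:", grad_mu, grad_sigma)
print("exact:    ", 2.0 * mu, 2.0 * sigma)
```

In an actual VAE the per-sample derivatives are produced by automatic differentiation rather than written by hand, but the structure is the same: once $\epsilon$ is drawn, $z$ is an ordinary differentiable function of the encoder outputs.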

In summary, the VAE framework converts a difficult inference problem into a differentiable optimization problem. By maximizing the ELBO, we simultaneously learn a generative model $p_{\theta}(x|z)$ and an efficient inference mechanism $q_{\phi}(z|x)$. The resulting latent space is not only useful for dimensionality reduction but also allows for sampling new, synthetic data by drawing $z \sim p(z)$ and passing it through the decoder. This elegant marriage of Bayesian statistics and deep learning remains a cornerstone of modern representation learning.