Foundations of Generative Modeling: Bayesian Inference and the VAE

An exploration of how variational inference transforms the intractable posterior into an optimization problem. We derive the Evidence Lower Bound (ELBO) as the cornerstone of the Variational Autoencoder.

To understand Variational Autoencoders (VAEs), we must first grasp the core tension in Bayesian inference. Given some observed data $x$, we wish to find the posterior distribution of the latent variables $z$, expressed as $p(z|x)$. According to Bayes' Theorem: $$p(z|x) = \frac{p(x|z)p(z)}{p(x)}$$ where $p(x|z)$ is the likelihood and $p(z)$ is the prior. However, the marginal likelihood $p(x) = \int p(x|z)p(z)\,dz$ is often computationally intractable for complex models, as it requires integrating over all possible configurations of $z$. This creates a bottleneck: we cannot compute the exact posterior, so we cannot efficiently sample latent representations.
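To make the bottleneck concrete, the sketch below evaluates the marginal likelihood numerically for a toy model chosen purely for illustration: a one-dimensional latent with prior $p(z) = \mathcal{N}(0, 1)$ and likelihood $p(x|z) = \mathcal{N}(z, 1)$. This model, the grid bounds, and the function names are assumptions rather than part of the lesson. In this special case the integral has the closed form $p(x) = \mathcal{N}(0, 2)$, which lets us check the numerical result.

```python
import math


def normal_pdf(value, mean, variance):
    """Density of a univariate Normal with the given mean and variance."""
    return math.exp(-0.5 * (value - mean) ** 2 / variance) / math.sqrt(
        2.0 * math.pi * variance
    )


def marginal_likelihood_grid(x, num_points=2001, z_min=-10.0, z_max=10.0):
    """Approximate p(x) = integral of p(x|z) p(z) dz with the trapezoidal rule."""
    step = (z_max - z_min) / (num_points - 1)
    total = 0.0
    for i in range(num_points):
        z = z_min + i * step
        # Integrand p(x|z) p(z) for the assumed toy model.
        integrand = normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0)
        weight = 0.5 if i in (0, num_points - 1) else 1.0
        total += weight * integrand * step
    return total


x_obs = 1.0
p_x_numeric = marginal_likelihood_grid(x_obs)
p_x_exact = normal_pdf(x_obs, 0.0, 2.0)  # closed form, valid for this toy model only
print(p_x_numeric, p_x_exact)

# Number of integrand evaluations if the same grid were used in every latent dimension.
grid_cost = {dim: 2001 ** dim for dim in (1, 2, 10)}
print(grid_cost)
```

A grid works in one dimension, but the evaluation count grows exponentially with the dimension of $z$, which is why this brute-force approach does not scale to the latent spaces used in practice.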

Variational Inference (VI) solves this by turning an integration problem into an optimization problem. Instead of calculating $p(z|x)$ exactly, we introduce a simplified parametric distribution $q_{\phi}(z|x)$, known as the variational distribution, and attempt to make it as similar as possible to the true posterior. The standard measure of dissimilarity used for this purpose is the Kullback-Leibler (KL) divergence, which is zero only when the two distributions coincide: $$D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) = \int q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p(z|x)}\,dz$$ Our goal is to minimize this divergence by optimizing the parameters $\phi$.
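As a small illustration of the KL divergence, the sketch below compares the well-known closed form for two univariate Gaussians with a direct numerical evaluation of the integral. The specific means and standard deviations are arbitrary values picked for the example, and the function names are hypothetical.

```python
import math


def normal_logpdf(value, mean, std):
    """Log-density of a univariate Normal with the given mean and standard deviation."""
    return (
        -0.5 * math.log(2.0 * math.pi)
        - math.log(std)
        - 0.5 * ((value - mean) / std) ** 2
    )


def kl_gaussians_closed_form(mean_q, std_q, mean_p, std_p):
    """D_KL(N(mean_q, std_q^2) || N(mean_p, std_p^2)) in closed form."""
    return (
        math.log(std_p / std_q)
        + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
        - 0.5
    )


def kl_gaussians_numeric(mean_q, std_q, mean_p, std_p, num_points=4001, lo=-12.0, hi=12.0):
    """Riemann-sum approximation of the integral of q(z) * log(q(z) / p(z)) dz."""
    step = (hi - lo) / (num_points - 1)
    total = 0.0
    for i in range(num_points):
        z = lo + i * step
        log_q = normal_logpdf(z, mean_q, std_q)
        log_p = normal_logpdf(z, mean_p, std_p)
        total += math.exp(log_q) * (log_q - log_p) * step
    return total


# Arbitrary example values: q plays the approximation, p the target.
q_params = (0.3, 0.8)
p_params = (0.5, math.sqrt(0.5))

kl_forward = kl_gaussians_closed_form(*q_params, *p_params)
kl_forward_numeric = kl_gaussians_numeric(*q_params, *p_params)
kl_reverse = kl_gaussians_closed_form(*p_params, *q_params)
print(kl_forward, kl_forward_numeric, kl_reverse)
```

The forward and reverse values differ, a reminder that the KL divergence is not symmetric; variational inference as described here uses the direction with $q_{\phi}$ in the first argument.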

Since we cannot minimize $D_{KL}$ directly (because it depends on the unknown $p(z|x)$), we rearrange the terms to find a surrogate objective. Substituting $p(z|x) = p(x|z)p(z)/p(x)$ into the KL divergence and splitting the logarithm yields the identity: $$\log p(x) = D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) + \mathcal{L}(\phi, \theta)$$ Here, $\mathcal{L}(\phi, \theta)$ is the Evidence Lower Bound (ELBO). Because the KL divergence is always non-negative, $\log p(x) \geq \mathcal{L}(\phi, \theta)$. Because $\log p(x)$ does not depend on $\phi$, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence between our approximation and the true posterior.
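For readers who want the intermediate steps, the identity can be sketched as follows. The joint density $p(x, z) = p(x|z)p(z)$ is shorthand introduced here for brevity. Starting from the definition of the KL divergence and substituting $p(z|x) = p(x, z)/p(x)$:

$$D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) = \mathbb{E}_{q_{\phi}(z|x)}\left[\log q_{\phi}(z|x) - \log p(x, z) + \log p(x)\right]$$

Since $\log p(x)$ does not depend on $z$, it can be pulled out of the expectation:

$$D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) = \log p(x) - \mathbb{E}_{q_{\phi}(z|x)}\left[\log p(x, z) - \log q_{\phi}(z|x)\right]$$

Naming the remaining expectation $\mathcal{L}(\phi, \theta)$ and moving it to the other side gives $\log p(x) = D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) + \mathcal{L}(\phi, \theta)$.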

The ELBO can be decomposed into two intuitive components that act as opposing forces. The full expression is: $$\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) \parallel p(z))$$ The first term is the 'reconstruction term,' which encourages the model to maximize the likelihood of the data given the latent code. The second term is the 'regularization term,' which penalizes the approximate posterior for drifting away from the prior $p(z)$ (usually a standard Normal $\mathcal{N}(0, I)$). This discourages the model from simply assigning a unique, narrow spike to every data point, and encourages a latent space that is continuous and supports interpolation.
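The tension between the two terms can be seen in a model small enough to evaluate exactly. The sketch below assumes a toy setup that is not part of the lesson: prior $p(z) = \mathcal{N}(0, 1)$, likelihood $p(x|z) = \mathcal{N}(z, 1)$, and a Gaussian $q(z|x) = \mathcal{N}(m, s^2)$. Under these assumptions both ELBO terms have closed forms, the evidence is $p(x) = \mathcal{N}(0, 2)$, and the true posterior is $\mathcal{N}(x/2, 1/2)$.

```python
import math


def log_evidence(x):
    """log p(x) for the assumed toy model, where marginally x ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0


def reconstruction_term(x, mean_q, std_q):
    """E_q[log p(x|z)] with p(x|z) = N(z, 1) and q = N(mean_q, std_q^2)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - mean_q) ** 2 + std_q ** 2)


def kl_to_standard_normal(mean_q, std_q):
    """D_KL(N(mean_q, std_q^2) || N(0, 1)), the regularization term."""
    return 0.5 * (mean_q ** 2 + std_q ** 2 - 1.0 - math.log(std_q ** 2))


def elbo(x, mean_q, std_q):
    """ELBO = reconstruction term minus KL to the prior."""
    return reconstruction_term(x, mean_q, std_q) - kl_to_standard_normal(mean_q, std_q)


x_obs = 1.0
elbo_prior = elbo(x_obs, 0.0, 1.0)     # q equals the prior: zero KL, poor reconstruction
elbo_spike = elbo(x_obs, 1.0, 0.01)    # narrow spike at x: good reconstruction, large KL
elbo_optimal = elbo(x_obs, x_obs / 2.0, math.sqrt(0.5))  # q equals the true posterior

print(log_evidence(x_obs), elbo_prior, elbo_spike, elbo_optimal)
```

In this toy case the bound is tight only when $q$ matches the true posterior; both the prior-matching and the spike-like choices fall below $\log p(x)$, each sacrificing one term for the other.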

In a VAE, the ELBO is implemented using two neural networks: an encoder $q_{\phi}(z|x)$ and a decoder $p_{\theta}(x|z)$. However, we encounter a problem: the expectation $\mathbb{E}_{q_{\phi}(z|x)}$ involves sampling $z$, and we cannot backpropagate gradients through a stochastic sampling process. To overcome this, we use the 'reparameterization trick.' Instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$, we express $z$ as a deterministic transformation of a noise variable $\epsilon \sim \mathcal{N}(0, I)$: $$z = \mu + \sigma \odot \epsilon$$ This shifts the stochasticity to an external input, allowing gradients to flow directly through $\mu$ and $\sigma$ to the encoder parameters $\phi$.
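A minimal sketch of the reparameterization trick, without any neural network, is given below. It uses the stand-in objective $f(z) = z^2$ in place of the ELBO integrand (an assumption made so that the answer is known analytically: $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, with gradients $2\mu$ and $2\sigma$). The function names and sample count are hypothetical.

```python
import random


def reparameterized_sample(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1); also return eps."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps, eps


def pathwise_gradients(mu, sigma, num_samples=100_000, seed=0):
    """Monte Carlo estimate of the gradients of E[z^2] with respect to mu and sigma."""
    rng = random.Random(seed)
    grad_mu_sum = 0.0
    grad_sigma_sum = 0.0
    for _ in range(num_samples):
        z, eps = reparameterized_sample(mu, sigma, rng)
        df_dz = 2.0 * z                 # derivative of the stand-in objective f(z) = z^2
        grad_mu_sum += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma_sum += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu_sum / num_samples, grad_sigma_sum / num_samples


mu, sigma = 1.0, 0.5
grad_mu, grad_sigma = pathwise_gradients(mu, sigma)
print(grad_mu, 2.0 * mu)        # estimate vs analytic gradient
print(grad_sigma, 2.0 * sigma)  # estimate vs analytic gradient
```

Because the randomness enters only through `eps`, each sample is a differentiable function of `mu` and `sigma`. Automatic differentiation frameworks exploit the same property when training a VAE end to end.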

In summary, the VAE is not just an autoencoder with noise, but a principled Bayesian framework. By maximizing the ELBO, we optimize the reconstruction quality while shaping the latent manifold to match a prior distribution. This allows us to generate new, synthetic data by sampling $z$ from the prior $p(z)$ and passing it through the decoder $p_{\theta}(x|z)$, effectively sampling from the learned distribution of the data. The elegance of the VAE lies in its ability to approximate a complex Bayesian posterior through the lens of deep learning and stochastic optimization.
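The generation procedure described above can be sketched in a few lines. The decoder here is a hypothetical fixed affine map from a one-dimensional latent to a two-dimensional mean, standing in for a trained network, and the Gaussian observation noise is likewise an assumption made for illustration.

```python
import random


def decoder_mean(z):
    """Hypothetical stand-in for a trained decoder: 1-D latent to 2-D mean."""
    return (2.0 * z + 1.0, -z)


def generate(num_samples, noise_std=0.1, seed=0):
    """Ancestral sampling: draw z from the prior, then x from p(x|z)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)  # z ~ p(z) = N(0, 1)
        mean = decoder_mean(z)
        # x ~ p(x|z): Gaussian noise around the decoder output (assumed likelihood).
        x = tuple(m + noise_std * rng.gauss(0.0, 1.0) for m in mean)
        samples.append(x)
    return samples


samples = generate(20_000)
mean_x0 = sum(s[0] for s in samples) / len(samples)
mean_x1 = sum(s[1] for s in samples) / len(samples)
print(mean_x0, mean_x1)
```

Note that the encoder plays no role at generation time; only the prior and the decoder are needed.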