
Bridging Bayesian Inference and Generative Modeling: The Mechanics of VAEs

An exploration of how Variational Autoencoders approximate intractable posterior distributions using the Evidence Lower Bound. This lesson connects probabilistic graphical models to deep learning via variational inference.

At its core, Bayesian inference is about updating our beliefs about a set of latent variables $z$ after observing some data $x$. We are interested in the posterior distribution $p(z|x)$, which tells us how the latent variables should be distributed given our data. According to Bayes' Theorem, this is expressed as: $$p(z|x) = \frac{p(x|z)p(z)}{p(x)}$$ where $p(x|z)$ is the likelihood, $p(z)$ is the prior, and $p(x) = \int p(x,z) \, dz$ is the evidence. In high-dimensional spaces, calculating the evidence $p(x)$ is computationally intractable because the integral must be evaluated over all possible configurations of $z$.
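To make the roles of the prior, likelihood, and evidence concrete, here is a minimal sketch of Bayes' Theorem on a toy discrete model, where the integral reduces to a sum. The three-state latent variable and all probability values are made-up assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical toy model: a latent z with 3 states and a binary observation x.
prior = np.array([0.5, 0.3, 0.2])          # p(z)
likelihood_x1 = np.array([0.9, 0.5, 0.1])  # p(x=1 | z) for each state of z

# Evidence p(x=1): sum over every configuration of z
# (the discrete counterpart of the integral over z).
evidence = np.sum(likelihood_x1 * prior)

# Posterior p(z | x=1) from Bayes' Theorem.
posterior = likelihood_x1 * prior / evidence

print("evidence p(x=1):", evidence)
print("posterior p(z | x=1):", posterior)
```

With three states the sum is trivial, but a latent vector of 100 binary variables would already require $2^{100}$ terms, which is the kind of blow-up that makes the evidence intractable in practice.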

To bypass this intractability, we use Variational Inference. Instead of computing the true posterior $p(z|x)$, we introduce a simpler, parameterized distribution $q_{\phi}(z|x)$, known as the variational distribution (or encoder), to approximate it. Our goal is to make $q_{\phi}(z|x)$ as close to $p(z|x)$ as possible. We measure the 'closeness' of these two distributions using the Kullback-Leibler (KL) Divergence: $$D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) = \int q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p(z|x)} \, dz$$ Minimizing this divergence over $\phi$ pulls the approximation toward the true posterior; the divergence is zero only when the two distributions coincide.
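The KL integral above can be read as an expectation under $q$ of the log-ratio of the two densities. The sketch below checks this for two univariate Gaussians, comparing a Monte Carlo average against the known closed form. The means and standard deviations are arbitrary assumptions, and the second Gaussian is only a stand-in: in a real VAE the true posterior is unknown, which is exactly why this quantity cannot be computed directly.

```python
import numpy as np

def gaussian_log_pdf(z, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((z - mean) / std) ** 2

def kl_gaussians(mean_q, std_q, mean_p, std_p):
    """Closed-form KL( N(mean_q, std_q^2) || N(mean_p, std_p^2) )."""
    return (np.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)

rng = np.random.default_rng(0)
mean_q, std_q = 0.5, 0.8   # illustrative approximation q
mean_p, std_p = 0.0, 1.0   # illustrative target p (stand-in for the posterior)

# Monte Carlo estimate: average of log q(z) - log p(z) over samples z ~ q.
z = rng.normal(mean_q, std_q, size=200_000)
kl_monte_carlo = np.mean(
    gaussian_log_pdf(z, mean_q, std_q) - gaussian_log_pdf(z, mean_p, std_p)
)
kl_exact = kl_gaussians(mean_q, std_q, mean_p, std_p)

print("Monte Carlo KL:", kl_monte_carlo)
print("closed-form KL:", kl_exact)
```

Swapping the arguments of `kl_gaussians` gives a different value, a reminder that the KL divergence is not symmetric.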

Since we cannot minimize $D_{KL}$ directly (it depends on the unknown posterior $p(z|x)$), we derive a surrogate objective called the Evidence Lower Bound (ELBO). Substituting $p(z|x) = p_{\theta}(x,z)/p(x)$ into the KL divergence, with $p_{\theta}(x,z) = p_{\theta}(x|z)p(z)$, and noting that $\log p(x)$ does not depend on $z$ and can therefore be pulled out of the expectation, we can show that: $$\log p(x) = D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) + \mathcal{L}(\theta, \phi; x), \qquad \mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_{\phi}(z|x)}\left[\log \frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\right]$$ where $\mathcal{L}$ is the ELBO. Because the KL divergence is always non-negative, $\mathcal{L}(\theta, \phi; x)$ serves as a lower bound on the log-likelihood of the data. For fixed $\theta$, $\log p(x)$ does not depend on $\phi$, so maximizing the ELBO over $\phi$ is equivalent to minimizing the KL divergence, pushing our approximate posterior toward the true posterior.
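The decomposition of $\log p(x)$ into a KL term plus the ELBO can be checked numerically on a model small enough for the evidence to be computed exactly. This sketch assumes a made-up three-state latent variable and an arbitrary variational distribution `q`, and takes the ELBO to be $\mathbb{E}_{q}[\log p(x,z) - \log q(z|x)]$, the standard definition.

```python
import numpy as np

# Hypothetical toy model with a 3-state latent z and one fixed observation x.
prior = np.array([0.5, 0.3, 0.2])       # p(z)
likelihood = np.array([0.9, 0.5, 0.1])  # p(x | z) for the observed x
joint = likelihood * prior              # p(x, z)
evidence = joint.sum()                  # p(x), tractable only because z is tiny
posterior = joint / evidence            # p(z | x)

# An arbitrary (deliberately imperfect) variational distribution q(z | x).
q = np.array([0.6, 0.3, 0.1])

# ELBO = E_q[log p(x, z) - log q(z | x)]
elbo = np.sum(q * (np.log(joint) - np.log(q)))
# KL(q || posterior) = E_q[log q(z | x) - log p(z | x)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))

log_evidence = np.log(evidence)
print("log p(x):  ", log_evidence)
print("KL + ELBO: ", kl + elbo)

# When q equals the true posterior, the KL term vanishes and the bound is tight.
elbo_at_posterior = np.sum(posterior * (np.log(joint) - np.log(posterior)))
print("ELBO at q = posterior:", elbo_at_posterior)
```

The gap between `log_evidence` and `elbo` is exactly `kl`, so any improvement in the ELBO that comes from changing `q` alone closes that gap.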

The ELBO can be decomposed into two meaningful components: the reconstruction term and the regularization term. Specifically: $$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) \parallel p(z))$$ The first term, the expected log-likelihood, encourages the decoder $p_{\theta}(x|z)$ to reconstruct the original input $x$ accurately from the sampled latent variable $z$. The second term, the KL divergence between the approximate posterior and the prior $p(z)$ (usually a standard normal $\mathcal{N}(0, I)$), acts as a regularizer: it discourages the encoder from assigning an isolated, near-deterministic point in latent space to every input, which encourages a smooth, continuous latent space.
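The two terms can be sketched as plain functions. This example assumes a diagonal Gaussian encoder parameterized by `mu` and `log_var`, for which the KL divergence to $\mathcal{N}(0, I)$ has a well-known closed form, and a Bernoulli decoder over binary data. The Bernoulli likelihood, the function names, and the input values are assumptions made for illustration, not choices stated in the lesson.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

def bernoulli_log_likelihood(x, x_prob, eps=1e-7):
    """log p(x | z) for binary x, given decoder output probabilities x_prob."""
    x_prob = np.clip(x_prob, eps, 1.0 - eps)  # avoid log(0)
    return np.sum(x * np.log(x_prob) + (1.0 - x) * np.log(1.0 - x_prob), axis=-1)

def elbo_estimate(x, x_prob, mu, log_var):
    """Single-sample ELBO estimate: reconstruction term minus KL regularizer."""
    return bernoulli_log_likelihood(x, x_prob) - kl_to_standard_normal(mu, log_var)

# Illustrative values standing in for one data point and the network outputs.
x = np.array([[1.0, 0.0, 1.0, 1.0]])       # binary input
x_prob = np.array([[0.9, 0.2, 0.8, 0.7]])  # hypothetical decoder output for one sampled z
mu = np.array([[0.3, -0.1]])               # hypothetical encoder mean
log_var = np.array([[-0.5, 0.2]])          # hypothetical encoder log-variance

print("reconstruction term:", bernoulli_log_likelihood(x, x_prob))
print("KL regularizer:     ", kl_to_standard_normal(mu, log_var))
print("ELBO estimate:      ", elbo_estimate(x, x_prob, mu, log_var))
```

In training code the negative of this quantity is typically used as the loss, since most optimizers minimize.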

A significant challenge arises during the optimization of the ELBO: the gradient cannot flow through the stochastic sampling process $z \sim q_{\phi}(z|x)$. To solve this, the Variational Autoencoder (VAE) employs the 'reparameterization trick'. Instead of sampling directly from the distribution, we express $z$ as a deterministic function of the parameters and an external noise variable $\epsilon$: $$z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon, \text{ where } \epsilon \sim \mathcal{N}(0, I)$$ This shifts the stochasticity to $\epsilon$, allowing the gradients of the loss function to be backpropagated through $\mu$ and $\sigma$ using standard automatic differentiation.
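The effect of the reparameterization can be seen without any deep learning framework. The sketch below uses a scalar latent variable and a stand-in objective $f(z) = z^2$ (an assumption chosen because $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$ has known gradients $2\mu$ and $2\sigma$), then estimates those gradients by differentiating through $z = \mu + \sigma \epsilon$ by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.5  # illustrative values for the encoder outputs

# Reparameterization: all randomness lives in eps, and z is a
# deterministic function of (mu, sigma, eps).
eps = rng.standard_normal(500_000)
z = mu + sigma * eps

# Stand-in objective f(z) = z^2, so df/dz = 2z.
# Chain rule through the reparameterized sample:
#   dz/dmu = 1      -> df/dmu    = 2z
#   dz/dsigma = eps -> df/dsigma = 2z * eps
grad_mu = np.mean(2.0 * z)
grad_sigma = np.mean(2.0 * z * eps)

print("estimated dE[f]/dmu:   ", grad_mu, "(analytic:", 2.0 * mu, ")")
print("estimated dE[f]/dsigma:", grad_sigma, "(analytic:", 2.0 * sigma, ")")
```

An automatic differentiation library performs the same chain-rule computation, which is why treating $\epsilon$ as an external input is enough to make the sampling step differentiable with respect to $\mu$ and $\sigma$.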

When we train a VAE, we maximize the ELBO across the entire dataset. Once trained, generation no longer requires the encoder $q_{\phi}(z|x)$: the decoder $p_{\theta}(x|z)$ alone serves as a generative model. By sampling $z \sim p(z) = \mathcal{N}(0, I)$ directly from the prior and passing it through the decoder, we generate new data samples that share the structural characteristics of the training set. This architecture turns the theoretical framework of Bayesian inference into a scalable, deep learning-based generative system.
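The generation step can be sketched as follows. The single-layer `decode` function, its randomly initialized `weights` and `bias`, and the Bernoulli output are all hypothetical placeholders; a trained VAE would supply learned decoder parameters, and the random ones here only demonstrate the data flow from prior sample to generated output.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def decode(z, weights, bias):
    """Hypothetical single-layer decoder: maps latent codes to Bernoulli means."""
    return sigmoid(z @ weights + bias)

rng = np.random.default_rng(0)
latent_dim, data_dim, num_samples = 2, 8, 5

# Placeholder parameters; in practice these would come from training.
weights = rng.normal(size=(latent_dim, data_dim))
bias = np.zeros(data_dim)

# Step 1: sample latent codes from the prior, z ~ N(0, I).
z = rng.standard_normal((num_samples, latent_dim))

# Step 2: pass them through the decoder to get the parameters of p(x | z).
x_mean = decode(z, weights, bias)

# Step 3: draw binary samples from p(x | z), or use x_mean directly.
x_sample = rng.binomial(1, x_mean)

print(x_sample)
```

No input data and no encoder appear anywhere in this loop, which is what makes the trained decoder a standalone generative model.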