At its core, Bayesian inference is about updating our beliefs about a hidden parameter $\theta$ after observing some data $x$. According to Bayes' Theorem, the posterior distribution $p(\theta|x)$ is proportional to the product of the likelihood $p(x|\theta)$ and the prior $p(\theta)$. However, in complex models, the marginal likelihood (or 'evidence') $p(x) = \int p(x, \theta) \, d\theta$ is often an intractable integral. This makes it impossible to compute the exact posterior, leading us to seek an approximation. Variational Inference (VI) solves this by introducing a simpler distribution $q(\theta)$ and attempting to make it as similar as possible to the true posterior $p(\theta|x)$.
The goal of Variational Inference is to minimize the divergence between our approximation $q(\theta)$ and the true posterior $p(\theta|x)$. We typically use the Kullback-Leibler (KL) divergence, expressed as: $$\text{KL}(q(\theta) || p(\theta|x)) = \int q(\theta) \log \frac{q(\theta)}{p(\theta|x)} \, d\theta$$ Since the true posterior is unknown, we cannot minimize this directly. By manipulating the log-evidence $\log p(x)$ using the identity $p(\theta|x) = \frac{p(x, \theta)}{p(x)}$, we find that $\log p(x) = \text{KL}(q(\theta) || p(\theta|x)) + \mathcal{L}(q)$, where $\mathcal{L}(q)$ is the Evidence Lower Bound, or ELBO.
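The manipulation mentioned above can be spelled out in three steps. This is a sketch of the standard argument, writing $\mathcal{L}(q)$ for $\mathbb{E}_{q(\theta)}[\log p(x, \theta)] - \mathbb{E}_{q(\theta)}[\log q(\theta)]$ and using the fact that $\log p(x)$ does not depend on $\theta$, so it passes through the expectation unchanged:

$$\begin{aligned}
\text{KL}(q(\theta) || p(\theta|x)) &= \mathbb{E}_{q(\theta)}[\log q(\theta)] - \mathbb{E}_{q(\theta)}[\log p(\theta|x)] \\
&= \mathbb{E}_{q(\theta)}[\log q(\theta)] - \mathbb{E}_{q(\theta)}[\log p(x, \theta)] + \log p(x) \\
&= \log p(x) - \mathcal{L}(q)
\end{aligned}$$

Rearranging the last line gives $\log p(x) = \text{KL}(q(\theta) || p(\theta|x)) + \mathcal{L}(q)$.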
The ELBO is the quantity we maximize because it serves as a tractable proxy for the log-evidence $\log p(x)$. Because the KL divergence is always non-negative, $\log p(x) \geq \mathcal{L}(q)$, and because $\log p(x)$ does not depend on $q$, raising the ELBO is equivalent to lowering the KL divergence to the true posterior. The ELBO is defined as: $$\mathcal{L}(q) = \mathbb{E}_{q(\theta)}[\log p(x, \theta)] - \mathbb{E}_{q(\theta)}[\log q(\theta)]$$ Substituting $p(x, \theta) = p(x|\theta) \, p(\theta)$ reveals a powerful tension: $\mathcal{L}(q) = \mathbb{E}_{q(\theta)}[\log p(x|\theta)] - \text{KL}(q(\theta) || p(\theta))$. The first term rewards $q$ for placing mass on parameters that explain the data well (reconstruction), while the second term penalizes $q$ for drifting away from the prior (regularization).
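The bound can be checked numerically on a model where every quantity has a closed form. The sketch below assumes a toy conjugate model that is not part of the discussion above: prior $\theta \sim \mathcal{N}(0, 1)$, likelihood $x|\theta \sim \mathcal{N}(\theta, 1)$, and a Gaussian $q(\theta) = \mathcal{N}(m, s^2)$. Under those assumptions the evidence is $\mathcal{N}(x; 0, 2)$ and the exact posterior is $\mathcal{N}(x/2, 1/2)$. The function names are illustrative.

```python
import math

def log_evidence(x):
    # Assumed toy model: theta ~ N(0, 1), x | theta ~ N(theta, 1),
    # so the marginal p(x) is N(0, 2).
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

def elbo(x, m, s2):
    # ELBO for q(theta) = N(m, s2), written as
    # E_q[log p(x | theta)] - KL(q || prior).
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    kl_to_prior = 0.5 * (s2 + m ** 2 - 1.0 - math.log(s2))
    return expected_log_lik - kl_to_prior

def kl_to_posterior(x, m, s2):
    # KL(q || p(theta | x)), with exact posterior N(x / 2, 1 / 2).
    post_mean, post_var = x / 2.0, 0.5
    return (0.5 * math.log(post_var / s2)
            + (s2 + (m - post_mean) ** 2) / (2.0 * post_var)
            - 0.5)

x = 1.5
# A poor approximation: q equal to the prior.
gap_prior = log_evidence(x) - elbo(x, 0.0, 1.0)
# The best possible approximation: q equal to the exact posterior.
gap_posterior = log_evidence(x) - elbo(x, x / 2.0, 0.5)

print("gap with q = prior:    ", gap_prior)
print("gap with q = posterior:", gap_posterior)
```

In this toy setting the gap between $\log p(x)$ and the ELBO equals the KL divergence to the exact posterior, and it closes when $q$ matches that posterior.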
A Variational Autoencoder (VAE) applies this Bayesian framework to deep learning. Instead of a single global parameter, the unobserved quantity is now a per-example latent variable $z$ that represents the high-level features of the data, and the symbol $\theta$ is reused for network weights. The VAE consists of an encoder with parameters $\phi$ that outputs the parameters of $q_{\phi}(z|x)$ (usually the mean $\mu$ and variance $\sigma^2$ of a Gaussian) and a decoder with parameters $\theta$ that models $p_{\theta}(x|z)$. The objective is to maximize the ELBO over the parameters $\phi$ and $\theta$: $$\mathcal{L}(\phi, \theta; x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - \text{KL}(q_{\phi}(z|x) || p(z))$$
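When the encoder is a diagonal Gaussian, the KL term of this objective has a closed form and needs no sampling. The sketch below assumes a standard normal prior $p(z) = \mathcal{N}(0, I)$ and an encoder that outputs the log-variance rather than the variance. Both are common choices, but neither is fixed by the objective above. The function name is illustrative.

```python
import math

def kl_diag_gaussian_to_std_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions.
    # Assumes the prior p(z) is a standard normal (an assumption of this sketch).
    return 0.5 * sum(
        math.exp(lv) + m ** 2 - 1.0 - lv
        for m, lv in zip(mu, log_var)
    )

# Encoder output identical to the prior: no regularization penalty.
kl_at_prior = kl_diag_gaussian_to_std_normal([0.0, 0.0], [0.0, 0.0])

# Encoder output that has moved away from the prior: positive penalty.
kl_shifted = kl_diag_gaussian_to_std_normal([1.0, -0.5], [0.0, math.log(0.25)])

print("KL at prior:", kl_at_prior)
print("KL shifted: ", kl_shifted)
```

Working with the log-variance keeps the variance positive without a constraint on the network output, which is one reason it is often preferred in practice.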
A critical challenge arises during training: the sampling step $z \sim q_{\phi}(z|x)$ is not a differentiable function of $\phi$, so we cannot backpropagate through it directly. To solve this, we use the 'reparameterization trick.' Instead of sampling $z$ directly, we express it as a deterministic function of the encoder outputs and a random noise variable $\epsilon \sim \mathcal{N}(0, I)$. We write: $$z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon$$ Here $\sigma_{\phi}(x)$ is the standard deviation and $\odot$ denotes element-wise multiplication. This moves the stochasticity into $\epsilon$, which does not depend on $\phi$, allowing us to compute gradients of the ELBO using standard backpropagation.
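A one-dimensional example makes the effect of the trick concrete. The sketch below uses a stand-in objective $f(z) = z^2$ instead of a real ELBO, chosen because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ has known gradients $2\mu$ and $2\sigma$. The derivatives are written out by hand via the chain rule instead of using an autodiff library, and all names are illustrative.

```python
import random

def reparam_sample(mu, sigma, eps):
    # z is a deterministic function of (mu, sigma) once eps is fixed.
    return mu + sigma * eps

def pathwise_grads(mu, sigma, n_samples=100_000, seed=0):
    # Monte Carlo estimate of d/dmu and d/dsigma of E[f(z)] with f(z) = z**2,
    # obtained by differentiating through z = mu + sigma * eps.
    rng = random.Random(seed)
    grad_mu_sum = 0.0
    grad_sigma_sum = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)       # noise does not depend on mu or sigma
        z = reparam_sample(mu, sigma, eps)
        df_dz = 2.0 * z                 # derivative of the stand-in objective
        grad_mu_sum += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma_sum += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu_sum / n_samples, grad_sigma_sum / n_samples

grad_mu, grad_sigma = pathwise_grads(mu=1.0, sigma=0.5)
print("estimated d/dmu:   ", grad_mu, "(analytic: 2 * mu = 2.0)")
print("estimated d/dsigma:", grad_sigma, "(analytic: 2 * sigma = 1.0)")
```

In a real VAE the same chain rule is applied automatically by the framework, with $\mu$ and $\sigma$ produced by the encoder network and $f$ replaced by the ELBO terms.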
In summary, the VAE transforms the problem of Bayesian posterior inference into a differentiable optimization task. Maximizing the ELBO encourages a latent space that is both informative enough to reconstruct the input and regularized enough to allow for sampling. When we discard the encoder and sample $z \sim p(z)$, the decoder acts as a generative model, capable of synthesizing new data that resembles the training distribution. The elegance of the VAE lies in its marriage of classical Bayesian statistics with the representational power of neural networks.