At its heart, Bayesian inference is about updating our beliefs about a latent variable $\mathbf{z}$ given some observed data $\mathbf{x}$. According to Bayes' Theorem, the posterior distribution is $p(\mathbf{z}|\mathbf{x}) = p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z}) / p(\mathbf{x})$: the product of the likelihood $p(\mathbf{x}|\mathbf{z})$ and the prior $p(\mathbf{z})$, normalized by the evidence. However, for complex models, this denominator, the evidence $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$, is often an intractable integral. Exact computation of the posterior is then infeasible in practice when $\mathbf{z}$ is high-dimensional and the likelihood is nonlinear, which necessitates approximate inference methods.
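To make the evidence integral concrete, here is a minimal sketch for a one-dimensional toy model in which the integral can still be computed. The model ($\mathbf{z} \sim \mathcal{N}(0, 1)$, $\mathbf{x}|\mathbf{z} \sim \mathcal{N}(\mathbf{z}, 1)$), the helper names, and the grid settings are all assumptions chosen for illustration; for this particular model the marginal is known to be $\mathcal{N}(0, 2)$, which gives a reference value.

```python
import numpy as np

def normal_pdf(v, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at v."""
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def evidence_by_quadrature(x, num_points=2001, half_width=10.0):
    """Approximate p(x) = integral of p(x|z) p(z) dz with a Riemann sum over z."""
    z = np.linspace(-half_width, half_width, num_points)
    dz = z[1] - z[0]
    joint = normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0)  # p(x|z) * p(z)
    return float(np.sum(joint) * dz)

x_obs = 0.7  # arbitrary observation, chosen for illustration
numeric = evidence_by_quadrature(x_obs)
exact = float(normal_pdf(x_obs, 0.0, 2.0))  # closed-form marginal of the toy model
```

A grid with 2001 points is enough in one dimension, but the same resolution in a $d$-dimensional latent space would need $2001^d$ evaluations, which is why this brute-force route does not scale.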
Variational Autoencoders (VAEs) sidestep this intractability by treating inference as an optimization problem. Instead of computing $p(\mathbf{z}|\mathbf{x})$ directly, we introduce a proxy distribution $q_{\phi}(\mathbf{z}|\mathbf{x})$, whose parameters are produced by a neural network (the encoder), to approximate the true posterior. The goal is to make $q_{\phi}(\mathbf{z}|\mathbf{x})$ as close as possible to $p(\mathbf{z}|\mathbf{x})$. We quantify the mismatch using the Kullback-Leibler (KL) divergence, $\text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x}))$, which is zero only when the two distributions coincide.
Minimizing this divergence directly runs into a problem: the KL term itself depends on the unknown $p(\mathbf{z}|\mathbf{x})$. We resolve this by deriving the Evidence Lower Bound (ELBO). Through algebraic manipulation, we can show that $\log p(\mathbf{x}) = \text{ELBO}(\phi, \theta) + \text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x}))$. Since the KL divergence is always non-negative, the ELBO is a lower bound on the log-likelihood of the data: $\log p(\mathbf{x}) \geq \text{ELBO}(\phi, \theta)$. Moreover, because $\log p(\mathbf{x})$ does not depend on $\phi$, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence, without ever evaluating the true posterior.
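The algebraic manipulation can be spelled out in a few lines. The sketch below assumes only Bayes' Theorem, $p(\mathbf{z}|\mathbf{x}) = p(\mathbf{x}, \mathbf{z}) / p(\mathbf{x})$, and writes $q$ as shorthand for $q_{\phi}(\mathbf{z}|\mathbf{x})$ under the expectations:

$$
\begin{aligned}
\text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x}))
&= \mathbb{E}_{q}\left[\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{z}|\mathbf{x})\right] \\
&= \mathbb{E}_{q}\left[\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{x}, \mathbf{z})\right] + \log p(\mathbf{x}) \\
\Longrightarrow \quad \log p(\mathbf{x})
&= \underbrace{\mathbb{E}_{q}\left[\log p(\mathbf{x}, \mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\right]}_{\text{ELBO}(\phi, \theta)} + \text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x}))
\end{aligned}
$$

The second line uses the fact that $\log p(\mathbf{x})$ does not depend on $\mathbf{z}$, so it can be pulled out of the expectation. Every quantity inside the bracketed ELBO term involves only the joint $p(\mathbf{x}, \mathbf{z})$ and $q_{\phi}$, both of which can be evaluated.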
The ELBO is mathematically decomposed into two competing terms: $\text{ELBO} = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - \text{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}))$. The first term is the 'reconstruction log-likelihood,' which encourages the decoder ($p_{\theta}$) to accurately reconstruct the input $\mathbf{x}$ from the sampled latent code $\mathbf{z}$. The second term is a regularizer that forces the approximate posterior to stay close to the prior $p(\mathbf{z})$, typically a standard Gaussian $\mathcal{N}(0, \mathbf{I})$.
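As an illustration of the two terms, here is a minimal NumPy sketch of a single-sample ELBO estimate. It assumes a diagonal Gaussian encoder (parameterized by `mu` and `log_var`), for which the KL term against $\mathcal{N}(0, \mathbf{I})$ has a well-known closed form, and a Gaussian decoder with fixed unit variance; the decoder choice and all function names are assumptions made for this example, not something the text prescribes.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def gaussian_log_likelihood(x, x_mean, var=1.0):
    """log N(x; x_mean, var * I), summed over data dims (fixed variance is an assumption)."""
    return -0.5 * np.sum((x - x_mean) ** 2 / var + np.log(2.0 * np.pi * var), axis=-1)

def elbo_estimate(x, x_mean, mu, log_var):
    """Reconstruction term minus KL regularizer.

    x_mean is the decoder mean computed from one latent sample z ~ q(z|x),
    so the first term is a single-sample Monte Carlo estimate.
    """
    reconstruction = gaussian_log_likelihood(x, x_mean)
    regularizer = kl_diag_gaussian_to_standard_normal(mu, log_var)
    return reconstruction - regularizer

# Illustrative values only: a 2-D data point, its reconstruction, and encoder outputs.
x = np.array([0.2, -0.1])
x_mean = np.array([0.25, -0.05])
mu = np.array([1.0, 0.0])
log_var = np.zeros(2)
value = elbo_estimate(x, x_mean, mu, log_var)
```

Moving `mu` away from zero or `log_var` away from zero increases the KL penalty, while a poorer reconstruction lowers the first term, which is the tension between the two terms described above.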
A significant technical hurdle arises when we try to differentiate the ELBO with respect to the encoder parameters $\phi$. Because the expectation is estimated by sampling $\mathbf{z} \sim q_{\phi}(\mathbf{z}|\mathbf{x})$, and the sampling step is not a differentiable function of $\phi$, we cannot backpropagate through this stochastic node directly. To overcome this, we use the 'reparameterization trick.' We express $\mathbf{z}$ as a deterministic transformation of a noise variable $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, such that $\mathbf{z} = \mu_{\phi}(\mathbf{x}) + \sigma_{\phi}(\mathbf{x}) \odot \epsilon$. This moves the randomness into an input that does not depend on $\phi$, allowing gradients to flow through $\mu_{\phi}$ and $\sigma_{\phi}$.
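The effect of the trick can be checked numerically. The sketch below is a scalar toy case: it replaces the decoder term with a hypothetical stand-in $f(\mathbf{z}) = \mathbf{z}^2$ and estimates the gradients of $\mathbb{E}[f(\mathbf{z})]$ with respect to $\mu$ and $\sigma$ by applying the chain rule through $\mathbf{z} = \mu + \sigma \epsilon$. The function names and the choice of $f$ are assumptions for illustration; in practice an automatic differentiation framework would compute these derivatives.

```python
import numpy as np

def reparameterize(mu, sigma, eps):
    """Deterministic map from noise eps ~ N(0, 1) to a sample z = mu + sigma * eps."""
    return mu + sigma * eps

def pathwise_gradients(mu, sigma, num_samples, rng):
    """Monte Carlo gradients of E[z^2] with respect to mu and sigma.

    By the chain rule, d f(z)/d mu = f'(z) * 1 and d f(z)/d sigma = f'(z) * eps,
    because dz/dmu = 1 and dz/dsigma = eps once eps is treated as a fixed input.
    """
    eps = rng.standard_normal(num_samples)
    z = reparameterize(mu, sigma, eps)
    df_dz = 2.0 * z  # derivative of the stand-in f(z) = z^2
    grad_mu = float(np.mean(df_dz))
    grad_sigma = float(np.mean(df_dz * eps))
    return grad_mu, grad_sigma

rng = np.random.default_rng(0)
grad_mu, grad_sigma = pathwise_gradients(mu=1.5, sigma=0.5, num_samples=200_000, rng=rng)
# Analytically E[z^2] = mu^2 + sigma^2, so the exact gradients are 2*mu and 2*sigma.
```

The estimates should land close to the analytic values $2\mu$ and $2\sigma$, which is only possible because the noise was drawn independently of the parameters being differentiated.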
Once trained, the VAE provides a powerful generative mechanism. While the encoder is used during training to learn the latent space, the decoder $p_{\theta}(\mathbf{x}|\mathbf{z})$ can be used independently for synthesis. By sampling $\mathbf{z} \sim p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$ directly from the prior and passing it through the decoder, we can generate novel data points that share the structural characteristics of the training set. Thus, the VAE transforms the daunting task of Bayesian integration into a scalable deep learning objective.
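The generation step itself is short. In the sketch below, `toy_decoder` is a hypothetical stand-in for a trained decoder network (an affine map followed by a sigmoid, with random placeholder weights), so the outputs are not meaningful samples; the point is only the flow from prior sample to decoder output.

```python
import numpy as np

def sample_from_prior(num_samples, latent_dim, rng):
    """Draw latent codes z ~ N(0, I)."""
    return rng.standard_normal((num_samples, latent_dim))

def toy_decoder(z, weights, bias):
    """Hypothetical decoder: affine map plus sigmoid, returning per-dimension means in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(z @ weights + bias)))

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 5
weights = rng.standard_normal((latent_dim, data_dim))  # placeholder values, not trained
bias = np.zeros(data_dim)

z = sample_from_prior(4, latent_dim, rng)    # no encoder is involved at generation time
x_generated = toy_decoder(z, weights, bias)  # one generated data point per row
```

With a trained decoder in place of `toy_decoder`, each row of `x_generated` would be the mean of $p_{\theta}(\mathbf{x}|\mathbf{z})$ for the corresponding latent code.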