At its heart, Bayesian inference is the process of updating our beliefs about a set of latent parameters $\theta$ given some observed data $x$. According to Bayes' Theorem, the posterior distribution $p(\theta|x)$ is proportional to the product of the likelihood $p(x|\theta)$ and the prior $p(\theta)$. However, in complex models, especially deep neural networks, the marginal likelihood $p(x) = \int p(x|\theta)p(\theta) d\theta$, known as the evidence, is often computationally intractable because the integral runs over a high-dimensional parameter space.
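To make the role of the evidence concrete, here is a minimal sketch of exact Bayesian updating in one dimension, where the integral can still be approximated on a grid. The coin-flip data, the uniform prior, and the grid size are all illustrative assumptions, not part of the discussion above.

```python
import numpy as np

# Hypothetical toy problem: infer a coin's bias theta from 7 heads in 10 flips.
heads, flips = 7, 10

# Discretize theta into 1000 cells of width 0.001 (cell midpoints).
theta = np.linspace(0.0005, 0.9995, 1000)
dtheta = theta[1] - theta[0]

# Assumed prior: uniform on (0, 1), so p(theta) = 1 everywhere.
prior = np.ones_like(theta)

# Likelihood p(x | theta) of the observed flip sequence.
likelihood = theta**heads * (1.0 - theta) ** (flips - heads)

# Evidence p(x) = integral of p(x | theta) p(theta) d theta, as a Riemann sum.
evidence = np.sum(likelihood * prior) * dtheta

# Posterior density p(theta | x) on the grid, via Bayes' Theorem.
posterior = likelihood * prior / evidence
posterior_mean = np.sum(theta * posterior) * dtheta
```

A grid with 1000 points per dimension needs $1000^d$ evaluations for $d$ parameters, which is why this direct approach stops being usable long before the parameter counts of a neural network.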
To circumvent this intractability, we employ Variational Inference (VI). Instead of computing the exact posterior $p(\theta|x)$, we introduce a simpler, parameterized distribution $q_{\phi}(\theta)$ (the variational distribution) and attempt to make it as similar as possible to the true posterior. The standard measure of mismatch used in VI is the Kullback-Leibler (KL) divergence, which is non-negative, equals zero only when the two distributions coincide, and is not symmetric in its arguments. Our goal is to minimize $KL(q_{\phi}(\theta) || p(\theta|x))$ with respect to $\phi$, which turns an inference problem into an optimization problem.
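As a small illustration of the KL divergence, the sketch below evaluates its closed form for two univariate Gaussians. The function name `kl_gaussian` and the example parameter values are assumptions chosen for illustration.

```python
import numpy as np

def kl_gaussian(m_q, s_q, m_p, s_p):
    """KL(N(m_q, s_q^2) || N(m_p, s_p^2)) for univariate Gaussians.

    s_q and s_p are standard deviations, not variances.
    """
    return np.log(s_p / s_q) + (s_q**2 + (m_q - m_p) ** 2) / (2.0 * s_p**2) - 0.5

# Identical distributions have zero divergence.
kl_same = kl_gaussian(0.0, 1.0, 0.0, 1.0)

# Swapping the arguments changes the value: KL is not symmetric.
kl_qp = kl_gaussian(0.0, 1.0, 1.0, 2.0)
kl_pq = kl_gaussian(1.0, 2.0, 0.0, 1.0)
```

The asymmetry matters in practice: VI minimizes the divergence with $q_{\phi}$ in the first argument, which is a different objective from the one with the arguments swapped.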
The challenge is that the KL divergence between $q$ and $p$ involves the term $\log p(\theta|x)$, which contains the intractable evidence $p(x)$. By rearranging the terms of the KL divergence, we derive the Evidence Lower Bound, or ELBO. Mathematically, the relationship is expressed as: $\log p(x) = ELBO(\phi) + KL(q_{\phi}(\theta) || p(\theta|x))$. Since the KL divergence is always non-negative, the ELBO serves as a lower bound on the log evidence (the log marginal likelihood of the data): $\log p(x) \geq ELBO(\phi)$. Because $\log p(x)$ does not depend on $\phi$, maximizing the ELBO is equivalent to minimizing the KL divergence to the true posterior.
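The rearrangement can be spelled out in a few lines. This sketch assumes the ELBO is defined as $ELBO(\phi) = E_{q_{\phi}(\theta)}[\log p(x|\theta) + \log p(\theta) - \log q_{\phi}(\theta)]$, and substitutes Bayes' Theorem, $\log p(\theta|x) = \log p(x|\theta) + \log p(\theta) - \log p(x)$, into the definition of the KL divergence:

$$
\begin{aligned}
KL(q_{\phi}(\theta) || p(\theta|x)) &= E_{q_{\phi}(\theta)}[\log q_{\phi}(\theta) - \log p(\theta|x)] \\
&= E_{q_{\phi}(\theta)}[\log q_{\phi}(\theta) - \log p(x|\theta) - \log p(\theta)] + \log p(x) \\
&= -ELBO(\phi) + \log p(x)
\end{aligned}
$$

The term $\log p(x)$ leaves the expectation because it does not depend on $\theta$. Moving the ELBO to the other side gives $\log p(x) = ELBO(\phi) + KL(q_{\phi}(\theta) || p(\theta|x))$.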
The ELBO can be decomposed into two intuitive components: the expected log-likelihood and the KL divergence from the prior. The objective function is written as: $$ELBO(\phi) = E_{q_{\phi}(\theta)}[\log p(x|\theta)] - KL(q_{\phi}(\theta) || p(\theta))$$ The first term measures 'reconstruction' quality, that is, how well values of $\theta$ drawn from $q_{\phi}$ explain the data, while the second term acts as a 'regularizer,' penalizing the variational distribution for drifting away from the prior $p(\theta)$, typically a standard Gaussian $\mathcal{N}(0, I)$.
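The two-term form can be checked numerically on a model where the evidence is known in closed form. The sketch below assumes a toy conjugate model that is not part of the text: prior $\theta \sim \mathcal{N}(0, 1)$, likelihood $x|\theta \sim \mathcal{N}(\theta, 1)$, and a Gaussian variational distribution $q_{\phi}(\theta) = \mathcal{N}(m, s^2)$ with $\phi = (m, s)$. Under these assumptions the exact posterior is $\mathcal{N}(x/2, 1/2)$ and the evidence is $\mathcal{N}(x; 0, 2)$.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    """Log density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

def elbo(x, m, s, n_samples=200_000, seed=0):
    """ELBO for the assumed toy model, with q(theta) = N(m, s^2)."""
    rng = np.random.default_rng(seed)
    theta = m + s * rng.standard_normal(n_samples)  # samples from q
    # First term: Monte Carlo estimate of E_q[log p(x | theta)].
    expected_loglik = np.mean(log_normal_pdf(x, theta, 1.0))
    # Second term: closed-form KL(N(m, s^2) || N(0, 1)).
    kl_to_prior = -np.log(s) + 0.5 * (s**2 + m**2) - 0.5
    return expected_loglik - kl_to_prior

x_obs = 1.5
log_evidence = log_normal_pdf(x_obs, 0.0, 2.0)         # exact log p(x)

elbo_poor = elbo(x_obs, m=0.0, s=1.0)                   # q set to the prior
elbo_best = elbo(x_obs, m=x_obs / 2, s=np.sqrt(0.5))    # q set to the posterior
```

With `q` equal to the prior the bound is loose; with `q` equal to the exact posterior the ELBO matches `log_evidence` up to Monte Carlo noise, as the identity $\log p(x) = ELBO(\phi) + KL(q_{\phi}(\theta) || p(\theta|x))$ predicts.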
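Variational Autoencoders (VAEs) operationalize this framework using neural networks. Here the unknown quantity is a per-datapoint latent vector $z$, which plays the role that $\theta$ played above, and $\phi$ denotes the encoder weights. The encoder network acts as the variational distribution $q_{\phi}(z|x)$, mapping input $x$ to the parameters of a distribution (usually mean $\mu$ and variance $\sigma^2$ of a diagonal Gaussian). The decoder network represents the likelihood $p(x|z)$, reconstructing the data from a sampled latent vector $z$. To allow backpropagation through the stochastic sampling process $z \sim q_{\phi}(z|x)$, we use the 'reparameterization trick,' expressing $z$ as $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\sigma$ is the elementwise standard deviation. The randomness then enters only through $\epsilon$, so $z$ is a deterministic, differentiable function of $\mu$ and $\sigma$.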
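The effect of the reparameterization can be seen without a neural network. The sketch below assumes a scalar latent and a hypothetical test function $f(z) = z^2$, for which $E[f(z)] = \mu^2 + \sigma^2$ and the true gradients are $2\mu$ and $2\sigma$. Writing $z = \mu + \sigma \epsilon$ lets the chain rule pass through the sample: $\partial f / \partial \mu = 2z$ and $\partial f / \partial \sigma = 2z\epsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.5  # illustrative values, standing in for encoder outputs

# Reparameterized sampling: the noise eps is drawn independently of mu and sigma.
eps = rng.standard_normal(100_000)
z = mu + sigma * eps

# Pathwise gradient estimates of E[z^2] via the chain rule through z.
grad_mu = np.mean(2.0 * z)           # dz/dmu = 1
grad_sigma = np.mean(2.0 * z * eps)  # dz/dsigma = eps
```

Both estimates should land close to the analytic values $2\mu = 2$ and $2\sigma = 1$. In a real VAE an automatic differentiation framework performs the same chain-rule step, with $\mu$ and $\sigma$ produced by the encoder.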
By maximizing the ELBO, the VAE learns a latent space that is both continuous and structured. The reconstruction term pushes the model to preserve information, while the KL term prevents the model from assigning a unique, point-like code to every input, which would amount to memorizing the training data and would leave regions of the latent space that the decoder never learned to handle. This balance allows the VAE to generate new, realistic data by sampling $z$ from the prior and passing it through the decoder, effectively performing generative modeling via approximate Bayesian inference.
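A minimal sketch of the resulting training loss and of generation is shown below. It assumes a diagonal Gaussian encoder that outputs `mu` and `log_var`, a Gaussian decoder with unit variance (so the reconstruction term reduces to a squared error, up to an additive constant), and a hypothetical untrained linear decoder `decode`; none of these choices is prescribed by the text.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), per example."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def negative_elbo(x, x_recon, mu, log_var):
    """Per-example loss: reconstruction error plus KL regularizer.

    Assumes a unit-variance Gaussian decoder; additive constants are dropped.
    """
    recon = 0.5 * np.sum((x - x_recon) ** 2, axis=-1)
    return recon + kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 5

# Hypothetical decoder: a fixed random linear map standing in for a trained network.
W = rng.standard_normal((latent_dim, data_dim))

def decode(z):
    return z @ W

# Generation: sample z from the prior N(0, I) and pass it through the decoder.
z_new = rng.standard_normal((3, latent_dim))
x_new = decode(z_new)
```

Minimizing `negative_elbo` averaged over a dataset is the same as maximizing the ELBO under the stated assumptions. With a trained decoder in place of `W`, the last two lines are the whole generation procedure.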