At its core, Bayesian inference is about updating our beliefs about a set of hidden parameters $\mathbf{z}$ given some observed data $\mathbf{x}$. According to Bayes' Theorem, the posterior distribution is given by $p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}$. In a generative context, $\mathbf{z}$ represents the latent factors that explain the observed data. However, the marginal likelihood $p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z}) \, d\mathbf{z}$, often called the 'evidence', is typically intractable for complex models because the integral must be evaluated over all possible configurations of the latent space.
To circumvent this intractability, we turn to Variational Inference (VI). Instead of computing the exact posterior $p(\mathbf{z}|\mathbf{x})$, we introduce a simpler, parameterized distribution $q_{\phi}(\mathbf{z}|\mathbf{x})$ (usually a Gaussian) and attempt to make it as similar as possible to the true posterior. We measure the mismatch between these two distributions using the Kullback-Leibler (KL) divergence: $D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x}))$. The KL divergence is non-negative, equals zero only when the two distributions coincide, and is not symmetric in its arguments, so it is a measure of dissimilarity rather than a true distance. Minimizing this divergence effectively transforms an inference problem into an optimization problem.
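To make the KL divergence concrete, here is a minimal sketch for the simplest case, two univariate Gaussians, where it has a closed form. The function name `kl_gaussian` and the example numbers are illustrative assumptions, not part of any particular library.

```python
import math


def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) in nats."""
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    )


# Illustrative values: swapping the arguments changes the result,
# and the divergence of a distribution from itself is zero.
kl_qp = kl_gaussian(0.0, 1.0, 1.0, 2.0)
kl_pq = kl_gaussian(1.0, 2.0, 0.0, 1.0)
kl_same = kl_gaussian(0.5, 1.5, 0.5, 1.5)
print(f"KL(q||p) = {kl_qp:.4f}, KL(p||q) = {kl_pq:.4f}, KL(q||q) = {kl_same:.4f}")
```

The two orderings give different values, which is why the direction $D_{KL}(q \parallel p)$ used in VI is a modeling choice and not an interchangeable detail.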
The challenge is that we cannot minimize $D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x}))$ directly because it requires knowing the very posterior we are trying to approximate. By rearranging the terms of the KL divergence, we derive the Evidence Lower Bound (ELBO). The relationship is defined as: $\log p(\mathbf{x}) = D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x})) + \mathcal{L}(\phi, \theta)$, where $\mathcal{L}(\phi, \theta)$ is the ELBO. Since the KL divergence is always non-negative, the ELBO provides a lower bound on the log-evidence: $\log p(\mathbf{x}) \ge \mathcal{L}(\phi, \theta)$.
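The rearrangement can be written out in a few lines. The sketch below substitutes Bayes' Theorem, $p(\mathbf{z}|\mathbf{x}) = p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})/p(\mathbf{x})$, into the definition of the KL divergence and uses the fact that $\log p(\mathbf{x})$ does not depend on $\mathbf{z}$, so it can be pulled out of the expectation:

$$
\begin{aligned}
D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x}))
&= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{z}|\mathbf{x})\right] \\
&= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p_{\theta}(\mathbf{x}|\mathbf{z}) - \log p(\mathbf{z})\right] + \log p(\mathbf{x}) \\
&= -\mathcal{L}(\phi, \theta) + \log p(\mathbf{x}),
\end{aligned}
$$

where $\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z}) + \log p(\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\right]$. Moving $\mathcal{L}(\phi, \theta)$ to the other side gives the identity stated above. Every term inside $\mathcal{L}$ can be evaluated without access to the true posterior, which is what makes it usable as an objective.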
The ELBO can be decomposed into two intuitive components. The full expression is: $$\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z})).$$ The first term is the 'reconstruction' term; it encourages the model to maximize the likelihood of the data given the latent codes. The second term is a 'regularization' term; it forces the approximate posterior $q_{\phi}$ to stay close to the prior $p(\mathbf{z})$, preventing the model from simply assigning a unique point in latent space to every single data sample (which would lead to overfitting).
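The identity $\log p(\mathbf{x}) = D_{KL} + \mathcal{L}$ can be checked numerically on a model small enough to solve by hand. The sketch below assumes a one-dimensional toy model chosen purely for illustration: prior $p(z) = \mathcal{N}(0, 1)$, likelihood $p(x|z) = \mathcal{N}(z, 1)$, and approximate posterior $q(z|x) = \mathcal{N}(m, s^2)$. For this model the evidence is $\mathcal{N}(x; 0, 2)$ and the exact posterior is $\mathcal{N}(x/2, 1/2)$, so both ELBO terms and the gap are available in closed form. All function names are hypothetical.

```python
import math


def elbo_terms(x, m, s):
    """Return (reconstruction term, KL to prior) for q(z|x) = N(m, s^2)."""
    # E_q[log N(x; z, 1)] = -0.5*log(2*pi) - 0.5*E_q[(x - z)^2]
    recon = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    # KL( N(m, s^2) || N(0, 1) )
    kl_prior = 0.5 * (m ** 2 + s ** 2 - 1.0) - math.log(s)
    return recon, kl_prior


def log_evidence(x):
    """log p(x) for the toy model, where x is marginally N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0


def kl_to_posterior(x, m, s):
    """KL( N(m, s^2) || exact posterior N(x/2, 1/2) )."""
    post_mu, post_var = x / 2.0, 0.5
    return (
        0.5 * math.log(post_var / s ** 2)
        + (s ** 2 + (m - post_mu) ** 2) / (2.0 * post_var)
        - 0.5
    )


# Arbitrary observation and a deliberately imperfect q.
x, m, s = 1.2, 0.3, 0.8
recon, kl_prior = elbo_terms(x, m, s)
elbo = recon - kl_prior
gap = log_evidence(x) - elbo

print(f"ELBO = {elbo:.6f}, log p(x) = {log_evidence(x):.6f}")
print(f"gap = {gap:.6f}, KL(q || posterior) = {kl_to_posterior(x, m, s):.6f}")
```

The gap between the log-evidence and the ELBO matches the KL divergence to the exact posterior, and it closes when `m = x / 2` and `s = sqrt(0.5)`, that is, when $q$ equals the true posterior.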
In a Variational Autoencoder (VAE), we implement these components using neural networks. The 'Encoder' network, with weights $\phi$, maps an input $\mathbf{x}$ to the mean $\mu$ and variance $\sigma^2$ of $q_{\phi}(\mathbf{z}|\mathbf{x})$, and the 'Decoder' network, with weights $\theta$, maps a latent code $\mathbf{z}$ to the parameters of the likelihood $p_{\theta}(\mathbf{x}|\mathbf{z})$. Because we need to backpropagate through the stochastic sampling process $\mathbf{z} \sim q_{\phi}(\mathbf{z}|\mathbf{x})$, we use the 'reparameterization trick'. We express $\mathbf{z}$ as a deterministic function of the encoder outputs and a random noise variable $\epsilon$: $\mathbf{z} = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. The randomness now enters only through $\epsilon$, which does not depend on $\phi$, so gradients can flow through $\mu$ and $\sigma$ back to the encoder weights.
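The effect of the reparameterization trick can be seen without any neural network. The sketch below is a scalar illustration under stated assumptions: it uses the test function $f(z) = z^2$ (chosen only because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ has known gradients $2\mu$ and $2\sigma$) and differentiates by hand through $z = \mu + \sigma\epsilon$. In a real VAE an automatic differentiation framework would do this step; the function names here are hypothetical.

```python
import random


def reparam_sample(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of noise eps."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps, eps


def pathwise_grads(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2]."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        z, eps = reparam_sample(mu, sigma, rng)
        df_dz = 2.0 * z              # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0       # dz/dmu = 1
        grad_sigma += df_dz * eps    # dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


mu, sigma = 1.5, 0.7
g_mu, g_sigma = pathwise_grads(mu, sigma)
print(f"estimated d/dmu    = {g_mu:.3f}  (analytic {2 * mu:.3f})")
print(f"estimated d/dsigma = {g_sigma:.3f}  (analytic {2 * sigma:.3f})")
```

Because the sample is written as a function of $\mu$ and $\sigma$, the chain rule applies sample by sample, and the averaged gradients approach the analytic values as the number of samples grows.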
By maximizing the ELBO, the VAE learns a structured latent space where similar data points cluster together and the distribution is smooth enough to allow for sampling. When we want to generate new data, we simply sample $\mathbf{z} \sim p(\mathbf{z})$ (usually a standard normal) and pass it through the decoder $p_{\theta}(\mathbf{x}|\mathbf{z})$. Thus, the ELBO is the bridge that allows us to use the machinery of deep learning to perform approximate Bayesian inference at scale.
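The generation step is ancestral sampling: draw a latent code from the prior, then draw data from the likelihood the decoder defines. The sketch below stands in for a trained decoder with a hypothetical one-dimensional linear map (`W = 2.0`, `B = 1.0`, both made-up values) and unit observation noise, so that the result can be sanity-checked: under these assumptions the generated samples should have mean `B` and variance `W**2 + 1`.

```python
import random

# Hypothetical "trained" decoder parameters, chosen only for illustration.
W, B = 2.0, 1.0


def decode_mean(z):
    """Toy stand-in for a decoder network: mean of p(x|z)."""
    return W * z + B


def generate(n_samples, seed=0):
    """Sample z ~ N(0, 1), then x ~ N(decode_mean(z), 1)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)               # latent code from the prior
        x = rng.gauss(decode_mean(z), 1.0)    # data from the likelihood
        samples.append(x)
    return samples


xs = generate(20_000)
mean = sum(xs) / len(xs)
var = sum((v - mean) ** 2 for v in xs) / len(xs)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

Note that the encoder plays no role at generation time; it is only needed during training to supply $q_{\phi}(\mathbf{z}|\mathbf{x})$ for the ELBO.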