At the heart of many machine learning problems lies the challenge of latent variables. In many datasets, the observed data $x$ is generated by some hidden, underlying structure $z$ that we cannot see. Bayesian inference allows us to reason about this hidden structure by calculating the posterior distribution $p(z|x)$, which represents our updated belief about $z$ after observing $x$. Using Bayes' Rule, we express this as: $$p(z|x) = \frac{p(x|z)p(z)}{p(x)}$$ where $p(x|z)$ is the likelihood and $p(z)$ is the prior.
The fundamental obstacle in Bayesian inference is the marginal likelihood, or evidence, $p(x) = \int p(x|z)p(z) dz$. For complex models, this integral is intractable: it generally has no closed form, and evaluating it numerically requires integrating over every possible configuration of the latent space, at a cost that grows exponentially with the dimension of $z$. Since we cannot compute the denominator, we cannot directly compute the posterior $p(z|x)$. To solve this, we turn to Variational Inference (VI), where we approximate the true posterior $p(z|x)$ with a simpler, parameterized distribution $q_{\phi}(z|x)$. Our goal is to make $q_{\phi}(z|x)$ as similar to $p(z|x)$ as possible by minimizing the Kullback-Leibler (KL) divergence.
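To make the evidence integral concrete, here is a minimal sketch on a toy model chosen purely for illustration (it is an assumption, not a model from the text above): a one-dimensional latent with prior $p(z) = \mathcal{N}(0, 1)$ and likelihood $p(x|z) = \mathcal{N}(x; z, 1)$. In this conjugate case the evidence has the closed form $\mathcal{N}(x; 0, 2)$, so a brute-force grid approximation can be checked against it. The helper names `normal_pdf` and `evidence_on_grid` are hypothetical.

```python
import math

def normal_pdf(v, mean, var):
    """Density of a univariate Normal with the given mean and variance."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def evidence_on_grid(x, lo=-10.0, hi=10.0, n=2001):
    """Approximate p(x) = integral of p(x|z) p(z) dz with the trapezoidal rule."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        # Integrand: likelihood N(x; z, 1) times prior N(z; 0, 1).
        total += weight * normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0)
    return total * step

x = 1.3
approx = evidence_on_grid(x)
exact = normal_pdf(x, 0.0, 2.0)  # closed form, available only for this toy model
print(approx, exact)
```

The grid works here because $z$ is a single number. With the same 2001 points per axis, a $d$-dimensional latent would need $2001^d$ evaluations, which is why this approach does not scale.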
The KL divergence between our approximation and the true posterior is defined as: $$D_{KL}(q_{\phi}(z|x) || p(z|x)) = \int q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p(z|x)} dz$$ Substituting the expression for $p(z|x)$ from Bayes' Rule and rearranging the terms, we find a crucial relationship. The log-evidence $\log p(x)$ can be decomposed into the KL divergence and another term: $\log p(x) = D_{KL}(q_{\phi}(z|x) || p(z|x)) + \mathcal{L}(\theta, \phi; x)$. This $\mathcal{L}$ is known as the Evidence Lower Bound, or ELBO.
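The rearrangement can be written out step by step. The sketch below writes the integral as an expectation under $q_{\phi}(z|x)$, substitutes $p(z|x) = p(x|z)p(z)/p(x)$, and uses the fact that $\log p(x)$ does not depend on $z$, so it can be pulled out of the expectation:

$$\begin{aligned}
D_{KL}(q_{\phi}(z|x) || p(z|x)) &= E_{z \sim q_{\phi}(z|x)}\left[\log q_{\phi}(z|x) - \log p(z|x)\right] \\
&= E_{z \sim q_{\phi}(z|x)}\left[\log q_{\phi}(z|x) - \log p(x|z) - \log p(z)\right] + \log p(x) \\
&= -\mathcal{L}(\theta, \phi; x) + \log p(x)
\end{aligned}$$

Here the ELBO is identified as $\mathcal{L}(\theta, \phi; x) = E_{z \sim q_{\phi}(z|x)}\left[\log p(x|z) + \log p(z) - \log q_{\phi}(z|x)\right]$. Moving $\mathcal{L}$ to the other side gives $\log p(x) = D_{KL}(q_{\phi}(z|x) || p(z|x)) + \mathcal{L}(\theta, \phi; x)$.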
The ELBO is the objective function we maximize in Variational Autoencoders. Because the KL divergence is always non-negative ($D_{KL} \geq 0$), the ELBO serves as a lower bound on the log-likelihood of the data: $\log p(x) \geq \mathcal{L}(\theta, \phi; x)$. We can rewrite the ELBO in a form that is intuitive for deep learning: $$\mathcal{L}(\theta, \phi; x) = E_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) || p(z))$$ The first term is the 'reconstruction term,' ensuring the decoded sample looks like the input, and the second term is a 'regularizer,' ensuring the latent space stays close to the prior $p(z)$ (usually a standard Normal distribution $\mathcal{N}(0, I)$).
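The bound can be checked numerically on a toy model where every quantity has a closed form. The following is a minimal sketch under assumptions that are not part of the text above: a one-dimensional latent with prior $\mathcal{N}(0, 1)$, likelihood $p(x|z) = \mathcal{N}(x; z, 1)$, and a Gaussian approximation $q(z|x) = \mathcal{N}(m, s^2)$. For this model the true posterior is $\mathcal{N}(x/2, 1/2)$ and the evidence is $\mathcal{N}(x; 0, 2)$. The function names `log_evidence` and `elbo` are hypothetical.

```python
import math

def log_evidence(x):
    """log p(x) for the toy model, where marginally x ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

def elbo(x, m, s2):
    """ELBO for q(z|x) = N(m, s2), prior N(0, 1), likelihood N(x; z, 1)."""
    # Reconstruction term E_q[log N(x; z, 1)], in closed form for this model.
    recon = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    # Regularizer KL(N(m, s2) || N(0, 1)).
    kl = 0.5 * (m ** 2 + s2 - 1.0 - math.log(s2))
    return recon - kl

x = 1.3
print(log_evidence(x))        # the quantity being bounded
print(elbo(x, 0.0, 1.0))      # q set to the prior: strictly below log p(x)
print(elbo(x, x / 2.0, 0.5))  # q set to the true posterior: the bound is tight
```

The gap between `log_evidence(x)` and `elbo(x, m, s2)` is exactly the KL divergence from $q$ to the true posterior, so it closes only when $q$ matches the posterior.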
In a Variational Autoencoder (VAE), the distribution $q_{\phi}(z|x)$ acts as the encoder, mapping data to a latent distribution, and $p_{\theta}(x|z)$ acts as the decoder, mapping latent samples back to data space. However, drawing a sample $z \sim q_{\phi}(z|x)$ is not a differentiable operation with respect to $\phi$, so we cannot backpropagate through it directly. To overcome this, we use the 'reparameterization trick.' Instead of sampling $z$ directly, we express it as a deterministic function of a random noise variable $\epsilon \sim \mathcal{N}(0, I)$: $$z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon$$ All randomness now lives in $\epsilon$, which does not depend on $\phi$, so gradients can flow through $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$ back to the encoder parameters.
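A minimal sketch of the reparameterization trick, using only the Python standard library. To keep it checkable, it estimates the gradient of a simple stand-in objective, $E[z^2]$, rather than a full ELBO; this objective, the scalar values of `mu` and `sigma`, and the helper name `reparameterize` are all illustrative assumptions. Since $E[z^2] = \mu^2 + \sigma^2$, the exact gradients are $2\mu$ and $2\sigma$.

```python
import random

def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, elementwise over plain Python lists."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)  # fixed seed so the run is repeatable
mu, sigma = 1.5, 0.7
n = 100_000

# Pathwise (reparameterized) gradient estimates of E[z^2].
grad_mu = 0.0
grad_sigma = 0.0
for _ in range(n):
    eps = rng.gauss(0.0, 1.0)  # noise drawn independently of mu and sigma
    z = reparameterize([mu], [sigma], [eps])[0]
    grad_mu += 2.0 * z          # chain rule: d(z^2)/dz * dz/dmu, with dz/dmu = 1
    grad_sigma += 2.0 * z * eps  # chain rule: d(z^2)/dz * dz/dsigma, with dz/dsigma = eps
grad_mu /= n
grad_sigma /= n

print(grad_mu, 2.0 * mu)        # Monte Carlo estimate vs. analytic gradient
print(grad_sigma, 2.0 * sigma)
```

In a real VAE the chain rule is applied by an automatic differentiation framework rather than by hand, but the structure is the same: the sample is a differentiable function of the encoder outputs once the noise is held fixed.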
By maximizing the ELBO, the VAE learns a continuous, structured latent space. The KL term prevents the model from simply assigning a unique point in space to every single image (which would lead to overfitting), forcing it to distribute the encodings according to the prior. This regularity is what allows us to interpolate between points in the latent space or sample new, synthetic data by drawing $z \sim p(z)$ and passing it through the decoder $p_{\theta}(x|z)$. Thus, the VAE bridges the gap between rigorous Bayesian inference and scalable deep generative modeling.
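Generation and interpolation can be sketched as follows. The decoder here is a hypothetical stand-in (a fixed random linear map followed by a sigmoid), not a trained network, and the dimensions and latent codes are arbitrary illustrative choices; only the sampling and interpolation pattern is the point.

```python
import math
import random

rng = random.Random(0)
LATENT_DIM, DATA_DIM = 2, 4

# Hypothetical stand-in for a trained decoder: fixed random weights.
# A real VAE would use the learned parameters theta here.
W = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)] for _ in range(DATA_DIM)]

def decode(z):
    """Map a latent vector to one value in (0, 1) per output dimension."""
    return [1.0 / (1.0 + math.exp(-sum(w * zi for w, zi in zip(row, z))))
            for row in W]

def interpolate(z_a, z_b, steps):
    """Evenly spaced points on the straight line from z_a to z_b, inclusive."""
    return [[a + (b - a) * t / (steps - 1) for a, b in zip(z_a, z_b)]
            for t in range(steps)]

# Generation: draw z from the prior N(0, I) and pass it through the decoder.
z_new = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
x_new = decode(z_new)

# Interpolation: decode points along a path between two latent codes.
z_a, z_b = [-1.0, 0.5], [1.0, -0.5]
path = [decode(z) for z in interpolate(z_a, z_b, steps=5)]
print(x_new)
print(path)
```

Straight-line interpolation is the simplest choice; whether the decoded path looks meaningful depends on how well the trained latent space matches the prior.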