At its heart, Bayesian inference is about updating our beliefs about a set of latent variables $\mathbf{z}$ given some observed data $\mathbf{x}$. We seek the posterior distribution $p(\mathbf{z}|\mathbf{x})$, which, by Bayes' Theorem, is expressed as $p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}$. In many deep learning contexts, the marginal likelihood $p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z}) \, d\mathbf{z}$ is computationally intractable because the integral must be evaluated over all possible configurations of $\mathbf{z}$. This is the 'bottleneck' that necessitates Variational Inference (VI).
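To make the marginal-likelihood integral concrete, the following sketch evaluates it numerically for a toy one-dimensional model chosen purely for illustration (it is not taken from the text above): $z \sim \mathcal{N}(0, 1)$ and $x \mid z \sim \mathcal{N}(z, 1)$, for which the marginal is known in closed form as $\mathcal{N}(x; 0, 2)$. The function names and grid settings are assumptions.

```python
import math

def normal_pdf(value, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) at `value`."""
    u = (value - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))

def marginal_likelihood_grid(x, num_points=2001, z_min=-10.0, z_max=10.0):
    """Approximate p(x) = integral of p(x|z) p(z) dz with the trapezoidal rule.

    Toy model (an assumption for illustration): z ~ N(0, 1), x | z ~ N(z, 1).
    """
    step = (z_max - z_min) / (num_points - 1)
    total = 0.0
    for i in range(num_points):
        z = z_min + i * step
        # Trapezoidal rule: endpoints get half weight.
        weight = 0.5 if i in (0, num_points - 1) else 1.0
        total += weight * normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0)
    return total * step

x_obs = 1.3
approx = marginal_likelihood_grid(x_obs)
exact = normal_pdf(x_obs, 0.0, math.sqrt(2.0))  # closed form for this toy model
print(approx, exact)
```

A grid with $K$ points per latent dimension needs $K^d$ evaluations in $d$ dimensions, which is why this brute-force approach stops being viable for latent spaces of even modest size.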
Variational Inference recasts the problem from integration to optimization. Instead of computing the exact posterior $p(\mathbf{z}|\mathbf{x})$, we introduce a proxy distribution $q_{\phi}(\mathbf{z}|\mathbf{x})$ with parameters $\phi$ (typically the weights of a neural network) and attempt to make $q_{\phi}$ as similar as possible to $p(\mathbf{z}|\mathbf{x})$. The standard measure of dissimilarity between the two distributions is the Kullback-Leibler (KL) divergence: $D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x})) = \int q_{\phi}(\mathbf{z}|\mathbf{x}) \log \frac{q_{\phi}(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\mathbf{x})} \, d\mathbf{z}$. The KL divergence is not a true distance, since it is asymmetric in its arguments, but it is non-negative and equals zero only when the two distributions coincide, so minimizing it drives the approximation toward the true posterior.
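As a small illustration of the KL divergence, the sketch below compares the well-known closed form for two univariate Gaussians with a direct numerical evaluation of the integral. The univariate Gaussian setting, the function names, and the integration range are assumptions made for the example.

```python
import math

def kl_gaussians(mean_q, std_q, mean_p, std_p):
    """Closed-form D_KL(N(mean_q, std_q^2) || N(mean_p, std_p^2))."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)

def kl_numeric(mean_q, std_q, mean_p, std_p, num_points=20001, lo=-30.0, hi=30.0):
    """Trapezoidal approximation of the integral of q(z) * log(q(z) / p(z))."""
    def log_pdf(z, mean, std):
        u = (z - mean) / std
        return -0.5 * u * u - math.log(std) - 0.5 * math.log(2.0 * math.pi)

    step = (hi - lo) / (num_points - 1)
    total = 0.0
    for i in range(num_points):
        z = lo + i * step
        log_q = log_pdf(z, mean_q, std_q)
        log_p = log_pdf(z, mean_p, std_p)
        weight = 0.5 if i in (0, num_points - 1) else 1.0
        total += weight * math.exp(log_q) * (log_q - log_p)
    return total * step

forward = kl_gaussians(0.0, 1.0, 1.0, 2.0)   # D_KL(q || p)
reverse = kl_gaussians(1.0, 2.0, 0.0, 1.0)   # D_KL(p || q), a different number
numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)
print(forward, reverse, numeric)
```

Swapping the arguments changes the value, which is the asymmetry that keeps the KL divergence from being a metric.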
Since we cannot compute $p(\mathbf{z}|\mathbf{x})$ directly, we cannot evaluate this KL divergence either, so we derive a surrogate objective: the Evidence Lower Bound, or ELBO. Rearranging the definition of the KL divergence, we find that $\log p(\mathbf{x}) = D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x})) + \mathcal{L}(\phi, \theta)$, where $\mathcal{L}(\phi, \theta)$ is the ELBO. Because the KL divergence is always non-negative, $\log p(\mathbf{x}) \geq \mathcal{L}(\phi, \theta)$. Since $\log p(\mathbf{x})$ does not depend on $\phi$, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence between our approximate and true posteriors, while maximizing it with respect to the generative model's parameters $\theta$ raises a lower bound on the log-likelihood of the data.
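For readers who want the intermediate steps, one way to obtain this identity is to substitute Bayes' Theorem, $p(\mathbf{z}|\mathbf{x}) = p_{\theta}(\mathbf{x}|\mathbf{z})p(\mathbf{z})/p(\mathbf{x})$, inside the logarithm and use the fact that $\log p(\mathbf{x})$ does not depend on $\mathbf{z}$. Here $\mathcal{L}(\phi, \theta)$ is taken to be $\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}))$:

$$
\begin{aligned}
D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x}))
&= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{z}|\mathbf{x})\right] \\
&= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log q_{\phi}(\mathbf{z}|\mathbf{x}) - \log p(\mathbf{z}) - \log p_{\theta}(\mathbf{x}|\mathbf{z})\right] + \log p(\mathbf{x}) \\
&= D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z})) - \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z})\right] + \log p(\mathbf{x}) \\
&= -\mathcal{L}(\phi, \theta) + \log p(\mathbf{x}).
\end{aligned}
$$

Moving $\mathcal{L}(\phi, \theta)$ to the left-hand side recovers $\log p(\mathbf{x}) = D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x})) + \mathcal{L}(\phi, \theta)$.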
The ELBO can be decomposed into two intuitive terms: $\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}))$, where $\theta$ denotes the parameters of the decoder (the likelihood model). The first term is the 'reconstruction term,' which encourages the model to decode the latent samples $\mathbf{z}$ back into the original data $\mathbf{x}$. The second term is the 'regularization term,' which penalizes the approximate posterior $q_{\phi}(\mathbf{z}|\mathbf{x})$ for straying from a simple prior $p(\mathbf{z})$, typically a standard Gaussian $\mathcal{N}(0, I)$. This discourages the model from simply assigning an isolated point in latent space to every single input, which promotes a smooth latent space.
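The two terms can be computed exactly in a toy setting, which makes the bound easy to check. The sketch below assumes a scalar latent with prior $\mathcal{N}(0, 1)$, a Gaussian approximate posterior, and a hypothetical fixed decoder $p(x|z) = \mathcal{N}(x; z, 1)$ in place of a neural network; for that model $\log p(x) = \log \mathcal{N}(x; 0, 2)$ and the true posterior is $\mathcal{N}(x/2, 1/2)$. All function names are illustrative.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """D_KL(N(mu, diag(exp(log_var))) || N(0, I)) for lists of equal length."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def expected_log_likelihood(x, mu, log_var):
    """E_q[log p(x|z)] for the toy decoder p(x|z) = N(x; z, 1) with scalar z.

    This has a closed form only because q and the decoder are both Gaussian
    (an assumption for illustration); a neural decoder would typically need
    Monte Carlo samples of z instead.
    """
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - mu) ** 2 + math.exp(log_var))

def elbo(x, mu, log_var):
    """Reconstruction term minus regularization term."""
    return expected_log_likelihood(x, mu, log_var) - kl_to_standard_normal([mu], [log_var])

x_obs = 1.3
log_evidence = -0.5 * math.log(2.0 * math.pi * 2.0) - x_obs ** 2 / 4.0  # log N(x; 0, 2)
tight = elbo(x_obs, x_obs / 2.0, math.log(0.5))  # q equals the true posterior
loose = elbo(x_obs, 0.0, 0.0)                    # q equals the prior
print(log_evidence, tight, loose)
```

When $q$ matches the true posterior the ELBO meets $\log p(x)$; any other choice, such as setting $q$ equal to the prior, falls strictly below it.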
In a Variational Autoencoder (VAE), the encoder network outputs the parameters of $q_{\phi}(\mathbf{z}|\mathbf{x})$, usually a mean $\mu$ and variance $\sigma^2$. However, drawing $\mathbf{z} \sim q_{\phi}(\mathbf{z}|\mathbf{x})$ directly is a stochastic operation, and backpropagation cannot pass gradients through a sampling step. To solve this, we employ the 'Reparameterization Trick.' We express $\mathbf{z}$ as a deterministic transformation of a noise variable $\epsilon \sim \mathcal{N}(0, I)$: $\mathbf{z} = \mu + \sigma \odot \epsilon$. This moves the stochasticity into the input $\epsilon$, which does not depend on $\phi$, allowing gradients to propagate to $\mu$ and $\sigma$ via standard chain rule operations.
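The effect of the trick can be checked on a case with a known answer. The sketch below is a minimal illustration, not a VAE training loop: it uses the hypothetical objective $\mathbb{E}[z^2]$ with scalar $z \sim \mathcal{N}(\mu, \sigma^2)$, whose exact value is $\mu^2 + \sigma^2$, and estimates its gradients by differentiating through $z = \mu + \sigma \epsilon$. The function names and sample count are assumptions.

```python
import random

def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, a deterministic function of (mu, sigma) given eps."""
    return mu + sigma * eps

def pathwise_gradients(mu, sigma, num_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2], z ~ N(mu, sigma^2).

    Chain rule through z = mu + sigma * eps:
        d(z^2)/dmu    = 2 * z * dz/dmu    = 2 * z
        d(z^2)/dsigma = 2 * z * dz/dsigma = 2 * z * eps
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)          # all randomness lives in eps
        z = reparameterize(mu, sigma, eps)
        grad_mu += 2.0 * z
        grad_sigma += 2.0 * z * eps
    return grad_mu / num_samples, grad_sigma / num_samples

g_mu, g_sigma = pathwise_gradients(mu=1.0, sigma=0.5)
# Analytic gradients of mu^2 + sigma^2 are 2 * mu and 2 * sigma.
print(g_mu, g_sigma)
```

The estimates should land close to the analytic gradients $2\mu$ and $2\sigma$. In an automatic differentiation framework the same idea applies, with the chain rule handled by the framework rather than written out by hand.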
Once trained, the VAE serves as a generative model. By discarding the encoder and sampling $\mathbf{z}$ directly from the prior $p(\mathbf{z})$, we can pass these samples through the decoder $p_{\theta}(\mathbf{x}|\mathbf{z})$ to generate entirely new data points. The balance between the ELBO's reconstruction and regularization terms encourages a latent space in which nearby points in $\mathbf{z}$-space correspond to semantically similar images or signals in $\mathbf{x}$-space, bridging the gap between rigorous Bayesian statistics and modern deep generative modeling.
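The generation procedure can be sketched in a few lines. In the example below the decoder is a hypothetical stand-in (a fixed linear map followed by a sigmoid, with made-up weights) rather than a trained network, so the outputs are meaningless as data; the point is only the control flow of sampling from the prior and decoding.

```python
import math
import random

def toy_decoder(z, weights, bias):
    """Hypothetical stand-in for a trained decoder: sigmoid(W z + b)."""
    outputs = []
    for row, b in zip(weights, bias):
        activation = sum(w * zi for w, zi in zip(row, z)) + b
        outputs.append(1.0 / (1.0 + math.exp(-activation)))
    return outputs

def generate(num_samples, latent_dim, weights, bias, seed=0):
    """Sample z ~ N(0, I) from the prior and decode each sample. No encoder is used."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        samples.append(toy_decoder(z, weights, bias))
    return samples

# Made-up parameters mapping a 2-dimensional latent to a 3-dimensional output.
weights = [[0.8, -0.3], [-0.5, 1.2], [0.1, 0.4]]
bias = [0.0, 0.1, -0.2]
generated = generate(4, 2, weights, bias)
for sample in generated:
    print(sample)
```

With a real VAE, `toy_decoder` would be replaced by the trained decoder network, and its output would be interpreted as the parameters of $p_{\theta}(\mathbf{x}|\mathbf{z})$, for example per-pixel Bernoulli means.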