At its heart, Bayesian inference is the process of updating our belief about a set of latent variables $z$ after observing some data $x$. We represent this via Bayes' Theorem: $P(z|x) = \frac{P(x|z)P(z)}{P(x)}$. In the context of generative modeling, we want to find the posterior $P(z|x)$, which tells us which latent codes $z$ are most likely to have generated the observed image or signal $x$. However, when the likelihood $P(x|z)$ is defined by a complex neural network, the evidence $P(x) = \int P(x|z)P(z) dz$ is an intractable integral: it requires integrating over every possible configuration of the latent space, which has no closed form and is far too expensive to approximate by brute force.
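To make the role of the evidence concrete, here is a minimal sketch of Bayes' Theorem for a toy model whose latent variable takes only three values, so the integral reduces to a short sum. The probabilities in `prior` and `likelihood` are made-up illustrative numbers, not taken from any real model.

```python
# Toy model with a discrete latent z in {0, 1, 2}; all numbers are illustrative.
prior = [0.5, 0.3, 0.2]          # P(z)
likelihood = [0.10, 0.40, 0.70]  # P(x | z) for one fixed observation x

# Evidence P(x): a sum over every latent configuration
# (this becomes an integral when z is continuous).
evidence = sum(l * p for l, p in zip(likelihood, prior))

# Bayes' Theorem: P(z | x) = P(x | z) P(z) / P(x)
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]

print("P(x)   =", evidence)
print("P(z|x) =", posterior)
```

With three latent values the sum is trivial. With a continuous, high-dimensional $z$ and a neural-network likelihood, the same normalizing constant is what becomes intractable.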
To bypass this intractability, Variational Autoencoders (VAEs) treat inference as an optimization problem. Instead of computing the true posterior $P(z|x)$, we introduce a variational distribution $q_{\phi}(z|x)$, parameterized by an encoder network with weights $\phi$. Our goal is to make $q_{\phi}(z|x)$ as close as possible to the true $P(z|x)$. We measure this closeness using the Kullback-Leibler (KL) Divergence: $D_{KL}(q_{\phi}(z|x) || P(z|x))$. This divergence cannot be evaluated directly, since it involves the unknown true posterior, so we need a proxy objective whose maximization with respect to $\phi$ is equivalent to minimizing it.
This proxy is the Evidence Lower Bound, or ELBO. By applying Jensen's Inequality to the log-likelihood of the data, we derive: $\log P(x) \geq E_{q_{\phi}(z|x)}[\log P_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) || P(z))$. The gap in this bound is exactly $D_{KL}(q_{\phi}(z|x) || P(z|x))$, so for a fixed decoder, maximizing the ELBO with respect to $\phi$ minimizes the divergence from the true posterior. The ELBO consists of two competing terms: the reconstruction term, which ensures the decoder $P_{\theta}(x|z)$ can reconstruct $x$ from $z$, and the regularization term, which penalizes the approximate posterior for straying from a prior distribution $P(z)$, typically a standard Gaussian $\mathcal{N}(0, I)$.
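The bound can be checked numerically on a toy model with a three-valued latent variable, where every quantity is a finite sum. The sketch below verifies that $\log P(x)$ equals the ELBO plus the KL divergence between $q$ and the true posterior, so the ELBO never exceeds $\log P(x)$. The probabilities and the helper names `kl` and `elbo` are illustrative assumptions.

```python
import math

# Toy discrete model; all numbers are illustrative.
prior = [0.5, 0.3, 0.2]          # P(z)
likelihood = [0.10, 0.40, 0.70]  # P(x | z) for one fixed observation x

def kl(p, r):
    """KL(p || r) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)

def elbo(q):
    """Reconstruction term minus KL(q || prior)."""
    reconstruction = sum(qi * math.log(li) for qi, li in zip(q, likelihood))
    return reconstruction - kl(q, prior)

evidence = sum(l * p for l, p in zip(likelihood, prior))
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
log_evidence = math.log(evidence)

q = [0.2, 0.3, 0.5]              # an arbitrary approximate posterior q(z | x)
gap = kl(q, posterior)           # KL(q || true posterior)

print("log P(x)       =", log_evidence)
print("ELBO(q)        =", elbo(q))
print("ELBO(q) + gap  =", elbo(q) + gap)
print("ELBO(posterior)=", elbo(posterior))
```

When `q` is set to the exact posterior the gap vanishes and the bound is tight, which is why pushing the ELBO up also pulls the approximate posterior toward the true one.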
Let's examine the tension between these two terms. The reconstruction term $E_{q_{\phi}(z|x)}[\log P_{\theta}(x|z)]$ encourages the model to assign each input $x$ to a unique, highly specific point in the latent space to minimize reconstruction error. Conversely, the KL term $D_{KL}(q_{\phi}(z|x) || P(z))$ acts as a 'spring,' pulling the distribution toward the center of the latent space. Without the KL term, the encoder would be free to shrink its variance toward zero, and the VAE would degenerate into a standard autoencoder with a fragmented latent space, where interpolating between points yields gibberish.
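The 'spring' behaviour is easy to see when the approximate posterior is a diagonal Gaussian and the prior is $\mathcal{N}(0, I)$, because the KL term then has the closed form $\frac{1}{2}\sum_i (\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2)$. The sketch below assumes that Gaussian form; the function name is hypothetical.

```python
import math

def gaussian_kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - 2.0 * math.log(s)  # 2*log(s) == log(s^2)
        for m, s in zip(mu, sigma)
    )

# No penalty when the approximate posterior already matches the prior.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))

# The penalty grows as the mean moves away from the origin...
print(gaussian_kl_to_standard_normal([2.0, 0.0], [1.0, 1.0]))

# ...and also as the variance collapses toward a single point.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.1, 0.1]))
```

Both a distant mean and a vanishing variance are penalized, which is what discourages the encoder from mapping each input to its own isolated point.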
A critical technical challenge arises when we try to compute the gradient of the ELBO with respect to the encoder parameters $\phi$. Because the expectation involves sampling $z$ from $q_{\phi}(z|x)$, we cannot backpropagate through the stochastic sampling process. To solve this, we use the Reparameterization Trick. We express $z$ as a deterministic function of the parameters and an auxiliary noise variable $\epsilon \sim \mathcal{N}(0, I)$: $z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon$. This shifts the randomness to an external input, allowing gradients to flow through $\mu$ and $\sigma$.
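As a rough illustration of the trick, the sketch below estimates gradients of $E_{z \sim \mathcal{N}(\mu, \sigma^2)}[f(z)]$ for a one-dimensional latent, with $f(z) = z^2$ standing in for the ELBO integrand. That choice is an assumption made only because the exact answer is known: $E[z^2] = \mu^2 + \sigma^2$, so the true gradients are $2\mu$ and $2\sigma$. The function name and sample count are likewise hypothetical.

```python
import random

def reparameterized_gradients(mu, sigma, num_samples=100_000, seed=0):
    """Monte Carlo gradients of E[z^2] w.r.t. mu and sigma, with z = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)  # the randomness comes from outside the parameters
        z = mu + sigma * eps       # deterministic in (mu, sigma) once eps is fixed
        df_dz = 2.0 * z            # derivative of the stand-in f(z) = z^2
        grad_mu += df_dz           # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

g_mu, g_sigma = reparameterized_gradients(mu=1.5, sigma=0.5)
print("estimated d/dmu   :", g_mu, "(exact: 3.0)")
print("estimated d/dsigma:", g_sigma, "(exact: 1.0)")
```

In a real VAE the chain rule is applied by an automatic differentiation library rather than by hand, but the structure is the same: once $\epsilon$ is drawn, $z$ is an ordinary differentiable function of $\mu$ and $\sigma$.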
In summary, the VAE transforms a Bayesian inference problem into a differentiable optimization task. By maximizing the ELBO, we simultaneously learn a generative model $P_{\theta}$ and an approximate inference model $q_{\phi}$. The resulting latent space is continuous and structured, allowing us to generate new, synthetic data by sampling $z \sim P(z)$ and passing it through the decoder. The same variational reasoning, in which a tractable lower bound stands in for an intractable likelihood, also underlies modern diffusion models and other advanced latent variable architectures.