To understand Variational Autoencoders (VAEs), we start with Bayesian inference. At its core, Bayesian inference updates our belief about a latent variable $z$ given observed data $x$. According to Bayes' rule, the posterior distribution $p(z|x)$ is given by: $$p(z|x) = \frac{p(x|z)p(z)}{p(x)}$$ where $p(x|z)$ is the likelihood, $p(z)$ is the prior, and $p(x) = \int p(x|z)p(z) \, dz$ is the evidence (also called the marginal likelihood). In complex models, computing $p(x)$ is intractable: the integral ranges over a high-dimensional latent space and generally has no closed form, so the posterior cannot be computed exactly.
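To make the role of the evidence concrete, here is a minimal sketch on a one-dimensional toy model where everything can be checked by hand. The model is an assumption chosen for illustration and is not taken from the text: $p(z) = \mathcal{N}(0, 1)$ and $p(x|z) = \mathcal{N}(z, 1)$, for which the evidence is known to be $\mathcal{N}(x; 0, 2)$. The function names are hypothetical.

```python
import math

def normal_pdf(v, mean, var):
    """Density of the univariate Gaussian N(mean, var) evaluated at v."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def evidence_by_quadrature(x, lo=-10.0, hi=10.0, n=4001):
    """Approximate p(x) = integral of p(x|z) p(z) dz with the trapezoidal rule.

    Assumed toy model: p(z) = N(0, 1) and p(x|z) = N(z, 1).
    """
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end points
        total += weight * normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0)
    return total * h

def posterior_pdf(z, x):
    """Bayes' rule: p(z|x) = p(x|z) p(z) / p(x)."""
    return normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0) / evidence_by_quadrature(x)

x_obs = 1.0
print(evidence_by_quadrature(x_obs))  # numerical evidence
print(normal_pdf(x_obs, 0.0, 2.0))    # closed form for this toy model
```

A grid with $n$ points per axis needs $n^d$ evaluations when $z$ has $d$ dimensions, which is why this brute-force approach stops being practical for high-dimensional latent spaces.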
Since we cannot compute the exact posterior $p(z|x)$, we turn to Variational Inference (VI). The strategy is to approximate the complex posterior with a simpler, parameterized distribution $q_{\phi}(z|x)$, such as a Gaussian. Our goal is to make $q_{\phi}(z|x)$ as close as possible to $p(z|x)$. We measure the mismatch between the two using the Kullback-Leibler (KL) divergence: $$D_{KL}(q_{\phi}(z|x) \,\|\, p(z|x)) = \int q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p(z|x)} \, dz$$ The divergence is zero only when the two distributions coincide, so minimizing it over $\phi$ pushes our approximate distribution toward the true posterior.
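As a rough illustration of the definition, the sketch below computes the KL divergence between two univariate Gaussians in two ways: with the standard closed-form expression, and as a Monte Carlo average of $\log q(z) - \log p(z)$ over samples $z \sim q$. The function names and the choice of Gaussians are assumptions made for the example; in a real VAE the second argument would be the unknown posterior, which is exactly why this quantity cannot be evaluated directly.

```python
import math
import random

def kl_gaussians(m_q, s_q, m_p, s_p):
    """Closed-form KL( N(m_q, s_q^2) || N(m_p, s_p^2) ) for univariate Gaussians."""
    return math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2.0 * s_p ** 2) - 0.5

def log_normal_pdf(v, mean, std):
    """Log-density of N(mean, std^2) evaluated at v."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - 0.5 * ((v - mean) / std) ** 2

def kl_monte_carlo(m_q, s_q, m_p, s_p, n=20000, seed=0):
    """Estimate the KL as the average of log q(z) - log p(z) over samples z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(m_q, s_q)
        total += log_normal_pdf(z, m_q, s_q) - log_normal_pdf(z, m_p, s_p)
    return total / n

print(kl_gaussians(1.0, 1.0, 0.0, 1.0))    # closed form
print(kl_monte_carlo(1.0, 1.0, 0.0, 1.0))  # sampling-based estimate
# The divergence is not symmetric in its arguments:
print(kl_gaussians(0.0, 1.0, 0.0, 2.0), kl_gaussians(0.0, 2.0, 0.0, 1.0))
```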
However, we cannot minimize this KL divergence directly, because evaluating it requires $p(z|x)$, which is exactly what we are trying to find. To bypass this, we derive the Evidence Lower Bound (ELBO). By rearranging the KL divergence formula and the definition of the marginal likelihood, we find that: $$\log p(x) = \mathrm{ELBO}(\phi, \theta) + D_{KL}(q_{\phi}(z|x) \,\|\, p(z|x))$$ where $\theta$ denotes the parameters of the generative model. Since the KL divergence is always non-negative, the ELBO is a lower bound on the log-likelihood of the data: $\log p(x) \geq \mathrm{ELBO}(\phi, \theta)$. Because $\log p(x)$ does not depend on $\phi$, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence to the true posterior.
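For readers who want the intermediate steps, one way to obtain this identity is sketched below. It writes $p(x, z) = p_{\theta}(x|z)\,p(z)$ for the joint density (notation introduced here for convenience), substitutes $p(z|x) = p(x, z) / p(x)$, and uses the fact that $\log p(x)$ does not depend on $z$ and can therefore be pulled out of the expectation.

$$\begin{aligned}
D_{KL}(q_{\phi}(z|x) \,\|\, p(z|x)) &= \mathbb{E}_{q_{\phi}(z|x)}\big[\log q_{\phi}(z|x) - \log p(z|x)\big] \\
&= \mathbb{E}_{q_{\phi}(z|x)}\big[\log q_{\phi}(z|x) - \log p(x, z)\big] + \log p(x) \\
\Longrightarrow \quad \log p(x) &= \underbrace{\mathbb{E}_{q_{\phi}(z|x)}\big[\log p(x, z) - \log q_{\phi}(z|x)\big]}_{\mathrm{ELBO}(\phi, \theta)} + D_{KL}(q_{\phi}(z|x) \,\|\, p(z|x))
\end{aligned}$$

Splitting $\log p(x, z) = \log p_{\theta}(x|z) + \log p(z)$ inside the expectation then groups the ELBO into an expected log-likelihood and a KL divergence between $q_{\phi}(z|x)$ and the prior $p(z)$.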
The ELBO can be decomposed into two interpretable terms: a reconstruction term and a regularization term. The full formulation is: $$\mathrm{ELBO}(\phi, \theta) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) \,\|\, p(z))$$ The first term, $\mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]$, encourages the model to reconstruct the input $x$ accurately from the sampled latent $z$. The second term, $D_{KL}(q_{\phi}(z|x) \,\|\, p(z))$, acts as a regularizer, penalizing approximate posteriors that stray far from the prior $p(z)$, typically a standard normal distribution $\mathcal{N}(0, I)$.
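The two terms can be evaluated exactly in a small example. The sketch below assumes a one-dimensional toy model that is not part of the text: prior $p(z) = \mathcal{N}(0, 1)$, likelihood $p(x|z) = \mathcal{N}(z, 1)$, and approximate posterior $q(z|x) = \mathcal{N}(m, s^2)$. For this model the exact posterior is $\mathcal{N}(x/2, 1/2)$ and the evidence is $\mathcal{N}(x; 0, 2)$, so we can check that the bound is tight when $q$ matches the posterior and loose otherwise. All function names are hypothetical.

```python
import math

def expected_log_likelihood(x, m, s2):
    """Reconstruction term E_{z ~ N(m, s2)}[ log N(x; z, 1) ], in closed form."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)

def kl_to_standard_normal(m, s2):
    """Regularization term KL( N(m, s2) || N(0, 1) )."""
    return 0.5 * (m ** 2 + s2 - 1.0 - math.log(s2))

def elbo(x, m, s2):
    """ELBO = reconstruction term minus KL to the prior."""
    return expected_log_likelihood(x, m, s2) - kl_to_standard_normal(m, s2)

def log_evidence(x):
    """log p(x) for the assumed toy model, where p(x) = N(x; 0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

x_obs = 1.0
print(log_evidence(x_obs))
print(elbo(x_obs, 0.5, 0.5))  # q equals the exact posterior: bound is tight
print(elbo(x_obs, 0.0, 1.0))  # q equals the prior: bound is strictly lower
```

In a real VAE neither term of the gap is available in closed form for the reconstruction part, which is typically estimated from samples of $z$.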
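In a VAE, we implement this using two neural networks: an encoder that outputs the parameters $\mu$ and $\sigma$ of $q_{\phi}(z|x)$, and a decoder that parameterizes $p_{\theta}(x|z)$. A significant challenge arises during training: the reconstruction term is an expectation over samples $z \sim q_{\phi}(z|x)$, and a random sampling step is not a differentiable function of $\mu$ and $\sigma$, so we cannot backpropagate through it. To solve this, we use the reparameterization trick. Instead of sampling $z \sim \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ directly, we express $z$ as: $$z = \mu + \sigma \odot \epsilon, \text{ where } \epsilon \sim \mathcal{N}(0, I)$$ This moves the stochasticity into an external noise variable $\epsilon$ that does not depend on $\phi$. Given $\epsilon$, $z$ is a deterministic, differentiable function of $\mu$ and $\sigma$, so gradients can flow through them back to the encoder weights $\phi$.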
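A minimal sketch of the idea in plain Python, without an autodiff library, is given below. It uses a scalar latent and the test function $f(z) = z^2$, both chosen only for illustration, and applies the chain rule by hand through $z = \mu + \sigma \epsilon$. Since $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, the estimates should approach $2\mu$ and $2\sigma$. The function names are hypothetical; in practice a deep learning framework would compute these derivatives automatically.

```python
import random

def reparameterized_sample(mu, sigma, eps):
    """z = mu + sigma * eps: deterministic in (mu, sigma) once eps is drawn."""
    return mu + sigma * eps

def pathwise_gradients(mu, sigma, n=50000, seed=0):
    """Estimate d/dmu and d/dsigma of E[z^2] by differentiating through z.

    With f(z) = z^2: df/dmu    = 2z * dz/dmu    = 2z
                     df/dsigma = 2z * dz/dsigma = 2z * eps
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)  # noise drawn independently of mu and sigma
        z = reparameterized_sample(mu, sigma, eps)
        grad_mu += 2.0 * z
        grad_sigma += 2.0 * z * eps
    return grad_mu / n, grad_sigma / n

g_mu, g_sigma = pathwise_gradients(1.5, 0.5)
print(g_mu, g_sigma)  # compare with the exact values 2 * mu and 2 * sigma
```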
The training process becomes a joint optimization of $\phi$ and $\theta$ to maximize the ELBO. By maximizing the reconstruction term while keeping the KL divergence to the prior small, the VAE learns a structured latent space in which similar data points are mapped to nearby regions. This enables generation: by sampling $z \sim p(z)$ and passing it through the decoder $p_{\theta}(x|z)$, we can synthesize new data samples that share the characteristics of the training set.
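The sketch below illustrates both halves of this paragraph on a deliberately simplified case, and should be read as a toy rather than a VAE implementation. It assumes a one-dimensional model with a fixed decoder, $p(z) = \mathcal{N}(0, 1)$ and $p(x|z) = \mathcal{N}(z, 1)$, so only the variational parameters are optimized and there is no encoder network or learned $\theta$. With $q(z|x) = \mathcal{N}(m, e^{2\rho})$, the ELBO gradients work out to $x - 2m$ and $1 - 2e^{2\rho}$, and gradient ascent should recover the exact posterior $\mathcal{N}(x/2, 1/2)$. The function names are hypothetical.

```python
import math
import random

def fit_q(x, steps=500, lr=0.1):
    """Gradient ascent on the toy ELBO over q(z|x) = N(m, exp(2 * rho)).

    For the assumed model p(z) = N(0, 1), p(x|z) = N(z, 1):
    dELBO/dm = x - 2m and dELBO/drho = 1 - 2 * exp(2 * rho).
    """
    m, rho = 0.0, 0.0
    for _ in range(steps):
        m += lr * (x - 2.0 * m)
        rho += lr * (1.0 - 2.0 * math.exp(2.0 * rho))
    return m, math.exp(2.0 * rho)  # mean and variance of q

def generate(n, seed=0):
    """Ancestral sampling: draw z ~ p(z), then x ~ p(x|z)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)            # latent from the prior
        samples.append(rng.gauss(z, 1.0))  # observation from the toy decoder
    return samples

print(fit_q(1.0))    # should approach the exact posterior (0.5, 0.5)
print(generate(5))   # new samples drawn without any input x
```

In an actual VAE the same loop would update encoder and decoder weights together using stochastic gradients of the ELBO, and the decoder would be a neural network rather than a fixed Gaussian.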