To understand Variational Autoencoders (VAEs), we must first understand the goal of Bayesian inference. At its heart, we want to find the posterior distribution $p(\mathbf{z} | \mathbf{x})$, which tells us how probable each value of a latent variable $\mathbf{z}$ is given some observed data $\mathbf{x}$. Using Bayes' rule, we have $p(\mathbf{z} | \mathbf{x}) = \frac{p(\mathbf{x} | \mathbf{z}) p(\mathbf{z})}{p(\mathbf{x})}$. However, the denominator $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}$ (the evidence) is often computationally intractable: it requires integrating over every possible configuration of the latent variable, which generally has no closed form and becomes prohibitively expensive to approximate numerically as the dimension of $\mathbf{z}$ grows.
Since we cannot compute the true posterior $p(\mathbf{z} | \mathbf{x})$, we employ variational inference. Instead of calculating the exact distribution, we define a simpler, parameterized distribution $q_{\phi}(\mathbf{z} | \mathbf{x})$ (usually a Gaussian) and attempt to make it as similar as possible to the true posterior. The 'closeness' between these two distributions is measured using the Kullback-Leibler (KL) divergence, defined as $\text{KL}(q_{\phi}(\mathbf{z} | \mathbf{x}) \parallel p(\mathbf{z} | \mathbf{x})) = \int q_{\phi}(\mathbf{z} | \mathbf{x}) \log \frac{q_{\phi}(\mathbf{z} | \mathbf{x})}{p(\mathbf{z} | \mathbf{x})} \, d\mathbf{z}$. Our goal is to minimize this divergence by optimizing the parameters $\phi$.
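To make the KL divergence concrete, here is a minimal sketch for the simplest case, where both distributions are univariate Gaussians and the integral has a closed form. The function names `kl_gaussians` and `kl_monte_carlo` are illustrative choices, not part of any library. The second function estimates the same quantity by sampling from $q$, which is the form that still works when no closed form is available.

```python
import numpy as np

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) for univariate Gaussians q = N(mu_q, sigma_q^2), p = N(mu_p, sigma_p^2)."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2)
            - 0.5)

def kl_monte_carlo(mu_q, sigma_q, mu_p, sigma_p, n_samples=200_000, seed=0):
    """Estimate KL(q || p) = E_q[log q(z) - log p(z)] by averaging over samples z ~ q."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mu_q, sigma_q, size=n_samples)
    # Log-densities up to the shared constant -0.5*log(2*pi), which cancels in the difference.
    log_q = -0.5 * ((z - mu_q) / sigma_q)**2 - np.log(sigma_q)
    log_p = -0.5 * ((z - mu_p) / sigma_p)**2 - np.log(sigma_p)
    return np.mean(log_q - log_p)

if __name__ == "__main__":
    print(kl_gaussians(0.0, 1.0, 1.0, 2.0))    # KL(q || p)
    print(kl_gaussians(1.0, 2.0, 0.0, 1.0))    # KL(p || q): a different value
    print(kl_monte_carlo(0.0, 1.0, 1.0, 2.0))  # sampling-based estimate of KL(q || p)
```

Note that the divergence is not symmetric: with $q = \mathcal{N}(0, 1)$ and $p = \mathcal{N}(1, 2^2)$, $\text{KL}(q \parallel p) \approx 0.443$ while $\text{KL}(p \parallel q) \approx 1.307$, so the order of the arguments matters.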
Minimizing the KL divergence directly is still impossible because it depends on the unknown posterior $p(\mathbf{z} | \mathbf{x})$. To get around this, we derive the Evidence Lower Bound (ELBO). Through algebraic manipulation of the log-evidence $\log p(\mathbf{x})$, we can show that $\log p(\mathbf{x}) = \text{ELBO}(\phi, \theta) + \text{KL}(q_{\phi}(\mathbf{z} | \mathbf{x}) \parallel p(\mathbf{z} | \mathbf{x}))$. Since the KL divergence is always non-negative, the ELBO is a lower bound on the log-likelihood of the data. Moreover, $\log p(\mathbf{x})$ does not depend on $\phi$, so any increase in the ELBO with respect to $\phi$ must be matched by an equal decrease in the KL term. Maximizing the ELBO over $\phi$ is therefore equivalent to minimizing the KL divergence between our approximation and the true posterior, and unlike that KL divergence, the ELBO can be evaluated without knowing $p(\mathbf{z} | \mathbf{x})$.
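The algebraic manipulation can be spelled out in a few lines. As a sketch, assume the ELBO is written in its joint form, $\text{ELBO}(\phi, \theta) = \mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})}[\log p(\mathbf{x}, \mathbf{z}) - \log q_{\phi}(\mathbf{z} | \mathbf{x})]$, and substitute $p(\mathbf{z} | \mathbf{x}) = p(\mathbf{x}, \mathbf{z}) / p(\mathbf{x})$ into the definition of the KL divergence:

$$
\begin{aligned}
\text{KL}(q_{\phi}(\mathbf{z} | \mathbf{x}) \parallel p(\mathbf{z} | \mathbf{x}))
&= \mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})}\left[\log q_{\phi}(\mathbf{z} | \mathbf{x}) - \log p(\mathbf{z} | \mathbf{x})\right] \\
&= \mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})}\left[\log q_{\phi}(\mathbf{z} | \mathbf{x}) - \log p(\mathbf{x}, \mathbf{z}) + \log p(\mathbf{x})\right] \\
&= -\mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})}\left[\log p(\mathbf{x}, \mathbf{z}) - \log q_{\phi}(\mathbf{z} | \mathbf{x})\right] + \log p(\mathbf{x}) \\
&= -\text{ELBO}(\phi, \theta) + \log p(\mathbf{x}).
\end{aligned}
$$

The third line uses the fact that $\log p(\mathbf{x})$ does not depend on $\mathbf{z}$, so it passes through the expectation unchanged. Rearranging the last line gives the identity stated above.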
The ELBO decomposes into two interpretable terms: $\text{ELBO} = \mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x})}[\log p_{\theta}(\mathbf{x} | \mathbf{z})] - \text{KL}(q_{\phi}(\mathbf{z} | \mathbf{x}) \parallel p(\mathbf{z}))$. The first term is the 'reconstruction term,' which encourages the model to decode $\mathbf{z}$ back into $\mathbf{x}$ accurately. The second term is the 'regularization term,' which penalizes the approximate posterior $q_{\phi}(\mathbf{z} | \mathbf{x})$ for drifting away from a prior distribution $p(\mathbf{z})$, typically a standard normal $\mathcal{N}(0, I)$. This prevents the model from simply assigning an isolated point in latent space to every input, which keeps the latent space continuous and structured.
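The two terms can be sketched numerically under two assumptions that the text does not fix: the approximate posterior is a diagonal Gaussian parameterized by a mean and a log-variance, and the decoder is a Gaussian with unit variance centered on its output. With a standard normal prior, the KL term then has the closed form $\frac{1}{2} \sum_j (\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2)$. The function names below are illustrative.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def gaussian_log_likelihood(x, x_recon):
    """log N(x | x_recon, I), summed over data dimensions.

    Assumption: the decoder is a unit-variance Gaussian whose mean is x_recon.
    """
    d = x.shape[-1]
    return -0.5 * np.sum((x - x_recon)**2, axis=-1) - 0.5 * d * np.log(2.0 * np.pi)

def elbo_estimate(x, x_recon, mu, log_var):
    """Single-sample ELBO estimate: reconstruction term minus KL regularizer.

    x_recon is assumed to be the decoder output for one sample z ~ q(z | x).
    """
    return gaussian_log_likelihood(x, x_recon) - kl_to_standard_normal(mu, log_var)

if __name__ == "__main__":
    x = np.array([0.2, -0.1, 0.4, 0.0])
    mu, log_var = np.array([0.5, -0.3]), np.array([-1.0, 0.2])
    print(kl_to_standard_normal(mu, log_var))
    print(elbo_estimate(x, x + 0.1, mu, log_var))
```

Under the unit-variance assumption, the reconstruction term reduces to a negative squared error plus a constant, which is why a mean-squared-error loss is a common stand-in for it. The KL term is zero exactly when the approximate posterior equals the prior.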
In a VAE, these components are implemented via neural networks. The encoder network, whose weights are the parameters $\phi$, maps an input $\mathbf{x}$ to the mean $\mu$ and standard deviation $\sigma$ of the distribution $q_{\phi}(\mathbf{z} | \mathbf{x})$. To allow gradients to flow through the stochastic sampling of $\mathbf{z}$, we use the 'reparameterization trick'. Instead of sampling $\mathbf{z} \sim \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ directly, we sample $\epsilon \sim \mathcal{N}(0, I)$ and compute $\mathbf{z} = \mu + \sigma \odot \epsilon$. This moves the randomness to an external input: for a fixed $\epsilon$, $\mathbf{z}$ is a deterministic, differentiable function of $\mu$ and $\sigma$, and therefore of $\phi$.
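The following sketch illustrates the trick with plain NumPy. It assumes the encoder emits a log-variance rather than $\sigma$ itself (a common but not universal convention), and the function names are illustrative. Since NumPy has no automatic differentiation, the second function applies the chain rule by hand to a test function $f(z) = z^2$, for which $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ and the exact gradients are $2\mu$ and $2\sigma$.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); returns (z, eps)."""
    sigma = np.exp(0.5 * log_var)          # log-variance -> standard deviation
    eps = rng.standard_normal(mu.shape)    # all randomness lives in eps
    return mu + sigma * eps, eps

def pathwise_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Estimate d/dmu and d/dsigma of E[z^2] for scalar mu, sigma via the chain rule."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    z = mu + sigma * eps
    df_dz = 2.0 * z                        # derivative of f(z) = z^2
    grad_mu = np.mean(df_dz * 1.0)         # dz/dmu = 1
    grad_sigma = np.mean(df_dz * eps)      # dz/dsigma = eps
    return grad_mu, grad_sigma

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z, eps = reparameterize(np.array([1.0, -2.0]), np.array([0.0, np.log(4.0)]), rng)
    print(z)
    print(pathwise_gradients(1.5, 0.5))    # should be close to the exact values (3.0, 1.0)
```

In an automatic-differentiation framework the same structure applies: because $\epsilon$ is treated as a constant input, backpropagation through $\mu + \sigma \odot \epsilon$ produces these gradients without any special handling of the sampling step.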
The final architecture is a dual-network system. The encoder learns to squash high-dimensional data into a structured latent bottleneck, while the decoder $p_{\theta}(\mathbf{x} | \mathbf{z})$ learns to reconstruct the data from these latent samples. By maximizing the ELBO, the VAE learns a generative model. Once trained, we can discard the encoder, sample $\mathbf{z}$ directly from the prior $p(\mathbf{z})$, and pass it through the decoder to generate entirely new, synthetic data samples that mimic the original training distribution.
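The generation step can be sketched as follows. The decoder here is a hypothetical two-layer network with random, untrained weights that merely stands in for a trained $p_{\theta}(\mathbf{x} | \mathbf{z})$. The layer sizes, the `tanh` and sigmoid activations, and the names `make_toy_decoder` and `generate` are all assumptions made for illustration.

```python
import numpy as np

def make_toy_decoder(latent_dim, data_dim, hidden_dim=16, seed=0):
    """Build a stand-in decoder. Its weights are random placeholders, not trained parameters."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((latent_dim, hidden_dim)) * 0.5
    b1 = np.zeros(hidden_dim)
    W2 = rng.standard_normal((hidden_dim, data_dim)) * 0.5
    b2 = np.zeros(data_dim)

    def decode(z):
        h = np.tanh(z @ W1 + b1)
        logits = h @ W2 + b2
        return 1.0 / (1.0 + np.exp(-logits))  # sigmoid: per-dimension mean in (0, 1)

    return decode

def generate(decode, n_samples, latent_dim, rng):
    """Sample z from the prior N(0, I) and map it through the decoder. No encoder is involved."""
    z = rng.standard_normal((n_samples, latent_dim))
    return decode(z)

if __name__ == "__main__":
    decode = make_toy_decoder(latent_dim=2, data_dim=8)
    samples = generate(decode, n_samples=5, latent_dim=2, rng=np.random.default_rng(1))
    print(samples.shape)
```

With a trained decoder in place of the random one, each row of `samples` would be the decoder's mean output for a freshly drawn latent code. Sampling from the prior is only sensible because the regularization term kept the approximate posteriors close to that prior during training.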