At its heart, Bayesian inference is about updating our beliefs about a set of hidden parameters $\zeta$ (latent variables) given some observed data $x$. According to Bayes' theorem, the posterior distribution is $p(\zeta|x) = \frac{p(x|\zeta)p(\zeta)}{p(x)}$. While the likelihood $p(x|\zeta)$ and the prior $p(\zeta)$ are usually specified by the model, the denominator $p(x) = \int p(x, \zeta) \, d\zeta$, known as the evidence, is typically an intractable high-dimensional integral. In complex models such as those built on deep neural networks, this integral has no closed form and is too expensive to approximate numerically, so we cannot calculate the exact posterior.
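To make the role of the evidence concrete, here is a minimal sketch on a one-dimensional toy model where everything is still tractable. The model (prior $\zeta \sim \mathcal{N}(0, 1)$, likelihood $x|\zeta \sim \mathcal{N}(\zeta, 1)$) and all function names are illustrative assumptions, not something prescribed by the text. In one dimension a plain Riemann sum recovers $p(x)$; the number of grid points needed grows exponentially with the dimension of $\zeta$, which is why this approach does not scale.

```python
import math

def normal_pdf(v, mean, var):
    """Density of a univariate Normal with the given mean and variance."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def evidence_on_grid(x, lo=-10.0, hi=10.0, n=4001):
    """Approximate p(x) = integral of p(x|zeta) p(zeta) d zeta with a Riemann sum."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        zeta = lo + i * step
        total += normal_pdf(x, zeta, 1.0) * normal_pdf(zeta, 0.0, 1.0)
    return total * step

def posterior_density(zeta, x):
    """Bayes' theorem: p(zeta|x) = p(x|zeta) p(zeta) / p(x)."""
    return normal_pdf(x, zeta, 1.0) * normal_pdf(zeta, 0.0, 1.0) / evidence_on_grid(x)

x_obs = 1.3
evidence = evidence_on_grid(x_obs)
# For this conjugate toy model the evidence is known exactly: N(x; 0, 2).
exact = normal_pdf(x_obs, 0.0, 2.0)
```

For this toy model the exact posterior is $\mathcal{N}(x/2, 1/2)$, so `posterior_density` can be checked against `normal_pdf(zeta, x_obs / 2, 0.5)`.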
To bypass this intractability, we utilize Variational Inference (VI). Instead of computing the true posterior $p(\zeta|x)$, we introduce a simpler, parameterized distribution $q_{\phi}(\zeta|x)$ (the variational distribution) and attempt to make it as similar as possible to the true posterior. We measure the 'closeness' of these two distributions using the Kullback-Leibler (KL) Divergence: $\text{KL}(q_{\phi}(\zeta|x) || p(\zeta|x)) = \int q_{\phi}(\zeta|x) \log \frac{q_{\phi}(\zeta|x)}{p(\zeta|x)} \, d\zeta$. Our goal is to minimize this divergence by optimizing the parameters $\phi$.
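As a rough illustration of the KL integral, the sketch below estimates it by averaging $\log q - \log p$ over samples drawn from $q$, and compares the result with the known closed form for two univariate Gaussians. Both distributions and all function names are hypothetical stand-ins: in a real VI problem the target $p(\zeta|x)$ is unknown, so this direct computation is not available.

```python
import math
import random

def log_normal_pdf(v, mean, std):
    """Log-density of a univariate Normal with the given mean and standard deviation."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((v - mean) / std) ** 2

def kl_monte_carlo(q_mean, q_std, p_mean, p_std, n_samples=50_000, seed=0):
    """Estimate KL(q || p) = E_q[log q(z) - log p(z)] using samples from q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(q_mean, q_std)
        total += log_normal_pdf(z, q_mean, q_std) - log_normal_pdf(z, p_mean, p_std)
    return total / n_samples

def kl_closed_form(q_mean, q_std, p_mean, p_std):
    """Exact KL(q || p) between two univariate Gaussians."""
    return (math.log(p_std / q_std)
            + (q_std ** 2 + (q_mean - p_mean) ** 2) / (2.0 * p_std ** 2)
            - 0.5)

kl_mc = kl_monte_carlo(0.5, 0.8, 0.0, 1.0)
kl_exact = kl_closed_form(0.5, 0.8, 0.0, 1.0)
```

Note that the divergence is not symmetric: swapping the roles of $q$ and $p$ in `kl_closed_form` generally gives a different value.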
Since we cannot minimize the KL divergence directly (because it requires knowing the intractable $p(\zeta|x)$), we derive a surrogate objective called the Evidence Lower Bound, or ELBO, defined as $\text{ELBO}(\phi) = \mathbb{E}_{q_{\phi}(\zeta|x)}[\log p(x, \zeta) - \log q_{\phi}(\zeta|x)]$. By rearranging the terms of the KL divergence and the log-evidence, we find that $\log p(x) = \text{ELBO}(\phi) + \text{KL}(q_{\phi}(\zeta|x) || p(\zeta|x))$. Because the KL divergence is always non-negative, the ELBO serves as a lower bound on the log-evidence (the marginal log-likelihood of the data): $\log p(x) \geq \text{ELBO}(\phi)$. Since $\log p(x)$ does not depend on $\phi$, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence between our approximation and the true posterior.
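The rearrangement can be written out in three steps. This is a sketch using the joint-density form of the ELBO, $\text{ELBO}(\phi) = \mathbb{E}_{q_{\phi}(\zeta|x)}[\log p(x, \zeta) - \log q_{\phi}(\zeta|x)]$. The second line substitutes $p(\zeta|x) = p(x, \zeta) / p(x)$ and uses the fact that $\log p(x)$ is constant with respect to $\zeta$:

$$
\begin{aligned}
\text{KL}(q_{\phi}(\zeta|x) || p(\zeta|x)) &= \mathbb{E}_{q_{\phi}(\zeta|x)}[\log q_{\phi}(\zeta|x) - \log p(\zeta|x)] \\
&= \mathbb{E}_{q_{\phi}(\zeta|x)}[\log q_{\phi}(\zeta|x) - \log p(x, \zeta)] + \log p(x) \\
&= \log p(x) - \text{ELBO}(\phi)
\end{aligned}
$$

Moving the ELBO to the left-hand side gives $\log p(x) = \text{ELBO}(\phi) + \text{KL}(q_{\phi}(\zeta|x) || p(\zeta|x))$.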
The ELBO can be decomposed into two intuitive components: $\text{ELBO}(\phi) = \mathbb{E}_{q_{\phi}(\zeta|x)}[\log p_{\theta}(x|\zeta)] - \text{KL}(q_{\phi}(\zeta|x) || p(\zeta))$, where $\theta$ denotes the parameters of the likelihood. The first term is the 'reconstruction term' (its negative is the familiar reconstruction loss), which rewards latent samples $\zeta$ under which the observed data $x$ has high likelihood. The second term is the 'regularization term,' which penalizes the approximate posterior for drifting away from the prior $p(\zeta)$ (usually a standard Normal distribution $\mathcal{N}(0, I)$). This balance discourages the model from simply memorizing the data.
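The two terms can be sketched in code under some simplifying assumptions that are not fixed by the text: a diagonal Gaussian $q_{\phi}(\zeta|x)$ parameterized by `mu` and `log_var`, a standard Normal prior, and a Gaussian likelihood with unit variance. Under these assumptions the KL term has the closed form $\frac{1}{2}\sum_j(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$, while the reconstruction term is estimated by sampling. The `decode` argument is a hypothetical placeholder for a decoder network.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def gaussian_log_likelihood(x, x_mean, std=1.0):
    """log p(x|zeta) for a Gaussian likelihood with fixed isotropic std (an assumption)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((xi - mi) / std) ** 2
        for xi, mi in zip(x, x_mean)
    )

def elbo_estimate(x, mu, log_var, decode, n_samples=16, seed=0):
    """Monte Carlo reconstruction term minus the analytic KL regularizer."""
    rng = random.Random(seed)
    recon = 0.0
    for _ in range(n_samples):
        # Draw zeta ~ q(zeta|x), one coordinate at a time.
        zeta = [rng.gauss(m, math.exp(0.5 * lv)) for m, lv in zip(mu, log_var)]
        recon += gaussian_log_likelihood(x, decode(zeta))
    recon /= n_samples
    return recon - kl_to_standard_normal(mu, log_var)
```

With an identity decoder (`decode = lambda z: z`) and a one-dimensional $x$, this reduces to a conjugate Gaussian model, so the estimate can be compared against the exact $\log p(x)$. The two should agree, up to sampling noise, when $q$ equals the true posterior.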
The Variational Autoencoder (VAE) implements this theory using neural networks. The 'Encoder' network, with weights $\phi$, maps $x$ to the parameters of $q_{\phi}(\zeta|x)$ (mean $\mu$ and standard deviation $\sigma$), while the 'Decoder' network, with weights $\theta$, represents the likelihood $p_{\theta}(x|\zeta)$. To allow gradients to flow back through the stochastic sampling process, we use the 'Reparameterization Trick'. Instead of sampling $\zeta \sim \mathcal{N}(\mu, \sigma^2)$ directly, we sample $\epsilon \sim \mathcal{N}(0, I)$ and compute $\zeta = \mu + \sigma \odot \epsilon$. This shifts the randomness to an external input, so that for a fixed $\epsilon$ the mapping from $\phi$ to $\zeta$ is deterministic and differentiable.
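A minimal sketch of the reparameterized sample is shown below, using plain Python lists rather than any particular deep learning framework. Predicting `log_var` (the log of $\sigma^2$) instead of $\sigma$ is a common convention but an assumption here, and the numeric values are made up. The finite difference at the end illustrates the key point: once $\epsilon$ is held fixed, $\zeta$ is an ordinary function of $\mu$ with derivative 1.

```python
import math
import random

def sample_noise(dim, rng):
    """Draw eps ~ N(0, I); this is the only source of randomness."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def reparameterize(mu, log_var, eps):
    """zeta = mu + sigma * eps (elementwise), with sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]

rng = random.Random(0)
mu = [0.5, -1.0]                   # hypothetical encoder output: means
log_var = [0.0, math.log(0.25)]    # hypothetical encoder output: log-variances
eps = sample_noise(len(mu), rng)
zeta = reparameterize(mu, log_var, eps)

# Hold eps fixed and nudge mu[0]: zeta[0] moves by the same amount,
# so d zeta_0 / d mu_0 = 1, and gradients can pass through the sample.
h = 1e-6
zeta_shifted = reparameterize([mu[0] + h, mu[1]], log_var, eps)
dzeta0_dmu0 = (zeta_shifted[0] - zeta[0]) / h
```

By the same argument, the derivative of $\zeta$ with respect to $\sigma$ is $\epsilon$. An automatic differentiation library would compute both without any special handling of the sampling step.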
In summary, the VAE transforms a probabilistic inference problem into a deep learning optimization problem. By maximizing the ELBO with respect to both $\phi$ and $\theta$, we simultaneously learn a compressed, structured latent space and a generative model of the data. After training, the encoder is no longer needed for generation: we can sample $\zeta$ from the prior $p(\zeta)$ and pass it through the decoder to synthesize new data points that share the statistical characteristics of the original training set.
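The generation step can be sketched as follows. The decoder here is a hypothetical stand-in, a fixed affine map followed by a sigmoid with made-up weights; in a real VAE it would be the trained decoder network. The point is only the data flow: draw $\zeta$ from the prior $\mathcal{N}(0, I)$, then decode, with no encoder involved.

```python
import math
import random

# Made-up weights standing in for a trained decoder (2-D latent -> 3-D output).
W = [[0.8, -0.3], [0.1, 0.9], [-0.5, 0.4]]
b = [0.0, 0.1, -0.2]

def decode(zeta):
    """Toy decoder: affine map followed by a sigmoid, giving values in (0, 1)."""
    out = []
    for row, bias in zip(W, b):
        activation = sum(w * z for w, z in zip(row, zeta)) + bias
        out.append(1.0 / (1.0 + math.exp(-activation)))
    return out

def generate(n, latent_dim=2, seed=0):
    """Sample zeta ~ N(0, I) from the prior and decode each sample."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        zeta = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        samples.append(decode(zeta))
    return samples

new_points = generate(5)
```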