At its core, Bayesian inference is about updating our beliefs about a set of hidden parameters $\theta$ given some observed data $x$. According to Bayes' Theorem, the posterior distribution is given by $p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)}$. However, in complex models, the denominator $p(x) = \int p(x|\theta)p(\theta) d\theta$, known as the evidence or marginal likelihood, is often computationally intractable because it requires integrating over a high-dimensional parameter space. This is the fundamental challenge that Variational Autoencoders (VAEs) seek to solve by transforming an integration problem into an optimization problem.
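To make the role of the evidence concrete, here is a minimal sketch in one dimension, where the integral is still easy. It assumes a toy coin-flip model that is not part of the text above: a uniform prior on the bias $\theta \in [0, 1]$ and a Bernoulli likelihood. The function name `evidence_on_grid` and the numbers are illustrative only.

```python
import math

def evidence_on_grid(heads, flips, grid_size=1000):
    """Approximate p(x) = integral of p(x|theta) p(theta) dtheta on [0, 1]
    with the midpoint rule (toy model: uniform prior, Bernoulli likelihood)."""
    width = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        theta = (i + 0.5) * width  # midpoint of the i-th grid cell
        likelihood = theta ** heads * (1.0 - theta) ** (flips - heads)
        prior = 1.0  # uniform prior density on [0, 1]
        total += likelihood * prior * width
    return total

heads, flips = 7, 10
approx = evidence_on_grid(heads, flips)

# For this toy model the integral is a Beta function with a closed form.
exact = (math.factorial(heads) * math.factorial(flips - heads)
         / math.factorial(flips + 1))

print(approx, exact)
```

The same grid strategy needs `grid_size ** d` likelihood evaluations for a $d$-dimensional parameter, which is why direct integration stops being practical as the dimension grows.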
The Variational Autoencoder introduces a latent variable $z$ to represent the underlying structure of the data. We assume that the data is generated by first sampling $z$ from a prior $p(z)$ and then sampling $x$ from a conditional distribution $p_{\theta}(x|z)$, parameterized by a neural network (the decoder). From here on, $\theta$ denotes the decoder's parameters, and $z$ takes over the role of the hidden quantity to be inferred. To perform inference, we want to find the posterior $p(z|x)$. Since this is intractable, we introduce a variational distribution $q_{\phi}(z|x)$, parameterized by a neural network (the encoder), to approximate the true posterior. Our goal is to make $q_{\phi}(z|x)$ as close as possible to $p(z|x)$, typically measured by the Kullback-Leibler (KL) divergence: $\text{KL}(q_{\phi}(z|x) || p(z|x))$.
To optimize the variational distribution, we need an objective function. However, $\text{KL}(q_{\phi}(z|x) || p(z|x))$ cannot be computed directly because it depends on the unknown $p(z|x)$. By applying the definitions of KL divergence and Bayes' rule, we can derive the following identity: $\log p(x) = \text{KL}(q_{\phi}(z|x) || p(z|x)) + \mathcal{L}(\phi, \theta)$, where $\mathcal{L}(\phi, \theta)$ is the Evidence Lower Bound (ELBO). Since the KL divergence is always non-negative, $\mathcal{L}(\phi, \theta)$ serves as a lower bound on the log-likelihood of the data: $\log p(x) \geq \mathcal{L}(\phi, \theta)$.
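The identity can be checked step by step. The following is a sketch of the derivation, writing $p(x, z)$ for the joint density of data and latent variable (a symbol not used elsewhere in the text) and defining the ELBO as $\mathcal{L}(\phi, \theta) = E_{q_{\phi}(z|x)}[\log p(x, z) - \log q_{\phi}(z|x)]$:

$$
\begin{aligned}
\text{KL}(q_{\phi}(z|x) || p(z|x)) &= E_{q_{\phi}(z|x)}[\log q_{\phi}(z|x) - \log p(z|x)] \\
&= E_{q_{\phi}(z|x)}[\log q_{\phi}(z|x) - \log p(x, z) + \log p(x)] \\
&= \log p(x) - E_{q_{\phi}(z|x)}[\log p(x, z) - \log q_{\phi}(z|x)] \\
&= \log p(x) - \mathcal{L}(\phi, \theta)
\end{aligned}
$$

The second line uses Bayes' rule in the form $p(z|x) = p(x, z) / p(x)$, and the third uses the fact that $\log p(x)$ does not depend on $z$, so it passes through the expectation unchanged. Rearranging the last line gives the identity. Factoring the joint as $p(x, z) = p_{\theta}(x|z) p(z)$ inside $\mathcal{L}(\phi, \theta)$ then yields $E_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - \text{KL}(q_{\phi}(z|x) || p(z))$.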
The ELBO can be formally decomposed into two intuitive terms: $\mathcal{L}(\phi, \theta) = E_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - \text{KL}(q_{\phi}(z|x) || p(z))$. The first term, $E_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)]$, is the expected reconstruction log-likelihood, which encourages the decoder to reconstruct the input accurately. The second term, $\text{KL}(q_{\phi}(z|x) || p(z))$, acts as a regularizer, forcing the approximate posterior to remain close to the prior (usually a standard Gaussian $\mathcal{N}(0, I)$). This prevents the model from simply assigning a unique point in latent space for every single data point, effectively smoothing the latent manifold.
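The two terms can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation: the encoder is assumed to output a mean `mu` and a log-variance `log_var` for a diagonal Gaussian, the decoder is assumed to output Bernoulli probabilities for binary data, and all function names are hypothetical. For a diagonal Gaussian against $\mathcal{N}(0, I)$, the KL term has the closed form $\frac{1}{2} \sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def bernoulli_log_likelihood(x, probs, eps=1e-7):
    """log p(x|z) for binary x, given the decoder's output probabilities."""
    total = 0.0
    for xi, p in zip(x, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += xi * math.log(p) + (1 - xi) * math.log(1.0 - p)
    return total

def elbo_single_sample(x, decoder_probs, mu, log_var):
    """One-sample ELBO estimate: reconstruction term minus KL regularizer.
    decoder_probs is assumed to come from decoding one z ~ q(z|x)."""
    reconstruction = bernoulli_log_likelihood(x, decoder_probs)
    return reconstruction - kl_to_standard_normal(mu, log_var)

# The KL term vanishes when the approximate posterior equals the prior
# and grows as the posterior mean moves away from zero.
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
kl_shifted = kl_to_standard_normal([1.0, -1.0], [0.0, 0.0])

elbo = elbo_single_sample(x=[1, 0, 1], decoder_probs=[0.9, 0.2, 0.8],
                          mu=[0.0, 0.0], log_var=[0.0, 0.0])
print(kl_at_prior, kl_shifted, elbo)
```

Parameterizing the variance through `log_var` is a common design choice because it keeps the variance positive without a constraint on the encoder output.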
A major technical hurdle arises when taking the gradient of the ELBO with respect to the encoder parameters $\phi$, as the expectation involves a random sample $z \sim q_{\phi}(z|x)$. We cannot propagate gradients through a stochastic sampling process. To solve this, we use the Reparameterization Trick. Instead of sampling $z$ directly, we express $z$ as a deterministic transformation of a noise variable $\epsilon \sim \mathcal{N}(0, I)$. For a Gaussian posterior with mean $\mu$ and variance $\sigma^2$, we write $z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon$. This shifts the stochasticity to $\epsilon$, making the mapping from $\phi$ to $z$ differentiable.
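A small numerical sketch can show why this helps. It assumes a scalar latent variable and a stand-in objective $f(z) = z^2$ (chosen only because its expectation, $\mu^2 + \sigma^2$, has known gradients $2\mu$ and $2\sigma$); in a VAE, $f$ would be the reconstruction term and the derivatives would come from automatic differentiation. The function names are hypothetical.

```python
import random

def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, elementwise; deterministic in (mu, sigma) given eps."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

def pathwise_gradients(mu, sigma, num_samples=100_000, seed=0):
    """Monte Carlo estimate of d/dmu and d/dsigma of E[z^2],
    where z = mu + sigma * eps and eps ~ N(0, 1)."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)  # all randomness lives in eps
        (z,) = reparameterize([mu], [sigma], [eps])
        df_dz = 2.0 * z            # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0     # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps  # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

g_mu, g_sigma = pathwise_gradients(mu=1.5, sigma=0.5)
print(g_mu, g_sigma)  # should be close to 2 * mu and 2 * sigma
```

Because $z$ is a differentiable function of $\mu$ and $\sigma$ once $\epsilon$ is fixed, the chain rule applies sample by sample, and averaging over samples gives an unbiased gradient estimate.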
In summary, the VAE framework blends Bayesian inference with deep learning by using the ELBO as a surrogate objective. By maximizing the ELBO, we simultaneously perform approximate posterior inference (via the encoder) and learn the generative distribution (via the decoder). The resulting model allows us to sample new data from the prior $p(z)$ and map it through the decoder $p_{\theta}(x|z)$, while ensuring the latent space is structured and continuous, providing a powerful tool for generative modeling and representation learning.