At its heart, Bayesian inference is about updating our beliefs about a hidden cause $z$ given some observed data $x$. In a generative context, we assume the data is generated by a latent process described by $p(x|z)$. To perform inference, we seek the posterior distribution $p(z|x)$, which tells us the probability of the latent variables given the observations. According to Bayes' Theorem, $p(z|x) = \frac{p(x|z)p(z)}{p(x)}$. However, for complex neural networks, computing the evidence (the marginal likelihood) $p(x) = \int p(x,z) \, dz$ is computationally intractable because it requires integrating over all possible configurations of the latent space.
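To make the evidence integral concrete, here is a minimal sketch on a one-dimensional toy model. The model itself ($z \sim \mathcal{N}(0, 1)$, $x|z \sim \mathcal{N}(z, 0.5^2)$) and the helper names `normal_pdf` and `evidence_by_quadrature` are assumptions chosen for illustration; they were picked because the evidence has a known closed form to compare against.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def evidence_by_quadrature(x, noise_std=0.5, lo=-10.0, hi=10.0, n=20001):
    """Approximate p(x) = integral of p(x|z) p(z) dz with the trapezoidal rule.

    Toy model (assumed for illustration): z ~ N(0, 1), x | z ~ N(z, noise_std^2).
    """
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * normal_pdf(x, z, noise_std) * normal_pdf(z, 0.0, 1.0)
    return total * h


x_obs = 1.3
numeric = evidence_by_quadrature(x_obs)
# For this linear-Gaussian toy model the marginal is x ~ N(0, 1 + noise_std^2).
exact = normal_pdf(x_obs, 0.0, math.sqrt(1.0 + 0.5 ** 2))
print(numeric, exact)
```

With one latent dimension, a grid of about 20,000 points is cheap. The same grid resolution in $d$ latent dimensions would need roughly $20{,}000^d$ evaluations of the decoder, which is why this brute-force route breaks down for realistic latent spaces.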
To bypass this intractability, Variational Autoencoders (VAEs) rely on a strategy called Variational Inference. Instead of calculating the true posterior $p(z|x)$, we approximate it using a simpler, parameterized distribution $q_{\phi}(z|x)$, typically a Gaussian. The goal is to make $q_{\phi}(z|x)$ as similar as possible to $p(z|x)$ by minimizing the Kullback-Leibler (KL) divergence: $D_{KL}(q_{\phi}(z|x) || p(z|x))$. This transforms a problem of integration into a problem of optimization, where we tune the parameters $\phi$ of an 'encoder' network so that $q_{\phi}(z|x)$ approximates the true posterior.
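The "inference as optimization" idea can be sketched in a setting where the target is known. The example below assumes a univariate Gaussian target with mean 1.0 and standard deviation 0.5 (arbitrary illustrative values) and fits a Gaussian $q$ to it by plain gradient descent on the closed-form KL divergence. The function names `kl_gaussians` and `fit_q` are hypothetical. In a real VAE the target posterior is not available in this form.

```python
import math


def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)) for univariate Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)


def fit_q(mu_p, sigma_p, lr=0.1, steps=500):
    """Minimize the KL over (mu_q, log sigma_q) with plain gradient descent."""
    mu_q, log_sigma_q = 0.0, 0.0  # start from a standard normal
    for _ in range(steps):
        sigma_q = math.exp(log_sigma_q)
        # Analytic gradients of the closed-form KL above.
        grad_mu = (mu_q - mu_p) / sigma_p ** 2
        grad_log_sigma = sigma_q ** 2 / sigma_p ** 2 - 1.0
        mu_q -= lr * grad_mu
        log_sigma_q -= lr * grad_log_sigma
    return mu_q, math.exp(log_sigma_q)


mu_fit, sigma_fit = fit_q(mu_p=1.0, sigma_p=0.5)
print(mu_fit, sigma_fit, kl_gaussians(mu_fit, sigma_fit, 1.0, 0.5))
```

Optimizing the log of the standard deviation rather than the standard deviation itself keeps $\sigma_q$ positive without any explicit constraint, a parameterization that encoder networks commonly use as well.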
The challenge arises because $D_{KL}(q_{\phi}(z|x) || p(z|x))$ depends on the unknown posterior $p(z|x)$, so it cannot be evaluated directly. To solve this, we derive the Evidence Lower Bound (ELBO). By applying Jensen's Inequality to the log-marginal likelihood, we can show that $\log p(x) \geq E_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) || p(z))$. The term on the right is the ELBO. The gap between $\log p(x)$ and the ELBO is exactly $D_{KL}(q_{\phi}(z|x) || p(z|x))$, and $\log p(x)$ does not depend on $\phi$, so maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence between our approximation and the true posterior. This provides a mathematically sound objective function that can be optimized with gradient-based methods (in practice, gradient descent on the negative ELBO).
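The relationship between the evidence, the ELBO, and the posterior KL can be written out explicitly. The short derivation below uses only Bayes' Theorem and the fact that $\log p(x)$ does not depend on $z$; the shorthand $E_{q}$ for $E_{z \sim q_{\phi}(z|x)}$ is introduced here for brevity.

$$
\begin{aligned}
\log p(x) &= E_{q}\left[\log \frac{p_{\theta}(x|z)\,p(z)}{p(z|x)}\right] \\
&= E_{q}\left[\log p_{\theta}(x|z)\right] + E_{q}\left[\log \frac{p(z)}{q_{\phi}(z|x)}\right] + E_{q}\left[\log \frac{q_{\phi}(z|x)}{p(z|x)}\right] \\
&= \underbrace{E_{q}\left[\log p_{\theta}(x|z)\right] - D_{KL}(q_{\phi}(z|x) || p(z))}_{\text{ELBO}} + D_{KL}(q_{\phi}(z|x) || p(z|x))
\end{aligned}
$$

Since the final KL term is non-negative, dropping it gives the lower bound, and the bound is tight exactly when $q_{\phi}(z|x) = p(z|x)$.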
The ELBO can be decomposed into two intuitive components. The first term, $E_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)]$, is the 'reconstruction term.' It encourages the decoder $p_{\theta}(x|z)$ to reconstruct the input $x$ accurately from the sampled latent variable $z$. The second term, $-D_{KL}(q_{\phi}(z|x) || p(z))$, is the 'regularization term.' It forces the approximate posterior to remain close to a prior distribution $p(z)$ (usually a standard normal $\mathcal{N}(0, I)$), preventing the model from simply assigning a unique point in the latent space to every single data point, which would lead to overfitting.
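The two terms can be sketched numerically for a single data point. The code below assumes a diagonal Gaussian encoder, a standard normal prior, and a Gaussian decoder with fixed unit variance; these are common choices, but they are assumptions here rather than something the text prescribes. The vectors `x_hat`, `mu`, and `log_var` stand in for network outputs and are hard-coded illustrative values, and the function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def gaussian_log_likelihood(x, x_hat, std=1.0):
    """log N(x; x_hat, std^2 I): the reconstruction term for one sampled z."""
    d = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * sq_err / std ** 2 - d * math.log(std) - 0.5 * d * math.log(2.0 * math.pi)


def elbo_single_sample(x, x_hat, mu, log_var):
    """One-sample Monte Carlo estimate of the ELBO: reconstruction minus KL."""
    return gaussian_log_likelihood(x, x_hat) - kl_to_standard_normal(mu, log_var)


x = [0.5, -1.0]          # observed data point
x_hat = [0.4, -0.8]      # stand-in for the decoder mean given a sampled z
mu = [0.3, -0.2]         # stand-in for the encoder mean
log_var = [-0.1, 0.2]    # stand-in for the encoder log-variance
print(elbo_single_sample(x, x_hat, mu, log_var))
```

With a unit-variance Gaussian decoder, the reconstruction term reduces to a negative squared error plus a constant, which is why VAE training losses often appear as mean squared error plus a KL penalty.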
A critical technical hurdle in training VAEs is that the sampling process $z \sim q_{\phi}(z|x)$ is non-differentiable, meaning gradients cannot flow back to the encoder. To resolve this, we employ the 'Reparameterization Trick.' Instead of sampling $z$ directly, we express $z$ as a deterministic function of the parameters and a random noise variable $\epsilon \sim \mathcal{N}(0, I)$. Specifically, $z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon$. This shifts the stochasticity to an input node, allowing us to use standard backpropagation to optimize $\phi$ and $\theta$ simultaneously.
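A small scalar sketch shows why the reparameterized form admits gradients. It assumes a test function $f(z) = z^2$ (chosen only because $E[z^2] = \mu^2 + \sigma^2$ gives exact gradients to compare against) and uses hand-written chain-rule derivatives instead of an autodiff library. The function names are hypothetical.

```python
import random


def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps: deterministic in (mu, sigma) once eps is drawn."""
    return mu + sigma * eps


def pathwise_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo gradients of E[f(z)] w.r.t. mu and sigma for f(z) = z^2.

    Chain rule through z = mu + sigma * eps:
    df/dmu = f'(z) * 1 and df/dsigma = f'(z) * eps.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)      # noise is sampled independently of mu, sigma
        z = reparameterize(mu, sigma, eps)
        df_dz = 2.0 * z                # derivative of f(z) = z^2
        grad_mu += df_dz
        grad_sigma += df_dz * eps
    return grad_mu / n_samples, grad_sigma / n_samples


g_mu, g_sigma = pathwise_gradients(mu=1.5, sigma=0.8)
# Exact gradients of mu^2 + sigma^2 are 2*mu = 3.0 and 2*sigma = 1.6;
# the estimates should land close to these, up to Monte Carlo noise.
print(g_mu, g_sigma)
```

In an actual VAE the same structure holds with $f$ replaced by the decoder log-likelihood and the derivatives computed by automatic differentiation.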
In summary, the VAE framework leverages Bayesian principles to create a structured latent space. By maximizing the ELBO, we balance the trade-off between faithful reconstruction and a well-behaved latent manifold. This allows the VAE to not only compress data but to act as a generative model: by sampling $z \sim p(z)$ and passing it through the decoder, we can generate novel data points that share the statistical characteristics of the training set. The transition from $\log p(x)$ to the ELBO is the pivotal step that makes deep generative Bayesian inference computationally feasible.