At its core, Bayesian inference is about updating our beliefs about a set of latent variables $z$ after observing some data $x$. We are interested in the posterior distribution $p(z|x)$, which tells us how the latent variables should be distributed given our data. According to Bayes' Theorem, this is expressed as: $$p(z|x) = \frac{p(x|z)p(z)}{p(x)}$$ where $p(x|z)$ is the likelihood, $p(z)$ is the prior, and $p(x) = \int p(x,z) dz$ is the evidence. In high-dimensional spaces, calculating the evidence $p(x)$ is computationally intractable because the integral must be evaluated over all possible configurations of $z$.
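To make the cost of the evidence integral concrete, here is a minimal sketch under an assumed toy model that is not part of the text above: a single latent variable with $z \sim \mathcal{N}(0, 1)$ and $x|z \sim \mathcal{N}(z, 1)$. For this model the evidence has the closed form $p(x) = \mathcal{N}(x; 0, 2)$, so a grid approximation can be checked against it. The function names and grid settings are illustrative choices.

```python
import math


def normal_pdf(value, mean, std):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((value - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def evidence_on_grid(x, lo=-10.0, hi=10.0, n_points=2001):
    """Approximate p(x) = integral of p(x|z) p(z) dz with the trapezoidal rule.

    Assumed toy model: z ~ N(0, 1), x|z ~ N(z, 1).
    """
    step = (hi - lo) / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        z = lo + i * step
        weight = 0.5 if i in (0, n_points - 1) else 1.0  # trapezoid end points
        total += weight * normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0)
    return total * step


x = 1.0
numeric = evidence_on_grid(x)
exact = normal_pdf(x, 0.0, math.sqrt(2.0))  # closed form for this toy model only
print("grid estimate:", numeric, "closed form:", exact)

# The same grid resolution in d latent dimensions needs n_points ** d evaluations.
for d in (1, 2, 10):
    print("latent dims:", d, "grid evaluations:", 2001 ** d)
```

In one dimension the grid is cheap, but the number of evaluations grows exponentially with the dimension of $z$, which is why this brute-force approach does not scale.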
To bypass this intractability, we use Variational Inference. Instead of computing the true posterior $p(z|x)$, we introduce a simpler, parameterized distribution $q_{\phi}(z|x)$, known as the variational distribution (or encoder), to approximate it. Our goal is to make $q_{\phi}(z|x)$ as close to $p(z|x)$ as possible. We measure the 'closeness' of these two distributions using the Kullback-Leibler (KL) Divergence: $$D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) = \int q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p(z|x)} dz$$ The KL divergence is non-negative, is not symmetric in its two arguments, and equals zero only when the two distributions coincide (almost everywhere). Minimizing it with respect to $\phi$ therefore drives our approximation toward the true posterior.
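As an illustration of the KL divergence, the sketch below compares the well-known closed form for two univariate Gaussians with a Monte Carlo estimate of the defining integral, using samples drawn from $q$. The Gaussian choice and the function names are assumptions made for the example; the text above does not restrict $q$ or $p$ to this family.

```python
import math
import random


def log_normal_pdf(value, mean, std):
    """Log-density of a univariate normal distribution."""
    return -math.log(std * math.sqrt(2.0 * math.pi)) - 0.5 * ((value - mean) / std) ** 2


def kl_gaussians(mu_q, std_q, mu_p, std_p):
    """Closed-form KL( N(mu_q, std_q^2) || N(mu_p, std_p^2) )."""
    return (
        math.log(std_p / std_q)
        + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * std_p ** 2)
        - 0.5
    )


def kl_monte_carlo(mu_q, std_q, mu_p, std_p, n_samples=50_000, seed=0):
    """Estimate E_q[log q(z) - log p(z)] by averaging over samples z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu_q, std_q)
        total += log_normal_pdf(z, mu_q, std_q) - log_normal_pdf(z, mu_p, std_p)
    return total / n_samples


print("closed form KL(q || p):", kl_gaussians(0.0, 1.0, 1.0, 2.0))
print("Monte Carlo KL(q || p):", kl_monte_carlo(0.0, 1.0, 1.0, 2.0))
print("closed form KL(p || q):", kl_gaussians(1.0, 2.0, 0.0, 1.0))  # differs: KL is asymmetric
```

Swapping the arguments gives a different value, which is why the direction $D_{KL}(q \parallel p)$ used in variational inference matters.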
Since we cannot minimize $D_{KL}$ directly (as it requires knowing the unknown $p(z|x)$), we derive a surrogate objective called the Evidence Lower Bound (ELBO). By substituting $p(z|x) = p(x,z)/p(x)$ into the KL divergence and noting that $\log p(x)$ does not depend on $z$, so that $\mathbb{E}_{q_{\phi}(z|x)}[\log p(x)] = \log p(x)$, we can show that: $$\log p(x) = D_{KL}(q_{\phi}(z|x) \parallel p(z|x)) + \mathcal{L}(\theta, \phi; x)$$ where $\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z) + \log p(z) - \log q_{\phi}(z|x)]$ is the ELBO and $\theta$ denotes the parameters of the likelihood (the decoder). Because the KL divergence is always non-negative, $\mathcal{L}(\theta, \phi; x)$ serves as a lower bound on the log-likelihood of the data. For fixed $\theta$, $\log p(x)$ does not depend on $\phi$, so maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence, pushing our approximate posterior toward the true posterior.
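The decomposition of $\log p(x)$ into a KL term and the ELBO can be checked numerically on a model where every quantity is available in closed form. The sketch below assumes a toy conjugate model that is not part of the text: $z \sim \mathcal{N}(0, 1)$, $x|z \sim \mathcal{N}(z, 1)$, and a Gaussian approximation $q(z|x) = \mathcal{N}(m, s^2)$. Under these assumptions the true posterior is $\mathcal{N}(x/2, 1/2)$ and the evidence is $\mathcal{N}(x; 0, 2)$.

```python
import math


def log_normal_pdf(value, mean, var):
    """Log-density of a univariate normal with the given variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (value - mean) ** 2 / var


def elbo(x, m, s):
    """ELBO for q(z|x) = N(m, s^2) under z ~ N(0, 1), x|z ~ N(z, 1)."""
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    expected_log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * (m ** 2 + s ** 2)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s ** 2)  # equals -E_q[log q]
    return expected_log_lik + expected_log_prior + entropy


def kl_to_posterior(x, m, s):
    """KL( N(m, s^2) || p(z|x) ), where p(z|x) = N(x/2, 1/2) for this toy model."""
    post_mean, post_var = x / 2.0, 0.5
    return (
        0.5 * math.log(post_var / s ** 2)
        + (s ** 2 + (m - post_mean) ** 2) / (2.0 * post_var)
        - 0.5
    )


def log_evidence(x):
    """log p(x) with p(x) = N(x; 0, 2) for this toy model."""
    return log_normal_pdf(x, 0.0, 2.0)


x, m, s = 1.0, 0.3, 0.8
print("log p(x):  ", log_evidence(x))
print("KL + ELBO: ", kl_to_posterior(x, m, s) + elbo(x, m, s))
print("ELBO alone:", elbo(x, m, s))  # never exceeds log p(x)
```

When $q$ is set to the exact posterior ($m = x/2$, $s^2 = 1/2$), the KL term vanishes and the ELBO equals $\log p(x)$.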
The ELBO can be decomposed into two meaningful components: the reconstruction term and the regularization term. Specifically: $$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) \parallel p(z))$$ The first term, the expected log-likelihood, encourages the decoder $p_{\theta}(x|z)$ to reconstruct the original input $x$ accurately from the sampled latent variable $z$. The second term, the KL divergence between the approximate posterior and the prior $p(z)$ (usually a standard normal $\mathcal{N}(0, I)$), acts as a regularizer: it discourages the model from assigning an isolated, near-deterministic point in latent space to every single input, which encourages a smooth latent space that the prior covers well.
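The two terms can be sketched in a few lines. The example below assumes a diagonal Gaussian $q_{\phi}(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$, for which the KL divergence to $\mathcal{N}(0, I)$ has the closed form $\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$, and a unit-variance Gaussian likelihood. The reconstruction term is estimated by Monte Carlo. `toy_decode` is a hypothetical stand-in for a trained decoder network, and the numbers are made up for illustration.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def gaussian_log_likelihood(x, x_mean):
    """log N(x; x_mean, I): assumes a unit-variance Gaussian decoder."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - 0.5 * (xi - mi) ** 2
        for xi, mi in zip(x, x_mean)
    )


def elbo_estimate(x, mu, log_var, decode, n_samples=64, seed=0):
    """Return (elbo, reconstruction_term, kl_term) for one data point x."""
    rng = random.Random(seed)
    recon = 0.0
    for _ in range(n_samples):
        # Draw z ~ q(z|x), then score x under the decoder's output.
        z = [rng.gauss(m, math.exp(0.5 * lv)) for m, lv in zip(mu, log_var)]
        recon += gaussian_log_likelihood(x, decode(z))
    recon /= n_samples
    kl = kl_to_standard_normal(mu, log_var)
    return recon - kl, recon, kl


def toy_decode(z):
    """Hypothetical decoder: fixed linear map from 2 latent to 3 observed dims."""
    return [z[0], z[1], z[0] + z[1]]


value, recon, kl = elbo_estimate(
    x=[0.5, -0.2, 0.3], mu=[0.4, -0.1], log_var=[-1.0, -1.0], decode=toy_decode
)
print("reconstruction term:", recon)
print("KL regularizer:     ", kl)
print("ELBO estimate:      ", value)
```

Training frameworks usually minimize the negative of this quantity, averaged over a mini-batch.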
A significant challenge arises during the optimization of the ELBO: drawing a sample $z \sim q_{\phi}(z|x)$ is not a differentiable operation with respect to $\phi$, so gradients cannot be backpropagated through the sampling step. To solve this, the Variational Autoencoder (VAE) employs the 'reparameterization trick'. Instead of sampling directly from the distribution, we express $z$ as a deterministic function of the parameters and an external noise variable $\epsilon$: $$z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon, \text{ where } \epsilon \sim \mathcal{N}(0, I)$$ This shifts the stochasticity to $\epsilon$, which does not depend on $\phi$, allowing the gradients of the loss function to be backpropagated through $\mu$ and $\sigma$ using standard automatic differentiation.
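The following sketch illustrates the reparameterization trick on a deliberately simple scalar objective, $\mathbb{E}[f(z)]$ with $f(z) = z^2$, chosen (as an assumption for this example) because its exact gradients are known: $\partial/\partial\mu = 2\mu$ and $\partial/\partial\sigma = 2\sigma$. The chain rule is applied by hand here; in an actual VAE an automatic differentiation framework would do this step.

```python
import random


def reparameterized_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Estimate d/dmu and d/dsigma of E[f(z)], f(z) = z**2, via z = mu + sigma * eps.

    Because z is a deterministic function of (mu, sigma) for a fixed eps,
    the chain rule gives df/dmu = f'(z) * 1 and df/dsigma = f'(z) * eps.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise that does not depend on the parameters
        z = mu + sigma * eps        # reparameterized sample
        df_dz = 2.0 * z             # derivative of f(z) = z**2
        grad_mu += df_dz            # dz/dmu = 1
        grad_sigma += df_dz * eps   # dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


mu, sigma = 1.5, 0.5
est_mu, est_sigma = reparameterized_gradients(mu, sigma)
print("estimated d/dmu:   ", est_mu, " exact:", 2.0 * mu)
print("estimated d/dsigma:", est_sigma, " exact:", 2.0 * sigma)
```

The estimates are unbiased averages of per-sample derivatives, so they approach the exact values as the number of samples grows.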
When we train a VAE, we maximize the ELBO summed over the entire dataset, jointly with respect to the encoder parameters $\phi$ and the decoder parameters $\theta$. Once trained, we can discard the encoder $q_{\phi}(z|x)$ and use the decoder $p_{\theta}(x|z)$ as a generative model. By sampling $z$ directly from the prior, $z \sim p(z) = \mathcal{N}(0, I)$, and passing it through the decoder, we generate new data samples that share the structural characteristics of the training set. This architecture turns the theoretical framework of Bayesian inference into a scalable, deep learning-based generative system.
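The generation step can be sketched as follows. `toy_decoder` is a hypothetical stand-in for a trained decoder network (a single linear layer followed by a sigmoid), and the weights are made-up numbers used only for illustration. A real decoder would be a neural network whose parameters $\theta$ come from training.

```python
import math
import random


def sample_prior(latent_dim, rng):
    """Draw z ~ N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]


def toy_decoder(z, weights, bias):
    """Hypothetical decoder: linear layer plus sigmoid, giving the mean of p(x|z)."""
    outputs = []
    for row, b in zip(weights, bias):
        activation = sum(w * zi for w, zi in zip(row, z)) + b
        outputs.append(1.0 / (1.0 + math.exp(-activation)))
    return outputs


def generate(n_samples, weights, bias, seed=0):
    """Sample latents from the prior and decode them; no encoder is involved."""
    rng = random.Random(seed)
    latent_dim = len(weights[0])
    return [
        toy_decoder(sample_prior(latent_dim, rng), weights, bias)
        for _ in range(n_samples)
    ]


# Illustrative parameters: 2 latent dimensions mapped to 3 observed dimensions.
weights = [[1.0, -0.5], [0.3, 0.8], [-1.2, 0.4]]
bias = [0.0, 0.1, -0.2]

for sample in generate(3, weights, bias):
    print([round(value, 3) for value in sample])
```

Only the prior and the decoder appear in this loop, which matches the observation that the encoder is no longer needed once training is done.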