At its heart, Bayesian inference is about updating our beliefs about a latent parameter $\mathbf{z}$ given some observed data $\mathbf{x}$. We seek the posterior distribution $p(\mathbf{z}|\mathbf{x})$, which represents the probability of the latent cause given the observation. According to Bayes' Theorem, this is defined as $p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}$. While the numerator is often tractable, the denominator $p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) d\mathbf{z}$ (the evidence) is an integral over all possible latent states, which is computationally intractable for high-dimensional data like images.
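To make the role of the evidence concrete, here is a minimal sketch on a one-dimensional toy model where the integral can still be computed by brute force. The model is an assumption chosen for illustration and is not taken from the text: the prior is $p(z) = \mathcal{N}(0, 1)$ and the likelihood is $p(x|z) = \mathcal{N}(z, 1)$, so the evidence has the closed form $p(x) = \mathcal{N}(0, 2)$ to compare against. The function names are hypothetical.

```python
import math

def gaussian_pdf(v, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at v."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def evidence_on_grid(x, num_points=2001, half_width=10.0):
    """Approximate p(x) = integral of p(x|z) p(z) dz with a Riemann sum over z."""
    dz = 2.0 * half_width / (num_points - 1)
    total = 0.0
    for i in range(num_points):
        z = -half_width + i * dz
        # Joint density p(x, z) = p(x|z) p(z) for the assumed toy model.
        total += gaussian_pdf(x, mean=z, var=1.0) * gaussian_pdf(z, mean=0.0, var=1.0)
    return total * dz

x_obs = 1.5
approx_evidence = evidence_on_grid(x_obs)
# Closed-form evidence; available only because the toy model is conjugate.
exact_evidence = gaussian_pdf(x_obs, mean=0.0, var=2.0)

# Bayes' theorem at a single latent value, using the grid-based evidence.
z_query = 1.0
posterior_density = (
    gaussian_pdf(x_obs, mean=z_query, var=1.0) * gaussian_pdf(z_query, 0.0, 1.0)
) / approx_evidence

print(approx_evidence, exact_evidence, posterior_density)
```

The same grid with 2001 points per axis would need $2001^d$ evaluations for a $d$-dimensional latent, which is why this direct approach stops being practical almost immediately as the latent dimension grows.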
To bypass this intractability, Variational Autoencoders (VAEs) treat inference as an optimization problem rather than an integration problem. We introduce a 'variational' distribution $q_{\phi}(\mathbf{z}|\mathbf{x})$, parameterized by a neural network (the encoder), to approximate the true posterior $p(\mathbf{z}|\mathbf{x})$. The goal is to make $q_{\phi}$ as close as possible to $p$ by minimizing the Kullback-Leibler (KL) divergence: $D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}|\mathbf{x}))$. This shifts our focus from calculating a distribution to finding the optimal parameters $\phi$ that define a distribution.
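The idea of inference as optimization can be sketched in a setting where the target is known. The snippet below assumes, purely for illustration, that the true posterior is the Gaussian $\mathcal{N}(0.75, 0.5)$ and that $q_{\phi}$ is a univariate Gaussian with $\phi = (\mu, \log\sigma)$; it then runs plain gradient descent on the closed-form KL divergence. In a real VAE the target posterior is not available, so this direct objective cannot be evaluated; the names and hyperparameters here are hypothetical.

```python
import math

def kl_gaussians(m_q, s_q, m_p, s_p):
    """Closed-form KL( N(m_q, s_q^2) || N(m_p, s_p^2) ) for univariate Gaussians."""
    return math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2.0 * s_p ** 2) - 0.5

# Assumed target standing in for the true posterior (illustration only).
target_mean, target_std = 0.75, math.sqrt(0.5)

# Variational parameters phi = (mu, log_sigma), starting from a standard normal.
mu, log_sigma = 0.0, 0.0
initial_kl = kl_gaussians(mu, math.exp(log_sigma), target_mean, target_std)

learning_rate = 0.1
for _ in range(500):
    sigma = math.exp(log_sigma)
    # Analytic gradients of the KL divergence with respect to mu and log_sigma.
    grad_mu = (mu - target_mean) / target_std ** 2
    grad_log_sigma = sigma ** 2 / target_std ** 2 - 1.0
    mu -= learning_rate * grad_mu
    log_sigma -= learning_rate * grad_log_sigma

final_kl = kl_gaussians(mu, math.exp(log_sigma), target_mean, target_std)
print(initial_kl, final_kl, mu, math.exp(log_sigma))
```

Optimizing $\log\sigma$ rather than $\sigma$ keeps the standard deviation positive without a constraint, which is a common choice for encoder outputs.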
Since we cannot calculate the KL divergence directly (because it requires knowing the unknown $p(\mathbf{z}|\mathbf{x})$), we derive a surrogate objective called the Evidence Lower Bound, or ELBO. By rearranging the log-marginal likelihood, we find: $\log p(\mathbf{x}) = D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}|\mathbf{x})) + \mathcal{L}(\phi, \theta; \mathbf{x})$. Because the KL divergence is always non-negative, the term $\mathcal{L}(\phi, \theta; \mathbf{x})$ serves as a lower bound on the log-evidence: $\log p(\mathbf{x}) \geq \mathcal{L}(\phi, \theta; \mathbf{x})$. Since $\log p(\mathbf{x})$ does not depend on $\phi$, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the divergence between our approximation and the true posterior.
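The rearrangement can be written out in a few steps. The sketch below uses the fact that $\log p(\mathbf{x})$ is constant in $\mathbf{z}$, so taking its expectation under $q_{\phi}$ changes nothing, and it takes $\mathcal{L}$ to be the expected log-ratio of the joint to the approximate posterior:

$$
\begin{aligned}
\log p(\mathbf{x})
&= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x})\right] \\
&= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{z}|\mathbf{x})}\right] \\
&= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x}, \mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\right]
 + \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log \frac{q_{\phi}(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\mathbf{x})}\right] \\
&= \mathcal{L}(\phi, \theta; \mathbf{x}) + D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}|\mathbf{x}))
\end{aligned}
$$

The third line multiplies and divides by $q_{\phi}(\mathbf{z}|\mathbf{x})$ inside the logarithm. The first expectation involves only the joint $p(\mathbf{x}, \mathbf{z})$ and $q_{\phi}$, both of which can be evaluated, while the unknown posterior is confined to the KL term.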
The ELBO can be decomposed into two interpretable terms: $\mathcal{L}(\phi, \theta; \mathbf{x}) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) || p(\mathbf{z}))$. The first term is the 'reconstruction likelihood,' which encourages the decoder $p_{\theta}$ to reconstruct the input $\mathbf{x}$ from the sampled $\mathbf{z}$. The second term is a regularization penalty that forces the approximate posterior to remain close to a prior distribution $p(\mathbf{z})$, typically a standard Gaussian $\mathcal{N}(0, I)$. This prevents the model from simply assigning a unique point in space for every single image, thereby ensuring a smooth latent space.
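The two terms can be evaluated exactly in a small example. The sketch below assumes a one-dimensional toy model that is not part of the text: prior $\mathcal{N}(0, 1)$, likelihood $p(x|z) = \mathcal{N}(z, 1)$ in place of a neural decoder, and $q = \mathcal{N}(m, s^2)$ in place of an encoder output. Under these assumptions both terms have closed forms, and the evidence $p(x) = \mathcal{N}(0, 2)$ is known, so the bound can be checked directly. In an actual VAE the reconstruction term is usually estimated from samples instead.

```python
import math

def expected_log_likelihood(x, m, s2):
    """Reconstruction term E_q[log p(x|z)] for p(x|z) = N(z, 1) and q = N(m, s2)."""
    # E_q[(x - z)^2] = (x - m)^2 + s2 when z ~ N(m, s2).
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)

def kl_to_standard_normal(m, s2):
    """Regularization term KL( N(m, s2) || N(0, 1) ) in closed form."""
    return 0.5 * (m ** 2 + s2 - 1.0 - math.log(s2))

def elbo(x, m, s2):
    """ELBO = reconstruction term minus KL penalty."""
    return expected_log_likelihood(x, m, s2) - kl_to_standard_normal(m, s2)

def log_evidence(x):
    """log p(x) for the assumed toy model, where p(x) = N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

x_obs = 1.5
# q equal to the prior: zero KL penalty, but a poor reconstruction term.
loose = elbo(x_obs, m=0.0, s2=1.0)
# q equal to the exact posterior of the toy model, N(x/2, 1/2): the bound is tight.
tight = elbo(x_obs, m=x_obs / 2.0, s2=0.5)

print(loose, tight, log_evidence(x_obs))
```

Setting $q$ to the prior removes the penalty entirely yet gives a lower ELBO, which illustrates the trade-off between the two terms: the best $q$ accepts some KL cost in exchange for a better expected reconstruction.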
A critical challenge arises when calculating the gradient of the ELBO: the expectation is taken over a distribution $q_{\phi}$ that depends on the parameters we are optimizing, and a sampling step is not differentiable with respect to those parameters. To solve this, we use the 'reparameterization trick.' Instead of sampling $\mathbf{z} \sim \mathcal{N}(\mu, \sigma^2)$ directly, we express $\mathbf{z}$ as a deterministic transformation of a noise variable $\epsilon \sim \mathcal{N}(0, I)$: $\mathbf{z} = \mu + \sigma \odot \epsilon$, where $\mu$ and $\sigma$ are the encoder outputs. This moves the stochasticity outside the gradient path, allowing us to use standard backpropagation to optimize both the encoder parameters $\phi$ and decoder parameters $\theta$.
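A minimal sketch of the resulting gradient estimator, written without an autodiff library so the chain rule is visible. The integrand $f(z) = z^2$ is a hypothetical stand-in for the term inside the ELBO expectation, chosen because $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$ has known gradients $2\mu$ and $2\sigma$ to compare against; the function name and sample count are likewise assumptions.

```python
import random

def reparameterized_gradients(mu, sigma, num_samples=200_000, seed=0):
    """Monte Carlo gradients of E[f(z)] for f(z) = z^2 with z = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # noise drawn independently of mu and sigma
        z = mu + sigma * eps        # deterministic transform of the noise
        df_dz = 2.0 * z             # derivative of the stand-in integrand f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

g_mu, g_sigma = reparameterized_gradients(mu=1.0, sigma=0.5)
print(g_mu, g_sigma)  # estimates of 2 * mu and 2 * sigma
```

Because the noise is sampled first and the parameters enter only through the deterministic transform, each sample contributes an ordinary derivative, which is the quantity an automatic differentiation framework would compute during backpropagation.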
In summary, the VAE is a marriage of Bayesian inference and deep learning. By maximizing the ELBO, we simultaneously learn a compressed representation of data and a generative model capable of sampling new instances. The balance between the reconstruction term and the KL term creates a latent manifold where similar data points are clustered together, enabling meaningful interpolation and controlled generation of high-dimensional data.