At its core, Bayesian inference is about updating our beliefs about a hidden variable $z$ given some observed data $x$. We are interested in the posterior distribution $p(z|x)$, which tells us the probability of the latent cause $z$ given the evidence $x$. According to Bayes' Theorem: $$p(z|x) = \frac{p(x|z)p(z)}{p(x)}$$. In complex generative models, however, the denominator $p(x) = \int p(x|z)p(z) dz$, known as the evidence, is often an intractable integral because it requires integrating over all possible configurations of the latent space, making direct computation of the posterior impossible.
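To make the intractability concrete, here is a minimal sketch on a toy model that is an assumption of this example, not part of the discussion above: a one-dimensional latent $z \sim \mathcal{N}(0, 1)$ with likelihood $x|z \sim \mathcal{N}(z, s^2)$. In this special case the evidence has a closed form, $\mathcal{N}(x; 0, 1 + s^2)$, so a grid-based integral can be checked against it. The helper names (`evidence_by_quadrature`, `grid_points`) are hypothetical.

```python
import math


def normal_pdf(v, mean, std):
    """Density of a univariate Gaussian N(mean, std^2) evaluated at v."""
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def evidence_by_quadrature(x, noise_std=0.5, lo=-10.0, hi=10.0, n=2001):
    """Approximate p(x) = integral of p(x|z) p(z) dz with the trapezoid rule."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * normal_pdf(x, z, noise_std) * normal_pdf(z, 0.0, 1.0)
    return total * h


def evidence_closed_form(x, noise_std=0.5):
    """Exact evidence for this toy model: x ~ N(0, 1 + noise_std^2)."""
    return normal_pdf(x, 0.0, math.sqrt(1.0 + noise_std ** 2))


def grid_points(n_per_dim, latent_dim):
    """Number of integrand evaluations a full grid needs in latent_dim dimensions."""
    return n_per_dim ** latent_dim


if __name__ == "__main__":
    print(evidence_by_quadrature(1.3), evidence_closed_form(1.3))
    print(grid_points(100, 2), grid_points(100, 32))
```

In one dimension the grid is cheap and accurate. The same grid with 100 points per axis needs $100^d$ evaluations for a $d$-dimensional latent, which is why this brute-force approach does not scale to the latent spaces used in practice.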
To overcome this intractability, Variational Autoencoders (VAEs) treat inference as an optimization problem. Instead of computing $p(z|x)$ exactly, we introduce a simpler, parameterized distribution $q_{\phi}(z|x)$ (the 'encoder') to approximate it. Our goal is to make $q_{\phi}(z|x)$ as close as possible to the true posterior $p(z|x)$. The standard measure of dissimilarity between two distributions is the Kullback-Leibler (KL) divergence: $$D_{KL}(q_{\phi}(z|x) || p(z|x)) = \int q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p(z|x)} dz$$. It is zero only when the two distributions coincide, so minimizing it would ideally yield the best approximation available within the chosen family $q_{\phi}$.
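As an illustration of the KL integral, the sketch below evaluates it for two univariate Gaussians, once with the well-known closed form and once by summing the integrand on a grid. The specific means and standard deviations are arbitrary choices for this example, and the function names are hypothetical.

```python
import math


def log_normal_pdf(v, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2) evaluated at v."""
    return -0.5 * ((v - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def kl_gaussians(m_q, s_q, m_p, s_p):
    """Closed-form KL(N(m_q, s_q^2) || N(m_p, s_p^2))."""
    return math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2.0 * s_p ** 2) - 0.5


def kl_by_quadrature(m_q, s_q, m_p, s_p, lo=-20.0, hi=20.0, n=40001):
    """Approximate the integral of q(z) * log(q(z) / p(z)) with a Riemann sum.

    The integrand is negligible at the end points, so no end-point
    correction is applied.
    """
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * h
        log_q = log_normal_pdf(z, m_q, s_q)
        log_p = log_normal_pdf(z, m_p, s_p)
        total += math.exp(log_q) * (log_q - log_p)
    return total * h


if __name__ == "__main__":
    print(kl_gaussians(0.5, 0.8, 0.0, 1.0), kl_by_quadrature(0.5, 0.8, 0.0, 1.0))
    # Swapping the arguments gives a different value: KL is not symmetric.
    print(kl_gaussians(0.0, 1.0, 0.5, 0.8))
```

The asymmetry is worth keeping in mind: the direction $D_{KL}(q || p)$ used here is not interchangeable with $D_{KL}(p || q)$.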
Since we cannot minimize the KL divergence directly (because it depends on the unknown $p(z|x)$), we derive a surrogate objective called the Evidence Lower Bound (ELBO). By manipulating the log marginal likelihood $\log p(x)$, we can show that: $$\log p(x) = D_{KL}(q_{\phi}(z|x) || p(z|x)) + \mathcal{L}(\phi, \theta)$$ where $\mathcal{L}(\phi, \theta)$ is the ELBO and $\theta$ denotes the parameters of the generative model. Because the KL divergence is always non-negative, it follows that $\log p(x) \ge \mathcal{L}(\phi, \theta)$. Moreover, since $\log p(x)$ does not depend on $\phi$, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence between our approximate posterior and the true posterior.
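One way to see where this identity comes from is sketched below. It assumes the ELBO is defined as $\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_{\phi}(z|x)}\left[\log \frac{p_{\theta}(x|z)p(z)}{q_{\phi}(z|x)}\right]$, a definition the text above does not spell out. The first step uses the fact that $\log p(x)$ does not depend on $z$ and that $q_{\phi}(z|x)$ integrates to one; the second applies Bayes' Theorem; the third multiplies and divides by $q_{\phi}(z|x)$ inside the logarithm.

$$
\begin{aligned}
\log p(x) &= \mathbb{E}_{q_{\phi}(z|x)}\left[\log p(x)\right] \\
&= \mathbb{E}_{q_{\phi}(z|x)}\left[\log \frac{p_{\theta}(x|z)p(z)}{p(z|x)}\right] \\
&= \mathbb{E}_{q_{\phi}(z|x)}\left[\log \frac{p_{\theta}(x|z)p(z)}{q_{\phi}(z|x)}\right] + \mathbb{E}_{q_{\phi}(z|x)}\left[\log \frac{q_{\phi}(z|x)}{p(z|x)}\right] \\
&= \mathcal{L}(\phi, \theta) + D_{KL}(q_{\phi}(z|x) || p(z|x))
\end{aligned}
$$

The last expectation is exactly the KL integral between the approximate and true posterior, and the remaining term contains only quantities we can evaluate.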
The ELBO can be decomposed into two intuitive terms that represent the tension in a VAE: $$\mathcal{L}(\phi, \theta) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) || p(z))$$. The first term is the 'reconstruction term,' which encourages the decoder $p_{\theta}(x|z)$ to accurately reconstruct the input $x$ from the latent sample $z$. The second term is the 'regularization term,' which forces the approximate posterior to remain close to a prior distribution $p(z)$ (usually a standard Gaussian $\mathcal{N}(0, I)$), preventing the model from simply memorizing each input by assigning it its own isolated point in latent space.
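The two terms can be sketched numerically as follows. Several pieces here are assumptions made only for illustration: the encoder output is taken to be a diagonal Gaussian described by `mu` and `log_var`, the decoder likelihood is a unit-variance Gaussian, and `toy_decoder` is a hypothetical fixed linear map standing in for a trained network. Under the diagonal-Gaussian assumption the KL term against $\mathcal{N}(0, I)$ has a closed form, while the reconstruction term is estimated by Monte Carlo.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def gaussian_log_likelihood(x, x_mean, noise_std=1.0):
    """log p(x|z) for an isotropic Gaussian likelihood centered at x_mean."""
    return sum(
        -0.5 * ((xi - mi) / noise_std) ** 2
        - math.log(noise_std)
        - 0.5 * math.log(2.0 * math.pi)
        for xi, mi in zip(x, x_mean)
    )


def toy_decoder(z):
    """Hypothetical decoder: a fixed linear map from 2 latent to 3 data dimensions."""
    weights = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]


def elbo_estimate(x, mu, log_var, n_samples=1000, seed=0):
    """Monte Carlo reconstruction term minus the closed-form KL term."""
    rng = random.Random(seed)
    recon = 0.0
    for _ in range(n_samples):
        # Draw z from the approximate posterior N(mu, diag(exp(log_var))).
        z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
        recon += gaussian_log_likelihood(x, toy_decoder(z))
    recon /= n_samples
    return recon - kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    x = [0.5, -0.2, 0.3]
    print(elbo_estimate(x, mu=[0.4, -0.1], log_var=[-1.0, -1.0]))
    print(kl_to_standard_normal([0.4, -0.1], [-1.0, -1.0]))
```

Note how the KL term vanishes when `mu` is zero and `log_var` is zero, that is, when the approximate posterior equals the prior; any gain in reconstruction quality from moving away from the prior has to outweigh the resulting penalty.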
A significant technical challenge arises when calculating the gradient of the ELBO with respect to the encoder parameters $\phi$, as we must propagate gradients through a random sampling process $z \sim q_{\phi}(z|x)$. To solve this, VAEs employ the 'reparameterization trick.' We express $z$ as a deterministic function of the encoder outputs $\mu_{\phi}(x)$ and $\sigma_{\phi}(x)$ and an auxiliary noise variable $\epsilon \sim \mathcal{N}(0, I)$: $$z = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon$$. This shifts the stochasticity to $\epsilon$, which does not depend on $\phi$, allowing us to use standard backpropagation to update the encoder's weights.
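The effect of the trick can be checked on a small example where the answer is known. The sketch below assumes a scalar latent and the test function $f(z) = z^2$, neither of which comes from the text above; for $z \sim \mathcal{N}(\mu, \sigma^2)$ we have $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, so the exact gradients are $2\mu$ and $2\sigma$. The derivatives are written out by hand via the chain rule; in an automatic differentiation framework the same path through $z = \mu + \sigma \epsilon$ would be followed by backpropagation.

```python
import random


def reparam_gradients(mu, sigma, n_samples=200_000, seed=0):
    """Estimate d/dmu and d/dsigma of E[z^2] with z = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise drawn independently of mu and sigma
        z = mu + sigma * eps        # deterministic in (mu, sigma) given eps
        df_dz = 2.0 * z             # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples


if __name__ == "__main__":
    g_mu, g_sigma = reparam_gradients(1.5, 0.7)
    print(g_mu, "vs exact", 2 * 1.5)
    print(g_sigma, "vs exact", 2 * 0.7)
```

Because the randomness enters only through `eps`, each sample contributes an ordinary derivative, and averaging those derivatives gives an unbiased estimate of the gradient of the expectation.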
In summary, the VAE transforms a complex Bayesian inference problem into a differentiable optimization task. By maximizing the ELBO, the model learns a structured latent space where similar data points are clustered together, while the decoder learns to map these latent representations back into the original data space. This framework allows us to generate new, synthetic data by sampling $z \sim p(z)$ and passing it through the decoder, producing samples that approximately follow the distribution of the training set.