At its heart, Bayesian inference is about updating our beliefs about a set of latent variables $\mathbf{z}$ given observed data $\mathbf{x}$. According to Bayes' Theorem, the posterior distribution is given by $p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})}$. While the numerator is often easy to compute, the denominator $p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z}) \, d\mathbf{z}$, known as the evidence, is typically intractable for high-dimensional latent spaces. This intractability prevents us from directly sampling from the posterior or calculating the exact likelihood of our data.
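To make the evidence integral concrete, here is a minimal sketch on a toy model that is not part of the discussion above: a scalar latent with prior $z \sim \mathcal{N}(0, 1)$ and likelihood $x|z \sim \mathcal{N}(z, 1)$, chosen only because its evidence has the closed form $p(x) = \mathcal{N}(x; 0, 2)$. The function names and grid settings are illustrative assumptions.

```python
import math

def normal_pdf(v, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at v."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def evidence_by_grid(x, lo=-10.0, hi=10.0, n=2001):
    """Approximate p(x) = integral of p(x|z) p(z) dz with the trapezoidal rule."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0)
    return total * step

def evidence_exact(x):
    """Closed form for this toy model only: marginally x ~ N(0, 2)."""
    return normal_pdf(x, 0.0, 2.0)
```

In one dimension the grid estimate matches the closed form closely. A grid with $n$ points per axis needs $n^d$ evaluations in $d$ dimensions, which is one way to see why this direct approach breaks down for high-dimensional $\mathbf{z}$.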
To bypass this, we use Variational Inference (VI). Instead of calculating the true posterior $p(\mathbf{z}|\mathbf{x})$, we introduce a simpler, parameterized distribution $q_{\phi}(\mathbf{z}|\mathbf{x})$ and attempt to make it as similar as possible to the true posterior. We measure this 'similarity' using the Kullback-Leibler (KL) divergence: $D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x})) = \int q_{\phi}(\mathbf{z}|\mathbf{x}) \log \frac{q_{\phi}(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\mathbf{x})} \, d\mathbf{z}$. By minimizing this divergence, we effectively 'approximate' the complex posterior with a tractable distribution, such as a Gaussian.
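The KL integral can be illustrated with a small sketch in which both distributions are univariate Gaussians, so the divergence has a closed form that a Monte Carlo estimate can be checked against. This is an assumption made for illustration: in VI the second argument is the unknown posterior, not a known Gaussian. All function names are hypothetical.

```python
import math
import random

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) for univariate Gaussians q and p."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def log_normal_pdf(v, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at v."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((v - mu) / sigma) ** 2

def kl_monte_carlo(mu_q, sigma_q, mu_p, sigma_p, n_samples=100_000, seed=0):
    """Estimate KL(q || p) = E_q[log q(z) - log p(z)] by sampling z ~ q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(mu_q, sigma_q)
        total += log_normal_pdf(z, mu_q, sigma_q) - log_normal_pdf(z, mu_p, sigma_p)
    return total / n_samples
```

Swapping the arguments of `kl_gaussians` generally gives a different value, a reminder that the KL divergence is not symmetric and that VI minimizes the specific direction $D_{KL}(q \parallel p)$.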
The challenge is that the KL divergence itself depends on the unknown $p(\mathbf{z}|\mathbf{x})$. To resolve this, we derive the Evidence Lower Bound (ELBO). Through algebraic manipulation of the log-evidence, we can show that: $\log p(\mathbf{x}) = D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}|\mathbf{x})) + \mathcal{L}(\phi; \mathbf{x})$, where $\mathcal{L}(\phi; \mathbf{x})$ is the ELBO. Since the KL divergence is always non-negative, $\log p(\mathbf{x}) \geq \mathcal{L}(\phi; \mathbf{x})$. Moreover, because $\log p(\mathbf{x})$ does not depend on $\phi$, any increase in the ELBO must be matched by an equal decrease in the KL term. Therefore, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence between our approximation and the true posterior.
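The identity $\log p(\mathbf{x}) = D_{KL} + \mathcal{L}$ can be checked numerically on a toy model where every quantity is available in closed form. The sketch below assumes a scalar prior $z \sim \mathcal{N}(0, 1)$, likelihood $x|z \sim \mathcal{N}(z, 1)$, and approximation $q(z) = \mathcal{N}(m, s^2)$. For this model the exact posterior is $\mathcal{N}(x/2, 1/2)$ and the evidence is $\mathcal{N}(x; 0, 2)$. The ELBO is written as expected log-likelihood minus KL to the prior, one standard form of it. The model and all names are illustrative assumptions.

```python
import math

def log_evidence(x):
    """log p(x) for the toy model, where marginally x ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

def elbo(x, m, s):
    """ELBO for q(z) = N(m, s^2): E_q[log p(x|z)] - KL(q || p(z))."""
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    kl_to_prior = 0.5 * (s ** 2 + m ** 2 - 1.0 - math.log(s ** 2))
    return expected_log_lik - kl_to_prior

def kl_to_posterior(x, m, s):
    """KL(q || p(z|x)), using the exact posterior p(z|x) = N(x/2, 1/2)."""
    post_mean, post_var = x / 2.0, 0.5
    return (0.5 * math.log(post_var / s ** 2)
            + (s ** 2 + (m - post_mean) ** 2) / (2.0 * post_var)
            - 0.5)
```

For any choice of `m` and `s > 0`, `elbo(x, m, s) + kl_to_posterior(x, m, s)` should agree with `log_evidence(x)` up to floating-point error, and the gap closes when `q` equals the exact posterior (`m = x / 2`, `s = sqrt(0.5)`).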
The ELBO can be decomposed into two intuitive terms: $\mathcal{L}(\phi; \mathbf{x}) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}) \parallel p(\mathbf{z}))$, where $\theta$ denotes the parameters of the likelihood model. The first term is the 'reconstruction term,' which encourages the model to decode the latent representation $\mathbf{z}$ back into the original data $\mathbf{x}$ accurately. The second term is a 'regularization term,' which keeps the approximate posterior close to the prior $p(\mathbf{z})$. Without it, the model could map every data point to its own isolated point in latent space, a form of overfitting.
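As a rough sketch of how the two terms might be computed for a single data point, the code below makes three assumptions that the text does not fix: a standard normal prior $p(\mathbf{z}) = \mathcal{N}(0, I)$, a diagonal Gaussian $q_{\phi}$ parameterized by a mean and a log-variance, and a unit-variance Gaussian likelihood. Here `x_hat` stands for the likelihood mean computed from one sample of $\mathbf{z}$. All names are hypothetical.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m ** 2 - 1.0 - lv for m, lv in zip(mu, log_var))

def gaussian_log_likelihood(x, x_hat):
    """log N(x; x_hat, I): reconstruction term under the assumed unit-variance likelihood."""
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * sq_err - 0.5 * len(x) * math.log(2.0 * math.pi)

def single_sample_elbo(x, x_hat, mu, log_var):
    """One-sample ELBO estimate: reconstruction term minus the KL regularizer."""
    return gaussian_log_likelihood(x, x_hat) - kl_diag_gaussian_to_standard_normal(mu, log_var)
```

The KL term is zero exactly when `mu` is all zeros and `log_var` is all zeros, that is, when $q_{\phi}$ coincides with the prior. It grows as the approximate posterior moves away from the prior, which is the pull the regularization term exerts.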
A Variational Autoencoder (VAE) implements this framework using neural networks. The 'encoder' represents $q_{\phi}(\mathbf{z}|\mathbf{x})$, outputting the parameters (mean $\mu$ and variance $\sigma^2$) of a Gaussian distribution. The 'decoder' represents $p_{\theta}(\mathbf{x}|\mathbf{z})$. Because we need to backpropagate through a random sample $\mathbf{z} \sim q_{\phi}(\mathbf{z}|\mathbf{x})$, we use the 'Reparameterization Trick'. We express $\mathbf{z}$ as $\mathbf{z} = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$, $\sigma$ is the element-wise standard deviation, and $\odot$ denotes element-wise multiplication. This shifts the randomness to an external input: for a fixed $\epsilon$, $\mathbf{z}$ is a deterministic, differentiable function of $\mu$ and $\sigma$, and therefore of $\phi$.
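The effect of the reparameterization can be sketched without a neural network. The example below takes a scalar $z = \mu + \sigma \epsilon$ and the test function $f(z) = z^2$, chosen only because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ has known gradients $2\mu$ and $2\sigma$. It then estimates those gradients by differentiating through the sample with the chain rule. The function names and the choice of $f$ are illustrative assumptions.

```python
import random

def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, elementwise; eps is drawn from N(0, I) by the caller."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

def pathwise_gradients(mu, sigma, n_samples=50_000, seed=0):
    """Estimate d/dmu and d/dsigma of E[z^2] for scalar z = mu + sigma * eps.

    Chain rule through the sample: d(z^2)/dmu = 2z * 1 and d(z^2)/dsigma = 2z * eps.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)          # randomness enters only here
        (z,) = reparameterize([mu], [sigma], [eps])
        grad_mu += 2.0 * z                 # dz/dmu = 1
        grad_sigma += 2.0 * z * eps        # dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples
```

With enough samples the two estimates should approach $2\mu$ and $2\sigma$. In a VAE the same idea is applied by an automatic differentiation framework, with the ELBO in place of $z^2$ and the encoder outputs in place of the scalar `mu` and `sigma`.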
In summary, the VAE is not just an autoencoder; it is a probabilistic graphical model. By optimizing the ELBO, we learn a structured latent space that allows us to generate new, synthetic data by sampling $\mathbf{z} \sim p(\mathbf{z})$ and passing it through the decoder. The balance between reconstruction and regularization ensures that the latent space is continuous and complete, enabling smooth interpolation between different data samples.