At its core, Bayesian inference is about updating our beliefs about a hidden variable given observed data. Imagine we have a dataset $x$ and we believe it was generated by some hidden, latent variable $z$. Our goal is to find the posterior distribution $p(z|x)$, which tells us the probability of the latent variable given the data. According to Bayes' Theorem, this is expressed as: $$p(z|x) = \frac{p(x|z)p(z)}{p(x)}$$ However, in complex generative models, the denominator $p(x) = \int p(x|z)p(z) \, dz$, known as the evidence, is computationally intractable because it requires integrating over all possible configurations of $z$.
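To make the role of the evidence concrete, here is a minimal sketch for the easy case where $z$ is discrete and takes only three values, so the integral reduces to a short sum. The function name `discrete_posterior` and all numbers are assumptions chosen purely for illustration.

```python
def discrete_posterior(prior, likelihood):
    """Posterior p(z|x) over a discrete latent z for one fixed observation x.

    prior[k]      = p(z = k)
    likelihood[k] = p(x | z = k)
    """
    # Numerator of Bayes' Theorem for every value of z.
    joint = [p_z * p_x_given_z for p_z, p_x_given_z in zip(prior, likelihood)]
    # Evidence: p(x) = sum_k p(x | z = k) p(z = k).
    evidence = sum(joint)
    posterior = [j / evidence for j in joint]
    return posterior, evidence


# Hypothetical numbers, chosen only for illustration.
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.4, 0.7]

posterior, evidence = discrete_posterior(prior, likelihood)
print("evidence p(x):", evidence)
print("posterior p(z|x):", posterior)
```

With $K$ values per latent dimension and $d$ dimensions, the same sum would have $K^d$ terms, and for continuous $z$ it becomes the integral above. That growth is what makes the exact posterior impractical in complex models.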
Since we cannot compute the true posterior $p(z|x)$ exactly, we turn to Variational Inference. Instead of calculating the true distribution, we introduce a simpler, parameterized distribution $q_{\phi}(z|x)$, often a Gaussian, and try to make it as close as possible to the true posterior. We measure the mismatch between these two distributions using the Kullback-Leibler (KL) divergence, which is zero only when the two distributions coincide. Our objective is to minimize $KL(q_{\phi}(z|x) || p(z|x))$, effectively transforming an inference problem into an optimization problem over the parameters $\phi$.
To derive a workable loss function, we analyze the log-likelihood of the data, $\log p(x)$. Through algebraic manipulation, we can decompose this log-likelihood into two parts: the Evidence Lower Bound (ELBO) and the KL divergence. Specifically: $\log p(x) = ELBO(\phi, \theta; x) + KL(q_{\phi}(z|x) || p(z|x))$. Because the KL divergence is always non-negative, the ELBO is a lower bound on the log-likelihood of the data. Moreover, $\log p(x)$ does not depend on $\phi$, so maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence to the true posterior, and the ELBO can be estimated without ever computing that posterior. This is why the ELBO is the central objective function of the Variational Autoencoder.
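The algebraic manipulation can be sketched in a few lines. The only facts used are that $\log p(x)$ does not depend on $z$, so taking its expectation under $q_{\phi}(z|x)$ leaves it unchanged, and that $p(x) = p(x, z) / p(z|x)$. Writing the joint as $p(x, z) = p_{\theta}(x|z)p(z)$ is an assumption about how the model is parameterized:

$$
\begin{aligned}
\log p(x) &= \mathbb{E}_{q_{\phi}(z|x)}\left[\log p(x)\right] \\
&= \mathbb{E}_{q_{\phi}(z|x)}\left[\log \frac{p(x, z)}{p(z|x)}\right] \\
&= \mathbb{E}_{q_{\phi}(z|x)}\left[\log \frac{p(x, z)}{q_{\phi}(z|x)}\right] + \mathbb{E}_{q_{\phi}(z|x)}\left[\log \frac{q_{\phi}(z|x)}{p(z|x)}\right] \\
&= ELBO(\phi, \theta; x) + KL(q_{\phi}(z|x) || p(z|x))
\end{aligned}
$$

Substituting $p(x, z) = p_{\theta}(x|z)p(z)$ into the first expectation splits it once more:

$$
\mathbb{E}_{q_{\phi}(z|x)}\left[\log \frac{p_{\theta}(x|z)p(z)}{q_{\phi}(z|x)}\right] = \mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z)\right] - KL(q_{\phi}(z|x) || p(z))
$$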
Mathematically, the ELBO is formulated as: $$ELBO(\phi, \theta; x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - KL(q_{\phi}(z|x) || p(z))$$ The first term, the reconstruction term, encourages the model to decode the latent variable $z$ back into the original input $x$ with high probability. The second term, the regularization term, forces the approximate posterior $q_{\phi}(z|x)$ to remain close to a prior distribution $p(z)$, typically a standard normal distribution $\mathcal{N}(0, I)$. This prevents the model from simply assigning a unique point in space to every single input, thereby ensuring a smooth and continuous latent space.
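The two terms can be checked numerically on a toy model in which everything has a closed form. The sketch below assumes a one-dimensional model with prior $z \sim \mathcal{N}(0, 1)$ and likelihood $x|z \sim \mathcal{N}(z, 1)$, for which the evidence is $\mathcal{N}(0, 2)$ and the exact posterior is $\mathcal{N}(x/2, 1/2)$. This model and the function names are illustrative assumptions, not part of a real VAE, where the decoder is a neural network and no such closed forms are available.

```python
import math


def log_evidence(x):
    """log p(x) for the toy model z ~ N(0, 1), x | z ~ N(z, 1), so x ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0


def kl_gaussians(m_q, s_q, m_p, s_p):
    """KL( N(m_q, s_q^2) || N(m_p, s_p^2) ) for univariate Gaussians."""
    return math.log(s_p / s_q) + (s_q**2 + (m_q - m_p) ** 2) / (2.0 * s_p**2) - 0.5


def elbo(x, m, s):
    """ELBO for q(z|x) = N(m, s^2): reconstruction term minus KL to the prior."""
    # E_q[log N(x; z, 1)] is closed form because E_q[(x - z)^2] = (x - m)^2 + s^2.
    reconstruction = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s**2)
    regularization = kl_gaussians(m, s, 0.0, 1.0)  # KL(q || N(0, 1))
    return reconstruction - regularization


x = 1.0
post_mean, post_std = x / 2.0, math.sqrt(0.5)  # exact posterior N(x/2, 1/2)

# Three candidate approximations; the last one equals the exact posterior.
for m, s in [(0.0, 1.0), (0.3, 0.9), (post_mean, post_std)]:
    gap = kl_gaussians(m, s, post_mean, post_std)  # KL(q || p(z|x))
    print(f"m={m:.2f} s={s:.2f} ELBO={elbo(x, m, s):.4f} KL={gap:.4f} "
          f"ELBO+KL={elbo(x, m, s) + gap:.4f} log p(x)={log_evidence(x):.4f}")
```

For every choice of `m` and `s`, the ELBO plus the KL divergence to the exact posterior equals $\log p(x)$, and the ELBO reaches $\log p(x)$ only when $q$ matches the posterior.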
A primary challenge in optimizing the ELBO is the presence of the expectation $\mathbb{E}_{q_{\phi}(z|x)}$, which is estimated by sampling $z$ from a distribution that itself depends on $\phi$. We cannot backpropagate through a random sampling step directly, because a drawn sample is not a differentiable function of the distribution's parameters. To solve this, VAEs employ the 'reparameterization trick'. Instead of sampling $z$ directly from $\mathcal{N}(\mu, \sigma^2)$, we sample a noise variable $\epsilon$ from a standard normal distribution $\mathcal{N}(0, I)$ and compute $z = \mu + \sigma \odot \epsilon$. This moves the stochasticity to an external input, so $z$ becomes a deterministic, differentiable function of $\mu$ and $\sigma$, and gradients can flow back through them to the encoder network parameters $\phi$.
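A small numerical sketch shows why this works. It uses a scalar $z$ and the test function $f(z) = z^2$ as a hypothetical stand-in for the term inside the expectation, because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ gives exact gradients $2\mu$ and $2\sigma$ to compare against. The gradients are written out by hand with the chain rule. In practice an automatic differentiation framework would do this step.

```python
import random


def reparameterized_sample(mu, sigma, eps):
    """z = mu + sigma * eps: deterministic in (mu, sigma) once the noise eps is fixed."""
    return mu + sigma * eps


def pathwise_gradients(mu, sigma, num_samples=100_000, seed=0):
    """Monte Carlo estimate of the gradients of E[f(z)] for f(z) = z^2.

    Because z = mu + sigma * eps, the chain rule gives
        df/dmu    = f'(z) * dz/dmu    = 2 z * 1
        df/dsigma = f'(z) * dz/dsigma = 2 z * eps
    and averaging over eps ~ N(0, 1) estimates the gradient of the expectation.
    """
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)  # noise drawn independently of mu and sigma
        z = reparameterized_sample(mu, sigma, eps)
        grad_mu += 2.0 * z
        grad_sigma += 2.0 * z * eps
    return grad_mu / num_samples, grad_sigma / num_samples


mu, sigma = 1.5, 0.8  # hypothetical encoder outputs
g_mu, g_sigma = pathwise_gradients(mu, sigma)

# Exact values for comparison: E[z^2] = mu^2 + sigma^2.
print("estimated gradients:", g_mu, g_sigma)
print("exact gradients:    ", 2.0 * mu, 2.0 * sigma)
```

The estimates should land close to the exact values, up to Monte Carlo noise. The key design choice is that `eps` is drawn without reference to `mu` or `sigma`, so the derivative of each sample with respect to those parameters is well defined.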
In summary, the VAE architecture consists of an encoder that outputs the parameters of $q_{\phi}(z|x)$ and a decoder that represents $p_{\theta}(x|z)$. By maximizing the ELBO, we simultaneously learn how to compress data into a meaningful latent representation and how to generate new data from that representation. The balance between the reconstruction accuracy and the KL regularization ensures that the latent space is well-structured, enabling the generative capabilities that distinguish VAEs from traditional autoencoders.