Probabilistic Latent Variable Models: From Bayesian Inference to VAEs

At its heart, Bayesian inference is about updating our beliefs about a hidden cause given some observed evidence. Imagine we have data $x$ and a latent variable $z$ that generates $x$. We want to find the posterior distribution $p(z|x)$, which tells us the probability of the latent cause given the data. According to Bayes' Theorem: $$p(z|x) = \frac{p(x|z)p(z)}{p(x)}$$ The denominator $p(x) = \int p(x|z)p(z) \, dz$ is known as the evidence. When $z$ is high-dimensional and $p(x|z)$ is a flexible nonlinear model, this integral generally has no closed form and is too expensive to approximate numerically, which makes exact Bayesian inference impractical for complex models.
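To make the evidence integral concrete, here is a minimal sketch on a toy model chosen purely for illustration (it is an assumption, not part of the discussion above): $z \sim \mathcal{N}(0, 1)$ and $x|z \sim \mathcal{N}(z, 1)$. For this model the evidence is known in closed form, $p(x) = \mathcal{N}(x; 0, 2)$, so a grid approximation can be checked against it. The function names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def evidence_on_grid(x, lo=-10.0, hi=10.0, n=2001):
    """Approximate p(x) = integral of p(x|z) p(z) dz with the trapezoidal rule."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        # Toy model (assumed): p(x|z) = N(x; z, 1), p(z) = N(z; 0, 1)
        total += weight * normal_pdf(x, z, 1.0) * normal_pdf(z, 0.0, 1.0)
    return total * step

x_obs = 1.5
p_x_numeric = evidence_on_grid(x_obs)
p_x_exact = normal_pdf(x_obs, 0.0, 2.0)  # closed-form marginal for this toy model
print(p_x_numeric, p_x_exact)

# Number of grid points needed at the same per-axis resolution if z had d dimensions
for d in (1, 2, 10):
    print(d, 2001 ** d)
```

In one dimension the grid is cheap, but the number of evaluation points grows exponentially with the dimension of $z$, which is one way to see why the integral becomes impractical.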

To circumvent this intractability, we use Variational Inference (VI). Instead of calculating the exact posterior $p(z|x)$, we introduce a simpler, parameterized distribution $q_{\phi}(z|x)$ (the variational distribution) and try to make it as similar as possible to the true posterior. A standard measure of dissimilarity between two distributions is the Kullback-Leibler (KL) divergence: $$D_{KL}(q_{\phi}(z|x) || p(z|x)) = \int q_{\phi}(z|x) \log \frac{q_{\phi}(z|x)}{p(z|x)} \, dz$$ The KL divergence is non-negative, equals zero only when the two distributions coincide, and is not symmetric in its arguments. By minimizing this divergence, we effectively 'fit' our approximate distribution to the true, unknown posterior.
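As an illustration of the KL integral, the sketch below evaluates it for two one-dimensional Gaussians, once with the well-known closed form and once by direct numerical integration of the definition. The specific means and standard deviations are arbitrary values chosen for the example, and the function names are hypothetical.

```python
import math

def log_normal_pdf(z, mean, std):
    """Log-density of a 1D Gaussian with the given mean and standard deviation."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((z - mean) / std) ** 2

def kl_gauss(m_q, s_q, m_p, s_p):
    """Closed-form KL( N(m_q, s_q^2) || N(m_p, s_p^2) )."""
    return math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2.0 * s_p ** 2) - 0.5

def kl_numeric(m_q, s_q, m_p, s_p, lo=-12.0, hi=12.0, n=4801):
    """Riemann-sum approximation of the integral of q(z) * log(q(z) / p(z)) dz."""
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * step
        log_q = log_normal_pdf(z, m_q, s_q)
        log_p = log_normal_pdf(z, m_p, s_p)
        total += math.exp(log_q) * (log_q - log_p)
    return total * step

# Arbitrary example values: q is the approximation, p plays the role of the target
m_q, s_q = 0.5, 0.8
m_p, s_p = 0.75, math.sqrt(0.5)

kl_forward = kl_gauss(m_q, s_q, m_p, s_p)   # KL(q || p)
kl_reverse = kl_gauss(m_p, s_p, m_q, s_q)   # KL(p || q), generally a different number
kl_check = kl_numeric(m_q, s_q, m_p, s_p)
print(kl_forward, kl_check, kl_reverse)
```

The two directions give different values, which is why the choice of $D_{KL}(q || p)$ rather than $D_{KL}(p || q)$ is a modeling decision and not just a matter of notation.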

Since we cannot minimize $D_{KL}(q_{\phi}(z|x) || p(z|x))$ directly (as it depends on the unknown $p(z|x)$), we rearrange the terms using the definition of the KL divergence and Bayes' Theorem. After algebraic manipulation, we find: $$\log p(x) = D_{KL}(q_{\phi}(z|x) || p(z|x)) + \mathcal{L}(\theta, \phi; x)$$ Here, $\mathcal{L}(\theta, \phi; x)$ is the Evidence Lower Bound, or ELBO, where $\theta$ denotes the parameters of the generative model and $\phi$ those of the variational distribution. Because the KL divergence is always non-negative, the ELBO serves as a lower bound on the log-likelihood of the data: $\log p(x) \geq \mathcal{L}(\theta, \phi; x)$. Since $\log p(x)$ does not depend on $\phi$, maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL divergence to the posterior.
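The algebraic manipulation can be spelled out in a few lines. This sketch writes $p(x, z) = p(x|z)p(z)$ for the joint density (a shorthand introduced here) and uses Bayes' Theorem in the form $p(z|x) = p(x, z) / p(x)$:

$$\begin{aligned}
D_{KL}(q_{\phi}(z|x) || p(z|x)) &= \mathbb{E}_{q_{\phi}(z|x)}[\log q_{\phi}(z|x) - \log p(z|x)] \\
&= \mathbb{E}_{q_{\phi}(z|x)}[\log q_{\phi}(z|x) - \log p(x, z)] + \log p(x) \\
&= \log p(x) - \mathbb{E}_{q_{\phi}(z|x)}[\log p(x, z) - \log q_{\phi}(z|x)]
\end{aligned}$$

The second line uses the fact that $\log p(x)$ does not depend on $z$, so it passes through the expectation unchanged. Identifying $\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p(x, z) - \log q_{\phi}(z|x)]$ and moving it to the left-hand side gives the stated identity.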

Let us decompose the ELBO to understand its mechanics. The ELBO can be written as: $$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{KL}(q_{\phi}(z|x) || p(z))$$ The first term is the 'reconstruction term,' which encourages the model to decode the latent variable $z$ back into the original data $x$. The second term is a 'regularization term,' which pulls the approximate posterior $q_{\phi}(z|x)$ toward the prior $p(z)$, usually a standard Gaussian $\mathcal{N}(0, I)$. This discourages the model from collapsing each $q_{\phi}(z|x)$ onto its own isolated point in latent space, and thereby encourages a smooth latent space.
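The decomposition can be checked numerically on a toy model where everything is available in closed form. The sketch below assumes $p(z) = \mathcal{N}(0, 1)$, $p(x|z) = \mathcal{N}(z, 1)$ and a Gaussian $q(z|x) = \mathcal{N}(m, s^2)$; this model, the chosen numbers, and the function names are illustrative assumptions. For this model $p(x) = \mathcal{N}(0, 2)$ and the exact posterior is $\mathcal{N}(x/2, 1/2)$.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)

def log_evidence(x):
    """log p(x) for the toy model, whose marginal is N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0

def reconstruction_term(x, m, s):
    """Closed form of E_q[log p(x|z)] with q = N(m, s^2) and p(x|z) = N(z, 1)."""
    return -0.5 * LOG_2PI - 0.5 * ((x - m) ** 2 + s * s)

def reconstruction_term_mc(x, m, s, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the same expectation, as a VAE would compute it."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(m, s)
        total += -0.5 * LOG_2PI - 0.5 * (x - z) ** 2
    return total / n_samples

def kl_gauss(m_q, s_q, m_p, s_p):
    """Closed-form KL( N(m_q, s_q^2) || N(m_p, s_p^2) )."""
    return math.log(s_p / s_q) + (s_q ** 2 + (m_q - m_p) ** 2) / (2.0 * s_p ** 2) - 0.5

def elbo(x, m, s):
    """ELBO = reconstruction term - KL(q || prior), with prior N(0, 1)."""
    return reconstruction_term(x, m, s) - kl_gauss(m, s, 0.0, 1.0)

x_obs = 1.5
post_mean, post_std = x_obs / 2.0, math.sqrt(0.5)  # exact posterior of the toy model

# An arbitrary, imperfect q: the ELBO sits strictly below log p(x)
m, s = 0.2, 1.0
gap = log_evidence(x_obs) - elbo(x_obs, m, s)
kl_to_posterior = kl_gauss(m, s, post_mean, post_std)
print(gap, kl_to_posterior)

# With q equal to the exact posterior, the bound is tight
elbo_at_posterior = elbo(x_obs, post_mean, post_std)
print(elbo_at_posterior, log_evidence(x_obs))

# Sampling-based estimate of the reconstruction term for comparison
recon_mc = reconstruction_term_mc(x_obs, m, s)
print(recon_mc, reconstruction_term(x_obs, m, s))
```

The gap between $\log p(x)$ and the ELBO matches the KL divergence from $q$ to the exact posterior, and it vanishes when $q$ is the exact posterior. In a real VAE neither quantity is available in closed form, which is why the reconstruction term is estimated by sampling.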

The Variational Autoencoder (VAE) implements this framework using neural networks. An 'encoder' network with weights $\phi$ maps $x$ to the parameters (mean $\mu$ and variance $\sigma^2$) of $q_{\phi}(z|x)$, and a 'decoder' network with weights $\theta$ maps $z$ to the parameters of $p_{\theta}(x|z)$. However, drawing a sample $z \sim q_{\phi}(z|x)$ is not a differentiable operation, so we cannot backpropagate through it to the encoder directly. To solve this, we use the 'Reparameterization Trick'. We express $z$ as a deterministic function of $\mu$, $\sigma$, and a noise variable $\epsilon \sim \mathcal{N}(0, I)$: $$z = \mu + \sigma \odot \epsilon$$ This moves the stochasticity to an input node, allowing gradients to flow from the loss function back to the encoder's weights.
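A minimal one-dimensional sketch of the reparameterization trick, without any neural network: it estimates the gradients of $\mathbb{E}[f(z)]$ with respect to $\mu$ and $\sigma$ by differentiating through $z = \mu + \sigma \epsilon$. The test function $f(z) = z^2$ is an assumption picked because the answer is known analytically ($\mathbb{E}[z^2] = \mu^2 + \sigma^2$, so the gradients are $2\mu$ and $2\sigma$); in a VAE, $f$ would be the loss and an autodiff library would apply the chain rule.

```python
import random

def reparam_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Estimate d/dmu and d/dsigma of E[f(z)] for f(z) = z**2 via reparameterization."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)     # noise drawn independently of mu and sigma
        z = mu + sigma * eps          # deterministic function of (mu, sigma, eps)
        df_dz = 2.0 * z               # derivative of the illustrative f(z) = z**2
        grad_mu += df_dz * 1.0        # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps     # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

mu, sigma = 1.0, 0.5
g_mu, g_sigma = reparam_gradients(mu, sigma)
print(g_mu, 2.0 * mu)        # estimate vs. analytic gradient with respect to mu
print(g_sigma, 2.0 * sigma)  # estimate vs. analytic gradient with respect to sigma
```

Because the randomness enters only through `eps`, each sample of `z` has well-defined derivatives with respect to `mu` and `sigma`, and averaging them gives an unbiased gradient estimate.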

In summary, the VAE transforms a complex Bayesian inference problem into a stochastic optimization problem. By maximizing the ELBO, we simultaneously learn a generative model $p_{\theta}(x|z)$ and an efficient way to approximate the posterior $q_{\phi}(z|x)$. This allows us to generate new, synthetic samples by sampling $z \sim p(z)$ and passing it through the decoder, leveraging the structured latent space we've learned. The elegance of the VAE lies in bridging the gap between deep learning's flexibility and Bayesianism's probabilistic rigor.
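The generation step can be sketched as ancestral sampling: draw $z$ from the prior, decode it, then draw $x$ from $p_{\theta}(x|z)$. The decoder below is a hypothetical fixed affine map standing in for a trained network, and its weight, bias, and noise level are made-up values for illustration only.

```python
import random

def decode_mean(z, weight=2.0, bias=1.0):
    """Hypothetical stand-in for a trained decoder: returns the mean of p(x|z)."""
    return weight * z + bias

def generate(n_samples, noise_std=0.5, seed=0):
    """Ancestral sampling: z ~ N(0, 1), then x ~ N(decode_mean(z), noise_std^2)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)            # sample the latent from the prior p(z)
        x_mean = decode_mean(z)            # pass it through the decoder
        samples.append(rng.gauss(x_mean, noise_std))  # sample x from p(x|z)
    return samples

samples = generate(50_000)
sample_mean = sum(samples) / len(samples)
sample_var = sum((x - sample_mean) ** 2 for x in samples) / len(samples)
# For this affine toy decoder, theory gives mean = bias = 1.0 and
# variance = weight**2 + noise_std**2 = 4.25; the estimates should be close to these.
print(sample_mean, sample_var)
```

Note that the encoder plays no role at generation time; it is only needed during training and when inferring latents for observed data.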