KL Divergence and the Geometry of Policy Updates

An exploration of how Kullback–Leibler divergence acts as a trust region mechanism to ensure stable convergence in Reinforcement Learning. This lesson bridges the gap between information theory and stochastic policy optimization.

At its core, Kullback–Leibler (KL) divergence measures how one probability distribution differs from a second, reference probability distribution. In machine learning, and in Reinforcement Learning (RL) in particular, we often need to update a policy $\pi_{\theta}$ to a new version $\pi_{\theta'}$. If the update is too aggressive, the agent may enter a region of the state space for which it has little or no reliable data, leading to a catastrophic collapse in performance. KL divergence provides a mathematical 'yardstick' for how far the new policy has moved from the old one (it is not a true distance metric, but it serves the same purpose here), so that the update can be kept within a 'trust region' where the approximation of the objective function remains valid.

Mathematically, for two probability distributions $P$ and $Q$ defined over the same space, the KL divergence is defined as the expected log-difference between them: $D_{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$. In the continuous case, the summation is replaced by an integral: $\int p(x) \log \frac{p(x)}{q(x)} dx$. Crucially, KL divergence is non-symmetric, meaning $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$. It is always non-negative and equals zero if and only if $P$ and $Q$ are identical. In RL, we typically treat $P$ as the current policy distribution and $Q$ as the updated distribution.
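
As a minimal sketch of the discrete definition above, the following computes $D_{KL}(P \parallel Q)$ in both directions for two small distributions. The function name `kl_divergence` and the example probabilities are illustrative assumptions, not part of the lesson.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if q_x == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x)
    return total

# Hypothetical action distributions over three actions.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

forward = kl_divergence(p, q)  # D_KL(P || Q)
reverse = kl_divergence(q, p)  # D_KL(Q || P)
print(forward, reverse)
```

The two printed values differ, which illustrates the asymmetry, and `kl_divergence(p, p)` returns zero.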

The role of KL divergence becomes critical when we consider the Policy Gradient Theorem. A standard policy gradient gives only a local, first-order direction of improvement; it says nothing about how far we can safely move in that direction, so the 'step size' (learning rate) is difficult to tune. If we move too far in the direction of the gradient, the new policy $\pi_{\theta'}$ may be radically different from $\pi_{\theta}$, causing the agent to 'forget' previously learned stable behaviors. By constraining the update such that $D_{KL}(\pi_{\theta} \parallel \pi_{\theta'}) \leq \delta$, we bound the information-theoretic distance between the two policies, ensuring that the distribution of actions does not shift too abruptly.
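
One simple way to enforce such a constraint is to shrink a proposed step until the KL divergence falls below $\delta$. The sketch below does this for a softmax policy over three actions in a single state. The helper names, the update direction, and the halving schedule are assumptions made for illustration, not a prescribed algorithm.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def constrained_step(theta, direction, step_size, delta, shrink=0.5, max_tries=20):
    """Shrink step_size until D_KL(pi_theta || pi_theta') <= delta."""
    old = softmax(theta)
    for _ in range(max_tries):
        candidate = [t + step_size * d for t, d in zip(theta, direction)]
        divergence = kl(old, softmax(candidate))
        if divergence <= delta:
            return candidate, step_size, divergence
        step_size *= shrink  # step too aggressive: back off and retry
    return list(theta), 0.0, 0.0  # give up and keep the old policy

theta = [0.0, 0.0, 0.0]          # logits of the current policy
direction = [1.0, -0.5, -0.5]    # hypothetical gradient direction
new_theta, used_step, divergence = constrained_step(
    theta, direction, step_size=4.0, delta=0.01
)
print(used_step, divergence)
```

The initial step of 4.0 would make the policy almost deterministic, so the loop backs off until the shift in the action distribution is within the budget.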

This constraint is most famously implemented in Trust Region Policy Optimization (TRPO). Instead of a simple gradient ascent step, TRPO solves a constrained optimization problem: maximize the surrogate objective $L(\theta) = \mathbb{E} \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\text{old}}(a|s)} A^{\pi_{\text{old}}}(s, a) \right]$ subject to $D_{KL}(\pi_{\text{old}} \parallel \pi_{\theta}) \leq \delta$. Here, $A$ represents the advantage function. The theory behind TRPO guarantees monotonic improvement of the policy for an idealized version of this update; the practical algorithm approximates it, so the guarantee holds only approximately. Geometrically, a second-order expansion of the KL constraint yields the Fisher information matrix, which acts as a local metric on parameter space. The KL constraint therefore replaces Euclidean distance between parameters with distance on a Riemannian manifold of distributions.
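
To make the two quantities in this problem concrete, the sketch below estimates the surrogate objective and the mean KL divergence from a small batch, using tabular policies. The state names, probabilities, advantages, and the value of `delta` are made-up illustrative values. The code only evaluates a candidate policy; it does not perform TRPO's actual optimization.

```python
import math

def surrogate_and_kl(batch, pi_old, pi_new):
    """Sample estimates of L(theta) and of the mean D_KL(pi_old || pi_new)."""
    surrogate = 0.0
    mean_kl = 0.0
    for state, action, advantage in batch:
        # Importance ratio pi_new(a|s) / pi_old(a|s), weighted by the advantage.
        ratio = pi_new[state][action] / pi_old[state][action]
        surrogate += ratio * advantage
        # KL between the full action distributions at the visited state.
        mean_kl += sum(
            po * math.log(po / pn)
            for po, pn in zip(pi_old[state], pi_new[state])
            if po > 0.0
        )
    n = len(batch)
    return surrogate / n, mean_kl / n

# Hypothetical tabular policies: state -> action probabilities.
pi_old = {"s0": [0.5, 0.5], "s1": [0.8, 0.2]}
pi_new = {"s0": [0.6, 0.4], "s1": [0.75, 0.25]}
# Hypothetical samples: (state, action taken, advantage estimate).
batch = [("s0", 0, 1.0), ("s0", 1, -1.0), ("s1", 1, 0.5)]
delta = 0.02

surrogate, mean_kl = surrogate_and_kl(batch, pi_old, pi_new)
feasible = mean_kl <= delta  # does the candidate stay inside the trust region?
print(surrogate, mean_kl, feasible)
```

Evaluating `surrogate_and_kl(batch, pi_old, pi_old)` gives the baseline: the ratio is 1 everywhere, so the surrogate reduces to the mean advantage and the KL term is zero.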

A more computationally efficient approximation of this constraint is found in Proximal Policy Optimization (PPO). While TRPO uses a hard constraint, one variant of PPO instead adds a KL penalty term to the objective function (the other, more widely used variant clips the probability ratio): $L^{KLPEN}(\theta) = \mathbb{E}_t \left[ R_t(\theta) - \beta D_{KL}(\pi_{\text{old}}(\cdot|s_t) \parallel \pi_{\theta}(\cdot|s_t)) \right]$. Here, $R_t(\theta)$ denotes the per-timestep surrogate term, that is, the probability ratio multiplied by the advantage estimate as in the TRPO objective, and $\beta$ is a coefficient that controls the strength of the penalty. If the KL divergence becomes too large, the penalty term dominates, pulling the new policy back toward the old one and preventing the 'collapse' associated with oversized updates.
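
The effect of $\beta$ can be seen in a small numerical sketch. It scores two candidate policies, one close to the old policy and one far from it, under the penalized objective, taking the surrogate term to be the probability ratio times the advantage. All names and numbers here are illustrative assumptions.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_penalized_objective(batch, pi_old, pi_new, beta):
    """Sample estimate of E_t[ratio * advantage - beta * KL(pi_old || pi_new)]."""
    total = 0.0
    for state, action, advantage in batch:
        ratio = pi_new[state][action] / pi_old[state][action]
        penalty = kl(pi_old[state], pi_new[state])
        total += ratio * advantage - beta * penalty
    return total / len(batch)

# Hypothetical single-state problem with two actions.
pi_old = {"s0": [0.5, 0.5]}
pi_near = {"s0": [0.6, 0.4]}    # small shift toward the better action
pi_far = {"s0": [0.99, 0.01]}   # near-deterministic jump
batch = [("s0", 0, 1.0), ("s0", 1, -1.0)]

scores = {}
for beta in (0.0, 1.0):
    near = kl_penalized_objective(batch, pi_old, pi_near, beta)
    far = kl_penalized_objective(batch, pi_old, pi_far, beta)
    scores[beta] = (near, far)
    print(beta, near, far)
```

With `beta = 0.0` the aggressive policy scores higher because only the surrogate term counts. With `beta = 1.0` the KL penalty outweighs its surrogate gain and the conservative policy is preferred.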

Ultimately, the use of KL divergence transforms policy optimization from a poorly controlled search in parameter space into a controlled evolution in probability space. It acknowledges that the parameters $\theta$ are merely a means of defining a distribution, and that the meaningful distance between two policies is not the Euclidean distance between their weights but the divergence between the action distributions they induce. Constraining that divergence stabilizes training: each step can be as large as the trust region allows, rather than being limited by a conservatively small learning rate, without sacrificing robustness.
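
The gap between parameter distance and distribution distance can be checked numerically. In the sketch below, the same parameter step is applied to two softmax policies: one near-uniform and one already near-deterministic. The logits and the step are arbitrary illustrative choices.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_after_step(theta, step):
    """KL between the policy at theta and the policy at theta + step."""
    old = softmax(theta)
    new = softmax([t + s for t, s in zip(theta, step)])
    return kl(old, new)

step = [1.0, -1.0]  # identical Euclidean length in both cases

kl_flat = kl_after_step([0.0, 0.0], step)       # starts from a uniform policy
kl_saturated = kl_after_step([6.0, 0.0], step)  # starts near-deterministic
print(kl_flat, kl_saturated)
```

Both updates move the weights by exactly the same Euclidean amount, yet the first changes the action distribution far more than the second. This is the reason for measuring steps in KL rather than in parameter space.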