KL Divergence and the Stability of Policy Updates

At its core, Kullback–Leibler (KL) divergence measures how one probability distribution differs from a second, reference distribution. In reinforcement learning (RL), the agent's policy, the mapping from states to actions, is usually treated as a conditional distribution $\pi(\cdot|s)$ over actions. When we update this policy by gradient ascent, a step that looks modest in parameter space can produce an overly aggressive change in the agent's actual behavior, because a small change in the weights $\theta$ can cause a large change in the resulting probability distribution. This failure mode, often called 'policy collapse', can cause the agent to lose previously learned stable behaviors.

Mathematically, the KL divergence between two distributions $P$ and $Q$ over a discrete space is defined as $D_{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$. For continuous distributions, the sum becomes an integral over the densities. KL divergence is not symmetric: in general $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$, so the order of the arguments matters. It is always non-negative and equals zero if and only if $P$ and $Q$ are identical (almost everywhere, in the continuous case). In RL, we typically measure the divergence from the old policy $\pi_{\theta_{old}}$ to the updated policy $\pi_{\theta}$.
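As a quick check of the discrete definition above, the following is a minimal sketch in Python with NumPy. The function name `kl_divergence` and the two example distributions are illustrative choices, not part of any particular RL library. The sketch assumes $Q(x) > 0$ wherever $P(x) > 0$; otherwise the divergence is infinite.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors.

    Assumes q > 0 wherever p > 0 (otherwise the divergence is infinite).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Terms with P(x) = 0 contribute 0 by convention; skip them to avoid log(0).
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two illustrative distributions over three outcomes (made-up numbers).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

forward = kl_divergence(p, q)   # D_KL(P || Q)
reverse = kl_divergence(q, p)   # D_KL(Q || P)
same = kl_divergence(p, p)      # divergence of a distribution from itself

print(f"D_KL(P||Q) = {forward:.4f}")
print(f"D_KL(Q||P) = {reverse:.4f}")
print(f"D_KL(P||P) = {same:.4f}")
```

The two directions give different values, and the divergence of a distribution from itself is zero, matching the properties stated above.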

The instability of vanilla Policy Gradient methods stems from the fact that the gradient $\nabla_{\theta} J(\theta)$ provides a direction of steepest ascent in the parameter space, not the distribution space. Because the relationship between $\theta$ and $\pi_{\theta}$ is highly non-linear, a fixed learning rate $\alpha$ can be too small in some regions and dangerously large in others. By constraining the update such that $D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \le \delta$, we ensure that the new policy remains within a 'trust region' where the local linear approximation of the objective function remains valid.
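To make the gap between parameter space and distribution space concrete, here is a small sketch under simplifying assumptions: a two-action softmax policy parameterized directly by its logits, with the same parameter step applied at two different starting points. The setup and all names (`softmax`, `kl`, `theta_flat`, `theta_peaked`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # D_KL(p || q); softmax outputs are strictly positive here, so no log(0).
    return float(np.sum(p * np.log(p / q)))

# The same step in parameter (logit) space, applied at two different points.
step = np.array([1.0, -1.0])

theta_flat = np.array([0.0, 0.0])     # near-uniform policy
theta_peaked = np.array([4.0, -4.0])  # already almost deterministic policy

kl_flat = kl(softmax(theta_flat), softmax(theta_flat + step))
kl_peaked = kl(softmax(theta_peaked), softmax(theta_peaked + step))

print(f"KL after step from near-uniform policy: {kl_flat:.5f}")
print(f"KL after step from peaked policy:       {kl_peaked:.5f}")
```

The step has the same Euclidean length in both cases, yet the resulting change in the policy differs by orders of magnitude. This is the sense in which a fixed learning rate is not a reliable measure of how far the behavior moves.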

This conceptual framework is operationalized in Trust Region Policy Optimization (TRPO). Instead of a simple gradient update, TRPO solves a constrained optimization problem: maximize the surrogate advantage objective subject to a bound on the average KL divergence across states visited by the old policy: $\mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}}} [D_{KL}(\pi_{\theta_{old}}(\cdot|s) \parallel \pi_{\theta}(\cdot|s))] \le \delta$. This keeps each update inside a region where the surrogate remains a trustworthy estimate of true performance, which greatly reduces the risk of the 'performance crash' often seen in standard actor-critic methods.
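One ingredient of enforcing such a constraint in practice is a backtracking line search: propose a step, measure the KL divergence it causes, and shrink the step until the constraint holds. The sketch below shows only that idea, on a hypothetical single-state softmax policy parameterized by logits. It is not a TRPO implementation: the search direction is given rather than computed from a natural gradient, there is no expectation over states, and the surrogate objective is not checked. The function name `backtrack_to_trust_region` and its parameters are assumptions made for this example.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # D_KL(p || q) for strictly positive probability vectors.
    return float(np.sum(p * np.log(p / q)))

def backtrack_to_trust_region(theta_old, direction, delta, shrink=0.5, max_iters=30):
    """Shrink the step along `direction` until D_KL(pi_old || pi_new) <= delta.

    Returns (theta_new, accepted_step). Falls back to no update if no step
    satisfies the constraint within `max_iters` shrinkages.
    """
    pi_old = softmax(theta_old)
    step = 1.0
    for _ in range(max_iters):
        theta_new = theta_old + step * direction
        if kl(pi_old, softmax(theta_new)) <= delta:
            return theta_new, step
        step *= shrink
    return theta_old.copy(), 0.0

theta_old = np.array([0.0, 0.0, 0.0])
direction = np.array([3.0, -1.0, -2.0])  # stand-in for an update direction
delta = 0.01                             # illustrative trust-region radius

theta_new, accepted_step = backtrack_to_trust_region(theta_old, direction, delta)
final_kl = kl(softmax(theta_old), softmax(theta_new))

print(f"accepted step size: {accepted_step}")
print(f"KL(old || new):     {final_kl:.5f} (limit {delta})")
```

The full-size step is rejected because it would move the policy far outside the trust region; the search accepts the first shrunken step whose KL divergence is at most $\delta$.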

While TRPO is theoretically sound, it relies on a second-order approximation of the KL constraint. The Hessian of the KL divergence at $\theta_{old}$ is the Fisher information matrix, and forming or inverting it for a large network is computationally expensive. Proximal Policy Optimization (PPO) simplifies this by using a clipped objective function. PPO does not strictly enforce a KL constraint; instead, it clips the probability ratio $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ to an interval $[1 - \epsilon, 1 + \epsilon]$, which removes the incentive to push the ratio further outside that range. This discourages the new policy from deviating too far from the old one, mirroring the intent of the KL constraint while using only first-order gradients.
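The clipped objective can be sketched in a few lines of NumPy. This is a minimal illustration that assumes per-sample log-probabilities and advantage estimates are already available. The function name `ppo_clipped_objective`, the sample values, and the clip range `epsilon=0.2` are choices made for this example rather than anything fixed by the text above.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples.

    r is the probability ratio pi_new(a|s) / pi_old(a|s), computed from
    log-probabilities for numerical stability.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Taking the minimum makes the objective a pessimistic bound on the
    # unclipped surrogate.
    return float(np.mean(np.minimum(unclipped, clipped)))

# Two made-up samples: the new policy strongly favors the first action.
logp_old = np.log(np.array([0.5, 0.5]))
logp_new = np.log(np.array([0.9, 0.1]))
advantages = np.array([1.0, -1.0])

objective = ppo_clipped_objective(logp_new, logp_old, advantages)
print(f"clipped surrogate objective: {objective:.4f}")
```

In the first sample the ratio is 1.8 with a positive advantage, so the term is capped at 1.2 and raising the ratio further earns nothing. In the second sample the ratio is 0.2 with a negative advantage, so the minimum selects the clipped term (0.8 times the advantage) and lowering the ratio further earns nothing either.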

Ultimately, the role of KL divergence in policy updates is to balance the speed of improvement against stability. By quantifying the 'distance' traveled in the space of behaviors, we transform the optimization process from a risky leap in the dark into a series of calculated, more reliable steps. The underlying theory guarantees monotonic improvement only for an idealized update whose KL step is sufficiently small; practical algorithms such as TRPO and PPO approximate that update, so they improve far more reliably than unconstrained methods but not provably monotonically.