Constraining Policy Updates via Kullback–Leibler Divergence

In reinforcement learning, a central challenge is updating an agent's policy to improve performance without destroying the very behaviors that made learning possible in the first place. If we change the policy too drastically based on noisy gradient estimates, the agent may forget how to survive in its environment, leading to what we call 'policy collapse.' The Kullback–Leibler (KL) divergence serves as a mathematical anchor: it measures how much the new policy's action distribution differs from the old one, so that updates can be kept within a safe 'trust region.'

Intuitively, think of the KL divergence as a measure of surprise or information loss. If you have a map of a city (the old policy) and I give you a slightly updated map (the new policy), the KL divergence quantifies how confused you would be if you used the new map while expecting the old one. In policy optimization, we want to maximize reward, but we must penalize changes that make the new policy 'too surprising' relative to the data collected by the old policy.

Mathematically, for two discrete probability distributions $P$ and $Q$, the KL divergence is defined as $D_{KL}(P || Q) = \sum_x P(x) \log \left( \frac{P(x)}{Q(x)} \right)$. It is crucial to note that this measure is asymmetric, meaning that in general $D_{KL}(P || Q) \neq D_{KL}(Q || P)$. In this section, $P$ represents the new policy $\pi_{\theta'}$ and $Q$ represents the old policy $\pi_{\theta}$, so the divergence measures the cost of approximating the new behavior using the old distribution. Some formulations, including the constraint in the original TRPO paper, place the old policy first; for small updates the two orderings agree to second order.
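To make the definition and its asymmetry concrete, here is a minimal sketch in plain Python. The function name `kl_divergence` and the two three-action distributions are illustrative assumptions, not taken from any particular library or experiment.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # terms with P(x) = 0 contribute nothing by convention
        if q_x == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x)
    return total

# Hypothetical action probabilities for a single state.
old_policy = [0.5, 0.4, 0.1]
new_policy = [0.7, 0.2, 0.1]

forward = kl_divergence(new_policy, old_policy)  # D_KL(new || old)
reverse = kl_divergence(old_policy, new_policy)  # D_KL(old || new)

print(f"D_KL(new || old) = {forward:.4f}")
print(f"D_KL(old || new) = {reverse:.4f}")
```

The two printed values differ, which is the asymmetry described above, and the divergence of a distribution from itself is zero.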

When optimizing a policy parameterized by $\theta$, we aim to improve an objective function $J(\theta)$ while constraining the KL divergence between successive policies. This creates a constrained optimization problem: $\max_{\theta'} \mathbb{E}_{s \sim \rho, a \sim \pi_{\theta}} \left[ \frac{\pi_{\theta'}(a|s)}{\pi_{\theta}(a|s)} A_{\pi_{\theta}}(s, a) \right]$ subject to $\mathbb{E}_{s} [D_{KL}(\pi_{\theta'}(\cdot|s) || \pi_{\theta}(\cdot|s))] \le \delta$. The importance ratio is what makes the objective depend on $\theta'$, since states and actions are sampled under the old policy $\pi_{\theta}$. Here, $\delta$ is a small hyperparameter that bounds the step size in distribution space rather than in parameter space.
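The constrained problem can be illustrated on a tiny hand-made batch. The sketch below assumes the usual importance-weighted surrogate, in which each advantage is weighted by the ratio $\pi_{\theta'}(a|s) / \pi_{\theta}(a|s)$, and it estimates the expected KL by averaging over the sampled states. The batch layout, the helper names, and the value of `delta` are all hypothetical.

```python
import math

def kl(p, q):
    # D_KL(P || Q); assumes Q has full support wherever P is positive.
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0.0)

def surrogate_and_mean_kl(batch):
    """Return (importance-weighted surrogate, mean KL(new || old)) over a batch.

    Each batch entry is (old_probs, new_probs, action, advantage), where the
    action was sampled from old_probs.
    """
    surrogate = 0.0
    mean_kl = 0.0
    for old_probs, new_probs, action, advantage in batch:
        ratio = new_probs[action] / old_probs[action]
        surrogate += ratio * advantage
        mean_kl += kl(new_probs, old_probs)
    n = len(batch)
    return surrogate / n, mean_kl / n

# Hypothetical two-action problem with three sampled (state, action) pairs.
batch = [
    ([0.5, 0.5], [0.6, 0.4], 0, 1.0),
    ([0.5, 0.5], [0.6, 0.4], 1, -1.0),
    ([0.8, 0.2], [0.75, 0.25], 0, 0.5),
]
delta = 0.05  # assumed trust-region radius

surrogate, mean_kl = surrogate_and_mean_kl(batch)
step_is_allowed = mean_kl <= delta

print(f"surrogate = {surrogate:.4f}, mean KL = {mean_kl:.4f}, allowed = {step_is_allowed}")
```

A candidate $\theta'$ would be accepted only if `step_is_allowed` is true; among accepted candidates, the one with the largest surrogate is preferred.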

This formulation leads directly to Trust Region Policy Optimization (TRPO), where the constraint is motivated by a monotonic improvement guarantee that practical implementations satisfy only approximately. Approximating the KL divergence with a second-order Taylor expansion around $\theta$ gives $D_{KL} \approx \frac{1}{2} (\theta' - \theta)^\top F (\theta' - \theta)$, where $F$ is the Fisher Information Matrix. With a linearized objective, the update rule becomes a natural gradient step: $\theta' = \theta + \alpha F^{-1} \nabla_{\theta} J(\theta)$, where the inverse Fisher matrix rescales the gradient to account for the curvature of the policy space and $\alpha$ is chosen so that the KL constraint is satisfied.
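The link between the KL divergence and the Fisher matrix can be checked numerically. The sketch below assumes a softmax policy over three actions whose parameters are the logits themselves, for which the Fisher matrix has the closed form $F = \mathrm{diag}(p) - p p^\top$. It compares the exact KL after a small parameter step with the quadratic form $\frac{1}{2} \Delta^\top F \Delta$. The logits and the step are arbitrary illustrative values, and the sketch does not compute $F^{-1}$ or a full natural gradient step.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0.0)

def fisher_softmax(probs):
    # Fisher matrix of a softmax distribution with respect to its logits:
    # F = diag(p) - p p^T.
    n = len(probs)
    return [
        [(probs[i] if i == j else 0.0) - probs[i] * probs[j] for j in range(n)]
        for i in range(n)
    ]

def quadratic_form(matrix, vec):
    n = len(vec)
    return sum(vec[i] * matrix[i][j] * vec[j] for i in range(n) for j in range(n))

theta = [0.2, -0.1, 0.5]    # hypothetical logits of the old policy
step = [0.02, -0.03, 0.01]  # hypothetical small parameter update
theta_new = [t + d for t, d in zip(theta, step)]

p_old = softmax(theta)
p_new = softmax(theta_new)

exact_kl = kl(p_new, p_old)
approx_kl = 0.5 * quadratic_form(fisher_softmax(p_old), step)

print(f"exact KL      = {exact_kl:.6e}")
print(f"0.5 * d^T F d = {approx_kl:.6e}")
```

For a step this small the two values agree closely; the gap grows as the step gets larger, which is one reason TRPO-style methods typically verify the actual KL after proposing a step.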

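While TRPO is theoretically elegant, working with the Fisher matrix is computationally expensive for large neural networks, even when conjugate gradient is used to avoid forming the inverse explicitly. Proximal Policy Optimization (PPO) replaces the hard constraint with a first-order surrogate. One variant moves the KL term into the objective function as a penalty; the more widely used variant instead maximizes the clipped objective $L^{CLIP}(\theta) = \mathbb{E} [\min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t)]$, where $r_t(\theta) = \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$ is the probability ratio. Although clipping does not compute a KL divergence explicitly, the underlying principle is the same: once the ratio leaves $[1-\epsilon, 1+\epsilon]$ in the direction favored by the advantage, the objective stops rewarding further change, which removes the incentive for the policy to drift too far.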
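The clipped objective is short enough to write out directly. The following is a minimal sketch in plain Python, not a training loop: the function name `clipped_surrogate`, the sample ratios and advantages, and the default `epsilon=0.2` are illustrative assumptions.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Batch average of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum makes the objective a pessimistic bound:
        # gains from moving the ratio far outside the clip range are ignored.
        total += min(r * adv, clipped_r * adv)
    return total / len(ratios)

# Hypothetical probability ratios and advantage estimates.
ratios = [1.5, 0.5, 1.1, 0.5]
advantages = [1.0, -1.0, 2.0, 1.0]

objective = clipped_surrogate(ratios, advantages)
print(f"clipped objective = {objective:.4f}")
```

In the first sample the ratio 1.5 is clipped to 1.2, so pushing the probability of that action even higher earns no extra credit. In the last sample the ratio moved against a positive advantage, so the unclipped (smaller) term is kept and the objective still reflects the loss.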

The role of KL divergence extends beyond just stabilization; it changes the geometry of the optimization landscape by measuring step size in terms of the policy's output distribution rather than its raw parameters. Without this constraint, standard gradient ascent might take large steps in directions where a small change in parameters produces a disproportionately large change in action probabilities, leading to high variance and instability. By constraining the KL divergence, we keep every update step conservative, prioritizing reliable improvement over risky, high-magnitude jumps.

In conclusion, the Kullback–Leibler divergence is a cornerstone of modern, stable policy gradient methods. It transforms the chaotic search for optimal behavior into a disciplined walk through probability space, balancing the exploration of new strategies with the preservation of acquired knowledge. Whether implemented as a hard constraint in TRPO or as a penalty or clipping heuristic in PPO, respecting the information distance between policies is essential for training robust deep reinforcement learning agents.