Kullback–Leibler (KL) Divergence and its Role in Constraining Policy Updates

At its core, KL divergence measures how much one probability distribution differs from a second, reference probability distribution. In the context of Reinforcement Learning (RL), we are often updating a policy $\pi_{\theta}$, which is a mapping from states to action probabilities. If we update the parameters $\theta$ too aggressively, the new policy $\pi_{\theta_{new}}$ might diverge drastically from the old policy $\pi_{\theta_{old}}$. This can cause a 'collapse' in which the agent forgets previously learned successful behaviors or enters a region of the state space where the gradients are uninformative, leading to unstable training.

Mathematically, for two discrete probability distributions $P$ and $Q$, the KL divergence is defined as the expected value, under $P$, of the logarithmic difference between the two: $D_{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$. In the continuous case, the summation is replaced by an integral. Crucially, $D_{KL}$ is not symmetric, meaning that in general $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$, and it is always non-negative, equaling zero if and only if $P$ and $Q$ are identical. Because it is asymmetric and does not satisfy the triangle inequality, it is not a true distance metric, but its non-negativity and its zero-only-at-equality property make it a useful measure of how far a policy has drifted during an update.
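To make the definition concrete, here is a minimal NumPy sketch of the discrete formula. The helper name `kl_divergence` and the two example distributions are illustrative choices, not taken from any library, and the code assumes $Q(x) > 0$ wherever $P(x) > 0$.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability arrays.

    Assumes q > 0 wherever p > 0; terms with p == 0 contribute zero.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two hypothetical action distributions over three actions.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

kl_pq = kl_divergence(p, q)  # D_KL(P || Q)
kl_qp = kl_divergence(q, p)  # D_KL(Q || P), generally a different number
kl_pp = kl_divergence(p, p)  # zero for identical distributions

print(kl_pq, kl_qp, kl_pp)
```

Swapping the arguments changes the result, which is the asymmetry described above, while comparing a distribution with itself gives zero.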

In standard Policy Gradient methods, we perform gradient ascent on the expected return $J(\theta)$. The update rule $\theta_{t+1} = \theta_t + \alpha \nabla_{\theta} J(\theta_t)$ moves parameters in the direction of steepest ascent. However, even a small change in $\theta$ can lead to a large change in the resulting distribution $\pi_{\theta}$, because the mapping from parameter space to distribution space is non-linear: the same step size can barely alter the policy in one region of parameter space and transform it in another. This mismatch often leads to the 'policy collapse' phenomenon, where a single bad update destroys the policy's performance.
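A small worked example may help here. For two Gaussians with the same standard deviation $\sigma$, the KL divergence has the closed form $(\mu_{new} - \mu_{old})^2 / (2\sigma^2)$. The sketch below assumes a hypothetical one-dimensional Gaussian policy whose only trainable parameter is its mean, and applies the same parameter step to a wide and a narrow policy.

```python
def kl_gaussian_same_std(mu_old, mu_new, sigma):
    """D_KL( N(mu_old, sigma^2) || N(mu_new, sigma^2) ) in closed form."""
    return (mu_new - mu_old) ** 2 / (2.0 * sigma ** 2)

step = 0.1  # identical step in parameter space for both policies

kl_wide = kl_gaussian_same_std(0.0, step, sigma=1.0)     # about 0.005
kl_narrow = kl_gaussian_same_std(0.0, step, sigma=0.01)  # about 50

print(kl_wide, kl_narrow)
```

The parameter step is the same in both cases, yet the change measured in distribution space differs by four orders of magnitude, which is why a fixed learning rate $\alpha$ gives no direct control over how much the policy itself changes.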

To solve this, we introduce a constraint on the update: we want to maximize the objective $J(\theta)$ while ensuring that the KL divergence between the old policy and the new policy remains below a small threshold $\delta$. The optimization problem becomes: $\max_{\theta} J(\theta)$ subject to $D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \leq \delta$. By constraining the update to a 'Trust Region,' we ensure that the new policy stays close enough to the old one that the local approximation of the reward surface remains valid.
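One simple way to respect such a constraint is to shrink a proposed step until the KL divergence falls below $\delta$. The sketch below is a backtracking heuristic under stated assumptions, not a full trust-region method: it assumes a hypothetical single-state softmax policy whose parameters are the logits, and the names `constrained_step`, `shrink`, and `max_tries` are illustrative.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q):
    # Softmax outputs are strictly positive here, so the log is well defined.
    return float(np.sum(p * np.log(p / q)))

def constrained_step(theta_old, direction, delta,
                     step=1.0, shrink=0.5, max_tries=20):
    """Halve the step until D_KL(pi_old || pi_new) <= delta."""
    pi_old = softmax(theta_old)
    for _ in range(max_tries):
        theta_new = theta_old + step * direction
        if kl_divergence(pi_old, softmax(theta_new)) <= delta:
            return theta_new, step
        step *= shrink
    return theta_old, 0.0  # no acceptable step found: keep the old policy

theta_old = np.zeros(3)                 # uniform policy over three actions
direction = np.array([5.0, 0.0, -5.0])  # an aggressive proposed update
delta = 0.01

theta_new, step_used = constrained_step(theta_old, direction, delta)
print(step_used, kl_divergence(softmax(theta_old), softmax(theta_new)))
```

The full step would move the policy far outside the trust region, so the loop accepts only a much smaller fraction of it.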

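Trust Region Policy Optimization (TRPO) formalizes this by using the Fisher Information Matrix (FIM). The KL divergence can be approximated locally using a second-order Taylor expansion around $\theta_{old}$: the zeroth- and first-order terms vanish, and the Hessian of the KL divergence with respect to $\theta$, evaluated at $\theta = \theta_{old}$, is exactly the FIM, denoted as $H$. This gives $D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \approx \frac{1}{2} (\theta - \theta_{old})^{\top} H (\theta - \theta_{old})$. The update direction is then modified to $\Delta \theta \propto H^{-1} \nabla_{\theta} J(\theta)$, with the step length chosen so that the KL constraint holds. This transforms the vanilla gradient update into a Natural Gradient update, which moves the policy along the steepest ascent direction in the space of distributions rather than the space of parameters.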
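The second-order approximation can be checked numerically in a small case. The sketch below assumes a hypothetical softmax policy over three actions parameterized directly by its logits, for which the FIM has the closed form $\mathrm{diag}(p) - p p^{\top}$. Because that matrix is singular (adding a constant to every logit leaves the policy unchanged), the natural gradient is computed with a small damping term, which is an assumption of this sketch rather than part of the definition.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

def fisher_softmax(p):
    """FIM of a categorical distribution with respect to its logits."""
    return np.diag(p) - np.outer(p, p)

def natural_gradient(grad, p, damping=1e-3):
    """Solve (H + damping * I) x = grad rather than inverting H explicitly."""
    H = fisher_softmax(p)
    return np.linalg.solve(H + damping * np.eye(len(p)), grad)

theta_old = np.array([0.5, -0.2, 0.1])
pi_old = softmax(theta_old)

# Compare the exact KL with the quadratic form for a small parameter step d.
d = np.array([0.01, -0.02, 0.005])
kl_exact = kl_divergence(pi_old, softmax(theta_old + d))
kl_quadratic = 0.5 * d @ fisher_softmax(pi_old) @ d
print(kl_exact, kl_quadratic)

# Natural-gradient direction for an illustrative (made-up) gradient vector.
grad = np.array([1.0, 0.0, -1.0])
nat_grad = natural_gradient(grad, pi_old)
print(nat_grad)
```

For small steps the two KL values agree closely, and the gap grows as the step gets larger, which is why the quadratic model is only trusted inside a small region.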

While TRPO is theoretically robust, working with the FIM is computationally expensive: forming and inverting it explicitly is impractical for large networks, and although TRPO sidesteps explicit inversion with the conjugate gradient method, it still requires second-order computations at every update. This led to the development of Proximal Policy Optimization (PPO). PPO replaces the explicit KL constraint with a clipped objective function: $L^{CLIP}(\theta) = \hat{E}_t [ \min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t) ]$, where $r_t(\theta)$ is the probability ratio $\frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ and $\hat{A}_t$ is the estimated advantage. Clipping removes the incentive to push this ratio outside $[1-\epsilon, 1+\epsilon]$, so PPO implicitly discourages the new policy from deviating too far from the old one, mimicking the behavior of a KL constraint using only first-order gradients.
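The clipped objective is short enough to write out directly. The following NumPy sketch evaluates $L^{CLIP}$ on a hand-made batch of three samples. The function name, the log-probabilities, and the advantages are all illustrative, and a real implementation would compute this inside an automatic-differentiation framework so that it can be maximized with respect to $\theta$.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    ratio = np.exp(logp_new - logp_old)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

# Hypothetical batch: probabilities of the taken actions under each policy.
logp_old = np.log(np.array([0.2, 0.4, 0.5]))
logp_new = np.log(np.array([0.3, 0.2, 0.5]))  # ratios: 1.5, 0.5, 1.0
advantages = np.array([1.0, -1.0, 2.0])

value = ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2)
print(value)  # about 0.8
```

In the first sample the ratio 1.5 is clipped to 1.2, so raising that action's probability further earns no extra objective. In the second, the ratio 0.5 with a negative advantage is clipped to 0.8, giving -0.8 instead of -0.5, so lowering that probability further also earns nothing. The third sample is unaffected because its ratio is already inside the clipping range.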