Kullback–Leibler (KL) Divergence and the Stability of Policy Updates

At its core, the Kullback–Leibler (KL) divergence, often called relative entropy, is a measure of how one probability distribution differs from a second, reference probability distribution. In the context of Machine Learning, imagine you have a 'true' distribution $P$ and an approximating distribution $Q$. KL divergence quantifies the amount of information lost when we use $Q$ to represent $P$. Crucially, it is not a true distance metric because it is asymmetric: $D_{KL}(P \parallel Q) \\≠ D_{KL}(Q \parallel P)$. This asymmetry is fundamental to how we constrain policy updates in reinforcement learning, as we care specifically about how the new policy deviates from the old one.

Mathematically, for discrete probability distributions, the KL divergence is defined as the expected value of the logarithmic difference between the two distributions. The formula is given by: $$D_{KL}(P \parallel Q) = \sum_{x \\∈ \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} = \sum_{x \\∈ \mathcal{X}} P(x) \log P(x) - \sum_{x \\∈ \mathcal{X}} P(x) \log Q(x)$$ The first term is the negative entropy of $P$, and the second term is the cross-entropy between $P$ and $Q$. Because of Gibbs' inequality, $D_{KL}(P \parallel Q) \\≥ 0$, with equality if and only if $P = Q$ almost everywhere.

In Reinforcement Learning, we represent a policy as a distribution $\pi_{\theta}(a|s)$ over actions $a$ given state $s$. When we perform a gradient update to improve the expected return, a standard step in the direction of the gradient $\nabla_{\theta} J(\theta)$ can be dangerously large. If the parameters $\theta$ change too significantly, the policy might shift to a region of the action space that the agent has not explored, leading to a 'collapse' in performance. This is the 'step-size' problem: a step that is too small leads to slow convergence, but a step that is too large can destroy the policy's stability.

To solve this, we introduce a constraint on the update using KL divergence. Instead of simply maximizing the objective function $J(\theta)$, we maximize it subject to a constraint on the divergence between the old policy $\pi_{\theta_{old}}$ and the new policy $\pi_{\theta}$: $$\\max_{\theta} J(\theta) \quad \text{subject to} \quad D_{KL}(\pi_{\theta_{old}}(\\·|s) \parallel \pi_{\theta}(\\·|s)) \\≤ \delta$$ Here, $\delta$ is a hyperparameter that defines a 'trust region.' By keeping the KL divergence small, we ensure that the new policy remains close to the distribution of the old policy, guaranteeing that the policy update is conservative and stable.

This conceptual framework is the engine behind Trust Region Policy Optimization (TRPO). TRPO leverages the fact that the KL divergence is related to the Fisher Information Matrix (FIM). Specifically, the second-order Taylor expansion of the KL divergence is: $D_{KL}(\pi_{\theta} \parallel \pi_{\theta + \Delta\theta}) \approx \frac{1}{2} \Delta\theta^T F \Delta\theta$, where $F$ is the Fisher Information Matrix. By utilizing this quadratic approximation, TRPO can compute a natural gradient update, which adjusts the parameters based on the geometry of the probability distribution space rather than the flat Euclidean space of the parameter vector $\theta$.

Finally, the Proximal Policy Optimization (PPO) algorithm simplifies this constraint by using a clipped surrogate objective. Instead of a hard constraint on the KL divergence, PPO effectively penalizes large changes in the ratio $r_t(\theta) = \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$. While PPO's clipping is a first-order approximation, its goal remains the same as the KL constraint: preventing the update from moving too far from the current distribution. Whether through hard constraints or clipping, preventing the 'divergence' of the policy is the key to achieving robust, monotonic improvement in complex agent behaviors.