Kullback–Leibler (KL) Divergence and Its Role in Constraining Policy Updates

At its simplest, the Kullback–Leibler (KL) divergence measures how one probability distribution differs from a second, reference probability distribution. In machine learning, it quantifies the information lost when an approximate distribution $Q$ is used to represent a true distribution $P$. Unlike a true distance metric, KL divergence is asymmetric: in general, $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$. This asymmetry matters because the direction we minimize determines the behavior of the approximation: minimizing $D_{KL}(P \parallel Q)$ pushes $Q$ to cover all modes of the target distribution, while minimizing $D_{KL}(Q \parallel P)$ pushes $Q$ to concentrate on its most probable regions.

Mathematically, for discrete probability distributions $P$ and $Q$ defined on the same probability space, the KL divergence is defined as the expected value of the logarithmic difference between the two distributions: $$D_{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right)$$ For continuous distributions, the summation is replaced by an integral: $$D_{KL}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log \left( \frac{p(x)}{q(x)} \right) dx$$ This formulation shows that if $P$ and $Q$ are identical, the ratio is 1 and the log is 0, resulting in a divergence of zero.
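To make the discrete formula concrete, here is a minimal sketch in plain Python. The function name `kl_divergence` and the two example distributions are illustrative assumptions, not part of any library; the code uses the common conventions that a term with $P(x) = 0$ contributes nothing and that the divergence is infinite when $P$ puts mass where $Q$ has none.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if q_x == 0.0:
            return math.inf  # P assigns mass where Q assigns none
        total += p_x * math.log(p_x / q_x)
    return total


# Hypothetical example distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

forward = kl_divergence(p, q)  # D_KL(P || Q)
reverse = kl_divergence(q, p)  # D_KL(Q || P)
print(f"forward={forward:.4f} reverse={reverse:.4f}")
```

Running this on the two example distributions gives different values for the two directions, which illustrates the asymmetry, while `kl_divergence(p, p)` evaluates to zero.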

In Reinforcement Learning (RL), we optimize a policy $\pi_{\theta}(a|s)$ to maximize the expected return. A naive gradient ascent update on the policy parameters $\theta$ can lead to excessively large steps in the parameter space. Because the relationship between $\theta$ and the resulting probability distribution $\pi_{\theta}$ is non-linear, a small change in $\theta$ can lead to a massive shift in the policy's behavior. This often results in a 'performance collapse,' where the agent forgets previously learned successful behaviors and enters a state of instability.

To mitigate this, we introduce the KL divergence as a constraint on the policy update. Instead of maximizing the objective $J(\theta)$ without limits, we seek a new policy $\pi_{\theta'}$ that maximizes the objective while keeping its divergence from the old policy $\pi_{\theta}$ below a threshold $\delta$: $$\max_{\theta'} J(\theta') \quad \text{subject to} \quad D_{KL}(\pi_{\theta} \parallel \pi_{\theta'}) \le \delta$$ This keeps the updated policy inside a 'trust region' around the previous policy, the neighborhood in which the local approximation of the objective function can be expected to remain accurate.
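The constrained update can be illustrated on a toy problem. The sketch below assumes a single-state softmax policy over three actions with known advantages, and enforces the constraint with a simple backtracking search that shrinks the step until $D_{KL}(\pi_{\theta} \parallel \pi_{\theta'}) \le \delta$. All names (`constrained_step`, `shrink`, `max_tries`) are hypothetical, and this is not the TRPO procedure itself, only a minimal way to see the constraint at work.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl(p, q):
    """D_KL(P || Q) for discrete distributions with strictly positive Q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def constrained_step(theta, advantages, delta=0.01, step=10.0, shrink=0.5, max_tries=50):
    """Gradient ascent step on the logits, shrunk until KL(old || new) <= delta."""
    pi_old = softmax(theta)
    baseline = sum(p * a for p, a in zip(pi_old, advantages))
    # Gradient of sum_a pi(a) * A(a) with respect to the softmax logits.
    grad = [p * (a - baseline) for p, a in zip(pi_old, advantages)]
    for _ in range(max_tries):
        candidate = [t + step * g for t, g in zip(theta, grad)]
        if kl(pi_old, softmax(candidate)) <= delta:
            return candidate
        step *= shrink  # step left the trust region: try a smaller one
    return list(theta)  # no acceptable step found; keep the old parameters


theta = [0.0, 0.0, 0.0]        # uniform initial policy (assumed)
advantages = [1.0, 0.0, -1.0]  # assumed advantages for the three actions
new_theta = constrained_step(theta, advantages, delta=0.01)
print(softmax(new_theta), kl(softmax(theta), softmax(new_theta)))
```

The deliberately large initial step of 10.0 would move the policy far from uniform; the search rejects it and accepts the first step whose divergence falls within $\delta$.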

This concept is the theoretical bedrock of Trust Region Policy Optimization (TRPO). TRPO keeps the hard constraint but makes it tractable by approximating the KL divergence to second order around the current parameters. The Hessian of that approximation is the Fisher Information Matrix $F$, which describes the local curvature of the space of probability distributions: $$F(\theta) = \mathbb{E}_{s \sim \mu, a \sim \pi_{\theta}} [\nabla_{\theta} \log \pi_{\theta}(a|s) \nabla_{\theta} \log \pi_{\theta}(a|s)^T]$$ so that $D_{KL}(\pi_{\theta} \parallel \pi_{\theta + \Delta\theta}) \approx \frac{1}{2} \Delta\theta^T F(\theta) \Delta\theta$ for small $\Delta\theta$. By scaling the gradient by the inverse of $F$ (the natural gradient direction), the algorithm takes steps that are consistent in terms of actual distribution change, largely independent of how the policy is parameterized.
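The link between the Fisher Information Matrix and the second-order approximation of the KL divergence can be checked numerically. The sketch below assumes a single-state softmax policy whose parameters are the logits; in that special case the score is $\nabla_{\theta} \log \pi_{\theta}(a) = e_a - \pi_{\theta}$, so the expectation reduces to $F = \mathrm{diag}(\pi) - \pi \pi^T$. The function names and the example values are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl(p, q):
    """D_KL(P || Q) for discrete distributions with strictly positive Q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def fisher_softmax(theta):
    """Fisher matrix of a softmax policy w.r.t. its logits: diag(pi) - pi pi^T."""
    pi = softmax(theta)
    n = len(pi)
    return [
        [(pi[i] if i == j else 0.0) - pi[i] * pi[j] for j in range(n)]
        for i in range(n)
    ]


def quadratic_form(matrix, d):
    """Compute d^T M d for a square matrix given as nested lists."""
    n = len(d)
    return sum(d[i] * matrix[i][j] * d[j] for i in range(n) for j in range(n))


theta = [0.2, -0.1, 0.5]     # assumed current logits
d = [0.01, -0.02, 0.015]     # assumed small parameter change

exact = kl(softmax(theta), softmax([t + x for t, x in zip(theta, d)]))
approx = 0.5 * quadratic_form(fisher_softmax(theta), d)
print(f"exact KL={exact:.3e} quadratic approximation={approx:.3e}")
```

For a small perturbation the two numbers agree closely, and the gap grows as the perturbation gets larger, which is why the approximation is only trusted inside a small region. Note that this particular $F$ is singular, because adding a constant to every logit leaves the policy unchanged, so inverting it in practice would need damping or a pseudo-inverse.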

Finally, Proximal Policy Optimization (PPO) simplifies this by using a clipped surrogate objective to achieve a similar effect without the computational overhead of the Fisher matrix. While PPO does not strictly enforce a KL constraint via Lagrange multipliers, its clipping mechanism removes the incentive to push the ratio $r_t(\theta) = \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$ outside the interval $[1 - \epsilon, 1 + \epsilon]$, that is, too far from 1. In essence, constraining the KL divergence turns RL from a volatile search in parameter space into a more stable evolution of probability distributions. The monotonic improvement guarantee holds for the idealized trust-region update; practical algorithms such as TRPO and PPO approximate it rather than ensure it.
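The clipping mechanism is short enough to write out. The sketch below computes the clipped surrogate $\min\left(r_t A_t, \mathrm{clip}(r_t, 1 - \epsilon, 1 + \epsilon) A_t\right)$ averaged over a batch, taking log-probabilities as inputs. The function name, the argument names, and the default $\epsilon = 0.2$ are assumptions made for illustration rather than a reference implementation.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch of (state, action) samples."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_t = pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum makes the objective a pessimistic bound:
        # gains from moving the ratio beyond the clip range are ignored.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + epsilon) * A.
capped_gain = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])

# Ratio 0.5 with a negative advantage: the term is capped at (1 - epsilon) * A.
capped_loss = ppo_clip_objective([math.log(0.5)], [0.0], [-1.0])

# Ratio 0.5 with a positive advantage: the unclipped term is smaller, so it is kept.
unclipped = ppo_clip_objective([math.log(0.5)], [0.0], [1.0])
print(capped_gain, capped_loss, unclipped)
```

In the first two cases the objective stops changing once the ratio leaves the clip range, so the gradient offers no reward for moving the policy further in that direction. In the third case the policy has moved the wrong way, and the unclipped term still penalizes it in full.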