Kullback–Leibler (KL) Divergence and its Role in Constraining Policy Updates

At its core, Kullback–Leibler (KL) divergence measures how one probability distribution differs from a second, reference probability distribution. In the context of machine learning, imagine you have a 'truth' distribution $P$ and an approximation $Q$. KL divergence quantifies the amount of information lost when $Q$ is used to approximate $P$. Unlike a true distance metric, it is not symmetric: the divergence from $P$ to $Q$ is generally not the same as from $Q$ to $P$. It is always non-negative, and it equals zero if and only if the two distributions are identical (almost everywhere, in the continuous case).

Mathematically, for discrete probability distributions, the KL divergence is defined as the expected value, taken under $P$, of the logarithmic difference between the probabilities. For distributions $P$ and $Q$ over a space $\mathcal{X}$, the formula is: $$D_{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$ In the continuous case, the summation is replaced by an integral over the densities. The divergence can also be read as the difference between the cross-entropy of $P$ and $Q$ and the entropy of $P$: $D_{KL}(P \parallel Q) = H(P, Q) - H(P)$.
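
The definition above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the two example distributions are illustrative choices, and it assumes $Q(x) > 0$ wherever $P(x) > 0$ (otherwise the divergence is infinite and the code would fail on a division by zero).

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions given as lists."""
    # Terms with P(x) == 0 contribute nothing, by the convention 0 * log 0 = 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(P) = -sum_x P(x) log P(x)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

forward = kl_divergence(p, q)  # D_KL(P || Q)
reverse = kl_divergence(q, p)  # D_KL(Q || P)

print(f"D_KL(P || Q) = {forward:.4f}")
print(f"D_KL(Q || P) = {reverse:.4f}")
print(f"H(P, Q) - H(P) = {cross_entropy(p, q) - entropy(p):.4f}")
```

Running it shows two different values for the two directions, and the third line matches the first, which is the cross-entropy identity stated above.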

In Reinforcement Learning (RL), we typically optimize a policy $\pi_{\theta}$ to maximize the expected return $J(\theta)$. A naive gradient ascent update $\theta_{t+1} = \theta_t + \alpha \nabla_{\theta} J(\theta_t)$ can be risky. Because the policy is a probability distribution over actions, the size of a step in parameter space says little about the size of the resulting change in $\pi_{\theta}$: a seemingly moderate step in $\theta$ can produce a massive shift in the distribution. If the new policy becomes overly deterministic or collapses into a suboptimal region of the action space, the agent may stop exploring and fail to recover, leading to a catastrophic drop in performance.
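
To make the mismatch between parameter space and distribution space concrete, here is a small sketch. It assumes a two-action softmax policy whose parameters are the logits themselves, which is a simplification chosen for illustration. Three steps of identical Euclidean length are applied, and the KL divergence between the old and new action distributions is compared.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """D_KL(P || Q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_after_step(theta, step):
    """KL between the policy at theta and the policy at theta + step."""
    old = softmax(theta)
    new = softmax([t + s for t, s in zip(theta, step)])
    return kl(old, new)

# All three steps have the same Euclidean norm, sqrt(2).
kl_uniform = kl_after_step([0.0, 0.0], [1.0, -1.0])   # start from a uniform policy
kl_peaked = kl_after_step([5.0, -5.0], [1.0, -1.0])   # start from a near-deterministic policy
kl_shift = kl_after_step([0.0, 0.0], [1.0, 1.0])      # shift both logits equally

print(f"uniform start: {kl_uniform:.6f}")
print(f"peaked start:  {kl_peaked:.6f}")
print(f"equal shift:   {kl_shift:.6f}")
```

The same step length changes the uniform policy substantially, barely changes the near-deterministic one, and leaves the policy untouched when both logits move together. A fixed learning rate in parameter space therefore does not bound how much the behavior changes.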

To mitigate this, we introduce the KL divergence as a constraint on the policy update. Instead of maximizing the objective $J(\theta)$ without limit, we solve a constrained optimization problem: maximize the objective subject to the KL divergence between the old policy $\pi_{\theta_{old}}$ and the new policy $\pi_{\theta}$ remaining below a threshold $\delta$. Because a policy defines one action distribution per state, the divergence is in practice averaged over the states visited by the old policy. This ensures that the agent's behavior changes incrementally: $$ \max_{\theta} J(\theta) \text{ subject to } D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \le \delta$$
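
One simple way to respect such a constraint is to propose a gradient step and shrink it until the KL threshold is met. The sketch below does this for a hypothetical three-armed bandit with a softmax policy over logits. The reward values, the initial step size, and the halving schedule are assumptions made for illustration, and this is a plain backtracking scheme rather than any particular published algorithm.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_reward(theta, rewards):
    """J(theta) = sum_a pi_theta(a) * r(a) for a bandit."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))

def policy_gradient(theta, rewards):
    """Exact gradient of J w.r.t. the logits: pi(a) * (r(a) - J)."""
    probs = softmax(theta)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - baseline) for p, r in zip(probs, rewards)]

def kl_constrained_step(theta, rewards, step_size=50.0, delta=0.01,
                        shrink=0.5, max_backtracks=30):
    """Shrink a gradient step until D_KL(pi_old || pi_new) <= delta."""
    old_probs = softmax(theta)
    grad = policy_gradient(theta, rewards)
    for _ in range(max_backtracks):
        candidate = [t + step_size * g for t, g in zip(theta, grad)]
        if kl(old_probs, softmax(candidate)) <= delta:
            return candidate
        step_size *= shrink
    return list(theta)  # no acceptable step found: keep the old policy

theta = [0.0, 0.0, 0.0]      # uniform initial policy
rewards = [1.0, 0.0, 0.5]    # hypothetical per-action rewards
new_theta = kl_constrained_step(theta, rewards)

print("old J:", expected_reward(theta, rewards))
print("new J:", expected_reward(new_theta, rewards))
print("KL   :", kl(softmax(theta), softmax(new_theta)))
```

The deliberately oversized initial step would make the policy almost deterministic in one update. The backtracking loop rejects it and accepts a much smaller step that still improves the expected reward while keeping the divergence under $\delta$.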

This constraint is most famously implemented in Trust Region Policy Optimization (TRPO). TRPO uses a second-order approximation of the KL divergence to define a 'trust region' around the current policy. The Hessian of the KL divergence with respect to $\theta$, evaluated at $\theta = \theta_{old}$, is the Fisher Information Matrix (FIM), denoted $F$: $F_{ij} = E_{s \sim \rho,\, a \sim \pi_{\theta}(\cdot|s)} [\nabla_{\theta_i} \log \pi_{\theta}(a|s) \nabla_{\theta_j} \log \pi_{\theta}(a|s)]$. For a small step $\Delta\theta$, this gives $D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta_{old} + \Delta\theta}) \approx \frac{1}{2} \Delta\theta^{\top} F \Delta\theta$. By constraining the update using the FIM, TRPO measures the step size in terms of the change in the distribution rather than the change in raw parameter values.
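
The link between the FIM and the KL divergence can be verified on a toy case. The sketch below assumes a single state and a softmax policy parameterized directly by its logits, so the score function is `one_hot(a) - probs`. It builds $F$ as the expected outer product of the score under the policy's own action distribution and compares $\frac{1}{2} \Delta\theta^{\top} F \Delta\theta$ with the exact KL for a small, arbitrarily chosen step.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def score(probs, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def fisher_matrix(theta):
    """F = E_{a ~ pi}[score(a) score(a)^T], computed exactly by summing over actions."""
    probs = softmax(theta)
    n = len(theta)
    F = [[0.0] * n for _ in range(n)]
    for a, pa in enumerate(probs):
        g = score(probs, a)
        for i in range(n):
            for j in range(n):
                F[i][j] += pa * g[i] * g[j]
    return F

def quadratic_form(F, d):
    n = len(d)
    return sum(d[i] * F[i][j] * d[j] for i in range(n) for j in range(n))

theta = [0.2, -0.1, 0.4]      # arbitrary logits
d = [0.01, -0.02, 0.015]      # small parameter step

exact = kl(softmax(theta), softmax([t + di for t, di in zip(theta, d)]))
approx = 0.5 * quadratic_form(fisher_matrix(theta), d)

print(f"exact KL:          {exact:.8f}")
print(f"0.5 * d^T F d:     {approx:.8f}")
```

For a step this small the two numbers agree closely, and the gap grows as the step gets larger, which is why the quadratic form is only trusted inside a small region around the current policy.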

While TRPO comes with a theoretical monotonic-improvement argument, it is computationally expensive: every update must solve a linear system involving the FIM, which in practice is done with conjugate gradient and a backtracking line search rather than an explicit matrix inversion. This led to the development of Proximal Policy Optimization (PPO). PPO replaces the explicit KL constraint with a clipped surrogate objective. Although this is not a hard KL constraint, clipping removes the incentive to move the ratio $r_t(\theta) = \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$ outside the interval $[1 - \epsilon, 1 + \epsilon]$. This maintains the spirit of the KL constraint, preventing overly aggressive updates, while using only first-order gradients for efficiency.
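
The clipped surrogate can be written in a few lines. This is a minimal sketch of the per-sample objective $\min(r_t A_t, \operatorname{clip}(r_t, 1 - \epsilon, 1 + \epsilon) A_t)$ averaged over a batch. The function name, the plain-list inputs, and the default `clip_eps=0.2` are illustrative assumptions, and a real implementation would compute this on tensors inside an autodiff framework.

```python
import math

def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Average PPO-style clipped surrogate over a batch of samples.

    new_logp / old_logp: log pi(a|s) under the new and old policy.
    advantages: advantage estimates for the same (s, a) pairs.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_t(theta)
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum makes the objective a pessimistic bound:
        # gains from pushing the ratio past the clip range are ignored.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)

# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * A.
capped_gain = clipped_surrogate([math.log(1.5)], [0.0], [1.0])

# Ratio 0.5 with a negative advantage: the value is held at 0.8 * A.
capped_loss = clipped_surrogate([math.log(0.5)], [0.0], [-1.0])

# Ratio 0.5 with a positive advantage: the unclipped, lower value is kept.
unclipped = clipped_surrogate([math.log(0.5)], [0.0], [1.0])

print(capped_gain, capped_loss, unclipped)
```

In the first two cases the objective is flat in the ratio once it leaves the clip range, so the gradient gives no reason to move the policy further from the old one. In the third case the update made a good action less likely, and the unclipped term is kept so that the mistake is still penalized.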

In summary, the KL divergence recasts policy optimization from a volatile search in parameter space into a more stable evolution in distribution space. By treating each policy as a point on a manifold of probability distributions, we keep updates 'safe,' preserving the agent's ability to explore while steadily improving its performance. This paradigm shifts the focus from 'how much should we change the weights?' to 'how much should we change the behavior?', which is a central ingredient of stability in deep reinforcement learning.