Kullback–Leibler (KL) Divergence and Trust Region Policy Optimization

At its core, the Kullback–Leibler (KL) divergence measures how one probability distribution differs from a second, reference distribution. In Reinforcement Learning (RL), a policy $\pi_{\theta}$ is a strategy for acting in an environment. If we update the parameters $\theta$ too aggressively, the new policy $\pi_{\theta'}$ may behave drastically differently from the old one. We want improvement, but a large shift in the action distribution can cause instability or performance collapse (sometimes loosely described as 'catastrophic forgetting'): the agent drifts into regions of the environment from which it cannot recover, and performance plummets.

Mathematically, for two discrete probability distributions $P$ and $Q$ defined on the same sample space, the KL divergence is defined as $D_{KL}(P \parallel Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$. In the continuous case, with densities $p$ and $q$, the sum becomes an integral: $D_{KL}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$. The divergence is finite only when $Q$ assigns positive probability wherever $P$ does. Crucially, KL divergence is non-negative and equals zero if and only if $P = Q$. However, it is not a true metric because it is asymmetric: in general, $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$. The order of the arguments therefore matters. In the TRPO constraint, the old policy $\pi_{\theta_{old}}$ takes the role of $P$ and the updated policy $\pi_{\theta}$ takes the role of $Q$.
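To make the discrete definition concrete, here is a minimal NumPy sketch. The function name `kl_divergence` and the two example distributions are illustrative choices, not part of any library, and the code assumes $Q(i) > 0$ wherever $P(i) > 0$.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors.

    Assumes q[i] > 0 wherever p[i] > 0 (otherwise the divergence is infinite).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Terms with P(i) = 0 contribute nothing, by the convention 0 * log 0 = 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two hypothetical action distributions over three actions.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

kl_pq = kl_divergence(p, q)  # roughly 0.184 nats
kl_qp = kl_divergence(q, p)  # roughly 0.192 nats
kl_pp = kl_divergence(p, p)  # zero: a distribution has no divergence from itself
```

The two directions give different values, which illustrates the asymmetry noted above, and both are non-negative.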

A central challenge in policy gradient methods, alongside the variance of the gradient estimate, is sensitivity to the step size. The standard gradient update $\theta_{t+1} = \theta_t + \alpha \nabla_{\theta} J(\theta_t)$ relies on a first-order approximation, which treats the objective $J$ as locally linear around $\theta_t$. However, the relationship between the parameters $\theta$ and the resulting distribution $\pi_{\theta}$ is highly non-linear, so a small step in parameter space can produce a massive change in the action distribution. By incorporating a KL constraint, we keep the update within a 'trust region' where the local approximation of the objective remains valid.
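A small worked example shows why Euclidean step size is a poor proxy for behavioral change. It assumes a one-dimensional Gaussian policy with fixed standard deviation, for which the KL divergence between $N(\mu_{old}, \sigma^2)$ and $N(\mu_{new}, \sigma^2)$ has the closed form $(\mu_{old} - \mu_{new})^2 / (2\sigma^2)$. The function name and the numbers are illustrative.

```python
def kl_gaussian_same_std(mu_old, mu_new, std):
    """D_KL( N(mu_old, std^2) || N(mu_new, std^2) ) in closed form."""
    return (mu_old - mu_new) ** 2 / (2.0 * std ** 2)

# The same Euclidean step of 0.1 is applied to the mean parameter in both cases.
step = 0.1

# Wide (exploratory) policy: the step barely changes the distribution.
kl_wide = kl_gaussian_same_std(0.0, step, std=1.0)     # about 0.005 nats

# Narrow (near-deterministic) policy: the same step changes it drastically.
kl_narrow = kl_gaussian_same_std(0.0, step, std=0.01)  # about 50 nats
```

The parameter-space step is identical, yet the divergence differs by four orders of magnitude. A fixed learning rate cannot account for this; a KL constraint can.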

This concept is formalized in Trust Region Policy Optimization (TRPO). Instead of relying on a simple learning rate, TRPO solves a constrained optimization problem: maximize a surrogate estimate of the expected return subject to the constraint $D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \le \delta$, where the divergence is averaged over the states visited by the old policy. This ensures that the new policy $\pi_{\theta}$ does not deviate from the old policy $\pi_{\theta_{old}}$ by more than a threshold $\delta$. The underlying theory guarantees monotonic policy improvement for a penalized form of this problem. Practical TRPO approximates that form, so a drop in performance becomes unlikely rather than impossible.
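Written out, the constrained problem takes the following commonly stated form. The advantage function $A^{\pi_{\theta_{old}}}$ and the state distribution $\rho_{\theta_{old}}$ are notation introduced here for illustration; they do not appear elsewhere in this text.

$$
\begin{aligned}
\max_{\theta} \quad & \mathbb{E}_{s \sim \rho_{\theta_{old}},\; a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} \, A^{\pi_{\theta_{old}}}(s, a) \right] \\
\text{subject to} \quad & \mathbb{E}_{s \sim \rho_{\theta_{old}}} \left[ D_{KL}\big(\pi_{\theta_{old}}(\cdot|s) \parallel \pi_{\theta}(\cdot|s)\big) \right] \le \delta
\end{aligned}
$$

The objective reweights advantages collected under the old policy by the probability ratio, and the constraint bounds the average per-state divergence from the old policy.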

Because the KL-constrained problem is computationally expensive to solve, requiring second-order information (the Hessian of the KL divergence, which at $\theta_{old}$ equals the Fisher Information Matrix), practitioners often use Proximal Policy Optimization (PPO). PPO simplifies the problem by using a clipped surrogate objective. It replaces the explicit KL constraint by clipping the probability ratio $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ to a range, typically $[1-\epsilon, 1+\epsilon]$. This removes the incentive to push the ratio far from 1, effectively imitating a KL-like constraint without the expensive second-order computation.
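The clipped surrogate can be sketched in a few lines of NumPy. This is an illustrative implementation, not code from any particular library: the function name `ppo_clip_objective`, the default `epsilon=0.2`, and the sample values are assumptions. It follows the usual formulation, which takes the minimum of the clipped and unclipped terms so that the objective is a pessimistic bound.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    new and old policies. advantages: advantage estimates for those actions.
    """
    # Probability ratio r_t(theta), computed in log space for stability.
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * adv
    # The elementwise minimum removes any gain from moving the ratio outside
    # [1 - epsilon, 1 + epsilon] in the direction the advantage favors.
    return float(np.mean(np.minimum(unclipped, clipped)))

# Three hypothetical samples with ratios 1.5, 0.5 and 1.0.
ratios = np.array([1.5, 0.5, 1.0])
advantages = np.array([1.0, -1.0, 2.0])
logp_old = np.log(np.array([0.2, 0.4, 0.5]))
logp_new = logp_old + np.log(ratios)

objective = ppo_clip_objective(logp_new, logp_old, advantages)
# Per-sample terms: min(1.5, 1.2) = 1.2, min(-0.5, -0.8) = -0.8, min(2, 2) = 2,
# so the mean is 0.8.
```

In the first sample the ratio has already exceeded $1+\epsilon$ with a positive advantage, so the clipped term caps the contribution at 1.2 and the gradient with respect to that sample vanishes.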

In summary, KL divergence serves as the mathematical 'leash' that prevents a learning agent from diverging into instability. By shifting our focus from the distance in parameter space (Euclidean distance) to the distance in distribution space (KL divergence), we align our optimization steps with the actual behavior of the agent. This transition from $L_2$ norms to information-theoretic measures is what enables modern RL agents to learn complex tasks with stability and reliability.