At its core, the Kullback–Leibler (KL) divergence is a measure of how one probability distribution differs from a second, reference probability distribution. In the context of Reinforcement Learning (RL), we use it to track the 'distance' between an agent's old policy $\pi_{\theta_{old}}$ and its updated policy $\pi_{\theta}$. Unlike a standard Euclidean distance between parameter vectors, KL divergence focuses on the change in the actual behavior (the output distributions), ensuring that the agent does not fundamentally alter its strategy based on a single noisy batch of data.
Mathematically, for two discrete probability distributions $P$ and $Q$ over the same set $\mathcal{X}$, the KL divergence is the expected value, taken under $P$, of the logarithmic difference $\log P(x) - \log Q(x)$. The formula is: $$D_{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$ This can be interpreted as the 'information gain' achieved when moving from prior $Q$ to posterior $P$, or, with a base-2 logarithm, as the average number of extra bits required to encode samples from $P$ using a code optimized for $Q$. The divergence is non-negative, equals zero only when $P = Q$, and is not symmetric: in general $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$.
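
As a concrete illustration of the definition, here is a minimal sketch in plain Python. The function name `kl_divergence`, the `base` parameter, and the example distributions are assumptions made for illustration; the zero-probability handling follows the usual conventions ($0 \log 0 = 0$, and infinite divergence when $P$ puts mass where $Q$ has none).

```python
import math

def kl_divergence(p, q, base=math.e):
    """D_KL(P || Q) for two discrete distributions given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # convention: a term with P(x) = 0 contributes nothing
        if q_x == 0.0:
            return math.inf  # P puts mass on an outcome that Q rules out
        total += p_x * math.log(p_x / q_x, base)
    return total

# Hypothetical example distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q, base=2))  # extra bits when coding P with a code built for Q
print(kl_divergence(q, p, base=2))  # a different value: the divergence is asymmetric
```

Swapping the arguments changes the result, which is why the order $D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta})$ matters whenever a KL limit is stated.
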
In policy gradient methods, we seek to maximize the expected return $J(\theta)$. However, a standard gradient ascent step $\theta_{t+1} = \theta_t + \alpha \nabla_{\theta} J(\theta_t)$ can be dangerous. Because the relationship between the parameters $\theta$ and the resulting distribution $\pi_{\theta}$ is highly non-linear, a small change in parameter space can cause a large change in behavior and a collapse in the policy's performance. This is sometimes called the 'cliff' problem: the agent moves into a region of the parameter space where it can no longer collect useful data, and because every later update is computed from data gathered by the degraded policy, training may not recover.
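
To make the non-linearity concrete, the sketch below uses a toy setup that is assumed here for illustration, not taken from the text: a two-action policy with a single parameter $\theta$, where the probability of the first action is $\mathrm{sigmoid}(\theta)$. The helper names are hypothetical. The same parameter step is applied at three starting points and the resulting KL divergence between the old and new policy is printed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_kl(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) in nats, for p and q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def policy_shift(theta_old, step):
    """KL divergence between the toy policy before and after a parameter step."""
    return bernoulli_kl(sigmoid(theta_old), sigmoid(theta_old + step))

step = 1.0  # identical Euclidean step size in every case
for theta_old in (0.0, 3.0, 6.0):
    print(theta_old, policy_shift(theta_old, step))
```

The step has the same length in parameter space each time, yet the change in the action distribution differs by orders of magnitude depending on where the step starts. A fixed learning rate therefore gives no direct control over how much the behavior changes.
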
To mitigate this, we introduce a constraint on the update. Instead of bounding the Euclidean distance between parameter vectors, we bound the KL divergence between the policies: $D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \le \delta$, where $\delta$ is a small positive threshold and the divergence is, in practice, averaged over the states visited by the old policy. This ensures that the new policy is 'close' to the old one in terms of probability mass. Limiting the divergence keeps the state-action visitation distribution from shifting too far, which preserves the validity of the advantage estimates used to compute the gradient, since those estimates come from data collected under $\pi_{\theta_{old}}$.
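
One simple way to respect such a constraint is to shrink a proposed step until the new policy falls inside the KL limit. The sketch below is an illustration under stated assumptions rather than a full algorithm: the policy is a softmax over a list of logits for a single state, and `constrained_step`, `shrink`, and `max_backtracks` are hypothetical names and defaults.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """D_KL(p || q) in nats; assumes q is strictly positive wherever p is."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0.0)

def constrained_step(theta_old, direction, delta, shrink=0.5, max_backtracks=20):
    """Scale down a proposed update until D_KL(pi_old || pi_new) <= delta."""
    pi_old = softmax(theta_old)
    scale = 1.0
    for _ in range(max_backtracks):
        theta_new = [t + scale * d for t, d in zip(theta_old, direction)]
        if kl(pi_old, softmax(theta_new)) <= delta:
            return theta_new
        scale *= shrink  # step was too large in distribution space: try a smaller one
    return list(theta_old)  # no acceptable step found: keep the old policy

# Hypothetical usage: three actions, an aggressive proposed step, and delta = 0.01.
theta_old = [0.0, 0.0, 0.0]
theta_new = constrained_step(theta_old, direction=[4.0, 0.0, -4.0], delta=0.01)
print(theta_new, kl(softmax(theta_old), softmax(theta_new)))
```

The accepted step is measured by its effect on the action distribution, not by its length in parameter space. With multiple states, the same check would be applied to the KL divergence averaged over sampled states.
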
The most prominent implementation of this concept is Trust Region Policy Optimization (TRPO). TRPO formulates the update as a constrained optimization problem: $\max_{\theta} \mathbb{E}\left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} A^{\pi_{\theta_{old}}}(s, a)\right]$ subject to $D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \le \delta$, where the expectation is over states and actions sampled under $\pi_{\theta_{old}}$. Using a second-order Taylor expansion around $\theta_{old}$, TRPO approximates the KL divergence as a quadratic form involving the Fisher Information Matrix $F$: $$D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \approx \frac{1}{2}(\theta - \theta_{old})^T F(\theta_{old})(\theta - \theta_{old})$$ The zeroth- and first-order terms vanish because the divergence is zero, and at its minimum, when $\theta = \theta_{old}$. Combined with a first-order approximation of the objective, this turns the constrained problem into a natural gradient update, which follows the steepest ascent direction on the manifold of probability distributions rather than in raw parameter space.
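
The link between the quadratic constraint and the natural gradient can be made explicit with a short derivation. This is a sketch under two assumptions: the objective is linearized around $\theta_{old}$ with gradient $g \neq 0$, and $F = F(\theta_{old})$ is positive definite. The symbols $g$, $\Delta$, and $\lambda$ are introduced here for illustration. Writing $\Delta = \theta - \theta_{old}$, the approximate problem is:

$$\max_{\Delta} \; g^T \Delta \quad \text{subject to} \quad \frac{1}{2} \Delta^T F \Delta \le \delta$$

Setting the gradient of the Lagrangian $g^T \Delta - \lambda \left( \frac{1}{2} \Delta^T F \Delta - \delta \right)$ to zero gives $g = \lambda F \Delta$, so $\Delta = \frac{1}{\lambda} F^{-1} g$. Because the objective is linear, the constraint is active at the optimum, and substituting $\Delta$ into $\frac{1}{2} \Delta^T F \Delta = \delta$ yields $\frac{1}{\lambda} = \sqrt{\frac{2\delta}{g^T F^{-1} g}}$. The resulting update is:

$$\theta = \theta_{old} + \sqrt{\frac{2\delta}{g^T F^{-1} g}} \; F^{-1} g$$

The direction $F^{-1} g$ is the natural gradient, and the square-root factor scales the step so that the approximate KL divergence equals $\delta$.
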
Proximal Policy Optimization (PPO) simplifies this further by replacing the hard constraint with a clipped surrogate objective. While TRPO solves a constrained problem that relies on second-order information, PPO clips the probability ratio $r_t(\theta) = \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$ to a range such as $[1-\epsilon, 1+\epsilon]$, which removes the incentive to push the ratio beyond that range. Although PPO does not strictly enforce a KL limit in its clipped form, it is designed to achieve the same goal: preventing the policy from deviating too far from the data-collection policy. The clipped objective carries no formal guarantee of monotonic improvement, but it tends to yield stable training in practice.
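
The clipped surrogate can be sketched in a few lines of plain Python. The function name, the argument layout (per-sample log-probabilities and advantage estimates), and the default `epsilon=0.2` are assumptions made for illustration. A practical implementation would compute this on batched tensors in an autodiff framework so that it can be differentiated with respect to $\theta$.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate over a batch of sampled (state, action) pairs."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_t(theta) = pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum makes the objective a pessimistic bound:
        # gains from moving the ratio outside the clip range are ignored.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Hypothetical samples: the old policy assigned probability 1.0 to each action.
print(ppo_clip_objective([math.log(1.5)], [0.0], [1.0]))   # positive advantage, ratio above range
print(ppo_clip_objective([math.log(1.5)], [0.0], [-1.0]))  # negative advantage, same ratio
```

With a positive advantage the contribution is capped once the ratio exceeds $1+\epsilon$. With a negative advantage and the same ratio, the unclipped term is the smaller one and is kept, so the policy is still penalized for raising the probability of a poor action.
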