Kullback–Leibler (KL) Divergence and its Role in Constraining Policy Updates

In reinforcement learning, the goal is to optimize a policy $\pi_{\theta}(a|s)$ to maximize expected returns. However, a common pitfall is the 'collapse' of the policy: if we take a gradient step that is too large, the policy may shift into a region of the parameter space where it performs poorly. Because the data used to compute the gradient is sampled from the current policy, a drastic change in the policy makes that data unrepresentative of the updated policy and leads to unstable training. The core intuition behind using Kullback–Leibler (KL) divergence is to ensure that the new policy $\pi_{\theta_{new}}$ remains 'close' to the old policy $\pi_{\theta_{old}}$ in terms of probability distribution, rather than relying on Euclidean distance in parameter space as a proxy for closeness.

Mathematically, the KL divergence, denoted as $D_{KL}(P || Q)$, measures how a probability distribution $P$ diverges from a second distribution $Q$. In the context of policies, with the old policy as $P$ and the new policy as $Q$, the KL divergence at a given state $s$ (for a discrete action space $\mathcal{A}$) is the expected value, under the old policy, of the log-ratio of the two distributions: $$D_{KL}(\pi_{\theta_{old}} || \pi_{\theta_{new}}) = \sum_{a \in \mathcal{A}} \pi_{\theta_{old}}(a|s) \log \frac{\pi_{\theta_{old}}(a|s)}{\pi_{\theta_{new}}(a|s)}$$ This formulation quantifies the 'information gain', that is, the expected amount of additional information required to encode samples from $P$ using a code optimized for $Q$.
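As a minimal sketch of the sum above, the following computes the KL divergence between two discrete action distributions at a single state. The function name `kl_divergence` and the example probabilities are illustrative assumptions, not taken from any particular library.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions given as lists."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    total = 0.0
    for p_a, q_a in zip(p, q):
        if p_a == 0.0:
            continue  # by convention, 0 * log(0 / q) contributes 0
        if q_a == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_a * math.log(p_a / q_a)
    return total

# Hypothetical action probabilities for one state under each policy.
pi_old = [0.5, 0.3, 0.2]
pi_new = [0.4, 0.4, 0.2]

print(kl_divergence(pi_old, pi_new))  # small positive value, roughly 0.025
print(kl_divergence(pi_old, pi_old))  # 0.0: identical distributions
```

The early return of infinity reflects that the divergence is unbounded whenever the new policy assigns zero probability to an action the old policy could take.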

It is critical to note that $D_{KL}$ is non-symmetric, meaning $D_{KL}(P || Q) \neq D_{KL}(Q || P)$. In policy updates, we typically treat the old policy as the reference distribution. Because the KL divergence is always non-negative ($D_{KL} \ge 0$) and equals zero if and only if $P = Q$, it serves as a powerful surrogate for a distance metric. Unlike Euclidean distance in the parameter space $\theta$, which doesn't account for how parameters actually affect the output probabilities, KL divergence operates directly on the output manifold, ensuring that the agent's behavior changes smoothly.
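Both properties can be checked numerically. The sketch below assumes a softmax policy over two actions whose parameters are the logits themselves, which is an illustrative simplification. It first compares the two directions of the divergence, then applies the same Euclidean step in logit space at two different starting points to show that equal parameter distances can produce very different changes in the action distribution.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """D_KL(p || q) for strictly positive discrete distributions."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)

# 1) Asymmetry: swapping the arguments changes the value.
p = [0.9, 0.1]
q = [0.5, 0.5]
forward = kl(p, q)
reverse = kl(q, p)
print(forward, reverse)  # two different non-negative numbers

# 2) Same Euclidean step (+1 on the first logit), different KL.
flat_old, flat_new = softmax([0.0, 0.0]), softmax([1.0, 0.0])
peaked_old, peaked_new = softmax([5.0, 0.0]), softmax([6.0, 0.0])
kl_flat = kl(flat_old, flat_new)        # near-uniform policy: behavior shifts a lot
kl_peaked = kl(peaked_old, peaked_new)  # near-deterministic policy: barely moves
print(kl_flat, kl_peaked)
```

In the second experiment the parameter change has identical Euclidean length in both cases, yet the near-uniform policy moves much further in distribution space than the already-peaked one.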

One of the most prominent applications of this concept is in Trust Region Policy Optimization (TRPO). Instead of a standard gradient update $\theta_{t+1} = \theta_t + \alpha \nabla J(\theta)$, TRPO solves a constrained optimization problem. The objective is to maximize the surrogate advantage function while keeping the KL divergence, averaged over the states visited by the old policy, below a small threshold $\delta$: $$\max_{\theta} E_{s,a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} A^{\pi_{\theta_{old}}}(s,a) \right]$$ subject to $$D_{KL}(\pi_{\theta_{old}} || \pi_{\theta}) \le \delta$$ This ensures that the update stays within a 'trust region' where the local approximation of the policy's performance is likely to be accurate.
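The constrained problem can be illustrated on a toy case. The sketch below is not TRPO itself: it assumes a single state, a softmax policy parameterized directly by logits, and known advantages, and it replaces the natural-gradient step with a plain gradient direction plus a backtracking line search. What it does show is the acceptance rule, where a candidate update is kept only if the KL divergence stays within $\delta$ and the surrogate objective improves. All function names and numbers are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """D_KL(p || q) for strictly positive discrete distributions."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)

def surrogate(pi_new, pi_old, adv):
    """E_{a ~ pi_old}[(pi_new / pi_old) * A], computed exactly over actions."""
    return sum(po * (pn / po) * a for pn, po, a in zip(pi_new, pi_old, adv))

def constrained_step(logits_old, adv, delta=0.01, step=10.0, max_backtracks=20):
    """Halve the step until KL <= delta and the surrogate improves."""
    pi_old = softmax(logits_old)
    baseline = sum(p * a for p, a in zip(pi_old, adv))
    # Gradient of the surrogate with respect to the logits at theta_old.
    grad = [p * (a - baseline) for p, a in zip(pi_old, adv)]
    old_value = surrogate(pi_old, pi_old, adv)
    for _ in range(max_backtracks):
        logits_new = [l + step * g for l, g in zip(logits_old, grad)]
        pi_new = softmax(logits_new)
        inside_region = kl(pi_old, pi_new) <= delta
        improved = surrogate(pi_new, pi_old, adv) > old_value
        if inside_region and improved:
            return logits_new
        step *= 0.5  # candidate rejected: shrink and retry
    return list(logits_old)  # no acceptable step found; keep the old policy

logits_old = [0.0, 0.0, 0.0]
advantages = [1.0, 0.0, -1.0]  # hypothetical advantage estimates
logits_new = constrained_step(logits_old, advantages, delta=0.01)
print(softmax(logits_new))
print(kl(softmax(logits_old), softmax(logits_new)))  # at most delta
```

Starting from a deliberately oversized step makes the role of the constraint visible: the first few candidates are rejected for leaving the trust region, and the accepted update shifts probability toward the high-advantage action only slightly.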

While TRPO provides rigorous guarantees, it is computationally expensive because enforcing the constraint requires second-order information in the form of the Fisher Information Matrix (in practice, Fisher-vector products combined with conjugate gradient). This led to the development of Proximal Policy Optimization (PPO). PPO simplifies the constraint by using a 'clipped' objective function that removes the incentive to move the probability ratio $r_t(\theta) = \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$ too far from 1. Alternatively, PPO can implement a KL penalty directly into the loss function: $$L^{KLPEN}(\theta) = E_t [ r_t(\theta) A_t - \beta D_{KL}(\pi_{\theta_{old}} || \pi_{\theta}) ]$$ Here, $\beta$ is a penalty coefficient that plays a role analogous to a Lagrange multiplier, controlling the trade-off between maximizing reward and maintaining policy stability.
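Both PPO variants can be sketched over a small batch of samples. The code below assumes the ratios $r_t$, advantages $A_t$, and per-sample KL values have already been computed. The clip range `epsilon` and the `adapt_beta` helper are assumptions not stated in the text above: the former is the usual clipping hyperparameter, and the latter follows the adaptive-penalty heuristic described in the PPO paper, with illustrative names.

```python
def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - epsilon), 1.0 + epsilon)
        # Taking the minimum makes this a pessimistic bound on r * A.
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)

def ppo_kl_penalty_objective(ratios, advantages, kl_values, beta):
    """Mean of r * A - beta * KL over the batch (the L^{KLPEN} form)."""
    terms = [r * a - beta * k for r, a, k in zip(ratios, advantages, kl_values)]
    return sum(terms) / len(terms)

def adapt_beta(beta, observed_kl, target_kl):
    """Adaptive penalty heuristic: loosen or tighten beta around a KL target."""
    if observed_kl < target_kl / 1.5:
        return beta / 2.0  # policy barely moved: weaken the penalty
    if observed_kl > target_kl * 1.5:
        return beta * 2.0  # policy moved too far: strengthen the penalty
    return beta

# Hypothetical batch of three samples.
ratios = [1.5, 0.5, 1.05]
advantages = [1.0, -1.0, 2.0]
kl_values = [0.08, 0.12, 0.001]

print(ppo_clip_objective(ratios, advantages))
print(ppo_kl_penalty_objective(ratios, advantages, kl_values, beta=1.0))
print(adapt_beta(1.0, observed_kl=0.05, target_kl=0.01))
```

With a positive advantage, a ratio above $1 + \epsilon$ earns no extra credit, so the gradient has no reason to push it further. With a negative advantage the minimum keeps the unclipped term whenever it is worse, so harmful moves remain fully penalized.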

Ultimately, constraining policy updates via KL divergence addresses the fundamental tension between exploration and stability. By limiting the 'step size' in the space of probability distributions, we prevent the agent from taking an irrevocable step into a 'black hole' of poor performance from which it cannot recover. This theoretical framework transforms the optimization process from a volatile descent into a controlled evolution, ensuring that the agent consistently improves its behavior without sacrificing the reliability of its learning process.