In reinforcement learning, the goal is to optimize a policy $\pi_{\theta}$ to maximize expected return. A fundamental challenge arises during the update step: if a gradient step is too large, the policy can change drastically and performance can collapse. Because the updated policy also collects the next batch of data, a bad update can be hard to recover from. This is often described as the instability problem of policy gradient methods. To prevent it, we need a way to measure the 'distance' between the old policy $\pi_{\theta_{old}}$ and the new policy $\pi_{\theta}$. Euclidean distance between parameter vectors does not reflect the actual change in behavior, so we instead use the Kullback–Leibler (KL) divergence, which compares the probability distributions themselves.
Mathematically, the KL divergence from distribution $Q$ to $P$ (denoted $D_{KL}(P \parallel Q)$) is an asymmetric measure of how much information is lost when $Q$ is used to approximate $P$. For discrete probability distributions over a set $X$, it is defined as: $$D_{KL}(P \parallel Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$ In the context of policy updates, the distributions being compared are the agent's action distributions at a given state $s$. If $D_{KL}(\pi_{\theta_{old}}(\cdot|s) \parallel \pi_{\theta}(\cdot|s))$ is small for the states the agent visits, the new policy will behave similarly to the old one, regardless of how far the parameters moved in weight space.
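As a minimal sketch of the discrete definition above, the following computes the KL divergence in both directions for two made-up action distributions. The function name `kl_divergence` and the example probabilities are illustrative assumptions, not part of any particular library.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as lists of probabilities."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0 by convention
        if q_x == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x)
    return total


# Hypothetical action distributions over three actions (illustrative values).
pi_old = [0.5, 0.3, 0.2]
pi_new = [0.4, 0.4, 0.2]

kl_forward = kl_divergence(pi_old, pi_new)  # D_KL(pi_old || pi_new)
kl_reverse = kl_divergence(pi_new, pi_old)  # D_KL(pi_new || pi_old)

print(f"D_KL(old || new) = {kl_forward:.6f}")
print(f"D_KL(new || old) = {kl_reverse:.6f}")
```

The two printed values differ slightly, which illustrates the asymmetry: swapping the arguments changes the result.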
The core intuition behind using KL divergence is that it captures an 'informational' distance between behaviors rather than between weights. In a high-dimensional parameter space $\theta$, a small change in one weight might have a negligible effect on the output distribution, while an equally small change in another weight might completely flip the agent's decision. By constraining the KL divergence rather than the parameter norm, we keep the agent's output behavior within a 'trust region,' which keeps each policy update conservative and makes destructive updates much less likely.
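To make this concrete, here is a small sketch under simplifying assumptions: a softmax policy over three actions whose parameters are the logits themselves. The logits and the two perturbations are made-up values chosen so that both steps have the same Euclidean norm.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """D_KL(P || Q) for strictly positive q (always true for softmax outputs)."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0.0)


def apply_step(theta, step):
    return [t + d for t, d in zip(theta, step)]


# Hypothetical logits: the third action is almost never chosen.
theta = [0.0, 0.0, -10.0]

# Two parameter steps with identical Euclidean norm (both equal to 1).
step_a = [0.0, 0.0, 1.0]  # nudges the logit of the rarely chosen action
step_b = [1.0, 0.0, 0.0]  # nudges the logit of a frequently chosen action

pi_old = softmax(theta)
kl_a = kl_divergence(pi_old, softmax(apply_step(theta, step_a)))
kl_b = kl_divergence(pi_old, softmax(apply_step(theta, step_b)))

norm_a = math.sqrt(sum(d * d for d in step_a))
norm_b = math.sqrt(sum(d * d for d in step_b))

print(f"step_a: norm = {norm_a}, KL = {kl_a:.2e}")
print(f"step_b: norm = {norm_b}, KL = {kl_b:.2e}")
```

Both steps look identical to a parameter-norm constraint, yet the second changes the action distribution by several orders of magnitude more in KL terms.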
This concept is formalized in Trust Region Policy Optimization (TRPO). Instead of a standard gradient ascent step $\theta_{t+1} = \theta_t + \alpha \nabla_{\theta} J(\theta_t)$, TRPO solves a constrained optimization problem. The objective is to maximize a surrogate objective $L(\theta)$, the importance-weighted advantage, subject to a constraint on the average KL divergence: $$\text{maximize } E_{s \sim \rho_{\pi_{\theta_{old}}}, a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} A^{\pi_{\theta_{old}}}(s, a) \right] \text{ s.t. } E_s [D_{KL}(\pi_{\theta_{old}}(\cdot|s) \parallel \pi_{\theta}(\cdot|s))] \le \delta$$ where $A^{\pi_{\theta_{old}}}$ is the advantage function of the old policy, $\rho_{\pi_{\theta_{old}}}$ is its state visitation distribution, and $\delta$ is a hyperparameter defining the trust region size.
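The two quantities in this constrained problem can be estimated from sampled data. The sketch below does this for a tiny tabular policy; it only evaluates the surrogate and the average KL for one candidate policy and does not perform TRPO's actual optimization. The policies, advantages, and the value of `delta` are hypothetical.

```python
import math


def surrogate_objective(pi_new, pi_old, samples):
    """Sample average of (pi_new / pi_old) * advantage.

    samples: list of (state, action, advantage) tuples collected under pi_old.
    """
    total = 0.0
    for state, action, advantage in samples:
        ratio = pi_new[state][action] / pi_old[state][action]
        total += ratio * advantage
    return total / len(samples)


def mean_kl(pi_old, pi_new, states):
    """Average of D_KL(pi_old(.|s) || pi_new(.|s)) over the given states."""
    total = 0.0
    for state in states:
        total += sum(
            p * math.log(p / q)
            for p, q in zip(pi_old[state], pi_new[state])
            if p > 0.0
        )
    return total / len(states)


# Hypothetical tabular policies: state -> action probabilities.
pi_old = {0: [0.5, 0.5], 1: [0.8, 0.2]}
pi_new = {0: [0.6, 0.4], 1: [0.7, 0.3]}

# Hypothetical (state, action, advantage) samples gathered with pi_old.
samples = [(0, 0, 1.0), (0, 1, -1.0), (1, 0, -0.5), (1, 1, 0.5)]
delta = 0.05  # assumed trust region size

objective = surrogate_objective(pi_new, pi_old, samples)
avg_kl = mean_kl(pi_old, pi_new, [s for s, _, _ in samples])
within_trust_region = avg_kl <= delta

print(f"surrogate = {objective:.6f}, mean KL = {avg_kl:.6f}, ok = {within_trust_region}")
```

With a smaller `delta`, such as 0.01, the same candidate policy would violate the constraint and would have to be scaled back toward `pi_old`.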
In practice, TRPO does not enforce the KL constraint exactly. It approximates the constraint using the Fisher Information Matrix (FIM), which is the Hessian of the KL divergence with respect to the new policy's parameters, evaluated at the current parameters. Specifically, for a small change $\Delta \theta$, the KL divergence can be approximated as a quadratic form: $$D_{KL}(\pi_{\theta} \parallel \pi_{\theta + \Delta \theta}) \approx \frac{1}{2} \Delta \theta^T F(\theta) \Delta \theta$$ where $F(\theta)$ is the Fisher Information Matrix. The zeroth- and first-order terms of the expansion vanish because the KL divergence attains its minimum value of zero at $\Delta \theta = 0$. This shows that the FIM acts as a Riemannian metric on the parameter space: distance is measured by how sensitive the distribution is to changes in $\theta$.
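The quadratic approximation can be checked numerically in a simple case. The sketch below assumes a single categorical distribution parameterized directly by its logits, for which the Fisher matrix has the closed form $F = \mathrm{diag}(p) - p p^T$. The logits and the step `delta_theta` are made-up values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """D_KL(P || Q) for strictly positive q."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0.0)


def fisher_softmax(p):
    """Fisher matrix of a categorical distribution w.r.t. its logits: diag(p) - p p^T."""
    n = len(p)
    return [
        [(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
        for i in range(n)
    ]


def half_quadratic_form(step, matrix):
    """Compute 0.5 * step^T * matrix * step."""
    n = len(step)
    return 0.5 * sum(
        step[i] * matrix[i][j] * step[j] for i in range(n) for j in range(n)
    )


theta = [1.0, 0.0, -1.0]           # hypothetical logits
delta_theta = [0.05, -0.02, 0.01]  # hypothetical small parameter step

p_old = softmax(theta)
p_new = softmax([t + d for t, d in zip(theta, delta_theta)])

exact_kl = kl_divergence(p_old, p_new)
approx_kl = half_quadratic_form(delta_theta, fisher_softmax(p_old))
relative_error = abs(exact_kl - approx_kl) / exact_kl

print(f"exact = {exact_kl:.6e}, quadratic = {approx_kl:.6e}, rel. error = {relative_error:.3%}")
```

For a step this small the two values agree closely; the gap grows as the step gets larger, because the approximation drops third-order and higher terms.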
Because the second-order computations involving the FIM are expensive for large neural networks, Proximal Policy Optimization (PPO) was introduced as a scalable, first-order alternative. PPO replaces the hard KL constraint with either a clipped surrogate objective, which limits the probability ratio to the interval $[1 - \epsilon, 1 + \epsilon]$, or a KL penalty term subtracted from the objective: $L^{KLPEN}(\theta) = E \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} A^{\pi_{\theta_{old}}}(s, a) - \beta D_{KL}(\pi_{\theta_{old}}(\cdot|s) \parallel \pi_{\theta}(\cdot|s)) \right]$, where $\beta$ controls the strength of the penalty. Either variant discourages updates that move the new policy too far from the old one. This approximates the trust region effect without second-order matrix computations, bridging the gap between theoretical stability and practical efficiency.
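Both PPO variants can be sketched as batch averages over precomputed quantities. In the code below, the probability ratios, advantages, and per-sample KL values are made-up inputs, and `epsilon = 0.2` and `beta = 1.0` are assumed hyperparameter values, not recommendations.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Batch average of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)


def kl_penalized_surrogate(ratios, advantages, kls, beta):
    """Batch average of r * A minus beta times the batch-average KL."""
    n = len(ratios)
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / n
    return surrogate - beta * sum(kls) / n


# Hypothetical batch: probability ratios pi_new / pi_old, advantage estimates,
# and D_KL(pi_old(.|s) || pi_new(.|s)) at each sampled state.
ratios = [1.5, 0.5, 1.1]
advantages = [1.0, -1.0, 2.0]
kls = [0.05, 0.08, 0.01]

clip_value = clipped_surrogate(ratios, advantages, epsilon=0.2)
penalty_value = kl_penalized_surrogate(ratios, advantages, kls, beta=1.0)

print(f"clipped objective      = {clip_value:.6f}")
print(f"KL-penalized objective = {penalty_value:.6f}")
```

In the first sample the ratio 1.5 is clipped to 1.2, so pushing the ratio further earns no extra objective. In the second sample the unclipped term (-0.5) would overstate the objective, so the minimum selects the more pessimistic clipped term (-0.8). Both objectives are meant to be maximized; a training loop would typically minimize their negation.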