Kullback–Leibler (KL) Divergence and its Role in Constraining Policy Updates

At its core, the Kullback-Leibler (KL) divergence is a measure of how one probability distribution $P$ differs from a second, reference probability distribution $Q$. In the context of information theory, it represents the 'extra' information required to encode samples from $P$ using a code optimized for $Q$. Intuitively, if $P$ and $Q$ are identical, the divergence is zero; as they drift apart, the value increases. In Machine Learning, we treat it not as a true metric—since it lacks symmetry—but as a measure of surprise or informational gain when switching from a prior belief to a posterior observation.

Mathematically, for discrete probability distributions, the KL divergence is defined as the expected value of the logarithmic difference between the probabilities. The formulation is given by: $$D_{KL}(P \parallel Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$ For continuous distributions, the summation is replaced by an integral over the support of the distributions. A critical property is that $D_{KL}(P \parallel Q) \\≥ 0$, with equality if and only if $P = Q$ almost everywhere. This non-negativity makes it an ideal candidate for a penalty term or a constraint in optimization problems.

In Reinforcement Learning, specifically Policy Gradient methods, we aim to optimize a policy $\pi_{\theta}$ to maximize the expected return. A naive gradient ascent update $\theta_{new} = \theta_{old} + \alpha \nabla J(\theta)$ can be dangerous. Because the policy $\pi_{\theta}$ defines the data distribution (the trajectories the agent explores), a small change in parameter space $\theta$ can lead to a catastrophic collapse in policy space. If the policy changes too drastically, the agent may enter a region of the state space where it cannot recover, leading to a permanent drop in performance.

To mitigate this, we introduce the KL divergence as a constraint on the policy update. Instead of limiting the step size in the parameter space (which is what a learning rate $\alpha$ does), we limit the step size in the distribution space. We seek to maximize the objective $J(\theta)$ subject to the constraint: $D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \\≤ \delta$. This ensures that the new policy $\pi_{\theta}$ remains 'close' to the old policy $\pi_{\theta_{old}}$, effectively creating a 'Trust Region' where the local approximation of the gradient is likely to be accurate.

This conceptual framework is the foundation of Trust Region Policy Optimization (TRPO). TRPO approximates the objective and the constraint using a second-order Taylor expansion. The constraint $D_{KL}$ is approximated by the Fisher Information Matrix (FIM), denoted as $F$. The update rule becomes a constrained optimization problem: $$\\max_{\theta} g^T (\theta - \theta_{old}) \text{ s.t. } \frac{1}{2}(\theta - \theta_{old})^T F (\theta - \theta_{old}) \\≤ \delta$$ where $g$ is the policy gradient. This transforms the update into a natural gradient step, moving the policy in the direction of steepest ascent on the Riemannian manifold of probability distributions.

While TRPO provides rigorous guarantees, its computational cost is high due to the inversion of the FIM. Proximal Policy Optimization (PPO) simplifies this by using a clipped surrogate objective or a KL penalty. In the penalty version, the loss function is modified: $L^{CLIP}(\theta) = \\hat{E}_t [ ext{Advantage} ] - \beta D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta})$. By dynamically adjusting the coefficient $\beta$, the algorithm penalizes updates that move too far from the original behavior, maintaining a balance between exploration and stability.

In summary, the KL divergence shifts the focus of policy updates from the geometry of the weights to the geometry of the behavior. By constraining the divergence, we ensure that the agent learns incrementally and robustly. This transition from $\text{Euclidean space}$ to $\text{Probability space}$ is what allows modern deep RL agents to converge reliably even in environments with high variance and sparse rewards.