Kullback–Leibler (KL) Divergence and its Role in Constraining Policy Updates

At its core, Kullback–Leibler (KL) divergence is a measure of how one probability distribution differs from a second, reference probability distribution. In the context of Machine Learning, think of it as a 'distance' metric, although it is technically not a distance because it is non-symmetric. If we have a true distribution $P$ and an approximation $Q$, the KL divergence quantifies the amount of information lost when we use $Q$ to represent $P$. Intuitively, if two distributions overlap perfectly, the divergence is zero; as they drift apart, the value increases, signaling a loss of fidelity in the approximation.

Mathematically, for discrete probability distributions $P$ and $Q$ defined on the same probability space, the KL divergence is defined as the expectation of the logarithmic difference between the probabilities: $$D_{KL}(P \parallel Q) = \sum_{x \\∈ \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right)$$ This can be rewritten as the difference between the cross-entropy and the entropy of $P$: $$D_{KL}(P \parallel Q) = H(P, Q) - H(P)$$ Crucially, since $D_{KL}(P \parallel Q) \\≥ 0$ (by Gibbs' inequality), it provides a reliable scalar signal to indicate the 'gap' between two models.

In Reinforcement Learning (RL), we often optimize a policy $\pi_{\theta}$, where $\theta$ represents the neural network parameters. A standard approach is to perform stochastic gradient ascent on the expected return $J(\theta)$. However, the 'Policy Gradient Theorem' tells us that the gradient is an estimate based on sampled trajectories. If we take a step that is too large in parameter space, the resulting distribution $\pi_{\theta_{new}}$ might deviate drastically from $\pi_{\theta_{old}}$, leading to a 'collapse' where the agent forgets previously learned beneficial behaviors.

To mitigate this, we introduce KL divergence as a constraint on the update. Instead of unconstrained maximization, we solve for the optimal parameters $\theta$ that maximize the objective function while ensuring the KL divergence between the old policy and the new policy remains below a threshold $\\delta$: $$\\max_{\theta} J(\theta) \quad \text{subject to} \quad D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \\≤ \delta$$ This ensures that the agent evolves incrementally, maintaining a 'trust region' where the local approximation of the gradient remains valid.

This principle is most famously implemented in Trust Region Policy Optimization (TRPO). TRPO uses a second-order approximation of the KL divergence to define a constraint on the update step. By calculating the Fisher Information Matrix $F$, which is the Hessian of the KL divergence, TRPO approximates the constraint as a quadratic form: $\frac{1}{2} \Delta \theta^T F \Delta \theta \\≤ \delta$. This transforms the update from a simple step in the direction of the gradient into a natural gradient step, which follows the steepest ascent on the manifold of probability distributions rather than the Euclidean parameter space.

Proximal Policy Optimization (PPO) further simplifies this by using a clipped surrogate objective to achieve a similar effect without the computational overhead of the Fisher matrix. PPO ensures that the ratio $r_t(\theta) = \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$ does not move too far from 1. By clipping this ratio, PPO effectively penalizes large KL divergences, preventing the policy from making updates that would swing the probability distribution too violently. This stability is what allows PPO to be more robust across diverse environments.

Ultimately, the use of KL divergence transforms the optimization problem from one of 'finding the peak' to one of 'carefully climbing.' Without these constraints, the high variance of policy gradients often leads to divergent behavior. By bounding the information shift, we ensure that each update is a refinement rather than a reboot, preserving the stability of the learning process and ensuring a monotonic improvement in the agent's performance.