Kullback–Leibler (KL) Divergence and its Role in Constraining Policy Updates

At its core, Kullback–Leibler (KL) divergence is a measure of how one probability distribution differs from a second, reference probability distribution. In the context of Machine Learning, imagine you have a true distribution of data $P$ and an approximation $Q$. If $Q$ perfectly matches $P$, the divergence is zero. However, as $Q$ deviates, the KL divergence increases, effectively quantifying the 'information loss' incurred when using $Q$ to represent $P$. Unlike a standard distance metric, KL divergence is asymmetric, meaning $D_{KL}(P || Q) \\≠ D_{KL}(Q || P)$.

Mathematically, for discrete probability distributions, the KL divergence is defined as the expected value of the logarithmic difference between the probabilities. The formula is expressed as: $D_{KL}(P || Q) = \sum_{x \\∈ X} P(x) \log\left(\frac{P(x)}{Q(x)}\right)$. By utilizing the properties of logarithms, this can be rewritten as the difference between the cross-entropy of $P$ and $Q$ and the entropy of $P$: $D_{KL}(P || Q) = H(P, Q) - H(P)$. This highlights that KL divergence measures the extra 'surprise' we experience when we assume the distribution is $Q$ when it is actually $P$.

In Reinforcement Learning (RL), we often optimize a policy $\pi_{\theta}(a|s)$, which maps states to action distributions. The goal is to update the parameters $\theta$ to maximize the expected reward. However, a standard gradient ascent step can lead to catastrophic collapse. If the update step is too large, the new policy $\pi_{\theta_{new}}$ may be wildly different from $\pi_{\theta_{old}}$, causing the agent to enter a region of the state space where it has no valid data, leading to a total collapse in performance.

To prevent this, we introduce a constraint on the update: the KL divergence between the old policy and the new policy must be kept below a small threshold $\delta$. This is formalized as: $D_{KL}(\pi_{\theta_{old}} || \pi_{\theta_{new}}) \le \delta$. By restricting the update to a 'trust region,' we ensure that the new policy remains close to the old one. This effectively smooths the learning curve and prevents the agent from taking 'leaps of faith' into unexplored or unstable regions of the parameter space.

This constraint is the theoretical foundation of Trust Region Policy Optimization (TRPO). Instead of a simple gradient update $\theta_{t+1} = \theta_t + \alpha \nabla J(\theta)$, TRPO solves a constrained optimization problem: $\max_{\theta} E[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} A^{\pi_{old}}(s, a) ]$ subject to $D_{KL}(\pi_{\theta_{old}} || \pi_{\theta}) \le \delta$. Here, $A^{\pi_{old}}$ is the advantage function, and the ratio represents the importance sampling weight. The KL constraint ensures that the improvement is monotonic and stable.

A modern, more computationally efficient approximation of this concept is found in Proximal Policy Optimization (PPO). While TRPO uses a hard constraint, PPO uses a clipped objective function or a KL penalty term in the loss function: $L(\theta) = \\hat{E}_t [ L^{CLIP}( heta) - c_{KL} D_{KL}(\pi_{\theta_{old}} || \pi_{\theta}) ]$. By adding the KL divergence as a penalty, PPO discourages the policy from drifting too far from the behavior policy, maintaining the stability of the Trust Region approach without the heavy computational cost of calculating the second-order Hessian matrix.