Kullback–Leibler (KL) Divergence and Policy Constraints in Reinforcement Learning

At its core, Kullback–Leibler (KL) divergence is a measure of how one probability distribution differs from a second, reference probability distribution. In the context of Machine Learning, think of it as an 'informational distance.' If we have a true distribution $P$ and an approximation $Q$, the KL divergence tells us how much information is lost when we use $Q$ to represent $P$. Unlike a standard distance metric, KL divergence is asymmetric: $D_{KL}(P \| Q) \\≠ D_{KL}(Q \| P)$. This lack of symmetry is crucial because it determines whether the approximation $Q$ prioritizes covering all modes of $P$ or focusing on the most probable regions.

Mathematically, for discrete probability distributions, the KL divergence is defined as the expected value of the logarithmic difference between the probabilities. The formula is expressed as: $$D_{KL}(P \| Q) = \sum_{x \\∈ \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)} ight)$$ For continuous distributions, the summation is replaced by an integral: $$D_{KL}(P \| Q) = \\∈t_{-\\∈fty}^{\\∈fty} p(x) \log\left(\frac{p(x)}{q(x)} ight) dx$$ Essentially, this is the difference between the cross-entropy of $P$ and $Q$ and the entropy of $P$ itself, quantifying the 'extra' bits needed to encode samples from $P$ using a code optimized for $Q$.

In Reinforcement Learning, we optimize a policy $\pi_{\theta}$ to maximize the expected return. A naive gradient ascent update $\theta_{t+1} = \theta_t + \alpha \nabla J(\theta)$ can be dangerous. If the learning rate $\alpha$ is too high or the gradient is steep, the policy may change drastically in a single step. This is the 'policy collapse' problem: a sudden change in the distribution of actions can lead the agent into a region of the state space it has never explored, resulting in a catastrophic drop in performance from which the agent cannot recover.

To solve this, we introduce a constraint on the policy update using KL divergence. Instead of maximizing the reward blindly, we maximize the reward subject to the constraint that the new policy $\pi_{\theta}$ does not stray too far from the old policy $\pi_{\theta_{old}}$. This is formulated as: $$\\max_{\theta} E_{s \sim \rho^{\pi}, a \sim \pi_{\theta}} [A^{\pi_{old}}(s, a)] \quad \text{subject to} \quad D_{KL}(\pi_{\theta_{old}}(\\·|s) \| \pi_{\theta}(\\·|s)) \\≤ \delta$$ Here, $A^{\pi_{old}}$ is the advantage function, and $\delta$ is a hyperparameter controlling the 'trust region' size.

The most influential implementation of this concept is Trust Region Policy Optimization (TRPO). TRPO replaces the hard constraint with a penalty or a projection, ensuring that the update stays within a region where the local approximation of the reward surface is reliable. By constraining the KL divergence, we ensure that the update is 'smooth.' This effectively transforms the optimization landscape, preventing the optimizer from taking steps that are too large in the probability space, even if the gradient suggests a massive improvement in reward.

A more computationally efficient approach to this constraint is found in Proximal Policy Optimization (PPO). While TRPO uses second-order optimization (Kronecker-factored Approximate Curvature), PPO simplifies the KL constraint using a clipped surrogate objective: $$L^{CLIP}(\theta) = \hat{E}_t [ \min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t) ]$$ where $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$. This clipping mechanism acts as a first-order approximation to the KL constraint, discouraging the ratio $r_t(\theta)$ from moving too far from 1, thereby implicitly limiting the KL divergence between the old and new policies.

Ultimately, the role of KL divergence in policy updates is to provide a mathematical guarantee of stability. By acknowledging that the gradient $\nabla J(\theta)$ is only a local linear approximation of the objective, and that this approximation degrades as we move away from $\theta_{old}$, KL divergence serves as a safety rail. It ensures that the agent evolves its behavior incrementally, maintaining a balance between exploration and the stability of its acquired knowledge.