Kullback–Leibler Divergence and Constraints in Policy Optimization

At its core, the Kullback–Leibler (KL) divergence measures how a probability distribution $P$ differs from a second, reference distribution $Q$. In machine learning, it can be read as an 'excess surprise' metric: if we model data with $Q$ while it actually follows $P$, the KL divergence quantifies the information lost when $Q$ is used to approximate $P$. Crucially, it is not a true distance metric because it is asymmetric: in general, $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$.

Mathematically, for discrete probability distributions, the KL divergence is the expected value, taken under $P$, of the logarithm of the ratio of the two probabilities: $$D_{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$ For continuous distributions with densities $p$ and $q$, the summation is replaced by an integral: $$D_{KL}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$ The divergence can be finite only if $Q(x) > 0$ wherever $P(x) > 0$. $D_{KL}$ is always non-negative, and it equals zero if and only if $P$ and $Q$ are identical (almost everywhere, in the continuous case).
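To make the discrete formula concrete, here is a minimal sketch in plain Python. The function name `kl_divergence` and the two example distributions are illustrative assumptions, not taken from any library; the result is in nats because the natural logarithm is used.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions.

    p and q are sequences of probabilities over the same outcomes,
    in the same order. (Hypothetical helper for illustration.)
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # convention: 0 * log(0 / q) contributes 0
        if q_x == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x)
    return total

# Two made-up distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
print(f"D_KL(P || Q) = {forward:.4f} nats")
print(f"D_KL(Q || P) = {reverse:.4f} nats")
print(f"D_KL(P || P) = {kl_divergence(p, p):.4f} nats")
```

The two directions give different values, which illustrates the asymmetry, and the divergence of a distribution from itself is zero.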

In Reinforcement Learning (RL), we often optimize a policy $\pi_{\theta}(a|s)$, which maps states to action probabilities. The challenge with standard gradient ascent is that the step size is measured in the parameter space $\theta$, while performance depends on the resulting distribution over actions, and even a modest change in $\theta$ can produce a massive change in that distribution. This can cause a 'performance collapse' in which the agent loses previously learned successful behaviors because the new policy $\pi_{\theta_{new}}$ is too far from the old policy $\pi_{\theta_{old}}$ in terms of output probabilities, even though the change in $\theta$ was small.

To solve this, we constrain the policy update. Instead of maximizing the expected return $J(\theta)$ alone, we require that the KL divergence between the old and new policies, typically averaged over the states visited by the old policy, stay below a threshold. This keeps the update within a 'trust region.' The optimization problem becomes: $$\max_{\theta} J(\theta) \quad \text{subject to} \quad D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \leq \delta$$ where $\delta$ is a small hyperparameter that controls the maximum allowable change in the distribution. A closely related variant adds the KL term to the objective as a penalty instead of imposing a hard constraint.
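One simple way to respect such a constraint is to propose a step and shrink it until the KL divergence falls below $\delta$. The sketch below does this for a softmax policy in a single state. The helper names, the proposed direction, and the halving schedule are illustrative assumptions, not a prescribed algorithm.

```python
import math

def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """D_KL(P || Q) in nats; assumes q has no zero entries."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0.0)

def constrained_step(theta_old, direction, delta, step=1.0, shrink=0.5, max_tries=20):
    """Shrink the step along `direction` until the KL constraint holds.

    Returns (theta_new, accepted_step). If no step satisfies the
    constraint within max_tries, the old parameters are kept.
    """
    pi_old = softmax(theta_old)
    for _ in range(max_tries):
        theta_new = [t + step * d for t, d in zip(theta_old, direction)]
        if kl(pi_old, softmax(theta_new)) <= delta:
            return theta_new, step
        step *= shrink
    return list(theta_old), 0.0

# Hypothetical example: uniform policy over three actions,
# with a made-up update direction favoring the first action.
theta_old = [0.0, 0.0, 0.0]
direction = [2.0, -1.0, -1.0]
delta = 0.01

theta_new, accepted = constrained_step(theta_old, direction, delta)
print("accepted step size:", accepted)
print("new policy:", [round(x, 4) for x in softmax(theta_new)])
print("KL(old || new):", round(kl(softmax(theta_old), softmax(theta_new)), 5))
```

The full step is rejected because it changes the action probabilities far too much, and only a much smaller fraction of it satisfies the constraint.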

A prominent implementation of this idea is Trust Region Policy Optimization (TRPO). TRPO approximates the objective to first order and the KL constraint by its second-order Taylor expansion around $\theta_{old}$, $D_{KL} \approx \frac{1}{2} \Delta\theta^T F \Delta\theta$. Here $F$ is the Fisher Information Matrix, which captures the local curvature of the KL divergence. The resulting update moves the parameters along the natural gradient $F^{-1} g$ rather than the raw gradient, with the step length chosen so that the approximate KL divergence equals $\delta$: $$\Delta \theta \approx \sqrt{\frac{2\delta}{g^T F^{-1} g}} F^{-1} g$$ where $g$ is the policy gradient.
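The update formula can be checked numerically on a toy problem. The sketch below uses a hand-picked 2x2 positive-definite matrix as a stand-in for $F$ and a made-up gradient $g$; both are assumptions chosen for illustration. It solves the linear system directly, which is only practical at this tiny scale, since large models cannot afford to form or invert $F$ explicitly.

```python
import math

def solve_2x2(F, g):
    """Solve F x = g for a 2x2 matrix F using Cramer's rule."""
    (a, b), (c, d) = F
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("F is singular or nearly singular")
    return [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]

def natural_gradient_step(F, g, delta):
    """Return sqrt(2*delta / (g^T F^-1 g)) * F^-1 g."""
    x = solve_2x2(F, g)               # x = F^-1 g, the natural gradient
    g_f_g = g[0] * x[0] + g[1] * x[1]  # g^T F^-1 g
    if g_f_g <= 0.0:
        return [0.0, 0.0]              # zero gradient: no step
    scale = math.sqrt(2.0 * delta / g_f_g)
    return [scale * x_i for x_i in x]

def quadratic_kl(F, step):
    """Second-order KL estimate: 0.5 * step^T F step."""
    f_s = [F[0][0] * step[0] + F[0][1] * step[1],
           F[1][0] * step[0] + F[1][1] * step[1]]
    return 0.5 * (step[0] * f_s[0] + step[1] * f_s[1])

# Hypothetical values, chosen only so the arithmetic is easy to follow.
F = [[4.0, 1.0], [1.0, 0.5]]
g = [1.0, 1.0]
delta = 0.01

step = natural_gradient_step(F, g, delta)
print("step:", [round(s, 4) for s in step])
print("approximate KL of the step:", round(quadratic_kl(F, step), 6))
```

The approximate KL of the resulting step matches $\delta$, so the step lands on the boundary of the trust region. In this example the natural gradient also points in a different direction from the raw gradient, because $F$ rescales and rotates it according to the local curvature.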

Alternatively, Proximal Policy Optimization (PPO) approximates the effect of this constraint with a clipped objective function. Rather than computing with the Fisher matrix, PPO works with the probability ratio $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$. By clipping this ratio to the interval $[1-\epsilon, 1+\epsilon]$, PPO removes the incentive to move $\pi_{\theta}$ far from $\pi_{\theta_{old}}$, mimicking the effect of a KL constraint while relying only on first-order optimization.
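The clipping can be sketched for a single sample as follows. The text above only describes clipping the ratio; combining the clipped and unclipped terms with a minimum follows the standard PPO-Clip surrogate, $\min(r_t A_t, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) A_t)$, where $A_t$ is an advantage estimate. The function name and the default $\epsilon = 0.2$ are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO-Clip objective (to be maximized).

    ratio:     pi_theta(a|s) / pi_theta_old(a|s)
    advantage: estimated advantage of the sampled action
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum makes the objective a pessimistic bound:
    # moving the ratio beyond the clip range earns no extra credit.
    return min(ratio * advantage, clipped_ratio * advantage)

# Made-up (ratio, advantage) pairs covering the four sign/direction cases.
samples = [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]
for ratio, advantage in samples:
    value = clipped_surrogate(ratio, advantage)
    print(f"ratio={ratio}, advantage={advantage:+.1f} -> objective={value:+.2f}")
```

With a positive advantage, raising the ratio above $1+\epsilon$ yields no further gain, and with a negative advantage, lowering it below $1-\epsilon$ yields none either. In both cases the gradient through the ratio vanishes, which is what discourages large policy changes.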

In summary, KL divergence serves as the mathematical bridge between parameter space and distribution space. By constraining the divergence, we replace potentially destabilizing steps with updates whose effect on the policy is controlled. This stability helps deep RL agents train more reliably across complex environments and makes them less sensitive to the choice of learning rate.