Kullback–Leibler (KL) Divergence and its Role in Constraining Policy Updates

At its core, Kullback–Leibler (KL) divergence measures how one probability distribution differs from a second, reference probability distribution. In the context of machine learning, imagine you have a 'true' distribution of data $P$ and an approximation $Q$. The KL divergence $D_{KL}(P || Q)$ quantifies the information lost when $Q$ is used to approximate $P$. It is non-negative and equals zero only when the two distributions agree, but unlike a true distance metric it is asymmetric: in general, $D_{KL}(P || Q) \neq D_{KL}(Q || P)$. Intuitively, it tells us how much extra 'surprise' we incur on average when we expect data to follow $Q$ but it actually follows $P$.

Mathematically, for discrete probability distributions, the KL divergence is defined as the expected value, taken under $P$, of the logarithm of the ratio between the two distributions: $$D_{KL}(P || Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$ For continuous distributions with densities $p$ and $q$, the summation is replaced by an integral: $$D_{KL}(P || Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} dx$$ This can be rewritten as the difference between the cross-entropy of $P$ and $Q$ and the entropy of $P$, that is, $D_{KL}(P || Q) = H(P, Q) - H(P)$, highlighting its relationship to information theory.
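To make the discrete definition concrete, here is a minimal sketch in plain Python. The function names and the two example distributions are assumptions chosen for illustration; they do not come from any particular library. The sketch computes the divergence in both directions and evaluates the cross-entropy identity.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as equal-length lists."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0 by convention
        if q_x == 0.0:
            return math.inf  # P puts mass where Q puts none
        total += p_x * math.log(p_x / q_x)
    return total

def entropy(p):
    """H(P) in nats."""
    return -sum(p_x * math.log(p_x) for p_x in p if p_x > 0.0)

def cross_entropy(p, q):
    """H(P, Q) in nats; assumes q_x > 0 wherever p_x > 0."""
    return -sum(p_x * math.log(q_x) for p_x, q_x in zip(p, q) if p_x > 0.0)

# Illustrative distributions over three outcomes (hypothetical values).
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

forward = kl_divergence(p, q)   # D_KL(P || Q)
reverse = kl_divergence(q, p)   # D_KL(Q || P)
identity_gap = forward - (cross_entropy(p, q) - entropy(p))

print(f"D_KL(P || Q) = {forward:.4f}")
print(f"D_KL(Q || P) = {reverse:.4f}")
print(f"D_KL(P || Q) - (H(P, Q) - H(P)) = {identity_gap:.2e}")
```

The two directions give different values for these inputs, which is the asymmetry in action, and the last line should print a value that is zero up to floating-point error.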

In Reinforcement Learning (RL), we typically optimize a policy $\pi_{\theta}(a|s)$, which maps each state to a probability distribution over actions. The goal is to maximize the expected return. However, standard policy gradient methods are prone to high variance and instability. Because the agent gathers its training data by acting under its own policy, a single update that changes the parameters $\theta$ too drastically can push it into a region of the state space where it collects poor-quality data, leading to a 'collapse' in performance from which it may not recover.

To mitigate this, we introduce KL divergence as a constraint on policy updates. The intuition is simple: we want to improve the policy to $\pi_{\theta_{new}}$ while ensuring it remains 'close' to the old policy $\pi_{\theta_{old}}$. Bounding the divergence keeps the new policy from deviating so far that the data collected by the previous version of the agent no longer describes its behavior. The update then becomes a constrained optimization problem: maximize the objective function subject to $D_{KL}(\pi_{\theta_{old}} || \pi_{\theta_{new}}) \le \delta$, where $\delta$ is a small positive threshold.
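As an illustration of the constraint check, the following sketch assumes a tiny tabular policy whose action probabilities are a softmax over per-state logits. The logits, the threshold `delta = 0.01`, and all function names are hypothetical. It averages the KL divergence over states and tests whether a candidate update stays within the bound.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """D_KL(P || Q); softmax outputs are strictly positive, so no zero checks."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q))

def mean_kl(old_logits, new_logits):
    """Average of D_KL(pi_old(.|s) || pi_new(.|s)) over the listed states."""
    kls = [
        kl_divergence(softmax(old), softmax(new))
        for old, new in zip(old_logits, new_logits)
    ]
    return sum(kls) / len(kls)

delta = 0.01  # hypothetical trust-region threshold

# Two states, three actions each (illustrative numbers only).
old_logits = [[1.0, 0.0, -1.0], [0.2, 0.1, 0.0]]
small_step = [[1.05, 0.0, -1.0], [0.2, 0.15, 0.0]]
large_step = [[-1.0, 0.0, 3.0], [2.0, 0.1, -2.0]]

small_ok = mean_kl(old_logits, small_step) <= delta
large_ok = mean_kl(old_logits, large_step) <= delta

print("small step within constraint:", small_ok)
print("large step within constraint:", large_ok)
```

In a real agent the states would be sampled from rollouts of the old policy rather than listed by hand, and the logits would come from a neural network.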

This constraint is most famously implemented in Trust Region Policy Optimization (TRPO). TRPO maximizes a surrogate objective built from importance-weighted advantages and enforces the KL bound as a hard constraint, which defines a 'trust region' around the old policy. The update step is formulated as: $$\max_{\theta} \mathbb{E}_{s,a \sim \pi_{\theta_{old}}} \left[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)} A^{\pi_{\theta_{old}}}(s,a) \right]$$ subject to $D_{KL}(\pi_{\theta_{old}} || \pi_{\theta}) \le \delta$, where the divergence is averaged over the states visited by the old policy. Here, $A^{\pi}$ represents the advantage function, which estimates how much better an action is compared to the average action in that state.
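The sketch below illustrates the shape of such an update on a single-state problem with a softmax policy. It is a simplification, not TRPO itself: TRPO computes its search direction with a conjugate-gradient solve against the Fisher matrix, whereas this sketch uses the plain gradient of the surrogate and keeps only the backtracking line search that enforces the KL bound. The samples, advantages, step sizes, and function names are all hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q))

def surrogate(theta, pi_old, samples):
    """Sample mean of (pi_theta(a) / pi_old(a)) * A(a)."""
    pi_new = softmax(theta)
    return sum(pi_new[a] / pi_old[a] * adv for a, adv in samples) / len(samples)

def surrogate_gradient(pi_old, samples):
    """Gradient of the surrogate w.r.t. the logits, evaluated at theta_old."""
    grad = [0.0] * len(pi_old)
    for a, adv in samples:
        for i in range(len(pi_old)):
            indicator = 1.0 if i == a else 0.0
            grad[i] += adv * (indicator - pi_old[i]) / len(samples)
    return grad

def constrained_step(theta_old, samples, delta, step=10.0, shrink=0.5, max_tries=20):
    """Backtrack along the gradient until the KL bound holds and the surrogate improves."""
    pi_old = softmax(theta_old)
    base = surrogate(theta_old, pi_old, samples)
    grad = surrogate_gradient(pi_old, samples)
    for _ in range(max_tries):
        theta_new = [t + step * g for t, g in zip(theta_old, grad)]
        kl = kl_divergence(pi_old, softmax(theta_new))
        if kl <= delta and surrogate(theta_new, pi_old, samples) > base:
            return theta_new, kl
        step *= shrink
    return list(theta_old), 0.0  # no acceptable step found; keep the old policy

# (action index, advantage estimate) pairs, assumed sampled under the old policy.
samples = [(0, 1.0), (1, -0.5), (2, -0.5), (0, 0.8), (1, -0.2), (2, -0.6)]
theta_old = [0.0, 0.0, 0.0]
delta = 0.01

theta_new, kl = constrained_step(theta_old, samples, delta)
print("accepted logits:", [round(t, 4) for t in theta_new])
print("KL(old || new):", round(kl, 5))
```

Starting from a deliberately oversized step and halving it mirrors the idea that the trust region, rather than a hand-tuned learning rate, determines how far the policy is allowed to move.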

A more computationally efficient alternative is Proximal Policy Optimization (PPO), which replaces the explicit constraint with a clipped surrogate objective: the probability ratio $\pi_{\theta}(a|s) / \pi_{\theta_{old}}(a|s)$ is clipped to a small interval around 1, which removes the incentive to move the policy far from the old one. While PPO does not explicitly compute the KL divergence in its basic clipped version, many implementations add a KL divergence penalty to the objective being maximized: $L = L_{clip} - \beta D_{KL}(\pi_{\theta_{old}} || \pi_{\theta})$, with $\beta > 0$. This regularization helps keep updates stable, acting as a soft constraint that penalizes the agent for moving too far from its previous behavioral distribution.
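A minimal sketch of the penalized objective follows, assuming each sample carries the log-probability of the taken action under the old and new policies plus an advantage estimate. The names `clip_eps` and `beta`, their default values, and the sample data are assumptions for illustration. The KL term is estimated as the sample mean of `logp_old - logp_new`, which is unbiased when actions are drawn from the old policy but can be negative on a small batch.

```python
import math

def clipped_term(ratio, advantage, clip_eps):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def ppo_objective(samples, clip_eps=0.2, beta=0.0):
    """Return (L_clip - beta * KL estimate, L_clip, KL estimate).

    samples: list of (logp_old, logp_new, advantage) for actions that were
    sampled under the old policy. The objective is meant to be maximized.
    """
    n = len(samples)
    l_clip = 0.0
    kl_est = 0.0
    for logp_old, logp_new, adv in samples:
        ratio = math.exp(logp_new - logp_old)
        l_clip += clipped_term(ratio, adv, clip_eps) / n
        kl_est += (logp_old - logp_new) / n  # sample estimate of KL(old || new)
    return l_clip - beta * kl_est, l_clip, kl_est

# Hypothetical batch: (log pi_old(a|s), log pi_new(a|s), advantage).
samples = [
    (math.log(0.5), math.log(0.7), 1.0),
    (math.log(0.3), math.log(0.2), -1.0),
    (math.log(0.2), math.log(0.1), 0.5),
]

penalized, l_clip, kl_est = ppo_objective(samples, clip_eps=0.2, beta=0.5)
print(f"L_clip = {l_clip:.4f}, KL estimate = {kl_est:.4f}, penalized = {penalized:.4f}")
```

With `beta=0.0` the function reduces to the basic clipped objective, so the same code covers both variants described above.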

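Ultimately, the role of KL divergence in policy updates is to tie practical algorithms to a theoretical guarantee of monotonic improvement: the analysis behind TRPO shows that an update improves the expected return provided the divergence from the old policy is penalized strongly enough, and the trust-region constraint is a practical approximation of that result. By limiting the step size in the space of probability distributions rather than the space of parameter weights $\theta$, we account for the fact that a small change in $\theta$ can lead to a massive change in the resulting behavior $\pi$. This distinction is critical: the geometry of the parameter space is often misleading, whereas the geometry of the distribution space, described locally by the Fisher Information Metric (the second-order approximation of the KL divergence), measures steps by how much the behavior actually changes, which is what helps deep reinforcement learning train stably.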
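The gap between parameter distance and behavioral distance can be seen in a one-parameter example. The sketch below assumes a two-action policy whose probability of the first action is a sigmoid of a single logit $\theta$; this setup and the function names are illustrative choices. It applies the same parameter step at two different starting points and compares the exact KL divergence with the quadratic form $\frac{1}{2} F(\theta) \Delta\theta^2$, where $F(\theta) = p(1 - p)$ is the Fisher information of this parameterization.

```python
import math

def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def bernoulli_kl(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_after_step(theta, d_theta):
    """Exact KL between the policy at theta and the policy at theta + d_theta."""
    return bernoulli_kl(sigmoid(theta), sigmoid(theta + d_theta))

def fisher_quadratic(theta, d_theta):
    """Second-order approximation 0.5 * F(theta) * d_theta**2 with F = p * (1 - p)."""
    p = sigmoid(theta)
    return 0.5 * p * (1.0 - p) * d_theta ** 2

d_theta = 0.1  # identical parameter step in both cases

kl_near_zero = kl_after_step(0.0, d_theta)   # policy close to uniform
kl_saturated = kl_after_step(6.0, d_theta)   # policy close to deterministic

print(f"theta=0.0: exact KL = {kl_near_zero:.3e}, Fisher approx = {fisher_quadratic(0.0, d_theta):.3e}")
print(f"theta=6.0: exact KL = {kl_saturated:.3e}, Fisher approx = {fisher_quadratic(6.0, d_theta):.3e}")
```

The same step in $\theta$ changes the policy far more near $\theta = 0$ than near $\theta = 6$, and the Fisher quadratic form tracks that difference while the Euclidean step size does not.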