The Geometry of Information: KL Divergence and Policy Constrained Updates

An exploration of how Kullback–Leibler divergence prevents catastrophic collapse in Reinforcement Learning. We examine the mathematical transition from vanilla policy gradients to Trust Region methods.

At its core, the Kullback–Leibler (KL) divergence measures how one probability distribution $P$ differs from a second, reference distribution $Q$. In machine learning it is often described as an 'informational distance,' although it is not a true metric: it is asymmetric, so $D_{KL}(P \parallel Q)$ generally differs from $D_{KL}(Q \parallel P)$, and it does not satisfy the triangle inequality. Intuitively, if we use $Q$ to approximate $P$, the KL divergence quantifies the expected extra information (or 'surprise') we incur. In policy optimization, this lets us quantify how far a new policy $\pi_{\theta_{new}}$ deviates from the current policy $\pi_{\theta_{old}}$.

Mathematically, for discrete probability distributions, the KL divergence is the expected value, taken under $P$, of the log-ratio of the two distributions: $$D_{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log \left( \frac{P(x)}{Q(x)} \right)$$ For continuous distributions with densities $p$ and $q$, the sum becomes an integral: $$D_{KL}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log \left( \frac{p(x)}{q(x)} \right) dx$$ From an information-theoretic perspective, this equals the cross-entropy of $P$ and $Q$ minus the entropy of $P$, that is $D_{KL}(P \parallel Q) = H(P, Q) - H(P)$, which reinforces the idea that KL divergence measures the inefficiency of using $Q$ as a code for $P$.
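
As a minimal sketch of the discrete definition, the snippet below computes $D_{KL}(P \parallel Q)$ with the Python standard library only, and checks both the asymmetry and the cross-entropy identity numerically. The function names and the two example distributions are illustrative assumptions, not part of any particular library.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0 by convention
        if q_x == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x)
    return total

def entropy(p):
    """H(P) in nats."""
    return -sum(p_x * math.log(p_x) for p_x in p if p_x > 0.0)

def cross_entropy(p, q):
    """H(P, Q) in nats; assumes Q is non-zero wherever P is non-zero."""
    return -sum(p_x * math.log(q_x) for p_x, q_x in zip(p, q) if p_x > 0.0)

# Illustrative distributions over three outcomes (assumed values).
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)  # divergence of Q from P
kl_qp = kl_divergence(q, p)  # a different number: KL is asymmetric
identity_gap = abs(kl_pq - (cross_entropy(p, q) - entropy(p)))

print(f"D_KL(P||Q) = {kl_pq:.4f}, D_KL(Q||P) = {kl_qp:.4f}")
print(f"|D_KL - (H(P,Q) - H(P))| = {identity_gap:.2e}")
```

The natural logarithm gives the result in nats; replacing `math.log` with `math.log2` would give bits.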

In Reinforcement Learning, we seek to optimize a policy $\pi_{\theta}$ to maximize the expected return $J(\theta)$. Standard Policy Gradient methods update parameters via $\theta_{t+1} = \theta_t + \alpha \nabla_{\theta} J(\theta_t)$. However, these updates are performed in parameter space. Because the mapping from parameters $\theta$ to the resulting distribution $\pi_{\theta}$ is generally non-linear, a small step in $\theta$ can produce a large shift in the agent's actual behavior. If an update pushes the policy toward regions of the state space for which it has little or no useful data, the agent may suffer a 'catastrophic collapse' that is difficult, and sometimes impossible, to recover from.
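
To illustrate why distance in parameter space is a poor proxy for distance between policies, the sketch below uses a hypothetical two-action policy whose first-action probability is a sigmoid of a single parameter $\theta$. The same step of size 1.0 in $\theta$ is applied at two different starting points, and the resulting KL divergences are compared. The policy, the step size, and the starting points are all assumptions chosen for illustration.

```python
import math

def two_action_policy(theta):
    """Hypothetical policy: P(action 0) = sigmoid(theta), P(action 1) = 1 - sigmoid(theta)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return [p, 1.0 - p]

def kl(p, q):
    """D_KL(P || Q) for discrete distributions with strictly positive Q."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

step = 1.0  # identical step in parameter space at both starting points

# Near theta = 0 the policy is sensitive to its parameter.
kl_near_zero = kl(two_action_policy(0.0), two_action_policy(0.0 + step))

# Near theta = 6 the sigmoid is saturated and the same step barely moves the policy.
kl_saturated = kl(two_action_policy(6.0), two_action_policy(6.0 + step))

print(f"KL after step at theta=0: {kl_near_zero:.5f}")
print(f"KL after step at theta=6: {kl_saturated:.5f}")
```

The two steps have the same Euclidean length, yet the change in the action distribution differs by roughly two orders of magnitude, which is why a fixed learning rate $\alpha$ cannot by itself bound the change in behavior.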

To address this, we constrain the update using KL divergence. Instead of trusting the gradient blindly, we require that the 'distance' between the old policy and the new policy stay below a threshold $\delta$: $$\max_{\theta} J(\theta) \quad \text{subject to} \quad D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \leq \delta$$ This keeps the new policy within a 'Trust Region' where the local approximation of the reward landscape is likely to be accurate. By constraining the distribution shift rather than the parameter shift, we obtain far more stable updates; the underlying theory guarantees monotonic improvement for an idealized version of the update, and practical algorithms approximate that guarantee.
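
One simple way to respect a constraint of this form is a backtracking search: propose a step, measure the KL divergence between the old and new policies, and shrink the step until the divergence is at most $\delta$. The sketch below does this for a hypothetical softmax policy over three actions. The function `constrained_step`, the proposed direction, and the shrink factor are assumptions for illustration; a full trust-region method would also check that the objective actually improves before accepting a step.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """D_KL(P || Q) for discrete distributions with strictly positive Q."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def constrained_step(theta_old, direction, delta, step=1.0, shrink=0.5, max_tries=20):
    """Shrink the step along `direction` until D_KL(pi_old || pi_new) <= delta."""
    pi_old = softmax(theta_old)
    for _ in range(max_tries):
        theta_new = [t + step * d for t, d in zip(theta_old, direction)]
        if kl(pi_old, softmax(theta_new)) <= delta:
            return theta_new, step
        step *= shrink
    return list(theta_old), 0.0  # no acceptable step found: keep the old policy

theta_old = [0.0, 0.0, 0.0]    # uniform policy over three actions
direction = [4.0, -2.0, -2.0]  # assumed update direction (e.g. a gradient estimate)
delta = 0.01                   # trust-region radius

theta_new, accepted_step = constrained_step(theta_old, direction, delta)
final_kl = kl(softmax(theta_old), softmax(theta_new))
print(f"accepted step = {accepted_step}, KL = {final_kl:.5f}")
```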

The most prominent implementation of this concept is Trust Region Policy Optimization (TRPO). TRPO approximates the objective function using a Taylor expansion and uses the Fisher Information Matrix $F$ to model the local curvature of the KL divergence. The quadratic approximation of the KL divergence is given by $D_{KL}(\pi_{\theta} \parallel \pi_{\theta + \Delta\theta}) \approx \frac{1}{2} \Delta\theta^T F \Delta\theta$. By solving the constrained optimization problem, TRPO computes a natural gradient update that moves the policy along the steepest ascent direction on the Riemannian manifold of probability distributions, rather than in the Euclidean space of parameters.
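
The quadratic approximation can be checked numerically in a small case. The sketch below assumes a softmax policy parameterized directly by its logits, for which the Fisher Information Matrix has the closed form $F = \mathrm{diag}(\pi) - \pi \pi^T$, and compares the exact KL divergence with $\frac{1}{2} \Delta\theta^T F \Delta\theta$ for a small perturbation. The parameter values are illustrative assumptions. This is a check of the approximation only, not an implementation of TRPO, which works with sampled estimates and conjugate-gradient solves rather than an explicit $F$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """D_KL(P || Q) for discrete distributions with strictly positive Q."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def fisher_softmax(pi):
    """Fisher matrix of a softmax policy w.r.t. its logits: diag(pi) - pi pi^T."""
    n = len(pi)
    return [[(pi[i] if i == j else 0.0) - pi[i] * pi[j] for j in range(n)]
            for i in range(n)]

def quadratic_form(matrix, vec):
    """Compute vec^T matrix vec."""
    n = len(vec)
    return sum(vec[i] * matrix[i][j] * vec[j] for i in range(n) for j in range(n))

theta = [0.5, -0.3, 0.1]       # assumed logits
d_theta = [0.02, -0.01, 0.03]  # assumed small perturbation

pi = softmax(theta)
exact_kl = kl(pi, softmax([t + d for t, d in zip(theta, d_theta)]))
approx_kl = 0.5 * quadratic_form(fisher_softmax(pi), d_theta)

print(f"exact KL     = {exact_kl:.8f}")
print(f"quadratic KL = {approx_kl:.8f}")
```

The agreement degrades as the perturbation grows, because the neglected terms are third order in $\Delta\theta$.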

A more computationally efficient alternative is Proximal Policy Optimization (PPO), which simplifies the KL constraint. Rather than enforcing a hard constraint, PPO uses either a clipped objective or a KL penalty term added directly to the objective: $L(\theta) = \mathbb{E}[R] - \beta D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta})$. The penalty acts as a regularizer, with the coefficient $\beta$ controlling how strongly the policy is punished for straying from its previous iteration. This balance between maximizing reward and limiting divergence is a cornerstone of modern stable deep reinforcement learning, because it keeps noisy policy-gradient estimates from driving the policy into divergent behavior.
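
Both PPO variants can be sketched on a handful of samples. In the code below, each ratio is $\pi_{\theta}(a \mid s) / \pi_{\theta_{old}}(a \mid s)$ for an action sampled from the old policy, and the advantage-weighted ratio stands in for the $\mathbb{E}[R]$ term above; that substitution, the sample values, the clip range `eps`, and the coefficient `beta` are all assumptions for illustration. The KL term is estimated from the same samples as the mean of $-\log(\text{ratio})$.

```python
import math

def surrogate(ratios, advantages):
    """Unclipped surrogate: mean of ratio * advantage."""
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)

def clipped_surrogate(ratios, advantages, eps=0.2):
    """Clipped variant: per-sample minimum of the unclipped and clipped terms."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - eps, min(1.0 + eps, r))
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)

def sample_kl(ratios):
    """Monte Carlo estimate of D_KL(pi_old || pi_new) from old-policy samples."""
    return sum(-math.log(r) for r in ratios) / len(ratios)

def kl_penalized(ratios, advantages, beta):
    """Penalty variant: surrogate minus beta times the estimated KL."""
    return surrogate(ratios, advantages) - beta * sample_kl(ratios)

# Illustrative samples (assumed values).
ratios = [1.5, 0.9, 0.6]
advantages = [1.0, -0.5, -2.0]

plain = surrogate(ratios, advantages)
clipped = clipped_surrogate(ratios, advantages, eps=0.2)
penalized = kl_penalized(ratios, advantages, beta=1.0)

print(f"unclipped = {plain:.4f}, clipped = {clipped:.4f}, KL-penalized = {penalized:.4f}")
```

Taking the per-sample minimum makes the clipped value a pessimistic bound on the unclipped surrogate, so the optimizer gains nothing from pushing a ratio outside the interval $[1 - \epsilon, 1 + \epsilon]$ in the direction that would raise the objective.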