Kullback–Leibler (KL) Divergence and Constraints in Policy Optimization

At its core, KL divergence is a measure of how one probability distribution differs from a second, reference probability distribution. In the context of machine learning, imagine you have a 'true' distribution that describes the world, and a model distribution that tries to approximate it. KL divergence quantifies the 'information loss' when we use our approximation instead of the truth. Unlike a true distance metric, KL divergence is asymmetric: $D_{KL}(P || Q)$ is generally not equal to $D_{KL}(Q || P)$.

Mathematically, for discrete probability distributions $P$ and $Q$ defined on the same probability space, the KL divergence is defined as the expectation of the logarithmic difference between the probabilities: $$D_{KL}(P || Q) = \sum_{i} P(i) \log \left( \frac{P(i)}{Q(i)} \right)$$. For continuous distributions, we replace the sum with an integral: $$D_{KL}(P || Q) = \int_{-\infty}^{\infty} p(x) \log \left( \frac{p(x)}{q(x)} \right) dx$$. Note that $D_{KL} \ge 0$, and it only equals zero if $P$ and $Q$ are identical across the entire domain.
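To make the discrete formula and the asymmetry concrete, here is a minimal sketch in plain Python. The function name `kl_divergence` and the two example distributions `p` and `q` are illustrative assumptions, not part of any particular library. It uses the natural logarithm, so results are in nats.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # convention: a term with P(i) = 0 contributes nothing
        if q_i == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total

# Illustrative distributions over three outcomes (assumed values).
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

forward = kl_divergence(p, q)  # D_KL(P || Q)
reverse = kl_divergence(q, p)  # D_KL(Q || P)
print(f"forward={forward:.4f} reverse={reverse:.4f}")
```

For these values the forward divergence comes out near 0.32 nats and the reverse near 0.24 nats, so swapping the arguments changes the result, while `kl_divergence(p, p)` is exactly zero.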

In Reinforcement Learning (RL), we represent our agent's behavior as a policy $\pi_{\theta}(a|s)$, which is a probability distribution over actions $a$ given state $s$, parameterized by $\theta$. The goal is to update $\theta$ to maximize the expected return. However, a naive gradient ascent update $\theta_{new} = \theta_{old} + \alpha \nabla J(\theta)$ can be dangerous. If the step size $\alpha$ is too large, the policy may change drastically, moving the agent into a region of the state space where it has no useful data, leading to a 'collapse' in performance from which the agent cannot recover.

To mitigate this, we introduce a constraint on the policy update using KL divergence. Instead of trusting the gradient blindly, we ensure that the new policy $\pi_{\theta_{new}}$ remains 'close' to the old policy $\pi_{\theta_{old}}$. We formulate this as a constrained optimization problem: $$\text{maximize } J(\theta) \text{ subject to } D_{KL}(\pi_{\theta_{old}} || \pi_{\theta_{new}}) \le \delta$$. Here, $\delta$ is a hyperparameter that defines the 'trust region': the maximum allowable change in the distribution of actions.
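One simple way to respect such a constraint is to shrink the step until the KL divergence between the old and new policies falls below $\delta$. The sketch below assumes a softmax policy over three actions in a single state, with logits as parameters, and a made-up gradient estimate `grad`. The helper `constrained_step` and its `shrink` and `max_backtracks` arguments are hypothetical names chosen for illustration; this is a plain backtracking scheme, not the full procedure of any specific algorithm.

```python
import math

def softmax(logits):
    """Turn logits into action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions with strictly positive q."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

def constrained_step(theta_old, grad, alpha, delta, shrink=0.5, max_backtracks=20):
    """Halve the step size until D_KL(pi_old || pi_new) <= delta."""
    pi_old = softmax(theta_old)
    for _ in range(max_backtracks):
        theta_new = [t + alpha * g for t, g in zip(theta_old, grad)]
        kl = kl_divergence(pi_old, softmax(theta_new))
        if kl <= delta:
            return theta_new, alpha, kl
        alpha *= shrink
    # No acceptable step found: keep the old policy unchanged.
    return list(theta_old), 0.0, 0.0

theta_old = [0.0, 0.0, 0.0]   # uniform policy over three actions
grad = [1.0, -0.5, -0.5]      # hypothetical gradient estimate
theta_new, alpha_used, kl = constrained_step(theta_old, grad, alpha=4.0, delta=0.01)
print(alpha_used, kl)
```

With the initial step size of 4.0 the new policy would place almost all probability on the first action, far outside the trust region. The loop keeps halving the step until the measured KL divergence drops below `delta`.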

This approach is the foundation of Trust Region Policy Optimization (TRPO). Constraining the KL divergence is what underpins TRPO's monotonic improvement guarantee, which holds for the theoretical algorithm and is approximated by the practical one. Rather than relying on the gradient of the objective function alone, TRPO uses the Fisher Information Matrix $F$, which equals the Hessian (second-order derivative) of the KL divergence evaluated at the old parameters: $F = \nabla_{\theta}^2 D_{KL}(\pi_{\theta_{old}} || \pi_{\theta}) \big|_{\theta = \theta_{old}}$. This effectively reshapes the gradient update to account for the geometry of the probability space, ensuring that we move a consistent distance in terms of 'distributional change' rather than 'parameter change'.
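The link between the Fisher matrix and the curvature of the KL divergence can be checked numerically. The sketch below again assumes a softmax policy over three actions in a single state, for which the Fisher matrix has the closed form $\mathrm{diag}(p) - p p^{\top}$. It compares that closed form against a finite-difference Hessian of the KL divergence at $\theta_{old}$. All function names and the example logits are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_from_old(theta_old, theta):
    """D_KL(pi_old || pi_theta) for softmax policies given by their logits."""
    p = softmax(theta_old)
    q = softmax(theta)
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q))

def fisher_softmax(theta):
    """Closed-form Fisher matrix of a softmax over logits: diag(p) - p p^T."""
    p = softmax(theta)
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]

def kl_hessian_fd(theta_old, h=1e-3):
    """Central finite-difference Hessian of the KL divergence at theta_old."""
    n = len(theta_old)

    def kl_at(i, s_i, j, s_j):
        theta = list(theta_old)
        theta[i] += s_i * h
        theta[j] += s_j * h
        return kl_from_old(theta_old, theta)

    return [[(kl_at(i, 1, j, 1) - kl_at(i, 1, j, -1)
              - kl_at(i, -1, j, 1) + kl_at(i, -1, j, -1)) / (4.0 * h * h)
             for j in range(n)]
            for i in range(n)]

theta_old = [0.2, -0.1, 0.5]  # illustrative logits
F = fisher_softmax(theta_old)
H = kl_hessian_fd(theta_old)
max_err = max(abs(F[i][j] - H[i][j]) for i in range(3) for j in range(3))
print(f"largest entrywise difference: {max_err:.2e}")
```

The two matrices should agree up to finite-difference error. The agreement only holds at $\theta = \theta_{old}$, where the KL divergence and its gradient both vanish, so the Hessian is the leading term of its Taylor expansion.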

Modern algorithms like Proximal Policy Optimization (PPO) simplify this further by replacing the hard constraint with a penalty term or a clipped objective. For example, instead of enforcing a strict constraint, PPO can maximize a surrogate objective in which the probability ratio $r_t(\theta) = \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$ is clipped to a small interval around 1. While not a direct KL constraint in every implementation, the motivation remains the same: preventing the policy from diverging too far from its predecessor to maintain training stability and sample efficiency.
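As a rough illustration of the clipped objective, the sketch below evaluates the per-sample surrogate $\min\left(r_t A_t,\ \mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon)\, A_t\right)$ for a few hand-picked values. The advantage estimate $A_t$ and the clip range $\epsilon$ are not defined in the text above; they are introduced here as assumptions, with `epsilon=0.2` used only as an example setting.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage, ratio above the clip range: the gain is capped.
gain_capped = clipped_surrogate(ratio=1.5, advantage=2.0)

# Negative advantage, ratio below the clip range: the value is held at the clipped one.
loss_capped = clipped_surrogate(ratio=0.5, advantage=-1.0)

# Negative advantage, ratio above the clip range: the unclipped (worse) value is kept.
pessimistic = clipped_surrogate(ratio=1.5, advantage=-1.0)

print(gain_capped, loss_capped, pessimistic)
```

In the first two cases the objective stops changing once the ratio leaves the interval, so there is no incentive to push the policy further from its predecessor. In the third case the minimum keeps the unclipped term, so a harmful change is still fully penalized.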