Kullback–Leibler (KL) Divergence and Policy Constraint in Reinforcement Learning

At its core, Kullback–Leibler (KL) divergence, also called relative entropy, measures how one probability distribution differs from a second, reference probability distribution. In Reinforcement Learning (RL), we are rarely interested in the distance between two parameter vectors as such, but rather in how much the 'shape' of the agent's policy changes after a parameter update. If a policy update is too aggressive, the agent may move into a region of the parameter space where it takes poor actions. Because the agent gathers its own training data by acting, a bad policy then produces bad data, and performance can collapse in a way that is hard to recover from. This failure mode is often described as policy collapse.

Mathematically, for two continuous probability distributions $P$ and $Q$ with densities $p$ and $q$, the KL divergence is the expectation under $P$ of the log-ratio of the two densities: $D_{KL}(P \parallel Q) = \int p(x) \log\left(\frac{p(x)}{q(x)}\right) dx$. For discrete distributions, this becomes a summation: $D_{KL}(P \parallel Q) = \sum_{x} P(x) \log\left(\frac{P(x)}{Q(x)}\right)$. KL divergence is not symmetric: in general $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$, so it is not a true distance metric. It is always non-negative, and it equals zero if and only if $P$ and $Q$ are identical (almost everywhere, in the continuous case).
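The discrete formula can be checked numerically with a minimal sketch. The function name `kl_divergence` and the two example distributions are illustrative assumptions, not part of any library; the code only shows the asymmetry and the zero-at-identity property described above.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0 by convention
        if qi == 0.0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi)
    return total

# Two made-up distributions over three outcomes (illustrative values only).
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

kl_pq = kl_divergence(p, q)  # divergence measured from P's point of view
kl_qp = kl_divergence(q, p)  # generally a different number
kl_pp = kl_divergence(p, p)  # zero, since the distributions are identical

print(kl_pq, kl_qp, kl_pp)
```

Swapping the arguments changes the result because the weighting distribution changes, which is why the order of the old and new policy matters in the constraints discussed later.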

In policy gradient methods, we seek to maximize the expected return $J(\theta) = \mathbb{E}_{\pi_{\theta}}[R]$. A standard gradient ascent step $\theta_{t+1} = \theta_t + \alpha \nabla J(\theta_t)$ is only justified while the first-order approximation of $J$ around $\theta_t$ remains accurate over the whole step. However, in RL, a small change in parameter space $\theta$ can lead to a massive change in the resulting distribution of actions $\pi_{\theta}$. This instability arises because the mapping from parameters to policy, and from policy to return, is highly non-linear, so a large step can push the policy into a 'plateau' of poor performance.

To mitigate this, we introduce a constraint on the update such that the KL divergence between the old policy $\pi_{\theta_{old}}$ and the new policy $\pi_{\theta}$ remains below a threshold $\delta$. The optimization problem is reformulated as: $\max_{\theta} J(\theta)$ subject to $D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \le \delta$. Since a policy defines one action distribution per state, the divergence is in practice averaged over the states visited by the old policy. By constraining the divergence, we keep the new policy within a 'trust region' where the local approximation of the policy's performance is likely to be accurate. The underlying theory gives a monotonic improvement guarantee for an idealized version of this update; practical algorithms approximate it, so improvement is encouraged rather than strictly guaranteed.
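One simple way to respect the constraint is to shrink a proposed step until the divergence falls below $\delta$. The sketch below assumes a single-state softmax policy over three actions whose parameters are the logits; `constrained_step`, the search direction, and the value of `delta` are all hypothetical choices for illustration. It checks only the KL constraint, whereas a full trust-region method would also check that the objective improves.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def constrained_step(theta_old, direction, delta, step=1.0, shrink=0.5, max_iters=20):
    """Backtrack along `direction` until KL(pi_old || pi_new) <= delta."""
    pi_old = softmax(theta_old)
    for _ in range(max_iters):
        theta_new = [t + step * d for t, d in zip(theta_old, direction)]
        if kl(pi_old, softmax(theta_new)) <= delta:
            return theta_new, step
        step *= shrink  # step too aggressive: halve it and retry
    return list(theta_old), 0.0  # no acceptable step found: keep the old policy

theta_old = [0.0, 0.0, 0.0]    # uniform policy over three actions
direction = [4.0, -2.0, -2.0]  # made-up ascent direction
theta_new, used_step = constrained_step(theta_old, direction, delta=0.01)

print(used_step, kl(softmax(theta_old), softmax(theta_new)))
```

With these values the full step would make the policy almost deterministic, so the search accepts only a much smaller fraction of it.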

The Trust Region Policy Optimization (TRPO) algorithm operationalizes this by using a quadratic approximation of the KL divergence. In a second-order Taylor expansion around $\theta_{old}$, the zeroth- and first-order terms vanish because the divergence attains its minimum value of zero at $\theta = \theta_{old}$, leaving $D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta}) \approx \frac{1}{2}(\theta - \theta_{old})^T H (\theta - \theta_{old})$, where $H$ is the Hessian of the divergence at $\theta_{old}$, which equals the Fisher Information Matrix (FIM). The FIM acts as a metric tensor for the manifold of probability distributions, effectively rescaling the gradient step to account for the curvature of the policy space rather than the raw geometry of the parameter space.
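The quality of the quadratic approximation can be inspected on a small example. The sketch assumes a categorical policy parameterized directly by its logits, for which the Fisher matrix with respect to the logits is $\mathrm{diag}(p) - p p^T$; the helper names and parameter values are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def fisher_softmax(p):
    """Fisher matrix of a categorical distribution w.r.t. its logits: diag(p) - p p^T."""
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]

def quadratic_kl(theta_old, theta_new):
    """0.5 * d^T F d with d = theta_new - theta_old and F evaluated at theta_old."""
    F = fisher_softmax(softmax(theta_old))
    d = [a - b for a, b in zip(theta_new, theta_old)]
    Fd = [sum(F[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return 0.5 * sum(di * fi for di, fi in zip(d, Fd))

theta_old = [0.2, -0.1, 0.5]     # made-up logits
theta_new = [0.25, -0.15, 0.52]  # a small perturbation of them

exact = kl(softmax(theta_old), softmax(theta_new))
approx = quadratic_kl(theta_old, theta_new)
print(exact, approx)
```

For small parameter changes the two numbers agree closely; the gap grows as the step gets larger, which is one reason a line search is typically paired with the quadratic model.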

While TRPO is well grounded theoretically, it is computationally expensive: each update must solve a linear system involving the FIM, which practical implementations do with conjugate gradient rather than an explicit inverse, followed by a line search. Proximal Policy Optimization (PPO) simplifies this by using a clipped objective function that removes the incentive to move the ratio $\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$ far from 1. The clipped version of PPO does not include the KL divergence in its objective. An alternative variant instead adds an explicit KL penalty term $\beta D_{KL}(\pi_{\theta_{old}} \parallel \pi_{\theta})$ to the loss function, creating a Lagrangian-style balance between reward maximization and policy stability.
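The two PPO variants can be compared in a few lines. This is a minimal sketch over a batch of per-sample log-probabilities and advantage estimates, written as objectives to maximize (a loss would be the negative). The function names, the clipping range `eps`, and the coefficient `beta` are illustrative assumptions, and the KL term uses the sample estimate "mean of log pi_old minus log pi_new", which assumes the actions were drawn from the old policy.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) with r = pi_new / pi_old."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

def kl_penalized_objective(logp_new, logp_old, advantages, beta=1.0):
    """Unclipped surrogate minus beta times a sample estimate of KL(pi_old || pi_new)."""
    n = len(advantages)
    surrogate = sum(math.exp(ln - lo) * adv
                    for ln, lo, adv in zip(logp_new, logp_old, advantages)) / n
    kl_estimate = sum(lo - ln for ln, lo in zip(logp_new, logp_old)) / n
    return surrogate - beta * kl_estimate

# Made-up batch: log-probabilities of the sampled actions and their advantages.
logp_old = [-1.0, -2.0, -0.5]
logp_new = [-0.6, -2.3, -0.5]
advantages = [1.0, -0.5, 0.2]

print(ppo_clip_objective(logp_new, logp_old, advantages))
print(kl_penalized_objective(logp_new, logp_old, advantages, beta=0.5))
```

Clipping caps the gain from raising the probability of a good action beyond `1 + eps` times its old value, but leaves the penalty for making a bad action more likely unclipped, so the objective acts as a pessimistic bound on the unclipped surrogate.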