At its core, the Kullback-Leibler (KL) divergence is a measure of how one probability distribution differs from a second, reference probability distribution. In the context of Machine Learning, imagine you have a 'true' distribution of data, $P$, and you attempt to model it using a parameterized distribution, $Q$. The KL divergence quantifies the 'information loss' or the extra 'surprise' encountered when using $Q$ to approximate $P$. Crucially, it is not a true distance metric because it is asymmetric: the divergence from $P$ to $Q$ is generally not the same as the divergence from $Q$ to $P$.
Mathematically, for discrete probability distributions, the KL divergence is defined as the expectation of the logarithmic difference between the probabilities. The formula is expressed as: $$D_{KL}(P \\parallel Q) = \\sum_{i} P(i) \\log \\left( rac{P(i)}{Q(i)} ight)$$ For continuous distributions, the summation is replaced by an integral: $$D_{KL}(P \\parallel Q) = ∈t_{-∈fty}^{∈fty} p(x) \\log \\left( rac{p(x)}{q(x)} ight) dx$$ This expression can be rewritten as the difference between the cross-entropy and the entropy of $P$, highlighting that KL divergence measures the inefficiency of using $Q$ to encode samples from $P$.
In Reinforcement Learning (RL), we optimize a policy $\\pi_{ heta}(a|s)$, which maps states to a distribution over actions. A common goal is to maximize the expected return $J( heta)$. However, performing a standard gradient ascent step $ heta_{new} = heta_{old} + \\alpha abla J( heta)$ can be dangerous. If the step size $\\alpha$ is too large, the policy parameters might move into a region of the parameter space where the agent's behavior changes drastically, leading to a 'collapse' in performance from which the agent cannot recover.
To prevent this instability, we introduce the concept of a 'Trust Region'. Instead of relying on a fixed learning rate, we constrain the update such that the new policy $\\pi_{new}$ remains 'close' to the old policy $\\pi_{old}$ in the space of probability distributions. We use KL divergence to define this closeness because it is sensitive to the actual change in behavior (the output distribution) rather than just the change in parameter values ($ heta$). The constraint is typically formulated as: $$D_{KL}(\\pi_{old}(·|s) \\parallel \\pi_{new}(·|s)) \\le \\delta$$ where $\\delta$ is a hyperparameter defining the size of the trust region.
This constraint is the foundational mechanism behind Trust Region Policy Optimization (TRPO). By bounding the KL divergence, we can theoretically guarantee that the new policy will improve the expected return without deviating so far that the approximation of the advantage function becomes invalid. The optimization problem becomes a constrained maximization: $$\\max_{ heta} E_{s \\sim ho_{\\pi_{new}}} [A^{\\pi_{old}}(s, a)] \\quad ext{subject to} \\quad E_{s \\sim ho_{\\pi_{old}}} [D_{KL}(\\pi_{old} \\parallel \\pi_{new})] \\le \\delta$$ Here, $A^{\\pi_{old}}$ represents the advantage function, signifying how much better an action is compared to the average action in that state.
While TRPO provides strong guarantees, the second-order optimization required to solve the constraint (involving the Fisher Information Matrix) is computationally expensive. This led to the development of Proximal Policy Optimization (PPO). PPO approximates the KL constraint using a clipped surrogate objective or a penalty term: $$L^{CLIP}( heta) = \\hat{E}_t [ \\min(r_t( heta)\\hat{A}_t, ext{clip}(r_t( heta), 1-\\epsilon, 1+\\epsilon)\\hat{A}_t) ]$$ where $r_t( heta)$ is the probability ratio $rac{\\pi_{ heta}(a|s)}{\\pi_{ heta_{old}}(a|s)}$. This effectively prevents the ratio from moving too far from 1, mimicking the effect of a KL constraint but with the simplicity of first-order stochastic gradient descent.