Proximal Policy Optimization (PPO) and the Clipped Surrogate Objective

An exploration of how PPO stabilizes reinforcement learning by preventing catastrophically large policy updates. We examine the transition from trust-region methods to the clipped surrogate objective.

In Reinforcement Learning (RL), a central difficulty of policy gradient methods is their sensitivity to step size. If we update the policy parameters $\theta$ too aggressively, the agent may move into a region of parameter space where the policy is drastically worse. Because the next batch of data is collected by that degraded policy, performance can collapse in a way the agent may not recover from. TRPO (Trust Region Policy Optimization) addresses this by constraining each update to a 'trust region' defined by the KL divergence between the old and new policies, but it is computationally expensive: enforcing the constraint requires second-order information (the curvature of the KL divergence, typically accessed through conjugate-gradient iterations on Hessian-vector products) followed by a line search.

PPO simplifies this by optimizing a 'surrogate' objective with ordinary first-order gradient methods. Instead of solving a constrained optimization problem, PPO modifies the objective itself so that the policy gains nothing from moving too far from the old policy. We define the probability ratio $r_t(\theta)$ as the ratio between the action probability under the current policy and the old policy: $$r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$ This ratio tells us how much more or less likely an action is under the new policy compared to the one that actually collected the data. When $\theta = \theta_{old}$, the ratio equals 1 for every sample.
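The ratio defined above is usually computed from log-probabilities rather than raw probabilities, since policies tend to expose log-probabilities and subtracting them is numerically safer than dividing small numbers. The following is a minimal NumPy sketch under that assumption; the function name `probability_ratio` and the sample probabilities are hypothetical and chosen only for illustration.

```python
import numpy as np

def probability_ratio(new_log_probs, old_log_probs):
    """Compute r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t) as exp(log pi_new - log pi_old)."""
    new_log_probs = np.asarray(new_log_probs, dtype=np.float64)
    old_log_probs = np.asarray(old_log_probs, dtype=np.float64)
    return np.exp(new_log_probs - old_log_probs)

# Hypothetical probabilities of three sampled actions under each policy.
old_probs = np.array([0.25, 0.50, 0.10])
new_probs = np.array([0.30, 0.40, 0.10])

ratios = probability_ratio(np.log(new_probs), np.log(old_probs))
print(ratios)  # approximately 1.2, 0.8, 1.0
```

A ratio above 1 means the new policy has made the sampled action more likely, a ratio below 1 means less likely, and a ratio of exactly 1 means the probability of that action is unchanged.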

To improve the policy, we want to make actions with a positive advantage estimate $\hat{A}_t$ more likely and actions with a negative one less likely, where $\hat{A}_t$ estimates how much better action $a_t$ is than the policy's average action in state $s_t$. A naive surrogate objective would be $L^{CPI}(\theta) = \mathbb{E}_t [r_t(\theta) \hat{A}_t]$, where CPI stands for conservative policy iteration. However, maximizing this without constraints pushes $r_t(\theta)$ as high as possible when $\hat{A}_t > 0$ and toward zero when $\hat{A}_t < 0$, which leads to the instability described earlier. To fix this, PPO introduces the 'clipped' surrogate objective function.

The clipped objective is defined as follows: $$L^{CLIP}(\theta) = \mathbb{E}_t [\min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t)]$$ Here, $\epsilon$ is a hyperparameter (typically 0.1 or 0.2). The $\text{clip}$ function restricts the ratio used in the second term to the interval $[1-\epsilon, 1+\epsilon]$; the ratio $r_t(\theta)$ itself is not constrained and can still leave that interval. Taking the minimum of the unclipped and clipped terms makes $L^{CLIP}$ a pessimistic lower bound on the unclipped objective: once the ratio has crossed the clip boundary in the direction that would improve the objective, moving further earns no additional credit, which removes the incentive for excessively large updates.
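To make the formula concrete, here is a small NumPy sketch of the clipped objective evaluated on a batch. It assumes the ratios and advantage estimates have already been computed. The function name `clipped_surrogate` and the four sample values are hypothetical.

```python
import numpy as np

def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Batch estimate of L^CLIP: mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratios = np.asarray(ratios, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # The element-wise minimum keeps the more pessimistic of the two terms.
    return np.mean(np.minimum(unclipped, clipped))

# Hypothetical batch of four samples.
ratios = np.array([1.5, 0.5, 1.1, 0.5])
advantages = np.array([1.0, -1.0, 2.0, 1.0])

objective = clipped_surrogate(ratios, advantages, epsilon=0.2)
# Per-sample terms: 1.2 (clipped), -0.8 (clipped), 2.2 (inside the interval),
# and 0.5 (unclipped, because the ratio moved in the unhelpful direction).
print(objective)  # approximately 0.775
```

The last sample illustrates that clipping is one-sided: a ratio of 0.5 with a positive advantage is below $1-\epsilon$, yet the minimum selects the unclipped term, so the objective still reflects how much worse the update has made things.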

Mathematically, this mechanism creates a 'flat' region in the objective landscape where the gradient with respect to $\theta$ is zero. If $\hat{A}_t > 0$, the objective increases as $r_t(\theta)$ increases, but it stops increasing once $r_t(\theta) \ge 1+\epsilon$. Conversely, if $\hat{A}_t < 0$ (the action was bad), the objective increases as $r_t(\theta)$ decreases, but it stops once $r_t(\theta) \le 1-\epsilon$. The flattening is one-sided: when the ratio moves in the direction that lowers the objective, the unclipped term is the smaller one and the sample keeps its full gradient. This keeps updates conservative, maintaining a 'proximal' relationship between the new and old policies while still allowing for steady improvement.
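The flat region can be checked numerically. The sketch below estimates the slope of the per-sample clipped term with respect to the ratio by a central finite difference. The helper names `clipped_term` and `slope`, the step size `h`, and the probe points are assumptions made for illustration.

```python
import numpy as np

def clipped_term(r, advantage, epsilon=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return min(r * advantage, float(np.clip(r, 1.0 - epsilon, 1.0 + epsilon)) * advantage)

def slope(r, advantage, epsilon=0.2, h=1e-6):
    """Central finite-difference estimate of d(term)/dr at ratio r."""
    upper = clipped_term(r + h, advantage, epsilon)
    lower = clipped_term(r - h, advantage, epsilon)
    return (upper - lower) / (2.0 * h)

# Probe below, inside, and above the interval [0.8, 1.2] for both advantage signs.
slopes = {}
for advantage in (1.0, -1.0):
    for r in (0.6, 1.0, 1.4):
        slopes[(advantage, r)] = slope(r, advantage)
        print(f"A={advantage:+.0f}  r={r:.1f}  slope={slopes[(advantage, r)]:+.3f}")

# With A > 0 the slope is zero only at r = 1.4 (past 1 + eps).
# With A < 0 the slope is zero only at r = 0.6 (past 1 - eps).
# Everywhere else the slope equals A, so the sample still contributes a gradient.
```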

In practice, PPO is often implemented as an actor-critic method. The final objective combines the clipped surrogate objective, a value function loss for state-value estimation, and an entropy bonus to encourage exploration: $$L^{PPO}(\theta) = \mathbb{E}_t [L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_{\theta}](s_t)]$$ where $L_t^{CLIP}$ is the per-timestep term inside the expectation of $L^{CLIP}$, $L_t^{VF} = (V_{\theta}(s_t) - V_t^{targ})^2$ is the squared error of the value function against its target, $S$ is the entropy of the policy at $s_t$, and $c_1, c_2$ are positive coefficients. This objective is maximized, so implementations that rely on a minimizer typically use its negative as the loss. The combination lets PPO perform strongly across a range of benchmarks while being significantly easier to implement than TRPO.
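As a rough illustration of how the three terms fit together, the NumPy sketch below evaluates the combined objective on a batch for a discrete-action policy. It is a sketch under stated assumptions, not a reference implementation: the function name `ppo_objective`, the default coefficients `c1=0.5` and `c2=0.01`, and the sample data are illustrative choices, the value-function term is taken per timestep as a squared error, and action probabilities are assumed to be strictly positive so the entropy is finite.

```python
import numpy as np

def ppo_objective(ratios, advantages, values, value_targets, action_probs,
                  epsilon=0.2, c1=0.5, c2=0.01):
    """Batch mean of: clipped term - c1 * squared value error + c2 * policy entropy."""
    ratios = np.asarray(ratios, dtype=np.float64)
    advantages = np.asarray(advantages, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    value_targets = np.asarray(value_targets, dtype=np.float64)
    action_probs = np.asarray(action_probs, dtype=np.float64)

    # Clipped surrogate term per timestep.
    clip_term = np.minimum(
        ratios * advantages,
        np.clip(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages,
    )
    # Squared error between the critic's value estimate and its target.
    vf_term = (values - value_targets) ** 2
    # Entropy of the categorical policy at each state (assumes probabilities > 0).
    entropy = -np.sum(action_probs * np.log(action_probs), axis=-1)

    return np.mean(clip_term - c1 * vf_term + c2 * entropy)

# Hypothetical batch of two timesteps with two possible actions each.
objective = ppo_objective(
    ratios=[1.5, 0.9],
    advantages=[1.0, -1.0],
    values=[0.5, 1.0],
    value_targets=[1.0, 1.0],
    action_probs=[[0.5, 0.5], [0.5, 0.5]],
)
loss = -objective  # a gradient-descent optimizer would minimize this
print(objective)   # approximately 0.0944
```

In an automatic-differentiation framework the same expression would be built from differentiable tensors, with the advantages and value targets treated as constants.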