Proximal Policy Optimization (PPO) and the Clipped Surrogate Objective

In policy-gradient reinforcement learning, a central challenge is the stability of policy updates. If we update a policy $\pi_{\theta}$ too aggressively based on a single batch of experience, we may move the parameters $\theta$ into a region of the parameter space where the policy performs poorly. Because the data collected depends on the current policy, a catastrophic drop in performance leads to poor data collection, creating a feedback loop that can cause the agent's performance to collapse entirely. PPO addresses this by discouraging updates that move the new policy too far from the old policy.

To understand PPO, we must first define the probability ratio $r_t(\theta)$, which represents the relative change between the new policy and the old policy for a given action $a_t$ and state $s_t$: $$r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$$. If $r_t(\theta) > 1$, the action is more likely under the new policy than the old one; if $r_t(\theta) < 1$, it is less likely; and at $\theta = \theta_{old}$ the ratio is exactly $1$. The goal of policy gradient methods is to maximize the expected return, which can be approximated using an estimate of the advantage function $A_t$, denoting how much better the action $a_t$ is than the average action in state $s_t$.
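
To make the ratio concrete, here is a minimal sketch in plain Python. The function name `probability_ratios` and the sample probabilities are hypothetical, and computing the ratio from log-probabilities is an assumption of this sketch (a common choice, since it avoids dividing two very small numbers), not something the definition above requires.

```python
import math

def probability_ratios(new_log_probs, old_log_probs):
    """Return r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t) for each timestep.

    Assumption of this sketch: both inputs hold log-probabilities of the
    actions that were actually taken, so the ratio is exp(new - old).
    """
    return [math.exp(n - o) for n, o in zip(new_log_probs, old_log_probs)]

# Hypothetical probabilities of the taken actions under each policy.
old_probs = [0.50, 0.20, 0.10]
new_probs = [0.60, 0.20, 0.05]

ratios = probability_ratios(
    [math.log(p) for p in new_probs],
    [math.log(p) for p in old_probs],
)
# ratios[0] > 1: the first action became more likely under the new policy.
# ratios[1] == 1 (up to rounding): its probability is unchanged.
# ratios[2] < 1: the third action became less likely.
```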

The naive surrogate objective is defined as $L^{CPI}(\theta) = \mathbb{E}_t [r_t(\theta) A_t]$. While maximizing this objective improves the policy, it lacks a mechanism to prevent excessively large updates. If the advantage $A_t$ is large and positive, the optimizer will push $r_t(\theta)$ to be as large as possible, potentially moving $\theta$ far away from $\theta_{old}$ and violating the assumption that the sampled data is still representative of the current policy's behavior.

PPO introduces a 'clipped surrogate objective' that approximates a trust-region constraint without the second-order optimization required by methods such as TRPO. The objective is formulated as: $$L^{CLIP}(\theta) = \mathbb{E}_t [\min(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t)]$$. Here, the $\text{clip}$ function limits the ratio $r_t(\theta)$ to the range $[1 - \epsilon, 1 + \epsilon]$, where $\epsilon$ is a hyperparameter (typically $0.1$ or $0.2$).
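
The clipped objective can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than a reference implementation: the function name `clipped_surrogate`, the default `epsilon=0.2`, and the sample ratios and advantages are all hypothetical, and the expectation $\mathbb{E}_t$ is approximated by a simple batch mean.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        # clip(r, 1 - eps, 1 + eps)
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        # Pessimistic choice between the unclipped and clipped terms.
        total += min(r * a, clipped_r * a)
    return total / len(ratios)

# Hypothetical batch of four timesteps.
ratios = [1.5, 0.5, 1.1, 1.5]
advantages = [1.0, -1.0, 2.0, -1.0]

objective = clipped_surrogate(ratios, advantages, epsilon=0.2)
# Per-sample terms: 1.2 (clipped), -0.8 (clipped), 2.2 (inside the range),
# and -1.5 (unclipped, because the unclipped term is the smaller one).
```

In a real training loop the same expression would be written with a differentiable array library so that gradients flow through `ratios`; the list-based version here is only meant to show the arithmetic.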

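The intuition behind the $\min$ operator is crucial. When $A_t > 0$, the objective encourages increasing the probability of the action, but once $r_t(\theta)$ exceeds $1 + \epsilon$ the clipped term is constant, so there is no further incentive to increase it. Conversely, when $A_t < 0$, the objective encourages decreasing the probability, but the incentive vanishes once $r_t(\theta)$ falls below $1 - \epsilon$. Clipping therefore removes the gradient outside the interval rather than imposing a hard constraint on the ratio. By taking the minimum of the clipped and unclipped terms, PPO makes $L^{CLIP}$ a pessimistic lower bound on the unclipped objective: a change in the ratio is ignored only when it would make the objective look better, and it is counted in full when it makes the objective worse. For example, if $A_t < 0$ and $r_t(\theta) > 1 + \epsilon$, the unclipped term $r_t(\theta) A_t$ is the smaller of the two, so the policy is still penalized for making a poor action more likely.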
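
As a worked case analysis (a direct simplification of the definition of $L^{CLIP}$, not an additional result), the per-sample term inside the expectation reduces to one of two forms depending on the sign of $A_t$:

$$\min\big(r_t(\theta) A_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t\big) = \begin{cases} \min(r_t(\theta),\ 1 + \epsilon)\, A_t & \text{if } A_t \geq 0 \\ \max(r_t(\theta),\ 1 - \epsilon)\, A_t & \text{if } A_t < 0 \end{cases}$$

For a positive advantage only the upper bound $1 + \epsilon$ can ever be active, and for a negative advantage only the lower bound $1 - \epsilon$ can be. With $\epsilon = 0.2$, $A_t = -1$ and $r_t(\theta) = 1.5$, for instance, the second case gives $\max(1.5, 0.8) \cdot (-1) = -1.5$, which is the unclipped value.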

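Finally, the complete PPO objective typically incorporates a value function loss and an entropy bonus to encourage exploration. The total objective is: $$L_t^{PPO} = \mathbb{E}_t [L^{CLIP}(\theta) - c_1 L^{VF}(\theta) + c_2 S[\pi_{\theta}](s_t)]$$, where $L^{VF}$ is the squared error between the value function's prediction and its target, $S$ is the entropy of the policy in state $s_t$, and $c_1, c_2$ are positive weighting coefficients. This objective is maximized, so the value error enters with a negative sign and the entropy with a positive one; implementations that minimize a loss use its negation. This combined approach ensures that the agent learns an accurate value estimate, maintains a diverse set of actions, and updates its policy in a stable, incremental manner.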
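
The combined objective can be sketched as follows. Several things here are assumptions made for illustration and are not taken from the text above: the function names `entropy` and `ppo_objective`, the coefficient defaults `c1=0.5` and `c2=0.01`, the restriction to discrete action distributions, and the use of a batch mean in place of $\mathbb{E}_t$ with all three terms evaluated per timestep.

```python
import math

def entropy(probs):
    """Entropy of a discrete action distribution, skipping zero entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_objective(ratios, advantages, values, value_targets, action_dists,
                  epsilon=0.2, c1=0.5, c2=0.01):
    """Batch mean of L_clip - c1 * L_vf + c2 * entropy (to be maximized).

    c1, c2 and epsilon are hypothetical defaults chosen for this sketch.
    """
    total = 0.0
    for r, a, v, v_targ, dist in zip(ratios, advantages, values,
                                     value_targets, action_dists):
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        l_clip = min(r * a, clipped_r * a)   # clipped surrogate term
        l_vf = (v - v_targ) ** 2             # squared value error
        total += l_clip - c1 * l_vf + c2 * entropy(dist)
    return total / len(ratios)

# Hypothetical two-step batch with two possible actions per state.
objective = ppo_objective(
    ratios=[1.0, 1.5],
    advantages=[1.0, 1.0],
    values=[1.0, 0.0],
    value_targets=[0.0, 0.0],
    action_dists=[[0.5, 0.5], [1.0, 0.0]],
)
```

A gradient-based optimizer that minimizes would be given `-objective`. How the advantages and value targets are estimated is outside the scope of this sketch.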