In Reinforcement Learning (RL), the primary goal is to optimize a policy $\pi_{\theta}$ to maximize the expected cumulative reward. A fundamental challenge is the 'step size' problem: if we update the policy weights $\theta$ too aggressively based on a single batch of experience, we may move to a region of the parameter space where the policy performs poorly. Because the data collected depends on the current policy, a sudden collapse in performance leads to a feedback loop of poor data collection, which makes recovery very difficult. Proximal Policy Optimization (PPO) was designed to keep each new policy close to the old one, approximating a 'trust region' without the computational overhead of second-order optimization.
To understand PPO, we first define the probability ratio $r_t(\theta)$, which measures how much the current policy differs from the old policy that collected the data: $$r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$$ If $r_t(\theta) > 1$, the action is more likely under the current policy than under the old one; if $r_t(\theta) < 1$, it is less likely. A natural surrogate objective is $\hat{E}_t[r_t(\theta) \hat{A}_t]$, where $\hat{A}_t$ is the advantage estimate; at $\theta = \theta_{old}$ its gradient coincides with the standard policy gradient. The advantage $\hat{A}_t$ tells us whether an action was better or worse than the average action in that state. However, maximizing this surrogate without any restriction can push $r_t(\theta)$ far from $1$, which leads to the instability described above.
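In practice, the ratio is usually computed from log-probabilities rather than raw probabilities, since that is numerically safer for small values. The following is a minimal sketch in plain Python; the function name `probability_ratio` and the example probabilities are illustrative assumptions, not part of any particular library.

```python
import math

def probability_ratio(logp_new, logp_old):
    """Return r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t),
    computed as exp(log pi_theta - log pi_theta_old)."""
    return math.exp(logp_new - logp_old)

# Hypothetical numbers: the old policy assigned the sampled action
# probability 0.25, the current policy assigns it 0.30.
r = probability_ratio(math.log(0.30), math.log(0.25))
print(round(r, 3))  # 1.2, so the action became more likely
```

A ratio of exactly 1 means the two policies agree on that action, which is always the case before the first gradient step on a freshly collected batch.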
The core innovation of PPO is the Clipped Surrogate Objective. Instead of maximizing the unclipped surrogate, PPO removes the incentive for the policy to move $r_t(\theta)$ far away from $1$. The objective function is defined as: $$L^{CLIP}(\theta) = \hat{E}_t \left[ \min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t) \right]$$ Here, $\epsilon$ is a hyperparameter (typically $0.1$ or $0.2$). Taking the minimum of the clipped and unclipped terms makes $L^{CLIP}$ a pessimistic bound on the unclipped surrogate: the objective stops improving once $r_t(\theta)$ leaves the interval $[1-\epsilon, 1+\epsilon]$ in the direction that would otherwise increase it.
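The clipped objective can be sketched over a small batch as follows. This is an illustrative plain-Python version under stated assumptions: the helper names `clip` and `clipped_surrogate` and the sample ratios and advantages are made up for the example, and a real implementation would use a tensor library with automatic differentiation.

```python
def clip(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))

def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Empirical mean over the batch of
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, adv in zip(ratios, advantages):
        unclipped = r * adv
        clipped = clip(r, 1.0 - epsilon, 1.0 + epsilon) * adv
        terms.append(min(unclipped, clipped))
    return sum(terms) / len(terms)

# Hypothetical batch of three samples.
ratios = [1.5, 0.5, 1.0]
advantages = [1.0, -1.0, 2.0]
value = clipped_surrogate(ratios, advantages)
print(round(value, 3))  # 0.8, the mean of the terms 1.2, -0.8 and 2.0
```

The first two samples are clipped (to $1.2 \cdot 1.0$ and $0.8 \cdot (-1.0)$), while the third has a ratio of $1$ and contributes its unclipped value.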
Let's analyze the mechanics of the $\min$ operator. If the advantage $\hat{A}_t$ is positive, the objective increases as $r_t(\theta)$ increases, but it flattens out once $r_t(\theta)$ reaches $1 + \epsilon$. This prevents the policy from becoming 'too greedy' on a single positive advantage. Conversely, if $\hat{A}_t$ is negative, the objective increases as $r_t(\theta)$ decreases (making the bad action less likely), but it is clipped once $r_t(\theta)$ drops to $1 - \epsilon$. This prevents the policy from drastically slashing the probability of an action based on a potentially noisy advantage estimate.
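The two cases can be checked numerically by evaluating the per-sample objective at several ratios. This is a small sketch with $\epsilon = 0.2$ and unit advantages chosen purely for illustration; `per_sample_objective` is a hypothetical helper name.

```python
def per_sample_objective(r, advantage, epsilon=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
    return min(r * advantage, clipped_r * advantage)

for r in (0.5, 0.8, 1.0, 1.2, 1.5):
    pos = per_sample_objective(r, advantage=1.0)
    neg = per_sample_objective(r, advantage=-1.0)
    print(f"r={r:.1f}  A=+1: {pos:+.2f}  A=-1: {neg:+.2f}")

# Output:
# r=0.5  A=+1: +0.50  A=-1: -0.80
# r=0.8  A=+1: +0.80  A=-1: -0.80
# r=1.0  A=+1: +1.00  A=-1: -1.00
# r=1.2  A=+1: +1.20  A=-1: -1.20
# r=1.5  A=+1: +1.20  A=-1: -1.50
```

With a positive advantage the value stops growing at $1.2$, and with a negative advantage it stops growing at $-0.8$. The table also shows that the $\min$ leaves the unfavorable direction unclipped: for $\hat{A}_t = -1$ and $r_t(\theta) = 1.5$ the full penalty of $-1.5$ is kept, so the policy still has a gradient that pulls it back.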
The clipping mechanism can be viewed as a first-order heuristic that mimics the constraint used in Trust Region Policy Optimization (TRPO). TRPO imposes a hard constraint on the KL divergence between the old and new policies, which requires second-order information: the Fisher Information Matrix has $O(n^2)$ entries for $n$ parameters, and practical implementations avoid forming it explicitly by running conjugate gradient on Fisher-vector products. PPO achieves a similar effect using only first-order gradients. By removing the incentive to push the policy ratio beyond the clip range, PPO encourages each update to stay within a region where the local approximation of the expected return is likely to remain valid, providing a robust balance between sample efficiency and stability.
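For comparison, the constrained problem that TRPO solves can be written in the notation used here, where $\delta$ (a symbol introduced only for this illustration) denotes the maximum allowed average KL divergence: $$\max_{\theta} \; \hat{E}_t \left[ r_t(\theta) \hat{A}_t \right] \quad \text{subject to} \quad \hat{E}_t \left[ \text{KL}\left[ \pi_{\theta_{old}}(\cdot | s_t), \pi_{\theta}(\cdot | s_t) \right] \right] \le \delta$$ The objective is the same unclipped surrogate; the difference is that TRPO enforces proximity through an explicit constraint, whereas PPO builds it into the objective through the clip term.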
In a full implementation, the PPO objective is often augmented with a value function loss and an entropy bonus to encourage exploration. The total objective, which is maximized, typically looks like: $$L^{TOTAL}(\theta) = L^{CLIP}(\theta) - c_1 L^{VF}(\theta) + c_2 \hat{E}_t \left[ S[\pi_{\theta}](s_t) \right]$$ where $L^{VF}$ is the mean squared error of the value function, $S$ is the entropy of the policy, and $c_1, c_2$ are positive coefficients. The value loss enters with a minus sign because it is minimized while the other two terms are maximized. This combined objective allows the agent to learn the value of states while simultaneously optimizing the policy under the clipped surrogate, which contributes to the training stability for which PPO is known.
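The way the three terms combine can be sketched in plain Python. Everything here is an assumption made for illustration: the function name `total_objective`, the use of empirical returns as value targets, the coefficient defaults `c1=0.5` and `c2=0.01`, and the sample numbers. Per-state entropies are passed in precomputed to keep the example short.

```python
def total_objective(ratios, advantages, values, returns, entropies,
                    epsilon=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate minus c1 * value MSE plus c2 * mean entropy.

    All inputs are equal-length lists with one entry per timestep.
    The coefficient defaults are illustrative assumptions.
    """
    n = len(ratios)

    # Clipped surrogate term, averaged over the batch.
    clip_term = 0.0
    for r, adv in zip(ratios, advantages):
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        clip_term += min(r * adv, clipped_r * adv)
    clip_term /= n

    # Mean squared error between value predictions and their targets.
    vf_loss = sum((v - g) ** 2 for v, g in zip(values, returns)) / n

    # Mean policy entropy over the visited states.
    entropy = sum(entropies) / n

    return clip_term - c1 * vf_loss + c2 * entropy

# Hypothetical two-sample batch.
obj = total_objective(
    ratios=[1.5, 1.0], advantages=[1.0, -1.0],
    values=[0.0, 1.0], returns=[1.0, 1.0],
    entropies=[0.5, 0.5],
)
print(round(obj, 3))  # -0.145 = 0.1 - 0.5 * 0.5 + 0.01 * 0.5
```

Because most optimizers minimize, implementations commonly negate this quantity and perform gradient descent on the result.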