At its core, Reinforcement Learning (RL) aims to find a policy $\pi_{\theta}$ that maximizes the expected cumulative reward. However, standard policy gradient methods suffer from high variance and a critical instability: if the step size in gradient ascent is too large, the policy may move into a region of parameter space that performs poorly. Because the data used for the next update is collected by the current policy, a single bad update can collapse the agent's performance and make recovery very difficult. Proximal Policy Optimization (PPO) addresses this by discouraging the new policy from deviating too far from the old policy.
To understand PPO, we first define the probability ratio $r_t(\theta)$, which measures how much the current policy $\pi_{\theta}$ differs from the policy $\pi_{\theta_{\text{old}}}$ used to collect the trajectory data: $$r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$$ If $r_t(\theta) > 1$, the action is more likely under the current policy than under the old one; if $r_t(\theta) < 1$, it is less likely; at $\theta = \theta_{\text{old}}$ the ratio is exactly 1. Without any constraint, we would maximize the surrogate objective $L^{CPI}(\theta) = \hat{E}_t [r_t(\theta) \hat{A}_t]$, where $\hat{A}_t$ is the estimated advantage at timestep $t$ and the superscript refers to conservative policy iteration. However, maximizing this objective directly is dangerous: nothing limits how far $r_t(\theta)$ moves from 1, so repeated gradient steps on the same batch can produce an excessively large policy update.
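As a rough illustration of the ratio and the unclipped surrogate, the sketch below computes both from log-probabilities, which is how the ratio is commonly evaluated to avoid dividing small numbers. The function names and the sample numbers are hypothetical and chosen only for this example.

```python
import math

def probability_ratios(new_logps, old_logps):
    """r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def unclipped_surrogate(ratios, advantages):
    """Sample mean of r_t * A_t, i.e. an estimate of L^CPI."""
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)

# Hypothetical data for three sampled (state, action) pairs.
old_logps = [math.log(0.5), math.log(0.2), math.log(0.4)]
new_logps = [math.log(0.6), math.log(0.1), math.log(0.4)]
advantages = [1.0, -0.5, 2.0]

ratios = probability_ratios(new_logps, old_logps)
# The ratios are approximately 1.2, 0.5 and 1.0: the first action became
# more likely, the second less likely, the third is unchanged.
print(ratios)
print(unclipped_surrogate(ratios, advantages))
```

Note that nothing in `unclipped_surrogate` penalizes a ratio far from 1, which is the gap the clipped objective is meant to close.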
PPO addresses this instability by introducing the 'Clipped Surrogate Objective.' Instead of blindly maximizing $r_t(\theta) \hat{A}_t$, PPO removes the incentive for the ratio to move outside the range $[1 - \epsilon, 1 + \epsilon]$. The objective function is defined as: $$L^{CLIP}(\theta) = \hat{E}_t [ \min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t) ]$$ Here, $\epsilon$ is a hyperparameter (typically $0.1$ or $0.2$) that controls the size of the 'trust region' around the old policy. Because the objective takes the minimum of the unclipped term and the clipped term, it is a pessimistic lower bound on the unclipped surrogate, which keeps the update conservative.
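The clipped objective can be sketched in a few lines of plain Python. This is a minimal illustration under the definitions above, not a reference implementation; the helper names and the sample values are assumptions made for the example.

```python
def clip(x, lo, hi):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))

def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Sample mean of min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)."""
    terms = []
    for r, a in zip(ratios, advantages):
        unclipped = r * a
        clipped = clip(r, 1.0 - epsilon, 1.0 + epsilon) * a
        terms.append(min(unclipped, clipped))
    return sum(terms) / len(terms)

# Hypothetical batch: per-sample terms are 1.2 (clipped at 1 + eps),
# 0.5 (unclipped term is smaller) and -1.0 (ratio is inside the range).
sample_ratios = [1.5, 0.5, 1.0]
sample_advantages = [1.0, 1.0, -1.0]
objective = clipped_surrogate(sample_ratios, sample_advantages, epsilon=0.2)
print(objective)
```

In the first sample the ratio of 1.5 earns no more credit than a ratio of 1.2 would, so pushing it further gains nothing.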
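The logic behind the $\min$ operator and the clipping mechanism depends on the sign of the advantage $\hat{A}_t$. When $\hat{A}_t > 0$, the action was better than average, so we want to increase its probability. However, once $r_t(\theta)$ exceeds $1 + \epsilon$, the clipped term is selected and no longer depends on $\theta$, so the gradient for that sample becomes zero. This prevents the policy from over-optimizing on a single batch of data. Conversely, when $\hat{A}_t < 0$, the action was worse than average, so we want to decrease its probability. Once $r_t(\theta)$ falls below $1 - \epsilon$, the gradient again vanishes, which removes any incentive to suppress the action further and stops the policy from changing too aggressively. In the remaining cases, where the ratio has moved in the direction that makes the objective worse ($r_t(\theta) < 1 - \epsilon$ with $\hat{A}_t > 0$, or $r_t(\theta) > 1 + \epsilon$ with $\hat{A}_t < 0$), the $\min$ selects the unclipped term, so the gradient is preserved and the policy can still correct the mistake.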
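The case analysis above can be checked numerically. The sketch below writes out the per-sample term and its derivative with respect to the ratio. The function names are hypothetical, and the derivative is undefined exactly at the clip boundaries, where this sketch simply returns the unclipped slope.

```python
def clipped_term(ratio, advantage, epsilon=0.2):
    """Per-sample objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

def clipped_term_slope(ratio, advantage, epsilon=0.2):
    """Derivative of clipped_term with respect to the ratio.

    The slope is 0.0 where the constant clipped term is active and equals
    the advantage where the unclipped term r * A is active.
    """
    if advantage > 0 and ratio > 1.0 + epsilon:
        return 0.0  # good action, already pushed far enough
    if advantage < 0 and ratio < 1.0 - epsilon:
        return 0.0  # bad action, already suppressed far enough
    return advantage

# (ratio, advantage) pairs covering the four out-of-range cases.
cases = [(1.3, 1.0), (0.7, 1.0), (0.7, -1.0), (1.3, -1.0)]
slopes = [clipped_term_slope(r, a) for r, a in cases]
print(slopes)
```

Only the first and third cases have zero slope. In the second and fourth, the ratio has moved in the harmful direction, so the signal to move it back survives.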
In practice, PPO is usually implemented as an actor-critic method. While the clipped objective optimizes the policy (the actor), a value function $V_{\phi}(s)$ (the critic) is trained simultaneously to estimate the expected return. The total objective often combines the clipped surrogate objective, a value function loss, and an entropy bonus to encourage exploration: $$L^{total}(\theta, \phi) = L^{CLIP}(\theta) - c_1 L^{VF}(\phi) + c_2 S[\pi_{\theta}](s_t)$$ Here $c_1$ and $c_2$ are weighting coefficients and $S[\pi_{\theta}](s_t)$ is the entropy of the policy at state $s_t$. Because this combined objective is maximized, the value function loss enters with a negative sign. $L^{VF}$ is typically the mean squared error between the predicted value $V_{\phi}(s_t)$ and the empirical return target.
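To make the combination concrete, the following sketch assembles the three terms from already-computed quantities. The coefficient defaults `c1=0.5` and `c2=0.01` are assumptions picked for illustration, not values stated above, and the entropy helper assumes a discrete (categorical) action distribution.

```python
import math

def value_loss(predicted_values, returns):
    """L^VF: mean squared error between V_phi(s_t) and the return targets."""
    errors = [(v - g) ** 2 for v, g in zip(predicted_values, returns)]
    return sum(errors) / len(errors)

def categorical_entropy(probs):
    """Entropy S of a discrete action distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def total_objective(clip_objective, vf_loss, entropy, c1=0.5, c2=0.01):
    """L^total = L^CLIP - c1 * L^VF + c2 * S, a quantity to be maximized."""
    return clip_objective - c1 * vf_loss + c2 * entropy

# Hypothetical numbers for a two-sample batch and a two-action policy.
vf = value_loss([1.0, 2.0], [1.5, 1.0])   # (0.25 + 1.0) / 2 = 0.625
ent = categorical_entropy([0.5, 0.5])     # ln 2
total = total_objective(0.2, vf, ent)
print(total)
```

Gradient-descent optimizers minimize, so an implementation built on one would typically minimize the negative of `total_objective`.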
The appeal of PPO lies in its balance between ease of implementation and stability. Its predecessor, Trust Region Policy Optimization (TRPO), enforces a hard constraint on the KL divergence between the old and new policies, which requires second-order information in the form of the Fisher Information Matrix. PPO achieves comparable stability using only first-order gradients. By limiting the update in probability space (through the ratio $r_t(\theta)$) rather than in parameter space, PPO tends to train reliably across a wide variety of complex environments.