A central challenge in policy-gradient Reinforcement Learning (RL) is the instability of policy updates. In standard Policy Gradient methods, a single large step in parameter space can cause performance to collapse: the updated agent may move into a region of the state space where it collects poor data, and training on that data produces a feedback loop of failure. Proximal Policy Optimization (PPO) addresses this by keeping the new policy from deviating too far from the old policy, effectively approximating a 'trust region' without the heavy computational overhead of second-order optimization.
To understand PPO, we must first define the probability ratio $r_t(\theta)$, which measures the difference between the current policy $\pi_\theta$ and the policy used to collect the data $\pi_{\theta_{old}}$. It is defined as: $$r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$$ If $r_t(\theta) > 1$, the action $a_t$ is more likely under the current policy than the old one; if $r_t(\theta) < 1$, it is less likely. This ratio allows us to perform 'importance sampling', reusing data from the old policy to optimize the new one.
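As a small illustration of the ratio, the sketch below computes it from log-probabilities, which is how policies usually expose action likelihoods. The function name `probability_ratio` and the example probabilities are assumptions made for this example only.

```python
import math

def probability_ratio(logp_new, logp_old):
    """Return r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) from log-probabilities."""
    # exp(log a - log b) equals a / b and avoids dividing two very small numbers.
    return math.exp(logp_new - logp_old)

# Hypothetical numbers: the old policy assigned the action probability 0.20,
# the current policy assigns it 0.25, so the action became more likely.
r = probability_ratio(math.log(0.25), math.log(0.20))
print(round(r, 4))  # 1.25
```

A ratio of exactly 1 means the two policies agree on that action, which is always the case before the first gradient step of an update.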
The policy update is driven by an estimate $\hat{A}_t$ of the advantage function, which tells us how much better a specific action was compared to the average action in that state. A vanilla surrogate objective would be $L^{CPI}(\theta) = \hat{E}_t [r_t(\theta) \hat{A}_t]$, where $\hat{E}_t$ denotes the empirical average over a batch of collected timesteps. However, maximizing this without constraints encourages the policy to push $r_t(\theta)$ far away from $1$ to chase high advantages, often leading to the aforementioned instability and collapse in performance.
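One way to see why $L^{CPI}$ is a reasonable starting point is to look at its gradient at $\theta = \theta_{old}$. Using $\nabla_\theta r_t(\theta) = r_t(\theta) \nabla_\theta \log \pi_\theta(a_t | s_t)$ and $r_t(\theta_{old}) = 1$, it reduces to the familiar policy gradient estimator: $$\nabla_\theta L^{CPI}(\theta) \big|_{\theta_{old}} = \hat{E}_t \left[ \nabla_\theta r_t(\theta) \big|_{\theta_{old}} \hat{A}_t \right] = \hat{E}_t \left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \big|_{\theta_{old}} \hat{A}_t \right]$$ The surrogate therefore agrees with the standard policy gradient to first order near the old policy, and the two only part ways once $r_t(\theta)$ drifts away from $1$.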
PPO introduces the 'Clipped Surrogate Objective' to prevent this collapse. The objective function is defined as: $$L^{CLIP}(\theta) = \hat{E}_t [ \min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t) ]$$ Here, $\epsilon$ is a hyperparameter (typically $0.1$ or $0.2$) that defines the boundary of the trust region. The $\min$ operator ensures that we take the more conservative estimate of the improvement.
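The clipped objective can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `clipped_surrogate`, the default `epsilon=0.2`, and the three-sample batch are all made up for the example.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Batch average of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for r, adv in zip(ratios, advantages):
        # clip(r, 1 - eps, 1 + eps)
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        # Keep the more pessimistic of the unclipped and clipped terms.
        total += min(r * adv, clipped_r * adv)
    return total / len(ratios)

# Hypothetical batch: two samples whose ratio left [0.8, 1.2], one that stayed inside.
ratios = [1.5, 0.5, 1.1]
advantages = [2.0, -1.0, 1.0]
print(clipped_surrogate(ratios, advantages))  # approximately 0.9
```

For the same batch the unclipped average $\hat{E}_t [r_t(\theta) \hat{A}_t]$ would be about $1.2$, so the clipped value is the lower, more conservative estimate.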
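The intuition behind the clipping mechanism is elegant: when the advantage $\hat{A}_t$ is positive, the objective increases as $r_t(\theta)$ increases, but only up to $1 + \epsilon$. Beyond that, the gradient for that sample becomes zero, removing the incentive to push the action probability further. Conversely, if $\hat{A}_t$ is negative, the objective increases as $r_t(\theta)$ decreases, but it stops providing a benefit once $r_t(\theta)$ hits $1 - \epsilon$. This prevents the policy from over-correcting and drastically dropping the probability of an action in a single step. Because of the $\min$, clipping never hides a penalty: if the ratio has moved in the direction that makes the objective worse (for example $r_t(\theta) > 1 + \epsilon$ with a negative advantage), the unclipped term is the smaller one and its gradient still pulls the policy back.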
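The case analysis above can be made concrete by writing out the derivative of the per-sample clipped term with respect to the ratio. The sketch below is illustrative only: the function name `clip_objective_grad_wrt_ratio` is hypothetical, and the behavior exactly at the clip boundaries (where the term is not differentiable) is glossed over.

```python
def clip_objective_grad_wrt_ratio(r, adv, epsilon=0.2):
    """d/dr of min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single sample."""
    # Positive advantage and ratio already above 1 + eps: no further incentive.
    if adv > 0 and r > 1.0 + epsilon:
        return 0.0
    # Negative advantage and ratio already below 1 - eps: no further incentive.
    if adv < 0 and r < 1.0 - epsilon:
        return 0.0
    # Everywhere else the unclipped term r * A is active, so the slope is A.
    return adv

cases = [
    (1.5, 1.0),   # good action, already pushed up past 1 + eps -> slope 0.0
    (0.5, 1.0),   # good action, probability dropped            -> slope 1.0
    (0.5, -1.0),  # bad action, already pushed down past 1 - eps -> slope 0.0
    (1.5, -1.0),  # bad action, probability rose                -> slope -1.0
]
for r, adv in cases:
    print(r, adv, clip_objective_grad_wrt_ratio(r, adv))
```

The two non-zero cases are the ones where the policy has moved the wrong way, which is exactly where a corrective gradient is still wanted.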
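In practice, PPO is usually implemented as an Actor-Critic method. The final objective combines the clipped surrogate objective, a value function loss to improve state estimation, and an entropy bonus to encourage exploration: $$L_t^{PPO}(\theta) = \hat{E}_t [ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) ]$$ Here, $L_t^{CLIP}$ is the per-timestep term inside the expectation of $L^{CLIP}$, $L_t^{VF}$ is the squared error between the value function's prediction and its target (so its batch average is the Mean Squared Error), $S$ represents the entropy of the policy at state $s_t$, and $c_1, c_2$ are positive weighting coefficients. This quantity is maximized, so implementations built around a minimizing optimizer use its negative as the loss. The composite objective is designed so that the agent learns a stable policy while maintaining a healthy level of exploration and accurate value predictions.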
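To show how the three terms fit together, here is a minimal sketch that assembles the combined objective for a batch and returns its negative as a loss. Everything named here is an assumption for illustration: the function `ppo_loss`, the coefficient values `c1=0.5` and `c2=0.01`, and the idea that per-state entropies arrive precomputed in a list.

```python
def ppo_loss(ratios, advantages, values, value_targets, entropies,
             epsilon=0.2, c1=0.5, c2=0.01):
    """Negative of the batch average of L_CLIP - c1 * L_VF + c2 * S."""
    n = len(ratios)

    # Clipped surrogate term, averaged over the batch.
    l_clip = 0.0
    for r, adv in zip(ratios, advantages):
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        l_clip += min(r * adv, clipped_r * adv)
    l_clip /= n

    # Value function term: mean squared error against the value targets.
    l_vf = sum((v - t) ** 2 for v, t in zip(values, value_targets)) / n

    # Entropy bonus: mean policy entropy over the visited states.
    entropy = sum(entropies) / n

    objective = l_clip - c1 * l_vf + c2 * entropy
    # The objective is maximized, so a minimizing optimizer gets its negative.
    return -objective

# Hypothetical two-sample batch.
loss = ppo_loss(
    ratios=[1.0, 1.0],
    advantages=[1.0, -1.0],
    values=[0.5, 1.0],
    value_targets=[1.0, 1.0],
    entropies=[0.7, 0.7],
)
print(loss)  # approximately 0.0555
```

In an actual training loop these quantities would be tensors produced by the actor and critic networks so that the loss can be differentiated with respect to $\theta$; the plain lists here only show the arithmetic.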