Proximal Policy Optimization (PPO): Stability in Reinforcement Learning

In Reinforcement Learning (RL), the primary goal is to optimize a policy $\pi_{\theta}(a|s)$ to maximize the expected cumulative reward. Standard Policy Gradient methods, such as REINFORCE, suffer from high variance and are sensitive to the learning rate. If a gradient update is too large, the policy may move into a region of the parameter space where it collects poor data, leading to a collapse in performance from which the agent may not recover. This is the 'stability problem': how do we take the largest possible improvement step without leaving the region where our current data remains valid?

To address this, PPO optimizes a surrogate objective. Instead of optimizing the expected reward directly, we look at the probability ratio between the new policy and the old policy: $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$. This ratio tells us how much more (or less) likely an action is under the current parameters compared to the parameters used to collect the data. When $r_t(\theta) > 1$, the action is more likely; when $r_t(\theta) < 1$, it is less likely; at $\theta = \theta_{old}$ the ratio is exactly $1$. By multiplying this ratio by the advantage estimate $\hat{A}_t$, we obtain the surrogate objective $L^{CPI}(\theta) = \mathbb{E}_t [r_t(\theta) \hat{A}_t]$, which makes actions with higher-than-average returns more likely and actions with lower-than-average returns less likely.
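As a rough illustration of the ratio and the unclipped surrogate, the sketch below estimates $L^{CPI}$ from a small batch. The function names `probability_ratios` and `cpi_surrogate`, and the three sample transitions, are hypothetical and chosen only for this example. Working with log-probabilities is an assumption made here for numerical convenience; the text itself does not prescribe it.

```python
import math

def probability_ratios(new_log_probs, old_log_probs):
    """r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed as exp(log difference)."""
    return [math.exp(new - old) for new, old in zip(new_log_probs, old_log_probs)]

def cpi_surrogate(ratios, advantages):
    """Sample mean of r_t * A_t, i.e. a batch estimate of L^CPI."""
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)

# Hypothetical batch of three transitions (illustrative values only).
old_log_probs = [math.log(0.5), math.log(0.2), math.log(0.4)]
new_log_probs = [math.log(0.6), math.log(0.1), math.log(0.4)]
advantages = [1.0, -2.0, 0.5]

ratios = probability_ratios(new_log_probs, old_log_probs)
surrogate = cpi_surrogate(ratios, advantages)
print(ratios, surrogate)
```

In this batch the first action became more likely (ratio above 1), the second less likely (ratio below 1), and the third is unchanged (ratio equal to 1).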

However, maximizing the Conservative Policy Iteration (CPI) objective $L^{CPI}$ without any constraint can lead to excessively large updates. Nothing in the objective stops $\theta$ from changing so drastically that the ratio $r_t(\theta)$ moves far from $1$, pushing the policy far away from $\pi_{\theta_{old}}$. To prevent this, PPO uses a 'clipped' objective. Inside the objective, the ratio $r_t(\theta)$ is clipped to the interval $[1-\epsilon, 1+\epsilon]$, where $\epsilon$ is a hyperparameter (e.g., $0.2$). This does not forbid the ratio from leaving the interval; it removes the incentive to move the policy further once the ratio has left the interval in the direction that improves the objective.

The clipped surrogate objective is defined as: $$L^{CLIP}(\theta) = \mathbb{E}_t [ \min( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t )]$$ This equation takes the minimum of the unclipped objective and the clipped version. If the advantage $\hat{A}_t$ is positive, the objective increases as $r_t(\theta)$ increases, but the term is capped at $(1+\epsilon)\hat{A}_t$ once $r_t(\theta)$ exceeds $1+\epsilon$. If the advantage $\hat{A}_t$ is negative, the objective increases as $r_t(\theta)$ decreases (making the bad action less likely), but the term is capped at $(1-\epsilon)\hat{A}_t$ once $r_t(\theta)$ falls below $1-\epsilon$.
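The formula above can be sketched in a few lines of plain Python. This is a minimal batch estimate under stated assumptions, not a reference implementation: the helper names `clip` and `clipped_surrogate` and the sample values are hypothetical, and the default `epsilon=0.2` is taken from the example value mentioned in the text.

```python
def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(x, high))

def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Sample mean of min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        unclipped = r * a
        clipped = clip(r, 1.0 - epsilon, 1.0 + epsilon) * a
        total += min(unclipped, clipped)
    return total / len(ratios)

# Hypothetical samples (illustrative values only).
ratios = [1.5, 0.5, 1.1]
advantages = [1.0, -1.0, 2.0]

# Per-sample terms:
#   r=1.5, A=+1 -> min(1.5, 1.2)   = 1.2   (capped at (1 + eps) * A)
#   r=0.5, A=-1 -> min(-0.5, -0.8) = -0.8  (capped at (1 - eps) * A)
#   r=1.1, A=+2 -> min(2.2, 2.2)   = 2.2   (inside the interval, unchanged)
objective = clipped_surrogate(ratios, advantages)
print(objective)
```

The first two samples show the two caps described above; the third lies inside $[1-\epsilon, 1+\epsilon]$, so clipping has no effect on it.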

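Crucially, the $\min$ operator makes $L^{CLIP}$ a pessimistic lower bound on the unclipped objective: clipping takes effect only when the change in the ratio would improve the objective, and it is ignored when the change makes the objective worse. If the ratio moves in the harmful direction (for example, a positive-advantage action becoming much less likely), the unclipped term is the smaller of the two and still contributes its full gradient, so the update can correct the mistake. This creates a 'trust region' effect using only first-order optimization, without the hard Kullback-Leibler (KL) divergence constraint and the second-order computations involving the Fisher Information Matrix used in Trust Region Policy Optimization (TRPO).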
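One way to see this asymmetry is to look at how a single term of the clipped objective responds to a small change in the ratio. The sketch below uses a central finite difference as a stand-in for the gradient with respect to $r_t(\theta)$; the names `clipped_term` and `slope` are hypothetical, and $\epsilon = 0.2$ is assumed.

```python
def clipped_term(ratio, advantage, epsilon=0.2):
    """One sample of the clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

def slope(ratio, advantage, epsilon=0.2, h=1e-6):
    """Central finite-difference derivative of the term with respect to the ratio."""
    upper = clipped_term(ratio + h, advantage, epsilon)
    lower = clipped_term(ratio - h, advantage, epsilon)
    return (upper - lower) / (2.0 * h)

# (ratio, advantage) pairs covering the four regions outside [1 - eps, 1 + eps].
cases = [
    (1.5, 1.0),   # good action already much more likely: term is flat, no incentive to go further
    (0.5, 1.0),   # good action made less likely: term is unclipped, slope equals the advantage
    (0.5, -1.0),  # bad action already much less likely: term is flat
    (1.5, -1.0),  # bad action made more likely: term is unclipped, slope equals the advantage
]
slopes = [slope(r, a) for r, a in cases]
for (r, a), s in zip(cases, slopes):
    print(f"ratio={r}, advantage={a:+}, slope={s:.3f}")
```

A zero slope means the sample no longer pushes the policy in that direction, while a non-zero slope in the harmful cases means the update still receives a corrective signal.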

In a full implementation, PPO typically uses an actor-critic architecture. A value function $V_{\phi}(s)$ is learned to estimate the expected return, which is needed to compute the advantage estimates $\hat{A}_t$ and reduce their variance. The combined objective, which is maximized, is a weighted sum of the clipped surrogate, a squared-error value function penalty, and an entropy bonus to encourage exploration: $$L^{PPO}(\theta, \phi) = \mathbb{E}_t [ L_t^{CLIP}(\theta) - c_1 (V_{\phi}(s_t) - R_t)^2 + c_2 S[\pi_{\theta}](s_t) ]$$ Here, $L_t^{CLIP}(\theta)$ is the per-timestep term inside the expectation of $L^{CLIP}(\theta)$, $R_t$ is the return target for the value function, $c_1$ and $c_2$ are positive weighting coefficients, and $S$ is the entropy of the policy, which discourages the agent from prematurely converging to a deterministic, sub-optimal strategy.
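To make the three terms concrete, the following sketch evaluates the combined objective on a batch for a discrete-action policy. Everything specific here is an assumption for illustration: the function names `entropy` and `ppo_objective`, the coefficient defaults `c1=0.5` and `c2=0.01`, the use of a categorical distribution for the entropy term, and the sample values. The source does not specify any of them.

```python
import math

def entropy(probs):
    """Shannon entropy of a categorical action distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_objective(ratios, advantages, values, returns, action_probs,
                  epsilon=0.2, c1=0.5, c2=0.01):
    """Batch mean of: clipped term - c1 * (V - R)^2 + c2 * entropy. To be maximized."""
    total = 0.0
    samples = zip(ratios, advantages, values, returns, action_probs)
    for r, a, v, ret, probs in samples:
        clipped_ratio = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        policy_term = min(r * a, clipped_ratio * a)  # clipped surrogate
        value_penalty = (v - ret) ** 2               # squared value error
        total += policy_term - c1 * value_penalty + c2 * entropy(probs)
    return total / len(ratios)

# Hypothetical single-sample batch with two actions (illustrative values only).
objective = ppo_objective(
    ratios=[1.0],
    advantages=[1.0],
    values=[1.0],
    returns=[2.0],
    action_probs=[[0.5, 0.5]],
)
print(objective)
```

Because most optimizers minimize, implementations commonly negate this quantity and treat the result as the loss.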