Optimization Landscapes: From Stochastic Gradient Descent to Adaptive Moments

At its core, training a deep neural network is a search for the parameters $\theta$ that minimize a cost function $J(\theta)$. The intuition is akin to finding the lowest point in a rugged mountain range during a thick fog; we cannot see the whole landscape, so we feel the slope beneath our feet and take a step in the direction of steepest descent. In standard (full-batch) Gradient Descent, every single step requires the average gradient across the entire dataset, a cost that grows with dataset size and becomes prohibitive at the scale of modern deep learning.

To resolve this, we use Stochastic Gradient Descent (SGD). Instead of the full dataset, SGD estimates the gradient from a single example or, more commonly, a small 'mini-batch' of data. For a single example $(x^{(i)}, y^{(i)})$, the update rule is: $$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} J(\theta_t; x^{(i)}, y^{(i)})$$ where $\eta$ is the learning rate and $\nabla_{\theta} J$ is the gradient of the loss with respect to the parameters; with a mini-batch, this gradient is averaged over the examples in the batch. While SGD introduces noise into the optimization path, this stochasticity often helps the model escape shallow local minima and find flatter, more generalizable regions of the loss surface.
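To make the update rule concrete, here is a minimal sketch of mini-batch SGD in plain Python. It assumes a toy problem that is not part of the discussion above: fitting a line $y = wx + b$ under mean squared error, with illustrative function names (`minibatch_gradient`, `sgd`) and hyperparameters chosen only for this example.

```python
import random

def minibatch_gradient(theta, batch):
    """Gradient of the mean squared error of y = w*x + b over one mini-batch."""
    w, b = theta
    grad_w = grad_b = 0.0
    for x, y in batch:
        err = (w * x + b) - y
        grad_w += 2.0 * err * x
        grad_b += 2.0 * err
    n = len(batch)
    return grad_w / n, grad_b / n

def sgd(data, theta=(0.0, 0.0), lr=0.05, batch_size=8, epochs=200, seed=0):
    """Mini-batch SGD: theta <- theta - lr * (gradient averaged over the batch)."""
    rng = random.Random(seed)
    data = list(data)
    w, b = theta
    for _ in range(epochs):
        rng.shuffle(data)  # a fresh random batch order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            gw, gb = minibatch_gradient((w, b), batch)
            w -= lr * gw
            b -= lr * gb
    return w, b

# Hypothetical noise-free toy data drawn from y = 2x + 1.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-20, 21)]
w, b = sgd(data)
print(w, b)  # expected to land close to w = 2, b = 1
```

Each inner-loop iteration sees only `batch_size` examples, so the cost of a step does not depend on the size of the dataset.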

A primary weakness of SGD is its use of a single, global learning rate $\eta$ for every parameter. In deep networks, some features are rare and call for larger updates, while others are frequent and call for smaller updates to avoid divergence. This leads us to adaptive methods. The intuition behind Adam (Adaptive Moment Estimation) is to track the first and second moments of the gradients: the first moment acts as a 'momentum' term that accelerates movement in consistent directions, while the second moment sets a per-parameter scale for the step size.

Adam maintains exponentially decaying averages of past gradients $m_t$ (the first moment) and past squared gradients $v_t$ (the second moment). The formulations are: $$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$ where $g_t$ is the gradient at time $t$, the square $g_t^2$ is taken element-wise, and $\beta_1, \beta_2 \in [0, 1)$ are the decay rates. Because both averages are initialized at zero ($m_0 = v_0 = 0$), they are biased toward zero during the first steps of training. To compensate, we apply bias correction: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$.
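A short worked case shows what the correction does at the first step. Assuming the usual zero initialization $m_0 = 0$ and, purely for illustration, $\beta_1 = 0.9$: $$m_1 = \beta_1 \cdot 0 + (1-\beta_1) g_1 = 0.1\, g_1, \qquad \hat{m}_1 = \frac{m_1}{1-\beta_1^1} = \frac{0.1\, g_1}{0.1} = g_1$$ The raw average $m_1$ understates the gradient by a factor of ten, while the corrected $\hat{m}_1$ recovers it exactly. As $t$ grows, $\beta_1^t \to 0$, so the correction factor tends to 1 and $\hat{m}_t \approx m_t$.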

The final parameter update in Adam combines these elements to normalize the step size: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$ Here, $\epsilon$ is a small constant to prevent division by zero. Because $\hat{v}_t$ is the uncentered second moment, dividing by $\sqrt{\hat{v}_t}$ dampens the update for parameters whose gradients are large or noisy and relatively amplifies it for parameters whose gradients are small but consistent. In practice this often yields faster early convergence than plain SGD on poorly scaled, high-dimensional problems, although the gain depends on the task.
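The moment estimates, bias correction, and update can be put together in a few lines. The following is a sketch in plain Python rather than a reference implementation; the function name `adam_step`, the badly scaled toy objective $J(\theta) = \theta_0^2 + 100\,\theta_1^2$, and the hyperparameter values are assumptions made for illustration.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for th, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first moment
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second moment
        m_hat = m_i / (1.0 - beta1 ** t)             # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_theta.append(th - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

def J(theta):
    """Hypothetical toy objective with very different curvature per coordinate."""
    return theta[0] ** 2 + 100.0 * theta[1] ** 2

def grad_J(theta):
    return [2.0 * theta[0], 200.0 * theta[1]]

theta = [1.0, 1.0]
m = [0.0, 0.0]   # zero initialization of both moments
v = [0.0, 0.0]
initial_loss = J(theta)
first_step = None
for t in range(1, 2001):
    theta, m, v = adam_step(theta, grad_J(theta), m, v, t, lr=0.01)
    if t == 1:
        first_step = list(theta)

print(first_step)  # both coordinates move by about lr, despite a 100x gradient gap
print(J(theta))
```

On the first step the two gradients differ by a factor of 100, yet both parameters move by roughly `lr`, because $\hat{m}_1 / \sqrt{\hat{v}_1} = g_1 / |g_1|$. This is the per-parameter normalization at work.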

Despite the power of Adam, the learning rate $\eta$ still requires tuning. Learning Rate Scheduling is the practice of adjusting $\eta$ over the course of training. The intuition is to start with a large learning rate to explore the landscape rapidly and then 'cool' the system down by reducing $\eta$, so that the optimizer can settle into a minimum instead of bouncing around it. This is analogous to simulated annealing, an optimization technique that takes its name from the gradual cooling of metals.

Common scheduling strategies include Step Decay, where $\eta$ is multiplied by a fixed factor less than one every $N$ epochs, and Cosine Annealing, which follows a half-cosine curve: $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{T_{cur}}{T_{max}}\pi))$, where $T_{cur}$ is the number of epochs (or steps) elapsed and $T_{max}$ is the length of the schedule, so that $\eta_t$ falls from $\eta_{max}$ at $T_{cur} = 0$ to $\eta_{min}$ at $T_{cur} = T_{max}$. These schedules reduce the risk of 'overshooting' the minimum in the late stages of training and help the model settle into a stable, low-error solution.
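Both schedules are simple functions of the training progress, as the sketch below shows. The function names and the parameters `drop` and `every` are illustrative choices, not taken from any particular library.

```python
import math

def step_decay(eta0, epoch, drop=0.5, every=10):
    """Multiply the initial rate eta0 by `drop` once per `every` completed epochs."""
    return eta0 * drop ** (epoch // every)

def cosine_annealing(t_cur, t_max, eta_min=0.0, eta_max=0.1):
    """Half-cosine decay from eta_max (t_cur = 0) to eta_min (t_cur = t_max)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_max))

# Compare the two schedules at a few points of a hypothetical 100-epoch run.
for epoch in (0, 25, 50, 75, 100):
    print(epoch, step_decay(0.1, epoch), cosine_annealing(epoch, 100))
```

Step decay changes the rate abruptly at fixed intervals, whereas cosine annealing decreases it smoothly, slowly at first, fastest around the midpoint, and slowly again near the end.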