Optimization Dynamics: From SGD to Adaptive Moment Estimation

At its core, optimizing a deep neural network is the process of finding a set of weights $\theta$ that minimizes a cost function $J(\theta)$. The intuition is akin to navigating a foggy mountain range to find the lowest valley; since we cannot see the entire landscape, we feel the slope beneath our feet and take a step in the direction of the steepest descent. However, calculating the gradient over the entire dataset is computationally prohibitive for large-scale models, leading to the adoption of Stochastic Gradient Descent (SGD), which approximates the true gradient using a small, random subset of data called a mini-batch.
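To make the mini-batch approximation concrete, the following is a minimal sketch that compares a full-dataset gradient with a mini-batch estimate. The one-parameter linear model, the synthetic data, and the function names `full_gradient` and `minibatch_gradient` are assumptions chosen for illustration; none of them come from the text above.

```python
import random

def full_gradient(w, xs, ys):
    """Gradient of J(w) = mean((w*x - y)^2) computed over the whole dataset."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def minibatch_gradient(w, xs, ys, batch_size, rng):
    """Estimate the same gradient from a random subset (a mini-batch)."""
    idx = rng.sample(range(len(xs)), batch_size)
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in idx) / batch_size

# Synthetic data (hypothetical): y = 3x plus a little Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]

w = 0.0
g_full = full_gradient(w, xs, ys)
g_batch = minibatch_gradient(w, xs, ys, batch_size=32, rng=rng)

# The mini-batch estimate is noisy, but it should share the sign and
# rough size of the full gradient while touching only 32 of 1000 points.
print(g_full, g_batch)
```

The mini-batch version does a small fraction of the work per step, at the price of a noisier direction.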

Mathematically, basic SGD updates the parameters by subtracting a fraction of the gradient: $\theta_{t+1} = \theta_t - \eta \nabla_{\theta} J(\theta_t)$, where $\eta$ denotes the learning rate and the gradient is evaluated on the current mini-batch. While this approach is computationally efficient, the mini-batch gradient is a noisy estimate of the true gradient, and this variance can cause the loss to oscillate from step to step. To stabilize training, we often introduce a momentum coefficient $\gamma \in [0, 1)$ and a velocity $v_t$ that accumulates an exponentially decaying sum of past gradients: $v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta_t)$, resulting in the update $\theta_{t+1} = \theta_t - v_t$. This allows the optimizer to 'gain speed' in directions where successive gradients agree and to dampen oscillations in directions where they alternate in sign.
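The momentum update can be sketched in a few lines of plain Python. The function name `sgd_momentum`, the toy quadratic objective, and the hyperparameter values below are assumptions made for illustration, and the gradient here is exact rather than a mini-batch estimate.

```python
def sgd_momentum(grad_fn, theta, lr=0.05, gamma=0.9, steps=200):
    """Apply v_t = gamma * v_{t-1} + lr * g_t, then theta_{t+1} = theta_t - v_t."""
    v = [0.0] * len(theta)  # velocity starts at zero
    for _ in range(steps):
        g = grad_fn(theta)
        v = [gamma * vi + lr * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta

# Toy objective (hypothetical): J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2),
# whose gradient is [theta_0, 10 * theta_1] and whose minimum is at the origin.
def grad(theta):
    return [theta[0], 10.0 * theta[1]]

theta_final = sgd_momentum(grad, [5.0, 5.0])
print(theta_final)  # both coordinates should end up close to zero
```

Setting `gamma=0.0` recovers the plain SGD update, which makes it easy to compare the two on the same objective.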

While momentum aids stability, a fixed learning rate $\eta$ is rarely optimal. If $\eta$ is too high, the model may overshoot the minimum and diverge; if it is too low, convergence becomes agonizingly slow. This motivates the use of learning rate scheduling. A common approach is 'Step Decay,' where the learning rate is multiplied by a constant factor $\delta \in (0, 1)$ every few epochs. A more sophisticated method is the 'Cosine Annealing' schedule, which follows the curve $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)$, where $T_{cur}$ is the number of epochs (or steps) elapsed and $T_{max}$ is the total length of the schedule. The rate starts at $\eta_{max}$ and decays smoothly to $\eta_{min}$, allowing the model to explore the loss landscape aggressively at first and refine its position precisely toward the end of training.
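Both schedules are simple enough to write as pure functions of the epoch index. The sketch below is one possible rendering; the function names, the argument names, and the example values are assumptions for illustration.

```python
import math

def step_decay(lr0, factor, step_size, epoch):
    """Multiply the initial rate lr0 by `factor` once every `step_size` epochs."""
    return lr0 * factor ** (epoch // step_size)

def cosine_annealing(eta_min, eta_max, t_cur, t_max):
    """Cosine curve from eta_max (at t_cur = 0) down to eta_min (at t_cur = t_max)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_max))

# Example values (hypothetical): 100 epochs, rate between 0.0 and 0.1.
for epoch in (0, 25, 50, 75, 100):
    print(epoch,
          step_decay(0.1, 0.5, 10, epoch),
          cosine_annealing(0.0, 0.1, epoch, 100))
```

Step decay changes the rate in discrete jumps, while the cosine schedule changes it smoothly, slowly near the start and end and fastest in the middle.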

A key limitation of SGD is that it applies a single learning rate to all parameters, and this led to the development of adaptive optimizers. The core intuition behind Adaptive Moment Estimation (Adam) is that different parameters may require different step sizes depending on the history of their gradients. Parameters associated with rare but informative features, whose gradients are usually small or zero, should receive relatively larger updates, while parameters that consistently receive large gradients should be tempered. Adam achieves this by scaling each parameter's step individually, using running estimates of the first and second moments of its gradient.

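Writing $g_t = \nabla_{\theta} J(\theta_t)$ for the gradient at step $t$, Adam tracks an exponential moving average of the gradient (the first moment $m_t$) and of the element-wise squared gradient (the second moment $v_t$, which is unrelated to the momentum velocity used earlier). The formulations are: $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$ and $v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$. Because both averages are initialized at zero, they are biased toward zero during early iterations. To correct for this, we compute bias-corrected estimates: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$. The final parameter update is then $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$, with the square root and division applied element-wise. Here, $\epsilon$ is a small constant to prevent division by zero. The defaults suggested in the original Adam paper are $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$.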
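The update equations translate almost line for line into code. The following is a minimal sketch in plain Python, not a reference implementation: the function name `adam`, the toy quadratic objective, and the chosen learning rate are assumptions for illustration.

```python
import math

def adam(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Run Adam for `steps` iterations, starting from the list `theta`."""
    m = [0.0] * len(theta)  # first-moment estimate, m_0 = 0
    v = [0.0] * len(theta)  # second-moment estimate, v_0 = 0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction: undo the pull toward the zero initialization.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [ti - lr * mh / (math.sqrt(vh) + eps)
                 for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

# Toy objective (hypothetical): J(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2).
def objective(theta):
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)

def grad(theta):
    return [theta[0], 10.0 * theta[1]]

start = [5.0, -5.0]
theta_one = adam(grad, start, lr=0.1, steps=1)    # a single step
theta_many = adam(grad, start, lr=0.1, steps=500)
print(theta_one, objective(theta_many))
```

After the first step, bias correction gives $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so every coordinate moves by roughly $\eta$ in the direction opposite its gradient, regardless of the gradient's magnitude. In this toy run the two coordinates have gradients that differ by a factor of ten, yet both move by about 0.1, which is the per-parameter scaling at work.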

Despite Adam's rapid initial convergence, research suggests that SGD with momentum often generalizes better to unseen data in specific tasks like image classification. A commonly proposed explanation is that Adam's aggressive adaptation can lead the model to settle in sharp minima, whereas SGD's inherent noise encourages it to find flatter, more robust minima. Consequently, a common hybrid strategy is to start training with Adam for rapid progress and then switch to SGD for the final 'fine-tuning' phase to maximize generalization performance.