At its core, training a deep neural network is an exercise in optimization: we seek the set of model parameters $\theta$ that minimizes a cost function $J(\theta)$. In the vast, high-dimensional landscape of such a function, we cannot compute the global minimum analytically. Instead, we use the gradient, the vector of partial derivatives of $J$ with respect to $\theta$, as a compass. If we take a sufficiently small step in the direction opposite to the gradient, we move 'downhill' toward a local minimum. Stochastic Gradient Descent (SGD) makes this practical by estimating the true gradient from small, random subsets of the data (mini-batches), which reduces the cost of each update and introduces noise that can help the optimizer escape shallow local minima.
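Mathematically, the standard SGD update rule is expressed as: $$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} J(\theta_t; x^{(i)}, y^{(i)})$$ where $\eta$ is the learning rate and $\nabla_{\theta} J$ is the gradient computed on the sampled example or mini-batch $(x^{(i)}, y^{(i)})$. While intuitive, vanilla SGD suffers from 'oscillation' in narrow valleys where the loss surface is steep in one dimension but shallow in another: a learning rate small enough to stay stable along the steep direction makes slow progress along the shallow one. To mitigate this, we introduce momentum, which accumulates an exponentially decaying sum of past gradients in a velocity term $v_t$ to smooth out updates: $$v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta_t) \quad \text{and} \quad \theta_{t+1} = \theta_t - v_t$$ where $\gamma \in [0, 1)$ is the momentum coefficient (commonly around 0.9). This effectively adds 'inertia' to the optimization process: gradient components that keep flipping sign partially cancel, while components that point in a consistent direction build up, accelerating convergence along them.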
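The two update rules above can be compared on a toy problem. The following is a minimal sketch, not a training recipe: it assumes a hypothetical two-parameter quadratic loss with very different curvatures (a stand-in for a narrow valley), uses the exact gradient rather than a mini-batch estimate, and the names `descend`, `CURVATURES`, and `START` are invented for illustration.

```python
def loss(theta, curvatures):
    """Quadratic bowl J(theta) = 0.5 * sum(c_i * theta_i^2)."""
    return 0.5 * sum(c * t * t for c, t in zip(curvatures, theta))


def grad(theta, curvatures):
    """Exact gradient of the quadratic: dJ/dtheta_i = c_i * theta_i."""
    return [c * t for c, t in zip(curvatures, theta)]


def descend(theta0, curvatures, lr, gamma, steps):
    """Apply v_t = gamma * v_{t-1} + lr * g_t, then theta_{t+1} = theta_t - v_t.

    Setting gamma = 0.0 reduces this to the plain gradient-descent update.
    """
    theta = list(theta0)
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta, curvatures)
        velocity = [gamma * v + lr * gi for v, gi in zip(velocity, g)]
        theta = [t - v for t, v in zip(theta, velocity)]
    return theta


# Assumed test problem: shallow along theta[0], steep along theta[1].
CURVATURES = [1.0, 50.0]
START = [10.0, 1.0]

plain = descend(START, CURVATURES, lr=0.03, gamma=0.0, steps=100)
with_momentum = descend(START, CURVATURES, lr=0.03, gamma=0.9, steps=100)

print("plain    loss:", loss(plain, CURVATURES))
print("momentum loss:", loss(with_momentum, CURVATURES))
```

With these particular settings the plain run flips sign along the steep coordinate on every step while crawling along the shallow one, and the momentum run should finish at a noticeably lower loss. Other curvatures or learning rates can change that picture, so the numbers are illustrative only.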
The primary limitation of SGD and momentum is the use of a single global learning rate $\eta$ for all parameters. In deep networks, parameters tied to sparse, rarely active features receive few informative gradients and benefit from larger updates, while parameters tied to frequent features benefit from smaller ones. Adaptive optimization algorithms address this by scaling the learning rate for each individual parameter. This is the foundation of the Adam (Adaptive Moment Estimation) optimizer, which combines the benefits of AdaGrad (scaling each parameter's step by its accumulated squared gradients) and RMSProp (replacing that ever-growing sum with an exponential moving average).
Adam maintains two exponential moving averages of the gradients: the first moment $m_t$ (the mean) and the second raw moment $v_t$ (the uncentered variance, not to be confused with the momentum velocity above): $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{and} \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$ where $g_t$ is the current gradient, $g_t^2$ is its elementwise square, and $\beta_1, \beta_2 \in [0, 1)$ are the decay rates of the two averages. Because these moments are initialized at zero, they are biased toward zero during early iterations, especially when $\beta_1$ and $\beta_2$ are close to 1. Adam corrects this using bias-correction terms: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \quad \text{and} \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
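To see why the correction matters, it helps to feed the first-moment recursion a constant gradient. The sketch below assumes $g_t = 1$ at every step and $\beta_1 = 0.9$ (both chosen purely for illustration): the raw average starts far below the true value, while the corrected estimate recovers it from the first step.

```python
beta1 = 0.9   # assumed decay rate for the first moment
g = 1.0       # assumed constant gradient, for illustration only

m = 0.0       # first moment, initialized at zero as in the text
raw, corrected = [], []
for t in range(1, 6):          # t is the 1-based step count
    m = beta1 * m + (1 - beta1) * g
    raw.append(m)
    corrected.append(m / (1 - beta1 ** t))

for t, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(f"t={t}  m_t={r:.4f}  m_hat_t={c:.4f}")
```

For a constant gradient the recursion gives $m_t = (1 - \beta_1^t)\, g$, so dividing by $1 - \beta_1^t$ returns exactly $g$ (up to floating-point rounding). The same argument applies to $v_t$ with $\beta_2$.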
The final parameter update in Adam combines these corrected moments to normalize the step size: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$ Here, $\epsilon$ is a small constant that prevents division by zero, and the square root and division are applied elementwise. The ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ behaves like a signal-to-noise ratio: when a parameter's recent gradients are large or fluctuate in sign, $\sqrt{\hat{v}_t}$ is large relative to $|\hat{m}_t|$ and the effective step shrinks, which guards against divergent updates; when the gradients are consistent, the step approaches $\eta$ in magnitude. This adaptive mechanism lets Adam handle sparse gradients and non-stationary objectives more robustly than standard SGD.
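Putting the moment updates, bias correction, and parameter update together gives a single Adam step. The following is a minimal sketch on plain Python lists, not a reference implementation: the function name `adam_step` is invented here, and the default values for `lr`, `beta1`, `beta2`, and `eps` are commonly used settings rather than anything specified in the text.

```python
import math


def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for th, gi, mi, vi in zip(theta, g, m, v):
        mi = beta1 * mi + (1 - beta1) * gi        # first moment m_t
        vi = beta2 * vi + (1 - beta2) * gi * gi   # second raw moment v_t
        m_hat = mi / (1 - beta1 ** t)             # bias-corrected mean
        v_hat = vi / (1 - beta2 ** t)             # bias-corrected uncentered variance
        new_theta.append(th - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v


# Two parameters whose gradients differ by five orders of magnitude.
theta = [1.0, 1.0]
g = [100.0, -0.001]
theta, m, v = adam_step(theta, g, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
print(theta)
```

On the first step $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so each parameter moves by roughly $\eta$ in the direction opposite to its gradient's sign, regardless of the gradient's scale. That is the per-parameter normalization described above in its simplest form.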
Despite the power of Adam, the learning rate $\eta$ remains a critical hyperparameter. A constant learning rate often leads to 'plateauing' or unstable oscillations around the minimum. Learning rate scheduling is the practice of adjusting $\eta$ over time. Common strategies include step decay, where $\eta$ is reduced by a factor every few epochs, and cosine annealing, which follows a half-cosine curve. A common formulation for step decay is: $$\eta_t = \eta_0 \cdot \gamma^{\lfloor t/k \rfloor}$$ where $\gamma < 1$ is the decay rate and $k$ is the step interval.
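Both schedules are short enough to write out directly. This is a sketch under stated assumptions: `step_decay` follows the formula above, while `cosine_annealing` uses one common half-cosine form, $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\pi t / T))$, whose floor `eta_min` and horizon `total` are parameters introduced here and not defined in the text.

```python
import math


def step_decay(eta0, gamma, k, t):
    """eta_t = eta0 * gamma ** floor(t / k), for integer epoch t >= 0."""
    return eta0 * gamma ** (t // k)


def cosine_annealing(eta0, eta_min, total, t):
    """Half-cosine decay from eta0 at t = 0 down to eta_min at t = total."""
    t = min(t, total)  # hold at eta_min once the horizon is reached
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / total))


# Illustrative settings: halve every 10 epochs vs. anneal over 50 epochs.
for epoch in (0, 9, 10, 25, 50):
    print(epoch,
          step_decay(0.1, 0.5, 10, epoch),
          cosine_annealing(0.1, 0.0, 50, epoch))
```

Step decay holds the rate constant within each interval of $k$ epochs and drops it abruptly at the boundaries, whereas the cosine form changes smoothly, decaying slowly at the start and end and fastest in the middle.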
In modern deep learning, warm-up schedules are frequently employed. In the initial phase of training, the learning rate is increased linearly from zero (or a small value) to the target value. This keeps the model from diverging early on, when gradients at a random initialization can be very large and, for adaptive optimizers such as Adam, the moment estimates are still based on only a few steps. Following the warm-up, the rate is decayed. This sequence lets the model first reach a stable region of the loss landscape before refining its position with smaller, more precise steps.
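A warm-up schedule can be expressed as a single function of the step count. The sketch below assumes a linear decay to zero after the warm-up, which is only one of several options (the text does not say which decay follows); the function name and its parameters are hypothetical.

```python
def warmup_then_linear_decay(step, peak_lr, warmup_steps, total_steps):
    """Linear ramp from 0 to peak_lr, then linear decay back to 0.

    The post-warm-up decay shape is an assumption; cosine or step decay
    could be substituted in the second branch.
    """
    if not 0 < warmup_steps < total_steps:
        raise ValueError("need 0 < warmup_steps < total_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Illustrative settings: 100 warm-up steps out of 1000 total.
for step in (0, 50, 100, 550, 1000):
    print(step, warmup_then_linear_decay(step, 1e-3, 100, 1000))
```

The rate peaks exactly at `warmup_steps` and the two branches agree there, so the schedule is continuous. The returned value would typically be written into the optimizer's learning rate before each update.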
To summarize, the transition from SGD to Adam represents a move from a single global step size to per-parameter, adaptive step sizes. While SGD with momentum is still favored in some computer vision tasks, where it is often reported to generalize better, Adam's rapid convergence makes it the default for Transformers and large-scale NLP models. The combination of an adaptive optimizer and a well-tuned learning rate schedule is a large part of what enables the training of networks with hundreds of billions of parameters without catastrophic divergence.