The Mechanics of Convergence: From SGD to Adam and Learning Rate Scheduling

At its core, training a deep neural network is an optimization problem: we seek a set of weights $\theta$ that minimizes a loss function $J(\theta)$. Full-batch Gradient Descent computes the gradient over the entire dataset at every step, which is computationally prohibitive for large datasets. Stochastic Gradient Descent (SGD) addresses this by estimating the gradient from a small 'mini-batch' of samples. The intuition is that a noisy estimate of the gradient is often 'good enough' to steer the model toward a local minimum, while significantly reducing the memory footprint and computational cost per iteration.
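
To make the mini-batch idea concrete, here is a minimal Python sketch. It assumes a one-parameter linear model fitted to synthetic data, neither of which comes from the discussion above, and the names `mse_gradient` and `sgd` are purely illustrative. It compares the full-dataset gradient with a 32-sample estimate, then runs SGD using only the noisy estimates.

```python
import random

def mse_gradient(w, samples):
    """Gradient of the mean of 0.5 * (w*x - y)^2 over the given samples."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)

def sgd(data, lr=0.1, batch_size=32, steps=500):
    """Plain SGD: each step uses a freshly sampled mini-batch."""
    w = 0.0
    for _ in range(steps):
        batch = random.sample(data, batch_size)
        w -= lr * mse_gradient(w, batch)
    return w

random.seed(0)

# Synthetic data, assumed for illustration: y = 3x plus a little noise.
data = []
for _ in range(1000):
    x = random.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + random.gauss(0.0, 0.1)))

# Full gradient (1000 samples) versus a mini-batch estimate (32 samples).
full_grad = mse_gradient(0.0, data)
batch_grad = mse_gradient(0.0, random.sample(data, 32))
print("full:", full_grad, "mini-batch:", batch_grad)

# SGD never touches the full dataset in a single step.
w_fit = sgd(data)
print("fitted w:", w_fit)
```

The mini-batch estimate differs from the full gradient but points in roughly the same direction, and in this toy setting the fitted weight should land close to the true slope of 3.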

Mathematically, the update rule for SGD is defined as: $$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} J(\theta_t; x^{(i)}, y^{(i)})$$ where $\eta$ is the learning rate, $(x^{(i)}, y^{(i)})$ denotes the sampled example (or mini-batch), and $\nabla_{\theta} J$ is the gradient of the loss with respect to the parameters. While conceptually simple, SGD suffers from challenges such as oscillations in steep ravines and slow progress on flat plateaus. To mitigate this, we often introduce Momentum, which accumulates an exponentially decaying sum of past gradients to dampen oscillations: $$v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta_t)$$ $$\theta_{t+1} = \theta_t - v_t$$ Here, $v_t$ is the velocity and $\gamma \in [0, 1)$ is the momentum coefficient (commonly around 0.9). It plays the role of friction by controlling how quickly past gradients are forgotten, allowing the optimizer to 'gain momentum' in directions where successive gradients agree.
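
The momentum update can be sketched in a few lines of Python. This example assumes a hypothetical elongated quadratic loss $J(\theta) = \frac{1}{2}(\theta_1^2 + 50\,\theta_2^2)$ as a stand-in for a 'ravine', and it uses the exact gradient rather than a mini-batch estimate so the comparison is deterministic. The function names and the values of $\eta$ and the step count are illustrative choices.

```python
CURVATURES = (1.0, 50.0)  # assumed: shallow along theta_1, steep along theta_2

def grad(theta):
    """Gradient of J(theta) = 0.5 * sum(c_i * theta_i^2)."""
    return [c * t for c, t in zip(CURVATURES, theta)]

def loss(theta):
    return 0.5 * sum(c * t * t for c, t in zip(CURVATURES, theta))

def run(gamma, eta=0.03, steps=200):
    """Iterate v_t = gamma*v_{t-1} + eta*g_t, theta_{t+1} = theta_t - v_t."""
    theta = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(theta)
        v = [gamma * vi + eta * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return loss(theta)

loss_plain = run(gamma=0.0)     # gamma = 0 reduces to plain gradient descent
loss_momentum = run(gamma=0.9)
print("plain:", loss_plain, "momentum:", loss_momentum)
```

With $\gamma = 0$ the steep direction caps the usable learning rate, so progress along the shallow direction is slow. With $\gamma = 0.9$ the accumulated velocity speeds up that shallow direction, and on this toy problem the final loss should be noticeably lower for the same number of steps.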

A key limitation of SGD is that a single global learning rate $\eta$ is applied to every parameter. In deep networks, different parameters may require different scales of updates; for example, parameters tied to rare features may need larger updates than those tied to frequent ones. This motivates Adaptive Optimization. The Adam (Adaptive Moment Estimation) optimizer addresses this by maintaining an effective step size for every parameter. Adam tracks both the first moment (the mean) and the second moment (the uncentered variance) of the gradients to scale the step size dynamically.

The Adam update process involves calculating exponential moving averages of the gradient, $m_t$, and of the elementwise squared gradient, $v_t$, where $g_t = \nabla_{\theta} J(\theta_t)$ is the current (mini-batch) gradient: $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$ $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$ Note that $v_t$ here denotes the second-moment estimate, not the momentum velocity used earlier. Because $m_t$ and $v_t$ are initialized to zero, they are biased toward zero during the first steps, so we apply bias correction: $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$. The final parameter update is then $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$ where $\epsilon$ is a small constant to prevent division by zero and all operations are applied elementwise.
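
The equations above translate almost line for line into code. The following is a minimal pure-Python sketch of a single Adam step, not a reference implementation. The function name `adam_step` is hypothetical, the defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are commonly used values rather than values stated in this text, and the two-parameter gradient is an assumed toy example.

```python
import math

def adam_step(theta, grads, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of parameters; t is the 1-based step index."""
    new_theta, new_m, new_v = [], [], []
    for th, g, m_prev, v_prev in zip(theta, grads, m, v):
        m_t = beta1 * m_prev + (1 - beta1) * g        # first-moment average
        v_t = beta2 * v_prev + (1 - beta2) * g * g    # second-moment average
        m_hat = m_t / (1 - beta1 ** t)                # bias correction
        v_hat = v_t / (1 - beta2 ** t)
        new_theta.append(th - eta * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_t)
        new_v.append(v_t)
    return new_theta, new_m, new_v

# Assumed toy gradients: the second is 50 times larger than the first.
theta = [1.0, 1.0]
grads = [1.0 * theta[0], 50.0 * theta[1]]
theta_next, m, v = adam_step(theta, grads, [0.0, 0.0], [0.0, 0.0], t=1, eta=0.1)
print(theta_next)
```

On the first step the bias-corrected ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ reduces to roughly the sign of the gradient, so both parameters move by about $\eta$ even though their gradients differ by a factor of 50. This is the per-parameter scaling in action.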

Despite the sophistication of Adam, the initial learning rate $\eta$ remains a critical hyperparameter. A constant learning rate often leads to 'bouncing' around the minimum rather than converging into it. Learning Rate Scheduling is the practice of adjusting $\eta$ over time. The intuition is to start with a large $\eta$ to cross the loss landscape quickly and escape poor local minima, then decay $\eta$ over time to refine the weights and stabilize convergence.

Common scheduling strategies include Step Decay, where the rate is multiplied by a fixed factor every few epochs, and Cosine Annealing, which follows a cosine curve to transition smoothly from a maximum to a minimum rate. For example, in Step Decay the learning rate at epoch $t$ is $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/S \rfloor}$, where $\eta_0$ is the initial rate, $S$ is the step size (the number of epochs between decays), and $\gamma \in (0, 1)$ is the decay factor (not to be confused with the momentum coefficient). More advanced techniques like 'Warm-up' start with a very low learning rate and increase it linearly over the first few hundred steps, which helps avoid unstable, overly large updates during the initial phase of training.
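
These schedules are simple functions of the step or epoch index, as the sketch below illustrates. The function names and argument values are hypothetical. The cosine formula shown is one common form (the text above does not spell it out), and the warm-up ramp is one of several reasonable ways to define a linear increase.

```python
import math

def step_decay(t, eta0, gamma, step_size):
    """eta_t = eta0 * gamma ** floor(t / step_size)."""
    return eta0 * gamma ** (t // step_size)

def cosine_annealing(t, eta_max, eta_min, total_steps):
    """Smoothly move from eta_max at t = 0 to eta_min at t = total_steps."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / total_steps))
    return eta_min + (eta_max - eta_min) * cosine

def linear_warmup(t, eta_target, warmup_steps):
    """Ramp linearly up to eta_target over warmup_steps, then hold it."""
    return eta_target * min(1.0, (t + 1) / warmup_steps)

# Example values, assumed for illustration only.
for epoch in (0, 9, 10, 25):
    print("step decay, epoch", epoch, "->", step_decay(epoch, 0.1, 0.5, 10))
for step in (0, 50, 100):
    print("cosine, step", step, "->", cosine_annealing(step, 0.1, 0.001, 100))
for step in (0, 49, 99, 500):
    print("warm-up, step", step, "->", linear_warmup(step, 0.1, 100))
```

In practice a warm-up phase is often followed by a decay schedule, for instance by using `linear_warmup` for the first steps and switching to `cosine_annealing` afterwards.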

In summary, the choice of optimizer defines the trajectory through the high-dimensional weight space. SGD provides a stochastic exploration that can improve generalization, Adam offers rapid convergence through per-parameter scaling, and scheduling helps the optimizer settle into a good minimum instead of bouncing around it; none of these methods guarantees reaching the global minimum of a non-convex loss. Understanding the interplay between the variance of the gradient and the scale of the step is essential for training stable and high-performing deep architectures.