Stochastic Gradient Descent, Adam, and the Dynamics of Learning Rate Scheduling

An exploration of how first-order optimization methods navigate high-dimensional loss landscapes. We will analyze the transition from basic SGD to adaptive moments and the critical role of decay schedules.

At its core, training a deep network is an optimization problem where we seek to minimize a cost function $J(\theta)$ by adjusting the model parameters $\theta$. The simplest approach is Gradient Descent, which moves parameters in the direction of the steepest descent. However, calculating the gradient over the entire dataset is computationally prohibitive for large-scale deep learning. Stochastic Gradient Descent (SGD) solves this by estimating the gradient using a single random sample or a small 'mini-batch', introducing a form of noise that can actually help the optimizer escape shallow local minima and saddle points.
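To make the idea of a gradient estimate concrete, here is a minimal sketch assuming a one-parameter least-squares model with per-sample loss $\frac{1}{2}(\theta x - y)^2$. The lesson does not specify a model, so the toy data and the helper names `sample_gradient`, `full_gradient`, and `minibatch_gradient` are hypothetical.

```python
import random

# Toy data generated from y = 3x (assumed for illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]


def sample_gradient(theta, x, y):
    """Gradient of 0.5 * (theta * x - y)**2 with respect to theta."""
    return (theta * x - y) * x


def full_gradient(theta):
    """Exact gradient: average over every sample in the dataset."""
    return sum(sample_gradient(theta, x, y) for x, y in zip(xs, ys)) / len(xs)


def minibatch_gradient(theta, batch_size, rng):
    """Noisy estimate: average over a random subset of the dataset."""
    idx = rng.sample(range(len(xs)), batch_size)
    return sum(sample_gradient(theta, xs[i], ys[i]) for i in idx) / batch_size


rng = random.Random(0)
theta = 0.0
exact = full_gradient(theta)
estimate = minibatch_gradient(theta, batch_size=2, rng=rng)
print(f"full gradient: {exact:.3f}, mini-batch estimate: {estimate:.3f}")
```

The mini-batch value generally differs from the full gradient from draw to draw, but it only touches `batch_size` samples instead of the whole dataset. That trade of accuracy for cost is what SGD exploits.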

Mathematically, the SGD update rule is $\theta_{t+1} = \theta_t - \eta \nabla_{\theta} J(\theta_t; x^{(i)}, y^{(i)})$, where $\eta$ is the learning rate and $\nabla_{\theta} J$ is the gradient of the loss with respect to the parameters, evaluated on a single sample $i$ (or averaged over a mini-batch). While SGD is computationally efficient, its gradient estimates have high variance, which can lead to erratic convergence. To stabilize this, we often introduce 'Momentum', which accumulates an exponentially decaying sum of past gradients in a velocity term: $v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta_t)$, followed by $\theta_{t+1} = \theta_t - v_t$, where $\gamma \in [0, 1)$ is the momentum coefficient (0.9 is a common choice). This acts like a ball rolling down a hill, gaining speed in consistent directions while canceling out oscillations.
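The two update rules can be compared side by side. This is a minimal sketch assuming the toy loss $J(\theta) = \frac{1}{2}\theta^2$, whose gradient is simply $\theta$. The function names and the values $\eta = 0.01$ and $\gamma = 0.9$ are illustrative choices, not taken from the lesson.

```python
def grad(theta):
    """Gradient of the toy loss J(theta) = 0.5 * theta**2 (an assumption)."""
    return theta


def sgd_step(theta, lr):
    """Plain update: theta <- theta - lr * grad."""
    return theta - lr * grad(theta)


def momentum_step(theta, v, lr, gamma):
    """Momentum update: v <- gamma * v + lr * grad, then theta <- theta - v."""
    v = gamma * v + lr * grad(theta)
    return theta - v, v


theta_plain = 5.0
theta_mom, v = 5.0, 0.0
for _ in range(20):
    theta_plain = sgd_step(theta_plain, lr=0.01)
    theta_mom, v = momentum_step(theta_mom, v, lr=0.01, gamma=0.9)

print(f"plain SGD: {theta_plain:.4f}, with momentum: {theta_mom:.4f}")
```

With the same learning rate, the momentum run ends much closer to the minimum at $\theta = 0$ after 20 steps, because the velocity keeps growing while successive gradients point the same way.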

Despite the utility of momentum, a single global learning rate $\eta$ is often suboptimal because different parameters may require updates of different scales: some parameters receive sparse gradients, while others have gradients that vanish or explode. This motivates adaptive optimization. Adam (Adaptive Moment Estimation) addresses this by giving each parameter its own effective step size. It tracks both the first moment (the mean) and the second moment (the uncentered variance) of the gradients to dynamically scale the step size.

The Adam algorithm computes the first moment estimate $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$ and the second moment estimate $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$, where $g_t$ is the gradient at time $t$ and the square is taken element-wise. (Here $v_t$ denotes the second moment, not the momentum velocity used earlier.) Because both moments are initialized at zero, they are biased toward zero during the first steps, so Adam applies bias correction: $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$. The final parameter update is then $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$, where $\epsilon$ is a small constant that prevents division by zero. Because each step is divided by the running root-mean-square of the gradient, parameters with large or volatile gradients receive proportionally smaller steps, while parameters with small, consistent gradients receive relatively larger ones.
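The equations above translate almost line for line into code. This is a minimal sketch for a single scalar parameter. The function name `adam_step` is hypothetical, and the default values ($\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) are the commonly cited defaults rather than values stated in this lesson.

```python
import math


def adam_step(theta, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g          # first moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Two parameters whose gradients differ by six orders of magnitude.
for g in (1000.0, 0.001):
    theta, m, v = adam_step(theta=0.0, m=0.0, v=0.0, g=g, t=1)
    print(f"gradient {g:>8}: first step = {theta:.6f}")
```

On the very first step the bias-corrected ratio $\hat{m}_1 / \sqrt{\hat{v}_1}$ equals $g_1 / |g_1|$, so both parameters move by roughly $\eta$ regardless of gradient magnitude. This is the per-parameter normalization at work.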

While Adam handles per-parameter scaling, the global learning rate $\eta$ still needs to be managed over the course of training. If $\eta$ is too high, the model may overshoot the minimum; if it is too low, training stalls. Learning rate scheduling is the practice of adjusting $\eta$ over time. Two common approaches are 'Step Decay', where the learning rate is multiplied by a fixed factor less than one every $N$ epochs, and 'Cosine Annealing', which follows half a cosine curve to smoothly decrease the rate to near zero.
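Both schedules are short enough to write out directly. This is a minimal sketch: the function names, the halving factor, the 10-epoch interval, and the optional floor `eta_min` are illustrative assumptions, not values prescribed by the lesson.

```python
import math


def step_decay(eta0, epoch, factor=0.5, every=10):
    """Multiply the rate by `factor` once every `every` epochs."""
    return eta0 * factor ** (epoch // every)


def cosine_annealing(eta0, epoch, total_epochs, eta_min=0.0):
    """Follow half a cosine wave from eta0 down to eta_min."""
    progress = epoch / total_epochs
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * progress))


for epoch in (0, 10, 25, 50):
    print(epoch, step_decay(0.1, epoch), cosine_annealing(0.1, epoch, total_epochs=50))
```

Step decay changes the rate in abrupt jumps, while cosine annealing changes it continuously, slowly at the start and end and fastest in the middle of training.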

The formal logic behind scheduling is that early in training, we need a large $\eta$ to explore the loss landscape rapidly. As the model converges toward a minimum, we decrease $\eta$ to allow the optimizer to 'settle' into the narrowest part of the valley without bouncing out. This is often represented as $\eta_t = \eta_0 \cdot f(t)$, where $f(t)$ is a decay function. When combined with Adam, scheduling acts as a coarse-grained control to ensure the adaptive mechanism doesn't converge prematurely to a suboptimal point.
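The 'settling' effect can be seen in a deliberately tiny example. This sketch assumes plain SGD alternating between two samples with per-sample losses $\frac{1}{2}(\theta - 1)^2$ and $\frac{1}{2}(\theta + 1)^2$, so the average loss is minimized at $\theta = 0$. The inverse-time decay $f(t) = 1 / (1 + kt)$ is one possible choice of $f$, and all names and constants here are hypothetical.

```python
def run(schedule, steps=200, eta0=0.5):
    """SGD on two alternating samples whose individual minima are +1 and -1.

    Each per-sample loss is 0.5 * (theta - c)**2, so the average loss is
    minimized at theta = 0, but every single-sample gradient pulls away from it.
    """
    theta = 2.0
    for t in range(steps):
        c = 1.0 if t % 2 == 0 else -1.0   # which sample is drawn this step
        eta = eta0 * schedule(t)          # eta_t = eta_0 * f(t)
        theta -= eta * (theta - c)
    return theta


def constant(t):
    """f(t) = 1: no decay."""
    return 1.0


def inverse_time(t, k=0.1):
    """f(t) = 1 / (1 + k * t): an assumed decay function."""
    return 1.0 / (1.0 + k * t)


print("constant rate, final theta:", run(constant))
print("decayed rate,  final theta:", run(inverse_time))
```

With a constant rate the iterate keeps bouncing between the two samples at a fixed distance from the minimum. With the decayed rate the bounce shrinks along with $\eta_t$, so the final value lands much nearer to $\theta = 0$.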

In summary, the progression from SGD to Adam and the integration of scheduling represent a move from rigid updates to highly flexible, data-driven optimization. SGD provides the foundation, Momentum adds stability, Adam provides per-parameter precision, and scheduling provides the global trajectory. Together, these tools allow us to optimize networks with millions of parameters across non-convex surfaces that would otherwise be impossible to navigate.