Optimization Dynamics in Deep Learning: From SGD to Adam and Beyond

This lesson explores the evolution of optimization algorithms, transitioning from the foundational principles of Stochastic Gradient Descent to the adaptive mechanisms of Adam and the strategic necessity of learning rate scheduling. We will rigorously derive the update rules while maintaining an intuitive grasp of how these methods navigate the non-convex loss landscapes of modern deep networks.

The fundamental challenge in training deep neural networks is minimizing a high-dimensional, non-convex loss function $\mathcal{L}(\theta)$, where $\theta$ represents the model parameters. Stochastic Gradient Descent (SGD) addresses this by approximating the full-dataset gradient with the gradient of a mini-batch of data, which introduces noise that can help the iterates escape shallow local minima. The core intuition is that, although any single step direction is noisy, the expected value of the stochastic gradient equals the true gradient when mini-batches are sampled uniformly, which allows steady progress toward a stationary point.

Mathematically, the SGD update rule at iteration $t$ is $\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \mathcal{L}(\theta_t; x_i, y_i)$, where $\eta_t$ is the learning rate and $(x_i, y_i)$ denotes the sampled mini-batch (a single example in the strictest form of SGD). Unlike batch gradient descent, which computes the gradient over the entire dataset, SGD performs frequent, cheap updates whose gradient estimates have high variance. This variance can act as an implicit regularizer, but it requires careful tuning of $\eta_t$: too large a rate makes the iterates oscillate around the optimum instead of converging.
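The update rule can be sketched on a toy problem. The example below is a minimal illustration, not a reference implementation: it fits a line to noiseless synthetic data with mini-batch SGD on a mean squared error loss. The function name `sgd_linear_regression`, the data, and all hyperparameter values are assumptions chosen for the illustration.

```python
import random

def sgd_linear_regression(xs, ys, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w*x + b with mini-batch SGD (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a fresh random partition into mini-batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Mini-batch gradient of 0.5 * mean((w*x + b - y)^2) w.r.t. w and b.
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # theta_{t+1} = theta_t - eta * gradient
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic data generated from y = 3x + 1 (assumed for this sketch).
xs = [i / 10 for i in range(-10, 11)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_linear_regression(xs, ys)
print(w, b)
```

Because the data here are noiseless, every per-example gradient vanishes at the optimum, so the iterates settle close to the generating values. With noisy data, a constant learning rate would leave the iterates fluctuating around the optimum.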

To accelerate convergence and dampen oscillations, momentum was introduced. It accumulates a velocity vector $v_t$ that grows along directions in which successive gradients agree. The update equations become $v_t = \gamma v_{t-1} + \eta_t \nabla_\theta \mathcal{L}(\theta_t)$ and $\theta_{t+1} = \theta_t - v_t$, where $\gamma$ is the momentum coefficient, typically set to 0.9. In the physical analogy of a ball rolling down a hill, the optimizer builds speed in consistent directions, while gradient components that flip sign from step to step largely cancel out.
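The effect of the velocity term can be seen on an ill-conditioned quadratic. The following sketch uses the two momentum equations above with exact (non-stochastic) gradients to keep the comparison deterministic. The loss $\frac{1}{2}(x^2 + 50y^2)$, the starting point, and the values of `lr` and `steps` are assumptions made for the illustration.

```python
def grad(theta):
    """Gradient of the toy loss 0.5 * (x^2 + 50 * y^2)."""
    x, y = theta
    return (x, 50.0 * y)

def loss(theta):
    x, y = theta
    return 0.5 * (x * x + 50.0 * y * y)

def run(gamma, lr=0.02, steps=200):
    """Run momentum descent; gamma=0.0 reduces to plain gradient descent."""
    theta = (1.0, 1.0)
    v = (0.0, 0.0)
    for _ in range(steps):
        g = grad(theta)
        # v_t = gamma * v_{t-1} + eta * gradient
        v = tuple(gamma * vi + lr * gi for vi, gi in zip(v, g))
        # theta_{t+1} = theta_t - v_t
        theta = tuple(ti - vi for ti, vi in zip(theta, v))
    return loss(theta)

loss_plain = run(gamma=0.0)
loss_momentum = run(gamma=0.9)
print(loss_plain, loss_momentum)
```

In this setup the learning rate is capped by the steep $y$ direction, so plain gradient descent crawls along the shallow $x$ direction. With $\gamma = 0.9$ the same learning rate reaches a much lower loss in the same number of steps.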

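Adam (Adaptive Moment Estimation) combines the benefits of momentum with per-parameter adaptive learning rates, which has made it a common default for many deep learning tasks. Writing $g_t = \nabla_\theta \mathcal{L}(\theta_t)$, it maintains two exponential moving averages of the gradients: the first moment $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ (mean) and the second moment $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ (uncentered variance), both initialized to zero and computed element-wise. Note that $v_t$ here is not the momentum velocity from the previous paragraph. Because the zero initialization biases both averages toward zero in early iterations, the update uses the bias-corrected estimates $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$, leading to the parameter update $\theta_{t+1} = \theta_t - \frac{\eta_t}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$.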
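The origin of the factor $1-\beta_1^t$ can be made explicit with a short derivation. It is a sketch under a simplifying assumption that is not stated in the text above: the gradients $g_i$ share a common expectation $\mathbb{E}[g]$ across iterations. Unrolling the first-moment recurrence $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ from $m_0 = 0$ gives

$$m_t = (1-\beta_1)\sum_{i=1}^{t} \beta_1^{t-i} g_i, \qquad \mathbb{E}[m_t] = \mathbb{E}[g]\,(1-\beta_1)\sum_{i=1}^{t} \beta_1^{t-i} = \mathbb{E}[g]\,(1-\beta_1^t).$$

Dividing by $1-\beta_1^t$ therefore removes the bias, and the same argument with $\beta_2$ and $g_i^2$ yields $\hat{v}_t$. For example, with $\beta_1 = 0.9$ and $t = 1$, the uncorrected $m_1$ is only one tenth of $g_1$.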

The adaptive nature of Adam scales the learning rate of each parameter inversely with the square root of an exponential moving average of its past squared gradients. This is particularly effective for sparse features or problems with varying curvature: parameters with consistently large gradients receive smaller effective step sizes, while those with small or infrequent gradients receive larger ones. The term $\epsilon$ is a small constant added for numerical stability; it prevents division by zero when $\hat{v}_t$ is close to zero, for instance when gradients vanish.
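A single Adam step can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation: the function name `adam_step` is hypothetical, and the defaults `lr=0.001`, `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are commonly used values assumed here, since the text above does not specify them.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on lists of floats; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first-moment moving average
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second-moment moving average
        m_hat = m_i / (1 - beta1 ** t)             # bias-corrected first moment
        v_hat = v_i / (1 - beta2 ** t)             # bias-corrected second moment
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

theta = [1.0, 1.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
# Two parameters whose gradients differ in scale by a factor of 10^4.
grad = [100.0 * theta[0], 0.01 * theta[1]]
theta1, m, v = adam_step(theta, grad, m, v, t=1)
step_sizes = [abs(a - b) for a, b in zip(theta, theta1)]
print(step_sizes)
```

On the first step, $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so both parameters move by approximately the learning rate despite the large difference in gradient magnitude. Plain SGD would instead move the first parameter ten thousand times farther than the second.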

However, a learning rate that stays fixed throughout training is often suboptimal, which motivates learning rate scheduling, that is, letting $\eta_t$ vary with $t$. Two common strategies are step decay, where the learning rate is multiplied by a constant factor after a fixed number of epochs, and cosine annealing, which decreases the rate smoothly along a cosine curve: $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{T_{cur}}{T_{max}}\pi))$, where $T_{cur}$ counts the epochs elapsed and $T_{max}$ is the length of the schedule. These schedules allow for aggressive exploration early in training and fine-grained convergence later.
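Both schedules are simple functions of the epoch counter. The sketch below implements the cosine formula as written and one plausible form of step decay. The function names, the `drop` factor of 0.1, the interval of 30 epochs, and the numeric rates are illustrative assumptions, not values from this lesson.

```python
import math

def cosine_annealing(t_cur, t_max, eta_max, eta_min=0.0):
    """eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t_cur / t_max))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_max))

def step_decay(epoch, eta_0, drop=0.1, epochs_per_drop=30):
    """Multiply the initial rate by `drop` once every `epochs_per_drop` epochs."""
    return eta_0 * drop ** (epoch // epochs_per_drop)

# Cosine schedule over 100 epochs, from 0.1 down to 0.001 (assumed values).
schedule = [cosine_annealing(t, 100, eta_max=0.1, eta_min=0.001) for t in range(101)]
print(schedule[0], schedule[50], schedule[100])
print(step_decay(0, 0.1), step_decay(30, 0.1), step_decay(60, 0.1))
```

The cosine schedule starts at $\eta_{max}$, passes through the midpoint of the two rates halfway through, and ends at $\eta_{min}$. Step decay stays constant between drops.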

In practice, the interplay between the optimizer and the learning rate schedule strongly influences the generalization performance of the model. While Adam often converges faster initially, SGD with momentum and a carefully tuned decaying learning rate schedule often achieves better final generalization on large-scale image classification tasks. Understanding these dynamics allows practitioners to diagnose training issues, such as divergence or stagnation, and to select an appropriate optimization strategy for their specific architecture and dataset.

Ultimately, optimization in deep learning is not merely about finding the global minimum but about navigating a complex loss landscape toward a wide, flat minimum that generalizes well to unseen data. The evolution from simple SGD to adaptive methods like Adam, coupled with dynamic learning rate schedules, represents our growing ability to control this navigation. As models scale, the theoretical understanding of these algorithms continues to mature, narrowing the gap between empirical success and mathematical guarantees.