At its core, training a deep neural network is a search for a minimum (ideally the global one) of a high-dimensional loss function $J(\theta)$, where $\theta$ represents the model parameters. The intuition behind Gradient Descent is simple: to minimize a function, move in the direction of steepest descent, that is, opposite to the gradient. However, calculating the gradient over the entire dataset (Batch Gradient Descent) is computationally prohibitive for large-scale problems. Stochastic Gradient Descent (SGD) addresses this by approximating the true gradient using a small, random subset of the data called a mini-batch. This introduces noise into each update, which can help the model escape shallow local minima.
Mathematically, the update rule for SGD is defined as: $$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} J(\theta_t; \mathcal{B})$$ where $\eta$ is the learning rate and $\nabla_{\theta} J(\theta_t; \mathcal{B})$ is the gradient of the loss function computed over the mini-batch $\mathcal{B}$. While SGD is conceptually elegant, it struggles with 'ravines': areas where the surface curves much more steeply in one dimension than in another. This often leads to oscillations across the narrow valley, slowing down progress toward the optimum.
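To make the update rule concrete, here is a minimal sketch in plain Python. The toy problem (fitting $y = 3x + 1$ with a two-parameter linear model), the function names, and the hyperparameter values are all illustrative assumptions, not something prescribed by the text.

```python
import random

def sgd_step(theta, grad, lr):
    """One SGD update: theta <- theta - lr * grad, elementwise."""
    return [p - lr * g for p, g in zip(theta, grad)]

def minibatch_grad(theta, batch):
    """Gradient of the mean squared error of y = w*x + b over one mini-batch."""
    w, b = theta
    n = len(batch)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return [grad_w, grad_b]

def train_sgd(data, lr=0.05, batch_size=8, steps=2000, seed=0):
    """Run mini-batch SGD; all hyperparameter values here are illustrative."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # start from w = 0, b = 0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # random mini-batch B
        theta = sgd_step(theta, minibatch_grad(theta, batch), lr)
    return theta

# Hypothetical noise-free toy data generated from y = 3x + 1.
data = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(-20, 21)]
theta = train_sgd(data)
print(theta)  # the two entries should end up close to w = 3 and b = 1
```

Each step sees only 8 of the 41 points, so the gradient is an estimate of the full-batch gradient, yet the parameters still settle near the values that generated the data.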
To overcome the limitations of vanilla SGD, we introduce momentum. The intuition is akin to a heavy ball rolling down a hill: it accumulates velocity in directions of consistent descent, while components that keep flipping sign largely cancel out, which smooths the oscillations. Mathematically, we maintain a velocity vector $v_t$, an exponentially decaying accumulation of past gradients: $$v_{t+1} = \gamma v_t + \eta \nabla_{\theta} J(\theta_t)$$ and then update the parameters as $$\theta_{t+1} = \theta_t - v_{t+1}$$ Here, $\gamma$ is the momentum coefficient, typically set around 0.9, which ensures that the updates are dominated by the long-term trend of the gradient rather than the immediate local noise.
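The effect of the velocity term is easiest to see on a small 'ravine'. The sketch below applies the two momentum equations to a hypothetical quadratic, $J(\theta) = \frac{1}{2}(\theta_0^2 + 50\,\theta_1^2)$, chosen only for illustration; the starting point, learning rate, and step count are likewise assumptions. Setting $\gamma = 0$ recovers plain gradient descent for comparison.

```python
def momentum_step(theta, velocity, grad, lr=0.01, gamma=0.9):
    """v <- gamma * v + lr * grad, then theta <- theta - v (elementwise)."""
    new_velocity = [gamma * v + lr * g for v, g in zip(velocity, grad)]
    new_theta = [p - v for p, v in zip(theta, new_velocity)]
    return new_theta, new_velocity

def ravine_grad(theta):
    """Gradient of the toy loss J = 0.5 * (theta[0]**2 + 50 * theta[1]**2)."""
    return [theta[0], 50.0 * theta[1]]

def run(gamma, steps=200, lr=0.01):
    """Iterate from a fixed start; gamma = 0 is gradient descent without momentum."""
    theta, velocity = [10.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        theta, velocity = momentum_step(theta, velocity, ravine_grad(theta), lr, gamma)
    return theta

plain = run(gamma=0.0)
with_momentum = run(gamma=0.9)
print("no momentum:", plain)
print("momentum   :", with_momentum)
```

Along the shallow direction `theta[0]`, plain gradient descent shrinks the error by only a factor of 0.99 per step, so after 200 steps it is still far from zero, while the momentum run is much closer to the minimum in the same number of steps.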
While momentum helps with direction, it does not address the issue of the learning rate $\eta$ being a global scalar. In deep networks, different parameters may require different scales of updates: some may need to move aggressively, while others require precision. This leads us to Adam (Adaptive Moment Estimation). Adam computes individual adaptive learning rates for different parameters by calculating estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients: $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$ and $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$ where $g_t$ is the gradient at time $t$ and $g_t^2$ denotes the elementwise square. Note that $v_t$ here is Adam's second-moment estimate, not the velocity used in the momentum update.
The final update step in Adam incorporates bias correction to account for the fact that $m_t$ and $v_t$ are initialized at zero and are therefore biased toward zero during the first steps. The corrected estimates $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$ are then used to update the weights: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$ where $\epsilon$ is a small constant that prevents division by zero. The ratio $\hat{m}_t / \sqrt{\hat{v}_t}$ behaves like a signal-to-noise ratio: parameters with consistently large gradients see their effective learning rate reduced, while those with small or sparse gradients receive relatively larger updates, which roughly equalizes the step size across parameters.
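The moment estimates, bias correction, and parameter update can be sketched as a single function in plain Python. The default values for `lr`, `beta1`, `beta2`, and `eps` below are commonly used settings and are an assumption here, since the text does not specify them; the function name and list-based representation are also illustrative.

```python
import math

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of parameters; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, m_i, v_i, g in zip(theta, m, v, grad):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)             # bias-corrected mean
        v_hat = v_i / (1.0 - beta2 ** t)             # bias-corrected second moment
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Two parameters whose gradients differ by more than two orders of magnitude.
theta, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
theta, m, v = adam_step(theta, m, v, grad=[0.5, -100.0], t=1)
print(theta)
```

On the very first step the bias correction gives $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so each parameter moves by almost exactly `lr` in the direction opposite its gradient, regardless of the gradient's magnitude. That is the per-parameter normalization described above.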
Despite the power of Adam, the choice of the initial learning rate $\eta$ remains critical. Learning rate scheduling is the process of adjusting $\eta$ during training to improve convergence. The intuition is to start with a large $\eta$ to rapidly explore the parameter space and gradually decrease it to 'settle' into the minimum without overshooting. Two common approaches are Step Decay, where $\eta$ is multiplied by a fixed factor (e.g., 0.1) every few epochs, and Cosine Annealing, which follows half a period of a cosine curve: $$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)$$ where $T_{cur}$ is the number of steps (or epochs) elapsed and $T_{max}$ is the total length of the schedule, so that $\eta_t$ falls from $\eta_{max}$ at $T_{cur} = 0$ to $\eta_{min}$ at $T_{cur} = T_{max}$.
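Both schedules are short enough to write out directly. In this sketch the function names, the drop interval of 30 epochs, and the example learning rates are hypothetical choices for illustration; only the factor 0.1 and the cosine formula come from the text.

```python
import math

def step_decay(eta0, epoch, drop=0.1, epochs_per_drop=30):
    """Multiply the initial rate eta0 by `drop` once every `epochs_per_drop` epochs."""
    return eta0 * drop ** (epoch // epochs_per_drop)

def cosine_annealing(t_cur, t_max, eta_max, eta_min=0.0):
    """Cosine schedule: eta_max at t_cur = 0, eta_min at t_cur = t_max."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_max))

for epoch in (0, 29, 30, 65):
    print("step decay, epoch", epoch, "->", step_decay(0.1, epoch))

for t in (0, 25, 50, 75, 100):
    print("cosine, t =", t, "->", cosine_annealing(t, 100, eta_max=0.1))
```

Step decay holds the rate constant between drops, whereas the cosine schedule changes it a little on every step: slowly near the start and end, fastest in the middle.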
In practice, a sophisticated training pipeline often combines these elements. For instance, a 'Warm-up' period in which the learning rate increases linearly for the first few thousand steps helps stabilize the adaptive moments of Adam before the main decay begins. By balancing the stochastic nature of SGD, the adaptivity of Adam, and the precision of scheduling, we can navigate the complex, non-convex loss landscapes of deep learning to find weights that generalize well to unseen data.
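One way such a combined schedule might look is a linear warm-up followed by cosine decay. This is a sketch under stated assumptions: the function name, the 1,000-step warm-up, the 10,000-step total, and the peak rate of 1e-3 are hypothetical, and the exact shape of the warm-up varies between implementations.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, eta_max, eta_min=0.0):
    """Linear warm-up to eta_max, then cosine decay down to eta_min."""
    if step < warmup_steps:
        # Ramp linearly; the last warm-up step reaches exactly eta_max.
        return eta_max * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))

schedule = [warmup_cosine_lr(s, total_steps=10000, warmup_steps=1000, eta_max=1e-3)
            for s in range(10001)]
print(schedule[0], schedule[999], schedule[5500], schedule[10000])
```

The value returned for each step would be passed to the optimizer as its learning rate $\eta$ for that step, for example as the `lr` argument of an Adam update.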