At its core, training a deep neural network is a search for a minimum (ideally the global minimum) of a cost function $J(\theta)$, where $\theta$ represents the model weights. Batch gradient descent computes the gradient over the entire dataset at every step, which is computationally prohibitive for large modern datasets. Stochastic Gradient Descent (SGD) instead estimates the gradient from a single random sample or a small 'mini-batch'. The intuition is that although each estimate is noisy, it points in the right direction on average, so the accumulated steps tend toward a minimum, and the inherent randomness can help the model escape shallow local minima.
Mathematically, the standard SGD update rule for a parameter $\theta$ at step $t$ is defined as: $$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} J(\theta_t; x^{(i)}, y^{(i)})$$ where $\eta$ is the learning rate and $\nabla_{\theta} J$ is the gradient of the loss function with respect to the weights for a specific sample $(x^{(i)}, y^{(i)})$. While effective, vanilla SGD suffers from oscillation in regions where the surface curves more steeply in one dimension than another, often leading to slow convergence in narrow 'canyons' of the loss landscape.
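As a minimal sketch of the update rule above, the following fits a one-parameter model $y \approx w x$ with single-sample SGD on the squared error $J = \frac{1}{2}(wx - y)^2$. The function name `sgd_linear_fit`, the toy data, and the hyperparameter values are illustrative assumptions, not part of any particular library.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x with single-sample SGD on J = 0.5 * (w*x - y)**2."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a fresh random order
        for i in indices:
            grad = (w * xs[i] - ys[i]) * xs[i]  # dJ/dw for sample i only
            w -= lr * grad                      # theta <- theta - eta * grad
    return w

# Noise-free toy data generated from y = 3x (an assumption for illustration).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [3.0 * x for x in xs]
w_hat = sgd_linear_fit(xs, ys)
print(w_hat)
```

Because the toy data contain no noise, every per-sample gradient vanishes at the same $w$, so the iterates settle there. With noisy data and a fixed learning rate, the estimate would instead keep fluctuating around the minimum.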
To accelerate convergence, we introduce 'Momentum'. Imagine a heavy ball rolling down a hill; it accumulates velocity as it descends, allowing it to push through small bumps and dampen oscillations. We track an exponentially decaying accumulation of past gradients, the velocity $v_t$, and use it to determine the update direction. The update equations become: $$v_{t+1} = \gamma v_t + \eta \nabla_{\theta} J(\theta_t)$$ $$\theta_{t+1} = \theta_t - v_{t+1}$$ Here, the momentum coefficient $\gamma$ (typically $0.9$) plays the role of friction: it controls how quickly old gradients are forgotten, so that the optimizer keeps a consistent direction based on historical trends rather than following only the current noisy gradient.
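To make the 'canyon' picture concrete, here is a small sketch that applies the two momentum equations to an ill-conditioned quadratic, $J(\theta) = \frac{1}{2}(\theta_0^2 + 25\,\theta_1^2)$, using the full gradient so that the effect of momentum is isolated from sampling noise. The objective, the helper names (`grad`, `run`, `distance_to_minimum`), and the values of $\eta$ and the step count are assumptions chosen for illustration.

```python
import math

def grad(theta):
    # Gradient of J(theta) = 0.5 * (theta[0]**2 + 25 * theta[1]**2),
    # a bowl that is 25x steeper along the second axis than the first.
    return [theta[0], 25.0 * theta[1]]

def run(gamma, lr=0.03, steps=100):
    """Iterate v <- gamma*v + lr*grad, theta <- theta - v; gamma=0 is plain descent."""
    theta = [10.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(theta)
        v = [gamma * vi + lr * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta

def distance_to_minimum(theta):
    # The minimum of this J is at the origin.
    return math.hypot(theta[0], theta[1])

plain = run(gamma=0.0)
heavy = run(gamma=0.9)
print(distance_to_minimum(plain), distance_to_minimum(heavy))
```

In this particular setup the learning rate is capped by the steep direction, which leaves plain descent crawling along the shallow one; the run with $\gamma = 0.9$ ends noticeably closer to the minimum after the same number of steps. Other objectives or hyperparameters may behave differently.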
Adam (Adaptive Moment Estimation) extends this idea by adapting the effective step size of every parameter individually. It tracks exponential moving averages of both the first moment (the mean of the gradients) and the second raw moment (the uncentered variance, i.e. the mean of the squared gradients). The intuition is that parameters with frequent, large gradients should have their updates scaled down to avoid divergence, while parameters with sparse, small gradients should be boosted to accelerate learning. This allows the optimizer to automatically adjust the step size based on the geometry of each parameter's landscape.
The Adam update mechanism is formulated as follows. First, we estimate the moments: $$m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla_{\theta} J(\theta_t)$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2) (\nabla_{\theta} J(\theta_t))^2$$ To account for the fact that these moments are initialized at zero, we apply bias correction: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$. The final parameter update is: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$ where $\epsilon$ is a small constant to prevent division by zero.
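The equations above translate almost line by line into code. The sketch below is a plain-Python illustration rather than a reference implementation: the function name `adam_minimize`, the quadratic test gradient, and the default hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are commonly used values) are assumptions, and the square and square root are applied element-wise.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Run Adam for `steps` iterations starting from `theta` (a list of floats)."""
    m = [0.0] * len(theta)  # first-moment estimate, initialized at zero
    v = [0.0] * len(theta)  # second raw-moment estimate, initialized at zero
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction compensates for the zero initialization.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        theta = [ti - lr * mh / (math.sqrt(vh) + eps)
                 for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

def quadratic_grad(theta):
    # Gradient of J(theta) = 0.5 * (theta[0]**2 + 25 * theta[1]**2).
    return [theta[0], 25.0 * theta[1]]

first_step = adam_minimize(quadratic_grad, [10.0, 1.0], steps=1)
later = adam_minimize(quadratic_grad, [10.0, 1.0], steps=500)
print(first_step, later)
```

One consequence of bias correction is visible in `first_step`: at $t = 1$ we have $\hat{m}_1 = g$ and $\hat{v}_1 = g^2$, so each coordinate moves by approximately $\eta$ regardless of how large its gradient is. Here the two gradient components are $10$ and $25$, yet both coordinates move by about $0.1$.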
Despite the power of Adam, the global learning rate $\eta$ still requires tuning. This is commonly handled with learning rate scheduling. A fixed learning rate is often either too large to settle into a minimum (causing 'chatter' around it) or too small to escape plateaus. Common strategies include 'Step Decay', where $\eta$ is multiplied by a constant factor every few epochs, and 'Cosine Annealing', which follows a cosine curve to smoothly reduce the rate. The mathematical intuition is to start with a high $\eta$ for rapid exploration and finish with a low $\eta$ for precise refinement.
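Both schedules can be written as small functions of the epoch index. The sketch below assumes the usual forms, $\eta_0 \cdot d^{\lfloor e / k \rfloor}$ for step decay and $\eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi e / T))$ for cosine annealing; the function and parameter names (`step_decay`, `cosine_annealing`, `drop`, `epochs_per_drop`) are illustrative, not taken from a specific framework.

```python
import math

def step_decay(eta0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the rate by `drop` once every `epochs_per_drop` epochs."""
    return eta0 * drop ** (epoch // epochs_per_drop)

def cosine_annealing(eta_max, epoch, total_epochs, eta_min=0.0):
    """Decay smoothly from eta_max (epoch 0) to eta_min (epoch == total_epochs)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + cosine)

# Print both schedules side by side over a 30-epoch run.
for epoch in range(0, 31, 5):
    print(epoch, step_decay(0.1, epoch), cosine_annealing(0.1, epoch, 30))
```

Step decay holds the rate constant between drops, which makes its effect easy to attribute, while cosine annealing changes the rate a little every epoch and needs the total training length fixed in advance.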
In summary, the progression from SGD to Adam represents a shift from manual tuning toward automated adaptation. While SGD with momentum remains a strong baseline, particularly where generalization matters, Adam's ability to handle non-stationary objectives and sparse gradients has made it a common default for deep architectures like Transformers. Combining an adaptive optimizer with a well-tuned decay schedule helps the network converge efficiently and generalize well to unseen data.