At its core, optimizing a deep neural network is a search for the minimum of a cost function $J(\theta)$, where $\theta$ represents the model weights. The global minimum is the ideal target, but because the loss landscape of a deep network is non-convex and high-dimensional, we cannot solve for it analytically. Instead, we use iterative updates that move the parameters in the direction of steepest descent, that is, along the negative gradient, and in practice accept a sufficiently low point rather than a guaranteed global minimum. The intuition is akin to a blindfolded hiker attempting to reach the bottom of a valley by feeling the slope of the ground beneath their feet and taking steps in the direction that descends most sharply.
Stochastic Gradient Descent (SGD) reduces the cost of each update by estimating the true gradient over the entire dataset from a small, random mini-batch of samples. The update rule is: $$\theta_{t+1} = \theta_t - \eta \cdot \nabla_{\theta} J(\theta_t; \mathcal{B})$$ where $\eta$ is the learning rate and $\nabla_{\theta} J(\theta_t; \mathcal{B})$ is the gradient calculated over the batch $\mathcal{B}$. While this introduces noise into the optimization path, the stochasticity often helps the model escape shallow local minima and saddle points, and it is often associated with better generalization on unseen data.
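The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the toy task (fitting a line with mean squared error), the function name `sgd_linear_fit`, and all hyperparameter values are assumptions chosen for the example.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y = w*x + b with mini-batch SGD on mean squared error (toy example)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a fresh random partition into mini-batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch-mean squared error with respect to w and b.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # theta_{t+1} = theta_t - eta * gradient over the batch
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noise-free data generated from y = 3x + 1 (hypothetical target values).
xs = [i / 10 for i in range(-10, 11)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = sgd_linear_fit(xs, ys)
```

Each step sees only `batch_size` samples, so individual gradients are noisy estimates of the full-dataset gradient, yet the parameters still approach the generating values on this toy problem.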
Despite its utility, SGD often struggles with 'ravines': areas where the surface curves much more steeply in one dimension than in another. The updates then oscillate across the steep walls while making little progress along the floor, which slows convergence. To counteract this, we introduce momentum, which accumulates an exponentially decaying sum of past gradients to smooth out oscillations. The momentum update is defined by: $$v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta_t)$$ $$\theta_{t+1} = \theta_t - v_t$$ Here, $\gamma$ (typically 0.9) is the momentum coefficient. It controls how quickly old gradients are forgotten, so $1 - \gamma$ plays the role of friction, and it allows the optimizer to build 'velocity' in directions of consistent descent.
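The effect of momentum can be sketched on a toy ravine. The quadratic $J(x, y) = \frac{1}{2}(x^2 + 50y^2)$, the starting point, the learning rate, and the helper name `steps_to_converge` are all assumptions made for illustration; the update itself follows the two equations above, and the gradient is exact rather than stochastic to keep the comparison deterministic.

```python
import math

def grad(theta):
    """Gradient of the toy ravine J(x, y) = 0.5 * (x**2 + 50 * y**2)."""
    x, y = theta
    return (x, 50.0 * y)

def steps_to_converge(lr, gamma, theta=(10.0, 1.0), tol=1e-6, max_steps=5000):
    """Apply v_t = gamma*v_{t-1} + lr*grad and theta_{t+1} = theta_t - v_t.

    gamma = 0 recovers plain gradient descent. Returns the number of steps
    taken until the distance to the minimum at the origin drops below tol,
    or max_steps if that never happens.
    """
    v = (0.0, 0.0)
    for step in range(1, max_steps + 1):
        g = grad(theta)
        v = tuple(gamma * v_i + lr * g_i for v_i, g_i in zip(v, g))
        theta = tuple(t_i - v_i for t_i, v_i in zip(theta, v))
        if math.hypot(*theta) < tol:
            return step
    return max_steps

steps_plain = steps_to_converge(lr=0.03, gamma=0.0)
steps_momentum = steps_to_converge(lr=0.03, gamma=0.9)
```

With this step size, plain gradient descent flips sign along the steep $y$ axis on every step and creeps along the shallow $x$ axis, so `steps_momentum` comes out smaller than `steps_plain` on this problem. The gap depends on the curvature ratio and on $\gamma$, so it should not be read as a general speed-up figure.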
The Adam (Adaptive Moment Estimation) optimizer extends this idea by computing an individual effective learning rate for every parameter. Writing $g_t = \nabla_{\theta} J(\theta_t)$ for the gradient at step $t$, Adam maintains exponential moving averages that estimate both the first moment $m_t$ (the mean) and the second moment $v_t$ (the uncentered variance) of the gradients: $$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$ where $g_t^2$ is the element-wise square and $v_t$ no longer denotes the momentum velocity used above. Because these moments are initialized at zero, they are biased toward zero in early steps, so Adam applies a bias correction: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$.
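To see why the correction matters, consider the very first step. The values $\beta_1 = 0.9$ and $\beta_2 = 0.999$ are assumed here as commonly used defaults; the text above does not fix them. With $m_0 = v_0 = 0$: $$m_1 = (1-\beta_1)\, g_1 = 0.1\, g_1, \qquad \hat{m}_1 = \frac{0.1\, g_1}{1 - 0.9} = g_1$$ $$v_1 = (1-\beta_2)\, g_1^2 = 0.001\, g_1^2, \qquad \hat{v}_1 = \frac{0.001\, g_1^2}{1 - 0.999} = g_1^2$$ Without the correction, the raw estimates would understate the first gradient by factors of 10 and 1000 respectively. As $t$ grows, $\beta_1^t$ and $\beta_2^t$ shrink toward zero and the correction fades.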
The final parameter update in Adam combines these corrected moments to scale the gradient: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$ where $\epsilon$ is a small constant that prevents division by zero. The term $\sqrt{\hat{v}_t}$ effectively normalizes the update: parameters with large, volatile gradients receive smaller updates, while parameters with small, consistent gradients receive larger updates. This makes Adam relatively robust to the initial choice of learning rate and effective when gradients are sparse, which is one reason it is widely used for large-scale transformers and deep CNNs.
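The full Adam step can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the API of any library: the function name `adam_minimize`, the default hyperparameters, and the badly scaled toy objective are all hypothetical choices for the example.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    """Run `steps` Adam updates on the parameter list `theta` and return the result."""
    theta = list(theta)
    m = [0.0] * len(theta)  # first-moment estimates, initialized at zero
    v = [0.0] * len(theta)  # second-moment estimates, initialized at zero
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i, g_i in enumerate(g):
            m[i] = beta1 * m[i] + (1 - beta1) * g_i
            v[i] = beta2 * v[i] + (1 - beta2) * g_i * g_i
            m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

def grad_fn(theta):
    # Hypothetical badly scaled objective: J = 500*x**2 + 0.0005*y**2.
    x, y = theta
    return [1000.0 * x, 0.001 * y]

first_step = adam_minimize(grad_fn, [1.0, 1.0], lr=0.1, steps=1)
```

Although the two gradient components differ by a factor of a million, both parameters move by roughly `lr` on the first step, because $\hat{m}_1 / \sqrt{\hat{v}_1}$ reduces to the sign of the gradient. This is the per-parameter normalization described above in its simplest form.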
Even with adaptive optimizers, a static learning rate $\eta$ is rarely optimal throughout the entire training process. Early in training, a high learning rate is desirable for rapid exploration of the parameter space. However, as the model nears a minimum, a high rate can cause the optimizer to overshoot the target, leading to instability. Learning rate scheduling addresses this by decaying $\eta$ over time. A common strategy is the 'Cosine Annealing' schedule, which varies $\eta$ according to: $$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)$$ where $T_{cur}$ is the number of steps (or epochs) elapsed and $T_{max}$ is the length of the schedule, so $\eta_t$ falls smoothly from $\eta_{max}$ at $T_{cur} = 0$ to $\eta_{min}$ at $T_{cur} = T_{max}$.
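The schedule translates directly into code. In the sketch below, the function name `cosine_annealing` and the specific values (100 steps, $\eta_{max} = 0.1$, $\eta_{min} = 0.001$) are assumptions picked for illustration.

```python
import math

def cosine_annealing(t_cur, t_max, eta_max, eta_min=0.0):
    """Learning rate at step t_cur of a cosine schedule lasting t_max steps."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_max))

# Hypothetical run: 100 steps, decaying from 0.1 down to 0.001.
schedule = [cosine_annealing(t, 100, eta_max=0.1, eta_min=0.001) for t in range(101)]
```

The rate starts at `eta_max`, passes through the midpoint of the two bounds halfway through the schedule, and ends at `eta_min`. The decay is slow at both ends and fastest in the middle, which keeps the early exploratory phase long before the steps shrink.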
Integrating these components (SGD for foundational stochasticity, Adam for adaptive scaling, and scheduling for fine-tuned convergence) allows us to train networks with millions of parameters. A practical trade-off remains between the fast convergence provided by adaptive methods and the superior final generalization often achieved by meticulously tuned SGD with momentum. Mastering these tools requires an understanding of the 'optimization trajectory' and the ability to diagnose whether a model is diverging due to excessive step sizes or stagnating due to premature decay.