At its core, optimizing a deep neural network is a search for a low point of a high-dimensional cost function $J(\theta)$. Because this function is generally non-convex, the practical goal is a good minimum rather than a guaranteed global one. The intuition behind Gradient Descent is akin to walking down a hill in thick fog: you cannot see the bottom, but you can feel the slope beneath your feet. By taking steps in the direction of steepest descent, which is opposite to the gradient, we iteratively refine our parameters $\theta$ to reduce the error. However, computing the gradient over the entire dataset at every step (Batch Gradient Descent) is expensive for large datasets and modern networks, which motivates a stochastic approximation.
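The "walking downhill" idea can be sketched in a few lines. This is a minimal illustration, not a training loop: the function name `gradient_descent`, the toy cost $J(\theta) = (\theta - 3)^2$, and the step size are all assumptions chosen for the example.

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeatedly step opposite to the gradient, starting from theta0."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # move against the slope
    return theta


# Toy cost J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
def toy_grad(theta):
    return 2.0 * (theta - 3.0)


theta_star = gradient_descent(toy_grad, theta0=0.0)
print(theta_star)  # ends up very close to the minimizer at theta = 3
```

On this convex toy problem every step shrinks the distance to the minimum by a constant factor; real loss surfaces offer no such guarantee.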
Stochastic Gradient Descent (SGD) reduces this cost by approximating the true gradient with a single random sample or a small 'mini-batch'. For a single sample, the update rule is $\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t; x^{(i)}, y^{(i)})$, where $\eta$ is the learning rate and $\nabla J$ is the gradient of the loss with respect to the parameters, evaluated on the training example $(x^{(i)}, y^{(i)})$; with a mini-batch, the gradient is averaged over the examples in the batch. This introduces noise into the optimization path, but each step is far cheaper, and the noise often helps the model escape shallow local minima and saddle points. In large-scale problems this usually means faster progress per unit of computation.
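A minimal sketch of mini-batch SGD, assuming a one-dimensional linear model $y \approx wx + b$ trained on squared error. The function name `sgd_linear_fit`, the synthetic data, and the hyperparameters are illustrative assumptions, not part of the text above.

```python
import random


def sgd_linear_fit(xs, ys, lr=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y ~ w * x + b with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean of 0.5 * (w*x + b - y)^2, using this batch only.
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


xs = [i / 10.0 for i in range(-10, 11)]  # 21 points in [-1, 1]
ys = [2.0 * x + 1.0 for x in xs]         # noiseless targets from w = 2, b = 1
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Each update touches at most four samples rather than all 21, yet the parameters still approach the values that generated the data. The targets here are noiseless, so the path settles; with noisy data it would keep jittering around the minimum at a constant learning rate.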
A significant limitation of vanilla SGD is that it applies the same learning rate in every direction, so it copes poorly with gradients whose scale differs across dimensions. If the loss surface is an elongated 'ravine', SGD tends to oscillate across the steep, narrow dimension while making slow progress along the shallow, long axis. To remedy this, we introduce Momentum. Momentum simulates a ball rolling down a hill by accumulating a velocity vector $v_t$, defined as $v_t = \gamma v_{t-1} + \eta \nabla J(\theta_t)$, where $\gamma \in [0, 1)$ is the momentum coefficient (0.9 is a common choice) and $v_0 = 0$. The parameter update then becomes $\theta_{t+1} = \theta_t - v_t$. Gradient components that flip sign from step to step largely cancel in $v_t$, which damps the oscillations, while components with a consistent sign accumulate, which accelerates descent along the ravine.
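The effect can be seen on a toy 'ravine'. The sketch below assumes the quadratic $J(x, y) = \frac{1}{2}(x^2 + 50y^2)$, which is shallow along $x$ and steep along $y$. The names `momentum_descent` and `ravine_grad` and all hyperparameters are assumptions made for illustration. Setting `gamma=0.0` recovers plain gradient descent, so the same function serves as the baseline.

```python
def momentum_descent(grad, theta0, lr=0.01, gamma=0.9, steps=200):
    """Gradient descent with momentum: v = gamma * v + lr * g, theta = theta - v."""
    theta = list(theta0)
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        velocity = [gamma * v + lr * gi for v, gi in zip(velocity, g)]
        theta = [t - v for t, v in zip(theta, velocity)]
    return theta


# Gradient of the toy ravine J(x, y) = 0.5 * (x^2 + 50 * y^2).
def ravine_grad(theta):
    x, y = theta
    return [x, 50.0 * y]


with_momentum = momentum_descent(ravine_grad, [1.0, 1.0], gamma=0.9)
without_momentum = momentum_descent(ravine_grad, [1.0, 1.0], gamma=0.0)
print(with_momentum, without_momentum)
```

With the same learning rate and step budget, the run without momentum is still far from zero along the shallow $x$ axis, while the momentum run has closed most of that gap.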
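To further refine this, the Adam (Adaptive Moment Estimation) optimizer was developed. Adam combines the benefits of Momentum and RMSProp by maintaining exponential moving averages of both the first moment (the mean) and the second raw moment (the uncentered variance) of the gradients. The first moment is tracked as $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, and the second moment as $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$, where $g_t$ is the current gradient, $g_t^2$ is its element-wise square, and $\beta_1, \beta_2 \in [0, 1)$ are decay rates. Note that $v_t$ here denotes the second-moment estimate, not the Momentum velocity. Because both averages are initialized at zero, they are biased toward zero during the first steps; Adam corrects for this with $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$, which gives more stable updates in the early stages of training.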
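As a quick check of the bias-correction step, consider the very first update. The worked example assumes zero initialization ($m_0 = v_0 = 0$), correction by dividing each estimate by $1 - \beta^t$, and the commonly used values $\beta_1 = 0.9$ and $\beta_2 = 0.999$; these particular values are an assumption, since the text does not fix them.

$$m_1 = \beta_1 \cdot 0 + (1 - \beta_1)\, g_1 = 0.1\, g_1, \qquad \hat{m}_1 = \frac{m_1}{1 - \beta_1^1} = \frac{0.1\, g_1}{0.1} = g_1$$

$$v_1 = \beta_2 \cdot 0 + (1 - \beta_2)\, g_1^2 = 0.001\, g_1^2, \qquad \hat{v}_1 = \frac{v_1}{1 - \beta_2^1} = \frac{0.001\, g_1^2}{0.001} = g_1^2$$

Without the correction, $m_1$ would be only a tenth of the observed gradient and $v_1$ a thousandth of its square. As $t$ grows, $\beta^t$ approaches zero and the correction fades out.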
The final update step in Adam scales the step for each individual parameter by the inverse square root of the bias-corrected second moment: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$, where $\epsilon$ is a small constant (for example $10^{-8}$) that prevents division by zero. This means that parameters with consistently large gradients receive a smaller effective learning rate, while those with sparse or small gradients receive a larger one. This per-parameter adaptation often lets Adam converge faster than plain SGD with less tuning, particularly on architectures such as Transformers; for other models, including many deep residual networks, well-tuned SGD with Momentum remains competitive.
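The pieces of Adam fit together as in the sketch below. It is a plain-Python illustration under stated assumptions rather than a reference implementation: the name `adam_minimize`, the default hyperparameters, and the badly scaled toy quadratic $J(x, y) = \frac{1}{2}(x^2 + 1000y^2)$ are all chosen for the example.

```python
import math


def adam_minimize(grad, theta0, lr=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    """Adam: per-parameter steps from bias-corrected moment estimates."""
    theta = list(theta0)
    m = [0.0] * len(theta)  # first-moment estimates
    v = [0.0] * len(theta)  # second-moment estimates
    for t in range(1, steps + 1):
        g = grad(theta)
        for i, gi in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * gi
            v[i] = beta2 * v[i] + (1.0 - beta2) * gi * gi
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy quadratic whose gradient is 1000 times larger along y than along x.
def scaled_grad(theta):
    x, y = theta
    return [x, 1000.0 * y]


first = adam_minimize(scaled_grad, [1.0, 1.0], steps=1)
final = adam_minimize(scaled_grad, [1.0, 1.0])
print(first, final)
```

After a single step both coordinates have moved by almost exactly `lr`, even though their gradients differ by a factor of 1000. This is the per-parameter scaling in action.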
Despite the power of adaptive optimizers, the choice of the global learning rate $\eta$ remains critical. Learning rate scheduling adjusts $\eta$ over the course of training so that the model makes fast progress early on and does not overshoot the minimum as it nears convergence. Common strategies include 'Step Decay', where $\eta$ is multiplied by a fixed factor (for example 0.1) every few epochs, and 'Cosine Annealing', which follows half a cosine curve from $\eta_{max}$ down to $\eta_{min}$: $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{T_{cur}}{T_{max}}\pi))$, where $T_{cur}$ is the number of epochs (or steps) elapsed and $T_{max}$ is the length of the schedule. Lowering the learning rate shrinks both the step size and the effect of gradient noise in the final stages, so the parameters can settle close to the bottom of the minimum they have reached instead of bouncing around it.
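Both schedules are simple functions of the training progress. The sketch below assumes the hypothetical names `step_decay` and `cosine_annealing` and illustrative default values; the cosine formula follows the expression given above.

```python
import math


def step_decay(epoch, lr0=0.1, drop=0.5, epochs_per_drop=10):
    """Multiply the initial rate lr0 by `drop` once every `epochs_per_drop` epochs."""
    return lr0 * drop ** (epoch // epochs_per_drop)


def cosine_annealing(t_cur, t_max, lr_min=0.0, lr_max=0.1):
    """Follow half a cosine wave from lr_max (t_cur = 0) to lr_min (t_cur = t_max)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / t_max))


for epoch in (0, 10, 25, 50, 100):
    print(epoch, step_decay(epoch), cosine_annealing(epoch, t_max=100))
```

Step decay changes the rate in abrupt jumps, whereas cosine annealing lowers it smoothly, slowly at first, fastest around the midpoint, and slowly again near the end.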