At its core, training a deep neural network is an optimization problem where we seek the parameters $\theta$ that minimize a cost function $J(\theta)$. Because calculating the gradient over the entire dataset (Batch Gradient Descent) is computationally prohibitive for millions of samples, we use Stochastic Gradient Descent (SGD). The intuition is to estimate the true gradient of the total loss by calculating the gradient on a small, random subset of the data called a 'mini-batch'. This introduces noise into the optimization process, which, paradoxically, can help the model escape shallow local minima and settle in more robust regions of the parameter space.
Mathematically, the update rule for SGD is expressed as $\theta_{t+1} = \theta_t - \eta \nabla_{\theta} J(\theta_t; \mathcal{B})$, where $\eta$ is the learning rate and $\mathcal{B}$ is the mini-batch. While simple, SGD suffers from oscillations in directions of high curvature, especially in 'ravines' where the surface curves much more steeply in one dimension than another. To dampen these oscillations and accelerate convergence, we introduce Momentum. This adds a fraction $\gamma$ of the previous update vector to the current one: $v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta_t; \mathcal{B})$ and $\theta_{t+1} = \theta_t - v_t$.
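To make the ravine picture concrete, here is a minimal NumPy sketch of the two update rules on a toy quadratic. Several things are assumptions made for illustration and are not part of the text above: the cost $J(\theta) = \frac{1}{2}\sum_i c_i \theta_i^2$, the curvature values, the learning rate, the step count, and the helper names `quadratic_grad` and `sgd_momentum`. The gradient is also computed exactly rather than from a mini-batch, so the example isolates the effect of momentum from the effect of sampling noise.

```python
import numpy as np

def quadratic_grad(theta, curvatures):
    """Gradient of J(theta) = 0.5 * sum(curvatures * theta**2)."""
    return curvatures * theta

def sgd_momentum(theta0, grad_fn, lr, gamma, steps):
    """Apply v_t = gamma * v_{t-1} + lr * g_t, then theta_{t+1} = theta_t - v_t.

    Setting gamma = 0 recovers the plain gradient step.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)  # assumption: the velocity starts at zero
    for _ in range(steps):
        g = grad_fn(theta)
        v = gamma * v + lr * g
        theta = theta - v
    return theta

# Hypothetical ravine: curvature 1 along one axis, 100 along the other.
curvatures = np.array([1.0, 100.0])
grad_fn = lambda theta: quadratic_grad(theta, curvatures)
start = np.array([1.0, 1.0])

plain = sgd_momentum(start, grad_fn, lr=0.015, gamma=0.0, steps=200)
with_momentum = sgd_momentum(start, grad_fn, lr=0.015, gamma=0.9, steps=200)

print("distance to minimum, plain:   ", np.linalg.norm(plain))
print("distance to minimum, momentum:", np.linalg.norm(with_momentum))
```

For these particular settings the momentum run ends closer to the minimum at the origin after the same number of steps, because the plain update is held back by the slow progress along the low-curvature axis. Other choices of learning rate and $\gamma$ can behave differently.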
The primary limitation of SGD and Momentum is the use of a single global learning rate $\eta$ for all parameters. In deep networks, some features are sparse and require larger updates, while others are frequent and require smaller, more precise updates. Adaptive Moment Estimation, or Adam, addresses this by maintaining an effective learning rate for each individual parameter. Adam tracks running estimates of both the first moment (the mean) and the second raw moment (the uncentered variance) of the gradients and uses them to scale the updates dynamically.
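Writing $g_t = \nabla_{\theta} J(\theta_t; \mathcal{B})$ for the mini-batch gradient, the Adam optimizer computes exponential moving averages of the gradient, $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$, and of the squared gradient, $v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$, where the square is taken element-wise and $\beta_1, \beta_2 \in [0, 1)$ are decay rates. Because these moments are initialized at zero, they are biased toward zero during the first steps, so Adam applies a bias correction: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$. The final parameter update is then $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$, where $\epsilon$ is a small constant that prevents division by zero. This ensures that parameters with large, volatile gradients have their effective step size reduced, while parameters with consistently small gradients take relatively larger steps.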
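The update can be written in a few lines of NumPy. The sketch below follows the formulas in the preceding paragraph term by term. The function name `adam_step` and the default values for `beta1`, `beta2`, and `eps` are assumptions (they are commonly used defaults, not values given in the text), and the step counter `t` is assumed to start at 1.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. All array operations are element-wise.

    theta: parameters, g: gradient at theta,
    m, v: running first and second moment estimates,
    t: 1-based step counter used by the bias correction.
    """
    m = beta1 * m + (1 - beta1) * g        # moving average of the gradient
    v = beta2 * v + (1 - beta2) * g ** 2   # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Two parameters whose gradients differ in scale by a factor of 200.
theta = np.array([1.0, -2.0])
g = np.array([0.5, -100.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

theta_new, m, v = adam_step(theta, g, m, v, t=1, lr=0.1)
print(theta_new - theta)  # per-parameter step taken on the first update
```

On the very first step the bias correction gives $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so each parameter moves by approximately $\eta$ in the direction opposite its gradient sign, regardless of the gradient's magnitude. This is the per-parameter rescaling in its simplest form.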
Despite the power of adaptive optimizers, the choice of the initial learning rate $\eta$ remains a critical hyperparameter. A learning rate that is too high causes the loss to diverge, while one that is too low leads to agonizingly slow convergence or stagnation in suboptimal plateaus. Learning rate scheduling is the process of adjusting $\eta$ over time. The intuition is to start with a large step size to explore the parameter space rapidly and gradually decrease it to 'fine-tune' the weights as the model approaches a minimum.
Common scheduling strategies include Step Decay, where $\eta$ is multiplied by a fixed factor smaller than one every few epochs, and Cosine Annealing, which follows a cosine curve: $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{T_{cur}}{T_{max}}\pi))$, where $T_{cur}$ is the number of steps (or epochs) elapsed and $T_{max}$ is the length of the schedule, so that $\eta_t$ falls from $\eta_{max}$ at $T_{cur} = 0$ to $\eta_{min}$ at $T_{cur} = T_{max}$. Another powerful technique is 'Warm-up', where the learning rate starts very low and linearly increases for a few hundred steps. This prevents the model from diverging during the initial phase of training when the weights are random and gradients are volatile.
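The three schedules can be sketched as plain Python functions. The function names, the decay factor `drop`, the interval `every`, and the exact warm-up formula (reaching the target rate on the last warm-up step) are assumptions chosen for illustration. The cosine function implements the formula from the preceding paragraph directly.

```python
import math

def step_decay(eta0, epoch, drop=0.5, every=10):
    """Multiply eta0 by `drop` once every `every` epochs."""
    return eta0 * drop ** (epoch // every)

def cosine_annealing(t_cur, t_max, eta_max, eta_min=0.0):
    """eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t_cur / t_max))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_max))

def linear_warmup(step, warmup_steps, eta_target):
    """Increase linearly from eta_target / warmup_steps up to eta_target."""
    if step >= warmup_steps:
        return eta_target
    return eta_target * (step + 1) / warmup_steps

# Print each schedule at a few points to see its shape.
for epoch in (0, 10, 25):
    print("step decay, epoch", epoch, "->", step_decay(0.1, epoch))
for t in (0, 50, 100):
    print("cosine, t_cur", t, "->", cosine_annealing(t, 100, eta_max=0.1, eta_min=0.001))
for step in (0, 49, 99, 500):
    print("warm-up, step", step, "->", linear_warmup(step, 100, eta_target=0.1))
```

In practice warm-up is often chained with a decay schedule: the warm-up function supplies $\eta$ for the first steps, after which the cosine or step schedule takes over with $\eta_{max}$ set to the warm-up target.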
In summary, the evolution from SGD to Adam represents a shift from a single global step size to step sizes adapted to each parameter. Adam typically converges faster early in training and is generally more robust to hyperparameter settings, but SGD combined with a well-tuned cosine schedule often reaches better final generalization on specific benchmarks. The art of deep learning optimization lies in balancing the exploration of the loss landscape with the precision of convergence.