At its core, optimizing a deep neural network is a search for parameters $\theta$ that minimize a high-dimensional, non-convex cost function $J(\theta)$; the global minimum is the ideal target, but in practice training settles for a sufficiently low local minimum. The most intuitive approach is Gradient Descent: if you are standing on a hill in fog, you repeatedly step in the direction of steepest descent to reach the valley. However, computing the gradient over the entire training set at every step (Batch Gradient Descent) is computationally prohibitive for modern datasets. Stochastic Gradient Descent (SGD) addresses this by estimating the gradient from a single random sample or a small 'mini-batch'. The estimate is noisy, but the noise can actually help the model escape shallow local minima.
Mathematically, the SGD update rule for a parameter vector $\theta$ at iteration $t$ is defined as: $$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} J(\theta_t; x^{(i)}, y^{(i)})$$ where $\eta$ is the learning rate, $(x^{(i)}, y^{(i)})$ is a randomly drawn training example (or, for mini-batch SGD, the gradient is averaged over the examples in the batch), and $\nabla_{\theta} J$ is the gradient of the loss function with respect to the parameters. While SGD is computationally efficient, it tends to oscillate in 'ravines', areas where the surface curves much more steeply in one dimension than in another, and these oscillations can slow convergence significantly.
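To make the update rule concrete, here is a minimal sketch of mini-batch SGD in plain Python. The toy problem (fitting a line $y = wx + b$ under squared error), the function name `sgd_linear_regression`, and all hyperparameter values are assumptions chosen for illustration, not part of any particular library.

```python
import random


def sgd_linear_regression(xs, ys, lr=0.05, epochs=200, batch_size=4, seed=0):
    """Fit y ~ w*x + b with mini-batch SGD on half the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a fresh random order of examples each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of 0.5 * mean((w*x + b - y)^2), averaged over the batch.
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # theta_{t+1} = theta_t - eta * gradient estimate
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 3x - 1.
xs = [x / 10 for x in range(-20, 21)]
ys = [3.0 * x - 1.0 for x in xs]
w, b = sgd_linear_regression(xs, ys)
```

Each inner-loop step touches only `batch_size` examples, which is what makes the per-step cost independent of the dataset size.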
To counteract these oscillations, we introduce Momentum. Intuition suggests that if we consistently move in a certain direction, we should gain speed. Momentum accumulates an exponentially decaying sum of past gradients, effectively acting as a low-pass filter that dampens high-frequency noise and accelerates descent along the consistent direction of the valley. The update rule becomes: $$v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta_t)$$ $$\theta_{t+1} = \theta_t - v_t$$ where $v_t$ is the velocity (initialized to zero) and $\gamma$ (usually $0.9$) is the momentum coefficient.
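The effect of momentum in a ravine can be illustrated with a small experiment. The sketch below assumes a toy quadratic $J(\theta) = \frac{1}{2}\sum_i c_i \theta_i^2$ with curvatures $1$ and $50$, so one direction is much steeper than the other; the helper names and the starting point are hypothetical. Setting `gamma=0.0` recovers plain gradient descent, so the same function serves for both runs.

```python
def grad(theta, curvatures):
    """Gradient of J(theta) = 0.5 * sum(c_i * theta_i^2)."""
    return [c * t for c, t in zip(curvatures, theta)]


def loss(theta, curvatures):
    return 0.5 * sum(c * t * t for c, t in zip(curvatures, theta))


def descend(theta0, curvatures, lr, gamma, steps):
    """Run the momentum update; gamma=0.0 reduces to plain gradient descent."""
    theta = list(theta0)
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta, curvatures)
        # v_t = gamma * v_{t-1} + eta * gradient
        velocity = [gamma * v + lr * gi for v, gi in zip(velocity, g)]
        # theta_{t+1} = theta_t - v_t
        theta = [t - v for t, v in zip(theta, velocity)]
    return theta


curvatures = [1.0, 50.0]   # shallow direction vs. steep direction
start = [10.0, 1.0]

plain_loss = loss(descend(start, curvatures, lr=0.02, gamma=0.0, steps=100), curvatures)
momentum_loss = loss(descend(start, curvatures, lr=0.02, gamma=0.9, steps=100), curvatures)
```

With the same learning rate and step budget, the momentum run ends at a noticeably lower loss on this problem, because the learning rate that keeps the steep direction stable is too small for the shallow one and the accumulated velocity makes up for that.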
Despite the success of momentum, a single learning rate $\eta$ for all parameters is suboptimal. Parameters tied to sparse features receive gradients rarely and benefit from larger updates, while those tied to frequent features need smaller ones. This led to Adaptive Moment Estimation, or Adam. Adam combines the idea of Momentum (a running average of gradients, the first moment) with that of RMSProp (a running average of squared gradients, the second moment): each parameter's step is divided by the square root of its second-moment estimate. Adam thus maintains an effective per-parameter learning rate that adapts to the history of the gradient's magnitude.
The Adam optimization process is defined by calculating the first moment $m_t$ (mean) and the second moment $v_t$ (uncentered variance) of the gradient $g_t = \nabla_{\theta} J(\theta_t)$: $$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$ Here $g_t^2$ is the element-wise square, and $v_t$ denotes the second-moment estimate, not the velocity from the momentum update. Because both moments are initialized at zero, they are biased toward zero during the first steps, so we use bias-corrected estimates: $\hat{m}_t = \frac{m_t}{1-\beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1-\beta_2^t}$. The final update is: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$ where $\epsilon$ is a small constant that prevents division by zero.
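The equations above translate almost line for line into code. The following is a minimal sketch in plain Python, not the implementation of any particular framework; the function name `adam_step`, the toy quadratic with curvatures $1$ and $100$, and the hyperparameter values are assumptions for illustration.

```python
import math


def adam_step(theta, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first moment
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second moment
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


# Toy objective (an assumption): J(theta) = 0.5 * sum(c_i * theta_i^2).
curvatures = [1.0, 100.0]


def quadratic_grad(theta):
    return [c * p for c, p in zip(curvatures, theta)]


def quadratic_loss(theta):
    return 0.5 * sum(c * p * p for c, p in zip(curvatures, theta))


theta = [1.0, -2.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
initial_loss = quadratic_loss(theta)
first_step = None
for t in range(1, 501):
    new_theta, m, v = adam_step(theta, quadratic_grad(theta), m, v, t, lr=0.05)
    if t == 1:
        # Distance each parameter moved on the very first update.
        first_step = [abs(a - b) for a, b in zip(new_theta, theta)]
    theta = new_theta
final_loss = quadratic_loss(theta)
```

On the first step the bias-corrected ratio $\hat{m}_1 / \sqrt{\hat{v}_1}$ equals $g_1 / |g_1|$, so both parameters move by roughly `lr` even though their gradients differ by a factor of 200. This is the per-parameter scaling in action.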
Finally, $\eta$ need not stay fixed during training. A relatively high learning rate early in training helps the optimizer explore the parameter space, but the same rate late in training keeps the iterates bouncing around a minimum instead of converging to it, so the loss fluctuates. Learning Rate Scheduling addresses this by decaying $\eta$ over time. Common strategies include 'Step Decay', where the rate is multiplied by a fixed factor every $N$ epochs, and 'Cosine Annealing', which follows a cosine curve to smoothly reduce the rate: $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{T_{cur}}{T_{max}}\pi))$, where $T_{cur}$ is the number of epochs (or steps) elapsed and $T_{max}$ is the length of the schedule. This helps the model 'settle' into a good basin of the loss landscape.
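Both schedules are simple functions of the current epoch. The sketch below is one possible way to write them in plain Python; the function names, argument names, and the numeric values in the usage lines are assumptions for illustration.

```python
import math


def step_decay(epoch, eta0, drop=0.5, every=10):
    """Multiply the initial rate eta0 by `drop` once every `every` epochs."""
    return eta0 * drop ** (epoch // every)


def cosine_annealing(t_cur, t_max, eta_max, eta_min=0.0):
    """eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t_cur / t_max))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_max))


# Example schedules over 100 epochs (values are hypothetical).
step_rates = [step_decay(e, eta0=0.1) for e in range(100)]
cosine_rates = [cosine_annealing(e, t_max=100, eta_max=0.1, eta_min=0.001) for e in range(101)]
```

The cosine schedule starts at `eta_max` when `t_cur = 0`, passes through the midpoint of the two rates at `t_cur = t_max / 2`, and reaches `eta_min` at `t_cur = t_max`, whereas step decay stays flat between drops.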