To understand optimization in deep learning, imagine a high-dimensional landscape where the elevation represents the loss function $J(\theta)$. The goal of an optimizer is to reach a point of low loss on this surface, ideally the global minimum, although for non-convex networks a good local minimum is usually the realistic target. Stochastic Gradient Descent (SGD) makes each step cheap by estimating the gradient from a small batch of data rather than the entire dataset. This introduces 'noise', which, paradoxically, can help the model escape shallow local minima and saddle points that might trap deterministic gradient descent.
Mathematically, SGD updates the parameters $\theta$ by moving in the opposite direction of the gradient of the cost function. For a learning rate $\eta$, the update rule is defined as: $$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} J(\theta_t)$$ where $\nabla_{\theta} J(\theta_t)$ is the estimate of the gradient based on a mini-batch. While intuitive, SGD suffers from a critical flaw: it uses a single global learning rate for all parameters, regardless of how frequently a specific feature appears in the data or how steep the curvature is for a particular dimension.
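The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the linear-regression problem, the helper names `sgd_step` and `minibatch_grad`, the batch size of 32, and the learning rate of 0.1 are all assumptions chosen to keep the example small.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """One SGD update: theta <- theta - lr * grad."""
    return theta - lr * grad

def minibatch_grad(theta, X, y):
    """Gradient of 0.5 * mean((X @ theta - y)**2) on one mini-batch."""
    residual = X @ theta - y
    return X.T @ residual / len(y)

# Toy data (assumed for illustration): noiseless linear targets.
rng = np.random.default_rng(0)
true_theta = np.array([2.0, -3.0])
X = rng.normal(size=(256, 2))
y = X @ true_theta

theta = np.zeros(2)
for step in range(500):
    # Sample a mini-batch instead of using all 256 rows.
    idx = rng.choice(len(X), size=32, replace=False)
    grad = minibatch_grad(theta, X[idx], y[idx])
    theta = sgd_step(theta, grad, lr=0.1)

print(theta)  # should end up close to true_theta
```

Note that the same `lr` multiplies every coordinate of the gradient, which is the single global learning rate described above.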
To address some of SGD's limitations, we introduce Momentum. Momentum mimics a heavy ball rolling down a hill; it accumulates velocity from previous gradients, helping the optimizer dampen oscillations in 'ravines' and accelerate through flat regions. We introduce a velocity term $v_t$: $$v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta_t)$$ $$\theta_{t+1} = \theta_t - v_t$$ Here, $\gamma$ is the momentum coefficient (typically $0.9$), so the current update is an exponentially decaying weighted sum of past gradients, which smooths the optimization path.
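To make the 'ravine' picture concrete, here is a small sketch that runs plain gradient descent and the momentum update side by side. The quadratic loss $J(\theta) = \frac{1}{2}(\theta_1^2 + 50\,\theta_2^2)$, the learning rate of 0.02, the 200 steps, and the function names are assumptions picked for illustration; full-batch gradients are used so the comparison is deterministic.

```python
import numpy as np

# Assumed toy loss: an elongated bowl, steep along the second axis.
CURVATURE = np.array([1.0, 50.0])

def loss(theta):
    return 0.5 * np.sum(CURVATURE * theta**2)

def grad(theta):
    return CURVATURE * theta

def run_gd(theta, lr=0.02, steps=200):
    """Plain gradient descent: theta <- theta - lr * grad."""
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

def run_momentum(theta, lr=0.02, gamma=0.9, steps=200):
    """Momentum: v <- gamma * v + lr * grad, then theta <- theta - v."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + lr * grad(theta)
        theta = theta - v
    return theta

start = np.array([1.0, 1.0])
loss_gd = loss(run_gd(start))
loss_momentum = loss(run_momentum(start))
print(loss_gd, loss_momentum)
```

In this setup, plain gradient descent makes slow progress along the shallow first axis, while the accumulated velocity lets momentum cover the same distance in far fewer steps, so it should finish at a lower loss.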
The Adam (Adaptive Moment Estimation) optimizer takes this further by computing individual adaptive learning rates for different parameters. It tracks both the first moment $m_t$ (the mean of gradients) and the second moment $v_t$ (the uncentered variance of gradients; note that $v_t$ here is no longer the momentum velocity). The updates are as follows: $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$ $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$ where $g_t$ is the current gradient and the square is taken element-wise. By dividing the update by $\sqrt{v_t}$, Adam scales each parameter's step size inversely to the recent magnitude of its gradient, effectively performing larger updates for infrequent features and smaller updates for frequent ones.
Because $m_t$ and $v_t$ are initialized to zero, they are biased toward zero during the initial steps. Adam corrects this using bias-correction terms: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$ The final parameter update then becomes: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$ The term $\epsilon$ is a tiny constant to prevent division by zero. This mechanism allows Adam to handle sparse gradients and non-stationary objectives with remarkable efficiency.
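The full Adam step, including bias correction, can be sketched as a single function. The hyperparameter defaults here ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 10^{-3}$) are commonly used values and are an assumption, as is the function name `adam_step`; the text above does not fix them.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `t` is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2    # second-moment estimate (element-wise)
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Two parameters whose gradients differ by four orders of magnitude.
theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
g = np.array([100.0, 0.01])

theta_new, m, v = adam_step(theta, g, m, v, t=1)
print(np.abs(theta_new - theta))
```

On the very first step the bias-corrected moments reduce to $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so each parameter moves by roughly $\eta$ regardless of its gradient scale. Without the correction, the raw ratio $m_1 / \sqrt{v_1}$ would be inflated by a factor of $(1 - \beta_1) / \sqrt{1 - \beta_2}$, which is about 3.2 with the values assumed here.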
Finally, we must address the learning rate $\eta$ itself. Even with adaptive optimizers, a static $\eta$ is often suboptimal. Learning rate scheduling involves decaying $\eta$ over time to allow the model to converge precisely as it approaches the minimum. Common strategies include 'Step Decay', where $\eta$ is dropped by a factor every few epochs, or 'Cosine Annealing', which follows a cosine curve: $$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)$$ where $T_{cur}$ is the number of epochs elapsed in the current cycle and $T_{max}$ is the length of the cycle. By starting with a high learning rate for exploration and ending with a low one for exploitation, we make it more likely that the network settles into a high-quality local minimum.
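The cosine annealing formula translates directly into code. The sketch below uses only the standard library; the function name `cosine_annealing`, the 100-epoch horizon, and the values $\eta_{max} = 0.1$ and $\eta_{min} = 0$ are assumptions for illustration.

```python
import math

def cosine_annealing(t_cur, t_max, eta_max=0.1, eta_min=0.0):
    """Learning rate at epoch t_cur of a cosine schedule of length t_max."""
    cosine = math.cos(math.pi * t_cur / t_max)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + cosine)

# One learning rate per epoch, from epoch 0 through epoch 100.
schedule = [cosine_annealing(t, 100) for t in range(101)]

for t in (0, 25, 50, 75, 100):
    print(t, round(schedule[t], 4))
```

The schedule starts at $\eta_{max}$, passes through the midpoint $\frac{1}{2}(\eta_{max} + \eta_{min})$ halfway through, and ends at $\eta_{min}$. The decay is slow at both ends and fastest in the middle, unlike the abrupt drops of step decay.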