At its core, training a deep network is an optimization problem in which we seek to minimize a cost function $J(\theta)$ by adjusting the model parameters $\theta$. The intuition behind gradient descent is akin to a hiker descending a mountain in thick fog: since they cannot see the valley, they feel the slope beneath their feet and take a step in the direction of steepest descent. In deep learning, computing the gradient over the entire dataset (the full 'batch') at every step is usually too expensive in memory and compute, so we use Stochastic Gradient Descent (SGD), which estimates the true gradient from a small, random subset of the data.
Mathematically, the SGD update subtracts the gradient scaled by a learning rate $\eta$. For a parameter vector $\theta$ at iteration $t$, the update is: $$\theta_{t+1} = \theta_t - \eta \nabla_{\theta} J(\theta_t; x^{(i)}, y^{(i)})$$ where $\nabla_{\theta} J(\theta_t; x^{(i)}, y^{(i)})$ is the gradient of the loss function computed on a single sample or a mini-batch. While SGD introduces noise into the optimization path, this stochasticity often helps the model escape shallow local minima and find more robust regions of the parameter space.
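To make the update rule concrete, here is a minimal sketch of single-sample SGD on a toy one-parameter regression. The data, the squared-error loss, and the helper names `sgd_step` and `sample_grad` are assumptions chosen for illustration, not part of any library.

```python
import random

def sgd_step(theta, grad, lr):
    """One SGD update: theta_{t+1} = theta_t - lr * grad."""
    return theta - lr * grad

def sample_grad(theta, x, y):
    """Gradient of the per-sample loss 0.5 * (theta * x - y)**2 w.r.t. theta."""
    return (theta * x - y) * x

random.seed(0)
# Toy data generated from y = 3x, so the minimizer is theta = 3.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]

theta = 0.0
lr = 0.1
for _ in range(200):
    x, y = random.choice(data)  # one randomly drawn sample per step
    theta = sgd_step(theta, sample_grad(theta, x, y), lr)

print(theta)  # approaches 3.0
```

Because this toy data is noise-free, every sample's gradient points toward the same minimizer. With real, noisy data the iterates would jitter around the optimum instead of settling exactly on it.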
Despite its elegance, vanilla SGD suffers from the 'oscillation problem' in ravines, regions where the surface curves much more steeply in one dimension than in another. To mitigate this, we introduce Momentum, which accumulates an exponentially decaying sum of past gradients to dampen oscillations and accelerate convergence. The velocity $v_t$ is updated as: $$v_{t+1} = \gamma v_t + \eta \nabla_{\theta} J(\theta_t)$$ and the parameters are updated via $$\theta_{t+1} = \theta_t - v_{t+1}$$. Here, $\gamma$ is the momentum coefficient, typically set to $0.9$; it plays the role of friction (a smaller $\gamma$ means more friction), so the optimizer keeps moving in directions of consistent descent while components that flip sign from step to step largely cancel.
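The effect can be sketched on a hand-made ravine. The quadratic $J(x, y) = \frac{1}{2}(x^2 + 50y^2)$, the starting point, and the step count below are assumptions chosen for illustration. The sketch uses the exact gradient rather than a stochastic estimate to isolate the effect of momentum, and setting `gamma=0.0` recovers plain gradient descent.

```python
def grad(theta):
    """Gradient of the ravine J(x, y) = 0.5 * (x**2 + 50 * y**2)."""
    x, y = theta
    return (x, 50.0 * y)

def loss(theta):
    x, y = theta
    return 0.5 * (x ** 2 + 50.0 * y ** 2)

def run(gamma, lr=0.03, steps=100, start=(10.0, 1.0)):
    """Run the momentum update and return the final loss."""
    theta = start
    v = (0.0, 0.0)
    for _ in range(steps):
        g = grad(theta)
        # v_{t+1} = gamma * v_t + lr * grad
        v = tuple(gamma * vi + lr * gi for vi, gi in zip(v, g))
        # theta_{t+1} = theta_t - v_{t+1}
        theta = tuple(ti - vi for ti, vi in zip(theta, v))
    return loss(theta)

loss_plain = run(gamma=0.0)     # plain gradient descent
loss_momentum = run(gamma=0.9)  # same learning rate, with momentum

print(loss_plain, loss_momentum)
```

With these settings the steep `y` direction caps the usable learning rate, so plain descent crawls along the shallow `x` direction. The momentum run ends at a noticeably lower loss after the same number of steps.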
Adaptive Moment Estimation, or Adam, goes further by adapting the effective step size for each parameter individually. Adam tracks exponential moving averages of both the first moment (the mean) and the second moment (the uncentered variance) of the gradients. The first moment $m_t$ and second moment $v_t$ are updated as: $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$ and $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$ where $g_t$ is the current gradient and $g_t^2$ is its element-wise square. Because $m_t$ and $v_t$ are initialized at zero, they are bias-corrected using $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$ and $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$ so that the estimates in the first steps are not biased toward zero.
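A small numeric sketch shows why the correction matters. It assumes a constant gradient $g_t = 1$, which is unrealistic but makes the bias easy to see: the raw moving average starts far below the true mean, while the corrected estimate recovers it from the first step.

```python
beta1 = 0.9
g = 1.0  # constant gradient, chosen purely for illustration
m = 0.0  # first moment starts at zero

raw, corrected = [], []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g          # m_t
    raw.append(m)
    corrected.append(m / (1 - beta1 ** t))   # bias-corrected m_hat_t

for t, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(t, round(r, 4), round(c, 4))
```

After the first step the raw estimate is roughly 0.1, a tenth of the true mean, whereas the corrected value is roughly 1.0. As $t$ grows, $\beta_1^t$ shrinks toward zero and the correction fades out.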
The final update step in Adam combines these moments to normalize the gradient update: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$. The term $\sqrt{\hat{v}_t}$ effectively scales the update: parameters with large, volatile gradients receive a smaller effective step size, while parameters with small, consistent gradients receive relatively larger updates. The $\epsilon$ term is a tiny constant (e.g., $10^{-8}$) added to the denominator to prevent division by zero and keep the update numerically stable.
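Putting the moment updates, bias correction, and normalized step together, a minimal pure-Python sketch of one Adam step might look like the following. The function name `adam_step`, the list-based parameter layout, and the toy objective are assumptions for illustration; production frameworks implement the same idea on tensors.

```python
import math

def adam_step(theta, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of parameters; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first moment m_t
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second moment v_t
        m_hat = m_i / (1 - beta1 ** t)             # bias-corrected moments
        v_hat = v_i / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

def grad(theta):
    """Gradient of the toy objective J = (a - 2)^2 + 100 * (b + 1)^2."""
    a, b = theta
    return [2.0 * (a - 2.0), 200.0 * (b + 1.0)]

theta, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 501):
    theta, m, v = adam_step(theta, grad(theta), m, v, t, lr=0.1)

print(theta)  # approaches [2.0, -1.0]
```

Although the two gradients differ in magnitude by a factor of 50 at the start, the very first step moves each parameter by almost exactly `lr`, because $\hat{m}_1 / \sqrt{\hat{v}_1}$ reduces to the sign of the gradient. This is the per-parameter normalization described above.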
While Adam is powerful, the choice of the global learning rate $\eta$ remains critical. Learning rate scheduling is the practice of adjusting $\eta$ during training so that the model makes fast progress early and converges stably later. A common approach is 'Step Decay', where $\eta$ is multiplied by a factor (e.g., $0.1$) every fixed number of epochs. Alternatively, 'Cosine Annealing' follows a cosine curve: $$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)$$ where $T_{cur}$ is the number of epochs elapsed and $T_{max}$ is the length of the schedule, so $\eta_t$ decays smoothly from $\eta_{max}$ to $\eta_{min}$. This allows the model to explore the landscape aggressively early on and refine its position delicately as it nears the optimum.
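Both schedules are short enough to sketch directly. The function names and the default values below (`drop=0.1`, `every=30`, `eta_max=0.1`) are illustrative assumptions, not recommended settings.

```python
import math

def step_decay(eta0, epoch, drop=0.1, every=30):
    """Multiply the initial rate eta0 by `drop` once per `every` epochs."""
    return eta0 * drop ** (epoch // every)

def cosine_annealing(t_cur, t_max, eta_min=0.0, eta_max=0.1):
    """Cosine schedule: eta_max at t_cur = 0, eta_min at t_cur = t_max."""
    cos_term = 1 + math.cos(math.pi * t_cur / t_max)
    return eta_min + 0.5 * (eta_max - eta_min) * cos_term

for epoch in (0, 25, 50, 75, 100):
    print(epoch, step_decay(0.1, epoch), cosine_annealing(epoch, 100))
```

Step decay holds the rate constant and then drops it abruptly, while the cosine schedule changes slowly near both ends and fastest around the midpoint, where it equals the average of `eta_max` and `eta_min`.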