At its core, training a deep neural network is an optimization problem: we seek the parameters $\theta$ that minimize a loss function $J(\theta)$, which measures the discrepancy between the model's predictions and the ground truth. Computing the exact gradient over the entire dataset at every step, known as Batch Gradient Descent, becomes computationally prohibitive because its cost grows with both the number of training examples and the number of parameters, which can reach the millions. Stochastic Gradient Descent (SGD) addresses this by estimating the gradient from a single random sample or a small 'mini-batch'. Each update is much cheaper, and the noise this introduces into the optimization path is often credited with a regularizing effect.
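To make the idea of a gradient estimate concrete, here is a minimal sketch on a toy problem. It assumes a one-dimensional linear model $y \approx wx + b$ with a mean squared error loss; the dataset and the helper names `mse_gradient` and `average` are hypothetical and chosen only for illustration. It compares the full-batch gradient with the gradients computed on disjoint mini-batches.

```python
import random

def mse_gradient(w, b, batch):
    """Gradient of the mean squared error of y ~ w*x + b over (x, y) pairs."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b

def average(values):
    return sum(values) / len(values)

# Hypothetical toy data: y = 3x + 1 plus a little Gaussian noise.
rng = random.Random(0)
data = [(i / 10, 3.0 * (i / 10) + 1.0 + rng.gauss(0, 0.1)) for i in range(100)]

w, b = 0.0, 0.0
full_grad = mse_gradient(w, b, data)  # "batch" gradient: uses every example

# Split the shuffled data into 10 disjoint mini-batches of 10 examples each.
rng.shuffle(data)
batches = [data[i:i + 10] for i in range(0, len(data), 10)]
mini_grads = [mse_gradient(w, b, batch) for batch in batches]

# Each mini-batch gradient is a noisy estimate; their average recovers
# the full-batch gradient because the batches are equal-sized and disjoint.
avg_grad = (average([g[0] for g in mini_grads]),
            average([g[1] for g in mini_grads]))
```

Any single entry of `mini_grads` differs from `full_grad`, which is the noise the text refers to, but it costs a tenth of the computation here.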
Mathematically, the SGD update rule is defined as: $$\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t ; x^{(i)}, y^{(i)})$$ where $\eta$ is the learning rate and $\nabla J$ is the gradient of the loss with respect to the parameters for a specific sample $(x^{(i)}, y^{(i)})$. While SGD is conceptually simple, it tends to oscillate in ravines, areas where the surface curves much more steeply in one dimension than in another. To mitigate this, we introduce momentum, which accumulates an exponentially decaying sum of past gradients in a velocity term $v_t$, damping oscillations across the ravine and accelerating descent along it: $$v_t = \gamma v_{t-1} + \eta \nabla J(\theta_t)$$ $$\theta_{t+1} = \theta_t - v_t$$ where $\gamma$ is the momentum coefficient, typically set to $0.9$, and the velocity is initialized to zero.
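The effect of momentum in a ravine can be sketched on a toy quadratic. The example below assumes the loss $J(\theta) = \tfrac{1}{2}(\theta_1^2 + 50\,\theta_2^2)$, which is 50 times steeper along the second axis, and uses its exact gradient rather than a stochastic estimate so that the comparison is deterministic. The function names and the chosen values of $\eta$ and the step count are illustrative assumptions.

```python
CURVATURES = (1.0, 50.0)  # shallow axis, steep axis: a simple "ravine"

def loss(theta):
    return 0.5 * sum(c * t * t for c, t in zip(CURVATURES, theta))

def grad(theta):
    return [c * t for c, t in zip(CURVATURES, theta)]

def run(theta0, lr, gamma, steps):
    """Apply v_t = gamma * v_{t-1} + lr * grad, then theta_{t+1} = theta_t - v_t.

    With gamma = 0 this reduces to plain gradient descent.
    """
    theta = list(theta0)
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        v = [gamma * vi + lr * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta

start = [10.0, 1.0]
theta_plain = run(start, lr=0.03, gamma=0.0, steps=100)
theta_momentum = run(start, lr=0.03, gamma=0.9, steps=100)

loss_plain = loss(theta_plain)
loss_momentum = loss(theta_momentum)
```

With these settings the plain run flips sign along the steep axis on every step while crawling along the shallow one, and `loss_momentum` ends up below `loss_plain`. Part of that gain comes from momentum's larger effective step size, roughly $\eta / (1 - \gamma)$ along directions where the gradient keeps its sign, so this is an illustration rather than a controlled benchmark.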
Standard SGD applies a global learning rate $\eta$ to all parameters, regardless of how frequently a specific feature appears in the training data. This is suboptimal for sparse data. Adaptive methods like Adam (Adaptive Moment Estimation) solve this by maintaining individual learning rates for every parameter. Adam effectively combines the benefits of Momentum (tracking the first moment of the gradient) and RMSProp (tracking the second raw moment, or the uncentered variance), allowing the optimizer to 'slow down' in steep directions and 'speed up' in flat directions.
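The Adam optimizer maintains two exponential moving averages: the first moment $m_t$ (the mean of the gradient) and the second moment $v_t$ (its uncentered variance). Note that $v_t$ here is not the velocity term used in the momentum update. The updates are formulated as: $$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\nabla J(\theta_t)$$ $$v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla J(\theta_t))^2$$ where the square is taken element-wise and $\beta_1, \beta_2 \in [0, 1)$ are the decay rates of the two averages (commonly $0.9$ and $0.999$). To correct for the fact that these moments are initialized at zero, and are therefore biased toward zero during the first steps, we apply bias correction: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$ Finally, the parameter update is: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$ where $\epsilon$ is a small constant to prevent division by zero, and the square root and division are applied element-wise.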
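The four Adam equations translate almost line for line into code. The following is a minimal sketch in plain Python, not a reference implementation: the function name `adam_minimize`, the toy quadratic gradient, and the chosen learning rate are assumptions made for illustration, while the defaults for `beta1`, `beta2`, and `eps` follow the commonly used values.

```python
import math

def adam_minimize(grad_fn, theta0, steps, lr=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates from theta0 and return the final parameters."""
    theta = list(theta0)
    m = [0.0] * len(theta)  # first moment estimate, initialized at zero
    v = [0.0] * len(theta)  # second moment estimate, initialized at zero
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        # Exponential moving averages of the gradient and its element-wise square.
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction compensates for the zero initialization.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        # Per-parameter step: lr scaled by 1 / (sqrt(v_hat) + eps).
        theta = [ti - lr * mh / (math.sqrt(vh) + eps)
                 for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

# Hypothetical toy problem: J(theta) = 0.5 * (theta_1^2 + 50 * theta_2^2).
def quadratic_grad(theta):
    return [1.0 * theta[0], 50.0 * theta[1]]

def quadratic_loss(theta):
    return 0.5 * (theta[0] ** 2 + 50.0 * theta[1] ** 2)

start = [10.0, 1.0]
after_one_step = adam_minimize(quadratic_grad, start, steps=1, lr=0.1)
after_many_steps = adam_minimize(quadratic_grad, start, steps=500, lr=0.1)
```

One property worth noticing in `after_one_step`: on the first update the bias-corrected ratio $\hat{m}_1 / \sqrt{\hat{v}_1}$ equals the sign of the gradient, so each parameter moves by almost exactly $\eta$, even though the two gradient components differ by a factor of five in magnitude. This is the per-parameter scaling described above.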
Despite the sophistication of Adam, the choice of the initial learning rate $\eta$ remains critical. A static learning rate forces a trade-off: a high rate allows rapid initial progress but causes the model to oscillate around a minimum instead of settling into it, while a low rate settles cleanly but converges slowly and can stall in poor regions of the landscape. Learning rate scheduling addresses this. The intuition is to start with a relatively high $\eta$ to explore the landscape and move past poor local minima, then systematically decay $\eta$ so the model can settle into a good minimum. Because deep network losses are non-convex, no schedule guarantees reaching the global minimum.
Common scheduling strategies include Step Decay, where $\eta$ is multiplied by a fixed factor every few epochs, and Cosine Annealing, which lowers the rate smoothly along a half cosine curve. Another widely used technique is a 'Warm-up' phase, where $\eta$ starts very small and increases linearly for a few thousand iterations. This limits the size of the earliest updates, when the weights are randomly initialized, the loss is high, and Adam's moment estimates rest on only a handful of gradients, which helps keep training stable.
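The three schedules can be written as small functions of the current epoch or step. This is a sketch under stated assumptions: the function names, the default decay factor of `0.5`, and the choice to follow warm-up with cosine annealing are illustrative, not prescribed by any particular library.

```python
import math

def step_decay(base_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the rate by `drop` once every `epochs_per_drop` epochs."""
    return base_lr * drop ** (epoch // epochs_per_drop)

def cosine_annealing(base_lr, step, total_steps, min_lr=0.0):
    """Decay from base_lr to min_lr along half a cosine period."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def warmup_then_cosine(base_lr, step, warmup_steps, total_steps):
    """Linear warm-up to base_lr, then cosine annealing for the remaining steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return cosine_annealing(base_lr, step - warmup_steps,
                            total_steps - warmup_steps)

# Learning rate at every step of a hypothetical 100-step run with 10 warm-up steps.
schedule = [warmup_then_cosine(0.1, s, warmup_steps=10, total_steps=100)
            for s in range(100)]
```

In a training loop, the value returned for the current step would replace the fixed $\eta$ in the SGD or Adam update. Starting the warm-up at `base_lr / warmup_steps` rather than at zero is a design choice that avoids a wasted first update.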
In practice, choosing between SGD with momentum and Adam depends on the specific architecture and dataset. Adam often converges faster and is less sensitive to the initial $\eta$, but some empirical studies report that SGD with a well-tuned schedule generalizes better on the test set. One proposed explanation, which is not settled, is that Adam's 'aggressive' adaptive steps can lead it into sharp minima, whereas the inherent noise of SGD steers it toward flatter, more robust minima that generalize better to unseen data.