At the heart of every deep learning model lies an optimization problem: finding the set of weights $\theta$ that minimizes a cost function $J(\theta)$. Intuitively, we treat this as a descent down a high-dimensional mountain range. Batch Gradient Descent computes the gradient over the entire dataset at every step, which is computationally prohibitive for large datasets. Stochastic Gradient Descent (SGD) addresses this by approximating the true gradient using a single random sample or a small 'mini-batch'. The approximation introduces noise, which can help the optimizer escape shallow local minima.
Mathematically, the SGD update rule is straightforward. For a given learning rate $\eta$, the parameter update at step $t$ is defined as: $$\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t; x^{(i)}, y^{(i)})$$ where $\nabla J$ is the gradient of the loss with respect to the parameters, evaluated on the sample $(x^{(i)}, y^{(i)})$ (or averaged over a mini-batch of such samples). While each step is cheap, vanilla SGD suffers from 'oscillation' in ravines, regions where the surface curves much more steeply in one dimension than in another. The iterates bounce across the steep walls while making little progress along the shallow floor, which leads to slow convergence toward a minimum.
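As a minimal sketch of this update rule, the snippet below applies mini-batch SGD to a small least-squares problem. The synthetic data, the helper names `sgd_step` and `mse_grad`, the batch size of 16, and the learning rate of 0.1 are all assumptions chosen for illustration; none of them come from a particular library.

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """One SGD update: theta_{t+1} = theta_t - lr * grad."""
    return theta - lr * grad

def mse_grad(theta, X, y):
    """Gradient of 0.5 * mean((X @ theta - y)**2) with respect to theta."""
    residual = X @ theta - y
    return X.T @ residual / len(y)

# Hypothetical toy dataset: noiseless linear targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
true_theta = np.array([2.0, -3.0])
y = X @ true_theta

theta = np.zeros(2)
for step in range(500):
    # Draw a random mini-batch and use its gradient in place of the full one.
    idx = rng.choice(len(y), size=16, replace=False)
    theta = sgd_step(theta, mse_grad(theta, X[idx], y[idx]), lr=0.1)
```

Because the targets here are noiseless, every mini-batch gradient vanishes at `true_theta`, so the iterates settle there. With noisy data the iterates would instead keep jittering around the minimum at a scale set by the learning rate.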
To mitigate these oscillations, we introduce momentum, which mimics a ball rolling down a hill by accumulating velocity. Instead of relying solely on the current gradient, momentum keeps an exponentially decaying sum of past gradients. The velocity $v_t$ is updated as: $$v_t = \gamma v_{t-1} + \eta \nabla J(\theta_t)$$ and the parameters are updated via $\theta_{t+1} = \theta_t - v_t$. Here, $\gamma$ (usually 0.9) controls how much of the previous velocity is retained, so $1 - \gamma$ plays the role of friction. Gradient components that flip sign from step to step largely cancel in $v_t$, which dampens oscillations, while components that keep pointing the same way accumulate and accelerate the descent.
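The effect can be illustrated on a toy 'ravine'. The sketch below assumes a hypothetical quadratic loss whose curvature is 50 times larger along one axis than the other, and it uses exact gradients rather than stochastic ones so that the comparison is deterministic. The names `CURVATURE`, `momentum_step`, and `run`, and the values `lr=0.03` and `steps=200`, are illustrative choices.

```python
import numpy as np

# Hypothetical ravine: shallow along axis 0, steep along axis 1.
CURVATURE = np.array([1.0, 50.0])

def loss(theta):
    return 0.5 * np.sum(CURVATURE * theta**2)

def grad(theta):
    return CURVATURE * theta

def momentum_step(theta, velocity, g, lr, gamma):
    """v_t = gamma * v_{t-1} + lr * g;  theta_{t+1} = theta_t - v_t."""
    velocity = gamma * velocity + lr * g
    return theta - velocity, velocity

def run(gamma, lr=0.03, steps=200):
    theta = np.array([1.0, 1.0])
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        theta, velocity = momentum_step(theta, velocity, grad(theta), lr, gamma)
    return loss(theta)

loss_plain = run(gamma=0.0)     # gamma = 0 reduces to the plain gradient step
loss_momentum = run(gamma=0.9)  # same learning rate, with momentum
```

Under these assumptions the momentum run ends at a lower loss than the plain run after the same number of steps, because progress along the shallow axis is no longer limited by the small step size that the steep axis requires.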
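The Adaptive Moment Estimation (Adam) optimizer extends this idea by maintaining an effective learning rate for every individual parameter. Writing $g_t = \nabla J(\theta_t)$ for the current gradient, Adam tracks exponential moving averages of both the first moment (the mean) $m_t$ and the second moment (the uncentered variance) $v_t$ of the gradients: $$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$ and $$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$ where the square is taken element-wise. Note that $v_t$ here denotes the second-moment estimate, not the momentum velocity above. This allows the optimizer to scale each parameter's update inversely to the square root of its running average of squared gradients, effectively taking smaller steps for parameters whose gradients are consistently large and larger steps for parameters whose gradients are small or infrequent.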
Because the first and second moments are initialized at zero, the estimates are biased toward zero during the initial steps. For example, at $t = 1$ we have $m_1 = (1-\beta_1)g_1$, which is much smaller than $g_1$ when $\beta_1$ is close to 1. Adam corrects this using bias-correction terms: $$\hat{m}_t = \frac{m_t}{1-\beta_1^t}$$ and $$\hat{v}_t = \frac{v_t}{1-\beta_2^t}$$ so that, in the example, $\hat{m}_1 = g_1$. The final update rule becomes: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$ Here, $\epsilon$ is a small constant to prevent division by zero. This combination of momentum and per-parameter scaling has made Adam a common default choice for many deep architectures.
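The moment updates, bias correction, and final step can be combined into one function. The following is a sketch rather than a reference implementation: the name `adam_step` is hypothetical, and the default values (`lr=1e-3`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are the commonly cited ones, assumed here because the text does not fix them.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `t` is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * g**2     # second-moment estimate v_t (element-wise)
    m_hat = m / (1 - beta1**t)             # bias-corrected first moment
    v_hat = v / (1 - beta2**t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Two parameters whose gradients differ by four orders of magnitude.
theta = np.array([1.0, 1.0])
g = np.array([0.01, 100.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

theta_next, m, v = adam_step(theta, g, m, v, t=1)
first_step = theta - theta_next  # both entries come out close to lr
```

On the very first step the bias-corrected moments reduce to $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so each parameter moves by roughly $\eta$ regardless of the scale of its gradient. This is the per-parameter scaling described above in its simplest form.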
Despite the power of Adam, the choice of the global learning rate $\eta$ remains critical. A constant learning rate that works well early in training can cause the optimizer to overshoot and bounce around the minimum as the model converges. Learning rate scheduling addresses this by decaying $\eta$ over time. Two common approaches are Step Decay, where $\eta$ is multiplied by a factor $\gamma < 1$ every $k$ epochs (this $\gamma$ is a decay factor, unrelated to the momentum coefficient), and Cosine Annealing, which follows a half-cosine curve to transition smoothly from a high rate to nearly zero: $$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)$$ where $T_{cur}$ is the number of epochs elapsed and $T_{max}$ is the total length of the schedule.
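Both schedules are short enough to write out directly. The sketch below uses hypothetical function names (`step_decay`, `cosine_annealing`) and assumes that epochs are counted from zero; the example values are illustrative only.

```python
import math

def step_decay(eta0, factor, k, epoch):
    """Multiply the initial rate eta0 by `factor` once every k epochs."""
    return eta0 * factor ** (epoch // k)

def cosine_annealing(eta_min, eta_max, t_cur, t_max):
    """Half-cosine decay from eta_max (t_cur = 0) to eta_min (t_cur = t_max)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_max))

# Learning rates over a hypothetical 100-epoch run.
step_rates = [step_decay(0.1, 0.5, 30, epoch) for epoch in range(100)]
cosine_rates = [cosine_annealing(0.0, 0.1, epoch, 100) for epoch in range(100)]
```

The step schedule holds the rate constant and then drops it abruptly at epochs 30, 60, and 90, while the cosine schedule decreases it continuously, slowly at first, fastest at the midpoint, and slowly again near the end.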
In practice, the interaction between the optimizer and the scheduler strongly influences the generalization capability of the network. While Adam often converges faster, some research suggests that SGD with a well-tuned schedule may find 'flatter' minima, leading to better test-set performance. For this reason, some practitioners start training with Adam for rapid progress and switch to SGD for a 'fine-tuning' phase, with the aim of settling the model into a robust, low-error region of the parameter space.