Optimizing Neural Networks: From SGD to Adaptive Moments and Scheduling

At its core, training a deep neural network is a high-dimensional optimization problem: the objective is to minimize a loss function $J(\theta)$ by adjusting the model parameters $\theta$. The intuition behind Gradient Descent is simple: we compute the gradient of the loss at the current point, which points in the direction of steepest ascent, and take a step in the opposite direction, moving toward a local minimum. Batch Gradient Descent uses the entire dataset to compute each gradient, which is computationally prohibitive for large datasets. Stochastic Gradient Descent (SGD) addresses this by approximating the gradient from a single random sample or a small mini-batch. The resulting estimate is noisy, but the noise can actually help the model escape shallow local minima.
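To make the difference between the batch gradient and its stochastic estimate concrete, here is a minimal sketch in plain Python. The one-parameter model $y \approx w x$, the toy data, and the helper names (`sample_gradient`, `full_gradient`, `minibatch_gradient`) are assumptions chosen for illustration, not part of any particular library.

```python
import random

# Toy data for a hypothetical one-parameter model y ≈ w * x (true w = 3).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [3.0 * x for x in xs]

def sample_gradient(w, x, y):
    """Gradient of the per-sample loss 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def full_gradient(w):
    """Batch gradient: the mean of the per-sample gradients over all data."""
    return sum(sample_gradient(w, x, y) for x, y in zip(xs, ys)) / len(xs)

def minibatch_gradient(w, batch_size, rng):
    """Stochastic estimate: mean gradient over a random subset of the data."""
    idx = rng.sample(range(len(xs)), batch_size)
    return sum(sample_gradient(w, xs[i], ys[i]) for i in idx) / batch_size

rng = random.Random(0)  # fixed seed so the run is repeatable
w = 0.0
exact = full_gradient(w)
estimates = [minibatch_gradient(w, 2, rng) for _ in range(5)]
print("full-batch gradient:", exact)
print("mini-batch estimates:", estimates)
```

Each mini-batch estimate scatters around the full-batch value. That scatter is the 'noise' described above, and it shrinks as the batch size approaches the dataset size.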

Mathematically, the SGD update rule is: $$\theta_{t+1} = \theta_t - \eta \cdot \nabla_{\theta} J(\theta_t; x^{(i)}, y^{(i)})$$ where $\eta$ is the learning rate, a scalar that controls the step size, and $\nabla_{\theta} J(\theta_t; x^{(i)}, y^{(i)})$ is the gradient of the cost function with respect to the parameters, evaluated on the training example $(x^{(i)}, y^{(i)})$. While SGD is powerful, it oscillates in steep directions and makes slow progress in flat regions, often called 'plateaus,' because it applies a single learning rate to all parameters regardless of their gradient histories.
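The update rule above can be sketched as a short loop. This is an illustrative sketch rather than a reference implementation: the function names `sgd` and `linear_grad`, the linear model $y \approx w x + b$, and the noise-free toy data are all assumptions made for the example.

```python
import random

def sgd(grad_fn, theta, data, lr=0.05, epochs=50, seed=0):
    """Plain SGD: theta <- theta - lr * grad, using one sample per update."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a fresh random order each epoch
        for x, y in data:
            g = grad_fn(theta, x, y)
            theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def linear_grad(theta, x, y):
    """Gradient of 0.5 * (w*x + b - y)**2 with respect to (w, b)."""
    w, b = theta
    err = w * x + b - y
    return [err * x, err]

# Noise-free samples of the line y = 2x + 1 on the interval [-1, 1].
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
theta = sgd(linear_grad, [0.0, 0.0], data)
print(theta)  # should end up close to [2.0, 1.0]
```

Note that the same `lr` multiplies every component of the gradient, which is the uniform learning rate the paragraph above identifies as a limitation.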

To address the limitations of SGD, we introduce the concept of Momentum. The intuition is that if we consistently move in a certain direction, we should gain 'velocity' to speed up convergence and dampen oscillations. By maintaining an exponentially decaying accumulation of past gradients, momentum allows the optimizer to smooth out high-frequency noise. The momentum update is formulated as: $$v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta_t)$$ $$\theta_{t+1} = \theta_t - v_t$$ where $v_t$ is the velocity and $\gamma$ is the momentum coefficient (typically 0.9). Each update is therefore a combination of the current gradient and the accumulated velocity of past steps.
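As a rough illustration of the momentum update, the sketch below runs it on a toy quadratic that is steep in one direction and flat in the other. The loss $J = \frac{1}{2}(100x^2 + y^2)$, the function names, and the hyperparameter values are assumptions chosen for the example. Setting `gamma=0.0` recovers plain gradient descent, which gives a baseline for comparison.

```python
import math

def grad(theta):
    """Gradient of the toy loss J = 0.5 * (100 * x**2 + y**2)."""
    x, y = theta
    return [100.0 * x, 1.0 * y]

def momentum_descent(theta, lr, gamma, steps):
    """v <- gamma * v + lr * grad;  theta <- theta - v."""
    v = [0.0] * len(theta)  # velocity starts at zero
    for _ in range(steps):
        g = grad(theta)
        v = [gamma * vi + lr * gi for vi, gi in zip(v, g)]
        theta = [t - vi for t, vi in zip(theta, v)]
    return theta

start = [1.0, 1.0]
plain = momentum_descent(start, lr=0.01, gamma=0.0, steps=100)
heavy = momentum_descent(start, lr=0.01, gamma=0.9, steps=100)

# Distance from the minimum at the origin after the same number of steps.
print("plain GD:     ", math.hypot(*plain))
print("with momentum:", math.hypot(*heavy))
```

In this particular setup the plain run crawls along the flat $y$ direction, while the momentum run accumulates velocity there and ends noticeably closer to the minimum. Other learning rates or loss shapes can behave differently.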

The Adam (Adaptive Moment Estimation) optimizer represents a significant evolution by combining the ideas of Momentum and RMSProp. Instead of a single global learning rate, Adam computes an individual adaptive learning rate for each parameter. It tracks exponentially decaying averages of both the first moment (the mean) and the second raw moment (the uncentered variance) of the gradients. This allows Adam to scale each update inversely to the recent magnitude of that parameter's gradients: parameters with large, volatile gradients receive smaller updates, while those with small, consistent gradients receive larger updates.

The mathematical machinery of Adam involves calculating biased estimates of the moments, where $g_t = \nabla_{\theta} J(\theta_t)$ is the current gradient and $g_t^2$ is its element-wise square: $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$ $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$ Here $v_t$ denotes the second-moment estimate, not the momentum velocity. Because both moments are initialized at zero, they are biased toward zero early in training, so we apply bias correction: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$ Finally, the parameter update is performed as: $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$ where $\epsilon$ is a small constant to prevent division by zero.
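The four Adam equations above translate almost line for line into code. The following is a minimal sketch in plain Python, not a substitute for a library optimizer. The function name `adam`, the toy gradient, and the default values $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are assumptions for illustration (they are commonly used defaults, but the text above does not specify them).

```python
import math

def adam(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam with bias correction, applied element-wise to a list of parameters."""
    m = [0.0] * len(theta)  # first-moment estimate, initialized at zero
    v = [0.0] * len(theta)  # second-moment estimate, initialized at zero
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction compensates for the zero initialization.
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        theta = [th - lr * mh / (math.sqrt(vh) + eps)
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

def grad(theta):
    """Toy gradient whose two components differ in scale by a factor of 100."""
    x, y = theta
    return [100.0 * x, 1.0 * y]

one_step = adam(grad, [1.0, 1.0], lr=0.1, steps=1)
print(one_step)
```

After a single step, both coordinates have moved by roughly `lr`, even though their gradients differ by a factor of 100. At $t = 1$ the bias-corrected ratio $\hat{m}_1 / \sqrt{\hat{v}_1}$ reduces to approximately the sign of the gradient, which is the per-parameter scaling at work.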

Despite the sophistication of Adam, the choice of the initial learning rate $\eta$ remains a critical hyperparameter. Learning rate scheduling is the process of adjusting $\eta$ over the course of training to improve convergence. If the learning rate is too high, the model may diverge or overshoot the minimum; if it is too low, training stalls. A common strategy is 'Learning Rate Decay,' where the rate is reduced according to a schedule, such as exponential decay, $\eta_t = \eta_0 e^{-kt}$, where $\eta_0$ is the initial rate, $k$ is the decay constant, and $t$ is the training step or epoch, or 'Step Decay,' where the rate is multiplied by a fixed factor every few epochs. This allows the model to make large leaps early in training and refine its position with precision as it nears the optimum.
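Both schedules can be written as small pure functions of the epoch index. The sketch below is illustrative only: the function names, the argument names `factor` and `every`, and the sample values are assumptions, not taken from any specific framework.

```python
import math

def exponential_decay(eta0, k, t):
    """Exponential decay: eta_t = eta0 * exp(-k * t)."""
    return eta0 * math.exp(-k * t)

def step_decay(eta0, factor, every, epoch):
    """Step decay: multiply the rate by `factor` once every `every` epochs."""
    return eta0 * factor ** (epoch // every)

# Compare the two schedules at a few epochs for an initial rate of 0.1.
for epoch in (0, 10, 20, 30):
    print(epoch,
          exponential_decay(0.1, k=0.05, t=epoch),
          step_decay(0.1, factor=0.5, every=10, epoch=epoch))
```

Exponential decay shrinks the rate smoothly at every epoch, while step decay holds it constant between drops. Which one works better depends on the problem and is usually settled empirically.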