Stochastic Gradient Descent - Albert Masoliver's learning site

## Definition **Stochastic Gradient Descent (SGD)** is a variant of [[Gradient Descent]] that uses a *single example* (or a small mini-batch) to estimate the gradient at each step. Drastically reduces per-step cost; the dominant optimiser in deep learning. ## Algorithm ``` initialise θ randomly repeat until convergence: shuffle training data for each example (or mini-batch) (x_i, y_i): g ← ∇_θ L(f_θ(x_i), y_i) θ ← θ - η · g ``` ## Why It Works The single-example gradient is an *unbiased estimator* of the true gradient: $ \mathbb{E}_i[\nabla L(f_\theta(x_i), y_i)] = \nabla \mathcal{L}(\theta) $ So SGD takes noisy steps in roughly the right direction. The noise accumulates near minima where gradients are small but doesn't impede progress in regions where gradients are clearly informative. ## Mini-batch SGD In practice, the term "SGD" usually refers to **mini-batch SGD** — gradient computed over a batch of $B$ examples (typically 32–512): $ g = \frac{1}{B} \sum_{i \in \text{batch}} \nabla L(f_\theta(x_i), y_i) $ Sweet spot: smaller variance than pure SGD, much faster than batch GD. ## Implicit Regularisation SGD's noise has a *regularising* effect — it nudges the model away from sharp minima toward flatter ones, which empirically generalise better. This implicit bias is one hypothesised reason overparameterised deep networks generalise despite being able to memorise the training set. ## Momentum A standard enhancement. Maintain a velocity that accumulates past gradients: $ v_{t+1} = \beta v_t + (1 - \beta) g_t $ $ \theta_{t+1} = \theta_t - \eta \, v_{t+1} $ with $\beta \approx 0.9$. Momentum smooths the update direction and accelerates progress in consistent directions while damping oscillation in noisy directions. **Nesterov momentum** evaluates the gradient at a "look-ahead" point — minor theoretical and empirical improvement. ## Why Mini-batches Have a Sweet Spot - **Too small (1)** — high variance, slow to use GPU parallelism, unstable. - **Too large** — diminishing returns on noise reduction; may lose the regularising noise; per-step cost grows. - **Hardware-friendly** sizes: powers of 2 (32, 64, 128, 256). GPUs are most efficient on aligned batch sizes. The "linear scaling rule" (Goyal et al. 2017): if you double the batch size, double the learning rate. Works up to a point; very large batches need warmup and special schedules. ## Modern Successors For most deep-learning work, plain SGD has been displaced by adaptive optimisers (Adam, AdamW, LAMB). But **SGD with momentum + cosine schedule** still wins on some computer vision benchmarks (ImageNet, CIFAR), and is computationally cheaper than Adam. ## Related - [[Gradient Descent]] - [[Optimizers SGD Adam]] - [[Learning Rate Schedules]] - [[Backpropagation]]