Batch Normalization - Albert Masoliver's learning site

## Definition

**Batch normalisation (BatchNorm)** (Ioffe & Szegedy, 2015) normalises layer activations using statistics computed over the current mini-batch during training. It stabilises and accelerates the training of deep networks and is one of the most impactful innovations of the deep learning era.

## Algorithm

For a mini-batch $B = \{x_1, \dots, x_m\}$ of $m$ values of a single activation (one feature or channel):

$$
\mu_B = \frac{1}{m} \sum_{i=1}^m x_i, \quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2
$$

$$
\hat x_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$

$$
y_i = \gamma \hat x_i + \beta
$$

- $\gamma, \beta$: learnable per-channel scale and shift parameters.
- $\epsilon$: small constant for numerical stability.

At inference, the batch statistics $\mu_B$ and $\sigma_B^2$ are replaced with running averages accumulated during training, so the output for one sample no longer depends on the rest of the batch.

## What It Does

- **Reduces internal covariate shift**: the distribution of each layer's inputs stays more stable during training. This was the original motivation; see the caveat in the next section.
- **Enables larger learning rates**: gradients are better-conditioned.
- **Acts as a regulariser**: batch statistics inject noise, similar in effect to dropout.
- **Reduces sensitivity to initialisation**: weight initialisation matters less.

## Why It Works (Disputed)

The original "internal covariate shift" explanation was partially refuted by later research (Santurkar et al., 2018, *How Does Batch Normalization Help Optimization?*). The explanation proposed there, now the more widely accepted one, is that BatchNorm *smooths the loss landscape*, making gradient descent more reliable. The mechanism is still debated; the empirical benefit is well established.

## Where It's Used

- **Convolutional networks**: most widely used CNN architectures apply BatchNorm after each convolutional layer.
- **Image classification, segmentation, detection.**
- Standard in models like ResNet and EfficientNet.

## Limitations

- **Batch size dependency.** Small batches (≤8) give noisy statistics. Use [[Layer Normalization]] or GroupNorm for small batches.
- **Works poorly for sequence models** where sequence length varies; use LayerNorm instead.
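- **Train/eval mismatch.** Using batch statistics during training but running statistics at evaluation can cause subtle differences in behaviour, especially when the running averages have not converged or the evaluation data is distributed differently from the training data.
- **Distributed training synchronisation.** In multi-GPU training each device normally computes statistics over its own local batch. When per-device batches are small, the statistics need to be synchronised across devices (e.g. PyTorch's `SyncBatchNorm`), which adds communication overhead.

## Variants

- **[[Layer Normalization]]**: normalises across features for each sample (no batch dependency). Used in transformers.
- **GroupNorm**: normalises within groups of channels for each sample. A compromise between BatchNorm and LayerNorm.
- **InstanceNorm**: normalises per sample and per channel. Used in style transfer.

## In Modern LLMs

Transformers use **LayerNorm** (or RMSNorm, a simpler variant that omits mean-centring), not BatchNorm. Reasons: variable sequence lengths, small effective batches during inference, and the need for statistics that do not depend on the other samples in the batch.

## Related

- [[Layer Normalization]]
- [[Vanishing and Exploding Gradients]]
- [[Skip Connections]]
- [[Convolutional Neural Network]]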
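
## Example: Forward Pass in Code

The three formulas in the Algorithm section, together with the switch to running averages at inference, can be sketched in plain Python for a single scalar feature. This is a minimal illustration, not a reference implementation: the class name `BatchNormScalar`, the `momentum` parameter, and the exponential-moving-average update are assumptions (the note only says "running averages"), and $\gamma$ and $\beta$ are held fixed rather than learned.

```python
import math


class BatchNormScalar:
    """Hypothetical single-feature BatchNorm, forward pass only."""

    def __init__(self, eps=1e-5, momentum=0.1):
        self.gamma = 1.0           # scale (learnable in a real layer, fixed here)
        self.beta = 0.0            # shift (learnable in a real layer, fixed here)
        self.eps = eps             # small constant for numerical stability
        self.momentum = momentum   # assumed update rate for the running averages
        self.running_mean = 0.0
        self.running_var = 1.0

    def forward(self, xs, training):
        if training:
            m = len(xs)
            mean = sum(xs) / m                          # mu_B
            var = sum((x - mean) ** 2 for x in xs) / m  # sigma_B^2 (divides by m)
            # Assumed exponential moving average of the batch statistics.
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            # Inference: use accumulated statistics, not the current batch.
            mean, var = self.running_mean, self.running_var
        inv_std = 1.0 / math.sqrt(var + self.eps)
        # y_i = gamma * x_hat_i + beta
        return [self.gamma * (x - mean) * inv_std + self.beta for x in xs]


bn = BatchNormScalar()
batch = [1.0, 2.0, 3.0, 4.0]
train_out = bn.forward(batch, training=True)   # normalised with batch statistics
eval_out = bn.forward(batch, training=False)   # normalised with running statistics
print(train_out)
print(eval_out)
```

After one training step the running mean has moved only a tenth of the way from 0 towards the batch mean of 2.5, so `eval_out` differs noticeably from `train_out` on the same inputs. That gap is the train/eval mismatch listed under Limitations, and it shrinks as the running averages converge.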