## Definition
**Batch normalisation (BatchNorm)** (Ioffe & Szegedy, 2015) normalises layer activations using batch statistics during training. It stabilises and accelerates the training of deep networks and is one of the most widely adopted techniques of the deep learning era.
## Algorithm
For a mini-batch $B = \{x_1, \dots, x_m\}$ containing $m$ values of a single activation (one feature or channel):
$$
\mu_B = \frac{1}{m} \sum_{i=1}^m x_i, \quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2
$$
$$
\hat x_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
$$
$$
y_i = \gamma \hat x_i + \beta
$$
- $\gamma, \beta$ — learnable per-channel scale and shift parameters.
- $\epsilon$ — small constant for numerical stability.
At inference, replace batch statistics with running averages accumulated during training.
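The training and inference paths above can be sketched for a single feature in plain Python. This is a minimal illustration, not a framework implementation: the function names, the `running` dictionary, and the exponential-moving-average update with `momentum=0.1` are assumptions made for the example, since the text does not specify how the running averages are accumulated.

```python
import math

def batchnorm_train(xs, gamma, beta, running, momentum=0.1, eps=1e-5):
    """Normalise one feature over a mini-batch and update the running statistics."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m  # biased variance, as in the formula
    # Assumed update rule: exponential moving average of the batch statistics.
    running["mean"] = (1 - momentum) * running["mean"] + momentum * mu
    running["var"] = (1 - momentum) * running["var"] + momentum * var
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

def batchnorm_eval(xs, gamma, beta, running, eps=1e-5):
    """Inference path: use the stored running statistics, not the batch's own."""
    mu, var = running["mean"], running["var"]
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

running = {"mean": 0.0, "var": 1.0}
batch = [1.0, 2.0, 3.0, 4.0]
ys = batchnorm_train(batch, gamma=1.0, beta=0.0, running=running)
single = batchnorm_eval([2.5], gamma=1.0, beta=0.0, running=running)
```

With `gamma=1` and `beta=0`, `ys` has approximately zero mean and unit variance. `batchnorm_eval` works on a batch of one because it never looks at the batch's own statistics.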
## What It Does
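- **Reduces internal covariate shift.** This was the original motivation: the distribution of inputs to each layer stays more stable as earlier layers update. The explanation is disputed (see below).
- **Enables larger learning rates.** Gradients are better-conditioned.
- **Acts as a regulariser.** Batch statistics depend on which samples share a batch, which injects noise; the effect is loosely comparable to dropout.
- **Reduces sensitivity to initialisation.** Training tolerates a wider range of initial weight scales.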
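The noise behind the regularisation effect can be seen in a toy example: the same activation value is normalised differently depending on which other samples share its mini-batch. The helper `normalise` and the batch contents below are made up for illustration.

```python
import math

def normalise(xs, eps=1e-5):
    """Standardise a list of values using its own mean and (biased) variance."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    return [(x - mu) / math.sqrt(var + eps) for x in xs]

sample = 2.0
# The same sample placed in two different mini-batches.
in_batch_a = normalise([sample, 0.0, 1.0, 3.0])[0]  # batch mean 1.5: sample is above it
in_batch_b = normalise([sample, 5.0, 6.0, 7.0])[0]  # batch mean 5.0: sample is below it
```

Here `in_batch_a` is positive and `in_batch_b` is negative, so the network sees a different input for the same sample depending on the batch composition.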
## Why It Works (Disputed)
The original "internal covariate shift" explanation was challenged by later research (Santurkar et al., 2018, *How Does Batch Normalization Help Optimization?*), which argued that BatchNorm's benefit does not come from reducing covariate shift. Their alternative account: BatchNorm *smooths the loss landscape*, making gradients more predictable and gradient descent more reliable.
The "why" is still debated; the empirical impact is settled.
## Where It's Used
- **Convolutional networks.** Many widely used CNN architectures apply BatchNorm after each conv layer.
- **Image classification, segmentation, detection.**
- Standard in models like ResNet, EfficientNet, etc.
## Limitations
- **Batch size dependency.** Small batches (≤8) give noisy statistics. Use [[Layer Normalization]] or GroupNorm for small batches.
- **Poor fit for sequence models.** When sequence length varies, batch statistics become unreliable across padded or missing positions; use LayerNorm instead.
- **Train/eval mismatch.** Using batch stats at training but running stats at evaluation can cause subtle behaviour differences.
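- **Distributed training synchronisation.** In multi-GPU training each device typically computes statistics over its own shard of the batch. When per-device batches are small, the statistics need to be synchronised across devices, at the cost of extra communication.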
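One way to picture synchronised batch statistics: each device reports a count, a sum, and a sum of squares for its shard, and the global mean and variance are reconstructed from the totals. This is a sketch of the arithmetic only. The function names and shard values are hypothetical, there is no real communication, and the $E[x^2] - \mu^2$ form can lose precision for large values.

```python
def local_sums(xs):
    """Per-device summary: (count, sum, sum of squares)."""
    return len(xs), sum(xs), sum(x * x for x in xs)

def combine(stats):
    """Recover the global mean and biased variance from per-device summaries."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    var = total_sq / n - mean ** 2  # E[x^2] - (E[x])^2
    return mean, var

# Three hypothetical devices, two samples each.
shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean, var = combine([local_sums(s) for s in shards])
```

The result matches the statistics of the full six-sample batch (mean 3.5, variance 35/12), whereas each shard alone would report a variance of 0.25.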
## Variants
- **[[Layer Normalization]]** — normalise across features per sample (no batch dependency). Used in transformers.
- **GroupNorm** — normalise within groups of channels. Compromise between BatchNorm and LayerNorm.
- **InstanceNorm** — normalise per-sample, per-channel. Used in style transfer.
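The variants differ mainly in which elements share a mean and variance. The sketch below makes that explicit for a toy tensor indexed as `x[n][c][l]` (sample, channel, position). The helper `normalise_by` and its `key` argument are constructs invented for this illustration, and the learnable scale and shift are omitted. LayerNorm is shown normalising over all channels and positions of a sample; in transformers it is usually applied over the feature dimension of each token instead.

```python
import math
from collections import defaultdict

def normalise_by(x, key, eps=1e-5):
    """Normalise x[n][c][l]; elements with the same key(n, c) share statistics."""
    groups = defaultdict(list)
    for n, sample in enumerate(x):
        for c, channel in enumerate(sample):
            groups[key(n, c)].extend(channel)
    stats = {}
    for k, vals in groups.items():
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        stats[k] = (mu, math.sqrt(var + eps))
    return [[[(v - stats[key(n, c)][0]) / stats[key(n, c)][1] for v in channel]
             for c, channel in enumerate(sample)]
            for n, sample in enumerate(x)]

# Toy input: 2 samples, 4 channels, 3 positions each.
x = [[[float(n * 12 + c * 3 + l) for l in range(3)] for c in range(4)]
     for n in range(2)]

batch_norm = normalise_by(x, key=lambda n, c: c)              # per channel, across the batch
layer_norm = normalise_by(x, key=lambda n, c: n)              # per sample, across all features
instance_norm = normalise_by(x, key=lambda n, c: (n, c))      # per sample and channel
group_norm = normalise_by(x, key=lambda n, c: (n, c // 2))    # per sample, 2 groups of 2 channels
```

Only the `batch_norm` result for a given sample depends on the other samples in the batch. The other three keys all include `n`, which is why those variants have no batch dependency.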
## In Modern LLMs
Transformers use **LayerNorm** (or RMSNorm, a simpler variant), not BatchNorm. Reasons: sequence lengths vary, effective batches at inference are small, and per-sample statistics do not depend on the other samples in the batch.
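To illustrate why RMSNorm counts as the simpler variant, the sketch below normalises one token's feature vector both ways. It assumes the commonly used RMSNorm form, which divides by the root mean square without subtracting the mean. The learnable gain (and LayerNorm's bias) are left out, and the function names and token values are illustrative.

```python
import math

def layer_norm(x, eps=1e-5):
    """Centre and scale one feature vector using its own mean and variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Scale one feature vector by its root mean square; no centring step."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

token = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(token)
rn = rms_norm(token)
```

Both functions use only the token's own features, so the result is the same whatever else is in the batch. `ln` has zero mean, while `rn` keeps the sign and relative size of every input and only rescales it.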
## Related
- [[Layer Normalization]]
- [[Vanishing and Exploding Gradients]]
- [[Skip Connections]]
- [[Convolutional Neural Network]]