Layer Normalization - Albert Masoliver's learning site

## Definition

**Layer normalisation (LayerNorm)** (Ba, Kiros & Hinton, 2016) normalises activations across the *features* of a single sample, rather than across the batch as [[Batch Normalization]] does. It is independent of batch size and is the standard normalisation in transformers and LLMs.

## Algorithm

For a sample with features $x = (x_1, \dots, x_d)$:

$$
\mu = \frac{1}{d} \sum_{j=1}^d x_j, \quad \sigma^2 = \frac{1}{d} \sum_{j=1}^d (x_j - \mu)^2
$$

$$
y_j = \gamma_j \cdot \frac{x_j - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_j
$$

Here $\gamma$ and $\beta$ are learned per-feature scale and shift parameters, and $\epsilon$ is a small constant for numerical stability. Each sample uses its own statistics; there is no cross-sample dependence.

## LayerNorm vs BatchNorm

| Property | BatchNorm | LayerNorm |
|---|---|---|
| Normalises over | Batch dimension | Feature dimension |
| Depends on batch size | Yes | No |
| Train/eval mismatch | Possible | None |
| Best for | CNNs, large batches | Transformers, RNNs, small batches |

## Why LayerNorm Dominates Transformers

- **Variable sequence lengths.** Batch statistics would be confounded by padding tokens.
- **Small effective batches** during deployment (single user, single sequence), where batch statistics are unreliable.
- **Per-token normalisation** maps cleanly to attention's per-token computation.

## Pre-Norm vs Post-Norm

In transformers, there are two placements:

- **Post-norm** (original Transformer paper): `LayerNorm(x + Sublayer(x))`. Less stable for very deep transformers.
- **Pre-norm**: `x + Sublayer(LayerNorm(x))`. More stable. **The standard in modern LLMs.**

The placement decision matters for training stability at scale.

## RMSNorm

A simpler variant (Zhang & Sennrich, 2019): drop the mean centring and normalise only by the root mean square:

$$
y_j = \gamma_j \cdot \frac{x_j}{\sqrt{\frac{1}{d}\sum_{k=1}^d x_k^2 + \epsilon}}
$$

Slightly cheaper, with comparable performance. Used by Llama, T5, and many modern transformers.

## Trainable Parameters

LayerNorm has $2d$ parameters per layer ($\gamma, \beta$); RMSNorm has $d$ ($\gamma$ only). Trivial overhead.

## Where to Place It

- **Before each sublayer** in a transformer block (pre-norm).
- **After each layer** in a feedforward MLP (less common in 2026).
- **In RNN cells** (e.g., LayerNorm-LSTM, LayerNorm-GRU).

## Related

- [[Batch Normalization]]
- [[Transformer Architecture]]
- [[Vanishing and Exploding Gradients]]
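
## Worked Sketch

The LayerNorm and RMSNorm formulas, and the two residual placements, can be sketched in plain Python for a single sample. This is a minimal illustration, not a reference implementation: the function names (`layer_norm`, `rms_norm`, `post_norm`, `pre_norm`), the default `eps=1e-5`, and the toy doubling sublayer are all assumptions made for the example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the d features of one sample (a list of floats)."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d  # biased variance: divide by d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mu) * inv_std + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: no mean centring and no shift, only a root-mean-square rescale."""
    d = len(x)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / d + eps)
    return [g * v * inv_rms for v, g in zip(x, gamma)]

def post_norm(x, sublayer, norm):
    """Post-norm placement: norm(x + sublayer(x))."""
    return norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer, norm):
    """Pre-norm placement: x + sublayer(norm(x))."""
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

# Toy inputs (hypothetical): d = 4, gamma = 1, beta = 0.
x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4

y = layer_norm(x, ones, zeros)   # approximately zero mean, unit variance
r = rms_norm(x, ones)            # approximately unit root mean square

double = lambda h: [2.0 * v for v in h]        # stand-in for attention / MLP
norm = lambda h: layer_norm(h, ones, zeros)
post = post_norm(x, double, norm)
pre = pre_norm(x, double, norm)
```

In this toy setup the post-norm output is itself normalised (zero mean), whereas the pre-norm output keeps the mean of `x`, because the residual path bypasses the normalisation. That untouched residual path is one common explanation for why pre-norm tends to train more stably in deep stacks.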