## Definition
**Layer normalisation (LayerNorm)** (Ba, Kiros & Hinton, 2016) normalises activations across the *features* of a single sample, rather than across the batch as [[Batch Normalization]] does. It is independent of batch size and is the standard normalisation in transformers and LLMs.
## Algorithm
For a sample with features $x = (x_1, \dots, x_d)$:
$
\mu = \frac{1}{d} \sum_{j=1}^d x_j, \quad \sigma^2 = \frac{1}{d} \sum_{j=1}^d (x_j - \mu)^2
$
$
y_j = \gamma_j \cdot \frac{x_j - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_j
$
Each sample uses its own statistics; no cross-sample dependence.
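The two formulas above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation; the function name `layer_norm` and the default `eps=1e-5` are assumptions chosen for the example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise one sample's feature vector x (a list of floats)."""
    d = len(x)
    mu = sum(x) / d                              # mean over the d features
    var = sum((xj - mu) ** 2 for xj in x) / d    # biased variance (divide by d)
    inv_std = 1.0 / math.sqrt(var + eps)
    # Standardise, then apply the per-feature scale gamma_j and shift beta_j.
    return [g * (xj - mu) * inv_std + b for xj, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
```

With $\gamma = 1$ and $\beta = 0$ the output has mean close to 0 and variance close to 1; the small deviation from exactly 1 comes from $\epsilon$.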
## LayerNorm vs BatchNorm
| Property | BatchNorm | LayerNorm |
|---|---|---|
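| Normalises over | Samples in the batch (per feature) | Features (per sample) |
| Depends on batch size | Yes | No |
| Train/eval mismatch | Possible (batch statistics in training, running averages at inference) | None (same computation in both modes) |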
| Best for | CNNs, large batches | Transformers, RNNs, small batches |
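The batch-dependence row can be checked with a small experiment. The sketch below assumes training-mode batch normalisation without running averages and omits $\gamma$ and $\beta$ from both functions for brevity; `batch_norm_train` is a hypothetical helper name.

```python
import math

def layer_norm(x, eps=1e-5):
    # Statistics over the features of ONE sample.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def batch_norm_train(batch, eps=1e-5):
    # Statistics per feature, over ALL samples in the batch.
    n, d = len(batch), len(batch[0])
    mus = [sum(row[j] for row in batch) / n for j in range(d)]
    vars_ = [sum((row[j] - mus[j]) ** 2 for row in batch) / n for j in range(d)]
    return [[(row[j] - mus[j]) / math.sqrt(vars_[j] + eps) for j in range(d)]
            for row in batch]

sample = [1.0, 2.0, 3.0]
batch_a = [sample, [0.0, 0.0, 0.0]]
batch_b = [sample, [10.0, -5.0, 7.0]]

# Output for `sample` under each scheme, in two different batches.
ln_a = [layer_norm(row) for row in batch_a][0]
ln_b = [layer_norm(row) for row in batch_b][0]
bn_a = batch_norm_train(batch_a)[0]
bn_b = batch_norm_train(batch_b)[0]
```

Here `ln_a` and `ln_b` are identical, while `bn_a` and `bn_b` differ because the other sample in the batch changes the per-feature statistics.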
## Why LayerNorm Dominates Transformers
- **Variable sequence lengths.** Batch statistics would be skewed by padding tokens, whose share varies from batch to batch.
- **Small effective batches** during deployment (single user, single sequence).
- **Per-token normalisation** maps cleanly to attention's per-token computation.
## Pre-Norm vs Post-Norm
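Transformers place LayerNorm in one of two positions relative to the residual connection:
- **Post-norm** (original Transformer paper): `LayerNorm(x + Sublayer(x))`. The residual sum itself is normalised, which makes very deep stacks less stable to train and typically calls for careful learning-rate warm-up.
- **Pre-norm**: `x + Sublayer(LayerNorm(x))`. The skip path carries `x` through unchanged, which is more stable. **The standard in modern LLMs.**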
The placement decision matters for training stability at scale.
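The two placements can be written as small wrapper functions. This is a sketch under simplifying assumptions: `layer_norm` omits $\gamma$ and $\beta$, and `zero_sublayer` is a hypothetical stand-in for attention or an MLP, used only to expose what happens on the residual path.

```python
import math

def layer_norm(x, eps=1e-5):
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    # LayerNorm(x + Sublayer(x)): the residual sum itself is normalised.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    # x + Sublayer(LayerNorm(x)): x travels along the skip path untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def zero_sublayer(h):
    # Hypothetical sublayer that contributes nothing.
    return [0.0 for _ in h]

x = [2.0, 4.0, 6.0, 8.0]
pre = pre_norm_block(x, zero_sublayer)    # equals x: pure identity path
post = post_norm_block(x, zero_sublayer)  # x is rescaled even with no sublayer output
```

With a sublayer that outputs zeros, the pre-norm block returns `x` unchanged, whereas the post-norm block still rescales it. That unmodified identity path is one common explanation for why pre-norm stacks are easier to train at depth.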
## RMSNorm
A simpler variant (Zhang & Sennrich, 2019): it drops the mean centring and the shift $\beta$, and normalises only by the root mean square of the features:
$
y_j = \gamma_j \cdot \frac{x_j}{\sqrt{\frac{1}{d}\sum_{k=1}^d x_k^2 + \epsilon}}
$
Slightly cheaper, comparable performance. Used by Llama, T5, and many modern transformers.
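A minimal sketch of RMSNorm in plain Python follows. The function name `rms_norm` and the default `eps=1e-6` are assumptions made for illustration.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Scale x by its root mean square; no mean subtraction, no shift."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
y = rms_norm(x, gamma=[1.0] * 4)

# Rescaling the input leaves the output (almost) unchanged.
y_scaled = rms_norm([10.0 * v for v in x], gamma=[1.0] * 4)
```

Unlike LayerNorm, the output keeps a non-zero mean because nothing is subtracted; only the overall scale is fixed, so that the mean of the squared outputs is close to 1.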
## Trainable Parameters
LayerNorm has $2d$ parameters per layer ($\gamma, \beta$); RMSNorm has $d$. Trivial overhead.
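To put "trivial" in numbers, the sketch below counts normalisation parameters for a hypothetical model. The width `d_model = 4096` and the count of 65 normalisation layers (two per block over 32 blocks, plus one final norm) are illustrative assumptions, not figures from any specific model.

```python
def norm_param_count(d, kind):
    # LayerNorm learns gamma and beta (2d); RMSNorm learns only gamma (d).
    if kind == "layernorm":
        return 2 * d
    if kind == "rmsnorm":
        return d
    raise ValueError(f"unknown norm kind: {kind}")

d_model = 4096        # hypothetical model width
n_norm_layers = 65    # hypothetical: 2 per block x 32 blocks + 1 final norm

layernorm_total = n_norm_layers * norm_param_count(d_model, "layernorm")
rmsnorm_total = n_norm_layers * norm_param_count(d_model, "rmsnorm")
one_weight_matrix = d_model * d_model  # a single d x d projection, for scale

print(layernorm_total, rmsnorm_total, one_weight_matrix)
# 532480 266240 16777216
```

Under these assumptions, all LayerNorm parameters in the model together amount to roughly 3% of a single $d \times d$ weight matrix.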
## Where to Place It
- **Before each sublayer** in a transformer block (pre-norm).
- **After each layer** in a feedforward MLP (less common in 2026).
- **In RNN cells** (e.g., LayerNorm-LSTM, LayerNorm-GRU).
## Related
- [[Batch Normalization]]
- [[Transformer Architecture]]
- [[Vanishing and Exploding Gradients]]