## Definition
**Dropout** (Srivastava et al., 2014) is a regularisation technique that randomly "drops" (sets to zero) a fraction of neurons during training. It prevents co-adaptation: because a neuron can't rely on specific other neurons being present to handle specific patterns, each one must encode more robust features.
## Algorithm
During training, on each forward pass, every neuron's activation $h_i$ is dropped independently with probability $p$ (typical: 0.2-0.5) and replaced by $\tilde{h}_i$:
$$
\tilde{h}_i = \begin{cases} 0 & \text{with probability } p \\ h_i / (1-p) & \text{with probability } 1-p \end{cases}
$$
The $1/(1-p)$ scaling preserves the expected activation magnitude. Applying it during training, rather than rescaling at inference, is sometimes called *inverted dropout*.
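As a quick check of that claim, write $\tilde{h}_i$ for the activation after dropout (notation assumed here) and treat $h_i$ as fixed. Taking the expectation over the random mask gives:

$$
\mathbb{E}[\tilde{h}_i] = p \cdot 0 + (1-p) \cdot \frac{h_i}{1-p} = h_i
$$

So, on average, the next layer sees the same input with or without dropout.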
At inference, dropout is disabled: all neurons are used, and because the scaling was already applied during training, no further rescaling is needed.
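The training and inference behaviour described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation: the function name `dropout`, its `training` flag, and the `rng` parameter are assumptions made for this example, and real frameworks operate on tensors rather than lists.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Inverted dropout over a flat list of activations.

    p is the drop probability; survivors are scaled by 1 / (1 - p).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Inference: dropout is disabled and nothing is rescaled.
        return list(activations)
    scale = 1.0 / (1.0 - p)
    # One independent Bernoulli draw per neuron.
    return [0.0 if rng.random() < p else h * scale for h in activations]

rng = random.Random(0)  # seeded only to make the sketch reproducible
hidden = [1.0] * 100_000
train_out = dropout(hidden, p=0.5, rng=rng)
eval_out = dropout(hidden, p=0.5, training=False)

train_mean = sum(train_out) / len(train_out)  # close to 1.0 in expectation
eval_mean = sum(eval_out) / len(eval_out)     # exactly 1.0
```

With `p=0.5`, each training output is either `0.0` or `2.0`, so the mean stays near the original `1.0`, while the inference path returns the input unchanged.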
## Why It Works
Several complementary intuitions:
- **Implicit ensembling.** Each forward pass samples a different subnetwork. A layer with $n$ droppable units defines $2^n$ possible subnetworks that share weights, and running the full network at inference approximates averaging their predictions.
- **Reduces co-adaptation.** Forces neurons to be useful in many contexts.
- **Acts like noise injection.** The random mask is multiplicative noise on the activations, which gives a generic regularisation effect.
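The ensembling intuition can be checked exactly in the simplest case, a single linear unit. The sketch below enumerates every dropout mask over the inputs, weights each masked output by the probability of that mask, and compares the result with the output of the full unit. The names `linear_unit` and `ensemble_average` and the example numbers are made up for illustration. For networks with nonlinearities the match is only approximate, so this shows the idea, not a general proof.

```python
from itertools import product

def linear_unit(weights, inputs, mask, p):
    """Output of one linear unit with inverted dropout applied to its inputs."""
    keep = 1.0 - p
    return sum(w * x * m / keep for w, x, m in zip(weights, inputs, mask))

def ensemble_average(weights, inputs, p):
    """Probability-weighted average over all 2^n dropout masks."""
    keep = 1.0 - p
    total = 0.0
    for mask in product((0, 1), repeat=len(inputs)):
        kept = sum(mask)
        prob = keep ** kept * p ** (len(mask) - kept)
        total += prob * linear_unit(weights, inputs, mask, p)
    return total

weights = [0.5, -1.0, 2.0]
inputs = [1.0, 3.0, 0.25]
full_output = sum(w * x for w, x in zip(weights, inputs))  # no dropout
averaged = ensemble_average(weights, inputs, p=0.5)        # all 8 subnetworks
```

Here `averaged` and `full_output` agree up to floating-point error, which is why a single pass through the full network can stand in for the ensemble at inference.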
## When to Use
- **MLPs and CNNs**, typically before fully-connected output layers.
- **Pre-2017 norm.** Dropout was the dominant regulariser during the "deep learning explosion" associated with LeCun, Hinton, and Bengio.
## When NOT to Use
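- **Batch-normalised networks.** Dropout and BatchNorm often interact poorly: dropout changes the variance of activations between training and inference, so the running statistics BatchNorm collects during training no longer match what it sees at test time. Many modern recipes use one or the other, not both.
- **Most transformers.** Modern LLMs use light dropout (0.0-0.1), or none at all, combined with layer normalisation, weight decay, and large training data.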
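The BatchNorm interaction in the first bullet comes down to a variance mismatch, which a small simulation can make concrete. For zero-mean activations, inverted dropout multiplies the variance by $1/(1-p)$ during training and leaves it unchanged at inference. The sketch below assumes standard-normal activations and uses hypothetical helper names; it is an illustration of the effect, not a model of any particular network.

```python
import random

def inverted_dropout(x, p, rng):
    """Apply inverted dropout to a single activation."""
    return 0.0 if rng.random() < p else x / (1.0 - p)

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)  # seeded only for reproducibility
p = 0.5
activations = [rng.gauss(0.0, 1.0) for _ in range(200_000)]

train_var = variance([inverted_dropout(x, p, rng) for x in activations])
eval_var = variance(activations)  # dropout is off at inference
ratio = train_var / eval_var      # theory for zero-mean inputs: 1 / (1 - p)
```

With `p = 0.5` the ratio comes out near 2, so a normalisation layer placed after dropout would accumulate statistics during training that are roughly twice the variance it encounters at inference.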
## Variants
- **Spatial dropout** for CNNs: drop entire feature maps.
- **DropConnect**: drop weights instead of activations.
- **Variational dropout**: reuse one fixed mask per training sample, for example across all time steps of a sequence in an RNN (Gal & Ghahramani 2016). The same Bayesian view motivates keeping dropout active at inference to obtain uncertainty estimates.
- **DropBlock**: drop contiguous regions in feature maps.
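To show how spatial dropout differs from the element-wise version, the sketch below makes one drop decision per feature map instead of one per activation. It is a plain-Python illustration under assumed names (`spatial_dropout`, nested lists standing in for a channels × height × width tensor), not a framework API.

```python
import random

def spatial_dropout(feature_maps, p, rng):
    """Drop whole channels: one Bernoulli draw per feature map."""
    scale = 1.0 / (1.0 - p)
    output = []
    for fmap in feature_maps:
        if rng.random() < p:
            # The entire map is zeroed together.
            output.append([[0.0 for _ in row] for row in fmap])
        else:
            # Surviving maps are rescaled, as in inverted dropout.
            output.append([[v * scale for v in row] for row in fmap])
    return output

rng = random.Random(0)  # seeded only for reproducibility
maps = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(8)]  # 8 channels of 2x2
dropped = spatial_dropout(maps, p=0.5, rng=rng)

# Each channel ends up with a single distinct value: all 0.0 or all 2.0.
channel_values = [{v for row in fmap for v in row} for fmap in dropped]
```

Because neighbouring pixels in a feature map are strongly correlated, dropping them one at a time removes little information. Dropping the whole map is the usual motivation for this variant.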
## Probability Setting
- **Input layer:** 0.1-0.2 (low — preserving input information).
- **Hidden layers:** 0.3-0.5.
- **Tune** if validation performance is sensitive; otherwise 0.5 is a reasonable default for hidden layers.
## In Modern Deep Learning
Dropout has been partially displaced by [[Batch Normalization]] and by training on much larger datasets, both of which regularise implicitly. It still appears in:
- **Output heads** of classifiers.
- **Attention layers** in transformers (small probability, ~0.1).
- **Tasks with limited training data** where other regularisation isn't enough.
## Related
- [[Regularization]]
- [[Batch Normalization]]
- [[Overfitting and Underfitting]]