## Definition
**Inductive bias** is the set of assumptions a learning algorithm uses to predict outputs for inputs it has not seen. Without some inductive bias, generalisation is impossible: by the [[No Free Lunch Theorem]], averaged over all possible target functions, no learner outperforms any other on unseen data.
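
To make the claim concrete, here is a minimal sketch on a toy problem with three binary features. The training labels (an XOR of the first two bits) and the choice of which inputs are held out are illustrative assumptions, not part of any standard benchmark. The sketch enumerates every Boolean function that agrees with the training data and counts how those functions vote on the unseen inputs.

```python
from itertools import product

# All 8 possible inputs over 3 binary features.
inputs = list(product([0, 1], repeat=3))

# Hypothetical training set: labels for the first 6 inputs (arbitrary choice).
train = {x: (x[0] ^ x[1]) for x in inputs[:6]}
unseen = inputs[6:]

# Enumerate every Boolean function on 3 bits (2**8 = 256 labelings) and keep
# the ones that agree with all training labels.
consistent = []
for labels in product([0, 1], repeat=len(inputs)):
    h = dict(zip(inputs, labels))
    if all(h[x] == y for x, y in train.items()):
        consistent.append(h)

# For each unseen input, count how many consistent hypotheses predict 1.
votes = {x: sum(h[x] for h in consistent) for x in unseen}
print(len(consistent), votes)
```

Four hypotheses fit the training data, and on each unseen input exactly half of them predict 1. With no assumption that favours some hypotheses over others, the data alone cannot decide the label.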
## Two Forms
### Restriction bias
The [[Hypothesis Space]] of the algorithm itself: the set of functions it can represent at all. Linear regression cannot represent relationships that are non-linear in its features; that is a restriction. Decision trees represent piecewise-constant (step) functions, so they can only approximate smooth ones.
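
A small sketch of a restriction in action, assuming NumPy and a noiseless target $y = x^2$ chosen for illustration. The helper `fit_mse` is a hypothetical name introduced here. The same least-squares procedure is run over two hypothesis spaces, and only the one that contains the target can fit it.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)
y = x ** 2  # noiseless, non-linear target (illustrative assumption)

def fit_mse(features, y):
    """Least-squares fit; returns (weights, mean squared error on the same data)."""
    w, *_ = np.linalg.lstsq(features, y, rcond=None)
    return w, float(np.mean((features @ w - y) ** 2))

# Hypothesis space 1: h(x) = w0 + w1 * x
linear = np.column_stack([np.ones_like(x), x])
w_lin, mse_lin = fit_mse(linear, y)

# Hypothesis space 2: h(x) = w0 + w1 * x + w2 * x^2
quadratic = np.column_stack([np.ones_like(x), x, x ** 2])
w_quad, mse_quad = fit_mse(quadratic, y)

print(f"linear MSE:    {mse_lin:.4f}")
print(f"quadratic MSE: {mse_quad:.2e}")
```

The optimiser is identical in both cases. The linear fit stays poor because no straight line equals $x^2$, while adding the $x^2$ feature enlarges the hypothesis space enough to contain the target.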
### Preference bias
Within the hypothesis space, the algorithm *prefers* some hypotheses over others. [[L2 Regularization]] prefers small-weight solutions. Greedy decision tree growth prefers shorter trees (Occam's razor). SGD prefers whichever solutions its search trajectory reaches from the initialisation, an implicit bias that comes from the optimiser rather than from the loss alone.
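
The L2 case can be sketched with closed-form ridge regression. The synthetic data, the penalty strength `lam=10.0`, and the helper name `ridge_fit` are all assumptions made for illustration. Both fits search the same hypothesis space (linear functions of five features); the penalty only changes which member is preferred.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])  # hypothetical ground truth
y = X @ true_w + 0.1 * rng.normal(size=30)

def ridge_fit(X, y, lam):
    """Minimise ||Xw - y||^2 + lam * ||w||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_ols = ridge_fit(X, y, lam=0.0)     # no preference beyond fitting the data
w_ridge = ridge_fit(X, y, lam=10.0)  # prefers smaller weights

norm_ols = float(np.linalg.norm(w_ols))
norm_ridge = float(np.linalg.norm(w_ridge))
mse_ols = float(np.mean((X @ w_ols - y) ** 2))
mse_ridge = float(np.mean((X @ w_ridge - y) ** 2))

print(f"weight norm: OLS {norm_ols:.3f}, ridge {norm_ridge:.3f}")
print(f"train MSE:   OLS {mse_ols:.4f}, ridge {mse_ridge:.4f}")
```

The ridge solution has a smaller weight norm and a training error no lower than ordinary least squares. That is the preference at work: a little training fit is given up in exchange for a solution the bias ranks higher.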
## Examples
| Algorithm | Inductive bias |
|---|---|
| Linear regression | Linearity in features |
| k-NN | Smooth, locally constant target function |
| Decision trees | Axis-aligned, hierarchical decision rules |
| Convolutional networks | Spatial locality, translation equivariance |
| Recurrent networks | Sequential dependence, temporal locality |
| Transformers | Permutation equivariance + positional encoding |
| Random Forest | Averaging many decorrelated trees cancels their individual errors |
| L2 regularisation | Smooth, low-norm solutions |
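
As one worked instance from the table, translation equivariance means that shifting the input and then convolving gives the same result as convolving and then shifting. The sketch below checks this for a 1-D convolution with wrap-around (circular) boundaries. The signal, kernel, and helper name `circular_conv1d` are illustrative assumptions, and zero-padded convolutions in real networks satisfy the property only away from the edges. A random dense layer is included for contrast.

```python
import numpy as np

def circular_conv1d(signal, kernel):
    """Cross-correlate `signal` with `kernel`, wrapping around at the edges."""
    n = len(signal)
    return np.array([
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ])

signal = np.array([0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 5.0, 1.0])
kernel = np.array([1.0, -2.0, 1.0])
shift = 3

# Convolution: shifting before or after gives the same output.
shift_then_conv = circular_conv1d(np.roll(signal, shift), kernel)
conv_then_shift = np.roll(circular_conv1d(signal, kernel), shift)
conv_equivariant = bool(np.allclose(shift_then_conv, conv_then_shift))

# Dense layer with unconstrained random weights: no such guarantee.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
dense_equivariant = bool(np.allclose(W @ np.roll(signal, shift),
                                     np.roll(W @ signal, shift)))

print(conv_equivariant, dense_equivariant)
```

The convolution shares one small kernel across all positions, which is exactly what builds the bias into the architecture. The dense layer would have to learn that structure from data.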
## Why It Matters
- **No generalisation without bias.** A learner that makes no assumptions about unseen data has no way to label it.
- **Right bias → fast learning.** Less data needed when the bias matches reality.
- **Wrong bias → fails silently.** A linear model on non-linear data simply can't learn the truth, regardless of data volume.
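
The last point can be checked numerically. The sketch below assumes a noiseless target $y = |x|$ on an evenly spaced grid over $[-1, 1]$, and the helper name `linear_fit_mse` is hypothetical. It fits a straight line to increasingly large samples and records the error.

```python
import numpy as np

def linear_fit_mse(n):
    """Fit h(x) = w0 + w1 * x to y = |x| on n grid points; return the MSE."""
    x = np.linspace(-1.0, 1.0, n)
    y = np.abs(x)
    A = np.column_stack([np.ones_like(x), x])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ w - y) ** 2))

errors = {n: linear_fit_mse(n) for n in (10, 100, 1_000, 10_000)}
for n, mse in errors.items():
    print(f"n={n:>6}  MSE={mse:.4f}")
```

The error does not shrink towards zero as `n` grows. It settles near $1/12 \approx 0.083$, the variance of $|x|$ under a uniform distribution on $[-1, 1]$, because the best available line is flat. The fit raises no error or warning, which is what makes this failure mode silent.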
## Choosing the Right Bias
The art of ML engineering is matching inductive bias to the problem:
- **Spatial data (images)** → convolutional bias.
- **Sequential data (text, audio)** → recurrence or attention.
- **Tabular with mixed types** → tree-based models.
- **Strong domain priors** → handcrafted features, kernels, or architecture.
## The Bitter Lesson
Rich Sutton's "Bitter Lesson" (2019): historically, AI methods that scale with compute and data tend to beat those relying on hard-coded knowledge. Translation: weaker inductive biases combined with vast scale often win over stronger biases with limited scale. The frontier models of 2026 (LLMs, diffusion) reflect this — relatively general architectures, scaled aggressively, dominate hand-crafted competitors.
## Related
- [[No Free Lunch Theorem]]
- [[Hypothesis Space]]
- [[Bias-Variance Tradeoff]]
- [[Machine Learning]]