## Definition
The **Universal Approximation Theorem** (UAT) states that a feedforward neural network with a single hidden layer of sufficient width can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy. First proven by Cybenko (1989) for sigmoid activations; generalised by Hornik (1991) to broader non-polynomial activations.
## Statement (Cybenko's Version, Informal)
For any continuous function $f: K \to \mathbb{R}$ on a compact set $K \subset \mathbb{R}^n$ and any $\epsilon > 0$, there exist parameters such that the network $g(x) = \sum_{i=1}^m w_i \sigma(a_i^\top x + b_i)$ satisfies $|g(x) - f(x)| < \epsilon$ for all $x \in K$.
In words: a wide-enough single-hidden-layer network can approximate any continuous function arbitrarily well.
## What It Doesn't Tell You
- **How wide.** The required width can be exponentially large in input dimension. Universal in theory; not practical.
- **How to find the parameters.** Existence is not constructive.
- **Generalisation guarantees.** UAT is about representation, not learning from finite data.
## Why Depth Matters
Despite the theorem's promise that one layer suffices, **deep** networks consistently outperform shallow ones. Why?
- **Compositionality.** Real-world functions often have hierarchical structure (edges → textures → parts → objects). Depth matches this structure.
- **Parameter efficiency.** A deep network can represent some functions with *exponentially fewer* parameters than a shallow network.
- **Optimisation landscape.** Gradient descent finds better solutions in deep architectures, empirically.
UAT shows shallow networks *can* represent any function; deep networks just do it *more efficiently and learnably*.
## Modern Extensions
- **Universal approximation by deep networks.** Multiple variants extend the theorem to deep networks with bounded width.
- **ReLU networks.** Specific universal approximation results for ReLU architectures.
- **Other modalities.** Analogous results for CNNs, RNNs, transformers.
## Historical Significance
Before UAT, it wasn't clear that neural networks could even *in principle* solve complex problems. UAT settled the representation question; the learning, scaling, and architecture questions took decades more to resolve.
## Connection to The Bitter Lesson
The UAT framing — "any function representable" — meshes with Sutton's *Bitter Lesson*: general-purpose function approximators plus scale outperform hand-engineered solutions. UAT provides the theoretical floor under empirical scale.
## Related
- [[Multilayer Perceptron]]
- [[Neural Network Architecture]]
- [[Activation function]]