Universal Approximation Theorem - Albert Masoliver's learning site

## Definition The **Universal Approximation Theorem** (UAT) states that a feedforward neural network with a single hidden layer of sufficient width can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy. First proven by Cybenko (1989) for sigmoid activations; generalised by Hornik (1991) to broader non-polynomial activations. ## Statement (Cybenko's Version, Informal) For any continuous function $f: K \to \mathbb{R}$ on a compact set $K \subset \mathbb{R}^n$ and any $\epsilon > 0$, there exist parameters such that the network $g(x) = \sum_{i=1}^m w_i \sigma(a_i^\top x + b_i)$ satisfies $|g(x) - f(x)| < \epsilon$ for all $x \in K$. In words: a wide-enough single-hidden-layer network can approximate any continuous function arbitrarily well. ## What It Doesn't Tell You - **How wide.** The required width can be exponentially large in input dimension. Universal in theory; not practical. - **How to find the parameters.** Existence is not constructive. - **Generalisation guarantees.** UAT is about representation, not learning from finite data. ## Why Depth Matters Despite the theorem's promise that one layer suffices, **deep** networks consistently outperform shallow ones. Why? - **Compositionality.** Real-world functions often have hierarchical structure (edges → textures → parts → objects). Depth matches this structure. - **Parameter efficiency.** A deep network can represent some functions with *exponentially fewer* parameters than a shallow network. - **Optimisation landscape.** Gradient descent finds better solutions in deep architectures, empirically. UAT shows shallow networks *can* represent any function; deep networks just do it *more efficiently and learnably*. ## Modern Extensions - **Universal approximation by deep networks.** Multiple variants extend the theorem to deep networks with bounded width. - **ReLU networks.** Specific universal approximation results for ReLU architectures. - **Other modalities.** Analogous results for CNNs, RNNs, transformers. ## Historical Significance Before UAT, it wasn't clear that neural networks could even *in principle* solve complex problems. UAT settled the representation question; the learning, scaling, and architecture questions took decades more to resolve. ## Connection to The Bitter Lesson The UAT framing — "any function representable" — meshes with Sutton's *Bitter Lesson*: general-purpose function approximators plus scale outperform hand-engineered solutions. UAT provides the theoretical floor under empirical scale. ## Related - [[Multilayer Perceptron]] - [[Neural Network Architecture]] - [[Activation function]]