## Definition

A **Multilayer Perceptron (MLP)** is a feedforward neural network composed of multiple layers of neurons, where each layer applies an affine transformation followed by a non-linear activation. It is the simplest "deep" architecture and the foundational template for nearly every neural network.

## Architecture

For $L$ layers:

$$
h^{(0)} = x
$$

$$
h^{(\ell)} = \sigma(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}) \quad \text{for } \ell = 1, \dots, L
$$

$$
\hat y = h^{(L)}
$$

- $W^{(\ell)}$: weight matrix of layer $\ell$.
- $b^{(\ell)}$: bias vector of layer $\ell$.
- $\sigma$: activation function (ReLU, GELU, sigmoid, tanh). In the final layer, $\sigma$ is usually replaced by a task-specific output function (see Output Layer Conventions).

## Why Multiple Layers Matter

A single-layer network (perceptron) can only separate classes that are linearly separable; it cannot represent XOR, for example. Adding a hidden layer with a non-linear activation lets the model approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units (Universal Approximation Theorem).

The theorem already holds for a single hidden layer. Depth matters in practice because deep networks compose functions hierarchically: features in early layers combine into more abstract features in later layers.

## Training

Standard recipe:

1. Forward pass: compute predictions.
2. Compute loss against true labels.
3. [[Backpropagation]]: compute gradients of the loss with respect to all parameters.
4. Update parameters via an optimiser (SGD, Adam; see [[Optimizers SGD Adam]]).

## Output Layer Conventions

- **Regression:** linear output, MSE loss.
- **Binary classification:** sigmoid output, binary cross-entropy.
- **Multi-class classification:** softmax output, categorical cross-entropy.

## Key Hyperparameters

- **Depth** (number of hidden layers): 2-10 for tabular tasks; many more for vision/NLP.
- **Width** (neurons per layer): 32-1024 typical.
- **Activation:** ReLU is the default; GELU and SwiGLU in modern transformers.
- **Initialisation:** He or Xavier for weights; never all zeros, since identical weights receive identical gradients and the neurons in a layer never differentiate.

## Strengths

- **Universal approximator:** given enough capacity, can approximate any continuous function on a compact domain.
- **End-to-end learnable.**
- **GPU-friendly** with batch processing.
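
As a rough illustration of the Architecture and Training sections above, the following is a minimal pure-Python sketch, not a reference implementation. It assumes tanh hidden units, a linear output with MSE loss (the regression convention), Xavier-style uniform initialisation, and plain full-batch gradient descent on the XOR problem. All function names (`init_mlp`, `forward`, `backward`, `train_step`, `count_params`), the layer sizes, the learning rate, and the step count are hypothetical choices made for this example.

```python
import math
import random

def init_mlp(sizes, rng):
    """Xavier-style uniform weights and zero biases (an assumed choice)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        limit = math.sqrt(6.0 / (n_in + n_out))
        W = [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
        params.append((W, [0.0] * n_out))
    return params

def count_params(sizes):
    """Weights plus biases for fully connected layers of the given sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

def forward(params, x):
    """Return all layer activations: hs[0] = x, ..., hs[L] = y_hat."""
    hs = [x]
    for i, (W, b) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, hs[-1])) + bias for row, bias in zip(W, b)]
        is_output = i == len(params) - 1
        # Hidden layers use tanh; the output layer stays linear (regression).
        hs.append(z if is_output else [math.tanh(v) for v in z])
    return hs

def mse(params, data):
    """Mean over examples of the squared error summed over outputs."""
    return sum(
        sum((o - t) ** 2 for o, t in zip(forward(params, x)[-1], y))
        for x, y in data
    ) / len(data)

def backward(params, hs, y):
    """Backpropagation for one example; returns per-layer (dW, db)."""
    delta = [2.0 * (o - t) for o, t in zip(hs[-1], y)]  # dLoss/dz at the linear output
    grads = []
    for i in reversed(range(len(params))):
        W, _ = params[i]
        h_prev = hs[i]
        grads.append(([[d * v for v in h_prev] for d in delta], delta))
        if i > 0:
            # Push the gradient through W, then through tanh: 1 - tanh(z)^2.
            back = [sum(W[k][j] * delta[k] for k in range(len(delta)))
                    for j in range(len(h_prev))]
            delta = [g * (1.0 - a * a) for g, a in zip(back, h_prev)]
    return grads[::-1]

def train_step(params, data, lr):
    """One full-batch gradient-descent update, applied in place."""
    total = [([[0.0] * len(W[0]) for _ in W], [0.0] * len(b)) for W, b in params]
    for x, y in data:
        grads = backward(params, forward(params, x), y)
        for (aW, ab), (dW, db) in zip(total, grads):
            for k in range(len(ab)):
                ab[k] += db[k]
                for j in range(len(aW[k])):
                    aW[k][j] += dW[k][j]
    n = len(data)
    for (W, b), (aW, ab) in zip(params, total):
        for k in range(len(b)):
            b[k] -= lr * ab[k] / n
            for j in range(len(W[k])):
                W[k][j] -= lr * aW[k][j] / n

rng = random.Random(0)
sizes = [2, 4, 1]  # 2 inputs, one hidden layer of width 4, 1 output
params = init_mlp(sizes, rng)
xor = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
       ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]

loss_before = mse(params, xor)
for _ in range(2000):
    train_step(params, xor, lr=0.05)
loss_after = mse(params, xor)
print(count_params(sizes), loss_before, loss_after)
```

XOR is used here because it is the classic target a single-layer perceptron cannot represent. In practice a framework with automatic differentiation would replace the hand-written `backward`, and an optimiser such as Adam would replace the plain update in `train_step`.
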
## Weaknesses

- **Doesn't exploit structural priors.** For images, convolutional networks vastly outperform MLPs of comparable size by exploiting spatial locality.
- **Many parameters** for high-dimensional inputs: an input of dimension `d` feeding a first hidden layer of width `h` needs `d * h` weights in that layer alone.
- **Black-box.** Interpretability is hard.

## When Used in 2026

- **Tabular data:** when you want to capture interactions a tree-based model misses (though gradient-boosted trees such as XGBoost usually still win).
- **Heads on top of pretrained encoders:** final classification layers on top of LLM or CNN embeddings.
- **Toy problems and teaching.**

For most "interesting" problems, specialised architectures (CNNs, RNNs, Transformers) beat raw MLPs.

## Related

- [[Perceptron]]
- [[Backpropagation]]
- [[Neural Network Architecture]]
- [[Activation function]]
- [[Universal Approximation Theorem]]