## Definition
A **Multilayer Perceptron (MLP)** is a feedforward neural network composed of multiple layers of neurons, where each layer applies an affine transformation followed by a non-linear activation. It is the simplest "deep" architecture and the template on which most other neural network architectures build.
## Architecture
For $L$ layers:
$$
h^{(0)} = x
$$
$$
h^{(\ell)} = \sigma(W^{(\ell)} h^{(\ell-1)} + b^{(\ell)}) \quad \text{for } \ell = 1, \dots, L
$$
$$
\hat y = h^{(L)}
$$
- $W^{(\ell)}$ — weight matrix of layer $\ell$.
- $b^{(\ell)}$ — bias vector.
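- $\sigma$ — activation function (ReLU, GELU, sigmoid, tanh). The recurrence applies $\sigma$ at every layer for simplicity; in practice the output layer $L$ usually uses a task-specific activation instead (identity, sigmoid, or softmax).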
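
The recurrence above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`affine`, `mlp_forward`, `random_layer`), the 3 → 4 → 2 layer sizes, and the uniform random weights are all assumptions made for the example, and a real model would normally use NumPy or a deep learning framework.

```python
import random

def relu(v):
    # Element-wise ReLU on a list of floats.
    return [max(0.0, a) for a in v]

def affine(W, b, h):
    # W is a list of rows, so this returns W h + b as a list.
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i
            for row, b_i in zip(W, b)]

def mlp_forward(x, layers, sigma=relu):
    h = list(x)                      # h^(0) = x
    for W, b in layers:              # l = 1, ..., L
        h = sigma(affine(W, b, h))   # h^(l) = sigma(W^(l) h^(l-1) + b^(l))
    return h                         # y_hat = h^(L)

# Hypothetical 3 -> 4 -> 2 network with arbitrary small random weights.
rng = random.Random(0)

def random_layer(n_in, n_out):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

layers = [random_layer(3, 4), random_layer(4, 2)]
y_hat = mlp_forward([1.0, -2.0, 0.5], layers)
```

Like the equations, this sketch applies the same activation at every layer, including the last one.
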
## Why Multiple Layers Matter
A single-layer network (perceptron) can only draw a linear decision boundary, so it can only classify linearly separable data. Adding a single hidden layer with a non-linear activation already lets the model approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units (Universal Approximation Theorem). Depth matters beyond that because deep networks compose functions hierarchically: features in early layers combine into more abstract features in later layers.
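
XOR is the classic example: it is not linearly separable, so no single perceptron computes it, yet one hidden layer of two ReLU units does. The sketch below uses hand-picked weights (chosen for illustration, not learned), and `xor_mlp` is a hypothetical name.

```python
def relu(a):
    return max(0.0, a)

def xor_mlp(x1, x2):
    # Hidden layer: two ReLU units sharing the same input weights (1, 1).
    h1 = relu(x1 + x2)         # 0, 1, or 2: counts the active inputs
    h2 = relu(x1 + x2 - 1.0)   # non-zero only when both inputs are 1
    # Linear output layer with weights (1, -2) and no bias.
    return h1 - 2.0 * h2

truth_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
# truth_table == {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}
```

The second hidden unit cancels the contribution of the first one exactly when both inputs are on, which is the kind of feature composition a single affine map cannot express.
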
## Training
Standard recipe:
1. Forward pass — compute predictions.
2. Compute loss against true labels.
3. [[Backpropagation]] — compute gradients of loss with respect to all parameters.
4. Update parameters via an optimiser (SGD, Adam — see [[Optimizers SGD Adam]]).
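
The four steps can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: the task (fitting $t = x^2$ on a small grid), the single tanh hidden layer of width 8, the learning rate, and the step count. For brevity the update is plain full-batch gradient descent rather than SGD or Adam, and the gradients are derived by hand instead of by an autodiff library.

```python
import math
import random

rng = random.Random(0)
H = 8  # hidden width (arbitrary choice for this sketch)
params = {
    "w1": [rng.uniform(-1, 1) for _ in range(H)],  # input -> hidden weights
    "b1": [0.0] * H,
    "w2": [rng.uniform(-1, 1) for _ in range(H)],  # hidden -> output weights
    "b2": 0.0,
}

# Toy regression data: t = x^2 on a grid in [-1, 1].
xs = [i / 10 - 1.0 for i in range(21)]
ts = [x * x for x in xs]

def forward(x):
    a = [math.tanh(params["w1"][j] * x + params["b1"][j]) for j in range(H)]
    y = sum(params["w2"][j] * a[j] for j in range(H)) + params["b2"]
    return a, y

def mse():
    return sum((forward(x)[1] - t) ** 2 for x, t in zip(xs, ts)) / len(xs)

def train_step(lr):
    gw1, gb1, gw2, gb2 = [0.0] * H, [0.0] * H, [0.0] * H, 0.0
    for x, t in zip(xs, ts):
        a, y = forward(x)                    # 1. forward pass
        dy = 2.0 * (y - t) / len(xs)         # 2. d(MSE)/dy for this sample
        for j in range(H):                   # 3. backpropagation (chain rule)
            gw2[j] += dy * a[j]
            dz = dy * params["w2"][j] * (1.0 - a[j] ** 2)  # tanh'(z) = 1 - tanh(z)^2
            gw1[j] += dz * x
            gb1[j] += dz
        gb2 += dy
    for j in range(H):                       # 4. gradient-descent update
        params["w1"][j] -= lr * gw1[j]
        params["b1"][j] -= lr * gb1[j]
        params["w2"][j] -= lr * gw2[j]
    params["b2"] -= lr * gb2

loss_before = mse()
for _ in range(1000):
    train_step(lr=0.02)
loss_after = mse()
```

Comparing `loss_before` with `loss_after` gives a quick sanity check that the hand-written gradients point downhill.
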
## Output Layer Conventions
- **Regression:** linear output, MSE loss.
- **Binary classification:** sigmoid output, binary cross-entropy.
- **Multi-class classification:** softmax output, categorical cross-entropy.
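
The three pairings above, written out as plain Python functions. This is a sketch with made-up input values; the function names are illustrative, and the naive `sigmoid` here would overflow for very large negative inputs, which production code guards against (typically by computing the loss directly from logits).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def mse_loss(y_hat, y):
    # Regression: linear (identity) output, squared error.
    return (y_hat - y) ** 2

def binary_cross_entropy(p, y):
    # Binary classification: p is the sigmoid output, y is 0 or 1.
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def categorical_cross_entropy(probs, target_index):
    # Multi-class: probs is the softmax output, target_index the true class.
    return -math.log(probs[target_index])

# Last-layer pre-activations below are invented for illustration.
regression_loss = mse_loss(2.5, 3.0)
binary_loss = binary_cross_entropy(sigmoid(0.0), 1)
multiclass_probs = softmax([2.0, 1.0, 0.1])
multiclass_loss = categorical_cross_entropy(multiclass_probs, 0)
```
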
## Key Hyperparameters
- **Depth** (number of hidden layers): 2-10 for tabular tasks; many more for vision/NLP.
- **Width** (neurons per layer): 32-1024 typical.
- **Activation:** ReLU is the default; GELU and SwiGLU in modern transformers.
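- **Initialisation:** He (for ReLU-family activations) or Xavier (for tanh/sigmoid); never all-zero weights, because every unit in a layer would then compute the same output and receive the same gradient.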
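
The two initialisation schemes can be sketched as follows, assuming the common variants: a zero-mean Gaussian with standard deviation $\sqrt{2 / n_\text{in}}$ for He, and a uniform distribution on $[-a, a]$ with $a = \sqrt{6 / (n_\text{in} + n_\text{out})}$ for Xavier. The function names and the 64 → 32 layer size are illustrative only.

```python
import math
import random

def he_normal(n_in, n_out, rng):
    # He initialisation: N(0, 2 / fan_in), usually paired with ReLU.
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def xavier_uniform(n_in, n_out, rng):
    # Xavier (Glorot) initialisation: U(-limit, limit),
    # limit = sqrt(6 / (fan_in + fan_out)), usually paired with tanh or sigmoid.
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
W_relu = he_normal(64, 32, rng)       # 32 x 64 weight matrix for a ReLU layer
W_tanh = xavier_uniform(64, 32, rng)  # 32 x 64 weight matrix for a tanh layer
```

Both schemes scale the weights by the layer's fan-in (and fan-out for Xavier) so that activations neither blow up nor vanish as depth grows.
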
## Strengths
- **Universal approximator** — given enough capacity, can approximate any continuous function on a compact domain to arbitrary accuracy.
- **End-to-end learnable.**
- **GPU-friendly** with batch processing.
## Weaknesses
- **Doesn't exploit structural priors.** For images, convolutional networks vastly outperform MLPs of comparable size because they exploit spatial locality and share weights across positions.
- **Many parameters** for high-dimensional inputs (`d * h` weights in the first layer alone, for input dimension `d` and first hidden width `h`).
- **Black-box.** Interpretability is hard.
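
To put the parameter-count point in numbers, here is a small back-of-the-envelope comparison. The sizes are assumptions chosen for illustration: a flattened 224 × 224 RGB image, a first hidden width of 1024, and a 3 × 3 convolution with 64 output channels as the point of comparison.

```python
def dense_params(n_in, n_out):
    # Fully connected layer: one weight per input-output pair, plus biases.
    return n_in * n_out + n_out

def conv2d_params(k, c_in, c_out):
    # k x k convolution: weights are shared across all spatial positions.
    return k * k * c_in * c_out + c_out

d = 224 * 224 * 3   # flattened RGB image -> 150528 inputs
h = 1024            # first hidden layer width

mlp_first_layer = dense_params(d, h)         # 154141696 parameters
conv_first_layer = conv2d_params(3, 3, 64)   # 1792 parameters
```

Under these assumptions the dense first layer alone carries roughly 154 million parameters, against fewer than two thousand for the convolutional layer.
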
## When Used in 2026
- **Tabular data** — when you want to capture interactions a tree-based model misses (though gradient-boosted trees such as XGBoost usually still win).
- **Heads on top of pretrained encoders** — final classification layers on top of LLM or CNN embeddings.
- **Toy problems and teaching.**
For most "interesting" problems, specialised architectures (CNNs, RNNs, Transformers) beat raw MLPs.
## Related
- [[Perceptron]]
- [[Backpropagation]]
- [[Neural Network Architecture]]
- [[Activation function]]
- [[Universal Approximation Theorem]]