## Definition
The **Transformer** is a neural-network architecture introduced by Vaswani et al. in 2017 (see [[Attention Is All You Need (Vaswani et al.)]]) that replaces recurrence and convolutions with [[Attention Mechanism|self-attention]]. It is the architectural foundation of nearly all modern frontier LLMs.
## Why It Replaced RNNs
- **Parallelisation.** RNNs process tokens sequentially; Transformers compute attention over all positions of a training sequence in parallel. This is the precondition for training at scale. Autoregressive generation at inference time still emits one token per step.
- **Long-range dependencies.** Self-attention has direct edges between any two positions; an RNN must propagate information through many time steps.
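- **Better gradient flow.** Gradients reach any position through a single attention step plus residual connections rather than through many recurrent steps, which largely avoids the vanishing-gradient problem RNNs face on long sequences.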
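The contrast between the two computation patterns can be sketched in a few lines. This is a minimal illustration in plain Python lists, not an efficient implementation: the helper names (`attend`, `rnn_states`), the toy inputs, and the scalar RNN weights are all assumptions made for the example. It shows that each position's attention output can be computed independently of the others, while each RNN state needs the previous one.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(i, q, k, v):
    """Attention output for position i: needs q[i] plus all of k and v,
    but no other position's output."""
    scale = math.sqrt(len(q[i]))
    scores = [sum(a * b for a, b in zip(q[i], k_j)) / scale for k_j in k]
    weights = softmax(scores)  # one direct weight per position pair (i, j)
    return [sum(w * v_j[c] for w, v_j in zip(weights, v)) for c in range(len(v[0]))]

def rnn_states(xs, w_h=0.9, w_x=0.5):
    """Toy scalar tanh RNN: state t needs state t-1, so steps run in order."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

# Toy inputs (hypothetical values); q, k and v share one matrix for brevity.
q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
n = len(q)

in_order = [attend(i, q, k, v) for i in range(n)]
reverse_order = {i: attend(i, q, k, v) for i in reversed(range(n))}

# Each position's output is the same whatever order the positions are computed in.
print(in_order == [reverse_order[i] for i in range(n)])
print(rnn_states([1.0, 0.5, -0.5]))
```

Because no call to `attend` depends on another, real implementations batch all positions into a few matrix multiplications; the loop in `rnn_states` has no such freedom.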
## Variants
- **Encoder-only** (BERT, RoBERTa) — bidirectional; used for understanding tasks.
- **Decoder-only** (GPT family, Claude, Llama) — autoregressive; the dominant LLM shape.
- **Encoder-decoder** (original Transformer, T5, BART) — used for seq2seq tasks like translation and summarisation.
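One way to see the difference between the variants is through their attention masks, which decide which positions each token may attend to. The sketch below is illustrative only; the function names are hypothetical and real models apply such masks to attention scores inside each layer.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def cross_attention_mask(n_target, n_source):
    """Encoder-decoder cross-attention: each target position sees all source positions."""
    return [[True] * n_source for _ in range(n_target)]

def render(mask):
    """Draw a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

print(render(bidirectional_mask(4)))
print()
print(render(causal_mask(4)))
```

An encoder-only model uses the bidirectional mask throughout, a decoder-only model uses the causal mask throughout, and an encoder-decoder model combines a bidirectional encoder with a causal decoder that also cross-attends to the encoder output.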
## Architectural Components
A standard decoder-only Transformer block:
1. **Masked multi-head self-attention**, in which each position attends only to itself and earlier positions; see [[Attention Mechanism]].
2. **Residual connection + layer norm**.
3. **Position-wise feed-forward network** — typically a two-layer MLP with a non-linearity (GELU / SwiGLU).
4. **Residual connection + layer norm**.
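The block is stacked N times. The ordering above is the post-norm layout of the original paper; many recent models instead normalise before each sublayer (pre-norm), often with RMSNorm. Frontier models in 2026 have anywhere from ~30 to ~200+ such blocks.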
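The four steps of a block can be sketched end to end in plain Python. This is a simplified illustration under several assumptions, not a reference implementation: the weights are random, biases and the learned layer-norm gain and offset are omitted, GELU is used for the feed-forward non-linearity, normalisation placement follows the order listed above, and all function and parameter names are hypothetical.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance (no learned gain/offset)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_head(x, w_q, w_k, w_v):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    weights = []
    for i, q_i in enumerate(q):
        # Position i scores only positions 0..i; later positions get weight 0.
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / scale for j in range(i + 1)]
        weights.append(softmax(scores) + [0.0] * (len(x) - i - 1))
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    head_outputs = [causal_attention_head(x, *h) for h in heads]
    # Concatenate the per-head outputs for each position, then project.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)

def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def feed_forward(x, w1, w2):
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def decoder_block(x, params):
    # Steps 1-2: attention, then residual connection + layer norm.
    x = layer_norm(add(x, multi_head_attention(x, params["heads"], params["w_o"])))
    # Steps 3-4: feed-forward network, then residual connection + layer norm.
    return layer_norm(add(x, feed_forward(x, params["w1"], params["w2"])))

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, n_heads, d_ff = 4, 8, 2, 32  # toy sizes, chosen for the example
d_head = d_model // n_heads
params = {
    "heads": [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3))
              for _ in range(n_heads)],
    "w_o": rand_matrix(d_model, d_model, rng),
    "w1": rand_matrix(d_model, d_ff, rng),
    "w2": rand_matrix(d_ff, d_model, rng),
}
x = rand_matrix(seq_len, d_model, rng)
y = decoder_block(x, params)
print(len(y), len(y[0]))  # output keeps the input shape: seq_len x d_model
```

Because the output has the same shape as the input, blocks compose directly: stacking N of them amounts to calling `decoder_block` N times with N separate parameter sets.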
## Modern Refinements
- **Rotary positional embeddings (RoPE)** encode position by rotating query and key vectors, replacing the additive sinusoidal encodings of the original design.
- **Grouped-query attention (GQA)** and **multi-query attention (MQA)** reduce KV-cache memory.
- **Mixture-of-Experts (MoE)** routes each token through a subset of feed-forward experts.
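- **Flash Attention** computes exact attention in tiles without materialising the full attention matrix, cutting attention memory from quadratic to linear in sequence length and speeding up both training and inference.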
All refinements live *inside* the Vaswani skeleton.
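The KV-cache saving from GQA and MQA follows from simple arithmetic: the cache stores one key and one value vector per layer, per key/value head, per token. The sketch below works this through for a hypothetical configuration (32 layers, 32 query heads, head dimension 128, 8192 tokens, 2 bytes per value); these numbers are assumptions for illustration and do not describe any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size; the leading 2 counts keys plus values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical model shape, shared by all three attention layouts.
config = dict(n_layers=32, head_dim=128, seq_len=8192)

mha = kv_cache_bytes(n_kv_heads=32, **config)  # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **config)   # 4 query heads share each KV head
mqa = kv_cache_bytes(n_kv_heads=1, **config)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.3f} GiB")
# MHA: 4.000 GiB
# GQA: 1.000 GiB
# MQA: 0.125 GiB
```

The cache shrinks in proportion to the number of key/value heads, while the number of query heads, and so the shape of the attention output, is unchanged.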
## Related
- [[Attention Mechanism]]
- [[Large Language Model]]
- [[Tokenization]]
- [[Attention Is All You Need (Vaswani et al.)]]