Transformer Architecture - Albert Masoliver's learning site

## Definition

The **Transformer** is a neural-network architecture introduced by Vaswani et al. in 2017 (see [[Attention Is All You Need (Vaswani et al.)]]) that replaces recurrence and convolutions with [[Attention Mechanism|self-attention]]. It is the architectural foundation of nearly every modern frontier LLM.

## Why It Replaced RNNs

- **Parallelisation.** RNNs process tokens sequentially; Transformers compute attention over all positions in parallel during training. This is the precondition for training at scale.
- **Long-range dependencies.** Self-attention has direct edges between any two positions; an RNN must propagate information through many time steps.
- **Better gradient flow.** Gradients between distant positions do not have to pass through one recurrent step per token, so the vanishing-gradient problem that RNNs face on long sequences is largely avoided.

## Variants

- **Encoder-only** (BERT, RoBERTa): bidirectional; used for understanding tasks.
- **Decoder-only** (GPT family, Claude, Llama): autoregressive; the dominant LLM shape.
- **Encoder-decoder** (original Transformer, T5, BART): used for seq2seq tasks like translation and summarisation.

## Architectural Components

A standard decoder-only Transformer block:

1. **Multi-head self-attention** with a causal mask, so each position attends only to itself and earlier positions; see [[Attention Mechanism]].
2. **Residual connection + layer norm**.
3. **Position-wise feed-forward network**: typically a two-layer MLP with a non-linearity (GELU, or a gated variant such as SwiGLU).
4. **Residual connection + layer norm**.

This is the post-norm ordering of the original paper; most current LLMs instead normalise *before* each sub-layer (pre-norm), often with RMSNorm in place of layer norm.

Stacked N times. Frontier models in 2026 have anywhere from ~30 to ~200+ such blocks.

## Modern Refinements

- **Rotary positional embeddings (RoPE)** replace the original sinusoidal positional encodings by rotating queries and keys according to position.
- **Grouped-query attention (GQA)** and **multi-query attention (MQA)** share key/value heads across query heads to reduce KV-cache memory.
- **Mixture-of-Experts (MoE)** routes each token through a subset of feed-forward experts.
- **Flash Attention** computes exact attention in tiles without materialising the full attention matrix, cutting memory use and memory traffic in both training and inference.

All refinements live *inside* the Vaswani skeleton.

## Related

- [[Attention Mechanism]]
- [[Large Language Model]]
- [[Tokenization]]
- [[Attention Is All You Need (Vaswani et al.)]]
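
## Sketch: One Decoder Block

As a rough illustration of the four steps listed under Architectural Components, here is a minimal pure-Python sketch of one decoder block. It rests on several simplifying assumptions that are not part of the note above: a single attention head with no output projection, layer norm without learned scale or shift, plain GELU in the feed-forward network, and random weights. All helper names (`decoder_block`, `causal_self_attention`, `random_matrix`, and so on) are hypothetical and chosen only for this example.

```python
import math
import random


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x, eps=1e-5):
    """Normalise each row to zero mean and unit variance (no learned parameters)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def causal_self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention with a causal mask."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = []
    for i, row in enumerate(scores):
        # Position i may only attend to positions j <= i.
        masked = [s / scale if j <= i else float("-inf") for j, s in enumerate(row)]
        weights.append(softmax(masked))
    return matmul(weights, v), weights


def feed_forward(x, w1, w2):
    """Position-wise two-layer MLP applied to each row independently."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def decoder_block(x, p):
    """Steps 1-4 in the post-norm ordering: attention, add + norm, FFN, add + norm."""
    attn_out, weights = causal_self_attention(x, p["wq"], p["wk"], p["wv"])
    x = layer_norm(add(x, attn_out))
    x = layer_norm(add(x, feed_forward(x, p["w1"], p["w2"])))
    return x, weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_ff, seq_len = 4, 8, 3  # toy sizes, chosen arbitrarily
params = {
    "wq": random_matrix(d_model, d_model, rng),
    "wk": random_matrix(d_model, d_model, rng),
    "wv": random_matrix(d_model, d_model, rng),
    "w1": random_matrix(d_model, d_ff, rng),
    "w2": random_matrix(d_ff, d_model, rng),
}
x = random_matrix(seq_len, d_model, rng)  # stand-in for token embeddings
out, weights = decoder_block(x, params)
```

Because of the causal mask, the first row of `weights` places all of its mass on position 0, and changing a later token leaves earlier rows of `out` unchanged. That property is what lets a decoder-only model be trained on next-token prediction at every position in parallel. Stacking N blocks amounts to feeding `out` back in as the next block's `x` with a fresh set of parameters.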