## Definition

The **attention mechanism** computes a weighted average of value vectors, where the weights are derived from the compatibility between a query vector and a set of key vectors.

**Self-attention** is the special case where queries, keys, and values all come from the same sequence: each token attends to every token in that sequence, itself included.

## Scaled Dot-Product Attention

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

- $Q, K, V$: query, key, and value matrices (one row per token).
- $d_k$: dimensionality of the keys. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with $d_k$, which would otherwise push the softmax into saturated regions with very small gradients.
- The softmax is applied row-wise and produces an attention *distribution*: each row sums to 1.

## Multi-Head Attention

Instead of one attention computation, the model runs $h$ in parallel with separate learned projections, then concatenates and projects:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O
$$

where $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$.

Each head can specialise in different relational patterns (syntactic dependency, coreference, positional offset).

## Why It Works (Intuitively)

Attention answers, per token, "where should I look for the information I need?" The model learns to point at the right neighbours rather than being forced to summarise the whole context into a single fixed-size state.

## Computational Cost

Naive attention is **O(n²)** in sequence length, in both time and memory, because it scores every query against every key. This cost drove the practical context-window limits of early LLMs. Modern optimisations attack it in different ways: FlashAttention computes exact attention without materialising the full $n \times n$ score matrix, which cuts memory use and memory traffic, while sparse and sliding-window attention reduce the number of query-key pairs that are scored at all.

## Causal Masking

In decoder-only models, a triangular mask ensures position $i$ can only attend to positions $\leq i$. In practice the scores for positions $> i$ are set to $-\infty$ before the softmax, so they receive zero weight. This is what makes autoregressive generation work: the model can't peek at future tokens during training.

## Related

- [[Transformer Architecture]]
- [[Large Language Model]]
- [[Tokenization]]
- [[Attention Is All You Need (Vaswani et al.)]]
- [[Lost in the Middle Effect]]
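
## Minimal Sketch

The scaled dot-product formula and the causal mask described in this note can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the function names `softmax` and `attention`, the `causal` flag, and the toy input `X` are all assumptions made for the example, and the learned projections $W^Q, W^K, W^V$ are omitted, so $Q = K = V = X$.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats (entries may be -inf)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention on lists of row vectors (hypothetical helper).

    Q and K hold n rows of length d_k; V holds n rows of length d_v.
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Compatibility of query i with every key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        if causal:
            # Mask out future positions j > i so they get zero weight.
            scores = [s if j <= i else -math.inf for j, s in enumerate(scores)]
        w = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Toy self-attention: three tokens, two dimensions, no learned projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(X, X, X, causal=True)
```

With the causal mask on, the first token can only see itself, so `w[0]` is `[1.0, 0.0, 0.0]` and `out[0]` equals the first value vector. Every row of `w` sums to 1, which is the attention distribution property noted under Scaled Dot-Product Attention. The nested loops also make the O(n²) cost visible: each of the n queries is scored against all n keys.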