## Definition
The **attention mechanism** computes a weighted average of value vectors, where each weight reflects how compatible a query vector is with the corresponding key vector. **Self-attention** is the special case where queries, keys, and values are all derived from the same sequence, so each token attends to every token in that sequence, itself included.
## Scaled Dot-Product Attention
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$
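- $Q, K, V$: query, key, and value matrices (one row per token).
- $d_k$: dimensionality of the keys (and queries). Dot products between $d_k$-dimensional vectors grow in magnitude with $d_k$, so dividing by $\sqrt{d_k}$ keeps the softmax out of its saturated, small-gradient region.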
- The softmax produces an attention *distribution*: each row sums to 1.
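
The formula and the notes above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `softmax` and `scaled_dot_product_attention`, the random inputs, and the chosen shapes are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (output, weights)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) compatibility scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights

# Hypothetical toy inputs: 4 tokens, d_k = 8, d_v = 16.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)             # (4, 16)
print(weights.sum(axis=-1))  # every entry is 1 up to floating-point error
```

Each output row is a convex combination of the rows of `V`, which is the "weighted average of value vectors" from the definition.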
## Multi-Head Attention
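Instead of one attention computation, the model runs $h$ attention heads in parallel, each with its own learned projections of the queries, keys, and values, then concatenates the head outputs and applies a final output projection $W^O$:
$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O, \quad \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
$$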
Each head can specialise in different relational patterns (syntactic dependency, coreference, positional offset).
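
A rough NumPy sketch of multi-head self-attention follows. It assumes the per-head projections are packed into single `(d_model, d_model)` matrices and split by reshaping, which is a common implementation choice but not the only one; all names (`multi_head_self_attention`, `W_q`, `W_k`, `W_v`, `W_o`) and the random weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d_model); all weight matrices: (d_model, d_model); h must divide d_model."""
    n, d_model = X.shape
    d_head = d_model // h

    def split_heads(M):
        # (n, d_model) -> (h, n, d_head): one slice of the projection per head.
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    heads = softmax(scores, axis=-1) @ V                 # (h, n, d_head)
    # Concatenate the heads back into (n, d_model), then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

# Hypothetical toy setup: 5 tokens, d_model = 16, 4 heads of width 4.
rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

Y = multi_head_self_attention(X, W_q, W_k, W_v, W_o, h)
print(Y.shape)  # (5, 16)
```

Because each head works in a `d_head = d_model / h` dimensional subspace, the total cost stays close to that of a single full-width head while allowing `h` different attention patterns.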
## Why It Works (Intuitively)
Attention answers, per token, "where should I look for the information I need?" The model learns to point at the right neighbours rather than being forced to summarise the whole context into a single fixed-size state.
## Computational Cost
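Naive attention costs **O(n²)** time and memory in sequence length $n$, because it scores every query against every key. This cost drove the practical context-window limits of early LLMs. Modern optimisations attack it in different ways: Flash Attention computes exact attention without materialising the full $n \times n$ score matrix, cutting memory traffic while leaving the quadratic arithmetic in place; sparse and sliding-window attention restrict which query-key pairs are scored, which lowers the arithmetic cost itself.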
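
To make the quadratic growth concrete, the sketch below counts query-key scores per head for dense attention and for a causal sliding window. The helper names and the window size `w=512` are illustrative assumptions, not values taken from any particular model.

```python
def dense_score_entries(n):
    # Dense attention computes one score per (query, key) pair.
    return n * n

def sliding_window_score_entries(n, w):
    # Assumes a causal window: each query sees itself and up to w - 1 earlier tokens.
    return sum(min(i + 1, w) for i in range(n))

for n in (1_024, 4_096, 16_384):
    dense = dense_score_entries(n)
    windowed = sliding_window_score_entries(n, w=512)
    print(n, dense, windowed, round(dense / windowed, 1))
```

Doubling `n` quadruples the dense count, while the windowed count grows roughly linearly once `n` exceeds the window size.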
## Causal Masking
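In decoder-only models, a triangular mask ensures position $i$ can attend only to positions $\leq i$. In practice, the scores for later positions are set to $-\infty$ before the softmax, so their weights are exactly zero. Because the model cannot peek at future tokens during training, every position can be trained to predict its next token in a single parallel pass, which matches how tokens are produced one at a time during autoregressive generation.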
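
One common way to apply such a mask, sketched here in NumPy under the assumption of a single head and no batching, is to overwrite the scores above the diagonal with negative infinity before the softmax. The function name `causal_attention_weights` and the toy input are hypothetical.

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Q, K: (n, d_k). Returns an (n, n) lower-triangular attention matrix."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True strictly above the diagonal marks future positions (j > i).
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Stable softmax: the diagonal is never masked, so each row maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)  # exp(-inf) evaluates to exactly 0
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical toy input: 4 tokens with 8-dimensional queries and keys.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W = causal_attention_weights(X, X)
print(np.round(W, 2))
```

The first row places all of its weight on position 0, since the first token has nothing earlier to attend to, and every entry above the diagonal is zero.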
## Related
- [[Transformer Architecture]]
- [[Large Language Model]]
- [[Tokenization]]
- [[Attention Is All You Need (Vaswani et al.)]]
- [[Lost in the Middle Effect]]