## Definition
**Positional encoding** is the mechanism that injects word-order information into a transformer, whose attention layers would otherwise treat the input as an unordered set of tokens. It is a core reason that the *order* of words in your prompt changes the model's output.
## Why it is necessary
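The [[Attention Mechanism]] on its own is **permutation-equivariant** (often loosely called permutation-invariant): it computes pairwise relationships between tokens without any built-in notion of which came first, so shuffling the input merely shuffles the output in the same way. Feed it "dog bites man" or "man bites dog" and, absent position information (and any causal mask), each word ends up with exactly the same representation in both sentences. Something must tell the model where each [[Token]] sits in the sequence.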
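
The point can be checked numerically. The following is a minimal sketch, not a real transformer layer: it assumes a single attention head with identity projections (Q = K = V = the embeddings), no causal mask, and made-up three-dimensional toy embeddings. The names `self_attention` and `tokens` are illustrative only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head attention with identity projections (assumption: Q = K = V = x)."""
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

# Toy embeddings (hypothetical values, no position information added).
tokens = {"dog": [1.0, 0.0, 0.5], "bites": [0.0, 1.0, 0.2], "man": [0.3, 0.7, 1.0]}

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]
out_a = dict(zip(a, self_attention([tokens[t] for t in a])))
out_b = dict(zip(b, self_attention([tokens[t] for t in b])))

# Each word gets the same output vector in both orderings (up to float rounding).
same = all(
    math.isclose(u, v, abs_tol=1e-12)
    for t in a
    for u, v in zip(out_a[t], out_b[t])
)
print(same)
```

Because the per-word outputs match in both orderings, nothing downstream could tell the two sentences apart without an added position signal.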
## Absolute / sinusoidal (classic)
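The original "Attention Is All You Need" transformer added a fixed **sinusoidal** signal to each token's [[Embedding]]: a unique pattern of sines and cosines per position, where $pos$ is the token's index in the sequence, $i$ indexes pairs of embedding dimensions, and $d$ is the embedding dimension: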
$$
PE_{(pos,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right),\quad
PE_{(pos,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
$$
It is parameter-free and lets the model read off an absolute position, but it generalizes poorly to sequences longer than those seen in training.
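
As a rough illustration of the formula above, here is a minimal pure-Python sketch. The function name `sinusoidal_pe` and the assumption of an even `d_model` are choices made for this example, not something prescribed by the original paper's code.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Sinusoidal encoding for one position (assumes d_model is even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

print(sinusoidal_pe(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(sinusoidal_pe(1, 4))  # first pair oscillates fast, second pair slowly
```

In the original design this vector is simply added element-wise to the token embedding before the first layer. Low-index pairs vary quickly with position and high-index pairs vary slowly, which is what gives each position a distinct signature.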
## RoPE (modern)
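**Rotary Position Embedding (RoPE)** is the dominant approach in current LLMs (Llama, Mistral, and most open models). Instead of adding a signal to the embeddings, it *rotates* each query and key vector by an angle proportional to its token's position. The elegant consequence: the rotation applied to each vector is set by its absolute position, yet the query-key dot product, and hence the attention score, depends on position only through the **relative** offset between the two tokens. RoPE therefore encodes **absolute and relative position at once**. It does not extrapolate far beyond the training length by itself, but it is much easier to extend to longer contexts with post-training rescaling.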
| | Sinusoidal | RoPE |
| --- | --- | --- |
| Applied to | embeddings (added) | queries & keys (rotated) |
| Encodes | absolute | absolute + relative |
| Long-context | weak | better (extensible with rescaling) |
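
The relative-position property of RoPE can be seen in a few lines. This is a minimal sketch under stated assumptions: it rotates adjacent dimension pairs (2i, 2i+1) and uses a base of 10000, whereas real implementations may pair dimensions differently. The helper names `rope` and `dot` and the sample vectors are hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos.

    Assumes len(vec) is even and pairs dimensions as (2i, 2i + 1).
    """
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos / base ** (2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2-D rotation
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]   # toy query vector
k = [1.0, 0.4, -0.6, 0.9]   # toy key vector

score_near = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, offset 3
score_far = dot(rope(q, 105), rope(k, 102))    # same offset, shifted by 100

# The two scores agree up to floating-point rounding.
print(math.isclose(score_near, score_far, abs_tol=1e-9))
```

Shifting both tokens by the same amount leaves the score unchanged, while changing the offset between them generally changes it. Rotation also preserves vector length, so RoPE adds position information without rescaling queries or keys.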
## Practitioner takeaway
"Order matters in prompts" is not folklore — it is positional encoding doing its job. Where you place an instruction, an example, or a retrieved chunk relative to others measurably shifts attention. RoPE-based extensions are also what let vendors stretch a model's usable [[Context Window]] after training.
## Related
- [[Attention Mechanism]]
- [[Transformer Architecture]]
- [[Token]]
- [[Embedding]]
- [[Context Window]]
- [[KV Cache]]
- [[Hands-On Large Language Models - Alammar, Grootendorst]]