Positional Encoding - Albert Masoliver's learning site

## Definition

**Positional encoding** is the mechanism that injects word-order information into a transformer, which would otherwise treat the input as an unordered set. It is the reason that the *order* of words in your prompt changes the model's output.

## Why it is necessary

The [[Attention Mechanism]], taken on its own, is **permutation-equivariant** (often loosely called permutation-invariant): it computes pairwise relationships between tokens without any built-in notion of which came first, so shuffling the input only shuffles the output in the same way. Feed it "dog bites man" or "man bites dog" and, absent position information, each word receives exactly the same representation in both sentences. Something must tell the model where each [[Token]] sits in the sequence.

## Absolute / sinusoidal (classic)

The original "Attention Is All You Need" transformer added a fixed **sinusoidal** signal to each token's [[Embedding]]: a unique pattern of sines and cosines per position.

$$
PE_{(pos,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right),\quad PE_{(pos,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
$$

Here $pos$ is the token's position, $i$ indexes the sine/cosine dimension pairs, and $d$ is the embedding dimension. The scheme is parameter-free and lets the model read off an absolute position, but it generalizes poorly to sequences longer than those seen in training.

## RoPE (modern)

**Rotary Position Embedding (RoPE)** is the dominant approach in current LLMs (Llama, Mistral, and most open models). Instead of adding a signal, it *rotates* the query and key vectors by an angle proportional to their position. The elegant consequence: for given query and key content, the attention dot product between two tokens depends only on their **relative** distance, not on where the pair sits in the sequence. RoPE therefore encodes **absolute and relative position at once**, and it can be extended to longer contexts far more gracefully.

| | Sinusoidal | RoPE |
| --- | --- | --- |
| Applied to | embeddings (added) | queries & keys (rotated) |
| Encodes | absolute | absolute + relative |
| Long-context | weak | strong (extensible) |

## Practitioner takeaway

"Order matters in prompts" is not folklore; it is positional encoding doing its job. Where you place an instruction, an example, or a retrieved chunk relative to others measurably shifts attention. RoPE-based extensions are also what let vendors stretch a model's usable [[Context Window]] after training.

## Related

- [[Attention Mechanism]]
- [[Transformer Architecture]]
- [[Token]]
- [[Embedding]]
- [[Context Window]]
- [[KV Cache]]
- [[Hands-On Large Language Models - Alammar, Grootendorst]]
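
## Worked example

A minimal sketch of the two schemes described in this note, in plain Python with no dependencies. It computes the sinusoidal vector for a position using the formula above, then checks numerically that a RoPE-rotated query and key give the same score whenever their offset is the same. The function names (`sinusoidal_pe`, `rope`, `dot`), the toy vectors, and the adjacent-pair rotation layout are illustrative assumptions, not any library's API; real implementations are vectorized and may pair dimensions differently.

```python
import math


def sinusoidal_pe(pos, d, base=10000.0):
    """Sinusoidal encoding for one position; d is assumed to be even."""
    pe = []
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        pe.extend([math.sin(angle), math.cos(angle)])  # dims 2i and 2i+1
    return pe


def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos * base**(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([
            x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta),
        ])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Made-up query and key vectors (hypothetical values, d = 4).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (query 3 positions after key) at two different absolute places.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))

# A different offset (1 position apart) for comparison.
score_other = dot(rope(q, 5), rope(k, 4))

print("offset 3 near start:", score_near)
print("offset 3 far along: ", score_far)
print("offset 1:           ", score_other)
```

The first two printed scores should agree up to floating-point error, while the third differs: only the offset between the two positions enters the dot product. Adding `sinusoidal_pe(pos, d)` to an embedding, by contrast, changes the vector itself according to its absolute position.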