Position Embeddings
Dec 27, 2025
Why do we need position embeddings?
Ideally, a sequence model (a family that includes traditional NNs, RNNs, and LSTMs) should know the position of each character/symbol in its input.
The brain's visual cortex knows which pixel is in the upper right versus the lower left (and which input came first) because its processors receive input from different dendrites for upper-right versus lower-left rods and cones.
Transformers are an overpowered architecture, but are by default position-invariant.
Attention transforms each embedding vector into a query, key, and value, but has no way of knowing at what position an embedding is. Attention naively treats "red" in "the red apple is a fruit" and "the apple is a red fruit" as the same query, key, and value, and so could not understand that "red" modifies "apple", not "fruit".
Relative or Absolute Position Encodings?
Attention needs information about the position of each token, before projecting each to a query, key, and value vector. We can accomplish this either with absolute or relative information.
1. Absolute encodings "label" each token 1, 2, 3, … in order.
2. Relative encodings, given tokens a and b, indicate pairwise positions, e.g. that b is x tokens after a.
To implement absolute encodings, we could simply append an extra dimension holding the token's position number to each embedding. In theory, the model could learn to use this optimally.
Why don’t we?
We want translation invariance: with this method, shifting every position changes every attention dot product. Absolute positions also prevent extrapolation past the training length.
Also, in practice, the model needs relative positions, and computing them itself is onerous and inefficient.
And Sinusoidal Position Embeddings work better.
I: Absolute Sinusoidal Position Embeddings
Sinusoidal position embeddings were introduced in Attention is All You Need.
First, we construct d-dimensional tensors (where d is the model’s embedding dimension), which encode position relatively, then add these to the model embeddings.
For every pair of dimensions 2i and 2i + 1, dimension 2i encodes $\sin(a/f(i))$ and dimension 2i + 1 encodes $\cos(a/f(i))$, where a is the token's position and $f(i) = 10000^{2i/d}$ sets the wavelength over which the sine/cosine waves repeat. So the first two dimensions of the embedding vectors are the highest-frequency (shortest-wavelength) sine/cosine waves along the sequence dimension, the next two dimensions are waves along the same sequence dimension but with lower frequency (longer wavelength), and so on.
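A minimal numpy sketch of this construction (function name and shapes are my own):

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d: int) -> np.ndarray:
    """Build the (seq_len, d) sinusoidal position-embedding table."""
    positions = np.arange(seq_len)[:, None]   # a = 0 .. seq_len-1, as a column
    i = np.arange(d // 2)[None, :]            # pair index i for dims (2i, 2i+1)
    inv_f = 1.0 / (10000 ** (2 * i / d))      # 1/f(i): frequency of each pair
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(positions * inv_f)   # even dims: sin(a / f(i))
    pe[:, 1::2] = np.cos(positions * inv_f)   # odd dims:  cos(a / f(i))
    return pe

pe = sinusoidal_pe(128, 64)
```

At position 0 the table is all zeros in the sine dimensions and all ones in the cosine dimensions, and the first dimension pair oscillates fastest along the sequence.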
This thus encodes position relatively by the identity:
$ \sin(a)\sin(b) + \cos(a)\cos(b) = \cos(a - b) $
For two positions a and b, applying $\cos^{-1}$ lets the transformer recover the relative distance $(a - b) + 2\pi x f$, where x is an unknown integer and f is that dimension pair's frequency scale.
So at each position, the model can now tell the relative offset within that frequency band.
This assigns a different frequency to every pair of dimensions, starting at 1 radian per token and decreasing geometrically down to $1/10000$.
When we compute $QK^T$ during attention, the model can learn to surface the pairwise $\sin(a)\sin(b) + \cos(a)\cos(b)$ terms in the $QK^T$ matrix, and so produce the delta $a - b$: the relative position of token to token within each frequency band.
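To see this concretely, here is a small numpy check (helper name is mine) that the dot product of two sinusoidal position vectors depends only on the offset $a - b$, not on the absolute positions:

```python
import numpy as np

def pe(pos, d=64):
    """Sinusoidal position vector for a single position."""
    i = np.arange(d // 2)
    inv_f = 1.0 / (10000 ** (2 * i / d))
    ang = pos * inv_f
    out = np.empty(d)
    out[0::2] = np.sin(ang)
    out[1::2] = np.cos(ang)
    return out

# By sin(a)sin(b) + cos(a)cos(b) = cos(a - b), the dot product collapses
# to sum_i cos((a - b) / f(i)): a function of the offset alone.
d1 = pe(10) @ pe(7)      # offset 3, near the start of the sequence
d2 = pe(110) @ pe(107)   # offset 3, a hundred positions later
assert abs(d1 - d2) < 1e-6
```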
II: Learnable Position Encodings
We can also allow the model to learn how best to encode position by explicitly adding a learned vector to each position's embedding. This worked better than sinusoidal position embeddings, but not as well as the classic relative position encodings.
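A learned position table is just a trainable matrix added to the token embeddings. A sketch in numpy (names, initialization scale, and max length are illustrative; a real implementation would train `pos_table` by backprop):

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d = 512, 64

# One trainable d-vector per absolute position (here: random init stand-in).
pos_table = rng.normal(scale=0.02, size=(max_len, d))

def add_positions(token_embs: np.ndarray) -> np.ndarray:
    """token_embs: (seq_len, d). Adds the learned vector for each position."""
    seq_len = token_embs.shape[0]
    return token_embs + pos_table[:seq_len]

x = rng.normal(size=(10, d))
y = add_positions(x)
```

Note the built-in limitation: `pos_table` has exactly `max_len` rows, so the model simply has no embedding for positions beyond the training length.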
III: Relative Position Encodings
In the limit of infinite model size and train time, absolute encodings can work (from absolute position information, any relative relations can be inferred by this infinite model’s transformation function).
But practically, and especially in reasoning about language, relative position is precisely what matters. "Red" being next to "apple" and 3 positions from "fruit" is useful information regardless of whether "the red apple is a fruit" starts at index 0 or 10000. A competent language network could infer that information from absolute encodings, but we can give it for free with relative embeddings.
In deep learning, we may know, through theoretical or experimental validation, that a prior (a restriction on the universality of the architecture) is near-guaranteed to be useful. Inductive biases restrict the output space of the model, so as long as we know that the remaining space still contains our optimal-complexity model, that restriction is a good one.
We are constraining the model to a space in which we can guarantee the optimal solution lies.
So instead of absolute position embeddings, we use relative embeddings: we tell the model how tokens relate to each other.
IV: Rotary Position Embeddings (RoPE)
Rotary Position Embeddings help explicitly encode relative position.
The RoPE paper welcomes you: 5 Chinese scientists with English-name emails, and 1 non-Chinese scientist, the only one with a Chinese email. The legend does it for the love of the game.
SinCos Position Embeddings worked, but the model still needs to learn how to use/interpret the position embeddings through the weights. RoPE ensures that the $QK^T$ matrix natively contains token position information.
In RoPE, instead of adding a d-dimensional embedding encoding relative-difference frequencies, we rotate each 2-d chunk of the vectors.
After projecting to $Q$ and $K$, we inject RoPE by multiplying each length-2 chunk at pair index $i$ of each vector by a rotation matrix:
$ R(mf_i)=\begin{bmatrix}\cos(mf_i)& \sin(mf_i)\\ -\sin(mf_i)&\cos(mf_i)\end{bmatrix} $
where $f_i = 10000^{-2i/d}$ and $m$ is the token's position in the sequence.
This RoPE rotation matrix has the property $R(m)^\top R(n) = R(n - m)$.
So computing $QK^T$ yields the relative offset explicitly: for positions $m$ and $n$, the combined rotation angle within each frequency band is $(m - n) f_i$.
RoPE thus gives the relative offset explicitly in each attention score: it tells the model the relative position between every pair of tokens. So the model explicitly knows that "red" attends to "apple" if it is next to "apple", or to "fruit" if next to "fruit".
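A small numpy sketch of RoPE (the chunk loop is written for clarity, not speed; names are mine) that also checks the relative-offset property, i.e. that the rotated $q \cdot k$ score depends only on $m - n$:

```python
import numpy as np

def rot(theta):
    """The 2x2 RoPE rotation matrix R(theta)."""
    return np.array([[np.cos(theta),  np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])

def apply_rope(x, pos, d):
    """Rotate each 2-d chunk of x by pos * f_i, with f_i = 10000**(-2i/d)."""
    out = x.copy()
    for i in range(d // 2):
        f = 10000 ** (-2 * i / d)
        out[2*i:2*i+2] = rot(pos * f) @ x[2*i:2*i+2]
    return out

d = 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

# R(m f)^T R(n f) = R((n - m) f), so the score depends only on the offset.
s1 = apply_rope(q, 3, d) @ apply_rope(k, 7, d)       # offset 4
s2 = apply_rope(q, 103, d) @ apply_rope(k, 107, d)   # offset 4, shifted by 100
assert abs(s1 - s2) < 1e-9
```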
But RoPE only works within the sequence lengths the model is trained on, because the frequencies it trains on may be too small for a long sequence (it can't resolve relative positions in the 1000s if it only trained on sequence lengths in the 100s).
Some tried to solve this using Position Interpolation: naively compress every position encoding by the ratio of the new to the trained context length (positions become $m/s$, where s is that scaling factor). In practice, this worked less well than YaRN (next section). The YaRN authors theorize this is because Position Interpolation ruins the high-frequency (short-range) computation of relative token positions.
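Position Interpolation in miniature (the lengths are illustrative): every position is divided by the scaling factor s, so the longest new position lands back inside the trained range.

```python
# Position Interpolation: squeeze new positions back into the trained range.
# If the model was trained on length L_train and we want L_new > L_train,
# scale every position by s = L_new / L_train before applying RoPE.
L_train, L_new = 2048, 8192
s = L_new / L_train   # = 4.0

def interpolated_position(m: int) -> float:
    # position 8191 maps to 2047.75, inside the trained range,
    # but all offsets shrink by 4x, hurting short-range resolution
    return m / s

assert interpolated_position(L_new - 1) < L_train
```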
V: Yet another RoPE extensioN (YaRN)
YaRN reportedly requires 10× fewer tokens and 2.5× less compute to generalize to the same sequence lengths, compared to RoPE.
It was shown via Neural Tangent Kernel theory that neural nets struggle to learn high-frequency functions unless those frequencies are explicitly encoded in the input embeddings.
So in the NTK-aware encoding, the short-range (high-frequency) dimensions of RoPE remain the same, and the long-range dimensions are stretched. But during autoregressive inference the sequence length s changes, so the encodings of new tokens end up under-compressed, and in practice the medium-range dimensions are often stretched too widely.
YaRN adds to this using:
1. “NTK-by-parts” interpolation
Low frequencies are scaled more, medium frequencies are in between, and high frequencies are scaled not at all.
The new YaRN frequency is $ f'_i = \begin{cases} f_i & i \leq i_{low}\\ \alpha f_i & i \geq i_{high}\\ \text{smooth interpolation} & \text{otherwise} \end{cases} $
where $\alpha < 1$
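A sketch of this piecewise frequency schedule, assuming a simple linear ramp between the two cutoffs (the real YaRN ramp is defined over wavelengths, and `i_low`/`i_high` here are illustrative cutoff indices):

```python
import numpy as np

def ntk_by_parts_freqs(d, scale, i_low, i_high, base=10000.0):
    """NTK-by-parts sketch: keep high-frequency dims (small i) unchanged,
    scale low-frequency dims (large i) down by 1/scale, and blend between."""
    i = np.arange(d // 2)
    f = base ** (-2 * i / d)                 # original RoPE frequencies
    ramp = np.clip((i - i_low) / (i_high - i_low), 0.0, 1.0)
    # ramp = 0 -> unchanged; ramp = 1 -> fully interpolated (f / scale)
    return f * (1 - ramp) + (f / scale) * ramp

f_new = ntk_by_parts_freqs(d=64, scale=4.0, i_low=4, i_high=24)
```

Here `1/scale` plays the role of $\alpha < 1$: the lowest frequencies are divided by the context-extension factor, while the highest are untouched.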
2. Attention scaling
They found that scaling attention by a temperature $t$ helped training, which they explain as assisting long contexts. Attention now becomes $ \text{softmax}\!\left(\frac{q_m^T k_n}{t \sqrt{|D|}}\right) $
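A minimal sketch of the scaled logits (names are mine; $t = 1$ recovers standard scaled dot-product attention):

```python
import numpy as np

def yarn_logits(q, k, t):
    """Attention logits with YaRN's extra temperature t in the denominator."""
    d = q.shape[-1]
    return (q @ k.T) / (t * np.sqrt(d))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 64))
k = rng.normal(size=(4, 64))
logits = yarn_logits(q, k, t=1.25)   # t > 1 softens attention for long context
```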
3. Dynamic scaling
Always use the current sequence length $s$, instead of the training length, for stretching. This means pre-RoPE vectors must be stored in the KV cache, since RoPE now depends on the current context length (trading off memory for accuracy).
Learnings
Include priors up to a point. Relative position encodings are a strong prior that works because we know exactly what information is useful: relative positions in the $QK^T$ matrix. Though the model could technically learn everything on its own.
Heuristics like "let the model learn everything; don't encode anything and just keep it universal" or "put in as many priors as possible" (no one says that) are wrong. The balance is so complex that it ends up being settled experimentally (and always justified after the fact).
Mr pessimist
The YaRN paper taught me that if your method isn't that novel, just string together multiple experimental adjustments into a “method”.
Also add good in-depth explanations of methods you built on.
Mr optimist
Experimental results are ultimately what matter.
Getting good experimental results is about generating good ideas, speed of execution, and perseverance.
Then after the fact, identify what actually made the biggest difference via ablations.