Induction Heads
An induction head implements one rule. When a sequence contains the pair
[A][B] and the token [A] appears again later, the head
predicts [B] as the continuation. It does this by attending from the
second A back to the position just after the first A,
then writing a vector that promotes whatever token sat there. The result is
prompt-local copying: the model continues a repeated pattern using evidence from
the prompt rather than from its weights.
The rule is narrow, but it is one of the few transformer behaviors with a mechanism that has been traced end to end. It also marks a transition in what a model can do. A network that notices "this token followed that token earlier" has the smallest useful form of in-context learning.
1. The copy rule
Take the toy sequence A B x y A ?. The head sits at the final
position, where the current token is the repeated A. It works in three
steps: match the current A against the earlier A, shift attention by one
position to the B that followed it, and copy B into the prediction. The
earlier B is the answer because it is what came next the last time the
prefix appeared.
The offset is the part worth pausing on. The head does not attend to the matching
A; it attends to the token after it. A pattern that pointed
at the match itself would predict another A, which is useless for
continuation. The one-position shift is what turns a repeat detector into a copy
mechanism.
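The rule can be written as a few lines of ordinary code. The sketch below is a hard, symbolic version for illustration only: `induction_predict` is a hypothetical helper, and it assumes exact token matches and picks the most recent earlier occurrence, whereas a real head does this softly through attention weights.

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Apply the [A][B] ... [A] -> [B] rule to the last token of `tokens`.

    Hypothetical helper: returns the token that followed the most recent
    earlier occurrence of the current token, or None if there is no repeat.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from most recent to oldest for a match.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # The one-position shift: return what came after the match,
            # not the match itself.
            return tokens[i + 1]
    return None


print(induction_predict(["A", "B", "x", "y", "A"]))  # B
print(induction_predict(["A", "B", "x", "y", "z"]))  # None
```

Returning `tokens[i]` instead of `tokens[i + 1]` would give the repeat detector described above, which can only predict another A.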
2. Two heads in composition
A single attention head cannot both find the earlier A and read the
token after it, because its keys describe individual positions, not their
neighbors. The mechanism needs two heads in different layers, and the earlier one
sets up the later one.
| Layer | Head | What it does |
|---|---|---|
| $\ell$ | previous-token head | At each position $t$, attend to $t-1$ and write the identity of $x_{t-1}$ into the stream. Position $t$ now carries a tag, "the token before me was $x_{t-1}$". |
| $\ell+1$ | induction head | At the repeated $A$, form a query from $A$ and match it against those tags. The position holding $B$ tags itself "previous token was $A$", so the query lands there. OV copies $B$ toward the output. |
This is QK-composition, the handoff described on the
QK and OV circuits page. The previous-token
head's write becomes part of the induction head's key, so the second head attends
by a relationship the first head computed rather than by raw token content. The
copy itself is the induction head's OV path: it writes in a direction that the
unembedding reads as B. Routing and writing stay separable, which is
why the same head can copy correctly or, in a copy-suppression variant, push a
repeated token down instead.
The [A][B] case is the cleanest instance, not the only one.
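The two-head handoff can be sketched with hand-set weights in NumPy. Everything here is an assumption made for illustration rather than something read off a trained model: the one-hot embeddings, the hard previous-token attention, the separate "previous token" subspace, and the `scale` that sharpens the softmax.

```python
import numpy as np

vocab = ["A", "B", "x", "y"]
ids = np.array([0, 1, 2, 3, 0])           # the toy sequence A B x y A
T, V = len(ids), len(vocab)

# One-hot token identities, shape (T, V). Assumed embedding for the sketch.
tok = np.eye(V)[ids]

# Layer l, previous-token head: position t attends to t-1 and writes the
# identity of x_{t-1} into its own subspace. Row 0 has nothing to attend to.
prev_attn = np.eye(T, k=-1)               # prev_attn[t, t-1] = 1
prev_tag = prev_attn @ tok                # tag: "the token before me was ..."

# Layer l+1, induction head. QK: the query is the current token, the key is
# the previous-token tag, so the score is high where the tag equals the query.
scale = 10.0                              # assumed sharpness of the match
scores = scale * (tok @ prev_tag.T)       # shape (T, T)
causal = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(causal, scores, -np.inf)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

# OV: copy the attended position's token identity toward the output.
logits = attn @ tok

attended = int(attn[-1].argmax())
predicted = vocab[int(logits[-1].argmax())]
print(attended, predicted)                # 1 B
```

The query at the final A never compares itself with the earlier A directly. It matches the tag stored at position 1, which is where B sits, so the one-position offset comes from the first head's write rather than from the second head's attention rule.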
3. The induction bump
Induction heads tend to appear suddenly during training rather than fading in. Across a short window of training, the model's loss drops faster than the surrounding trend, and a separate measurement of in-context ability rises at the same time. The in-context score used here is the loss on an early token minus the loss on a late token in the same sequence, averaged over sequences. A model that uses the prompt predicts later tokens better, so the score grows once copying is available.
Follow the two curves through training. Before the transition window the in-context score is flat: the model has no mechanism for using a repeated prefix. Inside the window the score climbs sharply while the loss curve dips below its own trend, the small acceleration that gives the phenomenon its name. After the window the score sits at a higher plateau, and copying is part of the model's behavior. The two measurements move together because the same circuit drives both.
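A minimal sketch of the score computation follows. It assumes per-token losses are already available as an array, and it takes the gap as early-token loss minus late-token loss so that a larger value means better use of context. That sign convention, the function name `in_context_score`, the chosen positions, and the synthetic numbers are all assumptions for illustration, not measurements.

```python
import numpy as np


def in_context_score(token_losses, early, late):
    """Mean loss at position `early` minus mean loss at position `late`.

    token_losses: array of shape (n_sequences, seq_len), per-token loss.
    Positive values mean later tokens are predicted better than early ones.
    """
    losses = np.asarray(token_losses, dtype=float)
    return float(np.mean(losses[:, early] - losses[:, late]))


# Made-up losses for two checkpoints, 4 sequences of 8 tokens each.
before = np.full((4, 8), 3.0)                       # no benefit from context
after = np.tile(np.linspace(3.0, 2.0, 8), (4, 1))   # loss falls along the sequence

score_before = in_context_score(before, early=1, late=7)
score_after = in_context_score(after, early=1, late=7)
print(score_before, score_after)
```

Computing this score at each saved checkpoint and plotting it next to the training loss is one way to look for the window in which both curves move together.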
4. What induction does and does not cover
Induction is a real lower bound on in-context learning, not an account of all of it. The mechanism is symbolic and local: when $A$ recurs, continue with what followed $A$ before. That explains exact and near-exact repetition, and it plausibly seeds richer abilities, since later heads can read the same previous-token tags for other purposes.
It does not explain in-context learning that has no surface repeat to match, such as inferring a format from a few labeled examples or following an instruction stated once. Those behaviors involve representations that the prefix-match rule does not name. The value of the induction story is its specificity: it shows that at least one in-context behavior is implemented by an identifiable circuit, which makes it a model for what a mechanistic account of a harder behavior would need.
- Olsson, Elhage, Nanda, Joseph, DasSarma, Henighan, Mann, and coauthors (2022), "In-context Learning and Induction Heads", for the induction-head account, the in-context score, and the formation bump.
- Elhage, Nanda, Olsson, and coauthors (2021), "A Mathematical Framework for Transformer Circuits", for the two-layer attention-only setting and QK-composition.
- QK and OV Circuits for the composition handoff the two heads use.
- Attention for the softmax read operation behind the prefix match.