Induction Heads

An induction head uses a repeated prefix to copy the token that followed the earlier occurrence.

An induction head implements one rule. When a sequence contains the pair [A][B] and the token [A] appears again later, the head predicts [B] as the continuation. It does this by attending from the second A back to the position just after the first A, then writing a vector that promotes whatever token sat there. The result is prompt-local copying: the model continues a repeated pattern using evidence from the prompt rather than from its weights.

The rule is narrow, but it is one of the few transformer behaviors with a mechanism that has been traced end to end. It also marks a transition in what a model can do. A network that notices "this token followed that token earlier" has the smallest useful form of in-context learning.

Figure 1 · Prefix match then copy

1. The copy rule

Take the toy sequence A B x y A ?. The head sits at the final position, where the current token is the repeated A. Figure 1 walks through what it does in three steps: match the current A against the earlier A, shift attention by one position to the B that followed it, and copy B into the prediction. The earlier B is the answer because it is what came next the last time the prefix appeared.
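As a minimal sketch, the rule can be written as ordinary list code with no attention at all. The function name `induction_predict` is hypothetical, and the choice to use the most recent earlier match is an assumption made for the sketch; a real head spreads attention softly over candidates instead of picking one.

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Apply the [A][B] ... [A] -> [B] rule to the final token.

    Returns the token that followed the most recent earlier occurrence of
    the final token, or None when the final token has not appeared before.
    """
    current = tokens[-1]
    # Scan earlier positions from newest to oldest, skipping the final one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            # Shift by one: read what came after the match, not the match.
            return tokens[i + 1]
    return None


print(induction_predict(["A", "B", "x", "y", "A"]))  # B
print(induction_predict(["x", "y", "z"]))            # None
```

The function only states what the head computes. How a transformer gets there with attention is the subject of the next sections.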

The offset is the part worth pausing on. The head does not attend to the matching A; it attends to the token after it. A pattern that pointed at the match itself would predict another A, which is useless for continuation. The one-position shift is what turns a repeat detector into a copy mechanism.
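The offset can be checked on an attention pattern directly. The sketch below, assuming a square attention matrix whose row t holds the weights that position t places on each position, measures how much mass lands on the token after an earlier match. The helper name `induction_mass` and both hand-built patterns are hypothetical, chosen only to contrast the two cases.

```python
import numpy as np


def induction_mass(attn, tokens):
    """Mean attention mass on induction targets.

    For each position t whose token appeared earlier at some position j,
    the induction target is j + 1. Positions with no earlier match are
    skipped. Returns 0.0 if no position has a target.
    """
    masses = []
    for t, tok in enumerate(tokens):
        targets = [j + 1 for j in range(t) if tokens[j] == tok]
        if targets:
            masses.append(attn[t, targets].sum())
    return float(np.mean(masses)) if masses else 0.0


tokens = ["A", "B", "x", "y", "A"]

# Pattern 1: the final A attends to position 1, the B after the first A.
copy_pattern = np.eye(5)
copy_pattern[4] = 0.0
copy_pattern[4, 1] = 1.0

# Pattern 2: the final A attends to position 0, the matching A itself.
match_pattern = np.eye(5)
match_pattern[4] = 0.0
match_pattern[4, 0] = 1.0

print(induction_mass(copy_pattern, tokens))   # 1.0
print(induction_mass(match_pattern, tokens))  # 0.0
```

Only the shifted pattern scores as induction. The pattern that points at the match itself scores zero even though it found the repeat.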

2. Two heads in composition

A single attention head cannot both find the earlier A and read the token after it, because its keys describe individual positions, not their neighbors. The mechanism needs two heads in different layers, and the earlier one sets up the later one.

Layer $\ell$	previous-token head	At each position $t$, attend to $t-1$ and write the identity of $x_{t-1}$ into the stream. Position $t$ now carries a tag, "the token before me was $x_{t-1}$".
Layer $\ell+1$	induction head	At the repeated $A$, form a query from $A$ and match it against those tags. The position holding $B$ tags itself "previous token was $A$", so the query lands there. OV copies $B$ toward the output.

This is QK-composition, the handoff described on the QK and OV circuits page. More precisely it is the key-side case, which Elhage et al. (2021) call K-composition: the previous-token head's write becomes part of the induction head's key, so the second head attends by a relationship the first head computed rather than by raw token content. The copy itself is the induction head's OV path: it writes in a direction that the unembedding reads as B. Routing and writing stay separable, which is why the same attention pattern can drive a copy or, in a copy-suppression variant with a different OV path, push the attended token down instead.
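The two-head handoff can be sketched with hand-set weights. This is a toy under strong assumptions, not a trained model: tokens are one-hot vectors, the previous-token head's pattern is hard-coded, the "previous token" tag lives in its own subspace, and the function name `two_head_induction` and the `sharpness` scale are hypothetical.

```python
import numpy as np


def one_hot(ids, vocab_size):
    out = np.zeros((len(ids), vocab_size))
    out[np.arange(len(ids)), ids] = 1.0
    return out


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def two_head_induction(ids, vocab_size, sharpness=20.0):
    cur = one_hot(ids, vocab_size)  # "current token" subspace, shape [T, V]
    T = len(ids)

    # Layer l, previous-token head: position t attends to t-1 and writes the
    # identity of x_{t-1} as a tag. Position 0 has no predecessor (zero tag).
    prev_attn = np.eye(T, k=-1)     # row t has a 1 in column t-1
    prev_tag = prev_attn @ cur

    # Layer l+1, induction head: the query reads the current token, the key
    # reads the tag written by the earlier head.
    scores = sharpness * (cur @ prev_tag.T)
    causal = np.tril(np.ones((T, T), dtype=bool))
    attn = softmax(np.where(causal, scores, -np.inf))

    # OV path: copy the attended position's token identity toward the logits.
    logits = attn @ cur
    return attn, logits


vocab = ["A", "B", "x", "y"]
ids = [0, 1, 2, 3, 0]               # A B x y A
attn, logits = two_head_induction(ids, len(vocab))
print(int(attn[-1].argmax()))       # 1, the position holding B
print(vocab[int(logits[-1].argmax())])  # B
```

The attention scores and the copied content come from separate matrices here (`cur @ prev_tag.T` versus `attn @ cur`), which mirrors the split between routing and writing described above.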

Prefix match, not literal token match. The query and key live in learned coordinates, so the match tolerates near-repeats: a different casing, a synonym, or a token that plays the same role. Heads that generalize the rule this way are sometimes called fuzzy or semantic induction heads. The crisp [A][B] case is the cleanest instance, not the only one.
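A small sketch of why learned coordinates tolerate near-repeats. The 2-D embedding values below are hand-picked and hypothetical, chosen so that "Cat" sits close to "cat"; the function name `fuzzy_induction_attention` and the `sharpness` scale are likewise assumptions for illustration.

```python
import numpy as np

# Hypothetical embeddings: "Cat" is a near-duplicate of "cat".
embed = {
    "cat": np.array([1.0, 0.0]),
    "Cat": np.array([0.9, 0.1]),
    "sat": np.array([0.0, 1.0]),
    "the": np.array([-1.0, 0.0]),
    "dog": np.array([0.0, -1.0]),
}


def fuzzy_induction_attention(tokens, sharpness=10.0):
    """Attention from the final token over positions 1..T-1.

    The key at position s is the embedding of the token at s-1, standing in
    for the previous-token tag. weights[i] is the attention on position i+1.
    """
    query = embed[tokens[-1]]
    scores = np.array(
        [sharpness * (query @ embed[tokens[s - 1]]) for s in range(1, len(tokens))]
    )
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()


tokens = ["the", "cat", "sat", "dog", "Cat"]
weights = fuzzy_induction_attention(tokens)
target = int(weights.argmax()) + 1  # convert back to a sequence position
print(tokens[target])               # sat
```

"Cat" never appeared earlier as a literal token, yet the dot product with the tag for "cat" is still the largest score, so the head lands on the token that followed "cat".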

3. The induction bump

Induction heads tend to appear suddenly during training rather than fading in. Across a short window of training, the model's loss drops faster than the surrounding trend, and a separate measurement of in-context ability rises at the same time. The in-context score measures how much lower the loss on a late token is than the loss on an early token in the same sequence: a model that uses the prompt predicts later tokens better, so the gap grows once copying is available.
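A minimal sketch of the score, assuming per-token losses are already available as an array of shape [sequences, positions]. The function name `in_context_score`, the index arguments, and the loss values in the example are hypothetical. The sign is chosen so that a larger score means more in-context benefit, matching the description above; some write-ups use the opposite sign (late minus early), so that a more negative value is better.

```python
import numpy as np


def in_context_score(token_losses, early, late):
    """Early-minus-late loss gap, averaged over sequences.

    token_losses: array-like of shape [n_sequences, seq_len].
    early, late: 0-based token indices with early < late.
    """
    losses = np.asarray(token_losses, dtype=float)
    if losses.ndim != 2:
        raise ValueError("token_losses must be 2-D: [n_sequences, seq_len]")
    if not 0 <= early < late < losses.shape[1]:
        raise ValueError("need 0 <= early < late < seq_len")
    return float((losses[:, early] - losses[:, late]).mean())


# Made-up losses for two short sequences, for illustration only.
losses = [
    [4.0, 3.0, 2.0],
    [5.0, 3.0, 3.0],
]
print(in_context_score(losses, early=0, late=2))  # 2.0
```

Computing this at each saved checkpoint gives the curve whose rise marks the window in which induction heads form.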

Figure 2 · In-context score forms across a narrow window

Figure 2 tracks both measurements through training. Before the shaded window the in-context score is flat: the model has no mechanism for using a repeated prefix. Inside the window the score climbs sharply while the loss curve dips below its own trend, the small acceleration that gives the phenomenon its name. After the window the score sits at a higher plateau, and copying is part of the model's behavior. The two measurements move together because the same circuit drives both.

4. What induction does and does not cover

Induction is a real lower bound on in-context learning, not an account of all of it. The mechanism is symbolic and local: when $A$ recurs, continue with what followed $A$ before. That explains exact and near-exact repetition, and it plausibly seeds richer abilities, since later heads can read the same previous-token tags for other purposes.

It does not explain in-context learning that has no surface repeat to match, such as inferring a format from a few labeled examples or following an instruction stated once. Those behaviors involve representations that the prefix-match rule does not describe. The value of the induction story is its specificity: it shows that at least one in-context behavior is implemented by an identifiable circuit, which makes it a template for what a mechanistic account of a harder behavior would need.

Induction heads have a specified copy circuit. The prefix match and the copy are each tied to a concrete component. Most attention-head labels name a looser family of patterns whose mechanism has not been pinned down to the same degree.

Citations

Olsson, Elhage, Nanda, Joseph, DasSarma, Henighan, Mann, and coauthors (2022), "In-context Learning and Induction Heads", for the induction-head account, the in-context score, and the formation bump.
Elhage, Nanda, Olsson, and coauthors (2021), "A Mathematical Framework for Transformer Circuits", for the two-layer attention-only setting and QK-composition.

Related pages

QK and OV Circuits for the composition handoff the two heads use.
Attention for the softmax read operation behind the prefix match.
