Residual Stream & Directions

The residual stream is a shared vector workspace. A feature can often be approximated by a direction in that space.

The residual-stream framing treats a transformer as a sequence of components that read from and write to a shared vector stream.¹ Each token position carries a vector forward through the stack. Attention heads and MLPs read from that vector, compute updates, and write new vectors back into the stream. Many interpretability analyses focus on this stream because different computations coexist in the same vector space.

The phrase "a feature is a direction" is a common approximation. If a model has a direction for "is a city", then projecting the residual vector onto that direction estimates how strongly the current token or context carries that feature.² The approximation is incomplete: directions can be shared, curved, split across subspaces, or active only in the right context. The direction view is still a useful local approximation because attention heads, MLPs, probes, lenses, and steering methods all read from or write to vectors.³
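The projection described above can be sketched in a few lines. This is a minimal illustration on synthetic vectors, not a readout from a real model: `city_direction`, `d_model`, and the planted value 0.30 are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical residual-stream width

# Hypothetical feature direction, normalized to unit length.
city_direction = rng.normal(size=d_model)
city_direction /= np.linalg.norm(city_direction)

def feature_activity(residual, direction):
    """Scalar projection of a residual vector onto a unit direction."""
    return float(residual @ direction)

# Synthetic residual: background content with no component along the
# direction, plus a known amount of the feature.
background = rng.normal(size=d_model)
background -= (background @ city_direction) * city_direction
residual = background + 0.30 * city_direction

activity = feature_activity(residual, city_direction)
```

Because the background was made orthogonal to the direction, the dot product recovers the planted amount. In a real model other features overlap with the direction, so the projection is an estimate rather than an exact readout.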

Figure 1 · A residual stream accumulating writes. [Interactive figure: feature activity shown at a selected layer.]

1. The shared workspace

In the transformer-circuits account, embeddings, attention heads, MLPs, and the unembedding are analyzed as reads from or writes to the residual stream. Token identity, position, local syntax, long-range references, partial semantic hypotheses, and output-relevant information all share that vector space. Some components write small corrections. Others write features that downstream components can read many layers later.

Embedding write: Before the first block, the token embedding (plus a positional embedding, in models that use additive position embeddings) sets the stream's starting value at each position. Every later component edits this vector rather than replacing it.
Attention write: A head reads from other positions (QK selects where to look, OV selects what to bring) and adds the result at the destination position. This is the main way information moves between positions.
MLP write: A position-wise MLP reads the current vector and adds a nonlinear function of it. MLPs compute many features, and factual associations appear to live largely in them (Meng et al., 2022).
Unembedding read: A final normalization and the unembedding matrix project the last vector onto the vocabulary to produce logits. The logit lens applies this same read at earlier layers to see a partial prediction.
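As a rough sketch of the attention write in the list above, the following toy head computes where to look from a QK score and what to bring from an OV map, then adds the result to the stream. The sizes, the random weights, and the helper name `attention_write` are assumptions for illustration; layer normalization, multiple heads, and biases are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pos, d_model, d_head = 5, 16, 4  # hypothetical sizes

x = rng.normal(size=(n_pos, d_model))  # residual stream, one row per position
W_Q = rng.normal(size=(d_model, d_head)) * 0.1
W_K = rng.normal(size=(d_model, d_head)) * 0.1
W_V = rng.normal(size=(d_model, d_head)) * 0.1
W_O = rng.normal(size=(d_head, d_model)) * 0.1

def attention_write(x):
    """Return (write, pattern) for one causal attention head."""
    # QK: score every (destination, source) pair to decide where to look.
    scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head)
    future = np.triu(np.ones((n_pos, n_pos), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # block attention to later positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    pattern = np.exp(scores)
    pattern /= pattern.sum(axis=-1, keepdims=True)
    # OV: decide what to bring from each source, mixed by the pattern.
    write = pattern @ (x @ W_V) @ W_O
    return write, pattern

write, pattern = attention_write(x)
x_after = x + write  # the head adds to the stream; it does not overwrite it
```

Each row of `pattern` is a distribution over source positions, and `x_after - write` recovers the original stream, which is the sense in which the head edits the vector rather than replacing it.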

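Each sublayer adds to the stream rather than overwriting it, so the writes accumulate: $x_{\ell+1} = x_\ell + \text{sublayer}(x_\ell)$. Unrolling the recurrence gives $x_L = x_0 + \sum_{\ell=0}^{L-1} \text{sublayer}(x_\ell)$, so reading the stream at any depth means reading the initial embedding plus a running sum of every write before it.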
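The update rule above can be checked numerically. The sketch below uses stand-in sublayers (small random nonlinear maps, not real attention heads or MLPs) and hypothetical sizes; it only illustrates that the final stream equals the starting value plus the sum of all writes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_sublayers = 32, 6  # hypothetical sizes

# Stand-in sublayers: one random weight matrix each.
weights = [rng.normal(size=(d_model, d_model)) * 0.05 for _ in range(n_sublayers)]

def sublayer(x, W):
    """Toy nonlinear function of the current stream."""
    return np.tanh(x @ W)

x0 = rng.normal(size=d_model)  # the embedding write sets the starting value
x = x0.copy()
writes = []
for W in weights:
    w = sublayer(x, W)  # read the current stream
    writes.append(w)
    x = x + w           # x_{l+1} = x_l + sublayer(x_l)

# The final stream is the starting value plus every write before it.
running_sum = x0 + np.sum(writes, axis=0)
```

Each write depends on everything written earlier, because the sublayer reads the accumulated vector, yet the stream itself stays a plain sum of the individual writes.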

2. Directions make features available to linear readouts

If a feature is linearly readable, a probe, lens, or unembedding vector can pull it out with a dot product. Linear probes therefore ask a question in the model's own terms: is some property present in a form that downstream linear reads could plausibly use? A successful probe does not prove that the model actually uses the property; that requires causal evidence. What a probe does is locate a candidate coordinate system.
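A minimal sketch of a linear probe, assuming synthetic activations with a planted direction rather than activations from a real model. The least-squares fit is a simple stand-in for the logistic regression often used in practice, and all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n, n_train = 32, 400, 300  # hypothetical sizes

# Synthetic "activations": a planted unit direction carries a binary label.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + 3.0 * np.outer(2 * labels - 1, direction)

def with_bias(a):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([a, np.ones((a.shape[0], 1))])

# Fit the probe on the training split by least squares against +/-1 targets.
targets = 2.0 * labels[:n_train] - 1.0
w, *_ = np.linalg.lstsq(with_bias(acts[:n_train]), targets, rcond=None)

# Held-out accuracy measures decodability, not causal use.
pred = (with_bias(acts[n_train:]) @ w > 0).astype(int)
accuracy = float((pred == labels[n_train:]).mean())

# Cosine between the probe weights (bias excluded) and the planted direction.
cosine = float(w[:-1] @ direction / np.linalg.norm(w[:-1]))
```

High held-out accuracy here shows only that the label is linearly decodable from the synthetic activations. Whether a model reads along the recovered direction is a separate question that needs an intervention.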

Linear does not mean simple. A linear direction can represent an abstract property if the model has already done the nonlinear work needed to place examples along that axis. The hard computation may live upstream; the final readout can still be linear.

3. Steering vectors and dose-response

A steering vector adds a scaled direction back into the residual stream and asks how the output changes as the scale (the dose) increases. A useful direction should produce a smooth, task-specific logit-margin curve that separates from a control band formed by matched random directions. The non-destructive range is the part of the curve where the target margin moves but an unrelated quality metric has not yet collapsed.
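A dose-response sweep with a random-direction control band can be sketched as follows. This is a toy under strong assumptions: the "model" is only a random unembedding matrix `W_U` applied to a random final residual vector, so the curve is exactly linear in the dose, and the unrelated quality metric is not modeled. Token ids, sizes, and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 50  # hypothetical sizes

W_U = rng.normal(size=(d_model, vocab))  # stand-in unembedding matrix
x = rng.normal(size=d_model)             # stand-in final residual vector
target, foil = 3, 7                      # hypothetical token ids

def logit_margin(resid):
    """Target logit minus foil logit under the linear readout."""
    logits = resid @ W_U
    return logits[target] - logits[foil]

def unit(v):
    return v / np.linalg.norm(v)

# Candidate steering direction and norm-matched random controls.
steer = unit(W_U[:, target] - W_U[:, foil])
controls = [unit(rng.normal(size=d_model)) for _ in range(100)]

doses = np.linspace(0.0, 4.0, 9)
steer_curve = np.array([logit_margin(x + a * steer) for a in doses])
control_curves = np.array(
    [[logit_margin(x + a * c) for a in doses] for c in controls]
)

# Control band: 5th to 95th percentile of the control margins at each dose.
band_lo, band_hi = np.percentile(control_curves, [5, 95], axis=0)
```

In a real model the intervention passes through later layers and a final normalization, so the curve typically bends or saturates. Finding the non-destructive range would also require tracking a separate quality metric at each dose.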

Figure 2 · Dose-response for a residual-stream direction. [Interactive figure: output response as a function of steering dose.]

4. Superposition

The residual stream has finite dimension, but the model may represent many more features than there are independent axes. If features are sparse, a model can pack many of them into the same space and tolerate some interference. Elhage et al. call this arrangement superposition: features are not always assigned one clean coordinate each. They can be arranged as overlapping directions whose collisions are manageable because only a few are active at once.⁴

For interpretation, this has two consequences. A neuron or coordinate can look polysemantic because it participates in several feature directions. Conversely, a feature can be real even when no single neuron lines up with it. As more directions are active together, off-axis interference grows.
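The growth of interference with the number of active features can be illustrated with random directions. This sketch is not the trained toy model from Elhage et al.; it simply assumes 512 random unit directions packed into 64 dimensions and measures how far a dot-product readout drifts from the true activation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 512  # more features than dimensions (hypothetical)

# Random unit directions are nearly, but not exactly, orthogonal.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

def mean_readout_error(n_active, trials=200):
    """Average |dot-product readout - true activation| over active features."""
    errors = []
    for _ in range(trials):
        active = rng.choice(n_features, size=n_active, replace=False)
        x = directions[active].sum(axis=0)  # each active feature has activation 1
        readout = directions[active] @ x    # linear read along each active direction
        errors.append(np.abs(readout - 1.0).mean())
    return float(np.mean(errors))

error_sparse = mean_readout_error(n_active=2)
error_dense = mean_readout_error(n_active=64)
```

With only two features active the readouts stay close to the true value. With 64 active at once, the overlaps between directions add up and the readout error becomes comparable to the signal.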

Citation notes

¹ Elhage, Nanda, Olsson, and coauthors (2021), "A Mathematical Framework for Transformer Circuits", introduces the residual-stream and QK/OV circuit framing used here.
² Park, Choe, and Veitch (2024), "The Linear Representation Hypothesis and the Geometry of Large Language Models", formalizes linear representations and relates them to probes and steering vectors.
³ Alain and Bengio (2018), "Understanding Intermediate Layers Using Linear Classifier Probes"; nostalgebraist (2020), "Interpreting GPT: The Logit Lens"; and Belrose, Furman, Smith, Halawi, Ostrovsky, McKinney, Biderman, and Steinhardt (2023), "Eliciting Latent Predictions from Transformers with the Tuned Lens", are examples of linear readouts from intermediate representations.
⁴ Elhage, Henighan, Olah, and coauthors (2022), "Toy Models of Superposition", develops the sparse-feature packing account of superposition.
