Logit Lens & Tuned Lens

Lens methods decode intermediate residual streams as vocabulary predictions. Layer depth is not time.

A transformer produces vocabulary logits by multiplying its final residual stream by the unembedding matrix $W_U$, usually after a layer norm. The logit lens applies that same readout to a residual stream from an earlier layer instead of the last one. The question it answers is concrete: if the model were forced to commit to a next token using the state at layer $\ell$, which tokens would that state favor?
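The readout described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not any particular model's code: states are treated as row vectors, `W_U` is assumed to have shape `(d_model, vocab)`, the weights are random stand-ins, and the names `layer_norm`, `logit_lens`, and `h_mid` are hypothetical.

```python
import numpy as np

def layer_norm(h, gain, bias, eps=1e-5):
    """Final layer norm, applied over the hidden dimension."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps) * gain + bias

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens(h_layer, gain, bias, W_U):
    """Decode a residual state with the model's own final readout."""
    return softmax(layer_norm(h_layer, gain, bias) @ W_U)

# Toy stand-ins (assumed sizes and random weights, not a real model).
rng = np.random.default_rng(0)
d_model, vocab = 16, 50
W_U = rng.normal(size=(d_model, vocab))
gain, bias = np.ones(d_model), np.zeros(d_model)
h_mid = rng.normal(size=d_model)  # pretend this is the residual state at layer l

probs = logit_lens(h_mid, gain, bias, W_U)
top5 = np.argsort(probs)[::-1][:5]  # token ids this state favors most
```

The only difference from the model's ordinary output path is which state is passed in: the final residual stream gives the real prediction, and any earlier one gives a lens readout.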

Running the lens at every layer gives a trajectory of latent predictions. Early layers often decode to high-frequency or locally plausible tokens. Middle layers may surface partial hypotheses. Late layers usually sharpen into the token the model finally emits. The trajectory is a readout of intermediate state, not a record of the model deliberating.

Figure 1 · Layerwise vocabulary readouts (interactive; controls: layer slider, tuned-lens toggle)

1. What the logit lens reads

The logit lens adds no trained parameters. It reuses the model's own final readout as a probe at each depth, which makes it cheap and keeps it in the model's coordinate system rather than a probe's. Move the layer slider in Figure 1 to watch the decoded distribution change. If a token is already on top at layer 8 and stays on top through layer 12, that token was linearly readable well before the output. If the top tokens churn from layer to layer, the decoded state is still moving.
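Both observations, a token that stays on top and top tokens that churn, can be reduced to simple statistics over the per-layer top-1 tokens. The sketch below is illustrative only: the trajectory is made up, and the helper names `first_stable_layer` and `churn` are hypothetical.

```python
def first_stable_layer(top_tokens):
    """Earliest layer index from which the top-1 token never changes again."""
    final = top_tokens[-1]
    layer = len(top_tokens) - 1
    while layer > 0 and top_tokens[layer - 1] == final:
        layer -= 1
    return layer

def churn(top_tokens):
    """Number of layer-to-layer changes in the top-1 token."""
    return sum(a != b for a, b in zip(top_tokens, top_tokens[1:]))

# Made-up top-1 tokens decoded at layers 0..12 for one position.
top_by_layer = ["the", "the", "of", "a", "city", "Rome", "city", "Rome",
                "Paris", "Paris", "Paris", "Paris", "Paris"]

stable_from = first_stable_layer(top_by_layer)  # 8
changes = churn(top_by_layer)                   # 7
```

On this trajectory the final token takes over at layer 8 and holds through layer 12, after seven earlier changes of the top token.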

A lens can surface information that is present internally but does not reach the output. A layer may rank a token highly even though a later component routes around it or suppresses it; this is the gap discussed under Represented vs. Expressed Knowledge. The lens reports decodability under a fixed readout. It does not establish that the model uses the decoded token.

2. Why early layers decode poorly

The unembedding was trained against final-layer states. Earlier residual streams occupy a related but shifted region of the same vector space: their typical norms differ, they lack the contributions of components that have not yet written, and in some models later layers rotate the basis. Feeding those states straight into $W_U$ decodes them as if they were finished, which is why a raw logit lens can look noisy or systematically wrong in early and middle layers and can fail almost entirely on some model families.

The failure is a coordinate mismatch rather than an absence of information. The state at layer $\ell$ may carry a usable prediction that the final readout cannot extract because it expects the final coordinate system. Correcting that mismatch is what the tuned lens does.
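A constructed toy makes the distinction between mismatch and missing information concrete. In the sketch below the "early" state is, by assumption, an exact rotated and shifted copy of the final state, so nothing is lost; the raw readout still misreads it, and undoing the known transform recovers the final distribution. Layer norm is omitted for brevity, and all weights and names (`R`, `offset`, `h_early`) are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    """KL(p || q) for two strictly positive distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(1)
d, vocab = 8, 20
W_U = rng.normal(size=(d, vocab)) / np.sqrt(d)

h_final = rng.normal(size=d)
R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # orthogonal change of basis
offset = rng.normal(size=d)
h_early = h_final @ R + offset                # same content, other coordinates

p_final = softmax(h_final @ W_U)
p_raw = softmax(h_early @ W_U)                # raw lens: decode as if finished
p_fixed = softmax(((h_early - offset) @ R.T) @ W_U)  # undo the known transform

kl_raw = kl(p_final, p_raw)      # clearly above zero
kl_fixed = kl(p_final, p_fixed)  # zero up to floating-point error
```

In a real model the transform is not known in advance and need not be exactly affine, which is why it has to be estimated from data.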

3. The tuned lens translator

The tuned lens trains one affine map per layer, $h_\ell \mapsto A_\ell h_\ell + b_\ell$, applied before the shared unembedding. The map is fit so that the decoded distribution at layer $\ell$ matches the model's final distribution, by minimizing the KL divergence of the lens distribution from the final distribution (the final distribution is the reference) over a training corpus. Each layer gets its own translator into final-layer coordinates, while $W_U$ stays fixed.
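Written out, the per-layer objective looks roughly as follows. The notation is an assumption made here for illustration: $p_{\text{final}}$ is the model's output distribution, $q_\ell$ is the tuned-lens distribution at layer $\ell$, $\mathrm{LN}$ is the final layer norm, and states are treated as row vectors. The direction of the divergence follows Belrose et al. (2023), with the final distribution as the reference.

$$
q_\ell(\cdot \mid x) = \mathrm{softmax}\big(\mathrm{LN}(A_\ell h_\ell(x) + b_\ell)\, W_U\big),
\qquad
\mathcal{L}_\ell(A_\ell, b_\ell) = \mathbb{E}_{x}\Big[ D_{\mathrm{KL}}\big(p_{\text{final}}(\cdot \mid x) \,\big\|\, q_\ell(\cdot \mid x)\big) \Big]
$$

Setting $A_\ell = I$ and $b_\ell = 0$ recovers the raw logit lens, so the raw lens is the special case in which no translation is learned.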

The translator is small and the target is the model itself. Because the objective is KL to the model's own final distribution rather than to ground-truth tokens, the tuned lens measures what the layer already implies about the model's eventual prediction, not what a fresh classifier could learn. An affine map cannot manufacture a prediction that the layer does not contain; it can only realign one that the raw lens misreads.
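A toy fit shows the shape of the training loop. This is a sketch under strong assumptions rather than the published training recipe: the early states are synthetic (a rotated, shifted copy of the final states), layer norm is omitted, plain full-batch gradient descent stands in for whatever optimizer a real run would use, and the translator starts at the identity so that step zero is the raw logit lens. All names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(p, q):
    """Mean over examples of KL(p || q)."""
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(0)
d, vocab, n = 8, 20, 256
W_U = rng.normal(size=(d, vocab)) / np.sqrt(d)  # frozen unembedding

# Synthetic data: early states are a rotated, shifted copy of the final states.
H_final = rng.normal(size=(n, d))
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
H_early = H_final @ R + 0.5 * rng.normal(size=d)
P_final = softmax(H_final @ W_U)                # target: the model's own output

A, b = np.eye(d), np.zeros(d)                   # identity start = raw logit lens
kl_raw = mean_kl(P_final, softmax((H_early @ A + b) @ W_U))

lr = 0.01
for _ in range(2000):
    Q_lens = softmax((H_early @ A + b) @ W_U)
    # Gradient of the mean KL w.r.t. the translated states (W_U stays fixed).
    dZ = (Q_lens - P_final) @ W_U.T / n
    A -= lr * H_early.T @ dZ
    b -= lr * dZ.sum(axis=0)

kl_tuned = mean_kl(P_final, softmax((H_early @ A + b) @ W_U))
```

Comparing `kl_tuned` with `kl_raw` measures how much of the raw lens error the affine translator removes on this toy data. Only `A` and `b` are updated; the target distribution comes from the model, not from ground-truth tokens.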

Toggling the tuned-lens control in Figure 1 pulls the early-layer readouts toward the final answer and reduces their noise. Tuned-lens predictions are typically better calibrated than raw logit-lens predictions and track the final distribution more smoothly across depth. The cost is the per-layer fit, and the caveat is that a trained translator can paper over a genuine coordinate change that the raw lens would have exposed.

4. Reading a lens trajectory

A lens marks the depth at which a prediction becomes linearly decodable. That is useful for locating where information appears and for comparing layers, but it invites two misreadings. The first treats the trajectory as a timeline of thought; the second treats a high lens score as proof the model acts on the token.

Layer depth is not cognitive time. A lens trajectory is not a movie of the model thinking. Each layer exposes a different intermediate state in one feed-forward pass. The trajectory can show when information becomes decodable, but it is not an event sequence in the human sense.

The disciplined use pairs a lens with a causal test. The lens proposes that a token is readable at some layer; a causal intervention on that layer's activations checks whether the readable state changes behavior. Decodability locates a candidate; intervention tests whether it matters. The lenses decode the residual stream, so a lens result is also a claim about which directions in that stream the unembedding can read.
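The pairing can be sketched with a toy stand-in for the layers after the probed one. The lens names a token; the intervention, here a projection that removes the unembedding direction for that token from the state, tests whether the output depends on it. Directional ablation is only one of several possible interventions, and `later_layers`, the random weights, and the sizes are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
d, vocab = 8, 20
W_U = rng.normal(size=(d, vocab)) / np.sqrt(d)
W_in = rng.normal(size=(d, 4 * d)) / np.sqrt(d)
W_out = rng.normal(size=(4 * d, d)) / np.sqrt(4 * d)

def later_layers(h):
    """Hypothetical stand-in for everything after the probed layer."""
    return h + np.tanh(h @ W_in) @ W_out

def output_probs(h):
    return softmax(later_layers(h) @ W_U)

h_mid = rng.normal(size=d)
token = int(np.argmax(h_mid @ W_U))  # step 1: what the lens reads at this layer

# Step 2: remove the direction the unembedding reads for that token.
u = W_U[:, token] / np.linalg.norm(W_U[:, token])
h_ablated = h_mid - (h_mid @ u) * u

# Step 3: compare the output probability with and without the direction.
effect = output_probs(h_mid)[token] - output_probs(h_ablated)[token]
```

A large `effect` suggests that the readable direction feeds the output; a value near zero suggests that the state was decodable but not relied on. Which of the two holds is an empirical question that the lens alone cannot settle.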

Citations

nostalgebraist (2020), "Interpreting GPT: The Logit Lens", for applying the final unembedding to intermediate residual streams.
Belrose, Furman, Smith, Halawi, Ostrovsky, McKinney, Biderman, and Steinhardt (2023), "Eliciting Latent Predictions from Transformers with the Tuned Lens", for the per-layer affine translators, the KL objective, and calibration results.

Related pages

Residual Stream & Directions for the hidden states that lenses decode.
KL Divergence for the objective the tuned lens minimizes.

What next

Represented vs. Expressed Knowledge for cases where internal readouts and output behavior disagree.
Causal Interventions for activation changes that test lens hypotheses.