Represented vs. Expressed Knowledge
Most measurements of a language model are taken at the output: token probabilities, loss, accuracy, calibration, refusal rate, generated text. Mechanistic interpretability adds a second vantage point and reads what is present inside the model before the output is formed. The two views can agree or come apart, and the cases where they come apart are the object of study.
The represented-versus-expressed distinction names that gap. A feature can be decodable from an intermediate activation while the output distribution barely moves. The reverse also happens: an output behavior can appear without any clean internal feature that current tools can read. The first case, an internal readout that the output does not reflect, is the more tractable one, because it supplies a concrete internal signal that can be tested.
1. Surprisal is an output measurement
Surprisal is negative log probability, $-\log p(\text{token})$. A token the model rates unlikely has high surprisal. This is a behavioral quantity: it depends only on the output distribution, and it is the per-token term in language-model loss. Averaged over a corpus it is the model's cross-entropy.
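As a minimal sketch of the definition above, the following computes per-token surprisal from raw logits and averages it into a cross-entropy. The function names and the toy logits are illustrative assumptions, not taken from any particular library, and values are in nats because the natural logarithm is used.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def surprisal(logits, target_index):
    """Surprisal of the target token in nats: -log p(token)."""
    return -log_softmax(logits)[target_index]

def cross_entropy(logit_rows, target_indices):
    """Mean surprisal over a sequence of (logits, target) pairs."""
    total = sum(surprisal(row, t) for row, t in zip(logit_rows, target_indices))
    return total / len(target_indices)

# Hypothetical toy data: two positions, three-token vocabulary.
logit_rows = [[2.0, 0.0, 0.0],   # model favors token 0
              [0.0, 0.0, 0.0]]   # model is uniform
targets = [0, 1]

per_token = [surprisal(row, t) for row, t in zip(logit_rows, targets)]
mean_loss = cross_entropy(logit_rows, targets)
```

Under a uniform distribution over three tokens the surprisal is $\log 3$, and the favored token at the first position scores lower. Nothing in this computation touches an intermediate activation, which is the sense in which surprisal is an output measurement.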
Because surprisal is read from the final distribution, it cannot see state that never reaches the output. In Figure 1 the gauges separate an internal margin, how strongly an intermediate layer favors the target token, from the output margin on the same token. Increasing the gap moves the two gauges apart: the internal readout climbs while the emitted probability stays low.
2. Internal readouts: lenses and probes
The logit lens, the tuned lens, and task-specific probes all read intermediate activations. They can show that a candidate token, an attribute, a grammatical feature, or a semantic relation is present at some layer. When that internal margin disagrees with the output margin, the live question is why the information failed to surface: the model may not route it to the output, a later component may suppress it, or the prompt format may override it.
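One way to make the internal and output margins concrete is a logit-lens-style readout: project each layer's hidden state through the unembedding matrix and compare the target token's logit with its strongest competitor. The sketch below is a toy under stated assumptions. The matrix, the hidden states, and the helper names are hypothetical, and the final layer normalization that a real logit lens usually applies before unembedding is omitted.

```python
def unembed(hidden, unembedding):
    """Project a hidden state onto vocabulary logits (one row per token)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembedding]

def margin(logits, target_index):
    """Target logit minus the best competing logit."""
    best_other = max(x for i, x in enumerate(logits) if i != target_index)
    return logits[target_index] - best_other

# Hypothetical toy model: 3-token vocabulary, 2-dimensional residual stream.
unembedding = [[1.0, 0.0],
               [0.0, 1.0],
               [-1.0, -1.0]]

# Hypothetical hidden states at three successive layers; the last is the output.
hidden_by_layer = [[0.1, 0.2],
                   [1.5, 0.3],
                   [0.6, 0.5]]
target = 0

margins = [margin(unembed(h, unembedding), target) for h in hidden_by_layer]
internal_margin = margins[1]    # readout at the middle layer
output_margin = margins[-1]     # readout at the final layer
gap = internal_margin - output_margin
```

In this toy the middle layer favors the target more strongly than the final layer does, so `gap` is positive. The readout alone does not say which mechanism produced the drop.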
3. Why the views come apart
Consider the prompt "The capital of France is". A logit lens may rank Paris at the top several layers before the end, yet the emitted probability can still be lower than that internal margin suggests. A few mechanisms produce this pattern.
- Suppression: A later head can write against a token. Copy-suppression heads lower the logit of a token that earlier layers promoted and that already appears in the context, so a candidate promoted mid-stack is pushed down before the output.
- Incomplete routing: A feature decodable at layer $\ell$ may need further layers to move it into the output-reading direction. If those layers do other work, the readout never fully transfers.
- Format and calibration: Spreading probability across paraphrases, punctuation, or continuations can keep any single target token's probability modest even when the model clearly favors that content.
- Overriding objectives: Instruction tuning or refusal behavior can divert the output away from a token the base computation supports, leaving the internal trace intact.
Each of these keeps an internal readout high while holding the output low, so the gap in Figure 1 is not a single phenomenon. Naming which mechanism produced it is the work, and it usually takes a causal test rather than a readout alone.
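The format-and-calibration case lends itself to a small numeric illustration. In the sketch below the vocabulary and logits are invented for the example: two surface forms of the same answer split the probability mass, so the single scored token looks weaker than the content as a whole.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical next-token candidates after "The capital of France is".
vocab = [" Paris", "Paris", " the", " a", " located"]
logits = [2.0, 1.5, 1.8, 1.0, 1.6]

probs = dict(zip(vocab, softmax(logits)))

# Probability of the one token a benchmark might score.
target_prob = probs[" Paris"]

# Probability of the content, summed over its surface forms.
content_prob = sum(probs[t] for t in (" Paris", "Paris"))
```

With these made-up numbers the scored token stays well below one half while the summed surface forms come noticeably closer to it. Which tokens count as the same content is itself an assumption that a real evaluation has to state.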
4. What "the model knows" means here
The distinction matters for evaluation. A benchmark scored at the output reports expressed behavior. A model that represents an answer but suppresses or misroutes it will look ignorant under that benchmark while a probe or lens finds the answer inside. The two results are both correct about different things, and conflating them produces claims about capability that the evidence does not support.
Turning a represented-versus-expressed observation into a mechanism claim needs the same discipline as any internal readout. The lens or probe locates the gap; a causal intervention tests whether the internal state, once changed, changes the output. Decodability shows the information is there. Only intervention shows what it does.
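The difference between decodability and causal effect can be sketched with a toy patching experiment. Everything here is hypothetical: `later_layers` stands in for whatever computation follows the patched layer, and the two-dimensional hidden state is chosen so that one dimension is readable by a probe yet ignored downstream.

```python
def later_layers(hidden):
    """Hypothetical downstream computation: reads dimension 0, ignores dimension 1.

    Returns logits for [target, other].
    """
    a, _b = hidden
    return [2.0 * a, 0.0]

def probe(hidden):
    """Hypothetical linear probe that reads dimension 1."""
    return hidden[1]

def patch(hidden, index, value):
    """Return a copy of the hidden state with one coordinate replaced."""
    patched = list(hidden)
    patched[index] = value
    return patched

def target_margin(logits):
    return logits[0] - logits[1]

clean = [0.5, 3.0]        # run where the answer is represented
corrupted = [-0.5, -3.0]  # run where it is not

baseline = target_margin(later_layers(clean))

# Patch each coordinate from the corrupted run and measure the output change.
effect_dim1 = target_margin(later_layers(patch(clean, 1, corrupted[1]))) - baseline
effect_dim0 = target_margin(later_layers(patch(clean, 0, corrupted[0]))) - baseline
```

The probe reads a strong signal on dimension 1, yet patching it leaves the output margin unchanged, while patching dimension 0 lowers it. A readout by itself would have pointed at the wrong coordinate.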
- Meng, Bau, Andonian, and Belinkov (2022), "Locating and Editing Factual Associations in GPT", for internal localization of factual recall.
- Belrose, Furman, Smith, Halawi, Ostrovsky, McKinney, Biderman, and Steinhardt (2023), "Eliciting Latent Predictions from Transformers with the Tuned Lens", for latent prediction trajectories.
- McDougall, Conmy, Rushing, McGrath, and Nanda (2023), "Copy Suppression", for heads that write against an already-promoted token.
- Logit Lens & Tuned Lens for layerwise vocabulary readouts.
- Entropy & Mutual Information for surprisal as negative log probability.