Represented vs. Expressed Knowledge

A model can carry information internally without making it visible in the next-token distribution.

Most measurements of a language model are taken at the output: token probabilities, loss, accuracy, calibration, refusal rate, generated text. Mechanistic interpretability adds a second vantage point and reads what is present inside the model before the output is formed. The two views can agree or come apart, and the cases where they come apart are the object of study.

The represented-versus-expressed distinction names that gap. A feature can be decodable from an intermediate activation while the output distribution barely moves. The reverse also happens: an output behavior can appear without any clean internal feature that current tools can read. The first case, an internal readout that the output does not reflect, is the more tractable one, because the readout gives a concrete quantity to compare against the output.

Figure 1 · Internal and output margins can diverge (interactive figure; the internal/output gap control is shown at 0.70)

1. Surprisal is an output measurement

Surprisal is negative log probability, $-\log p(\text{token})$. A token the model rates unlikely has high surprisal. This is a behavioral quantity: it depends only on the output distribution, and it is the per-token term in language-model loss. Averaged over a corpus it is the model's cross-entropy.
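As a minimal sketch of the definition above, the snippet below computes per-token surprisal from logits and averages it into a cross-entropy. The logits, the four-token vocabulary, and the helper names `log_softmax` and `surprisal` are illustrative assumptions, not values from any particular model.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def surprisal(logits, target):
    """Surprisal in nats of the token at index `target`: -log p(token)."""
    return -log_softmax(logits)[target]

# Hypothetical logits over a 4-token vocabulary at three positions,
# paired with the index of the token that actually came next.
steps = [
    ([2.0, 0.5, 0.1, -1.0], 0),  # the model's top token occurred: low surprisal
    ([2.0, 0.5, 0.1, -1.0], 3),  # the model's least likely token occurred: high surprisal
    ([0.0, 0.0, 0.0, 0.0], 1),   # uniform distribution: surprisal equals log(4)
]

per_token = [surprisal(logits, target) for logits, target in steps]

# Mean surprisal over the (tiny) corpus is the cross-entropy in nats.
cross_entropy = sum(per_token) / len(per_token)

print(per_token, cross_entropy)
```

Nothing in this computation touches an intermediate activation, which is the sense in which surprisal is purely an output measurement.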

Because surprisal is read from the final distribution, it cannot see state that never reaches the output. In Figure 1 the gauges separate an internal margin, how strongly an intermediate layer favors the target token, from the output margin on the same token. Raising the gap widens the distance between them: the internal readout climbs while the emitted probability stays low.

Measure and behavior can diverge at the output too. Adding the same constant to every logit leaves the softmax unchanged, so the next-token probabilities, the argmax, sampling, and the KL divergence between any two distributions shifted this way are all invariant. Only unnormalized quantities move: the raw logits and their log-sum-exp each shift by that constant. A score read from raw logits can therefore move while the output behavior does not.
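The invariance can be checked numerically. The sketch below uses two made-up logit vectors and an arbitrary shift of 7.5; all names and values are assumptions chosen for illustration.

```python
import math

def logsumexp(logits):
    """Stable log of the sum of exponentials."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def softmax(logits):
    lse = logsumexp(logits)
    return [math.exp(x - lse) for x in logits]

def kl(p, q):
    """KL(p || q) in nats for two distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

logits_a = [2.0, 0.5, 0.1, -1.0]   # hypothetical logits
logits_b = [1.0, 1.5, -0.3, 0.2]   # a second hypothetical distribution
shift = 7.5                        # arbitrary constant added to every logit

shifted_a = [x + shift for x in logits_a]
shifted_b = [x + shift for x in logits_b]

p, p_shifted = softmax(logits_a), softmax(shifted_a)
q, q_shifted = softmax(logits_b), softmax(shifted_b)

# Behavioral quantities: unchanged up to floating-point error.
max_prob_change = max(abs(x - y) for x, y in zip(p, p_shifted))
kl_change = abs(kl(p, q) - kl(p_shifted, q_shifted))
same_argmax = p.index(max(p)) == p_shifted.index(max(p_shifted))

# Unnormalized quantity: moves by exactly the shift.
lse_change = logsumexp(shifted_a) - logsumexp(logits_a)

print(max_prob_change, kl_change, same_argmax, lse_change)
```

Any metric built from `logsumexp` or raw logit values inherits `lse_change`, so it can report a difference between two runs whose output distributions are identical.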

2. Internal readouts: lenses and probes

The logit and tuned lenses and task-specific probes read intermediate activations. They can show that a candidate token, an attribute, a grammatical feature, or a semantic relation is present at some layer. When that internal margin disagrees with the output margin, the live question is why the information failed to surface: the model may not route it to the output, a later component may suppress it, or the prompt format may override it.
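A logit-lens style readout can be sketched on a toy example: project each layer's residual state through the unembedding matrix and track the target token's margin layer by layer. Everything here is hypothetical, including the three-dimensional residual states, the matrix `W_U`, and the four-token vocabulary. The final normalization that a real logit lens applies before unembedding is omitted for brevity, and a tuned lens would additionally learn a per-layer translator.

```python
def unembed(h, W_U):
    """Project a residual-stream vector onto vocabulary logits."""
    return [sum(w * x for w, x in zip(row, h)) for row in W_U]

def margin(logits, target):
    """Target logit minus the best competing logit."""
    best_other = max(v for i, v in enumerate(logits) if i != target)
    return logits[target] - best_other

vocab = ["Paris", "London", "the", "a"]

# Hypothetical unembedding matrix: one row per vocabulary item.
W_U = [
    [1.0, 0.0, 0.0],  # Paris
    [0.0, 1.0, 0.0],  # London
    [0.0, 0.0, 1.0],  # the
    [0.0, 0.0, 0.8],  # a
]

# Hypothetical residual states after each layer for one prompt position.
residuals = [
    [0.1, 0.1, 0.5],  # layer 0: no preference for the target yet
    [1.2, 0.3, 0.6],  # layer 1: target promoted
    [1.5, 0.2, 0.7],  # layer 2: target strongest
    [1.0, 0.2, 0.8],  # final layer: target partially pushed down
]

target = vocab.index("Paris")
margins = [margin(unembed(h, W_U), target) for h in residuals]

internal_margin = max(margins[:-1])  # best intermediate-layer readout
output_margin = margins[-1]          # what the output actually reflects
gap = internal_margin - output_margin

print(margins, gap)
```

A positive `gap` is only the observation. It does not say which of the explanations (routing, suppression, format) is responsible.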

3. Why the views come apart

Consider the prompt "The capital of France is". A logit lens may rank Paris at the top several layers before the end, yet the emitted probability can still be lower than that internal margin suggests. A few mechanisms produce this pattern.

Suppression: A later head can write against a token. Copy-suppression heads lower the score of a token that already appeared, so a candidate that was promoted mid-stack is pushed down before the output.
Incomplete routing: A feature decodable at layer $\ell$ may need further layers to move it into the direction the unembedding reads from. If those layers do other work, the information never fully transfers to the output.
Format and calibration: Spreading probability across paraphrases, punctuation, or continuations can keep any single target token's probability modest even when the model clearly favors that content.
Overriding objectives: Instruction tuning or refusal behavior can divert the output away from a token the base computation supports, leaving the internal trace intact.
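The format and calibration item in the list above can be illustrated with a small, made-up distribution in which several surface forms carry the same answer. The token strings and logits are assumptions for illustration only.

```python
import math

def softmax(logits):
    """Softmax over a dict mapping token -> logit."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical next-token logits; three entries are surface variants
# of the same answer.
logits = {
    "Paris": 2.0,
    " Paris": 1.9,
    "paris": 1.6,
    "the": 1.7,
    "London": 0.2,
    "a": 0.5,
}
variants = {"Paris", " Paris", "paris"}

probs = softmax(logits)
p_single = probs["Paris"]                      # probability of one target token
p_content = sum(probs[t] for t in variants)    # probability of the content

print(round(p_single, 3), round(p_content, 3))
```

Scoring only `p_single` understates how strongly the distribution favors the content, which is why a modest target-token probability is not by itself evidence of suppression or failed routing.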

Each of these keeps an internal readout high while holding the output low, so the gap in Figure 1 is not a single phenomenon. Naming which mechanism produced it is the work, and it usually takes a causal test rather than a readout alone.
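Suppression is the easiest of these mechanisms to caricature in a few lines. In the sketch below, a hypothetical later component writes a negative multiple of the target token's unembedding direction into the residual stream. The mid-stack readout is untouched, while the output margin falls as the write gets stronger. The matrix, the residual state, and the `suppression_write` helper are all assumptions, not a model of any real head.

```python
def unembed(h, W_U):
    """Project a residual-stream vector onto vocabulary logits."""
    return [sum(w * x for w, x in zip(row, h)) for row in W_U]

def margin(logits, target):
    """Target logit minus the best competing logit."""
    best_other = max(v for i, v in enumerate(logits) if i != target)
    return logits[target] - best_other

# Hypothetical unembedding matrix; row 0 is the target token.
W_U = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.8],
]
target = 0

h_mid = [1.5, 0.2, 0.7]  # hypothetical mid-stack state with the target promoted
internal_margin = margin(unembed(h_mid, W_U), target)

def suppression_write(strength):
    """Hypothetical later-head output: writes against the target's direction."""
    return [-strength * w for w in W_U[target]]

strengths = [0.0, 0.5, 1.0]
output_margins = []
for s in strengths:
    h_final = [a + b for a, b in zip(h_mid, suppression_write(s))]
    output_margins.append(margin(unembed(h_final, W_U), target))

print(internal_margin, output_margins)
```

The same internal-high, output-low pattern could be produced by the other mechanisms, which is why the readout alone cannot identify the cause.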

4. Frequency controls and abstraction

Output behavior can also reflect surface frequency rather than the abstraction of interest. One simple diagnostic is to plot a pairwise margin against the bigram-frequency difference and read the fitted line at frequency difference zero. The intercept is the frequency-independent preference; the slope is the part explained by frequency.
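The intercept diagnostic amounts to an ordinary least-squares line fit. The sketch below fits one by hand on synthetic data generated with a slope of 0.45 and an intercept of 0.35 plus small fixed residuals. The data points and the `fit_line` helper are assumptions for illustration, not measurements.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic item pairs: x is the bigram-frequency difference between the two
# candidates (for example in log units), y is the model's pairwise margin.
freq_diff = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
residual = [0.02, -0.03, 0.01, 0.0, -0.01, 0.03, -0.02]  # fixed "noise"
margins = [0.35 + 0.45 * d + r for d, r in zip(freq_diff, residual)]

slope, intercept = fit_line(freq_diff, margins)

# The intercept is the fitted margin at frequency difference zero;
# the slope is the change in margin per unit of frequency difference.
print(round(slope, 3), round(intercept, 3))
```

The intercept is only as trustworthy as the linear fit. With few items near a frequency difference of zero, it is an extrapolation and should be reported with its uncertainty.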

Figure 2 · Intercept at matched frequency (interactive figure; controls shown at frequency slope 0.45 and abstraction intercept 0.35)

5. What "the model knows" means here

Knowledge is shorthand. "The model knows X" compresses a more specific claim: X is decodable from internal activations under a stated readout, while the output distribution does not express X in the corresponding way. The claim is relative to the readout and the layer, not an absolute property of the model.

The distinction matters for evaluation. A benchmark scored at the output reports expressed behavior. A model that represents an answer but suppresses or misroutes it will look ignorant under that benchmark while a probe or lens finds the answer inside. The two results are both correct about different things, and conflating them produces claims about capability that the evidence does not support.

Turning a represented-versus-expressed observation into a mechanism claim needs the same discipline as any internal readout. The lens or probe locates the gap; a causal intervention tests whether the internal state, once changed, changes the output. Decodability shows the information is there. Only intervention shows what it does.
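The difference between decodability and causal relevance shows up even in a two-dimensional toy. In the sketch below, a feature is stored in a dimension that the downstream readout ignores: a probe recovers it, but ablating it leaves the output unchanged. The hidden state, the readout weights, and the zero-ablation used as the intervention are all assumptions. Patching in an activation from a different run is a common alternative intervention.

```python
def output_margin(h, readout):
    """Hypothetical downstream computation: a linear read of the hidden state."""
    return sum(w * x for w, x in zip(readout, h))

def ablate(h, index, value=0.0):
    """Return a copy of h with one dimension overwritten."""
    patched = list(h)
    patched[index] = value
    return patched

readout = [1.0, 0.0]  # assumption: the output reads only dimension 0
h = [0.4, 2.0]        # dimension 1 carries a strong feature the output ignores

# A probe aligned with dimension 1 decodes the feature easily.
probe_readout = h[1]

baseline = output_margin(h, readout)

# Intervene on each dimension and measure the change in the output.
effect_dim1 = output_margin(ablate(h, 1), readout) - baseline
effect_dim0 = output_margin(ablate(h, 0), readout) - baseline

print(probe_readout, effect_dim1, effect_dim0)
```

Here the probe result and the intervention result answer different questions: dimension 1 is highly decodable and causally inert for this output, while dimension 0 is weaker but is the one the output depends on.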

Citations

Meng, Bau, Andonian, and Belinkov (2022), "Locating and Editing Factual Associations in GPT", for internal localization of factual recall.
Belrose, Furman, Smith, Halawi, Ostrovsky, McKinney, Biderman, and Steinhardt (2023), "Eliciting Latent Predictions from Transformers with the Tuned Lens", for latent prediction trajectories.
McDougall, Conmy, Rushing, McGrath, and Nanda (2023), "Copy Suppression: Comprehensively Understanding an Attention Head", for heads that write against an already-promoted token.

Related pages

Logit Lens & Tuned Lens for layerwise vocabulary readouts.
Entropy & Mutual Information for surprisal as negative log probability.
