Probes and Validity

Probe scores measure decodability. Selectivity controls and interventions test different claims.

A probe is a small supervised model trained on internal representations. Give it hidden states from a frozen language model and labels such as part of speech, dependency edge, semantic role, truth value, or grammatical number. If the probe predicts the labels well, something about the representation supports that prediction.

A high probe score does not automatically mean the base model represents the property in a clean way, or uses it when making predictions. The probe might exploit shallow lexical cues, memorize word types, learn the task itself, or read a feature that is present but causally irrelevant.

Figure 1 · Accuracy can survive the wrong control

1. What a probe can say

The cleanest probe question is linear decodability: can a simple readout recover the property from a frozen representation? If yes, the property is available in some form to a downstream linear read. The evidential weight depends on probe capacity, test distribution, and baseline strength.
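As a minimal sketch of that question, the example below trains a least-squares linear readout on synthetic stand-ins for frozen hidden states and compares it with a majority-class baseline on held-out tokens. Everything here is an assumption made for illustration: the data is random, the property is planted along one direction by construction, and names such as `fit_linear_probe` are hypothetical rather than part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen hidden states (assumption: no real model here).
n_tokens, dim = 1000, 32
labels = rng.integers(0, 2, size=n_tokens)   # a binary property, e.g. grammatical number
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

# Plant the property along one direction and add isotropic noise.
hidden = np.outer(2 * labels - 1, direction) + 0.5 * rng.normal(size=(n_tokens, dim))

# Held-out split: the probe is scored on tokens it never saw.
train, test = slice(0, 800), slice(800, None)


def add_bias(x):
    """Append a constant column so the readout can learn an offset."""
    return np.hstack([x, np.ones((len(x), 1))])


def fit_linear_probe(x, y):
    """Least-squares linear readout; 0/1 labels are mapped to -1/+1 targets."""
    w, *_ = np.linalg.lstsq(add_bias(x), 2.0 * y - 1.0, rcond=None)
    return w


def probe_accuracy(w, x, y):
    """Fraction of tokens whose sign-thresholded readout matches the label."""
    return float(np.mean((add_bias(x) @ w > 0) == (y == 1)))


w = fit_linear_probe(hidden[train], labels[train])
probe_acc = probe_accuracy(w, hidden[test], labels[test])

# Baseline: always predict the majority training label.
majority = int(labels[train].mean() >= 0.5)
majority_acc = float(np.mean(labels[test] == majority))

print(f"linear probe: {probe_acc:.3f}  majority baseline: {majority_acc:.3f}")
```

The gap between the two numbers is what the decodability claim rests on. A stronger baseline, for example the same probe trained on non-contextual word embeddings, would shrink that gap and is usually the more informative comparison.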

Decodability is still not causality. A model may carry a feature as a side effect. It may carry a feature too late to affect behavior. It may carry a feature in a subspace that the final output never reads. Probes find readable signals; interventions test use.
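The gap between readable and used can be illustrated with a toy setup. In the sketch below, two features are planted along orthogonal directions of a synthetic hidden space, while a stand-in "model output" reads only one of them. Both features are decodable, but ablating the probe direction changes the output for only one. The hidden states, the `model_output` readout, and the helper names are all assumptions for illustration, not a real model or library API.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 1000, 16

# Two orthogonal unit directions in a toy hidden space (assumption).
u = np.zeros(dim); u[0] = 1.0   # carries feature A
v = np.zeros(dim); v[1] = 1.0   # carries feature B

feat_a = rng.choice([-1.0, 1.0], size=n)
feat_b = rng.choice([-1.0, 1.0], size=n)
hidden = np.outer(feat_a, u) + np.outer(feat_b, v) + 0.1 * rng.normal(size=(n, dim))


def model_output(h):
    """Toy stand-in for the model's output: it reads only along v."""
    return h @ v


def fit_probe_direction(h, y):
    """Least-squares linear probe, returned as a unit direction."""
    w, *_ = np.linalg.lstsq(h, y, rcond=None)
    return w / np.linalg.norm(w)


def ablate(h, d):
    """Remove the component of each hidden state along unit direction d."""
    return h - np.outer(h @ d, d)


train, test = slice(0, 800), slice(800, None)
dir_a = fit_probe_direction(hidden[train], feat_a[train])
dir_b = fit_probe_direction(hidden[train], feat_b[train])

# Decodability: both features are readable on held-out tokens.
acc_a = float(np.mean(np.sign(hidden[test] @ dir_a) == feat_a[test]))
acc_b = float(np.mean(np.sign(hidden[test] @ dir_b) == feat_b[test]))

# Use: mean absolute change in the output after ablating each probe direction.
baseline = model_output(hidden[test])
effect_a = float(np.mean(np.abs(model_output(ablate(hidden[test], dir_a)) - baseline)))
effect_b = float(np.mean(np.abs(model_output(ablate(hidden[test], dir_b)) - baseline)))

print(f"decodable:  A={acc_a:.3f}  B={acc_b:.3f}")
print(f"ablation effect on output:  A={effect_a:.4f}  B={effect_b:.4f}")
```

Both probes score about equally well, so the probe score alone cannot separate the two features. Only the ablation shows that the output depends on B and not on A.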

2. Selectivity and control tasks

Hewitt and Liang's control-task idea asks whether a probe can also solve a task that the representation does not genuinely encode. A common version assigns random labels by word type. If a high-capacity probe performs well on the real task and also performs well on the control, the raw score is less impressive: it may reflect lexical memorization or probe capacity rather than model knowledge.

As probe capacity increases, the raw score on the real task can stay high while selectivity collapses, because accuracy on the control task rises with it. The raw score and the control gap are different quantities.

Train/test split. A random-label control only exposes memorization if the train/test split lets the same word types recur. If the split prevents that, control accuracy falls toward chance. The control therefore tests one particular failure mode, memorization of word types, and is informative only when the split allows that failure to show up.
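The dependence on the split can be sketched with a deliberately extreme probe. The example below assigns one random control label per word type, then scores a lookup-table memorizer under two splits: one by token, where types recur, and one by type, where they do not. The vocabulary, the `LookupProbe` class, and the label count are all hypothetical. The memorizer works on word identity directly, standing in for a probe with enough capacity to recover word identity from hidden states.

```python
import random
from collections import Counter, defaultdict

random.seed(0)

# Hypothetical vocabulary and token stream (assumption: synthetic data).
vocab = [f"w{i}" for i in range(200)]
num_labels = 4
tokens = [random.choice(vocab) for _ in range(4000)]

# Control task: each word TYPE gets one random label, shared by all its tokens.
control_label = {w: random.randrange(num_labels) for w in vocab}


class LookupProbe:
    """Extreme high-capacity stand-in: memorizes the majority label per word type."""

    def fit(self, words, labels):
        counts = defaultdict(Counter)
        for w, y in zip(words, labels):
            counts[w][y] += 1
        self.table = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        # Unseen types fall back to the most frequent training label.
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def accuracy(self, words, labels):
        hits = sum(self.table.get(w, self.default) == y for w, y in zip(words, labels))
        return hits / len(words)


def control_accuracy(train_words, test_words):
    """Train and score the memorizer on control labels only."""
    probe = LookupProbe().fit(train_words, [control_label[w] for w in train_words])
    return probe.accuracy(test_words, [control_label[w] for w in test_words])


# Split A: by token, so the same word types appear in train and test.
overlap_acc = control_accuracy(tokens[:3000], tokens[3000:])

# Split B: by word type, so no test type was seen in training.
held_out = set(vocab[150:])
disjoint_acc = control_accuracy(
    [w for w in tokens if w not in held_out],
    [w for w in tokens if w in held_out],
)

print(f"control accuracy, types recur:    {overlap_acc:.3f}")
print(f"control accuracy, types disjoint: {disjoint_acc:.3f} (chance is {1 / num_labels:.2f})")
```

Under the type-disjoint split the memorizer looks harmless on the control, even though nothing about the probe has changed. Selectivity, real accuracy minus control accuracy, is only meaningful when both are measured under a split that lets word types recur.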

A second kind of control keeps syntax but removes much of the ordinary lexical semantics by replacing content words with nonce words. If a syntactic probe works on normal sentences but drops sharply under that control, the probe may have relied on semantic plausibility rather than syntax alone. If the score survives, the evidence for a syntactic representation is stronger.
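A nonce-word control of this kind could be built roughly as follows. This is a sketch under stated assumptions: sentences arrive as (word, tag) pairs with coarse part-of-speech tags, the tag set and the `NONCE_WORDS` list are hypothetical, and inflection is ignored. A careful version would preserve morphology such as tense and plural endings.

```python
import random

# Assumptions: coarse POS tags, and a small hypothetical nonce vocabulary.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}
NONCE_WORDS = ["blick", "dax", "wug", "fep", "tov", "zib"]


def to_nonce(tagged_sentence, rng):
    """Replace content words with nonce words; keep function words and all tags.

    Repeated occurrences of the same content word within a sentence map to
    the same nonce word, so repetition structure is preserved.
    """
    mapping = {}
    out = []
    for word, tag in tagged_sentence:
        if tag in CONTENT_TAGS:
            key = word.lower()
            if key not in mapping:
                mapping[key] = rng.choice(NONCE_WORDS)
            out.append((mapping[key], tag))
        else:
            out.append((word, tag))
    return out


sentence = [
    ("The", "DET"), ("cat", "NOUN"), ("quickly", "ADV"),
    ("chased", "VERB"), ("the", "DET"), ("mouse", "NOUN"),
]
nonce_sentence = to_nonce(sentence, random.Random(0))
print(" ".join(word for word, _ in nonce_sentence))
```

The probe is then evaluated on representations of the nonce sentences against the original syntactic labels, and the drop relative to normal sentences is the quantity of interest.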

3. Three separate claims

Figure 2 · Each claim is a stronger, rarer subset
Nested regions, outer to inner: Decodable, Selective, Causal. A control task narrows decodable to selective; an intervention narrows selective to causal. Each step rules out more.
Decodable
A trained readout recovers the property from frozen activations, above a baseline. This is the weakest claim: the information is present in an accessible form, not necessarily computed or used. A part-of-speech probe at 95% says POS is readable, nothing more.
Selective
The probe scores higher on the real task than on a matched control, such as random labels assigned by word type, or nonce-word sentences. The gap rules out a probe that wins by raw capacity or lexical memorization. It is reported as the selectivity gap: real minus control.
Causal
Editing the representation along the probe's direction changes model behavior under a stated metric, through activation patching, ablation, or steering. This is the strongest claim, and the only one about use rather than availability.

These claims can come apart. A feature can be decodable without being selective, and selective without being causal.

A probe score is a proxy for representational structure. When extra capacity lets the probe reach the same score by memorizing word types instead, the proxy and the property come apart: a probe-level case of Goodhart's law.

