Causal Interventions
Probes and lenses ask what can be read from a representation. Causal interventions ask what happens to the model's behavior when the representation is changed. The shift is from decodability to use: a feature can be readable and still play no role, and intervention is how that difference gets tested.
The basic design compares a clean run, where the model produces the behavior under study, against a corrupted run, where some relevant fact or relation has been altered. An activation from one run is copied into the other at a chosen site. If behavior changes in the expected direction under a stated metric, that is evidence the site carries causal weight for the behavior, relative to that particular corruption.
1. A worked patch
The indirect-object task is a standard example. The prompt "When John and Mary went to the store, John gave a drink to" should continue with Mary. A natural metric is the logit difference $\Delta = \text{logit}(\text{Mary}) - \text{logit}(\text{John})$, which is large and positive on a clean run. Corrupt the prompt by swapping the names so that the licensed answer flips, and $\Delta$ collapses toward zero or turns negative.
Now patch: take the activation at one head, on the corrupted run, and overwrite it with the value that head held on the clean run. If $\Delta$ recovers toward its clean value, that head was carrying information the answer depends on. Figure 1 contrasts this restoration with plain ablation. The "patch clean A" mode lifts the metric back up; the "ablate head A" mode drops it and leaves it down. Patching gives sufficiency-style evidence; ablation gives necessity-style evidence, and the two need not agree.
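The two modes can be sketched on a toy model. This is a minimal illustration, not a real transformer: the two scalar "heads", their weights, the readout, and the 2-d prompt encodings are all hypothetical stand-ins chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical per-head input weights and readout weights for the toy model.
HEAD_WEIGHTS = {"A": np.array([2.0, -2.0]), "B": np.array([0.5, -0.5])}
READOUT = {"A": 3.0, "B": 1.0}

def head_activations(x):
    """Each toy 'head' maps the 2-d prompt encoding to one scalar activation."""
    return {name: float(np.tanh(x @ w)) for name, w in HEAD_WEIGHTS.items()}

def logit_diff(x, overrides=None):
    """Toy Delta = logit(Mary) - logit(John), with optional activation overrides."""
    acts = head_activations(x)
    acts.update(overrides or {})
    return sum(READOUT[name] * acts[name] for name in acts)

clean = np.array([1.0, 0.0])    # stands in for the prompt whose answer is Mary
corrupt = np.array([0.0, 1.0])  # stands in for the name-swapped prompt

clean_acts = head_activations(clean)

delta_clean = logit_diff(clean)
delta_corrupt = logit_diff(corrupt)
# "patch clean A": corrupted run, head A overwritten with its clean value.
delta_patched = logit_diff(corrupt, {"A": clean_acts["A"]})
# "ablate head A": clean run, head A set to zero.
delta_ablated = logit_diff(clean, {"A": 0.0})

print(f"clean {delta_clean:+.2f}  corrupt {delta_corrupt:+.2f}")
print(f"patch clean A {delta_patched:+.2f}  ablate head A {delta_ablated:+.2f}")
```

In this toy, patching head A moves $\Delta$ most of the way back from its corrupted value, while ablating head A on the clean run removes most of it. In a real model the same pattern would typically be implemented with forward hooks that cache and overwrite activations, and the two results need not line up this neatly.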
2. Intervention types
- Ablation: Remove or zero a component. A metric drop is necessity-style evidence, mixed with redundancy and compensation effects.
- Activation patching: Copy an activation from a clean run into a corrupted run. Whether behavior is restored depends on the corruption, the site, the representation type, and the metric.
- Path patching: Restrict a patch to a route through the network, such as one head writing into another head's input, so the test isolates a connection rather than a component.
- Attribution patching: Use a local linear approximation to estimate many patch effects at once, cheaply, at lower resolution and under stronger assumptions.
- Self-repair: Measure whether another component's contribution grows once the targeted component is removed.
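The trade-off behind attribution patching can be illustrated with a small numerical sketch. It assumes a hypothetical metric `tanh(acts @ w)` over three activation sites, with made-up weights and activations. Exact patching needs one forward pass per site, while the linear estimate reuses a single gradient taken at the corrupted activations.

```python
import numpy as np

w = np.array([1.5, 0.5, -0.2])  # hypothetical readout weights

def metric(acts):
    """Toy nonlinear metric over three activation sites."""
    return np.tanh(acts @ w)

def metric_grad(acts):
    """Analytic gradient of the toy metric with respect to the activations."""
    return (1.0 - np.tanh(acts @ w) ** 2) * w

clean_acts = np.array([0.8, 0.3, 0.1])
corrupt_acts = np.array([-0.8, -0.3, 0.1])
base = metric(corrupt_acts)

# Exact activation patching: one forward pass per site.
exact = np.empty(3)
for i in range(3):
    patched = corrupt_acts.copy()
    patched[i] = clean_acts[i]
    exact[i] = metric(patched) - base

# Attribution patching: first-order estimate of every patch from one gradient.
approx = (clean_acts - corrupt_acts) * metric_grad(corrupt_acts)

print("exact :", np.round(exact, 3))
print("approx:", np.round(approx, 3))
```

Here the estimate ranks the sites in the same order as exact patching but badly underestimates the largest effect, because the metric is far from linear over the distance between the corrupted and clean activations. That gap is the "stronger assumptions" cost in miniature.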
3. The corruption and the metric are part of the claim
A patching result is only as specific as its corruption. Swapping a name, zeroing an activation, replacing it with a dataset mean, or resampling it from another prompt each define a different counterfactual, and a head can look important under one and irrelevant under another. The direction matters too: denoising patches a clean activation into a corrupted run and asks what restores the behavior, while noising patches a corrupted activation into a clean run and asks what breaks it. The two answer different questions and can disagree.
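Both points can be made concrete with toy numbers. The sketch below is illustrative only: the head activations, the readout weight, and the two-component `or_metric` are hypothetical. The first half scores one head under three baselines. The second half patches one of two redundant components in each direction.

```python
import numpy as np

# Hypothetical activations of one head on four prompts; index 0 is the clean prompt.
head_acts = np.array([2.0, 2.1, 1.9, 2.0])
READOUT = 0.5  # assumed weight from this head into the metric

def metric(head_value, rest=1.0):
    """Toy metric: this head's contribution plus a fixed remainder."""
    return READOUT * head_value + rest

clean_value = head_acts[0]
baselines = {
    "zero": 0.0,               # zero ablation
    "mean": head_acts.mean(),  # dataset-mean ablation
    "resample": head_acts[1],  # resample from another prompt
}
# Metric drop on the clean prompt under each counterfactual.
drops = {name: metric(clean_value) - metric(value) for name, value in baselines.items()}

def or_metric(a, b):
    """Two redundant components: either one alone is enough."""
    return max(a, b)

clean, corrupt = {"a": 1.0, "b": 1.0}, {"a": 0.0, "b": 0.0}
denoised = or_metric(clean["a"], corrupt["b"])  # clean a into the corrupted run
noised = or_metric(corrupt["a"], clean["b"])    # corrupted a into the clean run
restored = denoised - or_metric(**corrupt)      # how much denoising a restores
broken = or_metric(**clean) - noised            # how much noising a breaks

print(drops)
print("restored by denoising:", restored, " broken by noising:", broken)
```

The head's activation barely varies across prompts, so it looks important under zero ablation and nearly irrelevant under mean or resample ablation. In the redundant pair, denoising component `a` fully restores the metric while noising it breaks nothing, so the two directions give opposite verdicts on the same component.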
4. Self-repair changes what is being measured
In the Hydra effect, ablating one component causes another to grow its contribution. An ablation result therefore mixes the removed component's role with the model's reaction to its removal. A small metric drop can mean the component was unimportant, or that the rest of the network compensated before the metric was read.
This makes single-component ablation a lower bound on one kind of causal role rather than a clean measurement of the original computation. The intervention has changed the system being measured, so the number reflects the perturbed model, not the one you wanted to characterize.
The practical response is to triangulate. Compare ablation with activation patching and path patching, and use paired interventions that remove a candidate compensator alongside the target. A claim is stronger when necessity-style and restoration-style evidence tell the same story and the model is treated as adaptive rather than passive.
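The compensation problem and the paired-intervention response can be sketched with a two-component toy. Everything here is a hypothetical construction: head B is defined to write whatever part of a target signal head A failed to supply, which is a deliberately extreme form of self-repair.

```python
def run(ablate_a=False, ablate_b=False, freeze_b=None):
    """Toy two-component model; returns (metric, a, b).

    Head A writes a signal. Head B is a hypothetical backup that only
    writes whatever part of the target signal A failed to supply.
    """
    target = 1.0
    a = 0.0 if ablate_a else 1.0
    b = max(0.0, target - a)  # B reacts to what A wrote
    if freeze_b is not None:
        b = freeze_b          # hold B at a value taken from another run
    if ablate_b:
        b = 0.0
    return a + b, a, b

clean_metric, _, clean_b = run()
ablated_metric, _, repaired_b = run(ablate_a=True)
frozen_metric, _, _ = run(ablate_a=True, freeze_b=clean_b)
paired_metric, _, _ = run(ablate_a=True, ablate_b=True)

total_effect = clean_metric - ablated_metric   # what plain ablation reports
direct_effect = clean_metric - frozen_metric   # A's role with B held at its clean value
self_repair = repaired_b - clean_b             # growth in B's contribution

print("total effect of ablating A:", total_effect)
print("direct effect with B frozen:", direct_effect)
print("self-repair by B:", self_repair)
print("metric with A and B both ablated:", paired_metric)
```

Plain ablation of A reports no drop at all, yet freezing B at its clean value or ablating both components shows that A supplied the entire signal on the clean run. Real models compensate only partially and through several components, so the gap between the two numbers is usually smaller than in this toy.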
Further reading
- Zhang and Nanda (2024), "Towards Best Practices of Activation Patching in Language Models", for metrics, corruption choices, and noising versus denoising.
- Wang, Variengien, Conmy, Shlegeris, and Steinhardt (2023), "Interpretability in the Wild", for the indirect-object task and logit-difference metric.
- Meng, Bau, Andonian, and Belinkov (2022), "Locating and Editing Factual Associations in GPT", for causal tracing in factual recall.
- Goldowsky-Dill, MacLeod, Sato, and Arora (2023), "Localizing Model Behavior with Path Patching", for path patching.
- Syed, Rager, and Conmy (2023), "Attribution Patching Outperforms Automated Circuit Discovery", for attribution patching.
- McGrath, Rahtz, Kramar, Mikulik, and Legg (2023), "The Hydra Effect", for self-repair under ablation.
See also
- Residual Stream & Directions for what many activation patches replace.
- Probes and Validity for the decodability claims interventions are meant to test.