Causal Interventions

Causal interventions change activations and measure the effect on model behavior.

Probes and lenses ask what can be read from a representation. Causal interventions ask what happens to the model's behavior when the representation is changed. The shift is from decodability to use: a feature can be readable and still play no role, and intervention is how that difference gets tested.

The basic design compares a clean run, where the model produces the behavior under study, against a corrupted run, where some relevant fact or relation has been altered. An activation from one run is copied into the other at a chosen site. If the stated metric moves in the expected direction, that is evidence the site carries causal weight for the behavior, relative to the chosen corruption and metric.

Figure 1 · Necessity-style and restoration evidence. Interactive figure with two intervention modes: "patch clean A" and "ablate head A".

1. A worked patch

The indirect-object task is a standard example. The prompt "When John and Mary went to the store, John gave a drink to" should continue with Mary. A natural metric is the logit difference $\Delta = \text{logit}(\text{Mary}) - \text{logit}(\text{John})$, which is large and positive on a clean run. Corrupt the prompt by swapping the names so that the correct answer flips to John; on the corrupted run $\Delta$ collapses, and typically turns negative.

Now patch: take the activation at one head, on the corrupted run, and overwrite it with the value that head held on the clean run. If $\Delta$ recovers toward its clean value, that head was carrying information the answer depends on. Figure 1 contrasts this restoration with plain ablation. The "patch clean A" mode lifts the metric back up; the "ablate head A" mode drops it and leaves it down. Patching gives sufficiency-style evidence; ablation gives necessity-style evidence, and the two need not agree.
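The patch and the ablation described above can be sketched on a toy stand-in for a model. Everything here is an assumption made for illustration: `head_a`, `head_b`, the scalar `io_signal`, and the additive readout are hypothetical and far simpler than a real transformer, where the override would be applied with a forward hook on a specific head.

```python
# Toy stand-in for a transformer: two hypothetical "heads" each write a
# scalar, and the metric is the sum of what they write.

def head_a(io_signal):
    # Hypothetical head that strongly tracks the indirect object.
    return 2.0 * io_signal

def head_b(io_signal):
    # Hypothetical head with a weaker, partly redundant signal.
    return 0.5 * io_signal + 0.1

def logit_diff(io_signal, override=None):
    """Toy logit(Mary) - logit(John), optionally overriding head outputs."""
    acts = {"A": head_a(io_signal), "B": head_b(io_signal)}
    acts.update(override or {})
    return acts["A"] + acts["B"]

CLEAN, CORRUPTED = 1.0, -1.0  # +1: Mary is correct; -1: names swapped

clean_delta = logit_diff(CLEAN)
corrupted_delta = logit_diff(CORRUPTED)

# Activation patching: corrupted run, head A overwritten with its clean value.
patched_delta = logit_diff(CORRUPTED, override={"A": head_a(CLEAN)})

# Zero ablation: clean run, head A's output removed.
ablated_delta = logit_diff(CLEAN, override={"A": 0.0})

print(f"clean={clean_delta:.2f} corrupted={corrupted_delta:.2f}")
print(f"patched={patched_delta:.2f} ablated={ablated_delta:.2f}")
```

In this toy, patching head A recovers most of the gap between the corrupted and clean metric but not all of it, because head B still sees the corrupted input. Ablating head A on the clean run lowers the metric without reversing it, which is a different kind of evidence about the same head.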

2. Intervention types

Ablation removes a component and asks whether the metric drops. Activation patching copies a clean activation into a corrupted run and asks whether the metric recovers. Path patching restricts that test to a specific route between components, so the patched value reaches only chosen downstream readers. Attribution patching estimates many patch effects at once with a local linear approximation. Self-repair analysis is less an intervention than a follow-up measurement: it checks whether backup components increase their contribution after the target is removed.
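The local linear approximation behind attribution patching can be illustrated with a single scalar activation. This is a minimal sketch under stated assumptions: `metric` is a hypothetical nonlinear function standing in for everything downstream of the site, and the gradient is taken by finite differences, where a real implementation would use one backward pass to get gradients for all sites at once.

```python
import math

def metric(a):
    # Hypothetical nonlinear map from one activation to the metric.
    return math.tanh(a)

def numerical_grad(f, a, eps=1e-6):
    # Central finite difference; a real model would use a backward pass.
    return (f(a + eps) - f(a - eps)) / (2 * eps)

def exact_patch_effect(a_clean, a_corr):
    # Metric change from actually overwriting the corrupted activation.
    return metric(a_clean) - metric(a_corr)

def attribution_estimate(a_clean, a_corr):
    # First-order estimate: activation difference times the gradient
    # of the metric evaluated on the corrupted run.
    return (a_clean - a_corr) * numerical_grad(metric, a_corr)

a_corr = 0.4
for a_clean in (0.45, 0.9):
    exact = exact_patch_effect(a_clean, a_corr)
    approx = attribution_estimate(a_clean, a_corr)
    print(f"a_clean={a_clean}: exact={exact:.4f} approx={approx:.4f}")
```

The estimate tracks the exact patch closely when the clean and corrupted activations are near each other and drifts as the gap grows. That is the trade being made: one gradient computation in place of one forward pass per site, at the cost of accuracy where the metric is strongly nonlinear.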

3. The corruption and the metric are part of the claim

A patching result is only as specific as its corruption. Swapping a name, zeroing an activation, replacing it with a dataset mean, or resampling it from another prompt each define a different counterfactual, and a head can look important under one and irrelevant under another. The direction matters too: denoising patches a clean activation into a corrupted run and asks what restores the behavior, while noising patches a corrupted activation into a clean run and asks what breaks it. The two answer different questions and can disagree.
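Both points can be made concrete with toy numbers. The sketch below is illustrative only: `head_a`, the reference signals, and the `max` readout are hypothetical choices, picked so that the replacement value changes the measured effect in the first part and so that the two heads are fully redundant in the second.

```python
# Part 1: different replacement values define different counterfactuals.
def head_a(signal):
    return 2.0 * signal + 1.0  # hypothetical head with a constant offset

def delta(signal, a_override=None):
    a = head_a(signal) if a_override is None else a_override
    return a + 0.5 * signal  # second term stands in for the rest of the model

clean_signal = 1.0
reference_signals = [1.0, -1.0, 0.5, -0.5]  # hypothetical reference prompts

replacements = {
    "zero": 0.0,
    "mean": sum(head_a(s) for s in reference_signals) / len(reference_signals),
    "resample": head_a(reference_signals[1]),
}
ablation_effect = {
    name: delta(clean_signal) - delta(clean_signal, value)
    for name, value in replacements.items()
}
print(ablation_effect)

# Part 2: noising and denoising can disagree when components are redundant.
def redundant_metric(a, b):
    return max(a, b)  # either hypothetical head alone is enough

clean_act, corrupted_act = 1.0, -1.0
clean_metric = redundant_metric(clean_act, clean_act)
corrupted_metric = redundant_metric(corrupted_act, corrupted_act)

# Denoising: clean value for head A patched into the corrupted run.
denoising_gain = redundant_metric(clean_act, corrupted_act) - corrupted_metric
# Noising: corrupted value for head A patched into the clean run.
noising_drop = clean_metric - redundant_metric(corrupted_act, clean_act)
print(denoising_gain, noising_drop)
```

In the first part the same head gets three different effect sizes from zero, mean, and resample replacement. In the second, denoising head A fully restores the metric while noising it changes nothing, because the other head covers for it. Neither number is wrong; each answers the question its counterfactual poses.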

The metric is part of the claim. A component can restore one logit-difference metric while doing little for loss, calibration, or a different prompt family. The behavior you measure sets the scope of the causal claim, so a result stated without its metric is incomplete.

4. Self-repair changes what is being measured

In the Hydra effect, ablating one component causes another to grow its contribution. An ablation result therefore mixes the removed component's role with the model's reaction to its removal. A small metric drop can mean the component was unimportant, or that the rest of the network compensated before the metric was read.

This makes single-component ablation a lower bound on one kind of causal role rather than a clean measurement of the original computation. The intervention has changed the system being measured, so the number reflects the perturbed model, not the one you wanted to characterize.

The practical response is to triangulate. Compare ablation with activation patching and path patching, and use paired interventions that remove a candidate compensator alongside the target. A claim is stronger when necessity-style and restoration-style evidence tell the same story and the model is treated as adaptive rather than passive.
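A toy version of self-repair shows why the paired intervention matters. The model below is a hypothetical construction, not the mechanism reported for any real network: head B is assumed to write 80% of whatever head A fails to supply, so it is silent on a normal run and active only after A is removed.

```python
def run(signal, ablate=()):
    """Toy forward pass; returns the metric and each head's contribution."""
    a = 0.0 if "A" in ablate else 2.0 * signal
    # Hypothetical backup head: writes 80% of whatever A failed to supply.
    shortfall = max(0.0, 2.0 * signal - a)
    b = 0.0 if "B" in ablate else 0.8 * shortfall
    return a + b, {"A": a, "B": b}

baseline, base_parts = run(1.0)
only_a, a_parts = run(1.0, ablate=("A",))
only_b, _ = run(1.0, ablate=("B",))
both, _ = run(1.0, ablate=("A", "B"))

drop_a = baseline - only_a        # small: B compensates for A
drop_b = baseline - only_b        # zero: B is idle while A works
drop_both = baseline - both       # large: nothing is left to compensate
backup_growth = a_parts["B"] - base_parts["B"]

print(f"drop_a={drop_a:.2f} drop_b={drop_b:.2f} drop_both={drop_both:.2f}")
print(f"backup head B grew by {backup_growth:.2f} after ablating A")
```

Read alone, the single ablations suggest that neither head matters much. The paired ablation and the growth in B's contribution show that A was doing nearly all of the work on the unperturbed run, which is the situation the single-component number understates.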

Citations

Zhang and Nanda (2024), "Towards Best Practices of Activation Patching in Language Models", for metrics, corruption choices, and noising versus denoising.
Wang, Variengien, Conmy, Shlegeris, and Steinhardt (2023), "Interpretability in the Wild", for the indirect-object task and logit-difference metric.
Meng, Bau, Andonian, and Belinkov (2022), "Locating and Editing Factual Associations in GPT", for causal tracing in factual recall.
Goldowsky-Dill, MacLeod, Sato, and Arora (2023), "Localizing Model Behavior with Path Patching", for path patching.
Syed, Rager, and Conmy (2023), "Attribution Patching Outperforms Automated Circuit Discovery", for attribution patching.
McGrath, Rahtz, Kramar, Mikulik, and Legg (2023), "The Hydra Effect", for self-repair under ablation.

Related pages

Residual Stream & Directions for what many activation patches replace.
Probes and Validity for the decodability claims interventions are meant to test.

What next

Before: Probes and Validity. Readable features before intervention.
Circuit: QK and OV Circuits. The attention-head pieces interventions often target.