Causal Interventions

Causal interventions change activations and measure the effect on model behavior.

Probes and lenses ask what can be read from a representation. Causal interventions ask what happens to the model's behavior when the representation is changed. The shift is from decodability to use: a feature can be readable and still play no role, and intervention is how that difference gets tested.

The basic design compares a clean run, where the model produces the behavior under study, against a corrupted run, where some relevant fact or relation has been altered. An activation from one run is copied into the other at a chosen site. If behavior changes in the expected direction under a stated metric, that is evidence the site carries causal weight for the behavior, relative to that corruption and that metric.

Figure 1 · Necessity-style and restoration evidence

1. A worked patch

The indirect-object task is a standard example. The prompt "When John and Mary went to the store, John gave a drink to" should continue with Mary. A natural metric is the logit difference $\Delta = \text{logit}(\text{Mary}) - \text{logit}(\text{John})$, which is large and positive on a clean run. Corrupt the prompt by swapping the names so that the licensed answer flips to John, and $\Delta$ collapses, typically turning negative.

Now patch: take the activation at one head, on the corrupted run, and overwrite it with the value that head held on the clean run. If $\Delta$ recovers toward its clean value, that head was carrying information the answer depends on. Figure 1 contrasts this restoration with plain ablation. The "patch clean A" mode lifts the metric back up; the "ablate head A" mode drops it and leaves it down. Patching gives sufficiency-style evidence; ablation gives necessity-style evidence, and the two need not agree.
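The patch can be sketched on a toy stand-in for a model. Everything in it is an illustrative assumption rather than a real transformer: two scalar "heads" named `A` and `B`, a linear readout with made-up weights, and a prompt reduced to which name is the indirect object. The sketch computes $\Delta$ on the clean and corrupted runs, the fraction of the gap recovered by patching each head from clean into corrupted, and $\Delta$ after zero-ablating head A on the clean run.

```python
# Toy stand-in for a model. The heads, weights, and prompt encoding are
# hypothetical; only the patching logic mirrors the text.

def head_outputs(indirect_object):
    """Scalar activations of two made-up heads for a given prompt."""
    a = 1.0 if indirect_object == "Mary" else -1.0  # head A tracks the answer
    b = 0.5                                         # head B ignores the prompt
    return {"A": a, "B": b}

def logit_diff(acts):
    """Delta = logit(Mary) - logit(John), here a linear readout of the heads."""
    return 3.0 * acts["A"] + 0.5 * acts["B"]

def run(indirect_object, overrides=None):
    """Forward pass with optionally overwritten head activations."""
    acts = head_outputs(indirect_object)
    acts.update(overrides or {})
    return logit_diff(acts)

clean_acts = head_outputs("Mary")
delta_clean = run("Mary")
delta_corrupt = run("John")

def recovery(head):
    """Fraction of the clean-vs-corrupted gap restored by patching one head."""
    patched = run("John", {head: clean_acts[head]})
    return (patched - delta_corrupt) / (delta_clean - delta_corrupt)

delta_ablate_A = run("Mary", {"A": 0.0})  # zero-ablate head A on the clean run

print(delta_clean, delta_corrupt)    # 3.25 -2.75
print(recovery("A"), recovery("B"))  # 1.0 0.0
print(delta_ablate_A)                # 0.25
```

Normalizing by the clean-minus-corrupted gap is one common convention, chosen here so that 1.0 reads as full restoration and 0.0 as none. In this toy the patch and the ablation agree about head A; in a real model they need not.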

2. Intervention types

Ablation
Remove or zero a component. A metric drop is necessity-style evidence, though redundancy and compensation can mask or shrink it.
Activation patching
Copy an activation from a clean run into a corrupted run. Whether behavior is restored depends on the corruption, the site, the representation type, and the metric.
Path patching
Restrict a patch to a route through the network, such as one head writing into another head's input, so the test isolates a connection rather than a component.
Attribution patching
Use a local linear (gradient-based) approximation to estimate many patch effects at once. The estimate is cheap, but it is coarser than a real patch and rests on stronger assumptions.
Self-repair
Measure whether another component's contribution grows once the targeted component is removed.
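Of the types above, attribution patching is the easiest to misread, so a small sketch may help. It assumes a made-up nonlinear `metric` over three scalar sites and uses a finite-difference gradient as a stand-in for a backward pass. The estimate for each site is the first-order term $(a_{\text{clean}} - a_{\text{corrupt}}) \cdot \partial m / \partial a$ evaluated on the corrupted run, compared against the exact effect of patching that site alone.

```python
import math

def metric(acts):
    """Hypothetical nonlinear readout from three site activations to a scalar."""
    return math.tanh(acts[0] + 0.5 * acts[1]) + 0.1 * acts[2] ** 2

def grad(f, x, eps=1e-5):
    """Central finite-difference gradient; stands in for one backward pass."""
    g = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        g.append((f(hi) - f(lo)) / (2 * eps))
    return g

clean = [0.9, 0.4, 1.0]     # made-up activations on the clean run
corrupt = [0.1, 0.2, -1.0]  # made-up activations on the corrupted run

# Attribution estimate: one gradient on the corrupted run covers every site.
g = grad(metric, corrupt)
estimated = [(c - k) * gi for c, k, gi in zip(clean, corrupt, g)]

def exact_patch(i):
    """True effect of patching site i alone (one forward pass per site)."""
    patched = list(corrupt)
    patched[i] = clean[i]
    return metric(patched) - metric(corrupt)

exact = [exact_patch(i) for i in range(len(corrupt))]

for i, (est, ex) in enumerate(zip(estimated, exact)):
    print(f"site {i}: estimated {est:+.3f}, exact {ex:+.3f}")
```

Site 1 moves only a short distance and the estimate tracks the exact effect closely. Site 0 moves far enough for the `tanh` to saturate, so the estimate overshoots. Site 2 is the failure case: the true patch effect is zero because the metric depends on the square of that activation, yet the linear estimate reports a clearly negative effect.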

3. The corruption and the metric are part of the claim

A patching result is only as specific as its corruption. Swapping a name, zeroing an activation, replacing it with a dataset mean, or resampling it from another prompt each define a different counterfactual, and a head can look important under one and irrelevant under another. The direction matters too: denoising patches a clean activation into a corrupted run and asks what restores the behavior, while noising patches a corrupted activation into a clean run and asks what breaks it. The two answer different questions and can disagree.
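Both points can be shown on toy numbers. The sketch below is illustrative only: `metric` and `and_metric` are hypothetical readouts, and the activations are invented. The first part ablates one head three ways; the second part patches one site of an AND-like circuit in each direction.

```python
from statistics import mean

def metric(head_act):
    """Hypothetical readout: the metric rises linearly with one head's output."""
    return 1.0 + 1.5 * head_act

# Made-up activations of that head on four prompts. It emits the same value
# everywhere, so it carries no prompt-specific information.
dataset_acts = [2.0, 2.0, 2.0, 2.0]
baseline = metric(dataset_acts[0])

drop = {
    "zero": baseline - metric(0.0),
    "mean": baseline - metric(mean(dataset_acts)),
    "resample": baseline - metric(dataset_acts[3]),
}
print(drop)  # {'zero': 3.0, 'mean': 0.0, 'resample': 0.0}

def and_metric(a, b):
    """Hypothetical circuit that needs both sites a and b to be clean."""
    return min(a, b)

clean, corrupt = 1.0, 0.0
m_clean = and_metric(clean, clean)
m_corrupt = and_metric(corrupt, corrupt)
gap = m_clean - m_corrupt

# Denoising: clean value of site a patched into the corrupted run.
restored = (and_metric(clean, corrupt) - m_corrupt) / gap
# Noising: corrupted value of site a patched into the clean run.
broken = (m_clean - and_metric(corrupt, clean)) / gap
print(restored, broken)  # 0.0 1.0
```

Zero ablation makes the constant head look important because zero is far from anything the head ever outputs; mean and resample ablation show it carries nothing prompt-specific. In the AND-like circuit, denoising site `a` restores nothing while noising it breaks everything, so the two directions give opposite verdicts on the same site.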

The metric is part of the claim. A component can restore one logit-difference metric while doing little for loss, calibration, or a different prompt family. The behavior you measure sets the scope of the causal claim, so a result stated without its metric is incomplete.
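A small numeric sketch of that gap, using invented logits over a three-token vocabulary (the token `other` and all values are assumptions for illustration). The hypothetical patch only suppresses John, which fully restores $\Delta$ but barely moves the cross-entropy loss on Mary, because a third token still dominates the softmax.

```python
import math

def loss_on(logits, target):
    """Cross-entropy of the target token under a softmax over the logits."""
    log_z = math.log(sum(math.exp(v) for v in logits.values()))
    return log_z - logits[target]

def delta(logits):
    """Logit difference between Mary and John."""
    return logits["Mary"] - logits["John"]

# Hypothetical logits on three runs.
clean = {"Mary": 8.0, "John": 2.0, "other": 0.0}
corrupt = {"Mary": 2.0, "John": 2.0, "other": 8.0}
patched = {"Mary": 2.0, "John": -4.0, "other": 8.0}  # patch only suppresses John

def recovered(value):
    """Fraction of the corrupted-to-clean gap closed by the patch."""
    return (value(patched) - value(corrupt)) / (value(clean) - value(corrupt))

delta_recovery = recovered(delta)
loss_recovery = recovered(lambda lg: loss_on(lg, "Mary"))

print(delta_recovery)        # 1.0
print(loss_recovery < 0.01)  # True
```

Reported under $\Delta$ alone, this patch looks like a complete restoration; reported under loss, it looks like almost nothing happened.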

4. Self-repair changes what is being measured

In self-repair, sometimes called the Hydra effect, ablating one component causes others to increase their contribution. An ablation result therefore mixes the removed component's role with the model's reaction to its removal. A small metric drop can mean the component was unimportant, or that the rest of the network compensated before the metric was read.

This makes single-component ablation a lower bound on one kind of causal role rather than a clean measurement of the original computation. The intervention has changed the system being measured, so the number reflects the perturbed model, not the one you wanted to characterize.
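One way to separate the two quantities is to ablate the target twice: once letting downstream components react, and once with them frozen at their clean values. The sketch below assumes a made-up two-component model in which component B reads A's output and partially compensates; the functional forms and numbers are illustrative, not taken from any real network.

```python
def forward(a_override=None, b_override=None):
    """Toy two-component model; component B reads A's output downstream.

    Returns (a, b, metric). All functional forms here are made up.
    """
    a = 0.5 if a_override is None else a_override
    # B grows when A's output shrinks, which is the compensation.
    b = 0.5 * (1.0 - a) if b_override is None else b_override
    return a, b, a + b

a_clean, b_clean, m_clean = forward()

# Plain ablation: zero A and let B react.
_, b_after, m_ablated = forward(a_override=0.0)

# Frozen ablation: zero A but hold B at its clean value.
_, _, m_frozen = forward(a_override=0.0, b_override=b_clean)

total_drop = m_clean - m_ablated       # what plain ablation reports
direct_drop = m_clean - m_frozen       # A's own contribution
self_repair = direct_drop - total_drop  # amount absorbed by B

print(total_drop, direct_drop, self_repair)  # 0.25 0.5 0.25
print(b_after - b_clean)                     # 0.25
```

Here plain ablation reports half of A's direct contribution, and the missing half shows up exactly as the growth in B's output.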

The practical response is to triangulate. Compare ablation with activation patching and path patching, and use paired interventions that remove a candidate compensator alongside the target. A claim is stronger when necessity-style and restoration-style evidence tell the same story and the model is treated as adaptive rather than passive.
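A paired intervention can be sketched as a small grid of ablations. The toy below assumes the simplest possible backup arrangement, a hypothetical circuit where either of two components alone suffices. That is static redundancy rather than adaptive compensation, but the bookkeeping is the same: compare the joint drop with the sum of the single drops.

```python
from itertools import product

def metric(a_on, b_on):
    """Hypothetical redundant circuit: either component alone suffices."""
    a = 1.0 if a_on else 0.0
    b = 1.0 if b_on else 0.0
    return max(a, b)

full = metric(True, True)

# Metric drop for every on/off combination of the two components.
drops = {
    (a_on, b_on): full - metric(a_on, b_on)
    for a_on, b_on in product([True, False], repeat=2)
}

drop_a = drops[(False, True)]      # ablate target A alone
drop_b = drops[(True, False)]      # ablate candidate compensator B alone
drop_both = drops[(False, False)]  # ablate both together

# A positive interaction means the pair matters more than the parts suggest.
interaction = drop_both - (drop_a + drop_b)

print(drop_a, drop_b, drop_both, interaction)  # 0.0 0.0 1.0 1.0
```

Each single ablation reports no effect, yet removing both eliminates the behavior. A large positive interaction term is the signal that the single-component numbers understate the target's role.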
