Bayesian / MDL Evidence for Probes
Probe accuracy answers one question: how well does this probe predict the labels? Bayesian evidence and minimum description length ask a different one: how much prediction did the probe buy per unit of complexity, relative to a baseline? The complexity term matters for interpretation because a high-capacity readout can fit structure that belongs to the probe rather than to the representation.
A flexible probe can keep raising training or even test accuracy by fitting accidental regularities. A model-selection criterion charges for that flexibility. Evidence rises when extra parameters explain real structure and falls when they mostly memorize, so the criterion and raw accuracy can point in opposite directions.
1. The Occam factor
Bayesian model evidence integrates the likelihood over the parameters instead of keeping the single best setting. A simple model that fits the data across much of its parameter space can beat a complex model that fits only in a tiny region. The Occam factor is that comparison made quantitative: a ratio of volumes measuring how far the data shrink the plausible region of parameter space. A parameter the data leave unconstrained keeps its prior volume and adds neither fit nor penalty; a parameter the data pin down sharply has to improve the fit enough to pay for the volume it surrenders.
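For a single parameter the volume ratio can be written out explicitly. The following is a sketch in the spirit of MacKay (1992), not a derivation taken from this text. It assumes a flat prior of width $\sigma_{\text{prior}}$ and a posterior sharply peaked at $\hat\theta$ with width $\sigma_{\text{post}}$, and it drops constant factors of order one. The symbols $D$ (data), $M$ (model), $\theta$, $\hat\theta$, $\sigma_{\text{prior}}$, and $\sigma_{\text{post}}$ are notation introduced here for illustration.

$$
p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta
\;\approx\; \underbrace{p(D \mid \hat\theta, M)}_{\text{best fit}} \times \underbrace{\frac{\sigma_{\text{post}}}{\sigma_{\text{prior}}}}_{\text{Occam factor}}
$$

$$
\log p(D \mid M) \;\approx\; \log p(D \mid \hat\theta, M) \;-\; \log \frac{\sigma_{\text{prior}}}{\sigma_{\text{post}}}
$$

If the data leave the parameter unconstrained, $\sigma_{\text{post}} = \sigma_{\text{prior}}$ and the penalty term is zero. If the data pin it down, the penalty is the log of the factor by which the plausible region shrank. When the posterior width shrinks like $1/\sqrt{n}$, that log grows like $\tfrac{1}{2}\log n$ plus a constant.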
In the large-sample limit this reduces to the BIC penalty, a cost of about $\tfrac{1}{2}\log n$ in log-likelihood units per parameter for $n$ examples. Doubling a probe's parameter count therefore doubles the log-likelihood budget it must earn back through better fit. Figure 1 makes the tradeoff visible: raw fit keeps climbing with complexity, but past an intermediate capacity the penalty grows faster than the fit, so evidence peaks there and then declines even as accuracy improves.
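The size of the BIC charge is easy to compute directly. The sketch below writes it on the log-evidence scale, as log-likelihood minus $\tfrac{k}{2}\ln n$, rather than the conventional $-2\ln L + k\ln n$ form. The function names and the parameter counts are hypothetical and chosen only for illustration.

```python
import math

def bic_penalty_nats(num_params: int, num_examples: int) -> float:
    """BIC complexity cost in nats: (k / 2) * ln(n)."""
    return 0.5 * num_params * math.log(num_examples)

def bic_score(log_likelihood: float, num_params: int, num_examples: int) -> float:
    """Penalized log-likelihood (higher is better), a large-sample
    approximation to the log evidence."""
    return log_likelihood - bic_penalty_nats(num_params, num_examples)

# Illustrative sizes only (assumed, not from any experiment).
n = 10_000
small_penalty = bic_penalty_nats(100, n)
large_penalty = bic_penalty_nats(200, n)

# Extra log-likelihood the larger probe must gain just to break even.
extra_fit_needed = large_penalty - small_penalty
print(f"penalty, 100 params: {small_penalty:.1f} nats")
print(f"penalty, 200 params: {large_penalty:.1f} nats")
print(f"break-even gain:     {extra_fit_needed:.1f} nats")
```

The penalty is linear in the parameter count, so the break-even gain for the larger probe equals the whole penalty of the smaller one.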
2. MDL as a coding story
Minimum description length tells the same story in bits. To transmit the labels, send a model and then send the residual mistakes. A more complex probe costs more model bits but can save residual bits. The best probe minimizes the total message, not the residual term alone, which is the same balance the Occam factor strikes in a different currency.
Putting model bits and residual bits on one ledger is what disciplines the comparison. If accuracy keeps rising while total codelength also rises, the probe is spending capacity faster than it is finding structure, and the apparent gain is the probe fitting itself rather than reading the representation.
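The single ledger can be sketched as a two-part code. The example assumes a crude model code that spends a fixed number of bits per parameter (`bits_per_param` is a hypothetical knob, not something the text specifies), and the per-label probabilities are made-up numbers used only to show the bookkeeping.

```python
import math

def residual_bits(true_label_probs):
    """Bits to send the labels given the probe: sum of -log2 p(true label)."""
    return sum(-math.log2(p) for p in true_label_probs)

def two_part_codelength(num_params, bits_per_param, true_label_probs):
    """Total message length: model bits plus residual bits."""
    # Assumption: every parameter is sent at the same fixed precision.
    model_bits = num_params * bits_per_param
    return model_bits + residual_bits(true_label_probs)

# Made-up illustration: the large probe is more confident on every label,
# but it needs 100x as many parameters to get there.
small_probe = two_part_codelength(10, 16, [0.8] * 1000)
large_probe = two_part_codelength(1000, 16, [0.9] * 1000)

print(f"small probe total: {small_probe:.0f} bits")
print(f"large probe total: {large_probe:.0f} bits")
```

With these numbers the larger probe saves residual bits but loses on the total, which is the pattern of a probe spending capacity faster than it finds structure.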
3. Online codelength makes the penalty operational
The version used for probing is prequential, or online, description length. Process the labeled examples in order; at each step, predict the next label using a probe trained only on the examples seen so far, and pay the surprisal $-\log_2 p(\text{label})$ of the true label, measured in bits. The accumulated surprisal is the codelength, and no separate model-bits term is needed because the early predictions, made from little data, automatically carry the cost of a probe that needs many examples to work.
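The online procedure can be sketched in a few lines. As a stand-in for a trained probe, this example uses an add-one-smoothed count table over a discrete feature, which is an assumption made to keep the code self-contained; a real probe would be retrained or updated on the examples seen so far. The function name and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def prequential_codelength_bits(examples, num_classes):
    """Online codelength of the labels given the features, in bits.

    examples: ordered iterable of (feature, label) pairs,
              with label in range(num_classes).
    """
    counts = defaultdict(lambda: [0] * num_classes)
    total_bits = 0.0
    for feature, label in examples:
        seen = counts[feature]
        # Predict before the label is revealed, using earlier examples only.
        p_true = (seen[label] + 1) / (sum(seen) + num_classes)
        total_bits += -math.log2(p_true)
        # Only now update the stand-in probe with the revealed label.
        seen[label] += 1
    return total_bits

n = 200
# Feature determines the label: structure is found after a few examples.
informative = prequential_codelength_bits([(i % 2, i % 2) for i in range(n)], 2)
# Every feature value is unique: nothing learned earlier ever transfers.
lookup = prequential_codelength_bits([(i, i % 2) for i in range(n)], 2)

print(f"informative feature: {informative:.1f} bits")
print(f"unique-id feature:   {lookup:.1f} bits")
```

The unique-id case pays one full bit per binary label, the same as sending the labels blind, even though a lookup table over those ids would reach perfect accuracy on the data it has already seen.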
Voita and Titov report compression, the ratio of a uniform code's length to the model's codelength. With $K$ label classes, transmitting $n$ labels blind costs $n \log_2 K$ bits, so a probe whose codelength falls well below that has found predictive structure. A flexible probe that only memorizes pays heavily in the early, data-poor steps, so its online codelength stays long even when its final-epoch accuracy is high. The measure separates a representation that makes a property easy to read from a probe that learns the property itself, and it does so without a held-out split or a separately designed control task.
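Compression as defined above is a one-line ratio. A minimal sketch, with hypothetical function names and an illustrative codelength rather than a reported result:

```python
import math

def uniform_codelength_bits(num_examples, num_classes):
    """Cost of sending the labels blind: n * log2(K) bits."""
    return num_examples * math.log2(num_classes)

def compression(online_codelength_bits, num_examples, num_classes):
    """Ratio of the uniform codelength to the probe's online codelength."""
    return uniform_codelength_bits(num_examples, num_classes) / online_codelength_bits

# Illustrative: 1000 binary labels sent in 250 bits.
print(compression(250.0, 1000, 2))  # 4.0
```

A value near 1 means the probe did no better than the blind code; larger values mean the representation made the labels cheaper to transmit.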
- MacKay (1992), "A Practical Bayesian Framework for Backpropagation Networks", for evidence and Occam factors.
- Voita and Titov (2020), "Information-Theoretic Probing with Minimum Description Length", for online codelength and compression as a probing measure.
- Immer, Torroba Hennigen, Fortuin, and Cotterell (2022), "Probing as Quantifying Inductive Bias", for Bayesian model selection as probe evaluation.
- Hewitt and Liang (2019), "Designing and Interpreting Probes with Control Tasks", for control tasks and selectivity.
- Bayesian Neural Networks for evidence, Occam's hill, and prior mismatch.
- Free Energy & Variational Inference for evidence as a log-partition problem.