Transformer Internals
The moving parts of a transformer: the attention operation, the residual stream it reads and writes, and the QK/OV decomposition of a head.
3 items
Attention
Attention as a weighted sum, read three ways: as an adaptive sufficient statistic, as Nadaraya–Watson kernel regression, and as entropy-regularized retrieval
Residual Stream & Directions
The transformer residual stream as a shared workspace; features as directions; superposition as sparse feature packing
QK and OV Circuits
An attention head splits into a QK circuit that routes attention and an OV circuit that writes to the residual stream
Interpretability Methods
Readouts, probes, model-selection checks, and interventions for separating decodability from causal use.
Work in progress; not externally reviewed.
6 items
Probes and Validity
Probe scores, selectivity controls, lexical controls, and the distinction between decodability and causal use
Logit Lens & Tuned Lens
Layerwise vocabulary readouts, tuned affine decoders, and the difference between depth and cognitive time
Dependency Trees & Structural Probes
Dependency grammar, tree distance and depth, structural-probe geometry, MST extraction, and syntactic controls
Compositionality & Semantic Probes
Compositional meaning as a relation between head and dependent vectors, from additive to bilinear and nonlinear probes
Bayesian / MDL Evidence for Probes
Probe evaluation as model selection: fit, complexity penalties, evidence, and codelength
Causal Interventions
Ablation, activation patching, path patching, attribution patching, and self-repair under component removal
Phenomena & Circuits
Attention-head labels, copying circuits, binding and lookback, and gaps between internal state and output behavior.
Work in progress; not externally reviewed.
5 items
Attention Head Labels
Positional, induction, syntactic, rare-word, copy-suppression, and name-mover labels as hypotheses rather than stable kinds
Induction Heads
The prefix-match-then-copy mechanism behind the [A][B] ... [A] -> [B] transformer circuit
Binding
Resolving a use against a nonlocal source (agreement, anaphora, traces, variable use, and logical chaining), with minimal pairs and distractor controls
The Lookback Mechanism
Store an address, carry a pointer, look back to dereference it: how binding IDs and retrieval heads combine into one in-context recall motif
Represented vs. Expressed Knowledge
Surprisal, internal readouts, and cases where a model carries information that does not surface in the output distribution