Transformer Internals

The moving parts of a transformer: the attention operation, the residual stream it reads and writes, and the QK/OV decomposition of a head.

3 items

Attention

A weighted sum read as an adaptive sufficient statistic, Nadaraya–Watson kernel regression, and entropy-regularized retrieval

Residual Stream & Directions

The transformer residual stream as a shared workspace; features as directions; superposition as sparse feature packing

QK and OV Circuits

An attention head has separate routing and residual-write components

Interpretability Methods

Readouts, probes, model-selection checks, and interventions for separating decodability from causal use.

Work in progress; not externally reviewed.

6 items

Probes and Validity

Probe scores, selectivity controls, lexical controls, and the distinction between decodability and causal use

Logit Lens & Tuned Lens

Layerwise vocabulary readouts, tuned affine decoders, and the difference between depth and cognitive time

Dependency Trees & Structural Probes

Dependency grammar, tree distance and depth, structural-probe geometry, MST extraction, and syntactic controls

Compositionality & Semantic Probes

Compositional meaning as a relation between head and dependent vectors, from additive to bilinear and nonlinear probes

Bayesian / MDL Evidence for Probes

Probe evaluation as model selection: fit, complexity penalties, evidence, and codelength

Causal Interventions

Ablation, activation patching, path patching, attribution patching, and self-repair under component removal

Phenomena & Circuits

Attention-head labels, copying circuits, binding and lookback, and gaps between internal state and output behavior.

Work in progress; not externally reviewed.

5 items

Attention Head Labels

Positional, induction, syntactic, rare-word, copy-suppression, and name-mover labels as hypotheses rather than stable kinds

Induction Heads

The prefix-match-then-copy mechanism behind the [A][B] ... [A] -> [B] transformer circuit

Binding

Resolving a use against a nonlocal source — agreement, anaphora, traces, variable use, and logical chaining — with minimal pairs and distractor controls

The Lookback Mechanism

Store an address, carry a pointer, look back to dereference it: how binding IDs and retrieval heads combine into one in-context recall motif

Represented vs. Expressed Knowledge

Surprisal, internal readouts, and cases where a model carries information that does not surface in the output distribution
