Attention
Attention mixes information across positions of a sequence. Each output position pulls a weighted combination of the positions it can see (its own included), with weights that depend on the content at each position rather than its index. For one query, $\operatorname{Attn}(q, K, V) = \sum_i \alpha_i(q)\, v_i$, with weights $\alpha_i(q) \propto \exp(q \cdot k_i / \tau)$ normalized to sum to one, where $\tau > 0$ is a temperature.
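The single-query formula can be sketched in a few lines of plain Python. The helper names (`softmax`, `dot`, `attend`) and the toy keys and values are illustrative assumptions, not taken from any library.

```python
import math

def softmax(scores, tau=1.0):
    """Softmax of scores / tau, shifted by the max score for numerical stability."""
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values, tau=1.0):
    """Single-query attention: alpha_i proportional to exp(q . k_i / tau),
    output is sum_i alpha_i * v_i."""
    alpha = softmax([dot(q, k) for k in keys], tau)
    dim = len(values[0])
    out = [sum(a * v[d] for a, v in zip(alpha, values)) for d in range(dim)]
    return out, alpha

# Toy memory with made-up numbers.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
out, alpha = attend([2.0, 0.0], keys, values)
print(alpha, out)
```

The query scores the first and third keys equally and the second one lower, so the output is pulled toward an equal blend of the first and third values.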
For multi-head shapes, layer composition, positional encodings, and KV-cache memory in practice, see LLM Inference.
1. Fixed and query-dependent weights
A sufficient statistic such as the sample mean compresses data with fixed weights: every observation counts the same, whatever is asked of the summary later. Kernel regression makes the weights depend on a query and a chosen similarity. Attention makes the weights depend on a query and learned dot-product similarities.
2. Attention as Nadaraya–Watson kernel regression
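The Nadaraya–Watson estimator $\hat f(q) = \sum_i K(q, x_i)\, y_i \;/\; \sum_j K(q, x_j)$ is a weighted average of $y$-values, with weights given by a kernel of the distance from the query (this kernel $K$ is unrelated to the key matrix $K$ above). Soft attention with a Gaussian-shaped softmax ($\alpha_i \propto \exp(-\|q - k_i\|^2/(2\sigma^2))$, up to a query-only normalization) has the same normalized weighted-average form, with keys in the role of the $x_i$ and values in the role of the $y_i$. When all keys share the same norm, this is exactly dot-product attention at $\tau = \sigma^2$, because $-\|q - k_i\|^2/(2\sigma^2) = q \cdot k_i/\sigma^2 - (\|q\|^2 + \|k_i\|^2)/(2\sigma^2)$ and the second term is then the same for every $i$.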
At small $\tau$ the kernel is sharp and the prediction at $q$ is essentially the $y$-value of the nearest data point. Attention puts all its mass on one token, the entropy of the weight distribution is near zero, and the smoothed curve becomes a jagged step function. At large $\tau$ the kernel spans the whole interval, the weight distribution is nearly uniform, and the prediction flattens toward the global mean. Intermediate values trade fidelity against stability, the usual bias-variance tradeoff in non-parametric regression.
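A small sketch of the two extremes, assuming a one-dimensional Gaussian-kernel Nadaraya–Watson estimator. The function name `nadaraya_watson`, the bandwidth argument `sigma`, and the data points are illustrative choices.

```python
import math

def nadaraya_watson(q, xs, ys, sigma):
    """Gaussian-kernel weighted average of ys, evaluated at query q."""
    d2 = [(q - x) ** 2 for x in xs]
    # Shift by the smallest squared distance so that a tiny sigma does not
    # underflow every weight to zero; the shift cancels in the ratio.
    d2_min = min(d2)
    w = [math.exp(-(d - d2_min) / (2.0 * sigma ** 2)) for d in d2]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 2.0, 5.0]

sharp = nadaraya_watson(0.9, xs, ys, sigma=0.01)    # nearest point is x = 1.0
flat = nadaraya_watson(0.9, xs, ys, sigma=1000.0)   # nearly uniform weights
print(sharp, flat, sum(ys) / len(ys))
```

With the narrow kernel the estimate is the $y$-value of the nearest point; with the wide kernel it sits next to the global mean of the $y$-values.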
3. Softmax is the unique entropy-regularized retrieval rule
Among all retrieval distributions $a$ over the memory items, with $s_i$ the score of item $i$, the softmax $a_i \propto \exp(s_i/\tau)$ is the unique maximizer of $$\mathbb{E}_a[s] \;+\; \tau\, H(a) \;=\; \sum_i a_i s_i \;-\; \tau \sum_i a_i \log a_i$$ subject to $\sum_i a_i = 1$, for any $\tau > 0$. The objective trades score against entropy: higher expected score prefers high-score items, while higher entropy spreads mass across items. The temperature $\tau$ sets that tradeoff.
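For readers who want the missing step, here is a short derivation of why the maximizer has the softmax form $a_i \propto \exp(s_i/\tau)$. The multiplier $\lambda$ and the normalizer $Z$ are notation introduced only for this sketch. Attach a Lagrange multiplier to the normalization constraint and set the derivative with respect to each $a_i$ to zero:

$$\mathcal{L}(a, \lambda) \;=\; \sum_i a_i s_i \;-\; \tau \sum_i a_i \log a_i \;+\; \lambda\Big(1 - \sum_i a_i\Big), \qquad \frac{\partial \mathcal{L}}{\partial a_i} \;=\; s_i - \tau(\log a_i + 1) - \lambda \;=\; 0.$$

Solving for $a_i$ and normalizing gives

$$a_i \;=\; \frac{\exp(s_i/\tau)}{Z}, \qquad Z \;=\; \sum_j \exp(s_j/\tau),$$

and substituting back shows the maximum value of the objective is $\tau \log Z$. For $\tau > 0$ the objective is strictly concave in $a$ (the score term is linear and the entropy term is strictly concave), so this stationary point is the only maximizer. The constraint $a_i \ge 0$ never binds because the solution is strictly positive.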
The temperature parameter controls how concentrated the retrieval distribution is. There are two limiting cases, argmax and uniform, with an intermediate regime between them.
- $\tau \to 0$ (argmax). The total objective collapses to $\max_i s_i$; the retrieval distribution puts all mass on the highest-scoring item; entropy is zero. Brittle but maximally relevant.
- $\tau \to \infty$ (uniform). The objective is dominated by the entropy term; the retrieval distribution is uniform; effective $k$ is $N$. Maximally spread, scores don't matter.
- Intermediate $\tau$. The softmax interpolates. Multiple items contribute; the "effective $k$" (i.e., $\exp H(a)$) tells you roughly how many memory items you're actually pulling from.
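The three regimes can be checked numerically. This is a minimal sketch with made-up scores; `entropy`, `objective`, and `effective_k` are hypothetical helper names for the quantities defined above.

```python
import math

def softmax(scores, tau):
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(a):
    """Shannon entropy in nats; entries equal to zero contribute nothing."""
    return -sum(p * math.log(p) for p in a if p > 0.0)

def objective(a, scores, tau):
    """Expected score plus tau times entropy."""
    return sum(p * s for p, s in zip(a, scores)) + tau * entropy(a)

def effective_k(a):
    """exp(H(a)): roughly how many items carry the retrieval mass."""
    return math.exp(entropy(a))

scores = [2.0, 1.0, 0.5, -1.0]  # illustrative scores only

for tau in (0.01, 1.0, 100.0):
    a = softmax(scores, tau)
    print(f"tau={tau:>6}: effective k = {effective_k(a):.3f}")

# At tau = 1, softmax should beat other candidate distributions on the objective.
best = objective(softmax(scores, 1.0), scores, 1.0)
uniform = objective([0.25] * 4, scores, 1.0)
one_hot = objective([1.0, 0.0, 0.0, 0.0], scores, 1.0)
print(best, uniform, one_hot)
```

Comparing against only two alternatives does not prove optimality, but it does show both failure modes losing: the one-hot distribution gives up entropy and the uniform one gives up score.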
4. Kernel variants
Changing the kernel changes how retrieval weight falls off with similarity or position. Dot-product attention, linear attention, local windows, and positional-bias schemes make different choices here. The slider sets the bandwidth / temperature; the curves show the resulting weight profile in each coordinate system.
Each kernel encodes a prior about which positions count as neighbors. The dot-product softmax permits global retrieval at $O(N^2)$ cost in sequence length. Linear attention swaps the kernel for one with an explicit feature map $\phi$, which permits the rearrangement $\sum_i \big(\phi(q)^\top \phi(k_i)\big)\, v_i^\top = \phi(q)^\top \big(\sum_i \phi(k_i) v_i^\top\big)$: the bracketed sum does not depend on the query, so it can be pre-aggregated once, bringing the cost to $O(N)$. Local-window attention restricts the kernel to a sliding interval. Position-bias schemes like ALiBi and T5 relative bias add a position-only term $b_i$ to the logits before the softmax. This is equivalent to multiplying the kernel by a position prior $r_i = \exp(b_i)$, which gives the KL-regularized form $a_i \propto r_i \exp(s_i/\tau)$ from the Aside in §3.
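The linear-attention rearrangement can be verified on a toy example. This sketch makes two assumptions: the feature map is $\phi(x) = \operatorname{elu}(x) + 1$, one choice used in the linear-attention literature, and the weights are normalized by $\sum_i \phi(q)^\top \phi(k_i)$. All function names are illustrative.

```python
import math

def phi(x):
    """Assumed feature map: elu(x) + 1 elementwise, which keeps features positive."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_attention_direct(queries, keys, values):
    """Direct form: every query visits every key, O(N^2) kernel evaluations."""
    dim = len(values[0])
    outs = []
    for q in queries:
        fq = phi(q)
        w = [dot(fq, phi(k)) for k in keys]
        z = sum(w)
        outs.append([sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(dim)])
    return outs

def linear_attention_preaggregated(queries, keys, values):
    """Rearranged form: build S = sum_i phi(k_i) v_i^T and z = sum_i phi(k_i) once."""
    f_dim, v_dim = len(phi(keys[0])), len(values[0])
    S = [[0.0] * v_dim for _ in range(f_dim)]
    z = [0.0] * f_dim
    for k, v in zip(keys, values):
        fk = phi(k)
        for r in range(f_dim):
            z[r] += fk[r]
            for c in range(v_dim):
                S[r][c] += fk[r] * v[c]
    outs = []
    for q in queries:
        fq = phi(q)
        denom = dot(fq, z)
        outs.append([sum(fq[r] * S[r][c] for r in range(f_dim)) / denom
                     for c in range(v_dim)])
    return outs

queries = [[0.5, -1.0], [1.0, 2.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
direct = linear_attention_direct(queries, keys, values)
fast = linear_attention_preaggregated(queries, keys, values)
print(direct)
print(fast)
```

Both functions return the same outputs up to floating-point rounding. The second touches each key once, no matter how many queries there are. This version is non-causal; a causal variant would update `S` and `z` one position at a time.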
5. Multi-head: many statistics at once
A multi-head layer computes several query-dependent statistics in parallel. Each head has its own learned $W_Q, W_K, W_V$, so one head can implement a previous-token pattern while another routes by syntax, delimiter position, or a copied token if training finds such a solution. The four patterns here are stylized examples, not measurements from a particular model.
In the stylized example, H1 implements a "look at the previous token" rule, a building block of induction circuits for in-context copying. H2 routes verbs to subjects and prepositions to head verbs. H3 is positional: weights decay with distance regardless of content. H4 attends to determiners and modifiers. Each head computes a different summary statistic of the prefix; their outputs are concatenated and passed through the output projection, so that later layers can combine them into higher-level statistics. Multi-head = multiple sufficient statistics in parallel.
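Two of the stylized heads can be imitated with hand-written score functions. This is only a sketch: in a trained model the scores come from learned $W_Q, W_K$ and each head has its own $W_V$, whereas here the scores are hard-coded and both heads read the same one-hot values so that each head's output equals its weight vector. All names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def previous_token_scores(t, n):
    """Stylized H1: a large score on position t - 1, zero elsewhere."""
    return [10.0 if i == t - 1 else 0.0 for i in range(n)]

def distance_decay_scores(t, n, slope=1.0):
    """Stylized H3: score falls linearly with distance from t, ignoring content."""
    return [-slope * (t - i) for i in range(n)]

def head_output(score_fn, t, values):
    n = t + 1  # causal: position t sees positions 0..t
    alpha = softmax(score_fn(t, n))
    dim = len(values[0])
    return [sum(a * values[i][d] for i, a in enumerate(alpha)) for d in range(dim)]

def multi_head(t, values, score_fns):
    """Run every head at position t and concatenate the head outputs."""
    out = []
    for score_fn in score_fns:
        out.extend(head_output(score_fn, t, values))
    return out

# One-hot values: each head's output is then just its attention weights.
values = [[1.0 if d == i else 0.0 for d in range(4)] for i in range(4)]
out = multi_head(3, values, [previous_token_scores, distance_decay_scores])
h1, h3 = out[:4], out[4:]
print(h1)
print(h3)
```

At position 3 the first head puts almost all of its weight on position 2, while the second spreads weight over the prefix with more mass on nearby positions.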
6. Causal masking and the KV cache
During autoregressive generation, the causal mask restricts attention at step $t$ to positions $i \le t$, so the output depends on the keys and values seen so far through $\sum_{i \le t} \alpha_i(q_t)\, v_i$. An efficient decoder caches those $(k_i, v_i)$ pairs as it goes. The cache stores raw keys and values rather than an accumulated summary, because each new query reweights them differently.
The "broadcast / uniform" pattern is what an unconditional summary statistic would look like: the same weights regardless of the query, so the cache could be collapsed to one accumulated $\sum v_i$. The non-uniform patterns (subject-pointer, previous-token) reweight differently for each query, which is exactly why a real KV cache stores the raw vectors. Attention is a family of sufficient statistics, one per query, and the cache is the data structure for computing any member of that family on demand. For memory growth, paging, and attention sinks, see KV cache memory.
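The contrast between a raw cache and a collapsed summary can be sketched as two small data structures. `KVCache` and `RunningMean` are hypothetical names for this illustration; a production cache would hold tensors per layer and per head.

```python
import math

class KVCache:
    """Append-only store of raw (key, value) pairs for a single head."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q, tau=1.0):
        # Every query re-scores all cached keys, so the raw vectors must be kept.
        scores = [sum(a * b for a, b in zip(q, k)) / tau for k in self.keys]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        dim = len(self.values[0])
        return [sum(wi * v[d] for wi, v in zip(w, self.values)) / z for d in range(dim)]

class RunningMean:
    """What the cache collapses to when the weights are uniform for every query."""
    def __init__(self, dim):
        self.total = [0.0] * dim
        self.count = 0

    def append(self, v):
        self.total = [t + x for t, x in zip(self.total, v)]
        self.count += 1

    def read(self):
        return [t / self.count for t in self.total]

cache, summary = KVCache(), RunningMean(dim=1)
for k, v in [([1.0, 0.0], [1.0]), ([0.0, 1.0], [2.0]), ([1.0, 1.0], [6.0])]:
    cache.append(k, v)
    summary.append(v)

uniform_out = cache.attend([0.0, 0.0])   # zero query: all scores equal
pointed_out = cache.attend([5.0, 0.0])   # favors keys with a large first coordinate
print(uniform_out, summary.read(), pointed_out)
```

A zero query gives every key the same score, so the cache lookup matches the running mean. Any other query produces a different reweighting that the running mean cannot reproduce.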
7. Adaptive sufficiency
Attention can be interpreted as an adaptive sufficient statistic: the weighted sum compresses the sequence into a summary chosen for the current prediction. Kernel regression interprets the softmax as Nadaraya–Watson with learned similarity. From the maximum-entropy side, the softmax is the unique solution to "be relevant, but don't be overconfident." These interpretations account for different design choices: the sum, the softmax, and the cache of raw vectors.