Attention

Attention computes a query-dependent weighted sum. The same algebra also describes adaptive statistics and kernel regression.

Attention mixes information across positions of a sequence. Each output position pulls a weighted combination of the positions it can attend to, itself included, with weights that depend on the content at each position rather than its index. For one query, $\operatorname{Attn}(q, K, V) = \sum_i \alpha_i(q)\, v_i$, with weights $\alpha_i(q) \propto \exp(q \cdot k_i / \tau)$.

The same weighted sum appears in older settings. In a sufficient statistic, a sum compresses data with fixed weights. In Nadaraya–Watson regression, the weights depend on a query and a chosen kernel. In attention, the weights depend on a query and learned key vectors. The KV cache stores raw keys and values because each later query reweights the same memory differently.

For multi-head shapes, layer composition, positional encodings, and KV-cache memory in practice, see LLM Inference.

1. Fixed and query-dependent weights

Sufficient statistic

T(x) = Σᵢ φ(xᵢ)

Fixed weights (all equal). The summary is a function of the data alone.

Kernel regression

f̂(q) = Σᵢ wᵢ(q)·yᵢ

Weights depend on a query point. Nearby data contributes more.

Attention

Attn(q) = Σᵢ αᵢ(q)·vᵢ

Weights are softmax over learned dot-product similarities.

A sufficient statistic compresses data with fixed weights. Kernel regression makes the weights depend on a query and a chosen similarity. Attention makes the weights depend on a query and learned dot-product similarities.
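The three weightings above can be put side by side in a few lines. This is a minimal sketch on made-up data: the arrays `x`, `y`, and `q`, the unit-bandwidth Gaussian kernel, and the choice to use the raw `x` as keys are all illustrative assumptions, not part of the original text.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 2))    # six made-up data points, also used as keys
y = rng.normal(size=6)         # one scalar value per point
q = np.array([0.5, -0.2])      # a made-up query point

# 1. Sufficient-statistic style: fixed, equal weights; no query involved.
fixed = y.sum()

# 2. Kernel regression: weights from a chosen Gaussian kernel of distance.
w = np.exp(-np.sum((q - x) ** 2, axis=1) / 2.0)
kernel = np.sum(w / w.sum() * y)

# 3. Attention: softmax over dot-product scores between query and keys.
alpha = softmax(x @ q)
attn = np.sum(alpha * y)
```

Only the first summary can be computed before the query is known; the other two need the raw points at query time.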

2. Attention as soft dictionary lookup

The default reading is a content-addressed lookup. Each position projects its residual vector into three roles: a query $q = W_Q x$, a key $k = W_K x$, and a value $v = W_V x$. A query is scored against every key by dot product, the scores are normalized with a softmax, and the output is the resulting blend of values, $\operatorname{Attn}(q, K, V) = \sum_i \alpha_i(q)\, v_i$ with $\alpha_i(q) = \operatorname{softmax}_i(q \cdot k_i / \tau)$. A key advertises what its position offers, a value carries what that position contributes, and a query states what the current position is looking for.

A hard dictionary returns the one value whose key matches the query. Attention returns a weighted average of every value, each weighted by how well its key matches — a differentiable lookup over an associative memory. The temperature $\tau$ sets how peaked the match is: small $\tau$ approaches a hard argmax, large $\tau$ averages over everything. The QK and OV circuits page separates the matching half (which positions a query reads) from the value-write half (what gets copied back). The sections that follow recast this same weighted sum — as kernel regression, and as the unique entropy-regularized retrieval rule.
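The soft lookup and its two temperature limits can be sketched as follows. The dimensions, the random projection matrices, and the helper name `attend` are hypothetical choices for illustration; a trained model would supply learned weights.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def attend(q, K, V, tau=1.0):
    """Return (output, weights) for one query against keys K and values V."""
    alpha = softmax(K @ q / tau)
    return alpha @ V, alpha

rng = np.random.default_rng(0)
d_model, d_head, n = 8, 4, 5
X = rng.normal(size=(n, d_model))          # residual vectors, one per position
# Random stand-ins for the learned projections (assumption).
W_Q, W_K, W_V = (rng.normal(size=(d_head, d_model)) for _ in range(3))

K = X @ W_K.T                              # k_i = W_K x_i
V = X @ W_V.T                              # v_i = W_V x_i
q = W_Q @ X[-1]                            # query from the last position

out_sharp, a_sharp = attend(q, K, V, tau=1e-3)   # close to a hard lookup
out_flat, a_flat = attend(q, K, V, tau=1e3)      # close to a plain average
```

With the small temperature the weights concentrate on the best-matching key, so `out_sharp` is nearly one row of `V`; with the large one `out_flat` is nearly the mean of all rows.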

3. Attention as Nadaraya–Watson kernel regression

The Nadaraya–Watson estimator $\hat f(q) = \sum_i K(q, x_i)\, y_i \;/\; \sum_j K(q, x_j)$ is a weighted average of $y$-values, with weights given by a kernel of distance from the query. Soft attention with a Gaussian-shaped softmax ($\alpha_i \propto \exp(-\|q - k_i\|^2/2\sigma^2)$, up to a query-only normalization) has the same normalized weighted-average form.

Figure 2 · Attention as kernel regression

Legend: data $(x_i, y_i)$, read as tokens with key $k_i = x_i$ and value $v_i = y_i$; kernel bump at the query (kernel-regression view); attention weights $\alpha_i$ (attention view); predicted output $\hat f(q) = \sum_i \alpha_i y_i$.

Controls: query position $q$; $\log_2$ temperature (bandwidth).

At small $\tau$ the kernel is sharp and the prediction at $q$ is essentially the $y$-value of the nearest data point. Attention puts all its mass on one token, the entropy of the weight distribution is near zero, and the smoothed curve becomes a jagged step function. At large $\tau$ the kernel spans the whole interval, the weight distribution is nearly uniform, and the prediction flattens toward the global mean. Intermediate values trade fidelity against stability, the usual bias-variance tradeoff in non-parametric regression.
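The two extremes can be checked numerically with a small Nadaraya–Watson estimator. The sketch below assumes a Gaussian kernel, a made-up sine-wave dataset on a grid, and an arbitrary query at 0.23; the function names are illustrative.

```python
import numpy as np

def nadaraya_watson(q, x, y, tau):
    """Gaussian-kernel weighted average of y at query q, bandwidth tau."""
    logits = -(q - x) ** 2 / (2.0 * tau)
    logits -= logits.max()                 # stabilise before exponentiating
    w = np.exp(logits)
    w /= w.sum()
    return w @ y, w

def entropy(w):
    w = w[w > 0]                           # treat 0 log 0 as 0
    return -np.sum(w * np.log(w))

x = np.linspace(0.0, 1.0, 11)              # hypothetical 1-D inputs
y = np.sin(2 * np.pi * x)                  # hypothetical targets
q = 0.23                                   # nearest grid point is x[2] = 0.2

y_sharp, w_sharp = nadaraya_watson(q, x, y, tau=2.0 ** -14)
y_flat, w_flat = nadaraya_watson(q, x, y, tau=2.0 ** 10)
```

Here `y_sharp` matches the value at the nearest point and `entropy(w_sharp)` is close to zero, while `y_flat` sits at the global mean and `entropy(w_flat)` is close to $\log 11$.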

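In a transformer attention head, the score is $q \cdot k_i$ with $q = W_Q x_q$ and $k_i = W_K x_i$, so $\exp(q \cdot k_i / \tau)$ is an exponential kernel in those learned coordinates. Expanding $q \cdot k_i = -\tfrac12\|q - k_i\|^2 + \tfrac12\|q\|^2 + \tfrac12\|k_i\|^2$ shows how it relates to a Gaussian: the $\|q\|^2$ term is a query-only constant that cancels in the normalization, while the $\|k_i\|^2$ term is a per-key factor. When the keys have equal norm, the weights are exactly those of a Gaussian RBF kernel with bandwidth set by $\tau$; otherwise longer keys receive extra weight. The projection matrices $W_Q, W_K$ are how the network chooses what counts as "near."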
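The relation between the dot-product kernel and the Gaussian kernel can be tested directly. The sketch below uses random vectors in place of learned projections (an assumption) and relies on the identity $q \cdot k = -\tfrac12\|q-k\|^2 + \tfrac12\|q\|^2 + \tfrac12\|k\|^2$; the helper names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
tau = 0.7
q = rng.normal(size=4)                     # stand-in for W_Q x_q
K = rng.normal(size=(6, 4))                # stand-ins for W_K x_i

def dot_weights(q, K):
    return softmax(K @ q / tau)

def gauss_weights(q, K):
    return softmax(-np.sum((q - K) ** 2, axis=1) / (2.0 * tau))

# The |q|^2 term cancels inside the softmax; the |k|^2 term does not,
# so the two weightings agree only when all keys share one norm.
K_unit = K / np.linalg.norm(K, axis=1, keepdims=True)

same_when_normalised = np.allclose(dot_weights(q, K_unit), gauss_weights(q, K_unit))
same_in_general = np.allclose(dot_weights(q, K), gauss_weights(q, K))
```

With unit-norm keys the two weightings coincide; with the raw random keys they do not, because the key norms differ.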

4. Softmax is the unique entropy-regularized retrieval rule

Among all retrieval distributions $a$ over the memory items, softmax maximizes $$\mathbb{E}_a[s] \;+\; \tau\, H(a) \;=\; \sum_i a_i s_i \;-\; \tau \sum_i a_i \log a_i$$ subject to $\sum a_i = 1$. The objective trades score against entropy: higher expected score prefers high-score items, while higher entropy spreads mass across items. The temperature $\tau$ sets that tradeoff.

Figure 3 · Scores → softmax retrieval, with the objective on display

Legend: scores $s_i$ (relevance of each memory item); retrieval distribution $a_i = \operatorname{softmax}(s/\tau)_i$; contributions $a_i s_i$ (signed).

Controls: $\log_2$ temperature; score pattern (single peak shown).

The temperature parameter controls how concentrated the retrieval distribution is. Two limiting cases, argmax and uniform, bracket the intermediate regime.

$\tau \to 0$ (argmax). The total objective collapses to $\max_i s_i$; the retrieval distribution puts all mass on the highest-scoring item; entropy is zero — brittle but maximally relevant.
$\tau \to \infty$ (uniform). The objective is dominated by the entropy term; the retrieval distribution is uniform; effective $k$ is $N$ — maximally spread, and scores don't matter.
Intermediate $\tau$. The softmax interpolates. Multiple items contribute; the "effective $k$" (i.e., $\exp H(a)$) tells you roughly how many memory items you're actually pulling from.
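The three regimes can be read off the effective $k$, $\exp H(a)$. This is a minimal sketch with a made-up score vector `s`; the temperatures are chosen only to land near each regime.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def effective_k(a):
    """exp(H(a)): roughly how many items carry the retrieval mass."""
    a = a[a > 0]                               # treat 0 log 0 as 0
    return float(np.exp(-np.sum(a * np.log(a))))

s = np.array([3.0, 1.0, 0.5, 0.0, -1.0])       # hypothetical relevance scores

for log2_tau in (-6, 0, 6):
    tau = 2.0 ** log2_tau
    a = softmax(s / tau)
    print(f"tau=2^{log2_tau:+d}  effective k = {effective_k(a):.2f}")
```

The printed value moves from about 1 at the smallest temperature toward $N = 5$ at the largest.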

The softmax is the unique distribution maximizing $\mathbb{E}_a[s] + \tau H(a)$. Setting up the Lagrangian with multiplier $\lambda$ for the constraint $\sum a_i = 1$ and differentiating with respect to $a_i$ gives $s_i - \tau(\log a_i + 1) - \lambda = 0$, so $\log a_i = (s_i - \lambda)/\tau - 1$, i.e. $a_i \propto e^{s_i/\tau}$ once the constants are absorbed into the normalization. Uniqueness follows because the objective is strictly concave in $a$ for $\tau > 0$. Equivalently, among distributions with a fixed expected score $\mathbb{E}_a[s] = \mu$, the maximum-entropy one is an exponential family with $s$ as the sufficient statistic. The same algebra gives Boltzmann distributions in statistical mechanics and exponential families in §7 of Sufficient Statistics.
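The optimality claim can be spot-checked numerically. The sketch below compares the softmax against random points on the simplex for one made-up score vector and temperature; it is a sanity check under those assumptions, not a proof.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def objective(a, s, tau):
    """E_a[s] + tau * H(a), with 0 log 0 taken as 0."""
    nz = a > 0
    return float(a @ s - tau * np.sum(a[nz] * np.log(a[nz])))

rng = np.random.default_rng(0)
s = np.array([2.0, 0.5, 0.0, -1.0])        # hypothetical scores
tau = 0.8
a_star = softmax(s / tau)
best = objective(a_star, s, tau)

# Random distributions on the simplex should never beat the softmax.
candidates = rng.dirichlet(np.ones_like(s), size=2000)
gaps = np.array([best - objective(a, s, tau) for a in candidates])

# At the optimum the objective equals tau * log sum_i exp(s_i / tau).
closed_form = tau * np.log(np.sum(np.exp(s / tau)))
```

Every entry of `gaps` is nonnegative; each gap is $\tau$ times the KL divergence from the candidate to the softmax, which is why only the softmax itself attains zero.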

5. Kernel variants

Changing the kernel changes how retrieval weight falls off with similarity or position. Dot-product attention, linear attention, local windows, and positional-bias schemes make different choices here. The slider sets the bandwidth / temperature; the curves show the resulting weight profile in each coordinate system.

Figure 4 · Kernel shapes for common attention variants

Legend: dot-product softmax, $\exp(q\cdot k / \tau)$, Gaussian-like; linear attention, $\phi(q)^\top\phi(k)$ via random features; local-window, indicator of $|i - j| \le w$; positional bias (ALiBi-style), linear decay.

Controls: bandwidth / window.

Each kernel encodes a prior about which positions count as neighbors. The dot-product softmax permits global retrieval at $O(N^2)$ cost. Linear attention swaps the kernel for one with an explicit feature map $\phi$, which permits the rearrangement $\sum_i \phi(q)^\top \phi(k_i)\, v_i^\top = \phi(q)^\top \big(\sum_i \phi(k_i) v_i^\top\big)$: the bracketed sum is aggregated once and reused for every query, which brings the cost to $O(N)$. Local-window attention restricts the kernel to a sliding interval. Position-bias schemes like ALiBi and T5 relative bias add a position-only term $b_i$ to the logits before the softmax. That is equivalent to multiplying the kernel by a position prior $r_i = e^{b_i}$, which is the KL-regularized form $a_i \propto r_i \exp(s_i/\tau)$ from the Aside in §4.
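The linear-attention rearrangement can be verified on random data. This sketch assumes a simple positive feature map (`elu(x) + 1`, a stand-in for the random features mentioned in the figure), adds the usual row normalization, and ignores causal masking.

```python
import numpy as np

def phi(x):
    """Hypothetical positive feature map, elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
n, d, d_v = 50, 4, 3
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d_v))

# Quadratic form: build the full n x n weight matrix first.
A = phi(Q) @ phi(K).T                          # (n, n)
out_quadratic = (A @ V) / A.sum(axis=1, keepdims=True)

# Linear form: aggregate keys and values once, then reuse for every query.
S = phi(K).T @ V                               # (d, d_v) = sum_i phi(k_i) v_i^T
z = phi(K).sum(axis=0)                         # (d,)     = sum_i phi(k_i)
out_linear = (phi(Q) @ S) / (phi(Q) @ z)[:, None]
```

Both routes give the same output, but the second never materializes the $N \times N$ matrix: `S` and `z` have sizes that do not depend on the sequence length.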

6. Multi-head: many statistics at once

A multi-head layer computes several query-dependent statistics in parallel. Each head has its own learned $W_Q, W_K, W_V$, so one head can implement a previous-token pattern while another routes by syntax, delimiter position, or a copied token if training finds such a solution. The four patterns here are stylized examples, not measurements from a particular model.

Figure 5 · Four stylized heads over three sentences

Legend: row $i$ shows how token $i$ distributes attention over the sequence; greyed cells are the causal mask (attention to future positions).

Controls: sentence ("the cat sat on the mat" shown); display (all four heads shown); head index.

In the stylized example, H1 implements a "look at the previous token" rule, a building block of induction circuits for in-context copying. H2 routes verbs to subjects and prepositions to head verbs. H3 is positional: weights decay with distance regardless of content. H4 attends to determiners and modifiers. Each head computes a different summary statistic of the prefix. The head outputs are concatenated and passed through the layer's output projection, and later layers can combine them into higher-level statistics. Multi-head attention amounts to multiple sufficient statistics computed in parallel.
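A multi-head layer can be sketched as a loop over per-head projections. Everything here is an illustrative assumption: the weights are random rather than trained, the temperature is set to the conventional $\sqrt{d_{\text{head}}}$, a causal mask is applied, and `W_O` is the output projection applied to the concatenated heads.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head(X, W_Q, W_K, W_V, W_O):
    """X: (n, d_model); W_Q/W_K/W_V: (h, d_model, d_head); W_O: (h*d_head, d_model)."""
    n = X.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))
    heads, patterns = [], []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        scores = np.where(causal, scores, -np.inf)   # hide future positions
        A = softmax_rows(scores)                     # one attention pattern per head
        patterns.append(A)
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ W_O, np.stack(patterns)

rng = np.random.default_rng(0)
n, d_model, h, d_head = 6, 16, 4, 4
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(h * d_head, d_model))
out, patterns = multi_head(X, W_Q, W_K, W_V, W_O)
```

`patterns[j]` is the attention matrix of head `j`, the kind of grid the stylized heads depict; with random weights the patterns carry no linguistic meaning.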

7. Encoder and decoder masks

An encoder such as a masked-language model can use bidirectional context: each token can attend left and right, and a special masked position asks for a missing word. A causal decoder can only attend left when predicting the next token. The same attention operation therefore supports different evidence patterns because the mask changes which keys are visible.
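The two visibility patterns differ only in a boolean mask applied before the softmax. A minimal sketch, with hypothetical helper names and random scores standing in for real query-key products:

```python
import numpy as np

def visible(n, kind):
    """Boolean (n, n) matrix: entry [i, j] is True if query i may read key j."""
    if kind == "bidirectional":
        return np.ones((n, n), dtype=bool)
    if kind == "causal":
        return np.tril(np.ones((n, n), dtype=bool))
    raise ValueError(f"unknown mask kind: {kind}")

def masked_weights(scores, mask):
    scores = np.where(mask, scores, -np.inf)          # hidden keys get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n = 5
scores = rng.normal(size=(n, n))                      # made-up scores
enc = masked_weights(scores, visible(n, "bidirectional"))
dec = masked_weights(scores, visible(n, "causal"))
```

Every row of `enc` spreads weight over all positions, while row `i` of `dec` is zero beyond column `i`; the first decoder row can only attend to itself.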

Figure 6 · Bidirectional MLM mask versus causal decoder mask

Legend: visible attention cell; masked future or hidden cell; query row.

Controls: architecture (encoder / MLM shown); query token.

This difference matters for probe comparisons. A bidirectional encoder can represent a token using future context that a causal decoder has not seen. A masked-language-model objective also trains direct reconstruction of hidden tokens, while a causal-language-model objective trains next-token continuation. Layerwise syntax or binding readouts can therefore fragment differently across encoder, decoder, and encoder-decoder models even when the attention equation is the same.

8. Causal masking and the KV cache

During autoregressive generation, attention at step $t$ depends on the previous keys and values through $\sum_i \alpha_i(q_t) v_i$. An efficient decoder caches those $(k_i, v_i)$ pairs as it goes. The cache stores raw keys and values rather than an accumulated summary, because each new query reweights them differently.
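A cache of raw keys and values can be sketched as an append-only store. The class name `KVCache`, the single-head setup, and the random vectors are assumptions for illustration; the sketch restacks the lists at each step for clarity rather than speed.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

class KVCache:
    """Append-only store of raw keys and values for one head."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v, tau=1.0):
        # Add the new pair, then let the current query reweight everything stored.
        self.keys.append(k)
        self.values.append(v)
        K, V = np.stack(self.keys), np.stack(self.values)
        return softmax(K @ q / tau) @ V

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

cache = KVCache()
incremental = np.stack([cache.step(Q[t], K[t], V[t]) for t in range(n)])

# Reference: recompute causal attention from scratch at every step.
reference = np.stack([softmax(K[: t + 1] @ Q[t]) @ V[: t + 1] for t in range(n)])
```

The incremental outputs match the from-scratch reference, while each step computes only one new key and value instead of reprojecting the whole prefix.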

Figure 7 · Token-by-token attention with a growing KV cache

Legend: tokens already in the cache; current query; attention weights $\alpha_i(q_t)$; output $\sum_i \alpha_i v_i$.

Controls: generation step $t$; temperature $\tau$; head pattern (subject pointer shown).

The "broadcast / uniform" pattern is what an unconditional summary statistic would look like: the same weights regardless of the query, so the cache could be collapsed to one accumulated $\sum v_i$. The non-uniform patterns (subject-pointer, previous-token) reweight differently for each query, which is exactly why a real KV cache stores the raw vectors. Attention is a family of sufficient statistics, one per query, and the cache is the data structure for computing any member of that family on demand. For memory growth, paging, and attention sinks, see KV cache memory.
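The contrast between a collapsible summary and a cache of raw vectors can be made concrete. In this sketch, with made-up values, keys, and query, uniform weights are served by a running sum, while softmax weights give different outputs for two memories that share the same sum.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(8, 3))          # hypothetical values arriving one per step

# Query-independent (uniform) weights: a running sum and a count suffice.
running_sum, count = np.zeros(3), 0
uniform_outputs = []
for v in V:
    running_sum += v
    count += 1
    uniform_outputs.append(running_sum / count)
uniform_outputs = np.stack(uniform_outputs)

# Query-dependent weights: two memories with the same sum give different
# outputs, so the sum alone cannot stand in for the raw vectors.
K = rng.normal(size=(8, 3))
q = rng.normal(size=3)
alpha = np.exp(K @ q)
alpha /= alpha.sum()
V_reversed = V[::-1]                 # same set of values, hence the same sum
out_a, out_b = alpha @ V, alpha @ V_reversed
```

`uniform_outputs[t]` equals the mean of the first `t + 1` values without storing them, whereas `out_a` and `out_b` differ even though `V` and `V_reversed` have identical sums.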

9. Adaptive sufficiency

Attention can be interpreted as an adaptive sufficient statistic: the weighted sum compresses the sequence into a summary chosen for the current prediction. Kernel regression interprets the softmax as Nadaraya–Watson with learned similarity. From the maximum-entropy side, the softmax is the unique solution to "be relevant, but don't be overconfident." These interpretations account for different design choices: the sum, the softmax, and the cache of raw vectors.

A transformer layer stacks attention with a feed-forward block; with depth, later layers build sufficient statistics of the sufficient statistics from earlier layers. Which computations those stacked summaries implement is a representation question, not an attention-only question.

What next

Foundations

Sufficient Statistics

Fixed-weight summaries; attention is the query-dependent cousin of $T(x) = \sum \phi(x_i)$.

Systems

KV Cache Memory

The practical consequences of caching keys and values: memory growth, paging, and the cost of long contexts.

Likelihood

Fisher Information

Softmax-as-Gibbs and exponential families also drive the geometry of likelihood inference.