Attention

Attention computes a query-dependent weighted sum. The same algebra also describes adaptive statistics and kernel regression.

Attention mixes information across positions of a sequence. Each output position takes a weighted combination of the positions it can attend to, itself included, with weights that depend on the content at each position rather than its index. For one query $q$ over keys $k_i$ and values $v_i$, $\operatorname{Attn}(q, K, V) = \sum_i \alpha_i(q)\, v_i$, with weights $\alpha_i(q) \propto \exp(q \cdot k_i / \tau)$ normalized to sum to one, where $\tau$ is a temperature.
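The single-query formula can be sketched in a few lines of NumPy. This is a minimal illustration, not a production kernel; the function name `attend` and the toy keys and values are assumptions made for the example.

```python
import numpy as np

def attend(q, K, V, tau=1.0):
    """Single-query attention: softmax(K q / tau) weights applied to the rows of V."""
    scores = K @ q / tau                 # one score q . k_i / tau per memory item
    scores = scores - scores.max()       # shift for numerical stability; weights unchanged
    weights = np.exp(scores)
    weights = weights / weights.sum()    # alpha_i(q): non-negative, sums to 1
    return weights @ V, weights

# Hypothetical toy memory: 3 items with 2-d keys and 2-d values.
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

# Two different queries reweight the same memory differently.
out_a, w_a = attend(np.array([4.0, 0.0]), K, V)
out_b, w_b = attend(np.array([0.0, 4.0]), K, V)
print(w_a, out_a)
print(w_b, out_b)
```

The memory `K, V` is identical in both calls; only the query changes, and with it the weights and the output.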

The same weighted sum appears in older settings. In a sufficient statistic, a sum compresses data with fixed weights. In Nadaraya–Watson regression, the weights depend on a query and a chosen kernel. In attention, the weights depend on a query and learned key vectors. The KV cache stores raw keys and values because each later query reweights the same memory differently.

For multi-head shapes, layer composition, positional encodings, and KV-cache memory in practice, see LLM Inference.

1. Fixed and query-dependent weights

Sufficient statistic: $T(x) = \sum_i \phi(x_i)$. Fixed weights (all equal). The summary is a function of the data alone.

Kernel regression: $\hat f(q) = \sum_i w_i(q)\, y_i$. Weights depend on a query point. Nearby data contributes more.

Attention: $\operatorname{Attn}(q) = \sum_i \alpha_i(q)\, v_i$. Weights are a softmax over learned dot-product similarities.

A sufficient statistic compresses data with fixed weights. Kernel regression makes the weights depend on a query and a chosen similarity. Attention makes the weights depend on a query and learned dot-product similarities.

2. Attention as Nadaraya–Watson kernel regression

The Nadaraya–Watson estimator $\hat f(q) = \sum_i K(q, x_i)\, y_i \;/\; \sum_j K(q, x_j)$ is a weighted average of $y$-values, with weights given by a kernel of the distance between the query and each $x_i$. Soft attention with a Gaussian-shaped softmax ($\alpha_i \propto \exp(-\|q - k_i\|^2 / (2\sigma^2))$, with a normalizer that depends only on the query) has the same normalized weighted-average form.

Figure 2 · Attention as kernel regression
Legend: data $(x_i, y_i)$ (tokens with key $k_i = x_i$, value $v_i = y_i$); kernel bump at the query (kernel-regression view); attention weights $\alpha_i$ (attention view); predicted output $\hat f(q) = \sum_i \alpha_i y_i$.

At small $\tau$ the kernel is sharp and the prediction at $q$ is essentially the $y$-value of the nearest data point. Attention puts all its mass on one token, the entropy of the weight distribution is near zero, and the smoothed curve becomes a jagged step function. At large $\tau$ the kernel spans the whole interval, the weight distribution is nearly uniform, and the prediction flattens toward the global mean. Intermediate values trade fidelity against stability, the usual bias-variance tradeoff in non-parametric regression.
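The equivalence and the two bandwidth limits can be checked numerically. The sketch below assumes a one-dimensional toy dataset and a Gaussian kernel with bandwidth `sigma`, which plays the role of the temperature; the function names and data are illustrative only.

```python
import numpy as np

def nadaraya_watson(q, x, y, sigma):
    """Gaussian-kernel weighted average of y at the scalar query q."""
    k = np.exp(-(q - x) ** 2 / (2.0 * sigma ** 2))
    return (k * y).sum() / k.sum()

def distance_attention(q, keys, values, sigma):
    """Softmax over scores -(q - k_i)^2 / (2 sigma^2): the same weights in attention notation."""
    scores = -(q - keys) ** 2 / (2.0 * sigma ** 2)
    w = np.exp(scores - scores.max())    # max-shift keeps the sharp-kernel case stable
    w = w / w.sum()
    return w @ values, w

# Hypothetical data: keys are x, values are y.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 9.0])
q = 1.2

for sigma in (0.05, 0.5, 50.0):
    pred, w = distance_attention(q, x, y, sigma)
    entropy = -(w[w > 0] * np.log(w[w > 0])).sum()
    print(f"sigma={sigma:6.2f}  prediction={pred:.3f}  weight entropy={entropy:.3f}")
```

With the sharpest bandwidth the prediction sits at the value of the nearest key ($x = 1$); with the widest it approaches the mean of all values.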

In a transformer attention head, the score is $q \cdot k_i$ with $q = W_Q x_q$ and $k_i = W_K x_i$, so $\exp(q \cdot k_i / \tau)$ is an exponential kernel in those learned coordinates. Up to a query-only factor, which cancels in the normalization, and a key-norm factor $\exp(\|k_i\|^2 / 2\tau)$, which does not, that is a Gaussian RBF kernel with $\sigma^2 = \tau$; the two coincide exactly when all keys share the same norm. The projection matrices $W_Q, W_K$ are how the network chooses what counts as "near."
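Expanding the squared distance makes the relation between the dot-product kernel and the Gaussian kernel explicit. Starting from $\|q - k_i\|^2 = \|q\|^2 + \|k_i\|^2 - 2\, q \cdot k_i$:

$$q \cdot k_i \;=\; \tfrac{1}{2}\|q\|^2 + \tfrac{1}{2}\|k_i\|^2 - \tfrac{1}{2}\|q - k_i\|^2$$

$$\exp\!\left(\frac{q \cdot k_i}{\tau}\right) \;=\; \exp\!\left(\frac{\|q\|^2}{2\tau}\right)\, \exp\!\left(\frac{\|k_i\|^2}{2\tau}\right)\, \exp\!\left(-\frac{\|q - k_i\|^2}{2\tau}\right)$$

The first factor depends only on the query and cancels when the weights are normalized. The third is a Gaussian RBF kernel with $\sigma^2 = \tau$. The second depends on the key: it acts as a per-item weight that favors large-norm keys, and it drops out only when all keys have the same norm.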

3. Softmax is the unique entropy-regularized retrieval rule

Among all retrieval distributions $a$ over the memory items, softmax maximizes $$\mathbb{E}_a[s] \;+\; \tau\, H(a) \;=\; \sum_i a_i s_i \;-\; \tau \sum_i a_i \log a_i$$ subject to $\sum_i a_i = 1$ and $a_i \ge 0$, for $\tau > 0$. The objective trades score against entropy: higher expected score prefers high-score items, while higher entropy spreads mass across items. The temperature $\tau$ sets that tradeoff.

Figure 3 · Scores → softmax retrieval, with the objective on display
Legend: scores $s_i$ (relevance of each memory item); retrieval distribution $a_i = \operatorname{softmax}(s/\tau)_i$; contributions $a_i s_i$ (signed).

The temperature $\tau$ controls how concentrated the retrieval distribution is. As $\tau \to 0$ it approaches an argmax over the scores; as $\tau \to \infty$ it approaches the uniform distribution; intermediate values interpolate between the two.

The softmax is the unique distribution maximizing $\mathbb{E}_a[s] + \tau H(a)$. Setting up the Lagrangian with multiplier $\lambda$ for the constraint $\sum_i a_i = 1$ and differentiating with respect to $a_i$ gives $s_i - \tau(\log a_i + 1) - \lambda = 0$, so $\log a_i = (s_i - \lambda)/\tau - 1$, i.e. $a_i \propto e^{s_i/\tau}$. The objective is strictly concave for $\tau > 0$, so this stationary point is the unique maximizer. Equivalently, among distributions with a fixed expected score $\mathbb{E}_a[s] = \mu$, the maximum-entropy one is an exponential family with $s$ as the sufficient statistic. The same algebra gives Boltzmann distributions in statistical mechanics and exponential families in §7 of Sufficient Statistics.
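The variational claim can be sanity-checked numerically. The sketch below assumes an arbitrary score vector and temperature, evaluates the objective at the softmax, compares it with the closed-form maximum $\tau \log \sum_i e^{s_i/\tau}$, and confirms that random distributions on the simplex do not beat it. This is a check on one example, not a proof.

```python
import numpy as np

def softmax(s, tau):
    z = s / tau
    e = np.exp(z - z.max())              # max-shift for numerical stability
    return e / e.sum()

def objective(a, s, tau):
    """E_a[s] + tau * H(a), with the convention 0 * log 0 = 0."""
    nz = a > 0
    return float(a @ s - tau * (a[nz] * np.log(a[nz])).sum())

s = np.array([2.0, 1.0, 0.5, -1.0])      # hypothetical scores
tau = 0.7

a_star = softmax(s, tau)
best = objective(a_star, s, tau)

# Closed form of the maximum value: tau * log sum_i exp(s_i / tau).
closed_form = tau * np.log(np.exp(s / tau).sum())

# Random competitors drawn uniformly from the simplex.
rng = np.random.default_rng(0)
competitors = rng.dirichlet(np.ones(len(s)), size=1000)
smallest_gap = min(best - objective(a, s, tau) for a in competitors)

print(best, closed_form, smallest_gap)
```

The gap between the softmax and any other distribution $a$ is $\tau\, \mathrm{KL}(a \,\|\, a^\star)$, which is why it is never negative.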

4. Kernel variants

Changing the kernel changes how retrieval weight falls off with similarity or position. Dot-product attention, linear attention, local windows, and positional-bias schemes make different choices here. Figure 4 shows the resulting weight profile for each variant in its own coordinate system, at a chosen bandwidth or temperature.

Figure 4 · Kernel shapes for common attention variants
Legend: dot-product softmax, $\exp(q\cdot k / \tau)$, Gaussian-like; linear attention, $\phi(q)^\top\phi(k)$ via random features; local-window, indicator of $|i - j| \le w$; positional bias (ALiBi-style), linear decay.

Each kernel encodes a prior about which positions count as neighbors. The dot-product softmax permits global retrieval at $O(N^2)$ cost. Linear attention swaps the kernel for one with an explicit feature map $\phi$, which permits the rearrangement $\sum_i \phi(q)^\top \phi(k_i)\, v_i^\top = \phi(q)^\top \big(\sum_i \phi(k_i) v_i^\top\big)$, so the key-value sum can be pre-aggregated once and reused for every query, bringing the cost to $O(N)$. The normalizer $\sum_i \phi(q)^\top \phi(k_i)$ factors the same way. Local-window attention restricts the kernel to a sliding interval. Position-bias schemes like ALiBi and T5 relative bias add a position-only bias to the scores before the softmax. This is equivalent to multiplying the kernel by a position prior $r_i$, which gives the KL-regularized form $a_i \propto r_i \exp(s_i/\tau)$ from the Aside in §3.
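Two of these identities are easy to verify on random data: the linear-attention rearrangement and the bias-as-prior equivalence. The sketch below is illustrative only. It assumes the feature map $\phi(x) = \operatorname{elu}(x) + 1$ (one common positive choice, not something this page specifies), includes the normalizer, and uses a hypothetical linear decay with slope 0.1 for the position bias.

```python
import numpy as np

def phi(x):
    """Assumed positive feature map elu(x) + 1: x + 1 for x > 0, exp(x) otherwise."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention_explicit(q, K, V):
    """Builds all N weights phi(q)^T phi(k_i) for this query, then normalizes."""
    w = phi(K) @ phi(q)
    return (w / w.sum()) @ V

def linear_attention_aggregated(q, S, z):
    """Uses query-independent summaries S = sum_i phi(k_i) v_i^T and z = sum_i phi(k_i)."""
    fq = phi(q)
    return (fq @ S) / (fq @ z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N, d = 64, 4
K = rng.normal(size=(N, d))
V = rng.normal(size=(N, d))
q = rng.normal(size=d)

S = phi(K).T @ V             # (d, d) summary, built once in O(N)
z = phi(K).sum(axis=0)       # (d,) normalizer summary
out_explicit = linear_attention_explicit(q, K, V)
out_aggregated = linear_attention_aggregated(q, S, z)

# Position prior: adding a bias b_i to the scaled scores multiplies the kernel by r_i = exp(b_i).
tau = 2.0
scores = K @ q
distance = np.arange(N)[::-1]            # hypothetical: the query sits at the last position
bias = -0.1 * distance                   # ALiBi-style linear decay, slope assumed
a_bias = softmax(scores / tau + bias)
a_prior = np.exp(bias) * np.exp(scores / tau)
a_prior = a_prior / a_prior.sum()

print(np.allclose(out_explicit, out_aggregated), np.allclose(a_bias, a_prior))
```

The aggregated form never materializes per-item weights, which is what removes the dependence on $N$ at query time; the cost moves into the $d \times d$ summary.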

5. Multi-head: many statistics at once

A multi-head layer computes several query-dependent statistics in parallel. Each head has its own learned $W_Q, W_K, W_V$, so one head can implement a previous-token pattern while another routes by syntax, delimiter position, or a copied token if training finds such a solution. The four patterns here are stylized examples, not measurements from a particular model.

Figure 5 · Four stylized heads over three sentences
Legend: row $i$ shows how token $i$ distributes attention over the sequence; the causal mask (greyed) blocks attention to future positions.

In the stylized example, H1 implements a "look at the previous token" rule, a building block of induction circuits for in-context copying. H2 routes verbs to subjects and prepositions to head verbs. H3 is positional: weights decay with distance regardless of content. H4 attends to determiners and modifiers. Each head computes a different summary statistic of the prefix; their outputs are concatenated and passed through the layer's output projection and on to later layers, where they can be combined into higher-level statistics. In this view, multi-head attention computes multiple sufficient statistics in parallel.
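A minimal multi-head sketch shows the mechanics: per-head projections, a causal mask, and concatenation. The weights here are random rather than trained, so the heads will not show the stylized patterns above; the shapes, the function names, and the $1/\sqrt{d_\text{head}}$ scaling (used in place of a generic $\tau$) are assumptions of the example, and the output projection is omitted.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax with future positions (j > i) masked out."""
    n = scores.shape[-1]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)    # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # diagonal is never masked, so max is finite
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V):
    """X: (n, d_model); W_*: (heads, d_model, d_head). Returns concatenated outputs and weights."""
    Q = np.einsum("nd,hde->hne", X, W_Q)
    K = np.einsum("nd,hde->hne", X, W_K)
    V = np.einsum("nd,hde->hne", X, W_V)
    d_head = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, n, n)
    A = causal_softmax(scores)
    per_head = A @ V                                       # (heads, n, d_head)
    return np.concatenate(list(per_head), axis=-1), A      # (n, heads * d_head)

rng = np.random.default_rng(0)
n, d_model, heads, d_head = 5, 8, 4, 2
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(heads, d_model, d_head)) for _ in range(3))

out, A = multi_head_attention(X, W_Q, W_K, W_V)
print(out.shape, A.shape)
```

Each slice `A[h]` is one head's attention matrix: row $i$ is that head's weight distribution for token $i$ over positions up to and including $i$.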

6. Causal masking and the KV cache

During autoregressive generation, attention at step $t$ depends on the previous keys and values through $\sum_i \alpha_i(q_t) v_i$. An efficient decoder caches those $(k_i, v_i)$ pairs as it goes. The cache stores raw keys and values rather than an accumulated summary, because each new query reweights them differently.

Figure 6 · Token-by-token attention with a growing KV cache
Legend: tokens already in the cache; current query; attention weights $\alpha_i(q_t)$; output $\sum_i \alpha_i v_i$.

The "broadcast / uniform" pattern is what an unconditional summary statistic would look like: the same weights regardless of the query, so the cache could be collapsed to one accumulated $\sum v_i$. The non-uniform patterns (subject-pointer, previous-token) reweight differently for each query, which is exactly why a real KV cache stores the raw vectors. Attention is a family of sufficient statistics, one per query, and the cache is the data structure for computing any member of that family on demand. For memory growth, paging, and attention sinks, see KV cache memory.
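The contrast between a query-dependent cache and a collapsible summary can be sketched as a decode loop. The `KVCache` class below is a hypothetical minimal structure for illustration, not any library's API; queries, keys, and values are random stand-ins for what a model would produce.

```python
import numpy as np

def attend(q, K, V, tau=1.0):
    s = K @ q / tau
    w = np.exp(s - s.max())
    w = w / w.sum()
    return w @ V

class KVCache:
    """Hypothetical minimal cache: one raw (k, v) pair appended per decoded token."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q, tau=1.0):
        return attend(q, np.stack(self.keys), np.stack(self.values), tau)

rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))

cache = KVCache()
running_sum = np.zeros(d)
cached_out, uniform_out = [], []
for t in range(T):
    cache.append(K[t], V[t])
    cached_out.append(cache.attend(Q[t]))       # query-dependent: needs every raw (k_i, v_i)
    running_sum += V[t]
    uniform_out.append(running_sum / (t + 1))   # uniform weights: one accumulator is enough

# Reference: recompute causal attention from scratch at every step.
full_out = [attend(Q[t], K[: t + 1], V[: t + 1]) for t in range(T)]
print(np.allclose(cached_out, full_out))
```

The cached path reproduces the from-scratch result while only adding one pair per step. The uniform path needs no cache at all, which is the collapse the paragraph above describes.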

7. Adaptive sufficiency

Attention can be interpreted as an adaptive sufficient statistic: the weighted sum compresses the sequence into a summary chosen for the current prediction. From the kernel-regression side, the softmax-weighted sum is a Nadaraya–Watson estimator with a learned similarity. From the maximum-entropy side, the softmax is the unique solution to "be relevant, but don't be overconfident." These interpretations account for different design choices: the sum, the softmax, and the cache of raw vectors.

A transformer layer stacks attention with a feed-forward block; with depth, later layers build sufficient statistics of the sufficient statistics from earlier layers. Which computations those stacked summaries implement is a representation question, not an attention-only question.

What next