Posterior Summaries & Bayes Risk
Given a posterior $p(\theta\mid y)$ and a loss $L(\theta, \hat\theta)$ that scores how bad an estimate $\hat\theta$ is when the truth is $\theta$, the Bayes risk at $\hat\theta$ (here meaning the posterior expected loss) is
$$ R(\hat\theta) \;=\; \mathbb{E}_{\theta \mid y}\!\bigl[L(\theta, \hat\theta)\bigr] \;=\; \int L(\theta, \hat\theta)\,p(\theta\mid y)\,d\theta. $$
The Bayes-optimal point estimate is $\hat\theta^\ast = \arg\min_{\hat\theta} R(\hat\theta)$. Three classical losses give three named summaries, each revealing a different feature of the posterior.
1. Three losses, three summaries
The three canonical results, plus a fourth row for the case where the answer is a whole distribution rather than a point:
| Loss $L(\theta, \hat\theta)$ | Bayes-optimal estimate | Name | Sensitive to |
|---|---|---|---|
| $(\theta - \hat\theta)^2$ (squared) | posterior mean $\mathbb{E}[\theta\mid y]$ | MMSE | tails (variance) |
| $\lvert \theta - \hat\theta \rvert$ (absolute) | posterior median | MAD | quantile balance |
| $\mathbf 1[\theta \neq \hat\theta]$ (zero-one) | posterior mode $\arg\max p(\theta\mid y)$ | MAP | peak only |
| $-\log q(\theta)$ (log loss, over distributions $q$) | the posterior itself $q^\ast = p(\theta\mid y)$ | — | every part of the posterior |
The last row generalises the others: instead of asking for a point estimate $\hat\theta$, allow the answer to be a whole probability distribution $q$ and use the log loss $L(\theta, q) = -\log q(\theta)$. The Bayes risk is $\mathbb E_{\theta \mid y}[-\log q(\theta)]$, which is minimised by $q = p(\theta\mid y)$, with minimum value equal to the posterior entropy $H(\theta\mid y) = -\!\int p(\theta\mid y)\log p(\theta\mid y)\,d\theta$. Any other report $q$ is worse by exactly a KL gap: $\mathbb E_{\theta \mid y}[-\log q(\theta)] - H(\theta\mid y) = D(p\,\Vert\, q) \ge 0$, with equality only when $q = p$ almost everywhere. Under log loss, reporting the full posterior therefore beats reporting any other distribution, including one concentrated tightly around a point summary, by an amount the KL divergence measures exactly.
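The KL-gap identity can be checked numerically on a small discrete example. The sketch below is illustrative only: the four-point `posterior` and the uniform `report` are made-up values, and the helper names are not from the text.

```python
import numpy as np

def expected_log_loss(p, q):
    """E_p[-log q] for two discrete distributions on the same support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    """D(p || q) for discrete distributions with strictly positive entries."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical posterior over four parameter values (assumed for illustration).
posterior = np.array([0.1, 0.2, 0.4, 0.3])
# A competing report q: here simply the uniform distribution.
report = np.full(4, 0.25)

entropy = expected_log_loss(posterior, posterior)  # risk of reporting p itself
risk_q = expected_log_loss(posterior, report)      # risk of reporting q
gap = risk_q - entropy

print(f"entropy = {entropy:.4f}, risk(q) = {risk_q:.4f}, gap = {gap:.4f}")
print(f"KL(p || q) = {kl_divergence(posterior, report):.4f}")
```

The printed gap and the KL divergence should agree to floating-point precision, and the gap is zero only when `report` equals `posterior`.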
Squared loss → mean
Expanding $R(\hat\theta) = \mathbb{E}[(\theta - \hat\theta)^2]$,
$$ R(\hat\theta) = \mathbb{E}[\theta^2] \;-\; 2\hat\theta\,\mathbb{E}[\theta] \;+\; \hat\theta^2 = \operatorname{Var}(\theta\mid y) \;+\; \bigl(\hat\theta - \mathbb{E}[\theta\mid y]\bigr)^2, $$
where every expectation is taken under the posterior. This is a parabola in $\hat\theta$, with minimum at $\hat\theta^\ast = \mathbb E[\theta\mid y]$ and minimum value equal to the posterior variance, the minimum mean-squared error (MMSE).
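A quick Monte Carlo check of this decomposition, as a sketch: the Gamma draws stand in for posterior samples (an assumption made purely for illustration), and `squared_risk` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in posterior draws (assumed): Gamma(shape=3, scale=1), right-skewed.
draws = rng.gamma(shape=3.0, scale=1.0, size=50_000)

def squared_risk(est, draws):
    """Monte Carlo estimate of E[(theta - est)^2] under the posterior."""
    return float(np.mean((draws - est) ** 2))

post_mean = float(draws.mean())
post_var = float(draws.var())  # ddof=0, so the identity holds exactly on the sample

est = 4.2  # an arbitrary candidate estimate
lhs = squared_risk(est, draws)
rhs = post_var + (est - post_mean) ** 2

print(f"risk at {est}: {lhs:.6f}")
print(f"variance + squared offset: {rhs:.6f}")
print(f"risk at the mean: {squared_risk(post_mean, draws):.6f} (variance: {post_var:.6f})")
```

On the empirical distribution of the draws the two sides agree up to rounding, and the risk evaluated at the sample mean equals the sample variance.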
Absolute loss → median
For $R(\hat\theta) = \int |\theta - \hat\theta|\,p(\theta\mid y)\,d\theta$, differentiate:
$$ \frac{dR}{d\hat\theta} \;=\; -\!\int_{\hat\theta}^{\infty} p(\theta\mid y)\,d\theta \;+\!\int_{-\infty}^{\hat\theta} p(\theta\mid y)\,d\theta \;=\; 2\,F(\hat\theta\mid y) - 1, $$
where $F(\cdot\mid y)$ is the posterior CDF. Setting this to zero gives $F(\hat\theta^\ast\mid y) = \tfrac12$, the posterior median. The Bayes risk at the optimum is the mean absolute deviation about the median (MAD).
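The same result can be seen by brute force: minimise a sample estimate of the absolute-loss risk over a grid and compare with the sample median. This is a sketch under assumptions; the log-normal draws are a stand-in posterior and the grid range is chosen by hand.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in posterior draws (assumed): log-normal, so mean and median differ.
draws = rng.lognormal(mean=0.0, sigma=1.0, size=20_001)

def absolute_risk(est, draws):
    """Monte Carlo estimate of E|theta - est| under the posterior."""
    return float(np.mean(np.abs(draws - est)))

# Evaluate the risk on a grid of candidate estimates and take the argmin.
grid = np.linspace(0.0, 5.0, 2001)
risks = np.array([absolute_risk(g, draws) for g in grid])
best = float(grid[np.argmin(risks)])

median = float(np.median(draws))
mean = float(draws.mean())

print(f"grid minimiser: {best:.3f}, sample median: {median:.3f}, sample mean: {mean:.3f}")
# Derivative check: 2 F(est) - 1 should be close to zero at the median.
print(f"2F - 1 at the median: {2 * np.mean(draws <= median) - 1:.5f}")
```

Because the empirical risk is convex in the estimate, the grid minimiser lands within one grid step of the sample median, and the sample mean incurs a visibly larger absolute-loss risk.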
Zero-one loss → mode
For continuous $\theta$ strict equality has probability zero, so use a thin-window limit: $L_\varepsilon(\theta, \hat\theta) = \mathbf 1[\,|\theta - \hat\theta| > \varepsilon/2\,]$. Then $R_\varepsilon(\hat\theta) = 1 - \int_{\hat\theta-\varepsilon/2}^{\hat\theta+\varepsilon/2} p(\theta\mid y)\,d\theta \approx 1 - \varepsilon\,p(\hat\theta\mid y)$ for small $\varepsilon$, and the minimiser is $\hat\theta^\ast = \arg\max_{\theta} p(\theta\mid y)$, the posterior mode, also known as the maximum a posteriori (MAP) estimate.
Equivalent intuition: 0-1 loss only rewards being close to a peak; everything else incurs the same cost. The Bayes risk is large unless the posterior concentrates mass near $\hat\theta$.
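The thin-window limit can be illustrated on a grid. In this sketch the posterior is assumed to be Gamma(3, 1), whose mode is at 2; the grid, the window widths, and the helper `window_risk` are illustrative choices rather than anything prescribed above.

```python
import numpy as np

# Assumed posterior for illustration: Gamma(shape=3, scale=1), mode at theta = 2.
theta = np.linspace(0.0, 20.0, 20_001)
step = theta[1] - theta[0]
dens = theta**2 * np.exp(-theta)      # unnormalised density
dens /= dens.sum() * step             # normalise on the grid
cdf = np.cumsum(dens) * step          # grid approximation of the posterior CDF

def window_risk(est, eps):
    """R_eps(est) = 1 - posterior mass within eps/2 of est."""
    upper = np.interp(est + eps / 2, theta, cdf)
    lower = np.interp(est - eps / 2, theta, cdf)
    return 1.0 - (upper - lower)

# Minimise the windowed risk for progressively thinner windows.
minimisers = {}
for eps in (2.0, 0.5, 0.05):
    risks = window_risk(theta, eps)
    minimisers[eps] = float(theta[np.argmin(risks)])
    print(f"eps = {eps:<4}: minimiser = {minimisers[eps]:.3f}")

print(f"density argmax: {theta[np.argmax(dens)]:.3f}")
```

For a wide window the minimiser sits slightly to the right of the mode, because the window also collects mass from the heavier right tail; as the window shrinks, the minimiser approaches the density argmax.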
The mode is the odd one out. Mean and median are properties of the measure $\pi(\cdot \mid y)$ (defined by integrals and quantiles of the distribution itself), so they do not depend on the reference measure used to write down a density. Their behaviour under reparametrization differs, though: the median is carried along by any monotone bijection, while the mean is carried along only by affine maps. The mode is a property of the density with respect to a chosen reference measure: under a smooth bijection $\phi(\theta)$ the density picks up a Jacobian factor $|\phi'(\theta)|^{-1}$ and the argmax moves. MAP estimates of $\log \sigma^2$ and of $\sigma^2$ therefore disagree, whereas the medians map onto each other exactly (the means do not, because $\log$ is not affine). This is the same reparametrization-invariance issue that motivates Jeffreys priors.
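A small numerical sketch of the $\sigma^2$ versus $\log \sigma^2$ example. It assumes, purely for illustration, that $\log\sigma^2 \mid y$ is Normal with the made-up parameters `mu` and `s`; nothing here is tied to a particular model. It also shows that the mean, unlike the median, is not carried through the nonlinear map.

```python
import numpy as np

mu, s = 0.5, 0.8   # assumed: log(sigma^2) | y ~ Normal(mu, s^2)

# Mode in the phi = log(sigma^2) parametrisation (unnormalised density suffices).
phi = np.linspace(mu - 6 * s, mu + 6 * s, 200_001)
dens_phi = np.exp(-0.5 * ((phi - mu) / s) ** 2)
mode_phi = float(phi[np.argmax(dens_phi)])

# Change variables to v = sigma^2 = exp(phi): p_v(v) = p_phi(log v) / v (Jacobian).
v = np.exp(phi)
dens_v = dens_phi / v
mode_v = float(v[np.argmax(dens_v)])

# Medians and means from posterior draws (odd count, so the median is one draw).
rng = np.random.default_rng(3)
draws_phi = rng.normal(mu, s, size=100_001)
draws_v = np.exp(draws_phi)
median_phi, median_v = float(np.median(draws_phi)), float(np.median(draws_v))
mean_phi, mean_v = float(draws_phi.mean()), float(draws_v.mean())

print(f"MAP of sigma^2 directly:       {mode_v:.4f}")
print(f"exp(MAP of log sigma^2):       {np.exp(mode_phi):.4f}")
print(f"median of sigma^2:             {median_v:.4f}")
print(f"exp(median of log sigma^2):    {np.exp(median_phi):.4f}")
print(f"mean of sigma^2:               {mean_v:.4f}")
print(f"exp(mean of log sigma^2):      {np.exp(mean_phi):.4f}")
```

Under these assumptions the two MAP answers differ by the factor $e^{-s^2}$ introduced by the Jacobian, the two median answers coincide, and the two mean answers differ because the exponential is not an affine map.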
2. Three losses on one posterior
Figure 1 shows a posterior with deliberately distinct mean, median, and mode. As the candidate estimate $\hat\theta$ varies, each Bayes-risk curve $R(\hat\theta)$ attains its minimum at the matching summary.
Things to notice:
- On the skewed posterior, mode < median < mean, the canonical ordering when the right tail is heavier. The three Bayes-risk curves each minimise at a different point.
- On the bimodal posterior, the mode can sit on the taller peak while the median balances mass between modes and the mean lands between them. MAP can be a poor summary of a multi-modal posterior. It ignores everything except the highest peak.
- On a symmetric Gaussian, all three coincide.
- The squared-loss curve grows quickly away from the mean (it is a parabola whose minimum value is the posterior variance). The absolute-loss curve grows only linearly far from the median. The zero-one curve is nearly flat, close to its maximum, everywhere except in the immediate neighbourhood of a peak, where it dips.
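The bimodal case in the list above can be reproduced on a grid. The sketch below assumes a two-component Gaussian mixture with made-up weights and locations; it is not the posterior from Figure 1.

```python
import numpy as np

def normal_pdf(x, m, s):
    """Density of Normal(m, s^2) evaluated at x."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Assumed bimodal posterior: 60% of the mass near -2, 40% near +3.
theta = np.linspace(-6.0, 8.0, 14_001)
step = theta[1] - theta[0]
dens = 0.6 * normal_pdf(theta, -2.0, 0.5) + 0.4 * normal_pdf(theta, 3.0, 0.5)
dens /= dens.sum() * step                    # renormalise on the grid
cdf = np.cumsum(dens) * step

mode = float(theta[np.argmax(dens)])
median = float(theta[np.searchsorted(cdf, 0.5)])
mean = float(np.sum(theta * dens) * step)

# Posterior density at each summary: the mean sits in a near-empty valley.
for name, value in (("mode", mode), ("median", median), ("mean", mean)):
    print(f"{name:>6}: theta = {value:+.3f}, density = {np.interp(value, theta, dens):.5f}")
```

With these weights the mode sits on the taller peak, the median stays inside the heavier component, and the mean falls between the peaks where the density is orders of magnitude lower than at the mode.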
3. Which to use
Standard advice, written from the point of view of how the estimate will be used:
| Use this loss when | Reach for | Caveats |
|---|---|---|
| Errors have quadratic cost; you care about average squared error | posterior mean | Sensitive to heavy tails; can sit in a low-density region between modes |
| Errors have linear cost; you want a central tendency insensitive to tail mass | posterior median | Ignores how far out the tails extend; usually unique even when the mode is not |
| Decisions are discrete and an exact match matters | posterior mode (MAP) | Ignores uncertainty; arbitrary on a flat ridge; misleading if multimodal |
| None of the above: you need the full distribution | report intervals, samples, or the full posterior | A single summary is always lossy |
A practical note: the MAP estimate is the most computationally accessible of the three. It is the maximum of an unnormalised density, so it needs optimisation but no integration. That is why regularised maximum likelihood, fitted by gradient-based optimisation, implicitly chooses 0-1 loss: a penalised log-likelihood is an unnormalised log-posterior, and its maximiser is the MAP estimate. The posterior mean and median require integrating the posterior, which is what Monte Carlo & MCMC and variational inference exist to do.
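The contrast between optimising and integrating can be sketched on a toy problem. Everything specific here is an assumption made for illustration: 7 heads in 10 coin flips with a flat prior (so the posterior is Beta(8, 4)), a hand-picked gradient step size, and a basic random-walk Metropolis sampler with a hand-picked proposal scale. Both routines touch only the unnormalised log-posterior.

```python
import numpy as np

heads, tails = 7, 3  # assumed data; flat prior, so the posterior is Beta(8, 4)

def log_unnorm(t):
    """Unnormalised log-posterior of the coin bias t."""
    if not 0.0 < t < 1.0:
        return -np.inf
    return heads * np.log(t) + tails * np.log(1.0 - t)

# MAP by plain gradient ascent: optimisation only, no integration.
map_est = 0.5
for _ in range(200):
    grad = heads / map_est - tails / (1.0 - map_est)
    map_est += 0.01 * grad  # step size chosen by hand for this toy problem

# Posterior mean and median by random-walk Metropolis: integration by sampling.
rng = np.random.default_rng(2)
current, current_lp = 0.5, log_unnorm(0.5)
chain = []
for _ in range(20_000):
    proposal = current + 0.2 * rng.standard_normal()
    proposal_lp = log_unnorm(proposal)
    # Accept with probability min(1, p(proposal) / p(current)).
    if np.log(rng.uniform()) < proposal_lp - current_lp:
        current, current_lp = proposal, proposal_lp
    chain.append(current)
draws = np.array(chain[2_000:])  # discard burn-in

mc_mean = float(draws.mean())
mc_median = float(np.median(draws))
print(f"MAP (gradient ascent): {map_est:.4f}   exact mode: {7 / 10:.4f}")
print(f"mean (Metropolis):     {mc_mean:.4f}   exact mean: {8 / 12:.4f}")
print(f"median (Metropolis):   {mc_median:.4f}")
```

The optimiser converges in a few hundred cheap gradient steps, whereas the mean and median need thousands of sampler iterations and still carry Monte Carlo error.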