Posterior Summaries & Bayes Risk

Squared, absolute, and zero-one loss pick out the posterior mean, median, and mode, respectively. Three losses, three summaries. The three coincide when the posterior is symmetric and unimodal, and generally differ otherwise.

Given a posterior $p(\theta\mid y)$ and a loss $L(\theta, \hat\theta)$ that scores how bad an estimate $\hat\theta$ is when the truth is $\theta$, the Bayes risk at $\hat\theta$ is

$$ R(\hat\theta) \;=\; \mathbb{E}_{\theta \mid y}\!\bigl[L(\theta, \hat\theta)\bigr] \;=\; \int L(\theta, \hat\theta)\,p(\theta\mid y)\,d\theta. $$

The Bayes-optimal point estimate is $\hat\theta^\ast = \arg\min_{\hat\theta} R(\hat\theta)$. Three classical losses give three named summaries, each revealing a different feature of the posterior.
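The definition can be checked numerically. The following is a minimal sketch, not a definitive implementation: it assumes a hypothetical Gamma(shape 3, rate 1) posterior discretised on a grid, and the names `bayes_risk`, `bayes_estimate`, `squared_loss`, and `absolute_loss` are introduced here purely for illustration.

```python
import numpy as np

# Hypothetical posterior, chosen only for illustration: Gamma(shape=3, rate=1),
# with unnormalised density theta^2 * exp(-theta) (mean 3, median about 2.67).
theta = np.linspace(0.0, 40.0, 8001)
density = theta**2 * np.exp(-theta)
weights = density / density.sum()   # p(theta | y) d(theta) on the grid, sums to 1


def bayes_risk(loss, estimate):
    """Approximate R(estimate) = integral of L(theta, estimate) p(theta | y) by a grid sum."""
    return float(np.sum(loss(theta, estimate) * weights))


def bayes_estimate(loss, candidates):
    """Return the candidate with the smallest Bayes risk (a plain grid search)."""
    risks = [bayes_risk(loss, c) for c in candidates]
    return float(candidates[int(np.argmin(risks))])


def squared_loss(t, e):
    return (t - e) ** 2


def absolute_loss(t, e):
    return np.abs(t - e)


candidates = np.linspace(0.0, 8.0, 801)   # candidate estimates, step 0.01
best_squared = bayes_estimate(squared_loss, candidates)
best_absolute = bayes_estimate(absolute_loss, candidates)
print(f"squared loss  -> {best_squared:.2f}")
print(f"absolute loss -> {best_absolute:.2f}")
```

Under these assumptions the two minimisers should land close to the mean and the median of the assumed Gamma posterior. Any other loss with the signature `loss(theta, estimate)` can be passed in the same way.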

1. Three losses, three summaries

The canonical results, with log loss over distributions as a fourth row:

| Loss $L(\theta, \hat\theta)$ | Bayes-optimal estimate | Name | Sensitive to |
| --- | --- | --- | --- |
| $(\theta - \hat\theta)^2$ (squared) | posterior mean $\mathbb{E}[\theta\mid y]$ | MMSE | tails (variance) |
| $\lvert \theta - \hat\theta \rvert$ (absolute) | posterior median | MAD | quantile balance |
| $\mathbf 1[\theta \neq \hat\theta]$ (zero-one) | posterior mode $\arg\max p(\theta\mid y)$ | MAP | peak only |
| $-\log q(\theta)$ (log loss, over distributions $q$) | the posterior itself, $q^\ast = p(\theta\mid y)$ | | every part of the posterior |

The last row generalises the others: instead of asking for a point estimate $\hat\theta$, allow the answer to be a whole probability distribution $q$ and use the log loss $L(\theta, q) = -\log q(\theta)$. The Bayes risk is $\mathbb E_{\theta \mid y}[-\log q(\theta)]$, which is minimised by $q = p(\theta\mid y)$, with minimum value equal to the posterior entropy $H(\theta\mid y) = -\!\int p(\theta\mid y)\log p(\theta\mid y)\,d\theta$. Any other report $q$ is strictly worse by a KL gap: $\mathbb E_{\theta \mid y}[-\log q(\theta)] - H(\theta\mid y) = D(p\,\Vert\, q)$. A point estimate is the extreme case in which $q$ puts all its mass on a single value. Under log loss, reporting the full posterior therefore beats reporting any summary, by exactly the KL divergence.
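The KL identity is easy to verify on a small discrete example. The sketch below assumes a hypothetical four-point posterior with made-up probabilities; `log_loss_risk` and `kl_divergence` are illustrative helper names, not part of any library.

```python
import numpy as np

# Hypothetical discrete posterior over four parameter values (illustrative numbers).
posterior = np.array([0.5, 0.25, 0.15, 0.10])


def log_loss_risk(q, p=posterior):
    """Expected log loss E_p[-log q(theta)], i.e. the cross-entropy of q under p."""
    return float(-np.sum(p * np.log(q)))


def kl_divergence(p, q):
    """D(p || q) = sum of p * log(p / q), assuming all entries are strictly positive."""
    return float(np.sum(p * np.log(p / q)))


entropy = log_loss_risk(posterior)               # risk of reporting the posterior itself
uniform = np.full(4, 0.25)                       # a cruder report
sharpened = np.array([0.85, 0.05, 0.05, 0.05])   # close to a point estimate at the mode

for name, q in [("uniform", uniform), ("sharpened", sharpened)]:
    gap = log_loss_risk(q) - entropy
    print(f"{name}: excess risk {gap:.4f}, KL {kl_divergence(posterior, q):.4f}")
```

For each alternative report the excess risk over the entropy and the KL divergence should print as the same number, and both are positive.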

Squared loss → mean

Expanding $R(\hat\theta) = \mathbb{E}[(\theta - \hat\theta)^2]$,

$$ R(\hat\theta) = \mathbb{E}[\theta^2] \;-\; 2\hat\theta\,\mathbb{E}[\theta] \;+\; \hat\theta^2 = \operatorname{Var}(\theta\mid y) \;+\; \bigl(\hat\theta - \mathbb{E}[\theta\mid y]\bigr)^2. $$

A parabola in $\hat\theta$, with minimum at $\hat\theta^\ast = \mathbb E[\theta\mid y]$ and minimum value equal to the posterior variance, the minimum mean-squared error (MMSE).
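The decomposition can be confirmed on posterior draws. This is a sketch under stated assumptions: the Gamma(3, 1) draws stand in for real sampler output, and `squared_risk` is a hypothetical helper. With the population variance (`ddof=0`), the identity holds exactly for the empirical distribution of the draws.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for posterior draws (e.g. MCMC output); the Gamma(3, 1) choice is hypothetical.
samples = rng.gamma(shape=3.0, scale=1.0, size=100_000)

post_mean = samples.mean()
post_var = samples.var()   # population variance (ddof=0) of the draws


def squared_risk(estimate):
    """Monte Carlo estimate of E[(theta - estimate)^2] from the draws."""
    return float(np.mean((samples - estimate) ** 2))


for estimate in (1.0, 2.0, post_mean, 4.0):
    direct = squared_risk(estimate)
    decomposed = post_var + (estimate - post_mean) ** 2
    print(f"estimate={estimate:.3f}  direct={direct:.4f}  decomposed={decomposed:.4f}")
```

The two columns should agree for every estimate, and the risk at `post_mean` is just `post_var`.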

Absolute loss → median

For $R(\hat\theta) = \int |\theta - \hat\theta|\,p(\theta\mid y)\,d\theta$, differentiate:

$$ \frac{dR}{d\hat\theta} \;=\; -\!\int_{\hat\theta}^{\infty} p(\theta\mid y)\,d\theta \;+\!\int_{-\infty}^{\hat\theta} p(\theta\mid y)\,d\theta \;=\; 2\,F(\hat\theta\mid y) - 1. $$

Setting this to zero gives $F(\hat\theta^\ast\mid y) = \tfrac12$, the posterior median. The Bayes risk at the optimum is the mean absolute deviation about the median (MAD).
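The slope formula $2F(\hat\theta\mid y) - 1$ can be compared with a finite-difference slope of the Monte Carlo risk. The sketch assumes hypothetical lognormal draws as the posterior; `absolute_risk`, `empirical_cdf`, and `risk_slope` are names made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in posterior draws; the lognormal choice is a hypothetical skewed example.
samples = rng.lognormal(mean=0.0, sigma=0.75, size=200_001)


def absolute_risk(estimate):
    """Monte Carlo estimate of E|theta - estimate|."""
    return float(np.mean(np.abs(samples - estimate)))


def empirical_cdf(x):
    """Fraction of draws at or below x, an estimate of F(x | y)."""
    return float(np.mean(samples <= x))


def risk_slope(estimate, h=1e-3):
    """Central finite-difference approximation to dR/d(estimate)."""
    return (absolute_risk(estimate + h) - absolute_risk(estimate - h)) / (2 * h)


median = float(np.median(samples))
for estimate in (0.5, median, 2.0):
    slope = risk_slope(estimate)
    formula = 2 * empirical_cdf(estimate) - 1
    print(f"estimate={estimate:.3f}  slope={slope:+.4f}  2F-1={formula:+.4f}")
```

The slope should be negative below the median, close to zero at it, and positive above it, matching $2F - 1$ up to finite-difference error.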

Zero-one loss → mode

For continuous $\theta$ strict equality has probability zero, so use a thin-window limit: $L_\varepsilon(\theta, \hat\theta) = \mathbf 1[\,|\theta - \hat\theta| > \varepsilon/2\,]$. Then $R_\varepsilon(\hat\theta) = 1 - \int_{\hat\theta-\varepsilon/2}^{\hat\theta+\varepsilon/2} p(\theta\mid y)\,d\theta \approx 1 - \varepsilon\,p(\hat\theta\mid y)$ for small $\varepsilon$, and the minimiser is $\hat\theta^\ast = \arg\max_{\theta} p(\theta\mid y)$, the posterior mode, also known as the maximum a posteriori (MAP) estimate.

Equivalent intuition: zero-one loss rewards only a (near-)exact hit; every miss incurs the same cost, however large. The Bayes risk is therefore large unless the posterior concentrates mass near $\hat\theta$.
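The thin-window limit can be watched numerically. The sketch below again assumes a hypothetical Gamma(shape 3, rate 1) posterior, whose mode is at $\theta = 2$, and uses a grid CDF with linear interpolation; `window_risk` is an illustrative helper name.

```python
import numpy as np

# Hypothetical posterior for illustration: Gamma(shape=3, rate=1), mode at theta = 2.
theta = np.linspace(0.0, 40.0, 40001)      # grid step 0.001
density = theta**2 * np.exp(-theta)
cdf = np.cumsum(density) / density.sum()   # grid approximation to F(theta | y)


def window_risk(estimate, eps):
    """R_eps(estimate) = 1 - P(|theta - estimate| <= eps/2 | y), via the interpolated CDF."""
    upper = np.interp(estimate + eps / 2, theta, cdf)
    lower = np.interp(estimate - eps / 2, theta, cdf)
    return 1.0 - (upper - lower)


candidates = np.linspace(0.5, 6.0, 5501)   # candidate estimates, step 0.001
minimisers = {}
for eps in (2.0, 0.5, 0.1):
    risks = window_risk(candidates, eps)   # np.interp accepts an array of query points
    minimisers[eps] = float(candidates[np.argmin(risks)])
    print(f"eps={eps}: minimiser={minimisers[eps]:.3f}")
```

For a wide window the minimiser sits to the right of the mode, because the assumed posterior is right-skewed; as $\varepsilon$ shrinks it moves toward the mode.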

The mode is the odd one out. Mean and median are properties of the measure $\pi(\cdot \mid y)$ (defined by integrals and quantiles of the distribution itself), so they transform predictably under reparametrization: the mean commutes with affine maps, and the median with any monotone bijection. The mode is a property of the density with respect to a chosen reference measure: under a smooth bijection $\phi(\theta)$ the density picks up a Jacobian factor $|\phi'|^{-1}$ and the argmax moves. MAP estimates of $\log \sigma^2$ and of $\sigma^2$ therefore disagree, whereas the medians map onto each other exactly. The means do not map onto each other under the logarithm, since $\mathbb E[\log \sigma^2] \le \log \mathbb E[\sigma^2]$ by Jensen's inequality, but each is still defined without reference to a density. This is the same reparametrization-invariance issue that motivates Jeffreys priors.
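A short numerical check makes the $\log\sigma^2$ versus $\sigma^2$ comparison concrete. It is a sketch under a hypothetical assumption: the posterior of $\log\sigma^2$ is taken to be standard normal, so that $\sigma^2$ is lognormal. All variable names are illustrative.

```python
import numpy as np

# Hypothetical posterior: phi = log(sigma^2) is standard normal, so sigma^2 is lognormal.
phi = np.linspace(-6.0, 6.0, 12001)      # grid over phi, step 0.001
density_phi = np.exp(-0.5 * phi**2)      # unnormalised density in phi

sigma2 = np.exp(phi)                     # the same grid points expressed as sigma^2
density_sigma2 = density_phi / sigma2    # Jacobian factor |d(phi)/d(sigma^2)| = 1/sigma^2

map_phi = float(phi[np.argmax(density_phi)])            # MAP of log(sigma^2)
map_sigma2 = float(sigma2[np.argmax(density_sigma2)])   # MAP of sigma^2

rng = np.random.default_rng(2)
draws_phi = rng.normal(size=200_001)     # odd count, so the median is a single draw
draws_sigma2 = np.exp(draws_phi)

print(f"exp(MAP of log s2)    = {np.exp(map_phi):.3f}   MAP of s2    = {map_sigma2:.3f}")
print(f"exp(median of log s2) = {np.exp(np.median(draws_phi)):.3f}   "
      f"median of s2 = {np.median(draws_sigma2):.3f}")
print(f"exp(mean of log s2)   = {np.exp(draws_phi.mean()):.3f}   "
      f"mean of s2   = {draws_sigma2.mean():.3f}")
```

Under this assumption only the median row should show matching values: the MAP moves because of the Jacobian factor, and the mean moves because the exponential is not affine.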

2. Three losses on one posterior

Figure 1 shows a posterior with deliberately distinct mean, median, and mode. As the estimator $\hat\theta$ varies, the Bayes-risk curve $R(\hat\theta)$ for each loss attains its minimum at the matching summary.

Figure 1 · Bayes risk for squared, absolute, and zero-one loss. Markers: posterior mean, posterior median, posterior mode (MAP), and the chosen estimator $\hat\theta$.

3. Which to use

Standard advice, written from the point of view of how the estimate will be used:

| Use this loss when | Reach for | Caveats |
| --- | --- | --- |
| Errors have quadratic cost; you care about average squared error | posterior mean | Sensitive to heavy tails; can sit in a low-density region between modes |
| Errors have linear cost; you want a central tendency insensitive to tail mass | posterior median | Insensitive to tail mass; usually unique even when mode is not |
| Decisions are discrete and an exact match matters | posterior mode (MAP) | Ignores uncertainty; arbitrary on a flat ridge; misleading if multimodal |
| None of the above: you need the full distribution | report intervals, samples, or the full posterior | A single summary is always lossy |
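The caveat about the mean sitting between modes is easiest to see on a bimodal example. The sketch below assumes a hypothetical two-component normal mixture, evaluated on a grid; the weights, locations, and the helper `normal_pdf` are illustrative choices.

```python
import numpy as np

# Hypothetical bimodal posterior: 0.6 * N(-2, 0.5^2) + 0.4 * N(2, 0.5^2).
theta = np.linspace(-6.0, 6.0, 12001)   # grid step 0.001


def normal_pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))


density = 0.6 * normal_pdf(theta, -2.0, 0.5) + 0.4 * normal_pdf(theta, 2.0, 0.5)
weights = density / density.sum()       # normalised grid weights

post_mean = float(np.sum(theta * weights))
post_median = float(theta[np.searchsorted(np.cumsum(weights), 0.5)])
post_mode = float(theta[np.argmax(density)])

density_at_mean = float(np.interp(post_mean, theta, density))
print(f"mean={post_mean:.3f}  median={post_median:.3f}  mode={post_mode:.3f}")
print(f"density at mean / density at mode = {density_at_mean / density.max():.4f}")
```

In this example the mean falls in the valley between the two components, where the density is a small fraction of its peak value, while the mode reports the taller component and ignores the 40% of mass in the other one.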

A practical note: the MAP estimate is usually the most computationally accessible. It is the maximum of an unnormalised density, so finding it requires optimisation but no integration. This is why regularised maximum likelihood, fitted by gradient-based optimisation, implicitly corresponds to zero-one loss. The posterior mean and median require integrating the posterior, which is what Monte Carlo, MCMC, and variational inference exist to do.
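The contrast can be sketched in a few lines. The example assumes a hypothetical unnormalised posterior $\theta^2 e^{-\theta}$ on $\theta > 0$; the step size, iteration count, and starting point are arbitrary illustrative choices, and grid integration stands in for the Monte Carlo methods one would use in higher dimensions.

```python
import numpy as np


def log_unnormalised_posterior_grad(theta):
    """Gradient of log(theta^2 * exp(-theta)) = 2 log(theta) - theta.

    The normalising constant never appears.
    """
    return 2.0 / theta - 1.0


# MAP by plain gradient ascent on the log of the unnormalised density.
theta_map = 1.0   # arbitrary positive starting point
for _ in range(200):
    theta_map += 0.5 * log_unnormalised_posterior_grad(theta_map)

# The posterior mean needs integrals: a ratio of two grid sums here.
grid = np.linspace(0.0, 40.0, 8001)
unnormalised = grid**2 * np.exp(-grid)
theta_mean = float(np.sum(grid * unnormalised) / np.sum(unnormalised))

print(f"MAP  (optimisation only): {theta_map:.4f}")
print(f"mean (needs integration): {theta_mean:.4f}")
```

The optimisation touches the density only through local gradients, whereas the mean needs the density over its whole support, which is the part that becomes expensive as the dimension grows.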

What next