Posterior Summaries & Bayes Risk

Squared, absolute, and zero-one loss pick out the posterior mean, median, and mode. Three losses, three summaries. All three coincide when the posterior is symmetric and unimodal.

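Given a posterior $p(\theta\mid y)$ and a loss $L(\theta, \hat\theta)$ that scores how bad an estimate $\hat\theta$ is when the truth is $\theta$, the Bayes risk at $\hat\theta$ (strictly, the posterior expected loss, since the average is over $\theta$ given the observed $y$ only) is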

$$ R(\hat\theta) \;=\; \mathbb{E}_{\theta \mid y}\!\bigl[L(\theta, \hat\theta)\bigr] \;=\; \int L(\theta, \hat\theta)\,p(\theta\mid y)\,d\theta. $$

The Bayes-optimal point estimate is $\hat\theta^\ast = \arg\min_{\hat\theta} R(\hat\theta)$. Three classical losses give three named summaries, each revealing a different feature of the posterior.
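The definition can be checked numerically by brute force. The sketch below is a minimal illustration, not a recommended implementation: it assumes a hypothetical Gamma(3, 1) posterior on a grid (chosen here only because its mean and variance are both 3), and the helper name `bayes_risk` is made up for the example.

```python
import numpy as np

# Hypothetical posterior for illustration: an unnormalised Gamma(3, 1)
# density on a grid. Nothing here comes from a specific model.
theta = np.linspace(0.0, 30.0, 3001)
density = theta**2 * np.exp(-theta)
weights = density / density.sum()          # grid probabilities, sum to 1

def bayes_risk(loss, estimates):
    """Approximate R(theta_hat) = E[L(theta, theta_hat) | y] on the grid."""
    # Broadcast to shape (n_estimates, n_grid), then average over theta.
    values = loss(theta[None, :], estimates[:, None])
    return (values * weights[None, :]).sum(axis=1)

def squared(t, t_hat):
    return (t - t_hat) ** 2

candidates = np.linspace(0.0, 8.0, 801)
risk = bayes_risk(squared, candidates)
best = candidates[np.argmin(risk)]         # grid version of the arg min

print(f"minimiser ~ {best:.2f}, minimum risk ~ {risk.min():.3f}")
```

Swapping `squared` for another loss function changes which point the arg min picks out, which is the theme of the rest of the page.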

1. Three losses, three summaries

The three canonical results, plus a fourth row for log loss over whole distributions:

Loss $L(\theta, \hat\theta)$	Bayes-optimal estimate	Name	Sensitive to
$(\theta - \hat\theta)^2$ (squared)	posterior mean $\mathbb{E}[\theta\mid y]$	MMSE	tails (variance)
$\lvert \theta - \hat\theta \rvert$ (absolute)	posterior median	MAD	quantile balance
$\mathbf 1[\theta \neq \hat\theta]$ (zero-one)	posterior mode $\arg\max p(\theta\mid y)$	MAP	peak only
$-\log q(\theta)$ (log loss, over distributions $q$)	the posterior itself $q^\ast = p(\theta\mid y)$	—	every part of the posterior

The last row extends the setup: instead of asking for a point estimate $\hat\theta$, allow the answer to be a whole probability distribution $q$ and use the log loss $L(\theta, q) = -\log q(\theta)$. The Bayes risk is $\mathbb E_{\theta \mid y}[-\log q(\theta)]$, minimized by $q = p(\theta\mid y)$ with minimum value equal to the posterior entropy $H(\theta\mid y) = -\!\int p(\theta\mid y)\log p(\theta\mid y)\,d\theta$ (the differential entropy when $\theta$ is continuous). Any other report $q$ is strictly worse by a KL gap: $\mathbb E_{\theta\mid y}[-\log q(\theta)] - H(\theta\mid y) = D(p\,\Vert\, q)$. A point estimate, reported as a distribution concentrated around $\hat\theta$, is one such $q$, so under log loss the full posterior beats any summary by an amount the KL divergence measures exactly.
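The KL-gap identity is easy to verify on a small discrete example. The probabilities below are made-up illustrative values, with `p` standing in for the posterior and `q` for a competing report.

```python
import numpy as np

# Hypothetical discrete posterior p and an alternative report q.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

entropy = -(p * np.log(p)).sum()          # log-loss risk of reporting p itself
cross_entropy = -(p * np.log(q)).sum()    # log-loss risk of reporting q
kl = (p * np.log(p / q)).sum()            # D(p || q)

gap = cross_entropy - entropy             # excess risk from reporting q
print(f"gap = {gap:.6f}, KL = {kl:.6f}")
```

The two printed numbers agree up to floating-point error, and the gap is positive whenever `q` differs from `p`.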

Squared loss → mean

Expanding $R(\hat\theta) = \mathbb{E}[(\theta - \hat\theta)^2]$, with every expectation taken under the posterior,

$$ R(\hat\theta) = \mathbb{E}[\theta^2] \;-\; 2\hat\theta\,\mathbb{E}[\theta] \;+\; \hat\theta^2 = \operatorname{Var}(\theta\mid y) \;+\; \bigl(\hat\theta - \mathbb{E}[\theta\mid y]\bigr)^2. $$

This is a parabola in $\hat\theta$ with its minimum at $\hat\theta^\ast = \mathbb E[\theta\mid y]$. The minimum value is the posterior variance, which is the minimum mean-squared error (MMSE).
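The variance-plus-offset decomposition also holds exactly for a set of posterior draws, with sample mean and sample variance in place of the population quantities. A minimal sketch, assuming hypothetical Gamma(3, 1) draws stand in for samples from the posterior:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical posterior draws; any sample would do for this identity.
draws = rng.gamma(shape=3.0, scale=1.0, size=10_000)

def squared_risk(theta_hat):
    """Monte Carlo estimate of E[(theta - theta_hat)^2 | y]."""
    return np.mean((draws - theta_hat) ** 2)

theta_hat = 1.7                                        # an arbitrary estimate
lhs = squared_risk(theta_hat)
rhs = draws.var() + (theta_hat - draws.mean()) ** 2    # variance + squared offset

print(f"direct = {lhs:.6f}, decomposition = {rhs:.6f}")
```

Because the offset term is a square, no choice of `theta_hat` can beat `draws.mean()`.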

Absolute loss → median

For $R(\hat\theta) = \int |\theta - \hat\theta|\,p(\theta\mid y)\,d\theta$, differentiate:

$$ \frac{dR}{d\hat\theta} \;=\; -\!\int_{\hat\theta}^{\infty} p(\theta\mid y)\,d\theta \;+\!\int_{-\infty}^{\hat\theta} p(\theta\mid y)\,d\theta \;=\; 2\,F(\hat\theta\mid y) - 1. $$

Setting this to zero gives $F(\hat\theta^\ast\mid y) = \tfrac12$, the posterior median. The Bayes risk at the optimum is the mean absolute deviation about the median (MAD).
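The same argument can be sketched with posterior draws: the empirical slope is $2\hat F(\hat\theta) - 1$, which changes sign at the sample median. The Gamma(3, 1) draws and the helper names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical posterior draws (odd count, so the sample median is a draw).
draws = rng.gamma(shape=3.0, scale=1.0, size=20_001)

def abs_risk(theta_hat):
    """Monte Carlo estimate of E[|theta - theta_hat| | y]."""
    return np.mean(np.abs(draws - theta_hat))

def risk_slope(theta_hat):
    """Empirical version of dR/dtheta_hat = 2 F(theta_hat) - 1."""
    return 2.0 * np.mean(draws <= theta_hat) - 1.0

median = np.median(draws)
candidates = np.linspace(0.0, 10.0, 1001)
risks = np.array([abs_risk(c) for c in candidates])
best = candidates[np.argmin(risks)]

print(f"grid minimiser = {best:.2f}, sample median = {median:.2f}")
print(f"slope below / above the median: "
      f"{risk_slope(median - 0.5):.3f} / {risk_slope(median + 0.5):.3f}")
```

The grid minimiser lands within one grid step of the sample median, and the slope is negative below it and positive above it.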

Zero-one loss → mode

For continuous $\theta$ strict equality has probability zero, so use a thin-window limit: $L_\varepsilon(\theta, \hat\theta) = \mathbf 1[\,|\theta - \hat\theta| > \varepsilon/2\,]$. Then $R_\varepsilon(\hat\theta) = 1 - \int_{\hat\theta-\varepsilon/2}^{\hat\theta+\varepsilon/2} p(\theta\mid y)\,d\theta \approx 1 - \varepsilon\,p(\hat\theta\mid y)$ for small $\varepsilon$, and the minimiser is $\hat\theta^\ast = \arg\max_{\theta} p(\theta\mid y)$, the posterior mode, also known as the maximum a posteriori (MAP) estimate.
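The thin-window limit can be watched numerically. The sketch below assumes a hypothetical Gamma(3, 1) posterior, whose CDF has a closed form and whose mode is at 2, and minimises $R_\varepsilon$ over a grid for shrinking $\varepsilon$.

```python
import numpy as np

def cdf(x):
    """CDF of the assumed Gamma(3, 1) posterior (zero for x <= 0)."""
    x = np.maximum(x, 0.0)
    return 1.0 - np.exp(-x) * (1.0 + x + 0.5 * x**2)

def window_risk(theta_hat, eps):
    """R_eps = 1 - posterior mass within eps/2 of theta_hat."""
    return 1.0 - (cdf(theta_hat + eps / 2) - cdf(theta_hat - eps / 2))

candidates = np.linspace(0.0, 8.0, 801)
minimisers = {}
for eps in (2.0, 0.5, 0.1):
    minimisers[eps] = candidates[np.argmin(window_risk(candidates, eps))]
    print(f"eps = {eps}: minimiser = {minimisers[eps]:.2f}")
```

For a wide window the minimiser sits a little to the right of the mode, because the window also collects mass from the heavier right tail. As the window narrows, the minimiser moves toward the mode.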

Equivalent intuition: 0-1 loss rewards only landing within the window around the truth, and every miss costs the same no matter how large it is. The Bayes risk stays near 1 unless the posterior concentrates mass near $\hat\theta$.

The mode is the odd one out. Mean and median are properties of the measure $\pi(\cdot \mid y)$ (defined by integrals and quantiles of the distribution itself), and so transform predictably under reparametrization: the mean commutes with affine maps, and the median is preserved under any monotone bijection. The mode is a property of the density with respect to a chosen reference measure: under a smooth bijection $\phi(\theta)$ the density picks up a Jacobian factor $|\phi'|^{-1}$ and the argmax moves. MAP estimates of $\log \sigma^2$ and of $\sigma^2$ disagree, while the medians are consistent (the median of $\log\sigma^2$ is the log of the median of $\sigma^2$). The means are consistent only under affine maps; for the logarithm, Jensen's inequality gives $\mathbb E[\log\sigma^2\mid y] \le \log \mathbb E[\sigma^2\mid y]$. This is the same reparametrization-invariance issue that motivates Jeffreys priors.
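A small numerical sketch of how the mode moves under reparametrization, assuming a hypothetical posterior in which $u = \log\sigma^2$ is Normal with mean `mu` and standard deviation `s` (so $\sigma^2$ is log-normal). The grids and parameter values are illustrative choices.

```python
import numpy as np

mu, s = 0.0, 0.8   # assumed posterior: u = log(sigma^2) ~ Normal(mu, s^2)

# Mode in the u = log(sigma^2) parametrization.
u = np.linspace(-6.0, 6.0, 120_001)
p_u = np.exp(-0.5 * ((u - mu) / s) ** 2)           # unnormalised density of u
mode_u = u[np.argmax(p_u)]

# Mode in the v = sigma^2 parametrization: p_v(v) = p_u(log v) / v,
# where 1 / v is the Jacobian factor |dv/du|^{-1}.
v = np.linspace(1e-3, 20.0, 200_001)
p_v = np.exp(-0.5 * ((np.log(v) - mu) / s) ** 2) / v
mode_v = v[np.argmax(p_v)]

# Medians from draws: a monotone map carries the median along with it.
rng = np.random.default_rng(0)
draws_u = rng.normal(mu, s, size=100_001)
median_u = np.median(draws_u)
median_v = np.median(np.exp(draws_u))

print(f"exp(mode of u) = {np.exp(mode_u):.3f}, mode of v = {mode_v:.3f}")
print(f"exp(median of u) = {np.exp(median_u):.3f}, median of v = {median_v:.3f}")
```

Mapping the MAP of $\log\sigma^2$ back through the exponential does not give the MAP of $\sigma^2$ (for this log-normal case the latter is $e^{\mu - s^2}$), whereas the two medians match.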

2. Three losses on one posterior

Figure 1 shows a posterior with deliberately distinct mean, median, and mode. Drag the slider for the estimator $\hat\theta$ and watch the chosen Bayes-risk curve $R(\hat\theta)$ pin its minimum to the matching summary.

Figure 1 · Bayes risk for squared, absolute, and zero-one loss

[Interactive figure. Legend: posterior mean, posterior median, posterior mode (MAP), your estimator $\hat\theta$. Controls: loss, posterior, estimator $\hat\theta$.]

Things to notice as you drag:

On the skewed posterior, mode < median < mean, the canonical ordering when the right tail is heavier. The three Bayes-risk curves each minimise at a different point.
On the bimodal posterior, the mode can sit on the taller peak while the median balances mass between modes and the mean lands between them. MAP can be a poor summary of a multi-modal posterior. It ignores everything except the highest peak.
On a symmetric Gaussian, all three coincide.
The squared-loss curve grows quickly away from the mean (it's a parabola whose minimum value is the posterior variance). The absolute-loss curve grows roughly linearly far from the median. The zero-one curve stays nearly flat except in the immediate neighbourhood of a peak, where it dips.
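The observations above can be reproduced without the figure. The sketch below computes the three summaries on a grid for three stand-in posteriors (a Gamma(3, 1), a two-component Gaussian mixture, and a single Gaussian). These shapes are assumptions chosen for illustration, not the exact curves behind Figure 1.

```python
import numpy as np

def summaries(theta, density):
    """Return (mean, median, mode) of a density tabulated on a uniform grid."""
    w = density / density.sum()
    mean = (theta * w).sum()
    median = theta[np.searchsorted(np.cumsum(w), 0.5)]
    mode = theta[np.argmax(density)]
    return mean, median, mode

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

theta = np.linspace(-10.0, 30.0, 40_001)
positive = np.clip(theta, 0.0, None)

posteriors = {
    "skewed": positive**2 * np.exp(-positive),           # Gamma(3, 1) shape
    "bimodal": 0.6 * normal_pdf(theta, 0.0, 1.0)
               + 0.4 * normal_pdf(theta, 6.0, 1.0),      # taller peak at 0
    "symmetric": normal_pdf(theta, 2.0, 1.0),
}

results = {name: summaries(theta, d) for name, d in posteriors.items()}
for name, (mean, median, mode) in results.items():
    print(f"{name:9s} mean = {mean:.2f}  median = {median:.2f}  mode = {mode:.2f}")
```

In the bimodal case the mean falls in the low-density valley between the peaks while the mode stays on the taller one, which is the failure pattern described above.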

3. Which to use

Standard advice, written from the point of view of how the estimate will be used:

Use this loss when	Reach for	Caveats
Errors have quadratic cost; you care about average squared error	posterior mean	Sensitive to heavy tails; can sit in a low-density region between modes
Errors have linear cost; you want a central tendency insensitive to tail mass	posterior median	Says nothing about how far the tails extend; usually unique even when the mode is not
Decisions are discrete and an exact match matters	posterior mode (MAP)	Ignores uncertainty; arbitrary on a flat ridge; misleading if multimodal
None of the above: you need the full distribution	report intervals, samples, or the full posterior	A single summary is always lossy

A practical note: the MAP estimate is usually the cheapest to compute. It is the maximum of an unnormalised density, so it needs no integration. That is why gradient-based optimisation of a penalised objective (regularised maximum likelihood, with the penalty playing the role of a negative log prior) implicitly chooses 0-1 loss. The posterior mean and median require integrating the posterior, which is what Monte Carlo & MCMC and variational inference exist to do.
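The computational contrast can be sketched on a toy case. The example assumes a hypothetical Beta(8, 4) posterior: the MAP comes from plain gradient ascent on the unnormalised log density, while the mean and median come from draws. Direct Beta sampling stands in here for whatever sampler a real model would need, and the step size and iteration count are arbitrary choices that happen to be stable for this density.

```python
import numpy as np

# Hypothetical posterior for illustration: Beta(a, b).
a, b = 8.0, 4.0

def grad_log_density(theta):
    """Gradient of the unnormalised log density
    (a - 1) log(theta) + (b - 1) log(1 - theta); no normalising constant."""
    return (a - 1.0) / theta - (b - 1.0) / (1.0 - theta)

# MAP by gradient ascent: optimisation only, no integration.
theta_map = 0.5
for _ in range(500):
    theta_map += 0.01 * grad_log_density(theta_map)
    theta_map = min(max(theta_map, 1e-6), 1.0 - 1e-6)   # stay inside (0, 1)

# Mean and median by Monte Carlo: these need draws from the posterior.
rng = np.random.default_rng(0)
draws = rng.beta(a, b, size=100_000)
theta_mean = draws.mean()
theta_median = np.median(draws)

print(f"MAP = {theta_map:.4f}, mean = {theta_mean:.4f}, median = {theta_median:.4f}")
```

For this density the closed-form mode is $(a-1)/(a+b-2) = 0.7$ and the mean is $a/(a+b) = 2/3$, so both routes can be checked against known values.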

What next

Decision theory

Hypothesis Testing

Bayes risk reappears as the prior-and-loss-weighted error rate that picks the optimal Neyman–Pearson threshold.

Posteriors

Conjugate Priors & the Exponential Family

These pages assume you already have a posterior. Conjugacy is the simplest way to get one.

Computation

Monte Carlo & MCMC

Posterior mean and median need integration; samplers approximate the integrals that MAP avoids.

Approximation

Free Energy & Variational Inference

Variational posteriors give an approximate mean directly; the mode of the variational $q$ approximates the true MAP.