Hierarchical Bayes

When many similar groups share a hyperprior, the posterior estimate of each one borrows strength from the rest by an amount the data itself determines.

Suppose you have $J$ groups (eight coaching programs, twenty hospitals, sixty players), and from each you collect $n_j$ observations. You want a point estimate $\hat\theta_j$ of each group's true mean. Two obvious estimators stake out the extremes:

Both are wrong in opposite directions. Partial pooling interpolates: each group's estimate is pulled toward the grand mean by an amount that depends on how much the groups appear to vary. Doing Bayes on a hierarchical model is the way to get it.

1. Three model structures, and how to choose

Before writing down the math, decide what you actually believe about the parameters. Three structural options, drawn as graphs of the dependence between the $\theta_j$'s:

StructureGraphAppropriate when
Separate (no pooling) $\theta_1 \to y_1$, $\theta_2 \to y_2$, ..., $\theta_J \to y_J$  (no link between $\theta_j$'s) The $\theta_j$'s are causally unrelated: different experiments, different populations, no common source of variation.
Pooled (complete pooling) $\theta \to y_1, y_2, \ldots, y_J$  (one $\theta$ for all) The groups share an identical underlying mean.
Hierarchical (partial pooling) $\tau \to \theta_j \to y_j$  ($\theta_j$'s share a common cause $\tau$) The $\theta_j$'s are drawn from a common population but are not identical. Knowing one tells you something about the others through their shared hyperparameter.

The decision turns on what you know:

Exchangeability is weaker than independence. Exchangeable $\theta_j$'s are not independent in general: knowing $\theta_1$ updates your belief about $\theta_2$ through the shared hyperparameter $\tau$. A common parent $\tau$ creates dependence between the children $\theta_j$, even though the children look like iid draws conditional on $\tau$. The hierarchical model is the right choice whenever you can neither defend strict independence nor strict identity.

Measure-theory aside (de Finetti). If the groups are exchangeable (their order carries no information), then de Finetti's theorem says the joint distribution of $(\theta_1, \ldots, \theta_J)$ must be a mixture of iid sequences. The mixing measure over the population distribution is exactly the hyperprior $p(\mu, \tau^2)$ of the hierarchical model. Hierarchical Bayes is therefore the measure-theoretic representation of any exchangeable collection of groups. Empirical Bayes plugs in a point estimate for that mixing measure rather than integrating over it.

2. The two-level model

The simplest hierarchical model puts a Gaussian prior on the group means and treats its parameters as unknown:

$$ \begin{aligned} y_{ij} \mid \theta_j &\sim \mathcal N(\theta_j,\, \sigma^2), \quad i = 1,\ldots, n_j, \\ \theta_j \mid \mu, \tau^2 &\sim \mathcal N(\mu,\, \tau^2), \quad j = 1,\ldots, J, \\ (\mu, \tau^2) &\sim p(\mu, \tau^2). \\ \end{aligned} $$

The hyperparameters $\mu$ and $\tau^2$ describe the population of group means. $\tau^2$ is the one that matters: it controls how far groups can plausibly differ from one another, and therefore how much each group is allowed to disagree with the others.

The DAG factorization is $p(y, \theta, \mu, \tau^2) = p(\mu, \tau^2)\,\prod_j p(\theta_j\mid\mu,\tau^2)\,\prod_{i,j} p(y_{ij}\mid\theta_j)$, a plate diagram with three nested layers (hyperparameters, group means, observations).

3. The posterior formula

With $\sigma^2$ and $(\mu, \tau^2)$ held fixed, each group's posterior is conjugate Normal–Normal. Writing $\sigma_j^2 = \sigma^2/n_j$ for the standard error of the raw group mean,

$$ \theta_j \mid y, \mu, \tau^2 \;\sim\; \mathcal N\!\bigl(\hat\theta_j,\; V_j\bigr), $$

with

$$ \hat\theta_j \;=\; \frac{\tau^2\,\bar y_j \;+\; \sigma_j^2\,\mu} {\tau^2 + \sigma_j^2} \;=\; \bar y_j \;+\; B_j\,(\mu - \bar y_j), \qquad B_j = \frac{\sigma_j^2}{\sigma_j^2 + \tau^2}. $$

$B_j \in [0, 1]$ controls how far group $j$'s estimate is pulled toward $\mu$.

The posterior variance contracts too: $V_j = \sigma_j^2\,(1 - B_j)$. The posterior estimate is tighter and less noisy than the raw group mean, at the cost of a small bias toward the grand mean.

Estimating $\tau$ from the data

A full Bayesian treatment puts a hyperprior on $(\mu, \tau^2)$ and integrates. A practical alternative, empirical Bayes, plugs in point estimates. Method of moments: the variance of the group means decomposes as

$$ \operatorname{Var}(\bar y_j) \;=\; \tau^2 + \mathbb E[\sigma_j^2], $$

so $\hat\tau^2 = \max\!\bigl(0,\; S^2_{\bar y} - \overline{\sigma_j^2}\bigr)$ where $S^2_{\bar y}$ is the sample variance of the group means. If observed between-group scatter is no larger than what within-group noise would produce on its own, the data are consistent with $\tau = 0$ and the model collapses to complete pooling.

4. The simulation

Figure 1 shows $J$ simulated groups. Each row is one group: the small dots are the $n_j$ raw observations, the blue ✕ is the no-pooling estimate $\bar y_j$, the green ○ is the partial-pooling estimate $\hat\theta_j$, and the gold ▲ marks the true $\theta_j$ for reference. The dashed black line is the grand mean.

Drag the between-group spread $\tau$: as it shrinks, every green ○ slides toward the grand mean; as it grows, the ○'s converge onto the ✕'s. "Fit $\tau$ from data" runs the empirical-Bayes step above. The readout reports the mean-squared error of each estimator against the true means $\theta_j$. The hierarchical estimate typically wins.

Figure 1 · No pooling vs. partial pooling vs. complete pooling
raw $y_{ij}$ true $\theta_j$ no-pool $\bar y_j$ partial-pool $\hat\theta_j$ grand mean

Things to look for:

5. In practice

What next