Hierarchical Bayes

When many similar groups share a hyperprior, the posterior estimate of each one borrows strength from the rest by an amount the data itself determines.

Suppose you have $J$ groups (eight coaching programs, twenty hospitals, sixty players), and from each you collect $n_j$ observations. You want a point estimate $\hat\theta_j$ of each group's true mean. Two obvious estimators stake out the extremes:

No pooling. Estimate each group on its own data: $\hat\theta_j^{\text{np}} = \bar y_j$. Unbiased for $\theta_j$, but high-variance when $n_j$ is small.
Complete pooling. Ignore group membership and use the grand mean $\hat\theta_j^{\text{cp}} = \bar y$. Low-variance, but biased whenever groups actually differ.

Both are wrong in opposite directions. Partial pooling interpolates: each group's estimate is pulled toward the grand mean by an amount that depends on how much the groups appear to vary. Doing Bayes on a hierarchical model is the way to get it.

1. Three model structures, and how to choose

Before writing down the math, decide what you actually believe about the parameters. Three structural options, drawn as graphs of the dependence between the $\theta_j$'s:

Structure	Graph	Appropriate when
Separate (no pooling)	$\theta_1 \to y_1$, $\theta_2 \to y_2$, ..., $\theta_J \to y_J$ (no link between $\theta_j$'s)	The $\theta_j$'s are causally unrelated: different experiments, different populations, no common source of variation.
Pooled (complete pooling)	$\theta \to y_1, y_2, \ldots, y_J$ (one $\theta$ for all)	The groups share an identical underlying mean.
Hierarchical (partial pooling)	$\tau \to \theta_j \to y_j$ ($\theta_j$'s share a common cause $\tau$)	The $\theta_j$'s are drawn from a common population but are not identical. Knowing one tells you something about the others through their shared hyperparameter.

The decision turns on what you know:

Strict independence requires positive information that the $\theta_j$'s share no common cause (measurements of unrelated physical constants, parameters of disjoint physical systems). Use the separate model.
Exchangeability is the natural default when you have no information distinguishing one $\theta_j$ from another a priori. The group indices are interchangeable labels. Use the hierarchical model (de Finetti, below, is why).
Identity requires positive information that the groups share a single $\theta$ (same population, same protocol, no between-group variation). Use the pooled model.

Exchangeability is weaker than independence. Exchangeable $\theta_j$'s are not independent in general: knowing $\theta_1$ updates your belief about $\theta_2$ through the shared hyperparameter $\tau$. A common parent $\tau$ creates dependence between the children $\theta_j$, even though the children look like iid draws conditional on $\tau$. The hierarchical model is the right choice whenever you can neither defend strict independence nor strict identity.

Measure-theory aside (de Finetti). If the groups are exchangeable (their order carries no information), then de Finetti's theorem says the joint distribution of $(\theta_1, \ldots, \theta_J)$ must be a mixture of iid sequences. The mixing measure over the population distribution is exactly the hyperprior $p(\mu, \tau^2)$ of the hierarchical model. Hierarchical Bayes is therefore the measure-theoretic representation of any exchangeable collection of groups. Empirical Bayes plugs in a point estimate for that mixing measure rather than integrating over it.

2. The two-level model

The simplest hierarchical model puts a Gaussian prior on the group means and treats its parameters as unknown:

$$ \begin{aligned} y_{ij} \mid \theta_j &\sim \mathcal N(\theta_j,\, \sigma^2), \quad i = 1,\ldots, n_j, \\ \theta_j \mid \mu, \tau^2 &\sim \mathcal N(\mu,\, \tau^2), \quad j = 1,\ldots, J, \\ (\mu, \tau^2) &\sim p(\mu, \tau^2). \\ \end{aligned} $$

The hyperparameters $\mu$ and $\tau^2$ describe the population of group means. $\tau^2$ is the one that matters: it controls how far groups can plausibly differ from one another, and therefore how much each group is allowed to disagree with the others.

The DAG factorization is $p(y, \theta, \mu, \tau^2) = p(\mu, \tau^2)\,\prod_j p(\theta_j\mid\mu,\tau^2)\,\prod_{i,j} p(y_{ij}\mid\theta_j)$, a plate diagram with three nested layers (hyperparameters, group means, observations).

3. The posterior formula

With $\sigma^2$ and $(\mu, \tau^2)$ held fixed, each group's posterior is conjugate Normal–Normal. Writing $\sigma_j^2 = \sigma^2/n_j$ for the standard error of the raw group mean,

$$ \theta_j \mid y, \mu, \tau^2 \;\sim\; \mathcal N\!\bigl(\hat\theta_j,\; V_j\bigr), $$

with

$$ \hat\theta_j \;=\; \frac{\tau^2\,\bar y_j \;+\; \sigma_j^2\,\mu} {\tau^2 + \sigma_j^2} \;=\; \bar y_j \;+\; B_j\,(\mu - \bar y_j), \qquad B_j = \frac{\sigma_j^2}{\sigma_j^2 + \tau^2}. $$

$B_j \in [0, 1]$ controls how far group $j$'s estimate is pulled toward $\mu$.

$\tau \to 0$ (groups indistinguishable) gives $B_j \to 1$: complete pooling.
$\tau \to \infty$ (groups completely separate) gives $B_j \to 0$: no pooling.
Smaller groups (larger $\sigma_j^2 = \sigma^2/n_j$) get pulled harder. Larger groups speak more clearly for themselves.

The posterior variance contracts too: $V_j = \sigma_j^2\,(1 - B_j)$. The posterior estimate is tighter and less noisy than the raw group mean, at the cost of a small bias toward the grand mean.

Estimating $\tau$ from the data

A full Bayesian treatment puts a hyperprior on $(\mu, \tau^2)$ and integrates. A practical alternative, empirical Bayes, plugs in point estimates. Method of moments: the variance of the group means decomposes as

$$ \operatorname{Var}(\bar y_j) \;=\; \tau^2 + \mathbb E[\sigma_j^2], $$

so $\hat\tau^2 = \max\!\bigl(0,\; S^2_{\bar y} - \overline{\sigma_j^2}\bigr)$ where $S^2_{\bar y}$ is the sample variance of the group means. If observed between-group scatter is no larger than what within-group noise would produce on its own, the data are consistent with $\tau = 0$ and the model collapses to complete pooling.

4. The simulation

Figure 1 shows $J$ simulated groups. Each row is one group: the small dots are the $n_j$ raw observations, the blue ✕ is the no-pooling estimate $\bar y_j$, the green ○ is the partial-pooling estimate $\hat\theta_j$, and the gold ▲ marks the true $\theta_j$ for reference. The dashed black line is the grand mean.

Drag the between-group spread $\tau$: as it shrinks, every green ○ slides toward the grand mean; as it grows, the ○'s converge onto the ✕'s. "Fit $\tau$ from data" runs the empirical-Bayes step above. The readout reports the mean-squared error of each estimator against the true means $\theta_j$. The hierarchical estimate typically wins.

Figure 1 · No pooling vs. partial pooling vs. complete pooling

raw $y_{ij}$ true $\theta_j$ no-pool $\bar y_j$ partial-pool $\hat\theta_j$ grand mean

number of groups $J$ 10

observations per group $n$ 6

within-group noise $sigma$ 1

true between-group spread $ au_{ ext{true}}$ 1.2

modelled $ au$ 1.2

Things to look for:

When $\tau_{\text{true}}$ is small (groups truly similar), the empirical-Bayes fit picks $\hat\tau$ near zero and almost all groups collapse to the grand mean. No-pooling estimates wander; the hierarchical estimates cluster cleanly around the truth.
When $\tau_{\text{true}}$ is large (groups truly different), $\hat\tau$ is large and the hierarchical estimates barely move. Both estimators perform similarly.
Small groups (lower $n$) get pulled much harder than large groups: their raw means are less trustworthy, and the model knows it.
The hierarchical MSE is almost always lower than the no-pool MSE. Complete pooling wins only when groups really do share a mean.

5. In practice

Mixed-effects / multilevel models (lme4, brms, PyMC, Stan): these are hierarchical Bayes by another name. The "random effects" are the $\theta_j$; the "variance components" are $\sigma^2$ and $\tau^2$.
Regularization: ridge regression is a hierarchical model with $\beta_k \sim \mathcal N(0, \tau^2)$. The ridge penalty $\lambda \lVert\beta\rVert^2$ corresponds to $\tau^2 = \sigma^2/\lambda$. Lasso is the same picture with a Laplace prior. See Bayesian regression for the full penalty-as-prior dictionary, including the $\ell_0$ / spike-and-slab case.
Connection to Stein's 1956 result: for $J \ge 3$ independent $y_j \sim \mathcal N(\theta_j, 1)$ under total squared-error loss, the James–Stein estimator $\hat\theta_j = \bigl(1 - (J-2)/\sum_k y_k^2\bigr)\,y_j$ dominates the naive $\hat\theta_j = y_j$ strictly and everywhere, even when the groups are unrelated. Same multiplicative shape as the formula in §2; the hierarchical-Bayes derivation explains why.
Limitations: pulling toward a common mean helps the average but can mislead on outlier groups, which the model treats as draws from the same Gaussian population. For heavy-tailed group means, replace the $\mathcal N(\mu, \tau^2)$ prior with a Student-$t$.

What next

Updates

Conjugate Priors & the Exponential Family

Each level of a hierarchical model is a conjugate update if the prior shapes are chosen well.

Decision

Posterior Summaries & Bayes Risk

Hierarchical estimates are posterior means under the model in §2, Bayes-optimal under squared loss.

Computation

Monte Carlo & MCMC

Real hierarchical posteriors aren't conjugate end-to-end; Gibbs and HMC handle the joint update.

Regularization

Bayesian Neural Networks

Layer-wise priors with shared variance hyperparameters are hierarchical Bayes for weights.