Conjugate Priors & the Exponential Family

Why some prior–likelihood pairs update in closed form, what the hyperparameters mean as pseudo-data, and the standard conjugate pairs.

Conjugate priors are useful for two reasons. First, computational convenience — posterior arithmetic reduces to adding sufficient statistics. Second, the hyperparameters of the conjugate prior have a clean interpretation as prior data: a Beta$(\alpha,\beta)$ prior on a coin's bias acts like $\alpha-1$ prior heads and $\beta-1$ prior tails; a Normal-Gamma prior on $(\mu,\sigma^2)$ acts like $\kappa_0$ prior observations of the mean and $2\alpha_0$ prior observations of the variance. The same arithmetic that turns the posterior back into a member of the family also turns each hyperparameter into an effective sample size.

One sentence: Whenever the likelihood is an exponential family $p(x\mid\eta)=h(x)\exp(\eta^\top T(x)-A(\eta))$, the conjugate prior has the form $\pi(\eta\mid\tau,\nu)\propto\exp(\eta^\top\tau-\nu A(\eta))$, and Bayesian updating is just $\tau\mapsto\tau+\sum_i T(x_i)$, $\nu\mapsto\nu+n$.

1. The exponential family (recap)

Conjugacy lives inside the exponential family — distributions of the canonical form

$$ p(x\mid\eta) \;=\; h(x)\,\exp\!\bigl(\eta^\top T(x) \;-\; A(\eta)\bigr), $$

where $\eta$ is the natural parameter, $T(x)$ the sufficient statistic, $h(x)$ the base measure, and $A(\eta)$ the log-partition function. The exponential-family page covers what each piece is called and why, why the log-partition borrows its name from statistical physics, how derivatives of $A$ give the moments of $T(X)$, the canonical-link / mean-function pair behind logistic and Poisson regression, and an interactive picker stepping through six standard members.

The one fact the conjugacy argument below needs: the natural parameter $\eta$ and sufficient statistic $T(x)$ meet only in an inner product $\eta^\top T(x)$, so $n$ i.i.d. observations summarize themselves into a fixed-size pair $(\sum_i T(x_i), n)$.

The Pitman–Koopman–Darmois theorem is the converse: for a family with fixed support, a $k$-dimensional sufficient statistic exists for every sample size $n$ if and only if the family is exponential of dimension at most $k$. Conjugate priors with finite-dimensional hyperparameters are therefore essentially restricted to this case; outside the exponential family, the posterior's complexity grows with $n$.

2. Conjugacy follows from exponential-family sufficient statistics

With $n$ i.i.d. observations $x_{1:n}$ from an exponential family, the joint likelihood is

$$ p(x_{1:n}\mid\eta) \;=\; \Bigl[\prod_i h(x_i)\Bigr]\, \exp\!\Bigl(\eta^\top \!\sum_i T(x_i) \;-\; n\,A(\eta)\Bigr). $$

Read off two numbers from the data: the total sufficient statistic $S=\sum_i T(x_i)$ and the sample size $n$. As a function of $\eta$, the likelihood depends only on $(S, n)$.
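
A minimal Python sketch of this point, using the Bernoulli family (whose canonical form is worked out in §3) as an assumed example. The function names are illustrative only. Two datasets that share $(S,n)$ give the same log-likelihood at every $\eta$, and the summary alone reproduces it.

```python
import math

def bernoulli_log_lik(xs, eta):
    """Log-likelihood of 0/1 data at natural parameter eta (the log-odds)."""
    a = math.log1p(math.exp(eta))      # A(eta) = log(1 + e^eta); h(x) = 1
    return sum(x * eta - a for x in xs)

def log_lik_from_summary(s, n, eta):
    """The same quantity computed only from S = sum_i T(x_i) and n."""
    return eta * s - n * math.log1p(math.exp(eta))

data_a = [1, 1, 0, 0, 1, 0]   # S = 3, n = 6
data_b = [0, 0, 0, 1, 1, 1]   # different sequence, same (S, n)

for eta in (-1.0, 0.0, 0.7):
    print(eta,
          bernoulli_log_lik(data_a, eta),
          bernoulli_log_lik(data_b, eta),
          log_lik_from_summary(sum(data_a), len(data_a), eta))
```

The three printed values on each row should agree up to floating-point rounding, which is all that "sufficient" means here.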

Now pick a prior with the same functional form in $\eta$ — that is, a density proportional to $\exp(\eta^\top\tau_0 - \nu_0 A(\eta))$ on the natural-parameter space:

$$ \pi(\eta\mid\tau_0,\nu_0) \;\propto\; \exp\!\bigl(\eta^\top\tau_0 - \nu_0\,A(\eta)\bigr). $$

The posterior is the product, and the product is again in this family:

$$ p(\eta\mid x_{1:n}) \;\propto\; \exp\!\Bigl(\eta^\top(\tau_0+S) - (\nu_0+n)\,A(\eta)\Bigr) \;=\; \pi\!\bigl(\eta\mid\tau_0+S,\;\nu_0+n\bigr). $$

Updating is the arithmetic

$$ \boxed{\;\tau_0 \;\mapsto\; \tau_0 + \sum_i T(x_i),\qquad \nu_0 \;\mapsto\; \nu_0 + n.\;} $$

Pseudo-count reading: $\nu_0$ is the number of pseudo-observations the prior is worth, and $\tau_0$ is their pseudo-total of sufficient statistics. A Beta$(\alpha,\beta)$ prior on a Bernoulli has $\nu_0=\alpha+\beta-2$ pseudo-trials with $\alpha-1$ pseudo-heads. Choosing the prior is choosing how much pretend data to start with.
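
The update can be sketched generically in a few lines of Python. `ConjugateState` and `update` are hypothetical names introduced for illustration, and the Beta mapping in the comments uses the pseudo-count convention stated above ($\nu_0=\alpha+\beta-2$, $\tau_0=\alpha-1$).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConjugateState:
    tau: float   # pseudo-total of sufficient statistics
    nu: float    # pseudo-sample-size

def update(state, xs, T=lambda x: x):
    """Add the data's sufficient statistics and count to the hyperparameters."""
    return ConjugateState(state.tau + sum(T(x) for x in xs),
                          state.nu + len(xs))

# Beta(2, 2) on a Bernoulli: tau_0 = alpha - 1 = 1, nu_0 = alpha + beta - 2 = 2.
prior = ConjugateState(tau=1.0, nu=2.0)
post = update(prior, [1, 0, 1, 1])          # three heads, one tail

# Map back to Beta parameters under the same convention.
alpha_n = post.tau + 1
beta_n = post.nu - post.tau + 1
print(alpha_n, beta_n)                      # Beta(2 + 3, 2 + 1)

# Updating in two batches gives the same state as updating once.
assert update(update(prior, [1, 0]), [1, 1]) == post
```

Because the update is plain addition, the order and batching of the data cannot matter, which is what the final assertion checks.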

3. Worked example: Beta–Bernoulli

The Bernoulli likelihood is $p(x\mid p) = p^{x}(1-p)^{1-x}$ for $x\in\{0,1\}$. In exponential-family form,

$$ p(x\mid p) \;=\; \exp\!\Bigl(x\,\underbrace{\log\tfrac{p}{1-p}}_{\eta} \;-\; \underbrace{(-\log(1-p))}_{A(\eta)}\Bigr), $$

so $T(x)=x$ and the natural parameter $\eta$ is the log-odds (the logit). Written in terms of the bias $p$, the conjugate kernel $\exp(\eta\tau_0 - \nu_0 A(\eta))$ equals $p^{\tau_0}(1-p)^{\nu_0-\tau_0}$; read as a density in $p$, this is the Beta distribution:

$$ p \;\sim\; \mathrm{Beta}(\alpha,\beta) \;\propto\; p^{\alpha-1}(1-p)^{\beta-1}, $$

where $\alpha = \tau_0+1$ and $\beta=\nu_0-\tau_0+1$. (Read instead as a density in $\eta$, the Jacobian $d\eta/dp = 1/(p(1-p))$ lowers both parameters by one, to $\alpha=\tau_0$ and $\beta=\nu_0-\tau_0$; the update rule is the same either way.) After observing $k$ successes and $n-k$ failures, $S=k$ and the posterior is

$$ p \mid x_{1:n} \;\sim\; \mathrm{Beta}(\alpha+k,\;\beta+n-k). $$

The $\alpha-1$ prior heads add to the $k$ observed heads; the $\beta-1$ prior tails add to the $n-k$ observed tails. The prior is literally a pretend dataset. The posterior mean

$$ \mathbb{E}[p\mid x_{1:n}] \;=\; \frac{\alpha+k}{\alpha+\beta+n} \;=\; \frac{n}{\alpha+\beta+n}\,\hat p_{\mathrm{ML}} \;+\; \frac{\alpha+\beta}{\alpha+\beta+n}\,\mathbb{E}[p] $$

is a convex combination of the maximum-likelihood estimate $\hat p_{\mathrm{ML}}=k/n$ and the prior mean $\mathbb{E}[p]=\alpha/(\alpha+\beta)$, weighted by the sample size against the prior strength $\alpha+\beta$. The flat prior Beta$(1,1)$, the Jeffreys prior Beta$(1/2,1/2)$, and the Haldane prior Beta$(0,0)$ are all conjugate — they differ only in how much pseudo-data they contribute and where the mass sits near the boundary $p\in\{0,1\}$.
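
As a quick numerical check of the convex-combination identity, here is a small Python sketch. The function names and the example numbers (a Beta$(2,2)$ prior, 7 heads in 10 flips) are assumptions chosen for illustration.

```python
def beta_bernoulli_update(alpha, beta, k, n):
    """Posterior Beta parameters after k successes in n trials."""
    return alpha + k, beta + (n - k)

def posterior_mean_as_blend(alpha, beta, k, n):
    """Posterior mean written as w * MLE + (1 - w) * prior mean (needs n > 0)."""
    w = n / (alpha + beta + n)          # weight on the data
    mle = k / n
    prior_mean = alpha / (alpha + beta)
    return w * mle + (1 - w) * prior_mean

alpha_n, beta_n = beta_bernoulli_update(2.0, 2.0, k=7, n=10)
direct = alpha_n / (alpha_n + beta_n)   # mean of Beta(alpha_n, beta_n)
blend = posterior_mean_as_blend(2.0, 2.0, k=7, n=10)
print(direct, blend)                    # the two should agree up to rounding
```

Both routes give $9/14$ here: the prior mean $1/2$ carries weight $4/14$ and the MLE $0.7$ carries weight $10/14$.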

Figure 1 · Beta–Bernoulli updating as adding pseudo-counts
Legend: prior Beta$(\alpha,\beta)$; posterior Beta$(\alpha+k,\beta+n-k)$; likelihood (normalized); MLE $k/n$; posterior mean.

The readout shows the posterior mean's weighting between the prior mean and the MLE. The strong-prior preset makes visible how the posterior resists a small contradicting sample — the prior is acting like extra trials.

4. Worked example: Normal–Normal (known variance)

With known variance $\sigma^2$, the Gaussian likelihood depends on the data only through $\bar x$ and $n$. The conjugate prior on the mean is Gaussian:

$$ \mu \sim \mathcal{N}(\mu_0,\sigma_0^2),\qquad x_{1:n}\mid\mu \sim \mathcal{N}(\mu,\sigma^2). $$

Working in precision $\lambda = 1/\sigma^2$, the update rule is precision addition:

$$ \lambda_n \;=\; \lambda_0 + n\lambda,\qquad \mu_n \;=\; \frac{\lambda_0 \mu_0 + n\lambda\,\bar x}{\lambda_n} \;=\; w\,\bar x + (1-w)\mu_0, $$

with $w = n\lambda/\lambda_n$, prior precision $\lambda_0 = 1/\sigma_0^2$, and posterior variance $\sigma_n^2 = 1/\lambda_n$. Prior precision and data precision simply add; the prior precision $\lambda_0$ is the equivalent of $\lambda_0/\lambda = \sigma^2/\sigma_0^2$ extra observations. The posterior mean is a precision-weighted blend, pulled toward whichever side is more precise.

The posterior predictive for a new observation $x_*$ is

$$ x_* \mid x_{1:n} \;\sim\; \mathcal{N}\!\bigl(\mu_n,\;\sigma^2+\sigma_n^2\bigr). $$

The predictive variance is the observation noise plus the residual uncertainty about $\mu$. It is wider than the likelihood and narrower than the prior predictive; as $n\to\infty$, $\sigma_n^2\to 0$ and the predictive collapses onto $\mathcal{N}(\mu_n,\sigma^2)$.
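
The precision-addition rule and the predictive variance can be sketched in Python as follows. The function name and the example numbers are illustrative assumptions, not part of any library.

```python
def normal_known_variance_update(mu0, sigma0_sq, sigma_sq, xs):
    """Posterior mean and variance of mu for a N(mu0, sigma0_sq) prior
    and observations with known variance sigma_sq."""
    n = len(xs)
    xbar = sum(xs) / n
    lam0 = 1.0 / sigma0_sq              # prior precision
    lam = 1.0 / sigma_sq                # precision of one observation
    lam_n = lam0 + n * lam              # precisions add
    mu_n = (lam0 * mu0 + n * lam * xbar) / lam_n
    return mu_n, 1.0 / lam_n

sigma_sq, sigma0_sq = 1.0, 4.0
mu_n, sigma_n_sq = normal_known_variance_update(
    mu0=0.0, sigma0_sq=sigma0_sq, sigma_sq=sigma_sq, xs=[1.2, 0.8, 1.0, 1.4])

predictive_var = sigma_sq + sigma_n_sq          # noise + uncertainty about mu
prior_predictive_var = sigma_sq + sigma0_sq
print(mu_n, sigma_n_sq, predictive_var, prior_predictive_var)
```

With these numbers the prior is worth $\sigma^2/\sigma_0^2 = 0.25$ observations, so four real observations already dominate the posterior mean.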

5. Worked example: Normal–Gamma (unknown mean and variance)

When both $\mu$ and the precision $\lambda=1/\sigma^2$ are unknown, the conjugate prior is the Normal–Gamma:

$$ \mathrm{NG}(\mu,\lambda\mid\mu_0,\kappa_0,\alpha_0,\beta_0) \;=\; \mathcal{N}\!\bigl(\mu\mid\mu_0,(\kappa_0\lambda)^{-1}\bigr)\, \mathrm{Gamma}(\lambda\mid\alpha_0,\beta_0). $$

Parametrized on the variance $\sigma^2$ instead of the precision $\lambda$, the same prior is called the Normal–scaled-Inverse-χ². A Gamma prior on $\lambda = 1/\sigma^2$ is a scaled Inv-χ² prior on $\sigma^2$, related by the reciprocal map $X\mapsto 1/X$ (the same map that takes Chi² to Inv-χ²). Bayesian texts vary in which they use; the posterior predictive below (Student-$t$) is the same either way. Many treatments (e.g. Gelman et al.) prefer the Inv-χ² form because $\nu_0 = 2\alpha_0$ has the direct interpretation of "prior effective sample size on the variance."

The mean's prior variance is tied to the unknown precision through $\kappa_0$: prior uncertainty about the mean is measured in units of the noise variance, $(\kappa_0\lambda)^{-1}=\sigma^2/\kappa_0$. After data with sample mean $\bar x$ and sum of squared deviations $s^2=\sum_i(x_i-\bar x)^2$, the four hyperparameters update as

$$ \kappa_n = \kappa_0 + n,\qquad \mu_n = \frac{\kappa_0\mu_0 + n\bar x}{\kappa_n}, $$

$$ \alpha_n = \alpha_0 + \tfrac{n}{2},\qquad \beta_n = \beta_0 + \tfrac{1}{2}s^2 + \frac{\kappa_0\,n\,(\bar x-\mu_0)^2}{2\kappa_n}. $$

Read off the pseudo-counts: $\kappa_0$ is the prior's number of effective observations of the mean, $2\alpha_0$ is its number of effective observations of the variance. The extra term in $\beta_n$ penalizes the prior mean for disagreeing with the data mean: prior–data conflict enlarges the rate $\beta_n$ of the Gamma posterior on $\lambda$, which lowers the expected precision $\alpha_n/\beta_n$ and so inflates the variance estimate.

Marginalizing out the precision gives the posterior predictive for a new $x_*$:

$$ x_* \mid x_{1:n} \;\sim\; t_{2\alpha_n}\!\Bigl(\mu_n,\;\frac{\beta_n(\kappa_n+1)}{\alpha_n\,\kappa_n}\Bigr). $$

A Student-$t$, with degrees of freedom equal to twice the posterior shape — small $n$ gives heavy tails reflecting variance uncertainty, large $n$ recovers the Gaussian predictive of the known-variance case.
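
The four-parameter update and the Student-$t$ predictive parameters can be sketched in Python. The function names are hypothetical, and the second returned value of `predictive_t_params` is the squared scale appearing in the formula above, not a variance.

```python
def normal_gamma_update(mu0, kappa0, alpha0, beta0, xs):
    """Normal-Gamma posterior hyperparameters after observing xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)   # sum of squared deviations
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

def predictive_t_params(mu_n, kappa_n, alpha_n, beta_n):
    """Degrees of freedom, location and squared scale of the predictive t."""
    df = 2 * alpha_n
    scale_sq = beta_n * (kappa_n + 1) / (alpha_n * kappa_n)
    return df, mu_n, scale_sq

prior = (0.0, 1.0, 1.0, 1.0)                # (mu0, kappa0, alpha0, beta0)
batch = normal_gamma_update(*prior, [1.0, 2.0, 4.0])

# Feeding the same points one at a time should land on the same posterior.
seq = prior
for x in [1.0, 2.0, 4.0]:
    seq = normal_gamma_update(*seq, [x])

print(batch, seq, predictive_t_params(*batch))
```

The agreement between the batch and one-at-a-time results is a useful sanity check on the cross term in $\beta_n$, which is the part most often mistyped.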

5a. Gaussian priors at a glance

The three Gaussian cases, lined up (two of them are worked in §4 and §5 above). Each row corresponds to which parameter is unknown; each column gives the canonical non-informative choice and the conjugate choice.

| Unknown parameter | Non-informative | Conjugate |
| --- | --- | --- |
| $\mu$ (variance $\sigma^2$ known) | Uniform on $\mathbb R$ (improper; limit $\sigma_0^2 \to \infty$ of the conjugate prior) | Gaussian $\mathcal N(\mu_0, \sigma_0^2)$ (§4 above) |
| $\sigma^2$ (mean $\mu$ known) | Jeffreys $p(\sigma^2) \propto 1/\sigma^2$ (improper) | Inverse-$\chi^2$ / Inverse-Gamma (equivalent: $\mathrm{Inv}\text{-}\chi^2_\nu = \mathrm{Inv}\text{-}\mathrm{Gamma}(\nu/2, 1/2)$) |
| $(\mu, \sigma^2)$ (both unknown) | Independence-Jeffreys $p(\mu, \sigma^2) \propto 1/\sigma^2$ (improper; the full multiparameter Jeffreys rule gives $\propto \sigma^{-3}$; see Fisher information) | Normal–Gamma on $(\mu, \lambda)$ where $\lambda = 1/\sigma^2$ (§5 above; aka Normal–scaled-Inverse-$\chi^2$ in the $\sigma^2$ parametrization) |

Naming: the Inverse-$\chi^2$ and Inverse-Gamma names refer to the same distribution under different parametrizations. Bayesian texts in the Gelman/BDA tradition use Inverse-$\chi^2$; the standard-pairs table below uses the more general Inverse-Gamma form. The scaled-Inverse-$\chi^2$ card on named distributions has the density formulas. The Normal–Gamma case has a sibling name "Normal–scaled-Inverse-$\chi^2$" when the variance is parametrized directly rather than through precision; same prior, different label.

6. Standard conjugate pairs

For most named likelihoods, the conjugate prior is also named. The hyperparameter update is always "add the sufficient statistic to $\tau$ and $n$ to $\nu$"; the table below just unpacks that rule in each row's conventional parametrization.

Bernoulli↔Binomial and Categorical↔Multinomial share their conjugate prior; they differ only in whether the likelihood is written as $n$ i.i.d. trials or as one aggregate count. The first form is the primitive sampling model, the second is its repeated-trial summary, with the same sufficient statistic either way and therefore the same Beta or Dirichlet conjugate.
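
A short Python sketch of that equivalence for the Categorical/Multinomial case. The function names are hypothetical, and classes are assumed to be labelled $0,\dots,K-1$. Updating from the raw sequence of draws and updating from the aggregate counts give the same Dirichlet posterior.

```python
from collections import Counter

def dirichlet_update_from_draws(alpha, draws):
    """Categorical view: a sequence of class labels in 0..K-1."""
    counts = Counter(draws)
    return [a + counts.get(k, 0) for k, a in enumerate(alpha)]

def dirichlet_update_from_counts(alpha, counts):
    """Multinomial view: one vector of per-class counts n_k."""
    return [a + c for a, c in zip(alpha, counts)]

alpha = [1.0, 1.0, 1.0]                 # flat Dirichlet over three classes
draws = [0, 2, 2, 1, 2]                 # five categorical observations
counts = [1, 1, 3]                      # the same data, aggregated

post_from_draws = dirichlet_update_from_draws(alpha, draws)
post_from_counts = dirichlet_update_from_counts(alpha, counts)
print(post_from_draws, post_from_counts)
```

The multinomial coefficient that distinguishes the two likelihoods does not depend on the parameter, so it drops out of the posterior.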

| Likelihood | Sufficient stat | Conjugate prior | Posterior update | Pseudo-count |
| --- | --- | --- | --- | --- |
| Bernoulli$(p)$ / Binomial | $\sum x_i,\;n$ | Beta$(\alpha,\beta)$ | $\alpha\!+\!\sum x_i,\;\beta\!+\!n\!-\!\sum x_i$ | $\alpha\!-\!1$ heads, $\beta\!-\!1$ tails |
| Categorical / Multinomial | counts $n_k$ | Dirichlet$(\boldsymbol\alpha)$ | $\alpha_k + n_k$ | $\alpha_k-1$ pseudo-observations of class $k$ |
| Poisson$(\lambda)$ | $\sum x_i,\;n$ | Gamma$(\alpha,\beta)$ (rate) | $\alpha\!+\!\sum x_i,\;\beta\!+\!n$ | $\beta$ pseudo-intervals, $\alpha$ pseudo-events |
| Exponential$(\lambda)$ | $\sum x_i,\;n$ | Gamma$(\alpha,\beta)$ | $\alpha\!+\!n,\;\beta\!+\!\sum x_i$ | $\alpha$ pseudo-arrivals over time $\beta$ |
| Geometric$(p)$ ($x_i$ = failures before the first success) | $\sum x_i,\;n$ | Beta$(\alpha,\beta)$ | $\alpha\!+\!n,\;\beta\!+\!\sum x_i$ | $\alpha$ pseudo-successes, $\beta$ pseudo-failures in total |
| Normal$(\mu,\sigma^2)$, $\sigma$ known | $\bar x,\;n$ | Normal$(\mu_0,\sigma_0^2)$ | $\lambda_n\!=\!\lambda_0\!+\!n\lambda$; $\mu_n$ precision-weighted | $\sigma^2/\sigma_0^2$ pseudo-observations |
| Normal$(\mu,\sigma^2)$, $\mu$ known | $\sum(x_i\!-\!\mu)^2,\;n$ | Inverse-Gamma$(\alpha,\beta)$ | $\alpha\!+\!n/2,\;\beta\!+\!\tfrac12\sum(x_i\!-\!\mu)^2$ | $2\alpha$ pseudo-observations of variance |
| Normal$(\mu,\sigma^2)$, both unknown | $\bar x,\;s^2,\;n$ | Normal–Gamma$(\mu_0,\kappa_0,\alpha_0,\beta_0)$ | see §5 | $\kappa_0$ for mean, $2\alpha_0$ for variance |
| Multivariate normal, $\Sigma$ known | $\bar x,\;n$ | Normal$(\boldsymbol\mu_0,\Lambda_0^{-1})$ | $\Lambda_n\!=\!\Lambda_0\!+\!n\Sigma^{-1}$ | precision matrices add |
| Multivariate normal, both unknown | $\bar{\mathbf x},\;\mathbf S,\;n$ | Normal-inverse-Wishart | (matrix analogue of Normal–Gamma) | $\kappa_0$ for mean, $\nu_0$ for covariance |
| Gamma$(\alpha,\beta)$, $\alpha$ known | $\sum x_i,\;n$ | Gamma$(\alpha_0,\beta_0)$ on rate | $\alpha_0\!+\!n\alpha,\;\beta_0\!+\!\sum x_i$ | $n\alpha$-equivalent of arrivals |

The same pseudo-count reading shows what a "weakly informative" prior means quantitatively. A Beta$(2,2)$ prior is a pretend dataset of one head and one tail on top of the flat Beta$(1,1)$ (total weight $\alpha+\beta=4$ in the posterior-mean formula of §3): it nudges estimates from small samples back toward $1/2$ but is overwhelmed by $n=50$. A Normal$(0,10^2)$ prior on a regression coefficient with data precision near 1 is worth about $1/100$ of one observation: it constrains almost nothing.
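
The two numerical claims above can be checked with a few lines of Python. The helper names are hypothetical. The Beta weight is the coefficient on the prior mean from the posterior-mean formula in §3, and the Normal figure is the ratio $\sigma^2/\sigma_0^2$ from §4.

```python
def beta_prior_weight(alpha, beta, n):
    """Weight on the prior mean in the Beta-Bernoulli posterior mean."""
    return (alpha + beta) / (alpha + beta + n)

def normal_prior_pseudo_observations(sigma_sq, sigma0_sq):
    """How many observations of variance sigma_sq a N(mu0, sigma0_sq) prior is worth."""
    return sigma_sq / sigma0_sq

# Beta(2, 2): strong against n = 2, nearly irrelevant against n = 50.
w_small = beta_prior_weight(2.0, 2.0, n=2)
w_large = beta_prior_weight(2.0, 2.0, n=50)

# Normal(0, 10^2) prior with unit data variance.
pseudo_obs = normal_prior_pseudo_observations(sigma_sq=1.0, sigma0_sq=10.0 ** 2)

print(w_small, w_large, pseudo_obs)
```

The same two helpers can be pointed at any proposed hyperparameters before fitting, to see how much of the answer the prior will supply.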

7. Posterior-predictive distributions

For prediction, integrate the likelihood against the posterior on $\eta$:

$$ p(x_*\mid x_{1:n}) \;=\; \int p(x_*\mid\eta)\,p(\eta\mid x_{1:n})\,d\eta. $$

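For conjugate pairs the integral is closed-form and gives a named distribution: the Beta-Binomial for the Beta–Bernoulli/Binomial pair, a Gaussian with inflated variance $\sigma^2+\sigma_n^2$ for Normal–Normal (§4), and a Student-$t$ for Normal–Gamma (§5).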

Notice that for finite $n$ the predictive is not the likelihood with a point estimate plugged in. The Beta-Binomial is not a Binomial; the predictive for a Normal–Gamma is Student-$t$, not Gaussian; and even the Normal–Normal predictive, which stays Gaussian, carries the extra variance $\sigma_n^2$. The plug-in likelihood is only recovered in the $n\to\infty$ limit, when the posterior on $\eta$ becomes a point mass.
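
To make the contrast concrete, here is a Python sketch comparing the Beta-Binomial predictive for $m$ future trials with the plug-in Binomial at the posterior mean. The posterior Beta$(9,5)$ and $m=10$ are assumed example values, and the helper names are illustrative.

```python
import math

def log_beta(a, b):
    """log B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(j, m, a, b):
    """P(j successes in m future trials) with p ~ Beta(a, b) integrated out."""
    return math.comb(m, j) * math.exp(log_beta(a + j, b + m - j) - log_beta(a, b))

def binomial_pmf(j, m, p):
    return math.comb(m, j) * p ** j * (1 - p) ** (m - j)

def mean_and_var(pmf):
    mean = sum(j * q for j, q in enumerate(pmf))
    var = sum(j * j * q for j, q in enumerate(pmf)) - mean ** 2
    return mean, var

a_n, b_n, m = 9.0, 5.0, 10
predictive = [beta_binomial_pmf(j, m, a_n, b_n) for j in range(m + 1)]
plug_in = [binomial_pmf(j, m, a_n / (a_n + b_n)) for j in range(m + 1)]

print(sum(predictive))                  # should be 1 up to rounding
print(mean_and_var(predictive))         # same mean as the plug-in ...
print(mean_and_var(plug_in))            # ... but the plug-in has smaller variance
```

The two distributions share a mean, and the predictive's extra variance is the residual uncertainty about $p$ that the plug-in discards.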

8. Conjugacy is useful only when the family fits the belief

Conjugate priors are computationally cheap and have a clean pseudo-count interpretation, but they are only a parametric family — not all beliefs fit. Use them when the conjugate shape adequately approximates the prior you would otherwise have specified; when it doesn't, do the integrals some other way — numerically, by analytical approximation, or by simulation.

A few practical signals that the conjugate prior is the wrong tool:

The pseudo-count interpretation is also a diagnostic for prior strength. A hyperparameter that translates to "1000 pretend observations" on a dataset of 100 real observations is doing nearly all the work. If you wrote it down without thinking of the prior as data, you may not have meant that.
