Conjugate Priors & the Exponential Family
Conjugate priors are useful for two reasons. First, computational convenience — posterior arithmetic reduces to adding sufficient statistics. Second, the hyperparameters of the conjugate prior have a clean interpretation as prior data: a Beta$(\alpha,\beta)$ prior on a coin's bias acts like $\alpha-1$ prior heads and $\beta-1$ prior tails; a Normal-Gamma prior on $(\mu,\sigma^2)$ acts like $\kappa_0$ prior observations of the mean and $2\alpha_0$ prior observations of the variance. The same arithmetic that turns the posterior back into a member of the family also turns each hyperparameter into an effective sample size.
1. The exponential family (recap)
Conjugacy lives inside the exponential family — distributions of the canonical form
$$ p(x\mid\eta) \;=\; h(x)\,\exp\!\bigl(\eta^\top T(x) \;-\; A(\eta)\bigr), $$where $\eta$ is the natural parameter, $T(x)$ the sufficient statistic, $h(x)$ the base measure, and $A(\eta)$ the log-partition function. The exponential-family page covers what each piece is called and why, why the log-partition borrows its name from statistical physics, how derivatives of $A$ give the moments of $T(X)$, the canonical-link / mean-function pair behind logistic and Poisson regression, and an interactive picker stepping through six standard members.
The one fact the conjugacy argument below needs: the natural parameter $\eta$ and sufficient statistic $T(x)$ meet only in an inner product $\eta^\top T(x)$, so $n$ i.i.d. observations summarize themselves into a fixed-size pair $(\sum_i T(x_i), n)$.
The Pitman–Koopman–Darmois theorem is the converse: for a family with fixed support, a $k$-dimensional sufficient statistic exists for every sample size $n$ if and only if the family is exponential of dimension at most $k$. Conjugate priors with finite-dimensional hyperparameters are therefore essentially restricted to this case; outside the exponential family, the posterior's complexity grows with $n$.
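The fixed-size summary can be checked numerically. The sketch below assumes the Bernoulli case, with $T(x)=x$ and $A(\eta)=\log(1+e^\eta)$; the function names and the two datasets are made up for illustration. It compares the point-by-point log-likelihood with the one computed from $(S, n)$ alone.

```python
import math

def bernoulli_log_likelihood(xs, p):
    """Log-likelihood of i.i.d. Bernoulli(p) data, computed point by point."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

def summarize(xs):
    """Fixed-size summary (S, n), using T(x) = x."""
    return sum(xs), len(xs)

def log_likelihood_from_summary(S, n, p):
    """The same log-likelihood written as eta * S - n * A(eta)."""
    eta = math.log(p / (1 - p))        # natural parameter (log-odds)
    A = math.log1p(math.exp(eta))      # log-partition, equal to -log(1 - p)
    return eta * S - n * A

# Two illustrative datasets with different orderings but the same (S, n).
data_a = [1, 0, 1, 1, 0, 0, 1, 0]
data_b = [0, 0, 0, 0, 1, 1, 1, 1]

S, n = summarize(data_a)
for p in (0.2, 0.5, 0.9):
    full = bernoulli_log_likelihood(data_a, p)
    short = log_likelihood_from_summary(S, n, p)
    print(f"p={p}: pointwise={full:.6f}  from summary={short:.6f}")
```

Any dataset with the same $(S, n)$ gives the same likelihood curve, which is why the posterior only ever needs those two numbers.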
2. Conjugacy follows from exponential-family sufficient statistics
With $n$ i.i.d. observations $x_{1:n}$ from an exponential family, the joint likelihood is
$$ p(x_{1:n}\mid\eta) \;=\; \Bigl[\prod_i h(x_i)\Bigr]\, \exp\!\Bigl(\eta^\top \!\sum_i T(x_i) \;-\; n\,A(\eta)\Bigr). $$Read off two numbers from the data: the total sufficient statistic $S=\sum_i T(x_i)$ and the sample size $n$. As a function of $\eta$, the likelihood depends only on $(S, n)$.
Now pick a prior with the same functional form in $\eta$ — that is, a density proportional to $\exp(\eta^\top\tau_0 - \nu_0 A(\eta))$ on the natural-parameter space:
$$ \pi(\eta\mid\tau_0,\nu_0) \;\propto\; \exp\!\bigl(\eta^\top\tau_0 - \nu_0\,A(\eta)\bigr). $$The posterior is the product, and the product is again in this family:
$$ p(\eta\mid x_{1:n}) \;\propto\; \exp\!\Bigl(\eta^\top(\tau_0+S) - (\nu_0+n)\,A(\eta)\Bigr) \;=\; \pi\!\bigl(\eta\mid\tau_0+S,\;\nu_0+n\bigr). $$Updating is the arithmetic
$$ \boxed{\;\tau_0 \;\mapsto\; \tau_0 + \sum_i T(x_i),\qquad \nu_0 \;\mapsto\; \nu_0 + n.\;} $$
3. Worked example: Beta–Bernoulli
The Bernoulli likelihood is $p(x\mid p) = p^{x}(1-p)^{1-x}$ for $x\in\{0,1\}$. In exponential-family form,
$$ p(x\mid p) \;=\; \exp\!\Bigl(x\,\underbrace{\log\tfrac{p}{1-p}}_{\eta} \;-\; \underbrace{(-\log(1-p))}_{A(\eta)}\Bigr), $$so $T(x)=x$ and the natural parameter $\eta$ is the log-odds (the logit). The conjugate prior in canonical form has density $\propto\exp(\eta\tau_0 - \nu_0 A(\eta))$; transformed back to the bias $p$, this is the Beta distribution:
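$$ p \;\sim\; \mathrm{Beta}(\alpha,\beta) \;\propto\; p^{\alpha-1}(1-p)^{\beta-1}, $$where $\alpha = \tau_0+1$ and $\beta=\nu_0-\tau_0+1$ when the kernel $p^{\tau_0}(1-p)^{\nu_0-\tau_0}$ is read directly as a density in $p$. (Treating it as a density in $\eta$ and changing variables brings in the Jacobian $d\eta/dp = 1/\bigl(p(1-p)\bigr)$, which gives $\alpha=\tau_0$ and $\beta=\nu_0-\tau_0$ instead.) After observing $k$ successes and $n-k$ failures, $S=k$ and the posterior is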
$$ p \mid x_{1:n} \;\sim\; \mathrm{Beta}(\alpha+k,\;\beta+n-k). $$The $\alpha-1$ prior heads add to the $k$ observed heads; the $\beta-1$ prior tails add to the $n-k$ observed tails. The prior is literally a pretend dataset. The posterior mean
$$ \mathbb{E}[p\mid x_{1:n}] \;=\; \frac{\alpha+k}{\alpha+\beta+n} \;=\; \frac{n}{\alpha+\beta+n}\,\hat p_{\mathrm{ML}} \;+\; \frac{\alpha+\beta}{\alpha+\beta+n}\,\mathbb{E}[p] $$is a convex combination of the maximum-likelihood estimate $\hat p_{\mathrm{ML}}=k/n$ and the prior mean $\mathbb{E}[p]=\alpha/(\alpha+\beta)$, weighted by the sample size against the prior strength $\alpha+\beta$. The flat prior Beta$(1,1)$, the Jeffreys prior Beta$(1/2,1/2)$, and the Haldane prior Beta$(0,0)$ are all conjugate — they differ only in how much pseudo-data they contribute and where the mass sits near the boundary $p\in\{0,1\}$.
With a strong prior, the posterior mean stays close to the prior mean even when a small sample contradicts it: the prior is acting like extra trials.
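The Beta–Bernoulli update and the convex-combination form of the posterior mean can be sketched in a few lines. This is a minimal illustration, not a library API: the function names, the Beta$(2,2)$ prior, and the ten coin flips are all assumptions chosen for the example.

```python
def beta_bernoulli_update(alpha, beta, xs):
    """Return the posterior (alpha, beta) after observing 0/1 outcomes xs."""
    k = sum(xs)            # observed successes
    n = len(xs)            # observed trials
    return alpha + k, beta + (n - k)

def posterior_mean_weights(alpha, beta, n):
    """Weights placed on the MLE and on the prior mean, in that order."""
    total = alpha + beta + n
    return n / total, (alpha + beta) / total

# Illustrative data: 8 heads and 2 tails under a Beta(2, 2) prior.
xs = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
alpha0, beta0 = 2.0, 2.0

alpha_n, beta_n = beta_bernoulli_update(alpha0, beta0, xs)
post_mean = alpha_n / (alpha_n + beta_n)

# The same number, rebuilt as a weighted blend of the MLE and the prior mean.
w_mle, w_prior = posterior_mean_weights(alpha0, beta0, len(xs))
p_ml = sum(xs) / len(xs)
prior_mean = alpha0 / (alpha0 + beta0)
blend = w_mle * p_ml + w_prior * prior_mean

print(alpha_n, beta_n, post_mean, blend)
```

Here the posterior is Beta$(10, 4)$, and both routes give a posterior mean of $10/14$, between the MLE $0.8$ and the prior mean $0.5$.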
4. Worked example: Normal–Normal (known variance)
With known variance $\sigma^2$, the Gaussian likelihood depends on the data only through $\bar x$ and $n$. The conjugate prior on the mean is Gaussian:
$$ \mu \sim \mathcal{N}(\mu_0,\sigma_0^2),\qquad x_{1:n}\mid\mu \sim \mathcal{N}(\mu,\sigma^2). $$Working in precision $\lambda = 1/\sigma^2$, the update rule is precision addition:
$$ \lambda_n \;=\; \lambda_0 + n\lambda,\qquad \mu_n \;=\; \frac{\lambda_0 \mu_0 + n\lambda\,\bar x}{\lambda_n} \;=\; w\,\bar x + (1-w)\mu_0, $$with $w = n\lambda/\lambda_n$. Prior precision and data precision simply add; the prior precision $\lambda_0$ is the equivalent of $\lambda_0/\lambda$ extra observations. The posterior mean is a precision-weighted blend, pulled toward whichever side is more precise.
The posterior predictive for a new observation $x_*$ is
$$ x_* \mid x_{1:n} \;\sim\; \mathcal{N}\!\bigl(\mu_n,\;\sigma^2+\sigma_n^2\bigr). $$The predictive variance is the observation noise $\sigma^2$ plus the residual uncertainty about $\mu$, $\sigma_n^2 = 1/\lambda_n$. It is wider than the likelihood and narrower than the prior predictive $\mathcal{N}(\mu_0,\sigma^2+\sigma_0^2)$; as $n\to\infty$, $\sigma_n^2\to 0$ and the predictive collapses onto $\mathcal{N}(\mu_n,\sigma^2)$.
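The precision-addition rule and the predictive variance translate directly into code. The sketch below assumes a known observation variance; the function name, prior settings, and data are illustrative only.

```python
def normal_known_variance_update(mu0, sigma0_sq, sigma_sq, xs):
    """Posterior mean and variance of mu under a N(mu0, sigma0_sq) prior."""
    n = len(xs)
    xbar = sum(xs) / n
    lam0 = 1.0 / sigma0_sq          # prior precision
    lam = 1.0 / sigma_sq            # precision of one observation
    lam_n = lam0 + n * lam          # precisions add
    mu_n = (lam0 * mu0 + n * lam * xbar) / lam_n
    return mu_n, 1.0 / lam_n

# Illustrative numbers: a vague prior centred at 0, unit observation noise.
xs = [4.8, 5.3, 5.1, 4.9, 5.4]
sigma_sq, sigma0_sq = 1.0, 4.0
mu_n, sigma_n_sq = normal_known_variance_update(0.0, sigma0_sq, sigma_sq, xs)

predictive_var = sigma_sq + sigma_n_sq          # noise + uncertainty about mu
prior_predictive_var = sigma_sq + sigma0_sq     # before seeing any data

print(mu_n, sigma_n_sq, predictive_var, prior_predictive_var)
```

With these numbers the prior counts as $\sigma^2/\sigma_0^2 = 0.25$ of an observation, so five data points pull $\mu_n$ most of the way to the sample mean of about $5.1$.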
5. Worked example: Normal–Gamma (unknown mean and variance)
When both $\mu$ and the precision $\lambda=1/\sigma^2$ are unknown, the conjugate prior is the Normal–Gamma:
$$ \mathrm{NG}(\mu,\lambda\mid\mu_0,\kappa_0,\alpha_0,\beta_0) \;=\; \mathcal{N}\!\bigl(\mu\mid\mu_0,(\kappa_0\lambda)^{-1}\bigr)\, \mathrm{Gamma}(\lambda\mid\alpha_0,\beta_0). $$Parametrized on the variance $\sigma^2$ instead of the precision $\lambda$, the same prior is called the Normal–scaled-Inverse-χ². Gamma on $\lambda = 1/\sigma^2$ is Inv-χ² on $\sigma^2$ — the family-map $1/X$ edge between Chi² and Inv-χ². Bayesian texts vary in which they use; the posterior predictive below (Student-$t$) is the same either way. Many treatments (e.g. Gelman et al.) prefer the Inv-χ² form because $\nu_0 = 2\alpha_0$ has the direct interpretation of "prior effective sample size on the variance."
The mean's prior variance is tied to the unknown precision through $\kappa_0$: a prior on the mean is only meaningful up to scale. After data with sample mean $\bar x$ and sample sum-of-squares $s^2=\sum_i(x_i-\bar x)^2$, the four hyperparameters update as
$$ \kappa_n = \kappa_0 + n,\qquad \mu_n = \frac{\kappa_0\mu_0 + n\bar x}{\kappa_n}, $$ $$ \alpha_n = \alpha_0 + \tfrac{n}{2},\qquad \beta_n = \beta_0 + \tfrac{1}{2}s^2 + \frac{\kappa_0\,n\,(\bar x-\mu_0)^2}{2\kappa_n}. $$Read off the pseudo-counts: $\kappa_0$ is the prior's number of effective observations of the mean, $2\alpha_0$ is its number of effective observations of the variance. The extra term in $\beta_n$ penalizes the prior's mean for disagreeing with the data mean: prior–data conflict increases the Gamma rate $\beta_n$, which lowers the posterior precision (that is, inflates the inferred variance).
Marginalizing out the precision gives the posterior predictive for a new $x_*$:
$$ x_* \mid x_{1:n} \;\sim\; t_{2\alpha_n}\!\Bigl(\mu_n,\;\frac{\beta_n(\kappa_n+1)}{\alpha_n\,\kappa_n}\Bigr). $$A Student-$t$, with degrees of freedom equal to twice the posterior shape — small $n$ gives heavy tails reflecting variance uncertainty, large $n$ recovers the Gaussian predictive of the known-variance case.
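The four-hyperparameter update and the Student-$t$ predictive parameters can be written out as follows. This is a sketch under the parametrization used above (Gamma in rate form on the precision); the function names, prior values, and data are assumptions for illustration.

```python
def normal_gamma_update(mu0, kappa0, alpha0, beta0, xs):
    """Posterior Normal-Gamma hyperparameters (mu_n, kappa_n, alpha_n, beta_n)."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)   # sum of squares about the sample mean
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2
    # Last term: penalty for disagreement between prior mean and data mean.
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

def student_t_predictive(mu_n, kappa_n, alpha_n, beta_n):
    """Degrees of freedom, location, and squared scale of the predictive t."""
    df = 2 * alpha_n
    scale_sq = beta_n * (kappa_n + 1) / (alpha_n * kappa_n)
    return df, mu_n, scale_sq

# Illustrative data and a weak prior centred at 0.
xs = [2.0, 4.0, 6.0, 8.0]
params = normal_gamma_update(mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0, xs=xs)
df, loc, scale_sq = student_t_predictive(*params)

print(params, df, loc, scale_sq)
```

In this example half of $\beta_n - \beta_0$ comes from the spread of the data and half from the conflict term, because the prior mean $0$ sits far from the sample mean $5$.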
5a. Gaussian priors at a glance
The three Gaussian cases above, lined up. Each row corresponds to which parameter is unknown; each column gives the canonical non-informative choice and the conjugate choice.
| Unknown parameter | Non-informative | Conjugate |
|---|---|---|
| $\mu$ (variance $\sigma^2$ known) | Uniform on $\mathbb R$ (improper; limit $\sigma_0^2 \to \infty$ of the conjugate prior) | Gaussian $\mathcal N(\mu_0, \sigma_0^2)$ (§4 above) |
| $\sigma^2$ (mean $\mu$ known) | Jeffreys $p(\sigma^2) \propto 1/\sigma^2$ (improper) | Inverse-$\chi^2$ / Inverse-Gamma (equivalent: $\mathrm{Inv}\text{-}\chi^2_\nu = \mathrm{Inv}\text{-}\mathrm{Gamma}(\nu/2, 1/2)$) |
| $(\mu, \sigma^2)$ (both unknown) | Independence-Jeffreys $p(\mu, \sigma^2) \propto 1/\sigma^2$ (improper; the full joint Jeffreys rule gives $\propto (\sigma^2)^{-3/2}$; see Fisher information) | Normal–Gamma on $(\mu, \lambda)$ where $\lambda = 1/\sigma^2$ (§5 above; aka Normal–scaled-Inverse-$\chi^2$ in the $\sigma^2$ parametrization) |
Naming: the Inverse-$\chi^2$ and Inverse-Gamma names refer to the same distribution under different parametrizations. Bayesian texts in the Gelman/BDA tradition use Inverse-$\chi^2$; the standard-pairs table below uses the more general Inverse-Gamma form. The scaled-Inverse-$\chi^2$ card on named distributions has the density formulas. The Normal–Gamma case has a sibling name "Normal–scaled-Inverse-$\chi^2$" when the variance is parametrized directly rather than through precision; same prior, different label.
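The naming equivalence can be checked numerically. The sketch below writes the two densities in their usual textbook forms, assuming the scaled-Inverse-$\chi^2(\nu, s^2)$ convention in which it equals Inverse-Gamma$(\nu/2,\ \nu s^2/2)$; the function names and test values are illustrative.

```python
import math

def inv_gamma_pdf(x, a, b):
    """Inverse-Gamma density with shape a and scale b."""
    log_pdf = a * math.log(b) - math.lgamma(a) - (a + 1) * math.log(x) - b / x
    return math.exp(log_pdf)

def scaled_inv_chi2_pdf(x, nu, s_sq):
    """Scaled-Inverse-chi^2 density with nu degrees of freedom and scale s_sq."""
    half = nu / 2.0
    log_pdf = (half * math.log(half) + half * math.log(s_sq) - math.lgamma(half)
               - (half + 1) * math.log(x) - nu * s_sq / (2.0 * x))
    return math.exp(log_pdf)

nu, s_sq = 5.0, 2.0
for x in (0.3, 1.0, 2.5):
    a = scaled_inv_chi2_pdf(x, nu, s_sq)
    b = inv_gamma_pdf(x, nu / 2.0, nu * s_sq / 2.0)
    print(f"x={x}: scaled-inv-chi2={a:.8f}  inv-gamma={b:.8f}")
```

Setting $s^2 = 1/\nu$ recovers the plain Inverse-$\chi^2_\nu$, which matches the Inverse-Gamma$(\nu/2, 1/2)$ entry in the table.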
6. Standard conjugate pairs
For most named likelihoods, the conjugate prior is also named. The hyperparameter update is always "add the sufficient statistic to $\tau$ and $n$ to $\nu$"; the table below just unpacks that convention for each row.
Bernoulli↔Binomial and Categorical↔Multinomial share their conjugate prior; they differ only in whether the likelihood is written as $n$ i.i.d. trials or as one aggregate count. The first form is the primitive sampling model, the second is its repeated-trial summary, with the same sufficient statistic either way and therefore the same Beta or Dirichlet conjugate.
| Likelihood | Sufficient stat | Conjugate prior | Posterior update | Pseudo-count |
|---|---|---|---|---|
| Bernoulli$(p)$ / Binomial | $\sum x_i,\;n$ | Beta$(\alpha,\beta)$ | $\alpha\!+\!\sum x_i,\;\beta\!+\!n\!-\!\sum x_i$ | $\alpha\!-\!1$ heads, $\beta\!-\!1$ tails |
| Categorical / Multinomial | counts $n_k$ | Dirichlet$(\boldsymbol\alpha)$ | $\alpha_k + n_k$ | $\alpha_k-1$ pseudo-observations of class $k$ |
| Poisson$(\lambda)$ | $\sum x_i,\;n$ | Gamma$(\alpha,\beta)$ (rate) | $\alpha\!+\!\sum x_i,\;\beta\!+\!n$ | $\beta$ pseudo-intervals, $\alpha$ pseudo-events |
| Exponential$(\lambda)$ | $\sum x_i,\;n$ | Gamma$(\alpha,\beta)$ | $\alpha\!+\!n,\;\beta\!+\!\sum x_i$ | $\alpha$ pseudo-arrivals over time $\beta$ |
| Geometric$(p)$ | $\sum x_i,\;n$ | Beta$(\alpha,\beta)$ | $\alpha\!+\!n,\;\beta\!+\!\sum x_i$ | $\alpha$ pseudo-successes, $\beta$ pseudo-failures in total ($x_i$ counts failures before the first success) |
| Normal$(\mu,\sigma^2)$, $\sigma$ known | $\bar x,\;n$ | Normal$(\mu_0,\sigma_0^2)$ | $\lambda_n\!=\!\lambda_0\!+\!n\lambda$; $\mu_n$ precision-weighted | $\sigma^2/\sigma_0^2$ pseudo-observations |
| Normal$(\mu,\sigma^2)$, $\mu$ known | $\sum(x_i\!-\!\mu)^2,\;n$ | Inverse-Gamma$(\alpha,\beta)$ | $\alpha\!+\!n/2,\;\beta\!+\!\tfrac12\sum(x_i\!-\!\mu)^2$ | $2\alpha$ pseudo-observations of variance |
| Normal$(\mu,\sigma^2)$, both unknown | $\bar x,\;s^2,\;n$ | Normal–Gamma$(\mu_0,\kappa_0,\alpha_0,\beta_0)$ | see §5 | $\kappa_0$ for mean, $2\alpha_0$ for variance |
| Multivariate normal, $\Sigma$ known | $\bar x,\;n$ | Normal$(\boldsymbol\mu_0,\Lambda_0^{-1})$ | $\Lambda_n\!=\!\Lambda_0\!+\!n\Sigma^{-1}$ | precision matrices add |
| Multivariate normal, both unknown | $\bar{\mathbf x},\;\mathbf S,\;n$ | Normal-inverse-Wishart | (matrix analogue of Normal–Gamma) | $\kappa_0$ for mean, $\nu_0$ for covariance |
| Gamma$(\alpha,\beta)$, $\alpha$ known | $\sum x_i,\;n$ | Gamma$(\alpha_0,\beta_0)$ on rate | $\alpha_0\!+\!n\alpha,\;\beta_0\!+\!\sum x_i$ | $n\alpha$-equivalent of arrivals |
The same pseudo-count reading shows what a "weakly informative" prior means quantitatively. A Beta$(2,2)$ prior has prior strength $\alpha+\beta=4$ in the posterior mean, like two pretend heads and two pretend tails (one of each beyond the flat Beta$(1,1)$): it nudges estimates from small samples back toward $1/2$ but is overwhelmed by $n=50$. A Normal$(0,10^2)$ prior on a regression coefficient with data precision near 1 is worth about $1/100$ of one observation: it constrains almost nothing.
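Two rows of the table in this section, plus the Normal pseudo-observation count, can be sketched as one-line updates. The function names and the example counts and waiting times are assumptions for illustration; both Gamma priors are in rate form, as in the table.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Gamma(alpha, beta) prior on a Poisson rate: add events and intervals."""
    return alpha + sum(counts), beta + len(counts)

def gamma_exponential_update(alpha, beta, waits):
    """Gamma(alpha, beta) prior on an Exponential rate: add arrivals and time."""
    return alpha + len(waits), beta + sum(waits)

def normal_prior_pseudo_observations(sigma_sq, sigma0_sq):
    """How many observations a N(mu0, sigma0_sq) prior on the mean is worth."""
    return sigma_sq / sigma0_sq

# Illustrative data: four Poisson counts, three Exponential waiting times.
poisson_post = gamma_poisson_update(2.0, 1.0, [3, 0, 2, 4])
expon_post = gamma_exponential_update(2.0, 1.0, [0.5, 1.5, 1.0])
weak_prior_worth = normal_prior_pseudo_observations(1.0, 10.0 ** 2)

print(poisson_post, expon_post, weak_prior_worth)
```

Note how the roles of $n$ and $\sum x_i$ swap between the Poisson and Exponential rows: counts add to the shape in one case, elapsed time adds to the rate in the other.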
7. Posterior-predictive distributions
For prediction, integrate the likelihood against the posterior on $\eta$:
$$ p(x_*\mid x_{1:n}) \;=\; \int p(x_*\mid\eta)\,p(\eta\mid x_{1:n})\,d\eta. $$For conjugate pairs the integral is closed-form and gives a named distribution:
- Beta–Bernoulli: $\Pr(x_*=1\mid x_{1:n}) = (\alpha+k)/(\alpha+\beta+n)$ for a single new trial; the number of successes in $m$ future trials is Beta-Binomial. Both combine the prior pseudo-counts with the data.
- Dirichlet–Categorical: predictive $\Pr(x_*=k) = (\alpha_k+n_k)/(\sum_j \alpha_j + n)$. This is the additive-smoothing rule.
- Gamma–Poisson: Negative-Binomial predictive over counts. The overdispersion comes from posterior uncertainty about the Poisson rate.
- Normal–Normal ($\sigma$ known): Gaussian predictive $\mathcal{N}(\mu_n,\sigma^2+\sigma_n^2)$.
- Normal–Gamma: Student-$t$ predictive with $2\alpha_n$ degrees of freedom — heavy-tailed for small $n$, Gaussian asymptotically.
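Notice that the predictive distribution for finite $n$ is usually not in the original likelihood family. The Beta-Binomial is not a Binomial; the predictive for a Normal–Gamma is Student-$t$, not Gaussian. The known-variance Normal–Normal pair is the exception in the list above: its predictive stays Gaussian, only with the inflated variance $\sigma^2+\sigma_n^2$. In every case the sampling distribution itself is recovered only in the $n\to\infty$ limit, when the posterior on $\eta$ becomes a point mass.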
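Two of the closed-form predictives listed above can be sketched with the standard library alone: additive smoothing for the Dirichlet–Categorical pair, and the Negative-Binomial pmf for the Gamma–Poisson pair. The function names and the posterior values are illustrative assumptions; the Gamma is in rate form.

```python
import math

def dirichlet_categorical_predictive(alphas, counts):
    """Predictive class probabilities (alpha_k + n_k) / (sum_j alpha_j + n)."""
    total = sum(alphas) + sum(counts)
    return [(a + c) / total for a, c in zip(alphas, counts)]

def gamma_poisson_predictive_pmf(x, alpha, beta):
    """Negative-Binomial pmf of a new count under a Gamma(alpha, beta) posterior."""
    log_pmf = (math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
               + alpha * math.log(beta / (beta + 1.0))
               - x * math.log(beta + 1.0))
    return math.exp(log_pmf)

# Uniform Dirichlet(1, 1, 1) prior with observed counts (3, 0, 1).
probs = dirichlet_categorical_predictive([1.0, 1.0, 1.0], [3, 0, 1])

# Gamma(11, 5) posterior on a Poisson rate; sum far enough into the tail.
pmf = [gamma_poisson_predictive_pmf(x, alpha=11.0, beta=5.0) for x in range(200)]
pred_mean = sum(x * p for x, p in enumerate(pmf))
pred_var = sum((x - pred_mean) ** 2 * p for x, p in enumerate(pmf))

print(probs, pred_mean, pred_var)
```

The unseen class still gets probability $1/7$, and the predictive variance exceeds the predictive mean, which is the overdispersion a plain Poisson cannot show.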
8. Conjugacy is useful only when the family fits the belief
Conjugate priors are computationally cheap and have a clean pseudo-count interpretation, but they are only a parametric family — not all beliefs fit. Use them when the conjugate shape adequately approximates the prior you would otherwise have specified; when it doesn't, do the integrals some other way — numerically, by analytical approximation, or by simulation.
A few practical signals that the conjugate prior is the wrong tool:
- Multimodal prior beliefs. The conjugate Beta family is unimodal except for the U-shaped Beta$(\alpha,\beta)$ with both parameters $<1$, whose two peaks sit at the boundaries $p=0$ and $p=1$. If you have genuine prior mass concentrated near $p=0.2$ and $p=0.8$, a single Beta won't capture it.
- Hard prior bounds or hierarchical structure. A Beta prior on $p$ can't impose $p\in[0.4,0.6]$. A hierarchical or truncated prior may be closer to your belief, at the cost of closed-form posteriors.
- Sparsity or shrinkage targets. Conjugate priors put zero mass at zero. For sparse regression coefficients use a horseshoe, spike-and-slab, or Laplace prior (none of which are conjugate to Gaussian likelihood).
- The likelihood is not exponential family. Cauchy, mixture models, Student-$t$ with unknown degrees of freedom, and most neural-network likelihoods have no useful conjugate prior. Fall back to MCMC or VI — see the Monte Carlo & MCMC and variational inference pages.
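When the prior does not fit the conjugate family, a one-dimensional posterior can still be computed numerically. The sketch below uses a simple grid approximation for a coin's bias under a two-bump prior near $0.2$ and $0.8$. The prior shape, bump width, grid size, and data are all assumptions made up for the example, and a grid is only one of several possible numerical routes.

```python
import math

def bump(p, centre, width=0.05):
    """Unnormalized Gaussian-shaped bump used to build the prior."""
    return math.exp(-0.5 * ((p - centre) / width) ** 2)

def bimodal_prior(p):
    """Unnormalized prior with mass near 0.2 and 0.8 (illustrative shape)."""
    return bump(p, 0.2) + bump(p, 0.8)

def grid_posterior(prior_density, k, n, grid_size=2000):
    """Normalized posterior weights on cell midpoints of (0, 1)."""
    ps = [(i + 0.5) / grid_size for i in range(grid_size)]
    # Prior times Bernoulli likelihood for k successes in n trials.
    weights = [prior_density(p) * p ** k * (1.0 - p) ** (n - k) for p in ps]
    total = sum(weights)
    return ps, [w / total for w in weights]

# Illustrative data: 7 successes in 10 trials.
ps, post = grid_posterior(bimodal_prior, k=7, n=10)
post_mean = sum(p * w for p, w in zip(ps, post))
mass_upper = sum(w for p, w in zip(ps, post) if p > 0.5)

print(post_mean, mass_upper)
```

The data move nearly all posterior mass onto the upper bump, an outcome no single Beta prior with the same two-peaked shape could have expressed.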
The pseudo-count interpretation is also a diagnostic for prior strength. A hyperparameter that translates to "1000 pretend observations" on a dataset of 100 real observations is doing nearly all the work. If you wrote it down without thinking of the prior as data, you may not have meant that.
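The diagnostic amounts to one ratio. The helper below is a hypothetical name; it assumes the prior strength has already been expressed as a pseudo-observation count, for example $\alpha+\beta$ for a Beta prior or $\kappa_0$ for the Normal–Gamma mean.

```python
def prior_weight(prior_pseudo_obs, n):
    """Fraction of the posterior mean that comes from the prior mean."""
    return prior_pseudo_obs / (prior_pseudo_obs + n)

# 1000 pretend observations against 100 real ones: the prior dominates.
heavy = prior_weight(1000, 100)

# A Beta(2, 2) prior (strength 4) against n = 50: the data dominate.
light = prior_weight(4, 50)

print(f"heavy prior weight: {heavy:.3f}, light prior weight: {light:.3f}")
```

A weight near 1 means the reported estimate is mostly the prior; a weight near 0 means the prior is doing little beyond regularizing small samples.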