Conjugate Priors & the Exponential Family

Why some prior–likelihood pairs update in closed form, what the hyperparameters mean as pseudo-data, and the standard conjugate pairs.

Conjugate priors are useful for two reasons. First, computational convenience: posterior arithmetic reduces to adding sufficient statistics. Second, the hyperparameters of the conjugate prior have a clean interpretation as prior data: a Beta$(\alpha,\beta)$ prior on a coin's bias acts like $\alpha$ prior heads and $\beta$ prior tails when read through the posterior mean; a Normal-Gamma prior on a Gaussian's mean and precision $(\mu, 1/\sigma^2)$ acts like $\kappa_0$ prior observations of the mean and $2\alpha_0$ prior observations of the variance. The same arithmetic that turns the posterior back into a member of the family also turns each hyperparameter into an effective sample size.

One sentence: Whenever the likelihood is an exponential family $p(x\mid\eta)=h(x)\exp(\eta^\top T(x)-A(\eta))$, the conjugate prior has the form $\pi(\eta\mid\tau,\nu)\propto\exp(\eta^\top\tau-\nu A(\eta))$, and Bayesian updating is just $\tau\mapsto\tau+\sum_i T(x_i)$, $\nu\mapsto\nu+n$.

1. The exponential family (recap)

Conjugacy lives inside the exponential family — distributions of the canonical form

$$ p(x\mid\eta) \;=\; h(x)\,\exp\!\bigl(\eta^\top T(x) \;-\; A(\eta)\bigr), $$

where $\eta$ is the natural parameter, $T(x)$ the sufficient statistic, $h(x)$ the base measure, and $A(\eta)$ the log-partition function. The exponential-family page covers what each piece is called and why, why the log-partition borrows its name from statistical physics, how derivatives of $A$ give the moments of $T(X)$, the canonical-link / mean-function pair behind logistic and Poisson regression, and an interactive picker stepping through six standard members.

The one fact the conjugacy argument below needs: the natural parameter $\eta$ and sufficient statistic $T(x)$ meet only in an inner product $\eta^\top T(x)$, so $n$ i.i.d. observations summarize themselves into a fixed-size pair $(\sum_i T(x_i), n)$.

The Pitman–Koopman–Darmois theorem is the converse: for a family with fixed support, a $k$-dimensional sufficient statistic exists for every sample size $n$ if and only if the family is exponential of dimension at most $k$. Conjugate priors with finite-dimensional hyperparameters are therefore essentially restricted to this case; outside the exponential family, the posterior's complexity grows with $n$.

2. Conjugacy follows from exponential-family sufficient statistics

With $n$ i.i.d. observations $x_{1:n}$ from an exponential family, the joint likelihood is

$$ p(x_{1:n}\mid\eta) \;=\; \Bigl[\prod_i h(x_i)\Bigr]\, \exp\!\Bigl(\eta^\top \!\sum_i T(x_i) \;-\; n\,A(\eta)\Bigr). $$

Read off two summaries from the data: the total sufficient statistic $S=\sum_i T(x_i)$, which has the same dimension as $\eta$, and the sample size $n$. As a function of $\eta$, the likelihood depends on the data only through $(S, n)$.
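A quick numerical check of this claim, as a minimal sketch rather than a general implementation: it assumes the Bernoulli family (where $T(x)=x$, $h(x)=1$, and $A(\eta)=\log(1+e^\eta)$), and the function names are illustrative. Two datasets that share $(S,n)$ but differ in order give the same log-likelihood at every parameter value.

```python
import math

def bernoulli_log_lik(xs, p):
    """Log-likelihood of i.i.d. Bernoulli(p) data, one observation at a time."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in xs)

def log_lik_from_summary(S, n, eta):
    """Same quantity from the summary alone: eta*S - n*A(eta), since h(x) = 1."""
    A = math.log1p(math.exp(eta))  # Bernoulli log-partition A(eta) = log(1 + e^eta)
    return eta * S - n * A

xs_a = [1, 1, 0, 1, 0, 0, 1, 1]
xs_b = [0, 0, 1, 1, 1, 1, 1, 0]  # different order, same S = 5 and n = 8

for p in (0.2, 0.5, 0.9):
    eta = math.log(p / (1.0 - p))  # natural parameter (log-odds)
    full_a = bernoulli_log_lik(xs_a, p)
    full_b = bernoulli_log_lik(xs_b, p)
    summary = log_lik_from_summary(sum(xs_a), len(xs_a), eta)
    # The three columns agree up to floating-point rounding.
    print(f"p={p}: {full_a:.6f} {full_b:.6f} {summary:.6f}")
```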

Now pick a prior with the same functional form in $\eta$ — that is, a density proportional to $\exp(\eta^\top\tau_0 - \nu_0 A(\eta))$ on the natural-parameter space:

$$ \pi(\eta\mid\tau_0,\nu_0) \;\propto\; \exp\!\bigl(\eta^\top\tau_0 - \nu_0\,A(\eta)\bigr). $$

The posterior is the product, and the product is again in this family:

$$ p(\eta\mid x_{1:n}) \;\propto\; \exp\!\Bigl(\eta^\top(\tau_0+S) - (\nu_0+n)\,A(\eta)\Bigr) \;=\; \pi\!\bigl(\eta\mid\tau_0+S,\;\nu_0+n\bigr). $$

Updating is the arithmetic

$$ \boxed{\;\tau_0 \;\mapsto\; \tau_0 + \sum_i T(x_i),\qquad \nu_0 \;\mapsto\; \nu_0 + n.\;} $$

Pseudo-count reading: $\nu_0$ is the number of pseudo-observations the prior is worth, and $\tau_0$ is their pseudo-total of sufficient statistics. A Beta$(\alpha,\beta)$ prior on a Bernoulli has $\nu_0=\alpha+\beta$ pseudo-trials with $\tau_0=\alpha$ pseudo-heads. Choosing the prior is choosing how much pretend data to start with.
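The update rule can be sketched generically. This is an illustrative sketch, not a library API: `ConjugateState`, `update`, and the `T` argument are hypothetical names introduced here, and the only family-specific input is the sufficient-statistic function. Because the update is plain addition, processing the data in one batch or in several passes lands on the same hyperparameters.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

@dataclass(frozen=True)
class ConjugateState:
    tau: Tuple[float, ...]  # pseudo-total of sufficient statistics
    nu: float               # pseudo-count of observations

def update(state: ConjugateState, xs: Iterable[float],
           T: Callable[[float], Tuple[float, ...]]) -> ConjugateState:
    """Return the posterior hyperparameters: tau + sum_i T(x_i), nu + n."""
    tau = list(state.tau)
    n = 0
    for x in xs:
        for j, t_j in enumerate(T(x)):
            tau[j] += t_j
        n += 1
    return ConjugateState(tuple(tau), state.nu + n)

def identity_stat(x):
    """T(x) = x as a one-component vector (the Bernoulli and Poisson case)."""
    return (float(x),)

prior = ConjugateState(tau=(2.0,), nu=4.0)  # worth 4 pseudo-observations
data = [1, 0, 1, 1, 0]

batch = update(prior, data, identity_stat)
streamed = update(update(prior, data[:2], identity_stat), data[2:], identity_stat)
print(batch)              # ConjugateState(tau=(5.0,), nu=9.0)
print(batch == streamed)  # True
```

Keeping the state immutable makes the "posterior becomes the next prior" pattern explicit: each call returns a new state that can be fed straight back in.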

Figure 1 · The same updater across four conjugate families

[Interactive figure. Curves: prior, posterior, data summary. Controls: family (shown: Bernoulli / Beta), prior pseudo-count strength ν₀ (shown: 8), sample size n (shown: 24), data summary (shown: 0.68).]

Use the family picker to see the same arithmetic in different clothes: $\tau$ absorbs the data's sufficient statistic and $\nu$ absorbs the sample size. The Normal-Gamma option draws the posterior predictive, so the same $n$ slider also shows the Student-$t$ tail shrinking toward a Gaussian as variance uncertainty disappears.

3. Worked example: Beta–Bernoulli

The Bernoulli likelihood is $p(x\mid p) = p^{x}(1-p)^{1-x}$ for $x\in\{0,1\}$. In exponential-family form,

$$ p(x\mid p) \;=\; \exp\!\Bigl(x\,\underbrace{\log\tfrac{p}{1-p}}_{\eta} \;-\; \underbrace{(-\log(1-p))}_{A(\eta)}\Bigr), $$

so $T(x)=x$ and the natural parameter $\eta$ is the log-odds (the logit). The conjugate prior in canonical form has density $\propto\exp(\eta\tau_0 - \nu_0 A(\eta))$; transformed back to the bias $p$, this is the Beta distribution:

$$ p \;\sim\; \mathrm{Beta}(\alpha,\beta) \;\propto\; p^{\alpha-1}(1-p)^{\beta-1}, $$

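where $\alpha = \tau_0$ and $\beta=\nu_0-\tau_0$. To see this, note that $\exp(\eta\tau_0-\nu_0A(\eta)) = p^{\tau_0}(1-p)^{\nu_0-\tau_0}$, and the change of variables from $\eta$ to $p$ contributes the Jacobian $d\eta/dp = 1/\bigl(p(1-p)\bigr)$, which lowers each exponent by one. After observing $k$ successes and $n-k$ failures, $S=k$ and the posterior is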

$$ p \mid x_{1:n} \;\sim\; \mathrm{Beta}(\alpha+k,\;\beta+n-k). $$

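The $\alpha$ prior heads add to the $k$ observed heads, and the $\beta$ prior tails add to the $n-k$ observed tails. The prior is literally a pretend dataset. (Texts that count $\alpha-1$ and $\beta-1$ instead are reading the prior through its mode rather than its mean.) The posterior mean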

$$ \mathbb{E}[p\mid x_{1:n}] \;=\; \frac{\alpha+k}{\alpha+\beta+n} \;=\; \frac{n}{\alpha+\beta+n}\,\hat p_{\mathrm{ML}} \;+\; \frac{\alpha+\beta}{\alpha+\beta+n}\,\mathbb{E}[p] $$

is a convex combination of the maximum-likelihood estimate $\hat p_{\mathrm{ML}}=k/n$ and the prior mean $\mathbb{E}[p]=\alpha/(\alpha+\beta)$, weighted by the sample size against the prior strength $\alpha+\beta$. The flat prior Beta$(1,1)$, the Jeffreys prior Beta$(1/2,1/2)$, and the Haldane prior Beta$(0,0)$ are all conjugate — they differ only in how much pseudo-data they contribute and where the mass sits near the boundary $p\in\{0,1\}$.
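The update and the convex-combination identity can be checked in a few lines. This is a minimal sketch with illustrative function names; the inputs $\alpha=\beta=2$, $n=20$, $k=13$ are the settings shown for Figure 2.

```python
def beta_bernoulli_update(alpha, beta, k, n):
    """Posterior Beta parameters after k successes in n Bernoulli trials."""
    return alpha + k, beta + (n - k)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

alpha0, beta0 = 2.0, 2.0
n, k = 20, 13

alpha_n, beta_n = beta_bernoulli_update(alpha0, beta0, k, n)
post_mean = beta_mean(alpha_n, beta_n)

# Weight on the MLE is n / (alpha + beta + n); the rest goes to the prior mean.
w = n / (alpha0 + beta0 + n)
blend = w * (k / n) + (1.0 - w) * beta_mean(alpha0, beta0)

print(alpha_n, beta_n)        # 15.0 9.0
print(round(post_mean, 6))    # 0.625
print(round(blend, 6))        # 0.625
```

With four pseudo-trials against twenty real ones, the posterior mean sits five sixths of the way from the prior mean $0.5$ to the MLE $0.65$.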

Figure 2 · Beta–Bernoulli updating as adding pseudo-counts

[Interactive figure. Curves and markers: prior Beta$(\alpha,\beta)$, posterior Beta$(\alpha+k,\beta+n-k)$, likelihood (normalized), MLE $k/n$, posterior mean. Controls: prior pseudo-heads α (shown: 2), prior pseudo-tails β (shown: 2), trials n (shown: 20), successes k (shown: 13).]

The readout shows the posterior mean's weighting between the prior mean and the MLE. The strong-prior preset makes visible how the posterior resists a small contradicting sample — the prior is acting like extra trials.

4. Worked example: Normal–Normal (known variance)

With known variance $\sigma^2$, the Gaussian likelihood depends on the data only through $\bar x$ and $n$. The conjugate prior on the mean is Gaussian:

$$ \mu \sim \mathcal{N}(\mu_0,\sigma_0^2),\qquad x_i\mid\mu \;\overset{\text{iid}}{\sim}\; \mathcal{N}(\mu,\sigma^2),\quad i=1,\dots,n. $$

Working in precisions $\lambda = 1/\sigma^2$ and $\lambda_0 = 1/\sigma_0^2$, the update rule is precision addition:

$$ \lambda_n \;=\; \lambda_0 + n\lambda,\qquad \mu_n \;=\; \frac{\lambda_0 \mu_0 + n\lambda\,\bar x}{\lambda_n} \;=\; w\,\bar x + (1-w)\mu_0, $$

with $w = n\lambda/\lambda_n$. Prior precision and data precision simply add; the prior precision $\lambda_0$ is the equivalent of $\lambda_0/\lambda$ extra observations. The posterior mean is a precision-weighted blend, pulled toward whichever side is more precise.
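For readers who want the intermediate step, the update follows from collecting the terms in $\mu$ of the log-posterior and completing the square. The following is a sketch of that standard calculation, with $\lambda_0 = 1/\sigma_0^2$ denoting the prior precision and terms constant in $\mu$ dropped:

$$ \log p(\mu\mid x_{1:n}) \;=\; -\tfrac{\lambda_0}{2}(\mu-\mu_0)^2 \;-\; \tfrac{\lambda}{2}\sum_i (x_i-\mu)^2 + \text{const} \;=\; -\tfrac{1}{2}(\lambda_0+n\lambda)\,\mu^2 \;+\; (\lambda_0\mu_0+n\lambda\,\bar x)\,\mu + \text{const}. $$

Matching this against $-\tfrac{\lambda_n}{2}(\mu-\mu_n)^2$ gives $\lambda_n=\lambda_0+n\lambda$ from the quadratic coefficient and $\lambda_n\mu_n=\lambda_0\mu_0+n\lambda\,\bar x$ from the linear one.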

The posterior predictive for a new observation $x_*$, with posterior variance $\sigma_n^2 = 1/\lambda_n$, is

$$ x_* \mid x_{1:n} \;\sim\; \mathcal{N}\!\bigl(\mu_n,\;\sigma^2+\sigma_n^2\bigr). $$

The predictive variance is the observation noise plus the residual uncertainty about $\mu$. It is wider than the likelihood and narrower than the prior predictive; as $n\to\infty$, $\sigma_n^2\to 0$ and the predictive collapses onto $\mathcal{N}(\mu_n,\sigma^2)$.
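The known-variance update and the predictive variance can be sketched as follows. The function name and the example numbers are illustrative assumptions, not values from the text.

```python
def normal_known_variance_update(mu0, sigma0_sq, sigma_sq, xbar, n):
    """Posterior mean and variance of mu for a Gaussian with known variance."""
    lam0 = 1.0 / sigma0_sq          # prior precision
    lam = 1.0 / sigma_sq            # precision of a single observation
    lam_n = lam0 + n * lam          # precisions add
    mu_n = (lam0 * mu0 + n * lam * xbar) / lam_n  # precision-weighted blend
    return mu_n, 1.0 / lam_n

mu0, sigma0_sq = 0.0, 4.0   # prior N(0, 4)
sigma_sq = 1.0              # known observation variance
xbar, n = 2.0, 12           # data summary

mu_n, sigma_n_sq = normal_known_variance_update(mu0, sigma0_sq, sigma_sq, xbar, n)
predictive_var = sigma_sq + sigma_n_sq   # noise plus remaining uncertainty in mu

print(f"{mu_n:.4f} {sigma_n_sq:.4f} {predictive_var:.4f}")  # 1.9592 0.0816 1.0816
print(sigma_sq / sigma0_sq)  # prior worth 0.25 observations
```

Here the prior carries a quarter of an observation, so twelve real observations pull the posterior mean almost all the way to $\bar x$.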

5. Worked example: Normal–Gamma (unknown mean and variance)

When both $\mu$ and the precision $\lambda=1/\sigma^2$ are unknown, the conjugate prior is the Normal–Gamma:

$$ \mathrm{NG}(\mu,\lambda\mid\mu_0,\kappa_0,\alpha_0,\beta_0) \;=\; \mathcal{N}\!\bigl(\mu\mid\mu_0,(\kappa_0\lambda)^{-1}\bigr)\, \mathrm{Gamma}(\lambda\mid\alpha_0,\beta_0). $$

Parametrized on the variance $\sigma^2$ instead of the precision $\lambda$, the same prior is called the Normal–scaled-Inverse-χ². A Gamma prior on $\lambda = 1/\sigma^2$ is a scaled Inv-χ² (equivalently, an Inverse-Gamma) prior on $\sigma^2$; this is the $1/X$ edge between Chi² and Inv-χ² in the family map. Bayesian texts vary in which they use; the posterior predictive below (Student-$t$) is the same either way. Many treatments (e.g. Gelman et al.) prefer the Inv-χ² form because its degrees of freedom $\nu_0 = 2\alpha_0$ have the direct interpretation of "prior effective sample size on the variance." (This $\nu_0$ is the Inv-χ² degrees of freedom, not the pseudo-count $\nu_0$ of §2.)

The mean's prior variance is tied to the unknown precision through $\kappa_0$: a prior on the mean is only meaningful up to scale. After data with sample mean $\bar x$ and sample sum-of-squares $s^2=\sum_i(x_i-\bar x)^2$, the four hyperparameters update as

$$ \kappa_n = \kappa_0 + n,\qquad \mu_n = \frac{\kappa_0\mu_0 + n\bar x}{\kappa_n}, $$

$$ \alpha_n = \alpha_0 + \tfrac{n}{2},\qquad \beta_n = \beta_0 + \tfrac{1}{2}s^2 + \frac{\kappa_0\,n\,(\bar x-\mu_0)^2}{2\kappa_n}. $$

Read off the pseudo-counts: $\kappa_0$ is the prior's number of effective observations of the mean, $2\alpha_0$ is its number of effective observations of the variance. The extra term in $\beta_n$ penalizes the prior's mean for disagreeing with the data mean: prior–data conflict raises the rate $\beta_n$, which lowers the inferred precision (that is, inflates the inferred variance).

Marginalizing out the precision gives the posterior predictive for a new $x_*$:

$$ x_* \mid x_{1:n} \;\sim\; t_{2\alpha_n}\!\Bigl(\mu_n,\;\frac{\beta_n(\kappa_n+1)}{\alpha_n\,\kappa_n}\Bigr). $$

The predictive is a Student-$t$ with degrees of freedom equal to twice the posterior shape: small $n$ gives heavy tails reflecting variance uncertainty, large $n$ recovers the Gaussian predictive of the known-variance case.
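The four-parameter update and the Student-$t$ predictive parameters can be sketched directly from the formulas above. The function names and the toy data are illustrative assumptions; the second return value of `predictive_t_params` is the squared scale that appears as the second argument of $t_{2\alpha_n}(\cdot,\cdot)$.

```python
def normal_gamma_update(mu0, kappa0, alpha0, beta0, xs):
    """Posterior Normal-Gamma hyperparameters after observing the list xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)   # sum of squares about the sample mean
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

def predictive_t_params(mu_n, kappa_n, alpha_n, beta_n):
    """Degrees of freedom, location, and squared scale of the predictive t."""
    return 2.0 * alpha_n, mu_n, beta_n * (kappa_n + 1.0) / (alpha_n * kappa_n)

xs = [1.0, 2.0, 3.0, 4.0]
prior = (0.0, 1.0, 1.0, 1.0)   # (mu0, kappa0, alpha0, beta0)

posterior = normal_gamma_update(*prior, xs)
print(posterior)                        # (2.0, 5.0, 3.0, 6.0)
print(predictive_t_params(*posterior))  # df 6.0, location 2.0, squared scale about 2.4

# Feeding the observations one at a time gives the same hyperparameters.
state = prior
for x in xs:
    state = normal_gamma_update(*state, [x])
```

In this toy run the sample mean $2.5$ sits far from $\mu_0=0$, so the conflict term contributes as much to $\beta_n$ as the within-sample sum of squares does.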

For Gaussian models, the conjugate family changes with what is unknown: Gaussian for a known-variance mean, Inverse-Gamma or scaled-Inverse-$\chi^2$ for a known-mean variance, and Normal–Gamma when both mean and variance are unknown. The named-distributions card has the variance-prior density; Fisher information explains the corresponding Jeffreys priors.

6. Standard conjugate pairs

For the common named likelihoods, the conjugate prior is also named. The hyperparameter update is always "add sufficient statistic to $\tau$ and $n$ to $\nu$" — the table below keeps the pairs used by the worked examples and the interactive cards.

Bernoulli↔Binomial and Categorical↔Multinomial share their conjugate prior; they differ only in whether the likelihood is written as $n$ i.i.d. trials or as one aggregate count. The first form is the primitive sampling model, the second is its repeated-trial summary, with the same sufficient statistic either way and therefore the same Beta or Dirichlet conjugate.

Likelihood	Sufficient stat	Conjugate prior	Posterior update	Pseudo-count
Bernoulli$(p)$ / Binomial	$\sum x_i,\;n$	Beta$(\alpha,\beta)$	$\alpha\!+\!\sum x_i,\;\beta\!+\!n\!-\!\sum x_i$	$\alpha$ heads, $\beta$ tails
Categorical / Multinomial	counts $n_k$	Dirichlet$(\boldsymbol\alpha)$	$\alpha_k + n_k$	$\alpha_k$ pseudo-observations of class $k$
Poisson$(\lambda)$	$\sum x_i,\;n$	Gamma$(\alpha,\beta)$ (rate)	$\alpha\!+\!\sum x_i,\;\beta\!+\!n$	$\beta$ pseudo-intervals, $\alpha$ pseudo-events
Normal$(\mu,\sigma^2)$, $\sigma$ known	$\bar x,\;n$	Normal$(\mu_0,\sigma_0^2)$	$\lambda_n\!=\!\lambda_0\!+\!n\lambda$; $\mu_n$ precision-weighted	$\sigma^2/\sigma_0^2$ pseudo-observations
Normal$(\mu,\sigma^2)$, $\mu$ known	$\sum(x_i\!-\!\mu)^2,\;n$	Inverse-Gamma$(\alpha,\beta)$	$\alpha\!+\!n/2,\;\beta\!+\!\tfrac12\sum(x_i\!-\!\mu)^2$	$2\alpha$ pseudo-observations of variance
Normal$(\mu,\sigma^2)$, both unknown	$\bar x,\;s^2,\;n$	Normal–Gamma$(\mu_0,\kappa_0,\alpha_0,\beta_0)$	see §5	$\kappa_0$ for mean, $2\alpha_0$ for variance

The same pseudo-count reading shows what a "weakly informative" prior means quantitatively. A Beta$(2,2)$ prior is a pretend dataset of two heads and two tails: it nudges estimates of small samples back toward $1/2$ but is overwhelmed by $n=50$. A Normal$(0,10^2)$ prior on a regression coefficient with data precision near 1 is worth about $1/100$ of one observation: it constrains almost nothing.
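The arithmetic behind these two statements is short enough to spell out. This is a sketch with illustrative helper names; it uses the posterior-mean weighting from §3 and the $\lambda_0/\lambda$ pseudo-count from §4.

```python
def prior_weight(nu0, n):
    """Share of the posterior mean contributed by the prior: nu0 / (nu0 + n)."""
    return nu0 / (nu0 + n)

def normal_prior_pseudo_obs(sigma_sq, sigma0_sq):
    """Pseudo-observations carried by a Normal prior on a mean: lambda0 / lambda."""
    return sigma_sq / sigma0_sq

# Beta(2, 2): alpha + beta = 4 pseudo-trials.
for n in (4, 50):
    print(n, round(prior_weight(4.0, n), 3))  # prints "4 0.5" then "50 0.074"

# Normal(0, 10^2) prior with unit data variance.
print(normal_prior_pseudo_obs(1.0, 100.0))    # 0.01
```

At $n=4$ the Beta$(2,2)$ prior and the data share the posterior mean equally; at $n=50$ the prior's share is under eight percent.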

7. Posterior-predictive distributions

For prediction, integrate the likelihood against the posterior on $\eta$:

$$ p(x_*\mid x_{1:n}) \;=\; \int p(x_*\mid\eta)\,p(\eta\mid x_{1:n})\,d\eta. $$

For conjugate pairs the integral is closed-form and gives a named distribution:

Beta–Bernoulli: for a single new trial the predictive is Bernoulli with $\Pr(x_*=1\mid x_{1:n}) = (\alpha+k)/(\alpha+\beta+n)$; for a count over $m$ new trials it is Beta-Binomial. Always uses both prior pseudo-counts and the data.
Dirichlet–Categorical: predictive $\Pr(x_*=k) = (\alpha_k+n_k)/(\sum_j \alpha_j + n)$. This is the additive-smoothing rule.
Gamma–Poisson: Negative-Binomial predictive over counts. The overdispersion comes from posterior uncertainty about the Poisson rate.
Normal–Normal ($\sigma$ known): Gaussian predictive $\mathcal{N}(\mu_n,\sigma^2+\sigma_n^2)$.
Normal–Gamma: Student-$t$ predictive with $2\alpha_n$ degrees of freedom — heavy-tailed for small $n$, Gaussian asymptotically.
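Two of these predictives are easy to compute by hand. The sketch below uses illustrative function names and toy hyperparameters; the Negative-Binomial form is obtained by integrating the Poisson pmf against a Gamma$(\alpha_n,\beta_n)$ posterior in the rate parametrization, which gives success-probability parameter $\beta_n/(\beta_n+1)$.

```python
import math

def dirichlet_categorical_predictive(alphas, counts):
    """Pr(x* = k) = (alpha_k + n_k) / (sum_j alpha_j + n): additive smoothing."""
    total = sum(alphas) + sum(counts)
    return [(a + c) / total for a, c in zip(alphas, counts)]

def gamma_poisson_predictive_pmf(y, alpha_n, beta_n):
    """Negative-Binomial pmf of a new count y under a Gamma(alpha_n, beta_n) posterior."""
    log_pmf = (math.lgamma(y + alpha_n) - math.lgamma(alpha_n) - math.lgamma(y + 1)
               + alpha_n * math.log(beta_n / (beta_n + 1.0))
               - y * math.log(beta_n + 1.0))
    return math.exp(log_pmf)

# Uniform Dirichlet(1, 1, 1) with counts (3, 0, 1): the unseen class keeps mass 1/7.
probs = dirichlet_categorical_predictive([1.0, 1.0, 1.0], [3, 0, 1])
print([round(p, 4) for p in probs])  # [0.5714, 0.1429, 0.2857]

# Gamma(5, 2) posterior on the rate: predictive mean 2.5, variance above the mean.
ys = range(200)  # the tail beyond 200 is negligible for these parameters
pmf = [gamma_poisson_predictive_pmf(y, 5.0, 2.0) for y in ys]
mean = sum(y * p for y, p in zip(ys, pmf))
var = sum((y - mean) ** 2 * p for y, p in zip(ys, pmf))
print(round(mean, 4), round(var, 4))  # 2.5 3.75
```

A Poisson with the same mean would have variance $2.5$; the extra $1.25$ is the posterior variance of the rate, $\alpha_n/\beta_n^2$.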

With finite $n$, the predictive is generally not the likelihood with a point estimate plugged in. For a count over several future trials, the Beta-Binomial is not a Binomial; the Normal–Gamma predictive is Student-$t$, not Gaussian; even the known-variance Gaussian predictive carries the inflated variance $\sigma^2+\sigma_n^2$. The plug-in likelihood is recovered only in the $n\to\infty$ limit, when the posterior on $\eta$ becomes a point mass.

8. Conjugacy is useful only when the family fits the belief

Conjugate priors are computationally cheap and have a clean pseudo-count interpretation, but they are only a parametric family — not all beliefs fit. Use them when the conjugate shape adequately approximates the prior you would otherwise have specified; when it doesn't, do the integrals some other way — numerically, by analytical approximation, or by simulation.

A few practical signals that the conjugate prior is the wrong tool:

Bimodal or multimodal prior beliefs. A Beta density has at most one interior mode; with both parameters $<1$ it is U-shaped, piling mass at the two boundaries rather than at two interior values. If you have genuine prior mass at $p=0.2$ and $p=0.8$, a single Beta won't capture it.
Hard prior bounds or hierarchical structure. A Beta prior on $p$ can't impose $p\in[0.4,0.6]$. A hierarchical or truncated prior may be closer to your belief, at the cost of closed-form posteriors.
Sparsity or shrinkage targets. Conjugate priors put zero mass at zero. For sparse regression coefficients use a horseshoe, spike-and-slab, or Laplace prior (none of which are conjugate to Gaussian likelihood).
The likelihood is not exponential family. Cauchy, mixture models, Student-$t$ with unknown degrees of freedom, and most neural-network likelihoods have no useful conjugate prior. Fall back to MCMC or VI — see the Monte Carlo & MCMC and variational inference pages.

The pseudo-count interpretation is also a diagnostic for prior strength. A hyperparameter that translates to "1000 pretend observations" on a dataset of 100 real observations is doing nearly all the work. If you wrote it down without thinking of the prior as data, you may not have meant that.

What next

Choosing a Prior (Bayes). Conjugacy is one route to a workable prior; group invariance, max entropy, and Jeffreys give other routes.
Fisher Information & Jeffreys Priors (Geometry). The geometric story behind one of the canonical 'uninformative' priors, with interactive coordinate transforms.
The Exponential Family (Foundation). Why conjugate priors only have finite-dimensional hyperparameter updates inside the exponential family, and what η, T, h, A each mean.
Named Distributions (Reference). Where the named likelihoods and their conjugate partners sit in the broader distribution zoo.