The Exponential Family

One parametric template that covers most named distributions, explains where finite-dimensional sufficient statistics come from, supplies the canonical link functions behind GLMs, and ties Bayesian computation to statistical-mechanics partition functions.

1. The canonical form

A parametric family is an exponential family if every density in it can be written

$$ p(x \mid \eta) \;=\; h(x)\,\exp\!\Bigl[\eta^\top T(x) - A(\eta)\Bigr]. $$

Most named distributions are members: Bernoulli, binomial and multinomial (with a fixed number of trials), Poisson, geometric, exponential, gamma, beta, Dirichlet, and Gaussian (with known or unknown variance). The shape of the formula does the work. The factorization theorem (see Sufficient Statistics) is satisfied trivially by the canonical form: $T(x)$ is staring at you in the exponent, the only place where $x$ and $\eta$ interact.
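As a quick sanity check on the template, here is a minimal sketch that evaluates the Poisson mass function two ways: from the textbook formula and from the canonical pieces $h(x) = 1/x!$, $T(x) = x$, $\eta = \log\lambda$, $A(\eta) = e^\eta$. The function names and the value $\lambda = 3.5$ are illustrative choices, not part of the original text.

```python
import math

def poisson_pmf_direct(x: int, lam: float) -> float:
    """Textbook form: lambda^x * exp(-lambda) / x!"""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_pmf_canonical(x: int, eta: float) -> float:
    """Canonical form: h(x) * exp(eta * T(x) - A(eta))."""
    h = 1.0 / math.factorial(x)  # base measure h(x) = 1/x!
    T = x                        # sufficient statistic T(x) = x
    A = math.exp(eta)            # log-partition A(eta) = e^eta
    return h * math.exp(eta * T - A)

lam = 3.5                        # illustrative rate
eta = math.log(lam)              # natural parameter
max_gap = max(
    abs(poisson_pmf_direct(x, lam) - poisson_pmf_canonical(x, eta))
    for x in range(20)
)
print(max_gap)                   # difference is at floating-point noise level
```

The two functions differ only in how the same factors are grouped, which is the point: moving to canonical form is bookkeeping, not a new model.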

The Pitman–Koopman–Darmois theorem is the converse: under regularity (fixed support, smooth densities), this is the only way a finite-dimensional sufficient statistic exists for every sample size $n$. Conjugate priors with finite-dimensional hyperparameters are therefore restricted to this case; outside it, the posterior's complexity grows with $n$.
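To make the fixed-size posterior concrete, the sketch below assumes the usual conjugate form for a Bernoulli likelihood, a prior on $\eta$ proportional to $\exp(\eta\,\tau - \nu A(\eta))$, whose update is $(\tau, \nu) \to (\tau + \sum_i T(x_i),\ \nu + n)$. Under that assumption the implied prior on $p$ is Beta$(\tau, \nu - \tau)$. The helper names and the toy data are hypothetical.

```python
def conjugate_update(tau, nu, xs):
    """(tau, nu) -> (tau + sum of T(x_i), nu + n), with T(x) = x for Bernoulli."""
    return tau + sum(xs), nu + len(xs)

def as_beta(tau, nu):
    """Assumed correspondence: exp(eta*tau - nu*A(eta)) on eta is Beta(tau, nu - tau) on p."""
    return tau, nu - tau

tau0, nu0 = 1.0, 2.0              # corresponds to Beta(1, 1), the uniform prior on p
data = [1, 0, 1, 1]               # toy observations: three successes, one failure

tau_n, nu_n = conjugate_update(tau0, nu0, data)
posterior_beta = as_beta(tau_n, nu_n)

# Updating in two batches lands on the same two numbers.
tau_a, nu_a = conjugate_update(tau0, nu0, data[:2])
two_step = conjugate_update(tau_a, nu_a, data[2:])
print(posterior_beta, two_step)
```

However many observations arrive, the posterior is still described by two numbers, which is the finite-dimensional behavior the theorem ties to exponential families.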

2. Canonical-form terms

The names look like jargon but each one carries a specific meaning. The log-partition piece in particular is a direct loan from statistical physics that pays for itself in §3.

Piece	Name	Why that name
$\eta \in \mathbb R^k$	natural parameter (also canonical)	In this parametrization the exponent $\eta^\top T(x)$ is linear in $\eta$ and $A(\eta)$ is convex, so the log-likelihood is concave in $\eta$ and the geometry is simplest: the Fisher information becomes the Hessian of $A(\eta)$. Contrast the "conventional" or "moment" parameter: $p$ for Bernoulli vs. $\eta = \log\frac{p}{1-p}$, $\lambda$ for Poisson vs. $\eta = \log\lambda$.
$T(x) \in \mathbb R^k$	sufficient statistic	Conditional on $T(x)$, the distribution of $x$ doesn't depend on $\eta$; equivalently, by Fisher–Neyman (see sufficient statistics), the likelihood depends on the data only through $T(x)$. $T$ carries all the information about $\eta$ that $x$ provides. Pitman–Koopman–Darmois says this is the only class where a finite-dimensional $T$ does the job for every $n$.
$h(x) \ge 0$	base measure (also carrier)	The dominating measure with respect to which the density is computed. Whatever cannot be absorbed into $\eta^\top T(x) - A(\eta)$ ends up here. For Poisson it carries $1/x!$; for known-$\sigma$ Normal it carries the $x^2$ piece that doesn't depend on $\mu$; for Beta and Gamma it can be taken to be $1$.
$A(\eta) = \log\int h(x)\,e^{\eta^\top T(x)}\,dx$	log-partition function (also cumulant function)	Direct loan from statistical mechanics. In thermodynamics, $Z(\beta) = \sum_s e^{-\beta E_s}$ is the partition function: the normalizer that turns Boltzmann weights $e^{-\beta E_s}$ into a probability distribution, and almost every thermodynamic observable is a derivative of $\log Z$. Here $\int h(x)\,e^{\eta^\top T(x)}\,dx = e^{A(\eta)}$, so $A = \log Z$. The "cumulant" alias comes from the derivative property covered in §3.

The log-partition has a direct payoff. The free-energy/ELBO identity in variational inference is literally a statement about the log-partition function of an exponential family: $\log Z = \sup_q (\mathbb E_q[-E] + H(q))$, with $Z = p(y)$ the evidence and $E(\theta) = -\log p(y, \theta)$. That's not a coincidence. Bayesian computation is partition-function evaluation in disguise.
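The variational identity can be checked numerically on a toy discrete system. The sketch below assumes three states with made-up energies, computes $\log Z$ directly, and evaluates the bound $\mathbb E_q[-E] + H(q)$ at the Gibbs distribution $q \propto e^{-E}$ and at a uniform $q$. All names and numbers are illustrative.

```python
import math

def log_partition(energies):
    """log Z = log sum_s exp(-E_s), computed with the max-shift trick for stability."""
    m = max(-e for e in energies)
    return m + math.log(sum(math.exp(-e - m) for e in energies))

def free_energy_bound(q, energies):
    """E_q[-E] + H(q) for a discrete distribution q."""
    expected_neg_energy = sum(qs * (-e) for qs, e in zip(q, energies))
    entropy = -sum(qs * math.log(qs) for qs in q if qs > 0)
    return expected_neg_energy + entropy

energies = [0.3, 1.2, 2.5]                               # made-up energies
log_Z = log_partition(energies)

gibbs = [math.exp(-e - log_Z) for e in energies]         # q proportional to exp(-E)
uniform = [1.0 / len(energies)] * len(energies)

bound_at_gibbs = free_energy_bound(gibbs, energies)      # attains log Z
bound_at_uniform = free_energy_bound(uniform, energies)  # strictly below log Z
print(log_Z, bound_at_gibbs, bound_at_uniform)
```

The gap between `log_Z` and the bound at any other $q$ is the KL divergence from $q$ to the Gibbs distribution, which is why maximizing the bound recovers the posterior when $E(\theta) = -\log p(y, \theta)$.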

3. Derivatives of $A(\eta)$ give moments

Differentiating $A(\eta) = \log\int h(x)\,e^{\eta^\top T(x)}\,dx$ once and twice gives, by the same calculation that defines a moment-generating function:

$$ \nabla A(\eta) \;=\; \mathbb E[T(X)] \;=:\; \mu, \qquad \nabla^2 A(\eta) \;=\; \mathrm{Cov}[T(X)]. $$
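
The calculation behind these two identities, written out as a sketch. It assumes differentiation under the integral sign is allowed, which holds in the interior of the natural parameter space:

$$ \nabla A(\eta) \;=\; \frac{\int h(x)\,T(x)\,e^{\eta^\top T(x)}\,dx}{\int h(x)\,e^{\eta^\top T(x)}\,dx} \;=\; \int T(x)\,h(x)\,e^{\eta^\top T(x) - A(\eta)}\,dx \;=\; \mathbb E[T(X)], $$

$$ \nabla^2 A(\eta) \;=\; \int T(x)\,\bigl(T(x) - \nabla A(\eta)\bigr)^\top h(x)\,e^{\eta^\top T(x) - A(\eta)}\,dx \;=\; \mathbb E[T(X)\,T(X)^\top] - \mathbb E[T(X)]\,\mathbb E[T(X)]^\top \;=\; \mathrm{Cov}[T(X)]. $$

The first line divides by $e^{A(\eta)}$, which turns the integrand back into the density; the second differentiates that density once more, and the factor $T(x) - \nabla A(\eta)$ is the score.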

Higher derivatives give higher cumulants (third = skewness × $\sigma^3$, fourth = excess kurtosis × $\sigma^4$, …). That is why $A$ is also called the cumulant function: every cumulant of the sufficient statistic is a derivative of one scalar function.

Two operational consequences follow:

The mean is a function of $\eta$. Write $\mu(\eta) = \nabla A(\eta)$; this is the mean function. It is the link between the natural parametrization (where the math is clean) and the moment parametrization (where the answers live).
The Fisher information is the Hessian of $A$. $I(\eta) = \mathrm{Cov}[T(X)] = \nabla^2 A(\eta)$. So the Fisher metric on an exponential family is just the second derivative of the log-partition. This is why exponential families have such clean information geometry.
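
Both consequences can be checked numerically for the Bernoulli member, where $A(\eta) = \log(1+e^\eta)$, the mean is $p$, and the variance is $p(1-p)$. The sketch below uses central finite differences; the step size and the evaluation point $\eta = 0.8$ are arbitrary illustrative choices.

```python
import math

def A(eta: float) -> float:
    """Bernoulli log-partition, A(eta) = log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def central_first(f, x, step=1e-4):
    """Central-difference estimate of f'(x)."""
    return (f(x + step) - f(x - step)) / (2 * step)

def central_second(f, x, step=1e-4):
    """Central-difference estimate of f''(x)."""
    return (f(x + step) - 2 * f(x) + f(x - step)) / step ** 2

eta = 0.8                               # arbitrary evaluation point
p = 1.0 / (1.0 + math.exp(-eta))        # mean parameter via the sigmoid

slope = central_first(A, eta)           # should approximate E[T(X)] = p
curvature = central_second(A, eta)      # should approximate Var[T(X)] = p(1 - p)
print(slope, p)
print(curvature, p * (1 - p))
```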

Figure 1 · Slope and curvature of the log-partition. The plot shows $A(\eta)=\log(1+e^\eta)$ against the natural parameter $\eta$, with the tangent slope $A'(\eta)$ and the local curvature $A''(\eta)$ marked.

This figure uses the Bernoulli member because its log-partition is one-dimensional and familiar. The same derivative identities hold family-wide: slope is the mean parameter and curvature is the Fisher information.

4. The canonical link function (GLMs)

Generalized linear models put a linear predictor $\eta = x^\top\beta$ (with $x$ the covariate vector) in front of an exponential-family response $Y$. The function that maps the mean parameter $\mu = \mathbb E[T(Y)]$ to the natural parameter $\eta$ is the canonical link; its inverse $\mu = \nabla A(\eta)$ is the mean function. Logistic and Poisson regression are precisely the cases where the response is Bernoulli or Poisson and the link is the canonical one:

Family	$\mu$	Canonical link $\eta(\mu)$	Inverse link $\mu(\eta)$	GLM name
Bernoulli	$p$	$\log\dfrac{p}{1-p}$ (logit)	$\dfrac{1}{1+e^{-\eta}}$ (sigmoid)	logistic regression
Poisson	$\lambda$	$\log\lambda$ (log link)	$e^\eta$	Poisson regression
Normal ($\sigma$ known)	$\mu$	$\mu/\sigma^2$ (identity when $\sigma^2 = 1$)	$\sigma^2\,\eta$	linear regression
Exponential	$1/\lambda$	$-1/\mu = -\lambda$	$-1/\eta$	(survival models)

The logit/sigmoid pair is the most famous example. Logistic regression is exactly the GLM with a Bernoulli response and the canonical link; the sigmoid is the matching mean function (the inverse link). The same holds for the log/exp pair behind Poisson regression. The "canonical" in canonical link is the same "canonical" as in canonical (natural) parameter: it is the link that makes the linear predictor equal to $\eta$.
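
The table rows can be exercised as link and inverse-link pairs. The sketch below is illustrative only: the dictionary name, the test means, and the choice $\sigma^2 = 1$ for the Normal row are assumptions made for the example. Each pair should round-trip a valid mean back to itself.

```python
import math

# Each entry is (canonical link eta(mu), inverse link mu(eta)), matching the table rows.
CANONICAL_LINKS = {
    "bernoulli": (lambda mu: math.log(mu / (1.0 - mu)),       # logit
                  lambda eta: 1.0 / (1.0 + math.exp(-eta))),  # sigmoid
    "poisson": (math.log, math.exp),                          # log link and exp
    "normal_unit_variance": (lambda mu: mu,                   # identity when sigma^2 = 1
                             lambda eta: eta),
    "exponential": (lambda mu: -1.0 / mu,                     # eta = -lambda = -1/mu
                    lambda eta: -1.0 / eta),
}

# One valid mean per family (arbitrary illustrative values).
example_means = {
    "bernoulli": 0.3,
    "poisson": 4.2,
    "normal_unit_variance": -1.7,
    "exponential": 2.5,
}

round_trip_error = {}
for family, (link, inverse_link) in CANONICAL_LINKS.items():
    mu = example_means[family]
    round_trip_error[family] = abs(inverse_link(link(mu)) - mu)

print(round_trip_error)
```

In a GLM the fitted mean for a covariate vector is the inverse link applied to the linear predictor, so these four inverse links are what turns a real-valued predictor into a valid mean for each family.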

5. Six families, one template

The table below shows six standard members in the canonical decomposition. The pieces $h$, $T$, $\eta$, $A$ change family to family, but the shape $p(x\mid\eta) = h(x)\exp(\eta^\top T(x) - A(\eta))$ does not.

Family	$h(x)$	$T(x)$	$\eta(\theta)$	$A(\eta)$	Support
Bernoulli$(p)$	$1$	$x$	$\log\dfrac{p}{1-p}$	$\log(1+e^\eta) = -\log(1-p)$	$\{0,1\}$
Poisson$(\lambda)$	$1/x!$	$x$	$\log\lambda$	$e^\eta = \lambda$	$\{0,1,2,\dots\}$
Exponential$(\lambda)$	$1$	$x$	$-\lambda$	$-\log(-\eta) = -\log\lambda$	$[0,\infty)$
Normal$(\mu,\sigma^2)$, $\sigma$ known	$\dfrac{e^{-x^2/(2\sigma^2)}}{\sqrt{2\pi\sigma^2}}$	$x$	$\mu/\sigma^2$	$\sigma^2\eta^2/2 = \mu^2/(2\sigma^2)$	$\mathbb R$
Gamma$(\alpha,\beta)$	$1$	$(\log x,\;x)$	$(\alpha-1,\;-\beta)$	$\log\Gamma(\alpha) - \alpha\log\beta$	$(0,\infty)$
Beta$(\alpha,\beta)$	$1$	$(\log x,\;\log(1-x))$	$(\alpha-1,\;\beta-1)$	$\log B(\alpha,\beta)$	$(0,1)$
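
The two-parameter rows can be checked the same way as the one-parameter ones. The sketch below assumes a generic helper, `canonical_density`, that evaluates $h(x)\exp(\eta^\top T(x) - A)$, and compares it against the usual Gamma and Beta density formulas. The helper name and the parameter values are illustrative, not from the original text.

```python
import math

def canonical_density(x, h, T, eta, A):
    """Evaluate h(x) * exp(eta . T(x) - A) for vector-valued eta and T."""
    dot = sum(e * t for e, t in zip(eta, T(x)))
    return h(x) * math.exp(dot - A)

def gamma_pdf_direct(x, alpha, beta):
    """Usual rate-parametrized Gamma density."""
    return beta ** alpha / math.gamma(alpha) * x ** (alpha - 1) * math.exp(-beta * x)

def beta_pdf_direct(x, alpha, beta):
    """Usual Beta density with B(alpha, beta) written via Gamma functions."""
    B = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B

alpha, beta = 2.5, 1.5  # arbitrary illustrative parameters

# Gamma row: h = 1, T = (log x, x), eta = (alpha - 1, -beta).
gamma_A = math.lgamma(alpha) - alpha * math.log(beta)
gamma_gap = max(
    abs(gamma_pdf_direct(x, alpha, beta)
        - canonical_density(x, lambda _: 1.0, lambda v: (math.log(v), v),
                            (alpha - 1, -beta), gamma_A))
    for x in (0.2, 0.7, 1.5, 4.0)
)

# Beta row: h = 1, T = (log x, log(1 - x)), eta = (alpha - 1, beta - 1).
beta_A = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
beta_gap = max(
    abs(beta_pdf_direct(x, alpha, beta)
        - canonical_density(x, lambda _: 1.0, lambda v: (math.log(v), math.log(1 - v)),
                            (alpha - 1, beta - 1), beta_A))
    for x in (0.1, 0.4, 0.9)
)
print(gamma_gap, beta_gap)
```

Only the arguments passed to `canonical_density` change between the two families; the evaluation code is shared, which mirrors the "one template" claim.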

Three structural observations:

The 1-parameter families (Bernoulli, Poisson, Exponential, Normal-known-$\sigma$) all have $T(x) = x$. Their sufficient statistic for $n$ samples is just $\sum x_i$, the natural quantity an analyst would already compute.
The 2-parameter families (Gamma, Beta) have $T(x)$ a 2-vector. For Gamma it is $(\log x, x)$: geometric and arithmetic means are sufficient. For Beta it is $(\log x, \log(1-x))$, symmetric because replacing $x$ by $1-x$ swaps the roles of $\alpha$ and $\beta$.
Cauchy is the standard counterexample: $p(x\mid\theta) = (\pi[1+(x-\theta)^2])^{-1}$ cannot be put in canonical form, and for $n$ samples it has no sufficient statistic whose dimension stays fixed as $n$ grows (the order statistics are the best available reduction). This is exactly the situation Pitman–Koopman–Darmois describes.
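
The contrast between the first and third points can be seen numerically. The sketch below takes two made-up samples with the same size and the same sum, then compares their log-likelihood functions over a grid of parameter values. For Poisson the two curves differ by a constant, so the data matter only through $\sum x_i$; for the Cauchy location family they do not. The sample values and grid are illustrative.

```python
import math

def poisson_loglik(lam, xs):
    """Poisson log-likelihood; lgamma(x + 1) is log(x!)."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

def cauchy_loglik(theta, xs):
    """Cauchy location log-likelihood with unit scale."""
    return sum(-math.log(math.pi * (1 + (x - theta) ** 2)) for x in xs)

def spread_of_difference(loglik, xs_a, xs_b, grid):
    """Range of loglik(., xs_a) - loglik(., xs_b) over the grid.
    Near zero means the two likelihoods differ only by a constant factor."""
    diffs = [loglik(t, xs_a) - loglik(t, xs_b) for t in grid]
    return max(diffs) - min(diffs)

sample_a = [1, 2, 6]                      # made-up data, sum = 9
sample_b = [3, 3, 3]                      # same n and same sum
grid = [0.5 + 0.5 * k for k in range(12)] # parameter values 0.5, 1.0, ..., 6.0

poisson_spread = spread_of_difference(poisson_loglik, sample_a, sample_b, grid)
cauchy_spread = spread_of_difference(cauchy_loglik, sample_a, sample_b, grid)
print(poisson_spread, cauchy_spread)
```

A spread near zero means inference about the parameter is identical for the two samples, which is what sufficiency of $\sum x_i$ promises. The large Cauchy spread shows that the sum discards information there.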

What next

Foundations · Sufficient Statistics: The factorization theorem and fiber picture the canonical form is built around, and the data-processing inequality view in §8.

Bayes · Conjugate Priors: Why conjugate priors exist precisely inside the exponential family: the (τ, ν) → (τ+ΣT, ν+n) update.

Geometry · Fisher Information: Fisher information as the Hessian of $A(\eta)$, Jeffreys priors, and the information-geometric view of the family.

Approximation · Free Energy & Variational Inference: The free-energy identity recast as a statement about log-partition functions, with the posterior as the equilibrium distribution.

Reference · Named Distributions: Where Bernoulli, Beta, Gamma, Normal, Dirichlet, and the rest sit: the family map, with links back to which are exponential-family members.