The Exponential Family

One parametric template that covers most named distributions, explains where finite-dimensional sufficient statistics come from, supplies the canonical link functions behind GLMs, and ties Bayesian computation to statistical-mechanics partition functions.

1. The canonical form

A parametric family is an exponential family if every density in it can be written

$$ p(x \mid \eta) \;=\; h(x)\,\exp\!\Bigl[\eta^\top T(x) - A(\eta)\Bigr]. $$

Most named distributions are members: Bernoulli, binomial and multinomial (with the number of trials fixed), Poisson, geometric, exponential, gamma, beta, Dirichlet, and Gaussian (with known or unknown variance). The shape of the formula does the work. The canonical form trivially satisfies the Fisher–Neyman factorization theorem (see sufficient statistics): the density factors into $h(x)$, which is free of $\eta$, times a term that depends on $x$ only through $T(x)$, and $T(x)$ is staring at you in the exponent.
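As a quick sanity check on the template, here is a minimal sketch that evaluates $h(x)\exp(\eta\,T(x) - A(\eta))$ for the Poisson and Bernoulli cases and compares it with the textbook formulas. The helper name `canonical_density` and the parameter values are illustrative assumptions, not part of the original text.

```python
import math

def canonical_density(x, eta, h, T, A):
    """Evaluate h(x) * exp(eta * T(x) - A(eta)) for a one-parameter family."""
    return h(x) * math.exp(eta * T(x) - A(eta))

# Poisson(lam): h(x) = 1/x!, T(x) = x, eta = log(lam), A(eta) = exp(eta)
lam = 3.5  # illustrative value
poisson_canonical = [
    canonical_density(
        x, math.log(lam),
        h=lambda k: 1.0 / math.factorial(k),
        T=lambda k: k,
        A=math.exp,
    )
    for x in range(6)
]
poisson_direct = [lam**x * math.exp(-lam) / math.factorial(x) for x in range(6)]

# Bernoulli(p): h(x) = 1, T(x) = x, eta = log(p / (1 - p)), A(eta) = log(1 + e^eta)
p = 0.3  # illustrative value
eta_b = math.log(p / (1 - p))
bernoulli_canonical = [
    canonical_density(
        x, eta_b,
        h=lambda k: 1.0,
        T=lambda k: k,
        A=lambda e: math.log1p(math.exp(e)),
    )
    for x in (0, 1)
]
```

Both lists agree with the usual formulas up to floating-point error: `poisson_canonical` matches `poisson_direct`, and `bernoulli_canonical` is approximately `[1 - p, p]`.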

The Pitman–Koopman–Darmois theorem is the converse: under regularity conditions (fixed support, smooth densities), exponential families are the only families of i.i.d. models that admit a sufficient statistic whose dimension stays bounded as the sample size $n$ grows. Conjugate priors with finite-dimensional hyperparameters are therefore restricted to this case; outside it, the posterior's complexity grows with $n$.
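A small numerical illustration of what sufficiency buys, sketched for the Poisson case where $T(x) = x$ and the sample-level statistic is $\sum_i x_i$. The two samples below are made up for the example; they share the same size and the same sum, so their log-likelihoods should differ only by a constant that does not involve $\lambda$.

```python
import math

def poisson_loglik(xs, lam):
    """Log-likelihood of an i.i.d. Poisson(lam) sample."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

# Two hypothetical samples with the same n and the same sum (12)
sample_a = [0, 1, 2, 9]
sample_b = [3, 3, 3, 3]

lams = [0.5, 1.0, 2.0, 5.0]
# The gap comes only from the h(x) = 1/x! terms, so it is the same for every lam
gaps = [poisson_loglik(sample_a, l) - poisson_loglik(sample_b, l) for l in lams]
```

Because the gap is constant in $\lambda$, both samples give the same likelihood function up to scale, and hence the same maximum-likelihood estimate and the same posterior under any prior.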

2. Canonical-form terms

The names look like jargon but each one carries a specific meaning. The log-partition piece in particular is a direct loan from statistical physics that pays for itself in §3.

| Piece | Name | Why that name |
| --- | --- | --- |
| $\eta \in \mathbb R^k$ | natural parameter (also canonical) | In this parametrization the exponent $\eta^\top T(x)$ is linear in $\eta$, so the log-likelihood is concave, the natural parameter space is convex, and the geometry is simplest: the Fisher information becomes the Hessian of $A(\eta)$. Contrast the "conventional" or "moment" parameter: $p$ for Bernoulli vs. $\eta = \log\frac{p}{1-p}$, $\lambda$ for Poisson vs. $\eta = \log\lambda$. |
| $T(x) \in \mathbb R^k$ | sufficient statistic | Conditional on $T(x)$, the distribution of $x$ doesn't depend on $\eta$, by Fisher–Neyman (see sufficient statistics). $T$ carries all the information about $\eta$ that $x$ provides. Pitman–Koopman–Darmois says this is the only class where a finite-dimensional $T$ does the job for every $n$. |
| $h(x) \ge 0$ | base measure (also carrier) | The density of the reference measure against which the family is defined. Whatever cannot be absorbed into $\eta^\top T(x) - A(\eta)$ ends up here. For Poisson it carries $1/x!$; for known-$\sigma$ Normal it carries the $x^2$ piece that doesn't depend on $\mu$; for Beta and Gamma it can be taken to be $1$. |
| $A(\eta) = \log\int h(x)\,e^{\eta^\top T(x)}\,dx$ | log-partition function (also cumulant function) | Direct loan from statistical mechanics. In thermodynamics, $Z(\beta) = \sum_s e^{-\beta E_s}$ is the partition function: the normalizer that turns Boltzmann weights $e^{-\beta E_s}$ into a probability distribution, and almost every thermodynamic observable is a derivative of $\log Z$. Here $\int h(x)\,e^{\eta^\top T(x)}\,dx = e^{A(\eta)}$, so $A = \log Z$. The "cumulant" alias comes from the derivative property covered in §3. |

The log-partition has a direct payoff. The free-energy/ELBO identity in variational inference is a statement about the log-partition function of an exponential family: $\log Z = \sup_q \bigl(\mathbb E_q[-E] + H(q)\bigr)$, with $Z = p(y)$ the evidence, $E(\theta) = -\log p(y, \theta)$ the energy, and the supremum over distributions $q$ on $\theta$ attained at the posterior $q(\theta) = p(\theta \mid y)$. That's not a coincidence. Bayesian computation is partition-function evaluation in disguise.
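The variational identity can be checked numerically on a toy problem. The sketch below assumes a hypothetical five-state space with randomly drawn energies standing in for $E(\theta)$; nothing about these values comes from the original text. It evaluates $\mathbb E_q[-E] + H(q)$ at the Gibbs distribution $q^\ast \propto e^{-E}$ and at many random distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
energy = rng.normal(size=5)                # hypothetical energies E on 5 states
log_Z = np.log(np.sum(np.exp(-energy)))    # exact log-partition function

def free_energy_bound(q, energy):
    """E_q[-E] + H(q) for a distribution q with strictly positive entries."""
    return float(np.sum(q * (-energy)) - np.sum(q * np.log(q)))

# The maximizer is the Gibbs distribution q* proportional to exp(-E)
gibbs = np.exp(-energy - log_Z)
bound_at_gibbs = free_energy_bound(gibbs, energy)

# Any other distribution falls short of log Z by KL(q || q*)
random_qs = rng.dirichlet(np.ones(5), size=1000)
best_random = max(free_energy_bound(q, energy) for q in random_qs)
```

`bound_at_gibbs` equals `log_Z` up to rounding, while `best_random` stays below it. In variational inference the same gap is the KL divergence between the approximation and the posterior.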

3. Derivatives of $A(\eta)$ give moments

Differentiating $A(\eta) = \log\int h(x)\,e^{\eta^\top T(x)}\,dx$ once and twice gives, by the same calculation that defines a moment-generating function:

$$ \nabla A(\eta) \;=\; \mathbb E[T(X)] \;=:\; \mu, \qquad \nabla^2 A(\eta) \;=\; \mathrm{Cov}[T(X)]. $$

Higher derivatives give higher cumulants (third = skewness × $\sigma^3$, fourth = excess kurtosis × $\sigma^4$, …). That is why $A$ is also called the cumulant function: every cumulant of the sufficient statistic is a derivative of one scalar function.
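The derivative property is easy to verify with finite differences. This is a rough sketch, assuming central differences with hand-picked step sizes and illustrative parameter values; it checks the Bernoulli case, where $A(\eta) = \log(1+e^\eta)$, and the Poisson case, where $A(\eta) = e^\eta$.

```python
import math

def d1(f, x, eps=1e-5):
    """Central-difference approximation to f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def d2(f, x, eps=1e-4):
    """Central-difference approximation to f''(x)."""
    return (f(x + eps) - 2 * f(x) + f(x - eps)) / eps**2

# Bernoulli(p): expect A'(eta) = p and A''(eta) = p * (1 - p)
p = 0.3
eta_b = math.log(p / (1 - p))
A_bern = lambda e: math.log1p(math.exp(e))
mean_b, var_b = d1(A_bern, eta_b), d2(A_bern, eta_b)

# Poisson(lam): expect A'(eta) = A''(eta) = lam (mean equals variance)
lam = 2.5
eta_p = math.log(lam)
mean_p, var_p = d1(math.exp, eta_p), d2(math.exp, eta_p)
```

Up to finite-difference error, `mean_b` and `var_b` recover $p$ and $p(1-p)$, and both `mean_p` and `var_p` recover $\lambda$.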

An operational consequence follows:

Generalized linear models feed a linear predictor into the natural parameter of an exponential-family response: for observation $i$ with covariate vector $z_i$ (written $z$ here because $x$ already denotes the response), $\eta_i = z_i^\top\beta$. The function that maps the mean parameter $\mu = \mathbb E[T(X)]$ to the natural parameter $\eta$ is the canonical link; its inverse $\mu = \nabla A(\eta)$ is the mean function. Logistic and Poisson regression are precisely the cases where the response is Bernoulli or Poisson and the link is the canonical one:

| Family | Mean $\mu$ | Canonical link $\eta(\mu)$ | Inverse link $\mu(\eta)$ | GLM name |
| --- | --- | --- | --- | --- |
| Bernoulli | $p$ | $\log\dfrac{p}{1-p}$ (logit) | $\dfrac{1}{1+e^{-\eta}}$ (sigmoid) | logistic regression |
| Poisson | $\lambda$ | $\log\lambda$ (log link) | $e^\eta$ | Poisson regression |
| Normal ($\sigma$ known) | $\mu$ | $\mu/\sigma^2$ (identity when $\sigma^2 = 1$) | $\sigma^2\,\eta$ | linear regression |
| Exponential | $1/\lambda$ | $-1/\mu = -\lambda$ | $-1/\eta$ | (survival models) |

The logit/sigmoid pair is the most famous example. Logistic regression is exactly the GLM with a Bernoulli response and the canonical link; the sigmoid is the matching inverse mean function. Same for the log/exp pair behind Poisson regression. The "natural" in natural parameter is the same "natural" as in natural link.
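To make the canonical-link machinery concrete, here is a minimal sketch of logistic regression fitted by Newton's method on simulated data. The function `fit_logistic`, the design matrix `Z`, the coefficients `true_beta`, and the sample size are all assumptions made for illustration. With the canonical link, the score is simply covariates times residuals, $Z^\top(y - \mu)$, and the Hessian uses $A''(\eta) = \mu(1-\mu)$.

```python
import numpy as np

def sigmoid(eta):
    """Inverse canonical link for the Bernoulli family: mu = A'(eta)."""
    return 1.0 / (1.0 + np.exp(-eta))

def fit_logistic(Z, y, n_iter=25):
    """Newton's method for a Bernoulli GLM with the canonical (logit) link."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(Z @ beta)                        # mean function
        grad = Z.T @ (y - mu)                         # score under the canonical link
        hess = Z.T @ (Z * (mu * (1 - mu))[:, None])   # Fisher information, via A''(eta)
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated data: intercept plus one standard-normal covariate (hypothetical setup)
rng = np.random.default_rng(0)
n = 2000
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.5])
y = (rng.random(n) < sigmoid(Z @ true_beta)).astype(float)

beta_hat = fit_logistic(Z, y)
score_at_fit = Z.T @ (y - sigmoid(Z @ beta_hat))
```

At the fitted coefficients the score vanishes, which is the moment-matching condition $\sum_i z_i\,y_i = \sum_i z_i\,\mu_i$: the observed sufficient statistics equal their fitted expectations.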

5. Six families, one template

The table below shows six standard members in the canonical decomposition. The pieces $h$, $T$, $\eta$, $A$ change family to family, but the shape $p(x\mid\eta) = h(x)\exp(\eta^\top T(x) - A(\eta))$ does not.

| Family | $h(x)$ | $T(x)$ | $\eta(\theta)$ | $A(\eta)$ | Support |
| --- | --- | --- | --- | --- | --- |
| Bernoulli$(p)$ | $1$ | $x$ | $\log\dfrac{p}{1-p}$ | $\log(1+e^\eta) = -\log(1-p)$ | $\{0,1\}$ |
| Poisson$(\lambda)$ | $1/x!$ | $x$ | $\log\lambda$ | $e^\eta = \lambda$ | $\mathbb{N}_0 = \{0,1,2,\dots\}$ |
| Exponential$(\lambda)$ | $1$ | $x$ | $-\lambda$ | $-\log(-\eta) = -\log\lambda$ | $[0,\infty)$ |
| Normal$(\mu,\sigma^2)$, $\sigma$ known | $\dfrac{e^{-x^2/(2\sigma^2)}}{\sqrt{2\pi\sigma^2}}$ | $x$ | $\mu/\sigma^2$ | $\sigma^2\eta^2/2 = \mu^2/(2\sigma^2)$ | $\mathbb R$ |
| Gamma$(\alpha,\beta)$ | $1$ | $(\log x,\;x)$ | $(\alpha-1,\;-\beta)$ | $\log\Gamma(\alpha) - \alpha\log\beta$ | $(0,\infty)$ |
| Beta$(\alpha,\beta)$ | $1$ | $(\log x,\;\log(1-x))$ | $(\alpha-1,\;\beta-1)$ | $\log B(\alpha,\beta)$ | $(0,1)$ |
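The two-parameter rows can be checked numerically. The sketch below, with illustrative parameter values and a hand-rolled midpoint rule (both assumptions, chosen to avoid extra dependencies), plugs the Gamma and Beta entries for $h$, $T$, $\eta$, $A$ into the template and confirms that the result integrates to one. For the Gamma row it also confirms that the mean of the second sufficient statistic, $x$, is $\alpha/\beta$, which is $\partial A/\partial\eta_2$.

```python
import math
import numpy as np

def canonical_pdf(x, eta, T, A):
    """h(x) * exp(eta . T(x) - A) with h(x) = 1, vectorized over x."""
    return np.exp(eta @ T(x) - A)

def midpoint_integral(f, lo, hi, n=200_000):
    """Midpoint-rule quadrature on [lo, hi]; never evaluates f at the endpoints."""
    edges = np.linspace(lo, hi, n + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    return float(np.sum(f(mids)) * (hi - lo) / n)

# Gamma(alpha, beta): T = (log x, x), eta = (alpha - 1, -beta)
alpha, beta = 3.0, 2.0
eta_g = np.array([alpha - 1.0, -beta])
A_g = math.lgamma(alpha) - alpha * math.log(beta)
T_g = lambda x: np.stack([np.log(x), x])
gamma_mass = midpoint_integral(lambda x: canonical_pdf(x, eta_g, T_g, A_g), 0.0, 40.0)
gamma_mean = midpoint_integral(lambda x: x * canonical_pdf(x, eta_g, T_g, A_g), 0.0, 40.0)

# Beta(a, b): T = (log x, log(1 - x)), eta = (a - 1, b - 1)
a, b = 2.5, 4.0
eta_beta = np.array([a - 1.0, b - 1.0])
A_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)   # log B(a, b)
T_beta = lambda x: np.stack([np.log(x), np.log1p(-x)])
beta_mass = midpoint_integral(lambda x: canonical_pdf(x, eta_beta, T_beta, A_beta), 0.0, 1.0)
```

Both `gamma_mass` and `beta_mass` come out at one up to quadrature error, and `gamma_mean` matches $\alpha/\beta = 1.5$. The Gamma integral is truncated at 40, where the remaining tail mass is negligible for these parameter values.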

Three structural symmetries:

What next