The Exponential Family
1. The canonical form
A parametric family is an exponential family if every density in it can be written
$$ p(x \mid \eta) \;=\; h(x)\,\exp\!\Bigl[\eta^\top T(x) - A(\eta)\Bigr]. $$
Most named distributions are members: Bernoulli, binomial and multinomial (with the number of trials fixed), Poisson, geometric, exponential, gamma, beta, Dirichlet, Gaussian (with known or unknown variance). The shape of the formula does the work. The canonical form trivially satisfies the Fisher–Neyman factorization theorem (see sufficient statistics): $T(x)$ is staring at you in the exponent.
The Pitman–Koopman–Darmois theorem is the converse: under regularity (fixed support, smooth densities), exponential families are the only families that admit a sufficient statistic whose dimension stays fixed for every sample size $n$. Under the same conditions, conjugate priors with finite-dimensional hyperparameters are therefore restricted to this case; outside it, the posterior's complexity grows with $n$.
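As a minimal sketch of the two claims above, the snippet below evaluates the canonical form for the Poisson family and compares it with the textbook pmf, then checks that two samples with the same value of $\sum_i T(x_i)$ have a likelihood ratio that does not depend on the parameter. The helper names (`canonical_pmf`, `likelihood`) and the sample values are illustrative choices, not part of the original text.

```python
import math

def canonical_pmf(x, eta, h, T, A):
    """Evaluate h(x) * exp(eta * T(x) - A(eta)) for a scalar natural parameter."""
    return h(x) * math.exp(eta * T(x) - A(eta))

def poisson_pmf(x, lam):
    """Textbook Poisson pmf, used only as a reference."""
    return lam**x * math.exp(-lam) / math.factorial(x)

# Poisson pieces in canonical form: h(x) = 1/x!, T(x) = x, A(eta) = exp(eta).
h = lambda x: 1.0 / math.factorial(x)
T = lambda x: x
A = lambda eta: math.exp(eta)

lam = 3.5
eta = math.log(lam)
max_gap = max(
    abs(canonical_pmf(x, eta, h, T, A) - poisson_pmf(x, lam)) for x in range(20)
)

def likelihood(sample, lam):
    """Joint Poisson likelihood of an i.i.d. sample."""
    return math.prod(poisson_pmf(x, lam) for x in sample)

# Two hypothetical samples sharing the same sufficient statistic, sum(x) = 6.
sample_a, sample_b = [1, 2, 3], [0, 0, 6]
ratios = [likelihood(sample_a, l) / likelihood(sample_b, l) for l in (0.5, 2.0, 7.0)]

print(max_gap)
print(ratios)
```

The ratio is the same for every $\lambda$ because the only parameter-dependent part of the likelihood goes through $\sum_i x_i$; what is left is the ratio of the base-measure factors $\prod_i h(x_i)$, here $720/12 = 60$.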
2. Canonical-form terms
The names look like jargon but each one carries a specific meaning. The log-partition piece in particular is a direct loan from statistical physics that pays for itself in §3.
| Piece | Name | Why that name |
|---|---|---|
| $\eta \in \mathbb R^k$ | natural parameter (also canonical) | In this parametrization the exponent $\eta^\top T(x) - A(\eta)$ is linear in $\eta$ apart from the normalizer $A(\eta)$, which is convex. The log-likelihood is therefore concave in $\eta$, the natural parameter space is a convex set, and the geometry is simplest: the Fisher information becomes the Hessian of $A(\eta)$. Contrast the "conventional" or "moment" parameter: $p$ for Bernoulli vs. $\eta = \log\frac{p}{1-p}$, $\lambda$ for Poisson vs. $\eta = \log\lambda$. |
| $T(x) \in \mathbb R^k$ | sufficient statistic | Conditional on $T(x)$, the distribution of $x$ doesn't depend on $\eta$, by Fisher–Neyman (see sufficient statistics). $T$ carries all the information about $\eta$ that $x$ provides. Pitman–Koopman–Darmois says this is the only class where a finite-dimensional $T$ does the job for every $n$. |
| $h(x) \ge 0$ | base measure (also carrier) | The dominating measure with respect to which the density is computed. Whatever cannot be absorbed into $\eta^\top T(x) - A(\eta)$ ends up here. For Poisson it carries $1/x!$; for known-$\sigma$ Normal it carries the factor $e^{-x^2/(2\sigma^2)}/\sqrt{2\pi\sigma^2}$, which doesn't depend on $\mu$; for Beta and Gamma it can be taken to be $1$ on the support. |
| $A(\eta) = \log\int h(x)\,e^{\eta^\top T(x)}\,dx$ | log-partition function (also cumulant function) | Direct loan from statistical mechanics. In thermodynamics, $Z(\beta) = \sum_s e^{-\beta E_s}$ is the partition function: the normalizer that turns Boltzmann weights $e^{-\beta E_s}$ into a probability distribution, and almost every thermodynamic observable is a derivative of $\log Z$. Here $\int h(x)\,e^{\eta^\top T(x)}\,dx = e^{A(\eta)}$, so $A = \log Z$. The "cumulant" alias comes from the derivative property covered in §3. |
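To make the "normalizer" reading of $A(\eta)$ concrete, here is a small sketch that computes the log-partition function of the Poisson family directly from its definition (a sum rather than an integral, since the support is discrete) and compares it with the closed form $e^\eta$. The truncation point `terms=200` is an assumption that is adequate for the moderate values of $\lambda$ used here.

```python
import math

def log_partition_poisson(eta, terms=200):
    """Approximate A(eta) = log sum_x h(x) * exp(eta * x) with h(x) = 1/x!.

    The infinite sum is truncated at `terms`; the log-sum-exp trick
    (subtracting the largest log-weight) keeps the computation stable.
    """
    log_weights = [eta * x - math.lgamma(x + 1) for x in range(terms)]
    m = max(log_weights)
    return m + math.log(sum(math.exp(w - m) for w in log_weights))

for lam in (0.3, 2.0, 10.0):
    eta = math.log(lam)
    # Numerical A(eta) next to the closed form exp(eta) = lam.
    print(lam, log_partition_poisson(eta), math.exp(eta))
```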
3. Derivatives of $A(\eta)$ give moments
Differentiating $A(\eta) = \log\int h(x)\,e^{\eta^\top T(x)}\,dx$ once and twice gives, by the same calculation that extracts cumulants from a cumulant-generating function:
$$ \nabla A(\eta) \;=\; \mathbb E[T(X)] \;=:\; \mu, \qquad \nabla^2 A(\eta) \;=\; \mathrm{Cov}[T(X)]. $$
Higher derivatives give higher cumulants (third = skewness × $\sigma^3$, fourth = excess kurtosis × $\sigma^4$, …). That is why $A$ is also called the cumulant function: every cumulant of the sufficient statistic is a derivative of one scalar function.
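For completeness, the two identities can be checked in a couple of lines, assuming differentiation under the integral sign is allowed (which holds in the interior of the natural parameter space). Differentiating the logarithm once gives

$$ \nabla A(\eta) \;=\; \frac{\int h(x)\,T(x)\,e^{\eta^\top T(x)}\,dx}{\int h(x)\,e^{\eta^\top T(x)}\,dx} \;=\; \int T(x)\,h(x)\,e^{\eta^\top T(x) - A(\eta)}\,dx \;=\; \mathbb E[T(X)], $$

and differentiating the middle expression once more, using $\nabla_\eta\, e^{\eta^\top T(x) - A(\eta)} = \bigl(T(x) - \nabla A(\eta)\bigr)\,e^{\eta^\top T(x) - A(\eta)}$, gives

$$ \nabla^2 A(\eta) \;=\; \mathbb E\bigl[T(X)\,T(X)^\top\bigr] - \mathbb E[T(X)]\,\mathbb E[T(X)]^\top \;=\; \mathrm{Cov}[T(X)]. $$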
Two operational consequences follow:
- The mean is a function of $\eta$. Write $\mu(\eta) = \nabla A(\eta)$; this is the mean function. It is the link between the natural parametrization (where the math is clean) and the moment parametrization (where the answers live).
- The Fisher information is the Hessian of $A$. $I(\eta) = \mathrm{Cov}[T(X)] = \nabla^2 A(\eta)$. So the Fisher metric on an exponential family is just the second derivative of the log-partition. This is why exponential families have such clean information geometry.
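The derivative identities can be checked numerically. The sketch below uses central finite differences on the Bernoulli log-partition function $A(\eta) = \log(1+e^\eta)$ and compares the results with the known mean $p$ and variance $p(1-p)$. The step size and the value $p = 0.3$ are arbitrary illustrative choices.

```python
import math

def A(eta):
    """Bernoulli log-partition function log(1 + exp(eta))."""
    return math.log1p(math.exp(eta))

def first_derivative(f, x, step=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + step) - f(x - step)) / (2 * step)

def second_derivative(f, x, step=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + step) - 2 * f(x) + f(x - step)) / step**2

p = 0.3
eta = math.log(p / (1 - p))        # natural parameter for this p

mean_fd = first_derivative(A, eta)   # should approximate E[X] = p
var_fd = second_derivative(A, eta)   # should approximate Var[X] = p(1 - p)

print(mean_fd, p)
print(var_fd, p * (1 - p))
```

Since $T(x) = x$ here, the second derivative is also the Fisher information in the natural parametrization.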
4. The canonical link function (GLMs)
Generalized linear models put a linear predictor $\eta = x^\top\beta$, built from a covariate vector $x$, in front of an exponential-family response $Y$ with $T(Y) = Y$. The function that maps the mean parameter $\mu = \mathbb E[Y]$ to the natural parameter $\eta$ is the canonical link; its inverse $\mu = \nabla A(\eta)$ is the mean function. Logistic and Poisson regression are precisely the cases where the response is Bernoulli or Poisson and the link is the canonical one:
| Family | $\mu$ | Canonical link $\eta(\mu)$ | Inverse link $\mu(\eta)$ | GLM name |
|---|---|---|---|---|
| Bernoulli | $p$ | $\log\dfrac{p}{1-p}$ (logit) | $\dfrac{1}{1+e^{-\eta}}$ (sigmoid) | logistic regression |
| Poisson | $\lambda$ | $\log\lambda$ (log link) | $e^\eta$ | Poisson regression |
| Normal ($\sigma$ known) | $\mu$ | $\mu/\sigma^2$ (identity when $\sigma^2 = 1$) | $\sigma^2\,\eta$ | linear regression |
| Exponential | $1/\lambda$ | $-1/\mu = -\lambda$ | $-1/\eta$ | (survival models) |
The logit/sigmoid pair is the most famous example. Logistic regression is exactly the GLM with a Bernoulli response and the canonical link; the sigmoid is the matching mean function (the inverse link). Same for the log/exp pair behind Poisson regression. The "canonical" in canonical link is the same "canonical" as in canonical (natural) parameter.
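The link table can be sketched in code as pairs of functions that undo each other. The dictionary `LINKS`, the helper names, the value `SIGMA2 = 4.0`, and the coefficients in the usage lines are all hypothetical; this is an illustration of the table, not a GLM fitting routine.

```python
import math

SIGMA2 = 4.0  # assumed known variance for the Normal row

# family -> (canonical link mu -> eta, inverse link eta -> mu)
LINKS = {
    "bernoulli": (lambda mu: math.log(mu / (1 - mu)),
                  lambda eta: 1 / (1 + math.exp(-eta))),
    "poisson": (lambda mu: math.log(mu), lambda eta: math.exp(eta)),
    "normal": (lambda mu: mu / SIGMA2, lambda eta: SIGMA2 * eta),
    "exponential": (lambda mu: -1 / mu, lambda eta: -1 / eta),
}

def round_trip(family, mu):
    """Map a mean to the natural parameter and back again."""
    link, inverse_link = LINKS[family]
    return inverse_link(link(mu))

def predict_mean(family, x, beta):
    """Mean response under the canonical link: inverse_link(x . beta)."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    return LINKS[family][1](eta)

for family, mu in [("bernoulli", 0.2), ("poisson", 3.0),
                   ("normal", -1.5), ("exponential", 0.5)]:
    print(family, mu, round_trip(family, mu))

# Logistic regression prediction with made-up coefficients: eta = -1 + 2 * 0.5 = 0.
print(predict_mean("bernoulli", [1.0, 0.5], [-1.0, 2.0]))
```

With the canonical link the linear predictor is the natural parameter itself, so prediction is a single application of the mean function.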
5. Six families, one template
The table below shows six standard members in the canonical decomposition. The pieces $h$, $T$, $\eta$, $A$ change family to family, but the shape $p(x\mid\eta) = h(x)\exp(\eta^\top T(x) - A(\eta))$ does not.
| Family | $h(x)$ | $T(x)$ | $\eta(\theta)$ | $A(\eta)$ | Support |
|---|---|---|---|---|---|
| Bernoulli$(p)$ | $1$ | $x$ | $\log\dfrac{p}{1-p}$ | $\log(1+e^\eta) = -\log(1-p)$ | $\{0,1\}$ |
| Poisson$(\lambda)$ | $1/x!$ | $x$ | $\log\lambda$ | $e^\eta = \lambda$ | $\{0,1,2,\dots\}$ |
| Exponential$(\lambda)$ | $1$ | $x$ | $-\lambda$ | $-\log(-\eta) = -\log\lambda$ | $[0,\infty)$ |
| Normal$(\mu,\sigma^2)$, $\sigma$ known | $\dfrac{e^{-x^2/(2\sigma^2)}}{\sqrt{2\pi\sigma^2}}$ | $x$ | $\mu/\sigma^2$ | $\sigma^2\eta^2/2 = \mu^2/(2\sigma^2)$ | $\mathbb R$ |
| Gamma$(\alpha,\beta)$ | $1$ | $(\log x,\;x)$ | $(\alpha-1,\;-\beta)$ | $\log\Gamma(\alpha) - \alpha\log\beta$ | $(0,\infty)$ |
| Beta$(\alpha,\beta)$ | $1$ | $(\log x,\;\log(1-x))$ | $(\alpha-1,\;\beta-1)$ | $\log B(\alpha,\beta)$ | $(0,1)$ |
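The two-parameter rows are the easiest to get wrong by a sign or an off-by-one in $\eta$, so here is a small check, offered as a sketch, that plugs the Gamma and Beta rows into the shared template and compares the result with the textbook densities. The function names and the evaluation points are illustrative choices.

```python
import math

def canonical_density(x, eta, T, A, h=lambda x: 1.0):
    """h(x) * exp(<eta, T(x)> - A) for vector-valued eta and T(x)."""
    inner = sum(e * t for e, t in zip(eta, T(x)))
    return h(x) * math.exp(inner - A)

def gamma_canonical(x, alpha, beta):
    eta = (alpha - 1, -beta)
    A = math.lgamma(alpha) - alpha * math.log(beta)
    return canonical_density(x, eta, lambda x: (math.log(x), x), A)

def beta_canonical(x, a, b):
    eta = (a - 1, b - 1)
    A = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)  # log B(a, b)
    return canonical_density(x, eta, lambda x: (math.log(x), math.log(1 - x)), A)

def gamma_textbook(x, alpha, beta):
    """Gamma density with shape alpha and rate beta."""
    return beta**alpha / math.gamma(alpha) * x**(alpha - 1) * math.exp(-beta * x)

def beta_textbook(x, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x**(a - 1) * (1 - x)**(b - 1) / B

print(gamma_canonical(1.7, 2.5, 1.2), gamma_textbook(1.7, 2.5, 1.2))
print(beta_canonical(0.3, 2.0, 5.0), beta_textbook(0.3, 2.0, 5.0))
```

Note that the table's Gamma row uses the rate parametrization, which is what `gamma_textbook` assumes.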
Three structural observations:
- The 1-parameter families (Bernoulli, Poisson, Exponential, Normal-known-$\sigma$) all have $T(x) = x$. Their sufficient statistic for $n$ samples is just $\sum x_i$, the natural quantity an analyst would already compute.
- The 2-parameter families (Gamma, Beta) have $T(x)$ a 2-vector. For Gamma it is $(\log x, x)$: geometric and arithmetic means are sufficient. For Beta it is $(\log x, \log(1-x))$, which mirrors the fact that swapping $x \leftrightarrow 1-x$ simply exchanges $\alpha$ and $\beta$.
- Cauchy is the standard counterexample: $p(x\mid\theta) = (\pi[1+(x-\theta)^2])^{-1}$ has fixed support and a smooth density but cannot be put in canonical form, so by Pitman–Koopman–Darmois it has no sufficient statistic of fixed dimension for $n$ samples.
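The contrast between the first and last bullets can be illustrated numerically. In the sketch below, two hypothetical samples share the same sum. For a unit-variance Normal the difference of their log-likelihoods is constant in the location parameter, as sufficiency of $\sum x_i$ requires; for the Cauchy location family it is not. This only shows that the sum fails to be sufficient for Cauchy, not the full Pitman–Koopman–Darmois statement.

```python
import math

def normal_loglik(sample, mu):
    """Log-likelihood of an i.i.d. N(mu, 1) sample."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2 for x in sample)

def cauchy_loglik(sample, theta):
    """Log-likelihood of an i.i.d. Cauchy(theta, 1) sample."""
    return sum(-math.log(math.pi * (1 + (x - theta) ** 2)) for x in sample)

# Two illustrative samples with the same sum (and the same size).
sample_a, sample_b = [0.0, 2.0], [1.0, 1.0]
thetas = (-1.0, 0.0, 1.0, 3.0)

normal_gaps = [normal_loglik(sample_a, t) - normal_loglik(sample_b, t) for t in thetas]
cauchy_gaps = [cauchy_loglik(sample_a, t) - cauchy_loglik(sample_b, t) for t in thetas]

# Spread of the log-likelihood difference across parameter values.
normal_spread = max(normal_gaps) - min(normal_gaps)
cauchy_spread = max(cauchy_gaps) - min(cauchy_gaps)

print(normal_gaps, normal_spread)
print(cauchy_gaps, cauchy_spread)
```

A spread of zero (up to rounding) means the two samples are indistinguishable for inference about the parameter; a nonzero spread means the data carry information beyond the sum.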