Choosing a Prior

Principles for prior selection: use real prior information when you have it; otherwise group invariance, maximum entropy, or Jeffreys — and the trade-offs between them.

1. Use real prior information when available

The most important principle of prior selection is that your prior should represent the best knowledge you have about the parameters before you look at the data. Usually some information is at your disposal:

1 Ruggles & Brodie, "An Empirical Approach to Economic Intelligence in World War II," Journal of the American Statistical Association 42:237 (1947), 72–91. The problem is also covered in Wikipedia's "German tank problem".

It is unjustified to use default, ignorance, or other automatic priors if you have substantial information that can affect the answer. Reaching for a non-informative prior is a real choice — it says "I want my conclusions to be visible to a reader who shares no prior beliefs with me" — but the choice has costs (wider credible intervals, sensitivity to parameterization), and you should make it only when the cost is worth paying.

One sentence: If you have prior information, use it; if you don't, pick the "uninformative" prior whose justification matches the question you care about — symmetry, ignorance under constraints, or reparameterization invariance.

2. Three routes to "uninformative"

A number of principles have been used to construct non-informative priors. The three with logical justification — and with results that broadly reproduce a corresponding frequentist analysis — are:

  1. Group invariance arguments. Choose the prior to respect a symmetry the problem has — permutation symmetry on a die, translation symmetry on a location parameter, scale invariance on a positive parameter.
  2. Maximum entropy arguments. Choose the prior with the largest information entropy subject to whatever constraints you do know (mean, variance, moments, support).
  3. Arguments from the Fisher information matrix. Take Jeffreys' prior, $\pi_J(\theta)\propto\sqrt{\det I(\theta)}$, which is invariant under smooth reparameterizations.

None of the three is canonical. They answer different questions, and they sometimes disagree (a flat prior in $p$ is not flat in $\log\bigl(p/(1-p)\bigr)$; the Bernoulli MaxEnt prior under no constraint is Beta$(1,1)$, but Jeffreys is Beta$(1/2,1/2)$). The point of having three routes is that which one is right depends on what you mean by "uninformative."

3. Group invariance

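If our prior knowledge is invariant under the action of some group $G$ acting on the parameter space (that is, if applying any $g\in G$ doesn't change what we believe), then a defensible prior is one that is also invariant under $G$. Invariance here is a statement about the prior measure, not about the value of the density at a point: the requirement is $\pi(g\cdot\theta)\,\bigl|\det\partial(g\cdot\theta)/\partial\theta\bigr|=\pi(\theta)$ for all $g$. For translations the Jacobian factor is 1, so the density itself is constant; for rescalings it is not. When the group acts transitively on the parameter space, this requirement determines the prior up to a constant factor.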

| Setting | Symmetry group | Invariant prior |
|---|---|---|
| Die or coin, no side favored | Permutations of faces | Uniform on faces, $\pi=1/k$ |
| Location parameter $\mu\in\mathbb{R}$, translation-invariant | $\mu\mapsto\mu+c$ | Flat (improper), $\pi(\mu)\propto 1$ |
| Angle $\theta\in[0,2\pi)$, rotation-invariant | $\theta\mapsto\theta+c$ | Uniform on the circle |
| Direction $(\varphi,\theta)$ on the sphere | $O(3)$ | Uniform on solid angle, $d\Omega = \cos\varphi\,d\varphi\,d\theta$ |
| Scale parameter $\sigma>0$, scale-invariant | $\sigma\mapsto c\sigma$ | $\pi(\sigma)\propto 1/\sigma$ (improper); equivalently flat on $\log\sigma$ |

The scale case is the prototype. If we're measuring a positive quantity and want the prior to be invariant to the choice of unit (a ruler graduated in inches vs. centimeters), then the prior measure must be unchanged by rescaling: $\pi(\sigma)\,d\sigma=\pi(c\sigma)\,d(c\sigma)$, i.e. $\pi(\sigma)=c\,\pi(c\sigma)$ for all $c>0$. Equivalently, the density of $\omega=\log\sigma$ is invariant to translations of $\omega$, so it is flat in $\omega$. Pushing back to $\sigma$ picks up the Jacobian $|d\omega/d\sigma|=1/\sigma$, and the prior on $\sigma$ is the Jeffreys prior for a scale parameter, $\pi(\sigma)\propto 1/\sigma$.
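A small numerical sketch of the unit-invariance argument: under $\pi(\sigma)\propto 1/\sigma$ the (unnormalized) mass of an interval $[a,b]$ is $\log(b/a)$, which does not change when both endpoints are rescaled, whereas a flat density on $\sigma$ assigns mass $b-a$, which does. The function name and the inch/centimeter interval are illustrative choices, not part of the argument above.

```python
import math

def log_uniform_mass(a: float, b: float) -> float:
    """Unnormalized mass that pi(sigma) = 1/sigma assigns to [a, b], with 0 < a < b."""
    return math.log(b / a)

# The same physical interval, written in inches and in centimeters.
INCH_TO_CM = 2.54
mass_inches = log_uniform_mass(1.0, 10.0)
mass_cm = log_uniform_mass(1.0 * INCH_TO_CM, 10.0 * INCH_TO_CM)

# A flat density on sigma assigns mass b - a, which scales with the unit.
flat_inches = 10.0 - 1.0
flat_cm = (10.0 - 1.0) * INCH_TO_CM

print(mass_inches, mass_cm)  # equal up to floating-point rounding
print(flat_inches, flat_cm)  # differ by the conversion factor
```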

Group-invariance arguments are the most secure from a logical point of view, but they need a natural symmetry in the problem to begin with. They are great when they apply and silent when they don't.

4. Maximum entropy

Maximum entropy (E.T. Jaynes) chooses the prior with the largest information entropy subject to known constraints. The intuition: entropy measures how much information we lack, so picking the maximum-entropy density commits the least beyond the stated constraints. The recipe is a Lagrangian:

$$ \mathcal{L}[p] \;=\; -\!\int p(x)\log p(x)\,dx \;-\; \lambda_0\!\Bigl(\int p - 1\Bigr) \;-\; \sum_j \lambda_j\!\Bigl(\int g_j(x)\,p(x)\,dx - c_j\Bigr). $$

Stationarity in $p$ gives $-\log p(x) - 1 - \lambda_0 - \sum_j\lambda_j g_j(x)=0$, i.e.,

$$ p(x) \;\propto\; \exp\!\Bigl(-\sum_j \lambda_j g_j(x)\Bigr). $$

The MaxEnt density is an exponential family whose sufficient statistics are exactly the constraint functions $g_j$ and whose natural parameters are the negated Lagrange multipliers, $-\lambda_j$. Solving the constraint equations $\int g_j(x)\,p(x)\,dx = c_j$ fixes the $\lambda_j$.
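To make the recipe concrete, here is a minimal sketch for a discrete case: a six-sided die constrained to have a given mean (4.5 is an illustrative number). The code works with the natural parameter `eta` of the exponential family, $p_k\propto e^{\eta k}$, which matches the Lagrange multiplier up to the sign convention chosen in the Lagrangian. The function names are hypothetical, and plain bisection is just one simple way to solve the single constraint equation.

```python
import math

def maxent_pmf(values, eta):
    """Exponential-family pmf p_k proportional to exp(eta * k) on a finite support."""
    weights = [math.exp(eta * v) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def mean_of(values, pmf):
    return sum(v * p for v, p in zip(values, pmf))

def solve_eta(values, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on eta; the mean of the pmf is strictly increasing in eta."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_of(values, maxent_pmf(values, mid)) < target_mean:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

faces = [1, 2, 3, 4, 5, 6]
eta = solve_eta(faces, 4.5)
pmf = maxent_pmf(faces, eta)

print(eta)                  # positive: mass tilts toward the high faces
print(pmf)
print(mean_of(faces, pmf))  # close to the requested mean
```

With the target mean set to 3.5 the solver returns `eta` near zero and the pmf collapses to the uniform distribution, which is the unconstrained MaxEnt answer.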

The MaxEnt outputs for the standard constraints are exactly the named "default" distributions:

| Support | Constraints | MaxEnt density |
|---|---|---|
| Finite set $\{1,\dots,n\}$ | None (just normalization) | Uniform $p_k=1/n$ |
| Interval $[a,b]\subset\mathbb{R}$ | None | Uniform on $[a,b]$ |
| $\{0,1,2,\dots\}$ | Mean $\mu$ | Geometric$(p=1/(1+\mu))$ |
| $[0,\infty)$ | Mean $\mu$ | Exponential$(1/\mu)$ |
| $\mathbb{R}$ | Mean $\mu$, variance $\sigma^2$ | Normal$(\mu,\sigma^2)$ |
| $[0,\infty)$ | $\mathbb{E}[\log x]$ and $\mathbb{E}[x]$ both fixed | Gamma$(\alpha,\beta)$ |
| $(0,1)$ | $\mathbb{E}[\log x]$ and $\mathbb{E}[\log(1-x)]$ both fixed | Beta$(\alpha,\beta)$ |

This is the rule of thumb you reach for in practice: say what you know, and the MaxEnt prior is whichever named distribution turns those facts into sufficient statistics. If all you know is that a quantity is nonnegative and has a known mean, the exponential is the prior with the least extra commitment. If the quantity is real-valued and you know its mean and variance, Gaussian. If the support is bounded and you know nothing else, uniform.
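One row of the table can be spot-checked numerically. The sketch below compares the entropy of the geometric distribution on $\{0,1,2,\dots\}$ with that of a Poisson distribution of the same mean; the Poisson is only an illustrative competitor, and the mean of 3 and the truncation at 400 terms are arbitrary choices made so that the neglected tails are negligible.

```python
import math

def entropy(pmf):
    """Shannon entropy in nats, skipping zero-probability terms."""
    return -sum(p * math.log(p) for p in pmf if p > 0.0)

def geometric_pmf(mu, n_terms):
    """Geometric on {0, 1, 2, ...} with mean mu, i.e. success probability 1/(1+mu)."""
    p = 1.0 / (1.0 + mu)
    return [p * (1.0 - p) ** k for k in range(n_terms)]

def poisson_pmf(mu, n_terms):
    """Poisson with mean mu, computed in log space to avoid overflow in k!."""
    return [math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
            for k in range(n_terms)]

MU = 3.0
N_TERMS = 400  # truncation point; both tails are negligible beyond it

h_geometric = entropy(geometric_pmf(MU, N_TERMS))
h_poisson = entropy(poisson_pmf(MU, N_TERMS))

print(h_geometric, h_poisson)  # the geometric entropy is the larger of the two
```

The geometric value also agrees with its closed form $(1+\mu)\log(1+\mu)-\mu\log\mu$.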

But MaxEnt has a flaw, which is why Jeffreys is also worth knowing: its output depends on the choice of base measure and on which coordinates you write the entropy in. The continuous "differential entropy" is not invariant under smooth reparameterizations — if you change coordinates $x\mapsto\phi(x)$, the entropy shifts by $\mathbb{E}[\log|\phi'(x)|]$ and so does the answer. So MaxEnt implicitly commits to a natural parameterization. When that parameterization is obvious (Cartesian coordinates on space, time on $[0,\infty)$), MaxEnt is sharp. When it isn't, the answer is parameterization-dependent.
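A worked instance of that shift, using an illustrative choice of distribution and map: take $X$ uniform on $(0,1)$, so $h(X)=0$, and let $Y=\phi(X)=X^2$. Then $Y$ has density $f_Y(y)=1/(2\sqrt{y})$ on $(0,1)$, and

$$ h(Y) \;=\; -\int_0^1 f_Y(y)\log f_Y(y)\,dy \;=\; \mathbb{E}\bigl[\log\bigl(2\sqrt{Y}\bigr)\bigr] \;=\; \log 2 + \mathbb{E}[\log X] \;=\; \log 2 - 1, $$

$$ \mathbb{E}\bigl[\log|\phi'(X)|\bigr] \;=\; \mathbb{E}[\log 2X] \;=\; \log 2 - 1 \;\approx\; -0.307, $$

so $h(Y)=h(X)+\mathbb{E}[\log|\phi'(X)|]$ as claimed. The uniform density maximizes entropy on $(0,1)$ in the $x$ coordinate, but its image in the $y$ coordinate has lower entropy than the uniform density in $y$ (which has entropy 0), so "maximum entropy on the unit interval" names a different distribution in each coordinate.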

5. Jeffreys' priors

Harold Jeffreys' construction sidesteps the coordinate issue by defining the prior through the Fisher information of the likelihood:

$$ \pi_J(\theta) \;\propto\; \sqrt{\det I(\theta)}, \qquad I(\theta)\;=\; -\mathbb{E}\!\left[\frac{\partial^2 \log p(x\mid\theta)}{\partial\theta\,\partial\theta^\top}\right]. $$

This is automatically invariant under reparameterizations. For a scalar parameter and a smooth one-to-one map $\phi=h(\theta)$, the Fisher information transforms as $I(\phi)=I(\theta)\,(d\theta/d\phi)^2$, so $\sqrt{I(\phi)}=\sqrt{I(\theta)}\,|d\theta/d\phi|$, which is exactly the Jacobian factor by which a density transforms under the change of variables. Two analysts using different parameterizations end up with the same prior on the underlying probability model.
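The transformation rule can be checked directly on the Bernoulli model, moving between $p$ and the log-odds $\omega=\log\bigl(p/(1-p)\bigr)$. The sketch below computes the Fisher information in each coordinate as the expected squared score over $x\in\{0,1\}$ and compares $\sqrt{I(\omega)}$ with $\sqrt{I(p)}\,|dp/d\omega|$; the function names are hypothetical.

```python
import math

def fisher_info_p(p):
    """Fisher information of one Bernoulli draw in the p coordinate,
    computed as the expected squared score over x in {0, 1}."""
    score_1 = 1.0 / p            # d/dp log p, the score when x = 1
    score_0 = -1.0 / (1.0 - p)   # d/dp log(1 - p), the score when x = 0
    return p * score_1 ** 2 + (1.0 - p) * score_0 ** 2

def fisher_info_logodds(omega):
    """Same model in omega = log(p / (1 - p)); the score is x - p."""
    p = 1.0 / (1.0 + math.exp(-omega))
    return p * (1.0 - p) ** 2 + (1.0 - p) * p ** 2

def jeffreys_pair(p):
    """Return sqrt(I(omega)) and sqrt(I(p)) * |dp/domega| at the same point."""
    omega = math.log(p / (1.0 - p))
    dp_domega = p * (1.0 - p)
    lhs = math.sqrt(fisher_info_logodds(omega))
    rhs = math.sqrt(fisher_info_p(p)) * abs(dp_domega)
    return lhs, rhs

for p in (0.05, 0.3, 0.5, 0.9):
    print(p, jeffreys_pair(p))  # the two entries agree up to rounding
```

Both sides equal $\sqrt{p(1-p)}$, so the Jeffreys density computed in log-odds is the pushforward of the one computed in $p$.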

Worked examples:

Jeffreys is often improper (the integral diverges), as in the location and scale cases above. That's not by itself a problem for Bayesian inference — the posterior is usually proper as soon as you have any data. The Fisher information page develops Jeffreys in detail, including the multi-parameter version through the Fisher metric and several worked transformations.

6. Prior rules disagree near boundaries

The three routes don't always pick the same prior, because they answer different questions:

For some classes of problems the answers coincide (location and scale parameters under translation/rescaling symmetry). For others they diverge. A Bernoulli with no constraints has MaxEnt prior Beta$(1,1)$ (uniform on $p$); Jeffreys gives Beta$(1/2,1/2)$; a group-invariance argument doesn't directly apply because there's no natural group symmetry on $[0,1]$ unless you posit one. Which to use depends on the question: do you want a uniform-in-$p$ analysis (MaxEnt), or one that survives a switch to log-odds (Jeffreys)?

Figure 1. Three "uninformative" priors on the Bernoulli bias $p$: flat / MaxEnt, Beta$(1,1)$; Jeffreys, Beta$(1/2,1/2)$; Haldane, Beta$(0^+,0^+)$.

Before any data arrive ($n=0$) the three priors look different: flat is uniform on $p$, Jeffreys is U-shaped with mass pushed toward the boundaries, and Haldane (the limit $\alpha,\beta\to 0$) is improper, with its mass piling up at the endpoints $\{0,1\}$. In log-odds coordinates, $\omega=\log\bigl(p/(1-p)\bigr)$, the picture changes: the flat prior on $p$ becomes the logistic density, peaked at $\omega=0$ (that is, $p=1/2$); Jeffreys becomes a broader peak with heavier tails, proportional to $\sqrt{p(1-p)}$ (it is flat in the variance-stabilizing coordinate $\arcsin\sqrt{p}$, not in log-odds); and Haldane becomes flat over all of $\mathbb{R}$. As $n$ grows the posteriors converge. For moderate $n$ the disagreement only matters near the boundary $p\approx 0$ or $p\approx 1$, where one outcome has been observed rarely or never and the prior's pseudo-counts are comparable to the observed counts.
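The convergence claim is easy to see through posterior means. Under a symmetric Beta$(a,a)$ prior, $k$ successes in $n$ trials give the posterior mean $(k+a)/(n+2a)$. The sketch below evaluates it for the three priors on a few made-up data sets; the counts are illustrative only. One caveat is flagged in the code: with Haldane and $k=0$ or $k=n$ the posterior is still improper, and the formula returns its limiting value.

```python
def posterior_mean(k, n, a):
    """Posterior mean of p under a symmetric Beta(a, a) prior
    after k successes in n trials."""
    return (k + a) / (n + 2.0 * a)

# Symmetric Beta(a, a) priors; a = 0 stands for the Haldane limit.
PRIORS = {"flat": 1.0, "jeffreys": 0.5, "haldane": 0.0}

def spread(k, n):
    """Largest minus smallest posterior mean across the three priors."""
    means = [posterior_mean(k, n, a) for a in PRIORS.values()]
    return max(means) - min(means)

# Illustrative data sets: (successes, trials).
# Note: for k = 0 the Haldane posterior is improper; the value shown is a limit.
for k, n in [(3, 10), (300, 1000), (0, 10), (0, 1000)]:
    means = {name: posterior_mean(k, n, a) for name, a in PRIORS.items()}
    print((k, n), means, spread(k, n))
```

In the interior the spread shrinks quickly as $n$ grows; at $k=0$ it is larger for the same $n$, and the priors disagree about whether the estimate is exactly zero or merely small.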

7. Pitfalls and the "ignorance" warning

Non-informative priors are not priors with no assumptions. They are attempts to avoid privileging one coordinate system or one set of facts over another, and each construction makes a specific choice:

The standard caution bears repeating: use of "uninformative" or "automatic" priors does not incorporate any real prior information you may have. If you do have such information, use it.

8. A short decision guide

| Situation | First choice | Why |
|---|---|---|
| You have a calibrated estimate from previous data or theory | Informative prior matching that estimate | Don't waste the information; show it explicitly in the prior |
| You have soft beliefs ("plausibly $\pm 2\sigma$ around 0") | Weakly-informative prior at that scale (e.g. Normal$(0, 2.5^2)$ on a standardized coefficient) | Regularizes near the boundary, agrees with data when there's enough of it |
| Closed-form posterior matters; soft beliefs OK | Conjugate prior with hyperparameters interpretable as pseudo-counts (see the conjugate-prior page) | Fast updates, transparent strength, often a reasonable approximation |
| Location parameter, no information | Flat on $\theta$ (translation-invariant) | All three routes agree |
| Scale or rate parameter, no information | $\pi(\sigma)\propto 1/\sigma$ (Jeffreys / scale-invariant) | All three routes agree |
| Probabilities or rates with possible boundary cases | Beta$(1/2,1/2)$ (Jeffreys) | Coordinate-invariant; better near $p\approx 0, 1$ than flat |
| Discrete outcomes, no preference between them | Symmetric Dirichlet, e.g. $\alpha=1$ (uniform on simplex) or $\alpha=1/2$ (Jeffreys) | Group invariance under permutation of categories |
| Positive quantity with known mean only | Exponential$(1/\mu)$ (MaxEnt) | The least committal positive distribution with that mean |
| Real-valued quantity with known mean and variance | Normal (MaxEnt) | The least committal distribution on $\mathbb{R}$ with those moments |
| Posterior is sensitive to the prior choice | Report results under multiple priors, or collect more data | If the answer depends on the prior in a way you can't defend, the data isn't doing enough work yet |

9. Practical elicitation: turn beliefs into pseudo-counts

For prior beliefs that you can articulate but not immediately write as a density, the pseudo-count reading from the conjugate-prior page is often the most practical conversion:

The discipline this enforces is useful: every conjugate hyperparameter has units of either pseudo-events or pseudo-observations, and writing them down forces you to check that your prior isn't accidentally worth a thousand observations on a hundred-observation problem.
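A minimal sketch of that check for a Beta-Binomial model, assuming the belief is stated as a prior mean plus a strength in pseudo-observations. The helper names and the example numbers (a 20% prior guess worth 10 observations, then 30 successes in 100 trials) are hypothetical.

```python
def beta_from_belief(prior_mean, pseudo_obs):
    """Beta(a, b) with mean prior_mean and total weight a + b = pseudo_obs."""
    if not 0.0 < prior_mean < 1.0:
        raise ValueError("prior_mean must lie strictly between 0 and 1")
    if pseudo_obs <= 0.0:
        raise ValueError("pseudo_obs must be positive")
    return prior_mean * pseudo_obs, (1.0 - prior_mean) * pseudo_obs

def update(a, b, successes, failures):
    """Conjugate update: observed counts add to the pseudo-counts."""
    return a + successes, b + failures

def prior_weight(a, b, n):
    """Fraction of the posterior's total count that comes from the prior."""
    return (a + b) / (a + b + n)

# Hypothetical belief: "about 20%, worth roughly 10 observations".
a, b = beta_from_belief(0.2, 10.0)

# Hypothetical data: 30 successes in 100 trials.
a_post, b_post = update(a, b, 30, 70)
post_mean = a_post / (a_post + b_post)

print(a, b)                       # pseudo-successes and pseudo-failures
print(post_mean)                  # sits between the prior mean and 30/100
print(prior_weight(a, b, 100))    # share of the posterior owed to the prior
```

If `prior_weight` comes out close to 1, as it would for a prior worth a thousand pseudo-observations against a hundred real ones, the data can barely move the posterior.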

What next