Choosing a Prior

Principles for prior selection: use real prior information when you have it; otherwise group invariance, maximum entropy, or Jeffreys — and the trade-offs between them.

1. Use real prior information when available

The most important principle of prior selection is that your prior should represent the best knowledge you have about the parameters before you look at the data. Usually some information is at your disposal:

A candidate's vote share is in $[0, 1]$ by definition, so any prior on it lives on the unit interval — a Beta is the natural family.
In the German Tank Problem (the WWII estimation of German tank production from serial numbers on captured equipment), the total number of tanks $N$ must be at least the largest serial number observed. The prior reflects that lower bound: zero mass below it.¹
A treatment effect cannot plausibly be more than a few standard deviations of the outcome.

¹ Ruggles & Brodie, "An Empirical Approach to Economic Intelligence in World War II," Journal of the American Statistical Association 42:237 (1947), 72–91. The problem is also covered in Wikipedia's "German tank problem".

It is unjustified to use default, ignorance, or other automatic priors if you have substantial information that can affect the answer. Reaching for a non-informative prior is a real choice — it says "I want my conclusions to be visible to a reader who shares no prior beliefs with me" — but the choice has costs (wider credible intervals, sensitivity to parameterization), and you should make it only when the cost is worth paying.

One sentence: If you have prior information, use it; if you don't, pick the "uninformative" prior whose justification matches the question you care about — symmetry, ignorance under constraints, or reparameterization invariance.

2. Three routes to "uninformative"

A number of principles have been used to construct non-informative priors. The three with logical justification — and with results that broadly reproduce a corresponding frequentist analysis — are:

Group invariance arguments. Choose the prior to respect a symmetry the problem has — permutation symmetry on a die, translation symmetry on a location parameter, scale invariance on a positive parameter.
Maximum entropy arguments. Choose the prior with the largest information entropy subject to whatever constraints you do know (mean, variance, moments, support).
Arguments from the Fisher information matrix. Take Jeffreys' prior, $\pi_J(\theta)\propto\sqrt{\det I(\theta)}$, which is invariant under smooth reparameterizations.

None of the three is canonical. They answer different questions, and they sometimes disagree (a flat prior in $p$ is not flat in the log-odds $\log[p/(1-p)]$; the Bernoulli MaxEnt prior under no constraint is Beta$(1,1)$, but Jeffreys is Beta$(1/2,1/2)$). The point of having three routes is that which one is right depends on what you mean by "uninformative."
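The first disagreement can be checked with a few lines of arithmetic. The sketch below assumes a Uniform$(0,1)$ prior on $p$ and asks how much mass it puts on two equal-width intervals of log-odds; the function names and the intervals are illustrative choices, not part of the original text.

```python
import math

def sigmoid(x: float) -> float:
    """Inverse of the log-odds map: p = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def mass_in_logodds_interval(lo: float, hi: float) -> float:
    """Mass a Uniform(0, 1) prior on p assigns to log-odds in [lo, hi]."""
    # For uniform p, P(lo <= logit(p) <= hi) = sigmoid(hi) - sigmoid(lo).
    return sigmoid(hi) - sigmoid(lo)

# Two intervals of the same width (2 units) on the log-odds axis.
central = mass_in_logodds_interval(-1.0, 1.0)
tail = mass_in_logodds_interval(3.0, 5.0)

print(f"mass on log-odds [-1, 1]: {central:.4f}")
print(f"mass on log-odds [ 3, 5]: {tail:.4f}")
```

A prior that was flat in log-odds would give both intervals the same mass; the prior that is flat in $p$ gives the central one more than ten times as much.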

3. Group invariance

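If our prior knowledge is invariant under the action of some group $G$ acting on the parameter space (i.e., if applying any $g\in G$ doesn't change what we believe), then a defensible prior is one that is also invariant under $G$. Invariance is a statement about probability mass, not about pointwise density values: every set $A$ must receive the same mass as its image $g\cdot A$. In density terms this reads $\pi(g\cdot\theta)\,\bigl|\det\partial(g\cdot\theta)/\partial\theta\bigr|=\pi(\theta)$ for all $g$, which reduces to $\pi(g\cdot\theta)=\pi(\theta)$ when the Jacobian is 1 (permutations, translations, rotations). When the group can move any point to any other, this requirement determines the prior up to an overall constant.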

Setting	Symmetry group	Invariant prior
Die or coin, no side favored	Permutations of faces	Uniform on faces, $\pi=1/k$
Location parameter $\mu\in\mathbb{R}$, translation-invariant	$\mu\mapsto\mu+c$	Flat (improper), $\pi(\mu)\propto 1$
Angle $\theta\in[0,2\pi)$, rotation-invariant	$\theta\mapsto\theta+c$	Uniform on the circle
Direction $(\varphi,\theta)$ on the sphere	$O(3)$	Uniform on solid angle, $d\Omega = \cos\varphi\,d\varphi\,d\theta$
Scale parameter $\sigma>0$, scale-invariant	$\sigma\mapsto c\sigma$	$\pi(\sigma)\propto 1/\sigma$ (improper); equivalently flat on $\log\sigma$

The scale case is the prototype. If we're measuring a positive quantity and want the prior to be invariant to the choice of unit (graduations of the ruler in inches vs. centimeters), then the mass on any interval must not change when $\sigma$ is rescaled: $\pi(\sigma)\,d\sigma=\pi(c\sigma)\,d(c\sigma)$, i.e. $\pi(\sigma)=c\,\pi(c\sigma)$ for all $c>0$. Setting $\sigma=1$ gives $\pi(c)=\pi(1)/c$. Equivalently, the density of $\omega=\log\sigma$ is invariant to translations of $\omega$, so it is flat in $\omega$. Pushing back to $\sigma$ picks up the Jacobian $|d\omega/d\sigma|=1/\sigma$, and the prior on $\sigma$ is $\pi(\sigma)\propto 1/\sigma$, which is also the Jeffreys prior for a scale parameter.
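A minimal numerical sketch of the unit-change argument: the same physical event ("the scale lies between 1 and 3 inches") is restated in centimeters, and its unnormalized mass is computed under $\pi(\sigma)\propto 1/\sigma$ and under a flat prior. The interval endpoints and function names are assumptions made for illustration. Both priors are improper, so only the comparison across units is meaningful.

```python
import math

CM_PER_INCH = 2.54

def mass_scale_invariant(a: float, b: float) -> float:
    """Unnormalized mass of pi(sigma) ∝ 1/sigma on [a, b]: the integral of d(sigma)/sigma."""
    return math.log(b / a)

def mass_flat(a: float, b: float) -> float:
    """Unnormalized mass of pi(sigma) ∝ 1 on [a, b]."""
    return b - a

# The same event, stated in two units (illustrative numbers).
a_in, b_in = 1.0, 3.0
a_cm, b_cm = a_in * CM_PER_INCH, b_in * CM_PER_INCH

print("1/sigma:", mass_scale_invariant(a_in, b_in), mass_scale_invariant(a_cm, b_cm))
print("flat:   ", mass_flat(a_in, b_in), mass_flat(a_cm, b_cm))
```

Under $1/\sigma$ the mass depends only on the ratio of the endpoints, so the unit drops out. Under the flat rule the number attached to the same event changes by the conversion factor.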

Group-invariance arguments are the most secure from a logical point of view, but they need a natural symmetry in the problem to begin with. They are great when they apply and silent when they don't.

4. Maximum entropy

Maximum entropy (E.T. Jaynes) chooses the prior with the largest information entropy subject to known constraints. The intuition: entropy measures how much information we lack, so picking the maximum-entropy density commits the least beyond the stated constraints. The recipe is a Lagrangian:

$$ \mathcal{L}[p] \;=\; -\!\int p(x)\log p(x)\,dx \;-\; \lambda_0\!\Bigl(\int p - 1\Bigr) \;-\; \sum_j \lambda_j\!\Bigl(\int g_j(x)\,p(x)\,dx - c_j\Bigr). $$

Stationarity in $p$ gives $-\log p(x) - 1 - \lambda_0 - \sum_j\lambda_j g_j(x)=0$, i.e.,

$$ p(x) \;\propto\; \exp\!\Bigl(-\sum_j \lambda_j g_j(x)\Bigr). $$

The MaxEnt density is an exponential family whose sufficient statistics are exactly the constraint functions $g_j$ and whose natural parameters are the negated Lagrange multipliers $-\lambda_j$. Solving the constraint equations $\int g_j(x)\,p(x)\,dx=c_j$ fixes the $\lambda_j$.
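The last step, solving the constraint equations, rarely has a closed form, but it is easy to do numerically. The sketch below uses an assumed example that is not in the text above: a six-sided die whose mean is constrained to 4.5. It writes the MaxEnt solution as $p_k\propto\exp(\eta k)$, where `eta` is the coefficient on the constraint function in the exponent, and finds `eta` by bisection. The function names are illustrative.

```python
import math

def maxent_pmf(eta: float, support: list[int]) -> list[float]:
    """Exponential-family pmf p_k ∝ exp(eta * k) on a finite support."""
    weights = [math.exp(eta * k) for k in support]
    z = sum(weights)  # normalizing constant
    return [w / z for w in weights]

def mean_of(pmf: list[float], support: list[int]) -> float:
    return sum(p * k for p, k in zip(pmf, support))

def solve_eta(target_mean: float, support: list[int],
              lo: float = -10.0, hi: float = 10.0, tol: float = 1e-12) -> float:
    """Bisection on eta; the mean is strictly increasing in eta."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_of(maxent_pmf(mid, support), support) < target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

faces = [1, 2, 3, 4, 5, 6]
eta = solve_eta(4.5, faces)
pmf = maxent_pmf(eta, faces)

print("eta =", round(eta, 4))
print("pmf =", [round(p, 4) for p in pmf])
```

With the target mean set to 3.5 the same solver returns `eta` near zero and the uniform pmf, which is the unconstrained MaxEnt answer.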

The MaxEnt outputs for the standard constraints are exactly the named "default" distributions:

Figure 1 · MaxEnt density from stated constraints (interactive: toggles for positive support, known mean, known variance, and known E[log x]; the plot shows the selected MaxEnt density)

Support	Constraints	MaxEnt density
Finite set $\{1,\dots,n\}$	None (just normalization)	Uniform $p_k=1/n$
Interval $[a,b]\subset\mathbb{R}$	None	Uniform on $[a,b]$
$\{0,1,2,\dots\}$	Mean $\mu$	Geometric$(p=1/(1+\mu))$
$[0,\infty)$	Mean $\mu$	Exponential$(1/\mu)$
$\mathbb{R}$	Mean $\mu$, variance $\sigma^2$	Normal$(\mu,\sigma^2)$
$[0,\infty)$	$\mathbb{E}[\log x]$ and $\mathbb{E}[x]$ both fixed	Gamma$(\alpha,\beta)$
$(0,1)$	$\mathbb{E}[\log x]$ and $\mathbb{E}[\log(1-x)]$ both fixed	Beta$(\alpha,\beta)$

This is the rule of thumb you reach for in practice: say what you know, and the MaxEnt prior is whichever named distribution turns those facts into sufficient statistics. If all you know is that a quantity is positive and has a given mean, the exponential is the prior with the least extra commitment. If the quantity ranges over all of $\mathbb{R}$ and you know its mean and variance, Gaussian. If the support is bounded and you know nothing else, uniform.
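The "least extra commitment" claim for the exponential can be spot-checked numerically. The sketch below compares the differential entropy of three densities on $[0,\infty)$ that all have mean 1: an exponential, a uniform on $[0,2]$, and a half-normal. The two competitors, the mean of 1, and the midpoint-rule integrator are illustrative choices. This is a sanity check against two alternatives, not a proof of optimality.

```python
import math

MU = 1.0  # common mean of all three densities (illustrative)

def differential_entropy(pdf, upper: float = 40.0, steps: int = 100_000) -> float:
    """Midpoint-rule estimate of -∫ p(x) log p(x) dx over [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        p = pdf((i + 0.5) * h)
        if p > 0.0:  # skip points where the density is (numerically) zero
            total -= p * math.log(p) * h
    return total

def exponential_pdf(x: float) -> float:
    return math.exp(-x / MU) / MU

def uniform_pdf(x: float) -> float:
    return 1.0 / (2.0 * MU) if x <= 2.0 * MU else 0.0

HALF_NORMAL_SCALE = MU * math.sqrt(math.pi / 2.0)  # gives mean MU

def half_normal_pdf(x: float) -> float:
    s = HALF_NORMAL_SCALE
    return math.sqrt(2.0 / math.pi) / s * math.exp(-x * x / (2.0 * s * s))

h_exp = differential_entropy(exponential_pdf)
h_unif = differential_entropy(uniform_pdf)
h_half = differential_entropy(half_normal_pdf)

print(f"exponential: {h_exp:.4f}")
print(f"half-normal: {h_half:.4f}")
print(f"uniform:     {h_unif:.4f}")
```

The exponential comes out on top, with entropy $1+\log\mu$.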

But MaxEnt has a flaw, which is why Jeffreys is also worth knowing: its output depends on the choice of base measure and on which coordinates you write the entropy in. The continuous "differential entropy" is not invariant under smooth reparameterizations — if you change coordinates $x\mapsto\phi(x)$, the entropy shifts by $\mathbb{E}[\log|\phi'(x)|]$ and so does the answer. So MaxEnt implicitly commits to a natural parameterization. When that parameterization is obvious (Cartesian coordinates on space, time on $[0,\infty)$), MaxEnt is sharp. When it isn't, the answer is parameterization-dependent.
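The entropy shift can be verified by hand on a small example. The setup is an assumption made for illustration: take $X\sim\text{Uniform}(0,1)$ and the change of coordinates $\phi(x)=x^2$.

$$ h(X) = 0, \qquad Y=\phi(X)=X^2 \;\Rightarrow\; p_Y(y)=\frac{1}{2\sqrt{y}} \ \text{ on } (0,1), $$

$$ h(Y) \;=\; -\int_0^1 \frac{1}{2\sqrt{y}}\,\log\frac{1}{2\sqrt{y}}\,dy \;=\; \log 2 + \frac12\int_0^1 \frac{\log y}{2\sqrt{y}}\,dy \;=\; \log 2 - 1, $$

$$ \mathbb{E}\bigl[\log|\phi'(X)|\bigr] \;=\; \int_0^1 \log(2x)\,dx \;=\; \log 2 - 1 \;=\; h(Y)-h(X). $$

So the uniform density, which maximizes entropy on $(0,1)$ in the $x$ coordinate, has entropy $\log 2 - 1\approx -0.31$ in the $y$ coordinate. There it loses to the density that is uniform in $y$ (entropy 0), which is a different distribution on the underlying quantity.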

5. Jeffreys' priors

Harold Jeffreys' construction sidesteps the coordinate issue by defining the prior through the Fisher information of the likelihood:

$$ \pi_J(\theta) \;\propto\; \sqrt{\det I(\theta)}, \qquad I(\theta)\;=\; -\mathbb{E}\!\left[\frac{\partial^2 \log p(x\mid\theta)}{\partial\theta\,\partial\theta^\top}\right]. $$

This is automatically invariant under reparameterizations: if $\theta=h^{-1}(\phi)$, then $I(\phi)=I(\theta)\,(d\theta/d\phi)^2$, so $\sqrt{I(\phi)}=\sqrt{I(\theta)}\,|d\theta/d\phi|$, which is precisely the Jacobian factor needed for the prior density to push forward to itself in the new coordinate. Two analysts using different parameterizations end up with the same prior on the underlying probability model.
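The transformation rule for $I$ follows in two lines from the chain rule. This derivation is a sketch for the one-parameter case; it assumes the usual regularity conditions under which the Fisher information can also be written as the expected squared score.

$$ \frac{\partial \log p(x\mid\phi)}{\partial\phi} \;=\; \frac{\partial \log p(x\mid\theta)}{\partial\theta}\,\frac{d\theta}{d\phi}, $$

$$ I(\phi) \;=\; \mathbb{E}\!\left[\Bigl(\frac{\partial \log p(x\mid\phi)}{\partial\phi}\Bigr)^{2}\right] \;=\; \Bigl(\frac{d\theta}{d\phi}\Bigr)^{2}\,\mathbb{E}\!\left[\Bigl(\frac{\partial \log p(x\mid\theta)}{\partial\theta}\Bigr)^{2}\right] \;=\; I(\theta)\,\Bigl(\frac{d\theta}{d\phi}\Bigr)^{2}. $$

The factor $d\theta/d\phi$ does not depend on $x$, which is why it comes out of the expectation.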

Worked examples:

Normal mean, $\sigma$ known. $I(\mu)=n/\sigma^2$ is constant in $\mu$, so $\pi_J(\mu)\propto 1$ — flat. Matches the location-translation group argument.
Normal scale, $\mu$ known. $I(\sigma)\propto 1/\sigma^2$, so $\pi_J(\sigma)\propto 1/\sigma$. Matches the scale-invariance group argument.
Bernoulli bias $p$. $I(p)=1/[p(1-p)]$, so $\pi_J(p)\propto[p(1-p)]^{-1/2}$, the Beta$(1/2,1/2)$ prior. Unlike the flat-prior rule, the Jeffreys rule survives reparameterization: applying it in log-odds coordinates yields the same distribution on $p$.
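The Bernoulli example can be checked numerically. The sketch below computes the single-trial Fisher information as the expected squared score, once in the $p$ coordinate and once in log-odds $\phi$, and confirms that the two square roots differ by the Jacobian $dp/d\phi=p(1-p)$. The finite-difference score, the function names, and the test point $p=0.2$ are illustrative assumptions.

```python
import math

def log_lik_p(x: int, p: float) -> float:
    """Bernoulli log-likelihood of one draw x in {0, 1}, parameterized by p."""
    return x * math.log(p) + (1 - x) * math.log(1.0 - p)

def sigmoid(phi: float) -> float:
    return 1.0 / (1.0 + math.exp(-phi))

def log_lik_logodds(x: int, phi: float) -> float:
    """Same model, parameterized by the log-odds phi = log(p / (1 - p))."""
    return log_lik_p(x, sigmoid(phi))

def fisher_information(log_lik, theta: float, eps: float = 1e-6) -> float:
    """E[score^2] over x in {0, 1}; the score is a central finite difference."""
    total = 0.0
    for x in (0, 1):
        score = (log_lik(x, theta + eps) - log_lik(x, theta - eps)) / (2.0 * eps)
        total += math.exp(log_lik(x, theta)) * score ** 2
    return total

p = 0.2
phi = math.log(p / (1.0 - p))

info_p = fisher_information(log_lik_p, p)            # close to 1 / (p (1 - p))
info_phi = fisher_information(log_lik_logodds, phi)  # close to p (1 - p)
jacobian = p * (1.0 - p)                             # dp / dphi

print(math.sqrt(info_phi), math.sqrt(info_p) * jacobian)
```

The two printed numbers agree, which is the invariance statement: the Jeffreys density computed directly in $\phi$ equals the Jeffreys density in $p$ pushed through the change of variables.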

Jeffreys is often improper (the integral diverges), as in the location and scale cases above. That's not by itself a problem for Bayesian inference — the posterior is usually proper as soon as you have any data. The Fisher information page develops Jeffreys in detail, including the multi-parameter version through the Fisher metric and several worked transformations.

6. Prior rules disagree near boundaries

The three routes don't always pick the same prior, because they answer different questions:

Group invariance asks: which prior respects the natural symmetry of the problem?
Max entropy asks: given these constraints, which prior commits the least?
Jeffreys asks: which prior gives the same answer in any coordinate system?

For some classes of problems the answers coincide (location and scale parameters under translation/rescaling symmetry). For others they diverge. A Bernoulli with no constraints has MaxEnt prior Beta$(1,1)$ (uniform on $p$); Jeffreys gives Beta$(1/2,1/2)$; a group-invariance argument doesn't directly apply because there's no natural group symmetry on $[0,1]$ unless you posit one. Which to use depends on the question: do you want a uniform-in-$p$ analysis (MaxEnt), or one that survives a switch to log-odds (Jeffreys)?

Figure 2 · Three 'uninformative' priors on the Bernoulli bias p (interactive: Flat / MaxEnt Beta(1, 1), Jeffreys Beta(½, ½), and Haldane Beta(0⁺, 0⁺); controls for the display coordinate, observed trials n, and observed successes k)

At $n=0$ the three priors look different: flat is uniform on $p$, Jeffreys is U-shaped with mass pushed toward the boundaries, and Haldane (the limit $\alpha,\beta\to 0$) piles all of its mass at the endpoints $\{0,1\}$. Switch the readout to log-odds: the flat prior on $p$ becomes peaked around $p=1/2$, Jeffreys is also peaked at $p=1/2$ but with heavier tails (it is flat in the variance-stabilizing coordinate $\arcsin\sqrt{p}$, not in log-odds), and Haldane becomes flat over all $\mathbb{R}$. As $n$ grows the posteriors converge; for moderate $n$ the disagreement only matters when the data sit near the boundary ($k$ close to $0$ or $n$), where the likelihood peaks at $p\approx 0$ or $p\approx 1$ and the priors differ most.
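The convergence claim can be made concrete with the conjugate Beta update: a Beta$(\alpha,\beta)$ prior and $k$ successes in $n$ trials give a Beta$(\alpha+k,\ \beta+n-k)$ posterior with mean $(\alpha+k)/(\alpha+\beta+n)$. The sketch below tabulates that mean under the three priors. The data sets are made up for illustration, and Haldane is treated as the limiting case $\alpha=\beta=0$, which requires $n>0$.

```python
PRIORS = {
    "flat": (1.0, 1.0),
    "jeffreys": (0.5, 0.5),
    "haldane": (0.0, 0.0),  # limiting case alpha, beta -> 0; improper
}

def posterior_mean(alpha: float, beta: float, k: int, n: int) -> float:
    """Mean of the Beta(alpha + k, beta + n - k) posterior."""
    a, b = alpha + k, beta + (n - k)
    return a / (a + b)

def report(k: int, n: int) -> dict[str, float]:
    """Posterior mean of p under each prior, for k successes in n trials."""
    return {name: posterior_mean(a, b, k, n) for name, (a, b) in PRIORS.items()}

# Illustrative data: a boundary case, a central case, and a large sample.
for k, n in [(0, 10), (5, 10), (300, 1000)]:
    means = report(k, n)
    print(f"k={k:4d} n={n:5d}  " + "  ".join(f"{m}={v:.4f}" for m, v in means.items()))
```

With $k=0$ the three answers differ by a factor of two or more; at $k=n/2$ they coincide; at $n=1000$ they agree to three decimals. Note that under Haldane the posterior for $k=0$ (or $k=n$) is still improper, so the reported 0 is a limit rather than the mean of a proper distribution.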

Figure 3 · Flat on σ is not flat on log σ (interactive: choose the coordinate in which the prior is flat; the plot shows the implied density on σ)

The scale-parameter version is the same warning in a different coordinate system. A density that is flat in $\sigma$ is not flat in $\log\sigma$, and a density flat in $\sigma^2$ leans even harder toward large scales. The scale-invariant choice is the one that is flat in $\log\sigma$, namely $\pi(\sigma)\propto1/\sigma$.
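The three curves follow from one change-of-variables rule: a prior flat in a coordinate $u(\sigma)$ has density $|du/d\sigma|$ on $\sigma$. As a worked sketch (the comparison intervals $[1,10]$ and $[10,100]$ are chosen for illustration):

$$ u=\log\sigma:\ \pi(\sigma)\propto\frac{1}{\sigma}, \qquad u=\sigma:\ \pi(\sigma)\propto 1, \qquad u=\sigma^2:\ \pi(\sigma)\propto \sigma. $$

$$ \frac{\Pr(10\le\sigma\le100)}{\Pr(1\le\sigma\le10)} \;=\; \frac{\log 10}{\log 10}=1, \qquad \frac{100-10}{10-1}=10, \qquad \frac{100^2-10^2}{10^2-1^2}=100. $$

Only the prior flat in $\log\sigma$ treats the two decades equally. The one flat in $\sigma$ favors the upper decade tenfold, and the one flat in $\sigma^2$ a hundredfold.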

7. Pitfalls and the "ignorance" warning

Non-informative priors are not priors with no assumptions. They are attempts to avoid privileging one coordinate system or one set of facts over another, and each construction makes a specific choice:

Flat priors are coordinate-dependent. A prior that is uniform in $p$ is not uniform in $\log[p/(1-p)]$. Saying "let's just use a flat prior" leaves open which flat prior you mean; in practice the answer is whichever parameterization you happened to be using.
MaxEnt depends on the base measure. Maximum-entropy priors over continuous spaces are not parameterization-invariant unless you carry a reference measure along (the relative-entropy formulation), and writing down a natural reference measure is itself a modeling assumption.
Improper priors are not always safe. Jeffreys and group priors are often improper. The posterior usually becomes proper after any data, but not always — for hierarchical models you should check.
"Uninformative" can be very informative on derived quantities. A flat prior on the variance $\sigma^2$ puts most of its mass on huge values; a flat prior on $\log\sigma$ is dramatically different. Either is "uninformative" on its own coordinate but informative on the other.

The standard caution bears repeating: use of "uninformative" or "automatic" priors does not incorporate any real prior information you may have. If you do have such information, use it.

8. A short decision guide

Situation	First choice	Why
You have a calibrated estimate from previous data or theory	Informative prior matching that estimate	Don't waste the information; show it explicitly in the prior
You have soft beliefs ("plausibly $\pm 2\sigma$ around 0")	Weakly-informative prior at that scale (e.g. Normal$(0, 2.5^2)$ on a standardized coefficient)	Regularizes near the boundary, agrees with data when there's enough of it
Closed-form posterior matters; soft beliefs OK	Conjugate prior with hyperparameters interpretable as pseudo-counts (see Conjugate Priors & the Exponential Family)	Fast updates, transparent strength, often a reasonable approximation
Location parameter, no information	Flat on $\theta$ (translation-invariant)	All three routes agree
Scale or rate parameter, no information	$\pi(\sigma)\propto 1/\sigma$ (Jeffreys / scale-invariant)	All three routes agree
Probabilities or rates with possible boundary cases	Beta$(1/2,1/2)$ (Jeffreys)	Coordinate-invariant; better near $p\approx 0, 1$ than flat
Discrete outcomes, no preference between them	Symmetric Dirichlet, e.g. $\alpha=1$ (uniform on simplex) or $\alpha=1/2$ (Jeffreys)	Group invariance under permutation of categories
Positive quantity with known mean only	Exponential$(1/\mu)$ (MaxEnt)	The least committal positive distribution with that mean
Real-valued quantity with known mean and variance	Normal (MaxEnt)	The least committal distribution on $\mathbb{R}$ with those moments
Posterior is sensitive to the prior choice	Report results under multiple priors, or collect more data	If the answer depends on the prior in a way you can't defend, the data isn't doing enough work yet

9. Practical elicitation: turn beliefs into pseudo-counts

For prior beliefs that you can articulate but not immediately write as a density, the pseudo-count reading from the conjugate-prior page is often the most practical conversion:

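"I'd guess this coin's bias is around 0.6, and I'm about as sure as if I'd seen 10 trials" → Beta$(7, 5)$, i.e. a flat Beta$(1,1)$ updated with 6 pseudo-successes and 4 pseudo-failures. Mean $7/12\approx 0.58$; 10 pseudo-trials on top of the flat baseline ($\alpha+\beta=12$). Dropping the baseline gives Beta$(6,4)$, with mean exactly 0.6 and $\alpha+\beta=10$.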
"I'd estimate the event rate at ≈ 2 per year, with about 5 years of indirect experience" → Gamma$(\alpha=10, \beta=5)$ on the rate. Mean 2, effective observation 5 years.
"I think the parameter is near 0, on the scale of 0.5" → Normal$(0, 0.5^2)$. Worth one observation of precision-4 data.

The discipline this enforces is useful: every conjugate hyperparameter has units of either pseudo-events or pseudo-observations, and writing them down forces you to check that your prior isn't accidentally worth a thousand observations on a hundred-observation problem.
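The three conversions above are mechanical enough to write down as small helpers. This is a minimal sketch: the function names and the `baseline` argument (the pseudo-count of the prior you start from, e.g. 1 for a flat Beta$(1,1)$) are hypothetical conveniences, and the Gamma uses the shape/rate parameterization.

```python
def beta_from_pseudo_counts(mean: float, pseudo_trials: float,
                            baseline: float = 0.0) -> tuple[float, float]:
    """Beta(alpha, beta) encoding `pseudo_trials` imagined trials with success
    fraction `mean`, added on top of a Beta(baseline, baseline) starting point."""
    alpha = baseline + mean * pseudo_trials
    beta = baseline + (1.0 - mean) * pseudo_trials
    return alpha, beta

def gamma_from_rate(rate: float, exposure: float) -> tuple[float, float]:
    """Gamma(shape, rate) prior on an event rate: `rate * exposure` pseudo-events
    seen over `exposure` units of time, so the prior mean is `rate`."""
    return rate * exposure, exposure

def normal_precision(scale: float) -> float:
    """Precision (1 / variance) carried by a Normal(0, scale^2) prior."""
    return 1.0 / scale ** 2

print(beta_from_pseudo_counts(0.6, 10, baseline=1.0))  # coin example, flat baseline
print(gamma_from_rate(2.0, 5.0))                       # event-rate example
print(normal_precision(0.5))                           # near-zero parameter example
```

Reading the outputs back is the check the paragraph above recommends: sum the Beta parameters, or read off the Gamma rate, and compare that pseudo-sample size with the amount of real data you expect to collect.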

What next

Conjugate Priors & the Exponential Family (Bayes). The other route to a workable prior: conjugacy. Hyperparameters as pseudo-counts, closed-form posteriors, standard pairs.
Fisher Information & Jeffreys Priors (Geometry). The interactive treatment of Jeffreys' construction, coordinate transforms, and the Fisher metric.
Named Distributions (Reference). Where Beta, Dirichlet, Gamma, Normal, and the rest sit relative to each other; useful when picking an informative prior.
Monte Carlo & MCMC (Computation). When the prior is not conjugate and the posterior has no closed form, sample from it.