Choosing a Prior
1. Use real prior information when available
The most important principle of prior selection is that your prior should represent the best knowledge you have about the parameters before you look at the data. Usually some information is at your disposal:
- A candidate's vote share is in $[0, 1]$ by definition, so any prior on it lives on the unit interval — a Beta is the natural family.
- In the German Tank Problem (the WWII estimation of German tank production from serial numbers on captured equipment), the total number of tanks $N$ must be at least the largest serial number observed. The prior reflects that lower bound: zero mass below it. [1]
- A treatment effect cannot plausibly be more than a few standard deviations of the outcome.
[1] Ruggles & Brodie, "An Empirical Approach to Economic Intelligence in World War II," Journal of the American Statistical Association 42(237), 1947, pp. 72–91. The problem is also covered in Wikipedia's "German tank problem".
It is unjustified to use default, ignorance, or other automatic priors if you have substantial information that can affect the answer. Reaching for a non-informative prior is a real choice — it says "I want my conclusions to be visible to a reader who shares no prior beliefs with me" — but the choice has costs (wider credible intervals, sensitivity to parameterization), and you should make it only when the cost is worth paying.
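As a rough illustration of the German Tank bullet above, the following minimal sketch builds a posterior over $N$ from a prior that is zero below the largest observed serial number. The uniform shape above that bound, the upper cutoff `n_max`, the function name `tank_posterior`, and the serial numbers are all assumptions made for the example, not part of the historical analysis.

```python
from math import comb

def tank_posterior(serials, n_max):
    """Posterior over the total N given observed serial numbers.

    Assumed prior: uniform on {m, ..., n_max} and zero below m = max(serials).
    n_max is a hypothetical upper cutoff chosen only for illustration.
    """
    k, m = len(serials), max(serials)
    # P(largest serial = m | N) for k draws without replacement from 1..N
    weights = {n: comb(m - 1, k - 1) / comb(n, k) for n in range(m, n_max + 1)}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

post = tank_posterior([19, 40, 42, 60], n_max=1000)
post_mean = sum(n * p for n, p in post.items())
print("smallest N with posterior mass:", min(post))
print("posterior mean of N:", round(post_mean, 1))
```

The hard lower bound does real work here: no amount of data can move posterior mass below the largest serial number, because the prior and the likelihood both rule it out.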
2. Three routes to "uninformative"
A number of principles have been used to construct non-informative priors. The three with logical justification — and with results that broadly reproduce a corresponding frequentist analysis — are:
- Group invariance arguments. Choose the prior to respect a symmetry the problem has — permutation symmetry on a die, translation symmetry on a location parameter, scale invariance on a positive parameter.
- Maximum entropy arguments. Choose the prior with the largest information entropy subject to whatever constraints you do know (mean, variance, moments, support).
- Arguments from the Fisher information matrix. Take Jeffreys' prior, $\pi_J(\theta)\propto\sqrt{\det I(\theta)}$, which is invariant under smooth reparameterizations.
None of the three is canonical. They answer different questions, and they sometimes disagree: a flat prior in $p$ is not flat in the log-odds $\log[p/(1-p)]$, and the Bernoulli MaxEnt prior under no constraint is Beta$(1,1)$ while Jeffreys is Beta$(1/2,1/2)$. Which route is right depends on what you mean by "uninformative."
3. Group invariance
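If our prior knowledge is invariant under the action of some group $G$ acting on the parameter space (that is, if applying any $g\in G$ doesn't change what we believe), then a defensible prior is one that is also invariant under $G$. The requirement is on the prior measure, not just the density: $\pi(g\cdot\theta)\,d(g\cdot\theta)=\pi(\theta)\,d\theta$ for all $g$. When $G$ acts transitively, this determines the prior up to scale. For translations and rotations the Jacobian is 1, so the condition reduces to $\pi(g\cdot\theta)=\pi(\theta)$; for rescalings it does not.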
| Setting | Symmetry group | Invariant prior |
|---|---|---|
| Die or coin, no side favored | Permutations of faces | Uniform on faces, $\pi=1/k$ |
| Location parameter $\mu\in\mathbb{R}$, translation-invariant | $\mu\mapsto\mu+c$ | Flat (improper), $\pi(\mu)\propto 1$ |
| Angle $\theta\in[0,2\pi)$, rotation-invariant | $\theta\mapsto\theta+c$ | Uniform on the circle |
| Direction $(\varphi,\theta)$ on the sphere | $O(3)$ | Uniform on solid angle, $d\Omega = \cos\varphi\,d\varphi\,d\theta$ |
| Scale parameter $\sigma>0$, scale-invariant | $\sigma\mapsto c\sigma$ | $\pi(\sigma)\propto 1/\sigma$ (improper); equivalently flat on $\log\sigma$ |
The scale case is the prototype. If we're measuring a positive quantity and want the prior to be invariant to the choice of unit (graduations of the ruler in inches vs. centimeters), then the prior measure must be unchanged by $\sigma\mapsto c\sigma$: $\pi(\sigma)\,d\sigma=\pi(c\sigma)\,d(c\sigma)$, i.e. $\pi(\sigma)=c\,\pi(c\sigma)$ for all $c>0$. Setting $\sigma=1$ gives $\pi(c)\propto 1/c$. Equivalently, the prior on $\omega=\log\sigma$ is invariant to translations of $\omega$, so it is flat in $\omega$. Pushing back to $\sigma$ picks up the Jacobian $1/\sigma$, and the prior on $\sigma$ is the Jeffreys prior for a scale parameter, $\pi(\sigma)\propto 1/\sigma$.
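The unit-invariance claim is easy to check numerically. The sketch below (function names are illustrative) compares the unnormalized mass that $\pi(\sigma)\propto 1/\sigma$ and a flat density assign to the same physical interval expressed in inches and in centimeters.

```python
import math

def mass_inv_sigma(a, b):
    """Unnormalized mass of pi(sigma) proportional to 1/sigma on [a, b]."""
    return math.log(b / a)

def mass_flat(a, b):
    """Unnormalized mass of a flat density on [a, b]."""
    return b - a

inches = (1.0, 10.0)
cm = tuple(2.54 * x for x in inches)  # the same interval, in centimeters

print(mass_inv_sigma(*inches), mass_inv_sigma(*cm))  # equal up to rounding
print(mass_flat(*inches), mass_flat(*cm))            # differ by the factor 2.54
```

Because both priors are improper, only ratios of such masses are meaningful. Under $1/\sigma$ those ratios are the same in every unit system; under the flat density they are not.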
Group-invariance arguments are the most secure from a logical point of view, but they need a natural symmetry in the problem to begin with. They are great when they apply and silent when they don't.
4. Maximum entropy
Maximum entropy (E.T. Jaynes) chooses the prior with the largest information entropy subject to known constraints. The intuition: entropy measures how much information we lack, so picking the maximum-entropy density commits the least beyond the stated constraints. The recipe is a Lagrangian:
$$ \mathcal{L}[p] \;=\; -\!\int p(x)\log p(x)\,dx \;-\; \lambda_0\!\Bigl(\int p(x)\,dx - 1\Bigr) \;-\; \sum_j \lambda_j\!\Bigl(\int g_j(x)\,p(x)\,dx - c_j\Bigr). $$
Stationarity in $p$ gives $-\log p(x) - 1 - \lambda_0 - \sum_j\lambda_j g_j(x)=0$, i.e.,
$$ p(x) \;\propto\; \exp\!\Bigl(-\sum_j \lambda_j g_j(x)\Bigr). $$
The MaxEnt density is an exponential family whose sufficient statistics are exactly the constraint functions $g_j$ and whose natural parameters are the negated Lagrange multipliers $-\lambda_j$. Solving the constraint equations $\int g_j(x)\,p(x)\,dx = c_j$ fixes the $\lambda_j$.
The MaxEnt outputs for the standard constraints are exactly the named "default" distributions:
| Support | Constraints | MaxEnt density |
|---|---|---|
| Finite set $\{1,\dots,n\}$ | None (just normalization) | Uniform $p_k=1/n$ |
| Interval $[a,b]\subset\mathbb{R}$ | None | Uniform on $[a,b]$ |
| $\{0,1,2,\dots\}$ | Mean $\mu$ | Geometric$(p=1/(1+\mu))$ |
| $[0,\infty)$ | Mean $\mu$ | Exponential$(1/\mu)$ |
| $\mathbb{R}$ | Mean $\mu$, variance $\sigma^2$ | Normal$(\mu,\sigma^2)$ |
| $[0,\infty)$ | $\mathbb{E}[\log x]$ and $\mathbb{E}[x]$ both fixed | Gamma$(\alpha,\beta)$ |
| $(0,1)$ | $\mathbb{E}[\log x]$ and $\mathbb{E}[\log(1-x)]$ both fixed | Beta$(\alpha,\beta)$ |
This is the rule of thumb you reach for in practice: say what you know, and the MaxEnt prior is whichever named distribution turns those facts into sufficient statistics. If all you know is that a quantity is positive and has a finite mean, the exponential is the prior with the least extra commitment. If the quantity can be any real number and you know its mean and variance, the Gaussian. If the support is bounded and you know nothing else, uniform.
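When the constraint equations have no closed-form solution, the multiplier can be found numerically. The sketch below does this for a finite support with a fixed mean (a die whose average is known), where the MaxEnt pmf has the exponentially tilted form $p_k\propto e^{\theta x_k}$. The name `maxent_pmf`, the bisection bracket, and the target means are assumptions chosen for illustration; `theta` plays the role of the Lagrange multiplier up to sign convention.

```python
import math

def maxent_pmf(values, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """MaxEnt pmf on a finite set with a fixed mean: p_k proportional to exp(theta * x_k).

    theta is found by bisection, using the fact that the mean of the
    tilted pmf is increasing in theta.
    """
    def pmf(theta):
        shift = max(theta * x for x in values)  # subtract max exponent for stability
        w = [math.exp(theta * x - shift) for x in values]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(theta):
        return sum(x * p for x, p in zip(values, pmf(theta)))

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return pmf(0.5 * (lo + hi))

faces = [1, 2, 3, 4, 5, 6]
print([round(p, 4) for p in maxent_pmf(faces, 3.5)])  # no tilt needed: uniform
print([round(p, 4) for p in maxent_pmf(faces, 4.5)])  # tilted toward high faces
```

With the mean set to 3.5 the constraint is already satisfied by the uniform pmf, so the solver returns it; any other target mean tilts the pmf geometrically across the faces.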
But MaxEnt has a flaw, which is why Jeffreys is also worth knowing: its output depends on the choice of base measure and on which coordinates you write the entropy in. The continuous "differential entropy" is not invariant under smooth reparameterizations — if you change coordinates $x\mapsto\phi(x)$, the entropy shifts by $\mathbb{E}[\log|\phi'(x)|]$ and so does the answer. So MaxEnt implicitly commits to a natural parameterization. When that parameterization is obvious (Cartesian coordinates on space, time on $[0,\infty)$), MaxEnt is sharp. When it isn't, the answer is parameterization-dependent.
5. Jeffreys' priors
Harold Jeffreys' construction sidesteps the coordinate issue by defining the prior through the Fisher information of the likelihood:
$$ \pi_J(\theta) \;\propto\; \sqrt{\det I(\theta)}, \qquad I(\theta)\;=\; -\mathbb{E}\!\left[\frac{\partial^2 \log p(x\mid\theta)}{\partial\theta\,\partial\theta^\top}\right]. $$
This is automatically invariant under reparameterizations: if $\theta=h^{-1}(\phi)$, then $I(\phi)=I(\theta)\,(d\theta/d\phi)^2$, so $\sqrt{I(\phi)}=\sqrt{I(\theta)}\,|d\theta/d\phi|$, which is precisely the Jacobian factor needed for the prior density to push forward to itself in the new coordinate. Two analysts using different parameterizations end up with the same prior on the underlying probability model.
Worked examples:
- Normal mean, $\sigma$ known. $I(\mu)=n/\sigma^2$ is constant in $\mu$, so $\pi_J(\mu)\propto 1$ — flat. Matches the location-translation group argument.
- Normal scale, $\mu$ known. $I(\sigma)\propto 1/\sigma^2$, so $\pi_J(\sigma)\propto 1/\sigma$. Matches the scale-invariance group argument.
- Bernoulli bias $p$. $I(p)=1/[p(1-p)]$, so $\pi_J(p)\propto[p(1-p)]^{-1/2}$, which is the Beta$(1/2,1/2)$ prior. Unlike the flat prior, this is the same distribution Jeffreys' rule produces when the model is written in log-odds and mapped back to $p$.
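The Bernoulli example can be checked directly. The sketch below computes the Fisher information as the expected squared score (equivalent to the negative expected second derivative under the usual regularity conditions) in both $p$ and log-odds coordinates, then confirms that the two Jeffreys densities differ by exactly the Jacobian $dp/d\omega=p(1-p)$. Function names are illustrative.

```python
import math

def fisher_info_bernoulli(p):
    """I(p) = E[(d/dp log f(x|p))^2], summed exactly over x in {0, 1}."""
    score_1 = 1.0 / p            # d/dp log p        (x = 1)
    score_0 = -1.0 / (1.0 - p)   # d/dp log (1 - p)  (x = 0)
    return p * score_1**2 + (1.0 - p) * score_0**2

def fisher_info_logodds(w):
    """Same model written in log-odds w = log(p / (1 - p))."""
    p = 1.0 / (1.0 + math.exp(-w))
    # d/dw log f(1|w) = 1 - p,  d/dw log f(0|w) = -p
    return p * (1.0 - p)**2 + (1.0 - p) * p**2

for p in (0.1, 0.5, 0.9):
    w = math.log(p / (1.0 - p))
    jeffreys_p = math.sqrt(fisher_info_bernoulli(p))  # unnormalized density in p
    jeffreys_w = math.sqrt(fisher_info_logodds(w))    # unnormalized density in w
    dp_dw = p * (1.0 - p)                             # Jacobian of p(w)
    print(p, jeffreys_w, jeffreys_p * dp_dw)          # last two columns agree
```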
Jeffreys is often improper (the integral diverges), as in the location and scale cases above. That's not by itself a problem for Bayesian inference — the posterior is usually proper as soon as you have any data. The Fisher information page develops Jeffreys in detail, including the multi-parameter version through the Fisher metric and several worked transformations.
6. Prior rules disagree near boundaries
The three routes don't always pick the same prior, because they answer different questions:
- Group invariance asks: which prior respects the natural symmetry of the problem?
- Max entropy asks: given these constraints, which prior commits the least?
- Jeffreys asks: which prior gives the same answer in any coordinate system?
For some classes of problems the answers coincide (location and scale parameters under translation/rescaling symmetry). For others they diverge. A Bernoulli with no constraints has MaxEnt prior Beta$(1,1)$ (uniform on $p$); Jeffreys gives Beta$(1/2,1/2)$; a group-invariance argument doesn't directly apply because there's no natural group symmetry on $[0,1]$ unless you posit one. Which to use depends on the question: do you want a uniform-in-$p$ analysis (MaxEnt), or one that survives a switch to log-odds (Jeffreys)?
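With no data ($n=0$) the three candidate priors for a Bernoulli $p$ look different: flat is uniform on $p$, Jeffreys is U-shaped with mass pushed toward the boundaries, and Haldane (the limit $\alpha,\beta\to 0$, with improper density $\propto 1/[p(1-p)]$) piles all its mass at the endpoints $\{0,1\}$. In log-odds $\omega=\log[p/(1-p)]$ the picture changes: the flat prior on $p$ becomes the logistic density $\propto p(1-p)$, peaked at $\omega=0$ ($p=1/2$); Jeffreys becomes $\propto\sqrt{p(1-p)}$, still peaked at $\omega=0$ but with heavier tails; and Haldane becomes flat over all of $\mathbb{R}$. (Jeffreys is flat instead in the variance-stabilizing coordinate $\arcsin\sqrt{p}$.) As $n$ grows the posteriors converge; for moderate $n$ the disagreement matters only near the boundary $p\approx 0$ or $p\approx 1$, where one outcome has few or no observations and the prior's pseudo-counts are comparable to the data.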
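A small sensitivity check makes the boundary effect concrete. The sketch below compares posterior means under the flat, Jeffreys, and Haldane priors for one interior data set and one boundary data set; the counts are made up for illustration.

```python
def beta_posterior_mean(alpha, beta, successes, trials):
    """Posterior mean of p under a Beta(alpha, beta) prior and binomial data."""
    a = alpha + successes
    b = beta + trials - successes
    if a <= 0 or b <= 0:
        raise ValueError("posterior is improper")
    return a / (a + b)

priors = {"flat": (1.0, 1.0), "Jeffreys": (0.5, 0.5), "Haldane": (0.0, 0.0)}

# illustrative data: one interior case, one boundary case (zero successes)
for successes, trials in [(30, 100), (0, 10)]:
    for name, (alpha, beta) in priors.items():
        try:
            mean = beta_posterior_mean(alpha, beta, successes, trials)
            print(f"{successes}/{trials} {name:8s} mean = {mean:.4f}")
        except ValueError as err:
            print(f"{successes}/{trials} {name:8s} {err}")
```

For the interior data the three posterior means agree to about two decimal places. With zero successes they differ by roughly a factor of two between flat and Jeffreys, and the Haldane posterior is not proper at all.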
7. Pitfalls and the "ignorance" warning
Non-informative priors are not priors with no assumptions. They are attempts to avoid privileging one coordinate system or one set of facts over another, and each construction makes a specific choice:
- Flat priors are coordinate-dependent. A prior that is uniform in $p$ is not uniform in $\log[p/(1-p)]$. Saying "let's just use a flat prior" silently answers the question "flat in which coordinate?" with whatever parameterization you happened to be using.
- MaxEnt depends on the base measure. Maximum-entropy priors over continuous spaces are not parameterization-invariant unless you carry a reference measure along (the relative-entropy formulation), and writing down a natural reference measure is itself a modeling assumption.
- Improper priors are not always safe. Jeffreys and group priors are often improper. The posterior usually becomes proper after any data, but not always — for hierarchical models you should check.
- "Uninformative" can be very informative on derived quantities. A flat prior on the variance $\sigma^2$ puts most of its mass on huge values; a flat prior on $\log\sigma$ is dramatically different. Either is "uninformative" on its own coordinate but informative on the other.
The standard caution bears repeating: use of "uninformative" or "automatic" priors does not incorporate any real prior information you may have. If you do have such information, use it.
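The contrast between a flat prior on $\sigma^2$ and a flat prior on $\log\sigma$ can be quantified once both are truncated to a common range so that they normalize. In the sketch below the range `[LO, HI]` and the function names are assumptions chosen for illustration.

```python
import math

LO, HI = 0.1, 100.0  # hypothetical truncation range for sigma

def mass_flat_variance(a, b):
    """P(a <= sigma <= b) when sigma^2 is uniform on [LO^2, HI^2]."""
    return (b**2 - a**2) / (HI**2 - LO**2)

def mass_flat_log_sigma(a, b):
    """P(a <= sigma <= b) when log(sigma) is uniform on [log LO, log HI]."""
    return math.log(b / a) / math.log(HI / LO)

for a, b in [(0.1, 1.0), (1.0, 10.0), (10.0, 100.0)]:
    print(f"sigma in [{a}, {b}]: "
          f"flat-in-variance {mass_flat_variance(a, b):.4f}, "
          f"flat-in-log {mass_flat_log_sigma(a, b):.4f}")
```

Flat in $\log\sigma$ gives each decade equal mass, while flat in $\sigma^2$ puts nearly all of its mass in the top decade. Both are "flat", but they encode very different beliefs about $\sigma$.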
8. A short decision guide
| Situation | First choice | Why |
|---|---|---|
| You have a calibrated estimate from previous data or theory | Informative prior matching that estimate | Don't waste the information; show it explicitly in the prior |
| You have soft beliefs ("plausibly $\pm 2\sigma$ around 0") | Weakly-informative prior at that scale (e.g. Normal$(0, 2.5^2)$ on a standardized coefficient) | Regularizes near the boundary, agrees with data when there's enough of it |
| Closed-form posterior matters; soft beliefs OK | Conjugate prior with hyperparameters interpretable as pseudo-counts (see the conjugate-prior page) | Fast updates, transparent strength, often a reasonable approximation |
| Location parameter, no information | Flat on $\theta$ (translation-invariant) | All three routes agree |
| Scale or rate parameter, no information | $\pi(\sigma)\propto 1/\sigma$ (Jeffreys / scale-invariant) | All three routes agree |
| Probabilities or rates with possible boundary cases | Beta$(1/2,1/2)$ (Jeffreys) | Coordinate-invariant; better near $p\approx 0, 1$ than flat |
| Discrete outcomes, no preference between them | Symmetric Dirichlet, e.g. $\alpha=1$ (uniform on simplex) or $\alpha=1/2$ (Jeffreys) | Group invariance under permutation of categories |
| Positive quantity with known mean only | Exponential$(1/\mu)$ (MaxEnt) | The least committal positive distribution with that mean |
| Real-valued quantity with known mean and variance | Normal (MaxEnt) | The least committal distribution on $\mathbb{R}$ with those moments |
| Posterior is sensitive to the prior choice | Report results under multiple priors, or collect more data | If the answer depends on the prior in a way you can't defend, the data isn't doing enough work yet |
9. Practical elicitation: turn beliefs into pseudo-counts
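For prior beliefs that you can articulate but not immediately write as a density, the pseudo-count reading from the conjugate-prior page is often the most practical conversion:
- "I'd guess this coin's bias is around 0.6, and I'm about as sure as if I'd seen 10 trials" → Beta$(7, 5)$, i.e. a flat Beta$(1,1)$ updated with 6 successes and 4 failures. Mean $7/12\approx 0.58$; 10 pseudo-trials on top of the flat baseline ($\alpha+\beta=12$). If you count pseudo-trials from the Haldane baseline instead, Beta$(6,4)$ has mean exactly 0.6 and $\alpha+\beta=10$.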
- "I'd estimate the event rate at ≈ 2 per year, with about 5 years of indirect experience" → Gamma$(\alpha=10, \beta=5)$ on the rate. Mean 2, effective observation 5 years.
- "I think the parameter is near 0, on the scale of 0.5" → Normal$(0, 0.5^2)$. Worth one observation of precision-4 data.
The discipline this enforces is useful: every conjugate hyperparameter has units of either pseudo-events or pseudo-observations, and writing them down forces you to check that your prior isn't accidentally worth a thousand observations on a hundred-observation problem.
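The conversions above are mechanical enough to wrap in small helpers. The following is a minimal sketch; the function names and the `baseline` argument are illustrative, and the last call uses a made-up data noise level to show how to read a Normal prior's strength in pseudo-observations.

```python
def beta_from_mean_and_trials(mean, pseudo_trials, baseline=(0.0, 0.0)):
    """Beta hyperparameters = baseline + pseudo-successes and pseudo-failures."""
    a0, b0 = baseline
    return a0 + mean * pseudo_trials, b0 + (1.0 - mean) * pseudo_trials

def gamma_from_rate_and_exposure(rate, exposure):
    """Gamma(shape, rate) prior on an event rate.

    shape counts pseudo-events, rate counts pseudo-exposure (e.g. years).
    """
    return rate * exposure, exposure

def normal_pseudo_observations(prior_sd, data_sd):
    """Number of observations with noise data_sd that the Normal prior is worth."""
    return (data_sd / prior_sd) ** 2

print(beta_from_mean_and_trials(0.6, 10))                       # Haldane baseline
print(beta_from_mean_and_trials(0.6, 10, baseline=(1.0, 1.0)))  # flat baseline
print(gamma_from_rate_and_exposure(2.0, 5.0))
print(normal_pseudo_observations(prior_sd=0.5, data_sd=5.0))
```

The last line is the sanity check the paragraph above recommends: a Normal$(0, 0.5^2)$ prior looks modest, but against data with standard deviation 5 it carries the weight of a hundred observations.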