Notation: density form vs. measure-theoretic form
1. Densities depend on reference measures
A density $p(\theta)$ is not an absolute object: it depends on a reference measure. The same probability law $\pi$ on $\mathbb R$ has density $p(\theta)$ with respect to Lebesgue measure $\lambda$, density $p(\theta)/\varphi(\theta)$ with respect to the standard Gaussian measure (where $\varphi(\theta) \propto e^{-\theta^2/2}$ is the normalized standard Gaussian density), and no density at all with respect to (say) a sum of Dirac masses on the integers. The expression $p(\theta)\,d\theta$ silently assumes Lebesgue.
Measure-theoretic notation makes the reference explicit. You write $d\pi$ when you mean "integrate against the probability law itself" and $d\pi/d\nu$ when you specifically want a density — naming both the law and the yardstick.
For most calculations the difference is cosmetic: read $p(\theta)\,d\theta$ as a shorthand for $d\pi(\theta)$ and you've translated. The cases where the difference is not cosmetic are catalogued in §4.
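As a minimal sketch of how the same law can be integrated against two yardsticks, the following computes one expectation twice. It assumes, purely for illustration, that $\pi = N(1, 1)$; the function names are hypothetical. Against Lebesgue measure the integrand is $f \cdot p$ on a grid; against the standard Gaussian $\gamma$ the same expectation is an average of $f \cdot d\pi/d\gamma$ over Gaussian draws.

```python
import math
import random

def p(theta):
    """Lebesgue density of the illustrative law pi = N(1, 1)."""
    return math.exp(-0.5 * (theta - 1.0) ** 2) / math.sqrt(2.0 * math.pi)

def phi(theta):
    """Lebesgue density of the standard Gaussian reference measure gamma."""
    return math.exp(-0.5 * theta ** 2) / math.sqrt(2.0 * math.pi)

def density_wrt_gaussian(theta):
    """d(pi)/d(gamma) = p / phi; for N(1, 1) this equals exp(theta - 1/2)."""
    return p(theta) / phi(theta)

def mean_via_lebesgue(f, lo=-10.0, hi=12.0, n=100_000):
    """Midpoint-rule approximation of the integral of f * p d(theta)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        t = lo + (i + 0.5) * h
        total += f(t) * p(t)
    return total * h

def mean_via_gaussian(f, n=200_000, seed=0):
    """Monte Carlo approximation of the integral of f * (d pi / d gamma) d(gamma)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)   # a draw from the reference measure gamma
        total += f(z) * density_wrt_gaussian(z)
    return total / n

def identity(t):
    return t

m_lebesgue = mean_via_lebesgue(identity)
m_gaussian = mean_via_gaussian(identity)
print(m_lebesgue, m_gaussian)   # both approximate E_pi[theta] = 1
```

The two densities are different functions, yet both routes target the same number, because each pairs its density with the matching reference measure.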
2. The dictionary
| Concept | Density form | Measure-theoretic form |
|---|---|---|
| Probability law | $p(\theta)$ (density) | $\pi$ (measure on a space $(\Omega, \mathcal F)$) |
| Probability of a set $A$ | $\int_A p(\theta)\,d\theta$ | $\pi(A)$ |
| Expectation | $\int f(\theta)\,p(\theta)\,d\theta$ | $\int f\,d\pi$ or $\mathbb E_\pi[f]$ |
| Density (when one exists) | $p(\theta)$ | $\dfrac{d\pi}{d\nu}$ (Radon–Nikodym derivative w.r.t. dominating measure $\nu$) |
| Joint law | $p(x, y)$ | $\pi$ on the product space $(\mathcal X \times \mathcal Y, \mathcal F_X \otimes \mathcal F_Y)$ |
| Conditional | $p(\theta \mid y) = p(\theta, y)/p(y)$ | $\pi(\cdot \mid y)$ (regular conditional probability) |
| Independence | $p(x, y) = p(x)\,p(y)$ | $\pi_{X,Y} = \pi_X \otimes \pi_Y$ (product measure) |
| Pushforward | change of variables: $p_Y(y) = p_X(\phi^{-1}(y))\,|J_{\phi^{-1}}(y)|$ | $\phi_\ast \pi$, defined by $(\phi_\ast \pi)(B) = \pi(\phi^{-1}(B))$ |
| KL divergence | $D(p \,\Vert\, q) = \int p\,\log(p/q)\,d\theta$ | $D(\pi \,\Vert\, \nu) = \int \log\!\bigl(d\pi/d\nu\bigr)\,d\pi$ |
| Sufficient statistic | Fisher–Neyman: $p(x \mid \theta) = g(T(x), \theta)\,h(x)$ | $T$ measurable; the conditional law of $X$ given $\sigma(T)$ does not depend on $\theta$ |
| Convergence | $p_n \to p$ (typically pointwise) | $\pi_n \to \pi$ weakly, in total variation, in distribution, etc. |
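To make the pushforward row of the dictionary concrete, here is a small sketch that computes one probability both ways. It assumes, for illustration only, that $X$ is standard Gaussian and $\phi = \exp$, so $Y = \phi(X)$ is log-normal; the helper names are hypothetical.

```python
import math

def std_normal_cdf(x):
    """pi((-inf, x]) for the standard Gaussian law pi."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    """Lebesgue density of the standard Gaussian law."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def pushforward_prob(b):
    """(phi_* pi)((0, b]) = pi(phi^{-1}((0, b])) = pi((-inf, log b]) for phi = exp."""
    return std_normal_cdf(math.log(b))

def density_of_y(y):
    """Change of variables: p_Y(y) = p_X(log y) * |d(log y)/dy| = p_X(log y) / y."""
    return std_normal_pdf(math.log(y)) / y

def density_prob(b, n=200_000):
    """Midpoint-rule integral of p_Y over (0, b]."""
    h = b / n
    return sum(density_of_y((i + 0.5) * h) for i in range(n)) * h

prob_measure = pushforward_prob(2.0)   # measure form: no Jacobian needed
prob_density = density_prob(2.0)       # density form: Jacobian inside p_Y
print(prob_measure, prob_density)
```

The measure form only needs the preimage of the set; the Jacobian appears solely because the density form insists on expressing the answer against Lebesgue measure on the $y$ axis.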
3. Routine calculations agree
For laws absolutely continuous with respect to Lebesgue measure on $\mathbb R^d$ (or with respect to counting measure on a countable space), the two languages compute the same numbers from the same inputs. Bayes' rule $p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$ becomes $d\pi(\theta \mid y) \propto p(y \mid \theta)\,d\pi(\theta)$: the same reweighting of the prior by the likelihood. The standard derivations of conjugate posteriors, MCMC acceptance ratios, ELBO decompositions, and Gaussian-process kernels go through with $p(\theta)\,d\theta$ replaced verbatim by $d\pi(\theta)$.
Routine conjugate-prior, Monte Carlo, variational-Bayes, and Gaussian-process calculations usually lose nothing in density form. They can be rewritten in measure-theoretic form without changing a result.
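A hedged illustration of this agreement, using a Beta-Binomial update with made-up numbers (a Beta(2, 2) prior and 7 successes in 10 trials; all names are hypothetical). The density-form routine normalizes likelihood times prior density on a grid. The measure-form routine never evaluates a prior density: it integrates against the prior law by reweighting prior draws with the likelihood.

```python
import random

A, B = 2.0, 2.0      # illustrative Beta(A, B) prior
N, K = 10, 7         # illustrative data: K successes in N trials

def likelihood(theta):
    """Binomial likelihood, up to a constant that cancels on normalization."""
    return theta ** K * (1.0 - theta) ** (N - K)

def prior_density(theta):
    """Unnormalized Beta(A, B) density with respect to Lebesgue measure."""
    return theta ** (A - 1.0) * (1.0 - theta) ** (B - 1.0)

def posterior_mean_density_form(n=20_000):
    """Normalize likelihood * prior density on a midpoint grid over (0, 1)."""
    h = 1.0 / n
    grid = [(i + 0.5) * h for i in range(n)]
    weights = [likelihood(t) * prior_density(t) for t in grid]
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)

def posterior_mean_measure_form(n=200_000, seed=0):
    """Integrate against the prior law itself: reweight prior draws by the likelihood."""
    rng = random.Random(seed)
    draws = [rng.betavariate(A, B) for _ in range(n)]
    weights = [likelihood(t) for t in draws]
    return sum(t * w for t, w in zip(draws, weights)) / sum(weights)

# Mean of the conjugate Beta(A + K, B + N - K) posterior.
closed_form = (A + K) / (A + B + N)
mean_density = posterior_mean_density_form()
mean_measure = posterior_mean_measure_form()
print(closed_form, mean_density, mean_measure)
```

Both routines approximate the closed-form posterior mean; the second carries Monte Carlo error, so agreement is only up to sampling noise.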
4. Modes, KL, sufficient statistics, and convergence differ
The language choice matters substantively in four places:
The mode requires a reference measure
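The mean and median of a probability law are properties of the measure: $\mathbb E_\pi[X]$ and $F_\pi^{-1}(1/2)$ depend only on $\pi$, not on a chosen yardstick. The mode, $\arg\max_\theta p(\theta)$, depends on the density's reference measure. Under a smooth bijection $\phi$ the density transforms as $p_Y(y) = p_X(\phi^{-1}(y))\, |J_{\phi^{-1}}(y)|$, and the argmax moves with the Jacobian. So the MAP estimates of $\log\sigma^2$ and of $\sigma^2$ generally disagree once one is mapped back through $\exp$, whereas the medians correspond exactly, because quantiles commute with monotone maps. The means of the two parametrizations also differ under a nonlinear map (Jensen's inequality), but each mean is still determined by the law alone; only the mode needs a reference measure to be defined at all.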
This is the same reparametrization-invariance issue that motivates Jeffreys priors and that is flagged on Posterior Summaries.
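The $\sigma^2$ versus $\log\sigma^2$ example can be checked numerically. The sketch below assumes, for illustration, that $\sigma^2$ follows an inverse-gamma law with shape 3 and scale 2; the grid bounds and helper names are likewise hypothetical choices. It locates the mode in each parametrization and compares sample medians.

```python
import math
import random
import statistics

A, B = 3.0, 2.0   # illustrative inverse-gamma shape and scale for sigma^2

def density_s2(s2):
    """Unnormalized Lebesgue density of sigma^2 ~ InvGamma(A, B)."""
    return s2 ** (-A - 1.0) * math.exp(-B / s2)

def density_log_s2(eta):
    """Density of eta = log sigma^2: p(e^eta) times the Jacobian e^eta."""
    return density_s2(math.exp(eta)) * math.exp(eta)

def grid_argmax(f, lo, hi, n=200_000):
    """Argmax of f over an evenly spaced grid on [lo, hi]."""
    h = (hi - lo) / n
    return max((lo + i * h for i in range(n + 1)), key=f)

# Mode found in the sigma^2 parametrization: close to B / (A + 1).
mode_s2 = grid_argmax(density_s2, 0.01, 5.0)
# Mode found in the log parametrization, mapped back: close to B / A.
mode_via_log = math.exp(grid_argmax(density_log_s2, -4.0, 2.0))

# Medians: with an odd number of draws the sample median is a single draw,
# and a monotone map sends it to the sample median of the mapped draws.
rng = random.Random(0)
draws = [1.0 / rng.gammavariate(A, 1.0 / B) for _ in range(10_001)]
median_s2 = statistics.median(draws)
median_log_s2 = statistics.median(math.log(x) for x in draws)

print(mode_s2, mode_via_log)
print(math.log(median_s2), median_log_s2)
```

The two modes land at different values of $\sigma^2$, while the log of the median equals the median of the logs.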
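KL is built on a Radon–Nikodym derivative, not a density ratio
The definition $D(\pi \,\Vert\, \nu) = \int \log\!\bigl(d\pi/d\nu\bigr)\,d\pi$ requires $\pi \ll \nu$ (every $\nu$-null set is also $\pi$-null) so that $d\pi/d\nu$ exists; otherwise $D = \infty$ by convention. Absolute continuity is stronger than containment of supports: a point mass at $0$ has its support inside that of Lebesgue measure but is not absolutely continuous with respect to it. Even when $\pi \ll \nu$ the integral may diverge, so the condition is necessary for finiteness, not sufficient. Density form $D(p\,\Vert\,q) = \int p \log(p/q)\,d\theta$ hides this: division by zero appears as an algebraic accident rather than a structural condition. The measure-theoretic form also generalises to discrete laws, mixed laws, and laws on infinite-dimensional spaces (Gaussian processes, path spaces) without needing separate definitions. See KL divergence.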
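A minimal sketch of how the absolute-continuity condition can be made structural in code, restricted to laws on a common finite set (so the Radon–Nikodym derivative is a ratio of point masses). The function name and the example masses are illustrative assumptions.

```python
import math

def kl_divergence(pi, nu):
    """KL divergence between two laws on the same finite set, given as lists of masses.

    Returns math.inf when pi is not absolutely continuous with respect to nu,
    i.e. when some point has nu-mass zero but positive pi-mass.
    """
    total = 0.0
    for p_i, q_i in zip(pi, nu):
        if p_i == 0.0:
            continue          # pi-null points contribute nothing (0 * log 0 = 0)
        if q_i == 0.0:
            return math.inf   # pi is not << nu: no Radon-Nikodym derivative exists
        total += p_i * math.log(p_i / q_i)   # log(d pi / d nu), integrated against pi
    return total

kl_dominated = kl_divergence([0.5, 0.5, 0.0], [0.25, 0.25, 0.5])   # pi << nu
kl_undominated = kl_divergence([0.5, 0.5, 0.0], [0.5, 0.0, 0.5])   # nu misses a pi-atom
print(kl_dominated, kl_undominated)
```

The check for domination is an explicit branch rather than a floating-point division by zero, which mirrors the way the measure-theoretic definition states the condition up front.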
Sufficient statistics via σ-algebras
Fisher–Neyman factorization is a useful test for sufficiency ($p(x \mid \theta) = g(T(x), \theta)\,h(x)$), but it presumes a density with respect to a common dominating measure. The underlying definition is measure-theoretic: $T$ is sufficient if the conditional distribution of $X$ given $T(X)$ is the same for every $\theta$. In Bayesian terms, with a prior on $\theta$, this says that $\theta$ is conditionally independent of $X$ given $T(X)$. This formulation handles cases with no Lebesgue density (continuous-discrete mixtures, point masses) and connects directly to information theory through the data-processing inequality. See Sufficient Statistics.
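The conditional-distribution definition can be checked directly in a small case. The sketch below assumes i.i.d. Bernoulli($\theta$) observations with $T(x) = \sum_i x_i$; the sample size, the two values of $\theta$, and the function name are illustrative. It enumerates the conditional law of $X$ given $T(X) = t$ and compares it across parameters.

```python
import itertools
import math

def conditional_law_given_sum(theta, n, t):
    """P(X = x | T(X) = t) for i.i.d. Bernoulli(theta) X of length n, with T = sum."""
    def prob(x):
        k = sum(x)
        return theta ** k * (1.0 - theta) ** (n - k)

    matching = [x for x in itertools.product((0, 1), repeat=n) if sum(x) == t]
    norm = sum(prob(x) for x in matching)   # P(T = t) under theta
    return {x: prob(x) / norm for x in matching}

law_a = conditional_law_given_sum(0.2, n=4, t=2)
law_b = conditional_law_given_sum(0.9, n=4, t=2)

# Both conditional laws are uniform over the C(4, 2) arrangements.
for x in law_a:
    print(x, law_a[x], law_b[x])
```

Changing $\theta$ changes how likely each value of $T$ is, but not how the mass is spread within a level set of $T$, which is what sufficiency asserts.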
Modes of convergence are measure-theoretic by nature
Convergence in probability, almost surely, in $L^p$, and in distribution are each defined as conditions on a sequence of measures or a sequence of random variables on a common probability space. Density-form translations exist for absolutely continuous cases (convergence of densities, pointwise or in $L^1$), but the underlying notions are not density notions: random variables that have no density (e.g. point masses, mixed laws) can still converge in any of these senses. See Modes of Convergence.
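A small sketch of convergence without densities: point masses at $1/n$ approach the point mass at $0$. Laws are represented as hypothetical `{atom: mass}` dictionaries, and $\cos$ stands in for a single bounded continuous test function (weak convergence requires the same for every such function, which the code does not attempt to verify).

```python
import math

def expectation(law, f):
    """Integrate f against a discrete law given as {atom: mass}."""
    return sum(mass * f(atom) for atom, mass in law.items())

def total_variation(law_a, law_b):
    """Total variation distance between two discrete laws."""
    atoms = set(law_a) | set(law_b)
    return 0.5 * sum(abs(law_a.get(x, 0.0) - law_b.get(x, 0.0)) for x in atoms)

limit = {0.0: 1.0}   # point mass at 0
f = math.cos         # one bounded continuous test function

for n in (1, 10, 100, 1000):
    law_n = {1.0 / n: 1.0}   # point mass at 1/n
    gap = abs(expectation(law_n, f) - expectation(limit, f))
    print(n, gap, total_variation(law_n, limit))
```

The expectation gap shrinks as $n$ grows while the total variation distance stays at 1, so the sequence converges in one sense and not in another, and no Lebesgue density is involved at any point.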
5. Notation choice by task
- Default to density form. Conjugate updates, sampling-method derivations, MCMC acceptance ratios, ELBO bookkeeping, and GP kernels all work fine in density form, and it is the form most readers bring with them.
- Reach for measure-theoretic form when the result is reparametrization-dependent (anything involving a mode or a MAP), when absolute continuity matters (KL, Bayes factors where a denominator density $q$ is at or near $0$), when you're working on an unfamiliar space (path spaces, infinite product spaces, point processes), or when you need a definition that doesn't presume a density (sufficient statistics, convergence).
- Translation should be cheap. Any time the density form reads $p(\theta)\,d\theta$, read $d\pi(\theta)$ instead and the rest of the page still works. The cases above are the ones where translation alone isn't enough; the measure-theoretic form carries content the density form cannot.