Notation: density form vs. measure-theoretic form

A dictionary for density notation and measure-theoretic notation, with the places where they diverge.

1. Densities depend on reference measures

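A density $p(\theta)$ is not an absolute object: it depends on a reference measure. Suppose a probability law $\pi$ on $\mathbb R$ has density $p(\theta)$ with respect to Lebesgue measure $\lambda$. The same law has density $p(\theta)/\varphi(\theta) = \sqrt{2\pi}\,e^{\theta^2/2}\,p(\theta)$ with respect to the standard Gaussian law, whose Lebesgue density is $\varphi(\theta) = e^{-\theta^2/2}/\sqrt{2\pi}$, and no density at all with respect to (say) a sum of Dirac masses on the integers, because $\pi$ puts all its mass on the non-integers, which that measure treats as a null set. The expression $p(\theta)\,d\theta$ silently assumes Lebesgue.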

Measure-theoretic notation makes the reference explicit. You write $d\pi$ when you mean "integrate against the probability law itself" and $d\pi/d\nu$ when you specifically want a density — naming both the law and the yardstick.
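
As a rough numerical illustration of one law measured against two yardsticks, the sketch below takes $\pi = N(1, 1)$ (an example chosen here, not taken from the text) and computes $\pi([0, 2])$ twice: once by integrating the Lebesgue density over $[0, 2]$, and once by averaging $d\pi/d\gamma$ over draws from the standard Gaussian $\gamma$. All function names are illustrative.

```python
import math
import random


def p_lebesgue(theta, mu=1.0):
    """Density of N(mu, 1) with respect to Lebesgue measure."""
    return math.exp(-0.5 * (theta - mu) ** 2) / math.sqrt(2.0 * math.pi)


def phi(theta):
    """Lebesgue density of the standard Gaussian reference law."""
    return math.exp(-0.5 * theta ** 2) / math.sqrt(2.0 * math.pi)


def p_gaussian_ref(theta, mu=1.0):
    """Density of the same N(mu, 1) law with respect to the standard Gaussian."""
    return p_lebesgue(theta, mu) / phi(theta)


def prob_via_lebesgue(a, b, n=20000):
    """pi([a, b]) as a midpoint-rule integral of the Lebesgue density."""
    h = (b - a) / n
    return sum(p_lebesgue(a + (i + 0.5) * h) for i in range(n)) * h


def prob_via_gaussian(a, b, n=200000, seed=0):
    """pi([a, b]) as a Monte Carlo average of 1_[a,b] * dpi/dgamma under gamma."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)  # a draw from the reference law gamma
        if a <= z <= b:
            total += p_gaussian_ref(z)
    return total / n


exact = math.erf(1.0 / math.sqrt(2.0))  # P(|Z| <= 1) for a standard normal Z
print(prob_via_lebesgue(0.0, 2.0), prob_via_gaussian(0.0, 2.0), exact)
```

The two density functions take different values at the same point, yet both routes estimate the same probability, because each density is paired with its own reference measure.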

For most calculations the difference is cosmetic: read $p(\theta)\,d\theta$ as a shorthand for $d\pi(\theta)$ and you've translated. The cases where the difference is not cosmetic are catalogued in §4.

2. The dictionary

| Concept | Density form | Measure-theoretic form |
| --- | --- | --- |
| Probability law | $p(\theta)$ (density) | $\pi$ (measure on a space $(\Omega, \mathcal F)$) |
| Probability of a set $A$ | $\int_A p(\theta)\,d\theta$ | $\pi(A)$ |
| Expectation | $\int f(\theta)\,p(\theta)\,d\theta$ | $\int f\,d\pi$ or $\mathbb E_\pi[f]$ |
| Density (when one exists) | $p(\theta)$ | $\dfrac{d\pi}{d\nu}$ (Radon–Nikodym derivative w.r.t. dominating measure $\nu$) |
| Joint law | $p(x, y)$ | $\pi$ on the product space $(\mathcal X \times \mathcal Y, \mathcal F_X \otimes \mathcal F_Y)$ |
| Conditional | $p(\theta \mid y) = p(\theta, y)/p(y)$ | $\pi(\cdot \mid y)$ (regular conditional probability) |
| Independence | $p(x, y) = p(x)\,p(y)$ | $\pi_{X,Y} = \pi_X \otimes \pi_Y$ (product measure) |
| Pushforward | change of variables: $p_Y(y) = p_X(\phi^{-1}(y))\,\lvert J \rvert$ | $\phi_\ast \pi$, defined by $(\phi_\ast \pi)(B) = \pi(\phi^{-1}(B))$ |
| KL divergence | $D(p \,\Vert\, q) = \int p\,\log(p/q)\,d\theta$ | $D(\pi \,\Vert\, \nu) = \int \log\!\bigl(d\pi/d\nu\bigr)\,d\pi$ |
| Sufficient statistic | Fisher–Neyman: $p(x \mid \theta) = g(T(x), \theta)\,h(x)$ | Doob–Dynkin: $T$ measurable, $\sigma(T)$ makes $\theta$ conditionally independent of $X$ given $T(X)$ |
| Convergence | $p_n \to p$ (typically pointwise) | $\pi_n \to \pi$ weakly, in total variation, in distribution, etc. |
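
The pushforward row can be checked numerically. The sketch below assumes $X \sim N(0, 1)$ and $\phi = \exp$ (an example chosen here for illustration) and computes the probability that $Y = \phi(X)$ lands in $B = [1, e]$ in both languages: as $\pi(\phi^{-1}(B))$, and as the integral of the change-of-variables density over $B$.

```python
import math


def normal_pdf(x):
    """Lebesgue density of N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Distribution function of N(0, 1), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def pushforward_prob_measure(lo, hi):
    """(phi_* pi)([lo, hi]) = pi(phi^{-1}([lo, hi])), with phi = exp so phi^{-1} = log."""
    return normal_cdf(math.log(hi)) - normal_cdf(math.log(lo))


def pushforward_prob_density(lo, hi, n=20000):
    """Midpoint-rule integral of p_Y(y) = p_X(log y) * |d(log y)/dy| = p_X(log y) / y."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        total += normal_pdf(math.log(y)) / y
    return total * h


print(pushforward_prob_measure(1.0, math.e), pushforward_prob_density(1.0, math.e))
```

The measure-theoretic route never mentions a Jacobian; the Jacobian only appears once a Lebesgue density for $Y$ is requested.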

3. Routine calculations agree

For laws absolutely continuous with respect to Lebesgue measure on $\mathbb R^d$ (or with respect to counting measure on a countable space), the two languages compute the same numbers from the same inputs. Bayes' rule $p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta)$ becomes $d\pi(\theta \mid y) \propto p(y \mid \theta)\,d\pi(\theta)$: the posterior has a density proportional to the likelihood with respect to the prior. The standard derivations of conjugate posteriors, MCMC acceptance ratios, ELBO decompositions, and Gaussian-process kernels go through with $p(\theta)\,d\theta$ replaced verbatim by $d\pi(\theta)$.

Routine conjugate-prior, Monte Carlo, variational-Bayes, and Gaussian-process calculations usually lose nothing in density form. They can be rewritten in measure-theoretic form without changing a result.
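
A small sketch of that agreement, assuming a Beta(2, 2) prior and 7 successes in 10 Bernoulli trials (numbers invented for illustration). The posterior mean is computed once from the conjugate density formula and once by integrating against the prior measure, with prior draws reweighted by the likelihood.

```python
import random


def posterior_mean_conjugate(successes, failures, a=2.0, b=2.0):
    """Density-form answer: the posterior is Beta(a + successes, b + failures)."""
    return (a + successes) / (a + b + successes + failures)


def posterior_mean_by_reweighting(successes, failures, a=2.0, b=2.0,
                                  n=100_000, seed=0):
    """Measure-form answer: (integral of theta * L dpi) / (integral of L dpi),
    estimated with draws from the prior pi and likelihood weights L."""
    rng = random.Random(seed)
    numerator = 0.0
    denominator = 0.0
    for _ in range(n):
        theta = rng.betavariate(a, b)  # a draw from the prior
        weight = theta ** successes * (1.0 - theta) ** failures  # likelihood
        numerator += theta * weight
        denominator += weight
    return numerator / denominator


print(posterior_mean_conjugate(7, 3), posterior_mean_by_reweighting(7, 3))
```

The reweighting route uses the prior only through samples, so it would run unchanged for a prior that has no Lebesgue density.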

4. Modes, KL, sufficient statistics, and convergence differ

The language choice matters substantively in four places:

The mode requires a reference measure

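The mean and median of a probability law are properties of the measure: $\mathbb E_\pi[X]$ and $F_\pi^{-1}(1/2)$ depend only on $\pi$, not on a chosen yardstick. The mode, $\arg\max_\theta p(\theta)$, depends on the density's reference measure. Under a smooth bijection $\phi$ the density transforms as $p_Y(y) = p_X(\phi^{-1}(y))\,|J_{\phi^{-1}}(y)|$, so the argmax moves with the Jacobian: in general $\arg\max_y p_Y(y) \neq \phi(\arg\max_x p_X(x))$. A MAP estimate of $\log\sigma^2$, mapped back through $\exp$, therefore disagrees with a MAP estimate of $\sigma^2$. The medians agree, because the median commutes with monotone maps. The means do not map onto each other either ($\mathbb E[\log\sigma^2] \le \log \mathbb E[\sigma^2]$ by Jensen's inequality), but each is fixed by the law of the variable being averaged, with no reference measure involved.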

This is the same reparametrization-invariance issue that motivates Jeffreys priors and that is flagged on Posterior Summaries.
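
The effect is easy to see numerically. The sketch below assumes $\log\sigma^2 \sim N(0, 1)$ (an example chosen here for illustration), finds the mode of each Lebesgue density on a grid, and compares sample medians in the two parametrizations. Grid ranges and variable names are illustrative.

```python
import math
import random
import statistics


def normal_pdf(x):
    """Lebesgue density of N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def density_eta(eta):
    """Lebesgue density of eta = log(sigma^2), assumed N(0, 1)."""
    return normal_pdf(eta)


def density_v(v):
    """Lebesgue density of v = sigma^2 = exp(eta): p_eta(log v) * |d(log v)/dv|."""
    return normal_pdf(math.log(v)) / v


# Grid search for the mode in each parametrization.
eta_grid = [-5.0 + 0.001 * i for i in range(10001)]
v_grid = [0.0001 * i for i in range(1, 50001)]
map_eta = max(eta_grid, key=density_eta)
map_v = max(v_grid, key=density_v)

# Sample medians: exp is monotone, so it preserves the ordering of the sample.
rng = random.Random(0)
eta_samples = [rng.gauss(0.0, 1.0) for _ in range(10001)]
v_samples = [math.exp(e) for e in eta_samples]
median_eta = statistics.median(eta_samples)
median_v = statistics.median(v_samples)

print("mode of eta mapped through exp:", math.exp(map_eta))
print("mode of v directly:            ", map_v)
print("median of eta mapped through exp:", math.exp(median_eta))
print("median of v directly:            ", median_v)
```

Under these assumptions the mode of $\log\sigma^2$ maps to $\sigma^2 = 1$, while the mode of the $\sigma^2$ density sits near $e^{-1}$; the two medians coincide.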

KL is built from a Radon–Nikodym derivative, not a density ratio

The definition $D(\pi \,\Vert\, \nu) = \int \log\!\bigl(d\pi/d\nu\bigr)\,d\pi$ requires $\pi \ll \nu$ (every $\nu$-null set is $\pi$-null), so that $d\pi/d\nu$ exists; otherwise $D = \infty$ by convention. Absolute continuity is stronger than containment of supports: a point mass at $0$ sits inside the support of Lebesgue measure but is not absolutely continuous with respect to it. Even when $\pi \ll \nu$, the integral may diverge. Density form $D(p\,\Vert\,q) = \int p \log(p/q)\,d\theta$ hides this: division by zero appears as an algebraic accident rather than a structural condition. The measure-theoretic form also generalises to discrete laws, mixed laws, and laws on infinite-dimensional spaces (Gaussian processes, path spaces) without needing separate definitions. See KL divergence.
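
On a finite space the structural condition can be made explicit in code. In the sketch below, measures are plain dictionaries from points to masses (a representation assumed here for illustration); the absolute-continuity check comes first, and only then is the Radon–Nikodym derivative, which on a finite space is a ratio of masses, integrated against $\pi$.

```python
import math


def kl_divergence(pi, nu):
    """D(pi || nu) for finite measures given as {point: mass} dictionaries."""
    # Absolute continuity: pi must put no mass where nu puts none.
    for x, mass in pi.items():
        if mass > 0.0 and nu.get(x, 0.0) == 0.0:
            return math.inf
    # d(pi)/d(nu) at x is pi[x] / nu[x]; integrate its log against pi.
    return sum(mass * math.log(mass / nu[x])
               for x, mass in pi.items() if mass > 0.0)


uniform = {"a": 0.5, "b": 0.5}
point = {"a": 1.0}

print(kl_divergence(point, uniform))  # finite: point is dominated by uniform
print(kl_divergence(uniform, point))  # infinite: uniform charges "b", point does not
```

The asymmetry between the two calls is the asymmetry of $\ll$, not a numerical accident.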

Sufficient statistics via σ-algebras

Fisher–Neyman factorization ($p(x \mid \theta) = g(T(x), \theta)\,h(x)$) is a useful test for sufficiency, but it presumes densities with respect to a common dominating measure. The underlying definition is measure-theoretic: $T$ is sufficient if the conditional distribution of $X$ given $T(X)$ is the same for every $\theta$. In Bayesian terms, with a prior on $\theta$, this says that $\theta$ is conditionally independent of $X$ given $T(X)$. This formulation needs no density, so it covers laws without a Lebesgue density (continuous-discrete mixtures, point masses), and it connects directly to information theory through the data-processing inequality. See Sufficient Statistics.
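
The conditional-distribution definition can be verified by brute force in a small case. The sketch below assumes $n = 3$ i.i.d. Bernoulli($\theta$) observations with $T(x) = \sum_i x_i$ (an example chosen here for illustration) and computes the conditional law of $X$ given $T(X)$ for two values of $\theta$.

```python
from itertools import product


def conditional_given_sum(theta, n=3):
    """Return {x: P(X = x | T(X) = sum(x))} for n i.i.d. Bernoulli(theta) draws."""
    # Joint probability of each binary sequence.
    joint = {}
    for x in product((0, 1), repeat=n):
        k = sum(x)
        joint[x] = theta ** k * (1.0 - theta) ** (n - k)
    # Marginal law of the statistic T(x) = sum(x).
    t_marginal = {}
    for x, p in joint.items():
        t_marginal[sum(x)] = t_marginal.get(sum(x), 0.0) + p
    # Conditional law of X given the value of T.
    return {x: p / t_marginal[sum(x)] for x, p in joint.items()}


low = conditional_given_sum(0.2)
high = conditional_given_sum(0.7)
for x in sorted(low):
    print(x, round(low[x], 6), round(high[x], 6))
```

Both columns agree row by row: given the number of successes, every arrangement is equally likely whatever $\theta$ is, which is what sufficiency of the sum means here.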

Modes of convergence are measure-theoretic by nature

Convergence in probability, almost surely, in $L^p$, and in distribution are each defined as conditions on a sequence of measures or on a sequence of random variables on a common probability space. Density-form translations exist in absolutely continuous cases (convergence of densities, pointwise or in $L^1$), but the underlying notions are not density notions: random variables with no density (e.g. point masses, mixed laws) can still converge in each of these modes. For instance, the point masses $\delta_{1/n}$ converge in distribution to $\delta_0$, although none of them has a Lebesgue density. See Modes of Convergence.
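
A sequence of point masses shows two modes of convergence coming apart with no density in sight. The sketch below represents discrete measures as dictionaries (a representation assumed here for illustration) and compares $\delta_{1/n}$ with $\delta_0$ in two ways: through the integral of one bounded continuous test function, and through total variation distance. A single test function only illustrates weak convergence; it does not prove it.

```python
import math


def dirac(point):
    """Point mass at `point`, as a {point: mass} dictionary."""
    return {point: 1.0}


def integrate(f, measure):
    """Integral of f against a discrete measure."""
    return sum(mass * f(x) for x, mass in measure.items())


def total_variation(mu, nu):
    """Total variation distance between two discrete probability measures."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)


limit = dirac(0.0)
for n in (1, 10, 100, 1000):
    mu_n = dirac(1.0 / n)
    gap = abs(integrate(math.cos, mu_n) - integrate(math.cos, limit))
    print(n, gap, total_variation(mu_n, limit))
```

The integral gap shrinks as $n$ grows while the total variation distance stays at 1 for every $n$, so the sequence converges weakly but not in total variation.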

5. Notation choice by task

Where the distinction shows up