A family tree of the usual distributions: limits, sums, transformations, ratios, tails, and conjugacy.
Named distributions are easier to remember when they are connected by operations.
Bernoulli trials add into binomials; rare binomials limit to Poisson; Gaussians stay
Gaussian under sums; squared Gaussians make $\chi^2$; ratios make Cauchy, $t$, and
$F$ laws. This page shows those connections.
1. Opening figure: the family map
Click a node to jump to its card. The edge labels are the operations that turn one
distribution into another: sums, limits, transformations, ratios, and conjugate
updates.
Edge types: sum / transform / ratio · conjugate prior · special case · limit
Hover for a one-line summary; click a node to scroll to its card.
2. Discrete distributions
discrete · Bernoulli → Binomial
One trial becomes a count.
A Bernoulli variable is a single yes/no event. A binomial is the sum of $n$
independent Bernoulli trials with the same success probability $p$.
Drag $n$ and $p$: the bars are $P(X=k)$ for $X\sim\mathrm{Binomial}(n,p)$.
The center tracks the mean $np$; the variance is $np(1-p)$, so the spread (standard deviation) tracks $\sqrt{np(1-p)}$.
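| Law | PMF | CDF | CF | Mean | Var |
| --- | --- | --- | --- | --- | --- |
| Bernoulli | $p^x(1-p)^{1-x}$ | $0$ for $x<0$; $1-p$ for $0\le x<1$; $1$ for $x\ge1$ | $1-p+pe^{it}$ | $p$ | $p(1-p)$ |
| Binomial | $\binom{n}{k}p^k(1-p)^{n-k}$ | $\sum_{j\le k}\binom{n}{j}p^j(1-p)^{n-j}$ | $(1-p+pe^{it})^n$ | $np$ | $np(1-p)$ |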
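As a quick check of the sum construction, here is a minimal Python sketch using only the standard library. The helper name `binomial_draw`, the seed, and the values of `n` and `p` are illustrative assumptions, not part of the page's widget.

```python
import random
import statistics

def binomial_draw(n, p, rng):
    """One Binomial(n, p) draw, built as a sum of n Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(0)          # fixed seed so the run is repeatable
n, p = 20, 0.3                  # illustrative parameters
draws = [binomial_draw(n, p, rng) for _ in range(20000)]

sample_mean = statistics.fmean(draws)      # should be close to n*p = 6
sample_var = statistics.pvariance(draws)   # should be close to n*p*(1-p) = 4.2
```

The sample mean and variance should land near $np$ and $np(1-p)$, up to Monte Carlo noise.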
discrete · Categorical → Multinomial
From one $K$-sided draw to a vector of counts.
A categorical variable is a single draw over $K$ outcomes with probabilities
$(p_1,\dots,p_K)$. The multinomial is the vector of category counts after $n$
independent categorical draws. With $K=2$ these reduce to Bernoulli and binomial;
each marginal count is $\mathrm{Binomial}(n, p_k)$, but counts share negative
covariance because the total must equal $n$.
Bars show the expected counts $np_k$ with $\pm 2$ standard-deviation
whiskers from the binomial marginals.
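| Law | PMF | Mean | Var / Cov |
| --- | --- | --- | --- |
| Categorical | $\prod_k p_k^{\mathbb 1_{x=k}}$ | $p_k$ | $p_k(1-p_k)$; $-p_jp_k$ for $j\ne k$ |
| Multinomial | $\frac{n!}{\prod_k n_k!}\prod_k p_k^{n_k}$, with $\sum_k n_k=n$ | $np_k$ | $np_k(1-p_k)$; $-np_jp_k$ for $j\ne k$ |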
discrete · Geometric
Waiting for the first success.
The geometric distribution counts trials up to and including the first success.
After any number of failures, the remaining wait has the same distribution as the
original wait: it is the discrete memoryless waiting time.
If $X_n\sim\mathrm{Binomial}(n,\lambda/n)$, then as $n\to\infty$ the mass
approaches $\mathrm{Poisson}(\lambda)$. This is the count model for many
independent rare opportunities.
Poisson processes are the process-level version: counts over time windows plus exponential waiting times.
The plot overlays the binomial bars (blue, wider) with the limiting Poisson distribution (red, narrower).
PMF: $e^{-\lambda}\lambda^k/k!$
CDF: $\sum_{j\le k}e^{-\lambda}\lambda^j/j!$
CF: $\exp(\lambda(e^{it}-1))$
Mean / variance: $\lambda$, $\lambda$
Fact: Independent Poisson counts add by adding rates.
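The binomial-to-Poisson limit can be checked numerically. The sketch below compares the two PMFs through their total variation distance; the function names and the truncation point `kmax` are assumptions made for this illustration.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def tv_distance(n, lam, kmax=60):
    """Total variation distance between Binomial(n, lam/n) and Poisson(lam).

    The sum is truncated at kmax, which is harmless when lam is small
    relative to kmax (the Poisson tail beyond it is negligible).
    """
    p = lam / n
    total = 0.0
    for k in range(kmax + 1):
        b = binom_pmf(k, n, p) if k <= n else 0.0
        total += abs(b - poisson_pmf(k, lam))
    return 0.5 * total

distances = {n: tv_distance(n, 3.0) for n in (10, 100, 1000)}
```

With $\lambda=3$, the distance should shrink as $n$ grows from 10 to 1000, which is the limit stated above.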
discrete · Negative Binomial
Waiting for several successes, or overdispersed counts.
A negative binomial can be seen as a sum of geometric waits. As a count model,
it is what you reach for when Poisson variance is too small for the data.
The gray curve is a Poisson with the same mean. The wider bars show the
extra dispersion.
PMF: $\binom{k+r-1}{k}p^r(1-p)^k$ (number of failures $k$ before the $r$-th success)
CDF: $\sum_{j\le k}\binom{j+r-1}{j}p^r(1-p)^j$
CF: $\left(\frac{p}{1-(1-p)e^{it}}\right)^r$
Mean / variance: $r(1-p)/p$, $r(1-p)/p^2$
Fact: The variance-to-mean ratio is $1/p$, so it exceeds the Poisson ratio of 1 whenever $p<1$.
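The overdispersion claim can be verified directly from the PMF. This is a small sketch with hypothetical helper names; it assumes an integer $r$ and truncates the infinite sum at `kmax`, where the remaining mass is negligible for the parameters used.

```python
from math import comb

def nbinom_pmf(k, r, p):
    """P(k failures before the r-th success), r a positive integer."""
    return comb(k + r - 1, k) * p**r * (1 - p) ** k

def mean_and_variance(r, p, kmax=500):
    probs = [nbinom_pmf(k, r, p) for k in range(kmax + 1)]
    mean = sum(k * q for k, q in enumerate(probs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return mean, var

mean, var = mean_and_variance(r=5, p=0.4)
ratio = var / mean   # theory: 1/p = 2.5, versus 1 for a Poisson
```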
3. Continuous distributions
continuous · Uniform and inverse-CDF sampling
The random-number source behind the others.
If $U\sim\mathrm{Uniform}(0,1)$, then $F^{-1}(U)$ has CDF $F$. This is the
inverse-CDF method behind exact one-dimensional sampling.
The left axis is uniform probability. The curve is $F^{-1}(u)$ for the chosen
family, turning evenly spaced $u$ values into nonuniform samples.
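A minimal sketch of the inverse-CDF method, using the exponential family as the example because its inverse is elementary: $F(x)=1-e^{-\lambda x}$ gives $F^{-1}(u)=-\ln(1-u)/\lambda$. The function name, seed, and rate are illustrative assumptions.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw from Exp(rate) by pushing a uniform through the inverse CDF."""
    u = rng.random()                      # U ~ Uniform[0, 1)
    return -math.log(1.0 - u) / rate      # F^{-1}(u); 1 - u lies in (0, 1]

rng = random.Random(0)
rate = 2.0
samples = [sample_exponential(rate, rng) for _ in range(50000)]
sample_mean = sum(samples) / len(samples)   # theory: 1/rate = 0.5
```

The same pattern works for any family whose inverse CDF can be evaluated; only the last line of the sampler changes.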
The exponential distribution is the continuous waiting time with no aging:
\[P(T>s+t\mid T>s)=P(T>t).\]
It is also the interarrival-time distribution in a Poisson process.
The minimum of independent exponentials is exponential again: if
$T_1\sim\mathrm{Exp}(\lambda_1)$ and $T_2\sim\mathrm{Exp}(\lambda_2)$ are independent, then
$\min(T_1,T_2)\sim\mathrm{Exp}(\lambda_1+\lambda_2)$. Competing clocks add rates.
PDF: $\lambda e^{-\lambda x}$ for $x\ge0$
CDF: $1-e^{-\lambda x}$ for $x\ge0$
CF: $\lambda/(\lambda-it)$
Mean / variance: $1/\lambda$, $1/\lambda^2$
Fact: The only continuous memoryless distribution.
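The competing-clocks rule is easy to test by simulation. In this sketch the rates and seed are arbitrary illustrative values; `random.expovariate` takes the rate, so the minimum should have mean $1/(\lambda_1+\lambda_2)$.

```python
import random

rng = random.Random(2)
lam1, lam2 = 1.0, 3.0   # illustrative rates

# Each trial runs two independent exponential clocks and keeps the first to ring.
mins = [min(rng.expovariate(lam1), rng.expovariate(lam2)) for _ in range(50000)]
mean_min = sum(mins) / len(mins)        # theory: 1 / (lam1 + lam2) = 0.25
```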
continuous · Gaussian
The shape that sums of finite-variance variables converge to.
Gaussian distributions are closed under sums: independent normals add to a
normal, with means and variances adding. More broadly, normalized sums of many
independent finite-variance variables drift toward a bell curve.
The bars show the sum of $m$ centered uniform variables rescaled to variance
$\sigma^2$. The curve is the matching normal approximation.
The double exponential: a sharp peak with exponential tails.
If $E_1,E_2$ are independent $\mathrm{Exp}(1/b)$ (rate $1/b$, mean $b$), then $E_1-E_2\sim\mathrm{Laplace}(0,b)$.
The density is the symmetric exponential $\frac{1}{2b}e^{-|x-\mu|/b}$, so it has a
cusp at the mean and its log-density falls off linearly in $|x-\mu|$: heavier tails than a Gaussian, but
far lighter than a Cauchy.
It is the maximum-entropy distribution on $\mathbb{R}$ for a fixed mean absolute
deviation $\mathbb E|X-\mu|=b$, the same way the Gaussian is max-entropy for fixed variance.
As a noise model, $-\log p(x\mid\mu,b)=|x-\mu|/b+\log(2b)$, so maximum likelihood for $\mu$ under Laplace noise
minimizes the L1 loss (median regression); as a prior on regression coefficients, its MAP estimate is
the lasso.
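The link between Laplace noise and the median can be seen on a toy data set. The sketch below assumes a fixed scale $b$, so maximizing the Laplace likelihood in $\mu$ is the same as minimizing $\sum_i|x_i-\mu|$; the data and the search grid are made up for illustration.

```python
import statistics

def l1_loss(mu, xs):
    """Sum of absolute deviations: the Laplace negative log-likelihood in mu, up to constants."""
    return sum(abs(x - mu) for x in xs)

xs = [1.0, 2.0, 2.5, 7.0, 40.0]           # toy data with one large outlier
grid = [i / 100 for i in range(0, 4001)]  # candidate locations from 0.00 to 40.00

best_mu = min(grid, key=lambda m: l1_loss(m, xs))
median = statistics.median(xs)
mean = statistics.fmean(xs)
```

Here `best_mu` coincides with the sample median, while the sample mean is dragged toward the outlier.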
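Let $X_1,\dots,X_n$ be i.i.d. draws from a base distribution with mean $\mu$ and finite
variance $\sigma^2$, and let $Z_n=\bigl(\sum_{i=1}^n X_i-n\mu\bigr)/(\sigma\sqrt n)$ be the standardized sum.
Then $Z_n \Rightarrow \mathcal N(0,1)$ as $n \to \infty$. The base shape can be
skewed, bimodal, or discrete; the standardized sum is still Gaussian in the
limit. Pick a base below and slide $n$.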
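Figure 4a · The standardized sample mean converges to $\mathcal N(0, 1)$
[Interactive figure. Top panel: base distribution sample. Bottom panel: standardized sum (Monte Carlo) against the $\mathcal N(0, 1)$ target. Controls: base distribution; number of summands $n$.]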
Things to notice:
At $n = 1$, the bottom panel is the top panel — standardized but
otherwise unchanged. The Exponential is right-skewed; the bimodal mixture is
two-humped; Bernoulli is two spikes.
By $n = 5{-}10$, all four bases produce a bell-shaped bottom panel that
visibly tracks the $\mathcal N(0,1)$ curve. The empirical mean and variance
of $Z_n$ in the readout are already $\approx 0$ and $\approx 1$.
The Bernoulli base needs the most summands: for small $n$ the
standardized sum is still discrete, so the histogram is jagged. The CLT is a
statement about convergence in distribution (the CDFs converge), not about pointwise
convergence of densities or masses, so the discrete histogram only smooths into
the continuous bell as $n$ grows.
This is why the Gaussian shows up everywhere: averaging many independent
(or weakly dependent) finite-variance contributions produces it.
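The convergence can be reproduced outside the figure with a short Monte Carlo sketch. It assumes an Exp(1) base (mean and standard deviation both 1) and uses sample skewness as a rough measure of distance from the bell shape; the function names, seed, and trial count are illustrative.

```python
import math
import random
import statistics

def standardized_sums(n, trials, rng):
    """Z_n = (S_n - n*mu) / (sigma*sqrt(n)) for Exp(1) summands, where mu = sigma = 1."""
    return [
        (sum(rng.expovariate(1.0) for _ in range(n)) - n) / math.sqrt(n)
        for _ in range(trials)
    ]

def skewness(zs):
    m = statistics.fmean(zs)
    s = statistics.pstdev(zs)
    return statistics.fmean(((z - m) / s) ** 3 for z in zs)

rng = random.Random(4)
z1 = standardized_sums(1, 10000, rng)     # just the standardized base
z30 = standardized_sums(30, 10000, rng)   # much closer to N(0, 1)

skew_1, skew_30 = skewness(z1), skewness(z30)   # theory: 2/sqrt(n)
mean_30, var_30 = statistics.fmean(z30), statistics.pvariance(z30)
```

The skewness of $Z_n$ for this base is $2/\sqrt n$, so it should drop from about 2 at $n=1$ to about 0.37 at $n=30$, while the mean and variance stay near 0 and 1.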
4b. Other operations
operation · Sums and convolution
Adding independent variables convolves their densities.
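For independent $X$ and $Y$, the density of $X+Y$ is the convolution
$(f_X*f_Y)(z)=\int f_X(x)f_Y(z-x)\,dx$ (a sum replaces the integral for discrete variables).
Some families are closed under addition: Gaussian plus Gaussian is Gaussian; Poisson plus
Poisson is Poisson; Cauchy plus Cauchy is Cauchy.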
The plot shows two draggable family choices and their sum.
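For a discrete family the convolution is a finite sum, so closure under addition can be checked exactly. The sketch below convolves two truncated Poisson PMFs and compares the result with the Poisson whose rate is the sum; the rates and the truncation `kmax` are illustrative assumptions.

```python
from math import exp, factorial

def poisson_pmf(lam, kmax):
    """Poisson(lam) masses for k = 0..kmax."""
    return [exp(-lam) * lam**k / factorial(k) for k in range(kmax + 1)]

def convolve(f, g):
    """PMF of X + Y for independent X ~ f and Y ~ g on the nonnegative integers."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

kmax = 40
# Entries 0..kmax of the convolution are exact: every pair (i, j) with i + j <= kmax is included.
summed = convolve(poisson_pmf(2.0, kmax), poisson_pmf(3.0, kmax))[: kmax + 1]
direct = poisson_pmf(5.0, kmax)
max_gap = max(abs(a - b) for a, b in zip(summed, direct))
```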
operation · Max and min of i.i.d. samples
Extremes act on the CDF, not the density.
If $X_1,\dots,X_n$ are i.i.d. with CDF $F$ and $M_n=\max(X_1,\dots,X_n)$, then:
\[P(M_n\le x)=F(x)^n.\]
If $m_n=\min(X_1,\dots,X_n)$, then:
\[P(m_n\le x)=1-(1-F(x))^n.\]
The plot uses a $\mathrm{Uniform}(0,1)$ base distribution and shows how
increasing $n$ pushes mass toward the right edge for maxima and the left edge
for minima.
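Both CDF formulas can be checked against simulation for the $\mathrm{Uniform}(0,1)$ base, where $F(x)=x$. The helper name, seed, and the values of `n` and `x` below are illustrative.

```python
import random

def empirical_cdfs(n, x, trials, rng):
    """Estimate P(max <= x) and P(min <= x) for n i.i.d. Uniform(0,1) draws."""
    max_hits = min_hits = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        max_hits += max(sample) <= x
        min_hits += min(sample) <= x
    return max_hits / trials, min_hits / trials

n, x = 5, 0.7
p_max, p_min = empirical_cdfs(n, x, 20000, random.Random(1))

exact_max = x**n               # F(x)^n
exact_min = 1 - (1 - x) ** n   # 1 - (1 - F(x))^n
```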
operation · Transformations and Jacobians
Changing variables reshapes density by the derivative.
Two canonical cases: $Y=X^2$ turns $N(0,1)$ into $\chi^2_1$; $Y=e^X$ turns a
normal into a log-normal.
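Both cases can be worked out from the change-of-variables rule. The following is a sketch, assuming $X\sim N(0,1)$ with density $\varphi$ for the square and $X\sim N(\mu,\sigma^2)$ for the exponential; the symbols $\varphi$, $\mu$, and $\sigma$ are introduced here for illustration. For $Y=g(X)$, sum over the preimages of $y$:
\[f_Y(y)=\sum_{x:\,g(x)=y} f_X(x)\left|\frac{dx}{dy}\right|.\]
For $Y=X^2$ the preimages are $x=\pm\sqrt y$, each with $|dx/dy|=1/(2\sqrt y)$, so
\[f_Y(y)=\bigl[\varphi(\sqrt y)+\varphi(-\sqrt y)\bigr]\frac{1}{2\sqrt y}=\frac{1}{\sqrt{2\pi y}}\,e^{-y/2},\qquad y>0,\]
which is the $\chi^2_1$ density, since $\Gamma(1/2)=\sqrt\pi$. For $Y=e^X$ the single preimage is $x=\ln y$ with $|dx/dy|=1/y$, so
\[f_Y(y)=\frac{1}{y\,\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(\ln y-\mu)^2}{2\sigma^2}\right),\qquad y>0.\]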
operation · Ratios
Division creates heavy tails.
The Cauchy distribution is what you get when you divide one standard normal
by another, independent one. Near-zero denominators create enormous ratios.
This is the ratio story behind the $t$ and $F$ sampling distributions too:
normalize by an estimated scale, and tail weight appears.
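A short simulation makes the heavy tail visible. For a standard Cauchy, $P(|X|\le1)=1/2$ and $P(|X|>10)=1-\tfrac{2}{\pi}\arctan 10\approx 0.063$, whereas a standard normal essentially never exceeds 10. The seed and sample size are arbitrary choices for this sketch.

```python
import math
import random

rng = random.Random(3)
ratios = []
for _ in range(20000):
    num, den = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    if den != 0.0:                 # guard against an exact zero denominator
        ratios.append(num / den)

frac_within_1 = sum(abs(r) <= 1 for r in ratios) / len(ratios)
frac_beyond_10 = sum(abs(r) > 10 for r in ratios) / len(ratios)
cauchy_tail_10 = 1 - (2 / math.pi) * math.atan(10)   # exact Cauchy value
```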
5. Sampling distributions from Gaussians
continuous · $\chi^2$
Sum squared standard normals.
If $Z_i\sim N(0,1)$ independently, then $\sum_{i=1}^k Z_i^2\sim\chi^2_k$.
It is the distribution of squared Gaussian length in $k$ dimensions.
PDF: $x^{k/2-1}e^{-x/2}/(2^{k/2}\Gamma(k/2))$ for $x>0$
CDF: $P(k/2,x/2)$, where $P$ is the regularized lower incomplete gamma function
CF: $(1-2it)^{-k/2}$
Mean / variance: $k$, $2k$
Fact: Gamma with shape $k/2$ and rate $1/2$.
continuous · Student's t
A normal divided by estimated scale.
If $Z\sim N(0,1)$ and $V\sim\chi^2_\nu$ are independent, then
$T=Z/\sqrt{V/\nu}$ has Student's $t_\nu$ distribution. As $\nu$ grows, the
denominator concentrates at 1 and $t_\nu$ converges to the standard Gaussian.
Special-function form; no compact elementary expression.
Mean / variance: $0$ for $\nu>1$; $\nu/(\nu-2)$ for $\nu>2$
Fact: Heavy-tailed because the scale is estimated.
continuous · F distribution
Ratio of scaled chi-squares.
If $U\sim\chi^2_{d_1}$ and $V\sim\chi^2_{d_2}$ independently, then
$(U/d_1)/(V/d_2)$ has an $F_{d_1,d_2}$ distribution. It appears in variance
comparisons and ANOVA-style ratios.
Special-function form; no compact elementary expression.
Mean / variance: $d_2/(d_2-2)$ for $d_2>2$; the variance is finite only for $d_2>4$
Fact: Ratio of independent variance estimates.
continuous · Inverse $\chi^2$ and scaled inverse $\chi^2$
One over a chi-square; the conjugate prior for normal variance.
If $X\sim\chi^2_\nu$ then $Y=1/X$ has the inverse chi-squared distribution
with $\nu$ degrees of freedom. The scaled inverse chi-squared,
$\mathrm{Scale}$-$\mathrm{Inv}\chi^2(\nu,\tau^2)$, is the distribution of $\nu\tau^2/X$ and is the
standard conjugate prior for the variance $\sigma^2$ of a normal with known
mean $\mu$: $n$ observations with mean squared deviation $s^2=\frac1n\sum_i(x_i-\mu)^2$
about the known mean update the prior to the posterior
$\mathrm{Scale}$-$\mathrm{Inv}\chi^2(\nu+n,\,(\nu\tau^2+n s^2)/(\nu+n))$.
For both unknown mean and unknown variance, the conjugate prior is
Normal–scaled-Inv-$\chi^2$ (equivalently Normal–Gamma on the precision,
via the $1/X$ edge). Marginalising out the variance turns the Gaussian
posterior predictive into Student's $t$ with degrees of freedom equal to the
posterior $\nu_n$: heavy tails when $n$ is small, Gaussian as $n\to\infty$.
This is the single derivation that ties $\chi^2$, Inv-$\chi^2$, and Student-$t$
together in Bayesian inference.
PDF: $\frac{2^{-\nu/2}}{\Gamma(\nu/2)}x^{-\nu/2-1}e^{-1/(2x)}$ for $x>0$
CDF: $Q(\nu/2, 1/(2x))$, where $Q$ is the regularized upper incomplete gamma function
Mean / variance: $1/(\nu-2)$ for $\nu>2$; $2/((\nu-2)^2(\nu-4))$ for $\nu>4$
Fact: Conjugate prior for the variance of a normal with known mean.
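The known-mean variance update is a two-line computation. This sketch follows the update rule stated in the card and assumes $s^2$ is the mean squared deviation about the known mean; the function name and the toy numbers are hypothetical.

```python
def scaled_inv_chi2_update(nu, tau2, data, known_mean):
    """Posterior (nu_n, tau2_n) for a normal variance with known mean.

    Assumes s2 is the mean squared deviation of the data about known_mean.
    """
    n = len(data)
    s2 = sum((x - known_mean) ** 2 for x in data) / n
    nu_n = nu + n                           # degrees of freedom add
    tau2_n = (nu * tau2 + n * s2) / nu_n    # weighted average of prior and data scale
    return nu_n, tau2_n

nu_n, tau2_n = scaled_inv_chi2_update(nu=4, tau2=1.0, data=[1.0, -1.0, 2.0, -2.0], known_mean=0.0)
```

In this toy case $s^2=2.5$, so the posterior scale $(4\cdot1+4\cdot2.5)/8=1.75$ sits between the prior scale and the data scale, weighted by $\nu$ and $n$.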
6. Conjugate prior–likelihood pairs
Conjugacy means the posterior stays in the same family as the prior. These pairs
are exact-update shortcuts, and reference points for
variational inference when exact updating is
not available.
Why conjugacy exists. All four pairs below share a single structural
property: the likelihood is an exponential family
$p(x\mid\theta) = h(x)\exp\bigl(\eta(\theta)\cdot T(x) - A(\theta)\bigr)$,
and the prior has the same exponential-family shape in $\theta$. Multiplying prior by
likelihood and absorbing the result back into the same form gives a posterior whose
natural parameter is just a shift in the sufficient-statistic direction. Concretely:
seeing data $x_1,\dots,x_n$ shifts the prior hyperparameter $\alpha$ by $\sum_i T(x_i)$.
Under regularity conditions (fixed support, smooth densities), exponential families are
exactly the families with finite-dimensional sufficient statistics.
The data influence the posterior only through the fixed-size summary $\sum_i T(x_i)$,
which is why the update is a simple parameter shift. Conjugacy covers the cases
where the posterior's tilt stays inside a finite-dimensional family. When it does not,
you reach for variational inference.
| Family | Sufficient statistic $T(x)$ | Natural parameter shift on update |
| --- | --- | --- |
| Bernoulli / Binomial | $x$ (count of successes) | $\alpha\to\alpha+s,\;\beta\to\beta+f$ ($s$ successes, $f$ failures) |
| Poisson | $x$ (count) | $\alpha\to\alpha+\sum_i x_i,\;\beta\to\beta+t$ ($t$ = total observation time, or the number of observations) |
| Normal (known $\sigma^2$) | $x$ | precision adds; mean is precision-weighted |
| Categorical / Multinomial | $(\mathbb 1_{x=k})_k$ | $\alpha_k\to\alpha_k+n_k$ |
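The Normal row is the least obvious of the four, so here is a minimal sketch of it. It assumes a $N(\mu_0,\tau_0^2)$ prior on the mean and i.i.d. observations with known variance $\sigma^2$; the function and argument names are made up for illustration.

```python
def normal_known_variance_update(mu0, tau0_sq, data, sigma_sq):
    """Posterior mean and variance for a normal mean with known observation variance."""
    prior_prec = 1.0 / tau0_sq
    data_prec = len(data) / sigma_sq        # n observations each contribute 1/sigma^2
    post_prec = prior_prec + data_prec      # precisions add
    xbar = sum(data) / len(data)
    post_mean = (prior_prec * mu0 + data_prec * xbar) / post_prec   # precision-weighted mean
    return post_mean, 1.0 / post_prec

post_mean, post_var = normal_known_variance_update(mu0=0.0, tau0_sq=1.0, data=[2.0, 4.0], sigma_sq=2.0)
```

With these toy numbers the prior and the data carry equal precision, so the posterior mean is halfway between the prior mean 0 and the sample mean 3.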
Named distributions as maximum-entropy answers. Most of these
distributions are not arbitrary mathematical objects; they are the
unique distributions that maximize entropy given a particular support and
moment constraint:
| Support | Constraints (beyond normalization) | Max-entropy law |
| --- | --- | --- |
| $[a, b]$ | none | Uniform$(a,b)$ |
| $[0, \infty)$ | fixed mean $1/\lambda$ | Exponential$(\lambda)$ |
| $\mathbb{R}$ | fixed mean $\mu$, variance $\sigma^2$ | Normal$(\mu, \sigma^2)$ |
| $\mathbb{R}$ | fixed mean $\mu$, mean abs deviation $\mathbb E\lvert X-\mu\rvert=b$ | Laplace$(\mu, b)$ |
| $\{0,1,\dots,N\}$ | fixed mean | Truncated geometric, $p_k\propto e^{\lambda k}$ |
| $\{0,1,2,\dots\}$ | fixed mean $\mu$ | Geometric, $q = \mu/(\mu+1)$ |
| $\{0,1,2,\dots\}$ | fixed mean & variance | Discrete Gaussian-type, $p_k\propto\exp(\lambda_1 k+\lambda_2 k^2)$ |
| $\mathbb{R}^d$ | fixed mean & covariance | Multivariate Normal |
| $(K-1)$-simplex | fixed $\mathbb E[\log x_k]$ for each $k$ | Dirichlet$(\alpha_1,\dots,\alpha_K)$ |
The pattern is the same in every row: write down the Lagrangian
$-\int q\log q + \sum_i \lambda_i(\int T_i q - c_i)$ (plus a multiplier for normalization),
take the variation, and the stationary $q^*\propto\exp(\sum_i\lambda_i T_i)$ is exactly an
exponential family with $T_i$ as sufficient statistics. See the max-entropy
interactive on the Fisher-information page to step through the constraints and
watch the family member emerge, and the Legendre-duality
section for why this construction is forced by the geometry of $\log\sum e^{\eta T}$.
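The stationary form can be tested numerically on the finite support $\{0,\dots,N\}$ with a fixed mean, where $q^*_k\propto e^{\lambda k}$. The sketch below solves for $\lambda$ by bisection and compares the entropy with a Binomial of the same mean, used here only as a reference point; the function names, bracket, and target values are assumptions for illustration.

```python
import math
from math import comb

def tilted_pmf(lam, N):
    """q_k proportional to exp(lam * k) on {0, ..., N}."""
    weights = [math.exp(lam * k) for k in range(N + 1)]
    z = sum(weights)
    return [w / z for w in weights]

def mean_of(pmf):
    return sum(k * q for k, q in enumerate(pmf))

def entropy(pmf):
    return -sum(q * math.log(q) for q in pmf if q > 0)

def solve_maxent(target_mean, N, lo=-20.0, hi=20.0):
    """Bisection on lam; the mean of the tilted PMF is increasing in lam."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if mean_of(tilted_pmf(mid, N)) < target_mean:
            lo = mid
        else:
            hi = mid
    return tilted_pmf((lo + hi) / 2, N)

N, target = 10, 3.0
maxent = solve_maxent(target, N)
binomial = [comb(N, k) * 0.3**k * 0.7 ** (N - k) for k in range(N + 1)]  # same mean, N * 0.3 = 3
```

Both distributions satisfy the mean constraint, but the exponentially tilted one should have the larger entropy.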
bayes · Beta-Binomial
Prior over a Bernoulli/binomial probability. Successes add to $\alpha$;
failures add to $\beta$.
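The update rule is short enough to write out in full. This is a minimal sketch assuming a $\mathrm{Beta}(\alpha,\beta)$ prior; the function names and the example counts are illustrative.

```python
def beta_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing the given counts."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

# Uniform Beta(1, 1) prior, then 7 successes and 3 failures.
post_alpha, post_beta = beta_update(1, 1, successes=7, failures=3)
posterior_mean = beta_mean(post_alpha, post_beta)
```

The posterior mean $8/12$ lies between the prior mean $1/2$ and the observed frequency $7/10$.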
The Pareto tail has $P(X>x)\propto x^{-\alpha}$. On log-log axes it becomes
a straight line, while exponential and Gaussian tails curve downward much faster.
Smaller $\alpha$ means heavier tails and fewer finite moments.
Cauchy has no mean or variance. Pareto has finite moments only for orders
below $\alpha$: mean needs $\alpha>1$, variance needs $\alpha>2$.
PDF: $\alpha x_m^\alpha/x^{\alpha+1}$ for $x\ge x_m$
CDF: $1-(x_m/x)^\alpha$ for $x\ge x_m$
CF: Special-function form; no compact elementary expression.
Mean: $\alpha x_m/(\alpha-1)$ for $\alpha>1$
Variance: $\alpha x_m^2/((\alpha-1)^2(\alpha-2))$ for $\alpha>2$
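The straight line on log-log axes follows from the survival function, and Pareto draws come from inverting the CDF above. The sketch below assumes illustrative values of $\alpha$ and $x_m$; the function names and seed are hypothetical.

```python
import math
import random
import statistics

def survival(x, alpha, x_m):
    """P(X > x) for x >= x_m."""
    return (x_m / x) ** alpha

def sample_pareto(alpha, x_m, rng):
    """Inverse-CDF draw: solve u = 1 - (x_m/x)^alpha for x."""
    u = rng.random()
    return x_m * (1.0 - u) ** (-1.0 / alpha)

alpha, x_m = 2.5, 1.0

# Slope of log P(X > x) against log x between two points: should equal -alpha.
slope = (math.log(survival(100.0, alpha, x_m)) - math.log(survival(10.0, alpha, x_m))) / (
    math.log(100.0) - math.log(10.0)
)

rng = random.Random(5)
draws = [sample_pareto(alpha, x_m, rng) for _ in range(20001)]
sample_median = statistics.median(draws)
exact_median = x_m * 2 ** (1 / alpha)   # solves survival(x) = 1/2
```

The median is used for the check because it exists for every $\alpha>0$, unlike the mean and variance.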
8. Decision table
Start from what you are modeling; pick the distribution whose construction matches
that story.