Monte Carlo & MCMC
Bayesian computation centres on two integrals: the posterior normalizer $p(y) = \int p(y\mid\theta)p(\theta)\,d\theta$, and a posterior expectation $\mathbb{E}_{p(\theta\mid y)}[f(\theta)] = \int f(\theta)\,p(\theta\mid y)\,d\theta$. Neither has a closed form for most models of interest. Monte Carlo methods replace the integrals with averages over samples, either drawn directly (i.i.d. samplers) or drawn from a Markov chain whose stationary distribution is the target (MCMC). The sequential / state-space case (sequential Monte Carlo, particle filters, Kalman family) is a different problem shape and gets its own page.
1. Sampling methods estimate expectations under a target
Every method below evaluates an expectation against some target $\pi(x)$, typically a Bayesian posterior $\pi(x) = p(\theta\mid y)$, possibly known only up to a constant. They differ in three axes:
- How samples are produced. Direct i.i.d. draws (inverse CDF, rejection), reweighted draws from a tractable proposal (importance sampling), or draws from a Markov chain converging to $\pi$ (Metropolis-Hastings, Gibbs, RJMCMC).
- What they require of $\pi$. Inverse CDF needs the CDF; rejection needs a global envelope; importance sampling needs only pointwise $\pi(x)$; MCMC needs only ratios $\pi(x')/\pi(x)$; Gibbs needs tractable full conditionals.
- What state space they're built for. Static $\theta$ (most posteriors) or variable-dimension $(k,\theta_k)$ (RJMCMC for model selection). For time-evolving latent state $x_{0:T}$, see state-space filtering.
The decision table at the end reduces the choice to one view.
| Structure | Sampling idea | Use when |
|---|---|---|
| Static target | Direct / rejection | Low-dimensional target with an envelope or inverse transform. |
| Weighted target | Importance sampling | A tractable proposal covers the target and the goal is an expectation. |
| Markov target | MCMC | Only unnormalized target ratios are available, often in high dimension. |
| State over time | Sequential Monte Carlo | Latent state evolves and observations arrive one time step at a time. See state-space filtering. |
2. I.i.d. methods
exacti.i.d. Inverse-CDF sampling
The cleanest case. If $X$ has CDF $F$, then $F(X)\sim\mathcal{U}[0,1]$, so $X = F^{-1}(U)$ for $U\sim\mathcal{U}[0,1]$ recovers exact i.i.d. samples.
$$ X = F^{-1}(U),\qquad U\sim\mathcal{U}[0,1]. $$It is exact and free of rejection but useless beyond 1-D and beyond targets whose CDF has an invertible closed form. It is the baseline that motivates everything else.
exacti.i.d. Rejection sampling
Pick a tractable proposal $q$ and a constant $M$ with $M\,q(x)\geq \pi(x)$ for all $x$. Draw $x\sim q$, $u\sim\mathcal{U}[0,1]$, and accept if $u \lt \pi(x)/(M\,q(x))$. Accepted samples are exact draws from $\pi$.
- Draw $x \sim q$.
- Draw $u \sim \mathcal{U}[0,1]$.
- If $u \lt \pi(x) / (M\,q(x))$, accept $x$; else reject and repeat.
Acceptance rate is $1/M$. In high dimensions $M$ blows up exponentially, because the gap between any plausible $q$ and a peaked $\pi$ widens fast in volume, so rejection is unusable past a handful of dimensions.
i.i.d. Importance sampling (IS)
Never reject. Reweight every sample from $q$ by the ratio $w(x) = \pi(x)/q(x)$:
$$ \mathbb{E}_\pi[f(X)] \approx \frac{1}{N}\sum_{i=1}^N w(x^{(i)})\,f(x^{(i)}),\qquad x^{(i)}\sim q. $$When $\pi$ is unnormalized (Bayesian posteriors typically are), use self-normalized IS:
$$ \tilde w^{(i)} = \frac{w(x^{(i)})}{\sum_j w(x^{(j)})},\qquad \hat{\mathbb{E}}_\pi[f] = \sum_i \tilde w^{(i)} f(x^{(i)}). $$How good is this estimator? The samples come from $q$, so the variance is taken under $q$:
$$ \begin{aligned} \operatorname{Var}_q\bigl[f(X)\,w(X)\bigr] &= \mathbb{E}_q\!\bigl[\,f(X)^2 w(X)^2\,\bigr] - \mathbb{E}_q\!\bigl[\,f(X)\,w(X)\,\bigr]^2 \\ &= \mathbb{E}_q\!\bigl[\,f(X)^2 w(X)^2\,\bigr] - \mathbb{E}_p\!\bigl[\,f(X)\,\bigr]^2 \\ &= \mathbb{E}_q\!\bigl[\,f(X)^2 w(X)^2\,\bigr] - \mu_f^2. \end{aligned} $$The bridge from line 1 to line 2 is the unbiasedness identity that motivated IS in the first place: $\mathbb E_q[fw] = \int f(x)\,\tfrac{\pi(x)}{q(x)}\,q(x)\,dx = \mathbb E_p[f(X)] = \mu_f$. The second term doesn't depend on $q$; it's a property of the problem.
Two things follow immediately:
- The IS estimator is unbiased for any $q$ with support covering $\pi$. The choice of $q$ doesn't shift the mean — only the variance.
- The only $q$-dependent term is $\mathbb E_q[f^2 w^2]$. Since $w(x) = \pi(x)/q(x)$, the dangerous factor is $w(x)^2 = (\pi(x)/q(x))^2$: variance blows up wherever $q$ is small relative to $\pi$ in a region where $f$ is not also small. The squared weight is the explosion mechanism.
Minimising $\mathbb E_q[f^2 w^2]$ over $q$ subject to $\int q = 1$ gives the optimal proposal
$$ q^*(x) \;\propto\; |f(x)|\,\pi(x), $$which is exactly the thing you can't sample from — it requires the integral $\int |f|\pi$, the object closest to what you were trying to compute. What this says in practice is direct samples toward where $|f|\pi$ is large. For computing the evidence ($f \equiv 1$) the optimal proposal is $q^* = \pi$ — exact samples from the target. For a tail-event probability ($f$ an indicator), $q^*$ shifts mass to the tail, which is the design principle behind rare-event simulation.
The operational shortcut, since $q^*$ is unavailable: choose $q$ heavier-tailed than $\pi$ (Student-$t$ over Gaussian, Laplace over Normal) so the weight ratio $\pi/q$ stays bounded in the tails. Then monitor the effective sample size $\mathrm{ESS} = \bigl(\sum_i \tilde w^{(i)}\bigr)^2 \big/ \sum_i \bigl(\tilde w^{(i)}\bigr)^2$. If it collapses to $\ll N$, your weights are degenerate: the estimate is dominated by one or two samples and the variance is effectively the variance of the largest weight times $f^2$.
Figure 2 lets you compare rejection and importance sampling on the same problem. Pick a target $\pi$ and a proposal $q$ (family, center, scale, and for Student-$t$ the degrees of freedom $\nu$). The top row plots each method's draws; the ESS bar shows how many useful samples per $N$ draws each method extracts; the empirical-fit strip overlays the resulting histograms on $\pi$ so you can see which method actually reconstructs the target. The preset row sweeps through the regimes where the two methods agree, disagree, or both fail.
3. MCMC
Once $\pi$ is multivariate and peaked, both rejection (acceptance rate $\to 0$) and importance sampling (weight variance $\to \infty$) collapse. MCMC trades i.i.d. samples for correlated samples from a Markov chain $X_0, X_1, X_2, \dots$ whose stationary distribution is $\pi$. The chain is constructed to satisfy detailed balance:
$$ \pi(x)\,T(x\to x') \;=\; \pi(x')\,T(x'\to x), $$which is sufficient for $\pi$ to be invariant under $T$. Different MCMC algorithms are different ways to engineer such a $T$.
The MCMC story is a sequence of relaxations. Each step removes one limitation, but it adds a new requirement or a new way to fail.
| Step | What changes | Requirement | What can fail |
|---|---|---|---|
| Metropolis | Replace i.i.d. draws with a Markov chain and accept/reject local proposals. | Symmetric proposal $q(x'\mid x)=q(x\mid x')$. | Proposal scale can make the chain crawl or reject almost everything. |
| Metropolis-Hastings | Allow asymmetric or independence proposals. | Evaluate the proposal correction $q(x\mid x')/q(x'\mid x)$. | A bad proposal still gives sticky chains or mode blindness. |
| Gibbs | Replace accept/reject proposals with exact coordinate conditional draws. | Closed-form full conditionals, or MH-within-Gibbs substeps. | Strong posterior correlation makes axis moves mix slowly. |
| RJMCMC | Let the chain jump between parameter spaces with different dimensions. | Dimension-matched reversible moves and Jacobian terms. | Birth/death move design is problem-specific and can mix poorly. |
Figure 3 puts the MCMC variants on the same toy state space. The tabs keep the Markov-chain skeleton fixed where possible, then add the extra mechanism that each variant needs: rejection self-loops, Hastings proposal correction, coordinate-wise conditionals, birth/death dimension moves, or gradient-guided proposals. The RJMCMC tab shares the 1-D state space with the others for visual parallelism — the states there stand in for model orders $k$, not for the within-model parameters. The genuine two-space picture (within-model MH on each $k$, plus dimension-jumping moves between them) is in Figure 13.
mcmc Metropolis MCMC (1953)
Metropolis MCMC is the symmetric-proposal member of the Metropolis-Hastings family. Use a proposal with $q(x'\mid x) = q(x\mid x')$, propose $x'\sim q(\cdot\mid x)$, and accept with probability
$$ \alpha(x,x') = \min\!\left\{1,\;\frac{\pi(x')}{\pi(x)}\right\}. $$The symmetric $q$ cancels, so only the target-density ratio remains. Typical choice: random walk $x' = x + \epsilon$, $\epsilon\sim\mathcal{N}(0,\Sigma_\text{prop})$. The step size $\Sigma_\text{prop}$ is the one tuning knob and it matters: too small and the chain crawls; too large and nearly every proposal is rejected.
mcmc Metropolis-Hastings (1970)
Generalizes Metropolis to asymmetric $q$ by adding a correction:
$$ \alpha(x,x') = \min\!\left\{1,\;\frac{\pi(x')\,q(x\mid x')}{\pi(x)\,q(x'\mid x)}\right\}. $$- Propose $x' \sim q(\cdot \mid x)$.
- Compute $\alpha = \min\bigl\{1, \frac{\pi(x')\,q(x \mid x')}{\pi(x)\,q(x' \mid x)}\bigr\}$.
- Draw $u \sim \mathcal{U}[0,1]$. If $u \lt \alpha$, set $x \leftarrow x'$; else keep $x$.
Variants worth naming:
- Random-walk MH: $q$ symmetric; ratio collapses to $\pi(x')/\pi(x)$. Default in most papers.
- Independent MH: $q(x'\mid x) = q(x')$. Equivalent to importance sampling with hard accept/reject; works only if $q$ is genuinely a good global approximation of $\pi$.
- Langevin / HMC: $q$ uses gradient information $\nabla\log\pi$ to bias proposals toward high-density regions. Much faster in high dimensions when gradients are available.
Why that $\alpha$? Because it is exactly the factor that forces detailed balance. Write the transition as $T(x\to x') = q(x'\mid x)\,\alpha(x,x')$. The proposal alone is unbalanced: the two flows $\pi(x)\,q(x'\mid x)$ and $\pi(x')\,q(x\mid x')$ generally differ. One direction is over-full, and its $\alpha$ is the ratio that scales it down while the reverse direction's $\alpha$ is $1$. Either way both realized flows land on $\min\{\pi(x)\,q(x'\mid x),\;\pi(x')\,q(x\mid x')\}$, so $\pi(x)\,T(x\to x') = \pi(x')\,T(x'\to x)$ — detailed balance. Figure 4 lets you push the target and proposal ratios around: the proposal flow tilts, $\alpha$ compensates, and the realized flow stays balanced no matter what.
Figure 5 runs a random-walk MH on a 2-D banana-shaped target and lets you watch how step size trades off acceptance rate against mixing. Pause and use Step to see one proposal at a time: the dashed ellipse is the proposal region (±σ around the current state), and the line shows where the last proposal landed and whether it was accepted.
The BNN posterior target swaps the banana for the two-parameter $f(x; w_1, w_2) = w_1\tanh(w_2 x)$ posterior used on the Bayesian neural networks page. The model has a sign symmetry, so the energy surface has two basins of equal depth. Watch HMC commit to one basin while random-walk MH stays trapped in whichever it started in — switching basins requires crossing a high-energy ridge that local proposals rarely traverse.
One caveat the per-step view hides: cost. An HMC step here spends $L$ gradient evaluations on its leapfrog trajectory (the leapfrog steps slider) where random-walk MH spends none. Per accepted step HMC wins easily; per gradient evaluation the gap narrows, and the honest comparison is mixing per unit of compute — which is why the strong-but-expensive proposal still has to earn its keep against the cheap-but-slow one.
For a live MALA sampler on this same banana target, see chi-feng's MCMC demo. MALA — the Metropolis-adjusted Langevin algorithm — is HMC with a single leapfrog step: one gradient-guided Langevin proposal, then an accept/reject correction. It sits between random-walk MH and full HMC.
Figure 8 makes MALA's anatomy explicit. Every proposed step on the banana target $x' = x + d + s$ splits into a deterministic drift $d = \tfrac{\epsilon^2}{2}\nabla\log\pi(x)$ that climbs the gradient and a random diffusion kick $s = \epsilon\,z$ with $z\sim\mathcal{N}(0, I)$. Because the proposal is asymmetric, the accept ratio carries both forward and reverse Langevin densities. Watch the acceptance gauge: MALA mixes best near a rate of $0.574$, so the gauge doubles as a step-size tuning instrument.
MALA is the corrected discretization of a continuous process. The Langevin diffusion $d\theta = \tfrac12\nabla\log\pi(\theta)\,dt + dB$ has $\pi$ as its exact stationary distribution. Take a finite Euler step of size $\epsilon$ and you get the unadjusted Langevin algorithm (ULA): fast, but the discretization biases the stationary distribution away from $\pi$. MALA removes that bias with the Metropolis accept/reject step. Toggle the correction off to run ULA. The histogram strip tracks the chain's $x$-marginal against the true $\mathcal{N}(0,4)$: at large $\epsilon$ the uncorrected chain visibly settles to the wrong distribution while MALA stays on target.
Figure 9 shows the same target as a surface. Drag to rotate it. The ridge shape makes the random-walk tuning problem visible: local proposals can slide along the ridge, but large proposals jump off the high-density mass.
mcmc Markov topology and stationary mass
The Markov chain is not just a path. It is a transition graph over the state space: each state sends probability mass to nearby proposals, then the accept/reject rule changes how much of that mass actually moves. A good chain has local moves that still let empirical occupation mass converge to the target mass. The mode buttons switch the state-space layout: a 1-D line (axis-aligned random walk on a bimodal target), a 2-D grid (Moore-neighborhood random walk on a two-bump target), and a ring with a barrier between two clusters of mass — each exposes a different mixing failure mode.
mcmc Gibbs sampling
A special case of MH where the proposal is the full conditional of one coordinate at a time. With this proposal, the acceptance probability is identically 1.
$$ x_i^{(t+1)} \sim p\!\left(x_i \,\middle|\, x_1^{(t+1)},\ldots,x_{i-1}^{(t+1)},\,x_{i+1}^{(t)},\ldots,x_d^{(t)}\right). $$Gibbs is the workhorse of Bayesian hierarchical modeling because conjugate priors give closed-form full conditionals, so every update is a draw from a named distribution. It mixes poorly along directions of high posterior correlation (because it can only move along axes); a common cure is to update blocks of variables jointly, or to reparametrize. If a conditional is intractable, replace that step with an MH sub-step. This is Metropolis-within-Gibbs.
Figure 11 shows why Gibbs feels different from a random walk: it moves by exact conditional draws along coordinate axes. That makes every proposal accepted, but strong posterior correlation can still make the chain advance slowly.
mcmc Reversible-jump MCMC (RJMCMC)
What if the model order itself is unknown: number of mixture components, number of change-points, number of hidden layers? Then the state space is a disjoint union $\bigsqcup_k \{k\}\times\mathbb{R}^{n_k}$ of spaces of different dimension, and standard MH doesn't apply because $\pi(x')/\pi(x)$ is comparing densities in different-dimensional spaces.
Green's RJMCMC fixes this with a dimension-matching auxiliary variable. To move from $(k,\theta_k)$ to $(k',\theta_{k'})$ with $n_{k'}>n_k$, draw auxiliary $u\sim g$ and apply a diffeomorphism $\theta_{k'} = T(\theta_k, u)$. The acceptance ratio carries a Jacobian:
$$ \alpha = \min\!\left\{1,\; \frac{\pi(k',\theta_{k'})\,g'(u')\,r(k'\to k)}{\pi(k,\theta_k)\,g(u)\,r(k\to k')}\left|\frac{\partial(\theta_{k'},u')}{\partial(\theta_k,u)}\right|\right\}. $$The classic example is a mixture model with a birth/death pair: birth splits one component into two ($\theta_1=\theta+u,\theta_2=\theta-u$, Jacobian $=2$); death merges two into one. Recent work uses RJMCMC inside simulated annealing to choose neural network architectures jointly with weights.
Figure 13 puts the split/merge map inside a Markov chain. The left panel runs within-model Metropolis-Hastings on the $k=1$ parameter $\mu$; the right panel runs MH on the $k=2$ parameters $(\mu_1, \mu_2)$. With probability $p_\text{rj}$ the chain proposes a between-model move instead: a birth $\mu \mapsto (\mu+u, \mu-u)$ with $u\sim g$, or a death that merges $(\mu_1, \mu_2)$ to their midpoint. The acceptance ratio for each between-model move carries the Jacobian factor $2$ from the diffeomorphism. The model-order bar below the panels shows the chain's marginal occupation of $\{k=1, k=2\}$.
mcmc Simulated annealing
Same MH machinery, opposite goal. Fix a target $\pi(x) \propto e^{-U(x)}$, but instead of sampling from $\pi$, find its peak (the global minimum of $U$). Introduce a temperature $T$ and sample from a tempered distribution $\pi_T(x) \propto e^{-U(x)/T}$. The MH accept ratio for a symmetric proposal is
$$ r \;=\; \exp\!\bigl(-(U(x^*) - U(x))/T\bigr). $$At $T=1$ this is ordinary MH on $\pi$. At $T \to \infty$ every proposal is accepted: the chain ignores the landscape and explores freely. At $T \to 0$ only downhill moves are accepted: the chain becomes greedy. Simulated annealing starts hot and cools: $T_t$ follows a schedule like the logarithmic $T_t = T_0/\log(1 + t)$ (slow enough to escape local minima; Geman & Geman, 1984) or the more practical exponential $T_t = T_0\,\alpha^t$ with $\alpha$ just below $1$.
Information-theory aside. The tempered family $\pi_T \propto e^{-U/T}$ is the Gibbs distribution at temperature $T$, with free energy $F(T) = -T \log Z(T)$. SA can be read as free-energy descent along the path of measures connecting the uniform distribution ($T \to \infty$) to a delta on the global minimum ($T \to 0$). The cooling schedule controls how much entropy the chain is allowed to keep at each step, the same shape as the entropy-regularized argmax that turns up in reinforcement learning and rate-distortion.
Figure 13b runs both walkers on the same triple-well $U(x)$: a fixed-$T$ MH chain that samples $\pi$ (it visits all three wells in proportion to their depth), and an SA chain that cools exponentially and ends up parked in the global minimum. Watch SA's acceptance rate plummet as $T \to 0$, and the rare hot-start jumps that let it escape the shallow wells before cooling makes that impossible.
4. Shared algorithm structure
These methods produce the same kind of output (an estimate of $\mathbb E_\pi[f]$ or samples from $\pi$) but differ in what they ask of $\pi$, how they generate candidate states, and how they update. Lining the algorithms up row by row makes the family resemblances and the algorithm-specific moves visible at a glance.
| Rejection | Importance sampling | Metropolis | Metropolis–Hastings | Gibbs | HMC | |
|---|---|---|---|---|---|---|
| Family | i.i.d., exact | i.i.d., weighted | MCMC | MCMC | MCMC | MCMC |
| Requires | envelope $M$ with $M\,q(x) \ge \pi(x)$ | proposal $q$ with $\mathrm{supp}\,q \supseteq \mathrm{supp}\,\pi$ | $\pi$ pointwise up to a constant | $\pi$ pointwise; proposal $q(x'\mid x)$ | tractable full conditionals $\pi(x_k\mid x_{-k})$ | $\pi$ and $\nabla \log \pi$ |
| Proposal | $x \sim q(\cdot)$ | $x_i \sim q(\cdot),\ i = 1\ldots N$ | $x' = x + \varepsilon,\ \varepsilon$ symmetric | $x' \sim q(\cdot \mid x)$ | cycle $k$: $x'_k \sim \pi(x_k\mid x_{-k})$ | leapfrog $L$ steps from $(x, p)$ |
| Accept / weight | $u < \dfrac{\pi(x)}{M\,q(x)}$ | $w_i = \dfrac{\pi(x_i)}{q(x_i)}$ | $u < \min\!\Bigl(1, \dfrac{\pi(x')}{\pi(x)}\Bigr)$ | $u < \min\!\Bigl(1, \dfrac{\pi(x')\,q(x\mid x')}{\pi(x)\,q(x'\mid x)}\Bigr)$ | always (exact draw) | $u < \min\!\bigl(1,\, e^{H(x,p) - H(x',p')}\bigr)$ |
| State update | keep $x$ if accepted; else discard and resample | no state; reweight each draw | $x \leftarrow x'$ if accepted; else stay | $x \leftarrow x'$ if accepted; else stay | $x_k \leftarrow x'_k$ for each coordinate | $x \leftarrow x'$ if accepted; resample momentum $p \sim \mathcal N(0, I)$ |
| Output | i.i.d. samples from $\pi$ | $\hat{\mathbb E}_\pi[f] = \sum_i \tilde w_i\,f(x_i),\ \tilde w_i \propto w_i$ | correlated chain $X_0, X_1, \ldots \to \pi$ | correlated chain $X_0, X_1, \ldots \to \pi$ | correlated chain $X_0, X_1, \ldots \to \pi$ | correlated chain with longer effective step |
| Main failure mode | acceptance rate $1/M$ collapses exponentially in $d$ | weight degeneracy when $q$ poorly covers $\pi$ or in high $d$ | symmetric-proposal scale fragility | poor $q$ ⇒ sticky chain or mode-trapping | strong coordinate correlation ⇒ slow mixing | step size and trajectory length need tuning; divergences |
Reading across the Accept / weight row is where the family shows itself most clearly: rejection accepts or discards; IS accepts every sample but reweights; Metropolis and MH accept stochastically by a target- density ratio (with MH adding the proposal correction $q(x\mid x')/q(x'\mid x)$ that the symmetric Metropolis case eliminates); Gibbs accepts always but restricts the proposal to one-coordinate conditional draws; HMC accepts by an energy ratio rather than a density ratio, because its proposal is a trajectory rather than a point. Each row in this table is a single design decision; each method is one combination.
Two methods that share a row stay in the same family of failure modes. Reading Main failure mode shows why: high-$d$ dimension hurts rejection and IS for the same reason (geometric weight-mass concentration in the proposal-vs-target ratio), while MH and Gibbs share the symptom slow mixing but for different reasons (proposal-scale tuning vs. posterior anisotropy in coordinate-aligned axes).
5. Pair by pair: what changes between methods
The table above shows the design space at a glance. The four cards below zoom in on the single-step changes: this turned into that by swapping one line of the inner loop. Pseudocode for each pair shares the same outer structure, so the differences are confined to the highlighted lines.
Rejection ↔ Importance sampling
Both draw from $q$. Rejection discards mismatched samples; IS keeps every sample with a weight.
samples = []
while |samples| < N:
x ~ q(·)
u ~ Uniform(0, 1)
if u < π(x) / (M·q(x)):
samples.append(x)
return samples # i.i.d. from π
samples = []; weights = []
for i = 1 to N:
x ~ q(·)
w = π(x) / q(x)
samples.append(x)
weights.append(w)
return Σᵢ w̃ᵢ·f(xᵢ) # w̃ ∝ w
The accept-or-discard test (left) becomes a weight assignment (right). Rejection wastes a fraction $1 - 1/M$ of draws but returns honest i.i.d. samples; IS wastes none but returns a weighted estimator whose variance depends on how well $q$ covers $\pi$ (see §2 for the variance algebra).
Metropolis ↔ Metropolis–Hastings
Asymmetric proposals need a correction factor in the acceptance ratio.
x = x₀
chain = [x]
for t = 1 to T:
x' = x + ε, ε symmetric
u ~ Uniform(0, 1)
if u < min(1, π(x') / π(x)):
x = x'
chain.append(x)
return chain
x = x₀
chain = [x]
for t = 1 to T:
x' ~ q(·|x)
u ~ Uniform(0, 1)
if u < min(1, π(x')·q(x|x') /
(π(x)·q(x'|x))):
x = x'
chain.append(x)
return chain
Two lines change. The proposal becomes a general $q(\cdot \mid x)$ instead of a symmetric jump, and the accept test picks up the Hastings correction $q(x \mid x')/q(x' \mid x)$ that compensates for proposal asymmetry. Metropolis is MH with $q$ symmetric; the correction equals 1 and cancels.
Metropolis–Hastings ↔ Gibbs
Replace "propose then accept/reject" with "draw exactly from a full conditional, always accept."
x = x₀
chain = [x]
for t = 1 to T:
x' ~ q(·|x)
u ~ Uniform(0, 1)
if u < min(1, π(x')·q(x|x') /
(π(x)·q(x'|x))):
x = x'
chain.append(x)
return chain
x = x₀
chain = [x]
for t = 1 to T:
for k = 1 to d:
x_k ~ π(x_k | x_{-k})
chain.append(x)
return chain
Gibbs replaces the entire propose/accept block with a coordinate cycle of exact conditional draws. The acceptance probability is identically $1$ because each conditional draw is from the right distribution by construction. The price: you need the full conditionals $\pi(x_k \mid x_{-k})$ in closed form. When you don't, MH-within-Gibbs replaces one or more of these conditional draws with a local MH step.
Metropolis–Hastings ↔ HMC
Replace random-walk proposals with a Hamiltonian trajectory; replace the density ratio with an energy ratio.
x = x₀
chain = [x]
for t = 1 to T:
x' = x + ε, ε ~ N(0, Σ)
u ~ Uniform(0, 1)
if u < min(1, π(x') / π(x)):
x = x'
chain.append(x)
return chain
x = x₀
chain = [x]
for t = 1 to T:
p ~ N(0, I)
(x', p') = leapfrog(x, p, ε, L)
# uses ∇log π
u ~ Uniform(0, 1)
if u < min(1, exp(H(x,p) - H(x',p'))):
x = x'
chain.append(x)
return chain
The proposal step changes from a local random jump to a deterministic leapfrog trajectory of length $L\varepsilon$ using $\nabla\log\pi$. The accept test changes from a density ratio to an energy ratio $\exp(H - H')$ where $H(x, p) = -\log\pi(x) + \tfrac12\|p\|^2$. The leapfrog integrator is reversible and volume-preserving, which is what makes the energy-ratio test correct. HMC trades the need for gradients for much longer effective steps and much better mixing in high $d$.
6. Method choice by target
The decision table summarizes static-target methods. Tags echo the section colors: exact means returns-true-samples-from-$\pi$, i.i.d. means draws are independent, mcmc means correlated chain samples. For state-space targets (Kalman, EKF, UKF, particle filter), see Kalman & Particle Filters.
| Method | Use when | Strengths | Pitfalls |
|---|---|---|---|
| Inverse CDF exacti.i.d. | Univariate target with invertible CDF (exponential, Cauchy, custom 1-D). | Exact, zero waste. | Useless beyond 1-D or without closed-form $F^{-1}$. |
| Rejection exacti.i.d. | Low-dimensional target with a known envelope $M\,q\geq\pi$. | Exact i.i.d. samples; simple to implement. | Acceptance rate $1/M$ collapses exponentially in dimension. |
| Importance sampling i.i.d. | Computing expectations under an unnormalized target; building block inside SMC. | Never rejects; self-normalized version handles unnormalized $\pi$; flexible $q$. | Weight degeneracy if $q$ poorly covers $\pi$ or in high $d$; monitor ESS. |
| Metropolis mcmc | Static posterior where a symmetric random-walk proposal is adequate. | Simpler acceptance ratio; needs only pointwise unnormalized $\pi$. | A special case of MH; proposal scale still controls mixing and rejection. |
| Metropolis-Hastings mcmc | General-purpose static posterior with no special structure; only ratios $\pi(x')/\pi(x)$ available. | Asymptotically exact; needs only pointwise unnormalized $\pi$. | Tuning the proposal scale is fiddly; mixes poorly in high $d$ or multimodal $\pi$. |
| Gibbs sampling mcmc | Hierarchical/conjugate models where every full conditional has closed form. | Zero tuning; acceptance rate 1; natural for graphical models. | Slow when variables are correlated; needs tractable conditionals (else MH-within-Gibbs). |
| RJMCMC mcmc | Joint inference over model order and parameters (mixture components, change-points, NN architecture). | Asymptotically exact across model space; principled Bayesian model selection. | Move design and Jacobians are problem-specific; mixing across dimensions is hard. |
- 1-D target with an invertible CDF? → inverse CDF.
- Low-D target with a known envelope $M\,q\geq\pi$? → rejection.
- Want an expectation under an unnormalized target with a reasonable proposal $q$? → importance sampling.
- High-D static target, only ratios of $\pi$? → symmetric proposal: Metropolis; asymmetric: Metropolis-Hastings. Conjugate conditionals available? → Gibbs. Gradients available? → HMC/Langevin.
- Model order itself is unknown? → RJMCMC.
- Time-evolving hidden state? → see Kalman & Particle Filters.
What next
Sampling connects most directly to change of measure, approximate inference, dependence testing, and state-space filtering.