Monte Carlo & MCMC

Rejection, importance sampling, Metropolis-Hastings, Gibbs, RJMCMC, and simulated annealing: what they share, where they differ, which to reach for, and what their probability mass is doing. State-space filtering — Kalman and particle filters — lives on its own page.

Bayesian computation centres on two integrals: the posterior normalizer $p(y) = \int p(y\mid\theta)p(\theta)\,d\theta$, and a posterior expectation $\mathbb{E}_{p(\theta\mid y)}[f(\theta)] = \int f(\theta)\,p(\theta\mid y)\,d\theta$. Neither has a closed form for most models of interest. Monte Carlo methods replace the integrals with averages over samples, either drawn directly (i.i.d. samplers) or drawn from a Markov chain whose stationary distribution is the target (MCMC). The sequential / state-space case (sequential Monte Carlo, particle filters, Kalman family) is a different problem shape and gets its own page.

1. Sampling methods estimate expectations under a target

Every method below evaluates an expectation against some target $\pi(x)$, typically a Bayesian posterior $\pi(x) = p(\theta\mid y)$, possibly known only up to a constant. They differ in three axes:

The decision table at the end reduces the choice to one view.

Figure 1 · How the sampler creates probability mass
Structure Sampling idea Use when
Static target Direct / rejection Low-dimensional target with an envelope or inverse transform.
Weighted target Importance sampling A tractable proposal covers the target and the goal is an expectation.
Markov target MCMC Only unnormalized target ratios are available, often in high dimension.
State over time Sequential Monte Carlo Latent state evolves and observations arrive one time step at a time. See state-space filtering.
target $\pi$ proposal / transition weighted particles

2. I.i.d. methods

exacti.i.d. Inverse-CDF sampling

The cleanest case. If $X$ has CDF $F$, then $F(X)\sim\mathcal{U}[0,1]$, so $X = F^{-1}(U)$ for $U\sim\mathcal{U}[0,1]$ recovers exact i.i.d. samples.

$$ X = F^{-1}(U),\qquad U\sim\mathcal{U}[0,1]. $$

It is exact and free of rejection but useless beyond 1-D and beyond targets whose CDF has an invertible closed form. It is the baseline that motivates everything else.

exacti.i.d. Rejection sampling

Pick a tractable proposal $q$ and a constant $M$ with $M\,q(x)\geq \pi(x)$ for all $x$. Draw $x\sim q$, $u\sim\mathcal{U}[0,1]$, and accept if $u \lt \pi(x)/(M\,q(x))$. Accepted samples are exact draws from $\pi$.

Rejection sampling
  1. Draw $x \sim q$.
  2. Draw $u \sim \mathcal{U}[0,1]$.
  3. If $u \lt \pi(x) / (M\,q(x))$, accept $x$; else reject and repeat.

Acceptance rate is $1/M$. In high dimensions $M$ blows up exponentially, because the gap between any plausible $q$ and a peaked $\pi$ widens fast in volume, so rejection is unusable past a handful of dimensions.

i.i.d. Importance sampling (IS)

Never reject. Reweight every sample from $q$ by the ratio $w(x) = \pi(x)/q(x)$:

$$ \mathbb{E}_\pi[f(X)] \approx \frac{1}{N}\sum_{i=1}^N w(x^{(i)})\,f(x^{(i)}),\qquad x^{(i)}\sim q. $$

When $\pi$ is unnormalized (Bayesian posteriors typically are), use self-normalized IS:

$$ \tilde w^{(i)} = \frac{w(x^{(i)})}{\sum_j w(x^{(j)})},\qquad \hat{\mathbb{E}}_\pi[f] = \sum_i \tilde w^{(i)} f(x^{(i)}). $$

How good is this estimator? The samples come from $q$, so the variance is taken under $q$:

$$ \begin{aligned} \operatorname{Var}_q\bigl[f(X)\,w(X)\bigr] &= \mathbb{E}_q\!\bigl[\,f(X)^2 w(X)^2\,\bigr] - \mathbb{E}_q\!\bigl[\,f(X)\,w(X)\,\bigr]^2 \\ &= \mathbb{E}_q\!\bigl[\,f(X)^2 w(X)^2\,\bigr] - \mathbb{E}_p\!\bigl[\,f(X)\,\bigr]^2 \\ &= \mathbb{E}_q\!\bigl[\,f(X)^2 w(X)^2\,\bigr] - \mu_f^2. \end{aligned} $$

The bridge from line 1 to line 2 is the unbiasedness identity that motivated IS in the first place: $\mathbb E_q[fw] = \int f(x)\,\tfrac{\pi(x)}{q(x)}\,q(x)\,dx = \mathbb E_p[f(X)] = \mu_f$. The second term doesn't depend on $q$; it's a property of the problem.

Two things follow immediately:

Minimising $\mathbb E_q[f^2 w^2]$ over $q$ subject to $\int q = 1$ gives the optimal proposal

$$ q^*(x) \;\propto\; |f(x)|\,\pi(x), $$

which is exactly the thing you can't sample from — it requires the integral $\int |f|\pi$, the object closest to what you were trying to compute. What this says in practice is direct samples toward where $|f|\pi$ is large. For computing the evidence ($f \equiv 1$) the optimal proposal is $q^* = \pi$ — exact samples from the target. For a tail-event probability ($f$ an indicator), $q^*$ shifts mass to the tail, which is the design principle behind rare-event simulation.

The operational shortcut, since $q^*$ is unavailable: choose $q$ heavier-tailed than $\pi$ (Student-$t$ over Gaussian, Laplace over Normal) so the weight ratio $\pi/q$ stays bounded in the tails. Then monitor the effective sample size $\mathrm{ESS} = \bigl(\sum_i \tilde w^{(i)}\bigr)^2 \big/ \sum_i \bigl(\tilde w^{(i)}\bigr)^2$. If it collapses to $\ll N$, your weights are degenerate: the estimate is dominated by one or two samples and the variance is effectively the variance of the largest weight times $f^2$.

Figure 2 lets you compare rejection and importance sampling on the same problem. Pick a target $\pi$ and a proposal $q$ (family, center, scale, and for Student-$t$ the degrees of freedom $\nu$). The top row plots each method's draws; the ESS bar shows how many useful samples per $N$ draws each method extracts; the empirical-fit strip overlays the resulting histograms on $\pi$ so you can see which method actually reconstructs the target. The preset row sweeps through the regimes where the two methods agree, disagree, or both fail.

Figure 2 · Rejection sampling vs. importance sampling: choose $p$ and $q$
target $\pi$ proposal $M\!\cdot\!q$ accepted / rejection ESS importance weights / IS ESS rejected / wasted draws

3. MCMC

Once $\pi$ is multivariate and peaked, both rejection (acceptance rate $\to 0$) and importance sampling (weight variance $\to \infty$) collapse. MCMC trades i.i.d. samples for correlated samples from a Markov chain $X_0, X_1, X_2, \dots$ whose stationary distribution is $\pi$. The chain is constructed to satisfy detailed balance:

$$ \pi(x)\,T(x\to x') \;=\; \pi(x')\,T(x'\to x), $$

which is sufficient for $\pi$ to be invariant under $T$. Different MCMC algorithms are different ways to engineer such a $T$.

The MCMC story is a sequence of relaxations. Each step removes one limitation, but it adds a new requirement or a new way to fail.

StepWhat changesRequirementWhat can fail
Metropolis Replace i.i.d. draws with a Markov chain and accept/reject local proposals. Symmetric proposal $q(x'\mid x)=q(x\mid x')$. Proposal scale can make the chain crawl or reject almost everything.
Metropolis-Hastings Allow asymmetric or independence proposals. Evaluate the proposal correction $q(x\mid x')/q(x'\mid x)$. A bad proposal still gives sticky chains or mode blindness.
Gibbs Replace accept/reject proposals with exact coordinate conditional draws. Closed-form full conditionals, or MH-within-Gibbs substeps. Strong posterior correlation makes axis moves mix slowly.
RJMCMC Let the chain jump between parameter spaces with different dimensions. Dimension-matched reversible moves and Jacobian terms. Birth/death move design is problem-specific and can mix poorly.

Figure 3 puts the MCMC variants on the same toy state space. The tabs keep the Markov-chain skeleton fixed where possible, then add the extra mechanism that each variant needs: rejection self-loops, Hastings proposal correction, coordinate-wise conditionals, birth/death dimension moves, or gradient-guided proposals. The RJMCMC tab shares the 1-D state space with the others for visual parallelism — the states there stand in for model orders $k$, not for the within-model parameters. The genuine two-space picture (within-model MH on each $k$, plus dimension-jumping moves between them) is in Figure 13.

Figure 3 · MCMC variant lab: what changes in the transition kernel
target mass proposal / transition accepted move rejection / self-loop

mcmc Metropolis MCMC (1953)

Metropolis MCMC is the symmetric-proposal member of the Metropolis-Hastings family. Use a proposal with $q(x'\mid x) = q(x\mid x')$, propose $x'\sim q(\cdot\mid x)$, and accept with probability

$$ \alpha(x,x') = \min\!\left\{1,\;\frac{\pi(x')}{\pi(x)}\right\}. $$

The symmetric $q$ cancels, so only the target-density ratio remains. Typical choice: random walk $x' = x + \epsilon$, $\epsilon\sim\mathcal{N}(0,\Sigma_\text{prop})$. The step size $\Sigma_\text{prop}$ is the one tuning knob and it matters: too small and the chain crawls; too large and nearly every proposal is rejected.

mcmc Metropolis-Hastings (1970)

Generalizes Metropolis to asymmetric $q$ by adding a correction:

$$ \alpha(x,x') = \min\!\left\{1,\;\frac{\pi(x')\,q(x\mid x')}{\pi(x)\,q(x'\mid x)}\right\}. $$
Metropolis-Hastings
  1. Propose $x' \sim q(\cdot \mid x)$.
  2. Compute $\alpha = \min\bigl\{1, \frac{\pi(x')\,q(x \mid x')}{\pi(x)\,q(x' \mid x)}\bigr\}$.
  3. Draw $u \sim \mathcal{U}[0,1]$. If $u \lt \alpha$, set $x \leftarrow x'$; else keep $x$.

Variants worth naming:

Why that $\alpha$? Because it is exactly the factor that forces detailed balance. Write the transition as $T(x\to x') = q(x'\mid x)\,\alpha(x,x')$. The proposal alone is unbalanced: the two flows $\pi(x)\,q(x'\mid x)$ and $\pi(x')\,q(x\mid x')$ generally differ. One direction is over-full, and its $\alpha$ is the ratio that scales it down while the reverse direction's $\alpha$ is $1$. Either way both realized flows land on $\min\{\pi(x)\,q(x'\mid x),\;\pi(x')\,q(x\mid x')\}$, so $\pi(x)\,T(x\to x') = \pi(x')\,T(x'\to x)$ — detailed balance. Figure 4 lets you push the target and proposal ratios around: the proposal flow tilts, $\alpha$ compensates, and the realized flow stays balanced no matter what.

Figure 4 · The acceptance ratio forces detailed balance
proposal flow $\pi\,q$ over-full direction, throttled by $\alpha$ realized flow $\pi\,T$

Figure 5 runs a random-walk MH on a 2-D banana-shaped target and lets you watch how step size trades off acceptance rate against mixing. Pause and use Step to see one proposal at a time: the dashed ellipse is the proposal region (±σ around the current state), and the line shows where the last proposal landed and whether it was accepted.

Figure 5 · Random-walk Metropolis-Hastings on a banana target
current state proposal region (±σ) accepted proposal rejected proposal
Figure 6 · Trace, rolling autocorrelation, and effective sample size
trace rolling ACF ESS estimate
Figure 7 · Random-walk MH versus Hamiltonian proposals
random-walk MH leapfrog trajectory accepted HMC move

The BNN posterior target swaps the banana for the two-parameter $f(x; w_1, w_2) = w_1\tanh(w_2 x)$ posterior used on the Bayesian neural networks page. The model has a sign symmetry, so the energy surface has two basins of equal depth. Watch HMC commit to one basin while random-walk MH stays trapped in whichever it started in — switching basins requires crossing a high-energy ridge that local proposals rarely traverse.

One caveat the per-step view hides: cost. An HMC step here spends $L$ gradient evaluations on its leapfrog trajectory (the leapfrog steps slider) where random-walk MH spends none. Per accepted step HMC wins easily; per gradient evaluation the gap narrows, and the honest comparison is mixing per unit of compute — which is why the strong-but-expensive proposal still has to earn its keep against the cheap-but-slow one.

For a live MALA sampler on this same banana target, see chi-feng's MCMC demo. MALA — the Metropolis-adjusted Langevin algorithm — is HMC with a single leapfrog step: one gradient-guided Langevin proposal, then an accept/reject correction. It sits between random-walk MH and full HMC.

Figure 8 makes MALA's anatomy explicit. Every proposed step on the banana target $x' = x + d + s$ splits into a deterministic drift $d = \tfrac{\epsilon^2}{2}\nabla\log\pi(x)$ that climbs the gradient and a random diffusion kick $s = \epsilon\,z$ with $z\sim\mathcal{N}(0, I)$. Because the proposal is asymmetric, the accept ratio carries both forward and reverse Langevin densities. Watch the acceptance gauge: MALA mixes best near a rate of $0.574$, so the gauge doubles as a step-size tuning instrument.

MALA is the corrected discretization of a continuous process. The Langevin diffusion $d\theta = \tfrac12\nabla\log\pi(\theta)\,dt + dB$ has $\pi$ as its exact stationary distribution. Take a finite Euler step of size $\epsilon$ and you get the unadjusted Langevin algorithm (ULA): fast, but the discretization biases the stationary distribution away from $\pi$. MALA removes that bias with the Metropolis accept/reject step. Toggle the correction off to run ULA. The histogram strip tracks the chain's $x$-marginal against the true $\mathcal{N}(0,4)$: at large $\epsilon$ the uncorrected chain visibly settles to the wrong distribution while MALA stays on target.

Figure 8 · Langevin diffusion, ULA, and MALA
drift $d = \tfrac{\epsilon^2}{2}\nabla\log\pi$ diffusion $s = \epsilon z$ rejected proposal

Figure 9 shows the same target as a surface. Drag to rotate it. The ridge shape makes the random-walk tuning problem visible: local proposals can slide along the ridge, but large proposals jump off the high-density mass.

Figure 9 · WebGL density surface for the banana target
high target density low target density current MH state

mcmc Markov topology and stationary mass

The Markov chain is not just a path. It is a transition graph over the state space: each state sends probability mass to nearby proposals, then the accept/reject rule changes how much of that mass actually moves. A good chain has local moves that still let empirical occupation mass converge to the target mass. The mode buttons switch the state-space layout: a 1-D line (axis-aligned random walk on a bimodal target), a 2-D grid (Moore-neighborhood random walk on a two-bump target), and a ring with a barrier between two clusters of mass — each exposes a different mixing failure mode.

Figure 10 · Discrete MH transition topology and occupation mass
target mass transition edges empirical mass

mcmc Gibbs sampling

A special case of MH where the proposal is the full conditional of one coordinate at a time. With this proposal, the acceptance probability is identically 1.

$$ x_i^{(t+1)} \sim p\!\left(x_i \,\middle|\, x_1^{(t+1)},\ldots,x_{i-1}^{(t+1)},\,x_{i+1}^{(t)},\ldots,x_d^{(t)}\right). $$

Gibbs is the workhorse of Bayesian hierarchical modeling because conjugate priors give closed-form full conditionals, so every update is a draw from a named distribution. It mixes poorly along directions of high posterior correlation (because it can only move along axes); a common cure is to update blocks of variables jointly, or to reparametrize. If a conditional is intractable, replace that step with an MH sub-step. This is Metropolis-within-Gibbs.

Figure 11 shows why Gibbs feels different from a random walk: it moves by exact conditional draws along coordinate axes. That makes every proposal accepted, but strong posterior correlation can still make the chain advance slowly.

Figure 11 · Gibbs coordinate updates versus random-walk MH
target contours Gibbs path MH path

mcmc Reversible-jump MCMC (RJMCMC)

What if the model order itself is unknown: number of mixture components, number of change-points, number of hidden layers? Then the state space is a disjoint union $\bigsqcup_k \{k\}\times\mathbb{R}^{n_k}$ of spaces of different dimension, and standard MH doesn't apply because $\pi(x')/\pi(x)$ is comparing densities in different-dimensional spaces.

Green's RJMCMC fixes this with a dimension-matching auxiliary variable. To move from $(k,\theta_k)$ to $(k',\theta_{k'})$ with $n_{k'}>n_k$, draw auxiliary $u\sim g$ and apply a diffeomorphism $\theta_{k'} = T(\theta_k, u)$. The acceptance ratio carries a Jacobian:

$$ \alpha = \min\!\left\{1,\; \frac{\pi(k',\theta_{k'})\,g'(u')\,r(k'\to k)}{\pi(k,\theta_k)\,g(u)\,r(k\to k')}\left|\frac{\partial(\theta_{k'},u')}{\partial(\theta_k,u)}\right|\right\}. $$

The classic example is a mixture model with a birth/death pair: birth splits one component into two ($\theta_1=\theta+u,\theta_2=\theta-u$, Jacobian $=2$); death merges two into one. Recent work uses RJMCMC inside simulated annealing to choose neural network architectures jointly with weights.

Figure 12 · Dimension-matching map: split / merge
$k=1$ parameter $\theta = \mu$ $k=2$ parameters $(\mu_1, \mu_2)$ auxiliary $u$

Figure 13 puts the split/merge map inside a Markov chain. The left panel runs within-model Metropolis-Hastings on the $k=1$ parameter $\mu$; the right panel runs MH on the $k=2$ parameters $(\mu_1, \mu_2)$. With probability $p_\text{rj}$ the chain proposes a between-model move instead: a birth $\mu \mapsto (\mu+u, \mu-u)$ with $u\sim g$, or a death that merges $(\mu_1, \mu_2)$ to their midpoint. The acceptance ratio for each between-model move carries the Jacobian factor $2$ from the diffeomorphism. The model-order bar below the panels shows the chain's marginal occupation of $\{k=1, k=2\}$.

Figure 13 · RJMCMC as two coupled chains
target density current state recent trail RJ proposal

mcmc Simulated annealing

Same MH machinery, opposite goal. Fix a target $\pi(x) \propto e^{-U(x)}$, but instead of sampling from $\pi$, find its peak (the global minimum of $U$). Introduce a temperature $T$ and sample from a tempered distribution $\pi_T(x) \propto e^{-U(x)/T}$. The MH accept ratio for a symmetric proposal is

$$ r \;=\; \exp\!\bigl(-(U(x^*) - U(x))/T\bigr). $$

At $T=1$ this is ordinary MH on $\pi$. At $T \to \infty$ every proposal is accepted: the chain ignores the landscape and explores freely. At $T \to 0$ only downhill moves are accepted: the chain becomes greedy. Simulated annealing starts hot and cools: $T_t$ follows a schedule like the logarithmic $T_t = T_0/\log(1 + t)$ (slow enough to escape local minima; Geman & Geman, 1984) or the more practical exponential $T_t = T_0\,\alpha^t$ with $\alpha$ just below $1$.

Information-theory aside. The tempered family $\pi_T \propto e^{-U/T}$ is the Gibbs distribution at temperature $T$, with free energy $F(T) = -T \log Z(T)$. SA can be read as free-energy descent along the path of measures connecting the uniform distribution ($T \to \infty$) to a delta on the global minimum ($T \to 0$). The cooling schedule controls how much entropy the chain is allowed to keep at each step, the same shape as the entropy-regularized argmax that turns up in reinforcement learning and rate-distortion.

Figure 13b runs both walkers on the same triple-well $U(x)$: a fixed-$T$ MH chain that samples $\pi$ (it visits all three wells in proportion to their depth), and an SA chain that cools exponentially and ends up parked in the global minimum. Watch SA's acceptance rate plummet as $T \to 0$, and the rare hot-start jumps that let it escape the shallow wells before cooling makes that impossible.

Figure 13b · Fixed-T MH samples the distribution; SA cools to find the peak

4. Shared algorithm structure

These methods produce the same kind of output (an estimate of $\mathbb E_\pi[f]$ or samples from $\pi$) but differ in what they ask of $\pi$, how they generate candidate states, and how they update. Lining the algorithms up row by row makes the family resemblances and the algorithm-specific moves visible at a glance.

Rejection Importance sampling Metropolis Metropolis–Hastings Gibbs HMC
Family i.i.d., exact i.i.d., weighted MCMC MCMC MCMC MCMC
Requires envelope $M$ with $M\,q(x) \ge \pi(x)$ proposal $q$ with $\mathrm{supp}\,q \supseteq \mathrm{supp}\,\pi$ $\pi$ pointwise up to a constant $\pi$ pointwise; proposal $q(x'\mid x)$ tractable full conditionals $\pi(x_k\mid x_{-k})$ $\pi$ and $\nabla \log \pi$
Proposal $x \sim q(\cdot)$ $x_i \sim q(\cdot),\ i = 1\ldots N$ $x' = x + \varepsilon,\ \varepsilon$ symmetric $x' \sim q(\cdot \mid x)$ cycle $k$: $x'_k \sim \pi(x_k\mid x_{-k})$ leapfrog $L$ steps from $(x, p)$
Accept / weight $u < \dfrac{\pi(x)}{M\,q(x)}$ $w_i = \dfrac{\pi(x_i)}{q(x_i)}$ $u < \min\!\Bigl(1, \dfrac{\pi(x')}{\pi(x)}\Bigr)$ $u < \min\!\Bigl(1, \dfrac{\pi(x')\,q(x\mid x')}{\pi(x)\,q(x'\mid x)}\Bigr)$ always (exact draw) $u < \min\!\bigl(1,\, e^{H(x,p) - H(x',p')}\bigr)$
State update keep $x$ if accepted; else discard and resample no state; reweight each draw $x \leftarrow x'$ if accepted; else stay $x \leftarrow x'$ if accepted; else stay $x_k \leftarrow x'_k$ for each coordinate $x \leftarrow x'$ if accepted; resample momentum $p \sim \mathcal N(0, I)$
Output i.i.d. samples from $\pi$ $\hat{\mathbb E}_\pi[f] = \sum_i \tilde w_i\,f(x_i),\ \tilde w_i \propto w_i$ correlated chain $X_0, X_1, \ldots \to \pi$ correlated chain $X_0, X_1, \ldots \to \pi$ correlated chain $X_0, X_1, \ldots \to \pi$ correlated chain with longer effective step
Main failure mode acceptance rate $1/M$ collapses exponentially in $d$ weight degeneracy when $q$ poorly covers $\pi$ or in high $d$ symmetric-proposal scale fragility poor $q$ ⇒ sticky chain or mode-trapping strong coordinate correlation ⇒ slow mixing step size and trajectory length need tuning; divergences

Reading across the Accept / weight row is where the family shows itself most clearly: rejection accepts or discards; IS accepts every sample but reweights; Metropolis and MH accept stochastically by a target- density ratio (with MH adding the proposal correction $q(x\mid x')/q(x'\mid x)$ that the symmetric Metropolis case eliminates); Gibbs accepts always but restricts the proposal to one-coordinate conditional draws; HMC accepts by an energy ratio rather than a density ratio, because its proposal is a trajectory rather than a point. Each row in this table is a single design decision; each method is one combination.

Two methods that share a row stay in the same family of failure modes. Reading Main failure mode shows why: high-$d$ dimension hurts rejection and IS for the same reason (geometric weight-mass concentration in the proposal-vs-target ratio), while MH and Gibbs share the symptom slow mixing but for different reasons (proposal-scale tuning vs. posterior anisotropy in coordinate-aligned axes).

5. Pair by pair: what changes between methods

The table above shows the design space at a glance. The four cards below zoom in on the single-step changes: this turned into that by swapping one line of the inner loop. Pseudocode for each pair shares the same outer structure, so the differences are confined to the highlighted lines.

Rejection ↔ Importance sampling

Both draw from $q$. Rejection discards mismatched samples; IS keeps every sample with a weight.

Rejection sampling
samples = []
while |samples| < N:
    x ~ q(·)
    u ~ Uniform(0, 1)
    if u < π(x) / (M·q(x)):
        samples.append(x)
return samples            # i.i.d. from π
Importance sampling
samples = []; weights = []
for i = 1 to N:
    x ~ q(·)
    w = π(x) / q(x)
    samples.append(x)
    weights.append(w)
return Σᵢ w̃ᵢ·f(xᵢ)         # w̃ ∝ w

The accept-or-discard test (left) becomes a weight assignment (right). Rejection wastes a fraction $1 - 1/M$ of draws but returns honest i.i.d. samples; IS wastes none but returns a weighted estimator whose variance depends on how well $q$ covers $\pi$ (see §2 for the variance algebra).

Metropolis ↔ Metropolis–Hastings

Asymmetric proposals need a correction factor in the acceptance ratio.

Metropolis (symmetric)
x = x₀
chain = [x]
for t = 1 to T:
    x' = x + ε,  ε symmetric
    u ~ Uniform(0, 1)
    if u < min(1, π(x') / π(x)):
        x = x'
    chain.append(x)
return chain
Metropolis–Hastings
x = x₀
chain = [x]
for t = 1 to T:
    x' ~ q(·|x)
    u ~ Uniform(0, 1)
    if u < min(1, π(x')·q(x|x') /
                  (π(x)·q(x'|x))):
        x = x'
    chain.append(x)
return chain

Two lines change. The proposal becomes a general $q(\cdot \mid x)$ instead of a symmetric jump, and the accept test picks up the Hastings correction $q(x \mid x')/q(x' \mid x)$ that compensates for proposal asymmetry. Metropolis is MH with $q$ symmetric; the correction equals 1 and cancels.

Metropolis–Hastings ↔ Gibbs

Replace "propose then accept/reject" with "draw exactly from a full conditional, always accept."

Metropolis–Hastings
x = x₀
chain = [x]
for t = 1 to T:
    x' ~ q(·|x)
    u ~ Uniform(0, 1)
    if u < min(1, π(x')·q(x|x') /
                  (π(x)·q(x'|x))):
        x = x'
    chain.append(x)
return chain
Gibbs sampling
x = x₀
chain = [x]
for t = 1 to T:
    for k = 1 to d:
        x_k ~ π(x_k | x_{-k})
    chain.append(x)
return chain

Gibbs replaces the entire propose/accept block with a coordinate cycle of exact conditional draws. The acceptance probability is identically $1$ because each conditional draw is from the right distribution by construction. The price: you need the full conditionals $\pi(x_k \mid x_{-k})$ in closed form. When you don't, MH-within-Gibbs replaces one or more of these conditional draws with a local MH step.

Metropolis–Hastings ↔ HMC

Replace random-walk proposals with a Hamiltonian trajectory; replace the density ratio with an energy ratio.

Metropolis–Hastings (random walk)
x = x₀
chain = [x]
for t = 1 to T:
    x' = x + ε,  ε ~ N(0, Σ)


    u ~ Uniform(0, 1)
    if u < min(1, π(x') / π(x)):
        x = x'
    chain.append(x)
return chain
Hamiltonian Monte Carlo
x = x₀
chain = [x]
for t = 1 to T:
    p ~ N(0, I)
    (x', p') = leapfrog(x, p, ε, L)
                                  # uses ∇log π
    u ~ Uniform(0, 1)
    if u < min(1, exp(H(x,p) - H(x',p'))):
        x = x'
    chain.append(x)
return chain

The proposal step changes from a local random jump to a deterministic leapfrog trajectory of length $L\varepsilon$ using $\nabla\log\pi$. The accept test changes from a density ratio to an energy ratio $\exp(H - H')$ where $H(x, p) = -\log\pi(x) + \tfrac12\|p\|^2$. The leapfrog integrator is reversible and volume-preserving, which is what makes the energy-ratio test correct. HMC trades the need for gradients for much longer effective steps and much better mixing in high $d$.

6. Method choice by target

The decision table summarizes static-target methods. Tags echo the section colors: exact means returns-true-samples-from-$\pi$, i.i.d. means draws are independent, mcmc means correlated chain samples. For state-space targets (Kalman, EKF, UKF, particle filter), see Kalman & Particle Filters.

MethodUse whenStrengthsPitfalls
Inverse CDF
exacti.i.d.
Univariate target with invertible CDF (exponential, Cauchy, custom 1-D). Exact, zero waste. Useless beyond 1-D or without closed-form $F^{-1}$.
Rejection
exacti.i.d.
Low-dimensional target with a known envelope $M\,q\geq\pi$. Exact i.i.d. samples; simple to implement. Acceptance rate $1/M$ collapses exponentially in dimension.
Importance sampling
i.i.d.
Computing expectations under an unnormalized target; building block inside SMC. Never rejects; self-normalized version handles unnormalized $\pi$; flexible $q$. Weight degeneracy if $q$ poorly covers $\pi$ or in high $d$; monitor ESS.
Metropolis
mcmc
Static posterior where a symmetric random-walk proposal is adequate. Simpler acceptance ratio; needs only pointwise unnormalized $\pi$. A special case of MH; proposal scale still controls mixing and rejection.
Metropolis-Hastings
mcmc
General-purpose static posterior with no special structure; only ratios $\pi(x')/\pi(x)$ available. Asymptotically exact; needs only pointwise unnormalized $\pi$. Tuning the proposal scale is fiddly; mixes poorly in high $d$ or multimodal $\pi$.
Gibbs sampling
mcmc
Hierarchical/conjugate models where every full conditional has closed form. Zero tuning; acceptance rate 1; natural for graphical models. Slow when variables are correlated; needs tractable conditionals (else MH-within-Gibbs).
RJMCMC
mcmc
Joint inference over model order and parameters (mixture components, change-points, NN architecture). Asymptotically exact across model space; principled Bayesian model selection. Move design and Jacobians are problem-specific; mixing across dimensions is hard.
Quick decision tree.
  1. 1-D target with an invertible CDF? → inverse CDF.
  2. Low-D target with a known envelope $M\,q\geq\pi$? → rejection.
  3. Want an expectation under an unnormalized target with a reasonable proposal $q$? → importance sampling.
  4. High-D static target, only ratios of $\pi$? → symmetric proposal: Metropolis; asymmetric: Metropolis-Hastings. Conjugate conditionals available? → Gibbs. Gradients available? → HMC/Langevin.
  5. Model order itself is unknown? → RJMCMC.
  6. Time-evolving hidden state? → see Kalman & Particle Filters.

What next

Sampling connects most directly to change of measure, approximate inference, dependence testing, and state-space filtering.