Monte Carlo & MCMC

Rejection, importance sampling, Metropolis-Hastings, Gibbs, RJMCMC, and simulated annealing: what they share, where they differ, which to reach for, and what their probability mass is doing. State-space filtering — Kalman and particle filters — lives on its own page.

Bayesian computation centres on two integrals: the posterior normalizer $p(y) = \int p(y\mid\theta)p(\theta)\,d\theta$, and a posterior expectation $\mathbb{E}_{p(\theta\mid y)}[f(\theta)] = \int f(\theta)\,p(\theta\mid y)\,d\theta$. Neither has a closed form for most models of interest. Monte Carlo methods replace the integrals with averages over samples, either drawn directly (i.i.d. samplers) or drawn from a Markov chain whose stationary distribution is the target (MCMC). The sequential / state-space case (sequential Monte Carlo, particle filters, Kalman family) is a different problem shape and gets its own page.

1. Sampling methods estimate expectations under a target

Every method below evaluates an expectation against some target $\pi(x)$, typically a Bayesian posterior $\pi(x) = p(\theta\mid y)$, possibly known only up to a constant. They differ in three axes:

How samples are produced. Direct i.i.d. draws (inverse CDF, rejection), reweighted draws from a tractable proposal (importance sampling), or draws from a Markov chain converging to $\pi$ (Metropolis-Hastings, Gibbs, RJMCMC).
What they require of $\pi$. Inverse CDF needs the CDF; rejection needs a global envelope; importance sampling needs only pointwise $\pi(x)$; MCMC needs only ratios $\pi(x')/\pi(x)$; Gibbs needs tractable full conditionals.
What state space they're built for. Static $\theta$ (most posteriors) or variable-dimension $(k,\theta_k)$ (RJMCMC for model selection). For time-evolving latent state $x_{0:T}$, see state-space filtering.

The decision table at the end reduces the choice to one view.

Figure 1 · How the sampler creates probability mass

Structure	Sampling idea	Use when
Static target	Direct / rejection	Low-dimensional target with an envelope or inverse transform.
Weighted target	Importance sampling	A tractable proposal covers the target and the goal is an expectation.
Markov target	MCMC	Only unnormalized target ratios are available, often in high dimension.
State over time	Sequential Monte Carlo	Latent state evolves and observations arrive one time step at a time. See state-space filtering.

target $\pi$ proposal / transition weighted particles

2. I.i.d. methods

exacti.i.d. Inverse-CDF sampling

In the cleanest case, if $X$ has CDF $F$, then $F(X)\sim\mathcal{U}[0,1]$, so $X = F^{-1}(U)$ for $U\sim\mathcal{U}[0,1]$ recovers exact i.i.d. samples.

$$ X = F^{-1}(U),\qquad U\sim\mathcal{U}[0,1]. $$

It is exact and free of rejection but useless beyond 1-D and beyond targets whose CDF has an invertible closed form. It is the baseline that motivates everything else.

exacti.i.d. Rejection sampling

Pick a tractable proposal $q$ and a constant $M$ with $M\,q(x)\geq \pi(x)$ for all $x$. Draw $x\sim q$, $u\sim\mathcal{U}[0,1]$, and accept if $u \lt \pi(x)/(M\,q(x))$. Accepted samples are exact draws from $\pi$.

Rejection sampling

Draw $x \sim q$.
Draw $u \sim \mathcal{U}[0,1]$.
If $u \lt \pi(x) / (M\,q(x))$, accept $x$; else reject and repeat.

Acceptance rate is $1/M$. In high dimensions $M$ blows up exponentially, because the gap between any plausible $q$ and a peaked $\pi$ widens fast in volume, so rejection is unusable past a handful of dimensions.

i.i.d. Importance sampling (IS)

Never reject. Reweight every sample from $q$ by the ratio $w(x) = \pi(x)/q(x)$:

$$ \mathbb{E}_\pi[f(X)] \approx \frac{1}{N}\sum_{i=1}^N w(x^{(i)})\,f(x^{(i)}),\qquad x^{(i)}\sim q. $$

When $\pi$ is unnormalized (Bayesian posteriors typically are), use self-normalized IS:

$$ \tilde w^{(i)} = \frac{w(x^{(i)})}{\sum_j w(x^{(j)})},\qquad \hat{\mathbb{E}}_\pi[f] = \sum_i \tilde w^{(i)} f(x^{(i)}). $$

How good is this estimator? The samples come from $q$, so the variance is taken under $q$:

$$ \begin{aligned} \operatorname{Var}_q\bigl[f(X)\,w(X)\bigr] &= \mathbb{E}_q\!\bigl[\,f(X)^2 w(X)^2\,\bigr] - \mathbb{E}_q\!\bigl[\,f(X)\,w(X)\,\bigr]^2 \\ &= \mathbb{E}_q\!\bigl[\,f(X)^2 w(X)^2\,\bigr] - \mathbb{E}_p\!\bigl[\,f(X)\,\bigr]^2 \\ &= \mathbb{E}_q\!\bigl[\,f(X)^2 w(X)^2\,\bigr] - \mu_f^2. \end{aligned} $$

The bridge from line 1 to line 2 is the unbiasedness identity that motivated IS in the first place: $\mathbb E_q[fw] = \int f(x)\,\tfrac{\pi(x)}{q(x)}\,q(x)\,dx = \mathbb E_p[f(X)] = \mu_f$. The second term doesn't depend on $q$; it's a property of the problem.

Two things follow immediately:

The IS estimator is unbiased for any $q$ with support covering $\pi$. The choice of $q$ doesn't shift the mean — only the variance.
The only $q$-dependent term is $\mathbb E_q[f^2 w^2]$. Since $w(x) = \pi(x)/q(x)$, the dangerous factor is $w(x)^2 = (\pi(x)/q(x))^2$: variance blows up wherever $q$ is small relative to $\pi$ in a region where $f$ is not also small. The squared weight is the explosion mechanism.

Minimising $\mathbb E_q[f^2 w^2]$ over $q$ subject to $\int q = 1$ gives the optimal proposal

$$ q^*(x) \;\propto\; |f(x)|\,\pi(x), $$

which is exactly the thing you can't sample from — it requires the integral $\int |f|\pi$, the object closest to what you were trying to compute. What this says in practice is direct samples toward where $|f|\pi$ is large. For computing the evidence ($f \equiv 1$) the optimal proposal is $q^* = \pi$ — exact samples from the target. For a tail-event probability ($f$ an indicator), $q^*$ shifts mass to the tail, which is the design principle behind rare-event simulation.

The operational shortcut, since $q^*$ is unavailable: choose $q$ heavier-tailed than $\pi$ (Student-$t$ over Gaussian, Laplace over Normal) so the weight ratio $\pi/q$ stays bounded in the tails. Then monitor the effective sample size $\mathrm{ESS} = \bigl(\sum_i \tilde w^{(i)}\bigr)^2 \big/ \sum_i \bigl(\tilde w^{(i)}\bigr)^2$. If it collapses to $\ll N$, your weights are degenerate: the estimate is dominated by one or two samples and the variance is effectively the variance of the largest weight times $f^2$.

Figure 2 lets you compare rejection and importance sampling on the same problem. Pick a target $\pi$ and a proposal $q$ (family, center, scale, and for Student-$t$ the degrees of freedom $\nu$). The top row plots each method's draws; the ESS bar shows how many useful samples per $N$ draws each method extracts; the empirical-fit strip overlays the resulting histograms on $\pi$ so you can see which method actually reconstructs the target. The preset row sweeps through the regimes where the two methods agree, disagree, or both fail.

Figure 2 · Rejection sampling vs. importance sampling: choose $p$ and $q$

target $\pi$ proposal $M\!\cdot\!q$ accepted / rejection ESS importance weights / IS ESS rejected / wasted draws

target $p$ Bimodal mixture

proposal $q$ family Gaussian

proposal center $\mu_q$ 0

proposal scale $\sigma_q$ 2.0

Student-t df $\nu$ 3

# samples 300

Common: keep a current state and propose a transition that should preserve $\pi$.

Metropolis: propose from symmetric $q(x' \mid x)$, so $q$ cancels from the ratio.

Accept/reject: draw $u$ and keep the proposal only when $u \lt \alpha$; otherwise record a self-loop.

Hastings correction: multiply the target ratio by $q(x\mid x')/q(x'\mid x)$.

Gibbs: choose a coordinate and sample exactly from its full conditional.

No rejection: a true full-conditional draw has $\alpha=1$.

RJMCMC: propose a birth/death move between model dimensions.

Dimension match: include proposal probabilities and the Jacobian $|J|$.

HMC: simulate a gradient-guided trajectory before proposing the endpoint.

Energy correction: accept by $\min(1,\exp(-\Delta H))$ to correct discretization error.

3. MCMC

Once $\pi$ is multivariate and peaked, both rejection (acceptance rate $\to 0$) and importance sampling (weight variance $\to \infty$) collapse. MCMC trades i.i.d. samples for correlated samples from a Markov chain $X_0, X_1, X_2, \dots$ whose stationary distribution is $\pi$. The chain is constructed to satisfy detailed balance:

$$ \pi(x)\,T(x\to x') \;=\; \pi(x')\,T(x'\to x), $$

which is sufficient for $\pi$ to be invariant under $T$. Different MCMC algorithms are different ways to engineer such a $T$.

The MCMC story is a sequence of relaxations. Each step removes one limitation, but it adds a new requirement or a new way to fail.

Step	What changes	Requirement	What can fail
Metropolis	Replace i.i.d. draws with a Markov chain and accept/reject local proposals.	Symmetric proposal $q(x'\mid x)=q(x\mid x')$.	Proposal scale can make the chain crawl or reject almost everything.
Metropolis-Hastings	Allow asymmetric or independence proposals.	Evaluate the proposal correction $q(x\mid x')/q(x'\mid x)$.	A bad proposal still gives sticky chains or mode blindness.
Gibbs	Replace accept/reject proposals with exact coordinate conditional draws.	Closed-form full conditionals, or MH-within-Gibbs substeps.	Strong posterior correlation makes axis moves mix slowly.
RJMCMC	Let the chain jump between parameter spaces with different dimensions.	Dimension-matched reversible moves and Jacobian terms.	Birth/death move design is problem-specific and can mix poorly.

Figure 3 puts the MCMC variants on the same toy state space. The tabs keep the Markov-chain skeleton fixed where possible, then add the extra mechanism that each variant needs: rejection self-loops, Hastings proposal correction, coordinate-wise conditionals, birth/death dimension moves, or gradient-guided proposals. The RJMCMC tab shares the 1-D state space with the others for visual parallelism — the states there stand in for model orders $k$, not for the within-model parameters. The genuine two-space picture (within-model MH on each $k$, plus dimension-jumping moves between them) is in Figure 13.

Figure 3 · MCMC variant lab: what changes in the transition kernel

target mass proposal / transition accepted move rejection / self-loop

variant knob 0.55

speed (moves/sec) 2.5

mcmc Metropolis MCMC (1953)

Metropolis MCMC is the symmetric-proposal member of the Metropolis-Hastings family. Use a proposal with $q(x'\mid x) = q(x\mid x')$, propose $x'\sim q(\cdot\mid x)$, and accept with probability

$$ \alpha(x,x') = \min\!\left\{1,\;\frac{\pi(x')}{\pi(x)}\right\}. $$

The symmetric $q$ cancels, so only the target-density ratio remains. Typical choice: random walk $x' = x + \epsilon$, $\epsilon\sim\mathcal{N}(0,\Sigma_\text{prop})$. The step size $\Sigma_\text{prop}$ is the one tuning knob and it matters: too small and the chain crawls; too large and nearly every proposal is rejected.

mcmc Metropolis-Hastings (1970)

Generalizes Metropolis to asymmetric $q$ by adding a correction:

$$ \alpha(x,x') = \min\!\left\{1,\;\frac{\pi(x')\,q(x\mid x')}{\pi(x)\,q(x'\mid x)}\right\}. $$

Metropolis-Hastings

Propose $x' \sim q(\cdot \mid x)$.
Compute $\alpha = \min\bigl\{1, \frac{\pi(x')\,q(x \mid x')}{\pi(x)\,q(x' \mid x)}\bigr\}$.
Draw $u \sim \mathcal{U}[0,1]$. If $u \lt \alpha$, set $x \leftarrow x'$; else keep $x$.

Variants worth naming:

Random-walk MH: $q$ symmetric; ratio collapses to $\pi(x')/\pi(x)$. Default in most papers.
Independent MH: $q(x'\mid x) = q(x')$. Equivalent to importance sampling with hard accept/reject; works only if $q$ is genuinely a good global approximation of $\pi$.
Langevin / HMC: $q$ uses gradient information $\nabla\log\pi$ to bias proposals toward high-density regions. Much faster in high dimensions when gradients are available.

Why that $\alpha$? Because it is exactly the factor that forces detailed balance. Write the transition as $T(x\to x') = q(x'\mid x)\,\alpha(x,x')$. The proposal alone is unbalanced: the two flows $\pi(x)\,q(x'\mid x)$ and $\pi(x')\,q(x\mid x')$ generally differ. One direction is over-full, and its $\alpha$ is the ratio that scales it down while the reverse direction's $\alpha$ is $1$. Either way both realized flows land on $\min\{\pi(x)\,q(x'\mid x),\;\pi(x')\,q(x\mid x')\}$, so $\pi(x)\,T(x\to x') = \pi(x')\,T(x'\to x)$ — detailed balance. Figure 4 lets you push the target and proposal ratios around: the proposal flow tilts, $\alpha$ compensates, and the realized flow stays balanced no matter what.

Figure 4 · The acceptance ratio forces detailed balance

proposal flow $\pi\,q$ over-full direction, throttled by $\alpha$ realized flow $\pi\,T$

target ratio $\pi(x')/\pi(x)$ 2

proposal ratio $q(x\mid x')/q(x'\mid x)$ 1

Figure 5 runs a random-walk MH on a 2-D banana-shaped target and lets you watch how step size trades off acceptance rate against mixing. Pause and use Step to see one proposal at a time: the dashed ellipse is the proposal region (±σ around the current state), and the line shows where the last proposal landed and whether it was accepted.

Figure 5 · Random-walk Metropolis-Hastings on a banana target

current state proposal region (±σ) accepted proposal rejected proposal

step size $\sigma$ 0.5

speed (steps/frame) 5

Figure 6 · Trace, rolling autocorrelation, and effective sample size

trace rolling ACF ESS estimate

step size $\sigma$ 0.6

Figure 7 · Random-walk MH versus Hamiltonian proposals

random-walk MH leapfrog trajectory accepted HMC move

leapfrog step $\epsilon$ 0.08

leapfrog steps 16

speed (steps/sec) 2

The BNN posterior target swaps the banana for the two-parameter $f(x; w_1, w_2) = w_1\tanh(w_2 x)$ posterior used on the Bayesian neural networks page. The model has a sign symmetry, so the energy surface has two basins of equal depth. Watch HMC commit to one basin while random-walk MH stays trapped in whichever it started in — switching basins requires crossing a high-energy ridge that local proposals rarely traverse.

One caveat the per-step view hides: cost. An HMC step here spends $L$ gradient evaluations on its leapfrog trajectory (the leapfrog steps slider) where random-walk MH spends none. Per accepted step HMC wins easily; per gradient evaluation the gap narrows, and the honest comparison is mixing per unit of compute — which is why the strong-but-expensive proposal still has to earn its keep against the cheap-but-slow one.

For a live MALA sampler on this same banana target, see chi-feng's MCMC demo. MALA — the Metropolis-adjusted Langevin algorithm — is HMC with a single leapfrog step: one gradient-guided Langevin proposal, then an accept/reject correction. It sits between random-walk MH and full HMC.

Figure 8 makes MALA's anatomy explicit. Every proposed step on the banana target $x' = x + d + s$ splits into a deterministic drift $d = \tfrac{\epsilon^2}{2}\nabla\log\pi(x)$ that climbs the gradient and a random diffusion kick $s = \epsilon\,z$ with $z\sim\mathcal{N}(0, I)$. Because the proposal is asymmetric, the accept ratio carries both forward and reverse Langevin densities. Watch the acceptance gauge: MALA mixes best near a rate of $0.574$, so the gauge doubles as a step-size tuning instrument.

MALA is the corrected discretization of a continuous process. The Langevin diffusion $d\theta = \tfrac12\nabla\log\pi(\theta)\,dt + dB$ has $\pi$ as its exact stationary distribution. Take a finite Euler step of size $\epsilon$ and you get the unadjusted Langevin algorithm (ULA): fast, but the discretization biases the stationary distribution away from $\pi$. MALA removes that bias with the Metropolis accept/reject step. Toggle the correction off to run ULA. The histogram strip tracks the chain's $x$-marginal against the true $\mathcal{N}(0,4)$: at large $\epsilon$ the uncorrected chain visibly settles to the wrong distribution while MALA stays on target.

Figure 8 · Langevin diffusion, ULA, and MALA

drift $d = \tfrac{\epsilon^2}{2}\nabla\log\pi$ diffusion $s = \epsilon z$ rejected proposal

step size $\epsilon$ 0.45

speed (steps/sec) 3

Figure 9 shows the same target as a surface. Drag to rotate it. The ridge shape makes the random-walk tuning problem visible: local proposals can slide along the ridge, but large proposals jump off the high-density mass.

Figure 9 · WebGL density surface for the banana target

high target density low target density current MH state

mcmc Markov topology and stationary mass

The Markov chain is not just a path. It is a transition graph over the state space: each state sends probability mass to nearby proposals, then the accept/reject rule changes how much of that mass actually moves. A good chain has local moves that still let empirical occupation mass converge to the target mass. The mode buttons switch the state-space layout: a 1-D line (axis-aligned random walk on a bimodal target), a 2-D grid (Moore-neighborhood random walk on a two-bump target), and a ring with a barrier between two clusters of mass — each exposes a different mixing failure mode.

Figure 10 · Discrete MH transition topology and occupation mass

target mass transition edges empirical mass

proposal radius 2

temperature 1.0

speed (moves/sec) 2

background samples 25

mcmc Gibbs sampling

A special case of MH where the proposal is the full conditional of one coordinate at a time. With this proposal, the acceptance probability is identically 1.

$$ x_i^{(t+1)} \sim p\!\left(x_i \,\middle|\, x_1^{(t+1)},\ldots,x_{i-1}^{(t+1)},\,x_{i+1}^{(t)},\ldots,x_d^{(t)}\right). $$

Gibbs is the workhorse of Bayesian hierarchical modeling because conjugate priors give closed-form full conditionals, so every update is a draw from a named distribution. It mixes poorly along directions of high posterior correlation (because it can only move along axes); a common cure is to update blocks of variables jointly, or to reparametrize. If a conditional is intractable, replace that step with an MH sub-step. This is Metropolis-within-Gibbs.

Figure 11 shows why Gibbs feels different from a random walk: it moves by exact conditional draws along coordinate axes. That makes every proposal accepted, but strong posterior correlation can still make the chain advance slowly.

Figure 11 · Gibbs coordinate updates versus random-walk MH

target contours Gibbs path MH path

correlation $\rho$ 0.85

MH step size 0.9

mcmc Reversible-jump MCMC (RJMCMC)

What if the model order itself is unknown: number of mixture components, number of change-points, number of hidden layers? Then the state space is a disjoint union $\bigsqcup_k \{k\}\times\mathbb{R}^{n_k}$ of spaces of different dimension, and standard MH doesn't apply because $\pi(x')/\pi(x)$ is comparing densities in different-dimensional spaces.

Green's RJMCMC fixes this with a dimension-matching auxiliary variable. To move from $(k,\theta_k)$ to $(k',\theta_{k'})$ with $n_{k'}>n_k$, draw auxiliary $u\sim g$ and apply a diffeomorphism $\theta_{k'} = T(\theta_k, u)$. The acceptance ratio carries a Jacobian:

$$ \alpha = \min\!\left\{1,\; \frac{\pi(k',\theta_{k'})\,g'(u')\,r(k'\to k)}{\pi(k,\theta_k)\,g(u)\,r(k\to k')}\left|\frac{\partial(\theta_{k'},u')}{\partial(\theta_k,u)}\right|\right\}. $$

The classic example is a mixture model with a birth/death pair: birth splits one component into two ($\theta_1=\theta+u,\theta_2=\theta-u$, Jacobian $=2$); death merges two into one. Recent work uses RJMCMC inside simulated annealing to choose neural network architectures jointly with weights.

Figure 12 · Dimension-matching map: split / merge

$k=1$ parameter $\theta = \mu$ $k=2$ parameters $(\mu_1, \mu_2)$ auxiliary $u$

mean $\mu$ (in $k=1$) 0.5

auxiliary $u$ 1.2

Figure 13 puts the split/merge map inside a Markov chain. The left panel runs within-model Metropolis-Hastings on the $k=1$ parameter $\mu$; the right panel runs MH on the $k=2$ parameters $(\mu_1, \mu_2)$. With probability $p_\text{rj}$ the chain proposes a between-model move instead: a birth $\mu \mapsto (\mu+u, \mu-u)$ with $u\sim g$, or a death that merges $(\mu_1, \mu_2)$ to their midpoint. The acceptance ratio for each between-model move carries the Jacobian factor $2$ from the diffeomorphism. The model-order bar below the panels shows the chain's marginal occupation of $\{k=1, k=2\}$.

Figure 13 · RJMCMC as two coupled chains

target density current state recent trail RJ proposal

between-model proposal probability $p_{rj}$ 0.30

speed (steps/sec) 12

mcmc Simulated annealing

Same MH machinery, opposite goal. Fix a target $\pi(x) \propto e^{-U(x)}$, but instead of sampling from $\pi$, find its peak (the global minimum of $U$). Introduce a temperature $T$ and sample from a tempered distribution $\pi_T(x) \propto e^{-U(x)/T}$. The MH accept ratio for a symmetric proposal is

$$ r \;=\; \exp\!\bigl(-(U(x^*) - U(x))/T\bigr). $$

At $T=1$ this is ordinary MH on $\pi$. At $T \to \infty$ every proposal is accepted: the chain ignores the landscape and explores freely. At $T \to 0$ only downhill moves are accepted: the chain becomes greedy. Simulated annealing starts hot and cools: $T_t$ follows a schedule like the logarithmic $T_t = T_0/\log(1 + t)$ (slow enough to escape local minima; Geman & Geman, 1984) or the more practical exponential $T_t = T_0\,\alpha^t$ with $\alpha$ just below $1$.

Information-theory aside. The tempered family $\pi_T \propto e^{-U/T}$ is the Gibbs distribution at temperature $T$, with free energy $F(T) = -T \log Z(T)$. SA can be read as free-energy descent along the path of measures connecting the uniform distribution ($T \to \infty$) to a delta on the global minimum ($T \to 0$). The cooling schedule controls how much entropy the chain is allowed to keep at each step, the same shape as the entropy-regularized argmax that turns up in reinforcement learning and rate-distortion.

Figure 13b runs both walkers on the same triple-well $U(x)$: a fixed-$T$ MH chain that samples $\pi$ (it visits all three wells in proportion to their depth), and an SA chain that cools exponentially and ends up parked in the global minimum. Watch SA's acceptance rate plummet as $T \to 0$, and the rare hot-start jumps that let it escape the shallow wells before cooling makes that impossible.

Figure 13b · Fixed-T MH samples the distribution; SA cools to find the peak

initial temperature $T_0$ 3

cooling rate $alpha$ 0.992

speed (steps/sec) 24

4. Method choice by target

The decision table summarizes static-target methods. Tags echo the section colors: exact means returns-true-samples-from-$\pi$, i.i.d. means draws are independent, mcmc means correlated chain samples. For state-space targets (Kalman, EKF, UKF, particle filter), see Kalman & Particle Filters.

Method	Use when	Strengths	Pitfalls
Inverse CDF exacti.i.d.	Univariate target with invertible CDF (exponential, Cauchy, custom 1-D).	Exact, zero waste.	Useless beyond 1-D or without closed-form $F^{-1}$.
Rejection exacti.i.d.	Low-dimensional target with a known envelope $M\,q\geq\pi$.	Exact i.i.d. samples; simple to implement.	Acceptance rate $1/M$ collapses exponentially in dimension.
Importance sampling i.i.d.	Computing expectations under an unnormalized target; building block inside SMC.	Never rejects; self-normalized version handles unnormalized $\pi$; flexible $q$.	Weight degeneracy if $q$ poorly covers $\pi$ or in high $d$; monitor ESS.
Metropolis mcmc	Static posterior where a symmetric random-walk proposal is adequate.	Simpler acceptance ratio; needs only pointwise unnormalized $\pi$.	A special case of MH; proposal scale still controls mixing and rejection.
Metropolis-Hastings mcmc	General-purpose static posterior with no special structure; only ratios $\pi(x')/\pi(x)$ available.	Asymptotically exact; needs only pointwise unnormalized $\pi$.	Tuning the proposal scale is fiddly; mixes poorly in high $d$ or multimodal $\pi$.
Gibbs sampling mcmc	Hierarchical/conjugate models where every full conditional has closed form.	Zero tuning; acceptance rate 1; natural for graphical models.	Slow when variables are correlated; needs tractable conditionals (else MH-within-Gibbs).
RJMCMC mcmc	Joint inference over model order and parameters (mixture components, change-points, NN architecture).	Asymptotically exact across model space; principled Bayesian model selection.	Move design and Jacobians are problem-specific; mixing across dimensions is hard.

Quick decision tree.

1-D target with an invertible CDF? → inverse CDF.
Low-D target with a known envelope $M\,q\geq\pi$? → rejection.
Want an expectation under an unnormalized target with a reasonable proposal $q$? → importance sampling.
High-D static target, only ratios of $\pi$? → symmetric proposal: Metropolis; asymmetric: Metropolis-Hastings. Conjugate conditionals available? → Gibbs. Gradients available? → HMC/Langevin.
Model order itself is unknown? → RJMCMC.
Time-evolving hidden state? → see Kalman & Particle Filters.

What next

Sampling connects most directly to change of measure, approximate inference, dependence testing, and state-space filtering.

State space

Kalman & Particle Filters

The time-recursive cousin: sequential Monte Carlo and the Kalman family for problems where the target evolves over time.

Foundations

Measure Theory & Random Variables

Importance sampling is the Radon–Nikodym derivative identity turned into an estimator.

Approximation

Free Energy & Variational Inference

A complementary route to Bayesian computation: optimize a tractable approximation instead of drawing samples.

Processes

Markov Chains

Separate the general theory of transition matrices, mixing, recurrence, and CTMCs from their MCMC use here.

Testing

Distance Correlation

See another place resampling appears: permutation tests for nonlinear dependence.