Variational Bayes for Gaussian Mixtures

Mean-field VI on a mixture model: closed-form coordinate updates, a Dirichlet prior that prunes unused components automatically, and the ELBO as the convergence handle.

1. The model

Observations $\mathbf x_1, \ldots, \mathbf x_N \in \mathbb R^D$ are modelled as a mixture of $K$ Gaussians. Latent assignments $\mathbf z_n \in \{1,\ldots,K\}$ say which component each $\mathbf x_n$ came from. With conjugate priors,

$$ \begin{aligned} \boldsymbol\pi &\sim \mathrm{Dir}(\alpha_0, \ldots, \alpha_0), \\ (\boldsymbol\mu_k, \boldsymbol\Lambda_k) &\sim \mathrm{NW}(\mathbf m_0, \beta_0, \mathbf W_0, \nu_0), \\ \mathbf z_n \mid \boldsymbol\pi &\sim \mathrm{Cat}(\boldsymbol\pi), \\ \mathbf x_n \mid \mathbf z_n = k, \boldsymbol\mu_k, \boldsymbol\Lambda_k &\sim \mathcal N(\boldsymbol\mu_k, \boldsymbol\Lambda_k^{-1}). \end{aligned} $$

$\mathrm{NW}$ is the Normal–Wishart prior, conjugate for an unknown Gaussian mean and precision. The exact posterior $p(\mathbf Z, \boldsymbol\pi, \boldsymbol\mu, \boldsymbol\Lambda \mid \mathbf X)$ is intractable because the latents $\mathbf Z$ couple the component parameters through the data.
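The generative story above can be sketched as a forward sampler. This is a minimal illustration, not a reference implementation: the function name `sample_vb_gmm_prior`, the helper `_cov`, and the default hyperparameter values ($\mathbf m_0 = \mathbf 0$, $\mathbf W_0 = \mathbf I/\nu_0$, integer $\nu_0$) are assumptions made for the example. The Wishart draw uses the sum-of-outer-products construction, which is only valid for integer $\nu_0 \ge D$.

```python
import numpy as np

def _cov(precision):
    """Invert a precision matrix and symmetrize against round-off."""
    cov = np.linalg.inv(precision)
    return 0.5 * (cov + cov.T)

def sample_vb_gmm_prior(N=200, K=3, D=2, alpha0=1.0, beta0=0.1, nu0=4, seed=0):
    """Draw (pi, mu, Lambda, z, X) once from the generative model."""
    if nu0 < D:
        raise ValueError("this Wishart construction needs integer nu0 >= D")
    rng = np.random.default_rng(seed)
    m0 = np.zeros(D)        # assumed prior mean
    W0 = np.eye(D) / nu0    # assumed scale, so E[Lambda_k] = nu0 * W0 = I

    # pi ~ Dir(alpha0, ..., alpha0)
    pi = rng.dirichlet(np.full(K, alpha0))

    mus = np.empty((K, D))
    Lams = np.empty((K, D, D))
    for k in range(K):
        # Lambda_k ~ Wishart(W0, nu0): sum of nu0 outer products of N(0, W0) draws
        A = rng.multivariate_normal(np.zeros(D), W0, size=nu0)
        Lams[k] = A.T @ A
        # mu_k | Lambda_k ~ N(m0, (beta0 * Lambda_k)^{-1})
        mus[k] = rng.multivariate_normal(m0, _cov(beta0 * Lams[k]))

    # z_n ~ Cat(pi); labels are 0-based here, 1-based in the text
    z = rng.choice(K, size=N, p=pi)
    # x_n | z_n = k ~ N(mu_k, Lambda_k^{-1})
    X = np.stack([rng.multivariate_normal(mus[k], _cov(Lams[k])) for k in z])
    return pi, mus, Lams, z, X

pi, mus, Lams, z, X = sample_vb_gmm_prior()
```

Inference runs this story backwards: only `X` is observed, and everything else has to be recovered.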

2. The variational family

Pick the mean-field factorization

$$ q(\mathbf Z, \boldsymbol\pi, \boldsymbol\mu, \boldsymbol\Lambda) \;=\; q(\mathbf Z)\,q(\boldsymbol\pi)\,\prod_{k=1}^{K} q(\boldsymbol\mu_k, \boldsymbol\Lambda_k). $$

The standard CAVI derivation (Bishop §10.2) recovers the same parametric forms as the prior:

$q(\mathbf Z) = \prod_n \mathrm{Cat}(\mathbf z_n \mid \mathbf r_n)$, a responsibility vector $\mathbf r_n$ per data point.
$q(\boldsymbol\pi) = \mathrm{Dir}(\boldsymbol\alpha)$, one pseudo-count $\alpha_k$ per component.
$q(\boldsymbol\mu_k, \boldsymbol\Lambda_k) = \mathrm{NW}(\mathbf m_k, \beta_k, \mathbf W_k, \nu_k)$, Normal–Wishart per component.

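Because every factor stays in the same conjugate family as its prior, all CAVI updates are closed-form; this is one of the simplest latent-variable models where that holds for every factor. The price is the assumed independence between $\mathbf Z$ and the parameters. Mean field discards the posterior correlations between those blocks and, just as in the regression case, tends to underestimate posterior uncertainty.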

3. The CAVI updates

Each CAVI sweep alternates between soft assignments and conjugate posterior parameters. The assignment step scores component $k$ by its expected log mixing weight plus its expected Gaussian log likelihood, then normalizes:

$$ \log r_{nk} \leftarrow \mathbb E[\log \pi_k] + \mathbb E[\log \mathcal N(\mathbf x_n \mid \boldsymbol\mu_k,\boldsymbol\Lambda_k^{-1})] \quad\text{then normalize over }k. $$
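A minimal NumPy sketch of the assignment step, under stated assumptions. The two expectations are evaluated with the standard Dirichlet and Normal–Wishart identities (as in Bishop §10.2): $\mathbb E[\log\pi_k] = \psi(\alpha_k) - \psi(\sum_j \alpha_j)$, $\mathbb E[\log|\boldsymbol\Lambda_k|] = \sum_{i=1}^{D}\psi\big(\tfrac{\nu_k+1-i}{2}\big) + D\log 2 + \log|\mathbf W_k|$, and $\mathbb E[(\mathbf x_n-\boldsymbol\mu_k)^\top\boldsymbol\Lambda_k(\mathbf x_n-\boldsymbol\mu_k)] = D/\beta_k + \nu_k(\mathbf x_n-\mathbf m_k)^\top\mathbf W_k(\mathbf x_n-\mathbf m_k)$. The names `digamma` and `e_step` are illustrative; `digamma` is a hand-rolled stand-in (recurrence plus asymptotic series) so the block depends only on NumPy, and a library routine such as `scipy.special.digamma` would normally be used instead.

```python
import numpy as np

def digamma(x):
    """Digamma for x > 0: shift up via psi(x) = psi(x+1) - 1/x, then asymptotic series."""
    x = np.asarray(x, dtype=float).copy()
    out = np.zeros_like(x)
    for _ in range(6):                      # enough shifts to reach x >= 6
        small = x < 6.0
        out = out - np.where(small, 1.0 / x, 0.0)
        x = np.where(small, x + 1.0, x)
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return out + np.log(x) - 0.5 / x - series

def e_step(X, alpha, m, beta, W, nu):
    """Responsibilities r[n, k] from the current variational parameters.

    Shapes: X (N, D), alpha (K,), m (K, D), beta (K,), W (K, D, D), nu (K,).
    """
    N, D = X.shape
    # E[log pi_k]
    e_log_pi = digamma(alpha) - digamma(alpha.sum())
    # E[log |Lambda_k|]
    i = np.arange(1, D + 1)
    e_log_det = (digamma((nu[:, None] + 1 - i) / 2).sum(axis=1)
                 + D * np.log(2.0) + np.linalg.slogdet(W)[1])
    # E[(x_n - mu_k)^T Lambda_k (x_n - mu_k)]
    diff = X[:, None, :] - m[None, :, :]                      # (N, K, D)
    quad = np.einsum('nkd,kde,nke->nk', diff, W, diff)        # (N, K)
    e_quad = D / beta + nu * quad
    # unnormalized log responsibilities
    log_rho = (e_log_pi + 0.5 * e_log_det
               - 0.5 * D * np.log(2 * np.pi) - 0.5 * e_quad)
    # normalize over k with the max subtracted for numerical stability
    log_rho = log_rho - log_rho.max(axis=1, keepdims=True)
    r = np.exp(log_rho)
    return r / r.sum(axis=1, keepdims=True)
```

Subtracting the row maximum before exponentiating leaves the normalized result unchanged and avoids underflow when every component scores a point very poorly.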

The parameter step computes the weighted sufficient statistics $N_k = \sum_n r_{nk}$, $\bar{\mathbf x}_k = \frac1{N_k}\sum_n r_{nk}\mathbf x_n$, and $\mathbf S_k = \frac1{N_k}\sum_n r_{nk}(\mathbf x_n - \bar{\mathbf x}_k)(\mathbf x_n - \bar{\mathbf x}_k)^\top$, then updates the Dirichlet and Normal–Wishart factors. The load-bearing update is the pseudo-count:

$$ \alpha_k = \alpha_0 + N_k. $$

The remaining Normal–Wishart updates have the same shape: prior strength plus the effective amount of data assigned to component $k$. The whole iteration is closed-form: no sampling, no inner Newton iterations. Each coordinate update is guaranteed not to decrease the ELBO.
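The parameter step can be sketched as follows. This is an illustration under assumptions: the function name `m_step` and the small floor on $N_k$ are choices made for the example, and the Normal–Wishart updates are the standard conjugate ones (as in Bishop §10.2), namely $\beta_k = \beta_0 + N_k$, $\nu_k = \nu_0 + N_k$, $\mathbf m_k = (\beta_0\mathbf m_0 + N_k\bar{\mathbf x}_k)/\beta_k$ and $\mathbf W_k^{-1} = \mathbf W_0^{-1} + N_k\mathbf S_k + \frac{\beta_0 N_k}{\beta_0 + N_k}(\bar{\mathbf x}_k - \mathbf m_0)(\bar{\mathbf x}_k - \mathbf m_0)^\top$.

```python
import numpy as np

def m_step(X, r, alpha0, m0, beta0, W0, nu0):
    """Update Dirichlet and Normal-Wishart parameters from responsibilities.

    Shapes: X (N, D), r (N, K), m0 (D,), W0 (D, D); scalars alpha0, beta0, nu0.
    """
    # weighted sufficient statistics
    Nk = r.sum(axis=0)                                   # (K,)
    Nk_safe = np.maximum(Nk, 1e-10)                      # floor used for division only
    xbar = (r.T @ X) / Nk_safe[:, None]                  # (K, D)
    diff = X[:, None, :] - xbar[None, :, :]              # (N, K, D)
    S = np.einsum('nk,nkd,nke->kde', r, diff, diff) / Nk_safe[:, None, None]

    # prior strength plus effective data
    alpha = alpha0 + Nk
    beta = beta0 + Nk
    nu = nu0 + Nk

    # posterior mean location: precision-weighted blend of prior mean and data mean
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]

    # inverse scale matrix: prior + scatter + shrinkage term
    dm = xbar - m0
    W_inv = (np.linalg.inv(W0)
             + Nk[:, None, None] * S
             + (beta0 * Nk / beta)[:, None, None] * np.einsum('kd,ke->kde', dm, dm))
    W = np.linalg.inv(W_inv)
    return alpha, m, beta, W, nu
```

A component with $N_k = 0$ comes back with exactly its prior parameters, which is the behaviour the pruning argument relies on.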

4. Automatic pruning

The Dirichlet concentration $\alpha_0$ is the load-bearing knob. The posterior mean mixing weight is $\mathbb E[\pi_k] = \alpha_k/(\sum_j \alpha_j) = (\alpha_0 + N_k)/(K\alpha_0 + N)$. With $\alpha_0 \lt 1$, the Dirichlet prior actively pushes mass off components that attract few data points. Empirically, starting with more components than the data need and a small $\alpha_0$ (e.g. $10^{-3}$) drives the expected weight $\mathbb E[\pi_k]$ of unused components to near zero within a few iterations, so they "die out" automatically without any model-selection step.

This is the main thing VB-GMM gives you that classical EM does not. EM has no mechanism to remove components; you would have to fit a sequence of models with $K = 1, 2, 3, \ldots$ and compare on an information criterion. Variational Bayes does it inside one fit.
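To put numbers on the pruning effect, the sketch below evaluates $\mathbb E[\pi_k] = (\alpha_0 + N_k)/(K\alpha_0 + N)$ for a hypothetical converged assignment: $K = 8$ components, three of which hold 80 points each while five are empty. The counts and the function name `expected_weights` are assumptions chosen for illustration.

```python
import numpy as np

def expected_weights(Nk, alpha0):
    """Posterior mean mixing weights (alpha0 + N_k) / (K * alpha0 + N)."""
    Nk = np.asarray(Nk, dtype=float)
    return (alpha0 + Nk) / (Nk.size * alpha0 + Nk.sum())

# hypothetical effective counts: 3 supported components, 5 empty ones
Nk = [80, 80, 80, 0, 0, 0, 0, 0]

for alpha0 in (1e-3, 1.0, 10.0):
    w = expected_weights(Nk, alpha0)
    # w[0] is a supported component, w[-1] an empty one
    print(f"alpha0={alpha0:g}  supported={w[0]:.4f}  empty={w[-1]:.2e}")
```

With $\alpha_0 = 10^{-3}$ an empty component keeps a weight of $10^{-3}/240.008$, roughly $4\times 10^{-6}$; with $\alpha_0 = 10$ it keeps $10/320 \approx 0.031$, so the prior alone holds it visibly alive.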

Information-theory aside. The ELBO decomposes as $\mathcal L(q) = \mathbb E_q[\log p(\mathbf X, \mathbf Z, \boldsymbol\theta)] + H[q]$, the expected complete-data log-likelihood plus the entropy of $q$. The entropy contribution from $q(\mathbf Z)$ is exactly the soft assignment entropy $-\sum_{n,k} r_{nk} \log r_{nk}$. Read through this lens, pruning is a rate-distortion phenomenon: small $\alpha_0$ raises the Dirichlet KL cost of keeping a component "alive" (high rate), and CAVI trades that against the log-likelihood improvement (low distortion). When a component cannot earn its rate by reducing distortion, it is killed. CAVI itself is I-projection onto each coordinate slice of the mean-field manifold.
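The assignment-entropy term is easy to compute directly from the responsibilities. A minimal sketch, with the function name `assignment_entropy` chosen for illustration and the usual convention $0 \log 0 = 0$ made explicit:

```python
import numpy as np

def assignment_entropy(r):
    """Entropy of q(Z): -sum_{n,k} r_nk log r_nk, treating 0 * log 0 as 0."""
    r = np.asarray(r, dtype=float)
    safe = np.where(r > 0, r, 1.0)      # log(1) = 0 where r is exactly zero
    return float(-(r * np.log(safe)).sum())

# fully uncertain assignments: every point split evenly over K = 4 components
uniform = np.full((5, 4), 0.25)
# fully confident assignments: one-hot rows
hard = np.eye(4)[[0, 1, 2, 3, 0]]

h_uniform = assignment_entropy(uniform)   # equals N * log K
h_hard = assignment_entropy(hard)         # equals 0
```

The entropy ranges from $0$ for hard assignments up to $N \log K$ for uniform ones, so this term shrinks as components die and the remaining assignments sharpen.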

5. Watching VB-GMM converge

Figure 1 runs the CAVI iteration on a 2-D dataset. The most important control is Dirichlet $\alpha_0$, shown on a log scale: push it below $1$ and unsupported components rapidly lose mixing weight; raise it and extra components remain alive longer. Start with up to $K = 10$ components placed by random initialization on a small dataset drawn from 2–4 true Gaussian clusters. Watch:

The component ellipses (one-sigma contours centred at $\mathbb E[\boldsymbol\mu_k] = \mathbf m_k$ with covariance $\mathbb E[\boldsymbol\Lambda_k]^{-1}$) settle onto the data clusters.
The right-side bar chart of $\mathbb E[\pi_k]$: components with no support shrink to near zero. With $\alpha_0 \lt 1$ the surviving components match the true cluster count.
The ELBO trace at the bottom climbs and plateaus.
Data points are coloured by their argmax responsibility; the dimmer dots are points where no component has a confident claim.
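The one-sigma ellipses in the first item can be computed from the variational parameters with an eigendecomposition. The sketch below assumes the 2-D case and uses $\mathbb E[\boldsymbol\Lambda_k] = \nu_k \mathbf W_k$ for the Wishart factor; the function name `one_sigma_ellipse` and its return layout are illustrative and are not taken from the figure's own code.

```python
import numpy as np

def one_sigma_ellipse(m_k, W_k, nu_k, n_points=64):
    """Semi-axes, orientation and boundary points of the 1-sigma contour (2-D)."""
    cov = np.linalg.inv(nu_k * W_k)            # E[Lambda_k]^{-1}
    evals, evecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    semi_axes = np.sqrt(evals[::-1])           # (major, minor)
    major = evecs[:, -1]                       # direction of the major axis
    angle = np.arctan2(major[1], major[0])     # radians; sign of eigenvector is arbitrary
    # boundary: m + V diag(sqrt(evals)) [cos t, sin t]^T
    t = np.linspace(0.0, 2.0 * np.pi, n_points)
    unit = np.stack([np.cos(t), np.sin(t)])    # (2, n_points)
    pts = m_k[:, None] + evecs @ (np.sqrt(evals)[:, None] * unit)
    return semi_axes, angle, pts.T             # pts has shape (n_points, 2)

# example: E[Lambda] = diag(1/4, 1), so the covariance is diag(4, 1)
m_k = np.array([1.0, 2.0])
nu_k = 5.0
W_k = np.diag([1.0 / 20.0, 1.0 / 5.0])
semi_axes, angle, pts = one_sigma_ellipse(m_k, W_k, nu_k)
```

Every returned boundary point sits at Mahalanobis distance one from $\mathbf m_k$ under the expected precision, which is a convenient sanity check when plotting.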

Figure 1 · CAVI updates on a 2-D Gaussian mixture (interactive figure).

Legend: data points coloured by argmax responsibility; true cluster means; component 1-σ ellipses with opacity = $\pi_k$.

Default controls: true clusters 3; data points per cluster 80; initial components $K$ 8; $\log_{10}\alpha_0$ $-2$; iterations per second 4.

6. VB vs. EM

The CAVI updates above reduce to EM in a specific limit: set $\alpha_0 \to 0$, $\beta_0 \to 0$, $\nu_0 \to D-1$, and replace the variational expectations with the corresponding MAP values. You recover the familiar EM E and M steps for a GMM with maximum-likelihood point estimates of $\pi_k$, $\boldsymbol\mu_k$, $\boldsymbol\Lambda_k$. What VB adds:

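Regularization. The Normal–Wishart prior keeps the covariance $\boldsymbol\Lambda_k^{-1}$ from collapsing onto a single data point, where the precision $\boldsymbol\Lambda_k$ diverges and the likelihood is unbounded. EM is vulnerable to this singularity; VB is not.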
Automatic component pruning. See above.
An ELBO. Stopping criterion, convergence diagnostic, and lower bound on the log-marginal likelihood, all from the same quantity.
Approximate posterior uncertainty. $q(\boldsymbol\mu_k, \boldsymbol\Lambda_k)$ gives parameter uncertainty, not just point estimates. Mean field underestimates it, but it is still more than a point estimate provides.
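The ELBO's role as stopping criterion and diagnostic can be sketched as a small check on the ELBO trace. The function name `elbo_status`, the relative tolerance, and the three status strings are assumptions for illustration. Since CAVI updates cannot decrease the ELBO, a drop larger than the tolerance usually points to a bug in an update or in the ELBO computation rather than to genuine behaviour.

```python
def elbo_status(trace, rtol=1e-6):
    """Classify the last ELBO step as 'running', 'converged' or 'decreased'."""
    if len(trace) < 2:
        return "running"
    prev, curr = trace[-2], trace[-1]
    tol = rtol * max(abs(prev), abs(curr), 1.0)
    if curr < prev - tol:
        return "decreased"      # violates monotonicity: inspect the updates
    if curr - prev <= tol:
        return "converged"      # improvement below tolerance: stop iterating
    return "running"

# hypothetical ELBO values recorded once per CAVI sweep
trace = [-1250.0, -980.0, -975.0, -975.0]
status = elbo_status(trace)
```

A typical loop would append the ELBO after each sweep and stop once the status is `"converged"`, treating `"decreased"` as an error.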

What you lose: correlations between assignments and component parameters, and between component parameters across components. For mixture problems where component identifiability matters (label-switching, near-degenerate components), mean field's symmetry-breaking is brittle. MCMC or normalizing-flow VI handles this better at the cost of more compute.

What next

Foundations

Free Energy & Variational Inference

The ELBO / free-energy identity that justifies the CAVI updates derived here.

Regression

Mean-Field VI in Bayesian Regression

The Bayesian linear regression cousin, with exact posterior available so the mean-field underestimation of variance is visible directly.

Hierarchy

Hierarchical Bayes

VB-GMM is hierarchical: components share a Dirichlet hyperprior. Pulling unused components toward the prior is the same dynamic.

Alternative

Monte Carlo & MCMC

Collapsed Gibbs and HMC are the sampling alternatives; slower but recover correlations VB misses.