Variational Bayes for Gaussian Mixtures

Mean-field VI on a mixture model: closed-form coordinate updates, a Dirichlet prior that prunes unused components automatically, and the ELBO as the convergence handle.

1. The model

Observations $\mathbf x_1, \ldots, \mathbf x_N \in \mathbb R^D$ are modelled as a mixture of $K$ Gaussians. Latent assignments $\mathbf z_n \in \{1,\ldots,K\}$ say which component each $\mathbf x_n$ came from. With conjugate priors,

$$ \begin{aligned} \boldsymbol\pi &\sim \mathrm{Dir}(\alpha_0, \ldots, \alpha_0), \\ (\boldsymbol\mu_k, \boldsymbol\Lambda_k) &\sim \mathrm{NW}(\mathbf m_0, \beta_0, \mathbf W_0, \nu_0), \\ \mathbf z_n \mid \boldsymbol\pi &\sim \mathrm{Cat}(\boldsymbol\pi), \\ \mathbf x_n \mid \mathbf z_n = k, \boldsymbol\mu_k, \boldsymbol\Lambda_k &\sim \mathcal N(\boldsymbol\mu_k, \boldsymbol\Lambda_k^{-1}). \end{aligned} $$

$\mathrm{NW}$ is the Normal–Wishart prior, conjugate for an unknown Gaussian mean and precision. The exact posterior $p(\mathbf Z, \boldsymbol\pi, \boldsymbol\mu, \boldsymbol\Lambda \mid \mathbf X)$ is intractable because the latents $\mathbf Z$ couple the component parameters through the data.
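To make the generative story concrete, here is a minimal NumPy sketch that samples from the model. It is a simplification: the component means and precisions are passed in directly rather than drawn from the Normal–Wishart prior, and the function name `sample_mixture` and its arguments are illustrative, not part of the original text.

```python
import numpy as np

def sample_mixture(N, alpha0, mus, precisions, seed=0):
    """Draw N points from the mixture; mus is (K, D), precisions is (K, D, D)."""
    rng = np.random.default_rng(seed)
    K, D = mus.shape
    # pi ~ Dir(alpha0, ..., alpha0)
    pi = rng.dirichlet(np.full(K, alpha0))
    # z_n | pi ~ Cat(pi)
    z = rng.choice(K, size=N, p=pi)
    # x_n | z_n = k ~ N(mu_k, Lambda_k^{-1}); invert each precision once
    covs = np.linalg.inv(precisions)
    X = np.stack([rng.multivariate_normal(mus[k], covs[k]) for k in z])
    return X, z, pi

# Two hypothetical components in 2-D with identity precision.
mus = np.array([[0.0, 0.0], [5.0, 5.0]])
precisions = np.stack([np.eye(2), np.eye(2)])
X, z, pi = sample_mixture(50, alpha0=1.0, mus=mus, precisions=precisions)
```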

2. The variational family

Pick the mean-field factorization

$$ q(\mathbf Z, \boldsymbol\pi, \boldsymbol\mu, \boldsymbol\Lambda) \;=\; q(\mathbf Z)\,q(\boldsymbol\pi)\,\prod_{k=1}^{K} q(\boldsymbol\mu_k, \boldsymbol\Lambda_k). $$

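The standard CAVI derivation (Bishop §10.2) shows that each optimal factor keeps the same parametric form as its prior:

$$ q(\mathbf Z) = \prod_{n=1}^{N} \mathrm{Cat}(\mathbf z_n \mid r_{n1}, \ldots, r_{nK}), \qquad q(\boldsymbol\pi) = \mathrm{Dir}(\alpha_1, \ldots, \alpha_K), \qquad q(\boldsymbol\mu_k, \boldsymbol\Lambda_k) = \mathrm{NW}(\mathbf m_k, \beta_k, \mathbf W_k, \nu_k). $$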

This is the simplest case where mean field gives genuinely closed-form updates for all factors. The price is the independence assumption between $\mathbf Z$ and the parameters: mean-field VB drops the posterior correlations between factors and tends to underestimate posterior variance, just as in the regression case.
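For reference, the generic coordinate update behind these factors can be written as follows, with $\boldsymbol\theta = (\boldsymbol\pi, \boldsymbol\mu, \boldsymbol\Lambda)$ as shorthand for the parameters. Each factor is set to the exponentiated expectation of the log joint under the other factor:

$$ \ln q^*(\mathbf Z) = \mathbb E_{q(\boldsymbol\theta)}\bigl[\ln p(\mathbf X, \mathbf Z, \boldsymbol\theta)\bigr] + \text{const}, \qquad \ln q^*(\boldsymbol\theta) = \mathbb E_{q(\mathbf Z)}\bigl[\ln p(\mathbf X, \mathbf Z, \boldsymbol\theta)\bigr] + \text{const}. $$

Because the priors are conjugate, both expectations are available in closed form, which is what makes the updates in the next section explicit.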

3. The CAVI updates

The variational expectations needed for the E-step are

$$ \begin{aligned} \mathbb E\!\bigl[\ln \pi_k\bigr] &= \psi(\alpha_k) - \psi(\textstyle\sum_j \alpha_j), \\ \mathbb E\!\bigl[\ln |\boldsymbol\Lambda_k|\bigr] &= \sum_{d=1}^D \psi\!\Bigl(\tfrac{\nu_k + 1 - d}{2}\Bigr) + D\ln 2 + \ln |\mathbf W_k|, \\ \mathbb E\!\bigl[(\mathbf x_n - \boldsymbol\mu_k)^\top \boldsymbol\Lambda_k (\mathbf x_n - \boldsymbol\mu_k)\bigr] &= \tfrac{D}{\beta_k} + \nu_k\,(\mathbf x_n - \mathbf m_k)^\top \mathbf W_k (\mathbf x_n - \mathbf m_k). \end{aligned} $$
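A minimal NumPy sketch of these three expectations follows. It assumes the variational parameters are stored as arrays `alpha` (K,), `beta` (K,), `nu` (K,), `m` (K, D) and `W` (K, D, D). Those names are illustrative, and the small `digamma` helper is a stand-in for `scipy.special.digamma` that keeps the example NumPy-only.

```python
import math
import numpy as np

def digamma(x):
    """Scalar digamma: shift x up by recurrence, then use the asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def expectations(X, alpha, beta, nu, m, W):
    """Return E[ln pi_k] (K,), E[ln |Lambda_k|] (K,) and E[Delta_nk] (N, K)."""
    N, D = X.shape
    K = alpha.shape[0]
    e_ln_pi = np.array([digamma(a) for a in alpha]) - digamma(alpha.sum())
    e_ln_det = np.empty(K)
    e_delta = np.empty((N, K))
    for k in range(K):
        _, logdet_W = np.linalg.slogdet(W[k])
        psi_sum = sum(digamma(0.5 * (nu[k] + 1 - d)) for d in range(1, D + 1))
        e_ln_det[k] = psi_sum + D * math.log(2.0) + logdet_W
        diff = X - m[k]                                    # (N, D)
        quad = np.einsum("nd,de,ne->n", diff, W[k], diff)  # per-point quadratic form
        e_delta[:, k] = D / beta[k] + nu[k] * quad
    return e_ln_pi, e_ln_det, e_delta
```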

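Writing $\Delta_{nk} = (\mathbf x_n - \boldsymbol\mu_k)^\top \boldsymbol\Lambda_k (\mathbf x_n - \boldsymbol\mu_k)$ for the quadratic form in the last expectation above, the responsibilities update is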

$$ \ln\rho_{nk} \;=\; \mathbb E[\ln\pi_k] + \tfrac12 \mathbb E[\ln|\boldsymbol\Lambda_k|] - \tfrac{D}{2}\ln(2\pi) - \tfrac12 \mathbb E[\Delta_{nk}], \qquad r_{nk} \;=\; \frac{\rho_{nk}}{\sum_j \rho_{nj}}. $$
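In practice $\rho_{nk}$ can underflow, so the normalization is usually done in log space. The sketch below assumes the three expectations arrive as arrays `e_ln_pi` (K,), `e_ln_det` (K,) and `e_delta` (N, K); these names are illustrative. Subtracting the row maximum before exponentiating leaves $r_{nk}$ unchanged, as does the constant $-\tfrac{D}{2}\ln(2\pi)$, which is kept only to mirror the formula.

```python
import numpy as np

def responsibilities(e_ln_pi, e_ln_det, e_delta, D):
    """Normalize ln rho_nk row-wise into responsibilities r_nk."""
    log_rho = (e_ln_pi + 0.5 * e_ln_det
               - 0.5 * D * np.log(2.0 * np.pi)
               - 0.5 * e_delta)                      # (N, K) by broadcasting
    log_rho -= log_rho.max(axis=1, keepdims=True)    # stabilize before exp
    r = np.exp(log_rho)
    r /= r.sum(axis=1, keepdims=True)
    return r
```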

The M-step computes weighted sufficient statistics $N_k = \sum_n r_{nk}$, $\bar{\mathbf x}_k = \frac1{N_k}\sum_n r_{nk}\mathbf x_n$, $\mathbf S_k = \frac1{N_k}\sum_n r_{nk} (\mathbf x_n - \bar{\mathbf x}_k)(\mathbf x_n - \bar{\mathbf x}_k)^\top$, and updates the variational hyperparameters:

$$ \begin{aligned} \alpha_k &= \alpha_0 + N_k, \\ \beta_k &= \beta_0 + N_k, \\ \nu_k &= \nu_0 + N_k, \\ \mathbf m_k &= \frac{\beta_0 \mathbf m_0 + N_k \bar{\mathbf x}_k}{\beta_k}, \\ \mathbf W_k^{-1} &= \mathbf W_0^{-1} + N_k \mathbf S_k + \frac{\beta_0 N_k}{\beta_k}(\bar{\mathbf x}_k - \mathbf m_0)(\bar{\mathbf x}_k - \mathbf m_0)^\top. \end{aligned} $$
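The M-step translates almost line by line into NumPy. This is a sketch under stated assumptions: `r` is the (N, K) responsibility matrix, the prior is passed as scalars `alpha0`, `beta0`, `nu0`, a vector `m0` (D,) and the inverse scale matrix `W0_inv` (D, D), and a tiny floor is added to $N_k$ so that empty components do not divide by zero. The floor and all names are choices made for this example.

```python
import numpy as np

def m_step(X, r, alpha0, beta0, nu0, m0, W0_inv):
    """Update (alpha, beta, nu, m, W) from responsibilities r of shape (N, K)."""
    N, D = X.shape
    K = r.shape[1]
    Nk = r.sum(axis=0) + 1e-10             # floor guards empty components (assumption)
    xbar = (r.T @ X) / Nk[:, None]         # weighted means, (K, D)
    alpha = alpha0 + Nk
    beta = beta0 + Nk
    nu = nu0 + Nk
    m = (beta0 * m0 + Nk[:, None] * xbar) / beta[:, None]
    W = np.empty((K, D, D))
    for k in range(K):
        diff = X - xbar[k]
        Sk = (r[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
        dm = (xbar[k] - m0)[:, None]
        W_inv = W0_inv + Nk[k] * Sk + (beta0 * Nk[k] / beta[k]) * (dm @ dm.T)
        W[k] = np.linalg.inv(W_inv)
    return alpha, beta, nu, m, W
```

An empty component ($N_k \approx 0$) simply falls back to the prior values, which is the behaviour the pruning argument relies on.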

The whole iteration is closed-form: no sampling, no inner Newton. Each step is guaranteed not to decrease the ELBO.
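The monotonicity guarantee doubles as a debugging check. Below is a sketch of a driver loop that uses it. The callables `update` (one full E-step plus M-step) and `elbo` are hypothetical placeholders supplied by the caller, and the tolerances are arbitrary defaults.

```python
def run_cavi(state, update, elbo, tol=1e-6, max_iter=200):
    """Iterate `update` until the ELBO gain drops below `tol`."""
    history = [elbo(state)]
    for _ in range(max_iter):
        state = update(state)
        current = elbo(state)
        # CAVI should never lower the ELBO; a drop beyond round-off signals a bug.
        if current < history[-1] - 1e-8 * max(1.0, abs(history[-1])):
            raise RuntimeError(f"ELBO decreased: {history[-1]} -> {current}")
        history.append(current)
        if history[-1] - history[-2] < tol:
            break
    return state, history
```

Keeping the full `history` makes it easy to plot the ELBO trace and spot plateaus.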

4. Automatic pruning

The Dirichlet concentration $\alpha_0$ is the load-bearing knob. The posterior mean mixing weight is $\mathbb E[\pi_k] = \alpha_k/(\sum_j \alpha_j) = (\alpha_0 + N_k)/(K\alpha_0 + N)$. With $\alpha_0 < 1$, the Dirichlet prior actively pushes mass off components that attract few data points. Empirically, starting with more components than the data need and a small $\alpha_0$ (e.g. $10^{-3}$) drives unused components to near-zero $\pi_k$ within a few iterations. They "die out" automatically without any model-selection step.
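A few lines of arithmetic show the effect of $\alpha_0$ on an empty component. The numbers here ($K = 10$, $N = 500$, $N_k = 0$) are made up for illustration.

```python
def expected_weight(alpha0, Nk, K, N):
    """Posterior mean mixing weight E[pi_k] = (alpha0 + N_k) / (K * alpha0 + N)."""
    return (alpha0 + Nk) / (K * alpha0 + N)

# Hypothetical setting: 10 components, 500 points, one component with no data.
for alpha0 in (1.0, 1e-3):
    print(alpha0, expected_weight(alpha0, Nk=0.0, K=10, N=500))
```

With $\alpha_0 = 1$ the empty component keeps a weight of $1/510$, while with $\alpha_0 = 10^{-3}$ its weight is about a thousand times smaller. The effect on the E-step is stronger still, because $\psi(\alpha) \approx -1/\alpha$ for small $\alpha$, so $\mathbb E[\ln \pi_k]$ becomes very negative and the component's responsibilities collapse.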

This is the main thing VB-GMM gives you that classical EM does not. EM has no mechanism to remove components; you would have to fit a sequence of models with $K = 1, 2, 3, \ldots$ and compare on an information criterion. Variational Bayes does it inside one fit.

Information-theory aside. The ELBO decomposes as $\mathcal L(q) = \mathbb E_q[\log p(\mathbf X, \mathbf Z, \boldsymbol\theta)] + H[q]$, the expected complete-data log-likelihood plus the entropy of $q$. The entropy contribution from $q(\mathbf Z)$ is exactly the soft assignment entropy $-\sum_{n,k} r_{nk} \log r_{nk}$. Read through this lens, pruning is a rate-distortion phenomenon: small $\alpha_0$ raises the Dirichlet KL cost of keeping a component "alive" (high rate), and CAVI trades that against the log-likelihood improvement (low distortion). When a component cannot earn its rate by reducing distortion, it is killed. CAVI itself is I-projection onto each coordinate slice of the mean-field manifold.
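The assignment-entropy term is easy to compute directly. This is a small pure-Python sketch, with `r` given as a list of rows and the usual convention $0 \ln 0 = 0$; the function name is illustrative.

```python
import math

def assignment_entropy(r):
    """Entropy of q(Z): -sum_{n,k} r_nk * ln r_nk, skipping zero entries."""
    return -sum(p * math.log(p) for row in r for p in row if p > 0.0)

# One maximally uncertain point and one hard assignment.
print(assignment_entropy([[0.5, 0.5], [1.0, 0.0]]))
```

Only the first point contributes, giving $\ln 2$; hard assignments add no entropy.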

5. Watching VB-GMM converge

Figure 1 runs the CAVI iteration on a small 2-D dataset drawn from 2–4 true Gaussian clusters, starting from up to $K = 10$ randomly initialized components. Watch the ellipses of surplus components fade as their $\pi_k$ shrink toward zero.

Figure 1 · CAVI updates on a 2-D Gaussian mixture. Legend: data points, coloured by argmax responsibility; true cluster means; component 1-σ ellipses, opacity = $\pi_k$.

6. VB vs. EM

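The CAVI updates above reduce to EM in a specific limit: set $\alpha_0 \to 0$, $\beta_0 \to 0$, $\nu_0 \to D-1$, and replace the variational expectations with the corresponding MAP values. You recover the familiar EM E and M steps for a GMM with maximum-likelihood point estimates of $\pi_k$, $\boldsymbol\mu_k$, $\boldsymbol\Lambda_k$. What VB adds: full variational posteriors over $\boldsymbol\pi$, $\boldsymbol\mu_k$, $\boldsymbol\Lambda_k$ in place of point estimates, automatic pruning of surplus components, and the ELBO as a convergence handle.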

What you lose: correlations between assignments and component parameters, and between component parameters across components. For mixture problems where component identifiability matters (label-switching, near-degenerate components), mean field's symmetry-breaking is brittle. MCMC or normalizing-flow VI handles this better at the cost of more compute.
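Returning to the EM limit stated at the start of this section, the M-step updates can be checked term by term. For a component with $N_k > 0$:

$$ \mathbb E[\pi_k] = \frac{\alpha_0 + N_k}{K\alpha_0 + N} \;\xrightarrow{\;\alpha_0 \to 0\;}\; \frac{N_k}{N}, \qquad \mathbf m_k = \frac{\beta_0 \mathbf m_0 + N_k \bar{\mathbf x}_k}{\beta_0 + N_k} \;\xrightarrow{\;\beta_0 \to 0\;}\; \bar{\mathbf x}_k, \qquad \mathbf W_k^{-1} \;\xrightarrow{\;\beta_0 \to 0\;}\; \mathbf W_0^{-1} + N_k \mathbf S_k. $$

The first two limits are the EM updates for the mixing weights and means. For the covariance, $\mathbb E[\boldsymbol\Lambda_k] = \nu_k \mathbf W_k$, so the implied covariance $\mathbf W_k^{-1}/\nu_k = (\mathbf W_0^{-1} + N_k \mathbf S_k)/(\nu_0 + N_k)$ approaches the EM estimate $\mathbf S_k$ once the prior terms $\mathbf W_0^{-1}$ and $\nu_0$ are negligible next to $N_k$.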

What next