Free Energy & Variational Inference

How an intractable Bayesian posterior turns into an optimization problem.

Bayesian inference asks: given data $y$ and a model $p(y,\theta) = p(y\mid\theta)\,p(\theta)$, what is the posterior $p(\theta\mid y)$? In principle you just apply Bayes: $p(\theta\mid y) = p(y,\theta)/p(y)$. In practice the marginal evidence $p(y) = \int p(y,\theta)\,d\theta$ is a high-dimensional integral that is almost never available in closed form. Variational inference sidesteps the integral by replacing "find the posterior" with "find the closest tractable distribution to the posterior", turning inference into optimization.

The posterior is the prior, tilted. Bayes' rule has a one-line measure-theoretic reading: the posterior measure is absolutely continuous with respect to the prior, with Radon–Nikodym derivative $$ \frac{dP_{\theta\mid y}}{dP_\theta}(\theta) \;\propto\; p(y\mid\theta). $$ Updating doesn't create probability out of nothing; it reweights the prior by the likelihood and renormalizes. Variational inference is the case where this tilt is intractable, so we minimize a KL gap to an approximating $q$ instead. Conjugate cases (see named distributions) are the ones where the tilt stays inside a finite-dimensional exponential family and updates have closed form.
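The reweight-and-renormalize reading can be tried directly on a grid. The following is a minimal sketch, assuming a hypothetical prior $\mathcal{N}(0,1)$, a unit-noise Gaussian likelihood, and a single observation $y=1$; these numbers and the helper names are illustrative and do not come from the text above.

```python
import math

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def tilt_prior_on_grid(grid, prior_weights, likelihood):
    """Reweight prior weights by the likelihood, then renormalize."""
    unnormalized = [w * likelihood(t) for t, w in zip(grid, prior_weights)]
    total = sum(unnormalized)  # grid stand-in for the evidence p(y)
    return [u / total for u in unnormalized]

# Hypothetical toy model: theta ~ N(0, 1), y | theta ~ N(theta, 1), y = 1.
grid = [-6.0 + 12.0 * i / 2000 for i in range(2001)]
prior = [normal_pdf(t, 0.0, 1.0) for t in grid]
prior_total = sum(prior)
prior = [w / prior_total for w in prior]

y = 1.0
posterior = tilt_prior_on_grid(grid, prior, lambda t: normal_pdf(y, t, 1.0))
posterior_mean = sum(t * w for t, w in zip(grid, posterior))
print(f"grid posterior mean = {posterior_mean:.4f}")
```

For this conjugate toy model the grid answer can be compared with the closed-form posterior mean of $0.5$. The grid only works because $\theta$ is one-dimensional; the normalizing sum is the step that becomes intractable in high dimensions.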

The closeness measure is the Kullback-Leibler divergence, and the quantity we actually optimize is the variational free energy $F(q,y)$. With the sign convention used here, $F$ is the ELBO itself; the physics convention attaches the name free energy to $-F$. Together they sit inside the single identity

$$ \ln p(y) \;=\; \underbrace{\mathrm{KL}\!\bigl[\,q(\theta)\,\Vert\,p(\theta\mid y)\,\bigr]}_{\geq 0} \;+\; F(q,y). $$

Variational inference uses KL as a building block. For the standalone intuition behind KL, including categorical examples and forward-vs-reverse behavior, see KL Divergence. Here the same directed gap becomes an inference algorithm.

1. KL inside variational inference

For two densities $q$ and $p$ on the same space, the KL divergence is

$$ \mathrm{KL}[q\,\Vert\,p] \;=\; \int q(\theta)\,\ln\frac{q(\theta)}{p(\theta)}\,d\theta \;=\; \mathbb{E}_q\!\left[\ln\frac{q(\theta)}{p(\theta)}\right]. $$

It is non-negative, zero only when $q=p$ almost everywhere, and directed: $\mathrm{KL}[q\Vert p] \neq \mathrm{KL}[p\Vert q]$ in general. Variational inference uses $\mathrm{KL}[q\Vert p(\theta\mid y)]$, the reverse or mode-seeking direction, because that is the one that drops out of the algebra below.
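The definition and the asymmetry can be checked numerically. This is a small sketch, assuming two hypothetical Gaussians $q=\mathcal{N}(0,1)$ and $p=\mathcal{N}(1,2^2)$; the function names are illustrative. It compares the standard closed form for the KL between two univariate Gaussians with a direct Riemann sum of the integrand.

```python
import math

def log_normal_pdf(x, mu, sd):
    """Log density of N(mu, sd^2) at x."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((x - mu) / sd) ** 2

def kl_gauss(mu_q, sd_q, mu_p, sd_p):
    """Closed-form KL[N(mu_q, sd_q^2) || N(mu_p, sd_p^2)]."""
    return (math.log(sd_p / sd_q)
            + (sd_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sd_p ** 2)
            - 0.5)

def kl_numeric(mu_q, sd_q, mu_p, sd_p, lo=-20.0, hi=20.0, n=40001):
    """Riemann sum of the integrand q(x) * ln(q(x) / p(x))."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        log_q = log_normal_pdf(x, mu_q, sd_q)
        log_p = log_normal_pdf(x, mu_p, sd_p)
        total += math.exp(log_q) * (log_q - log_p)
    return total * h

# Hypothetical pair: q = N(0, 1), p = N(1, 4).
kl_qp = kl_gauss(0.0, 1.0, 1.0, 2.0)   # KL[q || p]
kl_pq = kl_gauss(1.0, 2.0, 0.0, 1.0)   # KL[p || q]
kl_qp_numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)
print(f"KL[q||p] = {kl_qp:.4f} (numeric {kl_qp_numeric:.4f}), KL[p||q] = {kl_pq:.4f}")
```

The two directions give different numbers for the same pair of densities, which is the directedness referred to above.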

Drag $q$ and $p$. The shaded gap between the curves on the bottom plot is the integrand $q(\theta)\ln\bigl(q(\theta)/p(\theta)\bigr)$, that is, the log-ratio $\ln\bigl(q(\theta)/p(\theta)\bigr)$ weighted by $q(\theta)$. The point for VI is that changing $q$ changes both where the approximation puts mass and where the mismatch is measured.

Figure 1 · $\mathrm{KL}[q\Vert p]$ between two Gaussians. Legend: $q(\theta)$, $p(\theta)$, integrand of KL.

Figure 1a · Reverse KL mode-seeking on a bimodal target. Legend: target posterior $p(\theta\mid y)$, single-Gaussian $q(\theta)$, objective landscape.

2. The variational identity

The evidence is a log-partition function. Identify $E(\theta) = -\ln p(y,\theta)$. Then the marginal evidence is literally a partition function over parameters: $$ p(y) \;=\; \int p(y,\theta)\,d\theta \;=\; \int e^{-E(\theta)}\,d\theta \;=\; Z, $$ and the posterior is the Gibbs distribution at temperature $1$: $p(\theta\mid y) = e^{-E(\theta)}/Z$. In the Legendre-duality identity $\ln Z = \sup_q\!\bigl(\mathbb{E}_q[-E] + \mathrm{H}[q]\bigr)$, the bracketed objective is exactly the ELBO, and the supremum is attained at $q = p(\theta\mid y)$. So variational inference is statistical mechanics on probability distributions: maximizing $F$ is the same as minimizing the physicist's free energy $\mathbb{E}_q[E] - \mathrm{H}[q] = -F$, with the posterior as the equilibrium and the KL gap as the excess free energy.
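The partition-function reading is easiest to check on a finite state space, where the integral becomes a sum. The sketch below assumes four hypothetical energy levels (the values are arbitrary and chosen only for illustration) and evaluates $\mathbb{E}_q[-E]$ plus the entropy of $q$ at the Gibbs distribution and at a uniform $q$.

```python
import math

def gibbs_objective(q, energies):
    """E_q[-E] + H(q) for a discrete distribution q over energy levels."""
    return sum(qi * (-e - math.log(qi)) for qi, e in zip(q, energies) if qi > 0.0)

def kl_discrete(q, p):
    """KL[q || p] for discrete distributions with matching support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

# Hypothetical energies E(theta) = -ln p(y, theta) on four states.
energies = [0.3, 1.2, 2.0, 0.7]
Z = sum(math.exp(-e) for e in energies)
log_Z = math.log(Z)

gibbs = [math.exp(-e) / Z for e in energies]   # plays the role of the posterior
uniform = [0.25] * 4                           # a deliberately poor q

bound_at_gibbs = gibbs_objective(gibbs, energies)
bound_at_uniform = gibbs_objective(uniform, energies)
gap = log_Z - bound_at_uniform                 # should match KL[uniform || gibbs]
print(f"ln Z = {log_Z:.6f}, at Gibbs = {bound_at_gibbs:.6f}, at uniform = {bound_at_uniform:.6f}")
```

At the Gibbs distribution the objective matches $\ln Z$, and for any other $q$ the shortfall is the KL divergence from $q$ to the Gibbs distribution.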

Start from Bayes' rule, $p(y,\theta) = p(\theta\mid y)\,p(y)$, take logs, and play the classic multiply-and-divide-by-$q(\theta)$ trick:

$\displaystyle \ln p(y) \;=\; \ln\frac{p(y,\theta)}{p(\theta\mid y)}$
Bayes' rule, rearranged

$\displaystyle \phantom{\ln p(y)} \;=\; \int q(\theta)\,\ln\frac{p(y,\theta)}{p(\theta\mid y)}\,d\theta$
Multiply by $q(\theta)$ and integrate; $\ln p(y)$ is constant in $\theta$, $\int q = 1$

$\displaystyle \phantom{\ln p(y)} \;=\; \int q(\theta)\,\ln\!\left[\frac{p(y,\theta)}{p(\theta\mid y)}\cdot\frac{q(\theta)}{q(\theta)}\right]d\theta$
Multiply and divide by $q(\theta)$ inside the log

$\displaystyle \phantom{\ln p(y)} \;=\; \int q(\theta)\,\ln\frac{q(\theta)}{p(\theta\mid y)}\,d\theta \;+\; \int q(\theta)\,\ln\frac{p(y,\theta)}{q(\theta)}\,d\theta$
Split the log of a product

$\displaystyle \phantom{\ln p(y)} \;=\; \underbrace{\mathrm{KL}\!\bigl[q\,\Vert\,p(\cdot\mid y)\bigr]}_{\color{#b8412a}\text{divergence}\geq 0} \;+\; \underbrace{F(q,y)}_{\color{#1f4a8c}\text{free energy}}$
Read off the two pieces
Figure 1b · Step through the variational identity. Legend: $q(\theta)$, true posterior, KL, $F$.

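Two consequences follow. First, because the KL term is non-negative, $F(q,y) \leq \ln p(y)$ for every $q$: the free energy is a lower bound on the log-evidence, with equality when $q = p(\theta\mid y)$. Second, because $\ln p(y)$ does not depend on $q$, any increase in $F$ is matched by an equal decrease in $\mathrm{KL}[q\Vert p(\cdot\mid y)]$, so maximizing $F$ over $q$ is the same problem as minimizing the KL to the posterior, and it never requires evaluating $p(y)$.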

The picture: log-evidence is a fixed ceiling, $F$ rises toward it as we optimize, and the leftover gap is exactly the KL.

3. Visualizing the decomposition

Take a Bayesian inference problem with a closed-form posterior, so we have a ground truth to compare against. Model:

$$ \theta \sim \mathcal{N}(\mu_0, \sigma_0^2),\qquad y \mid \theta \sim \mathcal{N}(\theta, \sigma^2_{\!\text{lik}}). $$

With one observation $y$, the true posterior is $p(\theta\mid y) = \mathcal{N}(\mu^\star, \sigma^{\star 2})$ with $\sigma^{\star 2}=(1/\sigma_0^2+1/\sigma^2_{\!\text{lik}})^{-1}$ and $\mu^\star = \sigma^{\star 2}(\mu_0/\sigma_0^2 + y/\sigma^2_{\!\text{lik}})$. We pick a variational family $q(\theta)=\mathcal{N}(\mu_q,\sigma_q^2)$ and watch the identity $\ln p(y) = \mathrm{KL}[q\Vert p(\cdot\mid y)] + F(q,y)$ hold for every choice of $(\mu_q,\sigma_q)$, even bad ones.
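The identity can be checked in closed form for this model, since every term is a Gaussian expectation. The sketch below assumes hypothetical values $y=1.5$, $\mu_0=0$, $\sigma_0=1$, $\sigma_{\text{lik}}=0.5$ and uses illustrative function names; it relies on the standard fact that the marginal of $y$ is $\mathcal{N}(\mu_0, \sigma_0^2+\sigma^2_{\text{lik}})$.

```python
import math

def expected_log_normal(mean, var, center, sd):
    """E[ln N(x; center, sd^2)] when x has the given mean and variance."""
    return (-0.5 * math.log(2.0 * math.pi * sd ** 2)
            - ((mean - center) ** 2 + var) / (2.0 * sd ** 2))

def log_evidence(y, mu0, sd0, sd_lik):
    """ln p(y), using y ~ N(mu0, sd0^2 + sd_lik^2)."""
    var_y = sd0 ** 2 + sd_lik ** 2
    return -0.5 * math.log(2.0 * math.pi * var_y) - (y - mu0) ** 2 / (2.0 * var_y)

def exact_posterior(y, mu0, sd0, sd_lik):
    """Posterior mean and standard deviation from the formulas above."""
    var_star = 1.0 / (1.0 / sd0 ** 2 + 1.0 / sd_lik ** 2)
    mu_star = var_star * (mu0 / sd0 ** 2 + y / sd_lik ** 2)
    return mu_star, math.sqrt(var_star)

def kl_gauss(mu_q, sd_q, mu_p, sd_p):
    """Closed-form KL between two univariate Gaussians."""
    return (math.log(sd_p / sd_q)
            + (sd_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sd_p ** 2) - 0.5)

def free_energy(mu_q, sd_q, y, mu0, sd0, sd_lik):
    """F(q, y) = E_q[ln p(y|theta)] + E_q[ln p(theta)] + H[q]."""
    exp_loglik = expected_log_normal(mu_q, sd_q ** 2, y, sd_lik)
    exp_logprior = expected_log_normal(mu_q, sd_q ** 2, mu0, sd0)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * sd_q ** 2)
    return exp_loglik + exp_logprior + entropy

# Hypothetical settings, chosen only for illustration.
y, mu0, sd0, sd_lik = 1.5, 0.0, 1.0, 0.5
mu_star, sd_star = exact_posterior(y, mu0, sd0, sd_lik)
ceiling = log_evidence(y, mu0, sd0, sd_lik)

for mu_q, sd_q in [(0.0, 1.0), (3.0, 0.1), (mu_star, sd_star)]:
    kl = kl_gauss(mu_q, sd_q, mu_star, sd_star)
    f = free_energy(mu_q, sd_q, y, mu0, sd0, sd_lik)
    print(f"q=N({mu_q:.2f},{sd_q:.2f}^2): KL={kl:.4f}  F={f:.4f}  KL+F={kl + f:.4f}  ln p(y)={ceiling:.4f}")
```

The last two columns should agree on every row, including the deliberately bad choices of $q$, and the KL column should vanish on the row where $q$ is the exact posterior.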

The bar on the right of the figure shows the decomposition. The ceiling $\ln p(y)$ is constant (it depends on the data and model, not on $q$). As you move $q$ closer to the true posterior, the red KL band shrinks and the blue free-energy band fills in to meet it. The "Optimize" button runs gradient ascent on $F(q,y)$; you'll see $q$ settle onto the posterior.
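The page does not say how the optimizer behind the button is implemented. One possible sketch, assuming plain fixed-step gradient ascent on $(\mu_q, \ln\sigma_q)$ with hypothetical model values and step size, uses the closed-form gradients $\partial F/\partial\mu_q = (y-\mu_q)/\sigma^2_{\text{lik}} - (\mu_q-\mu_0)/\sigma_0^2$ and $\partial F/\partial\ln\sigma_q = 1 - \sigma_q^2\,(1/\sigma^2_{\text{lik}} + 1/\sigma_0^2)$:

```python
import math

def grad_free_energy(mu_q, log_sd_q, y, mu0, sd0, sd_lik):
    """Gradient of F with respect to mu_q and ln(sd_q) for the Gaussian-Gaussian model."""
    var_q = math.exp(2.0 * log_sd_q)
    d_mu = (y - mu_q) / sd_lik ** 2 - (mu_q - mu0) / sd0 ** 2
    d_log_sd = 1.0 - var_q * (1.0 / sd_lik ** 2 + 1.0 / sd0 ** 2)
    return d_mu, d_log_sd

def optimize_q(y, mu0, sd0, sd_lik, mu_q=-2.0, sd_q=1.0, step=0.1, iters=500):
    """Fixed-step gradient ascent; optimizing ln(sd_q) keeps sd_q positive."""
    log_sd_q = math.log(sd_q)
    for _ in range(iters):
        d_mu, d_log_sd = grad_free_energy(mu_q, log_sd_q, y, mu0, sd0, sd_lik)
        mu_q += step * d_mu
        log_sd_q += step * d_log_sd
    return mu_q, math.exp(log_sd_q)

# Hypothetical settings; the exact posterior here is N(1.2, 0.2).
y, mu0, sd0, sd_lik = 1.5, 0.0, 1.0, 0.5
mu_fit, sd_fit = optimize_q(y, mu0, sd0, sd_lik)

var_star = 1.0 / (1.0 / sd0 ** 2 + 1.0 / sd_lik ** 2)
mu_star = var_star * (mu0 / sd0 ** 2 + y / sd_lik ** 2)
print(f"fitted q: mean={mu_fit:.4f}, sd={sd_fit:.4f}")
print(f"exact   : mean={mu_star:.4f}, sd={math.sqrt(var_star):.4f}")
```

The step size of $0.1$ is tuned to these particular numbers; with a much sharper likelihood the same fixed step could overshoot, so treat it as a parameter to adjust rather than a safe default.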

Figure 2 · $\ln p(y) = \mathrm{KL}[q\Vert p(\cdot\mid y)] + F(q,y)$. Legend: prior $p(\theta)$, likelihood $p(y\mid\theta)$, true posterior $p(\theta\mid y)$, variational $q(\theta)$.

4. Two ways to read the free energy

The free energy admits a second decomposition that is often more useful for computation. Starting from its definition,

$$ F(q,y) = \int q(\theta)\,\ln\frac{p(y,\theta)}{q(\theta)}\,d\theta = \underbrace{\mathbb{E}_q[\ln p(y,\theta)]}_{\text{expected log-joint}} + \underbrace{\mathrm{H}[q]}_{\text{entropy of }q}. $$

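Maximizing $F$ trades two pressures. The expected log-joint rewards placing the mass of $q$ where $p(y,\theta)$ is large, which on its own would collapse $q$ onto the mode of the joint. The entropy rewards spreading $q$ out, which resists that collapse.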

Equivalently, and this is the form people optimize in practice, $F(q,y) = \mathbb{E}_q[\ln p(y\mid\theta)] - \mathrm{KL}[q\Vert p(\theta)]$: fit the data, but stay close to the prior.

The two contributions appear side by side as you change $q$. Watch the trade-off: shrinking $\sigma_q$ raises the fit term (if $\mu_q$ is in the right place) but lowers the entropy. The optimum balances them.
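Both readings of $F$ can be evaluated side by side for the Gaussian-Gaussian model. The sketch below assumes hypothetical values $y=1.5$, $\mu_0=0$, $\sigma_0=1$, $\sigma_{\text{lik}}=0.5$ and illustrative function names; it computes the expected log-joint and the entropy separately, then checks that their sum equals the expected log-likelihood minus the KL to the prior.

```python
import math

def expected_log_normal(mean, var, center, sd):
    """E[ln N(x; center, sd^2)] when x has the given mean and variance."""
    return (-0.5 * math.log(2.0 * math.pi * sd ** 2)
            - ((mean - center) ** 2 + var) / (2.0 * sd ** 2))

def kl_gauss(mu_q, sd_q, mu_p, sd_p):
    """Closed-form KL between two univariate Gaussians."""
    return (math.log(sd_p / sd_q)
            + (sd_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sd_p ** 2) - 0.5)

def free_energy_terms(mu_q, sd_q, y, mu0, sd0, sd_lik):
    """Return the pieces of both decompositions of F for q = N(mu_q, sd_q^2)."""
    exp_loglik = expected_log_normal(mu_q, sd_q ** 2, y, sd_lik)
    exp_logprior = expected_log_normal(mu_q, sd_q ** 2, mu0, sd0)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * sd_q ** 2)
    kl_to_prior = kl_gauss(mu_q, sd_q, mu0, sd0)
    return {
        "fit": exp_loglik + exp_logprior,        # expected log-joint
        "entropy": entropy,
        "F_joint_plus_entropy": exp_loglik + exp_logprior + entropy,
        "F_loglik_minus_kl": exp_loglik - kl_to_prior,
    }

# Hypothetical settings; mu_q is placed at the exact posterior mean 1.2.
y, mu0, sd0, sd_lik = 1.5, 0.0, 1.0, 0.5
wide = free_energy_terms(1.2, 1.0, y, mu0, sd0, sd_lik)
narrow = free_energy_terms(1.2, 0.2, y, mu0, sd0, sd_lik)
for name, terms in [("sd_q=1.0", wide), ("sd_q=0.2", narrow)]:
    print(name, {k: round(v, 4) for k, v in terms.items()})
```

Comparing the two rows shows the trade-off described above: the narrower $q$ has the larger fit term and the smaller entropy, and in both rows the two expressions for $F$ coincide.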

Figure 3 · $F = \mathbb{E}_q[\ln p(y,\theta)] + \mathrm{H}[q]$

Figure 4 · Mean-field VI underestimates correlated posterior variance. Legend: true correlated posterior, axis-aligned mean-field $q_1q_2$, marginal variance kept by reverse KL.

The heatmap below shows $F(q,y)$ as a function of $(\mu_q, \sigma_q)$ for the same Gaussian-Gaussian model. Click anywhere on the landscape to place $q$ there; the optimum's $(\mu^\star,\sigma^\star)$ is marked with a crosshair, an arrow shows the gradient direction at your current $q$, and the inset on the right plots that $q$ against the true posterior on the $\theta$ axis. The readout reports $F$, the constant ceiling $\ln p(y)$, and their gap, which is exactly $\mathrm{KL}[q\Vert p(\cdot\mid y)]$. As you slide $y$ or the likelihood noise, the whole landscape shifts.

Figure 5 · ELBO landscape over variational parameters. Legend: higher ELBO / posterior marker; optimizer trajectory / current $q$; $\nabla F$ direction.

5. Mean-field VI in Bayesian linear regression

Figure 4 above made the underestimation point abstractly: a 2D Gaussian with correlation, approximated by an axis-aligned $q_1 q_2$. The same picture becomes more concrete in Bayesian linear regression, where the correlation in the posterior is not assumed but arises from the data you collect. Click in the data panel to add points, drag to move them, or alt-click to remove. The model is $y_i = \alpha + \beta x_i + \epsilon_i$ with Gaussian noise and a Gaussian prior on $(\alpha, \beta)$, so the exact posterior is a (correlated) Gaussian we can plot directly against the best axis-aligned mean-field approximation.

The failure mode: when intercept and slope are correlated, reverse-KL mean-field VI cannot rotate its ellipse. It shrinks the marginal variances to avoid putting mass in low-posterior-density corners — the "VI underestimates uncertainty" result you saw in Figure 4, now made contingent on the data you place yourself.
Figure 6 · Bayesian linear regression posterior and mean-field approximation. Legend: exact posterior, mean-field VI, posterior predictive.

The next figure runs the actual coordinate-ascent VI (CAVI) updates. For a Gaussian posterior with covariance $\Sigma$ and precision $\Lambda = \Sigma^{-1}$, the reverse-KL mean-field optimum has the same mean as the exact posterior, but each factor variance is the reciprocal of the matching diagonal entry of the precision matrix, $1/\Lambda_{ii}$, which is never larger than the true marginal variance $\Sigma_{ii}$. Click the buttons to alternate between updating the intercept factor $q(\alpha)$ and the slope factor $q(\beta)$: the red ellipse snaps toward the coordinate-wise optimum and the ELBO rises. You can also click anywhere in the $(\alpha,\beta)$ panel to drop $q$ at a different starting position before running CAVI.
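For a two-dimensional Gaussian target the CAVI updates have a simple closed form. The sketch below is not the figure's implementation: it assumes the target is given directly by a hypothetical mean and precision matrix (the numbers are made up) and applies the standard Gaussian mean-field updates, in which each factor mean is the conditional mean given the other factor's current mean and each factor variance is the reciprocal of the matching precision diagonal.

```python
def mean_dependent_elbo(mu, mean, precision):
    """Part of the ELBO that depends on the factor means: -0.5 * d^T Lambda d."""
    d1, d2 = mu[0] - mean[0], mu[1] - mean[1]
    (l11, l12), (l21, l22) = precision
    return -0.5 * (l11 * d1 * d1 + (l12 + l21) * d1 * d2 + l22 * d2 * d2)

def cavi_gaussian_2d(mean, precision, start, sweeps=100):
    """Alternate closed-form updates of q(alpha) and q(beta) for a Gaussian target."""
    m1, m2 = mean
    (l11, l12), (l21, l22) = precision
    mu1, mu2 = start
    elbo_trace = [mean_dependent_elbo((mu1, mu2), mean, precision)]
    for _ in range(sweeps):
        mu1 = m1 - (l12 / l11) * (mu2 - m2)   # update q(alpha) with q(beta) fixed
        elbo_trace.append(mean_dependent_elbo((mu1, mu2), mean, precision))
        mu2 = m2 - (l21 / l22) * (mu1 - m1)   # update q(beta) with q(alpha) fixed
        elbo_trace.append(mean_dependent_elbo((mu1, mu2), mean, precision))
    factor_variances = (1.0 / l11, 1.0 / l22)  # fixed by the precision diagonal
    return (mu1, mu2), factor_variances, elbo_trace

# Hypothetical correlated target: mean and precision chosen for illustration.
target_mean = (1.0, -0.5)
target_precision = ((2.0, 1.2), (1.2, 1.5))
q_mean, q_vars, elbo_trace = cavi_gaussian_2d(target_mean, target_precision, start=(4.0, 3.0))

det = 2.0 * 1.5 - 1.2 * 1.2
exact_marginal_vars = (1.5 / det, 2.0 / det)   # diagonal of the inverse precision
print("q mean:", q_mean, "q variances:", q_vars)
print("exact marginal variances:", exact_marginal_vars)
```

The factor means converge to the exact posterior mean, while the factor variances stay below the exact marginal variances whenever the off-diagonal precision is nonzero. Only the mean-dependent part of the ELBO is tracked here, because the variance-dependent part is constant once the variances sit at their optimum.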

Figure 7 · Mean-field coordinate updates and ELBO. Legend: exact posterior, current $q(\alpha)q(\beta)$, ELBO trace.

When the $x$ values are concentrated on one side, many intercept-slope pairs explain the data nearly equally well. The exact posterior tilts along that tradeoff. The factorized approximation cannot represent the tilt, so reverse KL chooses a smaller axis-aligned ellipse — the same phenomenon as Figure 4, but here you can see exactly which data placements force it.
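The effect of data placement can be computed directly. The sketch below fills in details the text leaves open, so treat them as assumptions: a zero-mean isotropic prior with standard deviation `prior_sd`, a known noise standard deviation `noise_sd`, and two hypothetical sets of $x$ values. Under those assumptions the posterior precision is $\Lambda = I/\tau^2 + X^\top X/\sigma^2$ with design rows $(1, x_i)$, and the covariance does not depend on the observed $y_i$.

```python
import math

def posterior_precision(xs, prior_sd=10.0, noise_sd=1.0):
    """2x2 posterior precision for (alpha, beta) under the assumed prior and noise."""
    n = len(xs)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    a = 1.0 / prior_sd ** 2 + n / noise_sd ** 2        # Lambda_11 (intercept)
    b = sum_x / noise_sd ** 2                          # Lambda_12 = Lambda_21
    d = 1.0 / prior_sd ** 2 + sum_xx / noise_sd ** 2   # Lambda_22 (slope)
    return a, b, d

def intercept_summary(xs):
    """Exact vs mean-field variance of the intercept, and posterior correlation."""
    a, b, d = posterior_precision(xs)
    det = a * d - b * b
    exact_var = d / det                 # Sigma_11 from inverting the 2x2 precision
    mean_field_var = 1.0 / a            # reverse-KL mean-field optimum
    correlation = -b / math.sqrt(a * d)
    return exact_var, mean_field_var, correlation

# Hypothetical designs: x values all on one side of zero vs centered on zero.
one_sided = intercept_summary([4.0, 4.5, 5.0, 5.5, 6.0])
centered = intercept_summary([-1.0, -0.5, 0.0, 0.5, 1.0])

ratio_one_sided = one_sided[1] / one_sided[0]
ratio_centered = centered[1] / centered[0]
print(f"one-sided: corr={one_sided[2]:.3f}, mean-field/exact variance={ratio_one_sided:.3f}")
print(f"centered : corr={centered[2]:.3f}, mean-field/exact variance={ratio_centered:.3f}")
```

The variance ratio equals $1-\rho^2$, where $\rho$ is the posterior correlation between intercept and slope, so the one-sided design shows strong shrinkage while the centered design shows none.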

Related constructions

The common thread is the identity $\ln p(y) = \mathrm{KL}[q\Vert p(\cdot\mid y)] + F(q,y)$: everything from the mean-field updates of a topic model to the loss function of a VAE is a tactic for making one side of that equation easy to compute.

What next

Variational inference sits between measure-theoretic identities and sampling-based computation.