Free Energy & Variational Inference

How an intractable Bayesian posterior turns into an optimization problem.

Bayesian inference asks: given data $y$ and a model $p(y,\theta) = p(y\mid\theta)\,p(\theta)$, what is the posterior $p(\theta\mid y)$? In principle you just apply Bayes: $p(\theta\mid y) = p(y,\theta)/p(y)$. In practice the marginal evidence $p(y) = \int p(y,\theta)\,d\theta$ is a high-dimensional integral that is almost never available in closed form. Variational inference sidesteps the integral by replacing "find the posterior" with "find the closest tractable distribution to the posterior", turning inference into optimization.

The posterior is the prior, tilted. Bayes' rule has a one-line measure-theoretic reading: the posterior measure is absolutely continuous with respect to the prior, with Radon–Nikodym derivative $$ \frac{dP_{\theta\mid y}}{dP_\theta}(\theta) \;\propto\; p(y\mid\theta). $$ Updating doesn't create probability out of nothing; it reweights the prior by the likelihood and renormalizes. Variational inference is the case where this tilt is intractable, so we minimize a KL gap to an approximating $q$ instead. Conjugate cases (see named distributions) are the ones where the tilt stays inside a finite-dimensional exponential family and updates have closed form.
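The "reweight and renormalize" reading can be checked on a grid. The sketch below is illustrative only: it assumes a hypothetical coin-flip model (Beta(2, 2) prior, 7 heads in 10 flips), which is not taken from this page, and compares the tilted prior against the known conjugate answer.

```python
import numpy as np

# Hypothetical example: Beta(2, 2) prior on a coin bias, 7 heads and 3 tails observed.
a, b = 2.0, 2.0
heads, tails = 7, 3

theta = np.linspace(1e-6, 1.0 - 1e-6, 20001)
dtheta = theta[1] - theta[0]

prior = theta**(a - 1) * (1 - theta)**(b - 1)    # unnormalized Beta(a, b) density
likelihood = theta**heads * (1 - theta)**tails   # p(y | theta) up to a constant

tilted = prior * likelihood                      # reweight the prior by the likelihood
posterior = tilted / (tilted.sum() * dtheta)     # renormalize with a Riemann sum

grid_mean = (theta * posterior).sum() * dtheta
# Conjugacy: the posterior is Beta(a + heads, b + tails), whose mean is known exactly.
conjugate_mean = (a + heads) / (a + b + heads + tails)

print(grid_mean, conjugate_mean)
```

In one dimension the renormalizing sum is trivial; the difficulty that variational inference addresses is that the same sum becomes a high-dimensional integral.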

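The closeness measure is the Kullback-Leibler divergence, and the quantity we actually optimize is the variational free energy $F(q,y)$. This page writes $F$ with the sign under which it is maximized, so it coincides with the ELBO; the physics convention flips the sign and minimizes it. Together they sit inside the single identity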

$$ \ln p(y) \;=\; \underbrace{\mathrm{KL}\!\bigl[\,q(\theta)\,\Vert\,p(\theta\mid y)\,\bigr]}_{\geq 0} \;+\; F(q,y). $$

Variational inference uses KL as a building block. For the standalone intuition behind KL, including categorical examples and forward-vs-reverse behavior, see KL Divergence. Here the same directed gap becomes an inference algorithm.

1. KL inside variational inference

For two densities $q$ and $p$ on the same space, the KL divergence is

$$ \mathrm{KL}[q\,\Vert\,p] \;=\; \int q(\theta)\,\ln\frac{q(\theta)}{p(\theta)}\,d\theta \;=\; \mathbb{E}_q\!\left[\ln\frac{q(\theta)}{p(\theta)}\right]. $$

It is non-negative, zero only when $q=p$ almost everywhere, and directed: $\mathrm{KL}[q\Vert p] \neq \mathrm{KL}[p\Vert q]$ in general. Variational inference uses $\mathrm{KL}[q\Vert p(\theta\mid y)]$, the reverse or mode-seeking direction, because that is the one that drops out of the algebra below.
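For two univariate Gaussians the integral has the standard closed form $\ln(\sigma_p/\sigma_q) + \bigl(\sigma_q^2 + (\mu_q-\mu_p)^2\bigr)/(2\sigma_p^2) - \tfrac12$. A minimal sketch, with arbitrarily chosen parameter values, that checks this against a numerical integral and shows the asymmetry:

```python
import numpy as np

def kl_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL[N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)]."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2)
            - 0.5)

def kl_numeric(mu_q, sigma_q, mu_p, sigma_p, n=200001, width=12.0):
    """Riemann-sum approximation of the integral of q * ln(q / p)."""
    theta = np.linspace(mu_q - width * sigma_q, mu_q + width * sigma_q, n)
    log_q = -0.5 * ((theta - mu_q) / sigma_q)**2 - np.log(sigma_q * np.sqrt(2 * np.pi))
    log_p = -0.5 * ((theta - mu_p) / sigma_p)**2 - np.log(sigma_p * np.sqrt(2 * np.pi))
    integrand = np.exp(log_q) * (log_q - log_p)
    return integrand.sum() * (theta[1] - theta[0])

kl_same = kl_gauss(0.0, 1.0, 0.0, 1.0)    # identical densities
forward = kl_gauss(0.0, 1.0, 1.0, 2.0)    # KL[q || p]
backward = kl_gauss(1.0, 2.0, 0.0, 1.0)   # KL[p || q], arguments swapped
numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)

print(kl_same, forward, backward, numeric)
```

The two directions differ here because the variances differ; with equal variances the Gaussian KL happens to be symmetric.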

Drag $q$ and $p$. The shaded gap between the curves on the bottom plot is the integrand $q(\theta)\ln\bigl(q(\theta)/p(\theta)\bigr)$, that is, the log-ratio $\ln(q/p)$ weighted by $q$. The point for VI is that changing $q$ changes both where the approximation puts mass and where the mismatch is measured.
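Because the mismatch is measured under $q$, a $q$ that sits on one mode of a multimodal target is barely penalized for ignoring the others. The sketch below illustrates this with a brute-force grid search; the bimodal target (an equal mixture of $\mathcal{N}(-3,1)$ and $\mathcal{N}(3,1)$) and the search ranges are assumptions chosen for illustration, not the settings of any figure on this page.

```python
import numpy as np

theta = np.linspace(-30.0, 30.0, 6001)
dtheta = theta[1] - theta[0]

def log_normal(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma)**2 - np.log(sigma * np.sqrt(2 * np.pi))

# Hypothetical bimodal target: 0.5 * N(-3, 1) + 0.5 * N(3, 1).
log_p = np.logaddexp(log_normal(theta, -3.0, 1.0),
                     log_normal(theta, 3.0, 1.0)) + np.log(0.5)

def reverse_kl(mu, sigma):
    """KL[q || p] for q = N(mu, sigma^2), by Riemann sum on the grid."""
    log_q = log_normal(theta, mu, sigma)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * dtheta

# Brute-force search over single-Gaussian candidates.
mus = np.linspace(-5.0, 5.0, 101)
sigmas = np.linspace(0.4, 4.0, 73)
kl_grid = np.array([[reverse_kl(m, s) for s in sigmas] for m in mus])
i, j = np.unravel_index(np.argmin(kl_grid), kl_grid.shape)
best_mu, best_sigma = mus[i], sigmas[j]

# Forward KL[p || q] over Gaussians is minimized by matching moments of p.
p = np.exp(log_p)
mm_mu = np.sum(theta * p) * dtheta
mm_sigma = np.sqrt(np.sum((theta - mm_mu)**2 * p) * dtheta)

print("reverse-KL fit:", best_mu, best_sigma)
print("moment-matched fit:", mm_mu, mm_sigma)
```

The reverse-KL fit locks onto one mode with a width close to that mode's width, while the moment-matched Gaussian straddles both modes.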

Figure 1 · $\mathrm{KL}[q\Vert p]$ between two Gaussians. Legend: $q(\theta)$, $p(\theta)$, integrand of KL. Controls: $\mu_q = 0$, $\sigma_q = 1$, $\mu_p = 1$, $\sigma_p = 1$.

Figure 1a · Reverse KL mode-seeking on a bimodal target. Legend: target posterior $p(\theta\mid y)$, single-Gaussian $q(\theta)$, objective landscape. Controls: $\mu_q = 0$, $\sigma_q = 1.4$.

2. The variational identity

The evidence is a log-partition function. Define the energy $E(\theta) = -\ln p(y,\theta)$. Then the marginal evidence is literally a partition function over parameters: $$ p(y) \;=\; \int p(y,\theta)\,d\theta \;=\; \int e^{-E(\theta)}\,d\theta \;=\; Z, $$ and the posterior is the Gibbs distribution at temperature $1$: $p(\theta\mid y) = e^{-E(\theta)}/Z$. The Legendre-duality identity $\ln Z = \sup_q\!\bigl(\mathbb{E}_q[-E] + \mathrm{H}[q]\bigr)$ then says that $\ln p(y)$ is the supremum of the ELBO over $q$, attained at $q = p(\theta\mid y)$. So variational inference is statistical mechanics on probability distributions: minimizing the physical free energy $-F$ (equivalently, maximizing $F$ as defined on this page), with the posterior as the equilibrium and the KL gap as the excess free energy.
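On a finite parameter space the duality can be checked directly, since the integral becomes a sum. The following is a small sketch under assumed inputs: the five energies are random placeholders rather than values from any model on this page.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical energies E_i = -ln p(y, theta_i) on a 5-state parameter space.
energy = rng.normal(size=5)
log_Z = np.log(np.sum(np.exp(-energy)))   # ln Z = ln p(y)
gibbs = np.exp(-energy - log_Z)           # posterior = Gibbs distribution at temperature 1

def free_energy(q):
    """F(q) = E_q[-E] + H[q] for a discrete q with strictly positive entries."""
    return np.sum(q * (-energy)) - np.sum(q * np.log(q))

# Randomly drawn distributions stay at or below ln Z ...
random_qs = rng.dirichlet(np.ones(5), size=1000)
max_random_F = max(free_energy(q) for q in random_qs)

# ... and the Gibbs distribution attains it.
F_at_gibbs = free_energy(gibbs)

print(log_Z, max_random_F, F_at_gibbs)
```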

Start from Bayes' rule, $p(y,\theta) = p(\theta\mid y)\,p(y)$, take logs, and play the classic multiply-and-divide-by-$q(\theta)$ trick:

$\displaystyle \ln p(y) \;=\; \ln\frac{p(y,\theta)}{p(\theta\mid y)}$

Bayes' rule, rearranged

$\displaystyle \phantom{\ln p(y)} \;=\; \int q(\theta)\,\ln\frac{p(y,\theta)}{p(\theta\mid y)}\,d\theta$

Multiply by $q(\theta)$ and integrate; $\ln p(y)$ is constant in $\theta$, $\int q = 1$

$\displaystyle \phantom{\ln p(y)} \;=\; \int q(\theta)\,\ln\!\left[\frac{p(y,\theta)}{p(\theta\mid y)}\cdot\frac{q(\theta)}{q(\theta)}\right]d\theta$

Multiply and divide by $q(\theta)$ inside the log

$\displaystyle \phantom{\ln p(y)} \;=\; \int q(\theta)\,\ln\frac{q(\theta)}{p(\theta\mid y)}\,d\theta \;+\; \int q(\theta)\,\ln\frac{p(y,\theta)}{q(\theta)}\,d\theta$

Split the log of a product

$\displaystyle \phantom{\ln p(y)} \;=\; \underbrace{\mathrm{KL}\!\bigl[q\,\Vert\,p(\cdot\mid y)\bigr]}_{\color{#b8412a}\text{divergence}\geq 0} \;+\; \underbrace{F(q,y)}_{\color{#1f4a8c}\text{free energy}}$

Read off the two pieces

Figure 1b · Step through the variational identity. Legend: $q(\theta)$, true posterior, KL, $F$. Controls: derivation step = 3, $\mu_q = 0.0$, $\sigma_q = 1.50$.

Two consequences follow:

Because KL is non-negative, the identity gives $F(q,y) \le \ln p(y)$ for every $q$. The free energy is a lower bound on the log evidence, hence the name Evidence Lower BOund (ELBO).
Because $\ln p(y)$ is constant in $q$, maximizing $F(q,y)$ is equivalent to minimizing $\mathrm{KL}[q\Vert p(\cdot\mid y)]$. And since $F$ involves only the joint $p(y,\theta)$ and $q$, never the posterior or $p(y)$ itself, we have turned an intractable integral into a tractable optimization.

The picture: log-evidence is a fixed ceiling, $F$ rises toward it as we optimize, and the leftover gap is exactly the KL.

3. Visualizing the decomposition

Take a Bayesian inference problem with a closed-form posterior, so we have a ground truth to compare against. Model:

$$ \theta \sim \mathcal{N}(\mu_0, \sigma_0^2),\qquad y \mid \theta \sim \mathcal{N}(\theta, \sigma^2_{\!\text{lik}}). $$

With one observation $y$, the true posterior is $p(\theta\mid y) = \mathcal{N}(\mu^\star, \sigma^{\star 2})$ with $\sigma^{\star 2}=(1/\sigma_0^2+1/\sigma^2_{\!\text{lik}})^{-1}$ and $\mu^\star = \sigma^{\star 2}(\mu_0/\sigma_0^2 + y/\sigma^2_{\!\text{lik}})$. We pick a variational family $q(\theta)=\mathcal{N}(\mu_q,\sigma_q^2)$ and watch the identity $\ln p(y) = \mathrm{KL}[q\Vert p(\cdot\mid y)] + F(q,y)$ hold for every choice of $(\mu_q,\sigma_q)$, even bad ones.
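For this model every term is available in closed form, so the identity can be verified numerically. The sketch below assumes a prior $\mathcal{N}(0,1)$ (the page does not state $\mu_0$ and $\sigma_0$), uses $y = 2$ and $\sigma_{\text{lik}} = 1$, and relies on the standard fact that the marginal is $y \sim \mathcal{N}(\mu_0, \sigma_0^2 + \sigma_{\text{lik}}^2)$.

```python
import numpy as np

# Assumed values: the prior N(0, 1) is a placeholder; y and s_lik follow the text.
y, mu0, s0, s_lik = 2.0, 0.0, 1.0, 1.0

# Exact posterior from the formulas above.
post_var = 1.0 / (1.0 / s0**2 + 1.0 / s_lik**2)
post_mean = post_var * (mu0 / s0**2 + y / s_lik**2)

# Log evidence: marginally y ~ N(mu0, s0^2 + s_lik^2).
ev_var = s0**2 + s_lik**2
log_evidence = -0.5 * np.log(2 * np.pi * ev_var) - (y - mu0)**2 / (2 * ev_var)

def free_energy(mu_q, s_q):
    """F(q, y) = E_q[ln p(y|theta)] + E_q[ln p(theta)] + H[q] for Gaussian q."""
    fit_lik = -0.5 * np.log(2 * np.pi * s_lik**2) - ((y - mu_q)**2 + s_q**2) / (2 * s_lik**2)
    fit_prior = -0.5 * np.log(2 * np.pi * s0**2) - ((mu_q - mu0)**2 + s_q**2) / (2 * s0**2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s_q**2)
    return fit_lik + fit_prior + entropy

def kl_to_posterior(mu_q, s_q):
    """KL[q || p(.|y)] between two univariate Gaussians."""
    return (0.5 * np.log(post_var / s_q**2)
            + (s_q**2 + (mu_q - post_mean)**2) / (2 * post_var) - 0.5)

# The identity should hold for good and bad choices of q alike.
candidates = [(0.0, 1.5), (1.0, np.sqrt(0.5)), (-3.0, 0.2), (5.0, 4.0)]
gaps = [kl_to_posterior(m, s) + free_energy(m, s) - log_evidence for m, s in candidates]

print(gaps)
```

Each entry of `gaps` should be zero up to floating-point error.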

The bar on the right of the figure shows the decomposition. The ceiling $\ln p(y)$ is constant (it depends on the data and model, not on $q$). As you move $q$ closer to the true posterior, the red KL band shrinks and the blue free-energy band fills in to meet it. The "Optimize" button runs gradient ascent on $F(q,y)$; you'll see $q$ settle onto the posterior.
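One way such an optimizer could look is sketched below. It is not the page's actual implementation: the prior $\mathcal{N}(0,1)$, the step size, the iteration count, and the choice to optimize $\ln\sigma_q$ (which keeps $\sigma_q$ positive) are all assumptions. The gradients follow from differentiating the closed-form $F$ for this model.

```python
import numpy as np

# Assumed values: prior N(0, 1) is a placeholder; y, s_lik and the start point
# mirror the Figure 2 slider defaults.
y, mu0, s0, s_lik = 2.0, 0.0, 1.0, 1.0

def free_energy(mu_q, s_q):
    fit_lik = -0.5 * np.log(2 * np.pi * s_lik**2) - ((y - mu_q)**2 + s_q**2) / (2 * s_lik**2)
    fit_prior = -0.5 * np.log(2 * np.pi * s0**2) - ((mu_q - mu0)**2 + s_q**2) / (2 * s0**2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s_q**2)
    return fit_lik + fit_prior + entropy

mu_q, log_s_q = 0.0, np.log(1.5)
step = 0.1                      # hypothetical step size
trace = [free_energy(mu_q, np.exp(log_s_q))]

for _ in range(500):
    s_q = np.exp(log_s_q)
    # dF/dmu_q: pull toward the data and toward the prior mean.
    grad_mu = (y - mu_q) / s_lik**2 - (mu_q - mu0) / s0**2
    # dF/d(ln s_q) = s_q * dF/ds_q: entropy gain minus the fit penalty for spreading out.
    grad_log_s = 1.0 - s_q**2 * (1.0 / s_lik**2 + 1.0 / s0**2)
    mu_q += step * grad_mu
    log_s_q += step * grad_log_s
    trace.append(free_energy(mu_q, np.exp(log_s_q)))

post_var = 1.0 / (1.0 / s0**2 + 1.0 / s_lik**2)
post_mean = post_var * (mu0 / s0**2 + y / s_lik**2)
print(mu_q, np.exp(log_s_q), "vs exact", post_mean, np.sqrt(post_var))
```

Because the variational family contains the exact posterior here, the fixed point of the ascent is the posterior itself.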

Figure 2 · $\ln p(y) = \mathrm{KL}[q\Vert p(\cdot\mid y)] + F(q,y)$. Legend: prior $p(\theta)$, likelihood $p(y\mid\theta)$, true posterior $p(\theta\mid y)$, variational $q(\theta)$. Controls: observation $y = 2$, likelihood noise $\sigma_{\!\text{lik}} = 1$, $\mu_q = 0$, $\sigma_q = 1.5$.

4. Two ways to read the free energy

The free energy admits a second decomposition that is often more useful for computation. Starting from its definition,

$$ F(q,y) = \int q(\theta)\,\ln\frac{p(y,\theta)}{q(\theta)}\,d\theta = \underbrace{\mathbb{E}_q[\ln p(y,\theta)]}_{\text{expected log-joint}} + \underbrace{\mathrm{H}[q]}_{\text{entropy of }q}. $$

Maximizing $F$ trades two pressures:

The expected log-joint $\mathbb{E}_q[\ln p(y,\theta)]$ pulls $q$ toward regions where the model thinks the data are likely, where high prior meets high likelihood. It is a "fit" term.
The entropy $\mathrm{H}[q] = -\int q\ln q$ pushes $q$ to spread out. It is a "don't be overconfident" term.

Equivalently, and this is the form people optimize in practice, $F(q,y) = \mathbb{E}_q[\ln p(y\mid\theta)] - \mathrm{KL}[q\Vert p(\theta)]$: fit the data, but stay close to the prior.
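The second form follows by writing $\ln p(y,\theta) = \ln p(y\mid\theta) + \ln p(\theta)$ and absorbing the prior term and the entropy into $-\mathrm{KL}[q\Vert p(\theta)]$. A sketch of how it might be estimated in practice: the data-fit expectation is approximated by Monte Carlo with reparameterized samples $\theta = \mu_q + \sigma_q\epsilon$, while the KL to the prior stays analytic. The prior $\mathcal{N}(0,1)$, the particular $q$, the sample size, and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed values: prior N(0, 1) and the particular q below are placeholders.
y, mu0, s0, s_lik = 2.0, 0.0, 1.0, 1.0
mu_q, s_q = 0.5, 0.8

# Reparameterized samples from q.
eps = rng.standard_normal(200_000)
theta = mu_q + s_q * eps

# Monte Carlo estimate of the fit term E_q[ln p(y | theta)].
log_lik = -0.5 * np.log(2 * np.pi * s_lik**2) - (y - theta)**2 / (2 * s_lik**2)
fit_mc = log_lik.mean()

# Analytic KL[q || p(theta)] between two Gaussians.
kl_to_prior = np.log(s0 / s_q) + (s_q**2 + (mu_q - mu0)**2) / (2 * s0**2) - 0.5
F_mc = fit_mc - kl_to_prior

# Reference value from the first form: expected log-joint plus entropy.
expected_log_joint = (
    -0.5 * np.log(2 * np.pi * s_lik**2) - ((y - mu_q)**2 + s_q**2) / (2 * s_lik**2)
    - 0.5 * np.log(2 * np.pi * s0**2) - ((mu_q - mu0)**2 + s_q**2) / (2 * s0**2)
)
entropy = 0.5 * np.log(2 * np.pi * np.e * s_q**2)
F_exact = expected_log_joint + entropy

print(F_mc, F_exact)
```

The two numbers agree up to Monte Carlo noise, which shrinks as the sample size grows.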

The two contributions appear side by side as you change $q$. Watch the trade-off: shrinking $\sigma_q$ raises the fit term (if $\mu_q$ is in the right place) but lowers the entropy. The optimum balances them.
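The trade-off can be tabulated directly. In this sketch $\mu_q$ is held at the posterior mean and only $\sigma_q$ is swept; the prior $\mathcal{N}(0,1)$ and the sweep range are assumptions.

```python
import numpy as np

# Assumed values: prior N(0, 1) is a placeholder; y = 2 and s_lik = 1 follow the figures.
y, mu0, s0, s_lik = 2.0, 0.0, 1.0, 1.0
post_var = 1.0 / (1.0 / s0**2 + 1.0 / s_lik**2)
post_mean = post_var * (mu0 / s0**2 + y / s_lik**2)

mu_q = post_mean                       # put the mean "in the right place"
sigmas = np.linspace(0.1, 3.0, 2901)   # sweep sigma_q in steps of 0.001

# Fit term E_q[ln p(y, theta)]: falls as q spreads out.
fit = (-0.5 * np.log(2 * np.pi * s_lik**2) - ((y - mu_q)**2 + sigmas**2) / (2 * s_lik**2)
       - 0.5 * np.log(2 * np.pi * s0**2) - ((mu_q - mu0)**2 + sigmas**2) / (2 * s0**2))
# Entropy H[q]: rises as q spreads out.
entropy = 0.5 * np.log(2 * np.pi * np.e * sigmas**2)

F = fit + entropy
best_sigma = sigmas[np.argmax(F)]

print(best_sigma, np.sqrt(post_var))
```

The maximizer of the sum lands on the posterior standard deviation, up to the grid resolution.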

Figure 3 · $F = \mathbb{E}_q[\ln p(y,\theta)] + \mathrm{H}[q]$. Controls: $\mu_q = 0$, $\sigma_q = 1.5$, observation $y = 2$.

Figure 4 · Mean-field VI underestimates correlated posterior variance. Legend: true correlated posterior, axis-aligned mean-field $q_1q_2$, marginal variance kept by reverse KL. Controls: posterior correlation $\rho = 0.78$.

The heatmap below shows $F(q,y)$ as a function of $(\mu_q, \sigma_q)$ for the same Gaussian-Gaussian model. Click anywhere on the landscape to place $q$ there; the optimum's $(\mu^\star,\sigma^\star)$ is marked with a crosshair, an arrow shows the gradient direction at your current $q$, and the inset on the right plots that $q$ against the true posterior on the $\theta$ axis. The readout reports $F$, the constant ceiling $\ln p(y)$, and their gap, which is exactly $\mathrm{KL}[q\Vert p(\cdot\mid y)]$. As you slide $y$ or the likelihood noise, the whole landscape shifts.

Figure 5 · ELBO landscape over variational parameters. Legend: higher ELBO / posterior marker, optimizer trajectory / current $q$, $\nabla F$ direction. Controls: observation $y = 2$, likelihood noise $\sigma_{\!\text{lik}} = 1$.

5. Mean-field VI in Bayesian linear regression

Figure 4 above made the underestimation point abstractly: a 2D Gaussian with correlation, approximated by an axis-aligned $q_1 q_2$. The same picture becomes more concrete in Bayesian linear regression, where the correlation in the posterior is not assumed but arises from the data you collect. Click in the data panel to add points, drag to move them, or alt-click to remove. The model is $y_i = \alpha + \beta x_i + \epsilon_i$ with Gaussian noise and a Gaussian prior on $(\alpha, \beta)$, so the exact posterior is a (correlated) Gaussian we can plot directly against the best axis-aligned mean-field approximation.

The failure mode: when intercept and slope are correlated, reverse-KL mean-field VI cannot rotate its ellipse. Instead it shrinks the marginal variances to avoid putting mass in low-posterior-density corners. This is the "VI underestimates uncertainty" result you saw in Figure 4, now made contingent on the data you place yourself.
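The size of the shrinkage has a simple form in the two-dimensional Gaussian case. The sketch assumes unit marginal variances and uses the correlation value shown on the Figure 4 slider; for a Gaussian target, each reverse-KL mean-field factor has variance equal to the reciprocal of the matching diagonal entry of the precision matrix.

```python
import numpy as np

rho = 0.78                                   # correlation shown on the Figure 4 slider
Sigma = np.array([[1.0, rho], [rho, 1.0]])   # assumed unit marginal variances
Lambda = np.linalg.inv(Sigma)                # precision matrix

true_marginal_var = np.diag(Sigma)
mean_field_var = 1.0 / np.diag(Lambda)       # reverse-KL mean-field factor variances

print(true_marginal_var, mean_field_var, 1.0 - rho**2)
```

With unit marginals the factor variance works out to $1-\rho^2$, so the approximation becomes arbitrarily overconfident as $|\rho| \to 1$.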

Figure 6 · Bayesian linear regression posterior and mean-field approximation. Legend: exact posterior, mean-field VI, posterior predictive. Controls: noise $\sigma = 0.35$, prior sd $= 2.0$.

The next figure runs the actual coordinate-ascent VI (CAVI) updates. For a Gaussian posterior with covariance $\Sigma$ and precision $\Lambda = \Sigma^{-1}$, the reverse-KL mean-field optimum has the same mean as the exact posterior, but each factor's variance is $1/\Lambda_{ii}$, the reciprocal of the corresponding diagonal entry of the precision matrix, which never exceeds the true marginal variance $\Sigma_{ii}$. Click the buttons to alternate between updating the intercept factor $q(\alpha)$ and the slope factor $q(\beta)$: the red ellipse snaps toward the coordinate-wise optimum and the ELBO rises. You can also click anywhere in the $(\alpha,\beta)$ panel to drop $q$ at a different starting position before running CAVI.
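A sketch of what these coordinate updates could look like for the regression model is given below. It is not the figure's code: the synthetic data, the seed, the zero prior mean, and the sweep count are assumptions, while the noise and prior sd follow the Figure 6 defaults. For a Gaussian target with mean $m$ and precision $\Lambda$, the standard mean-field update for factor $i$ sets its mean to $m_i - \Lambda_{ii}^{-1}\sum_{j\ne i}\Lambda_{ij}(\mu_j - m_j)$ and its variance to $1/\Lambda_{ii}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data with x on one side of the origin; sigma and prior_sd as in Figure 6.
sigma, prior_sd = 0.35, 2.0
x = rng.uniform(1.0, 3.0, size=12)
y = 0.5 + 1.2 * x + sigma * rng.standard_normal(12)

# Exact Gaussian posterior over (alpha, beta), assuming a zero-mean isotropic prior.
X = np.column_stack([np.ones_like(x), x])
Lambda = X.T @ X / sigma**2 + np.eye(2) / prior_sd**2   # posterior precision
m = np.linalg.solve(Lambda, X.T @ y / sigma**2)         # posterior mean
Sigma = np.linalg.inv(Lambda)                           # posterior covariance

# Mean-field factors q(alpha) q(beta): variances are fixed by the precision diagonal.
var = 1.0 / np.diag(Lambda)
mu = np.zeros(2)                                        # arbitrary starting means

def kl_to_posterior(mu):
    """KL[q || posterior] for diagonal-Gaussian q; equals ln p(y) minus the ELBO."""
    d = mu - m
    _, logdet = np.linalg.slogdet(Lambda)
    # With var_i = 1 / Lambda_ii the trace term cancels the dimension term.
    return 0.5 * (d @ Lambda @ d - logdet + np.sum(np.log(np.diag(Lambda))))

kl_trace = [kl_to_posterior(mu)]
for sweep in range(200):
    for i in range(2):
        j = 1 - i
        # Update factor i given the current mean of the other factor.
        mu[i] = m[i] - Lambda[i, j] * (mu[j] - m[j]) / Lambda[i, i]
    kl_trace.append(kl_to_posterior(mu))

print("CAVI means:", mu, "exact means:", m)
print("CAVI variances:", var, "exact marginal variances:", np.diag(Sigma))
```

The means converge to the exact posterior means, the KL gap falls sweep by sweep (so the ELBO rises), and the factor variances stay below the exact marginal variances.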

Figure 7 · Mean-field coordinate updates and ELBO. Legend: exact posterior, current $q(\alpha)q(\beta)$, ELBO trace.

When the $x$ values are concentrated on one side, many intercept-slope pairs explain the data nearly equally well. The exact posterior tilts along that trade-off. The factorized approximation cannot represent the tilt, so reverse KL chooses a smaller axis-aligned ellipse. It is the same phenomenon as Figure 4, but here you can see exactly which data placements force it.
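The dependence on data placement can be made concrete by comparing two hypothetical designs, one centered on zero and one shifted to the right. The helper name `posterior_summary`, the design points, and the zero-mean prior are assumptions for illustration; the noise and prior sd follow the Figure 6 defaults.

```python
import numpy as np

def posterior_summary(x, sigma=0.35, prior_sd=2.0):
    """Return (intercept-slope correlation, mean-field / exact intercept variance ratio).

    Hypothetical helper; assumes a zero-mean isotropic Gaussian prior. Neither
    quantity depends on the observed y values, only on where the x values sit.
    """
    X = np.column_stack([np.ones_like(x), x])
    Lambda = X.T @ X / sigma**2 + np.eye(2) / prior_sd**2
    Sigma = np.linalg.inv(Lambda)
    corr = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
    ratio = (1.0 / Lambda[0, 0]) / Sigma[0, 0]
    return corr, ratio

centered = np.linspace(-1.0, 1.0, 9)   # x values balanced around zero
one_sided = centered + 3.0             # same spacing, shifted to one side

corr_centered, ratio_centered = posterior_summary(centered)
corr_one_sided, ratio_one_sided = posterior_summary(one_sided)

print(corr_centered, ratio_centered)
print(corr_one_sided, ratio_one_sided)
```

With balanced $x$ values the off-diagonal precision entry vanishes and mean-field is exact; shifting the same points to one side produces a strongly negative correlation and a variance ratio of $1-\rho^2$, far below one.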

Related constructions

Mean-field variational inference. If $q(\theta) = \prod_i q_i(\theta_i)$ factorizes, coordinate ascent on $F$ updates each factor as $q_i(\theta_i) \propto \exp\bigl(\mathbb{E}_{q_{-i}}[\ln p(y,\theta)]\bigr)$, which has a closed form in conditionally conjugate models. This is the standard form of VI for graphical models and for topic models such as latent Dirichlet allocation.
Amortized inference / VAE. Replace $q(\theta)$ with $q_\phi(\theta\mid y)$, a neural network mapping data to variational parameters. Maximize the same $F$, now over $\phi$ and over the parameters of the generative model. That is the variational autoencoder.
Free-energy principle. The same identity, applied at every level of a hierarchical generative model of sensory input, gives Karl Friston's account of perception, action, and learning as minimizing free energy.
EM as a special case. The Expectation–Maximization algorithm alternates E-steps (set $q$ to the exact posterior, KL = 0) and M-steps (maximize $F$ over model parameters) on the same identity.
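The EM entry above can be illustrated with a toy mixture model. In this sketch the latent variables are the component assignments, $q$ is the table of responsibilities, and the model parameters are the two component means; the data, the unit variances, the fixed equal weights, and the starting point are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: equal mixture of N(-2, 1) and N(2, 1).
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.0, 200)])

def log_normal(x, mu):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu)**2

def log_evidence(mu):
    """ln p(y | mu) with unit variances and mixing weights fixed at 1/2."""
    comp = np.stack([log_normal(y, mu[0]), log_normal(y, mu[1])]) + np.log(0.5)
    return np.sum(np.logaddexp(comp[0], comp[1]))

mu = np.array([-0.5, 0.5])         # arbitrary starting means
trace = [log_evidence(mu)]

for _ in range(50):
    # E-step: set q to the exact posterior over assignments (KL = 0, so F = ln p(y | mu)).
    log_r = np.stack([log_normal(y, mu[0]), log_normal(y, mu[1])])
    r = np.exp(log_r - np.logaddexp(log_r[0], log_r[1]))   # responsibilities, shape (2, n)
    # M-step: maximize F over the means with q held fixed (responsibility-weighted averages).
    mu = (r * y).sum(axis=1) / r.sum(axis=1)
    trace.append(log_evidence(mu))

print(mu, trace[0], trace[-1])
```

Each E-step closes the KL gap and each M-step raises $F$, so the log evidence recorded in `trace` cannot decrease from one iteration to the next.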

$\ln p(y) = \mathrm{KL}[q\Vert p(\cdot\mid y)] + F(q,y)$. Everything from the mean-field updates of a topic model to the loss function of a VAE is a tactic for making one side of that equation easy to compute.

What next

Variational inference sits between measure-theoretic identities and sampling-based computation.

Information

KL Divergence

Study the directed mismatch measure used here, including categorical support errors and forward-vs-reverse behavior.

Foundations

Measure Theory & Random Variables

Back up to densities as relationships between measures, including the KL and importance-weight identities used here.

Geometry

Uninformative Priors, Fisher Information & MLE

Connect likelihood curvature, Jeffreys priors, and posterior concentration to the optimization view used here.

Computation

Monte Carlo & MCMC

Contrast optimization-based approximate inference with rejection, importance, and MCMC. The state-space (particle-filter) variant is on its own page.

Information

Entropy & Mutual Information

The entropy term in free energy is the same uncertainty accounting, now used as an optimization pressure.