KL Divergence

A directed measure of the extra cost of using one distribution in place of another.

The Kullback-Leibler divergence compares two probability measures on the same measurable space. If $Q$ is absolutely continuous with respect to $P$, written $Q\ll P$, then

$$ \mathrm{KL}[Q\Vert P] = \int \log\frac{dQ}{dP}\,dQ = \mathbb{E}_Q\!\left[\log\frac{dQ}{dP}\right]. $$

Equivalently, by changing the integrating measure back to $P$, $\mathrm{KL}[Q\Vert P]=\int \frac{dQ}{dP}\log\frac{dQ}{dP}\,dP$. When both measures have densities $q$ and $p$ with respect to a common base measure, this reduces to the familiar formula $\int q(x)\log(q(x)/p(x))\,dx$. These are Lebesgue integrals; the compact form integrates $\log(dQ/dP)$ with respect to the probability measure $Q$.

KL is an expectation under the left-hand measure: samples are drawn from $Q$, and each sample is charged $\log(dQ/dP)$, the log-ratio between the mass $Q$ places there and the mass $P$ assigned. That is why KL is directed. The distribution on the left decides where the comparison spends its mass.
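
As a rough illustration of "expectation under the left-hand measure", the sketch below estimates $\mathrm{KL}[Q\Vert P]$ two ways: by averaging $\log q(x)-\log p(x)$ over samples drawn from $Q$, and by a Riemann sum of $q\log(q/p)$. The two Gaussians, the function names, and the sample and grid sizes are all assumptions chosen for the example, not part of the text above.

```python
import math
import random

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def kl_monte_carlo(mu_q, sig_q, mu_p, sig_p, n=200_000, seed=0):
    """Estimate KL[Q||P] as the sample mean of log q(x) - log p(x), x ~ Q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_q, sig_q)  # samples come from the LEFT measure Q
        total += log_normal_pdf(x, mu_q, sig_q) - log_normal_pdf(x, mu_p, sig_p)
    return total / n

def kl_quadrature(mu_q, sig_q, mu_p, sig_p, lo=-12.0, hi=12.0, steps=24_000):
    """Approximate the integral of q log(q/p) dx with a midpoint Riemann sum."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        lq = log_normal_pdf(x, mu_q, sig_q)
        lp = log_normal_pdf(x, mu_p, sig_p)
        total += math.exp(lq) * (lq - lp) * dx
    return total

# Illustrative choice: Q = N(0, 1), P = N(1, 2^2).
mc = kl_monte_carlo(0.0, 1.0, 1.0, 2.0)
quad = kl_quadrature(0.0, 1.0, 1.0, 2.0)
print(f"Monte Carlo: {mc:.4f}   quadrature: {quad:.4f}")
```

Up to Monte Carlo noise, the two numbers should agree, since they are the same integral written against $Q$ and against the base measure.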

One sentence: $\mathrm{KL}[q\Vert p]$ is the average extra log-cost of pretending the data-generating distribution is $p$ when it is actually $q$.

1. The Radon-Nikodym derivative inside KL

The figure below keeps the base space ordinary, so $dQ/dP$ is just the density ratio $q(x)/p(x)$. The top panel shows the two probability densities, the middle panel shows the Radon-Nikodym derivative, and the bottom panel shows the KL integrand $q(x)\log(q(x)/p(x))$. In the fully measure-theoretic formula, the middle curve is the object being logged.

Figure 1 · $\mathrm{KL}[Q\Vert P] = \int \log(dQ/dP)\,dQ$. Panels: densities $q(x)$ and $p(x)$; Radon-Nikodym derivative $dQ/dP = q/p$; KL integrand $q\log(q/p)$.

If $Q\not\ll P$, then $Q$ puts positive mass on some set to which $P$ assigns zero, so no density $dQ/dP$ exists there, and by convention $\mathrm{KL}[Q\Vert P]=+\infty$. This is the measure-theoretic version of the discrete rule that putting positive $q_i$ where $p_i=0$ gives infinite KL.

2. KL direction sets the averaging measure

KL is nonnegative and is zero only when the two distributions match, but it is not a distance: in general $\mathrm{KL}[q\Vert p] \neq \mathrm{KL}[p\Vert q]$. The figure below uses Gaussians because the exact value is available in closed form; the lower plot shows the pointwise contribution $q(x)\log(q(x)/p(x))$.

Figure 2 · $\mathrm{KL}[q\Vert p]$ between two Gaussians. Curves: $q(x)$, $p(x)$, and the integrand $q(x)\log(q(x)/p(x))$.
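
For univariate Gaussians the closed form is short enough to write out. The sketch below is one way to evaluate it and to see the asymmetry numerically; the function name `kl_gauss` and the particular means and standard deviations are assumptions made for illustration.

```python
import math

def kl_gauss(mu_q, sig_q, mu_p, sig_p):
    """Closed-form KL[N(mu_q, sig_q^2) || N(mu_p, sig_p^2)] in nats."""
    return (math.log(sig_p / sig_q)
            + (sig_q**2 + (mu_q - mu_p) ** 2) / (2 * sig_p**2)
            - 0.5)

# Illustrative pair: q = N(0, 1), p = N(1, 2^2).
forward = kl_gauss(0.0, 1.0, 1.0, 2.0)  # KL[q || p]
reverse = kl_gauss(1.0, 2.0, 0.0, 1.0)  # KL[p || q]
print(f"KL[q||p] = {forward:.4f}, KL[p||q] = {reverse:.4f}")
# prints: KL[q||p] = 0.4431, KL[p||q] = 1.3069
```

Swapping the arguments changes the value by roughly a factor of three here: the narrow $q$ sits comfortably inside the wide $p$, but the wide $p$ spends a lot of mass where the narrow $q$ is small.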

3. The left distribution chooses the bill

The discrete case makes the accounting visible. Each outcome contributes $q_i\log(q_i/p_i)$. If $q_i=0$, that outcome contributes nothing, no matter how large $p_i$ is. If $q_i>0$ and $p_i=0$, the KL is infinite: $p$ says an event that actually happens is impossible.

Figure 3 · Per-outcome KL contributions. Bars: $q_i$, $p_i$, and $q_i\log(q_i/p_i)$.
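
The per-outcome accounting above translates almost line for line into code. This is a minimal sketch, assuming both inputs are valid probability vectors of the same length; `kl_discrete` and the example vectors are hypothetical names and values.

```python
import math

def kl_discrete(q, p):
    """KL[q||p] in nats for two probability vectors of equal length."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue          # outcomes q never produces contribute nothing
        if pi == 0.0:
            return math.inf   # q puts mass where p says "impossible"
        total += qi * math.log(qi / pi)
    return total

q = [0.5, 0.5, 0.0]
p = [0.25, 0.25, 0.5]
print(kl_discrete(q, p))  # finite: the third outcome is never billed
print(kl_discrete(p, q))  # inf: p uses an outcome that q rules out
```

The first call equals $\log 2$, because each of the two outcomes $q$ uses contributes $0.5\log 2$ and the third outcome is skipped, however large $p_3$ is.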

4. Forward KL covers; reverse KL chooses

When a simple distribution approximates a multi-modal target, the direction changes the qualitative behavior. Minimizing $\mathrm{KL}[p\Vert q]$ tends to cover all mass that $p$ might generate. Minimizing $\mathrm{KL}[q\Vert p]$ tends to place $q$ where it can be confident, often inside one mode.

Figure 4 · Forward vs. reverse KL on a bimodal target. Legend: target $p(x)$, single Gaussian $q(x)$, selected objective.
Figure 4b · Optimize forward and reverse KL from the same starting $q$. Legend: bimodal target $p$, reverse KL path, forward KL path.
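
The covering versus mode-choosing behavior can be reproduced with a brute-force grid search. The sketch below is not the optimizer behind the figures; it assumes a hypothetical target (an equal mixture of $N(-2, 0.6^2)$ and $N(2, 0.6^2)$), evaluates both divergences with a Riemann sum, and scans a coarse grid of means and standard deviations for a single Gaussian $q$.

```python
import math

def log_norm(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def log_target(x):
    """Log density of the assumed target: 0.5 N(-2, 0.6^2) + 0.5 N(2, 0.6^2)."""
    return math.log(0.5 * math.exp(log_norm(x, -2.0, 0.6))
                    + 0.5 * math.exp(log_norm(x, 2.0, 0.6)))

# Midpoint grid on [-8, 8] used for both integrals.
N = 600
DX = 16.0 / N
XS = [-8.0 + (i + 0.5) * DX for i in range(N)]
LOG_P = [log_target(x) for x in XS]

def forward_kl(mu, sigma):
    """KL[p||q]: the average is taken under the target p."""
    return sum(math.exp(lp) * (lp - log_norm(x, mu, sigma)) * DX
               for x, lp in zip(XS, LOG_P))

def reverse_kl(mu, sigma):
    """KL[q||p]: the average is taken under the approximation q."""
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_norm(x, mu, sigma)
        total += math.exp(lq) * (lq - lp) * DX
    return total

# Coarse parameter grid: mu in [-3, 3], sigma in [0.3, 3.0].
grid = [(-3.0 + 0.5 * i, 0.3 + 0.1 * j) for i in range(13) for j in range(28)]
mu_f, sig_f = min(grid, key=lambda g: forward_kl(*g))
mu_r, sig_r = min(grid, key=lambda g: reverse_kl(*g))
print("forward KL picks mu, sigma =", (mu_f, round(sig_f, 1)))
print("reverse KL picks mu, sigma =", (mu_r, round(sig_r, 1)))
```

With this target, the forward-KL winner should be a wide Gaussian centered between the modes, which is the moment-matching solution, while the reverse-KL winner should be a narrow Gaussian on one of the two modes. Which mode it lands on is arbitrary, because the target is symmetric.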

5. KL is locally quadratic, and Fisher is the curvature

Globally, KL is asymmetric and unbounded. Locally, it is neither. Expand $\mathrm{KL}[p_\theta\Vert p_{\theta+\Delta\theta}]$ in $\Delta\theta$. The score $s_\theta = \partial_\theta\log p_\theta$ has mean zero under $p_\theta$, so the linear term vanishes; the second-order term is the Fisher information:

$$ \mathrm{KL}\bigl[p_\theta \,\Vert\, p_{\theta+\Delta\theta}\bigr] \;\approx\; \tfrac{1}{2}\, \Delta\theta^\top I(\theta)\,\Delta\theta. $$

The bridge to Fisher information is local: KL's asymmetry disappears at second order, and the symmetric quadratic that remains is exactly the Fisher metric on the parameter manifold. So statements like "the model is locally sensitive to $\theta$" (Fisher information), "nearby parameters are easy to distinguish" (KL local quadratic), and "under regularity conditions the MLE's variance approaches the Cramér–Rao bound $1/(nI)$" (asymptotic efficiency) are three views of the same curvature.

Why is KL symmetric to second order? Because in the expansion $\mathrm{KL}[p_\theta\Vert p_{\theta+\Delta\theta}]$ vs $\mathrm{KL}[p_{\theta+\Delta\theta}\Vert p_\theta]$, the linear terms differ in sign but both vanish (zero-mean score), and the quadratic terms agree ($\tfrac{1}{2}\Delta\theta^\top I\Delta\theta$ in both directions). Asymmetry is a third-order phenomenon.
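
A quick numerical check of the second-order claim, sketched for a Bernoulli family in its mean parameter, where $I(\theta)=1/(\theta(1-\theta))$. The base point $\theta=0.3$ and the displacement sizes are arbitrary choices for the example.

```python
import math

def kl_bernoulli(a, b):
    """KL[Bernoulli(a) || Bernoulli(b)] in nats, for 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def fisher_bernoulli(theta):
    """Fisher information of Bernoulli(theta) in the mean parameter."""
    return 1.0 / (theta * (1.0 - theta))

theta = 0.3
for delta in (0.1, 0.01, 0.001):
    fwd = kl_bernoulli(theta, theta + delta)   # KL[p_theta || p_{theta+delta}]
    rev = kl_bernoulli(theta + delta, theta)   # KL[p_{theta+delta} || p_theta]
    quad = 0.5 * fisher_bernoulli(theta) * delta**2
    print(f"delta={delta:<6} fwd/quad={fwd / quad:.4f} "
          f"rev/quad={rev / quad:.4f} gap/quad={(fwd - rev) / quad:+.4f}")
```

Both ratios should approach 1 as `delta` shrinks, and the gap between forward and reverse KL, measured relative to the quadratic, should shrink roughly in proportion to `delta`. That is what a third-order asymmetry looks like.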

The figure below makes this concrete on a one-parameter Bernoulli family. Forward and reverse KL leave the origin tangent to the same parabola $\tfrac12 I(\theta)\,\Delta\theta^2$; shrink the displacement range and all three curves collapse together — KL really is locally quadratic. Widen it, or push $\theta$ toward an edge where $I(\theta)$ grows large, and the two KL curves peel away from the parabola in opposite directions.

Figure 5 · KL versus its quadratic Fisher approximation. Curves: forward KL, reverse KL, and the quadratic $\tfrac12 I(\theta)\,\Delta\theta^2$.

In two parameters this is a tangent paraboloid. The figure above is the one-parameter slice of a more general picture: when $\theta$ is two-dimensional, the KL bowl $D_{\mathrm{KL}}(\theta_0 \,\Vert\, \theta)$ is a surface and its second-order Fisher approximation is a tangent paraboloid. See Information Geometry · Figure 3 for an interactive 3D view of both surfaces and how they peel apart toward the boundary.

6. KL in Machine Learning: MLE and VI

The direction of KL dictates its use in machine learning algorithms: minimizing Forward KL over a model family gives Maximum Likelihood Estimation (MLE), and minimizing Reverse KL over an approximating family gives Variational Inference (VI).

Maximum Likelihood Estimation (Forward KL)

In MLE, we have an unknown true data-generating distribution $P_\text{data}$ and a model family $P_\theta$. We want to find the parameter $\theta$ that makes our model best match the data. This is equivalent to minimizing the Forward KL divergence:

$$ \min_\theta \mathrm{KL}[P_\text{data} \Vert P_\theta] = \min_\theta \mathbb{E}_{x \sim P_\text{data}} \left[ \log \frac{P_\text{data}(x)}{P_\theta(x)} \right] $$

Because $P_\text{data}$ is fixed, the term $\mathbb{E}_{x\sim P_\text{data}}[\log P_\text{data}(x)]$ (the negative entropy of the data) does not depend on $\theta$ and can be dropped. This leaves us with:

$$ \min_\theta -\mathbb{E}_{x \sim P_\text{data}} [\log P_\theta(x)] $$

This is exactly the objective of Maximum Likelihood: maximize the expected log-likelihood of the data under the model. In practice the expectation is replaced by an average over the observed samples. Forward KL forces the model $P_\theta$ to "cover" all the modes of $P_\text{data}$ because if the model assigns near-zero probability to any region where the data has mass, the KL penalty explodes.
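
To see the equivalence on a toy case, the sketch below fits a Bernoulli model to a handful of made-up coin flips, once by minimizing the average negative log-likelihood and once by minimizing forward KL from the empirical distribution. The data, the grid of candidate $\theta$ values, and the function names are all assumptions for illustration.

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # hypothetical coin flips
p_hat = sum(data) / len(data)           # empirical P_data(x = 1)

def avg_nll(theta):
    """Average negative log-likelihood of the data under Bernoulli(theta)."""
    return -sum(math.log(theta if x == 1 else 1 - theta) for x in data) / len(data)

def forward_kl(theta):
    """KL[P_data || P_theta], with P_data taken as the empirical distribution."""
    return (p_hat * math.log(p_hat / theta)
            + (1 - p_hat) * math.log((1 - p_hat) / (1 - theta)))

thetas = [i / 100 for i in range(1, 100)]
best_nll = min(thetas, key=avg_nll)
best_kl = min(thetas, key=forward_kl)
entropy = -(p_hat * math.log(p_hat) + (1 - p_hat) * math.log(1 - p_hat))

print("argmin NLL:", best_nll, " argmin KL:", best_kl, " sample mean:", p_hat)
print("NLL - KL at theta=0.4:", avg_nll(0.4) - forward_kl(0.4), " entropy:", entropy)
```

Both searches should select the sample mean, and the difference between the two objectives should equal the entropy of the empirical distribution at every $\theta$, which is the constant that was dropped.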

Variational Inference (Reverse KL)

In Bayesian inference, we want to approximate a complex posterior $P(z|x)$ with a simpler, tractable distribution $Q_\phi(z)$. Here, we minimize the Reverse KL divergence:

$$ \min_\phi \mathrm{KL}[Q_\phi(z) \Vert P(z|x)] = \min_\phi \mathbb{E}_{z \sim Q_\phi} \left[ \log \frac{Q_\phi(z)}{P(z|x)} \right] $$

This objective is workable because the expectation is taken over the simple approximation $Q_\phi$, from which samples are easy to draw. The posterior's normalizer $\log P(x)$ does not depend on $\phi$, so the minimization needs only the joint $P(x,z)$; the resulting objective is the ELBO. Reverse KL often places $Q_\phi$ on a single mode of the true posterior. If $Q_\phi$ puts mass where $P(z|x)$ is near zero, the penalty is large, so the optimum often stays within high-probability regions and underestimates the true variance. This is the foundation of Variational Inference.
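
The reverse-KL objective mentions $P(z|x)$, whose normalizer is usually unknown. A small discrete example, sketched below, shows why that is not an obstacle: the quantity $\mathbb{E}_Q[\log P(x,z) - \log Q(z)]$ (the ELBO) uses only the joint, and its gap to $\log P(x)$ is exactly the reverse KL. The three-state latent variable and the joint values are made up for the example.

```python
import math

# Hypothetical joint p(x, z) at the observed x, for z in {0, 1, 2}.
joint = [0.10, 0.02, 0.08]
evidence = sum(joint)                      # p(x); intractable in realistic models
posterior = [j / evidence for j in joint]  # P(z | x), shown only for checking

def elbo(q):
    """E_q[log p(x, z) - log q(z)]; needs the joint but not the posterior."""
    return sum(qi * (math.log(ji) - math.log(qi))
               for qi, ji in zip(q, joint) if qi > 0)

def reverse_kl(q):
    """KL[q || P(z|x)]; needs the normalized posterior."""
    return sum(qi * math.log(qi / pi)
               for qi, pi in zip(q, posterior) if qi > 0)

q = [0.6, 0.1, 0.3]                        # some candidate approximation
gap = math.log(evidence) - elbo(q)
print("log p(x) - ELBO:", gap)
print("KL[q || posterior]:", reverse_kl(q))
```

The two printed numbers should coincide, so raising the ELBO lowers the reverse KL by the same amount, and the ELBO reaches $\log P(x)$ only when $Q$ equals the posterior.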

7. KL connects entropy, Radon-Nikodym derivatives, and the ELBO

Entropy. Average surprise under one distribution: $\mathrm{H}[q]=-\mathbb{E}_q[\log q]$.
Cross-entropy. Average coding cost when events come from $q$ but are coded with $p$: $\mathrm{H}(q,p)=-\mathbb{E}_q[\log p]=\mathrm{H}[q]+\mathrm{KL}[q\Vert p]$.
Radon-Nikodym. The measure-theoretic ratio $dQ/dP$, whose log is averaged by KL.
ELBO / free energy. A tractable objective whose gap to log evidence is a KL term.
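
The cross-entropy decomposition is easy to confirm numerically. A minimal sketch, assuming $p_i>0$ wherever $q_i>0$; the two probability vectors are arbitrary.

```python
import math

def entropy(q):
    """H[q] = -E_q[log q], in nats."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)

def cross_entropy(q, p):
    """H(q, p) = -E_q[log p], in nats; assumes p_i > 0 wherever q_i > 0."""
    return -sum(qi * math.log(pi) for qi, pi in zip(q, p) if qi > 0)

def kl(q, p):
    """KL[q||p], in nats; same support assumption as cross_entropy."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.25, 0.25]
p = [0.25, 0.25, 0.5]
print("H(q, p)        =", cross_entropy(q, p))
print("H[q] + KL[q||p] =", entropy(q) + kl(q, p))
```

The two lines should match to floating-point precision: coding with the wrong distribution costs the entropy plus the KL surcharge.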

Use KL when it matters which distribution generates the samples and which approximation is being charged for the mismatch. It is also the right tool when support errors must not be ignored: KL becomes infinite if the distribution on the right assigns zero mass to events that actually occur. For the underlying measure machinery, see Radon-Nikodym derivatives. In variational inference, this directed gap becomes the optimization target; see Free Energy & Variational Inference.