Optimal Transport
How far apart are two probability distributions? Information-theoretic measures like KL divergence compare densities point-by-point. But if two distributions don't overlap, KL divergence is infinite and its gradient is zero.
Optimal Transport (OT) takes a different approach. It asks: what is the minimum cost to "move" the mass of one distribution to match the other? This cost is known as the Wasserstein distance or the Earth Mover's Distance.
1. The Transport Plan
Given two distributions $p$ and $q$, we look for a transport plan $\pi(x, y)$ that describes how much mass to move from location $x$ to location $y$. The total cost is the integral of the distance moved times the mass moved:
$$ W(p, q) = \inf_{\pi \in \Pi(p, q)} \int d(x, y) \, d\pi(x, y) $$The figure below shows the optimal transport plan between two 1D histograms. Drag the bars to change the distributions. The 2D grid shows the plan $\pi$, and the lines show the mass being transported.
We use Entropic Regularization (the Sinkhorn algorithm) to find the plan. As the regularization $\lambda$ decreases, the plan becomes more "sparse" and focused on the exact shortest distances.
2. Brenier's Theorem: the Map is a Gradient
When the source and target are absolutely continuous and the cost is squared Euclidean distance, the transport plan collapses to a deterministic map: every source point $x$ goes to a single destination $T(x)$. Brenier's theorem says more — that map is the gradient of a convex function:
$$ T(x) = \nabla \varphi(x), \qquad \varphi : \mathbb{R}^d \to \mathbb{R} \text{ convex}. $$For two Gaussians $p = \mathcal{N}(\mu_p, \sigma_p^2 I)$ and $q = \mathcal{N}(\mu_q, \sigma_q^2 I)$ in the plane, the potential has a closed form:
$$ \varphi(x) = \frac{\sigma_q}{2\sigma_p} \|x - \mu_p\|^2 + (\mu_q - \tfrac{\sigma_q}{\sigma_p}\mu_p) \cdot x. $$The figure plots $\varphi$ as a 3D paraboloid. Its slope at any source point gives the displacement to the corresponding target point — the gradient field on the base plane is precisely the optimal map $T$.
Because $\varphi$ is convex, its gradient is monotone: nearby source points stay near each other after transport, and the map never "crosses itself." That convexity is what makes optimal transport well-posed where pointwise comparisons of densities (next section) break down.
3. Wasserstein vs. KL Divergence
The main advantage of the Wasserstein distance is that it stays well-behaved even when distributions have non-overlapping support. KL divergence only "sees" the overlap; Wasserstein "sees" the distance between the masses.
This property is why Wasserstein distances are used in training Generative Adversarial Networks (WGANs). It provides a useful gradient for the generator even if its current output is completely different from the real data.
What next
Compare optimal transport with other divergence measures.