Optimal Transport

The Earth Mover's Distance and moving probability mass.

How far apart are two probability distributions? Information-theoretic measures like KL divergence compare densities point-by-point. But if two distributions don't overlap, KL divergence is infinite and its gradient is zero.

Optimal Transport (OT) takes a different approach. It asks: what is the minimum cost to "move" the mass of one distribution to match the other? This cost is known as the Wasserstein distance or the Earth Mover's Distance.

One sentence: Optimal transport treats distributions as "piles of dirt" and measures the work required to shovel one pile into the shape of another.

1. The Transport Plan

Given two distributions $p$ and $q$, we look for a transport plan $\pi(x, y)$ that describes how much mass to move from location $x$ to location $y$. The total cost is the integral of the distance moved times the mass moved:

$$ W(p, q) = \inf_{\pi \in \Pi(p, q)} \int d(x, y) \, d\pi(x, y) $$

The figure below shows the optimal transport plan between two 1D histograms. Drag the bars to change the distributions. The 2D grid shows the plan $\pi$, and the lines show the mass being transported.

Figure 1 · Optimal Transport Plan (Sinkhorn Iteration)

source distribution $p$ target distribution $q$ transport matrix $\pi$

regularization λ 0.05

We use Entropic Regularization (the Sinkhorn algorithm) to find the plan. As the regularization $\lambda$ decreases, the plan becomes more "sparse" and focused on the exact shortest distances.

2. Brenier's Theorem: the Map is a Gradient

When the source and target are absolutely continuous and the cost is squared Euclidean distance, the transport plan collapses to a deterministic map: every source point $x$ goes to a single destination $T(x)$. Brenier's theorem says more — that map is the gradient of a convex function:

$$ T(x) = \nabla \varphi(x), \qquad \varphi : \mathbb{R}^d \to \mathbb{R} \text{ convex}. $$

For two Gaussians $p = \mathcal{N}(\mu_p, \sigma_p^2 I)$ and $q = \mathcal{N}(\mu_q, \sigma_q^2 I)$ in the plane, the potential has a closed form:

$$ \varphi(x) = \frac{\sigma_q}{2\sigma_p} \|x - \mu_p\|^2 + (\mu_q - \tfrac{\sigma_q}{\sigma_p}\mu_p) \cdot x. $$

The figure plots $\varphi$ as a 3D paraboloid. Its slope at any source point gives the displacement to the corresponding target point — the gradient field on the base plane is precisely the optimal map $T$.

Figure 2 · Brenier Potential and its Gradient Map

Brenier potential $\varphi$ source $p$ target $q$ gradient $\nabla \varphi$ (transport map)

Source $p$ σ 1.00

Target $q$ σ 0.70

Separation $\mu_q - \mu_p$ Δx 2.00 Δy 0.60

Layers source & target disks gradient arrows tangent plane at $\mu_p$

Drag the canvas to rotate. The potential height is rescaled for legibility — only its slope is physically meaningful.

Because $\varphi$ is convex, its gradient is monotone: nearby source points stay near each other after transport, and the map never "crosses itself." That convexity is what makes optimal transport well-posed where pointwise comparisons of densities (next section) break down.

3. Wasserstein vs. KL Divergence

The main advantage of the Wasserstein distance is that it stays well-behaved even when distributions have non-overlapping support. KL divergence only "sees" the overlap; Wasserstein "sees" the distance between the masses.

Figure 3 · Gradients: Wasserstein distance vs. KL Divergence

fixed distribution draggable distribution

Drag the red bump. Observe how KL flatlines when they don't overlap, while Wasserstein changes linearly.

This property is why Wasserstein distances are used in training Generative Adversarial Networks (WGANs). It provides a useful gradient for the generator even if its current output is completely different from the real data.

What next

Compare optimal transport with other divergence measures.

Divergence

KL Divergence

Contrast the geometric approach of optimal transport with the information-theoretic approach of KL divergence.