Information Geometry

Probability distributions as points on a Riemannian manifold.

A family of probability distributions $p(x; \theta)$ can be viewed as a manifold where each point corresponds to a distribution and the parameters $\theta$ are the coordinates. Information Geometry uses differential geometry to study the intrinsic shape of these manifolds.

The key insight is that the "distance" between distributions should be based on their information content, not the Euclidean distance of their parameters. This distance is defined by the Fisher Information Metric — the expected outer product of the score, written in coordinates $\theta$ as:

$$ g_{ij}(\theta) = \mathbb{E}_{p_\theta}\!\left[ \frac{\partial \log p_\theta}{\partial \theta^i} \, \frac{\partial \log p_\theta}{\partial \theta^j} \right] $$

Two distributions are far apart when a sample easily tells them apart; they are close when it does not. The metric $g_{ij}$ turns that statement into a length.
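
As a quick numerical check, the sketch below evaluates this expectation for a single Bernoulli variable with mean parameter $\theta$ and compares it with the closed form $1/(\theta(1-\theta))$. The function names are illustrative and do not come from any library.

```python
def bernoulli_score(x, theta):
    """d/d(theta) of log p(x; theta) for p(x; theta) = theta^x (1 - theta)^(1 - x)."""
    return x / theta - (1 - x) / (1 - theta)


def bernoulli_fisher(theta):
    """Fisher information: the expected squared score over x in {0, 1}."""
    probs = {0: 1 - theta, 1: theta}
    return sum(p * bernoulli_score(x, theta) ** 2 for x, p in probs.items())


for theta in (0.5, 0.1, 0.01):
    # Second and third columns should agree up to rounding.
    print(theta, bernoulli_fisher(theta), 1 / (theta * (1 - theta)))
```

The metric grows without bound as $\theta$ approaches 0 or 1: the same parameter step is easier to detect from samples near the boundary than near the centre.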

One sentence: Information geometry treats statistical models as curved surfaces where the "straightest" paths (geodesics) follow the flow of information.

1. The Probability Simplex and Geodesics

The simplest statistical manifold is the probability simplex: the set of all categorical distributions over $n$ outcomes. For $n=3$, this is a triangle where each point $(p_1, p_2, p_3)$ satisfies $p_i \ge 0$ and $\sum p_i = 1$.

Under the Fisher metric, the simplex is not flat: it is isometric to a portion of a hypersphere. The shortest paths (geodesics) between two distributions are arcs of great circles on this sphere.

Concretely, the map $p_i \mapsto 2\sqrt{p_i}$ sends each distribution onto the positive orthant of a sphere of radius 2. Distance along the manifold is then the great-circle distance — the Fisher-Rao distance:

$$ d_{\mathrm{FR}}(p, q) = 2 \arccos\!\left( \sum_i \sqrt{p_i \, q_i} \right) $$
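
A minimal sketch of the distance formula, assuming distributions are given as plain lists of probabilities. The helper name `fisher_rao_distance` is an illustrative choice.

```python
import math


def fisher_rao_distance(p, q):
    """Great-circle distance on the radius-2 sphere between categorical p and q."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))  # Bhattacharyya coefficient
    bc = min(1.0, max(0.0, bc))  # guard against rounding just outside [0, 1]
    return 2.0 * math.acos(bc)


uniform = [1 / 3, 1 / 3, 1 / 3]
corner_a = [1.0, 0.0, 0.0]
corner_b = [0.0, 1.0, 0.0]

print(fisher_rao_distance(uniform, uniform))    # identical distributions: about zero
print(fisher_rao_distance(corner_a, corner_b))  # disjoint supports: the maximum, pi
```

Unlike KL divergence, this distance is symmetric and stays finite even when the two distributions share no support.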

Figure 1 · Geodesics on the 3-outcome Simplex. Curves: Euclidean path (mixture) and Fisher-Rao geodesic, drawn between two endpoints A and B on the triangle.

A "straight line" in Euclidean space corresponds to a mixture: $(1-t)\,p + t\,q$. A Fisher-Rao geodesic is instead a great-circle arc in the coordinates $2\sqrt{p_i}$, traversed so that the Fisher-Rao distance from the starting point grows at a constant rate.
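
The two paths can be compared numerically. The sketch below assumes the geodesic is built by spherical interpolation between $\sqrt{p}$ and $\sqrt{q}$ and then squared back onto the simplex. The endpoints and function names are illustrative.

```python
import math


def fisher_rao_distance(p, q):
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))


def mixture_path(p, q, t):
    """Euclidean straight line: (1 - t) p + t q."""
    return [(1 - t) * a + t * b for a, b in zip(p, q)]


def fisher_rao_path(p, q, t):
    """Great-circle arc between sqrt(p) and sqrt(q), squared back onto the simplex."""
    omega = math.acos(min(1.0, sum(math.sqrt(a * b) for a, b in zip(p, q))))
    if omega < 1e-12:  # p and q coincide, so the path is constant
        return list(p)
    w_p = math.sin((1 - t) * omega) / math.sin(omega)
    w_q = math.sin(t * omega) / math.sin(omega)
    return [(w_p * math.sqrt(a) + w_q * math.sqrt(b)) ** 2 for a, b in zip(p, q)]


p = [0.8, 0.1, 0.1]
q = [0.1, 0.1, 0.8]
total = fisher_rao_distance(p, q)
for t in (0.25, 0.5, 0.75):
    # Fraction of the total Fisher-Rao distance covered at time t on each path.
    on_geodesic = fisher_rao_distance(p, fisher_rao_path(p, q, t)) / total
    on_mixture = fisher_rao_distance(p, mixture_path(p, q, t)) / total
    print(t, round(on_geodesic, 4), round(on_mixture, 4))
```

Along the geodesic the covered fraction equals $t$, which is the constant-speed property. Along the mixture path it does not, because that path is neither the shortest route nor uniformly paced under the Fisher metric.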

2. Dual Flatness: e-projections and m-projections

Exponential families (like the Gaussian, Poisson, or Bernoulli) have a special geometric structure called dual flatness. They can be described by two sets of coordinates: natural parameters ($\eta$) and expectation parameters ($\mu$).

Such a family has density

$$ p(x; \eta) = h(x) \, \exp\!\left( \eta^\top T(x) - A(\eta) \right), $$

and the log-partition function $A(\eta)$ ties the two coordinate systems together: its gradient carries natural parameters to expectation parameters, $\mu = \nabla A(\eta)$.
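
For the Bernoulli family with $T(x) = x$, the log-partition function is $A(\eta) = \log(1 + e^\eta)$ and its derivative is the logistic sigmoid. The sketch below checks that relationship with a central finite difference. The function names and the step size `h` are illustrative choices.

```python
import math


def log_partition(eta):
    """A(eta) = log(1 + e^eta) for the Bernoulli family with T(x) = x."""
    return math.log1p(math.exp(eta))


def mean_from_natural(eta):
    """mu = dA/d(eta), which for the Bernoulli is the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-eta))


def natural_from_mean(mu):
    """Inverse map: eta = log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))


eta, h = 0.7, 1e-6
finite_diff = (log_partition(eta + h) - log_partition(eta - h)) / (2 * h)
print(finite_diff, mean_from_natural(eta))  # the two values should nearly agree
```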

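This duality gives us two types of "straight lines": e-geodesics, along which the natural parameters $\eta$ change linearly, and m-geodesics, along which the expectation parameters $\mu$ change linearly.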

Figure 2 · Dual Projections: e-projection vs. m-projection. Legend: m-projection (Forward KL / MLE) and e-projection (Reverse KL / VI). The chart shows the parameter space of a Bernoulli distribution, with a target point $P$ and a constraint line.

For two members of the family, KL divergence is the Bregman divergence generated by the log-partition function:

$$ D_{\mathrm{KL}}\!\left[\, p_{\eta_1} \,\Vert\, p_{\eta_2} \,\right] = A(\eta_2) - A(\eta_1) - \nabla A(\eta_1)^\top (\eta_2 - \eta_1) $$
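
The identity can be checked for the Bernoulli family, where $A(\eta) = \log(1 + e^\eta)$ and $\nabla A(\eta)$ is the logistic sigmoid. This is a minimal sketch; the function names and the two test parameters are illustrative.

```python
import math


def log_partition(eta):
    return math.log1p(math.exp(eta))


def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def kl_direct(mu1, mu2):
    """KL between Bernoulli(mu1) and Bernoulli(mu2), summed over x in {0, 1}."""
    return mu1 * math.log(mu1 / mu2) + (1 - mu1) * math.log((1 - mu1) / (1 - mu2))


def kl_bregman(eta1, eta2):
    """A(eta2) - A(eta1) - A'(eta1) * (eta2 - eta1), with A'(eta) = sigmoid(eta)."""
    return log_partition(eta2) - log_partition(eta1) - sigmoid(eta1) * (eta2 - eta1)


eta1, eta2 = -1.0, 2.0
print(kl_direct(sigmoid(eta1), sigmoid(eta2)), kl_bregman(eta1, eta2))
```

Swapping `eta1` and `eta2` gives a different value, since a Bregman divergence is not symmetric in general.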

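Minimizing the forward KL divergence $D_{\mathrm{KL}}[\, p \,\Vert\, q \,]$ over $q$ in a sub-family is an m-projection of the target $p$; when the sub-family is itself an exponential family, the result is the member whose expected sufficient statistics match those of $p$. Minimizing the reverse KL divergence $D_{\mathrm{KL}}[\, q \,\Vert\, p \,]$ over $q$ is an e-projection. This explains why MLE and VI behave so differently: they are "dropping perpendiculars" in different geometries.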
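
To see the two projections land on different points, the sketch below uses a setup that is an assumption of this example rather than the one in Figure 2: a hypothetical correlated joint distribution `P` over two binary variables, projected onto the sub-family of independent (product) distributions by brute-force grid search.

```python
import math

# Hypothetical target: a correlated joint distribution over (x, y) in {0, 1}^2.
P = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}


def product(a, b):
    """Independent model q(x, y) with P(x = 1) = a and P(y = 1) = b."""
    return {(x, y): (a if x else 1 - a) * (b if y else 1 - b) for (x, y) in P}


def kl(p, q):
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


grid = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
candidates = [(a, b) for a in grid for b in grid]

m_proj = min(candidates, key=lambda ab: kl(P, product(*ab)))  # forward KL
e_proj = min(candidates, key=lambda ab: kl(product(*ab), P))  # reverse KL

marginals = (P[(1, 0)] + P[(1, 1)], P[(0, 1)] + P[(1, 1)])
print("m-projection:", m_proj, "marginals of P:", marginals)
print("e-projection:", e_proj)
```

The m-projection recovers the marginals of `P` (moment matching), while the e-projection settles on a different product distribution. A grid search is used only to keep the example transparent; it is not how either projection would be computed in practice.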

3. The KL Landscape and its Tangent Quadratic

The previous figures show parameter space from above. Lifting a divergence into a third dimension, using KL as a height, exposes structure that the flat views hide: how the bowl steepens toward the boundary of the parameter domain, where the Fisher quadratic approximation peels away from the true KL, and how the two dual geodesics of §2 actually trace through the landscape.

Consider two independent Bernoulli variables with means $(\mu_1, \mu_2) \in (0,1)^2$. The KL divergence from a fixed anchor $\theta_0 = (\mu_{0,1}, \mu_{0,2})$ to a varying point $\theta = (\mu_1, \mu_2)$ is a sum over coordinates:

$$ D_{\mathrm{KL}}(\theta_0 \,\Vert\, \theta) = \sum_{i=1}^{2} \mu_{0,i} \log \frac{\mu_{0,i}}{\mu_i} + (1 - \mu_{0,i}) \log \frac{1 - \mu_{0,i}}{1 - \mu_i} $$

The Fisher quadratic, the second-order Taylor expansion of KL at the anchor, uses the Fisher information metric evaluated at the anchor, which in mean coordinates is diagonal with entries $1 / (\mu_{0,i}(1 - \mu_{0,i}))$:

$$ \hat D(\theta) = \tfrac{1}{2} \sum_{i} \frac{(\mu_i - \mu_{0,i})^2}{\mu_{0,i}(1 - \mu_{0,i})} $$

At the anchor, surface and paraboloid agree to second order. Move away and they diverge: the paraboloid is symmetric about the anchor, while the true KL grows much faster as $\theta$ approaches the boundary than as it moves toward the centre. That lopsidedness comes from the higher-order terms the quadratic drops, the same terms that make KL non-symmetric in its arguments.
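
The gap between the two surfaces is easy to tabulate. The sketch below evaluates the exact KL and the Fisher quadratic for two independent Bernoulli coordinates at a point near the anchor, one toward the centre, and one near the boundary. The anchor and test points are arbitrary illustrative values.

```python
import math


def kl_from_anchor(anchor, theta):
    """KL from the anchor to theta for independent Bernoulli coordinates."""
    return sum(
        m0 * math.log(m0 / m) + (1 - m0) * math.log((1 - m0) / (1 - m))
        for m0, m in zip(anchor, theta)
    )


def fisher_quadratic(anchor, theta):
    """Half the squared Fisher length of the step, metric evaluated at the anchor."""
    return 0.5 * sum((m - m0) ** 2 / (m0 * (1 - m0)) for m0, m in zip(anchor, theta))


anchor = (0.3, 0.6)
for theta in [(0.31, 0.59), (0.5, 0.6), (0.99, 0.6)]:
    print(theta, kl_from_anchor(anchor, theta), fisher_quadratic(anchor, theta))
```

For the small step the two values agree closely; for the step toward the boundary the exact KL is well above the quadratic.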

Figure 3 · The KL Bowl, its Fisher Quadratic, and Two Geodesics. Layers: KL bowl (height = divergence), Fisher quadratic (wireframe), m-geodesic, and e-geodesic, drawn between an anchor θ₀ and an endpoint θ₁. The bowl is capped at a finite height so its boundary behaviour stays legible.

The two geodesics agree at the endpoints but otherwise trace different curves over the bowl. The m-geodesic ($\mu$ linear in $t$) is the chord of the base square. The e-geodesic ($\eta = \log(\mu / (1-\mu))$ linear in $t$) bows toward the interior, where the Fisher metric stretches less. Lift either curve onto the surface and you see why KL projections in §2 land on different points: the two connections traverse different paths between the same two distributions.
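
Both curves are simple to compute for the two-Bernoulli model. The sketch below interpolates linearly in $\mu$ for the m-geodesic and linearly in $\eta = \log(\mu / (1-\mu))$ for the e-geodesic. The endpoints are arbitrary illustrative values.

```python
import math


def logit(mu):
    return math.log(mu / (1 - mu))


def sigmoid(eta):
    return 1 / (1 + math.exp(-eta))


def m_geodesic(theta0, theta1, t):
    """Each mean mu_i is interpolated linearly in t."""
    return tuple((1 - t) * a + t * b for a, b in zip(theta0, theta1))


def e_geodesic(theta0, theta1, t):
    """Each natural parameter eta_i = logit(mu_i) is interpolated linearly in t."""
    return tuple(
        sigmoid((1 - t) * logit(a) + t * logit(b)) for a, b in zip(theta0, theta1)
    )


theta0, theta1 = (0.1, 0.5), (0.5, 0.9)
for t in (0.0, 0.5, 1.0):
    print(t, m_geodesic(theta0, theta1, t), e_geodesic(theta0, theta1, t))
```

With these endpoints the two midpoints are approximately $(0.3, 0.7)$ on the m-geodesic and $(0.25, 0.75)$ on the e-geodesic, so the curves coincide only at $t = 0$ and $t = 1$.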
