Dependency Trees & Structural Probes

Structural probes test whether dependency-tree distance and depth are readable as geometry.

Dependency grammar represents a sentence as a tree. Each word depends on a head. For example, an adjective depends on the noun it modifies, a subject depends on its verb, and a determiner depends on its noun. This gives two geometric targets: distance between words in the tree, and depth of each word from the root.

A structural probe asks whether a linear transformation of contextual word representations can make those syntactic quantities visible. In the original version, squared distance in the transformed space is trained to match tree distance, and squared norm is trained to match tree depth. If the transformed geometry recovers a dependency tree, syntax is not just linearly classifiable one edge at a time; a whole tree-like structure is present in the representation.
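As a rough illustration of the two quantities just described, the sketch below computes squared distances and squared norms under a linear map. It assumes NumPy; the function names, the random toy vectors, and the mean-absolute-error gap are all illustrative choices, not the original implementation. The source only says the quantities are "trained to match" their targets.

```python
import numpy as np

def probe_squared_distances(H, B):
    """Return the matrix of ||B(h_i - h_j)||^2 for all word pairs.

    H: (n_words, hidden_dim) contextual vectors for one sentence.
    B: (rank, hidden_dim) linear map, the probe's only parameters.
    """
    T = H @ B.T                            # transformed vectors, (n_words, rank)
    diffs = T[:, None, :] - T[None, :, :]  # all pairwise differences
    return (diffs ** 2).sum(axis=-1)

def probe_squared_norms(H, B):
    """Return ||B h_i||^2 for every word."""
    T = H @ B.T
    return (T ** 2).sum(axis=-1)

def l1_gap(pred, gold):
    """Mean absolute difference between predictions and targets (an assumed loss)."""
    return np.abs(pred - gold).mean()

# Toy data: 4 "words", hidden size 8, probe rank 3. Values are random placeholders.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))
B_dist = rng.normal(size=(3, 8))   # map for the distance probe
B_depth = rng.normal(size=(3, 8))  # a separate map for the depth probe

D = probe_squared_distances(H, B_dist)    # compared against tree distances
norms = probe_squared_norms(H, B_depth)   # compared against tree depths
```

Training would adjust each map to shrink the gap between these outputs and the gold tree distances or depths; the optimization loop is omitted here.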

Figure 1 · Structural probe as tree geometry

1. The syntactic object

A dependency tree is sparse: a sentence with five words has four dependency edges. But the structural probe trains against richer quantities. Every pair of words has a tree distance, and every word has a depth. Those targets turn the tree into a geometry problem.

For a concrete case, take "The cat sat on the mat." The tree distance from cat to sat is 1, and from cat to mat is 3, following the path from cat to sat to on to mat. Depth counts from the root, so sat has depth 0 and cat has depth 1.
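The numbers in this example can be checked with a few lines of Python. The sketch below encodes the parse as a list of head indices; the encoding and the function names are assumptions made for illustration, and the determiners are attached to their nouns as described earlier.

```python
from collections import deque

words = ["The", "cat", "sat", "on", "the", "mat"]
# heads[i] is the index of word i's head; -1 marks the root.
# Path used in the text: cat - sat - on - mat. Determiners attach to their nouns.
heads = [1, 2, -1, 2, 5, 3]

def tree_depths(heads):
    """Number of edges from each word up to the root."""
    depths = []
    for i in range(len(heads)):
        d, node = 0, i
        while heads[node] != -1:
            node = heads[node]
            d += 1
        depths.append(d)
    return depths

def tree_distances(heads):
    """All-pairs path length in the undirected tree, via BFS from each word."""
    n = len(heads)
    adj = [[] for _ in range(n)]
    for child, head in enumerate(heads):
        if head != -1:
            adj[child].append(head)
            adj[head].append(child)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, d = queue.popleft()
            dist[start][node] = d
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
    return dist

depths = tree_depths(heads)
dist = tree_distances(heads)
print(dist[1][2], dist[1][5], depths[2], depths[1])  # 1 3 0 1
```

The full `dist` matrix and `depths` list are the regression targets a distance probe and a depth probe would be fit against for this sentence.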

Tree distance
The number of dependency edges on the shortest path between two words. In "The cat sat on the mat," cat to mat is 3, along cat to sat to on to mat. The distance probe learns a linear map B so the squared distance $\lVert B(h_i - h_j)\rVert^2$ between contextual vectors approximates this count for every pair of words.
Tree depth
The number of edges from a word to the root. The root has depth 0; in the same sentence sat is the root and cat has depth 1. A separately fit probe matches the squared norm $\lVert B h_i\rVert^2$ to depth, which orders words from root to leaves.
MST extraction
Predicted pairwise distances do not by themselves form a tree. The minimum spanning tree over them gives the unrooted, unlabeled tree the probe is scored on, compared edge for edge against the gold parse.
Attachment scores, by what they require
| | Directed (head → dependent) | Undirected |
| --- | --- | --- |
| Unlabeled | UAS: correct head and direction for each word. | UUAS: correct edge, ignoring direction. What structural probes are scored on, because the extracted MST is unrooted. |
| Labeled | LAS: correct head plus the relation label (nsubj, obj, …). | Not standard. Structural probes recover geometry, not labels, so neither labeled score applies to them. |
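MST extraction and UUAS scoring can be sketched in plain Python. The example below uses Prim's algorithm as one simple way to get a minimum spanning tree; the predicted distance matrix is invented for illustration (gold tree distances for "The cat sat on the mat" with one deliberate error), and the function names are hypothetical.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm over a dense symmetric distance matrix.

    Returns undirected edges, each stored as (smaller_index, larger_index).
    """
    n = len(dist)
    in_tree = {0}
    edges = set()
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or dist[i][j] < best[0]):
                    best = (dist[i][j], i, j)
        _, i, j = best
        in_tree.add(j)
        edges.add((min(i, j), max(i, j)))
    return edges

def uuas(pred_edges, gold_edges):
    """Fraction of gold undirected edges that appear in the predicted tree."""
    return len(pred_edges & gold_edges) / len(gold_edges)

# Gold undirected edges, word indices 0..5: The-cat, cat-sat, sat-on, on-mat, the-mat.
gold_edges = {(0, 1), (1, 2), (2, 3), (3, 5), (4, 5)}

# Invented predictions: the second "the" (4) sits slightly closer to "on" (3)
# than to "mat" (5), so the MST attaches it to the wrong word.
pred = [
    [0.0, 1.0, 2.0, 3.0, 5.0, 4.0],
    [1.0, 0.0, 1.0, 2.0, 4.0, 3.0],
    [2.0, 1.0, 0.0, 1.0, 3.0, 2.0],
    [3.0, 2.0, 1.0, 0.0, 0.8, 1.0],
    [5.0, 4.0, 3.0, 0.8, 0.0, 1.1],
    [4.0, 3.0, 2.0, 1.0, 1.1, 0.0],
]

pred_edges = minimum_spanning_tree(pred)
print(sorted(pred_edges))            # [(0, 1), (1, 2), (2, 3), (3, 4), (3, 5)]
print(uuas(pred_edges, gold_edges))  # 0.8
```

Four of the five gold edges are recovered. Because edges are compared as unordered pairs, nothing in this score depends on which word is the head.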

2. Middle-layer peak

Many probing studies find that syntactic information is easiest to decode in middle layers. Early layers are close to surface form, while later layers are closer to task-specific or next-token-output information. Syntax is an intermediate abstraction, so it often peaks in the middle. The layer slider uses a schematic curve; it is not a measurement from a particular model.

In Hewitt and Manning's original setup, a rank-128 distance probe on BERT-base recovers roughly 80% UUAS on the Penn Treebank test set, read from around layer 7.

Geometry is not parsing. Recovering a tree with a probe does not show that the model runs a parser internally. It shows that parse-like structure can be read from the representation under the probe's assumptions.

3. Controls and limitations

Structural probes inherit the general probe-validity problem: the choice of baselines, control tasks, and data splits changes what a given score means. A model may encode word position, lexical association, or local adjacency in ways that help tree recovery without amounting to syntactic knowledge. Lexical controls and evaluation on controlled syntactic phenomena reduce the room for semantic shortcuts.
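To see why local adjacency matters as a confound, consider a hypothetical baseline that ignores the model entirely and links each word to its right neighbor. The sketch below scores that chain against the parse used for "The cat sat on the mat" in section 1; the baseline, the function names, and the toy sentence are illustrative assumptions, and one short sentence says nothing about corpus-level scores.

```python
def chain_edges(n_words):
    """Edges of a left-to-right chain: each word linked to its right neighbor.

    This is the tree an MST would return if predicted distance were just |i - j|:
    the adjacent pairs are the n - 1 cheapest edges and already span the sentence.
    """
    return {(i, i + 1) for i in range(n_words - 1)}

def uuas(pred_edges, gold_edges):
    """Fraction of gold undirected edges that appear in the predicted tree."""
    return len(pred_edges & gold_edges) / len(gold_edges)

# Gold undirected edges, word indices 0..5: The-cat, cat-sat, sat-on, on-mat, the-mat.
gold_edges = {(0, 1), (1, 2), (2, 3), (3, 5), (4, 5)}

baseline_edges = chain_edges(6)
score = uuas(baseline_edges, gold_edges)
print(score)  # 0.8
```

On this sentence the position-only chain already matches four of five gold edges, which is why a probe's score is only informative relative to baselines of this kind.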

The conclusion that survives these checks is conditional and narrower: under this transformation and evaluation, dependency-like geometry is more available than in the chosen baselines.
