Dependency Trees & Structural Probes
Dependency grammar represents a sentence as a tree. Every word except the root depends on exactly one head. For example, an adjective depends on the noun it modifies, a subject depends on its verb, and a determiner depends on its noun. The tree gives two geometric targets: the distance between any two words in the tree, and the depth of each word below the root.
A structural probe asks whether a linear transformation of contextual word representations can make those syntactic quantities visible. In the original version, squared distance in the transformed space is trained to match tree distance, and squared norm is trained to match tree depth. If the transformed geometry recovers a dependency tree, syntax is not just linearly classifiable one edge at a time; a whole tree-like structure is present in the representation.
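As a sketch of what "trained to match" means here, the two objectives can be written as follows. This assumes the L1 loss form of the original paper; the symbols $s^\ell$ (the $\ell$-th training sentence), $d_{T^\ell}$ (tree distance in its gold parse), and $\operatorname{depth}$ are notation introduced for this illustration, while $B$ and $h_i$ are the probe matrix and contextual vectors.

$$
\min_{B} \sum_{\ell} \frac{1}{|s^\ell|^2} \sum_{i,j} \Bigl|\, d_{T^\ell}(w_i^\ell, w_j^\ell) - \lVert B(h_i^\ell - h_j^\ell)\rVert^2 \Bigr|
$$

$$
\min_{B} \sum_{\ell} \frac{1}{|s^\ell|} \sum_{i} \Bigl|\, \operatorname{depth}(w_i^\ell) - \lVert B h_i^\ell\rVert^2 \Bigr|
$$

The two probes are fit separately, so the $B$ in the first objective is not the same matrix as the $B$ in the second.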
1. The syntactic object
A dependency tree is sparse: a sentence with five words has four dependency edges. But the structural probe trains against richer quantities. Every pair of words has a tree distance, and every word has a depth. Those targets turn the tree into a geometry problem.
For a concrete case, take "The cat sat on the mat" under a Stanford-style analysis in which the preposition on heads its object mat. The tree distance from cat to sat is 1, and from cat to mat is 3, following the path from cat to sat to on to mat. Depth counts edges from the root, so sat has depth 0 and cat has depth 1.
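The numbers in this example can be checked with a few lines of Python. This is a minimal sketch, not code from any probing library: the function name `tree_distances_and_depths` and the head-index encoding are assumptions made for illustration, and the parse is the one in which on heads mat.

```python
from collections import deque


def tree_distances_and_depths(heads):
    """heads[i] is the index of word i's head, or -1 for the root."""
    n = len(heads)
    # Build an undirected adjacency list from the head array.
    adj = [[] for _ in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child].append(head)
            adj[head].append(child)

    def bfs(start):
        # Breadth-first search gives edge counts from `start` to every word.
        dist = [-1] * n
        dist[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    distances = [bfs(i) for i in range(n)]
    root = heads.index(-1)
    depths = distances[root]  # depth is simply the distance to the root
    return distances, depths


words = ["The", "cat", "sat", "on", "the", "mat"]
# Assumed parse: The -> cat, cat -> sat, sat = root, on -> sat, the -> mat, mat -> on
heads = [1, 2, -1, 2, 5, 3]
distances, depths = tree_distances_and_depths(heads)

print(distances[1][2])  # cat to sat
print(distances[1][5])  # cat to mat
print(dict(zip(words, depths)))
```

Under a different annotation scheme (for example, one in which mat attaches directly to sat), the same function would report a shorter cat-to-mat path, so the targets a probe is trained on depend on the treebank convention.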
- Tree distance: The number of dependency edges on the shortest path between two words. In "The cat sat on the mat," cat to mat is 3, along cat to sat to on to mat. The distance probe learns a linear map $B$ so that the squared distance $\lVert B(h_i - h_j)\rVert^2$ between contextual vectors approximates this count for every pair of words.
- Tree depth: The number of edges from a word to the root. The root has depth 0; in the same sentence sat is the root and cat has depth 1. A separately fit probe matches the squared norm $\lVert B h_i\rVert^2$ to depth, which orders words from root to leaves.
- MST extraction: Predicted pairwise distances do not by themselves form a tree. The minimum spanning tree over them gives the unrooted, unlabeled tree the probe is scored on, compared edge for edge against the gold parse.
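The distance computation and the MST step can be sketched together in NumPy. Everything named here is an assumption for illustration: `squared_probe_distances` and `mst_edges` are hypothetical helpers, the MST uses Prim's algorithm (any MST algorithm would do), and the vectors `H` are a toy construction rather than real contextual representations. Each toy vector marks the edges on a word's path to the root, which makes squared Euclidean distance equal tree distance exactly, so the identity matrix serves as a perfect probe.

```python
import numpy as np


def squared_probe_distances(H, B):
    """H: (n, d) word vectors; B: (k, d) probe matrix.
    Returns the (n, n) matrix of ||B(h_i - h_j)||^2."""
    T = H @ B.T                              # transformed vectors, (n, k)
    diff = T[:, None, :] - T[None, :, :]     # pairwise differences, (n, n, k)
    return (diff ** 2).sum(axis=-1)


def mst_edges(D):
    """Prim's algorithm on a dense symmetric distance matrix.
    Returns a set of undirected edges as sorted index pairs."""
    n = D.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best_dist = D[0].astype(float).copy()    # cheapest link into the tree
    best_from = np.zeros(n, dtype=int)       # tree node providing that link
    edges = set()
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best_dist)
        v = int(np.argmin(candidates))
        edges.add(tuple(sorted((int(best_from[v]), v))))
        in_tree[v] = True
        closer = (D[v] < best_dist) & ~in_tree
        best_dist = np.where(closer, D[v], best_dist)
        best_from = np.where(closer, v, best_from)
    return edges


# Toy setup for "The cat sat on the mat" (heads as indices, -1 = root).
heads = [1, 2, -1, 2, 5, 3]
n = len(heads)
H = np.zeros((n, n))
for i in range(n):
    j = i
    while heads[j] >= 0:     # walk up to the root, marking each edge crossed
        H[i, j] = 1.0
        j = heads[j]

B = np.eye(n)                # identity probe: the toy vectors need no transform
D = squared_probe_distances(H, B)
pred_edges = mst_edges(D)
gold_edges = {tuple(sorted((i, h))) for i, h in enumerate(heads) if h >= 0}
print(pred_edges == gold_edges)
```

With real model activations, `B` would be learned and the predicted distances would only approximate the tree, which is why the extracted MST is compared against the gold edges rather than assumed correct.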
| | Directed (head → dependent) | Undirected |
|---|---|---|
| Unlabeled | UAS: correct head and direction for each word. | UUAS: correct edge, ignoring direction. This is what structural probes are scored on, because the extracted MST is unrooted. |
| Labeled | LAS: correct head plus the relation label (nsubj, obj, …). | Not standard. Structural probes recover geometry, not labels, so neither labeled score applies to them. |
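As a rough illustration of the UUAS cell, the score is the fraction of gold edges that also appear in the predicted tree once direction is discarded. The function name `uuas` and the edge encoding are assumptions for this sketch, and it scores every edge; published evaluations typically apply extra conventions such as ignoring punctuation.

```python
def uuas(pred_edges, gold_edges):
    """Undirected unlabeled attachment score for one sentence.
    Edges are pairs of word indices; direction is ignored."""
    pred = {frozenset(edge) for edge in pred_edges}
    gold = {frozenset(edge) for edge in gold_edges}
    return len(pred & gold) / len(gold)


# Gold edges for "The cat sat on the mat" (indices 0..5), given as (dependent, head).
gold = [(0, 1), (1, 2), (3, 2), (5, 3), (4, 5)]
# A hypothetical prediction that attaches mat to sat instead of to on.
pred = [(1, 0), (2, 1), (2, 3), (2, 5), (5, 4)]

score = uuas(pred, gold)
print(score)  # 4 of 5 gold edges recovered
```

Because edges are compared as unordered pairs, the reversed pairs in `pred` still count as matches; a directed score such as UAS would need a rooted tree, which the MST does not provide.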
2. Middle-layer peak
Many probing studies find that syntactic information is easiest to decode in middle layers. Early layers are close to surface form, while later layers are closer to task-specific or next-token-output information. Syntax is an intermediate abstraction, so it often peaks in the middle. The layer slider uses a schematic curve; it is not a measurement from a particular model.
In Hewitt and Manning's original setup, a distance probe on BERT-base recovers roughly 80% UUAS on the Penn Treebank test set when read from layer 7, and BERT-large reaches roughly 82% around layer 15.
3. Controls and limitations
Structural probes inherit the general probe-validity problem. Baselines, control tasks, and data splits affect the interpretation. A model may encode word position, lexical association, or local adjacency in ways that help tree recovery without amounting to syntactic knowledge. Lexical controls and controlled syntactic phenomena reduce semantic shortcuts.
The conditional conclusion is narrower: under this transformation and evaluation, dependency-like geometry is more available than in the chosen baselines.
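One of the simplest baselines of this kind connects each word to its right-hand neighbor, which tests how much of a score local adjacency alone can explain. The sketch below is illustrative only: `adjacent_word_edges` and `uuas` are hypothetical helper names, and the gold edges are those of the running example sentence.

```python
def adjacent_word_edges(n):
    """Baseline 'tree': a chain linking each word to its right neighbor."""
    return {(i, i + 1) for i in range(n - 1)}


def uuas(pred_edges, gold_edges):
    """Fraction of gold edges recovered, ignoring direction."""
    pred = {frozenset(edge) for edge in pred_edges}
    gold = {frozenset(edge) for edge in gold_edges}
    return len(pred & gold) / len(gold)


# Gold undirected edges for "The cat sat on the mat" (indices 0..5).
gold_edges = {(0, 1), (1, 2), (2, 3), (3, 5), (4, 5)}

baseline_score = uuas(adjacent_word_edges(6), gold_edges)
print(baseline_score)  # the chain misses only the on-mat edge
```

On a short sentence like this one, the chain already recovers most edges, so a probe's score is only informative relative to such a baseline computed on the same data.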
- Hewitt and Manning (2019), "A Structural Probe for Finding Syntax in Word Representations", for the distance/depth probe.
- Hall Maudslay and Cotterell (2021), "Do Syntactic Probes Probe Syntax?", for lexical controls.
- Probes and Validity for the broader probe-evidence framework.