STRUCTFORMER: JOINT UNSUPERVISED INDUCTION OF DEPENDENCY AND CONSTITUENCY STRUCTURE FROM MASKED LANGUAGE MODELING

Abstract

There are two major classes of natural language grammar: the dependency grammar, which models one-to-one correspondences between words, and the constituency grammar, which models the assembly of one or several corresponding words. While previous unsupervised parsing methods mostly focus on inducing only one class of grammar, we introduce a novel model, StructFormer, that can induce dependency and constituency structure at the same time. To achieve this, we propose a new parsing framework that can jointly generate a constituency tree and a dependency graph. We then integrate the induced dependency relations into the Transformer, in a differentiable manner, through a novel dependency-constrained self-attention mechanism. Experimental results show that our model can achieve strong results on unsupervised constituency parsing, unsupervised dependency parsing, and masked language modeling at the same time.

1. INTRODUCTION

Human languages have a rich latent structure. This structure is multifaceted, with the two major classes of grammar being dependency and constituency structures. There has been an exciting breadth of recent work targeted at learning this structure in a data-driven, unsupervised fashion. The core principle behind recent methods that induce structure from data is simple: provide an inductive bias that is conducive for structure to emerge as a byproduct of some self-supervised training, e.g., language modeling. To this end, a wide range of models have been proposed that are able to successfully learn grammar structures (Shen et al., 2018a;c; Wang et al., 2019; Kim et al., 2019b;a). However, most of these works focus on learning constituency structures alone. To the best of our knowledge, there has been no prior work that is able to induce, in an unsupervised fashion, more than one grammar structure at once. In this paper, we make two important technical contributions. First, we introduce a new neural model that is able to induce dependency structures from raw data in an end-to-end unsupervised fashion. Most existing approaches induce dependency structures from other syntactic information like gold POS tags (Klein & Manning, 2004; Cohen & Smith, 2009; Jiang et al., 2016). Previous works that train from words alone often require additional information, like pre-trained word clustering (Spitkovsky et al., 2011), pre-trained word embeddings (He et al., 2018), acoustic cues (Pate & Goldwater, 2013), or annotated data from related languages (Cohen et al., 2011). Second, we introduce the first neural model that is able to induce both dependency structure and constituency structure at the same time. Specifically, our approach aims to unify latent structure induction of different types of grammar within the same framework.
We introduce a new inductive bias that enables Transformer models to induce a directed dependency graph in a fully unsupervised manner. To avoid the need for grammar labels during training, we use a distance-based parsing mechanism. The key idea is to predict a sequence of Syntactic Distances T (Shen et al., 2018b) and a sequence of Syntactic Heights ∆ (Luo et al., 2019) that represent dependency graphs and constituency trees at the same time. Examples of ∆ and T are illustrated in Figure 1a. Based on the syntactic distances T and syntactic heights ∆, we provide a new dependency-constrained self-attention layer to replace the multi-head self-attention layer in the standard Transformer model. More concretely, each attention head can only attend to its parent (to avoid confusion with the self-attention head, we use "parent" to denote the "head" in a dependency graph) or its dependents in the predicted dependency structure, through a weighted sum of the different relations shown in Figure 1b. In this way, we replace the complete graph in the standard Transformer model with a differentiable directed dependency graph. During training on a downstream task (e.g., masked language modeling), the model gradually converges to a reasonable dependency graph via gradient descent. Thus, the parser can be trained in an unsupervised manner as a component of the model. Incorporating the new parsing mechanism, the dependency-constrained self-attention, and the Transformer architecture, we introduce a new model named StructFormer. The proposed model can perform unsupervised dependency and constituency parsing at the same time, and can leverage the parsing results to achieve strong performance on masked language modeling tasks.

2. RELATED WORK

Previous works on unsupervised dependency parsing are primarily based on the dependency model with valence (DMV) (Klein & Manning, 2004) and its extensions (Daumé III, 2009; Gillenwater et al., 2010). To effectively learn the DMV model for better parsing accuracy, a variety of inductive biases and handcrafted features, such as correlations between parameters of grammar rules involving different part-of-speech (POS) tags, have been proposed to incorporate prior information into learning. The most recent progress is the neural DMV model (Jiang et al., 2016), which uses a neural network to predict grammar rule probabilities based on distributed representations of POS tags. However, most previous unsupervised dependency parsing algorithms require gold POS tags as input, which are labeled by humans and can be potentially difficult (or prohibitively expensive) to obtain for large corpora. Spitkovsky et al. (2011) proposed to overcome this problem with unsupervised word clustering that can dynamically assign tags to each word considering its context. Unsupervised constituency parsing has recently received more attention. PRPN (Shen et al., 2018a) and ON-LSTM (Shen et al., 2018c) induce tree structures by introducing an inductive bias into recurrent neural networks. PRPN proposes a parsing network to compute the syntactic distance of all word pairs, while a reading network utilizes the syntactic structure to attend to relevant memories. ON-LSTM allows hidden neurons to learn long-term or short-term information through a novel gating mechanism and activation function. In URNNG (Kim et al., 2019b), amortized variational inference is applied between a recurrent neural network grammar (RNNG) (Dyer et al., 2016) decoder and a tree-structure inference network, which encourages the decoder to generate reasonable tree structures.
DIORA (Drozdov et al., 2019) proposes using inside-outside dynamic programming to compose latent representations from all possible binary trees. The representations of the inside and outside passes over the same sentence are optimized to be close to each other. Compound PCFG (Kim et al., 2019a) achieves grammar induction by maximizing the marginal likelihood of the sentences in a corpus, generated by a probabilistic context-free grammar (PCFG). Tree Transformer (Wang et al., 2019) adds extra locality constraints to the Transformer encoder's self-attention to encourage the attention heads to follow a tree structure, such that each token can only attend to nearby neighbors in lower layers and gradually extends its attention field to further tokens when climbing to higher layers. Though large-scale pre-trained models have dominated most natural language processing tasks, some recent work indicates that neural network models can see accuracy gains by leveraging syntactic information rather than ignoring it (Marcheggiani & Titov, 2017; Strubell et al., 2018). Strubell et al. (2018) introduce syntactically-informed self-attention that forces one attention head to attend to the syntactic governor of each input token. Omote et al. (2019) and Deguchi et al. (2019) argue that dependency-informed self-attention can improve the Transformer's performance on machine translation. Kuncoro et al. (2020) show that syntactic biases help large-scale pre-trained models, like BERT, achieve better language understanding.

3. SYNTACTIC DISTANCE AND HEIGHT

In this section, we first reintroduce the concepts of syntactic distance and height, then discuss their relation in the context of StructFormer.

3.1. SYNTACTIC DISTANCE

Syntactic distance was proposed in Shen et al. (2018b) to quantify the process of splitting sentences into smaller constituents.

Definition 3.1. Let T be a constituency tree for sentence (w_0, ..., w_n). The height of the lowest common ancestor of consecutive words w_i and w_{i+1} is τ̃_i. The syntactic distances T = (τ_0, ..., τ_{n-1}) are defined as any sequence of n real scalars that share the same rank as (τ̃_0, ..., τ̃_{n-1}).

In other words, each syntactic distance τ_i is associated with the split point (i, i+1) and specifies the relative order in which the sentence will be split into smaller components. Thus, any sequence of n real values can be unambiguously mapped to an unlabeled binary constituency tree with n+1 leaves through Algorithm 1 (Shen et al., 2018b).

Algorithm 1 Distance to binary constituency tree
1: function CONSTITUENT(w, d)
2:   if d = [] then
3:     T ⇐ Leaf(w)
4:   else
5:     i ⇐ argmax_i(d)
6:     child_l ⇐ CONSTITUENT(w_{≤i}, d_{<i})
7:     child_r ⇐ CONSTITUENT(w_{>i}, d_{>i})
8:     T ⇐ Node(child_l, child_r)
9:   return T

Algorithm 2 Converting a binary constituency tree to a dependency graph
1: function DEPENDENT(T, ∆)
2:   if T = w then
3:     D ⇐ [], parent ⇐ w
4:   else
5:     child_l, child_r ⇐ T
6:     D_l, parent_l ⇐ DEPENDENT(child_l, ∆)
7:     D_r, parent_r ⇐ DEPENDENT(child_r, ∆)
8:     D ⇐ Union(D_l, D_r)
9:     if ∆(parent_l) > ∆(parent_r) then
10:      parent ⇐ parent_l, D ⇐ D ∪ {(parent_r → parent_l)}
11:    else
12:      parent ⇐ parent_r, D ⇐ D ∪ {(parent_l → parent_r)}
13:  return D, parent

As Shen et al. (2018c;a) and Wang et al. (2019) pointed out, the syntactic distance reflects the information communication between constituents. More concretely, a large syntactic distance τ_i indicates that less short-term or local information should be communicated between (x_{≤i}) and (x_{>i}). Combined with the correct inductive bias, we can leverage this feature to build unsupervised dependency parsing models.
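Algorithm 1 can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the authors' code; trees are represented as nested tuples and a bare string stands for a leaf:

```python
def constituent(words, dists):
    """Build an unlabeled binary tree from n+1 words and n syntactic distances."""
    if not dists:
        return words[0]  # a single word is a leaf
    # the split point with the largest distance separates the sentence first
    i = max(range(len(dists)), key=lambda k: dists[k])
    left = constituent(words[:i + 1], dists[:i])
    right = constituent(words[i + 1:], dists[i + 1:])
    return (left, right)

# "I like cats" with a larger distance at the split after "I"
print(constituent(["I", "like", "cats"], [2.0, 1.0]))  # → ('I', ('like', 'cats'))
```

Because only the rank of the distances matters, any monotone rescaling of `dists` yields the same tree.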

3.2. SYNTACTIC HEIGHT

Syntactic height was proposed in Luo et al. (2019), where it is used to capture the distance to the root node in a dependency graph. A word with a high syntactic height is close to the root node. In this paper, to match the definition of syntactic distance, we redefine syntactic height as follows. Definition 3.2. Let D be a dependency graph for sentence (w_0, ..., w_n). The height of a token w_i in D is δ̃_i. The syntactic heights of D can be any sequence of n+1 real scalars ∆ = (δ_0, ..., δ_n) that share the same rank as (δ̃_0, ..., δ̃_n). Although syntactic height is defined based on the dependency structure, we cannot rebuild the original dependency structure from syntactic heights alone, since there is no information about whether a token should be attached to the left side or the right side. However, given an unlabelled constituency tree, we can convert it into a dependency graph with the help of syntactic heights. The conversion process is similar to the standard process of converting a constituency treebank to a dependency treebank (Gelbukh et al., 2005). Instead of using the constituent labels and POS tags to identify the parent of each constituent, we simply assign the token with the largest syntactic height as the parent of each constituent. The conversion algorithm is described in Algorithm 2. In Appendix A.1, we also propose a joint algorithm that takes T and ∆ as inputs and outputs the constituency tree and dependency graph at the same time.
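Algorithm 2 admits an equally short sketch under the same hypothetical nested-tuple tree representation, with heights supplied as a word-to-score mapping:

```python
def dependent(tree, height):
    """Return (arcs, head): arcs are (dependent, parent) pairs for subtree `tree`."""
    if isinstance(tree, str):
        return [], tree  # a leaf is its own head
    left, right = tree
    arcs_l, head_l = dependent(left, height)
    arcs_r, head_r = dependent(right, height)
    arcs = arcs_l + arcs_r
    # the child head with the larger syntactic height heads the merged constituent
    if height[head_l] > height[head_r]:
        return arcs + [(head_r, head_l)], head_l
    return arcs + [(head_l, head_r)], head_r

tree = ("I", ("like", "cats"))
height = {"I": 1.0, "like": 3.0, "cats": 2.0}
print(dependent(tree, height))  # → ([('cats', 'like'), ('I', 'like')], 'like')
```

This reproduces the Figure 1 example: "like" heads both (like cats) and the whole sentence.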

3.3. THE RELATION BETWEEN SYNTACTIC DISTANCE AND HEIGHT

As discussed previously, the syntactic distance controls the information communication between the two sides of a split point, while the syntactic height quantifies the centrality of each token in the dependency graph. A token with a large syntactic height tends to have more long-term dependency relations that connect different parts of the sentence together. In StructFormer, we quantify the syntactic distance and height on the same scale: given a split point (i, i+1) with syntactic distance τ_i, only tokens x_j with δ_j > τ_i can have connections across the split point (i, i+1). Thus, tokens with a small syntactic height are limited to mostly attending to nearby tokens.
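The interplay between the two quantities can be illustrated with a toy predicate (the heights and distances below are made-up numbers for illustration):

```python
def can_cross(j, split, heights, dists):
    # token j may hold a dependency edge across split point (split, split+1)
    # only if its syntactic height exceeds that split's syntactic distance
    return heights[j] > dists[split]

heights = [1.0, 3.0, 2.0]  # I, like, cats
dists = [2.5, 1.5]         # splits (I | like), (like | cats)
print(can_cross(1, 0, heights, dists))  # → True: "like" reaches across to "I"
print(can_cross(0, 0, heights, dists))  # → False: "I" stays local
```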

4. STRUCTFORMER

In this section, we present the StructFormer model. Figure 2a shows the architecture of StructFormer, which includes a parser network and a Transformer module. The parser network predicts T and ∆, then passes them to a set of differentiable functions to generate dependency distributions. The Transformer module takes these distributions and the sentence as input and computes a contextual embedding for each position. StructFormer can be trained in an end-to-end fashion on a masked language modeling task. In this setting, the gradient backpropagates through the relation distributions into the parser.

4.1. PARSING NETWORK

As shown in Figure 2b, the parsing network takes word embeddings as input and feeds them into several convolution layers:

s_{l,i} = tanh(Conv(s_{l-1,i-W}, s_{l-1,i-W+1}, ..., s_{l-1,i+W})) (1)

where s_{l,i} is the output of the l-th layer at the i-th position, s_{0,i} is the input embedding of token w_i, and 2W + 1 is the convolution kernel size. Given the output of the convolution stack s_{N,i}, we parameterize the syntactic distance T as:

τ_i = W_1^τ tanh(W_2^τ [s_{N,i}; s_{N,i+1}] + b_2^τ) + b_1^τ (2)

where τ_i is the contextualized distance for the i-th split point, between tokens w_i and w_{i+1}. The syntactic height ∆ is parameterized in a similar way:

δ_i = W_1^δ tanh(W_2^δ s_{N,i} + b_2^δ) + b_1^δ (3)

4.2. ESTIMATING THE DEPENDENCY DISTRIBUTION

Given T and ∆, we now explain how to estimate the probability p(x_j|x_i) that the j-th token is the parent of the i-th token. The first step is identifying the smallest legitimate constituent C(x_i) that contains x_i and whose parent is not x_i. The second step is identifying the parent of that constituent, x_j = Pr(C(x_i)). Following the discussion in Section 3.2, the parent of C(x_i) must be the parent of x_i. Thus, the two-step process of identifying the parent of x_i can be formulated as:

D(x_i) = Pr(C(x_i)) (4)

In StructFormer, C(x_i) is represented as a constituent [l, r], where l is the starting index (l ≤ i) and r is the ending index (r ≥ i) of C(x_i). For example, in Figure 3, C(x_4) = [4, 8] and the parent of the constituent is Pr([4, 8]) = x_6, thus D(x_4) = x_6. In a dependency graph, x_i is only connected to its parent and dependents. This means that x_i has no direct connection to the outside of C(x_i). In other words, C(x_i) = [l, r] is the smallest constituent that satisfies:

δ_i < τ_{l-1}, δ_i < τ_r (5)

where τ_{l-1} is the first τ that is larger than δ_i while looking backward from the i-th token, and τ_r is the first τ that is larger than δ_i while looking forward. In the previous example, δ_4 = 3.5, τ_3 = 4 > δ_4 and τ_8 = ∞ > δ_4, thus C(x_4) = [4, 8].
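The hard (non-differentiable) version of this first step, finding C(x_i) = [l, r], can be sketched as follows; the heights and distances below are invented toy values, not the ones in Figure 3:

```python
def smallest_constituent(i, heights, dists):
    """Expand [l, r] from token i until a split distance exceeds the token's height."""
    n = len(heights)
    l = i
    while l > 0 and dists[l - 1] <= heights[i]:
        l -= 1
    r = i
    while r < n - 1 and dists[r] <= heights[i]:
        r += 1
    return l, r

heights = [1.0, 3.0, 2.0]  # toy values for "I like cats"
dists = [2.5, 1.5]
print(smallest_constituent(2, heights, dists))  # → (1, 2): "cats" sits in (like cats)
```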
To make this process differentiable, we define τ_k as a real value and δ_i as a probability distribution p(δ̃_i). For simplicity and efficiency of computation, we directly parameterize the cumulative distribution function p(δ̃_i > τ_k) with the sigmoid function:

p(δ̃_i > τ_k) = σ((δ_i - τ_k)/μ_1) (6)

where σ is the sigmoid function, δ_i is the mean of the distribution p(δ̃_i), and μ_1 is a learnable temperature term. Thus the probability that the l-th (l < i) token is inside C(x_i) is equal to the probability that δ̃_i is larger than the maximum distance τ between l and i:

p(l ∈ C(x_i)) = p(δ̃_i > max(τ_l, ..., τ_{i-1})) = σ((δ_i - max(τ_l, ..., τ_{i-1}))/μ_1) (7)

Then we can compute the probability distribution for l:

p_left(l|i) = p(l ∈ C(x_i)) - p(l-1 ∈ C(x_i)) = σ((δ_i - max(τ_l, ..., τ_{i-1}))/μ_1) - σ((δ_i - max(τ_{l-1}, ..., τ_{i-1}))/μ_1) (8)

Similarly, we can compute the probability distribution for r:

p_right(r|i) = σ((δ_i - max(τ_i, ..., τ_{r-1}))/μ_1) - σ((δ_i - max(τ_i, ..., τ_r))/μ_1) (9)

The probability distribution for [l, r] = C(x_i) can then be computed as:

p_C([l, r]|i) = { p_left(l|i) p_right(r|i), l ≤ i ≤ r; 0, otherwise } (10)

The second step is to identify the parent of [l, r]. For any constituent [l, r], we choose j = argmax_{k∈[l,r]}(δ_k) as its parent. In the previous example, given constituent [4, 8], the maximum syntactic height is δ_6 = 4.5, thus Pr([4, 8]) = x_6. We use the softmax function to parameterize the probability p_Pr(j|[l, r]):

p_Pr(j|[l, r]) = { exp(δ_j/μ_2) / Σ_{l≤k≤r} exp(δ_k/μ_2), l ≤ j ≤ r; 0, otherwise } (11)

Given p_Pr(j|[l, r]) and p_C([l, r]|i), we can compute the probability that x_j is the parent of x_i:

p_D(j|i) = { Σ_{[l,r]} p_Pr(j|[l, r]) p_C([l, r]|i), i ≠ j; 0, i = j } (12)
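Equations (6)-(12) can be prototyped directly, marginalizing over all candidate constituents containing token i. This is an unvectorized illustrative sketch (a real implementation would batch these operations as tensors); the infinite distance at sentence boundaries is emulated with a large constant:

```python
import math

def sigma(x):  # numerically safe sigmoid
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def parent_distribution(i, heights, dists, mu1=1.0, mu2=1.0):
    """p_D(j|i): probability that token j is the parent of token i (eq. 12)."""
    n = len(heights)
    BIG = 1e9  # stands in for the infinite distance at sentence boundaries

    def maxd(a, b):  # max of dists[a:b]; out-of-range → +inf, empty → -inf
        if a < 0 or b > len(dists):
            return BIG
        if a >= b:
            return -BIG
        return max(dists[a:b])

    p = [0.0] * n
    for l in range(i + 1):
        for r in range(i, n):
            # eq. 8 and 9: CDF differences for the left and right boundaries
            p_l = (sigma((heights[i] - maxd(l, i)) / mu1)
                   - sigma((heights[i] - maxd(l - 1, i)) / mu1))
            p_r = (sigma((heights[i] - maxd(i, r)) / mu1)
                   - sigma((heights[i] - maxd(i, r + 1)) / mu1))
            pc = p_l * p_r  # eq. 10: probability that C(x_i) = [l, r]
            # eq. 11: softmax over heights picks the constituent's parent
            z = sum(math.exp(heights[k] / mu2) for k in range(l, r + 1))
            for j in range(l, r + 1):
                if j != i:
                    p[j] += pc * math.exp(heights[j] / mu2) / z
    return p
```

With the toy values heights = [1.0, 3.0, 2.0] and dists = [2.5, 1.5], the most probable parent of token 0 ("I") comes out as token 1 ("like").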

4.3. DEPENDENCY-CONSTRAINED MULTI-HEAD SELF-ATTENTION

The multi-head self-attention in the Transformer can be seen as an information propagation mechanism on the complete graph G = (X, E), where the set of vertices X contains all n tokens in the sentence, and the set of edges E contains all possible word pairs (x_i, x_j). StructFormer replaces the complete graph G with a soft dependency graph D = (X, A), where A is the set of n probability distributions {p_D(j|i)} that represent the probability of a directed edge existing between dependent i and parent j. We call these edges directed because each specific head is only allowed to propagate information either from parent to dependent or from dependent to parent. To do so, StructFormer associates each attention head with a probability distribution over the parent and dependent relations:

p_parent = exp(w_parent) / (exp(w_parent) + exp(w_dep)), p_dep = exp(w_dep) / (exp(w_parent) + exp(w_dep)) (13)

where w_parent and w_dep are learnable parameters associated with each attention head, and p_parent is the probability that this head will propagate information from parent to dependent (and vice versa for p_dep). The model learns this association from the downstream task via gradient descent. We can then compute the probability that information is propagated from node j to node i via this head:

p_{i,j} = p_parent p_D(j|i) + p_dep p_D(i|j) (14)

However, Htut et al. (2019) pointed out that different heads tend to associate with different types of universal dependency relations (including nsubj, obj, advmod, etc.), but there is no generalist head that can work with all different relations. To accommodate this observation, we compute an individual probability for each head and pair of tokens (x_i, x_j):

q_{i,j} = sigmoid(QK^T / √d_k) (15)

where Q and K are the query and key matrices in a standard Transformer model and d_k is the dimension of the attention head. The equation is inspired by the scaled dot-product attention in the Transformer. We replace the original softmax function with the sigmoid function, so q_{i,j} becomes an independent probability that indicates whether this specific head should work for the word pair (x_i, x_j). In the end, we propose to replace the Transformer's scaled dot-product attention with our dependency-constrained self-attention:

Attention(Q_i, K_j, V_j, D) = p_{i,j} q_{i,j} V_j (16)
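A single dependency-constrained attention head (equations 13-16) can be sketched without any deep-learning framework, using plain lists for the Q, K, V matrices; the variable names are ours, not from a released implementation:

```python
import math

def dep_constrained_attention(q, k, v, p_D, w_parent, w_dep, d_k):
    """One attention head; q, k, v are n x d matrices, p_D[i][j] = p_D(j|i)."""
    n = len(q)
    # eq. 13: head-level mixture of parent vs dependent directions
    e_p, e_d = math.exp(w_parent), math.exp(w_dep)
    pp, pd = e_p / (e_p + e_d), e_d / (e_p + e_d)
    out = []
    for i in range(n):
        ctx = [0.0] * len(v[0])
        for j in range(n):
            # eq. 14: structural gate from the soft dependency graph
            p_ij = pp * p_D[i][j] + pd * p_D[j][i]
            # eq. 15: sigmoid-scored dot product replaces the softmax
            score = sum(q[i][t] * k[j][t] for t in range(d_k)) / math.sqrt(d_k)
            q_ij = 1.0 / (1.0 + math.exp(-score))
            # eq. 16: weighted sum of values
            for t in range(len(v[0])):
                ctx[t] += p_ij * q_ij * v[j][t]
        out.append(ctx)
    return out
```

If p_D is all zeros, every context vector is zero: information only flows along edges the parser proposes.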

5. EXPERIMENTS

We evaluate the proposed model on three tasks: Masked Language Modeling, Unsupervised Constituency Parsing, and Unsupervised Dependency Parsing. Our implementation of StructFormer is close to the original Transformer encoder (Vaswani et al., 2017), except that we put the layer normalization in front of each layer, similar to the T5 model (Raffel et al., 2019). We found that this modification allows the model to converge faster. For all experiments, we set the number of layers L = 8, the embedding size and hidden size to d_model = 512, the number of self-attention heads to h = 8, the feed-forward size to d_ff = 2048, the dropout rate to 0.1, and the number of convolution layers in the parsing network to L_p = 3.

5.1. MASKED LANGUAGE MODEL

Masked Language Modeling (MLM) has been widely used as a pretraining objective for large-scale pretraining models. In BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019), the authors found that MLM perplexities on a held-out evaluation set have a positive correlation with end-task performance. We trained and evaluated our model on two different datasets: the Penn TreeBank (PTB) and BLLIP. In our MLM experiments, each individual token has an independent chance to be replaced by a mask token <mask>, except that we never replace the <unk> token. The training and evaluation objective for the Masked Language Model is to predict the replaced tokens, and MLM performance is evaluated by measuring perplexity on the masked words. PTB is a standard dataset for language modeling (Mikolov et al., 2012) and unsupervised constituency parsing (Shen et al., 2018c; Kim et al., 2019a). Following the setting proposed in Shen et al. (2018c), we use Mikolov et al. (2012)'s preprocessing, which removes all punctuation and replaces low-frequency tokens with <unk>. The preprocessing results in a vocabulary size of 10001 (including <unk>, <pad> and <mask>). For PTB, we use a 30% mask rate. BLLIP is a large Penn Treebank-style parsed corpus of approximately 24 million sentences. We train and evaluate StructFormer on three splits of BLLIP: BLLIP-XS (40k sentences, 1M tokens), BLLIP-SM (200k sentences, 5M tokens), and BLLIP-MD (600k sentences, 14M tokens). They are obtained by randomly sampling sections from the BLLIP 1987-89 Corpus Release 1. All models are tested on a shared held-out test set (20k sentences, 500k tokens). Following the settings provided in Hu et al. (2020), we use a subword-level vocabulary extracted from the GPT-2 pre-trained model rather than from the BLLIP training corpora. For BLLIP, we use a 15% mask rate.
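The masking scheme amounts to an independent Bernoulli draw per token. A hypothetical helper (the function name and seed handling are ours, not from the paper) might look like:

```python
import random

def mask_tokens(tokens, mask_rate, seed=0):
    """Independently replace tokens with <mask>; <unk> is never masked."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if tok != "<unk>" and rng.random() < mask_rate:
            masked.append("<mask>")
            targets.append(tok)   # the MLM loss is computed only at these positions
        else:
            masked.append(tok)
            targets.append(None)  # no prediction target here
    return masked, targets
```

Under the settings above, `mask_rate` would be 0.3 for PTB and 0.15 for BLLIP.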

5.2. UNSUPERVISED CONSTITUENCY PARSING

The unsupervised constituency parsing task compares the latent tree structure induced by the model with those annotated by human experts. We use Algorithm 1 to predict constituency trees from the T predicted by StructFormer. Following the experiment settings proposed in Shen et al. (2018c), we take the model trained on the PTB dataset and evaluate it on the WSJ test set. The WSJ test set is section 23 of the WSJ corpus; it contains 2416 sentences labeled by human experts. Punctuation is ignored during evaluation.

5.3. UNSUPERVISED DEPENDENCY PARSING

The unsupervised dependency parsing evaluation compares the induced dependency relations with those in a reference dependency graph. The most common metric is the Unlabeled Attachment Score (UAS), which measures the percentage of tokens that are correctly attached to their parent in the reference tree. Another widely used metric for unsupervised dependency parsing is the Undirected Unlabeled Attachment Score (UUAS), which measures the percentage of undirected, unlabeled reference connections that are recovered by the induced tree. As in the unsupervised constituency parsing evaluation, we take the model trained on the PTB dataset and evaluate it on the WSJ test set (section 23). For the WSJ test set, reference dependency graphs are converted from its human-annotated constituency trees.

Unsupervised Parsing Models F1
PRPN (Shen et al., 2018a) 37.4 (0.3)
ON-LSTM (Shen et al., 2018c) 47.7 (1.5)
Tree-T (Wang et al., 2019) 49.5
URNNG (Kim et al., 2019b) 52.4
C-PCFG (Kim et al., 2019a) 55.2
StructFormer 54.0 (0.3)

(a) Constituency Parsing Results. * results are from Kim et al. (2020).

Methods UAS

w/o gold POS tags
DMV (Klein & Manning, 2004) 35.8
E-DMV (Headden III et al., 2009) 38.2
UR-A E-DMV (Tu & Honavar, 2012) 46.1
CS* (Spitkovsky et al., 2013) 64.4*
Neural E-DMV (Jiang et al., 2016) 42.7
Gaussian DMV (He et al., 2018) 43.1 (1.2)
INP (He et al., 2018) 47.9 (1.2)
StructFormer 46.2 (0.4)

w/ gold POS tags (for reference only)
DMV (Klein & Manning, 2004) 39.7
UR-A E-DMV (Tu & Honavar, 2012) 57.0
MaxEnc (Le & Zuidema, 2015) 65.8
Neural E-DMV (Jiang et al., 2016) 57.6
CRFAE (Cai et al., 2017) 55.7
L-NDMV† (Han et al., 2017) 63.2

Table 2: The unsupervised parsing performance of different models.

Following the settings of previous papers (Jiang et al., 2016), we ignore punctuation during evaluation. To obtain dependency relations from our model, we compute the argmax of the dependency distribution, k = argmax_{j≠i} p_D(j|i), and assign the k-th token as the parent of the i-th token.
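The two attachment metrics can be computed from parent indices alone. A minimal sketch, using our own convention of marking the root with -1:

```python
def uas_uuas(pred_parents, gold_parents):
    """UAS: fraction of tokens with the correct parent.
    UUAS: fraction of gold undirected edges recovered."""
    n = len(gold_parents)
    directed = sum(p == g for p, g in zip(pred_parents, gold_parents))
    gold_edges = {frozenset((i, g)) for i, g in enumerate(gold_parents) if g >= 0}
    pred_edges = {frozenset((i, p)) for i, p in enumerate(pred_parents) if p >= 0}
    return directed / n, len(gold_edges & pred_edges) / len(gold_edges)

# flipping one edge's direction hurts UAS but not UUAS
print(uas_uuas([1, 2, 1], [1, -1, 1]))  # → (0.6666..., 1.0)
```

This also illustrates why UUAS is always at least as high as UAS: a reversed edge still counts as an undirected match.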

5.4. EXPERIMENTAL RESULTS

The masked language model results are shown in Table 1. StructFormer consistently outperforms our Transformer baseline. This result aligns with previous observations that linguistically informed self-attention can help Transformers achieve stronger performance. We also observe that StructFormer converges much faster than the standard Transformer model. Table 2a shows that our model achieves strong results on unsupervised constituency parsing. While C-PCFG (Kim et al., 2019a) achieves stronger parsing performance thanks to its strong linguistic constraints (e.g., a finite set of production rules), StructFormer may have a broader domain of application. For example, it can replace the standard Transformer encoder in most popular large-scale pretrained language models (e.g., BERT and RoBERTa) and in Transformer-based machine translation models. It is also interesting to notice that, different from Tree-T (Wang et al., 2019), we did not directly use constituents to restrict the self-attention receptive field, yet we achieve stronger constituency parsing performance under the same experimental setting. This result may suggest that dependency relations are more suitable for grammar induction in Transformer-based models. Table 3 shows that our model achieves strong accuracy in predicting Noun Phrases (NP), Prepositional Phrases (PP), Adjective Phrases (ADJP), and Adverb Phrases (ADVP). While previous unsupervised dependency parsing methods rely on some kind of latent POS tags or pretrained word embeddings, StructFormer can be seen as an easy-to-use alternative that works in an end-to-end fashion. Table 5 shows that our model recovers 61.6% of undirected dependency relations. Given the strong performance on both dependency parsing and masked language modeling, we believe that the dependency graph schema could be a viable substitute for the complete graph schema used in the standard Transformer.
Since our model uses a mixture of relation probability distributions for each self-attention head, we also studied how different combinations of relations affect its performance.

A APPENDIX

A.1 JOINT DEPENDENCY AND CONSTITUENCY PARSING

Algorithm 3 The joint dependency and constituency parsing algorithm. Inputs are a sequence of words w, syntactic distances d, and syntactic heights h. Outputs are a binary constituency tree T, a dependency graph D represented as a set of dependency relations, the parent of the dependency graph D, and the syntactic height of that parent.



The ablation results show that the model achieves the best performance when using both parent and dependent relations. The model suffers more on dependency parsing if the parent relation is removed, and more on constituency parsing if the dependent relation is removed.

6. CONCLUSION

In this paper, we introduce a novel joint dependency and constituency parsing framework. Based on this framework, we propose StructFormer, a new unsupervised parsing algorithm that performs unsupervised dependency and constituency parsing at the same time. We also introduce a novel dependency-constrained self-attention mechanism that allows each attention head to focus on a specific mixture of dependency relations. This brings Transformers closer to modeling a directed dependency graph. The experiments show promising results: StructFormer can induce meaningful dependency and constituency structures and achieve better performance on the masked language modeling task. This research provides a new path towards building more linguistic bias into pre-trained language models.



(a) An example of Syntactic Distances T (grey bars) and Syntactic Heights ∆ (white bars). In this example, like is the parent (head) of the constituents (like cats) and (I like cats). (b) Two types of dependency relations. The parent distribution allows each token to attend to its parent, while the dependent distribution allows each token to attend to its dependents. For example, the parent of cats is like, and cats and I are dependents of like. Each attention head receives a different weighted sum of these relations.

Figure 1: An example of our parsing mechanism and dependency-constrained self-attention mechanism. The parsing network first predicts the syntactic distance T and syntactic height ∆ to represent the latent structure of the input sentence I like cats. Then the parent and dependent relations are computed in a differentiable manner from T and ∆.


Figure 2: The architecture of StructFormer. The parser takes shared word embeddings as input and outputs syntactic distances T, syntactic heights ∆, and dependency distributions between tokens. The Transformer layers take the word embeddings and dependency distributions as input and output contextualized embeddings for the input words.

(b) Dependency Parsing Results on the WSJ test set. Starred entries (*) benefit from additional punctuation-based constraints. Daggered entries (†) benefit from larger additional training data. Baseline results are from He et al. (2018).

Figure 4: Dependency relation weights learned on different datasets. Row i contains the relation weights for all attention heads in the i-th Transformer layer; p represents the parent relation and d the dependent relation. We observe a clearer preference for each attention head in the model trained on BLLIP-SM, probably because BLLIP-SM has significantly more training data. It is also interesting to notice that the first layer tends to focus on parent relations.

However, there are two different sets of rules for the conversion: the Stanford dependencies and the CoNLL dependencies. While Stanford dependencies are used as the reference dependencies in previous unsupervised parsing papers, we noticed that our model sometimes outputs dependency structures that are closer to the CoNLL dependencies. Therefore, we report UAS and UUAS for both Stanford and CoNLL dependencies.

Masked Language Model perplexities on BLLIP datasets of different models.



Fraction of ground-truth constituents that were predicted as a constituent by the models, broken down by label (i.e., label recall).

The performance of StructFormer with different combinations of attention masks.

D THE PERFORMANCE OF STRUCTFORMER WITH DIFFERENT MASK RATES

The performance of StructFormer on the PTB dataset with different mask rates. Dependency parsing is especially affected by the masks. A mask rate of 0.3 provides the best and most stable performance.

