STRUCTFORMER: JOINT UNSUPERVISED INDUCTION OF DEPENDENCY AND CONSTITUENCY STRUCTURE FROM MASKED LANGUAGE MODELING

Abstract

There are two major classes of natural language grammar: the dependency grammar, which models one-to-one correspondences between words, and the constituency grammar, which models the assembly of one or several corresponding words. While previous unsupervised parsing methods mostly focus on inducing only one class of grammar, we introduce a novel model, StructFormer, that can induce dependency and constituency structure at the same time. To achieve this, we propose a new parsing framework that can jointly generate a constituency tree and a dependency graph. We then integrate the induced dependency relations into the Transformer, in a differentiable manner, through a novel dependency-constrained self-attention mechanism. Experimental results show that our model can achieve strong results on unsupervised constituency parsing, unsupervised dependency parsing, and masked language modeling at the same time.

1. INTRODUCTION

Human languages have a rich latent structure. This structure is multifaceted, with the two major classes of grammar being dependency and constituency structures. There has been an exciting breadth of recent work targeted at learning this structure in a data-driven, unsupervised fashion. The core principle behind recent methods that induce structure from data is simple: provide an inductive bias that is conducive for structure to emerge as a byproduct of some self-supervised training, e.g., language modeling. To this end, a wide range of models have been proposed that are able to successfully learn grammar structures (Shen et al., 2018a;c; Wang et al., 2019; Kim et al., 2019b;a). However, most of these works focus on learning constituency structures alone. To the best of our knowledge, no prior model has been able to induce, in an unsupervised fashion, more than one grammar structure at once. In this paper, we make two important technical contributions. First, we introduce a new neural model that is able to induce dependency structures from raw data in an end-to-end unsupervised fashion. Most existing approaches induce dependency structures from other syntactic information like gold POS tags (Klein & Manning, 2004; Cohen & Smith, 2009; Jiang et al., 2016). Previous works that train from words alone often require additional information, such as pre-trained word clustering (Spitkovsky et al., 2011), pre-trained word embeddings (He et al., 2018), acoustic cues (Pate & Goldwater, 2013), or annotated data from related languages (Cohen et al., 2011). Second, we introduce the first neural model that is able to induce both dependency structure and constituency structure at the same time. Specifically, our approach aims to unify latent structure induction of different types of grammar within the same framework.
We introduce a new inductive bias that enables Transformer models to induce a directed dependency graph in a fully unsupervised manner. To avoid the need for grammar labels during training, we use a distance-based parsing mechanism. The key idea is to predict a sequence of Syntactic Distances T (Shen et al., 2018b) and a sequence of Syntactic Heights ∆ (Luo et al., 2019) that represent the dependency graph and the constituency tree at the same time. Examples of ∆ and T are illustrated in Figure 1a. Based on the syntactic distances (T) and syntactic heights (∆), we provide a new dependency-constrained self-attention layer to replace the multi-head self-attention layer in the standard Transformer model. More concretely, each token can only attend to its parent (to avoid confusion with self-attention heads, we use "parent" to denote the "head" of a word in a dependency graph) or its dependents in the predicted dependency structure, through a weighted sum of the different relations shown in Figure 1b. In this way, we replace the complete graph in the standard Transformer model with a differentiable directed dependency graph. During training on a downstream task (e.g., masked language modeling), the model gradually converges to a reasonable dependency graph via gradient descent. Thus, the parser can be trained in an unsupervised manner as a component of the model. Incorporating the new parsing mechanism, the dependency-constrained self-attention, and the Transformer architecture, we introduce a new model named StructFormer. The proposed model can perform unsupervised dependency and constituency parsing at the same time, and can leverage the parsing results to achieve strong performance on masked language modeling tasks.
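To make the distance- and height-based parsing concrete, the following minimal sketch shows one (non-differentiable) way to read a dependency parent for each token off T and ∆: recursively split the sentence at the largest syntactic distance, take the token with the greatest syntactic height in each span as that span's head, and attach the head of the other sub-span to it. This is our own simplified illustration with hard argmax decisions, not the model's differentiable soft relaxation, and the function name `induce_parents` is ours.

```python
def induce_parents(distances, heights):
    """Recover a dependency parent for each token from syntactic
    distances (length n-1) and syntactic heights (length n).

    Hard-decision sketch: split each span at its largest distance,
    and let the sub-span head with the greater height dominate.
    """
    n = len(heights)
    parent = [-1] * n  # -1 marks the root

    def build(lo, hi):
        """Parse span [lo, hi); return the index of its head token."""
        if hi - lo == 1:
            return lo
        # Split at the largest syntactic distance inside the span.
        split = max(range(lo, hi - 1), key=lambda i: distances[i]) + 1
        left = build(lo, split)
        right = build(split, hi)
        # The span head is the sub-span head with greater syntactic height;
        # the other sub-span head becomes its dependent.
        head, dep = (left, right) if heights[left] >= heights[right] else (right, left)
        parent[dep] = head
        return head

    build(0, n)
    return parent
```

For the sentence "I like cats" in Figure 1a, distances that split off "I" first and heights that peak at "like" (e.g., T = [2, 1], ∆ = [1, 3, 2]) yield "like" as the root, with "I" and "cats" as its dependents.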

2. RELATED WORK

Previous works on unsupervised dependency parsing are primarily based on the dependency model with valence (DMV) (Klein & Manning, 2004) and its extensions (Daumé III, 2009; Gillenwater et al., 2010). To effectively learn the DMV model for better parsing accuracy, a variety of inductive biases and handcrafted features, such as correlations between parameters of grammar rules involving different part-of-speech (POS) tags, have been proposed to incorporate prior information into learning. The most recent progress is the neural DMV model (Jiang et al., 2016), which uses a neural network to predict grammar rule probabilities based on distributed representations of POS tags. However, most previous unsupervised dependency parsing algorithms require gold POS tags as input, which are labeled by humans and can be potentially difficult (or prohibitively expensive) to obtain for large corpora. Spitkovsky et al. (2011) proposed to overcome this problem with unsupervised word clustering that can dynamically assign tags to each word considering its context.

Unsupervised constituency parsing has recently received more attention. PRPN (Shen et al., 2018a) and ON-LSTM (Shen et al., 2018c) induce tree structures by introducing an inductive bias into recurrent neural networks. PRPN proposes a parsing network to compute the syntactic distance of all word pairs, while a reading network utilizes the syntactic structure to attend to relevant memories. ON-LSTM allows hidden neurons to learn long-term or short-term information through a novel gating mechanism and activation function. In URNNG (Kim et al., 2019b), amortized variational inference is applied between a recurrent neural network grammar (RNNG) (Dyer et al., 2016) decoder and a tree-structure inference network, which encourages the decoder to generate reasonable tree structures. DIORA (Drozdov et al., 2019) proposed using inside-outside dynamic programming to



(a) An example of Syntactic Distances T (grey bars) and Syntactic Heights ∆ (white bars). In this example, like is the parent (head) of the constituents (like cats) and (I like cats). (b) Two types of dependency relations. The parent distribution allows each token to attend to its parent. The dependent distribution allows each token to attend to its dependents. For example, the parent of cats is like, while cats and I are dependents of like. Each attention head will receive a different weighted sum of these relations.

Figure 1: An example of our parsing mechanism and dependency-constrained self-attention mechanism. The parsing network first predicts the syntactic distance T and syntactic height ∆ to represent the latent structure of the input sentence I like cats. Then the parent and dependent relations are computed in a differentiable manner from T and ∆.
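The two relation matrices of Figure 1b can be illustrated with a small sketch. Given a parent index for each token, the parent matrix marks, in row i, which token is i's parent, and the dependent matrix marks i's dependents; an attention head then mixes the two with per-head weights. This is our own hard 0/1 illustration (functions `relation_masks` and `attention_mask` are hypothetical names); in the model these matrices are soft probability distributions derived differentiably from T and ∆.

```python
def relation_masks(parent):
    """Build parent and dependent relation matrices from a parent array
    (parent[i] = index of token i's parent, -1 for the root)."""
    n = len(parent)
    par = [[1.0 if parent[i] == j else 0.0 for j in range(n)] for i in range(n)]
    dep = [[1.0 if parent[j] == i else 0.0 for j in range(n)] for i in range(n)]
    return par, dep

def attention_mask(parent, w_parent, w_dependent):
    """Weighted sum of the two relations, as one attention head might
    combine them (the weights are learned in the model; fixed here)."""
    par, dep = relation_masks(parent)
    n = len(parent)
    return [[w_parent * par[i][j] + w_dependent * dep[i][j] for j in range(n)]
            for i in range(n)]
```

For "I like cats" with parents [1, -1, 1], a head weighted 0.7 toward parents and 0.3 toward dependents lets I and cats attend to like with weight 0.7, and like attend to each of its dependents with weight 0.3, replacing the complete attention graph with this sparse, directed one.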

