STRUCTFORMER: JOINT UNSUPERVISED INDUCTION OF DEPENDENCY AND CONSTITUENCY STRUCTURE FROM MASKED LANGUAGE MODELING

Abstract

There are two major classes of natural language grammars: the dependency grammar, which models one-to-one correspondences between words, and the constituency grammar, which models the assembly of one or several corresponding words. While previous unsupervised parsing methods mostly focus on inducing only one class of grammar, we introduce a novel model, StructFormer, that can induce dependency and constituency structure at the same time. To achieve this, we propose a new parsing framework that can jointly generate a constituency tree and a dependency graph. We then integrate the induced dependency relations into the Transformer, in a differentiable manner, through a novel dependency-constrained self-attention mechanism. Experimental results show that our model achieves strong results on unsupervised constituency parsing, unsupervised dependency parsing, and masked language modeling at the same time.

1. INTRODUCTION

Human languages have a rich latent structure. This structure is multifaceted, with the two major classes of grammar being dependency and constituency structures. There has been an exciting breadth of recent work targeted at learning this structure in a data-driven, unsupervised fashion. The core principle behind recent methods that induce structure from data is simple: provide an inductive bias that is conducive for structure to emerge as a byproduct of some self-supervised training, e.g., language modeling. To this end, a wide range of models have been proposed that are able to successfully learn grammar structures (Shen et al., 2018a;c; Wang et al., 2019; Kim et al., 2019b;a). However, most of these works focus on learning constituency structures alone. To the best of our knowledge, there has been no prior work that is able to induce, in an unsupervised fashion, more than one grammar structure at once. In this paper, we make two important technical contributions. First, we introduce a new neural model that is able to induce dependency structures from raw data in an end-to-end unsupervised fashion. Most existing approaches induce dependency structures from other syntactic information like gold POS tags (Klein & Manning, 2004; Cohen & Smith, 2009; Jiang et al., 2016). Previous works that train from words alone often require additional information, like pre-trained word clusterings (Spitkovsky et al., 2011), pre-trained word embeddings (He et al., 2018), acoustic cues (Pate & Goldwater, 2013), or annotated data from related languages (Cohen et al., 2011). Second, we introduce the first neural model that is able to induce both dependency structure and constituency structure at the same time. Specifically, our approach aims to unify latent structure induction of different types of grammar within the same framework.
We introduce a new inductive bias that enables Transformer models to induce a directed dependency graph in a fully unsupervised manner. To avoid the need for grammar labels during training, we use a distance-based parsing mechanism. The key idea is to predict a sequence of Syntactic Distances T (Shen et al., 2018b) and a sequence of Syntactic Heights ∆ (Luo et al., 2019) that represent the dependency graph and the constituency tree at the same time. Examples of ∆ and T are illustrated in Figure 1a. Based on the syntactic distances (T) and syntactic heights (∆), we provide a new dependency-constrained self-attention layer to replace the multi-head self-attention layer in the standard Transformer model. More concretely, each attention head can only attend to its parent (to avoid

