REPRESENTATIONAL CORRELATES OF HIERARCHICAL PHRASE STRUCTURE IN DEEP LANGUAGE MODELS

Abstract

While contextual representations from pretrained Transformer models have set a new standard for many NLP tasks, there is not yet a complete accounting of their inner workings. In particular, it is not entirely clear what aspects of sentence-level syntax are captured by these representations, nor how (if at all) they are built up along the stacked layers of the network. In this paper, we address such questions with a general class of interventional, input-perturbation-based analyses of representations from Transformer networks pretrained with self-supervision. Importing from computational and cognitive neuroscience the notion of representational invariance, we perform a series of probes designed to test the sensitivity of Transformer representations to several kinds of structure in sentences. Each probe involves swapping words in a sentence and comparing the representations of the perturbed sentence against those of the original. We experiment with three different perturbations: (1) random permutations of n-grams of varying width, to test the scale at which a representation is sensitive to word position; (2) swapping of two spans which do or do not form a syntactic phrase, to test sensitivity to global phrase structure; and (3) swapping of two adjacent words which do or do not break apart a syntactic phrase, to test sensitivity to local phrase structure. We also connect our probe results to the Transformer architecture by relating the attention mechanism to the syntactic distance between two words. Results from the three probes collectively suggest that Transformers build sensitivity to larger parts of the sentence along their layers, and that hierarchical phrase structure plays a role in this process. In particular, sensitivity to local phrase structure increases along deeper layers. Based on our analysis of attention, we show that this is at least partly explained by generally larger attention weights between syntactically distant words.¹

1. INTRODUCTION AND RELATED WORK

It is still unknown how distributed information processing systems encode and exploit complex relational structures in data. The fields of deep learning (Saxe et al., 2013; Hewitt & Manning, 2019), neuroscience (Sarafyazd & Jazayeri, 2019; Stachenfeld et al., 2017), and cognitive science (Elman, 1991; Kemp & Tenenbaum, 2008; Tervo et al., 2016) have given great attention to this question, including a productive focus on potential models, and their implementations, of hierarchical tasks such as predictive maps and graphs. Natural (human) language provides a rich domain for studying how complex hierarchical structures are encoded in information processing systems. More so than in other domains, the underlying hierarchy of human language has been extensively studied and theorized in linguistics, which provides a source of "ground truth" structures for stimulus data. Much prior work on characterizing the types of linguistic information encoded in computational models of language, such as neural networks, has focused on supervised readout probes, which train a classifier on top of pretrained models to predict a particular linguistic label (Belinkov & Glass, 2017; Liu et al., 2019a; Tenney et al., 2019). In particular, Hewitt & Manning (2019) apply probes to discover linear subspaces that encode tree distances as distances in the representational subspace, and Kim et al. (2020) show that these distances can be used, even without any labeled information, to induce hierarchical structure. However, recent work has highlighted issues with correlating supervised probe performance with the amount of language structure encoded in such representations (Hewitt & Liang, 2019). Another popular approach to analyzing deep models is through the lens of geometry (Reif et al., 2019; Gigante et al., 2019). While geometric interpretations provide significant insights, they present another challenge: summarizing the structure in a quantifiable way.
More recent techniques, such as the replica-based mean-field manifold analysis method (Chung et al., 2018; Cohen et al., 2019; Mamou et al., 2020), connect representation geometry with linear classification performance, but the method is limited to categorization tasks. In this work, we make use of an experimental framework from cognitive science and neuroscience to probe for hierarchical structure in contextual representations from pretrained Transformer models (i.e., BERT (Devlin et al., 2018) and its variants). A popular technique in neuroscience involves measuring the change in population activity in response to controlled input perturbations (Mollica et al., 2020; Ding et al., 2016). We apply this approach to test the characteristic scale and the complexity (Fig. 1) of the hierarchical phrase structure encoded in deep contextual representations, and present several key findings:

1. Representations are distorted by shuffling small n-grams in early layers, while the distortion caused by shuffling large n-grams starts to occur in later layers, implying that the characteristic word-length scale increases from input to downstream layers.

2. The representational distortion caused by swapping two constituent phrases is smaller than when control sequences of the same length are swapped, indicating that BERT representations are sensitive to hierarchical phrase structure.

3. The representational distortion caused by swapping adjacent words across a phrasal boundary is larger than when the swap is within a phrasal boundary; furthermore, the amount of distortion increases with the syntactic distance between the swapped words. The correlation between distortion and tree distance increases across the layers, suggesting that the characteristic complexity of phrasal subtrees increases across the layers.

4. Early layers pay more attention between syntactically closer adjacent pairs, and deeper layers pay more attention between syntactically distant adjacent pairs.
The attention paid in each layer can explain some of the emergent sensitivity to phrasal structure across layers. Our work demonstrates that interventional tools such as controlled input perturbations can be useful for analyzing deep networks, adding to a growing, interdisciplinary body of work which profitably adapts experimental techniques from cognitive neuroscience and psycholinguistics to analyze computational models of language (Futrell et al., 2018; Wilcox et al., 2019; Futrell et al., 2019; Ettinger, 2020).
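For concreteness, the syntactic distance used in finding 3 above is the number of edges on the path between two leaves of a constituency tree. The following is a minimal sketch of how that quantity can be computed; the nested-tuple tree encoding (node label first, then children) and the function names are our own illustrative choices, not the paper's implementation.

```python
def leaf_paths(tree, path=(), paths=None):
    """Collect, for every leaf (a string), its path of child indices
    from the root of a nested-tuple constituency tree."""
    if paths is None:
        paths = []
    if isinstance(tree, str):
        paths.append(path)
    else:
        # tree[0] is the node label; the remaining elements are children
        for k, child in enumerate(tree[1:]):
            leaf_paths(child, path + (k,), paths)
    return paths

def syntactic_distance(tree, i, j):
    """Number of edges on the path between the i-th and j-th leaves."""
    paths = leaf_paths(tree)
    pi, pj = paths[i], paths[j]
    common = 0  # length of the shared prefix = depth of the lowest common ancestor
    for a, b in zip(pi, pj):
        if a != b:
            break
        common += 1
    return len(pi) + len(pj) - 2 * common

# Adjacent words inside a phrase are closer than adjacent words across a
# phrasal boundary:
tree = ("S", ("NP", "the", "cat"), ("VP", "sat"))
print(syntactic_distance(tree, 0, 1))  # "the"/"cat", within NP -> 2
print(syntactic_distance(tree, 1, 2))  # "cat"/"sat", across NP/VP -> 4
```

This matches the intuition behind finding 3: the word pair that straddles the phrasal boundary is farther apart in the tree than the pair inside a single phrase.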

2. METHODS

Eliciting changes in behavioral and neural responses through controlled input perturbations is a common experimental technique in cognitive neuroscience and psycholinguistics (Tsao & Livingstone, 2008; Mollica et al., 2020). Inspired by these approaches, we perturb input sentences and measure the discrepancy between the resulting perturbed representation and the original. While conceptually simple, this approach allows for a targeted analysis of internal representations obtained from different layers of deep models, and can suggest partial mechanisms by which such models encode linguistic structure. We note that sentence perturbations have primarily been utilized in NLP for representation learning (Hill et al., 2016; Artetxe et al., 2018; Lample et al., 2018), data augmentation (Wang et al., 2018; Andreas, 2020), and testing for model robustness (e.g., against adversarial examples) (Jia & Liang, 2017; Belinkov & Bisk, 2018). A methodological contribution of our work is to show that input perturbations can also serve as a useful tool for analyzing the representations learned by deep networks.
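As a concrete illustration, a discrepancy of this kind can be computed as follows, assuming per-token activation vectors have already been extracted from some layer of a pretrained Transformer for both sentences. The Euclidean distance, the `perm` alignment argument, and the function name are illustrative assumptions on our part, not the paper's exact protocol.

```python
from math import sqrt

def distortion(orig_reps, pert_reps, perm):
    """Mean Euclidean distance between each token's contextual representation
    before and after a perturbation.

    orig_reps: per-token vectors from one layer, for the original sentence.
    pert_reps: per-token vectors from the same layer, for the perturbed one.
    perm: perm[i] is the position the i-th original token occupies in the
        perturbed sentence, so each token is compared with itself.
    """
    total = 0.0
    for i, v in enumerate(orig_reps):
        w = pert_reps[perm[i]]
        total += sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))
    return total / len(orig_reps)
```

If a perturbation merely moved tokens around without changing their contextual vectors, this measure would be zero; any nonzero value reflects how much each token's representation itself changed.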

2.1. SENTENCE PERTURBATIONS

In this work we consider three types of sentence perturbation, each designed to probe a different phenomenon.

n-gram shuffling. In the n-gram shuffling experiments, we randomly shuffle the words of a sentence in units of n-grams, with n varying from 1 (i.e., individual words) to 7 (see Fig. 2a for an example).
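A minimal sketch of this n-gram shuffling operation (the function name and interface are our own, for illustration):

```python
import random

def ngram_shuffle(words, n, rng=None):
    """Partition a token list into consecutive n-grams and permute the units.

    Words inside each n-gram keep their relative order; only the order of the
    n-gram units changes. A trailing remainder shorter than n forms its own
    unit.
    """
    rng = rng or random.Random()
    units = [words[i:i + n] for i in range(0, len(words), n)]
    rng.shuffle(units)
    return [w for unit in units for w in unit]

sentence = "the quick brown fox jumps over the lazy dog".split()
shuffled = ngram_shuffle(sentence, n=3, rng=random.Random(0))
# The 3-gram units "the quick brown", "fox jumps over", "the lazy dog"
# each stay intact; only their order may change.
```

At n = 1 this degenerates to a full word-order shuffle, while large n leaves progressively more local structure intact, which is what lets the probe vary the scale of the perturbation.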



¹ Datasets, extracted features, and code will be made publicly available upon publication.

