MAPPING THE TIMESCALE ORGANIZATION OF NEURAL LANGUAGE MODELS

Abstract

In the human brain, sequences of language input are processed within a distributed and hierarchical architecture, in which higher stages of processing encode contextual information over longer timescales. In contrast, we know little about how the multiple timescales of contextual information are functionally organized in recurrent neural networks that perform natural language processing. We therefore applied tools developed in neuroscience to map the "processing timescales" of individual units within a word-level LSTM language model. This timescale-mapping method assigned long timescales to units previously found to track long-range syntactic dependencies. In addition, the mapping revealed a small subset of the network (less than 15% of units) with long timescales whose function had not previously been explored. We next probed the functional organization of the network by examining the relationship between the processing timescale of units and their network connectivity. We identified two classes of long-timescale units: "controller" units formed a densely interconnected subnetwork and projected strongly to the rest of the network, while "integrator" units showed the longest timescales in the network and expressed projection profiles closer to the mean projection profile. Ablating integrator and controller units affected model performance at different positions within a sentence, suggesting distinct functions for these two sets of units. Finally, we tested the generalization of these results to a character-level LSTM model and to models with different architectures. In summary, we demonstrate a model-free technique for mapping the timescale organization of recurrent neural networks, and we apply this method to reveal the timescale and functional organization of neural language models.

1. INTRODUCTION

Language processing requires tracking information over multiple timescales. To predict the final word "timescales" in the previous sentence, one must consider both the short-range context (e.g. the adjective "multiple") and the long-range context (e.g. the subject "language processing"). How do humans and neural language models encode such multi-scale contextual information? Neuroscientists have developed methods to study how the human brain encodes information over multiple timescales during sequence processing. By parametrically varying the timescale of intact context and measuring the resultant changes in the neural response, a series of studies (Lerner et al., 2011; Xu et al., 2005; Honey et al., 2012) showed that higher-order regions are more sensitive to long-range context changes than lower-order sensory regions. These studies indicate the existence of a "hierarchy of processing timescales" in the human brain. More recently, Chien & Honey (2020) used a time-resolved method to investigate how the brain builds a shared representation when two groups of people process the same narrative segment preceded by different contexts. By directly mapping the time required for individual brain regions to converge on a shared representation in response to shared input, this approach confirmed that higher-order regions take longer to build a shared representation. Altogether, these and other lines of investigation suggest that sequence processing in the brain is supported by a distributed and hierarchical structure: sensory regions have short processing timescales and are primarily influenced by the current input and its short-range context, while higher-order cortical regions have longer timescales and track longer-range dependencies (Hasson et al., 2015; Honey et al., 2012; Chien & Honey, 2020; Lerner et al., 2011; Baldassano et al., 2017; Runyan et al., 2017; Fuster, 1997).
How are processing timescales organized within recurrent neural networks (RNNs) trained to perform natural language processing? Long short-term memory networks (LSTMs) (Hochreiter & Schmidhuber, 1997) have been widely investigated in terms of their ability to solve sequential prediction tasks. However, long-range dependencies have usually been studied with respect to a particular linguistic function (e.g. subject-verb number agreement; Linzen et al. 2016; Gulordava et al. 2018; Lakretz et al. 2019), and less attention has been paid to the broader question of how sensitivity to prior context, broadly construed, is functionally organized within these RNNs. Therefore, drawing on prior work in the neuroscience literature, here we demonstrate a model-free approach to mapping processing timescales in RNNs. We focused on existing language models trained to predict upcoming tokens at the word level (Gulordava et al., 2018) and at the character level (Hahn & Baroni, 2019). In both models, the timescale mapping revealed that the higher layers of the LSTM contain a small subset of units exhibiting long-range sequence dependencies; this subset includes previously reported units (e.g. a "syntax" unit; Lakretz et al., 2019) as well as previously unreported units. After mapping the timescales of individual units, we asked: does the processing timescale of each unit in the network relate to its functional role, as measured by its connectivity? This question is motivated by neuroscience studies showing that, in the human brain, higher-degree nodes tend to exhibit slower dynamics and longer context dependence than lower-degree nodes (Baria et al., 2013).
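The context-swap logic behind this kind of timescale mapping can be sketched in a few lines. The sketch below is a simplified illustration, not the paper's exact procedure: it assumes we have already recorded hidden-state activations from two runs of a network over the same shared token segment preceded by different contexts, and it estimates each unit's timescale as the number of shared tokens needed before its activations under the two contexts converge. The function name and the convergence threshold are our own assumptions.

```python
import numpy as np

def unit_timescales(acts_a, acts_b, threshold=0.1):
    """Estimate per-unit processing timescales from two runs of a network
    over the SAME shared input segment, preceded by DIFFERENT contexts.

    acts_a, acts_b : arrays of shape (T, n_units) holding hidden-state
        activations over the T tokens of the shared segment.
    threshold : fraction of the initial activation difference below which
        a unit is considered to have converged on a shared representation.

    Returns an integer array of length n_units: the number of shared
    tokens needed before each unit's activations under the two contexts
    converge. Units that never converge within the segment get T.
    """
    diff = np.abs(acts_a - acts_b)        # (T, n_units) divergence
    init = diff[0] + 1e-12                # divergence at segment onset
    rel = diff / init                     # normalized divergence
    converged = rel < threshold           # (T, n_units) boolean
    T, n_units = rel.shape
    timescales = np.full(n_units, T)
    for u in range(n_units):
        idx = np.nonzero(converged[:, u])[0]
        if idx.size:
            timescales[u] = idx[0]        # first token index of convergence
    return timescales

# Demo with synthetic activations: three units whose context-driven
# divergence decays with time constants of 1, 5, and 20 tokens.
T_demo = 50
taus = np.array([1.0, 5.0, 20.0])
t = np.arange(T_demo)[:, None]
acts_a = np.exp(-t / taus)                # run 1: decaying divergence
acts_b = np.zeros_like(acts_a)            # run 2: baseline
ts = unit_timescales(acts_a, acts_b)
# units with slower decay converge later, i.e. have longer timescales
```

In this toy example the estimated timescales increase monotonically with the units' decay constants, which is the qualitative behavior the mapping is designed to detect.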
More generally, the primate brain exhibits a core-periphery structure in which a relatively small number of "higher order" and high-degree regions (in the prefrontal cortex, in default-mode regions, and in so-called "limbic" zones) maintain a large number of connections with one another and exert a powerful influence over large-scale cortical dynamics (Hagmann et al., 2008; Mesulam, 1998; Gu et al., 2015). Inspired by the relationships between timescales and network structure in the brain, we set out to test corresponding hypotheses in RNNs: (1) Do units with longer timescales tend to have higher degree in neural language models? and (2) Do neural language models also exhibit a "core network" composed of functionally influential high-degree units? Using an exploratory network-theoretic approach, we found that units with longer timescales tend to have more projections to other units. Furthermore, we identified a set of medium-to-long-timescale "controller" units which exhibit distinct and strong projections that control the state of other units, and a set of long-timescale "integrator" units which influenced the prediction of words for which long-range context is relevant. In summary, these findings advance our understanding of the timescale distribution and functional organization of LSTM language models, and provide a method for identifying important units representing long-range contextual information in RNNs.
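One simple way to operationalize a unit's "degree" in an LSTM is to sum the absolute recurrent weights leaving it. The sketch below is an illustrative assumption rather than the paper's exact connectivity analysis: it assumes the common convention of a stacked recurrent weight matrix of shape (4 * n_units, n_units), as in PyTorch's `weight_hh_l0`, and treats the summed absolute column weight as an out-degree proxy for ranking candidate "core network" units. All function names and the top-fraction cutoff are hypothetical.

```python
import numpy as np

def projection_strengths(w_hh, n_units):
    """Summarize how strongly each hidden unit projects to the rest of
    the network, given a stacked recurrent weight matrix w_hh of shape
    (4 * n_units, n_units): the LSTM convention of stacking the four
    gates' recurrent weights row-wise.

    Column j of w_hh holds every recurrent weight leaving unit j, so the
    summed absolute column weight is a simple out-degree proxy.
    """
    assert w_hh.shape == (4 * n_units, n_units)
    return np.abs(w_hh).sum(axis=0)

def high_degree_units(w_hh, n_units, top_frac=0.1):
    """Indices of the top `top_frac` of units by outgoing strength."""
    strength = projection_strengths(w_hh, n_units)
    k = max(1, int(top_frac * n_units))
    return np.argsort(strength)[::-1][:k]

# Demo: random recurrent weights with one artificially strong "hub" unit.
rng = np.random.default_rng(0)
n = 10
w = rng.normal(scale=0.1, size=(4 * n, n))
w[:, 3] += 1.0                     # inflate unit 3's outgoing weights
hubs = high_degree_units(w, n)     # ranking recovers the hub
```

A fuller analysis in this spirit would correlate these per-unit strengths with the mapped timescales, and could weight projections by gate (e.g. treating forget-gate inputs separately), but the column-sum proxy is enough to illustrate the degree-based ranking.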

2. RELATED WORK

Linguistic Context in LSTMs. How do LSTMs encode linguistic context at multiple timescales? Prior work suggests that the units sensitive to information requiring long-range dependencies are sparse. By ablating one unit at a time, Lakretz et al. (2019) found two units that encode information required for processing long-range subject-verb number agreement (one encoding singular and one encoding plural information). They further identified several long-range "syntax units" whose activation was associated with syntactic tree depth. Overall, the findings of Lakretz et al. (2019) suggest that a sparse subset of units tracks long-range dependencies related to subject-verb agreement and syntax. If this pattern is general (i.e. if very few units track long-range dependencies of any kind), it may limit the capacity of the models to process long sentences with high complexity, for reasons similar to those that may limit human sentence processing (Lakretz et al., 2020). To test whether long-range units are sparse in general, we require a model-free approach for mapping the context dependencies of every unit in the language network.

Whole-network context dependence. Previous work by Khandelwal et al. (2018) investigated the duration of prior context that LSTM language models use to support word prediction. Context dependence was measured by permuting the order of words preceding the preserved context, and



The code and dataset to reproduce the experiments can be found at https://github.com/sherrychien/LSTM_timescales

