NEURAL NETWORKS AND THE CHOMSKY HIERARCHY

Abstract

Reliable generalization lies at the heart of safe ML and AI. However, understanding when and how neural networks generalize remains one of the most important unsolved problems in the field. In this work, we conduct an extensive empirical study (20 910 models, 15 tasks) to investigate whether insights from the theory of computation can predict the limits of neural network generalization in practice. We demonstrate that grouping tasks according to the Chomsky hierarchy allows us to forecast whether certain architectures will be able to generalize to out-of-distribution inputs. This includes negative results where even extensive amounts of data and training time never lead to any non-trivial generalization, despite models having sufficient capacity to fit the training data perfectly. Our results show that, for our subset of tasks, RNNs and Transformers fail to generalize on non-regular tasks, LSTMs can solve regular and counter-language tasks, and only networks augmented with structured memory (such as a stack or memory tape) can successfully generalize on context-free and context-sensitive tasks.

1. INTRODUCTION

Statistical learning theory is the most widely used theory of generalization in practical machine learning, justifying empirical risk minimization and estimating the generalization error via a test set (Vapnik, 1998). However, its central assumption that training and test data are independent and identically distributed (i.i.d.) is violated for many problems of interest (distribution shifts, continual learning, etc.). An example of such a non-i.i.d. setting is testing generalization on sequence prediction problems, where an agent is trained with sequences of length ℓ ≤ N and tested with arbitrarily longer sequences ℓ ≫ N. This problem is of particular importance since it subsumes all computable problems (Dawid, 1984; Rich, 2007; Sipser, 1997; Solomonoff, 2009; 2010). Central to sequence prediction is inductive inference, which consists of deriving a general rule from a finite set of concrete instances and using this rule to make predictions. For example, in program induction (Goldberg, 1989; Gomez et al., 2008; Holland, 1992; Liang et al., 2013; Nordin, 1997; Solomonoff, 1964a;b; Wineberg & Oppacher, 1994), the goal is to obtain a model that correctly identifies the underlying data-generating process given examples of input-output sequences. Then, if the model is correct, it can produce results in accordance with the generative process for previously unseen input sequences. The key challenge of inductive inference (as opposed to deduction) is that it does not allow selecting one hypothesis with certainty among the ones that fit the data. For instance, the sequence 2, 4, 6, 8 has infinitely many possible continuations. Thus, any principle that selects a particular continuation requires additional assumptions that are independent of the data, i.e., inductive biases (Mitchell, 1980).
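The ambiguity of the 2, 4, 6, 8 example can be made concrete with polynomial interpolation (a sketch of our own, not from the paper): two rules that agree exactly on the observed prefix can still diverge on the continuation, so the data alone cannot decide between them.

```python
import numpy as np

observed = [2, 4, 6, 8]

def rule_linear(n):
    # Hypothesis 1: "add 2 each step", i.e., a_n = 2n.
    return 2 * n

# Hypothesis 2: the unique degree-4 polynomial through
# (1,2), (2,4), (3,6), (4,8) and the arbitrary point (5,42).
coeffs = np.polyfit([1, 2, 3, 4, 5], [2, 4, 6, 8, 42], deg=4)

def rule_poly(n):
    return int(round(np.polyval(coeffs, n)))

# Both hypotheses fit the observed data perfectly ...
assert [rule_linear(n) for n in range(1, 5)] == observed
assert [rule_poly(n) for n in range(1, 5)] == observed
# ... yet they disagree on the very next element.
print(rule_linear(5), rule_poly(5))  # 10 42
```

Any learner that outputs 10 rather than 42 is relying on an inductive bias (e.g., a preference for simpler rules), not on the data.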
In machine learning, the network architecture, training mechanisms (e.g., gradient descent), and initial distributions over parameters all generate their corresponding inductive biases. This has led to a vast number of approaches for designing inductive biases via architectural and training protocol changes (see Battaglia et al. (2018) for an overview). However, stronger inductive biases generally come at the cost of decreasing the universality of a model, and thus finding a good balance between the two is one of the biggest challenges in the contemporary literature. Even if a neural architecture is theoretically universal or Turing complete, gradient-based training, which cannot exhaustively search the parameter space, can impede finding the right solution and thus practically render the model non-universal. Therefore, both architectural and training limitations impact which sequence prediction problems a model can solve in practice. In formal language theory, the Chomsky hierarchy (Chomsky, 1956) classifies such (sequence prediction) problems by increasing complexity. This hierarchy is associated with an equivalent hierarchy of models (automata) that can solve different problem classes (Savage, 1998; Sipser, 1997). Lower-level automata have restrictive memory models and can only solve lower-level problems, while Turing machines, with infinite memory and unrestricted memory access, lie at the top of the hierarchy and can solve all computable problems.
However, unlike for classical automata, a unified placement of neural architectures on the Chomsky hierarchy has not yet been practically established, which is precisely the goal of our work.

This work

We conduct an extensive empirical study with the aim of discovering how neural network models used for program induction relate to the idealized computational models defined by the Chomsky hierarchy in practice (see Fig. 1 for a summary of our findings). We investigate whether the theoretical limitations of certain neural models hold in practice when trained with gradient-based methods. For example, previous work has theoretically argued that RNNs are Turing complete (Siegelmann & Sontag, 1994). However, more recent theoretical analyses (Ackerman & Cybenko, 2020; Merrill, 2019; Weiss et al., 2018) showed that RNNs lie much lower on the Chomsky hierarchy. To complement these theoretical analyses, we conduct a large-scale empirical evaluation on sequence prediction problems. We make the following main contributions:
• We conduct an extensive generalization study (20 910 models, 15 tasks) of state-of-the-art neural network architectures (RNN, LSTM, Transformer) and memory-augmented networks (Stack-RNN, Tape-RNN) on a battery of sequence-prediction tasks spanning the entire Chomsky hierarchy that can be practically tested with finite-time computation.
• We open-source a length generalization benchmark (https://github.com/deepmind/neural_networks_chomsky_hierarchy) that is out of reach for state-of-the-art sequence prediction models and allows us to pinpoint the failure modes of these architectures.
• We show that increasing amounts of training data do not enable generalization on our tasks higher up in the hierarchy for some architectures (despite sufficient capacity to perfectly learn the training data), potentially implying hard limitations for scaling laws (Kaplan et al., 2020).
• We demonstrate how augmenting architectures with differentiable structured memory (e.g., with a stack or a tape) can enable them to solve tasks higher up the hierarchy.
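The length-generalization protocol underlying the benchmark can be sketched as follows (a hypothetical simplification: function names, the task instance, and the length parameters are illustrative, not the benchmark's actual API). A model is trained on sequences of length at most N and scored only on strictly longer inputs.

```python
import random

TRAIN_MAX_LEN, TEST_MAX_LEN = 40, 500  # N and the out-of-distribution horizon

def sample_reverse_string(min_len, max_len, rng):
    """'Reverse String' task: input is a binary sequence, target its reverse."""
    length = rng.randint(min_len, max_len)
    x = [rng.randint(0, 1) for _ in range(length)]
    return x, list(reversed(x))

def ood_accuracy(model, rng, n_samples=100):
    """Fraction of exactly-correct outputs on lengths N+1 .. TEST_MAX_LEN."""
    correct = 0
    for _ in range(n_samples):
        x, y = sample_reverse_string(TRAIN_MAX_LEN + 1, TEST_MAX_LEN, rng)
        correct += model(x) == y
    return correct / n_samples

# A model that has learned the true algorithm generalizes perfectly;
# one that memorized an identity-like shortcut fails on longer inputs.
print(ood_accuracy(lambda x: list(reversed(x)), random.Random(0)))  # 1.0
```

The key design choice is that the test distribution places all mass on lengths never seen during training, so a high score cannot be achieved by interpolation alone.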

2. RELATED WORK

Learning formal languages A long line of work has empirically investigated whether common machine learning architectures, including RNNs (Elman, 1990), GRUs (Cho et al., 2014), SCNs (Giles et al., 1992; Pollack, 1991), LSTMs (Hochreiter & Schmidhuber, 1997), and Transformers (Vaswani et al., 2017), are capable of learning formal languages. The main insights are: These networks can



Figure 1: Formal language classes and their correspondence with neural network architectures. Left: Our empirical evaluation locates the architectures on the hierarchy of formal language classes. Right: Each formal language class is associated with a minimal computational model (automaton) to recognize or generate the language (see Section 3). All automata have a finite-state controller at their core, in addition to increasingly restrictive memory access as we descend the hierarchy.

