WHAT THEY DO WHEN IN DOUBT: A STUDY OF INDUCTIVE BIASES IN SEQ2SEQ LEARNERS

Abstract

Sequence-to-sequence (seq2seq) learners are widely used, but we still have only limited knowledge about the inductive biases that shape the way they generalize. We address this by investigating how popular seq2seq learners generalize in tasks that have high ambiguity in the training data. We use four new tasks to study learners' preferences for memorization, arithmetic, hierarchical, and compositional reasoning. Further, we connect to Solomonoff's theory of induction and propose to use description length as a principled and sensitive measure of inductive biases. In our experimental study, we find that LSTM-based learners can learn to perform counting, addition, and multiplication by a constant from a single training example. Furthermore, Transformer- and LSTM-based learners show a bias toward hierarchical induction over linear induction, while CNN-based learners prefer the opposite. The latter also show a bias toward compositional generalization over memorization. Finally, across all our experiments, description length proved to be a sensitive measure of inductive biases.

1. INTRODUCTION

Sequence-to-sequence (seq2seq) learners (Sutskever et al., 2014) demonstrated remarkable performance in machine translation, story generation, and open-domain dialog (Sutskever et al., 2014; Fan et al., 2018; Adiwardana et al., 2020). Yet, these models have been criticized for requiring a tremendous amount of data and being unable to generalize systematically (Dupoux, 2018; Loula et al., 2018; Lake & Baroni, 2017; Bastings et al., 2018). In contrast, humans rely on their inductive biases to generalize from a limited amount of data (Chomsky, 1965; Lake et al., 2019). Due to the centrality of humans' biases in language learning, several works have studied inductive biases of seq2seq models and connected their poor generalization to the lack of the "right" biases (Lake & Baroni, 2017; Lake et al., 2019).

In this work, we focus on studying inductive biases of seq2seq models. We start from the observation that, generally, multiple explanations can be consistent with a limited training set, each leading to different predictions on unseen data. A learner might prefer one type of explanation over another in a systematic way, as a result of its inductive biases (Ritter et al., 2017; Feinman & Lake, 2018).

To illustrate the setup we work in, consider a quiz-like question: if f(3) maps to 6, what does f(4) map to? The "training" example is consistent with the following answers: 6 (f(x) ≡ 6); 7 (f(x) = x + 3); 8 (f(x) = 2 · x); indeed, any number z, since we can always construct a function such that f(3) = 6 and f(4) = z. By analyzing the learner's output on this new input, we can infer its biases. This example demonstrates how biases of learners are studied through the lens of the poverty-of-the-stimulus principle (Chomsky, 1965; 1980): if nothing in the training data indicates that a learner should generalize in a certain way, but it does so nonetheless, then this is due to the biases of the learner.

Inspired by the work of Zhang et al. (2019) in the image domain, we take this principle to the extreme and study biases of seq2seq learners in the regime of very few training examples, often as few as one. Under this setup, we propose four new synthetic tasks that probe seq2seq learners' preferences for memorization-, arithmetic-, hierarchical-, and compositional-based "reasoning". Next, we connect to the ideas of Solomonoff's theory of induction (Solomonoff, 1964) and Minimum Description Length (Rissanen, 1978; Grunwald, 2004) and propose to use description length, under a learner, as a principled measure of its inductive biases.

Our experimental study shows that the standard seq2seq learners have strikingly different inductive biases. We find that LSTM-based learners can learn non-trivial counting-, multiplication-, and addition-based rules from as little as one example. CNN-based seq2seq learners prefer linear over hierarchical generalizations, while LSTM-based ones and Transformers do just the opposite. When investigating compositional reasoning, description length proved to be a sensitive measure. Equipped with it, we found that CNN-, and, to a lesser degree, LSTM-based learners prefer compositional generalization over memorization when provided with enough composite examples. In turn, Transformers show a strong bias toward memorization.
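The motivating quiz above can be made concrete in a few lines. A minimal sketch (the candidate rules are exactly the ones listed in the text):

```python
# Several candidate rules all fit the single "training" example f(3) = 6,
# yet they disagree on the unseen input 4.
candidates = {
    "constant": lambda x: 6,      # f(x) = 6 for all x
    "add_3":    lambda x: x + 3,  # f(x) = x + 3
    "double":   lambda x: 2 * x,  # f(x) = 2 * x
}

train_x, train_y = 3, 6
# Every candidate explains the training example...
assert all(f(train_x) == train_y for f in candidates.values())

# ...but they diverge on the hold-out input, so whichever answer a learner
# produces there reveals its inductive bias.
predictions = {name: f(4) for name, f in candidates.items()}
print(predictions)  # {'constant': 6, 'add_3': 7, 'double': 8}
```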

2. SEARCHING FOR INDUCTIVE BIASES

To formalize the way we look for inductive biases of a learner $M$, we consider a training dataset of input/output pairs, $T = \{x_i, y_i\}_{i=1}^{n}$, and a hold-out set of inputs, $H = \{x_i\}_{i=n+1}^{k}$. W.l.o.g., we assume that there are two candidate "rules" that explain the training data, but do not coincide on the hold-out data: $C_1(x_i) = C_2(x_i) = y_i$ for $1 \le i \le n$, and $\exists i: C_1(x_i) \neq C_2(x_i)$ for $n+1 \le i \le k$. To compare the preferences of a learner $M$ toward those two rules, we fit the learner on the training data $T$ and then compare its predictions on the hold-out data $H$ to the outputs of the rules. We refer to this approach as "intuitive".

Usually, the measures of similarity between the outputs are task-specific. We too start with an accuracy-based measure. We define the fraction of perfect agreement (FPA) between a learner $M$ and a candidate generalization rule $C$ as the fraction of random seeds for which $M$ generalizes in perfect agreement with $C$ on the hold-out set $H$. The larger the FPA of $M$ w.r.t. $C$, the more biased $M$ is toward $C$. However, FPA does not account for imperfect generalization, nor does it allow a direct comparison between two candidate rules when both are dominated by a third candidate rule. Hence, below we propose a principled approach based on description length.

Description Length and Inductive Biases. At the core of the theory of induction (Solomonoff, 1964) is the question of continuing a finite string, which is very similar to our setup. Indeed, we can easily re-formulate our motivating example as a string continuation problem: "3 → 6; 4 →". The solution proposed by Solomonoff (1964) is to select the continuation that admits "the simplest explanation" of the entire string, i.e. the one produced by the program of the shortest length (description length). Our intuition is that when a continuation is "simple" for a learner, then this learner is biased toward it.
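The FPA computation can be sketched as follows; `train_learner` is a hypothetical stand-in for fitting a seq2seq model on $T$ with a given seed and returning it as a callable, not the paper's actual training code:

```python
# A minimal sketch of the fraction of perfect agreement (FPA): fit one
# learner per random seed, then count the seeds whose hold-out predictions
# match the candidate rule C on *every* input.
def fraction_perfect_agreement(train_learner, rule, holdout_inputs, seeds):
    perfect = 0
    for seed in seeds:
        model = train_learner(seed)  # fit the learner on T with this seed
        if all(model(x) == rule(x) for x in holdout_inputs):
            perfect += 1
    return perfect / len(seeds)

# Toy usage: a "learner" that generalizes as 2*x for even seeds and as
# x + 3 for odd seeds, so each rule gets FPA = 0.5.
toy = lambda seed: (lambda x: 2 * x) if seed % 2 == 0 else (lambda x: x + 3)
print(fraction_perfect_agreement(toy, lambda x: 2 * x, [4, 5], range(10)))  # 0.5
```

Note that a single seed with one mismatched hold-out output contributes nothing to FPA, which is exactly the insensitivity to imperfect generalization discussed above.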
We consider a learner $M$ to be biased toward $C_1$ over $C_2$ if the training set and its extension according to $C_1$ have a shorter description length (for $M$) than that of $C_2$. Denoting the description length of a dataset $D$ under the learner $M$ as $L_M(D)$, we hypothesise that if $L_M(\{x_i, C_1(x_i)\}_{i=1}^{k}) < L_M(\{x_i, C_2(x_i)\}_{i=1}^{k})$, then $M$ is biased toward $C_1$.

Calculating Description Length. To find the description length of data under a fixed learner, we use the online (prequential) code (Rissanen, 1978; Grunwald, 2004; Blier & Ollivier, 2018). The problem of calculating $L_M(D)$, $D = \{x_i, y_i\}_{i=1}^{k}$, is framed as a problem of transferring the outputs $y_i$ one by one, in a compressed form, between two parties, Alice (sender) and Bob (receiver). Alice has the entire dataset $\{x_i, y_i\}$, while Bob only has the inputs $\{x_i\}$. Before the transmission starts, both parties agree on the initialization of the model $M$, the order of the inputs $\{x\}$, random seeds, and the details of the learning procedure. The outputs $\{y_i\}$ are sequences of tokens from a vocabulary $V$. W.l.o.g., we fix some order over $\{x\}$. We assume that, given $x$, the learner $M$ produces a probability distribution over the space of outputs $y$, $p_M(y|x)$.
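The online code described above can be sketched as follows; the learner interface (`prob`, `fit`) and the helper name are hypothetical stand-ins, not the paper's implementation:

```python
import math

def prequential_description_length(make_learner, dataset):
    """Online (prequential) code: `dataset` is a list of (x, y) pairs in the
    order agreed on by Alice and Bob in advance."""
    total_bits = 0.0
    model = make_learner()  # shared initialization, seeds, hyperparameters
    for i, (x, y) in enumerate(dataset):
        # Bob already has x, so Alice only pays -log2 p_M(y | x) bits
        # to transmit y under the current model.
        total_bits -= math.log2(model.prob(x, y))
        # Both parties then update the model on everything sent so far.
        model.fit(dataset[: i + 1])
    return total_bits
```

As a sanity check, a model that assigns probability 1/2 to every output costs exactly one bit per example; data that is "simple" for the learner compresses below that as the model adapts to the already-transmitted prefix.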



* Equal contribution. Code used in the experiments can be found at https://github.com/facebookresearch/FIND.



For example, McCoy et al. (2020) used accuracy of the first term, Zhang et al. (2019) used correlation and MSE, and Lake & Baroni (2017) used accuracy calculated on the entire output sequence.

