WHAT THEY DO WHEN IN DOUBT: A STUDY OF INDUCTIVE BIASES IN SEQ2SEQ LEARNERS

Abstract

Sequence-to-sequence (seq2seq) learners are widely used, but we still have only limited knowledge about what inductive biases shape the way they generalize. We address this by investigating how popular seq2seq learners generalize in tasks that have high ambiguity in the training data. We use four new tasks to study learners' preferences for memorization, arithmetic, hierarchical, and compositional reasoning. Further, we connect to Solomonoff's theory of induction and propose to use description length as a principled and sensitive measure of inductive biases. In our experimental study, we find that LSTM-based learners can learn to perform counting, addition, and multiplication by a constant from a single training example. Furthermore, Transformer- and LSTM-based learners show a bias toward hierarchical induction over linear induction, while CNN-based learners prefer the opposite. The latter also show a bias toward compositional generalization over memorization. Finally, across all our experiments, description length proved to be a sensitive measure of inductive biases.

1. INTRODUCTION

Sequence-to-sequence (seq2seq) learners (Sutskever et al., 2014) have demonstrated remarkable performance in machine translation, story generation, and open-domain dialog (Sutskever et al., 2014; Fan et al., 2018; Adiwardana et al., 2020). Yet, these models have been criticized for requiring a tremendous amount of data and for being unable to generalize systematically (Dupoux, 2018; Loula et al., 2018; Lake & Baroni, 2017; Bastings et al., 2018). In contrast, humans rely on their inductive biases to generalize from a limited amount of data (Chomsky, 1965; Lake et al., 2019). Due to the centrality of humans' biases in language learning, several works have studied the inductive biases of seq2seq models and connected their poor generalization to the lack of the "right" biases (Lake & Baroni, 2017; Lake et al., 2019).

In this work, we focus on studying the inductive biases of seq2seq models. We start from the observation that, generally, multiple explanations can be consistent with a limited training set, each leading to different predictions on unseen data. A learner might prefer one type of explanation over another in a systematic way, as a result of its inductive biases (Ritter et al., 2017; Feinman & Lake, 2018). To illustrate our setup, consider a quiz-like question: if f(3) maps to 6, what does f(4) map to? The "training" example is consistent with the following answers: 6 (f(x) ≡ 6); 7 (f(x) = x + 3); 8 (f(x) = 2 · x); indeed, any number z, since we can always construct a function such that f(3) = 6 and f(4) = z. By analyzing the learner's output on this new input, we can infer its biases.

This example demonstrates how the biases of learners are studied through the lens of the poverty of the stimulus principle (Chomsky, 1965; 1980): if nothing in the training data indicates that a learner should generalize in a certain way, but it does nonetheless, then this is due to the biases of the learner. Inspired by the work of Zhang et al. (2019) in the image domain, we take this principle to the extreme and study the biases of seq2seq learners in the regime of very few training examples, often as few as one. Under this setup, we propose four new synthetic tasks that probe seq2seq learners' preferences for memorization-, arithmetic-, hierarchical-, and composition-based "reasoning".
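The ambiguity in the quiz-like question above can be made concrete with a small sketch (ours, for illustration only): three hypotheses that all fit the single training example f(3) = 6 yet disagree on the unseen input 4, which is exactly what lets the learner's choice among them reveal its inductive bias.

```python
# Three candidate explanations, each consistent with the single
# training example f(3) = 6 from the text.
hypotheses = {
    "constant": lambda x: 6,            # f(x) ≡ 6
    "additive": lambda x: x + 3,        # f(x) = x + 3
    "multiplicative": lambda x: 2 * x,  # f(x) = 2 · x
}

# All hypotheses agree on the training example...
assert all(f(3) == 6 for f in hypotheses.values())

# ...but each predicts a different answer on the unseen input 4.
predictions = {name: f(4) for name, f in hypotheses.items()}
print(predictions)  # {'constant': 6, 'additive': 7, 'multiplicative': 8}
```

Which of these answers a trained learner produces on the held-out input is precisely the probe the tasks in this paper are built around.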

