LEARNING TO RECOMBINE AND RESAMPLE DATA FOR COMPOSITIONAL GENERALIZATION

Abstract

Flexible neural sequence models outperform grammar- and automaton-based counterparts on a variety of tasks. However, neural models perform poorly in settings requiring compositional generalization beyond the training data, particularly to rare or unseen subsequences. Past work has found symbolic scaffolding (e.g. grammars or automata) essential in these settings. We describe R&R, a learned data augmentation scheme that enables a large category of compositional generalizations without appeal to latent symbolic structure. R&R has two components: recombination of original training examples via a prototype-based generative model, and resampling of generated examples to encourage extrapolation. Training an ordinary neural sequence model on a dataset augmented with recombined and resampled examples significantly improves generalization in two language processing problems, instruction following (SCAN) and morphological analysis (SIGMORPHON 2018), where R&R enables learning of new constructions and tenses from as few as eight initial examples.



1. INTRODUCTION

How can we build machine learning models with the ability to learn new concepts in context from little data? Human language learners acquire new word meanings from a single exposure (Carey & Bartlett, 1978), and immediately incorporate words and their meanings productively and compositionally into larger linguistic and conceptual systems (Berko, 1958; Piantadosi & Aslin, 2016). Despite the remarkable success of neural network models on many learning problems in recent years, including one-shot learning of classifiers and policies (Santoro et al., 2016; Wang et al., 2016), this kind of few-shot learning of composable concepts remains beyond the reach of standard neural models in both diagnostic and naturalistic settings (Lake & Baroni, 2018; Bahdanau et al., 2019a). Consider the few-shot morphology learning problem shown in Fig. 1, in which a learner must predict various linguistic features (e.g. 3rd person, singular, present tense) from word forms, with only a small number of examples of the past tense in the training set. Neural sequence-to-sequence models (e.g. Bahdanau et al., 2015) trained on this kind of imbalanced data fail to predict past-tense tags on held-out inputs of any kind (Section 5). Previous attempts to address this and related shortcomings in neural models have focused on explicitly encouraging rule-like behavior, e.g. by modeling data with symbolic grammars (Jia & Liang, 2016; Xiao et al., 2016; Cai et al., 2017) or applying rule-based data augmentation (Andreas, 2020). These procedures involve highly task-specific models or generative assumptions, preventing them from generalizing effectively to less structured problems that combine rule-like and exceptional behavior. More fundamentally, they fail to answer the question of whether explicit rules are necessary for compositional inductive bias, and whether it is possible to obtain "rule-like" inductive bias without appeal to an underlying symbolic generative process.
This paper describes a procedure for improving few-shot compositional generalization in neural sequence models without symbolic scaffolding. Our key insight is that even fixed, imbalanced training datasets provide a rich source of supervision for few-shot learning of concepts and composition rules. In particular, we propose a new class of prototype-based neural sequence models (cf. Gu et al., 2018) that can be directly trained to perform the kinds of generalization exhibited in Fig. 1 by explicitly recombining fragments of training examples to reconstruct other examples. Even when these prototype-based models are not effective as general-purpose predictors, we can resample their outputs to select high-quality synthetic examples of rare phenomena. Ordinary neural sequence models may then be trained on datasets augmented with these synthetic examples, distilling the learned regularities into more flexible predictors. This procedure, which we abbreviate R&R, promotes efficient generalization in both challenging synthetic sequence modeling tasks (Lake & Baroni, 2018) and morphological analysis in multiple natural languages (Cotterell et al., 2018). By directly optimizing for the kinds of generalization that symbolic representations are supposed to support, we can bypass the need for symbolic representations themselves: R&R gives performance comparable to or better than state-of-the-art neuro-symbolic approaches on tests of compositional generalization. Our results suggest that some failures of systematicity in neural models can be explained by simpler structural constraints on data distributions and corrected with weaker inductive bias than previously described.
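The recombine-then-resample loop described above can be sketched as follows. This is a minimal, illustrative Python sketch: the arguments `generate` and `rarity_weight` are hypothetical stand-ins for the learned prototype-based generative model and the resampling criterion, not the paper's actual API.

```python
import random

def recombine_and_resample(train_pairs, generate, rarity_weight,
                           n_candidates=1000, n_keep=100, seed=0):
    """Schematic R&R augmentation loop.

    train_pairs   : list of (x, y) pairs from the original dataset
    generate      : draws a synthetic (x, y) pair by recombining fragments
                    of a prototype pair (stand-in for the generative model)
    rarity_weight : scores a candidate higher when its output contains
                    rare tokens, steering resampling toward them
    """
    rng = random.Random(seed)
    # Recombination: propose candidate pairs conditioned on training prototypes.
    candidates = [generate(rng.choice(train_pairs)) for _ in range(n_candidates)]
    # Resampling: keep candidates in proportion to their rarity weight.
    weights = [rarity_weight(pair) for pair in candidates]
    kept = rng.choices(candidates, weights=weights, k=n_keep)
    # An ordinary sequence model is then trained on the augmented dataset.
    return train_pairs + kept
```

In the actual method both components are derived from data: `generate` corresponds to the prototype-based model trained to reconstruct examples from one another, and `rarity_weight` to the criterion that favors rare phenomena.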

2. BACKGROUND AND RELATED WORK

Compositional generalization Systematic compositionality, the capacity to identify rule-like regularities from limited data and generalize these rules to novel situations, is an essential feature of human reasoning (Fodor et al., 1988). While details vary, a common feature of existing attempts to formalize systematicity in sequence modeling problems (e.g. Gordon et al., 2020) is the intuition that learners should make accurate predictions in situations featuring novel combinations of previously observed input or output subsequences. For example, learners should generalize from actions seen in isolation to more complex commands involving those actions (Lake et al., 2019), and from relations of the form r(a,b) to r(b,a) (Keysers et al., 2020; Bahdanau et al., 2019b). In machine learning, previous studies have found that standard neural architectures fail to generalize systematically in a variety of settings, even when they achieve high in-distribution accuracy (Lake & Baroni, 2018; Bastings et al., 2018; Johnson et al., 2017).

Data augmentation and resampling Learning to predict sequential outputs with rare or novel subsequences is related to the widely studied problem of class imbalance in classification problems. There, undersampling of the majority class or oversampling of the minority class has been found to improve the quality of predictions for rare phenomena (Japkowicz et al., 2000). This can be combined with targeted data augmentation with synthetic examples of the minority class (Chawla et al., 2002). Generically, given a training dataset D, learning with class resampling and data augmentation involves defining an augmentation distribution p̃(x, y | D) and a sample weighting function u(x, y), and maximizing a training objective of the form:

L(θ) = (1/|D|) ∑_{(x,y)∈D} log p_θ(y | x) + E_{(x,y)∼p̃} [u(x, y) log p_θ(y | x)],   (1)

where the first term scores the original training data and the second the augmented data. In addition to such resampling schemes, task-specific model architectures have been proposed to encourage compositional generalization (Andreas et al., 2016; Russin et al., 2019).
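As a concrete (toy) reading of objective (1), the following sketch evaluates its two terms given a scoring function `log_p(x, y)` standing in for log p_θ(y | x); the expectation over p̃ is replaced by an average over samples already drawn from it. All names here are illustrative, not part of the paper's implementation.

```python
def augmented_objective(log_p, train_data, aug_samples, u):
    """Toy evaluation of objective (1): the mean log-likelihood of the
    original training data plus a sample average of the weighted
    log-likelihood of augmented examples drawn from p-tilde."""
    # First term: average log-likelihood over the original dataset D.
    original = sum(log_p(x, y) for x, y in train_data) / len(train_data)
    # Second term: Monte Carlo estimate of E[u(x, y) log p(y | x)].
    augmented = sum(u(x, y) * log_p(x, y) for x, y in aug_samples) / len(aug_samples)
    return original + augmented
```

With u(x, y) ≡ 0 this reduces to ordinary maximum-likelihood training; larger weights on rare-phenomenon samples shift the objective toward the augmented data.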



Code for all experiments in this paper is available at https://github.com/ekinakyurek/compgen. We implemented our experiments in Knet (Yuret, 2016) using Julia (Bezanson et al., 2017).



Figure 1: We first train a generative model to reconstruct training pairs (x, y) by constructing them from other training pairs (a). We then perform data augmentation by sampling from this model, preferentially generating samples in which y contains rare tokens or substructures (b). Dashed boxes show prediction targets. Conditional models trained on the augmented dataset accurately predict outputs y from new inputs x requiring compositional generalization (c).
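One simple way to realize the "preferentially generating samples in which y contains rare tokens" step in panel (b) is to weight each candidate output by the inverse corpus frequency of its rarest token. This is an illustrative heuristic only; the paper's actual resampling criterion may differ.

```python
from collections import Counter

def rarity_weights(outputs):
    """Weight each output sequence by the inverse frequency of its rarest
    token, so candidates containing rare tokens are preferentially kept."""
    # Token frequencies pooled over all candidate outputs.
    counts = Counter(tok for y in outputs for tok in y.split())
    total = sum(counts.values())
    # A sequence is as rare as its rarest token.
    return [total / min(counts[tok] for tok in y.split()) for y in outputs]
```

Sequences whose rarest token appears only a handful of times receive proportionally larger weights, so a weighted resampling step draws them more often than frequency-proportional sampling would.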

Recent years have also seen a renewed interest in data augmentation as a flexible and model-agnostic tool for encouraging controlled generalization (Ratner et al., 2017). Existing proposals for sequence models are mainly rule-based, specifying a synchronous context-free grammar (Jia & Liang, 2016) or a string rewriting system (Andreas, 2020) to generate new examples. Rule-based data augmentation schemes that recombine multiple training examples have been proposed

