LEARNING TO RECOMBINE AND RESAMPLE DATA FOR COMPOSITIONAL GENERALIZATION

Abstract

Flexible neural sequence models outperform grammar- and automaton-based counterparts on a variety of tasks. However, neural models perform poorly in settings requiring compositional generalization beyond the training data, particularly to rare or unseen subsequences. Past work has found symbolic scaffolding (e.g. grammars or automata) essential in these settings. We describe R&R, a learned data augmentation scheme that enables a large category of compositional generalizations without appeal to latent symbolic structure. R&R has two components: recombination of original training examples via a prototype-based generative model, and resampling of generated examples to encourage extrapolation. Training an ordinary neural sequence model on a dataset augmented with recombined and resampled examples significantly improves generalization in two language processing problems, instruction following (SCAN) and morphological analysis (SIGMORPHON 2018), where R&R enables learning of new constructions and tenses from as few as eight initial examples.



Introduction
How can we build machine learning models with the ability to learn new concepts in context from little data? Human language learners acquire new word meanings from a single exposure (Carey & Bartlett, 1978), and immediately incorporate those words and their meanings productively and compositionally into larger linguistic and conceptual systems (Berko, 1958; Piantadosi & Aslin, 2016). Despite the remarkable success of neural network models on many learning problems in recent years, including one-shot learning of classifiers and policies (Santoro et al., 2016; Wang et al., 2016), this kind of few-shot learning of composable concepts remains beyond the reach of standard neural models in both diagnostic and naturalistic settings (Lake & Baroni, 2018; Bahdanau et al., 2019a).

Consider the few-shot morphology learning problem shown in Fig. 1, in which a learner must predict various linguistic features (e.g. 3rd person, singular, present tense) from word forms, with only a small number of examples of the past tense in the training set. Neural sequence-to-sequence models (e.g. Bahdanau et al., 2015) trained on this kind of imbalanced data fail to predict past-tense tags on held-out inputs of any kind (Section 5). Previous attempts to address this and related shortcomings in neural models have focused on explicitly encouraging rule-like behavior, e.g. by modeling data with symbolic grammars (Jia & Liang, 2016; Xiao et al., 2016; Cai et al., 2017) or applying rule-based data augmentation (Andreas, 2020). These procedures involve highly task-specific models or



Figure 1: We first train a generative model to reconstruct training pairs (x, y) by constructing them from other training pairs (a). We then perform data augmentation by sampling from this model, preferentially generating samples in which y contains rare tokens or substructures (b). Dashed boxes show prediction targets. Conditional models trained on the augmented dataset accurately predict outputs y from new inputs x requiring compositional generalization (c).
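The resampling step in (b) can be illustrated with a minimal sketch. The scoring rule below (weighting each generated pair by the scarcity of the rarest token in its output y) is a simplified stand-in for the learned resampling procedure, not the paper's actual model; the function names and the toy morphological tag vocabulary are hypothetical.

```python
import random
from collections import Counter

def rare_token_weights(samples, token_counts, smoothing=1.0):
    """Score each generated (x, y) pair by the rarity of the rarest
    token in its output y (hypothetical scoring rule)."""
    weights = []
    for _, y in samples:
        rarest = min(token_counts.get(tok, 0) for tok in y.split())
        weights.append(1.0 / (rarest + smoothing))
    return weights

def resample(samples, token_counts, k, seed=0):
    """Draw k augmentation examples, preferring outputs with rare tokens."""
    rng = random.Random(seed)
    weights = rare_token_weights(samples, token_counts)
    return rng.choices(samples, weights=weights, k=k)

# Toy usage: past-tense tags are rare in the original training outputs,
# so generated pairs containing PAST are sampled more often.
train_outputs = ["walk V PRES", "walk V PRES", "jump V PRES", "walk V PAST"]
token_counts = Counter(tok for y in train_outputs for tok in y.split())
generated = [("walks", "walk V PRES"), ("jumped", "jump V PAST")]
augmented = resample(generated, token_counts, k=10)
```

Under this heuristic the PAST-tagged pair receives roughly twice the weight of the PRES-tagged one, so the augmented dataset skews toward the tag that was underrepresented in training.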

