LEARNING TO RECOMBINE AND RESAMPLE DATA FOR COMPOSITIONAL GENERALIZATION

Abstract

Flexible neural sequence models outperform grammar- and automaton-based counterparts on a variety of tasks. However, neural models perform poorly in settings requiring compositional generalization beyond the training data, particularly to rare or unseen subsequences. Past work has found symbolic scaffolding (e.g. grammars or automata) essential in these settings. We describe R&R, a learned data augmentation scheme that enables a large category of compositional generalizations without appeal to latent symbolic structure. R&R has two components: recombination of original training examples via a prototype-based generative model, and resampling of generated examples to encourage extrapolation. Training an ordinary neural sequence model on a dataset augmented with recombined and resampled examples significantly improves generalization in two language processing problems, instruction following (SCAN) and morphological analysis (SIGMORPHON 2018), where R&R enables learning of new constructions and tenses from as few as eight initial examples.



1. INTRODUCTION

How can we build machine learning models with the ability to learn new concepts in context from little data? Human language learners acquire new word meanings from a single exposure (Carey & Bartlett, 1978), and immediately incorporate words and their meanings productively and compositionally into larger linguistic and conceptual systems (Berko, 1958; Piantadosi & Aslin, 2016). Despite the remarkable success of neural network models on many learning problems in recent years, including one-shot learning of classifiers and policies (Santoro et al., 2016; Wang et al., 2016), this kind of few-shot learning of composable concepts remains beyond the reach of standard neural models in both diagnostic and naturalistic settings (Lake & Baroni, 2018; Bahdanau et al., 2019a). Consider the few-shot morphology learning problem shown in Fig. 1, in which a learner must predict various linguistic features (e.g. 3rd person, singular, present tense) from word forms, with only a small number of examples of the past tense in the training set. Neural sequence-to-sequence models (e.g. Bahdanau et al., 2015) trained on this kind of imbalanced data fail to predict past-tense tags on held-out inputs of any kind (Section 5). Previous attempts to address this and related shortcomings in neural models have focused on explicitly encouraging rule-like behavior, e.g. by modeling data with symbolic grammars (Jia & Liang, 2016; Xiao et al., 2016; Cai et al., 2017) or applying rule-based data augmentation (Andreas, 2020). These procedures involve highly task-specific models or generative assumptions, preventing them from generalizing effectively to less structured problems that combine rule-like and exceptional behavior. More fundamentally, they fail to answer the question of whether explicit rules are necessary for compositional inductive bias, and whether it is possible to obtain "rule-like" inductive bias without appeal to an underlying symbolic generative process.
This paper describes a procedure for improving few-shot compositional generalization in neural sequence models without symbolic scaffolding. Our key insight is that even fixed, imbalanced training datasets provide a rich source of supervision for few-shot learning of concepts and composition rules. In particular, we propose a new class of prototype-based neural sequence models (cf. Gu et al., 2018) that can be directly trained to perform the kinds of generalization exhibited in Fig. 1 by explicitly recombining fragments of training examples to reconstruct other examples. Even when these prototype-based models are not effective as general-purpose predictors, we can resample their outputs to select high-quality synthetic examples of rare phenomena. Ordinary neural sequence models may then be trained on datasets augmented with these synthetic examples, distilling the learned regularities into more flexible predictors. This procedure, which we abbreviate R&R, promotes efficient generalization in both challenging synthetic sequence modeling tasks (Lake & Baroni, 2018) and morphological analysis in multiple natural languages (Cotterell et al., 2018). By directly optimizing for the kinds of generalization that symbolic representations are supposed to support, we can bypass the need for symbolic representations themselves: R&R gives performance comparable to or better than state-of-the-art neuro-symbolic approaches on tests of compositional generalization. Our results suggest that some failures of systematicity in neural models can be explained by simpler structural constraints on data distributions and corrected with weaker inductive bias than previously described.

2. BACKGROUND AND RELATED WORK

Compositional generalization Systematic compositionality, the capacity to identify rule-like regularities from limited data and generalize these rules to novel situations, is an essential feature of human reasoning (Fodor et al., 1988). While details vary, a common feature of existing attempts to formalize systematicity in sequence modeling problems (e.g. Gordon et al., 2020) is the intuition that learners should make accurate predictions in situations featuring novel combinations of previously observed input or output subsequences. For example, learners should generalize from actions seen in isolation to more complex commands involving those actions (Lake et al., 2019), and from relations of the form r(a,b) to r(b,a) (Keysers et al., 2020; Bahdanau et al., 2019b). In machine learning, previous studies have found that standard neural architectures fail to generalize systematically even when they achieve high in-distribution accuracy in a variety of settings (Lake & Baroni, 2018; Bastings et al., 2018; Johnson et al., 2017).

Data augmentation and resampling

Learning to predict sequential outputs with rare or novel subsequences is related to the widely studied problem of class imbalance in classification. There, undersampling of the majority class or oversampling of the minority class has been found to improve the quality of predictions for rare phenomena (Japkowicz et al., 2000). This can be combined with targeted data augmentation with synthetic examples of the minority class (Chawla et al., 2002). Generically, given a training dataset D, learning with class resampling and data augmentation involves defining an augmentation distribution p~(x, y | D) and a sample weighting function u(x, y), then maximizing a training objective of the form:

    L(θ) = (1/|D|) Σ_{(x,y)∈D} log p_θ(y | x)        [original training data]
         + E_{(x,y)∼p~} [ u(x, y) log p_θ(y | x) ]   [augmented data]        (1)

In addition to task-specific model architectures (Andreas et al., 2016; Russin et al., 2019), recent years have seen a renewed interest in data augmentation as a flexible and model-agnostic tool for encouraging controlled generalization (Ratner et al., 2017). Existing proposals are mainly rule-based: in sequence modeling problems, specifying a synchronous context-free grammar (Jia & Liang, 2016) or a string rewriting system (Andreas, 2020); in other domains, specifying fixed transformations for image classification (Inoue, 2018) and machine translation (Fadaee et al., 2017). While rule-based data augmentation is highly effective in structured problems featuring crisp correspondences between inputs and outputs, its effectiveness in problems involving more complicated, context-dependent relationships between inputs and outputs has not been well studied.

Learned data augmentation What might compositional data augmentation look like without rules as a source of inductive bias? As Fig. 1 suggests, an ideal data augmentation procedure (p~ in Eq. 1) should automatically identify valid ways of transforming and combining examples, without pre-committing to a fixed set of transformations.
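The objective in Eq. 1 can be sketched in a few lines of code. This is an illustrative Monte Carlo sketch, not the paper's implementation: `log_p` stands in for the conditional model log p_θ(y | x), and the expectation over the augmentation distribution is approximated by an already-drawn list of samples.

```python
def augmented_objective(original, augmented, log_p, u):
    """Monte Carlo sketch of Eq. 1: mean log-likelihood over the original
    training data plus a weighted mean over augmented samples.

    original, augmented: lists of (x, y) pairs
    log_p(x, y): log p_theta(y | x) under the conditional model
    u(x, y): sample weight for augmented examples
    """
    base = sum(log_p(x, y) for x, y in original) / len(original)
    aug = sum(u(x, y) * log_p(x, y) for x, y in augmented) / max(len(augmented), 1)
    return base + aug  # quantity to be maximized over theta
```

In practice one maximizes this quantity by gradient ascent on the parameters of p_θ; with u ≡ 0 it reduces to the ordinary training objective.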
A promising starting point is provided by prototype-based models, a number of which have been proposed in recent work (Gu et al., 2018; Guu et al., 2018; Khandelwal et al., 2020). On one hand, such models may be viewed as nonparametric density estimators in the spirit of kernel methods (Rosenblatt, 1956). But additionally, building on the motivation in Section 1, they may be viewed as one-shot learners trained to generate new data d from a single example. Existing work uses prototype-based models as replacements for standard sequence models. We will show here that they are even better suited to use as data augmentation procedures: they can produce high-precision examples in the neighborhood of existing training data, then be used to bootstrap simpler predictors that extrapolate more effectively. But our experiments will also show that existing prototype-based models give mixed results on challenging generalizations of the kind depicted in Fig. 1 when used for either direct prediction or data augmentation, performing well in some settings but barely above baseline in others. Accordingly, R&R is built on two model components that transform prototype-based language models into an effective learned data augmentation scheme. Section 3 describes an implementation of p_rewrite that encourages greater sample diversity and well-formedness via a multi-prototype copying mechanism (a two-shot learner). Section 4 describes heuristics for sampling prototypes d' and model outputs d to focus data augmentation on the most informative examples. Section 5 investigates the empirical performance of both components of the approach, finding that together they provide a simple but surprisingly effective tool for enabling compositional generalization.

3. PROTOTYPE-BASED SEQUENCE MODELS FOR DATA RECOMBINATION

We begin with a brief review of existing prototype-based sequence models. Our presentation mostly follows the retrieve-and-edit approach of Guu et al. (2018), but versions of the approach in this paper could also be built on retrieval-based models implemented with memory networks (Miller et al., 2016; Gu et al., 2018) or transformers (Khandelwal et al., 2020; Guu et al., 2020). The generative process described in Eq. 2 implies a marginal sequence probability:

    p(d) = (1/|D|) Σ_{d'∈D} p_rewrite(d | d'; θ)        (3)

Maximizing this quantity over the training set with respect to θ will encourage p_rewrite to act as a model of valid data transformations: to be assigned high probability, every training example must be explained by at least one other example and a parametric rewriting operation. (The trivial solution where p_θ is the identity function, with p_θ(d | d' = d) = 1, can be ruled out manually in the design of p_θ.) When D is large, the sum in Eq. 3 is too large to enumerate exhaustively when computing the marginal likelihood.

[Footnote 2: As a concrete example of the potential advantage of learned data augmentation, consider applying the GECA procedure of Andreas (2020) to the language of strings a^n b^n. GECA produces a training set that is substitutable in the sense of Clark & Eyraud (2007); as noted there, a^n b^n is not substitutable. GECA will infer that a can be replaced with aab based on their common context in (aabb, aaabbb), then generate the malformed example aaababbb by replacing an a in the wrong position. In contrast, recurrent neural networks can accurately model a^n b^n (Weiss et al., 2018; Gers & Schmidhuber, 2001). Of course, this language can also be generated using even more constrained procedures than GECA, but in general learned sequence models can capture a broader set of both formal regularities and exceptions compared to rule-based procedures.]
Instead, we can optimize a lower bound by restricting the sum to a neighborhood N(d) ⊂ D of training examples around each d:

    p(d) ≥ (1/|D|) Σ_{d'∈N(d)} p_rewrite(d | d'; θ)        (4)

The choice of N is discussed in more detail in Section 4. Now observe that:

    log p(d) ≥ log [ Σ_{d'∈N(d)} (1/|N(d)|) p_rewrite(d | d'; θ) ] + log |N(d)| - log |D|        (5)
             ≥ (1/|N(d)|) Σ_{d'∈N(d)} log p_rewrite(d | d'; θ) + log (|N(d)| / |D|)             (6)

where the second step uses Jensen's inequality. If all N(d) are the same size, maximizing this lower bound on log-likelihood is equivalent to simply maximizing

    Σ_{d'∈N(d)} log p_rewrite(d | d'; θ)        (7)

over D. This is the ordinary conditional likelihood for a string transducer (Ristad & Yianilos, 1998) or sequence-to-sequence model (Sutskever et al., 2014) with examples d, d' ∈ N(d).

We have motivated prototype-based models by arguing that p_rewrite learns a model of transformations licensed by the training data. However, when generalization involves complex compositions, we will show that neither a basic RNN implementation of p_rewrite nor a single prototype is enough; we must provide the learned rewriting model with a larger inventory of parts and encourage reuse of those parts as faithfully as possible. This motivates the two improvements on the prototype-based modeling framework described in the remainder of this section: generalization to multiple prototypes (Section 3.1) and a new rewriting model (Section 3.2).
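In practice, the bound in Eq. 7 means p_rewrite can be trained as an ordinary sequence-to-sequence model on (prototype, target) pairs. A minimal sketch of building that training set, assuming a neighborhood function is supplied (the helper name is hypothetical):

```python
def rewrite_training_pairs(dataset, neighborhood):
    """Expand a dataset into (prototype, target) training pairs per Eq. 7:
    each example d is paired with every prototype d' in its neighborhood N(d)."""
    pairs = []
    for d in dataset:
        for d_proto in neighborhood(d):
            pairs.append((d_proto, d))  # condition on d', predict d
    return pairs
```

Any off-the-shelf conditional sequence model can then be fit to these pairs by maximum likelihood.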

3.1. n-PROTOTYPE MODELS

To improve compositionality in prototype-based models, we equip them with the ability to condition on multiple examples simultaneously. We extend the basic prototype-based language model to n prototypes, which we now refer to as a recombination model p_recomb:

    d ∼ p_recomb(· | d_{1:n}; θ)   where   d_{1:n} := (d_1, d_2, ..., d_n) ∼ p_Ω(·)

A multi-prototype model may be viewed as a meta-learner (Thrun & Pratt, 1998; Santoro et al., 2016): it maps from a small number of examples (the prototypes) to a distribution over new datapoints consistent with those examples. By choosing the neighborhood and implementation of p_recomb appropriately, we can train this meta-learner to specialize in one-shot concept learning (by reusing a fragment exhibited in a single prototype) or compositional generalization (by assembling fragments of prototypes into a novel configuration). To enable this behavior, we define a set of compatible prototypes Ω ⊂ D^n (Section 4) and let p_Ω := Unif(Ω). We update Eq. 6 to feature a corresponding multi-prototype neighborhood N: D → Ω. The only terms that have changed are the conditioning variable and the constant term, and it is again sufficient to choose θ to optimize

    Σ_{d_{1:n}∈N(d)} log p_recomb(d | d_{1:n})

over D, implementing p_recomb as described next.

3.2. RECOMBINATION NETWORKS

Past work has found that latent-variable neural sequence models often ignore the latent variable and attempt to directly model sequence marginals (Bowman et al., 2016). When an ordinary sequence-to-sequence model with attention is used to implement p_recomb, even in the one-prototype case, generated sentences often have little overlap with their prototypes (Weston et al., 2018). We describe a specific model architecture for p_recomb that does not function as a generic noise model, and in which outputs are primarily generated via explicit reuse of fragments of multiple prototypes, by facilitating copying from independent streams containing the prototypes and previously generated tokens. We take p_recomb(d | d_{1:n}; θ) to be a neural (multi-)sequence-to-sequence model (cf. Sutskever et al., 2014) which decomposes probability autoregressively:

    p_recomb(d | d_{1:n}; θ) = Π_t p(d_t | d_{<t}, d_{1:n}; θ)        (9)

As shown in Fig. 2, three LSTM encoders, two for the prototypes and one for the generated prefix, compute sequences of token representations h_proto and h_out respectively.
Given the current decoder hidden state h_out^t, the model first attends to both prototype and output tokens:

    α_out^i ∝ exp(h_out^t W_o h_out^i)          for i < t               (10)
    α_proto^{kj} ∝ exp(h_out^t W_p h_proto^{kj})   for k ≤ n, j ≤ |d_k|   (11)

To enable copying from each sequence, we project the attention weights α_out and α_proto^k onto the output vocabulary to produce a sparse vector of probabilities:

    p_copy,out^t(d_t = w) = Σ_{i<t} 1[d_i = w] · α_out^i                      (12)
    p_copy,proto-k^t(d_t = w) = Σ_{j≤|d_k|} 1[d_{k,j} = w] · α_proto^{kj}      (13)

Unlike rule-based data recombination procedures, however, p_recomb is not required to copy from the prototypes, and can predict output tokens directly using values retrieved by the attention mechanism:

    h_pre^t = [h_out^t, Σ_i α_out^i h_out^i, Σ_{k,j} α_proto^{kj} h_proto^{kj}]   (14)
    p_write^t ∝ exp(W_write h_pre^t)                                             (15)

To produce a final distribution over output tokens at time t, we combine predictions from each stream:

    β_gate = softmax(W_gate h_out^t)        (16)
    p(d_t = w | d_{<t}, d_{1:n}; θ) = β_gate · [p_write^t(w), p_copy,out^t(w), p_copy,proto-1^t(w), ..., p_copy,proto-n^t(w)]        (17)

This copy mechanism is similar to the one proposed by Merity et al. (2017) and See et al. (2017). We compare 1- and 2-prototype models to an ordinary sequence model and baselines in Section 5.
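The copy-and-gate computation above can be sketched with plain NumPy. All names here are illustrative: in the real model the attention weights and gate logits come from the LSTM states and the learned matrices W_o, W_p and W_gate; this sketch only shows how the streams combine into one distribution.

```python
import numpy as np

def copy_distribution(token_ids, attn, vocab_size):
    """Eqs. 12-13: project attention weights over a token sequence onto the
    vocabulary, p(w) = sum_i 1[token_i == w] * attn_i (a sparse distribution)."""
    p = np.zeros(vocab_size)
    for tok, a in zip(token_ids, attn):
        p[tok] += a
    return p

def mix_streams(p_write, copy_dists, gate_logits):
    """Eqs. 16-17: softmax the gate logits into mixture weights beta, then
    combine the write distribution with each copy distribution."""
    e = np.exp(gate_logits - np.max(gate_logits))
    beta = e / e.sum()
    streams = [p_write] + list(copy_dists)
    return sum(b * p for b, p in zip(beta, streams))
```

Because each stream is itself a distribution over the vocabulary and the gate weights sum to one, the mixture is again a valid distribution.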

4. SAMPLING SCHEMES

The models above provide generic procedures for generating well-formed combinations of training data, but do nothing to ensure that the generated samples are of a kind useful for compositional generalization. While the training objective in Eq. 7 encourages the learned p(d) to lie close to the training data, an effective data augmentation procedure should intuitively provide novel examples of rare phenomena. To generate augmented training data, we combine the generative models of Section 3 with a simple sampling procedure that upweights useful examples.

4.1. RESAMPLING AUGMENTED DATA

In classification problems with imbalanced classes, a common strategy for improving accuracy on the rare class is to resample so that the rare class is better represented in the training data (Japkowicz et al., 2000). When constructing an augmented dataset using the models described above, we apply a simple rejection sampling scheme. In Eq. 1, we set:

    u(d) = 1[min_t p(d_t) < ε]        (18)

Here p(d_t) is the marginal probability that the token d_t appears in any example, and ε is a hyperparameter. The final model is then trained using Eq. 1, retaining those augmented samples for which u(d) = 1. For extremely imbalanced problems, like the ones considered in Section 5, this weighting scheme effectively functions as a rare-tag constraint: only examples containing rare words or tags are used to augment the original training data.
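A sketch of this rejection rule (Eq. 18), estimating p(d_t) as the fraction of training examples in which a token appears; the function names are illustrative:

```python
from collections import Counter

def resample_filter(train, samples, eps):
    """Keep a generated example only if at least one of its tokens has
    marginal appearance probability below eps (Eq. 18); reject otherwise."""
    n = len(train)
    appears_in = Counter(tok for d in train for tok in set(d))

    def u(d):
        # min_t p(d_t) < eps  <=>  some token of d is rarer than eps
        return 1 if any(appears_in[t] / n < eps for t in d) else 0

    return [d for d in samples if u(d) == 1]
```

With a small ε this keeps exactly the samples that mention rare words or tags, matching the rare-tag-constraint reading above.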

4.2. NEIGHBORHOODS AND PROTOTYPE PRIORS

How can we ensure that the data augmentation procedure generates any samples with positive weight in Eq. 18? The prototype-based models described in Section 3 offer an additional means of control over the generated data. Aside from the implementation of p_recomb, the main factors governing the behavior of the model are the choice of neighborhood function N(d) and, for n ≥ 2, the prior over compatible prototypes Ω. Defining these so that rare tags also preferentially appear in prototypes helps ensure that the generated samples contribute to generalization.

1-prototype neighborhoods Guu et al. (2018) define a one-prototype N based on a Jaccard distance threshold (Jaccard, 1901). For experiments with one-prototype models we employ a similar strategy, choosing an initial neighborhood of candidates such that

    N(d) := {d_1 ∈ D : α · |d Δ d_1| + β · lev(d, d_1) < δ}        (19)

where Δ denotes the symmetric difference between the sets of tokens in the two sequences, lev is string edit distance (Levenshtein, 1966), and α, β and δ are hyperparameters (discussed in Appendix B).
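This one-prototype neighborhood can be sketched directly, with a standard Levenshtein distance over token sequences (a minimal illustration, not the paper's implementation):

```python
def lev(s, t):
    """Standard Levenshtein edit distance between two sequences."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (a != b)))  # substitution
        prev = cur
    return prev[-1]

def neighborhood_1proto(d, dataset, alpha, beta, delta):
    """Neighbors whose weighted token-set symmetric difference plus edit
    distance falls below the threshold delta."""
    def score(d1):
        return alpha * len(set(d) ^ set(d1)) + beta * lev(d, d1)
    return [d1 for d1 in dataset if d1 != d and score(d1) < delta]
```

For large datasets one would precompute token sets and prune candidates before running the quadratic edit-distance computation.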

2-prototype neighborhoods

The n ≥ 2 prototype case requires a more complex neighborhood function: intuitively, for an input d, we want each (d_1, d_2, ...) in the neighborhood to collectively contain enough information to reconstruct d. Future work might treat the neighborhood function itself as latent, allowing the model to identify groups of prototypes that make d probable; here, as in existing one-prototype models, we provide heuristic implementations for the n = 2 case.

Long-short recombination: For each (d_1, d_2) ∈ N(d), d_1 is chosen to be similar to d, and d_2 is chosen to be similar to the difference between d and d_1. (The neighborhood is so named because one of the prototypes will generally have fewer tokens than the other one.)

    N(d) := {(d_1, d_2) ∈ Ω : lev(d, d_1) < δ, lev([d\d_1], d_2) < δ, |d\d_1| > 0, |d\d_1\d_2| = 0}        (20)

Here [d\d_1] is the sequence obtained by removing all tokens in d_1 from d. Recall that we have defined p_Ω(d_{1:n}) := Unif(Ω) for a set Ω of "compatible" prototypes. For experiments using long-short recombination, all prototypes are treated as compatible; that is, Ω = D × D.

Long-long recombination: N(d) contains pairs of prototypes that are individually similar to d and collectively contain all the tokens needed to reconstruct d:

    N(d) := {(d_1, d_2) ∈ Ω : lev(d, d_1) < δ, lev(d, d_2) < δ, |d Δ d_1| = 1, |d\d_1\d_2| = 0}        (21)

For experiments using long-long recombination, we take Ω = {(d_1, d_2) ∈ D × D : |d_1 Δ d_2| = 1}.
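The two neighborhood definitions above can be written as predicates over candidate prototype pairs. This sketch (hypothetical helper names) interprets [d\d_1] as dropping every token of d that occurs in d_1, matching the description in the text:

```python
def lev(s, t):
    """Levenshtein edit distance over token sequences."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def remove(d, d1):
    """[d \\ d1]: drop every token of d that also occurs in d1."""
    drop = set(d1)
    return tuple(t for t in d if t not in drop)

def long_short_ok(d, d1, d2, delta):
    """Long-short: d1 close to d; d2 close to the non-empty leftover
    [d \\ d1], which must be fully covered by d2."""
    rest = remove(d, d1)
    return (lev(d, d1) < delta and lev(rest, d2) < delta
            and len(rest) > 0 and len(remove(rest, d2)) == 0)

def long_long_ok(d, d1, d2, delta):
    """Long-long: both prototypes close to d, d1 differs from d by exactly
    one token type, and d1 and d2 together cover all tokens of d."""
    return (lev(d, d1) < delta and lev(d, d2) < delta
            and len(set(d) ^ set(d1)) == 1
            and len(remove(remove(d, d1), d2)) == 0)
```

Scanning all pairs in Ω with these predicates yields the two-prototype neighborhoods used for training.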

5. DATASETS & EXPERIMENTS

We evaluate R&R on two tests of compositional generalization: the SCAN instruction following task (Lake & Baroni, 2018) and a few-shot morphology learning task derived from the SIGMORPHON 2018 dataset (Kirov et al., 2018; Cotterell et al., 2018). Our experiments are designed to explore the effectiveness of learned data recombination procedures in controlled and natural settings. Both tasks involve conditional sequence prediction: while preceding sections have discussed augmentation procedures that produce data points d = (x, y), learners are evaluated on their ability to predict an output y from an input x: actions y given instructions x, or morphological analyses y given words x. For each task, we compare a baseline with no data augmentation, the rule-based GECA data augmentation procedure (Andreas, 2020), and a sequence of ablated versions of R&R that measure the importance of resampling and recombination. The basic Learned Aug model trains an RNN to generate (x, y) pairs, then trains a conditional model on the original data and samples from the generative model. Resampling filters these samples as described in Section 4. Recomb-n models replace the RNN with a prototype-based model as described in Section 3. Additional experiments (Table 1b) compare data augmentation to prediction of y via direct inference (Appendix E) in the prototype-based model and several other model variants.

[Table 1 caption: (a) Data augmentation with recomb-2 + resampling performs slightly worse than GECA on the jump and around right splits; data augmentation with recomb-1 + resampling or an ordinary RNN does not generalize robustly to either split. All differences except between GECA and recomb-2 + resampling in jump are significant (paired t-test, p < 0.001). Dashes indicate that all samples were rejected by resampling when decoding with temperature T = 1. (b) Ablation experiments on the jump split. Introducing the latent variable used in previous work (Guu et al., 2018) does not change performance; removing the copy mechanism results in a complete failure of generalization. While it is possible to perform conditional inference of p(y | x) given the generative model in Eq. 3 (direct inference), this gives significantly worse results than data augmentation (see Sec. 5.3).]

5.1. SCAN

SCAN (Lake & Baroni, 2018) is a synthetic dataset featuring simple English commands paired with sequences of actions. Our experiments aim to show that R&R performs well at one-shot concept learning and zero-shot generalization on controlled tasks where rule-based models succeed. We experiment with two splits of the dataset, jump and around right. In the jump split, which tests one-shot learning, the word jump appears in a single command in the training set but in more complex commands in the test set (e.g. look and jump twice). The around right split (Loula et al., 2018) tests zero-shot generalization by presenting learners with constructions like walk around left and walk right in the training set, but walk around right only in the test set. Despite the apparent simplicity of the task, ordinary neural sequence-to-sequence models completely fail to make correct predictions on the SCAN test set (Table 1). As such, it has been a major focus of research on compositional generalization in sequence-to-sequence models, and a number of heuristic procedures, specialized model architectures, and training procedures have been developed to solve it (Russin et al., 2019; Gordon et al., 2020; Lake, 2019; Andreas, 2020). Here we show that the generic prototype recombination procedure described above does so as well. We use long-short recombination for the jump split and long-long recombination for the around right split. We use a recombination network to generate 400 samples d = (x, y) and then train an ordinary LSTM with attention (Bahdanau et al., 2015) on the original and augmented data to predict y from x. Training hyperparameters are provided in Appendix D. Table 1 shows the results of training these models on the SCAN dataset. 2-prototype recombination is essential for successful generalization on both splits. Additional ablations (Table 1b) show that the continuous latent variable used by Guu et al. (2018) does not affect performance, but that the copy mechanism described in Section 3.2 and the use of the recomb-2 model for data augmentation rather than direct inference are necessary for accurate prediction.

5.2. SIGMORPHON 2018

The SIGMORPHON 2018 dataset consists of words paired with morphological analyses (lemmas, or base forms, and tags for linguistic features like tense and case, as depicted in Fig. 1). We use the data to construct a morphological analysis task (Akyürek et al., 2019) (predicting analyses from surface forms) to test models' few-shot learning of new morphological paradigms. In three languages of varying morphological complexity (Spanish, Swahili, and Turkish) we construct splits of the data featuring a training set of 1000 examples and three test sets of 100 examples. One test set consists exclusively of words in the past tense, one in the future tense, and one of other word forms (present-tense verbs, nouns and adjectives). The training set contains exactly eight past-tense and eight future-tense examples; all the rest are other word forms. Experiments evaluate R&R's ability to efficiently learn noisy morphological rules, long viewed as a key challenge for connectionist approaches to language learning (Rumelhart & McClelland, 1986). As approaches may be sensitive to the choice of the eight examples from which the model must generalize, we construct five different splits per language and use the Spanish past-tense data as a development set. As above, we use long-long recombination with similarity criteria applied to y only. We augment the training data with 180 samples from p_recomb and again train an ordinary LSTM with attention for final predictions. Details are provided in Appendix B. Table 2 shows aggregate results across languages. We report the model's F1 score for predicting morphological analyses of words in the few-shot training condition (past and future) and the standard training condition (other word forms). Here, learned data augmentation with both one- and two-prototype models consistently matches or outperforms GECA. The improvement is sometimes dramatic: for few-shot prediction in Swahili, recomb-1 augmentation reduces the error rate by 40% relative to the baseline and 21% relative to GECA. An additional baseline + resampling experiment upweights the existing rare samples rather than synthesizing new ones; results demonstrate that recombination, and not simply reweighting, is important for generalization. Table 2 also includes a finer-grained analysis of novel word forms: words in the evaluation set whose exact morphological analysis never appeared in the training set. R&R again significantly outperforms both the baseline and GECA-based data augmentation in the few-shot FUT+PAST condition and the ordinary OTHER condition, underscoring the effectiveness of this approach for "in-distribution" compositional generalization. Finally, the gains provided by learned augmentation and GECA appear to be at least partially orthogonal: combining the GECA + resampling and recomb-1 + resampling models gives further improvements in Spanish and Turkish.

5.3. ANALYSIS

Why is R&R effective? Samples from the best learned data augmentation models for SCAN and SIGMORPHON may be found in Appendix G.3. We programmatically analyzed 400 samples from recomb-2 models in SCAN and found that 40% of novel samples are exactly correct in the around right split and 74% in the jump split. A manual analysis of 50 Turkish samples indicated that only 14% of the novel samples were exactly correct. The augmentation procedure has a high error rate! However, our analysis found that malformed samples either (1) feature malformed xs that will never appear in a test set (a phenomenon also observed by Andreas (2020) for outputs of GECA), or (2) are mostly correct at the token level (inducing predictions with a high F1 score). Data augmentation thus contributes a mixture of irrelevant examples, label noise (which may exert a positive regularizing effect; Bishop, 1995), and well-formed examples, a small number of which are sufficient to induce generalization (Bastings et al., 2018). Without resampling, SIGMORPHON models generate almost no examples of rare tags.

Why does R&R outperform direct inference? A partial explanation is provided by the preceding analysis, which notes that the accuracy of the data augmentation procedure as a generative model is comparatively low. Additionally, the data augmentation procedure selects only the highest-confidence samples from the model, so the quality of predicted ys conditioned on random xs will in general be even lower. A conditional model trained on augmented data is able to compensate for errors in augmentation or direct inference (Table 12 in the Appendix).

Why is resampling without recombination effective? One surprising feature of Table 2 is the performance of the learned aug (basic) + resampling model. While less effective than the recombination-based models, augmentation with samples from an ordinary RNN trained on (x, y) pairs improves performance for some test splits. One possible explanation is that resampling effectively acts as a posterior constraint on the final model's predictive distribution, guiding it toward solutions in which rare tags are more probable than observed in the original training data. Future work might model this constraint explicitly, e.g. via posterior regularization (as in Li & Rush, 2020).

6. CONCLUSIONS

We have described a method for improving compositional generalization in sequence-to-sequence models via data augmentation with learned prototype recombination models. These are the first results we are aware of demonstrating that generative models of data are effective as data augmentation schemes in sequence-to-sequence learning problems, even when the generative models are themselves unreliable as base predictors. Our experiments demonstrate that it is possible to achieve compositional generalization on par with complex symbolic models in clean, highly structured domains, and outperform them in natural ones, with basic neural modeling tools and without symbolic representations.

[Appendix excerpt: neighborhood scoring, sampling, and training details]

Given d, we sort training examples using score_1 as the comparison key and pick the four smallest neighbors (using a lexicographic sort) to form N(d). For the recomb-2 model, N(d) uses the same score function for the first prototype as in the recomb-1 case. The second prototype is selected using:

    score_2(d, d_1, d_2) = (1[d_tags = d_2,tags], |d_1,tags Δ d_2,tags|)

Given x and a scored first prototype, we do one more sort over training examples using score_2 as the comparison key, then pick the first four neighbors for N(d).

Sampling We use a mixed strategy of temperature sampling with T = 0.5 and greedy sampling, using the former for the input side of d and the latter for the output side. We sample 180 unique and novel examples. Every conditional model's size is the same as that of the corresponding generative model used for augmentation; this ensures that the conditional model and the generative model have the same capacity. We train conditional models for 150 epochs for SCAN, and we use augmentation ratios of p_aug = 0.01 and p_aug = 0.2 in jump and around right, respectively. For morphology, we train the conditional models for 100 epochs, and we use all generated examples for augmentation.

E DIRECT INFERENCE

To adapt the prototype-based model for conditional prediction, we condition the neighborhood function on the input x rather than the full datum d, as in Hashimoto et al. (2018). Candidate outputs y are then sampled from the generative model given the observed x while marginalizing over retrieved prototypes. Finally, we re-rank these candidates via Eq. 7 and output the highest-scoring candidate.
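A minimal sketch of this sample-and-re-rank procedure (the helpers `sample_candidates` and `score`, standing in for the generative sampler and the Eq. 7 scorer, are hypothetical):

```python
def direct_inference(x, sample_candidates, score, n=16):
    # Draw n candidate outputs y for the observed input x (the sampler is
    # assumed to marginalize over retrieved prototypes internally), then
    # return the candidate with the highest re-ranking score.
    candidates = sample_candidates(x, n)
    return max(candidates, key=lambda y: score(x, y))
```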

F VAE MODEL

Prior p(z): We use the same prior as Guu et al. (2018) given in Eq. 31. In this prior, z is defined by a norm and direction vector. The norm is sampled from the uniform distribution between zero and a maximum possible norm µ max = 10.0, and the direction is sampled uniformly from the unit hypersphere. This sampling procedure corresponds to a von Mises-Fisher distribution with concentration parameter zero.
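Sampling from this prior is straightforward: draw a direction by normalizing a standard Gaussian vector (which is uniform on the unit hypersphere, i.e. vMF with concentration zero) and scale it by a uniform norm. A minimal pure-Python sketch:

```python
import math
import random

def sample_prior(dim, mu_max=10.0, rng=random):
    # Direction: a normalized standard Gaussian vector is uniformly
    # distributed on the unit hypersphere.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    length = math.sqrt(sum(x * x for x in v))
    direction = [x / length for x in v]
    # Norm: uniform between zero and the maximum norm mu_max.
    r = rng.uniform(0.0, mu_max)
    return [r * x for x in direction]
```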

G.4 ATTENTION HEATMAP

Here we provide a visualization of the copy and attention mechanisms in the recomb-2 model for the SCAN experiments.



Code for all experiments in this paper is available at https://github.com/ekinakyurek/compgen. We implemented our experiments in Knet (Yuret, 2016) using Julia (Bezanson et al., 2017).

Past work also includes a continuous latent variable z, defining:

p_rewrite(d | d′) = E_{z∼p(z)} [p_rewrite(d | d′, z; θ)]  (8)

As discussed in Section 5, the use of a continuous latent variable appears to make no difference in prediction performance for the tasks in this paper. The remainder of our presentation focuses on the simpler model in Eq. 7.

We provide results from GECA for comparison. Our final RNN predictor is more accurate than the one used by Andreas (2020), and training it on the same augmented dataset gives higher accuracies than reported in the original paper.

When training the 2-proto and 1-proto models, we increment the epoch counter when the entire neighborhood for every d has been processed. For 0-proto, one epoch is defined canonically, i.e. one pass over the entire training set.



Figure 1: We first train a generative model to reconstruct training pairs (x, y) by constructing them from other training pairs (a). We then perform data augmentation by sampling from this model, preferentially generating samples in which y contains rare tokens or substructures (b). Dashed boxes show prediction targets. Conditional models trained on the augmented dataset accurately predict outputs y from new inputs x requiring compositional generalization (c).

Figure 2: (a) RNN encoders produce contextual embeddings for prototype tokens. (b) In the decoder, a gated copy mechanism reuses prototypes and generated output tokens via an attention mechanism (dashed lines).

Figure 3: Generation of a sample. We plot normalized output scores on the left and attention weights over the different prototypes on the right. The prototypes appear on the y-axes. The model is the recomb-2 model trained on the SCAN jump split.


Let d_1 and d_2 be prototypes. As a notational convenience, given two sequences d_1 and d_2, let d_1\d_2 denote the set of tokens in d_1 but not in d_2, and let d_1Δd_2 denote the set of tokens not common to d_1 and d_2 (their symmetric difference).
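These token-set operations correspond directly to Python's set difference and symmetric difference; the score function below is a hypothetical illustration of how |d_1 Δ d_2| can enter a lexicographic comparison key, not the paper's exact score function:

```python
d1 = {"jump", "twice", "and", "walk"}
d2 = {"walk", "twice", "and", "run"}

only_in_d1 = d1 - d2  # d1 \ d2: tokens in d1 but not d2
not_shared = d1 ^ d2  # d1 Δ d2: tokens not common to d1 and d2

# Hypothetical neighbor score: prefer candidates with matching tag sets,
# breaking ties by symmetric-difference size (compared lexicographically).
def score(d_tags, cand_tags, d_tokens, cand_tokens):
    return (d_tags != cand_tags, len(d_tokens ^ cand_tokens))
```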

Results on the SCAN dataset. (a) Comparison of R&R with previous work. Connecting lines indicate that model components are inherited from the parent (e.g. the row labeled recomb-2 also includes resampling).


F1 score for morphological analysis on rare (FUT+PST) and frequent (OTHER) word forms. R&R variants with 1- and 2-prototype recombination (shaded in grey) consistently match or outperform both a no-augmentation baseline and GECA; recomb-1 + resampling is best overall. Bold numbers are not significantly different from the best result in each column under a paired t-test (p < 0.05 after Bonferroni correction; nothing is bold if all differences are insignificant). The NOVEL portion of the table shows model accuracy on examples whose exact tag set never appeared in the training data. (There were no such words in the test set for Spanish OTHER.) Differences between GECA and the best R&R variant (recomb-1 + resampling) are larger than in the full evaluation set. *The Spanish past tense was used as a development set.

Morphology: All of the hyperparameters mentioned here were optimized by a grid search on the Spanish validation set. We train our models for 25 epochs. We use the Adam optimizer with learning rate 0.0001. The generative model is trained in the morphological reinflection order (d_lemma, d_tags, d_inflection) from left to right; samples from the model are then reordered for the morphological analysis task (d_inflection, d_lemma, d_tags).

SCAN: We use a different number of epochs for the jump and around right splits: all models are trained for 8 epochs on the former and 3 epochs on the latter. We use the Adam optimizer with learning rate 0.002, and clip gradient norms at 1.0.
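The reordering between the two tasks amounts to a permutation of the three segments of each generated sample. A minimal sketch (the triple representation here is a simplification of the actual token sequences):

```python
def to_analysis_order(sample):
    # Generated samples follow the reinflection order
    # (lemma, tags, inflection); the analysis task expects
    # (inflection, lemma, tags).
    lemma, tags, inflection = sample
    return (inflection, lemma, tags)
```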

ACKNOWLEDGMENTS

We thank Eric Chu for feedback on early drafts of this paper. This work was supported by a hardware donation from NVIDIA under the NVAIL grant program. The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center (Reuther et al., 2018) for providing HPC resources that have contributed to the research results reported within this paper.

A MODEL ARCHITECTURE

A.1 PROTOTYPE ENCODER

We use a single-layer BiLSTM network to encode h^(kj)_proto as follows:

Morphology: The hidden and embedding sizes are 1024. No dropout is applied. We project the bidirectional embeddings to the hidden size with a linear projection, concatenating the backward and forward hidden states.

SCAN: We choose a hidden size of 512 and an embedding size of 64. We apply dropout of 0.5 to the input. We project the hidden vectors in the attention mechanism.

A.2 DECODER

The decoder is implemented by a single-layer LSTM. In addition to the hidden state and memory cell, we also carry a feed vector through time: the input to the LSTM decoder at time step t is the concatenation of the previous token's representation, the previous feed vector, and a latent vector z (in the VAE model).

Morphology: We use a single-layer LSTM network with a hidden size of 1024 and an embedding size of 1024. We initialize the decoder hidden states with the final hidden states of the BiLSTM encoder. The feed vector is the same size as the hidden state. No dropout is applied in the decoder. Output calculations are given in Eq. 16. The query vector for the attention is identically the hidden state. Further details of the attention are provided in Appendix A.3.

SCAN: The decoder is a single-layer LSTM network with a hidden size of 512 and an embedding size of 64. The embedding parameters are shared with the encoder. Here the size of the feed vector is equal to the embedding size, 64. We use no self-attention over the feed vector in this decoder. There is an attention projection with dimension 128; the details of the attention mechanism are given in Appendix A.3. Finally, we use the transpose of the embedding matrix to project feed into the output space. output_t contains unnormalized scores before the final softmax layer. We apply dropout of 0.7 to h^t_out during both training and test. The copy mechanism is further described in Appendix A.3. The input to the LSTM decoder is the same as in Eq. 25, except that the decoder embedding matrix W_d shares parameters with the encoder embedding matrix W_e. We apply dropout of 0.5 to the embeddings d_{t-1}. The query vector for the attention is calculated by:
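The decoder input at each step is a simple concatenation; the sketch below uses plain lists in place of the embedding, feed, and latent vectors:

```python
def decoder_input(prev_token_embedding, prev_feed, z=None):
    # Concatenate the previous token's embedding, the previous feed
    # vector, and (in the VAE variant) the latent vector z.
    parts = prev_token_embedding + prev_feed
    if z is not None:
        parts = parts + z
    return parts
```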

A.3 ATTENTION AND COPYING

We use the attention mechanism described in Vaswani et al. (2017) with slight modifications.

Morphology: We use a linear transformation for the key, retaining an embedding size of 1024, and leave the query and value transformations as the identity. We do not normalize by the square root of the attention dimension. The query vector is described in Appendix A.2. The copy mechanism for the morphology task is explained in detail in the paper.

SCAN: We use a nonlinear tanh transformation for the key, query, and value. The attention scores are calculated separately for each prototype using different parameters, and the normalization (i.e. obtaining the α's) is likewise performed separately for each prototype. The copy mechanism for this task is slightly different and follows Gu et al. (2016): we normalize prototype attention scores and output scores jointly. Let ᾱ_i represent the attention weights for each prototype sequence before normalization. We concatenate them to the output vector in Eq. 27 and obtain a probability vector via a final softmax layer. The size of this probability vector is the vocabulary size plus the total length of all prototypes. We then project this into the output space via an indices function that finds all corresponding scores in prob_t for a token w; there may be more than one element for a given w, since one score can come from the output_t region and others from the prototype regions of prob_t. During training we apply dropout of 0.5 to the indices from output_t; thus the model is encouraged to copy more.
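The joint normalization and per-token aggregation can be sketched as follows. This is a simplified, single-prototype pure-Python illustration of the mechanism; the real model computes these scores with learned parameters:

```python
import math

def copy_softmax(output_scores, proto_scores, proto_tokens, vocab):
    # One softmax jointly over vocabulary scores and per-position
    # prototype attention scores.
    logits = output_scores + proto_scores
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # A token's probability sums its vocabulary entry with every
    # prototype position holding that token.
    out = {w: probs[i] for i, w in enumerate(vocab)}
    for j, w in enumerate(proto_tokens):
        out[w] = out.get(w, 0.0) + probs[len(vocab) + j]
    return out
```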

B NEIGHBORHOODS AND SAMPLING

In Eq. 20 and Eq. 21 we expressed the generic form of the neighborhood sets; here we provide the implementation details.

SCAN: In the jump split, we use long-short recombination with δ = 0.5. In around right, we use long-long recombination with δ = 0.5 and construct Ω so that the first and second prototypes differ by a single token. We randomly pick k < 10 × 3 prototype pairs (10 different first prototypes, and 3 different second prototypes for each of them) that satisfy these conditions. For the recomb-1 experiment, we use the same neighborhood setup but consider only the k < 10 first prototypes.

Sampling: In the jump split, we use beam search with beam size 4 in the decoder. We calculate the mean and standard deviation of the lengths of the first (d_1) and second (d_2) prototypes in the training set; during sampling, we require the first and second prototypes to be shorter than their respective means plus one standard deviation. This decision is based on the fact that the part of Ω the model is exposed to is determined by the empirical distribution arising from the training neighborhoods; when sampling, we try to pick prototypes whose properties are close to those of that empirical distribution. In around right, we use temperature sampling with T = 0.4. If a model cannot sample the expected number of novel and unique examples within a reasonable time, we increase the temperature T.

Morphology: We use long-long recombination, as explained in the paper, with slight modifications that leverage the structure of the task. We set Ω as:

For the recomb-1 model, N(d) uses tag similarity and lemma similarity, and is constructed using a score function:

For SCAN, the size of z is 32, and for morphology the size of z is 2.

Proposal network q(z | d, d_1:n): As with the prior, the posterior network decomposes z into its norm and direction vectors.
The norm vector is sampled from a uniform distribution on (|µ|, min(|µ| + ε, µ_max)), and the direction is sampled from the von Mises-Fisher distribution vMF(µ, κ), where κ = 25 and ε = 1.0.

G ADDITIONAL RESULTS

G.1 MORPHOLOGY RESULTS

In the paper, 

G.2 SIGNIFICANCE TESTS

Tables 9, 10 and 11 show the p-values for pairwise differences between the baseline and prototype-based models.

G.3 GENERATED SAMPLES

All samples are randomly selected unless otherwise indicated.

G.3.1 SCAN

In Table 12, we present three test samples from the SCAN task along with the predictions of direct inference and of the conditional model trained on the data augmented with recomb-2. Note that the augmentation procedure was able to create novel samples whose input x happens to appear in the test set (Examples 1 and 3), while y may or may not be correct (Example 1). Below is a set of samples from the learned aug. (basic) model for the SCAN dataset's jump and around right splits, in order:

