TRANSFERABILITY OF COMPOSITIONALITY

Abstract

Compositional generalization is the algebraic capacity to understand and produce a large number of novel combinations from known components. It is a key element of human intelligence for out-of-distribution generalization. To equip neural networks with such an ability, many algorithms have been proposed to extract compositional representations from the training distribution. However, it has not been discussed whether the trained model can still extract such representations in the test distribution. In this paper, we argue that the extraction ability does not transfer naturally, because the extraction network suffers from the divergence between the training and test distributions. To address this problem, we propose to use an auxiliary reconstruction network that takes regularized hidden representations as input, and to optimize the representations during inference. The proposed approach significantly improves accuracy, with more than a 20% absolute increase over baselines in various experiments. To the best of our knowledge, this is the first work to focus on the transferability of compositionality, and it is orthogonal to existing efforts on learning compositional representations in the training distribution. We hope this work helps advance research on compositional generalization and artificial intelligence. The code is provided in the supplementary materials.

1. INTRODUCTION

Human intelligence (Minsky, 1986; Lake et al., 2017) exhibits compositional generalization, the algebraic capacity to understand and produce a large number of novel combinations from known components (Chomsky, 1957; Montague, 1970). This capacity helps humans recognize the world efficiently and be imaginative, and it is equally desirable for machine learning algorithms. Current neural network models, however, generally lack such an ability.

Compositional generalization is a type of out-of-distribution generalization (Bengio, 2017), where the training and test distributions differ. A sample in such a setting is a combination of several components, and generalization is achieved by recombining, during inference, the seen components of an unseen combination. In the image domain, an object is a combination of many parts or properties; in the language domain, a compound word is a combination of multiple words. As an example, consider two overlapping digits (Figure 1). Each digit is a component, and each appears in training; a test example presents a new combination of two digits.

The main approach to compositional generalization is to learn compositional representations (Bengio, 2013), which contain several component representations. Each component representation depends only on its underlying generative factor and does not change when other factors change. We call this the compositionality property, and we formally introduce it in Section 3. In the digit example, this means that the representation of one digit does not change when the other digit changes. Multiple approaches have been proposed to learn compositional representations in the training distribution. However, little discussion has focused on whether the model can still extract such representations in the test distribution.
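The overlapping-digit setup can be made concrete with a small sketch. The `parity_split` helper below is hypothetical (not from the paper's code); it builds the combination split described above, where every digit appears as a component during training but the parity of the digit sum separates training combinations from test combinations.

```python
from itertools import product

def parity_split(digits=range(10)):
    """Split all ordered digit pairs by the parity of their sum:
    even sums form the training combinations, odd sums the test ones."""
    pairs = list(product(digits, repeat=2))
    train = [p for p in pairs if sum(p) % 2 == 0]
    test = [p for p in pairs if sum(p) % 2 == 1]
    return train, test

train_pairs, test_pairs = parity_split()
# Each digit 0-9 still occurs as a component of some training pair,
# while every test pair is a combination never seen in training.
```

Under this split, every single digit is in-distribution while every test pair is out-of-distribution, which is exactly the setting that compositional generalization targets.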
We find that the extraction ability does not transfer naturally: the extraction network suffers from the divergence between the training and test distributions (Bengio, 2017; Pleiss et al., 2020), so each extracted representation shifts away from its counterpart in training. Our experiment on the digit example shows that accuracy drops from 89.6% in training to 49.3% in test (Table 1 in Section 5). To address this problem, we want each representation to remain consistent with its training counterpart while still reflecting the test sample. We use an auxiliary network that takes the hidden representations as input and produces the original input as output. For a test sample, we regularize each hidden representation to stay on its training manifold and optimize the representations to recover the original input. We then use the optimized representations for prediction. Experimental results show that the proposed approach achieves more than a 20% absolute increase over baselines in various experiments, and even outperforms humans on the overlapping-digit task.

Figure 1: Examples of compositional generalization with overlapping digits. Each sample is a horizontal block with three images and two digits. The middle image X is the input, and the two digits on the right, Y = (Y1, Y2), are the output. The two images on the left, X1 and X2, are hidden components: X1 is in its original form, and X2 is flipped over the top-left to bottom-right diagonal. The sum of the digits is even in training and odd in test. We hope to learn a prediction model in training and transfer it to test.

Our contributions can be summarized as follows.
• We raise and investigate the problem of transferring compositionality to the test distribution. This work is orthogonal to existing efforts on learning compositionality in the training distribution.
• We propose to address the problem with an auxiliary reconstruction network that takes regularized hidden representations as input, and to optimize the representations during inference.
• We empirically show that the transferability problem exists and that the proposed approach achieves significant improvements over baselines.
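As a schematic illustration of the proposed inference step, the toy sketch below replaces the trained auxiliary reconstruction network with a fixed linear decoder, and the training-manifold regularizer with a plain L2 penalty; all names and the setup are illustrative assumptions, not the paper's implementation. At test time, the hidden representation is optimized by gradient descent so that the auxiliary network reconstructs the input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the auxiliary reconstruction network: a fixed linear
# decoder mapping a hidden representation z to a reconstruction D @ z.
D = rng.normal(size=(16, 4))

def infer_representation(x, steps=500, lr=0.01, lam=0.1):
    """Optimize z at inference time to minimize
        0.5 * ||D @ z - x||^2 + 0.5 * lam * ||z||^2,
    i.e., reconstruct the input while keeping z regularized
    (a crude proxy for staying on the training manifold)."""
    z = np.zeros(D.shape[1])
    for _ in range(steps):
        grad = D.T @ (D @ z - x) + lam * z  # gradient of the objective
        z -= lr * grad
    return z

x = D @ rng.normal(size=4)            # an observation the decoder can explain
z = infer_representation(x)
err_before = np.linalg.norm(x)        # reconstruction error at z = 0
err_after = np.linalg.norm(D @ z - x) # error after inference-time optimization
```

The optimized z would then be fed to the prediction network; the sketch only checks that inference-time optimization actually improves reconstruction under the regularizer.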

2. RELATED WORK

Compositional generalization (Chomsky, 1957; Montague, 1970) is critical in human cognition (Minsky, 1986; Lake et al., 2017; Johnson et al., 2017; Higgins et al., 2018; Lake et al., 2019). It helps humans understand and produce a large number of novel combinations from known components. Broadly speaking, compositional generalization is a type of out-of-distribution (o.o.d.) generalization, also called domain adaptation (Redko et al., 2020) or concept drift (Gama et al., 2014). This differs from the traditional i.i.d. setting, where the training and test distributions are identical. Transferring across distributions requires prior knowledge of how the distribution changes, and compositional generalization assumes a particular form of such change, as discussed in later sections. Compositional generalization is also a desirable property for deep neural networks. Human-level compositional learning (Marcus, 2003; Lake & Baroni, 2018) has been an important open challenge (Yang et al., 2019; Keysers et al., 2020), although there is a long history of studying compositionality in neural networks. The classic view (Fodor & Pylyshyn, 1988; Marcus, 1998; Fodor & Lepore, 2002) questions the compositionality of neural networks, and many recent approaches (Johnson et al., 2017; Lake & Baroni, 2018; Loula et al., 2018; Kliegl & Xu, 2018; Li et al., 2019; Lake, 2019; Gordon et al., 2020) aim to learn compositionality in the training distribution. Another line of related work is independent disentangled representation learning (Higgins et al., 2017; Burgess et al., 2018; Kim & Mnih, 2018; Chen et al., 2018; Kumar et al., 2017; Hsieh et al., 2018; Locatello et al., 2019; 2020). Its main assumption is that the expected components are statistically independent in the training distribution.

