GROUNDED COMPOSITIONAL GENERALIZATION WITH ENVIRONMENT INTERACTIONS

Abstract

In this paper, we present a compositional generalization approach for grounded agent instruction learning. Compositional generalization is an important part of human intelligence, but current neural network models lack this ability, and the problem is even harder in multi-modal settings that require grounding. Our approach rests on two main ideas. First, we use interactions between the agent and the environment to identify components in the output. Second, we apply entropy regularization to learn the corresponding input components for each output component. The results show that the proposed approach significantly outperforms baselines on most tasks, with an absolute average accuracy increase of more than 25%. We also investigate the impact of entropy regularization and other design choices with an ablation study. We hope this work is a first step toward addressing grounded compositional generalization and will be helpful in advancing artificial intelligence research. The source code is included in the supplementary material.

1. INTRODUCTION

Compositional generalization is a key skill for flexible and efficient learning. Humans leverage compositionality to create and recognize new combinations of familiar concepts (Chomsky, 1957; Minsky, 1986). Although machine learning and deep learning have recently made progress in many areas (LeCun et al., 2015), mainstream learning algorithms are unable to perform compositional generalization and require many samples to train models. Such efficient learning is even more important when machines interact with the environment for grounding, because interactions are usually slow. Machine learning has mostly been developed under the assumption that training and test distributions are identical. Compositional generalization, however, is a kind of out-of-distribution generalization (Bengio, 2017), where training and test distributions differ. The training data does not contain information about this difference, so it can only be given as prior knowledge. In compositional generalization, a sample is a combination of several components, and the test distribution changes because test samples are new combinations of components seen in training. For example, if we can find a "large apple" and a "small orange" in some environments, then we can also find a "large orange" among multiple objects in a new environment. Recombination is possible when an output component depends only on its corresponding input components and is invariant to the other components (please see Section 4.1 for more details). There are therefore two aspects to consider: what the components in the output are, and how to find the corresponding input signals. We propose to use interactions between the agent and the environment to define output components. This is analogous to model-free reinforcement learning (Sutton & Barto, 2018), where an agent does not have a model of the environment but learns to act at each step while interacting with it.
We then use entropy regularization (Li et al., 2019; Li & Eisner, 2019) to learn the minimal input components for the outputs. We evaluate the approach on the gSCAN dataset (Ruis et al., 2020), which is designed to study compositional generalization in grounded agent instruction learning; please see Figure 1 for examples. The results show that the proposed approach significantly outperforms baselines on most tasks, with an absolute average accuracy increase of more than 25%, and the high accuracy indicates that it addresses the grounded compositional generalization problems these tasks were designed to probe. We also examine the impact of entropy regularization and other design choices with an ablation study. We hope this work will be helpful in advancing grounded compositional generalization and artificial intelligence research. The contributions of this paper can be summarized as follows.

• To our knowledge, this is the first work to enable accurate compositional generalization in grounded instruction learning, and it serves as a basis for analyzing and understanding the underlying mechanism.

• The novelty of this paper is the finding that combining environment interaction with entropy regularization enables this generalization.
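To make the entropy-regularization idea concrete, the following is a minimal sketch, not the paper's actual implementation: it assumes an attention-based model in which each output step attends over input components, and it penalizes high-entropy (diffuse) attention so that each output component comes to depend on a minimal set of inputs. The coefficient `lam` is a hypothetical hyperparameter.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a probability distribution `p`."""
    return -np.sum(p * np.log(p + eps))

def regularized_loss(task_loss, attention_dists, lam=0.1):
    """Add an entropy penalty over per-step attention distributions.

    `task_loss` is the usual prediction loss; `attention_dists` is a
    list of attention weight vectors (one per output step). Minimizing
    the sum pushes each distribution toward a sparse (low-entropy)
    selection of input components. `lam` is a hypothetical coefficient.
    """
    return task_loss + lam * sum(entropy(a) for a in attention_dists)

# Sharp attention on one input component incurs (almost) no penalty,
# while uniform attention over three components incurs entropy ln(3).
sharp = np.array([1.0, 0.0, 0.0])
diffuse = np.array([1 / 3, 1 / 3, 1 / 3])
```

Under this regularizer, `regularized_loss(1.0, [sharp])` stays close to the task loss, while `regularized_loss(1.0, [diffuse])` is penalized by roughly `lam * ln(3)`, illustrating how the objective discourages redundant dependency on the input.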

2. RELATED WORK

Compositional generalization research has a long history, and it has recently received increasing attention. The SCAN dataset (Lake, 2019) was proposed to study compositional generalization in instruction learning. It maps a command sentence to a sequence of actions, and it has the property that input words and output actions correspond directly. Though some NLP tasks, such as machine translation, share this property, not all problems fit this setting. Moreover, the dataset does not contain an environment in which an agent takes actions. SCAN inspired multiple approaches (Russin et al., 2019; Andreas, 2019; Li et al., 2019; Lake, 2019; Gordon et al., 2020; Liu et al., 2020), some of which led to general techniques for compositional generalization. For example, entropy regularization (Li et al., 2019) was proposed to avoid redundant dependency on the input, and it is a core idea of the approach in this paper. Compositional generalization has applications in various fields such as question answering (Andreas et al., 2016; Hudson & Manning, 2019; Keysers et al., 2020), counting (Rodriguez & Wiles, 1998; Weiss et al., 2018), systematic behaviour (Wong & Wang, 2007; Brakel & Frank, 2009), and hierarchical structure (Linzen et al., 2016). Another related line of work is independent disentangled representation (Higgins et al., 2017; Locatello et al., 2019), but it does not address compositional generalization. Compositionality is also helpful for reasoning (Talmor et al., 2020) and continual learning (Jin et al., 2020; Li et al., 2020). The grounded SCAN (gSCAN) dataset was proposed to introduce an environment and grounding into agent instruction learning with compositional generalization (Ruis et al., 2020). It takes a command sentence as input and produces a sequence of actions as output. However, the input command does not specify how to act; the agent needs to understand the environment and take the corresponding actions.
This also avoids a direct mapping between input words and output actions. Different approaches have been proposed to address this problem. Since compositional generalization requires prior knowledge about the distribution change, these approaches correspond to different ways of providing that prior knowledge. Andreas (2019) uses linguistic knowledge to augment training data.



(a) Walk to a small cylinder. (b) Walk to a large blue circle.

Figure 1: Examples from the gSCAN dataset. The dataset evaluates compositional generalization ability in grounded instruction learning. The agent needs to understand the command and the environment to take a sequence of actions. Training and test data have different distributions, and the tasks require different kinds of compositional generalization. Please refer to Section 5 for more details.

