GROUNDED COMPOSITIONAL GENERALIZATION WITH ENVIRONMENT INTERACTIONS

Abstract

In this paper, we present a compositional generalization approach for grounded agent instruction learning. Compositional generalization is an important aspect of human intelligence, but current neural network models lack this ability, and the problem is even harder in multi-modal settings with grounding. Our proposed approach rests on two main ideas. First, we use interactions between the agent and the environment to identify components in the output. Second, we apply entropy regularization to learn the corresponding input components for each output component. The results show the proposed approach significantly outperforms baselines on most tasks, with more than a 25% absolute increase in average accuracy. We also investigate the impact of entropy regularization and other design changes in an ablation study. We hope this work is a first step toward addressing grounded compositional generalization and that it will help advance artificial intelligence research. The source code is included in the supplementary material.

1. INTRODUCTION

Compositional generalization is a key skill for flexible and efficient learning. Humans leverage compositionality to create and recognize new combinations of familiar concepts (Chomsky, 1957; Minsky, 1986). Despite much recent progress in machine learning and deep learning across many areas (LeCun et al., 2015), mainstream learning algorithms are unable to perform compositional generalization and require many samples to train models. Such efficient learning is even more important when machines interact with an environment for grounding, because interactions are usually slow. Machine learning has mostly been developed under the assumption that training and test distributions are identical. Compositional generalization, however, is a form of out-of-distribution generalization (Bengio, 2017), where the training and test distributions differ. Since the training data contains no information about this difference, it can only be supplied as prior knowledge. In compositional generalization, a sample is a combination of several components, and the test distribution changes because test samples are new combinations of components seen during training. For example, if we can find a "large apple" and a "small orange" in some environments, then we should also be able to find a "large orange" among multiple objects in a new environment. Such recombination is enabled when each output component depends only on its corresponding input components and is invariant to the others (please see Section 4.1 for more details). There are thus two aspects to consider: what the components in the output are, and how to find the corresponding input signals. We propose to use interactions between the agent and the environment to define output components. This is analogous to model-free reinforcement learning (Sutton & Barto, 2018), where an agent does not have an environment model but learns to act at each step through interactions with the environment.
We then use entropy regularization (Li et al., 2019; Li & Eisner, 2019) to learn the minimal input components for each output. We evaluate the approach on the gSCAN dataset (Ruis et al., 2020), which is designed to study compositional generalization in grounded agent instruction learning; please see Figure 1 for examples. The results show the proposed approach significantly outperforms baselines on most tasks, with more than a 25% absolute increase in average accuracy, and the high accuracy indicates that the proposed approach addresses the grounded compositional generalization problems these tasks were designed to probe. We also examine the impact of entropy regularization and other design changes in an ablation study.
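To make the entropy-regularization idea concrete, the following is a minimal sketch, not the paper's exact objective: an entropy penalty on attention distributions encourages each output component to attend to as few input components as possible. The function names (`attention_entropy`, `regularized_loss`) and the weight `beta` are illustrative assumptions, not part of the original method.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def attention_entropy(weights):
    # Shannon entropy of an attention distribution; low entropy
    # means an output component attends to few input components.
    return -sum(w * math.log(w) for w in weights if w > 0)

def regularized_loss(task_loss, attention_logits, beta=0.1):
    # Total loss = task loss + beta * sum of attention entropies
    # (one attention distribution per output component).
    # Minimizing the entropy term pushes each output component
    # toward a minimal set of input components.
    penalty = sum(attention_entropy(softmax(l)) for l in attention_logits)
    return task_loss + beta * penalty

# A peaked attention distribution incurs a smaller penalty than a
# diffuse one, so minimization favors sparse input dependence.
peaked = [[5.0, 0.0, 0.0]]
diffuse = [[1.0, 1.0, 1.0]]
assert regularized_loss(0.0, peaked) < regularized_loss(0.0, diffuse)
```

In this sketch the penalty is added to whatever task loss the model already optimizes; the regularizer only shapes which inputs each output component relies on.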

