META-LEARNING FROM DEMONSTRATIONS IMPROVES COMPOSITIONAL GENERALIZATION

Abstract

We study the problem of compositional generalization of language-instructed agents in gSCAN. gSCAN is a popular benchmark which requires an agent to generalize to instructions containing novel combinations of words that are not seen in the training data. We propose to improve the agent's generalization capabilities with an architecture inspired by the Meta-Sequence-to-Sequence learning approach (Lake, 2019). The agent receives as context a few examples of paired instructions and action trajectories in a given instance of the environment (a support set) and is tasked with predicting an action sequence for a query instruction in the same environment instance. The context is generated by an oracle, and the instructions come from the same distribution as the training data. In each training episode we also shuffle the indices of the actions and the words of the instructions, forcing the agent to infer the relations between actions and words from the context. Our predictive model uses the standard Transformer architecture. We show that the proposed architecture significantly improves the generalization capabilities of the agent on one of the most difficult gSCAN splits: the "adverb-to-verb" Split H.

1. INTRODUCTION

We want autonomous agents to have the same compositional understanding of language that humans do (Chomsky, 1957; Tenenbaum, 2018). Without this understanding, the sample complexity required to train them on a wide range of compositions of instructions would be very high (Sodhani et al., 2021; Jang et al., 2021). Naturally, such compositional generalization has received interest from both the language and reinforcement learning communities. "Compositional generalization" can be divided into several sub-skills, for example reasoning about object properties compositionally (Chaplot et al., 2018; Qiu et al., 2021), composing sub-instructions into a sequence (Logeswaran et al., 2022; Min et al., 2021), or generating novel action sequences according to novel instructions made up of familiar components (Lake & Baroni, 2018). In this work, we examine the latter challenge in more detail using the gSCAN environment (Ruis et al., 2020). gSCAN is a testing environment for language-grounded agents, consisting of a 6-by-6 grid world where each episode has a unique combination of objects, an initial agent position (in this work, the state) and a language instruction. Each object is a circle, cylinder or triangle, can be one of five sizes, and comes in the colors red, blue, purple and yellow. Each cell of the state is encoded as a bag of words, similar to BabyAI (Chevalier-Boisvert et al., 2019); the components are the object size, color, shape, and possible agent occupancy and direction. The available instruction action words are push, pull and walk to, and the available adverbs are while spinning, while zigzagging, hesitantly and cautiously. The actions the agent must produce come from the set of WALK, STAY, LTURN, RTURN, PUSH and PULL. When we refer to an action repeated more than once, we place the repeat count in parentheses, as in LTURN(4).
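The per-cell bag-of-words state encoding described above can be sketched as a multi-hot vector over the component vocabulary. This is an illustrative sketch only; the component names and vocabulary ordering below are our assumptions, not the exact gSCAN implementation.

```python
import numpy as np

# Assumed component vocabulary (sizes, colors, shapes, agent occupancy and
# direction); the real gSCAN encoder may order or name these differently.
SIZES = ["size1", "size2", "size3", "size4", "size5"]
COLORS = ["red", "blue", "purple", "yellow"]
SHAPES = ["circle", "cylinder", "triangle"]
AGENT = ["agent"]
DIRECTIONS = ["north", "east", "south", "west"]
VOCAB = SIZES + COLORS + SHAPES + AGENT + DIRECTIONS


def encode_cell(components):
    """Multi-hot encoding of one grid cell, e.g. {"red", "circle", "size2"}."""
    vec = np.zeros(len(VOCAB), dtype=np.float32)
    for c in components:
        vec[VOCAB.index(c)] = 1.0
    return vec


def encode_grid(grid):
    """grid: 6x6 nested lists of component sets -> (36, len(VOCAB)) array."""
    return np.stack([encode_cell(cell) for row in grid for cell in row])
```

A cell containing a small red circle would thus activate exactly three vocabulary entries, and an empty cell none.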
An episode is a success when the agent exactly matches the target actions. The instructions follow the template "action a size? color? object adverb", where ? indicates that a token is optional. Certain combinations of instructions, actions, objects and adverbs are not found in the training set; they instead appear in held-out "splits". Split A is an in-distribution validation split, containing instructions, target objects and target locations that also appear in the training set. Splits B to F are out-of-distribution splits where the target object has an unseen description made up of combinations of familiar terms (for example, red square in Split C) or is at a location not seen during training (for example, southwest of the agent in Split D). These test splits, except Split D, have been solved using a Transformer (Qiu et al., 2021). Split G is a "meta-learning" split, where an example of "cautiously" is seen only k times during training. Split H, also known as the "adverb-to-verb" split, contains only instructions following the template "pull a size? color? object while spinning". This requires the agent to walk towards the target object and pull it the required number of times, while performing LTURN(4) after each WALK and PULL. Solving the problem requires the agent to generate the unseen action trajectory based on a compositional understanding of the instructions. In this work, we focus mainly on Split H. We hypothesize that a promising approach is Meta Sequence-to-Sequence Learning (meta-seq2seq), proposed by Lake (2019). We believe this approach works well because of the permutations applied to the support and target instructions, which ensure that the agent does not overfit to particular sequences of symbols in the output space and is instead forced to determine, by meta-learning from the context, what the true output actions should be in a given episode.
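The effect of the "while spinning" adverb on a base trajectory, as described above, can be sketched as a simple transformation. This is a hypothetical helper for illustration, not code from the gSCAN environment.

```python
def apply_while_spinning(actions):
    """Insert LTURN(4) after each WALK and PULL in a base action sequence,
    mirroring the "while spinning" behavior described in the text."""
    out = []
    for a in actions:
        out.append(a)
        if a in ("WALK", "PULL"):
            out.extend(["LTURN"] * 4)
    return out
```

For example, a base trajectory of two WALKs and one PULL expands to fifteen actions, since each of the three steps is followed by four left turns.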
Extending this approach to language grounding environments was flagged as a possible future work direction by Ruis et al. (2020), and in this work we propose to do exactly that. Our contributions are: first, we describe an extension of meta-seq2seq with state-relevant supports; second, we report promising success-rate performance on gSCAN Split H; third, we demonstrate the importance of generating relevant supports and show how different support-generation procedures affect performance; and fourth, we motivate how this approach aligns with intuitions about human compositional problem solving.
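The per-episode symbol permutation at the heart of the meta-seq2seq idea can be sketched as follows: the same random remapping of action indices is applied to both the support set and the query targets, so the only way to predict the query's actions is to infer the mapping from the supports. The function and data layout here are our own illustrative assumptions, not the paper's implementation.

```python
import random

ACTIONS = ["WALK", "STAY", "LTURN", "RTURN", "PUSH", "PULL"]


def permute_episode(supports, query_targets, rng=random):
    """supports: list of (instruction, action_sequence) pairs.
    query_targets: action sequence for the query instruction.
    Returns the episode with actions remapped by one random permutation,
    shared between supports and query so the mapping is inferable."""
    permuted = ACTIONS[:]
    rng.shuffle(permuted)
    mapping = dict(zip(ACTIONS, permuted))
    remap = lambda seq: [mapping[a] for a in seq]
    return ([(instr, remap(acts)) for instr, acts in supports],
            remap(query_targets))
```

Because the permutation is resampled every episode, memorizing a fixed word-to-action mapping yields chance-level performance, which is what pushes the model toward in-context inference.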

2. RELATED WORK

Compositional Generalization There is a long line of work on the challenge of compositional generalization in deep learning. Initial works show that sequence models such as RNNs cannot solve these problems well (Tenenbaum, 2018; Loula et al., 2018). Datasets such as SCAN (Lake & Baroni, 2018), 0gendata (Geiger et al., 2019), COGS (Kim & Linzen, 2020), and PCFG (Hupkes et al., 2020) serve as benchmarks to measure progress. Various approaches have been proposed to improve compositional generalization performance, including data augmentation (Andreas, 2020; Shi et al., 2021; Guo et al., 2020a; Qiu et al., 2022), problem-specific inductive biases (Russin et al., 2020; Guo et al., 2020b; Yin et al., 2021; Chakravarthy et al., 2022; Spilsbury & Ilin, 2022), increased data diversity (Patel et al., 2022; Andreas, 2020), transfer learning (Zhu et al., 2021) and modular networks (Andreas et al., 2016; Ruis & Lake, 2022). These approaches can perform very well, but usually require prior assumptions about the underlying data. In computer vision and multimodal domains, the Transformer architecture has been shown to solve some compositional generalization tasks (Vaswani et al., 2017; Hudson & Zitnick, 2021; Chen et al., 2020). The Transformer's success on token-level tasks is also promising but still limited (Power et al., 2022; Bhattamishra et al., 2020; Qiu et al., 2021; 2022; Sikarwar et al., 2022). Meta-learning (Conklin et al., 2021; Yang et al., 2022; Mitchell et al., 2021) and group equivariance (Gordon et al., 2020) have also shown promise on such problems.

In-context and model-based learning We take inspiration from a long line of work on in-context meta-learning, starting with RL² (Duan et al., 2016) for RNNs and the extension to Transformers with TrMRL (Melo, 2022).
Also related are the ideas of retrieval for in-context learning (Goyal et al., 2022; Borgeaud et al., 2022) and proposing goals and planning in an imagination of the world (Nair et al., 2018; Chane-Sane et al., 2021; Hafner et al., 2020; Deac et al., 2021) .

Language grounding Many language grounding environments exist, such as BabyAI (Chevalier-Boisvert et al., 2019), ALFRED (Shridhar et al., 2020), VizDoom (Chaplot et al., 2018) and SILG (Zhong et al., 2021). gSCAN and its derivatives (Ruis et al., 2020; Wu et al., 2021) specifically focus on task compositional generalization in an interactive world. The various splits of gSCAN are still not completely solved. Various approaches have been proposed, including graph networks (Gao et al., 2020), linguistic-assisted attention (Kuo et al., 2021), symbolic reasoning (Nye et al., 2021), auxiliary tasks (Jiang & Bansal, 2021), Transformers (Qiu et al., 2021), modular networks (Heinze-Deml & Bouchacourt, 2020; Ruis & Lake, 2022) and data augmentation (Setzler et al., 2022; Ruis & Lake, 2022). Splits D, G and H remain challenging to solve with a general approach.

