META-LEARNING FROM DEMONSTRATIONS IMPROVES COMPOSITIONAL GENERALIZATION

Abstract

We study the problem of compositional generalization of language-instructed agents in gSCAN. gSCAN is a popular benchmark that requires an agent to generalize to instructions containing novel combinations of words not seen in the training data. We propose to improve the agent's generalization capabilities with an architecture inspired by the Meta-Sequence-to-Sequence learning approach (Lake, 2019). The agent receives as context a few pairs of instructions and action trajectories in a given instance of the environment (a support set) and is tasked with predicting an action sequence for a query instruction in the same environment instance. The context is generated by an oracle, and the instructions come from the same distribution as the training data. In each training episode, we also shuffle the indices of the actions and the words of the instructions, forcing the agent to infer the relations between actions and words from the context. Our predictive model has the standard Transformer architecture. We show that the proposed architecture significantly improves the generalization capabilities of the agent on one of the most difficult gSCAN splits: the "adverb-to-verb" Split H.

1. INTRODUCTION

We want autonomous agents to have the same compositional understanding of language that humans do (Chomsky, 1957; Tenenbaum, 2018). Without this understanding, the sample complexity required to train them on a wide range of compositions of instructions would be very high (Sodhani et al., 2021; Jang et al., 2021). Naturally, such compositional generalization has received interest from both the language and reinforcement learning communities. "Compositional generalization" can be divided into several sub-skills: for example, reasoning about object properties compositionally (Chaplot et al., 2018; Qiu et al., 2021), composing sub-instructions into a sequence (Logeswaran et al., 2022; Min et al., 2021), or generating novel action sequences according to novel instructions made up of familiar components (Lake & Baroni, 2018). In this work, we examine this latter challenge in more detail using the gSCAN environment (Ruis et al., 2020).

gSCAN is a testing environment for language-grounded agents, consisting of a 6-by-6 grid-world where each episode has a unique combination of objects, an initial agent position (in this work, the state), and a language instruction. Each object is a circle, cylinder, or triangle, can be one of five sizes, and comes in the colors red, blue, purple, and yellow. We encode each cell in the state as a bag of words, similar to BabyAI (Chevalier-Boisvert et al., 2019). The components are the object size, color, shape, and possible agent occupancy and direction. The available instruction action words are push, pull, and walk to, and the available instruction adverbs are while spinning, while zigzagging, hesitantly, and cautiously. The actions the agent must produce are drawn from the set WALK, STAY, LTURN, RTURN, PUSH, and PULL. In this work, when we refer to an action that is repeated more than once, we place the repeat count in parentheses, like LTURN(4).
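The per-cell bag-of-words encoding above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the component vocabulary, the `encode_cell` helper, and the multi-hot layout are our assumptions based on the description in the text.

```python
# Illustrative sketch of encoding one gSCAN grid cell as a bag of words.
# The vocabulary below (sizes, colors, shapes, agent occupancy/direction)
# follows the components named in the text; exact token names are assumed.
SIZES = ["size_1", "size_2", "size_3", "size_4", "size_5"]
COLORS = ["red", "blue", "purple", "yellow"]
SHAPES = ["circle", "cylinder", "triangle"]
AGENT = ["agent", "facing_north", "facing_east", "facing_south", "facing_west"]
VOCAB = SIZES + COLORS + SHAPES + AGENT

def encode_cell(components):
    """Return a multi-hot vector with a 1 for each component present in the cell."""
    vec = [0] * len(VOCAB)
    for c in components:
        vec[VOCAB.index(c)] = 1
    return vec

# A cell containing a size-2 red circle, with no agent present:
cell = encode_cell(["size_2", "red", "circle"])
```

A full 6-by-6 state is then a grid of such vectors, with empty cells encoded as all zeros.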
An episode counts as a success when the agent's predicted action sequence exactly matches the target actions. The instructions follow a template of "action a size? color? object adverb", where ? indicates that a token is optional. Certain combinations of instructions, actions, objects, and adverbs are not found in the training set. They are instead found in certain held-out "splits". Split A is an in-distribution validation split, containing instructions, target objects, and target locations that can be found in the training set. Splits B to F are out-of-distribution sets where the target object has 1

