GROUNDING LANGUAGE TO ENTITIES FOR GENERALIZATION IN REINFORCEMENT LEARNING

Abstract

In this paper, we consider the problem of leveraging textual descriptions to improve generalization of control policies to new scenarios. Unlike prior work in this space, we do not assume access to any form of prior knowledge connecting text and state observations, and learn both symbol grounding and control policy simultaneously. This is challenging due to a lack of concrete supervision, and incorrect groundings can result in worse performance than policies that do not use the text at all. We develop a new model, EMMA (Entity Mapper with Multi-modal Attention), which uses a multi-modal entity-conditioned attention module that allows for selective focus over relevant sentences in the manual for each entity in the environment. EMMA is end-to-end differentiable and can learn a latent grounding of entities and dynamics from text to observations using environment rewards as the only source of supervision. To empirically test our model, we design a new framework of 1320 games and collect text manuals with free-form natural language via crowd-sourcing. We demonstrate that EMMA achieves successful zero-shot generalization to unseen games with new dynamics, obtaining significantly higher rewards compared to multiple baselines. The grounding acquired by EMMA is also robust to noisy descriptions and linguistic variation.

1. INTRODUCTION

Interactive game environments are useful for developing agents that learn grounded representations of language for autonomous decision making (Golland et al., 2010; Andreas & Klein, 2015; Bahdanau et al., 2018). The key objective in these learning setups is for the agent to utilize feedback from the environment to acquire linguistic representations (e.g., word vectors) that are optimized for the task. Figure 1 provides an example of such a setting, where the meaning of the word fleeing in context is to "move away", which is captured by the movements of that particular entity (the wizard). Learning a useful grounding of concepts can also help agents navigate new environments with previously unseen entities or dynamics. Recent research has explored this approach by grounding language descriptions to the transition and reward dynamics of an environment (Narasimhan et al., 2018; Zhong et al., 2020). While these methods demonstrate successful transfer to new settings, they require manual specification of some minimal grounding before the agent can learn (e.g., a ground-truth mapping between individual entities and their textual symbols).

In this paper, we propose a model that learns an effective grounding for entities and dynamics without requiring any prior mapping between text and state observations, using only scalar reward signals from the environment. To achieve this, an agent must make two key inferences: (1) figure out which facts refer to which entities, and (2) understand what those facts mean for its decision making. To this end, we develop a new model called EMMA (Entity Mapper with Multi-modal Attention), which simultaneously learns to select relevant sentences in the manual for each entity in the game and to incorporate the corresponding text description into its control policy. This is done using a multi-modal attention mechanism that uses entity representations as queries to attend over specific tokens in the manual text.
EMMA then generates a text-conditioned representation which is processed further by a deep neural network to generate a policy. We train the entire model in a multi-task fashion using reinforcement learning to maximize task returns.
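The core of this mechanism can be illustrated with a minimal sketch of entity-conditioned attention: an entity embedding serves as the query in a scaled dot-product attention over sentence (or token) embeddings of the manual, yielding a text-conditioned representation for that entity. This is an illustrative simplification, not the paper's exact architecture; the function name, embedding dimensions, and use of pre-computed sentence embeddings are assumptions for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def entity_conditioned_attention(entity_emb, sentence_embs):
    """Attend over manual sentences using the entity embedding as query.

    entity_emb:    (d,) embedding of one in-game entity.
    sentence_embs: (num_sentences, d) embeddings of manual sentences.
    Returns the text-conditioned entity representation and the
    attention weights over sentences.
    """
    d = entity_emb.shape[-1]
    scores = sentence_embs @ entity_emb / np.sqrt(d)   # (num_sentences,)
    weights = softmax(scores)                          # sums to 1
    context = weights @ sentence_embs                  # (d,)
    return context, weights


# Toy example: a manual with 3 sentences, embedding dimension 4.
rng = np.random.default_rng(0)
sentences = rng.normal(size=(3, 4))
entity = sentences[1] + 0.1 * rng.normal(size=4)  # entity resembles sentence 1
context, weights = entity_conditioned_attention(entity, sentences)
```

In the full model, the resulting `context` vector (concatenated with the entity's own features) would feed into the policy network, and the attention weights are learned end-to-end from reward alone, since the whole computation is differentiable.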



Code and data are available at https://www.dropbox.com/s/fnprjrfekbnxxru/code_data.zip?raw=1.

