ENTITY DIVIDER WITH LANGUAGE GROUNDING IN MULTI-AGENT REINFORCEMENT LEARNING

Abstract

We investigate the use of natural language to drive the generalization of policies in multi-agent settings. Unlike single-agent settings, policy generalization here must also account for the influence of other agents. Moreover, as the number of entities in multi-agent settings grows, more agent-entity interactions are needed for language grounding, and the enormous search space can impede learning. Finally, given a simple general instruction, e.g., beating all enemies, agents must decompose it into multiple subgoals and identify the right one to focus on. Inspired by previous work, we address these issues at the entity level and propose a novel framework for language grounding in multi-agent reinforcement learning, entity divider (EnDi). EnDi enables agents to independently learn subgoal division at the entity level and to act in the environment based on the entities associated with their subgoal. The subgoal division is regularized by agent modeling to avoid subgoal conflicts and promote coordinated strategies. Empirically, EnDi demonstrates strong generalization to unseen games with new dynamics and outperforms existing methods.

1. INTRODUCTION

The generalization of reinforcement learning (RL) agents to new environments is challenging, even when those environments differ only slightly from the ones seen during training (Finn et al., 2017). Recently, language grounding has proven to be an effective way to grant RL agents generalization ability (Zhong et al., 2019; Hanjie et al., 2021). By relating the dynamics of the environment to a text manual that specifies those dynamics at the entity level, a language-based agent can adapt to new settings with unseen entities or dynamics. In addition, language-based RL provides a framework for enabling agents to reach user-specified goal states described in natural language (Küttler et al., 2020; Tellex et al., 2020; Branavan et al., 2012). Language can express abstract goals as sets of constraints on states and thereby drive generalization. However, in multi-agent settings, things are different. First, the policies of other agents also affect the dynamics of the environment, yet the text manual provides no such information. The generalization of policies should therefore also consider the influence of others. Second, as the number of entities grows in multi-agent settings, so does the number of agent-entity interactions needed for language grounding, and the enormous search space can impede learning. Third, it is often unrealistic to give each agent detailed instructions specifying exactly what to do. A single general goal instruction, e.g., beating all enemies or collecting all the treasures, is more convenient and effective. Agents must therefore learn subgoal division and cultivate coordinated strategies from one general instruction. The key to generalization in previous works (Zhong et al., 2019; Hanjie et al., 2021) is grounding language to dynamics at the entity level, which allows agents to reason over the dynamic rules of all entities in the environment.
Since the dynamics of individual entities are the basic components of the environment's dynamics, such language grounding is invariant to new distributions of dynamics or tasks, making generalization more reliable. Inspired by this, in multi-agent settings, the influence of other agents' policies should also be reflected at the entity level for better generalization. In addition, after jointly grounding the text manual and the language-based goal (goal instruction) to environment entities, each entity is associated with explicit dynamic rules and an explicit relationship to the goal state. Thus, the language-grounded entities can also be utilized to form better strategies. We present two goal-based multi-agent environments based on two previous single-agent settings, i.e., MESSENGER (Hanjie et al., 2021) and RTFM (Zhong et al., 2019), which require generalization to new dynamics (i.e., how entities behave), new entity references, and partially observable environments. Agents are given a document that specifies the environment dynamics, together with a language-based goal. Note that one goal may contain multiple subgoals. In more detail, in multi-agent MESSENGER, agents are required to bring all the messages to the targets, while in multi-agent RTFM, the general goal is to eliminate all monsters in a given team. A single agent may thus struggle or be unable to finish the task alone. In particular, after identifying relevant information in the language descriptions, agents need to decompose the general goal into subgoals and figure out the optimal subgoal division strategy. Note that we focus on interactive environments that are easily converted to symbolic representations, instead of raw visual observations, for efficiency, interpretability, and an emphasis on abstraction over perception.
In this paper, we propose a novel framework for language grounding in multi-agent reinforcement learning (MARL), entity divider (EnDi), which enables agents to independently learn subgoal division strategies at the entity level. Specifically, each EnDi agent first generates a language-based representation of the environment and decomposes the goal instruction into two subgoals: one for itself and one for others. The subgoals can be described at the entity level since the language descriptions give the explicit relationship between the goal and all entities. The EnDi agent then acts in the environment based on the entities associated with its own subgoal. To consider the influence of others, the EnDi agent has two policy heads: one interacts with the environment, and the other performs agent modeling. The two heads are jointly trained end-to-end, using reinforcement learning and supervised learning, respectively. The gradient signal of the supervised agent-modeling objective regularizes the subgoal division assigned to others. Our framework is the first attempt to address the challenges of grounding language for generalization to unseen dynamics in multi-agent settings. EnDi can be instantiated on top of many existing language grounding modules and is currently built and evaluated in the two multi-agent environments mentioned above. Empirically, we demonstrate that EnDi outperforms existing language-based methods in all tasks by a large margin. Importantly, EnDi also achieves the best generalization on unseen games, i.e., zero-shot transfer. Through ablation studies, we verify the effectiveness of each component and show that EnDi indeed obtains coordinated subgoal division strategies via agent modeling. We also argue that many language grounding problems can be addressed at the entity level.
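To make the two-head design concrete, the following is a minimal sketch of the per-agent computation: per-entity subgoal division into self/others, an action head conditioned on the self-associated entities, and an agent-modeling head conditioned on the remaining entities. All names, shapes, and the linear parameterization are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

class EnDiAgentSketch:
    """Illustrative two-head EnDi agent (hypothetical linear heads)."""

    def __init__(self, feat_dim, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        # per-entity logits over {self, others} subgoal assignment
        self.W_div = rng.normal(size=(feat_dim, 2))
        # action head, trained with reinforcement learning
        self.W_act = rng.normal(size=(feat_dim, n_actions))
        # agent-modeling head, trained with supervised learning;
        # its gradient regularizes the "others" side of the division
        self.W_model = rng.normal(size=(feat_dim, n_actions))

    def forward(self, entity_feats):
        # entity_feats: (n_entities, feat_dim), language-grounded features
        div = softmax(entity_feats @ self.W_div)   # (n_entities, 2)
        self_mask = div[:, 0:1]                    # P(entity in self subgoal)
        # act based on entities associated with the self subgoal
        self_repr = (self_mask * entity_feats).sum(0)
        action_probs = softmax(self_repr @ self.W_act)
        # predict others' behavior from the complementary entities
        other_repr = ((1 - self_mask) * entity_feats).sum(0)
        predicted_other = softmax(other_repr @ self.W_model)
        return action_probs, predicted_other
```

In this sketch, both heads share the entity-level subgoal division, so a supervised loss on `predicted_other` backpropagates through `self_mask`, which is the mechanism by which agent modeling regularizes the division.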

2. RELATED WORK

Language-grounded policy learning. Language grounding refers to learning the meaning of natural language units, e.g., utterances, phrases, or words, by leveraging non-linguistic context. In many previous works (Wang et al., 2019; Blukis et al., 2019; Janner et al., 2018; Küttler et al., 2020; Tellex et al., 2020; Branavan et al., 2012), the text conveys a goal or instruction to the agent, and the agent produces behaviors in response after language grounding. This encourages a strong connection between the given instruction and the policy. More recently, many works have explored generalization from different perspectives. Hill et al. (2020a; 2019; 2020b) investigated generalization regarding novel entity combinations, from synthetic template commands to natural instructions given by humans, and across numbers of objects. Choi et al. (2021) proposed a language-guided policy learning algorithm that enables learning new tasks quickly from language corrections. In addition, Co-Reyes et al. (2018) proposed guiding policies by language to generalize to new tasks via meta-learning. Huang et al. (2022) utilized the generalization ability of large language models to achieve zero-shot planning. However, these works may not generalize to a new distribution of dynamics or tasks, since they tie the policy to the given instruction rather than to the dynamics of the environment.

Language grounding to dynamics of environments. A different line of research has focused on utilizing manuals as auxiliary information to aid generalization. These text manuals provide

