MANIPULATING MULTI-AGENT NAVIGATION TASK VIA EMERGENT COMMUNICATIONS

Abstract

Multi-agent cooperation struggles to efficiently sustain grounded communication toward a specific task goal. Existing approaches are limited to simple task settings and single-turn communication. This work describes a multi-agent communication scenario in which language emerges in a navigation task. The task involves two agents with unequal abilities: the tourist (agent A), who can only observe its surroundings, and the guide (agent B), who has a holistic view but does not know the tourist's initial position. They communicate in an emergent language grounded in the environment and a common task goal: helping the tourist find the target place. We release a new dataset of 3,000 scenarios involving such visual and language navigation. We also address multi-agent emergent communication by proposing a collaborative learning framework that enables the agents to generate and understand emergent language and to solve the task. The framework is trained end-to-end with reinforcement learning by maximizing the task success rate. Results show that the proposed framework achieves competitive performance in both language-understanding accuracy and task success rate. We also discuss interpretations of the emerged language.

1. INTRODUCTION

Communication is a crucial factor for multiple agents to cooperate. While most recent works focus on interactions between artificial agents and humans, some research effort has been made on communication between artificial agents. However, most of these works concentrate on single-turn communication with a unidirectional message pass, and on the evolution of natural language or specific properties of the emergent language such as compositionality and interpretability. Multi-turn conversation is more analogous to human language: in a natural conversation, language generation and understanding should be mutual rather than unidirectional. We therefore provide a framework in which two agents generate multi-turn dialogues. To demonstrate its feasibility, we provide a suitable scenario: a new task adapted from the vision-language navigation (VLN) task in the human-machine communication area, where the guide communicates with the tourist to give guidance and help it find the target location. Unlike traditional VLN tasks, in our setting the tourist (agent A) and the guide (agent B) are both machines rather than a human-machine pair. Moreover, we assume the guide does not know the tourist's initial position, so the guide must not only give guidance but also infer where the tourist is. Our contributions can be summarized as follows: 1. From the view of emergent language, we study language with multiple turns. To provide a suitable scenario for multi-turn conversations, we propose a navigation task adapted from the vision-language navigation (VLN) task. 2. Compared with methods in which agents speak natural language, ours requires no expensive annotation, making it a more practical way for agents to communicate. 3. To the best of our knowledge, we are the first to propose a VLN-like task in a two-agent cooperation scenario, and we provide a benchmark for it, giving a possible solution to this kind of task.
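The tourist-guide interaction described above can be sketched as a multi-turn loop. The following is a minimal toy version under our own assumptions (a small grid world, an integer token vocabulary, and placeholder hash-based policies standing in for the learned generation and understanding modules); it illustrates the protocol, not the paper's actual implementation:

```python
import random

GRID = 5               # assumed 5x5 grid world
VOCAB = [0, 1, 2, 3]   # tiny emergent-language vocabulary (assumption)
MSG_LEN = 3            # fixed message length (assumption)

def tourist_observe(pos):
    """Local observation only: the tourist sees its immediate surroundings,
    abstracted here as which grid borders are adjacent to its cell."""
    x, y = pos
    return (x == 0, x == GRID - 1, y == 0, y == GRID - 1)

def speak(observation):
    """Placeholder message policy: in the real framework this is a learned
    generator; here we deterministically hash the observation into tokens."""
    rng = random.Random(hash(observation))
    return [rng.choice(VOCAB) for _ in range(MSG_LEN)]

def guide_reply(message, target):
    """The guide has the holistic view (it knows `target`) but not the
    tourist's position; it answers with a guidance message."""
    rng = random.Random(hash((tuple(message), target)))
    return [rng.choice(VOCAB) for _ in range(MSG_LEN)]

def act(pos, guidance):
    """Placeholder action policy mapping guidance to a clamped grid move."""
    moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]
    dx, dy = moves[hash(tuple(guidance)) % 4]
    x, y = pos
    return (min(max(x + dx, 0), GRID - 1), min(max(y + dy, 0), GRID - 1))

# One episode: the dialogue runs for multiple turns until the tourist
# reaches the target or a step budget runs out; task success is the
# reward signal an end-to-end RL training loop would maximize.
pos, target = (0, 0), (4, 4)
for _ in range(20):
    msg = speak(tourist_observe(pos))
    guidance = guide_reply(msg, target)
    pos = act(pos, guidance)
    if pos == target:
        break
success = (pos == target)
```

With learned policies in place of the hash-based stand-ins, the binary `success` outcome is the only supervision, which is what makes the emerging protocol grounded in the task rather than in annotations.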

2. RELATED WORK

Vision-Language Navigation (VLN) Navigation tasks with vision and natural-language information have attracted much attention in the last few years. Generally, such a task involves a human, an artificial agent, and the environment. The agent communicates with the human in natural language: it may receive requests or instructions and ask for further guidance, and it can navigate and interact with the environment to gather information or complete the task requested by the human, while the human gives requests or guidance to help the agent. Generally, the goal of these VLN tasks is to help the agent find a target place known only to the human side. These tasks differ in how language is exchanged:
• Instruction following: the instructions are given only at the beginning, so multi-turn communication cannot be generated.
• Oracle guidance: during navigation, the agent can request additional natural-language guidance to get more information Chi et al. (2019). But in most cases, the oracle can only respond with simple words like "turn left" and "turn right".
• Dialogue: dialogue is given as a supervised signal de Vries et al. (2018); Thomason et al. (2019). But the fixed dialogue constrains the flexibility of the policy.
By putting the VLN task into a multi-agent setting, the problems listed above can be alleviated or solved. To our knowledge, Talk-the-Walk de Vries et al. (2018) is the only work applying the VLN task to multiple agents, and it is very similar to our task setting. However, Talk-the-Walk only addresses the localization part, in which the guide determines where the tourist is, and ignores the guidance part, in which the guide makes a route plan and gives guidance while the tourist takes corresponding actions. Talk-the-Walk models this part with a random walk and implements only a minimal baseline, leaving the multi-agent VLN task incomplete.
Emergent Language Emergent language is an unplanned language that arises from the interactions between agents. From the perspective of the agents' relationship, most works focus on fully cooperative tasks such as referential games Lazaridou et al. (2018), adapted from the Lewis signaling game. In the referential game there are two roles, speaker and listener: the speaker gives instructions so that the listener can successfully select the target picture from the candidates. Some navigation tasks have been proposed Kalinowska et al. (2022); Kajić et al. (2020); however, those works apply only to limited environments, with 2 and 4 cases respectively. Semi-cooperative tasks have also been proposed, such as the negotiation game Cao et al. (2018), in which two agents must negotiate to gain more score for themselves. There are also fully competitive cases, such as a circular sender-receiver game Noukhovitch et al. (2021) in which the sender tries to make the receiver choose a position closer to the sender's target than to the receiver's. Most of these works study a single message pass, with one agent having only a language-generation module and the other only a language-understanding module. Some works do pay attention to the multi-turn form Cao et al. (2018); Evtimova et al. (2017). But the former studies a semi-cooperative case rather than a fully cooperative one like ours, while the latter uses a bag of words, generating a 0 or 1 for each token with 1 indicating that the token appears in the sentence; as a result, later tokens do not depend on earlier ones, so the output is not a sentence in the usual sense.
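The referential game described above can be sketched as a single speaker-listener round. This is a toy version under our own assumptions (objects as attribute tuples, a trivially compositional untrained speaker, a matching-based untrained listener); in actual emergent-language work both policies are learned from the reward:

```python
import random

random.seed(0)
N_ATTRS, N_VALUES, N_CANDIDATES = 3, 4, 5   # assumed game sizes
VOCAB_SIZE, MSG_LEN = 8, 2                  # assumed message space

def sample_object():
    """An object is a tuple of discrete attribute values."""
    return tuple(random.randrange(N_VALUES) for _ in range(N_ATTRS))

def speaker(target):
    """Untrained speaker: emits one token per attribute, a trivially
    compositional code used only for illustration."""
    return [target[i] % VOCAB_SIZE for i in range(MSG_LEN)]

def listener(message, candidates):
    """Untrained listener: picks the candidate whose attributes best
    match the received tokens."""
    def score(obj):
        return sum(obj[i] == message[i] for i in range(MSG_LEN))
    return max(candidates, key=score)

# One round of the game: reward 1 if the listener picks the target.
# This scalar is the only training signal both agents would receive
# (e.g. via a score-function/REINFORCE-style gradient estimator).
candidates = [sample_object() for _ in range(N_CANDIDATES)]
target = random.choice(candidates)
msg = speaker(target)
reward = 1 if listener(msg, candidates) == target else 0
```

Because both agents see only the shared reward, any protocol that emerges is grounded in task success; the single message pass here is exactly the limitation our multi-turn setting removes.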



Anderson et al. (2017) propose a VLN dataset called Room-to-Room (R2R) based on Matterport3D Chang et al. (2017) data and provide the first benchmark for this task. Chen et al. (2018) propose the Touchdown dataset, applying the VLN task to an outdoor environment. There are also scenarios where agents need to interact with objects in the environment: Narayan-Chen et al. (2019) place the task in the Minecraft game and instruct the agent to complete building tasks. From the view of language format, these tasks fall into the three categories Gu et al. (2022) discussed above.

