COMMUNICATION IN MULTI-AGENT REINFORCEMENT LEARNING: INTENTION SHARING

Abstract

Communication is one of the core components for learning coordinated behavior in multi-agent systems. In this paper, we propose a new communication scheme named Intention Sharing (IS) for multi-agent reinforcement learning in order to enhance coordination among agents. In the proposed IS scheme, each agent generates an imagined trajectory by modeling the environment dynamics and other agents' actions. The imagined trajectory is a simulated future trajectory of the agent, produced with learned models of the environment dynamics and of other agents, and represents the agent's future action plan. Each agent compresses this imagined trajectory into its intention message for communication by applying an attention mechanism that learns the relative importance of the components in the imagined trajectory based on the messages received from other agents. Numerical results show that the proposed IS scheme significantly outperforms other communication schemes in multi-agent reinforcement learning.

1. INTRODUCTION

Reinforcement learning (RL) has achieved remarkable success in various complex control problems such as robotics and games (Gu et al. (2017); Mnih et al. (2013); Silver et al. (2017)). Multi-agent reinforcement learning (MARL) extends RL to multi-agent systems, which model many practical real-world problems such as connected cars and smart cities (Roscia et al. (2013)). There exist several distinct problems in MARL inherent to the nature of multi-agent learning (Gupta et al. (2017); Lowe et al. (2017)). One such problem is how to learn coordinated behavior among multiple agents, and various approaches to tackling this problem have been proposed (Jaques et al. (2018); Pesce & Montana (2019); Kim et al. (2020)). One promising approach to learning coordinated behavior is learning a communication protocol among multiple agents (Foerster et al. (2016); Sukhbaatar et al. (2016); Jiang & Lu (2018); Das et al. (2019)). The line of recent research on communication for MARL adopts end-to-end training based on a differentiable communication channel (Foerster et al. (2016); Jiang & Lu (2018); Das et al. (2019)). That is, a message-generation network is defined at each agent and connected to other agents' policy or critic networks through communication channels. The message-generation network is then trained by using the gradients of other agents' policy or critic losses. Typically, the message-generation network is conditioned on the current observation or the hidden state of a recurrent network with observations as input. Thus, the trained message encodes past and current observation information so as to minimize other agents' policy or critic losses. It has been shown that, due to this capability of sharing observation information, such communication schemes outperform communication-free MARL algorithms such as independent learning, which is widely used in MARL, in partially observable environments. In this paper, we consider the following further question for communication in MARL: how to harness the benefit of communication beyond sharing partial observations. To address this question, we propose each agent's intention as the content of the message. Sharing intention through communication has long been used in natural multi-agent systems such as human society. For example, drivers use turn signals to inform other drivers of their intentions.
A car driver may slow down if a driver in the left lane turns on the right-turn signal. In this case, the signal encodes the driver's intention, which indicates the driver's future behavior, not a current or past observation such as the field of view. By sharing intentions through signals, drivers coordinate their driving with each other. In this paper, we formalize and propose a new communication scheme for MARL named Intention Sharing (IS) in order to go beyond existing observation-sharing schemes for communication in MARL. The proposed IS scheme allows each agent to share its intention with other agents in the form of an encoded imagined trajectory. That is, each agent generates an imagined trajectory by modeling the environment dynamics and other agents' actions. Then, each agent learns the relative importance of the components in the imagined trajectory, based on the messages received from other agents, by using an attention model. The output of the attention model is an encoded imagined trajectory that captures the intention of the agent and is used as the communication message. We evaluate the proposed IS scheme in several multi-agent environments requiring coordination among agents. Numerical results show that the proposed IS scheme significantly outperforms existing communication schemes for MARL, including state-of-the-art algorithms such as ATOC and TarMAC.
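As a concrete illustration of the end-to-end training through a differentiable communication channel discussed above, the following minimal sketch checks, by finite differences, that one agent's message-generation weights receive a gradient from another agent's policy loss. All names, dimensions, and the quadratic stand-in loss are illustrative assumptions, not the actual architecture used in the literature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: obs_dim = 4, msg_dim = 2 (not taken from any paper).
W_msg = rng.normal(size=(2, 4))   # agent i's message-generation weights
W_pol = rng.normal(size=(3, 6))   # agent j's policy weights (obs_j + msg -> 3 logits)

def message(obs_i, W):
    return np.tanh(W @ obs_i)                # agent i encodes its observation

def policy_logits(obs_j, msg, W):
    return W @ np.concatenate([obs_j, msg])  # agent j conditions on the message

def loss_j(W_m):
    obs_i, obs_j = np.ones(4), np.ones(4)
    logits = policy_logits(obs_j, message(obs_i, W_m), W_pol)
    return np.sum(logits ** 2)               # stand-in for agent j's policy loss

# Finite-difference gradient of agent j's loss w.r.t. agent i's message weights.
# Nonzero entries confirm the channel is differentiable end to end.
eps = 1e-5
g = np.zeros_like(W_msg)
for idx in np.ndindex(W_msg.shape):
    Wp, Wm = W_msg.copy(), W_msg.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    g[idx] = (loss_j(Wp) - loss_j(Wm)) / (2 * eps)

print(np.abs(g).max() > 0)  # True: message weights receive gradient from agent j's loss
```

In actual implementations this gradient is obtained by backpropagation through the communication channel during centralized training, rather than by finite differences.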
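The IS pipeline described above, an imagined rollout followed by attention-based compression into a message, can be sketched as follows. The linear stand-in models, dimensions, and function names are illustrative assumptions rather than the paper's actual networks.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def imagined_trajectory(obs, policy, others_model, dynamics, horizon):
    """Roll out H imagined steps using the agent's own policy, a learned model
    of other agents' actions, and a learned dynamics model."""
    steps = []
    for _ in range(horizon):
        a_self = policy(obs)                   # agent's own planned action
        a_others = others_model(obs)           # predicted actions of other agents
        steps.append(np.concatenate([obs, a_self]))
        obs = dynamics(obs, a_self, a_others)  # predicted next observation
    return steps

def encode_intention(steps, received_msg):
    """Compress the imagined trajectory into a fixed-size intention message:
    attention scores conditioned on the received message weigh the relative
    importance of each imagined step; the weighted sum is the outgoing message."""
    K = np.stack(steps)                              # (H, d) keys/values
    scores = K @ received_msg / np.sqrt(K.shape[1])  # dot-product attention
    weights = softmax(scores)
    return weights @ K                               # (d,) intention message

# Toy linear stand-ins for the learned models (illustrative only).
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4)) * 0.1
policy = lambda o: 0.5 * o[:2]
others = lambda o: -0.5 * o[2:]
dynamics = lambda o, a, b: A @ o + 0.1 * np.concatenate([a, b])

steps = imagined_trajectory(np.ones(4), policy, others, dynamics, horizon=3)
msg = encode_intention(steps, received_msg=np.ones(6))
print(msg.shape)  # (6,): a fixed-size message regardless of the rollout horizon
```

Note that the attention produces a fixed-size message however long the imagined rollout is, which is what makes the encoded trajectory usable as a communication message.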

2. RELATED WORKS

Under the asymmetry in learning resources between the training and execution phases, the framework of centralized training and decentralized execution (CTDE), which assumes the availability of all system information in the training phase and a distributed policy in the execution phase, has been adopted in most recent MARL research (Lowe et al. (2017); Foerster et al. (2018); Iqbal & Sha (2018); Kim et al. (2020)). Under the CTDE framework, learning a communication protocol has been considered to enhance performance in the decentralized execution phase for various multi-agent tasks (Foerster et al. (2016); Jiang & Lu (2018); Das et al. (2019)). For this purpose, Foerster et al. (2016) proposed Differentiable Inter-Agent Learning (DIAL). DIAL trains a message-generation network by connecting it to other agents' Q-networks and allowing gradient flow through the communication channels in the training phase. Then, in the execution phase, the messages are generated and passed to other agents through the communication channels. Jiang & Lu (2018) proposed an attentional communication model named ATOC to learn when to communicate and how to combine information received from other agents based on an attention mechanism. Das et al. (2019) proposed Targeted Multi-Agent Communication (TarMAC), which learns the message-generation network to produce different messages for different agents based on a signature-based attention model. The message-generation networks in the aforementioned algorithms are conditioned on the current observation or the hidden state of an LSTM. In partially observable environments, such messages, which encode past and current observations, are useful but do not capture any future information. In our approach, we use not only current but also future information to generate messages, and the weight between the current and future information is adaptively learned according to the environment. This yields further performance enhancement, as we will see in Section 5. In our approach, the encoded imagined trajectory capturing the intention of each agent is used as the communication message in MARL. Imagined trajectories have been used in other problems as well. Racanière et al. (2017) used imagined trajectories, augmented into the policy and critic, to combine model-based and model-free approaches in single-agent RL. They showed that even an arbitrary imagined trajectory (rolled out using a random policy or the agent's own policy) is useful for single-agent RL in terms of performance and data efficiency. Strouse et al. (2018) introduced an information regularizer to share or hide an agent's intention in a multi-goal MARL setting in which some agents know the goal and other agents do not.
By maximizing (or minimizing) the mutual information between the goal and the action, an agent that knows the goal learns to share (or hide) its intention with agents that do not know the goal in cooperative (or competitive) tasks. They showed that sharing intention is effective in the cooperative case. Besides our approach, Theory of Mind (ToM) and Opponent Modeling (OM) also use the notion of intention. Rabinowitz et al. (2018) proposed the Theory of Mind network (ToMnet) to predict other agents' behaviors by using meta-learning. Raileanu et al. (2018) proposed Self Other-Modeling (SOM) to infer other agents' goals in an online manner. Both ToM and OM take advantage of predicting other agents' behaviors, capturing their intention. One difference between our approach and





