MULTI-AGENT SEQUENTIAL DECISION-MAKING VIA COMMUNICATION

Abstract

Communication helps agents obtain information about others so that better-coordinated behavior can be learned. Some existing work communicates predicted future trajectories to other agents, hoping to get clues about what they will do and thereby coordinate better. However, circular dependencies can occur when agents are treated synchronously, which makes it hard to coordinate decision-making. In this paper, we propose a novel communication scheme, Sequential Communication (SeqComm). SeqComm treats agents asynchronously (the upper-level agents make decisions before the lower-level ones) and has two communication phases. In the negotiation phase, agents determine the priority of decision-making by communicating hidden states of observations and comparing the value of intention, which is obtained by modeling the environment dynamics. In the launching phase, the upper-level agents take the lead in making decisions and communicate their actions to the lower-level agents. Theoretically, we prove that the policies learned by SeqComm are guaranteed to improve monotonically and converge. Empirically, we show that SeqComm outperforms existing methods in various multi-agent cooperative tasks.

1. INTRODUCTION

The partial observability and stochasticity inherent in multi-agent systems can easily impede cooperation among agents and lead to catastrophic miscoordination (Ding et al., 2020). Communication has been exploited to help agents obtain extra information during both training and execution to mitigate such problems (Foerster et al., 2016; Sukhbaatar et al., 2016; Peng et al., 2017). Specifically, agents can share their information with others via a trainable communication channel.

Centralized training with decentralized execution (CTDE) is a popular learning paradigm in cooperative multi-agent reinforcement learning (MARL). Although a centralized value function can be learned to evaluate the joint policy of agents, the decentralized policies of agents are essentially independent. Therefore, a coordination problem arises. That is, when there exist multiple optimal joint actions, agents may take sub-optimal actions by making mistaken assumptions about others' actions (Busoniu et al., 2008). Communication allows agents to obtain information about others to avoid miscoordination.

However, most existing work only focuses on communicating messages, e.g., information about agents' current observations or historical trajectories (Jiang & Lu, 2018; Singh et al., 2019; Das et al., 2019; Ding et al., 2020). It is impossible for an agent to acquire others' actions before making decisions since the game model is usually synchronous, i.e., agents make decisions and execute actions simultaneously. Recently, intention or imagination, depicted by a combination of predicted actions and observations over many future steps, has been proposed as part of messages (Kim et al., 2021; Pretorius et al., 2021). However, circular dependencies can still occur, so it may be hard to coordinate decision-making under synchronous settings.

A general approach to solving the coordination problem is to make sure that ties between equally good actions are broken in the same way by all agents.
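
The coordination problem described above can be illustrated with a minimal sketch. The two-agent payoff table, the function names, and the lexicographic tie-breaking convention are assumptions made purely for illustration; they are not part of SeqComm or of any cited method.

```python
import random

# Hypothetical 2-agent, 2-action cooperative game (an assumption for
# illustration): both (0, 0) and (1, 1) are optimal joint actions.
PAYOFF = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}


def optimal_joint_actions(payoff):
    """Return all joint actions that attain the maximum payoff, sorted."""
    best = max(payoff.values())
    return sorted(joint for joint, value in payoff.items() if value == best)


def independent_choice(agent, payoff, rng):
    """Agent picks its own component of an optimal joint action,
    breaking ties privately (independently of the other agent)."""
    return rng.choice(optimal_joint_actions(payoff))[agent]


def conventional_choice(agent, payoff):
    """All agents break ties the same way: each plays its component of
    the lexicographically smallest optimal joint action."""
    return optimal_joint_actions(payoff)[0][agent]


def miscoordination_rate(trials=10_000, seed=0):
    """Fraction of trials in which private tie-breaking yields a
    sub-optimal joint action."""
    rng = random.Random(seed)
    best = max(PAYOFF.values())
    misses = 0
    for _ in range(trials):
        joint = (independent_choice(0, PAYOFF, rng),
                 independent_choice(1, PAYOFF, rng))
        if PAYOFF[joint] < best:
            misses += 1
    return misses / trials


if __name__ == "__main__":
    print("private tie-breaking miss rate:", miscoordination_rate())
    shared = (conventional_choice(0, PAYOFF), conventional_choice(1, PAYOFF))
    print("shared convention joint action:", shared, "payoff:", PAYOFF[shared])
```

In this toy game, private tie-breaking is expected to miscoordinate in roughly half of the trials, whereas a shared convention always selects an optimal joint action.
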
One simple mechanism for doing so is to know exactly what others will do and adjust the behavior accordingly under a unique ordering of agents and actions (Busoniu et al., 2008). Inspired by this, we reconsider the cooperative game from an asynchronous perspective. In other words, each agent is assigned a priority (i.e., order) of decision-making at each step in both training and execution, so the Stackelberg equilibrium (SE) (Von Stackelberg, 2010) is naturally set up as the learning objective. Specifically, the upper-level agents make decisions before the lower-level agents. Therefore, the lower-level agents can acquire the actual actions of the upper-level agents by

