MULTI-AGENT SEQUENTIAL DECISION-MAKING VIA COMMUNICATION

Abstract

Communication helps agents obtain information about others so that better coordinated behavior can be learned. Some existing work communicates predicted future trajectories to others, hoping to provide clues about what others will do for better coordination. However, when agents are treated synchronously, circular dependencies can occur, making it hard to coordinate decision-making. In this paper, we propose a novel communication scheme, Sequential Communication (SeqComm). SeqComm treats agents asynchronously (the upper-level agents make decisions before the lower-level ones) and has two communication phases. In the negotiation phase, agents determine the priority of decision-making by communicating the hidden states of their observations and comparing the values of their intentions, which are obtained by modeling the environment dynamics. In the launching phase, the upper-level agents take the lead in making decisions and communicate their actions to the lower-level agents. Theoretically, we prove that the policies learned by SeqComm are guaranteed to improve monotonically and converge. Empirically, we show that SeqComm outperforms existing methods in a variety of multi-agent cooperative tasks.

1. INTRODUCTION

The partial observability and stochasticity inherent to multi-agent systems can easily impede cooperation among agents and lead to catastrophic miscoordination (Ding et al., 2020). Communication has been exploited to help agents obtain extra information during both training and execution to mitigate such problems (Foerster et al., 2016; Sukhbaatar et al., 2016; Peng et al., 2017). Specifically, agents can share their information with others via a trainable communication channel. Centralized training with decentralized execution (CTDE) is a popular learning paradigm in cooperative multi-agent reinforcement learning (MARL). Although a centralized value function can be learned to evaluate the joint policy of the agents, their decentralized policies are essentially independent. Therefore, a coordination problem arises: agents may take sub-optimal actions by mistakenly assuming others' actions when there exist multiple optimal joint actions (Busoniu et al., 2008). Communication allows agents to obtain information about others to avoid miscoordination. However, most existing work focuses only on communicating messages, e.g., information about agents' current observations or historical trajectories (Jiang & Lu, 2018; Singh et al., 2019; Das et al., 2019; Ding et al., 2020). It is impossible for an agent to acquire others' actions before making its own decision, since the game model is usually synchronous, i.e., agents make decisions and execute actions simultaneously. Recently, intention or imagination, represented by a combination of predicted actions and observations over many future steps, has been proposed as part of the messages (Kim et al., 2021; Pretorius et al., 2021). However, circular dependencies can still occur, so it may be hard to coordinate decision-making under synchronous settings. A general approach to solving the coordination problem is to ensure that ties between equally good actions are broken consistently by all agents.
One simple mechanism for doing so is to know exactly what others will do and adjust one's behavior accordingly, under a unique ordering of agents and actions (Busoniu et al., 2008). Inspired by this, we reconsider the cooperative game from an asynchronous perspective. In other words, each agent is assigned a priority (i.e., order) of decision-making at each step, in both training and execution; thus the Stackelberg equilibrium (SE) (Von Stackelberg, 2010) naturally arises as the learning objective. Specifically, the upper-level agents make decisions before the lower-level agents. Therefore, the lower-level agents can acquire the actual actions of the upper-level agents via communication and make their decisions conditioned on what the upper-level agents will do. Under this setting, the SE is likely to be Pareto superior to the average Nash equilibrium (NE) in games that require a high level of cooperation (Zhang et al., 2020). However, is it necessary to decide a specific priority of decision-making for each agent? Ideally, the optimal joint policy can be decomposed in any order (Wen et al., 2019), e.g., π*(a^1, a^2 | s) = π*(a^1 | s) π*(a^2 | s, a^1) = π*(a^2 | s) π*(a^1 | s, a^2). But during the learning process, it is unlikely that agents can use the optimal actions of other agents for gradient calculation, leaving them vulnerable to the relative overgeneralization problem (Wei et al., 2018). Overall, there is no guarantee that the above equation holds during learning, so the ordering must be chosen carefully. In this paper, we propose a novel model-based multi-round communication scheme for cooperative MARL, Sequential Communication (SeqComm), which enables agents to explicitly coordinate with each other. Specifically, SeqComm has two communication phases: a negotiation phase and a launching phase. In the negotiation phase, agents communicate the hidden states of their observations with each other simultaneously.
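The benefit of ordered decision-making can be seen in a toy coordination game (our own illustration, not from the paper): with two equally good joint actions, agents that break ties independently miscoordinate half the time, whereas a lower-level agent that observes the upper-level agent's actual action can always best-respond.

```python
import numpy as np

# Hypothetical 2-agent coordination game with two equally good joint
# actions: both (0, 0) and (1, 1) yield reward 1; mixed choices yield 0.
payoff = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

rng = np.random.default_rng(0)

def simultaneous_play(n=1000):
    # Each agent breaks the tie independently at random; miscoordination
    # occurs whenever they pick different optimal actions.
    a1 = rng.integers(0, 2, size=n)
    a2 = rng.integers(0, 2, size=n)
    return payoff[a1, a2].mean()

def sequential_play(n=1000):
    # The upper-level agent acts first and communicates its action;
    # the lower-level agent best-responds to that actual action.
    a1 = rng.integers(0, 2, size=n)
    a2 = payoff[a1].argmax(axis=1)  # best response to each a1
    return payoff[a1, a2].mean()

print(simultaneous_play())  # ≈ 0.5: independent tie-breaking
print(sequential_play())    # 1.0: the ordering resolves the tie
```

This is exactly the tie-breaking argument above: a unique ordering plus communication of the actual upper-level action removes the ambiguity between equally good joint actions.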
They are then able to generate multiple predicted trajectories, called intentions, by modeling the environment dynamics and other agents' actions. In addition, the priority of decision-making is determined by communicating and comparing the corresponding values of the agents' intentions. The value of each intention represents the reward obtained by letting that agent take the upper-level position in the order sequence. The order of the remaining agents is determined by the same procedure, with the upper-level agents fixed. In the launching phase, the upper-level agents take the lead in decision-making and communicate their actual actions to the lower-level agents. Note that the actual actions are executed simultaneously in the environment without any changes. SeqComm is currently built on MAPPO (Yu et al., 2021). Theoretically, we prove that the policies learned by SeqComm are guaranteed to improve monotonically and converge. Empirically, we evaluate SeqComm on a set of tasks in the multi-agent particle environment (MPE) (Lowe et al., 2017) and the StarCraft multi-agent challenge (SMAC) (Samvelyan et al., 2019). In all these tasks, we demonstrate that SeqComm outperforms prior communication-free and communication-based methods. Ablation studies confirm that treating agents asynchronously is a more effective way to promote coordination and that SeqComm provides a proper priority of decision-making for agents to develop better coordination.
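The two phases can be sketched as follows. This is a minimal structural sketch under our own assumptions: the intention-value function and the policies are stand-in stubs with hypothetical signatures, not the paper's actual world model or networks.

```python
from typing import Callable, Dict, List

def negotiation_phase(hidden_states: Dict[int, float],
                      intention_value: Callable[[int, Dict[int, float]], float]) -> List[int]:
    # Each agent broadcasts its hidden state, then the value of its
    # intention (taking the upper-level position) is compared across
    # agents; a higher value means an earlier turn in the order.
    values = {i: intention_value(i, hidden_states) for i in hidden_states}
    return sorted(values, key=values.get, reverse=True)

def launching_phase(order: List[int],
                    policy: Callable[[int, Dict[int, int]], int]) -> Dict[int, int]:
    # Upper-level agents decide first; each lower-level agent conditions
    # on the actual actions communicated by all agents ahead of it.
    # All actions are then executed simultaneously in the environment.
    actions: Dict[int, int] = {}
    for agent in order:
        actions[agent] = policy(agent, dict(actions))
    return actions

# Toy usage with a fake intention value (the agent's own hidden state)
# and a policy that just counts the upper-level actions it receives.
order = negotiation_phase({0: 0.2, 1: 0.9, 2: 0.5},
                          intention_value=lambda i, h: h[i])
actions = launching_phase(order, policy=lambda i, upper: len(upper))
print(order)    # [1, 2, 0]
print(actions)  # {1: 0, 2: 1, 0: 2}
```

The key design point is that only the negotiation phase exchanges observations (hidden states); the launching phase exchanges actual actions, so lower-level decisions never depend on predictions of agents that have not yet acted.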

2. RELATED WORK

Communication. Existing studies (Jiang & Lu, 2018; Kim et al., 2019; Singh et al., 2019; Das et al., 2019; Zhang et al., 2019; Jiang et al., 2020; Ding et al., 2020; Konan et al., 2022) in this realm mainly focus on how to extract valuable messages. ATOC (Jiang & Lu, 2018) and IC3Net (Singh et al., 2019) utilize gating mechanisms to decide when to communicate with other agents. Several works (Das et al., 2019; Konan et al., 2022) employ multi-round communication to fully reason about the intentions of others and establish complex collaboration strategies. Social influence (Jaques et al., 2019) uses communication to influence the behaviors of others. I2C (Ding et al., 2020) communicates only with agents that are relevant and influential, as determined by causal inference. However, all these methods focus on how to effectively and properly exploit valuable information from current or past partial observations. More recently, some studies (Kim et al., 2021; Du et al., 2021; Pretorius et al., 2021) have begun to answer the question: can we favor cooperation beyond sharing partial observations? They allow agents to imagine their future states with a world model and communicate those predictions to others. IS (Pretorius et al., 2021), as representative of this line of research, enables each agent to share its intention with other agents in the form of an encoded imagined trajectory, and uses an attention module to figure out the importance of each received intention. However, two concerns arise. On one hand, circular dependencies can lead to inaccurate predicted future trajectories as long as the multi-agent system treats agents synchronously. On the other hand, MARL struggles to extract useful information from numerous messages, not to mention more complex and dubious messages such as predicted future trajectories. Unlike these works, we treat the agents from an asynchronous perspective, so circular dependencies are naturally resolved.
Furthermore, upper-level agents send only their actions (besides hidden states of observations) to lower-level agents, keeping messages compact as well as informative.

Coordination. The agents are essentially independent decision makers during execution and may break ties between equally good actions randomly. Thus, in the absence of additional mechanisms, different

