CONCENTRATED ATTENTION FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

In cooperative multi-agent reinforcement learning, centralized training with decentralized execution (CTDE) shows great promise as a trade-off between independent Q-learning and joint action learning. However, vanilla CTDE methods, which assume a fixed number of agents, can hardly adapt to real-world scenarios where dynamic team compositions typically suffer from dramatic partial observability variance. Specifically, agents with extensive sight ranges are prone to be distracted by trivial environmental substrates, dubbed the "attention distraction" issue; those with limited observability can hardly sense their teammates, hindering the quality of cooperation. In this paper, we propose a Concentrated Attention for Multi-Agent reinforcement learning (CAMA) approach, which is rooted in a divide-and-conquer strategy to facilitate stable and sustainable teamwork. Concretely, CAMA divides the input entities with controlled observability masks via an Entity Dividing Module (EDM) according to their contributions to the attention weights. To tackle the attention distraction issue, the highly contributing entities are fed to an Attention Enhancement Module (AEM) for execution-related representation extraction via action prediction with an inverse model. For better out-of-sight-range cooperation, the lowly contributing ones are compressed into brief messages by an Attention Replenishment Module (ARM) with a conditional mutual information estimator. CAMA significantly outperforms SOTA methods on the challenging StarCraft II, MPE, and Traffic Junction benchmarks.

1. INTRODUCTION

Cooperative multi-agent deep reinforcement learning (MARL) has gained increasing attention in many areas such as games (Berner et al., 2019; Samvelyan et al., 2019; Kurach et al., 2019), social science (Jaques et al., 2019), sensor networks (Zhang & Lesser, 2013), and autonomous vehicle control (Xu et al., 2018). With practical agent cooperation and scalable deployment capability, centralized training with decentralized execution (CTDE) (Rashid et al., 2018; Gupta et al., 2017) has been widely adopted for MARL. Current CTDE methods, such as QMIX (Rashid et al., 2018), MADDPG (Lowe et al., 2017), and QPLEX (Wang et al., 2020a), usually assume a fixed number of agents. To adapt to complicated and dynamic real-world scenarios with dynamic team compositions (i.e., the team size varies), researchers extend these methods by introducing the attention mechanism (Vaswani et al., 2017), which usually requires splitting the state of the environment into a series of entities (Yang et al., 2020; Agarwal et al., 2019; Iqbal et al., 2021). However, attention-based methods can hardly handle the varying partial observability (e.g., the varying sight range of each agent) in multi-agent systems, as illustrated in Fig. 1. Under severe partial observability, agents usually lose sight of their teammates, leading to poor coordination quality; we use a demo in Sec. 5.1 to verify this phenomenon. Under slight partial observability (large sight ranges with near-perfect information), these methods exhibit apparent performance degradation because trivial entities may distract the agents' attention and interfere with their decision making; see Sec. 4.1 for a detailed analysis. Therefore, keeping agents' attention concentrated on potential cooperators and execution-related entities, so as to adapt to partial observability variation, is crucial for MARL in challenging environments.
In this paper, we propose a Concentrated Attention for Multi-Agent reinforcement learning (CAMA) approach, which is rooted in a divide-and-conquer strategy to facilitate stable and sustainable teamwork via attention learning. Specifically, we first use an Entity Dividing Module (EDM) to divide the raw entities into two parts for each agent according to its attention weights. For the attention distraction issue in settings with large sight ranges, an Attention Enhancement Module (AEM) is applied to the entities with high attention weights for execution-related representation extraction via action prediction with an inverse model. For out-of-sight-range coordination under low sight ranges, an Attention Replenishment Module (ARM) with a novel conditional mutual information estimator is applied to compress the information in the entities with low attention weights. With the above three modules, agents' attention can be properly concentrated on the execution of local actions and potential teamwork to deal with dynamic partial observability. We evaluate our method on three commonly used benchmarks: StarCraft II (SC2) (Samvelyan et al., 2019), the Multi-agent Particle Environment (MPE) (Lowe et al., 2017), and Traffic Junction (Sukhbaatar et al., 2016). The proposed CAMA significantly outperforms SOTA methods in all conducted experiments and exhibits remarkable robustness to sight range variation and dynamic team composition.
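The entity-dividing step described above can be pictured as computing one agent's attention weights over all observed entities and splitting them into highly and lowly contributing sets. The following is a minimal NumPy sketch under our own simplifying assumptions: scaled dot-product attention with a single query and a fixed top-k split (function names such as `divide_entities` are illustrative, not from the paper, which learns the division).

```python
import numpy as np

def attention_weights(query, keys):
    """One agent's scaled dot-product attention weights over all entities."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)      # one score per entity
    exp = np.exp(scores - scores.max())     # numerically stable softmax
    return exp / exp.sum()

def divide_entities(entities, weights, k):
    """Split entities into high- and low-contribution sets by attention weight."""
    order = np.argsort(weights)[::-1]       # most-attended entities first
    return entities[order[:k]], entities[order[k:]]

rng = np.random.default_rng(0)
entities = rng.normal(size=(8, 4))          # 8 observed entities, 4-dim embeddings
query = rng.normal(size=4)                  # the agent's own embedding as query
w = attention_weights(query, entities)
high, low = divide_entities(entities, w, k=3)
```

In this sketch the `high` set would be routed to the AEM-style enhancement branch and the `low` set to the ARM-style compression branch.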

2. RELATED WORK

As a popular paradigm for single-reward MARL, CTDE is a trade-off between independent Q-learning (Tan, 1993) and joint action learning (Claus & Boutilier, 1998). Centralized training helps agents cooperate better, while decentralized execution benefits flexible deployment. A series of works concentrate on distributing the team reward to all agents by value function factorization (Sunehag et al., 2018; Rashid et al., 2018), deriving and extending the Individual-Global-Max (IGM) principle for policy optimality analysis (Son et al., 2019; Wang et al., 2020a; Rashid et al., 2020; Wan et al., 2021). To avoid the constraints of IGM, some works delve into applying centralized critics to local policies using the actor-critic paradigm (Lowe et al., 2017; Foerster et al., 2018; Zhou et al., 2020). Although CTDE methods have achieved great progress in recent years, with fixed team sizes they are typically impeded by the dynamic team composition issue (Schroeder de Witt et al., 2019; Liu et al., 2021) in real-world applications.

Dynamic Team Composition. When the agent number varies across episodes, the attention mechanism (Vaswani et al., 2017) is commonly adopted to handle the issue (Jiang et al., 2018; Agarwal et al., 2019; Yang et al., 2020; Hu et al., 2020; Iqbal et al., 2021). Some works develop a set of curricula to adapt to increasing team sizes (Baker et al., 2019; Long et al., 2020; Wang et al., 2020c), at non-negligible computational cost for training on different team sizes. Iqbal et al. (2021) add auxiliary Q-learning tasks to increase the multi-agent system's robustness by randomly masking out part of the agents' observability, which increases the variety of situations agents encounter.
Although these methods adapt well to different team sizes, they still suffer obvious attention distraction when the sight ranges of agents are large, and are prone to fail in situations where agents with limited observability must cooperate beyond their sight ranges (Liu et al., 2021).

Communication Mechanism. Communication is a feasible solution to enhance agents' cooperation. Recently, some works regard the relationship between agents as a proximity-based or fully connected graph and assume that information can propagate along the graph edges (Foerster et al., 2016; Suttle et al., 2020; Agarwal et al., 2019; Zhang et al., 2018; Sukhbaatar et al., 2016; Liu et al., 2020; Mao et al., 2020a). These methods usually let agents communicate with all neighbors or the whole team, which incurs high communication costs; moreover, the communication mode is sensitive to team sizes. Other works assume the existence of a centralized coach that integrates information and sends messages to all agents (Liu et al., 2021; Mao et al., 2020b; Niu et al., 2021), which requires centralized execution but has relatively low communication costs. The communication messages are usually trained by backpropagation of the RL loss, sometimes with a constraint from a mutual information objective to reduce the communication bandwidth (Wang et al., 2020b) or to aid action decisions (Yuan et al., 2022).
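As background for the mutual information objectives mentioned above (the paper's own conditional estimator is specified later, not here), a common building block is the InfoNCE lower bound on I(message; state): given a K x K critic score matrix whose diagonal entries score matched message-state pairs, the bound is log K plus the mean log-softmax probability of each positive pair. A minimal NumPy sketch, with the function name and critic setup being our own illustrative assumptions:

```python
import numpy as np

def infonce_lower_bound(scores):
    """InfoNCE bound on mutual information from a K x K critic score matrix.

    scores[i, j] is the critic value f(message_i, state_j); diagonal entries
    score each message against its own state (the positive pairs).
    """
    K = scores.shape[0]
    # Numerically stable log-sum-exp over each row.
    row_max = scores.max(axis=1)
    row_lse = np.log(np.exp(scores - row_max[:, None]).sum(axis=1)) + row_max
    log_probs = scores.diagonal() - row_lse   # log-softmax of positives
    return np.log(K) + log_probs.mean()

# A near-perfect critic scores positive pairs far above negatives,
# so the bound approaches its ceiling of log K.
bound = infonce_lower_bound(10.0 * np.eye(4))
```

The bound is capped at log K, which is why such objectives are typically estimated with large batches; a conditional variant, as used for communication learning, additionally feeds the conditioning context into the critic.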



Figure 1: The dynamic sight range dilemma. (a) Agents can hardly cooperate beyond their sight ranges. (b) Agents with large sight ranges may perform worse due to "attention distraction". (c) A sketch of our CAMA.

