CONCENTRATED ATTENTION FOR MULTI-AGENT REINFORCEMENT LEARNING

Abstract

In cooperative multi-agent reinforcement learning, centralized training with decentralized execution (CTDE) offers a promising trade-off between independent Q-learning and joint action learning. However, vanilla CTDE methods, which assume a fixed number of agents, can hardly adapt to real-world scenarios with dynamic team compositions, where they suffer from a dilemma of dramatic partial observability variance. Specifically, agents with extensive sight ranges are prone to be affected by trivial environmental substrates, dubbed the "attention distraction" issue, while agents with limited observability can hardly sense their teammates, hindering the quality of cooperation. In this paper, we propose a Concentrated Attention for Multi-Agent reinforcement learning (CAMA) approach, rooted in a divide-and-conquer strategy, to facilitate stable and sustainable teamwork. Concretely, CAMA divides the input entities with controlled observability masks via an Entity Dividing Module (EDM) according to their contributions to the attention weights. To tackle the attention distraction issue, the highly contributing entities are fed to an Attention Enhancement Module (AEM) for execution-related representation extraction via action prediction with an inverse model. For better out-of-sight-range cooperation, the lowly contributing ones are compressed into brief messages by an Attention Replenishment Module (ARM) with a conditional mutual information estimator. CAMA outperforms the SOTA methods significantly on the challenging StarCraftII, MPE, and Traffic Junction benchmarks.

1. INTRODUCTION

Cooperative multi-agent deep reinforcement learning (MARL) has gained increasing attention in many areas such as games (Berner et al., 2019; Samvelyan et al., 2019; Kurach et al., 2019), social science (Jaques et al., 2019), sensor networks (Zhang & Lesser, 2013), and autonomous vehicle control (Xu et al., 2018). With practical agent cooperation and scalable deployment capability, centralized training with decentralized execution (CTDE) (Rashid et al., 2018; Gupta et al., 2017) has been widely adopted for MARL. Current CTDE methods, such as QMIX (Rashid et al., 2018), MADDPG (Lowe et al., 2017), and QPLEX (Wang et al., 2020a), usually assume a fixed number of agents. To adapt to complicated and dynamic real-world scenarios with dynamic team compositions (i.e., varying team sizes), researchers extend these methods with the attention mechanism (Vaswani et al., 2017), which usually requires splitting the environment state into a series of entities (Yang et al., 2020; Agarwal et al., 2019; Iqbal et al., 2021). However, attention-based methods can hardly handle the varying partial observability (e.g., the varying sight range of each agent) in multi-agent systems (Fig. 1). With severe partial observability, agents usually lose sight of their teammates, leading to poor coordination quality; we use a demo in Sec. 5.1 to verify this phenomenon. With slight partial observability (large sight ranges with near-perfect information), these methods exhibit apparent performance degradation, as trivial entities may distract the agents' attention and interfere with their decision making; see Sec. 4.1 for a detailed analysis. Therefore, keeping agents' attention concentrated on potential cooperators and execution-related entities under varying partial observability is crucial for MARL in challenging environments.
In this paper, we propose a Concentrated Attention for Multi-Agent reinforcement learning (CAMA) approach, rooted in a divide-and-conquer strategy, to facilitate stable and sustainable teamwork via attention learning. Specifically, we first use an Entity Dividing Module (EDM) to divide the raw entities into two parts for each agent according to its attention weights. For the attention distraction issue in settings with large sight ranges, an Attention Enhancement Module (AEM) is applied to the entities with high attention weights for execution-related representation extraction via action prediction with an inverse model. For out-of-sight-range coordination under low sight ranges, an Attention Replenishment Module (ARM) with a novel conditional mutual information estimator is applied to compress the information in entities with low attention weights. With the above three modules, agents' attention can be properly concentrated on the execution of local actions and potential teamwork to deal with dynamic partial observability. We evaluate our method on three commonly used benchmarks: StarCraftII (SC2) (Samvelyan et al., 2019), the Multi-agent Particle Environment (MPE) (Lowe et al., 2017), and Traffic Junction (Sukhbaatar et al., 2016). The proposed CAMA outperforms SOTA methods significantly on all conducted experiments and exhibits remarkable robustness to sight range variation and dynamic team composition.

2. RELATED WORK

As a popular paradigm for shared-reward MARL, CTDE is a trade-off between independent Q-learning (Tan, 1993) and joint action learning (Claus & Boutilier, 1998). Centralized training makes agents cooperate better, while decentralized execution benefits flexible deployment. A series of works concentrates on distributing the team reward to all agents by value function factorization (Sunehag et al., 2018; Rashid et al., 2018), deriving and extending the Individual-Global-Max (IGM) principle for policy optimality analysis (Son et al., 2019; Wang et al., 2020a; Rashid et al., 2020; Wan et al., 2021). To avoid the constraints of IGM, some works apply centralized critics to local policies using the actor-critic paradigm (Lowe et al., 2017; Foerster et al., 2018; Zhou et al., 2020). Although CTDE methods have achieved great progress in recent years, by assuming fixed team sizes they are typically impeded by the dynamic team composition issue (Schroeder de Witt et al., 2019; Liu et al., 2021) in real-world applications.

Dynamic Team Composition. When the number of agents varies across episodes, the attention mechanism (Vaswani et al., 2017) is commonly adopted to handle the issue (Jiang et al., 2018; Agarwal et al., 2019; Yang et al., 2020; Hu et al., 2020; Iqbal et al., 2021). Some works develop curricula to adapt to increasing team sizes (Baker et al., 2019; Long et al., 2020; Wang et al., 2020c), with non-negligible computational costs for training on different team sizes. Iqbal et al. (2021) add auxiliary Q-learning tasks to increase the multi-agent system's robustness by randomly masking out part of the agents' observability, which increases the variety of situations encountered by agents.
Although these methods adapt to different team sizes well, they still suffer from obvious attention distraction when the sight ranges of agents are large, and are prone to fail in situations where agents with limited observability must cooperate beyond their sight ranges (Liu et al., 2021).

Communication Mechanism is a feasible solution to enhance agents' cooperation. Recently, some works regard the relationship between agents as a proximity-based or fully-connected graph and assume that information can propagate along the graph edges (Foerster et al., 2016; Suttle et al., 2020; Agarwal et al., 2019; Zhang et al., 2018; Sukhbaatar et al., 2016; Liu et al., 2020; Mao et al., 2020a). These methods usually let agents communicate with all neighbors or the whole team, which brings high communication costs; moreover, the communication mode is sensitive to team sizes. Other works assume the existence of a centralized coach that integrates information and sends messages to all agents (Liu et al., 2021; Mao et al., 2020b; Niu et al., 2021) with relatively low communication costs, which requires centralized execution. The communication messages are usually trained by backpropagation of the RL loss, sometimes with constraints from mutual information objectives to reduce the communication bandwidth (Wang et al., 2020b), help action decisions (Yuan et al., 2022), or predict future trajectories (Liu et al., 2021). Unlike these methods, we use communication from a centralized coach to apply attention replenishment for agents' better coordination.

3. BACKGROUND

MARL Symbols. We model a fully collaborative multi-agent task with $n_a$ agents as a decentralised partially observable Markov decision process (Dec-POMDP) (Oliehoek et al., 2016) $G = \langle S, A, I, P, r, Z, O, n_a, \gamma \rangle$, where $s \in S$ is the environment's state. At time step $t$, each agent $i \in I_a \equiv \{1, ..., n_a\}$ chooses an action $a^i \in A$, which makes up the joint action $\mathbf{a} \in \mathbf{A} \equiv A^{n_a}$. $P(s_{t+1}|s_t, \mathbf{a}_t): S \times \mathbf{A} \times S \to [0, 1]$ is the environment's state transition distribution. All agents share the same reward function $r(s, \mathbf{a}): S \times \mathbf{A} \to \mathbb{R}$. The discount factor is denoted by $\gamma \in [0, 1)$. Each agent $i$ has local observations $o^i \in O$ drawn from the observation function $Z(s, i): S \times I \to O$ and chooses an action by its stochastic policy $\pi^i(a^i|\rho^i, \chi^i): \Gamma \times X \to \Delta([0, 1]^{|A|})$, where $\rho^i \in \Gamma \equiv (O \times A)^l$ denotes the action-observation history of agent $i$, and $l$ is the number of observation-action pairs in $\rho^i$. $\boldsymbol{\rho}$ denotes the action-observation histories of all agents. $\chi^i \in X$ denotes an additional communication message; $\pi^i$ has no dependence on $\chi^i$ in vanilla CTDE. The agents' joint policy $\boldsymbol{\pi}$ induces a joint action-value function $Q^{\boldsymbol{\pi}}(s_t, \mathbf{a}_t) = \mathbb{E}_{s_{t+1:\infty}, \mathbf{a}_{t+1:\infty}}[R_t | s_t, \mathbf{a}_t]$, where $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the discounted accumulated team reward. The goal of MARL is to find the optimal joint policy $\boldsymbol{\pi}^*$ such that $Q^{\boldsymbol{\pi}^*}(s, \mathbf{a}) \geq Q^{\boldsymbol{\pi}}(s, \mathbf{a})$ for all $\boldsymbol{\pi}$ and $(s, \mathbf{a}) \in S \times \mathbf{A}$.

Value Function Factorization. During execution in CTDE, each agent chooses actions by its local Q function $Q^i(\rho^i)$ induced by its local observation $o^i$. During training, since all agents share a common team reward, a global Q function $Q_{tot}$ is calculated from all local $Q^i$ conditioned on the global state $s$ by a Mixer module: $Q_{tot}(\boldsymbol{\rho}, \mathbf{a}) = g(Q^1(\rho^1, a^1), ..., Q^{n_a}(\rho^{n_a}, a^{n_a}), s)$. To guarantee global optimality from local optimality, a Mixer should satisfy the IGM principle (Son et al., 2019): $\arg\max_{\mathbf{a}} Q_{tot}(\boldsymbol{\rho}, \mathbf{a}) = \left(\arg\max_{a^1} Q^1(\rho^1, a^1), ..., \arg\max_{a^{n_a}} Q^{n_a}(\rho^{n_a}, a^{n_a})\right)$.
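As an illustrative aside, the IGM principle can be checked numerically for a simple monotonic Mixer. The sketch below uses hypothetical payoffs and a positive-weighted-sum mixer (not the mixing network used in CAMA) to confirm that decentralized per-agent argmaxes recover the joint argmax of $Q_{tot}$:

```python
import numpy as np

# Toy IGM check: a monotonic mixer g (positive-weighted sum) preserves
# per-agent greedy action selection. All values are hypothetical.
rng = np.random.default_rng(0)
n_agents, n_actions = 2, 3
q_local = rng.normal(size=(n_agents, n_actions))   # local Q_i(rho_i, a_i)
w = np.array([0.7, 1.3])                           # positive mixer weights

# Decentralized greedy actions: argmax of each local Q.
greedy = q_local.argmax(axis=1)

# Centralized search: enumerate all joint actions and maximize Q_tot.
best_joint, best_val = None, -np.inf
for a1 in range(n_actions):
    for a2 in range(n_actions):
        q_tot = w[0] * q_local[0, a1] + w[1] * q_local[1, a2]
        if q_tot > best_val:
            best_val, best_joint = q_tot, (a1, a2)

assert tuple(greedy) == best_joint  # IGM holds for this monotonic mixer
```

Because the mixer is monotonically increasing in each $Q^i$, maximizing each local Q independently maximizes the mixture, which is exactly the IGM condition.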
Multi-Head Attention (MHA) in MARL. Under dynamic team compositions, the observation vector of each agent may vary in size within one episode, so traditional methods that only accept fixed-size input can hardly be applied. Instead, we use an "entity-wise input" to represent the observation and a multi-head attention (MHA) module to embed a dynamic number of entities into a fixed-length vector for each agent. To feed the MHA, the raw state $s$ of the environment is commonly expressed as a series of entities $e^i$ of the same vector length, i.e., $s^e := \{e^i\}, i \in [1, n_e]$, where $n_e$ is the maximum number of entities. The entities include the agents we can or cannot control and other substrates in the multi-agent scenario (e.g., obstacles). We consider a proximity-based observation function $Z(s^e, i)$ for agent $i$: each agent has a sight range $SR$, and $Z(s^e, i) := \{e^j \mid d(i, j) \leq SR\}, j \in [1, n_e]$, where $d(i, j)$ is the Euclidean (or Manhattan) distance between entities $i$ and $j$. Let $X \in \mathbb{R}^{n_e \times d}$ be the entity input, in which each row is an entity. Let $I_a \subseteq I := \{1, ..., n_e\}$ be the set of indices selecting which entities of the input $X$ are used to compute queries, such that $X_{I_a} \in \mathbb{R}^{n_a \times d}$ (usually $i \in I_a$ means entity $i$ is a controllable agent, and $I_a := \{1, ..., n_a\}$). The attention head is defined as:

$$\mathrm{AH}(I, X, M; W^Q, W^K, W^V) = \mathrm{softmax}\left(\mathrm{mask}\left(\frac{QK^\top}{\sqrt{h}}, M\right)\right)V \in \mathbb{R}^{|I| \times h},$$
$$Q = X_{I_a} W^Q, \quad K = X W^K, \quad V = X W^V, \quad M \in \{0, 1\}^{n_a \times n_e}, \quad W^Q, W^K, W^V \in \mathbb{R}^{d \times h}. \quad (1)$$

The $\mathrm{mask}(Y, M)$ operation takes two matrices of the same size as input and fills the entries of $Y$ with $-\infty$ where $M$ equals 0. By setting the mask to 0 at the positions of unseen entities, this operation blocks their information after the softmax, upholding the partial observability of local agents. $W^Q$, $W^K$, and $W^V$ are learnable parameters.
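The attention head of Eq. (1) can be sketched in a few lines of numpy (randomly initialized matrices stand in for the learned $W^Q$, $W^K$, $W^V$; the entity counts and dimensions are illustrative, not from the paper):

```python
import numpy as np

def attention_head(X, agent_idx, M, Wq, Wk, Wv):
    """Masked attention head: softmax(mask(QK^T / sqrt(h), M)) V.

    X: (n_e, d) entity matrix; agent_idx: rows used as queries;
    M: (n_a, n_e) binary visibility mask (0 = entity unseen).
    """
    h = Wq.shape[1]
    Q, K, V = X[agent_idx] @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(h)
    logits = np.where(M == 1, logits, -np.inf)    # mask(Y, M): -inf where M == 0
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True) # row-wise softmax; unseen -> 0
    return weights @ V                            # (n_a, h)

rng = np.random.default_rng(0)
n_e, n_a, d, h = 6, 2, 4, 8
X = rng.normal(size=(n_e, d))
M = np.ones((n_a, n_e)); M[0, 4:] = 0             # agent 0 cannot see entities 4, 5
Wq, Wk, Wv = (rng.normal(size=(d, h)) for _ in range(3))
out = attention_head(X, [0, 1], M, Wq, Wk, Wv)
assert out.shape == (n_a, h)
```

Note that the masked positions receive exactly zero attention weight after the softmax, so unseen entities contribute nothing to the output embedding.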
We can then define the multi-head attention module by concatenating $n_h$ attention heads: $\mathrm{MHA}(I, X, M) = \mathrm{concat}\left(\mathrm{AH}(I, X, M; W^Q_j, W^K_j, W^V_j)\right), j \in 1 \dots n_h$, where information that is not blocked can be integrated across all entities.

Figure 3: The attention distraction issue illustrated with agents' average attention heatmaps of REFIL on the StarCraftII map "3-8sz_symmetric" (the hardest scenario "8sz_vs_8sz") during one episode over all entities. The y axis denotes the 8 agents (A1-A8), while the x axis records 16 entities: the 8 agents plus 8 enemies (E1-E8). The sum of each row is normalized to 1. "SR" is each agent's sight range, and "WR" is the winning rate against the preset AI.

4.1. INTUITION AND OVERVIEW

Intuition. The performance of traditional MARL methods is highly affected by the partial observability of the environment. Take the agents in a game (e.g., StarCraftII) as an example. When the sight range is small, the agents can hardly find and support their teammates, leading to poor team coordination (a demo in Sec. 5.1 verifies this hypothesis). Counterintuitively, however, agents' performance typically also degrades as sight ranges increase (this phenomenon is detailed in Sec. 5.1). We argue that the agents' attention is easily distracted by unrelated entities, causing the attention distraction issue. To reveal the impact of agents' attention on their performance, we train the SOTA algorithm REFIL (Iqbal et al., 2021) with different sight ranges (3, 9, and ∞) and visualize the agents' attention weights over all entities (i.e., the values of the matrix $QK^\top$ in the MHA module) in Fig. 3. It can be seen that as more entities become visible, the agents' attention gets more dispersed, making it harder to beat the preset AI. Existing dynamic team MARL methods commonly face this problem (see Appendix E). To deal with the dilemma of dynamic partial observability in MARL, we resort to a divide-and-conquer learning strategy, dubbed CAMA, for stable performance and sustainable teamwork. Concretely, at low sight ranges we improve agents' awareness of potential cooperators via out-of-sight-range information, while at large sight ranges we keep agents' attention concentrated on execution-related entities.

Framework and RL Training. CAMA mainly consists of three components: an Entity Dividing Module (EDM) for dynamic partial observability control, an Attention Enhancement Module (AEM) for attention concentration on execution-related entities, and an Attention Replenishment Module (ARM) for agents' out-of-sight-range coordination (Fig. 2).
Specifically, for each agent $i$, the EDM divides and embeds the raw entities into $f^i$ and $f^{-i}$, which feed the AEM and ARM, respectively. In the AEM, an inverse model is applied to resolve the attention distraction issue. In the ARM, a coach with global sight is introduced to generate a communication message $\zeta^i$ for agents' coordination. Following Yuan et al. (2022), we generate a $Q^{i(local)}$ from $f^i$ and agent $i$'s observation-action history $\rho^i$ (the output of a gated recurrent unit (GRU) cell), and a $Q^{i(global)}$ from $\zeta^i$. $Q^i$ is computed as their summation, i.e., $Q^i = Q^{i(local)} + Q^{i(global)}$. All local $Q^i$ are fed into a mixing network (Yang et al., 2020) to calculate $Q_{tot}$. The RL loss can be formulated as:

$$\mathcal{L}_{QL} = \mathbb{E}_{(\mathbf{a}_t, r_t, \boldsymbol{\rho}_t, \boldsymbol{\rho}_{t+1}) \sim D}\left[\left(r_t + \gamma \max_{\mathbf{a}'} \bar{Q}_{tot}(\boldsymbol{\rho}_{t+1}, \mathbf{a}') - Q_{tot}(\boldsymbol{\rho}_t, \mathbf{a}_t)\right)^2\right],$$

where $\bar{Q}_{tot}$ is the target network, and $D$ is the replay buffer.
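A minimal numerical sketch of this TD objective, with random arrays standing in for network outputs and a simple additive mixer standing in for the mixing network (all shapes and values are illustrative assumptions):

```python
import numpy as np

# Sketch: Q_i = Q_i(local) + Q_i(global), mixed into Q_tot, regressed
# toward r + gamma * max_a' Qbar_tot. Batch of 4 transitions, 3 agents,
# 5 actions; every tensor below is a hypothetical stand-in.
rng = np.random.default_rng(0)
B, n, A, gamma = 4, 3, 5, 0.99
q_local  = rng.normal(size=(B, n, A))     # from f_i and the GRU history
q_global = rng.normal(size=(B, n, A))     # from the coach message zeta_i
q_i = q_local + q_global                  # Q_i = Q_i(local) + Q_i(global)

a_t = rng.integers(A, size=(B, n))        # actions actually taken
r_t = rng.normal(size=B)                  # team rewards
q_next_tot = rng.normal(size=B)           # stand-in for max_a' Qbar_tot

# Additive mixer stands in for the mixing network: sum chosen-action Qs.
q_tot = q_i[np.arange(B)[:, None], np.arange(n), a_t].sum(axis=1)
td_target = r_t + gamma * q_next_tot
loss = np.mean((td_target - q_tot) ** 2)  # L_QL
assert loss >= 0
```

In the real method the mixer is a learned network conditioned on the state, and $\bar{Q}_{tot}$ comes from a slowly updated target copy; only the loss bookkeeping is shown here.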

4.2. ENTITY DIVIDING MODULE

The EDM divides raw entities into an attention enhancement part and an attention replenishment part by the ranking of their attention weights. For the former, we wish to constrain the maximum number of observed entities to avoid attention distraction; the latter should contain enough out-of-sight-range information for team coordination. We first deal with the attention enhancement part. Recall that in the MHA module, $M \in \{0, 1\}^{n_a \times n_e}$ is a binary mask on the entity embeddings generated by the environment. To uphold each agent's partial observability, a sparser mask $M_s$ is introduced to replace $M$, satisfying the following constraints:

$$\|M_s\|_\infty \leq \alpha n_e, \quad \neg M_s \odot \neg M = \neg M, \quad (4)$$

where $n_e$ is the maximum number of entities, $\alpha \in (0, 1]$ is a hyper-parameter, and $\odot$ denotes element-wise multiplication. The negation of $M$ is defined as $\neg M := \mathbf{1} - M$, where $\mathbf{1}$ is an all-ones matrix with the same shape as $M$. The left constraint of Eq. (4) ensures that the fraction of observable entities is at most $\alpha$, while the right one makes the agent observe only entities available in the original mask $M$. We can assign a low value to $\alpha$ (e.g., 0.4) to limit each agent's visible entities in complicated environments. To obtain $M_s$, we define a $B_i(W)$ operation that returns the indices of the top $i$ values in each row of $W$. $M_s$ can be calculated as:

$$M_s = M \odot M_f, \quad M_f[\mathcal{I}] = 1, \quad M_f[\text{others}] = 0, \quad \mathcal{I} = B_{\lfloor \alpha n_e \rfloor}(QK^\top), \quad (5)$$

where $\lfloor \cdot \rfloor$ rounds down, and $Q$, $K$ are the matrices of queries and keys in the MHA module, respectively. Under Eq. (5), each agent retains sight of at most $\lfloor \alpha n_e \rfloor$ entities with the highest attention weights, improving its attention concentration. We prove in Appendix A that $M_s$ obtained by Eq. (5) satisfies Eq. (4). After obtaining $M_s$, we compute $f^i$ for the attention enhancement part as the output of $\mathrm{MHA}(I, X, M_s)$.
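Eq. (5) amounts to a row-wise top-k selection on the attention logits followed by an intersection with the environment mask. A minimal numpy sketch (hypothetical shapes and logits) that also checks the Eq. (4) constraints:

```python
import numpy as np

def sparse_mask(M, attn_logits, alpha):
    """EDM mask (sketch of Eq. (5)): M_s = M ⊙ M_f, where M_f keeps the
    top-floor(alpha * n_e) entities per row of the attention logits QK^T."""
    n_e = M.shape[1]
    k = int(np.floor(alpha * n_e))
    top_idx = np.argsort(-attn_logits, axis=1)[:, :k]  # B_k: row-wise top-k
    M_f = np.zeros_like(M)
    np.put_along_axis(M_f, top_idx, 1, axis=1)
    return M * M_f                                     # intersect with M

rng = np.random.default_rng(0)
n_a, n_e, alpha = 3, 10, 0.4
M = (rng.random((n_a, n_e)) < 0.8).astype(int)  # environment visibility mask
logits = rng.normal(size=(n_a, n_e))            # stands in for QK^T rows
M_s = sparse_mask(M, logits, alpha)
assert M_s.sum(axis=1).max() <= np.floor(alpha * n_e)  # ||M_s||_inf <= alpha*n_e
assert ((1 - M_s) * (1 - M) == (1 - M)).all()          # neg(M_s) ⊙ neg(M) = neg(M)
```

The two assertions are exactly the constraints of Eq. (4): bounded row sums and no entity visible under $M_s$ that was hidden under $M$.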
Since the attention replenishment part should contain all the information not involved in the former part for better out-of-sight-range coordination, we transmit it as $f^{-i}$, the embedding of the complementary entities of $M_s$, computed with the same MHA module as $\mathrm{MHA}(I, X, \neg M_s)$.

4.3. ATTENTION ENHANCEMENT FOR LOCAL AGENT

An agent that knows what will happen when a specific action is taken can hardly be distracted. We therefore aim to concentrate agents' attention on execution-related information distilled from the high-dimensional state space. Specifically, we resort to an inverse model (Pathak et al., 2017), a two-layer MLP, that uses the local observations $o^i_t$ and $o^i_{t+1}$ to predict agent $i$'s action $a^i_t$. In the prediction module, $o^i_t$ and $o^i_{t+1}$ are first fed into the same EDM to get the features $f^i_t$ and $f^i_{t+1}$. Then, the probability of each action is predicted as $p(\hat{a}^i_t) = \mathrm{IM}(f^i_t, f^i_{t+1}; \theta)$. The learning loss is defined as:

$$\mathcal{L}_{IM} = \mathrm{CE}(p(\hat{a}^i_t), a^i_t), \quad (6)$$

where CE denotes the cross entropy. By optimizing Eq. (6), $f^i$ will contain the information necessary to predict $a^i_t$, which encourages the EDM to discard irrelevant, distracting information from the attention enhancement embedding $f^i$. With this auxiliary representation learning task, the learned embedding $f^i$ can be used to calculate each agent's local Q function $Q^{i(local)}$.
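A toy sketch of the inverse model and its cross-entropy loss (a hypothetical two-layer MLP with random weights, for shape and loss bookkeeping only):

```python
import numpy as np

# Inverse-model sketch: a tiny two-layer MLP predicts a_t from (f_t, f_t+1),
# trained with cross-entropy (Eq. (6)). Dimensions and weights are illustrative.
rng = np.random.default_rng(0)
d_f, n_actions, hidden = 8, 5, 16
W1 = rng.normal(scale=0.1, size=(2 * d_f, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, n_actions))

def inverse_model(f_t, f_t1):
    x = np.concatenate([f_t, f_t1])                 # concat consecutive features
    logits = np.maximum(x @ W1, 0.0) @ W2           # two-layer ReLU MLP
    p = np.exp(logits - logits.max())
    return p / p.sum()                              # p(a_hat_t): softmax over actions

f_t, f_t1 = rng.normal(size=d_f), rng.normal(size=d_f)
a_t = 2                                             # hypothetical ground-truth action
p = inverse_model(f_t, f_t1)
loss_im = -np.log(p[a_t])                           # CE(p(a_hat_t), a_t)
assert np.isclose(p.sum(), 1.0) and loss_im > 0
```

In training, gradients of this loss flow back through the EDM features $f^i_t$, $f^i_{t+1}$, which is what pushes the embedding toward execution-related content.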

4.4. ATTENTION REPLENISHMENT BY GLOBAL COACH

To equip agents with the ability of out-of-sight-range coordination, we use a centralized coach with access to global states to generate a message $\zeta^i$ from $f^{-i}$ and compute a $Q^{i(global)}$ for each agent $i$ at each time step (Liu et al., 2021; Niu et al., 2021; Mao et al., 2020b). The message plays the role of attention replenishment when agents have difficulty cooperating through their local observations. The learning objective for $\zeta^i$ should retain the information unknown to agent $i$ while not distracting the agent's attention. Therefore, we maximize:

$$I(\zeta^i; f^{-i}) - \beta I(\zeta^i; s) = (1 - \beta) I(\zeta^i; s) - I(\zeta^i; f^i | f^{-i}), \quad (7)$$

where $s$ is the global state and $I(\cdot\,;\cdot)$ denotes mutual information; $f^{-i}$ carries the replenishment information for agent $i$. We leave the derivation of Eq. (7) to Appendix B. Maximizing $I(\zeta^i; f^{-i})$ lets $\zeta^i$ be a summary of $f^{-i}$; by feeding $\zeta^i$ to agent $i$, the agent can sense information beyond its sight range, alleviating the difficulties of cooperation caused by partial observability. Minimizing $I(\zeta^i; s)$ compresses the information contained in $\zeta^i$, which can be regarded as an information bottleneck constraint on $\zeta^i$ (Wang et al., 2020b). We use a hyper-parameter $\beta \in [0, 1]$ to control the compression degree of $\zeta^i$. Combining the two terms, we (1) discard the information in $f^i$, which is already known to agent $i$, and (2) compress the sophisticated $f^{-i}$ into a brief message, which can hardly distract the agent while promoting coordination. We then separately optimize the two mutual information terms on the right side of Eq. (7). Directly maximizing $I(\zeta^i; s)$ is difficult, but there exist tools that estimate its differentiable lower bound, e.g., InfoNCE (Oord et al., 2018) and MINE (Belghazi et al., 2018).
We choose the CatGen formulation (Fischer, 2020) of the former and maximize the following lower bound of $I(\zeta^i; s)$:

$$\bar{I}_{NCE}(\zeta^i; s) = \mathbb{E}_{\zeta^i, s}\left[\log \frac{p(\zeta^i | s)}{\frac{1}{K}\sum_{k=1}^{K} p(\zeta^i | s_k)}\right],$$

where $K$ is the sample number of a mini-batch. We then move on to minimizing $I(\zeta^i; f^i | f^{-i})$. There are existing tools that minimize an upper bound of the mutual information between two random variables, such as CLUB (Cheng et al., 2020) and L1Out (Poole et al., 2019), but these methods cannot be directly applied to conditional mutual information. Therefore, we extend the CLUB estimator to the conditional form and present the Conditional-CLUB (CC) estimator as a differentiable upper bound of $I(\zeta^i; f^i | f^{-i})$:

$$I_{CC}(\zeta^i; f^i | f^{-i}) = \mathbb{E}_{\zeta^i, f^i, f^{-i}}[\log p(\zeta^i | f^i, f^{-i})] - \mathbb{E}_{f^i}\mathbb{E}_{\zeta^i, f^{-i}}[\log p(\zeta^i | f^i, f^{-i})].$$

Theorem 4.1. For three random variables $\zeta^i$, $f^i$, and $f^{-i}$,

$$I_{CC}(\zeta^i; f^i | f^{-i}) \geq I(\zeta^i; f^i | f^{-i}). \quad (10)$$

The equality holds if and only if $f^i$ is independent of the joint distribution of $(\zeta^i, f^{-i})$.

The proof of Thm. 4.1 is given in Appendix C. Accordingly, we can minimize the conditional mutual information by minimizing $I_{CC}$. In practice, assuming we have the conditional distribution $p(\zeta^i | f^i, f^{-i})$, we can sample pairs $\{(\zeta^i_k, f^i_k, f^{-i}_k)\}_{k=1}^{K}$ and get an unbiased estimate of $I_{CC}$:

$$\hat{I}_{CC} = \frac{1}{K}\sum_{k=1}^{K} \log p(\zeta^i_k | f^i_k, f^{-i}_k) - \frac{1}{K^2}\sum_{k=1}^{K}\sum_{j=1}^{K} \log p(\zeta^i_k | f^i_j, f^{-i}_k). \quad (11)$$

Since the second part of Eq. (11) requires $O(K^2)$ computation, we use the following faster counterpart:

$$\bar{I}_{CC} = \frac{1}{K}\sum_{k=1}^{K} \log p(\zeta^i_k | f^i_k, f^{-i}_k) - \frac{1}{K}\sum_{k=1}^{K} \log p(\zeta^i_k | f^i_{\mu(k)}, f^{-i}_k),$$

where $\mu(\cdot)$ maps $k \in \{1, ..., K\}$ to a random permutation of the indices. Since $\hat{I}_{CC}$ and $\bar{I}_{CC}$ are both unbiased estimators, $\mathbb{E}[\hat{I}_{CC}] = \mathbb{E}[\bar{I}_{CC}] = I_{CC}$. In practice, we assume $p(\zeta^i | f^i, f^{-i})$ to be a Gaussian distribution.
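The sampled estimator $\bar{I}_{CC}$ with a Gaussian conditional distribution can be sketched as follows. The closed-form mean function and the synthetic data are illustrative assumptions standing in for the learned variational network:

```python
import numpy as np

def bar_i_cc(zeta, f, f_neg, mean_fn, log_std):
    """Sampled Conditional-CLUB estimate, assuming a Gaussian
    p(zeta | f, f_neg) with mean mean_fn(f, f_neg) and fixed log-std."""
    K = zeta.shape[0]

    def log_prob(z, fi, fn):
        mu = mean_fn(fi, fn)
        return -0.5 * np.sum(((z - mu) / np.exp(log_std)) ** 2
                             + 2 * log_std + np.log(2 * np.pi), axis=-1)

    positive = log_prob(zeta, f, f_neg)              # matched pairs (k, k)
    perm = np.random.default_rng(1).permutation(K)   # mu(k): random permutation
    negative = log_prob(zeta, f[perm], f_neg)        # f_i shuffled across batch
    return np.mean(positive - negative)

rng = np.random.default_rng(0)
K, d = 256, 4
f, f_neg = rng.normal(size=(K, d)), rng.normal(size=(K, d))
zeta = f + f_neg + 0.1 * rng.normal(size=(K, d))     # zeta depends on both inputs
est = bar_i_cc(zeta, f, f_neg, mean_fn=lambda a, b: a + b, log_std=np.log(0.1))
assert est > 0                                       # dependence is detected
```

The positive term scores matched triples under the conditional density, while the permuted term breaks the $\zeta^i$-$f^i$ pairing; their gap is large here precisely because $\zeta^i$ carries information about $f^i$ given $f^{-i}$.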
We use a neural network with input $\mathrm{concat}(f^i, f^{-i})$ to calculate its mean and variance, and optimize it with the reparameterization trick (Kingma & Welling, 2013). Please refer to Appendix G.2 for a detailed description. We test the performance of the following methods with three sight ranges in Fig. 5: (1) two CTDE methods: Qatten (Yang et al., 2020) and REFIL (Iqbal et al., 2021); (2) one graph communication method: EMP (Agarwal et al., 2019); (3) four centralized communication methods: MAGIC (Niu et al., 2021), Gated_Qatten (Mao et al., 2020b), COPA (Liu et al., 2021), and our CAMA. MAGIC and Gated_Qatten are modified to suit the entity-wise input setting. The results show that CAMA outperforms all baselines in all sight range settings. Moreover, only CAMA performs better at SR = 1.5 than at SR = 1.0 and SR = 0.5, while the others suffer from attention distraction and perform worse at larger sight ranges.

5.2. COMPONENT ANALYSIS

Entity Dividing Module. In the EDM, the parameter α balances the observability between the AEM and the ARM. We explore how to choose α for different sight ranges in Fig. 6. To make the error bars (std) clearer, points with the same α are slightly offset on the x axis. Note that α = 1.0 means no constraint on the observability function, i.e., removing the EDM. We find that as the sight range increases, the agents see more entities, so a lower α helps attention concentration.

Attention Enhancement. We analyze the contribution of each loss in Table 1. $\mathcal{L}_{IM}$ (AEM) is important when SR is large, which is reasonable since a large SR brings the attention distraction issue and requires agents to focus on execution-related entities.

Table 3: Results on Traffic Junction. SR = 0 or 1 means the agents can see only themselves or can observe the 3 × 3 grid around them. "↑" means MI maximization and "↓" means minimization.

Attention Replenishment. Table 1 shows that $\mathcal{L}_{MI}$ (ARM) brings obvious improvement under both SR conditions. To check whether the improvement comes from the communication mechanism or the mutual information objective, we turn to a grid-world-based Traffic Junction environment (TJ). We use the map from Sukhbaatar et al. (2016), a crossroads where cars are continuously generated from a Poisson distribution at one of the four entrances and aim for one of the other three exits (Fig. 7). Unlike the original simplified setting, in which the routes are fixed so that agents only choose to accelerate or brake, we use a more difficult setting where agents can move freely in four directions, which is harder and more realistic. Please refer to Appendix G.2 for environment details. We compare four MI-based message generators under our entity-wise setting: MAIC&NDQ (Yuan et al., 2022; Wang et al., 2019), IMAC (Wang et al., 2020b), COPA (Liu et al., 2021), and our CAMA.
For a fair comparison of the MI objectives, we implement only the mutual information part of each method and keep everything else the same as ours, e.g., the structure of the neural networks and the communication mechanism. Since MAIC and NDQ use similar MI objectives under the centralized coach setting, we regard them as one method. The test returns are shown in Table 3, exhibiting the superior performance of our MI objective under limited sight ranges. We check the effect of the information compression degree β in $\mathcal{L}_{MI}$ for different environment difficulties in Table 2, and find that in simple tasks such as 1-lane TJ, the coach can handle $f^{-i}$ without obvious information compression, while in hard tasks, e.g., RC with SR = 0.5, a large β simplifies the communication messages so as not to distract the agents, bringing higher performance.

A PROOF OF THE MASK GENERATOR

We now prove that $M_s$ obtained by Eq. (5) is sufficient for Eq. (4). We first note that $M_f$ in Eq. (5) has at most $\lfloor \alpha n_e \rfloor$ ones per row, so $\|M_s\|_\infty = \|M \odot M_f\|_\infty \leq \|M_f\|_\infty \leq \alpha n_e$. We then show that $\neg M_s \odot \neg M = \neg M$, which is equivalent to $\neg(M_f \odot M) \odot \neg M = \neg M$. Recall that $M_f$ and $M$ are both 0-1 matrices with the same shape. For any position in the matrices, Table 4 enumerates all the situations:

Table 4: Truth table for the proof of the mask generator.

M   M_f   ¬(M_f ⊙ M) ⊙ ¬M   ¬M
0    0            1           1
0    1            1           1
1    0            0           0
1    1            0           0

In every situation, $\neg(M_f \odot M) \odot \neg M = \neg M$, therefore $\neg M_s \odot \neg M = \neg M$. In summary, $M_s$ obtained by Eq. (5) is sufficient for Eq. (4).
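The truth-table argument can also be verified numerically over random binary masks (an illustrative check, not part of the proof):

```python
import numpy as np

# Numerical check of the Appendix A identity: for binary masks with
# M_s = M * M_f, we always have (1 - M_s) * (1 - M) == (1 - M),
# i.e., neg(M_s) ⊙ neg(M) = neg(M) with neg(X) := 1 - X.
rng = np.random.default_rng(0)
for _ in range(100):
    M   = rng.integers(0, 2, size=(4, 16))   # random visibility mask
    M_f = rng.integers(0, 2, size=(4, 16))   # random top-k selection mask
    M_s = M * M_f                            # element-wise product (⊙)
    assert ((1 - M_s) * (1 - M) == (1 - M)).all()
```

The identity holds because $M_s \leq M$ element-wise, so every position hidden by $M$ is also hidden by $M_s$.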

B DERIVATION OF MUTUAL INFORMATION OBJECTIVE

We now give the derivation of Eq. (7):

$$\begin{aligned}
I(\zeta^i; f^{-i}) - \beta I(\zeta^i; s) &= H(\zeta^i) - H(\zeta^i|f^{-i}) - \left(\beta H(\zeta^i) - \beta H(\zeta^i|s)\right) \\
&= (1-\beta)H(\zeta^i) - (1-\beta)H(\zeta^i|s) + H(\zeta^i|s) - H(\zeta^i|f^{-i}) \\
&= (1-\beta)\left[H(\zeta^i) - H(\zeta^i|s)\right] - \left[H(\zeta^i|f^{-i}) - H(\zeta^i|f^i, f^{-i})\right] \quad \text{// split } s \text{ into } f^i \text{ and } f^{-i} \\
&= (1-\beta)I(\zeta^i; s) - I(\zeta^i; f^i|f^{-i}).
\end{aligned}$$

C PROOF OF THEOREM 4.1

Theorem 4.1. For three random variables $\zeta^i$, $f^i$, and $f^{-i}$,

$$I_{CC}(\zeta^i; f^i | f^{-i}) \geq I(\zeta^i; f^i | f^{-i}). \quad (14)$$

The equality holds if and only if $f^i$ is independent of the joint distribution of $(\zeta^i, f^{-i})$.

Proof. Let $\Delta$ be the gap between $I_{CC}(\zeta^i; f^i|f^{-i})$ and $I(\zeta^i; f^i|f^{-i})$:

$$\begin{aligned}
\Delta &:= I_{CC}(\zeta^i; f^i|f^{-i}) - I(\zeta^i; f^i|f^{-i}) \\
&= \mathbb{E}_{\zeta^i, f^i, f^{-i}}[\log p(\zeta^i|f^i, f^{-i})] - \mathbb{E}_{f^i}\mathbb{E}_{\zeta^i, f^{-i}}[\log p(\zeta^i|f^i, f^{-i})] \\
&\quad - \left(\mathbb{E}_{\zeta^i, f^i, f^{-i}}[\log p(\zeta^i|f^i, f^{-i})] - \mathbb{E}_{\zeta^i, f^{-i}}[\log p(\zeta^i|f^{-i})]\right) \\
&= \mathbb{E}_{\zeta^i, f^{-i}}[\log p(\zeta^i|f^{-i})] - \mathbb{E}_{\zeta^i, f^{-i}}\mathbb{E}_{f^i}[\log p(\zeta^i|f^i, f^{-i})] \\
&= \mathbb{E}_{\zeta^i, f^{-i}}\left[\log\left(\mathbb{E}_{f^i}\, p(\zeta^i|f^i, f^{-i})\right) - \mathbb{E}_{f^i}[\log p(\zeta^i|f^i, f^{-i})]\right].
\end{aligned}$$

Since $\log(\cdot)$ is a concave function, $\Delta \geq 0$ by Jensen's inequality.

In practice, we use a variational distribution $q(\zeta^i|f^i, f^{-i})$ to estimate $p(\zeta^i|f^i, f^{-i})$:

$$I_{vCC}(\zeta^i; f^i|f^{-i}) = \mathbb{E}_{\zeta^i, f^i, f^{-i}}[\log q(\zeta^i|f^i, f^{-i})] - \mathbb{E}_{f^i}\mathbb{E}_{\zeta^i, f^{-i}}[\log q(\zeta^i|f^i, f^{-i})]. \quad (16)$$

Theorem 1 gives a sufficient condition for $I_{vCC}(\zeta^i; f^i|f^{-i})$ to be an upper bound of $I(\zeta^i; f^i|f^{-i})$:

Theorem 1. Denote $q(\zeta^i, f^i, f^{-i}) = q(\zeta^i|f^i, f^{-i})p(f^i)p(f^{-i})$. If

$$KL\left(p(f^i)p(\zeta^i, f^{-i}) \,\|\, q(\zeta^i, f^i, f^{-i})\right) \geq KL\left(p(\zeta^i, f^i, f^{-i}) \,\|\, q(\zeta^i, f^i, f^{-i})\right),$$

then $I_{vCC}(\zeta^i; f^i|f^{-i}) \geq I(\zeta^i; f^i|f^{-i})$. The equality holds when $f^i$ and the joint distribution of $(\zeta^i, f^{-i})$ are independent.

Proof. Let $\Delta$ be the gap between $I_{vCC}(\zeta^i; f^i|f^{-i})$ and $I(\zeta^i; f^i|f^{-i})$.
We have:

$$\begin{aligned}
\Delta &= I_{vCC}(\zeta^i; f^i|f^{-i}) - I(\zeta^i; f^i|f^{-i}) \\
&= \mathbb{E}_{\zeta^i, f^i, f^{-i}}[\log q(\zeta^i|f^i, f^{-i})] - \mathbb{E}_{f^i}\mathbb{E}_{\zeta^i, f^{-i}}[\log q(\zeta^i|f^i, f^{-i})] \\
&\quad - \left(\mathbb{E}_{\zeta^i, f^i, f^{-i}}[\log p(\zeta^i|f^i, f^{-i})] - \mathbb{E}_{\zeta^i, f^{-i}}[\log p(\zeta^i|f^{-i})]\right) \\
&= \left(\mathbb{E}_{\zeta^i, f^{-i}}[\log p(\zeta^i|f^{-i})] - \mathbb{E}_{f^i}\mathbb{E}_{\zeta^i, f^{-i}}[\log q(\zeta^i|f^i, f^{-i})]\right) \\
&\quad - \left(\mathbb{E}_{\zeta^i, f^i, f^{-i}}[\log p(\zeta^i|f^i, f^{-i})] - \mathbb{E}_{\zeta^i, f^i, f^{-i}}[\log q(\zeta^i|f^i, f^{-i})]\right) \\
&= \mathbb{E}_{f^i}\mathbb{E}_{\zeta^i, f^{-i}}\left[\log \frac{p(\zeta^i|f^{-i})}{q(\zeta^i|f^i, f^{-i})}\right] - \mathbb{E}_{\zeta^i, f^i, f^{-i}}\left[\log \frac{p(\zeta^i|f^i, f^{-i})}{q(\zeta^i|f^i, f^{-i})}\right] \\
&= \mathbb{E}_{f^i}\mathbb{E}_{\zeta^i, f^{-i}}\left[\log \frac{p(\zeta^i|f^{-i})p(f^i)p(f^{-i})}{q(\zeta^i|f^i, f^{-i})p(f^i)p(f^{-i})}\right] - \mathbb{E}_{\zeta^i, f^i, f^{-i}}\left[\log \frac{p(\zeta^i|f^i, f^{-i})p(f^i)p(f^{-i})}{q(\zeta^i|f^i, f^{-i})p(f^i)p(f^{-i})}\right] \\
&= \mathbb{E}_{f^i}\mathbb{E}_{\zeta^i, f^{-i}}\left[\log \frac{p(f^i)p(\zeta^i, f^{-i})}{q(\zeta^i, f^i, f^{-i})}\right] - \mathbb{E}_{\zeta^i, f^i, f^{-i}}\left[\log \frac{p(\zeta^i, f^i, f^{-i})}{q(\zeta^i, f^i, f^{-i})}\right] \\
&= KL\left(p(f^i)p(\zeta^i, f^{-i}) \,\|\, q(\zeta^i, f^i, f^{-i})\right) - KL\left(p(\zeta^i, f^i, f^{-i}) \,\|\, q(\zeta^i, f^i, f^{-i})\right).
\end{aligned}$$

Theorem 1 reveals that $I_{vCC}$ is an MI upper bound if the variational joint distribution $q(\zeta^i, f^i, f^{-i})$ is closer to $p(\zeta^i, f^i, f^{-i})$ than to $p(f^i)p(\zeta^i, f^{-i})$. Let $q_\phi$ be the parameterization of $q$. In addition to Eq. (16), which optimizes $I_{vCC}$, we should minimize the KL divergence between $p(\zeta^i, f^i, f^{-i})$ and $q_\phi(\zeta^i, f^i, f^{-i})$:

$$\begin{aligned}
\min_\phi KL\left(p(\zeta^i, f^i, f^{-i}) \,\|\, q_\phi(\zeta^i, f^i, f^{-i})\right) &= \min_\phi \mathbb{E}_{\zeta^i, f^i, f^{-i}}\left[\log \frac{p(\zeta^i|f^i, f^{-i})p(f^i)p(f^{-i})}{q_\phi(\zeta^i|f^i, f^{-i})p(f^i)p(f^{-i})}\right] \\
&= \min_\phi \mathbb{E}_{\zeta^i, f^i, f^{-i}}\left[\log p(\zeta^i|f^i, f^{-i}) - \log q_\phi(\zeta^i|f^i, f^{-i})\right] \\
&= \max_\phi \mathbb{E}_{\zeta^i, f^i, f^{-i}}\left[\log q_\phi(\zeta^i|f^i, f^{-i})\right].
\end{aligned}$$

Therefore, with sample pairs $\{(\zeta^i_k, f^i_k, f^{-i}_k)\}_{k=1}^{K}$, we can maximize the log-likelihood of $q_\phi$:

$$\max_\phi \frac{1}{K}\sum_{k=1}^{K} \log q_\phi(\zeta^i_k | f^i_k, f^{-i}_k). \quad (20)$$

With sufficient optimization of Eq. (20), $I_{vCC}$ is guaranteed to be an MI upper bound. We test the effect of a coach with memory in the Resource Collection environment with SR = 0.5 and SR = 1.0 (Table 5).
We find that although equipping the coach with a memory module improves the average performance, it brings large variance, making the method unstable. Therefore, to keep our method's performance stable, we still use the coach with an MLP in the main paper. We also visualize the heat maps of our method and REFIL at single time steps. We choose the first attention head of each saved model and set the color bar's range of attention weights to [0, 0.35]. In Fig. 11, we visualize both methods' attention weights over all agents from t = 0 to t = 24, during which all agents are usually alive. We find that the large sight range distracts the attention of REFIL, while our method CAMA keeps the attention concentrated.

F HYPERPARAMETERS

We summarize the hyperparameters in Table 6.

G ENVIRONMENT DETAILS

G.1 RESOURCE COLLECTION

On a map of [-1, 1], agents are initialized at random places with team size sampled from [3, 8]. They need to collect resources from 6 resource points and transport the goods home. The locations of the resource points and the home are randomly sampled from the whole map at the start of each episode.

H HARDWARE

We ran experiments on 2 GPU servers, each with 8 RTX 3090 Ti GPUs and 2 AMD EPYC 7H12 CPUs. Each experiment (one seed) takes 12-24 hours on one GPU.

I GENERALIZABILITY

We test the generalizability of our method in the Resource Collection environment. We train each method for 10^7 time steps in the environment with 2-5 agents. Every 5×10^4 time steps, the model is evaluated on the environment with 6-8 agents for 160 episodes, and we plot the test return curves in Fig. 13. The results show that our method generalizes well to team sizes larger than those seen in training, which is probably due to the ARM module, whose message generator passes on global information.



Figure 1: The dynamic sight range dilemma. (a) Agents can hardly cooperate beyond their sight ranges. (b) Agents with large sight ranges may perform worse due to "attention distraction". (c) A sketch of our CAMA.

Figure 2: Network structure of our proposed CAMA. Entities and the observation mask are first fed into the Entity Dividing Module (EDM) to get the Attention Enhancement embedding f i and Attention Replenishment embedding f -i , which is trained by inverse model (IM) loss and mutual information (MI) objective, respectively. For RL training, a local Q i is generated by f i , f -i , and its observation-action history with a GRU, and further fed into the mixing network for Q tot .

Figure 4: (a) A sketch of the environment "Resource Collection". (b) A sketch of the demo "Catch Apple". (c) Task solving rate in "Catch Apple". Qatten and REFIL are CTDE; MAIC and CAMA have a centralized coach. (d) Visualization of received messages (grouped by the next action agents take).

Figure 5: Test returns of 3 sight ranges (SR) on RC.

Figure 6: The Effect of α on different sight ranges (SR).

β     TJ (1-lane)    TJ (2-lane)    RC (SR=1)       RC (SR=0.5)
      30.89±2.54     95.54±5.73     389.47±41.88    366±15.35
0.3   29.9±2.99      99.64±4.45     398.74±68.33    379.8±36.11
0.5   29.89±3.21     95.85±3.49     458.13±46.93    385.25±25.61
0.7   27.58±1.23     91.32±0.74     373.19±25.43    396±63.19
0.9   27.29±0.74     91.28±4.23     354.06±20.24    369.4±52.52

Table 2: The effect of β on different tasks: 1-lane and 2-lane Traffic Junction, and Resource Collection with sight range (SR) 1 and 0.5. From left to right, the task gets harder.

Figure 7: Traffic Junction teaser (SR=1).

HARD TASK SMAC

We test CAMA's performance on the hard SMAC (Samvelyan et al., 2019) tasks with dynamic teams in this section. We use the setting from Liu et al. (2021) and Iqbal et al. (2021): at the start of each episode, a total of 3-8 agents are randomly divided into 2-4 groups and initialized at different places on the edge of a circle with radius 9, while enemies are divided into 1-2 groups, Fig. 8(a). Agents have a sight range of 9. Agents must learn to find teammates first before fighting against enemies of greater numbers. We show the results of the test win rate (TWR) on 3 maps in Fig. 8(b). Our method remarkably exceeds the current SOTA methods on all maps.

Figure 8: (a) An initialization teaser on SC2. (b) TWR comparisons on the 3 SC2 maps.

Figure 9: TWR comparisons with different SRs (a) and dynamic team composition (b). The x axis is the agent number, which varies from 3-8. The y axis is log(ω/ω̄), where ω is the winning rate and ω̄ is the agents' average winning rate. The black dotted line denotes ω̄.

Dynamic Team Composition. In Fig. 9(b), we test the same model on different team sizes and plot the logarithm of the relative winning rate for the corresponding agent number against all agents, where CAMA shows outstanding robustness to agent-number variation with a nearly stationary performance curve.

E HEAT MAPS OF ATTENTION WEIGHTS

E.1 AVERAGE HEAT MAPS OF MORE METHODS

As mentioned in Sec. 4.1, we visualize the attention weights of four methods in Fig. 10. Only our CAMA pays more attention to the agents themselves when the sight range is large, and therefore keeps a high performance.

Figure 10: Attention Weights of Qatten, COPA, REFIL and our CAMA.

Detailed environment parameters in the two versions of the Traffic Junction environment: max_steps is the length of each episode; entrances and targets indicate the number of choices of entrance and target (in the hard version, cars are restricted to keep to the right when entering or exiting, so there are only 8 entrances and targets). The total number of cars is fixed at N_max; cars are turned inactive when reaching the target or colliding with others, and new cars are added to the environment with probability p_arrive at every time step, which varies during training consistent with the curriculum learning in Singh et al. (2018). The detailed environment parameter settings are indicated in Table.

Figure 12: Traffic Junction Environment. Cars are navigating to their own assigned target and avoiding collision. There are two difficulty levels.

) satisfies ||M_f||_∞ ≤ αn_e. Recall the definition of M_f: it is an indicator matrix in which, at each row, only the positions of the top ⌊αn_e⌋ values of the corresponding row in the attention weight matrix are 1, and the others are 0. Therefore, at each row of M_f there are at most ⌊αn_e⌋ ones, which means the sum of the absolute values in each row is no larger than ⌊αn_e⌋. So ||M_f||_∞ ≤ ⌊αn_e⌋ ≤ αn_e. Next we show ||M_s||_∞ ≤ αn_e. Since the observability mask M is also a 0-1 mask, ||M_s
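The row-wise construction of M_f described in the proof can be sketched numerically. This is a minimal sketch (the function name `top_k_mask` is ours, not from the paper) that also illustrates the row-sum bound:

```python
import numpy as np

def top_k_mask(attn, alpha):
    """Build the indicator matrix M_f described above: per row, 1 at the
    positions of the top floor(alpha * n_e) attention weights, 0 elsewhere."""
    n_e = attn.shape[1]
    k = int(np.floor(alpha * n_e))
    mask = np.zeros_like(attn, dtype=int)
    if k > 0:
        top_idx = np.argsort(attn, axis=1)[:, -k:]  # indices of the k largest per row
        np.put_along_axis(mask, top_idx, 1, axis=1)
    return mask

attn = np.array([[0.10, 0.50, 0.20, 0.20],
                 [0.40, 0.10, 0.30, 0.20]])
M_f = top_k_mask(attn, alpha=0.5)  # keeps k = 2 of n_e = 4 entities per row
# each row sum is at most floor(alpha * n_e), so ||M_f||_inf <= alpha * n_e
print(np.abs(M_f).sum(axis=1))  # → [2 2]
```

Because every row sum is capped at ⌊αn_e⌋, the infinity norm bound ||M_f||_∞ ≤ αn_e follows directly.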

Logical table of the mask.

D GENERATE MESSAGES WITH MEMORY

In Sec. 4.4, we use concat(f^i, f^{-i}) to generate the message ζ^i. This means we have the true prior distribution p(ζ^i | f^i, f^{-i}), which can conveniently be applied to estimate I_CC. However, if we wish to enhance the coach's ability, e.g., by introducing a memory structure (similar to the GRU cell in local agents) into the message generator, the prior distribution turns into p(ζ^i | ρ), and we can therefore no longer obtain p(ζ^i | f^i, f^{-i}) directly to estimate I_CC. Similar to the idea in Cheng et al. (2020), we propose a variational term q

Comparison on the style of coach.

Name            Description                                                            Value
γ               Discount factor                                                        0.99
ε anneal time   Time-steps for ε to anneal from ε_s to ε_f. ε is the probability of    500000
                agents choosing random actions.
                Clipping value for all gradients                                       10

Table 6: Hyper-parameters.

The radius of the home is 0.1, and that of each resource location and agent is 0.05. There are 3 kinds of resources, and each agent i has its own ability b_i^e, uniformly sampled from {0.1, 0.5, 0.9}, to collect each kind of resource e. Each agent can accelerate in 4 directions or apply no force at each time step. Each agent has its maximal speed uniformly sampled from {0.3, 0.5, 0.7}, and the acceleration is fixed to 3.0. Every time agent i collects resource e, the team gets a reward of 10·b_i^e. When an agent brings a resource home, the team gets a reward of 1. An agent can only carry one resource at a time, which means an agent needs to bring the collected resource home before it starts to collect the next one. The episode limit is 145. The number of agents for training is uniformly sampled from {2,3,4,5}, while for testing it is sampled from {6,7,8}. Each agent has a sight range SR. Entities, including other agents and resource points, that exceed agent i's SR are invisible to agent i.
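The team-reward logic described above can be sketched as follows; the function name and event encoding are hypothetical, used only to illustrate the 10·b_i^e collection reward and the delivery reward of 1:

```python
def team_reward(events):
    """Team reward for one time step in Resource Collection, as described
    above: agent i collecting resource e yields 10 * b_i^e, and a successful
    delivery home yields 1. `events` is a hypothetical list of
    ("collect", b) or ("deliver",) tuples for the current step."""
    r = 0.0
    for ev in events:
        if ev[0] == "collect":
            r += 10.0 * ev[1]  # b_i^e is uniformly sampled from {0.1, 0.5, 0.9}
        elif ev[0] == "deliver":
            r += 1.0
    return r

# one agent with ability 0.5 collects while another delivers: 10*0.5 + 1 = 6
print(team_reward([("collect", 0.5), ("deliver",)]))  # → 6.0
```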

G.2 TRAFFIC JUNCTION

The simulated traffic junction environment from Sukhbaatar et al. (2016) has been a conventional and useful testbed for the performance of multi-agent communication algorithms (Singh et al., 2018; Das et al., 2019; Liu et al., 2020). Despite its great success, cars in the original traffic junction environment can only move along pre-assigned routes on one or more road junctions, so the action space for each car consists of only two actions, i.e., gas and brake, which restricts its ability to simulate real-world environments and test communication algorithms. Moreover, the original observation does not fit the need for entity-wise input. So, we modify the original traffic junction environment into a more flexible version. Instead of pre-assigned routes, a randomly selected navigation target is assigned to each active car, and the aim of each active car is to navigate to its own target while avoiding collisions, which occur when two cars are on the same location. Along with the modified setting, the observation for each active car is altered to an entity-wise form that contains the positions of cars within a limited visibility (e.g., 3 x 3 for sight_range = 1), and the action space is more flexible with five actions, i.e., forward, back, left, right, and wait. Besides, the reward consists of a linear time penalty -0.01τ, where τ is the number of active time steps for the car since the last resurrection, a collision penalty r_collision = -10, and the difference in Manhattan distance to the target between the previous and the current step. We choose two different difficulty levels following the settings in Singh et al. (2018), illustrated in Fig. 12. Moreover,

Under review as a conference paper at ICLR 2023

Figure 13: Generalizability on the task Resource Collection. Each method is trained with 2-5 agents and tested with 6-8 agents every 50k time steps to plot the curve.
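The per-car reward described above can be sketched as follows. This is a minimal sketch: the function name is ours, and the sign of the distance-shaping term is assumed to reward progress toward the target.

```python
def car_reward(tau, collided, prev_pos, cur_pos, target):
    """Per-car reward in the modified Traffic Junction: a linear time penalty
    -0.01 * tau (tau = active steps since the last resurrection), a collision
    penalty of -10, and the difference in Manhattan distance to the target
    between the previous and the current step."""
    manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
    r = -0.01 * tau                                  # linear time penalty
    if collided:
        r += -10.0                                   # collision penalty
    r += manhattan(prev_pos, target) - manhattan(cur_pos, target)  # progress term
    return r

# one step closer to the target after 5 active steps, no collision:
print(round(car_reward(5, False, (0, 0), (0, 1), (0, 3)), 2))  # → 0.95
```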

