OFFLINE COMMUNICATION LEARNING WITH MULTI-SOURCE DATASETS

Abstract

Scalability and partial observability are two major challenges in multi-agent reinforcement learning (MARL). Recently, researchers have proposed offline MARL algorithms to improve scalability by reducing online exploration costs, while the problem of partial observability is often ignored in the offline MARL setting. Communication is a promising approach to alleviate the miscoordination caused by partial observability, so in this paper we focus on offline communication learning, where agents learn from a fixed dataset. We find that learning communication in an end-to-end manner from a given offline dataset without communication information is intractable, since the space of correct communication protocols is too sparse compared with the exponentially growing joint state-action space as the number of agents increases. Besides, unlike offline policy learning, which can be guided by reward signals, offline communication learning struggles because communication messages only implicitly impact the reward. Moreover, in real-world applications, offline MARL datasets are often collected from multiple sources, making offline MARL communication learning even more challenging. Therefore, we present a new benchmark that contains a diverse set of challenging offline MARL communication tasks with single-/multi-source datasets, and propose a novel Multi-Head structure for Communication Imitation learning (MHCI) algorithm that automatically adapts to the distribution of the dataset. Empirical results show the effectiveness of our method on various tasks of the new offline communication learning benchmark.

Under review as a conference paper at ICLR 2023

A wide range of multi-agent systems benefit from effective communication, including electronic games such as StarCraft II (Rashid et al. (2018)) and soccer (Huang et al. (2021)), as well as real-world applications, e.g., autonomous driving (Shalev-Shwartz et al. (2016)) and traffic control (Das et al. (2019)). Although several offline MARL algorithms (Yang et al. (2021); Pan et al. (2022)) have been proposed recently to tackle the scalability challenge, how to deal with partial observability in the offline MARL setting has not received much attention. Unfortunately, simply adopting a communication mechanism in offline MARL, i.e., learning communication in an end-to-end manner from offline datasets, is still problematic. Finding effective communication protocols without any guidance can be the bottleneck, especially as the task scale increases, and training may converge to sub-optimal communication protocols that harm downstream policy learning. To handle this problem, in this paper we investigate the new area of offline communication learning, where multiple agents learn communication protocols from a static offline dataset containing extra communication information. We call this kind of dataset a "communication-based dataset" to distinguish it from single-agent offline datasets. In real-world applications, a communication-based dataset may be collected under a variety of existing communication protocols, such as handcrafted protocols designed by experts or hidden protocols learned by other agents. Therefore, communication-based datasets can be used in offline MARL to boost the performance of downstream tasks. Previous offline RL works focus on eliminating the problem of distributional shift, while offline MARL communication learning faces different challenges. Unlike policy learning, which is directly guided by reward signals since actions influence the expected return, communication learning is hard to evaluate since communication only implicitly influences the agents. What is worse, offline datasets in the real world are likely to be multi-source, so trajectories may be sampled under different communication protocols as well as different policies. The multi-source property introduces extra challenges, as we cannot simply imitate the dataset communication.
Offline communication learning algorithms need to distinguish the source of each trajectory before learning from it. In this paper, we propose Multi-head Communication Imitation (MHCI), which accomplishes multi-source classification and message imitation at the same time. To the best of our knowledge, MHCI is the first method to learn a composite communication protocol from a multi-source communication-based dataset. We also provide a theoretical explanation of its optimality with respect to the dataset.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) is essential for many real-world tasks where multiple agents must coordinate to achieve a joint goal. However, the problems of scalability and partial observability limit the effectiveness of online MARL algorithms. The large joint state-action space makes exploration costly, especially when the number of agents increases. On the other hand, partial observability requires communication among agents to make better decisions. Plenty of previous works in MARL try to find solutions for these two challenges, with the hope of making cooperative MARL applicable to more complicated real-world tasks (Sukhbaatar et al. (2016); Singh et al. (2019); Yang et al. (2021)). Recently, emerging research has applied offline RL to cooperative MARL in order to avoid costly exploration across the joint state-action space, thereby improving scalability. Offline RL is defined as learning from a fixed dataset instead of online interactions. In the context of single-agent offline RL, the main challenge is the distributional-shift issue: the learned policy reaches unseen state-action pairs whose values are not correctly estimated. By constraining the learned policy to the behavioral policy, offline RL has gained success on diverse single-agent offline tasks like locomotion (Fu et al. (2020)) and planning without expensive online exploration (Finn & Levine (2017)). As the problem of scalability can be promisingly alleviated by utilizing offline datasets, the other challenge of MARL, partial observability, can be addressed by introducing communication during coordination. Communication allows agents to share information and work as a team.
To better evaluate the effectiveness of our algorithm and to support further study, we propose an offline communication learning benchmark, including environments from previous works and additional environments that require sophisticated communication. The empirical results show that Multi-head Communication Imitation (MHCI) successfully combines and refines the information in the communication-based dataset and thus performs well on diverse challenging tasks of the offline communication learning benchmark. Our main contributions are twofold: 1) we analyze the new challenges in offline communication learning and introduce a benchmark of offline communication learning that contains diverse tasks; 2) we propose an effective algorithm, Multi-head Communication Imitation (MHCI), which addresses the problem of learning from single-source or multi-source datasets and shows superior performance in various environments of our benchmark.

2. RELATED WORKS

MARL with communication. Multi-agent reinforcement learning has attracted great attention in recent years (Tampuu et al. (2017); Matignon et al. (2012); Mordatch & Abbeel (2017); Wen et al. (2019)). In MARL, the framework of centralized training and decentralized execution (CTDE) has been widely adopted (Kraemer & Banerjee (2016); Lowe et al. (2017)). For cooperative scenarios under this framework, COMA (Foerster et al. (2018)) assigns credit to different agents based on a centralized critic and counterfactual advantage functions, while another series of works, including VDN (Sunehag et al. (2018)), QMIX (Rashid et al. (2018)) and QTRAN (Son et al. (2019)), achieves this by applying value-function factorization. These MARL algorithms show remarkable empirical results on the popular StarCraft unit micromanagement benchmark (SMAC) (Samvelyan et al. (2019)). CommNet (Sukhbaatar et al. (2016)), RIAL and DIAL (Foerster et al. (2016)) are seminal works that allow agents to learn how to communicate with each other in MARL. CommNet and DIAL design the communication structure in a differentiable way to enable end-to-end training, while RIAL trains communication with RL algorithms by constraining messages to be discrete and treating them as another kind of action. To make communication more effective and efficient, IC3Net (Singh et al. (2019)), Gated-ACML (Mao et al. (2020b)), and I2C (Ding et al. (2020)) utilize a gate mechanism to decide when and with whom to communicate. TarMAC (Das et al. (2019)) and DAACMP (Mao et al. (2020a)) achieve targeted communication by introducing attention mechanisms. NDQ (Wang* et al. (2020)) uses information-theory-based regularizers to minimize communication. MAIC (Yuan et al. (2022)) realizes sparse and effective communication by modeling teammates.

Offline MARL. Offline RL has been a hot topic in the last few years. It focuses on RL given a static dataset. The problem of distributional shift is critical for offline RL (Fujimoto et al. (2019)), and typical solutions include constraining the policy (Kumar et al. (2019); Wu et al. (2020)) or the Q value (Kumar et al. (2020)). Recently, offline MARL has also attracted interest. As a combination of the two areas, offline RL and MARL, it faces the new challenge of a larger state-action space. Yang et al. (2021) propose an even more conservative method than single-agent offline RL. Pan et al. (2022) find that conservative offline RL algorithms converge to local optima in the multi-agent setting and thus propose better optimization methods.

Multi-source fusion. Works in other fields also address multi-source fusion. For example, Sun et al. (2021) use an attention mechanism to fuse multiple word features for downstream tasks.

3. BACKGROUND

In fully cooperative multi-agent RL, the environment is generally modeled as a Dec-POMDP (Oliehoek & Amato (2016)). Adding communication, a Dec-POMDP with communication is defined as G = <I, S, A, P, R, Ω, O, n, γ, M>. I = {1, 2, ..., n} is the set of agents. s ∈ S, a_i ∈ A, o_i ∈ Ω are the state, the action, and the observation of an agent, and the observation of agent i is computed by the observation function o_i = O(s, i). P(s'|s, a) and R(s, a) are the transition function and the reward function, and γ ∈ [0, 1) is the discount factor. When communication is added to this model, m_ij denotes the message conveyed from agent i to agent j. In practice, we take s = o_: by default. The offline dataset is denoted as D. Based on these definitions, we further denote the action-observation history τ_i ∈ T ≡ (Ω × A)*, and a policy π_i(m_:i, τ_i) is defined on the history. To unify the notation of the communication protocols proposed by previous works, we first give a general form of communication:

m_ij = C_ij(o_i),  a_j = π_j(m_:j, o_j, τ_j)  (∀i, j ∈ I),  (1)

where C_ij is an arbitrary communication function and the subscript : abbreviates all agents. Although o_j is included in τ_j, we abuse the notation here for convenience. In Equation 1, each message from i to j is generated by the communication function, and an action is determined through the policy function π_j, which takes as input all received messages, the individual observation, and the individual history if needed. In Appendix A, we list the communication protocols of previous works. Generally speaking, they incorporate different inductive biases and can all be viewed as modifications of Equation 1.
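The general form in Equation 1 can be sketched as a minimal protocol interface. This is an illustrative toy implementation, not the paper's code: the linear per-pair communication functions and the threshold policy are assumptions made purely for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, obs_dim, msg_dim = 3, 4, 2

# One linear communication function C_ij per (sender, receiver) pair,
# mirroring m_ij = C_ij(o_i) in the general form (Equation 1).
C = rng.normal(size=(n_agents, n_agents, msg_dim, obs_dim))

def communicate(obs):
    """obs: (n_agents, obs_dim) -> messages m[i, j] from sender i to receiver j."""
    return np.einsum("ijmo,io->ijm", C, obs)

def policy(j, messages, obs):
    """a_j = pi_j(m_:j, o_j): a toy policy acting on all received messages."""
    received = messages[:, j].reshape(-1)   # m_:j, every message sent to agent j
    return int(received.sum() + obs[j].sum() > 0)

obs = rng.normal(size=(n_agents, obs_dim))
m = communicate(obs)
actions = [policy(j, m, obs) for j in range(n_agents)]
```

The protocols listed in Appendix A (broadcasting, targeted, incentive communication) can all be obtained by constraining or restructuring `communicate` and `policy` in this interface.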

4. A MOTIVATING EXAMPLE

Before digging into the specific method of offline communication learning, we first illustrate why learning communication directly from a communication-based dataset is important, rather than learning it in an end-to-end manner, i.e., from scratch. In this section, we analyze the influence of the number of agents n and the dimension of states p by comparing performance in an imaginary communication game with tunable n, p. The number of agents corresponds to the scalability of a task, and the dimension of states reflects the complexity of the information sources (e.g., images are complex, while velocity and acceleration information is not). The imaginary communication game consists of n agents, with state dimension p and observation dimension q. 2q (not q) dimensions of the state space are relevant to the policy of each agent in Equation (4); therefore, besides the q dimensions that are directly observed as defined in Equation (3), the other q dimensions are unobserved and must come from communication. The max horizon is T = 1, and each of the p state dimensions is randomly initialized from {-1, 1}. Details are in Appendix C.

T = 1, s ∈ {-1, 1}^p  (2)
o_i = (s_{index(i)_1}, s_{index(i)_2}, ..., s_{index(i)_q})  (3)
a*_i = 1 if Σ_{j=1,...,2q} s_{index(i)_j} > 0, and 0 otherwise (a*_i is the optimal action of agent i)  (4)
r = Σ_{i=1,...,n} I(a_i = a*_i).  (5)

Figure 1 shows how the performance decays with increasing n and p. In general, learning communication from the dataset performs better than learning communication from scratch, especially when the task becomes harder with larger n or p. The basic reason is that, in challenging tasks where the whole state space is enormous, pure offline MARL methods still have difficulty finding the unobserved information relevant to an optimal policy, even though hard exploration in the state space is no longer a problem. A randomly initialized communication function may converge to a sub-optimum that confuses the downstream policy to some extent.
For example, in a multi-agent navigation task with image inputs, the agent needs to communicate whether the goal is in sight. However, there's too much redundant information in the image, so the communication tends to converge to easier patterns like the color or background of the goal, which are misleading in downstream policy learning. 
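The imaginary game in Eqs. (2)-(5) can be sketched directly. The code below is a toy rendering under assumptions of ours (the `index(i)` sets are drawn uniformly at random, and the "oracle" agent receives the unobserved q dimensions via an ideal communication channel); it is not the paper's implementation.

```python
import numpy as np

def play_game(n=4, p=8, q=2, seed=0):
    """Toy version of the game in Eqs. (2)-(5): horizon T = 1, s in {-1, 1}^p.
    Agent i observes q state dims, but its optimal action depends on 2q dims,
    so the other q dims must arrive through communication."""
    rng = np.random.default_rng(seed)
    s = rng.choice([-1, 1], size=p)                       # Eq. (2)
    # index(i): the 2q state dims relevant to agent i (random, illustrative)
    index = [rng.choice(p, size=2 * q, replace=False) for _ in range(n)]
    obs = [s[idx[:q]] for idx in index]                   # Eq. (3): observed dims
    a_star = [int(s[idx].sum() > 0) for idx in index]     # Eq. (4): optimal action
    # An ideal channel delivers the unobserved q dims to each agent
    msgs = [s[idx[q:]] for idx in index]
    actions = [int(o.sum() + m.sum() > 0) for o, m in zip(obs, msgs)]
    reward = sum(a == a_s for a, a_s in zip(actions, a_star))  # Eq. (5)
    return reward, n

r, n = play_game()   # with ideal communication, every agent acts optimally
```

With the ideal channel, `o.sum() + m.sum()` reconstructs the full sum over `index(i)`, so the reward equals n; the experiments in Figure 1 measure how far learned communication falls short of this.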

5. METHOD

In this work, we propose Multi-head Communication Imitation (MHCI) for offline communication learning, as shown in Figure 2. In Section 5.1, we give complete definitions of multi-source communication in the offline setting and derive offline optimality with respect to the communication in the dataset. Then, in Section 5.2, the full structure of MHCI is introduced, which learns a composite communication protocol that supports the optimal policy according to the offline dataset.

5.1. UNIVERSAL AND LOCAL COMMUNICATION IN THE OFFLINE SETTING

In order to introduce the concepts of Universal and Local Communication, we first define how the multi-source dataset is split. Denote a communication function as C(o_:) : Ω^n → R^{p·n²} (p is the dimension of each message m_ij), which specifically is C((o_1, ..., o_n)) = (C_11(o_1), C_12(o_1), ..., C_1n(o_1), C_21(o_2), ..., C_2n(o_2), ..., C_n1(o_n), ..., C_nn(o_n)). Assume that the communication in dataset D can be discriminated by source into |G| groups, with state domains and communication functions G = {(S^(1), C^(1)(·)), (S^(2), C^(2)(·)), ..., (S^(|G|), C^(|G|)(·))}. We refer to each dataset communication function C^(i) as a Local Communication (LC).

Definition 1. Communication function C^(1) dominates C^(2) if

∃f, s.t. ∀j, k ∈ [n], f_jk(C^(1)_jk(o_j)) = C^(2)_jk(o_j).  (8)

Definition 2. Define the Universal Communication (UC) as a communication function that contains all the dataset communication functions: UC : S → R^*, with

∃f^(1), ..., f^(|G|), ∀i ∈ [|G|], j, k ∈ [n], f^(i)_jk(UC_jk(o_j)) = C^(i)_jk(o_j), i.e., ∀i ∈ [|G|], UC dominates C^(i).  (9)
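The domination relation in Definition 1 can be illustrated concretely. In this sketch (the protocols `C1`, `C2` and the witness `f` are hypothetical examples of ours), a protocol that sends the raw observation dominates one that sends only its sign, because `f = sign` maps the richer messages onto the coarser ones.

```python
import numpy as np

# Illustrative check of Definition 1: C1 dominates C2 if some function f
# satisfies f(C1(o)) = C2(o) for all observations o.
def C1(o):                 # richer protocol: transmits the full observation
    return o

def C2(o):                 # coarser protocol: transmits only the sign
    return np.sign(o)

f = np.sign                # candidate witness with f(C1(o)) = C2(o)

rng = np.random.default_rng(0)
obs = rng.normal(size=(100, 5))            # a batch of sampled observations
dominates = bool(np.array_equal(f(C1(obs)), C2(obs)))
```

The converse fails: no function can recover the raw observation from its sign, so C2 does not dominate C1 — which is exactly why a Universal Communication must be at least as informative as every Local Communication.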

Compared to a Local Communication, the Universal Communication includes the information of all the Local Communications C^(1), C^(2), ..., C^(|G|). Having introduced Universal and Local Communication, we prove in Theorem 1 that the Universal Communication is sufficient for obtaining policies that match or outperform the dataset. For simplicity, denote the optimal policy and the optimal value function under the Dec-POMDP T_C(M) (transformed from the real MDP M by the communication function C) as π*_{T_C(M)} and V*_{T_C(M)}.

Theorem 1. The optimal expected return based on the Universal Communication is greater than or equal to that of the dataset communications and policies, i.e.,

∀(S, LC, π) ∈ G_π, V*_{T_UC(M_D)}(s_init) ≥ V_{T_LC(M_D), π}(s_init).  (10)

Theorem 1 means that the optimal policy given UC(·) is always equal to or better than that of the dataset, under the dataset MDP M_D as defined in Fujimoto et al. (2019). The whole proof is in Appendix B. In fact, a provably correct Universal Communication is at least the concatenation of all Local Communications, because without additional assumptions on communication, the Local Communications may contain distinct information. As a result, we actually approximate the Universal Communication in the following algorithm design.

5.2. MULTI-HEAD COMMUNICATION IMITATION

In Section 5.1, we showed that to guarantee optimality with respect to the dataset policies, combining all the information contained in the dataset communication is essential for the subsequent policy training.

Multi-head structure. We design a multi-head structure to classify the true category of each dataset message. The structure is inspired by the widely used attention module, where a query and a key are first computed from the input, and the weight of each element is obtained from the inner product of key and query. In our multi-head structure, we compute the key from the dataset communication and the query from the observation. After taking the inner product and applying the softmax operator, we get a prediction probability for each category, which is used in the overall imitation loss in Equation 11. In practice, the communication heads C^(1), C^(2), ..., C^(k) share most network parameters except the last two layers for efficiency.

Loss_Imitation = Σ_{i=1}^{k} prob_i · MSE(C^(i)(o_D), m_D).  (11)

Figure 3: The schematics of the network structure and the loss computation in Equation 11.

Linear fusion. Although the multi-head structure successfully learns the communication function of each category, we still need to fuse the generated messages of all categories in the testing phase. A trivial fusion method is to concatenate them all, but it is bandwidth-consuming, as the dimension of all received messages m_:i becomes k · p. To fuse the different categories of message outputs without adding extra computational burden, we use linear fusion, which is sufficient in most cases empirically. It can be viewed as an approximation of the concatenated messages. We fuse the received messages as m_fusion = A_{kp×p}(m^(1), m^(2), ..., m^(k))^T, where (m^(1), m^(2), ..., m^(k)) are the learned messages, each of dimension p, and k is the total number of heads in MHCI. Further investigations of linear fusion are in the ablation study in Section 6.3.
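The attention-like classification and the probability-weighted imitation loss of Equation 11 can be sketched as follows. This is a minimal numpy rendering under assumptions of ours: per-head key projections, a single query projection, and random weights stand in for the trained networks.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mhci_loss(obs, m_dataset, heads, keys, W_query):
    """Sketch of Eq. (11): keys come from the dataset message, the query
    from the observation; inner products + softmax give the per-category
    probability, which weights each head's imitation MSE."""
    query = W_query @ obs
    logits = np.array([query @ (K @ m_dataset) for K in keys])
    prob = softmax(logits)                       # predicted source probabilities
    mse = np.array([np.mean((h(obs) - m_dataset) ** 2) for h in heads])
    return float(prob @ mse), prob               # Loss = sum_i prob_i * MSE_i

rng = np.random.default_rng(0)
obs_dim, msg_dim, d, k = 6, 3, 4, 2
obs = rng.normal(size=obs_dim)
# k hypothetical communication heads C^(i) (here: random linear maps)
heads = [(lambda o, A=rng.normal(size=(msg_dim, obs_dim)): A @ o) for _ in range(k)]
m_data = heads[0](obs)                           # dataset message from source 0
keys = [rng.normal(size=(d, msg_dim)) for _ in range(k)]
W_query = rng.normal(size=(d, obs_dim))
loss, prob = mhci_loss(obs, m_data, heads, keys, W_query)
```

In the actual method, `prob` is produced by the trained attention module and the heads share most parameters; here everything is random, so only the shapes and the loss computation of Equation 11 are meaningful.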
In practice, the fusion matrix A_{kp×p} is learned through the downstream RL loss. Since the dataset communication is probably not optimal, we also add an additional learned communication head. The optimization of the linear fusion is thus

A, ϕ = argmin_{A,ϕ} Loss_RL(o, a, A(m^(1), m^(2), ..., m^(k), C^(k+1)_ϕ(o))^T).
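The fusion step itself is a single matrix product over the concatenated head outputs plus the extra learned head. A shape-only sketch (the matrix `A` is random here; in the method it would be trained by the downstream RL loss):

```python
import numpy as np

def linear_fusion(messages, A):
    """Compress the per-head messages (each of dimension p) into one
    p-dimensional message: m_fusion = A @ concat(m^(1), ..., m^(k+1))."""
    stacked = np.concatenate(messages)      # dimension (k + 1) * p
    return A @ stacked                      # dimension p

rng = np.random.default_rng(0)
k, p = 3, 4
msgs = [rng.normal(size=p) for _ in range(k + 1)]   # k heads + 1 extra learned head
A = rng.normal(size=(p, (k + 1) * p))               # learned fusion matrix (random here)
m_fused = linear_fusion(msgs, A)                     # bandwidth stays at p, not (k+1)*p
```

The design choice is bandwidth: concatenation would multiply the message dimension by the number of heads, while the linear map keeps it constant and lets training decide how to weight each source.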

6. EXPERIMENT

In order to evaluate offline communication learning algorithms, we introduce a benchmark consisting of various communication-intensive cooperative multi-agent tasks. Tasks introduced in previous research are included, and we also create new tasks that require sophisticated communication. The details of the benchmark are given in Section 6.1. All the environments except Room (a new environment introduced in Section 6.1) are taken from previous works.

The first added environment is Room. Each agent corresponds to one of the goals (diamonds in Figure 4) and acquires reward only when it arrives at the goal of the same color. The goals are initialized randomly in the upper row, while the agents are initialized randomly in the lower row. The challenge comes from the fact that each agent has no idea where its corresponding goal is. The optimal joint policy is therefore to first go up to check the color of the nearby diamond and then send a coordinate message to the agent with the same color. In this way, every agent becomes aware of its proper destination and can head for the correct goal to obtain the reward. We also design a coordination task in StarCraft II called 6o1b vs 1r. In this environment, there are 6 observers who cannot move, and 1 agent (between the two observers on the top) needs to reach the enemy's position and kill it. The 6 observers and the agent are initialized at fixed positions, while the enemy is randomly initialized in the lower-left or lower-right corner. Each agent has a limited sight range, so the message must be relayed from the lowest observer up through the other observers and finally to the agent that can move to the enemy. While designing new environments with sophisticated communication, we identified the following patterns that make a task particularly challenging.

• Inconsistent communication: the communication is inconsistent among different receivers. Unlike broadcasting part of one's observation to everyone, sending only the necessary message without redundant information is more effective.
• Delayed-effect communication: the optimal policy may depend on messages received in the past. Optimizing in an environment with such a communication pattern requires backpropagating recursively through the historical policy network (e.g., an RNN), adding difficulty to convergence.

In conclusion, the benchmark contains different kinds of environments. Some come from previous online cooperative MARL research, and others are new environments where sophisticated communication is essential for solving the task. Specifically, we provide moderate-size datasets in order to investigate how offline communication learning helps downstream policy learning in challenging tasks. Although real-world tasks involve larger multi-agent systems and more complicated observations than our experimental settings, we hope the offline communication learning benchmark can bridge the gap between experimental RL tasks and more challenging real-world applications.

(Displaced rows of Table 1: 67.2 ± 6.4 | 49.9 ± 1.7 | 61.0 ± 1.2; 6o1b vs 1r-medium-random: 45.0 ± 5.7 | 44.0 ± 6.5 | 44.0 ± 1.4)


Table 1: The performance comparison of our method MHCI, learning communication from scratch, and pure imitation, normalized by the score of the expert (from 0 to 100). The results are averaged over three random seeds with standard deviation. As shown in Table 1, we compare our algorithm with two other communication learning strategies. Learning communication from scratch (the second column) means that instead of using the dataset communication, the communication function is trained together with the policy in an end-to-end manner. Pure imitation means that, without discriminating different sources, the communication is learned by minimizing the MSE loss between the dataset and the learned communication. The table is split into two parts, covering environments introduced in previous works and our added environments respectively. In the first part (the first 5 rows of results), we consider the StarCraft II tasks introduced in Wang* et al. (2020). From the results, we can conclude that, by learning from communication-based datasets, both MHCI and pure imitation perform better in 3b vs 1h1m-medium-random, 5z vs 1ul-medium-random and MMM-medium-random. Among the three methods, MHCI takes the multi-source property into account and achieves higher scores than pure imitation. In 1o10b vs 1r-medium-random, pure imitation performs better than MHCI and learning from scratch, probably because the large number of agents (11 in total) makes multi-head learning difficult. In 1o2r vs 4r-medium-random, all three strategies perform similarly. The comparison in the added environments described in Section 6.1 is included in the second part (the last 5 rows of results). In the single-source dataset Room-medium, pure imitation obtains the highest score, and MHCI is second. We also compare under three mixing schemes in Room.
The results show that our algorithm outperforms learning communication from scratch by using the dataset communication, as well as pure imitation, which neglects the multi-source property. In 6o1b vs 1r, where cascading communication is required, all three methods fail to achieve good scores; we look forward to future algorithms that can handle 6o1b vs 1r-medium-random. In conclusion, we compare the three methods, MHCI, learning communication from scratch, and pure imitation, on the created benchmark of offline communication learning. We show that by utilizing dataset communication, both MHCI and pure imitation work better than learning communication from scratch, and on many datasets, handling the multi-source property gives MHCI an additional advantage over pure imitation. Therefore, we can conclude that MHCI boosts downstream policy learning in offline communication learning with multi-source datasets.

6.3. ABLATION STUDY

In this section, we look into the effectiveness of each module in MHCI. As introduced in Section 5, our method includes two main parts: the multi-head structure for communication imitation learning and the way of fusing messages from different sources. We therefore compare each module with alternatives.

The effectiveness of the multi-head structure. In MHCI, we use an attention-like multi-head network to predict the categories of the dataset communications. An alternative is to classify each communication to its closest category, argmin_i dist(m_D, m^(i)). We compare the performance of the two classification methods in the upper part of Table 2, concluding that our multi-head structure outperforms the closest-category method and is more robust.
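The closest-category baseline is a hard assignment, in contrast to MHCI's soft attention-based probabilities. A minimal sketch (the example vectors are made up for illustration):

```python
import numpy as np

def closest_category(m_dataset, head_messages):
    """Ablation baseline: assign the dataset message to the head whose
    output is nearest in Euclidean distance, argmin_i dist(m_D, m^(i)).
    A hard decision, unlike MHCI's softmax probabilities."""
    dists = [np.linalg.norm(m_dataset - m) for m in head_messages]
    return int(np.argmin(dists))

m_D = np.array([1.0, 0.0])                       # dataset message
heads_out = [np.array([0.9, 0.1]),               # head 0's prediction (close)
             np.array([-1.0, 0.5])]              # head 1's prediction (far)
cat = closest_category(m_D, heads_out)           # -> 0
```

Early in training, head outputs are noisy, so this hard argmin can lock messages onto the wrong head; the soft weighting used in Equation 11 degrades more gracefully, which is one plausible reading of the robustness gap observed in Table 2.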


The effectiveness of linear fusion. We compare learned linear fusion, PCA-based fusion, and directly using the hidden variable shared by all communication heads. PCA-based fusion means that the fusion matrix is obtained by computing the compression matrix of the concatenated multi-head messages (Pearson (1901)). The results in the lower part of Table 2 show that learned linear fusion performs better than PCA-based fusion.
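The PCA-based fusion baseline can be sketched as follows. The exact procedure is an assumption of ours: we take the compression matrix to be the top-p principal directions of the concatenated multi-head messages, computed via SVD.

```python
import numpy as np

def pca_fusion_matrix(concat_messages, p):
    """Sketch of the PCA-based fusion baseline: compress concatenated
    multi-head messages (rows of shape k*p) down to p dimensions using
    the top-p principal directions."""
    X = concat_messages - concat_messages.mean(axis=0)   # center the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:p]                                        # (p, k*p) compression matrix

rng = np.random.default_rng(0)
kp, p = 12, 4                          # e.g. k = 3 heads, message dimension 4
data = rng.normal(size=(200, kp))      # concatenated messages from the dataset
A = pca_fusion_matrix(data, p)
fused = data @ A.T                     # each message compressed to p dims
```

Unlike the learned linear fusion, this matrix is fixed by the message statistics alone and never sees the RL loss, which is one plausible reason it underperforms in Table 2.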

7. CONCLUSION

Partial observability and scalability are two main challenges in cooperative multi-agent RL. We look into the new area of offline communication learning, which hopefully addresses both problems to achieve higher group intelligence. However, additional challenges come from the multi-source property of offline datasets, for which directly training communication in an end-to-end manner is not effective enough. Therefore, we propose Multi-head Communication Imitation, which combines the information in the dataset communication to boost downstream policy learning. Empirical results show the effectiveness of our algorithm on the new offline communication benchmark.

A DIFFERENT COMMUNICATION PROTOCOLS IN PREVIOUS WORKS

In Section 3, the general form of how communication works in multi-agent RL is summarized in Equation 1. Previous works on communication can be viewed as applying different modifications of this general form.

• Informative communication: m_ij ∈ M = R^d, where d is the dimension of the message. It is generally used in all kinds of differentiable communication learning.
• Broadcasting communication: C_i ≡ C_i1 = C_i2 = ... = C_in, ∀i ∈ I (Sukhbaatar et al. (2016)).
• Discrete communication (action communication): m_ij ∈ M = A_m = {1, 2, ..., |A_m|}. Each agent i sends the same message to all receivers (Foerster et al. (2016)).
• Targeted communication: the message follows the broadcasting protocol, while the policy function includes an attention module: m_i = C(o_i), a_j = π_j(Attention(m_:, o_j), τ_j) (Das et al. (2019)).
• Incentive communication: the message directly influences the Q value of each receiver: m_ij ∈ M = R^{|A|}, a_j = argmax_a (Q(o_j, τ_j) + m_ij) (Yuan et al. (2022)).

B SUPPLEMENTARY PROOF OF THEOREM 1

The proof of Theorem 1 is based on Lemma 1, which requires most of the effort. In this section, we first prove Theorem 1 using Lemma 1; the remainder of the section is devoted to the proof of Lemma 1.

Theorem 1. The optimal expected return based on the Universal Communication is greater than or equal to that of the dataset communications and policies:

∀(S, LC, π) ∈ G_π, V*_{T_UC(M_D)}(s_init) ≥ V_{T_LC(M_D), π}(s_init).

Proof. According to Definitions 1 and 2, ∀(S, LC, π) ∈ G_π, UC dominates LC. Therefore, by Lemma 1,

V*_{T_UC(M_D)}(s_init) ≥ V*_{T_LC(M_D)}(s_init),

and clearly V*_{T_LC(M_D)}(s_init) ≥ V_{T_LC(M_D), π}(s_init). Combining the two inequalities gives

V*_{T_UC(M_D)}(s_init) ≥ V*_{T_LC(M_D)}(s_init) ≥ V_{T_LC(M_D), π}(s_init).

Lemma 1. If communication function C_1 dominates C_2, then

V*_{T_{C_1}(M)}(s_init) ≥ V*_{T_{C_2}(M)}(s_init).

The following proves Lemma 1. Recall that the observation function in a POMDP is O : S → Ω.

Definition 3. Observation function O_1 contains more information than O_2 if ∃f, ∀s ∈ S, f(O_1(s)) = O_2(s), i.e., O_1 contains strictly more information than O_2.

Denote the history τ_t = (s_1, a_1, s_2, a_2, ..., s_{t-1}, a_{t-1}, s_t) and τ^(a)_t = (s_1, a_1, s_2, a_2, ..., s_{t-1}, a_{t-1}), τ_t ∈ T_t, with probability p(τ_t). Denote the history under observation function O(·) as O(τ_t) = (O(s_1), a_1, O(s_2), a_2, ...), O(τ_t) ∈ O(T_t), with probability p(τ^O_t) = Σ_{O(τ_t) = τ^O_t} p(τ_t). Denote the infostate obtained from the history as IS : O(T_t) → Δ(S), with p_{IS(τ^O_t)}(s) = Σ_{τ = (..., s_t), O(τ_t) = τ^O_t} p(τ_t) / p(τ^O_t).

Lemma 2. O(T_t) is a partition of T_t. If O_1 contains more information than O_2, then O_2(T_t) is a coarsening of O_1(T_t).

Proof. 1. IS(O_1(τ_t)) ⊂ IS(O_2(τ_t)). 2. The probabilities add up to 1 (by taking expectation). Therefore, the distribution Split_{O_1}(O_2(τ_t)) ∈ Δ(O_1(T_t)) is well defined.

Lemma 3. Suppose τ_t = (o_1, a_1, ..., o_{t-1}, a_{t-1}, o_t) and τ^(a)_t = (o_1, a_1, ..., o_{t-1}, a_{t-1}). If

E_{τ^{O_1}_t ∼ Split_{O_1}(τ^{O_2}_t)} Q*_{O_1}(τ^{O_1}_t, a_t) ≥ Q*_{O_2}(τ^{O_2}_t, a_t),

then

E_{τ^{O_1}_t ∼ Split_{O_1}(τ^{O_2}_t)} Q*_{O_1}(τ^{(a),O_1}_t, a_t) ≥ Q*_{O_2}(τ^{(a),O_2}_t, a_t).

Proof. LHS = E_{o_t | τ^{O_1}} E_{τ^{O_1} ∼ Split_{O_1}(τ^{O_2}_t)} Q*_{O_1}(τ^{O_1}_t, a_t) ≥ E_{o_t | τ^{O_1}} Q*_{O_2}(τ^{O_2}_t, a_t) = RHS.

Theorem 2. ∀t, if O_1 dominates O_2, then

E_{τ^{O_1}_t ∼ Split_{O_1}(τ^{O_2}_t)} Q*_{O_1}(τ^{O_1}_t, a_t) ≥ Q*_{O_2}(τ^{O_2}_t, a_t).

Proof. We prove by backward induction. Without loss of generality, suppose all trajectories have the same length t_max; the inequality naturally holds at the ending step. Assume the inductive hypothesis at step t + 1:

E_{τ^{O_1}_{t+1} ∼ Split_{O_1}(τ^{O_2}_{t+1})} Q*_{O_1}(τ^{O_1}_{t+1}, a_{t+1}) ≥ Q*_{O_2}(τ^{O_2}_{t+1}, a_{t+1}).

Then, at step t:

E_{τ^{O_1}_t ∼ Split_{O_1}(τ^{O_2}_t)} Q*_{O_1}(τ^{O_1}_t, a_t)
= E_{τ^{O_1}_t ∼ Split_{O_1}(τ^{O_2}_t)} max_{a_{t+1}} Q*_{O_1}(τ^{(a),O_1}_{t+1}, a_{t+1})  (denote τ^{(a),O_1}_{t+1} = Concat(τ^{O_1}_t, a_t))
≥ max_{a_{t+1}} E_{τ^{O_1}_t ∼ Split_{O_1}(τ^{O_2}_t)} Q*_{O_1}(τ^{(a),O_1}_{t+1}, a_{t+1})
≥ max_{a_{t+1}} Q*_{O_2}(τ^{(a),O_2}_{t+1}, a_{t+1})  (by Lemma 3)
= Q*_{O_2}(τ^{O_2}_t, a_t).

C DETAILS OF THE IMAGINARY COMMUNICATION GAME

The formulation of the imaginary communication game is given in Section 4. We call it an imaginary communication game because it has no real-world correspondence; however, it lets us concretely measure the influence of scalability and state-space complexity, as well as the advantage of learning communication from datasets. The number of agents is n. The state space has dimension p, while the individual observation space has dimension q. Naturally, p > q, i.e., no agent observes all the information by itself. In the implementation, we guarantee nq > p and that the concatenated observations of all agents contain all the information of the state. The partial observability lies in the fact that for each agent i, 2q dimensions relate to the optimal policy, but only q of them are included in its own observation. Therefore, a communication mechanism is required to make up for the missing q dimensions. The two experiments share the same network structure, but learning communication from scratch still fails to converge to the optimum when the task becomes more challenging. The network structure involves a communication module and a policy module, both 3-layer MLPs. The communication module of agent i takes the individual observation o_i as input and outputs the messages m_i: to the receivers. The policy module of agent j takes o_j and m_:j as input and outputs an action. The networks do not share parameters. Since this environment has horizon T = 1 and the reward belongs to {0, 1}, it can be modeled as a classification problem. We use the BCE loss for optimization, with data size 1000, batch size 1000 (1 step covers 1 epoch), and learning rate 0.1 with the SGD optimizer. When learning communication from scratch, we optimize both the communication and the policy module by the downstream BCE loss.
In fitting the dataset communication, we optimize the communication module by minimizing the MSE loss against the dataset communication, while optimizing the policy module by the downstream BCE loss. For fairness, we use an early-stopping rule that halts training if the evaluation accuracy does not increase within 30 steps. Figure 1 shows the average evaluation accuracy over 100 experiments at each value of n and p, with q = 2. In every experiment, the environment and dataset are re-initialized. In the appendix, we also compare under a different q = 3 in Figure 6, and a different data size of 10000 in Figure 7. In all the experiments, the state is the concatenation of observations by default. For convenience, we use expert to denote manually designed datasets, which are optimal. Medium indicates that the dataset is collected by a learned policy and communication, and random denotes trajectories collected by a randomly initialized policy and communication. We mix datasets from different sources and name the mixture accordingly, e.g., expert-medium for the mixture of an expert dataset and a medium dataset. In addition, for easier understanding, we note here the general challenge of the StarCraft II tasks included in the benchmark: as in previous communication-related works, the sight range of each agent is set to a small value, making the tasks partially observable. We also use this criterion in the added task 6o1b vs 1r.

Dataset generation details

In the StarCraft II related tasks (3b vs 1h1m, 1o10b vs 1r, 5z vs 1ul, MMM, 1o2r vs 4r and 6o1b vs 1r), the medium dataset is generated by collecting trajectories after 2e7 training steps of the NDQ algorithm (Wang* et al. (2020)), i.e., after the policy has been trained to convergence. The random dataset is collected at the first step of training, when the policy is only randomly initialized. The state required by the mixing network is the concatenation of all agents' observations, rather than a predefined low-dimensional state as in previous works, because the latter requires domain knowledge that is infeasible under the general CTDE assumption. In the Room environment, we manually design expert datasets in which all the agents act perfectly (first transmit the location to the correct agents, then move towards their own goals, as described in Section 6.1). The communication function is designed as m_ij = (1 + location) * 10 if i observes j's goal, and 0 otherwise. In datasets where two different expert sources are included, their communication functions are m_ij = (1 + location) * 10 and m_ij = -(1 + location) * 10 respectively.
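The expert communication rule for the Room environment can be written directly. The helper below is hypothetical (the paper does not name such a function); the second expert source, which flips the sign of the message, is modeled by a `source` flag.

```python
def expert_message(observes_goal: bool, location: float, source: int = 0) -> float:
    """Expert communication for the Room environment.

    Returns (1 + location) * 10 when sender i observes receiver j's goal,
    and 0 otherwise. The second expert source (source=1) flips the sign.
    """
    if not observes_goal:
        return 0.0
    sign = 1.0 if source == 0 else -1.0
    return sign * (1.0 + location) * 10.0

assert expert_message(True, 3) == 40.0
assert expert_message(False, 3) == 0.0
assert expert_message(True, 3, source=1) == -40.0
```

The sign flip makes the two expert sources mutually inconsistent protocols that achieve the same task, which is exactly what makes naive imitation of a mixed dataset fail.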

E EFFECTIVENESS OF MULTI-HEAD STRUCTURE

In this section, we examine the classification performance of the multi-head structure in communication learning and show how it improves the downstream performance of the policy. From the head classification results shown in Table 5, MHCI successfully classifies different categories of dataset communication into different heads on most datasets, which allows for accurate communication learning, and the overall performance of the downstream policy increases significantly. In contrast, on those datasets where communication classification does not work well (e.g., 1o10b vs 1r, 5z vs 1ul and Room-medium), MHCI does not surpass pure imitation. (Since the second version in the rebuttal updates the way of linear fusion, we clarify that the tendency shown in Table 5 is similar in both the old version with random linear fusion and the new version with learned linear fusion.)



Figure 1: Evaluation accuracy with different number of agents n, and different dimension of the state space p, comparing fitting the dataset communication and learning communication from scratch.

Figure 2: The pipeline of offline communication learning, breaking into two parts, the communication network and the policy network. Our work focuses on offline communication learning using a multi-head structure shown in the communication network part. The messages are generated by first computing the messages in all the heads, and then a linear fusion is applied on them.

Figure 4: The environment Room

Figure 6: Evaluation accuracy with different n, p, given q = 3 (size 1000 by default).

Figure 7: Evaluation accuracy with different n, p, given size 10000 (q = 2 by default).

The medium dataset is generated by offline training (learning communication from scratch) on abundant expert data, and the random dataset is likewise collected from trajectories of randomly initialized policies.

To this end, we propose a Multi-head Communication Imitation (MHCI) pipeline that learns communication from multi-source datasets. It simultaneously optimizes the predicted category of each communication and imitates the dataset communication. The learned multi-head communication can be viewed as the Local Communication mentioned above. In addition, we adopt a Linear Fusion module to fuse the learned communications of the different categories; the fusion phase can be understood as approximating the Universal Communication. With MHCI and a policy learner, we are able to master the communication and policy in challenging tasks with multi-source offline datasets.
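The pipeline above can be sketched in a simplified form. Everything here is illustrative, not the paper's implementation: the heads are linear maps, a minimum-imitation-error rule stands in for the learned classifier, and the fusion weights are initialized uniformly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_obs, d_msg = 2, 3, 4  # illustrative sizes

# Each head is a linear map obs -> message (stand-in for a head network).
heads = [rng.normal(size=(d_obs, d_msg)) for _ in range(n_heads)]
fusion_w = np.full(n_heads, 1.0 / n_heads)  # linear-fusion weights (learned in MHCI)

def head_messages(obs):
    """Messages proposed by every head for one observation: (n_heads, d_msg)."""
    return np.stack([obs @ W for W in heads])

def assign_head(obs, dataset_msg):
    """Classify a dataset message to the head that imitates it best."""
    errs = ((head_messages(obs) - dataset_msg) ** 2).sum(axis=1)
    return int(errs.argmin())  # that head alone receives the imitation gradient

def fused_message(obs):
    """Execution-time message: linear fusion over all heads."""
    return fusion_w @ head_messages(obs)  # (d_msg,)

obs = rng.normal(size=d_obs)
msg = fused_message(obs)
k = assign_head(obs, head_messages(obs)[1])  # a message head 1 reproduces exactly
```

Training would alternate the two roles: `assign_head` routes each dataset message to one head (Local Communication), while `fused_message` is what the downstream policy consumes (approximating the Universal Communication).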

Ablation study including alternative classifying methods (upper part) and alternative techniques of message fusion (lower part).

Data sizes of the datasets in the offline communication learning benchmark.

Multi-head vs. Single-head comparison on 3b vs 1h1m.

G THE EFFECT OF COMBINING DIFFERENT COMMUNICATION PROTOCOLS

We look into how MHCI combines different communication protocols to improve the overall performance. We specifically design a dataset containing multiple sources that are valuable in different aspects, so that combining all of them, rather than learning any single one, is essential for good performance. In detail, we modify the Room-medium dataset by masking half of the senders' messages while leaving the trajectories unchanged. In 50% of the data, we mask the messages sent by the first 4 agents, and in the remaining 50%, we mask those sent by the last 4 agents (8 agents in all). The communication learning algorithm therefore needs to combine the experience of how the first 4 agents should communicate with that of how the last 4 agents should communicate. The experiment results show that, on such a dataset that requires combining experiences, MHCI gains an advantage by combining the different communication protocols and exceeds learning from scratch.

Table 7: Experiment results when the optimal communication in the dataset is partially masked, showing that MHCI combines experience from different sources of communication protocols.

50% mask 1-4 / 50% mask 5-8: 82.0 ± 11.4
100% mask 1-4: 69.5 ± 2.3
Not masked: 88.3 ± 1.2
Learn communication from scratch: 74.2 ± 3.5
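The masked-dataset construction described above can be sketched as follows, assuming a hypothetical array layout `messages[sample, sender, receiver, dim]` (the benchmark's actual storage format may differ).

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_agents, d = 100, 8, 4
messages = rng.normal(size=(n_samples, n_agents, n_agents, d))

masked = messages.copy()
half = n_samples // 2
masked[:half, :4] = 0.0   # first half: mask messages sent by agents 1-4
masked[half:, 4:] = 0.0   # second half: mask messages sent by agents 5-8

# Each half still demonstrates how the other four agents should communicate,
# so a learner must combine the two halves to recover the full protocol.
assert np.all(masked[:half, :4] == 0) and np.all(masked[half:, 4:] == 0)
```

Trajectories (observations and actions) are left untouched; only the communication channel is masked, which isolates the effect of combining communication experience.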

ALGORITHMS

We compare MHCI with VBC (Zhang et al. (2019)) under 3b vs 1h1m-medium-random in Figure 8. VBC fails to converge because it is not designed for the offline setting.

Figure 8: Learning curve on 3b vs 1h1m-medium-random.

