ROBUST MULTI-AGENT REINFORCEMENT LEARNING DRIVEN BY CORRELATED EQUILIBRIUM

Abstract

In this paper we deal with robust cooperative multi-agent reinforcement learning (CMARL). While CMARL has many potential applications, only a trained policy that is robust enough can be confidently deployed in the real world. Existing works on robust MARL mainly apply vanilla adversarial training in the centralized-training, decentralized-execution paradigm. We, however, find that if a CMARL environment contains an adversarial agent, the decentralized equilibrium obtained in this way can perform significantly poorly in terms of such adversarial robustness. To tackle this issue, we suggest that the non-adversarial agents should jointly make their decisions during execution, i.e. solve for a correlated equilibrium instead. We theoretically demonstrate the superiority of the correlated equilibrium over the decentralized one in adversarial MARL settings. To achieve robust CMARL, we therefore introduce novel strategies to encourage agents to learn a correlated equilibrium while maximally preserving the convenience of decentralized execution. Global random variables with mutual information objectives are proposed to help agents learn robust policies with MARL algorithms. The experimental results show that our method can dramatically boost performance on the SMAC environments.

1. INTRODUCTION

Recently, reinforcement learning (RL) has achieved remarkable success in many practical sequential decision problems, such as Go (Silver et al., 2017), chess (Silver et al., 2018), real-time strategy games (Vinyals et al., 2019), etc. In the real world, many sequential decision problems involve more than one decision maker (i.e. multi-agent), such as autonomous driving, traffic light control, and network routing. Cooperative multi-agent reinforcement learning (CMARL) is a key framework for solving these practical problems. Existing MARL methods for cooperative environments include policy-based methods, e.g.



Therefore, in practice, we expect a multi-agent team policy in a fully cooperative environment to be robust when some agent(s) make mistakes or even behave adversarially. To the best of our knowledge, the few existing works on this issue mainly use the vanilla adversarial training strategy. Klima et al. (2018) considered a two-agent cooperative case where, to make the policy robust, agents become competitive with a certain probability during training. Li et al. (2019) provided a robust MADDPG approach called M3DDPG, where each agent optimizes its policy based on other agents' perturbed, sub-optimal actions. Most state-of-the-art MARL algorithms use the centralized training and decentralized execution (CTDE) routine, since this setting is common in real-world cases; the robust MARL method M3DDPG also follows the CTDE setting. However, existing works on team mini-max normal form and extensive form games show that if the environment contains an adversarial agent, then the decentralized equilibrium from the CTDE routine can be significantly worse than the correlated equilibrium. We further extend this finding to stochastic team mini-max games. Inspired by this observation, if we can urge agents to learn a correlated equilibrium (i.e. the non-adversarial agents jointly make decisions during execution), then we may achieve better performance than methods following CTDE in the robust MARL setting. In this work, we achieve robust MARL by solving for a correlated equilibrium via a latent variable model, where a latent variable shared across all agents helps them jointly make their decisions. Our contributions can be summarized as follows.

• We demonstrate that in stochastic team mini-max games, the decentralized equilibrium can be arbitrarily worse than the correlated one, and the gap can be significantly larger than in normal or extensive form games.
• With this result, we point out that learning correlated equilibrium is indeed necessary in robust MARL.

• We propose a simple strategy to urge agents to learn correlated equilibrium, and show that this method can yield significant performance improvements over vanilla adversarial training.

2. RELATED WORKS

Robust RL The robustness in RL involves perturbations occurring in different components, such as the state or observation, the environment, the action or policy, the opponent's policy, etc. 1) For robustness to state or observation perturbations, most works focused on adversarial attacks against image states/observations. Pattanaik et al. (2018) used gradient-based attacks on image states, and vanilla adversarial training was adopted to obtain a robust policy; Fischer et al. (2019) first trained a normal policy and then distilled it on adversarial states to achieve robustness; Ferdowsi et al. (2018) applied adversarial training to autonomous driving tasks where an adversary interferes with the agent's input sensors based on the environment. 2) For robustness to the environment, the robust Markov decision process (MDP) can be used to formulate the problem. Many works (e.g. Wiesemann et al. (2013); Lim et al. (2013)) have studied this model and provided both theoretical analysis and algorithmic design. In the deep RL scenario, Rajeswaran et al. (2016) used a Monte Carlo approach to train the agent, while Abdullah et al. (2019) and Hou et al. (2020) adopted adversarial training to obtain an agent robust to all environments within a Wasserstein ball. Mankowitz et al. (2019) conducted adversarial training in the MPO algorithm to optimize performance in the worst-case environment. 3) Against perturbations of the action or policy, Tessler et al. (2019); Gu et al. (2018); Vinitsky et al. (2020) considered the case where an agent's action may be influenced by another action, and conducted adversarial training. 4) For robustness to opponents, Pinto et al. (2017); Ma et al. (2018) focused on the case where an agent's reward may be influenced by another agent, and adversarial training was implemented to solve the two-agent game to obtain a robust agent.

Correlated Equilibrium Correlated equilibrium is a more general equilibrium concept in game theory than Nash equilibrium. In a cooperative task, if the team agents make decisions jointly, then the optimal team policy is a correlated equilibrium. Correlated equilibrium is widely studied in game theory (e.g. Hart & Mas-Colell (2001; 2000); Neyman (1997)). In team mini-max games, solving the team's correlated equilibrium in a normal form game is straightforward (just treat the team as a single agent); Celli & Gatti (2017); Zhang & An (2020); Farina et al. (2018) proposed various algorithms to solve correlated equilibrium in extensive form games. In the deep RL scenario, Celli et al. (2019) applied a vanilla hidden variable model to solve correlated equilibrium in simple repeated environments, while a hidden variable model with an information loss was used in Chen et al. (2019) to solve correlated equilibrium in normal multi-agent environments.

3.1. BACKGROUND

A typical cooperative MARL problem can be formulated as a stochastic Markov game $(\mathcal{S}, \{\mathcal{A}_i\}_{i=1}^{n}, r, P)$, where $\mathcal{S}$ denotes the state space and $\mathcal{A}_i$ is the $i$-th agent's action space. The environment starts at a state $s_0$ drawn from some initial distribution $p_0$. At each time step $t$, the agents select a joint action $a_t \in \times_{i=1}^{n}\mathcal{A}_i$ based on some policy $\pi_{team}(a_t|s_t)$, and receive reward $r(s_t, a_t)$. The environment then transitions to a new state $s_{t+1} \sim P(\cdot|s_t, a_t)$. The goal is to maximize the agents' expected accumulated reward: $\max_{\pi_{team}} \mathbb{E}_{s_0\sim p_0,\, a_t\sim\pi_{team},\, s_{t+1}\sim P}\left[\sum_t r(s_t, a_t)\right]$. Most state-of-the-art MARL algorithms use the CTDE routine, in which each agent selects its own action independently. If the environment is fully observable, then at each timestep $t$ each agent selects an action $a_{it} \in \mathcal{A}_i$ based on some policy $\pi_i(a_{it}|s_t)$; if the environment is partially observable, then at each timestep $t$ each agent receives an observation $o_{it}$ based on $s_t$ and selects its action based on the policy $\pi_i(a_{it}|o_{it})$. The goal becomes $\max_{\pi_{1:n}} \mathbb{E}_{s_0\sim p_0,\, a_{1:n,t}\sim\pi_{1:n},\, s_{t+1}\sim P}\left[\sum_t r(s_t, a_{1:n,t})\right]$.
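As a toy illustration of this notation (a made-up environment, not SMAC and not the paper's code), the sketch below rolls out independent decentralized policies $\pi_i(a_{it}|s_t)$ and accumulates the discounted team reward:

```python
import numpy as np

# Toy fully observable Markov game: n agents act independently (CTDE execution),
# and the team is rewarded whenever all agents happen to pick the same action.
n_agents, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

def pi_i(i: int, state: int) -> int:
    # decentralized policy pi_i(a_it | s_t): here simply uniform over actions
    return int(rng.integers(0, n_actions))

state, ret = 0, 0.0
for t in range(100):
    a_t = tuple(pi_i(i, state) for i in range(n_agents))  # joint action a_t
    r_t = 1.0 if len(set(a_t)) == 1 else 0.0              # team reward r(s_t, a_t)
    ret += gamma ** t * r_t                               # accumulated discounted reward
    state = (state + sum(a_t)) % 5                        # toy transition P(.|s_t, a_t)
```

Under this uniform policy the team only agrees by chance; the next sections quantify how much better a correlated team policy can do.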

3.2. ROBUST MARL AND VANILLA ADVERSARIAL TRAINING

The motivation of our work is to obtain a policy that is robust when one agent makes mistakes. In a normal MARL algorithm, the team is only guaranteed to achieve high reward when all agents accurately execute their optimal strategies. However, this may not always hold in real-world scenarios: real-world agents may occasionally make mistakes (e.g. machine malfunctioning). To achieve robustness to this kind of mistake, we propose to solve the worst-case mini-max problem (for a fixed $i$, or for all $i$):

$$\max_{\pi_{1:n}} \min_{\pi_{i,mis}} \mathbb{E}_{s_0\sim p_0,\, a_{it}\sim\pi_{i,mis},\, a_{-i,t}\sim\pi_{-i},\, s_{t+1}\sim P}\left[\sum_t r(s_t, a_{1:n,t})\right] \quad \text{s.t. } D(\pi_{i,mis}\,\|\,\pi_i) \le \epsilon \qquad (1)$$

where $-i$ denotes all agents except $i$, and $\pi_{i,mis}$ is the mistaken policy that the mistaken agent actually performs. $D$ is some distance measure, since we cannot expect the team policy to remain robust when one agent makes very large mistakes. Unfortunately, this mini-max problem is hard to solve, since it contains two MDPs nested within each other. On the other hand, the common case in the real world is that agents make mistakes randomly, i.e. the mistakes are most likely unrelated to the team's goal and to other agents' policies. Also, agents typically make mistakes only occasionally, since agents that always or frequently make mistakes would not be allowed to be deployed in practice. Following these considerations, we consider a simpler case and instantiate robust cooperative MARL in the QMIX algorithm. We consider a weaker worst-case mini-max problem: since we assume an agent only makes mistakes occasionally, we consider the case where the mistaken agent executes its "worst" action with a certain probability $\epsilon$. We also assume that the mistakes are most likely unrelated to the team's goal and other agents' policies. In QMIX, if an agent $i$ does not consider other agents' policies, then its worst action is the one that minimizes its own $Q_i$ function (since in QMIX $\partial Q_{tot}/\partial Q_i \ge 0$, a lower $Q_i$ leads to a lower $Q_{tot}$).
In summary, we let the mistaken agent $i$ perform

$$a_{i,mis} = \begin{cases} \arg\max_a Q_i(s,a) & \text{with prob. } 1-\epsilon \\ \arg\min_a Q_i(s,a) & \text{with prob. } \epsilon \end{cases} \qquad (2)$$

and apply vanilla adversarial training to obtain a robust policy. The detailed algorithm (Algorithm 2) is described in Appendix A.1. The performance of vanilla adversarial training will serve as a baseline.
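A minimal sketch (not the paper's code) of the mistaken-action rule in Eq. (2): the mistaken agent plays its greedy action with probability $1-\epsilon$ and its $Q_i$-minimizing "worst" action with probability $\epsilon$; the Q-values below are hypothetical.

```python
import numpy as np

def mistaken_action(q_values: np.ndarray, eps: float, rng: np.random.Generator) -> int:
    if rng.random() < eps:
        return int(np.argmin(q_values))  # worst action: minimizes Q_i, hence Q_tot
    return int(np.argmax(q_values))      # normal greedy action

rng = np.random.default_rng(0)
q = np.array([0.1, 0.7, -0.3])           # hypothetical Q_i(s, .) values
actions = [mistaken_action(q, eps=0.3, rng=rng) for _ in range(10000)]
# roughly 30% of samples are the worst action (index 2), the rest greedy (index 1)
```

During adversarial training, such sampled mistaken actions replace agent $i$'s action in the rollout before the transition is stored in the replay buffer.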

4.1. ROBUST MARL REQUIRES CORRELATED EQUILIBRIUM

In this part, we show that with naive adversarial training in the centralized-training, decentralized-execution fashion, the learned policy can be sub-optimal in adversarial settings, thus requiring a correlated equilibrium. In typical MARL settings, if the environment is fully cooperative, then algorithms with centralized training and decentralized execution (e.g. QMIX, MADDPG, COMA, etc.) can achieve state-of-the-art performance in certain environments. This indicates that, at least for these environments, correlation during execution is not necessary. Furthermore, Lauer & Riedmiller (2000) proved that a decentrally executed policy can achieve optimal performance in fully observable, fully cooperative RL. However, in the robust MARL setting, some agent(s) become adversarial, so the environment is no longer fully cooperative. The question is then: is correlation during execution necessary in the adversarial scenario? Under the setting of Eq. (1), the problem becomes a team mini-max game. Works on team mini-max normal form or extensive form games (von Stengel & Koller, 1997; Basilico et al., 2016; Celli & Gatti, 2017) pointed out that the decentralized equilibrium can be significantly worse than the correlated equilibrium, at least for some games. In normal form team mini-max games, Basilico et al. (2016) proved that the gap between the correlated and decentralized equilibrium is at most $O(m^{n-2})$, where $n$ is the number of agents (including the adversary) and $m$ is the number of actions per agent; this bound is tight. Throughout this section, we define the "correlated equilibrium" as the equilibrium of the optimal "correlated policy", i.e. the team learns the optimal joint policy $\pi^*_{team}(a|s_t)$ together, and the "decentralized equilibrium" as the equilibrium of the optimal "decentralized policy" learned by a CTDE algorithm.
We denote $E_{cor}$ as the team's expected reward under its optimal correlated policy, and $E_{dec}$ as that under its optimal decentralized policy. In MARL, agents play a stochastic game. Since a repeated normal form game is a special case of a stochastic game, $\frac{E_{cor}}{E_{dec}}$ can be at least $m^{n-2}$ in stochastic games. Moreover, we find that this gap can be even larger in stochastic games than in normal form games, because a stochastic game is sequential: agents' previous actions can influence the future state and therefore affect the future reward.

Proposition 1. There exists a stochastic game in which $\frac{E_{cor}}{E_{dec}} > m^{n-2}$.

Proof. Consider the following example: $\mathcal{S} = \{S_1, S_2\}$, with initial state $S_1$. Agents $1,\dots,n-1$ form the team and agent $n$ is the adversary. $\mathcal{A}_i = \{1,\dots,m\}$ for $i = 1,\dots,n$, and the discount factor is $\gamma < 1$. The team's reward function is $r(S_1, a) = 0$ for all $a$, and

$$r(S_2, a) = \begin{cases} 1 & a_1 = \cdots = a_n \\ 0 & \text{otherwise} \end{cases}$$

The deterministic state transition function is

$$T(S_1, a) = \begin{cases} S_2 & a_1 = \cdots = a_n \\ S_1 & \text{otherwise} \end{cases} \qquad T(S_2, a) = \begin{cases} \text{game ends} & a_1 = \cdots = a_n \\ S_2 & \text{otherwise} \end{cases}$$

In each state, the team's optimal correlated policy is to perform the correlated actions $\{1,\dots,1\},\dots,\{m,\dots,m\}$ with equal probability $\frac{1}{m}$: if the team performed any of these joint actions with probability less than $\frac{1}{m}$, the adversary would perform that action, which would reduce the team's reward. However, if the team plays in a decentralized way, then each agent's optimal policy is to perform each of the $m$ actions with equal probability $\frac{1}{m}$. We can prove that in this example $\frac{E_{cor}}{E_{dec}} \ge m^{2n-4}(1-\gamma)^2$; the detailed derivation can be found in Appendix A.2. In fact, the gap between the correlated and decentralized equilibrium in stochastic team mini-max games can be arbitrarily larger than in normal form games, as elaborated in the following proposition. The detailed derivation can be found in Appendix A.2.

Proposition 2. For any fixed $k \in \mathbb{Z}^+$, there exists a stochastic game in which $\frac{E_{cor}}{E_{dec}} \ge O(m^{k(n-2)})$.
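This gap can be checked numerically from the closed forms derived in Appendix A.2 (a minimal sketch under the stated per-step "match" probabilities, not the authors' code):

```python
# Closed form from Appendix A.2 for the Proposition 1 example: each step in S1
# moves to S2 w.p. p and each step in S2 yields the reward w.p. p, giving
# E = gamma * p^2 / (1 - (1 - p) * gamma)^2.
def team_value(p: float, gamma: float) -> float:
    return gamma * p ** 2 / (1 - (1 - p) * gamma) ** 2

m, n, gamma = 3, 4, 0.5
e_cor = team_value(1 / m, gamma)               # correlated play: p = 1/m
e_dec = team_value(1 / m ** (n - 1), gamma)    # decentralized play: p = 1/m^(n-1)
ratio = e_cor / e_dec                          # here ratio = 49.0
assert ratio > m ** (n - 2)                          # Proposition 1: 49 > 9
assert ratio >= m ** (2 * n - 4) * (1 - gamma) ** 2  # bound m^(2n-4)(1-gamma)^2
```

Note that the gap shrinks as $\gamma \to 1$ in this example, so the discount factor matters when picking an instance that exceeds $m^{n-2}$.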
Therefore, in robust MARL settings, CTDE algorithms may no longer achieve optimal performance. To achieve better robustness in adversarial settings, we need methods that can urge agents to learn a correlated equilibrium.

4.2. LEARNING CORRELATED EQUILIBRIUM

In this work, we propose a novel approach for robust MARL agents to learn correlated equilibrium: a global random variable with a mutual information objective. This method is inspired by the idea of InfoGAN (Chen et al., 2016). The idea is to add a global random variable $z$ as an extra input to the Q network, i.e. changing $Q_i(o_{it}, a_{it})$ to $Q_i(o_{it}, z_t, a_{it})$. Although the global random variable itself is meaningless, agents may learn to perform a correlated equilibrium based on its value. The intuition is that, taking the example in Proposition 1, if we add a global random variable $z$ sampled uniformly from $[0, 1]$ at each timestep, then each agent might learn to perform action 1 when $z \in [0, \frac{1}{m})$, action 2 when $z \in [\frac{1}{m}, \frac{2}{m})$, and so on. The overall policy is then the team's optimal correlated policy. The following proposition is straightforward from the definition of correlated equilibrium; we also give a simple derivation in Appendix A.2.

Proposition 3. For a fully observable MARL environment with finite discrete actions, if all agents receive a global continuous random variable $z$, then there exist deterministic policies $\mu_i(s, z): \mathcal{S}\times\mathbb{R}\to\mathcal{A}_i$, $i \in \{1,\dots,n\}$, equivalent to the team's optimal correlated (stochastic) policy $\pi^*(a_1,\dots,a_n|s)$.

We now summarize two advantages of the proposed global random variable method for learning correlated equilibrium in robust MARL settings.

• It allows agents to perform a correlated equilibrium while maximally preserving the properties of CTDE. Since the global latent variable is independent of the state, the only things the agents need to share are a random number generator and a random seed. This can be done before the game starts, and the agents can still execute a "decentralized" policy during the game.
• In an MDP, or in a fully observable, fully cooperative correlated multi-agent MDP, one can prove that there must exist an optimal deterministic policy (Puterman, 2014), and therefore deterministic-policy algorithms can learn the optimal policy. However, this property no longer holds if the multi-agent MDP is not fully cooperative: the optimal correlated policy in the example of Proposition 1 must be stochastic, which prevents deterministic-policy algorithms from learning the optimal correlated policy directly. But when agents share a global random variable, deterministic-policy approaches can learn a policy that is equivalent to the optimal correlated policy.

To prevent each agent from simply ignoring the global random variable's information, we propose to maximize the mutual information between the random variable and the agent's action, $I(z_t; a_{it})$, so that agents must use the information from $z$. Also, to avoid computing the posterior $P(z|a)$ directly, a variational lower bound is derived as an approximate objective, as in InfoGAN and Barber & Agakov (2003) (the subscripts $i$ and $it$ are omitted):

$$I(z; a) \ge \mathbb{E}_{z,a}\left[\log\frac{q(z|a)}{p(z)}\right] = \mathbb{E}_{z,a}[\log q(z|a)] + H(z),$$

where $q(z|a)$ is the variational approximation of $P(z|a)$. Unlike InfoGAN, which uses latent codes $c$ to model information that is independent of the generator's noise source $z$ (e.g. stroke width or italic style), in reinforcement learning with a latent variable we prefer the latent variable to model the correlated policy conditioned on the current observation. Thus, instead of simply using $I(z_t; a_{it})$, it is better to use the conditional mutual information $I(z_t; a_{it}|o_{it})$.
A similar variational lower bound can be derived to approximate this conditional mutual information (the subscripts $i$, $it$ are omitted):

$$I(z; a|o) = \mathbb{E}_{z,a,o}\left[\log\frac{p(z|a,o)}{p(z|o)}\right] = \mathbb{E}_{z,a,o}\left[\log\frac{q(z|a,o)}{p(z|o)}\right] + \mathbb{E}_{a,o}\left[D_{KL}(p(z|a,o)\,\|\,q(z|a,o))\right] \ge \mathbb{E}_{z,a,o}\left[\log\frac{q(z|a,o)}{p(z|o)}\right] = \mathbb{E}_{z,a,o}[\log q(z|a,o)] + H(z),$$

where the last equality holds since $p(z|o) = p(z)$. Since the entropy term $H(z)$ is a constant, it can be ignored, and we define

$$L_I = -\sum_{i=1}^{n} \mathbb{E}_{o_i, a_i\sim\mathcal{D},\, z\sim p(z)}\left[\log q(z \mid a_i, o_i)\right]$$

and use $L_{tot} = L_{RL} + \lambda_I L_I$ as the overall loss function. In experiments, we follow InfoGAN's idea: we configure the variational approximation $q(z|\cdot)$ as a Gaussian distribution and apply a neural network to output its mean and variance. Algorithm 1 describes the overall training procedure.

Comparison with existing works Both Chen et al. (2019) and Kim et al. (2020) involve the ideas of a global random variable and mutual information. However, both of these works aim to improve coordination in standard MARL tasks, while we focus on solving robust CMARL. To the best of our knowledge, we are the first to demonstrate the importance of correlation in this kind of robust MARL setting, so as a first-step work we apply the most straightforward method to demonstrate this important finding:

• In our work, we formulate the global random variable as common knowledge shared between the agents, which is straightforward and does not require specifying its distribution as Chen et al. (2019) do ($p_z(z)$ in their Theorem 2.2).

• Kim et al. (2020) use the mutual information between each pair of agents' policies, $I(\pi_i(\cdot \mid s_t); \pi_j(\cdot \mid s_t))$, to encourage agents to coordinate, which is a much more complicated method. As a first-step work, we employ the conditional mutual information $I(z; a|s)$ (or $I(z; a|o)$) to encourage agents to use the information from $z$, and let the agents learn the correlation themselves. Chen et al. (2019) proposed a similar mutual information setting, but since they model $p_z(z)$, they use the mutual information $I(z; \pi_S(a, z \mid s))$ over the joint distribution. Whether a more complicated correlation method can achieve even better performance in robust CMARL settings is still an open question.
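As a concrete illustration of the loss $L_I$, the following numpy sketch computes $-\log q(z \mid a, o)$ for a diagonal Gaussian $q$; in the paper the mean and variance come from a neural network, which we fake here with fixed, hypothetical values:

```python
import numpy as np

def mi_loss(z: np.ndarray, mean: np.ndarray, log_var: np.ndarray) -> float:
    # L_I = -E[log q(z | a, o)] for a diagonal Gaussian q parameterized by a
    # per-dimension mean and log-variance; sum over latent dims, mean over batch.
    log_q = -0.5 * (np.log(2 * np.pi) + log_var + (z - mean) ** 2 / np.exp(log_var))
    return float(-log_q.sum(axis=-1).mean())

batch, z_dim = 4, 3
z = np.full((batch, z_dim), 0.5)                         # sampled global variable
good = mi_loss(z, np.full((batch, z_dim), 0.5), np.zeros((batch, z_dim)))
bad = mi_loss(z, np.zeros((batch, z_dim)), np.zeros((batch, z_dim)))
assert bad > good  # loss is lower when (a, o) predicts z well, i.e. a depends on z
```

Minimizing this term pushes each agent's action to carry information about $z$, which is exactly what prevents the agents from ignoring the shared variable.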

5. EXPERIMENTS

We test our method with the QMIX algorithm in the SMAC (Samvelyan et al., 2019) environment. We follow the experiment settings of QMIX (Rashid et al., 2018); the only difference is that we train and evaluate with agents' mistaken actions as in Eq. (2). We test two robust settings as in Eq. (1): 1) only one fixed agent makes mistakes; 2) all agents can randomly make mistakes, but at each timestep at most one agent makes a mistake. Environment and Network Architecture We use the SMAC environment with decentralised micromanagement scenarios, where each unit of the game is controlled by an individual RL agent. It is a partially observable environment where each agent's observation covers a circular area around it; the action space is discrete. For further details of the environment, please refer to (Samvelyan et al., 2019). In training, we use a 3-dimensional independent uniform distribution $U(0,1)$ as the global variable, and set the MI loss coefficient $\lambda_I = 0.1$; these are just hyperparameters, chosen based on performance. We use the same experiment settings as the QMIX paper, including the architecture of the Q network, all hyperparameter settings, and the performance evaluation method (evaluation every 10K timesteps with 32 games). For the global variable part, to add the latent variable $z$ into the model, we simply extend the agent's observation space. For the mutual information part ($q(z|a,o)$), we use an independent fully connected network with two 32-unit layers and ReLU activations. The network takes the agent's observation (without the latent variable) and the agent's action as input, and outputs a 6-dimensional vector, the first 3 dimensions being the mean and the last 3 the variance. During training, we gather all agents' observation and action data and feed them into this network together.
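A sketch of the variational network $q(z|a,o)$ described above (the two 32-unit ReLU layers and the 6-dimensional mean/variance output follow the text; the toy input dimensions and random weights are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    # random weights for a fully connected network (toy initialization)
    return [(rng.normal(0.0, 0.1, (i, o)), np.zeros(o)) for i, o in zip(sizes, sizes[1:])]

def q_net(params, x):
    for k, (w, b) in enumerate(params):
        x = x @ w + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)       # ReLU on the two hidden layers only
    # 6-dim output: first 3 dims are the mean, last 3 parameterize the variance
    return x[..., :3], x[..., 3:]

obs_dim, act_dim, batch = 10, 5, 8
params = init_mlp([obs_dim + act_dim, 32, 32, 6])
obs = rng.random((batch, obs_dim))                        # observations without z
act = np.eye(act_dim)[rng.integers(0, act_dim, batch)]    # one-hot actions
mean, var_param = q_net(params, np.concatenate([obs, act], axis=1))
assert mean.shape == (batch, 3) and var_param.shape == (batch, 3)
```

The predicted mean and variance feed the Gaussian log-likelihood term in $L_I$, with all agents' (observation, action) pairs batched through the same network.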

Map and Agent Selection

To evaluate our method, we select maps and agents such that the original QMIX achieves good performance in the non-robust setting but its policy is not robust with respect to the selected agents. After examining several maps for QMIX's performance and robustness, we choose four maps to evaluate our method: 8m, 2s3z, 3m, 3s5z. Detailed information can be found in Appendix A.3. In all evaluations, we choose ε = 0.3 for 3m and ε = 0.5 for the other three maps. We evaluate the "random agent" case on all maps except 3s5z, because on 3s5z the normal policy still obtains a 68% winning rate when a random agent has a 50% probability of selecting its worst action. For the "fixed agent" case, since many agents on these maps are homogeneous, we only select some representative agents for evaluation. The selection criteria are mainly based on the robustness of the normal QMIX policy; detailed information can be found in Appendix A.3. We evaluate a total of 9 different agent settings.

Adversarial Testing Results

In this part we compare four settings: normal policy (trained without an adversary but tested with one), vanilla adversarial training (the baseline), adversarial training with the global variable but without the mutual information loss, and adversarial training with both the global variable and the mutual information loss. Throughout this section we abbreviate these four settings as NP, VA, GV, and GM. We run each experiment 5 times, as QMIX does, and report the 25th and 75th percentiles as the lower and upper error bars. The mean winning rate during training is shown in Figure 1. We also plot the mean winning rate over the last few steps (dotted lines), with the performance of NP plotted as a black dotted line. The plots show that, by adding the global variable and the mutual information loss, GM performs better than VA and GV on average in most settings. GV is sometimes, but not always, better than VA; the reason might be that, without the mutual information loss, it is easier for agents to ignore the global variable's information. Additional experimental results can be found in Appendix A.4. In adversarial settings, the testing variance is larger than in normal settings, so in order to evaluate performance more accurately, we select for each run a high checkpoint within the error bar and test it for 1000 episodes. We also test the models with different adversarial rates. The results, depicted in Figure 2, further show that training with the global variable and mutual information leads to better adversarial testing performance. The raw data can be found in Table 3 in Appendix A.5.

Additional Baselines of Partially Observable Environments

In Section 4 we used examples to demonstrate the importance of correlation in robust CMARL settings for fully observable environments, where correlation is not important in normal CMARL settings but plays a key role in robust ones. However, SMAC is a partially observable environment, and some existing works (e.g. Bernstein et al. (2009)) pointed out that providing a global random variable to all agents can improve performance in partially observable environments even without adversaries. So, in order to demonstrate that correlation is more important in robust CMARL settings, we build another baseline as follows: we train a policy in the normal setting with the global variable and mutual information loss (abbreviated NG), and then evaluate it in adversarial settings. We compare the average performance improvements of NG over NP and of GM over VA. The results can be found in Table 1 here and Table 2 in Appendix A.4. They show that, on most maps and agents, the improvement of NG over NP is smaller than that of GM over VA. This indicates, to a certain extent, that "correlation is more important in robust settings": under the same evaluation settings, the improvement in the normal setting (normal training) is smaller than that in the robust setting (adversarial training).

Random Testing Results Apart from testing with adversarial agents, we also test scenarios where an agent makes non-adversarial mistakes, i.e. plays a random action with a certain probability:

$$a_{mis} = \begin{cases} \arg\max_a Q(s,a) & \text{with prob. } 1-\epsilon \\ \text{random choice from } \mathcal{A} & \text{with prob. } \epsilon \end{cases}$$

We use the same ε as above. The results are depicted in Figure 3 (random testing results with different random rates), and show that training with the global variable and mutual information leads to slightly better random testing performance. The raw data can be found in Table 4 in Appendix A.5.

6. CONCLUSION AND DISCUSSIONS

In this work, we focus on robust CMARL when one agent in the team makes mistakes or even behaves adversarially. We find that in team mini-max stochastic games, the performance of the decentralized equilibrium can be significantly worse than that of the correlated equilibrium. Thus, in order to achieve better robust CMARL performance, we propose a method that uses global variables with mutual information to help agents learn correlated equilibrium. The experimental results show that this method achieves better performance than vanilla adversarial training. Robust CMARL is an important research direction, as it directly determines whether we can safely deploy CMARL in practical problems. To the best of our knowledge, this research direction is still at an early stage, and in this work we only provide a simple method for the robust CMARL problem. Several future directions can be explored:

• The experiments in this work show that, in some settings, the robust policy slightly decreases the non-robust performance. One possible future work is to balance performance and robustness.

• We only consider a weak worst-case mini-max problem. Whether it is possible to solve the real adversarial case (i.e. Eq. (1)) remains future work.

• We only focus on the robustness of one agent. In reality, all agents may have a certain probability of making mistakes; considering this case is another possible future work.

A.2 DETAILED DERIVATIONS

Detailed derivation of Proposition 1 In the correlated equilibrium, every step in $S_1$ has probability $\frac{1}{m}$ of transferring to $S_2$, and every step in $S_2$ has probability $\frac{1}{m}$ of obtaining the reward, so

$$E_{cor} = \frac{1}{1-\left(1-\frac{1}{m}\right)\gamma}\cdot\gamma\cdot\frac{1}{1-\left(1-\frac{1}{m}\right)\gamma}\cdot\frac{1}{m}\cdot\frac{1}{m}$$

In the decentralized equilibrium, every step in $S_1$ has probability $\frac{1}{m^{n-1}}$ of transferring to $S_2$, and every step in $S_2$ has probability $\frac{1}{m^{n-1}}$ of obtaining the reward. Similarly:

$$E_{dec} = \frac{1}{1-\left(1-\frac{1}{m^{n-1}}\right)\gamma}\cdot\gamma\cdot\frac{1}{1-\left(1-\frac{1}{m^{n-1}}\right)\gamma}\cdot\frac{1}{m^{n-1}}\cdot\frac{1}{m^{n-1}}$$

Therefore:

$$\frac{E_{cor}}{E_{dec}} = \left(\frac{m^{n-1}(1-\gamma)+\gamma}{m(1-\gamma)+\gamma}\right)^2 \ge \left(\frac{m^{n-1}(1-\gamma)}{m(1-\gamma)+m\gamma}\right)^2 = m^{2n-4}(1-\gamma)^2$$

Detailed derivation of Proposition 2 Consider the $k$-step game with $k$ states $S_1,\dots,S_k$. The agents and action spaces are the same as in Proposition 1, and the discount factor is $\gamma$.
The reward function is $r(s, a) = 0$ for $s \in \{S_1,\dots,S_{k-1}\}$ and all $a$, and

$$r(S_k, a) = \begin{cases} 1 & a_1 = a_2 = \cdots = a_n \\ 0 & \text{otherwise} \end{cases}$$

That is, the team can only receive a reward in the last step, which is a common situation in RL environments. The deterministic state transition function is

$$T(S_i, a) = \begin{cases} S_{i+1} & a_1 = a_2 = \cdots = a_n \\ S_i & \text{otherwise} \end{cases} \qquad i = 1,\dots,k,$$

where $S_{k+1}$ denotes the end of the game. We prove that $\frac{E_{cor,k}}{E_{dec,k}} \ge m^{k(n-2)}(1-\gamma)^k$. Proposition 1 shows that $\frac{E_{cor,2}}{E_{dec,2}} \ge m^{2(n-2)}(1-\gamma)^2$. Now suppose $\frac{E_{cor,i}}{E_{dec,i}} \ge m^{i(n-2)}(1-\gamma)^i$. By a derivation similar to that of Proposition 1, we get

$$E_{cor,i+1} = \frac{1}{1-\left(1-\frac{1}{m}\right)\gamma}\cdot\frac{\gamma E_{cor,i}}{m}, \qquad E_{dec,i+1} = \frac{1}{1-\left(1-\frac{1}{m^{n-1}}\right)\gamma}\cdot\frac{\gamma E_{dec,i}}{m^{n-1}}.$$

So

$$\frac{E_{cor,i+1}}{E_{dec,i+1}} = \frac{m^{n-1}(1-\gamma)+\gamma}{m(1-\gamma)+\gamma}\cdot\frac{E_{cor,i}}{E_{dec,i}} \ge \frac{m^{n-1}(1-\gamma)}{m(1-\gamma)+m\gamma}\cdot\frac{E_{cor,i}}{E_{dec,i}} = m^{n-2}(1-\gamma)\cdot\frac{E_{cor,i}}{E_{dec,i}} \ge m^{(i+1)(n-2)}(1-\gamma)^{i+1}.$$

Derivation of Proposition 3 Given a state $s$: since the action space is discrete, suppose each agent has $m$ actions; then we can list all possible joint actions $a_1\cdots a_n$ in a sequence $\{a^k\}_{k=1}^{m^n}$. Consider the optimal correlated policy $\pi^*(a_1,\dots,a_n|s)$, and let $p_k = \pi^*(a^k|s)$, $k = 1,\dots,m^n$. Let $\mathcal{Z}$ be the value range of $z$. Since $z$ is a continuous random variable, we can divide the support of $\mathcal{Z}$ into $m^n$ disjoint subsets $\{\mathcal{Z}_k\}_{k=1}^{m^n}$ satisfying $P(z\in\mathcal{Z}_k) = p_k$. Next, let each agent perform the deterministic policy

$$\mu_i(s, z) = (a^k)_i \quad \text{if } z\in\mathcal{Z}_k.$$

That is, agent $i$ performs action $(a^k)_i$ if $z\in\mathcal{Z}_k$. Hence all agents perform the joint action $a^k$ when $z\in\mathcal{Z}_k$, and the team performs joint action $a^k$ with probability $P(z\in\mathcal{Z}_k) = p_k$. This is exactly the optimal correlated policy.
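The interval-partition construction in the derivation of Proposition 3 can be sketched in a few lines (illustrative code, not the authors' implementation; the uniform $z$ and the example distribution are our assumptions):

```python
import numpy as np

def make_policies(joint_actions, probs):
    # Partition [0, 1) into intervals Z_k with P(z in Z_k) = p_k for z ~ U(0, 1);
    # agent i deterministically decodes its component of joint action a^k from z.
    cuts = np.cumsum(probs)
    def mu(i: int, z: float) -> int:
        k = min(int(np.searchsorted(cuts, z, side="right")), len(joint_actions) - 1)
        return joint_actions[k][i]
    return mu

# Example: the optimal correlated policy of the Proposition 1 game with m = 3
joint_actions = [(0, 0, 0), (1, 1, 1), (2, 2, 2)]
mu = make_policies(joint_actions, [1 / 3, 1 / 3, 1 / 3])
rng = np.random.default_rng(0)
z = float(rng.random())            # shared draw: same generator and seed for all
a = [mu(i, z) for i in range(3)]   # each agent decodes from z independently
assert a[0] == a[1] == a[2]        # the team realizes a correlated joint action
```

Because every agent maps the same $z$ through the same partition, the joint action is distributed exactly as $\pi^*(\cdot|s)$ even though each agent acts on its own.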

A.3 SMAC MAP AND AGENT SELECTION

In the SMAC environment, not all maps are suitable for evaluating our method, because:

• On some map(s), QMIX's performance is not good even in the non-robust setting (e.g. 3s5z vs 3s6z, 6h vs 8z, 27m vs 30m, corridor).

• On some map(s), the policy trained by QMIX in the non-robust setting is already robust with respect to most agents (e.g. so many banelines, 3s vs {3,4,5}z, 1c3s5z except agent Colossi, MMM).

• On some hard map(s), it may be difficult for agents to achieve a robust policy (e.g. on 8m vs 9m or 10m vs 11m, our team already has fewer agents than the enemy; for agent Colossi on 1c3s5z, Colossi is a large unit, so if it makes mistakes the other small agents can hardly compensate).

Since policies on large maps are more likely to remain robust when only one agent makes mistakes, we decided to focus on small maps. After examining several maps for QMIX's performance and robustness, we chose four maps to evaluate our method: 8m, 2s3z, 3m, 3s5z. Since many agents on these maps are homogeneous, we only select some representative agents to evaluate.

• 3m map: All agents are homogeneous. We select the agent that is the least robust under the normal QMIX policy: agent 2. (The test winning rates of the normal QMIX policy over 1000 episodes are 18.9%, 21.2%, and 17.6% respectively, when agent 0, 1, or 2 has a 30% probability of selecting its worst action.)

• 8m map: All agents are homogeneous. Since this map has more agents than 3m, we select the most robust and the least robust agent under the normal QMIX policy: agent 4 and agent 6. (The test winning rates of the normal QMIX policy over 1000 episodes are 2.3%, 3.4%, 5.4%, 10.3%, 1.9%, 7.3%, 14.4%, and 7.1% respectively, when agent 0-7 has a 50% probability of selecting its worst action.)

• 3s5z map: During training we found this map difficult to train; we had to train for more steps to make it converge, and the training process is slow. Therefore we only evaluate the least robust agent under the normal QMIX policy: agent 3. (The test winning rates of the normal QMIX policy over 1000 episodes are 27.5%, 31.5%, 80.5%, 9%, 9.5%, 14.5%, 18%, and 22.5% respectively, when agent 0-7 has a 50% probability of selecting its worst action.)

• 2s3z map: During training we found this map somewhat difficult to train, so we wanted to evaluate only the least robust agent under the normal QMIX policy. But on this map agents 2 and 4 are almost equally robust, so we evaluate both of them. (The test winning rates of the normal QMIX policy over 1000 episodes are 6.1%, 8.1%, 5.7%, 6.4%, and 5.7% respectively, when agent 0-4 has a 50% probability of selecting its worst action.)

A.4 ADDITIONAL EXPERIMENT RESULTS

In this section, we present additional experimental results, including the mean winning rate during training (Figure 4a), the adversarial testing results (Figure 4b), the comparison of NG/NP and GM/VA (Table 2), and the random testing results (Figure 4c). The results show that the performance of GM is better than VA and GV in most settings, except for 2s3z's random agent, where GV is better than VA but GM is similar to GV. Also, in most settings, the performance improvement of NG over NP is smaller than that of GM over VA. The mean testing winning rates of the adversarial testing results are shown in Table 3, and those of the random testing results in Table 4.



applied adversarial training to autonomous driving tasks, perturbing the agent's input sensors based on the environment and then conducting adversarial training; 2) For robustness to the environment, the robust Markov decision process (MDP) can be used to formulate the problem. Many works (e.g. Wiesemann et al. (2013); Lim et al. (2013)) have studied this model and provided both theoretical analysis and algorithm design. In the deep RL setting, Rajeswaran et al. (2016) used a Monte Carlo approach to train the agent, while Abdullah et al. (2019); Hou et al. (2020) adopted adversarial training to obtain an agent that is robust to all environments within a Wasserstein ball. Mankowitz et al. (2019) conducted adversarial training with the MPO algorithm to optimize performance in the worst-case environment. 3) For robustness against perturbations of actions or policies, Tessler et al. (2019); Gu et al. (2018); Vinitsky et al. (2020) considered the case where an agent's action may be influenced by another action, and conducted adversarial training. 4) For robustness to opponents, Pinto et al. (2017); Ma et al.

Figure 1: Test winning rate v.s. training steps for different maps and adversarial agents (NP: normal policy; VA: vanilla adversarial training; GV: adversarial training with global variable; GM: adversarial training with global variable and mutual information loss)

Figure 4: Additional Testing Results
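One sampling iteration of the overall training procedure (Algorithm 1 below) can be sketched in Python. This is an illustrative sketch only: the latent distribution, the Q-function interface, and the worst-action mistake model (standing in here for Eq. 2, which is not reproduced in this section) are assumptions.

```python
import numpy as np

def sample_step(obs, q_funcs, n_actions, eps=0.05, z_dim=4, rng=None):
    """One sampling iteration: draw a shared latent z_t, pick each agent's
    action eps-greedily from Q_i(o_it, z_t, .), then replace one randomly
    chosen agent's action with its worst (lowest-Q) action, mimicking the
    mistaken action a_it,mis."""
    rng = rng or np.random.default_rng()
    z = rng.standard_normal(z_dim)                 # shared latent variable z_t
    actions = []
    for i, q in enumerate(q_funcs):
        qs = q(obs[i], z)                          # Q_i(o_it, z_t, .)
        if rng.random() < eps:
            a = int(rng.integers(n_actions))       # exploratory action
        else:
            a = int(np.argmax(qs))                 # greedy action
        actions.append(a)
    i_mis = int(rng.integers(len(q_funcs)))        # agent forced to err
    actions_mis = list(actions)
    actions_mis[i_mis] = int(np.argmin(q_funcs[i_mis](obs[i_mis], z)))
    return z, actions, actions_mis, i_mis
```

The resulting transition (s_t, z_t, a_t,mis, r_t, s_t+1) would then be stored in the replay buffer D, as in Algorithm 1.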

Algorithm 1: Overall training procedure
  Initialize replay buffer D = ∅
  for each epoch do
    for sampling loop do
      Obtain current s_t or o_t.
      Sample z_t from some distribution.
      Rollout action a_t (using ε-greedy or another method, with Q_i(s_t, z_t, a_it) or Q_i(o_it, z_t, a_it)).
      Select i and rollout the mistaken action a_it,mis using Eq. 2.
      Let a_t,mis = (a_1t, ..., a_{i-1,t}, a_it,mis, a_{i+1,t}, ..., a_nt).
      Perform action a_t,mis and get r_t, s_t+1 or o_t+1.
      Store transition (s_t, z_t, a_t,mis, r_t, s_t+1) or (s_t, o_t, z_t, a_t,mis, r_t, s_t+1, o_t+1) in D.
    end
    for training loop do
      Sample a minibatch M from replay buffer D.
      Compute the QMIX loss L_TD.
      Compute the overall loss L_tot = L_TD + λ_I L_I, with L_I described above.
      Perform an update step of QMIX with loss L_tot.
    end
  end

Comparison with existing works. Both Chen et al. (2019) and

Comparison of NG/NP and GM/VA: Mean test winning rate of different settings (NP: normal policy; NG: normal training with global variable and mutual information loss; VA: vanilla adversarial training; GM: adversarial training with global variable and mutual information loss)

Also, as mentioned at the end of Section 4, we only use the most straightforward correlation method in order to demonstrate the importance of correlation in robust CMARL. Whether more sophisticated correlation methods can achieve even better performance in robust CMARL settings remains an open problem.

  Obtain current s_t or o_t, and rollout action a_t (using ε-greedy or another method).
  Select i and rollout the mistaken action a_it,mis using Eq. 2.
  Let a_t,mis = (a_1t, ..., a_{i-1,t}, a_it,mis, a_{i+1,t}, ..., a_nt).
  Perform action a_t,mis and get r_t, s_t+1 or o_t+1.
  Store transition (s_t, a_t,mis, r_t, s_t+1) or (s_t, o_t, a_t,mis, r_t, s_t+1, o_t+1) in D.

In the correlated equilibrium, every step in S_1 has probability 1/m to transfer to S_2, and every step in S_2 has probability 1/m to yield the reward. If the current state is S_2, then the expected reward is:
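As a numerical illustration of the chain dynamics just described, a minimal Monte Carlo sketch: if each step in S_2 yields the reward with probability 1/m, the waiting time until the first reward is geometric with mean m steps. The analytic expected-reward expression itself is part of the proof and is not reproduced here.

```python
import numpy as np

def mean_steps_until_reward(m, n_trials=200_000, seed=0):
    """Starting from S_2, each step yields the reward with probability 1/m.
    The waiting time is geometric with mean m; verify this empirically."""
    rng = np.random.default_rng(seed)
    # Generator.geometric counts the number of trials up to the first success
    samples = rng.geometric(1.0 / m, size=n_trials)
    return float(samples.mean())
```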

Additional comparison of NG/NP and GM/VA

Adversarial and Random Testing Results: Mean test winning rate (%)

Adversarial Agent
  VA  68.6  70.7  72.7  72.2  64.9  53.9
  GV  92.0  91.4  88.2  81.9  71.3  58.5
  GM  93.4  93.0  92.5  90.2  81.6  72.6

Random Agent
  NP  97.5  94.7  87.9  77.9  63.5  45.6
  VA  89.9  90.0  88.1  82.1  74.5  60.9
  GV  96.8  97.4  94.1  89.4  80.6  69.6
  GM  91.6  93.1  91.4  87.5  81.1  67.5

