RMIX: RISK-SENSITIVE MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Centralized training with decentralized execution (CTDE) has become an important paradigm in multi-agent reinforcement learning (MARL). Current CTDE-based methods rely on restrictive decompositions of the centralized value function across agents, decomposing the global Q value into individual Q values to guide individuals' behaviours. However, such expected, i.e., risk-neutral, Q value decomposition is not sufficient even with CTDE due to the randomness of rewards and the uncertainty of environments, which causes these methods to fail to train coordinating agents in complex environments. To address these issues, we propose RMIX, a novel cooperative MARL method with the Conditional Value at Risk (CVaR) measure over the learned distributions of individuals' Q values. Our main contributions are threefold: (i) we first learn the return distributions of individuals to analytically calculate CVaR for decentralized execution; (ii) we then propose a dynamic risk level predictor for CVaR calculation to handle the temporal nature of the stochastic outcomes during executions; (iii) we finally propose a risk-sensitive Bellman equation along with Individual-Global-MAX (IGM) for MARL training. Empirically, we show that our method significantly outperforms state-of-the-art methods on many challenging StarCraft II tasks, demonstrating significantly enhanced coordination and high sample efficiency. Demonstrative videos and results are available at this anonymous link: https://sites.google.com/view/rmix.

1. INTRODUCTION

Reinforcement learning (RL) has made remarkable advances in many domains, including arcade video games (Mnih et al., 2015), complex continuous robot control (Lillicrap et al., 2016) and the game of Go (Silver et al., 2017). Recently, many researchers have extended RL methods to multi-agent systems (MASs), such as urban systems (Singh et al., 2020), coordination of robot swarms (Hüttenrauch et al., 2017) and real-time strategy (RTS) video games (Vinyals et al., 2019). Centralized training with decentralized execution (CTDE) (Oliehoek et al., 2008; Kraemer & Banerjee, 2016) has drawn enormous attention: each agent's policy is trained with access to global trajectories in a centralized way, while actions are executed given only the local observations of each agent in a decentralized way. Empowered by CTDE, several MARL methods, both value-based and policy gradient-based, have been proposed (Foerster et al., 2017a; Sunehag et al., 2017; Rashid et al., 2018; Son et al., 2019). These methods propose decomposition techniques that factorize the global Q value either via structural constraints or by estimating state-values or inter-agent weights. Among them, VDN (Sunehag et al., 2017) and QMIX (Rashid et al., 2018) are representative methods that use additivity and monotonicity structural constraints, respectively. With relaxed structural constraints, QTRAN (Son et al., 2019) guarantees a more general factorization than VDN and QMIX. Other methods incorporate an estimation of advantage values (Wang et al., 2020a) or propose a multi-head attention method to represent the global values (Yang et al., 2020).
Despite their merits, most of these works focus on decomposing the global Q value into individual Q values under different constraints and network architectures, ignoring the fact that the expected, i.e., risk-neutral, value decomposition is not sufficient even with CTDE due to the randomness of rewards and the uncertainty in environments, which causes these methods to fail to train coordinating agents in complex environments. Specifically, these methods only learn the expected values over returns (Rashid et al., 2018) and do not handle the high variance caused by events that yield extremely high/low rewards with small probabilities, which leads to inaccurate and insufficient estimations of future returns. Therefore, instead of expected values, learning distributions of future returns, i.e., Q values, is more useful for agents to make decisions. Recently, QR-MIX (Hu et al., 2020) decomposes the estimated joint return distribution (Bellemare et al., 2017; Dabney et al., 2018a) into individual Q values. However, the policies in QR-MIX are still based on expected individual Q values. Furthermore, given that the environment is nonstationary from the perspective of each agent, each agent needs a more dynamic way to choose actions based on the return distributions, rather than simply taking the expected values. Current MARL methods do not extensively investigate these aspects. Motivated by these reasons, we intend to extend risk-sensitive RL (Chow & Ghavamzadeh, 2014; Keramati et al., 2020; Zhang et al., 2020) to MARL settings, where risk-sensitive RL optimizes policies with a risk measure, such as variance, the power formula, value at risk (VaR) and conditional value at risk (CVaR). Among these risk measures, CVaR has been gaining popularity due to both theoretical and computational advantages (Rockafellar & Uryasev, 2002; Ruszczyński, 2010).
However, there are two main obstacles: i) most previous works focus on risk-neutral or static risk levels in single-agent settings, ignoring the randomness of rewards and the temporal structure of agents' trajectories (Dabney et al., 2018a; Tang et al., 2019; Ma et al., 2020; Keramati et al., 2020); ii) many methods use risk measures over Q values only for policy execution, without using the risk measure values in policy optimization via temporal difference (TD) learning, which causes the global value factorization over expected individual values to yield sub-optimal behaviours in MARL. We provide a detailed review of related works in Appendix A due to limited space. In this paper, we propose RMIX, a novel cooperative MARL method with CVaR over the learned distributions of individuals' Q values. Specifically, our contributions are threefold: (i) we first learn the return distributions of individuals, parameterized by Dirac Delta functions, in order to analytically calculate CVaR for decentralized execution; the resulting CVaR values at each time step serve as policies for each agent via the arg max operation; (ii) we then propose a dynamic risk level predictor for CVaR calculation to handle the temporal nature of stochastic outcomes during executions; the predictor measures the discrepancy between the embedding of the current individual return distribution and the embedding of historical return distributions, and the dynamic risk levels are agent-specific and observation-wise; (iii) we finally propose a risk-sensitive Bellman equation along with IGM for centralized training; it enables CVaR value updates in a recursive form and can be trained with TD learning via a neural network. These also allow our method to achieve temporally extended exploration and enhanced temporal coordination, which are key to solving complex multi-agent tasks.
Empirically, we show that RMIX significantly outperforms state-of-the-art methods on many challenging StarCraft II tasks, demonstrating enhanced coordination in many symmetric & asymmetric and homogeneous & heterogeneous scenarios and revealing high sample efficiency. To the best of our knowledge, our work is the first attempt to investigate cooperative MARL with risk-sensitive policies under the Dec-POMDP framework.

2. PRELIMINARIES

In this section, we provide the notation and basic notions used in the following. We consider the probability space (Ω, F, Pr), where Ω is the set of outcomes (sample space), F is a σ-algebra over Ω representing the set of events, and Pr is a probability measure. Given a set X, we denote by P(X) the set of all probability measures over X.

Dec-POMDP A fully cooperative MARL problem can be described as a decentralised partially observable Markov decision process (Dec-POMDP) (Oliehoek et al., 2016), formulated as a tuple M = ⟨S, U, P, R, Υ, O, N, γ⟩, where s ∈ S denotes the true state of the environment. At each time step, each agent i ∈ N := {1, ..., N} chooses an action u_i ∈ U, giving rise to a joint action vector u := [u_i]_{i=1}^N ∈ U^N. The transition function P(s'|s, u): S × U^N → P(S) gives the distribution over next states. Each agent also has an action-observation history τ_i ∈ T := (Υ × U)*, on which it conditions its stochastic policy π_i(u_i|τ_i): T × U → [0, 1].

CVaR CVaR is a coherent risk measure with attractive computational properties (Rockafellar & Uryasev, 2002), originally derived for loss distributions in discrete decision-making in finance, and it has gained popularity in various engineering and finance applications. CVaR (as illustrated in Figure 1) is the expectation of values that are less than or equal to the α-percentile value of the distribution over returns.

Formally, let X ∈ X be a bounded random variable with cumulative distribution function F(x) = P[X ≤ x] and inverse CDF F^{-1}(u) = inf{x : F(x) ≥ u}. The conditional value at risk (CVaR) at level α ∈ (0, 1] of a random variable X is defined as CVaR_α(X) := sup_ν {ν − (1/α) E[(ν − X)^+]} (Rockafellar et al., 2000) when X is a discrete random variable. Correspondingly, CVaR_α(X) = E_{X∼F}[X | X ≤ F^{-1}(α)] (Acerbi & Tasche, 2002) when X has a continuous distribution. The α-percentile value itself is the value at risk (VaR). For ease of notation, we write CVaR as a function of the CDF F, CVaR_α(F).

[Figure 1: CVaR, marking the risk level α, VaR and the mean of the return distribution.]

Risk-sensitive RL Risk-sensitive RL uses risk criteria over policies/values and is a sub-field of safe RL (García et al., 2015). Von Neumann & Morgenstern (1947) proposed the expected utility theory, in which a decision policy behaves as though it is maximizing the expected value of some utility function. The theory holds when the decision policy is consistent with a particular set of four axioms; this is the most pervasive notion of risk-sensitivity. A policy maximizing a linear utility function is called risk-neutral, whereas concave or convex utility functions give rise to risk-averse or risk-seeking policies, respectively. Many risk measures have been used in RL, such as CVaR (Chow et al., 2015; Dabney et al., 2018a) and the power formula (Dabney et al., 2018a). However, little work has been done in MARL, to which these single-agent methods cannot easily be extended. Our work fills this gap.

CTDE CTDE has recently attracted attention in deep MARL as a way to deal with nonstationarity while learning decentralized policies. One promising way to exploit the CTDE paradigm is value function decomposition (Sunehag et al., 2017; Rashid et al., 2018; Son et al., 2019), which learns a decentralized utility function for each agent and uses a mixing network to combine these local Q values into a global action-value.
It follows the IGM principle where the optimal joint actions across agents are equivalent to the collection of individual optimal actions of each agent (Son et al., 2019) . To achieve learning scalability, existing CTDE methods typically learn a shared local value or policy network for agents.
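As a concrete sketch of the CVaR definition above, the following minimal NumPy snippet computes empirical VaR and CVaR of a sample of returns via the inverse-CDF form CVaR_α(X) = E[X | X ≤ F^{-1}(α)]. The sample values and the tie-breaking at the quantile index are illustrative choices, not the paper's implementation.

```python
import numpy as np

def var_cvar(samples, alpha):
    """Empirical VaR and CVaR at level alpha for return samples
    (lower tail: CVaR is the mean of outcomes at or below VaR)."""
    x = np.sort(np.asarray(samples, dtype=float))
    # inverse CDF: smallest x with F(x) >= alpha
    k = int(np.ceil(alpha * len(x)))   # number of tail samples
    var = x[k - 1]                     # alpha-percentile value (VaR)
    cvar = x[:k].mean()                # expectation of the alpha-tail
    return var, cvar

returns = [-4.0, -1.0, 0.0, 2.0, 3.0, 5.0, 6.0, 8.0]
var, cvar = var_cvar(returns, alpha=0.25)   # worst 25% of outcomes
print(var, cvar)   # -> -1.0 -2.5
```

Note that at α = 1 the tail covers the whole sample, so CVaR_1 reduces to the ordinary mean, i.e., the risk-neutral value.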

3. METHODOLOGY

In this section, we present our framework RMIX, as displayed in Figure 2, where the agent network learns the return distribution of each agent, a risk operator network determines the risk level of each agent, and the mixing network mixes the outputs of the agents' risk operators to produce the global value. In the rest of this section, we first introduce the CVaR operator that analytically calculates the CVaR value from the modeled individual distribution of each agent in Section 3.1, and propose the dynamic risk level predictor to alleviate the time-consistency issue in Section 3.2. Then, we introduce the risk-sensitive Bellman equation for both centralized training and decentralized execution in Section 3.3. Finally, we provide the details of the centralized training of RMIX in Section 3.4. All proofs are provided in Appendix B.

3.1. CVAR OF RETURN DISTRIBUTION

In this section, we describe how we estimate the CVaR value. The value of CVaR can be either estimated through sampling or computed from the parametrized return distribution (Rockafellar & Uryasev, 2002). However, the sampling method is usually computationally expensive (Tang et al., 2019). Therefore, we let each agent learn a return distribution parameterized by a mixture of Dirac Delta (δ) functions, which is demonstrated to be highly expressive and computationally efficient (Bellemare et al., 2017). For convenience, we provide the definition of the Generalized Return Probability Density Function (PDF).

Definition 1. (Generalized Return PDF). For a discrete random variable R ∈ [−R_max, R_max] with probability mass function (PMF) P(R = r_k), where r_k ∈ [−R_max, R_max], we define the generalized return PDF as: f_R(r) = Σ_{r_k ∈ R} P_R(r_k) δ(r − r_k). Note that for any r_k ∈ R, the probability of R = r_k is given by the coefficient of the corresponding δ function, δ(r − r_k).

We define the return distribution of each agent i at time step t as:

Z_i^t(τ_i, u_i^{t−1}) = Σ_{j=1}^m P_j(τ_i, u_i^{t−1}) δ_j(τ_i, u_i^{t−1}),   (1)

where m is the number of Dirac Delta functions, δ_j(τ_i, u_i^{t−1}) is the j-th Dirac Delta function indicating the estimated value, which can be parameterized by neural networks in practice, and P_j(τ_i, u_i^{t−1}) is the corresponding probability of the estimated value given local observations and actions. τ_i and u_i^{t−1} are the trajectory (up to that timestep) and action of agent i, respectively. With the individual return distribution Z_i^t(τ_i, u_i^{t−1}) ∈ Z and cumulative distribution function (CDF) F_{Z_i(τ_i, u_i^{t−1})}, we define the CVaR operator Π_{α_i} at a risk level α_i (α_i ∈ (0, 1], i ∈ A) over returns as

C_i^t(τ_i, u_i^{t−1}, α_i) = Π_{α_i^t} Z_i^t(τ_i, u_i^{t−1}) = CVaR_{α_i^t}(F_{Z_i^t(τ_i, u_i^{t−1})}),   (2)

where C ∈ C.
As we use CVaR on return distributions, α_i = 1 corresponds to risk-neutrality (the expectation), and α_i → 0 indicates an increasing degree of risk-aversion. CVaR_{α_i} can be estimated in a nonparametric way given an ordering of the Dirac Delta functions {δ_j}_{j=1}^m (Kolla et al., 2019) by leveraging the individual distribution:

CVaR_{α_i} = Σ_{j=1}^m P_j δ_j 1{δ_j ≤ v̂_{m,α_i}},   (3)

where 1{·} is the indicator function and v̂_{m,α_i} = δ_{⌊m(1−α_i)⌋} is the estimated value at risk, with ⌊·⌋ being the floor function. This is a closed-form formulation and can be easily implemented in practice. The optimal action of agent i can be calculated via arg max_{u_i} C_i(τ_i, u_i^{t−1}, α_i). We introduce the decentralized execution in detail in Section 3.3.
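The closed-form estimate above can be sketched as follows. This is an illustrative implementation that uses the conditional-expectation form of CVaR (normalizing by the tail probability mass), since normalization conventions for the discrete estimator vary slightly across papers; the atom values and probabilities are made up.

```python
import numpy as np

def cvar_from_diracs(probs, atoms, alpha):
    """CVaR_alpha of a discrete return distribution given as a mixture of
    Dirac deltas: `atoms` are the estimated values delta_j, `probs` their
    masses P_j. Lower-tail CVaR = E[X | X <= VaR_alpha]."""
    order = np.argsort(atoms)
    p, z = np.asarray(probs, float)[order], np.asarray(atoms, float)[order]
    cdf = np.cumsum(p)
    # VaR: smallest atom whose CDF reaches alpha
    k = np.searchsorted(cdf, alpha)
    var = z[k]
    tail = z <= var
    return (p[tail] * z[tail]).sum() / p[tail].sum()

# four equiprobable atoms; alpha = 0.5 averages the two worst outcomes
print(cvar_from_diracs([0.25] * 4, [-2.0, 0.0, 1.0, 3.0], 0.5))  # -> -1.0
```

At α = 1 the tail covers every atom, recovering the risk-neutral expectation, consistent with the discussion above.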

3.2. DYNAMIC RISK LEVEL PREDICTION

The values of the risk levels, i.e., α_i, i ∈ A, are important for the agents to make decisions. Most previous works take a fixed risk level and do not take into account any temporal structure of agents' trajectories, which can impede centralized training in evolving multi-agent environments. Therefore, we propose dynamic risk level prediction, which determines the risk levels of agents by explicitly taking into account the temporal nature of the stochastic outcomes, to alleviate the time-consistency issue (Ruszczyński, 2010; Iancu et al., 2015) and stabilize centralized training. Specifically, we represent the risk operator Π_α by a deep neural network, which calculates the CVaR value with a predicted dynamic risk level α over the return distribution.

[Figure 3: Agent architecture. Figure 4: The predictor ψ_i computes the dynamic risk level α_i and masks the K Dirac functions of each action-return distribution to produce C_i = Π_{α_i} Z_i.]

We show the architecture of agent i in Figure 3 and illustrate how ψ_i works with agent i for CVaR calculation in practice in Figure 4. As depicted in Figure 4, at time step t, the agent's return distribution is Z_i and its historical return distribution is Z̄_i. We then take an inner product to measure the discrepancy between the embedding of the individual return distribution f_emb(Z_i) and the embedding of the past trajectory φ_i(τ_i^{0:t−1}, u_i^{t−1}) modeled by a GRU (Chung et al., 2014). We discretize the risk level range into K even ranges for the purpose of computing. The k-th dynamic risk level α_i^k is output from ψ_i and the probability of α_i^k is defined as:

P(α_i^k) = exp(⟨f_emb(Z_i)_k, φ_i^k⟩) / Σ_{k'=0}^{K−1} exp(⟨f_emb(Z_i)_{k'}, φ_i^{k'}⟩).   (4)

Then we take the k ∈ [1, ..., K] with the maximal probability via arg max and normalize it into (0, 1], thus α_i = k/K.
The predicted risk level α_i is a scalar and is converted into a K-dimensional mask vector whose first α_i × K items are one and whose remaining items are zero. This mask vector is used to calculate the CVaR value (Eqn. 2 and 3) of each action-return distribution, which contains K Dirac functions. Finally, we obtain C_i and the policy π_i as illustrated in Figure 3. During training, f_emb_i updates its weights while its gradients are blocked (the dotted arrow in Figure 3) in order to prevent changing the weights of the network of agent i. We note that the predictor differs from the attention networks used in previous works (Iqbal & Sha, 2019; Yang et al., 2020): the agent's current return distribution and its return distribution of the previous time step are separate inputs to their embeddings, and there are no key, query and value weight matrices. The dynamic risk level predictor allows each agent to determine its risk level dynamically based on historical return distributions.
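The softmax over inner products (Eq. 4) and the mask-based CVaR computation can be sketched as below. The embedding shapes, the equal atom masses of 1/K and the random inputs are hypothetical stand-ins for the learned networks f_emb and φ_i.

```python
import numpy as np

def predict_risk_level(z_emb, phi_emb):
    """Hypothetical sketch of the dynamic risk level predictor psi_i:
    score each of K candidate risk levels by the inner product between the
    embedding of the current return distribution and the embedding of the
    agent's history, softmax the scores, then take the most likely level."""
    K = z_emb.shape[0]
    scores = (z_emb * phi_emb).sum(axis=-1)   # <f_emb(Z_i)_k, phi_i^k>
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                      # Eq. (4)
    k = int(probs.argmax()) + 1               # k in [1..K]
    return k / K                              # alpha in (0, 1]

def cvar_with_mask(dist, alpha):
    """Mask-based CVaR over K ascending Dirac atoms with equal mass 1/K:
    keep the first alpha*K atoms (the lower tail), zero out the rest."""
    K = dist.shape[0]
    mask = (np.arange(K) < int(alpha * K)).astype(float)
    return (dist * mask).sum() / mask.sum()

rng = np.random.default_rng(0)
# K=4 candidate risk levels, embedding dimension 8 (illustrative shapes)
alpha = predict_risk_level(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(alpha, cvar_with_mask(np.array([-3.0, -1.0, 2.0, 6.0]), alpha))
```

Because k ≥ 1, the mask always keeps at least one atom, so α = 0 (an empty tail) can never be predicted, matching the α ∈ (0, 1] constraint in the text.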

3.3. RISK-SENSITIVE BELLMAN EQUATION

Motivated by the success of optimizing the CVaR value in single-agent RL (Chow & Ghavamzadeh, 2014), RMIX aims to maximize the CVaR value of the joint return distribution, rather than the expectation (Rashid et al., 2018; Son et al., 2019). As proved in Theorem 1, the maximizing operation over CVaR values satisfies the IGM principle, which implies that maximizing the CVaR value of the joint return distribution is equivalent to maximizing the CVaR value of each agent.

Theorem 1. In decentralized execution, given α = {α_i}_{i=0}^{n−1}, we define the global arg max performed on the global CVaR C_tot(τ, u, α) as:

arg max_u C_tot(τ, u, α) = (arg max_{u_1} C_1(τ_1, u_1, α_1), ..., arg max_{u_n} C_n(τ_n, u_n, α_n)),   (5)

where τ and u are the trajectories (up to that timestep) and actions of all agents, respectively. The individuals' maximization operation over return distributions defined above satisfies IGM and allows each agent to participate in decentralised execution solely by choosing greedy actions with respect to its C_i(τ_i, u_i, α_i).

To maximize the CVaR value of each agent, we define the risk-sensitive Bellman operator T:

T C_tot(s, u, α) := E[R(s, u) + γ max_{u'} C_tot(s', u', α')],

where α and α' are the agents' static risk levels or dynamic risk levels output from the dynamic risk level predictor ψ at each time step. The risk-sensitive Bellman operator T operates on the CVaR value of the agent and the reward, and can be proved to be a contraction, as shown in Proposition 1. Therefore, we can leverage TD learning (Sutton & Barto, 2018) to compute the maximal CVaR value of each agent, thus leading to the maximal global CVaR value.

Proposition 1. T: C → C is a γ-contraction.
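To make the IGM property of Theorem 1 concrete, here is a toy numerical check with a hand-built monotone mixing function (a positively weighted sum, the simplest function that is increasing in each C_i); the CVaR values and weights are made up for illustration.

```python
import numpy as np

# For a mixing function monotone in each agent's CVaR value, the joint
# greedy action equals the tuple of per-agent greedy actions.
c1 = np.array([0.2, 1.5, -0.3])   # C_1(tau_1, u_1) over 3 actions
c2 = np.array([-1.0, 0.4, 0.9])   # C_2(tau_2, u_2) over 3 actions
w = np.array([0.7, 1.3])          # positive weights => monotone mixing

joint = w[0] * c1[:, None] + w[1] * c2[None, :]   # C_tot for all action pairs
joint_greedy = tuple(int(i) for i in np.unravel_index(joint.argmax(), joint.shape))
per_agent = (int(c1.argmax()), int(c2.argmax()))
print(joint_greedy, per_agent)    # -> (1, 2) (1, 2)
```

This is exactly why each agent can act greedily on its own C_i at execution time while the mixing network only needs to exist during centralized training.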

3.4. CENTRALIZED TRAINING

We now introduce the centralized training of RMIX. We utilize the monotonic mixing network of QMIX, a value decomposition network built with hypernetworks (Ha et al., 2017) to maintain monotonicity, which has shown success in cooperative MARL. Based on the IGM principle (Son et al., 2019), we define monotonic IGM between C_tot and C_i for RMIX: ∂C_tot/∂C_i ≥ 0, ∀i ∈ {1, 2, ..., N}, where C_tot is the total CVaR value and C_i(τ_i, u_i) is the individual CVaR value of agent i; C_tot can be considered a latent combination of the agents' implicit CVaR values. Following the CTDE paradigm, we define the TD loss of RMIX as:

L_Π(θ) := E_{D∼D}[(y_t^tot − C_tot(s_t, u_t, α_t))^2],   (8)

where y_t^tot = r_t + γ max_{u'} C_tot^{θ'}(s_{t+1}, u', α'), and (y_t^tot − C_tot^θ(s_t, u_t, α_t)) is our CVaR TD error for updating CVaR values. θ denotes the parameters of C_tot, which can be modeled by a deep neural network, and θ' denotes the parameters of the target network, periodically copied from θ to stabilize training (Mnih et al., 2015). During training, gradients from Z_i through the dynamic risk level predictor are blocked to avoid changing the weights of the agents' network. We train RMIX in an end-to-end manner: ψ_i is trained together with the agent network via the loss defined in Eq. 8, and f_emb_i updates its weights while its gradients are blocked in order to prevent changing the weights of the return distribution in agent i. The pseudo code of RMIX is shown in Algorithm 1 in Appendix D. Our framework, shown in Figure 2, is flexible and can be easily used in many cooperative MARL methods.
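A minimal sketch of the CVaR TD target and loss of Eq. 8, with scalar stand-ins for the mixing network outputs; the function names and shapes here are hypothetical, not the PyMARL implementation.

```python
import numpy as np

def cvar_td_target(reward, next_ctot, gamma=0.99):
    """One-step CVaR TD target y_tot = r + gamma * max_u' C_tot(s', u', alpha').
    `next_ctot` holds C_tot for each candidate joint action in s', a stand-in
    for the target mixing network's output."""
    return reward + gamma * np.max(next_ctot)

def td_loss(ctot, target):
    """Squared CVaR TD error averaged over a batch (Eq. 8)."""
    return float(np.mean((target - ctot) ** 2))

y = cvar_td_target(1.0, np.array([0.5, 2.0, -1.0]), gamma=0.5)
print(y)                                          # -> 2.0
print(td_loss(np.array([1.2]), np.array([y])))    # -> ~0.64
```

In the actual method, the max over joint actions is computed per-agent thanks to the IGM property, and the target uses a periodically copied parameter set θ' rather than a separate evaluation of the online network.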

4. EXPERIMENTS

We empirically evaluate our methods on various StarCraft II scenarios. In particular, we are interested in robust cooperation in complex asymmetric and homogeneous/heterogeneous scenarios. Additional descriptions of baselines, scenarios and results are in Appendices C, E, F and G.

4.1. EXPERIMENT SETUP

StarCraft II We consider the SMAC (Samvelyan et al., 2019) benchmark (screenshots of some scenarios are in Figure 5), a challenging set of cooperative StarCraft II maps for micromanagement, as our evaluation environment. We evaluate our methods every 10,000 training steps by running 32 episodes in which agents trained with our methods battle built-in game bots. We report the mean test won rate (test_battle_won_mean, the percentage of episodes won by the MARL agents) along with one standard deviation of the won rate (shaded in figures). [Figure 5: SMAC scenarios: 27m_vs_30m, 5m_vs_6m, 6h_vs_8z, corridor and MMM2.] Due to limited page space, we present the results of our methods and baselines on 8 scenarios (we train our methods and baselines on 17 SMAC scenarios): corridor, 3s5z_vs_3s6z, 6h_vs_8z, 5m_vs_6m, 8m_vs_9m, 10m_vs_11m, 27m_vs_30m and MMM2. Table 1 shows detailed information on these scenarios.

Baselines and training

The baselines are IQL (Tampuu et al., 2017), VDN (Sunehag et al., 2017), COMA (Foerster et al., 2017a), QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), MAVEN (Mahajan et al., 2019) and Qatten (Yang et al., 2020). We implement our methods on PyMARL and use 5 random seeds for training each method on 17 SMAC scenarios. We carry out experiments on an NVIDIA Tesla V100 GPU (16 GB). RMIX demonstrates substantial superiority over the baselines in asymmetric and homogeneous scenarios, as depicted in Figure 7. RMIX outperforms the baselines in the asymmetric homogeneous scenarios 5m_vs_6m, 8m_vs_9m, 10m_vs_11m and 27m_vs_30m (hard game). In 3s5z_vs_3s6z (asymmetric heterogeneous, very hard game) and MMM2 (symmetric heterogeneous, hard game), RMIX also shows leading performance over the baselines. RMIX learns micro-tricks (wall off and focus fire) faster and better on the very hard corridor and 6h_vs_8z. RMIX improves coordination in a sample efficient way via risk-sensitive policies. We summarize the performance of RMIX and QMIX on 17 SMAC scenarios in Figure 6. Readers can refer to Figures 19 and 20 for more results. We present results of RMIX on 3s5z_vs_3s6z and 6h_vs_8z over 8 million training steps in Figures 13 and 14. Although training 27m_vs_30m is memory-consuming, we also present results over 2.5 million training steps, as depicted in Figure 15.

4.2. EXPERIMENT RESULTS

Interestingly, as illustrated in Figure 8, RMIX also demonstrates leading exploration performance over the baselines on the very hard corridor scenario (Figure 5(d)), where there is a narrow corridor. [Figure 9: test_battle_won_mean of RMIX vs QR-MIX and QMIX.] In addition, we compare RMIX with QR-MIX (Hu et al., 2020). We implement QR-MIX with PyMARL and train it on 5m_vs_6m (hard), corridor (very hard), 6h_vs_8z (very hard) and 3s5z_vs_3s6z (very hard) with 3 random seeds per scenario. The hyper-parameters used during training are from the QR-MIX paper. As shown in Figure 9, RMIX shows leading performance and superior sample efficiency over QR-MIX on 5m_vs_6m, 27m_vs_30m and 6h_vs_8z. With distributional RL, QR-MIX presents slightly better performance on 3s5z_vs_3s6z. We present more results of RMIX vs QR-MIX in Appendix G.3.

We conduct an ablation study by fixing the risk level at 1, which yields a risk-neutral method we name RMIX (α = 1). We present results of RMIX, RMIX (α = 1) and QMIX on 4 scenarios in Figure 10. RMIX outperforms RMIX (α = 1) in many heterogeneous and asymmetric scenarios, demonstrating the benefits of learning risk-sensitive MARL policies in complex scenarios where the potential for loss should be taken into consideration in coordination. Intuitively, in asymmetric scenarios agents can easily be defeated by the opponents; consequently, coordination between agents must be cautious in order to win the game, and the cooperative strategies in these scenarios should avoid massive casualties in the starting stage of the game. Our risk-sensitive policy representation thus works better than vanilla expected Q values in evaluation. In heterogeneous scenarios, the action and observation spaces differ among different types of agents, and methods with vanilla expected action values are inferior to RMIX.

Finally, analogous to VDN, we replace the monotonic mixing network with an additive one, which yields RDN with C_tot(τ, u, α) = C_1(τ_1, u_1, α_1) + · · · + C_n(τ_n, u_n, α_n); results are presented in Figure 11.
In some scenarios, for example 1c3s5z and 8m_vs_9m, the converged performance of RDN is even equal to that of RMIX, which demonstrates that RMIX is also flexible with additive mixing networks. Overall, with the new policy representation and the additive decomposition network, RDN gains convincing improvements over VDN. Due to limited space, we show how the risk level α of each agent changes during an episode, together with the emergent cooperative strategies between agents, in the results analysis in Appendix G.2.

5. CONCLUSION

In this paper, we propose RMIX, a novel risk-sensitive MARL method with CVaR over the learned distributions of individuals' Q values. Our main contributions are threefold: (i) we first learn the return distributions of individuals to analytically calculate CVaR for decentralized execution; (ii) we then propose a dynamic risk level predictor for CVaR calculation to handle the temporal nature of the stochastic outcomes during executions; (iii) we finally propose a risk-sensitive Bellman equation along with Individual-Global-MAX (IGM) for MARL training. Empirically, we show that RMIX significantly outperforms state-of-the-art methods on many challenging StarCraft II tasks, demonstrating enhanced coordination in many complex scenarios and revealing high sample efficiency. To the best of our knowledge, our work is the first attempt to investigate cooperative MARL with risk-sensitive policies under the Dec-POMDP framework.

A RELATED WORKS

As deep reinforcement learning (DRL) becomes prevailing (Mnih et al., 2015; Schulman et al., 2017), recent years have witnessed a renaissance in cooperative MARL with deep learning. However, there are several inevitable issues, including the nonstationarity of the environment from the view of individual agents (Foerster et al., 2017b), the credit assignment in cooperative scenarios with shared global rewards (Sunehag et al., 2017; Rashid et al., 2018), the lack of coordination and communication in cooperative scenarios (Jiang & Lu, 2018; Wang et al., 2020b) and the failure to consider opponents' strategies when learning agent policies (He et al., 2016). Recent advances in distributional RL (Bellemare et al., 2017; Dabney et al., 2018a;b) focus on learning the distribution over returns. However, these works still address either risk-neutral settings or a static risk level in single-agent settings, neglecting the significant risk-sensitive problems ubiquitous in many multi-agent real-world applications, including pipeline robot cooperation in factories, warehouse robot coordination, etc. Risk sensitivity is especially common in highly dynamic tasks, for example military action, resource allocation, financial portfolio management and the Internet of Things (IoT). Chow & Ghavamzadeh (2014) considered the mean-CVaR optimization problem in MDPs and proposed a policy gradient method with CVaR, and García et al. (2015) presented a survey on safe RL; these works have ignited research on borrowing risk measures in RL (García et al., 2015; Tamar et al., 2015; Tang et al., 2019; Hiraoka et al., 2019; Majumdar & Pavone, 2020; Keramati et al., 2020; Ma et al., 2020). However, these works focus on single-agent settings, where a single agent interacts with the environment, in contrast to dynamic and non-stationary multi-agent environments. The merit of CVaR in the optimization of MARL has yet to be discovered.
We explore the CVaR risk measure in our methods and demonstrate the leading performance of our MARL methods.

B PROOFS

We present proofs of the propositions introduced in previous sections. Proposition and equation numbers are reused in the restated propositions.

Assumption 1. The mean rewards are bounded in a known interval, i.e., r ∈ [−R_max, R_max].

This assumption means we can bound the absolute value of the Q values as |Q_{sa}| ≤ Q_max = H R_max, where H is the maximum time horizon length in episodic tasks.

Proposition 1. T: C → C is a γ-contraction.

Proof. We consider the sup-norm contraction

‖T C^{(1)}(s, u, α^{(1)}) − T C^{(2)}(s, u, α^{(2)})‖_∞ ≤ γ ‖C^{(1)}(s, u, α^{(1)}) − C^{(2)}(s, u, α^{(2)})‖_∞   ∀s ∈ S, u ∈ U, α^{(i)} ∈ A, i ∈ {1, 2},   (9)

where the sup-norm is defined as ‖C‖_∞ = sup_{s∈S, u∈U, α∈A} |C(s, u)| and C ∈ R. In {C_i}_{i=0}^{n−1}, the risk level is fixed and can be considered an implicit input. Given two risk level sets α^{(1)} and α^{(2)}, and two different return distributions Z^{(1)} and Z^{(2)}, we prove:

‖T C^{(1)} − T C^{(2)}‖_∞
≤ max_{s,u} |T C^{(1)}(s, u, α^{(1)}) − T C^{(2)}(s, u, α^{(1)})|
= max_{s,u} γ |Σ_{s'} P(s'|s, u) [max_{u'} C^{(1)}(s', u', α') − max_{u'} C^{(2)}(s', u', α')]|
≤ γ max_{s'} |max_{u'} C^{(1)}(s', u', α') − max_{u'} C^{(2)}(s', u', α')|
≤ γ max_{s',u'} |C^{(1)}(s', u', α') − C^{(2)}(s', u', α')|
= γ ‖C^{(1)} − C^{(2)}‖_∞.   (10)

This further implies that ‖T C^{(1)} − T C^{(2)}‖_∞ ≤ γ ‖C^{(1)} − C^{(2)}‖_∞ for all s ∈ S, u ∈ U, α^{(i)} ∈ A, i ∈ {1, 2}.

With Proposition 1, we can leverage TD learning (Sutton & Barto, 2018) to compute the maximal CVaR value of each agent, thus leading to the maximal global CVaR value. In some scenarios where risk is not the primary concern for policy optimization, for example corridor, where agents should learn to explore in order to win the game, RMIX is less sample efficient in learning the optimal policy than its risk-neutral variant, RMIX (α = 1). As shown in Figure 16(b), Section G.1.1, RMIX learns more slowly than RMIX (α = 1) because it relies on the dynamic risk predictor; since the predictor is trained together with the agent, it takes more samples in these environments.
Interestingly, RMIX still shows very good performance over the other baselines.

Theorem 1. In decentralized execution, given α = {α_i}_{i=0}^{n−1}, we define the global arg max performed on the global CVaR C_tot(τ, u, α) as:

arg max_u C_tot(τ, u, α) = (arg max_{u_1} C_1(τ_1, u_1, α_1), ..., arg max_{u_n} C_n(τ_n, u_n, α_n)), (5)

where τ and u are the trajectories (up to that timestep) and actions of all agents, respectively. The individuals' maximization operation over return distributions defined above satisfies IGM and allows each agent to participate in decentralised execution solely by choosing greedy actions with respect to its C_i(τ_i, u_i, α_i).

Proof. With the monotonic mixing network f_m, in RMIX we have

C_tot(τ, u, α) = f_m(C_1(τ_1, u_1, α_1), ..., C_n(τ_n, u_n, α_n)). (12)

Consequently, we have

C_tot(τ, {arg max_{u'} C_i(τ_i, u', α_i)}_{i=0}^{n−1}) = f_m({max_{u'} C_i(τ_i, u', α_i)}_{i=0}^{n−1}). (13)

By the monotonicity property of f_m, we can easily derive that if, for some j ∈ {0, 1, ..., n−1}, u*_j = arg max_{u'} C_j(τ_j, u', α_j), where α_j ∈ (0, 1] is the optimal risk level given the current and historical return distributions, while the actions of the other agents are not necessarily the best actions, then

f_m({C_i(τ_i, u_i, α_i)}_{i=0}^{n−1}) ≤ f_m({C_i(τ_i, u_i, α_i)}_{i=0, i≠j}^{n−1}, C_j(τ_j, u*_j, α_j)).

Applying this to all agents, i.e., ∀j ∈ {0, 1, ..., n−1} with u*_j = arg max_{u'} C_j(τ_j, u', α_j), we have

f_m({C_i(τ_i, u_i, α_i)}_{i=0}^{n−1}) ≤ f_m({C_i(τ_i, u_i, α_i)}_{i=0, i≠j}^{n−1}, C_j(τ_j, u*_j, α_j)) ≤ f_m({C_i(τ_i, u*_i, α_i)}_{i=0}^{n−1}) = max_{{u_i}_{i=0}^{n−1}} f_m({C_i(τ_i, u_i, α_i)}_{i=0}^{n−1}).

Therefore, max_{{u_i, α_i}_{i=0}^{n−1}} f_m({C_i(τ_i, u_i, α_i)}_{i=0}^{n−1}) = max_{u,α} C_tot(τ, u, α), which implies

max_u C_tot(τ, u, α) = C_tot(τ, {arg max_{u', α} C_i(τ_i, u', α_i)}_{i=0}^{n−1}).

Proposition 2. For any agent i, i ∈ {0, ..., n−1}, there exists λ(τ_i, u_i) ∈ (0, 1] such that C_i(τ_i, u_i) = λ(τ_i, u_i) E[Z_i(τ_i, u_i)].

Proof. We first show that, given a return distribution Z with return random variable Z' and risk level α ∈ A, Π_α Z can be rewritten as E[Z' | Z' ≤ z_α], where z_α is the α-quantile of Z', and E[Z' | Z' ≤ z_α] < E[Z']. This can be easily proved by following Privault (2020)'s proof. Thus we get Π_α Z < E[Z'], and there exists λ(τ_i, u_i) ∈ (0, 1], a value depending on the agent's trajectory, such that Π_α Z_i(τ_i, u_i) = λ(τ_i, u_i) E[Z_i(τ_i, u_i)].

Proposition 2 implies that we can view the CVaR value as a truncated Q value lying in the lower region of the return distribution Z_i(τ_i, u_i): CVaR decomposes into two factors, λ(τ_i, u_i) and E[Z_i(τ_i, u_i)].
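Both Theorem 1 and Proposition 2 can be illustrated numerically. The numpy sketch below uses made-up return distributions; it computes per-agent CVaR values from uniform-probability atoms with positive support (so λ lands in (0, 1], as Proposition 2 requires), and then verifies by brute force that per-agent greedy actions maximize a monotonic mixture, here a simple linear mixer with positive weights standing in for f_m:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n_agents, n_actions, n_atoms, alpha = 3, 4, 50, 0.2

# Made-up per-agent return distributions Z_i(u_i): atoms of uniform prob.
# Positive support keeps lambda = CVaR / mean inside (0, 1].
Z = rng.uniform(1.0, 10.0, size=(n_agents, n_actions, n_atoms))

def cvar(atoms, a):
    k = max(1, int(np.floor(a * len(atoms))))  # worst a-fraction of atoms
    return np.sort(atoms)[:k].mean()

C = np.array([[cvar(Z[i, u], alpha) for u in range(n_actions)]
              for i in range(n_agents)])

# Proposition 2: C_i = lambda * E[Z_i] with lambda in (0, 1].
lam = C / Z.mean(axis=-1)
assert np.all((lam > 0) & (lam <= 1))

# Theorem 1 (IGM): per-agent greedy actions maximize a monotonic mixture;
# a linear mixer with strictly positive weights stands in for f_m here.
w = rng.random(n_agents) + 0.1
f_m = lambda cs: float(np.dot(w, cs))
u_star = tuple(C.argmax(axis=1))
best = max(product(range(n_actions), repeat=n_agents),
           key=lambda u: f_m(C[np.arange(n_agents), u]))
assert best == u_star  # decentralized greedy actions are globally optimal
```

The brute-force search over all joint actions is only feasible for this toy size; the point of Theorem 1 is precisely that, with a monotonic f_m, the search is never needed.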

C ADDITIONAL BACKGROUND

We introduce additional background on cooperative MARL algorithms, including QMIX, MAVEN and Qatten, for the convenience of readers who want to know more about these algorithms. Q-based cooperative MARL is concerned with estimating an accurate action-value function to select the actions with maximum expected returns. The optimal Q-function is defined as (Rashid et al., 2018):

Q_tot^θ(s, u) := E[ Σ_{t=0}^∞ γ^t r(s_t, u_t, θ) | s_{t+1} ∼ P(· | s_t, u_t, θ), u_{t+1} = arg max Q_tot^θ(s_{t+1}, ·) ]
= r(s, u, θ) + γ E[ max Q_tot^θ(s', ·) | s' ∼ P(· | s, u, θ) ],

where θ denotes the parameters of Q_tot, which can be modeled by deep neural networks working as the Q-function approximator. This Q-network can be trained by minimizing the loss function, in a supervised-learning fashion, defined below:

L(θ) := E_{D'∼D}[ (y_t^tot − Q_tot^θ(s_t, u_t))² ], with y_t^tot = r_t + γ Q_tot^{θ⁻}(s_{t+1}, arg max_u Q_tot^θ(s_{t+1}, ·)),

where D' = (s_t, u_t, r_t, s_{t+1}) and D is the replay buffer. θ⁻ indicates the parameters of the target network, which are periodically copied from θ to stabilize training (Mnih et al., 2015). The network is trained in a centralized way with all partial observations accessible to all agents.

QMIX QMIX (Rashid et al., 2018) is a well-known multi-agent Q-learning algorithm in the centralised training and decentralised execution paradigm, which restricts the joint action Q-values it can represent to a monotonic mixing of each agent's utilities in order to enable decentralisation and value decomposition:

Q_mix := { Q_tot | Q_tot(τ, u) = f_m(Q_1(τ_1, u_1), ..., Q_n(τ_n, u_n)), ∂f_m/∂Q_a ≥ 0, Q_a(τ, u) ∈ ℝ },

and the arg max operator used to obtain Q_tot for centralized training, via a TD loss similar to DQN (Mnih et al., 2015), factorizes as

arg max_u Q_tot(τ, u) = (arg max_{u_1} Q_1(τ_1, u_1), ..., arg max_{u_n} Q_n(τ_n, u_n)).

The architecture is shown in Figure 12.
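The DQN-style TD target and loss above can be sketched with tabular stand-ins for Q_tot^θ and the target network Q_tot^{θ⁻}; the tables and the sampled batch below are random placeholders, not the actual networks:

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, gamma, batch = 5, 3, 0.99, 32

# Tabular placeholders for the online network (theta) and the periodically
# copied target network (theta^-).
q_online = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))
q_target = q_online + 0.01 * rng.standard_normal((n_states, n_actions))

# A sampled batch D' = (s, u, r, s') from the replay buffer D.
s = rng.integers(0, n_states, size=batch)
u = rng.integers(0, n_actions, size=batch)
r = rng.uniform(0.0, 1.0, size=batch)
s2 = rng.integers(0, n_states, size=batch)

# Double-DQN-style target: evaluate the target net at the online argmax.
u_next = q_online[s2].argmax(axis=1)
y_tot = r + gamma * q_target[s2, u_next]

# Squared TD loss; in practice gradients flow only through q_online.
loss = np.mean((y_tot - q_online[s, u]) ** 2)
assert loss >= 0.0
```

Evaluating the target network at the online network's argmax, rather than taking its own max, is what distinguishes this target from vanilla Q-learning and reduces overestimation.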
The monotonic mixing network f_m is parametrised as a feedforward network whose non-negative weights are generated by hypernetworks (Ha et al., 2017) with the state as input. MAVEN MAVEN (Mahajan et al., 2019) (multi-agent variational exploration) overcomes the detrimental effects of QMIX's monotonicity constraint on exploration by learning a diverse ensemble of monotonic approximations with the help of a latent space. It consists of value-based agents that condition their behaviour on a shared latent variable z controlled by a hierarchical policy that offloads ε-greedy with committed exploration. Thus, fixing z, each joint action-value function is a monotonic approximation to the optimal action-value function that is learnt with Q-learning. Qatten Qatten (Yang et al., 2020) explicitly considers the agent-level impact of individuals on the whole system when transforming the individual Q_i's into Q_tot. It theoretically derives a general formula for Q_tot in terms of Q_i and uses a multi-head attention formation to approximate Q_tot, resulting in not only a refined representation of Q_tot with an agent-level attention mechanism, but also a tractable maximization algorithm for decentralized policies.
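A minimal numpy sketch of the QMIX-style state-conditioned monotonic mixer described above; abs() stands in for the hypernetworks' non-negativity constraint, and all weights are random placeholders rather than learned parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
n_agents, state_dim, hidden = 3, 8, 16

# Placeholder hypernetwork weights (learned in practice); they map the
# global state to the parameters of the mixing network.
W1_hyper = 0.1 * rng.standard_normal((state_dim, n_agents * hidden))
W2_hyper = 0.1 * rng.standard_normal((state_dim, hidden))

def monotone_mix(q, state):
    # abs() enforces non-negative mixing weights, so dQ_tot/dQ_i >= 0.
    w1 = np.abs(state @ W1_hyper).reshape(n_agents, hidden)
    w2 = np.abs(state @ W2_hyper)
    h = np.maximum(q @ w1, 0.0)  # any increasing activation keeps monotonicity
    return float(h @ w2)

state = rng.standard_normal(state_dim)
q = rng.standard_normal(n_agents)
q_hi = q.copy()
q_hi[0] += 1.0  # raise one agent's utility

# Monotonicity: increasing any individual utility never decreases Q_tot.
assert monotone_mix(q_hi, state) >= monotone_mix(q, state)
```

Because every weight is non-negative and the activation is increasing, the joint value is nondecreasing in each agent's utility, which is exactly the property the IGM argmax factorization relies on.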

D PSEUDO CODE OF RMIX

Algorithm 1: RMIX
input: K, γ;
1 Initialize the parameters θ of the networks of the agents, the risk operator and the monotonic mixing network;
2 Initialize the parameters θ⁻ of the target networks of the agents, the risk operator and the monotonic mixing network;
3 for each training episode do
4   for each timestep t do
5     Each agent i estimates its return distribution Z_i(τ_i, ·);
6     Predict the risk level α_i = k*/K, where k* = arg max_k exp(⟨f_emb(Z_i)_k, φ_i^k⟩) / Σ_{k'=0}^{K−1} exp(⟨f_emb(Z_i)_{k'}, φ_i^{k'}⟩), k ∈ {1, ..., K};
7     Apply the risk operator Π_{α_i} to obtain C_i(τ_i, ·, α_i) and select actions;
8   Store (s_t, {o_i^t}_{i∈[0,...,n−1]}, u_t, r_t, s') in the replay buffer D;
9   Sample a batch D' from D and calculate the TD loss L_Π(θ) := E_{D'∼D}[(y_tot − C_tot)²];
10  Update θ and periodically copy θ⁻ ← θ;

E SMAC

The SMAC benchmark is a challenging set of cooperative StarCraft II maps for micromanagement developed by Samvelyan et al. (2019), built on DeepMind's PySC2 (Vinyals et al., 2017). We introduce the states and observations, action space and rewards of SMAC, and the environmental settings of RMIX below.
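Referring back to Algorithm 1, the discretized risk-level selection can be sketched as follows; the scores standing in for the inner products ⟨f_emb(Z_i)_k, φ_i^k⟩ are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(5)
K = 10  # number of discretized risk levels {1/K, 2/K, ..., K/K}

# Hypothetical scores, standing in for the inner products between the
# return-distribution embedding f_emb(Z_i)_k and trajectory embedding phi_i^k.
scores = rng.standard_normal(K)

probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the K levels
k_star = int(probs.argmax())                   # most likely level index
alpha_i = (k_star + 1) / K                     # selected risk level in (0, 1]

assert abs(probs.sum() - 1.0) < 1e-9
assert 0.0 < alpha_i <= 1.0
```

Since softmax is monotone, taking the argmax of the probabilities is equivalent to taking the argmax of the raw scores; the softmax matters when the levels are sampled rather than selected greedily.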

States and Observations

At each time step, agents receive local observations within their field of view. This encompasses information about the map within a circular area around each unit with a radius equal to the sight range. The sight range makes the environment partially observable for each agent. An agent can only observe other agents if they are both alive and located within its sight range. Hence, there is no way for agents to distinguish whether their teammates are far away or dead. The feature vector observed by each agent contains the following attributes for both allied and enemy units within the sight range: distance, relative x, relative y, health, shield, and unit type. All Protoss units have shields, which serve as a source of protection to offset damage and can regenerate if no new damage is received. The global state is composed of the joint observations without the restriction of the sight range, and can be obtained during training in the simulations. All features, both in the global state and in the individual observations of agents, are normalized by their maximum values. Action Space The discrete set of actions which agents are allowed to take consists of move[direction], attack[enemy id], stop and no-op. Dead agents can only take the no-op action, while live agents cannot. Agents can only move with a fixed movement amount of 2 in four directions: north, south, east, or west. To ensure decentralization of the task, agents are restricted to use the attack[enemy id] action only towards enemies in their shooting range. This additionally constrains the units from using the built-in attack-move macro-actions on enemies that are far away. The shooting range is set to 6 for all agents. Having a larger sight range than shooting range allows agents to make use of the move commands before starting to fire. The unit behaviour of automatically responding to enemy fire without being explicitly ordered is also disabled.
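The availability rules above (dead agents may only no-op, attacks restricted to the shooting range) can be sketched as a hypothetical helper; the function name and action indexing are illustrative, not SMAC's actual API:

```python
# Hypothetical legal-action mask for one SMAC-style agent.
# Actions: 0 = no-op, 1 = stop, 2-5 = move N/S/E/W, 6+ = attack[enemy id].
def available_actions(alive, enemy_dists, shoot_range=6.0):
    n_actions = 6 + len(enemy_dists)
    mask = [False] * n_actions
    if not alive:
        mask[0] = True                 # dead agents may only take no-op
        return mask
    mask[1:6] = [True] * 5             # stop and the four move directions
    for e, d in enumerate(enemy_dists):
        # attack is only legal for enemies inside the shooting range
        mask[6 + e] = d <= shoot_range
    return mask

m = available_actions(True, [4.0, 7.5])
assert m[6] and not m[7] and not m[0]  # only the nearby enemy is attackable
```

Such a mask is typically applied before the greedy action selection, so that the argmax is only taken over currently legal actions.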
Rewards At each time step, the agents receive a joint reward equal to the total damage dealt to the enemy units. In addition, agents receive a bonus of 10 points for killing each opponent, and 200 points for killing all opponents and winning the battle. The rewards are scaled so that the maximum cumulative reward achievable in each scenario is around 20. Environmental Settings of RMIX The difficulty level of the built-in game AI used in our experiments is level 7 (very difficult) by default, as in many previous works (Rashid et al., 2018; Mahajan et al., 2019; Yang et al., 2020). The scenarios used in Section 4 are shown in Table 1. We present all SMAC scenarios in Table 1 and the corresponding memory usage for training each scenario in Table 2. The Ally Units are agents trained by MARL methods and the Enemy Units are built-in game bots. For example, 5m_vs_6m indicates that the number of MARL agents is 5 while the number of opponents is 6; the agent (unit) type is Marine. This asymmetric setting is hard for MARL methods.
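The joint reward described under Rewards above can be sketched as a hypothetical function; the name, signature and per-scenario scale are illustrative:

```python
def smac_step_reward(damage_dealt, enemies_killed, battle_won, scale=1.0):
    """Joint reward as described: damage plus 10 per kill plus 200 for a win.

    `scale` would be tuned per scenario so the maximum cumulative reward
    is around 20; it is left at 1.0 here for readability.
    """
    bonus = 10.0 * enemies_killed + (200.0 if battle_won else 0.0)
    return (damage_dealt + bonus) * scale

# A hypothetical final winning step: 35 damage, 2 kills, battle won.
assert smac_step_reward(35.0, 2, True) == 255.0
```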

F ADDITIONAL TRAINING DETAILS

The baselines are listed in Table 3 below. To make a fair comparison, we use the episode runner (a single environment process for training, as opposed to the parallel runner) defined in PyMARL to run all methods. The evaluation interval is 10,000 for all methods. We use uniform probability to estimate Z_i(·, ·) for each agent. For all baselines, we use the remaining training hyperparameters from the original papers. The metrics are calculated with a moving window of size 15. Experiments are carried out on an NVIDIA Tesla V100 GPU (16 GB). We also provide the memory usage of the baselines (given the current size of the replay buffer) for training each scenario of the SCII domain in SMAC. We use the same neural network architecture for the agent as QMIX (Rashid et al., 2018). The trajectory embedding network φ_i is similar to the agent's network.

Table 3: Baselines.
IQL (Tampuu et al., 2017): Independent Q-learning
VDN (Sunehag et al., 2017): Value decomposition network
COMA (Foerster et al., 2017a): Counterfactual actor-critic
QMIX (Rashid et al., 2018): Monotonic value decomposition
QTRAN (Son et al., 2019): Value decomposition with a linear affine transformation
MAVEN (Mahajan et al., 2019): MARL with a variational method for exploration
Qatten (Yang et al., 2020): Multi-head attention for decomposing the global Q values
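Under the uniform-probability estimate of Z_i mentioned above, both the expectation and the CVaR reduce to simple averages over atoms; a short sketch with made-up atom values:

```python
import numpy as np

# Made-up atoms: N quantile samples of uniform probability 1/N for Z_i.
atoms = np.array([2.0, 3.5, 5.0, 6.0, 8.0, 9.5, 11.0, 12.0])
N = len(atoms)

mean = atoms.mean()                        # E[Z_i]: average over all atoms
alpha = 0.25
n_tail = max(1, int(np.floor(alpha * N)))  # worst alpha-fraction of atoms
cvar = np.sort(atoms)[:n_tail].mean()      # CVaR under uniform atoms

assert cvar <= mean  # CVaR never exceeds the expectation
```

This is why RMIX can compute CVaR analytically from the learned distributions during decentralized execution: with uniform atoms, no integration or reweighting is required.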

G ADDITIONAL EXPERIMENTS ON SMAC G.1 ABLATIONS

We use the same hyper-parameters in the ablation studies unless otherwise specified.

G.1.1 STATIC RISK LEVEL

We present more results of RMIX and RMIX (α = 1) in Figures 13, 14 and 15. In the very hard 3s5z_vs_3s6z and 6h_vs_8z games, RMIX and RMIX (α = 1) outperform the baselines. Surprisingly, RMIX is even slightly better in 6h_vs_8z, where a micro-trick (focus fire) is learned. In conclusion, RMIX is also capable of learning hard micro-trick tasks in StarCraft II. On the asymmetric scenario 27m_vs_30m, RMIX shows leading performance as well. We also conduct an ablation study with static risk levels of α = 0.1 and α = 0.3, denoted RMIX-static. As shown in Figure 16, with a static risk level, RMIX-static makes steady progress over time, but its performance is lower than RMIX's in 6h_vs_8z and 5m_vs_6m. On the asymmetric scenario 5m_vs_6m, the converged performance of RMIX-static is 0.6, which is lower than RMIX's (0.8). In some scenarios where risk is not the primary concern for policy optimization, for example the corridor scenario, where agents should learn to explore to win the game, RMIX is less sample-efficient in learning the optimal policy than its risk-neutral variant, RMIX (α = 1), as shown in Figure 16 (b). RMIX learns more slowly than RMIX (α = 1) because it relies on the dynamic risk level predictor, which is trained together with the agent and therefore requires more samples in these environments. Interestingly, RMIX still shows very good performance over the other baselines. We provide additional result analyses of RMIX on corridor in Figure 17. There are 6 RMIX agents in corridor. For brevity of visualization, we use the data of 3 agents (agents 0, 1 and 3) to analyse the results and to demonstrate that RMIX agents have learned to address the time-consistency issue.



Footnotes:
"Risk" refers to the uncertainty of future outcomes (Dabney et al., 2018a).
StarCraft II is a trademark of Blizzard Entertainment, Inc.
The Dirac delta is a generalized function in the theory of distributions and not a function, given its properties; we use the name Dirac delta function by convention.
We omit t in the rest of the paper for notational brevity.
https://github.com/oxwhirl/smac
https://github.com/oxwhirl/pymarl
Marine is a type of unit (agent) in StarCraft II. Readers can refer to https://liquipedia.net/starcraft2/Marine_(Legacy_of_the_Void) for more information.



P(s' | s, u) : S × U^N × S → [0, 1] is a Markovian transition function and governs all state transition dynamics. Every agent shares the same joint reward function R(s, u) : S × U^N → ℝ, and γ ∈ [0, 1) is the discount factor. Due to partial observability, each agent has an individual partial observation υ ∈ Υ, according to some observation function O(s, i) : S × N → Υ.

Figure 2: The framework of RMIX (dotted arrows indicate that gradients are blocked during training). (a) Agent network structure (bottom) and risk operator (top). (b) The overall architecture. (c) Mixing network structure. Each agent i applies an individual risk operator Π_{α_i} on its return distribution Z_i(·, ·) to calculate C_i(·, ·, ·) for execution, given the risk level α_i predicted by the dynamic risk level predictor ψ_i. The values {C_i(·, ·, ·)}_{i=0}^{n−1} are fed into the mixing network for centralized training.

Figure 4: Risk level predictor ψ i .

Figure 6: test_battle_won_mean summary of RMIX and QMIX on 17 SMAC scenarios.

Figure 8: test_battle_won_mean of RMIX, QMIX, Qatten and MAVEN on very hard corridor scenario.

Figure 7: test_battle_won_mean of RMIX and baselines on 8 scenarios. The x-axis denotes the training steps and the y-axis is the test battle won rate, ranging from 0 to 1. This applies to the result figures in the rest of the paper, including the figures in the appendix.

Figure 10: test_battle_won_mean of RMIX and baselines on 4 scenarios.

Figure 12: The overall setup of QMIX (best viewed in colour), reproduced from the original paper (Rashid et al., 2018). (a) Mixing network structure; in red are the hypernetworks that produce the weights and biases for the mixing network layers shown in blue. (b) The overall QMIX architecture. (c) Agent network structure.

Figure 13: test_battle_won_mean of RMIX, RMIX (α = 1) and QMIX on 3s5z_vs_3s6z (a heterogeneous and asymmetric scenario, very hard game).

Figure 17: RMIX result analysis on corridor. We use the trained RMIX model and run it to collect one episode of data, including the game replay, states, actions, rewards and α values (risk levels). We show the rewards of the episode and the corresponding α value each agent predicts per time step in rows one and two. We provide descriptions and analyses of how agents learn time-consistent α values in the remaining rows. Pictures are screenshots from the game replay. Readers can watch the game replay via this anonymous link: https://youtu.be/J-PG0loCDGk. Interestingly, it also shows emergent cooperation strategies between agents at different time steps during the episode, which demonstrates the superiority of RMIX.





Table 2: Memory usage (given the current size of the replay buffer) for training each method (excluding COMA, an on-policy method that does not use a replay buffer) on scenarios of the SCII domain in SMAC.



G.3 ADDITIONAL RESULTS

We conduct experiments with RMIX, QMIX, MAVEN, Qatten, VDN, COMA and IQL on 17 SMAC scenarios. We show the test_battle_won_mean and test_return_mean results of the aforementioned methods in Figures 19 and 20, respectively. RMIX shows leading performance on most scenarios, ranging from symmetric homogeneous scenarios to asymmetric heterogeneous scenarios. Surprisingly, RMIX also shows superior performance on scenarios where a micro-trick should be learned to win the game. In addition, we compare RMIX with QR-MIX (Hu et al., 2020). Unlike QMIX, QR-MIX decomposes the estimated joint return distribution into individual Q values. We implement QR-MIX with PyMARL using the hyperparameters in the QR-MIX paper and train it on 3m (easy), 1c3s5z (easy), 5m_vs_6m (hard), 8m_vs_9m (hard), 10m_vs_11m (hard), 27m_vs_30m (very hard), MMM2 (very hard), 3s5z_vs_3s6z (very hard), corridor (very hard) and 6h_vs_8z (very hard) with 3 random seeds for each scenario. Results are shown in Figure 18.

