RMIX: RISK-SENSITIVE MULTI-AGENT REINFORCEMENT LEARNING

Abstract

Centralized training with decentralized execution (CTDE) has become an important paradigm in multi-agent reinforcement learning (MARL). Current CTDE-based methods rely on restrictive decompositions of the centralized value function, factorizing the global Q value into individual Q values that guide each agent's behaviour. However, such expected, i.e., risk-neutral, Q value decomposition is insufficient even with CTDE due to the randomness of rewards and the uncertainty of environments, causing these methods to fail to train coordinating agents in complex environments. To address these issues, we propose RMIX, a novel cooperative MARL method that applies the Conditional Value at Risk (CVaR) measure over the learned distributions of individuals' Q values. Our main contributions are threefold: (i) we first learn the return distributions of individuals to analytically calculate CVaR for decentralized execution; (ii) we then propose a dynamic risk level predictor for CVaR calculation to handle the temporal nature of stochastic outcomes during execution; (iii) we finally propose a risk-sensitive Bellman equation along with the Individual-Global-Max (IGM) principle for MARL training. Empirically, we show that our method significantly outperforms state-of-the-art methods on many challenging StarCraft II tasks, demonstrating significantly enhanced coordination and high sample efficiency. Demonstrative videos and results are available at this anonymous link: https://sites.google.com/view/rmix.

1. INTRODUCTION

Reinforcement learning (RL) has made remarkable advances in many domains, including arcade video games (Mnih et al., 2015), complex continuous robot control (Lillicrap et al., 2016) and the game of Go (Silver et al., 2017). Recently, many researchers have extended RL methods to multi-agent systems (MASs), such as urban systems (Singh et al., 2020), coordination of robot swarms (Hüttenrauch et al., 2017) and real-time strategy (RTS) video games (Vinyals et al., 2019). Centralized training with decentralized execution (CTDE) (Oliehoek et al., 2008; Kraemer & Banerjee, 2016) has drawn enormous attention: policies are trained in a centralized way with access to global trajectories, while actions are executed in a decentralized way given only each agent's local observation. Empowered by CTDE, several MARL methods, both value-based and policy-gradient-based, have been proposed (Foerster et al., 2017a; Sunehag et al., 2017; Rashid et al., 2018; Son et al., 2019). These methods factorize the global Q value either through structural constraints or by estimating state values or inter-agent weights. Among them, VDN (Sunehag et al., 2017) and QMIX (Rashid et al., 2018) are representative methods that impose additivity and monotonicity structural constraints, respectively. With relaxed structural constraints, QTRAN (Son et al., 2019) guarantees a more general factorization than VDN and QMIX. Other methods incorporate an estimation of advantage values (Wang et al., 2020a) or use multi-head attention to represent the global values (Yang et al., 2020).
Despite these merits, most of these works focus on decomposing the global Q value into individual Q values with different constraints and network architectures, while ignoring the fact that expected, i.e., risk-neutral, value decomposition is insufficient even with CTDE due to the randomness of rewards and the uncertainty in environments, which causes these methods to fail to train coordinating agents in complex environments. Specifically, these methods only learn expected values over returns (Rashid et al., 2018) and do not handle the high variance caused by events that yield extremely high or low rewards with small probability, leading to inaccurate or insufficient estimation of future returns. Therefore, instead of expected values, learning distributions of future returns, i.e., Q values, is more useful for agents to make decisions. Recently, QR-MIX (Hu et al., 2020) decomposes the estimated joint return distribution (Bellemare et al., 2017; Dabney et al., 2018a) into individual Q values. However, the policies in QR-MIX are still based on expected individual Q values. Furthermore, since the environment is non-stationary from the perspective of each agent, each agent needs a more dynamic way to choose actions based on the return distributions, rather than simply taking expected values. However, current MARL methods do not extensively investigate these aspects. Motivated by these reasons, we extend risk-sensitive^1 RL (Chow & Ghavamzadeh, 2014; Keramati et al., 2020; Zhang et al., 2020) to MARL settings, where risk-sensitive RL optimizes policies with respect to a risk measure, such as variance, the power formula measure, value at risk (VaR) or conditional value at risk (CVaR). Among these risk measures, CVaR has been gaining popularity due to both theoretical and computational advantages (Rockafellar & Uryasev, 2002; Ruszczyński, 2010).
However, there are two main obstacles: (i) most previous works focus on risk-neutral objectives or a static risk level in single-agent settings, ignoring the randomness of rewards and the temporal structure of agents' trajectories (Dabney et al., 2018a; Tang et al., 2019; Ma et al., 2020; Keramati et al., 2020); (ii) many methods apply risk measures over Q values only for policy execution, without using the risk measure values in policy optimization via temporal difference (TD) learning, so the global value factorization still operates on expected individual values and leads to sub-optimal behaviours in MARL. We provide a detailed review of related works in Appendix A due to limited space. In this paper, we propose RMIX, a novel cooperative MARL method with CVaR over the learned distributions of individuals' Q values. Specifically, our contributions are threefold: (i) we first learn the return distributions of individuals, represented with Dirac delta functions, in order to analytically calculate CVaR for decentralized execution; the resulting CVaR values at each time step serve as policies for each agent via the arg max operation; (ii) we then propose a dynamic risk level predictor for CVaR calculation to handle the temporal nature of stochastic outcomes during execution; the predictor measures the discrepancy between the embedding of the current individual return distributions and the embedding of historical return distributions, and the resulting risk levels are agent-specific and observation-wise; (iii) we finally propose a risk-sensitive Bellman equation along with IGM for centralized training; it enables CVaR value updates in a recursive form and can be trained with TD learning via a neural network. Together, these allow our method to achieve temporally extended exploration and enhanced temporal coordination, which are key to solving complex multi-agent tasks.
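To make the analytic CVaR calculation in contribution (i) concrete, the following is a minimal sketch of computing CVaR at level alpha for a return distribution represented as a mixture of Dirac deltas (atoms with probabilities), i.e., averaging the worst alpha-fraction of outcomes with fractional weight on the boundary atom. The function name and the example numbers are illustrative, not taken from the RMIX implementation.

```python
import numpy as np

def cvar(atoms, probs, alpha):
    """CVaR_alpha of a discrete return distribution (mixture of Dirac deltas).

    Sorts atoms ascending (worst returns first), accumulates probability
    mass up to alpha, and averages the alpha-tail, with the boundary atom
    weighted fractionally. alpha = 1 recovers the risk-neutral expectation.
    """
    order = np.argsort(atoms)  # worst (lowest) returns first
    atoms = np.asarray(atoms, dtype=float)[order]
    probs = np.asarray(probs, dtype=float)[order]
    cum = np.cumsum(probs)
    # probability mass each atom contributes to the alpha-tail
    tail = np.clip(np.minimum(cum, alpha) - (cum - probs), 0.0, None)
    return float(np.dot(tail, atoms) / alpha)

z, p = [-10.0, 0.0, 10.0], [0.1, 0.4, 0.5]
print(cvar(z, p, 1.0))  # 4.0: plain expectation (risk-neutral)
print(cvar(z, p, 0.1))  # -10.0: mean of the worst 10% of outcomes
```

A smaller risk level alpha makes the agent more risk-averse, since only the lowest-return atoms contribute; this is the quantity the dynamic risk level predictor would modulate per agent and per observation.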
Empirically, we show that RMIX significantly outperforms state-of-the-art methods on many challenging StarCraft II^2 tasks, demonstrating enhanced coordination in symmetric and asymmetric as well as homogeneous and heterogeneous scenarios, and revealing high sample efficiency. To the best of our knowledge, our work is the first attempt to investigate cooperative MARL with risk-sensitive policies under the Dec-POMDP framework.

2. PRELIMINARIES

In this section, we provide the notation and basic notions used in the following. We consider a probability space (Ω, F, Pr), where Ω is the set of outcomes (sample space), F is a σ-algebra over Ω representing the set of events, and Pr is a probability measure over F. Given a set X, we denote by P(X) the set of all probability measures over X.

1 "Risk" refers to the uncertainty of future outcomes (Dabney et al., 2018a).
2 StarCraft II is a trademark of Blizzard Entertainment, Inc.

A fully cooperative MARL problem can be described as a decentralized partially observable Markov decision process (Dec-POMDP) (Oliehoek et al., 2016), formulated as a tuple M = ⟨S, U, P, R, Υ, O, N, γ⟩, where s ∈ S denotes the true state of the environment. At each time step, each agent i ∈ N := {1, ..., N} chooses an action u_i ∈ U, giving rise to a joint action vector u := [u_i]_{i=1}^N ∈ U^N. P(s'|s, u): S × U^N → P(S) is the Markovian transition function governing the state transition dynamics. Every agent shares the same joint reward function R(s, u): S × U^N → R, and γ ∈ [0, 1) is the discount factor. Due to partial observability, each agent i receives an individual partial observation υ ∈ Υ according to the observation function O(s, i): S × N → Υ.
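As a concrete (hypothetical) illustration of the Dec-POMDP tuple above, the sketch below packages ⟨S, U, P, R, Υ, O, N, γ⟩ into a small Python container and runs one decentralized step in which each agent acts only on its own partial observation. All names, the container layout, and the toy coordination dynamics are illustrative assumptions, not part of any RMIX codebase.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DecPOMDP:
    """Minimal container mirroring the tuple <S, U, P, R, Upsilon, O, N, gamma>."""
    states: list          # S: set of true environment states
    actions: list         # U: shared individual action set
    n_agents: int         # N: number of agents
    gamma: float          # discount factor in [0, 1)
    transition: Callable  # P(s'|s, joint_u): returns the next state
    reward: Callable      # R(s, joint_u): shared scalar reward for all agents
    observe: Callable     # O(s, i): partial observation of agent i

def rollout_step(env: DecPOMDP, s, policies: List[Callable]):
    """One decentralized step: each agent picks u_i from its own observation only."""
    joint_u = [policies[i](env.observe(s, i)) for i in range(env.n_agents)]
    r = env.reward(s, joint_u)          # one shared reward (cooperative setting)
    s_next = env.transition(s, joint_u)
    return s_next, r

# Toy two-agent instance: agents are rewarded for choosing matching actions.
env = DecPOMDP(
    states=[0, 1], actions=[0, 1], n_agents=2, gamma=0.99,
    transition=lambda s, u: (s + sum(u)) % 2,
    reward=lambda s, u: 1.0 if u[0] == u[1] else 0.0,
    observe=lambda s, i: s,  # here both agents happen to observe the state
)
s_next, r = rollout_step(env, 0, [lambda o: o, lambda o: o])
print(s_next, r)  # 0 1.0 -> both agents picked action 0 and were rewarded
```

Note that during centralized training the learner may additionally access the true state s and all agents' trajectories, while `rollout_step` reflects the decentralized execution side of CTDE.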

