ROBUST MULTI-AGENT REINFORCEMENT LEARNING WITH STATE UNCERTAINTIES

Abstract

In real-world multi-agent reinforcement learning (MARL) applications, agents may lack perfect state information (e.g., due to inaccurate measurements or malicious attacks), which challenges the robustness of their policies. Although robustness is increasingly important for MARL deployment, little prior work has studied state uncertainty in MARL, in either problem formulation or algorithm design. Motivated by this robustness issue, we study the problem of MARL with state uncertainty and provide the first theoretical and empirical analysis of this challenging problem. We model the problem as a Markov Game with state perturbation adversaries (MG-SPA), in which each agent is associated with a state perturbation adversary that plays against it by preventing it from knowing the true state accurately, and we introduce the Robust Equilibrium as the solution concept. We conduct a fundamental analysis of the MG-SPA and give conditions under which such an equilibrium exists. We then propose a robust multi-agent Q-learning (RMAQ) algorithm to find such an equilibrium, with a convergence guarantee. To handle high-dimensional state-action spaces, we design a robust multi-agent actor-critic (RMAAC) algorithm based on an analytical expression of the policy gradient derived in the paper. Our experiments show that the proposed RMAQ algorithm converges to the optimal value function, and that the RMAAC algorithm outperforms several MARL methods that do not consider state uncertainty in several multi-agent environments. In summary, we develop a robust MARL framework that accounts for state uncertainty by analyzing the MARL problem under adversarial, i.e. worst-case, state perturbations.
Compared to single-agent RL, MARL is more challenging due to the interactions among agents and the necessity of studying equilibrium policies (Nash, 1951; McKelvey & McLennan, 1996; Slantchev, 2008; Daskalakis et al., 2009; Etessami & Yannakakis, 2010).

Contributions: To the best of our knowledge, this work is the first attempt to systematically characterize state uncertainties in MARL and to provide both theoretical and empirical analysis. First, we formulate the MARL problem with state uncertainty as a Markov Game with state perturbation adversaries (MG-SPA). We define the solution concept of the game as a Robust Equilibrium, in which all players, the agents as well as the adversaries, use policies from which no one has an incentive to deviate. In an MG-SPA, each agent not only aims to maximize its return in light of the other agents' actions but also needs to act against all state perturbation adversaries; a Robust Equilibrium policy of an agent is therefore robust to state uncertainty. Second, we study the fundamental properties of the MG-SPA and prove the existence of a Robust Equilibrium under certain conditions. We develop a robust multi-agent Q-learning (RMAQ) algorithm with a convergence guarantee, and an actor-critic (RMAAC) algorithm, for computing a robust equilibrium policy in an MG-SPA. Finally, we conduct experiments in a two-player game to validate the convergence of the proposed Q-learning method RMAQ, and we show that the RMAQ and RMAAC algorithms learn robust policies that outperform baselines under state perturbations in multi-agent environments.

Throughout, -i (resp. -ĩ) denotes the indices of all agents (resp. adversaries) except agent i (resp. adversary ĩ). We seek to characterize the optimal value v_*, which satisfies the Bellman Equations of an MG-SPA in the forms of (1) and (2); the Bellman Equation is a recursion for expected rewards that helps us identify or find an RE.
for all i ∈ N, where π = (π^i, π^{-i}_*), ρ = (ρ^ĩ, ρ^{-ĩ}_*), and π^{-i}_*, ρ^{-ĩ}_* are part of a robust equilibrium of G̃. We prove these statements in the following subsection.

Vector notations: To keep the analysis readable, we follow and extend the vector notation of Puterman (2014). Let V denote the set of bounded real-valued functions on S, with component-wise partial order and norm ∥v^i∥ := sup_{s∈S} |v^i(s)|, and let V_M denote the subspace of Borel-measurable functions in V. For a discrete state space, all real-valued functions are measurable, so V_M = V; when S is a continuum, V_M is a proper subset of V.

In the extensive-form game (EFG), J^i(λ, χ) denotes the expected payoff, i.e., -E_{λ,χ}[g^i_s], when player P1 plays λ and player P2 plays χ. In the following parts, as well as in the main text, a Nash Equilibrium of an EFG always refers to a Nash Equilibrium in behavioral strategies. How to solve an EFG is out of our scope, since it has been investigated in much of the literature (Başar & Olsder, 1998; Schipper, 2017; Slantchev, 2008). The individual policies λ^i and χ^i can be obtained as marginal probabilities computed with the chain rule (Devore et al., 2012; Mémoli, 2012).

Proof. Since S̃ is a subset of S, S̃ is finite when S is finite. When v^1 = ··· = v^N and S, A are finite, the EFG degenerates to a zero-sum two-person extensive-form game with finite strategies and perfect recall. Thus, an NE of this EFG exists (Başar & Olsder, 1998; Schipper, 2017; Slantchev, 2008).

Proof. The NE (λ_*, χ_*) of the extensive-form game implies that the equilibrium conditions hold for all i = 1, …, N; since f (for fixed s^i) exists and is a bijection as well, the corresponding transformation is well defined, and similar identities follow. Recalling the definition of the minimax operator L^i v^i(s), we conclude that, for all s ∈ S, the equilibrium value coincides with L^i v^i(s). Based on this proof, we also denote by (π_{v_*}, ρ_{v_*}) an NE policy for the EFG.

Proposition A.8. (Contraction mapping; same as Proposition 3.5 in the main text.) Suppose 0 ≤ γ < 1 and Assumption 3.4 holds. Then L is a contraction mapping on V.

Proof. Let u and v be in V.
Given Assumption 3.4, the two EFGs (induced by u and v) each have at least one mixed Nash Equilibrium by Lemma A.6. Let (π_{u*}, ρ_{u*}) and (π_{v*}, ρ_{v*}) be Nash Equilibria of these two games, respectively. By Lemma A.7, the following hold for all s ∈ S:

L^i u^i(s) = r^i_{(π_{u*}, ρ_{u*})}(s) + γ ∑_{s′∈S} p_{(π_{u*}, ρ_{u*})}(s′|s) u^i(s′),
L^i v^i(s) = r^i_{(π_{v*}, ρ_{v*})}(s) + γ ∑_{s′∈S} p_{(π_{v*}, ρ_{v*})}(s′|s) v^i(s′).

Then we have

r^i_{(π_{u*}, ρ_{v*})}(s) + γ ∑_{s′∈S} p_{(π_{u*}, ρ_{v*})}(s′|s) v^i(s′) ≤ L^i v^i(s) ≤ r^i_{(π_{v*}, ρ_{u*})}(s) + γ ∑_{s′∈S} p_{(π_{v*}, ρ_{u*})}(s′|s) v^i(s′),
r^i_{(π_{v*}, ρ_{u*})}(s) + γ ∑_{s′∈S} p_{(π_{v*}, ρ_{u*})}(s′|s) u^i(s′) ≤ L^i u^i(s) ≤ r^i_{(π_{u*}, ρ_{v*})}(s) + γ ∑_{s′∈S} p_{(π_{u*}, ρ_{v*})}(s′|s) u^i(s′),

since (π_{u*}, ρ_{v*}) and (π_{v*}, ρ_{u*}) each deviate from an equilibrium pair in only one player's policy, and at a saddle point a unilateral deviation by the maximizing (resp. minimizing) player cannot increase (resp. decrease) the value. Combining the two chains of inequalities and subtracting yields |L^i v^i(s) - L^i u^i(s)| ≤ γ max_{s′} |v^i(s′) - u^i(s′)| ≤ γ ∥v - u∥ for all s ∈ S, so ∥Lv - Lu∥ ≤ γ ∥v - u∥ and L is a contraction mapping on V.

1. INTRODUCTION

Reinforcement Learning (RL) has recently achieved remarkable success in many decision-making problems, such as robotics, autonomous driving, traffic control, and game playing (Espeholt et al., 2018; Silver et al., 2017; Mnih et al., 2015). However, in real-world applications, the agent may face state uncertainty, in that accurate information about the state is unavailable. This uncertainty may be caused by unavoidable sensor measurement errors, noise, missing information, communication issues, and/or malicious attacks. A policy that is not robust to state uncertainty can result in unsafe behaviors and even catastrophic outcomes. For instance, consider the path planning problem shown in Figure 1, where the agent (green ball) observes the position of an obstacle (red ball) through sensors and plans a safe (collision-free) and shortest path to the goal (black cross). In Figure 1-(a), the agent can observe the true state s (red ball) and choose an optimal, collision-free curve a_* (in red) tangent to the obstacle. In comparison, when the agent can only observe the perturbed state s̃ (yellow ball) caused by inaccurate sensing or state perturbation adversaries (Figure 1-(b)), it will choose a straight line ã (in blue) as the shortest collision-free path tangent to s̃. However, by following ã, the agent actually crashes into the obstacle. To avoid collision in the worst case, one can construct a state uncertainty set, based on the observed state, that contains the true state. The robustly optimal path under state uncertainty then becomes the yellow curve ã_* tangent to the uncertainty set, as shown in Figure 1-(c). In single-agent RL, imperfect information about the state has been studied in the literature on partially observable Markov decision processes (POMDPs) (Kaelbling et al., 1998).
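The uncertainty-set construction of Figure 1-(c) can be sketched numerically. The helper below is ours, with hypothetical numbers: a path planned tangent to the observed obstacle may hit the true obstacle, while a path tangent to the ϵ-inflated uncertainty set stays safe for any perturbation within ϵ.

```python
import math

def collides(path, center, radius):
    """True if any waypoint of `path` lies strictly inside the disk."""
    cx, cy = center
    return any(math.hypot(x - cx, y - cy) < radius for x, y in path)

# Hypothetical numbers: obstacle radius 1, perturbation bound eps = 0.4.
eps = 0.4
true_center = (0.0, 0.0)
observed_center = (0.0, -0.4)  # perturbed measurement, error within eps

# Path tangent to the *observed* obstacle (clearance = radius) vs. a robust
# path tangent to the eps-inflated uncertainty set (clearance = radius + eps).
naive_path = [(x, observed_center[1] + 1.0) for x in (-1.0, 0.0, 1.0)]
robust_path = [(x, observed_center[1] + 1.0 + eps) for x in (-1.0, 0.0, 1.0)]

print(collides(naive_path, true_center, 1.0))   # True: crashes into the true obstacle
print(collides(robust_path, true_center, 1.0))  # False: safe for any error <= eps
```

Because the true obstacle center lies within ϵ of the observed one, clearing the inflated set guarantees clearing the true obstacle.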
However, as pointed out in recent literature (Huang et al., 2017; Kos & Song, 2017; Yu et al., 2021b; Zhang et al., 2020a), the conditional observation probabilities in a POMDP cannot capture the worst-case (or adversarial) scenario, and a policy learned without considering state uncertainty may fail to achieve the agent's goal. Similarly, the existing literature on decentralized partially observable Markov decision processes (Dec-POMDPs) (Oliehoek et al., 2016) does not provide theoretical analysis or algorithmic tools for MARL under worst-case state uncertainty either. Dealing with state uncertainty becomes even more challenging in Multi-Agent Reinforcement Learning (MARL), where each agent aims to maximize its own total return while interacting with the other agents and the environment. Even if only one agent receives misleading state information, its action affects both its own return and the other agents' returns (Zhang et al., 2020b) and may result in catastrophic failure. To better illustrate the effect of state uncertainty in MARL, the path planning problem of Figure 1 is modified so that two agents try to reach their individual goals without collision (which incurs a penalty, i.e., a negative reward). When the blue agent knows the true position s^g_0 (the subscript denotes time, which starts from 0) of the green agent, it will get around the green agent to quickly reach its goal without collision. However, when the blue agent can only observe the perturbed position s̃^g_0 (yellow circle) of the green agent, it will choose a straight line that it believes to be safe (Figure 2-(a1)), which eventually leads to a crash. In Figure 2-(b), the blue agent instead adopts a robust trajectory by considering a state uncertainty set based on its observation. As shown in Figure 2-(b1), there is no overlap between (s^b_0, s̃^g_0) or (s^b_T, s̃^g_T).
Since the uncertainty sets centered at s̃^g_0 and s̃^g_T (the dotted circles) include the true state of the green agent, this robust trajectory also ensures no collision between (s^b_0, s^g_0) or (s^b_T, s^g_T). The blue agent considers its interactions with the green agent to ensure no collision at any time. It is therefore necessary to consider state uncertainty in a multi-agent setting where the dynamics of the other agents are taken into account.

Robust Reinforcement Learning: Recent work on robust reinforcement learning has studied different types of uncertainties, such as action uncertainties (Tessler et al., 2019) and transition kernel uncertainties (Sinha et al., 2020; Yu et al., 2021b; Hu et al., 2020; Wang & Zou, 2021; Lim & Autef, 2019; Nisioti et al., 2021; He et al., 2022). Recent work on adversarial state perturbations for single-agent RL has validated the importance of considering state uncertainty and improving the robustness of the learned policy in deep RL (Huang et al., 2017; Lin et al., 2017; Zhang et al., 2020a; 2021; Everett et al., 2021). Zhang et al. (2020a; 2021) formulate state perturbation in single-agent RL as a modified Markov decision process and study the robustness of single-agent RL policies. Huang et al. (2017) and Lin et al. (2017) show that adversarial state perturbations undermine the performance of neural-network policies in single-agent reinforcement learning and propose different single-agent attack strategies. In this work, we consider the more challenging problem of adversarial state perturbation for MARL, where the environment of an individual agent is non-stationary because the other agents' policies change during training.

Robust Multi-Agent Reinforcement Learning:

There is very limited literature on the solution concept or theoretical analysis of MARL under adversarial state perturbations. Other types of uncertainties have been investigated, such as uncertainty about a training partner's type (Shen & How, 2021), about the other agents' policies (Li et al., 2019; Sun et al., 2021; van der Heiden et al., 2020), and about rewards (Zhang et al., 2020b). The policies considered in these papers rely on the current true state information; hence, the robust MARL considered in this work is fundamentally different, since the agents do not know the true state. Dec-POMDPs enable a team of agents to optimize policies under partially observable states (Oliehoek et al., 2016; Chen et al., 2022). Lin et al. (2020) study state perturbation in cooperative MARL and propose a method that attacks the state of one single agent in order to decrease the team reward. In contrast, we consider the worst-case scenario in which the state of every agent can be perturbed by an adversary, and we focus on the theoretical analysis of robust MARL, including the existence of the optimal value function and of a Robust Equilibrium (RE). Our work provides formal definitions of the state uncertainty challenge in MARL and derives both theoretical analysis and practical algorithms.

Game Theory and MARL: MARL shares theoretical foundations with game theory, and surveys exist that interpret MARL from a game-theoretic perspective (Yang & Wang, 2020). A Markov game, sometimes called a stochastic game, models the interaction between multiple agents (Owen, 2013; Littman, 1994).
Algorithms that compute a Nash Equilibrium (NE), such as Nash Q-learning (Hu & Wellman, 2003), and analyses of Dec-POMDPs (Oliehoek et al., 2016) or partially observable stochastic games (POSGs) that assume an NE exists (Chades et al., 2002; Hansen et al., 2004; Nair et al., 2002) have been developed in the literature without proving conditions for the existence of an NE. The main theoretical contributions of this work are proving conditions under which the proposed MG-SPA admits Robust Equilibrium solutions, and a convergence analysis of our proposed robust Q-learning algorithm. This is the first attempt to analyze the fundamental properties of MARL under adversarial state uncertainties.

3.1. MARKOV GAME WITH STATE PERTURBATION ADVERSARIES (MG-SPA)

Preliminary: A Markov Game (MG) G is defined as (N, {S^i}_{i∈N}, {A^i}_{i∈N}, {r^i}_{i∈N}, p, γ), where N is a set of N agents, and S^i and A^i are the state space and action space of agent i, respectively (Littman, 1994; Owen, 2013). γ ∈ [0, 1) is the discount factor. S = S^1 × ··· × S^N is the joint state space and A = A^1 × ··· × A^N is the joint action space. The state transition p : S × A → ∆(S) is controlled by the current state and the joint action, where ∆(S) represents the set of all probability distributions over the joint state space S. Each agent has a reward function r^i : S × A → R. At time t, agent i chooses its action a^i_t according to a policy π^i : S^i → ∆(A^i). The agents' joint policy is π = ∏_{i∈N} π^i : S → ∆(A).
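As a small illustration of the product-form joint policy π = ∏_{i∈N} π^i, the probability of a joint action factorizes over agents; the numbers below are hypothetical toy values for two agents with two actions each.

```python
from itertools import product

# Per-agent action distributions at the agents' current states (toy numbers).
pi_1 = {0: 0.7, 1: 0.3}  # pi^1(a^1 | s^1)
pi_2 = {0: 0.5, 1: 0.5}  # pi^2(a^2 | s^2)

# Joint policy: pi(a | s) = pi^1(a^1 | s^1) * pi^2(a^2 | s^2).
joint = {(a1, a2): pi_1[a1] * pi_2[a2] for a1, a2 in product((0, 1), repeat=2)}
print(joint[(0, 1)])                            # 0.35
print(abs(sum(joint.values()) - 1.0) < 1e-12)   # True: a valid distribution over A
```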

Notations:

We use the tuple G̃ := (N, M, {S^i}_{i∈N}, {A^i}_{i∈N}, {B^ĩ}_{ĩ∈M}, {r^i}_{i∈N}, p, f, γ) to denote a Markov game with state perturbation adversaries (MG-SPA). In an MG-SPA, we introduce an additional set of adversaries M = {1̃, …, Ñ} into a Markov game (MG) with agent set N. Each agent i is associated with an adversary ĩ and, absent adversarial perturbation, a true state s^i ∈ S^i. Each adversary ĩ is associated with an action b^ĩ ∈ B^ĩ and the same state s^i ∈ S^i as agent i. We define the adversaries' joint action as b = (b^1̃, …, b^Ñ) ∈ B, B = B^1̃ × ··· × B^Ñ. At time t, adversary ĩ can manipulate the corresponding agent i's state: once adversary ĩ observes state s^i_t, it chooses an action b^ĩ_t according to a policy ρ^ĩ : S^i → ∆(B^ĩ), and, through a perturbation function f, perturbs s^i_t to s̃^i_t = f(s^i_t, b^ĩ_t) ∈ S^i. The adversaries' joint policy is ρ = ∏_{ĩ∈M} ρ^ĩ : S → ∆(B). The definitions of agent actions and the agents' joint action are the same as in an MG, except that agent i chooses its action a^i_t based on the perturbed state s̃^i_t according to a policy π^i(a^i_t | s̃^i_t), π^i : S^i → ∆(A^i). The agents execute the joint action a_t; at time t + 1, the joint state s_t transitions to s_{t+1} according to a transition probability function p : S × A × B → ∆(S). Each agent i receives a reward according to a state-wise reward function r^i_t : S × A × B → R, and each adversary ĩ receives the opposite reward -r^i_t. In an MG, the transition probability function and reward functions constitute the model of the game; in an MG-SPA, the perturbation function f is also part of the model, i.e., the model of an MG-SPA consists of f, p, and {r^i}_{i∈N}. To incorporate realistic settings into our analysis, we restrict the power of each adversary, a common assumption for state perturbation adversaries in the RL literature (Zhang et al., 2020a; 2021; Everett et al., 2021).
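For intuition, a minimal perturbation function f can be sketched for a discrete state space, matching the two-player game of Section 4.1 where b = 0 keeps the observation and b = 1 flips it; for each fixed s^i, f(s^i, ·) maps B onto S^i bijectively, consistent with Assumption 3.4-(4).

```python
def f(s_i, b_i):
    """b = 0: leave the state; b = 1: swap it. For fixed s_i, f(s_i, .) is a
    bijection from B = {0, 1} onto S^i = {0, 1}."""
    return s_i ^ b_i  # XOR is its own inverse

print([f(0, b) for b in (0, 1)])  # [0, 1]
print([f(1, b) for b in (0, 1)])  # [1, 0]
```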
We define perturbation constraints s̃^i ∈ B_dist(ϵ, s^i) ⊂ S^i that restrict adversary ĩ to perturb a state only within a predefined set of states. B_dist(ϵ, s^i) is an ϵ-radius ball in the metric dist(·,·), often chosen as the l-norm distance dist(s^i, s̃^i) = ∥s^i - s̃^i∥_l; we omit the subscript dist in the following. Each agent i attempts to maximize its expected sum of discounted rewards, i.e., its objective function J^i(π, ρ) = E[ ∑_{t=1}^∞ γ^{t-1} r^i_t(s_t, a_t, b_t) | s_1 = s, a_t ∼ π(·|s̃_t), b_t ∼ ρ(·|s_t) ]. Each adversary ĩ aims to minimize the objective function of agent i and is modeled as receiving the opposite reward of agent i, which gives adversary ĩ the value function -J^i(π, ρ). We further define the value functions of an MG-SPA as follows.

Definition 3.1. (Value Functions) v^{π,ρ} = (v^{π,ρ,1}, …, v^{π,ρ,N}) and q^{π,ρ} = (q^{π,ρ,1}, …, q^{π,ρ,N}) are defined as the state-value function (or value function for short) and the action-value function, respectively. The ith elements v^{π,ρ,i} and q^{π,ρ,i} are defined as
q^{π,ρ,i}(s, a, b) = E[ ∑_{t=1}^∞ γ^{t-1} r^i_t | s_1 = s, a_1 = a, b_1 = b, a_t ∼ π(·|s̃_t), b_t ∼ ρ(·|s_t), s̃^i_t = f(s^i_t, b^ĩ_t) ],
v^{π,ρ,i}(s) = E[ ∑_{t=1}^∞ γ^{t-1} r^i_t | s_1 = s, a_t ∼ π(·|s̃_t), b_t ∼ ρ(·|s_t), s̃^i_t = f(s^i_t, b^ĩ_t) ],
respectively. We name an equilibrium of an MG-SPA a Robust Equilibrium (RE): a joint policy (π_*, ρ_*) such that, for all i ∈ N, s ∈ S, and all π^i and ρ^ĩ, v^{(π^i, π^{-i}_*), ρ_*, i}(s) ≤ v^{π_*, ρ_*, i}(s) ≤ v^{π_*, (ρ^ĩ, ρ^{-ĩ}_*), i}(s), i.e., no agent or adversary can benefit from a unilateral deviation.

Vector notations: Let v = (v^1, …, v^N) ∈ V, where V is the set of bounded real-valued functions on S × ··· × S (the N-fold product of state sets) with norm ∥v∥ := sup_j ∥v^j∥. For discrete S, let |S| denote the number of elements in S, let r^i denote the |S|-vector whose sth component r^i(s) is the expected reward of agent i in state s, and let P be the |S| × |S| matrix with (s, s′)th entry p(s′|s).
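The perturbation constraint s̃^i ∈ B(ϵ, s^i) can be enforced in practice by projecting a proposed perturbed state back into the ϵ-ball. A minimal l2 sketch (the helper name is ours):

```python
import math

def project_to_ball(s_true, s_pert, eps):
    """Project s_pert onto the l2 ball B(eps, s_true), so that
    dist(s, s~) <= eps holds after projection."""
    delta = [p - t for p, t in zip(s_pert, s_true)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return list(s_pert)          # already feasible
    return [t + d * (eps / norm) for t, d in zip(s_true, delta)]

print(project_to_ball([1.0, 2.0], [1.0, 5.0], eps=1.0))  # ~[1.0, 3.0]
```

For an l∞ ball, the projection would instead clip each coordinate independently.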
We refer to r^i_d as the reward vector of agent i and P_d as the probability transition matrix corresponding to a joint policy d = (π, ρ); r^i_d + γ P_d v^i is the expected total one-period discounted reward of agent i obtained under d. Let z be a sequence of joint policies {d_1, d_2, …} with P^0_z = I; the expected total discounted reward of agent i under z is v^i_z = ∑_{t=1}^∞ γ^{t-1} P^{t-1}_z r^i_{d_t} = r^i_{d_1} + γ P_{d_1} r^i_{d_2} + ··· + γ^{n-1} P_{d_1} ··· P_{d_{n-1}} r^i_{d_n} + ···. We now define the minimax operator used in the rest of the paper.

Definition 3.3. (Minimax Operator) For v^i ∈ V and s ∈ S, we define the nonlinear operator L^i on v^i(s) by L^i v^i(s) := max_{π^i} min_{ρ^ĩ} [r^i_d + γ P_d v^i](s), where d := (π^{-i}_*, π^i, ρ^{-ĩ}_*, ρ^ĩ). We also define the operator Lv(s) = L(v^1(s), …, v^N(s)) = (L^1 v^1(s), …, L^N v^N(s)). Then L^i v^i is an |S|-vector with sth component L^i v^i(s). For discrete S and bounded r^i, it follows from Lemma 5.6.1 in Puterman (2014) that L^i v^i ∈ V for all v^i ∈ V; therefore Lv ∈ V for all v ∈ V. Throughout this paper, we consider the following assumptions on Markov games with state perturbation adversaries.

Assumption 3.4. (1) Bounded rewards: |r^i(s, a, b)| ≤ M^i < M < ∞ for all i ∈ N, a ∈ A, b ∈ B, and s ∈ S. (2) Finite state and action spaces: all S^i, A^i, B^ĩ are finite. (3) Stationary transition probability and reward functions. (4) f is a bijection when s^i is fixed. (5) All agents share one common reward function.

The next two propositions characterize the properties of the minimax operator L and the space V; they are proved in Appendix A.2.

Proposition 3.5. (Contraction mapping) Suppose 0 ≤ γ < 1 and Assumption 3.4 holds. Then L is a contraction mapping on V.

Proposition 3.6. (Complete space) V is a complete normed linear space.

In Theorem 3.7, we present the fundamental theoretical analysis of an MG-SPA.
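The minimax operator of Definition 3.3 can be sketched in tabular form. The paper's operator uses a mixed-strategy NE of an extensive-form game; the sketch below simplifies to pure strategies for a single agent/adversary pair on a hypothetical toy model, which is enough to check the contraction property of Proposition 3.5 numerically.

```python
def minimax_operator(v, r, p, gamma, states, actions, b_actions):
    """(Lv)(s) = max_a min_b [ r(s,a,b) + gamma * sum_s' p(s'|s,a,b) v(s') ],
    with pure strategies (a conservative simplification of Definition 3.3)."""
    return {
        s: max(
            min(
                r[s, a, b] + gamma * sum(p[s, a, b][s2] * v[s2] for s2 in states)
                for b in b_actions
            )
            for a in actions
        )
        for s in states
    }

# Hypothetical toy model: the adversary cancels the reward by matching b = a.
states, actions, b_actions, gamma = (0, 1), (0, 1), (0, 1), 0.9
r = {(s, a, b): 0.0 if b == a else 1.0 for s in states for a in actions for b in b_actions}
p = {(s, a, b): {0: 0.5, 1: 0.5} for s in states for a in actions for b in b_actions}

# Numerical check of the contraction property: ||Lu - Lv|| <= gamma * ||u - v||.
u, v = {0: 1.0, 1: -2.0}, {0: 0.3, 1: 0.5}
Lu = minimax_operator(u, r, p, gamma, states, actions, b_actions)
Lv = minimax_operator(v, r, p, gamma, states, actions, b_actions)
lhs = max(abs(Lu[s] - Lv[s]) for s in states)
rhs = gamma * max(abs(u[s] - v[s]) for s in states)
print(lhs <= rhs + 1e-12)  # True
```

In this toy model the pure-strategy adversary can always cancel the reward, so iterating L converges to the zero value function; with mixed strategies the robust value can be strictly positive, as in the two-player game of Section 4.1.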
In (1), we show that an optimal value function of an MG-SPA satisfies the Bellman Equations, by applying the Squeeze Theorem [Theorem 3.3.6, Sohrab (2003)]. Theorem 3.7-(2) shows that the unique solution of the Bellman Equation exists, a consequence of the fixed-point theorem (Smart, 1980); therefore, the optimal value function of an MG-SPA exists under Assumption 3.4. Part (3) characterizes the relationship between the optimal value function and a Robust Equilibrium. However, (3) does not imply the existence of an RE. To this end, part (4) formally establishes the existence of an RE when the optimal value function exists: we formulate a 2N-player extensive-form game (EFG) (Osborne & Rubinstein, 1994; Von Neumann & Morgenstern, 2007) based on the optimal value function such that its Nash Equilibrium (NE) is equivalent to an RE of the MG-SPA. The full proof of Theorem 3.7 is in Appendix A.3.

Theorem 3.7. Suppose 0 ≤ γ < 1 and Assumption 3.4 holds. (1) (Solution of the Bellman Equation) A value function v_* ∈ V is an optimal value function if, for all i ∈ N, the point-wise value function v^i_* ∈ V satisfies the corresponding Bellman Equation (2), i.e., v^i_* = L^i v^i_* for all i ∈ N. (2) (Existence and uniqueness of the optimal value function) There exists a unique v_* ∈ V satisfying Lv_* = v_*, i.e., L^i v^i_* = v^i_* for all i ∈ N. (3) (Robust Equilibrium (RE) and optimal value function) A joint policy d_* = (π_*, ρ_*), where π_* = (π^1_*, …, π^N_*) and ρ_* = (ρ^1̃_*, …, ρ^Ñ_*), is a robust equilibrium if and only if v_{d_*} is the optimal value function. (4) (Existence of a Robust Equilibrium) There exists a mixed RE for an MG-SPA.

Though the existence of an NE in stochastic games with perfect information has been investigated (Nash, 1951; Wald, 1945), it remains an open and challenging problem when players have no global state or only partially observable information (Hansen et al., 2004; Yang & Wang, 2020). A large body of literature develops algorithms that attempt to find an NE in Dec-POMDPs or partially observable stochastic games (POSGs) and analyzes algorithms under the assumption that an NE exists (Chades et al., 2002; Hansen et al., 2004; Nair et al., 2002) without proving conditions for its existence. Having established the existence of an RE, we design algorithms to find it. We first develop a robust multi-agent Q-learning (RMAQ) algorithm with a convergence guarantee; we then propose a robust multi-agent actor-critic (RMAAC) algorithm to handle high-dimensional state-action spaces.

3.3. ROBUST MULTI-AGENT Q-LEARNING (RMAQ) ALGORITHM

By solving the Bellman Equation, we can obtain the optimal value function of an MG-SPA, as shown in Theorem 3.7. We therefore develop a value-iteration (VI)-based method called the robust multi-agent Q-learning (RMAQ) algorithm. Recalling the Bellman equation in action-value form in (1), the optimal action-value function q_* satisfies
q^i_*(s, a, b) := r^i(s, a, b) + γ ∑_{s′∈S} p(s′|s, a, b) E[ q^i_*(s′, a′, b′) | a′ ∼ π_*(·|s′), b′ ∼ ρ_*(·|s′) ].
As a consequence, the tabular-setting RMAQ update can be written as

q^i_{t+1}(s_t, a_t, b_t) = (1 - α_t) q^i_t(s_t, a_t, b_t) + α_t [ r^i_t(s_t, a_t, b_t) + γ ∑_{a_{t+1}∈A} ∑_{b_{t+1}∈B} π_{q_t,*}(a_{t+1}|s_{t+1}) ρ_{q_t,*}(b_{t+1}|s_{t+1}) q^i_t(s_{t+1}, a_{t+1}, b_{t+1}) ],   (3)

where (π_{q_t,*}, ρ_{q_t,*}) is an NE policy obtained by solving the 2N-player extensive-form game (EFG) based on the payoff function (q^1_t, …, q^N_t, -q^1_t, …, -q^N_t); this joint policy is used in updating q_t. All related definitions of the EFG are introduced in Appendix A.1. How to solve an EFG is out of the scope of this work; algorithms for doing so exist in the literature (Čermák et al., 2017; Kroer et al., 2020). Note that in RMAQ each agent's policy is related not only to its own value function but also to the other agents' value functions. This multi-dependency structure captures the interactions between agents in a game, in contrast to Q-learning in single-agent RL, which optimizes a single value function. Meanwhile, establishing the convergence of a multi-agent Q-learning algorithm is a general challenge. We therefore establish the convergence of (3) in Theorem 3.9, motivated by Hu & Wellman (2003). Due to space limitations, the proof that RMAQ is guaranteed to reach the optimal value function q_* = (q^1_*, …, q^N_*) by updating q_t = (q^1_t, …, q^N_t) recursively via (3) under Assumption 3.8 is deferred to Appendix B.2.

Assumption 3.8. (1) State and action pairs are visited infinitely often. (2) The learning rate α_t satisfies 0 ≤ α_t < 1 and ∑_{t≥0} α_t² < ∞, and α_t(s, a, b) = 0 if (s, a, b) ≠ (s_t, a_t, b_t). (3) An NE of the 2N-player EFG based on (q^1_t, …, q^N_t, -q^1_t, …, -q^N_t) exists at each iteration t.

Theorem 3.9.
Under Assumption 3.8, the sequence {q_t} obtained from (3) converges with probability 1 to q_*, the optimal action-value functions satisfying the Bellman equations (1) for all i = 1, …, N.

Assumption 3.8-(1) is a typical ergodicity assumption used in the convergence analysis of Q-learning (Littman & Szepesvári, 1996; Hu & Wellman, 2003; Szepesvári & Littman, 1999; Qu & Wierman, 2020; Sutton & Barto, 1998); it is also common in Q-learning algorithm-design papers where exploration is not the main focus (Fujimoto et al., 2019). For exploration strategies in RL (McFarlane, 2018), researchers use ϵ-greedy exploration (Gomes & Kowalczyk, 2009), UCB (Jin et al., 2018; Azar et al., 2017), Thompson sampling (Russo et al., 2018), Boltzmann exploration (Cesa-Bianchi et al., 2017), etc. As for Assumption 3.8-(3), researchers have found that convergence is not necessarily sensitive to the existence of an NE for the stage games during training (Hu & Wellman, 2003; Yang et al., 2018). In particular, under Assumption 3.4, an NE of the 2N-player EFG exists, as proved in Lemma A.6 in Appendix A.1. We also provide an example in the experiment section (the two-player game) where the assumptions are indeed satisfied and our RMAQ algorithm successfully converges to the RE of the corresponding MG-SPA.
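One step of the tabular RMAQ update (3) can be sketched as follows. The stage policies are supplied as inputs here; in the algorithm they come from solving the EFG on the current Q-values, which this sketch does not do.

```python
def rmaq_update(q, s, a, b, r, s_next, pi, rho, alpha, gamma):
    """One RMAQ update, Eq. (3). q maps (s, a, b) -> value;
    pi[s'] and rho[s'] are dicts mapping an action to its probability."""
    expected_next = sum(
        pi[s_next][a2] * rho[s_next][b2] * q[(s_next, a2, b2)]
        for a2 in pi[s_next] for b2 in rho[s_next]
    )
    q[(s, a, b)] = (1 - alpha) * q[(s, a, b)] + alpha * (r + gamma * expected_next)

# Toy usage (hypothetical two-state, two-action, two-perturbation setting).
q = {(s, a, b): 0.0 for s in (0, 1) for a in (0, 1) for b in (0, 1)}
uniform = {s: {0: 0.5, 1: 0.5} for s in (0, 1)}
rmaq_update(q, s=0, a=1, b=0, r=1.0, s_next=1, pi=uniform, rho=uniform,
            alpha=0.5, gamma=0.99)
print(q[(0, 1, 0)])  # 0.5: half of the observed reward, since q was all zeros
```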

3.4. ROBUST MULTI-AGENT ACTOR-CRITIC (RMAAC) ALGORITHM

According to the above description of the tabular RMAQ algorithm, each learning agent has to maintain N action-value functions. The total space requirement is N|S||A|^N|B|^N when |A^1| = ··· = |A^N| = |A| and |B^1̃| = ··· = |B^Ñ| = |B|. This space complexity is linear in the number of joint states, polynomial in the sizes of the agents' and adversaries' joint action spaces, and exponential in the number of agents. The computational complexity is dominated by solving an extensive-form game (Čermák et al., 2017; Kroer et al., 2020); even for general-sum normal-form games, computing an NE is PPAD-complete, which is still considered difficult in the game theory literature (Daskalakis et al., 2009; Chen et al., 2009; Etessami & Yannakakis, 2010). These properties of the RMAQ algorithm motivate us to develop an actor-critic method to handle high-dimensional state-action spaces, since actor-critic methods can incorporate function approximation into the update (Konda & Tsitsiklis, 1999). We parameterize each agent i's policy π^i as π_{θ^i} for i ∈ N and each adversary's policy ρ^ĩ as ρ_{ω^ĩ}. We denote by θ = (θ^1, …, θ^N) the concatenation of all agents' policy parameters; ω is defined analogously. For simplicity, we omit the subscripts θ^i, ω^ĩ, since the parameters can be identified from the policy names. Note that here we parameterize all policies π^i, ρ^ĩ as deterministic policies. The value function v^i(s) under the joint policy (π, ρ) then satisfies

v^{π,ρ,i}(s) = E_{a∼π, b∼ρ} [ ∑_{s′∈S} p(s′|s, a, b) ( r^i(s, a, b) + γ v^{π,ρ,i}(s′) ) ].   (4)

We establish the general policy gradient with respect to the parameters θ, ω in the following theorem. We then propose our robust multi-agent actor-critic (RMAAC) algorithm, which adopts the centralized-training decentralized-execution structure common in the MARL literature (Li et al., 2019; Lowe et al., 2017; Foerster et al., 2018).
We put the pseudo-code of RMAAC in Appendix B.2.2.

Theorem 3.10. (Policy Gradient in RMAAC for an MG-SPA) For each agent i ∈ N and adversary ĩ ∈ M, the policy gradients of the objective J^i(θ, ω) with respect to the parameters θ, ω are

∇_{θ^i} J^i(θ, ω) = (1/T) ∑_{t=1}^T ∇_{a^i} q^i(s_t, a_t, b_t) ∇_{θ^i} π^i(s̃^i_t) |_{a^i_t = π^i(s̃^i_t), b^ĩ_t = ρ^ĩ(s^i_t)},
∇_{ω^ĩ} J^i(θ, ω) = (1/T) ∑_{t=1}^T [ ∇_{b^ĩ} q^i(s_t, a_t, b_t) + reg ] ∇_{ω^ĩ} ρ^ĩ(s^i_t) |_{a^i_t = π^i(s̃^i_t), b^ĩ_t = ρ^ĩ(s^i_t)},   (5)

where reg = ∇_{b^ĩ} f(s^i_t, b^ĩ_t) ∇_{a^i} q^i(s_t, a_t, b_t) ∇_f π^i(f).

Proof. Taking the gradient with respect to θ^i, ω^ĩ for all i on both sides of (4) yields the results; see details in Appendix B.2.1.
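The gradient expressions in (5), including the reg term induced by the chain rule through f, can be checked numerically on a toy 1-D instance. All concrete choices below are hypothetical: linear deterministic policies π(x) = θx and ρ(s) = ωs, perturbation f(s, b) = s + b, and a fixed smooth critic q standing in for the learned q^i.

```python
def q(s, a, b):
    """A fixed smooth stand-in critic (hypothetical)."""
    return -(a - b) ** 2 - 0.1 * a * s

def J(theta, omega, states):
    """Empirical objective: average critic value along the perturbed rollout."""
    total = 0.0
    for s in states:
        b = omega * s          # rho(s)
        s_tilde = s + b        # f(s, b)
        a = theta * s_tilde    # pi(s~)
        total += q(s, a, b)
    return total / len(states)

def analytic_grads(theta, omega, states):
    """Gradients per Theorem 3.10, incl. reg = grad_b f * grad_a q * grad_s~ pi."""
    g_th = g_om = 0.0
    for s in states:
        b = omega * s
        s_tilde = s + b
        a = theta * s_tilde
        dq_da = -2 * (a - b) - 0.1 * s
        dq_db = 2 * (a - b)
        g_th += dq_da * s_tilde            # grad_a q * grad_theta pi(s~)
        reg = 1.0 * dq_da * theta          # grad_b f = 1, grad_s~ pi = theta
        g_om += (dq_db + reg) * s          # (grad_b q + reg) * grad_omega rho(s)
    return g_th / len(states), g_om / len(states)

states, theta, omega, h = [0.5, -1.0, 2.0], 0.3, -0.2, 1e-6
g_th, g_om = analytic_grads(theta, omega, states)
fd_th = (J(theta + h, omega, states) - J(theta - h, omega, states)) / (2 * h)
fd_om = (J(theta, omega + h, states) - J(theta, omega - h, states)) / (2 * h)
print(abs(g_th - fd_th) < 1e-5, abs(g_om - fd_om) < 1e-5)  # True True
```

Dropping the reg term would bias the adversary's gradient, since the agent's action depends on b through the perturbed state s̃ = f(s, b).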

4. EXPERIMENT

4.1 ROBUST MULTI-AGENT Q-LEARNING (RMAQ)

Figure 3: Two-player game: each player has two states and the same action set of size 2. In state s_0, the two players receive the same reward 1 when they choose the same action; in state s_1, they receive the same reward 1 when they choose different actions. The state switches only when the two players receive a reward, i.e., the players stay in the current state until they are rewarded.

We show the performance of the proposed RMAQ algorithm by applying it to a two-player game. We first introduce the designed two-player game, then investigate the convergence of the algorithm and compare the performance of the Robust Equilibrium policies with other policies under different adversary policies.

Two-player game: For the game in Figure 3, the two players have the same action space A = {0, 1} and state space S = {s_0, s_1}. They receive the same positive reward when they choose the same action in state s_0 or different actions in state s_1, and the state does not change until they receive a positive reward. One Nash Equilibrium (NE) of this game is π_{*1} = (π^1_1, π^2_1): player 1 always chooses action 1, and player 2 chooses action 1 in state s_0 and action 0 in state s_1. Another is π_{*2} = (π^1_2, π^2_2): player 1 always chooses action 0, and player 2 chooses action 0 in state s_0 and action 1 in state s_1. Under an NE policy, the two players always receive the same positive reward, and the optimal discounted state value is v^i_*(s) = 1/(1 - γ) for all s ∈ S, i ∈ {1, 2}, where γ is the discount factor. We set γ = 0.99, so v^i_*(s) = 100. Following the definition of an MG-SPA, we add two adversaries, one per player, each perturbing its player's state and receiving the negative of that player's reward. The adversaries share the action space B = {0, 1}, where 0 means do not perturb and 1 means switch the observation to the other state.
Sometimes no perturbation is a good choice for an adversary. For example, when the true state is s_0 and the players are using π*_1, if adversary 1 does not perturb player 1's observation, player 1 still selects action 1; if adversary 2 changes player 2's observation to state s_1, player 2 chooses action 0, which differs from player 1's action 1. The players then always fail the game and get no reward. A Robust Equilibrium for the MG-SPA is d* = (π^1_*, π^2_*, ρ^1_*, ρ^2_*), in which each player chooses its actions with equal probability, and so does each adversary. The optimal discounted state value of the corresponding MG-SPA is ṽ^i_*(s) = 1/(2(1 − γ)) for all s ∈ S, i ∈ {1, 2}, when the players use Robust Equilibrium (RE) policies. With γ = 0.99, ṽ^i_*(s) = 50. More explanation of this two-player game can be found in Appendix C.1. The learning process for the RE: We initialize q^1(s, a, b) = q^2(s, a, b) = 0 for all s, a, b. After observing the current state, the adversaries choose their actions to perturb the agents' state. The players then execute their actions based on the perturbed state information, observe the next state and rewards, and every agent updates its q according to (3). In the next state, all agents repeat the process above. Training stops after 7500 steps. When updating the Q-values, each agent applies an NE policy of the extensive-form game based on (q^1, q^2, −q^1, −q^2). Experiment results: After 7000 training steps, the agents' Q-values stabilize, even though the dimension of q is fairly high (q ∈ R^32). We compare the optimal state value ṽ_* and the total discounted rewards in Table 1. The total discounted reward converges to the optimal state value of the corresponding MG-SPA, which validates the convergence of our RMAQ method. We compare the RE policy with other policies under different adversary policies in Appendix C.1.
This verifies the robustness of RE policies. Discussion: Even for general-sum normal-form games, computing an NE is known to be PPAD-complete, which is still considered difficult in the game theory literature (Conitzer & Sandholm, 2002; Etessami & Yannakakis, 2010). We therefore do not expect the RMAQ algorithm to scale to very large MARL problems. In the next subsection, we show that RMAAC with function approximation can handle large-scale MARL problems.

4.2. ROBUST MULTI-AGENT ACTOR-CRITIC (RMAAC)

Figure 4: Comparison of episode mean testing rewards using different algorithms in a complicated MPE scenario with a larger number of agents.

We compare our RMAAC algorithm with MADDPG (Lowe et al., 2017), which does not consider robustness, and M3DDPG (Li et al., 2019), which considers robustness with respect to alterations of the opponents' policies. We run experiments in several benchmark multi-agent environments based on the multi-agent particle environments (MPE) (Lowe et al., 2017). The host machine is a server with AMD Ryzen Threadripper 2990WX 32-core processors and four Quadro RTX 6000 GPUs; our experiments use Python 3.5.4, Gym 0.10.5, Numpy 1.14.5, Tensorflow 1.8.0, and CUDA 9.0. Experiment procedure: We first train agents' policies using RMAAC, MADDPG, and M3DDPG, respectively. For our RMAAC algorithm, we set the constraint parameter ϵ = 0.5. We choose two types of perturbation functions to validate the robustness of the trained policies under different MG-SPA models. The first is the linear noise format f_1(s^i, b^ĩ) := s^i + b^ĩ, i.e. the perturbed state s̃^i is obtained by adding a noise b^ĩ generated by adversary ĩ to the true state s^i. The second is f_2(s^i, b^ĩ) := s^i + Gaussian(b^ĩ, Σ), where the adversary ĩ's action b^ĩ is the mean of the Gaussian distribution and the covariance Σ is set to I, the identity matrix; we call this the Gaussian noise format. Experiment results: In Figure 5 and Table 2, we report the episode mean testing rewards and the variance of 10000 steps of testing rewards, respectively (abbreviated below as mean rewards and variance). In the table and figures, RM, M3, and MA abbreviate RMAAC, M3DDPG, and MADDPG, respectively. In Figure 5, the left five plots show mean rewards under the linear noise format f_1, and the right ones under the Gaussian noise format f_2.
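The two perturbation formats can be written down in a few lines. This standalone sketch (the toy dimensions are our own choice) implements f_1 and f_2 with the identity covariance Σ = I:

```python
import numpy as np

def f1(s_i, b_i):
    """Linear noise format: the adversary's action is added to the true state."""
    return s_i + b_i

def f2(s_i, b_i, rng):
    """Gaussian noise format: the adversary's action is the mean of the noise,
    with identity covariance (Sigma = I, i.e. unit scale per dimension)."""
    return s_i + rng.normal(loc=b_i, scale=1.0)

rng = np.random.default_rng(0)
s = np.zeros(4)          # a toy 4-dimensional local state (hypothetical size)
b = 0.1 * np.ones(4)     # an adversary action, e.g. bounded by the constraint eps = 0.5
assert np.allclose(f1(s, b), 0.1)
perturbed = f2(s, b, rng)
assert perturbed.shape == (4,)
```

Under f_1 the adversary controls the perturbation exactly; under f_2 it only controls its mean, so the same trained adversary induces a different, stochastic attack model.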
Under the optimally disturbed environment, agents with RMAAC policies get the highest mean rewards in almost all scenarios, regardless of the noise format. The only exception is Keep away under linear noise; even there, RMAAC achieves the highest rewards under Gaussian noise. In Figure 4, we show comparison results in a complicated scenario with a larger number of agents: the policies trained with RMAAC get the highest rewards when tested under optimally perturbed environments. Higher rewards mean the agents perform better, so RMAAC policies are more robust to worst-case state uncertainty than the two baselines. In Table 2, the left three columns report the variance under the linear noise format f_1, and the right ones under the Gaussian noise format f_2. The variance evaluates the stability of the trained policies, i.e. their robustness to system randomness, because the testing experiments are run in the same environments initialized with different random seeds. With our RMAAC method, the agents achieve the lowest variance in most scenarios under both perturbation formats. Therefore, our RMAAC algorithm is also more robust to system randomness than the baselines. Due to page limits, more experiment results and explanations are in Appendix C.2.

5. CONCLUSION

We study the problem of multi-agent reinforcement learning with state uncertainties in this work. We model the problem as a Markov Game with state perturbation adversaries (MG-SPA), where each agent aims to find a policy that maximizes its own total discounted reward and each associated adversary aims to minimize that reward. This problem is challenging, with little prior work on theoretical analysis or algorithm design. We provide the first attempt at theoretical analysis and algorithm design for MARL under worst-case state uncertainties. We first introduce Robust Equilibrium as the solution concept for MG-SPA and prove conditions under which such an equilibrium exists. Then we propose a robust multi-agent Q-learning algorithm (RMAQ) to find such an equilibrium, with convergence guarantees under certain conditions. We also derive the policy gradients and design a robust multi-agent actor-critic (RMAAC) algorithm to handle more general high-dimensional state-action space MARL problems. Finally, we conduct experiments that validate our methods.

Figure 6: a team extensive-form game. In Figure 6, an EFG unfolds from the top of the tree to the tip of one of its branches. A centralized nature player (P1) has |S̃| alternatives (branches) to choose from, whereas a centralized agent (P2) has |A| alternatives, and the order of play is that the centralized nature player acts before the centralized agent does. The set A is the same as the agents' joint action set in an MG-SPA, and S̃ is a set of perturbed states constrained by a constraint parameter ϵ. At the end of the lower branches, numbers are given; these represent the payoffs to the centralized agent (or, equivalently, the losses incurred by the centralized nature player) if the corresponding paths are selected by the players. We give the formal definition of the EFG used in the proof and the main text as follows: Definition A.3.
An extensive-form game based on (v^1, …, v^N, −v^1, …, −v^N) under s ∈ S is a finite tree structure whose payoff function is g_s(s̃, a) = (g^1_s(s̃, a), …, g^N_s(s̃, a)), where g^i_s(s̃, a) = r^i(s, a, f^{-1}_s(s̃)) + γ Σ_{s'} p(s' | s, a, f^{-1}_s(s̃)) v^i(s') assigns a real number to each terminal vertex of the tree. Player P1 gets −g_s(s̃, a) while player P2 gets g_s(s̃, a); 5. a partition of the nodes of the tree into two player sets (denoted N_1 and N_2 for P1 and P2, respectively); 6. a sub-partition of each player set N_i into information sets {η^i_j}, such that the same number of immediate branches emanates from every node belonging to the same information set, and no node follows another node in the same information set. Note that f_s(b) := f(s, b) = (f(s^1, b^1̃), …, f(s^N, b^Ñ)) is the vector version of the perturbation function f in an MG-SPA. Since in an MG-SPA q^i(s, a, b) = r^i(s, a, b) + γ Σ_{s'} p(s'|s, a, b) v^i(s') for all i = 1, …, N, we also have g^i_s(s̃, a) = q^i(s, a, f^{-1}_s(s̃)). We can therefore use (q^1, …, q^N, −q^1, …, −q^N) to denote an extensive-form game based on (v^1, …, v^N, −v^1, …, −v^N). We then define the behavioral strategies for P1 and P2, respectively, in the following definition. Definition A.4 (Behavioral strategy). Let I^i denote the class of all information sets of P_i, with a typical element denoted η^i. Let U^i_{η^i} denote the set of alternatives of P_i at the nodes belonging to the information set η^i, and define U^i = ∪_{η^i ∈ I^i} U^i_{η^i}. Let Y_{η^1} denote the set of all probability distributions on U^1_{η^1}, the set of alternatives of P1 at the nodes belonging to the information set η^1. Analogously, let Z_{η^2} denote the set of all probability distributions on U^2_{η^2}. Further define Y = ∪_{I^1} Y_{η^1} and Z = ∪_{I^2} Z_{η^2}.
Then, a behavioral strategy λ for P1 is a mapping from the class of all his information sets I^1 into Y, assigning one element in Y to each set in I^1, such that λ(η^1) ∈ Y_{η^1} for each η^1 ∈ I^1. A behavioral strategy χ for P2 is defined analogously as a restricted mapping from I^2 into Z. The set of all behavioral strategies for P_i is called his behavioral strategy set, denoted Γ^i. The information available to the centralized agent (P2) at the time of his play is indicated on the tree diagram in Figure 6 by dotted lines enclosing an area (the information set) containing the relevant nodes. This means the centralized agent knows exactly how the centralized nature player has acted. In this case, a strategy for the centralized agent is a mapping from the collection of his information sets into the set of his actions. The behavioral strategy λ for P1 is thus a mapping from his information sets and action space into a probability simplex, i.e. λ(s̃|s) is the probability of choosing s̃ given s; similarly, the behavioral strategy χ for P2 is χ(a|s̃), the probability of choosing action a when s̃ is given. Note that every behavioral strategy is a mixed strategy. We then give the definition of Nash Equilibrium in behavioral strategies for an EFG. Definition A.5 (Nash Equilibrium in behavioral strategies). A pair of strategies {λ* ∈ Γ^1, χ* ∈ Γ^2} is said to constitute a Nash Equilibrium in behavioral strategies if the following inequalities are satisfied for all i = 1, …, N, λ ∈ Γ^1, χ ∈ Γ^2, s ∈ S: J^i(λ*, χ) ≤ J^i(λ*, χ*) ≤ J^i(λ, χ*). Proposition A.8 (Contraction mapping, same as Proposition 3.5 in the main text). L is a contraction mapping on V. Proof. Assume L^i v^i(s) ≥ L^i u^i(s); then 0 ≤ L^i v^i(s) − L^i u^i(s) ≤ γ||v^i − u^i||. Repeating this argument in the case L^i v^i(s) ≤ L^i u^i(s) implies that ||L^i v^i(s) − L^i u^i(s)|| ≤ γ||v^i − u^i|| for all s ∈ S, i.e. L^i is a contraction mapping on V. Recalling that ||v|| = sup_j ||v^j||, we have ||Lv − Lu|| = sup_j ||L^j v^j − L^j u^j|| ≤ γ sup_j ||v^j − u^j|| = γ||v − u||, so L is a contraction mapping on V. Proposition A.9.
(Complete space, same as Proposition 3.6 in the main text.) V is a complete normed linear space. Proof. Recall that V denotes the set of bounded real-valued functions on S × … × S, i.e. the N-fold Cartesian product of the state set, with component-wise partial order and norm ||v|| := sup_{s∈S} sup_j |v^j(s)|. Since V is closed under addition and scalar multiplication and is endowed with a norm, it is a normed linear space. Since every Cauchy sequence in V converges to a limit in V, V is a complete space.

A.3 PROOF OF THEOREM 3.7

In this section, our goal is to prove Theorem 3.7. We first prove (1): the optimal value function of an MG-SPA satisfies the Bellman Equation, by applying the Squeeze Theorem [Theorem 3.3.6, Sohrab (2003)] in A.3.1. Then we prove that a unique solution of the Bellman Equation exists, using the fixed-point theorem (Smart, 1980), in A.3.2; thereby, the existence of the optimal value function is proved. In (3), we characterize the relationship between the optimal value function and a Robust Equilibrium; the proof of (3) can be found in A.3.3. However, (3) does not imply the existence of an RE. To this end, in (4) we formally establish the existence of an RE when the optimal value function exists: we formulate a 2N-player Extensive-form game (EFG) (Osborne & Rubinstein, 1994; Von Neumann & Morgenstern, 2007) based on the optimal value function such that its Nash Equilibrium (NE) is equivalent to an RE of the MG-SPA. The details are in A.3.4. Theorem A.10 (Same as Theorem 3.7 in the main text). Suppose 0 ≤ γ < 1 and Assumption 3.4 holds. (1) (Solution of Bellman Equation) A value function v_* ∈ V is an optimal value function if, for all i ∈ N, the point-wise value function v^i_* ∈ V satisfies the corresponding Bellman Equation (2), i.e. v^i_* = L^i v^i_* for all i ∈ N. (2) (Existence and uniqueness of optimal value function) There exists a unique v_* ∈ V satisfying Lv_* = v_*, i.e. L^i v^i_* = v^i_* for all i ∈ N.
(3) (Robust Equilibrium (RE) and optimal value function) A joint policy d * = (π * , ρ * ), where π * = (π 1 * , • • • , π N * ) and ρ * = (ρ 1 * , • • • , ρ Ñ * ) , is a robust equilibrium if and only if v d * is the optimal value function. (4) (Existence of Robust Equilibrium) There exists a mixed RE for an MG-SPA.

A.3.1 (1) SOLUTION OF BELLMAN EQUATION

Proof. First, we prove that if there exists a v^i ∈ V such that v^i ≥ L^i v^i, then v^i ≥ v^i_*. Indeed, v^i ≥ L^i v^i implies v^i ≥ max min [r^i + γP v^i] = r^i_d + γP_d v^i, where d = (π^{v,-i}_*, π^{v,i}_*, ρ^{v,-ĩ}_*, ρ^{v,ĩ}_*) is a Nash Equilibrium of the EFG based on v = (v^1, …, v^N, −v^1, …, −v^N); we omit the superscript v for convenience when there is no confusion. Choose a list of policies z = (d_1, d_2, …), where d_j = (π^{-i}_*, π^i_j, ρ^{-ĩ}_*, ρ^{ĩ}_*). Then

v^i ≥ r^i_{d_1} + γP_{d_1} v^i ≥ r^i_{d_1} + γP_{d_1}(r^i_{d_2} + γP_{d_2} v^i) = r^i_{d_1} + γP_{d_1} r^i_{d_2} + γ^2 P_{d_1} P_{d_2} v^i.

By induction, it follows that, for n ≥ 1,

v^i ≥ r^i_{d_1} + γP_{d_1} r^i_{d_2} + … + γ^{n-1} P_{d_1} ⋯ P_{d_{n-1}} r^i_{d_n} + γ^n P^n_z v^i, hence
v^i − v^i_z ≥ γ^n P^n_z v^i − Σ_{t=n}^{∞} γ^t P^t_z r^i_{d_{t+1}}.   (7)

Since ||γ^n P^n_z v^i|| ≤ γ^n ||v^i|| and γ ∈ [0, 1), for any ϵ > 0 we can find a sufficiently large n such that ϵe/2 ≥ γ^n P^n_z v^i ≥ −ϵe/2, where e denotes a vector of 1's. As a result of Assumption 3.4-(1), we have

−Σ_{t=n}^{∞} γ^t P^t_z r^i_{d_{t+1}} ≥ −γ^n M e / (1 − γ).   (9)

B ALGORITHM

B.1 ROBUST MULTI-AGENT Q-LEARNING (RMAQ)

In this section, we prove the convergence of RMAQ under certain conditions. First, let us recall the convergence theorem and the relevant assumptions. Assumption B.1 (Same as Assumption 3.8). (1) State and action pairs have been visited infinitely often. (2) The learning rate α_t satisfies: 0 ≤ α_t < 1, Σ_{t≥0} α_t = ∞, Σ_{t≥0} α_t^2 < ∞, and α_t(s, a, b) = 0 if (s, a, b) ≠ (s_t, a_t, b_t). (3) An NE of the EFG based on (q^1_t, …, q^N_t, −q^1_t, …, −q^N_t) exists at each iteration t. Theorem B.2 (Same as Theorem 3.9). Under Assumption B.1, the sequence {q_t} obtained from (13) converges to {q_*} with probability 1, where q_* are the optimal action-value functions satisfying the Bellman equations (1) for all i = 1, …, N.
q^i_{t+1}(s_t, a_t, b_t) = (1 − α_t) q^i_t(s_t, a_t, b_t) + α_t [ r^i_t(s_t, a_t, b_t) + γ Σ_{a_{t+1}∈A} Σ_{b_{t+1}∈B} π_{q_t*,t}(a_{t+1}|s_{t+1}) ρ_{q_t*,t}(b_{t+1}|s_{t+1}) q^i_t(s_{t+1}, a_{t+1}, b_{t+1}) ].   (13)

Proof. Define the operator T q_t = T(q^1_t, …, q^N_t) = (T^1 q^1_t, …, T^N q^N_t), where the operator T^i is defined by

T^i q^i_t(s, a, b) = r^i_t + γ Σ_{a'∈A} Σ_{b'∈B} π_{q_t*}(a'|s') ρ_{q_t*}(b'|s') q^i_t(s', a', b')

for i ∈ N, where (π_{q_t*}, ρ_{q_t*}) is the tuple of Nash Equilibrium policies of the EFG based on (q^1_t, …, q^N_t, −q^1_t, …, −q^N_t) obtained from (13). By Propositions B.3 and B.4, Lemma 8 in Hu & Wellman (2003) (or Corollary 5 in Szepesvári & Littman (1999)) implies that q_{t+1} = (1 − α_t) q_t + α_t T q_t converges to q_* with probability 1. Proposition B.3 (Contraction mapping). T q_t = (T^1 q^1_t, …, T^N q^N_t) is a contraction mapping. Proof. We omit the subscript t when there is no confusion. Assume T^i p^i ≥ T^i q^i; then

0 ≤ T^i p^i − T^i q^i = γ [ Σ_{a'∈A} Σ_{b'∈B} π_{p*}(a'|s') ρ_{p*}(b'|s') p^i(s', a', b') − Σ_{a'∈A} Σ_{b'∈B} π_{q*}(a'|s') ρ_{q*}(b'|s') q^i(s', a', b') ]
≤ γ [ Σ_{a'∈A} Σ_{b'∈B} π_{q*}(a'|s') ρ_{q*}(b'|s') p^i(s', a', b') − Σ_{a'∈A} Σ_{b'∈B} π_{p*}(a'|s') ρ_{p*}(b'|s') q^i(s', a', b') ]
≤ γ ||p^i − q^i||.

Repeating the argument for the case T^i p^i ≤ T^i q^i implies that T^i is a contraction mapping, i.e. ||T^i p^i − T^i q^i|| ≤ γ||p^i − q^i|| for all p^i, q^i ∈ Q. Recalling that ||p − q|| = sup_j ||p^j − q^j||, we have ||Tp − Tq|| = sup_j ||T^j p^j − T^j q^j|| ≤ γ sup_j ||p^j − q^j|| = γ||p − q||, so T is a contraction mapping satisfying ||Tp − Tq|| ≤ γ||p − q|| for all p, q ∈ Q. Proposition B.4 (A condition of Lemma 8 in Hu & Wellman (2003), also Corollary 5 in Szepesvári & Littman (1999)). q_* = E[T q_*]. Proof.
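A single application of update (13) is a convex combination of the old Q-value and a backed-up target in which the next-state Q-values are averaged under the stage-game equilibrium policies. A minimal tabular sketch (the toy sizes and uniform stage-game policies are our own assumptions; solving the EFG for (π_{q*}, ρ_{q*}) is omitted):

```python
import numpy as np

def rmaq_update(q_i, s, a, b, r, s_next, pi_star, rho_star, alpha=0.1, gamma=0.99):
    """Update (13): q[s, a, b] moves toward r + gamma * E_{a', b'} q[s', a', b'],
    where a' ~ pi_star[s'] and b' ~ rho_star[s'] are distributions over joint
    agent / adversary actions, assumed to come from the stage extensive-form game."""
    target = r + gamma * np.einsum('a,b,ab->',
                                   pi_star[s_next], rho_star[s_next], q_i[s_next])
    q_i[s, a, b] = (1 - alpha) * q_i[s, a, b] + alpha * target
    return q_i

# toy sizes: 2 states, 2 joint agent actions, 2 joint adversary actions
q = np.zeros((2, 2, 2))
pi_star = np.full((2, 2), 0.5)    # uniform stage-game policies, as in the RE above
rho_star = np.full((2, 2), 0.5)
q = rmaq_update(q, s=0, a=1, b=0, r=1.0, s_next=1,
                pi_star=pi_star, rho_star=rho_star)
assert np.isclose(q[0, 1, 0], 0.1)   # (1 - 0.1) * 0 + 0.1 * (1 + 0.99 * 0)
```

In the actual algorithm, pi_star and rho_star are recomputed at every iteration from the current Q-values, which is the step Assumption B.1-(3) guarantees is well defined.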
E[T^i q^i_*(s, a, b)] = E[ r^i + γ Σ_{a'∈A} Σ_{b'∈B} π_*(a'|s') ρ_*(b'|s') q^i_*(s', a', b') ]
= r^i + γ Σ_{s'∈S} p(s'|s, a, b) Σ_{a'∈A} Σ_{b'∈B} π_*(a'|s') ρ_*(b'|s') q^i_*(s', a', b')
= q^i_*(s, a, b).

Therefore q_* = E[T q_*].

B.2 ROBUST MULTI-AGENT ACTOR-CRITIC (RMAAC)

In this section, we first give the detailed proof of the policy gradients in an MG-SPA and then list the pseudo-code of RMAAC.

B.2.1 PROOF OF POLICY GRADIENTS

Recall the policy gradient in RMAAC for an MG-SPA: Theorem B.5 (Policy Gradient in RMAAC for MG-SPA, same as Theorem 3.10). For each agent and adversary i = 1, …, N, the policy gradients of the objective J^i(θ, ω) with respect to the parameters θ, ω are

∇_{θ^i} J^i(θ, ω) = (1/T) Σ_{t=1}^{T} ∇_{a^i} q^i(s_t, a_t, b_t) ∇_{θ^i} π^i(s̃^i_t) |_{a^i_t = π^i(s̃^i_t), b^i_t = ρ^i(s^i_t)},
∇_{ω^i} J^i(θ, ω) = (1/T) Σ_{t=1}^{T} [∇_{b^i} q^i(s_t, a_t, b_t) + reg] ∇_{ω^i} ρ^i(s^i_t) |_{a^i_t = π^i(s̃^i_t), b^i_t = ρ^i(s^i_t)},

where reg = ∇_{b^i} f(s^i_t, b^i_t) ∇_{a^i} q^i(s_t, a_t, b_t) ∇_f π^i(f). Proof.

∇_{θ^i} J^i(θ, ω) = E_{s∼p^{(π,ρ)}} [∇_{θ^i} q^i(s, a, b)] = E_{s∼p^{(π,ρ)}} [∇_{a^i} q^i(s, a, b) ∇_{θ^i} π^i(s̃^i)],
∇_{ω^i} J^i(θ, ω) = E_{s∼p^{(π,ρ)}} [∇_{ω^i} q^i(s, a, b)]
= E_{s∼p^{(π,ρ)}} [∇_{a^i} q^i(s, a, b) ∇_{s̃^i} π^i(s̃^i) ∇_{b^i} f(s^i, b^i) ∇_{ω^i} ρ^i(s^i) + ∇_{b^i} q^i(s, a, b) ∇_{ω^i} ρ^i(s^i)]
= E_{s∼p^{(π,ρ)}} [∇_{ω^i} ρ^i(s^i) (∇_{b^i} q^i(s, a, b) + reg)],

where reg = ∇_{a^i} q^i(s, a, b) ∇_{s̃^i} π^i(s̃^i) ∇_{b^i} f(s^i, b^i). When the actors are updated in a mini-batch fashion (Mnih et al., 2015; Li et al., 2014), (18) and (19) approximate (20) and (21), respectively.

B.2.2 PSEUDO CODE OF RMAAC

We provide the pseudo-code of RMAAC in Algorithm 1.

Algorithm 1: RMAAC
1: Randomly initialize the critic network q^i(s, a, b|η^i), the actor network π^i(s̃^i|θ^i), and the adversary network ρ^i(s^i|ω^i) for each agent i. Initialize the target networks q^{i′}, π^{i′}, ρ^{i′};
2: for each episode do
(s_k, a_k, b_k, r_k, s′_k) from D;
9: Set y^i_k = r^i_k + γ q^{i′}(s′_k, a′_k, b′_k) |_{a^{i′}_k = π^{i′}(s̃^i_k), b^{i′}_k = ρ^{i′}(s^i_k)};
10: Update the critic by minimizing the loss L = (1/K) Σ_k (y^i_k − q^i(s_k, a_k, b_k))²;
11: for each iteration step do
12: Update the actor π^i(·|θ^i) and the adversary ρ^i(·|ω^i) using the following gradients:
13: θ^i ← θ^i + α_a (1/K) Σ_k ∇_{θ^i} π^i(s̃^i_k) ∇_{a^i} q^i(s_k, a_k, b_k), where a^i_k = π^i(s̃^i_k), b^i_k = ρ^i(s^i_k);
14: ω^i ← ω^i − α_b (1/K) Σ_k ∇_{ω^i} ρ^i(s^i_k) [∇_{b^i} q^i(s_k, a_k, b_k) + reg], where reg = ∇_{a^i_k} q^i(s_k, a_k, b_k) ∇_{s̃^i_k} π^i(s̃^i_k), a^i_k = π^i(s̃^i_k), b^i_k = ρ^i(s^i_k);
15: end
16: end
17: Update all target networks: θ^{i′} ← τθ^i + (1 − τ)θ^{i′}, ω^{i′} ← τω^i + (1 − τ)ω^{i′}.
18: end
19: end

C EXPERIMENTS

C.1 ROBUST MULTI-AGENT Q-LEARNING (RMAQ)

In this section, we first further introduce the designed two-player game: the reward function and transition probability function are formally defined, and the MG-SPA based on the two-player game is further explained. We then show more experimental results for the proposed robust multi-agent Q-learning (RMAQ) algorithm, including the training process of the RMAQ algorithm in terms of total discounted rewards, and the comparison of testing total discounted rewards when using different policies against different adversaries. The two players get the same rewards all the time, i.e. they share a reward function r.

C.1.1 TWO-PLAYER GAME

The shared reward function is

r^i(s, a^1, a^2) = 1 if (a^1 = a^2 and s = s_0) or (a^1 ≠ a^2 and s = s_1), and r^i(s, a^1, a^2) = 0 otherwise.   (22)

The state does not change until the two players get a positive reward, so the transition probability function p is

p(s_1 | s, a^1, a^2) = 1 if a^1 = a^2 and 0 otherwise; p(s_0 | s, a^1, a^2) = 1 if a^1 ≠ a^2 and 0 otherwise, for s ∈ {s_0, s_1}.   (23)

Possible Nash Equilibria are π*_1 = (π^1_1, π^2_1) and π*_2 = (π^1_2, π^2_2), where

π^1_1(a^1 | s) = 1 if a^1 = 1 (for both s = s_0 and s = s_1) and 0 otherwise; π^2_1(a^2 | s) = 1 if (a^2 = 1, s = s_0) or (a^2 = 0, s = s_1), and 0 otherwise;   (24)
π^1_2(a^1 | s) = 1 if a^1 = 0 (for both s = s_0 and s = s_1) and 0 otherwise; π^2_2(a^2 | s) = 1 if (a^2 = 0, s = s_0) or (a^2 = 1, s = s_1), and 0 otherwise.   (25)

The discounted rewards of RE agents remain stable, but the rewards of NE agents and baseline agents keep decreasing. This experiment validates the necessity of the RE policy, which is robust not only to worst-case or adversarial state uncertainties but also to perturbations that are damaging yet not worst-case. Figure 9: The RE policy outperforms other policies in terms of total discounted rewards and total accumulated rewards when strong adversaries exist. In real-world applications, we cannot assume that the agents always have accurate state information. Hence, improving the robustness of the policies is very important for MARL, as explained in the introduction of this work.
It is worth noting that our RMAAC policies also work well in environments with random perturbations instead of worst-case perturbations. As shown in Figure 12, our RMAAC policies outperform the baselines in most scenarios when random noise is injected into the state. MAPPO is a multi-agent reinforcement learning algorithm that performs well in cooperative multi-agent settings (Yu et al., 2021a); we use MP to denote MAPPO (https://github.com/marlbenchmark/on-policy). In Figure 13, we compare its performance with our RMAAC algorithm in two cooperative scenarios of MPE. The details of the scenarios, such as Cooperative navigation and Navigate communication, can be found in the last section. Under the optimally perturbed environment, RMAAC outperforms MAPPO in all scenarios. In Figure 14 and Table 6, we compare the episode mean testing rewards and variances of the different algorithms under different environments in the complicated scenario with a large number of agents; we adopt the Gaussian noise format when training the RMAAC policies. Our method has lower variance in two of the three environments and the highest rewards in all of them. As these figures show, most of the time our RMAAC policy outperforms the MARL (MADDPG) and robust MARL (M3DDPG) baseline policies. In Cooperative communication and Predator Prey, under all 4 noise formats, RMAAC policies achieve the highest mean episode rewards. In Cooperative navigation, RMAAC policies have the highest mean episode rewards when the non-optimal Gaussian and Laplace noise formats are used; the only exceptions occur in Cooperative navigation under the Uniform and fixed Gaussian noise formats, where the performance of RMAAC policies is nevertheless close to that of the baseline policies in terms of mean episode testing rewards. In general, our RMAAC algorithm is robust to different types of state-information attacks.
Figure 15: We train RMAAC policies using different values of the constraint parameter in the scenario Cooperative Communication. In general, the smaller the constraint parameter, the higher the mean episode rewards RMAAC achieves. Figure 16: We train RMAAC policies using different values of the constraint parameter in the scenario Cooperative Navigation. In general, the smaller the constraint parameter, the higher the mean episode rewards RMAAC achieves. Figure 17: We train RMAAC policies using different values of the constraint parameter in the scenario Predator-Prey. In general, the smaller the constraint parameter, the higher the mean episode rewards RMAAC achieves. In general, the smaller the variance, the higher the mean episode rewards RMAAC achieves. Our RMAAC algorithm outperforms the baseline algorithms in terms of mean episode testing rewards under all kinds of attacks.



Figure 1: Motivation of considering state uncertainty in RL.

Figure 2: Motivation of considering state uncertainty in MARL.

Figure 5: Comparison of episode mean testing rewards using different algorithms and different perturbation functions in MPE.

a finite tree structure with: 1. a player P1 with action set S̃ = B(ϵ, s) = B(ϵ, s^1) × … × B(ϵ, s^N), with a typical element denoted s̃; P1 moves first; 2. another player P2 with action set A, with a typical element denoted a; P2 moves after P1; 3. a specific vertex indicating the starting point of the game; 4. a payoff function g_s(s̃, a) = (g^1_s …

3: Initialize a random process N for action exploration; 4: Receive the initial state s; 5: for each time step do 6: For each adversary i, select action b^i = ρ^i(s^i) + N w.r.t. the current policy and exploration; compute the perturbed state s̃^i = f(s^i, b^i); execute actions a^i = π^i(s̃^i) + N, observe the reward r = (r^1, …, r^n) and the new state information s′, and store (s, a, b, s̃, r, s′) in the replay buffer D; set s′ → s; 7: for agent i = 1 to n do 8: Sample a random minibatch of K samples

Figure 7: Two-player game: each player has two states and the same action set of size 2. Under state s_0, the two players get the same reward 1 when they choose the same action. Under state s_1, they get the same reward 1 when they choose different actions. One state switches to the other only when the two players get a reward, i.e. the players stay in the current state until they get a reward.

Figure 11: Comparison of episode mean testing rewards using different algorithms and different perturbation functions, under cleaned environments.

Figure 12: Comparison of episode mean testing rewards using different algorithms and different perturbation functions, under randomly perturbed environments.

Figure 13: Comparison of episode mean testing rewards using MAPPO and RMAAC under optimally perturbed environments.

Figure 14: Comparison of episode mean testing rewards using different algorithms under different environments in Predator prey+.

Figure 21: We train RMAAC policies using different values of the variance in the scenario Cooperative Communication. In general, the smaller the variance, the higher the mean episode rewards RMAAC achieves.

Figure 22: We train RMAAC policies using different values of the variance in the scenario Cooperative Navigation. In general, the smaller the variance, the higher the mean episode rewards RMAAC achieves.

Figure 23: We train RMAAC policies using different values of the variance in the scenario Predator-Prey. In general, the smaller the variance, the higher the mean episode rewards RMAAC achieves.

Figure 27: We test the performance of RMAAC(RM), MADDPG(MA), and M3DDPG(M3) policies under the attacks of different noise formats in the scenario Cooperative Communication. RM denotes our robust MARL algorithm, i.e. RMAAC. MA denotes MADDPG, a MARL baseline algorithm. M3 denotes M3DDPG, a robust MARL baseline algorithm. Our RMAAC algorithm outperforms baseline algorithms in terms of mean episode testing rewards under all kinds of attacks.

Figure 28: We test the performance of RMAAC policies under the attacks of different noise formats in the scenario Cooperative Navigation. RM denotes our robust MARL algorithm, i.e. RMAAC. MA denotes MADDPG, a MARL baseline algorithm. M3 denotes M3DDPG, a robust MARL baseline algorithm. Our RMAAC algorithm either outperforms or is close to baseline algorithms in terms of mean episode testing rewards under all kinds of attacks.

Figure 29: We test the performance of RMAAC policies under attacks of different noise formats in the scenario Predator-Prey. RM denotes our robust MARL algorithm, i.e. RMAAC. MA denotes MADDPG, a MARL baseline algorithm. M3 denotes M3DDPG, a robust MARL baseline algorithm. Our RMAAC algorithm outperforms the baseline algorithms in terms of mean episode testing rewards under all kinds of attacks.

Convergence values of total discounted rewards when training ends: v^1(s_0), v^2(s_0), v^1(s_1), v^2(s_1), ṽ^1.

These two formats f_1, f_2 are commonly used in adversarial training (Creswell et al., 2018; Zhang et al., 2020a; 2021). We then test the well-trained policies in the optimally disturbed environment (the injected noise is produced by the adversaries trained with the RMAAC algorithm). The number of testing steps is 10000, and each episode contains 24 steps. All hyperparameters used in the experiments for RMAAC, MADDPG, and M3DDPG are listed in Appendix C.2.2. Note that since the rewards are defined as negative values in the multi-agent environments we use, we add the same baseline (100) to the rewards to make them positive; this makes it easier to observe the testing results and make comparisons. The MPE scenarios used are Cooperative communication (CC),

Variance of testing rewards under cleaned environment Algorithms RM with f 1 RM with f 2

Variance of testing rewards under randomly perturbed environment Algorithms RM with f 1 RM with f 2

Variance of testing rewards under different environments in Predator prey+.
Algorithm                 RM      M3      MA
Optimally Perturbed Env   4.199   4.046   3.924
Randomly Perturbed Env    4.664   5.774   6.191
Cleaned Env               3.928   5.521   6.006
… 4 different noise formats, respectively. MADDPG is a MARL baseline algorithm. M3DDPG is a robust MARL baseline algorithm. The y-axis denotes the mean episode reward of the agents.

Supplementary Material for "Robust Multi-Agent Reinforcement Learning with State Uncertainty"

A THEORY

In this section, we give the full proofs of all propositions and theorems in the theoretical analysis of an MG-SPA. In Section A.1, we construct an extensive-form game (EFG) (Başar & Olsder, 1998; Osborne & Rubinstein, 1994; Von Neumann & Morgenstern, 2007) whose payoff function is related to the value functions of an MG-SPA, and we give conditions under which a Nash Equilibrium of the constructed EFG exists. In Section A.2, we prove Propositions 3.5 and 3.6. In Section A.3, we give the full proof of Theorem 3.7. To make the supplementary material self-contained, we restate the vector notation and assumptions presented in Section 3.2; readers may skip the repeated text and go directly to Section A.1. We follow and extend the vector notation in Puterman (2014). Let V denote the set of bounded real-valued functions on S with component-wise partial order and norm ∥v^i∥ := sup_{s∈S} |v^i(s)|. Let V_M denote the subspace of V of Borel measurable functions. For a discrete state space, all real-valued functions are measurable, so V = V_M; but when S is a continuum, V_M is a proper subset of V. Let v = (v^1, …, v^N) ∈ V, the set of bounded real-valued functions on S × … × S, i.e. the N-fold Cartesian product of the state set, with norm ∥v∥ := sup_j ∥v^j∥. We also define the sets Q and Q in a similar style, such that q^i ∈ Q and q ∈ Q. For discrete S, let |S| denote the number of elements in S. Let r^i denote a |S|-vector whose s-th component r^i(s) is the expected reward of agent i under state s, and let P be the |S| × |S| matrix with (s, s′)-th entry p(s′|s). We refer to r^i_d as the reward vector of agent i and to P_d as the probability transition matrix corresponding to a joint policy d = (π, ρ); r^i_d + γP_d v^i is the expected total one-period discounted reward of agent i obtained using the joint policy d = (π, ρ).
Let z be a list of joint policies {d_1, d_2, ...} with P^0_z = I; we denote the expected total discounted reward of agent i under z accordingly. We now define the following minimax operator, which is used throughout the rest of the paper.

Definition A.1 (Minimax Operator, same as Definition 3.3). For v^i ∈ V and s ∈ S, we define the nonlinear operator L^i. For discrete S and bounded r^i, it follows from Lemma 5.6.1 in Puterman (2014) that L^i v^i ∈ V for all v^i ∈ V; therefore Lv ∈ V for all v ∈ V.

In this paper, we consider the following assumptions on Markov games with state perturbation adversaries.

Assumption A.2 (same as Assumption 3.4).
(1) Bounded rewards.
(2) Finite state and action spaces: all S^i, A^i, B^ĩ are finite.
(3) Stationary transition probability and reward functions.
(4) f is a bijection when s^i is fixed.
(5) All agents share one common reward function.
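The exact form of the minimax operator is given in Definition 3.3 of the main paper; the sketch below shows only the general max-min backup shape on a hypothetical toy problem (the agent maximizes over its action a while the adversary minimizes over b), with made-up rewards and transitions, iterated to its fixed point:

```python
import numpy as np

# Hedged sketch of a minimax backup on a toy 2-state problem, with one agent
# action a and one adversary action b per step (all numbers hypothetical).
nS, nA, nB = 2, 2, 2
gamma = 0.9
rng = np.random.default_rng(0)
r = rng.uniform(0, 1, size=(nS, nA, nB))           # r(s, a, b)
P = rng.dirichlet(np.ones(nS), size=(nS, nA, nB))  # p(s' | s, a, b)

def L(v):
    # (L v)(s) = max_a min_b [ r(s,a,b) + gamma * sum_{s'} p(s'|s,a,b) v(s') ]
    backup = r + gamma * P @ v                     # shape (nS, nA, nB)
    return backup.min(axis=2).max(axis=1)

v = np.zeros(nS)
for _ in range(500):                               # value iteration with L
    v_new = L(v)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
assert np.allclose(L(v), v, atol=1e-8)             # numerical fixed point of L
```

Because the backup is a γ-contraction (shown in Section A.3 below), this iteration converges from any starting point.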

A.1 EXTENSIVE-FORM GAME

An extensive-form game (EFG) (Başar & Olsder, 1998; Osborne & Rubinstein, 1994; Von Neumann & Morgenstern, 2007) basically involves a tree structure with several nodes and branches, providing an explicit description of the order of players and the information available to each player at the time of his decision.

Then we have the corresponding bound for all s ∈ S and ϵ > 0. Letting all d_j be the same, and since ϵ was arbitrary, we obtain the claimed inequality. We then prove the converse: for arbitrary ϵ > 0 there exists a joint policy achieving the stated bound; the equality holds by Theorem 6.1.1 in Puterman (2014). Since ϵ was arbitrary, we conclude that if v^i satisfies the Bellman Equation, then v^i is an optimal value function.

A.3.2 (2) EXISTENCE OF OPTIMAL VALUE FUNCTION

Proof. Propositions 3.5 and 3.6 establish that V is a complete normed linear space and that L is a contraction mapping, so the hypotheses of the Banach Fixed-Point Theorem are satisfied (Smart, 1980). Therefore there exists a unique solution v* ∈ V to Lv = v. From (1), we know that if v* satisfies the Bellman Equation, it is an optimal value function. Therefore, the existence of the optimal value function is proved.
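The contraction property that feeds Banach's theorem can be checked numerically on a small instance: a max-min backup of discounted expectations shrinks sup-norm distances by a factor of at most γ. A sketch with hypothetical rewards and transitions:

```python
import numpy as np

# Numerical spot-check (hypothetical instance) that a minimax Bellman backup
# is a gamma-contraction in the sup norm, the hypothesis Banach's theorem needs.
nS, nA, nB, gamma = 3, 2, 2, 0.9
rng = np.random.default_rng(1)
r = rng.uniform(0, 1, size=(nS, nA, nB))
P = rng.dirichlet(np.ones(nS), size=(nS, nA, nB))

def L(v):
    return (r + gamma * P @ v).min(axis=2).max(axis=1)

for _ in range(100):
    v, w = rng.normal(size=nS), rng.normal(size=nS)
    # ||Lv - Lw||_inf <= gamma * ||v - w||_inf
    assert np.max(np.abs(L(v) - L(w))) <= gamma * np.max(np.abs(v - w)) + 1e-12
```

A finite check is of course no substitute for the proof; it only illustrates the inequality the proof establishes for all v, w ∈ V.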

A.3.3 (3) ROBUST EQUILIBRIUM AND OPTIMAL VALUE FUNCTION

Proof. (i) Robust Equilibrium → optimal value function: the value function under the Robust Equilibrium is the optimal value function. (ii) Optimal value function → Robust Equilibrium: suppose v_{d*} is the optimal value function, i.e., Lv_{d*} = v_{d*}.

A.3.4 (4) EXISTENCE OF ROBUST EQUILIBRIUM

Proof. From (2), we know that there exists a solution v* ∈ V to the Bellman Equation Lv = v. Now, we consider an EFG based on (v^1, ...).

The NE π*_1 means player 1 always selects action 1, while player 2 selects action 1 in state s_0 and action 0 in state s_1. The NE π*_2 means player 1 always selects action 0, while player 2 selects action 0 in state s_0 and action 0 in state s_1.

According to the definition of an MG-SPA, we add two adversaries, one for each player, to perturb the players' observations, and each adversary receives the negative of its player's reward. We let the adversaries share the same action space B^1 = B^2 = {0, 1}, where 0 means do not disturb and 1 means change the observation to the opposite state; the perturbation function f in this MG-SPA is defined accordingly. Obviously, f is a bijective function when s^i is given. The constraint parameter is ϵ = ∥S∥, where ∥S∥ := max_{s,s'∈S} |s - s'|, i.e., there is no constraint on the adversaries' power.

In Figure 8, we show the total discounted reward as a function of training episodes. We set the learning rate to 0.1 and train our RMAQ algorithm for 400 episodes, each containing 25 training steps. We can see that the total discounted reward converges to 50, i.e., the optimal value in the MG-SPA, after about 280 episodes or 7000 steps.

Figure 8: The total discounted reward converges to the optimal value after about 280 training episodes.
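The two-state perturbation function described above is simple enough to write out explicitly: with observation flip as the only attack, b = 0 leaves the observation untouched and b = 1 maps it to the opposite state.

```python
# Sketch of the two-state perturbation function from the toy game above:
# adversary action b = 0 means "do not disturb", b = 1 flips the observation.
S = [0, 1]  # the two states s_0 and s_1

def f(s_i, b):
    # For each fixed b, f is a bijection in s_i: identity (b=0) or the flip (b=1).
    return s_i if b == 0 else 1 - s_i

assert f(0, 0) == 0 and f(1, 0) == 1   # no perturbation
assert f(0, 1) == 1 and f(1, 1) == 0   # flipped observation
for b in (0, 1):                        # bijectivity check required by Assumption A.2(4)
    assert sorted(f(s, b) for s in S) == S
```

This is exactly the bijectivity-in-s^i condition of Assumption A.2(4) instantiated on a two-element state set.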

C.1.3 TESTING COMPARISON

We further test the well-trained RE policy when 'strong' adversaries exist, where a 'strong' adversary is one whose probability of modifying the agents' observations is larger than the probability of leaving the state information unperturbed. We make the two agents play the game using 3 different policies for 1000 steps under different adversaries, and calculate the accumulated and total discounted rewards. We use the Robust Equilibrium (of the MG-SPA), the Nash Equilibrium (of the original game), and a baseline policy, and report the results in Figure 9. The vertical axis is the accumulated/discounted reward, and the horizontal axis is the probability that the adversary attacks/perturbs the state; the two adversaries share the same policy. We can see that as this probability increases, the accumulated and
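The testing protocol above can be sketched in a few lines. Everything here other than the attack-probability mechanism is a hypothetical stand-in (a fabricated policy, reward, and transition), meant only to show the shape of the evaluation loop, not the paper's actual game:

```python
import numpy as np

# Hedged sketch of the testing protocol: at each step the adversary flips the
# observation with probability p (a 'strong' adversary has p > 0.5). The
# policy, reward, and transition used here are hypothetical stand-ins.
rng = np.random.default_rng(0)

def evaluate(policy, attack_prob, steps=1000, gamma=0.99):
    s, total, disc = 0, 0.0, 0.0
    for t in range(steps):
        obs = 1 - s if rng.random() < attack_prob else s  # perturbed observation
        a = policy(obs)
        r = 1.0 if a == s else 0.0      # stand-in reward: act to match true state
        total += r
        disc += (gamma ** t) * r
        s = int(rng.integers(0, 2))     # stand-in transition
    return total, disc

naive = lambda obs: obs                 # hypothetical placeholder policy
total, disc = evaluate(naive, attack_prob=0.8)
assert 0.0 <= disc <= total <= 1000.0
```

Sweeping `attack_prob` over a grid and plotting `total` and `disc` per policy reproduces the format of Figure 9.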

C.2 ROBUST MULTI-AGENT ACTOR-CRITIC (RMAAC)

In this section, we first briefly introduce the multi-agent environments we use in our experiments. Then we provide more experimental results and explanations, such as the testing results under a cleaned environment (where accurate state information can be obtained) and a randomly perturbed environment (where standard Gaussian noise is injected into agents' observations). In the last subsection, we list all hyper-parameters used in the experiments, as well as the baseline source code.

Cooperative navigation (CN): This is a cooperative game with 3 agents and 3 landmarks. Agents are rewarded based on how far any agent is from each landmark, and are penalized if they collide with other agents. Hence, agents have to learn to cover all the landmarks while avoiding collisions.

Physical deception (PD): This is a mixed cooperative and competitive task with 2 collaborative agents, 2 landmarks, and 1 adversary. Both the collaborative agents and the adversary want to reach the target, but only the collaborative agents know which landmark is the correct target. The collaborative agents should learn a policy that covers all landmarks so that the adversary cannot tell which one is the true target.

Keep away (KA): This is a competitive task with 1 agent, 1 adversary, and 1 landmark. The agent knows the position of the target landmark and wants to reach it. The adversary is rewarded for being close to the landmark and for keeping the agent far from it, so it should learn to push the agent away from the landmark.

Predator prey (PP): This is a mixed game known as predator-prey. Prey agents (green) are faster and want to avoid being hit by adversaries (red). Predators are slower and want to hit the prey agents. Obstacles (large black circles) block the way.

Navigate communication (NC): This is a cooperative game similar to Cooperative communication, with 2 agents and 3 landmarks of different colors. One agent is the 'speaker' that does not move but observes the goal of the other agent.
The other agent is the 'listener' that cannot speak but must navigate to the correct landmark.

Predator prey+ (PP+): This is an extension of the Predator prey environment with more agents: 2 prey, 6 adversaries, and 4 landmarks. Prey agents are faster and want to avoid being hit by adversaries. Predators are slower and want to hit the prey agents. Obstacles block the way.

C.2.2 EXPERIMENTS HYPER-PARAMETERS

In Table 3, we show all hyper-parameters used to train our policies and the baselines. We also provide our source code in the supplementary material. The source code of M3DDPG (Li et al., 2019) and MADDPG (Lowe et al., 2017) is released under the MIT License, which allows any person obtaining it to deal in the code without restriction, including without limitation the rights to use, copy, modify, etc.; for more information, see https://github.com/openai/maddpg and https://github.com/dadadidodi/m3ddpg.

In this subsection, we provide the testing results under a cleaned environment (accurate state information can be attained) and a randomly perturbed environment (standard Gaussian noise is injected into agents' observations). In Figure 11, we show the comparison of mean episode testing rewards under a cleaned environment using 4 different methods: RM1 denotes our RMAAC policy trained with the linear noise format f_1, RM2 denotes our RMAAC policy trained with the Gaussian noise format f_2, MA denotes MADDPG (https://github.com/openai/maddpg), and M3 denotes M3DDPG (https://github.com/dadadidodi/m3ddpg). Only in the Predator prey scenario does our method outperform the others under a cleaned environment. In Figure 12, our method outperforms the others in the Cooperative communication, Keep away, and Predator prey scenarios, and achieves performance similar to the others in the Cooperative navigation scenario under a randomly perturbed environment. In Tables 4 and 5, we also report the variances of the testing rewards in different scenarios under different environment settings; our method has lower variance in three of the five scenarios.

A similar phenomenon occurs in robust optimization (Beyer & Sendhoff, 2007; Boyd & Vandenberghe, 2004) and distributionally robust optimization (Delage & Ye, 2010; Rahimian & Mehrotra, 2019): the robust solution outperforms non-robust solutions in the worst-case scenario. Similarly, for single-agent RL with state perturbations, robust policies perform better than baselines under state perturbations (Zhang et al., 2020b), but robust solutions may perform relatively poorly compared with non-robust solutions when there is no uncertainty or perturbation in the environment, even in a single-agent RL problem (Zhang et al., 2020b). Improving the robustness of the trained policy may sacrifice performance when the perturbations or uncertainties do not occur; that is why our RMAAC policies only beat all baselines in one scenario when the state uncertainty is eliminated. However, for many real-world systems we

C.2.4 TRAINING RESULTS USING LINEAR NOISE WITH DIFFERENT CONSTRAINT PARAMETERS

Training Setup: In this subsection, we train several RMAAC policies using the linear noise format as the state perturbation function, i.e., f_1(s^i, b^ĩ) = s^i + b^ĩ. The constraint parameter ϵ is set to 0.01, 0.05, 0.1, 0.5, 1, and 2, respectively, with the other hyper-parameters unchanged; the hyper-parameters used can be found in Table 3.

Training Results: In Figures 15, 16, and 17, we show the training process in three scenarios: Cooperative communication (CC), Cooperative navigation (CN), and Predator prey (PP), respectively. The y-axis denotes the mean episode reward of the agents and the x-axis denotes the training episodes. From these figures we can see that, in general, the smaller the constraint parameter, the higher the mean episode reward RMAAC achieves. However, RMAAC has different sensitivities to the constraint parameter in different scenarios. With ϵ = 2, the RMAAC policies have the lowest mean episode rewards in all three scenarios. Nevertheless, with the smallest constraint parameter ϵ = 0.01, the trained RMAAC policies do not achieve the highest mean episode rewards in all three scenarios. In these three scenarios, the performance of RMAAC with ϵ = 0.5 is better than or similar to that with ϵ = 1, and better than that with ϵ = 2, i.e., Performance(ϵ = 0.5) ≥ Performance(ϵ = 1) > Performance(ϵ = 2). The performance of the RMAAC policies is close when the constraint parameter is less than or equal to 0.1.
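The linear noise format with its ϵ constraint can be sketched concretely. How the constraint is enforced is not specified above, so the L2-ball projection below is our assumption, shown only to illustrate the role of ϵ:

```python
import numpy as np

# Sketch of the linear noise format f_1(s^i, b^i~) = s^i + b^i~, with the
# adversary's perturbation kept inside an epsilon-ball. The projection rule
# (scaling onto the L2 ball) is an assumption for illustration.
def f1(s, b, eps):
    norm = np.linalg.norm(b)
    if norm > eps:
        b = b * (eps / norm)   # project b onto the L2 ball of radius eps
    return s + b

s = np.array([0.3, -1.2])
b = np.array([3.0, 4.0])       # norm 5, exceeds eps and gets scaled down
out = f1(s, b, eps=0.5)
assert np.linalg.norm(out - s) <= 0.5 + 1e-9
```

Larger ϵ gives the adversary more power, which matches the trend above that training rewards degrade as ϵ grows toward 2.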

C.2.5 TESTING RESULTS USING LINEAR NOISE WITH DIFFERENT CONSTRAINT PARAMETERS

In this subsection, we test well-trained RMAAC policies in perturbed environments where adversaries adopt the linear noise format with different constraint parameters.

Testing Setup: The tested policy π_test is trained with the linear noise format f_1(s^i, b^ĩ) = s^i + b^ĩ and constraint parameter 0.5, where b^ĩ = ρ^ĩ_test(s^i | ϵ = 0.5). The policy ρ^ĩ_test is adversary ĩ's policy, trained together with π_test in RMAAC, for all ĩ = 1, ..., Ñ. We use ρ_test to denote the joint policy of the adversaries used in testing. In summary, we test the agents' joint policy π^scenario_test(s) when the adversaries adopt the joint policy ρ^scenario_test(s | ϵ), with ϵ = 0.01, 0.05, 0.1, 0.5, 1, 2, and scenario = Cooperative communication (CC), Cooperative navigation (CN), Predator prey (PP), respectively. The testing is conducted over 400 episodes, where each episode has 25 time-steps.

Testing Results: In Figures 18, 19, and 20, we compare the performance of RMAAC, M3DDPG, and MADDPG in the CC, CN, and PP scenarios under different values of the constraint parameter. MADDPG is a MARL baseline algorithm; M3DDPG is a robust MARL baseline algorithm. The y-axis denotes the mean episode reward of the agents. From these figures, we can see that in all three scenarios, our RMAAC policies outperform the baseline MARL and robust MARL policies in terms of mean episode testing reward under attacks of the linear noise format with different constraint parameters ϵ. Our proposed RMAAC algorithm is thus robust to state information attacks of the linear noise format with different constraint parameters.

C.2.6 TRAINING RESULTS USING GAUSSIAN NOISE WITH DIFFERENT VARIANCE

Training Setup: In this subsection, we train several RMAAC policies using the Gaussian noise format as the state perturbation function, i.e., f_2(s^i, b^ĩ) = s^i + N(b^ĩ, σ). The variance σ is set to 0.001, 0.05, 0.1, 0.5, 1, 2, and 3, respectively, with the other hyper-parameters unchanged; the hyper-parameters used can be found in Table 3.

Training Results: In Figures 21, 22, and 23, we show the training process of RMAAC in three scenarios: Cooperative communication (CC), Cooperative navigation (CN), and Predator prey (PP). The y-axis denotes the mean episode reward of the agents and the x-axis denotes the training episodes. From the figures we can see that, in general, the smaller the variance, the higher the mean episode reward RMAAC achieves. However, RMAAC has different sensitivities to the value of the variance in different scenarios. With σ = 3, the RMAAC policies have the lowest mean episode rewards in all three scenarios. Nevertheless, with the smallest magnitude 0.001, the trained RMAAC policies do not always achieve the highest mean episode rewards. In these three scenarios, the performance of RMAAC with σ = 1 is better than or close to that with σ = 2, and better than that with σ = 3, i.e., Performance(σ = 1) ≥ Performance(σ = 2) > Performance(σ = 3). The performance of the RMAAC policies is close when the variance is less than or equal to 0.5.
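The Gaussian noise format can likewise be sketched: the adversary chooses the mean b^ĩ of the injected noise, while σ controls its spread. A minimal sketch with hypothetical values:

```python
import numpy as np

# Sketch of the Gaussian noise format f_2(s^i, b^i~) = s^i + N(b^i~, sigma):
# the adversary picks the mean b of the injected Gaussian noise.
rng = np.random.default_rng(0)

def f2(s, b, sigma):
    return s + rng.normal(loc=b, scale=sigma)

s = np.zeros(4)
b = np.full(4, 0.5)
samples = np.stack([f2(s, b, sigma=1.0) for _ in range(20000)])
# The perturbed observation is centered at s + b on average.
assert np.allclose(samples.mean(axis=0), s + b, atol=0.05)
```

Unlike the linear format, the realized perturbation here is stochastic even for a fixed adversary action, which is why training is sensitive to σ as reported above.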

C.2.7 TESTING RESULTS USING GAUSSIAN NOISE WITH DIFFERENT VARIANCE

In this subsection, we test well-trained RMAAC policies in perturbed environments where adversaries adopt the Gaussian noise format with different variances.

Testing Setup: The tested policy π_test is trained with the Gaussian noise format f_2(s^i, b^ĩ) = s^i + N(b^ĩ, σ = 1) and constraint parameter 0.5, where b^ĩ = ρ^ĩ_test(s^i | ϵ = 0.5). The policy ρ^ĩ_test is adversary ĩ's policy, trained together with π_test in RMAAC, for all ĩ = 1, ..., Ñ. We use ρ_test to denote the joint policy of the adversaries. In summary, we test the agents' joint policy π^scenario_test(s) when the adversaries adopt the corresponding joint policy with different variances.

Testing Results: From these figures, we can see that in all three scenarios, with all different values of the variance, our RMAAC policies outperform the MARL and robust MARL baseline policies in terms of mean episode reward under attacks of the Gaussian noise format with different variances. Our proposed RMAAC algorithm is thus robust to state information attacks of the Gaussian noise format with different values of the variance.

C.2.8 TESTING RESULTS UNDER DIFFERENT STATE PERTURBATION FUNCTIONS

In this subsection, we test the well-trained RMAAC policies in perturbed environments where adversaries adopt different noise formats and policies.

Testing Setup: The tested agents' joint policy π_test is trained with the Gaussian noise format f_2(s^i, b^ĩ) = s^i + N(b^ĩ, σ = 1) and constraint parameter 0.5, where b^ĩ = ρ^ĩ_test(s^i | ϵ = 0.5). The policy ρ^ĩ_test is adversary ĩ's policy, trained together with π_test in RMAAC, for all ĩ = 1, ..., Ñ. We use ρ_test to denote the joint policy of the adversaries. In summary, we test the agents' joint policy π^scenario_test(s) when the adversaries adopt the joint policy ρ^scenario_test(s | ϵ = 0.5), in the three scenarios scenario = Cooperative communication (CC), Cooperative navigation (CN), Predator prey (PP), under the non-optimal Gaussian format f_3, the uniform noise format f_4, the fixed Gaussian noise format f_5, and the Laplace noise format f_6, respectively. These noise formats are defined in the following, where ρ^ĩ_non-optimal is a non-optimal policy of adversary ĩ, randomly chosen from the training process. Note that f_3 and f_5 are independent of the optimal joint policy of the adversaries, but f_4 and f_6 are not. The testing is conducted over 400 episodes, where each episode has 25 time-steps.
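The exact definitions of f_3 through f_6 are given in the paper's equations (elided above); the sketch below is only one plausible reading, under our assumptions, of the four test-time noise families: each injects a different noise distribution into the observation, with the center b taken either from a (possibly non-optimal) adversary policy or fixed.

```python
import numpy as np

# Hypothetical reading of the four test-time noise formats. The exact
# definitions in the paper may differ; these illustrate the distinction
# between policy-dependent and policy-independent perturbations.
rng = np.random.default_rng(0)

def f3(s, b_nonoptimal, sigma=1.0):   # Gaussian noise around a non-optimal b
    return s + rng.normal(b_nonoptimal, sigma)

def f4(s, b, eps=0.5):                # uniform noise centered at the adversary's b
    return s + rng.uniform(b - eps, b + eps)

def f5(s, sigma=1.0):                 # fixed Gaussian noise, no adversary input
    return s + rng.normal(0.0, sigma, size=np.shape(s))

def f6(s, b, scale=0.5):              # Laplace noise centered at the adversary's b
    return s + rng.laplace(b, scale)

s = np.zeros(3)
b = np.ones(3)
for out in (f3(s, b), f4(s, b), f5(s), f6(s, b)):
    assert out.shape == s.shape
```

Under this reading, f_3 and f_5 never query the optimal adversary policy, while f_4 and f_6 are centered at its output, matching the independence remark above.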

