TOWARDS GLOBAL OPTIMALITY IN COOPERATIVE MARL WITH SEQUENTIAL TRANSFORMATION

Abstract

Policy learning in multi-agent reinforcement learning (MARL) is challenging because the joint state-action space grows exponentially with the number of agents. To achieve higher scalability, the paradigm of centralized training with decentralized execution (CTDE) is broadly adopted with factorized structure in MARL. However, we observe that existing CTDE algorithms in cooperative MARL cannot achieve optimality even in simple matrix games. To understand this phenomenon, we analyze two mainstream classes of CTDE algorithms: actor-critic algorithms and value-decomposition algorithms. Our theoretical and experimental results characterize the weakness of these two classes of algorithms when the optimization method is taken into consideration, which indicates that the currently used manner of centralized training is poorly compatible with decentralized policies. To address this issue, we present a transformation framework that reformulates a multi-agent MDP as a special "single-agent" MDP with a sequential structure, which allows off-the-shelf single-agent reinforcement learning (SARL) algorithms to efficiently learn the corresponding multi-agent tasks. A decentralized policy can then still be learned by distilling the "single-agent" policy. This framework carries the optimality guarantee of SARL algorithms over to cooperative MARL. To instantiate this transformation framework, we propose a Transformed PPO, called T-PPO, which can theoretically perform optimal policy learning in finite multi-agent MDPs and significantly outperforms baselines on a large set of cooperative multi-agent tasks.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) is a promising approach to a variety of real-world applications, such as sensor networks (Zhang & Lesser, 2011), traffic light control (Van der Pol & Oliehoek, 2016), and multi-robot formation (Alonso-Mora et al., 2017). However, "the curse of dimensionality" is one major challenge in cooperative MARL, since the joint state-action space grows exponentially with the number of agents. To achieve higher scalability, the paradigm of centralized training with decentralized execution (CTDE) (Kraemer & Banerjee, 2016a) is widely used, which allows agents to learn their local policies in a centralized way while retaining the ability of decentralized execution. Recently, many CTDE algorithms have been proposed. For value-based methods, the joint Q value is factorized as a function of the individual Q values of agents (hence the name value-decomposition algorithms), and then standard TD-learning is applied. To enable scalability and decentralized execution, it is critical to ensure that the joint greedy action can be computed by selecting local greedy actions through individual Q functions, which is formalized as the Individual-Global-Max (IGM) principle (Son et al., 2019). Based on this IGM property, a series of factorized multi-agent Q-learning methods have been developed, including but not limited to VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), and QPLEX (Wang et al., 2021b). For multi-agent actor-critic methods, the joint policy is often factorized into the direct product of individual policies, each of which is learned through policy gradient updates. For example, COMA (Foerster et al., 2018) and DOP (Wang et al., 2021c) focus on critic design for effective credit assignment, MADDPG (Lowe et al., 2017) studies the setting with parameterized deterministic policies, and MAPPO (Yu et al., 2021) applies PPO to multi-agent settings with parameter sharing.
Despite their promising performance on benchmark tasks, these CTDE methods do not have a global optimality guarantee in general cooperative settings and may fail to achieve optimality even in simple matrix games (Section 3). This may seem surprising, since prior theoretical work (Wang et al., 2021a) proves that algorithms such as QPLEX (Wang et al., 2021b) converge to the global optimum, which appears to contradict our experimental results. To understand this phenomenon, we provide a theoretical analysis of both actor-critic algorithms and value-decomposition algorithms. It shows that when the optimization method is taken into consideration, which prior analyses did not do, neither actor-critic algorithms nor value-decomposition algorithms can escape local optima, which are widespread in multi-agent tasks (Section 3). To address this suboptimality issue, we present a transformation framework that reformulates a multi-agent MDP as a special "single-agent" MDP with a sequential decision-making structure among agents. With this transformation, any off-the-shelf single-agent reinforcement learning (SARL) method can be adopted to efficiently learn coordination policies in cooperative multi-agent tasks by solving the transformed single-agent tasks, with its global optimality guarantee retained. To enable decentralized execution, a decentralized policy is learned at the same time by distilling the "single-agent" policy. As an instantiation of this transformation framework, we propose a Transformed PPO (T-PPO), which uses an attention mechanism (Vaswani et al., 2017), can theoretically perform optimal policy learning in finite multi-agent MDPs, and empirically outperforms baselines by a significant margin on a large set of cooperative multi-agent tasks, including SMAC (Samvelyan et al., 2019) and GRF (Kurach et al., 2019).

2.1. RL MODELS

In single-agent RL (SARL), an agent interacts with a Markov Decision Process (MDP) to maximize its cumulative reward (Sutton & Barto, 2018). An MDP is defined as a tuple $(S, A, r, P, \gamma, s_0)$, where $S$ and $A$ denote the state space and action space. At each time step $t$, the agent observes the state $s_t$ and chooses an action $a_t \in A$, where $a_t \sim \pi(s_t)$ depends on $s_t$ and its policy $\pi$. The agent then receives an instant reward $r_t = r(s_t, a_t)$ and transitions to the next state $s_{t+1} \sim P(\cdot|s_t, a_t)$; $\gamma$ is the discount factor. The goal of an SARL agent is to find a policy $\pi$ that maximizes the expected cumulative reward, i.e.,
$$J(\pi) = \mathbb{E}_{s_{t+1} \sim P(\cdot|s_t, \pi(s_t))}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t))\right].$$
In MARL, we model a fully cooperative multi-agent task as a Dec-POMDP (Oliehoek et al., 2016) defined by a tuple $\langle N, S, A, P, \Omega, O, r, \gamma \rangle$, where $N$ is a set of $n$ agents, $S$ is the global state space, $A$ is the action space, and $\gamma$ is a discount factor. At each time step, agent $i \in N$ has access to the observation $o_i \in \Omega$, drawn from the observation function $O(s, i)$. Each agent has an action-observation history $\tau_i \in \Omega \times (A \times \Omega)^*$ and constructs its individual policy $\pi_i(a_i|\tau_i)$. With each agent $i$ selecting an action $a_i \in A$, the joint action $\boldsymbol{a} \equiv [a_i]_{i=1}^{n}$ leads to a shared reward $r = r(s, \boldsymbol{a})$ and the next state $s'$ according to the transition distribution $P(s'|s, \boldsymbol{a})$. The formal objective of MARL agents is to find a joint policy $\boldsymbol{\pi} = \langle \pi_1, \ldots, \pi_n \rangle$ conditioned on the joint trajectories $\boldsymbol{\tau} \equiv [\tau_i]_{i=1}^{n}$ that maximizes the joint value function
$$V^{\boldsymbol{\pi}}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s, \boldsymbol{\pi}\right].$$
Another quantity in policy search is the joint action-value function $Q^{\boldsymbol{\pi}}(s, \boldsymbol{a}) = r(s, \boldsymbol{a}) + \gamma \mathbb{E}_{s'}[V^{\boldsymbol{\pi}}(s')]$. To simplify our analysis, we adopt the framework of Multi-agent MDPs (MMDP) (Boutilier, 1996), a special case of Dec-POMDP, to model cooperative multi-agent decision-making tasks with full observations.
An MMDP is defined as a tuple $\langle N, S, A, P, r, \gamma \rangle$, where $N$, $S$, $A$, $P$, $r$, and $\gamma$ are the same agent set, state space, action space, transition function, reward function, and discount factor as in the Dec-POMDP, respectively. Because of full observability, the current state $s$ is observable to each agent at each time step. For each agent $i$, an individual policy $\pi_i(a|s)$ represents a distribution over actions conditioned on the state $s$. Agents aim to find a joint policy $\boldsymbol{\pi} = \langle \pi_1, \ldots, \pi_n \rangle$ that maximizes the joint value function $V^{\boldsymbol{\pi}}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \boldsymbol{\pi}\right]$.

2.2. POLICY FACTORIZATION AND CENTRALIZED TRAINING WITH DECENTRALIZED EXECUTION

In MARL, due to partial observability and communication constraints, a decentralized policy is required during execution, i.e., the joint execution policy $\boldsymbol{\pi}^{test}$ can be decomposed into a product of individual execution policies $[\pi_i^{test}]_{i=1}^{n}$, called policy factorization:
$$\forall \boldsymbol{\tau}: \quad \boldsymbol{\pi}^{test}(\boldsymbol{a}|\boldsymbol{\tau}) = \prod_{i=1}^{n} \pi_i^{test}(a_i|\tau_i). \tag{1}$$
To learn $\boldsymbol{\pi}^{test}$ effectively, centralized training with decentralized execution (CTDE) has become a popular paradigm of cooperative MARL (Oliehoek et al., 2008; Kraemer & Banerjee, 2016b). In CTDE, agents are trained in a centralized manner and are granted access to other agents' information or the global state during the centralized training process. Denote the joint policy learned during training as $\boldsymbol{\pi}^{train}$. Note that during centralized training, the constraint of policy factorization (see Eq. (1)) is not necessary for $\boldsymbol{\pi}^{train}$, and the agents can use the joint policy $\boldsymbol{\pi}^{train}$ to interact with the environment for collecting training data. However, most popular CTDE MARL algorithms (Foerster et al., 2018; Lowe et al., 2017; Wang et al., 2021c; Yu et al., 2021; de Witt et al., 2020; Sunehag et al., 2018; Rashid et al., 2018; Son et al., 2019; Wang et al., 2021b) encode the policy factorization defined in Eq. (1) into the training joint policy $\boldsymbol{\pi}^{train}$. These algorithms can be divided into two categories: actor-critic and value-decomposition. For multi-agent actor-critic algorithms (Foerster et al., 2018; Lowe et al., 2017; Wang et al., 2021c; Yu et al., 2021; de Witt et al., 2020), a joint (stochastic or deterministic) policy is represented as a product of individual policies, i.e., $\boldsymbol{\pi}(\boldsymbol{a}|\boldsymbol{\tau}) = \prod_{i=1}^{n} \pi_i(a_i|\tau_i)$, which corresponds to the policy factorization defined in Eq. (1). An estimate of the multi-agent policy gradient (Kuba et al., 2021) is then calculated through the critic to update the policy.
For value-decomposition algorithms (Sunehag et al., 2018; Rashid et al., 2018; Wang et al., 2021b), the joint Q value is decomposed as a function of local Q values with some parameter $\lambda \in \Lambda$ (Fu et al., 2022):
$$Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a}) = f_{mix}\left(Q_1(\tau_1, \cdot), \ldots, Q_n(\tau_n, \cdot), \boldsymbol{\tau}, \boldsymbol{a}; \lambda\right). \tag{2}$$
Standard TD-learning is then applied, and the IGM (Individual-Global-Max) principle (Son et al., 2019) is enforced to realize effective TD-learning. IGM asserts the consistency between joint and local greedy action selections in the joint action-value $Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})$ and the local action-values $[Q_i(\tau_i, a_i)]_{i=1}^{n}$, respectively:
$$\forall \boldsymbol{\tau}: \quad \arg\max_{\boldsymbol{a} \in A^n} Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a}) = \left(\arg\max_{a_1 \in A} Q_1(\tau_1, a_1), \ldots, \arg\max_{a_n \in A} Q_n(\tau_n, a_n)\right). \tag{3}$$
Denote the greedy joint policy as $\boldsymbol{\pi}(\boldsymbol{a}|\boldsymbol{\tau}) = \arg\max_{\boldsymbol{a} \in A^n} Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})$ and the greedy individual policies as $\pi_i(a_i|\tau_i) = \arg\max_{a_i \in A} Q_i(\tau_i, a_i)$. Under IGM we have $\boldsymbol{\pi}(\boldsymbol{a}|\boldsymbol{\tau}) = \prod_{i=1}^{n} \pi_i(a_i|\tau_i)$, which is exactly the policy factorization defined in Eq. (1). Although various value factorizations (Wang et al., 2021a; Fu et al., 2022) are widely studied in the literature, to the best of our knowledge, this paper is the first to study the effect of CTDE from the perspective of optimal policy learning with optimization methods.
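As a concrete illustration of the IGM condition in Eq. (3), the following minimal sketch checks it on small tabular examples. The helper name `satisfies_igm`, the numeric values, and the "row/column mean" local utilities are illustrative assumptions and are not taken from any of the cited algorithms; the additive mixing mimics the VDN form $Q_{tot} = Q_1 + Q_2$, and all maximizers are assumed to be unique.

```python
import numpy as np

def satisfies_igm(q_tot, q_locals):
    """Return True if the joint greedy action equals the tuple of local greedy actions.

    q_tot: array with one axis per agent, q_tot[a_1, ..., a_n].
    q_locals: list of n one-dimensional arrays, q_locals[i][a_i].
    Assumes every argmax is unique (no ties).
    """
    joint_greedy = np.unravel_index(int(np.argmax(q_tot)), q_tot.shape)
    local_greedy = tuple(int(np.argmax(q)) for q in q_locals)
    return tuple(int(a) for a in joint_greedy) == local_greedy

# VDN-style additive mixing: Q_tot(a_1, a_2) = Q_1(a_1) + Q_2(a_2).
q1 = np.array([1.0, 3.0, 2.0])
q2 = np.array([0.5, 0.1, 0.9])
vdn_q_tot = q1[:, None] + q2[None, :]

# A joint payoff table paired with naive local utilities (row and column means).
payoff = np.array([[10.0, -20.0, -20.0],
                   [-20.0, 5.0, 0.0],
                   [-20.0, 0.0, 1.0]])
row_means = payoff.mean(axis=1)
col_means = payoff.mean(axis=0)

print(satisfies_igm(vdn_q_tot, [q1, q2]))              # additive mixing: IGM holds
print(satisfies_igm(payoff, [row_means, col_means]))   # naive projection: IGM fails
```

The second call fails because the joint maximizer is (0, 0) while both averaged local utilities prefer action 1, which is the kind of mismatch that IGM-compliant mixing functions are designed to rule out.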

3. SUBOPTIMALITY OF CURRENT CTDE ALGORITHMS: INHERENT LOCAL-MINIMA STRUCTURE IN LOSS FUNCTION

In this section, we formally analyze the suboptimality of multi-agent actor-critic and value-decomposition algorithms when the optimization method is taken into consideration. In short, the manner of centralized training adopted by multi-agent actor-critic and value-decomposition methods creates inherent local-minima structures in their loss functions, which makes gradient-descent methods lose their optimality guarantee in general. The main results of this section are summarized as Theorem 3.1 and Theorem 3.2. These two theorems establish the existence of local minima in the loss functions of both actor-critic and value-decomposition algorithms. For multi-agent actor-critic algorithms, any Nash equilibrium of policies always constitutes a stationary point of the actors' loss function, regardless of the parameterization of the policy networks (e.g., with parameter sharing (Cobbe et al., 2020) or without) and of the credit assignment used in the gradient calculation (Foerster et al., 2018). This implies that the multi-agent policy gradient can reach zero even when the current policies are not optimal. Consequently, multi-agent actor-critic algorithms lose their optimality guarantee when gradient descent-based optimization methods are used. For value-decomposition algorithms, we will prove that the loss function of any value-decomposition method can have exponentially many local optima as long as the Q-function class is complete for the IGM condition. In the research on value decomposition, a series of works (Sunehag et al., 2018; Rashid et al., 2018; Wang et al., 2021b) have made great efforts to enrich the Q-function class and make it complete, since an incomplete function class can cause TD-learning to diverge (Wang et al., 2021a). However, our result shows that the richness of the Q-function class and the smoothness of the loss function cannot both be retained under the IGM condition, which causes gradient descent-based optimization methods to fail in this case.
Actor-Critic algorithms. On the one hand, for multi-agent actor-critic algorithms on MMDP, the actor represents a decentralized stochastic policy $\boldsymbol{\pi}_{jt}(\boldsymbol{a}|s; \theta) = \prod_{i=1}^{n} \pi_i(a_i|s; \theta)$. The actor loss is an antiderivative of the policy gradient, so the policy is updated via the multi-agent policy gradient (Leonardos et al., 2022). We can prove the following theorem:
Theorem 3.1. For multi-agent actor-critic algorithms, any Nash equilibrium of policies is a stationary point of the actor loss. Moreover, there exists a family of single-step MMDPs such that the actor loss function contains $\Omega(|A|)$ different local minima for deterministic policies and infinitely many local minima for stochastic policies.
This theorem suggests that multi-agent actor-critic algorithms create inherent local minima in their actor loss functions and may get stuck in any Nash equilibrium of policies, which can be arbitrarily worse than the optimal policy (Table 1). The proof is presented in Appendix A.1.

Table 1: Matrix game with multiple Nash equilibria: 2 players (one selects a row and one selects a column) coordinate to select an entry of the matrix representing the joint payoff. In this case, (0, 0), (1, 1), and (2, 2) are three different Nash equilibria with payoffs 10, 5, and 1, respectively. Only (0, 0) is the globally optimal policy. However, by Theorem 3.1, policies (1, 1) and (2, 2) are two local minima of the actor loss regardless of how the local policies are parameterized.
$$\begin{pmatrix} 10 & -20 & -20 \\ -20 & 5 & 0 \\ -20 & 0 & 1 \end{pmatrix}$$

Value-decomposition algorithms. On the other hand, for a value-decomposition algorithm, the joint Q-value function is decomposed as in Eq. (2) and updated by the standard TD-loss. We are able to prove the following theorem:
Theorem 3.2. There exists a family of MMDPs such that, for any value-decomposition algorithm with a complete Q-function class satisfying the IGM condition (Eq. (3)), the TD-loss function contains $\Omega(|A|^{|S|})$ different local optima.
The term "complete Q-function class" means that any function in $\mathbb{R}^{|S| \times |A|^n}$ is decomposable by the function $f_{mix}$, as in QPLEX (Wang et al., 2021b). This theorem suggests that a value-decomposition method satisfying the IGM condition either has an incomplete function class (e.g., VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018)) or has a loss function containing exponentially many local optima (e.g., QPLEX, whose loss contains infinitely many local optima by Proposition A.1, which is worse than what is claimed in Theorem 3.2). In both cases the optimality guarantee is lost when a gradient-based optimization method is adopted. The proof is presented in Appendix A.1.
To give some experimental indication of our theorems, we introduce a multi-task matrix game, which is a simple 1-step game with 2 players and 10 matrices (see Appendix B.1.1 for details). We report the summed reward over all matrices for our approach and the baselines in Figure 1. The baseline algorithms get stuck in locally optimal solutions, which is consistent with the theorems discussed above. With its theoretical guarantee of global optimality, our method (see Section 4.2) converges to 100 immediately. The value-based methods VDN, QMIX, and QPLEX converge to a similar locally optimal point at the end of training. Benefiting from the duplex dueling network architecture, QPLEX has stronger representational ability than VDN and QMIX. Nevertheless, such a complex structure creates inherent local optima in its loss function (Theorem 3.2, Proposition A.1).
QTRAN (Son et al., 2019) achieves convincing performance in some simple matrix games with a carefully designed regularizer that handles IGM. However, its discontinuous loss function still prevents it from reaching the globally optimal solution in this case (Appendix A.1). The state-of-the-art multi-agent actor-critic algorithms HAPPO (Kuba et al., 2021) and MAPPO get stuck in a Nash equilibrium and cannot guarantee global optimality (Theorem 3.1). As a result, our approach outperforms all baselines in our multi-task matrix game.
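The behavior described by Theorem 3.1 can be reproduced on the Table 1 game with a few lines of NumPy. The sketch below is an illustration under simplifying assumptions rather than any of the cited algorithms: it enumerates the pure Nash equilibria of the payoff matrix and then runs exact (noise-free) gradient ascent on $J = p^\top M q$ with two independent softmax policies initialized uniformly. The function names, learning rate, and step count are arbitrary choices made for this example.

```python
import numpy as np

PAYOFF = np.array([[10.0, -20.0, -20.0],
                   [-20.0, 5.0, 0.0],
                   [-20.0, 0.0, 1.0]])

def pure_nash_equilibria(payoff):
    """Pure joint actions from which no single player gains by deviating alone."""
    equilibria = []
    for i in range(payoff.shape[0]):
        for j in range(payoff.shape[1]):
            row_is_best = payoff[i, j] >= payoff[:, j].max()
            col_is_best = payoff[i, j] >= payoff[i, :].max()
            if row_is_best and col_is_best:
                equilibria.append((i, j))
    return equilibria

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def independent_policy_gradient(payoff, steps=5000, lr=0.1):
    """Exact gradient ascent on J = p^T M q with separate softmax policies."""
    theta_row = np.zeros(payoff.shape[0])  # uniform initial policies
    theta_col = np.zeros(payoff.shape[1])
    for _ in range(steps):
        p, q = softmax(theta_row), softmax(theta_col)
        value = p @ payoff @ q
        # dJ/dtheta_i = p_i * (expected payoff of action i - J)
        grad_row = p * (payoff @ q - value)
        grad_col = q * (payoff.T @ p - value)
        theta_row += lr * grad_row
        theta_col += lr * grad_col
    p, q = softmax(theta_row), softmax(theta_col)
    return p, q, float(p @ payoff @ q)

print(pure_nash_equilibria(PAYOFF))
p, q, value = independent_policy_gradient(PAYOFF)
print(int(p.argmax()), int(q.argmax()), value)
```

From the uniform initialization, each player's expected payoff is highest for action 1 (the row and column means are about -10, -5, and -6.3), so the factorized policies drift toward the suboptimal equilibrium (1, 1) with payoff close to 5 instead of the global optimum (0, 0) with payoff 10.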

4. A SEQUENTIAL TRANSFORMATION FRAMEWORK FOR MARL PROBLEMS

To achieve global optimality in cooperative MARL with the CTDE paradigm, we design a new manner of centralized training that fits decentralized policies better under gradient descent-based optimization. Inspired by a series of prior works on sequence modeling (Bertsekas, 2019; Angermueller et al., 2020; Jain et al., 2022; Olivecrona et al., 2017; You et al., 2018), we find that breaking joint learning into sequential decision-making directly transfers properties of single-agent algorithms to multi-agent ones. To formalize such a "reduction" of algorithms from the perspective of optimality guarantees, we propose a transformation framework that allows us to employ any single-agent RL (SARL) algorithm to efficiently learn the corresponding multi-agent tasks while keeping the optimality guarantee of the SARL algorithm (if it has one). From the theoretical perspective, we also find that this transformation framework keeps the same minimax sample complexity (Appendix A.4). However, to maintain sample efficiency under this framework, one main challenge is to deal with the long horizon and sparse reward (Appendix A.4), both of which are long-standing challenges in the design of empirical reinforcement learning algorithms (Arjona-Medina et al., 2018; Zheng et al., 2018; Gangwani et al., 2020). To tackle this challenge and instantiate our transformation framework, we propose a Transformed PPO, called T-PPO, which adopts the attention mechanism (Vaswani et al., 2017) in its network design and combines the strong empirical performance of PPO (Schulman et al., 2017b) with a theoretical optimality guarantee in finite MMDPs. In this way, our implementation of T-PPO also bridges the gap between theoretical analysis and empirical performance.

4.1. SEQUENTIAL TRANSFORMATION

In the training phase of CTDE, all $n$ agents coordinate to improve their joint policy. In particular, when a joint action needs to be inferred, they coordinate to infer $\boldsymbol{a} = (a_1, \cdots, a_n)$ jointly. Suppose we give these agents a virtual order and let them infer their individual actions one after another; that is, in the centralized training phase, when agent $i$ infers its action $a_i$, all "previously inferred" actions $a_j$ ($j < i$) are known to $i$. As we shall see, this procedure of multi-agent decision-making is equivalent to a single-agent one, since all agents are homogeneous in the centralized training phase. From a theoretical perspective, the discussion above in essence describes a transformation from an MMDP into an MDP.
Definition 4.1 (Sequential Transformation $\Gamma$, informal). Given an MMDP $M$ with $n$ agents, its sequential transformation is an MDP $\Gamma(M)$ such that (1) at every time step $t$ with $t \bmod n = 1$, the agent infers the action $a^{(\tilde{t})}_1$ and transitions from a state $s$ to a virtual state $(s, a^{(\tilde{t})}_1)$; (2) at every time step $t$ with $t \bmod n = i > 1$, it infers an action $a^{(\tilde{t})}_i$ and transitions from the virtual state $(s, a^{(\tilde{t})}_{<i})$ to $(s, a^{(\tilde{t})}_{\le i})$; (3) at every time step $t$ with $t \bmod n = 0$, it infers an action $a^{(\tilde{t})}_n$ and transitions from the virtual state $(s, a^{(\tilde{t})}_{<n})$ to a state $s'$ according to the dynamics of the MMDP $M$, and at the same time it receives a reward from $M$. Here $\tilde{t} = \lceil t/n \rceil$ is the corresponding time step on $M$.
We present the formal definition in Definition A.1 for completeness. We can then prove that the MMDP $M$ and the transformed MDP $\Gamma(M)$ are equivalent in terms of policy value; this result is presented in the appendix (Theorem A.1) due to lack of space. The theorem also provides the conversion between a policy on $\Gamma(M)$ and a joint policy on $M$.
This transformation allows us to take any SARL algorithm $\mathcal{A}$, wrap the interface of the multi-agent environment so that $\mathcal{A}$ runs as if it had access to the interactive environment of $\Gamma(M)$, and finally convert the policy learned by $\mathcal{A}$ into a joint policy on $M$. The sequential framework that converts a SARL algorithm $\mathcal{A}$ into a MARL algorithm T-$\mathcal{A}$ is formally described in pseudo-code in the appendix (Algorithm 1). Finally, we can state the main theorem of this framework.
Theorem 4.1 (The transformation framework keeps the optimality). Using the sequential framework (Algorithm 1), if the SARL algorithm $\mathcal{A}$ is guaranteed to obtain an optimal policy on an MDP, then the MARL algorithm T-$\mathcal{A}$ is guaranteed to obtain an optimal policy on an MMDP.
Moreover, thanks to the theoretical analysis of the global optimality of PPO with neural networks (Liu et al., 2019), we are able to claim that a suitable implementation of T-PPO attains global optimality under the same mild assumptions as in Liu et al. (2019). The proofs of both results are presented in Section A.3, and more analysis of the complexity of the transformed model is presented in Section A.4.
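The environment wrapping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the class name `SequentialMDP`, the `(next_state, reward, done)` step signature, and the use of exhaustive backward induction as the stand-in "single-agent solver" are all hypothetical choices, discounting is omitted, and the underlying MMDP is the one-step Table 1 matrix game. The solver enumerates all action sequences and therefore only terminates on episodic tasks.

```python
import numpy as np

PAYOFF = np.array([[10.0, -20.0, -20.0],
                   [-20.0, 5.0, 0.0],
                   [-20.0, 0.0, 1.0]])

def matrix_game_step(state, joint_action):
    """Hypothetical one-step MMDP: returns (next_state, reward, done)."""
    return state, float(PAYOFF[joint_action]), True

class SequentialMDP:
    """Single-agent view of an n-agent MMDP: one individual action per step."""

    def __init__(self, n_agents, n_actions, mmdp_step, initial_state):
        self.n_agents = n_agents
        self.n_actions = n_actions
        self.mmdp_step = mmdp_step
        self.initial_state = initial_state

    def reset(self):
        # Virtual state = (MMDP state, actions already chosen in this round).
        return (self.initial_state, ())

    def step(self, virtual_state, action):
        state, chosen = virtual_state
        chosen = chosen + (action,)
        if len(chosen) < self.n_agents:
            # Intermediate step: extend the action prefix, no reward yet.
            return (state, chosen), 0.0, False
        # The last agent has acted: apply the joint action to the MMDP.
        next_state, reward, done = self.mmdp_step(state, chosen)
        return (next_state, ()), reward, done

def solve_by_backward_induction(mdp, virtual_state):
    """Exhaustive search over action sequences; returns (value, plan)."""
    best_value, best_plan = -np.inf, ()
    for action in range(mdp.n_actions):
        nxt, reward, done = mdp.step(virtual_state, action)
        if done:
            value, plan = reward, (action,)
        else:
            tail_value, tail_plan = solve_by_backward_induction(mdp, nxt)
            value, plan = reward + tail_value, (action,) + tail_plan
        if value > best_value:
            best_value, best_plan = value, plan
    return best_value, best_plan

mdp = SequentialMDP(n_agents=2, n_actions=3,
                    mmdp_step=matrix_game_step, initial_state=0)
value, plan = solve_by_backward_induction(mdp, mdp.reset())
print(value, plan)
```

Because the second decision is made with the first agent's action in the virtual state, an optimal single-agent solution of the transformed problem selects the joint action (0, 0) with payoff 10, whereas the factorized learners of Section 3 can stall at (1, 1).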

4.2. T-PPO: EXTENSION TO DEC-POMDP

In practice, many MARL benchmarks are partially observable. To apply our algorithms to partially observable environments, we need to extend them to Dec-POMDPs. To instantiate our transformation framework, we propose a Transformed PPO (T-PPO) based on PPO (Schulman et al., 2017a). The actor-critic structure is shown in Figure 2. Following the sequential transformation framework discussed in Section 4.1, we feed previous agents' actions into each agent's actor and critic modules. However, the amount of this sequentially transformed information grows with the number of agents. To achieve scalability, we equip each agent's actor and critic with a multi-head attention (MHA) module. Intuitively, we believe that considering too much information from previous agents' actions is harmful to learning, since in most cases a single agent does not need that much information to make a decision. We therefore add regularization terms to the agents' actor and critic modules to encourage each agent to extract only the critical information. As shown by the yellow modules in the actor, the policy $\pi^{(t)}_{i,T}$ used for training combines two parts: one part takes previous agents' actions as inputs to the MHA module, and the other part takes only the individual trajectory as input ($\pi^{(t)}_{i,main}$). For regularization, we add the KL divergence between $\pi^{(t)}_{i,main}$ and $\pi^{(t)}_{i,T}$ to reduce the influence of previous agents' actions. A similar structure is implemented in the critic, as shown on the right side of Figure 2, with an L1 norm for regularization. To enable decentralized execution, we further distill the single-agent policy $\pi^{(t)}_{i,T}$ into a decentralized policy $\pi^{(t)}_{i,E}$ with behavior cloning by optimizing the cross-entropy loss independently for each agent, which is equivalent to minimizing the KL divergence between the joint policy and the joint decentralized policy (see Appendix B.2.3 for details).
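The claimed equivalence between per-agent cross-entropy and the joint KL divergence follows from a standard identity, sketched below. The notation $\boldsymbol{\pi}_T$ for the joint training policy and the assumption that the decentralized policies $\pi_{i,E}$ do not share parameters are simplifications made for this sketch; the paper's own argument is in Appendix B.2.3.

```latex
\begin{aligned}
\mathrm{KL}\Big(\boldsymbol{\pi}_T(\cdot \mid \boldsymbol{\tau}) \,\Big\|\, \prod_{i=1}^{n} \pi_{i,E}(\cdot \mid \tau_i)\Big)
&= \mathbb{E}_{\boldsymbol{a} \sim \boldsymbol{\pi}_T}\Big[\log \boldsymbol{\pi}_T(\boldsymbol{a} \mid \boldsymbol{\tau})
   - \sum_{i=1}^{n} \log \pi_{i,E}(a_i \mid \tau_i)\Big] \\
&= -H\big(\boldsymbol{\pi}_T(\cdot \mid \boldsymbol{\tau})\big)
   + \sum_{i=1}^{n} \underbrace{\mathbb{E}_{\boldsymbol{a} \sim \boldsymbol{\pi}_T}\big[-\log \pi_{i,E}(a_i \mid \tau_i)\big]}_{\text{cross-entropy loss of agent } i}.
\end{aligned}
```

The entropy term does not depend on the decentralized policies, so minimizing the joint KL divergence over $\pi_{1,E}, \ldots, \pi_{n,E}$ reduces to minimizing each agent's cross-entropy term separately on actions sampled from the training policy.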
Here we share the GRU and the representation layer, whose inputs do not contain other agents' information.
We first use a multi-task matrix game to demonstrate the global optimality of our single-agent policy compared to multi-agent value-based methods (VDN, QMIX, QTRAN, QPLEX) and policy-based methods (MAPPO, HAPPO (Kuba et al., 2021)). This part has been discussed in Section 3. Then, we use challenging tasks from the StarCraft II micromanagement (SMAC) benchmark (Samvelyan et al., 2019) and the Google Research Football (GRF) benchmark (Kurach et al., 2019) to further demonstrate the advantage of our approach. In each environment, we show the average and variance of the performance of our method and the baselines, tested with three random seeds (seeds 0, 1, 2). For all baselines, we use the code provided by the authors with the same hyperparameters as in the original papers. Here we compare our approach with policy-based baselines on four super hard maps (MMM2, 3s5z_vs_3s6z, 6h_vs_8z, corridor), one easy map (1c3s5z), and one custom map (3h_vs_1b1z3h) based on the SMAC benchmark.
[Figure 2: Actor (left) and critic (right) structure of T-PPO. Each module encodes the agent's own input with an MLP and a GRU, attends over previous agents' actions with a multi-head attention (MHA) module, and is regularized with a KL term (actor) or an L1 norm (critic); the training policy is distilled into the evaluation policy.]

5.1. STARCRAFT II

We show the learning curves on StarCraft II in Figure 3. Our "single-agent" policy outperforms the baselines on five out of six maps and performs similarly to MAPPO on 6h_vs_8z. Super hard maps are typically hard-exploration tasks. However, benefiting from our approach's global optimality guarantee for MMDPs, T-PPO can exploit better with the same exploration strategy as MAPPO (based on the entropy of the learned policies). We highlight the map 1c3s5z, where MAPPO converges to a locally optimal point but our approach achieves global optimality with a nearly 100% winning rate. This phenomenon once again demonstrates the advantage of our sequential transformation framework. HAPPO can also achieve a nearly 100% winning rate on 1c3s5z but fails on the other maps. We believe this is again due to the local optimality of the Nash equilibrium learned by HAPPO. Meanwhile, unlike in our approach and MAPPO, agents cannot share parameters in HAPPO, which significantly affects training efficiency in complex tasks. We show the mean and standard deviation of the winning rates of the final policies trained with different random seeds in Table 2. We now further analyze local optimality in SMAC. Compared with the matrix game, the StarCraft II tasks are more complex, with a high-dimensional state space. To verify whether our approach can drive agents out of locally optimal points as in the matrix games, we create a new StarCraft II map named 3h_vs_1b1z3h. In 3h_vs_1b1z3h, we control three Hydralisks, while our opponent controls three Hydralisks, one low-damage Zergling, and one Baneling with high area damage. The explosion of the Baneling can only be avoided if all agents focus fire on it instead of on the nearer Zergling. However, focusing fire on the Zergling can be a suboptimal equilibrium, in which no agent has an incentive to change its policy. Previous research has demonstrated that super hard maps in SMAC require more exploration (Wang et al., 2020; Li et al., 2021a).
However, we obtain different experimental results on maps that contain locally optimal points, such as our custom map 3h_vs_1b1z3h. In PPO, exploration is encouraged by the entropy term in its loss function. As shown in Figure 5, the performance of T-PPO and MAPPO responds differently on different maps when the related hyperparameter is tuned. On 6h_vs_8z, increasing the entropy weight improves learning efficiency for both algorithms, which is in line with our expectations. On 3h_vs_1b1z3h, increasing the entropy weight still improves the performance of T-PPO but makes the performance of MAPPO worse. We believe this phenomenon is related to the local optimality discussed above. MAPPO has no mechanism that drives agents to escape locally optimal points, which leads to low exploitation efficiency. Thanks to the global optimality property discussed in Section 4.1, our approach achieves better exploitation under the same exploration strategy.
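For readers unfamiliar with the entropy weight being tuned here, the following sketch shows the standard clipped PPO surrogate with an entropy bonus. It is a generic NumPy illustration of the well-known PPO objective, not the T-PPO or MAPPO implementation; the function name, argument names, and default coefficients are assumptions made for this example.

```python
import numpy as np

def ppo_actor_loss(logp_new, logp_old, advantages, entropy,
                   clip_eps=0.2, entropy_coef=0.01):
    """Clipped PPO surrogate with an entropy bonus, returned as a loss to minimize.

    logp_new, logp_old: log-probabilities of the sampled actions under the
        current and the data-collecting policy.
    advantages: advantage estimates for the sampled actions.
    entropy: per-sample entropy of the current policy.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = np.minimum(unclipped, clipped).mean()
    # A larger entropy_coef rewards more stochastic (more exploratory) policies.
    return -(surrogate + entropy_coef * entropy.mean())

logp = np.log(np.array([0.5, 0.25, 0.25]))
adv = np.array([1.0, -1.0, 2.0])
ent = np.array([0.5, 0.5, 0.5])
print(ppo_actor_loss(logp, logp, adv, ent, entropy_coef=0.01))
print(ppo_actor_loss(logp, logp, adv, ent, entropy_coef=0.1))
```

Raising `entropy_coef` lowers the loss of high-entropy policies, which is the knob referred to as the entropy weight in the experiments above.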

5.2. GOOGLE RESEARCH FOOTBALL

In this section, we test our approach against policy-based baselines on another MARL benchmark, Google Research Football (GRF). In the environment setting, we use sparse rewards with both SCORING and CHECKPOINT for our approach and all baselines. For observations, we follow Li et al. (2021a) and use the simple 115-dimensional vector as the observation while removing the information irrelevant to the current scenario. Meanwhile, we use the relative position of each agent instead of absolute coordinates to make the environment more realistic. As shown in Figure 6, our approach clearly outperforms the baselines and achieves a high winning rate, while the baselines learn almost nothing in academy_pass_and_shoot_with_keeper and academy_counterattack_hard. In GRF scenarios, agents must coordinate timing and positions to organize an offense and seize fleeting opportunities. Cooperation between agents is difficult to achieve because the agents' crucial movements are sparse. Our sequential transformation framework provides an optimal solution to the MMDP by forcing agents to consider the information from previous ones, which promotes the coordination needed for sophisticated cooperation. T-PPO-Distillation performs similarly to T-PPO, which indicates that our algorithm can also be executed in environments that require decentralized execution.

6. CONCLUSION

In this paper, we study state-of-the-art cooperative multi-agent reinforcement learning methods and observe that, even with full expressiveness, they may fail to converge to an optimal solution in simple matrix games. To analyze this phenomenon, we generalize these MARL methods with a general model and find that their factorized policy structure combined with the gradient descent optimization is one of the major causes of their suboptimality. To solve this issue, we propose a novel sequential transformation framework that allows employing off-the-shelf single-agent reinforcement learning methods to solve cooperative multi-agent tasks and retain their global optimality guarantee. Based on this framework, we develop T-PPO that extends single-agent PPO to multi-agent settings and significantly outperforms baselines on various benchmark tasks. It is an interesting future direction to extend efficient value-based SARL methods to multi-agent settings through our transformation framework.

A. APPENDIX

A.1. SUBOPTIMALITY OF EXISTING CTDE ALGORITHMS

For multi-agent actor-critic algorithms, we recall Theorem 3.1:
Theorem 3.1. For multi-agent actor-critic algorithms, any Nash equilibrium of policies is a stationary point of the actor loss. Moreover, there exists a family of single-step MMDPs such that the actor loss function contains $\Omega(|A|)$ different local minima for deterministic policies and infinitely many local minima for stochastic policies.
Proof. Suppose $(\pi_{\theta_1}, \cdots, \pi_{\theta_n})$ is a Nash equilibrium (NE). Then, by the definition of NE, we have
$$\forall i = 1, \cdots, n,\ \forall \theta_i': \quad J(\pi_{\theta_1}, \cdots, \pi_{\theta_n}) \ge J(\pi_{\theta_1}, \cdots, \pi_{\theta_{i-1}}, \pi_{\theta_i'}, \pi_{\theta_{i+1}}, \cdots, \pi_{\theta_n}).$$
Denoting the actor loss function by $C$, this is equivalent to
$$\forall i = 1, \cdots, n,\ \forall \theta_i': \quad C(\theta_1, \cdots, \theta_n) \le C(\theta_1, \cdots, \theta_{i-1}, \theta_i', \theta_{i+1}, \cdots, \theta_n).$$
Suppose $\exists i: \frac{\partial C}{\partial \theta_i} \ne 0$. Let $l = (\underbrace{0, \cdots, 0}_{i-1}, \frac{\partial C}{\partial \theta_i}, \underbrace{0, \cdots, 0}_{n-i})$. We have
$$\lim_{\delta \to 0} \frac{C((\theta_1, \cdots, \theta_n) + \delta l) - C(\theta_1, \cdots, \theta_n)}{\delta \|l\|_2} = \|l\|_2 \ne 0.$$
Choosing a $\delta$ of sufficiently small magnitude and appropriate sign then contradicts the definition of NE. Thus we have $\forall i: \frac{\partial C}{\partial \theta_i} = 0$. By applying the chain rule,
$$\frac{\partial C}{\partial \Theta} = \sum_{i=1}^{n} \frac{\partial C}{\partial \theta_i} \frac{\partial \theta_i}{\partial \Theta} = 0,$$
we also cover the case where parameter sharing is used, which completes the proof of the first part of the theorem.
For the second part of the theorem, we first construct a 2-agent matrix game. The payoff matrix $M$ of the matrix game is
$$M = \begin{pmatrix} |A| & -K & \cdots & -K \\ -K & |A| - 1 & \cdots & -K \\ \vdots & \vdots & \ddots & \vdots \\ -K & -K & \cdots & 1 \end{pmatrix},$$
where $K > 0$ is a positive constant. In this case, any entry of the diagonal is a local minimum of the actor loss. Let $p, q$ both be the $i$-th one-hot probability vector for some $i = 1, \cdots, |A|$. The payoff of the joint policy $(p, q)$ is $J(p, q) = p^\top M q$. Perturb the joint policy slightly such that $p_i', q_i' \in (1 - \epsilon, 1 - \epsilon/2)$. Then, denoting $\Delta p = p - p'$ and $\Delta q = q - q'$, the perturbed payoff is
$$p'^\top M q' = (p - \Delta p)^\top M (q - \Delta q) \le J(p, q) - O(\epsilon) + O(\epsilon^2) < J(p, q)$$
for sufficiently small $\epsilon$.
This means $(p, q)$ is a local minimum. This case is easy to extend to general MMDPs.

Under review as a conference paper at ICLR 2023

For value-decomposition algorithms, we recall Theorem 3.2. Before proving the theorem, we first prove a stronger version for a special case (QPLEX).

Proposition A.1. Assuming the neural network is a universal approximator, for any MMDP with no cycle, there are infinitely many local optima of the TD-loss function of QPLEX.

Proof. We first expand the original formula of QPLEX (Wang et al., 2021b):

$$Q(s, a) = V_{tot}(s) + A_{tot}(s, a) = \sum_i V_i(s) + \sum_i \lambda_i(s, a) A_i(s, a_i) = \sum_i \big(w_i(s) V_i(s) + b_i(s)\big) + \sum_i \lambda_i(s, a) A_i(s, a_i),$$

where $w_i : S \to \mathbb{R}$, $b_i : S \to \mathbb{R}$, $Q_i : S \times A^n \to \mathbb{R}$ are neural networks, $V_i(s) = \max Q_i(s, \cdot)$, and $A_i(s, a_i) = Q_i(s, a_i) - V_i(s)$. By assuming the neural network is a universal approximator, and with a slight abuse of notation, the original formula in QPLEX can be rewritten as

$$Q(s, a) = b(s) + \sum_i \lambda_i(s, a) A_i(s, a_i),$$

where $b, \lambda_i, Q_i$ are parameterized universal approximators, and $A_i(s, a_i) = Q_i(s, a_i) - \max Q_i(s, \cdot)$ is the individual advantage function. We explicitly specify the parameters used by each approximator: $b(s; \psi)$, $\lambda_i(s, a; \phi_i)$, $Q_i(s, a_i; \theta_i)$. Denoting by $\Theta = (\theta_1, \dots, \theta_n, \phi_1, \dots, \phi_n, \psi)$ all parameters being used, the TD-loss function is

$$L_{TD}(\Theta) = \frac{1}{2} \mathbb{E}_{(s, a) \sim D} \big[Q(s, a; \Theta) - (\mathcal{T} Q)(s, a)\big]^2$$

for some distribution $D$ fully supported on $S \times A^n$, where $\mathcal{T}$ is the Bellman operator. The choice of $D$ does not matter much here as long as it is fully supported. We further assume $D$ to be the uniform distribution. For non-uniform $D$ the proof is essentially similar, and we leave it to the reader.

Fix any non-degenerate $\theta_1, \dots, \theta_n$, which means that $\forall i, s$, the greedy action of $Q_i(s, \cdot)$ is unique. Then the greedy joint action does not change within a sufficiently small neighborhood of $\Theta$, due to the continuity of the function (footnote 0). Therefore, we can fix $a^*(s) = \arg\max Q(s, \cdot)$ for each state $s$ ($a^*(s)$ is abbreviated as $a^*$ when there is no ambiguity). Denote by $T(s, a) = (\mathcal{T} Q)(s, a)$ the current target in the TD-loss.

We first consider the case where the MMDP is a one-step game (i.e., $\gamma = 0$). Then $T$ is a constant tensor independent of the current $Q$, and TD-learning becomes a supervised learning task. We optimize the TD-loss by minimizing $\sum_a [Q(s, a; \Theta) - T(s, a)]^2$ for each $s$. For any state $s$, denote

$$L = \{a \ne a^* : T(s, a) \le T(s, a^*)\}, \qquad G = \{a : T(s, a) > T(s, a^*)\} \cup \{a^*\}.$$

We have

$$\sum_a [Q(s, a; \Theta) - T(s, a)]^2 = \sum_{a \in L} [Q(s, a; \Theta) - T(s, a)]^2 + \sum_{a \in G} [Q(s, a; \Theta) - T(s, a)]^2.$$

Keeping $\theta_1, \dots, \theta_n$ fixed, find $\psi, \phi_1, \dots, \phi_n$ such that

$$b(s; \psi) = \frac{1}{|G|} \sum_a \max\big(T(s, a) - T(s, a^*), 0\big) + T(s, a^*),$$

$$\forall i : \ \lambda_i(s, a; \phi_i) = \min\big(T(s, a) - T(s, a^*), 0\big) \Big/ \sum_i A_i(s, a_i; \theta_i).$$

Plugging these formulas into the definition of the mixing network, we have

$$\forall a \in L : \ Q(s, a; \Theta) = T(s, a), \qquad \forall a \in G : \ Q(s, a; \Theta) = \frac{1}{|G|} \sum_{a' \in G} T(s, a').$$

It is easy to verify that

$$\sum_{a \in L} [Q(s, a; \Theta) - T(s, a)]^2 = 0 \quad \text{and} \quad \sum_{a \in G} [Q(s, a; \Theta) - T(s, a)]^2 = \min_{v \in \mathbb{R}} \sum_{a \in G} [v - T(s, a)]^2.$$

These further show that

$$L_{TD}(\Theta) = \min_{Q \in \mathbb{R}^{|S| \times |A|^n} \,:\, \forall s :\, a^*(s) \in \arg\max Q(s, \cdot)} [Q(s, a) - T(s, a)]^2,$$

which means $\Theta$ is a global minimum conditioned on the greedy joint policy staying unchanged. Recall that there is a small neighborhood of $\Theta$ in which the greedy joint policy stays unchanged. This finishes the proof of the local optimality of $\Theta$ in $L_{TD}$.

This argument can be easily extended to acyclic MMDPs. We only need to calculate $\psi$ in reverse topological order, so that all formulas above are well-defined.

Now we are able to prove Theorem 3.2.

Theorem 3.2. There exists a family of MMDPs such that, for any value-decomposition algorithm with a complete Q-function class satisfying the IGM condition (Eq. (3)), the TD-loss function contains $\Omega\big(|A|^{|S|}\big)$ different local optima.

Proof. Recall Eq.
(2), the mixing network of any value-decomposition algorithm on an MMDP has the following form:

$$Q(s, a; \Theta) = f_{mix}(Q_1(s, \cdot), \dots, Q_n(s, \cdot), s, a; \Theta).$$

Following the proof of Proposition A.1, it is essential to find a series of $\Theta^{(k)}$ such that (1) $\Theta^{(k)}$ is globally optimal (w.r.t. the TD-loss function) conditioned on the greedy joint policy staying unchanged; (2) the greedy joint policy is unchanged in a small neighborhood of $\Theta^{(k)}$.

Now we construct the MMDP as follows:

a. let $\gamma = 0$;

b. let

$$r(s, a) = \begin{cases} i, & a_1 = a_2 = \dots = a_n = i \\ 0, & \text{otherwise} \end{cases}$$

for all $s \in S$.

Since $\gamma = 0$, the transition probability does not really matter. Let $\Pi = \{\pi : S \to A : \pi_1 = \pi_2 = \dots = \pi_n\}$ be the set of policies with non-zero value on each state. Note that $|\Pi| = |A|^{|S|}$. If we can show that there is a $\Theta^{(\pi)}$ for each $\pi \in \Pi$ satisfying (1) and (2), then we finish our proof.

Fix any $\pi \in \Pi$. We construct $\Theta^{(\pi)}$ as follows. First, let

$$f_m(s, a) = \begin{cases} \frac{a_1 + |A|}{2}, & a = \pi(s) \\ \frac{\pi(s)_1 + |A|}{2} - \frac{1}{m}, & a_1 = a_2 = \dots = a_n > \pi(s)_1 \\ a_1, & a_1 = a_2 = \dots = a_n < \pi(s)_1 \\ 0, & \text{otherwise} \end{cases}$$

for $m \in \mathbb{N}$. By the completeness of the Q-function class, we are able to find some $\Theta_m$ such that $Q(s, a; \Theta_m) = f_m(s, a)$. By the uniqueness of the greedy policy of $f_m$, we have $\forall a_i \in A : Q_i(s, \pi(s); \Theta_m) > Q_i(s, a_i; \Theta_m)$ for $i = 1, \dots, n$. By the Bolzano-Weierstrass theorem (footnote 1), we can find a convergent subsequence $\{\Theta_{m_k}\}_{k=1}^{\infty}$. Taking the limit, we have $\Theta_{m_k} \to \Theta^{(\pi)}$. By continuity (footnote 2), we have

$$Q(s, a; \Theta^{(\pi)}) = \begin{cases} \frac{\pi(s)_1 + |A|}{2}, & a_1 = a_2 = \dots = a_n \ge \pi(s)_1 \\ a_1, & a_1 = a_2 = \dots = a_n < \pi(s)_1 \\ 0, & \text{otherwise,} \end{cases}$$

and $\forall a_i \in A : Q_i(s, \pi(s); \Theta^{(\pi)}) \ge Q_i(s, a_i; \Theta^{(\pi)})$ for $i = 1, \dots, n$.

Condition (1) is immediate in this case; it can be proved by following the argument in the proof of Proposition A.1. To prove (2), we need to show that the inequality $\forall a_i \in A : Q_i(s, \pi(s); \Theta^{(\pi)}) \ge Q_i(s, a_i; \Theta^{(\pi)})$ is strict for every $a_i \ne \pi(s)_i$. We prove it by contradiction. Suppose that there are some $s$, $i$, and $a_i \ne \pi(s)_i$ such that $Q_i(s, \pi(s); \Theta^{(\pi)}) = Q_i(s, a_i; \Theta^{(\pi)})$. Then we have

$$Q(s, \pi(s)_1, \dots, a_i, \dots, \pi(s)_n; \Theta^{(\pi)}) = \max Q(s, \cdot; \Theta^{(\pi)}) = \frac{\pi(s)_1 + |A|}{2}$$

by the IGM assumption, which contradicts $Q(s, \pi(s)_1, \dots, a_i, \dots, \pi(s)_n; \Theta^{(\pi)}) = 0$.

Experimental Results for QPLEX. In the empirical design of the algorithm, we notice that QPLEX uses engineering tricks such as "stop gradient" to modify the gradient at non-optimal points, which helps the algorithm escape some local optima. However, these tricks lack theoretical guarantees, and we can still construct cases in which QPLEX is not able to reach the global optimum, such as the following matrix game (Table 3).

$$\begin{pmatrix} -20 & 10 \\ 10 & 9 \end{pmatrix}$$

Table 3: Matrix Game 2, m = 2

This matrix game has two global optima, (0, 1) and (1, 0), and a suboptimal solution (1, 1) with high reward. QPLEX is likely to be initialized at the suboptimal solution (1, 1), after which it gets stuck because the manually modified gradient does not point in the right direction. The learnt joint Q oscillates around the following matrix:

$$\begin{pmatrix} -20 & 29/3 - \epsilon \\ 29/3 - \epsilon & 29/3 \end{pmatrix},$$

which can be proved to be a local optimum of $L_{TD}$ according to the proof of Proposition A.1.

QTRAN. $L_{TD}$ in QTRAN is discontinuous: every point where the policy switches may constitute a jump discontinuity. This may be harmful for gradient descent methods, which assume the loss function to be differentiable. Unfortunately, we are not able to give any theoretical analysis of QTRAN that either proves or disproves its optimality. We only have the empirical results shown in Section 5 to indicate its potential suboptimality.
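The conditional optimum used in the proof of Proposition A.1 can be checked numerically on the matrix game of Table 3. The sketch below is only an illustration under stated assumptions: the function name `conditional_optimum` and the dictionary encoding of the payoff table are hypothetical, and the code implements the values stated in the proof (exact fit on the set L, common mean on the set G) rather than any QPLEX training code.

```python
from fractions import Fraction


def conditional_optimum(targets, greedy):
    """Least-squares fit to `targets` that keeps `greedy` an argmax.

    targets: dict mapping joint action -> TD target T(s, a) for one state.
    greedy:  the joint action a* that must remain greedy.
    (Hypothetical helper, written for illustration only.)
    """
    t_star = targets[greedy]
    # G: actions whose target exceeds T(s, a*), together with a* itself.
    group_g = {a for a, t in targets.items() if t > t_star or a == greedy}
    mean_g = Fraction(sum(targets[a] for a in group_g), len(group_g))
    # L (everything else) is fitted exactly; all of G shares the mean.
    return {a: mean_g if a in group_g else Fraction(t) for a, t in targets.items()}


# Payoffs of the 2x2 matrix game in Table 3.
TABLE_3 = {(0, 0): -20, (0, 1): 10, (1, 0): 10, (1, 1): 9}

# Greedy action fixed at the suboptimal entry (1, 1).
q_local = conditional_optimum(TABLE_3, greedy=(1, 1))
for action in sorted(q_local):
    print(action, q_local[action])
```

With the greedy action fixed at (1, 1), the three entries in G share the mean (10 + 10 + 9)/3 = 29/3, which matches the matrix around which the learnt joint Q is reported to oscillate.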

A.2 SEQUENTIAL TRANSFORMATION

We present the formal definition of sequential transformation here for completeness.

Definition A.1 (Sequential Transformation Γ). Given an MMDP $M = (S, A, P, r, \gamma, s_0, N)$, its sequential transformation is an MDP $\Gamma(M) = (\bar{S}, A, \bar{P}, \bar{r}, \bar{\gamma}, s_0)$, where

$\bar{S} = \bigcup_{i=0}^{N-1} S \times A^i$ is the state space;

$A$ is the same action space as in the original MMDP $M$;

$\bar{P}$ is the transformed transition function, where $\forall k < N$, $\forall \bar{s} = (s, a_1, \dots, a_{k-1}) \in \bar{S}$, $\forall a_k \in A$, we have $\bar{P}((s, a_1, \dots, a_k) \mid \bar{s}, a_k) = 1$, and $\forall \bar{s} = (s, a_1, \dots, a_{N-1}) \in \bar{S}$, $\forall s' \in S$, $\forall a_N \in A$, we have $\bar{P}(s' \mid \bar{s}, a_N) = P(s' \mid s, (a_1, \dots, a_N))$;

$\bar{r}$ is the transformed reward function, where $\forall k < N$, $\forall \bar{s} = (s, a_1, \dots, a_{k-1}) \in \bar{S}$, $\forall a_k \in A$, we have $\bar{r}(\bar{s}, a_k) = 0$, and $\forall \bar{s} = (s, a_1, \dots, a_{N-1}) \in \bar{S}$, $\forall a_N \in A$, we have $\bar{r}(\bar{s}, a_N) = r(s, (a_1, \dots, a_N))$;

$\bar{\gamma} = \gamma^{1/N}$ is the transformed discount factor, and $s_0$ is the initial state.

A.2.1 PSEUDOCODE OF THE FRAMEWORK WITH SEQUENTIAL TRANSFORMATION

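Here we present the pseudocode of the sequential framework (Algorithm 1).

Algorithm 1 The Sequential Framework
1: Input: An SARL algorithm A, an oracle $O_M$ for interaction with an MMDP M.
2: while Simulating A do
3:     if A asks for the initialization of environment then
4:         Initialize $t = 0$, $a_i = 0$ for $i = 1, \dots, N$.
5:         Call $O_M$ for the initialization of M and obtain $s_0$.
6:         Return $s_0$ to A.
7:     else if A asks for an interaction with the environment by providing an action $a$ then
8:         $a_{t \bmod N + 1} \leftarrow a$
9:         if $t \bmod N = N - 1$ then
10:            Call $O_M$ for the interaction by providing action $(a_1, \dots, a_N)$ to obtain reward $r$ and next state $s'$.
           ...
           Return 0 and $(s, a_1, \dots, a_{t \bmod N + 1})$ to A.
           ...
    Convert $\pi$ to the joint policy $\pi_{jt}$ on M.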
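The oracle loop of Algorithm 1 can be sketched as an environment wrapper. This is a minimal illustration, not the authors' implementation: the class names `SequentialWrapper` and `ToyEnv` are hypothetical, and the wrapped environment is assumed to expose `reset()` returning a state and `step(joint_action)` returning `(next_state, reward)`.

```python
class ToyEnv:
    """Hypothetical MMDP: reward 1 when all agents pick the same action."""

    def reset(self):
        self.s = 0
        return self.s

    def step(self, joint_action):
        reward = 1.0 if len(set(joint_action)) == 1 else 0.0
        self.s += 1
        return self.s, reward


class SequentialWrapper:
    """Presents an N-agent environment to a single-agent learner,
    one agent action per step (assumed interface, see lead-in)."""

    def __init__(self, env, n_agents):
        self.env = env
        self.n_agents = n_agents

    def reset(self):
        self.state = self.env.reset()
        self.actions = []
        # A transformed state is the original state plus the actions chosen so far.
        return (self.state,)

    def step(self, action):
        self.actions.append(action)
        if len(self.actions) == self.n_agents:
            # Last agent has acted: query the real environment once.
            self.state, reward = self.env.step(tuple(self.actions))
            self.actions = []
            return (self.state,), reward
        # Intermediate step: deterministic transition with zero reward.
        return (self.state, *self.actions), 0.0


wrapper = SequentialWrapper(ToyEnv(), n_agents=3)
trace = [wrapper.reset()]
for a in (2, 2, 2):
    trace.append(wrapper.step(a))
print(trace)
```

Any single-agent learner that only calls `reset` and `step` can then be run on the wrapper unchanged, with the discount factor replaced by the N-th root of the original one.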

A.2.2 THE EQUIVALENCE BETWEEN M AND Γ(M) AS WELL AS THE CONVERSION OF POLICIES

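Here we state the equivalence between $M$ and $\Gamma(M)$ in terms of policy value.

Theorem A.1. For any deterministic policy $\pi$ on $\Gamma(M)$, there is a decentralized policy $\pi_{jt} = (\pi_1, \dots, \pi_N)$ on $M$ such that $J_M(\pi_{jt}) = \gamma^{(1-n)/n} J_{\Gamma(M)}(\pi)$, where $\pi_1(s) = \pi(s)$ and $\pi_k(s) = \pi((s, \pi_1(s), \dots, \pi_{k-1}(s)))$ for all $k > 1$. For any stochastic policy $\eta$ on $\Gamma(M)$, there is a communicated policy $\eta_{jt} = (\eta_1, \dots, \eta_N)$ on $M$ such that $J_M(\eta_{jt}) = \gamma^{(1-n)/n} J_{\Gamma(M)}(\eta)$, where $\eta_1(a_1 \mid s) = \eta(a_1 \mid s)$ and $\eta_k(a_k \mid s, a_1, \dots, a_{k-1}) = \eta(a_k \mid (s, a_1, \dots, a_{k-1}))$ for all $k > 1$, where $a_1, \dots, a_{k-1}$ are the actions selected by agents $1, \dots, k-1$. Conversely, for any policy $\pi_{jt}$ on $M$, there is a policy $\pi$ on $\Gamma(M)$ such that $J_M(\pi_{jt}) = \gamma^{(1-n)/n} J_{\Gamma(M)}(\pi)$.

Proof. Throughout, $n = N$ is the number of agents, and $\bar{\gamma} = \gamma^{1/n}$, $\bar{r}$, $\bar{P}$, and $\bar{s}$ denote the discount factor, reward function, transition function, and states of $\Gamma(M)$. For a deterministic policy:

$$\begin{aligned}
J_M(\pi_{jt}) &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, \pi_{jt}) \,\middle|\, s_{t+1} \sim P(s_t, \pi_{jt})\right] \\
&= \mathbb{E}\left[\sum_{t=0}^{\infty} \bar{\gamma}^{nt}\, \bar{r}\big((s_t, \pi_1(s_t), \dots, \pi_{n-1}(s_t)), \pi\big) \,\middle|\, s_{t+1} \sim P(s_t, \pi_{jt})\right] \\
&= \mathbb{E}\left[\sum_{t=0}^{\infty} \sum_{k=0}^{n-1} \bar{\gamma}^{nt+k-n+1}\, \bar{r}\big((s_t, \pi_1(s_t), \dots, \pi_{k-1}(s_t)), \pi\big) \,\middle|\, s_{t+1} \sim P(s_t, \pi_{jt})\right] \\
&= \mathbb{E}\left[\bar{\gamma}^{1-n} \sum_{t'=0}^{\infty} \bar{\gamma}^{t'}\, \bar{r}(\bar{s}_{t'}, \pi) \,\middle|\, \bar{s}_{t'+1} \sim \bar{P}(\bar{s}_{t'}, \pi)\right] \\
&= \gamma^{\frac{1-n}{n}} J_{\Gamma(M)}(\pi).
\end{aligned}$$

The third line only adds terms with zero reward, since $\bar{r}$ vanishes on every intermediate step. For a stochastic policy the proof is similar:

$$\begin{aligned}
J_M(\eta_{jt}) &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, \eta_{jt}) \,\middle|\, s_{t+1} \sim P(s_t, \eta_{jt})\right] \\
&= \mathbb{E}\left[\sum_{t=0}^{\infty} \bar{\gamma}^{nt}\, \bar{r}\big((s_t, a^{(t)}_{<n}), \eta\big) \,\middle|\, s_{t+1} \sim P(s_t, \eta_{jt}),\ a^{(t)}_l \sim \eta\big(\cdot \mid s_t, a^{(t)}_{<l}\big)\right] \\
&= \mathbb{E}\left[\sum_{t=0}^{\infty} \sum_{k=0}^{n-1} \bar{\gamma}^{nt+k-n+1}\, \bar{r}\big((s_t, a^{(t)}_{<k}), \eta\big) \,\middle|\, s_{t+1} \sim P(s_t, \eta_{jt}),\ a^{(t)}_l \sim \eta\big(\cdot \mid s_t, a^{(t)}_{<l}\big)\right] \\
&= \mathbb{E}\left[\bar{\gamma}^{1-n} \sum_{t'=0}^{\infty} \bar{\gamma}^{t'}\, \bar{r}(\bar{s}_{t'}, \eta) \,\middle|\, \bar{s}_{t'+1} \sim \bar{P}(\bar{s}_{t'}, \eta)\right] \\
&= \gamma^{\frac{1-n}{n}} J_{\Gamma(M)}(\eta).
\end{aligned}$$

For the converse part of the theorem, suppose $\pi_{jt} = (\pi_1, \dots, \pi_n)$. It is sufficient to let $\pi((s, a_1, \dots, a_{k-1})) = \pi_k(s)$. The calculation of its value is similar to the above, so we omit it here.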
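The discount bookkeeping behind Theorem A.1 can be verified on a single reward trajectory. The sketch below is an illustration with hypothetical function names; it assumes, as in Definition A.1, that each joint step becomes n steps in Γ(M), that only the last of them carries the reward, and that the transformed discount is the n-th root of the original one.

```python
def mmdp_return(rewards, gamma):
    """Discounted return of a reward sequence in the original MMDP."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def transformed_return(rewards, gamma, n_agents):
    """Discounted return of the same trajectory viewed in the transformed MDP."""
    gamma_bar = gamma ** (1.0 / n_agents)
    stretched = []
    for r in rewards:
        # n_agents - 1 zero-reward intermediate steps, then the real reward.
        stretched.extend([0.0] * (n_agents - 1) + [r])
    return sum(gamma_bar ** t * r for t, r in enumerate(stretched))


rewards, gamma, n_agents = [1.0, 0.5, 2.0], 0.9, 3
lhs = mmdp_return(rewards, gamma)
rhs = gamma ** ((1 - n_agents) / n_agents) * transformed_return(rewards, gamma, n_agents)
print(lhs, rhs)
```

The two printed values agree up to floating-point error, reflecting that the transformed return is scaled by a constant factor that does not depend on the policy, so the ordering of policies is preserved.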

A.3 OPTIMALITY OF TPPO

The proofs of Theorem 4.1 and Proposition 4.1 follow directly from Theorem A.1 and Theorem 4.9 of Liu et al. (2019), and are omitted here.

A.4 THE COMPLEXITY OF THE TRANSFORMED MODEL

By the sequential transformation, we are able to convert any MMDP into an MDP and run SARL algorithms on the MDP to solve the MMDP. One natural question is whether such a framework makes the task harder.

Table 4: Multi-task Matrix Game. Rows are indexed by $a_1$ and columns by $a_2$.

Matrix 1

| $a_1 \backslash a_2$ | A(1) | A(2) | A(3) | A(4) | A(5) |
|---|---|---|---|---|---|
| A(1) | 10 | -10 | -10 | -10 | -10 |
| A(2) | -10 | 9 | 0 | 0 | 0 |
| A(3) | -10 | 0 | 9 | 0 | 0 |
| A(4) | -10 | 0 | 0 | 9 | 0 |
| A(5) | -10 | 0 | 0 | 0 | 9 |

Matrix 2

| $a_1 \backslash a_2$ | A(1) | A(2) | A(3) | A(4) | A(5) |
|---|---|---|---|---|---|
| A(1) | 10 | -10 | 10 | -10 | 10 |
| A(2) | -10 | 10 | -10 | 10 | -10 |
| A(3) | 10 | -10 | 10 | -10 | 10 |
| A(4) | -10 | 10 | -10 | 10 | -10 |
| A(5) | 10 | -10 | 10 | -10 | 10 |

Matrix 3

| $a_1 \backslash a_2$ | A(1) | A(2) | A(3) | A(4) | A(5) |
|---|---|---|---|---|---|
| A(1) | -20 | -20 | -20 | -20 | 10 |
| A(2) | -20 | -20 | -20 | 10 | 9 |
| A(3) | -20 | -20 | 10 | 9 | 9 |
| A(4) | -20 | 10 | 9 | 9 | 9 |
| A(5) | 10 | 9 | 9 | 9 | 9 |

Matrix 4

| $a_1 \backslash a_2$ | A(1) | A(2) | A(3) | A(4) | A(5) |
|---|---|---|---|---|---|
| A(1) | -20 | -20 | -20 | -20 | 10 |
| A(2) | -20 | -20 | -20 | 10 | 9 |
| A(3) | -20 | -20 | 10 | 9 | 8 |
| A(4) | -20 | 10 | 9 | 8 | 7 |
| A(5) | 10 | 9 | 8 | 7 | 6 |

Matrix 5

| $a_1 \backslash a_2$ | A(1) | A(2) | A(3) | A(4) | A(5) |
|---|---|---|---|---|---|
| A(1) | -20 | -15 | -10 | -5 | 6 |
| A(2) | -20 | -15 | -10 | 7 | 5 |
| A(3) | -20 | -15 | 8 | 6 | 4 |
| A(4) | -20 | 9 | 7 | 5 | 3 |
| A(5) | 10 | 8 | 6 | 4 | 2 |

Matrix 6

| $a_1 \backslash a_2$ | A(1) | A(2) | A(3) | A(4) | A(5) |
|---|---|---|---|---|---|
| A(1) | 0.8 | -16.0 | -5.0 | -10.9 | -3.7 |
| A(2) | -9.2 | -4.2 | 7.3 | 9.6 | -3.0 |
| A(3) | -20.0 | -18.1 | 0.2 | -4.3 | 9.0 |
| A(4) | -14.9 | -2.0 | -17.7 | -17.6 | -0.8 |
| A(5) | 3.8 | 10 | 7.5 | 9.2 | -10.7 |

Matrix 7: [entries not available]

Matrix 8

| $a_1 \backslash a_2$ | A(1) | A(2) | A(3) | A(4) | A(5) |
|---|---|---|---|---|---|
| A(1) | -1.4 | -19.2 | 7.2 | -5.5 | 7.4 |
| A(2) | -18.5 | -20.0 | -14.4 | -17.6 | -5.1 |
| A(3) | 3.6 | 5.5 | 10 | -13.3 | -4.9 |
| A(4) | 9.8 | -12.3 | 0.6 | -16.5 | -13.0 |
| A(5) | -11.8 | -20.0 | -2.4 | 7.1 | -2.3 |

Matrix 9: [entries not available]

Matrix 10: [entries not available]

In Section 5.1, we compared and discussed the advantage of our approach against baselines on several representative maps. Here we further compare our approach against baselines on all maps. The SMAC benchmark contains 14 maps that have been classified as easy, hard, and super hard. In this paper, we design one more map, 3h_vs_1b1z3h, whose difficulty is comparable with the official super hard maps. In Figure 7, we compare the performance of our approach with baseline algorithms on all super hard maps. We can see that T-PPO outperforms all the baselines, especially on 3s5z_vs_3s6z, MMM2, and 3h_vs_1b1z3h. These results demonstrate that T-PPO can handle challenging tasks more efficiently, backed by the theoretical guarantees of its sequential transformation framework, in line with our expectations. Meanwhile, our distilled policy T-PPO-Distillation performs similarly to T-PPO, illustrating the competitiveness of our approach under fully decentralized evaluation. HAPPO performs poorly on 4 out of 6 super hard maps, demonstrating its limitation on complex tasks.

Our approach retains its advantage on most hard and easy maps. Compared with MAPPO, our approach achieves better convergence points on 3s_vs_5z and 1c3s5z, which confirms the experimental results and analysis in Section ??. In summary, T-PPO establishes a new state of the art on the SMAC benchmark by outperforming all policy-based baselines in 11 out of 15 scenarios. Meanwhile, the distilled strategy of T-PPO performs as well as the original strategy, which allows a fairer comparison with baselines under fully decentralized execution.

B.2.2 HYPER-PARAMETERS

Our code is implemented based on MAPPO (https://github.com/marlbenchmark/on-policy). We share the same structure with MAPPO except for the improvements mentioned in Section ?? to instantiate our transformation framework. We also share the same hyper-parameters with MAPPO (Yu et al., 2021), with two exceptions: (1) we fine-tune the weight of entropy on three maps (0.03 on 3h_vs_1b1z3h, 6h_vs_8z, and 5m_vs_6m) for both our approach and MAPPO; (2) the hyper-parameters of the multi-head attention (MHA) modules. As for HAPPO, we use the officially released code and related hyper-parameters (https://github.com/cyanrain7/TRPO-in-MARL).

The distillation is simply independent behavioural cloning for each agent. Denote the joint policy as $\pi_{jt}(\cdot \mid s)$, and the decentralized policy for agent $i$ as $\pi^i_{decen}(\cdot \mid s; \theta)$. Independent behavioural cloning is equivalent to minimizing the KL divergence between the joint policy $\pi_{jt}(\cdot \mid s)$ and the joint decentralized policy $\prod_{i=1}^{n} \pi^i_{decen}(\cdot \mid s; \theta)$:

$$\sum_{a = (a_1, \dots, a_n)} \pi_{jt}(a \mid s) \log \frac{\pi_{jt}(a \mid s)}{\prod_{i=1}^{n} \pi^i_{decen}(a_i \mid s; \theta)} = \underbrace{\mathbb{E}_{a \sim \pi_{jt}(\cdot \mid s)}\left[-\sum_{i=1}^{n} \log \pi^i_{decen}(a_i \mid s; \theta)\right]}_{\text{behavioural cloning loss}} - \underbrace{H(\pi_{jt}(\cdot \mid s))}_{\text{entropy (a constant)}}$$

B.3 GOOGLE RESEARCH FOOTBALL (GRF) TASKS

Because there are fewer agents in GRF than in SMAC (3 in Academy_3_vs_1_with_Keeper and 4 in Academy_Counterattack_Hard), we slightly decrease the number of MHA heads from 3 to 2, as shown in Table 9. Other hyper-parameters remain the same as in SMAC.

The sequential update does take longer than the concurrent update in the training phase, while in the testing phase our algorithm takes no extra time, since a decentralized policy has already been obtained by distillation. Specifically, in the training phase, our framework takes n times as long for action inference, where n is the number of agents. However, action inference accounts for only part of a whole training iteration, which also includes environment simulation and policy training. On the whole, the training time cost of our framework is 0.91 times more than MAPPO in SMAC environment 3m (3 agents), and 1.76 times more in SMAC environment 10m_vs_11m (10 agents). This result embodies a trade-off between training time and training performance. The specific time of each part is shown in Table 10.

We also compare T-PPO with two more actor-critic methods, DOP and FOP, on three super hard SMAC maps for completeness. The result is shown in Figure 10.
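Returning to the distillation objective of Section B.2.2, the decomposition of the KL divergence into a behavioural cloning loss minus a constant entropy term can be checked numerically. The sketch below is an illustration only: the two-agent joint policy and the per-agent factors are made-up numbers, and the function names are hypothetical.

```python
import math


def kl_joint_to_product(joint, factors):
    """KL(joint || product of per-agent factors) for one state."""
    total = 0.0
    for action, p in joint.items():
        q = math.prod(f[a_i] for f, a_i in zip(factors, action))
        total += p * math.log(p / q)
    return total


def behavioural_cloning_loss(joint, factors):
    """Expected negative log-likelihood of the factors under the joint policy."""
    return -sum(
        p * sum(math.log(f[a_i]) for f, a_i in zip(factors, action))
        for action, p in joint.items()
    )


def entropy(joint):
    """Shannon entropy of the joint policy (independent of the factors)."""
    return -sum(p * math.log(p) for p in joint.values())


# Made-up joint policy over two agents with two actions each.
joint = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}
# Made-up decentralized policies, one probability vector per agent.
factors = [[0.7, 0.3], [0.4, 0.6]]

kl = kl_joint_to_product(joint, factors)
gap = kl - (behavioural_cloning_loss(joint, factors) - entropy(joint))
print(kl, gap)
```

Because the entropy term does not depend on the decentralized parameters, minimizing the cloning loss and minimizing the KL divergence select the same decentralized policies.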

B.6 T-DQN RESULTS

We also implement the off-policy method T-DQN, based on the single-agent algorithm DQN (Mnih et al., 2013), for completeness. In Figure 11, we evaluate T-DQN on a Markov game (matrix games with random transitions), a hard SMAC map, and a super hard SMAC map to show that our framework is also compatible with off-policy methods.

B.7 LEARNED BEHAVIOUR OF THE SEQUENTIAL FRAMEWORK

We visualize the policy learned by our approach and compare it with MAPPO on MMM2. The comparison reveals an interesting phenomenon. The joint strategy trained by MAPPO is usually conservative: agents move only within a small area, and only two agents are left in the end. In contrast, the joint strategy trained by our approach is more aggressive. Our agents frequently advance and retreat over a large area in response to the opponents' movement while maintaining effective focused fire. We think this phenomenon is caused by our sequential transformation framework, which enables each agent to fully understand the team strategy for more efficient coordination.



Footnote 0: Note that continuity is naturally assumed, since we need to take the gradient of the loss function w.r.t. Θ.

Footnote 1: The Bolzano-Weierstrass theorem states that any bounded sequence in $\mathbb{R}^k$ has a convergent subsequence. It is sufficient to assume boundedness for the purpose of understanding the main idea of the proof. For the unbounded case, the theorem can be extended slightly by defining the convergence of $x_n / |x_n|$ as an extended convergence. This is a standard technique in mathematical analysis, and we omit it here for conciseness.

Footnote 2: Note that continuity is naturally assumed, since we need to take the gradient of the loss function w.r.t. Θ.



Figure 1: Learning curve of Multi-task Matrix Game

Proposition 4.1 (Suitable implementation of T-PPO has optimality guarantee). T-PPO converges to global optimum, if Assumption 4.1, 4.2, and 4.3 in Liu et al. (2019) hold, in particular, if M is tabular.

Figure 2: The architecture of combining our sequential transformation framework with PPO (T-PPO)

Figure 3: Learning curve of SMAC

Figure 4: Illustration of 3h_vs_1b1z3h.

Figure 5: Different changes caused by adjusting the exploration coefficients (entropy weight) between T-PPO and MAPPO on 3h_vs_1b1z3h and 6h_vs_8z.

Figure 6: Learning curve of GRF.

Input: An SARL algorithm A, an oracle O M for interaction with an MMDP M. 2: while Simulating A do 3: if A asks for the initialization of environment then 4: Initialize t = 0, a i = 0 for i = 1, • • • , N . 5: Call O M for the initialization of M and obtain s 0 6:

Figure 7: Comparisons between Our approach and policy-based baselines on all superhard maps.

Figure 9: Comparisons between Our approach and policy-based baselines on all easy maps.

,••• ,an) π jt (a|s) log π jt (a|s) n i=1 π i decen (a i |s; θ) = E a∼πjt(•|s) -n i=1 log π i decen (a i |s; θ) behavioural cloning loss -H(π jt (•|s)) entropy (a constant) B.3 GOOGLE RESEARCH FOOTBALL (GRF) TASKS

Figure 10: Comparisons between T-PPO and other on-policy baselines on three superhard maps.

Mean value and standard deviation of winning rate on the SMAC benchmark.

be the set of policies with non-zero value on each state. Note that |Π| = |A| |S| . If we can show that there is a Θ

• • • , a N ) to obtain reward r and next state s ′ . Return 0 and (s, a 1 , • • • , a t mod N +1 ) to A. Convert π to the joint policy π jt on M. THE EQUIVALENCE BETWEEN M AND ΓM AS WELL AS THE CONVERSION OF

Multi-task Matrix Game

Common hyper-parameters for our approach in the SMAC domain.

Common hyper-parameters for our approach in the GRF domain.

Comparison between TPPO and MAPPO on training time

5. EXPERIMENTS

We design experiments to answer the following questions: (1) Can the proposed sequential framework achieve the globally optimal policy on MMDP? (Section 3 and Section 5.1.1) (2) Can our approach improve learning efficiency for policy-based MARL algorithms? (Section 5.1 and Section


First of all, this framework does not increase the minimax sample complexity of the task, since the MMDP $M$ and the MDP $\Gamma(M)$ can be transformed into each other with merely negligible additional cost in time and space (see Appendix A.6). Nevertheless, for a concrete algorithm A (e.g., Q-learning), the sample complexity is not necessarily the same after such a transformation.

Let us take Q-learning as an example. We first investigate the size of the state-action space before and after the transformation. It is easy to see that the size of the state-action space of $M$ is $|S||A|^N$, and that of its sequential transformation $\Gamma(M)$ is, by Definition A.1, $|S| \sum_{i=1}^{N} |A|^i = O(|S||A|^N)$. This implies that the sequential transformation does not increase the complexity in terms of the state-action space.

However, if we take a closer look at the sample complexity, we find that the sample complexity bound of Q-learning (Li et al., 2021b) depends not only on the size of the state-action space, but also on the magnitude of $\frac{1}{1-\gamma}$. This implies that the sample complexity may increase for certain algorithms, since $\Gamma(M)$ has a longer horizon. Despite this unpleasant result for Q-learning, the analysis leaves out the structure of $\Gamma(M)$: in every $n$ steps of $\Gamma(M)$ there are $n - 1$ deterministic transitions with reward 0. Fortunately, if we modify the original Q-learning slightly, it attains the same sample complexity as before (see Appendix A.5).
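The two quantities discussed above, the size of the state-action space and the effective horizon, can be tabulated for concrete numbers. The sketch below uses hypothetical function names; the count for Γ(M) is derived from the state space of Definition A.1 (an assumption of this illustration, not a formula quoted from the paper), and the effective horizon is taken to be 1/(1 - discount).

```python
def state_action_sizes(n_states, n_actions, n_agents):
    """Number of state-action pairs in M and in its sequential transformation."""
    original = n_states * n_actions ** n_agents
    # Transformed states are (s, a_1, ..., a_i) for i = 0..N-1, each with |A| actions.
    transformed = n_states * sum(n_actions ** i for i in range(1, n_agents + 1))
    return original, transformed


def effective_horizons(gamma, n_agents):
    """1 / (1 - discount) before and after the transformation."""
    gamma_bar = gamma ** (1.0 / n_agents)
    return 1.0 / (1.0 - gamma), 1.0 / (1.0 - gamma_bar)


original, transformed = state_action_sizes(n_states=10, n_actions=5, n_agents=3)
h, h_bar = effective_horizons(gamma=0.99, n_agents=3)
print(original, transformed)  # state-action pairs in M and in the transformed MDP
print(h, h_bar)               # effective horizons in M and in the transformed MDP
```

In this example the state-action space grows by less than a factor of two, whereas the effective horizon grows by roughly the number of agents, which is the effect that the modified Q-learning of Appendix A.5 is meant to neutralize.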

A.5 AN EXTENSION OF Q-LEARNING

Here we introduce a variant of Q-learning dealing with deterministic transitions (Algorithm 2) to demonstrate the claim at the end of Appendix A.4. If we adopt this algorithm in our sequential transformation framework, it will have the same sample complexity as the original Q-learning on the original MMDP $M$.

We denote by $D : S \times A \to \{0, 1\}$ an oracle telling whether a state-action pair results in a deterministic transition.

Algorithm 2 QLDT (Q-learning for MDPs with deterministic transitions)

We have Proposition A.2.

Proposition A.2. T-QLDT has the same sample complexity as the original Q-learning on $M$.

Proof. One can view T-QLDT as the original Q-learning maintaining a max heap, where [...] In this way, T-QLDT has exactly the same behaviour as the original Q-learning and, as a consequence, does not change the sample complexity.

A.6 THE MINIMAX SAMPLE COMPLEXITY OF THE SEQUENTIAL FRAMEWORK

Here we explain in more detail the claim about the minimax sample complexity in Appendix A.4.

The minimax sample complexity here is the sample complexity of the "best" algorithm on the "hardest" task. For any multi-agent algorithm A, we can always find a single-agent algorithm B such that A = T-B. This is because, for any MDP $M$, we can always compress $n$ steps on $M$ into one step, and then use the corresponding multi-agent algorithm A to solve it as a multi-agent problem. In this way, T-B is exactly A, and thus the transformation does not increase the minimax sample complexity. One should keep in mind that every $n$ samples on the MDP correspond to exactly one sample on the MMDP. In particular, the number of bits needed to record every $n$ samples on the MDP is exactly the number needed to record one sample on the MMDP.

B EXPERIMENTAL DETAILS

In this section, we provide more experimental results supplementary to those presented in Section 5. We also discuss the details of the experimental settings of both our matrix game and the StarCraft II micromanagement (SMAC) benchmark.

B.1 MULTI-TASK MATRIX GAME

In section 3, we design a multi-task matrix game to demonstrate the global optimality of our sequential transformation framework. In this section, we will first show the details of this environment and provide more evidence of our algorithms' advantage on MMDP.

B.1.1 DETAILS OF MULTI-TASK MATRIX GAME

In the multi-task matrix game, the return of the optimal strategy for each matrix is 10, so the returns of the globally optimal strategy sum to 100 over the 10 matrices. The two agents are initialized to one matrix chosen uniformly at random, and the ID of the current matrix is observable to both of them. They need to cooperate to select the entry with the maximum reward for the current matrix, after which the game ends. Each matrix contains 5 × 5 = 25 entries, which means A = {0, 1, 2, 3, 4} for each agent.

All 10 payoff matrices are listed in Table 4. The optimal payoff of every matrix is 10. Matrices 1-5 are hand-crafted in order to create some hard NEs. Matrices 6-10 are drawn uniformly at random. It is worth noting that a random 5 × 5 matrix has 25/9 ≈ 2.78 different NEs in expectation, since each entry is an NE exactly when it is the largest of the 9 entries in its row and column. In our opinion, the existence of suboptimal NEs is the main reason why existing algorithms fail.
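The role of suboptimal NEs can be made concrete by enumerating the pure NEs of a payoff matrix. The sketch below is an illustration with hypothetical function names; it treats an entry as a pure NE of the common-payoff game when it is maximal in both its row and its column, and it uses the first hand-crafted payoff matrix listed in Table 4 as input.

```python
from fractions import Fraction


def pure_nash_equilibria(payoff):
    """All (row, col) entries that are maximal in both their row and their column."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    return [
        (i, j)
        for i in range(n_rows)
        for j in range(n_cols)
        if payoff[i][j] == max(payoff[i])
        and payoff[i][j] == max(payoff[r][j] for r in range(n_rows))
    ]


def expected_ne_count(n):
    """Expected number of pure NEs of an n x n matrix with i.i.d. continuous entries.

    Each entry is the largest of the 2n - 1 entries in its row and column
    with probability 1 / (2n - 1).
    """
    return Fraction(n * n, 2 * n - 1)


# First hand-crafted payoff matrix listed in Table 4.
MATRIX_1 = [
    [10, -10, -10, -10, -10],
    [-10, 9, 0, 0, 0],
    [-10, 0, 9, 0, 0],
    [-10, 0, 0, 9, 0],
    [-10, 0, 0, 0, 9],
]

equilibria = pure_nash_equilibria(MATRIX_1)
print(equilibria)
print(expected_ne_count(5), float(expected_ne_count(5)))
```

For this matrix every diagonal entry is a pure NE, but only one of them attains the optimal payoff of 10, so an algorithm that stops at an arbitrary NE can easily settle for a payoff of 9.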

B.1.2 VISUALIZATION OF LEARNED JOINT STRATEGIES

To further illustrate our approach's ability to attain the global optimum, we use the matrix game in (Wang et al., 2021b; Ma et al., 2021) as a toy example (shown in Table 5). Here we compare our approach T-PPO with MAPPO. The joint policies learned by T-PPO and MAPPO are shown in Table 6 and Table 7. MAPPO falls into a local optimum, as predicted by Theorem 3.1, while T-PPO obtains the optimal strategy. Thanks to this advantage, our approach dominates in our multi-task matrix game.

