ORDER MATTERS: AGENT-BY-AGENT POLICY OPTIMIZATION

Abstract

While multi-agent trust region algorithms have achieved great empirical success in solving coordination tasks, most of them suffer from a non-stationarity problem because agents update their policies simultaneously. In contrast, a sequential scheme that updates policies agent-by-agent provides another perspective and shows strong performance. However, sample inefficiency and the lack of monotonic improvement guarantees for each agent remain two significant challenges for the sequential scheme. In this paper, we propose the Agent-by-agent Policy Optimization (A2PO) algorithm, which improves sample efficiency and retains the guarantee of monotonic improvement for each agent during training. We justify the tightness of the monotonic improvement bound compared with other trust region algorithms. From the perspective of sequentially updating agents, we further consider the effect of the agent update order and extend the theory of non-stationarity to the sequential update scheme. To evaluate A2PO, we conduct a comprehensive empirical study on four benchmarks: StarCraftII, Multi-agent MuJoCo, Multi-agent Particle Environment, and Google Research Football full game scenarios. A2PO consistently outperforms strong baselines.

1. INTRODUCTION

Trust region learning methods in reinforcement learning (RL) (Kakade & Langford, 2002) have achieved great success in solving complex tasks, from single-agent control tasks (Andrychowicz et al., 2020) to multi-agent applications (Albrecht & Stone, 2018; Ye et al., 2020). These methods deliver superior and stable performance because of their theoretical guarantee of monotonic policy improvement. Recently, several works that adopt trust region learning in multi-agent reinforcement learning (MARL) have been proposed, including algorithms in which agents independently update their policies using trust region methods (de Witt et al., 2020; Yu et al., 2022) and algorithms that coordinate agents' policies during the update process (Wu et al., 2021; Kuba et al., 2022). Most algorithms update the agents simultaneously, i.e., all agents perform policy improvement at the same time and cannot observe the changes of other agents, as shown in Fig. 1c. The simultaneous update scheme brings about the non-stationarity problem: the environment dynamics change from one agent's perspective as other agents also change their policies (Hernandez-Leal et al., 2017).

Figure 1: The taxonomy of the rollout scheme and the policy update scheme.

In contrast to the simultaneous update scheme, algorithms that sequentially execute agent-by-agent updates allow agents to perceive changes made by preceding agents, presenting another perspective for analyzing inter-agent interaction (Gemp et al., 2022). Bertsekas (2021) proposed a sequential update framework, named Rollout and Policy Iteration for a Single Agent (RPISA) in this paper, which performs a rollout every time an agent updates its policy (Fig. 1a). RPISA effectively turns non-stationary MARL problems into stationary single-agent reinforcement learning (SARL) ones. It retains the theoretical properties of the chosen SARL base algorithm, such as monotonic improvement (Kakade & Langford, 2002).
However, it is sample-inefficient since it only utilizes 1/n of the collected samples to update n agents' policies. On the other hand, Heterogeneous Proximal Policy Optimization (HAPPO) (Kuba et al., 2022) sequentially updates agents based on their local advantages estimated from the same rollout samples (Fig. 1b). Although it avoids wasting collected samples and achieves monotonic improvement of the joint policy, the policy improvement of a single agent is not theoretically guaranteed. Consequently, one agent's policy update may offset previous agents' policy improvements, reducing the overall joint policy improvement. In this paper, we aim to combine the merits of the existing single rollout and sequential policy update schemes. Firstly, we show that naive sequential update algorithms with a single rollout can lose the monotonic improvement guarantee of PPO for a single agent's policy. To tackle this problem, we propose a surrogate objective with a novel off-policy correction method, preceding-agent off-policy correction (PreOPC), which retains the monotonic improvement guarantee on both the joint policy and each agent's policy. We then show that the joint monotonic bound built on the single-agent bound is tighter than those of other simultaneous update algorithms and is tightened further as the agents are updated within a stage. This leads to Agent-by-agent Policy Optimization (A2PO), a novel sequential update algorithm with a single rollout scheme (Fig. 1b). Further, we study the significance of the agent update order and extend the theory of non-stationarity to the sequential update scheme. We test A2PO on four popular cooperative multi-agent benchmarks: StarCraftII, Multi-agent MuJoCo, Multi-agent Particle Environment, and Google Research Football full game scenarios.
On all benchmark tasks, A2PO consistently outperforms strong baselines by a large margin in both performance and sample efficiency and shows an advantage in encouraging inter-agent coordination. To sum up, the main contributions of this work are as follows:

1. Monotonic improvement bound. We prove that the guarantee of monotonic improvement on each agent's policy can be retained under the single rollout scheme with our proposed off-policy correction method, PreOPC. We further prove that the monotonic bound on the joint policy, achieved given the theoretical guarantees for each agent, is the tightest among single rollout algorithms, yielding effective policy optimization.

2. A2PO algorithm. We propose A2PO, the first agent-by-agent sequential update algorithm that retains monotonic policy improvement on both each agent's policy and the joint policy and does not require multiple rollouts when performing policy improvement.

3. Agent update order. We further investigate the connections between the sequential policy update scheme, the agent update order, and the non-stationarity problem, which motivates two novel methods: a semi-greedy agent selection rule for optimization acceleration and an adaptive clipping parameter method for alleviating the non-stationarity problem.

2. RELATED WORKS

Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) and Proximal Policy Optimization (PPO) (Schulman et al., 2017) are popular trust region algorithms with strong performance, benefiting from the guarantee of monotonic policy improvement (Kakade & Langford, 2002). Several recent works delve deeper into understanding these methods (Wang et al., 2019; Liu et al., 2019; Wang et al., 2020). In multi-agent scenarios, de Witt et al. (2020) and Papoudakis et al. (2020) empirically studied the performance of Independent PPO in multi-agent tasks. Yu et al. (2022) conducted a comprehensive benchmark and analyzed the factors influencing the performance of Multi-agent PPO (MAPPO), a variant of PPO with centralized critics. Coordinated PPO (CoPPO) (Wu et al., 2021) integrates value decomposition (Sunehag et al., 2017) and approximately performs a joint policy improvement with a monotonic guarantee. Further attempts to implement trust region methods are discussed in Wen et al. (2021); Li & He (2020); Sun et al. (2022); Ye et al. (2022). However, these MARL algorithms suffer from the non-stationarity problem as they update agents simultaneously: the environment dynamics change from one agent's perspective as others also change their policies. Consequently, agents suffer from high gradient variance and require more samples for convergence (Hernandez-Leal et al., 2017). To alleviate the non-stationarity problem, the Multi-Agent Mirror descent policy algorithm with Trust region decomposition (MAMT) (Li et al., 2022b) factorizes the trust region of the joint policy and constructs connections among the factorized trust regions, approximately constraining the divergence of the joint policy. Rollout and Policy Iteration for a Single Agent (RPISA) (Bertsekas, 2021) and Heterogeneous PPO (HAPPO) (Kuba et al., 2022) consider the sequential update scheme.
RPISA suffers from sample inefficiency as it requires n rollouts for n agents to complete their policy updates. Additionally, that work lacks a practical algorithm for complex tasks. In contrast, we propose a practical algorithm, A2PO, that updates all agents using the same samples from a single rollout. HAPPO is derived from the advantage decomposition lemma, proposed as Lemma 1 in Kuba et al. (2022). It does not consider the distribution shift caused by preceding agents and has no monotonic policy improvement guarantee for each agent's policy. A2PO, in contrast, is derived without decomposing the advantage and guarantees monotonic improvement for each agent's policy. We further discuss other MARL methods in Appx. C.

3.1. MARL PROBLEM FORMULATION

We formulate the sequential decision-making problem in multi-agent scenarios as a decentralized Markov decision process (DEC-MDP) (Bernstein et al., 2002). An n-agent DEC-MDP can be formalized as a tuple $(\mathcal{S}, \{\mathcal{A}^i\}_{i\in\mathcal{N}}, r, \mathcal{T}, \gamma)$, where $\mathcal{N} = \{1, \dots, n\}$ is the set of agents and $\mathcal{S}$ is the state space. $\mathcal{A}^i$ is the action space of agent $i$, and $\mathcal{A} = \mathcal{A}^1 \times \dots \times \mathcal{A}^n$ is the joint action space. $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $\mathcal{T}$ is the transition function, and $\gamma \in [0, 1)$ is the discount factor. Given a joint policy $\pi$, the discounted state visitation distribution is $d^\pi(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t \Pr(s_t = s \mid \pi)$, where $\Pr(\cdot\mid\pi): \mathcal{S} \to [0, 1]$ is the state probability function under $\pi$. We then define the value function $V^\pi(s) = \mathbb{E}_{\tau\sim(\mathcal{T},\pi)}\left[\sum_{t=0}^{\infty}\gamma^t r(s_t, a_t) \mid s_0 = s\right]$ and the advantage function $A^\pi(s, a) = r(s, a) + \gamma\mathbb{E}_{s'\sim\mathcal{T}(\cdot\mid s,a)}[V^\pi(s')] - V^\pi(s)$, where $\tau = \{(s_0, a_0), (s_1, a_1), \dots\}$ denotes one sampled trajectory. The agents maximize their expected return: $\pi^* = \operatorname{argmax}_\pi \mathcal{J}(\pi) = \operatorname{argmax}_\pi \mathbb{E}_{\tau\sim(\mathcal{T},\pi)}\left[\sum_{t=0}^{\infty}\gamma^t r(s_t, a_t)\right]$.
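As a concrete illustration of these definitions, a minimal Python sketch with hypothetical toy numbers (not code or values from the paper) computes the discounted return that $V^\pi$ estimates and a one-sample advantage estimate:

```python
# Illustrative sketch with hypothetical toy numbers (not from the paper):
# the discounted return that V^pi estimates, and the advantage
# A^pi(s,a) = r(s,a) + gamma * V^pi(s') - V^pi(s) from point estimates.
gamma = 0.9

def discounted_return(rewards, gamma):
    """sum_t gamma^t * r_t, accumulated backwards over one trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def advantage(r, v_s, v_s_next, gamma):
    """One-sample estimate of A^pi(s,a) using value estimates v_s, v_s_next."""
    return r + gamma * v_s_next - v_s

ret = discounted_return([1.0, 0.0, 2.0], gamma)           # 1 + 0.9*0 + 0.81*2 = 2.62
adv = advantage(1.0, v_s=1.5, v_s_next=2.0, gamma=gamma)  # 1 + 0.9*2 - 1.5 = 1.3
```

The backward accumulation is the standard way to evaluate the discounted sum in one pass.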

3.2. MONOTONIC IMPROVEMENT IN SEQUENTIAL POLICY UPDATE SCHEME

We assume agents are updated in the order $1, 2, \dots, n$, without loss of generality. We define $\pi$ as the joint base policy from which the agents are updated at a stage, $e_i = \{1, \dots, i-1\}$ as the set of preceding agents updated before agent $i$, and $\hat{\pi}^i$ as the updated policy of agent $i$. We denote the joint policy composed of the updated policies of the agents in $e_i$, the updated policy of agent $i$, and the base policies of the other agents as $\bar{\pi}_i = \hat{\pi}^1 \times \dots \times \hat{\pi}^i \times \pi^{i+1} \times \dots \times \pi^n$, and define $\bar{\pi}_0 = \pi$ and $\bar{\pi}_n = \bar{\pi}$. A general sequential update scheme proceeds as follows, where $L_{\bar{\pi}_{i-1}}(\bar{\pi}_i)$ is the surrogate objective for agent $i$:

$$\pi = \bar{\pi}_0 \xrightarrow[\text{update } \pi^1]{\max_{\hat{\pi}^1} L_{\pi}(\bar{\pi}_1)} \bar{\pi}_1 \to \cdots \to \bar{\pi}_{n-1} \xrightarrow[\text{update } \pi^n]{\max_{\hat{\pi}^n} L_{\bar{\pi}_{n-1}}(\bar{\pi}_n)} \bar{\pi}_n = \bar{\pi}.$$

We wish our sequential update scheme to retain the desired monotonic improvement guarantee while improving sample efficiency. Before presenting our method, we first discuss why naively updating agents sequentially with the same rollout samples fails to achieve monotonic improvement for each agent. Since agent $i$ updates its policy from $\bar{\pi}_{i-1}$, an intuitive surrogate objective (Schulman et al., 2015) for agent $i$ could be formulated as $L^I_{\bar{\pi}_{i-1}}(\bar{\pi}_i) = \mathcal{J}(\bar{\pi}_{i-1}) + O_\pi(\bar{\pi}_i)$, where $O_\pi(\bar{\pi}_i) = \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^\pi, \bar{\pi}_i)}[A^\pi(s, a)]$ and the superscript $I$ means 'intuitive'. The expected return, however, is not guaranteed to improve with such a surrogate objective, as elaborated in the following proposition.

Proposition 1. For agent $i$, let $\epsilon = \max_{s,a}|A^\pi(s, a)|$ and $\alpha_j = D^{\max}_{TV}(\pi^j \| \hat{\pi}^j)\ \forall j \in (e_i \cup \{i\})$, where $D_{TV}(p \| q)$ is the total variation distance between distributions $p$ and $q$ and $D^{\max}_{TV}(\pi \| \bar{\pi}) = \max_s D_{TV}(\pi(\cdot|s) \| \bar{\pi}(\cdot|s))$. Then we have:

$$\mathcal{J}(\bar{\pi}_i) - L^I_{\bar{\pi}_{i-1}}(\bar{\pi}_i) \le 2\epsilon\alpha_i\left(\frac{3}{1-\gamma} - \frac{2}{1-\gamma\left(1-\sum_{j\in e_i\cup\{i\}}\alpha_j\right)}\right) + \underbrace{\frac{2\epsilon\sum_{j\in e_i}\alpha_j}{1-\gamma}}_{\text{uncontrollable}} = \beta^I_i. \quad (1)$$

The proof can be found in Appx. A.3.

Remark. From Eq. (1) and the definition of $L^I_{\bar{\pi}_{i-1}}$, we know $\mathcal{J}(\bar{\pi}_i) - \mathcal{J}(\bar{\pi}_{i-1}) > O_\pi(\bar{\pi}_i) - \beta^I_i$.
Thus $\mathcal{J}(\bar{\pi}_i) > \mathcal{J}(\bar{\pi}_{i-1})$ when $O_\pi(\bar{\pi}_i) > \beta^I_i$, which can be satisfied by constraining $\beta^I_i$ and optimizing $O_\pi(\bar{\pi}_i)$. However, the term $2\epsilon\sum_{j\in e_i}\alpha_j/(1-\gamma)$ in $\beta^I_i$ is uncontrollable by agent $i$. Consequently, the upper bound $\beta^I_i$ may be large, and the expected performance $\mathcal{J}(\bar{\pi}_i)$ may not improve after optimizing $O_\pi(\bar{\pi}_i)$ when $O_\pi(\bar{\pi}_i) < \beta^I_i$, even if $\alpha_i$ is well constrained. Although one can still prove a monotonic guarantee for the joint policy by summing Eq. (1) over all agents, we will show that monotonic improvement for every single agent, if guaranteed, brings a tighter monotonic bound on the joint policy and incrementally tightens that bound as the agents are updated during a stage. Uncontrollable terms also appear in a similar analysis of HAPPO and cause the loss of monotonic improvement for a single agent.

3.3. PRECEDING-AGENT OFF-POLICY CORRECTION

The uncontrollable term in Prop. 1 arises because agent $i$ ignores how the updates of its preceding agents' policies influence its advantage function. We therefore investigate reducing the uncontrollable term in policy evaluation. Since agent $i$ is updated from $\bar{\pi}_{i-1}$, the advantage function $A^{\bar{\pi}_{i-1}}$ should be used in agent $i$'s surrogate objective rather than $A^\pi$. However, $A^{\bar{\pi}_{i-1}}$ is impractical to estimate using samples collected under $\pi$ due to the off-policyness (Munos et al., 2016) of these samples. Nevertheless, we can approximate $A^{\bar{\pi}_{i-1}}$ by correcting the discrepancy between $\bar{\pi}_{i-1}$ and $\pi$ at each time step (Harutyunyan et al., 2016). To retain the monotonic improvement properties, we propose preceding-agent off-policy correction (PreOPC), which approximates $A^{\bar{\pi}_{i-1}}$ using samples collected under $\pi$ by correcting the state probability at each step with truncated product weights:

$$A^{\pi,\bar{\pi}_{i-1}}(s_t, a_t) = \delta_t + \sum_{k\ge 1}\gamma^k\left(\prod_{j=1}^{k}\lambda\min\left(1.0, \frac{\bar{\pi}_{i-1}(a_{t+j}|s_{t+j})}{\pi(a_{t+j}|s_{t+j})}\right)\right)\delta_{t+k}, \quad (2)$$

where $\delta_t = r(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t)$ is the temporal difference error for $V(s_t)$ and $\lambda$ is a parameter controlling the bias-variance trade-off, as used in Schulman et al. (2016). The factors $\min\left(1.0, \frac{\bar{\pi}_{i-1}(a_{t+j}|s_{t+j})}{\pi(a_{t+j}|s_{t+j})}\right)\ \forall j \in \{1, \dots, k\}$ are truncated importance sampling weights, approximating the probability of $s_{t+k}$ at time step $t+k$ under $\bar{\pi}_{i-1}$. The derivation of Eq. (2) can be found in Appx. A.8. With PreOPC, the surrogate objective of agent $i$ becomes $L_{\bar{\pi}_{i-1}}(\bar{\pi}_i) = \mathcal{J}(\bar{\pi}_{i-1}) + \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^\pi, \bar{\pi}_i)}[A^{\pi,\bar{\pi}_{i-1}}(s, a)]$, and we summarize the surrogate objective of updating all agents as:

$$G_\pi(\bar{\pi}) = \mathcal{J}(\pi) + \frac{1}{1-\gamma}\sum_{i=1}^{n}\mathbb{E}_{(s,a)\sim(d^\pi, \bar{\pi}_i)}\left[A^{\pi,\bar{\pi}_{i-1}}(s, a)\right]. \quad (3)$$

We can now prove that the monotonic policy improvement guarantee for both updating one agent's policy and updating the joint policy is retained by using Eq. (3) as the surrogate objective. The detailed proofs can be found in Appx. A.4.
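To make Eq. (2) concrete, here is a minimal Python sketch of the PreOPC estimator (our own illustrative code, not the authors' implementation; the `ratios` array stands for the per-step likelihood ratio $\bar{\pi}_{i-1}(a_t|s_t)/\pi(a_t|s_t)$, and all inputs are hypothetical):

```python
# Sketch of preceding-agent off-policy correction (PreOPC), Eq. (2):
#   A(s_t,a_t) ~= delta_t
#     + sum_{k>=1} gamma^k * (prod_{j=1..k} lam * min(1, ratio_{t+j})) * delta_{t+k},
# where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). Toy inputs are hypothetical.
def preopc_advantages(rewards, values, ratios, gamma=0.99, lam=0.95):
    """values has length len(rewards)+1 (bootstrap value appended);
    ratios[t] is pi_bar_{i-1}(a_t|s_t) / pi(a_t|s_t)."""
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    advs = [0.0] * T
    for t in range(T):
        adv, weight = deltas[t], 1.0
        for k in range(1, T - t):
            # Accumulate the truncated product weight for horizon k.
            weight *= lam * min(1.0, ratios[t + k])
            adv += (gamma ** k) * weight * deltas[t + k]
        advs[t] = adv
    return advs
```

With all ratios at 1 and λ = 1 this reduces to a plain multi-step sum of TD errors; truncating each ratio at 1 keeps the variance of the product weights bounded, in the spirit of V-trace-style corrections.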
Theorem 1 (Single Agent Monotonic Bound). For agent $i$, let $\epsilon_i = \max_{s,a}|A^{\bar{\pi}_{i-1}}(s, a)|$, $\xi_i = \max_{s,a}|A^{\pi,\bar{\pi}_{i-1}}(s, a) - A^{\bar{\pi}_{i-1}}(s, a)|$, and $\alpha_j = D^{\max}_{TV}(\pi^j \| \hat{\pi}^j)\ \forall j \in (e_i \cup \{i\})$. Then we have:

$$\mathcal{J}(\bar{\pi}_i) - L_{\bar{\pi}_{i-1}}(\bar{\pi}_i) \le 4\epsilon_i\alpha_i\left(\frac{1}{1-\gamma} - \frac{1}{1-\gamma\left(1-\sum_{j\in e_i\cup\{i\}}\alpha_j\right)}\right) + \frac{\xi_i}{1-\gamma} \le \frac{4\gamma\epsilon_i}{(1-\gamma)^2}\alpha_i\sum_{j\in e_i\cup\{i\}}\alpha_j + \frac{\xi_i}{1-\gamma}. \quad (4)$$

The single agent monotonic bound depends on $\epsilon_i$, $\xi_i$, $\alpha_i$, and the total variation distances of the preceding agents. Unlike Eq. (1), we can effectively constrain this bound by controlling $\alpha_i$: $\xi_i$ decreases as agent $i$ updates its value function (Munos et al., 2016) and does not lead to an unsatisfiable bound when $\alpha_i$ is well constrained, providing the guarantee of monotonic improvement when updating a single agent. Given the above bound, we can prove the monotonic improvement of the joint policy.

Theorem 2 (Joint Monotonic Bound). For each agent $i \in \mathcal{N}$, let $\epsilon_i = \max_{s,a}|A^{\bar{\pi}_{i-1}}(s, a)|$, $\alpha_i = D^{\max}_{TV}(\pi^i \| \hat{\pi}^i)$, $\xi_i = \max_{s,a}|A^{\pi,\bar{\pi}_{i-1}}(s, a) - A^{\bar{\pi}_{i-1}}(s, a)|$, and $\epsilon = \max_i \epsilon_i$. Then we have:

$$|\mathcal{J}(\bar{\pi}) - G_\pi(\bar{\pi})| \le 4\epsilon\sum_{i=1}^{n}\alpha_i\left(\frac{1}{1-\gamma} - \frac{1}{1-\gamma\left(1-\sum_{j\in e_i\cup\{i\}}\alpha_j\right)}\right) + \sum_{i=1}^{n}\frac{\xi_i}{1-\gamma} \le \frac{4\gamma\epsilon}{(1-\gamma)^2}\sum_{i=1}^{n}\alpha_i\sum_{j\in e_i\cup\{i\}}\alpha_j + \sum_{i=1}^{n}\frac{\xi_i}{1-\gamma}. \quad (5)$$

Eq. (5) suggests a condition for monotonic improvement of the joint policy, similar to that in the remark under Prop. 1.

Table 1: Monotonic bounds of trust region algorithms, where $\alpha_i = D^{\max}_{TV}(\pi^i \| \hat{\pi}^i)$.

Algorithm | Rollout | Update scheme | Sample efficiency | Monotonic bound
RPISA | Multiple | Sequential | Low | Joint: $4\epsilon\sum_{i=1}^{n}\alpha_i\left(\frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\alpha_i)}\right)$; Single agent: $4\epsilon\alpha_i\left(\frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\alpha_i)}\right)$
MAPPO | Single | Simultaneous | High | Joint: $\frac{4\epsilon\sum_{i=1}^{n}\alpha_i}{1-\gamma}$
CoPPO | Single | Simultaneous | High | Joint: $4\epsilon\sum_{i=1}^{n}\alpha_i\left(\frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\sum_{j=1}^{n}\alpha_j)}\right)$
HAPPO | Single | Sequential | High | Joint: $4\epsilon\sum_{i=1}^{n}\alpha_i\left(\frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\sum_{j=1}^{n}\alpha_j)}\right)$; Single agent: no guarantee
A2PO (ours) | Single | Sequential | High | Joint: $4\epsilon\sum_{i=1}^{n}\alpha_i\left(\frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\sum_{j\in e_i\cup\{i\}}\alpha_j)}\right) + \sum_{i=1}^{n}\frac{\xi_i}{1-\gamma}$; Single agent: $4\epsilon_i\alpha_i\left(\frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\sum_{j\in e_i\cup\{i\}}\alpha_j)}\right) + \frac{\xi_i}{1-\gamma}$
We further prove that the joint monotonic bound is incrementally tightened when performing the policy optimization agent-by-agent during a stage, due to the single agent monotonic bound; i.e., the condition for improving $\mathcal{J}(\pi)$ is relaxed and more likely to be satisfied. The details can be found in Appx. A.5. We present the monotonic bounds of other algorithms in Tab. 1. Since $-\frac{1}{1-\gamma(1-\sum_{j\in e_i\cup\{i\}}\alpha_j)} < -\frac{1}{1-\gamma(1-\sum_{j=1}^{n}\alpha_j)}$, Eq. (5) achieves the tightest bound compared to other single rollout algorithms, provided $\xi_i\ \forall i \in \mathcal{N}$ is small enough. The assumption about $\xi_i$ is valid since preceding-agent off-policy correction is a contraction operator, which is a corollary of Theorem 1 in Munos et al. (2016). A tighter bound improves expected performance by optimizing the surrogate objective more effectively (Li et al., 2022a).
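As a quick numeric illustration of this comparison (with assumed values for γ, ε, and equal per-agent total variation distances α, not numbers from the paper), the joint bounds in Tab. 1 can be evaluated directly:

```python
# Compare joint monotonic bounds (xi terms dropped) under equal TV distances.
# gamma, eps, n, alpha are assumed illustrative values, not from the paper.
gamma, eps, n, alpha = 0.99, 1.0, 4, 0.01

def gap(total_tv):
    # The bracketed factor 1/(1-gamma) - 1/(1 - gamma*(1 - total_tv)).
    return 1 / (1 - gamma) - 1 / (1 - gamma * (1 - total_tv))

# CoPPO/HAPPO: every agent's term involves the sum over all n agents.
bound_simultaneous = 4 * eps * sum(alpha * gap(n * alpha) for _ in range(n))
# A2PO (with xi_i -> 0): agent i's term involves only e_i and agent i itself.
bound_a2po = 4 * eps * sum(alpha * gap((k + 1) * alpha) for k in range(n))
assert 0 < bound_a2po < bound_simultaneous
```

Because early agents pay only for the total variation of already-updated agents, each of their bracketed factors is strictly smaller, so the summed A2PO bound is tighter whenever more than one agent is updated.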

4. AGENT-BY-AGENT POLICY OPTIMIZATION

We first give a practical implementation for optimizing the surrogate objective $G_\pi(\bar{\pi})$. When updating agent $i$, the monotonic bound in Eq. (4) involves the total variation distances of the preceding agents and agent $i$, i.e., $\alpha_i\sum_{j\in e_i\cup\{i\}}\alpha_j$. This suggests that we can control the monotonic bound, and thereby effectively improve the expected performance, by controlling the total variation distances $\alpha_j\ \forall j \in (e_i \cup \{i\})$. We consider applying the clipping mechanism for this purpose (Queeney et al., 2021; Sun et al., 2022). In the surrogate objective of agent $i$, i.e., $\mathcal{J}(\bar{\pi}_{i-1}) + \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^\pi,\pi)}\left[\frac{\hat{\pi}^i\prod_{j\in e_i}\hat{\pi}^j}{\pi^i\prod_{j\in e_i}\pi^j}A^{\pi,\bar{\pi}_{i-1}}(s, a)\right]$, the term $\mathcal{J}(\bar{\pi}_{i-1})$ has no dependence on agent $i$, while the joint policy ratio $\frac{\hat{\pi}^i\prod_{j\in e_i}\hat{\pi}^j}{\pi^i\prod_{j\in e_i}\pi^j}$ in the advantage estimation is suitable for clipping. We further reduce the instability in estimating agent $i$'s policy gradient by clipping the joint policy ratio of the preceding agents first, with a narrower clipping range (Wu et al., 2021). Thus we apply the clipping mechanism to the joint policy ratio twice: once on the joint policy ratio of the preceding agents and once on the policy ratio of agent $i$. The practical objective for updating agent $i$ becomes:

$$\hat{L}_{\bar{\pi}_{i-1}}(\bar{\pi}_i) = \mathbb{E}_{(s,a)\sim(d^\pi,\pi)}\left[\min\left(l(s,a)A^{\pi,\bar{\pi}_{i-1}},\ \mathrm{clip}\left(l(s,a), 1 \pm \epsilon_i\right)A^{\pi,\bar{\pi}_{i-1}}\right)\right], \quad (6)$$

where $l(s,a) = \frac{\hat{\pi}^i(a^i|s)}{\pi^i(a^i|s)}g(s,a)$ and $g(s,a) = \mathrm{clip}\left(\frac{\prod_{j\in e_i}\hat{\pi}^j(a^j|s)}{\prod_{j\in e_i}\pi^j(a^j|s)}, 1 \pm \frac{\epsilon_i}{2}\right)$. The clipping parameter $\epsilon_i$ is selected as $\epsilon_i = C(\epsilon, i)$, where $\epsilon$ is the base clipping parameter and $C(\cdot, \cdot)$ is the clipping parameter adapting function. We summarize the proposed Agent-by-agent Policy Optimization (A2PO) in Alg. 1. Note that in line 6, the agent for the next update iteration is selected according to the agent selection rule $R(\cdot)$.

Algorithm 1: Agent-by-agent Policy Optimization (A2PO)
1:  Initialize the joint policy $\pi_0 = \{\pi^1_0, \dots, \pi^n_0\}$.
2:  Initialize the global value function $V$.
3:  for iteration $m = 1, 2, \dots$ do
4:      Collect data using $\pi_{m-1} = \{\pi^1_{m-1}, \dots, \pi^n_{m-1}\}$.
5:      for order $k = 1, \dots, n$ do
6:          Select an agent according to the selection rule: $i = R(k)$.
7:          Set the policy $\pi^i_m = \pi^i_{m-1}$ and the preceding agents $e_i = \{R(1), \dots, R(k-1)\}$.
8:          Form the joint policy $\bar{\pi}_i = \{\pi^i_m\} \cup \{\pi^j_m : j \in e_i\} \cup \{\pi^j_{m-1} : j \in \mathcal{N} - e_i - \{i\}\}$.
9:          Compute the advantage approximation $A^{\pi,\bar{\pi}_{i-1}}(s, a)$ via Eq. (2).
10:         Compute the value target $v(s_t) = A^{\pi,\bar{\pi}_{i-1}}(s_t, a_t) + V(s_t)$.
11:         for $P$ epochs do
12:             $\pi^i_m = \operatorname{argmax}_{\pi^i_m}\hat{L}_{\bar{\pi}_{i-1}}(\bar{\pi}_i)$ as in Eq. (6).
13:             $V = \operatorname{argmin}_V \mathbb{E}_{s\sim d^\pi}\|v(s) - V(s)\|^2$.

Eq. (6) approximates the surrogate objective of a single agent. We remark that the monotonic improvement guarantee for a single agent reveals how the update of a single agent affects the overall objective. We now discuss $R(\cdot)$ and $C(\cdot, \cdot)$ from the perspective of how coordinating the policy updates of the agents benefits the optimization of the overall surrogate objective.

Semi-greedy Agent Selection Rule. With the monotonic policy improvement guarantee on the joint policy, as shown in Thm. 2, we can effectively improve the expected performance $\mathcal{J}(\pi)$ by optimizing the surrogate objective of all agents, $G_\pi(\bar{\pi}) = \mathcal{J}(\pi) + \sum_{i=1}^{n}L_{\bar{\pi}_{i-1}}(\bar{\pi}_i)$. Since all policies except $\hat{\pi}^i$ are fixed when maximizing $L_{\bar{\pi}_{i-1}}(\bar{\pi}_i)$, maximizing $\sum_{i=1}^{n}\hat{L}_{\bar{\pi}_{i-1}}(\bar{\pi}_i)$ amounts to block coordinate ascent, i.e., iteratively updating a chosen block of coordinates (agents) while the other blocks (agents) are fixed. As a special case of the coordinate selection rule, the agent selection rule becomes crucial for convergence. On the one hand, intuitively, updating the agent with a larger absolute value of the advantage function contributes more to optimizing $G_\pi(\bar{\pi})$.
Inspired by the Gauss-Southwell rule (Gordon & Tibshirani, 2015), we propose the greedy agent selection rule, under which an agent with a larger absolute value of the expected advantage function is updated with higher priority. We verify in Appx. B.2.5 that agents with small absolute values of the advantage function also benefit from the greedy selection rule. On the other hand, purely greedy selection may lead to early convergence, which harms performance. Therefore, we introduce randomness into the agent selection rule to avoid converging too early (Lu et al., 2018). Combining these merits, we propose the semi-greedy agent selection rule:

$$R(k) = \begin{cases}\operatorname{argmax}_{i\in(\mathcal{N}-e)}\mathbb{E}_{s,a^i}\left[\left|A^{\pi,\bar{\pi}_{R(k-1)}}\right|\right], & k \bmod 2 = 0,\\ R(k) \sim \mathcal{U}(\mathcal{N}-e), & k \bmod 2 = 1,\end{cases}$$

where $e = \{R(1), \dots, R(k-1)\}$ and $\mathcal{U}$ is a uniform distribution. We verify that the semi-greedy agent selection rule contributes to the performance of A2PO in Sec. 5.2.

Adaptive Clipping Parameter. We improve sample efficiency by updating all agents using the samples collected under the base joint policy $\pi$. However, when updating agent $i$ by optimizing $\frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^\pi,\bar{\pi}_i)}[A^{\pi,\bar{\pi}_{i-1}}(s, a)]$, the expectation of the advantage function is estimated using states sampled under $\pi$ instead of $\bar{\pi}_{i-1}$, which reintroduces non-stationarity since agent $i$ cannot perceive the change of the preceding agents. With non-stationarity modeled by the state transition shift (Sun et al., 2022), we define the state transition shift encountered when updating agent $i$ as

$$\Delta^{\hat{\pi}^1,\dots,\hat{\pi}^{i-1},\pi^i,\dots,\pi^n}_{\pi^1,\dots,\pi^n}(s'|s) = \sum_a\left[\mathcal{T}(s'|s,a)\left(\bar{\pi}_{i-1}(a|s) - \pi(a|s)\right)\right].$$

The state transition shift has the following property.

Proposition 2. The state transition shift $\Delta^{\hat{\pi}^1,\dots,\hat{\pi}^{i-1},\pi^i,\dots,\pi^n}_{\pi^1,\dots,\pi^n}$ can be decomposed as follows:

$$\Delta^{\hat{\pi}^1,\dots,\hat{\pi}^{i-1},\pi^i,\dots,\pi^n}_{\pi^1,\dots,\pi^n} = \Delta^{\hat{\pi}^1,\pi^2,\dots,\pi^n}_{\pi^1,\dots,\pi^n} + \Delta^{\hat{\pi}^1,\hat{\pi}^2,\pi^3,\dots,\pi^n}_{\hat{\pi}^1,\pi^2,\dots,\pi^n} + \cdots + \Delta^{\hat{\pi}^1,\dots,\hat{\pi}^{i-1},\pi^i,\dots,\pi^n}_{\hat{\pi}^1,\dots,\hat{\pi}^{i-2},\pi^{i-1},\dots,\pi^n}.$$

Prop. 2 shows that the total state transition shift encountered by agent $i$ can be decomposed into the sum of the state transition shifts caused by each agent whose policy has been updated. Shifts caused by agents with higher priorities are encountered by more following agents and thus contribute more to the non-stationarity problem. Recall that the state transition shift effectively measures the total variation distance between policies. Therefore, in order to reduce the non-stationarity brought by the agents' policy updates, we can adaptively clip each agent's surrogate objective according to their update priorities. We propose a simple yet effective method, named adaptive clipping parameter, which adjusts the clipping parameters according to the update order: $C(\epsilon, k) = \epsilon \cdot c_\epsilon + \epsilon \cdot (1 - c_\epsilon) \cdot k/n$, where $c_\epsilon$ is a hyper-parameter. We demonstrate how agents with higher priorities affect the following agents in Fig. 2. Under the clipping mechanism, the influence of higher-priority agents is reflected in the clipping ranges of the joint policy ratio. The policy changes of the preceding agents may constrain the following agents to optimize the surrogate objective within insufficient clipping ranges, as shown on the left side of Fig. 2. The right side of Fig. 2 demonstrates that the adaptive clipping parameter method leads to balanced and sufficient clipping ranges.

Figure 2: The clipping ranges of three agents. The surface $a^1 + a^2 + a^3 = 1$ represents the policy space over three discrete actions. The agents are updated in the order 2, 3, 1. The areas in gray/pink are the clipping ranges with/without considering the joint policy ratio of the preceding agents. Left: The agents have the same clipping parameters.
The clipping range of agent 1 is insufficient due to the large variation in the policies of agent 2 and agent 3. Right: The clipping ranges are more balanced and sufficient with the adaptive clipping parameter method.
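A compact Python sketch ties together the three practical pieces above: the doubly-clipped objective of Eq. (6), the semi-greedy selection rule R(k), and the adaptive clipping parameter C(ε, k). Function and variable names are our own and the toy values are hypothetical; this is an illustration, not the released implementation.

```python
import random

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def a2po_objective(ratio_i, ratio_pre, advantage, eps_i):
    """Per-sample objective of Eq. (6): clip the preceding agents' joint ratio
    with the narrower range eps_i/2, then clip the combined ratio with eps_i."""
    g = clip(ratio_pre, 1 - eps_i / 2, 1 + eps_i / 2)
    l = ratio_i * g
    return min(l * advantage, clip(l, 1 - eps_i, 1 + eps_i) * advantage)

def adaptive_clip(eps, k, n, c_eps=0.5):
    """C(eps, k) = eps*c_eps + eps*(1 - c_eps)*k/n: agents updated earlier
    (small k) get tighter ranges, since their shifts propagate to every
    following agent (Prop. 2)."""
    return eps * c_eps + eps * (1 - c_eps) * k / n

def semi_greedy_order(agents, mean_abs_adv, rng=random):
    """Semi-greedy rule: alternate uniform picks (k odd) with greedy picks of
    the largest expected |advantage| (k even) among the remaining agents."""
    remaining, order = set(agents), []
    for k in range(1, len(agents) + 1):
        if k % 2 == 0:
            pick = max(remaining, key=lambda i: mean_abs_adv[i])
        else:
            pick = rng.choice(sorted(remaining))
        order.append(pick)
        remaining.remove(pick)
    return order
```

For example, with `eps = 0.2`, `n = 4`, and `c_eps = 0.5`, the first selected agent is optimized with clipping parameter `adaptive_clip(0.2, 1, 4)` = 0.125, while the last agent gets the full 0.2.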

5. EXPERIMENTS

In this section, we empirically evaluate and analyze A2PO on widely adopted cooperative multi-agent benchmarks, including the StarCraftII Multi-agent Challenge (SMAC) (Samvelyan et al., 2019), Multi-agent MuJoCo (MA-MuJoCo) (de Witt et al., 2020), the Multi-agent Particle Environment (MPE) (Lowe et al., 2017), and the more challenging Google Research Football (GRF) full-game scenarios (Kurach et al., 2020). Experimental results demonstrate that 1) A2PO achieves performance and efficiency superior to those of state-of-the-art MARL trust region methods, 2) A2PO has strength in encouraging coordination behaviors to complete complex cooperative tasks, and 3) PreOPC, the semi-greedy agent selection rule, and the adaptive clipping parameter method all contribute significantly to the performance improvement. We compare A2PO with advanced MARL trust region methods: MAPPO (Yu et al., 2022), CoPPO (Wu et al., 2021), and HAPPO (Kuba et al., 2022). We implement all the algorithms with parameter sharing in SMAC and MPE, and with independent parameters in MA-MuJoCo and GRF, according to the homogeneity or heterogeneity of the agents. For tasks with numerous agents, we divide the agents into blocks to keep the training time of A2PO comparable to that of the other algorithms. Full experimental details can be found in Appx. B.

5.1. PERFORMANCE AND EFFICIENCY

We evaluate the algorithms on 9 maps of SMAC with various difficulties, 14 tasks of 6 scenarios in MA-MuJoCo, and the 5-vs-5 and 11-vs-11 full game scenarios in GRF. Results in Tab. 2, Fig. 3, and Fig. 4 show that A2PO consistently outperforms the baselines and achieves higher sample efficiency on all benchmarks. More results and the experimental setups can be found in Appx. B.2.

StarCraftII Multi-agent Challenge (SMAC). As shown in Tab. 2, A2PO achieves (nearly) 100% win rates on 6 out of 9 maps and significantly outperforms the other baselines on most maps. In Tab. 2, we additionally compare with Qmix (Rashid et al., 2018), a well-known baseline in SMAC. We also observe that CoPPO and A2PO have better stability, as they both clip joint policy ratios.

Multi-agent MuJoCo environment (MA-MuJoCo). We investigate whether A2PO can scale to more complex continuous-control multi-agent tasks in MA-MuJoCo. We report the normalized score, $\frac{\text{return} - \text{minimum return}}{\text{maximum return} - \text{minimum return}}$, over all 14 tasks on the left of Fig. 3. We also present part of the results on the right of Fig. 3, where the control complexity and observation dimension, which depend on the number of the robot's joints, increase from left to right. We observe that A2PO generally shows an increasing advantage over the baselines as task complexity increases.

Google Research Football (GRF). We evaluate A2PO in GRF full-game scenarios, where agents have difficulty discovering complex coordination behaviors. A2PO obtains a nearly 100% win rate in the 5-vs-5 scenario. In both scenarios, we attribute the performance gain of A2PO to the learned coordination behavior. We analyze the experiments in GRF to verify that A2PO encourages agents to learn coordination behaviors in complex tasks. In Tab. 3, an 'Assist' is attributed to the player who passes the ball to the teammate who scores, a 'Pass' is counted when the passing-and-receiving process is finished, and 'Pass Rate' is the proportion of successful passes over pass attempts. A2PO has an advantage in passing-and-receiving coordination, leading to more assists and scores.

PreOPC. Fig. 5 shows the effects of utilizing off-policy correction in two cases: 1) correction on all agents' policies for simultaneous update algorithms, i.e., MAPPO w/ V-trace (Espeholt et al., 2018) and CoPPO w/ V-trace, and 2) correction on the preceding agents' policies for sequential update algorithms, i.e., HAPPO w/ PreOPC and A2PO. V-trace brings no general improvement to MAPPO and CoPPO, while PreOPC significantly improves the sequential update cases. PreOPC improves the performance of HAPPO significantly, while A2PO still outperforms HAPPO w/ PreOPC. The performance gap lies in the fact that A2PO clips the joint policy ratios, which matches the monotonic bound in Thm. 1. The results verify that A2PO reaches or outperforms the asymptotic performance of RPISA-PPO while using an approximated advantage function and updating all the agents with the same rollout samples. Additionally, preceding-agent off-policy correction does not increase the sensitivity of the hyper-parameter λ, as shown in Appx. B.2.5.

Agent Selection Rule. We compare different agent selection rules in Fig. 6. The 'Cyclic' rule selects agents in the order 1, ..., n; the other rules were introduced in Sec. 4. The semi-greedy rule considers both optimization acceleration and the performance balance among agents and thus performs best in all tasks.

Adaptive Clipping Parameter. We propose the adaptive clipping parameter method for balanced and sufficient clipping ranges of agents. As shown in Fig. 7, the adaptive clipping parameter contributes to the performance gain of A2PO.

6. CONCLUSION

In this paper, we investigate the potential of the sequential update scheme in coordination tasks. We introduce A2PO, a sequential algorithm using a single rollout at each stage, which guarantees monotonic improvement of both the joint policy and each agent's policy. We also show that the monotonic bound achieved by A2PO is the tightest among existing trust region MARL algorithms under the single rollout scheme. Furthermore, A2PO integrates the proposed semi-greedy agent selection rule and adaptive clipping parameter method. Experiments on various benchmarks demonstrate that A2PO consistently outperforms state-of-the-art methods in performance and sample efficiency and encourages coordination behaviors for completing complex tasks. For future work, we plan to analyze the theoretical underpinnings of agent selection rules and to study learnable methods for selecting agents and clipping parameters.

A PROOFS

A.1 NOTATIONS

We list the main notations in Tab. 4.

Table 4: The notations and symbols used in this paper.

Notation | Definition
$\mathcal{S}$ | The state space
$\mathcal{N}$ | The set of agents
$n$ | The number of agents
$i$ | The agent index
$\mathcal{A}^i$ | The action space of agent $i$
$r$ | The reward function
$\mathcal{T}$ | The transition function
$\gamma$ | The discount factor
$t$ | The time step
$s_t$ | The state at time step $t$
$a^i_t$ | The action of agent $i$ at time step $t$
$a_t$ | The joint action at time step $t$
$d^\pi$ | The discounted state visitation distribution
$\Pr$ | The state probability function
$V$ | The value function
$A$ | The advantage function
$\tau$ | The trajectory of an episode
$e$ | A set of preceding agents
$e_i$ | The set of preceding agents updated before agent $i$
$\pi^i$ | The policy of agent $i$
$\hat{\pi}^i$ | The updated policy of agent $i$
$\pi$ | The joint policy
$\lambda$ | The bias-variance balance parameter
$\bar{\pi}$ | The joint target policy
$\bar{\pi}_i$ | The joint policy after updating agent $i$
$\mathcal{J}(\pi)$ | The expected return / performance of the joint policy $\pi$
$L_{\bar{\pi}_{i-1}}(\bar{\pi}_i)$ | The surrogate objective of agent $i$
$L^I_{\bar{\pi}_{i-1}}(\bar{\pi}_i)$ | An intuitive surrogate objective of agent $i$
$G_\pi(\bar{\pi})$ | The surrogate objective of all agents
$\epsilon$ | The upper bound of an advantage function
$D_{TV}$ | The total variation distance function
$\alpha$ | The total variation distance between two policies
$\xi_i$ | The off-policy correction error of $\bar{\pi}_{i-1}$
$C$ | The clipping parameter adaptation function
$R$ | The agent selection function

A.2 USEFUL LEMMAS

Lemma 1 (Multi-agent Policy Performance Difference Lemma). Given any joint policies $\pi$ and $\bar\pi$, the difference between the performances of the two joint policies can be expressed as:

$$\mathcal{J}(\bar\pi) - \mathcal{J}(\pi) = \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim(d^{\bar\pi},\bar\pi)}\big[A^{\pi}(s,a)\big],$$

where $d^{\pi}(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,Pr(s_t = s \mid \pi)$ is the normalized discounted state visitation distribution. The proof can be found in Kuba et al. (2022). $\Box$

For convenience, we give some properties and definitions of coupling and the definition of an $\alpha$-coupled policy pair (Schulman et al., 2015) here.

Definition 1 (Coupling). A coupling of two probability distributions $\mu$ and $\nu$ is a pair of random variables $(X, Y)$ such that the marginal distribution of $X$ is $\mu$ and the marginal distribution of $Y$ is $\nu$, i.e., $Pr(X = x) = \mu(x)$ and $Pr(Y = y) = \nu(y)$.

Proposition 3. For any coupling $(X, Y)$, $D_{TV}(\mu\|\nu) \le Pr(X \ne Y)$.

Proposition 4. There exists a coupling $(X, Y)$ such that $D_{TV}(\mu\|\nu) = Pr(X \ne Y)$.

Corollary 1. For all $s$, there exists a coupling of $(\pi(\cdot|s), \bar\pi(\cdot|s))$ such that $Pr(a = \bar a) \ge 1 - D^{\max}_{TV}(\pi\|\bar\pi)$ for $a \sim \pi(\cdot|s)$, $\bar a \sim \bar\pi(\cdot|s)$.

Proof. By Prop. 4, there exists a coupling of $(\pi(\cdot|s), \bar\pi(\cdot|s))$ such that

$$1 - Pr(a = \bar a) = Pr(a \ne \bar a) = D_{TV}(\pi(\cdot|s)\|\bar\pi(\cdot|s)) \le D^{\max}_{TV}(\pi\|\bar\pi). \qquad \Box$$

Corollary 2. For all $s$, $D_{TV}(\pi(\cdot|s)\|\bar\pi(\cdot|s)) \le \sum_{i=1}^{n} D_{TV}(\pi^i(\cdot|s)\|\bar\pi^i(\cdot|s))$.

Proof. We denote $\pi^i(\cdot|s)$ as $\pi^i(\cdot)$ for brevity.

$$\begin{aligned}
D_{TV}(\pi(\cdot|s)\|\bar\pi(\cdot|s))
&= \frac12 \sum_{a^1,\dots,a^n} \Big| \prod_{i=1}^n \pi^i(a^i) - \prod_{i=1}^n \bar\pi^i(a^i) \Big| \\
&= \frac12 \sum_{a^1,\dots,a^n} \Big| \prod_{i=1}^n \pi^i(a^i) - \pi^1(a^1)\prod_{i=2}^n \bar\pi^i(a^i) + \pi^1(a^1)\prod_{i=2}^n \bar\pi^i(a^i) - \prod_{i=1}^n \bar\pi^i(a^i) \Big| \\
&\le \frac12 \sum_{a^1} \pi^1(a^1) \sum_{a^2,\dots,a^n} \Big| \prod_{i=2}^n \pi^i(a^i) - \prod_{i=2}^n \bar\pi^i(a^i) \Big| + \frac12 \sum_{a^1} \big| \pi^1(a^1) - \bar\pi^1(a^1) \big| \sum_{a^2,\dots,a^n} \prod_{i=2}^n \bar\pi^i(a^i) \\
&= \frac12 \sum_{a^2,\dots,a^n} \Big| \prod_{i=2}^n \pi^i(a^i) - \prod_{i=2}^n \bar\pi^i(a^i) \Big| + \frac12 \sum_{a^1} \big| \pi^1(a^1) - \bar\pi^1(a^1) \big| \\
&\;\;\vdots \\
&\le \frac12 \sum_{i=1}^n \sum_{a^i} \big| \pi^i(a^i) - \bar\pi^i(a^i) \big| = \sum_{i=1}^n D_{TV}(\pi^i(\cdot|s)\|\bar\pi^i(\cdot|s)). \qquad \Box
\end{aligned}$$

Definition 2 ($\alpha$-coupled policy pair). $(\pi, \bar\pi)$ is an $\alpha$-coupled policy pair if, for all $s$, $Pr(a \ne \bar a \mid s) \le \alpha$ for $a \sim \pi(\cdot|s)$, $\bar a \sim \bar\pi(\cdot|s)$.
From Corollaries 1 and 2, we know that, given any joint policy pair $\pi$ and $\bar\pi$, selecting $\alpha = D^{\max}_{TV}(\pi(\cdot|s)\|\bar\pi(\cdot|s))$ makes $(\pi, \bar\pi)$ an $\alpha$-coupled policy pair such that, for all $s$,

$$Pr(a \ne \bar a \mid s) \le D^{\max}_{TV}(\pi(\cdot|s)\|\bar\pi(\cdot|s)) \le \sum_{i=1}^n \alpha^i, \quad \text{where } \alpha^i = D^{\max}_{TV}(\pi^i\|\bar\pi^i).$$

Lemma 2. Given any joint policies $\pi$ and $\bar\pi$, if $(\pi, \bar\pi)$ is a coupled policy pair, the following inequality holds:

$$\big|\mathbb{E}_{\bar a\sim\bar\pi}\big[A^{\pi}(s,\bar a)\big]\big| \le 2\epsilon \sum_{i=1}^n \alpha^i,$$

where $\alpha^i = D^{\max}_{TV}(\pi^i\|\bar\pi^i)$ and $\epsilon = \max_{s,a} |A^{\pi}(s,a)|$.

Proof. Note that $\mathbb{E}_{a\sim\pi}[A^{\pi}(s,a)] = 0$. We have

$$\begin{aligned}
\big|\mathbb{E}_{\bar a\sim\bar\pi}[A^{\pi}(s,\bar a)]\big|
&= \big|\mathbb{E}_{\bar a\sim\bar\pi}[A^{\pi}(s,\bar a)] - \mathbb{E}_{a\sim\pi}[A^{\pi}(s,a)]\big| \\
&= \big|\mathbb{E}_{(\bar a,a)\sim(\bar\pi,\pi)}[A^{\pi}(s,\bar a) - A^{\pi}(s,a)]\big| \\
&= Pr(\bar a \ne a \mid s)\,\big|\mathbb{E}_{(\bar a,a)\sim(\bar\pi,\pi)}[A^{\pi}(s,\bar a) - A^{\pi}(s,a) \mid \bar a \ne a]\big| \\
&\le \sum_{i=1}^n \alpha^i \cdot \mathbb{E}_{(\bar a,a)\sim(\bar\pi,\pi)}\big[|A^{\pi}(s,\bar a) - A^{\pi}(s,a)|\big] \\
&\le \sum_{i=1}^n \alpha^i \cdot 2\max_{s,a}|A^{\pi}(s,a)|. \qquad \Box
\end{aligned}$$
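As a quick numerical sanity check of Corollary 2 (a sketch with made-up distributions, not part of the proof), the total variation distance between two product joint policies never exceeds the sum of the per-agent distances:

```python
import numpy as np

def tv(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(0)
def rand_policy(k=3):
    p = rng.random(k)
    return p / p.sum()

# Two agents; the joint policies are products of the local policies.
pi = [rand_policy(), rand_policy()]
pi_bar = [rand_policy(), rand_policy()]
joint = np.outer(pi[0], pi[1]).ravel()
joint_bar = np.outer(pi_bar[0], pi_bar[1]).ravel()

lhs = tv(joint, joint_bar)
rhs = sum(tv(a, b) for a, b in zip(pi, pi_bar))
assert lhs <= rhs + 1e-12  # Corollary 2
```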

□

Lemma 3 (Multi-agent Advantage Discrepancy Lemma). Given any joint policies $\pi_1$, $\pi_2$ and $\pi_3$, if $(\pi_1, \pi_2)$ and $(\pi_2, \pi_3)$ are coupled policy pairs, the following inequality holds:

$$\Big|\mathbb{E}_{(s_t,a_t)\sim(Pr^{\pi_2},\pi_2)}\big[A^{\pi_1}\big] - \mathbb{E}_{(s_t,\bar a_t)\sim(Pr^{\pi_3},\pi_2)}\big[A^{\pi_1}\big]\Big| \le 4\epsilon^{\pi_1}\, D^{\max}_{TV}(\pi_1\|\pi_2)\,\big(1-(1-D^{\max}_{TV}(\pi_2\|\pi_3))^t\big),$$

where $\epsilon^{\pi_1} = \max_{s,a} |A^{\pi_1}(s, a)|$ and we write $A$ for $A(s,a)$ for brevity.

Proof. Let $n_t$ denote the number of timesteps before timestamp $t$ at which the coupled samples disagree, i.e., $a \ne \bar a$ for $a \sim \pi_2$, $\bar a \sim \pi_3$. Then

$$\begin{aligned}
&\Big|\mathbb{E}_{(s_t,a_t)\sim(Pr^{\pi_2},\pi_2)}\big[A^{\pi_1}\big] - \mathbb{E}_{(s_t,\bar a_t)\sim(Pr^{\pi_3},\pi_2)}\big[A^{\pi_1}\big]\Big| \\
&= Pr(n_t > 0) \cdot \Big|\mathbb{E}_{(s_t,a_t)\sim(Pr^{\pi_2},\pi_2)\mid n_t>0}\big[A^{\pi_1}\big] - \mathbb{E}_{(s_t,\bar a_t)\sim(Pr^{\pi_3},\pi_2)\mid n_t>0}\big[A^{\pi_1}\big]\Big| \\
&\overset{(a)}{=} \big(1 - Pr(n_t = 0)\big) \cdot E \\
&\le \Big(1 - \prod_{k=1}^{t} Pr\big(a_k = \bar a_k \mid a_k \sim \pi_2(\cdot|s_k),\, \bar a_k \sim \pi_3(\cdot|s_k)\big)\Big) \cdot E \\
&\overset{(b)}{\le} \Big(1 - \prod_{k=1}^{t}\big(1 - D^{\max}_{TV}(\pi_2\|\pi_3)\big)\Big) \cdot E \\
&= \big(1-(1-D^{\max}_{TV}(\pi_2\|\pi_3))^t\big) \cdot E \\
&\le \big(1-(1-D^{\max}_{TV}(\pi_2\|\pi_3))^t\big) \cdot 2 \cdot 2\, D^{\max}_{TV}(\pi_1\|\pi_2)\, \epsilon^{\pi_1} \\
&= 4\epsilon^{\pi_1}\, D^{\max}_{TV}(\pi_1\|\pi_2)\,\big(1-(1-D^{\max}_{TV}(\pi_2\|\pi_3))^t\big).
\end{aligned}$$

In (a), we denote $\big|\mathbb{E}_{(s_t,a_t)\sim(Pr^{\pi_2},\pi_2)\mid n_t>0}[A^{\pi_1}] - \mathbb{E}_{(s_t,\bar a_t)\sim(Pr^{\pi_3},\pi_2)\mid n_t>0}[A^{\pi_1}]\big|$ as $E$ for brevity; each of the two expectations is bounded by $2 D^{\max}_{TV}(\pi_1\|\pi_2)\,\epsilon^{\pi_1}$ by Lemma 2, hence the factor $2 \cdot 2$. (b) follows the definition of an $\alpha$-coupled policy pair. $\Box$

We provide a useful equation for the normalized discounted state visitation distribution here.
Proposition 5.

$$\begin{aligned}
\mathbb{E}_{(s,a)\sim(d^{\pi_1},\pi_2)}[f(s,a)]
&= (1-\gamma)\sum_{s}\sum_{t=0}^{\infty}\gamma^t\, Pr(s_t = s \mid \pi_1) \sum_{a}\pi_2(a|s)\, f(s,a) \\
&= (1-\gamma)\sum_{t=0}^{\infty}\gamma^t \sum_{s} Pr(s_t = s \mid \pi_1) \sum_{a}\pi_2(a|s)\, f(s,a) \\
&= (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\, \mathbb{E}_{(s_t,a_t)\sim(Pr^{\pi_1},\pi_2)}[f(s_t,a_t)].
\end{aligned}$$

A.3 PROOFS OF INTUITIVE SEQUENTIAL UPDATE

$$\begin{aligned}
&\Big|\mathcal{J}(\bar\pi_i) - \mathcal{J}(\bar\pi_{i-1}) - \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^{\pi},\bar\pi_i)}\big[A^{\pi}\big]\Big| \\
&\le \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^{\bar\pi_i},\bar\pi_i)}\big[A^{\bar\pi_{i-1}}\big] - \mathbb{E}_{(s,a)\sim(d^{\pi},\bar\pi_i)}\big[A^{\pi}\big]\Big| \\
&\le \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^{\bar\pi_i},\bar\pi_i)}\big[A^{\bar\pi_{i-1}}\big] - \mathbb{E}_{(s,a)\sim(d^{\pi},\bar\pi_i)}\big[A^{\bar\pi_{i-1}}\big]\Big| + \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^{\pi},\bar\pi_i)}\big[A^{\bar\pi_{i-1}} - A^{\pi}\big]\Big| \\
&\le 4\epsilon^{\bar\pi_{i-1}}\alpha^i \sum_{t=0}^{\infty}\gamma^t\Big(1-\Big(1-\sum_{j\in e^i\cup\{i\}}\alpha^j\Big)^t\Big) + \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^{\pi},\bar\pi_i)}\big[A^{\bar\pi_{i-1}} - A^{\pi}\big]\Big| \\
&\le 4\epsilon^{\bar\pi_{i-1}}\alpha^i\Big(\frac{1}{1-\gamma} - \frac{1}{1-\gamma\big(1-\sum_{j\in e^i\cup\{i\}}\alpha^j\big)}\Big) + \frac{1}{1-\gamma}\Big(4\alpha^i\epsilon^{\bar\pi_{i-1}} + 2\sum_{j\in e^i}\alpha^j\epsilon^{\pi}\Big).
\end{aligned}$$

A.4 PROOFS OF MONOTONIC POLICY IMPROVEMENT OF A2PO

Theorem 1 (Single-Agent Monotonic Bound). For agent $i$, let $\epsilon^i = \max_{s,a}|A^{\bar\pi_{i-1}}(s,a)|$, $\xi^i = \max_{s,a}|A^{\pi,\bar\pi_{i-1}}(s,a) - A^{\bar\pi_{i-1}}(s,a)|$, and $\alpha^j = D^{\max}_{TV}(\pi^j\|\bar\pi^j)$ for all $j \in e^i\cup\{i\}$; then we have:

$$\big|\mathcal{J}(\bar\pi_i) - L_{\bar\pi_{i-1}}(\bar\pi_i)\big| \le 4\epsilon^i\alpha^i\Big(\frac{1}{1-\gamma} - \frac{1}{1-\gamma\big(1-\sum_{j\in e^i\cup\{i\}}\alpha^j\big)}\Big) + \frac{\xi^i}{1-\gamma} \le \frac{4\gamma\epsilon^i}{(1-\gamma)^2}\,\alpha^i\sum_{j\in e^i\cup\{i\}}\alpha^j + \frac{\xi^i}{1-\gamma}.$$

Proof. Using Lemma 3 and Prop. 5, we get

$$\begin{aligned}
&\Big|\mathcal{J}(\bar\pi_i) - \mathcal{J}(\bar\pi_{i-1}) - \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^{\pi},\bar\pi_i)}\big[A^{\pi,\bar\pi_{i-1}}\big]\Big| \\
&= \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^{\bar\pi_i},\bar\pi_i)}\big[A^{\bar\pi_{i-1}}\big] - \mathbb{E}_{(s,a)\sim(d^{\pi},\bar\pi_i)}\big[A^{\pi,\bar\pi_{i-1}}\big]\Big| \\
&\le \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^{\bar\pi_i},\bar\pi_i)}\big[A^{\bar\pi_{i-1}}\big] - \mathbb{E}_{(s,a)\sim(d^{\pi},\bar\pi_i)}\big[A^{\bar\pi_{i-1}}\big]\Big| + \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^{\pi},\bar\pi_i)}\big[A^{\bar\pi_{i-1}} - A^{\pi,\bar\pi_{i-1}}\big]\Big| \\
&\le 4\epsilon^{\bar\pi_{i-1}}\alpha^i \sum_{t=0}^{\infty}\gamma^t\Big(1-\Big(1-\sum_{j\in e^i\cup\{i\}}\alpha^j\Big)^t\Big) + \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^{\pi},\bar\pi_i)}\big[A^{\bar\pi_{i-1}} - A^{\pi,\bar\pi_{i-1}}\big]\Big| \\
&\le 4\epsilon^{\bar\pi_{i-1}}\alpha^i\Big(\frac{1}{1-\gamma} - \frac{1}{1-\gamma\big(1-\sum_{j\in e^i\cup\{i\}}\alpha^j\big)}\Big) + \frac{\xi^i}{1-\gamma}. \qquad \Box
\end{aligned}$$

Theorem 2 (Joint Monotonic Bound). For each agent $i \in N$, let $\epsilon^i = \max_{s,a}|A^{\bar\pi_{i-1}}(s,a)|$, $\alpha^i = D^{\max}_{TV}(\pi^i\|\bar\pi^i)$, $\xi^i = \max_{s,a}|A^{\pi,\bar\pi_{i-1}}(s,a) - A^{\bar\pi_{i-1}}(s,a)|$, and $\epsilon = \max_i \epsilon^i$; then we have:

$$|\mathcal{J}(\bar\pi) - G_{\pi}(\bar\pi)| \le 4\epsilon\sum_{i=1}^{n}\alpha^i\Big(\frac{1}{1-\gamma} - \frac{1}{1-\gamma\big(1-\sum_{j\in e^i\cup\{i\}}\alpha^j\big)}\Big) + \sum_{i=1}^{n}\frac{\xi^i}{1-\gamma} \le \frac{4\gamma\epsilon}{(1-\gamma)^2}\sum_{i=1}^{n}\alpha^i\sum_{j\in e^i\cup\{i\}}\alpha^j + \sum_{i=1}^{n}\frac{\xi^i}{1-\gamma}.$$

Proof.
$$\begin{aligned}
|\mathcal{J}(\bar\pi) - G_\pi(\bar\pi)| &= \Big|\mathcal{J}(\bar\pi) - \mathcal{J}(\pi) - \frac{1}{1-\gamma}\sum_{i=1}^{n} \mathbb{E}_{(s,a)\sim(d^{\pi},\bar\pi_i)}\big[A^{\pi,\bar\pi_{i-1}}(s,a)\big]\Big| \\
&= \Big|\mathcal{J}(\bar\pi_n) - \mathcal{J}(\bar\pi_{n-1}) + \dots + \mathcal{J}(\bar\pi_1) - \mathcal{J}(\bar\pi_0) - \frac{1}{1-\gamma}\sum_{i=1}^{n} \mathbb{E}_{(s,a)\sim(d^{\pi},\bar\pi_i)}\big[A^{\pi,\bar\pi_{i-1}}(s,a)\big]\Big| \\
&\le \sum_{i=1}^{n} \Big|\mathcal{J}(\bar\pi_i) - \mathcal{J}(\bar\pi_{i-1}) - \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^{\pi},\bar\pi_i)}\big[A^{\pi,\bar\pi_{i-1}}(s,a)\big]\Big| \\
&\le 4\epsilon\sum_{i=1}^{n}\alpha^i\Big(\frac{1}{1-\gamma} - \frac{1}{1-\gamma\big(1-\sum_{j\in e^i\cup\{i\}}\alpha^j\big)}\Big) + \sum_{i=1}^{n}\frac{\xi^i}{1-\gamma} \\
&\le \frac{4\gamma\epsilon}{(1-\gamma)^2}\sum_{i=1}^{n}\Big(\alpha^i\sum_{j\in e^i\cup\{i\}}\alpha^j\Big) + \sum_{i=1}^{n}\frac{\xi^i}{1-\gamma}. \qquad \Box
\end{aligned}$$

A.5 PROOFS OF INCREMENTALLY TIGHTENED BOUND OF A2PO

Assume agent $k$ is the next to be updated in the order $1, \dots, n$. Since $\bar\pi_{k-1}$ is already known, we have

$$\begin{aligned}
|\mathcal{J}(\bar\pi) - G_\pi(\bar\pi)| &\le \sum_{i=1}^{k-1}\big|\mathcal{J}(\bar\pi_i) - L_{\bar\pi_{i-1}}(\bar\pi_i)\big| + 4\epsilon\sum_{i=k}^{n}\alpha^i\Big(\frac{1}{1-\gamma} - \frac{1}{1-\gamma\big(1-\sum_{j\in e^i\cup\{i\}}\alpha^j\big)}\Big) + \sum_{i=k}^{n}\frac{\xi^i}{1-\gamma} \\
&\le \sum_{i=1}^{k-2}\big|\mathcal{J}(\bar\pi_i) - L_{\bar\pi_{i-1}}(\bar\pi_i)\big| + 4\epsilon\sum_{i=k-1}^{n}\alpha^i\Big(\frac{1}{1-\gamma} - \frac{1}{1-\gamma\big(1-\sum_{j\in e^i\cup\{i\}}\alpha^j\big)}\Big) + \sum_{i=k-1}^{n}\frac{\xi^i}{1-\gamma} \\
&\;\;\vdots \\
&\le 4\epsilon\sum_{i=1}^{n}\alpha^i\Big(\frac{1}{1-\gamma} - \frac{1}{1-\gamma\big(1-\sum_{j\in e^i\cup\{i\}}\alpha^j\big)}\Big) + \sum_{i=1}^{n}\frac{\xi^i}{1-\gamma}.
\end{aligned}$$

Thus the condition for improving $\mathcal{J}(\bar\pi)$ is progressively relaxed as agents are updated within a stage.

A.6 PROOFS OF MONOTONIC POLICY IMPROVEMENT OF MAPPO, COPPO AND HAPPO

In this section, we prove the monotonic policy improvement of MAPPO and unify the formats of the monotonic bounds of CoPPO and HAPPO, without considering the parameter-sharing method.

MAPPO. For MAPPO, the surrogate objective is

$$L_\pi(\bar\pi) = \sum_{i=1}^{n}\Big(\mathcal{J}(\pi) + \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim(d^\pi,\pi)}\Big[\frac{\bar\pi^i}{\pi^i}A^\pi\Big]\Big).$$

We first prove that, for agent $i$, $\big|\mathcal{J}(\bar\pi) - \mathcal{J}(\pi) - \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^\pi,\pi)}\big[\frac{\bar\pi^i}{\pi^i}A^\pi\big]\big|$ is bounded:

$$\begin{aligned}
&\Big|\mathcal{J}(\bar\pi) - \mathcal{J}(\pi) - \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^\pi,\pi)}\Big[\frac{\bar\pi^i}{\pi^i}A^\pi\Big]\Big| \\
&= \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^{\bar\pi},\bar\pi)}\big[A^\pi\big] - \mathbb{E}_{(s,a)\sim(d^\pi,\pi)}\Big[\frac{\bar\pi^i}{\pi^i}A^\pi\Big]\Big| \\
&= \Big|\sum_{t=0}^{\infty}\gamma^t\Big(\mathbb{E}_{(s_t,a_t)\sim(Pr^{\bar\pi},\bar\pi)}\big[A^\pi\big] - \mathbb{E}_{(s_t,a_t)\sim(Pr^{\pi},\pi)}\Big[\frac{\bar\pi^i}{\pi^i}A^\pi\Big]\Big)\Big| \\
&\le \sum_{t=0}^{\infty} 2\gamma^t\Big(\Big(\sum_{j=1}^{n}\alpha^j\Big)\epsilon^\pi + \alpha^i\epsilon^\pi\Big) = \frac{2\epsilon^\pi}{1-\gamma}\Big(\alpha^i + \sum_{j=1}^{n}\alpha^j\Big).
\end{aligned}$$

Summing the bounds over all agents and taking the average, we get

$$\Big|\mathcal{J}(\bar\pi) - \mathcal{J}(\pi) - \frac{1}{n}\,\frac{1}{1-\gamma}\sum_{i=1}^{n}\mathbb{E}_{(s,a)\sim(d^\pi,\pi)}\Big[\frac{\bar\pi^i}{\pi^i}A^\pi\Big]\Big| \le \frac{2\epsilon^\pi}{1-\gamma}\,\frac{n+1}{n}\sum_{j=1}^{n}\alpha^j.$$

Finally, the monotonic bound for MAPPO is

$$\begin{aligned}
&\Big|\mathcal{J}(\bar\pi) - \mathcal{J}(\pi) - \frac{1}{1-\gamma}\sum_{i=1}^{n}\mathbb{E}_{(s,a)\sim(d^\pi,\pi)}\Big[\frac{\bar\pi^i}{\pi^i}A^\pi\Big]\Big| \\
&\le \Big|\mathcal{J}(\bar\pi) - \mathcal{J}(\pi) - \frac{1}{n}\,\frac{1}{1-\gamma}\sum_{i=1}^{n}\mathbb{E}_{(s,a)\sim(d^\pi,\pi)}\Big[\frac{\bar\pi^i}{\pi^i}A^\pi\Big]\Big| + \frac{n-1}{n}\,\frac{1}{1-\gamma}\Big|\sum_{i=1}^{n}\mathbb{E}_{(s,a)\sim(d^\pi,\pi)}\Big[\frac{\bar\pi^i}{\pi^i}A^\pi\Big]\Big| \\
&\le \frac{2\epsilon^\pi}{1-\gamma}\,\frac{n+1}{n}\sum_{j=1}^{n}\alpha^j + \frac{n-1}{n}\sum_{i=1}^{n}\frac{1}{1-\gamma}\,\alpha^i\cdot 2\epsilon^\pi = \frac{4\epsilon^\pi}{1-\gamma}\sum_{i=1}^{n}\alpha^i.
\end{aligned}$$

CoPPO. We prove the results of CoPPO in a unified and convenient form. For CoPPO, $L_\pi(\bar\pi) = \mathcal{J}(\pi) + \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^\pi,\bar\pi)}[A^\pi(s,a)]$; we prove the bound using Lemma 3:

$$\begin{aligned}
\Big|\mathcal{J}(\bar\pi) - \mathcal{J}(\pi) - \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^\pi,\bar\pi)}\big[A^\pi\big]\Big|
&\le \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^{\bar\pi},\bar\pi)}\big[A^\pi\big] - \mathbb{E}_{(s,a)\sim(d^\pi,\bar\pi)}\big[A^\pi\big]\Big| \\
&\le \sum_{t=0}^{\infty}\gamma^t\Big|\mathbb{E}_{(s,a)\sim(Pr^{\bar\pi},\bar\pi)}\big[A^\pi\big] - \mathbb{E}_{(s,a)\sim(Pr^{\pi},\bar\pi)}\big[A^\pi\big]\Big| \\
&\le 4\epsilon^\pi\sum_{t=0}^{\infty}\gamma^t\sum_{i=1}^{n}\alpha^i\big(1-(1-D^{\max}_{TV}(\pi\|\bar\pi))^t\big) \\
&\le 4\epsilon^\pi\sum_{i=1}^{n}\alpha^i\Big(\frac{1}{1-\gamma} - \frac{1}{1-\gamma\big(1-\sum_{j=1}^{n}\alpha^j\big)}\Big).
\end{aligned}$$

HAPPO. Following the proof of Lemma 2 in Kuba et al. (2022), we know that HAPPO has the same monotonic improvement bound as that of CoPPO. For the monotonic improvement of a single agent, we formulate the surrogate objective of agent $i$ under HAPPO as $\mathcal{J}(\bar\pi_{i-1}) + \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^\pi,\bar\pi_i)}[A^\pi(s,a)] - \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^\pi,\bar\pi_{i-1})}[A^\pi(s,a)]$, as shown in Proposition 3 of Kuba et al. (2022). Following the proof of Thm. 1, we get the following inequality.
$$\begin{aligned}
&\Big|\mathcal{J}(\bar\pi_i) - \mathcal{J}(\bar\pi_{i-1}) - \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^\pi,\bar\pi_i)}\big[A^\pi\big] + \frac{1}{1-\gamma}\mathbb{E}_{(s,a)\sim(d^\pi,\bar\pi_{i-1})}\big[A^\pi(s,a)\big]\Big| \\
&\le \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^{\bar\pi_i},\bar\pi_i)}\big[A^{\bar\pi_{i-1}}\big] - \mathbb{E}_{(s,a)\sim(d^\pi,\bar\pi_i)}\big[A^\pi\big]\Big| + \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^\pi,\bar\pi_{i-1})}\big[A^\pi(s,a)\big]\Big| \\
&\le \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^{\bar\pi_i},\bar\pi_i)}\big[A^{\bar\pi_{i-1}}\big] - \mathbb{E}_{(s,a)\sim(d^\pi,\bar\pi_i)}\big[A^{\bar\pi_{i-1}}\big]\Big| + \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^\pi,\bar\pi_i)}\big[A^{\bar\pi_{i-1}} - A^\pi\big]\Big| + \frac{2}{1-\gamma}\sum_{j\in e^i}\alpha^j\epsilon^\pi \\
&\le 4\epsilon^{\bar\pi_{i-1}}\alpha^i\sum_{t=0}^{\infty}\gamma^t\Big(1-\Big(1-\sum_{j\in e^i\cup\{i\}}\alpha^j\Big)^t\Big) + \frac{1}{1-\gamma}\Big|\mathbb{E}_{(s,a)\sim(d^\pi,\bar\pi_i)}\big[A^{\bar\pi_{i-1}} - A^\pi\big]\Big| + \frac{2}{1-\gamma}\sum_{j\in e^i}\alpha^j\epsilon^\pi \\
&\le 4\epsilon^{\bar\pi_{i-1}}\alpha^i\Big(\frac{1}{1-\gamma} - \frac{1}{1-\gamma\big(1-\sum_{j\in e^i\cup\{i\}}\alpha^j\big)}\Big) + \frac{1}{1-\gamma}\Big(4\alpha^i\epsilon^{\bar\pi_{i-1}} + 4\sum_{j\in e^i}\alpha^j\epsilon^\pi\Big).
\end{aligned}$$

The right-hand side of the last inequality is not a monotonic improvement bound, i.e., it does not provide a guarantee for improving the expected performance $\mathcal{J}(\bar\pi_i)$, since the term $\sum_{j\in e^i}\alpha^j\epsilon^\pi$ is not controllable by agent $i$, whether through policy improvement or value learning. This uncontrollable term means the expected performance may not be improved even if the total variation distances between consecutive policies are well constrained.

A.7 COMPARISONS ON MONOTONIC IMPROVEMENT BOUNDS

CoPPO and HAPPO have the same monotonic bound, which is tighter than that of MAPPO. A2PO achieves the tightest monotonic bound given mild assumptions on the errors of preceding-agent off-policy correction; these assumptions are valid and easy to satisfy since preceding-agent off-policy correction is a contraction operator. A sufficient condition for A2PO to have the tightest bound is that, for all $i \in N$,

$$\xi^i < \frac{\gamma(1-\gamma)\sum_{j\in N\setminus(e^i\cup\{i\})}\alpha^j}{\Big(1-\gamma\big(1-\sum_{j\in e^i\cup\{i\}}\alpha^j\big)\Big)\Big(1-\gamma\big(1-\sum_{j=1}^{n}\alpha^j\big)\Big)}.$$
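To make the sufficient condition concrete, the sketch below evaluates the threshold that $\xi^i$ must stay below, with hypothetical values for $\gamma$ and the $\alpha$'s; `xi_threshold` is our illustrative helper, not code from the paper.

```python
def xi_threshold(gamma, alphas, e_i, i):
    """Threshold on the correction error xi^i below which A2PO's bound is
    the tightest (sufficient condition of Appx. A.7)."""
    # Numerator: gamma * (1 - gamma) * sum of alphas of agents not yet updated.
    not_updated = [a for j, a in enumerate(alphas) if j != i and j not in e_i]
    num = gamma * (1 - gamma) * sum(not_updated)
    # Denominator: product of the two geometric-series terms from the bounds.
    den = (1 - gamma * (1 - sum(alphas[j] for j in e_i) - alphas[i])) * \
          (1 - gamma * (1 - sum(alphas)))
    return num / den

# Hypothetical setting: 3 agents, gamma = 0.99, small per-agent policy shifts.
thr = xi_threshold(0.99, [0.01, 0.02, 0.015], e_i={0}, i=1)
assert thr > 0.0  # the condition is satisfiable while some agents remain
```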

A.8 PRECEDING-AGENT OFF-POLICY CORRECTION

In Retrace($\lambda$) (Munos et al., 2016), taking the current policy as $\bar\pi_{i-1}$ and the base policy as $\pi$, we have the following definition:

$$\mathcal{R}_t = r_t + \gamma Q_{t+1} + \sum_{k\ge1}\gamma^k\Big(\prod_{j=1}^{k}\lambda\min\Big(1.0,\ \frac{\bar\pi_{i-1}(a_{t+j}|s_{t+j})}{\pi(a_{t+j}|s_{t+j})}\Big)\Big)\big(r_{t+k}+\gamma Q_{t+k+1}-Q_{t+k}\big).$$

Following the same structure with state values, we have:

$$\mathcal{R}_t = r_t + \gamma V_{t+1} + \sum_{k\ge1}\gamma^k\Big(\prod_{j=1}^{k}\lambda\min\Big(1.0,\ \frac{\bar\pi_{i-1}(a_{t+j}|s_{t+j})}{\pi(a_{t+j}|s_{t+j})}\Big)\Big)\big(r_{t+k}+\gamma V_{t+k+1}-V_{t+k}\big).$$

Subtracting $V_t$ yields the definition of PreOPC. Alternatively, one can obtain $\gamma A^{\pi,\bar\pi_{i-1}}$ by substituting $r_t + \gamma V_{t+1}$ for $Q_t$ and subtracting $r_t + \gamma V_{t+1}$.
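The V-form correction above can be sketched as follows; `preopc_advantage` is our illustrative name, and with all ratios equal to 1 and $\lambda = 1$ the estimator reduces to the on-policy Monte-Carlo advantage:

```python
import numpy as np

def preopc_advantage(rewards, values, ratios, gamma=0.99, lam=0.95):
    """Sketch of the corrected advantage R_t - V_t: TD errors are summed
    with discount gamma and truncated importance weights lam * min(1, ratio).
    `values` has length T + 1 (bootstrap value appended); `ratios[t]`
    approximates pi_{i-1}(a_t|s_t) / pi(a_t|s_t)."""
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]  # one-step TD errors
    adv = np.zeros(T)
    for t in range(T):
        coef, acc = 1.0, deltas[t]
        for k in range(t + 1, T):
            coef *= lam * min(1.0, ratios[k])            # truncated product
            acc += gamma ** (k - t) * coef * deltas[k]
        adv[t] = acc
    return adv

rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.4, 0.3, 0.0])   # last entry: bootstrap value
adv = preopc_advantage(rewards, values, np.ones(3), gamma=0.9, lam=1.0)
assert np.isclose(adv[0], (1.0 + 0.9 * 0.0 + 0.81 * 2.0) - 0.5)
```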

A.9 WHY IS OFF-POLICYNESS MORE SERIOUS IN THE SEQUENTIAL UPDATE SCHEME?

As shown in Fig. 13, the off-policy correction in sequential update algorithms improves the performance significantly, while similar performance gaps are not observed when it is used in simultaneous update algorithms. We attribute the difference to the influence of the clipping mechanism on the total variation distance. From Corollary 2, $D_{TV}(\pi\|\bar\pi) \le \sum_{i=1}^{n} D_{TV}(\pi^i\|\bar\pi^i)$. Although we cannot prove the exact relations, clipping the agents independently tends to produce larger total variation distances between the current and future policies of the agents, leading to more 'off-policyness' in sequential update algorithms.

B EXPERIMENTAL DETAILS

B.1 IMPLEMENTATION

For a fair comparison, we (re)implement A2PO and the baselines based on the implementation of MAPPO. We keep the same network structures for all the algorithms and tune them following the same process, i.e., a grid search over a small collection of hyper-parameters, to avoid the influence of different implementation details on the results. The grid search covers three hyper-parameters: the learning rate, $\lambda$, and the agent block number in tasks with numerous agents. All algorithms, including A2PO and the baselines, are implemented in both parameter-sharing and parameter-independent versions. The parameter-sharing version of A2PO is implemented as in Alg. 2, with the main modifications colored in blue: we rearrange the loops over agents and PPO epochs, and the number of PPO epochs is divided by $n$ so that the number of updates is comparable with the simultaneous algorithms. The approximated advantage is estimated by correcting the action probabilities of all the agents in $e^i$; the value target is computed as $v(s_t) = A^{\pi,\bar\pi_{i-1}}(s,a) + V(s)$ via Eq. (2), agent $i$'s policy is updated as $\bar\pi^i_m = \arg\max L_{\bar\pi_{i-1}}(\bar\pi_i)$ as in Eq. (6), and the value function is updated as $V = \arg\min_V \mathbb{E}_{s\sim d^\pi}\|v(s) - V(s)\|^2$. Practically, each agent is equipped with its own value function, and we generate the agent order once per stage to avoid estimating the advantage function $\frac{n(n-1)}{2}$ times. The order becomes $[1, \dots, i, \dots, j, \dots, n]$ in which $\mathbb{E}|A^i| \ge \mathbb{E}|A^j|$.
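The rearranged loops described above can be sketched as follows. This is a schematic, not the authors' implementation: `collect`, `update_agent`, and `order_fn` are injected placeholders, and we reset the preceding set at the start of each epoch pass for simplicity.

```python
def a2po_stage(n_agents, collect, update_agent, order_fn, ppo_epochs):
    """One A2PO stage: a single rollout, then sequential agent-by-agent
    updates. `update_agent(i, batch, e_i)` re-estimates the corrected
    advantage given the already-updated set e_i and improves agent i."""
    batch = collect()
    order = order_fn(batch)            # agent order generated once per stage
    for _ in range(max(1, ppo_epochs // n_agents)):  # epochs divided by n
        e_i = []                       # preceding agents within this pass
        for i in order:
            update_agent(i, batch, list(e_i))
            e_i.append(i)

calls = []
a2po_stage(3, collect=lambda: "batch",
           update_agent=lambda i, b, e: calls.append((i, tuple(e))),
           order_fn=lambda b: [2, 0, 1], ppo_epochs=3)
assert calls == [(2, ()), (0, (2,)), (1, (2, 0))]
```

Each agent thus sees exactly which teammates were updated before it, which is what the preceding-agent off-policy correction conditions on.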

B.2.1 STARCRAFTII MULTI-AGENT CHALLENGE

StarCraftII Multi-agent Challenge (SMAC) (Samvelyan et al., 2019) provides a wide range of multi-agent tasks in the battle scenarios of StarCraftII. Algorithms adopting parameter sharing have shown superior performance in SMAC, so all the algorithms are implemented with parameter sharing. As shown in Tab. 5, we evaluate the algorithms on 12 SMAC maps of various difficulties, on which the baselines cannot easily achieve 100% win rates. We use the Qmix results from Yu et al. (2022). The learning curves for episode return are summarized in Fig. 8.

B.2.2 MULTI-AGENT MUJOCO

Multi-agent MuJoCo (MA-MuJoCo) (Peng et al., 2021) contains a range of multi-agent continuous robot control tasks, in which each agent controls a subset of the robot's joints. MA-MuJoCo extends the high-dimensional single-agent locomotion tasks of MuJoCo (Todorov et al., 2012), a widely adopted benchmark for SARL algorithms (Haarnoja et al., 2018; He & Hou, 2020), to the multi-agent case. Agents must coordinate their actions for robot locomotion, and different agents control different subsets of the robot's joints. We use the reward settings of the original paper but set the environment to be fully observable. The agents are heterogeneous and mostly asymmetric in MA-MuJoCo, so we implement the algorithms as parameter-independent. We test 14 tasks of 6 scenarios in MA-MuJoCo, as illustrated in Fig. 9.

B.2.3 MULTI-AGENT PARTICLE ENVIRONMENT

We consider the Navigation task of the Multi-agent Particle Environment (MPE) (Lowe et al., 2017) as implemented in PettingZoo (Terry et al., 2021), which reproduces MPE with minor fixes and makes it convenient to customize the number of agents and landmarks as well as the global and local rewards. We use 3 and 5 agents with corresponding numbers of landmarks. The agents are rewarded based on the minimum distances to the landmarks and penalized for colliding with each other, so the reward depends entirely on coordination behavior. We adopt two reward settings: Fully Cooperative and General-sum. In the Fully Cooperative setting, the agents share the same reward, while in the General-sum setting, the agents are additionally rewarded based on local collision detection. The results in Fig. 10 show that A2PO generally outperforms the baselines in average return and sample efficiency. Note that although A2PO is developed for fully cooperative games, the results in the General-sum setting reveal the potential of extending A2PO to general-sum games. Further, the performance gap between A2PO and the baselines enlarges as the number of agents increases. In the above experiments, we have evaluated A2PO in tasks where agents can learn both their micro-operations and coordination behaviors (SMAC and MA-MuJoCo) and tasks where agents can only learn coordination behaviors (the Navigation task). However, the coordination behaviors in the above tasks are relatively easy to discover; e.g., agents learn to concentrate their fire on enemies and cover each other in SMAC. Recent works (Wen et al., 2022; Yu et al., 2022) have conducted experiments on Google Research Football academic scenarios with a small number of players and easily accessible targets, making the coordination behavior also easy to discover.
In contrast, we evaluate A2PO in the full-game scenarios, where the players of the left team, except for the goalkeeper, are controlled to play a football match against the right team controlled by the built-in AI provided by GRF. The agents in the full-game scenarios have high-dimensional observations, complex action spaces, and a long timescale (3000 steps). We reconstruct the observation space and design a dense reward, based on Football-paris, to facilitate training in these scenarios. The observation is formed to be agent-specific. The reward function evaluates the behaviors of the entire team, such as scoring and carrying the ball into the opponent's restricted area, but not individual behaviors such as ball-passing (Li et al., 2021). We implement all the algorithms for the 5-vs-5 scenario in both parameter-sharing and parameter-independent versions. Additional results with the algorithms implemented as parameter sharing are shown in Fig. 11, in which A2PO avoids the failure mode in which the controlled agents behave similarly and compete for the ball (Li et al., 2021).

B.2.4 GOOGLE RESEARCH FOOTBALL

We implement all the algorithms on the 11-vs-11 scenario with parameter sharing, using MALib (Zhou et al., 2021) for acceleration, and train the algorithms for 300M environment steps. We summarize the learned behaviors observed in the game videos:
• Basic Skills. The agents trained by MAPPO and CoPPO perform unsatisfactorily at basic skills such as dribbling and shooting, and even run out of bounds frequently. In contrast, the agents trained by HAPPO and A2PO perform better at the basic skills. We attribute these problems to the non-stationarity issue that seriously influences the simultaneously updating algorithms. We also note that the agents trained by all the algorithms fail to understand the offside mechanism and occasionally gather on the opponent's bottom line.
• Passing and Receiving Coordination. We analyze the most direct form of coordination: passing and receiving the ball. As illustrated in Tab. 3, the agents trained by MAPPO have the lowest number of successful passes and the lowest successful pass rate, and we can hardly observe them passing the ball. Agents trained by CoPPO perform better at passing the ball but suffer from poor basic skills and get tackled after receiving the ball. Agents trained by HAPPO prefer passing the ball without considering their teammates' situations, e.g., when the receiver is marked by several opponents. Agents trained by A2PO can pass the ball to their teammates in a way that leads to a score. We attribute the performance gain to the preceding-agent off-policy correction, which helps agents better estimate their teammates' situations and intentions. We further visualize the learned behaviors of A2PO in Fig. 12. In the top of Fig. 12, two players cooperatively break through the opponent's defense and complete a passing and receiving coordination for scoring. In the bottom of Fig. 12, three players make a fast thrust with two long passes: the goalkeeper passes the ball to the player at the edge, who then passes to the player behind the opponents. Such complex coordination strategies are hardly observed in the baselines.

B.2.5 ABLATION

Preceding-agent off-policy correction. More ablations on preceding-agent off-policy correction are shown in Fig. 13. The baselines are:
• MAPPO w/ V-trace, CoPPO w/ V-trace: simultaneous update methods with V-trace advantage estimation.
• HAPPO w/ PreOPC: HAPPO with PreOPC advantage estimation.
In this ablation study, the baselines are equipped with off-policy correction methods. The experiment yields the following three conclusions:
• First, the results support the conclusion in Sec. 3.3 that applying PreOPC to sequential update methods results in a greater performance improvement than applying V-trace to simultaneous update methods.
• Second, the primary distinction between A2PO and HAPPO with PreOPC is the clipping objective. The results demonstrate that the clipping objective derived from the single-agent improvement bound contributes to the performance improvement.
• Third, although we were unable to assess the error of PreOPC directly, we compare A2PO with RPISA-PPO, which can be viewed as an A2PO algorithm with an error-free off-policy correction method (the advantage estimation is exact) at the expense of sample inefficiency. A2PO reaches or outperforms the asymptotic performance of RPISA-PPO.
A2PO outperforms RPISA-PPO since RPISA-PPO suffers from performance degradation as a result of agents updating their policies with separated data (Taheri & Thrampoulidis, 2022). We further analyze the sensitivity to the hyper-parameter $\lambda$. Results in Fig. 14 illustrate that preceding-agent off-policy correction does not introduce more sensitivity.

Agent Selection Rule. More ablations on the agent selection rules are shown in Fig. 13. We compare two additional rules, 'Reverse-greedy' and 'Reverse-semi-greedy', where 'Reverse' means selecting the agent with the minimal advantage first. We observe that the effect of the selection rule becomes less significant in tasks with homogeneous or symmetric agents. Going deeper into the effects of the agent selection rules, we show in Figs. 16 and 17 that the agents with implicit guidance from the advantage estimation benefit from greedily selecting agents.

B.3 WALL TIME ANALYSIS

Multiple updates in a stage may increase training time, and the need for more training time may impact the scalability of A2PO, which is a common concern regarding the sequential update scheme. Nevertheless, the sequential update scheme increases training time less than might be expected. Before proceeding, we note that the majority of experiments in our work are implemented synchronously, and the training time consists of the time spent updating policies and the time spent collecting samples. We propose a simple yet effective method for controlling the training time: as a trade-off between performance and training time, we divide the agents into blocks to reduce the number of update iterations. For example, a task with 10 agents can be divided into 3 blocks of sizes 3, 3, and 4, so that only 3 updates are performed per policy update iteration. From the implementation perspective, since the number of samples used in a single update decreases, the sequential update scheme requires less memory and less updating time when updating policies. Therefore, it is possible to keep the training time below 1.5 times that of the simultaneous update methods. In addition, with a good implementation, fewer update iterations are performed if mini-batches are used in a single policy update, as the size of a mini-batch can be larger in sequential update methods under limited memory resources; in that case, fewer mini-batches are used, further decreasing the training time. Moreover, sampling consumes the majority of the training time, so the increased updating time appears less significant when analyzing the wall time of on-policy algorithms with synchronized implementations. The training time is reported in Tab. 6. A2PO achieves significantly greater performance with only marginally more training time.
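The block trick above can be sketched as a simple partition, mirroring the 10-agents-into-3-blocks example (`agent_blocks` is our illustrative helper):

```python
def agent_blocks(n_agents, n_blocks):
    """Partition agent indices into near-equal update blocks, e.g.
    10 agents and 3 blocks -> sizes [3, 3, 4], so only 3 sequential
    updates are performed per stage instead of 10."""
    base, rem = divmod(n_agents, n_blocks)
    sizes = [base] * (n_blocks - rem) + [base + 1] * rem
    blocks, start = [], 0
    for s in sizes:
        blocks.append(list(range(start, start + s)))
        start += s
    return blocks

assert [len(b) for b in agent_blocks(10, 3)] == [3, 3, 4]
```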
In addition, we illustrate the comparisons on Humanoid 9|8 with respect to both environment steps and training time in Fig. 19a, and the comparisons on the GRF 11-vs-11 scenario in Fig. 19b. A2PO maintains its advantage in terms of training time.

B.4 HYPER-PARAMETERS

We tune several hyper-parameters in all the benchmarks; the other hyper-parameters follow the settings used in MAPPO. $c_\epsilon$ is set to 0.5 in all the tasks. For the model structure in MA MuJoCo, the output of the last layer is processed by a Tanh layer, and the action distribution is modeled as a Gaussian distribution initialized with mean 0 and log std $-0.5$. The probability outputs of the different action dimensions are averaged when computing the policy ratio. The common hyper-parameters used in MA MuJoCo are listed in Tab. 8. We list the hyper-parameters used in MPE in Tab. 10.
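The continuous-action head described above can be sketched as follows; this is an illustration of the stated design (Tanh-squashed mean, fixed initial log std of $-0.5$, per-dimension probabilities averaged before the ratio), not the exact implementation, and `policy_logprob` is our hypothetical name.

```python
import numpy as np

def policy_logprob(last_layer_out, actions, log_std=-0.5):
    """Gaussian head sketch: the pre-activation output is squashed by Tanh
    to give the mean; the per-dimension log-densities are averaged, which
    corresponds to averaging the probability outputs in the policy ratio."""
    mean = np.tanh(last_layer_out)
    std = np.exp(log_std)
    logp = -0.5 * ((actions - mean) / std) ** 2 - log_std - 0.5 * np.log(2 * np.pi)
    return logp.mean(axis=-1)   # average over action dimensions

out = np.zeros(4)                         # pre-Tanh output -> mean = 0
on_mean = policy_logprob(out, np.zeros(4))
off_mean = policy_logprob(out, np.ones(4))
assert on_mean > off_mean                 # density peaks at the squashed mean
```

The policy ratio between updated and old policies would then be `np.exp(logp_new - logp_old)`.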

B.4.4 GOOGLE RESEARCH FOOTBALL

We list the hyper-parameters used in the GRF 5-vs-5 scenario in Tab. 11. 

C THE RELATED WORK OF OTHER MARL METHODS

Value decomposition methods. Value decomposition methods such as VDN (Sunehag et al., 2017) and Qmix (Rashid et al., 2018) factorize the joint value function and adopt the centralized training and decentralized execution paradigm. The Individual-Global-Max (IGM) principle is proposed to ensure consistency between the joint and local greedy action selections of the joint Q-value function $Q_{tot}(\tau, a)$ and the individual Q-value functions $\{Q^i(\tau^i, a^i)\}_{i=1}^n$:

$$\forall \tau \in \mathcal{T},\quad \arg\max_{a\in A} Q_{tot}(\tau, a) = \Big(\arg\max_{a^1\in A^1} Q^1(\tau^1, a^1),\ \dots,\ \arg\max_{a^n\in A^n} Q^n(\tau^n, a^n)\Big).$$

Two sufficient conditions for IGM, additivity and monotonicity, are proposed in Sunehag et al. (2017) and Rashid et al. (2018), respectively. In addition to V-function and Q-function decomposition, QPLEX (Wang et al., 2021) considers implementing IGM in the dueling structure $Q = V + A$, constraining only the advantage functions to satisfy the IGM principle. The global advantage function is decomposed as $A_{tot}(\tau, a) = \sum_{i=1}^n \lambda^i(\tau, a) A^i(\tau, a^i)$, where $\lambda^i(\tau, a) > 0$. We evaluate the performance of Qmix in Tab. 2 and Tab. 5. Integrating the IGM principle into A2PO without compromising the monotonic improvement guarantee is a desirable extension. Specifically, the advantage-based IGM establishes a connection between the global advantage function and the local advantage functions, and the advantage decomposition $A_{tot}(\tau, a) = \sum_{i=1}^n \lambda^i(\tau, a) A^i(\tau, a^i)$ does not jeopardize the derivation of the monotonic improvement guarantee.

Convergence and optimality of MARL. T-PPO (Ye et al., 2022) first introduces a framework called Generalized Multi-Agent Actor-Critic with Policy Factorization (GPF-MAC), which consists of methods with factorized local policies that may become stuck in sub-optimality. To address this problem, T-PPO transforms a multi-agent MDP into a special "single-agent" MDP with a sequential structure.
T-PPO has been shown to produce an optimal policy if implemented properly. Theoretically, sequential update methods such as A2PO and HAPPO are also instances of GPF-MAC and may get stuck in sub-optimal policies. The main differences between A2PO and T-PPO are that A2PO updates the factorized policies sequentially but makes decisions simultaneously, whereas T-PPO makes decisions sequentially, and that A2PO does not introduce the virtual state or the sequential transformation framework network. Moreover, T-PPO may theoretically compromise the monotonic improvement guarantee. In Tab. 12, we compare A2PO, MAPPO and T-PPO on SMAC tasks empirically. A2PO is superior to T-PPO in the majority of tasks.
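The IGM principle discussed above can be illustrated on a toy two-agent matrix game under the additivity condition (VDN-style); this is a didactic sketch, not an algorithm from the paper, and `igm_holds` is our hypothetical helper.

```python
import numpy as np

def igm_holds(q_locals, q_tot):
    """Check IGM on a 2-agent matrix game: the joint argmax of Q_tot must
    equal the tuple of local greedy actions."""
    joint = np.unravel_index(int(np.argmax(q_tot)), q_tot.shape)
    local = tuple(int(np.argmax(q)) for q in q_locals)
    return joint == local

# Additivity: Q_tot = Q1 + Q2 satisfies IGM by construction.
q1, q2 = np.array([1.0, 3.0]), np.array([0.5, 2.0])
q_tot = q1[:, None] + q2[None, :]
assert igm_holds([q1, q2], q_tot)
```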

D THE RELATED WORK OF COORDINATE DESCENT

Realizing the similarity between the sequential policy update scheme and block coordinate descent algorithms, we borrow optimization techniques from coordinate descent to accelerate the optimization and amplify the convergence advantage over the simultaneous update scheme (Gordon & Tibshirani, 2015; Shi et al., 2017). One of the critical questions in coordinate descent is selecting the coordinate for the next optimization step. Glasmachers & Dogan (2013) and Lu et al. (2018) analyzed the convergence-rate advantage of the Gauss-Southwell rule, i.e., greedily selecting the coordinate with the maximal gradient, over the random selection rule. We recognize the agent-by-agent optimization of our surrogate objective (Schulman et al., 2017) as a block coordinate descent problem, so the agent selection rule plays a crucial role in accelerating the optimization. Inspired by the coordinate selection rules, we propose the greedy and semi-greedy agent selection rules and empirically show that the underperforming agents benefit from greedy agent selection.
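The Gauss-Southwell rule can be sketched on a toy quadratic; the analogy is that the agent with the largest expected advantage magnitude plays the role of the coordinate with the largest gradient (a didactic sketch, not the A2PO selection code).

```python
import numpy as np

def gauss_southwell_step(x, grad_fn, lr=0.1):
    """One coordinate descent step with the Gauss-Southwell rule: update
    only the coordinate whose gradient has the largest magnitude."""
    g = grad_fn(x)
    i = int(np.argmax(np.abs(g)))
    x = x.copy()
    x[i] -= lr * g[i]
    return x, i

# Minimize f(x) = 0.5 * x^T diag([1, 10]) x; the steep coordinate is chosen.
grad = lambda x: np.array([1.0, 10.0]) * x
x, i = gauss_southwell_step(np.array([1.0, 1.0]), grad)
assert i == 1
```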



Footnotes:
1. We define a stage as a period during which all the agents have been updated once (Fig. 1).
2. More discussions about why HAPPO fails to guarantee monotonic improvement for a single agent's policy can be found in Appx. A.6.
3. We evaluate A2PO in fully cooperative and general-sum MPE tasks respectively, showing the potential of extending A2PO to general-sum games; see Appx. B.2.3 for full results. Code is available at https://anonymous.4open.science/r/A2PO.
4. The definition of coupling and its properties can be found in any textbook covering Markov chains.
5. Empirically, we find that the fully observable setting does not make the tasks easier because of the information redundancy.



Note that Eq. (3) takes the sum of expectations of the global advantage function approximated under different joint policies, different from the advantage decomposition lemma in Kuba et al. (2022) which decomposes the global advantage function into local ones.

Figure 3: Experiments in MA-MuJoCo. Left: Normalized scores on all the 14 tasks. Right: Comparisons of averaged return on selected tasks. The number of robot joints increases from left to right.

Figure 4: Averaged win rate on the Google Research Football full-game scenarios.

Figure 5: Ablation experiments on precedingagent off-policy correction.

Algorithm 2: Agent-by-agent Policy Optimization (Parameter Sharing)
1: Initialize the shared joint policy $\pi_0 = \{\pi^1_0, \dots, \pi^n_0\}$ with $\pi^1_0 = \dots = \pi^n_0$, and the global value function $V$.
2: for iteration $m = 1, 2, \dots$ do
3:   Collect data using $\pi_{m-1}$.
4:   Policy $\bar\pi_m = \pi_{m-1}$.
5:   for $\lceil P/n \rceil$ epochs do
6:     for $k = 1, \dots, n$ do
7:       Agent $i = R(k)$, preceding agents $e^i = \{R(1), \dots, R(k-1)\}$.

Figure 11: 5-vs-5 scenario with Parameter sharing.

Figure 9: Comparisons of average episode return on MA-MuJoCo.

Figure 12: Visualization of trained A2PO policies on the Google Research Football 11-vs-11 scenario, which shows that A2PO encourages complex cooperation behaviors to make a goal. Top: Player Turing and Johnson cooperate to beat multiple opponents to break through the defense and make a goal. Bottom: The goalkeeper, player Turing, and Curie achieve the pass and receive cooperation twice. A fast thrust is made by consecutively passing the ball.

Figure 13: Ablation experiments on preceding-agent off-policy correction.



Figure 19: Wall time Analysis.

Comparisons of trust region MARL algorithms. The proofs of the monotonic bounds can be found in Appx. A. Note that we also provide the monotonic bound of RPISA-PPO, which implements RPISA with PPO as the base algorithm. We separate RPISA-PPO from other methods as it has low sample efficiency and thus does not constitute a fair comparison.

Median win rates and standard deviations on SMAC tasks.

This section studies how PreOPC, the semi-greedy agent selection rule, and the adaptive clipping parameter affect performance. Full ablation details can be found in Appx. B.2.5.

ρ(s_t = s | π) is the normalized discounted state visitation distribution.
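The standard definition of this distribution, consistent with the surrounding text (γ denotes the discount factor), is:

```latex
% Normalized discounted state visitation distribution; the (1 - gamma)
% factor normalizes the geometric series of discounted visit probabilities.
\begin{equation*}
\rho(s \mid \boldsymbol{\pi})
  \;=\; (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t}\,
        \Pr\!\left(s_t = s \mid \boldsymbol{\pi}\right)
\end{equation*}
```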

Median win rates and standard deviations on SMAC tasks. 'w/ PS' means the algorithm is implemented with parameter sharing.

More ablation experiments on the adaptive clipping parameter. Left: Heterogeneous or asymmetric agents. Right: Homogeneous or symmetric agents.

The comparison of training duration. The format of the first line in a cell is: Training time (Sampling time + Updating time). The second line of a cell shows the normalized time.

We list the hyper-parameters used for each task of SMAC in Tab. 7.

Hyper-parameters in SMAC.

Common hyper-parameters in MA-MuJoCo.

Hyper-parameters for the scenarios in MA-MuJoCo.

Hyper-parameters for the scenarios in MPE.

Comparisons of A2PO, MAPPO and T-PPO.


Acknowledgements. The SJTU team is partially supported by "New Generation of AI 2030" Major Project (2018AAA0100900), the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102), the Shanghai Sailing Program (21YF1421900), the National Natural Science Foundation of China (62076161, 62106141). Xihuai Wang and Ziyu Wan are supported by Wu Wen Jun Honorary Scholarship, AI Institute, Shanghai Jiao Tong University. We thank Yan Song and He Jiang for their help in the football experiments.


Ethics Statement. Our method and algorithm do not involve any adversarial attacks and will not endanger human security. All our experiments are performed in simulation environments, which do not involve ethical or fairness issues.

Reproducibility Statement.

The source code of this paper is available at https://anonymous.4open.science/r/A2PO. We provide proofs in Appx. A, including the proofs of the intuitive sequential update, the monotonic policy improvement of A2PO, the incrementally tightened bound of A2PO, and the monotonic policy improvement of MAPPO, CoPPO, and HAPPO. We specify all the implementation details, the experimental setup, and the additional results in Appx. B. The related works on coordinate descent are shown in Appx. D.

More even bars in one figure indicate that the agents are more balanced in terms of the guidance from the advantage estimation. Take agent 10 in Fig. 16 for example: under the 'Cyclic' and 'Random' rules, agent 10 performs the worst with high proportions, while it has higher proportions in prior ranks under the 'Greedy' rule.

