DOP: OFF-POLICY MULTI-AGENT DECOMPOSED POLICY GRADIENTS

Abstract

Multi-agent policy gradient (MAPG) methods have recently witnessed vigorous progress. However, there is a significant performance discrepancy between MAPG methods and state-of-the-art multi-agent value-based approaches. In this paper, we investigate the causes that hinder the performance of MAPG algorithms and present a multi-agent decomposed policy gradient method (DOP). This method introduces the idea of value function decomposition into the multi-agent actor-critic framework. Based on this idea, DOP supports efficient off-policy learning and addresses the issues of centralized-decentralized mismatch and credit assignment in both discrete and continuous action spaces. We formally show that DOP critics have sufficient representational capability to guarantee convergence. In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms. Demonstrative videos are available at https://sites.google.com/view/dop-mapg/.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) has achieved great progress in recent years (Hughes et al., 2018; Jaques et al., 2019; Vinyals et al., 2019; Zhang et al., 2019; Baker et al., 2020; Wang et al., 2020c). Advances in value-based MARL (Sunehag et al., 2018; Rashid et al., 2018; Son et al., 2019; Wang et al., 2020e) contribute significantly to this progress, achieving state-of-the-art performance on challenging tasks such as StarCraft II micromanagement (Samvelyan et al., 2019). However, these value-based methods present a major challenge for stability and convergence in multi-agent settings (Wang et al., 2020a), which is further exacerbated in continuous action spaces. Policy gradient methods hold great promise for resolving these challenges. MADDPG (Lowe et al., 2017) and COMA (Foerster et al., 2018) are two representative methods that adopt the paradigm of centralized critic with decentralized actors (CCDA), which not only deals with the issue of non-stationarity (Foerster et al., 2017; Hernandez-Leal et al., 2017) by conditioning the centralized critic on global history and actions, but also maintains scalable decentralized execution by conditioning policies on local history. Several subsequent works improve the CCDA framework by introducing mechanisms of recursive reasoning (Wen et al., 2019) or attention (Iqbal & Sha, 2019). Despite this progress, most multi-agent policy gradient (MAPG) methods do not achieve satisfactory performance, e.g., they significantly underperform value-based methods on benchmark tasks (Samvelyan et al., 2019). In this paper, we analyze this discrepancy and pinpoint three major issues that hinder the performance of MAPG methods. (1) Current stochastic MAPG methods do not support off-policy learning, partly because common off-policy learning techniques are computationally expensive in multi-agent settings.
(2) In the CCDA paradigm, the suboptimality of one agent's policy can propagate through the centralized joint critic and negatively affect the policy learning of other agents, causing catastrophic miscoordination, an issue we call centralized-decentralized mismatch (CDM). (3) For deterministic MAPG methods, realizing efficient credit assignment (Tumer et al., 2002; Agogino & Tumer, 2004) with a single global reward signal largely remains challenging.

Multi-Agent Policy Gradients

The centralized training with decentralized execution (CTDE) paradigm (Foerster et al., 2016; Wang et al., 2020d) has recently attracted attention for its ability to address non-stationarity while maintaining decentralized execution. Learning a centralized critic with decentralized actors (CCDA) is an efficient approach that exploits the CTDE paradigm. MADDPG and COMA are two representative examples. MADDPG (Lowe et al., 2017) learns deterministic policies in continuous action spaces and uses the following gradients to update policies:
$$g = \mathbb{E}_{\boldsymbol{\tau},\boldsymbol{a}\sim D}\Big[\sum_i \nabla_{\theta_i}\mu_i(\tau_i)\nabla_{a_i} Q_{tot}^{\boldsymbol{\mu}}(\boldsymbol{\tau},\boldsymbol{a})\big|_{a_i=\mu_i(\tau_i)}\Big], \quad (1)$$
where $D$ is a replay buffer. COMA (Foerster et al., 2018) updates stochastic policies using the gradients:
$$g = \mathbb{E}_{\boldsymbol{\pi}}\Big[\sum_i \nabla_{\theta_i}\log\pi_i(a_i|\tau_i)\,A_i^{\boldsymbol{\pi}}(\boldsymbol{\tau},\boldsymbol{a})\Big], \quad (2)$$
where $A_i^{\boldsymbol{\pi}}(\boldsymbol{\tau},\boldsymbol{a}) = Q_{tot}^{\boldsymbol{\pi}}(\boldsymbol{\tau},\boldsymbol{a}) - \sum_{a_i'}\pi_i(a_i'|\tau_i)\,Q_{tot}^{\boldsymbol{\pi}}(\boldsymbol{\tau},(\boldsymbol{a}_{-i},a_i'))$ is a counterfactual advantage ($\boldsymbol{a}_{-i}$ is the joint action of all agents other than agent $i$) that deals with the issue of credit assignment and reduces variance.
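The counterfactual advantage above can be made concrete with a small tabular sketch. This is an illustrative toy (the table `q_tot`, distribution `pi_i`, and function name are assumptions, not the paper's code): the baseline marginalizes agent $i$'s own action out of $Q_{tot}$ while keeping $\boldsymbol{a}_{-i}$ fixed.

```python
def counterfactual_advantage(q_tot, pi_i, joint_action, i):
    """COMA-style counterfactual advantage for agent i.

    q_tot: hypothetical table mapping joint-action tuples to Q_tot values.
    pi_i:  agent i's action distribution (a list of probabilities).
    The baseline marginalizes agent i's action while fixing a_{-i}.
    """
    a = list(joint_action)
    baseline = 0.0
    for ai, p in enumerate(pi_i):
        a[i] = ai
        baseline += p * q_tot[tuple(a)]
    return q_tot[tuple(joint_action)] - baseline
```

Because the baseline is an expectation over $\pi_i$, this advantage has zero mean under agent $i$'s own policy, which is what makes it a variance-reducing baseline rather than a biased correction.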

3. ANALYSIS

In this section, we investigate challenges that limit the performance of state-of-the-art multi-agent policy gradient methods.

3.1. OFF-POLICY LEARNING FOR MULTI-AGENT STOCHASTIC POLICY GRADIENTS

Efficient stochastic policy learning in single-agent settings relies heavily on using off-policy data (Lillicrap et al., 2015; Wang et al., 2016; Fujimoto et al., 2018; Haarnoja et al., 2018), which is not supported by existing stochastic MAPG methods (Foerster et al., 2018). In the CCDA framework, off-policy policy evaluation, i.e., estimating $Q_{tot}^{\boldsymbol\pi}$ from data drawn from behavior policies $\boldsymbol\beta=\langle\beta_1,\dots,\beta_n\rangle$, encounters major challenges. Importance sampling (Meuleau et al., 2000; Jie & Abbeel, 2010; Levine & Koltun, 2013) is a simple way to correct for the discrepancy between $\boldsymbol\pi$ and $\boldsymbol\beta$, but it requires computing $\prod_i \frac{\pi_i(a_i|\tau_i)}{\beta_i(a_i|\tau_i)}$, whose variance grows exponentially with the number of agents. An alternative is to extend the tree backup technique (Precup et al., 2000; Munos et al., 2016) to multi-agent settings and use the $k$-step tree backup update target for training the critic:
$$y^{TB} = Q_{tot}^{\boldsymbol\pi}(\boldsymbol\tau,\boldsymbol a) + \sum_{t=0}^{k-1}\gamma^t\Big(\prod_{l=1}^{t}\lambda\,\boldsymbol\pi(\boldsymbol a_l|\boldsymbol\tau_l)\Big)\big[r_t + \gamma\,\mathbb{E}_{\boldsymbol\pi}[Q_{tot}^{\boldsymbol\pi}(\boldsymbol\tau_{t+1},\cdot)] - Q_{tot}^{\boldsymbol\pi}(\boldsymbol\tau_t,\boldsymbol a_t)\big], \quad (3)$$
where $\boldsymbol\tau_0=\boldsymbol\tau$ and $\boldsymbol a_0=\boldsymbol a$. However, the complexity of computing $\mathbb{E}_{\boldsymbol\pi}[Q_{tot}^{\boldsymbol\pi}(\boldsymbol\tau_{t+1},\cdot)]$ is $O(|A|^n)$, which becomes intractable when the number of agents is large. Therefore, it is challenging to develop off-policy stochastic MAPG methods.
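The exponential variance growth of the joint importance weight can be computed in closed form for a toy setting (identical binary-action agents; the policies and the function name are illustrative assumptions, not from the paper). Since each per-agent ratio $r_i$ has $\mathbb{E}_\beta[r_i]=1$, the variance of the product over $n$ independent agents is $(\mathbb{E}_\beta[r^2])^n - 1$:

```python
def joint_ratio_variance(n_agents, pi=(0.8, 0.2), beta=(0.5, 0.5)):
    """Variance of prod_i pi_i(a_i)/beta_i(a_i) when a ~ beta, for n
    identical independent binary-action agents (a toy illustration).

    Per agent: E_beta[r] = sum_a beta(a) * pi(a)/beta(a) = 1, and
    E_beta[r^2] = sum_a pi(a)^2 / beta(a); independence across agents
    gives Var = (E[r^2])^n - 1, i.e. exponential growth in n.
    """
    m2 = sum(p * p / b for p, b in zip(pi, beta))
    return m2 ** n_agents - 1.0
```

With these illustrative distributions the variance is 0.36 for a single agent but already exceeds 400 for twenty agents, which is why the paper turns to tree backup instead.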

3.2. THE CENTRALIZED-DECENTRALIZED MISMATCH ISSUE

In the centralized critic with decentralized actors (CCDA) framework, agents learn individual policies $\pi_i(a_i|\tau_i;\theta_i)$ conditioned on the local observation-action history. However, the gradients for updating these policies depend on the centralized joint critic $Q^{\boldsymbol\pi}_{tot}(\boldsymbol\tau,\boldsymbol a)$ (see Eq. 1 and 2), which introduces the influence of other agents' actions. Intuitively, gradient updates move an agent in a direction that increases the global Q value, but the presence of other agents' actions incurs large variance in the estimates of such directions. Formally, the variance of the policy gradients for agent $i$ at $(\tau_i, a_i)$ depends on other agents' actions:
$$\mathrm{Var}_{\boldsymbol a_{-i}\sim\boldsymbol\pi_{-i}}\big[Q^{\boldsymbol\pi}_{tot}(\boldsymbol\tau,(a_i,\boldsymbol a_{-i}))\nabla_{\theta_i}\log\pi_i(a_i|\tau_i)\big] = \mathrm{Var}_{\boldsymbol a_{-i}\sim\boldsymbol\pi_{-i}}\big[Q^{\boldsymbol\pi}_{tot}(\boldsymbol\tau,(a_i,\boldsymbol a_{-i}))\big]\,\big(\nabla_{\theta_i}\log\pi_i(a_i|\tau_i)\big)\big(\nabla_{\theta_i}\log\pi_i(a_i|\tau_i)\big)^T,$$
where $\mathrm{Var}_{\boldsymbol a_{-i}}[Q^{\boldsymbol\pi}_{tot}(\boldsymbol\tau,(a_i,\boldsymbol a_{-i}))]$ can be very large due to the exploration or suboptimality of other agents' policies, which may cause suboptimality in individual policies. For example, suppose that the optimal joint action under $\boldsymbol\tau$ is $\boldsymbol a^* = \langle a_1^*,\dots,a_n^*\rangle$. When $\mathbb{E}_{\boldsymbol a_{-i}\sim\boldsymbol\pi_{-i}}[Q^{\boldsymbol\pi}_{tot}(\boldsymbol\tau,(a_i^*,\boldsymbol a_{-i}))] < 0$, $\pi_i(a_i^*|\tau_i)$ will decrease, possibly resulting in a suboptimal $\pi_i$. This becomes problematic because a negative feedback loop is created: the joint critic is affected by the suboptimality of agent $i$, which in turn disturbs the policy updates of other agents. We call this issue centralized-decentralized mismatch (CDM). Does CDM occur in practice for state-of-the-art algorithms? To answer this question, we carry out a case study in Sec. 5.1. We can see that the variance of DOP gradients is significantly smaller than that of COMA and MADDPG (Fig. 2, left). This smaller variance enables DOP to outperform the other algorithms (Fig. 2, middle). We explain this didactic example in detail in Sec. 5.1. In Sec. 5.2 and 5.3, we further show that CDM is exacerbated in sequential decision-making settings, causing divergence even after a near-optimal strategy has been learned.

3.3. CREDIT ASSIGNMENT FOR MULTI-AGENT DETERMINISTIC POLICY GRADIENTS

MADDPG (Lowe et al., 2017) and MAAC (Iqbal & Sha, 2019) extend deterministic policy gradient algorithms (Silver et al., 2014; Lillicrap et al., 2015) to multi-agent settings, enabling efficient off-policy learning in continuous action spaces. However, they leave the issue of credit assignment (Tumer et al., 2002; Agogino & Tumer, 2004) largely untouched in fully cooperative settings, where agents learn policies from a single global reward signal. In stochastic cases, COMA assigns credits by designing a counterfactual baseline (Eq. 2). However, it is not straightforward to extend COMA to deterministic policies, since the output of the policies is no longer a probability distribution. As a result, it remains challenging to realize efficient credit assignment in deterministic cases.

4. DECOMPOSED OFF-POLICY POLICY GRADIENTS

To address the limitations of existing MAPG methods discussed in Sec. 3, we introduce the idea of value decomposition into the multi-agent actor-critic framework and propose a Decomposed Off-Policy policy gradient (DOP) method. We factor the centralized critic as a weighted summation of individual critics across agents:
$$Q^{\phi}_{tot}(\boldsymbol\tau,\boldsymbol a) = \sum_i k_i(\boldsymbol\tau)Q^{\phi_i}_i(\boldsymbol\tau,a_i) + b(\boldsymbol\tau),$$
where $\phi$ and $\phi_i$ are parameters of the global and local Q functions, respectively, and $k_i \ge 0$ and $b$ are generated by learnable networks whose inputs are global observation-action histories. In the following sections, we show that this linear decomposition helps address the limitations of previous methods. One concern is the limited expressivity of linear decomposition (Wang et al., 2020b), which may introduce bias into value estimations. We will show that this limitation does not violate the policy improvement guarantee of DOP.

Figure 1: A decomposed critic.

Fig. 1 shows the architecture for learning decomposed critics. We learn the individual critics $Q^{\phi_i}_i$ by backpropagating gradients from global TD updates dependent on the joint global reward, i.e., $Q^{\phi_i}_i$ is learned implicitly rather than from any reward specific to agent $i$. We enforce $k_i \ge 0$ by applying an absolute activation function at the last layer of the network. The network structure is described in detail in Appendix H. Based on this decomposed critic, the following sections introduce decomposed off-policy policy gradients for learning stochastic policies and deterministic policies, respectively. Similar to other actor-critic methods, DOP alternates between policy evaluation, estimating the value function for a policy, and policy improvement, using the value function to update the policy (Barto et al., 1983).
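The mixing step can be sketched in a few lines. This is a minimal illustration, not the paper's network: in DOP the weights $k_i$ and bias $b$ are produced by learnable networks conditioned on the global history, whereas here they are passed in directly; only the absolute-value activation that enforces $k_i \ge 0$ is taken from the text above.

```python
def decomposed_q_tot(q_locals, k_raw, b):
    """Mix per-agent utilities into Q_tot = sum_i k_i * Q_i + b.

    k_raw: unconstrained mixing-network outputs (stand-ins); the
    absolute-value activation enforces k_i >= 0 as described above.
    """
    ks = [abs(k) for k in k_raw]
    return sum(k * q for k, q in zip(ks, q_locals)) + b
```

The linearity of this mixer in each $Q_i$ is what later makes expectations over the joint policy decompose into per-agent expectations.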

4.1. STOCHASTIC DECOMPOSED OFF-POLICY POLICY GRADIENTS

For learning stochastic policies, the linearly decomposed critic plays an essential role: it enables tractable multi-agent tree backup for off-policy policy evaluation and attenuates the CDM issue while maintaining a provable policy improvement guarantee.

4.1.1. OFF-POLICY LEARNING

Policy Evaluation: Train the Critic. As discussed in Sec. 3.1, using tree backup (Eq. 3) to carry out multi-agent off-policy policy evaluation requires calculating $\mathbb{E}_{\boldsymbol\pi}[Q^{\phi}_{tot}(\boldsymbol\tau_{t+1},\cdot)]$, which needs $O(|A|^n)$ summations when a joint critic is used. Fortunately, with the linearly decomposed critic, DOP reduces the complexity of computing this expectation to $O(n|A|)$:
$$\mathbb{E}_{\boldsymbol\pi}[Q^{\phi}_{tot}(\boldsymbol\tau,\cdot)] = \sum_i k_i(\boldsymbol\tau)\mathbb{E}_{\pi_i}[Q^{\phi_i}_i(\boldsymbol\tau,\cdot)] + b(\boldsymbol\tau),$$
making the tree backup technique tractable (a detailed proof can be found in Appendix A.1). Another challenge of multi-agent tree backup (Eq. 3) is that the coefficient $c_t = \prod_{l=1}^{t}\lambda\,\boldsymbol\pi(\boldsymbol a_l|\boldsymbol\tau_l)$ decays as $t$ grows, which may lead to relatively low training efficiency. To solve this issue, we propose to mix off-policy tree backup updates with on-policy TD($\lambda$) updates to trade off sample efficiency against training efficiency. Formally, DOP minimizes the following loss for training the critic:
$$\mathcal{L}(\phi) = \kappa\,\mathcal{L}^{\text{DOP-TB}}_{\boldsymbol\beta}(\phi) + (1-\kappa)\,\mathcal{L}^{\text{On}}_{\boldsymbol\pi}(\phi), \quad (7)$$
where $\kappa$ is a scaling factor, $\boldsymbol\beta$ is the joint behavior policy, and $\phi$ denotes the parameters of the critic. The first loss term is $\mathcal{L}^{\text{DOP-TB}}_{\boldsymbol\beta}(\phi) = \mathbb{E}_{\boldsymbol\beta}[(y^{\text{DOP-TB}} - Q^{\phi}_{tot}(\boldsymbol\tau,\boldsymbol a))^2]$, where $y^{\text{DOP-TB}}$ is the update target of the proposed $k$-step decomposed multi-agent tree backup algorithm:
$$y^{\text{DOP-TB}} = Q^{\phi^-}_{tot}(\boldsymbol\tau,\boldsymbol a) + \sum_{t=0}^{k-1}\gamma^t c_t\Big[r_t + \gamma\Big(\sum_i k_i(\boldsymbol\tau_{t+1})\mathbb{E}_{\pi_i}[Q^{\phi_i^-}_i(\boldsymbol\tau_{t+1},\cdot)] + b(\boldsymbol\tau_{t+1})\Big) - Q^{\phi^-}_{tot}(\boldsymbol\tau_t,\boldsymbol a_t)\Big].$$
Here, $\phi^-$ denotes the parameters of a target critic, and $\boldsymbol a_t \sim \boldsymbol\beta(\cdot|\boldsymbol\tau_t)$. The second loss term is $\mathcal{L}^{\text{On}}_{\boldsymbol\pi}(\phi) = \mathbb{E}_{\boldsymbol\pi}[(y^{\text{On}} - Q^{\phi}_{tot}(\boldsymbol\tau,\boldsymbol a))^2]$, where $y^{\text{On}}$ is the on-policy update target as in TD($\lambda$):
$$y^{\text{On}} = Q^{\phi}_{tot}(\boldsymbol\tau,\boldsymbol a) + \sum_{t=0}^{\infty}(\gamma\lambda)^t\big[r_t + \gamma Q^{\phi}_{tot}(\boldsymbol\tau_{t+1},\boldsymbol a_{t+1}) - Q^{\phi}_{tot}(\boldsymbol\tau_t,\boldsymbol a_t)\big].$$
In practice, we use two buffers: an on-policy buffer for computing $\mathcal{L}^{\text{On}}_{\boldsymbol\pi}(\phi)$ and an off-policy buffer for estimating $\mathcal{L}^{\text{DOP-TB}}_{\boldsymbol\beta}(\phi)$.
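The target computation can be sketched as follows. This is a hedged sketch, not the paper's implementation: the inputs are assumed to be already computed with the target critic (in particular, `exp_q[t]` would come from the $O(n|A|)$ decomposed expectation), and the function name and argument layout are illustrative.

```python
def dop_tree_backup_target(rewards, q_behavior, exp_q, pis, gamma=0.99, lam=0.8):
    """Sketch of the k-step decomposed tree-backup target y^DOP-TB.

    rewards[t]    : r_t for t = 0..k-1
    q_behavior[t] : target-critic Q_tot(tau_t, a_t) for stored behavior actions
    exp_q[t]      : E_pi[Q_tot(tau_t, .)], assumed computed via the decomposition
    pis[t]        : joint policy probability pi(a_t | tau_t), used for t >= 1
    """
    y, c = q_behavior[0], 1.0
    for t in range(len(rewards)):
        if t > 0:
            c *= lam * pis[t]  # c_t = prod_{l=1}^t lam * pi(a_l | tau_l)
        y += (gamma ** t) * c * (rewards[t] + gamma * exp_q[t + 1] - q_behavior[t])
    return y
```

Note how the truncation coefficient $c_t$ starts at 1 (the $t=0$ term is fully trusted) and shrinks multiplicatively with each off-policy step, which is exactly the decay the on-policy TD($\lambda$) term in Eq. 7 is meant to compensate for.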
Policy Improvement: Train Actors. Using the linearly decomposed critic architecture, we can derive the following on-policy policy gradients for learning stochastic policies:
$$g = \mathbb{E}_{\boldsymbol\pi}\Big[\sum_i k_i(\boldsymbol\tau)\nabla_{\theta_i}\log\pi_i(a_i|\tau_i;\theta_i)\,Q^{\phi_i}_i(\boldsymbol\tau,a_i)\Big]. \quad (10)$$
In Appendix A.2, we provide the detailed derivation and an off-policy version of the stochastic policy gradients. This update rule reveals two important insights. (1) With a linearly decomposed critic, each agent's policy update depends only on its individual critic $Q^{\phi_i}_i$. (2) Learning the decomposed critic implicitly realizes multi-agent credit assignment, because the individual critic provides credit information for each agent to improve its policy in the direction of increasing the global expected return. Moreover, Eq. 10 also gives the policy gradients obtained when assigning credits via the aristocrat utility (Wolpert & Tumer, 2002) (Appendix A.2). Eq. 7 and 10 form the core of our DOP algorithm for learning stochastic policies, which we call stochastic DOP and describe in detail in Appendix E. The CDM issue occurs when the suboptimality of decentralized policies is exacerbated through the joint critic. Since an agent's stochastic DOP gradients do not rely on the actions of other agents, they attenuate the effect of CDM. We empirically show that DOP reduces the variance of policy gradients in Sec. 5.1 and attenuates the CDM issue in complex tasks in Sec. 5.2.1.
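For a softmax policy, the per-agent term of this update can be written out explicitly. The following is a toy sketch for a single sampled action of one agent (the tabular `q_i` and the function name are illustrative assumptions); it uses the standard identity $\partial \log\pi(a)/\partial \text{logit}_j = \mathbb{1}\{j=a\} - \pi(j)$.

```python
import math

def dop_actor_grad(logits, action, k_i, q_i):
    """Per-agent stochastic DOP gradient contribution w.r.t. softmax logits:
    k_i * Q_i(tau, a_i) * grad log pi_i(a_i | tau_i).

    logits: agent i's policy logits; q_i: a toy table of Q_i(tau, .) values.
    """
    exps = [math.exp(x) for x in logits]
    z = sum(exps)
    pi = [e / z for e in exps]
    coef = k_i * q_i[action]
    # d log pi(a) / d logit_j = 1{j == a} - pi(j)
    return [coef * ((1.0 if j == action else 0.0) - pi[j]) for j in range(len(logits))]
```

Consistent with insight (1) above, nothing in this computation touches other agents' actions or critics: only $k_i$, agent $i$'s own logits, and $Q^{\phi_i}_i$ appear.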

4.1.2. STOCHASTIC DOP POLICY IMPROVEMENT THEOREM

In this section, we theoretically demonstrate that stochastic DOP can converge to a local optimum despite the fact that a linearly decomposed critic has limited representational capability. Since an accurate analysis for a complex function approximator (e.g., a neural network) is difficult, we adopt several mild assumptions used in previous work (Feinberg et al., 2018; Degris et al., 2012). We first show that the linearly decomposed structure ensures that the learned local value functions $Q^{\phi_i}_i(\boldsymbol\tau,a_i)$ preserve the order of $Q^{\boldsymbol\pi}_i(\boldsymbol\tau,a_i) = \sum_{\boldsymbol a_{-i}}\boldsymbol\pi_{-i}(\boldsymbol a_{-i}|\boldsymbol\tau_{-i})Q^{\boldsymbol\pi}_{tot}(\boldsymbol\tau,\boldsymbol a)$ for a wide range of function classes.

Fact 1. Under mild assumptions, when value evaluation converges, for any $\boldsymbol\pi$, $Q^{\phi_i}_i$ satisfies
$$Q^{\boldsymbol\pi}_i(\boldsymbol\tau,a_i) \ge Q^{\boldsymbol\pi}_i(\boldsymbol\tau,a_i') \iff Q^{\phi_i}_i(\boldsymbol\tau,a_i) \ge Q^{\phi_i}_i(\boldsymbol\tau,a_i'), \quad \forall\boldsymbol\tau,a_i,a_i'.$$

A detailed proof of Fact 1 can be found in Appendix C.1, together with a more detailed discussion of its implications. Furthermore, we prove the following proposition to show that policy improvement is guaranteed as long as the function class expressed by $Q^{\phi_i}_i$ is sufficiently large and the loss of critic training is minimized.

Published as a conference paper at ICLR 2021

Proposition 1. Suppose the function class expressed by $Q^{\phi_i}_i(\boldsymbol\tau,a_i)$ is sufficiently large (e.g., neural networks) and the following loss $\mathcal{L}(\phi)$ is minimized:
$$\mathcal{L}(\phi) = \sum_{\boldsymbol a,\boldsymbol\tau} p(\boldsymbol\tau)\boldsymbol\pi(\boldsymbol a|\boldsymbol\tau)\big(Q^{\boldsymbol\pi}_{tot}(\boldsymbol\tau,\boldsymbol a) - Q^{\phi}_{tot}(\boldsymbol\tau,\boldsymbol a)\big)^2,$$
where $Q^{\phi}_{tot}(\boldsymbol\tau,\boldsymbol a) \equiv \sum_i k_i(\boldsymbol\tau)Q^{\phi_i}_i(\boldsymbol\tau,a_i) + b(\boldsymbol\tau)$. Then we have
$$g = \mathbb{E}_{\boldsymbol\pi}\Big[\sum_i \nabla_{\theta_i}\log\pi_i(a_i|\tau_i;\theta_i)\,Q^{\boldsymbol\pi}_{tot}(\boldsymbol\tau,\boldsymbol a)\Big] = \mathbb{E}_{\boldsymbol\pi}\Big[\sum_i k_i(\boldsymbol\tau)\nabla_{\theta_i}\log\pi_i(a_i|\tau_i;\theta_i)\,Q^{\phi_i}_i(\boldsymbol\tau,a_i)\Big],$$
which means the stochastic DOP policy gradients are the same as those calculated using a centralized critic (Eq. 2). Therefore, policy improvement is guaranteed. The proof, which is inspired by Wang et al. (2020a), can be found in Appendix C.2.

4.2.1. OFF-POLICY LEARNING

To enable efficient learning with continuous actions, we propose deterministic DOP. As in single-agent settings, because deterministic policy gradient methods avoid the integral over actions, they greatly ease the cost of off-policy learning (Silver et al., 2014). For policy evaluation, we train the critic by minimizing the following TD loss:
$$\mathcal{L}(\phi) = \mathbb{E}_{(\boldsymbol\tau_t,\boldsymbol a_t,r_t,\boldsymbol\tau_{t+1})\sim D}\Big[\big(r_t + \gamma Q^{\phi^-}_{tot}(\boldsymbol\tau_{t+1},\boldsymbol\mu(\boldsymbol\tau_{t+1};\theta^-)) - Q^{\phi}_{tot}(\boldsymbol\tau_t,\boldsymbol a_t)\big)^2\Big], \quad (11)$$
where $D$ is a replay buffer, and $\phi^-$, $\theta^-$ are the parameters of the target critic and target actors, respectively. For policy improvement, we derive the following deterministic DOP policy gradients:
$$g = \mathbb{E}_{\boldsymbol\tau\sim D}\Big[\sum_i k_i(\boldsymbol\tau)\nabla_{\theta_i}\mu_i(\tau_i;\theta_i)\nabla_{a_i}Q^{\phi_i}_i(\boldsymbol\tau,a_i)\big|_{a_i=\mu_i(\tau_i;\theta_i)}\Big]. \quad (12)$$
A detailed proof can be found in Appendix B.1. Similar to the stochastic case, this result reveals that updates of individual deterministic policies depend only on local critics when a linearly decomposed critic is used. Based on Eq. 11 and Eq. 12, we develop the DOP algorithm for learning deterministic policies in continuous action spaces, which we call deterministic DOP and describe in Appendix E.
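The chain rule inside Eq. 12 can be seen on a one-dimensional toy. Everything here is a hypothetical stand-in (a linear actor $\mu(\tau;\theta)=\theta\tau$ and a quadratic local critic), chosen only to show that the per-agent gradient is the product $k_i \cdot \partial\mu_i/\partial\theta_i \cdot \partial Q_i/\partial a_i$ evaluated at the policy's own action:

```python
def deterministic_dop_grad(theta, tau, k, dq_da):
    """Per-agent deterministic DOP gradient for a toy scalar actor
    mu(tau; theta) = theta * tau (a hypothetical linear actor):
    g_i = k_i * (d mu_i / d theta_i) * (d Q_i / d a_i) at a_i = mu_i(tau_i).
    """
    a = theta * tau          # the decentralized action
    return k * tau * dq_da(a)

# toy local critic Q_i(a) = -(a - 1)^2, maximized at a = 1, so dQ/da = -2(a - 1)
grad = deterministic_dop_grad(theta=0.0, tau=1.0, k=1.0, dq_da=lambda a: -2.0 * (a - 1.0))
```

A gradient-ascent step on `theta` moves the action toward the local critic's maximizer, using only agent $i$'s own critic, as the text above states.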

4.2.2. REPRESENTATION CAPACITY OF DETERMINISTIC DOP CRITICS

In continuous and smooth environments, we first show that a DOP critic has sufficient expressive capability to represent Q values in the proximity of $\boldsymbol\mu(\boldsymbol\tau)$, $\forall\boldsymbol\tau$, with a bounded error. For simplicity, we denote $O_\delta(\boldsymbol\tau) = \{\boldsymbol a \mid \|\boldsymbol a - \boldsymbol\mu(\boldsymbol\tau)\|_2 \le \delta\}$.

Fact 2. Assume that $\forall\boldsymbol\tau$ and $\boldsymbol a, \boldsymbol a' \in O_\delta(\boldsymbol\tau)$, $\|\nabla_{\boldsymbol a}Q^{\boldsymbol\mu}_{tot}(\boldsymbol\tau,\boldsymbol a) - \nabla_{\boldsymbol a}Q^{\boldsymbol\mu}_{tot}(\boldsymbol\tau,\boldsymbol a')\|_2 \le L\|\boldsymbol a - \boldsymbol a'\|_2$. Then the estimation error of a DOP critic is bounded by $O(L\delta^2)$ for $\boldsymbol a \in O_\delta(\boldsymbol\tau)$, $\forall\boldsymbol\tau$.

A detailed proof can be found in Appendix D. Here we assume that the gradients of Q values with respect to actions are Lipschitz smooth under the deterministic policy $\boldsymbol\mu$. This assumption is reasonable given that the Q values of most continuous environments with continuous policies are rather smooth. We further show that when Q values in the proximity of $\boldsymbol\mu(\boldsymbol\tau)$, $\forall\boldsymbol\tau$, are well estimated with a bounded error, the deterministic DOP policy gradients are a good approximation to the true gradients (Eq. 1). Approximately, $|\nabla_{a_i}Q^{\boldsymbol\mu}_{tot}(\boldsymbol\tau,\boldsymbol a) - k_i(\boldsymbol\tau)\nabla_{a_i}Q^{\phi_i}_i(\boldsymbol\tau,a_i)| \sim O(L\delta)$, $\forall i$, when $\delta \ll 1$. For a detailed proof, we refer readers to Appendix D.

5. EXPERIMENTS

We design experiments to answer the following questions: (1) Does the CDM issue commonly exist and can decomposed critics attenuate it? (Sec. 5.1, 5.2.1, and 5.3) (2) Can our decomposed multi-agent tree backup algorithm improve the efficiency of off-policy learning? (Sec. 5.2.1) (3) Can deterministic DOP learn reasonable credit assignment? (Sec. 5.3) (4) Can DOP outperform state-of-the-art MARL algorithms? For evaluation, all the results are averaged over 12 different random seeds and are shown with 95% confidence intervals. 

5.1. DIDACTIC EXAMPLE: THE CDM ISSUE AND BIAS-VARIANCE TRADE-OFF

We use a didactic example to demonstrate how DOP attenuates CDM and achieves a bias-variance trade-off. In a stateless game with 3 agents and 14 actions per agent, the agents get a team reward of 10 if they take actions 1, 5, and 9, respectively, and -10 otherwise. We train stochastic DOP, COMA, and MADDPG for 10K timesteps and show the gradient variance, value estimation bias, and learning curves in Fig. 2. The Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) is used to enable MADDPG to learn in discrete action spaces. Fig. 2-right shows the average bias in the estimates of all Q values. We see that the linear decomposition introduces extra estimation error. However, the variance of DOP policy gradients is much smaller than that of the other algorithms (Fig. 2-left). As discussed in Sec. 3.2, the large variance of the other algorithms is due to the CDM issue: undecomposed joint critics are affected by the actions of all agents. Free from the influence of other agents, DOP preserves the order of local Q values (bottom of Fig. 2-right) and effectively reduces the variance of policy gradients. In this way, DOP sacrifices value estimation accuracy for accurate, low-variance policy gradients, which explains why it outperforms the other algorithms (Fig. 2-middle).
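The CDM mechanism in this game is easy to verify numerically. The sketch below assumes zero-indexed actions with the rewarding joint action written as `(1, 5, 9)` (an indexing assumption on our part) and uniform exploration by the other two agents:

```python
from itertools import product

def team_reward(joint_action):
    """Stateless game from this section: 3 agents, 14 actions each;
    team reward 10 for the joint action (1, 5, 9), -10 otherwise."""
    return 10.0 if joint_action == (1, 5, 9) else -10.0

# With the other two agents exploring uniformly, the expected payoff of
# agent 1's optimal action is strongly negative (only 1 of 196 completions
# is rewarding), so an undecomposed joint critic pushes pi_1 away from it,
# which is the CDM effect described in Sec. 3.2.
n_actions = 14
exp_r = sum(team_reward((1, a2, a3))
            for a2, a3 in product(range(n_actions), repeat=2)) / n_actions ** 2
```

Here $\mathbb{E}_{a_{-1}}[r \mid a_1 = 1] = (10 - 195 \cdot 10)/196 \approx -9.9 < 0$, matching the condition from Sec. 3.2 under which $\pi_i(a_i^*|\tau_i)$ decreases.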

5.2. DISCRETE ACTION SPACES: THE STARCRAFT II MICROMANAGEMENT BENCHMARK

We evaluate stochastic DOP on the challenging SMAC benchmark (Samvelyan et al., 2019) for its high control complexity. We compare our method with the state-of-the-art multi-agent stochastic policy gradient method (COMA), value-based methods (VDN, QMIX, QTRAN (Son et al., 2019), NDQ (Wang et al., 2020e), and QPLEX (Wang et al., 2020b)), an exploration method (MAVEN, Mahajan et al. (2019)), and a role-based method (ROMA, Wang et al. (2020c)). For stochastic DOP, we fix the hyperparameter setting and network structure in all experiments, as described in Appendix H. For the baselines, we use their default hyperparameter settings, which have been fine-tuned on the SMAC benchmark. Results are shown in Fig. 3. Stochastic DOP outperforms all the baselines by a wide margin. To the best of our knowledge, this is the first time that a MAPG method has achieved significantly better performance than state-of-the-art value-based methods.

Off-Policy Learning. In our method, κ controls the "off-policyness" of training. For DOP, we set κ to 0.5. To demonstrate the effect of off-policy learning, we change κ to 0 and 1 and compare the performance. In Fig. 4, we can see that both DOP and off-policy DOP (κ=1) perform much better than the on-policy version (κ=0), highlighting the importance of using off-policy data. Moreover, purely off-policy learning generally needs more samples to achieve performance similar to DOP's. Mixing in on-policy data can thus largely improve training efficiency.

The CDM Issue. On-Policy DOP uses the same decomposed critic structure as DOP, but is trained only with on-policy data and does not use tree backup. The only difference between On-Policy DOP and COMA is that the former uses a decomposed joint critic. Therefore, given that a COMA critic has a more powerful expressive capacity than a DOP critic, the outperformance of On-Policy DOP over COMA shows the effect of CDM. COMA is not stable and may diverge after a near-optimal policy has been learned. For example, on the map so_many_baneling, COMA policies degenerate after 2M steps. In contrast, On-Policy DOP converges efficiently and stably.

Decomposed Multi-Agent Tree Backup. DOP with Common Tree Backup (DOP without component (c)) is the same as DOP except that $\mathbb{E}_{\boldsymbol\pi}[Q^{\phi}_{tot}(\boldsymbol\tau,\cdot)]$ is estimated by sampling 200 joint actions from $\boldsymbol\pi$. Here, we estimate this expectation by sampling because direct computation is intractable (for example, $20^{10}$ summations are needed on the map MMM). Fig. 4 shows that as the number of agents increases, sampling becomes less efficient, and common tree backup performs even worse than On-Policy DOP. In contrast, DOP with decomposed tree backup converges quickly and stably using a similar number of summations.

5.3. CONTINUOUS ACTION SPACES: MULTI-AGENT PARTICLE ENVIRONMENTS

We evaluate deterministic DOP on multi-agent particle environments (MPE; Mordatch & Abbeel, 2018), where agents take continuous actions in continuous spaces. We compare our method with MADDPG (Lowe et al., 2017) and MAAC (Iqbal & Sha, 2019). Hyperparameters and the network structure are fixed for deterministic DOP across experiments and are described in Appendix H.

The CDM Issue. We use the task Aggregation as an example to show that deterministic DOP attenuates the CDM issue. In this task, 5 agents navigate to one landmark. Only when all agents reach the landmark do they get a team reward of 10 and successfully end the episode; otherwise, the episode ends after 25 timesteps and the agents get a reward of -10. Aggregation is a typical example where other agents' actions can influence an agent's local policy through an undecomposed joint critic. Intuitively, as long as one agent has not reached the landmark, the centralized Q value is negative, confusing the other agents who do get to the landmark. This intuition is supported by the empirical results shown in Fig. 5-left: methods with undecomposed critics can find rewarding configurations but then quickly diverge, while DOP converges stably.

Credit Assignment. We use the task Mill to show that DOP can learn effective credit assignment mechanisms. In this task, 10 agents need to rotate a millstone clockwise. They can push the millstone clockwise or counterclockwise with a force between 0 and 1. If the millstone's angular velocity ω exceeds 30, the agents are rewarded 3 per step. If ω exceeds 100 within 10 steps, the agents win the episode and get a reward of 10; otherwise, they lose and get a punishment of -10. Fig. 5-right shows that deterministic DOP gradually learns a reasonable credit assignment during training, assigning much larger Q values to rotating the millstone clockwise. This explains why deterministic DOP outperforms previous state-of-the-art deterministic MAPG methods, as shown in Fig. 5-middle.

6. CLOSING REMARKS

This paper pinpointed three drawbacks that hinder the performance of state-of-the-art MAPG algorithms: the restriction of stochastic policy gradient methods to on-policy learning, the centralized-decentralized mismatch problem, and the credit assignment issue in deterministic policy learning. We proposed a decomposed actor-critic method (DOP) to address these problems. Theoretical analyses and empirical evaluations demonstrate that DOP achieves stable and efficient multi-agent off-policy learning.

A MATHEMATICAL DETAILS FOR STOCHASTIC DOP

A.1 DECOMPOSED CRITICS ENABLE TRACTABLE MULTI-AGENT TREE BACKUP

In Sec. 4.1.1, we propose to use tree backup (Precup et al., 2000; Munos et al., 2016) to carry out multi-agent off-policy policy evaluation. When a joint critic is used, calculating $\mathbb{E}_{\boldsymbol\pi}[Q^{\phi}_{tot}(\boldsymbol\tau,\cdot)]$ requires $O(|A|^n)$ summations. To solve this problem, DOP uses a linearly decomposed critic, for which it follows that
$$\begin{aligned}
\mathbb{E}_{\boldsymbol\pi}[Q^{\phi}_{tot}(\boldsymbol\tau,\boldsymbol a)] &= \sum_{\boldsymbol a}\boldsymbol\pi(\boldsymbol a|\boldsymbol\tau)Q^{\phi}_{tot}(\boldsymbol\tau,\boldsymbol a) \\
&= \sum_{\boldsymbol a}\boldsymbol\pi(\boldsymbol a|\boldsymbol\tau)\Big[\sum_i k_i(\boldsymbol\tau)Q^{\phi_i}_i(\boldsymbol\tau,a_i) + b(\boldsymbol\tau)\Big] \\
&= \sum_{\boldsymbol a}\boldsymbol\pi(\boldsymbol a|\boldsymbol\tau)\sum_i k_i(\boldsymbol\tau)Q^{\phi_i}_i(\boldsymbol\tau,a_i) + \sum_{\boldsymbol a}\boldsymbol\pi(\boldsymbol a|\boldsymbol\tau)\,b(\boldsymbol\tau) \\
&= \sum_i\sum_{a_i}\pi_i(a_i|\tau_i)\,k_i(\boldsymbol\tau)Q^{\phi_i}_i(\boldsymbol\tau,a_i)\sum_{\boldsymbol a_{-i}}\boldsymbol\pi_{-i}(\boldsymbol a_{-i}|\boldsymbol\tau_{-i}) + b(\boldsymbol\tau) \\
&= \sum_i k_i(\boldsymbol\tau)\,\mathbb{E}_{\pi_i}[Q^{\phi_i}_i(\boldsymbol\tau,\cdot)] + b(\boldsymbol\tau),
\end{aligned}$$
which means the complexity of calculating this expectation is reduced to $O(n|A|)$.
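The identity above can be checked numerically. In this sketch, random tables stand in for the learned quantities $k_i$, $b$, $Q^{\phi_i}_i$, and the per-agent policies (all names are illustrative); the brute-force enumeration over all $|A|^n$ joint actions must match the $O(n|A|)$ decomposed computation exactly:

```python
from itertools import product
import random

random.seed(0)
n, A = 3, 4
k = [random.random() for _ in range(n)]                      # mixing weights k_i
b = random.random()                                          # mixing bias b
q = [[random.random() for _ in range(A)] for _ in range(n)]  # Q_i(tau, a_i)
pi = []                                                      # per-agent policies
for _ in range(n):
    w = [random.random() + 0.1 for _ in range(A)]
    s = sum(w)
    pi.append([x / s for x in w])

brute = 0.0  # O(|A|^n) enumeration of E_pi[Q_tot]
for a in product(range(A), repeat=n):
    p = 1.0
    for i in range(n):
        p *= pi[i][a[i]]
    brute += p * (sum(k[i] * q[i][a[i]] for i in range(n)) + b)

fast = sum(k[i] * sum(pi[i][ai] * q[i][ai] for ai in range(A))
           for i in range(n)) + b  # O(n|A|) decomposed form
```

The two values agree to floating-point precision, reflecting that the derivation is exact rather than an approximation.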

A.2.1 ON-POLICY VERSION

In Sec. 4.1.1, we give the on-policy stochastic DOP policy gradients:
$$g = \mathbb{E}_{\boldsymbol\pi}\Big[\sum_i k_i(\boldsymbol\tau)\nabla_{\theta_i}\log\pi_i(a_i|\tau_i;\theta_i)\,Q^{\phi_i}_i(\boldsymbol\tau,a_i)\Big].$$
We now derive them in detail.

Proof. We use the aristocrat utility (Wolpert & Tumer, 2002) to perform credit assignment:
$$\begin{aligned}
U_i(\boldsymbol\tau,a_i) &= Q^{\phi}_{tot}(\boldsymbol\tau,\boldsymbol a) - \sum_x \pi_i(x|\tau_i)\,Q^{\phi}_{tot}(\boldsymbol\tau,(x,\boldsymbol a_{-i})) \\
&= \sum_j k_j(\boldsymbol\tau)Q^{\phi_j}_j(\boldsymbol\tau,a_j) - \sum_x \pi_i(x|\tau_i)\Big[\sum_{j\ne i}k_j(\boldsymbol\tau)Q^{\phi_j}_j(\boldsymbol\tau,a_j) + k_i(\boldsymbol\tau)Q^{\phi_i}_i(\boldsymbol\tau,x)\Big] \\
&= k_i(\boldsymbol\tau)Q^{\phi_i}_i(\boldsymbol\tau,a_i) - k_i(\boldsymbol\tau)\sum_x \pi_i(x|\tau_i)\,Q^{\phi_i}_i(\boldsymbol\tau,x) \\
&= k_i(\boldsymbol\tau)\Big[Q^{\phi_i}_i(\boldsymbol\tau,a_i) - \sum_x \pi_i(x|\tau_i)\,Q^{\phi_i}_i(\boldsymbol\tau,x)\Big].
\end{aligned}$$
It is worth noting that $U_i$ is independent of other agents' actions. Then, for the policy gradients, we have
$$\begin{aligned}
g &= \mathbb{E}_{\boldsymbol\pi}\Big[\sum_i \nabla_{\theta_i}\log\pi_i(a_i|\tau_i)\,U_i(\boldsymbol\tau,a_i)\Big] \\
&= \mathbb{E}_{\boldsymbol\pi}\Big[\sum_i \nabla_{\theta_i}\log\pi_i(a_i|\tau_i)\,k_i(\boldsymbol\tau)\Big(Q^{\phi_i}_i(\boldsymbol\tau,a_i) - \sum_x \pi_i(x|\tau_i)\,Q^{\phi_i}_i(\boldsymbol\tau,x)\Big)\Big] \\
&= \mathbb{E}_{\boldsymbol\pi}\Big[\sum_i \nabla_{\theta_i}\log\pi_i(a_i|\tau_i)\,k_i(\boldsymbol\tau)\,Q^{\phi_i}_i(\boldsymbol\tau,a_i)\Big],
\end{aligned}$$
where the last step uses the fact that $\mathbb{E}_{\pi_i}[\nabla_{\theta_i}\log\pi_i(a_i|\tau_i)] = 0$, so the action-independent baseline term vanishes in expectation.
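The two key facts in this derivation, that $U_i$ collapses to $k_i(Q^{\phi_i}_i(\boldsymbol\tau,a_i) - \sum_x \pi_i(x|\tau_i)Q^{\phi_i}_i(\boldsymbol\tau,x))$ and is independent of $\boldsymbol a_{-i}$, can be verified on small tables. The tables `k`, `q`, `pi`, `b` below are illustrative stand-ins for the network outputs:

```python
def aristocrat_utility(i, a, k, q, pi, b):
    """U_i = Q_tot(tau, a) - sum_x pi_i(x) Q_tot(tau, (x, a_{-i})),
    computed by direct enumeration on toy tables."""
    def q_tot(joint):
        return sum(k[j] * q[j][joint[j]] for j in range(len(k))) + b
    base = 0.0
    for x, p in enumerate(pi[i]):
        ax = list(a)
        ax[i] = x
        base += p * q_tot(tuple(ax))
    return q_tot(tuple(a)) - base
```

Because the $j \ne i$ terms and the bias $b$ cancel (the $\pi_i$ weights sum to 1), changing the other agent's action leaves $U_i$ untouched, which is exactly why the resulting gradients escape the CDM issue.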

A.2.2 OFF-POLICY VERSION

In Appendix A.2.1, we derive the on-policy policy gradients for updating stochastic multi-agent policies. As with policy evaluation, using off-policy data can improve the sample efficiency of policy improvement. Using the linearly decomposed critic architecture, the off-policy policy gradients for learning stochastic policies are
$$g = \mathbb{E}_{\boldsymbol\beta}\Big[\frac{\boldsymbol\pi(\boldsymbol a|\boldsymbol\tau)}{\boldsymbol\beta(\boldsymbol a|\boldsymbol\tau)}\sum_i k_i(\boldsymbol\tau)\nabla_{\theta_i}\log\pi_i(a_i|\tau_i;\theta_i)\,Q^{\phi_i}_i(\boldsymbol\tau,a_i)\Big].$$
Proof. The objective function is $J(\theta) = \mathbb{E}_{\boldsymbol\beta}[V^{\boldsymbol\pi}_{tot}(\boldsymbol\tau)]$. Similar to Degris et al. (2012), we have
$$\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\boldsymbol\beta}\Big[\frac{\boldsymbol\pi(\boldsymbol a|\boldsymbol\tau)}{\boldsymbol\beta(\boldsymbol a|\boldsymbol\tau)}\sum_i \nabla_{\theta_i}\log\pi_i(a_i|\tau_i)\,U_i(\boldsymbol\tau,a_i)\Big] \\
&= \mathbb{E}_{\boldsymbol\beta}\Big[\frac{\boldsymbol\pi(\boldsymbol a|\boldsymbol\tau)}{\boldsymbol\beta(\boldsymbol a|\boldsymbol\tau)}\sum_i \nabla_{\theta_i}\log\pi_i(a_i|\tau_i)\,k_i(\boldsymbol\tau)A_i(\boldsymbol\tau,a_i)\Big] \\
&= \mathbb{E}_{\boldsymbol\beta}\Big[\frac{\boldsymbol\pi(\boldsymbol a|\boldsymbol\tau)}{\boldsymbol\beta(\boldsymbol a|\boldsymbol\tau)}\sum_i \nabla_{\theta_i}\log\pi_i(a_i|\tau_i)\,k_i(\boldsymbol\tau)Q^{\phi_i}_i(\boldsymbol\tau,a_i)\Big],
\end{aligned}$$
where $A_i(\boldsymbol\tau,a_i) = Q^{\phi_i}_i(\boldsymbol\tau,a_i) - \sum_x \pi_i(x|\tau_i)\,Q^{\phi_i}_i(\boldsymbol\tau,x)$ and the action-independent baseline term again contributes zero in expectation.

B MATHEMATICAL DETAILS FOR DETERMINISTIC DOP

B.1 DETERMINISTIC DOP POLICY GRADIENT THEOREM

In Sec. 4.2.1, we give the following deterministic DOP policy gradients:
$$\nabla J(\theta) = \mathbb{E}_{\boldsymbol\tau\sim D}\Big[\sum_i k_i(\boldsymbol\tau)\nabla_{\theta_i}\mu_i(\tau_i;\theta_i)\nabla_{a_i}Q^{\phi_i}_i(\boldsymbol\tau,a_i)\big|_{a_i=\mu_i(\tau_i;\theta_i)}\Big].$$
We now present the derivation of this update rule.

Proof. Drawing inspiration from the single-agent case (Silver et al., 2014), we have
$$\begin{aligned}
\nabla J(\theta) &= \mathbb{E}_{\boldsymbol\tau\sim D}\big[\nabla_\theta Q^{\phi}_{tot}(\boldsymbol\tau,\boldsymbol a)\big|_{\boldsymbol a=\boldsymbol\mu(\boldsymbol\tau;\theta)}\big] \\
&= \mathbb{E}_{\boldsymbol\tau\sim D}\Big[\sum_i \nabla_\theta\, k_i(\boldsymbol\tau)Q^{\phi_i}_i(\boldsymbol\tau,a_i)\big|_{a_i=\mu_i(\tau_i;\theta_i)}\Big] \\
&= \mathbb{E}_{\boldsymbol\tau\sim D}\Big[\sum_i k_i(\boldsymbol\tau)\nabla_{\theta_i}\mu_i(\tau_i;\theta_i)\nabla_{a_i}Q^{\phi_i}_i(\boldsymbol\tau,a_i)\big|_{a_i=\mu_i(\tau_i;\theta_i)}\Big].
\end{aligned}$$

C THEORETICAL JUSTIFICATION FOR STOCHASTIC DOP POLICY IMPROVEMENT

In order to understand how DOP works despite the biased estimate of $Q^{\boldsymbol\pi}_{tot}(\boldsymbol\tau,\boldsymbol a)$, we provide some theoretical justification for the policy update. Unfortunately, a thorough analysis of deep neural networks and TD learning is too complex to be carried out. Thus, we make some assumptions for the mathematical proofs. The following two subsections provide two different viewpoints of theoretical understanding. 1. In the first view (Sec. C.1), we make some mild assumptions on value evaluation that hold for a wide range of function classes. In this way, we can prove a policy improvement theorem similar to that of Degris et al. (2012). 2. In the second view, we remove the MONOTONE condition from a practical point of view. We then prove that when the loss of value evaluation is minimized (i.e., the individual critics output $Q^{\phi_i}_i(\boldsymbol\tau,a_i)$ that are good estimates of $Q^{\boldsymbol\pi}_i(\boldsymbol\tau,a_i)$), the DOP gradients in Eq. 12 equal the standard gradient form in Eq. 2.

C.1 PROOF OF STOCHASTIC DOP POLICY IMPROVEMENT THEOREM

Inspired by previous work (Degris et al., 2012) , we relax the requirement that Q φ tot is a good estimate of Q π tot and show that stochastic DOP still guarantees policy improvement. First, we define Q π i (τ , a i ) = a-i π -i (a -i |τ -i )Q π tot (τ , a), A π i (τ , a i ) = a-i π(a -i |τ -i )A π i (τ , a). Directly analyzing the minimization of TD-error is challenging. To make it tractable, some works (Feinberg et al., 2018) simplify this analysis to an MSE problem. For the analysis of stochastic DOP, we adopt the same technique and formalize the critic's learning as the following problem: L(φ) = a,τ p(τ )π(a|τ ) Q π tot (τ , a) -Q φ tot (τ , a) 2 , ( ) where Q π tot (τ , a) are the true values, which are fixed during optimization. In the following analysis, we assume distinct parameters for different τ . We first show that Fact 1 holds for a wide range of function class of Q φi i . To this end, we first prove the following lemma. Lemma 1. Without loss of generality, we consider the following optimization problem: L τ (φ) = a π(a|τ ) Q π (τ , a) -f (Q φ (τ , a)) 2 . ( ) Here, f (Q φ (τ , a)) : R n → R, and Q φ (τ , a) is a vector whose i th entry is Q φi i (τ , a i ). In DOP, f satisfies that ∂f ∂Q φ i i (τ ,ai) > 0 for any i, a i . If ∇ φi Q φi i (τ , a i ) = 0, ∀φ i , a i it holds that: Q π i (τ , a i ) ≥ Q π i (τ , a i ) ⇐⇒ Q φi i (τ , a i ) ≥ Q φi i (τ , a i ), ∀a i , a i . Proof. When the optimization converges, φ i reaches a stationary point where ∇ φi L τ (φ) = 0, ∀i. π i (a i |τ i ) a-i j =i π j (a j |τ j ) Q π tot (τ , a) -f (Q φ (τ , a)) (- ∂f ∂Q φi i (τ , a i ) )∇ φi Q φi i (τ , a i ) = 0, ∀a i . Since ∇ φi Q φi i (τ , a i ) = 0, this implies that ∀i, a i , we have a-i j =i π j (a j |τ j )(Q π tot (τ , a) -f (Q φ (τ , a))) = 0 ⇒ a-i π -i (a -i |τ -i )f (Q φ (τ , a)) = Q π i (τ , a i ) We consider the function q(τ , a i ) = a-i π -i (a -i |τ -i )f (Q φ (τ , a))), which is a function of Q φ . 
Its partial derivative w.r.t Q φi i (τ , a i ) is: ∂q(τ , a i ) ∂Q φi i (τ , a i ) = a-i π -i (a -i |τ -i ) ∂f (Q φ (τ , a)) ∂Q φi i (τ , a i ) > 0 Therefore, if Q π i (τ , a i ) ≥ Q π i (τ , a i ), then any local minimal of L τ (φ) satisfies Q φi i (τ , a i ) ≥ Q φi i (τ , a i ). We argue that ∇ φi Q φi i (τ , a i ) = 0 is a rather mild assumption and holds for a large range of function class of Q φi i . Fact 3. [Formal Statement of Fact 1] For the following choices of Q φi i : 1. Tabular expression of Q φi i which requires Q(n|A||τ |) space; 2. Linear function class where a i are one-hot coded: Q φi i (τ , a i ) = φ i • τ , a i ; 3. 2-layer neural networks (φ i = 0) with strictly monotonic increasing activation functions (e.g. tanh, leaky-relu). 4. Arbitrary k-layer neural networks whose activation function at the (k -1) th layer is sigmoid. when value evaluation converges, ∀π, Q φi i satisfies that Q π i (τ , a i ) ≥ Q π i (τ , a i ) ⇐⇒ Q φi i (τ , a i ) ≥ Q φi i (τ , a i ), ∀τ , a i , a i . Proof. We need to prove that ∇ φi Q φi i (τ , a i ) = 0. For brevity, use a k i to denote the k th element of the one-hot coding, and use φ ta k i i to denote the weight connecting the t th element of the upper layer and the a k i element. (1 & 2) For tabular expression and linear functions, ∀a i = k we have ∂Q φi i (τ , a i ) ∂φ 1a k i i = 1 (3) The 2-layer neural network can be written as Q φi i (τ , a i ) = W 2 σ(W 1 (τ , a i )) . Besides, we denote the hidden layer as h. Since φ i = 0, we consider some nonzero element φ W2 1t,i . For the k th action, the gradient of the parameter φ W1 tk,i is ∂Q φi i (τ , a i ) ∂φ W1 tk,i = φ W2 1t,i σ (h t ) = 0, ∀k Without loss of generality, we consider the last layer φ W k 1t,i : ∂Q φi i (τ , a i ) ∂φ W k 1t,i = σ(h k-1 t ) > 0 These are the cases where ∇ φi Q φi i = 0. Even when ∃φ i , ∇ φi Q φi i = 0 , such φ i usually occupy only a small parameter space and happen with a small probability. 
As a result, we conclude that $\nabla_{\phi_i} Q_i^{\phi_i}(\tau, a_i) \neq 0$ is a rather mild assumption. Based on Fact 1, we can prove the policy improvement theorem for stochastic DOP: even without an accurate estimate of $Q^\pi_{tot}$, the stochastic DOP policy updates still improve the objective $J(\pi) = \mathbb{E}_\pi[\sum_t \gamma^t r_t]$. We first prove the following lemma.

Lemma 2. Let $\{a_i\}$ and $\{b_i\}$, $i \in [n]$, be two sequences listed in increasing order. If $\sum_i b_i = 0$, then $\sum_i a_i b_i \ge 0$.

Proof. Denote $\bar{a} = \frac{1}{n} \sum_i a_i$ and $\tilde{a}_i = a_i - \bar{a}$. Then $\sum_i a_i b_i = \bar{a} (\sum_i b_i) + \sum_i \tilde{a}_i b_i = \sum_i \tilde{a}_i b_i$, where $\sum_i \tilde{a}_i = 0$, so we may assume that $\sum_i a_i = 0$; without loss of generality, we also assume $a_i \neq 0$ for all $i$. Since both sequences are increasing and sum to zero, there exist indices $j$ and $k$ such that $a_j \le 0, a_{j+1} \ge 0$ and $b_k \le 0, b_{k+1} \ge 0$. Since $a$ and $b$ play symmetric roles, we assume $j \le k$. Then
$$\sum_{i \in [n]} a_i b_i = \sum_{i \in [1, j]} a_i b_i + \sum_{i \in [j+1, k]} a_i b_i + \sum_{i \in [k+1, n]} a_i b_i \ge \sum_{i \in [j+1, k]} a_i b_i + \sum_{i \in [k+1, n]} a_i b_i \ge a_k \sum_{i \in [j+1, k]} b_i + a_{k+1} \sum_{i \in [k+1, n]} b_i,$$
where the first inequality holds because $a_i \le 0$ and $b_i \le 0$ for $i \le j$. As $\sum_{i \in [j+1, n]} b_i \ge 0$ (the tail sum of an increasing zero-sum sequence is non-negative), we have $-\sum_{i \in [j+1, k]} b_i \le \sum_{i \in [k+1, n]} b_i$. Thus, together with $a_k \ge 0$,
$$\sum_{i \in [n]} a_i b_i \ge (a_{k+1} - a_k) \sum_{i \in [k+1, n]} b_i \ge 0.$$

Based on Fact 1 and Lemma 2, we prove the following proposition.

Proposition 2. [Stochastic DOP policy improvement theorem] Under mild assumptions, for any pre-update policy $\pi^o$ which is updated by Eq. 10 to $\pi$, denote $\pi_i(a_i|\tau_i) = \pi_i^o(a_i|\tau_i) + \beta_{a_i, \tau}\, \delta$, where $\delta > 0$ is a sufficiently small number. If it holds that
$$\forall \tau, a_i, a_i', \quad Q_i^{\phi_i}(\tau, a_i) \ge Q_i^{\phi_i}(\tau, a_i') \iff \beta_{a_i, \tau} \ge \beta_{a_i', \tau}$$
(the MONOTONE condition; here $\phi_i$ are the parameters before the update), then $J(\pi) \ge J(\pi^o)$, i.e., the joint policy is improved by the update.

Proof. Under Fact 1, it follows that $Q_i^{\pi^o}(\tau, a_i) \ge Q_i^{\pi^o}(\tau, a_i') \iff \beta_{a_i, \tau} \ge \beta_{a_i', \tau}$. Since $J(\pi) = \sum_{\tau_0} p(\tau_0) V^\pi_{tot}(\tau_0)$, it suffices to prove that $V^\pi_{tot}(\tau_t) \ge V^{\pi^o}_{tot}(\tau_t)$ for all $\tau_t$. We have:
$$\begin{aligned}
\sum_{a_t} \pi(a_t|\tau_t)\, Q^{\pi^o}_{tot}(\tau_t, a_t) &= \sum_{a_t} \prod_{i=1}^n \pi_i(a_i^t|\tau_i^t)\, Q^{\pi^o}_{tot}(\tau_t, a_t) \\
&= \sum_{a_t} \prod_{i=1}^n \left( \pi_i^o(a_i^t|\tau_i^t) + \beta_{a_i^t, \tau_t}\, \delta \right) Q^{\pi^o}_{tot}(\tau_t, a_t) \\
&= V^{\pi^o}_{tot}(\tau_t) + \delta \sum_{i=1}^n \sum_{a_t} \beta_{a_i^t, \tau_t} \Big( \prod_{j \neq i} \pi_j^o(a_j^t|\tau_j^t) \Big) Q^{\pi^o}_{tot}(\tau_t, a_t) + o(\delta) \\
&= V^{\pi^o}_{tot}(\tau_t) + \delta \sum_{i=1}^n \sum_{a_i^t} \beta_{a_i^t, \tau_t}\, Q_i^{\pi^o}(\tau_t, a_i^t) + o(\delta). \qquad (21)
\end{aligned}$$
Since $\delta$ is sufficiently small, we omit $o(\delta)$ in the following analysis. Observing that $\sum_{a_i} \pi_i(a_i|\tau_i) = 1$ for all $i$, we get $\sum_{a_i} \beta_{a_i, \tau} = 0$. Thus, by Lemma 2 and Eq. 21, we have
$$\sum_{a_t} \pi(a_t|\tau_t)\, Q^{\pi^o}_{tot}(\tau_t, a_t) \ge V^{\pi^o}_{tot}(\tau_t).$$
Similar to the policy improvement theorem for tabular MDPs (Sutton & Barto, 2018), we have
$$\begin{aligned}
V^{\pi^o}_{tot}(\tau_t) &\le \sum_{a_t} \pi(a_t|\tau_t)\, Q^{\pi^o}_{tot}(\tau_t, a_t) \\
&= \sum_{a_t} \pi(a_t|\tau_t) \Big[ r(\tau_t, a_t) + \gamma \sum_{\tau_{t+1}} p(\tau_{t+1}|\tau_t, a_t)\, V^{\pi^o}_{tot}(\tau_{t+1}) \Big] \\
&\le \sum_{a_t} \pi(a_t|\tau_t) \Big[ r(\tau_t, a_t) + \gamma \sum_{\tau_{t+1}} p(\tau_{t+1}|\tau_t, a_t) \sum_{a_{t+1}} \pi(a_{t+1}|\tau_{t+1})\, Q^{\pi^o}_{tot}(\tau_{t+1}, a_{t+1}) \Big] \\
&\le \cdots \le V^{\pi}_{tot}(\tau_t).
\end{aligned}$$
This implies $J(\pi) \ge J(\pi^o)$ for each update.

Moreover, we verify that the MONOTONE condition, $\forall \tau, a_i, a_i',\ Q_i^{\phi_i}(\tau, a_i) \ge Q_i^{\phi_i}(\tau, a_i') \iff \beta_{a_i, \tau} \ge \beta_{a_i', \tau}$, holds for any $\pi$ with a tabular expression. For such $\pi$, let $\pi_i(a_i|\tau_i) = \theta_{a_i, \tau}$; then $\sum_{a_i} \theta_{a_i, \tau} = 1$. The gradient of the policy update can be written as
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{d(\tau), \pi} \Big[ \sum_i k_i(\tau)\, \nabla_\theta \log \pi_i(a_i|\tau_i; \theta_i)\, Q_i^{\phi_i}(\tau, a_i) \Big] = \sum_\tau d(\tau) \sum_i k_i(\tau) \sum_{a_i} \nabla_{\theta_i} \pi_i(a_i|\tau_i)\, Q_i^{\phi_i}(\tau, a_i) = \sum_\tau d(\tau) \sum_i k_i(\tau) \sum_{a_i} \nabla_{\theta_i} \pi_i(a_i|\tau_i)\, A_i^{\phi_i}(\tau, a_i),$$
where $d(\tau)$ is the occupancy measure induced by our algorithm. With a tabular expression, the update of each $\theta_{a_i, \tau}$ is proportional to $\beta_{a_i, \tau}$:
$$\beta_{a_i, \tau} \propto \frac{\partial J(\pi_\theta)}{\partial \theta_{a_i, \tau}} = d(\tau)\, k_i(\tau)\, A_i^{\phi_i}(\tau, a_i).$$
Since $d(\tau)\, k_i(\tau) \ge 0$, clearly $\beta_{a_i', \tau} \ge \beta_{a_i, \tau} \iff Q_i^{\phi_i}(\tau, a_i') \ge Q_i^{\phi_i}(\tau, a_i)$.
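As a quick numerical sanity check (not part of the original analysis), Lemma 2 can be verified on random increasing sequences with a zero-sum second sequence. The helper names below are our own:

```python
import random

def lemma2_holds(a, b):
    """Check the claim of Lemma 2: for increasing sequences a, b with
    sum(b) == 0, the inner product sum_i a_i * b_i is non-negative."""
    assert list(a) == sorted(a) and list(b) == sorted(b)
    assert abs(sum(b)) < 1e-9
    return sum(x * y for x, y in zip(a, b)) >= -1e-9  # tolerate float error

def random_case(n, rng):
    """Draw a random increasing sequence a and a random increasing,
    zero-sum sequence b (centering by the mean preserves the ordering)."""
    a = sorted(rng.uniform(-1, 1) for _ in range(n))
    b = sorted(rng.uniform(-1, 1) for _ in range(n))
    mean_b = sum(b) / n
    return a, [x - mean_b for x in b]

rng = random.Random(0)
assert all(lemma2_holds(*random_case(8, rng)) for _ in range(1000))
```

The check mirrors how the lemma is used in Proposition 2: the $\beta$'s play the role of the zero-sum sequence, and the local Q values play the role of the other.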

C.2 ANALYSIS WITHOUT MONOTONE CONDITION

For practical implementations of the policies $\pi_i(a_i|\tau_i)$, the MONOTONE condition is too strong to be satisfied for all $\pi_i$, and analyzing the policy update when the condition is violated is difficult with only Fact 1 at hand. It is therefore helpful to understand policy improvement without the MONOTONE condition. To bypass it, we require a stronger property of the learned $Q_i^{\phi_i}(\tau, a_i)$ than order preservation (Fact 1). Theorem 1 in Wang et al. (2020a) offers a closed-form solution of the additive decomposition, and we restate it as the following lemma.

Lemma 3 (Restatement of Theorem 1 in Wang et al. (2020a)). Consider the solution of
$$\arg\min_Q \sum_{(\tau, a)} \pi(a|\tau) \Big[ y(\tau, a) - \sum_{i=1}^n Q_i(\tau, a_i) \Big]^2.$$
For all $i \in [n]$ and all $\tau, a$, the individual action-value functions satisfy
$$Q_i(\tau, a_i) = \mathbb{E}_{a_{-i} \sim \pi_{-i}(\cdot|\tau_{-i})} \left[ y(\tau, a_i, a_{-i}) \right] - \frac{n-1}{n}\, \mathbb{E}_{a \sim \pi(\cdot|\tau)} \left[ y(\tau, a) \right] + w_i(s),$$
where the residual term $w$ is an arbitrary vector satisfying $\sum_{i=1}^n w_i(s) = 0$ for all $s$.

Based on this lemma, we can derive another proposition that theoretically justifies the DOP architecture.

Proposition 1. Suppose the function class expressed by $Q_i^{\phi_i}(\tau, a_i)$ is sufficiently large (e.g., neural networks) and the following loss $L(\phi)$ is minimized:
$$L(\phi) = \sum_{\tau, a} p(\tau)\, \pi(a|\tau) \left( Q^\pi_{tot}(\tau, a) - Q^\phi_{tot}(\tau, a) \right)^2,$$
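Lemma 3's closed form can be checked numerically for a small tabular case with $n = 2$ agents: the $\pi$-weighted least-squares fit of an additive function to an arbitrary target $y$ coincides with the main-effect decomposition the lemma prescribes. All names below are illustrative, and the residuals $w_i$ are set to zero (they cancel in the sum):

```python
import numpy as np

rng = np.random.default_rng(0)
n, A = 2, 3                                          # two agents, three actions
pi = [rng.dirichlet(np.ones(A)) for _ in range(n)]   # independent policies
y = rng.normal(size=(A, A))                          # arbitrary target y(a1, a2)

# closed-form per-agent terms from Lemma 3
p = np.outer(pi[0], pi[1])                           # joint action distribution
Ey = (p * y).sum()
q1 = y @ pi[1] - (n - 1) / n * Ey                    # E_{a2}[y(a1, .)] - (n-1)/n E[y]
q2 = pi[0] @ y - (n - 1) / n * Ey
closed = q1[:, None] + q2[None, :]

# pi-weighted least squares over additive functions Q1(a1) + Q2(a2)
feats = np.zeros((A * A, 2 * A))
for i in range(A):
    for j in range(A):
        feats[i * A + j, i] = 1.0                    # one-hot for a1
        feats[i * A + j, A + j] = 1.0                # one-hot for a2
w = np.sqrt(p.ravel())
coef, *_ = np.linalg.lstsq(feats * w[:, None], y.ravel() * w, rcond=None)
fitted = (feats @ coef).reshape(A, A)

# the fitted additive Q_tot matches the lemma's closed form everywhere
assert np.allclose(closed, fitted, atol=1e-8)
```

Although the per-agent coefficients are only identified up to the residual terms, the fitted sum is unique, which is why the comparison is made on the combined values.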

G BASELINE BY SAMPLING

One problem of existing MAPG methods is the CDM issue: the large variance in policy gradients caused by the influence of other agents' actions, which is introduced through the joint critic. Another technique frequently used to reduce variance in policy gradients in the single-agent RL literature is the use of baselines (Sutton & Barto, 2018). In this section, we investigate whether baselines can effectively reduce variance in multi-agent settings. We start with centralized critics. COMA uses a baseline in which local actions are marginalized. Since the variance and performance of COMA have been discussed in Sec. 5, we omit it here and study the baseline in which the actions of all agents are marginalized. In multi-agent settings, calculating this baseline requires computing an expectation over the joint action space, which is generally intractable. To address this, we estimate the expectation by sampling. We compare stochastic DOP, COMA, and On-Policy DOP against this method, which we call Regular Critics with Baseline. Results are shown in Fig. 8. Regular Critics with Baseline performs better than COMA but worse than On-Policy DOP. These results indicate that a linearly decomposed critic reduces variance in policy gradients more efficiently.
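The sampled baseline described above can be sketched as follows; `q_fn` and the tabular setup are stand-ins for the paper's actual critic and policies, not its implementation:

```python
import numpy as np

def sampled_baseline(q_fn, policies, n_samples, rng):
    """Monte Carlo estimate of the baseline b = E_{a~pi}[Q(tau, a)] for a
    fixed trajectory tau, marginalizing the actions of ALL agents by
    sampling joint actions instead of enumerating the joint action space."""
    total = 0.0
    for _ in range(n_samples):
        a = tuple(rng.choice(len(p), p=p) for p in policies)
        total += float(q_fn(a))
    return total / n_samples

# toy check: with a tabular Q and uniform policies, the estimate should
# approach the exact expectation (here, the plain average of the table)
rng = np.random.default_rng(0)
q_table = rng.normal(size=(4, 4))
pis = [np.full(4, 0.25), np.full(4, 0.25)]
exact = q_table.mean()
est = sampled_baseline(lambda a: q_table[a], pis, 20000, rng)
assert abs(est - exact) < 0.05
```

The sample count trades estimation accuracy against the per-update cost that makes the exact expectation intractable for many agents.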

H INFRASTRUCTURE, ARCHITECTURE, AND HYPERPARAMETERS

Experiments are carried out on NVIDIA P100 GPUs with fixed hyper-parameter settings, which are described in the following sections.

H.1 STOCHASTIC DOP

The local utility networks output $Q_i^{\phi_i}(\tau, \cdot)$ for each possible local action, and these values are linearly combined to get an estimate of the global Q value. The weights and bias of the linear combination, $k_i$ and $b$, are generated by linear networks conditioned on the global state $s$. Each $k_i$ is enforced to be non-negative by applying an absolute activation at the last layer; we then divide $k_i$ by $\sum_i k_i$ to scale it into $[0, 1]$. The local policy network consists of three layers: a fully-connected layer, followed by a 64-dimensional GRU, followed by another fully-connected layer that outputs a probability distribution over local actions. We use ReLU activation after the first fully-connected layer. For all experiments, we set κ = 0.5 and use an off-policy replay buffer storing the latest 5000 episodes and an on-policy buffer with a size of 32. We run 4 parallel environments to collect data. Both the critic and the actors are optimized with RMSprop (α = 0.99, no momentum or weight decay); the learning rates for the critic and actors are set to 1 × 10⁻⁴ and 5 × 10⁻⁴, respectively. For exploration, we use ε-greedy with ε annealed linearly from 1.0 to 0.05 over 500k time steps and kept constant for the rest of training. Mixed batches consisting of 32 episodes sampled from the off-policy replay buffer and 16 episodes sampled from the on-policy buffer are used to train the critic. For training the actors, we sample 16 episodes from the on-policy buffer each time. The framework is trained on fully unrolled episodes, and we use 5-step decomposed multi-agent tree backup. All experiments on StarCraft II use the default reward and observation settings of the SMAC benchmark.
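The linear mixing described above (absolute activation, then normalization of the $k_i$) can be sketched as follows; weight shapes and initialization are illustrative assumptions, not the paper's code:

```python
import numpy as np

class LinearMixer:
    """Sketch of the DOP mixing step: non-negative weights k_i and a bias b
    are produced by linear maps of the global state s (absolute activation,
    then normalization so the k_i sum to 1), and the local utilities are
    combined linearly into Q_tot."""
    def __init__(self, state_dim, n_agents, rng):
        self.Wk = rng.normal(size=(n_agents, state_dim)) * 0.1
        self.Wb = rng.normal(size=state_dim) * 0.1

    def __call__(self, s, q_locals):
        k = np.abs(self.Wk @ s)        # absolute activation -> k_i >= 0
        k = k / k.sum()                # scale weights into [0, 1]
        b = self.Wb @ s
        return float(k @ q_locals + b)

rng = np.random.default_rng(0)
mixer = LinearMixer(state_dim=8, n_agents=3, rng=rng)
s, q = rng.normal(size=8), rng.normal(size=3)
q_tot = mixer(s, q)
# Q_tot is linear in the local utilities: shifting every Q_i by c shifts
# Q_tot by exactly c, since the k_i sum to one.
assert np.isclose(mixer(s, q + 1.0) - q_tot, 1.0)
```

Keeping the mixing linear with non-negative weights is what makes the decomposed policy gradients and the monotonicity arguments in the appendix go through.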

H.2 DETERMINISTIC DOP

The critic network structure of deterministic DOP is similar to that of stochastic DOP, except that local actions are part of the critic's input in deterministic DOP. For the actors, we use fully-connected feed-forward networks with two 64-dimensional hidden layers and ReLU activation, whose output is a local action. We use an off-policy replay buffer storing the latest 10000 transitions, from which 1250 transitions are sampled each time to train the critic and actors. The learning rates of both the critic and the actors are set to 5 × 10⁻³. To reduce variance in the actor updates, we update the actors and target networks only after every 2 updates to the critic, as proposed by Fujimoto et al. (2018); we use this delayed policy update technique in all the baselines as well. For all algorithms, we run a single environment to collect data, because we empirically find it more sample-efficient than parallel environments on the MPE benchmark. As in stochastic DOP, both the critic and the actors are optimized with RMSprop (α = 0.99, no momentum or weight decay).
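The delayed-update schedule (one actor/target update per $d = 2$ critic updates, following Fujimoto et al. (2018)) can be sketched as a simple control loop; the function and its return format are our own illustration:

```python
def update_schedule(total_steps, d=2):
    """Sketch of the delayed policy/target updates described above:
    the critic is updated every step, while actors and target networks
    are updated once every d critic updates."""
    ops = []
    for t in range(1, total_steps + 1):
        ops.append("critic")
        if t % d == 0:
            ops.append("actor+targets")
    return ops

ops = update_schedule(6, d=2)
assert ops.count("critic") == 6
assert ops.count("actor+targets") == 3
```

Updating the actors less frequently lets the critic's value estimates settle before they are used to compute policy gradients, which is the variance-reduction rationale cited above.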

I RELATED WORKS

Cooperative multi-agent reinforcement learning provides a scalable approach to learning collaborative strategies for many challenging tasks (Vinyals et al., 2019; Berner et al., 2019; Samvelyan et al., 2019; Jaderberg et al., 2019) and a computational framework to study many problems, including the emergence of tool usage (Baker et al., 2020) , communication (Foerster et al., 2016; Sukhbaatar et al., 2016; Lazaridou et al., 2017; Das et al., 2019) , social influence (Jaques et al., 2019) , and inequity aversion (Hughes et al., 2018) . Recent work on role-based learning (Wang et al., 2020c; 2021) introduces the concept of division of labor into multi-agent learning and grounds MARL into more realistic applications. Centralized learning of joint actions can handle coordination problems and avoid non-stationarity. However, the major concern of centralized training is scalability, as the joint action space grows exponentially with the number of agents. The coordination graph (Guestrin et al., 2002b; a) is a promising approach to achieve scalable centralized learning, which exploits coordination independencies between agents and decomposes a global reward function into a sum of local terms. Zhang & Lesser (2011) employ the distributed constraint optimization technique to coordinate distributed learning of joint action-value functions. Sparse cooperative Q-learning (Kok & Vlassis, 2006) learns to coordinate the actions of a group of cooperative agents only in the states where such coordination is necessary. These methods require the dependencies between agents to be pre-supplied. To avoid this assumption, value function decomposition methods directly learn centralized but factorized global Q-functions. They implicitly represent the coordination dependencies among agents by the decomposable structure (Sunehag et al., 2018; Rashid et al., 2018; Son et al., 2019; Wang et al., 2020e) . The stability of multi-agent off-policy learning is a long-standing problem. 
Foerster et al. (2017) ; Wang et al. (2020a) study this problem in value-based methods. In this paper, we focus on how to achieve efficient off-policy policy-based learning. Our work is complementary to previous work based on multi-agent policy gradients, such as those regarding multi-agent multi-task learning (Teh et al., 2017; Omidshafiei et al., 2017) and multi-agent exploration (Wang et al., 2020d) . Multi-agent policy gradient algorithms enjoy stable convergence properties compared to value-based methods (Gupta et al., 2017; Wang et al., 2020a) and can extend MARL to continuous control problems. COMA (Foerster et al., 2018) and MADDPG (Lowe et al., 2017) propose the paradigm of centralized critic with decentralized actors to deal with the non-stationarity issue while maintaining decentralized execution. PR2 (Wen et al., 2019) and MAAC (Iqbal & Sha, 2019) extend the CCDA paradigm by introducing the mechanism of recursive reasoning and attention, respectively. Another line of research focuses on fully decentralized actor-critic learning (Macua et al., 2017; Zhang et al., 2018; Yang et al., 2018; Cassano et al., 2018; Suttle et al., 2019; Zhang & Zavlanos, 2019) . Different from the setting of this paper, agents have local reward functions and full observation of the true state in these works.



Figure 2: Bias-variance trade-off of DOP on the didactic example. Left: gradient variance. Middle: performance. Right: average bias in Q estimations. Right-bottom: the element in the i-th row and j-th column is the local Q value learned by DOP for agent i taking action j.

Figure 4: Comparisons with ablations on the SMAC benchmark.

Figure 5: Left and middle: performance comparisons with COMA and MAAC on MPE. Right: The learned credit assignment mechanism on task Mill by deterministic DOP.

Figure 6: A decomposed critic can solve many coordination problems that cannot be solved by IQL.

In stochastic DOP, each agent has a neural network to approximate its local utility. The local utility network consists of two 256-dimensional fully-connected layers with ReLU activation. Since the critic is not used during execution, we condition the local Q networks on the global state $s$. The output of each local utility network is $Q_i^{\phi_i}(\tau, \cdot)$, one value per local action.

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for their insightful comments and helpful suggestions. This work is supported in part by Science and Technology Innovation 2030 -"New Generation Artificial Intelligence" Major Project (No. 2018AAA0100904), and a grant from the Institute of Guo Qiang, Tsinghua University.


where $Q^\phi_{tot}(\tau, a) \equiv \sum_i k_i(\tau)\, Q_i^{\phi_i}(\tau, a_i) + b(\tau)$. Then, in expectation, stochastic DOP policy gradients are the same as those calculated using centralized critics (Eq. 2), and therefore policy improvement is guaranteed.

Proof. For brevity, we denote $\hat{Q}_i^{\phi_i}(\tau, a_i) = k_i(\tau)\, Q_i^{\phi_i}(\tau, a_i)$, so that $L(\phi)$ takes the additive form required by Lemma 3. According to Lemma 3, when $L(\phi)$ is minimized, each $\hat{Q}_i^{\phi_i}$ recovers the expectation of $Q^\pi_{tot}$ over the other agents' policies up to residual terms that sum to zero and thus act as per-agent baselines. Therefore, in expectation, stochastic DOP gradients are the same as those calculated using centralized critics (Eq. 2), and we no longer require the MONOTONE condition to guarantee improvement of the policy update. Proposition 1 thus offers another point of view on the performance guarantee of DOP despite its constrained critics.

D REPRESENTATIONAL CAPABILITY OF DETERMINISTIC DOP CRITICS

In Sec. 4.2.2, we present the following fact about deterministic DOP: the estimation error of a DOP critic can be bounded by $O(L\delta^2)$ for $a \in O_\delta(\tau)$, for all $\tau$. To show this, we consider the Taylor expansion with Lagrange remainder of $Q^\mu_{tot}(\tau, a)$ around $\mu(\tau)$. Noticing that the first-order Taylor expansion of $Q^\mu_{tot}$ is linear in the joint action, and hence representable by a linearly decomposed critic, the Lagrange remainder is of order $L\delta^2$ within $O_\delta(\mu(\tau))$. Therefore, the optimal solution of the MSE problem in Eq. 17 under DOP critics has an error term less than $O(L\delta^2)$ for an arbitrary sampling distribution $p(\tau, a)$ with $a \in O_\delta(\mu(\tau))$. When the Q values in the proximity of $\mu(\tau)$, for all $\tau$, are well estimated within a bounded error and $\delta \ll 1$, the deterministic DOP policy gradients approximately match those computed with a centralized critic.

E ALGORITHMS

In this section, we describe the details of our algorithms, as shown in Algorithms 1 and 2.

Algorithm 1: Stochastic DOP
  Initialize a critic network Q_φ, actor networks π_θi, and a mixer network M_ψ with random parameters φ, θ_i, ψ
  Initialize target networks: φ′ = φ, θ′ = θ, ψ′ = ψ
  Initialize an off-policy replay buffer D_off and an on-policy replay buffer D_on
  for t = 1 to T do
    Generate a trajectory and store it in D_off and D_on
    Sample a batch consisting of N_1 trajectories from D_on
    Update decentralized policies using the gradients described in Eq. 10
    Calculate L_On(φ)
    Sample a batch consisting of N_2 trajectories from D_off
    Calculate L_DOP-TB(φ)
    Update the critic using L_On(φ) and L_DOP-TB(φ)
    if t mod d = 0 then
      Update target networks: φ′ = φ, θ′ = θ, ψ′ = ψ
    end if
  end for

Algorithm 2: Deterministic DOP
  Initialize a critic network Q_φ, actor networks μ_θi, and a mixer network M_ψ with random parameters θ, φ, ψ
  Initialize target networks: φ′ = φ, θ′ = θ, ψ′ = ψ
  Initialize a replay buffer D
  for t = 1 to T do
    Select an action with exploration noise a ∼ μ(τ) + ε, generate a transition, and store the transition tuple in D
    Sample N transitions from D
    Update the critic using the loss function described in Eq. 11
    Update decentralized policies using the gradients described in Eq. 12
    if t mod d = 0 then
      Update target networks: φ′ = φ, θ′ = θ, ψ′ = ψ
    end if
  end for
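A minimal Python skeleton of Algorithm 1's control flow (buffers, mixed on/off-policy batches, periodic target sync) may help make the loop concrete. All network objects and their update/sync methods are hypothetical stand-ins for the gradient steps in the paper:

```python
from collections import deque
import random

def stochastic_dop_loop(env, critic, actors, mixer, T, d,
                        n_on=16, n_off=32):
    """Control-flow skeleton of Algorithm 1 (stochastic DOP). env, critic,
    actors, and mixer are placeholder objects; actors.update stands in for
    the policy gradients of Eq. 10, and critic.update for the combined
    L_On / L_DOP-TB critic losses."""
    d_off = deque(maxlen=5000)   # off-policy replay buffer (episodes)
    d_on = deque(maxlen=32)      # small on-policy buffer
    for t in range(1, T + 1):
        traj = env.rollout(actors)            # generate a trajectory
        d_off.append(traj)
        d_on.append(traj)
        on_batch = random.sample(list(d_on), min(n_on, len(d_on)))
        actors.update(on_batch, critic)       # policy gradients (Eq. 10)
        off_batch = random.sample(list(d_off), min(n_off, len(d_off)))
        critic.update(on_batch, off_batch)    # L_On + L_DOP-TB
        if t % d == 0:                        # periodic target sync
            critic.sync_targets()
            actors.sync_targets()
            mixer.sync_targets()
```

The two buffers mirror the mixed-batch training described in Appendix H: the critic sees both on-policy and off-policy data, while the actors are updated from the on-policy buffer only.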

F DOP WITH COMMUNICATION

Although DOP can solve many coordination problems, as shown by the comparison against IQL in Fig. 6, its fully decomposed critic raises the concern that DOP cannot deal with miscoordination problems induced by highly uncertain and partially observable environments. We use an example to illustrate the causes of such miscoordination and argue that introducing communication into DOP can help address it. In hallway (Fig. 7(a)), two agents randomly start at states a_1 to a_m and b_1 to b_n, respectively. Agents can observe their own position and choose to move left, move right, or keep still at each timestep. Agents win and are rewarded 10 if they arrive at state g simultaneously. Otherwise, if either agent arrives at g earlier than the other, the team gets no reward and the next episode begins. The horizon is set to max(m, n) + 10 to avoid an infinite loop. Without communication, an agent cannot know the position of its teammate, so it is difficult to coordinate actions. This explains why on hallway with m = n = 4, the team can win only 25% of the games (Fig. 7(b)). Equipping DOP with communication can largely solve the problem: agents learn to communicate their positions and move left from a_1 and b_1 simultaneously. For communication, we
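A minimal sketch of the hallway task as described (random start positions, a simultaneous-arrival reward of 10, horizon max(m, n) + 10) is given below; details beyond the text, such as the exact action encoding, are our assumptions:

```python
import random

class Hallway:
    """Minimal sketch of the hallway task: two agents start at random
    positions 1..m and 1..n steps from the goal g (position 0) and win
    +10 only if they reach g on the same timestep; if one arrives first,
    the episode ends with no reward. Action encoding (-1 left toward g,
    0 stay, +1 right) is an illustrative assumption."""
    def __init__(self, m=4, n=4, seed=0):
        self.m, self.n = m, n
        self.rng = random.Random(seed)

    def reset(self):
        self.pos = [self.rng.randint(1, self.m), self.rng.randint(1, self.n)]
        self.t, self.horizon = 0, max(self.m, self.n) + 10
        return tuple(self.pos)

    def step(self, actions):
        for i, a in enumerate(actions):
            self.pos[i] = max(0, self.pos[i] + a)
        self.t += 1
        if self.pos == [0, 0]:                       # simultaneous arrival
            return tuple(self.pos), 10.0, True
        if 0 in self.pos or self.t >= self.horizon:  # one arrived early
            return tuple(self.pos), 0.0, True
        return tuple(self.pos), 0.0, False

# with full observability (as communication would provide), a simple
# "farther-from-goal agent moves first" policy always wins
env = Hallway(seed=1)
env.reset()
reward, done = 0.0, False
while not done:
    p = env.pos
    acts = [-1 if p[0] >= p[1] else 0, -1 if p[1] >= p[0] else 0]
    _, reward, done = env.step(acts)
assert reward == 10.0
```

The winning policy in the usage example relies on each agent seeing both positions, which is exactly the information that communication supplies and that purely local observations withhold.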

