MORE CENTRALIZED TRAINING, STILL DECENTRALIZED EXECUTION: MULTI-AGENT CONDITIONAL POLICY FACTORIZATION

Abstract

In cooperative multi-agent reinforcement learning (MARL), combining value decomposition with actor-critic enables agents to learn stochastic policies, which are more suitable for the partially observable environment. Given the goal of learning local policies that enable decentralized execution, agents are commonly assumed to be independent of each other, even in centralized training. However, such an assumption may prohibit agents from learning the optimal joint policy. To address this problem, we explicitly take the dependency among agents into centralized training. Although this leads to the optimal joint policy, it may not be factorized for decentralized execution. Nevertheless, we theoretically show that from such a joint policy, we can always derive another joint policy that achieves the same optimality but can be factorized for decentralized execution. To this end, we propose multi-agent conditional policy factorization (MACPF), which takes more centralized training but still enables decentralized execution. We empirically verify MACPF in various cooperative MARL tasks and demonstrate that MACPF achieves better performance or faster convergence than baselines.

1. INTRODUCTION

The cooperative multi-agent reinforcement learning (MARL) problem has attracted the attention of many researchers as it is a well-abstracted model for many real-world problems, such as traffic signal control (Wang et al., 2021a) and autonomous warehouses (Zhou et al., 2021). In a cooperative MARL problem, we aim to train a group of agents that can cooperate to achieve a common goal. Such a common goal is often defined by a global reward function that is shared among all agents. If centralized control is allowed, such a problem can be viewed as a single-agent reinforcement learning problem with an enormous action space. Based on this intuition, Kraemer & Banerjee (2016) proposed the centralized training with decentralized execution (CTDE) framework to overcome the non-stationarity of MARL. In the CTDE framework, a centralized value function is learned to guide the update of each agent's local policy, which enables decentralized execution. With a centralized value function, there are different ways to guide the learning of the local policy of each agent. One line of research, called value decomposition (Sunehag et al., 2018), obtains local policies by factorizing this centralized value function into the utility function of each agent. To ensure that the update of local policies indeed improves the joint policy, Individual-Global-Max (IGM) is introduced to guarantee the consistency between joint and local policies. Based on different interpretations of IGM, various MARL algorithms have been proposed, such as VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), and QPLEX (Wang et al., 2020a). IGM only specifies the relationship between optimal local actions and the optimal joint action, so it is typically used to learn deterministic policies.
In order to learn stochastic policies, which are more suitable for the partially observable environment, recent studies (Su et al., 2021; Wang et al., 2020b; Zhang et al., 2021; Su & Lu, 2022) combine the idea of value decomposition with actor-critic. While most of these decomposed actor-critic methods do not guarantee optimality, FOP (Zhang et al., 2021) introduces Individual-Global-Optimal (IGO) for the optimal joint policy learning in terms of maximum-entropy objective and derives the corresponding way of value decomposition. It is proved that factorized local policies of FOP converge to the global optimum, given that IGO is satisfied. The essence of IGO is for all agents to be independent of each other during both training and execution. However, we find this requirement dramatically reduces the expressiveness of the joint policy, making the learning algorithm fail to converge to the global optimal joint policy, even in some simple scenarios. As centralized training is allowed, a natural way to address this issue is to factorize the joint policy based on the chain rule (Schum, 2001) , such that the dependency among agents' policies is explicitly considered, and the full expressiveness of the joint policy can be achieved. By incorporating such a joint policy factorization into the soft policy iteration (Haarnoja et al., 2018) , we can obtain an optimal joint policy without the IGO condition. Though optimal, a joint policy induced by such a learning method may not be decomposed into independent local policies, thus decentralized execution is not fulfilled, which is the limitation of many previous works that consider dependency among agents (Bertsekas, 2019; Fu et al., 2022) . To fulfill decentralized execution, we first theoretically show that for such a dependent joint policy, there always exists another independent joint policy that achieves the same expected return but can be decomposed into independent local policies. 
To learn the optimal joint policy while preserving decentralized execution, we propose multi-agent conditional policy factorization (MACPF), where we represent the dependent local policy by combining an independent local policy and a dependency policy correction. The dependent local policies factorize the optimal joint policy, while the independent local policies constitute their independent counterpart that enables decentralized execution. We evaluate MACPF in several tasks, including matrix game (Rashid et al., 2020) , SMAC (Samvelyan et al., 2019) , and MPE (Lowe et al., 2017) . Empirically, MACPF consistently outperforms its base method, i.e., FOP, and achieves better performance or faster convergence than other baselines. By ablation, we verify that the independent local policies can indeed obtain the same level of performance as the dependent local policies.

2.1. MULTI-AGENT MARKOV DECISION PROCESS

In cooperative MARL, we often formulate the problem as a multi-agent Markov decision process (MDP) (Boutilier, 1996). A multi-agent MDP can be defined by a tuple $\langle \mathcal{I}, \mathcal{S}, \mathcal{A}, P, r, \gamma, N \rangle$. $N$ is the number of agents, $\mathcal{I} = \{1, 2, \ldots, N\}$ is the set of agents, $\mathcal{S}$ is the set of states, and $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$ is the joint action space, where $\mathcal{A}_i$ is the individual action space of each agent $i$. For the rigorousness of proof, we assume full observability, such that at each state $s \in \mathcal{S}$, each agent $i$ receives the state $s$, chooses an action $a_i \in \mathcal{A}_i$, and all actions form a joint action $a \in \mathcal{A}$. The state transitions to the next state $s'$ upon $a$ according to the transition function $P(s'|s,a): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$, and all agents receive a shared reward $r(s,a): \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. The objective is to learn a local policy $\pi_i(a_i|s)$ for each agent such that the agents cooperate to maximize the expected cumulative discounted return $\mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_t]$, where $\gamma \in [0,1)$ is the discount factor. In CTDE, from a centralized perspective, a group of local policies can be viewed as a joint policy $\pi_{jt}(a|s)$. For this joint policy, we can define the joint state-action value function $Q_{jt}(s_t, a_t) = \mathbb{E}_{s_{t+1:\infty}, a_{t+1:\infty}}[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t, a_t]$. Note that although we assume full observability for the rigorousness of proof, in practice we use the trajectory of each agent, $\tau_i \in \mathcal{T}_i: (\mathcal{Y} \times \mathcal{A}_i)^*$, in place of the state $s$ as its policy input to handle partial observability, where $\mathcal{Y}$ is the observation space.

2.2. FOP

FOP (Zhang et al., 2021) is one of the state-of-the-art CTDE methods for cooperative MARL, which extends value decomposition to learning stochastic policies. In FOP, the joint policy is decomposed into independent local policies based on Individual-Global-Optimal (IGO), which can be stated as: $\pi_{jt}(a|s) = \prod_{i=1}^{N} \pi_i(a_i|s)$. As all policies are learned by maximum-entropy RL (Haarnoja et al., 2018), i.e., $\pi_i(a_i|s) = \exp(\frac{1}{\alpha_i}(Q_i(s, a_i) - V_i(s)))$, IGO immediately implies a specific way of value decomposition: $Q_{jt}(s,a) = \sum_{i=1}^{N} \frac{\alpha}{\alpha_i}[Q_i(s, a_i) - V_i(s)] + V_{jt}(s)$. Unlike IGM, which is used to learn deterministic local policies and naturally avoids the dependency of agents, IGO assumes agents are independent of each other in both training and execution. Although IGO enables FOP to learn stochastic policies, such an assumption can be problematic even in some simple scenarios and prevents learning the optimal joint policy. As stated in soft Q-learning (Haarnoja et al., 2018), one goal of maximum-entropy RL is to learn an optimal maximum-entropy policy that captures multiple modes of near-optimal behavior. Since FOP can be seen as the extension of maximum-entropy RL to multi-agent settings, it is natural to expect that FOP can also learn a multi-modal joint policy. However, as shown in the following example, this desired property of maximum-entropy RL is not inherited by FOP due to the IGO condition.
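To make the IGO factorization concrete, the following minimal numpy sketch (with illustrative Q-values, not numbers from the paper) builds each maximum-entropy local policy as a Boltzmann distribution over its utilities and forms the joint policy as the product of the independent local policies:

```python
import numpy as np

def soft_policy(q, alpha=1.0):
    """Maximum-entropy policy pi(a) = exp((Q(a) - V)/alpha), where
    V = alpha * log sum_a exp(Q(a)/alpha), i.e. a softmax over Q/alpha."""
    v = alpha * np.log(np.sum(np.exp(q / alpha)))
    return np.exp((q - v) / alpha)

# Hypothetical per-agent utilities Q_i(s, a_i) at a fixed state.
q1 = np.array([1.0, 0.0])
q2 = np.array([0.0, 1.0])
pi1, pi2 = soft_policy(q1), soft_policy(q2)

# Under IGO, the joint policy is the outer product of the local policies,
# so it can never place correlated mass on specific joint actions.
pi_jt = np.outer(pi1, pi2)
assert np.isclose(pi_jt.sum(), 1.0)
```

This product structure is exactly what restricts the expressiveness of the joint policy, as the next subsection illustrates.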

2.3. PROBLEMATIC IGO

We extend the single-agent multi-goal environment used in soft Q-learning (Haarnoja et al., 2018) to its multi-agent variant to illustrate the problem of IGO. In this environment, we want to control a 2D point mass to reach one of four symmetrically placed goals, as illustrated in Figure 1 . The reward is defined as a mixture of Gaussians, with means placed at the goal positions. Unlike the original environment, this 2D point mass is now jointly controlled by two agents, and it can only move when these two agents select the same moving direction; otherwise, it will stay where it is. As shown in Figure 1a , when centralized control is allowed, multi-agent training degenerates to single-agent training, and the desired multi-modal policy can be learned. However, as shown in Figure 1b , FOP struggles to learn any meaningful joint policy for the multi-agent setting. One possible explanation is that, since IGO is assumed in FOP, the local policy of each agent is always independent of each other during training, and the expressiveness of joint policy is dramatically reduced. Therefore, when two agents have to coordinate to make decisions, they may fail to reach an agreement and eventually behave in a less meaningful way due to the limited expressiveness of joint policy. To solve this problem, we propose to consider dependency among agents in MARL algorithms to enrich the expressiveness of joint policy. As shown in Figure 1c , the learned joint policy can once again capture multiple modes of near-optimal behavior when the dependency is considered. Details of this algorithm will be discussed in the next section.
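The expressiveness gap can be checked with a short calculation on a hypothetical two-action example (not the exact multi-goal environment): any product of independent local policies satisfies a cross-product identity that a coordinated bimodal joint policy violates, so no independent factorization can represent it:

```python
import numpy as np

# A bimodal joint policy putting all mass on the two coordinated joint
# actions (A,A) and (B,B), analogous to two agreeing movement directions.
pi_jt = np.array([[0.5, 0.0],
                  [0.0, 0.5]])

# Any product pi1 (x) pi2 satisfies pi(A,A)*pi(B,B) == pi(A,B)*pi(B,A);
# the bimodal target violates this identity.
assert pi_jt[0, 0] * pi_jt[1, 1] != pi_jt[0, 1] * pi_jt[1, 0]

# The product policy built from the marginals spreads half of its mass
# onto the miscoordinated joint actions (A,B) and (B,A).
product = np.outer(pi_jt.sum(axis=1), pi_jt.sum(axis=0))
miscoordination = product[0, 1] + product[1, 0]
assert np.isclose(miscoordination, 0.5)
```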

3. METHOD

To overcome the aforementioned problem of IGO, we propose multi-agent conditional policy factorization (MACPF). In MACPF, we introduce dependency among agents during centralized training to ensure the optimality of the joint policy without the need for IGO. This joint policy consists of dependent local policies, which take the actions of other agents as input, and we use this joint policy as the behavior policy to interact with the environment during training. In order to fulfill decentralized execution, independent local policies are obtained from these dependent local policies such that the joint policy resulting from these independent local policies is equivalent to the behavior policy in terms of expected return.

3.1. CONDITIONAL FACTORIZED SOFT POLICY ITERATION

Like FOP, we also use maximum-entropy RL (Ziebart, 2010) to bridge the policy and the state-action value function of each agent. Additionally, it will also be used to introduce dependency among agents. For each local policy, we take the actions of other agents as its input and define it as follows:
$$\pi_i(a_i|s, a_{<i}) = \exp\big(\tfrac{1}{\alpha_i}(Q_i(s, a_{<i}, a_i) - V_i(s, a_{<i}))\big) \quad (3)$$
$$V_i(s, a_{<i}) := \alpha_i \log \sum_{a_i} \exp\big(\tfrac{1}{\alpha_i} Q_i(s, a_{<i}, a_i)\big), \quad (4)$$
where $a_{<i}$ represents the joint action of all agents whose indices are smaller than $i$. We can then relate the joint policy to the local policies via the chain-rule factorization of the joint probability:
$$\pi_{jt}(a|s) = \prod_{i=1}^{N} \pi_i(a_i|s, a_{<i}). \quad (5)$$
The full expressiveness of the joint policy is guaranteed by (5), as it is no longer restricted by the IGO condition. From (5), together with $\pi_{jt}(a|s) = \exp(\tfrac{1}{\alpha}(Q_{jt}(s, a) - V_{jt}(s)))$, we have the factorization of $Q_{jt}$:
$$Q_{jt}(s, a) = \sum_{i=1}^{N} \frac{\alpha}{\alpha_i}\big[Q_i(s, a_{<i}, a_i) - V_i(s, a_{<i})\big] + V_{jt}(s). \quad (6)$$
Note that in maximum-entropy RL, we can easily compute $V$ from $Q$. From (6), we introduce conditional factorized soft policy iteration and prove its convergence to the optimal joint policy in the following theorem.

Theorem 1 (Conditional Factorized Soft Policy Iteration). For any joint policy $\pi_{jt}$, if we repeatedly apply joint soft policy evaluation and individual conditional soft policy improvement from $\pi_i \in \Pi_i$, then the joint policy $\pi_{jt}(a|s) = \prod_{i=1}^{N} \pi_i(a_i|s, a_{<i})$ converges to $\pi^*_{jt}$, such that $Q^{\pi^*_{jt}}_{jt}(s, a) \geq Q^{\pi_{jt}}_{jt}(s, a)$ for all $\pi_{jt}$, assuming $|\mathcal{A}| < \infty$.

Proof. See Appendix A.
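Numerically, the chain-rule factorization in (5) can be recovered from any joint Boltzmann policy by marginalization. The sketch below (tabular one-state case with illustrative Q-values) verifies that the product of the dependent local policies reproduces the joint policy exactly, with no loss of expressiveness:

```python
import numpy as np

def joint_soft_policy(q_jt, alpha=1.0):
    """Joint Boltzmann policy pi_jt(a1, a2 | s) = softmax(Q_jt / alpha)
    over all joint actions (rows: a1, cols: a2)."""
    z = np.exp(q_jt / alpha)
    return z / z.sum()

# Hypothetical joint Q-values favoring the two coordinated joint actions.
q_jt = np.array([[8.0, 0.0],
                 [0.0, 8.0]])
pi_jt = joint_soft_policy(q_jt)

# Chain rule: pi_jt(a1, a2) = pi_1(a1) * pi_2(a2 | a1).
pi_1 = pi_jt.sum(axis=1)              # marginal policy of agent 1
pi_2_given_1 = pi_jt / pi_1[:, None]  # conditional policy of agent 2

# The product of the dependent local policies recovers the joint exactly.
assert np.allclose(pi_1[:, None] * pi_2_given_1, pi_jt)
```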

3.2. INDEPENDENT JOINT POLICY

Using conditional factorized soft policy iteration, we are able to obtain the optimal joint policy. However, such a joint policy requires dependent local policies, which are incapable of decentralized execution. To fulfill decentralized execution, we have to obtain independent local policies. Consider the joint policy shown in Figure 2a. This joint policy, called the dependent joint policy $\pi^{dep}_{jt}$, involves dependency among agents and thus cannot be factorized into two independent local policies. However, one may notice that this policy can be decomposed as the combination of an independent joint policy $\pi^{ind}_{jt}$ that involves no dependency among agents, as shown in Figure 2b, and a dependency policy correction $b^{dep}_{jt}$, as shown in Figure 2c. More importantly, since we use the Boltzmann distribution of joint Q-values as the joint policy, equal probabilities of two joint actions also imply that their joint Q-values are equal:
$$\pi^{dep}_{jt}(A, A) = \pi^{dep}_{jt}(B, B) \Rightarrow Q_{jt}(A, A) = Q_{jt}(B, B). \quad (7)$$
Therefore, in Figure 2, the expected return of the independent joint policy $\pi^{ind}_{jt}$ is the same as that of the dependent joint policy $\pi^{dep}_{jt}$:
$$\mathbb{E}_{\pi^{dep}_{jt}}[Q_{jt}] = \pi^{dep}_{jt}(A, A) \cdot Q_{jt}(A, A) + \pi^{dep}_{jt}(B, B) \cdot Q_{jt}(B, B) \quad (8)$$
$$= \pi^{ind}_{jt}(A, A) \cdot Q_{jt}(A, A) = \mathbb{E}_{\pi^{ind}_{jt}}[Q_{jt}]. \quad (9)$$
Formally, we have the following theorem.

Theorem 2. For any dependent joint policy $\pi^{dep}_{jt}$ that involves dependency among agents, there exists an independent joint policy $\pi^{ind}_{jt}$ that does not involve dependency among agents, such that $V^{\pi^{dep}_{jt}}(s) = V^{\pi^{ind}_{jt}}(s)$ for any state $s \in \mathcal{S}$.

Proof. See Appendix B.

Note that the independent counterpart of the optimal dependent joint policy may not be directly learned by FOP, as shown in Figure 1. Therefore, we need to explicitly learn the optimal dependent joint policy to obtain its independent counterpart.
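A quick tabular check of the intuition behind Theorem 2 (the numbers are hypothetical, mimicking the Figure 2 setting): when the dependent policy splits its mass over equally valued coordinated joint actions, a deterministic policy that commits both agents to one mode is trivially independent and attains the same expected return:

```python
import numpy as np

# Joint Q-values with two equally good coordinated joint actions.
q_jt = np.array([[10.0, 0.0],
                 [0.0, 10.0]])

# Dependent joint policy: mass only on the coordinated actions (A,A), (B,B).
pi_dep = np.array([[0.5, 0.0],
                   [0.0, 0.5]])

# Equal probabilities under a Boltzmann joint policy imply equal Q-values,
# so committing both agents to (A,A) gives a deterministic, hence trivially
# factorizable, independent joint policy with the same expected return.
pi_ind = np.outer([1.0, 0.0], [1.0, 0.0])
assert np.isclose((pi_dep * q_jt).sum(), (pi_ind * q_jt).sum())
```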

3.3. MACPF FRAMEWORK

With Theorems 1 and 2, we are ready to present the learning framework of MACPF, as illustrated in Figure 3, for simultaneously learning the dependent joint policy and its independent counterpart. In MACPF, each agent $i$ has an independent local policy $\pi^{ind}_i(a_i|s; \theta_i)$ parameterized by $\theta_i$ and a dependency policy correction $b^{dep}_i(a_i|s, a_{<i}; \phi_i)$ parameterized by $\phi_i$, which together constitute a dependent local policy $\pi^{dep}_i(a_i|s, a_{<i})$. So, we have:
$$\pi^{dep}_i(a_i|s, a_{<i}) = \pi^{ind}_i(a_i|s; \theta_i) + b^{dep}_i(a_i|s, a_{<i}; \phi_i) \quad (10)$$
$$\pi^{dep}_{jt}(a|s) = \prod_{i=1}^{N} \pi^{dep}_i(a_i|s, a_{<i}) \quad (11)$$
$$\pi^{ind}_{jt}(a|s) = \prod_{i=1}^{N} \pi^{ind}_i(a_i|s; \theta_i). \quad (12)$$
Similarly, each agent $i$ also has an independent local critic $Q^{ind}_i(s, a_i; \psi_i)$ parameterized by $\psi_i$ and a dependency critic correction $c^{dep}_i(s, a_{<i}, a_i; \omega_i)$ parameterized by $\omega_i$, which together constitute a dependent local critic $Q^{dep}_i(s, a_{<i}, a_i)$. Given all $Q^{ind}_i$ and $Q^{dep}_i$, we use a mixer network $\mathrm{Mixer}(\cdot; \Theta)$ parameterized by $\Theta$ to get $Q^{dep}_{jt}$ and $Q^{ind}_{jt}$ as follows:
$$Q^{dep}_i(s, a_{<i}, a_i) = Q^{ind}_i(s, a_i; \psi_i) + c^{dep}_i(s, a_{<i}, a_i; \omega_i) \quad (13)$$
$$Q^{dep}_{jt}(s, a) = \mathrm{Mixer}\big([Q^{dep}_i(s, a_{<i}, a_i)]_{i=1}^{N}, s; \Theta\big) \quad (14)$$
$$Q^{ind}_{jt}(s, a) = \mathrm{Mixer}\big([Q^{ind}_i(s, a_i; \psi_i)]_{i=1}^{N}, s; \Theta\big). \quad (15)$$
$Q^{dep}_i$, $Q^{ind}_i$, and the Mixer are learned by minimizing the TD error, where $\mathcal{D}$ is the replay buffer collected by $\pi^{dep}_{jt}$, $\bar{Q}$ is the target network, and $a'$ is sampled from the current $\pi^{dep}_{jt}$ and $\pi^{ind}_{jt}$, respectively. To ensure the independent joint policy $\pi^{ind}_{jt}$ has the same performance as $\pi^{dep}_{jt}$, the same batch sampled from $\mathcal{D}$ is used to compute both $\mathcal{L}^{dep}$ and $\mathcal{L}^{ind}$.
$$\mathcal{L}^{dep}([\omega_i]_{i=1}^{N}, \Theta) = \mathbb{E}_{\mathcal{D}}\Big[\big(Q^{dep}_{jt}(s, a) - \big(r + \gamma(\bar{Q}^{dep}_{jt}(s', a') - \alpha \log \pi^{dep}_{jt}(a'|s'))\big)\big)^2\Big] \quad (16)$$
$$\mathcal{L}^{ind}([\psi_i]_{i=1}^{N}, \Theta) = \mathbb{E}_{\mathcal{D}}\Big[\big(Q^{ind}_{jt}(s, a) - \big(r + \gamma(\bar{Q}^{ind}_{jt}(s', a') - \alpha \log \pi^{ind}_{jt}(a'|s'))\big)\big)^2\Big]. \quad (17)$$
It is worth noting that the gradient of $\mathcal{L}^{dep}$ only updates $[c^{dep}_i]_{i=1}^{N}$, while the gradient of $\mathcal{L}^{ind}$ only updates $[Q^{ind}_i]_{i=1}^{N}$. Then, $\pi^{dep}_i$ and $\pi^{ind}_i$ are updated by minimizing the KL divergence as follows:
$$\mathcal{J}^{dep}(\phi_i) = \mathbb{E}_{\mathcal{D}, a_{<i} \sim \pi^{dep}_{<i}, a_i \sim \pi^{dep}_i}\big[\alpha_i \log \pi^{dep}_i(a_i|s, a_{<i}) - Q^{dep}_i(s, a_{<i}, a_i)\big] \quad (18)$$
$$\mathcal{J}^{ind}(\theta_i) = \mathbb{E}_{\mathcal{D}, a_i \sim \pi^{ind}_i}\big[\alpha_i \log \pi^{ind}_i(a_i|s; \theta_i) - Q^{ind}_i(s, a_i; \psi_i)\big]. \quad (19)$$
Similarly, the gradient of $\mathcal{J}^{dep}$ only updates $b^{dep}_i$, and the gradient of $\mathcal{J}^{ind}$ only updates $\pi^{ind}_i$. For computing $\mathcal{J}^{dep}$, $a_{<i}$ is sampled from the current policies $\pi^{dep}_{<i}$. The purpose of learning $\pi^{ind}_{jt}$ is to enable decentralized execution while achieving the same performance as $\pi^{dep}_{jt}$. Therefore, a certain level of coupling has to be ensured between $\pi^{ind}_{jt}$ and $\pi^{dep}_{jt}$. First, motivated by Figure 2, we constitute the dependent policy as a combination of an independent policy and a dependency policy correction, and similarly for the local critic. Second, as aforementioned, the replay buffer $\mathcal{D}$ is collected by $\pi^{dep}_{jt}$, which implies that $\pi^{dep}_{jt}$ is the behavior policy and the learning of $\pi^{ind}_{jt}$ is offline. Third, we use the same Mixer to compute $Q^{dep}_{jt}$ and $Q^{ind}_{jt}$. The performance comparison between $\pi^{dep}_{jt}$ and $\pi^{ind}_{jt}$ will be studied in the experiments.
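As a minimal numpy sketch of the soft TD objectives in (16) and (17) — with hypothetical batch values, and without the mixer or any networks — both losses share the same target structure and differ only in which policy supplies the bootstrapped action $a'$ and its log-probability:

```python
import numpy as np

def soft_td_loss(q_sa, r, gamma, q_next, logp_next, alpha=1.0):
    """Mean squared soft TD error with target
    y = r + gamma * (Q_target(s', a') - alpha * log pi(a'|s'))."""
    y = r + gamma * (q_next - alpha * logp_next)
    return np.mean((q_sa - y) ** 2)

# Toy batch (illustrative numbers). In MACPF, the same replay batch,
# collected by the dependent joint policy, feeds both losses; only the
# bootstrapped action a' is drawn from pi_dep vs. pi_ind, respectively.
q_sa = np.array([1.0, 2.0])
r = np.array([0.5, 1.0])
loss = soft_td_loss(q_sa, r, 0.99,
                    q_next=np.array([1.2, 1.8]),
                    logp_next=np.array([-0.7, -0.7]))
assert loss >= 0.0
```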

4. RELATED WORK

Multi-agent policy gradient. In multi-agent policy gradient, a centralized value function is usually learned to evaluate the current joint policy and guide the update of each local policy. Most multi-agent policy gradient methods can be considered as extensions of policy gradient from RL to MARL. For example, MADDPG (Lowe et al., 2017) extends DDPG to MARL, other methods build on TRPO (Schulman et al., 2015), and MAPPO (Yu et al., 2021) extends PPO (Schulman et al., 2017). Some methods additionally address multi-agent credit assignment by policy gradient, e.g., counterfactual policy gradient (Foerster et al., 2018) or difference rewards policy gradient (Castellini et al., 2021; Li et al., 2022).

Value decomposition. Instead of providing gradients for local policies, in value decomposition, the centralized value function, usually a joint Q-function, is directly decomposed into local utility functions. Many methods have been proposed as different interpretations of Individual-Global-Max (IGM), which indicates the consistency between optimal local actions and the optimal joint action. VDN (Sunehag et al., 2018) and QMIX (Rashid et al., 2018) give sufficient conditions for IGM by additivity and monotonicity, respectively. QTRAN (Son et al., 2019) transforms IGM into optimization constraints, while QPLEX (Wang et al., 2020a) takes advantage of a duplex dueling architecture to guarantee IGM. Recent studies (Su et al., 2021; Wang et al., 2020b; Zhang et al., 2021; Su & Lu, 2022) combine value decomposition with policy gradient to learn stochastic policies, which are more desirable in partially observable environments. However, most research in this category does not guarantee optimality, while our method enables agents to learn the optimal joint policy.

Coordination graph. In coordination graph (Guestrin et al., 2002) methods (Böhmer et al., 2020; Wang et al., 2021b; Yang et al., 2022), the interactions between agents are considered as part of value decomposition.
Specifically, the joint Q-function is decomposed into a combination of utility functions and payoff functions. The introduction of payoff functions increases the expressiveness of the joint Q-function and considers at least pair-wise dependency among agents, which is similar to our algorithm, where the complete dependency is considered. However, to get the joint action with the maximum Q-value, coordination graph methods require communication between agents during execution, while our method still fulfills fully decentralized execution.

Coordinated exploration. One benefit of considering dependency is coordinated exploration. From this perspective, our method might be seen as a relative of coordinated exploration methods (Mahajan et al., 2019; Iqbal & Sha, 2019; Zheng et al., 2021). In MAVEN (Mahajan et al., 2019), a shared latent variable is used to promote committed, temporally extended exploration. In EMC (Zheng et al., 2021), an intrinsic reward based on the prediction error of individual Q-values is used to induce coordinated exploration. It is worth noting that our method does not conflict with coordinated exploration methods and can be used together with them, as our method is a base cooperative MARL algorithm. However, such a combination is beyond the scope of this paper.

5. EXPERIMENTS

In this section, we evaluate MACPF in three different scenarios. One is a simple yet challenging matrix game, which we use to verify whether MACPF can indeed converge to the optimal joint policy. Then, we evaluate MACPF on two popular cooperative MARL scenarios: StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al., 2019) and MPE (Lowe et al., 2017), comparing it against QMIX (Rashid et al., 2018), QPLEX (Wang et al., 2020a), FOP (Zhang et al., 2021), and MAPPO (Yu et al., 2021). More details about experiments and hyperparameters are included in Appendix C. All results are presented as the mean and standard deviation of five runs with different random seeds. In SMAC experiments, for visual clarity, we plot the curves with a moving average over a window of size five and shade half a standard deviation.

5.1. MATRIX GAME

In this matrix game, we have two agents. Each can pick one of the four actions and get a reward based on the payoff matrix depicted in Figure 4a . Unlike the non-monotonic matrix game in QTRAN (Son et al., 2019) , where there is only one optimal joint action, we have two optimal joint actions in this game, making this scenario much more challenging for many cooperative MARL algorithms. As shown in Figure 4b , general value decomposition methods, QMIX, QPLEX, and FOP, fail to learn the optimal coordinated strategy in most cases. The same negative result can also be observed for MAPPO. For general MARL algorithms, since agents are fully independent of each other when making decisions, they may fail to converge to the optimal joint action, which eventually leads to a suboptimal joint policy. As shown in Figure 4b , QMIX and MAPPO fail to converge to the optimal policy but find a suboptimal policy in all the seeds, while QPLEX, QTRAN, and FOP find the optima by chance (i.e., 60% for QPLEX, 20% for QTRAN, and 40% for FOP). This is because, in QMIX, the mixer network is purely a function of state and the input utility functions that are fully independent of each other. Thus it considers no dependency at all and cannot solve this game where dependency has to be considered. For QPLEX and FOP, since the joint action is considered as the input of their mixer network, the dependency among agents may be implicitly considered, which leads to the case where they can find the optima by chance. However, since the dependency is not considered explicitly, there is also a possibility that the mixer network misinterprets the dependency, which makes QPLEX and FOP sometimes find even worse policies than QMIX (20% for QPLEX and 40% for FOP). For QTRAN, it always finds at least the suboptimal policy in all the seeds. However, its optimality largely relies on the learning of its V jt , which is very unstable, so it also only finds the optima by chance. 
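The coordination failure described above can be reproduced with a toy payoff matrix (illustrative values with two diagonal optima, not the exact Figure 4a matrix): a policy that correlates the agents' actions keeps all its mass on the optima, while the independent policy with the same marginals loses half its mass to miscoordination:

```python
import numpy as np

# Illustrative 2x2 payoff matrix with two optimal joint actions on the
# diagonal and heavy penalties for miscoordination.
payoff = np.array([[8.0, -12.0],
                   [-12.0, 8.0]])

# Dependent policy: the second agent copies the first agent's action, so
# mass lies only on the two optimal joint actions and the return is optimal.
pi_dep = np.array([[0.5, 0.0],
                   [0.0, 0.5]])
assert np.isclose((pi_dep * payoff).sum(), 8.0)

# Independent policy with the same marginals: the agents miscoordinate half
# the time and the expected return collapses.
pi_ind = np.outer(pi_dep.sum(axis=1), pi_dep.sum(axis=0))
assert np.isclose((pi_ind * payoff).sum(), -2.0)
```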
For the dependent joint policy $\pi^{dep}_{jt}$ of MACPF, the local policy of the second agent depends on the action of the first agent. As a result, we can see from Figure 4b that $\pi^{dep}_{jt}$ (denoted as MACPF_DEP) always converges to the highest return. We also notice that in Figure 4c, $\pi^{dep}_{jt}$ indeed captures the two optimal joint actions. Unlike QMIX, QPLEX, and FOP, the mixer network in MACPF is a function of the state and the input utility functions $Q^{dep}_i(s, a_{<i}, a_i)$ that properly depend on each other, so the dependency among agents is explicitly considered. More importantly, the learned independent joint policy $\pi^{ind}_{jt}$ of MACPF, denoted as MACPF in Figure 4b, always converges to the optimal joint policy. Note that in the rest of this section, the performance of MACPF is achieved by the learned $\pi^{ind}_{jt}$, unless stated otherwise.

5.2. SMAC

Further, we evaluate MACPF on SMAC. Maps used in our experiments include two hard maps (8m_vs_9m, 10m_vs_11m) and two super-hard maps (MMM2, corridor). We also consider two challenging customized maps (8m_vs_9m_myopic, 10m_vs_11m_myopic), where the sight range of each agent is reduced from 9 to 6 and the information of allies is removed from the observations of agents. These changes are adopted to increase the difficulty of coordination in the original maps. Results are shown in Figure 5. In general, MACPF outperforms the baselines in all six maps. In the hard maps, MACPF outperforms the baselines mostly in convergence speed, while in the super-hard maps, MACPF outperforms the other algorithms in either convergence speed or performance. Especially in corridor, where the other value decomposition algorithms fail to learn any meaningful joint policy, MACPF obtains a win rate of almost 70%. In the two more challenging maps, the margin between MACPF and the baselines becomes much larger than in the original maps. These results show that, by introducing dependency among agents, MACPF can better handle complex cooperative tasks and learn coordinated strategies even when the task requires stronger coordination. We compare MACPF with the baselines in 18 maps in total. Their final performance is summarized in Appendix D. The win rate of MACPF is higher than or equal to that of the best baseline in 16 out of 18 maps, while for QMIX, QPLEX, MAPPO, and FOP it is 7/18, 8/18, 9/18, and 5/18, respectively.

Dependent and Independent Joint Policy. As discussed in Section 3.2, the learned independent joint policy of MACPF should not only enable decentralized execution but also match the performance of the dependent joint policy, as verified in the matrix game. What about complex environments like SMAC? As shown in Figure 6, we track the evaluation results of both $\pi^{ind}_{jt}$ and $\pi^{dep}_{jt}$ during training. As we can see, their performance stays at the same level throughout training.

Ablation Study. Without learning a dependent joint policy to interact with the environment, our algorithm degenerates to FOP. However, since our factorization of $Q_{jt}$ is induced from the chain-rule factorization of the joint probability (5), we use a mixer network different from FOP's (the reason is discussed and verified in Appendix E). Here we present an ablation study to further show that the improvement of MACPF is indeed induced by introducing dependency among agents. In Figure 7, MACPF_CONTROL represents an algorithm that is identical to MACPF in all other respects, except that no dependent joint policy is learned. As shown in Figure 7, MACPF outperforms MACPF_CONTROL in all four maps, demonstrating that the performance improvement is indeed achieved by introducing dependency among agents.

5.3. MPE

We further evaluate MACPF on three MPE tasks, including simple spread, formation control, and line control (Agarwal et al., 2020). As shown in Table 1, MACPF outperforms the baselines in all three tasks. A large margin can be observed in simple spread, while only a minor difference can be observed in the other two. This result may indicate that these MPE tasks are not challenging enough for strong MARL algorithms.

6. CONCLUSION

We have proposed MACPF, where dependency among agents is introduced to enable more centralized training. By conditional factorized soft policy iteration, we show that dependent local policies provably converge to the optimum. To fulfill decentralized execution, we represent dependent local policies as a combination of independent local policies and dependency policy corrections, such that the independent local policies can achieve the same level of expected return as the dependent ones. Empirically, we show that MACPF obtains the optimal joint policy in a simple yet challenging matrix game where the baselines fail, and MACPF also outperforms the baselines in SMAC and MPE.

A PROOF OF THEOREM 1

In this section, we incorporate dependency among agents into the standard soft policy iteration and prove that this modified soft policy iteration converges to the optimal joint policy. For soft policy evaluation, we repeatedly apply the soft Bellman operator $\Gamma^{\pi_{jt}}$ to $Q^{\pi_{jt}}_{jt}$ until convergence, where:
$$\Gamma^{\pi_{jt}} Q_{jt}(s_t, a_t) := r_t + \gamma \mathbb{E}_{s_{t+1}}[V_{jt}(s_{t+1})] \quad (20)$$
$$V_{jt}(s_t) = \mathbb{E}_{\pi_{jt}}[Q_{jt}(s_t, a_t) - \alpha \log \pi_{jt}(a_t|s_t)]. \quad (21)$$
In this way, as shown in Lemma A.1, we can get $Q^{\pi_{jt}}_{jt}$ for any joint policy $\pi_{jt}$.

Lemma A.1 (Joint Soft Policy Evaluation). Consider the modified soft Bellman backup operator $\Gamma^{\pi_{jt}}$ and a mapping $Q^0_{jt}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ with $|\mathcal{A}| < \infty$, and define $Q^{k+1}_{jt} = \Gamma^{\pi_{jt}} Q^k_{jt}$. Then, the sequence $Q^k_{jt}$ will converge to the joint soft Q-function of $\pi_{jt}$ as $k \to \infty$.

Proof. First, define the entropy-augmented reward as $r_{\pi_{jt}}(s_t, a_t) := r(s_t, a_t) + \mathbb{E}_{s_{t+1}}[\mathcal{H}(\pi_{jt}(\cdot|s_{t+1}))]$. Then, rewrite the update rule as $Q_{jt}(s_t, a_t) \leftarrow r_{\pi_{jt}}(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1}, a_{t+1} \sim \pi_{jt}}[Q_{jt}(s_{t+1}, a_{t+1})]$. Last, apply the standard convergence results for policy evaluation (Sutton & Barto, 2018).

After we get $Q^{\pi_{jt}}_{jt}$, we make a one-step improvement of the joint policy. First, we restrict the local policy $\pi_i$ of each agent $i$ to some set of policies $\Pi_i$ and update the local policy according to the following optimization problem:
$$\pi^{new}_i = \arg\min_{\pi'_i \in \Pi_i} \mathbb{E}_{a_{<i} \sim \pi^{new}_{<i}} \underbrace{D_{KL}\Big(\pi'_i(a_i|s, a_{<i}) \,\Big\|\, \exp\big(\tfrac{1}{\alpha_i}(Q^{\pi^{old}_i}_i(s, a_{<i}, a_i) - V^{\pi^{old}_i}_i(s, a_{<i}))\big)\Big)}_{J_{\pi^{old}_i, a_{<i}}(\pi'_i(a_i|s, a_{<i}))}. \quad (22)$$
Based on individual conditional soft policy improvement, we show that the newly projected joint soft policy has a higher state-action value than the old joint soft policy with respect to the maximum-entropy RL objective.

Lemma A.2 (Individual Conditional Soft Policy Improvement). Let $\pi^{old}_i \in \Pi_i$ and let $\pi^{new}_i$ be the optimizer of the minimization problem in (22).
Then, $Q^{\pi^{new}_{jt}}_{jt}(s_t, a_t) \geq Q^{\pi^{old}_{jt}}_{jt}(s_t, a_t)$ for all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$ with $|\mathcal{A}| < \infty$, where $\pi^{old}_{jt}(a|s) = \prod_{i=1}^{N} \pi^{old}_i(a_i|s, a_{<i})$ and $\pi^{new}_{jt}(a|s) = \prod_{i=1}^{N} \pi^{new}_i(a_i|s, a_{<i})$.

Proof. Let $Q^{\pi^{old}_i}_i$ and $V^{\pi^{old}_i}_i$ be the corresponding soft state-action value and soft state value of the individual policy $\pi^{old}_i$. First, consider that $J_{\pi^{old}_i, a_{<i}}(\pi^{new}_i(a_i|s, a_{<i})) \leq J_{\pi^{old}_i, a_{<i}}(\pi^{old}_i(a_i|s, a_{<i}))$. Then, we have:
$$\mathbb{E}_{a_i \sim \pi^{new}_i, a_{<i} \sim \pi^{new}_{<i}}[\alpha_i \log \pi^{new}_i(a_i|s, a_{<i}) - Q^{\pi^{old}_i}_i(s, a_{<i}, a_i) + V^{\pi^{old}_i}_i(s, a_{<i})]$$
$$\leq \mathbb{E}_{a_i \sim \pi^{old}_i, a_{<i} \sim \pi^{new}_{<i}}[\alpha_i \log \pi^{old}_i(a_i|s, a_{<i}) - Q^{\pi^{old}_i}_i(s, a_{<i}, a_i) + V^{\pi^{old}_i}_i(s, a_{<i})]. \quad (23)$$
Since $V^{\pi^{old}_i}_i$ depends only on $s$ and $a_{<i}$, we have:
$$\mathbb{E}_{a_{<i} \sim \pi^{new}_{<i}}[V^{\pi^{old}_i}_i(s, a_{<i})] = \mathbb{E}_{a_{<i} \sim \pi^{new}_{<i}, a_i \sim \pi^{old}_i}[Q^{\pi^{old}_i}_i(s, a_{<i}, a_i) - \alpha_i \log \pi^{old}_i(a_i|s, a_{<i})]. \quad (24)$$
By subtracting (24) from both sides of (23), we have:
$$\mathbb{E}_{a_i \sim \pi^{new}_i, a_{<i} \sim \pi^{new}_{<i}}[Q^{\pi^{old}_i}_i(s, a_{<i}, a_i) - \alpha_i \log \pi^{new}_i(a_i|s, a_{<i})] \geq \mathbb{E}_{a_{<i} \sim \pi^{new}_{<i}}[V^{\pi^{old}_i}_i(s, a_{<i})]. \quad (25)$$
And since
$$\pi^{new}_{jt} = \exp\big(\tfrac{1}{\alpha}(Q^{\pi^{old}_{jt}}_{jt}(s, a) - V^{\pi^{old}_{jt}}_{jt}(s))\big), \quad \pi^{new}_i = \exp\big(\tfrac{1}{\alpha_i}(Q^{\pi^{old}_i}_i(s, a_{<i}, a_i) - V^{\pi^{old}_i}_i(s, a_{<i}))\big), \quad \pi^{new}_{jt}(a|s) = \prod_{i=1}^{N} \pi^{new}_i(a_i|s, a_{<i}),$$
we have:
$$Q^{\pi^{old}_{jt}}_{jt}(s, a) = \sum_{i=1}^{N} \frac{\alpha}{\alpha_i}[Q^{\pi^{old}_i}_i(s, a_{<i}, a_i) - V^{\pi^{old}_i}_i(s, a_{<i})] + V^{\pi^{old}_{jt}}_{jt}(s).$$
Then we have:
$$\mathbb{E}_{a \sim \pi^{new}_{jt}}[Q^{\pi^{old}_{jt}}_{jt}(s, a) - \alpha \log \pi^{new}_{jt}(a|s)]$$
$$= \mathbb{E}_{a \sim \pi^{new}_{jt}}\Big[\sum_{i=1}^{N} \frac{\alpha}{\alpha_i}[Q^{\pi^{old}_i}_i(s, a_{<i}, a_i) - V^{\pi^{old}_i}_i(s, a_{<i})] + V^{\pi^{old}_{jt}}_{jt}(s) - \alpha \log \pi^{new}_{jt}(a|s)\Big]$$
$$= \sum_{i=1}^{N} \mathbb{E}_{a \sim \pi^{new}_{jt}}\Big[\frac{\alpha}{\alpha_i}[Q^{\pi^{old}_i}_i(s, a_{<i}, a_i) - V^{\pi^{old}_i}_i(s, a_{<i}) - \alpha_i \log \pi^{new}_i(a_i|s, a_{<i})]\Big] + V^{\pi^{old}_{jt}}_{jt}(s)$$
$$\geq V^{\pi^{old}_{jt}}_{jt}(s), \quad (26)$$
where the inequality follows from plugging in (25).
Last, considering the soft bellman equation, the following holds: Q π old jt jt (s t , a t ) = r t + γ E st+1 [V π old jt jt (s)] ≤ r t + γ E st+1 [E a∼π new jt [Q π old jt jt (s t+1 , a t+1 ) -α log π new jt (a t+1 | s t+1 )]] . . . ≤ Q π new jt jt (s t , a t ), where we have repeatedly expanded Q π old jt jt on the RHS by applying the soft Bellman equation and the bound in (26). Conditional factorized soft policy iteration alternates between joint soft policy evaluation and individual conditional soft policy improvement, and provably converges to the global optimum, as shown in Theorem 1. Theorem 1 (Conditional Factorized Soft Policy Iteration). For any joint policy π jt , if we repeatedly apply joint soft policy evaluation and individual conditional soft policy improvement from π i ∈ Π i . Then the joint policy π jt (a| s) = n i=1 π i (a i | s, a <i ) will eventually converge to π * jt , such that Q jt is bounded. Thus, this sequence must converge to some π * jt . Then, at convergence, we have the following inequality: J π * jt (π * jt (•| s)) ≤ J π * jt (π jt (•| s)), ∀ π jt ̸ = π * jt . Using the same iterative argument as in the proof of Lemma A.2, we get Q π * jt jt (s, a) ≥ Q πjt jt (s, a ) for all (s, a) ∈ S ×A. That is, the soft value of any other policy π jt is lower than that of the converged policy π * jt . Therefore, π * jt is optimal in Π 1 × • • • × Π N . B PROOF OF THEOREM 2 Theorem 2. For any dependent joint policy π dep jt that involves dependency among agents, there exists an independent joint policy π ind jt that does not involve dependency among agents, such that  V π dep jt (s) = V π ind jt π ind jt = N i=1 π i = N i=1 1[a i = arg max Q π dep jt [i]]. For such an independent joint policy π ind jt , we have a π  V π dep jt = a π dep jt Q π dep jt = a π ind jt Q π dep jt = E at∼π ind jt [Q π dep jt ]. Thus, we can have:  V π dep jt (s t ) = E at∼π ind jt [Q π dep jt (s t , a t )] = E at∼π ind jt ,
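As a concrete illustration, the iteration above can be sketched on a one-state, two-agent game, where joint soft policy evaluation reduces to reading off the payoff matrix and the KL projection in (22) has a closed form (a Boltzmann distribution). The payoff values, the temperature, and all helper names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# One-state, two-agent sketch of conditional factorized soft policy
# iteration: agent 2 conditions on agent 1's action, and agent 1 uses
# the conditional soft value of its own action.
Q_jt = np.array([[8.0, -12.0],
                 [-12.0, 6.0]])   # Q_jt(a1, a2), illustrative payoffs
alpha = 0.1                       # shared temperature (illustrative)

def boltzmann(q, alpha):
    # Closed-form minimizer of the KL projection: a Boltzmann policy.
    z = np.exp((q - q.max()) / alpha)
    return z / z.sum()

# Conditional improvement for agent 2: pi_2(a2 | a1) for each a1.
pi2 = np.stack([boltzmann(Q_jt[a1], alpha) for a1 in range(2)])
# Conditional soft value of agent 1's action: alpha * logsumexp over a2.
Q1 = alpha * np.log(np.exp(Q_jt / alpha).sum(axis=1))
pi1 = boltzmann(Q1, alpha)        # improvement for agent 1
# Dependent joint policy: pi_jt(a1, a2) = pi_1(a1) * pi_2(a2 | a1).
pi_jt = pi1[:, None] * pi2
```

With a small temperature, the dependent joint policy concentrates on the best joint action (0, 0), matching the fixed point Theorem 1 describes; with a larger temperature it stays stochastic, which is the maximum-entropy behavior targeted here.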

C EXPERIMENT SETTINGS AND IMPLEMENTATION DETAILS C.1 MATRIX GAME

In the matrix game, we use a learning rate of 3 × 10^-4 for all algorithms. For FOP and MACPF, α decays from 1 to 0.5 with a decay rate of 0.999 per episode. For QMIX and QPLEX, ϵ decays from 1 to 0.01 with a decay rate of 0.999 per episode. The batch size is 64 for FOP, MACPF, QMIX, and QPLEX, and 32 for MAPPO, as it is an on-policy algorithm. All critics and actors used in the experiments consist of one hidden layer of 64 units with ReLU non-linearity. For the mixer network, QMIX and MACPF both use a hypernetwork, except that an ELU non-linearity is used for QMIX and none for MACPF. FOP and QPLEX both use an attention network as their mixer network. The environment and model are implemented in Python. All models are built on PyTorch and are trained on a machine with 1 Nvidia GPU (RTX 1060) and 8 AMD CPU cores.
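The per-episode decay schedules above can be written as a small helper; the function name and the episode counts used in the example are our own illustration, not part of the paper's code.

```python
# Hypothetical helper reproducing the exponential decay schedules above:
# multiply by the decay rate once per episode, clipped at the floor value.
def decayed(start, end, rate, episode):
    return max(end, start * rate ** episode)

alpha = decayed(1.0, 0.5, 0.999, 500)     # alpha for FOP / MACPF
epsilon = decayed(1.0, 0.01, 0.999, 500)  # epsilon for QMIX / QPLEX
```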

C.2 SMAC

In StarCraft II, for MACPF, we use a learning rate of 5 × 10^-4. The critic network and policy network of MACPF consist of three layers: a fully-connected layer with 64 units activated by ReLU, followed by a GRU with a 64-dimensional hidden state, followed by another fully-connected layer. The policy correction network and critic correction network consist of two layers: a fully-connected layer with 64 units activated by ELU, followed by another fully-connected layer. The target networks are updated every 200 training episodes. The temperature parameters α and α_i are annealed from 0.5 to 0.05 over 200k time steps for all easy and hard maps and fixed at 0.001 for all super-hard maps. For QMIX, QPLEX, FOP, and MAPPO, we use their default settings for each map. The environment and model are implemented in Python. All models are built on PyTorch and are trained on a machine with 4 Nvidia GPUs (A100) and 224 Intel CPU cores. For 3s5z_vs_3s6z, all models are trained on a machine with 1 Nvidia GPU (RTX 2080 TI) and 16 Intel CPU cores. Our implementation of MACPF is based on PyMARL (Samvelyan et al., 2019), released under the MIT license. It is worth noting that, although we assume full observability for rigor in the proofs, the trajectory of each agent replaces the state s as each agent's input to handle partial observability in all SMAC experiments.
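The temperature annealing over time steps described above can be sketched as a simple schedule; we assume a linear ramp here (the exact shape is not stated), and the helper name is our own.

```python
# Sketch of annealing the temperature from 0.5 to 0.05 over 200k steps,
# then holding it at the final value. Linearity is an assumption.
def anneal(t, start=0.5, end=0.05, total=200_000):
    frac = min(t / total, 1.0)      # progress through the annealing window
    return start + frac * (end - start)
```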

C.3 MPE

In MPE (MIT license), we use the default settings of MAPPO. For QMIX, QPLEX, FOP, and MACPF, we use a learning rate of 5 × 10^-4. For FOP and MACPF, α decays from 0.5 to 0.05 over 50k time steps. For QMIX and QPLEX, ϵ decays from 1 to 0.05 over 50k time steps. The batch size is 64. All critics and actors used in the experiments consist of hidden layers of 64 units with ReLU non-linearity and a GRU with a 64-dimensional hidden state. For the mixer network, QMIX and MACPF both use a hypernetwork, except that an ELU non-linearity is used for QMIX and none for MACPF. FOP and QPLEX both use an attention network as their mixer network. The environment and model are implemented in Python. All models are built on PyTorch and are trained on a machine with 1 Nvidia GPU (RTX 2080 TI) and 16 Intel CPU cores. We also use the trajectory of each agent as input to handle partial observability in all MPE experiments.

D MORE EXPERIMENTS ON SMAC D.1 MORE MAPS

We additionally evaluate MACPF on more SMAC maps. The maps used here include six easy maps (8m, MMM, 3s_vs_3z, 3s_vs_4z, so_many_baneling, 1c3s5z), three hard maps (3s5z, 2c_vs_64zg, 3s_vs_5z) and three super-hard maps (3s5z_vs_3s6z, 27m_vs_30m, 6h_vs_8z). Results are shown in Figure 8 . In general, MACPF matches or slightly outperforms the best performance of the baselines on all twelve maps.

D.2 SUMMARY OF SMAC FINAL PERFORMANCE

In this section, we summarize the SMAC experiments in terms of final performance. All results are obtained after 2M training timesteps. As shown in Table 2, MACPF outperforms or at least matches the best performance of the baselines on all twelve maps.

E MIXER SELECTION

As mentioned in Section 5.2, we use a hypernetwork without non-linearity as our mixer network, which differs from QMIX, QPLEX, and FOP. In QPLEX and FOP, a weighted summation is used to reflect the relationship between Q_jt and Q_i, where the weight is a function of both the state and the agents' actions, so the dependency among agents is implicitly considered. However, this implicit dependency may contradict our explicit dependency model in Q_i^dep and degrade the performance of both Q_jt^dep and Q_jt^ind. Another choice is a hypernetwork with non-linearity, as used in QMIX. However, due to the non-linearity, two joint actions with the same Q_jt value may not be properly decomposed into two sets of Q_i with the same sum. Thus, their joint probabilities may differ, and the dependency among agents is distorted. Therefore, the only option left for MACPF is a hypernetwork without non-linearity, which is equivalent to a weighted summation whose weights are a function of the state alone. As shown in Figure 9, MACPF_NONLINEAR and MACPF_ATT denote variants identical to MACPF except that they use a hypernetwork with non-linearity and a weighted summation with actions as input as their mixer networks, respectively. MACPF_NONLINEAR achieves performance similar to MACPF on the easy and hard maps, indicating that even distorted dependency can still benefit learning. However, on the super-hard maps, MACPF outperforms MACPF_NONLINEAR, demonstrating the importance of accurately modeling the dependency among agents. MACPF_ATT is outperformed by both MACPF and MACPF_NONLINEAR by a large margin on all maps, which verifies that the implicit dependency model in the mixer network of MACPF_ATT conflicts with the explicit dependency model in Q_i^dep.
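The argument above can be made concrete: a hypernetwork without non-linearity is affine in the local Q_i, so the mixer never distorts how differences among local utilities map to the joint value. A minimal numpy sketch, where the weight and bias generators stand in for the hypernetwork heads (all shapes and names are illustrative assumptions):

```python
import numpy as np

# Minimal sketch of a linear (no non-linearity) hypernetwork mixer.
rng = np.random.default_rng(0)
state_dim, n_agents = 4, 3
W_hyper = rng.standard_normal((state_dim, n_agents))  # weight-generating head
b_hyper = rng.standard_normal(state_dim)              # bias-generating head

def mix(q_locals, state):
    w = np.abs(state @ W_hyper)   # non-negative, state-dependent weights
    b = state @ b_hyper           # state-dependent bias
    return q_locals @ w + b       # affine in the Q_i: no distortion

state = rng.standard_normal(state_dim)
```

Because `mix` is affine in `q_locals` for a fixed state, equal weighted sums of local utilities always produce equal Q_jt; an ELU hypernetwork would violate this identity, which is exactly the distortion discussed above.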

F FUTURE WORK

One limitation of our work is the sequential decision-making process in training. Since the dependent local policy π_i^dep(a_i | s, a_<i) takes as input the actions of all agents whose indices are smaller than i, agents have to make decisions one by one. This makes the whole decision process O(N). The difference is negligible when N is small, but when N is large, it slows down the training process. One approximate solution is to divide agents into groups, so that agents can make decisions group by group instead of one by one. However, such a mechanism raises a new question of how to group agents, which we leave to future work.
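The grouping idea can be sketched as follows; the `policies[i](prev_actions) -> action` interface and the grouping itself are hypothetical, not part of the paper's implementation.

```python
# Sketch of group-by-group decision making: agents within a group act in
# parallel, conditioning only on the actions of earlier groups, so the
# number of sequential rounds drops from N to the number of groups.
def grouped_decisions(policies, groups):
    actions = {}
    for group in groups:            # one sequential round per group
        prev = dict(actions)        # actions of all earlier groups
        for i in group:             # group members decide in parallel on prev
            actions[i] = policies[i](prev)
    return [actions[i] for i in sorted(actions)]
```

With `groups = [[0], [1], ..., [N-1]]` this recovers the fully sequential O(N) process described above; with a single group it recovers fully independent decisions.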



Figure 1: Sampled trajectories from the learned policy: (a) centralized control; (b) FOP, where IGO is assumed; (c) considering dependency during training.

Figure 2: A dependent joint policy and its independent counterpart

The logit of π ind i (ai| s; θi) is first added with b dep i (ai| s, a<i; ϕi) to get the logit of π dep i (ai| s, a<i); then softmax is applied over this combined logit to get π dep i (ai| s, a<i).
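The combination described in this footnote is straightforward to sketch: add the correction to the independent logits, then apply a softmax. The concrete logit values below are illustrative.

```python
import numpy as np

# Sketch of forming the dependent local policy from the independent
# logits plus the learned correction, followed by a softmax.
def softmax(z):
    z = z - z.max()               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

ind_logits = np.array([2.0, 0.0, -1.0])   # logits of pi_i^ind(.| s)
b_dep = np.array([-3.0, 1.0, 0.0])        # correction b_i^dep(.| s, a_<i)
pi_ind = softmax(ind_logits)              # independent local policy
pi_dep = softmax(ind_logits + b_dep)      # dependent local policy
```

Here the correction shifts the dependent policy's mode away from the independent policy's, as happens when the actions a_<i make agent i's default choice redundant.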

Figure 3: Learning framework of MACPF, where each agent i has four modules: an independent local policy π ind i (•; θi), a dependency policy correction b dep i (•; ϕi), an independent local critic Q ind i (•; ψi), and a dependency critic correction c dep i (•, ωi).

Figure 4: A matrix game that has two optimal joint actions: (a) payoff matrix; (b) learning curves of different methods; (c) the learned dependent joint policy of MACPF.
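The game in Figure 4 illustrates why independent factorization can fail: with two optimal joint actions, naively factorizing the uniform optimal joint policy into independent marginals leaks probability mass onto suboptimal joint actions. A sketch with an illustrative payoff matrix:

```python
import numpy as np

# Two optimal joint actions on the diagonal: (0, 0) and (1, 1).
payoff = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

# Dependent joint policy: agent 1 uniform, agent 2 copies agent 1.
pi_dep = np.array([[0.5, 0.0],
                   [0.0, 0.5]])            # pi_dep(a1, a2)
v_dep = (pi_dep * payoff).sum()            # -> 1.0

# Independent factorization of the same marginals.
p1 = pi_dep.sum(axis=1)                    # marginal of agent 1: [0.5, 0.5]
p2 = pi_dep.sum(axis=0)                    # marginal of agent 2: [0.5, 0.5]
pi_ind = np.outer(p1, p2)                  # product policy, 0.25 everywhere
v_ind = (pi_ind * payoff).sum()            # -> 0.5, half the value is lost
```

A deterministic independent policy that commits to one of the two optima recovers value 1, which is the construction Theorem 2 exploits.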

Figure 5: Learning curves of all the methods in six maps of SMAC, where the unit of x-axis is 1M timesteps and y-axis represents the win rate of each map.

Figure 6: Performance of π dep jt and π ind jt during training in four maps of SMAC, where the unit of x-axis is 1M timesteps and y-axis represents the win rate of each map.


Figure 8: Learning curves of all the methods in twelve maps of SMAC, where the unit of x-axis is 1M timesteps and y-axis represents the win rate of each map.

Figure 9: Ablation study of the mixer selection of MACPF on four maps of SMAC, including one easy map (MMM), one hard map (8m_vs_9m) and two super-hard maps (MMM2, corridor), where the unit of x-axis is 1M timesteps and y-axis represents the win rate of each map.

Average rewards per episode on three MPE tasks.


Table 2: Final performance on all SMAC maps. We bold all values within one standard deviation of the best mean performance for each map.

709±0.162 0.224±0.231 0.506±0.144 0.679±0.054
3s5z_vs_3s6z (super-hard)   0.209±0.202  0.024±0.031  0.135±0.090  0.0±0.0      0.144±0.175
corridor (super-hard)       0.691±0.349  0.0±0.0      0.002±0.005  0.0±0.0      0.58±0.184
27m_vs_30m (super-hard)     0.726±0.094  0.532±0.23   0.294±0.159  0.45±0.143   0.78±0.095
8m_vs_9m (hard)             0.855±0.069  0.675±0.127  0.716±0.075  0.338±0.329  0.81±0.119
10m_vs_11m (hard)           0.888±0.188  0.702±0.129  0.664±0.089  0.384±0.372  0.514±0.253
3s_vs_3z (easy)             0.974±0.019  0.988±0.014  0.994±0.004  0.999±0.002  0.997±0.003
3s_vs_4z (easy)             0.995±0.005  0.99±0.008   0.997±0.003  0.789±0.22   0.957±0.022
3s_vs_5z (hard)             0.959±0.033  0.759±0.153  0.992±0.006  0.862±0.076  0.576±0.063
so_many_baneling (easy)     0.969±0.019  0.974±0.009  0.941±0.037  0.97±0.025   0.979±0.012
1c3s5z (easy)               0.984±0.006  0.98±0.013   0.985±0.003  0.984±0.005  0.989±0.007
6h_vs_8z (super-hard)       0.059±0.038  0.001±0.002  0.059±0.09   0.028±0.055  0.13±0.074

ACKNOWLEDGMENTS

This work was supported in part by NSFC (under grant 62250068) and Tencent.

AVAILABILITY

//github.com/PKU

