MORE CENTRALIZED TRAINING, STILL DECENTRALIZED EXECUTION: MULTI-AGENT CONDITIONAL POLICY FACTORIZATION

Abstract

In cooperative multi-agent reinforcement learning (MARL), combining value decomposition with actor-critic enables agents to learn stochastic policies, which are better suited to partially observable environments. Given the goal of learning local policies that enable decentralized execution, agents are commonly assumed to be independent of each other, even in centralized training. However, such an assumption may prevent agents from learning the optimal joint policy. To address this problem, we explicitly bring the dependency among agents into centralized training. Although this leads to the optimal joint policy, it may not be factorized for decentralized execution. Nevertheless, we theoretically show that from such a joint policy, we can always derive another joint policy that achieves the same optimality but can be factorized for decentralized execution. To this end, we propose multi-agent conditional policy factorization (MACPF), which takes more centralized training but still enables decentralized execution. We empirically verify MACPF in various cooperative MARL tasks and demonstrate that MACPF achieves better performance or faster convergence than baselines.

1. INTRODUCTION

The cooperative multi-agent reinforcement learning (MARL) problem has attracted the attention of many researchers, as it is a well-abstracted model for many real-world problems, such as traffic signal control (Wang et al., 2021a) and autonomous warehouses (Zhou et al., 2021). In a cooperative MARL problem, we aim to train a group of agents that cooperate to achieve a common goal, which is often defined by a global reward function shared among all agents. If centralized control were allowed, such a problem could be viewed as a single-agent reinforcement learning problem with an enormous action space. Based on this intuition, Kraemer & Banerjee (2016) proposed the centralized training with decentralized execution (CTDE) framework to overcome the non-stationarity of MARL. In CTDE, a centralized value function is learned to guide the update of each agent's local policy, which enables decentralized execution.

With a centralized value function, there are different ways to guide the learning of each agent's local policy. One line of research, called value decomposition (Sunehag et al., 2018), obtains local policies by factorizing the centralized value function into the utility function of each agent. To ensure that the update of local policies indeed improves the joint policy, the Individual-Global-Max (IGM) principle is introduced to guarantee consistency between joint and local policies. Based on different interpretations of IGM, various MARL algorithms have been proposed, such as VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), and QPLEX (Wang et al., 2020a). IGM only specifies the relationship between optimal local actions and the optimal joint action, so it is mostly used to learn deterministic policies. To learn stochastic policies, which are better suited to partially observable environments, recent studies (Su et al., 2021; Wang et al., 2020b; Zhang et al., 2021; Su & Lu, 2022) combine value decomposition with actor-critic. While most of these decomposed actor-critic methods do not guarantee optimality, FOP (Zhang et al., 2021) introduces Individual-Global-Optimal (IGO) for learning the optimal joint policy under the maximum-entropy objective and derives the corresponding value decomposition. The factorized local policies of FOP provably converge to the global optimum, provided that IGO is satisfied.

The essence of IGO is that all agents are independent of each other during both training and execution. However, we find that this requirement dramatically reduces the expressiveness of the joint policy, making the learning algorithm fail to converge to the globally optimal joint policy, even in some simple scenarios. As centralized training is allowed, a natural way to address this issue is to factorize the joint policy by the chain rule (Schum, 2001), so that the dependency among agents' policies is explicitly considered and the full expressiveness of the joint policy is retained, as sketched below. By incorporating such a joint policy factorization into soft policy iteration (Haarnoja et al., 2018), we can obtain an optimal joint policy without the IGO condition. Though optimal, a joint policy learned this way may not decompose into independent local policies, so decentralized execution is not fulfilled; this is the limitation of many previous works that consider dependency among agents (Bertsekas, 2019; Fu et al., 2022).
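To make the chain-rule factorization concrete, below is a minimal sketch (our illustration, not the paper's implementation) of an autoregressive joint policy for discrete actions. Agent $i$'s head conditions on the state and the actions already sampled by agents $1, \ldots, i-1$, so the product $\pi_{jt}(\boldsymbol{a}|s) = \prod_i \pi_i(a_i \,|\, s, a_{1:i-1})$ can represent any joint distribution; all class and method names here are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainRulePolicy(nn.Module):
    """Autoregressive (chain-rule) joint policy for N agents with discrete actions."""

    def __init__(self, n_agents, state_dim, n_actions, hidden=64):
        super().__init__()
        self.n_actions = n_actions
        # Head i conditions on the state plus the one-hot actions of agents 1..i-1.
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(state_dim + i * n_actions, hidden),
                          nn.ReLU(),
                          nn.Linear(hidden, n_actions))
            for i in range(n_agents))

    def sample(self, state):
        """Sample a joint action agent by agent; returns actions and log pi_jt(a|s)."""
        actions, onehots, log_p = [], [], 0.0
        for head in self.heads:
            dist = torch.distributions.Categorical(
                logits=head(torch.cat([state] + onehots, dim=-1)))
            a = dist.sample()
            log_p = log_p + dist.log_prob(a)  # chain rule: log-probs add up
            actions.append(a)
            onehots.append(F.one_hot(a, self.n_actions).float())
        return actions, log_p
```

Note that sampling is inherently sequential here: agent $i$ cannot act before agents $1, \ldots, i-1$, which is precisely why such a policy does not directly support decentralized execution.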
To fulfill decentralized execution, we first theoretically show that for such a dependent joint policy, there always exists an independent joint policy that achieves the same expected return but decomposes into independent local policies. To learn the optimal joint policy while preserving decentralized execution, we propose multi-agent conditional policy factorization (MACPF), in which each dependent local policy is represented by combining an independent local policy with a dependency policy correction (see the sketch below). The dependent local policies factorize the optimal joint policy, while the independent local policies constitute its independent counterpart that enables decentralized execution. We evaluate MACPF on several tasks, including a matrix game (Rashid et al., 2020), SMAC (Samvelyan et al., 2019), and MPE (Lowe et al., 2017). Empirically, MACPF consistently outperforms its base method, FOP, and achieves better performance or faster convergence than other baselines. By ablation, we verify that the independent local policies indeed achieve the same level of performance as the dependent local policies.
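The following sketch illustrates one simple way to realize this combination, additive in logit space; the module and method names are ours, introduced only for illustration, and this is not the paper's exact architecture. During centralized training, each agent's dependent logits are its independent logits plus a correction conditioned on preceding agents' actions; during decentralized execution, the correction is simply dropped.

```python
import torch
import torch.nn as nn

class ConditionalAgentPolicy(nn.Module):
    """One agent's policy, split into an independent part and a dependency correction."""

    def __init__(self, obs_dim, n_actions, prev_dim, hidden=64):
        super().__init__()
        # Independent policy: conditions on the agent's own input only.
        self.ind = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))
        # Dependency correction: additionally conditions on preceding agents' actions.
        self.cor = nn.Sequential(nn.Linear(obs_dim + prev_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def dependent_logits(self, obs, prev_actions):
        # Centralized training: independent logits plus the correction.
        return self.ind(obs) + self.cor(torch.cat([obs, prev_actions], dim=-1))

    def independent_logits(self, obs):
        # Decentralized execution: the correction is dropped.
        return self.ind(obs)
```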

2.1. MULTI-AGENT MARKOV DECISION PROCESS

In cooperative MARL, we often formulate the problem as a multi-agent Markov decision process (MDP) (Boutilier, 1996). A multi-agent MDP is defined by a tuple $\langle \mathcal{I}, \mathcal{S}, \mathcal{A}, P, r, \gamma, N \rangle$. $N$ is the number of agents, $\mathcal{I} = \{1, 2, \ldots, N\}$ is the set of agents, $\mathcal{S}$ is the set of states, and $\mathcal{A} = A_1 \times \cdots \times A_N$ is the joint action space, where $A_i$ is the individual action space of agent $i$. For the rigorousness of proof, we assume full observability, such that at each state $s \in \mathcal{S}$, each agent $i$ receives the state $s$ and chooses an action $a_i \in A_i$, and all actions form a joint action $\boldsymbol{a} \in \mathcal{A}$. The state transitions to the next state $s'$ upon $\boldsymbol{a}$ according to the transition function $P(s'|s, \boldsymbol{a}): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$, and all agents receive a shared reward $r(s, \boldsymbol{a}): \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. The objective is to learn a local policy $\pi_i(a_i \,|\, s)$ for each agent such that they cooperate to maximize the expected cumulative discounted return $\mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_t]$, where $\gamma \in [0, 1)$ is the discount factor. In CTDE, from a centralized perspective, a group of local policies can be viewed as a joint policy $\pi_{jt}(\boldsymbol{a}|s)$. For this joint policy, we can define the joint state-action value function $Q_{jt}(s_t, \boldsymbol{a}_t) = \mathbb{E}_{s_{t+1:\infty}, \boldsymbol{a}_{t+1:\infty}}[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,|\, s_t, \boldsymbol{a}_t]$. Note that although we assume full observability for the rigorousness of proof, in practice we use the trajectory of each agent $\tau_i \in T_i \equiv (Y \times A_i)^*$ in place of the state $s$ as its policy input to handle partial observability, where $Y$ is the observation space.
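As a concrete illustration of this last point, an agent can encode its action-observation history $\tau_i$ with a recurrent cell and condition its policy on that encoding instead of $s$. This is a minimal sketch under assumed names, not the paper's architecture:

```python
import torch
import torch.nn as nn

class RecurrentAgent(nn.Module):
    """Encodes the action-observation history tau_i with a GRU cell."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim + n_actions, hidden)
        self.pi = nn.Linear(hidden, n_actions)

    def forward(self, obs, last_action_onehot, h):
        # h summarizes tau_i = (y_0, a_0, y_1, a_1, ...) up to the current step.
        h = self.rnn(torch.cat([obs, last_action_onehot], dim=-1), h)
        return self.pi(h), h  # logits over A_i, plus the updated history encoding
```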

2.2. FOP

FOP (Zhang et al., 2021) is one of the state-of-the-art CTDE methods for cooperative MARL, which extends value decomposition to learning stochastic policies. In FOP, the joint policy is decomposed into independent local policies based on Individual-Global-Optimal (IGO), which can be stated as:
$$\pi_{jt}(\boldsymbol{a}|s) = \prod_{i=1}^{N} \pi_i(a_i \,|\, s). \tag{1}$$
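In code, Eq. (1) says the joint log-probability is just the sum of per-agent log-probabilities, each computed from $s$ alone. Below is a minimal sketch of this computation (illustrative, not FOP's implementation; the function name is ours):

```python
import torch
from torch.distributions import Categorical

def igo_joint_log_prob(local_dists, joint_action):
    """local_dists: one Categorical per agent, each conditioned on s only.
    joint_action: sequence of per-agent action tensors."""
    # Independence (IGO): log pi_jt(a|s) = sum_i log pi_i(a_i|s).
    return sum(d.log_prob(a) for d, a in zip(local_dists, joint_action))

# Usage: two agents, two actions each, with uniform local policies.
dists = [Categorical(probs=torch.tensor([0.5, 0.5])) for _ in range(2)]
a = [torch.tensor(0), torch.tensor(0)]
print(igo_joint_log_prob(dists, a).exp())  # 0.25: mass spreads over all four pairs
```

Note that such a product of independent distributions cannot, for example, place half its mass on $(a_1, a_2) = (0, 0)$ and half on $(1, 1)$; this is exactly the loss of expressiveness discussed in the introduction.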

Code availability: https://github.com/PKU-RL/FOP-DMAC

