TOWARDS GLOBAL OPTIMALITY IN COOPERATIVE MARL WITH SEQUENTIAL TRANSFORMATION

Abstract

Policy learning in multi-agent reinforcement learning (MARL) is challenging because the joint state-action space grows exponentially with the number of agents. To achieve higher scalability, the paradigm of centralized training with decentralized execution (CTDE) with a factorized structure is broadly adopted in MARL. However, we observe that existing CTDE algorithms in cooperative MARL cannot achieve optimality even in simple matrix games. To understand this phenomenon, we analyze two mainstream classes of CTDE algorithms: actor-critic algorithms and value-decomposition algorithms. Our theoretical and experimental results characterize the weaknesses of these two classes of algorithms when the optimization method is taken into consideration, indicating that the currently used centralized training manner is deficient in its compatibility with decentralized policies. To address this issue, we present a transformation framework that reformulates a multi-agent MDP as a special "single-agent" MDP with a sequential structure, which allows off-the-shelf single-agent reinforcement learning (SARL) algorithms to be employed to efficiently learn the corresponding multi-agent tasks. A decentralized policy can then be learned by distilling the "single-agent" policy. This framework carries the optimality guarantee of SARL algorithms over to cooperative MARL. To instantiate the framework, we propose a Transformed PPO, called T-PPO, which can theoretically perform optimal policy learning in finite multi-agent MDPs and significantly outperforms existing methods on a large set of cooperative multi-agent tasks.

1. INTRODUCTION

Cooperative multi-agent reinforcement learning (MARL) is a promising approach to a variety of real-world applications, such as sensor networks (Zhang & Lesser, 2011), traffic light control (Van der Pol & Oliehoek, 2016), and multi-robot formation (Alonso-Mora et al., 2017). However, the curse of dimensionality is a major challenge in cooperative MARL, since the joint state-action space grows exponentially with the number of agents. To achieve higher scalability, the paradigm of centralized training with decentralized execution (CTDE) (Kraemer & Banerjee, 2016a) is widely used, which allows agents to learn their local policies in a centralized way while retaining the ability to execute in a decentralized manner. Many CTDE algorithms have recently been proposed. In value-based methods, the joint Q value is factorized as a function of the agents' individual Q values (hence these are also called value-decomposition algorithms), and standard TD-learning is then applied. To enable scalability and decentralized execution, it is critical to ensure that the joint greedy action can be computed by selecting local greedy actions through individual Q functions, which is formalized as the Individual-Global-Max (IGM) principle (Son et al., 2019). Based on the IGM property, a series of factorized multi-agent Q-learning methods have been developed, including but not limited to VDN (Sunehag et al., 2018), QMIX (Rashid et al., 2018), QTRAN (Son et al., 2019), and QPLEX (Wang et al., 2021b). In multi-agent actor-critic methods, the joint policy is typically factorized into the product of individual policies, each of which is learned through policy-gradient updates. For example, COMA (Foerster et al., 2018) and DOP (Wang et al., 2021c) focus on critic design for effective credit assignment, MADDPG (Lowe et al., 2017) studies the setting with parameterized deterministic policies, and MAPPO (Yu et al., 2021) applies PPO to multi-agent settings with parameter sharing.
Despite their promising performance on benchmark tasks, these CTDE methods do not have a global optimality guarantee in general cooperative settings and may fail to achieve optimality even in simple matrix games (Section 3). This may appear confusing, since some algorithms such as QPLEX (Wang et al., 2021b) have been proven to converge to the global optimum in prior theoretical work (Wang et al., 2021a), which contradicts our experimental results. To understand this phenomenon, we provide a theoretical analysis for both actor-critic and value-decomposition algorithms. It shows that when the optimization method is taken into consideration, which prior analyses did not do, neither class of algorithms can escape the local optima that widely exist in multi-agent tasks (Section 3). To address this suboptimality issue, we present a transformation framework that reformulates a multi-agent MDP as a special "single-agent" MDP with a sequential decision-making structure among agents. With this transformation, any off-the-shelf single-agent reinforcement learning (SARL) method can be adopted to efficiently learn coordination policies in cooperative multi-agent tasks by solving the transformed single-agent tasks, with its global optimality guarantee retained. To enable decentralized execution, a decentralized policy is learned at the same time by distilling the "single-agent" policy. As an instantiation of this transformation framework, we propose a Transformed PPO (T-PPO), which can theoretically perform optimal policy learning in finite multi-agent MDPs and empirically outperforms existing methods on a large set of cooperative multi-agent tasks, including SMAC (Samvelyan et al., 2019) and GRF (Kurach et al., 2019), using an attention mechanism (Vaswani et al., 2017).

2.1. RL MODELS

In single-agent RL (SARL), an agent interacts with a Markov Decision Process (MDP) to maximize its cumulative reward (Sutton & Barto, 2018). An MDP is defined as a tuple $(\mathcal{S}, \mathcal{A}, r, P, \gamma, s_0)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state space and action space, respectively. At each time step $t$, the agent observes the state $s_t$ and chooses an action $a_t \in \mathcal{A}$, where $a_t \sim \pi(\cdot|s_t)$ depends on $s_t$ and its policy $\pi$. The agent then receives an instant reward $r_t = r(s_t, a_t)$ and transitions to the next state $s_{t+1} \sim P(\cdot|s_t, a_t)$; $\gamma$ is the discount factor. The goal of an SARL agent is to find a policy $\pi$ that maximizes the expected cumulative reward, i.e., $J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where $a_t \sim \pi(\cdot|s_t)$ and $s_{t+1} \sim P(\cdot|s_t, a_t)$. In MARL, we model a fully cooperative multi-agent task as a Dec-POMDP (Oliehoek et al., 2016) defined by a tuple $\langle \mathcal{N}, \mathcal{S}, \mathcal{A}, P, \Omega, O, r, \gamma \rangle$, where $\mathcal{N}$ is the set of agents, $\mathcal{S}$ is the global state space, $\mathcal{A}$ is the action space, and $\gamma$ is the discount factor. At each time step, agent $i \in \mathcal{N}$ has access to the observation $o_i \in \Omega$, drawn from the observation function $O(s, i)$. Each agent maintains an action-observation history $\tau_i \in \Omega \times (\mathcal{A} \times \Omega)^*$ and constructs its individual policy $\pi_i(a_i|\tau_i)$. With each agent $i$ selecting an action $a_i \in \mathcal{A}$, the joint action $\boldsymbol{a} \equiv [a_i]_{i=1}^{n}$ yields a shared reward $r = r(s, \boldsymbol{a})$ and the next state $s'$ according to the transition distribution $P(s'|s, \boldsymbol{a})$. The objective of MARL agents is to find a joint policy $\boldsymbol{\pi} = \langle \pi_1, \ldots, \pi_n \rangle$ conditioned on the joint trajectory $\boldsymbol{\tau} \equiv [\tau_i]_{i=1}^{n}$ that maximizes the joint value function $V^{\boldsymbol{\pi}}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \boldsymbol{\pi}\right]$. Another quantity used in policy search is the joint action-value function $Q^{\boldsymbol{\pi}}(s, \boldsymbol{a}) = r(s, \boldsymbol{a}) + \gamma \mathbb{E}_{s'}\left[V^{\boldsymbol{\pi}}(s')\right]$. To simplify our analysis, we adopt the framework of Multi-agent MDPs (MMDPs) (Boutilier, 1996), a special case of Dec-POMDPs, to model cooperative multi-agent decision-making tasks with full observability.
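As a concrete illustration of the objective $J(\pi)$, the following sketch estimates the discounted return of a fixed policy in a tiny two-state MDP by Monte Carlo rollouts. The MDP, the policy, and all constants are illustrative assumptions, not objects from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-state, two-action MDP (all numbers are illustrative).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # P[s, a, s']
              [[0.7, 0.3], [0.05, 0.95]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])                   # r[s, a]
gamma = 0.9

def policy(s):
    """A fixed deterministic policy pi(s)."""
    return 0 if s == 0 else 1

def rollout_return(horizon=200):
    """One Monte Carlo sample of sum_t gamma^t * r(s_t, a_t), starting from s_0 = 0."""
    s, G = 0, 0.0
    for t in range(horizon):
        a = policy(s)
        G += gamma ** t * r[s, a]
        s = rng.choice(2, p=P[s, a])
    return G

# Average over many rollouts to estimate J(pi).
J = np.mean([rollout_return() for _ in range(500)])
```

For an MDP this small, the estimate can be cross-checked against exact policy evaluation, $V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi}$, which is the fixed point of the Bellman equation stated above.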
An MMDP is defined as a tuple $\langle \mathcal{N}, \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$, where $\mathcal{N}$, $\mathcal{S}$, $\mathcal{A}$, $P$, $r$, and $\gamma$ are the agent set, state space, action space, transition function, reward function, and discount factor as in a Dec-POMDP, respectively. Because of full observability, at each time step the current state $s$ is observable to every agent. For each agent $i$, an individual policy $\pi_i(a|s)$ represents a distribution over actions conditioned on the state $s$. The agents aim to find a joint policy $\boldsymbol{\pi} = \langle \pi_1, \ldots, \pi_n \rangle$ that maximizes the joint value function $V^{\boldsymbol{\pi}}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \boldsymbol{\pi}\right]$.
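The product form of a factorized joint policy can be written out directly. In the following sketch (the probabilities are illustrative), the joint distribution over the $3 \times 3$ joint-action space of two agents is simply the outer product of the two individual policies at a fixed state:

```python
import numpy as np

# Two agents, three actions each; individual policies pi_i(. | s) for one
# fixed state s (the probabilities are illustrative).
pi1 = np.array([0.2, 0.5, 0.3])
pi2 = np.array([0.6, 0.1, 0.3])

# Factorized joint policy: pi(a1, a2 | s) = pi_1(a1 | s) * pi_2(a2 | s).
joint = np.outer(pi1, pi2)

print(joint.shape)   # one entry per joint action
print(joint.sum())   # a valid distribution sums to 1
```

Note that this outer-product structure cannot represent arbitrary joint distributions, e.g., one that puts all mass on the two joint actions $(0, 0)$ and $(1, 1)$; this limited expressiveness is at the heart of the coordination difficulties analyzed later.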

2.2. POLICY FACTORIZATION AND CENTRALIZED TRAINING WITH DECENTRALIZED EXECUTION

In MARL, due to partial observability and communication constraints, a decentralized policy is required during execution, i.e., the joint execution policy $\pi_{\text{test}}$ can be decomposed into a product of

