ADVANTAGE CONSTRAINED PROXIMAL POLICY OPTIMIZATION IN MULTI-AGENT REINFORCEMENT LEARNING

Abstract

We explore the combination of value-based methods and policy gradient methods in multi-agent reinforcement learning (MARL). In value-based MARL, the Individual-Global-Max (IGM) principle plays an important role: it maintains consistency between the joint and local action values. However, IGM is difficult to guarantee in multi-agent policy gradient methods due to stochastic exploration and conflicting gradient directions. In this paper, we propose a novel multi-agent policy gradient algorithm called Advantage Constrained Proximal Policy Optimization (ACPPO). Based on the multi-agent advantage decomposition lemma, ACPPO equips each agent with an advantage network that estimates the current local state-action advantage. Each agent's coefficient constrains the joint-action advantage according to the consistency between the estimated joint-action advantage and the local advantage. Unlike previous policy-gradient-based MARL algorithms, ACPPO does not need an extra sampled baseline to reduce variance. We evaluate the proposed methods on continuous matrix games and Multi-Agent MuJoCo tasks. Results show that ACPPO outperforms baselines such as MAPPO, MADDPG, and HATRPO.

1. INTRODUCTION

Many complex sequential decision-making problems in the real world that involve multiple agents can be described as Multi-Agent Reinforcement Learning (MARL) problems, including autonomous vehicles (Chang et al., 2021), logistics management (Wang et al., 2017), and electric transportation (Chen & Wang, 2020). MARL in real-world environments usually requires systems with scalability and distributed execution capabilities (Shao et al., 2017). A centralized controller suffers from the exponential growth of the action space as the number of agents increases (Hu et al., 2021). Based on parameter sharing, decentralized policies can reduce the complexity of tasks and enable scalable structures (Shao et al., 2019b). However, directly deploying single-agent reinforcement learning in MARL suffers from non-stationarity. To stabilize the training process, MARL borrows monotonic improvement from trust region methods and Centralized Training with Decentralized Execution (CTDE) from value-based methods.

Trust region learning has played a major role in recent policy gradient methods (Shao et al., 2019a; Kakade, 2001). Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) and Proximal Policy Optimization (PPO) (Schulman et al., 2017) have achieved outstanding performance in single-agent reinforcement learning. The key to the effectiveness of trust region methods is the theoretically justified guarantee of monotonic performance improvement at each step. With a KL divergence constraint, parameters are updated within a trust region that prevents overly aggressive gradient steps. Based on parameter sharing, a centralized critic, and PopArt (Hessel et al., 2019), MAPPO (Yu et al., 2021) achieves good performance in multi-agent environments. Unfortunately, the value function of MAPPO is affected by the exploration of other agents, making its convergence unstable. Kuba et al.
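Since PPO's clipping mechanism underlies the methods discussed throughout, a minimal illustrative sketch may help. This is a per-sample plain-Python rendering of the standard PPO clipped surrogate objective (Schulman et al., 2017); the function name and list-based interface are ours, not from any implementation:

```python
import math

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate loss over per-sample lists.

    The probability ratio pi_new / pi_old is clipped to [1 - eps, 1 + eps]
    so that a single update cannot move the policy too far from the
    policy that collected the samples.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(log_prob_new, log_prob_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # PPO maximizes the minimum of the unclipped and clipped objectives,
        # which removes the incentive to push the ratio outside the clip range
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)  # negated so it can be minimized
```

When the new and old policies coincide the ratio is 1 and the loss reduces to the negated mean advantage; a large ratio with a positive advantage is capped at `1 + clip_eps`.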
(2021) introduce an optimal baseline for a more accurate estimate of the state value function in MAPPO. The optimal baseline relies on an estimated hypothetical joint action value, which may introduce estimation errors. Moreover, examples in Kuba et al. (2022) show that MAPPO does not guarantee consistent improvement even with correct gradients. To obtain a guarantee of monotonic improvement in MARL, Kuba et al. (2022) introduce HATRPO. HATRPO uses heterogeneous agents and a sequential, randomly ordered update scheme for agent gradient directions to obtain this guarantee. HATRPO does not apply a centralized critic; instead, it transmits the update information of previous agents through a compound policy ratio. However, the time complexity of the sequential update scheme is high, making it difficult to scale to scenarios with a large number of agents.

In this paper, we propose Advantage Constrained Proximal Policy Optimization (ACPPO), a novel multi-agent policy gradient method based on the advantage decomposition lemma. ACPPO adopts policy subsets, and each agent updates its policy according to its subset. The policy subset avoids the inefficiency caused by sequential updates and the instability caused by importance sampling while ensuring consistent improvement across gradient updates. A fictitious joint-action advantage function is estimated as the sum of local advantages, each learned by an individual agent. From the current and previous advantages, each agent can estimate how consistently its local advantage changes with the joint advantage of the policy subset. The local advantage of each agent is then scaled to ensure that each local policy gradient update improves the performance of the policy subset. In practice, we propose three variants of ACPPO. Advantage Constrained Proximal Policy Optimization with Hard Threshold (ACPPO-D) combines the proposed constraint coefficient with PPO.
However, the hard threshold is too sensitive to errors. Therefore, we also propose ACPPO with a soft threshold, including Advantage Constrained Proximal Policy Optimization with Parameter Sharing (ACPPO-PS) and Advantage Constrained Proximal Policy Optimization with Heterogeneous Agents (ACPPO-HA). The main contributions of this paper are summarized as follows:

• We present a constraint coefficient on the local advantage, estimated from the difference between the local and fictitious joint advantage functions, to ensure consistent improvement of the joint policy.

• We propose policy subsets with heterogeneous estimation of the constraint coefficient to ensure monotonic improvement while avoiding the inefficiency caused by sequential updates and the numerical overflow of importance sampling.

• We evaluate ACPPO on Multi-Agent MuJoCo benchmarks against strong baselines such as HATRPO, MAPPO, and MADDPG. The results show that ACPPO achieves state-of-the-art performance across all tested scenarios and demonstrate that parameter-sharing agents without a centralized mixing network perform well in Multi-Agent MuJoCo environments.
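The hard-threshold idea can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: we assume the coefficient simply zeroes out a local advantage whose sign disagrees with the fictitious joint advantage (the sum of all local advantages); the exact rule in ACPPO-D may differ.

```python
def hard_constraint_coeff(local_adv, joint_adv):
    """Hypothetical hard-threshold constraint coefficient (ACPPO-D spirit):
    keep an agent's local advantage only when its sign agrees with the
    fictitious joint advantage. Illustrative sketch, not the paper's rule."""
    return 1.0 if local_adv * joint_adv > 0 else 0.0

def constrained_local_advantages(local_advs):
    # Fictitious joint advantage: sum of the per-agent local advantages
    joint_adv = sum(local_advs)
    # Scale each local advantage by its constraint coefficient, so agents
    # whose local signal conflicts with the joint direction do not push
    # the gradient against the policy subset's improvement
    return [hard_constraint_coeff(a, joint_adv) * a for a in local_advs]
```

For example, with local advantages `[1.0, -0.5]` the fictitious joint advantage is positive, so the second agent's conflicting advantage is suppressed to zero. A hard cutoff like this is exactly what makes the method sensitive to estimation errors, motivating the soft-threshold variants.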

2.1. COOPERATIVE MARL PROBLEM

A fully cooperative multi-agent problem can be described as a six-element tuple $\langle \mathcal{N}, \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$. Here $\mathcal{N} = \{1, \dots, n\}$ is the set of agents, $\mathcal{S}$ denotes the finite state space, $\mathcal{A} = \prod_{i=1}^{n} \mathcal{A}^i$ is the joint action space of all agents, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition probability function, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. At each time step $t \in \mathbb{N}$, each agent $i$ takes an action $a^i_t \in \mathcal{A}^i$ at state $s_t \in \mathcal{S}$. The combination of all agents' actions is the joint action $a_t = (a^1_t, \dots, a^n_t) \in \mathcal{A}$, drawn from the joint policy $\pi(\cdot \mid s_t) = \prod_{i=1}^{n} \pi^i(\cdot \mid s_t)$. Given $a_t$ and $s_t$, the agents receive the reward $r_t = r(s_t, a_t) \in \mathbb{R}$ and move to a state $s_{t+1}$ according to the probability $P(s_{t+1} \mid s_t, a_t)$. $\rho^0$ is the distribution of the initial state $s_0$, and the marginal state distribution at time $t$ is denoted by $\rho^t_\pi$. The state value function and the state-action value function are defined as follows: $V_\pi(s) = \mathbb{E}_{a_{0:\infty} \sim \pi, s_{1:\infty} \sim P}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$ and $Q_\pi(s, a) = \mathbb{E}_{a_{1:\infty} \sim \pi, s_{1:\infty} \sim P}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$. The advantage function is $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$. The objective of the agents in a cooperative problem is to maximise the expected return $\mathcal{J}(\pi) = \mathbb{E}_{s_{0:\infty} \sim \rho^{0:\infty}_\pi, a_{0:\infty} \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. The set of all agents excluding agents $(i_1, \dots, i_m)$ is denoted by $-(i_1, \dots, i_m)$. The local joint state-action value function is defined as $Q_\pi(s, a^{(i_1, \dots, i_m)}) = \mathbb{E}_{a^{-(i_1, \dots, i_m)} \sim \pi^{-(i_1, \dots, i_m)}}\left[Q_\pi(s, a^{(i_1, \dots, i_m)}, a^{-(i_1, \dots, i_m)})\right]$, which is the expected return for the action $a^{(i_1, \dots, i_m)}$ chosen by the set of agents $(i_1, \dots, i_m)$. The local advantage function is defined as $A_\pi(s, a^{-(i_1, \dots, i_m)}, a^{(i_1, \dots, i_m)}) = Q_\pi(s, a^{(i_1, \dots, i_m)}, a^{-(i_1, \dots, i_m)}) - Q_\pi(s, a^{-(i_1, \dots, i_m)})$. In addition, the notations $\bar{Q}$, $\bar{V}$, $\bar{\pi} = \langle \bar{\pi}^1, \dots, \bar{\pi}^n \rangle$ are used to represent the updated $Q$, $V$, and $\pi = \langle \pi^1, \dots, \pi^n \rangle$.
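The quantities defined above can be illustrated with a small numerical sketch (a toy aid to the definitions, not part of the formal setup):

```python
def discounted_return(rewards, gamma=0.99):
    """sum_t gamma^t * r_t: the return inside the expectations that
    define the value functions V and Q above."""
    g = 0.0
    for r in reversed(rewards):  # backward pass avoids explicit gamma**t powers
        g = r + gamma * g
    return g

def advantage(q_value, v_value):
    """A(s, a) = Q(s, a) - V(s): how much better taking action a in
    state s is than the policy's average behavior in that state."""
    return q_value - v_value
```

For instance, rewards `[1.0, 1.0]` with `gamma = 0.5` give a return of `1.5`, and an action with `Q = 2.0` in a state with `V = 1.5` has advantage `0.5`, i.e. it is better than the policy's average action there.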

