ADVANTAGE CONSTRAINED PROXIMAL POLICY OPTIMIZATION IN MULTI-AGENT REINFORCEMENT LEARNING

Abstract

We explore the combination of value-based methods and policy gradients in multi-agent reinforcement learning (MARL). In value-based MARL, the Individual-Global-Max (IGM) principle plays an important role: it maintains consistency between the joint and local action values. However, IGM is difficult to guarantee in multi-agent policy gradient methods due to stochastic exploration and conflicting gradient directions. In this paper, we propose a novel multi-agent policy gradient algorithm called Advantage Constrained Proximal Policy Optimization (ACPPO). Based on the multi-agent advantage decomposition lemma, ACPPO employs an advantage network for each agent to estimate the current local state-action advantage. Each agent's coefficient then constrains the joint-action advantage according to the consistency between the estimated joint-action advantage and the local advantage. Unlike previous policy-gradient-based MARL algorithms, ACPPO does not require an extra sampled baseline to reduce variance. We evaluate the proposed method on a continuous matrix game and Multi-Agent MuJoCo tasks. Results show that ACPPO outperforms baselines such as MAPPO, MADDPG, and HATRPO.

1. INTRODUCTION

Many complex sequential decision-making problems in the real world that involve multiple agents can be described as Multi-Agent Reinforcement Learning (MARL) problems, including autonomous vehicles (Chang et al., 2021), logistics management (Wang et al., 2017), and electric transportation (Chen & Wang, 2020). MARL in real-world environments usually requires systems with scalability and distributed execution capabilities (Shao et al., 2017). A centralized controller suffers from exponential growth of the action space as the number of agents increases (Hu et al., 2021). Based on parameter sharing, decentralized policies can reduce task complexity and enable scalable structures (Shao et al., 2019b). However, directly deploying single-agent reinforcement learning in MARL suffers from non-stationarity. To stabilize training, MARL introduces monotonic improvement from trust region methods and Centralized Training with Decentralized Execution (CTDE) from value-based methods. Trust region learning has played a major role in recent policy gradient methods (Shao et al., 2019a; Kakade, 2001). Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) and Proximal Policy Optimization (PPO) (Schulman et al., 2017) have achieved outstanding performance in single-agent reinforcement learning. The key to the effectiveness of trust region methods is their theoretically justified guarantee of monotonic performance improvement at each step. With a KL divergence constraint, parameters are updated within a trust region, which prevents overly aggressive gradient steps. Based on parameter sharing, a centralized critic, and PopArt (Hessel et al., 2019), MAPPO (Yu et al., 2021) achieves good performance in multi-agent environments. Unfortunately, the value function of MAPPO is affected by the exploration of other agents, making its convergence unstable. Kuba et al. (2021) introduce an optimal baseline for a more accurate estimate of the state value function in MAPPO. The optimal baseline relies on an estimated hypothetical joint action value, which may introduce estimation errors. Moreover, examples (Kuba et al., 2022) show that MAPPO does not guarantee consistent improvement even with correct gradients. To obtain the guarantee of monotonic improvement in MARL, Kuba
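As background for the trust-region mechanism discussed above, the following is a minimal NumPy sketch of the clipped surrogate objective that PPO uses in place of an explicit KL constraint; the function name and implementation are illustrative, not the authors' code:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate loss in the style of PPO (Schulman et al., 2017).

    ratio:     pi_new(a|s) / pi_old(a|s) for the sampled actions.
    advantage: estimated advantages A(s, a) for those actions.
    Returns the negated objective, suitable for gradient-descent minimization.
    """
    unclipped = ratio * advantage
    # Clipping the ratio to [1 - eps, 1 + eps] keeps the update inside a
    # trust region: once the ratio leaves that interval, the clipped term
    # stops rewarding further movement of the policy.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))

# With eps = 0.2, a ratio of 1.5 under a positive advantage is
# effectively capped at 1.2, while a ratio of 0.9 passes through.
loss = ppo_clip_loss(np.array([1.5, 0.9]), np.array([1.0, 1.0]))
```

Taking the elementwise minimum makes the bound pessimistic: the objective never credits the policy for moving outside the clip range, which is what replaces TRPO's hard KL-divergence constraint.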

