MULTI-AGENT TRUST REGION LEARNING

Abstract

Trust-region methods are widely used in single-agent reinforcement learning. One advantage is that they guarantee a lower bound on monotonic payoff improvement at each policy-optimization iteration. Nonetheless, when applied in multi-agent settings, such a guarantee is lost because an agent's payoff is also determined by other agents' adaptive behaviors. In fact, measuring agents' payoff improvements in multi-agent reinforcement learning (MARL) scenarios is still challenging. Although game-theoretic solution concepts such as the Nash equilibrium can be applied, the associated algorithms (e.g., Nash Q-learning) suffer from poor scalability beyond two-player discrete games. To mitigate the above measurability and tractability issues, in this paper we propose the Multi-Agent Trust Region Learning (MATRL) method. MATRL augments the single-agent trust-region optimization process with the multi-agent solution concept of a stable fixed point, computed at the level of a policy-space meta-game. When multiple agents learn simultaneously, stable fixed points at the meta-game level can effectively measure agents' payoff improvements and, importantly, a meta-game representation enjoys better scalability in multi-player games. We derive a lower bound on agents' payoff improvements for the MATRL method and prove its convergence to meta-game fixed points. We evaluate MATRL on both discrete and continuous multi-player general-sum games; the results suggest that MATRL significantly outperforms strong MARL baselines on grid worlds, multi-agent MuJoCo, and Atari games.

1. INTRODUCTION

Multi-agent systems (MAS) (Shoham & Leyton-Brown, 2008) have received much attention from the reinforcement learning community. In the real world, automated driving (Cao et al., 2012), StarCraft II (Vinyals et al., 2019), and Dota 2 (Berner et al., 2019) are a few examples of the myriad applications that can be modeled as MAS. Due to the complexity of multi-agent problems (Chatterjee et al., 2004), investigating whether agents can learn to behave effectively during interactions with environments and other agents is essential (Fudenberg et al., 1998). This can be achieved naively through independent learners (ILs) (Tan, 1993), which ignore the other agents and optimize their policies under the assumption of a stationary environment (Buşoniu et al., 2010; Hernandez-Leal et al., 2017). Owing to their theoretical guarantees and good empirical performance in real-world applications, ILs based on trust region methods (e.g., PPO (Schulman et al., 2015; 2017)) are popular (Vinyals et al., 2019; Berner et al., 2019). In single-agent learning, trust region methods can provide a monotonic payoff improvement guarantee (Kakade & Langford, 2002) via line search (Schulman et al., 2015). In multi-agent scenarios, however, an agent's improvement is affected by other agents' adaptive behaviors (i.e., the multi-agent environment is non-stationary (Hernandez-Leal et al., 2017)). As a result, trust region learners can measure policy improvements against other agents' current policies, but improvements against opponents' updated policies are unknown (shown in Fig. 1). Therefore, trust-region-based ILs perform less well in MAS than in single-agent tasks. Moreover, convergence to a fixed point, such as a Nash equilibrium (Nash et al., 1950; Bowling & Veloso, 2004; Mazumdar et al., 2020), is a common and widely accepted criterion for multi-agent learning.
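This failure mode can be seen in a minimal sketch (not part of the MATRL method itself, and all names below are illustrative): two independent gradient learners playing matching pennies, each ascending its own payoff gradient while treating the opponent's policy as fixed. The unique Nash equilibrium is the mixed strategy (0.5, 0.5), yet simultaneous independent updates spiral away from it.

```python
# Illustrative sketch: two independent gradient learners in matching pennies.
# Each agent treats the opponent's policy as fixed when updating, so the
# joint dynamics diverge from the unique Nash equilibrium (p, q) = (0.5, 0.5).

def u1(p, q):
    """Player 1's expected payoff in matching pennies.

    p and q are the probabilities that players 1 and 2 play heads.
    Player 1 gets +1 on a match and -1 otherwise; the game is zero-sum,
    so player 2's payoff is -u1(p, q).
    """
    return (2 * p - 1) * (2 * q - 1)

def independent_gradient_steps(p, q, lr=0.05, steps=60):
    """Simultaneous projected gradient ascent by two independent learners."""
    for _ in range(steps):
        grad_p = 4 * (q - 0.5)   # d u1 / d p, opponent held fixed
        grad_q = -4 * (p - 0.5)  # d (-u1) / d q, opponent held fixed
        p = min(max(p + lr * grad_p, 0.0), 1.0)  # project back onto [0, 1]
        q = min(max(q + lr * grad_q, 0.0), 1.0)
    return p, q

if __name__ == "__main__":
    p0, q0 = 0.7, 0.6
    p, q = independent_gradient_steps(p0, q0)
    d0 = ((p0 - 0.5) ** 2 + (q0 - 0.5) ** 2) ** 0.5
    d = ((p - 0.5) ** 2 + (q - 0.5) ** 2) ** 0.5
    print(f"distance from Nash equilibrium: {d0:.3f} -> {d:.3f}")
```

Each individual update is a best-response improvement against the opponent's current policy, yet the distance to the equilibrium grows over time, which is exactly the gap between per-step improvement and convergence discussed next.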
Thus, although independent learners can best respond to other agents' current policies, they lose their convergence guarantee (Laurent et al., 2011). One approach to addressing this convergence problem is Empirical Game-Theoretic Analysis (EGTA) (Wellman, 2006), which approximates best responses to the policies generated by independent learners (Lanctot et al., 2017; Muller et al., 2019). Although EGTA-based methods (Lanctot et al., 2017; Omidshafiei et al., 2019; Balduzzi et al., 2019) establish

