MULTI-AGENT TRUST REGION LEARNING

Abstract

Trust-region methods are widely used in single-agent reinforcement learning. One advantage is that they guarantee a lower bound on monotonic payoff improvement for policy optimization at each iteration. Nonetheless, when applied in multi-agent settings, this guarantee is lost because an agent's payoff is also determined by the other agents' adaptive behaviors. In fact, measuring agents' payoff improvements in multi-agent reinforcement learning (MARL) scenarios is still challenging. Although game-theoretic solution concepts such as the Nash equilibrium can be applied, algorithms such as Nash-Q learning suffer from poor scalability beyond two-player discrete games. To mitigate these measurability and tractability issues, in this paper we propose the Multi-Agent Trust Region Learning (MATRL) method. MATRL augments the single-agent trust-region optimization process with the multi-agent solution concept of a stable fixed point, computed at the level of a policy-space meta-game. When multiple agents learn simultaneously, stable fixed points at the meta-game level can effectively measure agents' payoff improvements, and, importantly, the meta-game representation scales better to multi-player games. We derive a lower bound on agents' payoff improvements for the MATRL method and also prove its convergence to meta-game fixed points. We evaluate MATRL on both discrete and continuous multi-player general-sum games; the results suggest that MATRL significantly outperforms strong MARL baselines on grid worlds, multi-agent MuJoCo, and Atari games.

1. INTRODUCTION

Multi-agent systems (MAS) (Shoham & Leyton-Brown, 2008) have received much attention from the reinforcement learning community. In the real world, automated driving (Cao et al., 2012), StarCraft II (Vinyals et al., 2019), and Dota 2 (Berner et al., 2019) are a few examples of the myriad applications that can be modeled as MAS. Due to the complexity of multi-agent problems (Chatterjee et al., 2004), investigating whether agents can learn to behave effectively during interactions with environments and other agents is essential (Fudenberg et al., 1998). This can be achieved naively through independent learners (ILs) (Tan, 1993), which ignore the other agents and optimize their policies assuming a stationary environment (Buşoniu et al., 2010; Hernandez-Leal et al., 2017). Due to their theoretical guarantees and good empirical performance in real-world applications, ILs based on trust region methods (e.g., PPO (Schulman et al., 2015; 2017)) are popular (Vinyals et al., 2019; Berner et al., 2019). In single-agent learning, trust region methods can produce a monotonic payoff improvement guarantee (Kakade & Langford, 2002) via line search (Schulman et al., 2015). In multi-agent scenarios, however, an agent's improvement is affected by the other agents' adaptive behaviors, i.e., the multi-agent environment is non-stationary (Hernandez-Leal et al., 2017). As a result, trust region learners can measure the policy improvements of the agents' current policies, but the improvements against the updated opponent policies are unknown (shown in Fig. 1). Therefore, trust region based ILs perform less well in MAS than in single-agent tasks. Moreover, convergence to a fixed point, such as a Nash equilibrium (Nash et al., 1950; Bowling & Veloso, 2004; Mazumdar et al., 2020), is a common and widely accepted solution concept for multi-agent learning. Thus, although independent learners can best respond to other agents' current policies, they lose their convergence guarantee (Laurent et al., 2011).
One solution to the convergence problem of independent learners is Empirical Game-Theoretic Analysis (EGTA) (Wellman, 2006), which approximates the best response to the policies generated by the independent learners (Lanctot et al., 2017; Muller et al., 2019). Although EGTA-based methods (Lanctot et al., 2017; Omidshafiei et al., 2019; Balduzzi et al., 2019) offer convergence guarantees in several game classes, the computational cost of empirically approximating and solving the ever-growing meta-game is large (Yang et al., 2019). Other multi-agent learning approaches collect or approximate additional information, such as communication (Foerster et al., 2016) or centralized joint critics (Lowe et al., 2017; Foerster et al., 2017; Sunehag et al., 2018; Rashid et al., 2018). Nevertheless, these methods usually require centralized parameters or centralized communication assumptions. Thus, there is considerable interest in finding a multi-agent learning algorithm that has minimal requirements and computational cost, like independent learners, while also improving convergence.

Figure 1: The relationship between the discounted returns $\eta_i$ of an agent $i$ under different joint policy pairs, where $\pi_i$ is the current policy and $\tilde{\pi}_i$ is the simultaneously updated policy. Given $\tilde{\pi}_i$, the monotonic improvement against a fixed opponent can easily be measured: $\eta_i(\tilde{\pi}_i, \pi_{-i}) \ge \eta_i(\pi_i, \pi_{-i})$. However, due to simultaneous learning, the improvement of $\eta_i(\tilde{\pi}_i, \tilde{\pi}_{-i})$ over $\eta_i(\pi_i, \pi_{-i})$ is unknown.

This paper presents the Multi-Agent Trust Region Learning (MATRL) algorithm, which augments trust-region ILs with a meta-game analysis to improve the stability and efficiency of learning. In MATRL, a trust region trial step toward an agent's payoff improvement is implemented by independent learners, which yields a predicted policy based on the current policy. Then, an empirical policy-space meta-game is constructed to compare the expected advantages of the predicted policies with those of the current policies. By solving the meta-game, MATRL finds a restricted step, aggregating the current and predicted policies using the meta-game Nash equilibrium. Finally, because the trust stable region (TSR) found in this way is not always strictly stable, MATRL lets each agent take the best response to the aggregated policies from the last step in order to explore.
MATRL is therefore able to provide a weakly stable solution compared with naive independent learners. Built on trust region independent learners, MATRL needs no extra parameters, simulations, or modifications to the independent learner itself. We provide insights into the empirical meta-game in Section 3.2, showing that an approximate Nash equilibrium of the meta-game is a weak stable fixed point of the underlying game. Our experiments demonstrate that MATRL significantly outperforms deep independent learners (Schulman et al., 2017) with the same hyper-parameters and the centralized VDN (Sunehag et al., 2018) and QMIX (Rashid et al., 2018) methods in discrete-action grid worlds, as well as centralized MADDPG (Lowe et al., 2017) in a continuous-action multi-agent MuJoCo task (de Witt et al., 2020) and zero-sum multi-agent Atari (Terry & Black, 2020).

2. PRELIMINARY

A Stochastic Game (Shapley, 1953; Littman, 1994) can be defined as $G = \langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}_i\}, \{R_i\}, \mathcal{P}, p_0, \gamma \rangle$, where $\mathcal{N}$ is the set of agents, $n = |\mathcal{N}|$ is the number of agents, and $\mathcal{S}$ denotes the state space. $\mathcal{A}_i$ is the action space of agent $i$, and $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_n = \mathcal{A}_i \times \mathcal{A}_{-i}$ is the joint action space, where for simplicity we use $-i$ to denote all agents except agent $i$. $R_i = R_i(s, a_i, a_{-i})$ is the reward function of agent $i \in \mathcal{N}$, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition function, $p_0$ is the initial state distribution, and $\gamma \in [0, 1]$ is a discount factor. Each agent $i \in \mathcal{N}$ has a stochastic policy $\pi_i(a_i \mid s): \mathcal{S} \times \mathcal{A}_i \to [0, 1]$ and aims to maximize its long-term discounted return
$$\eta_i(\pi_i, \pi_{-i}) = \mathbb{E}_{s^0, a^0_i, a^0_{-i}, \dots}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^t R_i(s^t, a^t_i, a^t_{-i})\Big],$$
where $s^0 \sim p_0$, $s^{t+1} \sim \mathcal{P}(s^{t+1} \mid s^t, a^t_i, a^t_{-i})$, and $a^t_i \sim \pi_i(a^t_i \mid s^t)$. We then have the standard definitions of the state-action value function $Q_i^{\pi_i, \pi_{-i}}(s^t, a^t_i, a^t_{-i}) = \mathbb{E}_{s^{t+1}, a^{t+1}_i, a^{t+1}_{-i}, \dots}\big[\sum_{l=0}^{\infty} \gamma^l R_i(s^{t+l}, a^{t+l}_i, a^{t+l}_{-i})\big]$, the value function $V_i^{\pi_i, \pi_{-i}}(s^t) = \mathbb{E}_{a^t_i, a^t_{-i}, s^{t+1}, \dots}\big[\sum_{l=0}^{\infty} \gamma^l R_i(s^{t+l}, a^{t+l}_i, a^{t+l}_{-i})\big]$, and the advantage function $A_i^{\pi_i, \pi_{-i}}(s^t, a^t_i, a^t_{-i}) = Q_i^{\pi_i, \pi_{-i}}(s^t, a^t_i, a^t_{-i}) - V_i^{\pi_i, \pi_{-i}}(s^t)$ given the state and joint action.
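To make these definitions concrete, the following sketch (our own illustration, not from the paper) evaluates $\eta_i$, $Q_i$, $V_i$, and $A_i$ in closed form for a single-state stochastic game (an iterated matrix game), where the value function reduces to the mean one-step reward divided by $1 - \gamma$:

```python
import numpy as np

# Single-state, two-agent stochastic game (an iterated matrix game).
# With only one state, V, Q, and A all have simple closed forms.
R1 = np.array([[1.0, -1.0], [-1.0, 1.0]])  # agent 1's rewards (matching pennies)
gamma = 0.9

pi1 = np.array([0.7, 0.3])  # agent 1's stochastic policy over its two actions
pi2 = np.array([0.5, 0.5])  # agent 2's policy

# V^{pi_1, pi_2}: expected one-step reward, discounted over an infinite horizon.
r_bar = pi1 @ R1 @ pi2
V1 = r_bar / (1.0 - gamma)      # equals eta_1(pi_1, pi_2) for this game

# Q(a_1, a_2) = immediate reward + discounted value of the (single) next state.
Q1 = R1 + gamma * V1

# Advantage: how much better each joint action is than the on-policy value.
A1 = Q1 - V1
print(V1, A1)
```

Because agent 2 mixes uniformly in matching pennies, agent 1's expected reward is zero regardless of its own policy, so here $V_1 = 0$ and the advantage table coincides with the reward table.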

3. MULTI-AGENT TRUST REGION POLICY OPTIMIZATION

A trust region algorithm aims to answer two questions: how to compute a trust region trial step, and whether a trial step should be accepted. In multi-agent learning, a trust region trial step toward agents' payoff improvement can easily be implemented with independent learners; we call the resulting independent payoff improvement area the Trust Payoff Region (TPR). The remaining issue is resolved by finding a restricted step leading to a stable point in the joint policy space, denoted the Trust Stable Region (TSR). In other words, multi-agent trust region learning (MATRL) decomposes trust region learning into two parts: first, find a trust payoff region between the current policy $\pi_i$ and the predicted policy $\tilde{\pi}_i$; then, with the help of the predicted policy, approximate a weak stable fixed point. Instead of line searching for a single-agent payoff improvement, MATRL searches the joint policy space for a weak stable fixed point (see Fig. 2).

Figure 2: Comparison between an independent trust region learner and a multi-agent trust region learner. $\pi_i, \pi_{-i}$ are the two agents' current policies; $\tilde{\pi}_i, \tilde{\pi}_{-i}$ are the predicted policies within the TPR; $(\pi^*_i, \pi^*_{-i})$ forms a Nash equilibrium; $\pi'_i$ and $\pi'_{-i}$ are the best responses to the weak stable fixed point $(\hat{\pi}_i, \hat{\pi}_{-i})$. (a) Independent trust region learning: agent $i$ only considers its own policy improvement against the fixed opponent policy $\pi_{-i}$. (b) Multi-agent trust region learning: the agents' policy improvement is explored in the joint policy space $(\pi_i, \pi_{-i})$ toward a stable region.
Essentially, MATRL is a simple extension of single-agent TRPO to MAS in which independent learners learn to find a stable point between the current policy and the predicted policy. To solve the TSR, we assume knowledge of the other agents' policies during training in order to find weak stable points via empirical meta-game analysis, while execution can still be fully decentralized. We explain every step of MATRL in detail in the following sections.
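The two-part decomposition above can be summarized as a sketch of one MATRL iteration. All function names and signatures here are illustrative placeholders supplied by the caller, not the authors' implementation:

```python
# Hedged sketch of one MATRL iteration. The four callables (trust-region
# update, meta-game payoff estimator, Nash solver, best response) are
# assumed to be provided by the user; policies are treated as mixable
# objects (e.g., tabular action distributions).

def matrl_iteration(policies, trust_region_update, estimate_meta_payoffs,
                    nash_solver, best_response):
    n = len(policies)

    # 1. TPR: each agent independently takes a trust-region trial step
    #    against the other agents' *current* policies.
    predicted = [trust_region_update(i, policies) for i in range(n)]

    # 2. TSR: build the empirical meta-game between current and predicted
    #    policies and solve it for a mixed Nash equilibrium.
    meta_game = estimate_meta_payoffs(policies, predicted)
    rho = nash_solver(meta_game)  # rho[i] = Nash weight on agent i's predicted policy

    # 3. Aggregate: a linear mixture delimits each agent's trust stable region.
    aggregated = [rho[i] * predicted[i] + (1 - rho[i]) * policies[i]
                  for i in range(n)]

    # 4. Explore: best-respond to the aggregated weak stable fixed point.
    return [best_response(i, aggregated) for i in range(n)]
```

With scalar "policies" and trivial callables, one iteration simply averages the current and trial steps before best-responding, which mirrors the four steps described in the text.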

3.1. INDEPENDENT TRUST PAYOFF IMPROVEMENT

Single-agent reinforcement learning algorithms can be straightforwardly applied to multi-agent learning under the assumption that all agents behave independently (Tan, 1993). In this section, we choose a policy-based reinforcement learning method as the independent learner. In multi-agent games, the environment becomes a Markov decision process for agent $i$ when each of the other agents plays according to a fixed policy. We let agent $i$ make a monotonic improvement against the fixed opponent policies. Thus, at each iteration, the policy is updated by maximizing the utility function $\eta_i$ over a local neighborhood of the current joint policy $(\pi_i, \pi_{-i})$: $\tilde{\pi}_i = \arg\max_{\pi'_i \in \Pi_i} \eta_i(\pi'_i, \pi_{-i})$, based on trajectories sampled from $(\pi_i, \pi_{-i})$. We can adopt trust region policy optimization (e.g., PPO (Schulman et al., 2017)), which constrains the step size of the policy update:
$$\tilde{\pi}_i = \arg\max_{\pi'_i \in \Pi_{\theta_i}} \eta_i(\pi'_i, \pi_{-i}) \quad \text{s.t. } D(\pi_i, \pi'_i) \le \delta_i,$$
where $D$ is a distance measure and $\delta_i$ is a constraint. Independent trust region learners produce a monotonically improved policy $\tilde{\pi}_i$ that guarantees $\eta_i(\tilde{\pi}_i, \pi_{-i}) \ge \eta_i(\pi_i, \pi_{-i})$ and gives a trust payoff region bounded by $\tilde{\pi}_i$. However, because the policies improve simultaneously without awareness of the other agents, the single-agent lower bound on payoff improvement (Schulman et al., 2015) no longer holds for multi-agent payoff improvement. Following a similar proof procedure, we can obtain a precise lower bound for a multi-agent simultaneous trust payoff region in Theorem 1.

Theorem 1 (Independent Trust Payoff Region). Denote the expected advantage gain when $(\pi_i, \pi_{-i}) \to (\tilde{\pi}_i, \tilde{\pi}_{-i})$ as
$$g_i^{\pi_i, \pi_{-i}}(\tilde{\pi}_i, \tilde{\pi}_{-i}) := \sum_s p^{\pi_i, \pi_{-i}}(s) \sum_{a_i} \tilde{\pi}_i(a_i \mid s) \sum_{a_{-i}} \tilde{\pi}_{-i}(a_{-i} \mid s)\, A_i^{\pi_i, \pi_{-i}}(s, a_i, a_{-i}). \tag{3}$$
Let $\alpha_i = D^{\max}_{TV}(\pi_i, \tilde{\pi}_i) = \max_s D_{TV}(\pi_i(\cdot \mid s) \,\|\, \tilde{\pi}_i(\cdot \mid s))$ for agent $i$, where $D_{TV}$ is the total variation divergence (Schulman et al., 2015). Then, the following lower bound holds for multi-agent independent trust region optimization:
$$\eta_i(\tilde{\pi}_i, \tilde{\pi}_{-i}) - \eta_i(\pi_i, \pi_{-i}) \ge g_i^{\pi_i, \pi_{-i}}(\tilde{\pi}_i, \tilde{\pi}_{-i}) - \frac{4\gamma \epsilon_i}{(1 - \gamma)^2}\,(\alpha_i + \alpha_{-i} - \alpha_i \alpha_{-i})^2, \tag{4}$$
where $\epsilon_i = \max_{s, a_i, a_{-i}} |A_i^{\pi_i, \pi_{-i}}(s, a_i, a_{-i})|$.

Proof. See Appendix B.

Based on the independent trust payoff improvement, the predicted policy $\tilde{\pi}_i$ guides us in determining the step size of the TPR, but the stability of $(\tilde{\pi}_i, \tilde{\pi}_{-i})$ is still unknown. As shown in Theorem 1, an agent's penalty is roughly $O(4\alpha^2)$, four times larger than the single-agent trust region penalty of $O(\alpha^2)$ (Kakade & Langford, 2002). Furthermore, $\epsilon_i = \max_{s, a_i, a_{-i}} |A_i^{\pi_i, \pi_{-i}}(s, a_i, a_{-i})|$ depends on the other agents' actions $a_{-i}$ and can be very large when agents have conflicting interests. Therefore, the most critical issue underlying multi-agent trust region learning is to find a TSR after the TPR. In the next section, we illustrate how to search for a weak stable fixed point within the TPR, based on a policy-space meta-game analysis.
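As a quick numerical illustration of the bound (our own example with made-up tabular policies, not from the paper), the multi-agent penalty $\frac{4\gamma\epsilon_i}{(1-\gamma)^2}(\alpha_i + \alpha_{-i} - \alpha_i\alpha_{-i})^2$ can be compared with the single-agent penalty $\frac{4\gamma\epsilon_i}{(1-\gamma)^2}\alpha_i^2$:

```python
import numpy as np

# Illustrative comparison of the Theorem 1 penalty with the single-agent
# TRPO penalty when both agents move. Policies, gamma, and eps_i are
# assumed values for the sake of the example.
gamma, eps_i = 0.9, 1.0  # eps_i = max |A_i|, assumed known here

def max_tv(pi_old, pi_new):
    """alpha = max_s D_TV(pi_old(.|s) || pi_new(.|s)) for tabular policies."""
    return 0.5 * np.abs(pi_old - pi_new).sum(axis=1).max()

pi_i_old = np.array([[0.7, 0.3], [0.5, 0.5]])  # |S| = 2 states, 2 actions
pi_i_new = np.array([[0.8, 0.2], [0.5, 0.5]])
alpha_i = max_tv(pi_i_old, pi_i_new)           # = 0.1
alpha_j = 0.1                                  # opponent's step size (assumed)

coef = 4 * gamma * eps_i / (1 - gamma) ** 2
single_agent_penalty = coef * alpha_i ** 2
multi_agent_penalty = coef * (alpha_i + alpha_j - alpha_i * alpha_j) ** 2
print(single_agent_penalty, multi_agent_penalty)
```

With $\alpha_i = \alpha_{-i} = 0.1$, the joint penalty is about $3.6\times$ the single-agent one, illustrating why simultaneous updates loosen the guarantee.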

3.2. APPROXIMATING WEAK STABLE FIXED POINT

In multi-agent trust region learning, the TSR is an essential component. Since each iteration of MATRL requires solving the TPR and TSR sub-problems, an efficient solver for the trust stable region sub-problem is important. Instead of using stable fixed points (Balduzzi et al., 2018) for the TSR, we adopt the weak stable fixed point of Definition 1, which is much easier to find. To maximize the objective in Eq. 1, we only ask that reasonable algorithms avoid all strict minima (a.k.a. unstable fixed points); this imposes only that agents are well-behaved with respect to strict minima, even if their individual behavior is not self-interested (Letcher, 2020). We say a point is in the TSR if it is a weak stable fixed point.

Definition 1 (Weak Stable Fixed Point in the Restricted Underlying Game). Consider a restricted underlying game in which each agent's policy space is restricted to Π̄_i = [π_i, π̂_i] ⊆ Π_i. Denote the simultaneous gradient of the restricted underlying game by ξ = (∇_{π_i} g_i, ∇_{π_{-i}} g_{-i}) and its Hessian by H = ∇ξ. We call (π̃_i, π̃_{-i}) a fixed point if ξ(π̃_i, π̃_{-i}) = 0. We then say (π̃_i, π̃_{-i}) is a weak stable fixed point if H(π̃_i, π̃_{-i}) ≯ 0 (i.e., H is not positive definite), which rules out unstable fixed points (strict minima). A trust stable region built from weak stable fixed points is reasonable if it converges only to fixed points and avoids unstable fixed points almost surely.

Given that the TPR already produces a predicted policy, and with knowledge of all agents' policies, it is natural to conduct an empirical game-theoretic analysis (EGTA) (Tuyls et al., 2018) to search for a weak stable fixed point in the area bounded by the current policy pair and the predicted policy pair. We therefore define a meta-game in which each agent i has only two strategies, π_i and π̂_i:

M(π_i, π̂_i, π_{-i}, π̂_{-i}) =
    [ (g_i^{i,-i},  g_{-i}^{i,-i})     (g_i^{i,-î},  g_{-i}^{i,-î})  ]
    [ (g_i^{î,-i},  g_{-i}^{î,-i})     (g_i^{î,-î},  g_{-i}^{î,-î})  ],

where g_i^{î,-î} = g_i^{π_i,π_{-i}}(π̂_i, π̂_{-i}) (as defined in Eq. 3) is an empirical payoff entry of the meta-game; note that g_i^{i,-i} = 0, since a policy has zero expected advantage over itself. Compared with using η_i(π̂_i, π̂_{-i}) = η_i(π_i, π_{-i}) + g_i^{î,-î} as the meta-game payoff, g_i^{î,-î} has lower variance and is easier to approximate, because η_i(π_i, π_{-i}) is a constant baseline. However, most entries of M are unknown, and estimating payoff entries (e.g., g_i^{î,-î}) in EGTA usually requires many extra simulations. Instead, we reuse the trajectories from the TPR step to approximate g_i^{î,-î}, ignoring the small change in state visitation density caused by π_i → π̂_i (Schulman et al., 2015). Taking the two-agent case as an example, as shown in Eq. 5, the meta-game M becomes a 2 × 2 matrix-form game, which is far smaller than the whole underlying game. We can therefore use existing Nash solvers for matrix-form games (e.g., CMA-ES (Hansen et al., 2003)) to compute a mixed Nash equilibrium (ρ_i, ρ_{-i}) = NashSolver(M) of the meta-game M, where ρ_i, ρ_{-i} ∈ [0, 1]; the mixed Nash equilibrium of the meta-game is also an approximate equilibrium of the restricted underlying game (Tuyls et al., 2020). The trust stable region policies π̃_i, π̃_{-i} can then be aggregated from each agent i's current policy π_i and the predicted policy π̂_i from the TPR. In the TPR step, we require that ILs enjoy monotonic improvement against fixed opponent policies, where the change from π_i to π̂_i is constrained by a small step size. It is therefore reasonable to assume a continuous and monotonic change in the restricted policy space between π_i and π̂_i. In this case, with ρ_i being agent i's Nash equilibrium strategy in the meta-game, π̃_i can be derived via the linear mixture π̃_i = ρ_i π_i + (1 − ρ_i) π̂_i, which delimits agent i's trust stable region. We prove in Theorem 2 that (π̃_i, π̃_{-i}) is a weak stable fixed point of the underlying game.
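The meta-game step above can be sketched in a few lines of Python. The 2 × 2 bimatrix solver below uses the standard indifference conditions rather than CMA-ES, and the function names and plain-list parameter vectors are illustrative assumptions, not the paper's implementation:

```python
def mixed_nash_2x2(A, B):
    """One Nash equilibrium of a 2x2 bimatrix game.

    A[r][c] is the row player's payoff, B[r][c] the column player's.
    Returns (p, q): the probability each player puts on strategy 0
    (here strategy 0 = current policy pi_i, strategy 1 = predicted pi_hat_i).
    """
    # Check the four pure-strategy profiles first: both players best respond.
    for r in (0, 1):
        for c in (0, 1):
            if A[r][c] >= A[1 - r][c] and B[r][c] >= B[r][1 - c]:
                return (1.0 - r, 1.0 - c)
    # Otherwise solve the interior indifference conditions:
    # the column mix q makes the row player indifferent, and vice versa.
    q = (A[1][1] - A[0][1]) / (A[0][0] - A[0][1] - A[1][0] + A[1][1])
    p = (B[1][1] - B[1][0]) / (B[0][0] - B[0][1] - B[1][0] + B[1][1])
    return (p, q)


def aggregate(current, predicted, rho):
    """Trust stable region policy: linear mixture rho * pi + (1 - rho) * pi_hat."""
    return [rho * c + (1.0 - rho) * h for c, h in zip(current, predicted)]
```

On matching pennies (A = [[1, -1], [-1, 1]], B = -A) the solver returns ρ_i = ρ_{-i} = 0.5, so the aggregated policy π̃_i sits halfway between π_i and π̂_i.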

[Figure: Overview of MATRL. Panels: 1. Independent Policy Improvement (the current policies π_i, π_{-i} are improved to the predicted policies π̂_i, π̂_{-i} via π'_i = arg max_{π_i} η_i(π_i, π_{-i})); 2. Construct the Meta-Game M, solve its Nash equilibrium (ρ_i, ρ_{-i}) = NashSolver(M), and Policy Aggregation π̃_i = ρ_i π_i + (1 − ρ_i) π̂_i; finally, the best response π'_i = arg max_{π_i} η_i(π_i, π̃_{-i}).]

Algorithm 1 Multi-Agent Trust Region Learning
Input: Initial policies π_i for each agent i.
1: for k ∈ {0, 1, 2, · · · } do
2:     Collect trajectories using the current policies π_i, π_{-i}.

3:     For each i: compute a trust payoff region policy π̂_i using Eq. 2.  ▷ Trust Payoff Region.
4:     Solve the meta-game M(π_i, π̂_i, π_{-i}, π̂_{-i}) and obtain a meta-game Nash (ρ_i, ρ_{-i}).
5:     Compute the weak stable fixed point (π̃_i, π̃_{-i}).  ▷ Trust Stable Region.
6:     For each i: compute the best response π'_i using Eq. 6.  ▷ Best Response to Fixed Point.
7:     π_i ← π'_i, π_{-i} ← π'_{-i}.
8: end for

Theorem 2 (Existence of Weak Stable Fixed Point). Consider the restricted underlying game whose policy space is bounded to the linear continuous policy space [π_i, π̂_i], where π̂_i is monotonically improved from π_i within the TPR. If (ρ_i, ρ_{-i}) is a Nash equilibrium of the meta-game M, then the linear mixture joint policy (π̃_i, π̃_{-i}) is a weak stable fixed point of the restricted underlying game.

Proof. See Appendix C.
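One iteration of Algorithm 1 can be summarized as a thin Python skeleton. All helper names below (tpr_update, build_meta_game, nash_solver, best_response) are hypothetical placeholders for the paper's sub-procedures, not its actual code, and policies are modeled as parameter values supporting arithmetic:

```python
def matrl_step(policies, tpr_update, build_meta_game, nash_solver, best_response):
    """One iteration of Algorithm 1 for n agents (hypothetical helper names).

    policies:        list of per-agent policy parameters.
    tpr_update:      (i, policies) -> predicted policy pi_hat_i      [line 3]
    build_meta_game: (policies, predicted) -> meta-game payoffs M    [line 4]
    nash_solver:     M -> list of rho_i in [0, 1]                    [line 4]
    best_response:   (i, stable) -> next policy pi'_i                [line 6]
    """
    n = len(policies)
    # Line 3: independent trust-region improvement against fixed opponents.
    predicted = [tpr_update(i, policies) for i in range(n)]
    # Lines 4-5: meta-game Nash, then aggregate to the weak stable fixed point.
    rho = nash_solver(build_meta_game(policies, predicted))
    stable = [r * cur + (1.0 - r) * pre
              for r, cur, pre in zip(rho, policies, predicted)]
    # Line 6: best respond to the weak stable fixed point.
    return [best_response(i, stable) for i in range(n)]
```

With dummy callables (e.g., a TPR step that shifts each parameter by a constant and a solver that always returns ρ_i = 0.5), the skeleton reproduces the mixture-then-respond structure of the algorithm.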

According to Theorem 2, (π̃_i, π̃_{-i}) is a weak stable fixed point of the restricted underlying game. Although the weak stable fixed point is a relatively weak notion compared to the stable fixed points of Balduzzi et al. (2018), it is, as stated above, a reasonable requirement (weaker than rationality) for an algorithm to avoid strict minima. Furthermore, weak stable fixed points suit general game settings. As shown in Appendix C, in cooperative, competitive, and general-sum games, the fixed point found by the meta-game analysis can be either a stable point or a saddle point. Similarly, a local Nash equilibrium can be stable or a saddle in different differential games (Mazumdar et al., 2020). The usefulness of a stability concept therefore depends on the specific setting. Under additional assumptions on the game class, we could obtain stronger fixed-point types; this, however, comes at the cost of extra computation or assumptions that may break the most general settings. Besides, when the meta-game has multiple Nash equilibria, our method selects one at random. Some equilibria may produce a more stable fixed point; we leave the equilibrium selection problem for future work.

Extra Cost for Approximating and Solving the Meta-Game. There are two major sources of cost in common meta-game analysis: approximating and solving the meta-game (Muller et al., 2019). In our case, the meta-game is restricted to a local two-action game whose two actions π_i and π̂_i are close to each other. This proximity reduces the meta-game approximation cost (no extra sampling) by reusing the trajectories collected in the TPR step (Tuyls et al., 2020). The next crucial problem is how to solve the n-agent two-action meta-game, which consists of 2^n entries in each of the n payoff matrices. This is much simpler than solving the whole underlying game, whose size grows exponentially with the state space, action space, number of agents, and time horizon. Since general-sum matrix-form games admit no fully polynomial-time approximation scheme for computing Nash equilibria (Chen et al., 2006), solving the game exactly is usually expensive (Daskalakis et al., 2009). If we only require an approximate Nash equilibrium, then for small n (e.g., n ≤ 10) it is affordable to find a meta-game Nash equilibrium in sub-exponential time (Lipton et al., 2003). For large n, the problem persists; in that case we could use mean-field approximation (Yang et al., 2018) or exploit special payoff structures in the meta-game (e.g., graphical games (Littman et al., 2002; Daskalakis et al., 2009), which are polynomial-time solvable) to reduce the computational complexity.
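The 2^n count can be made concrete by enumerating the joint strategy profiles of the n-agent two-action meta-game (a purely illustrative sketch):

```python
from itertools import product

def meta_game_profiles(n):
    """All joint strategy profiles of an n-agent two-action meta-game.

    Each agent picks 0 (current policy pi_i) or 1 (predicted policy pi_hat_i),
    so each of the n payoff tables holds 2**n estimated-advantage entries:
    exponential in n, but independent of state space, action space, and horizon.
    """
    return list(product((0, 1), repeat=n))
```

For n = 10 this is 1024 entries per payoff table, which is why an approximate Nash of the meta-game remains affordable for small n.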

3.3. IMPROVEMENT AGAINST WEAK STABLE FIXED POINT

Although the weak stable fixed point (π̃_i, π̃_{-i}) constrains the policy update to a fixed point, undesired saddle points may remain according to Theorem 1. It is difficult to generalize to the parts of the policy space not reached from these saddle points, especially in anti-coordination games (Lanctot et al., 2017). Similar to the extra-gradient method (Mertikopoulos et al., 2018), to escape saddle points we apply the best response against the weak stable fixed point: π'_i = arg max_{π_i} η_i(π_i, π̃_{-i}). Performing this best response would normally require another round of experience collection before taking a gradient step on Eq. 6. In practice, however, since we already have the trajectories from the TPR step, the best response to the weak stable fixed point can be estimated through importance sampling. Defining the truncated importance sampling weight c_{-i} ≜ min(1 + c, max(1 − c, π̃_{-i}(a_{-i}|s) / π_{-i}(a_{-i}|s))), we can rewrite the best response update of Eq. 6 in an equivalent form in terms of expectations: π'_i = arg max_{π_i} E_{a_{-i}∼π_{-i}}[c_{-i} η_i(π_i, π_{-i})].

Connections to Existing Methods. MATRL generalizes many existing methods through the best response. In the extreme case where the meta-game Nash is (ρ_i, ρ_{-i}) = (1, 1), the Nash-aggregated policies always keep the current policies, and MATRL degenerates to independent learners: each agent best responds to the other agents' current policies, π'_i = arg max_{π_i} η_i(π_i, π_{-i}), following Eq. 6. The policy prediction (Zhang & Lesser, 2010; Foerster et al., 2018; Letcher et al., 2018), extra-gradient (Antipin, 2003), and exploitability descent (Tang et al., 2018; Lockhart et al., 2019) methods are also special instances of MATRL, obtained when the meta-game Nash is (ρ_i, ρ_{-i}) = (0, 0): each agent best responds to the most aggressive predicted policy π̂_{-i}, i.e., π'_i = arg max_{π_i} η_i(π_i, π̂_{-i}).

Global Convergence. MATRL is a gradient-based algorithm with a best response to policies within the TSR, and is essentially a variant of LookAhead methods (e.g., LOLA (Foerster et al., 2018), SOS (Letcher et al., 2018), and IGA-PP (Zhang & Lesser, 2010)). More specifically, MATRL enhances the classic LookAhead method with variable step size scaling (Bowling & Veloso, 2002) or a two time-scale update rule (Heusel et al., 2017) at each TSR step, controlled by the restricted meta-game analysis. It has been proven that the LookAhead method converges locally and avoids strict saddles in all differentiable games (Letcher et al., 2018), and that it enjoys better convergence with variable step size scaling (Song et al., 2019). The convergence analysis of gradient-based algorithms is usually based on fixed-point iterations and dynamical systems; note that, for convergence, the fixed-point iterations are taken over the whole learning process, whereas the meta-game analysis step in MATRL borrows fixed-point concepts only to show that it reasonably avoids unstable fixed points. Unlike LOLA, which uses a first-order Taylor expansion to estimate the best response to a predicted policy, we design the look-ahead step within the TSR and take best-response gradient steps toward the TSR. We also show in the experiments that MATRL empirically outperforms the typical LookAhead method, the independent learner with policy prediction (IL-PP). In summary, the learning of independent trust region learners in MATRL is constrained by a weak stable fixed point. By analyzing the much simpler meta-game, we can approximate this weak stable fixed point without extra rollouts or simulation. Although MATRL's training is centralized, its execution is fully decentralized, and it requires neither extra centralized parameters nor higher-order gradient computation. Fig. 3 shows an overview of MATRL, and Algo. 1 gives its pseudocode, which is compatible with any policy-based independent learner.
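The truncated importance sampling weight described above is a one-liner; the clip constant c = 0.2 below is an assumed PPO-style default, not a setting from the paper:

```python
def truncated_is_weight(p_target, p_behavior, c=0.2):
    """Truncated importance weight c_{-i}: the likelihood ratio of the
    weak-stable-fixed-point policy to the behavior policy, clipped to
    [1 - c, 1 + c] so the reweighted advantage estimate stays low-variance."""
    return min(1.0 + c, max(1.0 - c, p_target / p_behavior))
```

This reweights advantages estimated under the behavior policy π_{-i} toward the target π̃_{-i} without any new rollouts, at the cost of a small bias introduced by the truncation.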

4. RELATED WORK

The study of gradient-based methods in multi-agent learning is quite extensive (Mazumdar et al., 2020; Buşoniu et al., 2010). Work on learning in games has mostly focused on adjusting the step size, attempting to achieve convergence with multiple-timescales learning schemes (Leslie & Collins, 2005; Leslie et al., 2003; Bowling & Veloso, 2002). Balduzzi et al. (2018); Mazumdar et al. (2019); Letcher et al. (2018) utilize second-order methods to shape the step size; however, the computational cost of second-order methods is prohibitive in many cases. MATRL instead approximates the second-order fixed-point information via a small meta-game, at far lower cost than real Hessian computation. An alternative line of work augments gradient-based algorithms with a best response to predicted policies (Antipin, 2003; Zhang & Lesser, 2010; Lin et al., 2020; Foerster et al., 2018; Tang et al., 2018; Lockhart et al., 2019), but these methods have coordination problems.

5. EXPERIMENTS

MATRL also outperforms MADDPG (Lowe et al., 2017) on continuous multi-agent MuJoCo games. Besides, we test the algorithms on the two-agent Pong Atari game to investigate whether MATRL can mitigate unstable cyclic behaviors (Balduzzi et al., 2019) in zero-sum games. In these tasks, MATRL uses the same PPO configurations as the ILs, to isolate the effect of the trust region gradient-update mechanism, and we use official implementations for the other baselines. The step-by-step PPO-based MATRL algorithm is given in Algo. 2. Finally, ablation studies are conducted by: 1. removing the best response, called MATRL w/o BR; 2. skipping the trust-stable-region estimation, named IL-PP, which follows similar procedures to LOLA (Foerster et al., 2018) and IGA-PP (Zhang & Lesser, 2010); those methods approximate the best response to the predicted policies via Taylor expansion, whereas IL-PP takes best-response gradient steps toward the predicted policies.
These configurations provide insight into how much the trust stable region and the best response contribute to MATRL's performance, if at all. We also provide more environment details and extra experimental results, including 4-agent Ant (multi-agent MuJoCo), in Appendices D and E, together with the detailed experiment settings and hyper-parameters used for the algorithms. The code and experiment scripts are also anonymously available at https://github.com/matrl-project/matrl.
Matrix Game and Random 2 × 2 Matrix Games. To illustrate the effectiveness of MATRL, we conduct experiments on the well-known zero-sum matching pennies (MP) game (Bruns, 2015) and devise random 2 × 2 matrix games. Using IGA (Singh et al., 2000) as the ILs of MATRL, the learning dynamics of MATRL on MP are shown in Fig. 4, where the dark blue arrow is the trust payoff direction and the pale blue area is the TSR. MATRL reaches the Nash equilibrium (the central red star) by updating the policies under the constraints of the trust stable region (the pale blue area); it would be trapped in a cyclic loop if it followed the original trust payoff direction (the dark blue arrow). To examine MATRL on broader matrix games, we randomly generate three thousand 2 × 2 games of three types: coordination, anti-coordination, and cyclic (Pangallo et al., 2017). More details about the game generation are provided in Section D. We choose IGA and IGA-PP (Zhang & Lesser, 2010) as baselines, and the results in Table 1 show that MATRL has a higher convergence rate and needs fewer steps to converge in all types of games.
Grid World Checker and Switch. We evaluate MATRL in two grid-world games from MA-Gym (Koul, 2019): two-agent checker and four-agent switch. They are similar to the games in Sunehag et al. (2018), but with more agents, to examine whether MATRL can handle games with more than two agents.
In the checker game, two agents cooperate to collect fruits on the map; the sensitive agent gets 5 for an apple and -5 for a lemon, while the other gets 1 and -1, respectively. The optimal solution is therefore to let the sensitive agent collect the apples and the less sensitive one collect the lemons. In the four-agent switch game, two rooms are connected by a corridor; each room holds two agents, and the four agents try to pass through the single corridor to reach targets in the opposite room. Only one agent can occupy the corridor at a time, and agents get -0.1 for each step and 5 for reaching a target, so they need to cooperate to obtain optimal scores. In both games, the agents can move in four directions and only partially observe their own positions. Although our formulation uses a fully observable setting, in these games the methods adapt to partial observability by treating the observation as the state. We compare MATRL with the PPO-based IL and three off-policy centralized-training decentralized-execution baselines: VDN (Sunehag et al., 2018), QTRAN (Son et al., 2019), and QMIX (Rashid et al., 2018). Results are given in Fig. 5a and 5b, where MATRL improves stably and outperforms the other baselines. In two-player checker, using the best response, our method achieves a total reward of 18, while the independent learners' reward stays at -2. Moreover, although PPO-based MATRL uses on-policy learning, it achieves better final results in fewer time steps than the off-policy baselines. As for four-player switch, as shown in Fig. 5b, MATRL continuously improves the total reward to 6.5, the closest to the optimal score for this game among the compared baselines. The result in the four-agent switch also demonstrates the effectiveness of MATRL in guaranteeing stable policy improvement in games with more than two agents.
Multi-Agent MuJoCo Hopper. We also examine MATRL in a multi-agent continuous control task, the three-agent hopper from de Witt et al. (2020). Here, three agents cooperatively control the parts of a hopper to move forward. The agents are rewarded based on the distance and the number of steps they take before falling. Fig. 5c shows that MATRL significantly outperforms IL, MADDPG, and the benchmarks in de Witt et al. (2020) within the same amount of time.
Multi-Agent Pong Atari Game. In the 2-agent pong experiments, we use raw pixels as observations and train the MATRL and IL agents independently. Following training, we compare the models' pair-wise performance by pitting ten checkpoints of each against one another and recording the average scores. We report the results in Fig. 6, which shows that MATRL outperforms IL in the MATRL vs. IL setting for most policy pairs. Moreover, from the MATRL vs. MATRL and IL vs. IL results, we can see that MATRL has a more transitive learning process than IL, which means MATRL can mitigate the common cyclic behaviors in zero-sum games.
Figure 6: MATRL/IL versus MATRL/IL in the two-agent pong game. For each setting, the grids show pair-wise performance (average scores) obtained by pitting ten checkpoints against one another; yellow means a higher score.
Effect and Cost of Trust Stable Region and Best Response to Fixed Point. This section analyzes the effect of the TSR obtained from the meta-game Nash and of the best response against the weak stable fixed point. The ablation settings are obtained by removing the trust stable region (IL-PP) and the best response (MATRL w/o BR). In Fig. 5, we observe that in all tasks, without the best response to the fixed point, the learning curves of MATRL w/o BR have higher variance and the lowest final scores. This establishes the importance of the best response in stabilizing and improving agents' performance, and empirically shows that MATRL converges better than the other baselines. Likewise, without the TSR to select a fixed point, MATRL reduces to independent learners with policy prediction (IL-PP) (Zhang & Lesser, 2010; Foerster et al., 2018); the IL-PP curves have lower final scores and slower convergence than MATRL, which suggests that the TSR provides benefits. MATRL w/o BR has lower variance than IL-PP, which shows that the trust stable region can stabilize learning via the weak-stable-fixed-point constraints. Finally, compared to IL and IL-PP, as shown in Fig. 7, in 2-4 agent games with 20,000 environment steps and 50 gradient steps, the training time of MATRL is empirically about 1.1-1.2 times slower.
We think this extra computational cost from the TSR and the best response is acceptable given the performance improvement brought by these operations.
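The cyclic dynamics described for matching pennies are easy to reproduce. Below is a minimal sketch (our illustration, not the paper's code) of independent gradient ascent (IGA) on matching pennies: without a stability constraint such as the TSR, the joint policy orbits the mixed equilibrium instead of converging.

```python
import numpy as np

def iga_matching_pennies(x, y, lr=0.01, steps=2000):
    """Simultaneous (independent) gradient ascent on matching pennies.

    x, y: each player's probability of playing Heads. Player 1's payoff is
    u1 = (2x-1)(2y-1) and player 2's is u2 = -u1; each player ascends its
    own payoff, and the policies are clipped to [0, 1].
    """
    traj = [(x, y)]
    for _ in range(steps):
        gx = 2 * (2 * y - 1)    # d u1 / d x
        gy = -2 * (2 * x - 1)   # d u2 / d y
        x = float(np.clip(x + lr * gx, 0.0, 1.0))
        y = float(np.clip(y + lr * gy, 0.0, 1.0))
        traj.append((x, y))
    return np.array(traj)

# Same initial policies as in the matrix-game experiments (0.9 and 0.2).
traj = iga_matching_pennies(0.9, 0.2)
dist = np.linalg.norm(traj - 0.5, axis=1)
print(dist[0], dist[-1])  # the distance to the mixed NE (0.5, 0.5) stays large
```

Starting from the same initial policies as the matrix-game experiments, the distance to the equilibrium never decays; this is exactly the cycling behavior the TSR constraint is meant to suppress.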

6. CONCLUSION

We proposed and analyzed a trust region method for multi-agent learning problems, which considers both the trust payoff region and the trust stable region to meet multi-agent learning objectives. In practice, building on independent trust payoff learners, we provide a convenient way to approximate a further restricted step size within the TSR via the meta-game. This makes MATRL general, flexible, and easy to implement for multi-agent learning problems. Our experimental results show that the MATRL method significantly outperforms independent learners with the same configurations, as well as other strong MARL baselines, on both continuous and discrete games with various numbers of agents.

A MATRL ALGORITHM BASED ON PPO

1: for k ∈ {0, 1, 2, ...} do
2:   Collect trajectories τ_1, τ_2 using π_1(θ_1), π_2(θ_2).
3:   Compute the GAE return R̂_i for each agent i.
4:   Compute estimated advantages Â_1, Â_2 based on the current value functions V_φ1, V_φ2.
5:   for i ∈ {1, 2} do
6:     Compute a trust payoff region policy π̃_i using Eq. 2, where g is a clipping function.
8:     Fit the value function by regression on the mean-squared error: φ'_i = argmin_{φ_i} (1 / (|τ_i| T)) Σ_{τ∈τ_i} Σ_{t=0}^{T} (V_φ(s_t) − R̂_{i,t})²
9:   end for
10:  Construct the meta-game M(π_1(θ_1), π̃_1(θ̃_1), π_2(θ_2), π̃_2(θ̃_2)).
11:  Solve M and obtain the meta Nash (ρ_1, ρ_2).
12:  Compute the aggregated weak stable fixed point (π̄_1, π̄_2).
13:  for i ∈ {1, 2} do
14:     Compute π'_i, which best responds to π̄_{−i}, using Eq. 6.
15:     Estimate the best response by importance sampling: θ'_i = θ̃_i (1/(|τ_i| T)) Σ_{τ∈τ_i} Σ_{t=0}^{T} g(ε, π_i/π_{−i})
16:  end for
17:  θ_1 ← θ'_1, θ_2 ← θ'_2.
18: end for
Output: π_1(θ_1), π_2(θ_2).
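Steps 10-12 of the listing can be sketched in a few lines. The sketch below is our illustration (the function names are hypothetical): it solves a 2×2 meta-game whose entries are expected advantages and aggregates the weak stable fixed point as a linear mixture; the paper itself uses a general Nash solver.

```python
import numpy as np

def solve_2x2_meta_game(G1, G2):
    """Solve the 2x2 policy-space meta-game for a Nash (steps 10-11).

    G1[a, b] (resp. G2[a, b]) is agent 1's (resp. 2's) expected advantage
    when agent 1 plays its current (a=0) or predicted (a=1) policy and
    agent 2 plays current (b=0) or predicted (b=1). Returns (rho1, rho2),
    the Nash probabilities of keeping the current policies. Pure-NE
    enumeration with an interior mixed fallback.
    """
    for a in range(2):
        for b in range(2):
            if G1[a, b] >= G1[1 - a, b] and G2[a, b] >= G2[a, 1 - b]:
                return float(1 - a), float(1 - b)
    # Interior mixed NE: each agent's mixture makes the opponent indifferent.
    rho1 = (G2[1, 1] - G2[1, 0]) / (G2[0, 0] - G2[0, 1] - G2[1, 0] + G2[1, 1])
    rho2 = (G1[1, 1] - G1[0, 1]) / (G1[0, 0] - G1[0, 1] - G1[1, 0] + G1[1, 1])
    return rho1, rho2

def weak_stable_fixed_point(pi_cur, pi_new, rho):
    """Step 12: aggregate the mixture policy pi_bar = rho*pi + (1-rho)*pi_tilde."""
    return rho * pi_cur + (1 - rho) * pi_new

# Toy entries where the predicted policies are strictly better for both agents.
G = np.array([[0.0, 0.1], [0.2, 0.3]])
print(solve_2x2_meta_game(G, G))  # (0.0, 0.0): both agents fully adopt pi_tilde
```

Note that the entry for both agents keeping their current policies is zero, since the expected advantage of the current joint policy under itself vanishes.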

B INDEPENDENT TRUST PAYOFF REGION

We use the total variation divergence, defined for discrete probability distributions p, q as D_TV(p ‖ q) = (1/2) Σ_j |p_j − q_j| (Schulman et al., 2015). D_TV^max(π, π̃) is defined as D_TV^max(π, π̃) = max_s D_TV(π(·|s) ‖ π̃(·|s)). Based on this, we can define an α-coupled policy pair:

Definition 2 (α-Coupled Policy (Schulman et al., 2015)). (π, π̃) is an α-coupled policy pair if it defines a joint distribution over (a, ã)|s such that P(a ≠ ã | s) ≤ α for all s. π and π̃ denote the marginal distributions of a and ã, respectively.

When the joint policy pair (π_i, π_{−i}) changes to (π̃_i, π̃_{−i}), coupled with α_i and α_{−i} correspondingly,

η_i(π̃_i, π̃_{−i}) − η_i(π_i, π_{−i}) ≥ A_i^{π_i,π_{−i}}(π̃_i, π̃_{−i}) − (4εγ / (1−γ)²) (α_i + α_{−i} − α_i α_{−i})²,

where ε = max_{s,a_i,a_{−i}} |A_i^{π_i,π_{−i}}(s, a_i, a_{−i})|. The proofs are as follows.

Lemma 1. Given that (π_i, π̃_i) and (π_{−i}, π̃_{−i}) are α-coupled policy pairs bounded by α_i and α_{−i} respectively, for all s,

|Ā_i(s)| ≤ 2 (α_i + α_{−i} − α_i α_{−i}) max_{s,a_i,a_{−i}} |A_i^{π_i,π_{−i}}(s, a_i, a_{−i})|.   (9)

Proof.

Ā_i(s) = E_{ã_i,ã_{−i} ∼ (π̃_i,π̃_{−i})} [A_i^{π_i,π_{−i}}(s, ã_i, ã_{−i})]   (10)
      = E_{(a_i,ã_i) ∼ (π_i,π̃_i), (a_{−i},ã_{−i}) ∼ (π_{−i},π̃_{−i})} [A_i^{π_i,π_{−i}}(s, ã_i, ã_{−i}) − A_i^{π_i,π_{−i}}(s, a_i, a_{−i})]   (11)
      = P(a_i ≠ ã_i ∨ a_{−i} ≠ ã_{−i} | s) E_{(a_i,ã_i) ∼ (π_i,π̃_i), (a_{−i},ã_{−i}) ∼ (π_{−i},π̃_{−i})} [A_i^{π_i,π_{−i}}(s, ã_i, ã_{−i}) − A_i^{π_i,π_{−i}}(s, a_i, a_{−i}) | disagreement]   (12)
      ≤ (α_i + α_{−i} − α_i α_{−i}) · 2 max_{s,a_i,a_{−i}} |A_i^{π_i,π_{−i}}(s, a_i, a_{−i})|,

where the second equality uses E_{a_i,a_{−i} ∼ (π_i,π_{−i})}[A_i^{π_i,π_{−i}}(s, a_i, a_{−i})] = 0, and P(a_i ≠ ã_i ∨ a_{−i} ≠ ã_{−i} | s) ≤ 1 − (1−α_i)(1−α_{−i}) = α_i + α_{−i} − α_i α_{−i}.

Lemma 2. Let (π_i, π̃_i) and (π_{−i}, π̃_{−i}) be α-coupled policy pairs. Then

|E_{s_t ∼ (π̃_i,π̃_{−i})} [Ā_i(s_t)] − E_{s_t ∼ (π_i,π_{−i})} [Ā_i(s_t)]| ≤ 4 ε (α_i + α_{−i} − α_i α_{−i}) (1 − (1−α_i)^t (1−α_{−i})^t).

Proof. The preceding lemma bounds the difference in expected advantage at each time step t. Let n_t denote the number of time steps before t on which (π_i, π_{−i}) and (π̃_i, π̃_{−i}) disagree; n_t = 0 indicates that they agreed on all time steps less than t.
By the definition of α_i and α_{−i}, the coupled policies agree at any given step with probability at least (1−α_i)(1−α_{−i}), so P(n_t = 0) ≥ (1−α_i)^t (1−α_{−i})^t and P(n_t > 0) ≤ 1 − (1−α_i)^t (1−α_{−i})^t. We can then sum over time to bound the difference between η_i(π̃_i, π̃_{−i}) and η_i(π_i, π_{−i}):

η_i(π̃_i, π̃_{−i}) − L_i^{π_i,π_{−i}}(π̃_i, π̃_{−i})
  = Σ_{t=0}^{∞} γ^t [E_{s_t ∼ (π̃_i,π̃_{−i})} Ā_i(s_t) − E_{s_t ∼ (π_i,π_{−i})} Ā_i(s_t)]   (16)
  ≤ Σ_{t=0}^{∞} γ^t · 4 ε (α_i + α_{−i} − α_i α_{−i}) (1 − (1−α_i)^t (1−α_{−i})^t)   (17)
  = 4 ε (α_i + α_{−i} − α_i α_{−i}) [1/(1−γ) − 1/(1 − γ(1−α_i)(1−α_{−i}))]   (18)
  = 4 ε γ (α_i + α_{−i} − α_i α_{−i})² / [(1−γ)(1 − γ(1−α_i)(1−α_{−i}))]   (19)
  ≤ 4 ε γ (α_i + α_{−i} − α_i α_{−i})² / (1−γ)²,   (20)

where ε = max_{s,a_i,a_{−i}} |A_i^{π_i,π_{−i}}(s, a_i, a_{−i})|. Note that

L_i^{π_i,π_{−i}}(π̃_i, π̃_{−i}) = η_i(π_i, π_{−i}) + Σ_s ρ_{π_i,π_{−i}}(s) Σ_{a_i} π̃_i(a_i|s) Σ_{a_{−i}} π̃_{−i}(a_{−i}|s) A_i^{π_i,π_{−i}}(s, a_i, a_{−i}).   (21)

Then we have

η_i(π̃_i, π̃_{−i}) − η_i(π_i, π_{−i}) ≥ A_i^{π_i,π_{−i}}(π̃_i, π̃_{−i}) − (4εγ/(1−γ)²)(α_i + α_{−i} − α_i α_{−i})².   (22)

C PROOF OF THEOREM 2

At each iteration, denote ∇_i g_i = ∇_{π_i} g_i^{π_i,π_{−i}} and ∇_{i,−i} g_i = ∇_{π_i} ∇_{π_{−i}} g_i^{π_i,π_{−i}} for each i. Consider the simultaneous gradient ξ of the expected advantage gains and the corresponding Hessian H:

ξ(π_i, π_{−i}) = (∇_i g_i, ∇_{−i} g_{−i})ᵀ,   (23)
H = ∇ξ = [ ∇_{i,i} g_i, ∇_{i,−i} g_i ; ∇_{−i,i} g_{−i}, ∇_{−i,−i} g_{−i} ].   (24)

For a restricted underlying game, the policy space is bounded: π̄_i ∈ [π_i, π̃_i]. Assume π̄_i is the linear mixture of π_i and π̃_i, i.e., π̄_i = ρ_i π_i + (1 − ρ_i) π̃_i with ρ_i ∈ [0, 1]. We can therefore rewrite g_i^{π_i,π_{−i}}(π̄_i, π̄_{−i}) in the form

g_i^{π_i,π_{−i}}(π̄_i, π̄_{−i}) = g_i^{π_i,π_{−i}}(ρ_i, ρ_{−i}) = ρ_i (1−ρ_{−i}) g_i^{i,−î} + (1−ρ_i) ρ_{−i} g_i^{î,−i} + (1−ρ_i)(1−ρ_{−i}) g_i^{î,−î}.   (25)

Then we have

∇_i g_i(ρ_{−i}) = (1−ρ_{−i}) g_i^{i,−î} − ρ_{−i} g_i^{î,−i} − (1−ρ_{−i}) g_i^{î,−î},   (26)

and ξ(π̄_i, π̄_{−i}) = ξ(ρ_i, ρ_{−i}).
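As a sanity check of the reparameterization, the gradient in Eq. 26 can be verified against a finite difference of Eq. 25. The snippet below is our numeric illustration (the payoff values are invented; hats mark the predicted policies):

```python
# Check that the gradient formula (Eq. 26) matches the restricted-game
# objective (Eq. 25). Illustrative expected-advantage entries:
g_i_new, g_new_i, g_new_new = 0.10, 0.25, 0.30  # g_i^{i,-i^}, g_i^{i^,-i}, g_i^{i^,-i^}

def g(rho_i, rho_mi):
    # Eq. 25; the current-joint-policy entry g_i^{i,-i} is zero, since the
    # expected advantage of the current policy under itself vanishes.
    return (rho_i * (1 - rho_mi) * g_i_new
            + (1 - rho_i) * rho_mi * g_new_i
            + (1 - rho_i) * (1 - rho_mi) * g_new_new)

def grad_i(rho_mi):
    # Eq. 26
    return (1 - rho_mi) * g_i_new - rho_mi * g_new_i - (1 - rho_mi) * g_new_new

rho_i, rho_mi, h = 0.4, 0.7, 1e-6
fd = (g(rho_i + h, rho_mi) - g(rho_i - h, rho_mi)) / (2 * h)
print(abs(fd - grad_i(rho_mi)) < 1e-8)  # True
```

Because Eq. 25 is bilinear in (ρ_i, ρ_{−i}), the central difference agrees with the analytic gradient up to floating-point error.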
Given a meta Nash policy pair (π̄_i, π̄_{−i}), where π̄_i = ρ̂_i π_i + (1 − ρ̂_i) π̃_i, the Nash definition gives

(ρ̂_i, 1−ρ̂_i) G_i (ρ̂_{−i}, 1−ρ̂_{−i})ᵀ ≥ (ρ_i, 1−ρ_i) G_i (ρ̂_{−i}, 1−ρ̂_{−i})ᵀ,   where G_i = [ g_i^{i,−i}, g_i^{i,−î} ; g_i^{î,−i}, g_i^{î,−î} ],

which implies

(ρ̂_i − ρ_i) ∇_i g_i(ρ̂_{−i}) ≥ 0, ∀ρ_i ∈ [0, 1],
(ρ̂_{−i} − ρ_{−i}) ∇_{−i} g_{−i}(ρ̂_i) ≥ 0, ∀ρ_{−i} ∈ [0, 1].   (28)

When ρ̂_i, ρ̂_{−i} ∈ (0, 1), the Nash condition in Eq. 28 gives ∇_i g_i(ρ̂_{−i}) = ∇_{−i} g_{−i}(ρ̂_i) = 0, so (π̄_i, π̄_{−i}) is a fixed point because ξ(π̄_i, π̄_{−i}) = ξ(ρ̂_i, ρ̂_{−i}) = 0. For the boundary case, where ρ̂_i or ρ̂_{−i} ∈ {0, 1}, because the mixtures are constrained to the unit square [0, 1] × [0, 1], the gradients on the boundary are projected onto the square, so additional points of zero (projected) gradient exist. In other words, ∇_i g_i and ∇_{−i} g_{−i} still vanish in the boundary case, and (π̄_i, π̄_{−i}) is a fixed point in both cases. Next, we determine which type of fixed point (π̄_i, π̄_{−i}) is. According to Eq. 24, the exact Hessian matrix of the restricted game is

H = ∇ξ = [ 0, ḡ_i ; ḡ_{−i}, 0 ],   where ḡ_i := g_i^{î,−î} − g_i^{i,−î} − g_i^{î,−i}.   (29)

The eigenvalues λ of H satisfy λ² − Tr(H) λ + det(H) = λ² − ḡ_i ḡ_{−i} = 0, so λ = ±√(ḡ_i ḡ_{−i}). We therefore have the following cases for the fixed point (ρ̂_i, ρ̂_{−i}):
1. Fully cooperative games: ḡ_i ≤ 0 and ḡ_{−i} ≤ 0, so H(ρ̂_i, ρ̂_{−i}) ⪯ 0, which means (ρ̂_i, ρ̂_{−i}) is a stable fixed point, as we are maximizing the objective.
2. Fully competitive games: ḡ_i > 0, ḡ_{−i} < 0 or ḡ_i < 0, ḡ_{−i} > 0; the two eigenvalues are purely imaginary with zero real part, and (ρ̂_i, ρ̂_{−i}) is a saddle point.
3. General-sum games: these lie in between the cooperative and competitive cases, so (ρ̂_i, ρ̂_{−i}) can be either a stable fixed point or a saddle point.
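The case analysis above can be written as a small classification routine. The sketch below is our illustration (the function name and dictionary keys are invented); it computes ḡ_i for each agent from the three non-zero meta-game entries and applies the three cases:

```python
def classify_fixed_point(g_i, g_mi):
    """Classify the restricted-game fixed point from expected-advantage gains.

    g_i, g_mi: dicts keyed by ('cur'|'new', 'cur'|'new') minus the all-'cur'
    entry (which is zero), giving agent i's / agent -i's expected advantage
    for each joint choice of current vs. predicted policy.
    """
    gbar_i = g_i[('new', 'new')] - g_i[('cur', 'new')] - g_i[('new', 'cur')]
    gbar_mi = g_mi[('new', 'new')] - g_mi[('cur', 'new')] - g_mi[('new', 'cur')]
    if gbar_i <= 0 and gbar_mi <= 0:
        return "stable"         # case 1: real eigenvalues, H negative semi-definite
    if gbar_i * gbar_mi < 0:
        return "saddle"         # case 2: purely imaginary eigenvalues
    return "indeterminate"      # case 3: general-sum, could be either

# Fully cooperative toy entries: both mixed-direction gains are negative.
g = {('cur', 'new'): 1.0, ('new', 'cur'): 1.0, ('new', 'new'): 0.5}
print(classify_fixed_point(g, g))  # stable
```

When the routine returns "saddle", MATRL takes the best response to the weak stable fixed point, as described in the main text.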
Because we assume π̃_i monotonically improves over π_i, even in the zero-sum case there is at least one negative value among ḡ_i and ḡ_{−i}. Therefore, in all situations (ρ̂_i, ρ̂_{−i}) is not unstable: it is either a stable point or a saddle point. We call such a point a weak stable fixed point. It also admits a tighter lower bound than the independent trust region improvement, as stated in Remark 1.

Grid World Games. In two-player checker, as shown in Fig. 8a, there is one sensitive player who gets reward 5 for collecting an apple and -5 for collecting a lemon; the less sensitive player gets 1 for an apple and -1 for a lemon. The learning goal is for the sensitive player to collect the apples and the other player the lemons, yielding a higher total reward. In four-player switch, as shown in Fig. 8b, agents need to find a way through a narrow corridor to reach the targets. An agent gets -0.1 for each step and 5 upon arriving at a target. Four-player switch uses the same map as two-player switch, but two agents start from the left side and two from the right side, passing through the corridor to reach their targets. With more agents in four-player switch, learning becomes more challenging. MATRL agents achieved higher total rewards than the baseline algorithms within the same number of steps.

Multi-Agent MuJoCo Tasks. We used the three-agent Hopper environment described in (de Witt et al., 2020), shown in Fig. 8c, where three agents control three joints of the robot and learn to cooperate to move forward as far as possible. The agents are rewarded by the number of time steps they move without falling. Each agent outputs 3 continuous action values, and all agents fully observe the state of size 17. We use the same hyper-parameters for MATRL, MATRL w/o BR, and IL-PP. For the MADDPG agent, we use the hyper-parameters described in (de Witt et al., 2020).

Multi-Agent Atari Game.
The pong game is a multi-agent Atari version of table tennis, as shown in Fig. 10. Two players must prevent a ball from whizzing past their paddles, which would allow their opponent to score. The game ends when one side earns 21 points.

E EXPERIMENTAL PARAMETER SETTINGS

For all tasks, the most important hyper-parameters are the learning rate/step size, the number of update steps, the batch size, and the value and policy loss coefficients. An appropriate learning rate and number of update steps, plus a larger batch size, give a more stable learning curve. For different environments, policy and value network loss coefficients that keep the two losses at the same scale are essential for improving the learning result and speed. Also, for the meta-game construction and the best response update, where we use the importance ratio for estimation, a clipping factor on the ratio is vital to achieving a stable and monotonically improving result. The following are the detailed parameter settings for each task.

Matrix Game and Random 2 × 2 Matrix Games. The hyper-parameter settings for MATRL, IGA-PP, and WoLF are listed in Table 2. As shown in Fig. 9, we also provide additional convergence analysis on the classical Chicken and Prisoner's Dilemma games, which demonstrates good convergence performance of MATRL on both. For MATRL, the KL-divergence coefficient is an extra hyper-parameter that adds the KL divergence as part of the loss in policy updating. For the baseline algorithm WoLF, we give the true NE of the game as part of its parameters. In all games, all algorithms shared the same initial policy values: [0.9, 0.1] for player 1 and [0.2, 0.8] for player 2.

Grid World Games and Multi-Agent Continuous Control Task. The hyper-parameter settings for MATRL are given in Table 3. We used the same hyper-parameters for MATRL, MATRL w/o BR, IL-PP, and IL; the only difference is whether the best response and the meta-game are used.
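A minimal sketch of the clipped importance ratio mentioned above (our illustration; the function name and the clip value 0.2 are assumptions, not taken from the paper's hyper-parameter tables):

```python
import numpy as np

def clipped_is_weights(pi_new, pi_old, actions, clip=0.2):
    """Clip per-step importance-sampling ratios to [1-clip, 1+clip].

    pi_new, pi_old: arrays of shape (T, n_actions) with action probabilities
    of the new and behavior policies. actions: integer actions taken under
    pi_old. Clipping bounds the variance of the re-weighted estimates.
    """
    t = np.arange(len(actions))
    ratio = pi_new[t, actions] / pi_old[t, actions]
    return np.clip(ratio, 1.0 - clip, 1.0 + clip)

pi_old = np.array([[0.5, 0.5], [0.9, 0.1]])
pi_new = np.array([[0.8, 0.2], [0.5, 0.5]])
print(clipped_is_weights(pi_new, pi_old, np.array([0, 1])))  # [1.2 1.2]
```

Here both raw ratios (1.6 and 5.0) exceed the upper bound, so both are clipped to 1.2; without clipping, rare actions under the behavior policy would dominate the estimate.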



Footnotes: In this paper, we want to maximize the return, not minimize the loss, so we need to avoid a strict minimum. The multi-agent Atari environment is available at https://github.com/PettingZoo-Team/Multi-Agent-ALE.




Figure 3: Overview of the multi-agent trust region learning phases in two-agent games. It can be easily extended to the n-agent case by solving the n-agent two-action matrix form meta-game.

Figure 5: Learning curves in discrete and continuous tasks. The solid lines are average episode returns with 10 random seeds for each model, and the shaded areas are the error bars.

(a) Two-agent checker. (b) Four-agent switch. (c) Three-agent MuJoCo hopper.


Figure 7: Running time of 20,000 environment steps (including 50 gradient steps) for the algorithms in 2-4 agents games.

Figure 10: Pong game in Atari 2600.

Algorithm 2 Multi-Agent Trust Region Learning Algorithm (PPO Based, Two-Agent Example). Input: The initial policy parameters θ_1, θ_2, the initial value function parameters φ_1, φ_2, and ε.

Hyper-parameter settings for baseline algorithms in grid worlds.

Hyper-parameter settings in multi-agent MuJoCo hopper.

Hyper-parameter settings in multi-agent pong Atari.

Matching Pennies

These approaches target the challenge of instability caused by agents' changing policies. Instead of taking the best response to the approximated opponent policy, MATRL exploits ideas from both streams and introduces the improvement over the weak stable fixed point. Our research also relates to EGTA (Tuyls et al., 2018; Jordan & Wellman, 2009; Tuyls et al., 2020) , which creates a policy-space meta-game to model multi-agent interactions. Using various evaluation metrics, it then updates and extends the policies based on analysis of the meta-policies (Lanctot et al., 2017; Muller et al., 2019; Omidshafiei et al., 2019; Balduzzi et al., 2019; Yang et al., 2019) . Although these methods apply broadly to multi-agent tasks, they require extensive computing resources to estimate the empirical meta-game and to solve it as its size grows (Omidshafiei et al., 2019; Yang et al., 2019) . In our method, we adopt the idea of a policy-space meta-game to approximate the fixed point. Unlike previous works, we only maintain the current and predicted policies to construct the meta-game, which is computationally feasible in most cases. The payoff entries in MATRL's meta-game are expected advantages, which have lower estimation variance than the empirically estimated returns commonly used in EGTA. Moreover, we can reuse the trajectories from the TPR step to estimate the payoffs without incurring additional sampling costs. Recently, with the use of neural networks as function approximators for policies and values, many works on deep reinforcement learning (DRL) have emerged (Mnih et al., 2013; Lillicrap et al., 2015) . Trust region policy optimization (Kakade & Langford, 2002; Schulman et al., 2015; 2017) is one of the most successful DRL methods in the single-agent setting; it constrains the step size of policy updates, preserving improvements monotonically.
Building on the monotonic improvement of single-agent trust region policy optimization (TRPO) (Schulman et al., 2015) , MATRL extends the improvement guarantee to the multi-agent level, towards a weak stable fixed point. Some works directly apply fully decentralized single-agent DRL methods (Tan, 1993) , which can be unstable during learning due to non-stationarity, whereas others (Foerster et al., 2016; Sukhbaatar et al., 2016; Peng et al., 2017) further exploit the setting of centralized training with decentralized execution (CTDE). These methods provide solutions for training agents in complex multi-agent environments, and the experimental results show their effectiveness compared with independent learners. Similar to the CTDE setting, MATRL also enjoys fully decentralized execution. Although MATRL still needs knowledge of the other agents' policies during training, it only requires a centralized mechanism to adjust the step size rather than an additional centralized critic or communication channel.

5. EXPERIMENTS

We design the experiments to answer the following questions: 1) Can MATRL empirically contribute to convergence in general game settings, including cooperative/competitive and continuous/discrete games? 2) How does MATRL perform compared to ILs with the same hyper-parameters and to other strong MARL baselines in discrete and continuous games with various numbers of agents? 3) Do the meta-game and the best response to the weak stable fixed point bring benefits? We first evaluate the convergence performance of MATRL in matrix-form games to answer the first question and validate the effectiveness of convergence. For Question 2, we show that MATRL largely outperforms ILs (PPO (Schulman et al., 2017) ) and other centralized baselines (QMIX (Rashid et al., 2018) , QTRAN (Son et al., 2019) and VDN (Sunehag et al., 2018) ) on discrete grid world games.

Remark 1. Let (ρ_i, ρ_{−i}) be a Nash equilibrium of the policy-space meta-game M(π_i, π̃_i, π_{−i}, π̃_{−i}), which is used for computing the linear mixture policies π̄_i, π̄_{−i}. For simplicity, define ρ̄_i = 1 − ρ_i; then we have a payoff improvement lower bound for π̄_i, π̄_{−i} that is tighter than the bound in Theorem 1.

Finally, we obtain MATRL as follows. First, an agent i collects a set of trajectories using its current policy π_i by independent play with the other agents. Then a predicted policy π̃_i can be estimated using single-agent trust region methods, which has a trust payoff improvement against the other agents' current policies π_{−i}. However, this trust payoff improvement alone does not satisfy the convergence requirements of the multi-agent system, because the other agents are adapting as well. To address this, we approximate an n-agent two-action meta-game in policy space by reusing the trajectories from the last TPR step. In this game, each agent i has two pure strategies, choosing the current policy π_i or the predicted policy π̃_i, and the corresponding payoffs are the expected advantages (defined in Eq.
3) of the joint policy pairs. By constructing such a meta-game, we transform a complex multiagent interactions problem into game-theoretic analysis concerning the underlying game restricted in [π i , πi ]. Then we can obtain a weak stable fixed point as TSR within the TPR by solving the meta-game,. When the fixed point is a saddle point we then take the best response to the weak stable fixed point to get the next iteration's policies. This encourages exploration and avoid stagnation at an unexpected saddle point.
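To make the meta-game step concrete, the following is a minimal sketch for the two-agent case. The 2 × 2 payoff matrices would hold the expected advantages from Eq. 3 (the entries below are arbitrary placeholders), and the equilibrium (ρi, ρ-i) is found by first checking pure profiles and, failing that, solving the two indifference conditions. The function names are hypothetical; the paper's actual solver is (Šebek, 2013) with CMAES.

```python
import numpy as np

def pure_nash(A, B):
    """All pure Nash equilibria of a 2x2 bimatrix game.
    A[r, c] / B[r, c]: payoffs of agent i / agent -i when i plays row r
    (0 = current policy, 1 = predicted policy) and -i plays column c."""
    return [(r, c) for r in range(2) for c in range(2)
            if A[r, c] >= A[1 - r, c] and B[r, c] >= B[r, 1 - c]]

def mixed_nash(A, B):
    """Fully mixed NE: each agent mixes so that the other is indifferent.
    Assumes the denominators are nonzero (a fully mixed NE exists)."""
    rho_i = (B[1, 1] - B[1, 0]) / (B[0, 0] - B[0, 1] - B[1, 0] + B[1, 1])
    rho_ni = (A[1, 1] - A[0, 1]) / (A[0, 0] - A[1, 0] - A[0, 1] + A[1, 1])
    return rho_i, rho_ni  # probability each agent keeps its *current* policy

# Placeholder meta-payoffs; in MATRL these are expected advantages (Eq. 3).
A = np.array([[1., -1.], [-1., 1.]])  # agent i
B = -A                                # agent -i (zero-sum example)
if not pure_nash(A, B):               # no pure NE -> solve for the mixed one
    rho_i, rho_ni = mixed_nash(A, B)
```

With ρi in hand, the linear mixture policy of Remark 1 would be formed as π̂i = ρi πi + (1 - ρi) π̃i.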

D ENVIRONMENT DETAILS

Random 2 × 2 Matrix Games. We created a generator of 2 × 2 matrix games based on the categorization provided by (Pangallo et al., 2017). Coordination games allow one agent to improve its payoff without decreasing the payoff of the other agent. Anti-coordination games are ones in which one agent improves its payoff while the other agent's payoff decreases. Both coordination and anti-coordination games can have two pure NEs and one mixed-strategy NE. In cyclic games, the agents' best responses to each other's actions form a cycle, ensuring that there is no pure NE in the game; only a mixed-strategy NE can be found.

Training Details. We used Leaky ReLU as the activation function for both the policy and value networks. For training, we used parallel workers to collect experience data and update the network weights separately, then synchronized all workers to obtain the final updated weights. We used different value-loss and policy-loss coefficients to balance the two losses. For the Switch games, we used a small value-loss coefficient because the value loss lies in [0, 10] while the absolute policy loss is smaller than 1e-2. For the Checker game, the value loss and policy loss are on the same scale, in [1e-4, 1e-2]. We also added an entropy loss and a KL loss to encourage exploration and limit the policy update at each step. We used (Šebek, 2013) as the Nash equilibrium solver for finding the meta-game Nash; the solver is CMAES for all experiments. Unless otherwise indicated, all baselines use the common settings listed in Table 3. VDN and QMIX use the same individual action-value networks as MATRL, each consisting of two 128-width hidden layers. We include additional experimental results on the 4-agent Ant multi-agent MuJoCo task in Fig. 9c, which also demonstrate the superior performance of MATRL compared to the other settings.
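A generator in the spirit of this taxonomy can be sketched by sampling random payoffs and classifying each game by its pure best-response structure. This is an illustrative sketch, not the paper's generator; in particular, the diagonal convention used to separate coordination from anti-coordination is an assumption about how actions are labeled.

```python
import numpy as np

def classify_2x2(A, B):
    """Classify a 2x2 bimatrix game following the Pangallo et al. (2017)
    taxonomy: no pure NE -> cyclic; two pure NEs on the diagonal ->
    coordination, off the diagonal -> anti-coordination (labeling assumption)."""
    pure = [(r, c) for r in range(2) for c in range(2)
            if A[r, c] >= A[1 - r, c] and B[r, c] >= B[r, 1 - c]]
    if not pure:
        return "cyclic"            # best responses cycle: only a mixed NE
    if len(pure) == 2:
        on_diag = all(r == c for r, c in pure)
        return "coordination" if on_diag else "anti-coordination"
    return "other"                 # e.g., dominance-solvable games

def random_game(rng):
    """Sample a random 2x2 game with i.i.d. uniform payoffs."""
    return rng.uniform(-1, 1, (2, 2)), rng.uniform(-1, 1, (2, 2))
```

For example, a pure coordination game classifies as "coordination", matching pennies as "cyclic", and chicken as "anti-coordination".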
The specialized parameter settings for each algorithm are provided in Tables 4 and 5.

Multi-agent Atari Pong. The hyper-parameter settings for MATRL are listed in Table 6. We used the same hyper-parameters for MATRL and IL. We take the raw pixel input from the Atari environment and process it with a convolutional network with filter sizes [8, 4, 3], kernel sizes [3, 3, 3], stride sizes [4, 2, 1], and "VALID" padding. The processed embedding is then passed to a 2-layer fully connected network to produce the policy.
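Given the listed kernel sizes (3, 3, 3), strides (4, 2, 1), and "VALID" padding, the spatial size of the convolutional trunk's output can be checked with a short calculation. The 84 × 84 input resolution below is an assumption for illustration; the paper does not state the frame size.

```python
def conv_out(size, kernel, stride):
    # "VALID" padding: out = floor((size - kernel) / stride) + 1
    return (size - kernel) // stride + 1

def trunk_out(size, kernels=(3, 3, 3), strides=(4, 2, 1)):
    """Spatial side length after the three conv layers described in the text."""
    for k, s in zip(kernels, strides):
        size = conv_out(size, k, s)
    return size

side = trunk_out(84)  # 84 -> 21 -> 10 -> 8 for an assumed 84x84 frame
```

The resulting 8 × 8 feature map (times the final filter count) would be flattened before the 2-layer fully connected policy head.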

