MULTI-AGENT TRUST REGION LEARNING

Abstract

Trust-region methods are widely used in single-agent reinforcement learning. One advantage is that they guarantee a lower bound on monotonic payoff improvement for policy optimization at each iteration. Nonetheless, when applied in multi-agent settings, such a guarantee is lost because an agent's payoff is also determined by other agents' adaptive behaviors. In fact, measuring agents' payoff improvements in multi-agent reinforcement learning (MARL) scenarios is still challenging. Although game-theoretic solution concepts such as the Nash equilibrium can be applied, algorithms built on them (e.g., Nash Q-learning) suffer from poor scalability beyond two-player discrete games. To mitigate the above measurability and tractability issues, in this paper we propose the Multi-Agent Trust Region Learning (MATRL) method. MATRL augments the single-agent trust-region optimization process with the multi-agent solution concept of a stable fixed point, computed at the policy-space meta-game level. When multiple agents learn simultaneously, stable fixed points at the meta-game level can effectively measure agents' payoff improvements, and, importantly, a meta-game representation enjoys better scalability for multi-player games. We derive a lower bound on agents' payoff improvements for the MATRL method, and also prove its convergence to the meta-game fixed points. We evaluate the MATRL method on both discrete and continuous multi-player general-sum games; results suggest that MATRL significantly outperforms strong MARL baselines on grid worlds, multi-agent MuJoCo, and Atari games.

1. INTRODUCTION

Multi-agent systems (MAS) (Shoham & Leyton-Brown, 2008) have received much attention from the reinforcement learning community. In the real world, automated driving (Cao et al., 2012), StarCraft II (Vinyals et al., 2019), and Dota 2 (Berner et al., 2019) are a few examples of the myriad of applications that can be modeled as MAS. Due to the complexity of multi-agent problems (Chatterjee et al., 2004), investigating whether agents can learn to behave effectively during interactions with environments and other agents is essential (Fudenberg et al., 1998). This can be achieved naively through independent learners (ILs) (Tan, 1993), which ignore the other agents and optimize their policies assuming a stationary environment (Buşoniu et al., 2010; Hernandez-Leal et al., 2017). Due to their theoretical guarantees and good empirical performance in real-world applications, ILs based on trust-region methods (e.g., PPO (Schulman et al., 2015; 2017)) are popular (Vinyals et al., 2019; Berner et al., 2019). In single-agent learning, trust-region methods can produce a monotonic payoff improvement guarantee (Kakade & Langford, 2002) via line search (Schulman et al., 2015). In multi-agent scenarios, however, an agent's improvement is affected by other agents' adaptive behaviors (i.e., the multi-agent environment is non-stationary (Hernandez-Leal et al., 2017)). As a result, trust-region learners can measure policy improvements against the other agents' current policies, but the improvements against the updated opponent policies are unknown (shown in Fig. 1). Therefore, trust-region-based ILs perform less well in MAS than in single-agent tasks. Moreover, convergence to a fixed point, such as a Nash equilibrium (Nash et al., 1950; Bowling & Veloso, 2004; Mazumdar et al., 2020), is a common and widely accepted solution concept for multi-agent learning. Thus, although independent learners can best respond to other agents' current policies, they lose their convergence guarantee (Laurent et al., 2011). One solution to the convergence problem for independent learners is Empirical Game-Theoretic Analysis (EGTA) (Wellman, 2006), which approximates the best response to the policies generated by the independent learners (Lanctot et al., 2017; Muller et al., 2019). Although EGTA-based methods (Lanctot et al., 2017; Omidshafiei et al., 2019; Balduzzi et al., 2019) establish convergence guarantees in several game classes, the computational cost of empirically approximating and solving the ever-growing meta-game is large (Yang et al., 2019). Other multi-agent learning approaches collect or approximate additional information, such as communication (Foerster et al., 2016) and centralized joint critics (Lowe et al., 2017; Foerster et al., 2017; Sunehag et al., 2018; Rashid et al., 2018). Nevertheless, these methods usually require centralized parameters or centralized communication assumptions. Thus, there is considerable interest in finding a multi-agent learning algorithm that, while keeping the minimal requirements and computational cost of independent learners, also improves convergence performance.

Figure 1: The relationship of the discounted returns $\eta_i$ for an agent $i$ under different joint policy pairs, where $\pi_i$ is the current policy and $\pi'_i$ is the simultaneously updated policy. Given $\pi'_i$, the monotonic improvement against a fixed opponent can easily be measured: $\eta_i(\pi'_i, \pi_{-i}) \geq \eta_i(\pi_i, \pi_{-i})$. However, due to simultaneous learning, the improvement of $\eta_i(\pi'_i, \pi'_{-i})$ compared to $\eta_i(\pi_i, \pi_{-i})$ is unknown.

This paper presents the Multi-Agent Trust Region Learning (MATRL) algorithm, which augments trust-region ILs with a meta-game analysis to improve the stability and efficiency of learning. In MATRL, a trust-region trial step towards each agent's payoff improvement is implemented by independent learners, which yields a predicted policy based on the current policy. Then, an empirical policy-space meta-game is constructed to compare the expected advantages of the predicted policies with those of the current policies. By solving the meta-game, MATRL finds a restricted step that aggregates the current and predicted policies using the meta-game Nash equilibrium. Finally, for exploration, MATRL takes each agent's best response to the aggregated policies from the last step, because the found restricted step is not always strictly stable. MATRL is therefore able to provide a weakly stable solution, unlike naive independent learners. Built on trust-region independent learners, MATRL needs no extra parameters, simulations, or modifications to the independent learner itself. We provide insights into the empirical meta-game in Section 3.
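To make the composition of these stages concrete, the following Python sketch outlines one MATRL iteration. It is a minimal illustration, not the paper's implementation: all injected callables (trial_step, payoff, solve_meta_nash, mix, best_response) are assumed placeholders for the components described above.

# Minimal sketch of one MATRL iteration for n simultaneously learning agents.
# All callables are illustrative assumptions, not the paper's API.
from itertools import product
from typing import Callable, List, Sequence

def matrl_iteration(
    policies: List,                                 # current policy of each agent
    trial_step: Callable[[List, int], object],      # trust-region IL update -> predicted policy
    payoff: Callable[[List], Sequence[float]],      # estimated payoffs of a joint policy
    solve_meta_nash: Callable[[dict], Sequence[float]],  # meta-game Nash solver
    mix: Callable[[object, object, float], object], # aggregate current and predicted policies
    best_response: Callable[[List, int], object],   # BR to the others' restricted steps
) -> List:
    n = len(policies)

    # 1) Trust-region trial step: each agent independently predicts an improved
    #    policy against the others' current policies.
    predicted = [trial_step(policies, i) for i in range(n)]

    # 2) Empirical policy-space meta-game: each agent has two meta-actions,
    #    {current, predicted}; evaluate the payoffs of all 2^n joint choices.
    meta_game = {
        choice: payoff([predicted[i] if c else policies[i]
                        for i, c in enumerate(choice)])
        for choice in product([0, 1], repeat=n)
    }

    # 3) Solve the meta-game; weights[i] is agent i's Nash probability of
    #    playing its predicted policy.
    weights = solve_meta_nash(meta_game)

    # 4) Restricted step: aggregate current and predicted policies using the
    #    meta-game Nash equilibrium.
    restricted = [mix(policies[i], predicted[i], weights[i]) for i in range(n)]

    # 5) Exploration: best-respond to the others' restricted steps, since the
    #    found restricted step is not always strictly stable.
    return [best_response(restricted, i) for i in range(n)]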

2. PRELIMINARY

A Stochastic Game (Shapley, 1953; Littman, 1994) can be defined as a tuple $G = \langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}_i\}, \{R_i\}, \mathcal{P}, p_0, \gamma \rangle$, where $\mathcal{N}$ is the set of agents, $n = |\mathcal{N}|$ is the number of agents, and $\mathcal{S}$ denotes the state space. $\mathcal{A}_i$ is the action space of agent $i$, and $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_n = \mathcal{A}_i \times \mathcal{A}_{-i}$ is the joint action space, where for simplicity $-i$ denotes all agents except agent $i$. $R_i = R_i(s, a^i, a^{-i})$ is the reward function of agent $i \in \mathcal{N}$, $\mathcal{P} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition function, $p_0$ is the initial state distribution, and $\gamma \in [0, 1]$ is a discount factor. Each agent $i \in \mathcal{N}$ has a stochastic policy $\pi_i(a^i \mid s) : \mathcal{S} \times \mathcal{A}_i \to [0, 1]$ and aims to maximize its long-term discounted return
$$\eta_i(\pi_i, \pi_{-i}) = \mathbb{E}_{s_0, a^i_0, a^{-i}_0, \dots}\left[\sum_{t=0}^{\infty} \gamma^t R_i(s_t, a^i_t, a^{-i}_t)\right],$$
where $s_0 \sim p_0$, $s_{t+1} \sim \mathcal{P}(s_{t+1} \mid s_t, a^i_t, a^{-i}_t)$, and $a^i_t \sim \pi_i(a^i_t \mid s_t)$. We then have the standard definitions of the state-action value function $Q_i(s_t, a^i_t, a^{-i}_t)$ and the value function $V_i(s_t)$ given the state and joint action.
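For concreteness, these can be written out explicitly in the standard form; the advantage function is included here as well, since trust-region methods (and the meta-game construction in Section 3) compare expected advantages:
$$Q_i^{\pi_i, \pi_{-i}}(s_t, a^i_t, a^{-i}_t) = \mathbb{E}\left[\sum_{l=0}^{\infty} \gamma^l R_i(s_{t+l}, a^i_{t+l}, a^{-i}_{t+l}) \,\middle|\, s_t, a^i_t, a^{-i}_t\right], \qquad V_i^{\pi_i, \pi_{-i}}(s_t) = \mathbb{E}_{a^i_t \sim \pi_i,\, a^{-i}_t \sim \pi_{-i}}\left[Q_i^{\pi_i, \pi_{-i}}(s_t, a^i_t, a^{-i}_t)\right],$$
with the advantage function $A_i^{\pi_i, \pi_{-i}}(s_t, a^i_t, a^{-i}_t) = Q_i^{\pi_i, \pi_{-i}}(s_t, a^i_t, a^{-i}_t) - V_i^{\pi_i, \pi_{-i}}(s_t)$.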

3. MULTI-AGENT TRUST REGION POLICY OPTIMIZATION

A trust-region algorithm aims to answer two questions: how to compute a trust-region trial step, and whether a trial step should be accepted. In multi-agent learning, a trust-region trial step towards the agents' payoff improvement can be easily implemented with independent learners, and we call this independent payoff improvement area the Trust Payoff Region (TPR). The remaining issue is
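As a reference point for how such a trial step might be computed, the sketch below uses a PPO-style clipped surrogate as the independent learner's update. This is a minimal illustration under assumed interfaces (PyTorch; a policy object exposing a log_prob(states, actions) method), not the paper's exact learner.

# One PPO-style clipped update producing the "predicted" policy that enters
# the meta-game. `batch` holds states, actions, advantage estimates, and the
# log-probs under the pre-update (current) policy. Illustrative sketch only.
import torch

def ppo_trial_step(policy, optimizer, batch, clip_eps: float = 0.2):
    new_logp = policy.log_prob(batch["states"], batch["actions"])
    ratio = torch.exp(new_logp - batch["old_logp"])  # importance ratio
    adv = batch["advantages"]
    # Clipping keeps the update inside an approximate trust (payoff) region.
    surrogate = torch.min(
        ratio * adv,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv,
    )
    loss = -surrogate.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return policy  # the predicted policy, one trial step from the current one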





