PROVABLE RICH OBSERVATION REINFORCEMENT LEARNING WITH COMBINATORIAL LATENT STATES

Abstract

We propose a novel setting for reinforcement learning that combines two common real-world difficulties: the presence of rich observations (such as camera images) and factored states (such as the locations of objects). In our setting, the agent receives observations generated stochastically from a latent factored state. These observations are rich enough to enable decoding of the latent state and remove partial observability concerns. Since the latent state is combinatorial, the size of the state space is exponential in the number of latent factors. We create a learning algorithm, FactoRL (Fact-o-Rel), for this setting, which uses noise-contrastive learning to identify latent structures in emission processes and discover a factorized state space. We derive sample complexity guarantees for FactoRL that depend polynomially on the number of factors and very weakly on the size of the observation space. We also provide a guarantee of polynomial time complexity when given access to an efficient planning algorithm.

1. INTRODUCTION

Most reinforcement learning (RL) algorithms scale polynomially with the size of the state space, which is inadequate for many real-world applications. Consider, for example, a simple navigation task in a room with furniture, where the set of furniture pieces and their locations change from episode to episode. If we crudely approximate the room as a 10 × 10 grid and let each grid cell hold a single bit of information about the presence of furniture, then we end up with a state space of size $2^{100}$, as each cell can be filled independently of the others. This is intractable for RL algorithms that depend polynomially on the size of the state space. The notion of factorization allows tractable solutions to be developed. In the above example, the room can be considered a state with 100 factors, where the next value of each factor depends on just a few other parent factors and the action taken by the agent. Learning in factored Markov Decision Processes (MDPs) has been studied extensively (Kearns & Koller, 1999; Guestrin et al., 2003; Osband & Van Roy, 2014), with tractable solutions scaling linearly in the number of factors and exponentially in the number of parent factors whenever planning can be done efficiently. However, factorization alone is inadequate, since the agent may not have access to the underlying factored state space and may instead receive only a rich observation of the world. In our room example, the agent may have access to an image of the room taken from a megapixel camera instead of the grid representation. Naively treating each pixel of the image as a factor suggests there are over a million factors and a prohibitively large number of parent factors for each pixel. Counterintuitively, thinking of the observation as the state in this way leads to the conclusion that problems become harder as the camera resolution increases or other sensors are added.
It is entirely possible that these pixels (or, more generally, observation atoms) are generated by a small number of latent factors, each with a small number of parent factors. This motivates us to ask: can we achieve PAC RL guarantees that depend polynomially on the number of latent factors and very weakly (e.g., logarithmically) on the size of the observation space? Recent work has addressed this for a rich-observation setting with a non-factored latent state space when certain supervised learning problems are tractable (Du et al., 2019; Misra et al., 2020; Agarwal et al., 2020). However, addressing the rich-observation setting with a latent factored state space has remained elusive. Specifically, ignoring the factored structure in the latent space or treating observation atoms as factors yields intractable solutions.
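As a rough illustration of the counting argument in the room example, the sketch below compares the number of entries in a flat transition table with the number of parameters in a factored model over binary factors. The function names, the 4 actions, and the bound of 3 parents per factor are illustrative assumptions, not values taken from this paper.

```python
def flat_table_entries(num_factors: int, num_actions: int) -> int:
    """Entries in a tabular transition model over all 2^d binary states."""
    num_states = 2 ** num_factors
    # One probability per (state, action, next state) triple.
    return num_states * num_actions * num_states


def factored_model_entries(num_factors: int, num_actions: int, max_parents: int) -> int:
    """Entries in a factored model with binary factors.

    Each factor needs one Bernoulli parameter per (parent assignment, action),
    so the count is linear in the number of factors and exponential only in
    the number of parents.
    """
    return num_factors * (2 ** max_parents) * num_actions


if __name__ == "__main__":
    # Hypothetical sizes for the 10 x 10 room: 100 factors, 4 actions, 3 parents.
    d, num_actions, max_parents = 100, 4, 3
    print(flat_table_entries(d, num_actions))
    print(factored_model_entries(d, num_actions, max_parents))
```

Under these assumed sizes the factored model needs a few thousand parameters, while the flat table has on the order of $2^{200}$ entries.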

Contributions. We combine two threads of research, on rich-observation RL and on factored MDPs, by proposing a new problem setup called the Factored Block MDP (Section 2). In this setup, observations are emitted by latent states that obey the dynamics of a factored MDP. We assume observations to be composed of atoms (which can be pixels for an image) that are emitted by the latent factors. A single factor can emit a large number of atoms, but no two factors can control the same atom. Following the existing rich-observation RL literature, we assume observations are rich enough to decode the current latent state. We introduce an algorithm, FactoRL, that achieves the desired guarantees for a large class of Factored Block MDPs under certain computational and realizability assumptions (Section 4). The main challenge that FactoRL handles is mapping atoms to the parent factor that emits them. We achieve this by reducing the identification problem to solving a set of independence test problems with distributions satisfying certain properties. We perform independence tests in a domain-agnostic setting using noise-contrastive learning (Section 3). Once we have mapped atoms to their parent factors, FactoRL decodes the factors, estimates the model, recovers the latent structure in the transition dynamics, and learns a set of exploration policies. Figure 1 shows the different steps of FactoRL. This provides us with enough tools to visualize the latent dynamics and to plan for any given reward function. Due to space limits, we defer the discussion of related work to Appendix B. To the best of our knowledge, our work represents the first provable solution to rich-observation RL with a combinatorially large latent state space.
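To give some intuition for how noise-contrastive learning can serve as an independence test between two atoms, here is a toy sketch. It is not the procedure analyzed in this paper: it assumes small discrete atoms, replaces the learned classifier with a tabular count-based one, tests unconditional independence only, and the function name `contrastive_dependence_score` is hypothetical. Real pairs of atom values are contrasted with imposter pairs in which one atom is shuffled across samples; if the atoms are independent, no classifier can do better than chance.

```python
import numpy as np


def contrastive_dependence_score(a, b, rng):
    """Estimate how far a tabular real-vs-imposter classifier is from chance.

    Real pairs (a_i, b_i) get label 1; imposter pairs (a_i, b_j), with j given
    by a random permutation, get label 0. If the two atoms are independent,
    both kinds of pairs have the same distribution and the best classifier
    outputs 0.5 everywhere, so the returned score is close to 0.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    half = len(a) // 2

    # Training half: count real and imposter pairs for each (a, b) value.
    a_tr, b_tr = a[:half], b[:half]
    b_fake_tr = b_tr[rng.permutation(half)]
    num_a, num_b = a.max() + 1, b.max() + 1
    real = np.zeros((num_a, num_b))
    fake = np.zeros((num_a, num_b))
    np.add.at(real, (a_tr, b_tr), 1.0)
    np.add.at(fake, (a_tr, b_fake_tr), 1.0)

    # Smoothed estimate of P(label = real | a, b).
    prob_real = (real + 1.0) / (real + fake + 2.0)

    # Held-out half: mean distance of the prediction from chance on real pairs.
    a_te, b_te = a[half:], b[half:]
    return float(np.mean(np.abs(prob_real[a_te, b_te] - 0.5)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 20000
    # Two independent binary latent factors (a toy assumption).
    s1 = rng.integers(0, 2, size=n)
    s2 = rng.integers(0, 2, size=n)
    u, v, w = s1, 1 - s1, s2  # u and v share a factor; w comes from another
    print("same factor:", contrastive_dependence_score(u, v, rng))
    print("different factors:", contrastive_dependence_score(u, w, rng))
```

In this toy setup, atoms emitted by the same factor yield a score well above zero, while atoms from different factors yield a score near zero; thresholding the score then groups atoms by their emitting factor.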

2. THE FACTORED BLOCK MDP SETTING

There are many possible ways to add rich observations to a factored MDP, many of which result in inapplicability or intractability. Our goal here is to define a problem setting that is tractable to solve and covers potential real-world problems. We start with the definition of a Factored MDP (Kearns & Koller, 1999), but first review some useful notation.

For any $n \in \mathbb{N}$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For any ordered set (or vector) $U$ of size $n$ and an ordered index set $I \subseteq [n]$ of length $k$, we use the notation $U[I]$ to denote the ordered set $(U[I[1]], U[I[2]], \ldots, U[I[k]])$.

Definition 1. A Factored MDP $(S, A, T, R, H)$ consists of a $d$-dimensional discrete state space $S \subseteq \{0, 1\}^d$, a finite action space $A$, an unknown transition function $T : S \times A \to \Delta(S)$, an unknown reward function $R : S \times A \to [0, 1]$, and a time horizon $H$. Each state $s \in S$ consists of $d$ factors, with the $i$-th factor denoted as $s[i]$. The transition function satisfies $T(s' \mid s, a) = \prod_{i=1}^{d} T_i(s'[i] \mid s[\mathrm{pt}(i)], a)$ for every $s, s' \in S$ and $a \in A$, where $T_i : \{0, 1\}^{|\mathrm{pt}(i)|} \times A \to \Delta(\{0, 1\})$ defines a factored transition distribution and a parent function $\mathrm{pt} : [d] \to 2^{[d]}$ defines the set of parent factors that can influence a factor at the next time step.

We assume a deterministic start state. We also assume, without loss of generality, that each state and observation is reachable at exactly one time step. This can be easily accomplished by concatenating the time step information to the state and the observation. This allows us to write the state space as $S = (S_1, S_2, \ldots, S_H)$, where $S_h$ is the set of states reachable at time step $h$.

Figure 1: Left: A room navigation task as a Factored Block MDP, showing atoms and factors. Center and right: The stages executed by the FactoRL algorithm (identify emission structure, learn state decoder, learn model and policy cover). We do not show the observation x emitted by s for brevity. In practice, a factor would emit many more atoms.
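As a minimal sketch of Definition 1, the code below represents a factored transition function over binary factors and evaluates $T(s' \mid s, a)$ as a product of per-factor terms. The class name `FactoredTransition`, the helper `make_example`, and the specific probabilities are hypothetical and chosen only for illustration; note also that the code uses 0-based indices, whereas the notation above is 1-based.

```python
import itertools
import random


def index(U, I):
    """Return U[I] = (U[I[1]], ..., U[I[k]]), using 0-based positions."""
    return tuple(U[i] for i in I)


class FactoredTransition:
    """T(s' | s, a) = prod_i T_i(s'[i] | s[pt(i)], a) over binary factors."""

    def __init__(self, parents, tables):
        # parents[i]: tuple of parent factor indices pt(i).
        # tables[i][(parent_values, action)]: probability that s'[i] = 1.
        self.parents = parents
        self.tables = tables

    def factor_prob(self, i, next_value, state, action):
        """Probability T_i(next_value | s[pt(i)], a) for factor i."""
        p_one = self.tables[i][(index(state, self.parents[i]), action)]
        return p_one if next_value == 1 else 1.0 - p_one

    def prob(self, next_state, state, action):
        """Product of the per-factor probabilities."""
        p = 1.0
        for i in range(len(self.parents)):
            p *= self.factor_prob(i, next_state[i], state, action)
        return p

    def sample(self, state, action, rng):
        """Sample each next factor independently given its parents and the action."""
        return tuple(
            int(rng.random() < self.factor_prob(i, 1, state, action))
            for i in range(len(self.parents))
        )


def make_example():
    """Hypothetical 2-factor model: pt(0) = (0,), pt(1) = (0, 1), actions {0, 1}."""
    parents = [(0,), (0, 1)]
    tables = [{}, {}]
    for a in (0, 1):
        for s0 in (0, 1):
            tables[0][((s0,), a)] = 0.9 if s0 == a else 0.1
            for s1 in (0, 1):
                tables[1][((s0, s1), a)] = 0.8 if s0 != s1 else 0.2
    return FactoredTransition(parents, tables)


if __name__ == "__main__":
    T = make_example()
    state, action = (1, 0), 1
    for next_state in itertools.product((0, 1), repeat=2):
        print(next_state, T.prob(next_state, state, action))
    print(T.sample(state, action, random.Random(0)))
```

Because each factor is stored as a table over its parents only, the model size grows with $2^{|\mathrm{pt}(i)|}$ per factor rather than with $2^d$.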

