PROVABLE RICH OBSERVATION REINFORCEMENT LEARNING WITH COMBINATORIAL LATENT STATES

Abstract

We propose a novel setting for reinforcement learning that combines two common real-world difficulties: the presence of rich observations (such as camera images) and factored states (such as the locations of objects). In our setting, the agent receives observations generated stochastically from a latent factored state. These observations are rich enough to enable decoding of the latent state and remove partial observability concerns. Since the latent state is combinatorial, the size of the state space is exponential in the number of latent factors. We create a learning algorithm FactoRL (Fact-o-Rel) for this setting, which uses noise-contrastive learning to identify latent structures in emission processes and discover a factorized state space. We derive sample complexity guarantees for FactoRL which depend polynomially on the number of factors, and only very weakly on the size of the observation space. We also provide a guarantee of polynomial time complexity when given access to an efficient planning algorithm.

1. INTRODUCTION

Most reinforcement learning (RL) algorithms scale polynomially with the size of the state space, which is inadequate for many real-world applications. Consider, for example, a simple navigation task in a room with furniture, where the set of furniture pieces and their locations change from episode to episode. If we crudely approximate the room as a 10 × 10 grid and consider each element of the grid to contain a single bit of information about the presence of furniture, then we end up with a state space of size 2^100, as each element of the grid can be filled independently of the others. This is intractable for RL algorithms that depend polynomially on the size of the state space. The notion of factorization allows tractable solutions to be developed. For the above example, the room can be considered a state with 100 factors, where the next value of each factor depends on just a few other parent factors and the action taken by the agent. Learning in factored Markov Decision Processes (MDPs) has been studied extensively (Kearns & Koller, 1999; Guestrin et al., 2003; Osband & Van Roy, 2014), with tractable solutions scaling linearly in the number of factors and exponentially in the number of parent factors whenever planning can be done efficiently. However, factorization alone is inadequate, since the agent may not have access to the underlying factored state space, instead receiving only a rich observation of the world. In our room example, the agent may have access to an image of the room taken from a megapixel camera instead of the grid representation. Naively treating each pixel of the image as a factor suggests there are over a million factors and a prohibitively large number of parent factors for each pixel. Counterintuitively, thinking of the observation as the state in this way leads to the conclusion that problems become harder as the camera resolution increases or other sensors are added.
It is entirely possible that these pixels (or, more generally, observation atoms) are generated by a small number of latent factors, each with a small number of parent factors. This motivates us to ask: can we achieve PAC RL guarantees that depend polynomially on the number of latent factors and very weakly (e.g., logarithmically) on the size of the observation space? Recent work has addressed this for a rich-observation setting with a non-factored latent state space when certain supervised learning problems are tractable (Du et al., 2019; Misra et al., 2020; Agarwal et al., 2020). However, addressing the rich-observation setting with a latent factored state space has remained elusive. Specifically, ignoring the factored structure in the latent space or treating observation atoms as factors yields intractable solutions.

[Figure 1: panels titled "Identify Emission Structure", "Learn State Decoder", and "Learn Model and Policy Cover". Center and Right: the different stages executed by the FactoRL algorithm. We do not show the observation x emitted by s for brevity. In practice a factor would emit many more atoms.]

Contributions. We combine two threads of research on rich-observation RL and factored MDPs by proposing a new problem setup called Factored Block MDP (Section 2). In this setup, observations are emitted by latent states that obey the dynamics of a factored MDP. We assume observations are composed of atoms (which can be pixels for an image) that are emitted by the latent factors. A single factor can emit a large number of atoms, but no two factors can control the same atom. Following the existing rich-observation RL literature, we assume observations are rich enough to decode the current latent state. We introduce an algorithm FactoRL that achieves the desired guarantees for a large class of Factored Block MDPs under certain computational and realizability assumptions (Section 4). The main challenge FactoRL handles is mapping atoms to the parent factors that emit them. We achieve this by reducing the identification problem to solving a set of independence-test problems with distributions satisfying certain properties. We perform independence tests in a domain-agnostic setting using noise-contrastive learning (Section 3). Once we have mapped atoms to their parent factors, FactoRL decodes the factors, estimates the model, recovers the latent structure in the transition dynamics, and learns a set of exploration policies. Figure 1 shows the different steps of FactoRL. This provides us with enough tools to visualize the latent dynamics and plan for any given reward function. Due to the space limit, we defer the discussion of related work to Appendix B. To the best of our knowledge, our work represents the first provable solution to rich-observation RL with a combinatorially large latent state space.
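The stages listed above can be summarized as a pipeline. The sketch below is schematic pseudocode: the subroutine names are ours, and each call abstracts an entire procedure described later in the paper.

```
# Pseudocode sketch of the FactoRL stages (each call abstracts a full subroutine):
for h = 2, ..., H:
    ch_h  <- IdentifyEmissionStructure(h, policy_cover)   # map atoms to latent factors
    phi_h <- LearnStateDecoder(h, ch_h)                   # decode each factor individually
    T_h   <- EstimateFactoredModel(h, phi_h)              # recover latent dynamics
    policy_cover <- PlanExplorationPolicies(T_1 .. T_h)   # extend the cover to step h
```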

2. THE FACTORED BLOCK MDP SETTING

There are many possible ways to add rich observations to a factored MDP, most of which result in inapplicability or intractability. Our goal here is to define a problem setting that is tractable to solve and covers potential real-world problems. We start with the definition of a Factored MDP (Kearns & Koller, 1999), but first review some useful notation.

Notation: For any n ∈ ℕ, we use [n] to denote the set {1, 2, · · · , n}. For any ordered set (or vector) U of size n and an ordered index set I ⊆ [n] of length k, we use the notation U[I] to denote the ordered set (U[I[1]], U[I[2]], · · · , U[I[k]]). The parent function pt : [d] → 2^{[d]} defines the set of parent factors that can influence a factor at the next time step. We assume a deterministic start state. We also assume, without loss of generality, that each state and observation is reachable at exactly one time step. This can be easily accomplished by concatenating the time step information to states and observations, and allows us to write the state space as S = (S_1, S_2, · · · , S_H), where S_h is the set of states reachable at time step h.

A natural question to ask here is why we assume factored transitions. In tabular MDPs, the lower bound for sample complexity scales linearly with the size of the state set (Kakade, 2003). If we do not assume a factorized transition function, then we can encode an arbitrary MDP with a state space of size 2^d, which would yield a lower bound of Ω(2^d), rendering the setting intractable. Instead, we will prove sample complexity guarantees for FactoRL that scale in the number of factors as d^{O(κ)}, where κ := max_{i ∈ [d]} |pt(i)| is the size of the largest parent factor set. The dependence on κ in the exponent is unavoidable, as we have to find the parent factors from all possible d^κ combinations, as well as learn the model for all possible values of the parent factors. However, for real-world problems we expect κ to be a small constant such as 2.
This yields a significant improvement: for example, if κ = 2 and d = 100, then d^κ = 10^4 while 2^d ≈ 10^30. Based on the definition of a Factored MDP, we define the main problem setup of this paper, called the Factored Block MDP, where the agent does not observe the state but instead receives an observation containing enough information to decode the latent state.

Definition 2. A Factored Block MDP consists of an observation space X = 𝒳^m and a latent state space S ⊆ {0, 1}^d. A single observation x ∈ X is made of m atoms, with the k-th atom denoted x[k] ∈ 𝒳. Observations are generated stochastically given a latent state s ∈ S according to a factored emission function q(x | s) = ∏_{i=1}^{d} q_i(x[ch(i)] | s[i]), where q_i : {0, 1} → ∆(𝒳^{|ch(i)|}) and ch : [d] → 2^{[m]} is a child function satisfying ch(i) ∩ ch(j) = ∅ whenever i ≠ j. The emission function satisfies the disjointness property: for every i ∈ [d], we have supp(q_i(· | 0)) ∩ supp(q_i(· | 1)) = ∅. The dynamics of the latent state space follow a Factored MDP (S, A, T, R, H) with parent function pt and a deterministic start state.

The notion of atoms generalizes commonly used abstractions. For example, if the observation is an image, then atoms can be individual pixels or superpixels, and if the observation is natural language text, then atoms can be individual letters or words. We make no assumption about the structure of the atom space 𝒳 or its size, which can be infinite. An agent is responsible for mapping each observation x ∈ X to individual atoms (x[1], · · · , x[m]) ∈ 𝒳^m. For the two examples above, this mapping is routinely performed in practice: if the observation is a text presented to the agent as a string, it can use an off-the-shelf tokenizer to map the text to a sequence of tokens (atoms). Similar to states, we assume the sets of observations reachable at different time steps are disjoint. Additionally, we allow the parent (pt) and child (ch) functions to change across time steps.
We denote these functions at time step h by pt_h and ch_h. The disjointness property was introduced by Du et al. (2019) for Block MDPs, a class of rich-observation non-factorized MDPs. This property removes partial observability concerns and enables tractable learning. We expect it to hold in real-world problems whenever sufficient sensor data is available to decode the state from the observation. For example, disjointness holds for the navigation task with an overhead camera in Figure 1: the image provides enough information to locate all objects in the room, which describes the agent's state. Disjointness allows us to define a decoder φ_i : 𝒳^{|ch(i)|} → {0, 1} for every factor i ∈ [d], such that φ_i(x[ch(i)]) = s[i] whenever x[ch(i)] ∈ supp(q_i(· | s[i])). We define the shorthand φ_i(x) = φ_i(x[ch(i)]) whenever ch is clear from the context. Lastly, we define the state decoder φ* : X → {0, 1}^d where φ*(x)[i] = φ_i(x). The agent interacts with the environment by taking actions according to a policy π : X → ∆(A).
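To make the composition φ(x)[i] = φ_i(x[ch(i)]) concrete, here is a minimal sketch in Python. All names (`Atom`, `make_state_decoder`) and the toy decoders are illustrative assumptions, not the paper's implementation: FactoRL learns the child function and per-factor decoders rather than assuming them.

```python
from typing import Callable, Dict, List

# An observation is a list of m atoms; here an atom is just a float.
Atom = float

def make_state_decoder(
    ch: Dict[int, List[int]],                                 # factor i -> atom indices ch(i)
    factor_decoders: Dict[int, Callable[[List[Atom]], int]],  # phi_i : X^{|ch(i)|} -> {0, 1}
) -> Callable[[List[Atom]], List[int]]:
    """Compose per-factor decoders into phi(x)[i] = phi_i(x[ch(i)])."""
    def phi(x: List[Atom]) -> List[int]:
        return [factor_decoders[i]([x[k] for k in ch[i]]) for i in sorted(ch)]
    return phi

# Toy instance with d = 2 factors and m = 4 atoms: factor 0 emits atoms {0, 1},
# factor 1 emits atoms {2, 3}. Disjointness holds because each factor value
# emits atoms of one sign only, so the two supports never overlap.
ch = {0: [0, 1], 1: [2, 3]}
decoders = {0: lambda atoms: int(sum(atoms) > 0),
            1: lambda atoms: int(sum(atoms) > 0)}
phi = make_state_decoder(ch, decoders)
```

With this toy instance, `phi([1.0, 1.0, -1.0, -1.0])` decodes to `[1, 0]`.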

These interactions consist of episodes {s_1, x_1, a_1, r_1, s_2, x_2, a_2, r_2, · · · , a_H, s_H}, with s_1 = 0, x_h ∼ q(· | s_h), r_h = R(x_h, a_h), and s_{h+1} ∼ T(· | s_h, a_h). The agent never observes {s_1, · · · , s_H}.

Technical Assumptions. We make two assumptions that are specific to the FactoRL algorithm. The first is a margin assumption on the transition dynamics that enables us to identify different values of a factor. This assumption was introduced by Du et al. (2019), and we adapt it to our setting.

Assumption 1 (Margin Assumption). For every h ∈ {2, 3, · · · , H} and i ∈ [d], let u_i be the uniform distribution jointly over actions and all possible reachable values of s_{h-1}[pt(i)]. Then we assume: ‖P_{u_i}(·, · | s_h[i] = 1) − P_{u_i}(·, · | s_h[i] = 0)‖_TV ≥ σ, where P_{u_i}(s_{h-1}[pt(i)], a | s_h[i]) is the backward dynamics denoting the probability over parent values and the last action given s_h[i] and roll-in distribution u_i, and σ > 0 is the margin.

Assumption 1 captures a large set of problems, including all deterministic problems, for which the value of σ is 1. Assumption 1 helps us identify the different values of a factor, but it does not help with mapping atoms to the factors from which they are emitted. In order to identify whether two atoms come from the same factor, we make the following additional assumption to measure their dependence.

Assumption 2 (Atom Dependency Bound). For any h ∈ [H] and u, v ∈ [m] with u ≠ v, if ch^{-1}(u) = ch^{-1}(v), i.e., atoms x_h[u] and x_h[v] have the same factor, then under any distribution D ∈ ∆(S_h) we have ‖P_D(x_h[u], x_h[v]) − P_D(x_h[u]) P_D(x_h[v])‖_TV ≥ β_min.

The dependency assumption states that atoms emitted from the same factor will be correlated. This is true for many real-world problems. For example, consider a toy grid-based navigation task. Each state factor s[i] represents a cell in the grid, which can be empty (s[i] = 0) or occupied (s[i] = 1).
In the latter case, a randomly sampled box from the set {red box, yellow box, black box} occupies its place. We expect Assumption 2 to hold in this case, as pixels emitted from the same factor come from the same object and hence will be correlated. More specifically, if one pixel is red, then another pixel from the same cell will also be red, as the object occupying the cell is a red box. This assumption does not remove the key challenge in identifying factors, as atoms from different factors can still be dependent due to actions and state distributions from previous time steps.

Model Class. We use two regressor classes F and G. The first regressor class F : 𝒳 × 𝒳 → [0, 1] takes a pair of atoms and outputs a scalar in [0, 1]. To define the second class, we first define a decoder class Φ : 𝒳* → {0, 1}. We allow this class to be defined on any set of atoms. This is motivated by empirical research where commonly used neural network models operate on inputs of arbitrary length; for example, an LSTM model can operate on text of arbitrary length (Sundermeyer et al., 2012). However, this is without loss of generality, as we can define a different model class for different numbers of atoms. We also define a model class U : X × A × {0, 1} → [0, 1]. Finally, we define the regressor class G : X × A × 𝒳* → [0, 1] as {(x, a, x′) → u(x, a, φ(x′)) | u ∈ U, φ ∈ Φ}. We assume F and G are finite classes and derive sample complexity guarantees which scale as log |F| and log |G|; however, since we only use uniform convergence arguments, extending the guarantees to other statistical complexity measures such as Rademacher complexity is straightforward. Let Π_all denote the set of all non-stationary policies ϕ : S → A. We then define the class of policies Π : X → A as {x → ϕ(φ*(x)) | ϕ ∈ Π_all}, which we use later to define our task. We use P_π[E] to denote the probability of an event E under the distribution over episodes induced by policy π.

Computational Oracle.
We assume access to two regression oracles REG for the model classes F and G. Let D_1 be a dataset of triplets (x[u], x[v], y), where u, v denote two different atoms and y ∈ {0, 1}. Similarly, let D_2 be a dataset of quads (x, a, x′, y), where x ∈ X, a ∈ A, x′ ∈ 𝒳*, and y ∈ {0, 1}. Lastly, let Ê_D[·] denote the empirical mean over a dataset D. The two computational oracles compute:

REG(D_1, F) = argmin_{f ∈ F} Ê_{D_1}[(f(x[u], x[v]) − y)^2],   REG(D_2, G) = argmin_{g ∈ G} Ê_{D_2}[(g(x, a, x′) − y)^2].

We also assume access to a ∆_pl-optimal planning oracle planner. Let Ŝ = (Ŝ_1, · · · , Ŝ_H) be a learned state space, let T̂ = (T̂_1, · · · , T̂_H) with T̂_h : Ŝ_{h-1} × A → ∆(Ŝ_h) be the learned dynamics, and let R : Ŝ × A → [0, 1] be a given reward function. Let ϕ : Ŝ → A be a policy and V(ϕ; T̂, R) be its value. Then for any ∆_pl > 0, the output ϕ̂ = planner(T̂, R, ∆_pl) satisfies V(ϕ̂; T̂, R) ≥ sup_ϕ V(ϕ; T̂, R) − ∆_pl, where the supremum is taken over policies of type Ŝ → A.

Task Definition. We focus on a reward-free setting with the goal of learning a state decoder and estimating the latent dynamics T. Since the state space is exponentially large, we cannot visit every state. However, the factorization property allows us to estimate the model by reaching factor values. In fact, we show that controlling the values of at most 2κ factors is sufficient for learning the model. Let C_{≤k}(U) denote the space of all sets containing at most k different elements selected from the set U, including ∅. We define the reachability probability η_h(K, Z) for a given h ∈ [H], K ⊆ [d], and Z ∈ {0, 1}^{|K|}, and the reachability parameter η_min, as:

η_h(K, Z) := sup_{π ∈ Π_NS} P_π(s_h[K] = Z),   η_min := inf_{h ∈ [H]} inf_{s ∈ S_h} inf_{K ∈ C_{≤2κ}([d])} η_h(K, s[K]).

Our sample complexity scales polynomially with η_min^{-1}. Note that we only require that if s_h[K] = Z is reachable, then it is reachable with probability at least η_min, i.e., either η_h(K, Z) = 0 or it is at least η_min.
These requirements are similar to those made by earlier work for non-factored state spaces (Du et al., 2019; Misra et al., 2020). The key difference is that instead of requiring every state to be reachable with probability η_min, we only require a small set of factor values to be reachable. For reference, if every policy induces a uniform distribution over S = {0, 1}^d, then the probability of visiting any given state is 2^{-d}, but the probability of two factors taking certain values is 0.25. This gives us a more practical value for η_min. Besides estimating the dynamics and learning a decoder, we also learn an α-policy cover to enable exploration of different reachable values of factors. We define this below:

Definition 3 (Policy Cover). A set of policies Ψ is an α-policy cover of S_h for any α > 0 and h if: ∀ s ∈ S_h, K ∈ C_{≤2κ}([d]), sup_{π ∈ Ψ} P_π(s_h[K] = s[K]) ≥ α η_h(K, s[K]).
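The gap between the two reachability notions can be checked with a two-line calculation (an illustrative sketch using the uniform-distribution example above; the variable names are ours):

```python
# Under a uniform distribution over s in {0,1}^d, compare the probability of
# reaching one exact state (what a non-factored analysis would require)
# against a fixed pair of factors taking given values (what eta_min requires).
d = 100

p_exact_state = 0.5 ** d   # 2^{-d}: astronomically small for d = 100
p_two_factors = 0.5 ** 2   # 0.25, independent of d
```

The factored requirement thus stays bounded away from zero as d grows, while the non-factored one collapses exponentially.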

3. DISCOVERING EMISSION STRUCTURE WITH CONTRASTIVE LEARNING

Directly applying prior work (Du et al., 2019; Misra et al., 2020) to decode a factored state from observations fails, as the learned factored state need not obey the transition factorization. Instead, the key high-level idea of our approach is to first learn the latent emission structure ch, and then use it to decode each factor individually. We next discuss our approach for learning ch.

Reducing Identification of the Latent Emission Structure to Independence Tests. Assume we are able to perfectly decode the latent state and estimate the transition model up to time step h − 1. Our goal is to infer the latent emission structure ch_h at time step h, which is equivalent to the following: given an arbitrary pair of atoms u and v, determine whether they are emitted from the same factor or not. This is challenging since we cannot observe or control the latent state factors at time step h. Let i = ch^{-1}(u) and j = ch^{-1}(v) be the factors that emit x[u] and x[v]. If i = j, then by Assumption 2 the two atoms are dependent under any roll-in distribution; if i ≠ j, the factored emission process renders them independent once the values of their latent factors are fixed.

This observation motivates us to reduce the identification problem to performing independence tests under different roll-in distributions D ∈ ∆(S_{h−1} × A). Naively, we can iterate over all subsets K ∈ C_{≤2κ}([d]), where for each K we create a roll-in distribution that fixes the values of s_{h−1}[K] and the action a_{h−1}, and then perform an independence test under this distribution. If two atoms are independent, then there must exist a K that makes them independent; otherwise, they are always dependent by Assumption 2. However, there are two problems with this approach. First, we do not have access to the latent states, only to a decoder at time step h − 1. Further, it may not even be possible to find a policy that sets the values of factors deterministically. We later show that our algorithm FactoRL learns a decoder that induces a bijection between the learned factors and values and the real factors and values. Therefore, maximizing the probability of Ê_{K;Z} = {φ̂_{h−1}(x_{h−1})[K] = Z} for a set of learned factors K and their values Z implicitly maximizes the probability of E_{K′;Z′} = {s_{h−1}[K′] = Z′} for the corresponding real factors K′ and their values Z′. Since the event Ê_{K;Z} is observable, we can use rejection sampling to drive its probability sufficiently close to 1, which makes the probability of E_{K′;Z′} close to 1. The second problem is performing independence tests in a domain-agnostic setting. Directly estimating the mutual information I(x[u]; x[v]) can be challenging. Instead, we propose an oraclized independence test that reduces the problem to binary classification using noise-contrastive learning. Oraclized Independence Test. Here, we briefly sketch the main idea of our independence test scheme and defer the details to Appendix C. We comment that the high-level idea of our independence testing subroutine is similar to Sen et al. (2017). Suppose we want to test whether two random variables Y and Z are independent.
First, we construct a dataset in the following way: sample a Bernoulli random variable w ∼ Bern(1/2) and two pairs of independent realizations (y^{(1)}, z^{(1)}) and (y^{(2)}, z^{(2)}); if w = 1, add (y^{(1)}, z^{(1)}, w) to the dataset, and (y^{(1)}, z^{(2)}, w) otherwise. We repeat the sampling procedure n times and obtain a dataset {(y_i, z_i, w_i)}_{i=1}^n. We then fit a classifier that predicts the value of w_i from (y_i, z_i). If Y and Z are independent, then (y_i, z_i) provides no information about w_i, and thus no classifier can do better than random guessing. However, if Y and Z are dependent, then the Bayes optimal classifier performs strictly better than random guessing. As a result, by looking at the training loss of the learned classifier, we can determine whether Y and Z are dependent.
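The sketch below implements this contrastive test for binary atoms, with a tabular regressor standing in for the classifier; the helper names and the toy distributions are ours. The real algorithm fits f over a function class F and thresholds the training loss, as detailed in Appendix C:

```python
import numpy as np

def fit_and_loss(dataset):
    # Tabular "classifier": f(y, z) = empirical mean of w in cell (y, z),
    # which is exactly the least-squares fit over all tabular functions.
    cells = {}
    for y, z, w in dataset:
        cells.setdefault((y, z), []).append(w)
    f = {c: sum(ws) / len(ws) for c, ws in cells.items()}
    return sum((f[(y, z)] - w) ** 2 for y, z, w in dataset) / len(dataset)

def contrastive_loss(samples, rng):
    """Build the (y, z, w) dataset from i.i.d. (y, z) pairs and return
    the training squared loss of the fitted regressor."""
    data = []
    for i in range(0, len(samples) - 1, 2):
        (y1, z1), (_, z2) = samples[i], samples[i + 1]
        if rng.integers(2) == 1:
            data.append((y1, z1, 1))   # real pair
        else:
            data.append((y1, z2, 0))   # imposter pair: z from another draw
    return fit_and_loss(data)

rng = np.random.default_rng(0)
y = rng.integers(2, size=20000)
loss_dep = contrastive_loss(list(zip(y, y)), rng)                      # z = y
loss_ind = contrastive_loss(list(zip(y, rng.integers(2, size=20000))), rng)
print(loss_dep < 0.2, abs(loss_ind - 0.25) < 0.02)  # True True
```

In the dependent case the loss drops well below 0.25, while in the independent case it concentrates around 0.25, matching the dichotomy described above.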

4. FactoRL: REINFORCEMENT LEARNING IN FACTORED BLOCK MDPS

In this section, we present the main algorithm FactoRL (Algorithm 1). It takes as input the model classes F, G, a failure probability δ > 0, and five hyperparameters σ, η_min, β_min ∈ (0, 1) and d, κ ∈ N. We use these hyperparameters to define three sample sizes n_ind, n_abs, n_est and a rejection sampling frequency k. For brevity, we defer the exact values of these constants to Appendix D.7. FactoRL returns a learned decoder φ̂_h : X → {0,1}^{d_h} for some d_h ∈ [m], an estimated transition model T̂_h, learned parent and child functions p̂t_h and ĉh_h, and a 1/2-policy cover Ψ_h of S_h for every time step h ∈ {2, 3, ..., H}. We use ŝ_h to denote the learned state at time step h; formally, ŝ_h = (φ̂_{h1}(x_h), ..., φ̂_{hd_h}(x_h)). In the analysis of FactoRL, we show that d_h = d, and that ĉh_h is equivalent to ch_h up to permutation with high probability. Further, we show that φ̂_h and ĉh_h together learn a bijection between the learned factors and their values and the real factors and their values. FactoRL operates inductively over the time steps (Algorithm 1, lines 2-8). In the h-th iteration, the algorithm performs four stages of learning: identifying the latent emission structure, decoding the factors, estimating the model, and learning a policy cover. We describe these below.

Algorithm 1 FactoRL(F, G, δ, σ, η_min, β_min, d, κ). RL in Factored Block MDPs.
1: Initialize Ψ_h = ∅ for every h ∈ [H] and φ̂_1 : X → {0}. Set global constants n_ind, n_abs, n_est, k.
2: for h ∈ {2, 3, ..., H} do
3:   ĉh_h = FactorizeEmission(Ψ_{h−1}, φ̂_{h−1}, F)  // stage 1: discover latent emission structure
4:   φ̂_h = LearnDecoder(G, Ψ_{h−1}, ĉh_h)  // stage 2: learn a decoder for factors
5:   T̂_h, p̂t_h = EstModel(Ψ_{h−1}, φ̂_{h−1}, φ̂_h)  // stage 3: estimate the latent dynamics
6:   for I ∈ C_{≤2κ}([d_h]) and Z ∈ {0,1}^{|I|} do  // stage 4: learn a policy cover
7:     φ̂_{hIZ} = Planner(T̂, R_{hIZ}, ∆_pl), where R_{hIZ} is the internal reward for reaching {ŝ_h[I] = Z}
8:     If V(φ̂_{hIZ}; T̂, R_{hIZ}) ≥ 3η_min/4 then Ψ_h ← Ψ_h ∪ {φ̂_{hIZ} ∘ φ̂_h}
return {ĉh_h, φ̂_h, T̂_h, p̂t_h, Ψ_h}_{h=2}^H

Identifying the Latent Emission Process.
The FactorizeEmission subroutine collects a dataset of observations for every policy in Ψ_{h−1} and action a ∈ A (Algorithm 2, lines 1-4). Policies in Ψ_{h−1} are of the type π_{I;Z} where I ∈ C_{≤2κ}([d_{h−1}]) and Z ∈ {0,1}^{|I|}. We can inductively assume that π_{I;Z} maximizes the probability of E_{I;Z} = {ŝ_{h−1}[I] = Z}. If our decoder is accurate enough, then maximizing the probability of this event in turn maximizes the probability of fixing the values of a set of real factors. However, it is possible that P_{π_{I;Z}}(ŝ_{h−1}[I] = Z) is only O(η_min). Therefore, as explained earlier, we use rejection sampling to drive the probability of this event close to 1. Formally, we define a procedure RejectSamp(π_{I;Z}, E_{I;Z}, k) which rolls in to time step h − 1 with π_{I;Z} to observe x_{h−1} (line 3). If the event E_{I;Z} holds for x_{h−1}, then we return x_{h−1}; otherwise, we repeat the procedure. If we fail to satisfy the event k times, we return the last sample. We use this to define our main sampling procedure x_h ∼ D_{I,Z,a} := RejectSamp(π_{I;Z}, E_{I;Z}, k) ∘ a, which first samples x_{h−1} using the rejection sampling procedure and then takes action a to observe x_h. We collect a dataset of observation pairs (x^{(1)}, x^{(2)}) sampled independently from D_{I,Z,a}. For every pair of atoms u, v ∈ [m], we test whether they are independent under the distribution induced by D_{I,Z,a} using IndTest with dataset D_{I;Z;a} (lines 5-7). We share the dataset across atoms for sample efficiency. If there exists at least one (I, Z, a) triple for which we evaluate x[u], x[v] to be independent, then we mark these atoms as coming from different factors. Intuitively, such an I would contain the parent factors of at least one of ch_h^{−1}(u) or ch_h^{−1}(v). If no such I exists, then we mark these atoms as being emitted by the same factor.
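A minimal sketch of the rejection sampling step described above, assuming only a roll-in sampler and an observable event; the 10% event probability below plays the role of the O(η_min) visit probability:

```python
import random

def reject_samp(roll_in, event, k):
    """RejectSamp(pi, E, k): roll in, keep the sample if the observable
    event holds, retry up to k times, and fall back to the last sample
    after k failures."""
    x = roll_in()
    for _ in range(k):
        if event(x):
            return x
        x = roll_in()
    return x

random.seed(0)
roll_in = lambda: random.random() < 0.1   # "event holds" ~10% of the time
hits = sum(reject_samp(roll_in, lambda x: x, k=50) for _ in range(1000))
print(hits > 950)  # True: the event holds in almost every returned sample
```

With k retries, the failure probability decays geometrically, which is how the algorithm drives P(Ê_{K;Z}) close to 1 from an initial probability of only O(η_min).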

Algorithm 2 FactorizeEmission(Ψ_{h−1}, φ̂_{h−1}, F).
1: for (π_{I;Z}, a) ∈ Ψ_{h−1} × A and i ∈ [n_ind] do
2:   Define E_{I;Z} := 1{φ̂_{h−1}(x_{h−1})[I] = Z}
3:   Sample x_h^{(1)}, x_h^{(2)} ∼ RejectSamp(π_{I;Z}, E_{I;Z}, k) ∘ a  // rejection sampling procedure
4:   D_{I;Z;a} ← D_{I;Z;a} ∪ {(x_h^{(1)}, x_h^{(2)})}  // initialize D_{I;Z;a} = ∅
5: for u ∈ {1, 2, ..., m−1} and v ∈ {u+1, ..., m} do
6:   Mark u, v as coming from the same factor, i.e., ĉh_h^{−1}(u) = ĉh_h^{−1}(v), if for all (I, Z, a)
7:   the oraclized independence test finds x_h[u], x_h[v] dependent using D_{I;Z;a} and F
return ĉh_h  // label ordering of parents does not matter

Algorithm 3 LearnDecoder(G, Ψ_{h−1}, ĉh_h). The child function has type ĉh_h : [d_h] → 2^{[m]}.
1: for i ∈ [d_h], define ω = ĉh_h(i), D = ∅ do
2:   for n_abs times do  // collect a dataset of real (y = 1) and imposter (y = 0) transitions
3:     Sample (x^{(1)}, a^{(1)}, x′^{(1)}), (x^{(2)}, a^{(2)}, x′^{(2)}) ∼ Unf(Ψ_{h−1}) ∘ Unf(A) and y ∼ Bern(1/2)
4:     If y = 1 then D ← D ∪ {(x^{(1)}, a^{(1)}, x′^{(1)}[ω], y)} else D ← D ∪ {(x^{(1)}, a^{(1)}, x′^{(2)}[ω], y)}
5:   u_i, φ̂_i = REG(D, G)  // train the decoder using noise-contrastive learning
return φ̂ : X → {0,1}^{d_h} where for any x ∈ X and i ∈ [d_h] we have φ̂(x)[i] = φ̂_i(x[ĉh_h(i)]).

Decoding Factors. LearnDecoder partitions the set of atoms into groups based on the learned child function ĉh_h (Algorithm 3). For the i-th group ω, we learn a decoder φ̂_{hi} : X → {0,1} by adapting the prediction problem of Misra et al. (2020) to the Factored Block MDP setting. We define a sampling procedure (x, a, x′) ∼ Unf(Ψ_{h−1}) ∘ Unf(A) where x is observed after rolling in with a uniformly selected policy in Ψ_{h−1} until time step h − 1, the action a is taken uniformly, and x′ ∼ T(· | x, a) (line 3). We collect a dataset D of real and imposter transitions. A single datapoint in D is collected by sampling two independent transitions (x^{(1)}, a^{(1)}, x′^{(1)}), (x^{(2)}, a^{(2)}, x′^{(2)}) ∼ Unf(Ψ_{h−1}) ∘ Unf(A) and a Bernoulli random variable y ∼ Bern(1/2).
If y = 1, we add the real transition (x^{(1)}, a^{(1)}, x′^{(1)}[ω], y) to D; otherwise, we add the imposter transition (x^{(1)}, a^{(1)}, x′^{(2)}[ω], y) (line 4). The key difference from Misra et al. (2020) is our use of x′[ω] instead of x′, which allows us to decode a specific latent factor. We train a model to predict the probability that a given transition (x, a, x′[ω]) is real by solving a regression task with model class G (line 5). The bottleneck structure of G allows us to recover a decoder φ̂_i from the learned model. The algorithm also checks for the special case where a factor takes a single value; if it does, we return the decoder that always outputs 0, otherwise we keep φ̂_i. For brevity, we defer the details of this special case to Appendix D.2.2. The decoder for the h-th timestep is given by the composition of the decoders for each group.

Algorithm 4 EstModel(Ψ_{h−1}, φ̂_{h−1}, φ̂_h).
1: Collect a dataset D of n_est triplets (x, a, x′) ∼ Unf(Ψ_{h−1}) ∘ Unf(A)
2: for I, J ∈ C_{≤κ}([d_{h−1}]) satisfying I ∩ J = ∅ do
3:   Estimate P̂(ŝ_h[k] | ŝ_{h−1}[I], ŝ_{h−1}[J], a) from D using φ̂, for all a ∈ A and k ∈ [d_h]
4: For every k, define p̂t_h(k) as the solution to the following (where we bind ŝ′ = ŝ_h and ŝ = ŝ_{h−1}):
   argmin_I max_{u, J_1, J_2, w_1, w_2, a} ‖P̂(ŝ′[k] | ŝ[I] = u, ŝ[J_1] = w_1, a) − P̂(ŝ′[k] | ŝ[I] = u, ŝ[J_2] = w_2, a)‖_TV
5: Define T̂_h(ŝ′ | ŝ, a) = ∏_k P̂(ŝ′[k] | ŝ[p̂t_h(k)], a) and return T̂_h, p̂t_h

Estimating the Model. The EstModel routine first collects a dataset D of n_est independent transitions (x, a, x′) ∼ Unf(Ψ_{h−1}) ∘ Unf(A) (Algorithm 4, line 1). We iterate over disjoint sets of factors I, J of size at most κ. We can view I as the control set and J as the variable set. For every learned factor k ∈ [d_h], factor sets I, J, and action a ∈ A, we estimate the model P̂(ŝ_h[k] | ŝ_{h−1}[I], ŝ_{h−1}[J], a) using count-based statistics on the dataset D (line 3). Consider the case where ĉh_t = ch_t for every t ∈ [h], ignoring the label permutation for brevity.
If I contains the parent factors pt(k), then we expect the value of P̂(ŝ′[k] | ŝ[I], ŝ[J], a) ≈ T_k(ŝ′[k] | ŝ[pt(k)], a) to not change significantly as we vary either the set J or its values. This motivates us to define the learned parent set as the I which achieves the minimum value of this gap (line 4). When computing the gap, we take the max only over those values of ŝ[I] and ŝ[J] which can be reached jointly using a policy in Ψ_{h−1}. This is important since we can only reliably estimate the model for reachable factor values. The learned parent function p̂t_h need not be identical to pt_h even up to relabeling. However, finding the exact parent factors is not necessary for learning an accurate model, and may even be impossible; for example, two factors may always take the same value, making it impossible to distinguish between them. We use the learned parent function p̂t_h to define T̂_h following the factored structure of T (line 5). Learning a Policy Cover. We plan in the latent space using the estimated models {T̂_t}_{t=1}^h to find a policy cover for time step h. Formally, for every I ∈ C_{≤2κ}([d_h]) and Z ∈ {0,1}^{|I|}, we find a policy φ̂_{hIZ} to reach {ŝ_h[I] = Z} using the planner (Algorithm 1, line 7). This policy acts on the learned state space and is easily lifted to act on observations by composing it with the learned decoder. We add every policy that achieves a return of at least O(η_min) to Ψ_h (line 8).
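The parent-identification rule of EstModel can be illustrated on a toy single-parent (κ = 1) problem; the data-generating process and the helper names below are ours, and the max-min spread of conditional probabilities stands in for the TV gap (for binary outcomes the two coincide):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: next-step factor s'[k] copies s[0] with 5% noise, while the
# other candidate parents s[1], s[2] are irrelevant.
n, d = 5000, 3
S = rng.integers(2, size=(n, d))
sk_next = np.where(rng.random(n) < 0.95, S[:, 0], 1 - S[:, 0])

def cond_p1(mask, vals):
    """Empirical P(s'[k] = 1 | s[mask] = vals)."""
    rows = np.all(S[:, mask] == vals, axis=1)
    return sk_next[rows].mean()

def tv_gap(i):
    """Max change in P(s'[k] | s[i] = u, s[j] = w, .) as the 'variable'
    factor j and its value w vary -- small iff i screens off the rest."""
    gap = 0.0
    for j in range(d):
        if j == i:
            continue
        for u in (0, 1):
            ps = [cond_p1([i, j], (u, w)) for w in (0, 1)]
            gap = max(gap, max(ps) - min(ps))
    return gap

parent = min(range(d), key=tv_gap)
print(parent)  # 0 -- the true parent is recovered
```

Conditioning on the true parent makes the conditional distribution insensitive to the variable set, so the argmin picks it out, mirroring line 4 of Algorithm 4.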

5. THEORETICAL ANALYSIS AND DISCUSSION

In this section, we present theoretical guarantees for FactoRL. For technical reasons, we make the following realizability assumption on the function classes F and G.

Assumption 3 (Realizability). For any h ∈ [H], i ∈ [d], and distribution ρ ∈ ∆({0,1}), there exists g_{ihρ} ∈ G such that for all (x, a, x′) ∈ X_{h−1} × A × X_h and x̄ = x′[ch_h(i)] we have:

g_{ihρ}(x, a, x̄) = T_i(φ*_i(x̄) | φ*(x), a) / [T_i(φ*_i(x̄) | φ*(x), a) + ρ(φ*_i(x̄))].

For any h ∈ [H], u, v ∈ [m] with u ≠ v, and any D ∈ ∆(S_h), there exists f_{uvD} ∈ F satisfying:

∀s ∈ supp(D), x ∈ supp(q(· | s)):  f_{uvD}(x[u], x[v]) = D(x[u], x[v]) / [D(x[u], x[v]) + D(x[u]) D(x[v])].

Assumption 3 requires the function classes to be expressive enough to represent the optimal solutions of our regression tasks. Realizability assumptions are common in the literature and are in practice satisfied by using deep neural networks (Sen et al., 2017; Misra et al., 2020).

Theorem 1 (Main Theorem). For any δ > 0, FactoRL returns a transition function T̂_h, a child function ĉh_h, a decoder φ̂_h, and a set of policies Ψ_h for every h ∈ {2, 3, ..., H}, that with probability at least 1 − δ satisfy: (i) ĉh_h is equal to ch_h up to permutation, (ii) Ψ_h is a 1/2-policy cover of S_h, (iii) for every h ∈ [H] there exists a permutation mapping θ_h : {0,1}^d → {0,1}^d such that for every s ∈ S_{h−1}, a ∈ A, s′ ∈ S_h, and x′ ∈ X_h we have:

P(φ̂_h(x′) = θ_h^{−1}(s′) | s′) ≥ 1 − O(η_min² / (κH)),   ‖T(s′ | s, a) − T̂_h(θ_h^{−1}(s′) | θ_{h−1}^{−1}(s), a)‖_TV ≤ η_min / (8H),

and the sample complexity is poly(d^{16κ}, |A|, H, 1/η_min, 1/δ, 1/β_min, 1/σ, ln m, ln |F|, ln |G|).

Discussion. The proof and detailed analysis of Theorem 1 are deferred to Appendix C-D. Our guarantees show that FactoRL is able to discover the latent emission structure, learn a decoder, estimate the model, and learn a policy cover for every timestep.
We set the hyperparameters in order to learn a 1/2-policy cover; however, they can also be set to achieve a desired accuracy for the decoder or the transition model, giving a polynomial sample complexity that depends on this desired accuracy. It is straightforward to plan a near-optimal policy for a given reward function in our learned latent space, using the estimated model and the learned decoder. This incurs zero sample cost apart from the samples needed to learn the reward function. Our results show that we depend polynomially on the number of factors and only logarithmically on the number of atoms. This is appealing for real-world problems where d and m can be quite large. We also depend logarithmically on the size of the function classes. This allows us to use exponentially large function classes; further, as stated before, our results can also be easily extended to Rademacher complexity. Our algorithm makes only a polynomial number of calls to the computational oracles. Hence, if these oracles can be implemented efficiently, then our algorithm has polynomial computational complexity. The squared loss oracles are routinely used in practice, but planning in a fully-observed factored MDP is EXPTIME-complete (see Theorem 2.24 of Mausam (2012)). However, various approximation strategies based on linear programming and dynamic programming have been employed successfully (Guestrin et al., 2003). Our oracle assumptions provide a black-box mechanism to leverage such efforts. Note that all computational oracles incur no additional sample cost and can be simply implemented by enumeration over the search space. Misra et al. (2020) proposed a model-free approach that learns a decoder by training a classifier to distinguish between real and imposter transitions. As optimal policies for factored MDPs do not factorize, a model-free approach is unlikely to succeed in our setting (Sun et al., 2019). Feng et al. (2020) proposed another approach for solving Block MDPs.
They assume access to a purely unsupervised learning oracle that can learn an accurate decoder from a set of observations. This oracle assumption is significantly stronger than those made in Du et al. (2019) and Misra et al. (2020), and reduces the challenge of learning the decoder. Crucially, these three approaches have sample complexity guarantees which depend polynomially on the size of the latent state space. This yields an exponential dependence on d when applied to our setting. It is unclear whether these approaches can be extended to give polynomial dependence on d. For a general discussion of related work, see Appendix B. Proof of Concept Experiments. We empirically evaluate FactoRL to support our theoretical results, and to provide implementation details. We consider a problem with d factors, each emitting 2 atoms. We generate the atoms for factor s[i] by first defining a vector z_i = [1, 0] if s[i] = 0, and z_i = [0, 1] otherwise. We then sample scalar Gaussian noise g_i with mean 0 and standard deviation 0.1, and add it to both components of z_i. The atoms emitted from each factor are concatenated to generate a vector z ∈ R^{2d}. The observation x is generated by applying a fixed time-dependent permutation to z to shuffle atoms from different factors. This ensures that an algorithm cannot recover the child function by relying on the order in which atoms are presented. We consider an action space A = {a_1, a_2, ..., a_d} and non-stationary dynamics. For each time step t ∈ [H], we define σ_t as a fixed permutation of {1, 2, ..., d}. The dynamics at time step t are given by: T_t(s_{t+1} | s_t, a) = ∏_{i=1}^d T_{ti}(s_{t+1}[i] | s_t[i], a), where T_{ti}(s_{t+1}[i] = s_t[i] | s_t[i], a) = 1 for all a ≠ a_{σ_t(i)}, and T_{ti}(s_{t+1}[i] = 1 − s_t[i] | s_t[i], a_{σ_t(i)}) = 1. We evaluate in the setting d = 10 and H = 10. We implement the model classes F and G using feed-forward neural networks.
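The synthetic environment described above can be sketched as follows; the variable names are ours, and the action-to-factor mapping is written as the inverse of σ_t, which is the same construction up to relabeling:

```python
import numpy as np

rng = np.random.default_rng(0)
d, H = 10, 10

# Fixed per-timestep permutations: one shuffles atoms in the emission,
# one maps each action to the factor it toggles (sigma_t in the text).
emit_perm = [rng.permutation(2 * d) for _ in range(H)]
sigma = [rng.permutation(d) for _ in range(H)]

def emit(s, t):
    """Two atoms per factor: z_i = [1,0] or [0,1] plus N(0, 0.1^2) noise
    on both components, concatenated and shuffled by a fixed
    time-dependent permutation."""
    z = np.concatenate([[1 - b, b] for b in s]) + rng.normal(0, 0.1, 2 * d)
    return z[emit_perm[t]]

def step(s, a, t):
    """Deterministic factored dynamics: action a flips one factor."""
    s = s.copy()
    s[sigma[t][a]] ^= 1
    return s

s = np.zeros(d, dtype=int)
x = emit(s, 0)
s2 = step(s, a=3, t=0)
print(x.shape, (s != s2).sum())  # (20,) 1
```

Because the atom permutation is fixed but unknown, the learner must recover the child function from independence structure rather than from positional order, exactly as the experiment intends.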
Specifically, for G we apply the Gumbel-softmax trick to model the bottleneck, following Misra et al. (2020). We train the models using cross-entropy loss instead of the squared loss used in our theoretical analysis. For the independence test task, we declare two atoms to be independent if the best log-loss on the validation set is greater than a threshold c. We train the models using the Adam optimizer and perform model selection using a held-out set. We defer the full model and training details to Appendix F. For each time step, we collect 20,000 samples and share them across all routines. This gives a sample complexity of 20,000 × H. We repeated the experiment 3 times and found that each time, the model was able to perfectly detect the latent child function, learn a 1/2-policy cover, and estimate the model with error < 0.01. This is consistent with our theoretical findings and demonstrates the empirical viability of FactoRL. We will make the code available at: https://github.com/cereb-rl. Conclusion. We introduce Factored Block MDPs, which model the real-world difficulties of rich-observation environments with exponentially large latent state spaces. We also propose a provable RL algorithm called FactoRL for solving a large class of Factored Block MDPs. We hope the setting and ideas in FactoRL will stimulate both theoretical and empirical work in this important area.

APPENDIX ORGANIZATION

This appendix is organized as follows.
• Appendix A provides a list of notations used in this paper.
• Appendix B covers related work.
• Appendix C describes the independence test algorithm and its sample complexity guarantees.
• Appendix D provides sample complexity guarantees for FactoRL.
• Appendix E provides a list of supporting results.
• Appendix F provides details of the experimental setup and optimization.

A NOTATIONS

We present notations and their definition in Table 1 . In general, calligraphic notations represent sets. All logarithms are with respect to base e.

B RELATED WORK

There is a rich literature on sample-efficient reinforcement learning in tabular MDPs with a small number of observed states (Brafman & Tennenholtz, 2002; Strehl et al., 2006; Kearns & Singh, 2002; Jaksch et al., 2010; Azar et al., 2017; Jin et al., 2018). While recent state-of-the-art results along this line achieve near-optimal sample complexity, these algorithms do not exploit latent structure in the environment and therefore cannot scale to many practical settings, such as rich-observation environments with a possibly large number of factored latent states. To overcome this challenge, one line of research has focused on factored MDPs (Kearns & Koller, 1999; Guestrin et al., 2002; 2003; Strehl et al., 2010), which allow a combinatorial number of observed states with a factorized structure. Planning in factored MDPs is EXPTIME-complete (Mausam, 2012) yet often tractable in practice, and factored MDPs are statistically learnable with a number of samples polynomial in the number of parameters that encode the factored MDP (Osband & Van Roy, 2014; Li et al., 2011). There have also been several empirical works that focus either on the factored state space setting (e.g., Kim & Mnih (2018)) or on the factored action space setting (e.g., He et al. (2016); Sharma et al. (2017)). However, these works do not directly address our problem and do not provide sample complexity guarantees. Another line of work focuses on exploration in rich-observation environments. Empirical results (Tang et al., 2017; Chen et al., 2017; Bellemare et al., 2016; Pathak et al., 2017) have achieved inspiring performance on several RL benchmarks, while theoretical works (Krishnamurthy et al., 2016; Jiang et al., 2017) show that it is information-theoretically possible to explore these environments. As discussed before, the recent works of Du et al. (2019) and Misra et al. (2020) provide provable algorithms for rich-observation environments with non-factored latent states.

C INDEPENDENCE TESTING USING NOISE CONTRASTIVE ESTIMATION

In this section, we introduce the independence testing algorithm (Algorithm 5) and provide its theoretical guarantees. Algorithm 5 is used in Algorithm 2 as a subroutine for determining whether two atoms are emitted by the same latent factor. We comment that the high-level idea of Algorithm 5 is similar to Sen et al. (2017), which reduces independence testing to regression by adding imposter samples.

Published as a conference paper at ICLR 2021

Table 1: Notations.
N: Space of natural numbers {1, 2, ...}.
Z: Space of integers {..., −2, −1, 0, 1, 2, ...}.
Z_{≥0}: Space of non-negative integers {0, 1, 2, ...}, equivalent to N ∪ {0}.
[N]: Given N ∈ N, denotes the set {1, 2, ..., N}.
C_{≤k}(U): Given an ordered set U and k ∈ Z_{≥0}, denotes the set of all ordered subsets of U with at most k elements, including the empty set.
u[I]: For a given n-dimensional vector u and I = (i_1, i_2, ..., i_k) ∈ 2^{[n]}, denotes the k-dimensional vector (u[i_1], u[i_2], ..., u[i_k]).
[I; J]: Denotes the concatenation of two ordered sets I and J.
∆(U): Set of all possible distributions over the set U.
H: Horizon of the problem, denoting the maximum number of actions an agent takes in a single episode.
pt_h : [d] → 2^{[d]}: Parent function for time step h. We drop h when it is clear.
ch_h : [d] → 2^{[m]}: Child function for time step h. By definition, for any i, j ∈ [d] with i ≠ j we have ch_h(i) ∩ ch_h(j) = ∅. We drop h when clear.
q : S → ∆(X): Emission function that generates x ∼ q(· | s) given s ∈ S.
T_i: i-th product term in the transition function. Formally, for s, s′ ∈ S and a ∈ A we have T(s′ | s, a) = ∏_{i=1}^d T_i(s′[i] | s[pt(i)], a).
q_i: i-th product term in the emission function. Formally, for x ∈ X with φ*(x) = s we have q(x | s) = ∏_{i=1}^d q_i(x[ch(i)] | s[i]).
F, G: Regressor classes. We denote individual members of the classes as f ∈ F and g ∈ G.
Φ : X → {0, 1}: Decoder class.

Let P_D(x_h[u], x_h[v]), P_D(x_h[u]), and P_D(x_h[v]) be the joint and marginal distributions over x_h[u], x_h[v] with respect to the roll-in distribution D. The goal of our algorithm is to determine whether x_h[u] and x_h[v] are independent under P_D.

Algorithm 5 IndTest(F, D, u, v, β). Oraclized Independence Test.
1: Initialize D_train = ∅ and sample z_1, z_2, ..., z_n ∼ Bern(1/2).
2: for i ∈ [n] do
3:   if z_i = 1 then
4:     D_train ← D_train ∪ {(x^{(i,1)}[u], x^{(i,1)}[v], 1)}
5:   else
6:     D_train ← D_train ∪ {(x^{(i,1)}[u], x^{(i,2)}[v], 0)}
7: Compute f̂ := argmin_{f ∈ F} L(D_train, f), where L(D_train, f) := (1/n) Σ_{(x[u], x[v], z) ∈ D_train} (f(x[u], x[v]) − z)².
8: return Independent if L(D_train, f̂) > 1/4 − β²/10³, else return Dependent.
We solve this task using the IndTest algorithm (Algorithm 5), which takes as input a dataset of observation pairs D = {(x^{(i,1)}, x^{(i,2)})}_{i=1}^n where x^{(i,1)}, x^{(i,2)} ∼ P_D(·, ·), and a scalar β ∈ (0, 1). We use D to create a dataset D_train of real and imposter atom pairs (x[u], x[v]). This is done by taking every datapoint in D and sampling a Bernoulli random variable z_i ∼ Bern(1/2) (line 1). If z_i = 1, we add the real pair (x^{(i,1)}[u], x^{(i,1)}[v], 1) to D_train (line 4); otherwise we add the imposter pair (x^{(i,1)}[u], x^{(i,2)}[v], 0) (line 6). We train a classifier to predict whether a given atom pair (x[u], x[v]) is real or an imposter (line 7). The Bayes optimal classifier for this problem is given by:

∀x ∈ supp(P_D), f_D(x[u], x[v]) := P_D(z = 1 | x[u], x[v]) = P_D(x[u], x[v]) / [P_D(x[u], x[v]) + P_D(x[u]) P_D(x[v])].

If x[u] and x[v] are independent, then we have P_D(x[u], x[v]) = P_D(x[u]) P_D(x[v]) everywhere on the support of P_D. This implies f_D(x) = 1/2, and its training loss will concentrate around 0.25. Intuitively, this can be interpreted as the classifier having no information with which to tell real samples from imposter samples. However, if x[u] and x[v] are dependent and ‖P_D(x[u], x[v]) − P_D(x[u]) P_D(x[v])‖_TV ≥ β, then we can show the training loss of f̂ is less than 0.25 − O(β²) with high probability. The remainder of this section is devoted to a rigorous proof of this argument.

C.2 ANALYSIS OF ALGORITHM 5

Before analyzing Algorithm 5, we slightly simplify the notation. We introduce X and Y to represent the random variables x[u] and x[v], respectively, and simply use D to denote the joint distribution of X and Y. Define D_train to be the distribution of the training data (X_train, Y_train, z) produced in Algorithm 5. It is easy to verify that

D_train(X_train, Y_train, z) = (1/2) [z D(X_train, Y_train) + (1 − z) D(X_train) D(Y_train)].

Suppose the distribution D is specially designed such that at least one of the following hypotheses holds (which can be guaranteed when we invoke Algorithm 5):

H_0: ‖D(X, Y) − D(X)D(Y)‖_1 ≥ β/2   vs.   H_1: ‖D(X, Y) − D(X)D(Y)‖_1 ≤ β²/10³.

In the remaining part, we prove that Algorithm 5 correctly distinguishes between H_0 and H_1 with high probability.
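The separation quantity in H_0/H_1, the ℓ1 gap between the joint and the product of marginals, is easy to compute for a discrete joint; a small check (the example tables below are ours):

```python
import numpy as np

def l1_gap(joint):
    """||D(X,Y) - D(X)D(Y)||_1 for a discrete joint given as a 2-D table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal D(X), column vector
    py = joint.sum(axis=0, keepdims=True)   # marginal D(Y), row vector
    return np.abs(joint - px * py).sum()

print(l1_gap([[0.5, 0.0], [0.0, 0.5]]))      # 1.0 (fully dependent: X = Y)
print(l1_gap([[0.25, 0.25], [0.25, 0.25]]))  # 0.0 (independent)
```

The test only needs to separate a large gap (H_0) from a tiny one (H_1), which is what the loss threshold in line 8 of Algorithm 5 accomplishes.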

C.2.1 TWO PROPERTIES OF THE BAYES OPTIMAL CLASSIFIER

Our first lemma shows that the Bayes optimal classifier for the optimization problem in line 7 is the constant function 1/2 when X and Y are independent.

Lemma 1 (Bayes Optimal Classifier for Independent Case). In Algorithm 5, if X and Y (atoms u and v) are independent under distribution D, then for the optimization problem in line 7, the Bayes optimal classifier is given by: f*(X_train, Y_train) = 1/2 for all (X_train, Y_train).

Proof. From Bayes' rule we have:

f*(X_train, Y_train) = D_train(z = 1 | X_train, Y_train)
= D_train(X_train, Y_train | z = 1) D_train(z = 1) / [D_train(X_train, Y_train | z = 1) D_train(z = 1) + D_train(X_train, Y_train | z = 0) D_train(z = 0)]
= D_train(X_train, Y_train | z = 1) / [D_train(X_train, Y_train | z = 1) + D_train(X_train, Y_train | z = 0)],

where the last identity uses D_train(z = 1) = D_train(z = 0) = 1/2. When z = 0, we collect X_train and Y_train from two independent samples. Therefore, we have D_train(X_train, Y_train | z = 0) = D(X_train) D(Y_train). When z = 1, using the fact that X_train and Y_train are independent under distribution D, we also have D_train(X_train, Y_train | z = 1) = D(X_train, Y_train) = D(X_train) D(Y_train). Consequently, f*(X_train, Y_train) ≡ 1/2.

Our second lemma provides an upper bound on the expected training loss of the Bayes optimal classifier. Later we will use this lemma to show that the training loss is less than 0.25 − O(β²) with high probability when H_0 holds.

Lemma 2 (Square Loss of the Bayes Optimal Classifier). In Algorithm 5 line 7, the Bayes optimal classifier f* has expected square loss

E_{D_train} L(f*, D_train) ≤ 1/4 − (1/2) E_D[(1/2 − D(X, Y)/(D(X)D(Y) + D(X, Y)))²].

Proof. Recall the formula for the Bayes optimal classifier (derived as in Lemma 1, without the independence step):

f*(X, Y) = D(X, Y) / (D(X)D(Y) + D(X, Y)).

Plugging it into the square loss, we obtain

E_{D_train}[(f*(X, Y) − z)²] = E_{D_train}[f*(X, Y)(f*(X, Y) − 1)² + (1 − f*(X, Y)) f*(X, Y)²]
= E_{D_train}[f*(X, Y)(1 − f*(X, Y))]
= 1/4 − E_{D_train}[(1/2 − D(X, Y)/(D(X)D(Y) + D(X, Y)))²]
≤ 1/4 − (1/2) E_D[(1/2 − D(X, Y)/(D(X)D(Y) + D(X, Y)))²],

where the last step uses that the marginal of D_train over (X, Y) is the mixture (1/2)(D + D(X)D(Y)), which places mass at least (1/2)D on every point.
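The identity at the heart of this proof, E[(f* − z)²] = 1/4 − E[(1/2 − f*)²] under D_train, can be verified numerically for a hypothetical discrete joint D (the table below is ours):

```python
import numpy as np

# A discrete joint D over (X, Y) with X, Y in {0, 1}.
D = np.array([[0.4, 0.1], [0.1, 0.4]])
px, py = D.sum(1, keepdims=True), D.sum(0, keepdims=True)
prod = px * py  # product of marginals D(X)D(Y)

# D_train mixes the joint (z=1) and the product of marginals (z=0)
# with weight 1/2 each; the Bayes optimal regressor is D / (D + prod).
f_star = D / (D + prod)

# Expected squared loss of f* under D_train, expanded cell by cell.
loss = 0.5 * ((D * (f_star - 1) ** 2).sum() + (prod * f_star ** 2).sum())
gap = 0.5 * ((D * (0.5 - f_star) ** 2).sum()
             + (prod * (0.5 - f_star) ** 2).sum())
print(np.isclose(loss, 0.25 - gap))  # True
```

Since the joint here is dependent, the loss sits strictly below 1/4, which is exactly the signal Theorem 2 exploits.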

C.2.2 THREE USEFUL LEMMAS

To proceed, we take a short detour and prove three useful technical lemmas.

Lemma 3. Let µ and ν be two probability measures defined on a countable set X. If ‖µ − ν‖_TV ≥ c, then

E_{x∼µ} |µ(x)/(µ(x) + ν(x)) − 1/2| ≥ c/4.

Proof.

E_{x∼µ} |µ(x)/(µ(x) + ν(x)) − 1/2| = E_{x∼µ}[(1/2) |µ(x) − ν(x)|/(µ(x) + ν(x))]
≥ (1/2) E_{x∼µ}[1{µ(x) > ν(x)} (µ(x) − ν(x))/(µ(x) + ν(x))]
= (1/2) Σ_{x∈X} µ(x) 1{µ(x) > ν(x)} (µ(x) − ν(x))/(µ(x) + ν(x))
≥ (1/4) Σ_{x∈X} 1{µ(x) > ν(x)} (µ(x) − ν(x))
≥ c/4.

Lemma 4. Fix δ ∈ (0, 1). Then with probability at least 1 − δ, we have

|L(f̂, D_train) − E_{D_train} L(f*, D_train)| ≤ 10 √(C(F, δ)/n),

where D_train here denotes the training set consisting of n i.i.d. samples drawn from the distribution D_train, f̂ is the empirical minimizer of L(f, D_train) over F, f* is the population minimizer, and C(F, δ) := ln(|F|/δ) is the complexity measure of the function class F.

Proof. By Hoeffding's inequality and a union bound, with probability at least 1 − δ, for every f ∈ F we have

|L(f, D_train) − E_{D_train} L(f, D_train)| ≤ 10 √(C(F, δ)/n).

Because f̂ is the empirical minimizer,

L(f̂, D_train) ≤ L(f*, D_train) ≤ E_{D_train} L(f*, D_train) + 10 √(C(F, δ)/n).

Because f* is the population minimizer,

L(f̂, D_train) ≥ E_{D_train} L(f̂, D_train) − 10 √(C(F, δ)/n) ≥ E_{D_train} L(f*, D_train) − 10 √(C(F, δ)/n).

Combining the two inequalities above, we finish the proof.

For notational convenience, we introduce the following factor distribution D_factor defined on the same domain as (X, Y, z): D_factor(X, Y, z) = (1/2) D(X) D(Y).

Lemma 5. Suppose F contains the constant function f ≡ 1/2. Then with probability at least 1 − δ, we have

|L(f̂, D_train) − 1/4| ≤ 10 √(C(F, δ)/n) + 2 ‖D_train − D_factor‖_TV.

Proof. By Lemma 4, with probability at least 1 − δ, we have

|L(f̂, D_train) − E_{D_train} L(f*, D_train)| ≤ 10 √(C(F, δ)/n).

Noticing that L is bounded by 1, we have for every f ∈ F

|E_{D_factor} L(f, ·) − E_{D_train} L(f, ·)| ≤ 2 ‖D_train − D_factor‖_TV,

where E_{D_factor} L(f, ·) denotes the expected loss of f when the training samples are drawn i.i.d. from D_factor. Since z is a symmetric Bernoulli random variable independent of (X, Y) under distribution D_factor, and f ≡ 1/2 ∈ F, we have

min_{f∈F} E_{D_factor} L(f, ·) = 1/4.

Using the inequality |min_f L_1(f) − min_f L_2(f)| ≤ max_f |L_1(f) − L_2(f)| for any functionals L_1, L_2, together with the two displays above, we bound the minimum loss under D_train:

|min_{f∈F} E_{D_train} L(f, ·) − 1/4| ≤ 2 ‖D_train − D_factor‖_TV.

Combining this with the first display completes the proof.

C.2.3 MAIN THEOREM FOR ALGORITHM 5

Finally, we are ready to state and prove the main theorem for Algorithm 5.

Theorem 2. Under the realizability assumption and n ≥ Ω(C(F, δ)/β⁴), Algorithm 5 distinguishes

H_0: ‖D(X, Y) − D(X)D(Y)‖_1 ≥ β/2   vs.   H_1: ‖D(X, Y) − D(X)D(Y)‖_1 ≤ β²/10³

correctly with probability at least 1 − δ.

Proof. If H_1 holds, by Lemma 5 we have the following lower bound on the training loss of the empirical minimizer:

L(f̂, D_train) ≥ 1/4 − 10 √(C(F, δ)/n) − β²/10³.

In contrast, if H_0 is true, applying Lemma 4 we obtain

L(f̂, D_train) ≤ E_{D_train} L(f*, D_train) + 10 √(C(F, δ)/n).

Invoking Lemma 2 and Lemma 3,

E_{D_train} L(f*, D_train) ≤ 1/4 − (1/2) E_D[(1/2 − D(X, Y)/(D(X)D(Y) + D(X, Y)))²] ≤ 1/4 − β²/256.

Therefore, under H_0, the training loss of the empirical minimizer is upper bounded as

L(f̂, D_train) ≤ 1/4 − β²/256 + 10 √(C(F, δ)/n).

Plugging n ≥ Ω(C(F, δ)/β⁴) into the two bounds separates the two cases and completes the proof.

D THEORETICAL ANALYSIS OF FactoRL

In this section, we provide a detailed theoretical analysis of FactoRL. The algorithm is iterative, which makes an inductive argument natural. We therefore state an induction hypothesis for each time step, which we guarantee holds at the end of that time step.

Induction Hypothesis. We make the following induction assumptions for FactoRL under Assumptions 1-3, for all time steps t ∈ {2, 3, ..., H}.

IH.1 The learned child function ĉh_t agrees with ch_t up to a permutation of factor labels: for all atoms u, v ∈ [m], ĉh_t⁻¹(u) = ĉh_t⁻¹(v) if and only if ch_t⁻¹(u) = ch_t⁻¹(v). Note that a child function is invertible by definition. We can ignore this label permutation and assume ĉh_t = ch_t for cleaner expressions; this has no effect on the analysis. We assume ĉh_t = ch_t when stating the next three induction hypotheses.

IH.2 There exists a permutation mapping θ_t : {0,1}^d → {0,1}^d and ε ∈ (0, 1/(2d)) such that for every i ∈ [d] and s ∈ S_t we have:
P( φ̂_t(x_t)[i] = θ_t⁻¹(s)[i] | s[i] ) ≥ 1 − ε,  P( φ̂_t(x_t) = θ_t⁻¹(s) | s ) ≥ 1 − dε ≥ 1/2.
The two probabilities are independent of the roll-in distribution at time step t. The first holds as φ̂_t(x_t)[i] only depends upon the value x_t[ĉh_t(i)] = x_t[ch_t(i)], which only depends on s[i]. The second holds as x_t is independent of everything else given s_t. The form of ε will become clear at the end of the analysis.

IH.3 For every s ∈ S_{t−1}, s′ ∈ S_t and a ∈ A we have:
‖ T̂_t(θ_t⁻¹(s′) | θ_{t−1}⁻¹(s), a) − T(s′ | s, a) ‖_TV ≤ 3d(∆_est + ∆_app),
where ∆_est, ∆_app > 0 denote estimation and approximation errors whose forms will become clear at the end of the analysis.

IH.4 For every s ∈ S_t and K ∈ C_{≤2κ}([d]), let Z = s[K] and Ẑ = θ_t⁻¹(s)[K]; then there exists a policy π_{K,Ẑ} ∈ Ψ_t such that:
P_{π_{K,Ẑ}}( s_t[K] = Z ) ≥ α η_t(K, Z) ≥ α η_min.

Base Case. In the base case (t = 1), we have a deterministic start state. Therefore, we can without loss of generality assume a single factor and define ch_1[1] = [m].
As we can also define ĉh_1[1] = [m] without loss of generality, this trivially satisfies induction hypothesis IH.1. We define φ̂_1 : X → {0}^d (Algorithm 1, line 1). This satisfies IH.2 with θ_1 being the identity map. IH.3 is vacuous since there is no transition function before time step 1. For the last condition, for any K we have Z = 0^{|K|} and Ẑ = 0^{|K|}, and for any policy π:
P_π( s_1[K] = Z ) = P_π( φ̂_1(x_1)[K] = Ẑ ) = 1 ≥ η_min/2.
Note that we never take any action from this policy; therefore, we can simply define Ψ_1 = ∅.

D.1 GRAPH STRUCTURE IDENTIFICATION

In this section, we analyze the performance of Algorithm 2, given as input Ψ_{h−1}, φ̂_{h−1}, F, β and n. We analyze the performance for a fixed pair of atoms u, v ∈ [m], and then derive the full result using a union bound. We first state the result for rejection sampling.

Lemma 6. For a policy π_{I,Ẑ} ∈ Ψ_{h−1}, the event E_{I,Ẑ} = {ŝ_{h−1}[I] = Ẑ} and k ∈ N, let D^reject_{I,Ẑ} := RejectSamp(π_{I,Ẑ}, E_{I,Ẑ}, k) be the distribution induced by our rejection sampling procedure. Let Z = θ(Ẑ) denote the real factor values corresponding to Ẑ. Then we have:
P_{D^reject_{I,Ẑ}}( s_{h−1}[I] = Z ) ≥ 1 − ε − (1 − η_min/4)^k.  (8)

Proof. From IH.4 we have P_{π_{I,Ẑ}}( s_{h−1}[I] = Z ) ≥ η_min/2. This implies:
P_{π_{I,Ẑ}}( ŝ_{h−1}[I] = Ẑ ) ≥ P_{π_{I,Ẑ}}( ŝ_{h−1}[I] = Ẑ | s_{h−1}[I] = Z ) P_{π_{I,Ẑ}}( s_{h−1}[I] = Z ) ≥ (1 − dε) η_min/2 ≥ η_min/4,
using IH.2 and IH.4. Let a = P_{π_{I,Ẑ}}( ŝ_{h−1}[I] = Ẑ ) be the acceptance probability of event E_{I,Ẑ}. Then the probability of the event occurring under D^reject_{I,Ẑ} is:
P_{D^reject_{I,Ẑ}}( E_{I,Ẑ} ) = a + (1−a)a + (1−a)²a + ⋯ + (1−a)^{k−1}a = 1 − (1−a)^k ≥ 1 − (1 − η_min/4)^k.
We decompose the failure probability as follows:
P_{D^reject_{I,Ẑ}}( s_{h−1}[I] ≠ Z ) = P_{D^reject_{I,Ẑ}}( s_{h−1}[I] ≠ Z, ŝ_{h−1}[I] ≠ Ẑ ) + P_{D^reject_{I,Ẑ}}( s_{h−1}[I] ≠ Z, ŝ_{h−1}[I] = Ẑ ).  (9)
We bound the two terms below:
P_{D^reject_{I,Ẑ}}( s_{h−1}[I] ≠ Z, ŝ_{h−1}[I] ≠ Ẑ ) ≤ P_{D^reject_{I,Ẑ}}( ŝ_{h−1}[I] ≠ Ẑ ) ≤ (1 − η_min/4)^k,  (10)
P_{D^reject_{I,Ẑ}}( s_{h−1}[I] ≠ Z, ŝ_{h−1}[I] = Ẑ ) ≤ P_{D^reject_{I,Ẑ}}( ŝ_{h−1}[I] = Ẑ | s_{h−1}[I] ≠ Z ) ≤ ε.  (11)
Combining Equations 9, 10 and 11 we get:
P_{D^reject_{I,Ẑ}}( s_{h−1}[I] = Z ) = 1 − P_{D^reject_{I,Ẑ}}( s_{h−1}[I] ≠ Z ) ≥ 1 − ε − (1 − η_min/4)^k.  (12)

We now analyze the situation for a given pair of atoms. Recall that for any distribution D ∈ ∆(S_{h−1}) and a ∈ A, we denote by D ∘ a the distribution over S_h where s′ ∼ D ∘ a is sampled by first sampling s ∼ D and then s′ ∼ T(· | s, a).
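The rejection-sampling procedure analyzed in Lemma 6 can be sketched as follows (a minimal sketch with a hypothetical interface: `run_episode` produces a trajectory and `is_event` checks whether the decoded event E holds). The toy check at the end confirms the 1 − (1−a)^k acceptance rate used in the proof:

```python
import random

def reject_samp(run_episode, is_event, k):
    """Roll out up to k episodes and return the first trajectory satisfying
    is_event; if all k attempts fail, return the last attempt."""
    traj = run_episode()
    for _ in range(k - 1):
        if is_event(traj):
            break
        traj = run_episode()
    return traj

random.seed(0)
a, k, n = 0.3, 10, 20000
run = lambda: random.random() < a   # an "episode" on which the event holds w.p. a
acc = sum(reject_samp(run, bool, k) for _ in range(n)) / n
assert abs(acc - (1 - (1 - a) ** k)) < 0.01   # empirical rate matches 1 - (1-a)^k
```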
We want to derive roll-in distributions at time step h such that atoms coming from the same factor satisfy hypothesis H₀ and atoms coming from different factors satisfy hypothesis H₁ under this roll-in distribution. This allows us to use the independence test to identify the parent structure in the emission process. Specifically, we consider the roll-in distributions induced by D^reject_{I,Ẑ} ∘ a for some set I and action a. Instantiating the definitions of these hypotheses from Appendix C with these roll-in distributions and setting β = β_min gives us:
H₀: ‖ P_{D^reject_{I,Ẑ}∘a}(x[u], x[v]) − P_{D^reject_{I,Ẑ}∘a}(x[u]) P_{D^reject_{I,Ẑ}∘a}(x[v]) ‖₁ ≥ β_min/2
versus
H₁: ‖ P_{D^reject_{I,Ẑ}∘a}(x[u], x[v]) − P_{D^reject_{I,Ẑ}∘a}(x[u]) P_{D^reject_{I,Ẑ}∘a}(x[v]) ‖₁ ≤ β²_min/100.

Lemma 7 (Same Factor). If two atoms u, v satisfy ch_h⁻¹(u) = ch_h⁻¹(v), i.e., they come from the same factor, then hypothesis H₀ is true for D ∘ a for any D ∈ ∆(S_{h−1}) and a ∈ A. In particular, this holds for D^reject_{I,Ẑ} ∘ a for any choice of I, Ẑ and action a.

Proof. Follows directly from Assumption 2.

Lemma 8 (Different Factors). Let two atoms u, v satisfy ch_h⁻¹(u) = i and ch_h⁻¹(v) = j with i ≠ j. If I contains pt_h(i) ∪ pt_h(j), ε ≤ β²_min/1200 and k ≥ (8/η_min) ln(30/β_min), then hypothesis H₁ holds for D^reject_{I,Ẑ} ∘ a for any Ẑ such that π_{I,Ẑ} ∈ Ψ_{h−1} and any a ∈ A.

Proof. Let D̄ ∈ ∆(S_{h−1}) be a distribution that deterministically sets s_{h−1}[I] = Z. Then it is easy to verify that P_{D̄∘a}(x_h[u], x_h[v]) = P_{D̄∘a}(x_h[u]) P_{D̄∘a}(x_h[v]) for any a ∈ A and x_h ∈ X_h.
Then for any Ẑ and action a ∈ A, the triangle inequality gives:
‖ P_{D^reject_{I,Ẑ}∘a}(x_h[u], x_h[v]) − P_{D^reject_{I,Ẑ}∘a}(x_h[u]) P_{D^reject_{I,Ẑ}∘a}(x_h[v]) ‖₁
≤ ‖ P_{D^reject_{I,Ẑ}∘a}(x_h[u], x_h[v]) − P_{D̄∘a}(x_h[u], x_h[v]) ‖₁ + ‖ P_{D̄∘a}(x_h[u]) − P_{D^reject_{I,Ẑ}∘a}(x_h[u]) ‖₁ + ‖ P_{D̄∘a}(x_h[v]) − P_{D^reject_{I,Ẑ}∘a}(x_h[v]) ‖₁.
As x_h[u] and x_h[v] come from different factors, we have P(x_h[u], x_h[v] | s_{h−1}, a) = P(x_h[u] | s_{h−1}[I], a) P(x_h[v] | s_{h−1}[I], a). We use this to bound the three terms in the summation above:
‖ P_{D^reject_{I,Ẑ}∘a}(x_h[u], x_h[v]) − P_{D̄∘a}(x_h[u], x_h[v]) ‖₁
= Σ_{x_h[u], x_h[v]} | Σ_{s_{h−1}[I]} P(x_h[u] | s_{h−1}[I], a) P(x_h[v] | s_{h−1}[I], a) ( P_{D^reject_{I,Ẑ}}(s_{h−1}[I]) − P_{D̄}(s_{h−1}[I]) ) |
≤ Σ_{s_{h−1}[I]} Σ_{x_h[u], x_h[v]} P(x_h[u] | s_{h−1}[I], a) P(x_h[v] | s_{h−1}[I], a) | P_{D^reject_{I,Ẑ}}(s_{h−1}[I]) − P_{D̄}(s_{h−1}[I]) |
≤ Σ_{s_{h−1}[I]} | P_{D^reject_{I,Ẑ}}(s_{h−1}[I]) − P_{D̄}(s_{h−1}[I]) |
= 1 − P_{D^reject_{I,Ẑ}}( s_{h−1}[I] = Z ) + Σ_{s_{h−1}[I] ≠ Z} P_{D^reject_{I,Ẑ}}(s_{h−1}[I])
= 2 ( 1 − P_{D^reject_{I,Ẑ}}( s_{h−1}[I] = Z ) )
≤ 2ε + 2(1 − η_min/4)^k.
The other two terms are bounded similarly, which gives us:
‖ P_{D^reject_{I,Ẑ}∘a}(x_h[u], x_h[v]) − P_{D^reject_{I,Ẑ}∘a}(x_h[u]) P_{D^reject_{I,Ẑ}∘a}(x_h[v]) ‖₁ ≤ 6ε + 6(1 − η_min/4)^k.
We want this quantity to be less than β²_min/100 to satisfy hypothesis H₁. Distributing the errors equally and using ln(1 + a) ≤ a for all a > −1 gives:
ε ≤ β²_min/1200,  k ≥ (8/η_min) ln(30/β_min).  (13)

Theorem 3 (Learning ch_h). Fix δ_ind ∈ (0, 1). If ε ≤ β²_min/1200 and k ≥ (8/η_min) ln(30/β_min), then the learned ĉh_h is equivalent to ch_h up to a label permutation with probability at least 1 − δ_ind.

Proof. For any pair of atoms u, v coming from the same factor, H₀ holds by Lemma 7 and IndTest marks them dependent with probability at least 1 − δ. This holds for every triplet (I, Ẑ, a), and there are at most |A|(2ed)^{2κ+1} of them. Hence, by a union bound, we mark u, v correctly as coming from the same factor with probability at least 1 − |A|(2ed)^{2κ+1}δ.
If u and v come from different factors, then for any I containing the parents of both of them, and any value of Ẑ and a, H₁ holds by Lemma 8 and IndTest marks them as independent. Such an I exists since we iterate over all possible sets of size up to 2κ. Hence, with probability at least 1 − |A|2^κδ, we find u and v to be independent for every Ẑ and a, and our algorithm correctly marks them as coming from different factors.

For a given pair u, v, we therefore predict the correct output with probability at least 1 − |A|(2ed)^{2κ+1}δ. Using a union bound over all pairs, we output the right result for every u and v with probability at least 1 − |A|m²(2ed)^{2κ+1}δ.
From Theorem 2, we require n_ind ≥ Ω( (1/β⁴_min) ln(|F|/δ) ). Binding |A|m²(2ed)^{2κ+1}δ to δ_ind then gives the required value of n_ind to achieve a success probability of at least 1 − δ_ind. If we correctly assess the dependence of every pair of atoms, then partitioning the atoms by the dependence equivalence relation yields a ĉh_h that is the same as ch_h up to a label permutation.
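The final partitioning step can be sketched as a connected-components computation (a sketch: `dependent(u, v)` stands in for the outcome of IndTest aggregated over all (I, Ẑ, a) triplets):

```python
def learn_child_function(num_atoms, dependent):
    """Partition atoms into factors: atoms u, v share a factor iff the pairwise
    independence test marks them dependent (union-find over the dependence
    equivalence relation)."""
    parent = list(range(num_atoms))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    for u in range(num_atoms):
        for v in range(u + 1, num_atoms):
            if dependent(u, v):
                parent[find(u)] = find(v)

    groups = {}
    for u in range(num_atoms):
        groups.setdefault(find(u), []).append(u)
    return list(groups.values())

# toy check: atoms {0,1,2} come from one factor and {3,4} from another
same = lambda u, v: (u < 3) == (v < 3)
assert sorted(map(sorted, learn_child_function(5, same))) == [[0, 1, 2], [3, 4]]
```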

D.2 LEARNING A STATE DECODER

We focus on the task of learning an abstraction at time step h using Algorithm 3. We have access to ĉh_h, which is the same as ch_h up to a label permutation; we showed how to do this in Appendix C. We ignore the label permutation to avoid complicating our notation. This means we recover a decoder φ̂_h = (φ̂_{h1}, ..., φ̂_{hd}), where there is a bijection between the learned decoders {φ̂_{hi}} and the true decoders {φ_{hj}}. As we learn each decoder φ̂_{hi} independently of the others, we focus on learning the decoder φ̂_{hi} for a fixed i. The same analysis holds for every other decoder, and an application of the union bound establishes guarantees for all decoders. Further, since we are learning the decoder at a fixed time step h, we drop h from the subscript for brevity. We use the additional shorthand described below, some of which is visualized in Figure 2:

• s and x denote a state s_{h−1} and an observation x_{h−1} at time step h−1.
• s′ and x′ denote a state s_h and an observation x_h at time step h.
• s′[i] denotes the i-th factor of the state at time step h.
• š denotes s[pt_h(i)], the set of parent factors of s′[i]. Recall that by the factorization assumption, we have T(s′[i] | s, a) = T_i(s′[i] | š, a) for any s, a.
• φ_i denotes φ_{hi}, the decoder for the i-th factor at time step h.
• ω denotes ch_h(i), the set of indices of atoms emitted by s′[i].
• x̌ denotes x′[ch_h(i)], the collection of atoms generated by s′[i].
• N = |Ψ_{h−1}| is the size of the policy cover for the previous time step.

Let D = {(x^(k), a^(k), x̌^(k), y^(k))}_{k=1}^{n_abs} be a dataset of n_abs real transitions (y = 1) and imposter transitions (y = 0) collected in Algorithm 3, lines 2-4. We define the empirical risk minimizer (ERM) solution as:
ĝ_i = argmin_{g∈G} (1/n_abs) Σ_{k=1}^{n_abs} ( g(x^(k), a^(k), x̌^(k)) − y^(k) )².  (14)
Recall that by the structure of G, we have ĝ_i = (ŵ_i, φ̂_i), where ŵ_i ∈ W₂ and φ̂_i ∈ Φ : X* → {0, 1} is the learned decoder.
Our algorithm only cares about the properties of the decoder, and we throw away the regressor ŵ_i. Let D(x, a, x̌) be the marginal distribution over transitions, obtained by marginalizing over real (y = 1) and imposter (y = 0) transitions. We also define D(x, a) as the marginal distribution over (x, a). We have D(x, a) = µ_{h−1}(x)/|A|, as both real and imposter transitions involve sampling x ∼ µ_{h−1} and taking an action uniformly. Recall that µ_{h−1} is generated by rolling in with a uniformly selected policy in Ψ_{h−1} until time step h−1. Let P(x, a, x̌ | y = 1) be the probability of a real transition and P(x, a, x̌ | y = 0) the probability of an imposter transition. We can express these probabilities as:
P(x, a, x̌ | y = 1) = D(x, a) T(x̌ | x, a),  P(x, a, x̌ | y = 0) = D(x, a) ρ(x̌),
where ρ(x̌) = E_{(x,a)∼D}[ T(x̌ | x, a) ] is the marginal distribution over x̌. We overload the notation ρ analogously for x′ and for individual factor values, e.g., ρ(s′[i] = z). Lastly, we can express the marginal distribution over transitions as:
D(x, a, x̌) = P(x, a, x̌ | y = 1) P(y = 1) + P(x, a, x̌ | y = 0) P(y = 0) = ( µ_{h−1}(x)/(2|A|) ) { T(x̌ | x, a) + ρ(x̌) }.
We start by expressing the Bayes optimal classifier for the problem in Equation 14.

Lemma 9 (Bayes Optimal Classifier). The Bayes optimal classifier g* for the problem in Equation 14 is given by: for all (x, a, x̌) ∈ supp D,
g*(x, a, x̌) = T_i( φ_i(x̌) | φ(x)[pt(i)], a ) / ( T_i( φ_i(x̌) | φ(x)[pt(i)], a ) + ρ( φ_i(x̌) ) ).
Proof. The Bayes optimal classifier is g*(x, a, x̌) = P(y = 1 | x, a, x̌), which can be expressed using Bayes rule as:
P(y = 1 | x, a, x̌) = P(x, a, x̌ | y = 1) P(y = 1) / ( P(x, a, x̌ | y = 1) P(y = 1) + P(x, a, x̌ | y = 0) P(y = 0) )
= P(x, a, x̌ | y = 1) / ( P(x, a, x̌ | y = 1) + P(x, a, x̌ | y = 0) ),  using y ∼ Bern(1/2),
= D(x, a) T(x̌ | x, a) / ( D(x, a) T(x̌ | x, a) + D(x, a) ρ(x̌) )
= T(x̌ | x, a) / ( T(x̌ | x, a) + ρ(x̌) )
= q_i(x̌ | φ_i(x̌)) T_i(φ_i(x̌) | x, a) / ( q_i(x̌ | φ_i(x̌)) T_i(φ_i(x̌) | x, a) + q_i(x̌ | φ_i(x̌)) ρ(φ_i(x̌)) )
= T_i(φ_i(x̌) | x, a) / ( T_i(φ_i(x̌) | x, a) + ρ(φ_i(x̌)) )
= T_i(φ_i(x̌) | φ(x)[pt(i)], a) / ( T_i(φ_i(x̌) | φ(x)[pt(i)], a) + ρ(φ_i(x̌)) ).

Theorem 4 (Decoder Regression Guarantees). For any given δ_abs ∈ (0, 1) and n_abs ∈ N, we have the following with probability at least 1 − δ_abs:
E_{x,a,x̌∼D}[ ( ĝ_i(x, a, x̌) − g*(x, a, x̌) )² ] ≤ ∆(n_abs, δ_abs, |G|),
where ∆(n_abs, δ_abs, |G|) := (c/n_abs) ln(|G|/δ_abs) and c is a universal constant.

Proof. This is a standard regression guarantee derived using Bernstein's inequality with realizability (Assumption 3). See, for example, Proposition 11 in Misra et al. (2020) for a proof.

Corollary 5. For any given δ_abs ∈ (0, 1) and n_abs ∈ N, we have the following with probability at least 1 − δ_abs:
E_{x,a,x̌∼D}[ | ĝ_i(x, a, x̌) − g*(x, a, x̌) | ] ≤ √∆(n_abs, δ_abs, |G|).

Proof. Applying Jensen's inequality (E[√Z] ≤ √E[Z]) to Theorem 4 gives us:
E_{x,a,x̌∼D}[ |ĝ_i − g*| ] = E_{x,a,x̌∼D}[ √( (ĝ_i − g*)² ) ] ≤ √( E_{x,a,x̌∼D}[ (ĝ_i − g*)² ] ) ≤ √∆(n_abs, δ_abs, |G|).

Coupling Distribution. Following Misra et al. (2020), we introduce a coupling distribution:
D_coup(x, a, x̌₁, x̌₂) = D(x, a) ρ(x̌₁) ρ(x̌₂).
We also define the following quantity, which will be useful for stating our results:
ξ(x̌₁, x̌₂, x, a) = T(x̌₁ | x, a)/ρ(x̌₁) − T(x̌₂ | x, a)/ρ(x̌₂).  (19)

Lemma 10.
For any fixed δ_abs ∈ (0, 1), we have the following with probability at least 1 − δ_abs:
E_{x,a,x̌₁,x̌₂∼D_coup}[ 1{φ̂_i(x̌₁) = φ̂_i(x̌₂)} |ξ(x̌₁, x̌₂, x, a)| ] ≤ 8 √∆(n_abs, δ_abs, |G|).

Proof. We use the shorthand E = 1{φ̂_i(x̌₁) = φ̂_i(x̌₂)} for brevity. We also define a different coupled distribution D′_coup:
D′_coup(x, a, x̌₁, x̌₂) = D(x, a) D(x̌₁ | x, a) D(x̌₂ | x, a),  where D(x̌ | x, a) = (1/2) { T(x̌ | x, a) + ρ(x̌) }.
It is easy to see that the marginal distribution of D′_coup over (x, a, x̌₁) is the same as D(x, a, x̌₁). We first use the definitions of ξ (Equation 19) and g* (Lemma 9) to express their relation:
|g*(x, a, x̌₁) − g*(x, a, x̌₂)| = ρ(x̌₁) ρ(x̌₂) |ξ(x̌₁, x̌₂, x, a)| / ( (T(x̌₁ | x, a) + ρ(x̌₁)) (T(x̌₂ | x, a) + ρ(x̌₂)) )
= ρ(x̌₁) ρ(x̌₂) |ξ(x̌₁, x̌₂, x, a)| / ( 4 D(x̌₁ | x, a) D(x̌₂ | x, a) ).  (21)
The second line uses the definition of D(x̌ | x, a). We can view ρ(x̌₁)/D(x̌₁ | x, a) and ρ(x̌₂)/D(x̌₂ | x, a) as importance weights. Multiplying both sides by E and taking expectation with respect to D′_coup gives us:
E_{D′_coup}[ E |g*(x, a, x̌₁) − g*(x, a, x̌₂)| ] = (1/4) E_{D_coup}[ E |ξ(x̌₁, x̌₂, x, a)| ].  (22)
We bound the left-hand side of Equation 22 as shown below:
E_{D′_coup}[ E |g*(x, a, x̌₁) − g*(x, a, x̌₂)| ]
≤ E_{D′_coup}[ E |g*(x, a, x̌₁) − ĝ_i(x, a, x̌₁)| ] + E_{D′_coup}[ E |ĝ_i(x, a, x̌₁) − g*(x, a, x̌₂)| ]
= E_{D′_coup}[ E |g*(x, a, x̌₁) − ĝ_i(x, a, x̌₁)| ] + E_{D′_coup}[ E |ĝ_i(x, a, x̌₂) − g*(x, a, x̌₂)| ]
= 2 E_{D′_coup}[ E |g*(x, a, x̌₁) − ĝ_i(x, a, x̌₁)| ]
= 2 E_D[ E |g*(x, a, x̌) − ĝ_i(x, a, x̌)| ]
≤ 2 √∆(n_abs, δ_abs, |G|).
Here the first inequality follows from the triangle inequality. The second step is key: we use ĝ_i(x, a, x̌₁) = ĝ_i(x, a, x̌₂) whenever E = 1, which follows from the bottleneck structure of G, since ĝ_i(x, a, x̌) = ŵ_i(x, a, φ̂_i(x̌)). The third step uses the symmetry of x̌₁ and x̌₂ in D′_coup, and the fourth step uses the fact that the marginal distribution of D′_coup is the same as D. Lastly, the final inequality uses E ≤ 1 and the result of Corollary 5.
Combining the derived inequality with Equation 22 proves our result.

We define the quantity P(s′[i] = z | D′) := E_{(s,a)∼D′}[ T(s′[i] = z | s, a) ] for any distribution D′ ∈ ∆(S_{h−1} × A). From the definition of ρ, we have ρ(s′[i] = z) = P(s′[i] = z | D). Intuitively, as we have a policy cover at time step h−1 and take actions uniformly, we expect a good lower bound on P(s′[i] = z | D) for every i ∈ [d] and reachable z ∈ {0, 1}. Note that if z = 0 (resp. z = 1) is not reachable, then we always have s′[i] = 1 (resp. s′[i] = 0) by our reachability assumption (see Section 2). We formally prove this next, which will be useful later.

Lemma 11. For any z ∈ {0, 1} such that s′[i] = z is reachable, we have:
ρ(s′[i] = z) = P(s′[i] = z | D) ≥ αη_min / (N|A|).

Proof. Fix z ∈ {0, 1}. As s′[i] = z is reachable, from the definition of η_min we have:
η_min ≤ sup_{π∈Π} P_π(s′[i] = z) ≤ sup_{π∈Π} Σ_{š,a} P_π(š) T(s′[i] = z | š, a) ≤ Σ_{š,a} sup_{π∈Π} P_π(š) T(s′[i] = z | š, a) = Σ_{š,a} η(š) T(s′[i] = z | š, a).
We use the derived inequality to bound P(s′[i] = z | D) as shown:
P(s′[i] = z | D) = Σ_{š,a} ( µ_{h−1}(š)/|A| ) T(s′[i] = z | š, a) ≥ ( α/(N|A|) ) Σ_{š,a} η(š) T(s′[i] = z | š, a) ≥ αη_min / (N|A|).
The first inequality uses the fact that µ_{h−1} is created by rolling in with a uniformly selected policy in Ψ_{h−1}, which is an α-policy cover. Recall that N = |Ψ_{h−1}|. The second inequality uses the result derived above.

Lemma 12. For any x̌₁, x̌₂ such that φ_i(x̌₁) and φ_i(x̌₂) are reachable, we have:
E_{x,a∼D}[ |ξ(x̌₁, x̌₂, x, a)| ] ≥ 1{φ_i(x̌₁) ≠ φ_i(x̌₂)} αη_minσ / (2N).

Proof. For any x, a, x̌₁, x̌₂ we can express ξ (Equation 19) as:
|ξ(x̌₁, x̌₂, x, a)| = | T(φ_i(x̌₁) | φ(x)[pt(i)], a)/ρ(φ_i(x̌₁)) − T(φ_i(x̌₂) | φ(x)[pt(i)], a)/ρ(φ_i(x̌₂)) |,
where we use the factorization and decodability assumptions. Note that we are implicitly assuming φ_i(x̌₁) and φ_i(x̌₂) are reachable for the quantity ξ(x̌₁, x̌₂, x, a) to be well defined.
We define D_i to be the marginal distribution over S[pt(i)] × A. Taking expectations on both sides gives us:
E_{x,a∼D}[ |ξ(x̌₁, x̌₂, x, a)| ] = E_{š,a∼D_i}[ | T(φ_i(x̌₁) | š, a)/ρ(φ_i(x̌₁)) − T(φ_i(x̌₂) | š, a)/ρ(φ_i(x̌₂)) | ]
= Σ_{š,a} | P_{D_i}(š, a | φ_i(x̌₁)) − P_{D_i}(š, a | φ_i(x̌₂)) |
= 2 ‖ P_{D_i}(·, · | φ_i(x̌₁)) − P_{D_i}(·, · | φ_i(x̌₂)) ‖_TV.
The second equality uses the definition of the backward dynamics P_{D_i} over S[pt(i)] × A and the identity ρ(s′[i]) = P(s′[i] | D). If φ_i(x̌₁) = φ_i(x̌₂), then the quantity on the right is 0. Otherwise, this quantity is 2‖P_{D_i}(·, · | s′[i] = 1) − P_{D_i}(·, · | s′[i] = 0)‖_TV. In the latter case, both s′[i] = 1 and s′[i] = 0 are reachable, and without loss of generality we can assume P(s′[i] = 0 | D) ≥ 1/2. Our goal is to bound this term using the margin assumption (Assumption 1). We do so using importance weighting as shown below:
2 ‖ P_{D_i}(·, · | s′[i] = 1) − P_{D_i}(·, · | s′[i] = 0) ‖_TV
= Σ_{š,a} | ( P_{D_i}(š, a | s′[i] = 1)/P_{u_i}(š, a | s′[i] = 1) ) P_{u_i}(š, a | s′[i] = 1) − ( P_{D_i}(š, a | s′[i] = 0)/P_{u_i}(š, a | s′[i] = 0) ) P_{u_i}(š, a | s′[i] = 0) |
= Σ_{š,a} ( D_i(š, a)/u_i(š, a) ) | P_{u_i}(š, a | s′[i] = 1) P(s′[i] = 1 | u_i)/P(s′[i] = 1 | D_i) − P_{u_i}(š, a | s′[i] = 0) P(s′[i] = 0 | u_i)/P(s′[i] = 0 | D_i) |
≥ min_{š,a} [ ( D_i(š, a)/u_i(š, a) ) ( P(s′[i] = 0 | D_i)/P(s′[i] = 0 | u_i) ) ] ‖ P_{u_i}(·, · | s′[i] = 1) − P_{u_i}(·, · | s′[i] = 0) ‖_TV
≥ min_{š,a} [ ( D_i(š, a)/u_i(š, a) ) ( P(s′[i] = 0 | D_i)/P(s′[i] = 0 | u_i) ) ] σ.
The first step applies importance weighting; this is valid as u_i has support over all reachable configurations š and all actions a ∈ A. The second step uses the definition of the backward dynamics (P_{D_i}, P_{u_i}). The third step uses Lemma H.1 of Du et al. (2019) (see Lemma 24 in Appendix E for the statement). Finally, the last step uses Assumption 1. We bound the two multiplicative terms below. First, D_i(š, a) = µ_{h−1}(š)/|A| ≥ αη_min/(N|A|).
The equality uses the fact that actions are taken uniformly, and the inequality uses the fact that µ_{h−1} is an α-policy cover. As u_i is the uniform distribution over S[pt(i)] × A, we have u_i(š, a) = 1/(2^{|pt(i)|}|A|). This gives D_i(š, a)/u_i(š, a) ≥ αη_min 2^{|pt(i)|}/N. We bound the other multiplicative term as shown below:
P(s′[i] = 0 | D_i)/P(s′[i] = 0 | u_i) ≥ P(s′[i] = 0 | D_i) ≥ 1/2.
Combining the lower bounds for the two multiplicative terms and using 2^{|pt(i)|} ≥ 1, we get:
2 ‖ P_{D_i}(·, · | s′[i] = 1) − P_{D_i}(·, · | s′[i] = 0) ‖_TV ≥ αη_minσ/(2N).  (23)
Lastly, recall that our desired quantity equals 2‖P_{D_i}(·, · | s′[i] = 1) − P_{D_i}(·, · | s′[i] = 0)‖_TV whenever φ_i(x̌₁) ≠ φ_i(x̌₂), and 0 otherwise. Multiplying the derived lower bound by 1{φ_i(x̌₁) ≠ φ_i(x̌₂)} gives us the desired result.

Corollary 6. We have the following with probability at least 1 − δ_abs:
E_{x̌₁,x̌₂∼ρ}[ 1{φ̂_i(x̌₁) = φ̂_i(x̌₂)} 1{φ_i(x̌₁) ≠ φ_i(x̌₂)} ] ≤ ( 16N/(αη_minσ) ) √∆(n_abs, δ_abs, |G|).

Proof. The proof follows by applying the bound in Lemma 12 to Lemma 10, as shown below:
E_{x,a,x̌₁,x̌₂∼D_coup}[ 1{φ̂_i(x̌₁) = φ̂_i(x̌₂)} |ξ(x̌₁, x̌₂, x, a)| ] = E_{x̌₁,x̌₂∼ρ}[ 1{φ̂_i(x̌₁) = φ̂_i(x̌₂)} E_{x,a∼D}[ |ξ(x̌₁, x̌₂, x, a)| ] ]
≥ ( αη_minσ/(2N) ) E_{x̌₁,x̌₂∼ρ}[ 1{φ̂_i(x̌₁) = φ̂_i(x̌₂)} 1{φ_i(x̌₁) ≠ φ_i(x̌₂)} ].
The inequality uses Lemma 12. The left-hand side is bounded by 8√∆(n_abs, δ_abs, |G|) by Lemma 10. Combining the two bounds and rearranging the terms proves the result.

At this point, we analyze two cases separately. In the first case, s′[i] can be set to both 0 and 1. In the second case, s′[i] can only be set to one of the values, in which case we call s′[i] degenerate at time step h. We will show how to detect the second case, at which point we simply output a decoder that always returns 0. We analyze the first case below.

D.2.1 CASE A: WHEN s [i] CAN TAKE ALL VALUES

Corollary 6 allows us to define a correspondence between the learned state (i.e., the output of φ̂_i) and the actual state (i.e., the output of φ_i). We establish this correspondence in the next result.

Theorem 7 (Correspondence Theorem). For any state factor s′[i], there exist û₀ ∈ {0, 1} and û₁ = 1 − û₀ such that, with probability at least 1 − δ_abs:
P( φ̂_i(x̌) = û₀ | s′[i] = 0 ) ≥ 1 − ε,  P( φ̂_i(x̌) = û₁ | s′[i] = 1 ) ≥ 1 − ε,
where ε := ( 16N²|A|/(α²η²_minσ) ) √∆(n_abs, δ_abs, |G|) and x̌ ∼ ρ, provided ε ∈ (0, 1/2).

Proof. For any u, z ∈ {0, 1} we define the following quantities:
P_z := E_{x̌∼ρ}[ 1{φ_i(x̌) = z} ],  P_{uz} := E_{x̌∼ρ}[ 1{φ̂_i(x̌) = u} 1{φ_i(x̌) = z} ].
It is easy to see that these quantities are related by P_z = P_{uz} + P_{(1−u)z}. We define û₀ = argmax_{u∈{0,1}} P_{u0} and û₁ = 1 − û₀. This can be viewed as the learned bit value in greatest correspondence with s′[i] = 0. We will derive lower bounds on P_{û₀0}/P₀ and P_{û₁1}/P₁, which give the desired result. We first derive the following lower bound on P_{û₀0}:
P_{û₀0} ≥ ( P_{û₀0} + P_{û₁0} )/2 = P₀/2,  (24)
where we use the fact that the max is at least the average. Further, for any u, z ∈ {0, 1} we have:
E_{x̌₁,x̌₂∼ρ}[ 1{φ̂_i(x̌₁) = φ̂_i(x̌₂)} 1{φ_i(x̌₁) ≠ φ_i(x̌₂)} ] ≥ E_{x̌₁,x̌₂∼ρ}[ 1{φ̂_i(x̌₁) = u} 1{φ̂_i(x̌₂) = u} 1{φ_i(x̌₁) = z} 1{φ_i(x̌₂) = 1−z} ] = P_{uz} P_{u(1−z)}.
We define the shorthand ∆̄ := ( 16N/(αη_minσ) ) √∆(n_abs, δ_abs, |G|). Then from Corollary 6 we have P_{uz} P_{u(1−z)} ≤ ∆̄ for any u, z ∈ {0, 1}. This allows us to write:
P_{û₁1} = P₁ − P_{û₀1} ≥ P₁ − ∆̄/P_{û₀0}  ⇒  P_{û₁1}/P₁ ≥ 1 − ∆̄/(P₁ P_{û₀0}) ≥ 1 − 2∆̄/(P₀P₁),
where the last inequality uses Equation 24. We derive the analogous bound for P_{û₀0}/P₀:
P_{û₀0} = P₀ − P_{û₁0} ≥ P₀ − ∆̄/P_{û₁1}  ⇒  P_{û₀0}/P₀ ≥ 1 − ∆̄/(P₀ P_{û₁1}) ≥ 1 − ∆̄/(P₀P₁ − 2∆̄),
where the last inequality uses the derived bound for P_{û₁1}/P₁. If we assume ∆̄ ≤ P₀P₁/4, then P_{û₀0}/P₀ ≥ 1 − 2∆̄/(P₀P₁). As P₀ + P₁ = 1, we get P₀P₁ = P₀ − P₀² = P₁ − P₁². If P₀ ≤ 1/2 then P₀ − P₀² ≥ P₀/2.
Otherwise, P₀ > 1/2, which implies P₁ ≤ 1/2 and P₁ − P₁² ≥ P₁/2. This gives P₀P₁ ≥ min{P₀/2, P₁/2}. Using the lower bounds for P₀ and P₁ from Lemma 11 gives P₀P₁ ≥ αη_min/(2N|A|), which allows us to write:
P_{û₁1}/P₁ ≥ 1 − 4N|A|∆̄/(αη_min),  P_{û₀0}/P₀ ≥ 1 − 4N|A|∆̄/(αη_min).
It is easy to verify that ε = 4N|A|∆̄/(αη_min). As P_{û₀0}/P₀ = P( φ̂_i(x̌) = û₀ | s′[i] = 0 ) and P_{û₁1}/P₁ = P( φ̂_i(x̌) = û₁ | s′[i] = 1 ), this proves our result. The only requirement we used is ∆̄ ≤ P₀P₁/4, which is ensured if ε ∈ (0, 1/2).
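The label-matching step in Theorem 7's proof (choosing û₀ = argmax_u P_{u0}) can be sketched as follows. This is illustrative only: `phi_hat` and `phi_true` are stand-ins for the learned and true decoders, and the toy decoder below flips both the label and, 5% of the time, the bit itself:

```python
import numpy as np

def match_labels(phi_hat, phi_true, samples):
    """Estimate P_uz from samples and return (u0, u1) with u0 = argmax_u P_u0,
    i.e., the learned bit value most associated with true factor value 0."""
    p = np.zeros((2, 2))                       # p[u, z] ~ P_uz
    for xk in samples:
        p[phi_hat(xk), phi_true(xk)] += 1.0 / len(samples)
    u0 = int(np.argmax(p[:, 0]))
    return u0, 1 - u0

rng = np.random.default_rng(1)
true_bits = rng.integers(0, 2, 5000)
flips = rng.random(5000) < 0.05                # 5% decoding noise
hat_bits = 1 - (true_bits ^ flips)             # learned decoder: inverted labels

u0, u1 = match_labels(lambda k: hat_bits[k], lambda k: true_bits[k], range(5000))
assert (u0, u1) == (1, 0)   # recovers that the learned labeling is inverted
```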

D.2.2 CASE B: WHEN s [i] TAKES A SINGLE VALUE

We want to detect this case with high probability, so that we can learn a degenerate decoder that only takes the value 0. This trivially gives a correspondence result similar to Theorem 7. We describe the general form of the LearnDecoder subroutine in Algorithm 6. The key difference from the case covered earlier is lines 6-10. For a given factor i, we first learn the model ĝ_i containing the decoder φ̂_i using noise-contrastive learning, as before. We then sample n_deg i.i.d. triplets D_deg = {(x_j, a_j, x̌_j)}_{j=1}^{n_deg} and compute the width of the model's predictions over this dataset. If the width is smaller than a certain threshold, then we determine the factor to be degenerate and output the degenerate decoder φ̂_i := 0; otherwise, we keep the decoder learned by our regression task. The form of the sample size n_deg will become clear at the end of the analysis, as will the reason for the choice of threshold in line 9. Intuitively, if the latent factor only takes one value, then the optimal classifier always outputs 1/2, so our prediction values should be close to one another. However, if the latent factor takes two values, then the model predictions should be distinct.

Algorithm 6 LearnDecoder(G, Ψ_{h−1}, ĉh_h).

The child function has type ĉh_h : [d_h] → 2^[m].

1: for i in [d_h], define ω = ĉh_h(i), D = ∅, D_deg = ∅ do
2:   for n_abs times do // collect a dataset of real (y = 1) and imposter (y = 0) transitions
3:     Sample (x^(1), a^(1), x′^(1)), (x^(2), a^(2), x′^(2)) ∼ Unf(Ψ_{h−1}) ∘ Unf(A) and y ∼ Bern(1/2)
4:     If y = 1 then D ← D ∪ (x^(1), a^(1), x′^(1)[ω], y) else D ← D ∪ (x^(1), a^(1), x′^(2)[ω], y)
5:   ĝ_i := (ŵ_i, φ̂_i) = REG(D, G) // train the decoder using noise-contrastive learning
6:   for n_deg times do // detect degenerate factors
7:     Sample (x^(1), a^(1), x′^(1)), (x^(2), a^(2), x′^(2)) ∼ Unf(Ψ_{h−1}) ∘ Unf(A)
8:     D_deg ← D_deg ∪ {(x^(1), a^(1), x′^(2))}
9:   if max_{j,k∈[n_deg]} | ĝ_i(x_j, a_j, x̌_j) − ĝ_i(x_k, a_k, x̌_k) | ≤ α²η²_minσ/(40|Ψ_{h−1}|²|A|) then // max over D_deg
10:    φ̂_i := 0 // output a decoder that always returns 0
return φ̂ : X → {0,1}^{d_h}, where for any x ∈ X and i ∈ [d_h] we have φ̂(x)[i] = φ̂_i(x[ĉh_h(i)]).

For convenience, we define D′(x, a, x̌) = D(x, a)ρ(x̌), so that (x_j, a_j, x̌_j) ∼ D′. For brevity, we do not add additional qualifiers to differentiate x_j, a_j, x̌_j from the dataset of real and imposter transitions used in the previous section for the regression task. In this part alone, x_j, a_j, x̌_j refer to the transitions collected for the purpose of detecting degenerate factors.

Lemma 13 (Markov Bound). Let {(x_j, a_j, x̌_j)}_{j=1}^{n_deg} be a dataset of i.i.d. transitions sampled from D′. Fix a > 0. Then with probability at least 1 − δ_abs − 2n_deg√∆(n_abs, δ_abs, |G|)/a, we have:
∀ j ∈ [n_deg], | ĝ(x_j, a_j, x̌_j) − g*(x_j, a_j, x̌_j) | ≤ a.

Proof. It is straightforward to verify that for any (x_j, a_j, x̌_j) we have D(x_j, a_j, x̌_j) ≥ D′(x_j, a_j, x̌_j)/2.
Using Corollary 5 we get: $\mathbb{E}_{(x,a,\bar{x}) \sim D'}\left[|\hat{g}(x, a, \bar{x}) - g^\star(x, a, \bar{x})|\right] \le 2\sqrt{\Delta(n_{abs}, \delta_{abs}, |\mathcal{G}|)}$. Let $E_j$ denote the event $\{|\hat{g}(x_j, a_j, \bar{x}_j) - g^\star(x_j, a_j, \bar{x}_j)| \le a\}$ and $\bar{E}_j$ be its negation. Then:

$P(\cap_{j=1}^{n_{deg}} E_j) \ge 1 - \sum_{j=1}^{n_{deg}} P(\bar{E}_j) \ge 1 - \frac{2 n_{deg} \sqrt{\Delta(n_{abs}, \delta_{abs}, |\mathcal{G}|)}}{a}$,

where the first inequality uses the union bound and the second uses Markov's inequality. As Corollary 5 fails with probability at most $\delta_{abs}$, our overall failure probability is at most $\delta_{abs} + \frac{2 n_{deg} \sqrt{\Delta(n_{abs}, \delta_{abs}, |\mathcal{G}|)}}{a}$.

Lemma 14. For any reachable parent factor values $\check{s}$, action $a \in \mathcal{A}$ and reachable $s'[i] \in \{0,1\}$, we have $D'(\check{s}, a, s'[i]) \ge \frac{\alpha^2 \eta_{\min}^2}{N^2 |\mathcal{A}|^2}$.

Proof. We have $D'(\check{s}, a, s'[i]) = \frac{\mu_{h-1}(\check{s})}{|\mathcal{A}|} \rho(s'[i]) \ge \frac{\alpha^2 \eta_{\min}^2}{N^2 |\mathcal{A}|^2}$, where we used the induction hypothesis IH.4 that $\Psi$ is an $\alpha$-policy cover of $\mathcal{S}_{h-1}$, and Lemma 11.

Lemma 15 (Degenerate Factors). Fix $a > 0$. If $s'[i]$ takes only a single value, then with probability at least $1 - \delta_{abs} - \frac{2 n_{deg} \sqrt{\Delta(n_{abs}, \delta_{abs}, |\mathcal{G}|)}}{a}$ we have:

$\max_{j,k \in [n_{deg}]} |\hat{g}(x_j, a_j, \bar{x}_j) - \hat{g}(x_k, a_k, \bar{x}_k)| \le 2a$.

Proof. When $s'[i]$ takes a single value, $g^\star$ is the constant function $\frac{1}{2}$. For any $j$ and $k$ we get the following using Lemma 13 and the triangle inequality:

$|\hat{g}(x_j, a_j, \bar{x}_j) - \hat{g}(x_k, a_k, \bar{x}_k)| \le |\hat{g}(x_j, a_j, \bar{x}_j) - g^\star(x_j, a_j, \bar{x}_j)| + |g^\star(x_k, a_k, \bar{x}_k) - \hat{g}(x_k, a_k, \bar{x}_k)| \le 2a$.

Lemma 16 (Non-Degenerate Factors). Fix $a > 0$ and assume $n_{deg} \ge \frac{N^2 |\mathcal{A}|^2}{\alpha^2 \eta_{\min}^2}$. Then we have:

$\max_{j,k \in [n_{deg}]} |\hat{g}(x_j, a_j, \bar{x}_j) - \hat{g}(x_k, a_k, \bar{x}_k)| \ge \frac{\alpha^2 \eta_{\min}^2 \sigma}{16 N^2 |\mathcal{A}|} - 2a$

with probability at least $1 - \delta_{abs} - \frac{2 n_{deg} \sqrt{\Delta(n_{abs}, \delta_{abs}, |\mathcal{G}|)}}{a} - 4\exp\left(-\frac{\alpha^2 \eta_{\min}^2 n_{deg}}{3 N^2 |\mathcal{A}|^2}\right)$.

Proof. Equation 23 implies that there exist $\check{s}, a$ such that

$\left|\frac{T(s'[i] = 1 \mid \check{s}, a)}{\rho(s'[i] = 1)} - \frac{T(s'[i] = 0 \mid \check{s}, a)}{\rho(s'[i] = 0)}\right| \ge \frac{\alpha \eta_{\min} \sigma}{2N}$.

Combining this with Equation 21 we get:

$|g^\star(\check{s}, a, s'[i] = 1) - g^\star(\check{s}, a, s'[i] = 0)| \ge \frac{\rho(s'[i] = 1)\,\rho(s'[i] = 0)}{4\, D(s'[i] = 1 \mid \check{s}, a)\, D(s'[i] = 0 \mid \check{s}, a)} \cdot \frac{\alpha \eta_{\min} \sigma}{2N} \ge \frac{\alpha^2 \eta_{\min}^2 \sigma}{16 N^2 |\mathcal{A}|} \quad (26)$

where Equation 26 uses $\rho(s'[i] = 1)\rho(s'[i] = 0) \ge \frac{\alpha \eta_{\min}}{2 N |\mathcal{A}|}$, as one of the terms is at least $1/2$ and the other can be bounded using Lemma 11.
Say we have two examples in our dataset, $\{(x_1, a_1, \bar{x}_1), (x_2, a_2, \bar{x}_2)\}$ without loss of generality, such that $\phi^\star(x_1)[\omega] = \phi^\star(x_2)[\omega] = \check{s}$, action $a_1 = a_2 = a$, $\phi^\star_i(\bar{x}_1) = 1$, and $\phi^\star_i(\bar{x}_2) = 0$. That is, we assume that our dataset contains both $(\check{s}, a, s'[i] = 1)$ and $(\check{s}, a, s'[i] = 0)$; the probability of each of these events is given by Lemma 14. Therefore, if $n_{deg} \ge \frac{N^2 |\mathcal{A}|^2}{\alpha^2 \eta_{\min}^2}$, then from Lemma 25 and the union bound, the probability that at least one of these transitions does not occur is at most $4\exp\left(-\frac{\alpha^2 \eta_{\min}^2 n_{deg}}{3 N^2 |\mathcal{A}|^2}\right)$. Then we have:

$\max_{j,k \in [n_{deg}]} |\hat{g}(x_j, a_j, \bar{x}_j) - \hat{g}(x_k, a_k, \bar{x}_k)| \ge |\hat{g}(x_1, a_1, \bar{x}_1) - \hat{g}(x_2, a_2, \bar{x}_2)| \ge |g^\star(\check{s}, a, 1) - g^\star(\check{s}, a, 0)| - |\hat{g}(x_1, a_1, \bar{x}_1) - g^\star(x_1, a_1, \bar{x}_1)| - |\hat{g}(x_2, a_2, \bar{x}_2) - g^\star(x_2, a_2, \bar{x}_2)| \ge \frac{\alpha^2 \eta_{\min}^2 \sigma}{16 N^2 |\mathcal{A}|} - 2a$.

The total failure probability is given by the union bound and computes to: $\delta_{abs} + \frac{2 n_{deg} \sqrt{\Delta(n_{abs}, \delta_{abs}, |\mathcal{G}|)}}{a} + 4\exp\left(-\frac{\alpha^2 \eta_{\min}^2 n_{deg}}{3 N^2 |\mathcal{A}|^2}\right)$.

If we fix $a = \frac{\alpha^2 \eta_{\min}^2 \sigma}{80 N^2 |\mathcal{A}|}$, then in the two cases we have:

(Degenerate Factor) $\max_{j,k \in [n_{deg}]} |\hat{g}(x_j, a_j, \bar{x}_j) - \hat{g}(x_k, a_k, \bar{x}_k)| \le \frac{\alpha^2 \eta_{\min}^2 \sigma}{40 N^2 |\mathcal{A}|}$

(Non-Degenerate Factor) $\max_{j,k \in [n_{deg}]} |\hat{g}(x_j, a_j, \bar{x}_j) - \hat{g}(x_k, a_k, \bar{x}_k)| \ge \frac{3 \alpha^2 \eta_{\min}^2 \sigma}{80 N^2 |\mathcal{A}|}$

Proof. The result follows by combining Lemma 16 and Lemma 15, using the value of $a$ described above. These two results hold with probability at least:

$1 - \delta_{abs} - \frac{2 n_{deg} \sqrt{\Delta(n_{abs}, \delta_{abs}, |\mathcal{G}|)}}{a} - 4\exp\left(-\frac{\alpha^2 \eta_{\min}^2 n_{deg}}{3 N^2 |\mathcal{A}|^2}\right)$

Setting the hyperparameters to satisfy

$n_{deg} = \frac{3 N^2 |\mathcal{A}|^2}{\alpha^2 \eta_{\min}^2} \log\frac{4}{\delta_{abs}}, \qquad \Delta(n_{abs}, \delta_{abs}, |\mathcal{G}|)^{-1/2} \ge \frac{480 N^4 |\mathcal{A}|^3}{\alpha^4 \eta_{\min}^4 \delta_{abs} \sigma} \log\frac{4}{\delta_{abs}}$,

gives a failure probability of at most $3\delta_{abs}$. The latter condition can be expressed in terms of $\epsilon$, which gives us the desired bounds (see Theorem 7 for the definition of $\epsilon$). Lastly, note that setting $n_{deg}$ this way also satisfies the requirement in Lemma 16.
Lastly, note that the resultant bound on $\epsilon$ is much stronger than required for Theorem 7. Therefore, we can significantly improve the complexity bounds in the setting where there are no degenerate state factors.
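The width test of lines 6-10 can be sketched numerically. A minimal sketch, assuming the predictions of a trained classifier $\hat{g}_i$ are given as a plain list; the constants standing in for $\alpha$, $\eta_{\min}$, $\sigma$, $N$ and $|\mathcal{A}|$ are illustrative placeholders, not values from the paper:

```python
import itertools

def width(preds):
    """Width of prediction values: the largest pairwise absolute difference."""
    return max(abs(p - q) for p, q in itertools.combinations(preds, 2))

def is_degenerate(preds, alpha, eta_min, sigma, n_cover, n_actions):
    """Declare the factor degenerate when the width of the classifier's
    predictions falls below the threshold used in line 9 of Algorithm 6."""
    threshold = (alpha ** 2 * eta_min ** 2 * sigma) / (40 * n_cover ** 2 * n_actions)
    return width(preds) <= threshold

# Degenerate factor: the optimal classifier outputs values near 1/2.
degenerate_preds = [0.500, 0.501, 0.499, 0.5005]
# Non-degenerate factor: predictions split into two clusters.
split_preds = [0.9, 0.12, 0.88, 0.1]

# With the illustrative constants below, the threshold is 1/40 = 0.025.
print(is_degenerate(degenerate_preds, alpha=1, eta_min=1, sigma=1, n_cover=1, n_actions=1))  # True
print(is_degenerate(split_preds, alpha=1, eta_min=1, sigma=1, n_cover=1, n_actions=1))       # False
```

The two clusters in the non-degenerate case mirror the analysis: the gap between $g^\star(\check{s}, a, 1)$ and $g^\star(\check{s}, a, 0)$ keeps the width above the threshold.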

D.2.3 COMBINING CASE A AND CASE B

Theorem 8 shows that we can detect degenerate state factors with high probability. If we have a degenerate state factor and we detect it, then the correspondence theorem holds trivially. If we do not have degeneracy and we correctly predict this, then we stick with our learned decoder and Theorem 7 holds. Together, these two results allow us to define a bijection between real states and learned states, which we explain below.

Bijective mapping between real and learned states. For a given time step $h$ and state bit $s[i] = z$, we define $\hat{u}_{hiz}$ as the corresponding learned state bit. When $h$ and $i$ are clear from the context, we express this as $\hat{u}_z$. We use $\hat{s}$ to denote a learned state at time step $h-1$ and $\hat{s}'$ to denote a learned state at time step $h$. Let $pt(i) = (i_1, \cdots, i_l)$ and $s[pt(i)] = w := (w_1, w_2, \cdots, w_l)$; then we define $\hat{w} = (\hat{u}_{(h-1)i_1w_1}, \cdots, \hat{u}_{(h-1)i_lw_l})$. We define a mapping $\theta_h : \{0,1\}^d \to \{0,1\}^d$ from learned states to real states, and drop the subscript $h$ when the time step is clear from the context. We denote the domain of $\theta_h$ by $\hat{\mathcal{S}}_h$, which is a subset of $\{0,1\}^d$. Note that not every real state may be reachable at time step $h$; e.g., our decoder may output $\hat{s}' = (0, 0)$ even though the corresponding real state is not reachable at time step $h$. Figure 3 visualizes the mapping.

Figure 3: Bijection between the learned state space $\hat{\mathcal{S}}_h$ and the state space $\mathcal{S}_h$ at time step $h$. The bijection maps every state in $\hat{\mathcal{S}}_h$ to a unique state in $\mathcal{S}_h$. However, $\hat{\mathcal{S}}_h$ also contains learned states that do not correspond to a reachable state at this time step; this happens due to error in the decoder. For a learned state $\hat{s}$ we have $s = \theta(\hat{s})$ if $s = (z_1, \cdots, z_d)$ and $\hat{s} = (\hat{u}_{h1z_1}, \cdots, \hat{u}_{hdz_d})$.
We also overload our notation to write $w = \theta(\hat{w})$ for a given $\hat{s}[K] = \hat{w}$ where $w = \theta(\hat{s})[K]$, whenever the factor set $K$ is clear from the context. We call $s'$ (or $s$) reachable if $s' \in \mathcal{S}_h$ (or $s \in \mathcal{S}_{h-1}$). Similarly, we call $\hat{s}'$ (or $\hat{s}$) reachable if $\theta_h(\hat{s}')$ (or $\theta_{h-1}(\hat{s})$) is reachable. For a given set of factors $K$, we call $w \in \{0,1\}^{|K|}$ reachable for $K$ if there exists a reachable state whose factors $K$ take the value $w$. Similarly, we call $\hat{w} \in \{0,1\}^{|K|}$ reachable for a given $K$ if $\theta(\hat{w})$ is reachable for $K$. We use the mapping to state the correspondence theorem for the whole state.

Corollary 9 (General Case). If $n_{deg} = \frac{3 N^2 |\mathcal{A}|^2}{\alpha^2 \eta_{\min}^2} \log\frac{4}{\delta_{abs}}$ and $\epsilon \le \frac{\alpha^2 \eta_{\min}^2 \delta_{abs}}{30 N^2 |\mathcal{A}|^2} \log^{-1}\frac{4}{\delta_{abs}}$ hold, then with probability at least $1 - 3d\delta_{abs}$, we have:

$\forall s' \in \mathcal{S}_h, \quad P(\hat{s}'[i] = \theta^{-1}(s')[i] \mid s'[i]) \ge 1 - \epsilon, \qquad P(\hat{s}' = \theta^{-1}(s') \mid s') \ge 1 - d\epsilon$.

Proof. The first result follows directly from being able to detect whether we are in the degenerate setting: if not, we apply Theorem 7; if yes, the result holds trivially. This holds with probability at least $1 - 3\delta_{abs}$ from Theorem 8. Applying the union bound over all $d$ learned factors gives a success probability across all factors of at least $1 - 3d\delta_{abs}$. The second result follows from the union bound:

$P(\hat{s}' \ne \theta^{-1}(s') \mid s') = P(\exists i : \hat{s}'[i] \ne \theta^{-1}(s')[i] \mid s'[i]) \le \sum_{i=1}^d P(\hat{s}'[i] \ne \theta^{-1}(s')[i] \mid s'[i]) \le d\epsilon$.

As we noted before, our bounds can be significantly improved in the case of no degenerate factors, since this avoids the application of the expensive Markov inequality. Therefore, we also state bounds for the special case below.

Corollary 10 (Degenerate Factors Absent). If $\epsilon \in (0, \frac{1}{2})$, then with probability at least $1 - d\delta_{abs}$, we have:

$\forall s' \in \mathcal{S}_h, \quad P(\hat{s}'[i] = \theta^{-1}(s')[i] \mid s'[i]) \ge 1 - \epsilon, \qquad P(\hat{s}' = \theta^{-1}(s') \mid s') \ge 1 - d\epsilon$.

Proof. Same as Corollary 9, except we can directly apply Theorem 7, as we do not need to perform the expensive check for a degenerate factor.
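The bijection $\theta$ amounts to a per-bit lookup. A minimal sketch, where the table `u[(h, i, z)]` stands in for the learned bit $\hat{u}_{hiz}$ corresponding to bit value $z$ of factor $i$ at step $h$ (the table contents are illustrative):

```python
def theta_inverse(s, h, u):
    """Map a real state s = (z_1, ..., z_d) at step h to the learned
    state (u_{h,1,z_1}, ..., u_{h,d,z_d})."""
    return tuple(u[(h, i, z)] for i, z in enumerate(s))

def theta(s_hat, h, u):
    """Inverse direction: recover the real state from a learned state,
    assuming each pair (u_{h,i,0}, u_{h,i,1}) consists of distinct bits."""
    inv = {(h, i): {u[(h, i, 0)]: 0, u[(h, i, 1)]: 1} for i in range(len(s_hat))}
    return tuple(inv[(h, i)][b] for i, b in enumerate(s_hat))

# Illustrative table: at step h = 0, factor 0 has flipped bit labels.
u = {(0, 0, 0): 1, (0, 0, 1): 0, (0, 1, 0): 0, (0, 1, 1): 1}
s = (0, 1)
s_hat = theta_inverse(s, 0, u)   # -> (1, 1)
assert theta(s_hat, 0, u) == s   # the round trip recovers the real state
```

The round trip illustrates why $\theta$ is a bijection on its domain: each learned bit determines the real bit uniquely once the per-factor labels are fixed.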

D.3 MODEL ESTIMATION

Our next goal is to estimate a model $\hat{T}_h : \hat{\mathcal{S}}_{h-1} \times \mathcal{A} \to \Delta(\hat{\mathcal{S}}_h)$ and the latent parent structure $\hat{pt}_h$, given a roll-in distribution $D \in \Delta(\mathcal{S}_{h-1})$ and the learned decoders $\{\hat{\phi}_t\}_{t \le h}$. Recall that our approach estimates the model by count-based estimation. Let $D = \{(x^{(k)}, a^{(k)}, x'^{(k)})\}_{k=1}^{n_{est}}$ be the sample collected for model estimation. Recall that we estimate $\hat{pt}_h(i)$ by using a set of learned factors $I$ that we believe is $pt_h(i)$ and varying a disjoint set of learned factors $J$. If $pt(i) \subseteq I$ then we expect the learned model to behave the same.

We bound Term 1 below:

Term 1: $\frac{1}{|\mathcal{A}|} \sum_{s_{h-1}[K] \ne s[K]} P_D(s'[i] \mid \hat{s}[K], s_{h-1}[K], a)\, P(\hat{s}[K] \mid s_{h-1}[K])\, P_D(s_{h-1}[K]) \le \frac{\epsilon}{|\mathcal{A}|} \sum_{s_{h-1}[K] \ne s[K]} P_D(s'[i] \mid \hat{s}[K], s_{h-1}[K], a)\, P_D(s_{h-1}[K]) \le \frac{\epsilon}{|\mathcal{A}|} \sum_{s_{h-1}[K] \ne s[K]} P_D(s_{h-1}[K]) \le \frac{\epsilon}{|\mathcal{A}|}$.

The key inequality here is $P(\hat{s}[K] \mid s_{h-1}[K]) = \prod_{k \in K} P(\hat{s}[k] \mid s_{h-1}[k]) \le \epsilon$, as there exists at least one $j \in K$ such that $s_{h-1}[j] \ne s[j]$, and for this $j$ we have $P(\hat{s}[j] \mid s_{h-1}[j]) \le \epsilon$. We bound Term 2 similarly:

$\varepsilon_2 = \left|\sum_{s_{h-1}[K]} P_D(\hat{s}[K], s_{h-1}[K], a) - \sum_{\hat{s}_{h-1}[K]} P_D(\hat{s}_{h-1}[K], s[K], a)\right| = \left|\sum_{s_{h-1}[K] \ne s[K]} P_D(\hat{s}[K], s_{h-1}[K], a) - \sum_{\hat{s}_{h-1}[K] \ne \hat{s}[K]} P_D(\hat{s}_{h-1}[K], s[K], a)\right| \le \max\left\{\underbrace{\sum_{s_{h-1}[K] \ne s[K]} P_D(\hat{s}[K], s_{h-1}[K], a)}_{\text{Term 3}},\; \underbrace{\sum_{\hat{s}_{h-1}[K] \ne \hat{s}[K]} P_D(\hat{s}_{h-1}[K], s[K], a)}_{\text{Term 4}}\right\}$

We bound Term 3 below, similarly to Term 1:

Term 3: $\frac{1}{|\mathcal{A}|} \sum_{s_{h-1}[K] \ne s[K]} P_D(\hat{s}[K] \mid s_{h-1}[K])\, P_D(s_{h-1}[K]) \le \frac{\epsilon}{|\mathcal{A}|} \sum_{s_{h-1}[K] \ne s[K]} P_D(s_{h-1}[K]) \le \frac{\epsilon}{|\mathcal{A}|}$

and Term 4 is bounded similarly below:

Term 4: $\frac{1}{|\mathcal{A}|} \sum_{\hat{s}_{h-1}[K] \ne \hat{s}[K]} P_D(\hat{s}_{h-1}[K] \mid s[K])\, P_D(s[K]) \le \frac{1}{|\mathcal{A}|} \left\{1 - P_D(\hat{s}[K] \mid s[K])\right\} \le \frac{1}{|\mathcal{A}|}\left(1 - (1 - \epsilon)^{|K|}\right)$,

which is the desired result. We can merge the estimation error and the approximation error to obtain the total error.

Lemma 22 (Closure Result). Let $T^\bullet$ be the closure of the transition model with respect to some learned transition model.
Then for any policy $\pi \in \Pi$ and any event $E$ which is a function of an episode sampled using $\pi$, we have $P_\pi(E; T) = P_\pi(E; T^\bullet)$, where $P_\pi(E; T)$ denotes the probability of event $E$ when sampling from $\pi$ using the transition model $T$.

Proof. The proof follows from observing that when using $T^\bullet$ we never reach a state $s \notin \mathcal{S}_{h-1}$ for any $h-1$, by the definition of $\mathcal{S}_{h-1}$. From the definition of $T^\bullet$, this means that both $T^\bullet$ and $T$ generate the same range of episodes sampled from $\pi$ and assign the same probabilities to them. As $E$ is a function of an episode, its probability therefore remains unchanged.

With the definition of closure, we are now ready to state our last result in this section, which bounds the total variation between the estimated model and the transition closure under the bijection map $\theta$.

Theorem 13 (Model Error). For any $\hat{s} \in \hat{\mathcal{S}}_{h-1}$ and $a \in \mathcal{A}$ we have:

$\sum_{\hat{s}' \in \hat{\mathcal{S}}_h} \left|\hat{T}_h(\hat{s}' \mid \hat{s}, a) - T^\bullet_h(\theta(\hat{s}') \mid \theta(\hat{s}), a)\right| \le 6d\left(\Delta_{est}(n_{est}, \delta_{est}) + \Delta_{app}\right)$.

Proof. If $\theta(\hat{s}) \notin \mathcal{S}_{h-1}$ then by the definition of $T^\bullet$ the bound holds trivially. Therefore, we focus on $\theta(\hat{s}) \in \mathcal{S}_{h-1}$, for which $T^\bullet = T$. For every $j \in [d]$ we define the partial sum $S_j = \sum_{\hat{s}'[j], \cdots, \hat{s}'[d] \in \{0,1\}} (\cdots)$ over the last $d - j + 1$ factors; the base case follows from Theorem 12. We assume the induction hypothesis to be true for $S_k$ for all $k > j$, and handle the inductive step with the triangle inequality.
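The count-based estimation mentioned above can be sketched as a tally over decoded transitions; a minimal sketch, not the paper's exact estimator, where the decoders for the two time steps are passed in as plain functions:

```python
from collections import Counter, defaultdict

def estimate_model(samples, decode, decode_next):
    """Count-based estimate of T(s' | s, a) from transitions (x, a, x'),
    decoding observations into latent states at both time steps."""
    counts = Counter()
    totals = Counter()
    for x, a, x_next in samples:
        s, s_next = decode(x), decode_next(x_next)
        counts[(s, a, s_next)] += 1
        totals[(s, a)] += 1
    T_hat = defaultdict(float)
    for (s, a, s_next), c in counts.items():
        T_hat[(s, a, s_next)] = c / totals[(s, a)]
    return T_hat

# Toy check with identity decoders over 1-bit states.
samples = [((0,), 0, (1,)), ((0,), 0, (1,)), ((0,), 0, (0,)), ((1,), 0, (1,))]
T_hat = estimate_model(samples, decode=lambda x: x, decode_next=lambda x: x)
assert abs(T_hat[((0,), 0, (1,))] - 2 / 3) < 1e-9
assert T_hat[((1,), 0, (1,))] == 1.0
```

The estimation error of such empirical frequencies is what $\Delta_{est}(n_{est}, \delta_{est})$ controls, while $\Delta_{app}$ accounts for decoding mistakes.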

D.6 LEARNING A POLICY COVER

In this section we show how we learn the policy cover. We start by defining some notation. The next lemma is borrowed from Du et al. (2019). They state their lemma for a specific event ($E = \alpha(\hat{s})$ in their notation), but this choice of event is not important and their proof holds for any event.

Lemma 30 (Lemma G.5 of Du et al. (2019)). Let $\mathcal{S} = (\mathcal{S}_1, \cdots, \mathcal{S}_H)$ and $\hat{\mathcal{S}} = (\hat{\mathcal{S}}_1, \cdots, \hat{\mathcal{S}}_H)$ be the real and learned state spaces. We assume access to a decoder $\hat{\phi} : \mathcal{X} \to \hat{\mathcal{S}}$ and let $\phi^\star : \mathcal{X} \to \mathcal{S}$ be the oracle decoder. Let $\theta_h : \hat{\mathcal{S}}_h \to \mathcal{S}_h$ be a bijection for every $h \in [H]$ and $\theta : \hat{\mathcal{S}} \to \mathcal{S}$ where $\theta(\hat{s}) = \theta_h(\hat{s})$ for $\hat{s} \in \hat{\mathcal{S}}_h$. For any $h \in [H]$ and $s_h \in \mathcal{S}_h$, we assume $P(\hat{s}_h = \theta_h^{-1}(s_h) \mid s_h) \ge 1 - \varepsilon$, i.e., given the real state $s_h$, our decoder $\hat{\phi}$ will map it to $\theta_h^{-1}(s_h)$ with probability at least $1 - \varepsilon$. Let $\varphi : \mathcal{S} \to \mathcal{A}$ be a deterministic policy on the real state space and $\hat{\varphi} : \hat{\mathcal{S}} \to \mathcal{A}$ be the induced policy given by $\hat{\varphi}(\hat{s}) = \varphi(\theta(\hat{s}))$ for every $\hat{s} \in \hat{\mathcal{S}}$. Let $\pi, \hat{\pi} : \mathcal{X} \to \mathcal{A}$ where $\pi(x) = \varphi(\phi^\star(x))$ and $\hat{\pi}(x) = \hat{\varphi}(\hat{\phi}(x))$ for every $x \in \mathcal{X}$. For every random event $E$ we have: $|P_\pi(E) - P_{\hat{\pi}}(E)| \le H\varepsilon$.

In particular, let $\varphi : \mathcal{S} \to \mathcal{A}$ be a policy for $\mathcal{M}$ and let $\hat{\varphi} : \hat{\mathcal{S}} \to \mathcal{A}$ be the induced policy on $\hat{\mathcal{M}}$ such that for any $\hat{s} \in \hat{\mathcal{S}}$ we have $\hat{\varphi}(\hat{s}) = \varphi(\theta(\hat{s}))$. Then for any $h \in [H]$ we have:

$\sum_{s_h \in \mathcal{S}_h} \left|P_{\hat{\varphi}}(\theta^{-1}(s_h)) - P_\varphi(s_h)\right| \le h\varepsilon$
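The induced-policy construction in Lemma 30 is plain function composition; a minimal sketch with hypothetical stand-ins for the decoder $\hat{\phi}$, the bijection $\theta$, and the latent-state policy $\varphi$:

```python
def induced_policy(phi_hat, theta, varphi):
    """Build pi_hat(x) = varphi(theta(phi_hat(x))): decode the observation,
    map the learned state to its real counterpart, then apply the
    latent-state policy."""
    return lambda x: varphi(theta(phi_hat(x)))

# Hypothetical components for a 2-bit latent state.
phi_hat = lambda x: (x[0] % 2, x[1] % 2)    # stand-in decoder
theta = lambda s_hat: s_hat                  # identity bijection for this sketch
varphi = lambda s: 0 if s == (0, 0) else 1   # latent-state policy

pi_hat = induced_policy(phi_hat, theta, varphi)
assert pi_hat((2, 4)) == 0   # decodes to (0, 0) -> action 0
assert pi_hat((1, 4)) == 1   # decodes to (1, 0) -> action 1
```

When the decoder errs with probability at most $\varepsilon$ per step, the composed policy's behavior differs from the ideal $\pi(x) = \varphi(\phi^\star(x))$ only on those error events, which is what the lemma's bound accumulates over the horizon.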

F EXPERIMENT DETAILS

We provide details for our proof-of-concept experiment below.

Modeling details. We model $\mathcal{F}$, used for performing the independence test, using a single-layer feed-forward network $\theta_F$ with Leaky ReLU non-linearity (Maas et al., 2013) and a softmax output layer. Given a pair of atoms $x[u]$ and $x[v]$, we concatenate these atoms and map them to a probability distribution over $\{0, 1\}$ by applying $\theta_F$. We implement the model class $\mathcal{G}$ for learning the state decoder following the suggestion of Misra et al. (2020). Recall that a function in $\mathcal{G}$ maps a transition $(x, a, x') \in \mathcal{X} \times \mathcal{A} \times \mathcal{X}$ to a value in $[0, 1]$. We first map $x$ and $x'$ to vectors $v_1$ and $v_2$ respectively, using two separate linear layers. We map the action $a$ to its one-hot vector representation $\mathbb{1}_a$. We map the vector $v_2$ to a probability distribution using the Gumbel-softmax trick (Jang et al., 2016), by computing $q_i \propto \exp(v_2[i] + \vartheta_i)$ for all $i \in \{1, 2\}$, where $\vartheta_i$ is sampled independently from the Gumbel distribution. We concatenate the vectors $v_1$, $\mathbb{1}_a$ and $q$ and map them to a probability distribution over $\{0, 1\}$ through a single-layer feed-forward network $\theta_G$ with Leaky ReLU non-linearity. We recover a decoder $\hat{\phi}$ from the model that maps a set of atoms $x$ to $\hat{\phi}(x) = \arg\max_{i \in \{0,1\}} q_{i+1}$.
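The Gumbel-softmax step can be sketched in isolation. A minimal numpy sketch of computing $q_i \propto \exp(v[i] + \vartheta_i)$ with iid Gumbel noise (temperature fixed at 1; a simplified stand-in for the trick of Jang et al. (2016), not the paper's full decoder network):

```python
import numpy as np

def gumbel_softmax(v, rng):
    """Sample a relaxed one-hot vector: add iid Gumbel noise to the
    logits and normalize with a softmax."""
    gumbel = -np.log(-np.log(rng.uniform(size=v.shape)))  # Gumbel(0, 1) samples
    z = v + gumbel
    z = z - z.max()              # shift for numerical stability
    q = np.exp(z)
    return q / q.sum()

rng = np.random.default_rng(0)
v = np.array([2.0, -1.0])
q = gumbel_softmax(v, rng)
hard_bit = int(np.argmax(q))     # the decoder output uses argmax over q
assert np.isclose(q.sum(), 1.0)
assert q.shape == (2,)
```

The argmax at the end mirrors how the decoder $\hat{\phi}$ is recovered from the relaxed distribution $q$ at evaluation time.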



The notation $\mathrm{supp}(p)$ denotes the support of the distribution $p$; formally, $\mathrm{supp}(p) = \{z \mid p(z) > 0\}$. Our analysis can use any non-zero lower bound on $\eta_{\min}$, $\beta_{\min}$, $\sigma$ and any upper bound on $d$ and $\kappa$. We can also easily modify our proof to use the cross-entropy loss by using generalization bounds for the log-loss (see Appendix E in Agarwal et al. (2020)).



Figure 1: Left: A room navigation task as a Factored Block MDP setting, showing atoms and factors. Center and Right: The different stages executed by the FactoRL algorithm. We do not show the observation x emitted by s, for brevity. In practice, a factor would emit many more atoms.

A Factored MDP $(\mathcal{S}, \mathcal{A}, T, R, H)$ consists of a $d$-dimensional discrete state space $\mathcal{S} \subseteq \{0,1\}^d$, a finite action space $\mathcal{A}$, an unknown transition function $T : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, an unknown reward function $R : \mathcal{S} \times \mathcal{A} \to [0, 1]$ and a time horizon $H$. Each state $s \in \mathcal{S}$ consists of $d$ factors, with the $i$th factor denoted $s[i]$. The transition function satisfies $T(s' \mid s, a) = \prod_{i=1}^d T_i(s'[i] \mid s[pt(i)], a)$ for every $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$, where $T_i : \{0,1\}^{|pt(i)|} \times \mathcal{A} \to \Delta(\{0,1\})$ defines a factored transition distribution and $pt : [d] \to 2^{[d]}$ is a parent function.
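The factored transition structure can be made concrete with a short sketch; the per-factor conditional tables below are illustrative, and `parents[i]` plays the role of $pt(i)$:

```python
def transition_prob(s_next, s, a, parents, factor_models):
    """T(s' | s, a) = prod_i T_i(s'[i] | s[pt(i)], a), where parents[i]
    lists the parent indices of factor i and factor_models[i] maps
    (parent values, action) to a distribution over {0, 1}."""
    prob = 1.0
    for i, bit in enumerate(s_next):
        parent_vals = tuple(s[j] for j in parents[i])
        prob *= factor_models[i][(parent_vals, a)][bit]
    return prob

# Two factors; factor 0 depends on itself, factor 1 on both factors.
parents = {0: (0,), 1: (0, 1)}
factor_models = {
    0: {((0,), 0): (0.9, 0.1), ((1,), 0): (0.2, 0.8)},
    1: {((0, 0), 0): (0.5, 0.5), ((0, 1), 0): (0.3, 0.7),
        ((1, 0), 0): (0.6, 0.4), ((1, 1), 0): (0.1, 0.9)},
}
p = transition_prob((1, 1), (0, 1), 0, parents, factor_models)
assert abs(p - 0.1 * 0.7) < 1e-9   # T_0(1 | s[0]=0, a) * T_1(1 | (s[0], s[1])=(0, 1), a)
```

Because each table is indexed only by the parent values, the representation grows linearly in $d$ and exponentially only in the number of parents, which is exactly what makes factored models tractable.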

then Assumption 2 implies that these atoms are dependent on each other for any roll-in distribution $D \in \Delta(\mathcal{S}_{h-1} \times \mathcal{A})$ over the previous state and action. However, if $i \ne j$, then deterministically setting the previous action and the values of the parent factors $pt(i)$ and $pt(j)$ makes $x[u]$ and $x[v]$ independent. For the example in Figure 1, fixing the values of $s[1]$, $s[2]$ and $a$ would make $x[u]$ and $x[u']$ independent of each other.
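This can be illustrated with a toy simulation: atoms emitted by the same factor remain correlated, while atoms from different (here independently drawn) factors are uncorrelated. A minimal sketch using empirical covariance as a stand-in for the paper's classifier-based independence test:

```python
import random

def empirical_cov(pairs):
    """Empirical covariance of a list of (u, v) samples."""
    n = len(pairs)
    mu = sum(u for u, _ in pairs) / n
    mv = sum(v for _, v in pairs) / n
    return sum((u - mu) * (v - mv) for u, v in pairs) / n

rng = random.Random(0)
same, cross = [], []
for _ in range(20000):
    s1, s2 = rng.randint(0, 1), rng.randint(0, 1)  # two latent factors
    # each factor emits noisy atoms (10% bit-flip emission noise)
    x_u  = s1 ^ (rng.random() < 0.1)
    x_u2 = s1 ^ (rng.random() < 0.1)
    x_v  = s2 ^ (rng.random() < 0.1)
    same.append((x_u, x_u2))    # atoms of the same factor
    cross.append((x_u, x_v))    # atoms of different factors
assert empirical_cov(same) > 0.1         # strongly correlated
assert abs(empirical_cov(cross)) < 0.02  # approximately independent
```

The simulation draws the two factors independently, so the cross-factor atoms are unconditionally independent; under a general roll-in distribution the conditioning on parent factors and action described above would be needed to break the correlation.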

// stage 3: find latent p̂t_h and estimate model
6: for I ∈ C_{≤2κ}([d]), Z ∈ {0, 1}^{|I|} do
7:   φ̂_{hIZ} = planner(T̂, R_{hIZ}, ∆_pl) where R_{hIZ} := 1{ŝ_h[I] = Z}   // stage 4: planning
8:

; Thomas et al. (2018); Laversanne-Finot et al. (2018); Miladinović et al. (

), Misra et al. (2020) and Feng et al. (2020) provide computationally and sample efficient algorithms for Block MDPs, a rich-observation setting with a latent non-factored state space. Nevertheless, this line of results crucially relies on the number of latent states being relatively small. Finally, we note that the contrastive learning technique used in this paper has been used by other reinforcement learning algorithms for learning feature representations (e.g., Kim et al. (2019); Nachum et al. (2019); Srinivas et al. (2020)) without theoretical guarantees.

Figure 2: Scheme showing the important variables for the decoding step.

{(x_j, a_j, x̄_j)}_{j=1}^{n_deg} where (x_j, a_j) ∼ D and x̄_j ∼ ρ (lines 6-8). Recall x̄ = x[ch_h(i)]. Next, we compute the width of prediction values over D_deg as defined below:

max_{j,k ∈ [n_deg]} |ĝ(x_j, a_j, x̄_j) − ĝ(x_k, a_k, x̄_k)|    (25)
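Equation 25 is simply the spread of the regressor's outputs over the collected dataset; as a sketch (with ĝ an arbitrary scalar-valued function, and the data layout illustrative):

```python
def prediction_width(g_hat, transitions):
    """Width of g_hat over a dataset: the maximum pairwise gap of predictions.

    transitions: list of (x, a, x_bar) tuples
    """
    values = [g_hat(x, a, x_bar) for (x, a, x_bar) in transitions]
    # max_{j,k} |v_j - v_k| equals max(values) - min(values)
    return max(values) - min(values)
```

A degenerate factor yields near-constant predictions, so a small width is evidence of degeneracy; this is the quantity thresholded against a.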

Degenerate Factors). Fix a > 0. If s′[i] only takes a single value, then with probability at least 1 − δ_abs − 2n_deg √∆(n_abs, δ_abs, |G|) / a

(using Equation 26 and Lemma 13). We use Lemma 13, which has a failure probability of δ_abs + 2n_deg √∆(n_abs, δ_abs, |G|) / a

as the learned state bits corresponding to w. More generally, for a given set K ∈ 2^{[d]}, we denote the real state factors as s[K] (or s′[K]) and the corresponding learned state factors as ŝ[K] (or ŝ′[K]).


Lemma 19 (Model Approximation Error). For any i ∈ [d], K ∈ C_{≤2κ}([d]), s ∈ S_{h−1}, a ∈ A, s′ ∈ S_h, let ŝ = θ^{−1}_{h−1}(s) and ŝ′ = θ^{−1}_h(s′). Then we have:

|P_D(ŝ′[i] | ŝ[K], a) − P_D(s′[i] | s[K], a)| ≤ ∆_app := 5κεN / (αη_min).

Proof. We will first bound |P_D(s′[i] | ŝ[K], a) − P_D(s′[i] | s[K], a)| and then use the correspondence result (Corollary 9) to prove the desired result. We start by expressing our conditional probabilities as ratios of joint probabilities:

P_D(s′[i] | ŝ[K], a) = P_D(s′[i], ŝ[K], a) / P_D(ŝ[K], a),    P_D(s′[i] | s[K], a) = P_D(s′[i], s[K], a) / P_D(s[K], a).

From Lemma 29 we have:

|P_D(s′[i] | ŝ[K], a) − P_D(s′[i] | s[K], a)| ≤ (ε_1 + ε_2) / P_D(s[K], a),    where    (27)

ε_1 := |P_D(s′[i], ŝ[K], a) − P_D(s′[i], s[K], a)| and ε_2 := |P_D(ŝ[K], a) − P_D(s[K], a)|.

We bound these two quantities below:

ε_1 = | Σ_{s_{h−1}[K]} P_D(s′[i], ŝ[K], s_{h−1}[K], a) − Σ_{ŝ_{h−1}[K]} P_D(s′[i], ŝ_{h−1}[K], s[K], a) |
    = | Σ_{s_{h−1}[K] ≠ s[K]} P_D(s′[i], ŝ[K], s_{h−1}[K], a) − Σ_{ŝ_{h−1}[K] ≠ ŝ[K]} P_D(s′[i], ŝ_{h−1}[K], s[K], a) |
    ≤ max{ Σ_{s_{h−1}[K] ≠ s[K]} P_D(s′[i], ŝ[K], s_{h−1}[K], a), Σ_{ŝ_{h−1}[K] ≠ ŝ[K]} P_D(s′[i], ŝ_{h−1}[K], s[K], a) },

where the first inequality uses |a − b| ≤ max{a, b} for a, b > 0.

Consider the second sum (the first is bounded identically):

Σ_{ŝ_{h−1}[K] ≠ ŝ[K]} P_D(s′[i] | ŝ_{h−1}[K], s[K], a) P_D(ŝ_{h−1}[K] | s[K]) P_D(s[K]) / |A|
  ≤ (1/|A|) Σ_{ŝ_{h−1}[K] ≠ ŝ[K]} P_D(ŝ_{h−1}[K] | s[K])
  = (1/|A|) {1 − P_D(ŝ[K] | s[K])}
  ≤ (1/|A|) (1 − (1 − ε)^{|K|})
  ≤ 2κε / |A|,

where we use P_D(ŝ[K] | s[K]) = ∏_{k∈K} P(ŝ[k] | s[k]) ≥ (1 − ε)^{|K|} and |K| ≤ 2κ.

Published as a conference paper at ICLR 2021

This gives us ε_1 ≤ 2κε/|A|. The proof for ε_2 is similar.

… ≤ (1/|A|)(1 − (1 − ε)^{|K|}) ≤ 2κε/|A|. This gives us ε_2 ≤ 2κε/|A|. Plugging the bounds for ε_1 and ε_2 into Equation 27 and using P_D(s[K], a) = P_D(s[K]) / |A| ≥ αη_min / (N|A|) gives us:

|P_D(s′[i] | ŝ[K], a) − P_D(s′[i] | s[K], a)| ≤ 4κε / (|A| P_D(s[K], a)) ≤ 4κεN / (αη_min).    (28)

We can use the correspondence result to derive a lower bound:

P_D(ŝ′[i] | ŝ[K], a) ≥ P_D(ŝ′[i] | s′[i]) P_D(s′[i] | ŝ[K], a) ≥ (1 − ε) P_D(s′[i] | ŝ[K], a) ≥ P_D(s′[i] | ŝ[K], a) − ε,

and an upper bound:

P_D(ŝ′[i] | ŝ[K], a) = P_D(ŝ′[i] | s′[i]) P_D(s′[i] | ŝ[K], a) + P(ŝ′[i] | 1 − s′[i]) P_D(1 − s′[i] | ŝ[K], a) ≤ P_D(s′[i] | ŝ[K], a) + ε.

Combining the lower and upper bounds with Equation 28 gives us:

|P_D(ŝ′[i] | ŝ[K], a) − P_D(s′[i] | s[K], a)| ≤ 4κεN / (αη_min) + ε ≤ 5κεN / (αη_min).

S_j := Σ_{ŝ′[j] ⋯ ŝ′[d] ∈ {0,1}} | ∏_{i=j}^d T̂_{hi}(ŝ′[i] | ŝ[p̂t(i)], a) − ∏_{i=j}^d T_i(θ(ŝ′)[i] | θ(ŝ)[pt(i)], a) |    (30)

We claim that S_j ≤ 6(d − j + 1)(∆_est + ∆_app) for every j ∈ [d]. For the base case we have:

S_d = Σ_{ŝ′[d] ∈ {0,1}} | T̂_{hd}(ŝ′[d] | ŝ[p̂t(d)], a) − T(θ(ŝ′)[d] | θ(ŝ)[pt(d)], a) | ≤ 6(∆_est + ∆_app),

≤ Σ_{ŝ′[j] ⋯ ŝ′[d] ∈ {0,1}} ∏_{i=j+1}^d T̂_{hi}(ŝ′[i] | ŝ[p̂t(i)], a) | T̂_{hj}(ŝ′[j] | ŝ[p̂t(j)], a) − T(θ(ŝ′)[j] | θ(ŝ)[pt(j)], a) |
  + Σ_{ŝ′[j] ⋯ ŝ′[d] ∈ {0,1}} T(θ(ŝ′)[j] | θ(ŝ)[pt(j)], a) | ∏_{i=j+1}^d T̂_{hi}(ŝ′[i] | ŝ[p̂t(i)], a) − ∏_{i=j+1}^d T(θ(ŝ′)[i] | θ(ŝ)[pt(i)], a) |

The first term is equal to Σ_{ŝ′[j] ∈ {0,1}} | T̂_{hj}(ŝ′[j] | ŝ[p̂t(j)], a) − T(θ(ŝ′)[j] | θ(ŝ)[pt(j)], a) |, which is bounded by 6(∆_est + ∆_app) following the base case analysis. The second term is equal to S_{j+1}, which is bounded by 6(d − j)(∆_est + ∆_app) by the induction hypothesis. Combining these two bounds proves the induction claim, and the result then follows from the bound on S_1.
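The induction above is the standard telescoping argument: the L1 distance between two product distributions is at most the sum of the per-factor L1 distances. A quick numeric check on product distributions over {0,1}^d (Bernoulli parameters chosen arbitrarily):

```python
from itertools import product

def l1_product_gap(p_factors, q_factors):
    """L1 distance between two product distributions over {0,1}^d.

    p_factors[i] and q_factors[i] give P(bit i = 1) under each distribution.
    """
    d = len(p_factors)
    total = 0.0
    for bits in product([0, 1], repeat=d):
        p = q = 1.0
        for i, b in enumerate(bits):
            p *= p_factors[i] if b else 1 - p_factors[i]
            q *= q_factors[i] if b else 1 - q_factors[i]
        total += abs(p - q)
    return total

# The per-factor L1 gaps (2|p_i - q_i| for Bernoulli factors) sum to an
# upper bound on the joint L1 gap, as in the telescoping argument.
p, q = [0.7, 0.4, 0.9], [0.6, 0.5, 0.85]
per_factor_sum = sum(2 * abs(a - b) for a, b in zip(p, q))
assert l1_product_gap(p, q) <= per_factor_sum + 1e-12
```
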

Lemma 29 (Lemma H.3 of Du et al. (2019)). For any a, b, c, d > 0 with a ≤ b and c ≤ d we have:

|a/b − c/d| ≤ (|a − c| + |b − d|) / d

|P_π(E) − P_π̂(E)| ≤ 2εH

Lemma 31 (Lemma H.2 of Du et al. (2019)). Let there be two tabular MDPs M and M̂. Let S = (S_1, ⋯, S_H) be the state space of M and Ŝ = (Ŝ_1, ⋯, Ŝ_H) be the state space of M̂. Both MDPs share the action space A and the horizon H. For every h ∈ [H], there exists a bijection θ_h : Ŝ_h → S_h. Let T : S × A → ∆(S) and T̂ : Ŝ × A → ∆(Ŝ) be the transition dynamics of M and M̂, satisfying:

∀h ∈ [H], a ∈ A, ŝ ∈ Ŝ:  Σ_{ŝ′ ∈ Ŝ_h} | T_h(θ(ŝ′) | θ(ŝ), a) − T̂_h(ŝ′ | ŝ, a) | ≤ ε

Comparison with Block MDP Algorithms. Our work is closely related to algorithms for Block MDPs, which can be viewed as a non-factorized version of our setting. Du et al. (2019) proposed a model-based approach for Block MDPs. They learn a decoder for a given time step by training a classifier to predict the decoded state and action at the previous time step. In our case, this results in a classification problem over exponentially many classes, which can be practically undesirable. In contrast, Misra et al. (

List of notations and their definitions.

C.1 ALGORITHM DESCRIPTION

Let D ∈ ∆(S_{h−1} × A) be our roll-in distribution, which induces a probability distribution P_D ∈ ∆(X_h) over observations at time step h. Let u, v ∈ [m] be two different atoms, and

⋯, H}, at the end of time step t (Algorithm 1, line 8), FactoRL finds a child function ĉh_t : [d] → 2^{[m]}, a decoder φ̂_t : X → {0, 1}^d, a transition function T̂_t : {0, 1}^d × A → {0, 1}^d, and a set of policies Ψ_t satisfying the following:

IH.1 ch_t : [d] → 2^{[m]} and ĉh_t : [d] → 2^{[m]} are the same up to relabeling, i.e., for all u, v ∈ [m] we

Theorem 8 (Detecting Degenerate Case). We correctly predict whether s′[i] is a degenerate factor when using n_deg = 3N²|A|²

Two MDPs. After time step h, we can define two Markov Decision Processes (MDPs), M_h and M̂_h. M_h is the true MDP, consisting of the state space (S°_1, ⋯, S°_h), action space A, horizon h, a deterministic start state s_1 = {0}^d, and transition function T°. Recall that the set S_h ⊆ {0, 1}^d denotes the states which are reachable at time step h, and the set


irrespective of how we vary the factors J. However, if pt(i) ⊆ I then there exists a parent factor that, on varying, will yield different values for the learned dynamics. Let K = I ∪ J be the set of all factors in the control group (I) and the variable group (J).

We will first analyze the case of a fixed i ∈ [d], control group I and variable group J. For a given v ∈ {0, 1}, ŵ ∈ {0, 1}^{|K|} and a ∈ A, we have P̂_D(ŝ′[i] = v | ŝ[K] = ŵ, a) denoting the estimated probability derived from our count-based estimation (Algorithm 4, line 3). Let P_D(ŝ′[i] = v | ŝ[K] = ŵ, a) be the probability being estimated. It is important to note that we use the subscript D in these notations: the learned states ŝ are not Markovian and K may not contain pt(i), so both the estimated probabilities P̂ and the expected probabilities P depend on the roll-in distribution D.

In order to estimate P̂_D(· | ŝ_{h−1}[K] = ŵ, a) we want good lower bounds on P_D(ŝ_{h−1}[K] = ŵ, a) for every a ∈ A and every ŵ reachable for K. Our roll-in distribution D only guarantees a lower bound on P_D(s_{h−1}[K], a). However, we can use IH.2 to bound the desired quantity below.

Lemma 17 (Model Estimation Coverage). If ε ≤ 1/2 then for all K ∈ C_{≤2κ}([d]), a ∈ A and ŵ ∈ {0, 1}^{|K|} reachable for K, we have:

P_D(ŝ_{h−1}[K] = ŵ, a) = P_D(ŝ_{h−1}[K] = ŵ) / |A| ≥ αη_min / (4^κ N |A|),

as actions are taken uniformly. Let s_{h−1} = θ_{h−1}(ŝ_{h−1}) and w = s_{h−1}[K]. We bound P_D(ŝ_{h−1}[K] = ŵ) as shown:

where the third step uses the fact that the value of the learned state ŝ_{h−1}[k] is independent of the other decoders given the real state bit s_{h−1}[k]. The fourth step uses IH.2 and the fact that we have good coverage over all sets of state factors of size at most 2κ. The last inequality uses |K| ≤ 2κ and ε ≤ 1/2.

We now show that our count-based estimator P̂_D converges to P_D and derive the rate of convergence.

Lemma 18 (Model Estimation Error). Fix δ_est ∈ (0, 1).
Then with probability at least 1 − δ_est, for every K ∈ C_{≤2κ}([d]), ŵ ∈ {0, 1}^{|K|} reachable for K, and a ∈ A we have the following:

where

Proof. We sample n_est samples by rolling in at time step h − 1 with distribution D and taking actions uniformly. We first analyze the failure probability for a given K, ŵ, a, with probability at least 1 − δ. Lemma 17 shows that the probability of E(K, ŵ, a) is at least αη_min / (4^κ N |A|). Therefore, from Lemma 28, if n_est ≥ 2^{2κ+1} m N |A| / (αη_min) · ln(e/δ) then we get at least m samples of the event E(K, ŵ, a) with probability at least 1 − δ. Therefore, the total failure probability is at most 2δ: δ due to not getting at least m samples, and δ due to Corollary 15 on getting m samples. This holds for every triplet (K, ŵ, a), and Lemma 23 shows that there are at most 2(ed)^{2κ}|A| such triplets. Hence, an application of the union bound gives the desired result.

Lemma 20 (K-Model Error). For any i ∈ [d], K ∈ C_{≤2κ}([d]), s ∈ S_{h−1}, a ∈ A, s′ ∈ S_h, let ŝ = θ^{−1}_{h−1}(s) and ŝ′ = θ^{−1}_h(s′). Then we have:

with probability at least 1 − δ_est.

Proof. Follows directly by combining the estimation error (Lemma 18) and the approximation error (Lemma 19) with an application of the triangle inequality.
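The count-based estimator P̂_D of Lemma 18 is a plain conditional frequency table; a minimal sketch (the data layout is illustrative, not the paper's code):

```python
from collections import Counter

def estimate_conditional(samples, i, K):
    """Count-based estimate of P(s'[i] = 1 | s[K], a) from roll-in samples.

    samples: list of (s, a, s_next) with s and s_next given as bit-lists
    returns: dict mapping (values of s[K], a) -> estimated probability
    """
    hits, totals = Counter(), Counter()
    for s, a, s_next in samples:
        key = (tuple(s[k] for k in K), a)
        totals[key] += 1          # times this conditioning event occurred
        hits[key] += s_next[i]    # times the target bit came up 1
    return {key: hits[key] / totals[key] for key in totals}
```

Lemma 18 then quantifies how many roll-in samples are needed before each such frequency is close to the conditional it estimates.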

D.4 DETECTING LATENT PARENT STRUCTURE IN TRANSITION pt h

We are now ready to analyze the performance of the learned parent function p̂t. Fix i ∈ [d], K_1, K_2 ∈ C_{≤2κ}([d]) and ŵ_1 ∈ {0, 1}^{|K_1|}, ŵ_2 ∈ {0, 1}^{|K_2|}. We will assume ŵ_1 is reachable for K_1 and ŵ_2 is reachable for K_2. For convenience we define the following quantity Ω̂, which measures the total variation distance between distributions, for a fixed action a ∈ A:

We can compute Ω̂ for every value of i, K_1, ŵ_1, K_2, ŵ_2, a in computational time O((2ed)^{3κ+3}|A|). We also define a similar metric for the true distribution for any K_1, K_2 and v ∈ {0, 1}, w_1 ∈ {0, 1}^{|K_1|}, w_2 ∈ {0, 1}^{|K_2|} and a ∈ A:

Recall Lemma 21, where the max is taken over J_1, J_2, ŵ_1, ŵ_2.

Proof. We fix J_1, J_2, û, ŵ_1, ŵ_2, a and let w_1 = θ(ŵ_1) and w_2 = θ(ŵ_2). As pt(i) ⊆ I, we have:

Using this result along with Lemma 20 and an application of the triangle inequality we get:

Summing over v, dividing by 2, and using the definition of Ω̂ proves the result.

The following is a straightforward corollary of Lemma 21.

Corollary 11. Fix i ∈ [d]. Then there exists an I such that for all a ∈ A and û ∈ {0, 1}^{|I|}:

where the max is taken over J_1, J_2, ŵ_1, ŵ_2 satisfying the restrictions stated in Lemma 21.

Proof. Take any I such that pt(i) ⊆ I and apply Lemma 21. Note that we are allowed to pick such an I as |pt(i)| ≤ κ by our assumption.

Recall that we define p̂t(i) as the solution of the following problem:    (29)

where I ∈ C_{≤κ}([d]), a ∈ A, and û, J_1, J_2, ŵ_1, ŵ_2 satisfy the restrictions stated in Lemma 21.

We are now ready to state our main result for p̂t.

Theorem 12 (Property of p̂t). For any i ∈ [d], the learned parent function p̂t satisfies:

Proof. As pt(i) ⊆ K, we have:

Combining this result with Lemma 20 we get:

From the definition of p̂t(i) (Equation 29) and Corollary 11 we have:

Note that we are allowed to use Corollary 11 as ŝ[p̂t(i); ∅] and ŝ[p̂t(i); J] are both reachable, since they are derived from a reachable real state s. Combining the previous two inequalities using the triangle inequality completes the proof.
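The parent-selection rule can be schematized as follows: for each candidate control set I, measure how much the estimated conditional of ŝ′[i] still varies when only factors outside I change, and pick the I with the smallest worst-case variation. This is a simplified stand-in for the Ω̂-based objective of Equation 29; all names and the data layout are illustrative.

```python
from itertools import combinations

def select_parents(cond, d, kappa, actions):
    """Pick a candidate parent set for factor i.

    cond   : dict mapping (full factor assignment tuple, action) -> P(s'[i] = 1 | ..., a)
    d      : number of factors
    kappa  : maximum parent set size
    actions: list of actions
    Returns the size-<=kappa set I minimizing the worst-case spread of the
    conditional across assignments that agree on I.
    """
    best_I, best_score = None, float("inf")
    for size in range(1, kappa + 1):
        for I in combinations(range(d), size):
            score = 0.0
            for a in actions:
                # Group assignments by their restriction to I; the spread within
                # a group is variation caused by factors outside I.
                groups = {}
                for (w, act), p in cond.items():
                    if act != a:
                        continue
                    groups.setdefault(tuple(w[j] for j in I), []).append(p)
                for ps in groups.values():
                    score = max(score, max(ps) - min(ps))
            if score < best_score:
                best_I, best_score = I, score
    return best_I
```

When the conditional truly depends only on pt(i), any I ⊇ pt(i) attains zero spread, which mirrors why the argmin in Equation 29 recovers a valid parent set.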

D.5 BOUND TOTAL VARIATION BETWEEN ESTIMATED MODEL AND TRUE MODEL

Given the learned transition parent function p̂t_h we define the transition model as:

From Theorem 12 we have for any

Transition Closure. A subtle point remains before we prove the model error between T̂_h and T. Theorem 12 only gives guarantees for those ŝ that are the inverse of a reachable state s. However, as stated before, due to decoder error we can reach a state ŝ which does not have a corresponding reachable state, i.e. θ(ŝ) ∉ S_h (see Figure 3). We cannot get model guarantees for these unreachable states ŝ, since we may reach them with arbitrarily small probability. However, we can still derive a model error bound if we simply define the real transition probabilities in terms of the learned probabilities for these states. This will not cause a problem since the real model never reaches these states. We start by defining the closure of the transition model T° for time step h as:

We also define its state space domain. It is easy to see that θ_{h−1} represents a bijection between Ŝ_{h−1} and S°_{h−1}. We will derive our guarantees with respect to T°, which allows us to define a bijection between the domains of T̂ and T° and use important lemmas from the literature. The next result shows that our use of T° is harmless, as it assigns the same probability as T to any event. (Recall S_t ⊆ S°_t.)

The second MDP M̂_h consists of the learned state space (Ŝ_1, ⋯, Ŝ_h), action space A, horizon h, a deterministic start state ŝ_1 = {0}^d, and transition function T̂_t : Ŝ_{t−1} × A → ∆(Ŝ_t). For every t ∈ [h], θ_t : Ŝ_t → S°_t represents a bijection from the learned state space to the closure of the set of reachable states at time step t. The learned decoder φ̂_t predicts θ_t(s) given s ∈ S_t with high probability, for all t < h by IH.2 and for t = h by Corollary 9. Lastly, the transition models T°_t and T̂_t are close in L1 distance, for t < h by IH.3 and for t = h by Theorem 13. These results enable us to utilize the analysis of Du et al.
(2019) for learning a policy cover.

Let φ : Ŝ → A denote a non-stationary deterministic policy that operates on the learned state space. Similarly, let ϕ : S° → A denote a non-stationary deterministic policy that operates on the real state. We write φ = ϕ ∘ θ if for every ŝ ∈ Ŝ we have φ(ŝ) = ϕ(θ(ŝ)). Similarly, we write ϕ = φ ∘ θ^{−1} if for every s ∈ S° we have ϕ(s) = φ(θ^{−1}(s)). Let π : X → A be a non-stationary deterministic policy operating on the observation space. We say π = φ ∘ φ̂ if for every x ∈ X we have π(x) = φ(φ̂(x)). Similarly, we define π = ϕ ∘ φ if for every x ∈ X we have π(x) = ϕ(φ(x)).

We will use P_π[E] to denote the probability of an event E when actions are taken according to a policy π : X → A. We will use P_ϕ[E] to denote the probability of E when we operate directly on the real state and take actions using ϕ. Similarly, we define P_φ[Ê] to denote the probability of an event Ê when we operate on the learned state space. Lastly, let P̂_φ[Ê] denote the probability of an event Ê when actions are taken according to a policy φ operating directly on the latent state, following our estimated transition dynamics T̂ : Ŝ × A → ∆(Ŝ). Recall that our planner optimizes with respect to P̂_φ[Ê].

Theorem 14 (Planner Guarantee). Further, we have:

and if {s_h[I] = θ(ŵ)} is unreachable, then

Proof. We define two events E := {s_h[I] = θ(ŵ)} and Ê := {ŝ_h[I] = ŵ}. We define a policy ϕ_R = φ̂_R ∘ θ^{−1}, where for every s ∈ S° we have ϕ_R(s) = φ̂_R(θ^{−1}(s)). We also define π. Hence, every time our decoder outputs the correct mapped state θ(s), the two policies take the same action. We use the result of Du et al. (2019) stated in Lemma 30 (with ε set to dε, using Corollary 9) to write:

Let ϕ : S° → A be any policy on the real state space and let φ : Ŝ → A be the induced policy on the learned state space given by φ(ŝ) = ϕ ∘ θ(ŝ) = ϕ(θ(ŝ)) for any ŝ ∈ Ŝ. We showed in Theorem 13 that T and T̂ have small L1 distance under the bijection θ.
Therefore, from the perturbation result of Du et al. (2019) stated in Lemma 31 we have:

where ε := 6d(∆_est(n_est, δ_est) + ∆_app) due to Theorem 13. As {s_h[I] = θ(ŵ)} ⇔ {ŝ_h[I] = ŵ}, we can derive the following bound:

Let ϕ = arg max_ϕ P_ϕ[E] be the optimal policy for satisfying {s_h[I] = θ(ŵ)}. Note that ϕ is also the latent policy that optimizes the reward function R on the real dynamics. Let φ = ϕ ∘ θ^{−1} be the induced policy on learned states. We now bound the desired quantity as shown:

(using Equation 34)
≥ P_{φ̂_R}(Ê) − 2dεH − Hε    (using Equation 35)

This proves Equation 31 and Equation 32. Note that our calculations above show:

Plugging this into the above equation proves Equation 33 and completes the proof.

D.7 WRAPPING UP THE PROOF FOR FactoRL

We are almost done. All we need to do is set the hyperparameters and verify each induction hypothesis. We first set the hyperparameters.

Setting Hyperparameters. Let {s_h[I] = θ(ŵ)} be reachable for some I ∈ C_{≤2κ}([d]) and ŵ ∈ {0, 1}^{|I|}. Then, applying Theorem 14 and using the definition of η_min, we have:

As we want the right-hand side to be at least αη_min, we divide the error equally between the three terms. This gives us:

(Planning Error) ∆_pl ≤ (1 − α)η_min / 4    (36)

The model approximation error places a more stringent requirement on ε than the decoding error for planning. However, throughout the proof for FactoRL in this section, we made other requirements on our hyperparameters. For ε this is given by min{β²_min/1200, 1/2} = β²_min/1200, obtained by combining the constraints in Lemma 8 and Theorem 7, with an additional constraint for detecting non-degenerate factors stated in Corollary 9. Due to the inefficiency of non-degenerate factor detection, we state results separately for the two cases. Using the definition of ε from Theorem 7, we get a value of n_abs for the non-degenerate factor case (Equation 40) and the general case (Equation 41) given below:

Recall that for detecting degenerate factors we collect n_deg samples. Corollary 9 gives the value of this hyperparameter as

which also satisfies the condition in Lemma 16. Lastly, Theorem 3 gives the number of samples for independence testing n_ind and the rejection sampling frequency k as:

Failure probabilities for a single time step are bounded by δ_ind due to identification of the emission structure (Theorem 3), 3dδ_abs due to decoding (Corollary 9), and δ_est due to model estimation (Lemma 17). By a union bound, the total failure probability is δ_ind + 3dδ_abs + δ_est for a single step, and δ_ind H + 3dδ_abs H + δ_est H for the whole algorithm.
Binding δ_ind H → δ/3, 3dδ_abs H → δ/3, and δ_est H → δ/3 gives a total failure probability of δ and the right values of the hyperparameters.

The sample complexity of FactoRL is at most kHn_ind + Hn_abs + Hn_deg + Hn_est episodes, which is of the order:

where we use the fact that N = |Ψ_{h−1}| can be at most 2(ed)^{2κ} from Lemma 23. Note that if we did not have to apply the expensive degeneracy-detection step, then we would get a logarithmic dependence on 1/δ_abs. Cheaper ways of detecting degeneracy can, therefore, significantly improve the sample complexity. We have not attempted to optimize the degrees and exponents in the sample complexity above.

For our choice of the two hyperparameters n_est and n_abs, we can bound the model error and decoding failure by:

Verifying the Induction Hypotheses. Finally, we verify the different induction hypotheses below.

1. We already verified IH.1 with Theorem 3: we learn a ĉh_h that is equivalent to ch_h up to label permutation.
2. We already verified IH.2 with Corollary 9: given a real state s ∈ S_h, our decoder outputs the corresponding learned state with high probability. We also derived the form of ε.
3. We already verified IH.3 with Theorem 13. We also derived the forms of ∆_est and ∆_app.
4. Lastly, Theorem 14 and our subsequent hyperparameter calculations show that Ψ_h is an α-policy cover of S_h and that the size of Ψ_h is at most 2(ed)^{2κ} from Lemma 23. For all reachable factor values, the value of the learned policy is at least (1+α)η_min/2 using Equation 32 and our choice of hyperparameter values. Similarly, from Equation 33, the value of the learned policy for all unreachable factor values is at most (1−α)η_min/4. This allows us to filter out all unreachable factor values. In the main paper we focus on the value α = 1/2, which explains why on Algorithm 1, line 8, we only keep those policies with value at least (1+α)η_min/2 = 3η_min/4. This verifies IH.4.

This completes the analysis of FactoRL.
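The counting bound of Lemma 23, which the sample-complexity expression above uses through N ≤ 2(ed)^{2κ}, is easy to sanity-check numerically:

```python
from math import comb, e

def assignment_count(d, k):
    """Number of (subset of [d] of size <= k, bit assignment on it) pairs."""
    return sum(comb(d, i) * 2 ** i for i in range(k + 1))

# Verify the bound sum_{i<=k} C(d,i) 2^i <= 2 (e d)^k on a small grid.
for d in range(2, 12):
    for k in range(2, min(d, 5) + 1):
        assert assignment_count(d, k) <= 2 * (e * d) ** k
```
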

E SUPPORTING RESULTS

Lemma 23 (Assignment Counting Lemma). For given k, d ∈ ℕ with k ≤ d, the cardinality of the set

Proof. Assume k ≥ 2. The cardinality of this set is given by Σ_{i=0}^{k} C(d, i) 2^i, which can be bounded as shown below:

The first inequality uses the well-known bound on binomial coefficients C(n, i) ≤ (en/i)^i for any n, i ∈ ℕ with i ≤ n. Further bounding the above result using ed − 1 ≥ ed/2 gives us:

The proof is completed by checking that the inequality also holds for k < 2.

Lemma 24 (Lemma H.1 in Du et al. (2019)). Let u, v ∈ ℝ^d_+ with ‖u‖_1 = ‖v‖_1 = 1 and ‖u − v‖_1 ≥ ε. Then for any α > 0 we have ‖αu − v‖_1 ≥ ε/2.

Lemma 25 (Chernoff Bound). Let q be the probability of an event occurring. Then, given n i.i.d. samples with n ≥ 1/q, the probability that the event occurs at least once is at least 1 − 2 exp(−qn/3).

Proof. Let X_i be a 0-1 indicator denoting whether the event occurred, and let X = Σ_{i=1}^n X_i. We have E[X_i] = q and E[X] = qn. Let t = 1 − 1/(qn). We will assume qn > 1, so t ∈ (0, 1). Then the probability that the event never occurs is bounded by:

Lemma 26 (Hoeffding's Inequality). Let X_1, X_2, ⋯, X_n be independent random variables bounded in the interval [0, 1]. Let the empirical mean of these random variables be X̄ = (1/n) Σ_{i=1}^n X_i. Then for any t > 0 we have:

Lemma 27 (Weissman et al. (2003)). Let P be a probability distribution over a discrete set of size a. Let X^n = X_1, X_2, ⋯, X_n be independent identically distributed random variables distributed according to P. Let P̂_{X^n} be the empirical probability distribution estimated from the sample set X^n. Then for all ε > 0:

The next result is a direct corollary of Lemma 27.

Corollary 15. For any m ≥ (8a/ε²) ln(1/δ) samples, we have ‖P − P̂_{X^n}‖_1 < ε with probability at least 1 − δ.

Lemma 28. Let X be the number of successes in n independent trials, each with success probability µ. If n ≥ (2m/µ) ln(e/δ), then P(X < m) ≤ δ.
Proof. This is a standard Chernoff-bound argument. We have E[X] = nµ. Assuming n ≥ m/µ, the multiplicative Chernoff bound gives:

Setting the right-hand side equal to δ and solving gives n ≥ (2/µ)(m + ln(1/δ)), which is satisfied whenever n ≥ (2m/µ) ln(e/δ).

Learning Details. We train the two models using the cross-entropy loss. Formally, given a dataset D_ind = {(x_i[u], x_i[v], y_i)}_{i=1}^{n_ind} for performing independence testing, and a dataset D_abs = {(x_i, a_i, x̄_i, y_i)}_{i=1}^{n_abs} for learning a decoder, we optimize the models by minimizing the cross-entropy loss as shown below:

Here we overload our notation to allow the output of the models to be a distribution over {0, 1} rather than a scalar value in [0, 1], as we assumed before. This is in sync with how we implement these model classes and allows us to conveniently perform cross-entropy loss minimization.

Planner Details. We use a simple planner based on approximate dynamic programming. Given the model estimate, the reward function, and the set of visited learned states, we perform dynamic programming to compute optimal Q-values for the set of visited states. We assume the Q-values of non-visited states to be 0. This allows us to compute Q-values in a computationally efficient manner. Later, if the agent visits an unvisited state, it simply takes a random action.

Hyperparameters. We set the hidden dimension of θ_F to 10 and that of θ_G to 56. We set the threshold c on the held-out log-loss value, when performing the independence test, to 0.65. For reference, the log-loss of a uniformly random classifier is −ln(0.5) ≈ 0.69. We train both models for 10 epochs using Adam optimization with a learning rate of 0.001 and a batch size of 32. We remove 0.2% of the training data and use it as a validation set. We evaluate on the validation set after every epoch and use the model with the best performance on the validation set. We used PyTorch 1.6 to develop the code and used the default initialization scheme for all parameters.
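The approximate dynamic-programming planner described above can be sketched as follows (a simplified tabular version; the variable layout is our own, not the paper's code):

```python
def plan(model, reward, visited, actions, horizon):
    """Backward dynamic programming over the visited learned states.

    model  : dict (t, s, a) -> dict of next-state probabilities
    reward : dict (t, s, a) -> scalar reward
    visited: list of per-timestep sets of learned states seen during training
    Unvisited states get value 0 (the agent acts randomly there), matching
    the approximation described in the text.
    """
    V = [dict() for _ in range(horizon + 1)]
    policy = [dict() for _ in range(horizon)]
    for t in range(horizon - 1, -1, -1):
        for s in visited[t]:
            best_a, best_q = None, float("-inf")
            for a in actions:
                q = reward.get((t, s, a), 0.0)
                for s_next, p in model.get((t, s, a), {}).items():
                    q += p * V[t + 1].get(s_next, 0.0)  # unvisited -> value 0
                if q > best_q:
                    best_a, best_q = a, q
            V[t][s], policy[t][s] = best_q, best_a
    return V, policy
```

Restricting the backup to visited states keeps each sweep linear in the number of states actually encountered rather than exponential in d.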

