OPAL: OFFLINE PRIMITIVE DISCOVERY FOR ACCELERATING OFFLINE REINFORCEMENT LEARNING

Abstract

Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent's ability to query the environment for transitions and rewards is effectively unlimited. However, in many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severely limited. In this work, we focus on this offline setting. Our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. Primitives extracted in this way serve two purposes: they delineate the behaviors that are supported by the data from those that are not, making them useful for avoiding distributional shift in offline RL; and they provide a degree of temporal abstraction, which reduces the effective horizon, yielding better learning in theory and improved offline RL in practice. In addition to benefiting offline policy optimization, we show that offline primitive learning of this kind can also be leveraged to improve few-shot imitation learning as well as exploration and transfer in online RL on a variety of benchmark domains. Visualizations and code are available at https://sites.google.com/view/opal-iclr

1. INTRODUCTION

Reinforcement Learning (RL) systems have achieved impressive performance in a variety of online settings such as games (Silver et al., 2016; Tesauro, 1995; Brown & Sandholm, 2019) and robotics (Levine et al., 2016; Dasari et al., 2019; Peters et al., 2010; Parmas et al., 2019; Pinto & Gupta, 2016; Nachum et al., 2019a), where the agent can act in the environment and sample as many transitions and rewards as needed. However, in many practical applications the agent's ability to continuously act in the environment may be severely limited due to practical concerns (Dulac-Arnold et al., 2019). For example, a robot learning through trial and error in the real world requires costly human supervision, safety checks, and resets (Atkeson et al., 2015), rendering many standard online RL algorithms inapplicable (Matsushima et al., 2020). In such settings, however, we might instead have access to large amounts of previously logged data, which could come from a baseline hand-engineered policy or even from other related tasks. For example, in self-driving applications, one may have access to large amounts of human driving behavior; in robotic applications, one might have data of either humans or robots performing similar tasks. While these offline datasets are often undirected (generic human driving data on various routes in various cities may not be directly relevant to navigation of a specific route within a specific city) and unlabelled (generic human driving data is often not labelled with the human's intended route or destination), this data is still useful in that it can inform the algorithm about what is possible to do in the real world, without the need for active exploration. In this paper, we show that, in this offline setting, an effective strategy for leveraging unlabeled and undirected past data is to use unsupervised learning to extract potentially useful, temporally extended primitive skills that capture what types of behaviors are possible.
For example, consider a dataset of an agent performing undirected navigation in a maze environment (Figure 1). While the dataset does not provide demonstrations of exclusively one specific point-to-point navigation task, it nevertheless presents clear indications of which temporally extended behaviors are useful and natural in this environment (e.g., moving forward, left, right, and backward), and our unsupervised learning objective aims to distill these behaviors into temporally extended primitives. Once these locomotive primitive behaviors are extracted, we can use them as a compact, constrained, temporally extended action space for learning a task policy with offline RL, which then only needs to focus on task-relevant navigation, making task learning easier. For example, once a specific point-to-point navigation task is commanded, the agent can leverage the learned primitives for locomotion and focus only on the task of navigation, as opposed to learning locomotion and navigation from scratch. We refer to our proposed unsupervised learning method as Offline Primitives for Accelerating offline reinforcement Learning (OPAL), and apply this basic paradigm to offline RL, where the agent is given a single offline dataset to use for both the initial unsupervised learning phase and a subsequent task-directed offline policy optimization phase. Despite the fact that no additional data is used, we find that our proposed unsupervised learning technique can dramatically improve offline policy optimization compared to performing offline policy optimization on the raw dataset directly. To the best of our knowledge, ours is the first work to theoretically justify and experimentally verify the benefits of primitive learning in offline RL settings, showing that hierarchies can provide temporal abstraction that allows us to mitigate the issue of compounding errors in offline RL.
These theoretical and empirical results are notably in contrast to previous related work in online hierarchical RL (Nachum et al., 2019b), which found that improved exploration is the main benefit afforded by hierarchically learned primitives. We instead show significant benefits in the offline RL setting, where exploration is irrelevant. Beyond offline RL, and although this is not the main focus of our work, we also show the applicability of our method for accelerating RL by incorporating OPAL as a preprocessing step to standard online RL, few-shot imitation learning, and multi-task transfer learning. In all settings, we demonstrate that the use of OPAL can improve the speed and quality of downstream task learning.

2. RELATED WORK

Offline RL. Offline RL presents the problem of learning a policy from a fixed prior dataset of transitions and rewards. Recent works in offline RL (Kumar et al., 2019; Levine et al., 2020; Wu et al., 2019; Ghasemipour et al., 2020; Jaques et al., 2019; Fujimoto et al., 2018) constrain the policy to be close to the data distribution to avoid the use of out-of-distribution actions (Kumar et al., 2019; Levine et al., 2020). To constrain the policy, some methods use distributional penalties, as measured by KL divergence (Levine et al., 2020; Jaques et al., 2019), MMD (Kumar et al., 2019), or Wasserstein distance (Wu et al., 2019). Other methods first sample actions from the behavior policy and then either clip the maximum deviation from those actions (Fujimoto et al., 2018) or just use those actions (Ghasemipour et al., 2020) during the value backup to stay within the support of the offline data. In contrast to these works, OPAL uses an offline dataset for unsupervised learning of a continuous space of primitives. The use of these primitives for downstream tasks implicitly constrains a learned primitive-directing policy to stay close to the offline data distribution. As we demonstrate in our experiments, using OPAL in conjunction with an off-the-shelf offline RL algorithm in this way can yield significant improvement compared to applying offline RL to the dataset directly.

Online skill discovery. There are a number of recent works (Eysenbach et al., 2018; Nachum et al., 2018a; Sharma et al., 2019) which use unsupervised objectives to discover skills and use the discovered skills for planning (Sharma et al., 2019), few-shot imitation learning, or online RL (Eysenbach et al., 2018; Nachum et al., 2018a). However, these works focus on online settings and assume access to the environment. In contrast, OPAL focuses on settings where a large dataset of diverse behaviors is provided but access to the environment is restricted. It leverages these static offline datasets to discover primitive skills with better state coverage, and it avoids the exploration issue of learning primitives from scratch.

Hierarchical policy learning. Hierarchical policy learning involves learning a hierarchy of policies in which a low-level policy acts as a set of primitive skills and a high-level policy directs the low-level policy to solve a task. While some works (Bacon et al., 2017; Stolle & Precup, 2002; Peng et al., 2019) learn a discrete set of lower-level policies, each behaving as a primitive skill, other works (Vezhnevets et al., 2017; Nachum et al., 2018b; 2019a; Hausman et al., 2018) learn a continuous space of primitive skills representing the lower-level policy. These methods have mostly been applied in online settings. However, there are some recent variants of the above works (Lynch et al., 2020; Shankar & Gupta, 2020; Krishnan et al., 2017; Merel et al., 2018) which extract skills from a prior dataset and use them either for performing tasks directly (Lynch et al., 2020) or for learning downstream tasks (Shankar & Gupta, 2020; Krishnan et al., 2017; Merel et al., 2018) with online RL. While OPAL is related to these works, we mainly focus on leveraging the learned primitives to asymptotically improve the performance of offline RL; i.e., both the primitive learning and the downstream task must be solved using a single static dataset. Furthermore, we provide performance bounds for OPAL and enumerate the specific properties an offline dataset should possess to guarantee improved downstream task learning, while such theoretical guarantees are largely absent from existing work.

3. PRELIMINARIES

We consider the standard Markov decision process (MDP) setting (Puterman, 1994), specified by a tuple M = (S, A, P, μ, r, γ), where S is the state space, A is the action space, P(s'|s, a) is the transition probability, μ(s) is the initial state distribution, r(s, a) ∈ (-R_max, R_max) is the reward function, and γ ∈ (0, 1) is the discount factor. A policy π in this MDP corresponds to a function S → Δ(A), where Δ(A) is the simplex over A. It induces a discounted future state distribution d^π, defined by d^π(s) = (1 - γ) Σ_{t=0}^{∞} γ^t P(s_t = s | π), where P(s_t = s | π) is the probability of reaching the state s at time t by running π on M. For a positive integer k, we use d_k^π(s) = (1 - γ^k) Σ_{t=0}^{∞} γ^{tk} P(s_{tk} = s | π) to denote the every-k-step state distribution of π. The return of policy π in MDP M is defined as J_RL(π, M) = (1/(1 - γ)) E_{s∼d^π, a∼π(a|s)}[r(s, a)]. We represent the reward- and discount-agnostic environment as a tuple E = (S, A, P, μ). We aim to use a large, unlabeled, and undirected experience dataset D := {τ_i := (s_t, a_t)_{t=0}^{c-1}}_{i=1}^{N} associated with E to extract primitives and improve offline RL for downstream task learning. To account for the fact that the dataset D may be generated by a mixture of diverse policies starting at diverse initial states, we assume D is generated by first sampling a behavior policy π ∼ Π along with an initial state s ∼ κ, where Π and κ represent some (unknown) distributions over policies and states, respectively, and then running π on E for c time steps starting at s_0 = s. We define the probability of a sub-trajectory τ := (s_0, a_0, ..., s_{c-1}, a_{c-1}) in D under a policy π as π(τ) = κ(s_0) Π_{t=1}^{c-1} P(s_t | s_{t-1}, a_{t-1}) Π_{t=0}^{c-1} π(a_t | s_t), and the conditional probability as π(τ|s) = 1[s = s_0] Π_{t=1}^{c-1} P(s_t | s_{t-1}, a_{t-1}) Π_{t=0}^{c-1} π(a_t | s_t).
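To make the dataset construction concrete, the following sketch (our own illustrative code, not from the paper) slices one logged trajectory into the consecutive c-step sub-trajectories τ = (s_0, a_0, ..., s_{c-1}, a_{c-1}) that make up D:

```python
import numpy as np

def slice_subtrajectories(states, actions, c):
    """Slice one logged trajectory into consecutive c-step sub-trajectories
    tau = (s_0, a_0, ..., s_{c-1}, a_{c-1}), dropping any incomplete tail."""
    T = min(len(states), len(actions))
    subs = []
    for i in range(T // c):  # number of complete c-step windows
        subs.append((np.asarray(states[i * c:(i + 1) * c]),
                     np.asarray(actions[i * c:(i + 1) * c])))
    return subs

# Example: a 25-step trajectory with 4-dim states and 2-dim actions, c = 10,
# yields 25 // 10 = 2 complete sub-trajectories; the 5-step tail is dropped.
rng = np.random.default_rng(0)
dataset = slice_subtrajectories(rng.normal(size=(25, 4)),
                                rng.normal(size=(25, 2)), c=10)
```

In practice D would pool such windows from many behavior policies and initial states; this sketch shows only the windowing of a single trajectory.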
In this work, we will show how to apply unsupervised learning techniques to D to extract a continuous space of primitives π_θ(a|s, z), where z ∈ Z, the latent space inferred by unsupervised learning. We intend to use the learned π_θ(a|s, z) to asymptotically improve the performance of offline RL for downstream task learning. For offline RL, we assume the existence of a dataset D^r := {τ_i^r := (s_t, a_t, r_t)_{t=0}^{c-1}}_{i=1}^{N}, corresponding to the same sub-trajectories in D labelled with MDP rewards. Additionally, we can use the extracted primitives for other applications such as few-shot imitation learning, online RL, and online multi-task transfer learning. We review the additional assumptions for these applications in Appendix A.

[Figure 2: Graphical overview of offline RL with OPAL. The labelled data {τ = (s_t, a_t, r_t)_{t=0}^{c-1}} is passed through the encoder q_φ(z|τ), with prior ρ_ω(z|s_0), to produce latent-labelled sub-trajectories {τ = (s_t, a_t, z)_{t=0}^{c-1}} for training the primitive policy π_θ(a|s, z), and relabelled transitions {(s_0, z, s_c, Σ_{t=0}^{c-1} γ^t r_t)} for training the task policy π_ψ(z|s).]

4. OFFLINE RL WITH OPAL

In this section, we elaborate on OPAL, our proposed method for extracting primitives from D and then leveraging these primitives to learn downstream tasks with offline RL. We begin by describing our unsupervised objective, which distills D into a continuous space of latent-conditioned and temporally-extended primitive policies π_θ(a|s, z). For learning downstream tasks with offline RL, we first label D^r with appropriate latents using the OPAL encoder q_φ(z|τ) and then learn a policy π_ψ(z|s) which is trained to sample an appropriate primitive every c steps to optimize a specific task, using any off-the-shelf offline RL algorithm. A graphical overview of offline RL with OPAL is shown in Figure 2. While we mainly focus on offline RL, we briefly discuss how to use the learned primitives for few-shot imitation learning, online RL, and multi-task online transfer learning in Section 5 and provide more details in Appendix A.

4.1. EXTRACTING TEMPORALLY-EXTENDED PRIMITIVES FROM DATA

We would like to extract a continuous space of temporally-extended primitives π_θ(a|s, z) from D which we can later use as an action space for learning downstream tasks with offline RL. This both reduces our effective task horizon, making downstream learning easier, and allows the downstream policy to stay close to the offline data distribution, bringing stability to downstream learning. We propose the following objective for learning π_θ, incorporating an auto-encoding loss function with a KL constraint to encourage better generalization:

min_{θ,φ,ω} J(θ, φ, ω) = Ê_{τ∼D, z∼q_φ(z|τ)} [ -Σ_{t=0}^{c-1} log π_θ(a_t | s_t, z) ]    (1)

s.t. Ê_{τ∼D} [ D_KL( q_φ(z|τ) || ρ_ω(z|s_0) ) ] ≤ ε_KL,    (2)

where Ê indicates empirical expectation. The learned components of this objective may be interpreted as encoder, decoder, and prior:

Encoder: q_φ(z|τ) encodes the sub-trajectory τ of state-action pairs into a distribution in latent space and outputs the parameters of that distribution. In our case, we represent q_φ with a bidirectional GRU which takes in τ and outputs the parameters (μ_z^enc, σ_z^enc) of a Gaussian distribution.

Decoder (aka primitive policy): π_θ(a|s, z) is the latent-conditioned policy. It maximizes the conditional log-likelihood of the actions in τ given the states and the latent vector. In our implementation, we parameterize it as a feed-forward neural network which takes in the current state and latent vector and outputs the parameters (μ_a, σ_a) of a Gaussian distribution over actions.

Prior/primitive predictor: ρ_ω(z|s_0) tries to predict the encoded distribution of the sub-trajectory τ from its initial state. Our implementation uses a feed-forward neural network which takes in the initial state and outputs the parameters (μ_z^pr, σ_z^pr) of a Gaussian distribution.

KL constraint (Equation 2): as an additional component of the algorithm, we enforce consistency between the latent distributions predicted by the encoder q_φ(z|τ) and the prior ρ_ω(z|s_0).
Since our goal is to obtain a primitive z that captures a temporal sequence of actions for a given sub-trajectory τ = (s_0, a_0, ..., s_{c-1}, a_{c-1}) (as defined in Section 3), we utilize a regularization that enforces the distribution q_φ(z|τ) to be close to one that predicts the primitive (i.e., the latent variable z) given only the start state s_0 of this sub-trajectory (i.e., ρ_ω(z|s_0)). Conditioning on the initial state regularizes the distribution q_φ(z|τ) to not overfit to the complete sub-trajectory τ, as the same z should also be predictable given only s_0. The above form of KL constraint is inspired by past works (Lynch et al., 2020; Kumar et al., 2020a). In particular, Lynch et al. (2020) add a KL constraint ("plan prior matching" in that work) which constrains the distribution over latent variables computed from only the initial state and the goal state to match the distribution over latent variables computed from the entire trajectory. Our form in Equation 2 is similar to this prior, except that we do not operate in a goal-conditioned RL setting and hence condition ρ_ω only on the initial state s_0. In practice, rather than solving the constrained optimization directly, we implement the KL constraint as a penalty, weighted by an appropriately chosen coefficient β. Thus, one may interpret our unsupervised objective as a sequential β-VAE (Higgins et al., 2016). However, as mentioned above, our prior is conditioned on s_0 and learned as part of the optimization, because the set of primitives active in D depends on s_0. If β = 1, OPAL is equivalent to a conditional VAE maximizing the log-probability of τ conditioned on its initial state s_0; see Appendix D for more details. Despite the similarities between our proposed objective and VAEs, our presentation of OPAL as a constrained auto-encoding objective is deliberate.
As we will show in Section 4.3, our theoretical guarantees depend on a well-optimized auto-encoding loss to provide benefits of using the learned primitives π_θ for downstream tasks. In contrast, a VAE loss, which simply maximizes the likelihood of observed data, may not necessarily provide a benefit for downstream tasks. For example, if the data can be generated by a single stationary policy, a VAE-optimal policy π_θ can simply ignore the latent z, thus producing a degenerate space of primitives. In contrast, when the KL constraint in our objective is weak (i.e., ε_KL ≫ 0 or β < 1), the auto-encoding loss is encouraged to find a distinct z for each distinct τ to optimize the reconstruction loss.
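To make the objective above concrete, here is a minimal numpy sketch of Equations (1)-(2) in their β-weighted penalty form. The mean-pooling encoder and the untrained linear decoder are our own toy stand-ins for the paper's bidirectional-GRU encoder and feed-forward networks; only the loss structure (action negative log-likelihood plus β times the encoder-prior KL) follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def diag_gauss_logp(x, mu, log_std):
    # Log-density of x under a diagonal Gaussian N(mu, exp(log_std)^2).
    var = np.exp(2.0 * log_std)
    return -0.5 * np.sum((x - mu) ** 2 / var + 2.0 * log_std + np.log(2.0 * np.pi))

def diag_gauss_kl(mu_q, ls_q, mu_p, ls_p):
    # Closed-form KL( N(mu_q, e^{ls_q}) || N(mu_p, e^{ls_p}) ), diagonal Gaussians.
    var_q, var_p = np.exp(2.0 * ls_q), np.exp(2.0 * ls_p)
    return 0.5 * np.sum(var_q / var_p + (mu_p - mu_q) ** 2 / var_p
                        - 1.0 + 2.0 * (ls_p - ls_q))

def opal_loss(tau_s, tau_a, beta, latent_dim=2):
    # Encoder q_phi(z|tau): mean-pool over the sub-trajectory (toy stand-in
    # for the bidirectional GRU), outputting Gaussian parameters.
    feat = np.concatenate([tau_s, tau_a], axis=1).mean(axis=0)
    mu_enc, ls_enc = feat[:latent_dim], np.zeros(latent_dim)
    z = mu_enc + np.exp(ls_enc) * rng.normal(size=latent_dim)  # reparameterized sample
    # Prior rho_omega(z|s_0): a function of the initial state only.
    mu_pr, ls_pr = tau_s[0, :latent_dim], np.zeros(latent_dim)
    # Decoder pi_theta(a|s, z): an untrained linear map standing in for the MLP.
    nll = 0.0
    for s, a in zip(tau_s, tau_a):
        mu_a = 0.1 * (s[: len(a)] + z[: len(a)])
        nll -= diag_gauss_logp(a, mu_a, np.zeros_like(a))
    kl = diag_gauss_kl(mu_enc, ls_enc, mu_pr, ls_pr)
    return nll + beta * kl  # Eq. (1) with constraint (2) as a beta-weighted penalty

# One c = 10 sub-trajectory with 4-dim states and 2-dim actions.
tau_s = rng.normal(size=(10, 4))
tau_a = rng.normal(size=(10, 2))
loss = opal_loss(tau_s, tau_a, beta=0.1)
```

A real implementation would backpropagate this loss through learned encoder, decoder, and prior networks; the sketch only shows how the two terms of the objective are assembled.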

4.2. OFFLINE RL WITH PRIMITIVES FOR DOWNSTREAM TASKS

After distilling learned primitives from D in terms of an encoder q_φ(z|τ), a latent primitive policy (or decoder) π_θ(a|s, z), and a prior ρ_ω(z|s_0), OPAL then applies these learned models to improve offline RL for downstream tasks. As shown in Figure 2, our goal is to use a dataset of reward-labelled sub-trajectories D^r = {τ_i := (s_t^i, a_t^i, r_t^i)_{t=0}^{c-1}}_{i=1}^{N} to learn a behavior policy π that maximizes cumulative reward. With OPAL, we use the learned primitives π_θ(a|s, z) as low-level controllers, and then learn a high-level controller π_ψ(z|s). To do so, we relabel the dataset D^r in terms of temporally extended transitions using the learned encoder q_φ(z|τ). Specifically, we create a dataset D_hi^r = {(s_0^i, z_i, Σ_{t=0}^{c-1} γ^t r_t^i, s_c^i)}_{i=1}^{N}, where z_i ∼ q_φ(·|τ_i). Given D_hi^r, any off-the-shelf offline RL algorithm can be used to learn π_ψ(z|s) (in our experiments we use CQL (Kumar et al., 2020b)). As a way to ensure that the c-step transitions τ_i := (s_t^i, a_t^i, r_t^i)_{t=0}^{c-1} remain consistent with the labelled latent action z_i, we finetune π_θ(a|s, z) on D_lo^r = {((s_t^i, a_t^i)_{t=0}^{c-1}, z_i)}_{i=1}^{N} with a simple latent-conditioned behavioral cloning loss:

min_θ Ê_{(τ,z)∼D_lo^r} [ -Σ_{t=0}^{c-1} log π_θ(a_t | s_t, z) ].    (3)
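The relabeling step above can be sketched as follows. This is our own illustrative numpy code: the mean-pooling `encode` is a stand-in for the learned encoder q_φ(z|τ), and we use a deterministic z in place of sampling:

```python
import numpy as np

def relabel_for_offline_rl(sub_trajs, encode, gamma=0.99):
    """Build the high-level dataset D_hi = {(s_0, z, sum_t gamma^t r_t, s_c)}:
    each reward-labelled c-step sub-trajectory becomes one temporally
    extended transition whose 'action' is the inferred latent z."""
    d_hi = []
    for states, actions, rewards, next_state in sub_trajs:
        z = encode(states, actions)              # z ~ q_phi(.|tau); deterministic here
        discounts = gamma ** np.arange(len(rewards))
        r = float(np.sum(discounts * rewards))   # c-step discounted reward
        d_hi.append((states[0], z, r, next_state))
    return d_hi

# Illustrative stand-in for the learned encoder: mean-pool of (s_t, a_t) pairs.
encode = lambda s, a: np.concatenate([s, a], axis=1).mean(axis=0)

rng = np.random.default_rng(0)
sub = (rng.normal(size=(10, 4)),   # states s_0..s_9
       rng.normal(size=(10, 2)),   # actions a_0..a_9
       np.ones(10),                # rewards r_0..r_9
       rng.normal(size=4))         # s_c, the state after the sub-trajectory
d_hi = relabel_for_offline_rl([sub], encode, gamma=0.5)
```

The resulting tuples have exactly the form of standard (state, action, reward, next-state) transitions, which is why any off-the-shelf offline RL algorithm can be trained on them with z as the action.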

4.3. SUBOPTIMALITY AND PERFORMANCE BOUNDS FOR OPAL

Now, we will analyze OPAL and derive performance bounds for it in the context of offline RL, formally examining the benefit of the temporal abstraction afforded by OPAL as well as studying what properties $\mathcal{D}$ should possess so that OPAL can improve downstream task performance. As explained above, when applying OPAL to offline RL, we first learn the primitives $\pi_\theta(a|s,z)$ using $\mathcal{D}$, and then learn a high-level task policy $\pi_\psi(z|s)$ in the space of the primitives. Let $\pi_{\psi^*}(z|s)$ be the optimal task policy. The low-level $\pi_\theta$ and high-level $\pi_{\psi^*}$ together comprise a hierarchical policy, which we denote as $\pi_{\theta,\psi^*}$. To quantify the performance of policies obtained from OPAL, we define the suboptimality of the learned primitives $\pi_\theta(a|s,z)$ in an MDP $\mathcal{M}$ with associated optimal policy $\pi^*$ as

$$\mathrm{SubOpt}(\theta) := |J_{\mathrm{RL}}(\pi^*,\mathcal{M}) - J_{\mathrm{RL}}(\pi_{\theta,\psi^*},\mathcal{M})|.$$

To relate $\mathrm{SubOpt}(\theta)$ to a divergence between $\pi^*$ and $\pi_{\theta,\psi^*}$, we introduce the following performance difference lemma.

Lemma 4.0.1. If $\pi_1$ and $\pi_2$ are two policies in $\mathcal{M}$, then

$$|J_{\mathrm{RL}}(\pi_1,\mathcal{M}) - J_{\mathrm{RL}}(\pi_2,\mathcal{M})| \le \frac{2R_{\max}}{(1-\gamma^c)(1-\gamma)}\,\mathbb{E}_{s\sim d^{\pi_1}_c}\left[D_{\mathrm{TV}}(\pi_1(\tau|s)\,\|\,\pi_2(\tau|s))\right],$$

where $D_{\mathrm{TV}}(\pi_1(\tau|s)\,\|\,\pi_2(\tau|s))$ denotes the TV divergence between $c$-length sub-trajectories $\tau$ sampled from $\pi_1$ vs. $\pi_2$ (see Section 3). Furthermore,

$$\mathrm{SubOpt}(\theta) \le \frac{2R_{\max}}{(1-\gamma^c)(1-\gamma)}\,\mathbb{E}_{s\sim d^{\pi^*}_c}\left[D_{\mathrm{TV}}(\pi^*(\tau|s)\,\|\,\pi_{\theta,\psi^*}(\tau|s))\right].$$

The proof of this lemma and of all the following results is provided in Appendix B.1. This lemma shows that the suboptimality of the learned primitives can be bounded by the total variation divergence between the optimal policy $\pi^*$ in $\mathcal{M}$ and the optimal policy acting through the learned primitives, $\pi_{\theta,\psi^*}$. We now bound the divergence between $\pi^*$ and $\pi_{\theta,\psi^*}$ in terms of how representative $\mathcal{D}$ is of $\pi^*$ and how well the primitives $\pi_\theta$ optimize the auto-encoding objective (Equation 1).
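To make the horizon benefit of Lemma 4.0.1 concrete, one can compare its coefficient $2R_{\max}/((1-\gamma^c)(1-\gamma))$ against the flat-policy coefficient $2R_{\max}/(1-\gamma)^2$ from the standard performance difference lemma. A quick numerical check with illustrative values (not taken from the paper's experiments):

```python
gamma, c, R_max = 0.99, 10, 1.0

flat = 2 * R_max / (1 - gamma) ** 2                      # flat-policy coefficient
temporal = 2 * R_max / ((1 - gamma ** c) * (1 - gamma))  # c-step abstraction coefficient

# The coefficient shrinks by a factor of (1 - gamma**c) / (1 - gamma)
# = sum_{t < c} gamma**t, which approaches c as gamma -> 1.
print(flat, temporal, flat / temporal)
```

For these values the temporally abstracted coefficient is roughly a factor of $c$ smaller, matching the intuition that the effective horizon is reduced.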
We begin with a definition of how often an arbitrary policy appears in Π, the distribution generating $\mathcal{D}$:

Definition 1. We say a policy $\tilde\pi$ in $\mathcal{M}$ is ζ-common in Π if $\mathbb{E}_{\pi\sim\Pi,\,s\sim\kappa}\left[D_{\mathrm{TV}}(\pi(\tau|s)\,\|\,\tilde\pi(\tau|s))\right] \le \zeta$.

Theorem 4.1. Let θ, φ, ω be the outputs of solving Equation 1, with $J(\theta,\phi,\omega) = \epsilon_c$. Then, with high probability $1-\delta$, for any $\tilde\pi$ that is ζ-common in Π, there exists a distribution $H$ over $z$ such that for $\pi^H_\theta(\tau|s) := \mathbb{E}_{z\sim H}[\pi_\theta(\tau|z,s)]$,

$$\mathbb{E}_{s\sim\kappa}\left[D_{\mathrm{TV}}(\tilde\pi(\tau|s)\,\|\,\pi^H_\theta(\tau|s))\right] \le \zeta + \sqrt{\tfrac{1}{2}\left(\epsilon_c + \tfrac{S_J}{\sqrt{\delta}} + H_c\right)},$$

where $H_c = \mathbb{E}_{\pi\sim\Pi,\,\tau\sim\pi,\,s_0\sim\kappa}\left[\sum_{t=0}^{c-1}\log\pi(a_t|s_t)\right]$ (i.e., a constant and a property of $\mathcal{D}$) and $S_J$ is a positive constant incurred due to sampling error in $J(\theta,\phi,\omega)$, depending on concentration properties of $\pi_\theta(a|s,z)$ and $q_\phi(z|\tau)$.

Corollary 4.1.1. If the optimal policy $\pi^*$ of $\mathcal{M}$ is ζ-common in Π, and $\left\|d^{\pi^*}_c/\kappa\right\|_\infty \le \xi$, then, with high probability $1-\delta$,

$$\mathrm{SubOpt}(\theta) \le \frac{2\xi R_{\max}}{(1-\gamma^c)(1-\gamma)}\left(\zeta + \sqrt{\tfrac{1}{2}\left(\epsilon_c + \tfrac{S_J}{\sqrt{\delta}} + H_c\right)}\right).$$

As we can see, $\mathrm{SubOpt}(\theta)$ shrinks as $\mathcal{D}$ gets closer to $\pi^*$ (i.e., ζ approaches 0) and as better primitives are learned (i.e., $\epsilon_c$ decreases). While it might be tempting to increase $c$ (the length of the sub-trajectories) to reduce the suboptimality, a larger $c$ inevitably makes it harder in practice to control the auto-encoding loss $\epsilon_c$, leading to an increase in overall suboptimality; this induces a trade-off in determining the best value of $c$. In our experiments we treat $c$ as a hyperparameter and set $c = 10$, although more sophisticated ways of determining $c$ are an interesting avenue for future work.

So far, we have argued that a near-optimal task policy $\pi_{\psi^*}$ exists if θ is sufficiently well-learned and $\pi^*$ is sufficiently well-represented in $\mathcal{D}$. We now show how primitive learning can improve downstream learning by considering the benefits of using OPAL with offline RL. Building on the policy performance analysis of Kumar et al. (2020b), we present theoretical results bounding the performance of the policy obtained when offline RL is performed with OPAL.

Table 1: Results for CQL (Kumar et al., 2020b) and CQL+OPAL (ours).

Theorem 4.2. Let $\pi_{\psi^*}(z|s)$ be the policy obtained by CQL and let $\pi_{\psi^*,\theta}(a|s)$ refer to the policy obtained when $\pi_{\psi^*}(z|s)$ is used together with $\pi_\theta(a|s,z)$. Let $\pi_\beta \equiv \{\pi\,;\,\pi\sim\Pi\}$ refer to the policy generating $\mathcal{D}^r$ in MDP $\mathcal{M}$, and define $\pi^H_\beta(z|s)$ via $z\sim\pi^H_\beta(z|s) \equiv \tau\sim\pi_{\beta,s_0=s},\,z\sim q_\phi(z|\tau)$. Then $J(\pi_{\psi^*,\theta},\mathcal{M}) \ge J(\pi_\beta,\mathcal{M}) - \kappa$ with high probability $1-\delta$, where

$$\kappa = O\!\left(\frac{1}{(1-\gamma^c)(1-\gamma)}\,\mathbb{E}_{s\sim d^{\pi_{\psi^*,\theta}}_{\hat{\mathcal{M}}_H}(s)}\left[\sqrt{|Z|\left(D_{\mathrm{CQL}}(\pi_{\psi^*},\pi^H_\beta)(s)+1\right)}\right]\right) - \frac{\alpha}{1-\gamma^c}\,\mathbb{E}_{s\sim d^{\pi_{\psi^*}}_{\mathcal{M}_H}(s)}\left[D_{\mathrm{CQL}}(\pi_{\psi^*},\pi^H_\beta)(s)\right], \qquad (9)$$

where $D_{\mathrm{CQL}}$ is a measure of the divergence between two policies; see the appendix for a formal statement. The precise bound, along with a proof, is given in Appendix B.2. Intuitively, this bound says that the worst-case deterioration of the learned policy depends on the divergence (measured by $D_{\mathrm{CQL}}$) between the learned latent-space policy and the actual primitive distribution, which is controlled via any conservative offline RL algorithm (Kumar et al. (2020b) in our experiments), and on the size of the latent space $|Z|$. Crucially, comparing Equation 9 to the performance bound for CQL (Equation 6 in Kumar et al. (2020b)) reveals two benefits: (1) temporal abstraction, i.e., a reduction in the factor of horizon by virtue of $\gamma^c$, and (2) a reduction in the amount of worst-case error propagation due to the reduced action space $|Z|$ vs. $|A|$. Thus, as evident from the above bound, the total error induced by the combination of distributional shift and sampling error is significantly reduced when OPAL is used, compared to the standard RL counterpart of this bound, which is affected by the size of the entire action space at each and every timestep of the horizon. This formalizes our intuition that OPAL helps to partly mitigate distributional shift and sampling error.
One downside of using a latent space policy is that we incur unsupervised learning error while learning primitives. However, empirically, this unsupervised learning error gets dominated by other error terms pertaining to offline RL. That is, it is much easier to control unsupervised loss than errors arising in offline RL.

5. EVALUATION

In this section, we will empirically show that OPAL improves learning of downstream tasks with offline RL, and then briefly show the same with few-shot imitation learning, online RL, and online multi-task transfer learning. Unless otherwise stated, we use c = 10 and dim(Z) = 8. See Appendix C for further implementation and experimental details. Visualizations and code are available at https://sites.google.com/view/opal-iclr 

5.1. OFFLINE RL WITH OPAL

Description: We use environments and datasets provided in D4RL (Fu et al., 2020). Since the aim of our method is specifically to perform offline RL in settings where the offline data comprises varied and undirected multi-task behavior, we focus on Antmaze medium (diverse dataset), Antmaze large (diverse dataset), and Franka kitchen (mixed and partial datasets). The Antmaze datasets involve a simulated ant robot performing undirected navigation in a maze. The task is to use this undirected dataset to solve a specific point-to-point navigation problem, traversing the maze from one corner to the opposite corner, with only a sparse 0-1 completion reward for reaching the goal. The kitchen datasets involve a Franka robot manipulating multiple objects (microwave, kettle, etc.) either in an undirected manner (mixed dataset) or in a partially task-directed manner (partial dataset). The task is to use the datasets to arrange objects in a desired configuration, with only a sparse 0-1 completion reward for every object that attains the target configuration.

Table 2: Average success rate (%) (over 4 seeds) of few-shot IL methods: BC, BC+OPAL, and BC+SVAE (Wang et al., 2017).

Baselines: We use behavioral cloning (BC), BEAR (Kumar et al., 2019), EMAQ (Ghasemipour et al., 2020), and CQL (Kumar et al., 2020b) as baselines. We compare them to CQL+OPAL, which first uses OPAL to distill primitives from the offline dataset before applying CQL to learn a primitive-directing high-level policy. To ensure a fair comparison with EMAQ, we use an autoregressive primitive policy. Results: As shown in Table 1, CQL+OPAL outperforms nearly all the baselines on the antmaze (see Figure 1 and Figure 3 for visualization) and kitchen tasks, with the exception of EMAQ, which has similar performance on kitchen mixed.
With the exception of EMAQ on kitchen mixed, we are not aware of any existing offline RL algorithms that achieve similarly good performance on these tasks; moreover, we are not aware of any existing online RL algorithms that solve these tasks (see Table 3 for some comparisons), highlighting the benefit of using offline datasets to circumvent exploration challenges. There are two potential reasons for OPAL's success. First, temporally extended primitives could make the reward-propagation learning problem easier. Second, the primitives may provide a better latent action space than the atomic actions of the environment. To understand the relative importance of these factors, we experimented with an ablation of CQL+OPAL that uses c = 1 to remove temporal abstraction. In this case, we find the method's performance to be similar to standard CQL. This implies that the temporal abstraction provided by OPAL is one of the main contributing factors to its good performance. This observation also agrees with our theoretical analysis. See Appendix E for a detailed discussion.

5.2. FEW-SHOT IMITATION LEARNING WITH OPAL

Description: Previously, we assumed access to a task reward function, but only undirected data that performs other tasks. Now we study the opposite case, where we are not provided with a reward function for the new task either, but instead receive a small number of task-specific demonstrations that illustrate optimal behavior. Simply imitating these few demonstrations is insufficient to obtain a good policy, and our experiments evaluate whether OPAL can effectively incorporate the prior data to enable few-shot adaptation in this setting. We use the Antmaze environments (diverse datasets) to evaluate our method and use an expert policy for these environments to sample n = 10 successful trajectories. The results (Table 2; see Appendix A for detailed discussion) highlight the importance of temporal abstraction and ascertain the quality of the learned primitives.

5.3. ONLINE RL AND MULTI-TASK TRANSFER WITH OPAL

Description: For online RL and multi-task transfer learning, we learn a task policy in the space of primitives π_θ(a|s,z) while keeping the primitives fixed. For multi-task transfer, the task policy also takes in the task id, and we use c = 5 and dim(Z) = 8. Since the primitives need to transfer to a different state distribution for multi-task transfer, the encoder learns only the action sub-trajectory distribution and does not take in state feedback. See Appendix A for a detailed description of the models. For online RL, we use the Antmaze environments (diverse datasets) with sparse and dense rewards to evaluate our method. For online multi-task transfer learning, we learn primitives from expert data for a pick-and-place task and then use them to learn a multi-task policy for MT10 and MT50 (from Metaworld (Yu et al., 2020)), which contain 10 and 50 robotic manipulation tasks, respectively, that need to be solved simultaneously. Baselines and Results: For online RL, we use HIRO (Nachum et al., 2018b), a state-of-the-art hierarchical RL method; SAC (Haarnoja et al., 2018) with behavioral cloning (BC) pre-training on D; and Discovery of Deep Continuous Options (DDCO) (Krishnan et al., 2017), which uses D to learn a discrete set of primitives and then learns a task policy in the space of those primitives with online RL (Double DQN (DDQN) (Van Hasselt et al., 2015)). For online multi-task transfer learning, we use PPO (Schulman et al., 2017) and SAC (Haarnoja et al., 2018) as baselines. As shown in Table 3 and Table 4, OPAL uses temporal abstraction to improve exploration and thus accelerate both online RL and multi-task transfer learning. See Appendix A for detailed discussion.

6. DISCUSSION

We proposed Offline Primitives for Accelerating offline RL (OPAL) as a preprocessing step for extracting recurring primitive behaviors from undirected and unlabelled datasets of diverse behaviors. We derived theoretical statements describing the conditions under which OPAL can improve learning of downstream offline RL tasks, and showed how these improvements manifest in practice, leading to significant gains on complex manipulation tasks. We further demonstrated OPAL's application to few-shot imitation learning, online RL, and online multi-task transfer learning. In this work, we focused on simple auto-encoding models for representing OPAL; an interesting avenue for future work is scaling up this basic paradigm to image-based tasks.

A.2 ONLINE RL WITH OPAL

Baselines: We describe the rationale for each of the three baselines. First, to test the role of D in exploration, we use HIRO (Nachum et al., 2018b), a state-of-the-art hierarchical RL method, as a baseline. Second, to test the role of temporal abstraction, we pre-train a flat policy on D using behavioral cloning (BC) and then finetune the policy on downstream tasks with SAC. Third, to test the quality of the extracted primitives for online RL, we extract a discrete set of primitives with Discovery of Deep Continuous Options (DDCO) (Krishnan et al., 2017) and use Double DQN (DDQN) (Van Hasselt et al., 2015) to learn a task policy in the space of the learned discrete primitives. Results: As shown in Table 3, SAC+OPAL outperforms all the baselines, demonstrating (1) the importance of D in exploration, (2) the role of temporal abstraction, and (3) the good quality of the learned primitives. Except for HIRO on Antmaze large with dense rewards, all other baselines fail to make any progress at all. In contrast, SAC+OPAL only fails to make progress on Antmaze large with sparse rewards.

A.3 ONLINE MULTI-TASK TRANSFER LEARNING WITH OPAL

Additional Assumption: We assume the existence of M additional MDPs $\{\mathcal{M}_i = (S_i, A, \mathcal{P}_i, r_i, \gamma)\}_{i=1}^M$, where the action space $A$ and the discount factor $\gamma$ are the same as those of $\mathcal{M}$. How to use it with OPAL? In the multi-task setting, we aim to learn near-optimal behavior policies on the M MDPs $\{\mathcal{M}_i\}_{i=1}^M$. As in the previous applications of OPAL, we learn a set of high-level policies $\pi_\psi(z|s,i)$ which direct the pretrained primitives $\pi_\theta(a|s,z)$ to maximize cumulative reward. Since the state space of the M MDPs is potentially distinct from that of the offline dataset $\mathcal{D}$, we cannot transfer the state distribution and can only hope to transfer the action sub-trajectory distribution. Therefore, during the unsupervised training phase for learning $\pi_\theta$, we make the encoder and the decoder blind to the states in the sub-trajectory. Specifically, the encoder becomes $(\mu^{\mathrm{enc}}_z, \sigma^{\mathrm{enc}}_z) = q_\phi(z_t\,|\,s_t, \{a_{t+i}\}_{i=0}^{c-1})$ and is represented by a bidirectional GRU. The decoder becomes $\pi_\theta(\{a_t,\dots,a_{t+c-1}\}\,|\,z_t)$, which decodes the entire action sub-trajectory from the latent vector and is represented by a GRU. With these state-agnostic primitives in hand, we then learn a policy $\pi_\psi(z|s,i)$ using any off-the-shelf online RL method; in our experiments, we use Proximal Policy Optimization (PPO) (Schulman et al., 2017). Evaluation Description: We use the Metaworld task suite (Yu et al., 2020) to evaluate our method. The dataset $\mathcal{D}$ for learning primitives consists of trajectories generated by an expert policy for a goal-conditioned pick-and-place task. The pick-and-place task is suitable for unsupervised primitive learning because it contains all the basic operations (e.g., move, grasp, place) required for performing more complex manipulation tasks in Metaworld.
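A minimal sketch of the state-agnostic encoding described above, with a simple mean over actions standing in for the bidirectional GRU (all names are illustrative, not from the authors' code):

```python
def encode_actions_only(action_subtrajectory):
    """Stand-in for the state-blind encoder q_phi(z | {a_{t+i}}): the latent is
    computed from the actions alone, so it can transfer across state spaces."""
    dim = len(action_subtrajectory[0])
    n = len(action_subtrajectory)
    return [sum(a[i] for a in action_subtrajectory) / n for i in range(dim)]

z = encode_actions_only([[0.5, -0.5], [1.5, 0.5]])
print(z)  # [1.0, 0.0]
```

Because no states beyond the initial one enter the encoder, and none enter the decoder, the resulting primitive captures only the action sub-trajectory distribution, which is exactly the property needed for transfer to the Metaworld MDPs.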
Once we have learned the primitives, we learn a policy $\pi_\psi(z|s,i)$ for the MT10 and MT50 benchmarks, where MT10 and MT50 contain 10 and 50 robotic manipulation tasks, respectively, which need to be solved simultaneously. In these experiments we use c = 5 and dim(Z) = 8. Baselines: We use SAC (Haarnoja et al., 2018) and PPO (Schulman et al., 2017) as baselines. Results: As shown in Table 4, PPO+OPAL clearly outperforms both PPO and SAC, showing the importance of temporal abstraction in online multi-task transfer.

B PROOF OF THEOREMS B.1 BOUNDING THE SUBOPTIMALITY OF THE LEARNED PRIMITIVES

We will begin by proving the following lemma, which bounds the sampling error incurred by $J(\theta,\phi,\omega)$.

Lemma B.0.1. With high probability $1-\delta$,

$$\left|J(\theta,\phi,\omega) - \mathbb{E}_{\pi\sim\Pi,\,\tau\sim\pi,\,z\sim q_\phi(z|\tau)}\left[-\sum_{t=0}^{c-1}\log\pi_\theta(a_t|s_t,z)\right]\right| \le \frac{S_J}{\sqrt{\delta}}, \qquad (11)$$

where $S_J$ is a constant dependent on concentration properties of $\pi_\theta(a|s,z)$ and $q_\phi(z|\tau)$.

Proof. To be concise, let us denote the sampling error in $J(\theta,\phi,\omega)$ by

$$\Delta_J = J(\theta,\phi,\omega) - \mathbb{E}_{\pi\sim\Pi,\,\tau\sim\pi,\,z\sim q_\phi(z|\tau)}\left[-\sum_{t=0}^{c-1}\log\pi_\theta(a_t|s_t,z)\right].$$

Applying Chebyshev's inequality to $\Delta_J$, we get that, with high probability $1-\delta$,

$$|\Delta_J| \le \sqrt{\frac{\mathrm{Var}_{\pi\sim\Pi,\,\tau\sim\pi,\,z\sim q_\phi(z|\tau)}\left(-\sum_{t=0}^{c-1}\log\pi_\theta(a_t|s_t,z)\right)}{\delta}} = \frac{S_J}{\sqrt{\delta}}.$$

Therefore, combining the above equations, we have

$$\mathbb{E}_{\pi\sim\Pi,\,\tau\sim\pi,\,z\sim q_\phi(z|\tau)}\left[-\sum_{t=0}^{c-1}\log\pi_\theta(a_t|s_t,z)\right] \le J(\theta,\phi,\omega) + \frac{S_J}{\sqrt{\delta}}. \qquad (14)$$

We next present a general performance difference lemma that will help in our proof of Lemma 4.0.1.

Lemma B.0.2. If $\pi_1$ and $\pi_2$ are two policies in $\mathcal{M}$, then

$$|J_{\mathrm{RL}}(\pi_1,\mathcal{M}) - J_{\mathrm{RL}}(\pi_2,\mathcal{M})| \le \frac{2R_{\max}}{(1-\gamma)^2}\,\mathbb{E}_{s\sim d^{\pi_1}}\left[D_{\mathrm{TV}}(\pi_1(a|s)\,\|\,\pi_2(a|s))\right].$$

Proof. Following the derivations in Achiam et al. (2017) and Nachum et al. (2018a), we express the performance of a policy π in $\mathcal{M}$ in terms of linear operators:

$$J_{\mathrm{RL}}(\pi,\mathcal{M}) = R^\top(I - \gamma\Pi_\pi\mathcal{P})^{-1}\Pi_\pi\mu,$$

where $R$ is a vector representation of the rewards of $\mathcal{M}$, $\Pi_\pi$ is a linear operator mapping state distributions to state-action distributions according to π, and $\mu$ is a vector representation of the initial state distribution of $\mathcal{M}$. Accordingly, we express the performance difference of $\pi_1, \pi_2$ as

$$|J_{\mathrm{RL}}(\pi_1,\mathcal{M}) - J_{\mathrm{RL}}(\pi_2,\mathcal{M})| = \left|R^\top\big((I-\gamma\Pi_1\mathcal{P})^{-1}\Pi_1\mu - (I-\gamma\Pi_2\mathcal{P})^{-1}\Pi_2\mu\big)\right| \qquad (17)$$
$$\le R_{\max}\left\|(I-\gamma\Pi_1\mathcal{P})^{-1}\Pi_1\mu - (I-\gamma\Pi_2\mathcal{P})^{-1}\Pi_2\mu\right\|_1. \qquad (18)$$

By the triangle inequality, we may bound Equation 18 by

$$R_{\max}\left(\left\|(I-\gamma\Pi_1\mathcal{P})^{-1}\Pi_1\mu - (I-\gamma\Pi_2\mathcal{P})^{-1}\Pi_1\mu\right\|_1 + \left\|(I-\gamma\Pi_2\mathcal{P})^{-1}\Pi_1\mu - (I-\gamma\Pi_2\mathcal{P})^{-1}\Pi_2\mu\right\|_1\right). \qquad (19)$$

We begin by approaching the first term inside the parentheses of Equation 19.
That first term may be expressed as

$$\left\|(I-\gamma\Pi_2\mathcal{P})^{-1}\big((I-\gamma\Pi_2\mathcal{P}) - (I-\gamma\Pi_1\mathcal{P})\big)(I-\gamma\Pi_1\mathcal{P})^{-1}\Pi_1\mu\right\|_1 \qquad (20)$$
$$= \gamma\left\|(I-\gamma\Pi_2\mathcal{P})^{-1}(\Pi_1-\Pi_2)\mathcal{P}(I-\gamma\Pi_1\mathcal{P})^{-1}\Pi_1\mu\right\|_1 \qquad (21)$$
$$\le \frac{\gamma}{1-\gamma}\left\|(\Pi_1-\Pi_2)\mathcal{P}(I-\gamma\Pi_1\mathcal{P})^{-1}\Pi_1\mu\right\|_1 \qquad (22)$$
$$= \frac{2\gamma}{(1-\gamma)^2}\,\mathbb{E}_{s\sim(1-\gamma)\mathcal{P}(I-\gamma\Pi_1\mathcal{P})^{-1}\Pi_1\mu}\left[D_{\mathrm{TV}}(\pi_1(a|s)\,\|\,\pi_2(a|s))\right]. \qquad (23)$$

Now we continue to the second term inside the parentheses of Equation 19. This second term may be expressed as

$$\left\|(I-\gamma\Pi_2\mathcal{P})^{-1}(\Pi_1-\Pi_2)\mu\right\|_1 \le \frac{1}{1-\gamma}\left\|(\Pi_1-\Pi_2)\mu\right\|_1 \qquad (24)$$
$$= \frac{2}{1-\gamma}\,\mathbb{E}_{s\sim\mu}\left[D_{\mathrm{TV}}(\pi_1(a|s)\,\|\,\pi_2(a|s))\right]. \qquad (25)$$

To combine Equations 23 and 25, we note that $d^{\pi_1} = \gamma\cdot(1-\gamma)\mathcal{P}(I-\gamma\Pi_1\mathcal{P})^{-1}\Pi_1\mu + (1-\gamma)\cdot\mu$. Thus, we have

$$|J_{\mathrm{RL}}(\pi_1,\mathcal{M}) - J_{\mathrm{RL}}(\pi_2,\mathcal{M})| \le \frac{2R_{\max}}{(1-\gamma)^2}\,\mathbb{E}_{s\sim d^{\pi_1}}\left[D_{\mathrm{TV}}(\pi_1(a|s)\,\|\,\pi_2(a|s))\right],$$

as desired.

Lemma 4.0.1. If $\pi_1$ and $\pi_2$ are two policies in $\mathcal{M}$, then

$$|J_{\mathrm{RL}}(\pi_1,\mathcal{M}) - J_{\mathrm{RL}}(\pi_2,\mathcal{M})| \le \frac{2R_{\max}}{(1-\gamma^c)(1-\gamma)}\,\mathbb{E}_{s\sim d^{\pi_1}_c}\left[D_{\mathrm{TV}}(\pi_1(\tau|s)\,\|\,\pi_2(\tau|s))\right], \qquad (28)$$

where $D_{\mathrm{TV}}(\pi_1(\tau|s)\,\|\,\pi_2(\tau|s))$ denotes the TV divergence between $c$-length sub-trajectories $\tau$ sampled from $\pi_1$ vs. $\pi_2$ (see Section 3). Furthermore,

$$\mathrm{SubOpt}(\theta) \le \frac{2R_{\max}}{(1-\gamma^c)(1-\gamma)}\,\mathbb{E}_{s\sim d^{\pi^*}_c}\left[D_{\mathrm{TV}}(\pi^*(\tau|s)\,\|\,\pi_{\theta,\psi^*}(\tau|s))\right]. \qquad (29)$$

Proof. We focus on proving Equation 28, as the subsequent derivation of Equation 29 is straightforward from the definition of SubOpt. To derive Equation 28, we may simply consider $\pi_1$ and $\pi_2$ acting in an "every-c-steps" version of $\mathcal{M}$, in which the action space consists of $c$-length sub-trajectories $\tau$ and rewards are accumulated over $c$ steps with $\gamma$-discounting. Note that in this abstracted version of $\mathcal{M}$, the maximum reward is $\frac{1-\gamma^c}{1-\gamma}R_{\max}$ and the MDP discount is $\gamma^c$. Plugging these into the result of Lemma B.0.2 immediately yields the desired claim.

Theorem 4.1. Let θ, φ, ω be the outputs of solving Equation 1, with $J(\theta,\phi,\omega) = \epsilon_c$. Then, with high probability $1-\delta$, for any $\tilde\pi$ that is ζ-common in Π, there exists a distribution $H$ over $z$ such that for $\pi^H_\theta(\tau|s) := \mathbb{E}_{z\sim H}[\pi_\theta(\tau|z,s)]$,

$$\mathbb{E}_{s\sim\kappa}\left[D_{\mathrm{TV}}(\tilde\pi(\tau|s)\,\|\,\pi^H_\theta(\tau|s))\right] \le \zeta + \sqrt{\tfrac{1}{2}\left(\epsilon_c + \tfrac{S_J}{\sqrt{\delta}} + H_c\right)},$$

where $H_c = \mathbb{E}_{\pi\sim\Pi,\,s_0\sim\kappa,\,\tau\sim\pi}\left[\sum_{t=0}^{c-1}\log\pi(a_t|s_t)\right]$ (i.e., a constant and a property of $\mathcal{D}$) and $S_J$ is a positive constant incurred due to sampling error in $J(\theta,\phi,\omega)$, depending on concentration properties of $\pi_\theta(a|s,z)$ and $q_\phi(z|\tau)$.

Proof. We start with an application of the triangle inequality: for any $\pi\sim\Pi$,

$$D_{\mathrm{TV}}(\tilde\pi(\tau|s)\,\|\,\pi^H_\theta(\tau|s)) \le D_{\mathrm{TV}}(\pi(\tau|s)\,\|\,\tilde\pi(\tau|s)) + D_{\mathrm{TV}}(\pi(\tau|s)\,\|\,\pi^H_\theta(\tau|s)).$$

Taking expectations with respect to $\pi\sim\Pi$, $s\sim\kappa$ on both sides, we get

$$\mathbb{E}_{s\sim\kappa}\left[D_{\mathrm{TV}}(\tilde\pi\,\|\,\pi^H_\theta)\right] \le \mathbb{E}_{\pi\sim\Pi,\,s\sim\kappa}\left[D_{\mathrm{TV}}(\pi\,\|\,\tilde\pi)\right] + \mathbb{E}_{\pi\sim\Pi,\,s\sim\kappa}\left[D_{\mathrm{TV}}(\pi\,\|\,\pi^H_\theta)\right] \qquad (32)$$
$$\le \zeta + \mathbb{E}_{\pi\sim\Pi,\,s\sim\kappa}\left[\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(\pi(\tau|s)\,\|\,\pi^H_\theta(\tau|s))}\right] \qquad (33)$$
$$\le \zeta + \sqrt{\tfrac{1}{2}\,\mathbb{E}_{\pi\sim\Pi,\,s\sim\kappa,\,\tau\sim\pi}\left[\log\pi(\tau|s) - \log\mathbb{E}_{z\sim H}[\pi_\theta(\tau|z,s)]\right]} \qquad (34)$$
$$\le \zeta + \sqrt{\tfrac{1}{2}\,\mathbb{E}_{\pi\sim\Pi,\,s\sim\kappa,\,\tau\sim\pi,\,z\sim q_\phi(z|\tau)}\left[\log\pi(\tau|s) - \log\pi_\theta(\tau|z,s)\right]}, \qquad (35)$$

where (33) uses Pinsker's inequality, the last two inequalities use Jensen's inequality, and $H(z) = \mathbb{E}_{\pi\sim\Pi,\,\tau\sim\pi}[q_\phi(z|\tau)]$. Cancelling out the dynamics terms and using Equation 14, we get

$$\mathbb{E}_{s\sim\kappa}\left[D_{\mathrm{TV}}(\tilde\pi\,\|\,\pi^H_\theta)\right] \le \zeta + \sqrt{\tfrac{1}{2}\,\mathbb{E}_{\pi\sim\Pi,\,s\sim\kappa,\,\tau\sim\pi,\,z\sim q_\phi(z|\tau)}\left[\sum_{t=0}^{c-1}\big(\log\pi(a_t|s_t) - \log\pi_\theta(a_t|s_t,z)\big)\right]} \le \zeta + \sqrt{\tfrac{1}{2}\left(H_c + \epsilon_c + \frac{S_J}{\sqrt{\delta}}\right)}.$$

Corollary 4.1.1. If the optimal policy $\pi^*$ of $\mathcal{M}$ is ζ-common in Π, and $\left\|d^{\pi^*}_c/\kappa\right\|_\infty \le \xi$, then, with high probability $1-\delta$,

$$\mathrm{SubOpt}(\theta) \le \frac{2\xi R_{\max}}{(1-\gamma^c)(1-\gamma)}\left(\zeta + \sqrt{\tfrac{1}{2}\left(\epsilon_c + \tfrac{S_J}{\sqrt{\delta}} + H_c\right)}\right).$$

Proof. Expanding Lemma 4.0.1 under the above assumptions, we have

$$\mathrm{SubOpt}(\theta) \le |J_{\mathrm{RL}}(\pi^*,\mathcal{M}) - J_{\mathrm{RL}}(\pi^H_\theta,\mathcal{M})| \qquad (38)$$
$$\le \frac{2R_{\max}}{(1-\gamma^c)(1-\gamma)}\,\mathbb{E}_{s\sim d^{\pi^*}_c}\left[D_{\mathrm{TV}}(\pi^*(\tau|s)\,\|\,\pi^H_\theta(\tau|s))\right] \qquad (39)$$
$$\le \frac{2R_{\max}}{(1-\gamma^c)(1-\gamma)}\left\|\frac{d^{\pi^*}_c}{\kappa}\right\|_\infty\mathbb{E}_{s\sim\kappa}\left[D_{\mathrm{TV}}(\pi^*(\tau|s)\,\|\,\pi^H_\theta(\tau|s))\right] \qquad (40)$$
$$\le \frac{2\xi R_{\max}}{(1-\gamma^c)(1-\gamma)}\,\mathbb{E}_{s\sim\kappa}\left[D_{\mathrm{TV}}(\pi^*(\tau|s)\,\|\,\pi^H_\theta(\tau|s))\right]. \qquad (41)$$

Now, applying Theorem 4.1 proves the corollary.

B.2 PERFORMANCE BOUNDS FOR OPAL

Theorem 4.2. Let $\pi_{\psi^*}(z|s)$ be the policy obtained by CQL and let $\pi_{\psi^*,\theta}(a|s)$ refer to the policy obtained when $\pi_{\psi^*}(z|s)$ is used together with $\pi_\theta(a|s,z)$. Let $\pi_\beta \equiv \{\pi\,;\,\pi\sim\Pi\}$ refer to the policy generating $\mathcal{D}^r$ in MDP $\mathcal{M}$, and define $\pi^H_\beta(z|s)$ via $z\sim\pi^H_\beta(z|s) \equiv \tau\sim\pi_{\beta,s_0=s},\,z\sim q_\phi(z|\tau)$. Then $J(\pi_{\psi^*,\theta},\mathcal{M}) \ge J(\pi_\beta,\mathcal{M}) - \kappa$ with high probability $1-\delta$, where

$$\kappa = O\!\left(\frac{1}{(1-\gamma^c)(1-\gamma)}\,\mathbb{E}_{s\sim d^{\pi_{\psi^*,\theta}}_{\hat{\mathcal{M}}_H}(s)}\left[\sqrt{|Z|\left(D_{\mathrm{CQL}}(\pi_{\psi^*},\pi^H_\beta)(s)+1\right)}\right]\right) - \frac{\alpha}{1-\gamma^c}\,\mathbb{E}_{s\sim d^{\pi_{\psi^*}}_{\mathcal{M}_H}(s)}\left[D_{\mathrm{CQL}}(\pi_{\psi^*},\pi^H_\beta)(s)\right]. \qquad (42)$$

Proof. We assume that the variational posterior $q_\phi(z|\tau)$ obtained after learning OPAL from $\mathcal{D}$ is the same (or nearly the same) as the true posterior $p(z|\tau)$. $q_\phi$ can be used to define $\pi^H_\beta(z|s)$ via $z\sim\pi^H_\beta \equiv \tau\sim\pi_\beta,\,z\sim q_\phi(z|\tau)$. This induces an MDP $\mathcal{M}_H = (S, Z, \mathcal{P}_z, r_z, \gamma^c)$, where $Z$ is the inferred latent space for choosing primitives, $\mathcal{P}_z$ and $r_z$ are the latent dynamics and reward function such that

$$s_{t+c}\sim\mathcal{P}_z(s_{t+c}|s_t,z_t) \;\equiv\; s_{t+i+1}\sim\mathcal{P}(s_{t+i+1}|s_{t+i},a_{t+i}),\;\; a_{t+i}\sim\pi(a_{t+i}|s_{t+i},z_t)\;\;\forall i\in\{0,1,\dots,c-1\}$$

and $r_z(s_t,z_t) = \sum_{i=0}^{c-1}\gamma^i r(s_{t+i},a_{t+i})$, and $\gamma^c$ is the new discount factor, effectively reducing the task horizon by a factor of $c$. Here $\pi(a|s,z)$ is the primitive induced by $q_\phi$ and $\pi_\beta$. Since $q_\phi$ captures the true posterior, $\pi(a|s,z)$ is the optimal primitive one can learn, and its auto-encoding loss, under true expectation, is

$$\epsilon^*_c = \mathbb{E}_{\pi\sim\Pi,\,\tau\sim\pi,\,z\sim q_\phi(z|\tau)}\left[-\sum_{t=0}^{c-1}\log\pi(a_t|s_t,z)\right].$$

Therefore, $\tau\sim\pi_\beta \equiv z\sim\pi^H_\beta,\,\tau\sim\pi(\cdot|\cdot,z)$. $\pi_\beta$ is used to collect the data $\mathcal{D}^r$, which induces an empirical MDP $\hat{\mathcal{M}} = (S, A, \hat{\mathcal{P}}, \hat\mu, r, \gamma)$, where

$$\hat{\mathcal{P}}(s'|\hat s,\hat a) = \frac{\sum_{(s,a,s'')\in\mathcal{D}}\mathbb{1}[s=\hat s,\, a=\hat a,\, s''=s']}{\sum_{(s,a)\in\mathcal{D}}\mathbb{1}[s=\hat s,\, a=\hat a]} \quad\text{and}\quad \hat\mu(\hat s) = \frac{\sum_{s_0\in\mathcal{D}}\mathbb{1}[s_0=\hat s]}{N}.$$

We use $q_\phi$ to obtain $\mathcal{D}^r_{\mathrm{hi}}$ from $\mathcal{D}^r$, which induces another empirical MDP $\hat{\mathcal{M}}_H$. Using these definitions, we will bound $|J(\pi_{\psi^*,\theta},\mathcal{M}) - J(\pi_\beta,\mathcal{M})|$.
We decompose $|J(\pi_{\psi^*,\theta},\mathcal{M}) - J(\pi_\beta,\mathcal{M})|$ as

$$|J(\pi_{\psi^*,\theta},\mathcal{M}) - J(\pi_\beta,\mathcal{M})| \le |J(\pi_{\psi^*,\theta},\mathcal{M}) - J(\pi_{\psi^*},\mathcal{M}_H)|\;(44)\; + |J(\pi_{\psi^*},\mathcal{M}_H) - J(\pi^H_\beta,\mathcal{M}_H)|\;(45)\; + |J(\pi^H_\beta,\mathcal{M}_H) - J(\pi_\beta,\mathcal{M})|.\;(46)$$

Since $q_\phi$ captures the true variational posterior, $\tau\sim\pi_\beta \equiv z\sim\pi^H_\beta,\,\tau\sim\pi(\cdot|\cdot,z)$, and therefore $|J(\pi^H_\beta,\mathcal{M}_H) - J(\pi_\beta,\mathcal{M})| = 0$. To bound $|J(\pi_{\psi^*},\mathcal{M}_H) - J(\pi^H_\beta,\mathcal{M}_H)|$, we take Theorem 3.6 from Kumar et al. (2020b) and apply it to $\mathcal{M}_H$:

$$|J(\pi_{\psi^*},\mathcal{M}_H) - J(\pi^H_\beta,\mathcal{M}_H)| \le 2\left(\frac{C_{r,\delta}}{1-\gamma^c} + \frac{\gamma^c R_{\max} C_{\mathcal{P},\delta}}{(1-\gamma^c)(1-\gamma)}\right)\mathbb{E}_{s\sim d^{\pi_{\psi^*,\theta}}_{\hat{\mathcal{M}}_H}(s)}\left[\sqrt{\frac{|Z|}{|\mathcal{D}(s)|}\left(D_{\mathrm{CQL}}(\pi_{\psi^*},\pi^H_\beta)(s)+1\right)}\right] - \frac{\alpha}{1-\gamma^c}\,\mathbb{E}_{s\sim d^{\pi_{\psi^*}}_{\mathcal{M}_H}(s)}\left[D_{\mathrm{CQL}}(\pi_{\psi^*},\pi^H_\beta)(s)\right] =: \kappa_2. \qquad (47\text{–}49)$$

For the first term of the decomposition, note that $J(\pi_{\psi^*},\mathcal{M}_H) = J(\pi_{\psi^*,\pi(\cdot|\cdot,z)},\mathcal{M})$, so by Lemma 4.0.1 and Pinsker's inequality,

$$|J(\pi_{\psi^*,\theta},\mathcal{M}) - J(\pi_{\psi^*,\pi(\cdot|\cdot,z)},\mathcal{M})| \le \frac{2R_{\max}}{(1-\gamma^c)(1-\gamma)}\,\mathbb{E}_{s\sim d^{\pi_{\psi^*,\pi(\cdot|\cdot,z)}}_c}\left[D_{\mathrm{TV}}(\pi_{\psi^*,\pi(\cdot|\cdot,z)}(\tau|s)\,\|\,\pi_{\psi^*,\theta}(\tau|s))\right] \qquad (50\text{–}51)$$
$$\le \frac{2R_{\max}}{(1-\gamma^c)(1-\gamma)}\,\mathbb{E}_{s\sim d^{\pi_{\psi^*,\pi(\cdot|\cdot,z)}}_c}\left[\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(\pi_{\psi^*,\pi(\cdot|\cdot,z)}(\tau|s)\,\|\,\pi_{\psi^*,\theta}(\tau|s))}\right]. \qquad (52)$$

Now, we will bound $D_{\mathrm{KL}}(\pi_{\psi^*,\pi(\cdot|\cdot,z)}(\tau|s)\,\|\,\pi_{\psi^*,\theta}(\tau|s))$. We have

$$\mathbb{E}_{z\sim\pi_{\psi^*}(z|s),\,\tau\sim\pi(\tau|s,z)}\left[\log\frac{\pi_{\psi^*}(z|s)\prod_{t=1}^{c-1}\mathcal{P}(s_t|s_{t-1},a_{t-1})\prod_{t=0}^{c-1}\pi(a_t|s_t,z)}{\pi_{\psi^*}(z|s)\prod_{t=1}^{c-1}\mathcal{P}(s_t|s_{t-1},a_{t-1})\prod_{t=0}^{c-1}\pi_\theta(a_t|s_t,z)}\right] \qquad (53)$$
$$= \mathbb{E}_{z\sim\pi_{\psi^*}(z|s),\,\tau\sim\pi(\tau|s,z)}\left[\sum_{t=0}^{c-1}\log\pi(a_t|s_t,z) - \log\pi_\theta(a_t|s_t,z)\right] \qquad (54)$$
$$= \mathbb{E}_{z\sim\pi^H_\beta(z|s),\,\tau\sim\pi(\tau|s,z)}\left[\frac{\pi_{\psi^*}(z|s)}{\pi^H_\beta(z|s)}\sum_{t=0}^{c-1}\log\pi(a_t|s_t,z) - \log\pi_\theta(a_t|s_t,z)\right] \qquad (55)$$
$$\le \left\|\frac{\pi_{\psi^*}(z|s)}{\pi^H_\beta(z|s)}\right\|_\infty\mathbb{E}_{z\sim\pi^H_\beta(z|s),\,\tau\sim\pi(\tau|s,z)}\left[\sum_{t=0}^{c-1}\log\pi(a_t|s_t,z) - \log\pi_\theta(a_t|s_t,z)\right] \qquad (56)$$
$$\le \left\|\frac{\pi_{\psi^*}(z|s)}{\pi^H_\beta(z|s)}\right\|_\infty\left(\epsilon_c - \epsilon^*_c + \frac{S_J}{\sqrt{\delta}}\right). \qquad (57)$$

The last step follows from the above definition of $\epsilon^*_c$ and Equation 14. We will now bound $\left\|\pi_{\psi^*}(z|s)/\pi^H_\beta(z|s)\right\|_\infty$ using $D_{\mathrm{CQL}}(\pi_{\psi^*},\pi^H_\beta)(s)$. Using the definition of $D_{\mathrm{CQL}}$, we have

$$D_{\mathrm{CQL}}(\pi_{\psi^*},\pi^H_\beta)(s) = \sum_z \pi_{\psi^*}(z|s)\left(\frac{\pi_{\psi^*}(z|s)}{\pi^H_\beta(z|s)} - 1\right) \qquad (58)$$
$$\Rightarrow\; D_{\mathrm{CQL}}(\pi_{\psi^*},\pi^H_\beta)(s) + 1 = \sum_z \pi^H_\beta(z|s)\left(\frac{\pi_{\psi^*}(z|s)}{\pi^H_\beta(z|s)}\right)^2 \ge \left\|\frac{\pi_{\psi^*}(z|s)}{\pi^H_\beta(z|s)}\right\|_\infty^2\pi^H_\beta(z'|s), \qquad (59)$$

where $z' = \arg\max_z \pi_{\psi^*}(z|s)/\pi^H_\beta(z|s)$.
To be concise, let $\Delta_c = \epsilon_c - \epsilon^*_c + \frac{S_J}{\sqrt{\delta}}$. Combining the above equations, we have

$$D_{\mathrm{KL}}(\pi_{\psi^*}(z|s)\pi(\tau|s,z)\,\|\,\pi_{\psi^*}(z|s)\pi_\theta(\tau|s,z)) \le \Delta_c\sqrt{\frac{D_{\mathrm{CQL}}(\pi_{\psi^*},\pi^H_\beta)(s)+1}{\pi^H_\beta(z'|s)}}. \qquad (61)$$

Using this to bound the returns, we get

$$|J(\pi_{\psi^*,\theta},\mathcal{M}) - J(\pi_{\psi^*,\pi(\cdot|\cdot,z)},\mathcal{M})| \le \frac{2R_{\max}}{(1-\gamma^c)(1-\gamma)}\,\mathbb{E}_{s\sim d^{\pi_{\psi^*,\pi(\cdot|\cdot,z)}}_c}\left[\left(\frac{D_{\mathrm{CQL}}(\pi_{\psi^*},\pi^H_\beta)(s)+1}{\pi^H_\beta(z'|s)}\right)^{1/4}\sqrt{\tfrac{1}{2}\Delta_c}\right] =: \kappa_1. \qquad (62\text{–}63)$$

We conclude with $\kappa = \kappa_1 + \kappa_2$; applying $O(\cdot)$ yields the notation in the theorem.
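The count-based empirical MDP $\hat{\mathcal{M}}$ used in the proof above can be sketched directly from its definition (a toy illustration with hashable states and actions; the helper name is ours, not the authors'):

```python
from collections import Counter

def empirical_dynamics(transitions):
    """P_hat(s' | s, a) = count(s, a, s') / count(s, a), per the definition of M_hat."""
    sa_counts = Counter((s, a) for s, a, _ in transitions)
    sas_counts = Counter(transitions)
    return {(s, a, s2): n / sa_counts[(s, a)] for (s, a, s2), n in sas_counts.items()}

# Four (s, a, s') transitions; (0, 1) is visited three times.
data = [(0, 1, 0), (0, 1, 2), (0, 1, 2), (3, 1, 0)]
P_hat = empirical_dynamics(data)  # P_hat[(0, 1, 2)] is 2/3
```

The same counting construction, applied to the relabeled dataset $\mathcal{D}^r_{\mathrm{hi}}$ with latents $z$ in place of atomic actions, yields $\hat{\mathcal{M}}_H$.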

C EXPERIMENT DETAILS C.1 OPAL EXPERIMENT DETAILS

Encoder: The encoder $q_\phi(z|\tau)$ takes in a state-action trajectory τ of length c. It first passes the individual states through a fully connected network with 2 hidden layers of size H and ReLU activations. It then concatenates the processed states with the actions and passes them through a bidirectional GRU with hidden dimension H and 4 GRU layers. It projects the GRU output to the mean and log standard deviation of the latent vector through linear layers. Prior: The prior $\rho_\omega(z|s)$ takes in the current state s and passes it through a fully connected network with 2 hidden layers of size H and ReLU activations. It then projects the output of the hidden layers to the mean and log standard deviation of the latent vector through linear layers.

Primitive Policy

The primitive (i.e., decoder) $\pi_\theta(a|s,z)$ has the same architecture as the prior, but it takes in both the state and the latent vector and produces the mean and log standard deviation of the action. For kitchen environments, we use an autoregressive primitive policy with the same architecture as used by EMAQ (Ghasemipour et al., 2020). We use H = 200 for antmaze environments and H = 256 for kitchen environments. In both cases, OPAL was trained for 100 epochs with a fixed learning rate of 1e-3, β = 0.1 (Lynch et al., 2020), the Adam optimizer (Kingma & Ba, 2014), and a batch size of 50.
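For reference, the training settings above can be collected in one place (values transcribed from the text; the dictionary layout itself is just a convenience sketch, not the authors' code):

```python
OPAL_TRAINING_CONFIG = {
    "antmaze": {"hidden_size": 200, "autoregressive_decoder": False},
    "kitchen": {"hidden_size": 256, "autoregressive_decoder": True},  # EMAQ-style decoder
    "shared": {
        "epochs": 100,
        "learning_rate": 1e-3,
        "beta": 0.1,        # KL weight (Lynch et al., 2020)
        "optimizer": "adam",
        "batch_size": 50,
        "c": 10,            # sub-trajectory length (Section 5)
        "latent_dim": 8,    # dim(Z) (Section 5)
    },
}
```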

C.2 TASK POLICY ARCHITECTURE

In all environments, the task policy is a fully connected network with 3 hidden layers of size 256 and ReLU activations. It then projects the output of the hidden layers to the mean and log standard deviation of the latent vector through linear layers.

C.3 SAC HYPERPARAMETERS

We used SAC (Haarnoja et al., 2018) for the online RL experiments, learning a task policy either in the action space A or the latent space Z. For the discrete primitives extracted with DDCO (Krishnan et al., 2017), we used Double DQN (Van Hasselt et al., 2015). We used the standard hyperparameters for SAC and Double DQN as provided in the rlkit code base (https://github.com/vitchyr/rlkit), with both the policy learning rate and the Q-value learning rate set to 3e-4.

C.4 CQL HYPERPARAMETERS

We used CQL (Kumar et al., 2020b) for the offline RL experiments, learning a task policy either in the action space A or the latent space Z. We used the standard hyperparameters, as described in Kumar et al. (2020b), with minor differences: a policy learning rate of 3e-5, a Q-value learning rate of 3e-4, and a primitive learning rate of 3e-4. For antmaze tasks, we used the CQL(H) variant with τ = 5 and learned α. For kitchen tasks, we used the CQL(ρ) variant with fixed α = 10. In both cases, we ensured α never dropped below 0.001.

D CONNECTION BETWEEN OPAL AND VAE OBJECTIVES

We are given an undirected, unlabelled, and diverse dataset $\mathcal{D}$ of sub-trajectories of length c. We would like to fit a sequential VAE model to $\mathcal{D}$ that maximizes $\max_\theta\,\mathbb{E}_{\tau\sim\mathcal{D}}[\log p_\theta(\tau|s_0)]$, where $s_0$ is the initial state of the sub-trajectory. Consider

$$\log p_\theta(\tau|s_0) = \log\int p_\theta(\tau,z|s_0)\,dz = \log\int\frac{p_\theta(\tau,z|s_0)\,q_\phi(z|\tau)}{q_\phi(z|\tau)}\,dz \qquad (66)$$
$$\ge \int q_\phi(z|\tau)\left[\log p_\theta(\tau,z|s_0) - \log q_\phi(z|\tau)\right]dz \quad\text{(using Jensen's inequality)}$$
$$= \mathbb{E}_{z\sim q_\phi(z|\tau)}\left[\log p_\theta(\tau|z,s_0) - \log\frac{q_\phi(z|\tau)}{p_\theta(z|s_0)}\right]. \qquad (67)$$

Using the above equation, we have a lower bound for our objective function. We separate the parameters of the decoder from the prior and hence write $p_\theta(z|s_0) = \rho_\omega(z|s_0)$. We can expand

$$\log p_\theta(\tau|z,s_0) = \sum_{t=1}^{c-1}\log\mathcal{P}(s_t|s_{t-1},a_{t-1}) + \sum_{t=0}^{c-1}\log\pi_\theta(a_t|s_t,z).$$

Since $\mathcal{P}$ is fixed, it can be removed from the objective function. Therefore, we can write the objective function as

$$\max_{\theta,\phi,\omega}\;\mathbb{E}_{\tau\sim\mathcal{D},\,z\sim q_\phi(z|\tau)}\left[\sum_{t=0}^{c-1}\log\pi_\theta(a_t|s_t,z)\right] - \beta\,D_{\mathrm{KL}}(q_\phi(z|\tau)\,\|\,\rho_\omega(z|s_0))$$

with β = 1. This is similar to the auto-encoding loss function we described in Section 4.
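For a diagonal Gaussian encoder, prior, and decoder, both terms of this objective have closed forms. A one-dimensional sketch with made-up numbers (the helper names are ours, not the authors'):

```python
import math

def kl_gauss(mu_q, sig_q, mu_p, sig_p):
    """Closed-form KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) )."""
    return math.log(sig_p / sig_q) + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5

def gauss_logpdf(x, mu, sig):
    return -0.5 * math.log(2 * math.pi * sig**2) - (x - mu)**2 / (2 * sig**2)

beta = 0.1
mu_z, sig_z = 0.3, 0.5                         # encoder output q_phi(z | tau)
actions, decoded_mu = [0.1, -0.2], [0.0, 0.0]  # decoder means for pi_theta(a_t | s_t, z)
recon_nll = -sum(gauss_logpdf(a, m, 1.0) for a, m in zip(actions, decoded_mu))
loss = recon_nll + beta * kl_gauss(mu_z, sig_z, 0.0, 1.0)  # prior rho = N(0, 1)
```

Minimizing `loss` over a batch of sub-trajectories corresponds to the β-weighted objective above; β = 1 recovers the standard ELBO.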

E ABLATION STUDIES

We experimented with offline variants of CARML and DADS, which extract a discrete set of k skills from D (see below for implementation details). We went with k = 10 as our final model since it's simpler. Offline CARML effectively uses only 6 skills, as the other 4 skills had $p_\omega(z) = 0$; offline DADS uses all the skills. The results are described in Table 7. In addition to the average success rate, we also report the average cumulative dense reward over the entire trajectory and the average cumulative dense reward over the last 5 time steps, where the dense reward is the negative $\ell_2$ distance to the goal. The resulting trajectory clusters (using a subset of the dataset) from the discrete skills are visualized in Figure 4, where different colors represent different clusters. Since offline CARML treats the states in a trajectory as conditionally independent of each other given z, the clustering mainly focuses on spatial location. Therefore, offline CARML isn't able to separate different control modes that start around the same spatial location, which explains its poor performance when combined with CQL. As we can see from Figure 5, offline CARML makes progress towards the goal but gets stuck along the way due to poor separation of control modes. On the other hand, offline DADS treats the state transitions in a trajectory as conditionally independent of each other given z and thus clusters trajectories with similar state transitions together. This allows it to more effectively separate the control modes. Therefore, CQL+offline DADS slightly improves upon CQL but is still limited by the discrete set of skills. Furthermore, increasing the number of skills from 10 to 20 gives similar performance. Moreover, in these methods it is intractable to use a continuous skill space, since Bayes' rule is used to calculate $p_{\phi,\omega}(z|\tau)$. Therefore, we decided to switch to learning a β-VAE (Higgins et al., 2016) style generative model with a continuous skill space, i.e., OPAL.
Offline CARML: $p_\phi(s|z)$ takes in the latent one-hot vector z and passes it through a fully connected network with 2 hidden layers of size H = 200 and ReLU activations. It then projects the output of the hidden layers to the mean and log standard deviation of the reduced state s (only the global x-y pose is considered) through linear layers. Offline DADS: $p_\phi(s_t|s_{t-1},z)$ has the same architecture as $p_\phi(s|z)$ but also takes in the reduced state from the previous timestep. Primitive Policy: The primitive policy $\pi_\theta(a|s,z)$ takes in the current state s and the latent one-hot vector z and passes them through a fully connected network with 2 hidden layers of size H = 200 and ReLU activations. It then projects the output of the hidden layers to the mean and log standard deviation of the action through linear layers. We perform the clustering for 25 epochs with a fixed learning rate of 1e-3, the Adam optimizer (Kingma & Ba, 2014), and a batch size of 50, using Algorithm 2 from Jabri et al. (2019). Task Policy: For the task policy $\pi_\psi(z|s)$, we used a fully connected network with 3 hidden layers of size 256 and ReLU activations. It then projects the output of the hidden layers to the logits (corresponding to the components of the discrete latent space) through linear layers. CQL Hyperparameters: We used the standard hyperparameters for CQL(H) with a discrete action space, as described in Kumar et al. (2020b).



Figure 1: Visualization of (a subset of) diverse datasets for (a) antmaze medium and (c) antmaze large, along with trajectories sampled from CQL+OPAL trained on diverse datasets of (b) antmaze medium and (d) antmaze large.


Figure 2: Overview of offline RL with OPAL. OPAL is trained on the unlabelled data D using an autoencoding objective. For offline RL, the encoder first labels the reward-labelled data D r with latents and divides it into D r hi and D r lo. The task policy π ψ is trained on D r hi using offline RL, while the primitive policy π θ is finetuned on D r lo using behavioral cloning (BC).
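The relabelling step summarized in the caption can be sketched as follows. This is a minimal sketch; `encode` is an assumed stand-in for the learned encoder q φ (z|τ), and the chunk length c = 10 follows the experiments.

```python
def relabel(trajectory, encode, c=10, gamma=0.99):
    """Split one reward-labelled trajectory into length-c chunks and relabel.

    Returns high-level transitions (s_0, z, discounted c-step return, s_c)
    for training the task policy, and (s, a, z) tuples for finetuning the
    primitive with latent-conditioned BC.
    trajectory: list of (state, action, reward) tuples.
    """
    d_hi, d_lo = [], []
    for start in range(0, len(trajectory) - c + 1, c):
        chunk = trajectory[start:start + c]
        z = encode(chunk)  # latent assigned by the encoder q_phi(z|tau)
        ret = sum(gamma ** t * r for t, (_, _, r) in enumerate(chunk))
        # s_c is the state after the chunk; fall back to the last state
        # of the trajectory when the chunk ends the episode.
        nxt = trajectory[start + c][0] if start + c < len(trajectory) else chunk[-1][0]
        d_hi.append((chunk[0][0], z, ret, nxt))
        d_lo.extend((s, a, z) for s, a, _ in chunk)
    return d_hi, d_lo

# Toy usage: a constant encoder and a 20-step trajectory with reward 1 per step.
traj = [((t,), (0.0,), 1.0) for t in range(20)]
d_hi, d_lo = relabel(traj, encode=lambda chunk: 0, c=10, gamma=1.0)
```

From the high-level policy's perspective each chunk collapses into a single transition, which is exactly the horizon reduction the temporal abstraction argument relies on.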

Figure 3: State visitation heatmaps for antmaze medium policies learned using (1) CQL and (2) CQL+OPAL, and antmaze large policies learned using (3) CQL and (4) CQL+OPAL.

Environment: BC / BC+OPAL (ours) / BC+SVAE
antmaze medium (diverse): 30.1 ± 3.2 / 81.5 ± 2.7 / 72.8 ± 2.3
antmaze large (diverse):

Now we bound |J(π ψ * ,θ , M) - J(π ψ * , M H )|. The only difference between the two is that the primitive π θ (a|s, z) is used in M while the primitive π(a|s, z) is used in M H . Therefore, we can write the above bound as |J(π ψ * ,θ , M) - J(π ψ * ,π(•|•,z) , M)|. We first bound the difference between their value functions at a particular state s. Using Lemma 4.0.1, we get

max θ E τ ∼D [log p θ (τ |s 0 )] ≥ max θ,φ E τ ∼D, z∼q φ (z|τ ) [ log p θ (τ |z, s 0 ) - log ( q φ (z|τ ) / p θ (z|s 0 ) ) ]
= max θ,φ E τ ∼D, z∼q φ (z|τ ) [log p θ (τ |z, s 0 )] - E τ ∼D [ D KL (q φ (z|τ ) || p θ (z|s 0 )) ]

Figure 4: Visualization of (a subset of) dataset trajectories colored according to their assigned cluster using (a) Offline DADS and (b) Offline CARML. We use k = 10.

Figure 5: State visitation heatmaps for antmaze medium policies learned using (1) CQL, (2) CQL+OPAL, (3) CQL+Offline DADS and (4) CQL+Offline CARML. Note that while offline CARML and offline DADS get stuck at various corners of the maze, OPAL is able to find its path through the maze to the goal location on the top right.

Due to improved exploration, PPO+OPAL outperforms PPO and SAC on MT10 and MT50 in terms of average success rate (%) (over 4 seeds).

Average success rate (%) (over 4 seeds) of CQL+OPAL for different values of dim(Z). We fix c = 10.

As shown in Table 5, we experimented with different choices of dim(Z) on antmaze medium (diverse). Using the hyperparameters from Nachum et al. (2018a), we fixed c = 10.

Table 6: Average success rate on antmaze medium (diverse) (%) (over 4 seeds) of CQL combined with offline DADS and offline CARML for different values of k.
k = 5 / k = 10 / k = 20
CQL+Offline DADS: ± 5.7 / 59.1 ± 3.1 / 59.6 ± 2.9
CQL+Offline CARML: 13.3 ± 4.7 / 15.1 ± 2.6 / 14.9 ± 3.8

Average success rate (%), cumulative dense reward, and cumulative dense reward (last 5 steps) (over 4 seeds) of CQL combined with different offline skill discovery methods on antmaze medium (diverse). For CQL + (Offline) DADS and CQL + (Offline) CARML, we use k = 10. Note that CQL+OPAL outperforms both other methods for unsupervised skill discovery on all of these different evaluation metrics.

7. ACKNOWLEDGEMENTS

We would like to thank Ben Eysenbach and Kamyar Ghasemipour for valuable discussions at different points over the course of this work. This work was supported by Google, DARPA Machine Common Sense grant and MIT-IBM grant.

A OTHER APPLICATIONS OF OPAL

A.1 FEW-SHOT IMITATION LEARNING WITH OPAL

Additional Assumption: In addition to D, we assume access to a small number of expert demonstrations D exp = {τ i := (s t , a t ) T-1 t=0 } n i=1 , where n ≪ N.

How to use with OPAL? In imitation learning, the aim is to recover an expert policy given a small number of stochastically sampled expert demonstrations D exp. As in the offline RL setting, we use the primitives π θ (a|s, z) learned by OPAL as a low-level controller and learn a high-level policy π ψ (z|s). We first partition the expert demonstrations into sub-trajectories of length c. We then use the learned encoder q φ (z|τ) to label these sub-trajectories with latent actions z i,k ∼ q φ (z|τ i,k ), creating a dataset D exp hi = {(s i k , z i,k ) for k = 0, c, . . . , T-c} n i=1 . We use D exp hi to learn the high-level policy π ψ (z|s) using behavioral cloning. As in the offline RL setting, we also finetune π θ (a|s, z) with latent-conditioned behavioral cloning to ensure consistency of the labelled latents.

Evaluation Description: We receive a small number of task-specific demonstrations that illustrate optimal behavior. Simply imitating these few demonstrations is insufficient to obtain a good policy, and our experiments evaluate whether OPAL can effectively incorporate the prior data to enable few-shot adaptation in this setting. We use the Antmaze environments (diverse datasets) to evaluate our method and use an expert policy for these environments to sample n = 10 successful trajectories.

Baseline: We evaluate two baselines. First, we test a simple behavioral cloning (BC) baseline, which trains using a max-likelihood loss on the expert data. To make the comparison fair to OPAL (which uses the offline dataset D in addition to the expert dataset), we pretrain the BC agent on the undirected dataset using the same max-likelihood loss.
As a second baseline, and to test the quality of OPAL-extracted primitives, we experiment with an alternative unsupervised objective from Wang et al. (2017), which prescribes using a sequential VAE (SVAE) over state trajectories in conjunction with imitation learning.

Results: As shown in Table 2, BC+OPAL clearly outperforms both baselines, showing the importance of temporal abstraction and confirming the quality of the learned primitives. SVAE's slightly worse performance suggests that decoding the state trajectory directly is more difficult than simply predicting the actions, as OPAL does, and that this degrades downstream task learning.
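The max-likelihood (BC) training objective used above can be sketched as follows. This is a minimal pure-Python sketch assuming a diagonal-Gaussian high-level policy π ψ (z|s); `policy`, which returns a (mean, log_std) pair, is an assumed stand-in, not the paper's implementation.

```python
import math

def gaussian_nll(z, mean, log_std):
    """Negative log-likelihood of z under a diagonal Gaussian N(mean, exp(log_std)^2)."""
    nll = 0.0
    for zi, mi, lsi in zip(z, mean, log_std):
        std = math.exp(lsi)
        nll += 0.5 * math.log(2 * math.pi) + lsi + 0.5 * ((zi - mi) / std) ** 2
    return nll

def bc_loss(batch, policy):
    """Average NLL of encoder-labelled latents z under pi_psi(z|s).
    batch: list of (state, latent) pairs from D_exp_hi."""
    total = 0.0
    for s, z in batch:
        mean, log_std = policy(s)
        total += gaussian_nll(z, mean, log_std)
    return total / len(batch)

# Toy check: a policy whose mean matches the labels scores a lower loss
# than a policy whose mean is systematically off.
batch = [((0.0,), (1.0, 1.0)), ((1.0,), (1.0, 1.0))]
good = lambda s: ((1.0, 1.0), (0.0, 0.0))
bad = lambda s: ((3.0, 3.0), (0.0, 0.0))
```

The same loss, with actions in place of latents, is what the plain BC baseline and the primitive finetuning step minimize.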

A.2 ONLINE RL WITH OPAL

Additional Assumptions: We assume online access to M; i.e., access to Monte Carlo samples of episodes from M given an arbitrary policy π.

How to use with OPAL?

To apply OPAL to standard online RL, we fix the learned primitives π θ (a|s, z) and learn a high-level policy π ψ (z|s) in an online fashion, using the latents z as temporally-extended actions. Specifically, when interacting with the environment, π ψ (z|s) chooses an appropriate primitive every c steps, and this primitive acts on the environment directly for c timesteps. Any off-the-shelf online RL algorithm can be used to learn ψ; in our experiments, we use SAC (Haarnoja et al., 2018). To ensure that π ψ (z|s) stays close to the data distribution and to avoid generalization issues associated with the fixed π θ (a|s, z), we add a KL penalty to the reward of the form D KL (π ψ (z|s)||ρ ω (z|s 0 )).

Evaluation Description: We use Antmaze medium (diverse dataset) and Antmaze large (diverse dataset) from the D4RL task suite (Fu et al., 2020) to evaluate our method. We evaluate using both a dense distance-based reward -||g - ant xy || and a sparse success-based reward 1[||g - ant xy || ≤ 0.5] (the typical default for this task), where ant xy is the 2d position of the ant in the maze and g is the goal.

Baseline: To solve these tasks through online RL, we need both (i) hierarchy (i.e. learning a policy on top of primitives), which improves exploration (Nachum et al., 2019b), and (ii) an unlabelled (i.e. no task reward) offline dataset, which allows us to bootstrap the primitives. This informed our choice of baselines.

While dim(Z) = 8 and dim(Z) = 16 gave similar performance, dim(Z) = 4 performed slightly worse. Therefore, we selected dim(Z) = 8 for our final model as it was simpler.
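The hierarchical rollout described above can be sketched as follows. This is a minimal sketch against an assumed gym-style `env`; `high_policy`, `primitive`, and `kl_penalty` are stand-ins for π ψ , π θ , and the D KL reward term, and the coefficient `alpha` is an illustrative assumption.

```python
def hierarchical_rollout(env, high_policy, primitive, c=10, alpha=0.1,
                         kl_penalty=lambda s: 0.0, max_steps=100):
    """Roll out pi_psi over latents: pick z every c steps, let the fixed
    primitive act for c steps, and accumulate reward minus the KL regularizer.
    Returns (s_0, z, return, s_c, done) tuples for the off-the-shelf RL learner."""
    s = env.reset()
    transitions, done, t = [], False, 0
    while not done and t < max_steps:
        z = high_policy(s)                   # temporally-extended action
        s0, ret = s, -alpha * kl_penalty(s)  # KL term keeps pi_psi near the prior
        for _ in range(c):
            a = primitive(s, z)              # low-level controller pi_theta(a|s, z)
            s, r, done = env.step(a)
            ret += r
            t += 1
            if done:
                break
        transitions.append((s0, z, ret, s, done))
    return transitions

class ChainEnv:
    """Toy chain environment: reward 1 per step, episode ends after 25 steps."""
    def reset(self):
        self.t = 0
        return 0
    def step(self, a):
        self.t += 1
        return self.t, 1.0, self.t >= 25

trans = hierarchical_rollout(ChainEnv(), lambda s: 0, lambda s, z: 0, c=10)
```

With c = 10, a 25-step episode yields only three high-level transitions, which is the exploration benefit the hierarchy provides to SAC.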

Temporal abstraction actually helps:

To empirically verify that the gain in performance was due to temporal abstraction and not merely a better action space learned through the latent space, we tried c = 1 (with dim(Z) = 8) and found the performance to be similar to that of CQL (i.e. 55.3 ± 3.8), thereby empirically supporting the theoretical benefits of temporal abstraction. We found dim(Z) = 8 and c = 10 to work well in the other environments as well. However, we acknowledge that the performance of CQL+OPAL could be further improved by carefully choosing better hyperparameters for each environment or by using other offline hyperparameter selection methods for offline RL, which is a subject of future work.

F ALTERNATIVE METHODS FOR EXTRACTING PRIMITIVES FROM OFFLINE DATA

We describe alternative methods for extracting a primitive policy from offline data. These methods are offline variants of CARML (Jabri et al., 2019) and DADS (Sharma et al., 2019). We tried these techniques in an early phase of our project and used the environment antmaze medium (diverse) to evaluate them.

Consider an offline undirected, unlabelled and diverse dataset D = {τ i := (s t , a t ) T-1 t=0 } N i=1 , and let τ denote the corresponding state trajectory. To extract primitives, we first cluster the trajectories by maximizing the mutual information between the state trajectory τ and the latent variable z (indicating the cluster index) with respect to the parameters of the joint distribution p φ,ω (τ, z) = p ω (z) p φ (τ |z). For now, we consider p ω (z) = Cat(p 1 , . . . , p k ) (i.e. discrete latent variables sampled from a categorical distribution) and represent z as a one-hot vector of dimension k. This choice of p ω (z) is consistent with the choices made in Jabri et al. (2019) and Sharma et al. (2019). Since z is discrete, we can use Bayes' rule to calculate p φ,ω (z|τ ) ∝ p ω (z) p φ (τ |z). Offline CARML and offline DADS differ only in how they model p φ (τ |z): offline CARML treats the states as conditionally independent given z, modeling p φ (s t |z), while offline DADS models the state transitions p φ (s t |s t-1 , z). Here, we only model p φ (s t |s t-1 , z) and not p(s 0 ); since log is additive, p(s 0 ) is ignored when calculating the gradient.

To optimize equation 72, we use Algorithm 2 from Jabri et al. (2019). Once we have clustered the state trajectories τ with labels z by maximizing I(τ ; z), we use behavioral cloning (BC) to learn π θ (a|s, z). Finally, we use p φ,ω (z|τ ) to label the reward-labelled data D r = {(s i t , a i t , r i t ) c-1 t=0 } N i=1 with latents, and transform it into D r hi = {(s i 0 , z i , Σ c-1 t=0 γ t r i t , s i c )} N i=1 . The task policy π ψ is trained on D r hi using Conservative Q-Learning (CQL) (Kumar et al., 2020b). Since the primitive policy π θ is trained after p φ,ω (z|τ ) is fully trained, it does not need any additional finetuning.
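The Bayes-rule posterior over discrete skills used above can be sketched as follows. This is a minimal sketch; the log-likelihood values stand in for log p φ (τ |z) computed by the model, and are illustrative.

```python
import math

def posterior(log_prior, log_lik):
    """p(z|tau) via Bayes' rule for a k-way categorical latent:
    p(z|tau) is proportional to p(z) p(tau|z); normalize with the
    log-sum-exp trick for numerical stability."""
    logits = [lp + ll for lp, ll in zip(log_prior, log_lik)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# With a uniform prior over k = 3 skills, the posterior just renormalizes
# the likelihoods. This enumeration over z is what makes the computation
# tractable, and it is exactly what breaks for a continuous skill space,
# motivating the switch to a beta-VAE style model (OPAL).
k = 3
log_prior = [math.log(1.0 / k)] * k
p = posterior(log_prior, [math.log(0.7), math.log(0.2), math.log(0.1)])
```

A skill with p ω (z) = 0 receives zero posterior mass for every trajectory, which is how offline CARML ends up effectively using only 6 of its 10 skills.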

F.1 RESULTS

Using the hyperparameters from Nachum et al. (2018a), we used c = 10. We experimented with different values of k = 5, 10, 20 and found that k = 10 and k = 20 work best (see Table 6 for more details).

