HIERARCHICAL REINFORCEMENT LEARNING BY DISCOVERING INTRINSIC OPTIONS

Abstract

We propose a hierarchical reinforcement learning method, HIDIO, that can learn task-agnostic options in a self-supervised manner while jointly learning to utilize them to solve sparse-reward tasks. Unlike current hierarchical RL approaches that tend to formulate goal-reaching low-level tasks or pre-define ad hoc lower-level policies, HIDIO encourages lower-level option learning that is independent of the task at hand, requiring few assumptions or little knowledge about the task structure. These options are learned through an intrinsic entropy minimization objective conditioned on the option sub-trajectories. The learned options are diverse and task-agnostic. In experiments on sparse-reward robotic manipulation and navigation tasks, HIDIO achieves higher success rates with greater sample efficiency than regular RL baselines and two state-of-the-art hierarchical RL methods.

1. INTRODUCTION

Imagine a wheeled robot learning to kick a soccer ball into a goal with sparse reward supervision. In order to succeed, it must discover how to first navigate in its environment, then touch the ball, and finally kick it into the goal, only receiving a positive reward at the end for completing the task. This is a naturally difficult problem for traditional reinforcement learning (RL) to solve, unless the task has been manually decomposed into temporally extended stages where each stage constitutes a much easier subtask. In this paper we ask: how do we learn to decompose the task automatically and utilize the decomposition to solve sparse reward problems?

Deep RL has made great strides solving a variety of tasks recently, with hierarchical RL (hRL) demonstrating promise in solving such sparse reward tasks (Sharma et al., 2019b; Le et al., 2018; Merel et al., 2019; Ranchod et al., 2015). In hRL, the task is decomposed into a hierarchy of subtasks, where policies at the top of the hierarchy call upon policies below to perform actions to solve their respective subtasks. This abstracts away actions for the policies at the top levels of the hierarchy. hRL makes exploration easier by potentially reducing the number of steps the agent needs to take to explore its state space. Moreover, at higher levels of the hierarchy, temporal abstraction results in more aggressive, multi-step value bootstrapping when temporal-difference (TD) learning is employed. These benefits are critical in sparse reward tasks as they allow an agent to more easily discover reward signals and assign credit. Many existing hRL methods make assumptions about the task structure (e.g., fetching an object involves three stages: moving towards the object, picking it up, and coming back), and/or the skills needed to solve the task (e.g., pre-programmed motor skills) (Florensa et al., 2016; Nachum et al., 2018).
Thus these methods may require manually designing the correct task decomposition, explicitly formulating the option space, or programming pre-defined options for higher-level policies to compose. Instead, we seek to formulate a general method that can learn these abstractions from scratch, for any task, with little manual design in the task domain. The main contribution of this paper is HIDIO (HIerarchical RL by Discovering Intrinsic Options), a hierarchical method that discovers task-agnostic intrinsic options in a self-supervised manner while learning to schedule them to accomplish environment tasks. The latent option representation is uncovered as the option-conditioned policy is trained, both according to the same self-supervised worker objective. The scheduling of options is simultaneously learned by maximizing environment reward collected by the option-conditioned policy. HIDIO can be easily applied to new sparse-reward tasks by simply re-discovering options. We propose and empirically evaluate various instantiations of the option discovery process, comparing the resulting options with respect to their final task performance. We demonstrate that HIDIO is able to efficiently learn and discover diverse options to be utilized for higher task reward with superior sample efficiency compared to other hierarchical methods.

2. PRELIMINARIES

We consider the reinforcement learning (RL) problem in a Markov Decision Process (MDP). Let $s \in \mathbb{R}^S$ be the agent state. We use the terms "state" and "observation" interchangeably to denote the environment input to the agent. A state can be fully or partially observed. Without loss of generality, we assume a continuous action space $a \in \mathbb{R}^A$ for the agent. Let $\pi_\theta(a|s)$ be the policy distribution with learnable parameters $\theta$, and $\mathcal{P}(s_{t+1}|s_t, a_t)$ the transition probability that measures how likely the environment transitions to $s_{t+1}$ given that the agent samples an action by $a_t \sim \pi_\theta(\cdot|s_t)$. After the transition to $s_{t+1}$, the agent receives a deterministic scalar reward $r(s_t, a_t, s_{t+1})$. The objective of RL is to maximize the sum of discounted rewards with respect to $\theta$:

$\max_\theta \; \mathbb{E}_{\pi_\theta, \mathcal{P}} \left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t, s_{t+1}) \right]$    (1)

where $\gamma \in [0, 1]$ is a discount factor. We will omit $\mathcal{P}$ in the expectation for notational simplicity.

In the options framework (Sutton et al., 1999), the agent can switch between different options during an episode, where an option is translated to a sequence of actions by an option-conditioned policy with a termination condition. A set of options defined over an MDP induces a hierarchy that models temporal abstraction. For a typical two-level hierarchy, a higher-level policy produces options, and the policy at the lower level outputs environment actions conditioned on the proposed options. The expectation in Eq. 1 is then taken over policies at both levels.
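As a concrete illustration of the objective in Eq. 1, the following is a minimal sketch (not from the paper) of computing a finite-horizon discounted return; the function name and episode data are illustrative:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Finite-horizon version of the inner sum of Eq. 1: sum_t gamma^t * r_t."""
    weights = gamma ** np.arange(len(rewards))
    return float(np.dot(weights, rewards))

# Three steps of reward 1.0 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

In sparse-reward tasks, almost all entries of `rewards` are zero, which is precisely why discovering reward signals is hard for flat RL.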

3. HIERARCHICAL RL BY DISCOVERING INTRINSIC OPTIONS

We now introduce our hierarchical method for solving sparse reward tasks. We assume little prior knowledge about the task structure, except that it can be learned through a hierarchy of two levels. The higher-level policy (the scheduler $\pi_\theta$) is trained to maximize environment reward, while the lower-level policy (the worker $\pi_\phi$) is trained in a self-supervised manner to efficiently discover options that are utilized by $\pi_\theta$ to accomplish tasks. Importantly, by self-supervision the worker gets access to dense intrinsic rewards regardless of the sparsity of the extrinsic rewards.

[Figure 1: the scheduler $\pi_\theta(u_h|s_{h,0})$ issues an option every K steps; the worker $\pi_\phi(a_{h,k}|\bar{s}_{h,k}, \bar{a}_{h,k-1}, u_h)$ acts in the environment; the discriminator $q_\psi(u_h|\bar{s}_{h,k+1}, \bar{a}_{h,k})$ provides the intrinsic reward $r^{lo}_{h,k+1}$, while the scheduler receives the accumulated environment reward $R_h$.]

Without loss of generality, we assume that each episode has a length of $T$ and the scheduler outputs an option every $K$ steps. The scheduled option $u \in [-1, 1]^D$ (where $D$ is a pre-defined dimensionality) is a latent representation that will be learned from scratch given the environment task. Modulated by $u$, the worker executes $K$ steps before the scheduler outputs the next option. Let the time horizon of the scheduler be $H = \frac{T}{K}$. Formally, we define

Scheduler policy:     $u_h \sim \pi_\theta(\cdot|s_{h,0})$,  $0 \le h < H$
Worker policy:        $a_{h,k} \sim \pi_\phi(\cdot|s_{h,k}, u_h)$,  $0 \le k < K$
Environment dynamics: $s_{h,k+1} \sim \mathcal{P}(\cdot|s_{h,k}, a_{h,k})$,  $0 \le h < H$, $0 \le k < K$    (2)

where we denote $s_{h,k}$ and $a_{h,k}$ as the $k$-th state and action respectively, within the $h$-th option window of length $K$. Note that given this sampling process, we have $s_{h,K} \equiv s_{h+1,0}$, namely, the last state of the current option $u_h$ is the initial state of the next option $u_{h+1}$. The overall framework of our method is illustrated in Figure 1.
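The sampling process in Eq. 2 can be sketched as a nested rollout loop. This is a hedged illustration, not the paper's implementation: the scheduler, worker, and dynamics below are random placeholders standing in for the learned networks, and all dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, D, A = 12, 4, 2, 3           # episode length, option interval, option dim, action dim
H = T // K                         # scheduler horizon H = T / K

def scheduler(s):                  # placeholder for pi_theta: options live in [-1, 1]^D
    return rng.uniform(-1.0, 1.0, size=D)

def worker(s, u):                  # placeholder for pi_phi conditioned on option u
    return rng.uniform(-1.0, 1.0, size=A)

def env_step(s, a):                # placeholder for dynamics P
    return s + 0.1 * rng.standard_normal(s.shape)

s = np.zeros(5)                    # s_{0,0}
options = []
for h in range(H):
    u = scheduler(s)               # a new option every K steps
    options.append(u)
    for k in range(K):
        a = worker(s, u)           # worker modulated by the current option
        s = env_step(s, a)         # final state s_{h,K} becomes s_{h+1,0}
```

The key structural point is that the inner loop never resets `s`: the last state under option `u_h` is exactly the state the scheduler conditions on when sampling `u_{h+1}`.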

3.1. LEARNING THE SCHEDULER

Every time the scheduler issues an option $u_h$, it receives a reward $R_h$ computed by accumulating environment rewards over the next $K$ steps. Its objective is:

$\max_\theta \; \mathbb{E}_{\pi_\theta} \sum_{h=0}^{H-1} \beta^h R_h$, where $\beta = \gamma^K$ and $R_h = \mathbb{E}_{\pi_\phi} \sum_{k=0}^{K-1} \gamma^k \, r(s_{h,k}, a_{h,k}, s_{h,k+1})$    (3)

This scheduler objective itself is not a new concept, as similar ones have been adopted by other hRL methods (Vezhnevets et al., 2017; Nachum et al., 2018; Riedmiller et al., 2018). One significant difference between our options and those of prior work is that our option $u$ is simply a latent variable; there is no explicit constraint on what semantics $u$ could represent. In contrast, existing methods usually require their options to reside in a subspace of the state space, to be grounded to the environment, or to have known structures, so that the scheduler can compute rewards and termination conditions for the worker. Note that our latent options can be easily re-trained given a new task.
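The aggregation in Eq. 3 can be written as a short batch computation. The helper below is an illustrative sketch (names are ours, not the paper's): it folds per-step rewards into one $R_h$ per option window and returns the scheduler's discount $\beta = \gamma^K$:

```python
import numpy as np

def scheduler_rewards(step_rewards, K, gamma=0.99):
    """Fold per-step rewards into one R_h per option window (Eq. 3):
    R_h = sum_{k=0}^{K-1} gamma^k * r_{h,k}.  The scheduler then discounts
    these window rewards with beta = gamma^K."""
    r = np.asarray(step_rewards, dtype=float).reshape(-1, K)  # shape (H, K)
    return r @ (gamma ** np.arange(K)), gamma ** K

# Sparse reward: a single reward of 1 at step k=3 of the first option window.
R, beta = scheduler_rewards([0, 0, 0, 1, 0, 0, 0, 0], K=4, gamma=0.5)
```

From the scheduler's perspective the episode is only $H$ steps long, which is what makes value bootstrapping at the top level more aggressive.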

3.2. LEARNING THE WORKER

The main focus of this paper is to investigate how to effectively learn the worker policy in a self-supervised manner. Our motivation is that it might be unnecessary to make an option dictate the worker to reach goals in some pre-defined goal space (Vezhnevets et al., 2017; Nachum et al., 2018). As long as the option can be translated to a short sequence of primitive actions, it does not need to be grounded with concrete meanings such as goal reaching. Below we will treat the option as a latent variable that modulates the worker, and propose to learn its latent representation in a hierarchical setting from the environment task.

3.2.1. WORKER OBJECTIVE

We first define a new meta MDP on top of the original task MDP so that for any $h$ and $k$:
1) $\bar{s}_{h,k} := (s_{h,0}, \ldots, s_{h,k})$,
2) $\bar{a}_{h,k} := (a_{h,0}, \ldots, a_{h,k})$,
3) $\bar{r}(\bar{s}_{h,k}, \bar{a}_{h,k}, \bar{s}_{h,k+1}) := r(s_{h,k}, a_{h,k}, s_{h,k+1})$,
4) $\bar{\mathcal{P}}(\bar{s}_{h,k+1}|\bar{s}_{h,k}, \bar{a}_{h,k}) := \mathcal{P}(s_{h,k+1}|s_{h,k}, a_{h,k})$.

This new MDP equips the worker with the historical state and action information since the time $(h, 0)$ when an option $u_h$ was scheduled. Specifically, each state $\bar{s}_{h,k}$ or action $\bar{a}_{h,k}$ encodes the history from the beginning $(h, 0)$ up to $(h, k)$ within the option. In the following, we will call the pairs $\{\bar{a}_{h,k}, \bar{s}_{h,k+1}\}$ option sub-trajectories. The worker policy now takes option sub-trajectories as inputs: $a_{h,k} \sim \pi_\phi(\cdot|\bar{s}_{h,k}, \bar{a}_{h,k-1}, u_h)$, $0 \le k < K$, whereas the scheduler policy still operates in the original MDP. Denote $\sum_{h,k} \equiv \sum_{h=0}^{H-1} \sum_{k=0}^{K-1}$ for simplicity. The worker objective, defined on this new MDP, is to minimize the entropy of the option $u_h$ conditioned on the option sub-trajectory $\{\bar{a}_{h,k}, \bar{s}_{h,k+1}\}$:

$\max_\phi \; \mathbb{E}_{\pi_\theta, \pi_\phi} \sum_{h,k} \Big[ \underbrace{\log p(u_h|\bar{a}_{h,k}, \bar{s}_{h,k+1})}_{\text{negative conditional option entropy}} - \beta \underbrace{\log \pi_\phi(a_{h,k}|\bar{s}_{h,k}, \bar{a}_{h,k-1}, u_h)}_{\text{worker policy entropy}} \Big]$    (4)

where the expectation is over the current $\pi_\theta$ and $\pi_\phi$ but the maximization is only with respect to $\phi$. Intuitively, the first term suggests that the worker is optimized to confidently identify an option given a sub-trajectory. However, it alone will not guarantee the diversity of options, because potentially even very similar sub-trajectories can be classified into different options if the classification model has a high capacity, in which case we say that the resulting sub-trajectory space has a very high "resolution". As a result, the conditional entropy alone might not be able to generate useful options to be exploited by the scheduler for task solving, because the coverage of the sub-trajectory space is poor. To combat this degenerate solution, we add a second term which maximizes the entropy of the worker policy. Intuitively, while the worker generates identifiable sub-trajectories corresponding to a given option, it should act as randomly as possible to separate sub-trajectories of different options, lowering the "resolution" of the sub-trajectory space to encourage its coverage. Because directly estimating the posterior $p(u_h|\bar{a}_{h,k}, \bar{s}_{h,k+1})$ is intractable, we approximate it with a parameterized posterior $q_\psi(u_h|\bar{a}_{h,k}, \bar{s}_{h,k+1})$ to obtain a lower bound (Barber & Agakov, 2003), where $q_\psi$ is a discriminator to be learned. Then we can maximize this lower bound instead:

$\max_{\phi,\psi} \; \mathbb{E}_{\pi_\theta, \pi_\phi} \sum_{h,k} \Big[ \log q_\psi(u_h|\bar{a}_{h,k}, \bar{s}_{h,k+1}) - \beta \log \pi_\phi(a_{h,k}|\bar{s}_{h,k}, \bar{a}_{h,k-1}, u_h) \Big]$    (5)

The discriminator $q_\psi$ is trained by maximizing likelihoods of options given sampled sub-trajectories.
The worker $\pi_\phi$ is trained via max-entropy RL (Soft Actor-Critic (SAC) (Haarnoja et al., 2018)) with the intrinsic reward $r^{lo}_{h,k+1} := \log q_\psi(\cdot) - \beta \log \pi_\phi(\cdot)$. $\beta$ is fixed to 0.01 in our experiments. Note that there are at least four differences between Eq. 5 and the common option discovery objective in either VIC (Gregor et al., 2016) or DIAYN (Eysenbach et al., 2019):

1. Both VIC and DIAYN assume that a sampled option will last through an entire episode, and the option is always sampled at the beginning of an episode. Thus their option trajectories "radiate" from the initial state set. In contrast, our worker policy learns options that initialize every $K$ steps within an episode, and they can have more diverse semantics depending on the various states $s_{h,0}$ visited by the agent. This is especially helpful for tasks where new options need to be discovered after the agent reaches unseen areas in later stages of training.

2. Actions taken by the worker policy under the current option will have consequences on the next option, because the final state $s_{h,K}$ of the current option is defined to be the initial state $s_{h+1,0}$ of the next option. So in general, the worker policy is trained not only to discover diverse options across the current $K$ steps, but also to make the discovery easier in future steps. In other words, the worker policy needs to solve the credit assignment problem across options, under the expectation of the scheduler policy.

3. To enable the worker policy to learn from a discriminator that predicts based on option sub-trajectories $\{\bar{a}_{h,k}, \bar{s}_{h,k+1}\}$ instead of solely on individual states $s_{h,k}$, we have constructed a new meta MDP where each state $\bar{s}_{h,k}$ encodes history from the beginning $(h, 0)$ up to $(h, k)$ within an option. This new meta MDP is critical: otherwise one simply cannot learn a worker policy from a reward function that is defined over multiple time steps (sub-trajectories), since the learning problem is no longer Markovian.

4. Lastly, thanks to the new MDP, we are able to explore various possible instantiations of the discriminator (see Section 3.3). As observed in the experiments, individual states are actually not the optimal features for identifying options.

These differences constitute the major novelty of our worker objective.
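The intrinsic reward defined above reduces to a one-line computation once the two log-probabilities are available. Below is a hedged sketch (the helper name is ours); it assumes per-step log-probabilities have already been produced by the discriminator and the worker policy:

```python
import numpy as np

def intrinsic_reward(disc_logprob, worker_logprob, beta=0.01):
    """r^lo_{h,k+1} = log q_psi(u_h | sub-trajectory) - beta * log pi_phi(a | ...).
    beta = 0.01 as fixed in the paper's experiments; inputs are per-step
    log-probabilities (scalars or arrays of equal shape)."""
    return np.asarray(disc_logprob) - beta * np.asarray(worker_logprob)

# A confident discriminator (high log q) and a fairly random worker (low log pi)
# both push this reward up, mirroring the two terms of Eq. 5.
r = intrinsic_reward(-1.0, -3.0)
```

Note the sign structure: subtracting $\beta \log \pi_\phi$ rewards worker entropy, since random actions have low log-probability.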

3.2.2. SHORTSIGHTED WORKER

It is challenging for the worker to accurately predict values over a long horizon, since its rewards are densely computed by a complex nonlinear function $q_\psi$, and each option only lasts at most $K$ steps. Thus we set the discount $\eta$ for the worker in two shortsighted ways:

1. Hard: setting $\eta = 0$ on every $K$-th step and $\eta = 1$ otherwise. Basically this truncates the temporal correlation (gradients) between adjacent options. Its benefit might be faster and easier value learning, because the value is bootstrapped over at most $K$ steps ($K \ll T$).

2. Soft: $\eta = 1 - \frac{1}{K}$, which considers rewards of roughly $K$ steps ahead. The worker policy still needs to take into account the identification of future option sub-trajectories, but their importance quickly decays.

We will evaluate both versions and compare their performance in Section 4.1.
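The two discount schedules can be written out explicitly. The sketch below (function and argument names are ours) generates the per-step discount $\eta$ for an episode of length T:

```python
import numpy as np

def worker_discounts(T, K, mode="hard"):
    """Per-step discount eta for the worker.
    hard: eta = 0 on the last step of each K-step option window (truncating
          value bootstrapping at option boundaries), eta = 1 otherwise.
    soft: eta = 1 - 1/K at every step, so rewards ~K steps ahead still matter
          but their importance decays quickly."""
    if mode == "hard":
        eta = np.ones(T)
        eta[K - 1 :: K] = 0.0      # zero out every K-th step
        return eta
    return np.full(T, 1.0 - 1.0 / K)
```

With the hard schedule, a TD backup multiplied by these discounts can never propagate value across an option boundary, which is exactly the truncation described above.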

3.3. INSTANTIATING THE DISCRIMINATOR

We explore various ways of instantiating the discriminator $q_\psi$ in order to compute useful intrinsic rewards for the worker. Previous work has utilized individual states as the discriminator input (Eysenbach et al., 2019). However, we note that unlike these works, the distribution of our option sub-trajectories is also determined by the scheduler in the context of hRL. The other four feature extractors have not been evaluated before. With the extracted feature, the log-probability of predicting an option is simply computed as the negative squared L2 norm: $\log q_\psi(u_h|\bar{a}_{h,k}, \bar{s}_{h,k+1}) = -\|f_\psi(\bar{a}_{h,k}, \bar{s}_{h,k+1}) - u_h\|_2^2$, by which we implicitly assume the discriminator's output distribution to be an $\mathcal{N}(0, \mathbf{I}_D)$ multivariate Gaussian.
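Under the Gaussian assumption above, the discriminator's log-probability is just a negative squared distance. A minimal sketch, with a precomputed feature vector standing in for the network output $f_\psi(\bar{a}_{h,k}, \bar{s}_{h,k+1})$:

```python
import numpy as np

def discriminator_logprob(feature, u):
    """log q_psi(u_h | sub-trajectory) = -||f_psi(.) - u_h||_2^2: an unnormalized
    isotropic-Gaussian log-likelihood centered on the extracted feature."""
    f, u = np.asarray(feature, dtype=float), np.asarray(u, dtype=float)
    return -float(np.sum((f - u) ** 2))
```

The log-probability is maximal (zero) exactly when the extracted feature matches the option vector, so training the discriminator pulls sub-trajectory features toward their options in $[-1, 1]^D$.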

3.4. OFF-POLICY TRAINING

The scheduler and worker objectives (Eq. 3 and Eq. 5) are trained jointly. In principle, on-policy training such as A2C (Clemente et al., 2017) is needed due to the interplay between the scheduler and worker. However, to reuse training data and improve sample efficiency, we employ off-policy training (SAC (Haarnoja et al., 2018) ) for both objectives with some modifications.

Modified worker objective

In practice, the expectation over the scheduler π θ in Eq. 5 is replaced with the expectation over its historical versions. Specifically, we sample options u h from a replay buffer, together with sub-trajectories {a h,k , s h,k+1 }. This type of data distribution modification is conventional in off-policy training (Lillicrap et al., 2016) .

Intrinsic reward relabeling

We always recompute the rewards in Eq. 5 using the up-to-date discriminator for every update of φ, which can be trivially done without any additional interaction with the environment.
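Relabeling can be sketched as a pure batch computation over replayed data, with no environment interaction. This is an illustrative sketch, not the paper's code: it assumes the L2-form discriminator from Section 3.3 and hypothetical array shapes of (batch, D) for features and options:

```python
import numpy as np

def relabel_rewards(batch_features, batch_options, batch_logpi, beta=0.01):
    """Recompute intrinsic rewards for a replayed batch using the *current*
    discriminator (here the negative-squared-L2 form), so the worker always
    trains on up-to-date rewards without new environment steps."""
    logq = -np.sum((batch_features - batch_options) ** 2, axis=-1)  # (batch,)
    return logq - beta * batch_logpi

f = np.array([[0.0, 0.0], [1.0, 0.0]])   # current f_psi outputs for two samples
u = np.zeros((2, 2))                     # the options stored in the replay buffer
r = relabel_rewards(f, u, np.array([-1.0, -1.0]))
```

Because `logq` depends only on stored tensors and the current parameters $\psi$, relabeling is essentially free at every gradient update of $\phi$.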

Importance correction

The data in the replay buffer was generated by historical worker policies. Thus a sampled option sub-trajectory will be outdated under the same option, causing confusion to the scheduler policy. To resolve this issue, when minimizing the temporal-difference (TD) error between the values of $s_{h,0}$ and $s_{h+1,0}$ for the scheduler, an importance ratio can be multiplied:

$\prod_{k=0}^{K-1} \frac{\pi_\phi(a_{h,k}|\bar{s}_{h,k}, \bar{a}_{h,k-1}, u_h)}{\pi_\phi^{\text{old}}(a_{h,k}|\bar{s}_{h,k}, \bar{a}_{h,k-1}, u_h)}$

A similar correction can also be applied to the discriminator loss. However, in practice we find that this ratio has a very high variance and hinders the training.
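The importance ratio above multiplies K per-step likelihood ratios, which is numerically safer to do in log space. A small sketch (names are ours; inputs are hypothetical per-step action log-probabilities under the current and behavior worker policies):

```python
import numpy as np

def importance_ratio(logp_new, logp_old):
    """prod_{k=0}^{K-1} pi_phi(a_k|.) / pi_phi_old(a_k|.), computed as
    exp(sum of per-step log-probability differences) for numerical stability."""
    return float(np.exp(np.sum(np.asarray(logp_new) - np.asarray(logp_old))))

# If the new policy is twice as likely on each of 3 steps, the ratio is 2^3 = 8,
# illustrating why the product's variance explodes as K grows.
ratio = importance_ratio(np.log([0.4, 0.4, 0.4]), np.log([0.2, 0.2, 0.2]))
```

The example also shows the variance problem cited in the text: the ratio is exponential in K, so even modest per-step drift between $\pi_\phi$ and $\pi_\phi^{\text{old}}$ yields extreme values.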

4. EXPERIMENTS

Environments We evaluate success rate and sample efficiency across two environment suites, as shown in Figure 2. Important details are presented here with more information in appendix Section B. The first suite consists of two 7-DOF reaching and pushing environments evaluated in Chua et al. (2018). They both emulate a one-armed PR2 robot. The tasks have sparse rewards: the agent gets a reward of 0 at every timestep where the goal is not achieved, and 1 when it is achieved. There is also a small L2 action penalty applied. In 7-DOF REACHER, the goal is achieved when the gripper reaches a 3D goal position. In 7-DOF PUSHER, the goal is to push an object to a 3D goal position. Episodes have a fixed length of 100; an episode counts as a success if the goal is achieved at its final step. We also propose another suite of environments called SOCIALROBOT (https://github.com/HorizonRobotics/SocialRobot). We construct two sparse reward robotic navigation and manipulation tasks, GOALTASK and KICKBALL. In GOALTASK, the agent gets a reward of 1 when it successfully navigates to a goal, -1 if the goal becomes too far, -0.5 every time it is too close to a distractor object, and 0 otherwise. In KICKBALL, the agent receives a reward of 1 for successfully pushing a ball into the goal, 0 otherwise, and has the same distractor object penalty. At the beginning of each episode, both the agent and the ball are spawned randomly. Both environments contain a small L2 action penalty, and terminate an episode upon a success.

Comparison methods One baseline algorithm for comparison is standard SAC (Haarnoja et al., 2018), the building block of our hierarchical method. To verify whether our worker policy can simply be replaced with a naïve action repetition strategy, we compare with SAC+ActRepeat, which repeats each action for the same length K as our option interval.
We also compare against HIRO (Nachum et al., 2018), a data-efficient hierarchical method with importance-based option relabeling, and HiPPO (Li et al., 2020), which trains the lower-level and higher-level policies together with one unified PPO-based objective. Both are state-of-the-art hierarchical methods proposed to solve sparse reward tasks. Similar to our work, HiPPO makes no assumptions about options; however, it utilizes a discrete option space and its options are trained with environment reward. We implement HIDIO based on an RL framework called ALF (https://github.com/HorizonRobotics/alf). A comprehensive hyperparameter search is performed for every method, with a far greater search space for HiPPO and HIRO than for our method HIDIO to ensure maximum fairness in comparison; details are presented in Appendix D.

Evaluation For every evaluation point during training, we evaluate the agent with current deterministic policies (by taking arg max of action distributions) for a fixed number of episodes and compute the mean success rate. We plot the mean evaluation curve over 3 randomly seeded runs with standard deviations shown as the shaded area around the curve.

4.1. WORKER DESIGN CHOICES

We ask and answer questions about the design choices in HIDIO specific to the worker policy $\pi_\phi$.

1. What sub-trajectory feature results in good option discovery? We evaluate all six features proposed in Section 3.3 in all four environments. These features are selected to evaluate how different types of sub-trajectory information affect option discovery and final performance. They encompass varying types of both local and global sub-trajectory information. We plot comparisons of all features in Figure 3. StateAction includes the current action and next state, encouraging $\pi_\phi$ to differentiate its options with different actions even at similar states. Similarly, Action includes the option initial state and current action, encouraging option diversity by differentiating between actions conditioned on initial states. Meanwhile, StateDiff simply encodes the difference between the next and current state, encouraging $\pi_\phi$ to produce options with different state changes at each step.

2. How do soft shortsighted workers (Soft) compare against hard shortsighted workers (Hard)? In Figure 3, we plot all features with Soft in dotted lines. We can see that in general there is not much difference in performance between Hard and Soft, except some extra instability of Soft in REACHER regarding the StateConcat and State features. One reason for this similar general performance could be that since our options are very short-term in Hard, the scheduler policy has the opportunity to switch to a good option before the current one leads to bad consequences. In a few cases, Hard seems to learn better, perhaps due to easier value bootstrapping for the worker.

4.2. COMPARISON RESULTS

We compare our three best sub-trajectory features of Hard from Section 4.1 against the SAC baselines and hierarchical RL methods across all four environments in Figure 4. Generally, we see that HIDIO (solid lines) achieves greater final performance with superior sample efficiency than the compared methods. Both SAC and SAC+ActRepeat perform poorly across all environments, and all baseline methods perform significantly worse than HIDIO on REACHER, GOALTASK, and KICKBALL. In PUSHER, HiPPO displays competitive performance, rapidly improving from the start. However, all three HIDIO instantiations achieve nearly 100% success rates while HiPPO is unable to do so. Furthermore, HIRO and SAC+ActRepeat take much longer to start performing well, and never achieve similar success rates as HIDIO. HIDIO is able to solve REACHER while HiPPO achieves only about a 60% success rate at best. Meanwhile, HIRO, SAC+ActRepeat, and SAC are unstable or non-competitive. REACHER is a difficult exploration problem as the arm starts far from the goal position, and we see that HIDIO's automatically discovered options ease exploration for the higher-level policy to consistently reach the goal. HIDIO performs well on GOALTASK, achieving 60-80% success rates, while the task is too challenging for every other method. In KICKBALL, the most challenging task, HIDIO achieves 30-40% success rates while every other method again learns poorly, highlighting the need for the intrinsic option discovery of HIDIO in these environments.

4.3. IS JOINT TRAINING NECESSARY?

We next ask: is jointly training $\pi_\theta$ and $\pi_\phi$ necessary? To answer this, we compare HIDIO against a pre-training baseline where we first pre-train $\pi_\phi$ with uniformly sampled options u for a portion ρ of the total number of training time steps, and then fix $\pi_\phi$ while training $\pi_\theta$ for the remaining (1 - ρ) time steps. This is essentially using pre-trained options for downstream higher-level tasks as demonstrated in DIAYN (Eysenbach et al., 2019).
We conduct this experiment with the StateAction feature on both KICKBALL and PUSHER, with ρ ∈ {1/16, 1/8, 1/4}. The results are shown in Figure 6. We can see that in PUSHER, fewer pre-training time steps are more sample efficient, as the environment is simple and options can be learned from a small amount of samples. The nature of PUSHER also only requires options that can be learned independently of the scheduler policy evolution. Nevertheless, the pre-training baselines seem less stable. In KICKBALL, the best pre-training baseline uses ρ = 1/8 of the total time steps. However, without the joint training scheme of HIDIO, the learned options cannot be used as efficiently for the difficult obstacle avoidance, navigation, and ball manipulation subtasks required for performing well.

4.4. OPTION BEHAVIORS

Finally, since options discovered by HIDIO in our sparse reward environments help it achieve superior performance, we ask, what do useful options look like? To answer this question, after training, we sample options from the scheduler π θ to visualize their behaviors in different environments in Figure 5 . For each sampled option u, we fix it until the end of an episode and use the worker π φ to output actions given u. We can see that the options learned by HIDIO are low-level navigation and manipulation skills useful for the respective environments. We present more visualizations in Figure 9 and more analysis in Section C.2 in the appendix. Furthermore, we present an analysis of task performance for different option lengths in appendix Section C.1 and Figures 7 and 8 .

5. RELATED WORK

Hierarchical RL Much of the previous work in hRL makes assumptions about the task structure and/or the skills needed to solve the task. While obtaining promising results under specific settings, these methods may have difficulty with different scenarios. For example, SAC-X (Riedmiller et al., 2018) requires manually designing auxiliary subtasks as skills to solve a given downstream task. SNN4HRL (Florensa et al., 2016) is geared towards tasks with pre-training and downstream components. Goal-conditioned methods (e.g., Vezhnevets et al., 2017; Nachum et al., 2018) make higher-level manager policies output goals for lower-level worker policies to achieve. Usually the goal space is a subspace of the state space or defined according to the task so that lower-level rewards are easy to compute. This requirement of manually "grounding" goals in the environment poses generalization challenges for tasks that cannot be decomposed into state or goal reaching. The MAXQ decomposition (Dietterich, 2000) defines an hRL task decomposition by breaking up the target MDP into a hierarchy of smaller MDPs such that the value function in the target MDP is represented as the sum of the value functions of the smaller ones. This has inspired works that use such decompositions (Mehta et al., 2008; Winder et al., 2020; Li et al., 2017) to learn structured, hierarchical world models or policies to complete target tasks or perform transfer learning. However, building such hierarchies makes these methods limited to MDPs with discrete action spaces. Our method HIDIO makes few assumptions about the specific task at hand. It follows from the options framework (Sutton et al., 1999), which has recently been applied to continuous domains (Bacon et al., 2017), spawning a diverse set of recent hierarchical options methods (Bagaria & Konidaris, 2020; Klissarov et al., 2017; Riemer et al., 2018; Tiwari & Thomas, 2019; Jain et al., 2018).
HIDIO automatically learns intrinsic options, avoiding explicit initiation or termination policies dependent on the task at hand. HiPPO (Li et al., 2020), like HIDIO, also makes no major assumptions about the task, but does not employ self-supervised learning for training the lower-level policy.

Self-supervised option/skill discovery There are also plenty of prior works which attempt to learn skills or options without task reward. DIAYN (Eysenbach et al., 2019) and VIC (Gregor et al., 2016) learn skills by maximizing the mutual information between trajectory states and their corresponding skills. VALOR (Achiam et al., 2018) and related methods demonstrate pre-trained options to be useful for hRL. These methods usually pre-train options in an initial stage separate from downstream task learning; few works directly integrate option discovery into a hierarchical setting. For higher-dimensional input domains, Lynch et al. (2020) learn options from human-collected robot interaction data for image-based, goal-conditioned tasks, and Chuck et al. (2020) learn a hierarchy of options by discovering objects from environment images and forming options which can manipulate them. HIDIO can also be applied to image-based environments by replacing fully-connected layers with convolutional layers in the early stages of the policy and discriminator networks. However, we leave this to future work to address possible practical challenges arising in this process.

6. CONCLUSION

Towards solving difficult sparse reward tasks, we propose a new hierarchical reinforcement learning method, HIDIO, which can learn task-agnostic options in a self-supervised manner and simultaneously learn to utilize them to solve tasks. We evaluate several different instantiations of the discriminator of HIDIO for providing intrinsic rewards for training the lower-level worker policy. We demonstrate the effectiveness of HIDIO compared against other reinforcement learning methods in achieving high rewards with better sample efficiency across a variety of robotic navigation and manipulation tasks.

B.0.1 PUSHER AND REACHER

There is an action penalty in both environments: at every timestep the squared L2 norm of the agent action is subtracted from the reward. In PUSHER, this penalty is multiplied by a coefficient of 0.001. In REACHER, it is multiplied by 0.0001.

B.0.2 GOALTASK AND KICKBALL

For both SOCIALROBOT environments, an episode terminates early when either a success is reached or the goal is out of range. For each episode, the positions of all objects (including the agent) are randomly picked. In GOALTASK, observations are 18-dimensional, including egocentric positions, distances, and directions from the agent to different objects. In KICKBALL, observations are 30-dimensional, including absolute poses and velocities of the goal, the ball, and the agent. In KICKBALL, the agent receives a reward of 1 for successfully pushing a ball into the goal (episode termination) and 0 otherwise. At the beginning of each episode, the ball is spawned randomly inside the neighborhood of the agent. Three distractor objects are included on the ground to increase task difficulty; in GOALTASK, the number of distractor objects increases to 5. Both environments contain a small L2 action penalty: at every time step the squared L2 norm of the agent action, multiplied by 0.01, is subtracted from the reward. GOALTASK has a time horizon of 100 steps, while KICKBALL's horizon is 200. Both GOALTASK and KICKBALL use the same navigation robot PIONEER2DX, which has 2-dimensional actions that control the angular velocities (scaled to [-1, 1]) of the two wheels.

C OPTION DETAILS C.1 OPTION LENGTH ABLATION

We ablate the option length K in all four environments on the three best HIDIO instantiations in Figure 7. Results for K ∈ {1, 3, 5} timesteps per option are shown, with K = 3 and K = 5 performing similarly across all environments, but K = 1 performing very poorly in comparison. K = 1 provides no temporal abstraction, resulting in worse sample efficiency in PUSHER and REACHER, and failing to learn in GOALTASK and KICKBALL. Although K = 5 and K = 3 are generally similar, we see in GOALTASK that K = 5 results in better performance than K = 3 across all three instantiations, demonstrating the potential benefit of longer temporal abstraction lengths. We also plot the distribution of (x, y) velocities in GOALTASK and (x, y) coordinates in KICKBALL of randomly sampled options of different lengths in Figure 8. Despite the fact that these two dimensions only represent a small subspace of the entire (30-dimensional) state space, they still demonstrate a difference in option behavior at different option lengths. We can see that as the option length K increases, the option behaviors become more consistent within a trajectory. Meanwhile, regarding coverage, K = 1's (blue) trajectory distribution in both environments is less concentrated near the center, while K = 5's (green) is the most concentrated at the center. K = 3's (orange) lies somewhere in between. We believe that this difference in behavior signifies a trade-off between the coverage of the state space and how consistent the learned options can be depending on the option length. Given the same entropy coefficient (β in Eq. 5), with longer option lengths, it is likely that the discriminator can more easily discriminate the sub-trajectories created by these options, so that their coverage does not have to be as wide for the worker policy to obtain high intrinsic rewards.
Meanwhile, with shorter option lengths, the shorter sub-trajectories have to be more distinct for the discriminator to be able to successfully differentiate between the options.
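The interaction between the discriminator and the worker's intrinsic reward can be sketched as follows. This is a minimal numpy sketch assuming a unit-variance Gaussian discriminator head that predicts the option vector from sub-trajectory features; the paper's actual parameterization may differ, and all names here are our own.

```python
import numpy as np

def gaussian_log_prob(u, mean, std=1.0):
    # log N(u; mean, std^2 * I), summed over the option dimensions.
    d = u.shape[-1]
    return (-0.5 * np.sum(((u - mean) / std) ** 2)
            - d * np.log(std) - 0.5 * d * np.log(2.0 * np.pi))

def intrinsic_reward(u, predicted_u):
    # Worker reward: log-likelihood the discriminator assigns to the true
    # option u, given its prediction from the sub-trajectory features.
    # The closer the prediction, the higher the reward, so options whose
    # sub-trajectories are easy to tell apart earn higher intrinsic reward.
    return gaussian_log_prob(u, predicted_u)
```

Under this sketch, a worker maximizing `intrinsic_reward` is pushed to make sub-trajectories that are distinctive per option, which is the trade-off discussed above.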

C.2 OPTION VISUALIZATIONS

We visualize more option behaviors in Figure 9, produced in the same way as in Figure 5 and as detailed in Section 4.4. The top 4 picture reels are from KICKBALL. We see that KICKBALL options lead to varied directional driving behaviors that can be utilized for efficient navigation. For example, the second, third, and fourth reels highlight options that produce right-turning behavior, though at different speeds and angles. The option in the third reel is a quick turn that results in the robot tumbling over into an unrecoverable state, while the options in the second and fourth reels turn more slowly and do not flip the robot. The first option simply proceeds forward from the robot's starting position, kicking the ball into the goal. The bottom 4 reels are from PUSHER. Each option results in different sweeping behaviors with varied joint positioning and arm height. These sweeping and arm-folding behaviors, when utilized in short sub-trajectories, are useful for controlling where and how to move the arm to push the puck into the goal.

D HYPERPARAMETERS

To ensure a fair comparison across all methods, we perform a hyperparameter search over the following values for each algorithm and suite of environments. 



In this paper we focus on non-image observations that can be processed with MLPs, although our method makes no assumption about the observation space.

One possible reason is that the deep RL process is "highly non-stationary anyway, due to changing policies, state distributions and bootstrap targets" (Schaul et al., 2016).

https://github.com/HorizonRobotics/SocialRobot
https://github.com/HorizonRobotics/alf

Velocities are relative to the agent's yaw rotation. Because GOALTASK has egocentric inputs, the agent is not aware of absolute (x, y) coordinates in this task.



Figure 1: The overall framework of HIDIO. The scheduler π θ samples an option u h every K (3 in this case) time steps, which is used to guide the worker π φ to interact directly with the environment, conditioned on u h and the current sub-trajectory (s h,k , a h,k-1 ). The scheduler receives accumulated environment rewards R h , while the worker receives intrinsic rewards r lo h,k+1 . Refer to Eq. 2 for sampling and Eqs. 3 and 5 for training.
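The sampling loop described in the Figure 1 caption can be sketched as follows. Here `env`, `scheduler`, and `worker` are hypothetical stand-ins for the paper's actual interfaces, shown only to make the scheduler/worker interaction concrete.

```python
import numpy as np

def rollout(env, scheduler, worker, K=3, horizon=100):
    """Hierarchical rollout: every K steps the scheduler samples a new
    option u; the worker then acts conditioned on u and the sub-trajectory
    collected so far within the current option."""
    s = env.reset()
    total_reward = 0.0
    sub_traj = []
    for t in range(horizon):
        if t % K == 0:
            u = scheduler(s)      # scheduler picks a new option every K steps
            sub_traj = [s]        # restart the option's sub-trajectory
        a = worker(u, sub_traj)   # worker conditions on option + sub-trajectory
        s, r, done = env.step(a)
        sub_traj.append(s)
        total_reward += r         # scheduler's reward accumulates env rewards
        if done:
            break
    return total_reward
```

In training, the worker would be rewarded with the intrinsic reward from the discriminator rather than `r`; the loop above only shows the control flow.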

Similar to observations made in Nachum et al. (2018) and Fedus et al. (2020), our method performs well empirically even without importance correction.

Figure 2: The four tasks we evaluate on. From left to right: 7-DOF PUSHER, 7-DOF REACHER, GOALTASK, and KICKBALL. The first two tasks simulate a one-armed PR2 robot environment while the last two are in the SOCIALROBOT environment. The final picture shows a closeup of the PIONEER2DX robot used in SOCIALROBOT.

Figure 3: Comparison of all discriminator features against each other across the four environments. Solid lines indicate hard short-sighted workers (Hard), dotted lines indicate soft short-sighted workers (Soft).

Figure 4: Comparisons of the mean success rates of three features of HIDIO (Action, StateAction, StateDiff; solid lines) against other methods (dashed lines).

Figure 5: Two example options from the StateAction instantiation on KICKBALL (top) and PUSHER (bottom). The top option navigates directly to the goal by bypassing obstructions along the way and the bottom option sweeps the puck towards one direction.

Figure 6: Pretraining baseline comparison at fractions {1/16, 1/8, 1/4} of the total number of training time steps.

Lee et al. (2019; 2020) learn to modulate or compose given primitive skills that are customized for their particular robotics tasks. Ghavamzadeh & Mahadevan (2003) and Sohn et al. (2018) operate under the assumption that tasks can be manually decomposed into subtasks. The feudal reinforcement learning proposal (Dayan & Hinton, 1993) has inspired another line of works (Vezhnevets et al., 2017; Nachum et al., 2018; Levy et al., 2019; Rafati & Noelle, 2019).

VALOR (Achiam et al., 2018) learns options by maximizing the probability of options given their resulting observation trajectory. DADS (Sharma et al., 2019a) learns skills that are predictable by dynamics models. DISCERN (Warde-Farley et al., 2019) maximizes the mutual information between goal and option termination states to learn a goal-conditioned reward function. Brunskill & Li (2014) learns options in discrete MDPs that are guaranteed to improve a measure of sample complexity. Portable Option Discovery (Topin et al., 2015) discovers options by merging options from source policies to apply to some target domain.

Figure 7: Comparisons of the mean success rates of three features of HIDIO (Action, StateAction, StateDiff) at different option lengths K. Dotted lines indicate K = 1, solid lines indicate K = 3, and dashed lines indicate K = 5. K = 3 was used across all environments for the results in the main text.

Figure 8: Trajectory distributions compared for different option lengths K for the StateAction HIDIO instantiation in both SOCIALROBOT environments. These are obtained by randomly sampling an option uniformly in [-1, 1] D and keeping it fixed for the entire trajectory. 100 trajectories for each option length are visualized and plotted in different colors.

Figure 9: Eight example options from the StateAction instantiation on KICKBALL (top 4) and PUSHER (bottom 4).


Existing methods typically condition on individual states (Jabri et al., 2019) or full observation trajectories (Warde-Farley et al., 2019; Sharma et al., 2019a; Achiam et al., 2018) for option discrimination. Thanks to the newly defined meta MDP, our discriminator is able to take option sub-trajectories instead of current individual states for prediction. In this paper, we investigate six sub-trajectory feature extractors f ψ .
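Three of these extractors are named in the figures (Action, StateAction, StateDiff), and their behavior can be sketched as follows. The tensor layout and function name are our own choices for illustration, not the paper's implementation.

```python
import numpy as np

def extract_features(states, actions, kind):
    """Sketch of three sub-trajectory feature extractors.

    `states` holds the K+1 states of an option sub-trajectory (one row each);
    `actions` holds the K actions taken within it."""
    if kind == "Action":
        # Actions only: the discriminator never sees states.
        return actions
    if kind == "StateAction":
        # Pair each state with the action taken from it.
        return np.concatenate([states[:-1], actions], axis=-1)
    if kind == "StateDiff":
        # Differences between consecutive states (how the state changed).
        return np.diff(states, axis=0)
    raise ValueError(f"unknown extractor: {kind}")
```

For a sub-trajectory with 4 states of dimension 3 and 3 actions of dimension 2, the three extractors produce feature matrices of shape (3, 2), (3, 5), and (3, 3) respectively.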

A PSEUDO CODE FOR HIDIO

Step through the environment: s h,k+1 ∼ P(·|s h,k , a h,k ).

Shared hyperparameters across all methods are listed below (where applicable, and except when overridden by hyperparameters listed for each individual method). For all methods, we take the hyperparameters that perform best across 3 random seeds in terms of the area under the evaluation success curve (AUC) in the PUSHER environment.

• Rollout length: {25, 50, 100}

The target entropy used for automatically adjusting α is calculated as ∑ i [ln(M i - m i ) + ln ∆], where M i /m i are the maximum/minimum values of action dim i. Intuitively, the target distribution concentrates on a segment of length (M i - m i )∆ with a constant probability.

Chosen to match the option interval K of HIDIO.
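The target-entropy formula above can be computed directly from the action bounds. A minimal sketch, with a hypothetical helper name:

```python
import numpy as np

def target_entropy(action_low, action_high, delta=0.2):
    """Target entropy for auto-tuning alpha:
    sum_i [ln(M_i - m_i) + ln(delta)], where [m_i, M_i] bounds action dim i."""
    action_low = np.asarray(action_low, dtype=float)
    action_high = np.asarray(action_high, dtype=float)
    return float(np.sum(np.log(action_high - action_low) + np.log(delta)))
```

For a 2-dimensional action space bounded by [-1, 1] with ∆ = 0.2, this gives 2 · (ln 2 + ln 0.2) ≈ -1.83 nats.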

D.1.4 HIRO

• Steps per option: {3, 5, 8}
• Replay buffer size (total): {500000, 2000000}
• Meta action space (actions are relative, e.g., meta-action is current obs + action): (-np.ones(obs space - 3D goal pos) * 2, np.ones(obs space - 3D goal pos) * 2)

Other hyperparameters are kept the same as the optimal SAC ones.
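The relative meta-action bounds above can be constructed as in the following sketch; the helper name and return format are our own, and we assume a 3-dimensional goal position is stripped from the observation.

```python
import numpy as np

def hiro_meta_action_space(obs_dim, goal_dim=3):
    """Relative meta-action bounds: the meta-action spans the observation
    space minus the goal dimensions, scaled to [-2, 2] per dimension."""
    d = obs_dim - goal_dim
    return -np.ones(d) * 2, np.ones(d) * 2
```

For a 30-dimensional observation with a 3-dimensional goal, this yields 27-dimensional lower and upper bound vectors of -2 and 2.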

D.2.3 HIDIO

Due to the large hyperparameter search space, we only search over the option vector size and rollout length, and select everything else heuristically.

• Latent option u vector dimension (D): {4, 6}
• Policy/Q network hidden layers for π φ : (128, 128, 128)
• Steps per option (K): 3
• π φ has a fixed entropy coefficient α of 0.01. Target entropy min prob ∆ for π θ is 0.2.
• Discriminator network hidden layers: (32, 32)
• Replay buffer length per parallel actor: 20000

• Steps per option: {3, 5, 8}
• Replay buffer size (total): {500000, 2000000}
• Meta action space (actions are relative, e.g., meta-action is current obs + action):
  - GOALTASK: (-np.ones(obs space) * 2, np.ones(obs space) * 2)
  - KICKBALL: (-np.ones(obs space - goal space) * 2, np.ones(obs space - goal space) * 2) (because the goal position is given but will not change in the observation space)

