HIERARCHICAL REINFORCEMENT LEARNING BY DISCOVERING INTRINSIC OPTIONS

Abstract

We propose a hierarchical reinforcement learning method, HIDIO, that can learn task-agnostic options in a self-supervised manner while jointly learning to utilize them to solve sparse-reward tasks. Unlike current hierarchical RL approaches that tend to formulate goal-reaching low-level tasks or pre-define ad hoc lower-level policies, HIDIO encourages lower-level option learning that is independent of the task at hand, requiring few assumptions and little knowledge about the task structure. These options are learned through an intrinsic entropy minimization objective conditioned on the option sub-trajectories. The learned options are diverse and task-agnostic. In experiments on sparse-reward robotic manipulation and navigation tasks, HIDIO achieves higher success rates with greater sample efficiency than regular RL baselines and two state-of-the-art hierarchical RL methods.

1. INTRODUCTION

Imagine a wheeled robot learning to kick a soccer ball into a goal with sparse reward supervision. In order to succeed, it must discover how to first navigate in its environment, then touch the ball, and finally kick it into the goal, receiving a positive reward only at the end for completing the task. This is a naturally difficult problem for traditional reinforcement learning (RL) to solve, unless the task has been manually decomposed into temporally extended stages where each stage constitutes a much easier subtask. In this paper we ask: how do we learn to decompose the task automatically and utilize the decomposition to solve sparse-reward problems?

Deep RL has recently made great strides in solving a variety of tasks, with hierarchical RL (hRL) demonstrating promise in solving such sparse-reward tasks (Sharma et al., 2019b; Le et al., 2018; Merel et al., 2019; Ranchod et al., 2015). In hRL, the task is decomposed into a hierarchy of subtasks, where policies at the top of the hierarchy call upon policies below to perform actions to solve their respective subtasks. This abstracts away actions for the policies at the top levels of the hierarchy. hRL makes exploration easier by potentially reducing the number of steps the agent needs to take to explore its state space. Moreover, at higher levels of the hierarchy, temporal abstraction results in more aggressive, multi-step value bootstrapping when temporal-difference (TD) learning is employed. These benefits are critical in sparse-reward tasks as they allow an agent to more easily discover reward signals and assign credit.
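To make the bootstrapping benefit concrete, the sketch below (illustrative only, not code from the paper) compares a standard one-step TD target with the K-step target a higher-level policy would use when each of its decisions spans K primitive environment steps; the function names are hypothetical.

```python
def one_step_td_target(r, gamma, v_next):
    # Standard TD(0) target: bootstrap from the value of the very next state.
    return r + gamma * v_next

def k_step_option_target(rewards, gamma, v_next):
    # Higher-level target under temporal abstraction: accumulate the rewards
    # from one option's K primitive steps, then bootstrap once from the value
    # of the state where the option terminates.
    g = sum(gamma ** i * r for i, r in enumerate(rewards))
    return g + gamma ** len(rewards) * v_next

# With K = 3, a single higher-level backup propagates value information
# across three environment steps at once:
print(one_step_td_target(1.0, 0.9, 5.0))                 # 1.0 + 0.9 * 5.0
print(k_step_option_target([1.0, 1.0, 1.0], 0.9, 5.0))   # 2.71 + 0.9^3 * 5.0
```

This is why sparse reward signals reach distant states in fewer updates at higher levels of the hierarchy.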
Many existing hRL methods make assumptions about the task structure (e.g., fetching an object involves three stages: moving towards the object, picking it up, and coming back), and/or the skills needed to solve the task (e.g., pre-programmed motor skills) (Florensa et al., 2016; Riedmiller et al., 2018; Lee et al., 2019; Hausman et al., 2018; Lee et al., 2020; Sohn et al., 2018; Ghavamzadeh & Mahadevan, 2003; Nachum et al., 2018). Thus these methods may require manually designing the correct task decomposition, explicitly formulating the option space, or programming pre-defined options for higher-level policies to compose. Instead, we seek to formulate a general method that can learn these abstractions from scratch, for any task, with little manual design in the task domain.

The main contribution of this paper is HIDIO (HIerarchical RL by Discovering Intrinsic Options), a hierarchical method that discovers task-agnostic intrinsic options in a self-supervised manner while learning to schedule them to accomplish environment tasks. The latent option representation is uncovered as the option-conditioned policy is trained, both according to the same self-supervised worker objective. The scheduling of options is simultaneously learned by maximizing the environment reward collected by the option-conditioned policy. HIDIO can be easily applied to new sparse-reward tasks by simply re-discovering options. We propose and empirically evaluate various instantiations of the option discovery process, comparing the resulting options with respect to their final task performance. We demonstrate that HIDIO is able to efficiently learn and discover diverse options to be utilized for higher task reward with superior sample efficiency compared to other hierarchical methods.

2. PRELIMINARIES

We consider the reinforcement learning (RL) problem in a Markov Decision Process (MDP). Let s ∈ R^S be the agent state. We use the terms "state" and "observation" interchangeably to denote the environment input to the agent. A state can be fully or partially observed. Without loss of generality, we assume a continuous action space a ∈ R^A for the agent. Let π_θ(a|s) be the policy distribution with learnable parameters θ, and P(s_{t+1}|s_t, a_t) the transition probability that measures how likely the environment transitions to s_{t+1} given that the agent samples an action by a_t ∼ π_θ(·|s_t). After the transition to s_{t+1}, the agent receives a deterministic scalar reward r(s_t, a_t, s_{t+1}). The objective of RL is to maximize the sum of discounted rewards with respect to θ:

    E_{π_θ, P} [ Σ_{t=0}^∞ γ^t r(s_t, a_t, s_{t+1}) ]        (1)

where γ ∈ [0, 1] is a discount factor. We will omit P in the expectation for notational simplicity. In the options framework (Sutton et al., 1999), the agent can switch between different options during an episode, where an option is translated to a sequence of actions by an option-conditioned policy with a termination condition. A set of options defined over an MDP induces a hierarchy that models temporal abstraction. For a typical two-level hierarchy, a higher-level policy produces options, and the policy at the lower level outputs environment actions conditioned on the proposed options. The expectation in Eq. 1 is taken over policies at both levels.
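The discounted return in Eq. 1 can be estimated from a single sampled trajectory; a minimal sketch (illustrative, not part of the paper's implementation):

```python
def discounted_return(rewards, gamma=0.99):
    # Computes sum_t gamma^t * r_t for one sampled trajectory, i.e., a
    # single-sample estimate of the objective in Eq. 1.
    g = 0.0
    for r in reversed(rewards):  # backward accumulation: g_t = r_t + gamma * g_{t+1}
        g = r + gamma * g
    return g

# A sparse-reward episode: reward arrives only on the final step.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # ~0.81 (= 0.9^2 * 1.0)
```

Backward accumulation avoids recomputing powers of γ and is the standard way returns are computed in RL codebases.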

3. HIERARCHICAL RL BY DISCOVERING INTRINSIC OPTIONS

We now introduce our hierarchical method for solving sparse-reward tasks. We assume little prior knowledge about the task structure, except that it can be learned through a hierarchy of two levels. The higher-level policy (the scheduler π_θ) is trained to maximize environment reward, while the lower-level policy (the worker π_φ) is trained in a self-supervised manner to efficiently discover options that are utilized by π_θ to accomplish tasks. Importantly, through self-supervision the worker gets access to dense intrinsic rewards regardless of the sparsity of the extrinsic rewards.

Worker. Without loss of generality, we assume that each episode has a length of T and the scheduler outputs an option every K steps. The scheduled option u ∈ [-1, 1]^D (where D is a pre-defined dimensionality) is a latent representation that will be learned from scratch given the environment task. Modulated by u, the worker executes K steps before the scheduler outputs the next option. Let the time horizon of the scheduler be H = T/K. Formally, we define



Figure 1: The overall framework of HIDIO. The scheduler π_θ samples an option u_h every K (3 in this case) time steps, which is used to guide the worker π_φ to directly interact with the environment conditioned on u_h and the current sub-trajectory s_{h,k}, a_{h,k-1}. The scheduler receives accumulated environment rewards R_h, while the worker receives intrinsic rewards r^lo_{h,k+1}. Refer to Eq. 2 for sampling and Eqs. 3 and 5 for training.
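The scheduler–worker control flow described above can be sketched as a nested loop; here `env`, `scheduler`, and `worker` are hypothetical callables standing in for the environment and the two policies, not the paper's actual implementation.

```python
def hierarchical_rollout(env, scheduler, worker, T=12, K=3):
    # Every K steps the scheduler emits an option u; the worker then acts for
    # K environment steps conditioned on u. The scheduler therefore makes
    # H = T // K decisions per episode.
    s = env.reset()
    for h in range(T // K):          # scheduler horizon H
        u = scheduler(s)             # option for this K-step segment
        for k in range(K):
            a = worker(s, u)         # primitive action conditioned on the option
            s, r, done = env.step(a)
            if done:
                return
```

In HIDIO the worker is additionally conditioned on the sub-trajectory within the current segment and is rewarded intrinsically rather than with the environment reward r; the sketch only shows the two-level control flow.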

