LEARNING SUBGOAL REPRESENTATIONS WITH SLOW DYNAMICS

Abstract

In goal-conditioned Hierarchical Reinforcement Learning (HRL), a high-level policy periodically sets subgoals for a low-level policy, and the low-level policy is trained to reach those subgoals. A proper subgoal representation function, which abstracts a state space to a latent subgoal space, is crucial for effective goal-conditioned HRL, since different low-level behaviors are induced by reaching subgoals in the compressed representation space. Observing that the high-level agent operates at an abstract temporal scale, we propose a slowness objective to effectively learn the subgoal representation (i.e., the high-level action space). We provide a theoretical grounding for the slowness objective: selecting slow features as the subgoal space enables efficient hierarchical exploration. As a result of this improved exploration ability, our approach significantly outperforms state-of-the-art HRL and exploration methods on a number of benchmark continuous-control tasks. Thanks to the generality of the proposed subgoal representation learning method, empirical results also demonstrate that the learned representation and corresponding low-level policies can be transferred between distinct tasks.

1. INTRODUCTION

Deep Reinforcement Learning (RL) has demonstrated increasing capabilities in a wide range of domains, including playing games (Mnih et al., 2015; Silver et al., 2016), controlling robots (Schulman et al., 2015; Gu et al., 2017) and navigating complex environments (Mirowski et al., 2016; Zhu et al., 2017). Solving temporally extended tasks with sparse or deceptive rewards is one of the major challenges for RL. Hierarchical Reinforcement Learning (HRL), which enables control at multiple time scales via a hierarchical structure, provides a promising way to solve those challenging tasks. Goal-conditioned methods have long been recognized as an effective paradigm in HRL (Dayan & Hinton, 1993; Schmidhuber & Wahnsiedler, 1993; Nachum et al., 2019). In goal-conditioned HRL, higher-level policies periodically set subgoals for lower-level ones, and lower-level policies are incentivized to reach these selected subgoals. A proper subgoal representation function, abstracting a state space to a latent subgoal space, is crucial for effective goal-conditioned HRL, because the abstract subgoal space, i.e., the high-level action space, simplifies high-level policy learning, and explorative low-level behaviors can also be induced by setting different subgoals in this compressed space. Recent works in goal-conditioned HRL have concentrated on implicitly learning the subgoal representation in an end-to-end manner with hierarchical policies (Vezhnevets et al., 2017; Dilokthanakul et al., 2019), e.g., using a variational autoencoder (Péré et al., 2018; Nair & Finn, 2019; Nasiriany et al., 2019), or directly utilizing the state space (Levy et al., 2019) or a handcrafted space (Nachum et al., 2018) as a subgoal space. Sukhbaatar et al. (2018) proposed to learn subgoal embeddings via self-play, and Ghosh et al. (2018) designed a representation learning objective using an actionable distance metric, but both methods require a pretraining process.
Near-Optimal Representation (NOR) for HRL (Nachum et al., 2019) learns an abstract space concurrently with hierarchical policies by bounding the sub-optimality. However, the NOR subgoal space cannot support efficient exploration in challenging deceptive-reward tasks. In this paper, we develop a novel method, which LEarns the Subgoal representation with SlOw dyNamics (LESSON) along with the hierarchical policies. Subgoal representation in HRL is not only a state space abstraction, but also a form of high-level action abstraction. Since the high-level agent makes decisions at a low temporal resolution, our method extracts features with slow dynamics from observations as the subgoal space to enable temporal coherence. LESSON minimizes feature changes between adjacent low-level timesteps, so that the learned feature representation has the slowness property. To capture dynamic features and prevent the collapse of the learned representation space, we also introduce an additional contrastive objective that maximizes feature changes across high-level temporal intervals. We provide a theoretical motivation for the slowness objective: selecting slow features as the subgoal space achieves the most efficient hierarchical exploration when the subgoal space dimension is low and fixed. We illustrate on a didactic example that our method LESSON accomplishes the most efficient state coverage among all the compared subgoal representation functions. We also compare LESSON with state-of-the-art HRL and exploration methods on complex MuJoCo tasks (Todorov et al., 2012). Experimental results demonstrate that (1) LESSON dramatically outperforms previous algorithms and learns hierarchical policies more efficiently; (2) our learned representation with slow dynamics can provide interpretability for the hierarchical policy; and (3) our subgoal representation and low-level policies can be transferred between different tasks.
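To make the two objectives above concrete, the following is a minimal NumPy sketch of a combined slowness and contrastive loss. The linear encoder, hinge margin, and batch layout are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def slowness_loss(phi, s_t, s_next, s_far, margin=1.0):
    """Sketch of the two representation objectives.

    - `slow` minimizes feature change between adjacent low-level
      timesteps (s_t, s_{t+1});
    - `contrast` maximizes feature change across a high-level interval
      (s_t, s_{t+c}) via a hinge term, which also prevents the
      representation from collapsing to a constant.
    Inputs are state batches of shape (batch, state_dim); `phi` would be
    a differentiable feature map in a real implementation.
    """
    z_t, z_next, z_far = phi(s_t), phi(s_next), phi(s_far)
    slow = np.mean(np.sum((z_next - z_t) ** 2, axis=-1))
    dist_far = np.sum((z_far - z_t) ** 2, axis=-1)
    contrast = np.mean(np.maximum(margin - dist_far, 0.0))
    return slow + contrast

# A linear stand-in for the encoder network phi: S -> R^k.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))
phi = lambda s: s @ W
```

In practice the triplets (s_t, s_{t+1}, s_{t+c}) would be sampled from the replay buffer and the loss minimized by gradient descent on the encoder parameters.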

2. PRELIMINARIES

In reinforcement learning, an agent interacts with an environment modeled as an MDP M = (S, A, P, R, γ), where S is a state space and A is an action space. P : S × A × S → [0, 1] is an unknown dynamics model, which specifies the probability P(s′|s, a) of transitioning to the next state s′ from the current state s by taking action a. R : S × A → ℝ is a reward function, and γ ∈ [0, 1) is a discount factor. We optimize a stochastic policy π(a|s), which outputs a distribution over the action space for a given state s. The objective is to maximize the expected cumulative discounted reward $\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$ under policy π.
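For concreteness, the discounted return above can be evaluated for a finite trajectory with a short backward recursion (a pure-Python sketch; truncating the infinite sum at the trajectory length is an assumption):

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_{t=0}^{T-1} gamma^t * r_t, computed backwards via
    the recursion G_t = r_t + gamma * G_{t+1}, with G_T = 0."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, with γ = 0.5 the reward sequence [1, 1] yields 1 + 0.5 · 1 = 1.5.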

3. METHOD

In this section, we present the proposed method for LEarning Subgoal representations with SlOw dyNamics (LESSON). First, we describe a two-layered goal-conditioned HRL framework. We then introduce a core component of LESSON, the slowness objective for learning the subgoal representation of HRL. Finally, we summarize the whole learning procedure.

3.1. FRAMEWORK

Following previous work (Nachum et al., 2018; 2019), we model a policy π(a|s) as a two-level hierarchical policy composed of a high-level policy π_h(g|s) and a low-level policy π_l(a|s, g). The high-level policy π_h(g|s) selects a subgoal g in state s every c timesteps. The subgoal g lies in a low-dimensional space abstracted by the representation function φ : S → ℝ^k. The low-level policy π_l(a|s, g) takes the high-level action g as input and interacts with the environment at every timestep. Figure 1 depicts the execution process of the hierarchical policy. LESSON iteratively learns the subgoal representation function φ(s) together with the hierarchical policy. To encourage policy π_l to reach the subgoal g, we train π_l with an intrinsic reward function based on the negative Euclidean distance in the latent space, r_l(s_t, a_t, s_{t+1}, g) = −‖φ(s_{t+1}) − g‖_2. Policy π_h is trained to optimize the expected extrinsic rewards r_t^{env}. We use the off-policy algorithm SAC (Haarnoja et al., 2018) as our base RL optimizer; in fact, our framework is compatible with any standard RL algorithm.

Clearly, a proper subgoal representation φ(s) is critical not only for learning an effective low-level goal-conditioned policy but also for efficiently learning an optimal high-level policy to solve a given task. As the feature dimension k is low, φ(s) has a compression property, which is necessary
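The execution loop of the two-level policy and the latent-space intrinsic reward can be sketched as follows. The `env`, `pi_h`, and `pi_l` interfaces are hypothetical placeholders, not the paper's code:

```python
import numpy as np

def intrinsic_reward(phi, s_next, g):
    """Low-level reward r_l = -||phi(s_{t+1}) - g||_2 from the framework above."""
    return -np.linalg.norm(phi(s_next) - g)

def hierarchical_rollout(env, pi_h, pi_l, phi, c=10, horizon=100):
    """Sketch of one episode: the high level picks a latent subgoal g
    every c steps; the low level acts at every step and is rewarded for
    approaching g in the latent space, while the high level collects the
    extrinsic environment reward.
    """
    s = env.reset()
    transitions = []
    for t in range(horizon):
        if t % c == 0:
            g = pi_h(s)                    # high-level action = latent subgoal
        a = pi_l(s, g)                     # low-level action conditioned on g
        s_next, r_env, done = env.step(a)  # assumed env interface
        r_l = intrinsic_reward(phi, s_next, g)
        transitions.append((s, g, a, r_l, r_env, s_next))
        s = s_next
        if done:
            break
    return transitions
```

In an off-policy setup such as SAC, the stored transitions would feed two replay buffers: (s, g, a, r_l, s_next) for the low level and c-step aggregated tuples for the high level.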

