LATENT SKILL PLANNING FOR EXPLORATION AND TRANSFER

Abstract

To quickly solve new tasks in complex environments, intelligent agents need to build up reusable knowledge. For example, a learned world model captures knowledge about the environment that applies to new tasks. Similarly, skills capture general behaviors that can apply to new tasks. In this paper, we investigate how these two approaches can be integrated into a single reinforcement learning agent. Specifically, we leverage the idea of partial amortization for fast adaptation at test time. For this, actions are produced by a policy that is learned over time, while the skills it conditions on are chosen using online planning. We demonstrate the benefits of our design decisions across a suite of challenging locomotion tasks, showing improved sample efficiency on single tasks as well as in transfer from one task to another, compared to competitive baselines. Videos are available at: https://sites.google.com/view/latent-skill-planning/

1. INTRODUCTION

Humans can effortlessly compose skills (sequences of temporally correlated actions) and quickly adapt skills learned in one task to another. For building reusable knowledge about the environment, Model-based Reinforcement Learning (MBRL) (Wang et al., 2019) provides an intuitive framework that holds the promise of training agents that generalize to different situations and are sample efficient with respect to the number of environment interactions required for training. For temporally composing behaviors, hierarchical reinforcement learning (HRL) (Barto & Mahadevan, 2003) seeks to explicitly learn behaviors at different levels of abstraction. A simple approach for learning the environment dynamics is to learn a world model, either directly in the observation space (Chua et al., 2018; Sharma et al., 2019; Wang & Ba, 2019) or in a latent space (Hafner et al., 2018; 2019). World models summarize an agent's experience in the form of learned transition dynamics and reward models, which are used either to learn parametric policies by amortizing over the entire training experience (Hafner et al., 2019; Janner et al., 2019), or to perform online planning, as done in PlaNet (Hafner et al., 2018) and PETS (Chua et al., 2018). Amortization here refers to learning a parameterized policy whose parameters are updated using samples during the training phase, and which can then be directly queried at each state to output an action during evaluation. Fully online planning methods such as PETS (Chua et al., 2018) learn only the dynamics (and reward) model, and rely on an online search procedure such as the Cross-Entropy Method (CEM; Rubinstein, 1997) over the learned models to determine which action to execute next. Since rollouts from the learned dynamics and reward models are not executed in the actual environment during training, these learned models are sometimes also referred to as imagination models (Hafner et al., 2018; 2019).
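As a concrete illustration of the online search procedure described above, the following is a minimal NumPy sketch of CEM planning over imagined rollouts. The `dynamics` and `reward` functions stand in for the learned world model and are hypothetical placeholders, not the paper's actual architecture.

```python
import numpy as np

def cem_plan(state, dynamics, reward, horizon=12, action_dim=6,
             n_samples=500, n_elites=50, n_iters=5):
    """Plan an action sequence by iteratively refitting a Gaussian
    over action sequences to the top-scoring (elite) samples."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current distribution.
        seqs = mean + std * np.random.randn(n_samples, horizon, action_dim)
        returns = np.empty(n_samples)
        for i, seq in enumerate(seqs):
            s, ret = state, 0.0
            for a in seq:
                s = dynamics(s, a)   # imagined rollout; no env interaction
                ret += reward(s)
            returns[i] = ret
        # Refit the sampling distribution to the elite sequences.
        elites = seqs[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action (MPC-style replanning)
```

Note that the search restarts from scratch at every environment step, which is what makes fully online planning expensive in high-dimensional action spaces.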
Fully amortized methods such as Dreamer (Hafner et al., 2019) train a reactive policy with many rollouts from the imagination model, and then execute the resulting policy in the environment. The benefit of the amortized method is that it improves with experience. Amortized policies are also faster: an action is computed in one forward pass of the reactive policy, as opposed to the potentially expensive search procedure used in CEM. Additionally, the performance of the amortized method is more consistent, since CEM relies on drawing good samples from a random action distribution. On the other hand, the shortcoming of the amortized policy is poor generalization: when attempting novel tasks unseen during training, CEM will plan action sequences for the new task according to the new reward function, while a fully amortized method would be stuck with a behavior optimized for the training tasks. Since it is intractable to perform fully online random-shooting-based planning in high-dimensional action spaces (Bharadhwaj et al., 2020; Amos & Yarats, 2019), this motivates the question: can we combine online search with amortized policy learning in a meaningful way to learn useful and transferable skills for MBRL? To this end, we propose a partially amortized planning algorithm that temporally composes high-level skills through the Cross-Entropy Method (CEM) (Rubinstein, 1997), and uses these skills to condition a low-level policy that is amortized over the agent's experience. Our world model consists of a learned latent dynamics model and a learned latent reward model. We use a mutual information (MI) based intrinsic reward objective, in addition to the predicted task rewards, to train the low-level policy, while the high-level skills are planned through CEM using the learned task rewards. We term our approach Learning Skills for Planning (LSP).
The key idea of LSP is that the high-level skills abstract out the information essential for solving a task while remaining agnostic to irrelevant aspects of the environment, so that, given a new task in a similar environment, the agent can meaningfully compose the learned skills with very little fine-tuning. In addition, since the skill space is low-dimensional, we can leverage the benefits of online planning in skill space through CEM, without encountering the intractability of applying CEM directly in the higher-dimensional action space, especially over longer time horizons (Figure 1). In summary, our main contributions are: developing a partially amortized planning approach for MBRL; demonstrating that high-level skills can be temporally composed using this scheme to condition low-level policies; and experimentally demonstrating the benefit of LSP on challenging locomotion tasks that require composing different behaviors, as well as its benefit in transferring from one quadruped locomotion task to another with very little adaptation on the target task.
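The partial-amortization idea described above can be sketched as CEM search over a low-dimensional skill vector z, with an amortized low-level policy mapping (state, skill) to actions inside the imagined rollout. This is a minimal NumPy illustration under stated assumptions: `policy`, `dynamics`, and `reward` are hypothetical stand-ins for the learned networks, not the paper's implementation.

```python
import numpy as np

def plan_skill(state, policy, dynamics, reward, skill_dim=3,
               horizon=10, n_samples=200, n_elites=20, n_iters=4):
    """Partially amortized planning: CEM searches over a low-dimensional
    skill z, while the amortized policy maps (state, z) to actions."""
    mean, std = np.zeros(skill_dim), np.ones(skill_dim)
    for _ in range(n_iters):
        zs = mean + std * np.random.randn(n_samples, skill_dim)
        returns = np.empty(n_samples)
        for i, z in enumerate(zs):
            s, ret = state, 0.0
            for _ in range(horizon):
                a = policy(s, z)     # amortized low-level policy
                s = dynamics(s, a)   # imagined latent rollout
                ret += reward(s)
            returns[i] = ret
        elites = zs[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # skill that conditions the policy for the next steps
```

Because the search space is the skill vector rather than a full action sequence, the number of CEM samples needed stays modest even when the action space and horizon grow.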

2. BACKGROUND

We discuss learning latent dynamics for MBRL and mutual-information-based skill discovery, which serve as the basic theoretical tools for our approach.

2.1. LEARNING LATENT DYNAMICS AND BEHAVIORS IN IMAGINATION

Latent dynamics models are special cases of world models used in MBRL that project observations into a latent representation amenable for planning (Hafner et al., 2018; 2019). This framework is general, as it can model both partially observed environments, where sensory inputs are pixel observations, and fully observable environments, where sensory inputs are proprioceptive state features. The latent dynamics models we consider in this work consist of four key components: a representation module p_θ(s_t | s_{t-1}, a_{t-1}, o_t) and an observation module q_θ(o_t | s_t) that encode observations and actions into continuous vector-valued latent states s_t, a latent forward dynamics module q_θ(s_t | s_{t-1}, a_{t-1}) that predicts future latent states given only past states and actions, and a task reward module q_θ(r_t | s_t) that predicts the reward given the current latent state. To learn this model, the agent interacts with the environment and maximizes the following variational lower bound on the log-likelihood of observations and rewards:

max_θ  E_{p_θ} [ Σ_t ( ln q_θ(o_t | s_t) + ln q_θ(r_t | s_t) − KL( p_θ(s_t | s_{t-1}, a_{t-1}, o_t) || q_θ(s_t | s_{t-1}, a_{t-1}) ) ) ]

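The per-step training terms of such a latent dynamics model (observation and reward log-likelihoods under the posterior state, minus a KL term between posterior and prior) can be sketched as follows. This is a simplified diagonal-Gaussian illustration, not the paper's actual training code; the function names are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal Gaussians, summed over dimensions."""
    return np.sum(np.log(std_q / std_p)
                  + (std_p**2 + (mu_p - mu_q)**2) / (2 * std_q**2) - 0.5)

def gaussian_logp(x, mu, std):
    """Diagonal-Gaussian log-likelihood, summed over dimensions."""
    return np.sum(-0.5 * np.log(2 * np.pi * std**2)
                  - (x - mu)**2 / (2 * std**2))

def step_objective(o, r, post, prior, obs_pred, rew_pred):
    """Single-step bound: reconstruction and reward log-likelihoods
    minus KL(posterior || prior). Each distribution is a (mean, std) pair."""
    recon = gaussian_logp(o, *obs_pred)   # ln q(o_t | s_t)
    rew = gaussian_logp(r, *rew_pred)     # ln q(r_t | s_t)
    kl = gaussian_kl(*post, *prior)       # KL(p(s_t|..., o_t) || q(s_t|...))
    return recon + rew - kl
```

Summing this term over time steps and maximizing it with respect to the model parameters recovers the objective above; the KL term keeps the forward-dynamics prior close to the observation-informed posterior so that imagined rollouts remain accurate.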
Figure 1: Visual illustration of the 2D root position of the quadruped trained with LSP on an environment with random obstacles and transferred to an environment with obstacles aligned in a line. The objective is to reach the goal location in red.

