CHOREOGRAPHER: LEARNING AND ADAPTING SKILLS IN IMAGINATION

Abstract

Unsupervised skill learning aims to learn a rich repertoire of behaviors without external supervision, providing artificial agents with the ability to control and influence the environment. However, without appropriate knowledge and exploration, skills may provide control only over a restricted area of the environment, limiting their applicability. Furthermore, it is unclear how to leverage the learned skill behaviors for adapting to downstream tasks in a data-efficient manner. We present Choreographer, a model-based agent that exploits its world model to learn and adapt skills in imagination. Our method decouples the exploration and skill learning processes, and is able to discover skills in the latent state space of the model. During adaptation, the agent uses a meta-controller to evaluate and adapt the learned skills efficiently by deploying them in parallel in imagination. Choreographer is able to learn skills both from offline data and by collecting data simultaneously with an exploration policy. The skills can be used to effectively adapt to downstream tasks, as we show in the URL benchmark, where we outperform previous approaches from both pixel and state inputs. The learned skills also explore the environment thoroughly, finding sparse rewards more frequently, as shown in goal-reaching tasks from the DMC Suite and Meta-World.

1. INTRODUCTION

Deep Reinforcement Learning (RL) has yielded remarkable success in a wide variety of tasks ranging from game playing (Mnih et al., 2013; Silver et al., 2016) to complex robot control (Smith et al., 2022; OpenAI et al., 2019). However, most of these accomplishments are specific to mastering a single task, relying on millions of interactions to learn the desired behavior. Solving a new task generally requires starting over, collecting task-specific data, and training a new agent from scratch. Instead, natural agents, such as humans, can quickly adapt to novel situations or tasks. Since their infancy, these agents are intrinsically motivated to try different movement patterns, continuously acquiring greater perceptual capabilities and sensorimotor experiences that are essential for the formation of future-directed behaviors (Corbetta, 2021). For instance, a child who understands how object relations work, e.g. has autonomously learned to stack one block on top of another, can quickly master how to create structures comprising multiple objects (Marcinowski et al., 2019). With the same goal, unsupervised RL (URL) methods aim to leverage intrinsic motivation signals, used to drive the agent's interaction with the environment, to acquire generalizable knowledge and behaviors. While some URL approaches focus on exploring the environment (Schmidhuber, 1991; Mutti et al., 2020; Bellemare et al., 2016), competence-based methods (Laskin et al., 2021) aim to learn a set of options or skills that provide the agent with the ability to control the environment (Gregor et al., 2016; Eysenbach et al., 2019), a.k.a. empowerment (Salge et al., 2014). Learning a set of options can provide an optimal set of behaviors to quickly adapt and generalize to new tasks (Eysenbach et al., 2021). However, current methods still exhibit several limitations.
Some of these are due to the nature of the skill discovery objective (Achiam et al., 2018), which struggles to capture behaviors that are natural and meaningful for humans. Another major issue is the limited exploration of competence-based methods, which tend to commit to behaviors that are easily discriminable but guarantee control over only a small area of the environment. This difficulty has been both analyzed theoretically (Campos et al., 2020) and demonstrated empirically (Laskin et al., 2021; Rajeswar et al., 2022). A final important question with competence-based methods arises when adapting the skills learned without supervision to downstream tasks: how can the skills be exploited efficiently, i.e. using the fewest environment interactions? While one could exhaustively test all the options in the environment, this can be expensive when learning a large number of skills (Eysenbach et al., 2019) or intractable for continuous skill spaces (Kim et al., 2021; Liu & Abbeel, 2021a).

In this work, we propose Choreographer, an agent able to discover, learn, and adapt unsupervised skills efficiently by leveraging a generative model of the environment dynamics, a.k.a. a world model (Ha & Schmidhuber, 2018). Thanks to the model, Choreographer discovers and learns skills in imagination, decoupling the exploration and option-discovery processes. During adaptation, Choreographer can predict the outcomes of the learned skills' actions, and thus evaluate multiple skills in parallel in imagination, allowing it to combine them efficiently for solving downstream tasks.

Contributions. Our contributions can be summarized as follows:
• We describe a general algorithm for discovering, learning, and adapting unsupervised skills that is exploration-agnostic and data-efficient (Section 3);
• We propose a code resampling technique to prevent the issue of index collapse when learning high-dimensional codes with vector quantized autoencoders (Kaiser et al., 2018), which we employ in the skill discovery process (Section 3.2);
• We show that Choreographer can learn skills both from offline data and in parallel with exploration, and from both state and pixel inputs. The skills are adaptable to multiple tasks, as shown in the URL benchmark, where we outperform all baselines (Section 4.1);
• We show that the skills learned by Choreographer are effective for exploration, discovering sparse rewards in the environment more often than other methods (Section 4.2), and we further visualize and analyze them to provide additional insights (Section 4.3).
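To make the parallel evaluation idea concrete, the sketch below rolls out several candidate skills from the same start state inside a learned model and scores each by its imagined return. All components here (`dynamics`, `reward_model`, `skill_policy`, the skill codes) are toy stand-ins for illustration only, not Choreographer's actual learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned components (hypothetical, for illustration):
def dynamics(z, a):          # imagined latent transition
    return 0.9 * z + 0.1 * a

def reward_model(z):         # task reward predicted from the latent state
    return -np.sum(z ** 2, axis=-1)

def skill_policy(z, skills):  # skill-conditioned action (pull state toward code)
    return skills - z

def imagined_returns(z0, skills, horizon=15, gamma=0.99):
    """Roll out every skill in parallel from the same start state and
    return each skill's imagined discounted return."""
    z = np.repeat(z0[None], len(skills), axis=0)  # one rollout per skill
    returns = np.zeros(len(skills))
    for t in range(horizon):
        a = skill_policy(z, skills)
        z = dynamics(z, a)
        returns += gamma ** t * reward_model(z)
    return returns

z0 = rng.normal(size=4)
skills = rng.normal(size=(8, 4))   # 8 candidate skill codes
scores = imagined_returns(z0, skills)
best = int(np.argmax(scores))      # a meta-controller would pick the best skill
```

Because every rollout happens in the model rather than the environment, evaluating eight skills here costs zero real interactions; only the selected skill would then be deployed.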

2. PRELIMINARIES AND BACKGROUND

Reinforcement Learning (RL). The RL setting can be formalized as a Markov Decision Process, where we denote observations from the environment with $x_t$, actions with $a_t$, rewards with $r_t$, and the discount factor with $\gamma$. The objective of an RL agent is to maximize the expected discounted sum of rewards over time for a given task, a.k.a. the return: $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$. We focus on continuous-action settings, where a common strategy is to learn a parameterized function that outputs the best action given a certain state, referred to as the actor $\pi_\theta(a|x)$, and a parameterized model that estimates the actor's expected return from a state, referred to as the critic $v_\psi(x)$. The models for actor-critic algorithms can be instantiated as deep neural networks to solve complex continuous control tasks (Haarnoja et al., 2018; Lillicrap et al., 2016; Schulman et al., 2017).

Unsupervised RL (URL). In our method, we distinguish three stages of the URL process (Figure 1). During stage (i), data collection, the agent can interact with the environment without rewards, driven by intrinsic motivation. During stage (ii), pre-training, the agent uses the data collected to



Figure 1: Unsupervised Reinforcement Learning. The agent should effectively leverage the unsupervised phase, consisting of the data collection and the pre-training (PT) stages, to efficiently adapt during the supervised phase, where the agent is fine-tuned (FT) for a downstream task.
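As a concrete illustration of the return defined above, a minimal sketch computing $G_t$ from a reward sequence via the standard recursion $G_t = r_{t+1} + \gamma\, G_{t+1}$ (a generic textbook computation, not specific to Choreographer):

```python
def discounted_return(rewards, gamma):
    """Return G = sum_{k=1}^{T} gamma^(k-1) * r_k, i.e. the discounted
    return of an episode, computed by iterating backwards through time."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```

Iterating backwards avoids recomputing powers of $\gamma$ and is numerically identical to summing $\gamma^{k-1} r_k$ directly.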

