CHOREOGRAPHER: LEARNING AND ADAPTING SKILLS IN IMAGINATION

Abstract

Unsupervised skill learning aims to learn a rich repertoire of behaviors without external supervision, providing artificial agents with the ability to control and influence the environment. However, without appropriate knowledge and exploration, skills may provide control only over a restricted area of the environment, limiting their applicability. Furthermore, it is unclear how to leverage the learned skill behaviors for adapting to downstream tasks in a data-efficient manner. We present Choreographer, a model-based agent that exploits its world model to learn and adapt skills in imagination. Our method decouples the exploration and skill learning processes, and is able to discover skills in the latent state space of the model. During adaptation, the agent uses a meta-controller to evaluate and adapt the learned skills efficiently by deploying them in parallel in imagination. Choreographer is able to learn skills both from offline data and by collecting data simultaneously with an exploration policy. The skills can be used to adapt effectively to downstream tasks, as we show on the URL benchmark, where we outperform previous approaches from both pixel and state inputs. The learned skills also explore the environment thoroughly, finding sparse rewards more frequently, as shown in goal-reaching tasks from the DMC Suite and Meta-World.

1. INTRODUCTION

Deep Reinforcement Learning (RL) has yielded remarkable success in a wide variety of tasks, ranging from game playing (Mnih et al., 2013; Silver et al., 2016) to complex robot control (Smith et al., 2022; OpenAI et al., 2019). However, most of these accomplishments are specific to mastering a single task, relying on millions of interactions to learn the desired behavior. Solving a new task generally requires starting over, collecting task-specific data, and learning a new agent from scratch. Instead, natural agents, such as humans, can quickly adapt to novel situations or tasks. Since infancy, these agents are intrinsically motivated to try different movement patterns, continuously acquiring greater perceptual capabilities and sensorimotor experiences that are essential for the formation of future-directed behaviors (Corbetta, 2021). For instance, a child who understands how object relations work, e.g. has autonomously learned to stack one block on top of another, can quickly master how to create structures comprising multiple objects (Marcinowski et al., 2019).

With the same goal, unsupervised RL (URL) methods aim to leverage intrinsic motivation signals, which drive the agent's interaction with the environment, to acquire generalizable knowledge and behaviors. While some URL approaches focus on exploring the environment (Schmidhuber, 1991; Mutti et al., 2020; Bellemare et al., 2016), competence-based methods (Laskin et al., 2021) aim to learn a set of options or skills that provide the agent with the ability to control the environment (Gregor et al., 2016; Eysenbach et al., 2019), a.k.a. empowerment (Salge et al., 2014). Learning such a set of options can provide an optimal set of behaviors to quickly adapt and generalize to new tasks (Eysenbach et al., 2021). However, current methods still exhibit several limitations.
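As a representative formulation (not specific to any single method above), competence-based approaches typically maximize the mutual information between a skill code $z$ and the states $s$ visited by the skill-conditioned policy $\pi_z$, optimizing a variational lower bound with a learned skill discriminator $q_\phi(z \mid s)$:

```latex
I(S; Z) \;=\; H(Z) - H(Z \mid S)
\;\geq\; \mathbb{E}_{z \sim p(z),\; s \sim \pi_z}
\big[\, \log q_\phi(z \mid s) - \log p(z) \,\big],
```

where $p(z)$ is a fixed skill prior; the bound follows from the non-negativity of the KL divergence between the true posterior $p(z \mid s)$ and the variational approximation $q_\phi(z \mid s)$ (Gregor et al., 2016; Eysenbach et al., 2019).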
Some of these are due to the nature of the skill discovery objective (Achiam et al., 2018), which struggles to capture behaviors that are natural and meaningful for humans. Another major issue with current

