LEARNING TO DISCOVER SKILLS WITH GUIDANCE

Abstract

Unsupervised skill discovery (USD) allows agents to learn diverse and discriminable skills without access to pre-defined rewards by maximizing the mutual information (MI) between skills and the states reached by each skill. The most common problem of MI-based skill discovery is insufficient exploration, because each skill is heavily penalized when it deviates from the region where it initially settled. Recent works introduce an auxiliary reward that encourages exploration by maximizing the epistemic uncertainty or entropy of visited states. However, we find that the effectiveness of these auxiliary rewards decreases as the environment becomes more challenging. Therefore, we introduce a new unsupervised skill discovery algorithm, skill discovery with guidance (DISCO-DANCE), which (1) selects the guide skill with the highest potential to reach unexplored states, (2) guides other skills to follow the guide skill, and then (3) diffuses the guided skills to maximize their discriminability in the unexplored states. Empirically, DISCO-DANCE substantially outperforms other USD baselines on challenging environments, including two navigation benchmarks and a continuous control benchmark.

1. INTRODUCTION

In recent years, Deep Reinforcement Learning (DRL) has shown great success in various complex tasks, ranging from playing video games (Mnih et al., 2015; Silver et al., 2016) to complex robotic manipulation (Andrychowicz et al., 2017; Gu et al., 2017). Despite this remarkable success, most DRL models are trained from scratch for every single task, which is significantly inefficient. In addition, the reward functions used to train these agents are generally handcrafted, which prevents DRL from scaling to various real-world tasks. For these reasons, there has been increasing interest in training task-agnostic policies without access to a pre-defined reward function (Campos et al., 2020; Eysenbach et al., 2018; Gregor et al., 2016; Hansen et al., 2019; Laskin et al., 2021; Liu & Abbeel, 2021; Sharma et al., 2019; Strouse et al., 2022; Park et al., 2022; Laskin et al., 2022; Shafiullah & Pinto, 2022). This training paradigm falls into the category of Unsupervised Skill Discovery (USD), whose goal is to acquire diverse and discriminable behaviors, known as skills. These pre-trained skills can be utilized as useful primitives or directly employed to solve various downstream tasks. Most previous studies in USD (Achiam et al., 2018; Eysenbach et al., 2018; Gregor et al., 2016; Hansen et al., 2019; Sharma et al., 2019) discover a set of diverse and discriminable skills by maximizing a self-supervised intrinsic motivation as a form of reward. Commonly, the mutual information (MI) between the skill's latent variable and the states reached by each skill is used as this self-supervised reward.
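To make the MI objective concrete, a widely used variational lower bound (as in DIAYN-style methods) rewards a state s under skill z with r(s, z) = log q(z|s) - log p(z), where q(z|s) is a learned discriminator and p(z) a uniform prior over skills. The sketch below is a minimal, framework-agnostic illustration of this reward; the discriminator logits would come from a trained network in practice, and all names here are our own, not from the paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mi_reward(disc_logits, skill, n_skills):
    """DIAYN-style MI reward: r(s, z) = log q(z|s) - log p(z).

    disc_logits: (batch, n_skills) discriminator outputs for each state,
                 i.e. an (unnormalized) approximation of q(z|s).
    skill:       (batch,) integer index of the skill z that produced each state.
    n_skills:    number of skills; p(z) is assumed uniform, so log p(z) = -log n_skills.
    """
    log_q = np.log(softmax(disc_logits))
    # Pick out log q(z|s) for the skill that actually generated each state.
    log_q_z = log_q[np.arange(len(skill)), skill]
    return log_q_z + np.log(n_skills)
```

Because log q(z|s) ≤ 0, the reward is bounded above by log n_skills, attained when the discriminator identifies the skill from the state with certainty; states that many skills visit yield low reward, which is what drives skills apart.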
However, recent research has shown that solely maximizing the sum of MI rewards is insufficient to explore the state space, because the agent asymptotically receives larger rewards for visiting known states than for exploring novel ones (Campos et al., 2020; Liu & Abbeel, 2021; Strouse et al., 2022). To ameliorate this issue, recent studies designed auxiliary exploration rewards that incentivize the agent when it succeeds in visiting novel states (Strouse et al., 2022; Lee et al., 2019; Liu & Abbeel, 2021). However, even with these auxiliary rewards, previous approaches often fail to work efficiently in complex environments. Fig. 1 conceptually illustrates how previous methods behave ineffectively in a simple environment. Suppose that the upper region in Fig. 1a is hard to reach with MI rewards alone, so that the learned skills become stuck in the lower-left region. To make these skills explore the upper region, previous methods provide an auxiliary exploration reward based on intrinsic motivation (e.g., a disagreement- or curiosity-based bonus). However, since such rewards do not indicate which direction to explore, they become increasingly inefficient in challenging environments. We detail the limitations of previous approaches in Section 2.2. In response, we design a new exploration objective that provides direct guidance toward the unexplored states. To encourage skills to explore unvisited states, we first pick a guide skill z* with the highest potential to reach the unexplored states (Fig. 1(d-1)). Next, we select the skills that will move towards the guide skill (e.g., relatively unconverged skills receiving low MI rewards). These skills are then incentivized to follow the guide skill, aiming to leap over the region with low MI reward (Fig. 1(d-2)). Finally, they are diffused to maximize their discriminability (Fig. 1(d-3)), resulting in a set of skills with high state coverage.
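The guide-and-follow steps above can be sketched as follows. This is a simplified illustration under our own assumptions, not the paper's actual algorithm: we use a generic novelty score as a proxy for "highest potential to reach unexplored states", select followers by their low average MI reward, and implement guidance as a dense distance-based bonus toward the guide skill's visited states. All function and variable names are hypothetical.

```python
import numpy as np

def select_guide_and_followers(mi_rewards, novelty_scores, follow_frac=0.5):
    """Pick one guide skill and a set of follower skills.

    mi_rewards:     (n_skills,) average MI reward per skill; low values mark
                    relatively unconverged skills.
    novelty_scores: (n_skills,) proxy for each skill's potential to reach
                    unexplored states (e.g., novelty of its visited states).
    """
    guide = int(np.argmax(novelty_scores))
    order = np.argsort(mi_rewards)  # ascending: least converged first
    n_follow = max(1, int(follow_frac * len(mi_rewards)))
    followers = [int(i) for i in order[:n_follow] if i != guide]
    return guide, followers

def guidance_reward(state, guide_states):
    """Dense bonus for a follower: negative distance to the nearest state
    visited by the guide skill, filling the low-MI pathway with reward."""
    dists = np.linalg.norm(guide_states - state, axis=-1)
    return -float(dists.min())
```

Once followers reach the guide's region, the guidance bonus would be switched off and the standard MI reward takes over again, diffusing the followers to restore their discriminability.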
We call this algorithm skill discovery with guidance (DISCO-DANCE) and present it in detail in Section 3. DISCO-DANCE can be thought of as filling the pathway to the unexplored region with a positive dense reward. In Section 4, we demonstrate that DISCO-DANCE outperforms previous approaches with auxiliary exploration rewards in terms of state-space coverage and downstream task performance in two navigation environments (2D mazes and Ant mazes), which have been commonly used to validate the performance of USD agents (Campos et al., 2020; Kamienny et al., 2021). Furthermore, we also experiment on DMC (Tunyasuvunakool et al., 2020) and show that the set of skills learned by DISCO-DANCE provides better primitives for learning general behaviors (e.g., run, jump, and flip) compared to previous baselines.

2. PRELIMINARIES

In Section 2.1, we formalize USD and explain the inherent pessimism that arises in USD. Section 2.2 describes existing exploration objectives for USD and their pitfalls.



Figure 1: Conceptual illustration of previous methods and DISCO-DANCE. Each skill is shown as a grey trajectory. The blue skill z i indicates an unconverged skill. Panels (b,c) illustrate the reward landscapes of previous methods (DISDAIN, APS, and SMM). (b) DISDAIN fails to reach the upper region due to the absence of a pathway to the unexplored states. (c) APS and SMM fail since they do not provide an exact direction to the unexplored states. In contrast, (d) DISCO-DANCE directly guides z i towards the selected guide skill z*, which has the highest potential to reach the unexplored states. A detailed explanation of the limitations of each baseline is given in Section 2.2.

