LEARNING TO DISCOVER SKILLS WITH GUIDANCE

Abstract

Unsupervised skill discovery (USD) allows agents to learn diverse and discriminable skills without access to pre-defined rewards, by maximizing the mutual information (MI) between skills and the states reached by each skill. The most common problem of MI-based skill discovery is insufficient exploration, because each skill is heavily penalized when it deviates from the states it has already settled in. Recent works introduced auxiliary rewards that encourage exploration by maximizing the epistemic uncertainty or entropy of visited states. However, we find that the benefit of these auxiliary rewards diminishes as the environment becomes more challenging. We therefore introduce a new unsupervised skill discovery algorithm, skill discovery with guidance (DISCO-DANCE), which (1) selects a guide skill that has the highest potential to reach unexplored states, (2) guides the other skills to follow the guide skill, and then (3) diffuses the guided skills to maximize their discriminability in the unexplored states. Empirically, DISCO-DANCE substantially outperforms other USD baselines on challenging environments, including two navigation benchmarks and a continuous-control benchmark.

1. INTRODUCTION

In recent years, Deep Reinforcement Learning (DRL) has shown great success in various complex tasks, ranging from playing video games (Mnih et al., 2015; Silver et al., 2016) to complex robotic manipulation (Andrychowicz et al., 2017; Gu et al., 2017). Despite this remarkable success, most DRL models are trained from scratch for every single task, which results in significant inefficiency. In addition, the reward functions adopted for training the agents are generally handcrafted, acting as an impediment that prevents DRL from scaling to various real-world tasks. For these reasons, there has been increasing interest in training task-agnostic policies without access to a pre-defined reward function (Campos et al., 2020; Eysenbach et al., 2018; Gregor et al., 2016; Hansen et al., 2019; Laskin et al., 2021; Liu & Abbeel, 2021; Sharma et al., 2019; Strouse et al., 2022; Park et al., 2022; Laskin et al., 2022; Shafiullah & Pinto, 2022). This training paradigm falls into the category of Unsupervised Skill Discovery (USD), whose goal is to acquire diverse and discriminable behaviors, known as skills. These pre-trained skills can be utilized as useful primitives or directly employed to solve various downstream tasks. Most previous studies in USD (Achiam et al., 2018; Eysenbach et al., 2018; Gregor et al., 2016; Hansen et al., 2019; Sharma et al., 2019) discover a set of diverse and discriminable skills by maximizing a self-supervised, intrinsic motivation as a form of reward. Commonly, the mutual information (MI) between the skills' latent variables and the states reached by each skill is used as this self-supervised reward.
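As a concrete illustration, the MI objective is typically optimized via a variational lower bound: a learned discriminator q(z|s) tries to infer the skill z from the visited state s, and each step is rewarded with log q(z|s) - log p(z), as in DIAYN-style methods. The sketch below is a minimal, illustrative version assuming discrete skills with a uniform prior and a discriminator that outputs logits; the function name and interface are our own, not from the paper.

```python
import numpy as np

def mi_skill_reward(disc_logits, skill_id, num_skills):
    """Variational MI intrinsic reward: log q(z|s) - log p(z).

    disc_logits: unnormalized discriminator outputs for the current state,
                 one entry per skill (assumed shape: (num_skills,)).
    skill_id:    index of the skill that produced this state.
    Assumes a uniform skill prior p(z) = 1 / num_skills.
    """
    # Log-softmax over the discriminator logits gives log q(z|s).
    log_q = disc_logits - np.log(np.sum(np.exp(disc_logits)))
    # Reward is positive when the state identifies its skill better than chance.
    return log_q[skill_id] - np.log(1.0 / num_skills)
```

Under this reward, a skill earns the most by visiting states the discriminator already attributes to it, which is exactly the conservatism the next paragraph describes: deviating toward states claimed by other skills (or by none) lowers log q(z|s).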
However, recent research has shown that solely maximizing the sum of MI rewards is insufficient to explore the state space, because the agent asymptotically receives larger rewards for visiting known states than for exploring novel ones (Campos et al., 2020; Liu & Abbeel, 2021; Strouse et al., 2022). To ameliorate this issue, recent studies designed auxiliary exploration rewards that incentivize the agent when it succeeds in visiting novel states (Strouse et al., 2022; Lee et al., 2019; Liu & Abbeel, 2021). However, even with these auxiliary rewards, previous approaches often fail to work efficiently in complex environments. Fig. 1 conceptually illustrates how previous methods become ineffective even in a simple environment. Suppose that the upper region in Fig. 1a is hard to reach
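A common form of such an auxiliary reward is a particle-based state-entropy bonus (as in APT-style methods): a state is rewarded in proportion to its distance to its k-th nearest neighbor among previously visited states, so novel states earn more. The snippet below is a simplified sketch under that assumption; the brute-force neighbor search and the function name are illustrative choices, not any baseline's exact implementation.

```python
import numpy as np

def knn_entropy_reward(state, visited_states, k=3):
    """Particle-based exploration bonus for one state.

    state:          current state vector, shape (d,).
    visited_states: array of previously visited states, shape (n, d),
                    assumed not to contain `state` itself.
    Returns log(1 + distance to the k-th nearest visited state),
    a proxy for the state's contribution to visitation entropy.
    """
    dists = np.linalg.norm(visited_states - state, axis=1)
    k = min(k, len(dists))  # guard against small buffers
    kth_dist = np.sort(dists)[k - 1]
    return np.log(1.0 + kth_dist)
```

Because the bonus depends only on distance to past states, it rewards any novelty indiscriminately; it gives no signal about *which* skill is best positioned to push into a hard-to-reach region, which is the gap the guidance mechanism described above targets.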

