AUTOMATIC CURRICULUM FOR UNSUPERVISED REINFORCEMENT LEARNING

Abstract

Recent unsupervised reinforcement learning (URL) methods can learn meaningful skills without task rewards through carefully designed training objectives. However, most existing works lack quantitative evaluation metrics for URL and rely mainly on visualizations of trajectories to compare performance. Moreover, each URL method focuses on a single training objective, which can hinder further learning progress and the development of new skills. To bridge these gaps, we first propose multiple evaluation metrics for URL that cover different preferred properties. We show that balancing these metrics captures what a "good" trajectory visualization embodies. Next, we use these metrics to develop an automatic curriculum that changes the URL objective across different learning stages in order to improve and balance all metrics. Specifically, we apply a non-stationary multi-armed bandit algorithm to select an existing URL objective for each episode, according to the metrics evaluated in previous episodes. Extensive experiments in different environments demonstrate the advantages of our method in achieving promising and balanced performance across all URL metrics.

1. INTRODUCTION

Reinforcement learning (RL) has recently achieved remarkable success in autonomous control (Kiumarsi et al., 2017) and video games (Mnih et al., 2013). Its mastery of Go (Silver et al., 2016) and large-scale multiplayer video games (Vinyals et al., 2019) has drawn growing attention. However, a primary limitation of current RL is that it is highly task-specific and easily overfits to the training task, while it remains challenging to gain fundamental skills generalizable to different tasks. Moreover, due to the sparse rewards in many tasks and poor exploration of the state-action space, RL can be highly inefficient. To overcome these weaknesses, intrinsic motivations (Oudeyer & Kaplan, 2009) have been studied to pre-train RL agents in earlier stages, even without any assigned task. The resulting "unsupervised RL" (URL) does not rely on any extrinsic task rewards; its primary goal is to encourage exploration and develop versatile skills that can be adapted to downstream tasks. Although URL provides additional objectives and rewards to train fundamental, task-agnostic skills, it lacks quantitative evaluation metrics and mainly relies on visualizations of trajectories to demonstrate its effectiveness. Although URL can be evaluated on downstream tasks through their extrinsic rewards (Laskin et al., 2021b), this requires further training and can be prone to overfitting or bias towards specific tasks. A key challenge in developing evaluation metrics for URL is how to cover the different expectations or preferable properties of an agent, which usually cannot all be captured by a single metric. Recently, IBOL (Kim et al., 2021) introduced the concept of disentanglement to evaluate the informativeness and separability of learned skills. However, it does not consider other characteristics such as coverage of the state space. In addition, how to balance multiple metrics in the evaluation is an open challenge.
Therefore, it is critical to develop a set of metrics that provides a complete and precise evaluation of a URL agent. In this paper, we take a first step towards quantitative evaluation of URL by proposing a set of evaluation metrics that cover different preferred capabilities of URL, e.g., both exploration and skill discovery. In case studies, we show that a URL agent achieving balanced and high scores over all the proposed metrics fulfills our requirements for a promising pre-trained agent, whereas excelling on only one metric cannot rule out certain poorly learned URL policies. In contrast to the ambiguity of evaluation metrics for current URL, existing intrinsic rewards for URL are quite specific and focused, e.g., the novelty/uncertainty of states (Pathak et al., 2017; Burda et al., 2019; Pathak et al., 2019), the entropy of the state distribution (Lee et al., 2019; Mutti et al., 2021; Liu & Abbeel, 2021a), and the mutual information between states and skills (Gregor et al., 2017; Eysenbach et al., 2019b), all of which are task-free and provide dense feedback. Most URL methods each focus on learning with a single intrinsic reward; they mainly differ in implementation, e.g., how to define novelty, or how to estimate the state entropy or mutual information. However, the quality of these implementations depends significantly on the modeling of the environment dynamics, which cannot always be accurate everywhere if exploration is guided by only a single reward. For example, as shown later, an agent learning with a single intrinsic reward for exploration can be hindered from further exploration, since its novelty approximation is limited to local regions. Moreover, training with a single intrinsic reward is not enough to achieve consistent improvement on multiple evaluation metrics and balance their trade-offs. Hence, it is necessary for URL to take multiple intrinsic rewards into account.
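Two of the intrinsic-reward families above admit compact formulations. The sketch below is illustrative rather than any paper's exact implementation: a particle-based state-entropy reward scores a state by the log distance to its k-th nearest neighbor within a batch (in the spirit of APT/MEPOL), and a mutual-information skill reward scores a state by log q(z|s) - log p(z) under a learned skill discriminator (in the spirit of DIAYN); the function names and hyperparameters are our assumptions, and the discriminator is represented only by its output logits.

```python
import numpy as np

def knn_entropy_reward(states: np.ndarray, k: int = 5) -> np.ndarray:
    """Particle-based exploration reward: log distance to the k-th nearest
    neighbor of each state in a batch, a common proxy for state entropy.
    `states` has shape (n, d)."""
    # Pairwise Euclidean distances between all states in the batch.
    diffs = states[:, None, :] - states[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)          # shape (n, n)
    # Row-wise sort puts the self-distance (0) first, so index k is the
    # distance to the k-th nearest neighbor excluding the state itself.
    knn_dists = np.sort(dists, axis=1)[:, k]
    # The +1 inside the log keeps the reward non-negative.
    return np.log(1.0 + knn_dists)

def skill_mi_reward(disc_logits: np.ndarray, skill: int, n_skills: int) -> float:
    """DIAYN-style skill-discovery reward for one state: log q(z|s) - log p(z),
    with q(z|s) given by discriminator logits and p(z) uniform over skills."""
    logits = disc_logits - disc_logits.max()        # numerical stability
    log_q = logits[skill] - np.log(np.exp(logits).sum())
    log_p = -np.log(n_skills)                       # uniform skill prior
    return float(log_q - log_p)
```

As a sanity check, scaling a batch of states up (spreading them out) increases the entropy-style reward, and a discriminator that confidently recognizes the active skill yields a positive skill reward while a wrong skill yields a negative one.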
In this paper, we leverage multiple existing intrinsic rewards and aim to automatically choose the most helpful one at each learning stage in order to optimize the proposed evaluation metrics. This produces a curriculum for URL whose training objective is adjusted over the course of training to keep improving all evaluation metrics. Since the intrinsic reward varies concurrently with URL on the fly, we apply a multi-objective multi-armed bandit algorithm to address the exploration-exploitation trade-off, i.e., we intend to select the intrinsic reward (1) that has been rarely selected before (exploration) or (2) that has historically yielded the greatest and most balanced improvement over all the metrics (exploitation). Specifically, we adopt Pareto UCB (Drugan & Nowe, 2013) to optimize the multiple objectives defined by the metrics, and then extend it to capture the non-stationary dynamics of curriculum learning, i.e., the best intrinsic reward may change across learning stages. This assumption is in line with our observation that a single intrinsic reward cannot keep improving all metrics, and URL may stop exploring and end with sub-optimal skills. To the best of our knowledge, our work is among the few pioneering studies focused on developing evaluation metrics for URL. While automatic curriculum learning (ACL) has achieved success in deep RL (Portelas et al., 2020), it has not been studied for URL, even though adaptively changing intrinsic motivations is a natural human learning strategy for exploring an unknown world. In experiments, we evaluate our approach in challenging URL environments. Our method consistently achieves better and more balanced results over multiple evaluation metrics than SOTA URL methods. Moreover, we present thorough empirical analyses to demonstrate the advantages brought by the automatic curriculum and by the multiple objectives used to optimize it.
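The episode-level selection described above can be sketched as a multi-objective bandit. Below is a minimal, hedged sketch of Pareto UCB1 (Drugan & Nowe, 2013) over candidate intrinsic rewards, with an exponential discount on the statistics as one simple heuristic for non-stationarity; the class name, the discounting scheme, and all hyperparameters are our illustrative assumptions, not the paper's exact algorithm.

```python
import math
import random

class ParetoUCB:
    """Multi-objective UCB bandit in the spirit of Pareto UCB1.
    Arms = candidate intrinsic rewards; each pull returns a vector of
    improvements on the evaluation metrics. gamma < 1 discounts old
    statistics so that recent feedback dominates (gamma = 1 recovers
    the stationary variant)."""

    def __init__(self, n_arms: int, n_objectives: int, gamma: float = 0.99):
        self.k, self.d, self.gamma = n_arms, n_objectives, gamma
        self.counts = [0.0] * n_arms                      # (discounted) pull counts
        self.sums = [[0.0] * n_objectives for _ in range(n_arms)]

    def _ucb(self, i: int, total: float):
        # Per-objective mean plus the Pareto UCB1 confidence bonus.
        bonus = math.sqrt(
            max(0.0, 2.0 * math.log(total * (self.d * self.k) ** 0.25))
            / self.counts[i])
        return [s / self.counts[i] + bonus for s in self.sums[i]]

    def select(self) -> int:
        # Pull every arm once before using confidence bounds.
        for i in range(self.k):
            if self.counts[i] == 0:
                return i
        total = sum(self.counts)
        ucbs = [self._ucb(i, total) for i in range(self.k)]
        # Pareto front: arms whose UCB vector is not dominated by any other.
        front = [i for i in range(self.k)
                 if not any(all(u >= v for u, v in zip(ucbs[j], ucbs[i]))
                            and any(u > v for u, v in zip(ucbs[j], ucbs[i]))
                            for j in range(self.k) if j != i)]
        return random.choice(front)          # uniform over the Pareto front

    def update(self, arm: int, rewards) -> None:
        # Discount all statistics, then credit the pulled arm.
        for i in range(self.k):
            self.counts[i] *= self.gamma
            self.sums[i] = [s * self.gamma for s in self.sums[i]]
        self.counts[arm] += 1.0
        self.sums[arm] = [s + r for s, r in zip(self.sums[arm], rewards)]
```

In a toy run where one arm consistently improves both objectives, that arm ends up on the Pareto front most of the time and is therefore pulled far more often than the others, while the confidence bonus still forces occasional re-checks of the neglected arms.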

2. RELATED WORKS

Unsupervised Reinforcement Learning. Intrinsic rewards are used to train URL. For exploration, intrinsic motivations can be based on curiosity about and surprise at environmental dynamics (Di Domenico & Ryan, 2017), such as the Intrinsic Curiosity Module (ICM) (Pathak et al., 2017), Random Network Distillation (RND) (Burda et al., 2019), and Disagreement (Pathak et al., 2019). Another common way to explore is to maximize the state entropy. State Marginal Matching (SMM) (Lee et al., 2019) approximates the state marginal distribution; matching it to the uniform distribution is equivalent to maximizing the state entropy. Other methods approximate the state entropy with particle-based estimators, e.g., MEPOL (Mutti et al., 2020), APT (Liu & Abbeel, 2021a), ProtoRL (Yarats et al., 2021), and APS (Liu & Abbeel, 2021b). Mutual information-based approaches have been used for self-supervised skill discovery, such as VIC (Gregor et al., 2017), DIAYN (Eysenbach et al., 2019a), and VALOR (Achiam et al., 2018). VISR (Hansen et al., 2020) also optimizes the same objective, but its particular approximation brought successor features (Barreto et al., 2016) into the unsupervised skill learning paradigm and enables fast task inference. APS (Liu & Abbeel, 2021b) combines the exploration of APT with the successor features of VISR.

Automatic Curriculum Learning. Automatic curriculum learning has been widely studied. It allows models to learn in a specific order so that harder tasks are learned more efficiently (Graves et al., 2017; Bengio et al., 2009). In RL, much work considers scheduling learning tasks (Florensa et al., 2018; 2017; Fang et al., 2019; Matiisen et al., 2019; Schmidhuber, 2013). In URL, handcrafted curricula are used by EDL (Campos et al., 2020) and IBOL (Kim et al., 2021). EDL first explores, then assigns

