AUTOMATIC CURRICULUM FOR UNSUPERVISED REINFORCEMENT LEARNING

Abstract

Recent unsupervised reinforcement learning (URL) methods can learn meaningful skills without task rewards through carefully designed training objectives. However, most existing works lack quantitative evaluation metrics for URL and instead rely mainly on visualizations of trajectories to compare performance. Moreover, each URL method focuses on only a single training objective, which can hinder further learning progress and the development of new skills. To bridge these gaps, we first propose multiple evaluation metrics for URL that cover different preferred properties. We show that balancing these metrics captures what a "good" trajectory visualization embodies. Next, we use these metrics to develop an automatic curriculum that changes the URL objective across different learning stages in order to improve and balance all metrics. Specifically, we apply a non-stationary multi-armed bandit algorithm that selects an existing URL objective for each episode according to the metrics evaluated in previous episodes. Extensive experiments in different environments demonstrate the advantages of our method in achieving promising and balanced performance across all URL metrics.

1. INTRODUCTION

Reinforcement learning (RL) has recently achieved remarkable success in autonomous control (Kiumarsi et al., 2017) and video games (Mnih et al., 2013). Its mastery of Go (Silver et al., 2016) and large-scale multiplayer video games (Vinyals et al., 2019) has drawn growing attention. However, a primary limitation of current RL is that it is highly task-specific and easily overfits to the training task, while it remains challenging to acquire fundamental skills that generalize across tasks. Moreover, due to sparse rewards in many tasks and poor exploration of the state-action space, RL can be highly inefficient. To overcome these weaknesses, intrinsic motivations (Oudeyer & Kaplan, 2009) have been studied to help pre-train RL agents in earlier stages, even before any task is assigned. This so-called "unsupervised RL (URL)" does not rely on any extrinsic task rewards; its primary goal is to encourage exploration and develop versatile skills that can be adapted to downstream tasks.

Although URL provides additional objectives and rewards to train fundamental, task-agnostic skills, it lacks quantitative evaluation metrics and instead relies mainly on visualizations of trajectories to demonstrate its effectiveness. Although URL can be evaluated through the extrinsic rewards of downstream tasks (Laskin et al., 2021b), this requires further training and can be prone to overfitting or bias towards specific tasks. A key challenge in developing evaluation metrics for URL is covering the different expectations or preferable properties for the agent, which usually cannot all be captured by a single metric. Recently, IBOL (Kim et al., 2021) introduced the concept of disentanglement to evaluate the informativeness and separability of learned skills. However, it does not consider other characteristics such as coverage of the state space. In addition, how to balance multiple metrics in the evaluation remains an open challenge.
Therefore, it is critical to develop a set of metrics that provides a complete and precise evaluation of a URL agent. In this paper, we take a first step towards quantitative evaluation of URL by proposing a set of evaluation metrics that cover different preferred capabilities of URL, e.g., both exploration and skill discovery. In case studies, we show that a URL agent achieving balanced and high scores over all the proposed metrics fulfills our requirements for a promising pre-trained agent, whereas excelling on only one metric cannot rule out certain poorly learned URL policies.
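To make the curriculum idea from the abstract concrete, the sketch below shows one way a non-stationary multi-armed bandit could select a URL objective per episode, where the bandit's reward is an aggregate of the evaluation metrics. This is an illustrative assumption, not the paper's exact algorithm: the discounted-UCB variant, the class name, and all parameter values here are hypothetical.

```python
import math
import random

class DiscountedUCB:
    """Non-stationary bandit: each arm is one URL training objective.
    Past statistics are exponentially discounted so that recent episodes
    dominate, letting the preferred objective change over learning stages."""

    def __init__(self, n_arms, gamma=0.95, c=1.0):
        self.n_arms = n_arms
        self.gamma = gamma             # discount factor for non-stationarity
        self.c = c                     # exploration bonus scale
        self.counts = [0.0] * n_arms   # discounted pull counts
        self.values = [0.0] * n_arms   # discounted reward sums

    def select(self):
        """Pick the URL objective to optimize in the next episode."""
        total = sum(self.counts)
        if total == 0:
            return random.randrange(self.n_arms)
        best, best_score = 0, -math.inf
        for a in range(self.n_arms):
            if self.counts[a] == 0:
                return a  # try every objective at least once
            mean = self.values[a] / self.counts[a]
            bonus = self.c * math.sqrt(math.log(total) / self.counts[a])
            if mean + bonus > best_score:
                best, best_score = a, mean + bonus
        return best

    def update(self, arm, reward):
        """Credit the chosen objective with an aggregate metric score."""
        for a in range(self.n_arms):   # discount all history first
            self.counts[a] *= self.gamma
            self.values[a] *= self.gamma
        self.counts[arm] += 1.0
        self.values[arm] += reward
```

In an outer training loop, `select()` would return the objective used to compute intrinsic rewards for the next episode, and `update()` would be called with a scalar combining the evaluation metrics measured after that episode.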

