EUCLID: TOWARDS EFFICIENT UNSUPERVISED REINFORCEMENT LEARNING WITH MULTI-CHOICE DYNAMICS MODEL

Abstract

Unsupervised reinforcement learning (URL) poses a promising paradigm for learning useful behaviors in a task-agnostic environment without the guidance of extrinsic rewards, so as to facilitate fast adaptation to various downstream tasks. Previous works focus on pre-training in a model-free manner and lack the study of transition dynamics modeling, leaving large room for improving sample efficiency in downstream tasks. To this end, we propose an Efficient Unsupervised reinforCement Learning framework with multi-choIce Dynamics model (EUCLID), which introduces a novel model-fused paradigm that jointly pre-trains the dynamics model and the unsupervised exploration policy in the pre-training phase, thus better leveraging environmental samples and improving downstream sample efficiency. However, constructing a generalizable model that captures the local dynamics under different behaviors remains a challenging problem. We introduce a multi-choice dynamics model that concurrently covers the distinct local dynamics induced by different behaviors: it uses separate heads to learn state transitions under different behaviors during unsupervised pre-training, and selects the most appropriate head for prediction in the downstream task. Experimental results in the manipulation and locomotion domains demonstrate that EUCLID achieves state-of-the-art performance with high sample efficiency, essentially solving the state-based URLB benchmark and reaching a mean normalized score of 104.0±1.2% on downstream tasks within 100k fine-tuning steps, matching DDPG's performance at 2M interaction steps, i.e., with 20× more data. More visualization videos are released on our homepage.
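To make the head-selection idea concrete, the following is a minimal, self-contained sketch of a multi-head dynamics model. It is a toy illustration, not the paper's implementation: each head is a hypothetical linear predictor standing in for a neural-network head, and `select_head` picks the head with the lowest prediction error on a small batch of downstream-task transitions, mirroring the selection step described above.

```python
import numpy as np

class MultiChoiceDynamicsModel:
    """Toy multi-head dynamics model (illustrative only).

    Each head is a separate linear predictor mapping [state; action]
    to the next state; in the paper these would be learned neural
    network heads specialized to different behaviors.
    """

    def __init__(self, num_heads, state_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One weight matrix per head: (state_dim + action_dim) x state_dim.
        self.heads = [rng.normal(size=(state_dim + action_dim, state_dim))
                      for _ in range(num_heads)]

    def predict(self, head_idx, state, action):
        """Predict the next state with one chosen head."""
        x = np.concatenate([state, action])
        return x @ self.heads[head_idx]

    def select_head(self, states, actions, next_states):
        """Pick the head with the lowest mean squared prediction error
        on a batch of transitions collected in the downstream task."""
        errors = []
        for k in range(len(self.heads)):
            preds = np.stack([self.predict(k, s, a)
                              for s, a in zip(states, actions)])
            errors.append(np.mean((preds - next_states) ** 2))
        return int(np.argmin(errors))
```

In the pre-training phase each head would be trained on transitions gathered by a different exploration behavior; at fine-tuning time, `select_head` only needs a handful of task transitions to identify the best-matching local dynamics.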

1. INTRODUCTION

Reinforcement learning (RL) has shown promising capabilities in many practical scenarios (Li et al., 2022b; Ni et al., 2021; Shen et al., 2020; Zheng et al., 2019). However, RL typically requires substantial interaction data and task-specific rewards for policy learning without using any prior knowledge, resulting in low sample efficiency (Yarats et al., 2021c) and making it hard to generalize quickly to new downstream tasks (Zhang et al., 2018; Mu et al., 2022). To address this, unsupervised reinforcement learning (URL) has emerged and suggests a new paradigm: pre-training policies in an unsupervised way and reusing them as priors for fast adaptation to a specific downstream task (Li et al., 2020; Peng et al., 2022; Seo et al., 2022), shedding light on a promising way to further promote RL in solving complex real-world problems filled with various unseen tasks. Most URL approaches focus on pre-training a policy with diverse skills by exploring the environment under a designed unsupervised signal instead of a task-specific reward signal (Hansen et al., 2020; Liu & Abbeel, 2021a). However, such a pre-training procedure may not always benefit downstream policy learning. As shown in Fig. 1, we pre-train a policy for 100k, 500k, and 2M steps, respectively, in the robotic arm control benchmark Jaco, and use each as the prior for downstream policy learning to see how pre-training promotes learning. Surprisingly, longer pre-training does not always bring benefits and sometimes deteriorates downstream learning (500k vs. 2M in the

