EUCLID: TOWARDS EFFICIENT UNSUPERVISED REINFORCEMENT LEARNING WITH MULTI-CHOICE DYNAMICS MODEL

Abstract

Unsupervised reinforcement learning (URL) is a promising paradigm for learning useful behaviors in a task-agnostic environment without the guidance of extrinsic rewards, so as to facilitate fast adaptation to various downstream tasks. Previous works focus on pre-training in a model-free manner and lack a study of transition dynamics modeling, leaving large room for improving sample efficiency in downstream tasks. To this end, we propose an Efficient Unsupervised reinforCement Learning framework with multi-choIce Dynamics model (EUCLID), which introduces a novel model-fused paradigm that jointly pre-trains the dynamics model and the unsupervised exploration policy in the pre-training phase, thus better leveraging environmental samples and improving downstream sample efficiency. However, constructing a generalizable model that captures the local dynamics arising under different behaviors remains challenging. We introduce the multi-choice dynamics model, which covers the local dynamics of different behaviors concurrently: different heads learn the state transitions induced by different behaviors during unsupervised pre-training, and the most appropriate head is selected for prediction in the downstream task. Experimental results in the manipulation and locomotion domains demonstrate that EUCLID achieves state-of-the-art performance with high sample efficiency, essentially solving the state-based URLB benchmark and reaching a mean normalized score of 104.0±1.2% in downstream tasks with 100k fine-tuning steps, which matches DDPG's performance at 2M interaction steps, i.e., with 20× more data. More visualization videos are released on our homepage.

1. INTRODUCTION

Reinforcement learning (RL) has shown promising capabilities in many practical scenarios (Li et al., 2022b; Ni et al., 2021; Shen et al., 2020; Zheng et al., 2019). However, RL typically requires substantial interaction data and task-specific rewards for policy learning without using any prior knowledge, resulting in low sample efficiency (Yarats et al., 2021c) and making it hard to generalize quickly to new downstream tasks (Zhang et al., 2018; Mu et al., 2022). To address this, unsupervised reinforcement learning (URL) has emerged with a new paradigm: pre-train policies in an unsupervised way and reuse them as priors for fast adaptation to specific downstream tasks (Li et al., 2020; Peng et al., 2022; Seo et al., 2022), offering a promising way to further promote RL in solving complex real-world problems filled with various unseen tasks. Most URL approaches focus on pre-training a policy with diverse skills by exploring the environment guided by a designed unsupervised signal instead of a task-specific reward signal (Hansen et al., 2020; Liu & Abbeel, 2021a). However, such a pre-training procedure may not always benefit downstream policy learning. As shown in Fig. 1, we pre-train a policy for 100k, 500k, and 2M steps in the robotic arm control benchmark Jaco, respectively, and use each as the prior for downstream policy learning to see how pre-training promotes that learning. Surprisingly, longer pre-training does not always bring benefits and sometimes deteriorates downstream learning (500k vs. 2M in the orange line). We visualize the three pre-trained policies on the left of Fig. 1 and find that they learn different skills (i.e., each covers a different part of the state space). Evidently, only one policy (pre-trained for 500k steps) is beneficial for downstream learning, as it happens to focus on the area where the red brick lies.
This finding reveals that downstream policy learning can heavily depend on the pre-trained policy, and exposes a potential limitation of existing URL approaches: pre-training a policy via diverse exploration alone is not enough to guarantee improved downstream learning. Specifically, most mainstream URL approaches pre-train the policy in a model-free manner (Pathak et al., 2019; 2017; Campos et al., 2020), meaning that skills discovered later in pre-training more or less suppress the earlier ones (akin to catastrophic forgetting). This can result in an unpredictable skill that is most likely not the one required for solving the downstream task (2M vs. 500k). We refer to this as the mismatch issue, which can make pre-training even less effective than a randomly initialized policy for downstream learning. Similarly, Laskin et al. (2021) also found that simply increasing pre-training steps sometimes brings no monotonic improvement but rather oscillation in performance.

To alleviate the above issue, we propose the Efficient Unsupervised reinforCement Learning framework with multi-choIce Dynamics model (EUCLID), introducing the model-based RL paradigm to achieve rapid downstream task adaptation and higher sample efficiency. First, in the pre-training phase, EUCLID pre-trains an environment dynamics model, which barely suffers from the mismatch issue since the upstream and downstream tasks usually share the same environment dynamics. Notably, pre-training the dynamics model is also orthogonal to pre-training the policy, so EUCLID pre-trains them together and achieves the best performance (see Fig. 1). In practice, EUCLID incurs almost no additional sampling burden, as the transitions collected during policy pre-training can also be used for dynamics model pre-training. On the other hand, in the fine-tuning phase, EUCLID leverages the pre-trained dynamics model for planning, guided by the pre-trained policy.
Such a combination can eliminate the negative impact caused by the mismatch issue and achieve fast adaptation. More importantly, EUCLID can monotonically benefit from a more accurate dynamics model obtained through longer pre-training. Another practical challenge is that, due to limited model capacity, a single pre-trained dynamics model can hardly capture all of the environment dynamics accurately. The inaccuracy is further exacerbated in complex environments with huge state spaces, which deteriorates downstream learning performance. Inspired by multi-choice learning, EUCLID proposes a multi-headed dynamics model in which each head is pre-trained with separate transition data. Each head focuses on a different region of the environment, and the heads are combined to predict the entire environment dynamics accurately. As such, in the fine-tuning phase, EUCLID can select the most appropriate head (the one whose dynamics are most similar to the downstream task) to achieve fast adaptation.

Our contributions are four-fold: (1) we extend the mainstream URL paradigm by innovatively introducing a dynamics model in the pre-training phase, so that model-based planning can be leveraged in the fine-tuning phase to alleviate the mismatch issue and further boost downstream policy learning; (2) we propose a multi-headed dynamics model to achieve fine-grained and more accurate prediction, which promotes effective model planning in solving downstream tasks; (3) we empirically study the performance of EUCLID by comparing different mainstream URL mechanisms and designs, and comprehensively analyze how each part of EUCLID affects the ultimate performance; (4) we conduct extensive comparisons on diverse continuous control tasks, and the results demonstrate the significant superiority of EUCLID in performance and sample efficiency, especially in challenging environments.
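To make the multi-choice idea concrete, the following is a minimal, hypothetical sketch, not our actual implementation (which uses neural networks over full state-action spaces): each head is reduced to a tiny one-dimensional linear model fit to transitions from one behavior, and at fine-tuning time the head with the lowest prediction error on a few downstream transitions is selected. All names (`MultiChoiceDynamics`, `fit_head`, `select_head`) are illustrative assumptions.

```python
class MultiChoiceDynamics:
    """Toy multi-headed dynamics model: one linear head per behavior."""

    def __init__(self, num_heads):
        # Each head models s' = a * s + b; start with identity dynamics.
        self.heads = [{"a": 1.0, "b": 0.0} for _ in range(num_heads)]

    def fit_head(self, k, transitions):
        # Least-squares fit of s' = a * s + b on (s, s') pairs
        # collected under one exploration behavior.
        n = len(transitions)
        sx = sum(s for s, _ in transitions)
        sy = sum(sp for _, sp in transitions)
        sxx = sum(s * s for s, _ in transitions)
        sxy = sum(s * sp for s, sp in transitions)
        a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        b = (sy - a * sx) / n
        self.heads[k] = {"a": a, "b": b}

    def predict(self, k, s):
        h = self.heads[k]
        return h["a"] * s + h["b"]

    def select_head(self, transitions):
        # Fine-tuning phase: pick the head with the lowest squared
        # prediction error on a handful of downstream transitions.
        def error(k):
            return sum((self.predict(k, s) - sp) ** 2 for s, sp in transitions)

        return min(range(len(self.heads)), key=error)


# Pre-train two heads on transitions from two different behaviors,
# then select a head using downstream data that follows s' = s + 1.
model = MultiChoiceDynamics(num_heads=2)
model.fit_head(0, [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)])  # s' = 2s
model.fit_head(1, [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)])  # s' = s + 1
best = model.select_head([(3.0, 4.0), (4.0, 5.0)])       # → 1
```

The selection rule here (lowest one-step prediction error on downstream transitions) is one simple choice; what matters is that each head specializes during pre-training and only the best-matching head is used for downstream planning.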
Our approach essentially solves the state-based URLB, achieving state-of-the-art performance with a normalized score of 104.0±1.2%, outperforming the prior leading method by 1.35× and matching DDPG trained with 20× more data.



Figure 1: A motivating example.

