CURIOSITY-DRIVEN UNSUPERVISED DATA COLLECTION FOR OFFLINE REINFORCEMENT LEARNING

Abstract

In offline reinforcement learning (RL), while most efforts focus on engineering sophisticated learning algorithms for a fixed dataset, very few works address improving the quality of the dataset itself. It is even more challenging to collect a task-agnostic dataset from which an offline RL agent can learn multiple skills. In this paper, we propose a Curiosity-driven Unsupervised Data Collection (CUDC) method to improve the data collection process. Specifically, we quantify the agent's internal belief to estimate the probability that the k-step future states are reachable from the current states. Unlike existing approaches, which implicitly assume a limited feature space with a fixed temporal distance between current and future states, CUDC adapts the number of steps into the future that the dynamics model predicts. The feature representation can thus be diversified with dynamics information. With this adaptive reachability mechanism in place, the agent can guide itself with curiosity to collect higher-quality data. Empirically, CUDC surpasses existing unsupervised methods in sample efficiency and learning performance on various downstream offline RL tasks from the DeepMind control suite.

1. INTRODUCTION

Deep reinforcement learning has demonstrated remarkable breakthroughs in games, robotics, and navigation in complex environments (Kiran et al., 2021; Singh et al., 2022; Sun et al., 2022a). In online RL, agents constantly update the policy to acquire different skills through active interactions with the environments. However, online RL is impractical in many real-world settings, where direct interaction with the environment can be expensive or dangerous (Kiran et al., 2021; Singh et al., 2022). In recent years, offline RL has become a promising research area for coping with limited interactions: agents learn a policy exclusively from previously collected experiences stored in a fixed dataset (Levine et al., 2020; Kostrikov et al., 2021; Fujimoto & Gu, 2021).

In view of the growing popularity of offline RL, the majority of current research focuses on model-centric practices, successively developing new algorithms (Kumar et al., 2020; Janner et al., 2021; Matsushima et al., 2021; Emmons et al., 2022; Kumar et al., 2022). Despite the rapid progress in these algorithmic advances, their performance is inevitably limited by the quality of the pre-collected dataset itself. Recently, the data-centric perspective has gained traction in the machine learning community, emphasizing the importance of improving training data quality over algorithmic advances (Ng, 2021; Motamedi et al., 2021; Patel et al., 2022). Motivated by this, the offline RL research community has begun to explore ways of engineering the training data (Prudencio et al., 2022). To focus on more useful data, one solution is to exploit sample importance via sampling (Zhang et al., 2020) or re-weighting (Wu et al., 2021). In contrast to this approach, we aim to collect a higher-quality dataset that can be directly used by offline RL agents.
More importantly, it is even more desirable, yet challenging, to collect a task-agnostic dataset from which offline RL agents can extract effective policies for multiple downstream tasks. To analyze and understand these challenges, ExORL (Yarats et al., 2022) empirically shows that unsupervised RL methods are superior to supervised methods at collecting exploratory data, allowing even vanilla off-policy RL algorithms to learn effectively offline and acquire different skills. Nevertheless, these existing methods pre-define a fixed temporal distance between current and future states for training the models, which implicitly limits the diversity of the learned feature representation.
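To make the adaptive-reachability idea concrete, the toy sketch below illustrates one way such a mechanism could work. All names and the update rule are illustrative assumptions, not the paper's exact formulation: in place of a learned classifier scoring whether the k-step future state is reachable from the current state, we use a simple feature-distance proxy, derive a curiosity reward from low estimated reachability, and grow k whenever the agent's average belief says the current horizon has become too easy to predict.

```python
import numpy as np


class AdaptiveReachability:
    """Toy sketch of curiosity from adaptive k-step reachability.

    Hypothetical stand-in for CUDC's estimator: a real implementation
    would train a classifier on (s_t, s_{t+k}) pairs; here reachability
    is proxied by feature distance.
    """

    def __init__(self, k_init=1, k_max=10, grow_threshold=0.8):
        self.k = k_init                      # current temporal distance
        self.k_max = k_max                   # cap on the horizon
        self.grow_threshold = grow_threshold # belief level that triggers growth

    def reachability(self, feat_t, feat_tk):
        # Proxy probability: closer feature pairs look more "reachable".
        dist = np.linalg.norm(feat_t - feat_tk, axis=-1)
        return np.exp(-dist)

    def intrinsic_reward(self, feat_t, feat_tk):
        # Curiosity: states judged hard to reach yield high reward.
        return 1.0 - self.reachability(feat_t, feat_tk)

    def adapt(self, batch_feat_t, batch_feat_tk):
        # If the k-step future is, on average, easily reachable, look
        # further ahead so the learned features keep diversifying.
        belief = float(np.mean(self.reachability(batch_feat_t, batch_feat_tk)))
        if belief > self.grow_threshold and self.k < self.k_max:
            self.k += 1
        return self.k
```

Under this sketch, identical current and future features drive the belief to 1, so the estimator stretches its prediction horizon, while distant feature pairs produce a large curiosity bonus that steers data collection toward them.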

