CURIOSITY-DRIVEN UNSUPERVISED DATA COLLECTION FOR OFFLINE REINFORCEMENT LEARNING

Abstract

In offline reinforcement learning (RL), while most efforts focus on engineering sophisticated learning algorithms given a fixed dataset, very few works aim to improve the quality of the dataset itself. More importantly, it is challenging to collect a task-agnostic dataset from which an offline RL agent can learn multiple skills. In this paper, we propose a Curiosity-driven Unsupervised Data Collection (CUDC) method to improve the data collection process. Specifically, we quantify the agent's internal belief to estimate the probability that the k-step future states are reachable from the current states. Unlike existing approaches, which implicitly assume a limited feature space with a fixed temporal distance between current and next states, CUDC adapts how many steps into the future the dynamics model should predict, so the feature representation can be diversified with dynamics information. With this adaptive reachability mechanism in place, the agent can navigate itself to collect higher-quality data with curiosity. Empirically, CUDC surpasses existing unsupervised methods in sample efficiency and learning performance on various downstream offline RL tasks from the DeepMind control suite.

1. INTRODUCTION

Deep reinforcement learning has demonstrated remarkable breakthroughs in games, robotics, and navigation in complex environments (Kiran et al., 2021; Singh et al., 2022; Sun et al., 2022a). In online RL, agents constantly update the policy to acquire different skills through active interactions with the environment. However, online RL is impractical in many real-world settings, as direct interactions with the environment can be expensive or dangerous (Kiran et al., 2021; Singh et al., 2022). In recent years, offline RL has become a promising research area for coping with limited interactions, where agents learn a policy exclusively from previously collected experiences stored in a fixed dataset (Levine et al., 2020; Kostrikov et al., 2021; Fujimoto & Gu, 2021). Given the growing popularity of offline RL, the majority of current research focuses on model-centric practices, successively developing new algorithms (Kumar et al., 2020; Janner et al., 2021; Matsushima et al., 2021; Emmons et al., 2022; Kumar et al., 2022). Despite this rapid algorithmic progress, performance is inevitably limited by the quality of the pre-collected dataset itself. Recently, data-centric approaches have become critical in the machine learning community, emphasizing the importance of improving training data quality over algorithmic advances (Ng, 2021; Motamedi et al., 2021; Patel et al., 2022). Motivated by this, the offline RL research community has begun to explore ways of engineering the training data (Prudencio et al., 2022). To focus on more useful data, one solution is to exploit sample importance through sampling (Zhang et al., 2020) or re-weighting (Wu et al., 2021). Different from this approach, we aim to collect a higher-quality dataset that can be directly used by offline RL agents.
More importantly, it is even more desirable yet challenging to collect a task-agnostic dataset such that offline RL agents can extract effective policies for multiple downstream tasks. To analyze and understand these challenges, ExORL (Yarats et al., 2022) empirically shows that unsupervised RL methods are superior to supervised ones for collecting exploratory data, allowing even vanilla off-policy RL algorithms to learn effectively offline and acquire different skills. Nevertheless, these existing methods pre-define a fixed temporal distance between current and future states to train the models, which implicitly limits the diversity of the learned feature representation. Since fixing the temporal distance between current and future states may restrict the feature space and result in a low-quality dataset, we aim to enhance the feature representation by exploiting the reachability from the current state to more distant future states. Although reachability analysis has been introduced in RL (Savinov et al., 2019; Péré et al., 2018; Ivanovic et al., 2019; Yu et al., 2022), these approaches are not directly applicable. For example, Savinov et al. (2019) only consider reachability in a binary case and require extensive comparisons to the stored embeddings in memory. Moreover, reachability in goal-space exploration requires kernel density estimation, which increases the computational cost substantially. Different from these approaches, we propose a Curiosity-driven Unsupervised Data Collection (CUDC) method with a novel reachability module. Inspired by the fact that human curiosity can enhance learning by directing attention to novel knowledge beyond one's current perception (Rossing & Long, 1981; Markey & Loewenstein, 2014; Sun et al., 2022b), CUDC enables the agent to collect a dataset curiously without any task-specific reward.
In particular, we define a reachability module that characterizes the probability of a k-step future state being reachable from the current state, with no episodic memory or feature-space density modeling required. This module allows the agent to automatically determine how many steps into the future the dynamics model should predict, so that the learned feature representation incorporates dynamics information. Compared with existing unsupervised methods, it avoids relying on a fixed feature space by gradually expanding to more distant future states. With the enhanced representation learning, a mixed intrinsic reward encourages curious exploration towards more meaningful state-action space as well as under-learned states. As a result, the collected dataset leads to improved sample efficiency and better performance in downstream offline RL tasks. Our contributions can be summarized as follows. 1) We are the first to introduce reachability for improving data collection in offline RL; our formulation is more efficient and enables the agent to navigate curiosity-driven learning coherently. 2) We empirically show that adapting the number of steps between current and future states to perform increasingly challenging predictions can enhance the feature representation with dynamics information, thereby improving the quality of the collected dataset. 3) With the learned state and action representations, CUDC additionally incentivizes the agent to explore diverse state-action space as well as under-learned states with high prediction errors, through a mixed intrinsic reward and regularization. 4) Under the ExORL (Yarats et al., 2022) setting, CUDC outperforms other unsupervised methods in collecting datasets that can be learned offline across multiple downstream tasks of the DeepMind control suite (Tassa et al., 2018).
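The adaptive reachability mechanism can be illustrated with a minimal sketch. All names, the distance-based belief, and the thresholds below are illustrative stand-ins, not the paper's actual implementation: a reachability estimator scores how likely the k-step-ahead state is reachable from the current state in embedding space, and the temporal distance k is expanded once the agent's average belief exceeds a confidence threshold, making the prediction task increasingly challenging.

```python
import numpy as np

rng = np.random.default_rng(0)

def reachability_score(phi_s, phi_future, temperature=1.0):
    """Illustrative belief that `phi_future` is reachable from `phi_s`:
    a sigmoid of the negative embedding distance, standing in for a
    learned reachability classifier."""
    dist = np.linalg.norm(phi_s - phi_future, axis=-1)
    return 1.0 / (1.0 + np.exp(dist / temperature - 1.0))

def maybe_expand_k(k, beliefs, threshold=0.6, k_max=10):
    """Expand the temporal distance k once the average belief that
    k-step futures are reachable exceeds `threshold`."""
    if np.mean(beliefs) > threshold and k < k_max:
        return k + 1
    return k

# Toy rollout: embeddings drift slowly with k, so near futures look
# reachable and k is gradually expanded.
k = 1
for step in range(5):
    phi_s = rng.normal(size=(32, 8))               # batch of current states
    phi_future = phi_s + 0.02 * k * rng.normal(size=(32, 8))  # k-step futures
    beliefs = reachability_score(phi_s, phi_future)
    k = maybe_expand_k(k, beliefs)
print("final temporal distance k:", k)
```

In this toy setting the belief stays high, so k grows each step until the agent would face futures it can no longer confidently reach; the actual module instead trains the estimator jointly with the dynamics model.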

2. RELATED WORKS

Reachability in RL. Savinov et al. (2019) constructed a reachability network to estimate how many environment steps are needed to reach a particular state. It intrinsically rewards the agent for exploring states that are unreachable, i.e., require more than a fixed threshold number of steps, from other states in memory. However, it only considers the binary case of reachability and is quite inefficient, as it compares similarity against all stored states in memory. In goal-exploration tasks, Péré et al. (2018) defined the reachability of a goal with an estimated density and proposed to sample increasingly difficult goals to reach during exploration. Although the goal space can be learned in an unsupervised manner rather than being specifically engineered, the sampling process requires a kernel density estimator, which increases the computational cost substantially. Following a similar idea, BARC (Ivanovic et al., 2019) gradually adapts the initial state distribution from easy-to-reach to challenging-to-reach goals; as a result, agents can perform well even on hard robotic control tasks. Recently, RCRL (Yu et al., 2022) has shown that leveraging reachability analysis (Hsu et al., 2021) can help learn an optimal safe policy by expanding a limited conservative feasible set to the largest feasible set of the state space.

Curiosity-Driven RL. Curiosity-driven RL intrinsically encourages agents to explore the task environment in a human-like way, which is of vital importance when task-specific rewards are sparse or absent (Aubret et al., 2019; Sun et al., 2022b).
The main type of curiosity-driven RL incorporates an intrinsic reward that self-motivates agents to explore based on various aspects of the state, such as novelty (Bellemare et al., 2016), entropy (Seo et al., 2021; Liu & Abbeel, 2021b), reachability (Savinov et al., 2019), prediction errors (Pathak et al., 2017; Berseth et al., 2020), complexity (Campero et al., 2020), and uncertainty (Pathak et al., 2019; Sekar et al., 2020; Li et al.,

