THE CHALLENGES OF EXPLORATION FOR OFFLINE REINFORCEMENT LEARNING

Anonymous

Abstract

Offline Reinforcement Learning (ORL) enables us to separately study the two interlinked processes of reinforcement learning: collecting informative experience and inferring optimal behaviour. The second step has been widely studied in the offline setting, but just as critical to data-efficient RL is the collection of informative data. The task-agnostic setting for data collection, where the task is not known a priori, is of particular interest due to the possibility of collecting a single dataset and using it to solve several downstream tasks as they arise. We investigate this setting via curiosity-based intrinsic motivation, a family of exploration methods that encourage the agent to explore those states or transitions it has not yet learned to model. With Explore2Offline, we propose to evaluate the quality of collected data by transferring it to downstream tasks and inferring policies with reward relabelling and standard offline RL algorithms. We evaluate a wide variety of data collection strategies, including a new exploration agent, Intrinsic Model Predictive Control (IMPC), using this scheme and demonstrate their performance on various tasks. We use this decoupled framework to strengthen intuitions about exploration and the data prerequisites for effective offline RL.

1. INTRODUCTION

The field of offline reinforcement learning (ORL) is growing quickly, motivated by its promise to use previously-collected datasets to produce new high-quality policies. It enables the disentangling of the collection and inference processes underlying effective RL (Riedmiller et al., 2021). To date, the majority of research in the offline RL setting has focused on the inference side, the extraction of a performant policy given a dataset, but just as crucial is the development of the dataset itself. While the challenges of the inference step are increasingly well investigated (Levine et al., 2020; Agarwal et al., 2020), we instead investigate the collection step. For evaluation, we investigate correlations between the properties of collected data and final performance, how much data is necessary, and the impact of different collection strategies.

Whereas most existing benchmarks for ORL (Fu et al., 2020; Gulcehre et al., 2020) focus on the single-task setting with the task known a priori, we evaluate the potential of task-agnostic exploration methods to collect datasets for previously-unknown tasks. Task-agnostic data is an exciting avenue to pursue because it can illuminate potential tasks of interest in a state space via unsupervised learning. In this setting, we transfer information from the unsupervised pretraining phase not via the policy (Yarats et al., 2021) but via the collected data. Historically, the question of how to act, and therefore collect data, in RL has been studied through the exploration-exploitation trade-off, which amounts to a balance between an agent's goals of solving a task immediately versus collecting data to perform better in the future. Task-agnostic exploration expands this well-studied direction towards how to explore in the absence of knowledge about current or future agent goals (Dasagi et al., 2019).
In this work, we particularly focus on intrinsic motivation (Oudeyer & Kaplan, 2009), which explores novel states based on rewards derived from the agent's internal information. These intrinsic rewards can take many forms, such as curiosity-based methods that learn a world model (Burda et al., 2018b; Pathak et al., 2017; Shyam et al., 2019), data-based methods that optimize statistical properties of the agent's experience (Yarats et al., 2021), or competence-based metrics that extract skills (Eysenbach et al., 2018). In particular, we perform a wide study of data collected via curiosity-based exploration methods, similar to ExORL (Yarats et al., 2022). In addition, we introduce a novel method for effectively combining curiosity-based rewards with model predictive control.
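To make the curiosity-based family concrete, the sketch below shows the common pattern these methods share: the agent fits a forward model of the environment and uses its prediction error on each transition as the intrinsic reward, so frequently-visited transitions become "boring" as the model learns them. This is a minimal illustration of the principle, not the method used in this paper; the linear model, the `CuriosityReward` class, and its learning rate are all hypothetical choices for the example.

```python
import numpy as np

class CuriosityReward:
    """Toy curiosity signal: squared prediction error of a linear forward model.

    The model predicts s' from (s, a); the intrinsic reward for a transition is
    the model's current error on it, and the model takes one SGD step per call,
    so repeated transitions yield shrinking rewards. (Hypothetical example.)
    """

    def __init__(self, obs_dim: int, act_dim: int, lr: float = 1e-2):
        self.W = np.zeros((obs_dim, obs_dim + act_dim))  # linear forward model
        self.lr = lr

    def reward_and_update(self, s, a, s_next) -> float:
        x = np.concatenate([s, a])
        err = s_next - self.W @ x          # prediction error on this transition
        reward = float(np.sum(err ** 2))   # curiosity = squared prediction error
        self.W += self.lr * np.outer(err, x)  # one gradient step on the model
        return reward

# Revisiting the same transition drives its intrinsic reward toward zero,
# which is what pushes a curiosity-driven agent toward unmodelled states.
cr = CuriosityReward(obs_dim=2, act_dim=1)
s, a, s_next = np.array([1.0, 0.0]), np.array([0.5]), np.array([0.9, 0.1])
first = cr.reward_and_update(s, a, s_next)
for _ in range(200):
    last = cr.reward_and_update(s, a, s_next)
print(first, last)  # the reward shrinks as the model learns the transition
```

Deep-learning variants such as ICM (Pathak et al., 2017) replace the linear model with a neural network and predict in a learned feature space, but the reward structure is the same.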

