THE CHALLENGES OF EXPLORATION FOR OFFLINE REINFORCEMENT LEARNING Anonymous

Abstract

Offline Reinforcement Learning (ORL) enables us to separately study the two interlinked processes of reinforcement learning: collecting informative experience and inferring optimal behaviour. The second step has been widely studied in the offline setting, but just as critical to data-efficient RL is the collection of informative data. The task-agnostic setting for data collection, where the task is not known a priori, is of particular interest due to the possibility of collecting a single dataset and using it to solve several downstream tasks as they arise. We investigate this setting via curiosity-based intrinsic motivation, a family of exploration methods that encourage the agent to explore those states or transitions it has not yet learned to model. With Explore2Offline, we propose to evaluate the quality of collected data by transferring it to downstream tasks and inferring policies with reward relabelling and standard offline RL algorithms. Using this scheme, we evaluate a wide variety of data collection strategies, including a new exploration agent, Intrinsic Model Predictive Control (IMPC), and demonstrate their performance on various tasks. We use this decoupled framework to strengthen intuitions about exploration and the data prerequisites for effective offline RL.

1. INTRODUCTION

The field of offline reinforcement learning (ORL) is growing quickly, motivated by its promise to use previously-collected datasets to produce new high-quality policies. It enables the disentangling of the collection and inference processes underlying effective RL (Riedmiller et al., 2021). To date, the majority of research in the offline RL setting has focused on the inference side, the extraction of a performant policy given a dataset, but just as crucial is the development of the dataset itself. While the challenges of the inference step are increasingly well investigated (Levine et al., 2020; Agarwal et al., 2020), we instead investigate the collection step. For evaluation, we study the correlations between the properties of collected data and final performance, how much data is necessary, and the impact of different collection strategies.

Whereas most existing benchmarks for ORL (Fu et al., 2020; Gulcehre et al., 2020) focus on the single-task setting with the task known a priori, we evaluate the potential of task-agnostic exploration methods to collect datasets for previously-unknown tasks. Task-agnostic data is an exciting avenue to pursue, as it can illuminate potential tasks of interest in an environment via unsupervised learning. In this setting, we transfer information from the unsupervised pretraining phase not via the policy (Yarats et al., 2021) but via the collected data. Historically, the question of how to act, and therefore collect data, in RL has been studied through the exploration-exploitation trade-off, which amounts to balancing an agent's goals of solving a task immediately versus collecting data to perform better in the future. Task-agnostic exploration expands this well-studied direction towards how to explore in the absence of knowledge about current or future agent goals (Dasagi et al., 2019).
In this work, we particularly focus on intrinsic motivation (Oudeyer & Kaplan, 2009), which explores novel states based on rewards derived from the agent's internal information. These intrinsic rewards can take many forms: curiosity-based methods that learn a world model (Burda et al., 2018b; Pathak et al., 2017; Shyam et al., 2019), data-based methods that optimize statistical properties of the agent's experience (Yarats et al., 2021), or competence-based methods that extract skills (Eysenbach et al., 2018). In particular, we perform a wide study of data collected via curiosity-based exploration methods, similar to ExORL (Yarats et al., 2022). In addition, we introduce a novel method for effectively combining curiosity-based rewards with model predictive control.
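To make prediction-error curiosity concrete, the following is a minimal sketch of our own (not the implementation used in any of the cited works): a linear forward model stands in for the learned neural dynamics model, and the intrinsic reward is its squared prediction error on an observed transition. The function names and the linear-model choice are illustrative assumptions.

```python
import numpy as np

def fit_forward_model(states, actions, next_states, reg=1e-3):
    """Fit a linear forward model s' ~ [s; a] @ W by ridge regression.
    A stand-in for the learned neural dynamics models in the text."""
    X = np.hstack([states, actions])                       # (N, ds + da)
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]),
                        X.T @ next_states)                 # (ds + da, ds)
    return W

def curiosity_reward(W, state, action, next_state):
    """Intrinsic reward = squared prediction error of the learned model.
    Well-modelled transitions earn near-zero reward; novel ones earn more."""
    pred = np.concatenate([state, action]) @ W
    return float(np.sum((next_state - pred) ** 2))
```

Transitions the model already predicts well receive almost no reward, so the agent is pushed towards states where its model is still wrong, which is the core mechanism shared by the curiosity methods discussed above.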

Figure 1: The Explore2Offline framework for evaluating data-efficient intrinsic agents. First, the agent acts in the environment task-agnostically to search for novel states. After a set lifetime, the agent experience stored in a replay buffer is labelled with the rewards of a task of interest. This replay buffer is used to train an RL policy with the offline reinforcement learning algorithm Critic Regularized Regression, in order to finally evaluate the quality of exploration in the environment.

In Explore2Offline, we use offline RL as a mechanism for evaluating the exploration performance of these curiosity-based models, which separates the fundamental feedback loop key to RL in order to disentangle questions of collection and inference (Riedmiller et al., 2021), as displayed in Fig. 1. With this methodology, our paper makes a series of contributions to understanding the properties and applications of data collected by curiosity-based agents.

Contribution 1: We propose Explore2Offline, combining offline RL and reward relabelling to transfer the information gained from task-agnostic exploration to downstream tasks. Our results showcase how experience from intrinsic exploration can solve many tasks, partially reaching performance similar to state-of-the-art online RL data collection.

Contribution 2: We propose Intrinsic Model Predictive Control (IMPC), which combines a learned dynamics model with a curiosity approach to enable online planning for exploration, minimizing the potential for stale intrinsic rewards. A large sweep over existing and new methods shows where task-agnostic exploration succeeds and where it fails.

Contribution 3: By investigating multi-task downstream learning, we highlight a further strength of task-agnostic data collection, where each datapoint can be assigned multiple rewards in hindsight.
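The relabelling step in Fig. 1 can be sketched as below. This is a toy illustration of ours, not the paper's code; the `reach_origin` and `move_right` tasks are hypothetical reward functions, and the downstream offline RL algorithm (e.g. CRR) is not shown. It highlights the multi-task strength noted above: one task-agnostic buffer yields one labelled dataset per reward function.

```python
import numpy as np

def relabel(transitions, reward_fns):
    """Assign each task's reward to every task-agnostic transition in
    hindsight, producing one labelled dataset per downstream task."""
    datasets = {name: [] for name in reward_fns}
    for (s, a, s_next) in transitions:
        for name, reward_fn in reward_fns.items():
            r = reward_fn(s, a, s_next)
            datasets[name].append((s, a, r, s_next))
    return datasets

# Hypothetical downstream tasks, defined purely by reward functions.
tasks = {
    "reach_origin": lambda s, a, s2: -float(np.linalg.norm(s2)),
    "move_right":   lambda s, a, s2: float(s2[0] - s[0]),
}
```

Because rewards are assigned in hindsight, the (expensive) exploration phase is paid for once, while new tasks only require a pass over the stored transitions.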
We build on recent advancements in intrinsic curiosity with the Intrinsic Model Predictive Control agent, which has two new properties: online planning over which states to explore, and the use of a reward model separate from the dynamics model used for control.
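A minimal random-shooting sketch of such an agent follows. This is our own illustration under stated assumptions (the paper's planner may differ in its optimizer and model classes): candidate action sequences are rolled out through the learned dynamics model, scored by a separate intrinsic reward model, and the first action of the best-scoring sequence is executed.

```python
import numpy as np

def impc_action(state, dynamics_fn, intrinsic_reward_fn, action_dim,
                horizon=5, n_candidates=64, rng=None):
    """One step of an intrinsic MPC loop (random-shooting variant):
    sample action sequences, imagine their outcomes with the learned
    dynamics model, score each imagined transition with the separate
    intrinsic reward model, and return the first action of the best plan."""
    if rng is None:
        rng = np.random.default_rng()
    candidates = rng.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, plan in enumerate(candidates):
        s = state
        for a in plan:
            s_next = dynamics_fn(s, a)             # imagined transition
            returns[i] += intrinsic_reward_fn(s, a, s_next)
            s = s_next
    return candidates[np.argmax(returns), 0]
```

Because the plan is scored at decision time with the current reward model, the intrinsic reward attached to a state is always fresh, rather than a stale label stored in the replay buffer.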

2. RELATED WORK

2.1. CURIOSITY-DRIVEN EXPLORATION

Intrinsic exploration is a well-studied direction in reinforcement learning, with the goal of enabling agents to generate compelling behaviour in any environment through an internal reward representation. Curiosity-driven learning uses learned models to reward agents for reaching states with high modelling error or uncertainty. Many recent works use the prediction error of a learned neural network model to reward agents that visit new states (Burda et al., 2018b; Pathak et al., 2017). Often, intrinsic curiosity agents are trained with on-policy RL algorithms such as Proximal Policy Optimization (PPO) to maintain recent reward labels for visited states. Burda et al. (2018a) conducted a wide study of different intrinsic reward models, focusing on pixel-based learning. Instead, we use off-policy learning and re-label the intrinsic rewards associated with a tuple when learning the policy. Other strategies for using learned dynamics models to explore include rewarding agents based on the variance of the predictions (Pathak et al., 2019; Sekar et al., 2020) or of the value function (Lowrey et al., 2018).

2.2. UNSUPERVISED PRETRAINING IN RL

Recent works have proposed a two-phase RL setting consisting of a long "pretraining" phase in a version of the environment without rewards, and a sample-limited "task learning" phase with visible rewards (Schwarzer et al., 2021). In this setting the agent attempts to learn task-agnostic information about the environment in the first phase, then rapidly re-explores to find rewards and produce a policy.

