OPAL: OFFLINE PRIMITIVE DISCOVERY FOR ACCELERATING OFFLINE REINFORCEMENT LEARNING

Abstract

Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent's ability to query the environment for transitions and rewards is effectively unlimited. However, in many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severely limited. In this work, we focus on this offline setting. Our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. Primitives extracted in this way serve two purposes: they delineate the behaviors that are supported by the data from those that are not, making them useful for avoiding distributional shift in offline RL; and they provide a degree of temporal abstraction, which reduces the effective horizon, yielding better learning in theory and improved offline RL in practice. Beyond offline policy optimization, we show that offline primitive learning of this kind also improves few-shot imitation learning, as well as exploration and transfer in online RL, on a variety of benchmark domains. Visualizations and code are available at https://sites.google.com/view/opal-iclr

1. INTRODUCTION

Reinforcement Learning (RL) systems have achieved impressive performance in a variety of online settings such as games (Silver et al., 2016; Tesauro, 1995; Brown & Sandholm, 2019) and robotics (Levine et al., 2016; Dasari et al., 2019; Peters et al., 2010; Parmas et al., 2019; Pinto & Gupta, 2016; Nachum et al., 2019a), where the agent can act in the environment and sample as many transitions and rewards as needed. However, in many practical applications the agent's ability to continuously act in the environment may be severely limited due to practical concerns (Dulac-Arnold et al., 2019). For example, a robot learning through trial and error in the real world requires costly human supervision, safety checks, and resets (Atkeson et al., 2015), rendering many standard online RL algorithms inapplicable (Matsushima et al., 2020). However, in such settings we might instead have access to large amounts of previously logged data, which could come from a baseline hand-engineered policy or even from other related tasks. For example, in self-driving applications, one may have access to large amounts of human driving behavior; in robotic applications, one might have data of either humans or robots performing similar tasks. While these offline datasets are often undirected (generic human driving data on various routes in various cities may not be directly relevant to navigation of a specific route within a specific city) and unlabeled (generic human driving data is often not labeled with the human's intended route or destination), this data is still useful in that it can inform the algorithm about what is possible to do in the real world, without the need for active exploration. In this paper, we show that, in this offline setting, an effective strategy for leveraging unlabeled and undirected past data is to apply unsupervised learning to extract potentially useful, temporally extended primitive skills, thereby learning what types of behavior are possible.
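To make this strategy concrete, the sketch below slices an undirected offline trajectory into fixed-length sub-trajectories and fits a linear autoencoder over them: an encoder maps each sub-trajectory to a continuous latent primitive z, and a decoder policy reconstructs actions from (state, z) pairs. This is only an illustrative stand-in under simplifying assumptions (deterministic linear maps, squared-error reconstruction, toy random data), not the paper's actual stochastic model; the window length c and latent size dz are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy offline data: one long undirected trajectory, sliced into
# length-c sub-trajectories (the inputs to primitive learning).
T, ds, da, c, dz = 200, 2, 2, 10, 4
states = rng.normal(size=(T, ds))
actions = 0.5 * states @ rng.normal(size=(ds, da))  # actions correlate with states
windows = [(states[t:t + c], actions[t:t + c]) for t in range(0, T - c, c)]

# Linear encoder (sub-trajectory -> latent primitive z) and linear
# decoder policy pi(a | s, z); both are illustrative assumptions.
W_enc = rng.normal(scale=0.1, size=(ds + da, dz))
W_dec = rng.normal(scale=0.1, size=(ds + dz, da))

def recon_loss():
    """Mean squared error of the decoder policy's action reconstruction."""
    total = 0.0
    for S, A in windows:
        z = np.concatenate([S, A], axis=1).mean(axis=0) @ W_enc
        X = np.concatenate([S, np.tile(z, (c, 1))], axis=1)
        total += 0.5 * np.mean((X @ W_dec - A) ** 2)
    return total / len(windows)

lr = 0.01
loss_before = recon_loss()
for _ in range(300):
    for S, A in windows:
        m = np.concatenate([S, A], axis=1).mean(axis=0)  # pooled features
        z = m @ W_enc                                    # latent primitive
        X = np.concatenate([S, np.tile(z, (c, 1))], axis=1)
        err = (X @ W_dec - A) / c                        # scaled error signal
        g_dec = X.T @ err                                # gradient w.r.t. W_dec
        g_z = (err @ W_dec[ds:].T).sum(axis=0)           # gradient w.r.t. z
        W_dec -= lr * g_dec
        W_enc -= lr * np.outer(m, g_z)

loss_after = recon_loss()
print(loss_before, loss_after)  # reconstruction loss should drop
```

After training, each sub-trajectory in the dataset is summarized by a point z in a continuous latent space, and the decoder policy executes the corresponding temporally extended behavior; a downstream (e.g., offline RL) agent can then act by selecting z every c steps rather than primitive actions every step, which is the source of the temporal abstraction discussed above.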
For example, consider a dataset of an agent performing undirected navigation in a maze environment (Figure 1). While the dataset does not provide demonstrations of exclusively one specific point-to-point navigation task,

