OPAL: OFFLINE PRIMITIVE DISCOVERY FOR ACCELERATING OFFLINE REINFORCEMENT LEARNING

Abstract

Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent's ability to query the environment for transitions and rewards is effectively unlimited. However, in many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severely limited. In this work, we focus on this offline setting. Our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. Primitives extracted in this way serve two purposes: they delineate the behaviors that are supported by the data from those that are not, making them useful for avoiding distributional shift in offline RL; and they provide a degree of temporal abstraction, which reduces the effective task horizon, yielding better learning in theory and improved offline RL in practice. In addition to benefiting offline policy optimization, we show that performing offline primitive learning in this way can also be leveraged for improving few-shot imitation learning as well as exploration and transfer in online RL on a variety of benchmark domains. Visualizations and code are available at https://sites.google.com/view/opal-iclr

1. INTRODUCTION

Reinforcement Learning (RL) systems have achieved impressive performance in a variety of online settings such as games (Silver et al., 2016; Tesauro, 1995; Brown & Sandholm, 2019) and robotics (Levine et al., 2016; Dasari et al., 2019; Peters et al., 2010; Parmas et al., 2019; Pinto & Gupta, 2016; Nachum et al., 2019a), where the agent can act in the environment and sample as many transitions and rewards as needed. However, in many practical applications the agent's ability to continuously act in the environment may be severely limited due to practical concerns (Dulac-Arnold et al., 2019). For example, a robot learning through trial and error in the real world requires costly human supervision, safety checks, and resets (Atkeson et al., 2015), rendering many standard online RL algorithms inapplicable (Matsushima et al., 2020). In such settings, however, we might instead have access to large amounts of previously logged data, which could come from a baseline hand-engineered policy or even from other related tasks. For example, in self-driving applications, one may have access to large amounts of human driving behavior; in robotic applications, one might have data of either humans or robots performing similar tasks. While these offline datasets are often undirected (generic human driving data on various routes in various cities may not be directly relevant to navigation of a specific route within a specific city) and unlabelled (generic human driving data is often not labelled with the human's intended route or destination), this data is still useful in that it can inform the algorithm about what is possible to do in the real world, without the need for active exploration. In this paper, we show that, in this offline setting, an effective strategy for leveraging unlabeled and undirected past data is to use unsupervised learning to extract potentially useful, temporally extended primitive skills, thereby learning what types of behaviors are possible.
For example, consider a dataset of an agent performing undirected navigation in a maze environment (Figure 1). While the dataset does not provide demonstrations of exclusively one specific point-to-point navigation task, it nevertheless presents clear indications of which temporally extended behaviors are useful and natural in this environment (e.g., moving forward, left, right, and backward), and our unsupervised learning objective aims to distill these behaviors into temporally extended primitives. Once these locomotive primitive behaviors are extracted, we can use them as a compact, constrained, temporally extended action space for learning a task policy with offline RL, which then only needs to focus on task-relevant navigation, thereby making task learning easier. For example, once a specific point-to-point navigation task is commanded, the agent can leverage the learned primitives for locomotion and focus solely on the task of navigation, as opposed to learning locomotion and navigation from scratch. We refer to our proposed unsupervised learning method as Offline Primitives for Accelerating offline reinforcement Learning (OPAL), and apply this basic paradigm to offline RL, where the agent is given a single offline dataset to use for both the initial unsupervised learning phase and a subsequent task-directed offline policy optimization phase. Despite the fact that no additional data is used, we find that our proposed unsupervised learning technique can dramatically improve offline policy optimization compared to performing offline policy optimization on the raw dataset directly. To the best of our knowledge, ours is the first work to theoretically justify and experimentally verify the benefits of primitive learning in offline RL settings, showing that hierarchies can provide temporal abstraction that mitigates the issue of compounding errors in offline RL.
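To make the two-phase paradigm concrete, the following is a minimal, illustrative sketch: phase one distills fixed-length action sub-sequences from the offline data into a small set of reusable primitives (here via k-means as a stand-in for OPAL's learned continuous latent space), and phase two lets a task policy select one primitive per decision step instead of individual low-level actions, shrinking the effective horizon. All names (`extract_primitives`, `rollout`) and the clustering choice are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_primitives(action_seqs, k=4, iters=20):
    """Phase 1 (illustrative): cluster length-c action sub-sequences into k
    'primitives' with simple k-means. Each center decodes back to a length-c
    action sequence. OPAL instead learns a continuous latent space; this is
    only a hedged stand-in to show the structure of the pipeline."""
    X = np.stack([seq.ravel() for seq in action_seqs])  # (N, c * act_dim)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each sub-trajectory to its nearest center
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers

def rollout(env_step, state, policy, primitives, c, act_dim, steps=3):
    """Phase 2 (illustrative): the task policy emits one primitive index per
    decision step; each primitive is decoded into c low-level actions. The
    task policy thus makes `steps` decisions instead of `steps * c`."""
    states = [state]
    for _ in range(steps):
        z = policy(state)                       # high-level action: primitive id
        plan = primitives[z].reshape(c, act_dim)
        for a in plan:                          # decode primitive to low-level actions
            state = env_step(state, a)
        states.append(state)
    return states
```

In the full method, the primitive decoder is a learned policy conditioned on a continuous latent, and the downstream offline RL algorithm optimizes over that latent space; the sketch above only conveys how temporal abstraction reduces the number of decisions the task policy must make.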
These theoretical and empirical results are notably in contrast to previous related work in online hierarchical RL (Nachum et al., 2019b), which found that improved exploration is the main benefit afforded by hierarchically learned primitives. We instead show significant benefits in the offline RL setting, where exploration is irrelevant. Beyond offline RL, and although it is not the main focus of this work, we also show the applicability of our method for accelerating RL by incorporating OPAL as a preprocessing step in standard online RL, few-shot imitation learning, and multi-task transfer learning. In all settings, we demonstrate that the use of OPAL can improve the speed and quality of downstream task learning.

2. RELATED WORK

Offline RL. Offline RL presents the problem of learning a policy from a fixed prior dataset of transitions and rewards. Recent works in offline RL (Kumar et al., 2019; Levine et al., 2020; Wu et al., 2019; Ghasemipour et al., 2020; Jaques et al., 2019; Fujimoto et al., 2018) constrain the policy to be close to the data distribution to avoid the use of out-of-distribution actions (Kumar et al., 2019; Levine et al., 2020). To constrain the policy, some methods use distributional penalties, as measured by KL divergence (Levine et al., 2020; Jaques et al., 2019), MMD (Kumar et al., 2019), or Wasserstein distance (Wu et al., 2019). Other methods first sample actions from the behavior policy and then either clip the maximum deviation from those actions (Fujimoto et al., 2018) or simply use those actions (Ghasemipour et al., 2020) during the value backup to stay within the support of the offline data. In contrast to these works, OPAL uses an offline dataset for unsupervised learning of a continuous space of primitives. The use of these primitives for downstream tasks implicitly constrains a learned primitive-directing policy to stay close to the offline data distribution. As we demonstrate in our experiments, using OPAL in conjunction with an off-the-shelf offline RL algorithm in this way can yield significant improvements over applying offline RL to the dataset directly.
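The distributional-penalty idea underlying several of the methods cited above can be sketched in a few lines: penalize a learned Gaussian policy for deviating from an (estimated) Gaussian behavior policy via KL divergence. The closed form below is the standard KL between two diagonal Gaussians; the function names, the diagonal-Gaussian assumption, and the fixed penalty weight `alpha` are simplifying assumptions for illustration, not any specific paper's algorithm.

```python
import numpy as np

def gaussian_kl(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between two diagonal Gaussians, summed over action dims.
    Standard closed form: log(sigma_q/sigma_p) + (sigma_p^2 + (mu_p-mu_q)^2) / (2 sigma_q^2) - 1/2."""
    var_p, var_q = std_p ** 2, std_q ** 2
    return float(np.sum(
        np.log(std_q / std_p) + (var_p + (mu_p - mu_q) ** 2) / (2 * var_q) - 0.5
    ))

def penalized_objective(q_value, mu_pi, std_pi, mu_beta, std_beta, alpha=1.0):
    """Offline-RL-style objective: maximize Q while penalizing divergence
    from the behavior policy, i.e. Q(s, a) - alpha * KL(pi || pi_beta)."""
    return q_value - alpha * gaussian_kl(mu_pi, std_pi, mu_beta, std_beta)
```

When the learned policy matches the behavior policy exactly, the penalty vanishes and the objective reduces to the Q-value; as the policy drifts toward out-of-distribution actions, the KL term grows and pulls it back toward the data distribution.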



Figure 1: Visualization of (a subset of) diverse datasets for (a) antmaze medium and (c) antmaze large, along with trajectories sampled from CQL+OPAL trained on diverse datasets of (b) antmaze medium and (d) antmaze large.

