LEARNING ABOUT PROGRESS FROM EXPERTS

Abstract

Many important tasks involve some notion of long-term progress in multiple phases: e.g. to clean a shelf it must be cleared of items, cleaning products applied, and then the items placed back on the shelf. In this work, we explore the use of expert demonstrations in long-horizon tasks to learn a monotonically increasing function that summarizes progress. This function can then be used to aid agent exploration in environments with sparse rewards. As a case study we consider the NetHack environment, which requires long-term progress at a variety of scales and is far from being solved by existing approaches. In this environment, we demonstrate that by learning a model of long-term progress from expert data containing only observations, we can achieve efficient exploration in challenging sparse tasks, well beyond what is possible with current state-of-the-art approaches. We have made the curated gameplay dataset used in this work available at https://github.com/deepmind/nao_top10.

1. INTRODUCTION

Complex real-world tasks often involve long time dependencies and require decision making across multiple phases of progress. This class of problems can be challenging to solve because the effects of many decisions are intertwined across timesteps. Moreover, the sparsity of the learning signal in many tasks results in a challenging exploration problem, which motivates the use of intrinsic sources of feedback, e.g. curiosity (Pathak et al., 2017; Raileanu & Rocktäschel, 2020), information-gathering (Kim et al., 2018; Ecoffet et al., 2019; Guo et al., 2022), diversity-seeking (Hong et al., 2018; Seo et al., 2021; Yarats et al., 2021) and many others. The internal structure of some environments implicitly favors certain types of intrinsic motivation. For example, Go-Explore (Ecoffet et al., 2019) excels on Montezuma's Revenge by enforcing spatio-temporally consistent exploration, while ProtoRL (Yarats et al., 2021) achieves high returns in continuous control domains. Nevertheless, some challenging tasks that lack this structure remain unsolved due to complex dynamics and large action spaces. For instance, in this work we study the game of NetHack (Küttler et al., 2020), which combines a long task horizon, a sparse or uninformative learning signal, and complex dynamics, making it an ideal testbed for building exploration-oriented agents. The complexity of NetHack prevents agents from efficiently exploring the action space, and even computing meaningful curiosity objectives can be challenging. Instead of training agents on NetHack tabula rasa, a challenging prospect for efficient exploration, we take inspiration from recent advances in another hard exploration benchmark, Minecraft (Baker et al., 2022; Fan et al., 2022), and leverage plentiful offline human demonstration data available in the wild.
Equipping learning agents with human priors has been done successfully in the context of deep reinforcement learning (Silver et al., 2016; Cruz Jr et al., 2017; Abramson et al., 2021; Shah et al., 2021; Baker et al., 2022; Fan et al., 2022) and robotic manipulation (Mandlekar et al., 2021). One of the salient features of domains such as NetHack (Küttler et al., 2020) is the lack of an explicit signal for monotonic progress. The objective of the game is to descend through the dungeon, obtaining equipment, finding food and gold, battling monsters, and collecting several critical items along the way, before returning to the beginning of the game with one particular item. A full game can take even experienced players upward of 24 hours of real time and more than 30k actions, and the game is punishingly difficult for all but the best players. In addition, the in-game score is not a meaningful measure of progress, and maximizing it does not lead agents toward completing the game (Küttler et al., 2020). Correspondingly, agents trained via reinforcement learning fail to make significant long-term progress (Hambro et al., 2022a). Many tasks of interest combine long horizons, complex temporal dependencies and sparse feedback, but demonstrations with actions are often difficult to obtain: we focus on settings where abundant unlabeled data is available in the wild. To address these problems we propose Explore Like Experts (ELE), a simple way to use progress estimates from expert demonstrations as an exploration reward. We estimate progress by training a model to predict the temporal distance between any two observations; the model is pretrained on offline human demonstrations and does not require actions. Maximizing an exploration objective based on this progress model leads to new state-of-the-art performance on complex NetHack tasks known for their difficulty, including four tasks on which competing methods achieve no reward at all.
While we focus on NetHack in this work, the method is not specific to that domain, and can in principle be applied to any challenging exploration task with demonstration data that incorporates a notion of long-term progress.
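To make the idea concrete, the sketch below shows one way to train a temporal-distance model on action-free demonstrations and derive a monotone exploration reward from it. The network architecture, loss, pair-sampling scheme, and reward shaping here are illustrative assumptions, not the paper's exact implementation; the only ingredients taken from the text are that the model predicts the temporal distance between two observations and that the resulting progress signal is used as an exploration reward.

```python
import torch
import torch.nn as nn


class ProgressModel(nn.Module):
    """Predicts the temporal distance between two observations.

    NOTE: this MLP over concatenated observation features is a hypothetical
    stand-in; the paper only specifies that temporal distance is predicted
    from observation pairs, with no action labels required.
    """

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs_a, obs_b):
        return self.net(torch.cat([obs_a, obs_b], dim=-1)).squeeze(-1)


def train_on_demo(model, demo, pairs_per_step=32, steps=200, max_gap=50):
    """Sample observation pairs (t, t+k) from one demonstration, regress k."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    T = demo.shape[0]
    for _ in range(steps):
        t = torch.randint(0, T - 1, (pairs_per_step,))
        k = torch.randint(1, max_gap + 1, (pairs_per_step,))
        k = torch.minimum(k, (T - 1) - t)          # stay inside the episode
        pred = model(demo[t], demo[t + k])
        loss = ((pred - k.float()) ** 2).mean()    # regress the temporal gap
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


def progress_reward(model, obs_start, obs_now, best_so_far):
    """Exploration reward: pay out only when estimated progress from the
    episode start exceeds the best value seen so far, which keeps the
    cumulative signal monotonically increasing."""
    with torch.no_grad():
        progress = model(obs_start, obs_now).item()
    reward = max(0.0, progress - best_so_far)
    return reward, max(best_so_far, progress)
```

The best-so-far clipping is one simple way to turn a noisy distance estimate into a monotone progress summary; other shaping choices are possible.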

2. RELATED WORKS

Intrinsic motivation: In complex domains with sparse extrinsic feedback, agents may resort to intrinsic motivation. Existing intrinsic motivation objectives can be informally categorized into curiosity-driven (Pathak et al., 2017; Zhang et al., 2020; Raileanu & Rocktäschel, 2020), information-gathering (Kim et al., 2018; Ecoffet et al., 2019; Seurin et al., 2021; Guo et al., 2022), and diversity-seeking (Bellemare et al., 2016; Hong et al., 2018; Seo et al., 2021; Yarats et al., 2021). These works propose various intrinsic motivation objectives which result in more efficient exploration than random actions. A simple yet effective approach is Random Network Distillation (RND) (Burda et al., 2019), where the predictions of a frozen network serve as a distillation target for an online network, and the prediction error acts as an exploration reward. One significant drawback of intrinsic motivation methods is that they may need to cover large regions of sub-optimal state-action space before encountering extrinsic feedback. To mitigate this, approaches like Go-Explore (Ecoffet et al., 2019) reset to promising states and resume exploration from there, but this requires that the environment can be reset to desired states, or that arbitrary states can be reached via behavior (Ecoffet et al., 2021). Unfortunately, the ability to instantiate arbitrary states is not available in many domains of interest, including NetHack, and particular configurations of the environment are not always reachable from all states.
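The RND mechanism described above can be sketched in a few lines. The network sizes and optimizer settings below are illustrative assumptions; the core idea from Burda et al. (2019) is that a frozen, randomly initialized target network embeds observations, an online predictor is trained to match those embeddings, and the prediction error serves as the exploration reward, decaying as states become familiar.

```python
import torch
import torch.nn as nn


def make_net(obs_dim: int, out_dim: int = 32) -> nn.Module:
    # Small MLP embedding; sizes are illustrative.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))


class RND:
    """Minimal Random Network Distillation sketch (Burda et al., 2019)."""

    def __init__(self, obs_dim: int):
        self.target = make_net(obs_dim)
        for p in self.target.parameters():
            p.requires_grad_(False)            # target stays frozen forever
        self.predictor = make_net(obs_dim)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=1e-3)

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        """Per-observation prediction error; also takes one training step,
        so frequently visited states earn progressively less reward."""
        with torch.no_grad():
            y = self.target(obs)
        pred = self.predictor(obs)
        err = ((pred - y) ** 2).mean(dim=-1)   # novelty signal per observation
        self.opt.zero_grad()
        err.mean().backward()
        self.opt.step()
        return err.detach()
```

Because only the predictor is trained, repeatedly visited states are eventually predicted well and yield low reward, steering the agent toward novel observations.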



Figure 1: In this work, we learn a model of task progress from expert demonstrations and use it as an exploration reward. This figure shows an example cumulative progress curve over the first 300 steps of a representative episode, alongside three frames taken over the course of the sequence. In a typical episode, the player @ will explore a procedurally-generated dungeon, interact with objects ) [, open locked and unlocked doors +, search for the stairs to the next level >, and more.

