LEARNING WHAT TO DO BY SIMULATING THE PAST

Abstract

Since reward functions are hard to specify, recent work has focused on learning policies from human feedback. However, such approaches are impeded by the expense of acquiring human feedback. Recent work proposed that agents have access to a source of information that is effectively free: in any environment that humans have acted in, the state will already be optimized for human preferences, and thus an agent can extract information about what humans want from the state (Shah et al., 2019). Such learning is possible in principle, but requires simulating all possible past trajectories that could have led to the observed state. This is feasible in gridworlds, but how do we scale it to complex tasks? In this work, we show that by combining a learned feature encoder with learned inverse models, we can enable agents to simulate human actions backwards in time to infer what they must have done. The resulting algorithm is able to reproduce a specific skill in MuJoCo environments given a single state sampled from the optimal policy for that skill.

1. INTRODUCTION

As deep learning has become popular, many parts of AI systems that were previously designed by hand have been replaced with learned components. Neural architecture search has automated architecture design (Zoph & Le, 2017; Elsken et al., 2019), population-based training has automated hyperparameter tuning (Jaderberg et al., 2017), and self-supervised learning has led to impressive results in language modeling (Devlin et al., 2019; Radford et al., 2019; Clark et al., 2020) and reduced the need for labels in image classification (Oord et al., 2018; He et al., 2020; Chen et al., 2020). However, in reinforcement learning, one component continues to be designed by humans: the task specification. Hand-coded reward functions are notoriously difficult to specify (Clark & Amodei, 2016; Krakovna, 2018), and learning from demonstrations (Ng et al., 2000; Fu et al., 2018) or preferences (Wirth et al., 2017; Christiano et al., 2017) requires a lot of human input. Is there a way that we can automate even the specification of what must be done?

It turns out that we can learn part of what the user wants simply by looking at the state of the environment: after all, the user will already have optimized the state towards their own preferences (Shah et al., 2019). For example, when a robot is deployed in a room containing an intact vase, it can reason that if its user had wanted the vase to be broken, it would already have been broken; thus she probably wants the vase to remain intact.

However, we must ensure that the agent distinguishes aspects of the state that the user could not control from aspects that the user deliberately designed. This requires us to simulate what the user must have done to lead to the observed state: anything that the user put effort into in the past is probably something the agent should do as well. As illustrated in Figure 1, if we observe a Cheetah balancing on its front leg, we can infer how it must have launched itself into that position.
Unfortunately, it is unclear how to simulate these past trajectories that lead to the observed state. So far, this has only been done in gridworlds, where all possible trajectories can be considered using dynamic programming (Shah et al., 2019). Our key insight is that we can sample such trajectories by starting at the observed state and simulating backwards in time. To enable this, we derive a gradient that is amenable to estimation through backwards simulation, and learn an inverse policy and inverse dynamics model using supervised learning.
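The backwards-simulation idea above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the names `inverse_policy` and `inverse_dynamics` are hypothetical stand-ins for the learned models, assumed here to be callables that sample the action likely to have preceded a state and the state likely to have preceded that action, respectively.

```python
def simulate_backwards(s_T, inverse_policy, inverse_dynamics, horizon):
    """Sample one plausible past trajectory ending at the observed state s_T.

    Starting from s_T, we alternately ask the inverse policy which action
    likely led to the current state, and the inverse dynamics model which
    state that action was taken from, walking backwards in time.
    """
    states, actions = [s_T], []
    s_t = s_T
    for _ in range(horizon):
        a_prev = inverse_policy(s_t)            # action that led to s_t
        s_prev = inverse_dynamics(s_t, a_prev)  # state a_prev was taken in
        actions.append(a_prev)
        states.append(s_prev)
        s_t = s_prev
    # Reverse so the trajectory reads forward in time: s_0, a_0, ..., s_T.
    return states[::-1], actions[::-1]
```

As a toy check, with deterministic forward dynamics s' = s + a on a 1-D chain, the inverse dynamics is s = s' - a, and simulating backwards from s_T = 5 with a constant action of +1 recovers the forward trajectory 0, 1, ..., 5.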

