DITTO: OFFLINE IMITATION LEARNING WITH WORLD MODELS

Abstract

We propose DITTO, an offline imitation learning algorithm which uses world models and on-policy reinforcement learning to address the problem of covariate shift, without access to an oracle or any additional online interactions. We discuss how world models enable offline, on-policy imitation learning, and propose a simple intrinsic reward, defined in the world model's latent space, that induces imitation learning via reinforcement learning. Theoretically, we show that our formulation bounds the divergence between expert and learner, which in turn bounds the difference in their returns. We test our method on difficult Atari environments from pixels alone, and achieve state-of-the-art performance in the offline setting.

1. INTRODUCTION

Generating agents which can capably act in complex environments is challenging. In the most difficult environments, hand-designed controllers are often insufficient and learning-based methods must be used to achieve good performance. When we can exactly specify the goals and constraints of the problem using a reward function, reinforcement learning (RL) offers an approach which has been extremely successful at solving a range of complex tasks, such as the strategy games of Go (Silver et al., 2016), StarCraft (Vinyals et al., 2019), and poker (Brown et al., 2020), and difficult real-world control problems like quadrupedal locomotion (Lee et al., 2020), datacenter cooling (Lazic et al., 2018), and chip placement (Mirhoseini et al., 2020). RL lifts the human designer's work from explicitly designing a good policy to designing a good reward function. However, this optimization often results in policies that maximize reward in undesirable ways that their designers did not intend (Lehman et al., 2018), the so-called reward hacking phenomenon (Krakovna et al., 2020). To combat this, practitioners spend substantial effort observing agent failure modes and tuning reward functions with extra regularization terms and weighting hyperparameters to counteract undesirable behaviors (Wang et al., 2022; Peng et al., 2017).

Imitation learning (IL) offers an alternative approach to policy learning which bypasses reward specification by directly mimicking the behavior of an expert demonstrator. The simplest kind of IL, behavior cloning (BC), trains an agent to predict an expert's actions from observations, then acts on these predictions at test time. This approach fails to account for the sequential nature of decision problems, since decisions at the current step affect which states are seen later.
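This compounding of errors over a horizon can be illustrated with a toy simulation. The sketch below (with an illustrative function name `bc_cost`; the specific error model is an assumption, not the paper's) models a behavior-cloned policy that errs with some small probability per step while on the expert's distribution, and, once off-distribution, never recovers and incurs unit cost for every remaining step. Its expected cost grows roughly quadratically in the horizon, matching the quadratic regret bound for BC.

```python
import random

def bc_cost(horizon, eps, trials=20000, seed=0):
    """Monte Carlo estimate of expected cost for a behavior-cloned policy.

    Illustrative error model (an assumption for this sketch): while on the
    expert's state distribution, the learner errs with probability `eps`
    per step; after its first error it is off-distribution, cannot recover,
    and pays cost 1 on every remaining step.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        off_distribution = False
        for _ in range(horizon):
            if off_distribution:
                total += 1          # compounding: every later step is wrong
            elif rng.random() < eps:
                off_distribution = True
                total += 1          # the first mistake leaves the expert's
                                    # distribution for good
    return total / trials

# For eps * horizon << 1, the expected cost is about eps * horizon^2 / 2:
# doubling the horizon roughly quadruples the cost, i.e. regret is
# O(eps * T^2) rather than the O(eps * T) one might naively expect.
```

An interactive method like DAgger avoids this because the learner is trained on states from its own induced distribution, so a single mistake no longer dooms the rest of the episode; this is the linear-regret regime discussed next.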
The distribution of states seen at test time will differ from those seen during training unless the expert training data covers the entire state space and the agent makes no mistakes. This distribution mismatch, or covariate shift, leads to a compounding error problem: initially small prediction errors lead to small changes in the state distribution, which lead to larger errors, and eventually to departure from the training distribution altogether (Pomerleau, 1989). Intuitively, the agent has not learned how to act under its own induced distribution. This was formalized in the seminal work of Ross & Bagnell (2010), who gave a tight regret bound on the difference in return achieved by expert and learner, which is quadratic in the episode length for BC. Ross et al. (2011) showed that a linear regret bound can be achieved if the agent learns online in an interactive setting with the expert: since the agent is trained under its own distribution with expert corrections, there is no distribution mismatch at test time. This works well when online learning is safe and expert supervision can be scaled, but is untenable in many real-world use cases such as robotics, where online learning can be unsafe, time-consuming, or otherwise

