DITTO: OFFLINE IMITATION LEARNING WITH WORLD MODELS

Abstract

We propose DITTO, an offline imitation learning algorithm which uses world models and on-policy reinforcement learning to address the problem of covariate shift, without access to an oracle or any additional online interactions. We discuss how world models enable offline, on-policy imitation learning, and propose a simple intrinsic reward defined in the world model latent space that induces imitation learning by reinforcement learning. Theoretically, we show that our formulation induces a divergence bound between expert and learner, in turn bounding the difference in reward. We test our method on difficult Atari environments from pixels alone, and achieve state-of-the-art performance in the offline setting.

1. INTRODUCTION

Generating agents which can act capably in complex environments is challenging. In the most difficult environments, hand-designed controllers are often insufficient, and learning-based methods must be used to achieve good performance. When we can exactly specify the goals and constraints of a problem with a reward function, reinforcement learning (RL) offers an approach that has been extremely successful at solving a range of complex tasks, such as the strategy games of Go (Silver et al., 2016), Starcraft (Vinyals et al., 2019), and poker (Brown et al., 2020), and difficult real-world control problems like quadrupedal locomotion (Lee et al., 2020), datacenter cooling (Lazic et al., 2018), and chip placement (Mirhoseini et al., 2020). RL lifts the human designer's work from explicitly designing a good policy to designing a good reward function. However, this optimization often results in policies that maximize reward in undesirable ways their designers did not intend (Lehman et al., 2018), the so-called reward hacking phenomenon (Krakovna et al., 2020). To combat this, practitioners spend substantial effort observing agent failure modes and tuning reward functions with extra regularization terms and weighting hyperparameters to counteract undesirable behaviors (Wang et al., 2022; Peng et al., 2017). Imitation learning (IL) offers an alternative approach to policy learning which bypasses reward specification by directly mimicking the behavior of an expert demonstrator. The simplest form of IL, behavior cloning (BC), trains an agent to predict an expert's actions from observations, then acts on these predictions at test time. This approach fails to account for the sequential nature of decision problems, since decisions at the current step affect which states are seen later.
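Behavior cloning reduces policy learning to supervised learning on expert state-action pairs. A minimal NumPy sketch, using a toy one-dimensional task and a linear-softmax policy (the task and all names here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expert: in a 1-D state, the expert picks action 1 iff the state is positive.
states = rng.normal(size=(500, 1))
actions = (states[:, 0] > 0).astype(int)

# Linear-softmax policy trained by maximum likelihood -- the BC objective.
W = np.zeros((1, 2))
b = np.zeros(2)
for _ in range(200):
    logits = states @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = p.copy()
    grad[np.arange(len(actions)), actions] -= 1  # dNLL/dlogits
    W -= 0.5 * states.T @ grad / len(actions)
    b -= 0.5 * grad.mean(axis=0)

acc = ((states @ W + b).argmax(axis=1) == actions).mean()
print(f"BC accuracy on expert states: {acc:.2f}")
```

Note that the accuracy above is measured on the expert's own state distribution; it says nothing about how the policy behaves once its rollout drifts off that distribution, which is exactly the failure mode described next.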
The distribution of states seen at test time will differ from those seen during training unless the expert training data covers the entire state space and the agent makes no mistakes. This distribution mismatch, or covariate shift, leads to a compounding-error problem: initially small prediction errors lead to small changes in the state distribution, which lead to larger errors, and eventually to departure from the training distribution altogether (Pomerleau, 1989). Intuitively, the agent has not learned how to act under its own induced distribution. This was formalized in the seminal work of Ross & Bagnell (2010), who gave a tight regret bound on the difference in return achieved by expert and learner, which is quadratic in the episode length for BC. Ross et al. (2011) showed that a linear regret bound can be achieved if the agent learns online in an interactive setting with the expert: since the agent is trained under its own distribution with expert corrections, there is no distribution mismatch at test time. This works well when online learning is safe and expert supervision can be scaled, but is untenable in many real-world use cases such as robotics, where online learning can be unsafe, time-consuming, or otherwise infeasible. On the one hand, we want to generate data on-policy to avoid covariate shift; on the other, we may not be able to afford to learn online due to safety or other concerns.
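The compounding-error argument can be made concrete with a back-of-the-envelope simulation: if the cloned policy errs with probability eps at each step and, pessimistically, earns no further reward after its first mistake, the expected regret grows superlinearly in the horizon, consistent with the quadratic bound of Ross & Bagnell (2010). A sketch under these simplifying assumptions:

```python
import numpy as np

eps, T = 0.05, 100
# Once the BC policy makes its first mistake it leaves the expert's state
# distribution; pessimistically, it earns no reward from then on. The expected
# lost reward at step t is then P(first mistake before t) = 1 - (1 - eps)^t.
steps = np.arange(1, T + 1)
regret = np.cumsum(1 - (1 - eps) ** steps)
print(f"regret after T={T} steps: {regret[-1]:.1f} (quadratic bound eps*T^2 = {eps * T * T:.0f})")
```

Doubling the horizon more than doubles the regret in this toy model, whereas an on-policy learner with expert corrections (Ross et al., 2011) suffers only linearly growing regret.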

Follow-up work by Ha & Schmidhuber (2018) proposes a two-stage approach to policy learning, in which agents first learn to predict the environment dynamics with a recurrent neural network called a "world model" (WM), and then learn the policy inside the WM alone. This approach is desirable because it enables on-policy learning offline, given the existence of the world model. Similar model-based learning methods have recently achieved success in standard online RL settings (Hafner et al., 2021), and impressive zero-shot transfer of policies trained solely in the WM to physical robots (Wu et al., 2022). In this paper, we propose an imitation learning algorithm called Dream Imitation (DITTO), which addresses the tension between offline and on-policy learning by training an agent with on-policy RL inside a learned world model. Specifically, we define a reward which measures divergence between the agent and expert demonstrations in the latent space of the world model, and show that optimizing this reward with RL induces imitation of the expert. We discuss the relationship between our method and the imitation-learning-as-divergence-minimization framework (Ghasemipour et al., 2019), and show that our method optimizes a similar bound without requiring adversarial training. We compare our method against behavior cloning and generative adversarial imitation learning (GAIL; Ho & Ermon, 2016), which we adapt to the world-model setting, and show that we achieve better performance and sample efficiency in challenging Atari environments from pixels alone. Our main contributions are summarized as follows:
• We discuss how world models relieve the tension between offline and on-policy learning methods, which mitigates the covariate shift induced by offline learning.
• We demonstrate the first fully offline model-based imitation learning method that achieves strong performance on Atari from pixels, and show that our agent outperforms competitive baselines adapted to the offline setting.
• We show how imitation learning can naturally be cast as a reinforcement learning problem in the latent space of learned world models, and propose a latent-matching intrinsic reward which compares favorably against commonly used adversarial and sparse formulations.
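To make the latent-matching idea concrete, one simple instantiation of such an intrinsic reward is a normalized inner product between the agent's and expert's world-model latents at each imagined step. The functional form below is illustrative, not necessarily the paper's exact definition:

```python
import numpy as np

def latent_match_reward(z_agent, z_expert):
    """Illustrative intrinsic reward: similarity between the agent's world-model
    latent and the expert's latent at the same step. Inner product normalized by
    the larger squared norm, so the reward is maximal (1.0) only when the two
    latents coincide, and smaller as the trajectories diverge in latent space."""
    num = float(np.dot(z_agent, z_expert))
    den = max(float(np.dot(z_agent, z_agent)), float(np.dot(z_expert, z_expert)))
    return num / den

z_expert = np.array([1.0, 0.5, -0.2])
r_match = latent_match_reward(z_expert, z_expert)        # agent matches expert
r_drift = latent_match_reward(0.5 * z_expert, z_expert)  # agent has drifted
print(r_match, r_drift)
```

Because the reward is dense in latent space, a standard on-policy RL algorithm can optimize it inside the world model without any adversarial discriminator or environment interaction.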

2. RELATED WORK

2.1 IMITATION BY REINFORCEMENT LEARNING

Imitation learning algorithms can be classified according to the set of resources needed to produce a good policy. Ross et al. (2011) give strong theoretical and empirical results in the online interactive setting, which assumes both that we can learn while acting online in the real environment, and that we can interactively query an expert policy, e.g. to provide the learner with the optimal action in the current state. Follow-up works have progressively relaxed the resource assumptions needed to produce good policies. Sasaki & Yamashina (2021) show that the optimal policy can be recovered with a modified form of BC when learning from imperfect demonstrations, given a constraint on the expert sub-optimality bound. Brantley et al. (2020) study covariate shift in the online, non-interactive setting, and demonstrate an approximately linear regret bound by jointly optimizing the BC objective with a novel policy-ensemble uncertainty cost, which encourages the learner to return to and stay in the distribution of expert support. They achieve this by augmenting the BC objective with the following uncertainty cost term:

$$\mathrm{Var}_{\pi \sim \Pi_E}\big(\pi(a|s)\big) = \frac{1}{E}\sum_{i=1}^{E}\Big(\pi_i(a|s) - \frac{1}{E}\sum_{j=1}^{E}\pi_j(a|s)\Big)^2 \tag{1}$$

This term measures the total variance of a policy ensemble $\Pi_E = \{\pi_1, \ldots, \pi_E\}$ trained on disjoint subsets of the expert data. They optimize the combined BC-plus-uncertainty objective using standard online RL algorithms, and show that this mitigates covariate shift. Inverse reinforcement learning (IRL) can achieve improved performance over BC by first learning, from the expert demonstrations, a reward for which the expert is optimal, then optimizing that reward with on-policy reinforcement learning. This two-step process, which includes on-policy RL in the
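Equation 1 can be computed directly from the stacked action distributions of the ensemble members at a given state. A minimal sketch (the helper name and the toy distributions are hypothetical):

```python
import numpy as np

def ensemble_uncertainty(probs):
    """Total variance of a policy ensemble's action distributions (Eq. 1).
    probs: array of shape (E, A) -- row i is pi_i(.|s) for ensemble member i.
    Computes the per-action variance across members, summed over actions."""
    mean = probs.mean(axis=0, keepdims=True)          # (1/E) sum_j pi_j(a|s)
    return float(((probs - mean) ** 2).mean(axis=0).sum())

# On expert support, ensembles trained on disjoint data tend to agree;
# off support, their predictions diverge and the cost grows.
agree    = np.array([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])
disagree = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
print(ensemble_uncertainty(agree))     # 0.0: members agree
print(ensemble_uncertainty(disagree))  # positive: members disagree
```

Used as a cost, this quantity penalizes visiting states where the ensemble disagrees, i.e. states outside the expert's support, which is what drives the learner back toward the expert distribution.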

