DROPOUT'S DREAM LAND: GENERALIZATION FROM LEARNED SIMULATORS TO REALITY

Anonymous authors
Paper under double-blind review

Abstract

A World Model is a generative model used to simulate an environment. World Models have proven capable of learning spatial and temporal representations of Reinforcement Learning environments, and in some cases a World Model offers an agent the opportunity to learn entirely inside its own dream environment. In this work we explore improving generalization from dream environments to reality (Dream2Real). We present a general approach that improves a controller's ability to transfer from a neural-network dream environment to reality at little additional cost. The approach draws inspiration from domain randomization, whose basic idea is to randomize as much of a simulator as possible without fundamentally changing the task at hand. Domain randomization generally assumes access to a pre-built simulator with configurable parameters, but oftentimes such a simulator is not available. By training the World Model with dropout, a single learned model can instantiate a nearly infinite number of distinct dream environments. Our experimental results show that Dropout's Dream Land is an effective technique for bridging the reality gap between dream environments and reality. We additionally perform an extensive set of ablation studies.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton & Barto, 2018) has experienced a flurry of success in recent years, from learning to play Atari (Mnih et al., 2015) to achieving grandmaster-level performance in StarCraft II (Vinyals et al., 2019). However, in all of these examples the target environment is a simulator in which the agent can train directly. Without a simulator of the environment, reinforcement learning is often not a practical solution: the target environment may be expensive, dangerous, or even impossible to interact with. In these cases, the agent is trained in a simulated source environment. Approaches that train an agent in a simulated environment in the hope that it generalizes to the target environment face a common problem known as the reality gap (Jakobi et al., 1995).

One approach to bridging the reality gap is domain randomization (Tobin et al., 2017). The basic idea is that an agent which performs well across an ensemble of simulations will also generalize to the real environment (Antonova et al., 2017; Tobin et al., 2017; Mordatch et al., 2015; Sadeghi & Levine, 2016). The ensemble of simulations is generally created by randomizing as much of the simulator as possible without fundamentally changing the task at hand. Unfortunately, this approach is only applicable when a configurable simulator is provided.

World Models (Ha & Schmidhuber, 2018), a recently growing field, addresses the case in which no simulator exists. World Models offer a general framework for optimizing controllers directly in learned simulated environments; the learned dynamics model can be viewed as the agent's dream environment. This is an interesting setting because it removes the need for an agent to operate in the target environment.
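To make the dropout idea concrete, the following sketch (a simplification under our own assumptions; the names are hypothetical, the dynamics model is a toy linear network rather than the recurrent model used in practice) illustrates how keeping dropout active at rollout time lets one learned model induce many distinct dream environments: each sampled Bernoulli mask fixes a different transition function.

```python
import numpy as np


class DreamWorldModel:
    """Toy dropout-trained dynamics model (illustrative sketch only)."""

    def __init__(self, state_dim=4, hidden_dim=16, drop_p=0.5, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-ins for learned weights; in practice these come from training.
        self.W_in = rng.normal(size=(hidden_dim, state_dim))
        self.W_out = rng.normal(size=(state_dim, hidden_dim))
        self.drop_p = drop_p
        self.hidden_dim = hidden_dim

    def sample_mask(self, rng):
        # Each Bernoulli dropout mask defines one "dream environment"
        # (inverted-dropout scaling keeps the expected activation unchanged).
        keep = (rng.random(self.hidden_dim) > self.drop_p).astype(float)
        return keep / (1.0 - self.drop_p)

    def step(self, state, mask):
        # Dropout stays active at rollout time: holding one mask fixed for a
        # rollout yields one consistent, perturbed transition function.
        h = np.tanh(self.W_in @ state) * mask
        return self.W_out @ h
```

A controller trained against rollouts that resample the mask per episode is, in effect, trained against an ensemble of simulators, mirroring domain randomization without a hand-built configurable simulator.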
Some related approaches (Łukasz Kaiser et al., 2020; Hafner et al., 2019; 2020; Sekar et al., 2020; Sutton, 1990; Kurutach et al., 2018) focus on an adjacent problem in which the controller is allowed to continually interact with the target environment. Despite recent improvements to World Models (Łukasz Kaiser et al., 2020; Hafner et al., 2019; Sekar et al., 2020; Kim et al., 2020; Hafner et al., 2020), none of these approaches addresses the fact that World Models are susceptible to the reality gap. The learned dream environment can be viewed as the source domain and the true environment as the target domain; whenever there are discrepancies between the source and target domains, the reality gap can cause problems. Even though World Models

