DROPOUT'S DREAM LAND: GENERALIZATION FROM LEARNED SIMULATORS TO REALITY

Anonymous authors
Paper under double-blind review

Abstract

A World Model is a generative model used to simulate an environment. World Models have proven capable of learning spatial and temporal representations of Reinforcement Learning environments. In some cases, a World Model offers an agent the opportunity to learn entirely inside of its own dream environment. In this work we explore improving generalization from dream environments to reality (Dream2Real). We present a general approach that improves a controller's ability to transfer from a neural network dream environment to reality at little additional cost. These improvements draw inspiration from domain randomization, where the basic idea is to randomize as much of a simulator as possible without fundamentally changing the task at hand. Domain randomization generally assumes access to a pre-built simulator with configurable parameters, but such a simulator is often unavailable. By training the World Model with dropout, we can instead instantiate a nearly infinite number of different dream environments from a single learned model. Our experimental results show that Dropout's Dream Land is an effective technique for bridging the reality gap between dream environments and reality. We additionally perform an extensive set of ablation studies.

1. INTRODUCTION

Reinforcement learning (RL; Sutton & Barto, 2018) has experienced a flurry of success in recent years, from learning to play Atari (Mnih et al., 2015) to achieving grandmaster-level performance in StarCraft II (Vinyals et al., 2019). However, in all these examples the target environment is a simulator in which the agent can train directly. Without a simulator of the environment, reinforcement learning is often not a practical solution: sometimes the target environment is expensive, dangerous, or even impossible to interact with. In these cases, the agent is trained in a simulated source environment. Approaches that train an agent in a simulated environment in the hope of generalizing to the target environment run into a common problem known as the reality gap (Jakobi et al., 1995).

One approach to bridging the reality gap is domain randomization (Tobin et al., 2017). The basic idea is that an agent which performs well in an ensemble of simulations will also generalize to the real environment (Antonova et al., 2017; Tobin et al., 2017; Mordatch et al., 2015; Sadeghi & Levine, 2016). The ensemble of simulations is generally created by randomizing as much of the simulator as possible without fundamentally changing the task at hand. Unfortunately, this approach is only applicable when a configurable simulator is provided.

A recently growing field, World Models (Ha & Schmidhuber, 2018), focuses on the setting in which no simulator exists. World Models offer a general framework for optimizing controllers directly in learned simulated environments. The learned dynamics model can be viewed as the agent's dream environment. This is an interesting area because it removes the need for an agent to operate in the target environment.
Some related approaches (Łukasz Kaiser et al., 2020; Hafner et al., 2019; 2020; Sekar et al., 2020; Sutton, 1990; Kurutach et al., 2018) focus on an adjacent problem in which the controller is allowed to continually interact with the target environment. Despite recent improvements to World Models (Łukasz Kaiser et al., 2020; Hafner et al., 2019; Sekar et al., 2020; Kim et al., 2020; Hafner et al., 2020), none of them address the fact that World Models are susceptible to the reality gap. The learned dream environment can be viewed as the source domain and the true environment as the target domain; whenever there are discrepancies between the two, the reality gap can cause problems. Even though World Models suffer from the reality gap, none of the domain randomization approaches are directly applicable, because the dream environment does not have easily configurable parameters.

In this work we present Dropout's Dream Land (DDL), a simple approach to bridging the reality gap from learned dream environments to reality. Dropout's Dream Land is inspired by the first principles of domain randomization: train a controller on a large set of different simulators that all adhere to the fundamental task of the target environment. We are able to generate a nearly infinite number of different simulators via the insight that dropout (Srivastava et al., 2014) can be understood as learning an ensemble of neural networks (Baldi & Sadowski, 2013). Our empirical results demonstrate the advantage of Dropout's Dream Land over baseline approaches (Ha & Schmidhuber, 2018; Kim et al., 2020). Furthermore, we perform an extensive set of ablation studies that indicate the source of the generalization improvements, the requirements for the method to work, and when the method is most useful.
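The core idea, that each sampled dropout mask instantiates a distinct member of an implicit ensemble of simulators, can be sketched as follows. This is a hypothetical illustration; the class name, interface, and `step_fn` signature are our own, not the paper's:

```python
import numpy as np

class DreamLand:
    """Sketch of a learned dynamics model whose dropout masks are
    resampled at each episode reset, so every rollout takes place in
    a different member of the implicit ensemble of dream environments."""

    def __init__(self, step_fn, mask_shape, drop_p, seed=0):
        self.step_fn = step_fn        # learned transition: (state, action, mask) -> state
        self.mask_shape = mask_shape
        self.drop_p = drop_p
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def reset(self, s0):
        # A fresh Bernoulli keep-mask defines one concrete "simulator"
        # for the duration of this episode.
        self.mask = (self.rng.random(self.mask_shape) > self.drop_p).astype(float)
        self.state = s0
        return s0

    def step(self, action):
        # The mask stays fixed within the episode, mirroring how
        # variational dropout fixes masks across a sequence.
        self.state = self.step_fn(self.state, action, self.mask)
        return self.state
```

Training a controller across many such resets exposes it to many different dream environments, in the spirit of domain randomization.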

2. RELATED WORKS

2.1. DROPOUT

Dropout (Srivastava et al., 2014) was introduced as a regularization technique for feedforward and convolutional neural networks. In its most general form, each unit is dropped with probability p during training. Recurrent neural networks (RNNs) initially had difficulty benefiting from dropout: Zaremba et al. (2014) suggest not applying dropout to the hidden-state units of the RNN cell, while Gal & Ghahramani (2016b) shortly thereafter showed that a mask can also be applied to the hidden-state units, provided the mask is fixed across the sequence during training. In this work, we follow the dropout approach of Gal & Ghahramani (2016b) when training the RNN. More formally, for each sequence the binary masks m_{xi}, m_{xf}, m_{xw}, m_{xo}, m_{hi}, m_{hf}, m_{hw}, and m_{ho} are sampled and then used in the following LSTM update:

i_t = W_{xi}(x_t ⊙ m_{xi}) + W_{hi}(h_{t-1} ⊙ m_{hi}) + b_i,
f_t = W_{xf}(x_t ⊙ m_{xf}) + W_{hf}(h_{t-1} ⊙ m_{hf}) + b_f,
w_t = W_{xw}(x_t ⊙ m_{xw}) + W_{hw}(h_{t-1} ⊙ m_{hw}) + b_w,
o_t = W_{xo}(x_t ⊙ m_{xo}) + W_{ho}(h_{t-1} ⊙ m_{ho}) + b_o,

where ⊙ denotes elementwise multiplication; x_t, h_t, and c_t are the input, hidden state, and cell state, respectively; W_{xi}, W_{xf}, W_{xw}, W_{xo} ∈ R^{d×r} and W_{hi}, W_{hf}, W_{hw}, W_{ho} ∈ R^{d×d} are the LSTM weight matrices; and b_i, b_f, b_w, b_o ∈ R^d are the LSTM biases. The masks are fixed for the entire sequence, but may differ between sequences in the mini-batch.
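The mask-sampling scheme can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation; the function and parameter names are our own, and we add the standard sigmoid/tanh gate activations for the sketch to be executable:

```python
import numpy as np

def sample_masks(r, d, p, rng):
    """Sample one set of Bernoulli keep-masks for a whole sequence:
    shape (r,) for input masks, (d,) for hidden-state masks."""
    keep = lambda n: (rng.random(n) > p).astype(float)
    return {f"{loc}{g}": keep(r if loc == "x" else d)
            for loc in ("x", "h") for g in "ifwo"}

def lstm_step_with_dropout(x_t, h_prev, c_prev, params, masks):
    """One LSTM step with variational dropout: the same masks are
    applied to the input and previous hidden state at every step."""
    def gate(name, activation):
        W_x, W_h, b = params[f"W_x{name}"], params[f"W_h{name}"], params[f"b_{name}"]
        pre = W_x @ (x_t * masks[f"x{name}"]) + W_h @ (h_prev * masks[f"h{name}"]) + b
        return activation(pre)

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = gate("i", sigmoid)   # input gate (i_t)
    f = gate("f", sigmoid)   # forget gate (f_t)
    w = gate("w", np.tanh)   # candidate cell update (w_t)
    o = gate("o", sigmoid)   # output gate (o_t)
    c_t = f * c_prev + i * w
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

The key point is that `sample_masks` is called once per sequence, not once per time step, so the same sub-network processes the entire sequence.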

2.2. DOMAIN RANDOMIZATION

The goal of domain randomization (Tobin et al., 2017; Sadeghi & Levine, 2016) is to create many different versions of the dynamics model, with the hope that a policy which generalizes to all versions will also do well on the true environment. Figure 1 illustrates many simulated environments (ê_j) overlapping with the actual environment (e*). Simulated environments are often far cheaper to operate in than the actual environment; hence it is desirable to perform the majority of interactions in simulation. Randomization has been applied to observations (e.g., lighting, textures) for robotic grasping (Tobin et al., 2017) and drone collision avoidance (Sadeghi & Levine, 2016). Randomization has also proven useful when applied to the underlying dynamics of simulators (Peng et al., 2018), and often both the observations and the simulation dynamics are randomized (Andrychowicz et al., 2020). Domain randomization generally starts from a pre-existing simulator and injects randomness into specific aspects of it (e.g., color textures, friction coefficients). Each of the simulated environments in Figure 1 can be thought of as a noisy sample of the pre-existing simulator. To the best of our knowledge, domain randomization has yet to be applied to entirely learned simulators.
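The parameter-injection scheme described above can be sketched in a few lines. This is a hypothetical illustration of the general recipe, not any specific simulator's API; the parameter names and ranges are made up:

```python
import random

def sample_randomized_sim(base_params, ranges, rng=random):
    """Domain-randomization sketch: jitter each configurable simulator
    parameter uniformly within a range chosen so that the task itself
    is unchanged. Parameters without a range are left at their base value."""
    sim = dict(base_params)
    for name, (lo, hi) in ranges.items():
        sim[name] = rng.uniform(lo, hi)
    return sim

# An ensemble of randomized simulators (the ê_j of Figure 1) is then
# just repeated sampling:
# ensemble = [sample_randomized_sim(base, ranges) for _ in range(100)]
```

A policy trained across such an ensemble must be robust to every sampled configuration, which is what drives generalization to the true environment.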

2.3. WORLD MODELS

The world model (Ha & Schmidhuber, 2018) has three modules, trained separately: (i) a vision module (V); (ii) a dynamics module (M); and (iii) a controller (C). A high-level view is shown in Algorithm 1.
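The three-stage pipeline can be sketched as a skeleton like the following. This is an illustrative sketch under assumed interfaces, not a reproduction of Algorithm 1; all function names and signatures are our own:

```python
def train_world_model(collect_rollouts, train_vae, train_rnn, train_controller):
    """Skeleton of the three-stage World Models pipeline: each module is
    trained separately, and the controller never needs the real environment
    after the initial data collection."""
    rollouts = collect_rollouts()                  # (observation, action) pairs
    V = train_vae(rollouts)                        # vision: frames -> latent z
    latent_rollouts = [(V.encode(obs), act)        # compress rollouts to latents
                       for obs, act in rollouts]
    M = train_rnn(latent_rollouts)                 # dynamics: predict next latent
    C = train_controller(M)                        # controller trained in the dream
    return V, M, C
```

The separation matters: V and M are trained with supervised objectives on collected data, while only the small controller C is optimized against the learned dynamics model.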




