TRANSFORMERS ARE SAMPLE-EFFICIENT WORLD MODELS

Abstract

Deep reinforcement learning agents are notoriously sample inefficient, which considerably limits their application to real-world problems. Recently, many model-based methods have been designed to address this issue, with learning in the imagination of a world model being one of the most prominent approaches. However, while virtually unlimited interaction with a simulated environment sounds appealing, the world model has to be accurate over extended periods of time. Motivated by the success of Transformers in sequence modeling tasks, we introduce IRIS, a data-efficient agent that learns in a world model composed of a discrete autoencoder and an autoregressive Transformer. With the equivalent of only two hours of gameplay in the Atari 100k benchmark, IRIS achieves a mean human normalized score of 1.046, and outperforms humans on 10 out of 26 games, setting a new state of the art for methods without lookahead search. To foster future research on Transformers and world models for sample-efficient reinforcement learning, we release our code and models at https://github.com/eloialonso/iris.
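
For reference, the human normalized score is conventionally computed per game as (agent score - random score) / (human score - random score). The "two hours" figure is consistent with the Atari 100k budget of 100k agent steps, assuming a frame skip of 4 and 60 frames per second. The following Python sketch illustrates both calculations under those assumptions; the function names and the toy scores are hypothetical and are not results from this paper.

```python
from statistics import mean


def human_normalized_score(agent: float, random: float, human: float) -> float:
    """Return (agent - random) / (human - random).

    0 corresponds to random play, 1 to the human reference score.
    """
    if human == random:
        raise ValueError("human and random reference scores must differ")
    return (agent - random) / (human - random)


def mean_human_normalized_score(scores: dict) -> float:
    """Average the per-game normalized scores.

    `scores` maps a game name to a tuple (agent, random, human).
    """
    return mean(human_normalized_score(*s) for s in scores.values())


def gameplay_hours(agent_steps: int = 100_000, frame_skip: int = 4, fps: int = 60) -> float:
    """Convert an agent-step budget into hours of real-time gameplay.

    The frame skip and frame rate defaults are assumptions, not values
    stated in the text above.
    """
    return agent_steps * frame_skip / fps / 3600


if __name__ == "__main__":
    # Hypothetical scores for illustration only: (agent, random, human).
    toy_scores = {
        "game_a": (50.0, 0.0, 100.0),
        "game_b": (30.0, 10.0, 20.0),
    }
    print(mean_human_normalized_score(toy_scores))
    print(round(gameplay_hours(), 2))
```

A per-game score above 1 means the agent exceeds the human reference on that game, which is the sense in which an agent "outperforms humans" on a given title.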

1. INTRODUCTION

Deep Reinforcement Learning (RL) has become the dominant paradigm for developing competent agents in challenging environments. Most notably, deep RL algorithms have achieved impressive performance in a multitude of arcade (Mnih et al., 2015; Schrittwieser et al., 2020; Hafner et al., 2021), real-time strategy (Vinyals et al., 2019; Berner et al., 2019), board (Silver et al., 2016; 2018; Schrittwieser et al., 2020) and imperfect information (Schmid et al., 2021; Brown et al., 2020a) games. However, a common drawback of these methods is their extremely low sample efficiency. Indeed, experience requirements range from months of gameplay for DreamerV2 (Hafner et al., 2021) in Atari 2600 games (Bellemare et al., 2013b) to thousands of years for OpenAI Five in Dota2 (Berner et al., 2019). While some environments can be sped up for training agents, real-world applications often cannot. Besides, additional cost or safety considerations related to the number of environmental interactions may arise (Yampolskiy, 2018). Hence, sample efficiency is a necessary condition to bridge the gap between research and the deployment of deep RL agents in the wild.

Model-based methods (Sutton & Barto, 2018) constitute a promising direction towards data efficiency. Recently, world models were leveraged in several ways: pure representation learning (Schwarzer et al., 2021), lookahead search (Schrittwieser et al., 2020; Ye et al., 2021), and learning in imagination (Ha & Schmidhuber, 2018; Kaiser et al., 2020; Hafner et al., 2020; 2021). The latter approach is particularly appealing because training an agent inside a world model frees it from sample efficiency constraints. Nevertheless, this framework relies heavily on accurate world models since the policy is purely trained in imagination. In a pioneering work, Ha & Schmidhuber (2018) successfully built imagination-based agents in toy environments. SimPLe recently showed promise in the more challenging Atari 100k benchmark (Kaiser et al., 2020). Currently, the best Atari agent learning in imagination is DreamerV2 (Hafner et al., 2021), although it was developed and evaluated with two hundred million frames available, far from the sample-efficient regime. Therefore, designing new world model architectures, capable of handling visually complex and partially observable environments with few samples, is key to realizing their potential as surrogate training grounds.

The Transformer architecture (Vaswani et al., 2017) is now ubiquitous in Natural Language Processing (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020b; Raffel et al., 2020), and is also gaining traction in Computer Vision (Dosovitskiy et al., 2021; He et al., 2022), as well as in Offline

