SPEEDYZERO: MASTERING ATARI WITH LIMITED DATA AND TIME

Abstract

Many recent breakthroughs of deep reinforcement learning (RL) are mainly built upon large-scale distributed training of model-free methods using millions to billions of samples. On the other hand, state-of-the-art model-based RL methods can achieve human-level sample efficiency but often take a much longer overall training time than model-free methods. However, high sample efficiency and fast training time are both important to many real-world applications. We develop SpeedyZero, a distributed RL system built upon a state-of-the-art model-based RL method, EfficientZero, with a dedicated system design for fast distributed computation. We also develop two novel algorithmic techniques, Priority Refresh and Clipped LARS, to stabilize training under massive parallelization and large batch sizes. SpeedyZero maintains on-par sample efficiency with EfficientZero while achieving a 14.5× speedup in wall-clock time, reaching human-level performance on the Atari benchmark within 35 minutes using only 300k samples. In addition, we present an in-depth analysis of the fundamental challenges in further scaling our system, bringing insights to the community.

1. INTRODUCTION

Deep reinforcement learning (RL) has achieved significant successes in the past few years. Prior work has scaled model-free RL training to computing clusters with tens to hundreds of machines, achieving human-level performance or beating human experts on various complex problems (Jaderberg et al., 2019; Baker et al., 2019; Berner et al., 2019; Vinyals et al., 2019). There are two fundamental ideas behind these successes: (1) training with larger batches for faster convergence, as used in hide-and-seek (Baker et al., 2019), DOTA 2 (Berner et al., 2019), and many popular PPO projects (Yu et al., 2021; Stooke & Abbeel, 2018); and (2) developing systems with high scalability, such as Gorila (Nair et al., 2015), Ape-X (Horgan et al., 2018), IMPALA (Espeholt et al., 2018), and R2D2 (Kapturowski et al., 2018), which can efficiently simulate huge numbers of environments in parallel. Despite these achievements, such model-free RL applications consume an extremely high volume of samples, which can be infeasible for many real-world scenarios where an efficient simulator is not accessible. By contrast, model-based RL methods require substantially fewer samples to train a strong agent. In particular, some recent works have even achieved sample efficiency comparable to humans in complex RL domains like Atari (Ye et al., 2021) or robotic control (Wu et al., 2022). The downside of model-based RL methods is that they often require a long training time (Schrittwieser et al., 2020; Ye et al., 2021). Although prior work has accelerated simple model-based RL methods (Zhang et al., 2019; Abughalieh & Alawneh, 2019), state-of-the-art sample-efficient model-based RL methods such as EfficientZero (Ye et al., 2021), which require complicated model learning and policy planning, are still time-consuming to run. In this paper, we aim to build a state-of-the-art sample-efficient model-based RL system that trains fast in wall-clock time.
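The scalable systems cited above share a common actor/learner layout: many actor processes collect environment transitions in parallel and push them to a central replay buffer, from which a single learner samples large batches. The sketch below illustrates this pattern in its simplest form; all names and the placeholder transitions are illustrative assumptions, not drawn from any of the cited systems.

```python
# Minimal sketch of an actor/learner layout in the style of Ape-X
# (Horgan et al., 2018): parallel actors feed a shared queue; a learner
# drains it into a replay buffer and samples large batches from it.
# All identifiers here are illustrative, not from any real codebase.
import multiprocessing as mp
import random


def actor(actor_id: int, queue: "mp.Queue", n_steps: int) -> None:
    """Simulate an environment loop, pushing placeholder transitions."""
    rng = random.Random(actor_id)
    for t in range(n_steps):
        transition = (actor_id, t, rng.random())  # stand-in for (s, a, r, s')
        queue.put(transition)


def learner(queue: "mp.Queue", total: int, batch_size: int) -> int:
    """Drain `total` transitions into a replay buffer; count sampled batches."""
    replay_buffer, n_batches = [], 0
    while len(replay_buffer) < total:
        replay_buffer.append(queue.get())
        if len(replay_buffer) % batch_size == 0:
            batch = random.sample(replay_buffer, batch_size)
            n_batches += 1  # a real learner would run a gradient step on `batch`
    return n_batches


if __name__ == "__main__":
    n_actors, n_steps, batch_size = 4, 25, 20
    q = mp.Queue()
    procs = [mp.Process(target=actor, args=(i, q, n_steps)) for i in range(n_actors)]
    for p in procs:
        p.start()
    n_batches = learner(q, n_actors * n_steps, batch_size)
    for p in procs:
        p.join()
    print(n_batches)  # 100 transitions at batch size 20 -> 5 batches
```

In a full system the queue and replay buffer would be backed by dedicated replay servers, and the learner's sampling would be prioritized rather than uniform; the sketch only shows the parallel data flow that makes large-batch training possible.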
To do so, we start with EfficientZero, a state-of-the-art sample-efficient model-based RL method, and accelerate it with massively parallel distributed training. We remark that scaling state-of-the-art model-based RL methods like EfficientZero is non-trivial for

