SPEEDYZERO: MASTERING ATARI WITH LIMITED DATA AND TIME

Abstract

Many recent breakthroughs in deep reinforcement learning (RL) are built upon large-scale distributed training of model-free methods using millions to billions of samples. On the other hand, state-of-the-art model-based RL methods can achieve human-level sample efficiency but often take a much longer overall training time than model-free methods. However, high sample efficiency and fast training time are both important for many real-world applications. We develop SpeedyZero, a distributed RL system built upon a state-of-the-art model-based RL method, EfficientZero, with a dedicated system design for fast distributed computation. We also develop two novel algorithmic techniques, Priority Refresh and Clipped LARS, to stabilize training under massive parallelization and large batch sizes. SpeedyZero maintains on-par sample efficiency with EfficientZero while achieving a 14.5× speedup in wall-clock time, reaching human-level performance on the Atari benchmark within 35 minutes using only 300k samples. In addition, we present an in-depth analysis of the fundamental challenges in further scaling our system to bring insights to the community.

1. INTRODUCTION

Deep reinforcement learning (RL) has achieved significant successes in the past few years. Prior work has scaled model-free RL training to computing clusters with tens to hundreds of machines, achieving human-level performance or beating human experts on various complex problems (Jaderberg et al., 2019; Baker et al., 2019; Berner et al., 2019; Vinyals et al., 2019). There are two fundamental ideas behind these successes: (1) training with larger batches for faster convergence, as used in hide-and-seek (Baker et al., 2019), DOTA 2 (Berner et al., 2019), and many popular PPO projects (Yu et al., 2021; Stooke & Abbeel, 2018); and (2) developing systems with high scalability, such as Gorila (Nair et al., 2015), Ape-X (Horgan et al., 2018), IMPALA (Espeholt et al., 2018), and R2D2 (Kapturowski et al., 2018), which can efficiently simulate huge numbers of environments in parallel. Despite these achievements, such model-free RL applications consume an extremely high volume of samples, which can be infeasible in many real-world scenarios without access to an efficient simulator. By contrast, model-based RL methods require substantially fewer samples to train a strong agent. In particular, some recent works have even achieved sample efficiency comparable to humans in complex RL domains like Atari (Ye et al., 2021) or robotic control (Wu et al., 2022). The downside of model-based RL methods is that they often require a long training time (Schrittwieser et al., 2020; Ye et al., 2021). Although prior work has tried to accelerate simple model-based RL methods (Zhang et al., 2019; Abughalieh & Alawneh, 2019), state-of-the-art sample-efficient model-based RL methods such as EfficientZero (Ye et al., 2021), which require complicated model learning and policy planning, are still time-consuming to run. In this paper, we aim to build a state-of-the-art sample-efficient model-based RL system that trains fast in wall-clock time.
To do so, we start with EfficientZero, a state-of-the-art sample-efficient model-based RL method, and accelerate it with massively parallel distributed training. We remark that scaling state-of-the-art model-based RL methods like EfficientZero is non-trivial due to two major challenges. First, unlike model-free RL, where massive parallelization can be achieved simply by simulating more environments simultaneously, a single training step in EfficientZero requires substantially more computation, including model expansion, back-track search/planning, Q-value backup, and re-analyzing past samples; parallelizing these components efficiently is non-trivial. Second, we empirically observe that when the data-producing rate is accelerated via parallelization and the batch size is increased to retain the same sample efficiency, EfficientZero training may suffer from significant instabilities. We present SpeedyZero, a distributed model-based RL training system that leverages dedicated system-level optimization to largely reduce computation overhead while inheriting the high sample efficiency of EfficientZero. From the system perspective, SpeedyZero contains three major innovations for a much faster per-iteration training speed: (1) a non-trivial partition of the computation workload to reduce network communication, (2) shared-memory queues for high-throughput, low-latency inter-process communication, and (3) optimized data transfer scheduling that reduces CPU-GPU communication and redundant data transmission. Furthermore, from the algorithm perspective, SpeedyZero is equipped with two novel techniques, Priority Refresh (P-Refresh) and Clipped LARS, which significantly stabilize model-based training under massive parallelization and larger batch sizes.
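As an illustration of the second system optimization, the snippet below sketches how a batch can be handed between processes through named shared memory rather than serialized pipes. This is a minimal sketch, not SpeedyZero's actual implementation; the helper names `put_batch` and `get_batch` and the block name `obs_batch` are hypothetical.

```python
# Minimal sketch of passing a numpy batch between processes via shared
# memory, avoiding pickling/copy overhead. Illustrative only.
import numpy as np
from multiprocessing import shared_memory

def put_batch(name, batch):
    """Copy a numpy batch into a named shared-memory block (one copy)."""
    shm = shared_memory.SharedMemory(name=name, create=True, size=batch.nbytes)
    view = np.ndarray(batch.shape, dtype=batch.dtype, buffer=shm.buf)
    view[:] = batch
    return shm

def get_batch(name, shape, dtype):
    """Attach to an existing block and view it as a numpy array (zero-copy)."""
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)

# Producer side
batch = np.arange(8, dtype=np.float32)
shm_prod = put_batch("obs_batch", batch)

# Consumer side (would normally run in a different process)
shm_cons, view = get_batch("obs_batch", batch.shape, batch.dtype)
result = view.copy()   # copy out before releasing the view
del view               # numpy view must be released before close()
shm_cons.close()
shm_prod.close()
shm_prod.unlink()
```

In a real queue, producers and consumers would additionally coordinate through a small index structure (also in shared memory) so that readers know when a slot contains a fresh batch.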
P-Refresh is a distributed and more aggressive variant of prioritized experience replay (Schaul et al., 2015), which actively re-computes accurate priorities for all the samples in the replay buffer. Clipped LARS is a variant of LARS (You et al., 2017) that ensures stable training with large batch sizes. The proposed techniques prove critical to the overall success of SpeedyZero. We evaluate SpeedyZero on the Atari 100k benchmark (Kaiser et al., 2019), where it achieves human-level performance with only 35 minutes of training and 300k samples. Compared with EfficientZero, which requires 8.5 hours of training, SpeedyZero retains comparable sample efficiency while achieving a 14.5× speedup in wall-clock time. Ablation studies are also presented to show the effectiveness of each design component and technique in our system. In addition, we conduct a further study on the effect of batch size; the results show that when the batch size increases, the performance of SpeedyZero may drop significantly. We carefully analyze the underlying bottleneck and hope the insights can benefit the community in future research. Our main contributions are summarized as follows:
• We develop SpeedyZero, a massively parallel distributed model-based RL training system featuring high sample efficiency and fast training speed. SpeedyZero masters Atari in 35 minutes with 300k samples, achieving a 14.5× speedup and on-par sample efficiency compared with the state-of-the-art EfficientZero algorithm.
• SpeedyZero adopts three system optimization techniques that significantly reduce training latency and improve training throughput, and further leverages two algorithmic techniques, Priority Refresh and Clipped LARS, to stabilize training under massive parallelization and larger batch sizes.
• We present comprehensive ablation studies on our system components and an in-depth analysis of the existing bottlenecks in further scaling SpeedyZero, leading to practical suggestions and open questions for the community.
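To make the contrast with standard prioritized replay concrete, here is a minimal, hypothetical sketch of the Priority Refresh idea: rather than updating only the priorities of transitions that happen to be sampled, a refresh pass recomputes the priority of every transition in the buffer. The class and function names are illustrative, not from the paper, and `td_error_fn` stands in for the learner's value-error computation.

```python
# Illustrative sketch of Priority Refresh (P-Refresh). In SpeedyZero this
# refresh would run continuously in dedicated workers; here it is a single
# synchronous pass for clarity.
import numpy as np

class RefreshableBuffer:
    def __init__(self, capacity):
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float64)

    def add(self, transition, priority):
        self.data.append(transition)
        self.priorities[len(self.data) - 1] = priority

    def sample(self, batch_size, rng):
        # Standard priority-proportional sampling, as in PER.
        n = len(self.data)
        probs = self.priorities[:n] / self.priorities[:n].sum()
        idx = rng.choice(n, size=batch_size, p=probs)
        return [self.data[i] for i in idx]

    def refresh_all(self, td_error_fn):
        # The key difference from vanilla PER: recompute EVERY priority
        # with the latest model, not just those of sampled transitions.
        for i, tr in enumerate(self.data):
            self.priorities[i] = abs(td_error_fn(tr)) + 1e-6

buf = RefreshableBuffer(capacity=4)
for v in [1.0, 2.0, 3.0]:
    buf.add(v, priority=1.0)   # stale, uniform priorities
buf.refresh_all(lambda tr: tr)  # toy "TD error": the stored value itself
```

With stale priorities, fast-moving distributed training samples from an outdated distribution; refreshing all priorities keeps the sampling distribution aligned with the current model.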
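Clipped LARS is only described at a high level here; the sketch below shows one plausible reading, assuming the layer-wise LARS trust ratio is clamped to an upper bound before scaling the update. The exact clipping rule and hyperparameters may differ from SpeedyZero's.

```python
# Hedged sketch of a LARS-style update with a clipped trust ratio.
# `clipped_lars_step` and its defaults are illustrative, not from the paper.
import torch

def clipped_lars_step(params, lr=0.1, trust_coef=0.001, clip=1.0, eps=1e-9):
    """Scale each layer's gradient by its LARS trust ratio, clamped at `clip`."""
    for p in params:
        if p.grad is None:
            continue
        w_norm = p.detach().norm()
        g_norm = p.grad.detach().norm()
        # LARS layer-wise trust ratio: eta * ||w|| / ||g||
        trust = trust_coef * w_norm / (g_norm + eps)
        trust = torch.clamp(trust, max=clip)  # the "clipped" part (assumed)
        p.data.add_(p.grad, alpha=-(lr * trust.item()))

w = torch.nn.Parameter(torch.ones(4))
w.grad = torch.full((4,), 0.5)
clipped_lars_step([w])  # each entry moves from 1.0 to ~0.9999
```

The intuition is that plain LARS can produce an arbitrarily large effective learning rate when a layer's gradient norm is tiny; clamping the trust ratio bounds the per-layer step and avoids the resulting spikes at large batch sizes.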

2. RELATED WORK

Distributed Machine Learning With the emergence of larger datasets and larger models, distributed machine learning systems have proliferated in both industry and research. Two main branches exist in this field: data parallelism and model parallelism. Data parallelism partitions an enormous dataset into small chunks that are computationally tractable on single machines (or GPUs) and assigns the chunks to different machines (or GPUs) in the training cluster. Successful frameworks for data parallelism include three generations of parameter servers (Smola & Narayanamurthy, 2010; Dean et al., 2012; Li et al., 2014) and distributed data parallel (Li et al., 2020). On the other hand, model parallelism, such as Megatron-LM (Shoeybi et al., 2019), handles the problem of training gigantic models with billions of parameters by assigning different layers of the model to different machines. In our case, the model is small enough to fit on a single GPU, while the sample batch is too large for efficient single-GPU training. Therefore, we use the distributed data parallel implementation provided by PyTorch (Li et al., 2020) for multi-GPU training in SpeedyZero.
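As a minimal illustration of PyTorch's distributed data parallel API (not SpeedyZero's training code), the snippet below runs a world of size 1 on CPU with the `gloo` backend; real multi-GPU training would spawn one process per GPU and pass `device_ids`. The model and data are toy placeholders.

```python
# Toy single-rank DDP example: wrap a model, run one optimization step.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29512")
    # A world of size 1: gradients are all-reduced across ranks (here, one).
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = DDP(torch.nn.Linear(8, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x, y = torch.randn(32, 8), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()  # DDP synchronizes gradients during backward
    opt.step()

    dist.destroy_process_group()
    return loss.item()

loss_value = main()
```

Because gradient synchronization overlaps with the backward pass, DDP adds little overhead when the model is small, which is why a single replicated learner model suits this setting better than model parallelism.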

