SPEEDYZERO: MASTERING ATARI WITH LIMITED DATA AND TIME

Abstract

Many recent breakthroughs in deep reinforcement learning (RL) are built upon large-scale distributed training of model-free methods using millions to billions of samples. On the other hand, state-of-the-art model-based RL methods can achieve human-level sample efficiency but often take much longer overall training time than model-free methods. However, high sample efficiency and fast training time are both important to many real-world applications. We develop SpeedyZero, a distributed RL system built upon a state-of-the-art model-based RL method, EfficientZero, with a dedicated system design for fast distributed computation. We also develop two novel algorithmic techniques, Priority Refresh and Clipped LARS, to stabilize training with massive parallelization and large batch sizes. SpeedyZero maintains on-par sample efficiency compared with EfficientZero while achieving a 14.5× speedup in wall-clock time, leading to human-level performance on the Atari benchmark within 35 minutes using only 300k samples. In addition, we present an in-depth analysis of the fundamental challenges in further scaling our system to bring insights to the community.

1. INTRODUCTION

Deep reinforcement learning (RL) has achieved significant successes in the past few years. Prior work has scaled model-free RL training to computing clusters with tens to hundreds of machines, achieving human-level performance or beating human experts on various complex problems (Jaderberg et al., 2019; Baker et al., 2019; Berner et al., 2019; Vinyals et al., 2019). There are two fundamental ideas behind these successes: (1) training with larger batches for faster convergence, as used in hide-and-seek (Baker et al., 2019), DOTA 2 (Berner et al., 2019), and even in many popular PPO projects (Yu et al., 2021; Stooke & Abbeel, 2018); and (2) developing highly scalable systems, such as Gorila (Nair et al., 2015), Ape-X (Horgan et al., 2018), IMPALA (Espeholt et al., 2018), and R2D2 (Kapturowski et al., 2018), which can efficiently simulate huge numbers of environments in parallel. Despite these achievements, model-free RL applications consume an extremely high volume of samples, which can be infeasible in many real-world scenarios where no efficient simulator is accessible. By contrast, model-based RL methods require substantially fewer samples to train a strong agent. In particular, some recent works have even achieved sample efficiency comparable to humans in complex RL domains like Atari (Ye et al., 2021) or robotic control (Wu et al., 2022). The downside of model-based RL methods is that they often require a long training time (Schrittwieser et al., 2020; Ye et al., 2021). Although prior work has tried to accelerate simple model-based RL methods (Zhang et al., 2019; Abughalieh & Alawneh, 2019), state-of-the-art sample-efficient model-based RL methods such as EfficientZero (Ye et al., 2021), which require complicated model learning and policy planning, are still time-consuming to run. In this paper, we aim to build a state-of-the-art sample-efficient model-based RL system that trains fast in wall-clock time.
To do so, we start with EfficientZero, a state-of-the-art sample-efficient model-based RL method, and then accelerate it with massively parallel distributed training. We remark that scaling state-of-the-art model-based RL methods like EfficientZero is non-trivial because of two major challenges. First, unlike model-free RL, for which massive parallelization can be achieved simply by simulating more environments simultaneously, a single training step in EfficientZero requires substantially more computation steps, including model expansion, back-track search/planning, Q-value backup, and re-analyzing past samples. Therefore, it is non-trivial to parallelize these components efficiently. Second, we empirically notice that when the data production rate is greatly accelerated via parallelization and the batch size is increased while retaining the same sample efficiency, EfficientZero training may suffer from significant instabilities. We present SpeedyZero, a distributed model-based RL training system, which leverages dedicated system-level optimization to greatly reduce computation overhead while inheriting the high sample efficiency of EfficientZero. From the system perspective, SpeedyZero contains three major innovations for a much faster per-iteration training speed: (1) a non-trivial partition of the computation workload to reduce network communication, (2) shared memory queues for high-throughput and low-latency inter-process communication on the same node, and (3) optimized data transfer scheduling that reduces CPU-GPU communication and redundant data transmission. Furthermore, from the algorithm perspective, SpeedyZero is equipped with two novel techniques, Priority Refresh (P-Refresh) and Clipped LARS, to significantly stabilize model-based training in the case of massive parallelization and larger batch sizes.
P-Refresh is a distributed and more aggressive variant of prioritized experience replay (Schaul et al., 2015), which actively re-computes accurate priorities for all the samples in the replay buffer. Clipped LARS is a variant of LARS (You et al., 2017) that ensures stable training with large batch sizes. The proposed techniques are shown to be critical for the overall success of SpeedyZero. We evaluate SpeedyZero on the Atari 100k benchmark (Kaiser et al., 2019), where SpeedyZero achieves human-level performance with only 35 minutes of training and 300k samples. Compared with EfficientZero, which requires 8.5 hours of training, SpeedyZero retains a comparable sample efficiency while achieving a 14.5× speedup in wall-clock time. Ablation studies are also presented to show the effectiveness of each design component and technique in our system. In addition, we conduct a further study on the effect of batch size, and the results show that the performance of SpeedyZero may drop significantly as the batch size increases. We carefully analyze the underlying bottleneck and hope the insights can benefit the community for future research. Our main contributions are summarized as follows:

2. RELATED WORK

Distributed Machine Learning. With the emergence of larger datasets and larger models, distributed machine learning systems have proliferated in industry and in research. Two main branches exist in this field: data parallelism and model parallelism. Data parallelism partitions an enormous dataset into small chunks that are computationally tractable on single machines (or GPUs) and assigns the chunks to different machines (or GPUs) in the training cluster. Successful frameworks for data parallelism include three generations of parameter servers (Smola & Narayanamurthy, 2010; Dean et al., 2012; Li et al., 2014) and distributed data parallel (Li et al., 2020). On the other hand, model parallelism, as in Megatron-LM (Shoeybi et al., 2019), handles the problem of training gigantic models with billions of parameters by assigning different layers of the model to different machines. In our case, the model is small enough to fit onto a single GPU, while the sample batch is too large for efficient single-GPU training. Therefore, we use distributed data parallel provided by PyTorch (Li et al., 2020) for multi-GPU training in SpeedyZero.

Distributed Deep Reinforcement Learning. There have been many successful attempts to scale out model-based deep RL methods with distributed training. Assuming known environment models, AlphaGo (Silver et al., 2016) achieves super-human performance in the game of Go by distributing the rollout process in the Monte-Carlo Tree Search (MCTS) over hundreds of machines. Its successor MuZero (Schrittwieser et al., 2020) achieves astonishing results in the game of Go with thousands of TPUs. Zhang et al. (2019) propose an asynchronous framework for parallelizing model-based RL training using the idea of parameter servers. However, their method is based on a simple model-based RL method, which is easy to parallelize but has relatively low sample efficiency.
In the domain of model predictive control, many works (Abughalieh & Alawneh, 2019) study parallelism for improved speed, but focus on simple settings that are far less complicated than games like Atari. There are also many efforts in scaling out model-free deep RL methods, including Gorila (Nair et al., 2015), Ape-X (Horgan et al., 2018), IMPALA (Espeholt et al., 2018), and R2D2 (Kapturowski et al., 2018). These prior works typically focus on data-rich settings, requiring millions to billions of samples and hours to days of training. Stooke & Abbeel (2018) study accelerated training of model-free methods such as PPO and A2C on a single machine. Our focus is on speeding up the training of model-based RL methods while maintaining high sample efficiency, which could bridge the training speed gap between model-based and model-free methods. Besides system design optimizations, many prior works also adopt algorithmic improvements for better performance in distributed training of deep RL agents. IMPALA (Espeholt et al., 2018) introduces V-trace to correct the policy lag between the actors and learners. R2D2 (Kapturowski et al., 2018) proposes "burn-in" steps to deal with the parameter lag in recurrent neural networks. Many works also adopt prioritized experience replay (PER) in distributed training settings (Horgan et al., 2018; Kapturowski et al., 2018; Schrittwieser et al., 2020).

where $u_t$ is the reward at step $t$.

MuZero. MuZero (Schrittwieser et al., 2020) is a model-based RL method based on the Monte-Carlo Tree Search (MCTS) algorithm. MuZero learns the environment dynamics and performs MCTS over the learned environment model to find a better policy. More specifically, MuZero models the environment with a representation function $h$, a dynamics function $g$, and a prediction function $f$.
To find a high-quality policy given a history of observations $o_{\le t}$, MuZero first encodes the observation history as $s^0_t = h(o_{\le t})$, which is used as the latent state at the root of the tree. To perform MCTS, MuZero runs $N$ simulation steps. In the $k$-th simulation step, a leaf node $s'$ and an unexplored action $a'$ on the leaf node are chosen by the UCT rule (Kocsis & Szepesvári, 2006; Rosin, 2011). A node expansion step follows, which computes the next latent state $s^{k+1}_t$ with the dynamics function $g$, as well as the policy and value at $s^{k+1}_t$ with the prediction function $f$. At the end of each simulation step, the value $v^{k+1}_t$ is back-propagated along the tree path to update the Q values. The MCTS process is computationally expensive since it requires extensive CPU operations for tree search as well as GPU resources for model inference. MuZero interacts with the environment by searching for a policy using MCTS over the learned environment model. The resulting trajectories are then stored in the replay buffer. During training, a batch of observation histories $\{o_{\le t}\}$ is sampled, and MuZero rolls out the environment model on the batch along the actions $\{a_{t \ldots t+K-1}\}$ taken in the following $K$ steps, predicting a batch of rewards $\{r^{1 \ldots K-1}_t\}$, policies $\{p^{0 \ldots K-1}_t\}$, and values $\{v^{0 \ldots K-1}_t\}$. To learn the models, the following loss is minimized:

$$\sum_{k} \left[ \mathcal{L}(u_{t+k}, r^k_t) + \lambda_1 \mathcal{L}(\pi_{t+k}, p^k_t) + \lambda_2 \mathcal{L}(z_{t+k}, v^k_t) \right]$$

where $u_{t+k}$ is the environment reward, $\pi_{t+k}$ is the target policy obtained through MCTS over a target model, $z_{t+k} = \sum_{i=0}^{n-1} \gamma^i u_{t+k+i} + \gamma^n v'_{t+k+n}$ is the discounted $n$-step return, and $v'_t$ is the value computed by the target model. To improve sample efficiency, the MuZero Reanalyze algorithm (Schrittwieser et al., 2020) regenerates the policy and value of a training batch when the batch is sampled from the replay buffer. Compared with MuZero, MuZero Reanalyze uses significantly fewer samples while still achieving strong results.
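The discounted $n$-step return used as the value target can be illustrated with a minimal sketch. The function name `n_step_return` and the reward and value sequences below are hypothetical and chosen only for illustration; the index `t` plays the role of $t+k$ in the formula.

```python
def n_step_return(rewards, target_values, t, n, gamma):
    """z_t = sum_{i=0}^{n-1} gamma^i * u_{t+i} + gamma^n * v'_{t+n}."""
    discounted_rewards = sum((gamma ** i) * rewards[t + i] for i in range(n))
    bootstrap = (gamma ** n) * target_values[t + n]
    return discounted_rewards + bootstrap


# Hypothetical trajectory: environment rewards u_t and target-model values v'_t.
rewards = [1.0, 0.0, 2.0, 0.0]
target_values = [0.0, 0.0, 0.0, 4.0, 0.0]

# 1.0 + 0.5 * 0.0 + 0.25 * 2.0 + 0.125 * 4.0
z = n_step_return(rewards, target_values, t=0, n=3, gamma=0.5)
```

Because the bootstrap term depends on the target model's value $v'$, the target changes whenever the target model is updated, which is why Reanalyze recomputes it at sampling time.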
It is worth noting that the reanalysis step over the sampled batch is computationally expensive since it involves an additional MCTS procedure.

EfficientZero. EfficientZero (Ye et al., 2021) is a sample-efficient visual RL algorithm built on top of the MuZero Reanalyze algorithm, which re-computes the target policies via MCTS when a training batch is sampled from the replay buffer. EfficientZero further proposes several augmentations for visual RL tasks: a self-supervised consistency loss that provides more training signal to the environment model, prediction of the value prefix instead of the reward to deal with aleatoric uncertainty, and off-policy correction for the n-step return. The workflow of EfficientZero is shown in Fig. 1a, where the Reanalyze workers continuously generate training batches. EfficientZero suffers from the same computation expense issue as MuZero Reanalyze due to its reliance on reanalyzing batches. Our method SpeedyZero inherits the sample efficiency optimizations from EfficientZero and boosts its training speed by 14.5×.

4.1. OVERVIEW

The ultimate goal of SpeedyZero is to speed up the training of EfficientZero-based RL agents while maintaining on-par sample efficiency. We achieve this through efforts on both the system side and algorithm side in SpeedyZero, as shown in Fig. 1 . The system optimizations in SpeedyZero help us reduce the time needed for each training step. The algorithm optimizations reduce the number of training steps needed while maintaining the stability of the training process. We will discuss the system optimizations in Sec. 4.2 and the algorithm optimizations in Sec. 4.3 respectively.

4.2. SPEEDYZERO SYSTEM DESIGN

As shown in Fig. 1, SpeedyZero partitions the workflow of EfficientZero into three stages: data collection, batch generation, and training. The three stages are distributed to the data node, the reanalysis node, and the trainer node respectively. SpeedyZero features the following three system design novelties for higher training step throughput and lower latency on critical modules.

Modular Design and Non-Trivial Workload Partition: A naive partition of the workload across multiple machines entails massive network data transfer, inducing high latency and low throughput. SpeedyZero follows a modular design, in which we partition the workflow into three major stages so that the data transfer across stages is reduced, and we assign one type of node to each stage. The machines (nodes) in SpeedyZero work together asynchronously, and we can easily add more machines for a single type of node if it becomes the bottleneck. Our partition strategy helps us scale out SpeedyZero for higher throughput with low network latency overhead.

Efficient On-Node Communication with Shared Memory: Traditional message passing between different processes on the same machine serializes data on the sender side and deserializes it again on the receiver side (known as data ser/des). This process also entails multiple memory copies. The data ser/des and memory copies make message passing extremely slow when the amount of data to transfer is huge. We notice that in our setting, the majority of the transferred data can be expressed as NumPy arrays, and many of these arrays are written once but read multiple times. Therefore, we develop a special shared memory object store for on-node communication. It avoids data ser/des for NumPy arrays and supports zero-copy, lock-free reads for them. With this shared memory object store, processes in SpeedyZero can communicate with higher bandwidth and lower latency.
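The write-once, read-many pattern described above can be sketched with Python's standard `multiprocessing.shared_memory` module (Python 3.8+). This is only an illustration of the idea, not SpeedyZero's object store: the helper names `put_array` and `get_array` and the handle layout are assumptions, and the demo runs in a single process for brevity, whereas in practice the handle would be passed to another process.

```python
import numpy as np
from multiprocessing import shared_memory


def put_array(arr):
    """Writer side: copy `arr` once into a new shared-memory block."""
    arr = np.ascontiguousarray(arr)
    shm = shared_memory.SharedMemory(create=True, size=max(arr.nbytes, 1))
    view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    view[...] = arr
    del view  # drop the buffer export so `shm` can be closed later
    # Only this small handle travels between processes, not the array data.
    handle = (shm.name, arr.shape, arr.dtype.str)
    return shm, handle


def get_array(handle):
    """Reader side: map the block and wrap it as a read-only array, no copy."""
    name, shape, dtype = handle
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray(shape, dtype=np.dtype(dtype), buffer=shm.buf)
    arr.flags.writeable = False  # readers never mutate, so no lock is needed
    return shm, arr


obs = np.arange(12, dtype=np.float32).reshape(3, 4)  # hypothetical payload
owner, handle = put_array(obs)
reader, shared_obs = get_array(handle)
matches = bool(np.array_equal(shared_obs, obs))
del shared_obs   # release the view before closing the mapping
reader.close()
owner.close()
owner.unlink()   # the owner frees the block once all readers are done
```

Since the array bytes are never pickled, the only per-message cost is sending the small handle; readers wrap the same physical memory, which is what makes repeated reads cheap.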

Data Transfer Optimizations:

We empirically find that batch generation latency on the reanalysis node and priority refresh latency on the data node strongly affect the final performance. We also notice that a considerable amount of time is spent on CPU-GPU data transfer in these two components. Therefore, we reduce the CPU-GPU data transfer during batch reanalysis by storing all MCTS latent states on GPU. For priority refresh (Sec. 4.3), we store the observations in the replay buffer on GPU to avoid loading them every time the priorities are re-computed. These two optimizations help SpeedyZero better utilize GPU VRAM to reduce the latency of critical components. Data transfer can also overlap with computation in many cases to further reduce latency and improve throughput. In SpeedyZero, we overlap network transmission with computation by sending and receiving network packets in separate processes, allowing workers to continue their jobs while a packet is in flight. Moreover, on the trainer node, we also overlap batch loading with training by preloading the next batch onto the GPU.
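The idea of overlapping batch loading with training can be illustrated with a small standard-library sketch. It uses a background thread and a bounded queue rather than the separate processes and GPU preloading described above, and `prefetch`, `load_batch`, and `train_step` are hypothetical stand-ins.

```python
import queue
import threading


def prefetch(batches, depth=1):
    """Yield items from `batches` while a background thread loads the next ones."""
    buf = queue.Queue(maxsize=depth)  # bounded, so the loader cannot run far ahead
    done = object()                   # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buf.put(batch)            # blocks while the buffer is full
        buf.put(done)

    # Daemon thread: it will not keep the program alive if the consumer stops early.
    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        yield item


def load_batch(i):
    # Stand-in for receiving a batch and moving it to the training device.
    return [i] * 4


def train_step(batch):
    # Stand-in for one optimizer step; returns a dummy "loss".
    return sum(batch)


losses = [train_step(b) for b in prefetch(load_batch(i) for i in range(5))]
```

While `train_step` works on batch `i`, the producer is already preparing batch `i + 1`, so loading latency is hidden as long as it is shorter than a training step.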

4.3. ALGORITHM IMPROVEMENTS

Unstable Model-Based Training: We empirically find that the predicted values may behave unstably, especially during the initial training stage. The instability is largely due to the accelerated training speed and the reduced number of training steps required to keep the sample efficiency unchanged. As shown in Fig. 2, the predicted values climb abnormally high at the beginning, and it takes many steps before the values decrease back to the normal range. Notably, this phenomenon exists when using either distributed prioritized experience replay (DPER) (Horgan et al., 2018) or uniform sampling from the replay buffer. The surge in predicted values at the beginning prevents proper policy improvement since MCTS relies on predicted values. Also, when scaling SpeedyZero to a larger batch size, we observe several sudden large gradients during training across a wide range of trials, as shown in Fig. 3a. We remark that these issues are not severe in EfficientZero since EfficientZero uses a much longer overall training time and a smaller batch size than SpeedyZero does.

Priority Refresh: To address the issue of unstable values, we propose Priority Refresh, in which we actively refresh the priorities of all data points in the replay buffer. As shown in Fig. 1b, a group of priority refreshers periodically update the priorities of all data points in the replay buffer. The latest priorities are synchronized to all reanalysis nodes at a constant frequency. Since the goal is to stabilize the values, we use TD errors as the priorities. The key difference between P-Refresh and Distributed Prioritized Experience Replay (DPER) (Horgan et al., 2018) is that DPER only updates the priorities of data points that are trained on, while P-Refresh updates the priorities of all data points. This difference allows SpeedyZero to effectively use data from old policies, which stabilizes the training and leads to better performance.

Figure 2: The predicted values of different trials when using uniform sampling from the replay buffer (Unif.), distributed prioritized experience replay (DPER), and priority refresh (P-Refresh) in Jamesbond. Uniform sampling exhibits very unstable values. Values of DPER are less unstable but still suffer from the same instability issue. In contrast, P-Refresh shows stable improvement in the predicted values and exhibits much lower variance across different trials.

Clipped LARS: To tackle the instability of large batch size training, we propose an optimizer called Clipped LARS. Clipped LARS updates the parameters with the following rule:

$$w_{t+1} = w_t - \gamma \cdot \min\left\{ \frac{\eta \|w_t\|_2}{\|\nabla w_t\|_2 + \beta \|w_t\|_2},\ 1 \right\} \cdot \left( \|\nabla w_t\|_2 + \beta \|w_t\|_2 \right) \quad (1)$$

where $w_t$ is the parameter of a layer after $t$ training steps, $\beta$ is the weight decay coefficient, $\gamma$ is the base learning rate, and $\eta$ is a scaling factor that controls the change in the parameter. As shown in Fig. 3c, LARS (You et al., 2017) shows an over-regularization effect in the early training stage. Clipped LARS overcomes the over-regularization issue by clipping the scaling ratio to at most 1, so that it never magnifies the gradients and only shrinks exploding ones. More details about Clipped LARS can be found in Appendix A.3.
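A minimal NumPy sketch of one Clipped LARS step for a single layer is given below. It rests on one assumption: the final factor of Eq. 1 is read as the regularized gradient vector $\nabla w_t + \beta w_t$, as in LARS (You et al., 2017), so that the update has the same shape as $w_t$. The function name `clipped_lars_step` and the numbers in the example are hypothetical, and momentum and other details from the appendix are omitted.

```python
import numpy as np


def clipped_lars_step(w, grad, lr, eta, beta):
    """One layer-wise update; the LARS trust ratio is clipped to at most 1."""
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    denom = g_norm + beta * w_norm
    # Trust ratio from Eq. 1. A zero denominator falls back to a ratio of 1.
    # Note that an all-zero layer (w_norm == 0) gets ratio 0 in this sketch.
    ratio = eta * w_norm / denom if denom > 0 else 1.0
    scale = min(ratio, 1.0)  # the clip: shrink large steps, never magnify
    # Assumption: the step direction is the vector grad + beta * w, as in LARS.
    return w - lr * scale * (grad + beta * w)


w = np.array([3.0, 4.0])  # ||w|| = 5
# Exploding gradient: the step is shrunk so that its norm is eta * ||w||.
after_big = clipped_lars_step(w, np.array([300.0, 400.0]), lr=1.0, eta=0.03, beta=0.0)
# Tiny gradient: unclipped LARS would scale it up 30x, the clip keeps plain SGD.
after_small = clipped_lars_step(w, np.array([0.003, 0.004]), lr=1.0, eta=0.03, beta=0.0)
```

In the first call the ratio is 0.0003, so the step has norm 0.15; in the second call the ratio would be 30 and is clipped to 1, which is the behavior that avoids magnifying the weight decay term early in training.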

4.4. IMPLEMENTATION

SpeedyZero is highly optimized for higher training throughput and lower latency on critical modules. In this section, we introduce some key system efficiency optimizations. For more details about the implementation of SpeedyZero, please refer to the appendix.

Distributed Data Parallel: We use Distributed Data Parallel provided by PyTorch (Li et al., 2020) to split each training batch across multiple GPUs for faster training.

Data Compression:

We compress the batches sent through the network with the lz4 algorithm, reducing the average network bandwidth requirement by 12×. Also, lz4 features high compression and decompression throughput, causing negligible overhead to our overall latency.

Replay Buffer Replication: We keep a replica of the replay buffer on each reanalysis node. This grants data loaders fast access to the data they need and also reduces traffic over the network. Each trajectory is only transmitted once while priorities are synchronized throughout the training process.

5.1. EXPERIMENT SETUP

Atari 100k Benchmark. The Atari 100k benchmark was proposed for testing the effectiveness of sample-efficient RL methods (Kaiser et al., 2019). It contains 26 Atari games that are deemed solvable with a limited number of samples. In this benchmark, agents are allowed to take at most 100 thousand environment steps, which are equivalent to 400 thousand frames due to a frameskip of 4. EfficientZero is the first method that achieves human performance in terms of both the mean and median of the human normalized score on this benchmark. In our experiments, we test SpeedyZero on the Atari 100k benchmark and also conduct additional experiments on the same set of games with 300k environment steps. Raw performance on each game as well as the mean and median of the human normalized score are reported. The human normalized score is computed as $(\text{score}_{\text{agent}} - \text{score}_{\text{random}}) / (\text{score}_{\text{human}} - \text{score}_{\text{random}})$. The baselines we compare against include SimPLe (Kaiser et al., 2019), CuRL (Srinivas et al., 2020), SPR (Schwarzer et al., 2020)

We observe severe performance degradation on some environments, e.g. Jamesbond, when shifting from the 50min experiment to the 35min experiment. The inconsistency of the performance among different machine configurations also occurred in our early experiments (see A.6 for more details). The main reason for the performance gap is the non-uniform speedup of different components on machines with faster GPUs. For example, increasing the training speed while not accelerating the data generation process will influence the "priority staleness", which measures the model version gap between when a batch is sampled from the replay buffer and when it is trained on.
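As a small illustration of the human normalized score defined in Sec. 5.1 and of how its mean and median are aggregated over games, consider the following sketch. The game names and raw scores are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, median


def human_normalized_score(agent, random, human):
    """(score_agent - score_random) / (score_human - score_random)."""
    return (agent - random) / (human - random)


# Hypothetical raw scores per game: (agent, random, human).
raw_scores = {
    "game_a": (30.0, 10.0, 50.0),
    "game_b": (120.0, 20.0, 70.0),
    "game_c": (5.0, 5.0, 25.0),
}

normalized = {game: human_normalized_score(*s) for game, s in raw_scores.items()}
normalized_mean = mean(normalized.values())
normalized_median = median(normalized.values())
```

A score of 1.0 means human-level play on that game and 0.0 means random-level play. A few very high scores can lift the mean well above the median, which is why both statistics are reported.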

5.3. ABLATION STUDY

Priority Refresh. As stated in Sec. 4.3, the RL agent's final performance suffers considerably from unstable predicted values during the initial training stage. Therefore, we propose Priority Refresh to stabilize the training process. As shown in Fig. 2, when using uniform sampling, the predicted values remain unstable throughout the training process. When using DPER, although the values sometimes seem more stable, the variance across different trials is still considerably high. In contrast, Priority Refresh ensures stable training within a single run and across multiple runs. In Table 2, Priority Refresh achieves the highest score among all sampling methods in a set of environments. This suggests the superiority of Priority Refresh in stabilizing training and improving final performance.

Table 2: Ablation study on different training data sampling strategies, i.e., DPER, uniform sampling, and Priority Refresh. Unif. is the worst in all environments while Priority Refresh outperforms all the baselines in these environments.

Clipped LARS. Table 3 compares the performance of SpeedyZero when using SGD, LARS, and Clipped LARS as the optimizer. When using LARS, training fails completely. Clipped LARS stabilizes large batch size training of SpeedyZero and significantly improves the performance over SGD and LARS, indicating that Clipped LARS is critical to the overall success of SpeedyZero.

5.4. EFFECT OF BATCH SIZE

Prior works have shown that PPO can be easily parallelized and benefits from large batch size training (Stooke & Abbeel, 2018). However, we find that it is hard to train with larger batch sizes in SpeedyZero. Fig. 4 compares PPO and SpeedyZero when using different batch sizes in a number of selected environments. PPO maintains its performance when using larger batch sizes. However, SpeedyZero demonstrates different degrees of performance degradation in different environments, with the most significant degradation occurring when the batch size increases from 2048 to 4096. We find that "Reanalyze staleness" is a critical factor that influences the performance of SpeedyZero and blocks its learning.

Table 4: Bottleneck analysis on the Reanalyze staleness. Oracle staleness is the optimal staleness that could be achieved by the synchronized version of SpeedyZero. SpeedyZero staleness measures the actual Reanalyze staleness achieved by SpeedyZero. Finally, "Staleness Ratio" is the ratio of SpeedyZero staleness to the oracle staleness. A larger staleness ratio indicates a larger gap between the synchronized and asynchronized execution. As the batch size increases, the staleness ratio becomes larger, indicating a much more severe gap between SpeedyZero and the synchronized version.

Reanalyze Staleness. Since reanalysis and training are parallelized, DDP trainers often receive batches that were reanalyzed not by the latest target model but by an older version of the model. This model version gap of the training batches, which we call "Reanalyze staleness", could significantly affect the quality of training.
In practice, we find several factors that contribute to an increased Reanalyze staleness, including improper queue design between the reanalysis node and the trainer node, a large latency of the Reanalyze processes, communication overhead, and compression time for the training batches. When using a larger batch size, which requires a shorter interval between target model updates, the issue of Reanalyze staleness becomes more severe since the latency of all components except the trainers remains the same while the interval between two consecutive target model updates is much shorter.

Table 12. The performance shows a large variance across different machine configurations. The main reason for the performance gap between these experiments is the non-uniform speedup of different components on machines with faster GPUs. For example, increasing the training speed while not accelerating the data generation process will influence the "priority staleness", which measures the model version gap between when a batch is sampled from the replay buffer and when it is trained on. Stabilizing the performance of RL agents across different hardware configurations is an interesting and challenging direction.

Priority Staleness. As data loaders run in parallel with the DDP trainers, there is a time gap between when a batch is sampled and when it is trained on. This means that data points in the batch are sampled according to priorities from an old version of the model. This gap is measured by "priority staleness", i.e. the step at which the batch is trained on minus the step at which it is sampled. We ablate different degrees of priority staleness in Table 13. When the priority staleness is very small, SpeedyZero performs poorly. We hypothesize that a larger priority staleness has a regularizing effect on the sampling probability, preventing the training from focusing on only a limited number of data points.
However, the priority staleness is hard to control in SpeedyZero, since it depends on the latency of multiple components in the system, which could differ a lot on machines with different configurations. We leave the study of optimal priority staleness and improved schemes to control the priority staleness as future work. 
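The priority staleness defined above (the training step at which a batch is consumed minus the step at which it was sampled) can be measured from a simple per-batch log. The sketch below assumes such a log exists; the function name and the logged step numbers are hypothetical.

```python
from statistics import mean


def priority_staleness(sampled_step, trained_step):
    """Trainer step when the batch is trained on minus the step when it was sampled."""
    return trained_step - sampled_step


# Hypothetical log with one (sampled_step, trained_step) pair per batch.
batch_log = [(100, 112), (105, 121), (110, 124)]

staleness = [priority_staleness(s, t) for s, t in batch_log]
mean_staleness = mean(staleness)
```

Logging this quantity on each cluster makes differences between hardware configurations visible, since the gap depends on the combined latency of the data loaders, queues, and trainers.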



Figure 1: System Architecture Comparison between EfficientZero and SpeedyZero: EfficientZero finishes all computation on a single machine. In comparison, SpeedyZero partitions the workflow into data collection (Data Node), batch reanalysis (Reanalysis Node), and training (Trainer Node) and distributes the three stages to different machines. SpeedyZero has three main novelties in its architecture (Sec. 4.2): (1) modular design with non-trivial workload partition, (2) efficient on-node communication with shared memory, (3) data transfer optimizations that reduce CPU-GPU communication and overlap data transfer with computation. It also implements our proposed Priority Refresh (Sec. 4.3) in the priority refresher of the data node for more stable training.

Node expansion in MuZero's MCTS: $s^{k+1}_t = g(s', a')$ and $(v^{k+1}_t, p^{k+1}_t) = f(s^{k+1}_t)$.

Figure 3: L1 norms of the parameters of the representation network of several trials with a larger batch size using SGD, Clipped LARS, and LARS respectively in Breakout. (a) There exist several sudden huge changes in the weights across all trials when using SGD. (b) Clipped LARS significantly stabilizes the training process. (c) LARS causes over-regularization in the initial training stage and numerical instability or excessive shrinkage of the gradient during further training.

Figure 4: The effect of batch size for SpeedyZero compared with a distributed implementation of PPO. We report the scores of PPO and SpeedyZero on selected games with batch sizes of 512/1024/2048/4096. Here PPO consumes 25M samples while SpeedyZero uses 100k samples. With an increasing batch size, PPO maintains its performance while the performance of SpeedyZero drops considerably. The most significant drop happens when increasing the batch size from 2048 to 4096.

We present comprehensive ablation studies on our system components and in-depth analysis on the existing bottlenecks when further scaling SpeedyZero, leading to practical suggestions and open questions for the community.

$O\rangle$, where $S$ is a set of states, $A$ is the set of possible actions, $T$ is a transition function over next states given the actions at current states, and $U : S \times A \to \mathbb{R}$ is the reward function. $\Omega$ is the set of observations of the agent and $O$ maps states to probability distributions over observations. We use $o_{\le t}$ to denote the history of observations at timestep $t$. The objective of the agent is to find a policy $\pi$ that maximizes the expected discounted return $\mathbb{E}\left[\sum_t \gamma^t u_t \mid a_t \sim \pi(\cdot \mid o_{\le t})\right]$

, MuZero and EfficientZero. All baselines use 100k environment steps. For the main results in Sec. 5.2 and the ablation study in Sec. 5.3, the trainer node is configured with 8 DDP trainers and each DDP trainer receives batches of size 256 for training, giving a total batch size of 2048. The model held by each Reanalyze worker is updated every 25 training steps. The models of the priority refreshers and actors are updated every 10 training steps. The total number of training steps is 15k. We run SpeedyZero on two different clusters, resulting in 35min and 50min total running time due to differences in the machine hardware. The detailed hardware configuration of the two clusters is listed in the appendix. As a reference, EfficientZero uses a batch size of 256 and 120k training steps, taking over 8.5 hours to finish training under the same machine configuration used by the 35min experiments of SpeedyZero.

Additionally, we perform experiments with 300k environment steps on two different clusters, which allow SpeedyZero to finish training in 35 and 50 minutes. In the 35 minutes experiment, which accelerates EfficientZero by 14.5×, SpeedyZero achieves a normalized mean of 2.594 and a normalized median of 0.520. In the 50 minutes experiment, SpeedyZero achieves a similar normalized mean to the 35 minutes experiment, i.e. 2.915, and shows a stronger normalized median of 1.113.

Scores and running time achieved by SpeedyZero and some baselines on the Atari 100k benchmark. Compared with previous RL methods, SpeedyZero achieves human-level performance and performs best on 9 out of 26 games with 10× shorter training time. The results of SpeedyZero are evaluated with 100 evaluation episodes. The 35min results of SpeedyZero with 300k environment steps are evaluated with 3 training seeds. The 50min results of SpeedyZero are evaluated with 16 training seeds. The training time of EfficientZero is measured under the same machine configuration as the 300k, 35-minute experiment of SpeedyZero.

Performance of SpeedyZero using SGD, LARS, and Clipped LARS on a number of selected Atari games with a batch size of 2048. LARS completely fails to learn any useful policies. Clipped LARS is significantly better than SGD and LARS.

Table 4 reports the ratio of actual SpeedyZero Reanalyze staleness to the oracle Reanalyze staleness when using different batch sizes. As expected, the ratio increases as the batch size increases, indicating a widening gap between SpeedyZero and the synchronized version.

SpeedyZero achieves human-level performance on the Atari 100k benchmark with only 300k samples and 35 minutes of training. This work is one step towards the application of RL in real-world scenarios where both sample efficiency and training time are mission-critical. We expect future research to further accelerate SpeedyZero and apply SpeedyZero in the real world.

Common hyper-parameters of SpeedyZero.

Best η found for a number of selected games. For other games, we use the default η = 0.03.

RoadRunner, and UpNDown. The best η values found for the selected games are listed in Table 7. For other games, we use η = 0.03 by default. We use the linear scaling rule for large-batch training. The hyper-parameters for large-batch training are shown in Table 8.
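For readers unfamiliar with the linear scaling rule mentioned above, the sketch below shows its usual form: the learning rate is multiplied by the same factor as the batch size. The function name and the base learning rate of 0.2 are hypothetical placeholders chosen for illustration; the values actually used are those given in the hyper-parameter tables.

```python
def linearly_scaled_lr(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Linear scaling rule: scale the learning rate in proportion to the batch size."""
    if base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / base_batch_size


if __name__ == "__main__":
    BASE_LR = 0.2          # hypothetical placeholder, not taken from the paper
    BASE_BATCH_SIZE = 256  # batch size of the EfficientZero reference setting
    for batch_size in (256, 512, 1024, 2048):
        print(batch_size, linearly_scaled_lr(BASE_LR, BASE_BATCH_SIZE, batch_size))
```

Under this rule, moving from a batch size of 256 to 2048 multiplies the learning rate by 8.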

Hyper-parameters of SpeedyZero with different batch sizes.

For PPO, we only change the batch size across experiments. The hyper-parameters of PPO are shown in Table 9. The network architecture we use for PPO is shown in Table 10.

Key system configuration of SpeedyZero.

Early experiment results with a batch size of 512 and 20k training steps. Compared with previous RL methods, SpeedyZero achieves human-level performance and performs best on 12 out of 26 games with 10× shorter training time. The results of SpeedyZero are evaluated with 100 evaluation episodes. The 0.5 hour and 1 hour results of SpeedyZero with 300k environment steps are evaluated with 3 training seeds. The 0.75 hour results of SpeedyZero with 100k and 300k environment steps are evaluated with 16 seeds. The training time of EfficientZero is evaluated under the same machine configuration of the 300k, 30 minutes experiment of SpeedyZero.

ACKNOWLEDGMENTS

This work is supported by the Ministry of Science and Technology of the People's Republic of China, the 2030 Innovation Megaprojects "Program on New Generation Artificial Intelligence" (Grant No. 2021AAA0150000).

A APPENDIX

A.1 CLUSTER HARDWARE CONFIGURATIONS

This section lists the cluster hardware configurations of the 35min and 50min experiments. For the 35min experiments, the trainer node and the data node are both machines with 8 A100 80G GPUs (with NV-Switch), 128 CPU cores, and 1TB of RAM. There are 9 reanalysis nodes, each of which contains 4 A100 80G GPUs (with NV-Switch), 64 CPU cores, and 512GB of RAM. For the 50min experiments, the trainer node and the data node each contain 8 A100 80G GPUs (without NV-Switch), 128 CPU cores, and 512GB of RAM, and each of the 15 reanalysis nodes contains 1 NVIDIA RTX 3090 GPU, 128 CPU cores, and 512GB of RAM.

A.2 MODELS AND HYPER-PARAMETERS

SpeedyZero uses the same model as EfficientZero. The model consists of three modules: a representation function, a dynamics function, and a prediction function, all of which are represented as neural networks. We list the architecture of each module below:

We stack four historical frames as the input observation, with a frame-skip interval of 4. The frames are stacked on the channel dimension, hence the shape of the input is 96 × 96 × 12.
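The frame stacking described above can be sketched as follows. The example assumes 3-channel (RGB) frames, which is what a 96 × 96 × 12 input built from four frames implies, and it assumes that frames are concatenated oldest first; the ordering and all names are illustrative. Nested Python lists are used only to keep the sketch dependency-free, and frame skipping is not modelled.

```python
from collections import deque
from typing import Deque, List, Tuple

Frame = List[List[List[int]]]  # height x width x channels


def make_frame(value: int, height: int = 96, width: int = 96, channels: int = 3) -> Frame:
    """Create a constant-valued frame as a stand-in for an environment observation."""
    return [[[value] * channels for _ in range(width)] for _ in range(height)]


def stack_on_channels(frames: List[Frame]) -> Frame:
    """Concatenate frames along the channel dimension, oldest frame first."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [[c for frame in frames for c in frame[y][x]] for x in range(width)]
        for y in range(height)
    ]


def shape(frame: Frame) -> Tuple[int, int, int]:
    return len(frame), len(frame[0]), len(frame[0][0])


if __name__ == "__main__":
    history: Deque[Frame] = deque(maxlen=4)  # keeps only the four most recent frames
    for t in range(6):
        history.append(make_frame(t))
    observation = stack_on_channels(list(history))
    print(shape(observation))
    print(observation[0][0])  # channel values of one pixel, frames 2..5
```

A bounded deque is a convenient way to keep the most recent four frames: appending a new frame silently discards the oldest one.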

A.5 SYSTEM WORKFLOW AND SYSTEM CONFIGURATIONS

In this section, we describe the detailed workflow of SpeedyZero. We illustrate this by looking at how one data point (environment step) takes effect in the whole training process.

The story begins on the data node. Actors on the data node collect the data point from the environment and put it into the replay buffer. Priority refreshers periodically recompute the priorities of data points in the replay buffer. Since we have replay buffer replicas on reanalysis nodes, we need to synchronize the data points and their priorities to the replicas. Each data point is only sent once (when the trajectory is finished), while priorities are synchronized periodically throughout the training process.

On the reanalysis node, data loaders sample batches based on previously computed priorities and reformat the batches for ease of GPU reanalysis. The batches are then sent into a shared memory queue between data loaders and Reanalyze workers. The separation of data loading and reanalyzing decouples the CPU workload from the GPU workload, improving resource utilization. In our implementation, we partition data loaders and Reanalyze workers into several groups. Each group has one shared memory queue, and the data loaders can only communicate with Reanalyze workers in the same group. We use grouping here to achieve higher bandwidth and lower latency.

Continuing with the workflow, Reanalyze workers then pop batches from the shared memory queue and reanalyze them using MCTS. The Reanalyze workers do not send the reanalyzed batches to the trainers themselves. Instead, they push the batches into another shared memory queue, called the batch queue. Batch senders then take the batches out of the batch queue and send them through the network to the trainer node. This additional step overlaps slow network transmission with computation, since Reanalyze workers can work on the next batch while the batch senders are sending the current batch.

Similarly, on the trainer node, the DDP trainers do not receive batches themselves. The batch receivers are responsible for receiving batches, and they put the batches into a trainer-side batch queue. The DDP trainers directly read the batches from this queue and use them in training.

In Table 11, we show some key system configurations used to set up SpeedyZero on our cluster. This can serve as a reference when setting up SpeedyZero on new clusters.
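The queue-based decoupling described in this section can be illustrated with a small producer/consumer pipeline. The sketch below is not the SpeedyZero implementation: it uses Python threads and `queue.Queue` as stand-ins for processes, shared memory queues, and the network, and all function names are hypothetical. It only shows how each stage can start on the next batch while a later stage is still handling the previous one.

```python
import queue
import threading
import time

_DONE = object()  # sentinel that tells each stage to shut down


def stage(src: queue.Queue, dst: queue.Queue, work) -> None:
    """Pop items from src, apply work, and push results to dst until _DONE arrives."""
    while True:
        item = src.get()
        if item is _DONE:
            dst.put(_DONE)  # propagate shutdown to the next stage
            return
        dst.put(work(item))


def reanalyze(batch: dict) -> dict:
    # Stand-in for MCTS reanalysis by a Reanalyze worker.
    return {**batch, "reanalyzed": True}


def send_over_network(batch: dict) -> dict:
    # Stand-in for slow network transmission by a batch sender.
    time.sleep(0.001)
    return batch


def receive(batch: dict) -> dict:
    # Stand-in for a batch receiver on the trainer node.
    return batch


def data_loader(dst: queue.Queue, num_batches: int) -> None:
    # Stand-in for priority-based sampling from the replay buffer replica.
    for i in range(num_batches):
        dst.put({"id": i})
    dst.put(_DONE)


def run_pipeline(num_batches: int, queue_size: int = 2) -> list:
    """Run all stages concurrently and return the batches seen by the trainer."""
    loader_q, batch_q, network_q, trainer_q = (
        queue.Queue(maxsize=queue_size) for _ in range(4)
    )
    threads = [
        threading.Thread(target=data_loader, args=(loader_q, num_batches)),
        threading.Thread(target=stage, args=(loader_q, batch_q, reanalyze)),
        threading.Thread(target=stage, args=(batch_q, network_q, send_over_network)),
        threading.Thread(target=stage, args=(network_q, trainer_q, receive)),
    ]
    for t in threads:
        t.start()
    trained = []
    while True:  # the "trainer" reads directly from the trainer-side queue
        batch = trainer_q.get()
        if batch is _DONE:
            break
        trained.append(batch)
    for t in threads:
        t.join()
    return trained


if __name__ == "__main__":
    print([b["id"] for b in run_pipeline(5)])
```

The bounded queues provide back-pressure: a fast stage blocks once its output queue is full instead of accumulating unbounded work ahead of a slower stage.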

