LARGE BATCH SIMULATION FOR DEEP REINFORCEMENT LEARNING

Abstract

We accelerate deep reinforcement learning-based training in visually complex 3D environments by two orders of magnitude over prior work, realizing end-to-end training speeds of over 19,000 frames of experience per second on a single GPU and up to 72,000 frames per second on a single eight-GPU machine. The key idea of our approach is to design a 3D renderer and embodied navigation simulator around the principle of "batch simulation": accepting and executing large batches of requests simultaneously. Beyond exposing large amounts of work at once, batch simulation allows implementations to amortize in-memory storage of scene assets, rendering work, data loading, and synchronization costs across many simulation requests, dramatically improving the number of simulated agents per GPU and overall simulation throughput. To balance DNN inference and training costs with faster simulation, we also build a computationally efficient policy DNN that maintains high task performance, and modify training algorithms to maintain sample efficiency when training with large mini-batches. By combining batch simulation and DNN performance optimizations, we demonstrate that PointGoal navigation agents can be trained in complex 3D environments on a single GPU in 1.5 days to 97% of the accuracy of agents trained on a prior state-of-the-art system using a 64-GPU cluster over three days. We provide open-source reference implementations of our batch 3D renderer and simulator to facilitate incorporation of these ideas into RL systems.

1. INTRODUCTION

Speed matters. It is now common for modern reinforcement learning (RL) algorithms leveraging deep neural networks (DNNs) to require billions of samples of experience from simulated environments (Wijmans et al., 2020; Petrenko et al., 2020; OpenAI et al., 2019; Silver et al., 2017; Vinyals et al., 2019). For embodied AI tasks such as visual navigation, where the ultimate goal for learned policies is deployment in the real world, learning from realistic simulations is important for successful transfer of learned policies to physical robots. In these cases, simulators must render detailed 3D scenes and simulate agent interaction with complex environments (Kolve et al., 2017; Dosovitskiy et al., 2017; Savva et al., 2019; Xia et al., 2020; Gan et al., 2020). Evaluating and training a DNN on billions of simulated samples is computationally expensive. For instance, the DD-PPO system (Wijmans et al., 2020) used 64 GPUs over three days to learn from 2.5 billion frames of experience and achieve near-perfect PointGoal navigation in 3D scanned environments of indoor spaces. At an even larger distributed training scale, OpenAI Five used over 50,000 CPUs and 1000 GPUs to train Dota 2 agents (OpenAI et al., 2019). Unfortunately, experiments at this scale are out of reach for most researchers. This problem will only grow worse as the field explores more complex tasks in more detailed environments. Many efforts to accelerate deep RL focus on improving the efficiency of DNN evaluation and training, e.g., by "centralizing" computations to facilitate efficient batch execution on GPUs or TPUs (Espeholt et al., 2020; Petrenko et al., 2020) or by parallelizing across GPUs (Wijmans et al., 2020).
However, most RL platforms still accelerate environment simulation by running many copies of off-the-shelf, unmodified simulators, such as simulators designed for video game engines (Bellemare et al., 2013; Kempka et al., 2016; Beattie et al., 2016; Weihs et al., 2020), on large numbers of CPUs or GPUs. This approach is a simple and productive way to improve simulation throughput, but it makes inefficient use of computation resources. For example, when rendering complex environments (Kolve et al., 2017; Savva et al., 2019; Xia et al., 2018), a single simulator instance might consume gigabytes of GPU memory, limiting the total number of instances to far below the parallelism afforded by the machine. Further, running many simulator instances (in particular when they are distributed across machines) can introduce overhead in synchronization and communication with other components of the RL system. Inefficient environment simulation is a major reason RL platforms typically require scale-out parallelism to achieve high end-to-end system throughput.

In this paper, we crack open the simulation black box and take a holistic approach to co-designing a 3D renderer, simulator, and RL training system. Our key contribution is batch simulation for RL: designing high-throughput simulators that accept large batches of requests as input (aggregated across different environments, potentially with different assets) and efficiently execute the entire batch at once.
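The interface implied by this design can be sketched in a few lines. The sketch below is illustrative only: the class and method names are hypothetical stand-ins, not the paper's actual API. The key point is that a single simulator instance steps N environments in one call, so rendering, data transfer, and synchronization costs are paid once per batch rather than once per environment.

```python
import numpy as np

class BatchSimulator:
    """Toy stand-in for a batch simulator: steps N environments in one call.
    (Hypothetical interface; a real implementation would issue one set of
    GPU draw commands covering every environment's scene.)"""
    def __init__(self, num_envs, obs_shape=(64, 64)):
        self.num_envs = num_envs
        self.obs_shape = obs_shape

    def step(self, actions):
        # One batched request: actions for all environments in, one
        # batched tensor of observations out.
        assert actions.shape == (self.num_envs,)
        obs = np.zeros((self.num_envs, *self.obs_shape), dtype=np.float32)
        rewards = np.zeros(self.num_envs, dtype=np.float32)
        dones = np.zeros(self.num_envs, dtype=bool)
        return obs, rewards, dones

sim = BatchSimulator(num_envs=1024)
actions = np.zeros(1024, dtype=np.int64)
obs, rewards, dones = sim.step(actions)  # one call steps all 1024 envs
```

Because the observation batch comes back as a single contiguous tensor, it can be handed directly to batched policy inference without per-environment copies or synchronization.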
Exposing work en masse facilitates a number of optimizations: we reduce memory footprint by sharing scene assets (geometry and textures) across rendering requests (enabling orders of magnitude more environments to be rendered simultaneously on a single GPU), amortize rendering work using GPU commands that draw triangles from multiple scenes at once, hide the latency of scene I/O, and exploit batch transfer to reduce data communication and synchronization costs between the simulator, DNN inference, and training. To further improve end-to-end RL speedups, the DNN workload must be optimized to match high simulation throughput, so we design a computationally efficient policy DNN that still achieves high task performance in our experiments. Large-batch simulation increases the number of samples collected per training iteration, so we also employ techniques from large-batch supervised learning to maintain sample efficiency in this regime.

We evaluate batch simulation on the task of PointGoal navigation (Anderson et al., 2018) in 3D scanned Gibson and Matterport3D environments, and show that end-to-end optimization of batched rendering, simulation, inference, and training yields a 110× speedup over state-of-the-art prior systems, while achieving 97% of the task performance for depth-sensor-driven agents and 91% for RGB-camera-driven agents. Concretely, we demonstrate sample generation and training at over 19,000 frames of experience per second on a single GPU.¹ In real-world terms, a single GPU is capable of training a virtual agent on 26 years of experience in a single day.² This new performance regime significantly improves the accessibility and efficiency of RL research in realistic 3D environments, and opens new possibilities for more complex embodied tasks in the future.
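One widely used technique from large-batch supervised learning is the linear learning-rate scaling rule: grow the learning rate in proportion to the mini-batch size. The snippet below is a minimal sketch of that rule with illustrative numbers; it is not a claim about the specific technique or hyperparameters this paper uses.

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: scale the learning rate by the ratio of the
    large batch size to the reference batch size (illustrative values)."""
    return base_lr * (batch / base_batch)

# e.g. a reference setting of lr=2.5e-4 at batch 256, scaled to batch 8192:
lr = scaled_lr(2.5e-4, 256, 8192)  # -> 8e-3 (32x larger batch, 32x larger lr)
```

In practice such scaling is usually paired with a warmup schedule, since very large learning rates can destabilize early training.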

2. RELATED WORK

Systems for high-performance RL. Existing systems for high-performance RL have primarily focused on improving the efficiency of the DNN components of the workload (policy inference and optimization), treating a simulator designed for efficient single-agent simulation as a black box. For example, IMPALA and Ape-X used multiple worker processes to asynchronously collect experience for a centralized learner (Espeholt et al., 2018; Horgan et al., 2018). SEED RL and Sample Factory built upon this idea and introduced inference workers that centralize network inference, thereby allowing it to be accelerated by GPUs or TPUs (Espeholt et al., 2020; Petrenko et al., 2020). DD-PPO proposed a synchronous distributed system for similar purposes (Wijmans et al., 2020). A number



¹ Samples of experience used for learning, not the 'frameskipped' frame counts typically reported for Atari/DMLab.
² Calculated from the rate at which a physical robot (LoCoBot (Carnegie Mellon University, 2019)) collects observations when operating constantly at maximum speed (0.5 m/s) and capturing one frame every 0.25 m.
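The "26 years of experience in a single day" figure follows directly from the footnote's assumptions and can be checked with a few lines of arithmetic (a sanity check, not taken from the paper):

```python
# Robot from the footnote: 0.5 m/s, one observation every 0.25 m -> 2 obs/s.
robot_fps = 0.5 / 0.25

# Single-GPU simulation throughput from the text: 19,000 frames/second.
sim_fps = 19_000

# Frames a single GPU generates in one day of wall-clock time.
frames_per_day = sim_fps * 86_400

# Time (in years) the robot would need to collect that much experience.
robot_years = frames_per_day / robot_fps / (86_400 * 365)
print(round(robot_years))  # -> 26
```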



Figure 1: We train agents to perform PointGoal navigation in visually complex Gibson (Xia et al., 2018) and Matterport3D (Chang et al., 2017) environments such as the ones shown here. These environments feature detailed scans of real-world scenes composed of up to 600K triangles and high-resolution textures. Our system is able to train agents using 64×64 depth sensors (a high-resolution example is shown on the left) in these environments at 19,900 frames per second, and agents with 64×64 RGB cameras at 13,300 frames per second on a single GPU.

