PARALLEL Q-LEARNING: SCALING OFF-POLICY REINFORCEMENT LEARNING

Abstract

Reinforcement learning algorithms require a long time to learn policies on complex tasks due to the need for a large amount of training data. With recent advances in GPU-based simulation, such as Isaac Gym, data collection can be sped up thousands of times on a commodity GPU. Most prior works have used on-policy methods such as PPO to train policies in Isaac Gym due to their simplicity and effectiveness in scaling up. Off-policy methods are usually more sample-efficient but more challenging to scale up, resulting in much longer wall-clock training times in practice. In this work, we present a novel Parallel Q-Learning (PQL) framework that is substantially faster in wall-clock time and achieves better sample efficiency than PPO. Our key insight is to parallelize data collection, policy function learning, and value function learning as much as possible. Unlike prior works on distributed off-policy learning, such as Ape-X, our framework is designed specifically for massively parallel GPU-based simulation and optimized to work on a single workstation. We demonstrate the capability of scaling up Q-learning methods to tens of thousands of parallel environments. We also investigate various factors that can affect policy learning speed, including the number of parallel environments, exploration schemes, batch size, GPU models, etc.

1. INTRODUCTION

Reinforcement learning (RL) has achieved impressive results on many problems: video games (Berner et al., 2019; Mnih et al., 2015), robotics (Kober et al., 2013; Miki et al., 2022), drug discovery (Popova et al., 2018), and others. A primary challenge in using RL is the need for large amounts of training data. One way to tackle this problem is to improve off-policy RL algorithms (Mnih et al., 2015; Lillicrap et al., 2015), which make better use of data than on-policy algorithms (Schulman et al., 2017; Mnih et al., 2016). Another strategy is to substantially reduce the need for real-world data collection by training in simulation. Recent works have achieved remarkable success in deploying policies trained in simulation to the real world (Hwangbo et al., 2019; OpenAI et al., 2020; Margolis et al., 2022; Miki et al., 2022). In a sim-to-real training paradigm, the major constraint is not the amount of training data but the wall-clock time. Faster training, measured in wall-clock time, speeds up the experimentation cycle and unlocks the potential for addressing a broader range of more complex problems that currently take a long time to train. The need for faster training has been recognized in the literature, resulting in several distributed learning frameworks (Horgan et al., 2018; Espeholt et al., 2018). Typically, these frameworks operate at the server scale, requiring hundreds or thousands of computers, which makes them infeasible for most researchers. Many of these computers are used to run multiple copies of a "slow" simulator in parallel to feed the learning process. Recent advances in GPU-based simulation, such as Isaac Gym (Makoviychuk et al., 2021), have mitigated the need for a large number of machines by enabling the parallel simulation of thousands of environments on a commodity GPU in a single workstation. One natural question to ask is: what RL algorithm is apt in such a setting?
Many prior works (Allshire et al., 2021; Rudin et al., 2022; Chen et al., 2022) use PPO (Schulman et al., 2017) for training agents in Isaac Gym due to its simplicity and easy-to-scale nature. Running off-policy algorithms such as DDPG (Lillicrap et al., 2015) or SAC (Haarnoja et al., 2018) without making use of parallel environments usually requires much longer training time than PPO, even though these algorithms are more sample efficient. We build upon distributed frameworks for Q-learning developed and deployed in server-scale settings (Horgan et al., 2018; Nair et al., 2015) to leverage GPU-based simulation on a single workstation. We present an approach to scaling up Q-learning, Parallel Q-Learning (PQL), that can be deployed on a workstation to efficiently leverage thousands of environment simulations running in parallel. The key factor that boosts learning speed in PQL is that we parallelize data collection, policy function updates, and value function updates as much as possible on a single workstation. Such parallelization would be non-trivial for on-policy algorithms such as PPO, since the policy update requires on-policy interaction data, which means data collection and policy updates have to happen in sequence. PQL outperforms state-of-the-art algorithms such as PPO in terms of both wall-clock time and data efficiency. We empirically investigate the effectiveness of our method on six Isaac Gym tasks (Makoviychuk et al., 2021), demonstrating the superior performance of PQL against commonly used state-of-the-art (SOTA) RL algorithms. We also analyze several important factors that can affect learning speed, such as the number of parallel environments, batch size, the balance between parallel processes performing simulation, data collection, and learning, the exploration scheme, replay buffer size, the number of GPUs, different GPU hardware, etc. Other noteworthy findings are: (i) we empirically find that DDPG performs better than SAC when using a large number of parallel environments.
(ii) We can mitigate the need for tuning the hyper-parameter controlling exploration. Overall, while previous distributed learning frameworks were accessible only to researchers with access to server-scale compute, we hope our framework, which leverages recent advances in GPU-based simulation, will be a useful tool for the broader research community training RL agents on a single workstation.
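The core idea described above, decoupling data collection from policy and value learning so they run concurrently, can be illustrated with a minimal stand-in sketch. This is not the PQL implementation: the environment, the replay buffer, and the "learner update" below are hypothetical placeholders, and real PQL operates on GPU tensors from Isaac Gym rather than Python objects.

```python
# Illustrative sketch of actor/learner parallelism with a shared replay
# buffer. An actor thread collects batched transitions (as a vectorized
# simulator would produce) while a learner thread samples and "updates"
# concurrently. All components are stand-ins, not Isaac Gym or PQL code.
import threading
import collections
import random

class ReplayBuffer:
    def __init__(self, capacity):
        self.data = collections.deque(maxlen=capacity)  # oldest entries evicted
        self.lock = threading.Lock()

    def add_batch(self, transitions):
        with self.lock:
            self.data.extend(transitions)

    def sample(self, batch_size):
        with self.lock:
            return random.sample(list(self.data), min(batch_size, len(self.data)))

def actor(buffer, num_envs, num_steps):
    # Stand-in for stepping many parallel environments at once:
    # each step yields one transition per environment.
    for step in range(num_steps):
        batch = [(step, env_id, random.random()) for env_id in range(num_envs)]
        buffer.add_batch(batch)

def learner(buffer, num_updates, stats):
    # Stand-in for Q-function updates: just averages sampled rewards.
    total, count = 0.0, 0
    for _ in range(num_updates):
        for _, _, reward in buffer.sample(64):
            total, count = total + reward, count + 1
    stats["mean_reward"] = total / max(count, 1)

buffer = ReplayBuffer(capacity=10_000)
stats = {}
t_actor = threading.Thread(target=actor, args=(buffer, 128, 100))
t_learn = threading.Thread(target=learner, args=(buffer, 200, stats))
t_actor.start(); t_learn.start()
t_actor.join(); t_learn.join()
print(len(buffer.data))  # 10000: 128 envs * 100 steps, capped by capacity
```

An on-policy algorithm could not be decomposed this way: its update consumes only freshly collected data, forcing the two loops to alternate rather than overlap.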

2. RELATED WORK

Massively Parallel Simulation. Simulation has been an important tool for various research fields, including robotics, drug discovery, and physics. In the past, researchers have used simulators such as MuJoCo (Todorov et al., 2012) and PyBullet (Coumans & Bai, 2016) for rigid-body simulation. Recently, there has been a new wave of development in GPU-based simulation, such as Isaac Gym (Makoviychuk et al., 2021). GPU-based simulation has substantially improved simulation speed, allowing a massive amount of parallel simulation on a single commodity GPU. It has been used in various challenging robotic control problems such as quadruped locomotion (Rudin et al., 2022; Margolis et al., 2022) and dexterous manipulation (Chen et al., 2022; Allshire et al., 2021). With the advent of fast simulation, one can obtain much more environment interaction data in the same training time as before. This poses a challenge to the reinforcement learning algorithm: making the best use of the massive amount of data. A straightforward approach is to use on-policy algorithms such as PPO, which can be scaled up easily and is also the default algorithm that researchers use in Isaac Gym. However, on-policy algorithms are less data efficient. In our work, we investigate how to scale up off-policy algorithms to achieve both higher sample efficiency and shorter wall-clock training time with massively parallel simulation. Our parallel training framework runs on a commodity workstation without requiring a large compute cluster.

Distributed Reinforcement Learning. Model-free reinforcement learning typically requires a large number of environment interactions (low sample efficiency). One way to speed up policy learning is distributed training. There have been numerous works developing different distributed reinforcement learning frameworks. One line of work focuses on Q-learning methods.
Gorila (Nair et al., 2015) distributes DQN agents across many machines, where each machine has its own local environment, replay buffer, and value learner, and uses asynchronous SGD for centralized Q-function learning. Similarly, Popov et al. (2017) apply asynchronous SGD to the DDPG algorithm (Lillicrap et al., 2015). Combining prioritized replay (Schaul et al., 2015), n-step returns (Sutton, 1988), and double Q-learning (Hasselt, 2010), Horgan et al. (2018) (Ape-X) parallelize the actor threads (environment interaction) for data collection and use a centralized learner thread for policy and value function learning. Building upon Ape-X, Kapturowski et al. (2018) adapt distributed prioritized experience replay to RNN-based DQN agents.
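Of the Ape-X ingredients mentioned above, the n-step return is the simplest to state concretely: the one-step bootstrap target is replaced by the sum of n discounted rewards plus a discounted bootstrap value. A minimal sketch (the function name and argument layout are illustrative, not from any of the cited works):

```python
# n-step return: sum_{k=0}^{n-1} gamma^k * r_{t+k}  +  gamma^n * V(s_{t+n})
# Computed backwards from the bootstrap value for numerical simplicity.
def n_step_return(rewards, bootstrap_value, gamma, n):
    ret = bootstrap_value
    for r in reversed(rewards[:n]):
        ret = r + gamma * ret
    return ret

# Example: rewards [1, 1, 1], bootstrap V = 4, gamma = 0.5, n = 3:
# 1 + 0.5*(1 + 0.5*(1 + 0.5*4)) = 2.25
print(n_step_return([1.0, 1.0, 1.0], 4.0, 0.5, 3))  # 2.25
```

With n = 1 this reduces to the standard one-step TD target r + gamma * V(s'); larger n propagates reward information faster at the cost of more off-policy bias when the data is stale.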

