PARALLEL Q-LEARNING: SCALING OFF-POLICY REINFORCEMENT LEARNING

Abstract

Reinforcement learning algorithms require a long time to learn policies on complex tasks due to the need for a large amount of training data. With the recent advances in GPU-based simulation, such as Isaac Gym, data collection has been sped up thousands of times on a commodity GPU. Most prior works have used on-policy methods such as PPO to train policies in Isaac Gym due to their simplicity and effectiveness in scaling up. Off-policy methods are usually more sample-efficient but more challenging to scale up, resulting in a much longer wall-clock training time in practice. In this work, we present a novel Parallel Q-Learning (PQL) framework that is substantially faster in wall-clock time and achieves better sample efficiency than PPO. Our key insight is to parallelize data collection, policy function learning, and value function learning as much as possible. Different from prior works on distributed off-policy learning, such as Ape-X, our framework is designed specifically for massively parallel GPU-based simulation and optimized to work on a single workstation. We demonstrate the capability of scaling up Q-learning methods to tens of thousands of parallel environments. We also investigate various factors that can affect policy learning speed, including the number of parallel environments, exploration schemes, batch size, GPU models, etc.

1. INTRODUCTION

Reinforcement learning (RL) has achieved impressive results on many problems: video games (Berner et al., 2019; Mnih et al., 2015), robotics (Kober et al., 2013; Miki et al., 2022), drug discovery (Popova et al., 2018), and others. A primary challenge in using RL is the need for large amounts of training data. One way to tackle this problem is by improving off-policy RL algorithms (Mnih et al., 2015; Lillicrap et al., 2015) that make better use of data than on-policy algorithms (Schulman et al., 2017; Mnih et al., 2016). Another strategy is to substantially reduce the need for real-world data collection by training in simulation. Recent works have achieved remarkable success in deploying policies trained in simulation to the real world (Hwangbo et al., 2019; OpenAI et al., 2020; Margolis et al., 2022; Miki et al., 2022). In a sim-to-real training paradigm, the major constraint is not the amount of training data but the wall-clock time. Faster training, measured in wall-clock time, speeds up the experimentation cycle and unlocks the potential for addressing a broader range of more complex problems that currently take a long time to train. The need for faster training has been recognized in the literature, resulting in several distributed learning frameworks (Horgan et al., 2018; Espeholt et al., 2018). Typically, these frameworks operate at the server scale, requiring hundreds or thousands of computers, making them infeasible for most researchers. Many of these computers are used to run multiple copies of a "slow" simulator in parallel to feed the learning process. Recent advances in GPU-based simulation, such as Isaac Gym (Makoviychuk et al., 2021), have mitigated the need for a large number of machines by enabling the parallel simulation of thousands of environments on a commodity GPU on a single workstation. One natural question to ask is: what RL algorithm is apt in such a setting?
Many prior works (Allshire et al., 2021; Rudin et al., 2022; Chen et al., 2022) use PPO (Schulman et al., 2017) for training agents in Isaac Gym due to its simplicity and easy-to-scale nature. Intuitively, by virtue of requiring less data than on-policy algorithms, off-policy algorithms should reduce the wall-clock time of training. However, a naive implementation of off-policy algorithms

