VARIANCE DOUBLE-DOWN: THE SMALL BATCH SIZE ANOMALY IN MULTI-STEP DEEP REINFORCEMENT LEARNING

Abstract

In deep reinforcement learning, multi-step learning is almost unavoidable for achieving state-of-the-art performance. However, the increased variance that multi-step learning brings makes it difficult to increase the update horizon beyond relatively small values. In this paper, we report the counterintuitive finding that decreasing the batch size parameter improves the performance of many standard deep RL agents that use multi-step learning. It is well known that gradient variance decreases with increasing batch size, so obtaining improved performance by increasing variance on two fronts is a rather surprising finding. We conduct a broad set of experiments to better understand what we call the variance double-down phenomenon.

1. INTRODUCTION

Deep reinforcement learning (DRL), which combines traditional reinforcement learning (RL) techniques with neural networks, has had a number of recent successes, including achieving superhuman performance on challenging games (Mnih et al., 2015; Schrittwieser et al., 2020; Perolat et al., 2022), overcoming difficult robotics challenges (Andrychowicz et al., 2020; Smith et al., 2022), and being successfully applied to large-scale real-world tasks (Bellemare et al., 2020; Degrave et al., 2022). Yet successful application of DRL to new problems remains a challenge, in large part due to the difficulty of understanding how neural network training is affected by the vast number of hyper-parameters involved. Despite a number of recent works developing a greater understanding of the dynamics of training neural networks for reinforcement learning (Ceron & Castro, 2021; Araújo et al., 2021; Nikishin et al., 2022; Ostrovski et al., 2021; Schaul et al., 2022), the relationship between particular hyper-parameter configurations and performance on a given environment remains hard to predict.

One generally held desire in training neural networks is to reduce the variance of gradient updates, so as to avoid unstable and unreliable learning. In the reinforcement learning literature, for example, there has been a growing trend toward multi-step (or n-step) learning (Hessel et al., 2018; Schwarzer et al., 2020; Kapturowski et al., 2018; Agarwal et al., 2022) for improved performance. Despite its demonstrated advantage, researchers have been limited to small values of n to avoid performance collapse, in part due to the increased variance arising from larger n. The supervised learning literature suggests that an effective mechanism for mitigating variance is the choice of batch size: Shallue et al. (2019) empirically demonstrate that larger batch sizes result in reduced variance and improved performance.
In this paper, we report the counterintuitive finding that reducing the batch size can help avoid performance collapse with larger n-step updates. This is effectively doubling down on increased variance for improved performance. We showcase this anomaly in a broad set of training regimens and value-based RL agents, and conduct an empirical analysis to develop a better understanding of its causes. Additionally, we demonstrate that reduced batch sizes also result in reduced overall computation time during training. In Appendix A we provide background on deep reinforcement learning, including a description of n-step updates and batch sizes.
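For reference (full background is in Appendix A), the n-step bootstrapped target that standard value-based agents regress toward can be sketched as follows. The function name and flat-array interface are illustrative choices for this sketch, not the agents' actual implementation:

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """n-step bootstrapped target:
        sum_{k=0}^{n-1} gamma^k * r_{t+k}  +  gamma^n * max_a Q(s_{t+n}, a),
    where n = len(rewards) and bootstrap_value stands in for the
    max-Q estimate at the state reached after n steps."""
    n = len(rewards)
    discounted = sum(gamma**k * r for k, r in enumerate(rewards))
    return discounted + gamma**n * bootstrap_value
```

With n = 1 this reduces to the familiar one-step TD target; larger n folds more (noisy) environment rewards into the target, which is the source of the added variance discussed above.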

2. CASE STUDY: THE ARCADE LEARNING ENVIRONMENT

Advances in deep reinforcement learning (DRL) often build on prior algorithms, network architectures, and hyper-parameter selections. Given the large number of options, new work typically re-tunes only those components necessary for the new methods being considered. Thus, we have accumulated a set of mostly static parameters upon which new ideas are tested (this may be a form of the "social dynamics of research" hypothesized by Schaul et al. (2022)). One of the static parameters for training single-agent value-based agents has been the choice of batch size. Since the introduction of DQN by Mnih et al. (2015), single-agent training on the Arcade Learning Environment (ALE; Bellemare et al., 2013) has used a batch size of 32, where this value was carefully tuned by the authors for performance. Since then, this value has rarely been changed, save for distributed agent training (Kapturowski et al., 2018; Espeholt et al., 2018). If one follows the general advice from the supervised learning literature, one should aim to increase the batch size so as to reduce variance and improve performance (Shallue et al., 2019). We focus on the effect of changing the batch size, while keeping all else equal.
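The supervised-learning intuition referenced above is easy to verify numerically: averaging B i.i.d. per-sample gradients reduces the variance of the resulting update by a factor of roughly B. A minimal synthetic sketch (the Gaussian "gradients" here are purely illustrative, not real network gradients):

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_mean_variance(batch_size, num_batches=100_000):
    # Simulate noisy per-sample "gradients" (mean 1.0, std 2.0) and measure
    # the empirical variance of the mini-batch average across many batches.
    grads = rng.normal(loc=1.0, scale=2.0, size=(num_batches, batch_size))
    return grads.mean(axis=1).var()

v8 = batch_mean_variance(8)
v32 = batch_mean_variance(32)
# Variance of the batch-averaged gradient shrinks roughly as 1/batch_size,
# so the ratio below is close to 32 / 8 = 4.
print(v8 / v32)
```

This 1/B scaling is exactly why the result in this paper is surprising: shrinking the batch size increases update variance, yet helps multi-step agents.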

2.1. EXPERIMENTAL SETUP

For this case study, we use JAX implementations of agents provided by the Dopamine library (Castro et al., 2018), applied to game-playing in the ALE (Bellemare et al., 2013).foot_0 For computational reasons, we evaluate our agents on 20 games chosen by Fedus et al. (2020) in their analysis of replay ratios; these were picked to offer a diversity of difficulty and dynamics. Similarly, we run each learning trial for 100 million frames (as opposed to the standard 200 million); in exploratory experiments, we determined that the differences between 100M and 200M frames are insubstantial for our purposes. The four agents we consider are: DQN (Mnih et al., 2015), Rainbow (Hessel et al., 2018), QR-DQN, and IQN.



Dopamine uses sticky actions by default (Machado et al., 2018).
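Concretely, Dopamine exposes the two hyper-parameters studied here through gin-config bindings. A sketch of the relevant lines follows; the binding names are taken from Dopamine's published JAX DQN config, but both the names and the values shown should be treated as illustrative assumptions rather than the exact configuration used in these experiments:

```
# Sketch of a Dopamine gin configuration varying the update horizon (the
# multi-step n) and the batch size while keeping all else equal.
JaxDQNAgent.gamma = 0.99
JaxDQNAgent.update_horizon = 3          # multi-step n
OutOfGraphReplayBuffer.batch_size = 8   # reduced from the standard 32
```

Sweeping these two bindings, with all other hyper-parameters fixed, is the essence of the experimental protocol in this section.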



Figure 1: Varying batch sizes for DQN, Rainbow, QR-DQN, and IQN.

