Q-ENSEMBLE FOR OFFLINE RL: DON'T SCALE THE ENSEMBLE, SCALE THE BATCH SIZE

Abstract

Training large neural networks is known to be time-consuming, with the learning duration taking days or even weeks. To address this problem, large-batch optimization was introduced. This approach demonstrated that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude. While long training time has not typically been a major issue for model-free deep offline RL algorithms, recently introduced Q-ensemble methods, which achieve state-of-the-art performance, have made this issue more relevant by notably extending the training duration. In this work, we demonstrate how this class of methods can benefit from large-batch optimization, which is commonly overlooked by the deep offline RL community. We show that scaling the mini-batch size and naively adjusting the learning rate allow for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time, effectively shortening training duration by 3-4x on average.

1. INTRODUCTION

Offline Reinforcement Learning (ORL) provides a data-driven perspective on learning decision-making policies by using previously collected data without any additional online interaction during the training process (Lange et al., 2012; Levine et al., 2020). Despite its recent methodological (Fujimoto et al., 2019; Nair et al., 2020; An et al., 2021; Zhou et al., 2021; Kumar et al., 2020) and application progress (Zhan et al., 2022; Apostolopoulos et al., 2021; Soares et al., 2021), one of the persistent challenges in ORL remains extrapolation error: the inability to correctly estimate the values of unseen actions (Fujimoto et al., 2019). Numerous algorithms were designed to address this issue. For example, IQL (Kostrikov et al., 2021) avoids estimation for out-of-sample actions entirely, while CQL (Kumar et al., 2020) penalizes out-of-distribution actions such that their values are lower-bounded. Other methods explicitly push the learned policy closer to the behavioral one (Fujimoto & Gu, 2021; Nair et al., 2020; Wang et al., 2020). In contrast to these approaches, An et al. (2021) demonstrated that simply increasing the number of value estimates in the Soft Actor-Critic (SAC) (Haarnoja et al., 2018) algorithm is enough to advance state-of-the-art performance consistently across various datasets in the D4RL benchmark (Fu et al., 2020). Furthermore, An et al. (2021) showed that the clipped double Q-learning trick actually serves as an uncertainty-quantification mechanism providing a lower bound of the estimate, so that simply increasing the number of critics can result in sufficient penalization of out-of-distribution actions. Despite its state-of-the-art results, the performance gain on some datasets requires significant computation time or optimization of an additional term, leading to extended training duration (Figure 2).
In this paper, inspired by parallel work on reducing the training time of large models in other areas of deep learning (You et al., 2017; 2019), commonly referred to as large-batch optimization, we study the overlooked use of large batches 1 in the deep ORL setting. We demonstrate that, instead of increasing the number of critics or introducing an additional optimization term in the SAC-N algorithm (An et al., 2021), simple batch scaling and naive adjustment of the learning rate can (1) provide a sufficient penalty on out-of-distribution actions and (2) match state-of-the-art performance on the D4RL benchmark. Moreover, this large-batch optimization approach significantly reduces convergence time, making it possible to train models 4x faster on a single-GPU setup. To the best of our knowledge, this is the first study to examine large-batch optimization in the ORL setup.
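To make "batch scaling with a naive learning-rate adjustment" concrete, here is an illustrative Python sketch. The base values and the linear/square-root scaling rules are common conventions from the large-batch literature (linear scaling per Goyal et al., 2017; square-root scaling is often preferred with adaptive optimizers), not the paper's exact recipe:

```python
import math

BASE_BATCH = 256   # conventional mini-batch size in deep offline RL (assumed baseline)
BASE_LR = 3e-4     # common default learning rate for SAC-style critics (assumed baseline)

def scaled_lr(batch_size: int, rule: str = "sqrt") -> float:
    """Adjust the learning rate when the mini-batch grows by a factor k.

    'linear' is the rule of Goyal et al. (2017); 'sqrt' is a softer rule
    often used with adaptive optimizers such as Adam.
    """
    k = batch_size / BASE_BATCH
    if rule == "linear":
        return BASE_LR * k
    if rule == "sqrt":
        return BASE_LR * math.sqrt(k)
    raise ValueError(f"unknown rule: {rule}")

# e.g., a 10K mini-batch with square-root scaling
lr = scaled_lr(10_000)  # sqrt(10000/256) = 6.25, so lr ≈ 1.9e-3
```

Which rule works best is an empirical question; the point is only that the batch size and learning rate are scaled together rather than independently.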

2. Q-ENSEMBLE FOR OFFLINE RL

Ensembles have a long history of applications in the reinforcement learning community. They are employed in model-based approaches to combat compounding error and model exploitation (Kurutach et al., 2018; Chua et al., 2018; Lai et al., 2020; Janner et al., 2019), in model-free methods to greatly increase sample efficiency (Chen et al., 2021; Hiraoka et al., 2021; Liang et al., 2022), and more generally to boost exploration in online RL (Osband et al., 2016; Chen et al., 2017; Lee et al., 2021; Ciosek et al., 2019). In offline RL, ensembles were mostly utilized to model epistemic uncertainty in value function estimation (Agarwal et al., 2020; Bai et al., 2022; Ghasemipour et al., 2022), introducing uncertainty-aware conservatism. Recently, An et al. (2021) investigated the isolated effect of clipped Q-learning on value overestimation in offline RL, increasing the number of critics in the Soft Actor-Critic (Haarnoja et al., 2018) algorithm from 2 to N. Surprisingly, with a tuned N, SAC-N outperformed previous state-of-the-art algorithms on the D4RL benchmark (Fu et al., 2020) by a large margin, although requiring up to 500 critics on some datasets. To reduce the ensemble size, An et al. (2021) proposed EDAC, which adds an auxiliary loss to diversify the ensemble, allowing a much smaller N (Figure 1). Such a pessimistic Q-ensemble can be interpreted as utilizing the Lower Confidence Bound (LCB) of the Q-value predictions.
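In clipped Q-learning, the TD target is built from the minimum over the ensemble, so actions on which the critics disagree receive pessimistic value estimates. A minimal numpy sketch of this SAC-N-style target (names, shapes, and hyperparameters are illustrative, not the authors' implementation):

```python
import numpy as np

def sac_n_target(q_values, reward, done, log_prob, gamma=0.99, alpha=0.2):
    """Compute the clipped Q-ensemble TD target.

    q_values: (N, batch) next-state Q estimates from N critics.
    reward, done, log_prob: (batch,) arrays; log_prob is log pi(a'|s').
    """
    min_q = q_values.min(axis=0)           # clip: minimum over the ensemble
    soft_value = min_q - alpha * log_prob  # SAC entropy regularization
    return reward + gamma * (1.0 - done) * soft_value

# Toy example: 3 critics, batch of 2
q = np.array([[1.0, 2.0],
              [0.5, 3.0],
              [0.8, 1.5]])
y = sac_n_target(q,
                 reward=np.array([1.0, 0.0]),
                 done=np.array([0.0, 1.0]),
                 log_prob=np.array([0.0, 0.0]))
# y[0] = 1.0 + 0.99 * min(1.0, 0.5, 0.8) = 1.495; y[1] = 0.0 (terminal)
```

Increasing N lowers the expected minimum, which is exactly the penalization mechanism quantified by the LCB interpretation discussed next.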
Assuming that $Q(s, a)$ follows a Gaussian distribution with mean $m(s, a)$ and standard deviation $\sigma(s, a)$, and that $\{Q_j(s, a)\}_{j=1}^{N}$ are realizations of $Q(s, a)$, we can approximate the expected minimum (An et al., 2021; Royston, 1982)



1 Offline RL is often referred to as Batch RL; here, we use the term batch extensively to denote a mini-batch, not the dataset size.



Figure 1: The difference between the recently introduced SAC-N, EDAC, and the proposed LB-SAC. The introduced approach does not require an auxiliary optimization term while making it possible to effectively reduce the number of critics in the Q-ensemble by switching to the large-batch optimization setting.

as

$$\mathbb{E}\left[\min_{j=1,\dots,N} Q_j(s,a)\right] \approx m(s,a) - \Phi^{-1}\!\left(\frac{N - \frac{\pi}{8}}{N - \frac{\pi}{4} + 1}\right)\sigma(s,a),$$

where $\Phi$ is the CDF of the standard Gaussian distribution.
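This closed-form approximation is easy to check numerically. The following Python snippet (an illustration, not code from the paper) compares it against a Monte-Carlo estimate of the expected minimum of N Gaussian samples:

```python
import math
import random
from statistics import NormalDist

def lcb(m: float, sigma: float, n: int) -> float:
    """Royston (1982) approximation of E[min of n Gaussian samples]."""
    p = (n - math.pi / 8) / (n - math.pi / 4 + 1)
    return m - NormalDist().inv_cdf(p) * sigma

random.seed(0)
m, sigma, n = 0.0, 1.0, 10
trials = 100_000
mc = sum(min(random.gauss(m, sigma) for _ in range(n))
         for _ in range(trials)) / trials
# lcb(m, sigma, n) and mc should agree closely (roughly -1.5 for N = 10)
```

Note how the penalty grows with N: a larger ensemble pushes the expected minimum further below the mean, which is why SAC-N can trade ensemble size against the strength of out-of-distribution penalization.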

