RAISIN: RESIDUAL ALGORITHMS FOR VERSATILE OFFLINE REINFORCEMENT LEARNING

Abstract

The residual gradient algorithm (RG), gradient descent on the mean squared Bellman error, brings robust convergence guarantees to bootstrapped value estimation. Meanwhile, the far more common semi-gradient algorithm (SG) suffers from well-known instabilities and divergence. Unfortunately, RG often converges slowly in practice. Baird (1995) proposed residual algorithms (RA), a weighted averaging of RG and SG, to combine RG's robust convergence with SG's speed. RA works moderately well in the online setting. We find, however, that RA works disproportionately well in the offline setting. Concretely, we find that merely adding a variable residual component to SAC gives state-of-the-art scores on about half of the D4RL gym tasks. We further show that using the minimum of ten critics lets our algorithm approximately match SAC-N's state-of-the-art returns using 50× less compute. In contrast, TD3+BC with the same minimum-of-ten-critics trick does not match SAC-N's returns on many environments. The only hyperparameter we tune is our residual weight; we leave all other hyperparameters unchanged from SAC-N.

1. INTRODUCTION

Strong data scaling has given us baffling success in supervised learning. Offline reinforcement learning (offline RL) holds promise for RL to scale with that same success, among other benefits. Despite all the compelling motivations of offline RL (Levine et al., 2020), we still lack a simple, versatile, and computationally efficient solution. By versatile, we mean algorithms that attain high returns when trained on any of a diverse range of datasets, such as data collected by greatly differing policies.

Arguably the simplest and most versatile approach thus far is SAC-N (An et al., 2021), which uses the minimum of N critics instead of SAC's usual two critics. SAC-N achieves state-of-the-art scores but, unfortunately, requires up to 500 critics for sufficient pessimism on benchmark problems. Hu et al. (2022) illustrate that stronger pessimism, specifically a smaller discount factor, enables SAC-N to solve harder tasks (Rajeswaran et al., 2017). A smaller discount factor is a simple and computationally efficient source of pessimism but not a versatile one: it increases bias (Zhang et al., 2020).

In this paper, we identify residual algorithms (RA) (Baird, 1995) as a simple, versatile, and computationally efficient source of pessimism for SAC-N. As explained above, RA saw moderate success in its goal of fusing RG's convergence with SG's speed. Recently, Zhang et al. (2019) found similar success when extending RA to deep learning. But we find RA truly excels in the offline setting. Prior works in both the online and offline settings (Geist et al., 2016; Fujimoto et al., 2022; Saleh & Jiang, 2019) show that, while RG performs well with data near the optimal policy, RG consistently fails when the data is far from the optimal policy. Our key insight is that RA allows for the adjustable exploitation of RG's natural pessimism.
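To make the RG/SG mixture concrete, the following is a minimal sketch of Baird's (1995) residual update for a linear value function V(s) = w·φ(s). The function name `ra_update` and the symbol `eta` for the residual weight are our illustrative choices, not notation from any particular implementation; the key point is that η = 0 recovers the semi-gradient update and η = 1 recovers the full residual-gradient update.

```python
import numpy as np

def ra_update(w, phi_s, phi_s2, r, gamma, alpha, eta):
    """One residual-algorithm (RA) step for a linear value function
    V(s) = w @ phi(s), after Baird (1995).

    eta = 0 gives the semi-gradient (SG) update;
    eta = 1 gives the full residual-gradient (RG) update;
    intermediate eta interpolates between the two.
    """
    # Bellman error on this transition (s, r, s')
    delta = r + gamma * (w @ phi_s2) - w @ phi_s
    # SG differentiates only through V(s), contributing phi(s).
    # RG also differentiates through the bootstrapped target,
    # contributing the -gamma * phi(s') term, scaled here by eta.
    return w + alpha * delta * (phi_s - eta * gamma * phi_s2)
```

With deep function approximation the same interpolation is typically realized by partially stopping gradients through the bootstrapped target, but the linear case above already exhibits the trade-off the paper exploits.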
In other words, a weighted RG component may serve as a superior alternative to the widespread use of a weighted behavior-cloning component for offline RL (Fujimoto & Gu, 2021; Buckman et al., 2020). Critically, however, we also find that no single weight for the RG component works well universally: it must be tuned per dataset, similar to SAC-N (An et al., 2021). We discuss potential routes for automatic tuning in Section 5. We propose Raisin, roughly RA for SAC-N, giving D4RL (Fu et al., 2020) gym scores roughly matching those of SAC-N, the state of the art, with one-fiftieth of the critics. EDAC (An et al., 2021)

