RAISIN: RESIDUAL ALGORITHMS FOR VERSATILE OFFLINE REINFORCEMENT LEARNING

Abstract

The residual gradient algorithm (RG), gradient descent on the Mean Squared Bellman Error, brings robust convergence guarantees to bootstrapped value estimation. Meanwhile, the far more common semi-gradient algorithm (SG) suffers from well-known instabilities and divergence. Unfortunately, RG often converges slowly in practice. Baird (1995) proposed residual algorithms (RA), a weighted average of RG and SG, to combine RG's robust convergence with SG's speed. RA works moderately well in the online setting. We find, however, that RA works disproportionately well in the offline setting. Concretely, we find that merely adding a variable residual component to SAC gives state-of-the-art scores on about half of the D4RL gym tasks. We further show that using the minimum of ten critics lets our algorithm approximately match SAC-N's state-of-the-art returns with 50× less compute. In contrast, TD3+BC with the same minimum-of-ten-critics trick does not match SAC-N's returns on many environments. The only hyperparameter we tune is our residual weight; we leave all other hyperparameters unchanged from SAC-N.

1. INTRODUCTION

Strong data scaling has given us remarkable success in supervised learning. Offline reinforcement learning (offline RL) holds promise for RL to scale with that same success, among other benefits. Despite all the compelling motivations for offline RL (Levine et al., 2020), we still lack a simple, versatile, and computationally efficient solution. By versatile, we mean algorithms that attain high returns when trained on any of a diverse range of datasets, such as data collected by greatly differing policies. Arguably the simplest and most versatile approach thus far is SAC-N (An et al., 2021), which uses the minimum of N critics instead of SAC's usual two. SAC-N achieves state-of-the-art scores but, unfortunately, requires up to 500 critics for sufficient pessimism on benchmark problems. Hu et al. (2022) illustrate that stronger pessimism, specifically a smaller discount factor, enables SAC-N to solve harder tasks (Rajeswaran et al., 2017). A smaller discount factor is a simple and computationally efficient source of pessimism, but not a versatile one: it increases bias (Zhang et al., 2020). In this paper, we identify residual algorithms (RA) (Baird, 1995) as a simple, versatile, and computationally efficient source of pessimism for SAC-N. As noted above, RA saw moderate success in its goal of fusing RG's convergence with SG's speed. Recently, Zhang et al. (2019) found similar success when extending RA to deep learning. But we find that RA truly excels in the offline setting. Prior works in both the online and offline settings (Geist et al., 2016; Fujimoto et al., 2022; Saleh & Jiang, 2019) show that, while RG performs well with data near the optimal policy, it consistently fails when the data is far from the optimal policy. Our key insight is that RA allows for adjustable exploitation of RG's natural pessimism.
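The weighted averaging at the heart of RA can be sketched on the simplest possible case: linear value estimation from a single transition. This is an illustrative sketch, not our implementation; the function and variable names (`residual_update`, `eta`, `phi_s`) are hypothetical.

```python
import numpy as np

def residual_update(w, phi_s, phi_s2, r, gamma, eta, lr):
    """One residual-algorithm (RA) step on a transition (s, a, r, s')
    with linear values V(s) = w @ phi(s).

    eta = 0 recovers the semi-gradient (SG) update, which treats the
    bootstrapped target as fixed; eta = 1 recovers the full residual
    gradient (RG) of the squared Bellman error.
    """
    delta = w @ phi_s - (r + gamma * (w @ phi_s2))  # Bellman error
    grad_sg = delta * phi_s                          # ignores the next-state term
    grad_rg = delta * (phi_s - gamma * phi_s2)       # differentiates through it
    grad = (1.0 - eta) * grad_sg + eta * grad_rg     # Baird's weighted average
    return w - lr * grad
```

The single weight `eta` is the knob this paper tunes per dataset: larger values lean harder on RG and, in the offline setting, on its natural pessimism.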
In other words, a weighted RG component may serve as a superior alternative to the widespread use of a weighted behavior cloning component in offline RL (Fujimoto & Gu, 2021; Buckman et al., 2020). Critically, however, we also find that no single weight for the RG component works well universally: it must be tuned per dataset, as with SAC-N (An et al., 2021). We discuss potential routes for automatic tuning in Section 5. We propose Raisin, roughly RA for SAC-N, which achieves D4RL (Fu et al., 2020) gym scores roughly matching SAC-N, the state of the art, with one-fiftieth of the critics. EDAC (An et al., 2021) matches those scores as well but requires both five times more compute than Raisin and adjustment of two hyperparameters per dataset. Raisin keeps the number of critics small and fixed and requires adjusting only one hyperparameter for pessimism (the residual weight). Thanks to embarrassingly parallel computation, Raisin runs at the same speed with its standard ten critics as with one (on a single GPU), similar to SAC-N (An et al., 2021). We plan to release a clean and efficient implementation of Raisin (and SAC-N) upon acceptance of this manuscript. Meanwhile, TD3+BC (Fujimoto & Gu, 2021) equipped with the same minimum-of-N-critics tool (TD3+BC-N) does not appear versatile: it still fails to match the scores of SAC-N on several datasets (especially the random datasets) despite per-dataset tuning of its pessimism. Outside of fundamental offline RL research, such simple combinations of behavior cloning and reinforcement learning are debatably the most common approach in the literature (Humphreys et al., 2022; Baker et al., 2022; Nakano et al., 2021). IQL is computationally efficient and potentially versatile but needs more testing on suboptimal data (e.g., the random datasets of the D4RL gym tasks). Perhaps even more importantly, IQL is not simple.
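The minimum-of-N-critics trick shared by SAC-N, Raisin, and TD3+BC-N reduces to a single reduction over an ensemble of Q estimates. The sketch below uses a random array as a stand-in for the N critic networks; nothing here comes from a released implementation.

```python
import numpy as np

def pessimistic_target(q_values):
    """SAC-N-style pessimism: elementwise minimum over an ensemble
    of N critic estimates.

    q_values: array of shape (N, batch_size), one row per critic.
    Returns the per-sample minimum, shape (batch_size,).
    """
    return np.min(q_values, axis=0)

# Stand-in for N = 10 critics evaluated on a batch of 4 state-action pairs.
rng = np.random.default_rng(0)
ensemble = rng.normal(size=(10, 4))
target = pessimistic_target(ensemble)  # used in place of min over two critics
```

Because each critic's forward pass is independent, the ensemble is embarrassingly parallel: on a GPU, stacking the critics into one batched computation makes N = 10 roughly as fast as N = 1.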
For example, IQL uses a learning rate decay for its actor, rescales its rewards by 1000, clips advantages at 100, and adds a third learning rate. Each of these components may be overfitting. Conversely, Raisin makes a single change to SAC-N, and SAC-N itself makes a single change to SAC (Haarnoja et al., 2018), an extensively well-tested algorithm. Upside-down reinforcement learning (UDRL) (Schmidhuber, 2019; Kumar et al., 2019) has shown some potential in recent work (Chen et al., 2021a; Lee et al., 2022) but scores poorly on suboptimal data (Brandfonbrener et al., 2022). Moreover, it wastes capacity learning poor behaviors (Emmons et al., 2021), and it kicks the maximization can down the road: UDRL requires a desired-return hyperparameter, a goal specification, or a complicated equivalent.

2. PRELIMINARIES

We consider the standard offline RL setting of a fixed dataset D of transitions (s, a, r, s′), where s is a state, a is an action taken in that state, r is the reward received for that action, and s′ is the resulting next state. We denote our policy π_ϕ(·|s), whose goal is to learn actions that maximize the discounted sum of rewards Σ_t γ^t r_t, where γ is the discount factor. Q_θ(s, a) is our state-action value function, whose goal is to estimate the future discounted sum of rewards given a state-action pair. For a comprehensive overview of residual algorithms (RA), see Zhang et al. (2019). We give a brief primer here in the context of SAC. Ignoring discounting for simplicity, SAC's critic loss is

L(θ_i, D) := E_{(s,a,r,s′)∼D} [ ( Q_{θ_i}(s, a) − min_{j=1,2} ȳ(j, r, s′) )² ],

where θ_i refers to the parameters θ of the i-th of two Q networks, and ȳ, the next-state target before the minimum operation, is

ȳ(j, r, s′) := r + Q̄_{θ_j}(s′, ã′) − α log π_ϕ(ã′|s′),   ã′ ∼ π_ϕ(·|s′).

Q̄ denotes the target network (Mnih et al., 2015), and α is SAC's entropy coefficient. This critic loss derives from the Mean Squared Bellman Error (MSBE), a natural error function for bootstrapped value estimation. But SAC's critic does not quite optimize the MSBE, partly because SAC, like all common value-based RL algorithms using gradient descent, ignores the gradient contribution from the next-state term. In other words, gradient descent on SAC's critic loss treats the value of the next state (plus the reward) as a fixed target towards which the current state's value is stepped. This is known as the semi-gradient (SG) algorithm. In contrast, performing true gradient descent on the MSBE is called the residual gradient (RG) algorithm. As a true gradient algorithm, RG brings robust convergence guarantees. But, thus far, RG has empirically yielded poor returns and slow convergence.
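To make the SG/RG distinction explicit, consider the per-transition squared Bellman error with the entropy term, the critic minimum, and the target network stripped away (a simplification for exposition only):

```latex
\delta := Q_\theta(s,a) - \bigl(r + Q_\theta(s', a')\bigr),
\qquad
\nabla_\theta \tfrac{1}{2}\delta^2
  = \underbrace{\delta\, \nabla_\theta Q_\theta(s,a)}_{\text{kept by SG and RG}}
  \; - \; \underbrace{\delta\, \nabla_\theta Q_\theta(s', a')}_{\text{dropped by SG, kept by RG}}.
```

A residual algorithm scales the second term by a weight in [0, 1], recovering SG at one extreme and RG at the other.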
It also suffers from a few theoretical concerns (Baird, 1995; Sutton & Barto, 2018): double sampling bias (for which we discuss workarounds in Section 5); convergence to unsatisfactory

