INCORPORATING EXPLICIT UNCERTAINTY ESTIMATES INTO DEEP OFFLINE REINFORCEMENT LEARNING

Abstract

Most theoretically motivated work in the offline reinforcement learning setting requires precise uncertainty estimates. This requirement restricts the algorithms derived in that work to the tabular and linear settings where such estimates exist. In this work, we develop a novel method for incorporating scalable uncertainty estimates into an offline reinforcement learning algorithm, called deep-SPIBB, that extends the SPIBB family of algorithms to environments with larger state and action spaces. Deep-SPIBB leverages recent innovations in uncertainty estimation from the deep learning community to obtain scalable uncertainty estimates. While these estimates do not allow for the same theoretical guarantees as in the tabular case, we argue that the SPIBB mechanism for incorporating uncertainty is more robust and flexible than pessimistic approaches that incorporate uncertainty as a value function penalty. We bear this out empirically, showing that deep-SPIBB outperforms pessimism-based approaches with access to the same uncertainty estimates and performs at least on par with a variety of other strong baselines across several environments and datasets.

1. INTRODUCTION

In the study of offline reinforcement learning (OffRL), uncertainty plays a key role (Buckman et al., 2020; Levine et al., 2020). This is because, unlike online RL where an agent receives feedback in the form of low rewards after taking a bad action, an OffRL agent must learn from a fixed dataset without feedback from the environment. As a result, a consistent issue for OffRL algorithms is overestimating the values of states and actions that are not seen in the dataset, leading to poor performance when the agent is deployed and finds that those states and actions in fact have low reward (Fujimoto et al., 2019b). To overcome this issue, OffRL algorithms often attempt to incorporate some notion of uncertainty to ensure that the learned policy avoids regions of high uncertainty. There are two main issues with this approach: (1) how to define uncertainty and (2) how to incorporate uncertainty estimates into the OffRL algorithm. In tabular and linear MDPs, issue (1) is resolved by using visitation counts and elliptical confidence regions, respectively (Yin et al., 2021; Yin & Wang, 2021; Jin et al., 2021; Laroche et al., 2019). In the large-scale MDPs that we consider, neither of these solutions works, but there is a large literature from the deep learning community on uncertainty quantification that we can leverage for OffRL (Ciosek et al., 2019; Osband et al., 2018; 2021; Burda et al., 2019; Ostrovski et al., 2017; Lakshminarayanan et al., 2017; Blundell et al., 2015; Gal & Ghahramani, 2016). Given these uncertainty estimators, this paper focuses primarily on issue (2): how to incorporate uncertainty for OffRL. To understand how best to incorporate uncertainty into an OffRL algorithm, we first provide a high-level algorithmic template that captures the majority of related work as instances of modified policy iteration (Scherrer et al., 2012), alternating between policy evaluation and policy improvement.
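To make the tabular resolution of issue (1) concrete, visitation counts yield a simple uncertainty estimate that shrinks as a state-action pair is seen more often. The sketch below is illustrative, not from any particular prior work; the function name and the confidence scale `c` are our own choices, and the rule u(s, a) = c / sqrt(n(s, a)) is the standard count-based form.

```python
from collections import Counter

def count_based_uncertainty(dataset, c=1.0):
    """Tabular uncertainty from visitation counts: u(s, a) = c / sqrt(n(s, a)).

    `dataset` is an iterable of (s, a, r, s_next) transitions and `c` is a
    hypothetical confidence scale. Pairs never seen in the dataset get
    infinite uncertainty.
    """
    counts = Counter((s, a) for s, a, _, _ in dataset)

    def u(s, a):
        n = counts[(s, a)]
        return float("inf") if n == 0 else c / n ** 0.5

    return u
```

In large-scale MDPs this breaks down because most state-action pairs are never revisited exactly, which is precisely why the scalable estimators cited above are needed.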
We then sort prior work into four categories along two axes: whether the algorithm modifies the evaluation step or the improvement step, and whether it uses an explicit uncertainty estimator. One class of algorithms modifies the evaluation step by introducing value penalties based on explicit uncertainty estimates, which we call pessimism (Petrik et al., 2016; Buckman et al., 2020; Jin et al., 2021). An alternative modifies the value estimation without using an uncertainty estimate, as in CQL (Kumar et al., 2020). Another family uses behavior constraints that modify the policy improvement step to keep the learned policy near the behavior policy (Fujimoto et al., 2019b;a), but does not use explicit uncertainty. Instead, we propose to use the fourth class of methods, which leverages uncertainty-based constraints in the policy improvement step and is inspired by the SPIBB family of algorithms (Laroche et al., 2019; Laroche & Tachet des Combes, 2019; Nadjahi et al., 2019; Simão et al., 2020). These algorithms modify the policy improvement step like a behavior constraint, but also reason about state-based uncertainty like the pessimistic algorithms. Concretely, we define the deep-SPIBB algorithm, which effectively incorporates uncertainty estimates into OffRL.

The main contributions of this paper are as follows:

• We introduce the deep-SPIBB algorithm, which provides a principled way to incorporate scalable uncertainty estimates for OffRL. We instantiate this algorithm using ensemble-based uncertainty estimates inspired by Bayesian inference (Ciosek et al., 2019; Osband et al., 2021).

• We provide a detailed comparison of several different mechanisms for incorporating uncertainty by considering how each mechanism behaves at the extreme settings of its hyperparameters. This analysis shows that deep-SPIBB provides a flexible and robust mechanism to interpolate between various extremes (greedy RL, behavior cloning, and one-step RL).
• Through experiments on classical environments (cartpole and catch) as well as Atari games, we demonstrate the efficacy of deep-SPIBB. In particular, we find that deep-SPIBB consistently outperforms pessimism when given access to the same imperfect uncertainty estimators.

• When deep-SPIBB has access to better uncertainty estimators (as in the easier cartpole environment), it substantially outperforms our other baselines, CQL and BCQ, as well. This suggests that as uncertainty estimators improve, deep-SPIBB will provide a useful mechanism for incorporating them into OffRL.
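To make the contrast between the two uncertainty-based families concrete, the tabular sketch below compares a pessimistic value penalty against a SPIBB-style constrained improvement step. This is an illustrative simplification, not the deep-SPIBB algorithm itself: the function names and the simple thresholding rule (copy the behavior policy on high-uncertainty actions, act greedily among the rest) are ours.

```python
import numpy as np

def pessimistic_greedy(Q, u, alpha):
    """Pessimism: act greedily on penalized values Q - alpha * u."""
    return np.argmax(Q - alpha * u, axis=-1)

def spibb_style_policy(Q, u, beta, eps):
    """SPIBB-style improvement for one state. Q, u, beta are (n_actions,)
    arrays. On actions whose uncertainty exceeds eps, keep the behavior
    policy's probabilities; put the remaining mass on the best
    low-uncertainty action."""
    uncertain = u > eps
    pi = np.where(uncertain, beta, 0.0)
    free_mass = 1.0 - pi.sum()
    if (~uncertain).any():
        safe_q = np.where(uncertain, -np.inf, Q)
        pi[np.argmax(safe_q)] += free_mass
    else:
        pi = beta.copy()  # everything is uncertain: fall back to behavior policy
    return pi
```

The key difference is where the uncertainty enters: pessimism changes the values being maximized, while the SPIBB mechanism changes the set of policies the improvement step may return.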

2. PRELIMINARIES

We consider an OffRL setup with a discrete action space and access to a dataset $D = \{(s_j, a_j, r_j, s'_j)\}_{j=1}^N$ consisting of $N$ transitions collected by some behavior policy $\beta$. The goal is to learn a policy $\pi$ from this data to maximize the expected discounted return $J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$.
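As a concrete reading of the objective, the discounted return of a single sampled trajectory (a Monte Carlo sample of the quantity inside the expectation) can be computed by a backward pass over its rewards. A minimal sketch; the function name is ours:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one trajectory's reward sequence,
    accumulating backward so each reward is discounted exactly once."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```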

2.1. ALGORITHMIC TEMPLATE

The vast majority of prior work on the OffRL problem can be seen through a common algorithmic template of modified policy iteration. Each algorithm alternates between policy evaluation and policy improvement steps; the main difference between algorithms lies in how they modify one of the two steps. Below we first define the generic version of the OffRL algorithmic template and then explain how different OffRL algorithms modify it.

Policy improvement by greedy maximization:

$$\pi^{(i+1)}(\cdot|s) = \arg\max_{\pi \in \Pi} \sum_{a \in \mathcal{A}} \pi(a|s)\, \hat{Q}^{(i)}(s, a).$$

Value estimation by fitted Q evaluation given the dataset $D = \{(s_j, a_j, r_j, s'_j)\}_{j=1}^N$. Define the Bellman operator at datapoint $j$, for a policy $\pi$ and value function $Q$, as

$$\mathcal{T}(j, \pi, Q) = r_j + \gamma \sum_{a' \in \mathcal{A}} \pi(a'|s'_j)\, Q(s'_j, a').$$

Then the evaluation step is:

$$\hat{Q}^{(i+1)} = \arg\min_{Q \in \mathcal{Q}} \sum_j \left( Q(s_j, a_j) - \mathcal{T}(j, \pi^{(i+1)}, \hat{Q}^{(i)}) \right)^2.$$

In addition to the policy and Q function, some of the algorithms we consider also learn an estimated behavior policy β̂(a|s) and/or an uncertainty function û(s, a). Generally, β̂ is learned by maximum likelihood supervised learning. The uncertainty û, on the other hand, can be learned in many different ways; we discuss û in more detail in Section 3 when we describe our method.

With this template we can characterize much of the prior work, as summarized in Table 1. The essential axes that we consider are (1) whether the algorithm modifies the improvement step or the evaluation step, and (2) whether the algorithm uses an uncertainty function û(s, a) or not.
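The template above can be sketched in a tabular setting, where the least-squares fit over a function class reduces to incremental regression of Q(s_j, a_j) toward the fixed target T(j, π, Q_prev). This is a minimal stand-in for the template, assuming integer-indexed states and actions; the function names and the use of a learning-rate loop in place of an exact least-squares solve are our simplifications.

```python
import numpy as np

def greedy_improvement(Q):
    """Greedy maximization over deterministic policies: a one-hot policy
    per state, placing all mass on argmax_a Q(s, a).
    Q has shape (n_states, n_actions)."""
    n_states, _ = Q.shape
    pi = np.zeros_like(Q)
    pi[np.arange(n_states), Q.argmax(axis=1)] = 1.0
    return pi

def fitted_q_evaluation(dataset, pi, Q, gamma=0.99, lr=0.5, sweeps=50):
    """One evaluation step: regress Q(s_j, a_j) toward the Bellman target
    r_j + gamma * sum_{a'} pi(a'|s'_j) Q_prev(s'_j, a'), with Q_prev held
    fixed, mimicking the least-squares minimization in the template."""
    Q_prev = Q.copy()
    Q = Q.copy()
    for _ in range(sweeps):
        for s, a, r, s_next in dataset:
            target = r + gamma * pi[s_next] @ Q_prev[s_next]
            Q[s, a] += lr * (target - Q[s, a])
    return Q
```

Alternating these two functions until convergence recovers plain modified policy iteration; the algorithms in Table 1 differ in which of the two steps they replace with a constrained or penalized variant.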

