CONFIDENCE-CONDITIONED VALUE FUNCTIONS FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Offline reinforcement learning (RL) promises the ability to learn effective policies solely from existing, static datasets, without any costly online interaction. To do so, offline RL methods must handle distributional shift between the dataset and the learned policy. The most common approach is to learn conservative, or lower-bound, value functions, which underestimate the return of out-of-distribution (OOD) actions. Such methods exhibit one notable drawback: policies optimized on these value functions can only behave according to a fixed, possibly suboptimal, degree of conservatism. This drawback can be alleviated if we instead learn policies for varying degrees of conservatism at training time and devise a method to dynamically choose one of them during evaluation. To this end, we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation, adjusting the confidence level based on the history of observations thus far. This approach can be implemented in practice by conditioning the Q-function from existing conservative algorithms on the confidence. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence. Finally, we empirically show that our algorithm outperforms existing conservative offline RL algorithms on multiple discrete control domains.

1. INTRODUCTION

Offline reinforcement learning (RL) aims to learn effective policies entirely from previously collected data, without any online interaction (Levine et al., 2020). This addresses one of the main bottlenecks in the practical adoption of RL in domains such as recommender systems (Afsar et al., 2021), healthcare (Shortreed et al., 2011; Wang et al., 2018), and robotics (Kalashnikov et al., 2018), where exploratory behavior can be costly and dangerous. However, offline RL introduces new challenges, primarily caused by distribution shift. Naïve algorithms can grossly overestimate the return of actions that are not taken by the behavior policy that collected the dataset (Kumar et al., 2019a). Without online data gathering and feedback, the learned policy will exploit these likely suboptimal actions. One common approach to handling distribution shift in offline RL is to optimize a conservative lower-bound estimate of the expected return, or Q-values (Kumar et al., 2020; Kostrikov et al., 2021; Yu et al., 2020). By intentionally underestimating the Q-values of out-of-distribution (OOD) actions, policies are discouraged from taking OOD actions. However, such algorithms rely on manually specifying the desired degree of conservatism, which determines how pessimistic the estimated Q-values are. The performance of these algorithms is often sensitive to this choice of hyperparameter, and an imprecise choice can cause them to fail. Our work proposes the following solution: instead of learning one pessimistic estimate of the Q-values, we propose an offline RL algorithm that estimates Q-values for all possible degrees of conservatism. We do so by conditioning the learned Q-values on their confidence level, i.e., the probability that they lower-bound the true expected returns. This allows us to learn a range of lower-bound Q-values at different confidences.
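To make the idea of per-confidence lower bounds concrete, the following is a minimal tabular sketch, not the authors' implementation: we keep one Q-table per candidate confidence level and apply a CQL-style pessimism penalty whose weight grows with the confidence level and shrinks with visitation counts. All names here (`DELTAS`, `cql_weight`, `ccvl_update`) and the specific penalty schedule are illustrative assumptions.

```python
import numpy as np

# Illustrative tabular sketch of confidence-conditioned Q-learning.
# One Q-table per confidence level in DELTAS; each receives a CQL-style
# pessimism penalty that grows with confidence and shrinks with data,
# acting as a crude count-based surrogate for epistemic uncertainty.
DELTAS = np.array([0.5, 0.9, 0.99])  # desired lower-bound confidence levels

def cql_weight(delta, n_visits):
    # Hypothetical penalty scale: larger for higher confidence levels
    # and for rarely visited state-action pairs.
    return np.log(1.0 / (1.0 - delta)) / np.sqrt(max(n_visits, 1))

def ccvl_update(Q, counts, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One confidence-conditioned backup for a transition (s, a, r, s')."""
    counts[s, a] += 1
    for i, delta in enumerate(DELTAS):
        target = r if done else r + gamma * Q[i, s_next].max()
        # Standard TD step, then subtract a confidence-scaled pessimism term.
        Q[i, s, a] += alpha * (target - Q[i, s, a])
        Q[i, s, a] -= alpha * cql_weight(delta, counts[s, a])
    return Q
```

Running this update on a transition leaves the higher-confidence tables with strictly more pessimistic values for the updated state-action pair, which is the ordering the confidence conditioning is meant to induce.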
These confidence-conditioned Q-values enable us to do something conservative RL algorithms could not: control the level of confidence used to evaluate actions. Specifically, when evaluating the offline-learned Q-values, policies derived from conservative offline RL algorithms must follow a static behavior, even if online observations suggest that they are being overly pessimistic or optimistic. In contrast, our approach enables confidence-adaptive policies that can correct their behavior using online observations, simply by adjusting the confidence level used to estimate Q-values. We posit that this adaptation leads to successful policies more frequently than existing static policies, which rely on tuning a rather opaque hyperparameter during offline training.

Our primary contribution is a new offline RL algorithm that we call confidence-conditioned value learning (CCVL), which learns a mapping from confidence levels to corresponding lower-bound estimates of the true Q-values. Our theoretical analysis shows that our method learns appropriate lower-bound value estimates for any confidence level. Our algorithm also has a practical implementation that leverages several existing ideas in offline RL. Namely, we use network parameterizations studied in distributional RL to predict Q-values parameterized by confidence (Dabney et al., 2018b;a). Our objective, similar to conservative Q-learning (CQL) (Kumar et al., 2020), uses regularization to learn Q-values for all levels of pessimism and optimism, instead of anti-exploration bonuses that may be difficult to compute accurately in complex environments (Rezaeifar et al., 2021). In addition, our algorithm can easily be extended to learn both lower- and upper-bound estimates, which can be useful when fine-tuning our offline-learned value function on additional data obtained via online exploration.
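One simple way to realize such confidence-adaptive behavior is to maintain a belief over the candidate confidence levels and act with the currently most plausible one. The sketch below, an assumption on our part rather than the paper's exact rule, penalizes each confidence-conditioned head by its squared online TD error, so heads whose predictions track observed outcomes gain belief mass; the names `update_belief` and `select_delta` are hypothetical.

```python
import numpy as np

# Illustrative sketch of confidence-adaptive evaluation: a log-space belief
# over candidate confidence levels, updated from each head's online TD error.
def update_belief(log_belief, td_errors, temperature=1.0):
    """Penalize each confidence level by its squared online TD error."""
    log_belief = log_belief - temperature * np.square(td_errors)
    log_belief -= log_belief.max()  # shift for numerical stability
    return log_belief

def select_delta(deltas, log_belief):
    """Act with the currently most plausible confidence level."""
    return deltas[int(np.argmax(log_belief))]
```

Under this rule, a head that is too pessimistic or too optimistic accumulates large TD errors online and loses influence, which mirrors the adaptation the paragraph above describes.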
Finally, we show that our approach outperforms existing state-of-the-art approaches in discrete-action environments such as Atari (Mnih et al., 2013; Bellemare et al., 2013). Our empirical results also confirm that conditioning on confidence, and adjusting the confidence based on online observations, can lead to significant improvements in performance.

2. RELATED WORK

Offline RL (Lange et al., 2012; Levine et al., 2020) has shown promise in numerous domains. The major challenge in offline RL is distribution shift (Kumar et al., 2019a), where the learned policy might select out-of-distribution actions with unpredictable consequences. Methods to tackle this challenge can be roughly categorized into policy-constraint and conservative methods. Policy-constraint methods regularize the learned policy to be "close" to the behavior policy, either explicitly via a policy regularizer in the objective (Fujimoto et al., 2018; Kumar et al., 2019a; Liu et al., 2020; Wu et al., 2019; Fujimoto & Gu, 2021), implicitly through constrained policy updates (Siegel et al., 2020; Peng et al., 2019; Nair et al., 2020), or via importance sampling (Liu et al., 2019; Swaminathan & Joachims, 2015; Nachum et al., 2019). Conservative methods, on the other hand, learn a lower-bound, or conservative, estimate of the return and optimize the policy against it (Kumar et al., 2020; Kostrikov et al., 2021; Kidambi et al., 2020; Yu et al., 2020; 2021). Conservative approaches traditionally rely on estimating epistemic uncertainty, either explicitly via anti-exploration bonuses (Rezaeifar et al., 2021) or implicitly using regularization on the learned Q-values (Kumar et al., 2020). The limitation of existing offline RL approaches is that the derived policies can only act under a fixed degree of conservatism, which is determined by an opaque hyperparameter that scales the estimated epistemic uncertainty and has to be chosen during offline training. This means the policies are unable to correct their behavior online, even when it becomes evident from online observations that the estimated value function is too pessimistic or optimistic. Our algorithm learns confidence-conditioned Q-values that capture all possible degrees of pessimism by conditioning on the confidence level, modeling epistemic uncertainty as a function of confidence.
By doing so, instead of committing to one degree of pessimism, we enable policies that adapt how conservatively they should behave using the observations they see during online evaluation. Our approach is related to ensemble approaches (Agarwal et al., 2020; Lee et al., 2021; Chen et al., 2021; An et al., 2021) in that they also predict multiple Q-values to model epistemic uncertainty. However, existing ensemble methods train individual Q-values on the same objective and rely on different parameter initializations; in contrast, each of our Q-values captures a different confidence level. In addition, standard ensemble approaches do not consider adaptive policies. Recently, APE-V proposed using ensembles to learn adaptive policies that condition on a belief over which value function is most accurate (Ghosh et al., 2022). Our approach considers a similar strategy for adaptation, but explicitly parameterizes the value function by the confidence level, introducing a novel training objective for this purpose. In our experiments, we compare to a method that adapts APE-V to our discrete-action benchmark tasks. Jiang & Huang (2020) and Dai et al. (2020) propose confidence intervals for policy evaluation at specified confidence levels; we instead aim to learn a value function across all confidences and use it for adaptive policy optimization. Finally, distributional RL (Dabney et al., 2017; Bellemare et al., 2017; Dabney et al., 2018b) learns a distribution over values, but only captures aleatoric uncertainty, whereas our focus is on epistemic uncertainty and offline RL.

