CONFIDENCE-CONDITIONED VALUE FUNCTIONS FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Offline reinforcement learning (RL) promises the ability to learn effective policies solely from existing, static datasets, without any costly online interaction. To do so, offline RL methods must handle distributional shift between the dataset and the learned policy. The most common approach is to learn conservative, or lower-bound, value functions, which underestimate the return of out-of-distribution (OOD) actions. However, such methods exhibit one notable drawback: policies optimized against such value functions can only behave according to a fixed, possibly suboptimal, degree of conservatism. This can be alleviated if we instead learn policies for varying degrees of conservatism at training time and devise a method to dynamically choose among them during evaluation. To this end, we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation: the confidence level can be controlled using the history of observations thus far. This approach can be implemented in practice by conditioning the Q-function from existing conservative algorithms on the confidence. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence. Finally, we empirically show that our algorithm outperforms existing conservative offline RL algorithms on multiple discrete control domains.

1. INTRODUCTION

Offline reinforcement learning (RL) aims to learn effective policies entirely from previously collected data, without any online interaction (Levine et al., 2020). This addresses one of the main bottlenecks in the practical adoption of RL in domains such as recommender systems (Afsar et al., 2021), healthcare (Shortreed et al., 2011; Wang et al., 2018), and robotics (Kalashnikov et al., 2018), where exploratory behavior can be costly and dangerous. However, offline RL introduces new challenges, primarily caused by distribution shift. Naïve algorithms can grossly overestimate the return of actions that are not taken by the behavior policy that collected the dataset (Kumar et al., 2019a). Without online data gathering and feedback, the learned policy will exploit these likely suboptimal actions. One common approach to handling distribution shift in offline RL is to optimize a conservative lower-bound estimate of the expected return, or Q-values (Kumar et al., 2020; Kostrikov et al., 2021; Yu et al., 2020). By intentionally underestimating the Q-values of out-of-distribution (OOD) actions, such methods discourage policies from taking OOD actions. However, these algorithms rely on manually specifying the desired degree of conservatism, which determines how pessimistic the estimated Q-values are. Their performance is often sensitive to this choice of hyperparameter, and an imprecise choice can cause them to fail. Our work proposes the following solution: instead of learning one pessimistic estimate of Q-values, we propose an offline RL algorithm that estimates Q-values for all possible degrees of conservatism. We do so by conditioning the learned Q-values on their confidence level, i.e., the probability that they lower-bound the true expected returns. This allows us to learn a range of lower-bound Q-values at different confidences.
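To make the idea concrete, the following is a minimal tabular sketch of confidence-conditioned lower bounds in a one-step (bandit-style) setting. The penalty c · sqrt(log(1/δ) / n) is a generic Hoeffding-style concentration term used purely for illustration; it is an assumption of this sketch, not the paper's actual Bellman backup, and the function name and example numbers are hypothetical.

```python
import numpy as np

def confidence_conditioned_values(rewards, counts, deltas, c=1.0):
    """Return a lower-bound value estimate per action for each failure
    probability delta (i.e., confidence 1 - delta).

    The penalty c * sqrt(log(1/delta) / n) is a generic concentration-style
    bonus: actions with few samples (likely OOD) are penalized more, and
    smaller delta (higher confidence) yields a more pessimistic estimate.
    """
    rewards = np.asarray(rewards, dtype=float)  # empirical mean return per action
    counts = np.asarray(counts, dtype=float)    # number of dataset samples per action
    lower_bounds = {}
    for delta in deltas:
        penalty = c * np.sqrt(np.log(1.0 / delta) / counts)
        lower_bounds[delta] = rewards - penalty
    return lower_bounds

# Action 1 has a higher empirical return but is rarely observed.
vals = confidence_conditioned_values(
    rewards=[0.5, 0.9], counts=[100, 4], deltas=[0.5, 0.01])
```

At low confidence (delta = 0.5) the rarely-observed action still looks best, while at high confidence (delta = 0.01) the bound flips toward the well-covered action; this is exactly the kind of conservatism trade-off that a single fixed hyperparameter cannot capture.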
These confidence-conditioned Q-values enable us to do something prior conservative RL algorithms could not: control the level of confidence used to evaluate actions. Specifically, when evaluating the offline-learned Q-values, policies derived from conservative offline

