CONSERVATIVE BAYESIAN MODEL-BASED VALUE EXPANSION FOR OFFLINE POLICY OPTIMIZATION

Abstract

Offline reinforcement learning (RL) addresses the problem of learning a performant policy from a fixed batch of data collected by following some behavior policy. Model-based approaches are particularly appealing in the offline setting since they can extract more learning signals from the logged dataset by learning a model of the environment. However, the performance of existing model-based approaches falls short of that of their model-free counterparts, due to the compounding of estimation errors in the learned model. Driven by this observation, we argue that it is critical for a model-based method to understand when to trust the model and when to rely on model-free estimates, and how to act conservatively w.r.t. both. To this end, we derive an elegant and simple methodology called conservative Bayesian model-based value expansion for offline policy optimization (CBOP), which trades off model-free and model-based estimates during the policy evaluation step according to their epistemic uncertainties, and facilitates conservatism by taking a lower bound on the Bayesian posterior value estimate. On the standard D4RL continuous control tasks, we find that our method significantly outperforms previous model-based approaches: e.g., MOPO by 116.4%, MOReL by 23.2%, and COMBO by 23.7%. Further, CBOP achieves state-of-the-art performance on 11 out of 18 benchmark datasets while performing on par on the remaining datasets.

1. INTRODUCTION

Fueled by recent advances in supervised and unsupervised learning, there has been a great surge of interest in data-driven approaches to reinforcement learning (RL), known as offline RL (Levine et al., 2020). In offline RL, an RL agent must learn a good policy entirely from a logged dataset of past interactions, without access to the real environment. This paradigm of learning is particularly useful in applications where it is prohibited or too costly to conduct online trial-and-error explorations (e.g., due to safety concerns), such as autonomous driving (Yu et al., 2018), robotics (Kalashnikov et al., 2018), and operations research (Boute et al., 2022). However, because of the absence of online interactions with the environment that give correcting signals to the learner, direct applications of online off-policy algorithms have been shown to fail in the offline setting (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Kumar et al., 2020). This is mainly ascribed to the distribution shift between the learned policy and the behavior policy (data-logging policy) during training. For example, in Q-learning based algorithms, the distribution shift in the policy can incur uncontrolled overestimation bias in the learned value function. Specifically, positive biases in the Q function for out-of-distribution (OOD) actions can be picked up during policy maximization, which leads to further deviation of the learned policy from the behavior policy, resulting in a vicious cycle of value overestimation. Hence, the design of offline RL algorithms revolves around how to counter the adverse impacts of the distribution shift while achieving improvements over the data-logging policy.
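The overestimation mechanism described above can be illustrated with a toy experiment (not from the paper): even when every action has the same true value, maximizing over independently noisy Q estimates yields a systematic positive bias, since E[max_a Q̂(s, a)] ≥ max_a E[Q̂(s, a)]. The Gaussian estimation noise here is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# True Q-values for 10 actions are all zero; the learner only sees
# noisy estimates (e.g., from function approximation on scarce data).
n_actions, n_trials = 10, 10_000
true_q = np.zeros(n_actions)
noisy_q = true_q + rng.normal(scale=1.0, size=(n_trials, n_actions))

# Greedy maximization over noisy estimates is positively biased:
# the max picks whichever action happened to receive upward noise.
bias = noisy_q.max(axis=1).mean()
print(f"average overestimation bias: {bias:.2f}")
```

With 10 actions and unit Gaussian noise the bias is roughly 1.5 standard deviations; in offline RL this bias compounds through bootstrapped targets because no online data arrives to correct it.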

* Equal contribution

† Corresponding authors

In this work, we consider model-based (MB) approaches since they allow better use of a given dataset and can provide better generalization capability (Yu et al., 2020; Kidambi et al., 2020; Yu et al., 2021; Argenson & Dulac-Arnold, 2021). Typically, MB algorithms such as MOPO (Yu et al., 2020), MOReL (Kidambi et al., 2020), and COMBO (Yu et al., 2021) adopt the Dyna-style policy optimization approach developed in online RL (Janner et al., 2019; Sutton, 1990). That is, they use the learned dynamics model to generate rollouts, which are then combined with the real dataset for policy optimization. We hypothesize that we can make better use of the learned model by employing it for target value estimation during the policy evaluation step of the actor-critic method. Specifically, we can compute h-step TD targets through dynamics model rollouts and bootstrapped terminal Q function values. In online RL, this MB value expansion (MVE) has been shown to provide a better value estimation of a given state (Feinberg et al., 2018). However, the naïve application of MVE does not work in the offline setting due to model bias that can be exploited during policy learning. Therefore, it is critical to trust the model only when it can reliably predict the future, which can be captured by the epistemic uncertainty surrounding the model predictions. To this end, we propose CBOP (Conservative Bayesian MVE for Offline Policy Optimization) to control the reliance on the model-based and model-free value estimates according to their respective uncertainties, while mitigating the overestimation errors in the learned values. Unlike existing MVE approaches (e.g., Buckman et al. (2018)), CBOP estimates the full posterior distribution over a target value from the h-step TD targets for h = 0, . . . , H sampled from ensembles of the state dynamics and the Q function.
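The h-step TD targets underlying MVE can be sketched as follows. Here `model`, `q_fns`, and `policy` are assumed interfaces standing in for the learned dynamics model, the critic ensemble, and the current policy; they are illustrative placeholders, not the paper's exact API.

```python
import numpy as np

def mve_targets(s_next, r, model, q_fns, policy, gamma, H):
    """Compute h-step value expansion targets for h = 0..H from one
    transition with reward r and next state s_next: roll the learned
    model forward under the current policy, accumulating predicted
    rewards, and bootstrap with the critic ensemble at each depth.

    Assumed interfaces: model(s, a) -> (next_state, reward);
    q_fns is a list of critics q(s, a) -> value; policy(s) -> a.
    """
    targets = []
    s_h, ret, disc = s_next, r, gamma
    for h in range(H + 1):
        if h > 0:
            # Extend the partial return by one model-predicted step.
            a_h = policy(s_h)
            s_h, r_h = model(s_h, a_h)
            ret += disc * r_h
            disc *= gamma
        # Bootstrap the tail of the return with the critic ensemble.
        q_boot = np.mean([q(s_h, policy(s_h)) for q in q_fns])
        targets.append(ret + disc * q_boot)
    return np.array(targets)  # one candidate target per rollout depth h
```

The h = 0 entry is the ordinary model-free TD(0) target; larger h leans increasingly on the learned model, which is exactly where model bias can creep in if the targets are used without accounting for uncertainty.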
The novelty of CBOP lies in its ability to fully leverage this uncertainty in two related ways: (1) by deriving an adaptive weighting over different h-step targets informed by the posterior uncertainty; and (2) by using this weighting to derive conservative lower confidence bounds (LCB) on the target values that mitigate value overestimation. Ultimately, this allows CBOP to reap the benefits of MVE while significantly reducing value overestimation in the offline setting (Figure 1). We evaluate CBOP on the D4RL benchmark of continuous control tasks (Fu et al., 2020). The experiments show that using the conservative target value estimate significantly outperforms previous model-based approaches: e.g., MOPO by 116.4%, MOReL by 23.2%, and COMBO by 23.7%. Further, CBOP achieves state-of-the-art performance on 11 out of 18 benchmark datasets while performing on par on the remaining datasets.
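The two ingredients above, uncertainty-adaptive weighting and an LCB, can be sketched with a simplified precision-weighted (inverse-variance) combination of the h-step targets. This is a stand-in for the idea only; CBOP's actual Bayesian posterior over the target value is derived in the paper, and the `beta` coefficient here is a hypothetical conservatism knob.

```python
import numpy as np

def conservative_target(target_samples, beta=1.0):
    """Combine per-depth MVE target samples into one conservative target.

    target_samples[h] holds ensemble samples of the h-step target
    (e.g., across dynamics-model and critic ensemble members). Depths
    whose samples agree (low variance) get large weights; the final
    target is a lower confidence bound: posterior mean minus beta
    posterior standard deviations. Simplified sketch, not CBOP's
    exact posterior computation.
    """
    means = np.array([np.mean(t) for t in target_samples])
    precisions = np.array([1.0 / (np.var(t) + 1e-8) for t in target_samples])
    weights = precisions / precisions.sum()   # adaptive weights over h
    post_mean = float(np.dot(weights, means))
    post_var = 1.0 / precisions.sum()         # combined-estimate variance
    lcb = post_mean - beta * np.sqrt(post_var)
    return lcb, weights
```

The weights here play the same role as the w_h in Figure 1: when the model rollouts are confident, long-horizon targets dominate; as the model-free critics improve, the weight mass shifts toward small h.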

2. BACKGROUND

We study RL in the framework of Markov decision processes (MDPs), characterized by a tuple (S, A, T, r, d_0, γ); here, S is the state space, A is the action space, T(s′|s, a) is the transition function, r(s, a) is the immediate reward function, d_0 is the initial state distribution, and γ ∈ [0, 1] is the discount factor. Specifically, we call the transition and reward functions the model of the environment, which we denote as f = (T, r). A policy π is a mapping from S to a distribution over A.



Figure 1: Prevention of value overestimation & adaptive reliance on model-based value predictions. (Left) We leverage the full posterior over the target values to prevent value overestimation during offline policy learning (blue). Without conservatism incorporated, the target value diverges (orange). (Right) We can automatically adjust the level of reliance on the model-based and bootstrapped model-free value predictions based on their respective uncertainties during model-based value expansion. The 'expected horizon' (E[h] = Σ_h w_h · h, with Σ_h w_h = 1) shows the effective model-based rollout horizon during policy optimization. E[h] is large at the beginning, but it gradually decreases as the model-free value estimates improve over time. The figures were generated using the hopper-random dataset from D4RL (Fu et al., 2020).

