EFFECTIVE OFFLINE REINFORCEMENT LEARNING VIA CONSERVATIVE STATE VALUE ESTIMATION

Abstract

Offline RL seeks to learn effective policies solely from historical data, with the expectation that the learned policy performs well in the online environment. However, it faces a major challenge: value over-estimation introduced by the distributional shift between the dataset and the currently learned policy, which leads to learning failure in practice. A common remedy is to add a penalty term to the reward or the value estimate in the Bellman iterations, which has given rise to a number of successful algorithms such as CQL. Meanwhile, to avoid extrapolation on unseen states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose CSVE, a new approach that learns a conservative V-function by directly imposing a penalty on out-of-distribution states. We prove that, for the evaluated policy, our conservative state value estimate satisfies: (1) over the state distribution from which the penalized states are sampled, it lower-bounds the true values in expectation, and (2) over the marginal state distribution of the data, it is at most the true values in expectation plus a constant determined by the sampling error. Further, we develop a practical actor-critic algorithm in which the critic performs conservative value estimation by additionally sampling and penalizing states around the dataset, and the actor applies advantage-weighted updates to improve the policy. We evaluate CSVE on classic continuous control tasks from D4RL, showing that it outperforms conservative Q-function learning methods (e.g., CQL) and is strongly competitive with recent state-of-the-art methods.
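To make the two ingredients described above concrete, the following is a minimal, self-contained sketch (not the authors' implementation) of a critic loss that penalizes state values on samples drawn around the dataset, and an advantage-weighted actor update. The network sizes, the Gaussian-perturbation scheme used to generate the penalized states, and the hyperparameters alpha, beta, and noise_std are illustrative assumptions; the actual sampling procedure and objectives are derived later in the paper.

```python
import torch
import torch.nn as nn

# Small MLPs standing in for the value, Q, and policy networks.
def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 64), nn.ReLU(), nn.Linear(64, out))

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = mlp(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, s, a):
        dist = torch.distributions.Normal(self.mean(s), self.log_std.exp())
        return dist.log_prob(a).sum(-1)

obs_dim, act_dim, batch_size = 11, 3, 256
V = mlp(obs_dim, 1)
Q = mlp(obs_dim + act_dim, 1)
policy = GaussianPolicy(obs_dim, act_dim)

# Dummy offline batch (in practice: sampled from the static dataset).
s = torch.randn(batch_size, obs_dim)
a = torch.randn(batch_size, act_dim)
r = torch.randn(batch_size)
s_next = torch.randn(batch_size, obs_dim)
done = torch.zeros(batch_size)

gamma, alpha, noise_std, beta = 0.99, 1.0, 0.1, 3.0

# ----- Critic: Bellman regression plus a conservative state-value penalty -----
with torch.no_grad():
    target = r + gamma * (1.0 - done) * V(s_next).squeeze(-1)
bellman = ((V(s).squeeze(-1) - target) ** 2).mean()

# Push V down on states sampled *around* the dataset (here: Gaussian
# perturbations, purely for illustration) and up on the dataset states;
# the exact form of the penalty is specified in the body of the paper.
s_ood = s + noise_std * torch.randn_like(s)
penalty = V(s_ood).mean() - V(s).mean()
critic_loss = bellman + alpha * penalty

# ----- Actor: advantage-weighted update toward high-advantage dataset actions -----
with torch.no_grad():
    adv = Q(torch.cat([s, a], dim=-1)).squeeze(-1) - V(s).squeeze(-1)
    w = torch.exp(adv / beta).clamp(max=100.0)
actor_loss = -(w * policy.log_prob(s, a)).mean()

print(float(critic_loss), float(actor_loss))
```

The exponential advantage weights keep the actor close to the data distribution while favoring actions the conservative critic rates highly, which is the intended effect of the advantage-weighted update described above.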

1. INTRODUCTION

Reinforcement Learning (RL), which learns to act by interacting with the environment, has achieved remarkable success in various tasks. However, in most real applications it is impossible to learn online from scratch, as exploration is often risky and unsafe. Offline RL (Fujimoto et al., 2019; Lange et al., 2012) avoids this problem by learning the policy solely from historical data. However, the naive approach, which directly applies online RL algorithms to a static dataset, suffers from value over-estimation and policy extrapolation on out-of-distribution (OOD) states or actions.

Recently, conservative value estimation, i.e., being conservative on states and actions for which there are not enough samples, has been put forward as a principle for effective offline RL (Shi et al., 2022; Kumar et al., 2020; Buckman et al., 2020). Prior methods, e.g., Conservative Q-Learning (CQL; Kumar et al., 2020), avoid value over-estimation by systematically underestimating the Q-values of OOD actions at the states in the dataset. In practice, this is often too pessimistic and thus leads to overly conservative algorithms. COMBO (Yu et al., 2021) leverages a learned dynamics model to augment the data through interpolation, and then learns a Q-function that is less conservative than CQL's and potentially derives a better policy.

In this paper, we propose CSVE (Conservative State Value Estimation), a new offline RL approach. Unlike the methods above, which estimate conservative values by penalizing the Q-function on OOD states or actions, CSVE directly penalizes the V-function on OOD states. We prove that CSVE attains tighter bounds on the true state values than CQL, and the same bounds as COMBO but under more general discounted state distributions, which leaves more room for algorithm design. Our main contributions are as follows.

