EFFECTIVE OFFLINE REINFORCEMENT LEARNING VIA CONSERVATIVE STATE VALUE ESTIMATION

Abstract

Offline RL seeks to learn effective policies solely from historical data and expects them to perform well in the online environment. However, it faces a major challenge: value over-estimation introduced by the distributional drift between the dataset and the learned policy, which leads to learning failure in practice. The common remedy is to add a penalty term to the reward or value estimate in the Bellman iterations, which has given rise to a number of successful algorithms such as CQL. Meanwhile, to avoid extrapolation on unseen states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose CSVE, a new approach that learns a conservative V-function by directly imposing a penalty on out-of-distribution (OOD) states. We prove that, for the evaluated policy, our conservative state value estimate satisfies: (1) over the state distribution from which penalized states are sampled, it lower-bounds the true values in expectation, and (2) over the marginal state distribution of the data, it exceeds the true values in expectation by at most a constant determined by the sampling error. Further, we develop a practical actor-critic algorithm in which the critic performs conservative value estimation by additionally sampling and penalizing states around the dataset, and the actor applies advantage-weighted updates to improve the policy. We evaluate on classic continuous control tasks of D4RL, showing that our method outperforms conservative Q-function learning methods (e.g., CQL) and is strongly competitive among recent SOTA methods.

1. INTRODUCTION

Reinforcement Learning (RL), which learns to act by interacting with the environment, has achieved remarkable success in various tasks. In most real applications, however, it is impossible to learn online from scratch, as exploration is often risky and unsafe. Offline RL (Fujimoto et al., 2019; Lange et al., 2012) avoids this problem by learning the policy solely from historical data. However, the naive approach, which directly applies online RL algorithms to a static dataset, suffers from value over-estimation and policy extrapolation on OOD (out-of-distribution) states or actions. Recently, conservative value estimation, i.e., being conservative on states and actions for which there are not enough samples, has been put forward as a principle for effective offline RL (Shi et al., 2022; Kumar et al., 2020; Buckman et al., 2020). Prior methods, e.g., Conservative Q-Learning (CQL; Kumar et al., 2020), avoid value over-estimation by systematically underestimating the Q-values of OOD actions at states in the dataset. In practice, this is often too pessimistic and thus leads to overly conservative algorithms. COMBO (Yu et al., 2021) leverages a learned dynamics model to augment data by interpolation, and then learns a Q-function that is less conservative than CQL's and can potentially derive a better policy. In this paper, we propose CSVE (Conservative State Value Estimation), a new offline RL approach. Unlike the methods above, which obtain conservative value estimates by penalizing the Q-function on OOD states or actions, CSVE directly penalizes the V-function on OOD states. We prove in theory that CSVE has tighter bounds on the true state values than CQL, and the same bounds as COMBO but under more general discounted state distributions, which leaves more room for algorithm design. Our main contributions are as follows. • Conservative state value estimation with the related theoretical analysis.
We prove that it lower-bounds the real state values in expectation over any state distribution used to sample OOD states, and is upper-bounded by the real values in expectation over the marginal state distribution of the dataset plus a constant term depending only on sampling errors. Compared to prior work, it has several advantages that can potentially yield a better policy. • A practical actor-critic implementation. It approximately estimates the conservative state values in the offline context and improves the policy via advantage-weighted updates. In particular, we use a dynamics model to generalize over the in-distribution space and to sample OOD states that are directly reachable from the dataset. • Experimental evaluation on continuous control tasks of the Gym (Brockman et al., 2016) and Adroit (Rajeswaran et al., 2017) suites in the D4RL (Fu et al., 2020) benchmarks, showing that CSVE performs better than prior methods based on conservative Q-value estimation and is strongly competitive among mainstream SOTA offline RL algorithms.
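The central idea, penalizing the V-function on OOD states that a learned dynamics model reaches from the dataset, can be sketched schematically as follows. This is a minimal NumPy illustration of the loss shape only, not the paper's implementation; the array names, the toy values, and the choice of `alpha` are all assumptions for illustration:

```python
import numpy as np

def conservative_v_loss(v_ood, v_data, v_pred, v_targets, alpha):
    """Schematic conservative state-value critic loss:
      alpha * (E_{s ~ model}[V(s)] - E_{s ~ D}[V(s)])   # push V down on OOD states,
                                                        # up on dataset states
      + 0.5 * E_{s ~ D}[(V(s) - B^pi V(s))^2]           # standard Bellman error
    """
    penalty = alpha * (v_ood.mean() - v_data.mean())
    bellman_mse = 0.5 * np.mean((v_pred - v_targets) ** 2)
    return penalty + bellman_mse

v_ood = np.array([6.0, 5.0])      # V at OOD states sampled via a dynamics model
v_data = np.array([4.0, 5.0])     # V at states drawn from the dataset
v_pred = np.array([4.0, 5.0])     # current V(s) on dataset states
v_targets = np.array([4.5, 4.5])  # empirical Bellman targets for those states
print(conservative_v_loss(v_ood, v_data, v_pred, v_targets, alpha=1.0))  # 1.125
```

When the OOD states are valued higher than the dataset states, the penalty term is positive and gradient descent pushes their values down, which is the conservatism mechanism described above.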

2. PRELIMINARIES

Offline Reinforcement Learning. Consider the Markov Decision Process $M := (S, A, P, r, \rho, \gamma)$, which consists of the state space $S$, the action space $A$, the transition model $P : S \times A \to \Delta(S)$, the reward function $r : S \times A \to \mathbb{R}$, the initial state distribution $\rho$ and the discount factor $\gamma \in (0, 1]$. A stochastic policy $\pi : S \to \Delta(A)$ takes an action with some probability given the current state. A transition is a tuple $(s_t, a_t, r_t, s_{t+1})$ where $a_t \sim \pi(\cdot|s_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t)$ and $r_t = r(s_t, a_t)$. We assume that the reward values satisfy $|r(s, a)| \le R_{\max}, \forall s, a$. A trajectory under $\pi$ is the random sequence $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$, which consists of consecutive transitions starting from $s_0 \sim \rho$. Standard RL learns a policy $\pi \in \Pi$ that maximizes the expected cumulative future reward $J_\pi(M) = \mathbb{E}_{M,\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$ via active interaction with the environment $M$. At any time $t$, for the policy $\pi$, the state value function is defined as $V^\pi(s) := \mathbb{E}_{M,\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s\right]$, and the Q-value function is $Q^\pi(s, a) := \mathbb{E}_{M,\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \mid s_t = s, a_t = a\right]$. The Bellman operator is a function projection: $\mathcal{B}^\pi Q(s, a) := r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a), a' \sim \pi(\cdot|s')}[Q(s', a')]$, or $\mathcal{B}^\pi V(s) := \mathbb{E}_{a \sim \pi(\cdot|s)}\left[r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot|s,a)}[V(s')]\right]$, which leads to iterative value updates. Bellman consistency implies that $V^\pi(s) = \mathcal{B}^\pi V^\pi(s), \forall s$ and $Q^\pi(s, a) = \mathcal{B}^\pi Q^\pi(s, a), \forall s, a$. In practice with function approximation, we use the empirical Bellman operator $\hat{\mathcal{B}}^\pi$, in which the expectations above are estimated from data. Offline RL learns the policy $\pi$ from a static dataset $D = \{(s, a, r, s')\}$ of transitions collected by an arbitrary behaviour policy, aiming to behave well in the online environment. Note that, unlike standard online RL, offline RL cannot interact with the environment during learning. Conservative Value Estimation. 
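For concreteness, the empirical Bellman backup $\hat{\mathcal{B}}^\pi V$ on a batch of transitions can be sketched as below. This is a minimal NumPy illustration, not any paper's implementation; the array names, shapes, and the terminal-state handling via a `dones` mask are assumptions:

```python
import numpy as np

def empirical_bellman_v(rewards, next_values, dones, gamma=0.99):
    """One-sample estimate of B^pi V(s) = E_a[r(s, a) + gamma * E_s'[V(s')]].

    rewards:     r_t for a batch of transitions (s, a, r, s')
    next_values: current estimate V(s') for the sampled next states
    dones:       1.0 if s' is terminal (no bootstrapping), else 0.0
    """
    return rewards + gamma * (1.0 - dones) * next_values

# Toy batch of three transitions.
r = np.array([1.0, 0.0, 2.0])
v_next = np.array([10.0, 5.0, 0.0])
done = np.array([0.0, 0.0, 1.0])
targets = empirical_bellman_v(r, v_next, done, gamma=0.9)
# targets = [1 + 0.9*10, 0 + 0.9*5, 2] = [10.0, 4.5, 2.0]
```

Iterating this backup against the current value estimate is exactly what makes over-estimation propagate when $V(s')$ is extrapolated on states the dataset never covers, which motivates the conservative estimation discussed next.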
One main challenge in offline RL is the over-estimation of values caused by extrapolation on unseen states and actions, which may make the learned policy collapse. To address this issue, conservatism or pessimism is introduced into value estimation; e.g., CQL learns a conservative Q-value function by penalizing the values of unseen actions at dataset states:
$$\hat{Q}^{k+1} \leftarrow \arg\min_Q \; \alpha \left( \mathbb{E}_{s \sim D, a \sim \mu(a|s)}[Q(s, a)] - \mathbb{E}_{s \sim D, a \sim \hat{\pi}_\beta(a|s)}[Q(s, a)] \right) + \frac{1}{2} \mathbb{E}_{s,a,s' \sim D}\left[ \left( Q(s, a) - \hat{\mathcal{B}}^\pi \hat{Q}^{k}(s, a) \right)^2 \right]$$
where $\hat{\pi}_\beta$ and $\pi$ are the behaviour policy and the learned policy respectively, $\mu$ is an arbitrary policy different from $\hat{\pi}_\beta$, and $\alpha$ is the factor that trades off conservatism. Constrained Policy Optimization. To address the distribution drift between the learned policy and the behaviour policy, one approach is to constrain the learned policy to stay close to the behaviour policy (Bai et al., 2021; Wu et al., 2019; Nair et al., 2020; Levine et al., 2020; Fujimoto et al., 2019). Here we take Advantage Weighted Regression (Peng et al., 2019b; Nair et al., 2020), which adopts an implicit KL divergence to constrain the distance between policies, as an example:
$$\pi^{k+1} \leftarrow \arg\max_\pi \; \mathbb{E}_{s,a \sim D}\left[ \log \pi(a|s) \, \frac{1}{Z(s)} \exp\left( \frac{1}{\lambda} A^{\pi^k}(s, a) \right) \right]$$
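The two regularizers above can be illustrated in a few lines of NumPy. This is a hedged sketch with made-up batch values, not the original implementations: `alpha` and `lam` play the roles of $\alpha$ and $\lambda$ in the equations, and normalizing the weights by their batch mean is an assumed stand-in for the per-state partition function $Z(s)$:

```python
import numpy as np

def cql_penalty(q_mu, q_beta, alpha):
    """CQL regularizer alpha * (E_{a~mu}[Q] - E_{a~pi_beta}[Q]):
    pushes Q down on actions sampled from mu (possibly OOD)
    and up on the dataset (behaviour) actions."""
    return alpha * (q_mu.mean() - q_beta.mean())

def awr_weights(advantages, lam):
    """Advantage-weighted regression weights exp(A / lam),
    normalized over the batch as a stand-in for Z(s)."""
    w = np.exp(advantages / lam)
    return w / w.mean()

q_mu = np.array([3.0, 4.0])    # Q on actions sampled from mu
q_beta = np.array([2.0, 3.0])  # Q on dataset actions
print(cql_penalty(q_mu, q_beta, alpha=1.0))  # 1.0: OOD actions valued higher get penalized

adv = np.array([1.0, -1.0, 0.0])
print(awr_weights(adv, lam=1.0))  # largest weight on the highest-advantage action
```

Note how the weighted log-likelihood keeps the policy update within the support of $D$: every gradient comes from dataset actions, with better-than-average actions simply weighted more heavily.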




