EFFECTIVE OFFLINE REINFORCEMENT LEARNING VIA CONSERVATIVE STATE VALUE ESTIMATION

Abstract

Offline RL seeks to learn effective policies solely from historical data, and those policies are expected to perform well in the online environment. However, it faces a major challenge of value over-estimation introduced by the distributional drift between the dataset and the current learned policy, which leads to learning failure in practice. The common approach is to add a penalty term to the reward or value estimation in the Bellman iterations, which has given rise to a number of successful algorithms such as CQL. Meanwhile, to avoid extrapolation on unseen states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose CSVE, a new approach that learns a conservative V-function by directly imposing a penalty on out-of-distribution states. We prove that for the evaluated policy, our conservative state value estimation satisfies: (1) over the state distribution that samples penalizing states, it lower bounds the true values in expectation, and (2) over the marginal state distribution of the data, it is no more than the true values in expectation plus a constant determined by the sampling error. Further, we develop a practical actor-critic algorithm in which the critic does the conservative value estimation by additionally sampling and penalizing the states around the dataset, and the actor applies advantage-weighted updates to improve the policy. We evaluate on classic continuous control tasks of D4RL, showing that our method performs better than conservative Q-function learning methods (e.g., CQL) and is strongly competitive among recent SOTA methods.

1. INTRODUCTION

Reinforcement Learning (RL), which learns to act by interacting with the environment, has achieved remarkable success in various tasks. However, in most real applications it is impossible to learn online from scratch, as exploration is often risky and unsafe. Instead, offline RL (Fujimoto et al., 2019; Lange et al., 2012) avoids this problem by learning the policy solely from historical data.
However, the naive approach, which directly uses online RL algorithms to learn from a static dataset, suffers from value over-estimation and policy extrapolation on OOD (out-of-distribution) states or actions. Recently, conservative value estimation, i.e., being conservative on states and actions for which there are not enough samples, has been put forward as a principle for effective offline RL (Shi et al., 2022; Kumar et al., 2020; Buckman et al., 2020). Prior methods, e.g., Conservative Q-Learning (CQL; Kumar et al. (2020)), avoid the value over-estimation problem by systematically underestimating the Q-values of OOD actions at the states in the dataset. In practice, this is often too pessimistic and thus leads to overly conservative algorithms. COMBO (Yu et al., 2021) leverages a learnt dynamics model to augment data by interpolation, and then learns a Q-function that is less conservative than CQL's and can potentially derive a better policy. In this paper, we propose CSVE (Conservative State Value Estimation), a new offline RL approach. Unlike the traditional methods above, which estimate conservative values by penalizing the Q-function on OOD states or actions, CSVE directly penalizes the V-function on OOD states. We prove in theory that CSVE has tighter bounds on the true state values than CQL, and the same bounds as COMBO but under more general discounted state distributions, which leaves more room for algorithm design. Our main contributions are as follows.



• The conservative state value estimation with related theoretical analysis. We prove that it lower bounds the real state values in expectation over any state distribution that is used to sample OOD states, and is upper bounded by the real values in expectation over the marginal state distribution of the dataset plus a constant term depending only on the sampling error. Compared to prior work, it has several advantages for potentially deriving a better policy.

• A practical actor-critic implementation. It approximately estimates the conservative state values in the offline context and improves the policy via advantage-weighted updates. In particular, we use a dynamics model to generalize over the in-distribution space and sample OOD states that are directly reachable from the dataset.

• Experimental evaluation on continuous control tasks of Gym (Brockman et al., 2016) and Adroit (Rajeswaran et al., 2017) in the D4RL (Fu et al., 2020) benchmarks, showing that CSVE performs better than prior methods based on conservative Q-value estimation, and is strongly competitive among the main SOTA offline RL algorithms.

2. PRELIMINARIES

Offline Reinforcement Learning. Consider the Markov Decision Process M := (S, A, P, r, ρ, γ), which consists of the state space S, the action space A, the transition model P : S × A → ∆(S), the reward function r : S × A → R, the initial state distribution ρ and the discount factor γ ∈ (0, 1]. A stochastic policy π : S → ∆(A) takes an action with some probability given the current state. A transition is a tuple (s_t, a_t, r_t, s_{t+1}) where a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t) and r_t = r(s_t, a_t). We assume that the reward values satisfy |r(s, a)| ≤ R_max, ∀s, a. A trajectory under π is the random sequence τ = (s_0, a_0, r_0, s_1, a_1, r_1, . . . , s_T), which consists of consecutive transitions starting from s_0 ∼ ρ. Standard RL learns a policy π ∈ Π that maximizes the expected cumulative reward J_π(M) = E_{M,π}[Σ_{t=0}^∞ γ^t r_t] via active interaction with the environment M. At any time t, for the policy π, the state value function is defined as V^π(s) := E_{M,π}[Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s], and the action value function as Q^π(s, a) := E_{M,π}[Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s, a_t = a]. The Bellman operator B^π leads to iterative value updates, and Bellman consistency implies that V^π(s) = B^π V^π(s), ∀s and Q^π(s, a) = B^π Q^π(s, a), ∀s, a. In practice with function approximation, we use the empirical Bellman operator B̂^π, in which the former expectations are estimated from data.

Offline RL learns the policy π from a static dataset D = {(s, a, r, s′)} consisting of transitions collected by any behaviour policy, aiming to behave well in the online environment. Note that, unlike standard online RL, offline RL cannot interact with the environment during learning.

Conservative Value Estimation. One main challenge in offline RL is the over-estimation of values introduced by extrapolation on unseen states and actions, which may make the learned policy collapse. To address this issue, conservatism or pessimism is used in value estimation, e.g.
CQL learns a conservative Q-value function by penalizing the values of unseen actions at dataset states:

Q̂^{k+1} ← argmin_Q α (E_{s∼D, a∼µ(a|s)}[Q(s, a)] − E_{s∼D, a∼π_β(a|s)}[Q(s, a)]) + (1/2) E_{s,a,s′∼D}[(Q(s, a) − B̂^π Q̂^k(s, a))^2]   (1)

where π_β and π are the behaviour policy and the learnt policy respectively, µ is an arbitrary policy different from π_β, and α is the trade-off factor for conservatism.

Constrained Policy Optimization. To address the issue of distribution drift between the learning policy and the behaviour policy, one approach is to constrain the learning policy to stay close to the behaviour policy (Bai et al., 2021; Wu et al., 2019; Nair et al., 2020; Levine et al., 2020; Fujimoto et al., 2019). Here we take Advantage Weighted Regression (Peng et al., 2019b; Nair et al., 2020), which adopts an implicit KL divergence to constrain the distance between policies, as an example:

π^{k+1} ← argmax_π E_{s,a∼D}[log π(a|s) (1/Z(s)) exp((1/λ) A^{π_k}(s, a))]

where A^{π_k} is the advantage of policy π_k, and Z(s) is the normalization constant for s.

Model-based Offline RL. In RL, the model is an approximation of the MDP M. We denote a model as M̂ := (S, A, P̂, r̂, ρ, γ), where P̂ and r̂ are approximations of P and r respectively. In the offline RL setting, the model is used to roll out and augment data (Yu et al., 2020; 2021) or to act as a surrogate of the real environment that interacts with the agent (Kidambi et al., 2020). In this paper, we use the model to sample next states that are approximately reachable from the dataset.
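To make the CQL regularizer in Eq. 1 concrete, the following is a minimal tabular sketch (not the paper's or CQL's implementation). Without function approximation, setting the derivative of Eq. 1 to zero gives the per-entry minimizer Q(s, a) = B̂^π Q̂^k(s, a) − α[µ(a|s)/π_β(a|s) − 1]; the policies and backup values below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, alpha = 4, 3, 1.0

# Hypothetical tabular ingredients; names (pi_beta, mu) follow Eq. 1.
pi_beta = rng.dirichlet(np.ones(n_actions), size=n_states)  # behaviour policy
mu = rng.dirichlet(np.ones(n_actions), size=n_states)       # OOD-action sampler
backup = rng.normal(size=(n_states, n_actions))             # stands in for B^pi Q^k

# Closed-form minimizer of Eq. 1 without function approximation.
q = backup - alpha * (mu / pi_beta - 1.0)

# Actions that mu samples more often than the behaviour policy are pushed
# below their backup value; actions the behaviour policy prefers are pushed up.
pushed_down = q < backup
assert np.array_equal(pushed_down, mu > pi_beta)
```

The sketch only illustrates the direction of the penalty: the conservative shift at each (s, a) grows with the density ratio µ(a|s)/π_β(a|s), which is exactly what makes CQL pessimistic about actions that are rare in the data.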

3. CONSERVATIVE STATE VALUE ESTIMATION

In the offline setting, value overestimation is a major problem that results in failure to learn a reasonable policy (Levine et al., 2020; Fujimoto et al., 2019). In contrast to prior works (Kumar et al., 2020; Yu et al., 2021) that obtain conservative value estimation by penalizing the Q-function on OOD state-action pairs, we directly penalize the V-function on OOD states. Our approach provides several novel theoretical results that allow a better trade-off between conservative value estimation and policy improvement. All proofs of our theorems can be found in Appendix A.

3.1. CONSERVATIVE OFF-POLICY EVALUATION

Our approach is an alternative to CQL (Kumar et al., 2020). Instead of learning a conservative Q-function, we aim to conservatively estimate the value V^π(s) of a target policy π given a dataset D, in order to avoid overestimation on out-of-distribution states. To achieve this, we penalize the V-values evaluated on states that are more likely to be out-of-distribution and push up the V-values on states that are in the distribution of the dataset, via the following iteration:

V̂^{k+1} ← argmin_V (1/2) E_{s∼d_u(s)}[(B̂^π V̂^k(s) − V(s))^2] + α (E_{s′∼d(s)}[V(s′)] − E_{s∼d_u(s)}[V(s)])   (3)

where d_u(s) is the discounted state distribution of D, d(s) is any state distribution, and B̂^π is the empirical Bellman operator (see appendix for more details). In the setting without function approximation, by setting the derivative of Eq. 3 to zero, the V-function found by approximate dynamic programming at iteration k can be obtained as:

V̂^{k+1}(s) = B̂^π V̂^k(s) − α [d(s)/d_u(s) − 1], ∀s, k.   (4)

Denote the function projection on V̂^k in Eq. 4 as T^π. We have Lemma 1, and thus V̂^k converges to a unique fixed point.

Lemma 1. For any d with supp d ⊆ supp d_u, T^π is a γ-contraction in the L_∞ norm.

Theorem 1. For any d with supp d ⊆ supp d_u (d ≠ d_u), with a sufficiently large α, i.e., α ≥ E_{s∼d(s)} E_{a∼π(a|s)}[C_{r,t,δ} R_max / ((1−γ)√|D(s,a)|)] / E_{s∼d(s)}[d(s)/d_u(s) − 1], the expected value of the estimate V̂^π(s) = lim_{k→∞} V̂^k(s) under d(s) lower bounds the true value, that is: E_{s∼d(s)}[V̂^π(s)] ≤ E_{s∼d(s)}[V^π(s)]. Here |D(s, a)| is large but finite, and we assume that with probability ≥ 1 − δ the sampling error is less than C_{r,t,δ} R_max / ((1−γ)√|D(s,a)|), where C_{r,t,δ} is a constant (see appendix for more details). Note that if the sampling error is negligible, any α > 0 guarantees the lower bound.

Theorem 2. The expected value of the estimate V̂^π(s) under the state distribution of the original dataset is at most the true value plus an irreducible sampling-error term, that is:

E_{s∼d_u(s)}[V̂^π(s)] ≤ E_{s∼d_u(s)}[V^π(s)] + E_{s∼d_u(s)}[(I − γP^π)^{−1} E_{a∼π(a|s)} C_{r,t,δ} R_max / ((1−γ)√|D(s,a)|)]

where P^π refers to the transition matrix coupled with policy π (see appendix for details).

Now we show that, during the iterations, the gap between the values of in-distribution states and out-of-distribution states in the estimated V-function is larger than in the true V-function.

Theorem 3. At any iteration k, with a large enough α, our method expands the difference in expected V-values under the chosen state distribution and the dataset state distribution, that is: E_{s∼d_u(s)}[V̂^k(s)] − E_{s∼d(s)}[V̂^k(s)] > E_{s∼d_u(s)}[V^k(s)] − E_{s∼d(s)}[V^k(s)].

In the policy extraction step, this property makes the policy take actions a at states s ∼ D that stay in distribution rather than out of distribution, since our estimated V-function does not overestimate the erroneous out-of-distribution states relative to the in-distribution states.

Now we present four remarks that explain how the above theorems guide the application of Eq. 3 in offline RL algorithms.

Remark 1. In Eq. 3, if d = d_u, the penalty on out-of-distribution states degenerates, which means that the policy should not reach states with low support in the data, and consequently never explores unseen actions at those states. Indeed, AWAC (Nair et al., 2020) adopts this setting. We show that with a proper choice of d different from d_u, our method performs better than AWAC in practice.

Remark 2. Theorem 2 implies that under d_u, the marginal state distribution of the data, the expected estimated value of π is either lower than the true value, or higher than the true value but within a threshold. This fact motivates our advantage-weighted policy update method in Eq. 11.

Remark 3.
Theorem 1 implies that under d, say the discounted state distribution of any policy, the expected estimated value of π lower bounds the true value. This fact motivates our policy improvement method of unifying the advantage-weighted update with an exploration bonus in Eq. 12.

Remark 4. Theorem 3 states that E_{s∼d(s)}[V^k(s)] − E_{s∼d(s)}[V̂^k(s)] > E_{s∼d_u(s)}[V^k(s)] − E_{s∼d_u(s)}[V̂^k(s)], i.e., the estimate is penalized more heavily on the states sampled from d than on those from the data.

Comparison with prior work: CQL (Eq. 1), which penalizes the Q-function of OOD actions at states in the historical data, guarantees a state-wise lower bound on value estimation: V̂^π(s) = E_{π(a|s)}[Q̂^π(s, a)] ≤ E_{π(a|s)}[Q^π(s, a)] = V^π(s) for all s ∈ D. COMBO, which penalizes the Q-function on OOD states and actions of an interpolation of historical data and model-based roll-outs, guarantees a lower bound on the expectation of state values: E_{s∼µ_0}[V̂^π(s)] ≤ E_{s∼µ_0}[V^π(s)], where µ_0 is the initial state distribution (Remark 1, Section A.2 of COMBO; Yu et al. (2021)); this is a special case of our result in Theorem 1 when d = µ_0. Although both CSVE and COMBO aim for better performance by relaxing the conservative estimation guarantee from state-wise values to an expectation of state values, CSVE obtains the same lower bounds under more general state distributions. This provides more flexible space for algorithm design, and it is also one main reason for penalizing V rather than Q. By controlling the distance of d to the behaviour policy's discounted state distribution d_β, CSVE has the potential for further performance improvement. Note that bounding E[V̂(s)], rather than the state-wise V̂(s), yields a more adventurous policy, which achieves better performance on in-distribution states but behaves more riskily on OOD states. To deal with that limitation, we introduce a deep ensemble dynamics model to sample the OOD states for better estimation.
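Theorem 1 can be checked numerically in a small tabular setting. The sketch below is an illustration under assumptions, not the paper's code: it ignores sampling error, chooses d as the stationary state distribution of P^π (one valid state distribution, under which the bound holds for any α > 0), iterates Eq. 4 to its fixed point, and verifies both the interpolation inequality E_d[d/d_u − 1] ≥ 0 and the expected lower bound E_d[V̂^π] ≤ E_d[V^π]:

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, alpha = 5, 0.9, 0.5

# Hypothetical tabular ingredients: P_pi is the state transition matrix under
# a fixed policy pi, r_pi the expected per-state reward; no sampling error.
P_pi = rng.dirichlet(np.ones(n), size=n)     # rows sum to 1
r_pi = rng.uniform(size=n)

# d: here the stationary distribution of P_pi (found by power iteration);
# d_u: the data's marginal state distribution (full support, chosen randomly).
d = np.full(n, 1.0 / n)
for _ in range(500):
    d = d @ P_pi
d_u = rng.dirichlet(np.ones(n))

# True values and the Eq. 4 fixed point (iterated to convergence).
V_true = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
V_hat = np.zeros(n)
for _ in range(1000):
    V_hat = r_pi + gamma * P_pi @ V_hat - alpha * (d / d_u - 1.0)

assert d @ (d / d_u - 1.0) >= -1e-9      # interpolation lemma with f = 1
assert d @ V_hat <= d @ V_true + 1e-6    # Theorem 1: lower bound under d
```

The first assertion follows from the Cauchy-Schwarz inequality (Σ_s d(s)^2/d_u(s) ≥ 1 for any two distributions), and with d stationary the penalty accumulates as α/(1−γ) times that nonnegative gap, so the second assertion holds regardless of the random draw.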

3.2. SAFE POLICY IMPROVEMENT GUARANTEES

Following prior works (Laroche et al., 2019; Kumar et al., 2020; Yu et al., 2021), we show that our method has safe policy improvement guarantees against the data-implied behaviour policy. We first show that our method optimizes a penalized RL empirical objective:

Theorem 4. Let V̂^π be the fixed point of Equation 3; then π*(a|s) = argmax_π V̂^π(s) is equivalently obtained by solving:

π* ← argmax_π J(π, M̂) − α (1/(1−γ)) E_{s∼d^π_{M̂}(s)}[d(s)/d_u(s) − 1]   (5)

Building upon Theorem 4, we show that our method provides a ζ-safe policy improvement over π_β:

Theorem 5. Let π*(a|s) be the policy obtained in Equation 5. Then it is a ζ-safe policy improvement over π̂_β in the actual MDP M, i.e., J(π*, M) ≥ J(π̂_β, M) − ζ with high probability 1 − δ, where ζ is given by:

ζ = 2 (C_{r,δ}/(1−γ) + γ R_max C_{T,δ}/(1−γ)^2) E_{s∼d^{π*}_{M̂}(s)}[(√|A| / √|D(s)|) E_{a∼π*(a|s)}(π*(a|s)/π̂_β(a|s))] − (J(π*, M̂) − J(π̂_β, M̂)),

where the last difference satisfies J(π*, M̂) − J(π̂_β, M̂) ≥ α (1/(1−γ)) E_{s∼d^{π*}_{M̂}(s)}[d(s)/d_u(s) − 1].

4. METHODOLOGY

In this section, we propose a practical actor-critic method for computing a conservative value estimate by approximately solving Equation 3, together with advantage-weighted policy updates. It is mainly motivated by the theoretical results, as explained by the four remarks in Section 3.1. The full algorithm of the deep learning implementation is presented in Appendix B.

4.1. CONSERVATIVE VALUE ESTIMATION

Given access to a dataset D collected by some behaviour policy π_β, our aim is to estimate the value function V^π of a target policy π. As stated in Section 3, to prevent value overestimation we instead learn a conservative value function V̂^π that lower bounds the real values of π, by adding a penalty on out-of-distribution states into the flow of Bellman projections. Our method consists of the following iterative updates, Equations 7-9, where Q̄^k is the target network of Q̂^k:

V̂^{k+1} ← argmin_V L^π_V(V; Q̄^k) = α (E_{s∼D, a∼π(·|s), s′∼P̂(s,a)}[V(s′)] − E_{s∼D}[V(s)]) + E_{s∼D}[(E_{a∼π(·|s)}[Q̄^k(s, a)] − V(s))^2]   (7)

Q̂^{k+1} ← argmin_Q L^π_Q(Q; V̂^{k+1}) = E_{s,a,s′∼D}[(r(s, a) + γ V̂^{k+1}(s′) − Q(s, a))^2]   (8)

Q̄^{k+1} ← ω Q̄^k + (1 − ω) Q̂^{k+1}   (9)

The RHS of Eq. 7 is an approximation of Eq. 3, where the first term penalizes out-of-distribution states and the second term follows the definition of V-values and Q-values. In Eq. 8, the RHS is the TD error estimated on transitions in the dataset D. Note that the target here uses the sum of the immediate reward r(s, a) and the value of the next state, V̂^{k+1}(s′). In Eq. 9, the target Q-values are updated with a soft interpolation factor ω ∈ (0, 1). Q̄^k changes more slowly than Q̂^k, which makes the value regression in Eq. 7 more stable.

Constrained policy. Note that in the RHS of Eq. 7, the expectation uses a ∼ π(·|s). To safely estimate the target value of V(s) by E_{a∼π(·|s)}[Q̄(s, a)] almost always requires supp(π(·|s)) ⊂ supp(π_β(·|s)). We achieve this by the advantage-weighted policy update, which forces π(·|s) to have significant probability mass on actions taken by π_β in the data, as detailed in Section 4.2.

Model-based OOD state sampling. In Eq. 7, we implement the state sampling process s′ ∼ d of Eq. 3 as the flow {s ∼ D; a ∼ π(a|s); s′ ∼ P̂(s′|s, a)}, i.e., the distribution of predicted next states from D when following π. This is beneficial in practice.
On one hand, it is efficient to sample only the states that are approximately reachable from D in one step, rather than to sample the whole state space. On the other hand, we only need the model for one-step prediction, so no bootstrapped errors due to long horizons are introduced. Following previous work (Janner et al., 2019; Yu et al., 2020; 2021), we implement the probabilistic dynamics model as an ensemble of deep neural networks {p_θ^1, . . . , p_θ^B}. Each network produces a Gaussian distribution over the next state and reward: P^i_θ(s_{t+1}, r | s_t, a_t) = N(µ^i_θ(s_t, a_t), σ^i_θ(s_t, a_t)).

Adaptive penalty factor α. The pessimism level is controlled by the parameter α ≥ 0. In practice, we make α adaptive during training, similarly to CQL (Kumar et al., 2020):

max_{α≥0} [α (E_{s′∼d}[V_ψ(s′)] − E_{s∼D}[V_ψ(s)] − τ)]

where τ is a budget parameter. If the expected difference in V-values is less than τ, α decreases; otherwise, α increases and penalizes out-of-distribution state values more aggressively.

Discussion. As stated in the former sections, our method focuses on estimating a conservative state value for learning a policy. The benefit of adding conservatism to the V-function is twofold. First, penalizing V-values involves a smaller hypothesis space than penalizing Q-values, which reduces the computational complexity. Second, penalizing V-values achieves a more relaxed lower bound than penalizing Q-values, by forgoing the explicit marginalization over Q-values. A more relaxed lower bound leaves more opportunity to find a better policy.
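The critic updates of Eqs. 7-9 and the adaptive α can be sketched in a toy tabular setting. Everything below is illustrative, not the paper's implementation: the "dataset" and "model" come from a small, randomly generated deterministic MDP (the true dynamics stand in for the learned ensemble P̂), and plain SGD on tabular V and Q replaces the neural networks:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma, omega, lr, tau = 6, 2, 0.9, 0.99, 0.5, 0.1
alpha = 1.0

# Toy deterministic dynamics and rewards (stand-ins for both the real task
# and the learned model P_hat, which coincide in this sketch).
succ = rng.integers(0, nS, size=(nS, nA))   # s' = succ[s, a]
R = rng.uniform(size=(nS, nA))

# A fixed batch of behaviour-policy transitions playing the role of D.
N = 256
bs = rng.integers(0, nS, size=N)
ba = rng.integers(0, nA, size=N)
br = R[bs, ba]
bs2 = succ[bs, ba]

pi = rng.dirichlet(np.ones(nA), size=nS)    # current policy (held fixed here)
V = np.zeros(nS)
Q = np.zeros((nS, nA))
Q_bar = Q.copy()                            # target network (Eq. 9)

first_td_sq = None
for it in range(300):
    # Eq. 7: regress V toward E_{a~pi}[Q_bar] on data states, and penalize
    # model-predicted successors of policy actions (the OOD-state samples).
    a_pi = np.array([rng.choice(nA, p=pi[s]) for s in bs])
    s_model = succ[bs, a_pi]
    gV = np.zeros(nS)
    np.add.at(gV, s_model, alpha / N)       # push sampled OOD states down
    np.add.at(gV, bs, -alpha / N)           # push data states up
    target = (pi[bs] * Q_bar[bs]).sum(axis=1)
    np.add.at(gV, bs, -2.0 * (target - V[bs]) / N)
    V -= lr * gV

    # Eq. 8: TD regression of Q toward r + gamma * V(s') on dataset transitions.
    td = br + gamma * V[bs2] - Q[bs, ba]
    if first_td_sq is None:
        first_td_sq = (td ** 2).mean()
    gQ = np.zeros((nS, nA))
    np.add.at(gQ, (bs, ba), -2.0 * td / N)
    Q -= lr * gQ

    # Eq. 9: soft target update; Eq. 10: dual ascent on the penalty factor.
    Q_bar = omega * Q_bar + (1 - omega) * Q
    alpha = max(0.0, alpha + 0.1 * (V[s_model].mean() - V[bs].mean() - tau))

final_td_sq = (td ** 2).mean()
assert final_td_sq < first_td_sq            # the TD regression makes progress
```

The interleaving mirrors the text: the slow-moving Q̄ stabilizes the V regression, while the dual update shrinks α once the V-value gap between sampled OOD states and data states drops below the budget τ.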

4.2. ADVANTAGE WEIGHTED POLICY UPDATES

After learning the conservative V̂^{k+1} and Q̂^{k+1} (or V̂^π and Q̂^π at convergence), we improve the policy by the following advantage-weighted policy update (Nair et al., 2020):

π ← argmin_{π′} L_π(π′) = −E_{s,a∼D}[log π′(a|s) exp(β Â^{k+1}(s, a))]   (11)

where Â^{k+1}(s, a) = Q̂^{k+1}(s, a) − V̂^{k+1}(s). Eq. 11 updates the policy π by weighted maximum likelihood, computed by re-weighting state-action samples in D with the estimated advantage Â^{k+1}. As discussed in AWAC (Nair et al., 2020), this method avoids explicitly estimating the behaviour policy, whose sampling errors are an important issue in the offline RL setting (Kumar et al., 2020).

Implicit policy constraints. We adopt the advantage-weighted policy updates, which impose an implicit KL divergence constraint between π and π_β. This policy constraint is necessary to guarantee that the next state s′ in Equation 7 can be safely generated by the policy π. As derived in Nair et al. (2020) (Appendix A), Eq. 11 is a parametric solution of the following problem:

max_{π′} E_{a∼π′(·|s)}[Â^{k+1}(s, a)]  s.t.  D_KL(π′(·|s) ∥ π_β(·|s)) ≤ ϵ,  ∫_a π′(a|s) da = 1.

Note that D_KL(π′ ∥ π_β) is a reverse KL divergence with respect to π′, which is mode-seeking (Shlens, 2014). When treated as a Lagrangian, it forces π′ to allocate its probability mass to the maximum-likelihood supports of π_β, re-weighted by the estimated advantage. In other words, in the part of the action space A where π_β(·|s) has no samples, π′(·|s) has almost zero probability mass too.

Bonus of exploration on near states. As suggested by the remarks in Section 3.1, in practice, allowing the policy to explore the predicted next states of transitions (s ∼ D, following a ∼ π′(·|s)) leads to better test performance. With this kind of exploration, the policy is updated as follows:
π ← argmin_{π′} L^+_π(π′) = L_π(π′) − λ E_{s∼D, a∼π′(·|s), s′∼P̂(s,a)}[r(s, a) + V̂^{k+1}(s′)]   (12)

The second term approximates E_{s∼d_π(s)}[V^π(s)], while the first term approximates E_{s∼d_u(s)}[V^π(s)]. While the choice of λ is ultimately just a hyper-parameter, adjusting λ balances optimistic policy optimization (maximizing V) against the constrained policy update (the first term).
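The weighted maximum-likelihood step of Eq. 11 can be illustrated on a single-state, discrete-action example with a softmax policy. This is a hypothetical sketch (the paper uses continuous actions and adds the model-based bonus of Eq. 12 on top of this loss); the advantages and action counts are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
nA, beta, lr = 4, 3.0, 0.5

# One-state sketch of Eq. 11: per-action advantages A_hat(s, a) and how often
# the behaviour policy took each action in the dataset D.
adv = np.array([0.0, 1.0, -1.0, 0.5])
counts = np.array([10, 5, 5, 10])
logits = np.zeros(nA)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(200):
    pi = softmax(logits)
    w = counts * np.exp(beta * adv)      # AWR weights aggregated over D
    # gradient of -sum_a w_a * log pi(a|s) wrt logits: pi * sum(w) - w
    grad = pi * w.sum() - w
    logits -= lr * grad / w.sum()

pi = softmax(logits)
assert pi.argmax() == 1                  # mass moves to the high-advantage action
assert pi[2] < pi[0]                     # negative-advantage action is suppressed
```

At convergence the softmax policy matches the normalized weights, i.e., the behaviour distribution re-weighted by exp(β·Â): actions the behaviour policy never took keep zero weight, which is the implicit support constraint discussed above.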

5. EXPERIMENTS

The primary goal of this section is to investigate whether the proposed tighter conservative value estimation leads to performance improvement. Besides, we would like to ascertain when further exploration is beneficial and how well CSVE performs compared with SOTA algorithms. We evaluate our method on classical continuous control tasks of Gym (Brockman et al., 2016) and Adroit (Rajeswaran et al., 2017) in the standard D4RL (Fu et al., 2020) benchmark. The Gym control tasks include HalfCheetah, Hopper and Walker2D, each with 5 datasets collected by following different types of policies (random, medium, medium-replay, medium-expert, and expert). The Adroit tasks include Pen, Hammer, Door and Relocate, each with 3 datasets collected by different policies (human, cloned, and expert). Our method, namely CSVE, is compared with the baselines CQL (Kumar et al., 2020), COMBO (Yu et al., 2021), AWAC (Nair et al., 2020), PBRL (Bai et al., 2021) and other SOTA algorithms, TD3-BC (Fujimoto & Gu, 2021), UWAC (Wu et al., 2021), IQL (Kostrikov et al., 2021b) and BEAR (Kumar et al., 2019), whose performance results are public or which have high-quality open implementations. CQL, which estimates conservative Q-values on state-action pairs rather than states, is the method most directly comparable to ours. COMBO also lower bounds the estimated V-function. AWAC is the special case of our Eq. 3 where d = d_u. PBRL is a very strong performer in offline RL, but is computationally costly, since it uses an ensemble of hundreds of sub-models.

5.1. OVERALL PERFORMANCE

We first test on the Gym control tasks. We train our methods for 1 million steps and report the final evaluation performance. The overall results are shown in Table 1. Compared to CQL, our method has better performance on 11 of 15 tasks and similar performance on the others. In particular, our method shows a consistent advantage on the datasets generated by following random or suboptimal policies (random and medium). Compared to AWAC, our method has better performance on 9 of 15 tasks and comparable performance on the others, which demonstrates the effect of our further exploration beyond cloning the behaviour policy. In particular, our method shows an obvious advantage in extracting the best policy from data of mixed policies (medium-expert) while AWAC cannot. Compared to COMBO, our method has better performance on 6 out of 12 tasks and comparable or slightly worse performance on the others, which demonstrates the effect of our better bounds on V. In particular, our method shows an obvious advantage in extracting the best policy on the medium and medium-expert tasks. In the 9 tasks evaluated, our method gets a higher score than IQL in 7 of them, and has similar performance in the other tasks. Finally, our method performs close to PBRL, even though PBRL has orders of magnitude more model capacity and computation cost.

We now evaluate our method on the Adroit tasks. For CSVE, we report the final evaluation results after training for 0.1 million steps. The full results are reported in Table 2. Compared to IQL, our method performs better in 8 out of 12 tasks, and performs similarly in the other 4 tasks. For the expert datasets, all methods, including simple BC (behaviour cloning), can perform well, among which ours is the most competitive on all four tasks. For the human and cloned datasets, almost all methods fail to learn effective policies on three tasks, except for the Pen task.
For the Pen task, CSVE is the only method that succeeds in learning a good policy on the human dataset, while it learns a medium policy on the cloned dataset, as BC and PBRL do.

5.2. SENSITIVENESS OF HYPER-PARAMETERS

We analyze the hyper-parameter β, which trades off between behaviour cloning and policy optimization. For smaller values, the objective behaves similarly to behaviour cloning (the weights are close for all actions), while for larger values, it attempts to recover the maximum of the Q-function. To quantitatively analyze its effect, we test different β from {0.1, 3, 10} on the MuJoCo tasks with the medium-type datasets; the results are shown in Fig. 1. We can see that β affects the policy performance during training. Empirically, we found that in general β = 3.0 is suitable for such medium-type datasets. In practice, by default we use β = 3.0 for the random and medium tasks, and 0.1 for the medium-replay, medium-expert and expert datasets.
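This trade-off can be seen directly in the AWR weights exp(β·Â). The sketch below, with made-up advantages, measures the entropy of the normalized weights for the three tested values of β: near-uniform weights at β = 0.1 correspond to behaviour-cloning-like updates, while β = 10 concentrates almost all weight on the highest-advantage samples:

```python
import numpy as np

# Illustrative advantages spread over [-1, 1]; not taken from any real task.
adv = np.linspace(-1.0, 1.0, 11)

def weight_entropy(beta):
    """Entropy of the normalized AWR weights exp(beta * adv)."""
    w = np.exp(beta * adv)
    p = w / w.sum()
    return -(p * np.log(p)).sum()

entropies = [weight_entropy(b) for b in (0.1, 3.0, 10.0)]
assert entropies[0] > entropies[1] > entropies[2]   # sharper weighting as beta grows
assert abs(entropies[0] - np.log(len(adv))) < 0.01  # beta = 0.1 is near uniform
```

Maximum entropy here is log(11) ≈ 2.40 (uniform weights); as β grows the weight distribution tilts exponentially toward the top-advantage samples, which is exactly the behaviour-cloning versus Q-maximization trade-off described above.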

6. RELATED WORK

Offline RL (Fujimoto et al., 2019; Levine et al., 2020) aims to learn a reasonable policy from a static dataset collected by arbitrary policies, without further interaction with the environment. Compared to interactive RL, offline RL suffers from two critical inherent issues, i.e., the distribution drift introduced by off-policy learning and the out-of-distribution extrapolation in value estimation (Ostrovski et al., 2021; Levine et al., 2020). The common idea of offline RL algorithms is to incorporate conservatism or regularization into online RL algorithms. Here we briefly review the prior work with a comparison to ours.

Figure 1: The effect of β on medium tasks.

Conservative value estimation: Prior offline RL algorithms regularize the learning policy to stay close to the data (or to an explicitly estimated behaviour policy) and penalize exploration of the out-of-distribution region, via distribution correction estimation (Dai et al., 2020; Yang et al., 2020), policy constraints with support matching (Wu et al., 2019) and distributional matching (Fujimoto et al., 2019; Kumar et al., 2019), applying a policy-divergence-based penalty on Q-functions (Kostrikov et al., 2021a; Wang et al., 2020), an uncertainty-based penalty on Q-functions (Agarwal et al., 2020), or conservative Q-function estimation (Kumar et al., 2020). Besides, model-based algorithms (Yu et al., 2020) directly estimate dynamics uncertainty and translate it into a reward penalty. Different from these prior works, which impose conservatism on state-action pairs or actions, ours directly does such conservative estimation on states and requires no explicit uncertainty quantification. With a learned conservative value estimate, an offline policy can be learned via implicit derivation from a state-action joint distribution, or in the Q-learning and actor-critic frameworks. In this paper, our implementation adopts the method proposed in AWAC (Nair et al., 2020; Peng et al., 2019a).
Model-based algorithms: Model-based offline RL learns the dynamics model from the static dataset and uses it for uncertainty quantification (Yu et al., 2020), data augmentation with roll-outs (Yu et al., 2021), or planning (Kidambi et al., 2020; Chen et al., 2021). Such methods typically rely on wide data coverage when planning and augmenting data with roll-outs, and on low model estimation error when estimating uncertainty, which is often difficult to satisfy in reality and leads to policy instability. Instead, we use the model to sample only the next-step states reachable from the data, which imposes no such strict requirements on data coverage or model bias.

Theoretical results: Our theoretical results are derived from conservative Q-value estimation (CQL) and safe policy improvement (Laroche et al., 2019). Besides, COMBO (Yu et al., 2021) gives a conservative but tighter value estimate than CQL when the dataset is augmented with model-based roll-outs. Compared to our result, COMBO's lower bounds additionally assume the same initial state distribution, which may not always hold in continuous control.

7. DISCUSSION

In this paper, we propose a new approach to offline RL based on conservative value estimation on states, and discuss how the theoretical results lead to new RL algorithms. In particular, we develop a practical actor-critic algorithm, in which the critic performs conservative state value estimation by incorporating a penalty on the model-predicted next states into the Bellman iterations, and the actor performs advantage-weighted policy updates with a bonus for exploring states with conservative values. Experimental evaluation shows that our method performs better than alternative methods based on conservative Q-function estimation and is competitive among the SOTA methods, confirming our theoretical analysis. Moving forward, we hope to explore the design of more powerful algorithms based on conservative state value estimation.

A PROOFS

We first redefine notation for clarity and then provide the proofs of the results in the main paper.

Notation. Let k ∈ N denote an iteration of policy evaluation (in Section 3.1). V^k denotes the true, tabular (or functional) V-function iterate in the MDP, without any correction. V̂^k denotes the approximate tabular (or functional) V-function iterate. The empirical Bellman operator can be expressed as follows:

(B̂^π V̂^k)(s) = E_{a∼π(a|s)}[r̂(s, a)] + γ Σ_{s′} E_{a∼π(a|s)}[P̂(s′|s, a)] V̂^k(s′)

where r̂(s, a) is the empirical average reward obtained in the dataset when performing action a at state s. The true Bellman operator can be expressed as follows:

(B^π V^k)(s) = E_{a∼π(a|s)}[r(s, a)] + γ Σ_{s′} E_{a∼π(a|s)}[P(s′|s, a)] V^k(s′)

We first prove that the iteration in Eq. 3 has a fixed point. Assume the state value function is lower bounded, i.e., V(s) ≥ C, ∀s ∈ S; then Eq. 3 can always be solved with Eq. 4. Thus, we only need to investigate the iteration in Eq. 4. Denote the iteration as a function operator T^π on V. Suppose supp d ⊆ supp d_u; then the operator T^π is a γ-contraction in the L_∞ norm, where γ is the discount factor.

Proof of Lemma 1: Let V and V′ be any two state value functions with the same support, i.e., supp V = supp V′. Then

(T^π V − T^π V′)(s) = (B̂^π V(s) − α[d(s)/d_u(s) − 1]) − (B̂^π V′(s) − α[d(s)/d_u(s) − 1]) = B̂^π V(s) − B̂^π V′(s) = γ E_{a∼π(a|s)} Σ_{s′} P̂(s′|s, a)[V(s′) − V′(s′)]

and therefore

||T^π V − T^π V′||_∞ = max_s |(T^π V − T^π V′)(s)| = max_s γ |E_{a∼π(a|s)} Σ_{s′} P̂(s′|s, a)[V(s′) − V′(s′)]| ≤ γ E_{a∼π(a|s)} Σ_{s′} P̂(s′|s, a) max_{s′′} |V(s′′) − V′(s′′)| = γ max_{s′′} |V(s′′) − V′(s′′)| = γ ||V − V′||_∞.

We now bound the difference between the empirical Bellman operator and the true one. Following previous work Kumar et al.
(2020) , we make the following assumptions that: P π is the transition matrix coupled with policy, specifically, P π V (s) = E a ′ ∼π(a ′ |s ′ ),s ′ ∼P (s ′ |s,a ′ ) [V (s ′ )] Assumption 1. ∀s, a ∈ M, the following relationships hold with at least (1 -δ) (δ ∈ (0, 1)) probability, |r -r(s, a)| ≤ C r,δ |D(s, a)| , || P (s ′ |s, a) -P (s ′ |s, a)|| 1 ≤ C P,δ |D(s, a)| Under this assumption, the absolute difference between the empirical Bellman operator and the actual one can be calculated as follows: |( Bπ ) V k -(B π ) V k )| = E a∼π(a|s) |r -r(s, a) + γ s ′ E a ′ ∼π(a ′ |s ′ ) ( P (s ′ |s, a) -P (s ′ |s, a))[ V k (s ′ )]| (16) ≤ E a∼π(a|s) |r -r(s, a)| + γ| s ′ E a ′ ∼π(a ′ |s ′ ) ( P (s ′ |s, a ′ ) -P (s ′ |s, a ′ ))[ V k (s ′ )]| (17) ≤ E a∼π(a|s) C r,δ + γC P,δ 2R max /(1 -γ) |D(s, a)| Thus, the estimation error due to sampling error can be bounded by a constant as a function of C r,δ and C t,δ . We define this constant as C r,T,δ . Thus we obtain: ∀V, s ∈ D, | Bπ V (s) -B π V (s)| ≤ E a∼π(a|s) C r,t,δ (1 -γ) |D(s, a)| Next we provide an important lemma. Lemma 2. (Interpolation Lemma) For any f ∈ [0, 1], and any given distribution ρ(s), let d f be an f-interpolation of ρ and D, i.e.,d f (s ) := f d(s) + (1 -f )ρ(s), let v(ρ, f ) := E s∼ρ(s) [ ρ(s)-d(s) d f (s) ], then v(ρ, f ) satisfies v(ρ, f ) ≥ 0. The proof can be found in Yu et al. (2021) . By setting f as 1, we have E s∼ρ(s) [ ρ(s)-d(s) d(s) ] > 0. Proof of Theorem 1: The V function of approximate dynamic programming in iteration k can be obtained as: V k+1 (s) = Bπ V k (s) -α[ d(s) d u (s) -1] ∀s, k The fixed point: V π (s) = Bπ V π (s) -α[ d(s) d u (s) -1] ≤ B π V π (s) + E a∼π(a|s) C r,t,δ R max (1 -γ) |D(s, a)| -α[ d(s) d u (s) -1] (21) Thus we obtain: V π (s) ≤ V π (s) + (I -γP π ) -1 E a∼π(a|s) C r,t,δ R max (1 -γ) |D(s, a)| -α(I -γP π ) -1 [ d(s) d u (s) -1] (22) , where P π is the transition matrix coupled with the policy π and P π V (s) = E a ′ ∼π(a ′ |s ′ )s ′ ∼P (s ′ |s,a ′ ) [V (s ′ )]. 
Then the expectation of V π (s) under distribution d(s) satisfies: E s∼d(s) V π (s) ≤E s∼d(s) (V π (s)) + E s∼d(s) (I -γP π ) -1 E a∼π(a|s) C r,t,δ R max (1 -γ) |D(s, a)| -α E s∼d(s) (I -γP π ) -1 [ d(s) d u (s) -1]) >0 (23) When α ≥ E s∼d(s) E a∼π(a|s) C r,t,δ Rmax (1-γ) √ |D(s,a)| E s∼d(s) [ d(s) du(s) -1]) , E s∼d(s) V π (s) ≤ E s∼d(s) (V π (s)). Lemma 3. For any MDP M , an empirical MDP M generated by sampling actions according to the behavior policy π β and a given policy π, |J(π, M ) -J(π, M )| ≤ ( C r,δ 1 -γ + γR max C T,δ (1 -γ) 2 )E s∼d π * M (s) [ |A| |D(s)| E a∼π(a|s) ( π(a|s) π β (a|s) )] (33) Setting π in the above lemma as π β , we get: |J(π β , M ) -J(π β , M )| ≤ ( C r,δ 1 -γ + γR max C T,δ (1 -γ) 2 )E s∼d π * M (s) [ |A| |D(s)| E a∼π * (a|s) ( π * (a|s) π β (a|s) )] , given that E a∼π * (a|s) [ π * (a|s) π β (a|s) ] is a pointwise upper bound of E a∼π β (a|s) [ a|s) ](Kumar et al. ( 2020)). Thus we get, π β (a|s) π β ( J(π * , M ) ≥ J(π β , M ) -2( C r,δ 1 -γ + γR max C T,δ (1 -γ) 2 )E s∼d π * M (s) [ |A| |D(s)| E a∼π * (a|s) ( π * (a|s) π β (a|s) )] + α 1 1 -γ E s∼d π M (s) [ d(s) d u (s) -1] (35) , which completes the proof. Here, the second term is sampling error which occurs due to mismatch of M and M ; the third term denotes the increase in policy performance due to CSVE in M . Note that when the first term is small, the smaller value of α are able to provide an improvement compared to the behavior policy.
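To make Lemma 1 concrete, the following numpy sketch builds a small random tabular MDP (all sizes and distributions are hypothetical placeholders) and checks numerically that the conservative operator $T^\pi V = \hat{\mathcal{B}}^\pi V - \alpha[d/d_u - 1]$ contracts in the $L_\infty$ norm, since the penalty term cancels in the difference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP: |S| = 4 states, |A| = 2 actions (hypothetical sizes).
S, A, gamma, alpha = 4, 2, 0.9, 0.5
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # transition probs
r = rng.random((S, A))                                         # rewards
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)   # policy
d, d_u = rng.random(S), rng.random(S)
d /= d.sum(); d_u /= d_u.sum()                                 # state distributions

def T(V):
    """Conservative operator: (B^pi V)(s) - alpha * (d(s)/d_u(s) - 1)."""
    backup = (pi * (r + gamma * P @ V)).sum(axis=1)  # E_{a~pi}[r + gamma E_{s'} V]
    return backup - alpha * (d / d_u - 1.0)

V1, V2 = rng.random(S), rng.random(S)
lhs = np.max(np.abs(T(V1) - T(V2)))   # ||T V1 - T V2||_inf
rhs = gamma * np.max(np.abs(V1 - V2)) # gamma * ||V1 - V2||_inf
print(lhs <= rhs + 1e-12)
```

Because the penalty $\alpha[d(s)/d_u(s)-1]$ is the same for both inputs, the check reduces to the standard Bellman contraction, exactly as in the proof.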

B CSVE ALGORITHM

Now we put all the pieces of Section 4 together and describe the practical deep offline reinforcement learning algorithm. In particular, the dynamics model, value functions, and policy are all parameterized with deep neural networks and trained via stochastic gradient descent. The pseudo-code is given in Alg. 1.

Algorithm 1: CSVE Algorithm
Input: Data D = {(s, a, r, s′)}
Parameters: Q_θ, V_ψ, π_ϕ, target network Q_θ̄, dynamics model M_ν
Hyperparameters: α, λ, learning rates η_θ, η_ψ, η_ϕ, ω
Initialize function parameters θ_0, ψ_0, ϕ_0, θ̄_0 = θ_0
1: // Train the transition model on the static dataset D
2: M_ν ← train(D)
3: // Train the conservative value and policy functions
4: foreach step k = 1 → N do
5:   ψ_k ← ψ_{k−1} − η_ψ ∇_ψ L^π_V(V_ψ; Q_θ̄)
6:   θ_k ← θ_{k−1} − η_θ ∇_θ L^π_Q(Q_θ; V_ψ)
7:   ϕ_k ← ϕ_{k−1} − η_ϕ ∇_ϕ L^+_π(π_ϕ)
8:   θ̄_k ← ω θ̄_{k−1} + (1−ω) θ_k

C IMPLEMENTATION DETAIL

We implement our method based on the offline deep reinforcement learning library d3rlpy (Seno & Imai, 2021). The code is available at https://github.com/iclr20234089/code4098. The detailed hyper-parameters are provided in the hyper-parameter table. We also investigated λ values of {0.0, 0.1, 0.5, 1.0} in the medium tasks; the results are shown in Fig. 4.

We implement an ablation version of our method, penalty-Q, which directly penalizes the value of state-action pairs. Specifically, we change the critic loss function into:
$$\hat{Q}^{k+1} \leftarrow \arg\min_Q L^\pi_Q(Q;\hat{Q}^k) = \alpha\big(\mathbb{E}_{s\sim D,\,a'\sim\pi(\cdot|s)}[Q(s,a')] - \mathbb{E}_{s,a\sim D}[Q(s,a)]\big) + \mathbb{E}_{s,a,s'\sim D}\Big[\big(r(s,a) + \gamma\,\hat{Q}^k(s',a') - Q(s,a)\big)^2\Big].$$
We use the same policy extraction method and test this ablation on the medium tasks, in which the data is collected by a medium-performing policy. In all three tasks, the performance of penalty-Q is worse than the original implementation, its penalty-V counterpart. When the penalty is imposed on state-action pairs, as illustrated by our theoretical discussion, the evaluated Q value tends to pointwise lower bound the true Q value, which results in a more conservative and thus worse policy. In contrast, when we penalize V, the estimated value function only bounds the expectation of the true V-function, which results in a more flexible and better-performing policy.
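The contrast between the two penalties can be sketched in a short tabular toy example (numpy; all arrays, sizes, and names are hypothetical stand-ins for the network quantities, not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, alpha, gamma = 5, 3, 1.0, 0.99

Q = rng.random((S, A))   # stand-in for the current Q_theta estimate
Q_targ = Q.copy()        # stand-in for the target network
V = Q.mean(axis=1)       # stand-in for the V_psi estimate

# Toy batch of dataset transitions (indices into the tabular tables).
s    = rng.integers(0, S, 64)   # states from D
a    = rng.integers(0, A, 64)   # dataset actions
sp   = rng.integers(0, S, 64)   # next states from D
r    = rng.random(64)           # rewards
a_pi = rng.integers(0, A, 64)   # actions sampled from the current policy

# penalty-Q ablation: push Q down on policy actions and up on dataset
# actions, plus the usual TD error term.
td_q = (r + gamma * Q_targ[sp, a_pi] - Q[s, a]) ** 2
loss_penalty_q = alpha * (Q[s, a_pi].mean() - Q[s, a].mean()) + td_q.mean()

# penalty-V (CSVE-style): the conservative term acts on state values instead,
# sketched here as penalizing V on model-predicted next states vs dataset states.
sp_model = rng.integers(0, S, 64)  # stand-in for model-predicted next states
loss_penalty_v = alpha * (V[sp_model].mean() - V[s].mean()) \
                 + ((r + gamma * V[sp] - Q[s, a]) ** 2).mean()

print(float(loss_penalty_q), float(loss_penalty_v))
```

The point of the sketch is only where the conservative term attaches: to $(s, a)$ pairs in penalty-Q, but to states alone in penalty-V, matching the pointwise versus in-expectation bounds discussed above.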

D.3 RELATIONSHIP BETWEEN MODEL BIAS AND FINAL PERFORMANCE

As stated in the main paper, compared to typical model-based offline RL algorithms, CSVE is insensitive to model bias. To understand this quantitatively, we now investigate the effect of model bias on performance. We use the dynamics model's average L2 error on transition prediction as a surrogate of model bias. As shown in Fig. 4, in CSVE the model bias has very little effect on RL performance. In particular, for halfcheetah there is no observed effect of model errors on scores, while in hopper and walker2d the scores show a slight downward trend with increasing errors, where the decrease is relatively small.

In the main body of this paper, our results for COMBO adopt the results reported in the literature (Rigter et al., 2022). Our goal here is to look into more details of COMBO's asymptotic performance during training. For fairness of comparison, we adopt the official COMBO code provided by the authors and rerun it on the medium datasets of D4RL mujoco v2. Fig. 5 shows the asymptotic performance up to 1000 epochs, in which the scores are normalized by the corresponding SAC performance. We found that in both hopper and walker2d, the scores show dramatic fluctuations. The average scores of the last 10 epochs for halfcheetah, hopper, and walker2d are 71.7, 65.3, and −0.26, respectively. Besides, we found that even on the D4RL v0 datasets, COMBO behaves similarly.

Let us take the medium-replay datasets to analyze the effect of λ. In the experiments, with fixed β = 0.1, we investigate λ values of {0.0, 0.5, 1.0, 3.0}. As shown in the upper sub-figures of Fig. 6, λ has an obvious effect on policy performance and variance during training. In general, there is a threshold below which increasing λ leads to performance improvement, while above it further increasing λ hurts performance. For example, with λ = 3.0 in the hopper-medium-replay and walker2d-medium-replay tasks, the performance is worse than with smaller λ values.
The value of λ is task-specific, and we find that its effect is highly related to the loss in Eq. 11, which can be observed by comparing the bottom and upper sub-figures of Fig. 6. Thus, in practice, we can choose a proper λ according to this loss without online interaction.
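Since λ can thus be chosen offline by monitoring the loss in Eq. 11, the selection rule can be sketched as follows (the candidate values match the ablation above; the recorded loss values are made-up placeholders, not measurements):

```python
import numpy as np

# Hypothetical final values of the Eq. 11 loss, recorded offline for each
# candidate lambda -- no online interaction is needed for this selection.
candidate_lambdas = [0.0, 0.5, 1.0, 3.0]
final_losses = {0.0: 0.42, 0.5: 0.31, 1.0: 0.28, 3.0: 0.55}  # placeholder numbers

# Pick the lambda whose offline loss is smallest.
best_lam = min(candidate_lambdas, key=lambda lam: final_losses[lam])
print(best_lam)  # -> 1.0
```

With real training runs, `final_losses` would be replaced by the loss curves logged during offline training for each candidate value.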



The state value function is $V^\pi(s) := \mathbb{E}_{M,\pi}\big[\sum_{k=0}^{\infty}\gamma^{k}\, r_{t+k} \mid s_t = s\big]$, and the Q value function is $Q^\pi(s,a) := \mathbb{E}_{M,\pi}\big[\sum_{k=0}^{\infty}\gamma^{k}\, r_{t+k} \mid s_t = s, a_t = a\big]$. The Bellman operator is a function projection: $\mathcal{B}^\pi Q(s,a) := r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a),\,a'\sim\pi(\cdot|s')}[Q(s',a')]$, or $\mathcal{B}^\pi V(s) := \mathbb{E}_{a\sim\pi(\cdot|s)}\big[r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}[V(s')]\big]$.
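As a concrete check of these definitions, the following sketch implements the tabular operator $\mathcal{B}^\pi V$ on a small random MDP (all sizes and values are hypothetical) and verifies that iterating it converges to the closed-form fixed point $V^\pi = (I - \gamma P^\pi)^{-1} r^\pi$:

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)  # P(s'|s,a)
r = rng.random((S, A))                                        # r(s,a)
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)  # pi(a|s)

def bellman_v(V):
    """(B^pi V)(s) = E_{a~pi}[ r(s,a) + gamma * E_{s'~P}[V(s')] ]."""
    return (pi * (r + gamma * P @ V)).sum(axis=1)

# Closed-form fixed point: V^pi = (I - gamma P^pi)^{-1} r^pi.
P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
r_pi = (pi * r).sum(axis=1)             # expected reward under pi
V_closed = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

V = np.zeros(S)
for _ in range(500):                    # gamma-contraction => convergence
    V = bellman_v(V)
print(np.allclose(V, V_closed, atol=1e-6))  # -> True
```

The iteration converges geometrically at rate $\gamma$, which is the same contraction property used in the proof of Lemma 1.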

$\hat{V}^\pi$ is the converged value estimate under the dataset $D$, and the error is introduced by using the empirical rather than the true Bellman operator. $|D(s,a)|$ denotes a vector of size $|S||A|$ containing the counts for each state-action pair in $D$. If the count of a state-action pair is zero, the corresponding $\frac{1}{\sqrt{|D(s,a)|}}$ is treated as a large but finite constant.
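The count vector and the resulting $1/\sqrt{|D(s,a)|}$ sampling-error scale can be sketched as follows (the dataset, sizes, and the zero-count cap `BIG` are hypothetical choices for illustration):

```python
import numpy as np

S, A = 3, 2
# Hypothetical dataset of (state, action) visits.
visits = [(0, 0), (0, 0), (0, 1), (2, 1)]

# |D(s,a)|: visit counts per state-action pair.
counts = np.zeros((S, A))
for s, a in visits:
    counts[s, a] += 1

# Sampling-error scale 1/sqrt(|D(s,a)|); for unseen pairs the count is zero,
# so the scale is replaced by a large but finite constant instead of infinity.
BIG = 1e6  # hypothetical cap
scale = np.where(counts > 0, 1.0 / np.sqrt(np.maximum(counts, 1)), BIG)
print(scale[0, 0], scale[1, 0])  # frequently seen pair vs unseen pair
```

The `np.maximum(counts, 1)` guard only avoids a divide-by-zero warning inside `np.where`; unseen pairs still receive the capped value `BIG`.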


Figure 2: Return (upper sub-figures) and loss in Eq. 11 (bottom sub-figures) during training with different λ values

Figure 3: Performance comparison between our original implementation and the penalty-Q version.

Figure 4: The relationship between score and model bias. The correlation coefficients are −0.32, −0.34, and −0.29, respectively.

Figure 5: Return of COMBO on D4RL mujoco v2 tasks

Performance comparison on Gym control tasks. The results of CSVE are over three seeds, and we reimplement AWAC using d3rlpy. Results of IQL, TD3-BC, and PBRL are from their original papers (Table 1 in Kostrikov et al. (2021b), Table C.3 in Fujimoto & Gu (2021), and Table 1 in Bai et al. (2021), respectively). Results of COMBO are from the reproduction in Rigter et al. (2022) (Table 1), given that the original paper reports results on the v0 datasets. For the same reason, results of CQL are from Bai et al. (2021).

Performance comparison on Adroit tasks. The results of CSVE are over three seeds. Results of IQL are from Table 3 in Kostrikov et al. (2021b), and results of the other algorithms are from Table 4 in Bai et al. (2021).






Proof of Theorem 2: The expectation of $\hat{V}^\pi(s)$ under the distribution $d(s)$ satisfies the bound derived above. Noticing that the last term is positive, we obtain the claimed result.

Proof of Theorem 3: Recall that the expression of the V-function iterate is given by Eq. 4. Taking the expectation of $\hat{V}^\pi(s)$ under the distribution $d_u(s)$, then the expectation under $d(s)$, and choosing $\alpha$ accordingly, we obtain the claimed bound.

Proof of Theorem 4: $\hat{V}$ is obtained by solving the recursive Bellman fixed-point equation in the empirical MDP, with the altered reward $r(s,a) - \alpha\big[\frac{d(s)}{d_u(s)} - 1\big]$; hence the optimal policy $\pi^*(a|s)$ is obtained by optimizing the value under Eq. 4.

Proof of Theorem 5: The proof of this statement is divided into two parts. We first relate the return of $\pi^*$ in the empirical MDP $\hat{M}$ with the return of $\pi_\beta$. The next step is to bound the difference between $J(\pi_\beta, \hat{M})$ and $J(\pi_\beta, M)$ and the difference between $J(\pi^*, \hat{M})$ and $J(\pi^*, M)$, using a useful lemma quoted from Kumar et al. (2020).

