WHEN DATA GEOMETRY MEETS DEEP FUNCTION: GENERALIZING OFFLINE REINFORCEMENT LEARNING

Abstract

In offline reinforcement learning (RL), one issue detrimental to policy learning is the accumulation of errors in the deep Q function in out-of-distribution (OOD) regions. Unfortunately, existing offline RL methods are often over-conservative, inevitably hurting generalization performance outside the data distribution. In our study, one interesting observation is that deep Q functions approximate well inside the convex hull of the training data. Inspired by this, we propose a new method, DOGE (Distance-sensitive Offline RL with better GEneralization). DOGE marries dataset geometry with deep function approximators in offline RL, and enables exploitation in generalizable OOD areas rather than strictly constraining the policy within the data distribution. Specifically, DOGE trains a state-conditioned distance function that can be readily plugged into standard actor-critic methods as a policy constraint. Simple yet elegant, our algorithm enjoys better generalization compared to state-of-the-art methods on D4RL benchmarks. Theoretical analysis demonstrates the superiority of our approach over existing methods that are based solely on data distribution or support constraints.

1. INTRODUCTION

Offline reinforcement learning (RL) provides a new possibility of learning optimized policies from large, pre-collected datasets without any environment interaction (Levine et al., 2020). This holds great promise for many real-world problems where online interaction is costly or dangerous yet historical data is easily accessible (Zhan et al., 2022). However, the optimization nature of RL, together with the need for counterfactual reasoning on unseen data in the offline setting, poses great technical challenges for designing effective offline RL algorithms. Evaluating the value function outside the data coverage can produce falsely optimistic values; without corrective information from online interaction, such estimation errors accumulate quickly and misguide the policy learning process (Van Hasselt et al., 2018; Fujimoto et al., 2018; Kumar et al., 2019). Recent model-free offline RL methods tackle this error accumulation challenge in several ways: 1) policy constraint: directly constraining the learned policy to stay inside the distribution, or within the support, of the dataset (Kumar et al., 2019); 2) value regularization: regularizing the value function to assign low values to out-of-distribution (OOD) actions (Kumar et al., 2020b); 3) in-sample learning: learning the value function only within data samples (Kostrikov et al., 2021b) or simply treating it as the value function of the behavioral policy (Brandfonbrener et al., 2021). All three schools of methods share the trait of being conservative and omitting evaluation on OOD data, which minimizes model exploitation error but at the expense of poor generalization of the learned policy in OOD regions. Thus, a considerable gap remains when such methods are applied to real-world tasks, where most datasets only partially cover the state-action space with suboptimal policies.
Meanwhile, online deep reinforcement learning (DRL), which leverages powerful deep neural networks (DNNs) with optimistic exploration on unseen samples, can yield high-performing policies with promising generalization performance (Mnih et al., 2015; Silver et al., 2017; Degrave et al., 2022; Packer et al., 2018). This stark contrast propels us to re-think the question: are we being too conservative? It is well known that DNNs have unparalleled approximation and generalization abilities compared with other function approximators. These attractive abilities have not only led to huge success in computer vision and natural language processing (He et al., 2016; Vaswani et al., 2017), but have also amplified the power of RL. Ideally, to obtain the best policy, an algorithm should enable offline policy learning on unseen state-action pairs where function approximators (e.g., the Q function or policy network) generalize well, and add penalization only in non-generalizable areas. However, existing offline RL methods place too much weight on conservative, data-centric regularizations, while largely overlooking the generalization ability of deep function approximators. Intuitively, consider the well-known AntMaze task in the D4RL benchmark (Fu et al., 2020), where an ant navigates from the start to the destination in a large maze. We observe that existing offline RL methods fail miserably when we remove only small areas of data on the critical pathways to the destination. As shown in Figure 1, the two missing areas reside in close proximity to the trajectory data; simply "stitching" existing trajectories together is not sufficient to form a near-optimal policy in the missing regions.

[Figure 1. Left: visualization of the AntMaze dataset; data transitions in two small areas on the critical pathways to the destination have been removed (red boxes). Right: performance of three SOTA offline RL methods.]
Exploiting the generalizability of deep function approximators, however, can potentially compensate for the missing information. In our study, we observe that the value function approximated by a DNN interpolates well but struggles to extrapolate (see Section 2.2). This "interpolate well" phenomenon has also been observed in previous studies on the generalization of DNNs (Haley & Soloway, 1992; Barnard & Wessels, 1992; Arora et al., 2019a; Xu et al., 2020; Florence et al., 2022). This finding motivates us to reconsider the generalization of function approximators in offline RL in the context of dataset geometry. Along this line, we discover that a smaller distance from a sample to the offline dataset often leads to a smaller value-variation range of the learned neural network, which effectively yields more accurate inference of the value function inside the convex hull formed by the dataset. By contrast, outside the convex hull, especially in areas far from the training data, the value-variation range usually becomes too large to guarantee a small approximation error. Inspired by this, we design a new algorithm, DOGE (Distance-sensitive Offline RL with better GEneralization), from the perspective of the generalization performance of deep Q functions. We propose a state-conditioned distance function that characterizes the geometry of offline datasets, whose output serves as a proxy for the network's generalization ability; the resulting algorithm plugs this learned distance function into a standard actor-critic framework as a policy constraint. Theoretical analysis demonstrates the superior performance bound of our method compared to previous policy constraint methods based on data distribution or support constraints. Evaluations on D4RL benchmarks validate that our algorithm enjoys better performance and generalization than state-of-the-art offline RL methods.

2. DATA GEOMETRY VS. DEEP Q FUNCTIONS

2.1. NOTATIONS

We consider the standard continuous-action Markov decision process (MDP) setting, represented by a tuple (S, A, P, r, γ), where S and A are the state and action spaces, P(s′|s, a) is the transition dynamics, r(s, a) is the reward function, and γ ∈ [0, 1) is the discount factor. The objective of RL is to find a policy π(a|s) that maximizes the expected cumulative discounted return, represented by the Q function Q^π_θ(s, a) = E[Σ_{t=0}^∞ γ^t r(s_t, a_t) | s_0 = s, a_0 = a, a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t)]. The Q function is typically approximated by function approximators with learnable parameters θ, such as deep neural networks. Under the offline RL setting, we are only given a fixed dataset D and cannot interact further with the environment. The parameters θ are therefore optimized by minimizing the following temporal-difference (TD) error:

min_θ E_{(s,a,s′)∼D} [ ( r(s, a) + γ E_{a′∼π(·|s′)}[Q^π_{θ′}(s′, a′)] − Q^π_θ(s, a) )² ]    (1)

where Q^π_{θ′} is the target Q function, a delayed copy of the current Q network.
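The TD objective in Eq. (1) can be sketched in a few lines on a toy tabular MDP (a two-state chain with a single action per state, standing in for a deep Q network; all quantities below are illustrative, not the paper's setup). The target table is a delayed copy that is periodically synced, mirroring Q^π_{θ′}:

```python
import numpy as np

gamma = 0.9
# two-state chain: s0 -(r=0)-> s1, then s1 -(r=1)-> s1 forever
# true values: Q(s1) = 1/(1-gamma) = 10, Q(s0) = gamma * 10 = 9
transitions = [(0, 0.0, 1), (1, 1.0, 1)]   # offline dataset of (s, r, s')

Q = np.zeros(2)      # current Q table (stand-in for Q_theta)
Q_tgt = Q.copy()     # delayed target copy (Q_theta' in Eq. (1))

for step in range(3000):
    for s, r, s_next in transitions:
        target = r + gamma * Q_tgt[s_next]   # TD target
        Q[s] -= 0.1 * (Q[s] - target)        # gradient step on the squared TD error
    if step % 10 == 0:
        Q_tgt = Q.copy()                     # periodic target sync
```

With enough iterations the table converges to the true values [9, 10]; the same fixed-point logic underlies the deep, mini-batch version in Eq. (1).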

2.2. INTERPOLATE VS. EXTRAPOLATE

Motivating examples. Let us first consider a set of simple one-dimensional random walk tasks with different offline datasets, where at each step the agent takes an action in the range [-1, 1], and the state space is a line spanning [-10, 10]. The destination is located at s = 10, and the reward increases as the agent approaches it (i.e., r = 1 at s = 10 and r = 0 at s = -10). The approximation errors of the learned Q functions are visualized in Figure 2. Note that the approximation errors tend to be low at state-action pairs that lie inside or near the boundary of the convex hull formed by the dataset. Under a continuous state-action space, state-action pairs within the convex hull of the dataset can be represented in an interpolated manner (referred to as interpolated data), i.e., x_in = Σ_{i=1}^n α_i x_i with Σ_{i=1}^n α_i = 1, α_i ≥ 0, x_i = (s_i, a_i) ∈ D; similarly, we define extrapolated data, which lie outside the convex hull of the dataset, as x_out = Σ_{i=1}^n β_i x_i, where Σ_{i=1}^n β_i = 1 and β_i ≥ 0 do not hold simultaneously. We observe that the geometry of the dataset plays a special role in the approximation error of deep Q functions; in other words, deep Q functions interpolate well but struggle to extrapolate. This phenomenon has also been reported in studies of the generalization of deep neural networks under supervised learning (Haley & Soloway, 1992; Barnard & Wessels, 1992; Arora et al., 2019a; Xu et al., 2020; Florence et al., 2022), but is largely overlooked in modern offline RL.

Theoretical explanations. Based on advanced theoretical machinery from the generalization analysis of DNNs, such as the neural tangent kernel (NTK) (Jacot et al., 2018), we can theoretically demonstrate that this phenomenon carries over to deep Q functions in the offline RL setting.
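The interpolated-data definition above can be made concrete with toy data (illustrative only, not the paper's benchmark): drawing convex weights α from a simplex and forming x_in = Σ α_i x_i always yields a point inside the dataset's coordinate-wise range, a necessary condition for convex hull membership:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 2))   # toy dataset of (s, a) pairs

# interpolated point: convex weights alpha_i >= 0 with sum alpha_i = 1
alpha = rng.dirichlet(np.ones(len(X)))
x_in = alpha @ X

# a convex combination never leaves the coordinate-wise range of the data,
# a necessary condition for lying in the convex hull
inside_box = bool(np.all(x_in >= X.min(axis=0)) and np.all(x_in <= X.max(axis=0)))
```

An extrapolated point, by contrast, is any point not expressible this way, e.g., one lying beyond the data range in some coordinate.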
Define Proj_D(x) := argmin_{x_i∈D} ∥x − x_i∥ (we denote by ∥x∥ the Euclidean norm) as the projection operator that maps unseen data x to the nearest data point in dataset D. Theorem 1 gives a theoretical explanation of the "interpolate well" phenomenon for deep Q functions under the NTK assumptions (see Appendix B.2 for detailed proofs):

Theorem 1. (Value difference of deep Q functions for interpolated and extrapolated data). Under the NTK regime, given unseen interpolated data x_in and extrapolated data x_out, the value differences of the deep Q function for interpolated and extrapolated input data can be bounded as:

∥Q_θ(x_in) − Q_θ(Proj_D(x_in))∥ ≤ C₁( min(∥x_in∥, ∥Proj_D(x_in)∥) √(d_{x_in}) + 2d_{x_in} ) ≤ C₁( min(∥x_in∥, ∥Proj_D(x_in)∥) √B + 2B )    (2)

∥Q_θ(x_out) − Q_θ(Proj_D(x_out))∥ ≤ C₁( min(∥x_out∥, ∥Proj_D(x_out)∥) √(d_{x_out}) + 2d_{x_out} )    (3)

where d_{x_in} = ∥x_in − Proj_D(x_in)∥ ≤ max_{x_i∈D} ∥x_in − x_i∥ ≤ B and d_{x_out} = ∥x_out − Proj_D(x_out)∥ are the distances of x_in and x_out to their nearest data points in dataset D, and B and C₁ are finite constants.

Theorem 1 shows that given an unseen input x, Q_θ(x) is controlled by the in-sample Q value Q_θ(Proj_D(x)) and the distance ∥x − Proj_D(x)∥: the smaller the distance, the more controllable the output of the deep Q function. Because the distance to the dataset is strictly bounded (by at most B) for interpolated data, the approximated Q values at interpolated data, as well as at extrapolated data near the boundary of the convex hull formed by the dataset, cannot be too far off. Moreover, since d_{x_out} can take substantially larger values than d_{x_in} when the dataset only narrowly covers a large state-action space, interpolated data generally enjoy a tighter bound than extrapolated data. The empirical observations in Figure 2 and Theorem 1 both demonstrate that data geometry induces different approximation-error accumulation patterns for deep Q functions.
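The projection operator Proj_D and the sample-to-dataset distance d_x that drive the bound can be sketched in a few lines (toy dataset and query points, purely for illustration):

```python
import numpy as np

def proj_D(x, dataset):
    """Nearest dataset point and the distance d_x = ||x - Proj_D(x)||."""
    dists = np.linalg.norm(dataset - x, axis=1)
    i = int(np.argmin(dists))
    return dataset[i], float(dists[i])

D = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
x_in = np.array([0.3, 0.3])    # lies inside the convex hull of D
x_out = np.array([5.0, 5.0])   # lies far outside the hull

_, d_in = proj_D(x_in, D)
_, d_out = proj_D(x_out, D)    # d_out >> d_in, so the bound in Eq. (3) is far looser
```

The interpolated query sits within distance B of the data, while the extrapolated one can be arbitrarily far away, which is exactly why Eq. (2) admits the uniform bound and Eq. (3) does not.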
While approximation error accumulation is generally detrimental to offline RL, a fine-grained analysis is missing in previous studies about where value function can approximate well. We argue that it is necessary to take data geometry into consideration when designing less conservative offline RL algorithms.

3. GENERALIZABLE OFFLINE RL FRAMEWORK

In this section, we present our algorithm DOGE (Distance-sensitive Offline RL with better GEneralization). By introducing a specially designed state-conditioned distance function to characterize the geometry of offline datasets, we construct a very simple, less conservative, and more generalizable offline RL algorithm on top of the standard actor-critic framework.

3.1. STATE-CONDITIONED DISTANCE FUNCTION

As revealed in Theorem 1, the sample-to-dataset distance plays an important role in measuring the controllability of Q values. However, given an arbitrary state-action sample (s, a), naively computing its distance to the closest data point in a large dataset is costly and impractical. Ideally, we would like a learnable distance function that also reflects the overall dataset geometry. Based on this intuition, we design a state-conditioned distance function that can be learned in an elegantly simple supervised manner with desirable properties. Specifically, we learn the state-conditioned distance function g(s, â) by solving the following regression problem, with state-action pairs (s, a) ∼ D and synthetic noise actions â sampled from the uniform distribution over the full action space A:

min_g E_{(s,a)∼D} E_{â∼Unif(A)} [ ( ∥a − â∥ − g(s, â) )² ]    (4)

In practice, for each (s, a) ∼ D, we sample N noise actions uniformly from the action space A to train g(·); more implementation details can be found in Appendix E. Moreover, with the optimization objective defined in Eq. (4), the optimal state-conditioned distance function has two desirable properties (proofs can be found in Appendix C):

Property 1. The optimal state-conditioned distance function of Eq. (4) is convex w.r.t. actions and upper-bounds the distance to the state-conditioned centroid a_o(s) of the training dataset D:

g*(s, â) = E_{a∼Unif(A)}[C(s, a)∥â − a∥] ≥ ∥â − E_{a∼Unif(A)}[C(s, a) · a]∥ = ∥â − a_o(s)∥, ∀â ∈ A, s ∈ D    (5)

where C(s, a) = µ(s, a) / E_{a∼Unif(A)}[µ(s, a)] ≥ 0 and µ(s, a) is the state-action distribution of dataset D. Given a state s ∈ D, the state-conditioned centroid is defined as a_o(s) = E_{a∼Unif(A)}[C(s, a) · a]. Since the L₂ norm is convex and a non-negative combination of convex functions is still convex, g*(s, â) is also convex w.r.t. â.

Property 2. The negative gradient of the optimal state-conditioned distance function at an extrapolated action â, −∇_â g*(s, â), points inside the convex hull of the dataset.

From Property 1, we see that the optimal state-conditioned distance function characterizes data geometry and outputs an upper bound on the distance to the state-conditioned centroid of the training dataset. Property 2 indicates that, if used as a policy constraint, the learned distance function drives the learned policy toward the interior of the convex hull of the training data. We visualize the values of the trained state-conditioned distance function in Figure 3: the learned distance function accurately predicts the sample-to-centroid distance. By utilizing such a distance function, we can constrain the policy based on the global geometric information of the training dataset. This desirable property is unobtainable by simply constraining the policy with a sample-to-sample distance, such as an MSE loss between policy-generated and dataset actions, which conveys only local geometric information. Moreover, the learned distance function not only predicts well at in-distribution states but also generalizes well at OOD states.
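At its core, Property 1 is Jensen's inequality applied to the norm: the expected distance to sampled actions upper-bounds the distance to their centroid. This is easy to check numerically (toy action samples standing in for the weighted conditional action distribution C(s, ·) at one state):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy samples of the (weighted) conditional action distribution at one state s
a_data = rng.normal(loc=0.5, scale=0.2, size=(200, 2))
a_hat = np.array([0.9, -0.3])                     # an arbitrary query action

g_star = float(np.mean(np.linalg.norm(a_hat - a_data, axis=1)))  # E[||a_hat - a||]
centroid = a_data.mean(axis=0)                    # state-conditioned centroid a_o(s)
lower = float(np.linalg.norm(a_hat - centroid))   # ||a_hat - a_o(s)||
# Property 1: g_star >= lower for every query action a_hat
```

The gap between g_star and lower shrinks as the conditional action distribution concentrates, which is consistent with the distance function tracking dataset geometry.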

3.2. DISTANCE-SENSITIVE OFFLINE REINFORCEMENT LEARNING

Capturing the geometry of offline datasets, we now construct a minimalist distance-sensitive offline RL framework by simply plugging the state-conditioned distance function, as a policy constraint, into standard actor-critic methods (such as TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018)). This results in the following policy maximization objective:

π = argmax_π E_{s∼D, a∼π(·|s)}[Q(s, a)]  s.t.  E_{s∼D, a∼π(·|s)}[g(s, a)] ≤ G    (6)

where G is a task-dependent threshold varying across tasks. In our method, we adopt a non-parametric treatment by setting G to the mean output (50% quantile) of the learned distance function on the training dataset, i.e., E_{(s,a)∼D}[g(s, a)], approximated over mini-batch samples to reduce computational complexity (see Appendix G for an ablation on G). The constrained optimization problem in Eq. (6) can be reformulated as:

π = argmax_π min_{λ} E_{s∼D, a∼π(·|s)}[βQ(s, a) − λ(g(s, a) − G)]  s.t.  λ ≥ 0    (7)

where λ is the Lagrange multiplier, auto-adjusted using dual gradient descent. Following TD3+BC (Fujimoto & Gu, 2021), Q values are rescaled by β = α / ( (1/n) Σ_{i=1}^n |Q(s_i, a_i)| ) to balance Q-function maximization against policy constraint satisfaction, controlled by a hyperparameter α; to reduce computation, the denominator of β is approximated over a mini-batch of samples. The resulting algorithm is easy to implement; in our experiments we build on TD3. Please refer to Appendix E for implementation details.
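The primal-dual update behind Eq. (7) can be sketched on a one-dimensional toy problem (Q, g, and the step sizes below are illustrative stand-ins, not the learned networks; the β rescaling is omitted): alternate a gradient ascent step on the action with a projected ascent step on λ.

```python
import numpy as np

def Q(a):  # toy critic: unconstrained optimum at a = 2
    return -(a - 2.0) ** 2

def g(a):  # toy distance function
    return abs(a)

G = 1.0            # distance threshold
a, lam = 0.0, 0.0  # primal variable and Lagrange multiplier

for _ in range(5000):
    # gradient of the Lagrangian Q(a) - lam * (g(a) - G) w.r.t. a
    grad_a = -2.0 * (a - 2.0) - lam * np.sign(a)
    a += 0.01 * grad_a
    lam = max(0.0, lam + 0.05 * (g(a) - G))   # projected dual ascent, lam >= 0
```

The iterate settles on the constraint boundary a ≈ 1 with an active multiplier λ ≈ 2: the multiplier grows exactly until it cancels the critic's pull past the allowed distance, which is the mechanism Eq. (7) relies on.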

3.3. THEORETICAL ANALYSIS

3.3.1. BELLMAN-CONSISTENT COEFFICIENT AND CONSTRAINED POLICY SET

The key difference between DOGE and other policy constraint methods lies in that DOGE relaxes the strong full-coverage assumption on offline datasets and allows exploitation in generalizable OOD areas. To relax this unrealistic full-coverage assumption, we resort to a weaker condition proposed by Xie et al. (2021a), the Bellman-consistent coefficient (Definition 1), to measure how well Bellman errors transfer across distributions (Theorem 2). Denote ∥f∥²_{2,µ} := E_µ[∥f∥²]; T^π Q is the Bellman operator of policy π, defined as T^π Q(s, a) := r(s, a) + γE_{s′∼P(·|s,a), a′∼π(·|s′)}[Q(s′, a′)] := r(s, a) + γP^π[Q(s′, a′)], where P^π[·] is shorthand for E_{s′∼P(·|s,a), a′∼π(·|s′)}[·]. F is the function class of Q networks. The Bellman-consistent coefficient is defined as:

Definition 1. (Bellman-consistent coefficient). We define B(v, µ, F, π) to measure the distributional shift from an arbitrary distribution v to the data distribution µ, w.r.t. F and π:

B(v, µ, F, π) := sup_{Q∈F} ∥Q − T^π Q∥²_{2,v} / ∥Q − T^π Q∥²_{2,µ}

This definition captures the generalization performance of function approximation across different distributions. Intuitively, a small value of B(v, µ, F, π) means Bellman errors for policy π transfer accurately from distribution µ to v. Bellman errors can thus transfer well between two distributions even when a large discrepancy exists between them, as long as the Bellman-consistent coefficient is small. Based on Definition 1, we introduce the definition of the Bellman-consistent constrained policy set.

Definition 2. (Bellman-consistent constrained policy set). We define the Bellman-consistent constrained policy set as Π_B.
The Bellman-consistent coefficient under the transitions induced by Π_B can be bounded by finite constants l(k):

B(ρ_k, µ, F, π) ≤ l(k)    (9)

where ρ_k = ρ₀ P^{π₁} ... P^{π_k}, ∀π₁, ..., π_k ∈ Π_B, ρ₀ is the initial state-action distribution, and P^{π_i} is the transition operator induced by π_i, i.e., P^{π_i}(s′, a′|s, a) = P(s′|s, a)π_i(a′|s′). We denote the constrained Bellman operator induced by Π_B as T^{Π_B}, with T^{Π_B} Q(s, a) := r(s, a) + γ max_{π∈Π_B} P^π[Q(s′, a′)]. T^{Π_B} can be seen as the Bellman operator of a redefined MDP, so standard theoretical results for MDPs, such as the contraction property and the existence of a fixed point, carry over.
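Definition 1 can be made concrete with a crude Monte Carlo estimate on a tabular MDP (the MDP, policy, distributions, and the random Q tables standing in for the class F below are all illustrative assumptions): draw random Q functions and take the largest observed ratio of squared Bellman errors under v versus µ.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] = distribution over s'
r = rng.uniform(size=(nS, nA))
pi = np.full((nS, nA), 1.0 / nA)               # uniform policy

def bellman(Q):
    # (T^pi Q)(s, a) = r(s, a) + gamma * E_{s', a'~pi}[Q(s', a')]
    v_next = (pi * Q).sum(axis=1)
    return r + gamma * P @ v_next

mu = rng.dirichlet(np.ones(nS * nA)).reshape(nS, nA)  # data distribution
v = rng.dirichlet(np.ones(nS * nA)).reshape(nS, nA)   # shifted distribution

def B_estimate(v, mu, n_samples=500):
    # crude lower estimate of the sup over F, with F replaced by random Q tables
    best = 0.0
    for _ in range(n_samples):
        Q = rng.normal(size=(nS, nA))
        err2 = (Q - bellman(Q)) ** 2
        best = max(best, float((v * err2).sum() / (mu * err2).sum()))
    return best
```

A sanity check: with v = µ the ratio is exactly 1 for every Q, so the estimate returns 1; shifted distributions generally yield larger values, reflecting worse Bellman-error transfer.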

3.3.2. BELLMAN-CONSISTENT COEFFICIENT AND PERFORMANCE BOUND OF DOGE

We show that the policy set induced by DOGE is essentially a Bellman-consistent constrained policy set as defined in Definition 2. Meanwhile, the distance constraint in DOGE produces a small value of B and hence guarantees that the learned policy deviates only into generalizable areas.

Theorem 2. (Upper bound of the Bellman-consistent coefficient). Under the NTK assumption, the Bellman-consistent coefficient B(v, µ, F, π) is upper bounded as:

B(v, µ, F, π) ≤ (1/ε_µ) ∥ B₁ + B₂ + B₃ ∥²_{2,v},  where  B₁ = (1 − γ)Q(s_o, a_o) + R_max,  B₂ = C₁( C₂√(d₁) + d₁ ),  B₃ = (2 − γ)C₁ P^π[ C₂√(d₂) + d₂ ]    (10)

Here we denote x = (s, a) and x′ = (s′, a′); x_o = E_{x∼D}[x] = (s_o, a_o) is the centroid of the offline dataset; d₁ = ∥x − x_o∥ and d₂ = ∥x′ − x_o∥ are the sample-to-centroid distances; C₂ = sup_{x∈S×A} ∥x∥ is related to the upper bound of the input scale; and ε_µ is a lower bound on the squared Bellman error of π under distribution µ, i.e., ε_µ ≤ ∥Q − T^π Q∥²_{2,µ}.

The RHS of Eq. (10) contains four parts: 1/ε_µ, B₁, B₂ and B₃. It is reasonable to assume ε_µ > 0, because of the approximation error of Q networks and the distribution mismatch between µ and π. B₁ depends only on the Q value Q(s_o, a_o) at the centroid of the dataset and the maximum reward R_max. B₂ is related to the distance d₁ and the distribution v; B₃ is related to d₂, v and P^π. Notably, the distance regularization in DOGE compels the learned policy to output actions near the state-conditioned centroid of the dataset, so B₂ and B₃ can be driven to small values. Therefore, the RHS of Eq. (10) can be bounded by finite constants under DOGE, which shows that the constrained policy set induced by DOGE is essentially a Bellman-consistent constrained policy set. The performance gap between the policy learned by DOGE and the optimal policy can then be bounded as given in Theorem 3. See Appendix D.1 and D.2 for the proofs of Theorems 2 and 3.

Theorem 3. (Performance bound of the policy learned by DOGE).
Let Q^{Π_B} be the fixed point of T^{Π_B}, i.e., Q^{Π_B} = T^{Π_B} Q^{Π_B}, and let ε_k = Q_k − T^{Π_B} Q_{k−1} be the Bellman error at the k-th iteration; ∥f∥_µ := E_µ[∥f∥]. The performance of the learned policy π_n is bounded by:

lim_{n→∞} ∥Q* − Q^{π_n}∥_{ρ₀} ≤ (2γ / (1 − γ)²) [ L(Π_B) sup_{k≥0} ∥ε_k∥_µ + ((1 − γ) / 2γ) α(Π_B) ]

where L(Π_B) = (1 − γ)² Σ_{k=1}^∞ k γ^{k−1} l(k), which is similar to the concentrability coefficient in BEAR (Kumar et al., 2019) but takes a different form. Note that l(k) is related to the RHS of Eq. (10) and can be driven to a small value by DOGE according to Theorem 2. α(Π_B) = ∥T^{Π_B} Q^{Π_B} − T Q*∥_∞ is the suboptimality constant, analogous to α(Π) = ∥T^Π Q^Π − T Q*∥_∞ in BEAR.

Compared with BEAR, DOGE allows the policy to shift into some generalizable OOD areas and relaxes the strong full-coverage assumption. In addition, we have L(Π_B) ≤ L(Π) ∝ ρ₀P^{π₁}...P^{π_k} / µ(s, a), where L(Π) is the concentrability coefficient in BEAR. This is evident when µ(s, a) = 0 and ρ₀P^{π₁}...P^{π_k}(s, a) > 0: L(Π_B) remains bounded by finite constants while L(Π) → ∞. Moreover, as Π_B extends the policy set to cover more generalizable OOD areas (Π ⊆ Π_B) and thus produces a larger feasible region for optimization, a lower degree of suboptimality can be achieved (i.e., α(Π_B) ≤ α(Π)) compared to optimizing only over Π. Therefore, DOGE enjoys a tighter performance bound than previous, more conservative methods when allowed to exploit generalizable OOD areas.
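A quick numeric sanity check on the normalization of L(Π_B) (toy values of γ and a constant bound l(k) = c, chosen for illustration): since Σ_{k≥1} k γ^{k−1} = 1/(1 − γ)², a constant l(k) = c gives L(Π_B) = c exactly, so the (1 − γ)² prefactor keeps the coefficient on the scale of the per-step bounds.

```python
import numpy as np

gamma, c = 0.9, 2.0              # toy discount and constant bound l(k) = c
k = np.arange(1, 5000)           # truncation; the geometric tail is negligible
# L(Pi_B) = (1 - gamma)^2 * sum_k k * gamma^(k-1) * l(k);
# the series sum_{k>=1} k * gamma^(k-1) equals 1/(1-gamma)^2, so L(Pi_B) = c
L = (1.0 - gamma) ** 2 * np.sum(k * gamma ** (k - 1) * c)
```

The same cancellation explains why driving each l(k) to a small value (Theorem 2) directly tightens the overall performance bound.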

4. EXPERIMENTS

For evaluation, we compare DOGE against prior offline RL methods on the D4RL MuJoCo and AntMaze tasks (Fu et al., 2020). MuJoCo is a standard benchmark commonly used in previous work. The AntMaze tasks are far more challenging due to their non-Markovian, mixed-quality offline datasets, the stochasticity of the environment, and the high-dimensional state-action space. Implementation details, the experimental setup, and additional experimental results can be found in Appendix E and F.

4.1. COMPARISON WITH SOTA

We compare DOGE with model-free SOTA methods such as TD3+BC (Fujimoto & Gu, 2021), CQL (Kumar et al., 2020b) and IQL (Kostrikov et al., 2021b). For fairness, we use the "-v2" datasets for all methods. For most MuJoCo tasks, we report the scores from the IQL paper; we obtain the other results using the authors' or our implementations. For the AntMaze tasks, we obtain the results of CQL, TD3+BC, and IQL using the authors' implementations. For BC (Pomerleau, 1988), BCQ (Fujimoto et al., 2019) and BEAR (Kumar et al., 2019), we report the scores from (Fu et al., 2020). All methods are evaluated over the final 10 evaluations for MuJoCo tasks and 100 for AntMaze tasks. Table 1 shows that DOGE achieves comparable or better performance than SOTA methods on most MuJoCo and AntMaze tasks. Compared to other policy constraint approaches such as BCQ, BEAR and TD3+BC, DOGE is the first policy constraint method to successfully solve the AntMaze-medium and AntMaze-large tasks. Note that IQL is an algorithm designed for multi-step dynamic programming and holds a strong advantage on AntMaze tasks. Nevertheless, DOGE can compete with or even surpass IQL on most AntMaze tasks by employing only a generalization-oriented policy constraint. These results illustrate the benefits of allowing policy learning in generalizable OOD areas.

4.2. EVALUATION ON GENERALIZATION

To evaluate the generalization ability of DOGE, we remove small areas of data from the critical pathways to the destination in the AntMaze medium and large tasks, constructing an OOD dataset. The two removed areas reside in close proximity to the trajectory data (see Figure 1). We evaluate representative methods (TD3+BC, CQL, IQL) and DOGE on these modified datasets. Figure 4 shows the comparison before and after data removal. For such a dataset with partial state-action space coverage, existing policy constraint methods tend to over-constrain the policy to stay inside the support of the dataset, which does not cover the optimal policy well. Value regularization methods suffer from deteriorated generalization performance, as the value function is distorted to assign low values across all OOD areas. In-sample learning methods are only guaranteed to retain the best policy within the partially covered dataset (Kostrikov et al., 2021b). As shown in Figure 4, all these methods struggle to generalize on the missing areas and suffer a severe performance drop, while DOGE maintains competitive performance. This further demonstrates the benefits of relaxing the over-conservatism of existing methods.

4.3. ABLATION STUDY

We conduct ablation studies to evaluate the impact of the hyperparameter α, the non-parametric distance threshold G in Eq. (6), and the number of noise actions N used to train the distance function. For α, we add or subtract 2.5 from the original value; for G, we choose the 30%, 50%, 70% and 90% upper quantiles of the distance values over mini-batch samples; for N, we choose N = 10, 20, 30. Compared to N and α, we find that G has a more significant impact on performance. Figure 5b shows that an overly restrictive G (30% quantile) results in a policy set too small to cover near-optimal policies. A more tolerant G, on the other hand, is unlikely to cause excessive error accumulation and achieves relatively good performance. In addition, Figure 5a and Figure 5c show that performance is stable across variations of the other hyperparameters, indicating that our method is robust to hyperparameter choices.

5. RELATED WORK

To prevent distributional shift and the accumulation of exploitation errors when inferring the value function at unseen samples, a direct approach is to restrict policy learning from deviating into OOD areas. To ensure the learned policy stays inside the distribution, or the support, of the training data, these policy constraint methods either carefully parameterize the learned policy (Fujimoto et al., 2019; Matsushima et al., 2020), use explicit divergence penalties (Kumar et al., 2019; Wu et al., 2019; Fujimoto & Gu, 2021; Xu et al., 2021; Dadashi et al., 2021), or use implicit divergence constraints (Peng et al., 2019; Nair et al., 2020; Xu et al., 2022a). The theories behind these methods typically assume full state-action space coverage of the offline datasets (Le et al., 2019; Kumar et al., 2019). However, policy constraint under the full-coverage assumption is unrealistic in most real-world settings, especially on datasets with partial coverage collected by suboptimal behavior policies. Some recent works try to relax the full-coverage assumption to partial coverage by introducing different distribution divergence metrics, but only in theoretical analysis (Liu et al., 2020; Zanette et al., 2021; Xie et al., 2021b; Uehara & Sun, 2021; Xie et al., 2021a). Our method is an enhanced policy constraint method, in which we relax the full-coverage assumption and allow the policy to learn in OOD areas where networks generalize well.

A SKETCH OF THEORETICAL ANALYSIS

In this section, we present in Figure 6 a sketch of the overall logical flow of our theoretical analyses and the proposed algorithm, DOGE. We start by analyzing the effects of data geometry on the generalization patterns of deep Q functions. We find that a small sample-to-dataset distance leads to a tighter Q-function approximation error, so interpolation enjoys better generalization properties than extrapolation (Theorem 1). Motivated by this, we propose DOGE, which keeps the upper bound of the sample-to-centroid distance small (Property 1) and enforces a convex-hull-based policy constraint (Property 2). We then dive deeper and find that the upper bound of the Bellman-consistent coefficient is well controlled by the sample-to-centroid distance, so DOGE enjoys a bounded Bellman-consistent coefficient (Theorem 2). Based on these findings, we derive a tighter performance bound for DOGE compared to support constraint methods like BEAR (Theorem 3).

The neural tangent kernel (NTK) is a widely used tool in the generalization analysis of DNNs (Jacot et al., 2018; Arora et al., 2019b; Bietti & Mairal, 2019). Moreover, NTK is also a popular analysis tool for the convergence and optimality of deep RL (Cai et al., 2019; Fan et al., 2020; Kumar et al., 2020a; Xiao et al., 2021), and is thus adopted in our study.

B.1 NEURAL TANGENT KERNEL

We denote a general neural network by f(θ, x): R^d → R, where θ collects all the parameters of the network and x ∈ R^d is the input. Given a training dataset {(x_i, y_i)}_{i=1}^n, the parameters θ are optimized by minimizing the squared loss L(θ) = (1/2) Σ_{i=1}^n (f(θ, x_i) − y_i)² via gradient descent. The dynamics of the network outputs are characterized by Lemma 1 (Lemma 3.1 of Arora et al. (2019b); see that work for the proof).

Lemma 1. Consider minimizing the squared loss L(θ) by gradient descent with an infinitesimally small learning rate, i.e., dθ(t)/dt = −∇L(θ(t)). Let u(t) = (f(θ(t), x_i))_{i∈[n]} ∈ R^n be the network outputs on all x_i's at time t, and Y = (y_i)_{i∈[n]} be the desired outputs. Then u(t) follows the evolution

du(t)/dt = −H(t) · (u(t) − Y)

where H(t) is an n × n positive semidefinite matrix whose (i, j)-th entry is ⟨∂f(θ(t), x_i)/∂θ, ∂f(θ(t), x_j)/∂θ⟩.

Plenty of works (Jacot et al., 2018; Arora et al., 2019b; Allen-Zhu et al., 2019; Xu et al., 2020) study the dynamics of the neural network training process and find that if the width of the network is sufficiently large, H(t) stays almost constant during training, i.e., H(t) = H(0). Moreover, if the network parameters are randomly initialized with certain scales and the width goes to infinity, H(0) converges to a fixed matrix K, called the neural tangent kernel (NTK) (Jacot et al., 2018):

K(x, x′) = E_{θ∼W} ⟨∂f(θ(t), x)/∂θ, ∂f(θ(t), x′)/∂θ⟩    (13)

where W is a Gaussian distribution. The training dynamics in Lemma 1 are then identical to the dynamics of kernel regression under gradient flow, because K stays constant during training when the width of the network goes to infinity.
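The gradient-flow dynamics of Lemma 1 can be simulated with a forward-Euler step for a fixed positive-definite H (an illustrative surrogate kernel matrix, not one computed from a real network); the outputs u(t) then decay exponentially onto the labels Y:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))
H = A @ A.T + 0.1 * np.eye(n)    # fixed positive-definite surrogate for the NTK matrix
Y = rng.normal(size=n)           # training labels

u = np.zeros(n)                  # network outputs, u(0) = 0
dt = 0.9 / float(np.linalg.eigvalsh(H).max())   # stable forward-Euler step size
for _ in range(4000):
    u = u - dt * H @ (u - Y)     # Euler discretization of du/dt = -H (u - Y)

residual = float(np.linalg.norm(u - Y))          # shrinks toward 0
```

Each eigen-direction of H contracts at its own rate set by the corresponding eigenvalue, which is why positive definiteness of the NTK matrix implies the network fits the training labels in the infinite-width limit.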
Then, the final prediction function (t → ∞, assuming u(0) = 0) equals the kernel regression solution:

f_ntk(x) = (K(x, x₁), ..., K(x, x_n)) · K_train^{−1} Y    (14)

where K_train is the n × n NTK matrix on the training data (the state-action pairs x = (s, a) in policy evaluation for offline RL), which stays constant during training once the training data are fixed; Y is the vector of training labels (r(s, a) + γE_{a′∼π(·|s′)}[Q_{θ′}(s′, a′)] in offline RL); and K(x, x_i) is the kernel value between test data x and training data x_i. Denoting the feature map of K(·, ·) by Φ(·), with K(x, x′) = ⟨Φ(x), Φ(x′)⟩, Eq. (14) is equivalent to:

f_ntk(x) = (⟨Φ(x), Φ(x₁)⟩, ..., ⟨Φ(x), Φ(x_n)⟩) · K_train^{−1} Y    (15)

B.2 IMPACT OF DATA GEOMETRY ON DEEP Q FUNCTIONS

In this section, we analyze the impact of data geometry on deep Q functions under the NTK regime. We first introduce the smoothness property of the feature map Φ(x) induced by the NTK (Lemma 2). Then we introduce the equivalence between the kernel regression solution in Eq. (15) and a min-norm solution (Lemma 3). Building on Lemmas 2 and 3, Lemma 4 analyzes the smoothness of deep Q functions. Finally, we study how data geometry affects deep Q functions (Theorem 1).

Assumption 1. (NTK assumption). Unless otherwise specified, we assume the function approximators discussed in our paper are two-layer fully-connected ReLU neural networks of infinite width, trained with an infinitesimally small learning rate.

Although some gaps exist between the NTK assumption and the practical setting, NTK is one of the most advanced theoretical tools for the generalization analysis of DNNs. In addition, Assumption 1 is common in previous analyses of DNN generalization (Jacot et al., 2018; Arora et al., 2019a; Bietti & Mairal, 2019) and DRL convergence (Cai et al., 2019; Liu et al., 2019; Xu & Gu, 2020; Fan et al., 2020).
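Stepping back to Eq. (14): it is ordinary kernel regression. A minimal sketch with an RBF kernel standing in for the (here intractable) infinite-width NTK shows the key property used throughout this appendix: the prediction interpolates the training labels exactly at the training points.

```python
import numpy as np

def rbf(x, y, bandwidth=1.0):
    # stand-in kernel; the true NTK of an infinite-width net is not computed here
    return np.exp(-bandwidth * (x - y) ** 2)

x_train = np.array([-1.0, 0.0, 1.0, 2.0])
y_train = np.array([0.5, 1.0, 0.0, -0.5])       # regression targets ("Y")

K_train = rbf(x_train[:, None], x_train[None, :])   # n x n kernel matrix
coef = np.linalg.solve(K_train, y_train)            # K_train^{-1} Y

def f_kernel(x):
    # (K(x, x_1), ..., K(x, x_n)) . K_train^{-1} Y, mirroring Eq. (14)
    return float(rbf(x, x_train) @ coef)
```

How the prediction behaves away from x_train is then governed entirely by the kernel's smoothness, which is exactly what Lemma 2 quantifies for the NTK feature map.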
A more precise analysis would require more advanced tools than NTK; we leave this for future work. We first introduce Lemma 2 (Proposition 4 of (Bietti & Mairal, 2019)), which shows that the feature map Φ(x) induced by the NTK is not Lipschitz continuous but satisfies a weaker Hölder smoothness property.

Lemma 2 (Smoothness of the kernel map of two-layer ReLU networks). Let Φ be the kernel map of the neural tangent kernel induced by a two-layer ReLU neural network, and let x and y be two inputs. Then Φ satisfies:

∥Φ(x) - Φ(y)∥ ≤ √(min(∥x∥, ∥y∥) · ∥x - y∥) + 2∥x - y∥.

Lemma 3 (Lemma 2 of (Xu et al., 2020)) builds the connection between the kernel regression solution in Eq. (14) and a min-norm solution; for the proof of Lemma 3, we refer the reader to (Xu et al., 2020).

Lemma 3 (Equivalence to a min-norm optimization problem). Let Φ(x) be the feature map induced by a neural tangent kernel, for any x ∈ R^d. The solution to the kernel regression in Eq. (14) and Eq. (15) is equivalent to f_ntk(x) = Φ(x)^T β_ntk, where β_ntk is the optimal solution of the min-norm optimization problem min_β ∥β∥ s.t. Φ(x_i)^T β = y_i, for i = 1, ..., n.

Then, deep Q functions satisfy the following smoothness property.

Lemma 4 (Smoothness of deep Q functions). Given two inputs x and x′, let d = ∥x - x′∥ be the distance between them, and let C_1 := sup ∥β_ntk∥_∞ be a finite constant. Then the difference between the outputs at x and x′ can be bounded by:

∥Q_θ(x) - Q_θ(x′)∥ ≤ C_1 (√(min(∥x∥, ∥x′∥) · d) + 2d).

Proof. In offline RL, we denote a general Q network by Q_θ(x): R^{|S|+|A|} → R, where θ collects all parameters of the network and x = (s, a) ∈ R^{|S|+|A|} is shorthand for the state-action pair (s, a).
The Q function is trained by minimizing the temporal-difference error (1/2) Σ_{i=1}^n (Q_θ(x_i) - y_i)^2 via gradient descent, where y_i = r(x_i) + γ E_{a′_i∼π(·|s′_i)}[Q^π_{θ′}(x′_i)] ∈ R is the target value. Using the kernel method from NTK, the Q function can be formulated as Q_θ(x) = Φ(x)^T β, where Φ(x) is independent of changes to the training labels when the NTK assumption holds. This is because, as the network width goes to infinity, the NTK K(x, x′) = ⟨Φ(x), Φ(x′)⟩ produced by this network stays constant during training, and so does the feature map Φ(x) (Jacot et al., 2018). Hence, the learning process under the NTK framework adjusts β to fit the labels rather than changing Φ(x). As a result, Lemma 2 holds when the deep Q function satisfies the NTK assumptions. Given two inputs x and x′ with distance d = ∥x - x′∥, based on Lemma 2 it is easy to see that:

∥Q_θ(x) - Q_θ(x′)∥ = ∥Φ(x)^T β - Φ(x′)^T β∥
  ≤ ∥Φ(x) - Φ(x′)∥ · ∥β∥_∞   (infinity norm)
  ≤ ∥β∥_∞ (√(min(∥x∥, ∥x′∥) · ∥x - x′∥) + 2∥x - x′∥)   (Lemma 2)
  = ∥β∥_∞ (√(min(∥x∥, ∥x′∥) · d) + 2d)
  ≤ C_β (√(min(∥x∥, ∥x′∥) · d) + 2d)   (C_β := sup ∥β∥_∞)

Additionally, if we consider the delayed Q target and delayed actor updates during policy learning, we can assume the target value used for Q evaluation stays relatively stable within each policy evaluation step, so the problem can be seen as solving a series of regression problems. Under this mild assumption, we learn the actual β_ntk at each step (β → β_ntk, and thus C_β → C_1, where C_1 := sup ∥β_ntk∥_∞), which completes the proof. Similar assumptions and treatments are used in Section 4 of (Kumar et al., 2020a), where the Q function at each iteration is assumed to fit its labels well, in Appendix A.8 of (Xiao et al., 2021), and in Appendix F of (Ghasemipour et al.). Lemma 4 states that the value difference of a deep Q function at two inputs is controlled by the distance between those inputs.
The closer the distance, the smaller the value difference.

B.2.1 PROOF OF THEOREM 1

Building on Lemma 4, we can bring in the dataset geometry and analyze its impact on deep Q functions.

Proof. We first review the definitions of interpolated and extrapolated data. In a continuous state-action space, state-action pairs within the convex hull of the dataset can be represented in an interpolated manner (referred to as interpolated data x_in):

x_in = Σ_{i=1}^n α_i x_i,  Σ_{i=1}^n α_i = 1,  α_i ≥ 0.   (20)

Similarly, we define extrapolated data x_out that lie outside the convex hull of the dataset:

x_out = Σ_{i=1}^n β_i x_i, where Σ_{i=1}^n β_i = 1 and β_i ≥ 0 do not hold simultaneously.   (21)

We define Proj_D(x) := argmin_{x_i∈D} ∥x - x_i∥ as the projector that maps unseen data x to its nearest point in dataset D. Given interpolated data x_in and extrapolated data x_out, the distances to their nearest data in the dataset are d_{x_in} = ∥x_in - Proj_D(x_in)∥ and d_{x_out} = ∥x_out - Proj_D(x_out)∥. Because interpolated data lie inside the convex hull of the training data, d_{x_in} ≤ max_{x_i∈D} ∥x_in - x_i∥ ≤ B is bounded, where B := max_{x_i, x_j∈D} ∥x_i - x_j∥ is a finite constant. Then, by applying Lemma 4, the value differences of the deep Q function for interpolated and extrapolated data can be formulated as:

∥Q_θ(x_in) - Q_θ(Proj_D(x_in))∥ ≤ C_1 (√(min(∥x_in∥, ∥Proj_D(x_in)∥) · d_{x_in}) + 2 d_{x_in}) ≤ C_1 (√(min(∥x_in∥, ∥Proj_D(x_in)∥) · B) + 2B),   (22)

∥Q_θ(x_out) - Q_θ(Proj_D(x_out))∥ ≤ C_1 (√(min(∥x_out∥, ∥Proj_D(x_out)∥) · d_{x_out}) + 2 d_{x_out}).   (23)

B.3 QUANTITATIVE EXPERIMENTS ON THEOREM 1

In addition to the one-dimensional random walk experiments presented in Section 2.2, we conduct additional experiments on the more complex and high-dimensional MuJoCo tasks (including D4RL Hopper-medium-v2, Halfcheetah-medium-v2, and Walker2d-medium-v2) to provide quantitative support for Theorem 1, in particular the distinction between interpolation and extrapolation.
We first synthesize a large number of interpolated data x_in and extrapolated data x_out (x = (s, a) ∈ S × A) and then search for their nearest data points in the offline dataset D, i.e., Proj_D(x_in) and Proj_D(x_out). We then evaluate the Q-value differences ∥Q_θ(x) - Q_θ(Proj_D(x))∥ (LHS of Theorem 1) at these generated data and check whether they align with Theorem 1. For the detailed experimental setup, recall that an interpolated data point x_in is a convex combination of the offline dataset, i.e., x_in = Σ_{i=1}^n α_i x_i, x_i ∼ D, with weights satisfying Σ_{i=1}^n α_i = 1, α_i ≥ 0. We can therefore generate interpolated data by interpolating the offline dataset with weights α_i sampled from a Dirichlet distribution. Likewise, an extrapolated data point x_out is a weighted sum of the offline dataset, x_out = Σ_{i=1}^n β_i x_i, x_i ∼ D, but its weights β_i violate the non-negativity or sum-to-1 constraint. We can therefore generate extrapolated data by setting the sign of some weights to negative values and varying the weights so they do not sum to 1. After obtaining the interpolated and extrapolated data, we search for their closest data points in the offline dataset D and compute the corresponding distance ∥x - Proj_D(x)∥ and Q-value difference ∥Q_θ(x) - Q_θ(Proj_D(x))∥. Figure 7a shows the relationship between the distance to the dataset ∥x - Proj_D(x)∥ and the Q-value difference ∥Q_θ(x) - Q_θ(Proj_D(x))∥ (LHS of Theorem 1). We also report the learned state-conditioned distance value g(s, a) on these generated data in Figure 7b. Figure 7a demonstrates that the interpolated data enjoy a tighter empirical upper bound on ∥Q_θ(x) - Q_θ(Proj_D(x))∥ (LHS of Theorem 1) than most of the extrapolated data.
Moreover, the empirical upper bound of the Q-value difference grows with the sample-to-dataset distance ∥x - Proj_D(x)∥, which is consistent with Theorem 1 (the upper bound of the value difference of a deep Q function is well controlled by the distance to the dataset). Figure 7b shows that the state-conditioned distance function g(s, a) outputs low values for interpolated data and for some near-dataset extrapolated data, and thus can serve as a relaxed policy constraint in these OOD regions.
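The data-generation procedure described above can be sketched in NumPy. The dataset below is a toy stand-in, and a fixed mixed-sign weight vector (summing to 1) is used for the extrapolated point for determinism:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.uniform(-1.0, 1.0, size=(200, 6))      # toy dataset of x = (s, a) pairs

def sample_interpolated(D, k=4):
    """Convex combination of k dataset points: weights from a Dirichlet."""
    idx = rng.choice(len(D), size=k, replace=False)
    alpha = rng.dirichlet(np.ones(k))           # alpha_i >= 0, sums to 1
    return alpha @ D[idx]

def sample_extrapolated(D):
    """Affine combination that sums to 1 but has negative weights."""
    idx = rng.choice(len(D), size=4, replace=False)
    beta = np.array([1.6, -0.2, -0.2, -0.2])    # mixed signs, sums to 1
    return beta @ D[idx]

def dist_to_dataset(x, D):
    """The distance to the nearest dataset point, i.e. ||x - Proj_D(x)||."""
    return np.linalg.norm(D - x, axis=1).min()

x_in, x_out = sample_interpolated(D), sample_extrapolated(D)
# A convex combination stays inside the dataset's bounding box.
assert np.all(np.abs(x_in) <= 1.0)
assert dist_to_dataset(D[0], D) == 0.0
```

In the actual experiments one would additionally evaluate the trained Q network at x and Proj_D(x) to obtain the value differences plotted in Figure 7a.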

C STATE-CONDITIONED DISTANCE FUNCTION C.1 PROOF OF PROPERTY 1

Proof. Given a state-action pair from the training data (s, a) ∼ D, we synthesize random noise actions from a uniform distribution over the action space, i.e., â ∼ Unif(A). The distance function g(·) is then trained by Eq. (24):

min_g E_{(s,a)∼D} E_{â∼Unif(A)} [∥â - a∥ - g(s, â)]².   (24)

[∥â - a∥ - g(s, â)]² is upper bounded by a finite constant because S × A is compact in our analysis, so the optimization problem in Eq. (24) can be reformulated via Fubini's theorem as:

min_g E_{â∼Unif(A)} E_{(s,a)∼D} [∥â - a∥ - g(s, â)]².   (25)

Note that the objective of Eq. (25) can also be written as a functional J[g(s, â)] of the function g:

J[g(s, â)] = ∫_A (1/|A|) E_{(s,a)∼D} [∥â - a∥ - g(s, â)]² dâ = ∫_A F(s, â, g(s, â)) dâ.   (26)

By the calculus of variations, the extrema (maxima or minima) of the functional J[g(s, â)] are obtained by solving the associated Euler-Lagrange equation (∂F/∂g = 0). In our case, this requires the optimal state-conditioned distance function g* to satisfy:

∂/∂g* E_{(s,a)∼D} [∥â - a∥ - g*(s, â)]² = 0
⇒ E_{(s,a)∼D} ∂/∂g* [∥â - a∥ - g*(s, â)]² = 0   (DNN is continuous)
⇒ E_{(s,a)∼D} [∥â - a∥ - g*(s, â)] = 0.   (27)

Conditioned on a state s ∈ D, the optimal state-conditioned distance function in Eq. (27) satisfies g*(s, â) = ∫_A C(s, a) ∥â - a∥ da with ∫_A µ(s, a) da ≥ 0 and ∫_A C(s, a) da = 1 (see the derivation in Eq. (28)). Because the L2-norm is convex and a non-negative combination of convex functions is still convex, g*(s, â) is a convex function w.r.t. â. In addition, for all â ∈ A, by Jensen's inequality we have:

g*(s, â) ≥ ∥â - ∫_A C(s, a) a da∥ = ∥â - a_o(s)∥, s ∈ D,

where a_o(s) := ∫_A C(s, a) a da, s ∈ D, is the state-conditioned centroid of the training dataset.
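The closed-form optimum and its Jensen lower bound can be checked numerically on a toy action set; uniform empirical weights C(s, a) = 1/n stand in for the general case:

```python
import numpy as np

rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=(50, 2))   # actions observed at some state s

def g_star(a_hat, actions):
    """Optimal distance function g*(s, a_hat): the C-weighted mean distance
    to the dataset actions (here C(s, a) = 1/n, the empirical weights)."""
    return np.linalg.norm(actions - a_hat, axis=1).mean()

a_o = actions.mean(axis=0)                       # state-conditioned centroid a_o(s)
a_hat = rng.uniform(-3.0, 3.0, size=2)           # an arbitrary (possibly OOD) action

# Jensen's inequality: g*(s, a_hat) >= ||a_hat - a_o(s)||.
assert g_star(a_hat, actions) >= np.linalg.norm(a_hat - a_o)
```

Since a mean of norms always dominates the norm of the mean, the learned distance can never under-report how far an action is from the centroid of the observed actions.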

C.2 PROOF OF PROPERTY 2

Proof. The negative gradient of the optimal state-conditioned distance function can be formulated as:

-∇_â g*(s, â) = -∫_A C(s, a) (â - a)/∥â - a∥ da = (1/∫_A µ(s, a) da) ∫_A µ(s, a) · (-(â - a))/∥â - a∥ da, ∀â ∈ A, s ∈ D.   (30)

Observe that the direction of the negative gradient of g*(s, â) is determined by the integral of the vectors -(â - a) (each pointing toward a). When (s, a) ∉ D, -(â - a) does not influence the final gradient because µ(s, a) = 0; thus -(â - a) contributes to the gradient of g*(s, â) only for (s, a) ∈ D, where µ(s, a) > 0. For a given s ∈ D and any extrapolated action â that lies outside the convex hull of the training data, the integral is a non-negative combination of vectors -(â - a) pointing toward actions a ∈ D inside the convex hull. As a result, -∇_â g*(s, â) also points inside the convex hull formed by the data.
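This pull-back-toward-the-data behavior can be verified numerically with a toy action set; the query point and dataset below are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
actions = rng.uniform(-1.0, 1.0, size=(100, 2))  # dataset actions at a state s

def neg_grad_g_star(a_hat, actions):
    """-grad_a g*(s, a_hat): mean of unit vectors pointing from a_hat
    toward each dataset action (empirical weights C(s, a) = 1/n)."""
    diff = actions - a_hat                       # a - a_hat points toward the data
    return (diff / np.linalg.norm(diff, axis=1, keepdims=True)).mean(axis=0)

a_hat = np.array([4.0, 4.0])                     # far outside the convex hull
direction = neg_grad_g_star(a_hat, actions)
to_centroid = actions.mean(axis=0) - a_hat
# The negative gradient aligns with the direction back toward the data.
assert direction @ to_centroid > 0
```

Used as a policy regularizer, this gradient therefore steers actor updates from extrapolated actions back toward the convex hull of observed actions.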

D THEORETICAL ANALYSIS OF DOGE

In this section, we analyze the performance of the policy learned by DOGE. We first adopt the Bellman-consistent coefficient from (Xie et al., 2021a) to quantify the distributional shift from the perspective of deep Q function generalization. Then, we give the upper bound of the Bellman-consistent coefficient under the NTK regime (Appendix D.1). Finally, we give the performance bound of DOGE (Appendix D.2).

D.1 UPPER BOUND OF BELLMAN-CONSISTENT COEFFICIENT

Let us first review the definition of the Bellman-consistent coefficient B(v, µ, F, π) in (Xie et al., 2021a). B(v, µ, F, π) measures the distributional shift from an arbitrary distribution v to the data distribution µ, w.r.t. F and π, where F is the function class of Q networks:

B(v, µ, F, π) := sup_{Q∈F} ∥Q - T^π Q∥²_{2,v} / ∥Q - T^π Q∥²_{2,µ},   (31)

where the µ-weighted (squared) norm is defined as ∥f∥²_{2,µ} := E_µ[∥f∥²], and analogously for any distribution v. T^π is the Bellman operator of policy π, defined as T^π Q(s, a) := r(s, a) + γ E_{s′∼P(·|s,a), a′∼π(·|s′)}[Q(s′, a′)] := r(s, a) + γ P^π[Q(s′, a′)], where P^π[·] abbreviates E_{s′∼P(·|s,a), a′∼π(·|s′)}[·]. The smaller the ratio of the Bellman errors under v and µ, the more transferable the Q function from µ to v, even when sup_{(s,a)} v(s,a)/µ(s,a) = ∞.

Then we give the proof of Theorem 2 (upper bound of the Bellman-consistent coefficient).

Proof. We denote x = (s, a) and x′ = (s′, a′). Let x_o = E_{x∼D}[x] be the centroid of the offline dataset, and let d_1 = ∥x - x_o∥ and d_2 = ∥x′ - x_o∥ be the sample-to-centroid distances. Let µ(x) be the distribution of the offline dataset and v(x) any distribution. Then, for the numerator in Eq. (8) and Eq. (31), we have the following inequalities:

∥Q - T^π Q∥²_{2,v} = ∫_{S×A} v(x) ∥Q(x) - r(x) - γ P^π[Q(x′)]∥²
= ∫_{S×A} v(x) ∥Q(x) - P^π[Q(x′)] - r(x) + (1 - γ) P^π[Q(x′)]∥²
≤ ∫_{S×A} v(x) (∥Q(x) - P^π[Q(x′)]∥ + ∥r(x)∥ + (1 - γ)∥P^π[Q(x′)]∥)²   (triangle inequality)
= ∫_{S×A} v(x) (∥Q(x) - Q(x_o) + Q(x_o) - P^π[Q(x′)]∥ + ∥r(x)∥ + (1 - γ)∥P^π[Q(x′)] - Q(x_o) + Q(x_o)∥)²
≤ ∫_{S×A} v(x) ((1 - γ)∥Q(x_o)∥ + ∥r(x)∥ + ∥Q(x) - Q(x_o)∥ + (2 - γ)∥P^π[Q(x′)] - Q(x_o)∥)²   (triangle inequality)
≤ ∫_{S×A} v(x) ( [(1 - γ)∥Q(x_o)∥ + ∥r(x)∥] + ∥Q(x) - Q(x_o)∥ + (2 - γ) P^π[∥Q(x′) - Q(x_o)∥] )²   (Jensen)   (32)

The RHS contains three parts: I_1 = (1 - γ)∥Q(x_o)∥ + ∥r(x)∥, I_2 = ∥Q(x) - Q(x_o)∥, and I_3 = (2 - γ) P^π[∥Q(x′) - Q(x_o)∥].
Because ∥r(x)∥ ∈ [0, R_max] for all x ∈ S × A, I_1 can be upper bounded as:

I_1 ≤ (1 - γ)∥Q(x_o)∥ + R_max.   (33)

By applying Lemma 4, I_2 and I_3 are upper bounded as:

I_2 ≤ C_1 (√(min(∥x∥, ∥x_o∥) · d_1) + 2 d_1),   (34)
I_3 ≤ C_1 (2 - γ) P^π[√(min(∥x′∥, ∥x_o∥) · d_2) + 2 d_2].   (35)

In addition, denoting C_2 := sup_{x∈S×A} ∥x∥, I_2 and I_3 can be further upper bounded by:

I_2 ≤ C_1 (√(C_2 d_1) + 2 d_1),   (36)
I_3 ≤ (2 - γ) C_1 P^π[√(C_2 d_2) + 2 d_2].   (37)

The relaxation of the upper bound in Eq. (36) and Eq. (37) is not necessary, but we adopt it for notational brevity via C_2 := sup_{x∈S×A} ∥x∥. Plugging Eq. (33), Eq. (36) and Eq. (37) into the RHS of Eq. (32), we get:

∥Q - T^π Q∥²_{2,v} ≤ ∫_{S×A} v(x) ((1 - γ)∥Q(x_o)∥ + R_max + C_1 (√(C_2 d_1) + 2 d_1) + (2 - γ) C_1 P^π[√(C_2 d_2) + 2 d_2])²
= ∥(1 - γ)Q(s_o, a_o) + R_max + C_1 (√(C_2 d_1) + 2 d_1) + (2 - γ) C_1 P^π[√(C_2 d_2) + 2 d_2]∥²_{2,v}.   (38)

For the denominator ∥Q - T^π Q∥²_{2,µ} in Eq. (8) and Eq. (31): because the Q function is approximated, there exists approximation error between Q and T^π Q, i.e., Q - T^π Q ≥ ϵ. In addition, the distribution µ has some mismatch with the equilibrium distribution induced by policy π. It is therefore reasonable to assume ∥Q - T^π Q∥²_{2,µ} ≥ ϵ_µ > 0. We can then complete the proof by plugging the upper bound in Eq. (38) and ∥Q - T^π Q∥²_{2,µ} ≥ ϵ_µ > 0 into Eq. (8) or Eq. (31):

B(v, µ, F, π) ≤ (1/ϵ_µ) ∥ [(1 - γ)Q(s_o, a_o) + R_max] (B_1) + [C_1 (√(C_2 d_1) + 2 d_1)] (B_2) + [(2 - γ) C_1 P^π[√(C_2 d_2) + 2 d_2]] (B_3) ∥²_{2,v}.   (39)

Note that the distance regularization in DOGE compels the learned policy to output actions near the state-conditioned centroid of the dataset, so B_2 and B_3 can be driven to small values, while B_1 is independent of the distributional shift. Therefore, B(v, µ, F, π) can be bounded by a finite constant under DOGE, and the constrained policy set induced by DOGE is essentially a Bellman-consistent constrained policy set Π_B as defined in Definition 2.
In addition, other policy constraint methods such as BEAR (Kumar et al., 2019) can also have bounded B. However, these policy constraint methods do not allow the learned policy to shift toward generalizable distributions where B(v, µ, F, π) is small but sup_{(s,a)} v(s,a)/µ(s,a) → ∞, which is essentially different from DOGE.

D.2 PERFORMANCE OF THE POLICY LEARNED BY DOGE

Here, we briefly review the definition of the Bellman-consistent constrained policy set Π_B from Definition 2: the Bellman-consistent coefficient under the transitions induced by Π_B is bounded by finite constants l(k),

B(ρ_k, µ, F, π) ≤ l(k),   (40)

where ρ_0 is the initial state-action distribution, µ is the distribution of the training data, ρ_k = ρ_0 P^{π_1} P^{π_2} ⋯ P^{π_k} for any π_1, π_2, ..., π_k ∈ Π_B, and P^{π_i} is the transition operator on states induced by π_i, i.e., P^{π_i}(s′, a′ | s, a) = P(s′|s, a) π_i(a′|s′). We denote the constrained Bellman operator induced by Π_B as T^{Π_B}, with T^{Π_B} Q(s, a) := r(s, a) + max_{π∈Π_B} γ P^π[Q(s′, a′)]. T^{Π_B} can be seen as the Bellman optimal operator of a redefined MDP; hence it is a contraction mapping and has a fixed point. We denote by Q^{Π_B} the fixed point of T^{Π_B}, i.e., Q^{Π_B} = T^{Π_B} Q^{Π_B}. The Bellman optimal operator T, defined as T Q(s, a) := r(s, a) + max_π γ P^π[Q(s′, a′)], is also a contraction mapping; its fixed point is the optimal value function Q*, with Q* = T Q*. Then, by the triangle inequality, we have:

∥Q* - Q^{π_n}∥_{ρ_0} = ∥Q* - Q^{Π_B} + Q^{Π_B} - Q^{π_n}∥_{ρ_0} ≤ ∥Q* - Q^{Π_B}∥_{ρ_0} (L_1) + ∥Q^{Π_B} - Q^{π_n}∥_{ρ_0} (L_2),   (42)

where Q^{π_n} is the true Q value of policy π_n and π_n is the greedy policy w.r.t. the n-th Q iterate. For the L_1 part in Eq. (42), we first work with the infinity norm:

∥Q* - Q^{Π_B}∥_∞ = ∥T Q* - T^{Π_B} Q^{Π_B}∥_∞
≤ ∥T Q* - T^{Π_B} Q*∥_∞ + ∥T^{Π_B} Q* - T^{Π_B} Q^{Π_B}∥_∞
≤ ∥T Q* - T^{Π_B} Q*∥_∞ + γ ∥Q* - Q^{Π_B}∥_∞   (T^{Π_B} is a γ-contraction)
= α(Π_B) + γ ∥Q* - Q^{Π_B}∥_∞,   (43)

where α(Π_B) := ∥T Q* - T^{Π_B} Q*∥_∞ is the suboptimality constant. Then we get ∥Q* - Q^{Π_B}∥_∞ ≤ α(Π_B)/(1 - γ) and L_1 ≤ ∥Q* - Q^{Π_B}∥_∞ ≤ α(Π_B)/(1 - γ). For L_2, we introduce Lemma 5, which upper bounds ∥Q^{Π_B} - Q^{π_n}∥²_{2,ρ_0}. The proof of Lemma 5 is obtained by directly replacing Q* with Q^{Π_B} in Appendix F.3 of (Le et al., 2019), because Q^{Π_B} is the optimal value function of the modified MDP induced by T^{Π_B}.

Lemma 5 (Upper bound of error propagation).
∥Q^{Π_B} - Q^{π_n}∥²_{2,ρ_0} can be upper bounded as:

∥Q^{Π_B} - Q^{π_n}∥²_{2,ρ_0} ≤ [2γ(1 - γ^{n+1})/(1 - γ)²]² ∫_{S×A} ρ_0(ds, da) [Σ_{k=0}^{n-1} α_k A_k ϵ_k² + α_n A_n (Q^{Π_B} - Q_0)²](s, a),   (44)

where

ϵ_k = Q_{k+1} - T^{Π_B} Q_k,   (45)
α_k = (1 - γ)γ^{n-k-1}/(1 - γ^{n+1}) for k < n,  α_n = (1 - γ)γ^n/(1 - γ^{n+1}),   (46)
A_k = ((1 - γ)/2) Σ_{m≥0} γ^m (P^{π_n})^m [(P^{π_{Π_B}})^{n-k} + P^{π_n} P^{π_{n-1}} ⋯ P^{π_{k+1}}] for k < n,
A_n = ((1 - γ)/2) Σ_{m≥0} γ^m (P^{π_n})^m [(P^{π_{Π_B}})^{n+1} + P^{π_n} P^{π_{n-1}} ⋯ P^{π_0}],   (47)

and Q_0 is the Q function after initialization. Since lim_{n→∞} α_n A_n (Q^{Π_B} - Q_0)² = 0, we leave this term out for analysis simplicity. In addition, each A_k is a probability kernel combining the P^{π_i} and P^{π_{Π_B}} (the transition operator on states induced by the constrained optimal policy π_{Π_B} ∈ Π_B), and Σ_k α_k = 1. The key part of Eq. (44) is ∫_{S×A} ρ_0 A_k ϵ_k², which we expand as follows:

∫_{S×A} ρ_0 A_k ϵ_k² = ∫_{S×A} ((1 - γ)/2) ρ_0 Σ_{m≥0} γ^m (P^{π_n})^m [(P^{π_{Π_B}})^{n-k} + P^{π_n} P^{π_{n-1}} ⋯ P^{π_{k+1}}] ϵ_k²
= ((1 - γ)/2) Σ_{m≥0} γ^m ∫_{S×A} [(P^{π_n})^m (P^{π_{Π_B}})^{n-k} + (P^{π_n})^m P^{π_n} P^{π_{n-1}} ⋯ P^{π_{k+1}}] ρ_0 ϵ_k².   (48)

As Eq. (40) shows, the policy set induced by DOGE is a Bellman-consistent constrained policy set Π_B as defined in Definition 2. Therefore, letting ρ_0 be the initial state-action distribution and µ the distribution of the training data, for any policies π_1, π_2, ..., π_k ∈ Π_B, the distribution after the k-th Bellman-consistent iteration is ρ_k = ρ_0 P^{π_1} P^{π_2} ⋯ P^{π_k}, and there exist finite constants l(k) such that B(ρ_k, µ, F, π) ≤ l(k) holds. Then we have:

∥Q - T^π Q∥²_{2,ρ_k} ≤ l(k) ∥Q - T^π Q∥²_{2,µ},  i.e.,  ∫_{S×A} ρ_k ϵ² ≤ l(k) ∫_{S×A} µ ϵ²  (ϵ = Q - T^π Q).   (49)

As a result, applying Eq. (49) to Eq. (48), we get:

∫_{S×A} ρ_0 A_k ϵ_k² ≤ (1 - γ) Σ_{m≥0} γ^m l(m + n - k) ∫_{S×A} µ ϵ_k².   (50)

Plugging Eq. (50) into Eq. (44) and leaving out the α_n A_n (Q^{Π_B} - Q_0)² term in Eq. (44), we get:

lim_{n→∞} L_2² ≤ lim_{n→∞} [2γ(1 - γ^{n+1})/(1 - γ)²]² [Σ_{k=0}^{n-1} (1 - γ) Σ_{m≥0} γ^m l(m + n - k) α_k ∥ϵ_k∥²_{2,µ}]
= lim_{n→∞} [2γ(1 - γ^{n+1})/(1 - γ)²]² [(1/(1 - γ^{n+1})) Σ_{k=0}^{n-1} (1 - γ)² Σ_{m≥0} γ^{m+n-k-1} l(m + n - k) ∥ϵ_k∥²_{2,µ}]
≤ lim_{n→∞} [2γ(1 - γ^{n+1})/(1 - γ)²]² (1/(1 - γ^{n+1})) L(Π_B)² sup_{k≥0} ∥ϵ_k∥²_{2,µ}
= [2γ/(1 - γ)²]² L(Π_B)² sup_{k≥0} ∥ϵ_k∥²_{2,µ},   (51)

where L(Π_B) = (1 - γ)² Σ_{k=1}^∞ k γ^{k-1} l(k). Then we can bound L_2 by:

lim_{n→∞} L_2 ≤ (2γ/(1 - γ)²) L(Π_B) sup_{k≥0} ∥ϵ_k∥_{2,µ}.   (52)

With the upper bound of L_1 and lim_{n→∞} L_2, we can complete the proof by adding these two terms together:

lim_{n→∞} ∥Q* - Q^{π_n}∥_{ρ_0} ≤ (2γ/(1 - γ)²) L(Π_B) sup_{k≥0} ∥ϵ_k∥_{2,µ} + α(Π_B)/(1 - γ).   (53)

E IMPLEMENTATION DETAILS

DOGE can be built on top of standard online actor-critic algorithms such as TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018). We choose TD3 as our base because of its simplicity compared to other methods. We build DOGE on top of TD3 by simply plugging the state-conditioned distance function into the policy training process as a regularization term. The learning objective of policy π in Eq. (7) can then be formulated as:

π = argmax_π min_λ E_{s∼D} [β Q(s, π(s)) - λ (g(s, π(s)) - G)]  s.t. λ ≥ 0.   (54)

The Q function, policy, and state-conditioned distance function networks are 3-layer ReLU-activated MLPs with 256 units per hidden layer, optimized with Adam. In addition, we normalize each dimension of the state to a standard normal distribution for Mujoco tasks. The hyperparameters of DOGE are listed in Table 2. Regarding the choice of the Critic learning rate and discount factor γ, we find that for AntMaze tasks a high Critic learning rate improves the stability of the value function during training. This may be because the AntMaze tasks require the value function to perform more steps of dynamic programming to "stitch" suboptimal trajectories than the Mujoco tasks do.
Therefore, we choose 1 × 10⁻³ and 0.995 as the Critic learning rate and discount factor γ for AntMaze tasks, respectively. Other implementation details, such as the policy noise scale and policy noise clipping, are the same as in the authors' implementation (Fujimoto et al., 2018).
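A minimal PyTorch sketch of the constrained actor update in the spirit of Eq. (54) is given below, with dual ascent on λ clipped to [1, 100] as described in Appendix E.3. The networks, hyperparameter values, and the `constrained_actor_step` helper are illustrative stand-ins, not the paper's actual code:

```python
import torch
import torch.nn as nn

def constrained_actor_step(actor, critic, g_fn, states, lam, actor_opt,
                           G=0.1, beta=1.0, lam_lr=1e-3):
    """One constrained actor update: maximize beta*Q(s, pi(s)) - lam*(g - G),
    then take a dual-ascent step on the Lagrange multiplier lam."""
    actions = actor(states)
    q = critic(states, actions)
    constraint = g_fn(states, actions) - G          # g(s, pi(s)) - G
    actor_loss = (-beta * q + lam * constraint).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    with torch.no_grad():                           # dual ascent, clipped to [1, 100]
        lam = (lam + lam_lr * constraint.mean()).clamp(1.0, 100.0)
    return lam

# Toy usage with stand-in networks.
torch.manual_seed(0)
actor = nn.Sequential(nn.Linear(4, 2), nn.Tanh())
critic = lambda s, a: -((torch.cat([s, a], -1)) ** 2).sum(-1, keepdim=True)
g_fn = lambda s, a: a.norm(dim=-1, keepdim=True)   # stand-in distance function
states = torch.randn(32, 4)
lam = torch.tensor(5.0)                            # Mujoco-style initial lambda
opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
lam = constrained_actor_step(actor, critic, g_fn, states, lam, opt)
```

The multiplier grows while the distance constraint is violated and shrinks otherwise, so the constraint strength adapts automatically during training.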

E.2 STATE-CONDITIONED DISTANCE FUNCTION'S IMPLEMENTATION DETAILS

We sample N = 20 noise actions from a uniform distribution covering the full action space to approximate the expectation in Eq. (4). We find that N = 20 balances computational cost and estimation accuracy, and matches the number of samples used in CQL (Kumar et al., 2020b); the ablation of N can be found in Fig. 15. The practical training objective of the state-conditioned distance function is:

min_g E_{(s,a)∼D, â_i∼Unif(A)} (1/N) Σ_{i=1}^N [∥a - â_i∥ - g(s, â_i)]².   (55)

We find that a sampling range wider than the action space [-a_max, a_max] helps characterize the geometry of the full offline dataset. This is because some actions in the offline dataset lie at the boundary of the action space and would only be sampled with small probability from a narrow distribution, in which case the noise actions may not cover the geometric information near the boundary. Therefore, we sample noise actions from a uniform distribution 3 times wider than the action space, i.e., â ∼ Unif[-3a_max, 3a_max]. For the learning rate, we find that a high learning rate enables a stable training process on Mujoco tasks; we therefore choose 1 × 10⁻³ and 1 × 10⁻⁴ as the distance function learning rate for Mujoco and AntMaze, respectively. We also observe that for Mujoco tasks, 10⁵ iterations already produce a relatively good state-conditioned distance function, and training longer does not hurt the final results. To reduce computation, we train the state-conditioned distance function for only 10⁵ steps on Mujoco tasks.
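The practical objective in Eq. (55), with N noise actions drawn 3× wider than the action space, can be sketched in PyTorch. The network shape and batch below are toy stand-ins for the paper's 3-layer, 256-unit MLPs:

```python
import torch
import torch.nn as nn

def distance_fn_loss(g_net, s, a, n_noise=20, a_max=1.0):
    """Eq.-(55)-style loss: regress g(s, a_hat) onto ||a - a_hat|| for noise
    actions a_hat ~ Unif[-3*a_max, 3*a_max] (a sketch; g_net is a stand-in)."""
    B, act_dim = a.shape
    # Noise actions sampled 3x wider than the action space.
    a_hat = (torch.rand(B, n_noise, act_dim) * 2 - 1) * 3 * a_max
    target = (a.unsqueeze(1) - a_hat).norm(dim=-1)          # (B, N) distances
    s_rep = s.unsqueeze(1).expand(-1, n_noise, -1)          # broadcast states
    pred = g_net(torch.cat([s_rep, a_hat], dim=-1)).squeeze(-1)
    return ((target - pred) ** 2).mean()

torch.manual_seed(0)
g_net = nn.Sequential(nn.Linear(4 + 2, 64), nn.ReLU(), nn.Linear(64, 1))
s, a = torch.randn(8, 4), torch.rand(8, 2) * 2 - 1
loss = distance_fn_loss(g_net, s, a)
```

Minimizing this loss over the dataset drives g toward the C-weighted distance of Eq. (28), since the squared loss is minimized at the conditional mean of the regression targets.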

E.3 HYPERPARAMETERS TUNING OF DOGE

The scale of α determines the strength of the policy constraint, and we tune α to balance the trade-off between policy constraint and policy improvement. Notably, α is tuned over only 5 candidate values across 20 tasks (17.5 for hopper-m, hopper-m-r and all Mujoco random datasets; 7.5 for the other Mujoco datasets; 5 for antmaze-u; 10 for antmaze-u-d; 70 for the other AntMaze tasks). This is acceptable for offline policy tuning, following (Kumar et al., 2019; Brandfonbrener et al., 2021). To ensure numerical stability, we clip the Lagrangian multiplier λ to [1, 100]. We also find that a large initial λ enables stable training for Mujoco tasks but slows down AntMaze training. Therefore, the initial value of the Lagrangian multiplier λ is 5 for Mujoco and 1 for AntMaze tasks, respectively.

E.4 PSEUDOCODE OF DOGE

The pseudocode of DOGE is listed in Algorithm 1. Algorithm 1 builds on TD3 (Fujimoto et al., 2018). Require: policy network π_ϕ and target policy network π_{ϕ′} with ϕ′ ← ϕ; value networks Q_{θ_i}, i = 1, 2 and target value networks Q_{θ′_i}, i = 1, 2 with θ′_i ← θ_i; state-conditioned distance network training steps N_g; policy update frequency m.

1: for t = 0, 1, ..., M do
2:   Sample mini-batch transitions {(s_i, a_i, r_i, s′_i)} ∼ D
3:   if t < N_g then
4:     State-Conditioned Distance Function Update: update ψ as Eq. (55) shows.
5:   end if
6:   Critic Update: update θ_i using the policy evaluation method in TD3.
7:   if t mod m = 0 then
8:     Constrained Actor Update: update ϕ, λ via Eq. (54).
9:     Update target networks: θ′_i ← τ θ_i + (1 - τ) θ′_i, ϕ′ ← τ ϕ + (1 - τ) ϕ′
10:  end if
11: end for

E.5 EXPERIMENT SETUP FOR THE IMPACT OF DATA GEOMETRY ON DEEP Q FUNCTIONS

We consider a one-dimensional random walk task with a fixed horizon (50 steps per episode), where the agent at each step can move within the range [-1, +1] and the state space is the line segment [-10, 10]. The destination is located at s = 10: the closer the agent is to the destination, the larger the reward it receives. The discount factor is γ = 0.9. The reward function is defined as:

r = (400 - (s′ - 10)²) / 400.

We generate offline datasets with different geometry and train the agent on these datasets. Each synthetic dataset consists of 200 transition steps. We obtain the approximated Q value Q̂ by training TD3 for 1e4 steps on each dataset. The learning rates of the Actor and Critic networks are both 1 × 10⁻³; the other implementation details are the same as in the original TD3 implementation (Fujimoto et al., 2018). The true Q function is obtained via Monte-Carlo estimation. We find that near-destination states exhibit higher approximation error than states far from the destination, because the scale of the true Q values near the destination is large. To alleviate the impact of the Q-value scale on the approximation error, we define the relative approximation error as:

ε(s, a) = ϵ(s, a) - min_a ϵ(s, a), where ϵ(s, a) = Q̂(s, a) - Q(s, a).

The relative error in the above definition eliminates the effect of different states on the approximation error and captures the over-estimation error that we care about. We plot the relative approximation error of deep Q functions with different random seeds and data geometries in Fig. 13.
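The random-walk reward and the relative-error metric above can be expressed directly; the toy Q arrays below are illustrative stand-ins:

```python
import numpy as np

def reward(s_next):
    """Reward of the 1-D random walk: r = (400 - (s' - 10)^2) / 400,
    which peaks at the destination s = 10 and decays quadratically."""
    return (400.0 - (s_next - 10.0) ** 2) / 400.0

def relative_error(q_hat, q_true):
    """Relative approximation error per state (row):
    eps(s, a) = (q_hat - q_true)(s, a) - min_a (q_hat - q_true)(s, a)."""
    eps = q_hat - q_true
    return eps - eps.min(axis=1, keepdims=True)

assert reward(10.0) == 1.0           # maximal reward at the destination
assert reward(-10.0) == 0.0          # minimal reward at the far end

q_hat = np.array([[1.2, 0.8], [0.5, 0.9]])   # toy learned Q table (states x actions)
q_true = np.ones((2, 2))                     # toy true Q table
rel = relative_error(q_hat, q_true)
# Per-state minimum is subtracted, so the error is non-negative with a zero per row.
assert np.all(rel >= 0) and np.all(rel.min(axis=1) == 0)
```

Subtracting the per-state minimum removes the state-dependent offset of the error, so what remains reflects how strongly one action is over-estimated relative to the others at the same state.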

F.1 COMPARISON OF GENERALIZATION ABILITY

In the well-known AntMaze tasks of the D4RL benchmark (Fu et al., 2020), an ant needs to navigate from the start to the destination in a large maze. We remove the trajectories whose coordinates fall in the regions shown in Fig. 8a; these clipped data account for only about one-tenth of the original dataset and lie in close proximity to the remaining trajectories. Under these modified datasets, simply relying on "stitching" data transitions is not enough to solve the navigation problems. We evaluate a representative policy constraint method (TD3+BC (Fujimoto & Gu, 2021)), a value regularization method (CQL (Kumar et al., 2020b)), an in-sample learning method (IQL (Kostrikov et al., 2021b)), and DOGE (our method) on these modified datasets. The evaluation results before and after clipping the trajectories are listed in Table 3. The learning curves for the modified AntMaze medium and AntMaze large tasks are shown in Fig. 9 and Fig. 4. Observe in Table 3 that existing offline RL methods fail miserably and suffer from severe performance drops. By contrast, DOGE maintains competitive performance after the modification of the dataset and shows good generalization ability in unknown areas. Apart from the above experiments, we also evaluate DOGE when removing only one area: 7, 9] for the AntMaze-medium datasets. The final results can be seen in Table 4.

Figure 9: Evaluation of TD3+BC (Fujimoto & Gu, 2021), CQL (Kumar et al., 2020b), IQL (Kostrikov et al., 2021b), and DOGE (ours) before and after removing the data shown in Fig. 8a for AntMaze medium tasks.

In this section, we further demonstrate the superiority of DOGE over our most related practical work, TD3+BC (Fujimoto & Gu, 2021). The biggest difference between DOGE and TD3+BC lies in the policy constraint used for policy optimization:
- TD3+BC constrains the policy to minimize the MSE BC loss.
- DOGE constrains the policy to minimize the learned state-conditioned distance function g(s, a).
As discussed in Section 3.1, the learned distance function g(s, a) captures the global geometric information of the offline dataset, while the MSE BC loss can only provide local sample-to-sample regularization, which may be noisy, especially in datasets that contain low-quality samples. Taking Figure 10 as an illustration, under a strict BC constraint, policy learning on noisy low-quality samples may produce learning signals that contradict those from near-optimal samples, which can cause inferior policy performance and an unstable training process. By contrast, the state-conditioned distance function g(s, a) in DOGE is trained on the whole dataset and hence carries global geometric information, which is far more informative and stable than the MSE BC loss. Figure 11 shows that DOGE enjoys larger performance gains when the random dataset contains near-optimal data, while TD3+BC is heavily influenced by the local information from the larger proportion of low-quality random data. Moreover, TD3+BC suffers from severe oscillation and training instability, while DOGE enjoys a stable training process thanks to the more informative state-conditioned distance constraint that captures the overall dataset geometry. We run experiments with several different random seeds (see Figure 13). Although the approximation error patterns differ across seeds, they all behave in the same manner: deep Q functions produce relatively low approximation error inside the convex hull of the training data. We refer to this phenomenon as deep Q functions interpolating well but struggling to extrapolate.

H LEARNING CURVES

The learning curves for Mujoco and AntMaze tasks are shown in Fig. 18 and Fig. 19. The learned policies are evaluated for 10 episodes (Mujoco) and 100 episodes (AntMaze) per seed. For the AntMaze datasets, we subtract 1 from the rewards following (Kumar et al., 2020b; Kostrikov et al., 2021b).



sup_{s,a} v(s,a)/µ(s,a) < ∞, where v and µ are the marginal distributions of the learned policy and the dataset (Le et al., 2019). Another type of offline RL method, value regularization (Kumar et al., 2020b; Kostrikov et al., 2021a; Yu et al., 2021; Xu et al., 2022b; 2023), directly penalizes the value function to produce low values at OOD actions. In-sample learning methods (Brandfonbrener et al., 2021; Kostrikov et al., 2021b), on the other hand, only learn the value function within data or treat it as the value function of the behavior policy. Compared with our approach, these methods exercise too much conservatism, which limits the generalization performance of deep neural networks in OOD regions and largely weakens the ability of dynamic programming. There are also uncertainty-based and model-based methods that regularize the value function or policy with epistemic uncertainty estimated from the model or value function (Janner et al., 2019; Yu et al., 2020; Uehara & Sun, 2021; Wu et al., 2021; Zhan et al., 2022; Bai et al., 2021). However, estimating the epistemic uncertainty of DNNs is still an under-explored area, with results highly dependent on evaluation methods and network architecture.

6 CONCLUSION

In this study, we provide new insights into the relationship between the approximation error of deep Q functions and the geometry of offline datasets. Through empirical and theoretical analysis, we find that deep Q functions attain relatively low approximation error when interpolating rather than extrapolating the dataset. This phenomenon motivates us to design a new algorithm, DOGE, to empower policy learning on OOD samples within the convex hull of the training data. DOGE is simple yet elegant, plugging a dataset-geometry-derived distance constraint into TD3. With such minimal surgery, DOGE outperforms existing model-free offline RL methods on most D4RL tasks.
We theoretically prove that DOGE enjoys a tighter performance bound compared with existing policy constraint methods under the more realistic partial-coverage assumption. Empirical results and theoretical analysis suggest the necessity of re-thinking the conservatism principle in offline RL algorithm design, and point to sufficient exploitation of the generalization ability of deep Q functions.



Figure 2: Approximation error of deep Q functions with different dataset geometry. Offline data are marked as white dots (please refer to Appendix E.5 for the detailed experimental setup).

Figure 3: Illustration of the state-conditioned distance function. The output of the optimal distance function is a non-negative combination of the distances to all training data. G is the threshold in Eq. (6). In (b), offline data are marked as white dots.

Figure 4: Generalization performance after removing data from AntMaze large tasks (see Appendix F.1 for detailed setup and additional results on AntMaze medium tasks).

Figure 5: Ablation results. The default parameters in our implementation are marked by *. The error bars indicate min and max over 5 seeds. See Appendix G for more detailed ablation studies.

Figure 6: Sketch of theoretical analysis

Figure 7: Quantitative experiments of Theorem 1 on the D4RL MuJoCo-medium datasets. The red star-shaped dots are the interpolated data and the circle dots are the extrapolated data. The color of the dots represents ∥Q_θ(x) − Q_θ(Proj_D(x))∥ values in (a) and g(x) values in (b), respectively; the darker the color, the smaller the corresponding value. In (a), the yellow dashed line is the empirical upper bound of ∥Q_θ(x) − Q_θ(Proj_D(x))∥.

$$\int_A \|\hat{a}-a\|\,\mu(s,a)\,da - \Big(\int_A \mu(s,a)\,da\Big)\,g^*(s,\hat{a}) = 0,\quad s\in\mathcal{D}$$
$$\Rightarrow\; g^*(s,\hat{a}) = \frac{\int_A \|\hat{a}-a\|\,\mu(s,a)\,da}{\int_A \mu(s,a)\,da},\quad s\in\mathcal{D}$$
$$\Rightarrow\; g^*(s,\hat{a}) = \int_A C(s,a)\,\|\hat{a}-a\|\,da,\quad s\in\mathcal{D} \tag{28}$$
where $\mu(s,a)$ is the empirical distribution on a finite offline dataset $\mathcal{D}=\{x_i\}_{i=1}^n$, i.e., the sum of the Dirac measures $\frac{1}{n}\sum_{i=1}^n \delta_{x_i}$. $\forall (s,a)\notin\mathcal{D},\ \mu(s,a)=0$; $\forall (s,a)\in\mathcal{D},\ \mu(s,a)>0$. $C(s,a)=\mu(s,a)\big/\int_A \mu(s,a)\,da$.
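A quick numerical check of Eq. (28): under the empirical measure, the MSE-optimal scalar prediction for the distances ∥â − a∥ is their sample mean, which is exactly g*(s, â) = ∫_A C(s,a)∥â − a∥ da. A minimal sketch (the fixed-state setup, sizes, and step size are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
actions = rng.normal(size=(256, 3))  # dataset actions at a fixed state s
a_hat = np.zeros(3)                  # a query action â

# Regression targets: distances from â to every dataset action.
dists = np.linalg.norm(a_hat - actions, axis=1)

# Minimizing (1/n) * sum_i (g - d_i)^2 over a scalar g yields the
# empirical mean distance, i.e. g*(s, â) in Eq. (28).
g_star = dists.mean()

# Cross-check with plain gradient descent on the same MSE loss.
g = 0.0
for _ in range(2000):
    g -= 0.01 * 2 * (g - dists).mean()
assert abs(g - g_star) < 1e-6
```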

w.r.t. Q_n in the Bellman-consistent constrained policy set Π_B, i.e., π_n = sup_{π∈Π_B} E_{a∼π(·|s)}[Q_n(s, a)], where Q_n is the Q function after the n-th value iteration under the constrained Bellman operator T^{Π_B}.

Figure 8: The trajectories in the offline dataset are visualized in blue. Data transitions in two small areas on the critical pathways to the destination have been removed (red boxes).

[10.5, 21] × [7, 9], [10.5, 21] × [7, 9] for AntMaze-large datasets and [4, 13] × [7, 9], [4, 13] × [

Figure 10: Illustrations of the differences between (a) the MSE BC constraint of TD3+BC and (b) the state-conditioned distance function constraint of DOGE. In (a), the MSE BC constraint in TD3+BC blindly enforces the imitation behavior on any data samples, which may lead to an inferior policy in the presence of noisy low-quality samples. In (b), the state-conditioned distance function g(s, a) can provide more informative global dataset geometry information to guide the stable learning of the policy.

Figure 12: Learning curves of the state-conditioned distance function g(s, a)

Figure 13: The figures above depict the effect of different data geometries on the final approximation error of deep Q functions. The training data are marked as white dots.

Figure 14: Ablation for α. Error bars indicate min and max over 5 seeds.

Figure 15: Ablation for N . Error bars indicate min and max over 5 seeds.

Figure 16: Ablation for G. Error bars indicate min and max over 5 seeds.

Figure 17: Ablation for λ. Error bars indicate min and max over 5 seeds.

Figure 18: Learning curves for MuJoCo tasks. Error bars indicate min and max over 5 seeds.

Figure 19: Learning curves for AntMaze Tasks. Error bars indicate min and max over 5 seeds.

Average normalized scores and standard deviations over 5 seeds on benchmark tasks

Hyperparameters of DOGE

are marked in red. The only modifications are the training process of the additional state-conditioned distance function and the constrained actor update. We can perform 1M training steps on one GTX 3080Ti GPU in less than 50 min for MuJoCo tasks and 1 h 40 min for AntMaze tasks. Inputs: dataset D; state-conditioned distance network g_ψ; policy network π_ϕ and target policy network π
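To make the constrained actor update concrete, here is a minimal sketch of what a Lagrangian actor objective of this kind could look like; the function name, the exact normalization, and the default α = 2.5 (borrowed from TD3+BC) are our assumptions, not the paper's verbatim code:

```python
import numpy as np

def constrained_actor_loss(q_vals, g_vals, G, lam, alpha=2.5):
    """Sketch of a DOGE-style actor objective: maximize a scale-normalized Q
    while a Lagrangian multiplier lam pushes the state-conditioned distance
    g(s, pi(s)) below the threshold G.
    q_vals, g_vals: per-sample critic and distance-function values."""
    beta = alpha / (np.abs(q_vals).mean() + 1e-8)  # Q-scale normalization
    return float(-(beta * q_vals).mean() + lam * (g_vals - G).mean())
```

Minimizing this loss trades the max-Q term against constraint satisfaction; as λ grows, violations g(s, a) > G dominate the gradient.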

The performance drop after removing the data on the only pathway to the destination.

Ablation for DOGE generalization with different removal areas.

Normalized scores of DOGE trained on distance functions with different network configurations. [128, 128] means the g network has 2 hidden layers with 128 units; [256, 256, 256] means 3 hidden layers with 256 units.

ACKNOWLEDGMENTS

This work is supported by National Key Research and Development Program of China under Grant (2022YFB2502904). This work is also supported by Baidu Inc. through Apollo-AIR Joint Research Center. The authors would also like to thank the anonymous reviewers for their feedback on the manuscript. Jianxiong Li would like to thank Zhixu Du, Yimu Wang, Li Jiang, Haoyi Niu, Hao Zhao and all colleagues in AIR-Dream group for valuable discussions.

CODE AVAILABILITY

Code is available at https://github.com/Facebear-ljx/

F.3 COMPARISON WITH UNCERTAINTY-BASED METHODS

We also compare DOGE with SOTA uncertainty-based offline RL approaches, including EDAC (An et al., 2021) and PBRL (Bai et al., 2021), on the more complex D4RL AntMaze tasks. The final results are presented in Table 5. Table 5 shows that the SOTA uncertainty-based methods are unable to provide reasonable performance on the difficult AntMaze tasks, even though they can achieve good performance on simpler MuJoCo tasks. A similar finding is also reported in a recent offline RL study (Anonymous, 2023). In practical implementations of EDAC and PBRL, to obtain relatively accurate uncertainty measures and achieve reasonable performance, these methods typically need dozens of ensemble Q-networks, which can be quite costly and inefficient. Moreover, heavy hyperparameter tuning is also required for them to obtain the best performance. In contrast, our method quantifies the generalization ability of the Q-function from the perspective of dataset geometry and is trained using a simple regression loss (Eq. (4)), which enjoys better training stability and simplicity.
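For context on the ensemble baselines above, the epistemic-uncertainty proxy they rely on is the disagreement across many Q-networks; a minimal sketch with toy numbers (not EDAC/PBRL's actual architecture):

```python
import numpy as np

def ensemble_uncertainty(q_preds):
    """Epistemic-uncertainty proxy used by ensemble methods: the standard
    deviation of Q estimates across ensemble members.
    q_preds: array of shape (n_ensemble, batch_size)."""
    return q_preds.std(axis=0)

rng = np.random.default_rng(1)
q_in = rng.normal(loc=100.0, scale=0.5, size=(10, 4))    # members agree in-distribution
q_ood = rng.normal(loc=100.0, scale=20.0, size=(10, 4))  # members disagree OOD
assert ensemble_uncertainty(q_in).mean() < ensemble_uncertainty(q_ood).mean()
```

With only a handful of members the std estimate is noisy, which is one reason such methods resort to dozens of Q-networks.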

F.4 ADDITIONAL ANALYSIS ON DISTANCE FUNCTION

We report the learning curves of the state-conditioned distance function g(s, a) trained on different datasets (including hopper-m-v2, halfcheetah-m-v2, and walker2d-m-v2) in Figure 12. Our proposed state-conditioned distance function is learned through a simple regression task (Eq. (4)), which is very easy to train. Figure 12 shows that it reaches convergence within only 1K training steps on the D4RL MuJoCo medium datasets. We also change the network configurations (i.e., the number of hidden layers and hidden units) of the state-conditioned distance function g(s, a) to investigate how the expressivity of g influences the performance of the policy.
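The regression task in Eq. (4) boils down to perturbing dataset actions and regressing the perturbation's norm; a minimal sketch of the target construction (dimensions, noise scale, and N = 20 are illustrative choices, not the paper's exact values):

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(512, 2))           # toy states (paired with the noisy actions)
actions = rng.uniform(-1, 1, size=(512, 2))  # dataset actions

N = 20  # noise actions per dataset action
noise = rng.normal(scale=0.5, size=(512, N, 2))
noisy_actions = np.clip(actions[:, None, :] + noise, -1.0, 1.0)

# Regression targets for g(s, a_noisy): distance back to the dataset action.
targets = np.linalg.norm(noisy_actions - actions[:, None, :], axis=-1)

# g_psi is then fit by minimizing the MSE between its predictions on
# (state, noisy_action) pairs and these targets.
assert targets.shape == (512, N) and targets.min() >= 0.0
```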

G ABLATIONS

We conduct ablation studies on the effect of α in β = α / ((1/n) Σ_{i=1}^n |Q(s_i, a_i)|) (see Figure 14), the non-parametric threshold G in Eq. (6) (see Figure 16), and the number of noise actions N used to train the state-conditioned distance function (see Figure 15) on the performance of the final algorithm. We also conduct ablation studies on the effect of G on the Lagrangian multiplier λ (see Figure 17). For α, we add or subtract 2.5 from the original value. For N, we choose N = 10, 20, 30. For G, we choose the 30%, 50%, 70%, 90%, and 100% upper quantiles of the distance values in mini-batch samples; the results can be found in Table 7 (Ablations on G with different quantiles). Table 7 shows that using different G for different tasks may achieve even better performance. Particularly, for some datasets with diverse data distributions that require finding good actions among suboptimal data, a more tolerant quantile (e.g., G = 70%) can reasonably extend the feasible region and increase the opportunity to find the optimal policy, such as on hopper-m-r, halfcheetah-m-r, walker2d-m-r, hopper-m-e, and halfcheetah-m-e. However, an overly relaxed quantile (e.g., G = 90% or 100%) increases the risk of including problematic OOD actions in policy learning, causing a performance drop due to value overestimation and high variance. By contrast, an overly restrictive quantile such as G = 30% can be over-conservative and cause significant constraint violations that impede policy learning, as constraint satisfaction is favored over the max-Q operation in most updates. This can be reflected in the additional results for the Lagrangian multiplier λ (see Appendix E.2 for learning curves and Figure 11 for additional ablations), where λ → ∞ for some tasks under G = 30%.
This will cause the suboptimality gap term ((1−γ)/(2γ) · α(Π_D)) in Theorem 3 to dominate the performance bound, leading to an inferior policy. As hyperparameter tuning in practical offline RL applications without online interaction is very difficult, to reduce the computational load we set G = 50% as the default in a non-parametric manner, since it consistently achieves good performance and is neither too conservative nor too aggressive for most tasks. Observe in Figure 14 that DOGE maintains similar performance as α changes on most MuJoCo tasks. At the same time, we also observe that the effect of N on the experiments is not obvious. Compared with N and α, we find that G has a more significant effect on the experimental results. Observe in Figure 16 that a small G usually causes the policy set induced by DOGE to be too small to obtain a near-optimal policy. By contrast, a large G is not likely to cause excessive error accumulation and hence maintains relatively good performance. In addition, the ablation studies show that our method is robust to hyperparameters and maintains good performance as they change.
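The non-parametric threshold described above can be read off directly from a mini-batch; a minimal sketch (the helper name is ours):

```python
import numpy as np

def batch_threshold(g_batch, quantile=0.5):
    """Non-parametric threshold G: the given quantile of the state-conditioned
    distance values over a mini-batch. quantile=0.5 is the default discussed
    in the text; 0.3 is more restrictive, 0.9/1.0 more permissive."""
    return float(np.quantile(np.asarray(g_batch), quantile))
```

For example, `batch_threshold([1, 2, 3, 4, 5])` returns 3.0, the batch median.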

