OFFLINE RL WITH NO OOD ACTIONS: IN-SAMPLE LEARNING VIA IMPLICIT VALUE REGULARIZATION

Abstract

Most offline reinforcement learning (RL) methods suffer from the trade-off between improving the policy to surpass the behavior policy and constraining the policy to limit its deviation from the behavior policy, since computing Q-values using out-of-distribution (OOD) actions will suffer from errors due to distributional shift. The recently proposed in-sample learning paradigm (i.e., IQL), which improves the policy by quantile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles the distributional shift in learning the value function. In this work, we make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works, i.e., it applies implicit value regularization to the policy. Based on the IVR framework, we further propose two practical algorithms, Sparse Q-learning (SQL) and Exponential Q-learning (EQL), which adopt the same value regularization used in existing works, but in a complete in-sample manner. Compared with IQL, we find that our algorithms introduce sparsity in learning the value function, making them more robust in noisy data regimes. We also verify the effectiveness of SQL and EQL on D4RL benchmark datasets and show the benefits of in-sample learning by comparing them with CQL in small data regimes. Code is available at https://github.com/ryanxhr/IVR.

1. INTRODUCTION

Reinforcement learning (RL) is an increasingly important technology for developing highly capable AI systems; it has achieved great success in game-playing domains (Mnih et al., 2013; Silver et al., 2017). However, the fundamental online learning paradigm in RL is also one of the biggest obstacles to RL's widespread adoption, as interacting with the environment can be costly and dangerous in real-world settings. Offline RL, also known as batch RL, aims at solving the above-mentioned problem by learning effective policies solely from offline data, without any additional online interaction. It is a promising area for bringing RL into real-world domains, such as robotics (Kalashnikov et al., 2021), healthcare (Tang & Wiens, 2021) and industrial control (Zhan et al., 2022). In such scenarios, arbitrary exploration with untrained policies is costly or dangerous, but sufficient prior data is available. While most off-policy RL algorithms are applicable in the offline setting by filling the replay buffer with offline data, improving the policy beyond the level of the behavior policy entails querying the Q-function about values of actions produced by the policy, which are often not seen in the dataset. Those out-of-distribution actions can be deemed adversarial examples of the Q-function, which cause extrapolation errors of the Q-function (Kumar et al., 2020). To alleviate this issue, prior model-free offline RL methods typically add pessimism to the learning objective, in order to be pessimistic about the distributional shift. Pessimism can be achieved by policy constraint, which constrains the policy to be close to the behavior policy (Kumar et al., 2019; Wu et al., 2019; Nair et al., 2020; Fujimoto & Gu, 2021); or by value regularization, which directly modifies the Q-function to be pessimistic (Kumar et al., 2020; Kostrikov et al., 2021a; An et al., 2021; Bai et al., 2021).
Nevertheless, this imposes a trade-off between accurate value estimation (more regularization) and maximum policy performance (less regularization). In this work, we find that we can alleviate the trade-off present in out-of-sample learning by performing implicit value regularization: this bypasses querying the value function of any unseen actions and allows learning an optimal policy using in-sample learning*. More specifically, we propose the Implicit Value Regularization (IVR) framework, in which a general form of behavior regularizer is added to the policy learning objective. Because of the regularization, the optimal policy in the IVR framework has a closed-form solution, which can be expressed by imposing a weight on the behavior policy. The weight can be computed from a state-value function and an action-value function; the state-value function serves as a normalization term that makes the optimal policy integrate to 1. It is usually intractable to find a closed form of the state-value function; however, we make a subtle mathematical transformation and show its equivalence to solving a convex optimization problem. In this manner, both value functions can be learned using only dataset samples. Note that the recently proposed method IQL (Kostrikov et al., 2021b), although derived from a different view (i.e., approximating an upper expectile of dataset actions given a state), remains quite close to the learning paradigm of our framework. Furthermore, our IVR framework explains why learning the state-value function is important in IQL and gives a deeper understanding of how IQL handles the distributional shift: it performs implicit value regularization, with the hyperparameter τ controlling the strength. This explains one disturbing issue of IQL, namely that the role of τ does not have a perfect match between theory and practice: in theory τ should be close to 1 to obtain an optimal policy, while in practice a larger τ may give a worse result.
Based on the IVR framework, we further propose practical algorithms. We find that the value regularization terms used in CQL (Kumar et al., 2020) and AWR (Peng et al., 2019) are two valid choices in our framework; when applying them within our framework, we obtain two complete in-sample learning algorithms. The resulting algorithms bear similarities to IQL, but we find that they introduce sparsity in learning the state-value function, which is missing in IQL. The sparsity term filters out bad actions whose Q-values are below a threshold, which brings benefits when the quality of offline datasets is inferior. We verify the effectiveness of SQL and EQL on widely used D4RL benchmark datasets and demonstrate state-of-the-art performance, especially on suboptimal datasets in which value learning is necessary (e.g., AntMaze and Kitchen). We also show the benefits of sparsity in our algorithms by comparing them with IQL in noisy data regimes, and the robustness of in-sample learning by comparing with CQL in small data regimes. To summarize, the contributions of this paper are as follows:

• We propose a general implicit value regularization framework, where different behavior regularizers can be included, all leading to a complete in-sample learning paradigm.

• Based on the proposed framework, we design two effective offline RL algorithms: Sparse Q-learning (SQL) and Exponential Q-learning (EQL), which obtain SOTA results on benchmark datasets and show robustness in both noisy and small data regimes.

2. RELATED WORK

To tackle the distributional shift problem, most model-free offline RL methods augment existing off-policy methods (e.g., Q-learning or actor-critic) with a behavior regularization term. Behavior regularization can appear explicitly as divergence penalties (Wu et al., 2019; Kumar et al., 2019; Fujimoto & Gu, 2021), implicitly through weighted behavior cloning (Wang et al., 2020; Nair et al., 2020), or more directly through careful parameterization of the policy (Fujimoto et al., 2018; Zhou et al., 2020). Another way to apply behavior regularization is via modification of the critic learning objective to incorporate some form of regularization, to encourage staying near the behavioral distribution and being pessimistic about unknown state-action pairs (Nachum et al., 2019; Kumar et al., 2020; Kostrikov et al., 2021a; Xu et al., 2022c). There are also several works incorporating behavior regularization through the use of uncertainty (Wu et al., 2021; An et al., 2021; Bai et al., 2021) or distance functions (Li et al., 2023b). However, the in-distribution constraints used in these works might not be sufficient to avoid value function extrapolation errors. Another line of methods, on the contrary, avoids value function extrapolation by performing some kind of imitation learning on the dataset. When the dataset is good enough or contains high-performing trajectories, we can simply clone or filter dataset actions to extract useful transitions (Xu et al., 2022b; Chen et al., 2020), or directly filter individual transitions based on how advantageous they could be under the behavior policy and then clone them (Brandfonbrener et al., 2021; Xu et al., 2021; 2022a). While alleviating extrapolation errors, these methods only perform single-step dynamic programming and lose the ability to "stitch" suboptimal trajectories by multi-step dynamic programming.
Our method can be viewed as a combination of these two approaches that shares the best of both worlds: SQL and EQL implicitly control the distributional shift and learn an optimal policy by in-sample generalization. SQL and EQL are less vulnerable to erroneous value estimation, as in-sample actions induce less distributional shift than out-of-sample actions. Similar to our work, IQL approximates the optimal in-support policy by fitting the upper expectile of the behavior policy's action-value function; however, it is not motivated by remaining pessimistic to the distributional shift. Our method adds a behavior regularization term to the RL learning objective. In online RL, several works incorporate an entropy regularization term into the learning objective (Haarnoja et al., 2018; Nachum et al., 2017; Lee et al., 2019; Neu et al., 2017; Geist et al., 2019; Ahmed et al., 2019); this brings multi-modality to the policy and benefits exploration. Note that the entropy term involves only the policy, so it can be computed directly, resulting in a learning procedure similar to SAC (Haarnoja et al., 2018). Our method, in contrast, considers the offline setting and provides a different learning procedure, jointly learning a state-value function and an action-value function.

3. PRELIMINARIES

We consider the RL problem presented as a Markov Decision Process (MDP) (Sutton et al., 1998), which is specified by a tuple M = ⟨S, A, T, r, ρ, γ⟩ consisting of a state space, an action space, a transition probability function, a reward function, an initial state distribution, and the discount factor. The goal of RL is to find a policy π(a|s) : S × A → [0, 1] that maximizes the expected discounted cumulative reward (or return) along a trajectory:

$$\max_\pi \; \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s,\; a_0 = a,\; s_t \sim T(\cdot\,|\,s_{t-1}, a_{t-1}),\; a_t \sim \pi(\cdot\,|\,s_t) \text{ for } t \geq 1\Big]. \quad (1)$$

In this work, we focus on the offline setting. Unlike online RL methods, offline RL aims to learn an optimal policy from a fixed dataset D consisting of trajectories collected by different policies. The dataset can be heterogeneous and suboptimal; we denote the underlying behavior policy of D as µ, which represents the conditional distribution p(a|s) observed in the dataset. RL methods based on approximate dynamic programming (both online and offline) typically maintain an action-value function (Q-function) and, optionally, a state-value function (V-function), referred to as Q(s, a) and V(s) respectively (Haarnoja et al., 2017; Nachum et al., 2017; Kumar et al., 2020; Kostrikov et al., 2021b). These two value functions are learned by encouraging them to satisfy single-step Bellman consistencies. Define a collection of policy evaluation operators (for a policy x) on Q and V as

$$(\mathcal{T}^x Q)(s,a) := r(s,a) + \gamma\,\mathbb{E}_{s'|s,a}\,\mathbb{E}_{a'\sim x}\big[Q(s',a')\big], \qquad (\mathcal{T}^x V)(s) := \mathbb{E}_{a\sim x}\big[r(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V(s')]\big],$$

then Q and V are learned by

$$\min_Q J(Q) = \tfrac{1}{2}\,\mathbb{E}_{(s,a)\sim D}\big[\big((\mathcal{T}^x Q - Q)(s,a)\big)^2\big] \quad \text{and} \quad \min_V J(V) = \tfrac{1}{2}\,\mathbb{E}_{s\sim D}\big[\big((\mathcal{T}^x V - V)(s)\big)^2\big],$$

respectively. Note that x could be the learned policy π or the behavior policy µ; if x = µ, then a ∼ µ and a′ ∼ µ are equivalent to a ∼ D and a′ ∼ D, respectively.
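The policy evaluation operator above can be sketched on a tiny tabular MDP. The following is a minimal numpy illustration, not the paper's implementation; all dynamics and rewards are made up, and it shows that repeatedly applying the γ-contractive operator converges to a fixed point:

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))   # P[s, a, s'] transition probabilities
r = np.array([[0.0, 1.0], [0.5, 0.0]])       # r[s, a], illustrative rewards
mu = np.array([[0.5, 0.5], [0.5, 0.5]])      # behavior policy, taken as x = mu

def bellman_op(q):
    # (T^mu Q)(s, a) = r(s, a) + gamma * E_{s'}[ E_{a'~mu} Q(s', a') ]
    v = np.sum(mu * q, axis=1)               # V(s') = E_{a'~mu} Q(s', a')
    return r + gamma * (P @ v)

q = np.zeros((2, 2))
for _ in range(500):
    q = bellman_op(q)

# gamma-contraction: the iteration has reached its unique fixed point Q = T^mu Q
assert np.max(np.abs(bellman_op(q) - q)) < 1e-6
```

With x = µ, both expectations reduce to averages over dataset samples, which is the in-sample property the rest of the paper builds on.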
In offline RL, since D typically does not contain all possible transitions (s, a, s′), one actually uses an empirical policy evaluation operator that only backs up a single s′ sample; we denote this operator as $\hat{\mathcal{T}}^x$.

In-sample Learning via Expectile Regression. Instead of adding explicit regularization to the policy evaluation operator to avoid out-of-distribution actions, IQL uses only in-sample actions to learn the optimal Q-function. IQL uses an asymmetric ℓ2 loss (i.e., expectile regression) to learn the V-function, which can be seen as an estimate of the maximum Q-value over actions that are in the dataset support, thus allowing implicit Q-learning:

$$\min_V \; \mathbb{E}_{(s,a)\sim D}\Big[\big|\tau - \mathbb{1}\big(Q(s,a) - V(s) < 0\big)\big|\,\big(Q(s,a) - V(s)\big)^2\Big], \quad (2)$$
$$\min_Q \; \mathbb{E}_{(s,a,s')\sim D}\Big[\big(r(s,a) + \gamma V(s') - Q(s,a)\big)^2\Big], \quad (3)$$

where $\mathbb{1}$ is the indicator function. After learning Q and V, IQL extracts the policy by advantage-weighted regression (Peters et al., 2010; Peng et al., 2019; Nair et al., 2020):

$$\max_\pi \; \mathbb{E}_{(s,a)\sim D}\big[\exp\big(\beta\,(Q(s,a) - V(s))\big)\log\pi(a|s)\big].$$

While IQL achieves superior D4RL benchmark results, several issues remain unsolved:

• The hyperparameter τ has a gap between theory and practice: in theory τ should be close to 1 to obtain an optimal policy, while in practice a larger τ may give a worse result.

• In IQL the value function is estimating the optimal policy instead of the behavior policy; how does IQL handle the distributional shift issue?

• Why should the policy be extracted by advantage-weighted regression? Does this technique guarantee the same optimal policy as the one implied in the learned optimal Q-function?
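As a sanity check on expectile regression (the loss in Eq. 2, here applied to a scalar random variable), the following numpy sketch fits an expectile by gradient descent on the asymmetric ℓ2 loss; the data and step sizes are our own illustrative choices. It shows that τ = 0.5 recovers the mean while τ → 1 approaches the in-sample maximum:

```python
import numpy as np

def expectile_loss_grad(m, xs, tau):
    # Gradient w.r.t. m of L(m) = E[ |tau - 1(x - m < 0)| * (x - m)^2 ]
    diff = xs - m
    w = np.where(diff < 0, 1 - tau, tau)
    return -2 * np.mean(w * diff)

def fit_expectile(xs, tau, lr=0.05, steps=5000):
    m = float(np.mean(xs))
    for _ in range(steps):
        m -= lr * expectile_loss_grad(m, xs, tau)
    return m

rng = np.random.default_rng(0)
xs = rng.normal(size=1000)

m_mean = fit_expectile(xs, 0.5)   # tau = 0.5 recovers the mean
m_high = fit_expectile(xs, 0.99)  # tau -> 1 approaches the in-sample maximum
assert abs(m_mean - xs.mean()) < 1e-2
assert m_high > np.quantile(xs, 0.9)   # well above the bulk of the samples
assert m_high <= xs.max()              # but still bounded by in-sample values
```

The last two assertions mirror IQL's claim: the expectile interpolates between the mean and the in-support maximum, never extrapolating beyond observed values.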

4. OFFLINE RL WITH IMPLICIT VALUE REGULARIZATION

In this section, we introduce a framework where a general form of value regularization can be implicitly applied. We begin with a special MDP where a behavior regularizer is added to the reward; we conduct a full mathematical analysis of this regularized MDP and give its solution under certain assumptions, which results in a complete in-sample learning paradigm. We then instantiate practical algorithms from this framework and give a thorough analysis and discussion.

4.1. BEHAVIOR-REGULARIZED MDPS

Just as entropy-regularized RL adds an entropy regularizer to the reward (Haarnoja et al., 2018), in this paper we consider imposing a general behavior regularization term on objective (1) and solve the following behavior-regularized MDP problem:

$$\max_\pi \; \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t\Big(r(s_t, a_t) - \alpha f\Big(\frac{\pi(a_t|s_t)}{\mu(a_t|s_t)}\Big)\Big)\Big], \quad (4)$$

where f(·) is a regularization function. It is known that in entropy-regularized RL the regularization gives smoothness to the Bellman operator (Ahmed et al., 2019; Chow et al., 2018), e.g., turning the greedy max into a softmax over the whole action space when the regularization is the Shannon entropy. In our new learning objective (4), we find that the smoothness instead transfers the greedy max from the policy π to a softened max (depending on f) over the behavior policy µ; this enables an in-sample learning scheme, which is appealing in the offline RL setting. In the behavior-regularized MDP, we have a modified policy evaluation operator $\mathcal{T}_f^\pi$ given by

$$(\mathcal{T}_f^\pi Q)(s,a) := r(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V(s')], \quad \text{where } V(s) = \mathbb{E}_{a\sim\pi}\Big[Q(s,a) - \alpha f\Big(\frac{\pi(a|s)}{\mu(a|s)}\Big)\Big].$$

The policy learning objective can also be expressed as $\max_\pi \mathbb{E}_{s\sim D}[V(s)]$. Compared with the original policy evaluation operator $\mathcal{T}^\pi$, $\mathcal{T}_f^\pi$ actually applies value regularization to the Q-function. However, the regularization term is hard to compute because the behavior policy µ is unknown. Although we could use Fenchel duality (Boyd et al., 2004) to get a sample-based estimate when f belongs to the f-divergence family (Wu et al., 2019), this brings an unnecessary min-max optimization problem, which is hard to solve and results in poor performance in practice (Nachum et al., 2019).

4.2. ASSUMPTIONS AND SOLUTIONS

We now show that we can obtain the optimal value functions Q* and V* without knowing µ. First, in order to make learning problem (4) analyzable, two basic assumptions are required:

Assumption 1. π(a|s) > 0 ⇒ µ(a|s) > 0, so that π/µ is well-defined.

Assumption 2. The function f(x) satisfies the following conditions on (0, ∞): (1) f(1) = 0; (2) $h_f(x) = x f(x)$ is strictly convex; (3) f(x) is differentiable.

The assumptions that f(1) = 0 and that $x f(x)$ is strictly convex make the regularization term positive, by Jensen's inequality:
$$\mathbb{E}_\mu\Big[\frac{\pi}{\mu} f\Big(\frac{\pi}{\mu}\Big)\Big] = \mathbb{E}_\mu\Big[h_f\Big(\frac{\pi}{\mu}\Big)\Big] \geq h_f\Big(\mathbb{E}_\mu\Big[\frac{\pi}{\mu}\Big]\Big) = h_f(1) = 1 \cdot f(1) = 0.$$
This guarantees that the regularization term is minimized only when π = µ. Because $h_f(x)$ is strictly convex, its derivative $h'_f(x) = f(x) + x f'(x)$ is strictly increasing, so its inverse $(h'_f)^{-1}(x)$ exists; for simplicity, we denote $g_f(x) = (h'_f)^{-1}(x)$. The differentiability assumption facilitates theoretical analysis and benefits practical implementation, given the ubiquity of automatic differentiation in deep learning. Under these two assumptions, we obtain the following two theorems.

Theorem 1. In the behavior-regularized MDP, any optimal policy π* and its optimal value functions Q* and V* satisfy the following optimality conditions for all states and actions:
$$Q^*(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V^*(s')],$$
$$\pi^*(a|s) = \mu(a|s)\cdot\max\Big(g_f\Big(\frac{Q^*(s,a) - U^*(s)}{\alpha}\Big),\, 0\Big), \quad (5)$$
$$V^*(s) = U^*(s) + \alpha\,\mathbb{E}_{a\sim\mu}\Big[\Big(\frac{\pi^*(a|s)}{\mu(a|s)}\Big)^2 f'\Big(\frac{\pi^*(a|s)}{\mu(a|s)}\Big)\Big], \quad (6)$$
where U*(s) is a normalization term ensuring $\sum_{a\in\mathcal{A}}\pi^*(a|s) = 1$.

The proof, provided in Appendix C.1, relies on the KKT conditions: the derivative of a Lagrangian objective function with respect to the policy π(a|s) vanishes at the optimal solution. Note that the resulting formulations of Q* and V* involve only U* and action samples from µ. U*(s) can be uniquely solved from the equation obtained by plugging Eq.(5) into $\sum_{a\in\mathcal{A}}\pi^*(a|s) = 1$, which also uses only actions sampled from µ.
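To make $g_f$ concrete, here is a small numpy check (toy numbers of our own) for two choices of f. For f(x) = x − 1, $h'_f(x) = 2x - 1$ gives $g_f(y) = (y+1)/2$, which can be negative, so Eq.(5) zeroes out bad actions; for f(x) = log x, $h'_f(x) = \log x + 1$ gives $g_f(y) = \exp(y-1)$, which is always positive:

```python
import numpy as np

# g_f = (h_f')^{-1} for the two regularizers analyzed in the paper
g_sql = lambda y: (y + 1) / 2        # f(x) = x - 1:  h_f'(x) = 2x - 1
g_eql = lambda y: np.exp(y - 1)      # f(x) = log x:  h_f'(x) = log x + 1

# Verify g_f really inverts h_f' for both choices.
xs = np.linspace(0.1, 5.0, 50)
assert np.allclose(g_sql(2 * xs - 1), xs)
assert np.allclose(g_eql(np.log(xs) + 1), xs)

# pi*(a|s) = mu(a|s) * max(g_f((Q* - U*)/alpha), 0): illustrative advantages
advantages = np.array([-3.0, -1.5, 0.5, 2.0])
sql_weights = np.maximum(g_sql(advantages), 0.0)
eql_weights = np.maximum(g_eql(advantages), 0.0)
assert (sql_weights[:2] == 0.0).all()   # bad actions get exactly zero mass
assert (eql_weights > 0.0).all()        # KL regularizer never zeroes an action
```

This is precisely the sparsity dichotomy discussed next: whether π* can assign exactly zero probability depends on whether $g_f$ takes negative values.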
In other words, the learning of Q* and V* can now be carried out in an in-sample manner. Theorem 1 also shows how the behavior regularization influences the optimality condition. If we choose f such that $g_f(x) < 0$ for some x, then Eq.(5) shows that the optimal policy π* will be sparse: it assigns zero probability to actions whose Q-values Q*(s, a) fall below the threshold $U^*(s) + \alpha h'_f(0)$, and assigns positive probability to near-optimal actions in proportion to their Q-values (since $g_f(x)$ is increasing). Note that π* could also have no sparsity; for example, choosing f(x) = log(x) gives $g_f(x) = \exp(x - 1)$, which assigns non-zero values to all elements.

Theorem 2. Define $\mathcal{T}_f^*$ as the case where π in $\mathcal{T}_f^\pi$ is the optimal policy π*; then $\mathcal{T}_f^*$ is a γ-contraction.

The proof is provided in Appendix C.2. This theorem means that by applying $Q_{k+1} = \mathcal{T}_f^* Q_k$ repeatedly, the sequence $Q_k$ converges to the Q-value of the optimal policy π* as k → ∞.

Having given the closed-form solution of the optimal value function, we now aim to instantiate practical algorithms. In offline RL, in order to completely avoid out-of-distribution actions, we want a zero-forcing support constraint, i.e., µ(a|s) = 0 ⇒ π(a|s) = 0. This points us to the class of α-divergences (Boyd et al., 2004), a subset of the f-divergences, which takes the following form (α ∈ ℝ\{0, 1}):
$$D_\alpha(\mu, \pi) = \frac{1}{\alpha(\alpha - 1)}\,\mathbb{E}_\pi\Big[\Big(\frac{\pi}{\mu}\Big)^{-\alpha} - 1\Big].$$
The α-divergence is known to be mode-seeking if one chooses α ≤ 0. Note that the reverse KL divergence is the limit of $D_\alpha(\mu, \pi)$ as α → 0, and we obtain the Hellinger distance and the Neyman χ²-divergence at α = 1/2 and α = −1, respectively. One interesting property of the α-divergence is that $D_\alpha(\mu, \pi) = D_{1-\alpha}(\pi, \mu)$.

4.3. SPARSE Q-LEARNING (SQL)

We first consider the case α = −1, which we find is the regularization term CQL adds to the policy evaluation operator (see Appendix C of CQL):
$$Q(s,a) = \mathcal{T}^\pi Q(s,a) - \beta\Big[\frac{\pi(a|s)}{\mu(a|s)} - 1\Big].$$
In this case, we have f(x) = x − 1 and $g_f(x) = \frac{1}{2}x + \frac{1}{2}$. Plugging them into Eq.(5) and Eq.(6) in Theorem 1, we get the following formulation:
$$Q^*(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V^*(s')], \quad (7)$$
$$\pi^*(a|s) = \mu(a|s)\cdot\max\Big(\frac{1}{2} + \frac{Q^*(s,a) - U^*(s)}{2\alpha},\, 0\Big), \quad (8)$$
$$V^*(s) = U^*(s) + \alpha\,\mathbb{E}_{a\sim\mu}\Big[\Big(\frac{\pi^*(a|s)}{\mu(a|s)}\Big)^2\Big], \quad (9)$$
where U*(s) needs to satisfy the following equation so that π* integrates to 1:
$$\mathbb{E}_{a\sim\mu}\Big[\max\Big(\frac{1}{2} + \frac{Q^*(s,a) - U^*(s)}{2\alpha},\, 0\Big)\Big] = 1. \quad (10)$$
It is usually intractable to get a closed-form solution of U*(s) from Eq.(10); however, a mathematical transformation shows its equivalence to solving a convex optimization problem.

Lemma 1. We can get U*(s) by solving the following optimization problem:
$$\min_U \; \mathbb{E}_{a\sim\mu}\Big[\mathbb{1}\Big(\frac{1}{2} + \frac{Q^*(s,a) - U(s)}{2\alpha} > 0\Big)\Big(\frac{1}{2} + \frac{Q^*(s,a) - U(s)}{2\alpha}\Big)^2 + \frac{U(s)}{\alpha}\Big]. \quad (11)$$

The proof follows easily by setting the derivative of the objective with respect to U(s) to 0, which yields exactly Eq.(10). We thus obtain a learning scheme for Q*, U* and V* by iteratively updating Q, U and V following Eq.(7), objective (11) and Eq.(9), respectively. We refer to this scheme as SQL-U; however, SQL-U needs to train three networks, which is computationally expensive. Note that the term $\mathbb{E}_{a\sim\mu}[(\pi^*(a|s)/\mu(a|s))^2]$ in Eq.(9) is equal to $\mathbb{E}_{a\sim\pi^*}[\pi^*(a|s)/\mu(a|s)]$. As π* is optimized to be mode-seeking, for actions sampled from π* the probability π*(a|s) should be close to the probability under the behavior policy, µ(a|s) (whereas for actions sampled from µ, π*(a|s) and µ(a|s) may differ substantially, since π*(a|s) may be 0). Hence in SQL we make an approximation by assuming $\mathbb{E}_{a\sim\pi^*}[\pi^*(a|s)/\mu(a|s)] = 1$; this removes one network, since then U* = V* − α.
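Lemma 1 and the normalization condition Eq.(10) can be checked numerically on a toy discrete state. The sketch below (illustrative µ, Q-values and α of our own choosing) finds U* by bisection, exploiting that the left-hand side of Eq.(10) is decreasing in U:

```python
import numpy as np

alpha = 1.0
mu = np.array([0.25, 0.25, 0.25, 0.25])   # uniform behavior policy over 4 actions
q = np.array([0.0, 1.0, 2.0, 5.0])        # toy Q*(s, a) values

def total_mass(u):
    # E_mu[ max(1/2 + (Q - u)/(2*alpha), 0) ], the left-hand side of Eq.(10)
    return np.sum(mu * np.maximum(0.5 + (q - u) / (2 * alpha), 0.0))

# total_mass is decreasing in u, so solve total_mass(u) = 1 by bisection.
lo, hi = q.min() - 10.0, q.max() + 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if total_mass(mid) > 1.0 else (lo, mid)
u_star = (lo + hi) / 2

pi_star = mu * np.maximum(0.5 + (q - u_star) / (2 * alpha), 0.0)
assert abs(pi_star.sum() - 1.0) < 1e-6    # pi* is a valid distribution
assert pi_star.argmax() == q.argmax()     # more mass on higher-Q actions
```

In function-approximation settings this per-state root finding is replaced by minimizing objective (11) with a network, but the fixed point is the same.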
Replacing U* with V*, we get the following learning scheme, which only needs to learn V and Q iteratively to obtain V* and Q*:
$$\min_V \; \mathbb{E}_{(s,a)\sim D}\Big[\mathbb{1}\Big(1 + \frac{Q(s,a) - V(s)}{2\alpha} > 0\Big)\Big(1 + \frac{Q(s,a) - V(s)}{2\alpha}\Big)^2 + \frac{V(s)}{\alpha}\Big], \quad (12)$$
$$\min_Q \; \mathbb{E}_{(s,a,s')\sim D}\Big[\big(r(s,a) + \gamma V(s') - Q(s,a)\big)^2\Big]. \quad (13)$$
After obtaining V and Q, following the formulation of π* in Eq.(8), we get the learning objective of the policy π by minimizing the KL-divergence between π and π*:
$$\max_\pi \; \mathbb{E}_{(s,a)\sim D}\Big[\mathbb{1}\Big(1 + \frac{Q(s,a) - V(s)}{2\alpha} > 0\Big)\Big(1 + \frac{Q(s,a) - V(s)}{2\alpha}\Big)\log\pi(a|s)\Big]. \quad (14)$$

4.4. EXPONENTIAL Q-LEARNING (EQL)

We now consider another choice, α → 0, which gives the reverse KL divergence. Note that AWR also uses the reverse KL divergence; however, it applies it in the policy improvement step and needs to sample actions from the policy when learning the value function. In this case, we have f(x) = log(x) and $g_f(x) = \exp(x - 1)$. Plugging them into Eq.(5) and Eq.(6) in Theorem 1, we have
$$Q^*(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V^*(s')],$$
$$\pi^*(a|s) = \mu(a|s)\cdot\exp\Big(\frac{Q^*(s,a) - U^*(s)}{\alpha} - 1\Big),$$
$$V^*(s) = U^*(s) + \alpha\,\mathbb{E}_{a\sim\mu}\Big[\Big(\frac{\pi^*(a|s)}{\mu(a|s)}\Big)^2\frac{\mu(a|s)}{\pi^*(a|s)}\Big].$$
Note that $\mathbb{E}_{a\sim\mu}\big[(\pi^*(a|s)/\mu(a|s))^2 \cdot \mu(a|s)/\pi^*(a|s)\big] = \mathbb{E}_{a\sim\mu}[\pi^*(a|s)/\mu(a|s)] = 1$, so V*(s) = U*(s) + α; this eliminates U* without any approximation. Replacing U* with V*, we get the following formulation:
$$Q^*(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'|s,a}[V^*(s')],$$
$$\pi^*(a|s) = \mu(a|s)\cdot\exp\Big(\frac{Q^*(s,a) - V^*(s)}{\alpha}\Big).$$
Since π* must integrate to 1, we apply the same mathematical transformation used in SQL and obtain the closed-form characterization of V*(s) as the solution of a convex optimization problem. Lemma 2.
We can get V*(s) by solving the following optimization problem:
$$\min_V \; \mathbb{E}_{a\sim\mu}\Big[\exp\Big(\frac{Q^*(s,a) - V(s)}{\alpha}\Big) + \frac{V(s)}{\alpha}\Big].$$
The final learning objectives of V, Q and π are then:
$$\min_V \; \mathbb{E}_{(s,a)\sim D}\Big[\exp\Big(\frac{Q(s,a) - V(s)}{\alpha}\Big) + \frac{V(s)}{\alpha}\Big], \quad (15)$$
$$\min_Q \; \mathbb{E}_{(s,a,s')\sim D}\Big[\big(r(s,a) + \gamma V(s') - Q(s,a)\big)^2\Big], \quad (16)$$
$$\max_\pi \; \mathbb{E}_{(s,a)\sim D}\Big[\exp\Big(\frac{Q(s,a) - V(s)}{\alpha}\Big)\log\pi(a|s)\Big]. \quad (17)$$
We name this algorithm EQL (Exponential Q-learning) because of the exponential term in the learning objective. To summarize, our final algorithms, SQL and EQL, consist of three supervised stages: learning V, learning Q, and learning π. We use target networks for the Q-functions and use clipped double Q-learning (taking the minimum of two Q-functions) in learning V and π. We summarize the training procedure in Algorithm 1.
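The EQL V-objective in Eq.(15) has a clean interpretation for a fixed batch of Q-values: setting its derivative in V to zero gives $\mathbb{E}[\exp((Q - V)/\alpha)] = 1$, i.e., $V = \alpha \log \mathbb{E}[\exp(Q/\alpha)]$, a soft maximum of the in-sample Q-values. A minimal numpy sketch with toy numbers (our own, not from the paper):

```python
import numpy as np

def eql_v_loss(q, v, alpha):
    # Eq.(15): exp((Q - V)/alpha) + V/alpha, averaged over a batch
    return np.mean(np.exp((q - v) / alpha) + v / alpha)

alpha = 1.0
q = np.array([0.0, 1.0, 2.0])

# Closed-form minimizer: V* = alpha * log E[exp(Q/alpha)]  (soft max of Q)
v_star = alpha * np.log(np.mean(np.exp(q / alpha)))

# Confirm by a grid search over candidate scalar V values.
grid = np.linspace(q.min(), q.max() + 2, 2001)
losses = [eql_v_loss(q, np.full_like(q, g), alpha) for g in grid]
v_grid = grid[int(np.argmin(losses))]
assert abs(v_grid - v_star) < 1e-2

# As alpha -> 0 the soft max approaches max(Q) = 2: less regularization,
# a more optimal (and more distributional-shift-prone) value estimate.
assert 0.1 * np.log(np.mean(np.exp(q / 0.1))) > 1.8
```

This also previews the role of α as the single temperature trading off optimality against distributional shift.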

4.5. DISCUSSIONS

Algorithm 1 Sparse or Exponential Q-Learning
Require: dataset D, hyperparameter α.
1: Initialize Q_ϕ, Q_ϕ′, V_ψ, π_θ
2: for t = 1, 2, ..., N do
3:   Sample transitions (s, a, r, s′) ∼ D
4:   Update V_ψ by Eq.(12) or Eq.(15) using V_ψ, Q_ϕ′
5:   Update Q_ϕ by Eq.(13) or Eq.(16) using V_ψ, Q_ϕ
6:   Update Q_ϕ′ by ϕ′ ← λϕ + (1 − λ)ϕ′
7:   Update π_θ by Eq.(14) or Eq.(17) using V_ψ, Q_ϕ′
8: end for

SQL and EQL establish connections with several prior works such as CQL, IQL and AWR. Just as CQL pushes down policy Q-values and pushes up dataset Q-values, in SQL and EQL the first term in Eq.(12) and Eq.(15) pushes up V-values where Q − V > 0, while the second term pushes down V-values, and α trades off these two terms. SQL incorporates the same inherent conservatism as CQL by adding the χ²-divergence to the policy evaluation operator. However, SQL learns the value function using only dataset samples, while CQL needs to sample actions from the policy; in this sense, SQL is an "implicit" version of CQL that avoids any out-of-distribution action. Like AWR, EQL applies the KL-divergence, but implicitly in the policy evaluation step; in this sense, EQL is an "implicit" version of AWR that avoids any OOD action. Like IQL, SQL and EQL learn both a V-function and a Q-function. However, IQL appears to be a heuristic approach, and the learning objective of the V-function in IQL has a drawback. We compute the derivative of the V-function learning objective with respect to the residual (Q − V) in SQL and IQL (see Figure 2 in Appendix A). We find that SQL keeps the derivative unchanged once the residual falls below a threshold, while IQL does not: in IQL, the derivative keeps decreasing as the residual becomes more negative, hence the V-function will be excessively underestimated because of bad actions whose Q-values are extremely small.
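The derivative argument above can be checked numerically. The following numpy sketch of the SQL objectives Eq.(12) and Eq.(14) (toy batch, all numbers illustrative) verifies that bad actions are filtered out and that the V-loss gradient becomes constant once the residual is clipped:

```python
import numpy as np

def sql_v_loss(q, v, alpha):
    # Eq.(12): 1[1 + (Q-V)/(2a) > 0] * (1 + (Q-V)/(2a))^2 + V/a
    resid = 1.0 + (q - v) / (2 * alpha)
    return np.mean(np.where(resid > 0, resid ** 2, 0.0) + v / alpha)

def sql_policy_weights(q, v, alpha):
    # Eq.(14): weights for weighted behavior cloning; zero for bad actions
    resid = 1.0 + (q - v) / (2 * alpha)
    return np.maximum(resid, 0.0)

alpha = 2.0
q = np.array([-10.0, -1.0, 0.0, 3.0])   # toy Q(s, a) values
v = np.zeros_like(q)                    # toy V(s) values

w = sql_policy_weights(q, v, alpha)
assert w[0] == 0.0                 # action far below V is filtered out entirely
assert np.all(np.diff(w) >= 0)     # weights increase with the advantage

# Once the residual is clipped, only the V/alpha term contributes, so the
# gradient w.r.t. V is the constant 1/alpha (finite-difference check):
eps = 1e-5
g = (sql_v_loss(q[:1], v[:1] + eps, alpha) - sql_v_loss(q[:1], v[:1], alpha)) / eps
assert abs(g - 1.0 / alpha) < 1e-3
```

An IQL expectile loss, by contrast, has a gradient that keeps growing in magnitude as the residual becomes more negative, which is exactly the over-penalization of bad actions discussed here.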
Note that SQL and EQL assign zero or exponentially small probability mass to those bad actions according to Eq.(14) and Eq.(17); the sparsity arises from the mode-seeking behavior of the χ²-divergence and the KL-divergence. Also, IQL needs two hyperparameters (τ and β) while SQL and EQL need only one (α). The two hyperparameters in IQL may not align well, because they represent two different regularizations. Note that objective (17) is exactly how IQL extracts the policy! However, the corresponding optimal V-function learning objective (15) is not objective (2). This reveals that the policy extraction step in IQL recovers a different policy from the one implied by the learned optimal Q-function.

5. EXPERIMENTS

We present empirical evaluations of SQL and EQL in this section. We first evaluate SQL and EQL against other baseline algorithms on benchmark offline RL datasets. We then show the benefits of sparsity introduced in SQL and EQL by comparing them with IQL in noisy data regimes. We finally show the robustness of SQL and EQL by comparing them with CQL in small data regimes.

5.1. BENCHMARK DATASETS

We first evaluate our approach on D4RL datasets (Fu et al., 2020). It is worth mentioning that the AntMaze and Kitchen datasets include few or no near-optimal trajectories and therefore strongly require learning a value function to obtain effective policies via "stitching". We compare SQL with prior state-of-the-art offline RL methods, including BC (Pomerleau, 1989), 10%BC (Chen et al., 2021), BCQ (Fujimoto et al., 2018), DT (Chen et al., 2021), TD3+BC (Fujimoto & Gu, 2021), One-step RL (Brandfonbrener et al., 2021), CQL (Kumar et al., 2020), and IQL (Kostrikov et al., 2021b). Aggregated results are displayed in Table 1. On MuJoCo tasks, where performance is already saturated, SQL and EQL show results competitive with the best prior methods. On the more challenging AntMaze and Kitchen tasks, SQL and EQL outperform all other baselines by a large margin, which shows the effectiveness of value learning in SQL and EQL. We show learning curves and performance profiles generated by the rliable library (Agarwal et al., 2021) in Appendix D. We then compare our approach with other baselines on high-dimensional image-based Atari datasets in RL Unplugged (Gulcehre et al., 2020). Our approach also achieves superior performance on these datasets; we show aggregated results, performance profiles and experimental details in Appendix D.

5.2. NOISY DATA REGIME

In this section, we validate our hypothesis that the sparsity term our algorithms introduce in learning the value function is beneficial when the datasets contain a large portion of noisy transitions. To do so, we make "mixed" datasets by combining random and expert datasets with different expert ratios. We test the performance of SQL, EQL and IQL under different mixing ratios in Fig. 1. SQL and EQL outperform IQL under all settings. The performance of IQL is vulnerable to the expert ratio: it shows a sharp decrease from 30% to 1%, while SQL and EQL still retain the expert performance.
For example, in walker2d, SQL and EQL reach near-100 performance when the expert ratio is only 5%; in halfcheetah, IQL is affected even at a high expert ratio (30%). Next, we explore the benefits of in-sample learning over out-of-sample learning.

5.3. SMALL DATA REGIME

We are interested in whether in-sample learning brings more robustness than out-of-sample learning when the dataset size is small or the dataset diversity at some states is limited, challenges one might encounter when applying offline RL algorithms to real-world data. To do so, we make custom datasets by discarding some transitions in the AntMaze datasets: the closer a transition is to the target location, the higher the probability it is discarded from the dataset. This simulates scenarios (e.g., robotic manipulation) where the data is scarce and has limited state coverage near the target location, because the (stochastic) data-generation policies may not be successful and become more deterministic as they get closer to the target location (Kumar et al., 2022). We use a hyperparameter to control the discarding ratio and build three new tasks: Easy, Medium and Hard, with progressively smaller datasets; for details please refer to Appendix D. We compare SQL with CQL, as they use the same inherent value regularization but SQL uses in-sample learning while CQL uses out-of-sample learning. We report the final normalized return (NR) during evaluation and the mean squared Bellman error (BE) during training in Table 2. CQL suffers a significant performance drop as task difficulty increases, and its Bellman error grows exponentially, indicating that value extrapolation error becomes large in small data regimes. SQL and EQL maintain stable and good performance under all difficulties, and the Bellman error of SQL is much smaller than that of CQL. This justifies the benefits of in-sample learning: it avoids erroneous value estimation by using only dataset samples while still allowing in-sample generalization to obtain good performance.

6. CONCLUSIONS AND FUTURE WORK

In this paper, we propose a general Implicit Value Regularization framework, which builds a bridge between behavior-regularized and in-sample learning methods in offline RL. Based on this framework, we propose two practical algorithms, which use the same value regularization as existing works, but in a complete in-sample manner. We verify the effectiveness of our algorithms on the D4RL benchmark and on custom noisy and small data regimes by comparing them with different baselines. One direction for future work is to scale our framework to online RL or offline imitation learning (Li et al., 2023a). Another is, instead of only constraining the action distribution, to constrain the state-action distribution between d^π and d^D, as considered in Nachum et al. (2019).

A A STATISTICAL VIEW OF WHY SQL AND EQL WORK

Inspired by the analysis in IQL, we give another view of why SQL and EQL can learn the optimal policy. Consider estimating a parameter $m_\alpha$ for a random variable $X$ using samples from a dataset $D$. We show that $m_\alpha$ fits the extrema of $X$ by using the learning objective of the $V$-function in SQL:
$$\arg\min_{m_\alpha} \mathbb{E}_{x\sim D}\left[\mathbb{1}\left(1+\frac{x-m_\alpha}{2\alpha}>0\right)\left(1+\frac{x-m_\alpha}{2\alpha}\right)^2+\frac{m_\alpha}{\alpha}\right],$$
or by using the learning objective of the $V$-function in EQL:
$$\arg\min_{m_\alpha} \mathbb{E}_{x\sim D}\left[\exp\left(\frac{x-m_\alpha}{\alpha}\right)+\frac{m_\alpha}{\alpha}\right].$$
In Figure 2 and Figure 3, we give an example of estimating the state-conditional extrema of a two-dimensional random variable; as shown, $\alpha \to 0$ approximates the maximum operator over in-support values of $y$ given $x$. This phenomenon is justified in our IVR framework, as the value function becomes more optimal with less value regularization. However, less value regularization also brings more distributional shift, so we need a proper $\alpha$ to trade off optimality against distributional shift.

Figure 2: Left: the loss with respect to the residual ($Q - V$) in the learning objective of $V$ in SQL with different $\alpha$. Center: an example of estimating the state-conditional extrema of a two-dimensional random variable (generated by adding random noise to samples from $y = \sin(x)$); each $x$ corresponds to a distribution over $y$, and the loss fits the extrema more closely as $\alpha$ becomes smaller. Right: comparison of the derivative of the loss of SQL and IQL; in SQL, the derivative stays unchanged once the residual falls below a threshold.

Figure 3: Left: the loss with respect to the residual ($Q - V$) in the learning objective of $V$ in EQL with different $\alpha$. Center: the same state-conditional extrema estimation example as in Figure 2. Right: comparison of the derivative of the loss of EQL and IQL; in EQL, the derivative softly decreases and stays (nearly) unchanged once the residual falls below a threshold.
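The extremum-fitting behavior of the SQL objective can be checked numerically. The following minimal Python sketch is our own illustration (not the paper's code): it minimizes the SQL $V$-objective over $m_\alpha$ for a batch of samples, using a ternary search since the objective is convex in $m_\alpha$; as $\alpha$ shrinks, the estimate moves from the mean toward the in-sample maximum.

```python
import numpy as np

def sql_loss(m, x, alpha):
    # SQL V-objective from Appendix A with residual x - m:
    # E[ 1(1 + (x-m)/(2a) > 0) * (1 + (x-m)/(2a))^2 ] + m/a
    r = 1.0 + (x - m) / (2.0 * alpha)
    return np.mean(np.where(r > 0, r ** 2, 0.0)) + m / alpha

def estimate_extremum(x, alpha, iters=200):
    # The objective is convex in m, so a simple ternary search suffices.
    lo, hi = float(np.min(x)) - 1.0, float(np.max(x)) + 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if sql_loss(m1, x, alpha) < sql_loss(m2, x, alpha):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0
```

For samples drawn uniformly from [0, 1], a large $\alpha$ recovers roughly the mean (0.5), while a small $\alpha$ pushes the estimate close to the maximum, matching the $\alpha \to 0$ limit discussed above.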

B HOW DOES SPARSITY BENEFIT VALUE LEARNING IN SQL?

In this section, we add more experiments on the sparsity characteristic of SQL. We use a toy example in the tabular setting to demonstrate how sparsity benefits value learning in SQL, and we study the relationship of the normalized score and the sparsity ratio with α in the continuous-action setting to show that sparsity plays an important role in the performance of SQL.

B.1 SPARSITY IN THE TABULAR SETTING

In Appendix A, we compare the derivative of the loss of SQL and IQL: SQL keeps the derivative unchanged once the residual falls below a threshold while IQL does not, so the V-function in IQL is dragged down by bad actions whose Q-values are small. To justify this claim, we use the Four Rooms environment, where the agent starts from the bottom-left corner and needs to navigate through the four rooms to reach the goal in the upper-right corner in as few steps as possible. There are four actions: A = {up, down, right, left}. The reward is zero on each time step until the agent reaches the goal state, where it receives +10. The offline dataset is collected by a random behavior policy that takes each action with equal probability. We collect 30 trajectories, each trajectory is terminated after 20 steps if the agent has not succeeded, and γ is 0.9. It can be seen from Figure 4 that the value learning in IQL is corrupted by the suboptimal actions in this dataset: they prevent IQL from propagating correct learning signals from the goal location to the start location, resulting in underestimated V-values and some mistaken Q-values. In particular, incorrect Q-values at (1, 1) and (5, 9) make the agent fail to reach the goal. In SQL, the V-values and Q-values are much closer to the true optimal ones, and the agent succeeds in reaching the goal location. This reveals that the sparsity term in SQL helps alleviate the effect of bad dataset actions and learn a more optimal value function.
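The derivative comparison underlying this claim can be sketched directly. The snippet below is our own illustration (not the released implementation): it contrasts the gradient of the SQL V-loss with that of IQL's expectile loss as a function of the residual u = Q - V. The SQL gradient is exactly zero once u ≤ -2α, so strongly negative residuals from bad actions stop influencing V, while the expectile gradient keeps growing in magnitude with |u|.

```python
import numpy as np

def sql_loss_grad(u, alpha):
    # Derivative w.r.t. u of 1(1 + u/(2a) > 0) * (1 + u/(2a))^2:
    # it vanishes once u <= -2*alpha, so bad actions with very
    # negative residuals contribute no gradient to the V-function.
    r = 1.0 + u / (2.0 * alpha)
    return np.where(r > 0.0, r / alpha, 0.0)

def iql_loss_grad(u, tau):
    # Derivative w.r.t. u of the expectile loss |tau - 1(u < 0)| * u^2:
    # its magnitude grows linearly with |u|, so bad actions keep
    # dragging the V-function down.
    w = np.where(u < 0.0, 1.0 - tau, tau)
    return 2.0 * w * u
```

This is why, in the Four Rooms dataset above, suboptimal actions corrupt IQL's value propagation but leave SQL's largely intact.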

B.2 SPARSITY IN THE CONTINUOUS ACTION SETTING

The non-sparsity ratio (i.e., E_{(s,a)∼D}[1(1 + (Q(s, a) - V(s))/(2α) > 0)]) is controlled by the hyperparameter α. In the continuous-action setting, we show the relationship of the normalized score and the non-sparsity ratio with α in Table 3 and Table 4. Typically, a larger α gives less sparsity. Sparsity plays an important role in the performance of SQL, and we need to choose a proper sparsity ratio to achieve the best performance. The best sparsity ratio depends on the composition of the dataset: for example, the best sparsity ratios in MuJoCo datasets (around 0.1) are always smaller than those in AntMaze datasets (around 0.4). This is because the AntMaze datasets are effectively multi-task datasets (the start and goal locations differ from the current ones), so they contain a large portion of useless transitions, and it is reasonable to give those transitions zero weight by using more sparsity.
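As a concrete illustration (our own sketch, not taken from the released code), the non-sparsity ratio can be computed from Q- and V-estimates over dataset transitions as follows; a larger α activates more transitions:

```python
import numpy as np

def non_sparsity_ratio(q, v, alpha):
    # Fraction of dataset transitions with nonzero weight:
    # E_{(s,a)~D}[ 1(1 + (Q(s,a) - V(s)) / (2*alpha) > 0) ]
    return float(np.mean(1.0 + (q - v) / (2.0 * alpha) > 0.0))
```

In practice one would evaluate this on minibatches during training to monitor how α translates into an effective sparsity level for a given dataset.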

C PROOFS

C.1 PROOF OF THEOREM 1

In this section, we give the detailed proof of Theorem 1, which states the optimality condition of the behavior regularized MDP. The proof follows from the Karush-Kuhn-Tucker (KKT) conditions, where the derivative of the Lagrangian objective with respect to the policy $\pi(a|s)$ is set to zero. Hence, our main theory is necessary and sufficient.

Proof. The Lagrangian function of (4) is written as
$$\mathcal{L}(\pi, \beta, u) = \sum_s d^\pi(s)\Big[\sum_a \pi(a|s)\Big(Q(s,a) - \alpha f\Big(\frac{\pi(a|s)}{\mu(a|s)}\Big)\Big) + u(s)\Big(1 - \sum_a \pi(a|s)\Big) + \sum_a \beta(a|s)\pi(a|s)\Big],$$
where $d^\pi$ is the stationary state distribution of the policy $\pi$, and $u$ and $\beta$ are Lagrangian multipliers for the equality and inequality constraints, respectively. Let $h_f(x) = x f(x)$. Then the KKT conditions of (4) are as follows: for all states and actions we have
$$0 \le \pi(a|s) \le 1 \quad \text{and} \quad \sum_a \pi(a|s) = 1, \qquad (18)$$
$$\beta(a|s) \ge 0, \qquad (19)$$
$$\beta(a|s)\,\pi(a|s) = 0, \qquad (20)$$
$$Q(s,a) - \alpha h'_f\Big(\frac{\pi(a|s)}{\mu(a|s)}\Big) - u(s) + \beta(a|s) = 0, \qquad (21)$$
where (18) is the feasibility of the primal problem, (19) is the feasibility of the dual problem, (20) results from the complementary slackness, and (21) is the stationarity condition. We eliminate $d^\pi(s)$ since we assume all policies induce an irreducible Markov chain. From (21), we can resolve $\pi(a|s)$ as
$$\pi(a|s) = \mu(a|s) \cdot g_f\Big(\frac{1}{\alpha}\big(Q(s,a) - u(s) + \beta(a|s)\big)\Big).$$
Fix a state $s$. For any positive-probability action, its corresponding Lagrangian multiplier $\beta(a|s)$ is zero due to the complementary slackness, so $Q(s,a) > u(s) + \alpha h'_f(0)$ must hold. For any zero-probability action, its Lagrangian multiplier $\beta(a|s)$ is set such that $\pi(a|s) = 0$; since $\beta(a|s) \ge 0$, $Q(s,a) \le u(s) + \alpha h'_f(0)$ must hold in this case. From these observations, $\pi(a|s)$ can be reformulated as
$$\pi(a|s) = \mu(a|s) \cdot \max\Big(g_f\Big(\frac{1}{\alpha}\big(Q(s,a) - u(s)\big)\Big),\, 0\Big). \qquad (22)$$
By plugging (22) into (18), we obtain a new equation
$$\mathbb{E}_{a\sim\mu}\Big[\max\Big(g_f\Big(\frac{1}{\alpha}\big(Q(s,a) - u(s)\big)\Big),\, 0\Big)\Big] = 1. \qquad (23)$$
Note that (23) has one and only one solution, denoted $u^*$, because the LHS of (23) is a continuous and monotonic function of $u$, so $u^*$ can be solved uniquely. We denote the corresponding policy $\pi$ as $\pi^*$. Next, we aim to obtain the optimal state value $V^*$.
It follows that
$$V^*(s) = \mathcal{T}^*_f V^*(s) = \sum_a \pi^*(a|s)\Big(Q^*(s,a) - \alpha f\Big(\frac{\pi^*(a|s)}{\mu(a|s)}\Big)\Big)$$
$$= \sum_a \pi^*(a|s)\Big(u^*(s) + \alpha\, \frac{\pi^*(a|s)}{\mu(a|s)}\, f'\Big(\frac{\pi^*(a|s)}{\mu(a|s)}\Big)\Big)$$
$$= u^*(s) + \alpha \sum_a \frac{\pi^*(a|s)^2}{\mu(a|s)}\, f'\Big(\frac{\pi^*(a|s)}{\mu(a|s)}\Big) = u^*(s) + \alpha\, \mathbb{E}_{a\sim\mu}\Big[\Big(\frac{\pi^*(a|s)}{\mu(a|s)}\Big)^2 f'\Big(\frac{\pi^*(a|s)}{\mu(a|s)}\Big)\Big].$$
The first equality follows from the definition of the optimal state value. The second equality holds because $\pi^*$ maximizes $\mathcal{T}^*_f V^*(s)$. The third equality results from plugging in (21). To summarize, we obtain the optimality condition of the behavior regularized MDP as follows:
$$Q^*(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'|s,a}[V^*(s')],$$
$$\pi^*(a|s) = \mu(a|s) \cdot \max\Big(g_f\Big(\frac{Q^*(s,a) - u^*(s)}{\alpha}\Big),\, 0\Big),$$
$$V^*(s) = u^*(s) + \alpha\, \mathbb{E}_{a\sim\mu}\Big[\Big(\frac{\pi^*(a|s)}{\mu(a|s)}\Big)^2 f'\Big(\frac{\pi^*(a|s)}{\mu(a|s)}\Big)\Big].$$

C.2 PROOF OF THEOREM 2

Proof. For any two state value functions $V_1$ and $V_2$, let $\pi_i$ be the policy that maximizes $\mathcal{T}^*_f V_i$, $i \in \{1, 2\}$. Then it follows that for any state $s \in S$,
$$\mathcal{T}^*_f V_1(s) - \mathcal{T}^*_f V_2(s) = \sum_a \pi_1(a|s)\Big(r + \gamma\,\mathbb{E}_{s'}[V_1(s')] - \alpha f\Big(\frac{\pi_1(a|s)}{\mu(a|s)}\Big)\Big) - \max_\pi \sum_a \pi(a|s)\Big(r + \gamma\,\mathbb{E}_{s'}[V_2(s')] - \alpha f\Big(\frac{\pi(a|s)}{\mu(a|s)}\Big)\Big)$$
$$\le \sum_a \pi_1(a|s)\Big(r + \gamma\,\mathbb{E}_{s'}[V_1(s')] - \alpha f\Big(\frac{\pi_1(a|s)}{\mu(a|s)}\Big)\Big) - \sum_a \pi_1(a|s)\Big(r + \gamma\,\mathbb{E}_{s'}[V_2(s')] - \alpha f\Big(\frac{\pi_1(a|s)}{\mu(a|s)}\Big)\Big)$$
$$= \gamma \sum_a \pi_1(a|s)\,\mathbb{E}_{s'}[V_1(s') - V_2(s')] \le \gamma\, \|V_1 - V_2\|_\infty.$$
By symmetry, it follows that for any state $s \in S$,
$$\mathcal{T}^*_f V_2(s) - \mathcal{T}^*_f V_1(s) \le \gamma\, \|V_1 - V_2\|_\infty.$$
Therefore,
$$\big\|\mathcal{T}^*_f V_1 - \mathcal{T}^*_f V_2\big\|_\infty \le \gamma\, \|V_1 - V_2\|_\infty.$$
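As a numerical sanity check on this contraction result, consider the special case $f(x) = \log x$ (KL regularization toward $\mu$), for which the maximization in $\mathcal{T}^*_f$ has the standard closed form $\mathcal{T}^*_f V(s) = \alpha \log \sum_a \mu(a|s)\exp\big((r(s,a)+\gamma\,\mathbb{E}_{s'}[V(s')])/\alpha\big)$. The sketch below is our own illustration on a randomly generated tabular MDP, not code from the paper:

```python
import numpy as np

def soft_operator(V, r, P, mu, alpha=1.0, gamma=0.9):
    # Closed form of T*_f for f(x) = log x (KL regularization toward mu):
    # T V(s) = alpha * log sum_a mu(a|s) * exp((r(s,a) + gamma*E[V(s')]) / alpha)
    q = r + gamma * np.einsum('sat,t->sa', P, V)  # (S, A) action values
    return alpha * np.log(np.sum(mu * np.exp(q / alpha), axis=1))

def check_contraction(S=5, A=3, gamma=0.9, seed=0):
    # Verify ||T V1 - T V2||_inf <= gamma * ||V1 - V2||_inf on random inputs.
    rng = np.random.default_rng(seed)
    r = rng.uniform(0.0, 1.0, size=(S, A))
    P = rng.uniform(size=(S, A, S))
    P /= P.sum(axis=-1, keepdims=True)       # row-stochastic transitions
    mu = rng.uniform(size=(S, A))
    mu /= mu.sum(axis=-1, keepdims=True)     # behavior policy
    V1, V2 = rng.normal(size=S), rng.normal(size=S)
    lhs = np.max(np.abs(soft_operator(V1, r, P, mu, gamma=gamma)
                        - soft_operator(V2, r, P, mu, gamma=gamma)))
    rhs = gamma * np.max(np.abs(V1 - V2))
    return lhs <= rhs + 1e-12
```

The inequality holds for every random MDP because the log-sum-exp is 1-Lipschitz in the max-norm of the action values, which differ by at most $\gamma\|V_1 - V_2\|_\infty$.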

D EXPERIMENTAL DETAILS

D4RL experimental details For MuJoCo locomotion and Kitchen tasks, we average mean returns over 10 evaluations every 5000 training steps, over 5 random seeds. For AntMaze tasks, we average over 100 evaluations every 0.1M training steps, over 5 random seeds. Following IQL, we standardize the rewards by dividing them by the difference in returns between the best and worst trajectories in MuJoCo and Kitchen tasks; we subtract 1 from the rewards in AntMaze tasks. Our implementation of 10%BC is as follows: we first filter the top 10% of trajectories in terms of trajectory return, and then run behavior cloning on the filtered data. We re-run IQL on all datasets and report its score by choosing the best score over τ in [0.5, 0.6, 0.7, 0.8, 0.9, 0.99], using the author-provided implementation†. We re-run CQL on the AntMaze datasets, as we find the performance can be improved by carefully sweeping the hyperparameter min-q-weight over [0.5, 1, 2, 5, 10], using the PyTorch-version implementation‡. Other baseline results are taken directly from their corresponding papers. In SQL and EQL, we use a 2-layer MLP with 256 hidden units and the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 2 · 10^-4 for all neural networks. Following Mnih et al. (2013); Lillicrap et al. (2016), we introduce a target critic network with soft update weight 5 · 10^-3. We implement our method in the JAX framework. The only hyperparameter, α, used in SQL and EQL is listed in Table 5. The sensitivity of α in SQL can be found in Table 3 and Table 4. The sensitivity of τ in IQL can be found in Table 3. The runtime of different algorithms can be found in Table 7.
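The 10%BC filtering step described above can be sketched as follows; this is a minimal illustration with an assumed trajectory format (a list of dicts holding a `rewards` array), not the exact implementation:

```python
import numpy as np

def filter_top_trajectories(trajectories, frac=0.1):
    """Keep the top `frac` of trajectories ranked by undiscounted return.

    `trajectories` is assumed to be a list of dicts with a 'rewards' array.
    """
    returns = np.array([np.sum(t['rewards']) for t in trajectories])
    k = max(1, int(len(trajectories) * frac))
    top_idx = np.argsort(returns)[-k:]  # indices of the k highest returns
    return [trajectories[i] for i in top_idx]
```

Behavior cloning is then run only on the transitions of the returned trajectories.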
Dataset                        α (SQL)   α (EQL)
…                              2.0       2.0
walker2d-medium-v2             2.0       2.0
halfcheetah-medium-replay-v2   2.0       2.0
hopper-medium-replay-v2        2.0       2.0
walker2d-medium-replay-v2      2.0       2.0
halfcheetah-medium-expert-v2   5.0       5.0
hopper-medium-expert-v2        5.0       5.0
walker2d-medium-expert-v2      5.0       5.0
antmaze-umaze-v2               0.5       0.5
antmaze-umaze-diverse-v2       5.0       5.0
antmaze-medium-play-v2         0.5       0.5
antmaze-medium-diverse-v2      0.5       0.5
antmaze-large-play-v2          0.5       0.5
antmaze-large-diverse-v2       0.5       0.5
kitchen-c                      2.0       2.0
kitchen-p                      2.0       2.0
kitchen-m                      2.0       2.0

Atari experimental details We use d3rlpy for the offline Atari datasets introduced in (Agarwal et al., 2020). There are three types of Atari datasets in d3rlpy: mixed, datasets collected during the first 1M training steps of an online DQN agent; medium, datasets collected between 9M and 10M training steps of an online DQN agent; and expert, datasets collected during the last 1M training steps of an online DQN agent. To make the task more challenging, we use only 10% or 5% of the original datasets. We choose three image-based Atari games: Breakout, Qbert and Seaquest. We implement the discrete versions of SQL (D-SQL) and IQL (D-IQL) based on d3rlpy; the implementations of discrete CQL (D-CQL) and discrete BCQ (D-BCQ) are taken directly from d3rlpy. We use consistent preprocessing and network structures to ensure a fair comparison. For baselines, we report the score of D-IQL by choosing the best score over τ in [0.5, 0.7, 0.9], the score of D-CQL by choosing the best score over min-q-weight in [1, 2, 5], and the score of D-BCQ by choosing the best score over τ in [0.1, 0.3, 0.5]. For D-SQL, we use α = 1.0 for all datasets.

Noisy data regime experimental details In this setting, we build noisy datasets by mixing the expert and random MuJoCo locomotion datasets with different expert ratios. The total number of transitions in each noisy dataset is 100,000. We provide details in Table 8. We report the score of IQL by choosing the best score over τ in [0.5, 0.6, 0.7, 0.8, 0.9].
Small data regime experimental details We generate the small datasets using the pseudocode in Listing 1; their hardness levels can be found in Table 9. We report the score of CQL by choosing the best score over min-q-weight in [0.5, 1, 2, 5, 10]. Listing 1: the generation procedure of small data regimes with different hardness levels. Given an AntMaze environment and a hardness level, we discard transitions following the rule in the listing; intuitively, the closer a transition is to the goal, the higher the probability that it is discarded.



† https://github.com/ikostrikov/implicit_q_learning ‡ https://github.com/young-geng/CQL



Figure 1: Performance of different methods in noisy data regimes.

Figure 4: Evaluation of IQL and SQL on the Four Rooms environment. SQL learns a more optimal value function and produces a better policy than IQL when the dataset is heavily corrupted by suboptimal actions.


import numpy as np

LEVELS = ('easy', 'medium', 'hard')  # available hardness levels

obs = dataset['observations']
length = obs.shape[0]
POSITIONS = env.get_position(obs)
GOAL = env.get_goal()
MINIMAL_POSITION = env.get_minimal_position()
# maximal squared Euclidean distance, used to normalize distances to [0, 1]
MAX_EU_DIS = (GOAL - MINIMAL_POSITION) ** 2
DIS = ((POSITIONS - MINIMAL_POSITION) ** 2) / MAX_EU_DIS
# discard a transition with probability increasing in its closeness to GOAL
save_idx = np.random.random(size=length) > DIS * hardness[level]  # level in LEVELS
small_data = {}
for key in dataset.keys():
    small_data[key] = dataset[key][save_idx]

Averaged normalized scores of SQL against other baselines. The scores are taken over the final 10 evaluations with 5 seeds. SQL or EQL achieves the highest scores in 14 out of 18 tasks.

The normalized return (NR) and Bellman error (BE) of CQL, SQL and EQL in small data regimes.

The relationship of the normalized score (left) and non-sparsity ratio (right) with α in MuJoCo datasets.

The relationship of the normalized score (left) and non-sparsity ratio (right) with α in AntMaze datasets.

α used for SQL and EQL

The sensitivity of τ in IQL.

The runtime of different algorithms.

Noisy dataset of MuJoCo locomotion tasks with different expert ratios.

ACKNOWLEDGEMENT

This work is supported by funding from Intel Corporation. The authors would like to thank the anonymous reviewers for their feedback on the manuscript.

