KNOW YOUR BOUNDARIES: THE ADVANTAGE OF EXPLICIT BEHAVIORAL CLONING IN OFFLINE RL

Abstract

We introduce an offline reinforcement learning (RL) algorithm that explicitly clones a behavior policy to constrain value learning. In offline RL, it is often important to prevent a policy from selecting unobserved actions, since the consequences of these actions cannot be presumed without additional information about the environment. One straightforward way to implement such a constraint is to explicitly model the given data distribution via behavior cloning and directly force the policy not to select uncertain actions. However, many offline RL methods instantiate the constraint indirectly (for example, through pessimistic value estimation) because of concerns about errors when modeling a potentially complex behavior policy. In this work, we argue that it is not only viable but beneficial to explicitly model the behavior policy for offline RL, because the constraint can be realized in a stable way with the explicitly cloned model. We first suggest a theoretical framework that allows us to incorporate behavior-cloned models into value-based offline RL methods, enjoying the strengths of both explicit behavior cloning and value learning. Then, we propose a practical method that uses a score-based generative model for behavior cloning to better handle the complicated behaviors that an offline RL dataset might contain. The proposed method shows state-of-the-art performance on several datasets within the D4RL and Robomimic benchmarks and achieves competitive performance across all datasets tested.

1. INTRODUCTION

The goal of offline reinforcement learning (RL) is to learn a policy purely from pre-generated data. This data-driven RL paradigm is promising since it opens up the possibility of applying RL to many realistic scenarios where large-scale data is available. Two primary targets need to be considered in designing offline RL algorithms: maximizing reward and staying close to the provided data. Finding a policy that maximizes the accumulated sum of rewards is the main objective in RL, and this can be achieved by learning an optimal Q-value function. However, in the offline setup, it is often infeasible to infer a precise optimal Q-value function due to limited data coverage (Levine et al., 2020; Liu et al., 2020); for example, the value of states not shown in the dataset cannot be estimated without additional assumptions about the environment. This implies that value learning can typically be performed accurately only for the subset of the state (or state-action) space covered by a dataset. Because of this limitation, offline RL algorithms should implement some form of imitation learning objective that forces a policy to stay close to the given data. Recently, many offline RL algorithms have been proposed that instantiate an imitation learning objective without explicitly modeling the data distribution of the provided dataset. For instance, one approach applies the pessimism-under-uncertainty principle in value learning (Buckman et al., 2020; Kumar et al., 2020; Kostrikov et al., 2021a) in order to prevent out-of-distribution actions from being selected.
While these methods show promising practical results for certain domains, it has also been reported that such methods fall short compared to simple behavior cloning methods (Mandlekar et al., 2021; Florence et al., 2021), which only model the data distribution without exploiting any reward information. We hypothesize that this deficiency occurs because the imitation learning objective in these methods is realized only indirectly (e.g., by pessimistic value prediction), without explicitly modeling the data distribution. Such an indirect realization could be much more complicated than simple behavior cloning for some data distributions, since it is often entangled with unstable training dynamics caused by bootstrapping and function approximation. Hence, implicit methods are prone to over-regularization (Kumar et al., 2021) or failure, and they may require delicate hyperparameter tuning to prevent this deficiency (Emmons et al., 2022). Yet, at the same time, it is obvious that simple behavior cloning cannot extract a good policy from a data distribution composed of suboptimal policies.

To this end, we ask the following question in this paper: Can offline RL benefit from explicitly modeling the data distribution via behavior cloning no matter what kind of data distribution is given? Previously, there have been attempts to use an explicitly trained behavior cloning model in offline RL (Wu et al., 2019; Kumar et al., 2019; Fujimoto et al., 2019; Liu et al., 2020), but we argue that two important elements are missing from existing algorithms. First, high-fidelity generative models have not been integrated with offline RL algorithms, even though offline RL needs a precise estimate of the behavior policy and an inaccurate estimate could limit the final performance of the algorithm (Levine et al., 2020). Florence et al. (2021) have tried an energy-based generative model, but the proposed method is an imitation learning method that does not incorporate a value function. Second, the trained behavior cloning models have only been used in limited forms, such as a KL (Wu et al., 2019) or MMD (Kumar et al., 2019) divergence between the cloned policy and an actor policy; these are heuristics or proxy formulations that are only empirically justified.

Therefore, we tackle these two problems by, first, incorporating a state-of-the-art score-based generative model (Song & Ermon, 2019; 2020; Song et al., 2021) to achieve the high fidelity required for offline RL, and second, proposing a theoretical framework, direct Q-penalization (DQP), that provides a mechanism to integrate the trained behavior model into value learning. Furthermore, DQP can provide an integrated view of different offline RL algorithms, helping to analyze the possible failures of these algorithms. We evaluate our algorithm on various benchmark datasets that differ in quality and complexity, namely D4RL and Robomimic. Our method shows not only competitive performance across different types of datasets but also state-of-the-art results on complex contact-rich tasks, such as the transport and tool-hang tasks in Robomimic. The results demonstrate the effectiveness of the proposed algorithm as well as the practical advantage of explicit behavior cloning, which was previously considered a bottleneck that would limit the final offline RL performance (Levine et al., 2020).
To summarize, our contributions are: (1) we provide a theoretical framework for offline RL, DQP, which gives a unified view of previously disparate offline RL algorithms; (2) using DQP, we suggest a principled offline RL formulation that incorporates an explicitly trained behavior cloning model; (3) we propose a practical algorithm that instantiates the above formulation, leveraging a score-based generative model; and (4) we achieve competitive and state-of-the-art performance across a variety of offline RL datasets.

2. RELATED WORKS

The end goal of offline RL is to extract the best possible policy from a given dataset, regardless of the quality of the trajectories that compose the dataset (Ernst et al., 2005; Riedmiller, 2005; Lange et al., 2012; Levine et al., 2020). One of the simplest approaches to this problem is imitation learning (IL) (Schaal, 1999; Florence et al., 2021), which aims to recover the performance of the behavior policy that generated the dataset. However, simple imitation fails to achieve the end goal of offline RL, since one cannot outperform the behavior policy by just imitating it. This problem is commonly addressed with value learning, which in turn must resolve the distribution shift problem that arises in the offline setup. Since distribution shift commonly results in overestimation of values, offline RL algorithms try to estimate values pessimistically for out-of-distribution inputs (Kumar et al., 2020; Goo & Niekum, 2021), sometimes by explicitly quantifying the certainty with a trained transition dynamics model (Yu et al., 2020; Kidambi et al., 2020), a generative model (Rezaeifar et al., 2021), or a pseudometric (Dadashi et al., 2021).

Distribution shift is also commonly addressed by constraining a policy to be close to the behavior policy. Based on how the constraint is instantiated, policy-constraint methods can be further categorized into implicit methods, which constrain a policy via weighted behavior cloning or linear gradient combination (Peng et al., 2019; Wang et al., 2020; Kostrikov et al., 2021b; Brandfonbrener et al., 2021; Fujimoto & Gu, 2021; Wang et al., 2022), and explicit methods (Fujimoto et al., 2019; Kumar et al., 2019; Wu et al., 2019; Liu et al., 2020), which constrain policy learning via a value penalty or policy regularization. We generally follow the learning structure of the explicit policy-constraint methods. In particular, our offline RL algorithm is closely related to the works of Fujimoto et al. (2019), Wu et al. (2019), and Liu et al. (2020), in which a behavior policy $\beta$ is explicitly cloned first and the cloned policy is directly used to instantiate the policy constraint. We try to improve on these works by, first, suggesting a principled policy-constraining method that uses the cloned policy, and second, using a powerful generative model for explicit behavior cloning that can greatly reduce the BC error and thereby allow precise policy regularization.

Concurrent with our work, Wang et al. (2022) propose an offline RL algorithm that leverages a diffusion-based generative model. Their method is based on the work of Fujimoto & Gu (2021), in which the ordinary actor loss is linearly combined with a behavior cloning loss. The proposed algorithm is simple and minimalistic, but the theoretical justification of the approach has not been fully addressed.

3. PRELIMINARIES

We use a Markov Decision Process (MDP) as the foundation for our mathematical framework. An MDP is defined by a tuple $M = (S, A, T, d_0, r, \gamma)$: a set of states $s \in S$, a set of actions $a \in A$, transition dynamics $T = p(s' \mid s, a)$, an initial state distribution $d_0(s_0)$, a reward function $r(s, a)$, and a discount factor $\gamma$. In this setup, the goal of reinforcement learning is to find an optimal policy $\pi^*(a \mid s)$ that maximizes the expected sum of discounted rewards (return) $J(\pi)$:

$$\pi^* = \arg\max_{\pi} J(\pi) = \arg\max_{\pi} \mathbb{E}_{\tau \sim \rho_\pi}\left[\sum_{t=0}^{H} \gamma^{t} r(s_t, a_t)\right], \tag{1}$$

where $\tau$ is a sequence of states and actions $(s_0, a_0, \ldots, s_H, a_H)$ of length $H$, and $\rho_\pi$ is the trajectory distribution of a policy $\pi$, which can be represented as

$$\rho_\pi(\tau) = d_0(s_0) \prod_{t=0}^{H} \pi(a_t \mid s_t)\, T(s_{t+1} \mid s_t, a_t).$$

We can directly optimize the return when we can compute the gradients of $J(\pi_\phi)$ with respect to the policy parameters $\phi$ (Williams, 1992; Schulman et al., 2017; 2015), but this approach does not extend straightforwardly to the offline setting, since on-policy data is typically required to compute the gradient. Instead, it is more common for offline RL methods to extend dynamic programming approaches, which are formed around the action-value function $Q^\pi(s, a)$, formally defined as:

$$Q^\pi(s_t, a_t) = \mathbb{E}_{\tau \sim \rho_\pi(s_t, a_t)}\left[\sum_{t'=t}^{H} \gamma^{t'-t} r(s_{t'}, a_{t'})\right].$$

The Q function of a policy $\pi_k$ always implies a greedy policy $\pi_{k+1}$ that is better than or equal to the evaluation target policy $\pi_k$. Therefore, an optimal policy can be found by iteratively evaluating the Q function of each new greedy policy $\pi_{k+1}$ until convergence:

$$Q^{\pi_k} = \lim_{n \to \infty} \mathcal{B}_{\pi_k}^{\,n} Q^{\pi_{k-1}} \quad \text{(policy evaluation)}, \qquad \pi_{k+1}(a \mid s) = \arg\max_{a} Q^{\pi_k}(s, a) \quad \text{(policy improvement)},$$

where $\mathcal{B}_\pi$ is the Bellman operator, which has the ground-truth $Q^\pi$ as its unique fixed point (Lagoudakis & Parr, 2003). This algorithm is called policy iteration.
Although the convergence of the algorithm is restricted to the scenario where the unique fixed point is reachable (Sutton & Barto, 2018), policy iteration has been widely used as a backbone for most offline RL algorithms due to its extensibility to off-policy data; policy evaluation can be done with off-policy data using bootstrapping. When the value function and the policy are represented with parameters $\theta$ and $\phi$ respectively, the policy iteration algorithm with bootstrapping has the following form:

$$\begin{cases} \theta_{k+1} \leftarrow \arg\min_{\theta}\ \mathbb{E}_{s,a,r,s' \sim D}\left[ d\Big( Q_\theta(s, a),\ r + \gamma\, \mathbb{E}_{a' \sim \pi_k(a' \mid s')}\big[ Q_k(s', a') \big] \Big) \right] \\ \phi_{k+1} \leftarrow \arg\max_{\phi}\ \mathbb{E}_{s \sim D,\ a \sim \pi_\phi(a \mid s)}\big[ Q_{k+1}(s, a) \big] \end{cases} \tag{3}$$

where $k$ is an update step, $d$ is a distance metric such as the squared $\ell_2$ or Huber loss, and $D$ is the provided (offline) dataset of transition tuples, $D = \{(s, a, r, s')\}$. For brevity of notation, we write $Q_k := Q_{\theta_k}$ and $\pi_k := \pi_{\phi_k}$.
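To make the alternation between policy evaluation and policy improvement concrete, the following is a minimal tabular sketch. It is not the sampled, function-approximation form of Eq. 3: it assumes a small MDP with known transition tensor `T` and reward table `r`, and the toy two-state MDP, the function name `policy_iteration`, and the sweep counts are all illustrative assumptions rather than part of the proposed method.

```python
import numpy as np

def policy_iteration(T, r, gamma=0.9, n_eval_sweeps=200, max_iters=100):
    """Tabular policy iteration on a known MDP (illustrative only).

    T: (S, A, S) transition probabilities, r: (S, A) rewards.
    """
    n_states, n_actions, _ = T.shape
    policy = np.zeros(n_states, dtype=int)
    q = np.zeros((n_states, n_actions))
    for _ in range(max_iters):
        # Policy evaluation: repeatedly apply the Bellman operator of the current policy.
        for _ in range(n_eval_sweeps):
            v = q[np.arange(n_states), policy]   # V(s') = Q(s', pi(s'))
            q = r + gamma * T @ v                # (S, A, S) @ (S,) -> (S, A)
        # Policy improvement: act greedily with respect to the evaluated Q.
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, q

# Hypothetical two-state MDP: action 1 always moves to state 1, which pays reward 1.
T = np.zeros((2, 2, 2))
T[:, 0, 0] = 1.0   # action 0 always leads to state 0
T[:, 1, 1] = 1.0   # action 1 always leads to state 1
r = np.array([[0.0, 0.0], [1.0, 1.0]])  # reward depends only on the current state
policy, q = policy_iteration(T, r)
```

In the offline setting, the exact expectation over next states is replaced by samples from $D$, which is where the bootstrapping errors discussed in the next section enter.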

4. METHOD

In this paper, we consider offline RL algorithms that use the policy iteration scheme shown in Eq. 3, which covers several offline RL methods (Kumar et al., 2020; 2019; Wu et al., 2019). In this family of methods, the correctness of the value function becomes the major concern, since policy evaluation could diverge due to the data restriction imposed by the offline setup. Specifically, divergence can happen because of bootstrapping and function approximation: when some estimates are erroneously high due to poor generalization, the over-estimated values are likely to be picked up in the policy improvement step and fed back to policy evaluation via bootstrapping, completing a vicious cycle that causes training to diverge. Therefore, offline RL methods focus on solving the overestimation problem with different regularization methods, such as policy constraints (Kumar et al., 2019; Wu et al., 2019) or pessimism (Kumar et al., 2020; Goo & Niekum, 2021).

One straightforward solution to the overestimation problem is to directly penalize the Q estimate (Rezaeifar et al., 2021; Dadashi et al., 2021) with a penalty function $p(s, a)$:

$$\tilde{Q}_\theta(s, a) = Q_\theta(s, a) - p(s, a).$$

This can be a solution because the penalty function, if chosen carefully, can reduce erroneously high values and prevent them from being propagated via bootstrapping. We refer to this family of algorithms as direct Q-penalization (DQP). In DQP, we can easily observe that the penalty function that equals the oracle Q estimation error (i.e., $p(s, a) = Q_\theta(s, a) - Q^{\pi_\phi}(s, a)$) is the best solution, since subtracting it recovers the true value exactly. Therefore, we want to design a penalty function that resembles the oracle error.
Since the estimation error is likely to occur more often for out-of-distribution state-action pairs $(s, a') \notin D$, a few ad-hoc methods have been proposed to measure the estimation error or epistemic uncertainty of $Q_\theta$: the aleatoric uncertainty of the transition dynamics model has been suggested as a proxy for the epistemic uncertainty of $Q_\theta$ (Yu et al., 2020), and generative models (Rezaeifar et al., 2021) or pseudometrics (Dadashi et al., 2021) have been proposed to distinguish whether a particular $(s, a)$ is in-distribution or not. However, it has not been thoroughly investigated how penalty functions affect the policy iteration process, nor what penalty functions are best for offline RL. Without answers to these questions, DQP methods can only be understood as ad-hoc methods in which heuristically designed penalty functions are used to prevent overestimation.

DQP provides a unified way to represent different offline RL algorithms in terms of penalty functions. However, for this unified perspective to provide a better understanding of offline RL algorithms, and thereby be useful for developing better ones, it is necessary to investigate the effect of the penalty function under the policy iteration scheme. To this end, we address the following questions: (1) What is the effect of direct Q-penalization in the context of the policy iteration framework? (2) How can we design an appropriate penalty function based on this analysis? (3) How can we instantiate the penalty function and achieve strong performance across different offline RL datasets?

4.1. THEORETIC BACKGROUND ON DIRECT Q-PENALIZATION

We describe a theorem that answers the first question: soft policy iteration (Haarnoja et al., 2018) with a penalized value function $\tilde{Q} = Q - p$ is equivalent to policy iteration regularized by $D_{\mathrm{KL}}\big(\pi(s) \,\|\, \pi_p(s)\big)$, where $\pi_p(a \mid s) := \mathrm{softmax}\big(-p(s, a)\big)$.
This theorem generalizes the one shown by Rezaeifar et al. (2021) and does not require unnecessary assumptions on the penalty function.

Theorem 1 (Equivalence between KL-policy regularization and DQP). The following two algorithms are equivalent.

Policy iteration w/ KL-policy regularization:

$$\begin{cases} \theta_{k+1} \leftarrow \arg\min_{\theta}\ \mathbb{E}_{s,a,r,s' \sim D}\left[ d\Big( Q_\theta(s, a),\ r + \gamma \langle \pi_k, Q_k \rangle(s') - \gamma D_{\mathrm{KL}}\big(\pi_k(s') \,\|\, \pi_p(s')\big) \Big) \right] \\ \phi_{k+1} \leftarrow \arg\max_{\phi}\ \mathbb{E}_{s \sim D}\left[ \langle \pi_\phi, Q_{k+1} \rangle(s) - D_{\mathrm{KL}}\big(\pi_\phi(s) \,\|\, \pi_p(s)\big) \right] \end{cases}$$

Soft policy iteration (Haarnoja et al., 2018) w/ penalty:

$$\begin{cases} \theta_{k+1} \leftarrow \arg\min_{\theta}\ \mathbb{E}_{s,a,r,s' \sim D}\left[ d\Big( Q_\theta(s, a),\ r + \gamma \big( \langle \pi_k, Q_k - p \rangle(s') - Z(s') + \mathcal{H}\big(\pi_k(s')\big) \big) \Big) \right] \\ \phi_{k+1} \leftarrow \arg\max_{\phi}\ \mathbb{E}_{s \sim D}\left[ \langle \pi_\phi, Q_{k+1} - p \rangle(s) + \mathcal{H}\big(\pi_\phi(s)\big) \right] \end{cases}$$

where $d$ is a distance metric, $\langle u_1, u_2 \rangle := \sum_a u_1(\cdot, a)\, u_2(\cdot, a)$, and $Z(s) = \ln \sum_a \exp\big(-p(s, a)\big)$.

Proof. The common term in KL-policy regularization can be rearranged as follows, using $\ln \pi_p(a \mid s) = -p(s, a) - Z(s)$:

$$\begin{aligned} \langle \pi, Q \rangle(s) - D_{\mathrm{KL}}\big(\pi(s) \,\|\, \pi_p(s)\big) &= \langle \pi, Q \rangle(s) - \langle \pi, \ln \pi - \ln \pi_p \rangle(s) \\ &= \langle \pi, Q + \ln \pi_p \rangle(s) - \langle \pi, \ln \pi \rangle(s) \\ &= \langle \pi, Q - p \rangle(s) - Z(s) + \mathcal{H}\big(\pi(s)\big). \end{aligned}$$

Then, we obtain the equivalence by replacing the term with the rearranged term. Note that the normalization term $Z(s)$ can be dropped in the policy update step since $Z(s)$ is not a function of $\phi$. Also, we can safely ignore $Z(s)$ in the policy evaluation step when $|Z(s) - Z(s')| < \epsilon$ for any pair $(s, s') \in S \times S$, because it then does not affect the policy improvement step.

The theorem is straightforward since it describes a special case of regularized policy iteration (Geist et al., 2019), namely KL-control (Peters et al., 2010; Schulman et al., 2015), in which the policy is regularized through the KL-divergence with respect to another policy. Yet, the theorem shows a way to interpret any penalty function from the point of view of KL-control and vice versa. Therefore, by using the theorem, we can have a unified view of previously disparate offline RL algorithms.
In Table 1, we compare different offline RL algorithms in terms of the penalty function that each algorithm uses.
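The rearrangement used in the proof of Theorem 1 can be checked numerically for a single state with a discrete action set. The sketch below is only a sanity check under assumed inputs: the random values for $Q$, $p$, and $\pi$, and the two helper function names, are illustrative and do not come from the paper.

```python
import numpy as np

def kl_regularized_objective(pi, q, p):
    """<pi, Q> - KL(pi || pi_p), with pi_p = softmax(-p)."""
    log_pi_p = -p - np.log(np.sum(np.exp(-p)))
    return np.sum(pi * q) - np.sum(pi * (np.log(pi) - log_pi_p))

def penalized_soft_objective(pi, q, p):
    """<pi, Q - p> - Z + H(pi), with Z = ln sum_a exp(-p(a))."""
    z = np.log(np.sum(np.exp(-p)))
    entropy = -np.sum(pi * np.log(pi))
    return np.sum(pi * (q - p)) - z + entropy

rng = np.random.default_rng(0)
q = rng.normal(size=5)                 # arbitrary Q values for five actions
p = rng.uniform(0.0, 3.0, size=5)      # arbitrary finite penalties
logits = rng.normal(size=5)
pi = np.exp(logits) / np.exp(logits).sum()   # arbitrary strictly positive policy

lhs = kl_regularized_objective(pi, q, p)
rhs = penalized_soft_objective(pi, q, p)
```

The two quantities agree up to floating-point error for any finite penalty, which is the content of the identity; infinite penalties require the usual convention $0 \cdot \infty = 0$ and are not covered by this check.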

4.2. WHAT MAKES A GOOD PENALTY FUNCTION?

Theorem 1 shows the connection between a penalty function and its effect as a policy regularizer, and can help to guide the construction of an effective, principled penalty function. Specifically, we propose a penalty function that can instantiate the support set constraint (Kumar et al., 2019; Liu et al., 2020), which restricts the action space of a trained policy to be in the support set of the behavior policy $\beta(a \mid s)$. The support set constraint is an effective way to solve the offline RL problem in that the suboptimality caused by the constraint is bounded (Kumar et al., 2019; Liu et al., 2020). While previous works express the constraint in terms of the distribution-constrained Bellman operator, we represent the constraint via a penalty function since it allows us to compare different offline RL algorithms under the same viewpoint, helping to analyze the possible failures of these algorithms. The following penalty function instantiates the support set constraint:

$$p(s, a) = \begin{cases} 0 & \text{for } \{(s, a) \mid \beta(a \mid s) \ge \epsilon\} \\ \infty & \text{otherwise} \end{cases} \tag{4}$$

where $\epsilon$ is a threshold hyperparameter that decides whether $(s, a)$ is considered out-of-support or not. The penalty function carries the same effect as the filtration operator in (Liu et al., 2020) under the policy iteration scheme. This is because the function prevents out-of-support actions from being chosen by the policy while it imposes no preference over in-support actions; a rare action that has not occurred often in a dataset can be selected as long as it provides a high Q value. This indifference is a desirable property when good trajectories compose only a small portion of a dataset, since good actions could be drowned out by more frequent actions if the penalty function is designed to prefer more frequent actions. We can also confirm the characteristics of the penalty function by observing the flip side: the penalty-induced policy $\pi_p$ and the KL-constraint $D_{\mathrm{KL}}(\pi \,\|\, \pi_p)$.
Since the reverse KL term makes $\pi$ seek a mode of $\pi_p$, which is the uniform distribution over the in-support actions, the policy $\pi$ is guided to select one of the actions in the support set while there is no preference over actions in the set. Therefore, the penalty function instantiates the support set constraint.

Given the proposed penalty function, the similarity between different offline RL methods can be observed. For instance, we can see that BRAC-KL and CQL penalize out-of-support actions infinitely: when $\beta(a \mid s)$ is zero, the penalty becomes infinite. However, some discrepancies can also be noted, and these provide hints on how and why other methods could fail due to excessive or insufficient pessimism. First, BRAC-KL could fail because it prefers actions that are more frequently executed by the behavior policy. This can be easily seen from the KL-policy regularization perspective. Since $\pi_p = \beta$ in BRAC-KL, the penalty makes a policy seek the mode of $\beta(a \mid s)$ when the KL regularization term dominates the policy update. Therefore, the algorithm could work like behavior cloning and disregard rare but good actions in the provided dataset. CQL could exhibit a similar problem since its penalty function is defined with $\beta$. As in BRAC-KL, when the policy $\mu_k$ implied by the value function selects an action that is infrequent in $\beta$, the action is harshly penalized. Especially when CQL is tuned to strongly penalize out-of-support actions (i.e., $\alpha_k$ is large), it could force the policy to mimic the dataset (Kumar et al., 2021). Similarly, TD3+BC and Diffusion RL (Wang et al., 2022) could suffer from the same problem when the penalty strength $\alpha$ is not properly tuned.
Another common problem in other penalty functions is that they replace $\beta(a \mid s)$ with a proxy; for example, a conditional variational autoencoder (CVAE) (Rezaeifar et al., 2021), a pseudometric (Dadashi et al., 2021), or a transition dynamics model (Yu et al., 2020) is estimated instead of the behavior policy $\beta$, and penalty functions are designed heuristically with the proxy estimates. While such formulations could show some positive correlation with the suggested penalty function, there is no clear connection that allows us to interpret the penalty in terms of $\beta$ or the support set. BEAR, on the other hand, is designed to implement the support set constraint as ours is, so it could avoid the problem of BRAC-KL and CQL, which prefer more frequent actions. However, BEAR could be inaccurate since it instantiates the constraint without explicitly modeling the behavior policy $\beta$. In particular, it resorts to the maximum mean discrepancy (MMD) distance since this can be computed using only samples from a dataset, but the use of this distance metric is only empirically justified (Kumar et al., 2019). In contrast, we directly instantiate the support set constraint by explicitly modeling the behavior policy.
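The behavior of the support-set penalty discussed in this section can be illustrated on a single state with four discrete actions. This is a toy sketch: the behavior probabilities, Q values, threshold, and helper names below are made-up assumptions chosen to show a rare but in-support action being preferred over an unseen action with an erroneously high estimate.

```python
import numpy as np

def support_penalty(beta, eps):
    """Zero penalty for in-support actions (beta >= eps), infinite otherwise."""
    return np.where(beta >= eps, 0.0, np.inf)

def penalty_induced_policy(penalty):
    """pi_p = softmax(-p); exp(-inf) evaluates to exactly 0."""
    weights = np.exp(-penalty)
    return weights / weights.sum()

# Hypothetical single-state example.
beta = np.array([0.70, 0.25, 0.05, 0.00])  # last action never appears in the data
q = np.array([1.0, 0.5, 3.0, 10.0])        # last value is an unchecked overestimate

penalty = support_penalty(beta, eps=0.01)
pi_p = penalty_induced_policy(penalty)        # uniform over the three in-support actions
greedy_action = int(np.argmax(q - penalty))   # greedy action under the penalized Q
```

Here the penalized greedy action is the rare action with $\beta = 0.05$, whereas a penalty proportional to $-\ln \beta$ would additionally bias the choice toward the most frequent action.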

4.3. PRACTICAL IMPLEMENTATION

We now propose a practical algorithm that instantiates the penalty function designed above. Essentially, the designed penalty function serves to determine whether an action $a$ at a certain state $s$ is likely to be executed by the behavior policy $\beta$. Therefore, we can implement the penalty function simply by cloning the behavior policy explicitly and checking the likelihood $\beta(a \mid s)$ with the cloned model. There have been other research works that have tried to model a behavior policy using generative models, such as variational auto-encoders (VAEs) (Fujimoto et al., 2019; Rezaeifar et al., 2021). However, the performance of these approaches is limited compared to methods that do not explicitly clone the behavior policy $\beta$. We presume that the reason for this failure is the limited expressivity of the generative model; since the behavior policy $\beta$ can be complex, discontinuous, and multi-modal, only a very expressive model can successfully model the policy. To this end, we chose to use a score-based generative model (Song & Ermon, 2019; 2020; Song et al., 2021), which has recently shown great success in generating high-quality images. Furthermore, the score-based generative model allows an exact likelihood computation, which is essential in instantiating the penalty function. We briefly examine the ability of the score-based generative model using four discontinuous multimodal distributions, and the results are shown in Figure A.1. In all four cases, the inferred probability distribution is very sharp, and its log probability resembles the penalty function we proposed. In the score-based generative model, a target distribution $p(x)$ is indirectly expressed and trained in the form of the gradient of a log probability density function $\nabla_x \log p(x)$, often referred to as the (Stein) score function (Liu et al., 2016).

Table 1 (rows recovered from the source; the method name of the first row is missing):

Method                 Penalty p(s, a)                  Notes
(name not recovered)   Q_k(s, a) exp(-α 2 D(s, a))      D is a pseudometric.
MOPO                   α |Σ(s, a)|                      Σ is the std. of the trained T.
CQL                    α_k [μ_k / β - 1]                μ_k is the soft policy given Q_k.
When we approximate the score function of a behavior policy $\beta(a \mid s)$ accurately via a score-matching algorithm (i.e., $s_\psi(a \mid s) \approx \nabla_a \log \beta(a \mid s)$) (Song & Ermon, 2019), we can instantiate the penalty function with a hyperparameter $\epsilon$ that decides whether a certain action $a$ given $s$ is considered to be sufficiently in-support or not. While this formulation allows direct instantiation of the penalty function that can be plugged into the DQP framework, it is impractical for two reasons: first, we need to run an iterative algorithm to compute the log-likelihood from the score function, which is computationally prohibitive, and second, it could hurt the generalization performance of the value function since $Q$ has to output an extremely wide range of values, including negative infinity. Therefore, we propose a practical approximation of the policy iteration algorithm that uses the proposed penalty function.

The key observation is that a policy trained on top of the penalized value function will never select out-of-support actions due to the penalization. This allows two modifications to the original policy iteration algorithm. First, we can perform policy evaluation considering only in-support state-action pairs; i.e., we can bootstrap from one of the samples from $\beta(a \mid s)$, and we can skip the penalty computation since the penalty is zero for in-support data points under the suggested penalty function. While sampling with the score function also requires expensive iterative computation, we can greatly reduce the computation by prepopulating samples for states that exist in the dataset and reusing them in the policy evaluation step. Specifically, policy evaluation is done with the following loss function:

$$L_{\text{policy-eval}} = \mathbb{E}_{s,a,r,s' \sim D}\left[ d\Big( Q_\theta(s, a),\ r + \gamma \operatorname*{K\text{-}th}_{a' \in \widehat{\mathrm{supp}}(\beta(s'))} Q_{\bar{\theta}}(s', a') \Big) \right] \tag{5}$$

where $Q_{\bar{\theta}}$ is a slowly updated target network, $\widehat{\mathrm{supp}}(\beta)$ is a set of samples approximating $\mathrm{supp}(\beta)$, and K-th is an operator that selects the K-th item among the candidates.
When $K = 1$, it becomes the max operator. Both the target network $Q_{\bar{\theta}}$ and the K-th operator are adopted to stabilize learning. Second, we can skip the policy improvement step since policy evaluation is done with pre-generated samples and does not depend on any parameterized policy. Instead, we can define an implicit policy using the last $Q_\theta$ and $s_\psi$:

$$\pi(a \mid s) = \frac{\exp\big(\alpha Q_\theta(s, a)\big)}{\sum_{a' \in \widehat{\mathrm{supp}}(\beta)} \exp\big(\alpha Q_\theta(s, a')\big)} \quad \text{or} \quad \frac{\exp\big(\alpha A(s, a)\big)}{\sum_{a' \in \widehat{\mathrm{supp}}(\beta)} \exp\big(\alpha A(s, a')\big)} \tag{6}$$

where $A(s, a) = Q_\theta(s, a) - \frac{1}{|\widehat{\mathrm{supp}}(\beta)|} \sum_{a' \in \widehat{\mathrm{supp}}(\beta)} Q_\theta(s, a')$ is an advantage function, and $\alpha$ is a temperature parameter that controls the policy softness with regard to $Q$ or $A$; when $\alpha$ is zero or infinity, the resulting policy becomes $\beta(a \mid s)$ or greedy with regard to $Q_\theta$, respectively. To sample an action from the policy $\pi$, we can sample one action from an empirical action distribution consisting of samples generated from $\beta$ on the fly. Alternatively, we can also train a parameterized policy $\pi_\phi(a \mid s)$ using advantage-weighted regression (AWR) (Neumann & Peters, 2008; Peng et al., 2019) with the advantage function $A(s, a)$.

The resulting algorithm can be regarded as a special type of Q-learning in which we restrict the domain of the maximum operator to in-support actions. We refer to this algorithm as Action-Restricted Q-learning (ARQ). The pseudocode of ARQ is shown in Algorithm 1. ARQ can also be seen as an extension of MBS-QI (Liu et al., 2020) in that it makes the existing algorithm applicable to MDPs with a continuous action space. Note that ARQ is only one way to instantiate the penalty function, chosen mainly because of the expensive computational cost of the score-based generative model. We discuss other possible instantiations in the discussion section.
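The two ingredients above, the K-th bootstrapped target of Eq. 5 and the implicit softmax policy of Eq. 6, can be sketched with NumPy as follows. This is an illustration under assumptions, not the released implementation: the Q values of the candidate actions are taken as a precomputed array with one row per state, "K-th" is read as the K-th largest value (so that $K = 1$ is the max), terminal-state handling is omitted, and all function names are hypothetical.

```python
import numpy as np

def kth_largest(q_values, k):
    """K-th largest entry along the last axis; k=1 recovers the max."""
    return np.sort(q_values, axis=-1)[..., -k]

def bootstrapped_target(reward, next_q_values, gamma, k):
    """r + gamma * K-th_{a'} Q(s', a') over pre-sampled in-support candidates."""
    return reward + gamma * kth_largest(next_q_values, k)

def implicit_policy(q_values, alpha):
    """Softmax over candidate actions using advantages (second form of Eq. 6)."""
    adv = q_values - q_values.mean(axis=-1, keepdims=True)
    logits = alpha * adv
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

# One transition with three candidate actions sampled from the cloned behavior model.
next_q = np.array([[1.0, 3.0, 2.0]])   # target-network Q values of the candidates
reward = np.array([0.5])

target_max = bootstrapped_target(reward, next_q, gamma=0.9, k=1)     # uses Q = 3.0
target_second = bootstrapped_target(reward, next_q, gamma=0.9, k=2)  # uses Q = 2.0
uniform_probs = implicit_policy(next_q, alpha=0.0)   # uniform over the candidates
sharp_probs = implicit_policy(next_q, alpha=50.0)    # close to greedy
```

With $\alpha = 0$ the policy is uniform over the candidates, which approximates $\beta$ only because the candidates themselves were sampled from the cloned behavior model. Choosing $K > 1$ gives a more conservative target than the max.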
Algorithm 1: Action-Restricted Q-learning (ARQ)
Input: Dataset D = {(s, a, r, s')}, hyperparameters N, ϵ, K, α
Initialize s_ψ(a|s), Q_θ(s, a), and π_ϕ(a|s) (if needed)
Train s_ψ with a score matching algorithm (Song et al., 2021)
Sample N in-support actions for each s ∈ D (i.e., β_ψ(a|s) > ϵ)
while not converged do
    Update θ with ∇_θ L_policy-eval (Eq. 5)
while not converged do    (AWR; only if π_ϕ is needed)
    Update ϕ with ∇_ϕ ( -E_{s,a∼D} [ e^{αA(s,a)} log π_ϕ(a|s) ] )
return s_ψ, Q_θ, and π_ϕ
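The optional AWR step in Algorithm 1 minimizes an advantage-weighted negative log-likelihood. The sketch below computes that loss for a batch, assuming the log-probabilities $\log \pi_\phi(a \mid s)$ and advantages $A(s, a)$ are already available as arrays. The weight clipping via `max_weight` is a common stabilization trick added here as an assumption; it does not appear in Algorithm 1.

```python
import numpy as np

def awr_loss(log_probs, advantages, alpha, max_weight=100.0):
    """-E[ exp(alpha * A(s, a)) * log pi(a | s) ] over a batch.

    max_weight clips the exponential weights (an assumption, not part of Algorithm 1).
    """
    weights = np.minimum(np.exp(alpha * advantages), max_weight)
    return -np.mean(weights * log_probs)

# Hypothetical batch of two dataset actions with equal log-probability.
log_probs = np.array([-1.0, -1.0])
advantages = np.array([0.0, np.log(2.0)])   # second action gets weight 2
loss = awr_loss(log_probs, advantages, alpha=1.0)
bc_loss = awr_loss(log_probs, advantages, alpha=0.0)  # alpha = 0 reduces to plain BC
```

In practice the gradient with respect to $\phi$ would be taken with an automatic differentiation library; the weights are treated as constants in that gradient.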

5. EXPERIMENTS

Our empirical goal is to design an algorithm that enjoys the strengths of both explicit behavior cloning and value learning. Therefore, the main goal of the experiments is to check whether the proposed algorithm ARQ achieves competitive performance on different types of datasets, ranging from datasets that consist of near-optimal data, in which explicitly cloning a behavior is adequate, to datasets containing various suboptimal trajectories, in which learning a value function is necessary. The implementation of ARQ consists of four steps: learning the score-based generative model $s_\psi$, sampling, Q-learning of $Q_\theta$, and optional training of an explicit policy $\pi_\phi$. As for the hyperparameters, we tune $K$ and $\alpha$ for each group of datasets using random search, while we use $N = 30$ and $\epsilon = e^{-5}$ (i.e., 30 samples are generated, and a sample is dropped if its likelihood is lower than $\epsilon = e^{-5}$) across all the datasets tested. For implementation details and hyperparameters, please refer to Appendix A.2 or the provided code.

The proposed method is evaluated on various simulated benchmark datasets, from simple low-dimensional locomotion tasks to complex contact-rich manipulation tasks. Specifically, we use locomotion (Brockman et al., 2016), Adroit (Rajeswaran et al., 2018), Kitchen (Gupta et al., 2019), and Antmaze tasks in D4RL (Fu et al., 2020), and six manipulation tasks in Robomimic (Mandlekar et al., 2021). We use the medium-replay, medium, expert, and medium-expert datasets of the locomotion tasks. We use the machine-generated (mg.), proficient-human (ph.), and multi-human (mh.) datasets of Robomimic, which consist of, respectively, a replay buffer of an SAC training run, trajectories of a proficient human demonstrator, and trajectories of multiple human demonstrators with different levels of proficiency.
We compare the proposed method to behavior cloning baselines, specifically ordinary BC and implicit behavior cloning (Florence et al., 2021), and to prior state-of-the-art offline RL methods: TD3+BC (Fujimoto & Gu, 2021), Decision Transformer (DT) (Chen et al., 2021), One-step RL (Brandfonbrener et al., 2021), CQL (Kumar et al., 2020), and IQL (Kostrikov et al., 2021b). The aggregated results are displayed in Table 2. The proposed algorithm ARQ shows competitive performance on all ranges of datasets, from near-optimal ones in which simple behavior cloning is sufficient, to suboptimal datasets in which value learning is necessary. Also, ARQ exhibits state-of-the-art performance on complex and contact-rich tasks, such as the Adroit, Kitchen, and Robomimic datasets. The results indicate the practical effectiveness of the proposed algorithm, as well as the advantage of performing behavior cloning explicitly with high-fidelity models.

To examine the importance of each component in ARQ, we run two ablations: we evaluate an implicit policy defined only with $s_\psi$ without any value function, and an implicit policy that incorporates $Q^\beta$, the value function of the behavior policy, instead of ARQ. The results of the ablations affirm our hypotheses. First, when the dataset is near-optimal (e.g., the adroit-human or proficient-human datasets), explicitly modeling the behavior policy alone can address the problem, and similar performance is obtained when we use ARQ. Next, we confirm the necessity of value learning and the ability of ARQ to leverage the explicitly cloned behavior model in learning a value function. Especially in the tasks where trajectory stitching is required (e.g., the kitchen and antmaze datasets), we can see a performance improvement from $s_\psi$ and $Q(\beta) + s_\psi$ to ARQ $+ s_\psi$, and we achieve state-of-the-art performance with the help of the explicit models.
Table 2: Aggregated performance of prior methods, ours, and two ablations (s_ψ and Q(β) + s_ψ) on D4RL (Fu et al., 2020) and Robomimic (Mandlekar et al., 2021) datasets. Each number represents the mean relative performance over 100 episodes. 0 and 100 represent the performance of a random and an expert policy, respectively. Unless noted as (ours) or (repro.), all the numbers are borrowed from Kostrikov et al. (2021b), Fujimoto & Gu (2021), and Mandlekar et al. (2021). The numbers generated by us are averaged over 3 different random seeds. We run IQL on Robomimic ourselves using the author-provided implementation.

It is also noteworthy that the ablated method with Q^β shows competitive results on a large number of benchmarks. Echoing prior research (Goo & Niekum, 2021; Brandfonbrener et al., 2021), these results indicate that the vast majority of offline RL benchmarks can be solved without iterative value learning, while most offline RL algorithms tackle problems that arise from it. Therefore, in order to evaluate offline RL algorithms fairly and thereby foster the advance of the field, it is essential to focus on environments that require value learning (e.g., antmaze) or to develop new benchmarks.
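The relative performance scale described in the caption of Table 2 (0 for a random policy, 100 for an expert policy) corresponds to a simple linear rescaling of the raw return. The following is a minimal sketch of that rescaling; the function name and the reference returns in the usage lines are hypothetical and are not taken from the paper or from any benchmark.

```python
def normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw return so that a random policy maps to 0 and an expert to 100.

    All three arguments are plain floats; the reference returns are
    assumed to be supplied by the benchmark (hypothetical values below).
    """
    if expert_return == random_return:
        raise ValueError("expert and random reference returns must differ")
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, for illustration only.
RANDOM_RETURN = 10.0
EXPERT_RETURN = 110.0
halfway = normalized_score(60.0, RANDOM_RETURN, EXPERT_RETURN)
```

Under this scaling, a score above 100 simply means that the evaluated policy collected a higher return than the expert reference.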

6. DISCUSSION

We investigate an offline RL algorithm that combines explicit behavior cloning and value learning. We provide a theoretical framework, DQP, which enables various offline RL algorithms to be expressed in terms of different penalty functions, and we derive a principled penalty function that can leverage a behavior cloning model. We then provide a practical algorithm, ARQ, which realizes the derived penalty function, and we implement it with a powerful generative model to realize the full potential of ARQ. As a result, the proposed algorithm shows competitive results on most of the D4RL and Robomimic benchmarks and yields state-of-the-art results on several tasks. This indicates that the common presumption that it is unnecessary or infeasible to estimate a behavior policy in offline RL is likely incorrect. Explicitly cloning a behavior policy, which has been avoided because of the performance limitations that can arise from inaccurate modeling of the policy, can in fact be advantageous.

The major drawback of the proposed algorithm is its computationally expensive sampling procedure. While sampling needs to be performed only once, before the value learning step, it can take several hours to generate the samples (90K samples are generated per hour on our in-house workstation with an Nvidia GTX 1080 Ti). Future research may therefore examine how to reduce the computational burden, for instance, by using different generative models for behavior cloning. Alternatively, it may be possible to devise a method that directly utilizes the score function under the actor-critic framework: since the actor update step with the penalized Q function Q - p only requires the gradient of p, not an exact penalty value, the computational bottleneck may be avoided if the gradient of the penalty function can be computed directly from the score function.



Please see the attached supplementary files. The code will be released publicly upon publication.



The penalty functions of different offline RL algorithms.

Table A.3: Dataset-specific hyperparameters used in the implicit policy and the explicit policy π_ϕ.

We display the full experiment results in Table A.4.

Table A.4: Performance of prior methods and ours on D4RL (Fu et al., 2020) and Robomimic (Mandlekar et al., 2021) datasets. Each number represents the performance relative to a random policy as 0 and an expert policy as 100. Unless noted as (ours) or (repro.), all the numbers are borrowed from Kostrikov et al. (2021b), Fujimoto & Gu (2021), and Mandlekar et al. (2021). The numbers generated by us are averaged over 3 different random seeds. The standard deviations of multiple runs are also displayed.

A APPENDIX

A.1 SCORE-BASED GENERATIVE MODEL (SONG & ERMON, 2019; 2020; SONG ET AL., 2021)

In the score-based generative model, a target distribution p(x) is expressed and trained indirectly, in the form of the gradient of the log probability density function ∇_x log p(x), often referred to as the (Stein) score function (Liu et al., 2016). This method circumvents several problems of other generative models. First, it avoids inferring the normalizing constant, a problem that arises in likelihood-based methods (Kingma & Welling, 2013; LeCun et al., 2006). Second, the score function can be trained via score-matching algorithms (Hyvärinen & Dayan, 2005; Vincent, 2011) without the training instability that arises in adversarial training (Goodfellow et al., 2014); the score-matching algorithm minimizes the gap between the ground-truth score function and its estimate. Since the objective is essentially regression with an ℓ2 loss, the loss function does not require any assumptions on the parameterized function s_ψ, unlike the energy-based model (LeCun et al., 2006), in which strong regularization is often required for stable training. These advantages make it possible to model a complex behavior policy β with the high fidelity that offline RL requires.

We briefly examine the ability of the score-based generative model using four discontinuous multi-modal distributions, and the results are shown in Figure A.1. In all four cases, the inferred probability distribution is very sharp, and its log probability resembles the penalty function we proposed.
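For reference, the score-matching regression described above is commonly written as follows. This is the standard textbook form (explicit score matching and its denoising variant from Vincent, 2011), given here as an assumption about the general shape of the objective rather than as the exact equation of this paper; the factor 1/2, the noise kernel q_σ, and the noise level σ are conventions introduced for illustration.

```latex
% Explicit score matching: regress s_\psi onto the ground-truth score.
\mathcal{L}_{\mathrm{SM}}(\psi)
  = \tfrac{1}{2}\,\mathbb{E}_{x \sim p(x)}
    \Big[ \big\| s_\psi(x) - \nabla_x \log p(x) \big\|_2^2 \Big]

% Denoising score matching: the intractable score of p is replaced by the
% score of a Gaussian perturbation kernel
% q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x,\, \sigma^2 I).
\mathcal{L}_{\mathrm{DSM}}(\psi)
  = \tfrac{1}{2}\,\mathbb{E}_{x \sim p(x),\; \tilde{x} \sim q_\sigma(\tilde{x} \mid x)}
    \Big[ \big\| s_\psi(\tilde{x})
      - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \big\|_2^2 \Big],
\qquad
\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = -\frac{\tilde{x} - x}{\sigma^2}
```

The denoising form is the one that makes the objective a plain ℓ2 regression with a closed-form target, which is why no structural assumption on s_ψ is needed.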

A.2 IMPLEMENTATION DETAILS

The implementation of ARQ consists of four steps: learning the score-based generative model s_ψ, sampling, Q-learning, and optional training of an explicit policy π_ϕ.

For the score-based generative model, sampling, and likelihood computation, we generally follow the implementation of Song et al. (2021), which trains a time-dependent neural network for the reverse-time stochastic differential equation (SDE); the network learns to reverse the progressive diffusion process that turns a data point into random noise. Specifically, we use a variance-preserving SDE (VP SDE) with a neural network having residual connections (He et al., 2015) with an embedding size of 256, and we stack 3 residual blocks. We use swish as the nonlinearity. Since our target distribution is conditional (i.e., conditioned on a state), unlike the original formulation, we extend the time-dependent neural network to be a function of a state, an action, and time. We train the network with Adam with a learning rate of 1e-4 and a batch size of 512, and we apply an exponential moving average with a coefficient of 0.999. Since the number of training samples varies across datasets, we tested different numbers of training iterations and ensemble sizes to prevent overfitting. For training iterations, we tested 150,000, 300,000, and 1 million steps, and for ensembling, we tried a single model and an ensemble of 3 models. For ensemble training, we train each model with different random samples from the same data pool. The dataset-specific hyperparameters are shown in Table A.1.

The second step of the ARQ algorithm is prepopulating samples for value learning. We use a predictor-corrector algorithm (Song et al., 2021) to generate samples by solving a reverse SDE. Again, we follow the implementation of Song et al. (2021), which uses the Euler-Maruyama method as a predictor and Langevin dynamics as a corrector.
Table A.1 rows: Training Iterations: 1,000,000; 300,000; 300,000; 1,000,000; 1,000,000; 150,000. # Ensembles: 1; 1; 3; 1; 3; 3.

We discretize the time domain [1e-3, 1] of s_ψ into 500 steps, and we execute a single corrector step for every predictor step. For Langevin dynamics, we dynamically adjust the noise scale using the norm of the score; we use a signal-to-noise ratio of 0.16, as in Song et al. (2021). Since the suggested penalty function is defined in terms of the probability β(a|s), we need to compute the likelihood of the generated samples using s_ψ. For the likelihood computation, we use the instantaneous change-of-variables formula on top of the ordinary differential equation (ODE) induced from the SDE. To solve the ODE numerically, we use the RK45 algorithm of scipy. For the detailed formulation and algorithmic details regarding the SDE, we refer to the original papers (Song & Ermon, 2019; 2020; Song et al., 2021) or our code. We use ϵ = e^{-5} across all experiments, and we drop any sample whose probability is lower than the threshold ϵ. We generate 30 samples for both ARQ and the implicit policy using s_ψ.

For ARQ training, we parameterize the value function Q_θ with an MLP with 2 layers of 256 hidden units and ReLU activations. We also shape the reward function of the datasets following Kostrikov et al. (2021b): for the locomotion tasks, the reward is normalized by multiplying it by the ratio of returns between the worst and the best trajectories in the dataset, and for the antmaze tasks, the reward is set to -1 except at the goal state. Similarly, we densify the reward function of Robomimic using the same technique as for antmaze. For stability, we train two Q functions and a slowly moving target network with a Polyak coefficient of 0.995. We train for 1 million steps using the Adam optimizer and a batch size of 512.
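The likelihood-based filtering described above (generate N = 30 samples, then drop those whose probability falls below ϵ) can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes that ϵ is read as e^{-5}, that log-likelihoods are already available from the ODE-based computation, and that a sample exactly at the threshold is kept. The function name and array layout are hypothetical.

```python
import numpy as np

# Assumption: the threshold is ϵ = e^{-5}, so the comparison is done in log space.
LOG_EPS = -5.0


def filter_samples(actions, log_likelihoods, log_eps=LOG_EPS):
    """Keep only the generated actions whose log-likelihood is at least log_eps.

    actions:          array of shape (N, action_dim), samples drawn for one state
    log_likelihoods:  array of shape (N,), log β(a|s) estimated for each sample
    """
    actions = np.asarray(actions)
    log_likelihoods = np.asarray(log_likelihoods, dtype=np.float64)
    keep = log_likelihoods >= log_eps  # samples below the threshold are dropped
    return actions[keep], log_likelihoods[keep]


# Illustrative call with made-up numbers: 3 samples of a 2-dimensional action.
example_actions = np.arange(6).reshape(3, 2)
example_log_likelihoods = np.array([-1.0, -7.5, -5.0])
kept_actions, kept_log_likelihoods = filter_samples(
    example_actions, example_log_likelihoods
)
```

Working in log space avoids exponentiating very negative log-likelihoods, which would otherwise underflow to zero for low-density samples.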
We perform a rough random search over a range of values for the hyperparameters.

For the implicit policy that is based on the samples of s_ψ, we first generate 30 samples using s_ψ, and then we resample an action from the categorical distribution that treats the advantages as logits. Similarly, for the explicit policy, we train a policy using weighted behavior cloning (Peng et al., 2019), where the weight is computed using the advantage. For the locomotion, kitchen, and adroit tasks of the D4RL datasets, we use the stochastic policy of Kostrikov et al. (2021b), which predicts the mean µ(a|s) and a state-independent standard deviation σ(a) of a Gaussian distribution. We use a 2-layer MLP having 256 hidden units with ReLU activations. For the antmaze tasks, we use a deterministic policy that omits the standard deviation prediction. Similarly, a deterministic policy is used for the Robomimic datasets, but we use dense ResNet blocks for the parameterization. We stack 4 ResNet blocks, each of which has 2048 embedding dimensions. We tried a Gaussian mixture model (GMM) policy as suggested in Mandlekar et al. (2021), but we could not replicate the reported performance in the BC setting. For D4RL and Robomimic, we train a policy for 1 million and 300,000 steps, respectively, using the Adam optimizer with a learning rate of 3e-4. The key hyperparameter for the policy is the temperature term α. We tested the following values: [0.1, 1.0, 10.0, 30.0], and we display the chosen values in Table A.3 along with the other hyperparameters.
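The resampling step of the implicit policy (draw candidate actions from s_ψ, then pick one from a categorical distribution whose logits are the advantages) can be sketched as below. This is a sketch under stated assumptions: the function name is hypothetical, the advantages are assumed to be computed elsewhere from the learned value function, and multiplying the advantages by the temperature α before the softmax is an assumption about where α enters, not a detail confirmed by the text.

```python
import numpy as np


def resample_action(actions, advantages, alpha=1.0, rng=None):
    """Pick one candidate action from a softmax over (scaled) advantages.

    actions:     array of shape (N, action_dim), candidates sampled from s_ψ
    advantages:  array of shape (N,), advantage estimate of each candidate
    alpha:       temperature; scaling the logits by alpha is an assumption
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = alpha * np.asarray(advantages, dtype=np.float64)
    logits = logits - logits.max()  # shift for numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()
    index = rng.choice(len(probs), p=probs)
    return np.asarray(actions)[index], probs


# Illustrative call: 30 candidates of a 2-dimensional action, one clearly best.
candidate_actions = np.arange(60, dtype=np.float64).reshape(30, 2)
candidate_advantages = np.zeros(30)
candidate_advantages[7] = 1000.0
chosen_action, chosen_probs = resample_action(
    candidate_actions, candidate_advantages, rng=np.random.default_rng(0)
)
```

Because the softmax is invariant to adding a constant to all logits, using Q-values of the candidates in place of advantages would give the same distribution for a fixed state; a larger α concentrates the choice on the highest-advantage candidate.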

