DROP: CONSERVATIVE MODEL-BASED OPTIMIZATION FOR OFFLINE REINFORCEMENT LEARNING

Anonymous

Abstract

In this work, we decouple the iterative (bi-level) offline RL optimization from the offline training phase, forming a non-iterative bi-level paradigm that avoids iterative error propagation over the two levels. Specifically, this non-iterative paradigm allows us to conduct inner-level optimization in training (i.e., employing policy/value regularization) while performing outer-level optimization in testing (i.e., conducting policy inference). Naturally, such a paradigm raises three core questions that are not fully answered by prior non-iterative offline RL counterparts such as reward-conditioned policies: (Q1) What information should we transfer from the inner-level to the outer-level? (Q2) What should we pay attention to when exploiting the transferred information for the outer-level optimization? (Q3) What are the benefits of concurrently conducting outer-level optimization during testing? Motivated by model-based optimization (MBO), we propose DROP, which answers all three questions. In particular, in the inner-level, DROP decomposes offline data into multiple subsets and learns an MBO score model (A1). To exploit the score model safely in the outer-level, we explicitly learn a behavior embedding and introduce a conservative regularization (A2). During testing, we show that DROP permits deployment adaptation, enabling adaptive inference across states (A3). Empirically, we evaluate DROP on various tasks and show that it achieves comparable or better performance than prior methods.

1. INTRODUCTION

Offline reinforcement learning (RL) (Lange et al., 2012; Levine et al., 2020) describes the task of learning a policy from previously collected static data. Due to the overestimation of values at out-of-distribution (OOD) state-actions, recent iterative offline RL methods introduce various policy/value regularizations to avoid deviating from the offline data distribution (or support) in the training phase. These methods then directly deploy the learned policy in an online environment to test its performance. To set up our analysis, we term this kind of learning procedure iterative bi-level offline RL (Figure 1 left), wherein the inner-level optimization tries to eliminate the OOD issue by constraining the policy/value function, and the outer-level optimization tries to learn a better policy that will be employed at testing. Here, we use the term "iterative" to emphasize that the inner-level and outer-level are iteratively optimized in the training phase. However, without enough inner-level optimization (OOD regularization), there is still a distribution shift between the behavior policy and the policy to be evaluated. Further, due to the iterative error exploitation and propagation (Brandfonbrener et al., 2021) over the two levels, performing such an iterative bi-level optimization entirely in training often struggles to learn a stable policy/value function.

In this work, we thus advocate for non-iterative bi-level optimization (Figure 1 right), which decouples the bi-level optimization from the training phase: it performs inner-level optimization (eliminating OOD) in training and outer-level optimization (updating the policy) in testing. Intuitively, moving the outer-level optimization into the testing phase can eliminate the iterative error propagation over the two levels. Three core questions then arise: (Q1) What information should we transfer from the inner-level to the outer-level?
(Q2) What should we pay special attention to when we exploit the transferred information for outer-level optimization? (Q3) Notice that the outer-level optimization and the online rollout test form a new loop; what new benefit does this give us?

Figure 1: A framework for bi-level offline RL optimization, where the inner-level optimization refers to regularizing the policy/value function (for OOD issues) and the outer-level refers to updating the policy (for reward maximization). Non-iterative offline RL decouples the joint optimization (of the two levels) from the training phase, where the information transferred from the inner-level to the outer-level depends on the specific choice of algorithm. In Table 1, we summarize different choices for the transferred information.

Intriguingly, prior works under such a non-iterative framework have proposed to transfer (as the information in Q1) filtered trajectories (Chen et al., 2021), a reward-conditioned policy (Emmons et al., 2021; Kumar et al., 2019b), and the Q-value estimate of the behavior policy (Brandfonbrener et al., 2021; Gulcehre et al., 2021), all of which, however, only partially address the aforementioned questions (we elaborate on these works in Table 1). In this work, we propose an alternative method that transfers an embedding-conditioned (Q-value) score model, and we show that this method answers the above questions and benefits most from the non-iterative framework.

Before introducing our method, we introduce a task that is conceptually similar to non-iterative bi-level optimization: offline model-based optimization (MBO; Trabucco et al., 2021), which aims to discover, from static input-score pairs, a new design input that will lead to the highest score.
Typically, offline MBO first learns a score model that maps the input to its score via supervised regression (corresponding to inner-level optimization), and then performs inference with the learned score model (as the transferred information), for instance, by optimizing the input against the learned score model via gradient ascent (corresponding to the outer-level). To enable this MBO implementation in offline RL, we need to decompose an offline RL task into multiple sub-tasks, each of which corresponds to a behavior policy-return (parameters-return) pair. However, there are practical optimization difficulties when learning the score model (inner-level) and performing inference (outer-level) on the high-dimensional parameter space of policies (the input of the score model). At inference, directly extrapolating the learned score model also tends to drive the high-dimensional candidate policy (parameters) towards out-of-distribution, invalid, and low-scoring parameters (Kumar & Levine, 2020), as these are falsely and over-optimistically scored by the learned score model.

To tackle these problems, we suggest (A1) learning low-dimensional embeddings for the sub-tasks decomposed in the MBO implementation, over which we estimate an embedding-conditioned Q-value as the MBO score model (the transferred information in Q1), and (A2) introducing a conservative regularization that pushes down the predicted scores on OOD embeddings, so as to avoid over-optimistic exploitation and protect against producing low-confidence embeddings when conducting outer-level optimization (policy/embedding inference). Meanwhile, (A3) learning embeddings permits deployment adaptation, which means we can dynamically adjust the inferred embeddings across different states in testing (also known as test-time adaptation). We name our method DROP (Design fROm Policies).
Compared with standard offline MBO for parameter design (Trabucco et al., 2021), deployment adaptation in DROP leverages the MDP structure of RL tasks, rather than simply conducting inference at the beginning of the test rollout. Empirically, we demonstrate that DROP can effectively extrapolate a better policy that benefits from the non-iterative framework by answering the above three questions, and that it achieves comparable or better performance than many prior offline RL algorithms.

2.1. REINFORCEMENT LEARNING AND OFFLINE REINFORCEMENT LEARNING

We model the interaction between agent and environment as a Markov Decision Process (MDP) (Sutton & Barto, 2018), denoted by the tuple $(\mathcal{S}, \mathcal{A}, R, P, p_0)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition kernel, $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $p_0: \mathcal{S} \to [0, 1]$ is the initial state distribution. Let $\pi \in \Pi := \{\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]\}$ denote a policy. In RL, we aim to find a stationary policy that maximizes the expected discounted return $J(\pi) := \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$ in the environment, where $\tau = (s_0, a_0, r_0, s_1, a_1, \dots)$ with $r_t = R(s_t, a_t)$ is a sample trajectory and $\gamma \in (0, 1)$ is the discount factor. We also define the state-action value function $Q^\pi(s, a) := \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a\right]$, which describes the expected discounted return starting from state $s$ and action $a$ and following $\pi$ afterwards, and the state value function $V^\pi(s) = \mathbb{E}_{a \sim \pi(a|s)}\left[Q^\pi(s, a)\right]$.

To maximize $J(\pi)$, the actor-critic algorithm alternates between policy evaluation and improvement. Specifically, given initial $Q_0$ and $\pi_0$, it iterates

$$Q_{k+1}(s, a) \leftarrow \arg\min_{Q} \mathbb{E}_{(s, a, s') \sim \mathcal{D}^+}\left[\left(R(s, a) + \gamma \mathbb{E}_{a' \sim \pi_k(a'|s')}\left[Q_k(s', a')\right] - Q(s, a)\right)^2\right], \tag{1}$$

$$\pi_{k+1}(a|s) \leftarrow \arg\max_{\pi} \mathbb{E}_{s \sim \mathcal{D}^+, a \sim \pi(a|s)}\left[Q_{k+1}(s, a)\right], \tag{2}$$

where the value function (critic) $Q(s, a)$ is updated by minimizing the mean squared Bellman error with an experience replay dataset $\mathcal{D}^+$ and, following the deterministic policy gradient theorem (Silver et al., 2014), the policy (actor) $\pi(a|s)$ is updated to maximize the estimated $Q_{k+1}(s, \pi(a|s))$.

In offline RL (Levine et al., 2020), the agent is provided with a static dataset $\mathcal{D} = \{\tau\}$ that consists of trajectories collected by running some data-generating policies. Note that we denote the static offline data by $\mathcal{D}$ to distinguish it from the experience replay $\mathcal{D}^+$ in the online setting. Unlike the online RL problem, where the experience $\mathcal{D}^+$ in Equation 1 can be dynamically updated, the agent in offline RL is not allowed to interact with the environment to collect new experience data. As a result, naively performing policy improvement as in Equation 2 may evaluate the estimated $Q_k(s', a')$ on actions that lie far outside of the static offline data $\mathcal{D}$, resulting in pathological values $Q_{k+1}(s, a)$ that incur large errors. Further, iterating policy evaluation and improvement will bias the learned policy $\pi_{k+1}(a|s)$ towards out-of-distribution actions with erroneously overestimated values.
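To make the overestimation issue concrete, the following minimal Python sketch contrasts two fitted evaluations on a hypothetical two-state, two-action dataset in which action 1 never appears. The dataset `DATA`, the erroneous initial value `50.0`, and the helper `evaluate` are illustrative assumptions and not part of the method. Bootstrapping from the greedy action propagates the uncorrected value of the unseen action, whereas bootstrapping from the dataset action (i.e., evaluating the behavior policy) does not.

```python
import numpy as np

GAMMA = 0.9
# Transitions (s, a, r, s_next, a_next); action 1 is never taken in the data.
DATA = [(0, 0, 0.0, 1, 0), (1, 0, 1.0, 1, 0)]

def evaluate(data, q_init, greedy, num_iters=500):
    """Tabular fitted evaluation on a fixed dataset.

    greedy=True  bootstraps from max_a Q(s_next, a), which may be an unseen action.
    greedy=False bootstraps from the dataset action a_next (behavior policy).
    """
    q = q_init.copy()
    for _ in range(num_iters):
        new_q = q.copy()
        for s, a, r, s_next, a_next in data:
            boot = q[s_next].max() if greedy else q[s_next, a_next]
            new_q[s, a] = r + GAMMA * boot
        q = new_q
    return q

q0 = np.zeros((2, 2))
q0[1, 1] = 50.0  # erroneous value at an out-of-distribution state-action pair
q_greedy = evaluate(DATA, q0, greedy=True)
q_behavior = evaluate(DATA, q0, greedy=False)
```

With these numbers, the greedy variant settles at Q(1, 0) = 46 and Q(0, 0) = 45 because the entry for the unseen action is never updated, while the behavior-policy evaluation converges to the true values 10 and 9.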

2.2. OFFLINE MODEL-BASED OPTIMIZATION

Model-based optimization (MBO) (Trabucco et al., 2022) aims to find an optimal design input $x^*$ for a given score function $f^*: \mathcal{X} \to \mathcal{Y} \subset \mathbb{R}$, i.e., $x^* = \arg\max_x f^*(x)$. Typically, we can repeatedly query the oracle score function $f^*$ for new candidate designs until it produces the best design. However, we often do not have access to the oracle score function $f^*$ and are instead provided with a static offline dataset $\{(x, y)\}$ of labeled input-score pairs. To tackle this offline MBO problem, we can fit a parametric model $f$ to the oracle score function $f^*$ via empirical risk minimization (ERM): $f \leftarrow \arg\min_f \mathbb{E}_{x, y}\left[(f(x) - y)^2\right]$. Then, starting from the best point in the dataset, we can perform gradient ascent on the design input and set the learned optimal design $x^* = x^\circ_K := \text{GradAscent}_f(x^\circ_0, K)$ (for simplicity, we omit the subscript $f$ in $\text{GradAscent}_f$ below), where

$$x^\circ_{k+1} \leftarrow x^\circ_k + \eta \nabla_x f(x)\big|_{x = x^\circ_k}, \quad k = 0, 1, \dots, K-1. \tag{3}$$

Since the aim is to find a design input better than all designs in the dataset, and directly optimizing the score model $f$ with ERM cannot ensure that new candidates (out-of-distribution design inputs) receive correct scores, one crucial requirement is to conduct confident extrapolation.
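The two stages above (ERM fit, then gradient ascent on the input as in Equation 3) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the oracle score `-(x - 2)^2`, the quadratic model class, and all function names are hypothetical choices made so that the learned model happens to extrapolate correctly; in general, nothing guarantees this, which is exactly why confident extrapolation is needed.

```python
import numpy as np

def fit_score_model(x, y):
    """ERM fit of a quadratic score model f(x) = w0 + w1*x + w2*x^2 (least squares)."""
    features = np.stack([np.ones_like(x), x, x ** 2], axis=1)
    w, *_ = np.linalg.lstsq(features, y, rcond=None)
    return w

def score_grad(w, x):
    """Gradient of the fitted quadratic with respect to the design input x."""
    return w[1] + 2.0 * w[2] * x

def grad_ascent(w, x0, num_steps, step_size):
    """K steps of gradient ascent on the input, starting from x0 (Equation 3)."""
    x = float(x0)
    for _ in range(num_steps):
        x = x + step_size * score_grad(w, x)
    return x

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=200)   # designs covered by the offline dataset
ys = -(xs - 2.0) ** 2                    # hypothetical oracle scores, noise-free
w = fit_score_model(xs, ys)
x_start = xs[np.argmax(ys)]              # best design in the dataset
x_found = grad_ascent(w, x_start, num_steps=200, step_size=0.1)
```

Here the ascent leaves the data range `[-1, 1]` and converges to the oracle optimum at 2 only because the model class matches the oracle; with a misspecified or over-parameterized model, the same procedure can climb towards inputs whose scores are overestimated.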

3. DROP: DESIGN FROM POLICIES

We present our framework in Figure 2. In Sections 3.1 and 3.2, we answer questions Q1 and Q2 by setting a learned MBO score model as the transferred information (A1) and introducing a conservative regularization over the score model (A2). In Section 3.3, we answer Q3 by showing that we can conduct outer-level optimization during testing, enabling adaptive embedding inference across states (A3).

3.1. TASK DECOMPOSITION

Our core idea is to explore MBO in the non-iterative bi-level offline RL framework (Figure 1 right). To this end, we decompose the offline data $\mathcal{D}$ into $N$ subsets $\mathcal{D}_{[N]} := \{\mathcal{D}_1, \dots, \mathcal{D}_N\}$, each treated as a sub-task, and fit a behavior policy $\beta_n$ to each subset $\mathcal{D}_n$:

$$\beta_n(a|s) \leftarrow \arg\max_{\beta_n} \mathbb{E}_{(s, a) \sim \mathcal{D}_n}\left[\log \beta_n(a|s)\right], \quad n = 1, \cdots, N. \tag{4}$$

Such a decomposition comes with an additional benefit: it provides an avenue to exploit the hybrid modes in the offline data $\mathcal{D}$. Because $\mathcal{D}$ is often collected using hybrid data-generating behavior policies (Fu et al., 2020), fitting a single behavior policy may not be optimal for modeling the multiple modes of the offline data distribution (see Appendix C.1 for empirical evidence). Thus, to encourage the emergence of diverse sub-tasks that capture distinct behavior modes in $\mathcal{D}$ (this is not our focus in this work), we simply perform task decomposition according to the returns of trajectories in $\mathcal{D}$, heuristically ensuring that trajectories in the same sub-task share similar returns and trajectories from different sub-tasks have distinct returns.
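A return-based decomposition of this kind can be sketched as follows. The text only states that trajectories are grouped by return; the specific rule used here (sorting by return and cutting into near-equal-sized bins), the trajectory format, and the name `decompose_by_return` are assumptions made for illustration.

```python
import numpy as np

def decompose_by_return(trajectories, num_subtasks):
    """Split trajectories into num_subtasks groups with similar returns.

    Each trajectory is a list of (state, action, reward) tuples. Trajectories are
    sorted by undiscounted return and cut into contiguous, near-equal-sized bins,
    so each bin covers a distinct return range.
    """
    returns = np.array([sum(r for _, _, r in traj) for traj in trajectories])
    order = np.argsort(returns)                 # indices from low to high return
    bins = np.array_split(order, num_subtasks)  # contiguous bins over sorted indices
    return [[trajectories[i] for i in idx] for idx in bins]

# Six one-step trajectories with returns 5, 1, 3, 2, 6, 4, split into N = 3 sub-tasks.
toy_trajectories = [[(0, 0, float(r))] for r in [5, 1, 3, 2, 6, 4]]
subsets = decompose_by_return(toy_trajectories, num_subtasks=3)
```

Each resulting subset plays the role of one $\mathcal{D}_n$, on which a behavior policy can then be fit by maximum likelihood as in Equation 4.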

3.2. TASK EMBEDDING AND CONSERVATIVE MODEL-BASED OPTIMIZATION

Naive model-based optimization (MBO) over behavior policies. Benefiting from the above offline task decomposition, we can conduct MBO over a set of input-score $(x, y)$ pairs, where we model the (parameters of) behavior policies $\beta_n$ as the design inputs $x$ and the corresponding expected returns at the initial state, $J(\beta_n)$, as the scores $y$. Note that, ideally, evaluating the behavior policy $\beta_n$, i.e., calculating $J(\beta_n) \equiv \mathbb{E}_{s_0}\left[V^{\beta_n}(s_0)\right]$, with subset $\mathcal{D}_n$ will never trigger the overestimation of values in the inner-level optimization. By introducing a score model $f: \Pi \to \mathbb{R}$ (as the transferred information in Q1), we can then perform outer-level policy inference with

$$\pi^*(a|s) \leftarrow \arg\max_{\pi} f(\pi), \quad \text{where } f = \arg\min_{f} \mathbb{E}_{n}\left[\left(f(\beta_n) - J(\beta_n)\right)^2\right]. \tag{5}$$

However, directly performing optimization-based inference (outer-level optimization), $\max_\pi f(\pi)$, will quickly find an invalid input for which the learned score model $f$ outputs erroneously large values (Q2). This problem is particularly severe if we perform the inference directly over the parameters of policies, because the parameters of the input behavior policies lie on a narrow manifold in a high-dimensional parametric space (Kumar & Levine, 2020).

Task embedding. To enable feasible policy inference, we propose to decouple the MBO techniques from the high-dimensional space of policy parameters. We achieve this by learning a latent embedding space $\mathcal{Z}$ with an information bottleneck ($\dim(\mathcal{Z}) \ll \min(N, \dim(\Pi))$), conditioned on the sub-task id $n$, from which the high-dimensional parameters of behavior policies can be inferred. We can thus use the embedding $z \in \mathcal{Z}$ to represent sub-tasks (or the corresponding behavior policies). Formally, we learn a task embedding $\phi: \mathbb{R}^N \times \mathcal{Z} \to [0, 1]$ and a contextual behavior policy $\beta: \mathcal{S} \times \mathcal{Z} \times \mathcal{A} \to [0, 1]$, which replaces the $N$ separate behavior policies in Equation 4:

$$\beta(a|s, z), \phi(z|n) \leftarrow \arg\max_{\beta, \phi} \mathbb{E}_{\mathcal{D}_n \sim \mathcal{D}_{[N]}} \mathbb{E}_{(s, a) \sim \mathcal{D}_n}\left[\log \beta(a|s, \phi(z|n))\right]. \tag{6}$$

Conservative model-based optimization.
In principle, by substituting the learned task embedding $\phi(z|n)$ and the contextual behavior policy $\beta(a|s, z)$ into the original objective in Equation 5, we can conduct MBO over the embedding space: learning $f: \mathcal{Z} \to \mathbb{R}$ with $\min_f \mathbb{E}_{n, \phi(z|n)}\left[(f(z) - J(\beta_n))^2\right]$, and setting the optimal embedding $z^* = \arg\max_z f(z)$ and the corresponding policy $\pi^*(a|s) = \beta(a|s, z^*)$, where $z^*$ can be inferred with gradient ascent as in Equation 3. However, we must account for a new distribution shift in the $\mathcal{Z}$-space, stemming from the original distribution shift in the parametric space when directly optimizing Equation 5. Motivated by energy-based models (LeCun et al., 2006) and conservative regularization (CQL; Kumar et al., 2020), we introduce conservative score model learning, which additionally regularizes the scores of out-of-distribution embeddings sampled from $\mu(z)$:

$$f \leftarrow \arg\min_{f} \mathbb{E}_{n, \phi(z|n)}\left[\left(f(z) - J(\beta_n)\right)^2\right], \quad \text{s.t. } \mathbb{E}_{\mu(z)}\left[f(z)\right] - \mathbb{E}_{n, \phi(z|n)}\left[f(z)\right] \le \eta. \tag{7}$$

Intuitively, as long as the scores of out-of-distribution embeddings, $\mathbb{E}_{\mu(z)}[f(z)]$, are lower than those of in-distribution embeddings, $\mathbb{E}_{n, \phi(z|n)}[f(z)]$ (up to a threshold $\eta$), conducting embedding inference with $z^* = \arg\max_z f(z)$ produces a confident solution and avoids drifting towards embeddings that are far away from the training set $\{\phi(z|n), n = 1, \dots, N\}$.

We have now reframed the non-iterative bi-level offline RL problem as one of offline MBO: in the inner-level optimization (Q1), we set the learned score model $f$ as the transferred information (A1); in the outer-level optimization (Q2), we introduce task embedding and conservative regularization to avoid over-optimistic exploitation when exploiting $f$ for policy/embedding inference (A2). In the next section, we show how to slightly change the form of the score model $f$ so as to leverage the (MDP) structure of RL tasks (the loop formed by outer-level optimization and online rollout) and answer the remaining question Q3.
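One simple way to work with the constrained objective in Equation 7 is to replace the constraint by a penalty with a fixed weight. The sketch below evaluates such a penalized loss with NumPy. The penalty form, the fixed weight `alpha`, the choice of out-of-distribution samples, and the function name `conservative_loss` are all assumptions for illustration; this section does not specify how the constraint is handled in practice, and the threshold $\eta$ is dropped because, under a fixed weight, it only shifts the loss by a constant.

```python
import numpy as np

def conservative_loss(score_fn, z_in, returns, z_ood, alpha):
    """Penalized version of the conservative score objective.

    score_fn: maps an array of embeddings (batch, dim) to scores (batch,).
    z_in:     in-distribution embeddings, one per sub-task.
    returns:  regression targets J(beta_n) for the in-distribution embeddings.
    z_ood:    embeddings sampled from an out-of-distribution proposal mu(z).
    alpha:    fixed penalty weight (hypothetical stand-in for a multiplier).
    """
    fit = np.mean((score_fn(z_in) - returns) ** 2)            # regression term
    gap = np.mean(score_fn(z_ood)) - np.mean(score_fn(z_in))  # OOD minus in-dist score
    return fit + alpha * gap, fit, gap

# Toy score model that sums the embedding coordinates.
toy_score = lambda z: z.sum(axis=1)
z_in = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
targets = np.array([0.0, 1.0, 1.0])
z_ood = np.array([[2.0, 2.0], [3.0, 3.0]])
loss, fit, gap = conservative_loss(toy_score, z_in, targets, z_ood, alpha=0.5)
```

In this toy case the regression term is zero, yet the gap is positive because the score model assigns higher values to the far-away embeddings; minimizing the penalized loss would push those scores down relative to the in-distribution ones.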

3.3. DEPLOYMENT ADAPTATION

Recall that we update $f(z)$ to regress the value at the initial state, $\mathbb{E}_{s_0}\left[V^{\beta_n}(s_0)\right]$, in Equation 7; we then conduct outer-level inference with $z^* = \arg\max_z f(z)$ and roll out the $z^*$-conditioned policy $\pi^*(a|s) := \beta(a|s, z^*)$ until the end of the episode at deployment (testing). In essence, this inference produces an extrapolation over the distribution of behavior policies (corresponding to embeddings). Going beyond (outer-level) inference only at the initial state, we propose that an implementation can benefit from performing inference at any rollout state in testing (A3).

To enable deployment adaptation, we define the score model as $f: \mathcal{S} \times \mathcal{A} \times \mathcal{Z} \to \mathbb{R}$, taking a state-action pair as extra input. Then, we encourage the score model to regress the values of behavior policies over all state-action pairs in each sub-task, $\min_f \mathbb{E}_{n, \phi(z|n)} \mathbb{E}_{(s, a) \sim \mathcal{D}_n}\left[\left(f(s, a, z) - Q^{\beta_n}(s, a)\right)^2\right]$. For simplicity, instead of learning an additional value function $Q^{\beta_n}$ for each behavior policy, we learn the score model directly with the TD-error used for learning the value function $Q^{\beta_n}(s, a)$ as in Equation 1, together with the conservative regularization in Equation 7:

$$f \leftarrow \arg\min_{f} \mathbb{E}_{\mathcal{D}_n \sim \mathcal{D}_{[N]}} \mathbb{E}_{(s, a, s', a') \sim \mathcal{D}_n}\left[\left(R(s, a) + \gamma \bar{f}(s', a', \phi(z|n)) - f(s, a, \phi(z|n))\right)^2\right], \tag{8}$$

$$\text{s.t. } \mathbb{E}_{n, \mu(z)} \mathbb{E}_{s \sim \mathcal{D}_n, a \sim \beta(a|s, z)}\left[f(s, a, z)\right] - \mathbb{E}_{n, \phi(z|n)} \mathbb{E}_{s \sim \mathcal{D}_n, a \sim \beta(a|s, z)}\left[f(s, a, z)\right] \le \eta,$$

where $\bar{f}$ denotes a target network, which we update with soft updates: $\bar{f} \leftarrow (1 - \upsilon) \bar{f} + \upsilon f$.

In testing, we can thus dynamically adapt the outer-level optimization, setting policy inference with $\pi^*(a|s) = \beta(a|s, z^*(s))$, where $z^*(s) = \arg\max_z f(s, \beta(a|s, z), z)$. Specifically, at any state $s$ in the deployment phase, we perform gradient ascent to find the optimal behavior embedding $z^*(s) = z^\circ_K(s) := \text{GradAscent}(s, z^\circ_0, K)$, where $z^\circ_0$ is the starting point and

$$z^\circ_{k+1}(s) \leftarrow z^\circ_k(s) + \eta \nabla_z f(s, \beta(a|s, z), z)\big|_{z = z^\circ_k}, \quad k = 0, 1, \dots, K-1. \tag{9}$$

Table 1: Comparison of five non-iterative bi-level offline algorithms, where $R(\cdot)$ denotes the return of the sampled $\tau$ or the return starting from $(s, a)$; the checkmark in A2 indicates whether the exploitation (outer-level) of the transferred information is regularized, and that in A3 indicates whether deployment adaptation is supported.

| Method | Inner-level | Outer-level | Transferred information (A1) | A2 | A3 |
|---|---|---|---|---|---|
| F-BC | filter $\tau$ with high $R(\tau)$ | behavior cloning | filtered $\{\tau\}$ | | |
| RvS-R | $\min_\pi -\mathbb{E}\left[\log \pi(a \mid s, R(\tau))\right]$ | handcraft $R_{\text{target}}$ | $\pi(a \mid s, \cdot)$ | | |
| Onestep | $\min_Q L(Q(s, a), R(s, a))$ | $\arg\max_a Q^{\beta}(s, \beta(a \mid s))$ | $Q^{\beta}(s, a)$ | | |
| COMs | $\min_f L(f(\beta_\tau), R(\tau))$ | $\arg\max_\beta f(\beta)$ | $f(\beta)$ | | |
| DROP | $\min_f L(f(s, a, z), R(s, a, z))$ | $\arg\max_z f(s, \beta(a \mid z), z)$ | $f(s, a, z)$ | | |

3.4. CONNECTION TO PRIOR NON-ITERATIVE OFFLINE COUNTERPARTS

In Table 1, we summarize the comparison with prior representative non-iterative offline RL methods. Intuitively, DROP (which leverages returns to decompose $\mathcal{D}$) is similar in spirit to F-BC and RvS-R (Chen et al., 2021; Emmons et al., 2021), both of which use the return $R(\tau)$ to guide the inner-level optimization. However, both F-BC and RvS-R leave Q2 unanswered. In the outer-level, F-BC cannot perform policy extrapolation and thus relies heavily on the data quality in offline tasks, while RvS-R needs a handcrafted target return (as the contextual variable for $\pi(a|s, \cdot)$), which can also trigger a distribution shift between the handcrafted contextual variable and the one used for learning the contextual policy (see examples in Figure 6 of Emmons et al. (2021)).

Diving deeper into the bi-level optimization, we also find that DROP combines the advantages of Onestep (Brandfonbrener et al., 2021) and COMs (Trabucco et al., 2021): Onestep performs outer-level optimization in action space ($\arg\max_a$), COMs performs it in parameter space ($\arg\max_\beta$), and DROP performs it in embedding space ($\arg\max_z$). As a result, the choice of $f(s, a, z)$ in DROP allows us to safely exploit the transferred information in the outer-level (Q2) and to leverage the structure of RL tasks (the loop in Q3), rather than simply conducting outer-level optimization at initial states as in COMs (corresponding to the objective in Equation 5).

Learn $\phi$, $\beta$, and $f$ with Equations 6 and 8.
6: end while
Return: $\beta(a|s, z)$ and $f(s, a, z)$.
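The score-model update used in training, i.e., the TD regression of Equation 8 with a softly updated target, can be sketched in tabular form as follows. This is a simplified illustration under stated assumptions: the score model is a table indexed by sub-task id `n` (a stand-in for conditioning on the embedding $\phi(z|n)$), the conservative constraint is omitted, and the names `soft_update`, `td_step`, and `train` as well as all hyperparameter values are hypothetical.

```python
import numpy as np

def soft_update(target, online, rate):
    """Soft target update: target <- (1 - rate) * target + rate * online."""
    return (1.0 - rate) * target + rate * online

def td_step(f, f_target, batch, gamma, lr):
    """One pass of TD regression on a tabular score table f[n, s, a].

    Each batch item is (n, s, a, r, s_next, a_next) drawn from sub-task n, so the
    bootstrap uses the dataset action a_next rather than a maximizing action.
    """
    f = f.copy()
    for n, s, a, r, s_next, a_next in batch:
        td_error = r + gamma * f_target[n, s_next, a_next] - f[n, s, a]
        f[n, s, a] += lr * td_error  # gradient step on 0.5 * td_error ** 2
    return f

def train(batch, shape, gamma=0.5, lr=0.1, rate=0.05, num_iters=2000):
    """Alternate TD steps on f with soft updates of the target table."""
    f = np.zeros(shape)
    f_target = np.zeros(shape)
    for _ in range(num_iters):
        f = td_step(f, f_target, batch, gamma, lr)
        f_target = soft_update(f_target, f, rate)
    return f

# Hypothetical single sub-task with one self-looping transition and reward 1.
f_table = train(batch=[(0, 0, 0, 1.0, 0, 0)], shape=(1, 1, 1))
```

For this toy transition the discounted value is 1 / (1 - 0.5) = 2, and the learned table entry converges to it. Because the bootstrap only uses actions that appear in the data, the regression evaluates the behavior policy of each sub-task rather than a new candidate policy.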
Algorithm 2 DROP (Testing / Deployment)
Require: Env, $\beta(a|s, z)$, and $f(s, a, z)$.
1: $s_0$ = Env.Reset().
2: while not done do
3:     Inference (deployment adaptation): $z^*(s_t) = \arg\max_z f(s_t, \beta(a_t|s_t, z), z)$.
4:     Sample action: $a_t \sim \beta(a_t|s_t, z^*(s_t))$.
5:     Step Env: $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$.
6: end while

We now summarize the DROP algorithm (see Algorithm 1 for the training phase and Algorithm 2 for the testing phase). During training (inner-level optimization), we alternate between updating $\phi(z|n)$, $\beta(a|s, z)$, and $f(s, a, z)$, wherein we update $\phi$ with both the maximum likelihood loss and the TD-error loss in Equations 6 and 8. During testing (outer-level optimization), for each state $s$, we use the gradient ascent in Equation 9 to choose the optimal embedding $z^*(s)$. Instead of sampling a single starting point $z^\circ_0$, we choose $N$ starting points corresponding to the embeddings $\{z_n \mid n = 1, \dots, N\}$ of all sub-tasks, and then choose the optimal $z^*(s)$ from those updated embeddings for which the learned $f$ outputs the highest score:

$$z^*(s) = \arg\max_z f(s, \beta(a|s, z), z) \quad \text{s.t. } z \in \{\text{GradAscent}(s, z_n, K) \mid n = 1, \dots, N\}.$$

Then, we sample the action from $\pi^*(a|s) := \beta(a|s, z^*(s))$. For more optimization (training/testing) details, we refer the reader to Appendix D.
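The test-time procedure described above (multi-start gradient ascent over embeddings at every state, followed by acting with the selected embedding) can be sketched as follows. Everything concrete in this sketch is a hypothetical stand-in: `policy` is a deterministic toy replacement for $\beta(a|s, z)$, `score` is a toy replacement for the learned $f(s, a, z)$, the dynamics are a one-line toy, and the gradient is taken by central finite differences rather than by backpropagation through a learned network.

```python
import numpy as np

def policy(state, z):
    """Toy deterministic stand-in for beta(a|s, z): the action equals the embedding."""
    return z

def score(state, action, z):
    """Toy stand-in for f(s, a, z); it peaks when the action cancels the state."""
    return -float(np.sum((action + state) ** 2))

def objective(state, z):
    """The quantity maximized over z at a given state: f(s, beta(a|s, z), z)."""
    return score(state, policy(state, z), z)

def numerical_grad(state, z, eps=1e-5):
    """Central finite-difference gradient of the objective with respect to z."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (objective(state, z + step) - objective(state, z - step)) / (2 * eps)
    return grad

def grad_ascent(state, z0, num_steps, step_size):
    """K steps of gradient ascent on the embedding, starting from z0 (Equation 9)."""
    z = z0.copy()
    for _ in range(num_steps):
        z = z + step_size * numerical_grad(state, z)
    return z

def infer_embedding(state, start_embeddings, num_steps=20, step_size=0.1):
    """Run ascent from every sub-task embedding and keep the highest-scoring result."""
    candidates = [grad_ascent(state, z0, num_steps, step_size) for z0 in start_embeddings]
    return max(candidates, key=lambda z: objective(state, z))

def rollout(initial_state, start_embeddings, horizon=5):
    """Re-infer the embedding at every state, then act with it."""
    state = initial_state.copy()
    visited = [state.copy()]
    for _ in range(horizon):
        z_star = infer_embedding(state, start_embeddings)
        action = policy(state, z_star)
        state = state + action  # toy deterministic dynamics
        visited.append(state.copy())
    return visited

starts = [np.array([-1.0]), np.array([0.0]), np.array([1.0])]
path = rollout(np.array([0.8]), starts)
```

In this toy setting, none of the three starting embeddings is optimal at the initial state, but the ascent from the closest one moves towards the embedding that cancels the state, and the selected starting point changes from step to step as the state changes.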

4. RELATED WORK

Offline RL. In offline RL, learning with static offline data is prone to exploiting out-of-distribution (OOD) state-action pairs and over-estimating values, which makes vanilla iterative policy learning and value optimization challenging (Rashidinejad et al., 2021). To address this problem, a number of methods have been explored, in essence, by either introducing a policy/value regularization in the iterative loop or trying to eliminate the iterative procedure itself.

Iterative methods: Sticking with the normal iterative updates in RL, offline policy regularization methods aim to keep the learned policy close to the behavior policy under a probabilistic distance (Cang et al., 2021; Fujimoto & Gu, 2021; Kostrikov et al., 2021a; Kumar et al., 2019a; Liu et al., 2022; Nair et al., 2020a; Peng et al., 2019; Siegel et al., 2020; Wu et al., 2019; Zhang et al., 2021). Some works also conduct implicit policy regularization with variants of importance sampling (Lee et al., 2021; Liu et al., 2019; Nachum et al., 2019). Besides regularizing the policy, it is also feasible to constrain the value function in the iterative loop. Methods constraining the value function aim at mitigating over-estimation: they typically introduce pessimism into the prediction of the Q-values (Chebotar et al., 2021; Jin et al., 2021; Kumar et al., 2020; Li et al., 2022; Ma et al., 2021a; b) or penalize the value with an uncertainty quantification (An et al., 2021; Bai et al., 2022; Rezaeifar et al., 2021; Wu et al., 2021), making the values of out-of-distribution state-actions more conservative. Similarly, another branch of model-based methods (Kidambi et al., 2020; Yu et al., 2020; 2021b; Rigter et al., 2022) also performs iterative bi-level updates, alternating between regularized evaluation and improvement. Different from these works, DROP only evaluates values of behavior policies in the inner-level, avoiding error propagation between the two levels.
Non-iterative methods: Another complementary line of work studies how to eliminate the iterative updates by casting RL as a weighted or conditional imitation learning problem (Q1). Derived from behavior-regularized RL (Geist et al., 2019; Vieillard et al., 2020), the former conducts weighted behavior cloning: first learn a value function for the behavior policy, then weight the state-action pairs with the learned values or advantages (Abdolmaleki et al., 2018; Chen et al., 2019; Wang et al., 2020; Peng et al., 2019). Besides, some works propose implicit behavior policy regularization that also avoids estimating the value of new candidate policies, either initializing the learned policy with a behavior policy (Matsushima et al., 2020) or performing only a "one-step" update (policy improvement) over the behavior policy (Fujimoto et al., 2019; Gulcehre et al., 2021). The latter branch typically builds upon hindsight information matching (Andrychowicz et al., 2017; Eysenbach et al., 2020; Pong et al., 2018; Wan et al., 2021), assuming that future trajectory information can be useful for inferring the intermediate decisions that lead to that future, and thus relabels the trajectory with the reached states or returns. Owing to their simplicity and stability, RvS-based methods advocate learning a goal-conditioned or reward-conditioned policy (Chen et al., 2021; Ding et al., 2019; Emmons et al., 2021; Furuta et al., 2021; Janner et al., 2021; Lin et al., 2022; Srivastava et al., 2019; Yang et al., 2022) with supervised learning. However, these works do not fully exploit the non-iterative bi-level framework and fail to answer the proposed questions: they either do not regularize the inner-level optimization before exploiting the transferred information in the outer-level (Q2) or do not support deployment adaptation in testing (Q3).

Offline model-based optimization (MBO).
Similar to offline RL, the main challenge of MBO is to reason about uncertainty and OOD values (Brookes et al., 2019; Fannjiang & Listgarten, 2020), since direct gradient ascent against the learned score model can easily produce invalid inputs that are falsely assigned high scores. To counteract the effect of model exploitation, prior works introduce various techniques, including normalized maximum likelihood estimation (Fu & Levine, 2021), model inversion networks (Kumar & Levine, 2020), local smoothness priors (Yu et al., 2021a), and conservative objective models (COMs) (Trabucco et al., 2021). DROP shares the conservative model with COMs but instantiates it on the embedding space instead of the parameter space. This difference is nontrivial, not only because DROP allows OOD sampling (aimed at pessimism) directly in the embedding space, avoiding adversarial training as in COMs, but also because DROP allows deployment adaptation, enabling dynamic inference across states in testing.

5. EXPERIMENTS

In this section, we present our empirical results. We first give examples to illustrate deployment adaptation. Then we evaluate DROP against prior offline RL algorithms on the D4RL benchmark. Finally, we compare DROP with prior (offline) latent-based baselines. For more ablation studies with respect to the decomposition rules and the conservative regularization, we refer readers to the appendix.

Under review as a conference paper at ICLR 2023

Illustration of deployment adaptation. To better understand the deployment adaptation of DROP, we include four comparisons that exhibit different embedding inference rules at testing:

(1) DROP-Best: At the initial state $s_0$, we choose the best embedding from the embeddings of behavior policies, $z_0^*(s_0) = \arg\max_z f(s_0, \beta(a_0|s_0, z), z)$ s.t. $z \in z_{[N]} := \{z_1, \dots, z_N\}$, and keep this embedding fixed for the entire episode, i.e., setting $\pi^*(a_t|s_t) = \beta(a_t|s_t, z_0^*(s_0))$.
(2) DROP-Grad: At the initial state $s_0$, we conduct inference (gradient ascent from the starting point $z_0^*(s_0)$) with $z^*(s_0) = \arg\max_z f(s_0, \beta(a_0|s_0, z), z)$, and keep this embedding fixed throughout the rollout.
(3) DROP-Best-Ada: We adapt the contextual policy by setting $\pi^*(a_t|s_t) = \beta(a_t|s_t, z_0^*(s_t))$, where we choose the best embedding $z_0^*(s_t)$ directly from the embeddings of behavior policies for which the score model outputs the highest score, i.e., $z_0^*(s_t) = \arg\max_z f(s_t, \beta(a_t|s_t, z), z)$ s.t. $z \in z_{[N]}$.
(4) DROP-Grad-Ada (gradient-based adaptation as described in Section 3.5): We set $\pi^*(a_t|s_t) = \beta(a_t|s_t, z^*(s_t))$ and choose the best embedding from the updated embeddings of behavior policies, i.e., $z^*(s_t) = \arg\max_z f(s_t, \beta(a_t|s_t, z), z)$ s.t. $z \in \{\text{GradAscent}(s_t, z_n, K) \mid n = 1, \dots, N\}$.

[Figure 3. Legend: Grad. Ascent, $z_{[N]}$, $z_0^*(s_0)$, $z^*(s_0)$, $z_0^*(s_t)$, $z^*(s_t)$. Here $z_{[N]}$ denotes the embeddings of all behavior policies; $z_0^*(s_0)$, $z^*(s_0)$, $z_0^*(s_t)$, and $z^*(s_t)$ denote the selected embeddings in DROP-Best, DROP-Grad, DROP-Best-Ada, and DROP-Grad-Ada, respectively.]

In Figure 3, we visualize the four different inference rules and report the corresponding performance on the halfcheetah-medium-expert task (Fu et al., 2020). In Figure 3 (a), we set the starting point as the best embedding $z_0^*(s_0)$ in $z_{[N]}$ and perform gradient ascent to find the optimal $z^*(s_0)$ for DROP-Grad. In Figure 3 (b), we find that DROP-Best-Ada chooses different embeddings (as contextual variables for $\beta(a_t|s_t, \cdot)$) at different time steps. At a high level, performing such dynamic inference enables us to combine different embeddings, switching behavior policies at different states. Further, in Figure 3 (c1, c2), we find that the additional inference (with gradient ascent) in DROP-Grad-Ada allows us to extrapolate beyond the embeddings of behavior policies, and thus results in a sequential composition of new embeddings (policies) across different states. For the practical impact of these inference rules, we provide a performance comparison in Figure 3 (d), where performing gradient-based optimization (*-Grad-*) outperforms simple selection among the embeddings of behavior policies in sub-tasks (*-Best-*), and rollout with adaptive embedding inference (DROP-*-Ada) outperforms rollout with fixed embeddings (DROP-*).

Empirical performance on benchmark tasks. We evaluate DROP on a number of tasks from the D4RL dataset and compare it with both prior iterative and non-iterative offline algorithms.
Considering that DROP follows the non-iterative offline RL paradigm, we compare DROP with prior non-iterative offline baselines (BC, F-BC, DT (Chen et al., 2021), RvS-R, Onestep, and COMs) in the main paper. Compared with DROP, these baselines do not fully answer the raised questions (see Table 1): they either do not regularize the inner-level optimization before exploiting the transferred information in the outer-level (Q2) or do not support deployment adaptation in testing (Q3). Moreover, we also provide a comparison with CQL, which inspired the conservative regularization in Equation 7. In Table 2, we show the evaluation results for the AntMaze-* and Gym-*-medium-* tasks in D4RL (*-v2), where DROP (-Grad-Ada) achieves better overall performance than these non-iterative offline RL baselines. Compared with CQL, DROP shows superior performance on the AntMaze-large-* and Gym-*-medium-expert (m.-exp.) tasks, while it performs worse on the AntMaze-medium-* and Gym-*-medium-replay (m.-rep.) tasks. As an extension, we also design a DROP+CVAE implementation (see the motivation in the next paragraph and details in Appendix C.4), which further improves DROP's performance and retains superior or comparable performance on all tasks.

Table 2: Comparison with non-iterative methods on D4RL (*-v2). For all results of our method, we average the normalized returns across 5 seeds; for each seed, we run 10 evaluation episodes. For comparison, we mark the results where DROP-Grad-Ada achieves better performance than Onestep and COMs (the most related baselines in Table 1), respectively. (BA: Best-Ada, GA: Grad-Ada)

Comparison with latent policy methods. One additional merit of DROP is that it naturally accounts for hybrid modes in $\mathcal{D}$ by conducting task decomposition in the inner-level. We thus compare DROP to latent policy methods (PLAS (Zhou et al., 2020) and LAPO (Chen et al., 2022)) that use a conditional variational autoencoder (CVAE) to model offline data and also account for multiple modes in offline data. Essentially, both DROP and the baselines (PLAS and LAPO) learn a latent policy in the inner-level optimization, except that we adopt non-iterative bi-level learning while the baselines are instantiated under the iterative paradigm. By answering Q3, DROP permits deployment adaptation, enabling us to dynamically switch/stitch "skills" (latent policies/behaviors, as shown in Figure 3) and encouraging high-level abstract exploration in testing. In contrast, the aim of introducing the latent policy in PLAS and LAPO is to regularize the inner-level optimization, which fairly answers Q2 in the iterative offline counterpart but cannot provide the potential benefit (deployment adaptation) of answering Q3 in the non-iterative paradigm.

We provide the comparison results in Table 3. We observe that DROP (-Best-Ada) and DROP (-Grad-Ada) consistently achieve better performance than PLAS on the AntMaze-*-v2 tasks. On the Gym *-medium domain, DROP (-Grad-Ada) also performs better than LAPO. However, there is a large performance gap between DROP and LAPO on the *-random domain. We speculate that this gap is mainly caused by the decomposition rule. In our DROP implementation, we heuristically use the return to conduct task decomposition (motivated by RvS-R (Emmons et al., 2021)), while LAPO and PLAS conduct decomposition (learning a latent policy) automatically. To bridge the gap, we similarly adopt a CVAE to model the offline data and then take the latent embedding learned by the CVAE as the embedding of behaviors, instead of conducting return-guided task decomposition.
We provide implementation details (DROP+CVAE) in Appendix C.4 and new results in Tables 2 and 3 (DROP+), where this CVAE-based DROP implementation brings a substantial performance improvement. Further, in Table 6 (Appendix C.4), we compare DROP+CVAE to IQL (Kostrikov et al., 2021b), demonstrating the competitive empirical performance of the DROP approach against state-of-the-art iterative and non-iterative offline baselines.

6. CONCLUSION

In this work, we introduce non-iterative bi-level offline RL and, based on this paradigm, raise three questions (Q1, Q2, and Q3). To answer them, we reframe the offline RL problem as one of MBO and learn a score model (A1), introduce embedding learning and conservative regularization (A2), and propose deployment adaptation in testing (A3). We evaluate DROP on various tasks, showing that DROP achieves comparable or better performance compared to prior methods.

A APPENDIX

B DISCUSSION AND FUTURE WORK

Limitations. DROP also has several limitations. First, the offline data decomposition dominates the subsequent bi-level optimization, and thus choosing a suitable decomposition rule is crucial for policy inference (see the experimental analysis in Appendix C.2). An exciting direction for future work is to study generalized task decomposition rules (Rao et al., 2021). In Appendix C.4, we also show the potential of such generalized task decomposition by introducing a CVAE into DROP's implementation, and find that this combination (DROP+CVAE) brings a practical performance improvement. Second, we find that when the number of sub-tasks is too large, the inference is unstable: adjacent checkpoint models exhibit larger variance in performance (such instability also exists in prior offline RL methods, as discovered by Fujimoto & Gu (2021)). One natural remedy for this instability is online fine-tuning (see Appendix C.3 for our empirical studies).

Going forward, we believe our work suggests a feasible alternative for generalizable offline robotic learning: by decomposing a single robotic dataset into multiple subsets, offline policy inference can benefit from performing model-based optimization (MBO) and the joint deployment adaptation procedure.

Social impact: Beyond a general offline RL improvement, the authors do not foresee negative social impacts.

In Figure 4, we provide empirical evidence that learning a single behavior policy (using BC) is not sufficient to characterize the whole offline dataset, and that multiple behavior policies (obtained by task decomposition) characterize the offline data better than a single behavior policy.

C.2 DECOMPOSITION RULES

In DROP algorithm, we explicitly decompose an offline task into multiple sub-tasks, over which we then reframe the offline policy learning problem as one of offline model-based optimization. In this section, we discuss three different designs for the task decomposition rule.

Random(N, M ):

We decompose the offline dataset $D := \{\tau\}$ into N subsets, each of which contains at most M trajectories that are randomly sampled from the offline dataset.

Quantization(N, M):

Leveraging the returns of trajectories in the offline data, we first quantize the offline trajectories into N bins, and then randomly sample at most M trajectories (as a sub-task) from each bin. Specifically, in the i-th bin, the quantized trajectories $\{\tau_i\}$ satisfy $R_{\min} + \Delta \cdot i < \mathrm{Return}(\tau_i) \le R_{\min} + \Delta \cdot (i + 1)$, where $\Delta = (R_{\max} - R_{\min})/N$, $\mathrm{Return}(\tau_i)$ denotes the return of trajectory $\tau_i$, and $R_{\max}$ and $R_{\min}$ denote the maximum and minimum trajectory returns in the offline dataset respectively.

Rank(N, M ):

We first rank the offline trajectories in descending order of their returns, and then sequentially sample M trajectories for each subset. (We adopt this decomposition rule in the main paper.)

In Figure 5, we compare the above three decomposition rules (see the selected number of sub-tasks and the number of trajectories in each sub-task in Table 9). Across a variety of tasks, the decomposition rule has a fundamental impact on the subsequent model-based optimization. Across different tasks and different embedding inference rules, the Random and Quantization decomposition rules tend to exhibit large performance fluctuations, which reveals the importance of choosing a suitable task decomposition rule. In our paper, we adopt the Rank decomposition rule, as it demonstrates more robust performance, as shown in Figure 5. In Appendix C.4, we adopt the conditional variational auto-encoder (CVAE) to conduct automatic task decomposition (treating each trajectory in the offline dataset as an individual task), and we find that this implementation (DROP+CVAE) can further improve DROP's performance. In future work, we also encourage better decomposition rules to decompose offline tasks so as to enable more effective model-based optimization for offline RL tasks.

Table 4: Comparison between our DROP (using the Rank decomposition rule) and filtered behavior cloning (F-BC) on D4RL AntMaze and MuJoCo suites (*-v2). We take the baseline results of BC and F-BC from Emmons et al. (2021), where F-BC is trained over the top 10% of trajectories, ordered by return. Our DROP results are computed over 5 seeds and 10 episodes for each seed.

Tasks | BC | F-BC | DROP (three embedding inference rules)
medium-replay | 26 | 62.5 | 37.4 ± 13.5 | 60.9 ± 7.4 | 61.9 ± 2.3
halfcheetah medium-expert | 55.2 | 92.9 | 86.6 ± 3.9 | 88.5 ± 1.2 | 88.9 ± 2
hopper medium-expert | 52.5 | 110.9 | 103.5 ± 6.3 | 102.5 ± 6.2 | 105.9 ± 4.9
walker2d medium-expert | 107.5 | 109 | 107.5 ± 2 | 106.8 ± 3.9 | 106.9 ± 3.6
mujoco-gym total | 475.5 | 674 | 609.1 | 672.2 | 683.8
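As a minimal sketch of the three decomposition rules in this section, the following Python snippet represents each trajectory as a plain list of per-step rewards. The function names, the trajectory representation, the `rng` argument, the placement of the minimum-return trajectory in the first bin, and the reading of Rank as taking consecutive blocks of M trajectories from the sorted list are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def trajectory_return(traj):
    """Return of one trajectory, here simply the sum of its rewards."""
    return sum(traj)


def random_rule(trajs, n, m, rng):
    """Random(N, M): N subsets, each with at most M randomly sampled trajectories."""
    return [rng.sample(trajs, min(m, len(trajs))) for _ in range(n)]


def quantization_rule(trajs, n, m, rng):
    """Quantization(N, M): bin trajectories by return into N equal-width bins,
    then sample at most M trajectories from each bin."""
    returns = [trajectory_return(t) for t in trajs]
    r_min, r_max = min(returns), max(returns)
    delta = (r_max - r_min) / n
    bins = [[] for _ in range(n)]
    for traj, ret in zip(trajs, returns):
        if delta == 0:
            i = 0  # all returns are equal, so everything goes to one bin
        else:
            # Bin i covers (r_min + delta * i, r_min + delta * (i + 1)].
            # Assumption: the minimum-return trajectory is clamped into bin 0.
            i = min(max(math.ceil((ret - r_min) / delta) - 1, 0), n - 1)
        bins[i].append(traj)
    return [rng.sample(b, min(m, len(b))) for b in bins]


def rank_rule(trajs, n, m):
    """Rank(N, M): sort by return (descending) and cut consecutive blocks of M."""
    ranked = sorted(trajs, key=trajectory_return, reverse=True)
    return [ranked[i * m:(i + 1) * m] for i in range(n)]
```

Under this reading, Rank only ever touches the top N × M trajectories, which is consistent with the observation in this appendix that Rank leverages more high-quality trajectories than Random and Quantization.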


Comparison with filtered behavior cloning. We also note that the Rank decomposition rule leverages more high-quality trajectories than the other two decomposition rules (Random and Quantization). Thus, a natural question is whether Rank outperforms Random and Quantization simply because the decomposed sub-tasks contain more high-quality trajectories, that is, whether DROP (using the Rank decomposition rule) merely conducts behavioral cloning over those high-quality trajectories. To answer this question, we compare DROP (using the Rank decomposition rule) with filtered behavior cloning (F-BC), which performs behavior cloning after filtering for the trajectories with the highest returns. We provide the comparison results in Table 4. In AntMaze tasks, the overall performance of DROP is higher than that of F-BC. For the MuJoCo-Gym suite, DROP-based methods outperform F-BC on the offline tasks that contain plenty of sub-optimal trajectories, including the random, medium, and medium-replay domains. This result indicates that DROP can leverage embedding inference (extrapolation) to find a better policy beyond all the behavior policies in sub-tasks, which is more effective than simply performing imitation learning on a subset of the dataset.

C.3 ONLINE FINE-TUNING

Online fine-tuning (checkpoint-level). In Figure 6, we show the learning curves of DROP-Best on four D4RL tasks. DROP exhibits high variance (in performance) across training steps [footnote 5], which means the performance of the agent may depend on the specific stopping point chosen for evaluation (such instability also exists in prior offline RL methods (Fujimoto & Gu, 2021)). To choose a suitable stopping checkpoint over which we perform the DROP inference (DROP-Grad, DROP-Best-Ada and DROP-Grad-Ada), we propose checkpoint-level online fine-tuning (see Algorithm 3 in Section D for more details): we evaluate each of the latest T checkpoint models and choose the one that leads to the highest episode return.

In Figure 7, we show the total normalized returns across all the tasks in each suite (including Maze2d, AntMaze, and MuJoCo-Gym). In most tasks, fine-tuning (FT) improves performance. However, we also find that such fine-tuning hurts performance in the AntMaze (*-v0) suite. The main reason is that, in this checkpoint-level fine-tuning, we choose the "suitable" checkpoint model using the DROP-Best embedding inference rule, while we adopt the other three embedding inference rules (DROP-Grad, DROP-Best-Ada and DROP-Grad-Ada) at test time. This finding also implies that the success of DROP's deployment adaptation is not entirely dependent on the best embedding across sub-tasks [footnote 6] (i.e., the best embedding $z^*_0(s_0)$ in DROP-Best), but requires switching between some "suboptimal" embeddings (using DROP-Best-Ada) or extrapolating new embeddings (using DROP-Grad-Ada).

Online fine-tuning (embedding-level). Beyond the above checkpoint-level fine-tuning procedure, we can also conduct embedding-level online fine-tuning: we aim to choose a suitable gradient update step for the gradient-based embedding inference rules (DROP-Grad and DROP-Grad-Ada).
Similar to the checkpoint-level fine-tuning, we first conduct the deployment adaptation procedure (DROP-Grad and DROP-Grad-Ada) over a set of gradient update steps, and then choose the step that leads to the highest episode return (see Algorithm 4 in Section D for more details).

In Table 5, we compare DROP (DROP-Grad and DROP-Grad-Ada) to three offline RL methods (AWAC (Nair et al., 2020b), CQL (Kumar et al., 2020) and IQL (Kostrikov et al., 2021b)), reporting the initial performance and the performance after online fine-tuning. The embedding-level fine-tuning (0.3M) enables a significant improvement in performance. The fine-tuned DROP-Grad-Ada (0.3M) outperforms the AWAC and CQL counterparts in most tasks, even though we use fewer rollout steps for online fine-tuning (the baselines take 1M online rollout steps, while DROP-based fine-tuning takes 0.3M steps). However, there is still a large gap between the fine-tuned IQL and the embedding-level fine-tuned DROP (0.3M). Considering that 0.7M online steps remain in the comparison, we further conduct "parametric-level" fine-tuning (updating the parameters of the policy network) for DROP-Grad-Ada on medium-* and large-* tasks, which achieves competitive fine-tuning performance even compared with IQL.

Table 5: Online fine-tuning results (initial performance → performance after online fine-tuning). The baseline results of AWAC, CQL, and IQL are from Kostrikov et al. (2021b), where they run 1M online steps to fine-tune the learned policy. For our DROP method (DROP-Grad and DROP-Grad-Ada), we run 0.3M (= 6 checkpoints × 50 $K_{\max}$ × 1000 steps per episode) online steps to fine-tune (embedding-level) the policy, i.e., aiming to find the optimal gradient ascent step that is used to infer the contextual embedding $z^*(s_0)$ or $z^*(s_t)$ for $\pi^*(a_t|s_t) := \beta(a_t|s_t, \cdot)$ (see Algorithm 4 for the details).
Moreover, for medium-* and large-* tasks, we conduct additional parametric-level fine-tuning, with 0.7M online steps to update the policy's parameters. Our DROP results are computed over 5 seeds and 10 episodes for each seed.

C.4 DROP+CVAE IMPLEMENTATION

Similar to LAPO (Chen et al., 2022) and PLAS (Zhou et al., 2020), we adopt the conditional variational auto-encoder (CVAE) to model offline data. Specifically, we learn the contextual policy and behavior embedding:

$$\beta(a|s,z),\ \phi(z|s) \leftarrow \arg\max_{\beta,\phi}\ \mathbb{E}_{(s,a)\sim D}\Big[\mathbb{E}_{z\sim\phi(z|s)}\big[\log \beta(a|s,z)\big] - \mathrm{KL}\big(\phi(z|s)\,\|\,p(z)\big)\Big].$$

Then, we learn the score model f with the TD-error and the conservative regularization:

$$f \leftarrow \arg\min_f\ \mathbb{E}_{(s,a,s',a')\sim D}\Big[\big(R(s,a) + \gamma \bar{f}(s',a',\phi(z|s)) - f(s,a,\phi(z|s))\big)^2\Big],$$

$$\text{s.t.}\quad \mathbb{E}_{s\sim D,\, z\sim\mu(z),\, a\sim\beta(a|s,z)}\big[f(s,a,z)\big] - \mathbb{E}_{s\sim D,\, z\sim\phi(z|s),\, a\sim\beta(a|s,z)}\big[f(s,a,z)\big] \le \eta,$$

where $\bar{f}$ denotes a target network and $\mu(z)$ denotes the uniform distribution over the Z-space. In testing, we also dynamically adapt the outer-level optimization, setting policy inference with $\pi^*(a|s) = \beta(a|s, z^*(s))$, where $z^*(s) = \arg\max_z f(s, \beta(a|s,z), z)$.

In Table 6, we compare DROP+CVAE (-Grad-Ada) with LAPO (Chen et al., 2022), PLAS (Zhou et al., 2020), CQL (Kumar et al., 2020), IQL (Kostrikov et al., 2021b) and the naive implementation of DROP (-Grad-Ada) (conducting return-guided task decomposition and afterward learning the behavior embedding as in Equation 6). We highlight that even though there is a large performance gap between DROP and the baselines (LAPO and PLAS) in Gym-MuJoCo *-random tasks, our CVAE-based implementation (DROP+CVAE) can bridge this gap. Further, in *-medium tasks, DROP+CVAE [...]

Note that our embedding inference depends on the learned score model f. Without proper regularization, such inference will lead to out-of-distribution embeddings that are erroneously scored high (Q2). Here we conduct an ablation study to examine the impact of the conservative regularization used for learning the score model. In Figure 8, we compare DROP-Grad and DROP-Grad-Ada to their naive implementation (w/o Reg) that ablates the regularization on halfcheetah-medium-expert. Removing the conservative regularization leads to unstable performance when changing the number of update steps of the gradient-based optimization. However, we empirically find that in some tasks such a naive implementation (w/o Reg) does not necessarily bring unstable inference (Appendix E). Although an improper number of gradient update steps leads to faraway embeddings, the embedding-conditioned behavior policy can, to some extent, correct such deviation.
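To make the role of the gradient update steps concrete, the following toy sketch contrasts selecting among fixed behavior embeddings with refining them by K gradient-ascent steps before selection, in the spirit of DROP-Best and DROP-Grad-Ada. It is only an illustration under stated assumptions: the quadratic `score` stands in for the learned score model (it drops the action argument and carries no conservative regularization), and `score_grad`, `lr`, and all function names are hypothetical.

```python
def score(s, z):
    """Toy stand-in for the score f(s, beta(a|s, z), z): peaks when z equals s."""
    return -sum((zi - si) ** 2 for zi, si in zip(z, s))


def score_grad(s, z):
    """Analytic gradient of the toy score with respect to the embedding z."""
    return [-2.0 * (zi - si) for zi, si in zip(z, s)]


def grad_ascent(s, z, k, lr=0.1):
    """GradAscent(s, z, K): K gradient-ascent steps on the score, starting at z."""
    z = list(z)
    for _ in range(k):
        g = score_grad(s, z)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z


def drop_best(s, embeddings):
    """Select the behavior embedding with the highest score at state s."""
    return max(embeddings, key=lambda z: score(s, z))


def drop_grad_ada(s, embeddings, k, lr=0.1):
    """Refine every behavior embedding by gradient ascent at state s,
    then select the refined embedding with the highest score."""
    candidates = [grad_ascent(s, z, k, lr) for z in embeddings]
    return max(candidates, key=lambda z: score(s, z))
```

In an adaptive rollout, `drop_grad_ada` would be called again at every state, so the selected embedding can change over time. With a learned, unregularized score model, a large `k` can push `z` far from the behavior embeddings, which is the sensitivity the ablation examines.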

D IMPLEMENTATION DETAILS

For the practical implementation of DROP, we parameterize the task embedding function $\phi(z|n)$, the contextual behavior policy $\beta(a|s,z)$ and the score model $f(s,a,z)$ with neural networks (see Appendix D for specific architectures). For Equation 8, we construct a Lagrangian and solve the optimization through primal-dual gradient descent. For the choice of $\mu(z)$, we simply set $\mu(z)$ to be the uniform distribution over the Z-space and empirically find that such uniform sampling can effectively avoid out-of-distribution extrapolation at inference.

Lagrangian Relaxation. To optimize the constrained objective in Equation 8 in the main paper, we construct a Lagrangian and solve the optimization through primal-dual gradient descent,

$$\min_f \max_{\lambda>0}\ \mathbb{E}_{D_n\sim D_{[N]}}\,\mathbb{E}_{(s,a,s',a')\sim D_n}\Big[\big(R(s,a) + \gamma \bar{f}(s',a',\phi(z|n)) - f(s,a,\phi(z|n))\big)^2\Big] + \lambda\Big(\mathbb{E}_{n,\mu(z)}\,\mathbb{E}_{s\sim D_n,\, a\sim\beta(a|s,z)}\big[f(s,a,z)\big] - \mathbb{E}_{n,\phi(z|n)}\,\mathbb{E}_{s\sim D_n,\, a\sim\beta(a|s,z)}\big[f(s,a,z)\big] - \eta\Big),$$

where $\bar{f}$ denotes a target network. This unconstrained objective implies that if the expected difference in scores between out-of-distribution embeddings and in-distribution embeddings is less than the threshold $\eta$, $\lambda$ is driven toward 0; otherwise, $\lambda$ tends to take a larger value, which penalizes the over-estimated value function. The objective encourages out-of-distribution embeddings to score lower than in-distribution embeddings, so that embedding inference does not lead to out-of-distribution embeddings that are falsely and over-optimistically scored by the learned score model. In our experiments, we tried five different values for the Lagrange threshold $\eta$ (1.0, 2.0, 3.0, 4.0 and 5.0). We did not observe a significant difference in performance across these values. Therefore, we simply set $\eta = 2.0$.

Hyper-parameters. In Figure 9, we provide the network architecture of the task embedding $\phi(z|s)$, the contextual behavior policy $\beta(a|s,z)$, and the score function $f(s,a,z)$, where the corresponding hyper-parameters are provided in Table 7.
For the gradient ascent update steps (used for embedding inference), we set K = 100 for all the embedding inference rules in experiments. In Table 9, we provide the number of sub-tasks, the number of trajectories in each sub-task, and the dimension of the embedding for each sub-task (behavior policy). The selection of the hyperparameter N is based on two evaluation metrics: (1) the fitting loss of the decomposed behavioral policies to the offline data, and (2) the testing performance of DROP. Specifically,

• (Step1) Over a hyperparameter (the number of sub-tasks) set, we conduct the hyperparameter search using the fitting loss of behavior policies, then we choose/filter the four best hyperparameters;
• (Step2) We follow the normal practice of hyperparameter selection and tune the four hyperparameters selected in Step1 by interacting with the simulator to estimate the performance of DROP under each hyperparameter setting.

We provide the hyperparameter sets in Table 8:

Gym-mujoco: 10, 20, 50, 100, 200, 500, 800, 1000
Adroit: 10, 20, 50, 100, 200, 500, 800, 1000

In Step2, we tune the (filtered) hyperparameters using 1 seed, then evaluate the best hyperparameter by training on an additional 4 seeds and finally report the results on the 5 total seeds (see "Evaluation protocol" below). In the AntMaze domain, a single fixed N works well for many tasks, while in the Gym-mujoco and Adroit domains, we did not find a fixed N that provides good results for all tasks in the corresponding domain in D4RL; thus we use the above hyperparameter selection rules (Step1 and Step2) to choose the number N.

Environment details. For the comparison of our method to prior iterative offline RL methods, we consider the v0 versions of the datasets in D4RL [footnote 7]. We take the baseline results of BEAR, BCQ, CQL, and BRAC-p from the D4RL paper (Fu et al., 2020), and take the results of TD3+BC from the original paper (Fujimoto & Gu, 2021).
For the comparison of our method to prior non-iterative offline RL methods, we use the v2 versions of the datasets in D4RL. All the baseline results of behavior cloning (BC), Decision Transformer (DT), RvS-R, and Onestep are taken from Emmons et al. (2021). In our implementation of COMs, we take the parameters (neural network weights) of the behavior policies as the design input for the score model; during testing, we conduct parameter inference (outer-level optimization) with 200 steps of gradient ascent over the learned score function, and then initialize the rollout policy with the inferred parameters. For the specific architecture, we instantiate the policy network with dim(S) input units, two layers with 64 hidden units, and a final output layer with dim(A) units.

Evaluation protocol. We evaluate our results over 5 seeds. For each seed, instead of taking the final checkpoint model produced by a training loop, we take the last T (T = 6 in our experiments) checkpoint models, and evaluate them over 10 episodes for each checkpoint. That is to say, we report the average of the evaluation scores over 5 seeds × 6 checkpoints × 10 episode rollouts.

Online fine-tuning (checkpoint-level): Instead of re-training the learned (final) policy with online rollouts, we fine-tune our policy with enumerated trial-and-error over the last T checkpoint models (Algorithm 3). Specifically, for each seed, we run the last T checkpoint models in the environment over one episode for each checkpoint. The checkpoint model that achieves the maximum episode return is returned. In essence, this fine-tuning procedure imitates the online RL evaluation protocol: if the current policy is unsatisfactory, we can use checkpoints of previous iterations of the policy.

Online fine-tuning (embedding-level): The embedding-level fine-tuning aims to find a suitable gradient ascent step that is used to conduct the embedding inference in DROP-Grad or DROP-Grad-Ada.
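The two selection procedures described above can be sketched as plain enumeration loops. This is a minimal illustration, not the authors' code: `evaluate` is a hypothetical callable that runs one episode in the environment with the given checkpoint (and, for the embedding-level variant, the given number of gradient-ascent steps) and returns the episode return.

```python
def checkpoint_level_finetune(checkpoints, evaluate):
    """Run one episode per checkpoint and keep the checkpoint with the
    highest episode return."""
    best_ckpt, r_max = None, float("-inf")
    for ckpt in checkpoints:
        episode_return = evaluate(ckpt)
        if r_max < episode_return:
            best_ckpt, r_max = ckpt, episode_return
    return best_ckpt, r_max


def embedding_level_finetune(checkpoints, steps, evaluate):
    """Enumerate (checkpoint, number of gradient-ascent steps) pairs and keep
    the pair with the highest episode return."""
    best_ckpt, best_k, r_max = None, None, float("-inf")
    for ckpt in checkpoints:
        for k in steps:
            episode_return = evaluate(ckpt, k)
            if r_max < episode_return:
                best_ckpt, best_k, r_max = ckpt, k, episode_return
    return best_ckpt, best_k, r_max
```

With 6 checkpoints, 50 candidate step counts, and 1000 environment steps per episode, this enumeration costs 6 × 50 × 1000 = 0.3M online steps, matching the budget stated in the Table 5 caption.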

E ADDITIONAL RESULTS

Comparison with iterative offline RL baselines. Here, we compare the performance of DROP (Grad, Best-Ada, and Grad-Ada) to iterative offline RL baselines (BEAR (Kumar et al., 2019a), BCQ (Fujimoto et al., 2019), CQL (Kumar et al., 2020), BRAC-p (Wu et al., 2019), and TD3+BC (Fujimoto & Gu, 2021)) that follow the iterative bi-level offline RL paradigm with (explicit or implicit) value/policy regularization in the inner level. In Table 10, we present the results for the Maze2D, AntMaze, Gym-MuJoCo, and Adroit suites in the standard D4RL benchmark (*-v0), where DROP-Grad-Ada performs comparably to or surpasses prior iterative bi-level works on most tasks: it outperforms (or matches) the policy-regularized methods (BRAC-p and TD3+BC) on 25 out of 33 tasks and outperforms (or matches) the value-regularized algorithms (BEAR, BCQ, and CQL) on 19 out of 33 tasks.

Comparison with RvS baselines. As a complement to the results shown in Table 2 in the main paper, we provide the performance comparison for more tasks (Gym-MuJoCo and AntMaze *-v2 suites in D4RL) in Figure 11. We find results similar to those in the main paper: DROP consistently outperforms the baselines in AntMaze tasks (the last three rows of sub-figures in Figure 11) and reaches comparable results on most tasks in the Gym-MuJoCo suite. Across different environments, we also find that DROP exhibits more robust performance. Although the Onestep baseline shows impressive performance in Gym-MuJoCo tasks, it fails to make progress in AntMaze-medium-* and AntMaze-large-* tasks, whereas DROP-based methods exhibit a significant performance improvement in this AntMaze suite. We attribute the success of DROP over Onestep (which conducts only behavior policy improvement) to three advantages: (1) DROP learns multiple behavior policies; (2) DROP conducts policy improvement (corresponding to the embedding inference procedure) over multiple behavior policies; (3) DROP permits deployment adaptation, enabling the agent to "switch" behavior policies.

Ablation studies. In Figure 10, we provide more results for the ablation of the conservative regularization term in Equation 8 in the main paper. For the halfcheetah-medium and hopper-medium tasks, the performance of DROP-Grad-Ada w/o Reg depends on the choice of the number of gradient update steps: too few or too many gradient update steps deteriorate the performance. This result is consistent with COMs (Trabucco et al., 2021), which also observes the sensitivity of the naive gradient update (i.e., w/o Reg) to the number of update steps used for design input inference. By comparison, the conservative score model learned with DROP-Grad-Ada is more stable and robust to the number of gradient update steps. Further, we also find that in walker2d-medium and walker2d-medium-expert tasks, the naive gradient update (w/o Reg) does not affect performance significantly across a wide range of gradient update steps. The main reason is that although excessive gradient updates lead to faraway embeddings, the learned contextual behavior policy, conditioned on the inferred embeddings, can safeguard against the embedding distribution shift. Compared to prior model-based optimization that conducts direct gradient optimization (inference) over the design input itself, such "self-safeguard" is a special merit in the offline RL domain as long as we reframe the offline RL problem as one of model-based optimization and conduct inference over the embedding space. Thus, we encourage the research community to further study this model-based optimization view of the offline RL problem.

DROP results. In Table 11, we provide our complete results (including the variance) on all tasks in the paper.

Table 10: Comparison of our method to prior offline methods that follow the iterative (regularized) RL paradigm on D4RL. We take the baseline results of BEAR, BCQ, CQL and BRAC-p from Fu et al. (2020), and the results of TD3+BC from Fujimoto & Gu (2021). For all results of our method (DROP), we average the normalized returns across 5 seeds; for each seed, we run 10 evaluation episodes. For proper comparison, we use two markers to denote that DROP (*-Ada) achieves comparable or better performance compared with value-regularized and policy-regularized offline RL methods, respectively.



Footnote 0: Next we will use A1, A2, and A3 to denote our answers to the raised questions (Q1, Q2, and Q3) respectively.

Footnote 1: Please note that this MBO is different from regular model-based RL (MBRL for short): the model in MBO denotes a score model, while that in MBRL denotes the transition dynamics (or reward) model.

Footnote 2: We compare three different decomposition rules in Appendix C.2. Further, in Appendix C.4, we adopt CVAE to conduct automatic task decomposition (treating each trajectory as an individual task).

Footnote 3: We feed the one-hot encoding of the sub-task specification (n = 1, . . . , N) into the embedding network $\phi$.

Footnote 4: Due to the page limit, we mainly provide the comparison with prior iterative baselines in Appendix E.

Footnote 5: In view of such instability, we evaluate our methods over multiple checkpoints for each seed, instead of choosing the final checkpoint models during the training loop (see the detailed evaluation protocol in Appendix D).

Footnote 6: Conversely, if the performance of DROP depended on the best embedding across sub-tasks (i.e., $z^*_0(s_0)$ in DROP-Best), then the checkpoint model we choose by fine-tuning with DROP-Best should enable a consistent performance improvement for rules that perform embedding inference with DROP-Best-Ada and DROP-Grad-Ada. However, we find a performance drop in the AntMaze (*-v0) suite, which means there is no explicit dependency between the best embedding $z^*_0(s_0)$ and the embedding inferred using the adaptive inference rules (DROP-*-Ada).

Footnote 7: We noticed that Maze2D-v0 in the D4RL dataset (https://rail.eecs.berkeley.edu/datasets/) is not available, so we used the v1 version instead in our experiments. For simplicity, we still use v0 in the paper exposition.



Figure 2: Overview of DROP. Given static offline dataset D, we decompose the data into N (= 4 in diagram) subsets {D n |n = 1, . . . , N }, over which we learn a task embedding φ(z|n) and conduct MBO by learning multiple behavior policies (modeled by a contextual policy) β(a|s, z) and a score model f (s, a, z). During testing, at state s, we can adapt the optimal policy (contextual variable/embedding) with π * (a|s) = β (a|s, z * (s)), where z * (s) = arg max z f (s, β(a|s, z), z).

Dataset of trajectories, $D = \{\tau\}$.
1: Initialize $\phi(z|n)$, $\beta(a|s,z)$, and $f(s,a,z)$.
2: Decompose D into N sub-sets $D_{[N]}$.
3: while not converged do
4: Sample a sub-task: $D_n \sim D_{[N]}$.

Figure 3: Visualization of the embedding inference (a, b, c1, c2) and performance comparison (d). $z_{[N]}$ denotes the embeddings of all behavior policies; $z^*_0(s_0)$, $z^*(s_0)$, $z^*_0(s_t)$ and $z^*(s_t)$ denote the selected embeddings in DROP-Best, DROP-Grad, DROP-Best-Ada and DROP-Grad-Ada respectively.

Figure 4: Learning curves of behavior cloning on AntMaze suites (*-v2) in D4RL, where the x-axis denotes the training steps, and the y-axis denotes the training loss. The number N in the legend denotes the number of sub-tasks. If N = 1, we learn a single behavior policy for the whole offline dataset.

Figure 5: Comparison of three different decomposition rules on D4RL MuJoCo-Gym suite (*-v0) and AntMaze suite (*-v2), where "Rand", "Quan" and "Rank" denote the Random, Quantization, and Rank decomposition rules respectively. We can find across 18 tasks (AntMaze and MuJoCo-Gym suites) and 3 embedding inference methods (DROP-Grad, DROP-Best-Ada, and DROP-Grad-Ada), Rank is more stable and yields better performance compared with the other two decomposition rules.

Figure 6: Learning curves of DROP, where the x-axis denotes the training steps (k), y-axis denotes the evaluation return (using DROP-Best embedding inference rule). We only show two seeds for legibility.

Figure 7: Total normalized returns across all the tasks in Maze2d, AntMaze, and MuJoCo-Gym suites.

Figure 8: Ablation on the conservative regularization. The y-axis represents the normalized return, and the x-axis represents the number of gradient-ascent steps used for embedding inference at deployment. We plot each random seed as a transparent line; the solid line corresponds to the average across 5 seeds.

Figure 9: Architectures of the task embedding network φ(z|s), the contextual behavior policy β(a|s, z), and the score function f (s, a, z) (from left to right).

Figure 10: The performance comparison of DROP-Grad-Ada and DROP-Grad-Ada w/o Reg, where we ablate the conservative regularization for the w/o Reg implementation. The y-axis denotes the normalized return, and the x-axis denotes the number of gradient-ascent steps used for embedding inference at deployment.

Figure 11: Comparison with non-iterative methods on D4RL (*-v2), where ha = halfcheetah, ho = hopper, and wa = walker2d. Each bar denotes the average of normalized returns. The baseline results of behavior cloning (BC), Decision Transformer (DT), RvS-R, and Onestep are taken from Emmons et al. (2021). Our DROP results are computed over 5 seeds and 10 episodes for each seed.

Comparison on D4RL *-v2.

Comparison (on D4RL benchmark)  between DROP (including the implementation of return-guided task decomposition and the implementation CVAE-based embedding learning), latent policy baselines (LAPO and PLAS) and other two representative baselines (CQL and IQL).

Hyper-parameters.

Algorithm 3 DROP: Online fine-tuning (checkpoint-level)
Require: Env, last T checkpoint models: $\beta_t(a|s,z)$ and $f_t(s,a,z)$ $(t = 1, \cdots, T)$.
1: $R_{\mathrm{MAX}} = -\infty$.
2: $\beta_{\mathrm{best}} \leftarrow$ None.
Evaluate $\beta_t$ and $f_t$ on Env, setting $\pi^*(a|s) = \beta(a|s, z^*_0(s_0))$.
Update the best checkpoint models: $\beta_{\mathrm{best}} \leftarrow \beta_t$, $f_{\mathrm{best}} \leftarrow f_t$.

Hyperparameter (the number of sub-tasks) set.

The number (N ) of sub-tasks, the number (M ) of trajectories in each sub-task, and the dimension (dim(z)) of the embedding for each sub-task.


Algorithm 4 DROP: Online fine-tuning (embedding-level)
Require: Env, last T checkpoint models: $\beta_t(a|s,z)$ and $f_t(s,a,z)$
$s_0$ = Env.Reset().
# Conduct embedding inference with DROP-Grad or DROP-Grad-Ada
8: Return ← Evaluate $\beta_t$ and $f_t$ on Env, setting $\pi^*(a|s) = \beta(a|s, z^*(s_0))$ or $\beta(a|s, z^*(s))$, where we conduct k gradient ascent steps to obtain $z^*(s_0)$ or $z^*(s)$.
9: if $R_{\mathrm{MAX}}$ < Return then
10: Update the best checkpoint models $\beta_{\mathrm{best}}$ and $f_{\mathrm{best}}$.
11: Update the best gradient update step: $k_{\mathrm{best}} \leftarrow k$.
12: Update the optimal return: $R_{\mathrm{MAX}} \leftarrow$ Return.
13: end if
14: end while
15: end while
Return: $\beta_{\mathrm{best}}$, $f_{\mathrm{best}}$ and $k_{\mathrm{best}}$.

Thus, we enumerate a list of gradient update steps and pick the best update step (according to the episode returns).

Codebase. Our code is based on d3rlpy: https://github.com/takuseno/d3rlpy. We provide our source code in the supplementary material.

Computational resources. The experiments were run on a computational cluster with 22x GeForce RTX 2080 Ti, and 4x NVIDIA Tesla V100 32GB for 20 days.

