CONSERWEIGHTIVE BEHAVIORAL CLONING FOR RELIABLE OFFLINE REINFORCEMENT LEARNING

Abstract

The goal of offline reinforcement learning (RL) is to learn near-optimal policies from static logged datasets, thus sidestepping expensive online interactions. Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that by conditioning on desired future returns, BC can perform competitively with its value-based counterparts, while being simpler and more stable to train. However, the distribution of returns in the offline dataset can be arbitrarily skewed and suboptimal, which poses a unique challenge for conditioning BC on expert returns at test time. We propose ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the performance of conditional BC for offline RL with two key components: trajectory weighting and conservative regularization. Trajectory weighting addresses the bias-variance tradeoff in conditional BC and provides a principled mechanism to learn from both low-return trajectories (typically plentiful) and high-return trajectories (typically few). Further, we analyze the notion of conservatism in existing BC methods, and propose a novel conservative regularizer that explicitly encourages the policy to stay close to the data distribution. The regularizer helps achieve more reliable performance, and removes the need for ad-hoc tuning of the conditioning value during evaluation. We instantiate CWBC in the context of Reinforcement Learning via Supervised Learning (RvS) (Emmons et al., 2021) and Decision Transformer (DT) (Chen et al., 2021), and empirically show that it significantly boosts the performance and stability of prior methods on various offline RL benchmarks.

1. INTRODUCTION

In many real-world applications such as education, healthcare, and autonomous driving, collecting data via online interactions can be expensive or even dangerous. However, we often have access to historical logged datasets in these domains that have been collected previously by some unknown policies. The goal of offline reinforcement learning (RL) is to directly learn effective agent policies from such datasets, without additional online interactions (Lange et al., 2012; Levine et al., 2020). Many online RL algorithms have been adapted to work in the offline setting, including value-based methods (Fujimoto et al., 2019; Ghasemipour et al., 2021; Wu et al., 2019; Jaques et al., 2019; Kumar et al., 2020; Fujimoto & Gu, 2021; Kostrikov et al., 2021a) as well as model-based methods (Yu et al., 2020; Kidambi et al., 2020). The key challenge in all these methods is to generalize the value or dynamics to state-action pairs outside the offline dataset. An alternative way to approach offline RL is via approaches derived from behavioral cloning (BC) (Bain & Sammut, 1995). BC is a supervised learning technique that was initially developed for imitation learning, where the goal is to learn a policy that mimics expert demonstrations. Recently, a number of works propose to formulate offline RL as a supervised learning problem (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021). Since offline RL datasets usually do not contain expert demonstrations, these works condition BC on extra context information to specify target outcomes such as returns and goals. The empirical evidence has shown that these conditional BC approaches perform competitively with value-based approaches, and they additionally enjoy the simplicity and training stability of supervised learning.
As commonly observed for supervised learning approaches, the performance of conditional BC is often limited by the suboptimality of the offline dataset, which can be probed through the distribution of returns in the dataset. There are two related challenges in this regard for offline RL. First, there is a unique bias-variance tradeoff in learning that arises due to the mismatch between the training and test distribution of returns. Typically, offline datasets in the real world mostly contain trajectories with low returns, whereas at test time, we are interested in conditioning on high returns. Simply filtering the offline dataset to contain high-return trajectories is not always viable, as the number of such trajectories can be very low, leading to high variance during learning. Second, the maximum return in the offline trajectories is often far below the desired expert returns. This implies that at test time, we need to condition our agent on out-of-distribution (ood) expert returns. Interestingly, we find that existing BC methods behave significantly differently when conditioned on ood returns. While DT (Chen et al., 2021) enjoys stable performance, RvS (Emmons et al., 2021) is highly sensitive to such ood conditioning and exhibits vast drops in peak performance for such ood inputs. Therefore, the current practice for setting the conditioning return at test time in RvS is based on careful tuning with online rollouts, which is often tedious, impractical, and inconsistent with the promise of offline RL to minimize online interactions. We propose ConserWeightive Behavioral Cloning (CWBC), a new BC-based approach for offline RL that mitigates the aforementioned challenges. CWBC consists of two key components: trajectory weighting and conservative regularization.
With trajectory weighting, we strive to balance the bias-variance trade-off in learning by downweighting the low-return trajectories without filtering them out, preserving data efficiency. Moreover, we introduce a notion of conservatism for ood-sensitive BC methods such as RvS, which encourages the policy to stay close to the data distribution when conditioning on large returns. We take trajectories with high returns from the dataset and add positive noise to their returns, which generates trajectories with large ood returns. We then predict actions conditioned on the perturbed returns and penalize their $\ell_2$ distance to the ground-truth actions. By imposing such a regularizer, we can condition the policy on large, unseen target returns at test time, sidestepping tedious manual tuning and online interactions. Our proposed algorithm is simple and easy to implement. Empirically, we instantiate our framework in the context of RvS (Emmons et al., 2021) and DT (Chen et al., 2021), two state-of-the-art BC methods for offline RL. CWBC significantly improves the performance of RvS and DT on the D4RL (Fu et al., 2020) locomotion tasks by 18% and 8%, respectively, without any hand-picking of the conditioning return at test time.

2. RELATED WORK

Offline Temporal Difference Learning. Most existing off-policy RL methods are based on temporal difference (TD) updates. A key challenge of directly applying them in the offline setting is extrapolation error: the value function is poorly estimated at unseen state-action pairs. To remedy this issue, various forms of conservatism have been introduced to off-policy RL methods that exploit temporal difference updates, with the purpose of encouraging the learned policy to stay close to the behavior policy that generated the data. For instance, Fujimoto et al. (2019) and Ghasemipour et al. (2021) use policy parameterizations specifically tailored for offline RL. Wu et al. (2019); Jaques et al. (2019); Kumar et al. (2019) penalize divergence-based distances between the learned policy and the behavior policy. Fujimoto & Gu (2021) propose an extra behavior cloning term to regularize the policy. This regularizer is simply the $\ell_2$ distance between predicted actions and the ground truth, yet it is surprisingly effective for porting off-policy TD methods to the offline setting. Instead of regularizing the policy, several other works have sought to incorporate divergence regularizations into the value function estimation, e.g., (Nachum et al., 2019; Kumar et al., 2020; Kostrikov et al., 2021a). Another recent work by Kostrikov et al. (2021b) predicts the Q function via expectile regression, where the estimate of the maximum Q-value is constrained to lie in the dataset.

Behavior Cloning Approaches for Offline RL. Recently, there has been a surge of interest in converting offline RL into supervised learning paradigms (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021). In essence, these approaches conduct behavior cloning (Bain & Sammut, 1995) while additionally conditioning on extra information such as goals or rewards. Among these works, Chen et al. (2021) and Janner et al. (2021) have formulated offline RL as a sequence modeling problem and train transformer architectures (Vaswani et al., 2017) in a similar fashion to language and vision (Radford et al., 2018; Chen et al., 2020; Brown et al., 2020; Lu et al., 2022; Yan et al., 2021). Extensions have also been proposed in the context of sequential decision making for offline black-box optimization (Nguyen & Grover, 2022; Krishnamoorthy et al., 2022). A recent work by Emmons et al. (2021) further shows that conditional BC can achieve competitive performance even with a simple but carefully designed MLP network. Earlier, similar ideas were proposed for online RL, where the policy is trained via supervised learning to fit the data stored in the replay buffer (Schmidhuber, 2019; Srivastava et al., 2019; Ghosh et al., 2019).

Data Exploration for Offline RL. Recent research efforts have also been made towards understanding the properties and limitations of datasets used for offline RL (Yarats et al., 2022; Lambert et al., 2022; Guo et al., 2021), particularly focusing on exploration techniques during data collection. Both Yarats et al. (2022) and Lambert et al. (2022) collect datasets using task-agnostic exploration strategies (Laskin et al., 2021), relabel the rewards, and train offline RL algorithms on them. Yarats et al. (2022) benchmark multiple offline RL algorithms on different tasks including transfer, whereas Lambert et al. (2022) focus on improving the exploration method.

3. PRELIMINARIES

We model our environment as a Markov decision process (MDP) (Bellman, 1957), described by a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, p, P, R, \gamma \rangle$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s_1)$ is the distribution of the initial state, $P(s_{t+1} \mid s_t, a_t)$ is the transition probability distribution, $R(s_t, a_t)$ is the deterministic reward function, and $\gamma$ is the discount factor. At each timestep $t$, the agent observes a state $s_t \in \mathcal{S}$ and takes an action $a_t \in \mathcal{A}$. This moves the agent to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ and provides the agent with a reward $r_t = R(s_t, a_t)$.

Offline RL. We are interested in learning a (near-)optimal policy from a static offline dataset of trajectories collected by unknown policies, denoted as $\mathcal{T}_{\text{offline}}$. We assume that these trajectories are i.i.d. samples drawn from some unknown static distribution $\mathcal{T}$. We use $\tau$ to denote a trajectory and $|\tau|$ to denote its length. Following Chen et al. (2021), the return-to-go (RTG) for a trajectory $\tau$ at timestep $t$ is defined as the sum of rewards from $t$ until the end of the trajectory: $g_t = \sum_{t'=t}^{|\tau|} r_{t'}$. This means the initial RTG $g_1$ equals the total return of the trajectory, $r_\tau = \sum_{t=1}^{|\tau|} r_t$.

Decision Transformer (DT). DT (Chen et al., 2021) solves offline RL via sequence modeling. Specifically, DT employs a transformer architecture that generates actions given a sequence of historical states and RTGs. To do so, DT first transforms each trajectory in the dataset into a sequence of returns-to-go, states, and actions: $\tau = (g_1, s_1, a_1, g_2, s_2, a_2, \ldots, g_{|\tau|}, s_{|\tau|}, a_{|\tau|})$. DT trains a policy that generates action $a_t$ at each timestep $t$ conditioned on the history of RTGs $g_{t-K:t}$, states $s_{t-K:t}$, and actions $a_{t-K:t-1}$, where $K$ is the context length of the transformer.
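For concreteness, the return-to-go sequence can be computed from a trajectory's rewards with a reverse cumulative sum. The sketch below is illustrative only; the function name is ours, not from the authors' codebase:

```python
def returns_to_go(rewards):
    """Compute g_t = r_t + r_{t+1} + ... + r_{|tau|} for every timestep t.

    The first entry g_1 equals the trajectory's total return r_tau.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each position accumulates all future rewards.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg
```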
The learning objective is a simple mean squared error between the predicted actions and the ground truths: $\min_\theta \mathcal{L}_{\text{DT}}(\theta) = \mathbb{E}_{\tau \sim \mathcal{T}} \big[ \frac{1}{|\tau|} \sum_{t=1}^{|\tau|} \big( a_t - \pi_\theta(g_{t-K:t}, s_{t-K:t}, a_{t-K:t-1}) \big)^2 \big]$. During evaluation, DT starts with an initial state $s_1$ and a target RTG $g_1$. At each step $t$, the agent generates an action $a_t$, receives a reward $r_t$, and observes the next state $s_{t+1}$. DT updates its RTG via $g_{t+1} = g_t - r_t$ and generates the next action $a_{t+1}$. This process is repeated until the end of the episode.

Reinforcement Learning via Supervised Learning (RvS). Emmons et al. (2021) conduct a thorough empirical study of conditional BC methods under the umbrella of Reinforcement Learning via Supervised Learning (RvS), and show that even simple models such as multi-layer perceptrons (MLPs) can perform well. With carefully chosen architectures and hyperparameters, they match or exceed the performance of transformer-based models. There are two main differences between RvS and DT. First, RvS conditions on the average future reward $\omega_t$ instead of the sum of future rewards: $\omega_t = \frac{1}{H - t + 1} \sum_{t'=t}^{|\tau|} r_{t'} = \frac{g_t}{H - t + 1}$, where $H$ is the maximum episode length. Intuitively, $\omega_t$ is the RTG normalized by the number of remaining steps. Second, RvS employs a simple MLP architecture, which generates action $a_t$ at step $t$ based only on the current state $s_t$ and the expected outcome $\omega_t$. RvS minimizes a mean squared error: $\min_\theta \mathcal{L}_{\text{RvS}}(\theta) = \mathbb{E}_{\tau \sim \mathcal{T}} \big[ \frac{1}{|\tau|} \sum_{t=1}^{|\tau|} \big( a_t - \pi_\theta(s_t, \omega_t) \big)^2 \big]$. At evaluation time, RvS follows the same iterative procedure as DT, except that the expected outcome is updated as $\omega_{t+1} = (g_t - r_t)/(H - t)$.

Figure 1: The suboptimality of offline datasets (left) and the effect of trajectory weighting on the return distribution (right), illustrated on walker2d-med-replay. For weighting, we use $B = 20$, $\lambda = 0.01$, $\kappa = \hat{r}^\star - \hat{r}_{90}$, where $\hat{r}_{90}$ is the 90-th percentile of the returns in the offline dataset.
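The evaluation loop shared by DT and RvS can be sketched as follows, here using RvS's state-plus-$\omega_t$ conditioning. This is a minimal illustration under our own assumptions: `policy` and `env_step` are hypothetical stand-ins for a trained model and an environment transition function, not the authors' API:

```python
def rollout_rvs(policy, env_step, s1, target_return, H):
    """Roll out an RvS-style policy for at most H steps (timesteps are 1-indexed).

    At step t the conditioning signal is omega_t = g_t / (H - t + 1);
    after observing reward r_t, the RTG is updated as g_{t+1} = g_t - r_t,
    which gives omega_{t+1} = (g_t - r_t) / (H - t).
    """
    s, g, total = s1, float(target_return), 0.0
    for t in range(1, H + 1):
        omega = g / (H - t + 1)      # average reward demanded per remaining step
        a = policy(s, omega)
        s, r, done = env_step(s, a)
        total += r
        g -= r                       # RTG update: g_{t+1} = g_t - r_t
        if done:
            break
    return total
```

For DT, the same loop applies, except the policy consumes the last $K$ states, RTGs, and actions, and conditions directly on $g_t$ rather than $\omega_t$.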

4. CONSERVATIVE BEHAVIORAL CLONING WITH TRAJECTORY WEIGHTING

A key challenge that behavioral cloning faces in the offline setting is the suboptimality of the dataset, which we can characterize via the distribution of trajectory returns. An ideal offline dataset consists of sufficiently many high-quality trajectories, whose returns match those of a dataset of expert demonstrations. In such an idealized scenario, offline RL reduces to a vanilla imitation learning problem. In practice, however, we observe that the return distribution of a typical offline dataset is spread over a wide range of returns and is highly non-uniform. Figure 1a illustrates the return distribution of the walker2d-med-replay dataset (Fu et al., 2020), which differs significantly from the expert distribution. Therefore, from a return perspective, the trajectories in the offline dataset can be of varying importance for learning, which leads to a bias-variance trade-off. Further, for return-conditioned methods including conditional BC, it is unclear how the policy will behave when conditioned on ood returns at test time. We study mitigation techniques for both of these challenges in the following sections.

4.1. CONTROLLING BIAS-VARIANCE TRADEOFF VIA TRAJECTORY WEIGHTING

To formalize our discussion, recall that $r_\tau$ denotes the return of a trajectory $\tau$, and let $r^\star = \sup_\tau r_\tau$ be the maximum expert return, which is assumed to be known in prior works on conditional BC (Chen et al., 2021; Emmons et al., 2021). The optimal offline data distribution, denoted by $\mathcal{T}^\star$, is simply the distribution of demonstrations rolled out from the optimal policy. Typically, the offline trajectory distribution $\mathcal{T}$ will be biased w.r.t. $\mathcal{T}^\star$. During learning, this leads to a bias-variance tradeoff: ideally we want our BC agent to condition on expert returns, but we are forced to minimize the empirical risk on a biased data distribution. The core idea of our approach is to transform $\mathcal{T}$ into a new distribution $\tilde{\mathcal{T}}$ that better estimates $\mathcal{T}^\star$. More concretely, $\tilde{\mathcal{T}}$ should concentrate on high-return trajectories, which mitigates the bias. One naive strategy is to simply filter out the small fraction of high-return trajectories from the offline dataset. However, since we expect the original dataset to contain very few high-return trajectories, filtering will increase the variance of downstream BC. To balance the bias-variance trade-off, we propose to weight the trajectories by their returns. Let $f_{\mathcal{T}}: \mathbb{R} \to \mathbb{R}_+$ be the density function of $r_\tau$ where $\tau \sim \mathcal{T}$. We consider the transformed distribution $\tilde{\mathcal{T}}$ whose density satisfies

$f_{\tilde{\mathcal{T}}}(r_\tau) \propto \dfrac{f_{\mathcal{T}}(r_\tau)}{f_{\mathcal{T}}(r_\tau) + \lambda} \cdot \exp\left(-\dfrac{|r_\tau - r^\star|}{\kappa}\right),$    (5)

where $\lambda, \kappa \in \mathbb{R}_+$ are two hyperparameters. A larger value of $\kappa$ leads to a more uniform $\tilde{\mathcal{T}}$, whereas a smaller value upweights the high-return trajectories. Similarly, a smaller value of $\lambda$ gives more weight to high-return trajectories, while a larger value makes $\tilde{\mathcal{T}}$ closer to $\mathcal{T}$. Our trajectory weighting is motivated by a similar scheme proposed for model-based optimization (Kumar & Levine, 2020), where the authors use it to balance the bias and variance of gradient approximations for surrogates to black-box functions, and theoretically establish the optimality of the proposed distribution.

Algorithm 1: Weighted Trajectory Sampling
Input: offline dataset $\mathcal{T}_{\text{offline}}$, number of bins $B$, smoothing parameters $\lambda$, $\kappa$
Compute the returns: $r_\tau \leftarrow \sum_{t=1}^{|\tau|} r_t$ for all $\tau \in \mathcal{T}_{\text{offline}}$.
Group the trajectories into $B$ equal-sized bins according to $r_\tau$.
Sample a bin $b \in [B]$ with probability $P_{\text{bin}}(b)$ defined in Equation (6).
Sample a trajectory $\tau$ in bin $b$ uniformly at random.
Output: $\tau$

Table 1: The normalized return on D4RL locomotion tasks of RvS and DT with trajectory weighting. We use +W as shorthand for weighting, and #wins to denote the number of datasets on which the variant outperforms the original model. The results are averaged over 10 seeds.
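To make the roles of $\lambda$ and $\kappa$ in the trajectory-weighting density concrete, the sketch below evaluates the unnormalized transformed density for a given return density. All names are ours, assumed for illustration:

```python
import math

def transformed_density(f, r, r_star, lam, kappa):
    """Unnormalized density of the reweighted return distribution (Eq. 5).

    f: density of returns under the original offline distribution.
    A small kappa concentrates mass near r_star; a small lam further
    upweights rare high-return regions where f(r) itself is small.
    """
    return f(r) / (f(r) + lam) * math.exp(-abs(r - r_star) / kappa)
```

For example, with a flat return density, a small $\kappa$ makes the weight of a return far below $r^\star$ exponentially smaller than the weight of a return near $r^\star$, while a large $\kappa$ flattens the two weights toward each other.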

4.1.1. IMPLEMENTATION DETAILS

In practice, the dataset $\mathcal{T}_{\text{offline}}$ only contains a finite number of samples, and the density function $f_{\tilde{\mathcal{T}}}$ in Equation (5) cannot be computed exactly. Following Kumar & Levine (2020), we sample from a discretized approximation of $\tilde{\mathcal{T}}$. We first group the trajectories in $\mathcal{T}_{\text{offline}}$ into $B$ equal-sized bins according to the return $r_\tau$. To sample a trajectory, we first sample a bin index $b \in \{1, \ldots, B\}$ and then uniformly sample a trajectory inside bin $b$. We use $|b|$ to denote the size of bin $b$. Let $\bar{r}_b = \frac{1}{|b|} \sum_{\tau \in b} r_\tau$ be the average return of the trajectories in bin $b$, let $\hat{r}^\star$ be the highest return in the dataset $\mathcal{T}_{\text{offline}}$, and define $f_{\mathcal{T}_{\text{offline}}}(b) = |b| / |\mathcal{T}_{\text{offline}}|$. As a discretized version of Equation (5), the bins are weighted by their average returns with probability

$P_{\text{bin}}(b) \propto \dfrac{f_{\mathcal{T}_{\text{offline}}}(b)}{f_{\mathcal{T}_{\text{offline}}}(b) + \lambda} \cdot \exp\left(-\dfrac{|\bar{r}_b - \hat{r}^\star|}{\kappa}\right).$    (6)

Algorithm 1 summarizes the data sampling procedure when trajectory weighting is used. Figure 1b illustrates the impact of trajectory weighting on the return distribution of the med-replay dataset for the walker2d environment. We plot the histograms before and after the transformation, where the density curves are estimated with kernel density estimators.
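A minimal sketch of this discretized sampling procedure (Algorithm 1), assuming equal-sized bins over returns sorted in ascending order; the function names and signatures are ours, not the official implementation:

```python
import math
import random

def sample_trajectory_index(returns, B=20, lam=0.01, kappa=1.0, rng=None):
    """Sample one trajectory index via the discretized weighting scheme:
    sort trajectories by return into B equal-sized bins, weight the bins
    by Eq. (6), draw a bin, then draw a trajectory uniformly inside it."""
    rng = rng or random.Random()
    order = sorted(range(len(returns)), key=lambda i: returns[i])
    size = len(order) // B
    bins = [order[k * size:(k + 1) * size] for k in range(B - 1)]
    bins.append(order[(B - 1) * size:])  # last bin absorbs the remainder
    r_hat_star = max(returns)
    weights = []
    for b in bins:
        f_b = len(b) / len(returns)                      # f_{T_offline}(b)
        mean_b = sum(returns[i] for i in b) / len(b)     # average return in bin
        weights.append(f_b / (f_b + lam)
                       * math.exp(-abs(mean_b - r_hat_star) / kappa))
    (chosen_bin,) = rng.choices(bins, weights=weights, k=1)
    return rng.choice(chosen_bin)
```

With a small $\kappa$, draws concentrate heavily in the highest-return bins while still occasionally visiting lower bins, matching the bias-variance motivation above.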

Dataset

We evaluate the effectiveness of trajectory weighting on three locomotion tasks with dense rewards from the D4RL benchmark (Fu et al., 2020): hopper, walker2d, and halfcheetah. For each task, we consider the v2 medium, med-replay, and med-expert offline datasets. The medium dataset contains 1M samples from a policy trained to approximately 1/3 the performance of an expert policy. The med-replay dataset uses the replay buffer of a policy trained up to the performance of a medium policy. The med-expert dataset contains 1M samples generated by a medium policy and 1M samples generated by an expert policy.

Baselines. We apply trajectory weighting to RvS (Emmons et al., 2021) and DT (Chen et al., 2021), two state-of-the-art BC methods. We compare their performance when trained on the original distribution and on the transformed distribution induced by our trajectory weighting (denoted as +W).

Hyperparameters. For all datasets, we use $B = 20$ and $\lambda = 0.01$, and we set the temperature parameter $\kappa$ to be the difference between the highest return and the 90-th percentile, $\hat{r}^\star - \hat{r}_{90}$, whose value varies across datasets. At test time, we set the evaluation RTG to be the expert return for each environment. The model architecture and the other hyperparameters are identical to those used in the original papers. We provide a complete list of hyperparameters in Appendix B.2 and additional ablation experiments on $\lambda$ and $\kappa$ in Appendix C.
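The adaptive choice of $\kappa$ above can be computed directly from the dataset's returns. A small sketch assuming NumPy; the helper name is ours:

```python
import numpy as np

def kappa_from_percentile(returns, z=90):
    """kappa = (highest return in the dataset) - (z-th percentile return)."""
    returns = np.asarray(returns, dtype=float)
    return float(returns.max() - np.percentile(returns, z))
```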

Results

Table 1 shows the performance of RvS and DT and their weighted variants. DT+W outperforms the original DT on 6/9 datasets, achieving an average improvement of 8%. The improvement is most significant on low-quality datasets (med-replay), which agrees with our analysis. Unlike for DT, trajectory weighting has varying effects on RvS, and the average performance of RvS+W is not better than that of RvS. To better understand this, we plot the returns achieved by RvS and DT when conditioning on different values of RTG. Figure 2 shows an interesting difference between the behaviors of DT and RvS. DT is insensitive to the conditioning RTG, and continues to perform stably even when conditioned on out-of-distribution RTGs. In contrast, the performance of RvS correlates strongly with the evaluation RTG, but degrades quickly after a certain threshold. This performance crash problem of RvS overshadows the improvement made by trajectory weighting.

4.2. RELIABLE EVALUATION VIA CONSERVATISM

The results in Section 4.1.2 highlight another challenging problem for return-conditioned BC in offline RL: generalization to out-of-distribution (ood) returns. While strong generalization beyond the offline dataset remains an ongoing challenge for the offline RL community (Wang et al., 2020; Zanette, 2021; Foster et al., 2021), we require the policy to be reliable and at least stay close to the data distribution to avoid catastrophic failure when conditioned on ood returns. In other words, we want the policy to be conservative. Figure 2 shows that DT enjoys self-conservatism, while RvS does not. We hypothesize that the conservative behavior of DT comes from the transformer architecture: since the policy conditions on a sequence of both state tokens and RTG tokens to predict the next action, the attention layers can choose to ignore the ood RTG tokens while still obtaining a good prediction loss. To test this hypothesis, we experiment with a slightly modified version of DT, where we concatenate the state and RTG at each timestep instead of treating them as separate tokens. By doing this, the model cannot ignore the RTG information in the sequence. We call this version DT-Concat. Figure 3 shows that the performance of DT-Concat is strongly correlated with the conditioning RTG, and degrades quickly when the target return is out-of-distribution.

Algorithm 2: ConserWeightive Behavioral Cloning (CWBC) for RvS
Input: dataset $\mathcal{T}_{\text{offline}}$, number of iterations $I$, batch size $S$, regularization coefficient $\alpha$, initial parameters $\theta_0$
for iteration $i = 1, \ldots, I$ do
    Sample a batch of trajectories $\mathcal{B} \leftarrow \{\tau^{(1)}, \ldots, \tau^{(S)}\}$ from $\mathcal{T}_{\text{offline}}$ using Algorithm 1.
    for every sampled trajectory $\tau^{(j)}$ do
        Sample noise $\varepsilon$ as described in Section 4.2.1.
        Compute noisy RTGs: $g^\varepsilon_t \leftarrow g_t + \varepsilon$ for $1 \le t \le |\tau^{(j)}|$.
    Perform a gradient update of $\theta$ by minimizing the regularized empirical risk $\hat{\mathcal{L}}^{\mathcal{B}}_{\text{RvS}}(\theta) + \alpha \cdot \hat{\mathcal{C}}^{\mathcal{B}}_{\text{RvS}}(\theta)$, with the loss and regularizer defined in Equations (4) and (7).
Output: $\pi_\theta$
This result confirms our hypothesis. However, conservatism does not have to come from the architecture; it can also emerge from a proper objective function, as is common in conservative value-based methods (Kumar et al., 2020; Fujimoto & Gu, 2021). In this section, we propose a novel conservative regularizer for BC that explicitly encourages the policy to stay close to the data distribution. The intuition is to force the actions predicted when conditioning on large ood returns to stay close to in-distribution actions. To do so, for a trajectory $\tau$ with high return, we inject positive random noise $\varepsilon \sim \mathcal{E}_\tau$ into its RTGs, and penalize the $\ell_2$ distance between the predicted action and the ground truth. Specifically, to guarantee that we generate large ood returns, we choose a noise distribution $\mathcal{E}_\tau$ such that the perturbed initial RTG $g_1 + \varepsilon$ is at least $\hat{r}^\star$, the highest return in the dataset. The next subsections instantiate the conservative regularizer in the context of RvS and empirically evaluate its performance.

4.2.1. IMPLEMENTATION DETAILS

We apply conservative regularization to trajectories whose returns are above $\hat{r}_q$, the $q$-th percentile of returns in the dataset. This ensures that when conditioned on ood returns, the policy behaves similarly to high-return trajectories rather than to a random trajectory in the dataset. We sample a scalar noise $\varepsilon \sim \mathcal{E}_\tau$ and offset the RTG of $\tau$ at every timestep by $\varepsilon$: $g^\varepsilon_t = g_t + \varepsilon$ for $t = 1, \ldots, |\tau|$, resulting in the conservative regularizer

$\mathcal{C}_{\text{RvS}}(\theta) = \mathbb{E}_{\tau \sim \mathcal{T},\, \varepsilon \sim \mathcal{E}_\tau} \Big[ \mathbb{1}_{r_\tau > \hat{r}_q} \cdot \frac{1}{|\tau|} \sum_{t=1}^{|\tau|} \big( a_t - \pi_\theta(s_t, \omega^\varepsilon_t) \big)^2 \Big],$    (7)

where $\omega^\varepsilon_t = (g_t + \varepsilon)/(H - t + 1)$ (cf. Equation (3)) is the noisy average RTG at timestep $t$. We observe that using the 95-th percentile $\hat{r}_{95}$ generally works well across different environments and datasets. We use the noise distribution $\mathcal{E}_\tau = \text{Uniform}[l_\tau, u_\tau]$, with lower bound $l_\tau = \hat{r}^\star - r_\tau$ so that the perturbed initial RTG $g^\varepsilon_1 = r_\tau + \varepsilon$ is no less than $\hat{r}^\star$, and upper bound $u_\tau = \hat{r}^\star - r_\tau + \sqrt{12\sigma^2}$ so that the standard deviation of $\mathcal{E}_\tau$ equals $\sigma$. We emphasize that our conservative regularizer is distinct from the conservative components proposed for value-based offline RL methods. While those typically regularize the value function estimate to prevent extrapolation error (Fujimoto et al., 2019), we perturb the returns to generate ood conditioning values and regularize the predicted actions. When the conservative regularizer is used, the final objective for training RvS is $\mathcal{L}_{\text{RvS}}(\theta) + \alpha \cdot \mathcal{C}_{\text{RvS}}(\theta)$, where $\alpha$ is the regularization coefficient.

Table 2: Comparison of the normalized return on the D4RL locomotion benchmark. For BC and TD3+BC, the numbers are from Emmons et al. (2021); for IQL, from Kostrikov et al. (2021b); for TTO, from Janner et al. (2021). The results are averaged over 10 seeds.
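The noise construction can be sketched as follows, assuming the stated Uniform bounds; the function name is ours:

```python
import math
import random

def sample_conservative_noise(r_tau, r_hat_star, sigma, rng=None):
    """Draw eps ~ Uniform[l, u] with l = r*_hat - r_tau and u = l + sqrt(12)*sigma.

    The lower bound guarantees the perturbed initial RTG r_tau + eps is at
    least r*_hat (i.e., out-of-distribution), and the interval width is
    chosen so that the standard deviation of the noise equals sigma.
    """
    rng = rng or random.Random()
    lower = r_hat_star - r_tau
    upper = lower + math.sqrt(12.0) * sigma
    return rng.uniform(lower, upper)
```

A Uniform$[l, u]$ distribution has standard deviation $(u - l)/\sqrt{12}$, which is why the width $\sqrt{12\sigma^2}$ yields exactly $\sigma$.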
When trajectory weighting is used in conjunction with the conservative regularizer, we obtain ConserWeightive Behavioral Cloning (CWBC), which combines the best of both components. We provide pseudocode for CWBC in Algorithm 2.
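Putting the pieces together, evaluating the regularized empirical risk for one trajectory might look like the sketch below, where `policy` is any callable mapping (state, omega) to an action. All names are illustrative assumptions, not the authors' API:

```python
def cwbc_objective(policy, states, actions, omegas, noisy_omegas,
                   is_high_return, alpha):
    """Per-trajectory regularized risk: the RvS MSE loss plus alpha times
    the conservative penalty, the latter applied only when the trajectory's
    return is above the chosen percentile threshold."""
    n = len(states)
    # Standard BC term: fit actions conditioned on in-distribution omegas.
    bc = sum((actions[t] - policy(states[t], omegas[t])) ** 2
             for t in range(n)) / n
    # Conservative term: same targets, but conditioned on noise-perturbed
    # (out-of-distribution) omegas, for high-return trajectories only.
    reg = 0.0
    if is_high_return:
        reg = sum((actions[t] - policy(states[t], noisy_omegas[t])) ** 2
                  for t in range(n)) / n
    return bc + alpha * reg
```

In Algorithm 2, this quantity is averaged over the sampled batch and minimized by gradient descent on the policy parameters.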

Dataset

We evaluate the effectiveness of the conservative regularizer, as well as the performance of CWBC as a whole, on the D4RL datasets (Fu et al., 2020) for the gym locomotion tasks.

Baselines. We apply the conservative regularizer, denoted as +C, to both RvS and RvS+W. In addition, we report the performance of three value-based methods as references: TD3+BC (Fujimoto & Gu, 2021), CQL (Kumar et al., 2020), and IQL (Kostrikov et al., 2021b).

Hyperparameters. We apply our conservative regularization to trajectories whose returns are above the $q = 95$-th percentile return in the dataset, and perturb their RTGs as described in Section 4.2.1. We use a regularization coefficient of $\alpha = 1$. The evaluation protocol is the same as in Section 4.1.2.

Results. Table 2 reports the performance of the methods we consider. Our proposed framework CWBC with all components enabled (RvS+W+C) significantly outperforms the original RvS on 9/9 datasets, with an average improvement of 18%. RvS+W+C is also the best-performing BC method in the table, and is competitive with the value-based methods. Conservative regularization consistently improves the results for both RvS and RvS+W. Trajectory weighting on its own can have varying effects on performance, but is synergistic when combined with conservatism, leading to our best-performing model, RvS+W+C. To better understand the impact of each component, we plot the returns achieved by RvS and its variants when conditioning on different values of RTG. Figure 4 shows that RvS generalizes poorly to out-of-distribution RTGs, which leads to a significant performance drop when the evaluation RTG is larger than the best return in the dataset. Figure 4 also illustrates the importance of encouraging conservatism for RvS: RvS+C has much more stable performance, even when the evaluation RTG is 2× the expert return.
By explicitly encouraging the model to stay close to the data distribution, we achieve more reliable out-of-distribution performance and avoid the performance crash problem. This leads to the absolute performance improvements of RvS+C in Table 2. CWBC combines the best of both weighting and conservatism, enjoying good performance when conditioning on high RTG values as well as better robustness to large, out-of-distribution RTGs. In addition to the main results, we include ablations over the conservative percentile $q$ and the regularization coefficient $\alpha$ in Appendix C. Finally, we also evaluate CWBC on two more benchmarks: Atari games (Bellemare et al., 2013) and the D4RL Antmaze datasets. We present these results in Appendices D and E, respectively.

5. CONCLUSION

We proposed ConserWeightive Behavioral Cloning (CWBC), a new framework that extends BC for offline RL with two novel components: trajectory weighting and conservative regularization. Trajectory weighting balances the bias-variance tradeoff that arises when learning from a suboptimal dataset, improving the performance of both DT and RvS. Next, we showed that while DT is self-conservative due to its attention architecture, we can recover this desired behavior for RvS using our proposed conservative regularizer. As confirmed by our experiments, CWBC significantly improves the performance and stability of RvS. While we made good progress on BC, advanced value-based methods such as CQL and IQL are still ahead, and we believe further understanding of the tradeoffs in both kinds of approaches is important future work. Another promising direction, from a data perspective, is how to combine datasets from multiple environments to obtain diverse, high-quality data; recent works have shown promising results in this direction (Reed et al., 2022). Last but not least, while CWBC significantly improves the performance and reliability of RvS, it is not able to extrapolate beyond the offline dataset. How to obtain extrapolation, or whether it is possible at all, remains an open question, and poses a persistent research opportunity not only for CWBC but for the whole offline RL community.

REPRODUCIBILITY STATEMENT

We present the practical implementation of our framework in Section 4.1.1 and Section 4.2.1. We include the implementation details of our paper in Appendix B, which contains information about the datasets we use, the open-sourced code we build on, and the list of hyperparameters needed to reproduce our results. Finally, we submitted the source code in the supplementary material.

A LIST OF SYMBOLS

$f_{\mathcal{T}}(\tau)$ — probability density of trajectory $\tau \sim \mathcal{T}$
$f_{\tilde{\mathcal{T}}}(\tau)$ — probability density of trajectory $\tau \sim \tilde{\mathcal{T}}$, Equation (5)

We train and evaluate our models on the D4RL (Fu et al., 2020) and Atari (Agarwal et al., 2020) benchmarks, which are available at https://github.com/rail-berkeley/d4rl and https://research.google/tools/datasets/dqn-replay, respectively. Our codebase is largely based on the official RvS (Emmons et al., 2021) implementation at https://github.com/scottemmons/rvs and the official DT (Chen et al., 2021) implementation at https://github.com/kzl/decision-transformer.

C ABLATION ANALYSIS

In this section, we investigate the impact of each of these hyperparameters on CWBC to give insight into which values work well in practice. We use the walker2d environment and its three associated datasets for illustration. In all experiments, when we vary one hyperparameter, the others are kept as in Table 4.

C.1 TRAJECTORY WEIGHTING: SMOOTHING PARAMETERS λ AND κ

The two hyperparameters κ and λ in Equation (6) determine the probability that a bin index $b$ is sampled:

$$P_{\text{bin}}(b) \propto \frac{f_{\mathcal{T}_{\text{offline}}}(b)}{f_{\mathcal{T}_{\text{offline}}}(b) + \lambda} \cdot \exp\left(-\frac{|\bar{r}_b - \hat{r}^*|}{\kappa}\right).$$

In practice, we have observed that the performance of CWBC is considerably robust to a wide range of values of κ and λ.

The impact of κ. The smoothing parameter κ controls how we weight trajectories based on their relative returns. Intuitively, a smaller κ gives more weight to high-return bins (and thus their trajectories), while a larger κ makes the transformed distribution more uniform. We illustrate the effect of κ on the transformed distribution and on the performance of CWBC in Figure 5. As in Section 4.1.2, we set κ to the difference between the empirical highest return $\hat{r}^*$ and the $z$-th percentile return in the dataset, $\kappa = \hat{r}^* - \hat{r}_z$, and vary $z$. This allows the actual value of κ to adapt to different datasets. Figure 5 shows the results. The top row plots the distributions of returns before and after trajectory weighting for varying values of κ. We tested four values $z \in \{99, 90, 50, 0\}$, which correspond to four increasing values of κ. We mark the actual value of κ for each dataset in the top row. For each dataset, the transformed distribution using a small κ (orange) concentrates heavily on high returns; as κ increases, the density of low returns grows and the distribution becomes more and more uniform. The bottom row plots the corresponding performance of CWBC for the different choices of κ.
We select RvS+C as our baseline, which does not use trajectory weighting but has the conservative regularization enabled. Relatively small values of κ (based on $\hat{r}_{99}$, $\hat{r}_{90}$, and $\hat{r}_{50}$) perform well on all three datasets, whereas large values (based on $\hat{r}_0$) hurt performance on the med-expert dataset and even underperform the RvS+C baseline.

Results. Table 6 summarizes the performance of RvS and its variants. CWBC (RvS+W+C) is the best method, outperforming the original RvS by 72% on average. Figure 9 clearly shows the effectiveness of the conservative regularization (+C). On the two low-quality datasets, Qbert and Seaquest, the performance of RvS degrades quickly when conditioning on out-of-distribution RTGs. By regularizing the policy to stay close to the data distribution, we achieve much more stable performance. The trajectory weighting component (+W) alone has varying effects on performance because of the performance-crash problem, but achieves state-of-the-art results when used in conjunction with conservative regularization. It is also worth noting that on both Qbert and Seaquest, CWBC achieves returns much higher than the best return in the offline dataset. This shows that while conservatism encourages the policy to stay close to the data distribution, it does not prohibit extrapolation. There is always a trade-off between optimizing the original supervised objective (which presumably allows extrapolation) and the conservative objective. This is very similar to the conservative regularizers used in value-based methods such as CQL or TD3+BC, where there is a trade-off between learning the value function and staying close to the data distribution.

E ADDITIONAL RESULTS ON D4RL ANTMAZE

Our proposed conservative regularization is especially important in dense-reward environments such as gym locomotion tasks or Atari games, where choosing the target return during evaluation is a difficult problem. Trajectory weighting, on the other hand, is generally useful whenever the offline dataset contains both low-return and high-return trajectories. In this section, we consider Antmaze (Fu et al., 2020), a sparse-reward domain in the D4RL benchmark, to evaluate the generality of CWBC. Antmaze is a navigation domain in which the task is to control a complex 8-DoF "Ant" quadruped robot to reach a goal location. We consider three maze layouts (umaze, medium, and large) and three dataset flavors (v0, diverse, and play). We use the same set of hyperparameters as described in Appendix B.2.

Results. Table 7 summarizes the results. As expected, the conservative regularization is not important in these tasks, since the target return is either 0 (failure) or 1 (success). However, trajectory weighting significantly boosts performance, yielding an average 60% improvement over the original RvS.

F TRAJECTORY WEIGHTING VERSUS HARD FILTERING

An alternative to trajectory weighting is hard filtering (+F), where we train the model on only the top 10% of trajectories with the highest returns. Filtering can be viewed as a hard weighting mechanism, in which the transformed distribution only has support over trajectories with returns above a certain threshold.

F.1 HARD FILTERING FOR RVS

When using hard filtering for RvS, we also consider combining it with the conservative regularization. Table 8 and Figure 10 compare the performance of trajectory weighting and hard filtering when applied to RvS. While RvS+F+C also gains notable improvements, it lags behind RvS+W+C and even seems to erode the benefit of conservatism alone in RvS+C. This agrees with our analysis in Section 4.1.
While hard filtering achieves the same bias reduction, it completely removes the low-return trajectories, resulting in greatly increased variance. Our trajectory weighting upweights the good trajectories while aiming to stay close to the original data distribution, balancing this bias-variance tradeoff. This is clearly shown in Figure 10, where RvS+W+C has much smaller variance when conditioning on large RTGs.
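To make the contrast concrete, here is a minimal, self-contained sketch (our own toy example, not the paper's code) of the two reweighting schemes applied to a list of trajectory returns: hard filtering keeps support only above a return cutoff, while soft exponential weighting keeps every trajectory with nonzero mass.

```python
import math

def top_k_filter(returns, frac=0.10):
    """Hard filtering (+F): keep only the top-frac fraction of trajectories,
    returning unnormalized 0/1 weights. Everything below the cutoff is dropped."""
    k = max(1, int(len(returns) * frac))
    cutoff = sorted(returns)[-k]
    return [1.0 if r >= cutoff else 0.0 for r in returns]

def soft_weight(returns, kappa):
    """Soft weighting: exponential decay in the gap to the best observed
    return, so low-return trajectories keep nonzero (small) mass."""
    r_star = max(returns)
    return [math.exp(-abs(r - r_star) / kappa) for r in returns]

returns = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
hard = top_k_filter(returns)           # only the best trajectory survives
soft = soft_weight(returns, kappa=30)  # support over the whole dataset
```

The hard scheme discards the low-return support entirely (the high-variance regime discussed above), while the soft scheme merely downweights it.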

G BIAS-VARIANCE TRADEOFF ANALYSIS

We formalize our discussion of the bias-variance tradeoff when learning from a suboptimal distribution, mentioned in Section 4.1. The objective functions for training DT (2) and RvS (4) can be rewritten as:

$$\min_\theta \; L_{p_D}(\theta) = \mathbb{E}_{\tau \sim \mathcal{T}}\left[D(\tau, \pi_\theta)\right] = \mathbb{E}_{r \sim p_D(r),\, \tau \sim \mathcal{T}_r}\left[D(\tau, \pi_\theta)\right]. \tag{9}$$

Here, $p_D(r)$ is the data distribution over trajectory returns, $\mathcal{T}_r$ is a uniform distribution over the set of trajectories whose return is $r$, and $D(\tau, \pi_\theta)$ is the supervised loss on the sampled trajectory $\tau$. For DT,

$$D(\tau, \pi_\theta) = \frac{1}{|\tau|} \sum_{t=1}^{|\tau|} \big(a_t - \pi_\theta(g_{t-K:t}, s_{t-K:t}, a_{t-K:t-1})\big)^2,$$

and for RvS,

$$D(\tau, \pi_\theta) = \frac{1}{|\tau|} \sum_{t=1}^{|\tau|} \big(a_t - \pi_\theta(s_t, \omega_t)\big)^2.$$

Equation (9) is equivalent to first sampling a return $r$, then sampling a trajectory $\tau$ whose return is $r$, and computing the loss on $\tau$. Ideally, we would train the model from an optimal return distribution $p^*(r)$ centered around the expert return $r^*$:

$$\min_\theta \; L_{p^*}(\theta) = \mathbb{E}_{r \sim p^*(r),\, \tau \sim \mathcal{T}_r}\left[D(\tau, \pi_\theta)\right]. \tag{10}$$

In practice, we only have access to the suboptimal return distribution $p_D(r)$, which leads to a biased training objective with respect to $p^*(r)$. While the dataset is fixed, we can transform the data distribution $p_D(r)$ into a distribution $q(r)$ that better estimates the ideal distribution $p^*(r)$. The objective with respect to $q$ is:

$$\min_\theta \; L_q(\theta) = \mathbb{E}_{r \sim q(r),\, \tau \sim \mathcal{T}_r}\left[D(\tau, \pi_\theta)\right] \tag{11}$$
$$= \mathbb{E}_{r \sim p_D(r),\, \tau \sim \mathcal{T}_r}\left[\frac{q(r)}{p_D(r)} \cdot D(\tau, \pi_\theta)\right]. \tag{12}$$

In the extreme case $q(r) = \mathbb{1}[r = r^*]$, we only train the policy on trajectories whose return matches the expert return $r^*$. However, since offline datasets often contain very few expert trajectories, this choice of $q$ yields a very high-variance training objective. A good distribution $q$ should lead to a training objective that balances the bias-variance tradeoff. We quantify this by measuring the $\ell_2$ norm of the difference between the gradient of $L_q(\theta)$ and the gradient of the optimal objective $L_{p^*}(\theta)$.
Analogous to Kumar & Levine (2020), we can prove that for some constants $C_1, C_2, C_3$, with high confidence:

$$\mathbb{E}\left[\left\|\nabla_\theta L_q(\theta) - \nabla_\theta L_{p^*}(\theta)\right\|_2^2\right] \le C_1 \cdot \mathbb{E}_{r \sim q(r)}\left[\frac{1}{N_r}\right] + C_2 \cdot \frac{d_2(q \,\|\, p_D)}{|D|} + C_3 \cdot D_{TV}(p^*, q)^2. \tag{13}$$

Here, $N_r$ is the number of trajectories in the dataset $D$ whose return is $r$, $d_2$ is the exponentiated Renyi divergence, and $D_{TV}$ is the total variation distance. The right-hand side of inequality (13) shows that a good distribution $q$ should stay close to the data distribution $p_D$ to reduce variance, while approximating $p^*$ well to reduce bias. As shown in Kumar & Levine (2020), $q(r) \propto \frac{N_r}{N_r + K} \cdot \exp\left(-\frac{|r - r^*|}{\kappa}\right)$ minimizes this bound, which inspires our trajectory weighting.
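The change of measure between Equations (11) and (12) is a standard importance-weighting identity. The following toy numeric check (the distribution and loss values are illustrative only) verifies that reweighting the loss by $q(r)/p_D(r)$ under $p_D$ reproduces the expectation under $q$, provided $p_D$ has support wherever $q$ does.

```python
# Numeric check of the importance-weighting identity behind Eqs. (11)-(12):
# E_{r~q}[D(r)] equals E_{r~p_D}[(q(r)/p_D(r)) * D(r)] when p_D covers q.
rs  = [0.0, 50.0, 100.0]   # three return levels (toy values)
p_D = [0.7, 0.2, 0.1]      # suboptimal data distribution: mostly low returns
q   = [0.1, 0.3, 0.6]      # transformed distribution favouring high returns
D   = [5.0, 3.0, 1.0]      # per-return loss values (arbitrary)

lhs = sum(qi * di for qi, di in zip(q, D))                          # Eq. (11)
rhs = sum(pi * (qi / pi) * di for pi, qi, di in zip(p_D, q, D))     # Eq. (12)
assert abs(lhs - rhs) < 1e-12
```

The identity fails only where $p_D(r) = 0$ but $q(r) > 0$, which is why $q$ must stay within the support of the offline data.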



The original return distribution $\mathcal{T}$ and the transformed distribution $\tilde{\mathcal{T}}$.

Figure 2: Performance of RvS and DT when conditioning on different evaluation RTGs. We report the mean and standard deviation of 10 seeds.

Figure 3: Performance of DT when the state and RTG tokens are concatenated. We report the mean and standard deviation of 10 seeds.

Figure 4: Performance of RvS and its variants when conditioning on different evaluation RTGs. We report the mean and standard deviation of 10 seeds.

$b$ — index of a bin of trajectories in the offline dataset; $|b|$ — size of bin $b$; $f_{\mathcal{T}_{\text{offline}}}(b)$ — proportion of trajectories in bin $b$, $|b| / |\mathcal{T}_{\text{offline}}|$; $P_{\text{bin}}(b)$ — probability that bin $b$ is sampled, Equation (6).

Figure 5: The influence of κ on the transformed distribution (top) and on the performance of CWBC (bottom). The legend in each top panel shows the absolute value of κ for easier comparison. In the bottom row, we also plot the results of RvS+C (no trajectory weighting) as a baseline.

Figure 10: Comparison of trajectory weighting and hard filtering.

Important symbols used in this paper.

Comparison of the normalized return on Atari games. The results are averaged over 3 seeds. We include the results of DT, CQL, and BC from Chen et al. (2021) for reference.

Comparison of the success rate on the Antmaze environment. The results are averaged over 3 seeds. We include the results of DT, CQL, and BC from Emmons et al. (2021) for reference.

Comparison of trajectory weighting (+W) and hard filtering (+F) on D4RL locomotion benchmarks. The results are averaged over 10 seeds.

B.2 DEFAULT HYPERPARAMETERS

The impact of λ. To better understand the role of λ, we can rewrite Equation (6) as

$$P_{\text{bin}}(b) \propto \underbrace{f_{\mathcal{T}_{\text{offline}}}(b) \cdot \exp\left(-\frac{|\bar{r}_b - \hat{r}^*|}{\kappa}\right)}_{T_1} \cdot \underbrace{\frac{1}{f_{\mathcal{T}_{\text{offline}}}(b) + \lambda}}_{T_2}.$$

Clearly, only $T_2$ depends on λ. When λ = 0, $T_2$ cancels the frequency term in $T_1$, and the above equation reduces to

$$P_{\text{bin}}(b) \propto \exp\left(-\frac{|\bar{r}_b - \hat{r}^*|}{\kappa}\right),$$

which depends purely on the relative return. As λ increases, $T_2$ becomes less sensitive to $f_{\mathcal{T}_{\text{offline}}}(b)$, and finally becomes the same for every $b \in [B]$ as λ → ∞. In that regime, $P_{\text{bin}}(b)$ only depends on $T_1$, which is the original frequency $f_{\mathcal{T}_{\text{offline}}}(b)$ weighted by the relative return. The top row of Figure 6 plots the distributions of returns before and after trajectory weighting for different values of λ. When λ = 0, the distributions concentrate on high returns. As λ increases, the distributions correlate more with the original one, but still place more weight on the high-return region than the original distribution does, due to the exponential term in $T_1$. The bottom row of Figure 6 plots the actual performance of CWBC as λ varies. All values of λ produce similar results, consistently better than or comparable to training on the original dataset (RvS+C).
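As a concrete illustration, here is a minimal sketch (our own toy implementation, not the released code) of the bin-weighting rule discussed in this section: trajectories are grouped into return bins, and each bin is sampled with probability proportional to $f(b)/(f(b)+\lambda) \cdot \exp(-|\bar{r}_b - \hat{r}^*|/\kappa)$, with κ set from the $z$-th percentile return. Taking $\bar{r}_b$ to be the bin's average return is our assumption.

```python
import math

def bin_weights(returns, num_bins=20, lam=0.1, z=90):
    """Toy sketch of CWBC trajectory weighting (assumed form):
    P_bin(b) ∝ f(b) / (f(b) + lam) * exp(-|r_bar_b - r_star| / kappa),
    with kappa = r_star minus the z-th percentile return."""
    r_lo, r_hi = min(returns), max(returns)
    width = (r_hi - r_lo) / num_bins or 1.0  # guard against identical returns
    # assign each trajectory return to a bin
    bins = [[] for _ in range(num_bins)]
    for r in returns:
        i = min(int((r - r_lo) / width), num_bins - 1)
        bins[i].append(r)
    # empirical best return and percentile-based kappa
    srt = sorted(returns)
    r_star = srt[-1]
    r_z = srt[min(int(len(srt) * z / 100), len(srt) - 1)]
    kappa = max(r_star - r_z, 1e-8)
    weights = []
    for b in bins:
        if not b:
            weights.append(0.0)
            continue
        f = len(b) / len(returns)   # f(b): fraction of data in bin b
        r_bar = sum(b) / len(b)     # bin's average return (assumed convention)
        weights.append(f / (f + lam) * math.exp(-abs(r_bar - r_star) / kappa))
    total = sum(weights)
    return [w / total for w in weights]
```

Sampling a bin from these weights, then a trajectory uniformly within it, upweights high-return data while keeping the whole dataset in support.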

C.2 CONSERVATIVE REGULARIZATION: PERCENTILE q

We only apply the conservative regularization to trajectories whose return is above the q-th percentile of the returns in the dataset. Intuitively, a larger q applies the regularization to fewer trajectories. We test four values of q: 0, 50, 95, and 99. For q = 0, the regularization applies to all trajectories in the dataset. Figure 7 shows the impact of q on the performance of CWBC. q = 95 and q = 99 perform well on all three datasets, while q = 50 and q = 0 lead to poor results on the med-replay dataset. This is because, when the regularization applies to trajectories with low returns, it forces the policy conditioned on out-of-distribution RTGs to stay close to the actions of low-return trajectories. Since the med-replay dataset contains many low-return trajectories (see Figure 5), such regularization results in poor performance. In contrast, the medium and med-expert datasets contain a much larger proportion of high-return trajectories and are less sensitive to the choice of q.
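The percentile rule above can be sketched as follows (a toy implementation; the exact percentile convention, and the use of an inclusive threshold so that q = 0 covers every trajectory, are our assumptions):

```python
def conservative_mask(returns, q=95):
    """Select trajectories eligible for the conservative regularizer:
    those whose return is at or above the q-th percentile of dataset
    returns. A sketch; the exact percentile convention is an assumption."""
    srt = sorted(returns)
    idx = min(int(len(srt) * q / 100), len(srt) - 1)
    threshold = srt[idx]
    return [r >= threshold for r in returns]
```

With q = 95 only the top few percent of trajectories are regularized; with q = 0 the mask covers the whole dataset, reproducing the failure mode described above for low-quality data.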

C.3 REGULARIZATION COEFFICIENT α

The hyperparameter α controls the weight of the conservative regularization in the final objective of CWBC, $\mathcal{L}_{\text{RvS}} + \alpha \cdot \mathcal{C}_{\text{RvS}}$. We show the performance of CWBC with different values of α in Figure 8. Using no regularization (α = 0) suffers from the performance-crash problem, while overly aggressive regularization (α = 10) also hurts performance. CWBC is robust to the other, non-extreme values of α.
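A minimal sketch of how the two terms combine (our own illustration; the interfaces and the exact noise scheme are assumptions, not the released implementation). The conservative term conditions the policy on returns inflated by positive noise, so that it is pushed out of distribution, while still regressing to the dataset actions:

```python
import random

def conservative_term(policy, states, actions, rtgs, noise_std=1.0):
    """Sketch of a conservative regularizer of the form described above:
    perturb the conditioning return upward by positive noise (making it
    out-of-distribution) and still regress to the dataset actions.
    The noise scheme is an assumption, not the paper's code."""
    total = 0.0
    for s, a, r in zip(states, actions, rtgs):
        r_ood = r + abs(random.gauss(0.0, noise_std))  # inflated, OOD return
        total += (a - policy(s, r_ood)) ** 2
    return total / len(states)

def cwbc_objective(bc_loss, cons_loss, alpha=0.1):
    # final objective: L_RvS + alpha * C_RvS
    return bc_loss + alpha * cons_loss
```

With α = 0 the conservative term vanishes and only the supervised loss remains, matching the ablation discussed above.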

D ADDITIONAL RESULTS ON ATARI GAMES

In addition to D4RL, we consider four games from the Atari benchmark (Bellemare et al., 2013): Breakout, Qbert, Pong, and Seaquest. Similar to Chen et al. (2021), for each game we train our method on 500,000 transitions sampled from the DQN-replay dataset, which consists of 50 million transitions collected by an online DQN agent (Mnih et al., 2015). Due to the varying performance of the DQN agent across games, the quality of the datasets also varies. While the Breakout and Pong datasets are high-quality with many expert transitions, the Qbert and Seaquest datasets are highly suboptimal.

Hyperparameters. For trajectory weighting, we use B = 20 bins, λ = 0.1, and $\kappa = \hat{r}^* - \hat{r}_{50}$. We apply conservative regularization with coefficient α = 0.1 to trajectories whose returns are above $\hat{r}_{95}$. The standard deviation of the noise distribution varies across datasets, as different games have very different return ranges. During evaluation, we set the target return to $5 \times \hat{r}^*$ for Qbert and Seaquest, and to $1 \times \hat{r}^*$ for Breakout and Pong.
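For reference, the Atari hyperparameters stated above can be collected into a single config. The structure and key names below are our own, not from the released code; the values mirror the text.

```python
# Atari hyperparameters as stated in the text; dict layout and key names
# are our own convention, not the released implementation's.
ATARI_CONFIG = {
    "num_bins": 20,               # B
    "lambda": 0.1,                # smoothing parameter
    "kappa_percentile": 50,       # kappa = r_hat_star - r_hat_50
    "alpha": 0.1,                 # conservative regularization coefficient
    "conservative_percentile": 95,
    # evaluation target return as a multiple of the best dataset return
    "target_return_scale": {
        "Qbert": 5.0, "Seaquest": 5.0, "Breakout": 1.0, "Pong": 1.0,
    },
}
```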

F.2 HARD FILTERING FOR UNCONDITIONAL BC

Hard filtering can also be applied to ordinary BC; this is equivalent to Filtered BC in Emmons et al. (2021). Table 9 compares Filtered BC and CWBC. CWBC performs comparably on the medium and med-expert datasets, and outperforms Filtered BC significantly, with an average improvement of 12%, on the med-replay datasets. We believe that in low-quality datasets, even after filtering out 90% of the data, the remaining trajectories are still so diverse that simple imitation learning is not good enough. CWBC is able to learn from such diverse data, and by conditioning on the expert return at test time, we can recover an effective policy.

