HARNESSING MIXED OFFLINE REINFORCEMENT LEARNING DATASETS VIA TRAJECTORY WEIGHTING

Abstract

Most offline reinforcement learning (RL) algorithms return a target policy maximizing a trade-off between (1) the expected performance gain over the behavior policy that collected the dataset, and (2) the risk stemming from the out-of-distribution-ness of the induced state-action occupancy. It follows that the performance of the target policy is strongly related to the performance of the behavior policy and, thus, to the trajectory return distribution of the dataset. We show that in mixed datasets consisting of mostly low-return trajectories and a few high-return trajectories, state-of-the-art offline RL algorithms are overly restrained by the low-return trajectories and fail to exploit the high-performing trajectories to the fullest. To overcome this issue, we show that, in deterministic MDPs with stochastic initial states, the dataset sampling can be re-weighted to induce an artificial dataset whose behavior policy has a higher return. This re-weighted sampling strategy may be combined with any offline RL algorithm. We further analyze that the opportunity for performance improvement over the behavior policy correlates with the positive-sided variance of the trajectory returns in the dataset. We empirically show that while CQL, IQL, and TD3+BC achieve only a part of this potential policy improvement, the same algorithms combined with our re-weighted sampling strategy fully exploit the dataset. Furthermore, we empirically demonstrate that, despite its theoretical limitation, the approach may still be efficient in stochastic environments.

1. INTRODUCTION

Offline reinforcement learning (RL) currently receives great attention because it allows one to optimize RL policies from logged data without direct interaction with the environment. This makes the RL training process safer and cheaper, since collecting interaction data is high-risk, expensive, and time-consuming in the real world (e.g., robotics and health care). Unfortunately, several papers have shown that near optimality of the offline RL task is intractable sample-efficiency-wise (Xiao et al., 2022; Chen & Jiang, 2019; Foster et al., 2022). In contrast to near optimality, policy improvement over the behavior policy is an objective that is approximately realizable, since the behavior policy may efficiently be cloned with supervised learning (Urbancic, 1994; Torabi et al., 2018). Thus, most practical offline RL algorithms incorporate a component ensuring, either formally or intuitively, that the returned policy improves over the behavior policy: pessimistic algorithms make sure that a lower bound on the value of the target policy (i.e., the policy learned by the offline RL algorithm) improves over the value of the behavior policy (Petrik et al., 2016; Kumar et al., 2020b; Buckman et al., 2020); conservative algorithms regularize their policy search with respect to the behavior policy (Thomas, 2015; Laroche et al., 2019; Fujimoto et al., 2019); and one-step algorithms prevent the target policy value from propagating through bootstrapping (Brandfonbrener et al., 2021). These algorithms use the behavior policy as a stepping stone. As a consequence, their performance guarantees depend highly on the performance of the behavior policy, which makes them susceptible to the return distribution of the trajectories in the dataset collected by that policy. To illustrate this dependency, we say that these algorithms are anchored to the behavior policy.
Anchoring to a near-optimal (i.e., expert) dataset favors the performance of an algorithm, while anchoring to a low-performing (e.g., novice) dataset may hinder the target policy's performance. In realistic scenarios, offline RL datasets may consist mostly of low-performing trajectories with a few high-performing trajectories collected by a mixture of behavior policies, since curating high-performing trajectories is costly. It is thus desirable to avoid anchoring on the low-performing behavior policies and to exploit the high-performing ones in mixed datasets. However, we show that state-of-the-art offline RL algorithms fail to exploit high-performing trajectories to their fullest. We analyze that the potential for policy improvement over the behavior policy is correlated with the positive-sided variance (PSV) of the trajectory returns in the dataset, and advance that when the return PSV is high, the algorithmic anchoring may limit the performance of the returned policy. To provide a better algorithmic anchoring, we propose to alter the behavior policy without collecting additional data. We start by proving that re-weighting the dataset during the training of an offline RL algorithm is equivalent to performing this training with another behavior policy. Furthermore, under the assumption that the environment is deterministic, by giving larger weights to high-return trajectories, we can control the implicit behavior policy to be high-performing and therefore grant a cold-start performance boost to the offline RL algorithm.
While determinism is a strong assumption that we prove to be necessary with a minimal failure example, we show that the guarantees still hold when the initial state is stochastic by re-weighting with, instead of the trajectory return, a trajectory return advantage: G(τ_i) − V^µ(s_{i,0}), where G(τ_i) is the return obtained for trajectory i and V^µ(s_{i,0}) is the expected return of following the behavior policy µ from the initial state s_{i,0}. Furthermore, we empirically observe that our strategies allow performance gains over their uniform-sampling counterparts even in stochastic environments. We also note that determinism is required by several state-of-the-art offline RL algorithms (Schmidhuber, 2019; Srivastava et al., 2019; Kumar et al., 2019b; Chen et al., 2021; Furuta et al., 2021; Brandfonbrener et al., 2022). Under the guidance of this theoretical analysis, our principal contribution is two simple weighted sampling strategies: Return-weighting (RW) and Advantage-weighting (AW). RW and AW re-weight trajectories using the Boltzmann distribution of trajectory returns and advantages, respectively. Our weighted sampling strategies are agnostic to the underlying offline RL algorithm and can thus be used as a drop-in replacement in any off-the-shelf offline RL algorithm, essentially at no extra computational cost. We evaluate our sampling strategies with three state-of-the-art offline RL algorithms, CQL, IQL, and TD3+BC (Kumar et al., 2020b; Kostrikov et al., 2022; Fujimoto & Gu, 2021), as well as behavior cloning, over 62 datasets in the D4RL benchmarks (Fu et al., 2020). The experimental results, reported with statistically robust metrics (Agarwal et al., 2021), demonstrate that both our sampling strategies significantly boost the performance of all considered offline RL algorithms on challenging mixed datasets with sparse high-return trajectories, and perform at least on par with them on regular datasets with evenly distributed return distributions.

2. PRELIMINARIES

We consider a reinforcement learning (RL) problem in a Markov decision process (MDP) characterized by a tuple (S, A, R, P, ρ_0), where S and A denote the state and action spaces, respectively, R : S × A → ℝ is a reward function, P : S × A → Δ_S is the state transition dynamics, and ρ_0 ∈ Δ_S is the initial state distribution, where Δ_X denotes the simplex over a set X. An MDP starts from an initial state s_0 ∼ ρ_0. At each timestep t, an agent perceives the state s_t, takes an action a_t ∼ π(·|s_t), where π : S → Δ_A is the agent's policy, receives a reward r_t = R(s_t, a_t), and transitions to a next state s_{t+1} ∼ P(·|s_t, a_t). The performance of a policy π is measured by the expected return J(π) starting from initial states s_0 ∼ ρ_0:

J(π) = E[ Σ_{t=0}^{∞} R(s_t, a_t) | s_0 ∼ ρ_0, a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t) ].   (1)

Given a dataset D collected by a behavior policy µ : S → Δ_A, offline RL algorithms aim to learn a target policy π such that J(π) ≥ J(µ), where

D = { τ_i = (s_{i,0}, a_{i,0}, r_{i,0}, ..., s_{i,T_i}) | s_{i,0} ∼ ρ_0, a_{i,t} ∼ µ(·|s_{i,t}), r_{i,t} = R(s_{i,t}, a_{i,t}), s_{i,t+1} ∼ P(·|s_{i,t}, a_{i,t}) },   (2)

τ_i denotes trajectory i in D, (i, t) denotes timestep t in episode i, and T_i denotes the length of τ_i. Note that µ can be a mixture of multiple policies. For brevity, we omit the episode index i in the subscripts of states and actions unless necessary. Generically, offline RL algorithms learn π with actor-critic methods that train a Q-value function Q : S × A → ℝ and π in parallel. The Q-value Q(s, a) predicts the expected return of taking action a at state s and following π thereafter; π maximizes the expected Q-value over D.
Q and π are trained by alternating between policy evaluation (Equation 3) and policy improvement (Equation 4) steps:

Q ← argmin_Q E[ ( r_t + γ E_{a′∼π(·|s_{t+1})}[Q(s_{t+1}, a′)] − Q(s_t, a_t) )² | Uni(D) ],   (3)

π ← argmax_π E[ Q(s_t, a) | Uni(D), a ∼ π(·|s_t) ],   (4)

where E[ · | Uni(D) ] denotes an expectation over uniform sampling of transitions from D.
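As a toy illustration of the alternation in Equations 3 and 4, consider the following tabular sketch. The two-state, two-action logged dataset is hypothetical and chosen only to make the loop concrete; real offline RL algorithms use function approximators and, crucially, the pessimistic/conservative components discussed later.

```python
import numpy as np

gamma = 0.9
n_states, n_actions = 2, 2
# Hypothetical logged transitions (s, a, r, s'): action 1 always yields reward 1.
data = [(0, 0, 0.0, 1), (0, 1, 1.0, 0), (1, 0, 0.0, 0), (1, 1, 1.0, 1)]

Q = np.zeros((n_states, n_actions))
pi = np.zeros(n_states, dtype=int)          # deterministic target policy
for _ in range(200):
    # Policy evaluation (cf. Equation 3): move Q toward the TD target.
    for s, a, r, s_next in data:
        target = r + gamma * Q[s_next, pi[s_next]]
        Q[s, a] += 0.5 * (target - Q[s, a])
    # Policy improvement (cf. Equation 4): pick the Q-maximizing action.
    pi = Q.argmax(axis=1)

print(pi)  # both states select the rewarding action: [1 1]
```

Because every state-action pair appears in this toy dataset, there are no out-of-distribution actions here; the anchoring issue analyzed in Section 3 arises precisely when the argmax queries actions poorly covered by D.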

3. PROBLEM FORMULATION

Most offline RL algorithms are anchored to the behavior policy. This is beneficial when the dataset behavior policy is high-performing while detrimental when the behavior policy is low-performing. We consider mixed datasets consisting of mostly low-performing trajectories and a handful of highperforming trajectories. In such datasets, it is possible to exploit the rare high-performing trajectories, yet the anchoring restrains these algorithms from making sizable policy improvements over the behavior policy of the mixed dataset. We formally define the return positive-sided variance (RPSV) of a dataset in Section 3.1 and illustrate why the performance of offline RL algorithms could be limited on high-RPSV datasets in Section 3.2.

3.1. POSITIVE-SIDED VARIANCE

Formally, we are concerned with a dataset D := {τ_0, τ_1, ..., τ_{N−1}} potentially collected by various behavior policies {µ_0, µ_1, ..., µ_{N−1}} and constituted of empirical returns {G(τ_0), G(τ_1), ..., G(τ_{N−1})}, where τ_i is generated by µ_i, N is the number of trajectories, T_i denotes the length of τ_i, and G(τ_i) = Σ_{t=0}^{T_i−1} r_{i,t}. To study the distribution of returns, we equip ourselves with a statistical quantity, the positive-sided variance (PSV) of a random variable X:

Definition 1 (Positive-sided variance). The positive-sided variance (PSV) of a random variable X is the second-order moment of the positive component of X − E[X]: V⁺[X] := E[ ((X − E[X])⁺)² ], with x⁺ = max{x, 0}.

The return PSV (RPSV) of D aims at capturing the positive dispersion of the distribution of the trajectory returns. An interesting question to ask is: what distribution leads to high RPSV? We simplify sampling trajectories collected by a novice and an expert as sampling from a Bernoulli distribution B(p), and suppose that the novice policy always yields a 0 return, while the expert always yields a 1 return. Figure 1a visualizes the Bernoulli distribution's PSV, V⁺[B(p)] = p(1 − p)², as a function of its parameter p, where p is the probability of choosing an expert trajectory. Maximal PSV is achieved at p = 1/3, while both p = 0 (pure novice) and p = 1 (pure expert) lead to zero PSV. This observation indicates that mixed datasets tend to have higher RPSV than datasets collected by a single policy. We present the return distributions of datasets at varying RPSV in Figure 1. Low-RPSV datasets have their highest returns close to the mean return, which limits the opportunity for policy improvement. In contrast, the return distribution of high-RPSV datasets disperses away from the mean toward the positive end.

Figure 1: (a) The PSV of a Bernoulli distribution, V⁺[B(p)] = p(1 − p)², as a function of p.
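Definition 1 translates directly into a few lines of NumPy; the function name below is ours, not from the paper's code. The novice/expert example instantiates the Bernoulli formula: a dataset of two 0-return and one 1-return trajectories sits exactly at the maximizing p = 1/3.

```python
import numpy as np

def positive_sided_variance(x):
    """PSV (Definition 1): second moment of the positive part of x - E[x]."""
    x = np.asarray(x, dtype=float)
    dev = np.maximum(x - x.mean(), 0.0)   # (X - E[X])^+ per sample
    return float(np.mean(dev ** 2))

# A 2/3 novice (return 0), 1/3 expert (return 1) mixture attains the
# Bernoulli maximum V+[B(p)] = p(1-p)^2 at p = 1/3, i.e. 4/27.
print(positive_sided_variance([0.0, 0.0, 1.0]))  # 0.14814... = 4/27
```

Applied to the empirical returns {G(τ_i)} of a dataset, this quantity is exactly the RPSV used in Sections 5.3.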
(b-c) The return distributions of datasets with (b) low and (c) high return positive-sided variance (RPSV) (Section 3.1), where RPSV measures the positive contribution to the variance of trajectory returns in a dataset and Ḡ denotes the average episodic return (dashed line) of the dataset. Intuitively, a high RPSV implies that some trajectories have far higher returns than the average.

3.2. OFFLINE RL FAILS TO UTILIZE DATA IN HIGH-RPSV DATASETS

High-RPSV datasets (Figure 1c) have a handful of high-return trajectories, yet the anchoring of offline RL algorithms to the behavior policy inhibits them from utilizing these high-return data to their fullest. The predominating low-return trajectories in a high-RPSV dataset restrain offline RL algorithms from learning a non-trivial policy close to the best trajectories in D, due to these algorithms' pessimistic and/or conservative nature. High RPSV implies that the average episodic return is far from the best return in D (see Figure 1c). The average episodic return reflects the performance J(µ) (formally justified in Section 4.1) of the behavior policy µ that collected D, where µ is a mixture of {µ_0, µ_1, ..., µ_{N−1}} (Section 3.1). Pessimistic algorithms (Petrik et al., 2016; Kumar et al., 2020b; Buckman et al., 2020) strive to guarantee that the algorithm returns a π such that J(π) ≥ J(µ), but this guarantee is loose when J(µ) is low. Conservative algorithms (Laroche et al., 2019; Fujimoto et al., 2019; Fujimoto & Gu, 2021; Kumar et al., 2019a) restrict π to behave closely to µ to prevent exploiting poorly estimated Q-values on out-of-distribution state-action pairs in actor-critic updates (i.e., (s_{t+1}, a′) ∉ D in Equation 3), hence restricting J(π) from deviating too much from J(µ). Similarly, one-step algorithms (Brandfonbrener et al., 2021; Kostrikov et al., 2022) that perform only a single step of policy improvement return a target policy subject to constraints that enforce π to be close to µ (Peters & Schaal, 2007; Peng et al., 2019). As a consequence, offline RL algorithms are restrained by J(µ) and fail to utilize high-return data far from J(µ) in high-RPSV datasets.
On the contrary, in low-RPSV datasets (Figure 1b ), pessimistic, conservative, and one-step algorithms do not have this severe under-utilization issue since the return distribution concentrates around or below the average episodic return, and there are very few to no better trajectories to exploit. We will show, in Section 5.2, that no sampling strategy makes offline RL algorithms perform better in extremely low-RPSV datasets, while in high-RPSV datasets, our methods (Sections 4.2 and 4.3) outperform typical uniform sampling substantially.

4. METHOD

Section 3 explains why behavior policy anchoring prevents offline RL algorithms from exploiting high-RPSV datasets to their fullest. To overcome this issue, the question that needs to be answered is: can we improve the performance of the behavior policy without collecting additional data? To do so, we propose to implicitly alter it through a re-weighting of the transitions in the dataset. Indeed, we show that weighted sampling can emulate sampling transitions with a different behavior policy. We analyze the connection between weighted sampling and the performance of the implicit behavior policy in Section 4.1, and then present two weighted sampling strategies in Sections 4.2 and 4.3.

4.1. ANALYSIS

We start by showing how re-weighting the transitions in a dataset emulates sampling transitions generated by an implicit mixture behavior policy different from the one that collected the dataset. It is implicit because the policy is defined by the weights of the transitions in the dataset. As suggested in Peng et al. (2019), sampling transitions from D defined in Section 3 is equivalent to sampling state-action pairs from a weighted joint state-action occupancy: d_W(s, a) = Σ_{i=0}^{N−1} w_i d_{µ_i}(s) µ_i(a|s), where w_i is the weight of trajectory i (each τ_i is collected by µ_i), W := {w_0, ..., w_{N−1}}, and d_{µ_i}(s) denotes the unnormalized state occupancy measure (Laroche et al., 2022) in the rollout of µ_i. Tweaking the weighting W effectively alters d_W and thus the transition distribution during sampling. As Peng et al. (2019) suggested, a weighting W also induces a weighted behavior policy: µ_W(a|s) = d_W(s, a) / Σ_{i=0}^{N−1} w_i d_{µ_i}(s). Uniform sampling, w_i = 1/N for all w_i ∈ W, is equivalent to sampling from the joint state-action occupancy of the original mixture behavior policy µ that collected D. To obtain a well-defined sampling distribution over transitions, we convert the trajectory weights w_i into transition sample weights w_{i,t} := w_i / Σ_{j=0}^{N−1} T_j w_j for all t ∈ [0, T_i − 1], so that Σ_{i=0}^{N−1} Σ_{t=0}^{T_i−1} w_{i,t} = (Σ_{i=0}^{N−1} T_i w_i) / (Σ_{j=0}^{N−1} T_j w_j) = 1. Thus, we formulate our goal as finding W := {w_i}_{i ∈ [0, N−1]} ∈ Δ_N such that J(µ_W) ≥ J(µ), where Δ_N denotes the simplex of size N. Naturally, we can write J(µ_W) = Σ_{i=0}^{N−1} w_i J(µ_i). The remaining question is then to estimate J(µ_i). The episodic return G(τ_i) can be treated as a sample of J(µ_i). As a result, we can concentrate J(µ_W) near the weighted sum of returns with a direct application of Hoeffding's inequality (Serfling, 1974):

P( |J(µ_W) − Σ_{i=0}^{N−1} w_i G(τ_i)| > ε ) ≤ 2 exp( −2ε² / (G_⊤² Σ_{i=0}^{N−1} w_i²) ),   (7)

where G_⊤ := G_MAX − G_MIN is the amplitude of the return interval (see Hoeffding's inequality).
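The trajectory-to-transition weight conversion and the weighted return estimate J(µ_W) = Σ_i w_i G(τ_i) can be sketched as follows; the helper names are ours, not from the paper's code.

```python
import numpy as np

def transition_weights(traj_weights, traj_lengths):
    """Convert trajectory weights w_i into per-transition weights
    w_{i,t} = w_i / sum_j T_j w_j, which sum to 1 over all transitions."""
    w = np.asarray(traj_weights, dtype=float)
    T = np.asarray(traj_lengths, dtype=float)
    per_step = w / np.sum(T * w)     # every step of trajectory i gets the same weight
    return [np.full(int(t), p) for p, t in zip(per_step, T)]

def weighted_return_estimate(traj_weights, returns):
    """Estimate J(mu_W) = sum_i w_i G(tau_i), with w normalized onto the simplex."""
    w = np.asarray(traj_weights, dtype=float)
    return float(np.dot(w / w.sum(), returns))

# Two trajectories of lengths 2 and 3, equal trajectory weights:
wt = transition_weights([0.5, 0.5], [2, 3])
total = sum(a.sum() for a in wt)     # all transition weights sum to 1
```

Note that a longer trajectory receives more total transition mass for the same w_i, which is exactly the T_i w_i term in the normalizer above.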
For completeness, the soundness of the method is proved for any policy and any MDP with a discount factor (Sutton & Barto, 2018) less than 1 in Appendix A.1. Equation 7 tells us that we have a consistent estimator of J(µ_W) as long as not too much mass has been assigned to a small set of trajectories. Since our goal is to obtain a behavior policy with higher performance, we would like to give high weights w_i to high-performing µ_i. However, it is worth noting that setting w_i as a function of G(τ_i) could induce a bias in the estimator of J(µ_W), due to the stochasticity in the trajectory generation stemming from ρ_0, P, and/or µ_i. In that case, the concentration bound of Equation 7 would no longer be valid. To demonstrate and illustrate the bias, we provide a counterexample in Appendix A.3. The following section addresses this issue by making the strong assumption of a deterministic environment and applying a trick to remove the stochasticity from the behavior policy µ_i. Section 4.3 then relaxes the requirement that ρ_0 be deterministic by using the return advantage instead of the absolute return.

4.2. RETURN-WEIGHTING

In this section, we make the strong assumption that the MDP is deterministic (i.e., the transition dynamics P and the initial state distribution ρ_0 are Dirac delta distributions). This assumption allows us to obtain G(τ_i) = J(µ_i), where µ_i is the deterministic policy taking the actions of trajectory τ_i. Since the performance of the target policy is anchored on the performance of a behavior policy, we look for a weighting distribution W that maximizes J(µ_W):

max_{W ∈ Δ_N} Σ_{i=0}^{N−1} w_i G(τ_i),   (8)

where w_i corresponds to the unnormalized weight assigned to each transition of episode i. However, the resulting solution trivially assigns all the weight to the transitions of the episode τ_i with maximum return. This trivial solution would indeed be optimal in the deterministic setting we consider, but would fail otherwise. To prevent this from happening, we classically incorporate entropy regularization and turn Equation 8 into:

max_{W ∈ Δ_N} Σ_{i=0}^{N−1} w_i G(τ_i) − α Σ_{i=0}^{N−1} w_i log w_i,   (9)

where α ∈ ℝ⁺ is a temperature parameter that controls the strength of the regularization. α interpolates the solution between a delta distribution (α → 0) and a uniform distribution (α → ∞). As the optimal solution to Equation 9 is a Boltzmann distribution of G(τ_i), the resulting weighting distribution W is:

w_i = exp(G(τ_i)/α) / Σ_{τ_j ∈ D} exp(G(τ_j)/α).   (10)

The temperature parameter α (Equations 10 and 11) is a fixed hyperparameter. As the choice of α depends on the scale of episodic returns, which varies across environments, we normalize G(τ_i) using max-min normalization: G(τ_i) ← (G(τ_i) − min_j G(τ_j)) / (max_j G(τ_j) − min_j G(τ_j)).
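Equation 10 with the max-min normalization is a softmax over normalized returns; a minimal sketch (function name ours, with a standard max-shift for numerical stability that does not change the result):

```python
import numpy as np

def return_weights(returns, alpha=0.1):
    """Return-weighting (Equation 10) over max-min-normalized returns;
    alpha is the temperature hyperparameter."""
    G = np.asarray(returns, dtype=float)
    G = (G - G.min()) / max(G.max() - G.min(), 1e-8)  # max-min normalization
    z = (G - G.max()) / alpha                          # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

# A single high-return trajectory among low-return ones dominates the weights.
w = return_weights([10.0, 12.0, 11.0, 100.0], alpha=0.1)
```

With a small α such as 0.1, the weighting approaches the delta distribution on the best trajectory; increasing α smoothly recovers uniform sampling.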

4.3. ADVANTAGE-WEIGHTING

In this section, we allow the initial state distribution ρ_0 to be stochastic. The return-weighting strategy in Section 4.2 could be biased toward trajectories starting from lucky initial states that yield higher returns than other initial states. Thus, we change the objective of Equation 9 to maximizing the weighted episodic advantage Σ_{w_i ∈ W} w_i A(τ_i) with entropy regularization, where A(τ_i) denotes the episodic advantage of τ_i, defined as A(τ_i) = G(τ_i) − V^µ(s_{i,0}). V^µ(s_{i,0}) is the estimated expected return of following µ from s_{i,0}, obtained by regression:

V^µ ← argmin_V E[ (G(τ_i) − V(s_{i,0}))² | Uni(D) ].

Substituting G(τ_i) with A(τ_i) in Equation 9 and solving for W, we obtain the following weighting distribution:

w_i = exp(A(τ_i)/α) / Σ_{τ_j ∈ D} exp(A(τ_j)/α),   with A(τ_i) = G(τ_i) − V^µ(s_{i,0}).   (11)
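A minimal sketch of AW, under the simplifying assumption of discrete initial states so that V^µ(s_0) can be estimated by the mean return from each initial state; in continuous state spaces the paper fits V^µ by regression instead. All names here are illustrative.

```python
import numpy as np
from collections import defaultdict

def advantage_weights(init_states, returns, alpha=0.1):
    """Advantage-weighting (Equation 11) with a tabular V_mu estimate:
    V_mu(s_0) is the mean return of trajectories starting at s_0."""
    by_state = defaultdict(list)
    for s, g in zip(init_states, returns):
        by_state[s].append(g)
    V = {s: float(np.mean(g)) for s, g in by_state.items()}          # V_mu(s_0)
    A = np.array([g - V[s] for s, g in zip(init_states, returns)])   # episodic advantage
    z = (A - A.max()) / alpha                                        # stable softmax
    w = np.exp(z)
    return w / w.sum()

# Two initial states with different baseline returns: AW discounts "lucky" starts,
# so the best trajectory from each start gets the same weight.
w = advantage_weights(["s0", "s0", "s1", "s1"], [0.0, 1.0, 10.0, 11.0], alpha=0.5)
```

Under pure return-weighting, both "s1" trajectories would dominate simply because their initial state is luckier; the advantage baseline removes that bias.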

5. EXPERIMENTS

Our experiments answer the following primary questions: (i) Do our methods enable offline RL algorithms to achieve better performance in datasets with sparse high-return trajectories? (ii) Does our method benefit from high RPSV? (iii) Can our method also perform well in regular datasets? (iv) Is our method robust to stochasticity in an MDP?

5.1. SETUP

Implementation. We implement our weighted-sampling strategies and the baselines on the following offline RL algorithms: implicit Q-learning (IQL) (Kostrikov et al., 2022), conservative Q-learning (CQL) (Kumar et al., 2020b), TD3+BC (Fujimoto & Gu, 2021), and behavior cloning (BC). IQL, CQL, and TD3+BC were chosen to cover various approaches to offline RL: one-step, pessimistic, and conservative algorithms. Note that though BC is an imitation learning algorithm, we include it since BC clones the behavior policy, which is the object we directly alter, and BC is also a common baseline in offline RL research (Kumar et al., 2020b; Kostrikov et al., 2022). Baselines. We compare our weighted sampling against uniform sampling (denoted as Uniform), percentage filtering (Chen et al., 2021) (denoted as Top-x%), and half-sampling (denoted as Half). Percentage filtering only uses the episodes with top-x% returns for training. We consider percentage filtering as a baseline since it similarly increases the expected return of the behavior policy, by discarding some data. In the following, we compare our method against Top-10%, since 10% is the best configuration found in the hyperparameter search (Appendix A.11). Half-sampling samples half of its transitions from high-return trajectories and half from low-return trajectories. Half is a simple workaround to avoid over-sampling low-return data in datasets with only sparse high-return trajectories. Note that Half requires the extra assumption that the dataset can be separated into high-return and low-return partitions, while our methods need no such assumption. Our return-weighted and advantage-weighted strategies are denoted as RW and AW, respectively, and use the same hyperparameter α in all the environments (see Appendix A.7). Datasets and environments.
We evaluate the performance of each algorithm+sampler variant (i.e., the combination of an offline RL algorithm and a sampling strategy) in MuJoCo locomotion environments of the D4RL benchmarks (Fu et al., 2020) and in stochastic classic control benchmarks. Each environment is regarded as an MDP and can have multiple datasets in a benchmark suite. The dataset choices are described in the respective sections of the experimental results. We evaluate our method in stochastic classic control to investigate whether stochastic dynamics break our weighted sampling strategies. The implementation of the stochastic dynamics is presented in Appendix A.6. Evaluation metric. An algorithm+sampler variant is trained for one million batches of updates with five random seeds for each dataset and environment. Its performance is measured by the average normalized episodic return of running the trained policy over 20 episodes in the environment. As suggested in Fu et al. (2020), we normalize the performance using (X − X_Random)/(X_Expert − X_Random), where X, X_Random, and X_Expert denote the performance of the algorithm+sampler variant, the random policy, and the expert policy, respectively.

Figure 2: Our RW and AW sampling strategies achieve higher returns (y-axis) than all baselines (color) on average, consistently for all algorithms (CQL, IQL, BC, and TD3+BC) and all datasets with varying high-return data ratios σ% (x-axis). Remarkably, our performance with all four algorithms exceeds or matches the average returns (dashed lines) of these algorithms trained with uniform sampling on full expert datasets. The substantial performance gain over Uniform at low ratios (1% ≤ σ% ≤ 10%) shows the advantage of our methods in datasets with sparse high-return trajectories.
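The normalization above amounts to a one-liner; the per-environment random and expert reference scores are the values published with the D4RL benchmark.

```python
def normalized_score(x, x_random, x_expert):
    """Fu et al. (2020) normalization: 0 corresponds to a random policy,
    1 (often reported scaled by 100) to an expert policy."""
    return (x - x_random) / (x_expert - x_random)

# e.g., a raw return midway between random and expert normalizes to 0.5
s = normalized_score(50.0, 0.0, 100.0)
```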

5.2. RESULTS IN MIXED DATASETS WITH SPARSE HIGH-RETURN TRAJECTORIES

To answer whether our weighted sampling methods improve over uniform sampling in datasets with sparse high-return trajectories, we create mixed datasets with varying ratios of high-return data. We test each algorithm+sampler variant in four MuJoCo locomotion environments, with eight mixed datasets and one non-mixed dataset per environment. The mixed datasets are created by mixing σ% of either an expert or a medium dataset (high-return) with (1 − σ%) of a random dataset (low-return), for four ratios σ ∈ {1, 5, 10, 50}. The expert, medium, and random datasets are generated by an expert policy, a policy with 1/3 of the expert policy's performance, and a random policy, respectively. We test all the variants on these 32 mixed datasets and the random datasets. Figure 2 shows the mean normalized performance (y-axis) of each algorithm+sampler variant (color) at varying σ (x-axis). Each variant's performance is measured by the interquartile mean (IQM) (also known as the 25%-trimmed mean) of the average return (see Section 5.1), since the IQM is less susceptible to outlier performance, as suggested in Agarwal et al. (2021). Appendix A.8 details the evaluation protocol. As Figure 2 shows, our RW and AW strategies significantly outperform the baselines Uniform, Top-10%, and Half for all algorithms at all expert/medium data ratios σ%. Remarkably, our methods even exceed or match the performance of each algorithm trained on full expert datasets with uniform sampling (dashed lines). This implies that our methods enable offline RL algorithms to achieve an expert level of performance with only 5% to 10% of medium or expert trajectories. Uniform fails to exploit the datasets to the fullest when high-performing trajectories are sparse (i.e., low σ). Top-10% slightly improves the performance, yet falls behind Uniform at low ratios (σ% = 1%), which implies that the best filtering percentage might be dataset-dependent.
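The mixing procedure can be sketched as below, under the assumption that the mixed dataset draws σ% of its trajectories from the high-return source dataset and the rest from the low-return one; the helper name and sizes are illustrative, not the paper's exact construction.

```python
import numpy as np

def make_mixed_dataset(high_trajs, low_trajs, sigma, n_total, seed=0):
    """Mix sigma% high-return trajectories with (100 - sigma)% low-return ones.
    Assumes both source lists are large enough to sample without replacement."""
    rng = np.random.default_rng(seed)
    n_high = int(round(n_total * sigma / 100.0))
    high_idx = rng.choice(len(high_trajs), size=n_high, replace=False)
    low_idx = rng.choice(len(low_trajs), size=n_total - n_high, replace=False)
    return [high_trajs[i] for i in high_idx] + [low_trajs[i] for i in low_idx]

# sigma = 10: 10 expert-like and 90 random-like trajectories
mixed = make_mixed_dataset(["e"] * 100, ["r"] * 1000, sigma=10, n_total=100)
```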
Half consistently improves over Uniform slightly at all ratios, yet its performance gains are far below ours. Overall, these results suggest that up-weighting high-return trajectories in a dataset with a low ratio of high-return data benefits performance, while naively filtering out low-return episodes, as Top-10% does, does not consistently improve performance. Moreover, AW and RW do not show visible differences, likely because the initial state distribution is narrow in MuJoCo locomotion environments. We also include the average returns for each environment and dataset in Appendix A.13. In addition to average return, we also evaluate our methods on the probability of improvement (Agarwal et al., 2021) over uniform sampling and show statistically significant improvements in Appendix A.10.

5.3. ANALYSIS OF PERFORMANCE GAIN IN MIXED DATASETS

We hypothesize that our methods' performance gain over uniform sampling results from increased RPSV in the datasets. Designing a robust predictor of the performance gain of a sampling strategy is not trivial, since offline RL performance is influenced by several factors, including the environment and the offline RL algorithm it is paired with. We focus on two statistical factors that are easy to estimate from the dataset: (i) the mean return of the dataset and (ii) its RPSV. Although dependent on each other, these two factors have good variability in our experiments, since increasing the ratio of expert/medium data increases not only the RPSV but also the mean return of a dataset. Figure 3 shows the relationship between the performance gain over uniform sampling (represented by the color of each dot), the dataset's mean return (x-axis), and its RPSV (y-axis, in log scale). Each dot denotes the average performance gain for a tuple of environment, dataset, and σ. At similar mean returns (x-axis), our methods' performance gain grows markedly (colors closer to red) as RPSV increases (y-axis). This observation indicates that the performance gain at low σ (expert/medium data ratio) in Figure 2 can be related to the performance gain at high RPSV, since most datasets with low mean returns have high RPSV in our experiments. We also notice that a high dataset average return may temper our advantage. The reason is that offline RL with uniform sampling is already quite efficient in settings where σ is in a high range, such as 50%, so the room for additional improvement over it is limited.

5.4. RESULTS IN REGULAR DATASETS WITH MORE HIGH-RETURN TRAJECTORIES

Datasets in Section 5.2 are adversarially created to test the performance with extremely sparse high-return trajectories. However, we show in Figure 5 that such challenging return distributions are not common among the regular datasets of the D4RL benchmarks. As a result, regular datasets are easier for the uniform sampling baseline than mixed datasets with sparse high-return trajectories. To show that our method does not lose performance in regular datasets with more high-return trajectories, we also evaluate it on 30 regular datasets from the D4RL benchmark (Fu et al., 2020), using the same evaluation metric as in Section 5.1, and present the results in Figure 4a. Both our methods exhibit performance on par with the baselines in regular datasets, confirming that our method does not lose performance. Note that we do not compare with Half, since regular datasets collected by multiple policies cannot be split into two buffers. Notably, we find that with our RW and AW, BC achieves performance competitive with the offline RL algorithms (i.e., CQL, IQL, and TD3+BC). The substantial improvement over uniform sampling for BC aligns with our analysis (Section 4.1), since the performance of BC solely depends on the performance of the behavior policy and hence on the average return of the sampled trajectories. Nonetheless, paired with RW and AW, the offline RL algorithms (i.e., CQL, IQL, and TD3+BC) still outperform BC. This suggests that our weighted sampling strategies do not overshadow the advantage of offline RL over BC. The complete performance table can be found in Appendix A.13. We also evaluate our methods' probability of improvement (Agarwal et al., 2021) over uniform sampling, showing that our methods are no worse than the baselines, in Appendix A.8.1.

Figure 4: (a) Our methods match Uniform's return (y-axis), indicating that they do not lose performance in datasets consisting of sufficiently many high-return trajectories. These are regular D4RL datasets without the adversarial mixing of low-return trajectories done in Section 5.2. (b) Performance in classic control tasks with stochastic dynamics. Our methods outperform the baselines, showing that stochasticity does not break them.

5.5. RESULTS IN STOCHASTIC MDPS

As our weighted-sampling strategy theoretically requires a deterministic MDP, we investigate whether stochastic dynamics (i.e., stochastic state transitions) break our method by evaluating it in stochastic control environments. Implementation details can be found in Appendix A.6. We use the evaluation metric described in Section 5.1 and present the results in Figure 4b. Both of our methods still outperform uniform sampling under stochastic dynamics, suggesting that stochasticity does not break them. Note that we only report results with CQL since IQL and TD3+BC are not compatible with the discrete action spaces used in stochastic classic control.

6. RELATED WORKS

Our weighted sampling strategies and non-uniform experience replay in online RL share the goal of improving on uniform sample selection. Prior works prioritize uncertain data (Schaul et al., 2015; Horgan et al., 2018; Lahire et al., 2021), attend to nearly on-policy samples (Sinha et al., 2022), or select samples by topological order (Hong et al., 2022; Kumar et al., 2020a). However, these approaches do not take into account the performance of the implicit behavior policy induced by sampling and hence are unlikely to tackle the issue in mixed offline RL datasets. Offline imitation learning (IL) (Kim et al., 2021; Ma et al., 2022; Xu et al., 2022) considers training an expert policy from a dataset consisting of a handful of expert data and plenty of random data. These methods train a model to discriminate whether a transition comes from an expert and learn a nearly expert policy from the discriminator's predictions. Conceptually, our methods and offline IL both aim to capitalize on advantageous data (i.e., sparse high-return/expert data) in a dataset, despite their different problem settings. Offline IL requires that expert and random data be given in two separate buffers, but does not need reward labels. In contrast, we do not require separable datasets but do require reward labels to find advantageous data.

7. DISCUSSION

Importance of learning sparse high-return trajectories. Though most regular datasets in mainstream offline RL benchmarks such as D4RL have more high-return trajectories than the mixed datasets studied in Section 5.2, it should be noted that collecting such high-return data is tedious and can be expensive in realistic domains (e.g., health care). Thus, enabling offline RL to learn from datasets with a limited amount of high-return trajectories is crucial for deploying offline RL in more realistic tasks. The significance of our work is a simple technique that enables offline RL to learn from a handful of high-return trajectories. Limitation. As our methods require trajectory returns to compute the sample weights, the dataset cannot contain partial, fragmented trajectories, and each trajectory needs to start from a state in the initial state distribution; otherwise, the trajectory return cannot be estimated. One possible approach to lifting this limitation is to estimate the sample weight using a learned value function, so that the expected return of a state can be estimated without complete trajectories.

A APPENDIX

A.1 DETAILED ANALYSIS

We consider the definitions of a policy and a Markovian policy (Laroche et al., 2022):

Definition 2 (Policy). A policy π represents any function mapping its trajectory history h_t = ⟨s_0, a_0, r_0, ..., s_{t−1}, a_{t−1}, r_{t−1}, s_t⟩ to a distribution over actions π(·|h_t) ∈ Δ_A, where Δ_A denotes the simplex over A. Let Π denote the space of policies, and Π_D the space of deterministic policies.

Definition 3 (Markovian policy). Policy π is said to be Markovian if its action probabilities only depend on the current state s_t: π(·|h_t) = π(·|s_t) ∈ Δ_A. Otherwise, policy π is non-Markovian. We let Π_M denote the space of Markovian policies, and Π_DM the space of deterministic Markovian policies.

We make no assumption on the behavior policy β, i.e., β ∈ Π. We notice that:

    J(β) = E[R(τ) | τ ∼ ⟨p_0, β, p⟩] = E[R(τ) | β_τ ∼ β, τ ∼ ⟨p_0, β_τ, p⟩] = E[J(β_τ) | β_τ ∼ β].    (12)

Equation 12 is a trick that has already been used in Peng et al. (2019). We go a bit further by constraining β_τ to be a deterministic policy sampled at the start of the episode, which may be programmatically interpreted as sampling the random seed used for the full trajectory. With a trajectory-wise reweighting W, we obtain:

    J(β_W) = Σ_{i=0}^{N−1} w_i J(β_{τ_i}).

Furthermore, Altman (1999) tells us that there exists a Markovian policy β_W^M with the same occupancy measure as β_W and the same performance when γ < 1 in MDPs with countable state space. Laroche et al. (2022) generalize this theorem to any MDP (including uncountable state spaces) as long as γ < 1. Simão et al. (2020) prove (Theorem 3.2) that, in finite MDPs, any Markovian behavior policy β_W^M can be cloned with a policy β̂_W^M from a dataset of N trajectories, with high probability 1 − δ, up to an accuracy of

    (2r_⊤ / (1 − γ)) · sqrt( (3|S||A| + 4 log(1/δ)) / (2N² Σ_{i=0}^{N−1} w_i²) ),

where 2r_⊤ is the amplitude of the reward function.

A.2 ADDITIONAL RELATED WORKS: IMBALANCED CLASSIFICATION/REGRESSION

Mixed datasets with high RPSV are closely related to imbalanced datasets in supervised learning. Supervised learning approaches either over-sample minority classes (Cui et al., 2019; Cao et al., 2019; Dong et al., 2018) or sample data inversely proportionally to the density of the target values (Yang et al., 2021; Steininger et al., 2021). Other works (Chawla et al., 2002; García & Herrera, 2009) synthesize samples by interpolating data points near the minority data. In RL, on the other hand, over-sampling minority trajectories can be harmful if a trajectory has low return and does not cover high-performing policies' trajectories; in other words, naively applying over-sampling techniques from supervised learning can hurt in an RL setting because they are agnostic to the notion of return.
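To make the contrast concrete, here is a minimal sketch (in Python; the function name and binning scheme are illustrative, not code from any of the cited works) of the density-inverse sampling idea used in imbalanced regression:

```python
import numpy as np

def inverse_density_weights(targets, n_bins=20, eps=1e-8):
    """Weight each sample inversely to the empirical density of its
    target value, as in imbalanced-regression reweighting schemes."""
    counts, edges = np.histogram(targets, bins=n_bins)
    density = counts / counts.sum()
    # Map each target to its bin; clip so the max value stays in the last bin.
    bin_idx = np.clip(np.digitize(targets, edges[1:-1]), 0, n_bins - 1)
    w = 1.0 / (density[bin_idx] + eps)
    return w / w.sum()  # normalize into a sampling distribution
```

Applied naively to trajectory returns, such a scheme would also up-weight rare low-return trajectories, since it sees only rarity and not return, which is exactly the failure mode noted above.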

A.3 COUNTER-EXAMPLE WITH TRAJECTORY WEIGHTING

We will consider a minimal example consisting of a stateless MDP (a multi-armed bandit) with 2 actions A = {a_1, a_2}. Action a_1 yields a deterministic reward of 0.6. Action a_2 yields a Bernoulli-distributed reward with parameter p = 0.5. In other words, a_2 is a coin flip: with 50% chance, no reward is received, and with 50% chance, a maximal reward of 1. We collect a dataset containing some number of samples of each action. Now consider weights w_i such that:

    w_i = 1[G(τ_i) > 0.8] / Σ_j 1[G(τ_j) > 0.8].    (14)

Then:

    J(µ_W) = Σ_{i=0}^{N−1} w_i J(µ_i) = J(µ(a_2) = 1) = 0.5,    (15)

while

    Σ_{i=0}^{N−1} w_i G(τ_i) = Σ_{i=0}^{N−1} 1[G(τ_i) > 0.8] / Σ_j 1[G(τ_j) > 0.8] = 1,

showing a counter-example for the concentration bound proposed in equation 7.
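The counter-example can be checked numerically. The sketch below (Python; the dataset size and seed are arbitrary) builds such a dataset, applies the weights of equation 14, and contrasts the weighted average of observed returns (exactly 1) with the true value of the reweighted behavior policy (0.5):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000

# Dataset: each trajectory plays a1 (return 0.6) or a2 (Bernoulli(0.5) return).
actions = rng.integers(0, 2, size=N)  # 0 -> a1, 1 -> a2
returns = np.where(actions == 0, 0.6,
                   rng.integers(0, 2, size=N).astype(float))

# Weights from equation 14: keep only trajectories with return > 0.8,
# i.e. the lucky a2 flips that paid out 1.
keep = (returns > 0.8).astype(float)
w = keep / keep.sum()

# Weighted average of *observed* returns: 1 by construction.
weighted_return = float((w * returns).sum())

# True value of the reweighted behavior policy: it always plays a2,
# so J(mu_W) = E[Bernoulli(0.5)] = 0.5, not 1.
j_mu_w = 0.5
```

The gap between the two quantities is exactly the failure of concentration that the counter-example exhibits.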

A.4 DATASET AVERAGE RETURN

We plot the average return of mixed and regular datasets in Figure 5. It can be seen that the mixed datasets used in Section 5.2 have lower average returns than the regular datasets. We also study the relationship between a dataset's average return and offline RL performance in Figure 6, showing that increasing the dataset's average return improves offline RL performance. Interestingly, there is a sweet spot beyond which increasing the dataset's average return starts to hurt performance. We hypothesize that this is due to insufficient state-action coverage of such datasets. We also present the return distribution of each mixed dataset in Figure 7.

A.5 DATASET RPSV

We list the RPSV of each dataset in Table 1.
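For reference, RPSV can be computed from a dataset's trajectory returns as below (a minimal Python sketch based on the description in Section 3.1 and Figure 1; the function name is ours). It reproduces the Bernoulli PSV p(1 − p)² from Figure 1a:

```python
import numpy as np

def rpsv(returns):
    """Return positive-sided variance (RPSV): the second moment of the
    positive part of the deviation from the dataset's average return."""
    g = np.asarray(returns, dtype=float)
    dev = np.maximum(g - g.mean(), 0.0)  # keep only positive deviations
    return float(np.mean(dev ** 2))
```

For a Bernoulli(0.5) return distribution, e.g. `rpsv([0.0, 1.0])`, this gives 0.125 = 0.5 · (1 − 0.5)², matching Figure 1a.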

A.6 STOCHASTIC CLASSIC CONTROL

We adapt CartPole-v1, Acrobot-v1, and MountainCar-v0 from the classic control environments in OpenAI Gym (Brockman et al., 2016). At each timestep, the agent's action has a 10% chance of being replaced with a random action a ∼ Uniform(A). As a result, the transition dynamics become stochastic.

Table 1: RPSV calculated using normalized return.

    Dataset                                RPSV
    ant-expert-v2 (σ = 1)                  0.002886
    ant-expert-v2 (σ = 5)                  0.013152
    ant-expert-v2 (σ = 10)                 0.026764
    ant-expert-v2 (σ = 50)                 0.151138
    ant-expert-v2                          0.015052
    ant-full-replay-v2                     0.096415
    ant-medium-expert-v2                   0.042611
    ant-medium-replay-v2                   0.038834
    ant-medium-v2 (σ = 1)                  0.001334
    ant-medium-v2 (σ = 5)                  0.006703
    ant-medium-v2 (σ = 10)                 0.013846
    ant-medium-v2 (σ = 50)                 0.075325
    ant-medium-v2                          0.020586
    ant-random-v2                          0.000092
    antmaze-large-diverse-v0               0.015378
    antmaze-large-play-v0                  0.004538
    antmaze-medium-diverse-v0              0.072896
    antmaze-medium-play-v0                 0.006583
    antmaze-umaze-diverse-v0               0.029304
    antmaze-umaze-v0                       0.016956
    halfcheetah-expert-v2 (σ = 1)          0.008236
    halfcheetah-expert-v2 (σ = 5)          0.035784
    halfcheetah-expert-v2 (σ = 10)         0.063523
    halfcheetah-expert-v2 (σ = 50)         0.097583
    halfcheetah-expert-v2                  0.000130
    halfcheetah-full-replay-v2             0.006521
    halfcheetah-medium-expert-v2           0.028588
    halfcheetah-medium-replay-v2           0.005747
    halfcheetah-medium-v2 (σ = 1)          0.001771
    halfcheetah-medium-v2 (σ = 5)          0.007604
    halfcheetah-medium-v2 (σ = 10)         0.013685
    halfcheetah-medium-v2 (σ = 50)         0.021007
    halfcheetah-medium-v2                  0.000110
    halfcheetah-random-v2                  0.000019
    hopper-expert-v2 (σ = 1)               0.000318
    hopper-expert-v2 (σ = 5)               0.001436
    hopper-expert-v2 (σ = 10)              0.002948
    hopper-expert-v2 (σ = 50)              0.024691
    hopper-expert-v2                       0.000771
    hopper-full-replay-v2                  0.041835
    hopper-medium-expert-v2                0.064473
    hopper-medium-replay-v2                0.018824
    hopper-medium-v2 (σ = 1)               0.000129
    hopper-medium-v2 (σ = 5)               0.000506
    hopper-medium-v2 (σ = 10)              0.001055
    hopper-medium-v2 (σ = 50)              0.008475
    hopper-medium-v2                       0.006648
    hopper-random-v2                       0.000025
    walker2d-expert-v2 (σ = 1)             0.000264
    walker2d-expert-v2 (σ = 5)             0.001261
    walker2d-expert-v2 (σ = 10)            0.002601
    walker2d-expert-v2 (σ = 50)            0.022037
    walker2d-expert-v2                     0.000030
    walker2d-full-replay-v2                0.070379
    walker2d-medium-expert-v2              0.027597
    walker2d-medium-replay-v2              0.029698
    walker2d-medium-v2 (σ = 1)             0.000128
    walker2d-medium-v2 (σ = 5)             0.000559
    walker2d-medium-v2 (σ = 10)            0.001233
    walker2d-medium-v2 (σ = 50)            0.010165
    walker2d-medium-v2                     0.018411
    walker2d-random-v2                     0.000001
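The action-noise perturbation described in Appendix A.6 can be sketched as a thin environment wrapper (Python; a hedged illustration, not the exact code used in our experiments):

```python
import random

class RandomActionWrapper:
    """Minimal environment wrapper: with probability eps, replace the
    agent's action with one sampled uniformly from the action space."""
    def __init__(self, env, eps=0.1):
        self.env = env
        self.eps = eps

    def step(self, action):
        if random.random() < self.eps:
            # Inject noise: override the agent's choice with a random action.
            action = self.env.action_space.sample()
        return self.env.step(action)

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)
```

With eps = 0.1, roughly one action in ten is replaced, which is enough to make the otherwise deterministic classic control dynamics stochastic from the agent's perspective.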

A.7 DETAILS OF IMPLEMENTATION

• Temperature α. For RW and AW, we use α = 0.1 for IQL and TD3+BC, and α = 0.2 for CQL.
• Trajectory advantage. We use linear regression to approximate V^µ. We build a training set {(s_{i,0}, G(τ_i)) : τ_i ∈ D} and train a regression model on it.
• Implementation. We use the public codebase d3rlpy (Takuma Seno, 2021). For each algorithm, we use its default hyperparameters.

A.8 DETAILS OF EVALUATION PROTOCOL

Performance logging. Given an environment E and a dataset D_E, for each trial (i.e., a random seed) we train each algorithm+sampler variant for one million batches using D_E and roll out the learned policy in the environment E for 20 episodes. The average return over the 20 episodes is recorded as the performance of the trial. Performance metric. Given a list of empirical returns [g_1, g_2, ...], one per trial, the interquartile mean (IQM) (Agarwal et al., 2021) discards the bottom 25% and top 25% of samples and computes the mean of the rest.
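The two ingredients above (the temperature and the regression-based trajectory advantage) can be combined into one weight-computation sketch. This is our reading of Sections 4.2 and 4.3 rendered in Python; the function name and the max-shift normalization trick are ours:

```python
import numpy as np

def trajectory_weights(init_states, returns, alpha=0.1, advantage=False):
    """Sketch of return-weighting (RW) and advantage-weighting (AW).

    init_states: (N, d) array of each trajectory's initial state s_{i,0}.
    returns:     (N,) array of trajectory returns G(tau_i).
    """
    g = np.asarray(returns, dtype=float)
    score = g  # RW scores trajectories by their raw return
    if advantage:
        # Linear regression of G(tau_i) on s_{i,0} approximates V(s_0);
        # the trajectory advantage is the regression residual.
        X = np.hstack([np.asarray(init_states, dtype=float),
                       np.ones((len(g), 1))])  # bias column
        coef, *_ = np.linalg.lstsq(X, g, rcond=None)
        score = g - X @ coef
    z = np.exp((score - score.max()) / alpha)  # max-shift for stability
    return z / z.sum()  # softmax over trajectories
```

A low temperature concentrates the sampling distribution on the highest-return (or highest-advantage) trajectories; a high temperature recovers near-uniform sampling.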

A.8.1 PROBABILITY OF IMPROVEMENT

According to Agarwal et al. (2021), the probability of improvement in an environment m is defined as:

    P(X_m ≥ Y_m) = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} S(x_{m,i}, y_{m,j}),

    S(x_{m,i}, y_{m,j}) = 1 if x_{m,i} > y_{m,j};  1/2 if x_{m,i} = y_{m,j};  0 if x_{m,i} < y_{m,j},

where m denotes an environment index, and x_{m,i} and y_{m,j} denote samples of the mean return over trials of algorithms X and Y, respectively. We report the average probability of improvement (1/M) Σ_{m=0}^{M−1} P(X_m ≥ Y_m) and its 95%-confidence interval obtained via bootstrapping. PI cannot be directly translated into a "number of wins" since PI takes the stochasticity resulting from random seeds into account, measuring the probability of improvement in a trial with a randomly selected seed, dataset, and environment. For example, in Figure 9, we show that our methods attain a 70% probability of improvement over uniform sampling, but this does not mean we beat uniform sampling in only 70% of datasets and environments. From the complete score table in Appendix A.13, we see that our AW and RW strategies outperform uniform sampling in at least 80% of datasets. We want to highlight that the probability of improvement (PI) measures the robustness of a method, conveying a different message than the average performance shown in Figure 2 and Figure 4a. PI measures "how likely is a method to perform better than uniform sampling in a randomly selected environment, dataset, and random seed?" PI captures the uncertainty among random seeds, while aggregated metrics like average performance do not. For example, suppose we have 5 trials with different random seeds on the same environment and dataset for two methods A and B. The fact that A has a higher average return than B does not imply that A performs better than B in all trials; A may be worse than B in some trials. Comparing only the average return, one could wrongly conclude that A is certainly better than B. Instead, PI answers "how likely is A to be better than B?"
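The pairwise comparison defined above is straightforward to implement; a minimal Python sketch (the function name is ours):

```python
import numpy as np

def probability_of_improvement(x, y):
    """P(X >= Y) for one environment: compare every trial score of
    algorithm X against every trial score of algorithm Y
    (Agarwal et al., 2021), scoring 1 for a win, 1/2 for a tie."""
    x = np.asarray(x, dtype=float)[:, None]   # shape (N, 1)
    y = np.asarray(y, dtype=float)[None, :]   # shape (1, N)
    s = np.where(x > y, 1.0, np.where(x == y, 0.5, 0.0))
    return float(s.mean())  # averages the N*N pairwise outcomes
```

Averaging this quantity over environments, with bootstrapped confidence intervals, yields the aggregate PI reported in Figure 9.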
PI is important for algorithm selection since it measures the robustness of a method. A method can achieve an extremely high performance gain in a few tasks yet lose to the baselines in the majority of tasks. Such a method would not consistently outperform the baselines, which makes it not robust. A robust method should consistently improve over the baselines and not lose performance in most tasks. Robustness is important for a user deciding whether to prefer a new method over an existing one (i.e., the baselines). As offline RL algorithms' performance depends on several interacting factors (e.g., dataset properties, environment dynamics, reward functions), it is difficult to accurately predict the conditions under which a new method performs best. Lacking perfect knowledge of those conditions, it is unclear whether a user should deploy the new method on a new task for which no benchmarking results exist yet. As a result, robustness is crucial when selecting an algorithm for a new task. If the new method is shown to be robust and to perform better than the baseline in most trials (i.e., high PI), it is worth preferring over the baseline. In contrast, it is not worth using the new method if its PI is below 50%.

A.9 SENSITIVITY TO TEMPERATURE

The temperature α (Section 4) is an important hyperparameter of our method. We investigate how sensitive our method is to the choice of temperature. Using the evaluation metric described in Section 5.1, we compare the performance of RW and AW paired with IQL at varying temperatures α in Figure 8, where 0.1 is the temperature used in Sections 5.2, 5.4, and 5.5. Our methods outperform uniform sampling over a range of temperatures and hence are not overly sensitive to the temperature. The full results for the other offline RL algorithms are presented in Appendix A.12.

A.10 PROBABILITY OF IMPROVEMENTS

In addition to average performance, the recent study by Agarwal et al. (2021) highlights the importance of measuring the robustness of an algorithm via its probability of improvement (PI), since outliers can dominate average performance. An algorithm with a higher average performance does not necessarily perform better than the baselines in the majority of environments. Therefore, we evaluate our method on both regular (Section 5.4) and mixed (Section 5.2) datasets in D4RL using the probability of performing better than uniform sampling: PI(X > Uniform), where X ∈ {Half, Top-10%, RW, AW}. The bottom row of Figure 9 shows that our method achieves above a 70% chance of outperforming uniform sampling in mixed datasets with sparse high-return trajectories. Moreover, the lower bounds of the confidence intervals are clearly above 50%, which indicates that the improvements are significant according to Agarwal et al. (2021). On the other hand, in regular datasets with abundant high-return data, PI(AW > Uniform) and PI(RW > Uniform) are around 50%, suggesting that our methods match the uniform sampling baseline. Note that PI(RW > Uniform) = 70% does not imply that our method only beats uniform sampling in 70% of datasets and environments. The calculation of PI is detailed in Appendix A.8.1.

A.11 ADDITIONAL RESULTS OF PERCENTAGE FILTERING

Figure 10 presents additional results at varying percentages for percentage filtering.

A.12 ADDITIONAL RESULTS OF TEMPERATURE SENSITIVITY

Figure 11 presents the full results at varying temperatures.

A.13 FULL RESULTS

We list the full benchmark results in Tables 2, 3, 4, and 5 , where bold text denotes a score higher than Uniform and "*" sign indicates the maximum score in a dataset and environment (row).

A.14 RESULTS IN OFFICIAL IQL CODEBASE

As the performance of our IQL implementation slightly mismatches that of the official implementation, we rerun the experiments of Sections 5.2 and 5.4 with the official codebase and report the results in Figure 12. It can be seen that our methods still exhibit a similar amount of performance gain as in Figure 2 and Figure 4a, indicating that the performance gain of our methods is independent of the implementation.

Figure 9: Probability of improvement over uniform sampling (Agarwal et al., 2021) in 32 mixed datasets (lower row) and 30 regular datasets (upper row). In mixed datasets with sparse high-return data, our methods attain above a 75% probability of improvement with the lower bound of the confidence interval clearly above 50%, suggesting statistically significant improvements over uniform sampling. On the other hand, in regular datasets with abundant high-return data, PI(AW > Uniform) and PI(RW > Uniform) are around 50%, suggesting that our methods match the uniform sampling baseline.



Footnotes:
• The formalization of µ_i is provided in Appendix A.1.
• We implicitly replace the unweighted term N in the denominator with its weighted version N² Σ_{i=0}^{N−1} w_i².
• Official IQL implementation: https://github.com/ikostrikov/implicit_q_learning



Figure 1: (a) Bernoulli distribution PSV: V+[B(p)] = p(1 − p)². (b-c) The return distribution of datasets with (b) low and (c) high return positive-sided variance (RPSV) (Section 3.1), where RPSV measures the positive contributions to the variance of trajectory returns in a dataset and Ḡ denotes the average episodic return (dashed line) of the dataset. Intuitively, a high RPSV implies that some trajectories have far higher returns than the average.

Figure 3: The left end (average return between 0 and 0.1) of each plot shows that for all offline RL algorithms (CQL, IQL, and TD3+BC) and BC, our AW and RW sampling methods' performance gain grows as RPSV increases. The color denotes the performance (average return) gain over the uniform sampling baseline in the mixed datasets and environments tested in Section 5.2; the x-axis and y-axis indicate the average return and the RPSV of the dataset, respectively.

Figure5: Average return of datasets. We see that mixed datasets used in Section 5.2 have lower average return than regular datasets on average.

Figure 10: Performance of percentage-filtering sampling with varying percentages.

Haoran Xu, Xianyuan Zhan, Honglei Yin, and Huiling Qin. Discriminator-weighted offline imitation learning from suboptimal demonstrations. In International Conference on Machine Learning, pp. 24725-24742. PMLR, 2022.

Yuzhe Yang, Kaiwen Zha, Ying-Cong Chen, Hao Wang, and Dina Katabi. Delving into deep imbalanced regression. In International Conference on Machine Learning (ICML), 2021.

Figure 8: Performance of our method with varying temperature α (Sections 4.2 and 4.3), where color denotes α. Both AW and RW achieve higher returns than the baselines over a wide range of temperatures [0.01, 1.0], showing that our methods are not overly sensitive to the choice of temperature.

ACKNOWLEDGMENTS

We thank members of the Improbable AI Lab and Microsoft Research Montreal for helpful discussions and feedback. We are grateful to MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing HPC resources. This research was supported in part by the MIT-IBM Watson AI Lab, an AWS MLRA research grant, Google cloud credits provided as part of Google-MIT support, DARPA Machine Common Sense Program, ARO MURI under Grant Number W911NF-21-1-0328, ONR MURI under Grant Number N00014-22-1-2740, and by the United States Air Force Artificial Intelligence Accelerator under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

AUTHOR CONTRIBUTIONS

• Zhang-Wei Hong conceived the problem of mixed datasets in offline RL and the idea of trajectory reweighting, implemented the algorithms, ran the experiments, and wrote the paper.
• Pulkit Agrawal provided feedback on the idea and experiment designs.
• Rémi Tachet des Combes provided feedback on the idea and experiment designs, came up with the examples in Appendix A.3, and revised the paper.
• Romain Laroche formulated the analysis in Section 4, came up with the RPSV metric, formulated the idea, and revised the paper.

REPRODUCIBILITY STATEMENT

We have included the implementation details in Appendix A.7 and the source code in the supplementary material.

Published as a conference paper at ICLR 2023

Figure 12: (a) The average return of each sampling strategy in mixed datasets (Section 5.2) under the official IQL codebase. Our methods again show higher average returns than the baselines, following the trend in Figure 2 and suggesting that the performance gain of our methods is independent of implementation choices. (b) The average return of each sampling method in regular datasets (Section 5.4). Our methods show no significant performance difference from uniform sampling, similar to the results in Figure 4a.

