HARNESSING MIXED OFFLINE REINFORCEMENT LEARNING DATASETS VIA TRAJECTORY WEIGHTING

Abstract

Most offline reinforcement learning (RL) algorithms return a target policy maximizing a trade-off between (1) the expected performance gain over the behavior policy that collected the dataset, and (2) the risk stemming from the out-of-distributionness of the induced state-action occupancy. It follows that the performance of the target policy is strongly related to the performance of the behavior policy and, thus, to the trajectory return distribution of the dataset. We show that in mixed datasets consisting of mostly low-return trajectories and only a few high-return trajectories, state-of-the-art offline RL algorithms are overly restrained by the low-return trajectories and fail to exploit the high-performing trajectories to the fullest. To overcome this issue, we show that, in deterministic MDPs with stochastic initial states, the dataset sampling can be re-weighted to induce an artificial dataset whose behavior policy has a higher return. This re-weighted sampling strategy may be combined with any offline RL algorithm. We further show that the opportunity for performance improvement over the behavior policy correlates with the positive-sided variance of the trajectory returns in the dataset. We empirically show that while CQL, IQL, and TD3+BC achieve only a part of this potential policy improvement, these same algorithms combined with our re-weighted sampling strategy fully exploit the dataset. Furthermore, we empirically demonstrate that, despite its theoretical limitation, the approach may still be efficient in stochastic environments.

1. INTRODUCTION

Offline reinforcement learning (RL) currently receives great attention because it allows one to optimize RL policies from logged data without direct interaction with the environment. This makes the RL training process safer and cheaper, since collecting interaction data in the real world (e.g., in robotics or health care) is high-risk, expensive, and time-consuming. Unfortunately, several papers have shown that achieving near-optimality in offline RL is intractable in terms of sample efficiency (Xiao et al., 2022; Chen & Jiang, 2019; Foster et al., 2022). In contrast to near-optimality, policy improvement over the behavior policy is an objective that is approximately realizable, since the behavior policy can efficiently be cloned with supervised learning (Urbancic, 1994; Torabi et al., 2018). Thus, most practical offline RL algorithms incorporate a component ensuring, either formally or intuitively, that the returned policy improves over the behavior policy: pessimistic algorithms ensure that a lower bound on the value of the target policy (i.e., the policy learned by the offline RL algorithm) improves over the value of the behavior policy (Petrik et al., 2016; Kumar et al., 2020b; Buckman et al., 2020), conservative algorithms regularize their policy search with respect to the behavior policy (Thomas, 2015; Laroche et al., 2019; Fujimoto et al., 2019), and one-step algorithms prevent the target policy value from propagating through bootstrapping (Brandfonbrener et al., 2021). These algorithms use the behavior policy as a stepping stone. As a consequence, their performance guarantees depend heavily on the performance of the behavior policy, which makes these offline RL algorithms sensitive to the return distribution of the trajectories collected by the behavior policy. To capture this dependency, we say that these algorithms are anchored to the behavior policy.
Anchoring to a near-optimal (i.e., expert) dataset favors the performance of an algorithm, while anchoring to a low-performing (e.g., novice) dataset may hinder the target policy's performance. In realistic scenarios, offline RL datasets might consist mostly of low-performing trajectories with only a few high-performing trajectories collected by a mixture of behavior policies, since curating high-performing trajectories is costly. It is thus desirable to avoid anchoring on low-performing behavior policies and to exploit high-performing ones in mixed datasets. However, we show that state-of-the-art offline RL algorithms fail to exploit high-performing trajectories to their fullest. We find that the potential for policy improvement over the behavior policy is correlated with the positive-sided variance (PSV) of the trajectory returns in the dataset, and argue that when the return PSV is high, the algorithmic anchoring may be limiting the performance of the returned policy. In order to provide a better algorithmic anchoring, we propose to alter the behavior policy without collecting additional data. We start by proving that re-weighting the dataset during the training of an offline RL algorithm is equivalent to performing this training with another behavior policy. Furthermore, under the assumption that the environment is deterministic, by giving larger weights to high-return trajectories, we can steer the implicit behavior policy toward high performance and therefore grant a cold-start performance boost to the offline RL algorithm.
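The return PSV discussed above can be computed directly from the dataset's trajectory returns. The following is a minimal sketch using the common definition of positive-sided variance, E[(X − E[X])₊²]; the function name and the toy datasets are illustrative, not taken from the paper:

```python
import numpy as np

def positive_sided_variance(returns):
    """PSV of trajectory returns: mean squared *positive* deviation
    from the mean, E[(X - E[X])_+^2]. A high PSV signals that a few
    trajectories score far above the dataset's average return."""
    returns = np.asarray(returns, dtype=float)
    dev = returns - returns.mean()
    return float(np.mean(np.clip(dev, 0.0, None) ** 2))

# A mixed dataset: mostly low-return trajectories, a few high-return ones.
mixed = [1.0] * 95 + [100.0] * 5
# A homogeneous dataset with the same mean return (5.95).
uniform = [5.95] * 100

print(positive_sided_variance(mixed))    # large: high upside potential
print(positive_sided_variance(uniform))  # 0.0: nothing to exploit
```

Comparing the two datasets illustrates the paper's point: both have the same average return, but only the mixed one has a nonzero PSV, i.e., room for improvement over the (implicit) behavior policy.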
While determinism is a strong assumption that we prove to be necessary with a minimal failure example, we show that the guarantees still hold when the initial state is stochastic by re-weighting with, instead of the trajectory return, a trajectory return advantage: G(τ_i) − V^µ(s_{i,0}), where G(τ_i) is the return obtained for trajectory i and V^µ(s_{i,0}) is the expected return of following the behavior policy µ from the initial state s_{i,0}. Furthermore, we empirically observe that our strategies allow performance gains over their uniform-sampling counterparts even in stochastic environments. We also note that determinism is required by several state-of-the-art offline RL algorithms (Schmidhuber, 2019; Srivastava et al., 2019; Kumar et al., 2019b; Chen et al., 2021; Furuta et al., 2021; Brandfonbrener et al., 2022). Under the guidance of our theoretical analysis, our principal contribution is two simple weighted sampling strategies: Return-weighting (RW) and Advantage-weighting (AW). RW and AW re-weight trajectories using the Boltzmann distribution of trajectory returns and advantages, respectively. Our weighted sampling strategies are agnostic to the underlying offline RL algorithm and can thus serve as a drop-in replacement in any off-the-shelf offline RL algorithm, essentially at no extra computational cost. We evaluate our sampling strategies on three state-of-the-art offline RL algorithms, CQL, IQL, and TD3+BC (Kumar et al., 2020b; Kostrikov et al., 2022; Fujimoto & Gu, 2021), as well as on behavior cloning, over 62 datasets in the D4RL benchmark (Fu et al., 2020). The experimental results, reported with statistically robust metrics (Agarwal et al., 2021), demonstrate that both of our sampling strategies significantly boost the performance of all considered offline RL algorithms on challenging mixed datasets with few high-return trajectories, and perform at least on par with them on regular datasets with evenly distributed returns.
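Both RW and AW reduce to computing a Boltzmann distribution over per-trajectory scores and sampling trajectories from it when building minibatches. A minimal sketch, assuming a temperature hyperparameter (called alpha here; the parameter name and values are illustrative, not the paper's):

```python
import numpy as np

def boltzmann_weights(scores, alpha=1.0):
    """Sampling probabilities proportional to exp(score / alpha).
    For RW, `scores` are trajectory returns G(tau_i); for AW, they are
    return advantages G(tau_i) - V_mu(s_{i,0}). Smaller alpha
    concentrates sampling on the highest-scoring trajectories."""
    scores = np.asarray(scores, dtype=float)
    logits = (scores - scores.max()) / alpha  # subtract max for stability
    w = np.exp(logits)
    return w / w.sum()

# One high-return trajectory among three low-return ones.
returns = [1.0, 1.0, 1.0, 10.0]
probs = boltzmann_weights(returns, alpha=2.0)

# Drop-in replacement for uniform sampling: draw trajectory indices
# with these probabilities when assembling training minibatches.
rng = np.random.default_rng(0)
batch_idx = rng.choice(len(returns), size=256, p=probs)
```

Because only the sampling distribution changes, any offline RL algorithm that consumes minibatches of transitions or trajectories can use these weights without modification, which matches the drop-in property claimed above.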

2. PRELIMINARIES

We consider the reinforcement learning (RL) problem in a Markov decision process (MDP) characterized by a tuple (S, A, R, P, ρ_0), where S and A denote the state and action spaces, respectively, R : S × A → ℝ is a reward function, P : S × A → Δ_S is the state transition dynamics, and ρ_0 ∈ Δ_S is an initial state distribution, where Δ_X denotes the simplex over a set X. An MDP starts from an initial state s_0 ∼ ρ_0. At each timestep t, an agent perceives the state s_t, takes an action a_t ∼ π(·|s_t), where π : S → Δ_A is the agent's policy, receives a reward r_t = R(s_t, a_t), and transitions to a next state s_{t+1} ∼ P(·|s_t, a_t). The performance of a policy π is measured by the expected return J(π) from initial states s_0 ∼ ρ_0:

J(π) = E[ Σ_{t=0}^∞ R(s_t, a_t) | s_0 ∼ ρ_0, a_t ∼ π(·|s_t), s_{t+1} ∼ P(·|s_t, a_t) ].   (1)
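Equation (1) can be estimated by Monte Carlo: roll out the policy from initial states s_0 ∼ ρ_0 and average the (truncated) returns. The sketch below uses a toy deterministic chain MDP; the environment interface (reset/env_step) and all names are illustrative assumptions, not part of the paper:

```python
def estimate_return(env_step, reset, policy, episodes=100, horizon=50):
    """Monte Carlo estimate of J(pi): average return over rollouts
    starting from s_0 ~ rho_0, truncated at `horizon` steps."""
    total = 0.0
    for _ in range(episodes):
        s = reset()
        for _ in range(horizon):
            a = policy(s)
            s, r, done = env_step(s, a)
            total += r
            if done:
                break
    return total / episodes

# Toy chain MDP: the state is an integer starting at 0, actions move
# it by +1 or -1, and reaching state 3 yields reward 1 and terminates.
def reset():
    return 0

def env_step(s, a):
    s2 = s + a
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

policy = lambda s: 1  # always move right

print(estimate_return(env_step, reset, policy, episodes=10))  # prints 1.0
```

The "always move right" policy collects the reward in every episode, so its estimated return is exactly 1.0; a policy that moved left would score 0.0, mirroring how J(π) ranks policies.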

