CONSERWEIGHTIVE BEHAVIORAL CLONING FOR RELIABLE OFFLINE REINFORCEMENT LEARNING

Abstract

The goal of offline reinforcement learning (RL) is to learn near-optimal policies from static logged datasets, thus sidestepping expensive online interactions. Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that by conditioning on desired future returns, BC can perform competitively with its value-based counterparts while enjoying greater simplicity and training stability. However, the distribution of returns in the offline dataset can be arbitrarily skewed and suboptimal, which poses a unique challenge for conditioning BC on expert returns at test time. We propose ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the performance of conditional BC for offline RL with two key components: trajectory weighting and conservative regularization. Trajectory weighting addresses the bias-variance tradeoff in conditional BC and provides a principled mechanism to learn from both low-return trajectories (typically plentiful) and high-return trajectories (typically few). Further, we analyze the notion of conservatism in existing BC methods and propose a novel conservative regularizer that explicitly encourages the policy to stay close to the data distribution. The regularizer helps achieve more reliable performance and removes the need for ad-hoc tuning of the conditioning value during evaluation. We instantiate CWBC in the context of Reinforcement Learning via Supervised Learning (RvS) (Emmons et al., 2021) and Decision Transformer (DT) (Chen et al., 2021), and empirically show that it significantly boosts the performance and stability of prior methods on various offline RL benchmarks.

1. INTRODUCTION

In many real-world applications such as education, healthcare, and autonomous driving, collecting data via online interactions can be expensive or even dangerous. However, we often have access to historical logged datasets in these domains that were collected previously by some unknown policies. The goal of offline reinforcement learning (RL) is to directly learn effective agent policies from such datasets, without additional online interactions (Lange et al., 2012; Levine et al., 2020). Many online RL algorithms have been adapted to work in the offline setting, including value-based methods (Fujimoto et al., 2019; Ghasemipour et al., 2021; Wu et al., 2019; Jaques et al., 2019; Kumar et al., 2020; Fujimoto & Gu, 2021; Kostrikov et al., 2021a) as well as model-based methods (Yu et al., 2020; Kidambi et al., 2020). The key challenge in all these methods is to generalize the value or dynamics to state-action pairs outside the offline dataset. An alternative way to approach offline RL is via methods derived from behavioral cloning (BC) (Bain & Sammut, 1995). BC is a supervised learning technique that was initially developed for imitation learning, where the goal is to learn a policy that mimics expert demonstrations. Recently, a number of works have proposed to formulate offline RL as a supervised learning problem (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021). Since offline RL datasets usually do not contain expert demonstrations, these works condition BC on extra context information that specifies target outcomes such as returns and goals. Empirically, these conditional BC approaches have been shown to perform competitively with value-based approaches, while additionally enjoying the simplicity and training stability of supervised learning.
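To make the return-conditioned BC setup concrete, the sketch below relabels each logged trajectory with its return-to-go and builds supervised (input, action) pairs, in the spirit of the conditional BC methods cited above. This is a minimal illustration, not the exact data pipeline of any of these papers; the trajectory tuple format and the concatenation of state and return are assumptions for the example.

```python
import numpy as np

def returns_to_go(rewards):
    """Suffix sums of the reward sequence: rtg[t] = sum of rewards from step t onward."""
    return np.cumsum(rewards[::-1])[::-1]

def make_conditional_bc_dataset(trajectories):
    """Build a supervised dataset for return-conditioned BC.

    Each trajectory is an (assumed) tuple (states, actions, rewards). Inputs are
    [state, return-to-go]; targets are the logged actions. Training then reduces
    to ordinary supervised regression/classification on these pairs.
    """
    inputs, targets = [], []
    for states, actions, rewards in trajectories:
        rtg = returns_to_go(np.asarray(rewards, dtype=float))
        for s, a, g in zip(states, actions, rtg):
            inputs.append(np.concatenate([s, [g]]))
            targets.append(a)
    return np.array(inputs), np.array(targets)
```

At test time, the same policy is queried with a *desired* return in place of the logged return-to-go, which is exactly where the distribution-mismatch issues discussed next arise.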
As commonly observed for supervised learning approaches, the performance of conditional BC is often limited by the suboptimality of the offline dataset, which can be probed through the distribution of returns in the dataset. There are two related challenges in this regard for offline RL. First, there is a unique bias-variance tradeoff in learning that arises due to the mismatch between the training and test distributions of returns. Typically, offline datasets in the real world mostly contain trajectories with low returns, whereas at test time we are interested in conditioning on high returns. Simply filtering the offline dataset to contain only high-return trajectories is not always viable, as the number of such trajectories can be very small, leading to high variance during learning. Second, the maximum return in the offline trajectories is often far below the desired expert returns. This implies that at test time, we need to condition our agent on out-of-distribution (ood) expert returns. Interestingly, we find that existing BC methods behave very differently when conditioning on ood returns. While DT (Chen et al., 2021) enjoys stable performance, RvS (Emmons et al., 2021) is highly sensitive to such ood conditioning and exhibits large drops from its peak performance on such inputs. Therefore, the current practice for setting the conditioning return at test time in RvS is based on careful tuning with online rollouts, which is often tedious, impractical, and inconsistent with the promise of offline RL to minimize online interactions. We propose ConserWeightive Behavioral Cloning (CWBC), a new BC-based approach for offline RL that mitigates the aforementioned challenges. CWBC consists of two key components: trajectory weighting and conservative regularization.
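One simple way to navigate the filtering-versus-variance tradeoff described above is to resample trajectories with weights that increase with return while keeping every trajectory's probability nonzero. The softmax-over-returns scheme below is a hedged illustration of this idea, not CWBC's actual weighting formula; the `temperature` parameter and the exponential form are assumptions for the sketch.

```python
import numpy as np

def trajectory_weights(returns, temperature=1.0):
    """Softmax-style sampling weights over trajectory returns.

    High-return trajectories are upweighted, but low-return trajectories keep a
    nonzero probability, so no data is discarded outright. Lower temperature
    concentrates mass on high returns (more bias, less variance); higher
    temperature approaches uniform sampling.
    """
    r = np.asarray(returns, dtype=float)
    z = (r - r.max()) / temperature  # shift by max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def sample_trajectory_indices(returns, n, temperature=1.0, rng=None):
    """Draw n trajectory indices for a training batch according to the weights."""
    rng = np.random.default_rng(rng)
    p = trajectory_weights(returns, temperature)
    return rng.choice(len(p), size=n, p=p)
```

The temperature plays the role of a bias-variance knob: it interpolates between pure filtering (keep only the best trajectories) and uniform sampling over the raw dataset.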
With trajectory weighting, we strive to balance the bias-variance tradeoff in learning via a scheme that downweights low-return trajectories without filtering them out, preserving data efficiency. Moreover, we introduce a notion of conservatism for ood-sensitive BC methods such as RvS, which encourages the policy to stay close to the data distribution when conditioned on large returns. We take trajectories with high returns from the dataset and add positive noise to their returns, which generates trajectories with large ood returns. We then predict actions conditioned on the perturbed returns and regularize them toward the original actions by penalizing the ℓ2 distance. By imposing such a regularizer, we can condition the policy on large, unseen target returns at test time, sidestepping tedious manual tuning and online interactions. Our proposed algorithm is simple and easy to implement. Empirically, we instantiate our framework in the context of RvS (Emmons et al., 2021) and DT (Chen et al., 2021), two state-of-the-art BC methods for offline RL. CWBC significantly improves the performance of RvS and DT on D4RL (Fu et al., 2020) locomotion tasks by 18% and 8%, respectively, without any hand-picking of the conditioning return at test time.
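The conservative regularizer described above can be sketched as follows: inflate the returns of sampled high-return trajectories with positive noise, query the policy at those ood returns, and penalize the ℓ2 distance to the logged actions. This is a minimal sketch of the idea, not the paper's exact implementation; the half-normal noise distribution, the `noise_scale` parameter, and the `policy(states, returns)` call signature are assumptions for illustration.

```python
import numpy as np

def conservative_penalty(policy, states, returns, actions, noise_scale=1.0, rng=None):
    """L2 regularizer tying predictions at inflated (ood) returns to logged actions.

    policy : callable mapping (states, conditioning_returns) -> predicted actions.
    states, returns, actions : arrays for a batch of high-return transitions.

    Adding positive noise to the conditioning returns simulates the ood target
    returns used at test time; penalizing the squared distance to the dataset
    actions keeps the policy close to the data distribution there.
    """
    rng = np.random.default_rng(rng)
    noise = np.abs(rng.normal(scale=noise_scale, size=np.shape(returns)))  # positive noise
    perturbed = returns + noise  # returns larger than anything in the batch
    pred = policy(states, perturbed)  # actions conditioned on ood returns
    return np.mean(np.sum((pred - actions) ** 2, axis=-1))
```

In training, a term like this would be added to the usual conditional BC loss, so that conditioning on large unseen returns at test time degrades gracefully instead of producing arbitrary actions.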

2. RELATED WORK

Offline Temporal Difference Learning. Most existing off-policy RL methods are based on temporal difference (TD) updates. A key challenge in directly applying them to the offline setting is extrapolation error: the value function is poorly estimated at unseen state-action pairs. To remedy this issue, various forms of conservatism have been introduced into off-policy TD methods, with the purpose of encouraging the learned policy to stay close to the behavior policy that generated the data. For instance, Fujimoto et al. (2019); Ghasemipour et al. (2021) use certain policy parameterizations specifically tailored for offline RL. Wu et al. (2019); Jaques et al. (2019); Kumar et al. (2019) penalize divergence-based distances between the learned policy and the behavior policy. Fujimoto & Gu (2021) propose an extra behavior cloning term to regularize the policy. This regularizer is simply the ℓ2 distance between predicted actions and the ground truth, yet it is surprisingly effective for porting off-policy TD methods to the offline setting. Instead of regularizing the policy, several other works have sought to incorporate divergence regularizations into the value function estimation, e.g., (Nachum et al., 2019; Kumar et al., 2020; Kostrikov et al., 2021a). Another recent work by Kostrikov et al. (2021b) learns the Q function via expectile regression, constraining the estimate of the maximum Q-value to actions within the dataset.

Behavior Cloning Approaches for Offline RL. Recently, there has been a surge of interest in converting offline RL into a supervised learning paradigm (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021). In essence, these approaches conduct behavior cloning (Bain & Sammut, 1995) while additionally conditioning on extra information such as goals or rewards. Among these works, Chen et al. (2021) and Janner et al. (2021) have formulated offline RL as sequence modeling problems.