CONSERWEIGHTIVE BEHAVIORAL CLONING FOR RELIABLE OFFLINE REINFORCEMENT LEARNING

Abstract

The goal of offline reinforcement learning (RL) is to learn near-optimal policies from static logged datasets, thus sidestepping expensive online interactions. Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that by conditioning on desired future returns, BC can perform competitively with value-based counterparts, while enjoying greater simplicity and training stability. However, the distribution of returns in the offline dataset can be arbitrarily skewed and suboptimal, which poses a unique challenge for conditioning BC on expert returns at test time. We propose ConserWeightive Behavioral Cloning (CWBC), a simple and effective method for improving the performance of conditional BC for offline RL with two key components: trajectory weighting and conservative regularization. Trajectory weighting addresses the bias-variance tradeoff in conditional BC and provides a principled mechanism to learn from both low-return trajectories (typically plentiful) and high-return trajectories (typically scarce). Further, we analyze the notion of conservatism in existing BC methods, and propose a novel conservative regularizer that explicitly encourages the policy to stay close to the data distribution. The regularizer helps achieve more reliable performance, and removes the need for ad-hoc tuning of the conditioning value during evaluation. We instantiate CWBC in the context of Reinforcement Learning via Supervised Learning (RvS) (Emmons et al., 2021) and Decision Transformer (DT) (Chen et al., 2021), and empirically show that it significantly boosts the performance and stability of prior methods on various offline RL benchmarks.

1. INTRODUCTION

In many real-world applications such as education, healthcare, and autonomous driving, collecting data via online interactions can be expensive or even dangerous. However, we often have access to historical logged datasets in these domains that have been collected previously by some unknown policies. The goal of offline reinforcement learning (RL) is to directly learn effective agent policies from such datasets, without additional online interactions (Lange et al., 2012; Levine et al., 2020). Many online RL algorithms have been adapted to the offline setting, including value-based methods (Fujimoto et al., 2019; Ghasemipour et al., 2021; Wu et al., 2019; Jaques et al., 2019; Kumar et al., 2020; Fujimoto & Gu, 2021; Kostrikov et al., 2021a) as well as model-based methods (Yu et al., 2020; Kidambi et al., 2020). The key challenge in all these methods is to generalize the value function or dynamics model to state-action pairs outside the offline dataset.

An alternative way to approach offline RL is via methods derived from behavioral cloning (BC) (Bain & Sammut, 1995). BC is a supervised learning technique that was initially developed for imitation learning, where the goal is to learn a policy that mimics expert demonstrations. Recently, a number of works have proposed formulating offline RL as a supervised learning problem (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021). Since offline RL datasets usually do not contain expert demonstrations, these works condition BC on additional context information that specifies a target outcome, such as a desired return or goal. Empirical evidence shows that these conditional BC approaches perform competitively with value-based approaches, while additionally enjoying the simplicity and training stability of supervised learning.
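To make the return-conditioning idea concrete, the following is a minimal sketch, not any of the cited methods: a linear policy is fit by least squares to predict actions from states augmented with the return-to-go, and at test time the policy is queried with a desired target return. All function names here are illustrative.

```python
import numpy as np

def returns_to_go(rewards):
    # Suffix sums of rewards: R_t = sum of r_{t'} for t' >= t.
    return np.cumsum(rewards[::-1])[::-1]

def fit_conditional_bc(states, actions, rtgs):
    # Regress actions on [state, return-to-go, bias] via least squares,
    # i.e., supervised learning of a return-conditioned linear policy.
    X = np.hstack([states, rtgs[:, None], np.ones((len(states), 1))])
    W, *_ = np.linalg.lstsq(X, actions, rcond=None)
    return W

def act(W, state, target_return):
    # Condition the learned policy on a desired outcome at test time.
    x = np.concatenate([state, [target_return, 1.0]])
    return x @ W
```

Real methods such as DT and RvS replace the linear map with a transformer or MLP and condition on full trajectory context, but the training signal is the same supervised regression onto logged actions.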

