OPTIMAL TRANSPORT FOR OFFLINE IMITATION LEARNING

Abstract

With the advent of large datasets, offline reinforcement learning (RL) is a promising framework for learning good decision-making policies without the need to interact with the real environment. However, offline RL requires the dataset to be reward-annotated, which presents practical challenges when reward engineering is difficult or when obtaining reward annotations is labor-intensive. In this paper, we introduce Optimal Transport Reward labeling (OTR), an algorithm that assigns reward labels to offline trajectories given a small number of high-quality demonstrations. OTR's key idea is to use optimal transport to compute an optimal alignment between an unlabeled trajectory in the dataset and an expert demonstration, yielding a similarity measure that can be interpreted as a reward; this reward can then be used by an offline RL algorithm to learn the policy. OTR is easy to implement and computationally efficient. On D4RL benchmarks, we show that OTR with a single demonstration can consistently match the performance of offline RL with ground-truth rewards.

1. INTRODUCTION

Offline Reinforcement Learning (RL) has made significant progress recently, enabling learning policies from logged experience without any interaction with the environment. Offline RL is relevant when online data collection can be expensive or slow, e.g., robotics, financial trading, and autonomous driving. A key feature of offline RL is that it can learn an improved policy that goes beyond the behavior policy that generated the data. However, offline RL requires the existence of a reward function for labeling the logged experience, making direct applications of offline RL methods impractical for applications where rewards are hard to specify with hand-crafted rules. Even if it is possible to label the trajectories with human preferences, such a procedure to generate reward signals can be expensive. Therefore, enabling offline RL to leverage unlabeled data is an open question of significant practical value.

Besides labeling every single trajectory, an alternative way to inform the agent about human preference is to provide expert demonstrations. For many applications, providing expert demonstrations is more natural for practitioners compared to specifying a reward function. In robotics, providing expert demonstrations is fairly common, and in the absence of natural reward functions, 'learning from demonstration' has been used for decades to find good policies for robotic systems; see, e.g., (Atkeson & Schaal, 1997; Abbeel & Ng, 2004; Calinon et al., 2007; Englert et al., 2013). One such framework for learning policies from demonstrations is imitation learning (IL). IL aims at learning policies that imitate the behavior of expert demonstrations without an explicit reward function. There are two popular approaches to IL: Behavior Cloning (BC) (Pomerleau, 1988) and Inverse Reinforcement Learning (IRL) (Ng & Russell, 2000). BC aims to recover the demonstrator's behavior directly by setting up an offline supervised learning problem.
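To make the supervised-learning view of BC concrete, consider a minimal sketch (not the paper's implementation): with recorded state-action pairs from a demonstrator, BC reduces to regression from states to expert actions. Here we fit a linear policy by least squares on synthetic data; the linear expert and the dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert demonstrations: states and the actions the expert took.
states = rng.normal(size=(500, 4))    # (N, state_dim)
expert_W = rng.normal(size=(4, 2))    # unknown "expert policy" weights
actions = states @ expert_W           # (N, action_dim), noiseless for clarity

# Behavior cloning as supervised learning: fit a linear policy by least squares.
W_bc, *_ = np.linalg.lstsq(states, actions, rcond=None)

# The cloned policy imitates the expert on states from the same distribution.
test_states = rng.normal(size=(10, 4))
error = np.max(np.abs(test_states @ W_bc - test_states @ expert_W))
print(f"max action error on held-out states: {error:.2e}")
```

With enough in-distribution data and a noiseless expert, the regression recovers the expert exactly; the generalization failures discussed below arise once the learned policy visits states outside the demonstration distribution.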
If demonstrations are of high quality and actions of the demonstrations are recorded, BC can work very well, as demonstrated by Pomerleau (1988), but generalization to new situations typically does not work well. IRL instead learns an intermediate reward function that aims to capture the demonstrator's intent. Current IRL algorithms have demonstrated very strong empirical performance, requiring only a few expert demonstrations to obtain good performance (Ho & Ermon, 2016; Dadashi et al., 2022). While these IRL methods do not require many demonstrations, they typically focus on the online RL setting and require a large number of environment interactions to learn an imitating policy, i.e., these methods are not suitable for offline learning.

In this paper, we introduce Optimal Transport Reward labeling (OTR), an algorithm that uses optimal transport theory to automatically assign reward labels to unlabeled trajectories in an offline dataset, given one or more expert demonstrations. This reward-annotated dataset can then be used by offline RL algorithms to find good policies that imitate demonstrated behavior. Specifically, OTR uses optimal transport to find optimal alignments between unlabeled trajectories in the dataset and expert demonstrations. The similarity measure between a state in an unlabeled trajectory and that of an expert trajectory is then treated as a reward label. These rewards can be used by any offline RL algorithm for learning policies from a small number of expert demonstrations and a large offline dataset. Figure 1 illustrates how OTR uses expert demonstrations to add reward labels to an offline dataset, which can then be used by an offline RL algorithm to find a good policy that imitates demonstrated behavior. Empirical evaluations on the D4RL datasets (Fu et al., 2021) demonstrate that OTR recovers the performance of offline RL methods with ground-truth rewards with only a single demonstration.
Compared to previous reward learning and imitation learning approaches, our approach also achieves consistently better performance across a wide range of offline datasets and tasks.
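The core alignment-to-reward computation can be sketched in a few lines. The following is a minimal illustration, not the paper's exact implementation, of using entropy-regularized optimal transport (Sinkhorn iterations) to align an unlabeled trajectory with an expert demonstration and convert the per-step transport cost into rewards. The squared-Euclidean cost, the regularization strength `eps`, and the reward scaling are assumptions made here for illustration.

```python
import numpy as np

def sinkhorn_rewards(traj, expert, eps=1.0, n_iters=500):
    """Label each state in `traj` with a reward from its OT alignment to `expert`.

    traj:   (T, d) array of states from an unlabeled trajectory.
    expert: (T', d) array of states from an expert demonstration.
    Returns a length-T array of non-positive rewards: higher means closer
    to the expert under the optimal alignment.
    """
    T, Tp = len(traj), len(expert)
    # Pairwise cost: squared Euclidean distance between states.
    C = ((traj[:, None, :] - expert[None, :, :]) ** 2).sum(-1)
    # Sinkhorn iterations for entropy-regularized OT with uniform marginals.
    K = np.exp(-C / eps)
    a, b = np.full(T, 1.0 / T), np.full(Tp, 1.0 / Tp)
    u = np.ones(T)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]          # optimal coupling, shape (T, T')
    # Reward for step t is the negative transport cost that step contributes
    # (scaled by T so rewards do not vanish as trajectories get longer).
    return -(P * C).sum(axis=1) * T

# Toy check: a trajectory identical to the expert should earn higher total
# reward than a perturbed copy.
rng = np.random.default_rng(0)
expert = rng.normal(size=(20, 3))
good = sinkhorn_rewards(expert, expert)
bad = sinkhorn_rewards(expert + 1.0, expert)
print(good.sum(), bad.sum())
```

In practice one would compute such rewards for every trajectory in the offline dataset and then run any off-the-shelf offline RL algorithm on the relabeled data.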

2. OFFLINE REINFORCEMENT LEARNING AND IMITATION LEARNING

Offline Reinforcement Learning In offline/batch reinforcement learning, we are interested in learning policies directly from fixed offline datasets (Lange et al., 2012; Levine et al., 2020), i.e., the agent is not permitted any additional interaction with the environment. Offline RL research typically assumes access to an offline dataset of observed transitions D = {(s i t , a i t , r i t , s i t+1 )} N i=1 . This setting is particularly attractive for applications where there is previous logged experience available but online data collection is expensive (e.g., robotics, healthcare). Recently, the field of offline RL has made significant progress, and many offline RL algorithms have been proposed to learn improved policies from diverse and sub-optimal offline data (Levine et al., 2020; Fujimoto & Gu, 2021; Kumar et al., 2020; Kostrikov et al., 2022c; Wang et al., 2020).

Offline RL research typically assumes that the offline dataset is reward-annotated. That is, each transition (s i t , a i t , r i t , s i t+1 ) in the dataset is labeled with reward r i t . One common approach to offline RL is to leverage the reward signals in the offline dataset and learn a policy with an actor-critic algorithm purely from offline transitions (Levine et al., 2020). However, in practice, obtaining per-transition reward annotations for the offline dataset may be difficult due to the challenges in designing a good reward function. Zolna et al. (2020) propose ORIL, which learns a reward function based on positive-unlabeled (PU) learning (Elkan & Noto, 2008) that can be used to add reward labels to offline datasets, allowing unlabeled datasets to be used by offline RL algorithms.
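To make the relabeling setting concrete, the following sketch shows how an unlabeled dataset of trajectories can be turned into the reward-annotated transitions (s_t, a_t, r_t, s_{t+1}) that offline RL algorithms expect, given any function that maps a trajectory's state sequence to per-step rewards. The container names and the zero-reward placeholder function are illustrative assumptions, not part of the paper.

```python
import numpy as np
from typing import Callable, NamedTuple

class Transition(NamedTuple):
    s: np.ndarray       # state
    a: np.ndarray       # action
    r: float            # reward (filled in by relabeling)
    s_next: np.ndarray  # next state

def relabel(trajectories, reward_fn: Callable[[np.ndarray], np.ndarray]):
    """Annotate unlabeled trajectories with rewards.

    trajectories: list of (states, actions) pairs, with states of shape
                  (T+1, d) and actions of shape (T, k).
    reward_fn:    maps the (T+1, d) state sequence to T per-step rewards.
    """
    dataset = []
    for states, actions in trajectories:
        rewards = reward_fn(states)
        for t in range(len(actions)):
            dataset.append(Transition(states[t], actions[t],
                                      float(rewards[t]), states[t + 1]))
    return dataset

# Illustrative use with a placeholder reward function (zero everywhere).
rng = np.random.default_rng(0)
trajs = [(rng.normal(size=(11, 3)), rng.normal(size=(10, 2))) for _ in range(4)]
labeled = relabel(trajs, lambda states: np.zeros(len(states) - 1))
print(len(labeled))  # 4 trajectories x 10 steps = 40 transitions
```

Any reward labeler, whether learned as in ORIL or computed via optimal transport as in OTR, can be plugged in as `reward_fn` without changing the downstream offline RL algorithm.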



Code is available at https://github.com/ethanluoyc/optimal_transport_reward



Figure 1: Illustration of Optimal Transport Reward Labeling (OTR). Given expert demonstrations (left) and an offline dataset without reward labels (center), OTR adds reward labels r_i to the offline dataset by means of optimal transport (orange, center). The labeled dataset can then be used by an offline RL algorithm (right) to learn policies.


