OPTIMAL TRANSPORT FOR OFFLINE IMITATION LEARNING

Abstract

With the advent of large datasets, offline reinforcement learning (RL) is a promising framework for learning good decision-making policies without the need to interact with the real environment. However, offline RL requires the dataset to be reward-annotated, which presents practical challenges when reward engineering is difficult or when obtaining reward annotations is labor-intensive. In this paper, we introduce Optimal Transport Reward labeling (OTR), an algorithm that assigns rewards to offline trajectories using a few high-quality demonstrations. OTR's key idea is to use optimal transport to compute an optimal alignment between an unlabeled trajectory in the dataset and an expert demonstration, yielding a similarity measure that can be interpreted as a reward; this reward can then be used by an offline RL algorithm to learn the policy. OTR is easy to implement and computationally efficient. On D4RL benchmarks, we show that OTR with a single demonstration can consistently match the performance of offline RL with ground-truth rewards.¹
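The key idea — aligning an unlabeled trajectory to an expert demonstration with optimal transport and reading rewards off the alignment — can be sketched in a few lines of NumPy. This is an illustrative reimplementation under simplifying assumptions (Euclidean state cost, uniform marginals, Sinkhorn iterations), not the paper's code; the actual OTR may differ in its cost function and reward scaling, and all names below are invented:

```python
import numpy as np

def sinkhorn(a, b, cost, reg=0.05, n_iters=200):
    """Entropy-regularized OT via Sinkhorn iterations; returns the coupling matrix."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def ot_rewards(traj, expert, reg=0.05):
    """Label each state of `traj` with a reward derived from its OT alignment to `expert`.

    traj:   (T, d) array of states from an unlabeled trajectory.
    expert: (N, d) array of states from an expert demonstration.
    """
    # Pairwise cost between states; Euclidean distance is a modeling assumption here.
    cost = np.linalg.norm(traj[:, None, :] - expert[None, :, :], axis=-1)
    cost = cost / max(cost.max(), 1e-12)  # normalize for numerical stability
    T, N = cost.shape
    a = np.full(T, 1.0 / T)  # uniform mass over trajectory states
    b = np.full(N, 1.0 / N)  # uniform mass over expert states
    plan = sinkhorn(a, b, cost, reg)
    # Reward for state t: negative transport cost assigned to it by the coupling,
    # so states that align cheaply with the expert receive higher reward.
    return -(plan * cost).sum(axis=1)
```

Trajectories that closely track the expert incur little transport cost and thus receive rewards near zero, while dissimilar trajectories are pushed toward more negative rewards; these labels can then be fed to any off-the-shelf offline RL algorithm.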

1. INTRODUCTION

Offline Reinforcement Learning (RL) has made significant progress recently, enabling policies to be learned from logged experience without any interaction with the environment. Offline RL is relevant when online data collection is expensive or slow, e.g., in robotics, financial trading, and autonomous driving. A key feature of offline RL is that it can learn an improved policy that goes beyond the behavior policy that generated the data. However, offline RL requires a reward function for labeling the logged experience, making it impractical in settings where rewards are hard to specify with hand-crafted rules. Even when it is possible to label trajectories with human preferences, generating reward signals in this way can be expensive. Therefore, enabling offline RL to leverage unlabeled data is an open question of significant practical value.

Besides labeling every single trajectory, an alternative way to inform the agent about human preferences is to provide expert demonstrations. For many applications, providing expert demonstrations is more natural for practitioners than specifying a reward function. In robotics, providing expert demonstrations is fairly common, and in the absence of natural reward functions, 'learning from demonstration' has been used for decades to find good policies for robotic systems; see, e.g., (Atkeson & Schaal, 1997; Abbeel & Ng, 2004; Calinon et al., 2007; Englert et al., 2013). One such framework for learning policies from demonstrations is imitation learning (IL), which aims to learn policies that imitate the behavior of expert demonstrations without an explicit reward function. There are two popular approaches to IL: Behavior Cloning (BC) (Pomerleau, 1988) and Inverse Reinforcement Learning (IRL) (Ng & Russell, 2000). BC aims to recover the demonstrator's behavior directly by setting up an offline supervised learning problem.
If the demonstrations are of high quality and their actions are recorded, BC can work very well, as demonstrated by Pomerleau (1988), but it typically generalizes poorly to new situations. IRL instead learns an intermediate reward function that aims to capture the demonstrator's intent. Current algorithms

¹ Code is available at https://github.com/ethanluoyc/optimal_transport_reward

