ADVERSARIAL IMITATION LEARNING WITH PREFERENCES

Abstract

Designing an accurate and explainable reward function for many Reinforcement Learning tasks is a cumbersome and tedious process. Instead, learning policies directly from the feedback of human teachers naturally integrates human domain knowledge into the policy optimization process. Different feedback modalities, such as demonstrations and preferences, offer distinct benefits and drawbacks. For example, demonstrations convey a lot of information about the task but are often hard or costly to obtain from real experts, while preferences typically contain less information but are in most cases cheap to generate. However, existing methods centered around human feedback mostly focus on a single teaching modality, causing them to miss out on valuable training data while making them less intuitive to use. In this paper, we propose a novel method for policy learning that incorporates two different feedback types, namely demonstrations and preferences. To this end, we make use of the connection between discriminator training and density ratio estimation to incorporate preferences into the popular Adversarial Imitation Learning paradigm. This insight allows us to express loss functions over both demonstrations and preferences in a unified framework. Besides expert demonstrations, we are also able to learn from imperfect ones and combine them with preferences to achieve improved task performance. We experimentally validate the effectiveness of combining preferences and demonstrations on common benchmarks and show that our method can efficiently learn challenging robot manipulation tasks.

1. INTRODUCTION

This paper aims to progress research towards enabling humans without expert knowledge of machine learning or robotics to teach robots to perform tasks using diverse feedback modalities. Enabling human teachers to use various feedback types allows for a more natural human-robot training interaction. In particular, this paper focuses on human demonstrations of the desired behavior, and preferences, i.e., pairwise comparisons of two possible robot behaviors. Both types of feedback have distinct benefits. Demonstrations exploit the domain knowledge of human teachers for a given task. Yet, they are often cumbersome to generate and place a considerable cognitive load on the human teacher, up to the point that the teacher might not be capable of demonstrating optimal behavior on their own. In contrast, preferences are easier to evaluate, albeit less informative, as they only indicate relative quality, i.e., which option is better among the two provided ones. Both feedback types have been extensively researched in isolation. For instance, a plethora of approaches focus on learning from demonstrations, also referred to as imitation learning (Osa et al., 2018). In recent years, preference learning has been an active research topic (Wirth et al., 2017), but integrative work on learning from multiple feedback modalities has been much sparser and is often limited to, e.g., a theoretical analysis (Jeon et al., 2020). In this paper, we introduce Adversarial Imitation Learning with Preferences (AILP), a novel method for learning from a combination of demonstrations and preferences that builds upon the well-known Adversarial Imitation Learning (AIL) framework. AIL trains a discriminator that indicates how the policy should change, which can be seen as a differential reward. Preference learning, in contrast, typically encodes the reward directly in the preferences and requires a static reward.
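The contrast between a differential and a static reward can be made concrete through the density-ratio view of the discriminator. The sketch below, assuming an AIRL-style reward derived from a sigmoid discriminator output ϕ(s, a) ∈ (0, 1), is illustrative only and not the paper's implementation:

```python
import numpy as np

def ail_reward(phi, eps=1e-8):
    """Reward induced by a discriminator output phi = phi(s, a) in (0, 1).

    At the discriminator optimum, log(phi) - log(1 - phi) recovers the
    log density ratio log(p_expert(s, a) / p_policy(s, a)): positive when
    a sample looks more like expert data, negative otherwise. Because the
    policy distribution shifts every iteration, this reward is
    *differential* -- meaningful only relative to the current policy.
    """
    phi = np.clip(phi, eps, 1.0 - eps)
    return np.log(phi) - np.log(1.0 - phi)
```

A sample the discriminator cannot distinguish from the demonstrations (ϕ = 0.5) receives zero reward, which is precisely what makes this signal non-stationary across training, unlike the static reward assumed in standard preference learning.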
We show that naively combining a differential and a static reward, i.e., adversarial training and preferences, is incompatible and leads to poor performance. To alleviate this issue, we present a preference loss that is compatible with AIL and therefore enables AIL approaches to benefit from additional preference feedback that is available alongside demonstrations. We provide an overview of our method in Figure 1. The contributions of this paper are as follows: (i) we present a novel preference-learning approach that is compatible with Adversarial Imitation Learning and achieves results comparable to the state of the art, (ii) we extend Adversarial Imitation Learning to include learning from both preferences and demonstrations, and (iii) we present extensive evaluations of the proposed method, outperforming various baseline methods on well-known benchmarks.
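A unified objective over demonstrations and preferences might be sketched as follows. The toy logistic discriminator, the trajectory score (a sum of per-step log-discriminator terms), and the weight `lam` are assumptions for illustration, not the paper's exact formulation of L_dem and L_pref:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def phi(w, sa):
    """Toy logistic discriminator phi(s, a) on state-action features sa."""
    return sigmoid(sa @ w)

def demo_loss(w, demo_sa, buffer_sa, eps=1e-8):
    """L_dem: GAIL-style binary cross-entropy pushing phi -> 1 on
    teacher demonstrations and phi -> 0 on samples from the buffer B."""
    return (-np.mean(np.log(phi(w, demo_sa) + eps))
            - np.mean(np.log(1.0 - phi(w, buffer_sa) + eps)))

def pref_loss(w, better_sa, worse_sa, eps=1e-8):
    """L_pref: Bradley-Terry negative log-likelihood on trajectory scores,
    so the preferred trajectory should score higher under phi."""
    score_better = np.sum(np.log(phi(w, better_sa) + eps))
    score_worse = np.sum(np.log(phi(w, worse_sa) + eps))
    return -np.log(sigmoid(score_better - score_worse) + eps)

def total_loss(w, demo_sa, buffer_sa, better_sa, worse_sa, lam=1.0):
    """Combined objective: L_dem + lam * L_pref (lam is an assumed weight)."""
    return demo_loss(w, demo_sa, buffer_sa) + lam * pref_loss(w, better_sa, worse_sa)
```

Both terms are differentiable in the discriminator parameters `w`, which is what allows demonstrations and preferences to be expressed as loss functions in one framework and optimized jointly.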

2. RELATED WORK

Teaching robots through feedback from human teachers has been an active research topic in past years (Chernova & Thomaz, 2014). There has been research on unifying multiple feedback types into a single framework, e.g., by Jeon et al. (2020), but the focus has mostly been on unimodal feedback. Two commonly researched human feedback modes are demonstrations and preferences. Besides these, there are numerous others, such as assigning numerical values to different options, as in the work by Wilde et al. (2021).

Learning from Demonstrations. Recently, learning from demonstrations and Imitation Learning (IL) (Osa et al., 2018; Argall et al., 2009) have been active research fields in robot learning, in which the robot is presented with instances of desired expert behavior. IL algorithms can generally be classified as either Behavioral Cloning (BC) (Torabi et al., 2018; Florence et al., 2021), where a policy is directly regressed from demonstrations in a supervised fashion, or Inverse Reinforcement Learning (IRL) (Abbeel & Ng, 2017; Ziebart et al., 2008; Zakka et al., 2021), which recovers and subsequently optimizes a reward function from demonstrations. In recent years, there has been a rise in adversarial methods inspired by Generative Adversarial Networks (Goodfellow et al., 2014). Starting with Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016), these methods utilize a discriminator-based distribution-matching objective for both BC (Ho & Ermon, 2016; Torabi et al.) and IRL (Fu et al., 2018; Xiao et al., 2019). Building on this, another body of work



Figure 1: Schematic overview of Adversarial Imitation Learning with Preferences (AILP). Given a policy (blue) that may be pre-trained on a self-supervised maximum entropy objective, we can optionally query a teacher for new demonstrations (yellow) and preferences between trajectories from the buffer B (green). Next, a discriminator ϕ(s, a) (purple) is trained to discriminate between samples from B and teacher demonstrations (L dem ), while at the same time preferring better samples over worse ones (L pref ). For an off-policy Reinforcement Learning (RL) algorithm, this training makes use of the environment (grey) and a sample buffer that needs to be relabeled according to the reward that depends on ϕ(s, a). This process is iterated over until convergence, at which point the policy produces samples that are indistinguishable from the expert demonstrations.

