RELATIVE BEHAVIORAL ATTRIBUTES: FILLING THE GAP BETWEEN SYMBOLIC GOAL SPECIFICATION AND REWARD LEARNING FROM HUMAN PREFERENCES

Abstract

Generating complex behaviors that satisfy the preferences of non-expert users is a crucial requirement for AI agents. Interactive reward learning from trajectory comparisons (a.k.a. RLHF) is one way to allow non-expert users to convey complex objectives by expressing preferences over short clips of agent behaviors. Even though this parametric method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human is unable to provide richer feedback than binary preference labels, leading to intolerably high feedback complexity and poor user experience. While providing a detailed symbolic closed-form specification of the objectives might be tempting, it is not always feasible even for an expert user. However, in most cases, humans are aware of how the agent should change its behavior along meaningful axes to fulfill their underlying purpose, even if they are not able to fully specify task objectives symbolically. Using this as motivation, we introduce the notion of Relative Behavioral Attributes, which allows users to tweak agent behavior through symbolic concepts (e.g., increasing the softness or speed of the agent's movement). We propose two practical methods that can learn to model any kind of behavioral attribute from ordered behavior clips. We demonstrate the effectiveness of our methods on four tasks with nine different behavioral attributes, showing that once the attributes are learned, end users can produce desirable agent behaviors relatively effortlessly, by providing only around ten instances of feedback. This is over an order of magnitude less than that required by the popular learning-from-human-preferences baselines.

1. INTRODUCTION

A central problem in building versatile autonomous agents is how to specify and customize agent behaviors. Two representative ways to specify tasks are manual specification of reward functions and reward learning from trajectory comparisons. In the former, the user needs to provide an exact description of the objective as a suitable reward function to be used by the agent. This is often only feasible when the specification can be done at a high level, e.g., in symbolic terms (Russell & Norvig, 2003), by providing a symbolic reward machine or domain model (Yang et al., 2018; Illanes et al., 2020; Guan et al., 2022; Icarte et al., 2022), or by giving a natural language instruction (Mahmoudieh et al., 2022). Instead of requiring users to give precise task descriptions, reward learning from trajectory comparisons (a.k.a. reinforcement learning from human feedback, or RLHF) learns a reward function from human preference labels over pairs of behavior clips (Wilson et al., 2012; Christiano et al., 2017; Ouyang et al., 2022; Bai et al., 2022a), from numerical ratings on trajectory segments (Knox & Stone, 2009; MacGlashan et al., 2017; Warnell et al., 2018; Guan et al., 2021; Abramson et al., 2022), or from rankings over a set of behaviors (Brown et al., 2019). To illustrate the distinct characteristics of the two objective specification methods, let us consider a Humanoid control task from the DeepMind Control Suite benchmark (Tunyasuvunakool et al., 2020).
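To make the comparison-based paradigm concrete, the following is a minimal sketch of reward learning from pairwise preferences under the standard Bradley-Terry model, in the style of Christiano et al. (2017). The setup is entirely synthetic and illustrative (the linear reward, the feature dimensions, and all variable names are assumptions, not details from this paper): a latent reward ranks behavior clips, binary preference labels are generated from it, and a reward model is fit by logistic regression on reward differences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each behavior clip is summarized by a 3-d feature
# vector, and the user's latent reward is linear in those features.
true_w = np.array([1.0, -2.0, 0.5])
clips = rng.normal(size=(200, 3))            # features of 200 clips
pairs = rng.integers(0, 200, size=(500, 2))  # random comparison pairs
# Binary preference label: 1 if the first clip in a pair has higher
# latent reward than the second (this simulates the human annotator).
prefs = (clips[pairs[:, 0]] @ true_w
         > clips[pairs[:, 1]] @ true_w).astype(float)

# Fit a reward model w by gradient descent on the Bradley-Terry
# (logistic) loss: P(a preferred over b) = sigmoid(r(a) - r(b)).
w = np.zeros(3)
feat_diff = clips[pairs[:, 0]] - clips[pairs[:, 1]]
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(feat_diff @ w)))
    grad = (p - prefs) @ feat_diff / len(pairs)
    w -= 0.5 * grad

# The learned reward should reproduce the preference ordering.
pred = (feat_diff @ w > 0).astype(float)
accuracy = (pred == prefs).mean()
```

On this separable synthetic data the learned reward recovers nearly all preference labels; the point of the paper's critique is that in realistic tasks each label carries only one bit of information, so very many such comparisons are needed.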

Availability: https://guansuns.github.io/pages/rba.

