RELATIVE BEHAVIORAL ATTRIBUTES: FILLING THE GAP BETWEEN SYMBOLIC GOAL SPECIFICATION AND REWARD LEARNING FROM HUMAN PREFERENCES

Abstract

Generating complex behaviors that satisfy the preferences of non-expert users is a crucial requirement for AI agents. Interactive reward learning from trajectory comparisons (a.k.a. RLHF) is one way to allow non-expert users to convey complex objectives by expressing preferences over short clips of agent behaviors. Even though this parametric method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human is unable to provide richer feedback than binary preference labels, leading to intolerably high feedback complexity and poor user experience. While providing a detailed symbolic closed-form specification of the objectives might be tempting, it is not always feasible even for an expert user. However, in most cases, humans are aware of how the agent should change its behavior along meaningful axes to fulfill their underlying purpose, even if they are not able to fully specify task objectives symbolically. Using this as motivation, we introduce the notion of Relative Behavioral Attributes, which allows the users to tweak the agent behavior through symbolic concepts (e.g., increasing the softness or speed of agents' movement). We propose two practical methods that can learn to model any kind of behavioral attributes from ordered behavior clips. We demonstrate the effectiveness of our methods on four tasks with nine different behavioral attributes, showing that once the attributes are learned, end users can produce desirable agent behaviors relatively effortlessly, by providing feedback just around ten times. This is over an order of magnitude less than that required by the popular learning-from-human-preferences baselines.

1. INTRODUCTION

A central problem in building versatile autonomous agents is how to specify and customize agent behaviors. Two representative ways to specify tasks include manual specification of reward functions, and reward learning from trajectory comparisons. In the former, the user needs to provide an exact description of the objective as a suitable reward function to be used by the agent. This is often only feasible when the specification can be done at a high level, e.g., in symbolic terms (Russell & Norvig, 2003), by providing a symbolic reward machine or domain model (Yang et al., 2018; Illanes et al., 2020; Guan et al., 2022; Icarte et al., 2022), or by giving a natural language instruction (Mahmoudieh et al., 2022). Instead of requiring users to give precise task descriptions, reward learning from trajectory comparisons (a.k.a. reinforcement learning from human feedback or RLHF) learns a reward function from human preference labels over pairs of behavior clips (Wilson et al., 2012; Christiano et al., 2017; Ouyang et al., 2022; Bai et al., 2022a), from numerical ratings on trajectory segments (Knox & Stone, 2009; MacGlashan et al., 2017; Warnell et al., 2018; Guan et al., 2021; Abramson et al., 2022), or from rankings over a set of behaviors (Brown et al., 2019). To illustrate the distinct characteristics of the two objective specification methods, let us consider a Humanoid control task in the DeepMind Control Suite benchmark (Tunyasuvunakool et al., 2020). The default symbolic reward function in the benchmark is a carefully designed linear combination of multiple explicit factors such as moving speed and control force. By optimizing this reward, the agent will learn to run at the specified speed, but in an unnatural way. However, to further specify the motion styles of the robot (e.g., to define a human-like walking style), non-expert users may find it hard to express such objectives symbolically, since this involves tacit motion knowledge.
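To make the contrast concrete, a hand-specified symbolic reward of the kind described above can be written as a weighted combination of explicit, nameable factors. The following is a minimal illustrative sketch, not the actual DeepMind Control Suite reward; the particular shaping terms and weights (`w_speed`, `w_ctrl`) are assumptions chosen only to show the structure of such a specification.

```python
def symbolic_run_reward(speed, target_speed, control_forces,
                        w_speed=1.0, w_ctrl=0.1):
    """Illustrative symbolic reward: a speed-tracking term minus a
    control-effort penalty. All weights/shaping are hypothetical."""
    # Speed term peaks at 1.0 when the agent moves at the target speed.
    speed_term = max(0.0, 1.0 - abs(speed - target_speed) / target_speed)
    # Penalize large actuation forces (mean squared control effort).
    ctrl_penalty = sum(f * f for f in control_forces) / len(control_forces)
    return w_speed * speed_term - w_ctrl * ctrl_penalty
```

Note that every factor here is explicit and nameable; a property like "human-like gait" has no such closed-form term, which is exactly where this mode of specification breaks down.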
Similarly, in natural language processing tasks like dialogue (Ouyang et al., 2022) and summarization (Stiennon et al., 2020), it can be extremely challenging to construct reward functions purely with explicit concepts. Hence, for tacit-knowledge tasks, people usually resort to reward specification via pairwise trajectory comparisons and learn a parametric reward function (e.g., parameterized by deep neural networks). While pairwise comparisons are general enough to work for any setting, due to the limited information a binary label can carry, they are also an impoverished way for humans to communicate their preferences. Thus, treating every task at hand as a tacit-knowledge one and limiting reward specification to binary comparisons can be unnecessarily inefficient. Moreover, since the internal representations learned within neural networks are typically inscrutable, it is unclear how a learned reward model can be reused to serve multiple users and fulfill a variety of goals. In short, symbolic goal specification is more straightforward and intuitive to use, but it offers limited expressiveness, making it more suitable for explicit-knowledge tasks. Reward learning from trajectory comparisons, in contrast, offers better expressiveness, but it is more costly and less intuitive to use. However, in many real-world scenarios, user objectives are neither completely explicit nor purely tacit. In other words, although human users may not be able to construct a reward function symbolically, they still have some idea of how the AI agent can change its behavior along certain meaningful axes of variation to better serve the users. In the above Humanoid example, even though the users cannot fully define the motion style in a closed-form manner, they may still be able to express their preferences in terms of some nameable concepts that describe certain properties in the agent behavior.
For instance, the users of a household humanoid robot may want it to not only walk in a natural way but also walk more softly and take smaller steps at night when people are sleeping. Motivated by the observations above, we introduce the notion of Relative Behavioral Attributes (RBA) to capture the relative strength of some properties' presence in the agent's behavior. We aim to learn an attribute-parameterized reward function that can encode complex task knowledge while allowing end users to freely increase or decrease the strength of certain behavioral attributes such as "step size" and "movement softness" (Fig. 1). This new paradigm of reward specification is intended to bring the best of symbolic and pairwise trajectory comparison approaches by allowing for reward feedback in terms of conceptual attributes. The benefits of explicitly modeling RBAs are clear: (a) it offers a semantically richer and more natural mode of communication, such that the hypothesis space of user intents or preferences can be greatly reduced each time the system receives attribute-level feedback from the user, thereby improving the overall feedback complexity significantly; (b) since humans communicate at the level of symbols and concepts, the learned RBAs can be used as a shared vocabulary between humans and inscrutable models (Kambhampati et al., 2022), and more importantly, such shared vocabulary can be reused to serve any future users and support a diverse set of objectives. The concept of relative attributes has been introduced earlier in other areas like computer vision (Parikh & Grauman, 2011) and recommender systems (Goenka et al., 2022).
Though they share some similarities in motivation and definition with our relative behavioral attributes, the key difference is that relative attributes there only capture properties present in a single static image, while RBAs aim to capture properties over trajectories of behavior, wherein an attribute may be associated with either static features at a certain timestep (e.g., step size of a walking agent) or temporally-extended features spanning multiple steps (e.g., the softness of movement). Additionally, we need to consider



Figure 1: Visualizing behavioral attributes of Walker and Lane-Change. The behavioral attributes of other domains are shown in Fig. 3 in Appendix A.
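As background for the attribute-learning idea previewed in the abstract (modeling behavioral attributes from ordered behavior clips), the core of fitting an attribute strength function from ordered pairs can be sketched as pairwise ranking, in the spirit of relative attributes (Parikh & Grauman, 2011). This is a minimal sketch, not the paper's actual method: each clip is summarized by a fixed-length feature vector (an assumption; real clips are trajectories), the scoring function is linear, and the trainer is a simple perceptron-style update on a margin ranking loss. All names and hyperparameters are illustrative.

```python
import numpy as np

def train_attribute_ranker(pairs, dim, lr=0.1, margin=1.0, epochs=200):
    """Fit a linear attribute-strength function g(x) = w . x from
    ordered pairs (x_low, x_high), where x_high is labeled as
    exhibiting the attribute more strongly."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_low, x_high in pairs:
            diff = np.asarray(x_high) - np.asarray(x_low)
            # Margin ranking constraint: want w . (x_high - x_low) >= margin.
            if w @ diff < margin:
                w += lr * diff  # gradient step on the violated pair
    return w

# Toy usage: attribute strength grows with the first feature dimension.
pairs = [(np.array([0.1, 0.5]), np.array([0.9, 0.4])),
         (np.array([0.2, 0.1]), np.array([0.7, 0.2]))]
w = train_attribute_ranker(pairs, dim=2)
```

Once such a strength function is learned, "increase the attribute" feedback can be grounded as a request for behaviors with a higher score, which is what makes attribute-level feedback so much cheaper than raw binary comparisons.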

Availability: https://guansuns.github.io/pages/rba.

