RELATIVE BEHAVIORAL ATTRIBUTES: FILLING THE GAP BETWEEN SYMBOLIC GOAL SPECIFICATION AND REWARD LEARNING FROM HUMAN PREFERENCES

Abstract

Generating complex behaviors that satisfy the preferences of non-expert users is a crucial requirement for AI agents. Interactive reward learning from trajectory comparisons (a.k.a. RLHF) is one way to allow non-expert users to convey complex objectives by expressing preferences over short clips of agent behaviors. Even though this parametric method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human is unable to provide richer feedback than binary preference labels, leading to intolerably high feedback complexity and poor user experience. While providing a detailed symbolic closed-form specification of the objectives might be tempting, it is not always feasible even for an expert user. However, in most cases, humans are aware of how the agent should change its behavior along meaningful axes to fulfill their underlying purpose, even if they are not able to fully specify task objectives symbolically. Using this as motivation, we introduce the notion of Relative Behavioral Attributes, which allows users to tweak the agent behavior through symbolic concepts (e.g., increasing the softness or speed of the agent's movement). We propose two practical methods that can learn to model any kind of behavioral attribute from ordered behavior clips. We demonstrate the effectiveness of our methods on four tasks with nine different behavioral attributes, showing that once the attributes are learned, end users can produce desirable agent behaviors with relatively little effort, by providing feedback only around ten times. This is over an order of magnitude less than that required by the popular learning-from-human-preferences baselines.

1. INTRODUCTION

A central problem in building versatile autonomous agents is how to specify and customize agent behaviors. Two representative ways to specify tasks are manual specification of reward functions and reward learning from trajectory comparisons. In the former, the user needs to provide an exact description of the objective as a suitable reward function to be used by the agent. This is often only feasible when the specification can be done at a high level, e.g., in symbolic terms (Russell & Norvig, 2003), by providing a symbolic reward machine or domain model (Yang et al., 2018; Illanes et al., 2020; Guan et al., 2022; Icarte et al., 2022), or by giving a natural language instruction (Mahmoudieh et al., 2022). Instead of requiring users to give precise task descriptions, reward learning from trajectory comparisons (a.k.a. reinforcement learning from human feedback, or RLHF) learns a reward function from human preference labels over pairs of behavior clips (Wilson et al., 2012; Christiano et al., 2017; Ouyang et al., 2022; Bai et al., 2022a), from numerical ratings on trajectory segments (Knox & Stone, 2009; MacGlashan et al., 2017; Warnell et al., 2018; Guan et al., 2021; Abramson et al., 2022), or from rankings over a set of behaviors (Brown et al., 2019). To illustrate the distinct characteristics of the two objective specification methods, let us consider a Humanoid control task in the DeepMind Control Suite benchmark (Tunyasuvunakool et al., 2020). The default symbolic reward function in the benchmark is a carefully designed linear combination of multiple explicit factors such as moving speed and control force. By optimizing the reward, the agent will learn to run at the specified speed but in an unnatural way. However, to further specify the motion styles of the robot (e.g., to define a human-like walking style), non-expert users may find it hard to express such objectives symbolically since this involves tacit motion knowledge.
Similarly, in natural language processing tasks like dialogue (Ouyang et al., 2022) and summarization (Stiennon et al., 2020) , it can be extremely challenging to construct reward functions purely with explicit concepts. Hence, for tacit-knowledge tasks, people usually resort to reward specification via pairwise trajectory comparisons and learn a parametric reward function (e.g., parameterized by deep neural networks). While pairwise comparisons are general enough to work for any setting, due to the limited information a binary label can carry, they are also an impoverished way for humans to communicate their preferences. Thus, treating every task at hand as a tacit-knowledge one and limiting reward specification to binary comparisons can be unnecessarily inefficient. Moreover, since the internal representations learned within neural networks are typically inscrutable, it's unclear how a learned reward model can be reused to serve multiple users and fulfill a variety of goals. In short, symbolic goal specification is more straightforward and intuitive to use, but it offers limited expressiveness, making it more suitable for explicit-knowledge tasks. Reward learning from trajectory comparisons, in contrast, offers better expressiveness, but it is more costly and less intuitive to use. However, in many real-world scenarios, user objectives are neither completely explicit nor purely tacit. In other words, although human users may not be able to construct a reward function symbolically, they still have some idea of how the AI agent can change its behavior along certain meaningful axes of variation to better serve the users. In the above Humanoid example, even though the users can not fully define the motion style in a closed-form manner, they may still be able to express their preferences in terms of some nameable concepts that describe certain properties in the agent behavior. 
For instance, the users of a household humanoid robot may want it to not only walk in a natural way but also walk more softly and take smaller steps at night when people are sleeping. Motivated by the observations above, we introduce the notion of Relative Behavioral Attributes (RBA) to capture the relative strength of some properties' presence in the agent's behavior. We aim to learn an attribute-parameterized reward function that can encode complex task knowledge while allowing end users to freely increase or decrease the strength of certain behavioral attributes such as "step size" and "movement softness" (Fig. 1). This new paradigm of reward specification is intended to bring together the best of the symbolic and pairwise trajectory comparison approaches by allowing for reward feedback in terms of conceptual attributes. Explicitly modeling RBAs has two main benefits: (a) it offers a semantically richer and more natural mode of communication, such that the hypothesis space of user intents or preferences can be greatly reduced each time the system receives attribute-level feedback from the user, thereby improving the overall feedback complexity significantly; (b) since humans communicate at the level of symbols and concepts, the learned RBAs can be used as a shared vocabulary between humans and inscrutable models (Kambhampati et al., 2022), and more importantly, such a shared vocabulary can be reused to serve any future users and support a diverse set of objectives. The concept of relative attributes has been introduced earlier in other areas like computer vision (Parikh & Grauman, 2011) and recommender systems (Goenka et al., 2022).
Though these relative attributes share some similarities in motivation and definition with our relative behavioral attributes, the key difference is that they only capture properties present in a single static image, while RBAs aim to capture properties over trajectories of behavior, wherein an attribute may be associated with either static features at a certain timestep (e.g., step size of a walking agent) or temporally-extended features spanning multiple steps (e.g., the softness of movement). Additionally, we need to consider how to connect the captured attributes to a valid reward function. In this work, we propose two generic data-driven methods to encode RBAs within a reward function. On four domains with nine attributes in total, we show that our methods can learn to accurately capture the attributes from roughly 200 labelled behavior clips. Once the attribute-parameterized reward is learned, any end user can produce diverse agent behaviors with relative ease, by providing feedback only around 10 times. This offers a significant advantage over the learning-from-human-preferences baseline (Christiano et al., 2017), which requires hundreds of binary labels from the user per task.

2. RELATED WORK

In general, direct reward specification is the most commonly employed approach. It is used in almost all simulators with engineered reward functions and in industrial applications such as self-driving vehicle control systems. Yet, there are still some challenges associated with it, such as the extensive requirement for sensory instrumentation, incomplete (sub)goal specification (Guan et al., 2022), reward exploitation (Lee et al., 2021), and the expressiveness issue mentioned in the previous section. These impediments motivate the idea of learning rewards from states or trajectory comparisons (Christiano et al., 2017; Warnell et al., 2018; Zhang et al., 2019; Lee et al., 2021), which leverages the exceptional representational capacity of neural networks but tends to suffer from high data complexity. Another alternative to symbolic reward specification is imitation learning or inverse reinforcement learning (Schaal, 1996; Ng et al., 2000; Abbeel & Ng, 2004). However, this is usually infeasible for non-expert end users, as it requires the user to have sufficient knowledge and a proper hardware setup in order to teleoperate the agent. Even if a pre-collected large-scale behavior dataset is provided to the user, the process of finding the target behavior trace(s) in the large dataset can still be frustrating. The introduction of RBAs aligns with the broader intent of building symbolic interfaces as a middle layer for humans to communicate effectively with the agent (Bobu et al., 2021; Guan et al., 2021; 2022; Silver et al., 2022; Zhang et al., 2022; Bucker et al., 2022; Cui et al., 2023) or for the agent to explain itself to humans (Kim et al., 2018; Sreedharan et al., 2022; Kambhampati et al., 2022). Lee et al. (2020) utilize relative-attribute information in robot skill learning, but their GAN-based formulation is restricted to static visual attributes and is not applicable to temporally-extended concepts.
This paper adopts a similar setup to works that learn diverse skills or motion styles from large-scale offline behavior datasets or demonstrations (Lee & Popović, 2010; Wang et al., 2017; Zhou & Dragan, 2018; Peng et al., 2018b; Luo et al., 2020; Chebotar et al., 2021; Peng et al., 2021). These works emphasize modeling a variety of reusable motor skills by learning a low-level controller conditioned on skill latent codes. Since the latent codes are inscrutable to humans, for each new task, the user must specify the desirable agent behavior by constructing an engineered symbolic reward and use it to train a separate high-level policy that controls the low-level controller. Existing diverse-skill learning methods complement ours because skill priors (i.e., pre-trained low-level controllers) allow us to optimize the behavioral reward more efficiently. More recently, there have been works on diffusion-based text-to-motion animation generation (Tevet et al., 2022; Guo et al., 2022). They are similar to this work in the sense that both allow humans to control the agent behavior through explicit concepts. However, they do not support fine-grained control over the strength of individual behavioral attributes, and they are not applicable to physics-based character control.

3. PROBLEM SETUP

Personalizing agent behaviors at skill level. We assume the users are interested in customizing agent behaviors at the level of skills (e.g., making a lane change, taking a step, picking up an object). A skill is a solution (i.e., policy) to an episodic task in an indefinite-horizon discounted MDP $M = \langle S, A, R, P, \gamma \rangle$, where $S$ is the set of states, $A$ is the set of actions, $R : S \times A \to \mathbb{R}$ is the provided reward function, $P : S \times A \times S \to [0, 1]$ is the transition function, and $\gamma$ is the discount factor. An episode terminates whenever the agent reaches an absorbing terminal state or exceeds a fixed number of time steps $T$. An agent in $M$ is supposed to find a policy $\pi : S \to A$ that maximizes the expected return $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$, $r_t \in \mathbb{R}$. When executing a policy, the agent is initialized to some initial state $s_0 \in S$ at the beginning of each episode; then at each succeeding time step $t$, the agent interacts with the environment $E$ by taking an action $a_t \in A$ and receiving the next state $s_{t+1} \in S$. In order to personalize agent behavior, rather than assuming that the reward function $R$ is supplied by the environment, we require the reward function to be specified by the end user. Given a behavior clip or a trajectory $\tau = \{(s_0, a_0), \ldots, (s_l, a_l)\}$, where $l$ is the length of the trajectory, a relative behavioral attribute $\alpha \in \mathcal{A}$, where $\mathcal{A}$ is the set of attributes, captures the strength of the presence of a certain property exhibited in $\tau$. It assumes that a mapping $\zeta$ can be established to map any $(\alpha, \tau)$ pair to a real value that reflects the relative strength of $\alpha$ in $\tau$. In this work, our goal is to construct a reward function that allows end users to iteratively adjust the attribute strengths present in the agent's behavior.

Learning to model RBAs from offline behavior datasets. RBAs essentially capture, in semantic terms, different ways to carry out a task. Ideally, the training data for the reward function should contain clips of diverse skill behaviors.
Although we used synthetic data in this study, there are many publicly accessible behavior corpora, such as the Waymo Open Dataset for autonomous driving tasks (Ettinger et al., 2021) and large-scale motion clip data for character control (Peng et al., 2018a; 2022). Note that we do not expect the offline dataset to exhaustively cover all possible skill behaviors, as the reward function should generalize to unseen configurations. Also, considering that action information is not always available, to be more general, we assume the training dataset $D$ only consists of state-only trajectories. Last but not least, as a common setting, we assume the embodiment of the agent is not significantly different from that in $D$.

Reward learning from trajectory comparisons. Methods for reward learning from trajectory comparisons construct a reward function by learning to infer the user's latent preference from a set of ranked trajectory pairs $\{(\tau^1, \tau^2)\}$. Typically, the user preference is modelled according to the Bradley-Terry model (Bradley & Terry, 1952):

$$P[\tau^1 \succ \tau^2] = \frac{\exp \sum_t r(s^1_t, [a^1_t])}{\sum_{i \in \{1,2\}} \exp \sum_t r(s^i_t, [a^i_t])}, \tag{1}$$

where $\tau^1 \succ \tau^2$ denotes the event that the user prefers $\tau^1$ over $\tau^2$, $r$ is a parametric learnable reward function, and $[a^i_t]$ means the action input is optional. A cross-entropy loss is typically used for optimization:

$$\mathrm{Loss}_p(r) = -\sum_{(\tau^1, \tau^2, y) \in D_p} \Big( y(1) \log P[\tau^1 \succ \tau^2] + y(2) \log P[\tau^2 \succ \tau^1] \Big), \tag{2}$$

where $y \in \{(1, 0), (0, 1), (0.5, 0.5)\}$ are the possible preference labels indicating $\tau^1 \succ \tau^2$, $\tau^2 \succ \tau^1$, or that $\tau^1$ and $\tau^2$ are equally preferred, respectively. The set of preference labels $D_p$ can be collected beforehand (Brown et al., 2019) or through active queries to the user in an online manner (Wilson et al., 2012; Christiano et al., 2017). The latter is often referred to as preference-based reinforcement learning, or PbRL for short.
Most PbRL methods build upon the Bradley-Terry model but differ in the query strategies and how the network is initialized (Lee et al., 2021; Park et al., 2022; III & Sadigh, 2022; Ren et al., 2022) .
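As a concrete illustration of the Bradley-Terry preference probability and the cross-entropy loss described above, the following is a minimal NumPy sketch. The function names `preference_prob` and `preference_loss`, and the toy per-step rewards, are assumptions made for illustration; they are not taken from the paper's implementation, and a practical version would compute the rewards with a neural network and minimize the loss by gradient descent.

```python
import numpy as np


def preference_prob(rewards_1, rewards_2):
    """Bradley-Terry probability that clip 1 is preferred over clip 2 (Eq. 1).

    rewards_1, rewards_2: per-step reward predictions for each clip;
    the two clips may have different lengths.
    """
    ret_1, ret_2 = float(np.sum(rewards_1)), float(np.sum(rewards_2))
    shift = max(ret_1, ret_2)  # subtract the max before exp() to avoid overflow
    e_1, e_2 = np.exp(ret_1 - shift), np.exp(ret_2 - shift)
    return e_1 / (e_1 + e_2)


def preference_loss(batch, eps=1e-12):
    """Cross-entropy loss of Eq. 2 over (rewards_1, rewards_2, y) triples."""
    loss = 0.0
    for rewards_1, rewards_2, y in batch:
        p_12 = preference_prob(rewards_1, rewards_2)
        loss -= y[0] * np.log(p_12 + eps) + y[1] * np.log(1.0 - p_12 + eps)
    return loss


# Toy per-step rewards, standing in for the outputs of a learned reward model.
clip_a = np.array([0.5, 1.0, 0.5])  # return 2.0
clip_b = np.array([0.1, 0.2])       # return 0.3
p_ab = preference_prob(clip_a, clip_b)
loss = preference_loss([(clip_a, clip_b, (1.0, 0.0)),
                        (clip_a, clip_a, (0.5, 0.5))])
```

Because only the difference of the two returns matters, the probability is invariant to adding a constant to every return, which is why subtracting the maximum before exponentiating is safe.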

4. METHODOLOGY

4.1. PERSONALIZING AGENT BEHAVIOR VIA RELATIVE BEHAVIORAL ATTRIBUTES

Our framework involves two phases, namely, learning an attribute-parameterized reward function and interacting with the end user.

Learning an attribute-parameterized reward function (no interaction with the end user). This phase learns a reward function that internally represents a family of rewards corresponding to behaviors with diverse attribute strengths. By varying the input attribute configurations, end users will be able to find the preferred "reward member" and obtain desired behaviors. In Section 4.2 and Section 4.3, we will present two methods that can learn such a reward function given a subset of labelled trajectories from an offline behavior dataset D. We assume that the agent builders are the ones who provide the training labels (e.g., the engineers of autonomous vehicles or the developers of virtual characters). In some extreme cases, if a novel concept has to be learned, the training labels may also come from the end user. Also note that this step can be skipped if the RBAs have already been captured by a learned reward function.

Supporting end users in the loop. Once an attribute-parameterized reward function is learned, any incoming user can leverage it to personalize the agent behavior through multiple rounds of queries. In each round of interaction, the agent presents the user with a trajectory sampled according to the policy that optimizes the current reward function. Then the user provides feedback on whether the current behavior is desirable; if not satisfied, the user can express the intent to increase or decrease the strength of certain attributes. The concept-level feedback is a set of attribute-feedback pairs {(α, h)}, where α is the attribute of interest, and h is a binary value indicating whether the user wants to increase or decrease α's strength. In this work, we consider two types of attribute representation, namely the index of α in a list of known attributes or a natural-language description of α.
Upon collecting feedback from the user, the agent will adjust the reward function and update the corresponding policy. This human-agent interaction process repeats until the user is satisfied with the latest agent behavior. In the next two sections, we will elaborate on two candidate architectures for the attribute parameterized reward function. We note that both architectures support the same type of user interaction as outlined above.
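The interaction protocol described above can be sketched as a simple loop. This is only an illustrative outline under strong assumptions: the behavior shown to the user is summarized by a dictionary of attribute strengths, the user is simulated with a hidden target and tolerance, and `fixed_step_update` is a hypothetical stand-in for "adjust the reward function and update the policy". None of these names come from the paper.

```python
from typing import Callable, Dict, List, Tuple

Behavior = Dict[str, float]       # attribute name -> strength shown to the user
Feedback = List[Tuple[str, int]]  # (attribute, h): h = +1 increase, -1 decrease


def interaction_loop(initial: Behavior,
                     get_feedback: Callable[[Behavior], Feedback],
                     apply_feedback: Callable[[Behavior, Feedback], Behavior],
                     max_rounds: int = 50) -> Tuple[Behavior, int]:
    """Query the user until no attribute-level feedback is returned."""
    behavior = dict(initial)
    for n_feedback in range(max_rounds):
        feedback = get_feedback(behavior)
        if not feedback:  # empty feedback: the user is satisfied
            return behavior, n_feedback
        behavior = apply_feedback(behavior, feedback)
    return behavior, max_rounds


# Hypothetical simulated user with a hidden target and a tolerance.
TARGET = {"softness": 0.6, "step_size": 0.2}
TOL = 0.05


def simulated_user(behavior: Behavior) -> Feedback:
    return [(name, 1 if TARGET[name] > value else -1)
            for name, value in behavior.items()
            if abs(TARGET[name] - value) > TOL]


def fixed_step_update(behavior: Behavior, feedback: Feedback,
                      step: float = 0.1) -> Behavior:
    """Stand-in for adjusting the reward and re-optimizing the policy."""
    updated = dict(behavior)
    for name, h in feedback:
        updated[name] += h * step
    return updated


final, rounds = interaction_loop({"softness": 0.0, "step_size": 0.5},
                                 simulated_user, fixed_step_update)
```

In this toy run the loop ends once both attributes fall within the tolerance of the hidden target; the returned count is the number of feedback rounds the simulated user had to provide.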

4.2. METHOD 1: MODELING BEHAVIORAL ATTRIBUTES BY ESTABLISHING GLOBAL RANKINGS

The learning process of Method 1 consists of two steps. In the first step, we learn an attribute strength estimator $\zeta_\sigma$ (parameterized by $\sigma$) that can map any given attribute $\alpha$ and trajectory $\tau$ to a real-valued score that measures the relative strength of attribute $\alpha$ in $\tau$. Here, $\zeta_\sigma$ is essentially establishing a global ranking among all possible behaviors according to any given attribute $\alpha$. Hence, we denote this method as RBA-Global. Then in the second step, given a finite set of attributes $\mathcal{A}$, we learn a dense reward function $r_\theta(s \mid v_t = \langle v_t^{\alpha_1}, \ldots, v_t^{\alpha_k} \rangle)$, where $v_t$ is called the target attribute-score vector, $v_t^{\alpha_i}$ is the target strength of attribute $\alpha_i$, and $k = |\mathcal{A}|$ is the total number of attributes. $r_\theta$ is expected to satisfy that, when $r_\theta(\cdot \mid v_t)$ is optimized, the agent is able to sample a trajectory $\tau'$ such that for any $\alpha_i \in \mathcal{A}$, we have $\zeta_\sigma(\tau', \alpha_i) \simeq v_t^{\alpha_i}$. Hence, with a learned reward function $r_\theta$, the agent can produce diverse behaviors by varying the input target attribute-score vector $v_t$. As an example, let us consider the humanoid household robot domain with two attributes $\{\alpha_0$: "softness of movement", $\alpha_1$: "step size"$\}$. By setting the target attribute vector $v_t$ to be $\langle v_t^{\alpha_0} = -1.5, v_t^{\alpha_1} = 2.0 \rangle$, we expect the agent that optimizes $r_\theta$ to be able to produce a trajectory $\tau'$ with a "softness" score $\zeta_\sigma(\tau', \alpha_0)$ approximately equal to $-1.5$ and a "step size" score $\zeta_\sigma(\tau', \alpha_1)$ approximately equal to $2.0$. A graphical illustration of the overall architecture can be found in Fig. 4 in Appendix A. In the remainder of this section, we will explain how $\zeta_\sigma$ and $r_\theta$ can be learned from the offline behavior dataset $D$.

The problem of learning an attribute strength estimator $\zeta_\sigma$ is essentially a learning-to-rank problem. Specifically, we assume we are given (normally by the agent designers rather than the end users) a set of state-only trajectories $\{\tau\}$ and their orderings according to different attributes $\{(\tau_0 \succ \tau_1 \succ \ldots \succ \tau_N \mid \alpha)\}$, or a set of ranked trajectory pairs $D_l = \{(\tau_i \succ \tau_j \mid \alpha)\}$, where $\alpha \in \mathcal{A}$ is one of the attributes in the domain, $(\tau_i \succ \tau_j \mid \alpha)$ represents the event that the attribute $\alpha$ has a stronger presence in trajectory $\tau_i$ than in trajectory $\tau_j$, and $N \le |D|$ is the length of the ranked trajectory sequence. We propose to employ a modified state-only version of the Bradley-Terry model (Eq. 1), in which rather than assuming that the ranking is governed by the latent user preferences, we assume the ranking is determined by the given attribute $\alpha$:

$$P_\sigma[\tau^1 \succ \tau^2 \mid \alpha] = \frac{\exp \sum_t f_\sigma([s^1_t, e_\alpha])}{\sum_{i \in \{1,2\}} \exp \sum_t f_\sigma([s^i_t, e_\alpha])}, \tag{3}$$

where $f_\sigma$ is an attribute-conditioned ranking function with parameters $\sigma$, $[\cdot, \cdot]$ is the vector concatenation operation, and $e_\alpha$ is the embedding of attribute $\alpha$. The strength of any attribute $\alpha$ in a trajectory $\tau$ is given as $\zeta_\sigma(\tau, \alpha) = \sum_{s \in \tau} f_\sigma([s, e_\alpha])$. Recall that we consider two types of attribute representation, namely attribute index and natural language description. Accordingly, $e_\alpha$ can either be a one-hot vector or a sentence embedding generated by any pretrained natural language sentence encoder like Sentence-BERT (Reimers & Gurevych, 2019). Since different behaviors may result in trajectories of varying lengths, in Eq. 3 we do not require the two trajectories to have the same size. Given the training dataset $D_l$, $f_\sigma$ can be trained via a cross-entropy loss similar to the one in Eq. 2. For numerical stability, in practice, we also clip the values of $\sum_t f_\sigma([s^i_t, e_\alpha])$ (we used $[-20, 20]$ for all the attributes in our experiment). With a learned attribute strength estimator $\zeta_\sigma$ and a finite set of $k$ attributes, the agent behavior in any trajectory $\tau$ can be characterized by an attribute vector $v(\tau) = \langle \zeta_\sigma(\tau, \alpha_1), \ldots, \zeta_\sigma(\tau, \alpha_k) \rangle$.
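To make the attribute strength estimator and the ranking probability of Eq. 3 more concrete, here is a small NumPy sketch. It assumes a hypothetical linear ranking function `f_linear` in place of the learned network, a one-hot attribute embedding, and toy trajectories; only the summation over states, the concatenation with the attribute embedding, and the clipping to [-20, 20] follow the description above.

```python
import numpy as np


def attribute_strength(f, states, e_alpha, clip=20.0):
    """zeta(tau, alpha): sum f([s, e_alpha]) over the states of tau, then clip."""
    states = np.asarray(states, dtype=float)
    tiled = np.tile(e_alpha, (len(states), 1))        # repeat e_alpha per step
    inputs = np.concatenate([states, tiled], axis=1)  # rows are [s_t, e_alpha]
    return float(np.clip(np.sum(f(inputs)), -clip, clip))


def rank_prob(f, states_1, states_2, e_alpha):
    """P[tau1 > tau2 | alpha] of Eq. 3, written as a sigmoid of the score gap."""
    gap = (attribute_strength(f, states_1, e_alpha)
           - attribute_strength(f, states_2, e_alpha))
    return 1.0 / (1.0 + np.exp(-gap))


# Hypothetical linear ranking function over [state (2 dims), one-hot (2 dims)].
weights = np.array([1.0, 0.0, 0.0, 0.0])
f_linear = lambda x: x @ weights

e_softness = np.array([1.0, 0.0])  # one-hot embedding of attribute index 0
tau_short = np.ones((3, 2))        # 3 steps
tau_long = np.zeros((5, 2))        # 5 steps: the lengths need not match
p = rank_prob(f_linear, tau_short, tau_long, e_softness)
```

With two trajectories, the softmax in Eq. 3 reduces to a logistic sigmoid of the difference between the two clipped scores, which is what `rank_prob` computes.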
The problem of learning $r_\theta([s, v_t])$ can also be cast as a learning-to-rank or preference modeling problem, wherein the reward function is supposed to give higher cumulative rewards to trajectories that have attribute strengths closer to the targets in $v_t$. Specifically, given $v_t$, for any two trajectories $\tau_i$ and $\tau_j$ from the offline behavior dataset $D$, the preference label $l_p(\tau_i, \tau_j, v_t)$ is given as:

$$l_p(\tau_i, \tau_j, v_t) = \begin{cases} \tau_i \succ \tau_j & \text{if } \|v(\tau_i) - v_t\|_2 < \|v(\tau_j) - v_t\|_2 - \xi_r \\ \tau_i \prec \tau_j & \text{if } \|v(\tau_i) - v_t\|_2 > \|v(\tau_j) - v_t\|_2 + \xi_r \\ \text{no ordering} & \text{otherwise,} \end{cases} \tag{4}$$

where $\xi_r$ is a small slack variable. To train $r_\theta$, we can randomly sample a set of triplets from $D$, namely $\{(\tau_0, \tau_1, \tau_2)\}$. By treating $\tau_0$ as the target behavior, we can generate a set of training labels $\{l_p(\tau_1, \tau_2, v_t = v(\tau_0))\}$ for $r_\theta$. In short, $r_\theta$ can be trained as a standard state-only preference-based reward function according to Eq. 1 and Eq. 2 but with preference labels given by the extracted attribute strengths (i.e., by $\zeta_\sigma$). Note that, unlike the training of $\zeta_\sigma$, the training of $r_\theta$ does not require any human-provided labels.

Once a reward function $r_\theta$ is learned, the end user can use it to specify agent behavior by simply tuning the attribute strength values in the input target attribute vector $v_t$. In our realization, we implement the process of finding the target attribute score as a binary search in real space (details can be found in Appendix A.3). The process of personalizing agent behavior through $r_\theta$ is highly intuitive because $r_\theta$ handles the complex tacit parts of the problem internally (e.g., how to walk naturally and realistically under the constraints of softness and step size) and only relies on the end user to set the explicit parts. Additional discussion on alternative ways to leverage $\zeta_\sigma$ without learning a reward function can be found in Appendix A.2.
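The labelling rule of Eq. 4 and the search over a target attribute score can be sketched as follows. The label function follows Eq. 4 directly. The bisection routine, in contrast, is only an assumed illustration of "binary search in real space": the actual procedure is described in Appendix A.3, which is not reproduced here, and the sketch further assumes a single attribute, known search bounds, and a simulated user who reacts to the score itself rather than to a rendered behavior.

```python
import numpy as np


def preference_label(v_i, v_j, v_t, slack=0.1):
    """Eq. 4: prefer the trajectory whose attribute vector is closer to v_t.

    Returns (1, 0) for tau_i > tau_j, (0, 1) for tau_i < tau_j, None otherwise.
    """
    d_i = np.linalg.norm(np.asarray(v_i) - np.asarray(v_t))
    d_j = np.linalg.norm(np.asarray(v_j) - np.asarray(v_t))
    if d_i < d_j - slack:
        return (1.0, 0.0)
    if d_i > d_j + slack:
        return (0.0, 1.0)
    return None  # the two distances are within the slack: no ordering


def bisect_target(ask_user, low, high, max_queries=20):
    """Bisection over one target score; ask_user returns +1, -1, or 0 (done)."""
    for n_queries in range(1, max_queries + 1):
        mid = 0.5 * (low + high)
        h = ask_user(mid)
        if h == 0:
            return mid, n_queries
        if h > 0:
            low = mid   # user wants more of the attribute
        else:
            high = mid  # user wants less of the attribute
    return 0.5 * (low + high), max_queries


# Triplet relabelling: tau_0 supplies the target, tau_1 and tau_2 are compared.
v_tau0, v_tau1, v_tau2 = [0.0, 0.0], [0.1, 0.0], [1.0, 0.0]
label = preference_label(v_tau1, v_tau2, v_t=v_tau0)

# Hypothetical simulated user with hidden target 0.3 and tolerance 0.06.
ask = lambda v: 0 if abs(0.3 - v) <= 0.06 else (1 if v < 0.3 else -1)
score, n_queries = bisect_target(ask, low=-1.0, high=1.0)
```

Each query halves the remaining interval, which is one way to see why a handful of attribute-level feedback rounds can suffice when the reward function realizes the requested score faithfully.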

4.3. METHOD 2: MODELING BEHAVIORAL ATTRIBUTES BY CAPTURING MINIMALLY VIABLE LOCAL CHANGES

One limitation of RBA-Global is that it requires the total number of encoded attributes to be finite, because the size of the target attribute vector $v_t$ grows with the number of attributes. This may limit the scalability of RBA-Global. In this section, we will introduce a more extensible method that can potentially encode an arbitrary number of attributes. The key motivation is that in RBA-Global, the user never directly manipulates the attribute scores. Therefore, we can skip the explicit modeling of attribute strength (i.e., learning $\zeta_\sigma$) and directly learn a behavior-editing reward function. Specifically, given a trajectory $\tau_c$ and the corresponding human feedback $(\alpha, h)$, our goal is to construct a reward function $r_\theta(\cdot \mid \alpha, h, \tau_c)$ that gives higher cumulative rewards to trajectories that have some minimal but noticeable change in $\alpha$ in the direction specified by $h$ while keeping other unmentioned attributes unchanged (or minimally changed). We refer to such minimal but noticeable changes as minimally viable local changes, and to the queried trajectory $\tau_c$ as the anchor trajectory. Accordingly, we denote this method as RBA-Local. To learn $r_\theta$, we assume we are provided (again, normally by the agent builder and not the end user) a set of trajectory pairs $D_l = \{(\tau_c, \tau_t, \alpha, h)\}$, where $\tau_t$ is a trajectory that reflects some minimally viable local changes to the anchor trajectory $\tau_c$ in terms of the attribute $\alpha$ and the direction $h$. $r_\theta$ is trained to prefer $\tau_t$ over other negative samples (such as trajectories that make excessive changes to $\tau_c$, trajectories that are not significantly different from $\tau_c$, or trajectories that make changes to unspecified attributes). In practice, we select the negative samples by randomly sampling from the behavior dataset $D$.
Again, $r_\theta$ can be formulated as a modified Bradley-Terry model:

$$P_\theta[\tau_t \succ \tau_n \mid \alpha, h, \tau_c] = \frac{\exp \sum_{s \in \tau_t} r_\theta([s, e_\alpha, h, \phi(\tau_c)])}{\exp \sum_{s \in \tau_n} r_\theta([s, e_\alpha, h, \phi(\tau_c)]) + \exp \sum_{s \in \tau_t} r_\theta([s, e_\alpha, h, \phi(\tau_c)])}, \tag{5}$$

where $\tau_n$ is a negative sample (i.e., trajectory), $[\cdot]$ is the vector concatenation operation, $e_\alpha$, as in RBA-Global, can be either a one-hot representation of attribute $\alpha$ or a sentence embedding output by Sentence-BERT, and $\phi(\tau_c)$ is a sequence encoder (e.g., an LSTM (Hochreiter & Schmidhuber, 1997)) that encodes the anchor trajectory $\tau_c$ to a compact latent representation. Note that $\phi(\cdot)$ is a sub-module of $r_\theta$ and is jointly optimized with $r_\theta$. Since $r_\theta$ is essentially a preference-based reward function, it can be optimized by employing a cross-entropy loss like the one in Eq. 2. We note that our computational framework is similar to Prompt-DT (Xu et al., 2022) in the sense that both take a reference trajectory as a "prompt" to the model to obtain conditioned outputs. But instead of trying to replicate the behavior in the "prompt" trajectory as in Prompt-DT, our reward function learns to modify the prompt in a controlled way. In a more recent work (Liu et al., 2023), similar ideas have been shown to be effective in refining language model outputs through sequences of local changes informed by human feedback.

Compared to RBA-Global (Sec. 4.2), which requires a full specification of all the attributes' strengths, the reward function in RBA-Local only takes one attribute as input at a time. This design is appealing as it offers better scalability and it affords the development of a big universal behavioral concept "encoder". Nevertheless, RBA-Local still has the following shortcomings: (a) it is less efficient than RBA-Global in searching for the target behavior because it can only make minimally viable changes to the presented behavior; (b) the training data for RBA-Local is harder to collect.
Unlike RBA-Global, where any random subset of trajectories exhibiting distinct behaviors can be used as the training samples, in RBA-Local, the agent builder must carefully pick pairs of trajectories that reflect local changes. Also, the judgement of how much variation constitutes a minimally viable change can be fairly subjective.
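As a brief worked note on the pairwise probability used by RBA-Local, it may help to write the cumulative reward of a trajectory as $R_\theta(\tau) = \sum_{s \in \tau} r_\theta([s, e_\alpha, h, \phi(\tau_c)])$, a shorthand introduced here only for illustration. Dividing the numerator and denominator of the modified Bradley-Terry model by $\exp R_\theta(\tau_t)$ shows that it is a logistic sigmoid of the return gap, and that the cross-entropy loss for a pair labelled $\tau_t \succ \tau_n$ takes the familiar logistic form:

$$P_\theta[\tau_t \succ \tau_n \mid \alpha, h, \tau_c] = \frac{1}{1 + \exp\big(R_\theta(\tau_n) - R_\theta(\tau_t)\big)}, \qquad -\log P_\theta[\tau_t \succ \tau_n \mid \alpha, h, \tau_c] = \log\Big(1 + \exp\big(R_\theta(\tau_n) - R_\theta(\tau_t)\big)\Big).$$

The loss therefore approaches zero only when the target trajectory $\tau_t$ receives a clearly higher cumulative reward than the negative sample $\tau_n$ under the same anchor, attribute, and direction.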

5. EMPIRICAL EVALUATION

As a proof of concept, we demonstrate the effectiveness of our methods in a diverse set of four domains with nine behavioral attributes, which are depicted in Fig. 1 and Fig. 3:

Walker. This environment corresponds to a scenario that involves a 2-legged home-service robot walking around the house to perform household tasks. The users may want the robot to walk more softly at night. This environment is also related to physical character control scenarios wherein we want the character to move in a sneaky way. Two attributes are considered here: (a) step size; (b) softness of movement.

Manipulator. We consider a virtual character control scenario wherein we want a simulated arm to mimic the ways a human lifts objects. When humans, especially elders and children, are lifting heavy objects, their movement can be unstable. Hence, we consider two attributes: (a) moving speed of the arm; (b) instability of the movement.

Lane Change. We consider a driving scenario wherein the rider would want to change the lane-changing behavior of the cab to get a more pleasant experience. Two attributes are used for evaluation: (a) the sharpness of steering: this attribute corresponds to how sharp a turn the agent makes while changing lanes; (b) distance to the following vehicle: this attribute is about the distance between our agent and the following car at the moment when our agent starts making the lane change.

Snake Concertina. In this task, the agent is supposed to control a virtual snake to imitate diverse concertina styles of a real snake's locomotion. There are three relevant attributes in concertina locomotion: (a) width of the bend (i.e., the maximal width that the snake occupies); (b) compression (i.e., how much the snake's body is compressed when it is moving); (c) speed of movement.

More details of the evaluation domains can be found in Appendix A.1.
In our experiments, all the behavior clips are generated either by hard-coded motions or by using reinforcement learning with sophisticated reward designs and hard-coded constraints. 

5.1. BASELINE AND RESULTS

For comparison, we use the PbRL algorithm proposed in (Christiano et al., 2017; Lee et al., 2021), which learns a reward function from human preference labels collected by making active queries. Considering that our methods assume additional access to the offline behavior dataset D, we made a couple of modifications to make PbRL a stronger baseline. The most important one is that, to optimize the most recent reward function and update the agent's behavior after each query, rather than applying reinforcement learning, we use the policies extracted from D. Specifically, we stochastically sample a large set of rollouts by executing the policies we used to synthesize dataset D, and the rollout with the highest cumulative reward is set as the agent's latest behavior. In practice, these policies can also be learned from D in an unsupervised manner. This is identical to the use of skill priors as in (Peng et al., 2018b; Pertsch et al., 2020; Luo et al., 2020; Peng et al., 2022), which first learn a diverse set of natural and plausible motions from offline behavior datasets to reduce online computation. Besides, for PbRL, we also experimented with different query strategies and considered reusing previously trained reward models. For the sake of simplicity, we only report the best PbRL performance achieved across the different setups. As the evaluation metric, we count the number of times human feedback is given (i.e., the binary preference labels in PbRL and the attribute-level feedback in our methods) before the target behavior is produced. For each domain, we randomly sample 20 behavior configurations as targets, and the selected targets were unseen in the behavior dataset D. A trial is considered a success if the generated agent behavior has a ground-truth attribute strength or proxy score that falls within a certain range of the target value. We use a threshold that roughly divides the strength of each attribute into five to ten buckets.
A trial is deemed unsuccessful if the agent fails to produce the target behavior within a user-affordable number of feedback instances (we used 500 in our evaluation). The results are shown in Table 1, where we add "L" as a suffix to the names of variants that use language embeddings as the attribute representation. Results show that with RBAs, users can obtain the desired agent behavior much more efficiently than with PbRL. The upper row in Fig. 2 showcases how the user can obtain desired agent behaviors through sequences of attribute feedback with both methods. The interaction processes suggest that the attribute-parameterized reward is able to modify agent behaviors meaningfully in directions that are informed by RBAs. For the full interaction processes in all domains, we encourage the reader to check out the supplementary video at https://guansuns.github.io/pages/rba.

Analysis of failure modes. Results suggest that both methods have a lower success rate when trying to generate behaviors close to the two extremes (e.g., moving very softly or very recklessly). The bottom row in Fig. 2 shows two failure cases. In RBA-Global, the reward function fails to produce a behavior with a larger step size when we increase the target step size score in $v_t$ from 18.51 to 18.62. This disrupts the binary search process (Appendix A.3) and causes the system to get stuck. Similar failure patterns can also be observed in other domains. For example, in the Lane-Change domain, the agent fails to increase the distance to the following vehicle when the sharpness (proxy) score is 0.09. Since RBA-Global is composed of two learned models, namely the attribute strength estimator $\zeta_\sigma$ and the reward function $r_\theta$, we also examine them separately to see which module contributes more to the failures. By visualizing the outputs of $\zeta_\sigma$ and the corresponding proxy ground truth (Fig. 5 in Appendix A), we observe a positive, sometimes linear, correlation between them, suggesting that $\zeta_\sigma$ can accurately capture the attribute strength even at the two extremes. This observation indicates that the ineffectiveness of RBA-Global is mainly caused by the failure of $r_\theta$ to recover behaviors specified in the target attribute-score vector $v_t$. Note that this is not surprising, since $\zeta_\sigma$ only needs to learn one global ordering per attribute while $r_\theta$ has to learn an almost infinite number of orderings given various input targets $v_t$. Future work can explore better training paradigms and more expressive model architectures for $r_\theta$.

As for the failure modes of RBA-Local, it sometimes fails to edit the behavior in the anchor trajectory $\tau_c$ and simply reproduces the behavior in $\tau_c$. As shown in the bottom-right plot of Fig. 2, the agent gets stuck when the user wants to increase the softness while the agent is taking a large step (proxy score: 0.954). Also, RBA-Local tends to have a lower success rate than RBA-Global, indicating that its more complex formulation and architecture might affect its performance.
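
The baseline's behavior update and the success criterion described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `best_of_n`, `bucket_tolerance`, and `is_success` are hypothetical, a rollout is assumed to be a list of states scored by a per-step reward function, and the success tolerance is assumed to equal one bucket width of the attribute range.

```python
from typing import Callable, List, Sequence

State = float            # hypothetical: a state is reduced to a single number here
Rollout = List[State]    # hypothetical: a rollout is a list of states


def best_of_n(rollouts: Sequence[Rollout],
              reward_fn: Callable[[State], float]) -> Rollout:
    """Return the sampled rollout with the highest cumulative reward."""
    return max(rollouts, key=lambda traj: sum(reward_fn(s) for s in traj))


def bucket_tolerance(low: float, high: float, num_buckets: int) -> float:
    """Assumed tolerance: the width of one bucket of the attribute range."""
    return (high - low) / num_buckets


def is_success(achieved: float, target: float, tolerance: float) -> bool:
    """A trial succeeds if the achieved score is within the tolerance of the target."""
    return abs(achieved - target) <= tolerance


if __name__ == "__main__":
    sampled = [[0.1, 0.2], [0.5, 0.5], [0.3, 0.1]]
    print(best_of_n(sampled, reward_fn=lambda s: s))
    print(is_success(0.52, 0.5, bucket_tolerance(0.0, 1.0, 10)))
```

Selecting the best of many sampled rollouts replaces an online reinforcement learning step, which is why the baseline benefits from the same offline dataset D as the proposed methods.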

5.2. ADDITIONAL DISCUSSION

To gain more insight into the number of labels needed to learn an accurate attribute ranking function or reward function in our methods, we conduct an extra experiment in which we train the model with different numbers of samples drawn uniformly from the training set, and evaluate each model on a held-out test set. Results (Appendix A.5) show that to simultaneously learn two attributes, RBA-Global needs around 200 labelled trajectories (and the orderings among them) and RBA-Local needs around 200 $(\tau_c, \tau_t)$ pairs, which is a reasonable number. Also, the cost of learning a reward function is amortized when we continue to use it to support incoming users over its lifetime. Recall that in the main experiment, we consider a trial successful if the difference between the agent's behavior and the target behavior is lower than a threshold. A natural question is whether our methods can achieve even higher-precision control over the agent's behavior. Hence, we additionally experiment with a threshold value that roughly divides the attribute strengths into five to ten times more buckets. As expected, a more restrictive threshold reduces the performance of all algorithms, but our methods still have a significant advantage in terms of feedback efficiency. Results are shown in Table 2 in Appendix A. Note that this performance degradation was expected because the control precision targeted in this experiment is higher than that of the training data (e.g., the precision corresponds to changes that are smaller than the minimally viable local changes defined in the training data for RBA-Local).

6. CONCLUSION

In this paper, we introduced the notion of relative behavioral attributes, which allows users to provide symbolic feedback (i.e., their intent to increase or decrease attribute strength) to efficiently tweak and obtain the desired agent behavior. We proposed two approaches and demonstrated their effectiveness through experiments in a varied set of domains. For future work, apart from addressing the limitations we discussed earlier, it would be interesting to develop methods that combine the strengths of the two approaches proposed in this work. Also, we currently use the sentence embedding of each attribute only as an alternative to the one-hot vector. However, it would be beneficial to make better use of the semantic structure inside the sentence embeddings. Furthermore, it would be useful to explore the use of RBAs outside of continuous control tasks. For instance, for AI chatbots, we may construct rewards to capture not only binary attributes like helpfulness and harmfulness (Bai et al., 2022b) but also abstraction levels of the text (e.g., scientific concepts need to be explained differently to kids and researchers) or the tonality of the response (ranging from casual to more formal or professional). Inspired by recent attempts to build reward-driven vision models (Pinto et al., 2023), it would also be interesting to investigate whether RBAs can facilitate reward construction for fine-tuning models in computer vision tasks.

To synthesize the dataset and simulate human feedback, we use the moving speed and landing speed of the feet as proxies to measure the abstract concept of softness (or "sneaky"). The environment is implemented based on the Walker-2d domain in the DeepMind Control Suite (Tunyasuvunakool et al., 2020).

Manipulator. We consider two attributes: (a) moving speed of the arm; (b) instability of the movement. The environment is implemented based on the Manipulator domain in the DeepMind Control Suite. We synthesize the behavior clips by hard-coding the arm motion and by adding random noise in a controlled way. The purpose of conducting experiments in this domain is to demonstrate that our method can capture not only regular behavior patterns but also irregularities. A Markov state is constructed by stacking five consecutive raw states of the environment to ensure that it contains information about the irregularities.

Lane Change. Two attributes are used for evaluation: (a) the sharpness of steering, which corresponds to how sharp a turn the agent makes while changing lanes; (b) the distance to the following vehicle, measured between our agent and the following car at the moment when our agent starts making the lane change. This environment is built on the highway environment in (Leurent, 2018). Note that this is an image-based domain, so the objective here is to verify that our methods can be scaled to image inputs.

Snake Concertina. There are three relevant attributes in concertina locomotion: (a) width of the bend (i.e., the maximal width that the snake occupies); (b) compression (i.e., how much the snake's body is compressed when it is moving); (c) speed of movement.

A.2 ALTERNATIVE WAYS TO USE THE ATTRIBUTE STRENGTH ESTIMATOR FUNCTION IN METHOD 1 (RBA-GLOBAL)

Recall that in RBA-Global, given a trained attribute strength estimator $\zeta_\sigma$ and a finite set of attributes $A$, the agent behavior in any trajectory $\tau$ can be represented by an attribute vector $v(\tau) = \langle \zeta_\sigma(\tau, \alpha_1), \ldots, \zeta_\sigma(\tau, \alpha_k) \rangle$, where $k = |A|$ is the number of attributes. If we assume the optimal policy is approximated via some parametric model (e.g., neural networks), we can actually skip the learning of the attribute-parameterized reward function by viewing the attribute vectors as skill latent codes and learning a versatile policy conditioned on them. This is similar to the operations in (Wang et al., 2017; Peng et al., 2022) but with a more structured latent code. Alternatively, one might create a sparse reward, given at the end of an episode, by computing the distance between the extracted attribute vector and the target attribute vector. However, such a reward is usually hard to optimize. The main reason we choose to construct a reward function is that we find it more general, since rewards can be optimized not only by RL but also by other optimization-based methods. Another consideration here is whether it is easier to learn a policy directly (e.g., via BC or IL) or to learn the reward first and then the policy (e.g., via IRL or PbRL). Prior empirical results suggest that the latter tends to be a more robust solution.
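
To make the attribute vector and the sparse end-of-episode reward mentioned above concrete, the following Python sketch builds the vector from an estimator and scores a trajectory by its distance to a target vector. It is only an illustration under stated assumptions: `attribute_vector`, `sparse_terminal_reward`, and `toy_estimator` are hypothetical names, the toy estimator (a per-feature mean) merely stands in for a trained attribute strength estimator, and the negative Euclidean distance is one possible choice since the text does not fix a distance measure.

```python
import math
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]   # hypothetical state: (speed, step_size)
Trajectory = List[State]
Estimator = Callable[[Trajectory, str], float]


def attribute_vector(trajectory: Trajectory,
                     attributes: Sequence[str],
                     estimator: Estimator) -> List[float]:
    """Stack the estimated strength of every attribute into one vector."""
    return [estimator(trajectory, name) for name in attributes]


def sparse_terminal_reward(trajectory: Trajectory,
                           target: Sequence[float],
                           attributes: Sequence[str],
                           estimator: Estimator) -> float:
    """Reward given once at episode end: negative Euclidean distance
    (an assumed distance measure) between achieved and target vectors."""
    achieved = attribute_vector(trajectory, attributes, estimator)
    return -math.dist(achieved, target)


def toy_estimator(trajectory: Trajectory, attribute: str) -> float:
    """Stand-in for a learned estimator: the mean of one state feature."""
    index = {"speed": 0, "step_size": 1}[attribute]
    return sum(state[index] for state in trajectory) / len(trajectory)


if __name__ == "__main__":
    clip = [(1.0, 0.2), (3.0, 0.4)]
    names = ["speed", "step_size"]
    print(attribute_vector(clip, names, toy_estimator))
    print(sparse_terminal_reward(clip, [5.0, 0.3], names, toy_estimator))
```

Because this reward is only available after the full trajectory has been observed, it gives no per-step learning signal, which is consistent with the remark that such a reward is usually hard to optimize.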



Supplementary video at https://guansuns.github.io/pages/rba



Figure 1: Visualizing behavioral attributes of Walker and Lane-Change. The behavioral attributes of other domains are shown in Fig. 3 in Appendix A.

Figure 2: A step-by-step visualization of the interaction process showing how the agent's behaviors change according to sequences of user feedback in the Walker domain. The two sequences on the left showcase RBA-Global and the sequences on the right showcase RBA-Local. The upper and bottom rows represent the success and failure cases respectively. The attribute strength scores above each image frame are given by hard-coded proxy measures, which are in the range of [0, 1]. For RBA-Global, we also show the corresponding reward function and input $v_t = \langle \text{target step size}, \text{target softness} \rangle$. A more detailed visualization is presented in the supplementary video.

Figure 3: Visualizations of the evaluation domains and behavioral attributes.

Figure 4: Overview of Method 1 (RBA-Global) and Method 2 (RBA-Local).

Figure 5: The relationship between $\zeta_\sigma$ and the hand-designed proxy ground truth. The attribute strength scores are computed with one-hot vectors used as the attribute representation. We can observe a similar pattern when language embeddings are used.

Figure 6: A failure case of unsupervised concept discovery. Each row corresponds to a behavior trace.

Figure 7: Performance of the ranking function in Method 1 (RBA-Global) versus # of training samples.

Table 2: Results on controlling the agent's behavior with higher precision. SR - Success Rate; AF - Average

ACKNOWLEDGMENTS

This research is supported in part by ONR grants N00014-16-1-2892, N00014-18-1-2442, N00014-18-1-2840, N00014-9-1-2119, AFOSR grant FA9550-18-1-0067, DARPA SAIL-ON grant W911NF19-2-0006 and a JP Morgan AI Faculty Research grant.


A.3 FINDING THE TARGET ATTRIBUTE SCORES WITH BINARY SEARCH

Binary search is highly efficient because it narrows down the search space by cutting it in half at each step. In order to apply binary search in the attribute space, we need to maintain beliefs about the upper and lower bounds of each attribute. In our case, the upper and lower bounds can be initialized to the maximum and minimum attribute scores observed in the offline behavioral dataset D, where the scores are given by $\zeta_\sigma$. At each query step, the agent presents to the human the behavior corresponding to the midpoint attribute value, i.e., $\frac{\alpha_{\text{upper}} + \alpha_{\text{lower}}}{2}$, and updates the beliefs accordingly after getting the feedback. In the case of extrapolation, we can also go beyond the maximum and minimum attribute strengths, but this is no longer a binary search.
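
The search procedure above can be sketched for a single attribute as follows. This is a minimal sketch under stated assumptions rather than the authors' code: `binary_search_attribute` and `make_simulated_user` are hypothetical names, the feedback is assumed to be one of the strings "increase", "decrease", or "ok", and the simulated user simply compares the shown score with a hidden target, whereas in the paper the user judges a generated behavior.

```python
from typing import Callable, Tuple


def binary_search_attribute(lower: float,
                            upper: float,
                            ask_user: Callable[[float], str],
                            max_queries: int = 50) -> Tuple[float, int]:
    """Halve the interval [lower, upper] until the user accepts a value.

    Returns the accepted attribute score and the number of queries used.
    """
    for query in range(1, max_queries + 1):
        midpoint = (lower + upper) / 2.0      # value presented to the user
        feedback = ask_user(midpoint)
        if feedback == "ok":
            return midpoint, query
        if feedback == "increase":            # target lies above the midpoint
            lower = midpoint
        elif feedback == "decrease":          # target lies below the midpoint
            upper = midpoint
        else:
            raise ValueError(f"unknown feedback: {feedback!r}")
    raise RuntimeError("no acceptable value found within the query budget")


def make_simulated_user(target: float, tolerance: float) -> Callable[[float], str]:
    """Hypothetical user who knows the target score and a tolerance."""
    def ask_user(shown: float) -> str:
        if abs(shown - target) <= tolerance:
            return "ok"
        return "increase" if shown < target else "decrease"
    return ask_user


if __name__ == "__main__":
    user = make_simulated_user(target=0.7, tolerance=0.02)
    print(binary_search_attribute(0.0, 1.0, user))
```

The sketch also shows why the failure case in Section 5.1 stalls the search: if raising the target score does not change the generated behavior, the user keeps giving the same feedback and the interval shrinks without making progress.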

A.4 ARCHITECTURES AND HYPERPARAMETERS

When a one-hot representation is used to represent attributes, both $f_\sigma$ and $r_\theta$ in RBA-Global employ a 3-layer fully-connected network with 512 hidden neurons. In RBA-Local, the trajectory encoder is a 2-layer bi-directional LSTM with a hidden dimension of 128. The trajectory embedding, along with the input state and attribute, is fed into a 3-layer fully-connected network with 512 hidden neurons to compute the reward. When sentence embeddings are used as the attribute representation, for both RBA-Global and RBA-Local, we increase the number of hidden neurons in the fully-connected layers from 512 to 1024 due to the larger attribute embedding (size 768).

A.5 NUMBER OF TRAINING SAMPLES NEEDED IN OUR METHODS

Fig. 7 and Fig. 8 show the performance (on a held-out test set) of Method 1 (RBA-Global) and Method 2 (RBA-Local) when different numbers of training samples are used. For RBA-Global, given an ordered trajectory pair $(\tau_1 \succ_\alpha \tau_2)$, the ranking function is converted into a binary classifier that predicts the ordering of the given pair. The performance is measured in terms of the accuracy of this binary classifier. Similarly, for RBA-Local, the performance is measured by converting Equation 5 to a binary classification problem where the function predicts whether a trajectory is the target trajectory or not. Note that the sample complexity of the two methods is not directly comparable because their training samples are in different formats.

A.6 DISCUSSION ON UNSUPERVISED DISCOVERY OF BEHAVIORAL ATTRIBUTES

As a preliminary attempt in this study, we explored the possibility of unsupervised discovery of concepts or properties. We applied a state-of-the-art disentangled sequential variational autoencoder method, C-DSVAE (Bai et al., 2021), to learn to encode behaviors in the offline behavior dataset D. Though the behavior/motion embeddings given by C-DSVAE are able to cluster visually similar traces together, they still have difficulty capturing subtle but meaningful differences in behaviors. More importantly, they fail to establish any meaningful ordering among behaviors. As an example (Fig. 6), we visualize the variations encoded in a specific dimension in the latent space of the Lane-Change domain. This observation is consistent with the current trend in vision-text research, which suggests that unless additional supervision signals are provided, representations developed by neural networks are not guaranteed to capture semantics that make sense to humans. This preliminary study confirms the necessity of employing supervised learning for behavioral attribute modeling.

Published as a conference paper at ICLR 2023

