PREFERENCE TRANSFORMER: MODELING HUMAN PREFERENCES USING TRANSFORMERS FOR RL

Abstract

Preference-based reinforcement learning (RL) provides a framework to train agents using human preferences between two behaviors. However, preference-based RL has been challenging to scale since it requires a large amount of human feedback to learn a reward function aligned with human intent. In this paper, we present Preference Transformer, a neural architecture that models human preferences using transformers. Unlike prior approaches, which assume human judgment is based on Markovian rewards that contribute to the decision equally, we introduce a new preference model based on the weighted sum of non-Markovian rewards. We then design the proposed preference model using a transformer architecture that stacks causal and bidirectional self-attention layers. We demonstrate that Preference Transformer can solve a variety of control tasks using real human preferences, while prior approaches fail to work. We also show that Preference Transformer can induce a well-specified reward and attend to critical events in the trajectory by automatically capturing the temporal dependencies in human decision-making. Code is available on the project website: https://sites.google.com/view/preference-transformer.

1. INTRODUCTION

Reinforcement learning (RL) has been successful in solving sequential decision-making problems in various domains where a suitable reward function is available (Mnih et al., 2015; Silver et al., 2017; Berner et al., 2019; Vinyals et al., 2019). However, reward engineering poses a number of challenges. It often requires extensive instrumentation (e.g., thermal cameras (Schenck & Fox, 2017), accelerometers (Yahya et al., 2017), or motion trackers (Peng et al., 2020)) to design a dense and precise reward. Also, it is hard to evaluate the quality of outcomes with a single scalar since many problems have multiple objectives. For example, we need to care about objectives like velocity, energy spent, and torso verticality to achieve stable locomotion (Tassa et al., 2012; Faust et al., 2019), and it requires substantial human effort and extensive task knowledge to aggregate multiple objectives into a single scalar. To avoid reward engineering, there are various ways to learn the reward function from human data, such as real-valued feedback (Knox & Stone, 2009; Daniel et al., 2014), expert demonstrations (Ng et al., 2000; Abbeel & Ng, 2004), preferences (Akrour et al., 2011; Wilson et al., 2012; Sadigh et al., 2017), and language instructions (Fu et al., 2019; Nair et al., 2022). In particular, research interest in preference-based RL (Akrour et al., 2012; Christiano et al., 2017; Lee et al., 2021b) has increased recently since relative judgments (e.g., pairwise comparisons) are easy to provide yet information-rich. By learning the reward function from human preferences between trajectories, recent work has shown that the agent can learn novel behaviors (Christiano et al., 2017; Stiennon et al., 2020) or avoid reward exploitation (Lee et al., 2021b). However, existing approaches still require a large amount of human feedback, making it hard to scale up preference-based RL to various applications.
We hypothesize that this difficulty originates from common underlying assumptions in the preference modeling used in most prior work. Specifically, prior work commonly assumes that (a) the reward function is Markovian (i.e., it depends only on the current state and action), and (b) a human evaluates the quality of a trajectory (the agent's behavior) based on the sum of rewards with equal weights. These assumptions can be flawed for the following reasons. First, there are various tasks where rewards depend on the visited states (i.e., are non-Markovian), since it is hard to encode all task-relevant information into the state (Bacchus et al., 1996; 1997). This can be especially true in preference-based learning, since the trajectory segment is presented to the human sequentially (e.g., as a video clip (Christiano et al., 2017; Lee et al., 2021b)), enabling earlier events to influence the ratings of later ones. In addition, since humans are highly sensitive to remarkable moments (Kahneman, 2000), credit assignment within the trajectory is required. For example, in a study of human attention in video games using eye trackers (Zhang et al., 2020), the human player requires a longer reaction time and multiple eye movements on important states that can lead to a large reward or penalty in order to make a decision. In this paper, we aim to propose an alternative preference model that can overcome the limitations of the common assumptions in prior work. To this end, we introduce a new preference model based on the weighted sum of non-Markovian rewards, which can capture the temporal dependencies in human decisions and infer critical events in the trajectory.

Figure 1: Given a preference between two trajectory segments (σ^0, σ^1), Preference Transformer generates non-Markovian rewards r̂_t and their importance weights w_t over each segment. We then model the preference predictor based on the weighted sum of non-Markovian rewards (i.e., Σ_t w_t r̂_t) and align it with human preferences.
Inspired by the recent success of transformers (Vaswani et al., 2017) in modeling sequential data (Brown et al., 2020; Chen et al., 2021), we present Preference Transformer, a transformer-based architecture for designing the proposed preference model (see Figure 1). Preference Transformer takes a trajectory segment as input, which allows it to extract task-relevant historical information. By stacking bidirectional and causal self-attention layers, Preference Transformer generates non-Markovian rewards and importance weights as outputs. We then utilize them to define the preference model. We highlight the main contributions of this paper below:

• We propose a more generalized framework for modeling human preferences based on a weighted sum of non-Markovian rewards.

• We present Preference Transformer, a transformer-based architecture that consists of a novel preference attention layer designed for the proposed framework.

• Preference Transformer enables RL agents to solve complex navigation and locomotion tasks from the D4RL (Fu et al., 2020) benchmark and robotic manipulation tasks from the Robomimic (Mandlekar et al., 2021) benchmark by learning a reward function from real human preferences.

• We analyze the learned reward function and importance weights, showing that Preference Transformer can induce a well-specified reward and capture critical events within a trajectory.

2. RELATED WORK

Transformer-based models (i.e., pre-trained language models) are used as a reward function in these approaches due to the partial observability of language inputs, whereas we utilize transformers with a new preference model for control tasks.

Transformer for reinforcement learning and imitation learning. Transformers (Vaswani et al., 2017) have been studied for various purposes in RL (Vinyals et al., 2019; Zambaldi et al., 2019; Parisotto et al., 2020; Chen et al., 2021; Janner et al., 2021). It has been observed that sample-efficiency and generalization ability can be improved by modeling RL agents with transformers in complex environments, such as the StarCraft (Vinyals et al., 2019; Zambaldi et al., 2019) and DMLab-30 (Parisotto et al., 2020) benchmarks. For offline RL, Chen et al. (2021) and Janner et al. (2021) also leveraged transformers by formulating RL as a sequential modeling problem. Additionally, transformers have been applied successfully in imitation learning (Dasari & Gupta, 2020; Mandi et al., 2022; Reed et al., 2022). Dasari & Gupta (2020) and Mandi et al. (2022) demonstrated the generalization ability of transformers in one-shot imitation learning (Duan et al., 2017), and Reed et al. (2022) utilized transformers for multi-domain and multi-task imitation learning. In this work, we demonstrate that transformers can also be useful in improving the efficiency of preference-based learning.

Non-Markovian reward learning. Non-Markovian rewards (Bacchus et al., 1996; 1997) have been studied to handle realistic reward settings in which the reward depends on the visited states, or is delayed or even given only at the end of each episode. Several recent works focus on return decomposition, where an additional model is trained to predict the trajectory return from a given state-action sequence and is then used for reward redistribution. Arjona-Medina et al. (2019) and Early et al. (2022) used LSTMs (Hochreiter & Schmidhuber, 1997) to capture sequential information in reward learning. Gangwani et al. (2020) and Ren et al. (2022) proposed simplified approaches without any prior task knowledge, under the assumption that the episodic return is distributed uniformly over timesteps. In this work, we adopt a transformer-based reward model for learning non-Markovian rewards, which can capture the temporal dependencies in human decisions.

3. PRELIMINARIES

We consider the reinforcement learning (RL) framework in which an agent interacts with an environment in discrete time (Sutton & Barto, 2018). Formally, at each timestep t, the agent receives the current state s_t from the environment and chooses an action a_t based on its policy π. The environment returns a reward r(s_t, a_t), and the agent transitions to the next state s_{t+1}. The goal of RL is to learn a policy that maximizes the expected return R_t = Σ_{k=0}^{∞} γ^k r(s_{t+k}, a_{t+k}), the discounted cumulative sum of rewards with discount factor γ.

In many applications, it is difficult to design a suitable reward function capturing human intent. Preference-based RL (Akrour et al., 2011; Pilarski et al., 2011; Wilson et al., 2012; Christiano et al., 2017) addresses this issue by learning a reward function from human preferences. Similar to Wilson et al. (2012) and Christiano et al. (2017), we consider preferences over two trajectory segments of length H, σ = {(s_1, a_1), ..., (s_H, a_H)}. Given a pair of segments (σ^0, σ^1), a (human) teacher indicates which segment is preferred, i.e., y ∈ {0, 1, 0.5}. The label y = 1 indicates σ^1 ≻ σ^0, y = 0 indicates σ^0 ≻ σ^1, and y = 0.5 denotes an equally preferable case, where σ^i ≻ σ^j means that segment i is preferred to segment j. Each feedback is stored in a dataset of preferences D as a triple (σ^0, σ^1, y). To obtain a reward function r parameterized by ψ, most prior work (Christiano et al., 2017; Ibarz et al., 2018; Lee et al., 2021b;c; III & Sadigh, 2022; Park et al., 2022) defines a preference predictor following the Bradley-Terry model (Bradley & Terry, 1952):

P[σ^1 ≻ σ^0; ψ] = exp(Σ_t r(s^1_t, a^1_t; ψ)) / (exp(Σ_t r(s^1_t, a^1_t; ψ)) + exp(Σ_t r(s^0_t, a^0_t; ψ))).   (1)

Then, given a dataset of preferences D, the reward function r is updated by minimizing the cross-entropy loss between this preference predictor and the actual human labels:

L_CE(ψ) = -E_{(σ^0, σ^1, y) ∼ D} [(1 - y) log P[σ^0 ≻ σ^1; ψ] + y log P[σ^1 ≻ σ^0; ψ]].   (2)

One can then update a policy π using any RL algorithm such that it maximizes the expected return with respect to the learned reward.
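To make equations 1 and 2 concrete, here is a minimal sketch in plain Python. The function names are hypothetical, and the per-timestep reward values are assumed to have been produced by a learned reward model:

```python
import math

def preference_prob(rewards_1, rewards_0):
    """Bradley-Terry probability that segment 1 is preferred over segment 0,
    given per-timestep rewards r(s_t, a_t; psi) for each segment (equation 1)."""
    e1 = math.exp(sum(rewards_1))
    e0 = math.exp(sum(rewards_0))
    return e1 / (e1 + e0)

def ce_loss(dataset):
    """Cross-entropy loss over a dataset of (rewards_0, rewards_1, y) triples
    (equation 2), where y in {0, 0.5, 1} is the human preference label."""
    total = 0.0
    for r0, r1, y in dataset:
        p1 = preference_prob(r1, r0)
        total += -((1 - y) * math.log(1.0 - p1) + y * math.log(p1))
    return total / len(dataset)
```

As a sanity check, two segments with identical rewards yield a preference probability of 0.5, and a label of y = 0.5 on such a pair gives a loss of log 2.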

Figure 2: Overview of Preference Transformer. We first construct hidden embeddings {x_t} through the causal transformer, where each x_t represents the context information from the initial timestep to timestep t. The preference attention layer, with bidirectional self-attention, computes the non-Markovian rewards {r̂_t} and their convex combinations {z_t} from these hidden embeddings; we then aggregate {z_t} to model the weighted sum of non-Markovian rewards Σ_t w_t r̂_t.

4. PREFERENCE TRANSFORMER

In this section, we present Preference Transformer (PT), the transformer architecture for modeling human preferences (see Figure 2 for the overview). First, we introduce a new preference predictor P [σ 1 ≻ σ 0 ] based on a weighted sum of non-Markovian rewards in Section 4.1, which can reflect the long-term context of the agent's behaviors and capture the critical events in the trajectory segment. We then describe a novel transformer-based architecture to model the proposed preference predictor in Section 4.2.

4.1. PREFERENCE MODELING

As mentioned in Section 3, most prior work assumes that the reward is Markovian (i.e., it depends only on the current state and action) and that a human evaluates the quality of a trajectory segment based on the sum of rewards with equal weights. Based on these assumptions, a preference predictor P[σ^1 ≻ σ^0] is defined as in equation 1. However, this formulation has several limitations in modeling real human preferences. First, in many cases, it is hard to specify tasks using a Markovian reward (Bacchus et al., 1996; 1997; Early et al., 2022). Furthermore, credit assignment within the trajectory can be required, since humans are sensitive to remarkable moments (Kahneman, 2000). To address these issues, we introduce a new preference predictor that assumes the probability of preferring a segment depends exponentially on the weighted sum of non-Markovian rewards:

P[σ^1 ≻ σ^0; ψ] = exp(Σ_t w({(s^1_i, a^1_i)}^H_{i=1}; ψ)_t · r({(s^1_i, a^1_i)}^t_{i=1}; ψ)) / Σ_{j∈{0,1}} exp(Σ_t w({(s^j_i, a^j_i)}^H_{i=1}; ψ)_t · r({(s^j_i, a^j_i)}^t_{i=1}; ψ)).   (3)

To capture temporal dependencies, we consider a non-Markovian reward function r, which receives the full preceding sub-trajectory {(s_i, a_i)}^t_{i=1} as input. The importance weight w is also introduced as a function of the entire trajectory segment {(s_i, a_i)}^H_{i=1}, which enables our preference predictor to perform credit assignment within the segment. We remark that our formulation is a generalized version of the conventional design: if the reward function depends only on the current state-action pair and the importance weight is always 1, our preference predictor is equivalent to the standard model in equation 1.
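The generalized predictor in equation 3 can be sketched in a few lines of plain Python (hypothetical function name; the weights and non-Markovian rewards are assumed to be given per timestep):

```python
import math

def weighted_pref_prob(w1, r1, w0, r0):
    """Probability of preferring segment 1 over segment 0 under the weighted
    sum of non-Markovian rewards (equation 3). w_j[t] is the importance
    weight and r_j[t] the non-Markovian reward of segment j at timestep t."""
    s1 = sum(w * r for w, r in zip(w1, r1))
    s0 = sum(w * r for w, r in zip(w0, r0))
    return math.exp(s1) / (math.exp(s1) + math.exp(s0))
```

Setting every importance weight to 1 recovers the standard equal-weight predictor of equation 1, illustrating the claim that the formulation is a strict generalization.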

4.2. ARCHITECTURE

In order to model the preference predictor in equation 3, we propose a new transformer architecture, called Preference Transformer, that consists of the following learned components.

Causal transformer. We use the transformer network (Vaswani et al., 2017) as the backbone, inspired by its strength in modeling sequential data (Brown et al., 2020; Ramesh et al., 2021; Janner et al., 2021; Chen et al., 2021). Specifically, we use the GPT architecture (Radford et al., 2018), i.e., the transformer architecture with causally masked self-attention. Given a trajectory segment σ = {(s_1, a_1), ..., (s_H, a_H)} of length H, we generate 2H input embeddings (one for each state and action), which are learned by a linear layer followed by layer normalization (Ba et al., 2016). A shared positional embedding (i.e., the state and action at the same timestep share the same positional embedding) is learned and added to each input embedding. The input embeddings are then fed into the causal transformer network, which produces output embeddings {x_t}^H_{t=1} such that the t-th output depends only on input embeddings up to timestep t.

Preference attention layer. To model the preference predictor using the weighted sum of non-Markovian rewards defined in equation 3, we introduce a preference attention layer. As shown in Figure 2, the preference attention layer receives the hidden embeddings {x_t}^H_{t=1} from the causal transformer and generates rewards r̂_t and importance weights w_t. Formally, the t-th input x_t is mapped via linear transformations to a key k_t ∈ R^d, a query q_t ∈ R^d, and a value r̂_t ∈ R, where d is the embedding dimension. We remark that the t-th value in self-attention, i.e., r̂_t, models the non-Markovian reward r({(s_i, a_i)}^t_{i=1}), since the hidden embedding x_t depends only on the previous inputs in the trajectory segment. In other words, at each timestep t, the t state-action pairs {(s_i, a_i)}^t_{i=1} are used to approximate the reward r̂_t during training.
Following self-attention (Vaswani et al., 2017), the i-th output z_i is defined as a convex combination of the values, with attention weights computed from the i-th query and the keys:

z_i = Σ^H_{t=1} softmax({⟨q_i, k_{t'}⟩}^H_{t'=1})_t · r̂_t.

Then, the weighted sum of rewards can be computed as the average of the outputs {z_i}^H_{i=1}:

(1/H) Σ^H_{i=1} z_i = (1/H) Σ^H_{i=1} Σ^H_{t=1} softmax({⟨q_i, k_{t'}⟩}^H_{t'=1})_t · r̂_t = Σ^H_{t=1} w_t r̂_t,  where  w_t = (1/H) Σ^H_{i=1} softmax({⟨q_i, k_{t'}⟩}^H_{t'=1})_t.

Here, w_t corresponds to the importance weight w({(s_i, a_i)}^H_{i=1})_t in equation 3, because this preference attention is not causally masked and thus depends on the full sequence (i.e., bidirectional self-attention). In summary, we model the weighted sum of non-Markovian rewards by taking the average of the outputs of the transformer network and assume that the probability of preferring a segment depends exponentially on this quantity, as in equation 3. The complete architecture of Preference Transformer is shown in Figure 2.
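The identity above — the mean of the attention outputs equals the weighted sum Σ_t w_t r̂_t — can be checked with a small plain-Python sketch of the preference attention layer (the 1/√d scaling and the learned linear maps are omitted for brevity; names are hypothetical):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def preference_attention(queries, keys, values):
    """Bidirectional attention over scalar reward values r_hat_t.
    Returns the outputs z_i and the importance weights w_t; by construction,
    mean(z) == sum_t w_t * r_hat_t."""
    H = len(values)
    attn = [softmax([sum(qd * kd for qd, kd in zip(q, k)) for k in keys])
            for q in queries]                       # attn[i][t] = softmax(<q_i, k_t>)
    z = [sum(attn[i][t] * values[t] for t in range(H)) for i in range(H)]
    w = [sum(attn[i][t] for i in range(H)) / H for t in range(H)]
    return z, w
```

Since each attention row is a probability distribution, the weights w_t also sum to 1, which is why averaging the z_i yields a convex combination of the rewards.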

4.3. TRAINING AND INFERENCE

Training. We train Preference Transformer by minimizing the cross-entropy loss in equation 2 on a dataset of preferences D. By aligning the preference predictor modeled by Preference Transformer with human labels, we find that Preference Transformer can induce a suitable reward function and capture important events in trajectory segments (see Figure 3 for supporting results).

Inference. For RL training, all state-action pairs are labeled with the learned reward function. Because we train a non-Markovian reward function, we provide the H past transitions (s_{t-H+1}, a_{t-H+1}, ..., s_t, a_t) to Preference Transformer and use the t-th value r̂_t from the preference attention layer as the reward at timestep t.
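The inference-time labeling loop can be sketched as follows; `reward_model` is a hypothetical callable standing in for the trained Preference Transformer, which maps a window of recent transitions to the reward r̂_t for the window's final step:

```python
def label_rewards(trajectory, reward_model, H):
    """Label each timestep of a trajectory with the learned non-Markovian
    reward: feed the H most recent transitions ending at timestep t and take
    the model's reward value for the final step of the window."""
    rewards = []
    for t in range(len(trajectory)):
        window = trajectory[max(0, t - H + 1): t + 1]  # shorter near the start
        rewards.append(reward_model(window))
    return rewards
```

Early timesteps have fewer than H preceding transitions, so the window is simply truncated there; how the original implementation handles this boundary is an assumption of this sketch.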

5. EXPERIMENTS

We design our experiments to investigate the following questions:

• Can Preference Transformer solve complex control tasks using real human preferences?

• Can Preference Transformer induce a well-aligned reward and attend to critical events?

• How well does Preference Transformer perform with synthetic preferences (i.e., scripted teacher settings)?

Following Shin & Brown (2021), we evaluate Preference Transformer (PT) on several complex control tasks in the offline setting using the D4RL (Fu et al., 2020) and Robomimic (Mandlekar et al., 2021) benchmarks. Specifically, we consider three different domains (AntMaze and Gym-Mujoco locomotion (Todorov et al., 2012; Brockman et al., 2016) from D4RL, and Robosuite robotic manipulation (Zhu et al., 2020) from Robomimic) with different data collection schemes. For reward learning, we select queries (pairs of trajectory segments) uniformly at random from the offline datasets and collect preferences from real human trainers (the authors). Then, using the collected datasets of human preferences, we learn a reward function and train RL agents using Implicit Q-Learning (IQL; Kostrikov et al., 2022), a recent offline RL algorithm that achieves strong performance on D4RL benchmarks. We consider Markovian policy and value functions, following the original implementation (e.g., architecture and hyperparameters) of IQL. We additionally provide results of offline RL experiments with non-MDP models in Appendix G. For evaluation, we measure expert-normalized scores with respect to the underlying task reward from the original benchmark (D4RL) and success rates (Robomimic). As baselines, we consider standard preference modeling based on a Markovian reward (MR) or a non-Markovian reward (NMR).
For the MR-based model, we define the preference predictor as

P[σ^1 ≻ σ^0; ψ_MR] = exp(Σ_t r(s^1_t, a^1_t; ψ_MR)) / Σ_j exp(Σ_t r(s^j_t, a^j_t; ψ_MR)),

and for the NMR-based model as

P[σ^1 ≻ σ^0; ψ_NMR] = exp(Σ_t r({(s^1_i, a^1_i)}^t_{i=1}; ψ_NMR)) / Σ_j exp(Σ_t r({(s^j_i, a^j_i)}^t_{i=1}; ψ_NMR)).

Note that both baselines use preference modeling based on the sum of rewards with equal weights, while our method is based on the weighted sum of non-Markovian rewards modeled by transformers. We also report the performance of IQL trained with the task reward from the benchmarks as a reference. For all experiments, we report the mean and standard deviation across 8 runs. More experimental details (e.g., task descriptions, feedback collection, and reward learning) are in Appendices B and C.

Figure 4: Averaged human evaluation results on 4 AntMaze tasks. Numbers denote the statistics of the evaluators' responses over 40 trials. PT received higher ratings compared to both MR and NMR.

Table 1 shows the performance of IQL with different reward functions. Preference Transformer consistently outperforms all baselines on almost all tasks. Notably, only our method nearly matches the performance of IQL with the task reward, while the baselines fail on hard tasks. This implies that PT can induce a suitable reward function from real human preferences and teach meaningful behaviors. In particular, there is a large gap between PT and the baselines on complex tasks (i.e., AntMaze), where capturing the long-term context of the agent's behavior (e.g., the direction of the agent) and critical events (e.g., the goal location) is important. These results show that our transformer-based preference model is very effective for reward learning.

Human evaluation. To check whether the learned rewards are indeed aligned with human preferences, we also conduct a human evaluation. We generate a query (i.e., two trajectories) from agents trained with two different rewards, and a human evaluator (the authors) decides which trajectory is better. We observe that human evaluators prefer agents from PT over agents from MR or NMR, showing that PT is more aligned with human preferences.

5.3. REWARD AND WEIGHT ANALYSIS

We evaluate whether Preference Transformer can induce a well-specified reward and capture critical events from human preferences. Figure 3 shows the learned reward function (red curve) and importance weights (blue curve) on successful and failed trajectory segments from antmaze-large-play-v2. First, we find that the reward function is well-aligned with human intent. In the successful trajectory (Figure 3a), the reward value increases as the agent gets close to the goal, while the failed trajectory shows low rewards as the agent struggles to escape the corner and then flips, as shown in Figure 3b. This shows that Preference Transformer can capture the context of the agent's behavior in the reward function. We also find that the importance weights show a different trend from the reward function. Interestingly, for both successful and failed trajectories, spikes in the importance weights correspond to critical events such as turning right to reach the goal or flipping. More video examples are available in the supplementary material.

5.4. BENCHMARK TASKS WITH SCRIPTED TEACHERS

Similar to prior work (Christiano et al., 2017; Lee et al., 2021b;c), we also evaluate Preference Transformer using synthetic preferences from scripted teachers. We consider a deterministic teacher that generates preferences based on the task reward r from the benchmark as follows: y = argmax_{i∈{0,1}} Σ^H_{t=1} r(s^i_t, a^i_t). We remark that this scripted teacher is a special case of the preference model introduced in equation 1.

Figure 5 shows the performance of IQL with different reward functions from both human and scripted teachers (see Appendix D for full results). Preference Transformer achieves strong performance with scripted teachers even though the synthetic preferences are based on Markovian rewards that contribute to the decision equally. We expect this is because Preference Transformer can induce a better-shaped reward by utilizing historical information. Also, the non-Markovian formulation (equation 3) can be interpreted as a generalized version of the Markovian formulation (equation 1), since it can learn Markovian rewards by ignoring the history in the inputs. We also remark that the baselines achieve better performance with scripted teachers than with real human teachers on some easy tasks (e.g., locomotion tasks). This implies that the scripted teacher does not model real human behavior exactly, and evaluation with scripted teachers can generate misleading information. To investigate this in detail, we measure the agreement rates between human and scripted teachers. Figure 6b shows that the disagreement rates are quite high across all tasks, again implying that evaluation with a scripted teacher can produce misleading conclusions.
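The scripted teacher above can be sketched directly; returning 0.5 on an exact tie is an assumption of this sketch (the paper's teacher is a pure argmax over the two return sums):

```python
def scripted_label(rewards_0, rewards_1):
    """Deterministic scripted teacher: prefer the segment with the larger
    sum of ground-truth task rewards. The tie case (y = 0.5) is an
    assumption added for completeness."""
    s0, s1 = sum(rewards_0), sum(rewards_1)
    if s0 == s1:
        return 0.5
    return 1 if s1 > s0 else 0
```

Such a teacher ignores both temporal context and remarkable moments, which is consistent with the observed disagreement between scripted and human labels.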
Q z T V N M x n h I e 5 Z K L K g O 8 v k B U 3 R u l Q G K E 2 V L G j R X f 0 / k W G g 9 E Z H t F N i M 9 L I 3 E / / z e p m J b 4 K c y T Q z V J L F R 3 H G k U n Q L A 0 0 Y I o S w y e W Y K K Y 3 R W R E V a Y G J t Z x Q z T V N M x n h I e 5 Z K L K g O 8 v k B U 3 R u l Q G K E 2 V L G j R X f 0 / k W G g 9 E Z H t F N i M 9 L I 3 E / / z e p m J b 4 K c y T Q z V J L F R 3 H G k U n Q L A 0 0 Y I o S w y e W Y K K Y 3 R W R E V a Y G J t Z x Q z T V N M x n h I e 5 Z K L K g O 8 v k B U 3 R u l Q G K E 2 V L G j R X f 0 / k W G g 9 E Z H t F N i M 9 L I 3 E / / z e p m J b 4 K c y T Q z V J L F R 3 H G k U n Q L A 0 0 Y I o S w y e W Y K K Y 3 R W R E V a Y G J t Z x Q z T V N M x n h I e 5 Z K L K g O 8 v k B U 3 R u l Q G K E 2 V L G j R X f 0 / k W G g 9 E Z H t F N i M 9 L I 3 E / / z e p m J b 4 K c y T Q z V J L F R 3 H G k U n Q L A 0 0 Y I o S w y e W Y K K Y 3 R W R E V a Y G J t Z x Q z T V N M x n h I e 5 Z K L K g O 8 v k B U 3 R u l Q G K E 2 V L G j R X f 0 / k W G g 9 E Z H t F N i M 9 L I 3 E / / z e p m J b 4 K c y T Q z V J L F R 3 H G k U n Q L A 0 0 Y I o S w y e W Y K K Y 3 R W R E V a Y G J t Z x Q z T V N M x n h I e 5 Z K L K g O 8 v k B U 3 R u l Q G K E 2 V L G j R X f 0 / k W G g 9 E Z H t F N i M 9 L I 3 E / / z e p m J b 4 K c y T Q z V J L F R 3 H G k U n Q L A 0 0 Y I o S w y e W Y K K Y 3 R W R E V a Y G J t Z x Q z T V N M x n h I e 5 Z K L K g O 8 v k B U 3 R u l Q G K E 2 V L G j R X f 0 / k W G g 9 E Z H t F N i M 9 L I 3 E / / z e p m J b 4 K c y T Q z V J L F R 3 H G k U n Q L A 0 0 Y I o S w y e W Y K K Y 3 R W R E V a Y G J t Z x Q z T V N M x n h I e 5 Z K L K g O 8 v k B U 3 R u l Q G K E 2 V L G j R X f 0 / k W G g 9 E Z H t F N i M 9 L I 3 E / / z e p m J b 4 K c y T Q z V J L F R 3 H G k U n Q L A 0 0 Y I o S w y e W Y K K Y 3 R W R E V a Y G J t Z x 4 r Y F Z j B O 4 U 8 T O i x O Y B j o T z a j k = " > A A A C C H i c b V C 7 S g N B F J 3 1 G e N r 1 d L C w S B Y h V k R F K u A j W U C 5 g H Z s M x O 
Z p M x s w 9 m 7 o r L E r G x 8 V d s L B S x 9 R P s / B s n j 0 I T D 1 w 4 n H M v M + f 4 i R Q a C P m 2 F h a X l l d W C 2 v F 9 Y 3 N r W 1 7 Z 7 e h 4 1 Q x X m e x j F X L p 5 p L E f E 6 C J C 8 l S h O Q 1 / y p j + 4 H P n N W 6 6 0 i K N r y B L e C W k v E o F g F I z k 2 Q e u F r 2 Q e u Q C u 8 D v I A d F b z i D W G W Y D O + x Z 5 d I m Y y B 5 4 k z J S U 0 R d W z v 9 x u z N K Q R 8 A k 1 b r t k A Q 6 O V U = " > A A A C C H i c b V C 7 S g N B F J 3 1 G e N r 1 d L C w S B Y h R 0 R F K u A j W U C 5 g H Z s M x O Z p M x s w 9 m 7 o r L E r G x 8 V d s L B S x 9 R P s / B s n j 0 I T D 1 w 4 n H M v M + f 4 i R Q a H O f b W l h c W l 5 Z L a w V 1 z c 2 t 7 b t n d 2 G j l P F e J 3 F M l Y t n 2 o u R c T r I E D y V q I 4 D X 3 J m / 7 g c u Q 3 b 7 n S I o 6 u I U t 4 J 6 S 9 S A S C U T C S Z x + 4 W v R C 6 p E L 7 A K / g x w U v e E M Y p V h M r z H n l 1 y y s 4 Y e J 6 Q K S m h K a q e / e V 2 Y 5 a G P A I m q d Z t 4 i T Q y a k C w S Q f F t 1 U 8 4 S y A e 3 x t q E R D b n u 5 O M g Q 3 x k l C 4 O Y m U m A j x W f 1 / k N N Q 6 C 3 2 z G V L = " > A A A C C H i c b V C 7 S g N B F J 3 1 G e N r 1 d L C w S B Y h R 0 R F K u A j W U C 5 g H Z s M x O Z p M x s w 9 m 7 o r L E r G x 8 V d s L B S x 9 R P s / B s n j 0 I T D 1 w 4 n H M v M + f 4 i R Q a H O f b W l h c W l 5 Z L a w V 1 z c 2 t 7 b t n d 2 G j l P F e J 3 F M l Y t n 2 o u R c T r I E D y V q I 4 D X 3 J m / 7 g c u Q 3 b 7 n S I o 6 u I U t 4 J 6 S 9 S A S C U T C S Z x + 4 W v R C 6 p E L 7 A K / g x w U v e E M Y p V h M r z H n l 1 y y s 4 Y e J 6 Q K S m h K a q e / e V 2 Y 5 a G P A I m q d Z t 4 i T Q y a k C w S Q f F t 1 U 8 4 S y A e 3 x t q E R D b n u 5 O M g Q 3 x k l C 4 O Y m U m A j x W f 1 / k N N Q 6 C 3 2 z G V L = " > A A A C C H i c b V C 7 S g N B F J 3 1 G e N r 1 d L C w S B Y h R 0 R F K u A j W U C 5 g H Z s M x O Z p M x s w 9 m 7 o r L E r G x 8 V d s L B S x 9 R P s / B s n j 0 I T D 1 w 4 n H M v M + f 4 i R Q a H O f b W l 
h c W l 5 Z L a w V 1 z c 2 t 7 b t n d 2 G j l P F e J 3 F M l Y t n 2 o u R c T r I E D y V q I 4 D X 3 J m / 7 g c u Q 3 b 7 n S I o 6 u I U t 4 J 6 S 9 S A S C U T C S Z x + 4 W v R C 6 p E L 7 A K / g x w U v e E M Y p V h M r z H n l 1 y y s 4 Y e J 6 Q K S m h K a q e / e V 2 Y 5 a G P A I m q d Z t 4 i T Q y a k C w S Q f F t 1 U 8 4 S y A e 3 x t q E R D b n u 5 O M g Q 3 x k l C 4 O Y m U m A j x W f 1 / k N N Q 6 C 3 2 z G V L = " > A A A C C H i c b V C 7 S g N B F J 3 1 G e N r 1 d L C w S B Y h R 0 R F K u A j W U C 5 g H Z s M x O Z p M x s w 9 m 7 o r L E r G x 8 V d s L B S x 9 R P s / B s n j 0 I T D 1 w 4 n H M v M + f 4 i R Q a H O f b W l h c W l 5 Z L a w V 1 z c 2 t 7 b t n d 2 G j l P F e J 3 F M l Y t n 2 o u R c T r I E D y V q I 4 D X 3 J m / 7 g c u Q 3 b 7 n S I o 6 u I U t 4 J 6 S 9 S A S C U T C S Z x + 4 W v R C 6 p E L 7 A K / g x w U v e E M Y p V h M r z H n l 1 y y s 4 Y e J 6 Q K S m h K a q e / e V 2 Y 5 a G P A I m q d Z t 4 i T Q y a k C w S Q f F t 1 U 8 4 S y A e 3 x t q E R D b n u 5 O M g Q 3 x k l C 4 O Y m U m A j x W f 1 / k N N Q 6 C 3 2 z G V L the context of the agent's behavior correctly (see Figure 6a ). Note that the disagreement between human and scripted teachers also has been observed in simple grid world domains (Knox et al., 2022) . These results imply that existing preference-based RL benchmarks (Lee et al., 2021c) based on synthetic preferences may not be enough, calling for a new benchmark specially designed for preference-based RL.
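The scripted-teacher labeling rule above is simple enough to sketch directly. The following is a minimal illustration; the function name and array inputs are ours, not from the released code:

```python
import numpy as np

def scripted_label(task_rewards_0, task_rewards_1):
    """Deterministic scripted teacher: prefer the segment whose
    ground-truth task rewards sum to the larger return."""
    returns = [float(np.sum(task_rewards_0)), float(np.sum(task_rewards_1))]
    return int(np.argmax(returns))  # preference label y in {0, 1}

# Segment 1 accumulates more task reward, so the teacher prefers it.
seg0 = np.array([0.1, 0.0, 0.2])
seg1 = np.array([0.3, 0.4, 0.1])
label = scripted_label(seg0, seg1)
```

Because the label depends only on the per-step Markovian task rewards, this teacher corresponds to the preference model of equation 1 with equal per-step contributions.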

5.5. LEARNING COMPLEX NOVEL BEHAVIORS

Figure 7: Six frames from a double backflip of Hopper. The agent is trained to perform a sequence of backflips using 300 queries of human feedback.

We present the effectiveness of Preference Transformer for enabling agents to learn complex and novel behaviors where a suitable reward function is difficult to design. Specifically, we train the Hopper agent from Gym-Mujoco locomotion to perform multiple backflips at each jump. This task is more challenging than the single backflip considered in previous preference-based RL approaches (Christiano et al., 2017; Lee et al., 2021b), as the reward function must capture non-Markovian context, including the number of rotations. We observe that the Hopper agent learns to perform a double backflip as shown in Figure 7, while an agent with a Markovian reward function fails to learn it. The implementation details of the Hopper backflip are provided in Appendix B, and videos of all behaviors (including a triple backflip) are available on the project website.

6. DISCUSSION

In this paper, we present a new framework for modeling human preferences based on the weighted sum of non-Markovian rewards, and design the proposed framework using a transformer-based architecture. We propose a novel preference attention layer that learns the weighted sum of non-Markovian rewards, inferring the importance weight of each reward via the self-attention mechanism. Our experiments demonstrate that Preference Transformer significantly outperforms current preference modeling methods on complex navigation, locomotion, and robotic manipulation tasks from offline RL benchmarks. In addition, we observe that the learned preference attention layer can indeed capture the events critical to human decisions. We believe that Preference Transformer is essential to scaling preference-based RL (and other human-in-the-loop learning) to various applications.
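The weighted-sum preference model summarized above can be sketched in a few lines: each segment is scored by the weighted sum of per-step rewards, a two-way softmax gives the preference probability, and cross-entropy over that probability is the training loss. This is a minimal numpy illustration with hypothetical inputs standing in for the transformer's outputs:

```python
import numpy as np

def preference_prob(r0, w0, r1, w1):
    """P[segment 1 is preferred] under the weighted-sum preference model.
    r*, w* stand in for the per-step non-Markovian rewards and importance
    weights that Preference Transformer would emit for each segment."""
    s0 = float(np.dot(w0, r0))  # score of segment 0: sum_t w_t * r_t
    s1 = float(np.dot(w1, r1))  # score of segment 1
    m = max(s0, s1)             # stabilize the two-way softmax
    e0, e1 = np.exp(s0 - m), np.exp(s1 - m)
    return e1 / (e0 + e1)

def preference_loss(p1, y):
    """Cross-entropy for label y in {0, 1} (y = 1: segment 1 preferred)."""
    eps = 1e-8
    return -(y * np.log(p1 + eps) + (1 - y) * np.log(1 - p1 + eps))
```

With uniform weights this reduces to the standard Bradley-Terry model over segment returns; learned non-uniform weights let the model emphasize the critical steps of a segment.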

FUTURE DIRECTIONS

There are several future directions for Preference Transformer. One is to utilize the importance weights in reinforcement learning or preference-based reward learning. For example, importance weights could be used to sample more informative queries, which can improve the feedback-efficiency of preference-based reward learning (Sadigh et al., 2017; Lee et al., 2021c). It is also an interesting direction to use the importance weights for stabilizing Q-learning via weighted updates (Kumar et al., 2020; Lee et al., 2021a). Combination with other preference models is another important direction for future research. For example, Knox et al. (2022) proposed a new preference model based on each segment's regret in simple grid-world environments. Even though their method rests on several assumptions (e.g., generating successor features (Dayan, 1993; Barreto et al., 2017)), a combination with the regret-based model would be interesting.

ETHICS STATEMENT

Unlike other domains (e.g., language), control tasks require high-quality human feedback from domain experts. Even though the quality of a dataset can be improved by training labelers, such training requires substantial effort. In addition, feedback from crowd-sourcing platforms, such as Amazon Mechanical Turk, can be noisy. We addressed these concerns by collecting human feedback from domain experts (the authors), who are very familiar with robotics, RL, and the target tasks. We expect that our datasets carry high-quality, clean labels, and that our collection strategy is closer to practice. At the same time, we think that evaluations using public crowd-sourcing platforms would be interesting and leave them to future work.

REPRODUCIBILITY STATEMENT

We describe the implementation details of Preference Transformer in Appendix B, and our source code is available on the project website linked in the abstract. We will also publicly release the collected offline datasets with real human preferences for benchmarks.

A TASKS AND DATASETS

In this section, we describe the details of the control tasks from the D4RL benchmark (Fu et al., 2020).

AntMaze. AntMaze is a navigation task requiring a Mujoco Ant robot to reach a goal location. The datasets are generated by a pre-trained policy designed to reach a goal on different maze layouts. We consider two maze layouts (medium and large) and two data-generation strategies (diverse and play). The diverse datasets are generated from the pre-trained policy with randomized start and goal locations, while the play datasets are generated from the pre-trained policy with specific hand-picked goal locations. The task reward is sparse: the agent receives a reward only when its distance to the goal location is less than a fixed threshold, and zero otherwise.

Gym-Mujoco locomotion. The goal of the Gym-Mujoco locomotion tasks is to control simulated robots (Walker2d, Hopper) so that they move forward while minimizing the energy cost (action norm) for safe behaviors. We consider two data-generation strategies: medium-expert and medium-replay. The medium-expert datasets are generated by mixing equal amounts of expert and suboptimal (partially-trained) demonstrations, while the medium-replay datasets correspond to the replay buffer collected by a partially-trained policy. The task reward consists of the forward velocity of the torso, a control penalty, and a survival bonus.

Robosuite robotic manipulation. Robosuite robotic manipulation (Zhu et al., 2020)

Baselines. Implementation details of our baselines are as follows:

• Markovian reward model (MR). We use two-layer MLPs with 256 hidden units each. We use ReLU activations between layers and no activation on the output. Each model is trained by optimizing the cross-entropy loss defined in equation 2 with a learning rate of 0.0003.

• Non-Markovian reward model (NMR).
For the non-Markovian reward model, we reimplement CSC Instance Space LSTM (Early et al., 2022) in JAX, following the architecture specified in the original paper. We double the hidden dimensions so that the capacities of NMR and PT match.

Implementation details of Hopper backflip. To enable the Hopper backflip, we train the agent in an online setup following Lee et al. (2021b). We pre-train the policy π using an intrinsic reward for the first 10,000 timesteps to explore and collect diverse experiences. Then, we train PT on the collected behaviors and relabel past experiences in the replay buffer using the learned reward model. This reward-learning and relabeling stage is repeated every 10,000 timesteps, and we collect 100 queries of human feedback at each stage. We observe that 3 reward-learning stages (i.e., 300 queries) are enough to perform double backflips. We use 3 layers and 8 attention heads for PT, and the model is trained for 100 epochs at each stage. The other details are the same as in Table 2.
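The MR baseline described in the Baselines paragraph above is small enough to sketch directly. The weight shapes follow the two-hidden-layer, 256-unit description; the initialization scheme and function names are ours and purely illustrative:

```python
import numpy as np

def init_mr_params(input_dim, hidden=256, seed=0):
    """Markovian reward model (MR): an MLP with two 256-unit hidden
    layers, ReLU activations, and a linear scalar output."""
    rng = np.random.default_rng(seed)
    def layer(n_in, n_out):
        return rng.normal(0.0, 0.05, (n_in, n_out)), np.zeros(n_out)
    return [layer(input_dim, hidden), layer(hidden, hidden), layer(hidden, 1)]

def mr_reward(params, state_action):
    """Predict a scalar reward r(s_t, a_t) from a state-action vector."""
    h = np.asarray(state_action, dtype=float)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:  # ReLU between layers, none on the output
            h = np.maximum(h, 0.0)
    return float(h[0])
```

In training, per-step predictions from this model are summed over each segment and plugged into the Bradley-Terry cross-entropy loss of equation 2.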

C HUMAN PREFERENCES

Preference collection. We collect feedback from actual human subjects (the authors), who are familiar with the robotic tasks. In detail, a human teacher given instructions for each task watches a video rendering each segment and chooses which of the two is more helpful for achieving the agent's objective. Each trajectory segment is 3 seconds long (100 timesteps). If the human teacher cannot decide on a preference between the segments, they may select a neutral option that assigns the same preference to both segments.

Instructions given to human teachers.

• AntMaze: The first priority is for the ant robot to reach the goal location as soon as possible without wandering or falling. If the ant robot is left fallen, hovering, or moving in the opposite direction from the goal, lower your priority for the segment even if its distance to the goal is closer than the other segment's. If the two robots are almost tied on this metric, choose the segment by the distance the robot has moved.

• Hopper: The hopper robot aims to move to the right as far as possible while minimizing energy costs. If the hopper robot lands unsteadily, lower your priority even if the distance traveled during the segment is longer than the other's. If the two robots are almost tied on this metric, choose the segment by the distance the robot has moved.

• Walker2d: The goal of the walker robot is to move to the right as far as possible while minimizing energy costs. If the walker is about to fall or walks abnormally (e.g., walking using only one leg, slipping), lower your priority for the segment even if the distance traveled during the segment is longer than the other's. If the two robots are almost tied on this metric, choose the segment by the distance the robot has moved.

• Robosuite: The Panda robot arm first grasps the object and then carries out actions for a specific purpose (lifting a cube, or moving a coke can to the target bin). Prioritize the segment in which the object is grasped. If the two robots are almost tied on this metric, choose by the extent to which the target object has moved.

G EXPERIMENTAL RESULTS USING NON-MARKOVIAN MODELS

We investigate whether our non-Markovian reward function performs better with non-Markovian policy and value functions. For each timestep t, we augment the state s_t by concatenating historical information, and use it as input to train the policy and value functions. Table 4 shows the results comparing Preference Transformer and NMR. We observe that PT outperforms NMR, which again shows the superiority of our method. However, compared to the results with Markovian policy and value functions in Table 1, the overall performance of all methods is degraded. We expect this is because the original offline RL algorithm (IQL) assumes a Markovian setup, and all of its hyper-parameters were tuned under this assumption.
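A minimal sketch of this state augmentation, following the footnote's choice of the previous timestep's predicted reward r_{t-1} as the historical input (array names are illustrative):

```python
import numpy as np

def augment_states(states, predicted_rewards):
    """Concatenate each state s_t with the previous timestep's predicted
    reward r_{t-1} (zero at t = 0), producing inputs for non-Markovian
    policy and value functions."""
    states = np.asarray(states, dtype=float)
    rewards = np.asarray(predicted_rewards, dtype=float)
    r_prev = np.concatenate(([0.0], rewards[:-1]))  # shift rewards by one step
    return np.concatenate([states, r_prev[:, None]], axis=1)
```

The augmented state has one extra dimension per timestep; the rest of the IQL training pipeline is unchanged.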



Footnotes:

• To focus on evaluating the performance of reward learning, we consider the offline setting. Note that a similar setting assuming available expert demonstrations is also considered in prior work (Ibarz et al., 2018).

• We collected 100 (medium) / 1000 (large) queries in AntMaze, 500 (medium-replay) / 100 (medium-expert) queries in Gym-Mujoco locomotion, and 100 (PH) / 500 (MH) queries in Robosuite robotic manipulation. A maximum of 10 minutes of human time was required for feedback collection in all cases except Gym-Mujoco locomotion medium-expert and antmaze-large, which required between 1 and 2 hours of human time.

• Since preferences are collected by experts who are familiar with robotics and RL domains, we measure the performance with respect to task reward, similar to Christiano et al. (2017).

• The original work (Early et al., 2022) only considered return prediction, but we utilize the proposed LSTM architecture in preference-based learning.

• Trajectories are anonymized for fair comparisons. Also, the evaluators skip the query if both agents equally fail to solve the task, for meaningful comparisons.

• Each set consists of 5 rollouts from 8 models trained with different random seeds.

• If we introduce a rationality constant (or temperature) β in the exponential term of equation 1 and set β → ∞, the preference model becomes deterministic.

• https://github.com/matthias-wright/flaxmodels

• https://github.com/ikostrikov/implicit_q_learning

• Specifically, we provide the predicted reward of the previous timestep, r_{t-1}, as an additional input.



Figure 1: Illustration of our framework. Given a preference between two trajectory segments (σ^0, σ^1), Preference Transformer generates non-Markovian rewards r_t and their importance weights w_t over each segment. We then model the preference predictor based on the weighted sum of non-Markovian rewards (i.e., Σ_t w_t r_t), and align it with human preferences.


Figure 3: Time series of the learned reward function (red curve) and importance weight (blue curve) on (a) a successful trajectory segment and (b) a failure trajectory segment from antmaze-large-play-v2. In both cases, spikes in the importance weight correspond to critical events: turning right to reach the goal (point 2), or flipping (point 4). The learned reward is also well-aligned with human intent: the reward increases as the agent gets close to the goal, and decreases when the agent flips.

Figure 5: Averaged normalized scores of IQL with various reward functions trained from human and synthetic preferences on AntMaze, Gym-Mujoco locomotion (Hopper and Walker2d), and Robosuite robotic manipulation tasks. The result shows the mean and standard deviation over 8 runs. Our method (PT) achieves strong performance with both scripted and human teachers, while the performance of the baselines (MR and NMR) degrades significantly with human teachers.


Figure 6: Difference between the human and scripted teacher. (a) Examples of trajectories shown to the human and scripted teachers on the AntMaze task. The human teacher provides the correct label by capturing the context of the behavior (i.e., direction), while the scripted teacher does not. (b) Agreement rates between human and scripted teachers. We find that disagreement rates are quite high across all tasks, implying that evaluation with scripted teachers can produce misleading conclusions.

Table 1: Averaged normalized scores of IQL on AntMaze and Gym-Mujoco locomotion tasks, and success rates on Robosuite manipulation tasks, with different reward functions. Using the same dataset of preferences from real human teachers, we train Preference Transformer (PT), an MLP-based Markovian reward model (MR; Christiano et al. 2017; Lee et al. 2021b), and an LSTM-based non-Markovian reward model (NMR; Early et al. 2022). The result shows the mean and standard deviation over 8 runs.



includes different types of tasks with various 7-DoF simulated robot arms. We use simulated environments with the Panda robot by Franka Emika in our experiments. We choose the tasks of lifting a cube object (lift) and moving a coke can from the table to a target bin (can). Datasets are collected either by one proficient teleoperator (PH) or by 6 teleoperators of varying proficiency (MH). The task reward is sparse, and we defer the details to the original paper.

Normalized scores. We report the normalized score, computed as 100 × (score − random score) / (expert score − random score), for AntMaze and Gym-Mujoco locomotion tasks, and the success rate for Robosuite robotic manipulation tasks. Here, max timestep denotes the maximum episode length, and the maximum and minimum returns denote the returns of the best and worst trajectories in the dataset, respectively.

Compute. Training time varies depending on the environment, but reward learning takes less than 10 minutes in the AntMaze environment with 1000 human feedback queries, and training IQL with the learned PT reward takes about an hour. For training and evaluating our model, we use a single NVIDIA GeForce RTX 2080 Ti GPU and 8 CPU cores (Intel Xeon CPU E5-2630 v4 @ 2.20GHz). We train both the reward function and IQL over 8 random seeds.

Implementation details of PT. In all experiments, we use a causal transformer with one layer and four self-attention heads, followed by a bidirectional self-attention layer with a single self-attention head. PT is trained using the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 1 × 10^-4, linear warmup over 5% of total gradient steps, cosine learning rate decay, weight decay of 1 × 10^-4, and batch size of 256. For RL training, we use the publicly released implementation of IQL and follow the original hyper-parameter settings. For all experiments, we use the same hyperparameters as the original IQL.

Hyperparameters. Hyperparameters for PT are shown in Table 2.
We remark that PT with more attention layers and more gradient updates can further boost the model's performance, but we use the current version with fewer layers for faster training and evaluation. Hyperparameters of Preference Transformer.
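For context on what these hyperparameters parameterize: PT predicts a per-timestep (non-Markovian) reward and an importance weight for each step, and models the preference between two segments with a Bradley-Terry model over the weighted sums of rewards. A minimal numpy sketch, with the transformer's outputs replaced by plain arrays (the function name and inputs are illustrative):

```python
import numpy as np

def preference_prob(rewards_0, weights_0, rewards_1, weights_1):
    """Probability that segment 1 is preferred over segment 0 under a
    Bradley-Terry model on weighted sums of per-timestep rewards.

    rewards_* and weights_* stand in for the per-timestep reward and
    importance-weight outputs the transformer would produce.
    """
    score_0 = np.sum(np.asarray(weights_0) * np.asarray(rewards_0))
    score_1 = np.sum(np.asarray(weights_1) * np.asarray(rewards_1))
    # Sigmoid of the score difference gives the preference probability.
    return 1.0 / (1.0 + np.exp(-(score_1 - score_0)))
```

The reward model is then trained by minimizing the cross-entropy between this probability and the human label.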

Averaged normalized scores of IQL on AntMaze and Gym-Mujoco locomotion tasks, and success rate on Robosuite manipulation tasks, with different reward functions. We train Preference Transformer (PT) and standard preference modeling with Markovian reward (MR) modeled by MLP or non-Markovian reward (NMR) modeled by LSTM, using the same dataset of preferences from scripted teachers and real human teachers. The result shows the average and standard deviation over 8 runs. [Per-task result rows (hopper-medium-expert-v2, walker2d-medium-replay-v2, walker2d-medium-expert-v2, and others) are flattened in the source and their column structure is not recoverable.]

Averaged normalized scores of non-MDP IQL with different reward functions on AntMaze and Gym-Mujoco locomotion tasks. Using the same dataset of preferences from real human teachers, we train Preference Transformer (PT) and LSTM-based non-Markovian reward (NMR; Early et al. 2022). The result shows the average and standard deviation over 8 runs. [Only the following rows are recoverable from the source.]

Task                         NMR             PT
hopper-medium-expert-v2      55.80 ± 34.34   79.84 ± 29.32
walker2d-medium-replay-v2    41.91 ± 32.58   67.06 ± 13.39
walker2d-medium-expert-v2    94.20 ± 29.41   104.97 ± 10.63

ACKNOWLEDGMENTS AND DISCLOSURE OF FUNDING

We would like to thank Sihyun Yu, Subin Kim, Younggyo Seo, and anonymous reviewers for providing helpful feedback and suggestions for improving our paper. This research is supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2022-0-00953, Self-directed AI Agents with Problem-solving Capability; No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)). This material is based upon work supported by the Google Cloud Research Credits program with the award (A696-P323-JC3B-RNWG).

E ADDITIONAL EXAMPLES OF QUERIES

In Figure 8, we provide additional examples of queries to highlight the difference between human and scripted teachers. In the example in Figure 8a, the scripted teacher prefers trajectory 1 since the ant is closer to the goal. However, the human teacher prefers trajectory 0 since the ant is flipped over in trajectory 1. In the example in Figure 8b, the scripted teacher makes a very myopic decision because the hand-designed reward does not capture the context of the behavior (e.g., falling down). These examples show that the scripted teacher lacks the ability to consider multiple objectives, while humans can provide more reasonable feedback by balancing them.


F DESCRIPTION OF THE VIDEO EXAMPLES

We provide video examples with visualization of the learned importance weight in the supplementary material. For better visibility, we represent the weight as a color map on the rim of the video, i.e., we highlight frames with higher weights using brighter colors. We observe that the learned importance weights are well-aligned with human intent, as in Figure 3.
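The weight-to-color mapping described above can be sketched as follows: min-max normalize the per-frame importance weights to [0, 1] and use the normalized value as a channel intensity, so frames with higher weight receive brighter rim colors. This is an illustrative sketch (the function name and the single-channel color scheme are assumptions; the actual videos may use a different colormap):

```python
import numpy as np

def weights_to_colors(weights):
    """Map per-frame importance weights to one RGB triple per frame.

    Min-max normalizes the weights to [0, 1] and uses the result as the
    intensity of the red channel: higher weight -> brighter color.
    """
    w = np.asarray(weights, dtype=float)
    span = w.max() - w.min()
    # Guard against constant weights (zero span).
    intensity = (w - w.min()) / span if span > 0 else np.zeros_like(w)
    # Stack into (num_frames, 3) RGB with green/blue fixed at 0.
    return np.stack([intensity, np.zeros_like(w), np.zeros_like(w)], axis=-1)
```

Each triple can then be drawn as the border color of the corresponding video frame.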

