PREFERENCE TRANSFORMER: MODELING HUMAN PREFERENCES USING TRANSFORMERS FOR RL

Abstract

Preference-based reinforcement learning (RL) provides a framework to train agents using human preferences between two behaviors. However, preference-based RL has been challenging to scale since it requires a large amount of human feedback to learn a reward function aligned with human intent. In this paper, we present Preference Transformer, a neural architecture that models human preferences using transformers. Unlike prior approaches that assume human judgment is based on Markovian rewards which contribute to the decision equally, we introduce a new preference model based on the weighted sum of non-Markovian rewards. We then design the proposed preference model using a transformer architecture that stacks causal and bidirectional self-attention layers. We demonstrate that Preference Transformer can solve a variety of control tasks using real human preferences, while prior approaches fail to work. We also show that Preference Transformer can induce a well-specified reward and attend to critical events in the trajectory by automatically capturing the temporal dependencies in human decision-making.

1. INTRODUCTION

Reinforcement learning (RL) has been successful in solving sequential decision-making problems in various domains where a suitable reward function is available (Mnih et al., 2015; Silver et al., 2017; Berner et al., 2019; Vinyals et al., 2019). However, reward engineering poses a number of challenges. It often requires extensive instrumentation (e.g., thermal cameras (Schenck & Fox, 2017), accelerometers (Yahya et al., 2017), or motion trackers (Peng et al., 2020)) to design a dense and precise reward. Also, it is hard to evaluate the quality of outcomes with a single scalar since many problems have multiple objectives. For example, stable locomotion requires balancing objectives such as velocity, energy spent, and torso verticality (Tassa et al., 2012; Faust et al., 2019), and aggregating multiple objectives into a single scalar requires substantial human effort and extensive task knowledge. To avoid reward engineering, there are various ways to learn the reward function from human data, such as real-valued feedback (Knox & Stone, 2009; Daniel et al., 2014), expert demonstrations (Ng et al., 2000; Abbeel & Ng, 2004), preferences (Akrour et al., 2011; Wilson et al., 2012; Sadigh et al., 2017), and language instructions (Fu et al., 2019; Nair et al., 2022). In particular, research interest in preference-based RL (Akrour et al., 2012; Christiano et al., 2017; Lee et al., 2021b) has increased recently since relative judgments (e.g., pairwise comparisons) are easy to provide yet information-rich. By learning the reward function from human preferences between trajectories, recent work has shown that agents can learn novel behaviors (Christiano et al., 2017; Stiennon et al., 2020) or avoid reward exploitation (Lee et al., 2021b). However, existing approaches still require a large amount of human feedback, making it hard to scale preference-based RL to various applications.
We hypothesize that this difficulty originates from common underlying assumptions in the preference models used in most prior work. Specifically, prior work commonly assumes that (a) the reward function is Markovian (i.e., it depends only on the current state and action), and (b) humans evaluate the quality of a trajectory (the agent's behavior) based on the sum of rewards with equal weights. These assumptions can be flawed for the following reasons. First, there are various tasks where rewards depend on the history of visited states (i.e., they are non-Markovian), since it is hard to encode all task-relevant information into the state (Bacchus et al., 1996; 1997). This can be especially true in preference-based learning, since the trajectory segment is presented to the human sequentially (e.g., as a video clip (Christiano et al., 2017; Lee et al., 2021b)), enabling earlier events to influence the ratings of later ones. In addition, since humans are highly sensitive to remarkable moments (Kahneman, 2000), credit assignment within the trajectory is required. For example, in a study of human attention on video games using eye trackers (Zhang et al., 2020), the human player requires a longer reaction time and multiple eye movements on important states that can lead to a large reward or penalty in order to make a decision.

Figure 1: Illustration of our framework. Given a preference between two trajectory segments (σ^0, σ^1), Preference Transformer generates non-Markovian rewards r_t and their importance weights w_t over each segment. We then model the preference predictor based on the weighted sum of non-Markovian rewards (i.e., Σ_t w_t r_t), and align it with human preference.

In this paper, we aim to propose an alternative preference model that overcomes the limitations of these common assumptions. To this end, we introduce a new preference model based on the weighted sum of non-Markovian rewards, which can capture the temporal dependencies in human decisions and infer critical events in the trajectory. Inspired by the recent success of transformers (Vaswani et al., 2017) in modeling sequential data (Brown et al., 2020; Chen et al., 2021), we present Preference Transformer, a transformer-based architecture for designing the proposed preference model (see Figure 1).
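The weighted-sum preference predictor can be sketched numerically. The snippet below is a minimal sketch, assuming per-step non-Markovian rewards r_t and importance weights w_t have already been produced for each segment; `preference_prob` is a hypothetical helper name, and the predictor is a Bradley-Terry-style softmax over the two weighted returns Σ_t w_t r_t.

```python
import math

def preference_prob(rewards_0, weights_0, rewards_1, weights_1):
    """Probability that segment 1 is preferred over segment 0.

    Each segment's score is the weighted sum of its (non-Markovian)
    rewards, sum_t w_t * r_t; the preference probability is a softmax
    over the two segment scores (Bradley-Terry style).
    """
    score_0 = sum(w * r for w, r in zip(weights_0, rewards_0))
    score_1 = sum(w * r for w, r in zip(weights_1, rewards_1))
    # Numerically stable two-way softmax.
    m = max(score_0, score_1)
    e0, e1 = math.exp(score_0 - m), math.exp(score_1 - m)
    return e1 / (e0 + e1)

# Equal rewards on most steps, but a large weight on one critical step
# of segment 1 shifts the prediction toward preferring segment 1.
p = preference_prob([1.0, 1.0], [0.5, 0.5], [1.0, 2.0], [0.5, 1.0])
```

Under the equal-weight Markovian assumption of prior work, all w_t would be fixed to a constant; letting the model output the weights is what allows it to emphasize critical events.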
Preference Transformer takes a trajectory segment as input, which allows extracting task-relevant historical information. By stacking bidirectional and causal self-attention layers, Preference Transformer generates non-Markovian rewards and importance weights as outputs. We then utilize them to define the preference model. We highlight the main contributions of this paper below: • We propose a more generalized framework for modeling human preferences based on a weighted sum of non-Markovian rewards. • We show that Preference Transformer can induce a well-specified reward and capture critical events within a trajectory.
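The forward pass described above can be illustrated with a toy sketch. This is not the actual implementation: the attention layers below use identity projections, the output heads are random matrices (the real model learns these projections), and the function names are hypothetical. It only shows the shape of the computation, in which a causal pass yields history-dependent (non-Markovian) per-step rewards and a bidirectional pass yields importance weights over the whole segment.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, causal=False):
    """Single-head self-attention with identity Q/K/V projections (toy)."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    if causal:
        # Mask out future timesteps so step t only attends to steps <= t.
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    return softmax(scores, axis=-1) @ x

def preference_transformer_sketch(segment, rng):
    """Toy forward pass over a (T, d) segment of state-action features.

    Returns per-step rewards r_t, importance weights w_t (summing to 1),
    and the segment score sum_t w_t * r_t used by the preference model.
    """
    T, d = segment.shape
    # Hypothetical output heads; the real model learns these.
    w_r = rng.standard_normal((d, 1))
    w_a = rng.standard_normal((d, 1))
    h = self_attention(segment, causal=True)   # history-aware features
    rewards = (h @ w_r).squeeze(-1)            # non-Markovian rewards r_t
    z = self_attention(h, causal=False)        # bidirectional pass
    weights = softmax((z @ w_a).squeeze(-1))   # importance weights w_t
    score = float(np.sum(weights * rewards))   # weighted return
    return rewards, weights, score
```

Normalizing the weights with a softmax makes them a distribution over timesteps, so inspecting w_t directly indicates which moments of the segment the model treats as critical.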

2. RELATED WORK

Preference-based reinforcement learning. Recently, various methods have utilized human preferences to train RL agents without reward engineering (Akrour et al., 2012; Christiano et al., 2017; Ibarz et al., 2018; Stiennon et al., 2020; Lee et al., 2021b; c; Nakano et al., 2021; Wu et al., 2021; III & Sadigh, 2022; Knox et al., 2022; Ouyang et al., 2022; Park et al., 2022; Verma & Metcalf, 2022). Christiano et al. (2017) showed that preference-based RL can effectively solve complex control tasks using deep neural networks. To improve feedback efficiency, several methods have been proposed, such as pre-training (Ibarz et al., 2018; Lee et al., 2021b), data augmentation (Park et al., 2022), exploration (Liang et al., 2022), and meta-learning (III & Sadigh, 2022). Preference-based RL has also been successful in fine-tuning large-scale language models (such as GPT-3; Brown et al. 2020) for hard tasks (Stiennon et al., 2020; Wu et al., 2021; Nakano et al., 2021; Ouyang et al.,

