PREFERENCE TRANSFORMER: MODELING HUMAN PREFERENCES USING TRANSFORMERS FOR RL

Abstract

Preference-based reinforcement learning (RL) provides a framework to train agents using human preferences between two behaviors. However, preference-based RL has been challenging to scale since it requires a large amount of human feedback to learn a reward function aligned with human intent. In this paper, we present Preference Transformer, a neural architecture that models human preferences using transformers. Unlike prior approaches that assume human judgments are based on Markovian rewards contributing equally to the decision, we introduce a new preference model based on a weighted sum of non-Markovian rewards. We then design the proposed preference model using a transformer architecture that stacks causal and bidirectional self-attention layers. We demonstrate that Preference Transformer can solve a variety of control tasks using real human preferences, while prior approaches fail to work. We also show that Preference Transformer can induce a well-specified reward and attend to critical events in the trajectory by automatically capturing the temporal dependencies in human decision-making.

1. INTRODUCTION

Reinforcement learning (RL) has been successful in solving sequential decision-making problems in various domains where a suitable reward function is available (Mnih et al., 2015; Silver et al., 2017; Berner et al., 2019; Vinyals et al., 2019). However, reward engineering poses a number of challenges. It often requires extensive instrumentation (e.g., thermal cameras (Schenck & Fox, 2017), accelerometers (Yahya et al., 2017), or motion trackers (Peng et al., 2020)) to design a dense and precise reward. Also, it is hard to capture the quality of outcomes in a single scalar since many problems have multiple objectives. For example, stable locomotion requires balancing objectives like velocity, energy spent, and torso verticality (Tassa et al., 2012; Faust et al., 2019), and aggregating multiple objectives into a single scalar demands substantial human effort and extensive task knowledge. To avoid reward engineering, there are various ways to learn the reward function from human data, such as real-valued feedback (Knox & Stone, 2009; Daniel et al., 2014), expert demonstrations (Ng et al., 2000; Abbeel & Ng, 2004), preferences (Akrour et al., 2011; Wilson et al., 2012; Sadigh et al., 2017) and language instructions (Fu et al., 2019; Nair et al., 2022). In particular, research interest in preference-based RL (Akrour et al., 2012; Christiano et al., 2017; Lee et al., 2021b) has increased recently since relative judgments (e.g., pairwise comparisons) are easy to provide yet information-rich. By learning the reward function from human preferences between trajectories, recent work has shown that the agent can learn novel behaviors (Christiano et al., 2017; Stiennon et al., 2020) or avoid reward exploitation (Lee et al., 2021b). However, existing approaches still require a large amount of human feedback, making it hard to scale up preference-based RL to various applications.
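Concretely, most of this prior work scores a trajectory segment by the unweighted sum of Markovian per-step rewards and converts the two scores into a preference probability with the Bradley-Terry model (as in Christiano et al., 2017). A minimal sketch of that baseline model is below; the function name and toy reward values are illustrative, not taken from any particular implementation:

```python
import math

def bradley_terry_preference(rewards_0, rewards_1):
    """P[segment 1 preferred over segment 0] under the standard preference
    model assumed in prior preference-based RL work: each segment's score
    is the equal-weight sum of Markovian per-step rewards r(s_t, a_t),
    and preferences follow the Bradley-Terry model."""
    ret_0 = sum(rewards_0)  # unweighted return of segment 0
    ret_1 = sum(rewards_1)  # unweighted return of segment 1
    # Bradley-Terry: exp(ret_1) / (exp(ret_0) + exp(ret_1)),
    # computed as a numerically stable sigmoid of the return difference.
    return 1.0 / (1.0 + math.exp(ret_0 - ret_1))

# Equal total returns give an indifferent preference of 0.5,
# regardless of how reward is distributed over time steps.
p_equal = bradley_terry_preference([1.0, 1.0], [2.0, 0.0])  # 0.5
```

Note that this model is blind to *when* reward occurs within a segment, which is exactly the equal-weight assumption the paper relaxes.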
We hypothesize that this difficulty originates from common underlying assumptions in the preference modeling used in most prior work. Specifically, prior work commonly assumes that (a) the reward function is Markovian (i.e., it depends only on the current state and action), and (b) humans evaluate the quality of a trajectory (the agent's behavior) based on the sum of rewards with equal weights. These assumptions can be flawed due to the

