BEYOND REWARD: OFFLINE PREFERENCE-GUIDED POLICY OPTIMIZATION

Abstract

In this work, we study offline preference-based reinforcement learning (PbRL), which relaxes two fundamental supervisory signals of standard reinforcement learning: online access to transition dynamics and per-step rewards. In other words, the agent is provided with fixed offline trajectories and human preferences between pairs of trajectories. Since rewards and dynamics provide orthogonal supervision, one common practice is to combine prior PbRL reward-learning objectives with off-the-shelf offline RL algorithms, using the learned reward to bridge preference modeling and offline learning. However, these two isolated optimizations require learning a separate reward function and thus place an information bottleneck on reward learning (the bridge). As an alternative, we propose offline preference-guided policy optimization (OPPO), an end-to-end offline PbRL formulation that jointly learns to model the preference (for finding the optimal task policy) and the offline data (for avoiding out-of-distribution behaviors). In particular, OPPO introduces an offline hindsight information matching objective and a preference modeling objective. By iterating over the two objectives, we can directly extract a well-performing decision policy, avoiding separate reward learning. We empirically show that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms trained with the true reward function.

1. INTRODUCTION

Deep reinforcement learning (RL) presents a flexible framework for learning task-oriented behaviors (Kohl & Stone, 2004; Kober & Peters, 2008; Kober et al., 2013; Silver et al., 2017; Kalashnikov et al., 2018; Vinyals et al., 2019), where the "task" is often expressed as maximizing the cumulative reward of trajectories generated by rolling out the learning policy in the corresponding environment. However, this RL formulation implies two indispensable prerequisites for training a decision policy: 1) an interactive environment and 2) a pre-specified reward function. Unfortunately, 1) online interactions with the environment can be costly and unsafe (Mihatsch & Neuneier, 2002; Hans et al., 2008; Garcıa & Fernández, 2015), and 2) designing a suitable reward function often requires expensive human effort, while the heuristic rewards often used in practice can fail to convey the true intention (Hadfield-Menell et al., 2017). To relax these requirements, previous works have either 1) focused on the offline RL formulation (where online rollouts are inaccessible) (Fujimoto et al., 2019), in which the learner has access to fixed offline trajectories along with a reward signal for each transition (or limited expert demonstrations), or 2) considered the (online) preference-based RL (PbRL) formulation (Christiano et al., 2017; Bıyık & Sadigh, 2018; Sadigh et al., 2017; Biyik et al., 2020; Lee et al., 2021a), where messages about the task objective are passed to the learner through an annotator's preferences between two trajectories rather than per-transition rewards. To make further progress, we propose relaxing both requirements and advocate for offline PbRL. In the offline PbRL setting, with access to an offline dataset and labeled preferences between the offline trajectories, it is straightforward to combine previous (online) PbRL methods with off-the-shelf offline RL algorithms (Shin & Brown, 2021). As shown in Fig.
1 (left), we can first use the Bradley-Terry model (Bradley & Terry, 1952) to model the preference labels and learn a reward function via supervised learning (as normally done in prior PbRL methods), and then train the policy with any offline RL algorithm on the transitions relabeled by the learned reward function. Intuitively, such a two-step procedure allows preference modeling via a separate reward function. Fundamentally, however, learning a separate reward function that explains expert preferences does not directly instruct the policy on how to act optimally. Since PbRL tasks are defined by preference labels, the goal is to learn the trajectory most preferred by the annotator, rather than to maximize the cumulative discounted proxy rewards of the policy rollouts. More specifically, for complex tasks such as non-Markovian ones, conveying information from preferences to the policy via scalar rewards creates a bottleneck in policy improvement. On the other hand, if the learned reward function is miscalibrated, an isolated policy optimization may learn to exploit loopholes in the relabeled rewards, resulting in unwanted behaviors. Why, then, must we learn a reward function at all, given that it may not directly yield optimal actions? To this end, we propose offline preference-guided policy optimization (OPPO), an end-to-end formulation that jointly models offline preferences and learns the optimal decision policy without learning a separate reward function (as shown in Fig. 1, right). Specifically, we explicitly introduce a hindsight information encoder, with which we design an offline hindsight information matching objective and a preference modeling objective. By optimizing the two objectives iteratively, we derive a contextual policy π(a|s, z) to model the offline data and concurrently optimize an optimal contextual variable z* to model the preference.
In this way, OPPO focuses on learning a high-dimensional z-space that captures task-related information and on evaluating policies in it. We then arrive at the optimal policy by conditioning the contextual policy π(a|s, z*) on the learned optimal z*. In summary, our contributions include 1) OPPO: a simple, stable, and end-to-end offline PbRL method that avoids learning a separate reward function, 2) an instance of a preference-based hindsight information matching objective and a novel preference modeling objective over the contextual variable, and 3) extensive experiments that show and analyze the strong performance of OPPO against previous competitive baselines.
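To make the two objectives concrete, below is a deliberately simplified numpy sketch of the iteration, not the paper's implementation: the mean-state encoder, the linear contextual policy, and the distance-based preference score over z are illustrative stand-ins for the learned networks, and all names and toy data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_traj(offset):
    # toy trajectory: 2-d states, 1-d actions from a fixed behavior rule
    states = rng.normal(size=(8, 2)) + offset
    actions = states @ np.array([[1.0], [-1.0]])
    return {"states": states, "actions": actions}

traj_good, traj_bad = make_traj(+1.0), make_traj(-1.0)  # annotator prefers traj_good

def encode(traj):
    # stand-in hindsight encoder: the mean state feature (OPPO learns this jointly)
    return traj["states"].mean(axis=0)

def fit_contextual_policy(trajs):
    # hindsight information matching (sketch): least-squares fit of
    # a = [s; z] @ W, pairing each trajectory with its OWN embedding z,
    # so the contextual policy models all offline data, good and bad
    X = [np.hstack([t["states"], np.tile(encode(t), (len(t["states"]), 1))])
         for t in trajs]
    Y = [t["actions"] for t in trajs]
    W, *_ = np.linalg.lstsq(np.vstack(X), np.vstack(Y), rcond=None)
    return W

def preference_step(z_star, z_pref, z_other, lr=0.1):
    # preference modeling over z (sketch): a Bradley-Terry style loss in
    # embedding space, scoring each trajectory by the closeness of its
    # embedding to z_star; the gradient pulls z_star toward the winner
    d_pref = np.sum((z_star - z_pref) ** 2)
    d_other = np.sum((z_star - z_other) ** 2)
    p = 1.0 / (1.0 + np.exp(d_pref - d_other))  # P(pref beats other)
    return z_star + 2 * lr * (1 - p) * (z_pref - z_other)

# iterate the two objectives, then act with pi(a | s, z_star)
W = fit_contextual_policy([traj_good, traj_bad])
z_star = np.zeros(2)
for _ in range(100):
    z_star = preference_step(z_star, encode(traj_good), encode(traj_bad))

def policy(s):
    return np.hstack([s, z_star]) @ W
```

After iterating, z_star sits closer to the embedding of the preferred trajectory than to the dispreferred one, so conditioning the contextual policy on z_star selects the preferred behavior without any intermediate scalar reward.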

2. RELATED WORK

Preference-based RL. Preference-based RL is also known as learning from human feedback. Several works have successfully utilized feedback from real humans to train RL agents (Arumugam et al., 2019; Christiano et al., 2017; Ibarz et al., 2018; Knox & Stone, 2009; Lee et al., 2021b; Warnell et al., 2017). Christiano et al. (2017) scaled preference-based reinforcement learning to utilize modern deep learning techniques, and Ibarz et al. (2018) improved the efficiency of this method by introducing additional forms of feedback such as demonstrations. Recently, Lee et al. (2021b) proposed a feedback-efficient RL algorithm by utilizing off-policy learning and pre-training. Park et al. (2022) used pseudo-labeling to utilize unlabeled segments and proposed a novel data augmentation method called temporal cropping.

Figure 1: A flow diagram of previous offline PbRL algorithms (left) and our OPPO algorithm (right). Previous works require learning a separate reward function for modeling human preferences using the Bradley-Terry model. In contrast, our OPPO directly learns the policy network.

Offline RL. To mitigate the impact of distribution shift in offline RL, prior algorithms (a) constrain the action space (Fujimoto et al., 2019; Kumar et al., 2019a; Siegel et al., 2020), (b) incorporate value pessimism (Fujimoto et al., 2019; Kumar et al., 2020), and (c) introduce pessimism into learned dynamics models (Kidambi et al., 2020; Yu et al., 2020). Another line of work explored learning a wide behavior distribution from the offline dataset by learning a task-agnostic set
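The preference-based RL methods above share a common reward-learning step: fitting the Bradley-Terry model (Bradley & Terry, 1952) to pairwise preference labels. A minimal sketch of that step, assuming a linear reward over toy state features (the feature construction, learning rate, and data are illustrative, not taken from any of the cited methods):

```python
import numpy as np

def bradley_terry_prob(w, seg0, seg1):
    # P(seg0 preferred over seg1) under a linear reward r(s) = w @ s;
    # each segment is a (T, d) array of state features, scored by its
    # cumulative reward, as in the Bradley-Terry preference model
    r0, r1 = np.sum(seg0 @ w), np.sum(seg1 @ w)
    m = max(r0, r1)  # numerical stabilization before exponentiating
    e0, e1 = np.exp(r0 - m), np.exp(r1 - m)
    return e0 / (e0 + e1)

def reward_learning_step(w, seg0, seg1, pref, lr=0.1):
    # one gradient step on the preference cross-entropy loss;
    # pref = 1.0 if the annotator prefers seg0, else 0.0; for this
    # linear model, d(loss)/dw = (p - pref) * (sum(seg0) - sum(seg1))
    p = bradley_terry_prob(w, seg0, seg1)
    grad = (p - pref) * (seg0.sum(axis=0) - seg1.sum(axis=0))
    return w - lr * grad

# toy data: feature 0 encodes task progress, which the annotator prefers
rng = np.random.default_rng(0)
seg_a = rng.normal(size=(5, 3))
seg_a[:, 0] += 1.0  # the preferred segment makes more progress
seg_b = rng.normal(size=(5, 3))
w = np.zeros(3)
for _ in range(50):
    w = reward_learning_step(w, seg_a, seg_b, pref=1.0)
print(bradley_terry_prob(w, seg_a, seg_b))  # approaches 1.0 with training
```

Under this model, a segment's score is the sum of its per-step rewards and the preference probability is a softmax over the two scores; minimizing the cross-entropy against the annotator's labels recovers a reward consistent with the preferences, which the two-step pipelines then hand to an offline RL algorithm.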

