SCALING PARETO-EFFICIENT DECISION MAKING VIA OFFLINE MULTI-OBJECTIVE RL

Abstract

The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives. In practice, an agent's preferences over the objectives may not be known a priori, and hence, we require policies that can generalize to arbitrary preferences at test time. In this work, we propose a new data-driven setup for offline MORL, where we wish to learn a preference-agnostic policy using only a finite dataset of offline demonstrations of other agents and their preferences. The key contributions of this work are two-fold. First, we introduce D4MORL, (D)atasets for MORL that are specifically designed for offline settings. It contains 1.8 million annotated demonstrations obtained by rolling out reference policies that optimize for randomly sampled preferences on 6 MuJoCo environments with 2-3 objectives each. Second, we propose Pareto-Efficient Decision Agents (PEDA), a family of offline MORL algorithms that builds on and extends return-conditioned offline methods, including Decision Transformers (Chen et al., 2021) and RvS (Emmons et al., 2021), via a novel preference-and-return conditioned policy. Empirically, we show that PEDA closely approximates the behavioral policy on the D4MORL benchmark and provides an excellent approximation of the Pareto-front with appropriate conditioning, as measured by the hypervolume and sparsity metrics.

1. INTRODUCTION

We are interested in learning agents for multi-objective reinforcement learning (MORL) that optimize for one or more competing objectives. This setting is commonly observed in many real-world scenarios. For instance, an autonomous car might trade off high speed and energy savings depending on the user's preferences. If the user has a relatively high preference for speed, the agent will move fast regardless of power usage; if the user instead prioritizes energy savings, the agent will maintain a steadier, more economical speed. One key challenge in MORL is that different users may have different preferences over the objectives, and systematically exploring policies for each preference can be expensive, or even impossible. In the online setting, prior work considers several approximations based on scalarizing the vector-valued rewards of the different objectives under a single preference (Lin, 2005), learning an ensemble of policies based on enumerating preferences (Mossalam et al., 2016; Xu et al., 2020), or extending single-objective algorithms such as Q-learning to vectorized value functions (Yang et al., 2019).

We introduce the setting of offline multi-objective reinforcement learning for high-dimensional state and action spaces, where our goal is to train an MORL agent using an offline dataset of demonstrations from multiple agents with known preferences. As in the single-task setting, offline MORL can utilize auxiliary logged datasets to minimize environment interactions, improving data efficiency and reducing risk when deploying agents in high-risk settings. In addition to its practical utility, offline RL (Levine et al., 2020) has enjoyed major successes in the last few years (Kumar et al., 2020; Kostrikov et al., 2021; Chen et al., 2021) on challenging high-dimensional environments for continuous control and game-playing. Our contributions in this work are two-fold, introducing benchmarking datasets and a new family of MORL algorithms, as described below.

We introduce Datasets for Multi-Objective Reinforcement Learning (D4MORL), a collection of 1.8 million trajectories on 6 multi-objective MuJoCo environments (Xu et al., 2020); 5 environments consist of 2 objectives and 1 environment consists of 3 objectives. For each environment in D4MORL, we collect demonstrations from 2 pretrained behavioral agents, expert and amateur, where the relative expertise is defined in terms of the Pareto-efficiency of the agents and measured empirically via their hypervolumes. Furthermore, we include 3 kinds of preference distributions with varying entropies to expose additional data-centric aspects for downstream benchmarking. The lack of MORL datasets and large-scale benchmarking has been a major challenge for basic research (Hayes et al., 2022), and we hope that D4MORL can aid future research in the field.

Next, we propose Pareto-Efficient Decision Agents (PEDA), a family of offline MORL algorithms that extends return-conditioned methods, including Decision Transformer (DT) (Chen et al., 2021) and RvS (Emmons et al., 2021), to the multi-objective setting. These methods learn a return-conditioned policy via a supervised loss on the predicted actions. In recent work, such methods have successfully scaled to agents that demonstrate broad capabilities in multi-task settings (Lee et al., 2022; Reed et al., 2022). For MORL, we introduce a novel preference-and-return conditioned policy network and train it via a supervised learning loss.
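To make the conditioning concrete, below is a minimal RvS-style sketch of a preference-and-return conditioned policy. The module name, layer sizes, and the exact way the preference and the vector-valued return-to-go are concatenated are illustrative assumptions, not the paper's exact architecture:

```python
# A minimal sketch of a preference-and-return conditioned policy,
# assuming an MLP in the style of RvS (Emmons et al., 2021).
import torch
import torch.nn as nn

class PreferenceReturnPolicy(nn.Module):
    def __init__(self, state_dim, act_dim, num_objectives, hidden=256):
        super().__init__()
        # Condition on state, preference vector, and a vector-valued
        # return-to-go (one entry per objective).
        in_dim = state_dim + 2 * num_objectives
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # bounded actions
        )

    def forward(self, state, preference, return_to_go):
        x = torch.cat([state, preference, return_to_go], dim=-1)
        return self.net(x)

# Training reduces to supervised regression on actions over the
# offline dataset, e.g. loss = F.mse_loss(policy(s, w, g), a).
```

A Decision Transformer variant would instead feed the same conditioning signals to a sequence model; the supervised objective on predicted actions is unchanged.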
At test time, naively conditioning on the default preferences and the maximum possible returns leads to out-of-distribution behavior for the model: it has neither seen maximum returns for all objectives in the training data, nor is it possible to simultaneously maximize all objectives under competition. We address this issue by learning to map preferences to appropriate returns, thereby enabling predictable generalization at test time. Empirically, we find that PEDA performs exceedingly well on D4MORL and closely approximates the reference Pareto-frontier of the behavioral policy used for data generation. In the multi-objective HalfCheetah environment, compared with an average upper bound on the hypervolume of 5.79 × 10^6 achieved by the behavioral policy, PEDA achieves an average hypervolume of 5.77 × 10^6 on the Expert and 5.76 × 10^6 on the Amateur datasets.
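For reference, the two evaluation metrics can be computed as below for the two-objective case. This is a minimal sketch under the standard definitions (hypervolume with respect to a dominated reference point, and the sparsity measure of Xu et al., 2020), not code from the benchmark itself:

```python
import numpy as np

def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by reference point `ref`
    (maximization, 2 objectives). Larger is better."""
    pts = np.array([p for p in points if p[0] > ref[0] and p[1] > ref[1]])
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(-pts[:, 0])]  # sweep in decreasing x
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:  # point is non-dominated in the sweep
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def sparsity(front):
    """Average squared gap between neighboring Pareto solutions,
    summed over objectives (lower is better), per Xu et al. (2020)."""
    front = np.asarray(front)
    if len(front) < 2:
        return 0.0
    total = 0.0
    for j in range(front.shape[1]):
        v = np.sort(front[:, j])
        total += np.sum((v[1:] - v[:-1]) ** 2)
    return total / (len(front) - 1)

# Example: three Pareto-optimal returns with reference point (0, 0).
front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
print(hypervolume_2d(front, ref=(0.0, 0.0)))  # 6.0
print(sparsity(front))                        # 4.0
```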

2. RELATED WORK

Multi-Objective Reinforcement Learning. Predominant works in MORL focus on the online setting, where the goal is to train agents that can generalize to arbitrary preferences. This can be achieved by training a single preference-conditioned policy (Yang et al., 2019; Parisi et al., 2016), or an ensemble of single-objective policies for a finite set of preferences (Mossalam et al., 2016; Xu et al., 2020; Zhang & Li, 2007). Many of these algorithms consider vectorized variants of standard algorithms such as Q-learning (Mossalam et al., 2016; Yang et al., 2019), often augmented with strategies to guide the policy ensemble towards the Pareto front using evolutionary or incrementally updated algorithms (Xu et al., 2020; Zhang & Li, 2007; Mossalam et al., 2016; Roijers et al., 2014; Huang et al., 2022). Other approaches have also been studied, such as framing MORL as a meta-learning problem (Chen et al., 2019), learning the action distribution for each objective (Abdolmaleki et al., 2020), and learning the relationship between objectives (Zhan & Cao, 2019), among others. In contrast to these online MORL works, our focus is on learning a single policy that works for all preferences using only offline datasets; a minimal sketch of the preference scalarization underlying many of these methods appears at the end of this section.

A few works study decision-making with multiple objectives in the offline setting and sidestep any interaction with the environment. Wu et al. (2021) propose a provably efficient offline MORL algorithm for tabular MDPs based on dual gradient ascent. Thomas et al. (2021) study the learning of safe policies by extending the approach of Laroche et al. (2019) to the offline MORL setting; their algorithm assumes knowledge of the behavioral policy used to collect the offline data and is demonstrated primarily on tabular MDPs with finite state and action spaces. In contrast, we are interested in developing dataset benchmarks and algorithms for scalable offline policy optimization in high-dimensional MDPs with continuous states and actions.

Multi-Task Reinforcement Learning. MORL is also closely related to multi-task reinforcement learning, where every task can be interpreted as a distinct objective. There is an extensive body of work on learning multi-task policies in both the online and offline setups (Wilson et al., 2007; Lazaric & Ghavamzadeh, 2010; Teh et al., 2017), inter alia. However, the key difference is that typical MTRL benchmarks and algorithms do not consider solving multiple tasks that involve inherent trade-offs. Consequently, there is no notion of Pareto efficiency, and an agent can simultaneously excel in all the tasks without accounting for user preferences.
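As referenced above, below is a minimal sketch of linear preference scalarization, the textbook recipe (e.g., Lin, 2005) that collapses a vector-valued reward into a scalar for a fixed preference, reducing MORL to standard RL for that preference. The function name and example numbers are purely illustrative:

```python
import numpy as np

def scalarize(reward_vec, preference):
    """Linear scalarization: weight the per-objective rewards by a
    preference vector (non-negative, summing to 1)."""
    reward_vec = np.asarray(reward_vec, dtype=float)
    preference = np.asarray(preference, dtype=float)
    assert np.all(preference >= 0) and np.isclose(preference.sum(), 1.0)
    return float(preference @ reward_vec)

# Example: a user who values speed twice as much as energy savings.
r = [10.0, 4.0]         # (forward-speed reward, energy-saving reward)
w = [2 / 3, 1 / 3]
print(scalarize(r, w))  # 8.0
```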

