APPROXIMATING PARETO FRONTIER THROUGH BAYESIAN-OPTIMIZATION-DIRECTED ROBUST MULTI-OBJECTIVE REINFORCEMENT LEARNING

Abstract

Many real-world decision or control problems involve multiple conflicting objectives and uncertainties, which requires that learned policies be not only Pareto optimal but also robust. In this paper, we propose a novel algorithm to approximate a representation of the robust Pareto frontier through Bayesian-optimization-directed robust multi-objective reinforcement learning (BRMORL). Firstly, environmental uncertainty is modeled as an adversarial agent over the entire space of preferences by incorporating a zero-sum game into multi-objective reinforcement learning (MORL). Secondly, a comprehensive metric based on hypervolume and information entropy is presented to evaluate the convergence, diversity and evenness of the distribution of Pareto solutions. Thirdly, the agent's learning process is regarded as a black box: the proposed comprehensive metric is computed after each episode of training, and a Bayesian optimization (BO) algorithm is then adopted to guide the agent to evolve towards improving the quality of the approximated Pareto frontier. Finally, we demonstrate the effectiveness of the proposed approach on challenging multi-objective tasks across four environments, and show that our scheme can produce robust policies under environmental uncertainty.

1. INTRODUCTION

Reinforcement learning (RL) algorithms have demonstrated their worth in a series of challenging sequential decision-making and control tasks, training policies to optimize a single scalar reward function (Mnih et al., 2015; Silver et al., 2016; Haarnoja et al., 2018; Hwangbo et al., 2019). However, many real-world tasks are characterized by multiple competing objectives whose relative importance (preferences) is in most cases ambiguous. Moreover, uncertainty or perturbation caused by dynamic changes in the environment is inevitable in real-world scenarios and may degrade agent performance (Pinto et al., 2017; Ji et al., 2018). For instance, an autonomous electric vehicle must trade off transport efficiency against electricity consumption while accounting for environmental uncertainty (e.g., vehicle mass, tire pressure and road conditions might vary over time). Consider the decision-making problem for traffic mode shown in Figure 1. A practitioner or a rule is responsible for picking the appropriate preference between time and cost, and the agent needs to determine different policies depending on the chosen trade-off between these two metrics. However, the environment contains uncertainty factors related to the actions of other agents or to dynamic changes in nature, which introduce additional randomness into these two metrics and make multi-objective decision-making or control more challenging. If weather is taken into account, for example, heavy rain may cause traffic congestion, which increases the time and cost of plan-A but has no significant impact on the two metrics of plan-B. From this perspective, selecting plan-B is more robust; that is, a policy is said to be robust if its ability to obtain utility remains relatively stable under environmental changes. Therefore, preference and uncertainty jointly affect the decision-making behavior of the agent.
Figure 1: Diagram of the decision-making problem for traffic mode. If time is crucial, the agent tends to choose plan-A, which takes less time but costs more. On the other hand, if cost matters more, the agent will be inclined to select plan-B, which requires less cost but takes more time.

In traditional multi-objective reinforcement learning (MORL), one popular approach is scalarization: the multi-objective reward vector is converted into a single scalar reward through various techniques (e.g., a convex combination), and standard RL algorithms are then adopted to optimize this scalar reward (Vamplew et al., 2011). Unfortunately, it is very tricky to determine an appropriate scalarization, because the common approach often learns only an 'average' policy over the space of preferences (Yang et al., 2019), or the obtained policies can be relatively quickly adapted to different preferences between performance objectives but are not necessarily optimal. Furthermore, these methods rarely take into account the robustness of the policies under different preferences, which means the agent cannot learn robust Pareto optimal policies. In this work, we propose a novel approach to approximate a well-distributed robust Pareto frontier through BRMORL. This allows a single trained network model to produce the robust Pareto optimal policy for any specified preference, i.e., the learned policy is not only robust to uncertainty (e.g., random disturbances and environmental changes) but also Pareto optimal under different preference conditions. Our algorithm is based on three key ideas, which are also the main contributions of this paper: (1) we present a generalized robust MORL framework by modelling uncertainty as an adversarial agent; (2) inspired by the Shannon-Wiener diversity index, we present a novel metric to evaluate the diversity and evenness of the distribution of Pareto solutions.
In addition, combined with the hypervolume indicator, a comprehensive metric is designed that can evaluate the convergence, diversity and evenness of the solutions on the approximated Pareto frontier; (3) we regard the agent's learning process in each episode as a black box and use a BO algorithm to guide the agent to evolve towards improving the quality of the Pareto set. Finally, we demonstrate that our proposed algorithm outperforms competitive baselines on multi-objective tasks across several MuJoCo (Todorov et al., 2012) environments and SUMO (Simulation of Urban Mobility) (Lopez et al., 2018), and show that our approach can produce robust policies under environmental uncertainty.
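To make the metric idea concrete, the following sketch shows how a 2-D hypervolume indicator and a Shannon-Wiener-style evenness term can be combined into a single quality score for a two-objective maximization front. The function names and the particular combination are illustrative assumptions on our part, not the paper's exact formulation; a BO loop would then treat such a score as its black-box objective.

```python
import math

def hypervolume_2d(front, ref=(0.0, 0.0)):
    """Area dominated by a 2-D maximization Pareto front w.r.t. a reference point."""
    pts = sorted(front)  # ascending in objective 1; objective 2 then descends on a front
    hv, prev_x = 0.0, ref[0]
    for x, y in pts:
        hv += (x - prev_x) * (y - ref[1])  # vertical strip between consecutive points
        prev_x = x
    return hv

def evenness(front):
    """Shannon-Wiener evenness of the spacing between adjacent front solutions:
    1.0 when solutions are evenly spread along the front, lower when they cluster."""
    pts = sorted(front)
    gaps = [math.dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1)]
    total = sum(gaps)
    probs = [g / total for g in gaps if g > 0]
    if len(probs) < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))  # normalize entropy to [0, 1]

def quality_score(front, ref=(0.0, 0.0)):
    # Illustrative combination only: reward both convergence (hypervolume)
    # and an even spread; the paper's exact weighting is not reproduced here.
    return hypervolume_2d(front, ref) * (1.0 + evenness(front))
```

For example, the front `[(1, 5), (2, 3), (4, 2), (6, 1)]` has hypervolume 14.0 against the origin, and its four points happen to be equally spaced, so its evenness is exactly 1.0.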

2.1. MULTI-OBJECTIVE REINFORCEMENT LEARNING

MORL algorithms can be roughly classified into two main categories: single-policy approaches and multiple-policy approaches (Roijers et al., 2013; Liu et al., 2014). Single-policy methods seek the optimal policy for a given preference among multiple competing objectives. These approaches convert the multi-objective problem into a single-objective problem through different forms of scalarization, both linear and non-linear (Mannor & Shimkin, 2002; Tesauro et al., 2008). The main advantage of scalarization is its simplicity: it can be integrated into a single-policy scheme with very little modification. However, the main drawback of these approaches is that the preference among the objectives must be set in advance. Multi-policy methods aim to learn a set of policies that approximate the Pareto frontier under different preference conditions. The most common approaches repeatedly call a single-policy scheme with different preferences (Natarajan & Tadepalli, 2005; Van Moffaert et al., 2013; Zuluaga et al., 2016). Other methods learn a set of policies simultaneously, either by using a multi-objective extension of value-based RL (Barrett & Narayanan, 2008; Castelletti et al., 2012; Van Moffaert & Nowé, 2014; Mossalam et al., 2016; Nottingham et al., 2019) or by modifying policy-based RL into a MORL variant (Pirotta et al., 2015; Parisi et al., 2017; Abdolmaleki et al., 2020; Xu et al., 2020). Nevertheless, most of these methods are often constrained to convex regions of the Pareto front and explicitly maintain sets of policies, which may prevent them from finding well-distributed sets of Pareto solutions that represent different preferences. There are also meta-policy methods, which can be adapted relatively quickly to different preferences (Chen et al., 2018; Abels et al., 2019; Yang et al., 2019).
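As a minimal sketch of the linear scalarization that single-policy methods rely on (the function name is ours, for illustration): a preference vector on the simplex turns the multi-objective reward vector into one scalar that a standard RL algorithm can then optimize.

```python
def scalarize(reward_vec, preference):
    """Convex combination of objective rewards under a fixed preference vector."""
    total = sum(preference)
    weights = [p / total for p in preference]  # normalize onto the probability simplex
    return sum(w * r for w, r in zip(weights, reward_vec))
```

With rewards `[2.0, 4.0]`, an even preference `[1, 1]` yields the scalar 3.0, while the extreme preference `[1, 0]` yields 2.0. Note that the preference must be fixed before training begins, which is exactly the drawback of single-policy methods: each new trade-off requires retraining.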
Although the above works were successful to some extent, these approaches share the same shortcoming: no attention is paid to the robustness of Pareto-optimal policies over the

