APPROXIMATING PARETO FRONTIER THROUGH BAYESIAN-OPTIMIZATION-DIRECTED ROBUST MULTI-OBJECTIVE REINFORCEMENT LEARNING

Abstract

Many real-world decision-making and control problems involve multiple conflicting objectives and uncertainties, which requires that learned policies be not only Pareto optimal but also robust. In this paper, we propose a novel algorithm to approximate a representation of the robust Pareto frontier through Bayesian-optimization-directed robust multi-objective reinforcement learning (BRMORL). Firstly, environmental uncertainty is modeled as an adversarial agent over the entire space of preferences by incorporating a zero-sum game into multi-objective reinforcement learning (MORL). Secondly, a comprehensive metric based on hypervolume and information entropy is presented to evaluate the convergence, diversity and evenness of the distribution of Pareto solutions. Thirdly, the agent's learning process is regarded as a black box: the proposed comprehensive metric is computed after each episode of training, and a Bayesian optimization (BO) algorithm is then adopted to guide the agent to evolve towards improving the quality of the approximated Pareto frontier. Finally, we demonstrate the effectiveness of the proposed approach on challenging multi-objective tasks across four environments, and show that our scheme can produce robust policies under environmental uncertainty.

1. INTRODUCTION

Reinforcement learning (RL) algorithms have demonstrated their worth in a series of challenging sequential decision-making and control tasks, training policies to optimize a single scalar reward function (Mnih et al., 2015; Silver et al., 2016; Haarnoja et al., 2018; Hwangbo et al., 2019). However, many real-world tasks are characterized by multiple competing objectives whose relative importance (preferences) is ambiguous in most cases. Moreover, uncertainty or perturbation caused by changes in environment dynamics is inevitable in real-world scenarios and may degrade agent performance (Pinto et al., 2017; Ji et al., 2018). For instance, an autonomous electric vehicle must trade off transport efficiency against electricity consumption while accounting for environmental uncertainty (e.g., vehicle mass, tire pressure and road conditions might vary over time). Consider the traffic-mode decision-making problem shown in Figure 1. A practitioner or a rule is responsible for picking the appropriate preference between time and cost, and the agent needs to determine different policies depending on the chosen trade-off between these two metrics. However, the environment contains uncertainty related to the actions of other agents or to dynamic changes of nature, which may introduce more randomness into these two metrics and makes multi-objective decision-making or control more challenging. If weather is taken into account, for example, heavy rain may cause traffic congestion, which increases the time and cost of plan-A but has little impact on the two metrics of plan-B. From this perspective, selecting plan-B is more robust, i.e., a policy is said to be robust if its capability to obtain utility is relatively stable under environmental changes. Therefore, preference and uncertainty jointly affect the decision-making behavior of the agent.
In traditional multi-objective reinforcement learning (MORL), one popular approach is scalarization: the multi-objective reward vector is converted into a single scalar reward through various techniques (e.g., a convex combination), and standard RL algorithms are then adopted to optimize this scalar reward (Vamplew et al., 2011). Unfortunately, it is very tricky to determine an appropriate scalarization, because the common approaches often learn only an 'average' policy over the space of preferences (Yang et al., 2019), or the obtained policies, though relatively quickly adaptable to different preferences among performance objectives, are not necessarily optimal. Furthermore, these methods rarely take into account the robustness of the policies under different preferences, which means the agent cannot learn robust Pareto optimal policies.

Figure 1: Diagram of the traffic-mode decision-making problem. If time is crucial, the agent tends to choose plan-A, which takes less time but costs more. On the other hand, if cost matters more, the agent will be inclined to select plan-B, which requires less cost but takes more time.

In this work, we propose a novel approach to approximate a well-distributed robust Pareto frontier through BRMORL. This allows a single trained network model to produce the robust Pareto optimal policy for any specified preference, i.e., the learned policy is not only robust to uncertainty (e.g., random disturbance and environmental change) but also Pareto optimal under different preference conditions. Our algorithm is based on three key ideas, which are also the main contributions of this paper: (1) we present a generalized robust MORL framework through modeling uncertainty as an adversarial agent; (2) inspired by the Shannon-Wiener diversity index, a novel metric is presented to evaluate the diversity and evenness of the distribution of Pareto solutions.
In addition, combined with the hypervolume indicator, a comprehensive metric is designed that can evaluate the convergence, diversity and evenness of the solutions on the approximated Pareto frontier; (3) the agent's learning process in each episode is regarded as a black box, and a BO algorithm is used to guide the agent to evolve towards improving the quality of the Pareto set. Finally, we demonstrate that our proposed algorithm outperforms competitive baselines on multi-objective tasks across several MuJoCo (Todorov et al., 2012) environments and SUMO (Simulation of Urban Mobility) (Lopez et al., 2018), and show that our approach can produce robust policies under environmental uncertainty.

2.1. MULTI-OBJECTIVE REINFORCEMENT LEARNING

MORL algorithms can be roughly classified into two main categories: single-policy approaches and multiple-policy approaches (Roijers et al., 2013; Liu et al., 2014). Single-policy methods seek to find the optimal policy for a given preference among multiple competing objectives. These approaches convert the multi-objective problem into a single-objective problem through different forms of scalarization, including linear and non-linear ones (Mannor & Shimkin, 2002; Tesauro et al., 2008). The main advantage of scalarization is its simplicity: it can be integrated into a single-policy scheme with very little modification. However, the main drawback of these approaches is that the preference among the objectives must be set in advance. Multiple-policy methods aim to learn a set of policies that approximates the Pareto frontier under different preference conditions. The most common approaches repeatedly call a single-policy scheme with different preferences (Natarajan & Tadepalli, 2005; Van Moffaert et al., 2013; Zuluaga et al., 2016). Other methods learn a set of policies simultaneously, either using a multi-objective extended version of value-based RL (Barrett & Narayanan, 2008; Castelletti et al., 2012; Van Moffaert & Nowé, 2014; Mossalam et al., 2016; Nottingham et al., 2019) or modifying policy-based RL into a MORL variant (Pirotta et al., 2015; Parisi et al., 2017; Abdolmaleki et al., 2020; Xu et al., 2020). Nevertheless, most of these methods are often constrained to the convex regions of the Pareto front and explicitly maintain sets of policies, which may prevent these schemes from finding sets of well-distributed Pareto solutions that represent different preferences. There are also meta-policy methods, which can be relatively quickly adapted to different preferences (Chen et al., 2018; Abels et al., 2019; Yang et al., 2019).
Although the above works have been successful to some extent, these approaches share the same shortcoming: no attention is paid to the robustness of the Pareto-optimal policy over the entire space of preferences. In addition, most approaches still focus on domains with discrete action spaces. In contrast, our scheme can guarantee that the learned policies are approximately robust Pareto-optimal on continuous control tasks.

2.2. ROBUST REINFORCEMENT LEARNING

Robust reinforcement learning (RRL) algorithms can be broadly grouped into three distinct families (Derman et al., 2020). The first approach focuses on solving robust Markov decision processes (MDPs) with rectangular uncertainty sets. Some works proposed RRL algorithms for learning optimal policies using coupled uncertainty sets (Mannor et al., 2012). Other works modeled an ambiguous linear function of a factor matrix as a selection from an uncertainty set (Goyal & Grand-Clement, 2018). The second RRL approach considers a distribution over the uncertainty set to mitigate conservativeness. Yu & Xu (2015) presented a distributional RRL method by supposing the uncertain parameters are random variables following an unknown distribution. Tirinzoni et al. (2018) proposed an RRL scheme using conditioned probability distributions that define uncertainty sets. The third RRL family mostly concerns adversarial settings in RL. Pinto et al. (2017) developed a robust adversarial reinforcement learning (RARL) scheme through modeling uncertainties via an adversarial agent that applies disturbances to the system. Tessler et al. (2019) proposed an adversarial RRL framework by structuring the probabilistic action robust MDP and the noisy action robust MDP. Nonetheless, these studies do not take into account the connection between Pareto-optimal policies and robust policies, which leaves room for improving their performance in practical applications. In contrast, our scheme can learn robust Pareto-optimal policies through modeling uncertainty as an adversary over the entire space of preferences.

3.1. MULTI-OBJECTIVE MARKOV DECISION PROCESS

In this work, we consider a MORL problem defined by a multi-objective Markov decision process (MOMDP), represented by the tuple ⟨S, A, P, R, γ, Ω, U_Ω⟩ with state space S, action space A, state transition probability P(s'|s, a), vector reward function R(s, a) = [r_1, ..., r_k]^T, space of preferences Ω, preference functions U_Ω, e.g., U_ω(R), which produces a utility function using preference ω ∈ Ω, and discount factor γ ∈ [0, 1). In a MOMDP, a policy π is associated with a vector of expected returns Q^π(s, a) = [Q^π_1, ..., Q^π_k]^T, where the action-value function of π for objective k can be represented as Q^π_k(s, a) = E_π[Σ_t γ^t r_k(s_t, a_t) | s_0 = s, a_0 = a]. For a MOMDP, the set of non-dominated policies is called the Pareto frontier.

Definition 1. A policy π_1 Pareto dominates another policy π_2, i.e., π_1 ≻ π_2, when ∃i : Q^{π_1}_i(s, a) > Q^{π_2}_i(s, a) ∧ ∀j ≠ i : Q^{π_1}_j(s, a) ≥ Q^{π_2}_j(s, a).

Definition 2. A policy π is Pareto optimal if and only if it is non-dominated by any other policy.
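As a concrete illustration of Definitions 1 and 2, the following minimal sketch (not the paper's code; function names are ours) checks Pareto dominance between return vectors and filters a set down to its non-dominated members:

```python
import numpy as np

def dominates(q1, q2):
    """True if return vector q1 Pareto dominates q2 (Definition 1):
    q1 >= q2 in every objective and strictly > in at least one."""
    q1, q2 = np.asarray(q1), np.asarray(q2)
    return bool(np.all(q1 >= q2) and np.any(q1 > q2))

def pareto_front(points):
    """Keep only the non-dominated return vectors (Definition 2)."""
    pts = [np.asarray(p) for p in points]
    return [p for p in pts
            if not any(dominates(q, p) for q in pts if q is not p)]
```

For example, among the two-objective returns [2, 1], [1, 2], [1, 1] and [0, 3], only [1, 1] is dominated (by [2, 1]), so the approximated front keeps the other three.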

3.2. TWO-PERSON ZERO-SUM GAMES

In standard two-person zero-sum games, the players have opposite goals: the payoff of one player equals the loss of the opponent (Mazalov, 2014), i.e., V + V̄ = 0, where V and V̄ are the payoffs of the player and the opponent, respectively. For a two-player discounted zero-sum Markov game, assuming the protagonist plays policy π and the adversary plays policy π̄, the transition kernel P(s'|s, a, ā) depends on both players. In the game, the value function based on π and π̄ can be represented as v^{π,π̄}(s) ≡ E_{π,π̄}[Σ_{t=0}^∞ γ^t r(s_t, a_t, ā_t) | s_0 = s], ∀s ∈ S. Each player chooses his policy regardless of the opponent. The protagonist attempts to maximize the value function (i.e., the total expected discounted reward), and the adversary seeks to minimize it. The Nash equilibrium is a key solution concept in game theory. A Nash equilibrium (π*, π̄*) in a zero-sum Markov game exists when the following relation holds (Shapley, 1953; Başar & Olsder, 1998):

v*(s) = max_π min_π̄ E_{π,π̄}[Σ_{t=0}^∞ γ^t r(s_t, a_t, ā_t) | s_0 = s]
      = min_π̄ max_π E_{π,π̄}[Σ_{t=0}^∞ γ^t r(s_t, a_t, ā_t) | s_0 = s],   (1)

where π* and π̄* are the optimal policies of the protagonist and the adversary respectively, and v* is the optimal equilibrium value of the game. In such a situation, neither player can improve their return, and the important relation ∀π, π̄ : v^{π,π̄*} ≤ v* ≤ v^{π*,π̄} holds.

4. BAYESIAN-OPTIMIZATION-DIRECTED ROBUST MORL

4.1. OVERVIEW

We propose a generalized robust MORL framework to learn a single parametric representation for the robust Pareto optimal policy over the space of preferences (see Algorithm 1 for an implementation scheme based on DDPG). The optimization process of our proposed approach is illustrated in Figure 2. A Bayesian model based on a Gaussian process is adopted to predict the Pareto quality and estimate the model uncertainty.
Then, using the Bayesian model, an acquisition function (Frazier, 2018) determines the optimal guess point, which in our task is the suggested preference. In order to prevent the policy from falling into a local optimum, some preferences are randomly sampled from the replay buffer, and these guide the training of the agent together with the preferences from BO. In addition, the policy of the adversary evolves in the opposite direction to the policy of the protagonist for each preference. In Sections 4.2 and 4.3, by incorporating a zero-sum game into MORL, environmental uncertainty is modeled as an adversarial agent. This means that the protagonist needs to learn a Pareto optimal policy under attack from the adversary. In Section 4.4, inspired by the Shannon-Wiener diversity index, a novel metric for Pareto quality is presented to evaluate the distribution of Pareto solutions in terms of diversity and evenness. Moreover, combined with the hypervolume indicator, a comprehensive metric is designed that can evaluate the convergence, diversity and evenness of the solutions in the Pareto set. In Section 4.5, the agent's learning process is regarded as a black box: the comprehensive metric for the approximated Pareto frontier is computed after each episode of training, and then a BO algorithm is adopted to guide the protagonist to evolve towards improving the Pareto quality (i.e., maximizing the comprehensive metric).

Figure 2: Illustration of the process of approximating a well-distributed robust Pareto frontier through the proposed algorithm.

4.2. ROBUST MULTI-OBJECTIVE MDP

In this section, we propose a robust multi-objective MDP (RMO-MDP), which considers both the Pareto optimality and the robustness of the learned policies. The probabilistic action robust MDP (PR-MDP) (Tessler et al., 2019) is adopted to improve the robustness of the policies, and can be regarded as a special zero-sum game between a protagonist and an adversary. We refer to the optimal policies of the protagonist as robust Pareto-optimal policies in the RMO-MDP; the difference from the MOMDP is that the action space here includes not only the actions of the protagonist, but also the actions of the adversary, taken with a certain probability.

Definition 3. An RMO-MDP is defined by the tuple ⟨S, A_mix, P, R, γ, Ω, U_Ω⟩, where A_mix is the mixed action space. The mixed policy π_α^mix(π, π̄) is defined as π_α^mix(a_mix | s, ω) ≡ (1 − α)π(a | s, ω) + απ̄(ā | s, ω), ∀s ∈ S and α ∈ [0, 1]. π and π̄ are the policies the players can take, and a_mix ∼ π_α^mix(π(s), π̄(s)).

In this work, in order to improve the quality of the approximated Pareto frontier, the scalar utility function U_Ω is designed as a non-linear combination of the objectives:

U_Ω(s, a_mix, ω) = ω^T Q^{π_α^mix(π,π̄)}(s, a_mix, ω) + k M(s, a_mix, ω),   (3)

M(s, a_mix, ω) = || Q(s, a_mix, ω)/||Q(s, a_mix, ω)||_2 − ω/||ω||_2 ||_2^2,   (4)

Q^{π_α^mix(π,π̄)}(s, a_mix, ω) = (1 − α)Q(s, a, ω) + αQ(s, ā, ω),   (5)

where M(s, a_mix, ω) is a metric that evaluates the mismatch between a Pareto optimal solution and the corresponding preference. Figure 3 illustrates the metric function M(s, a_mix, ω) in more detail. The distribution of solutions on the Pareto front can be made more even by optimizing the function M(s, a_mix, ω). The coefficient k adjusts the role of M(s, a_mix, ω) in the utility function (3); k is negative for the protagonist and positive for the adversary.
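A minimal numerical sketch of the mismatch metric in Equation 4, assuming Q and ω are plain vectors (names are ours, not the paper's):

```python
import numpy as np

def mismatch(q, omega):
    """M = || Q/||Q||_2 - omega/||omega||_2 ||_2^2: squared distance
    between the normalized return vector and the normalized preference."""
    q, omega = np.asarray(q, float), np.asarray(omega, float)
    diff = q / np.linalg.norm(q) - omega / np.linalg.norm(omega)
    return float(np.sum(diff ** 2))
```

When the return vector is perfectly aligned with the preference direction, M = 0; e.g., mismatch([1, 0], [1, 1]) = 2 − √2 ≈ 0.586, while mismatch([2, 2], [1, 1]) = 0.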
This means that a policy with a higher preference weight is more likely to be violently attacked by the adversary, which makes such policies more robust. Under adversarial attack, the utility value of the protagonist's policy can be defined as v^π_α ≡ min_π̄ E_{π_α^mix(π,π̄)}[U_Ω(s, a_mix, ω)]. Therefore, the robust Pareto optimal policy is the optimal policy in the RMO-MDP, which can be represented as: π*_α ∈ arg max_π min_π̄ E_{π_α^mix(π,π̄)}[U_Ω(s, a_mix, ω)]. The complexity of a greedy solution for finding the Nash equilibrium policies is exponential in the cardinality of the action space, which makes it unworkable in most cases (Schulman et al., 2015). In addition, most methods for two-player discounted zero-sum Markov games require solving for the equilibrium policy of a minimax action-value function at each iteration, which is a typically intractable optimization problem (Pinto et al., 2017). Instead, we focus on approximating the equilibrium solution to avoid this tricky optimization.
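The probabilistic mixing of Definition 3 amounts to executing the adversary's action with probability α and the protagonist's action otherwise; a sketch under that reading (names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixed_action(protagonist_action, adversary_action, alpha, rng=rng):
    """Draw a_mix ~ pi_mix_alpha: with probability alpha the adversary's
    action is played instead of the protagonist's (PR-MDP style)."""
    if rng.random() < alpha:
        return adversary_action
    return protagonist_action
```

With α = 0 this reduces to the ordinary MOMDP; larger α exposes the protagonist to stronger attacks during training.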

4.3. POLICY ITERATION FOR RMO-MDP

In this section, we present a policy iteration (PI) approach for solving the RMO-MDP, called robust multi-objective PI (RMO-PI). The RMO-PI algorithm decomposes the RMO-MDP problem into two sub-problems (policy evaluation and policy improvement) and iterates until convergence.

4.3.1. ROBUST MULTI-OBJECTIVE POLICY EVALUATION

In this stage, the vectorized Q-function is learned to evaluate the policy π of the protagonist. With Equation 5, we define the target vectorized Q-function as:

y = E_{π_α^mix}[R + γ Q^{π_α^mix(π,π̄)}(s', a'_mix, ω; φ⁻)]
  = E_{s',a'_mix,ω}[ R + γ((1 − α)Q(s', a', ω; φ⁻) + αQ(s', ā', ω; φ⁻)) ].

Then, we minimize the following loss function at each step:

L_1(φ) = E_{π_α^mix}[ ||y_{ω_rb} − Q(s, a, ω_rb; φ)||_2^2 + ||y_{ω_bo} − Q(s, a, ω_bo; φ)||_2^2 ],   (8)

where φ and φ⁻ are the parameters of the Q-function network and the target Q-function network, ω_rb and ω_bo are obtained from the replay buffer and from Bayesian optimization, and y_{ω_rb} and y_{ω_bo} represent y(s', a'_mix, ω_rb) and y(s', a'_mix, ω_bo), respectively. In order to improve the smoothness of the landscape of the loss function, an auxiliary loss is used (Yang et al., 2019):

L_2(φ) = E_{π_α^mix}[ |ω_rb^T y_{ω_rb} − ω_rb^T Q(s, a, ω_rb; φ)|^2 + |ω_bo^T y_{ω_bo} − ω_bo^T Q(s, a, ω_bo; φ)|^2 ].   (9)

The final loss function can be written as L(φ) = (1 − β)L_1(φ) + βL_2(φ), where β is a weighting coefficient that trades off between the losses L_1(φ) and L_2(φ).
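A toy numpy sketch of the combined critic loss L(φ) = (1 − β)L_1 + βL_2 for a single transition; the inputs stand in for the target and predicted Q-vectors and the two preference vectors (all names and values are illustrative, not the paper's):

```python
import numpy as np

def critic_loss(y_rb, q_rb, w_rb, y_bo, q_bo, w_bo, beta):
    """L1 penalizes the vectorized TD error (eq. 8); L2 penalizes the
    preference-scalarized TD error (eq. 9); L blends them with beta."""
    y_rb, q_rb, w_rb = map(np.asarray, (y_rb, q_rb, w_rb))
    y_bo, q_bo, w_bo = map(np.asarray, (y_bo, q_bo, w_bo))
    l1 = np.sum((y_rb - q_rb) ** 2) + np.sum((y_bo - q_bo) ** 2)
    l2 = (w_rb @ (y_rb - q_rb)) ** 2 + (w_bo @ (y_bo - q_bo)) ** 2
    return (1 - beta) * l1 + beta * l2
```

For instance, with target [1, 2], prediction [0.5, 1.5], preference [0.6, 0.4] for both preference sources, and β = 0.5, the loss is 0.5·1.0 + 0.5·0.5 = 0.75.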

4.3.2. ROBUST MULTI-OBJECTIVE POLICY IMPROVEMENT

In RMO-PI, policy improvement refers to optimizing and updating the policies of the protagonist and the adversary for the given utility function. RMO-PI optimizes both agents through the following alternating process. In the first stage, the policy of the protagonist is learned while holding the adversary's policy fixed. In the second stage, the policy of the protagonist is held constant and the adversary's policy is learned. This learning sequence is repeated until convergence. The protagonist seeks to maximize the utility function U_Ω, and its policy gradient can be represented as ∇_θ L^π = ∇_θ L^π_{ω_rb} + ∇_θ L^π_{ω_bo}, where

∇_θ L^π_{ω_rb} ≈ E_{π_α^mix}[ (1 − α)∇_a ω_rb^T Q(s, a, ω_rb; φ)∇_θ π(s, ω_rb; θ) + k∇_a M(s, a, ω_rb)∇_θ π(s, ω_rb; θ) ],   (10)

∇_θ L^π_{ω_bo} ≈ E_{π_α^mix}[ (1 − α)∇_a ω_bo^T Q(s, a, ω_bo; φ)∇_θ π(s, ω_bo; θ) + k∇_a M(s, a, ω_bo)∇_θ π(s, ω_bo; θ) ],   (11)

and θ denotes the model parameters of the protagonist. Next, the adversary tries to minimize the utility function U_Ω, and its policy gradient can be written as ∇_θ̄ L^π̄ = ∇_θ̄ L^π̄_{ω_rb} + ∇_θ̄ L^π̄_{ω_bo}, where

∇_θ̄ L^π̄_{ω_rb} ≈ E_{π_α^mix}[ α∇_ā ω_rb^T Q(s, ā, ω_rb; φ)∇_θ̄ π̄(s, ω_rb; θ̄) + k∇_ā M(s, ā, ω_rb)∇_θ̄ π̄(s, ω_rb; θ̄) ],   (12)

∇_θ̄ L^π̄_{ω_bo} ≈ E_{π_α^mix}[ α∇_ā ω_bo^T Q(s, ā, ω_bo; φ)∇_θ̄ π̄(s, ω_bo; θ̄) + k∇_ā M(s, ā, ω_bo)∇_θ̄ π̄(s, ω_bo; θ̄) ],   (13)

and θ̄ denotes the model parameters of the adversary. The derivation details of the policy gradients are available in Appendix A.1.2.
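The alternating maximize/minimize scheme can be illustrated on a toy saddle problem (not the actual RMO-PI networks): for u(θ, θ̄) = −(θ − 3)² + ½(θ̄ − θ)², the protagonist ascends in θ while the adversary descends in θ̄, and both converge to the equilibrium θ = θ̄ = 3:

```python
def alternating_updates(theta=0.0, theta_bar=0.0, lr=0.1, steps=200):
    """Toy alternating gradient ascent/descent on
    u(theta, theta_bar) = -(theta - 3)**2 + 0.5*(theta_bar - theta)**2."""
    for _ in range(steps):
        # protagonist: gradient ascent on u w.r.t. theta (adversary fixed)
        grad_theta = -2.0 * (theta - 3.0) - (theta_bar - theta)
        theta += lr * grad_theta
        # adversary: gradient descent on u w.r.t. theta_bar (protagonist fixed)
        grad_theta_bar = theta_bar - theta
        theta_bar -= lr * grad_theta_bar
    return theta, theta_bar
```

Here the inner minimization drives θ̄ → θ, after which u reduces to −(θ − 3)², maximized at θ = 3; the alternating iterates spiral into this max-min point, mirroring the two-stage protagonist/adversary updates above.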

4.4. METRICS FOR PARETO REPRESENTATION

Since the true Pareto set is intractable to obtain in complex problems, the goal of MORL is to find the set of policies that best approximates the optimal Pareto front. Many researchers have studied quality metrics for the Pareto front (Cheng et al., 2012; Parisi et al., 2017; Audet et al., 2018). The hypervolume indicator is widely adopted to evaluate the quality of an approximated Pareto frontier, and can measure the convergence and uniformity of the distribution of Pareto solutions (Zitzler & Thiele, 1999; Xu et al., 2020). From our perspective, however, this indicator may fail to accurately measure the uniformity of the Pareto solution distribution. As shown in Figure 4, suppose that Pareto frontiers 1, 2 and 3 are obtained by different algorithms. Compared with frontiers 2 and 3, although the hypervolume formed by the solutions on frontier 1 and the reference point O is optimal, the distribution of solutions on frontier 1 is not even, leaving the practitioner or the agent very few valid preferences to choose from. Moreover, imagine that the solutions on frontier 1 are very close to each other, or even overlap into one solution. In this case, if we adopt the metric (integrating hypervolume and sparsity) proposed by Xu et al. (2020) to measure the quality of frontier 1, the result, with high hypervolume and low sparsity, appears ideal. However, such a frontier might not satisfy the needs of the practitioner or the agent. In a word, a high-quality approximated Pareto frontier is expected to have both a high hypervolume and an even distribution of solutions. Therefore, in this section, we propose a novel metric for the quality of the approximated Pareto frontier by combining a hypervolume metric and an evenness metric.
Inspired by the Shannon-Wiener diversity index, the diversity metric for the solutions on the Pareto frontier can be expressed as D(P) = −Σ_i p_i ln(p_i), where P represents the solutions on the Pareto frontier, and p_i is the proportion of non-dominated solutions falling in the i-th solution interval relative to the total number of solutions on the Pareto frontier. The expected diversity of the Pareto set, D_max, can be defined as ln(S_n), where S_n is the number of solution intervals. Then, our evenness metric E(P) can be represented as D(P)/D_max. For example, in Figure 4, S_n = 6, and the evenness metrics for the distributions of the solutions on Pareto frontiers 1, 2 and 3 are approximately equal to 0.37, 1 and 0.56, respectively. Hence, we can state the following two propositions.

Proposition 1. As E(P) and S_n increase, the distribution of solutions in the Pareto set becomes denser and more uniform, and the Pareto frontier becomes more continuous.

Proposition 2. The Pareto frontier is continuous as E(P) = 1 and S_n → ∞.

Combined with the hypervolume indicator H(P), we propose a comprehensive metric I(P) that can measure the convergence, diversity and evenness of the solutions: I(P) = H(P)(1 + λE(P)), where λ is a weight coefficient.
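A minimal sketch of the evenness metric E(P) and the comprehensive metric I(P), assuming the per-interval solution counts are already available and the hypervolume H(P) is passed in rather than computed (function names are ours):

```python
import math

def evenness(interval_counts):
    """E(P) = D(P)/ln(S_n), with D(P) = -sum p_i ln p_i over the S_n
    solution intervals (Shannon-Wiener style); empty intervals add 0."""
    total = sum(interval_counts)
    s_n = len(interval_counts)
    d = -sum((c / total) * math.log(c / total)
             for c in interval_counts if c > 0)
    return d / math.log(s_n)

def comprehensive_metric(hypervolume, interval_counts, lam=1.0):
    """I(P) = H(P) * (1 + lambda * E(P))."""
    return hypervolume * (1.0 + lam * evenness(interval_counts))
```

A perfectly uniform spread (equal counts in all S_n intervals) gives E(P) = 1, while a frontier whose solutions pile up in one interval scores much lower, matching the Figure 4 discussion.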

4.5. BAYESIAN-OPTIMIZATION-DIRECTED PARETO REPRESENTATION IMPROVEMENT

In this section, in order to further improve the representation of the approximated Pareto frontier, the agent's learning process is regarded as a black box: the comprehensive metric I(P) is computed after each episode of training, and then a BO algorithm is adopted to guide the protagonist to evolve towards maximizing the proposed metric I(P). The BO-directed Pareto representation improvement scheme is illustrated in Figure 5. The value of the objective function f(Ω) equals the value of the comprehensive metric I(P), which is obtained after each episode of training. In addition, preferences suggested by the BO algorithm and preferences sampled from the replay buffer are used simultaneously to guide the learning process, which prevents the algorithm from falling into a local optimum. The scheme of guiding the learning process with BO is highly general for Pareto quality improvement, and does not require much expert experience in the selection of prediction models.
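The BO loop can be sketched end-to-end on a one-dimensional toy objective standing in for f(Ω) = I(P): a numpy Gaussian-process posterior with an RBF kernel plus an expected-improvement acquisition, as in Figure 5. All names, kernel choices and hyperparameters here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ls=0.2):
    # squared-exponential kernel between two 1-D point sets
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """GP posterior mean/std with an RBF kernel (zero prior mean)."""
    k = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    k_s = rbf(x_train, x_test)
    k_inv = np.linalg.inv(k)
    mu = k_s.T @ k_inv @ y_train
    var = 1.0 - np.einsum('ij,ik,kj->j', k_s, k_inv, k_s)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    # EI = (mu - best) * Phi(z) + sigma * phi(z), z = (mu - best) / sigma
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2)))
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (mu - best) * cdf + sigma * pdf

def bayes_opt(f, n_iter=8):
    x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # initial design
    y = np.array([f(v) for v in x])
    cand = np.linspace(0.0, 1.0, 101)            # candidate "preferences"
    for _ in range(n_iter):
        mu, sigma = gp_posterior(x, y, cand)
        x_next = cand[np.argmax(expected_improvement(mu, sigma, y.max()))]
        x = np.append(x, x_next)
        y = np.append(y, f(x_next))
    return x[np.argmax(y)], y.max()
```

On a smooth toy objective such as f(w) = −(w − 0.7)², the loop concentrates evaluations near the maximizer, mirroring how the suggested preference is chosen to drive I(P) upwards episode by episode.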

5. EXPERIMENTS

In order to benchmark our proposed scheme, we develop two MORL environments with continuous action spaces based on SUMO and Swimmer-v2. Moreover, we also adopt HalfCheetah-v2 and Walker2d-v2, two MORL domains provided by Xu et al. (2020). The goal of all tasks is to optimize the speed of the agent while minimizing energy consumption. The observation and action space settings are shown in Table 1. More details can be found in Appendix A.2.

Table 1: Observation space and action space of the experiment environments.

                     SUMO       Swimmer-v2   HalfCheetah-v2   Walker2d-v2
  Observation Space  S ∈ R^16   S ∈ R^8      S ∈ R^17         S ∈ R^17
  Action Space       A ∈ R^1    A ∈ R^2      A ∈ R^6          A ∈ R^6

Our algorithm is implemented on top of the Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) framework. In principle, our scheme can be combined with any RL method, regardless of whether it is off-policy or on-policy. Moreover, we implement three baseline methods for comparison and ablation analysis: SMORL denotes a MO-DDPG method based on a linear scalarization function, i.e., a preference-weighted linear combination of rewards; SRMORL is an RMO-DDPG approach using a linear scalarization function; RMORL denotes an RMO-DDPG approach with the utility function U_Ω; and BRMORL is an RMO-DDPG scheme combining the utility function U_Ω and the BO algorithm. More details about the algorithms are described in Appendix A.1.1. Figures 6 and 7 show the learning curves and Pareto frontier comparison results on SUMO and Swimmer-v2, respectively. Moreover, the results in Tables 2 and 3 demonstrate that our proposed BRMORL scheme outperforms all the baseline methods on the SUMO and Swimmer-v2 environments in hypervolume and evenness. It can also be seen from Figure 7(c) that the BRMORL method is able to find solutions not only on the convex portions of the Pareto frontier, but also on the concave portions.
Figure 8 illustrates the robustness of different policy models under the preference [0.5, 0.5] on the Swimmer-v2 domain. We test with jointly varying mass and disturbance probability. Obviously, the capability of the BRMORL approach to obtain return is less affected by environmental changes than that of the other schemes. Moreover, the standard deviation of a policy's utility is adopted to quantify robustness: the stronger the robustness of a policy, the smaller its standard deviation. Table 4 shows the quantitative analysis results of robustness under different preferences and environmental changes on Swimmer-v2. For more results and implementation details, please refer to Appendices A.3 and A.4.1. In Tables 5 and 7, we compare our BRMORL scheme with the state-of-the-art baseline (PGMORL) provided by Xu et al. (2020). Although our method is not superior in hypervolume, it outperforms the baseline in evenness, robustness and utility. In this section, the utility is defined as the expected return of a policy under environmental changes. More details and results can be found in Appendices A.3 and A.4.2.

6. CONCLUSION AND DISCUSSION

In this paper, we proposed a generalized robust MORL framework to approximate a representation of the robust Pareto frontier, which allows a single trained model to produce the robust Pareto optimal policy for any specified preference. Our experiments across four different domains demonstrate that our scheme is effective and advanced. Most importantly, we note that training with an appropriate adversarial setting can not only produce robust policies, but can even improve performance. Moreover, both the convex and the concave portions of the Pareto frontier can be found by our approach. Although our scheme cannot guarantee that the learned policy is optimal, it is approximately robust Pareto optimal.

Algorithm 1 (fragment), target network updates:
    end for
    Update the target networks:
        θ⁻ ← τθ + (1 − τ)θ⁻
        θ̄⁻ ← τθ̄ + (1 − τ)θ̄⁻
        φ⁻ ← τφ + (1 − τ)φ⁻
    end for
end for

A.1.2 THEORETICAL DERIVATION

In this part, we provide the derivation details of some formulas. The policy gradient of the protagonist based on ω_rb can be derived as:

∇_θ L^π_{ω_rb} ≈ E_{π_α^mix}[∇_θ U_Ω(s, a_mix, ω_rb)]
             = E_{π_α^mix}[∇_θ ω_rb^T Q^{π_α^mix(π,π̄)}(s, a_mix, ω_rb; φ) + k∇_θ M(s, a_mix, ω_rb)]
             = E_{π_α^mix}[(1 − α)∇_a ω_rb^T Q(s, a, ω_rb; φ)∇_θ π(s, ω_rb; θ) + k∇_a M(s, a, ω_rb)∇_θ π(s, ω_rb; θ)].   (15)

The policy gradient of the protagonist based on ω_bo follows the same steps with ω_bo in place of ω_rb.



Figure 3: Illustration of the mismatch between a Pareto optimal solution and the corresponding preference. Suppose point A represents a Pareto optimal solution; A and the origin form the vector OA. The corresponding preference vector can be represented by the vector OB. In most cases, OA is not parallel to OB.

Figure 4: Quality analysis of Pareto frontiers. Pareto frontiers 1, 2 and 3 are approximated by different approaches. The green, blue and purple points represent the solutions on Pareto frontiers 1, 2 and 3, respectively. The hypervolume formed by the solutions on Pareto frontier 2 and the reference point O is the blue shaded region.

Figure 5: Pareto representation improvement scheme based on the BO algorithm. The surrogate model for the objective function f(Ω) is typically a Gaussian process. Posteriors represent the confidence the model has in the function values at a point or set of points. The acquisition function is employed to evaluate the usefulness of the optimal guess point corresponding to the posterior distribution over f(Ω). The expected improvement method is chosen to design the acquisition function in our scheme.

Figure 6: The learning curves and the Pareto frontiers obtained by different algorithms on SUMO.

Figure 7: The learning curves and the Pareto frontiers obtained by different methods on Swimmer-v2.

Figure 8: Robustness to environmental uncertainty. Disturbance probability represents the probability of a random disturbance being played instead of the selected action. Relative mass denotes the ratio of the current agent's mass to its original mass.

Figure 12: Robustness to environmental uncertainty on Swimmer-v2 domain.

Figure 13: Robustness to environmental uncertainty on Swimmer-v2 domain.

Figure 14: Robustness to environmental uncertainty on Walker2d-v2 domain. Disturbance probability represents the probability of a random disturbance being played instead of the selected action. Relative mass denotes the ratio of the current agent's mass to its original mass.

Figure 15: Robustness to environmental uncertainty on HalfCheetah-v2 domain.

Table 2: Training results on SUMO.

Table 3: Training results on Swimmer-v2.

Table 4: Quantitative analysis results for robustness.

Table 5: Test results on Walker2d-v2.

Algorithm 4: multi-objective DDPG with linear scalarization function
Input: weighting coefficients β and τ
Randomly initialize actor π(s, ω; θ), adversary π̄(s, ω; θ̄) and critic network Q(s, a, ω; φ)
Initialize target networks with weights θ⁻, θ̄⁻ and φ⁻
Initialize replay buffer B and comprehensive metric I
for episode = 0...M do
    Receive initial state s_0
    for t = 0...T do
        Sample preference ω_ud from a uniform distribution: ω_t ← ω_ud
        Sample action a_t = π(s_t, ω_t; θ)
        ã_t = a_t + exploration noise
        Execute action ã_t and observe reward r_t and new state s_{t+1}
        Store transition (s_t, ω_t, ã_t, r_t, s_{t+1}) in B
        for i = 0...N do
            Sample batch from replay buffer B
            Update actor network: θ ← E_π[∇_a ω_rb^T Q(s, a, ω_rb; φ)∇_θ π(s, ω_rb; θ)]
            Update critic network: φ ← (1 − β)∇_φ E_π[||y_{ω_rb} − Q(s, a, ω_rb; φ)||²] + β∇_φ E_π[|ω_rb^T y_{ω_rb} − ω_rb^T Q(s, a, ω_rb; φ)|²]
        end for
    end for
end for

The policy gradient of the protagonist based on ω_bo can be written as:

∇_θ L^π_{ω_bo} ≈ E_{π_α^mix}[∇_θ ω_bo^T Q^{π_α^mix(π,π̄)}(s, a_mix, ω_bo; φ) + k∇_θ M(s, a_mix, ω_bo)]
             = E_{π_α^mix}[(1 − α)∇_a ω_bo^T Q(s, a, ω_bo; φ)∇_θ π(s, ω_bo; θ) + k∇_a M(s, a, ω_bo)∇_θ π(s, ω_bo; θ)].   (16)

The policy gradient of the adversary based on ω_rb can be derived analogously, yielding equation (12).

Test results on HalfCheetah-v2.

A APPENDIX

A.1 ALGORITHM

A.1.1 ALGORITHM OVERVIEW

The details of the BRMORL, RMORL, SRMORL and SMORL schemes are provided in Algorithms 1, 2, 3 and 4, respectively. In their inner loops, these schemes sample a batch from the replay buffer B, update the critic according to equations 8 and 9, update the adversary according to equation 12, and then update the target networks. The policy gradient of the adversary based on ω_bo can be described in a similar form.

A.2 DOMAIN

In this section, we give more details about the training environments. We demonstrate our approach in four challenging continuous control domains.

A.2.1 SUMO

Observation and action space dimension: S ∈ R^16, A ∈ R^1. The first objective is vehicle velocity and the second objective is energy conservation, where v_x is the vehicle longitudinal velocity, v_max is the maximum speed, and p = 1 is a scaling factor.

A.2.2 SWIMMER-V2

Observation and action space dimension:
The first objective is running velocity and the second objective is energy conservation, where v_x is the longitudinal velocity and a_i is the action of each actuator.
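The velocity and energy formulas for the locomotion domains did not survive extraction. A common form for such two-objective MuJoCo locomotion rewards, given here only as an assumption consistent with the symbols v_x and a_i in the text (not necessarily the paper's exact definition), is forward speed versus control cost:

```python
def locomotion_objectives(v_x, actions, energy_offset=4.0):
    """Two-objective locomotion reward sketch (ASSUMED form, not the paper's):
    objective 1 rewards forward speed, objective 2 rewards low actuation energy.
    v_x: longitudinal velocity; actions: per-actuator commands a_i."""
    r_speed = v_x
    # Energy objective: constant offset minus the sum of squared actions,
    # so less actuation yields a higher energy-conservation reward.
    r_energy = energy_offset - sum(a * a for a in actions)
    return r_speed, r_energy
```

Under this form, both objectives are bounded per step and directly conflict: faster gaits require larger actions and thus a lower energy reward.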

A.2.3 WALKER2D-V2

Observation and action space dimension: S ∈ R^17, A ∈ R^6. The first objective is running velocity and the second objective is energy conservation, where v_x is the longitudinal velocity and a_i is the action of each actuator.

A.2.4 HALFCHEETAH-V2

Observation and action space dimension:
The first objective is running velocity and the second objective is energy conservation, where v_x is the longitudinal velocity and a_i is the action of each actuator.

A.3 IMPLEMENTATION

We implement the actor, adversary and critic neural networks with 2 fully connected hidden layers, whose layer sizes are {256, 256} and {512, 256} in SUMO and MuJoCo respectively.

Our scheme contains three important hyperparameters: α, λ and k. α mainly affects the robustness of the policy model: if α is set to a smaller value, the robustness of the model is reduced; otherwise, the performance of the model is reduced. λ and k are responsible for balancing the hypervolume, diversity and evenness metrics: if λ and k are set to larger values, the diversity and evenness of the learned policies improve; otherwise, the hypervolume converges to a larger value. To select appropriate values, these three hyperparameters need to be tuned in the experiments. The number of solution intervals S_n is set to 9, and the main parameters of our BRMO-DDPG algorithm are reported in Table 6.

In the robustness tests and in the comparison with PGMORL, the utility is designed as follows:
We test the utility of the policy ten times under each mass and disturbance change, and calculate the cumulative value of the utility. The mean and standard deviation of each policy's utility over the whole test process are used to evaluate the performance of the policy under environmental uncertainty. The smaller the standard deviation, the stronger the robustness of the policy; moreover, for a given policy, the higher the mean, the stronger its capability to obtain utility under environmental uncertainty. Hence, the robustness and utility in Tables 5 and 7 refer to the standard deviation and the mean, respectively.

In the experiment comparing with the PGMORL scheme, we evenly selected ten trained policy models. Because we use a single model for testing, we evenly selected ten preferences as inputs to the tested model. For the PGMORL method, we tested each model one hundred times and calculated the cumulative value of the utility. For the BRMORL scheme, we tested each preference one hundred times and then calculated the cumulative value of the utility. Therefore, the hypervolume and evenness in Tables 5 and 7 are calculated using these returns.
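To make the reported metrics concrete, here is a minimal sketch of a two-objective hypervolume and an entropy-based evenness over the S_n solution intervals, computed from test returns (our own illustrative implementations, not the paper's code):

```python
import math

def hypervolume_2d(points, ref):
    """Hypervolume (maximization) of 2-objective points w.r.t. a reference
    point that is dominated by every point."""
    hv, prev_y = 0.0, ref[1]
    # Sweep from the largest first objective downward, adding the strip of
    # area each non-dominated point contributes above the previous best y.
    for x, y in sorted(points, key=lambda p: p[0], reverse=True):
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def evenness(counts):
    """Normalized Shannon entropy of solution counts over S_n intervals:
    1.0 for a perfectly even spread, 0.0 when all solutions share one
    interval."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(counts))
```

With S_n = 9 as in our setup, `evenness` takes the per-interval counts of Pareto solutions; `hypervolume_2d` takes the (return_1, return_2) pairs and a common reference point, so both quantities in Tables 5 and 7 can be recomputed from the raw returns.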

A.4 RESULTS

In this section, we give more experimental results.

A.4.1 SWIMMER-V2

The comparison results with the baseline (PGMORL) are provided here.

