MULTI-OBJECTIVE REINFORCEMENT LEARNING: CONVEXITY, STATIONARITY AND PARETO OPTIMALITY

Abstract

In recent years, single-objective reinforcement learning (SORL) algorithms have received significant attention and achieved strong results. However, it is generally recognized that many practical problems have intrinsic multi-objective properties that SORL algorithms cannot easily handle. Although many multi-objective reinforcement learning (MORL) algorithms have been proposed, there has been little recent exploration of the fundamental properties of the spaces we are learning in. In this paper, we perform a rigorous analysis of policy-induced value functions and use the insights to distinguish three views of Pareto optimality. The results imply the convexity of the induced value function's range for stationary policies and suggest that any point of its Pareto front can be achieved by training a policy using linear scalarization (LS). We show that the problem leading to the suboptimal performance of LS can be solved by adding strongly concave terms to the immediate rewards, which motivates us to propose a new vector reward-based Q-learning algorithm, CAPQL. Combined with an actor-critic formulation, our algorithm achieves state-of-the-art performance on multiple MuJoCo tasks in the preference-agnostic setting. Furthermore, we empirically show that, in contrast to other LS-based algorithms, our approach is significantly more stable, achieving similar results across various random seeds.

Published as a conference paper at ICLR 2023.

1. INTRODUCTION

The past decade has seen the rapid development of reinforcement learning (RL) algorithms. Recent breakthroughs in RL have made it possible to develop policies that exceed human-level performance, for example in Atari (Mnih et al., 2015) and Dota 2 (OpenAI et al., 2019). Despite their great success, the vast majority of RL algorithms are single-objective. Although many practical problems can be reduced to a SORL task, there is an increasing recognition that many real-world tasks require us to consider their multi-objective nature (Coello, 2000; Pickett & Barto, 2002; Moffaert & Nowé, 2014; Roijers et al., 2013; Abels et al., 2019; Abdolmaleki et al., 2020; Abdelaziz et al., 2021). Many works discuss how to find optimal policies in a multi-objective RL (MORL) problem (Gábor et al., 1998; Pickett & Barto, 2002; Moffaert & Nowé, 2014; Roijers et al., 2013; Yang et al., 2019; Parisi et al., 2016; Mahapatra & Rajan, 2020) or in a more general dynamic programming setting (Sobel, 1975; Corley, 1985), but the relationship among the various definitions of Pareto optimal policies is hardly discussed. Moreover, there is no rigorous analysis of the range of induced value functions, which has been thought to be hard to characterize and irregularly shaped (Vamplew et al., 2008; Roijers et al., 2013; Reymond & Nowe, 2019). (Note that similar work has been done for mixed policies, but these differ fundamentally from the more common stationary policies sought in modern RL.) We hope to give researchers well-aligned intuitions about MORL problems that can save effort and accelerate research in the field; it is to this end that we introduce this paper.

Within this paper, we perform a theoretical analysis of MORL problems with an infinite horizon (rigorous proofs are given in Appx B). After a quick review of the MORL setting and three widely adopted definitions of Pareto efficiency (PE), we begin our analysis by characterizing the effects of policy alterations on the induced value function. We find that single-state policy alterations are insufficient to optimize the induced value function in a MORL setting, but show how it can be done by a multi-state update. We also prove that improving in all states is generally not possible. From here, we show that the range of value functions is convex, which suggests that linear scalarization (LS) is not the bottleneck in finding PE policies.

We discuss the deficiencies of existing LS-based algorithms as suggested by our theory and fix them by augmenting the reward function with a strongly concave term. These insights motivate us to propose a new MORL algorithm (CAPQL), which achieves state-of-the-art performance on multiple MuJoCo environments in the preference-agnostic setting. An ablation study is performed to understand how the augmentation affects the algorithm's performance.

To begin, we give a quick review of MORL problems and the notation we will be using, and introduce our definitions of Pareto optimality. As in SORL problems, we consider an agent interacting with an environment. At each step, the agent performs an action based on the current state, and the environment returns a reward and the next state. Our setting assumes a vector reward in $\mathbb{R}^d$ and reduces to a SORL problem if $d = 1$. We model the interaction as a Markov Decision Process (MDP) $(S, A, R, P, \gamma)$. As usual, $S$ and $A$ are the sets of states and actions, and $\gamma \in (0, 1)$ is the discount factor. Our discussion considers finite $A$ and $S$. When the agent takes action $a \in A$ in state $s \in S$, the environment gives reward $R(a, s) \in \mathbb{R}^d$ and moves to the next state following the transition probability $P(a, s) \in \Delta^{|S|}$. In this paper, we consider an infinite-horizon MORL problem and assume bounded rewards. Let $R(s) = [R(a, s) \mid a \in A] \in \mathbb{R}^{d \times |A|}$ and $P(s) = [P(a, s) \mid a \in A] \in \mathbb{R}^{|S| \times |A|}$, and let $\Pi$ be the set of all stationary policies, where $\pi \in \Pi$ maps a state to a distribution over actions. Following the work of Roijers et al. (2013), given $\pi \in \Pi$, the induced value function $V^{\pi}(s) \in \mathbb{R}^d$ returns the expected sum of discounted rewards over the interaction trajectory with initial state $s$:

$$V^{\pi}(s) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(a_t, s_t)\Big] \quad \text{with } s_t \sim P(a_{t-1}, s_{t-1}),\ a_{t-1} \sim \pi(s_{t-1}),\ s_0 = s. \tag{1}$$

Let $\mu : S \to [0, 1]$ be the probability distribution over initial states. The expected value function is:

$$V^{\pi}_{\mu} := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(a_t, s_t)\Big] \quad \text{with } s_t \sim P(a_{t-1}, s_{t-1}),\ a_{t-1} \sim \pi(s_{t-1}),\ s_0 \sim \mu. \tag{2}$$

That is, $V^{\pi}_{\mu} = \mathbb{E}_{s_0 \sim \mu} V^{\pi}(s_0)$. Let $V^{\pi} = [V^{\pi}(s) \mid s \in S] \in \mathbb{R}^{d \times |S|}$, $V(s) = \{V^{\pi}(s) \mid \pi \in \Pi\}$, $V_{\mu} = \{V^{\pi}_{\mu} \mid \pi \in \Pi\}$ and $V = \{V^{\pi} \mid \pi \in \Pi\}$. The Bellman equation (Bellman, 2003) can be written as:

$$V^{\pi}(s) = \big[R(s) + \gamma V^{\pi} P(s)\big]\, \pi(s) \quad \text{for } s \in S. \tag{3}$$

In RL, we are interested in finding a $\pi$ that maximizes $V^{\pi}(s)$. When $d = 1$, the regular order defined on $\mathbb{R}$ is adopted, and the optimal policy gives the greatest $V^{\pi}(s)$. For $d > 1$, we consider the Pareto order (PO): for real-valued tensors $u, v$ of the same shape, $u \succeq v$ if every entry of $u$ is not less than its counterpart in $v$. For a set of tensors $C$ of the same shape, $v \in C$ is Pareto efficient (PE) if for all $u \in C$, either $v \succeq u$ or $v \npreceq u$. (A set may have multiple PE elements.)

In this paper, we are interested in three types of PE policies that are not carefully distinguished in the existing literature (Roijers et al., 2013; Song et al., 2020; Abdolmaleki et al., 2020). Given an initial state $s \in S$ and an initial-state distribution $\mu$, $\pi \in \Pi$ is single-state PE (SPE) if $V^{\pi}(s)$ is PE in $V(s)$, and it is distributed initial state PE (DPE) if $V^{\pi}_{\mu}$ is PE in $V_{\mu}$. Likewise, $\pi \in \Pi$ is aggregate PE (APE) if $V^{\pi}$ is PE in $V$. Let $\Pi^*_s$, $\Pi^*_{\mu}$ and $\Pi^*$ denote the sets of policies that are SPE, DPE and APE, respectively. In Sec 4.2, assuming the Markov chain is ergodic, we show that SPE and DPE are equivalent and have the relationship with APE demonstrated in Fig. 1. Since we prove that the sets $\Pi^*_s$ coincide for all $s \in S$, we may omit the state we are referring to when discussing SPE.

Optimizing MORL using linear scalarization (LS). It is common to convert MORL problems to SORL problems by LS (White, 1993). To do so, take a nonzero vector $w \in \mathbb{R}^d$ and scalarize the reward vector via the dot product $w^{\top} R(a, s)$. In fact, for any $\pi$, we can left-multiply Eq (3) by $w^{\top}$ to see that $w^{\top} V^{\pi}$ is the induced value function of the associated SORL problem. We refer to the associated SORL problem with weight $w$ as SORL(w).

In this paper, we rely on the intimate relationship between MORL and the associated SORL problems to characterize the properties of PE policies (of both types). Our characterization shows that LS does not necessarily inhibit agents from finding desired PE policies, and sheds light on the challenges that modern MORL algorithms face when searching for PE policies.
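The claim that $w^{\top} V^{\pi}$ is the value function of SORL(w) can be checked numerically. The following is a minimal sketch, assuming a small randomly generated MDP (the sizes, the seed and the helper name `evaluate_policy` are illustrative choices, not part of the paper). It stores values as arrays of shape (states, objectives), i.e. the transpose of the $d \times |S|$ convention used in the text.

```python
import numpy as np

def evaluate_policy(R, P, pi, gamma):
    """Solve the Bellman equation for a stationary policy.

    R:  (S, A, d) vector rewards, P: (S, A, S) transition probabilities,
    pi: (S, A) action probabilities. Returns V with shape (S, d).
    """
    r_pi = np.einsum("sa,sad->sd", pi, R)   # expected reward vector per state
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transition matrix
    n_states = P_pi.shape[0]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Hypothetical random MDP used only for this check.
rng = np.random.default_rng(0)
S, A, d, gamma = 4, 3, 2, 0.9
R = rng.normal(size=(S, A, d))
P = rng.dirichlet(np.ones(S), size=(S, A))  # each P[s, a] is a distribution
pi = rng.dirichlet(np.ones(A), size=S)
w = np.array([0.3, 0.7])

V_vec = evaluate_policy(R, P, pi, gamma)                     # vector values
V_scalar = evaluate_policy((R @ w)[..., None], P, pi, gamma)[:, 0]
ls_gap = np.max(np.abs(V_vec @ w - V_scalar))                # should be ~0
```

Because policy evaluation is linear in the rewards, projecting the vector value onto `w` and evaluating the scalarized rewards give the same result up to floating-point error.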

3. FINE-GRAINED CONTROL OF INDUCED VALUE FUNCTIONS

Figure 2: The induced value functions of selected policies in Ex 1. Here, $\pi_0$: $\lambda_0 = \lambda_1 = 0.5$; $\pi$: $\lambda_0 = 0.7$, $\lambda_1 = 0.5$. The pink line segments are the value function range for $\pi \in \Phi(s_0, \pi_0)$, and the light blue patch is the value function range $V$.

Figure 3: The induced value functions of selected policies in a two-state MORL problem with the configuration given in Ex 1. Here, $\pi_0$: $\lambda_0 = \lambda_1 = 0.5$; $\pi_a$: $\lambda_0 = 0.7$, $\lambda_1 = 0.5$; $\pi_b$: $\lambda_0 = 0.5$, $\lambda_1 = 0.7$. Starting from $\pi_0$, the pink area corresponds to the value functions of $\pi$ obtained by adjusting $\pi_0(s_0)$ (i.e., $\lambda_0$), the green area is the one obtained by adjusting $\pi_0(s_1)$ (i.e., $\lambda_1$) and the grey area is the one obtained by adjusting both. The red vector indicates the shifting direction when increasing $\lambda_0$ and $\lambda_1$ at the same rate.

In this section, we study the dynamics of the induced value functions resulting from policy alterations in MORL problems. Our discussion starts by characterizing the effects of single-state policy adjustments. We will show that, unlike in SORL, it is not always possible to optimize the value function by optimizing single-state actions. This suggests that the popular Bellman operator, which optimizes policies on a single-state basis, does not sufficiently improve a policy's performance in the MORL setting. The difficulties faced by single-state optimization techniques reveal the intrinsic difference between SORL and MORL optimization, which motivates us to investigate how to adjust multiple states' actions to jointly improve the induced function values. To make our theoretical results intuitive, we will use the following example throughout this paper.

Example 1 Consider a two-objective problem with $S = \{s_0, s_1\}$ and $A = \{a_0, a_1\}$. For each state $s$, the chance of staying in $s$ is 50% for both actions. Let $R(a_0, s_0) = [1, 5]$, $R(a_1, s_0) = [5, 1]$, $R(a_0, s_1) = [10, 1]$, $R(a_1, s_1) = [5, 2]$, and $\gamma = 0.5$. The induced value function of $\pi$ satisfies:

$$V^{\pi}(s_0) = \left( \begin{bmatrix} 1 & 5 \\ 5 & 1 \end{bmatrix} + \gamma V^{\pi} \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix} \right) \pi(s_0), \qquad V^{\pi}(s_1) = \left( \begin{bmatrix} 10 & 5 \\ 1 & 2 \end{bmatrix} + \gamma V^{\pi} \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix} \right) \pi(s_1).$$

Since $|A| = 2$, $\lambda_i := \Pr(\text{taking } a_0 \text{ in } s_i)$, $i \in \{0, 1\}$, is sufficient to specify a policy because $\Pr(\text{taking } a_1 \text{ in } s_i) = 1 - \lambda_i$.
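The value functions in Ex 1 can be computed directly by solving the linear Bellman system. The sketch below is illustrative only; the helper name `induced_value` and the (state, objective) array layout are our own choices rather than notation from the paper.

```python
import numpy as np

GAMMA = 0.5
# R[s, a] is the reward vector of action a in state s, as given in Ex 1.
R = np.array([[[1.0, 5.0], [5.0, 1.0]],
              [[10.0, 1.0], [5.0, 2.0]]])
# Every action leads to s0 or s1 with probability 0.5 each.
P = np.full((2, 2, 2), 0.5)

def induced_value(lam0, lam1, gamma=GAMMA):
    """Return V^pi as a (2, 2) array whose row i is V^pi(s_i)."""
    pi = np.array([[lam0, 1.0 - lam0], [lam1, 1.0 - lam1]])
    r_pi = np.einsum("sa,sad->sd", pi, R)   # expected reward per state
    P_pi = np.einsum("sa,sat->st", pi, P)   # induced Markov chain
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

V0 = induced_value(0.5, 0.5)   # pi_0: lambda_0 = lambda_1 = 0.5
Va = induced_value(0.7, 0.5)   # pi_a: lambda_0 = 0.7, lambda_1 = 0.5
```

For $\pi_0$ this gives $V^{\pi_0}(s_0) = [8.25, 5.25]$ and $V^{\pi_0}(s_1) = [12.75, 3.75]$; raising $\lambda_0$ to 0.7 lowers the first objective and raises the second in both states.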

3.1. ADJUSTING THE POLICY IN A SINGLE STATE

In this section, we discuss the properties of the induced value functions when the policy is modified in a single state. In particular, given $s_0 \in S$ and $\pi_0 \in \Pi$, we temporarily restrict our attention to the policies equal to $\pi_0$ in all states except $s_0$; this is the set:

$$\Phi(s_0, \pi_0) := \{\pi \in \Pi : \pi(s) = \pi_0(s) \text{ for } s \neq s_0\}.$$

Fig 2 plots the value function's range with $\pi_0$ in Ex 1: $\lambda_0 = \lambda_1 = 0.5$. The plot shows that when $\pi(s_0)$ is changed (i.e., $\pi \in \Phi(s_0, \pi_0)$), its induced value function moves along a line segment in both states, with the same moving direction but at different rates. For instance, in Fig 2, we mark the value function of $\pi$ by green dots, where $\pi$: $\lambda_0 = 0.7$, $\lambda_1 = 0.5$. In comparison to $\pi_0$ (blue dots), $V^{\pi}$ moves faster in $s_0$ than in $s_1$, which is intuitive because the adjustment is made in $s_0$ and should change its value function by the greatest magnitude. Our observations here indeed hold in general:

Proposition 1 For $\pi_0, \pi_1 \in \Phi(s_0, \pi_0)$, let $\pi_{\alpha} = (1 - \alpha)\pi_0 + \alpha\pi_1$. Then

$$\frac{\partial V^{\pi_{\alpha}}(s)}{\partial \alpha} = \frac{\partial V^{\pi_{\alpha}}(s_0)}{\partial \alpha}\, \mathbb{E}\big[\gamma^{X(s; s_0)}\big], \tag{5}$$

where the random variable $X(s; s_0)$ is the number of steps to reach $s_0$ starting from $s$; its distribution is identical for $\pi_0$, $\pi_1$, and $\pi_{\alpha}$. Moreover,

$$V^{\pi_{\alpha}} = (1 - \beta(\alpha; s_0))\, V^{\pi_0} + \beta(\alpha; s_0)\, V^{\pi_1}, \quad \text{where } \beta(\alpha; s_0) = \frac{\alpha \varphi_1}{(1 - \alpha)\varphi_0 + \alpha \varphi_1} \tag{6}$$

and $\varphi_i \in [1 - \gamma, 1]$ is a scalar depending on $\pi_i$ and the RL problem settings but independent of $\alpha$.

In Prop 1, Eq (5) says that when $\pi_{\alpha}(s_0)$ moves from $\pi_0(s_0)$ to $\pi_1(s_0)$, the value functions in all states move along the same direction $\partial V^{\pi_{\alpha}}(s_0) / \partial \alpha$, while the moving rate is scaled by $\mathbb{E}[\gamma^{X(s; s_0)}]$ for state $s \in S$. We note that $X(s_0; s_0) = 0$ and $X(s; s_0) \geq 1$ for all $s \neq s_0$. Therefore, the induced value function always changes the most drastically in $s_0$, which is what we observed in Fig 2.
Eq (6) shows that, if we modify the policy in $s_0$ by letting $\pi_{\alpha}(s_0)$ be a convex combination of $\pi_0(s_0)$ and $\pi_1(s_0)$ with $\pi_1 \in \Phi(s_0, \pi_0)$, then $V^{\pi_{\alpha}}(s_0)$ is a convex combination of $V^{\pi_0}(s_0)$ and $V^{\pi_1}(s_0)$. Consider the example presented in Fig 2: every policy $\pi \in \Phi(s_0, \pi_0)$ can be seen as a convex combination of $\pi_0$ and some policy $\pi_1 \in \Phi(s_0, \pi_0)$ with $V^{\pi_1}(s_0)$ on the boundary (where $\pi_1$ depends on $\pi$). Moreover, every point between $V^{\pi_0}(s_0)$ and $V^{\pi_1}(s_0)$ corresponds to a policy in $\Phi(s_0, \pi_0)$.

Single-state optimization is not sufficient to improve policies in MORL. It is known that, in SORL problems, a single-state action optimization is sufficient to increase the value function in all states (Sutton & Barto, 2018, p78). Unfortunately, this does not generally hold in MORL problems. To give a counterexample, consider the setting in Ex 1. Given $\pi_0$: $\lambda_0 = \lambda_1 = 0.5$, we plot the induced values of $\pi$ obtained by modifying exactly one state's policy in Fig 3. Here, the pink line segment is the set of value functions obtained by modifying $\pi_0(s_0)$. Similarly, the green line segment is obtained by changing $\pi_0(s_1)$. As we can observe, when updating a single state's policy, one objective must be traded off to optimize the other one. This can be clarified by considering the policy optimization of $s_0$: taking actions $a_0$ and $a_1$ gives the rewards $[1, 5]$ and $[5, 1]$ respectively, so the immediate expected reward for this state is $[-4\lambda_0 + 5, 4\lambda_0 + 1]$. Thus, no single-state policy adjustment will improve both objectives.
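Eq (5) of Prop 1 can be sanity-checked on Ex 1. The sketch below, which assumes the Ex 1 configuration and uses a finite-difference derivative (our own choice of method), compares the rate of change of the value function in $s_1$ against that in $s_0$ when only $\lambda_0$ is perturbed. In Ex 1, $s_0$ is reached with probability 0.5 at every step, so $X(s_1; s_0)$ is geometric and $\mathbb{E}[\gamma^{X(s_1; s_0)}] = \sum_{k \geq 1} (0.5\gamma)^k$.

```python
import numpy as np

GAMMA = 0.5
R = np.array([[[1.0, 5.0], [5.0, 1.0]],
              [[10.0, 1.0], [5.0, 2.0]]])   # R[s, a] from Ex 1
P = np.full((2, 2, 2), 0.5)                 # uniform transitions of Ex 1

def induced_value(lam0, lam1, gamma=GAMMA):
    """Return V^pi as a (2, 2) array whose row i is V^pi(s_i)."""
    pi = np.array([[lam0, 1.0 - lam0], [lam1, 1.0 - lam1]])
    r_pi = np.einsum("sa,sad->sd", pi, R)
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Central finite difference with respect to lambda_0 (policy changed in s0 only).
eps = 1e-6
dV = (induced_value(0.5 + eps, 0.5) - induced_value(0.5 - eps, 0.5)) / (2 * eps)

# E[gamma^X(s1; s0)] with P(X = k) = 0.5^k for k >= 1 (series truncated).
expected_discount = sum((0.5 * GAMMA) ** k for k in range(1, 200))

# Per-objective ratio between the moving rate in s1 and the one in s0.
rate_ratio = dV[1] / dV[0]
```

Both objectives move in $s_1$ at one third of the rate observed in $s_0$, which matches the expected discount of 1/3 for $\gamma = 0.5$.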

3.2. ADJUSTING THE POLICY IN MULTIPLE STATES

Figure 4: Left: $\gamma = 0.5$. The feasible changes on $\pi_0$ in Ex 1 that increase (decrease) the value function in two states (under the PO). Middle: $\gamma = 0.7$. Right: $\gamma = 0.9$.

The insufficiency of optimizing single-state actions motivates us to combine policy adjustments in multiple states to jointly improve the value functions. In Fig 3, the vectors in red give the induced value function's moving directions when we increase $\lambda_0$ and $\lambda_1$ at the same rate. We observe that both objectives are improved in $s_1$, but not in $s_0$. This suggests that we can adjust multiple states' policies simultaneously to jointly improve the value function in one state. This leads to the natural question: are there rates of change for $\lambda_0$ and $\lambda_1$ that improve the induced values in both states? The answer is negative. The first graph of Fig 4 plots the feasible changes of $\lambda_0$ and $\lambda_1$ that improve $V^{\pi_0}(s_0)$ and $V^{\pi_0}(s_1)$. When $\gamma = 0.5$, the feasible areas (blue and yellow) that improve both objectives in each of the two states are disjoint, and hence the two states cannot be improved simultaneously.

The mutual exclusiveness of the two areas is caused by the attenuation effect of the discount factor. Consider the induced value function obtained by increasing $\lambda_0$ from 0.5 to 0.7 for $\pi_0$ (the steel-blue dots in Fig. 3). This adjustment makes the induced function increase in Obj. 1 at the cost of Obj. 0 for both states. Since we are making the adjustment from the perspective of $s_0$, the value function change is greatest in $s_0$ and is dampened by the discount factor $\gamma$ when viewed from $s_1$. Similar observations can be made by increasing $\lambda_1$ from 0.5 to 0.7. This attenuation of the reward propagation makes it impossible for the induced values to move in the same direction in all states, which prevents them from being optimized simultaneously. The last two plots of Fig. 4 show that when $\gamma$ increases, the feasible areas to improve induced values in both states overlap, which corroborates our claim.
We generalize our observations by proving:

Proposition 2 Assume $\pi_0 \in \Pi$ is not optimal in the associated scalar problem for all $w \neq 0$. Let $s_0 \in S$. Then any neighbourhood of $\pi_0$ contains $\pi_1, \pi_2 \in \Pi$ such that for any $u \in \mathbb{R}^d$ we have

$$V^{\pi_1}(s_0) - V^{\pi_0}(s_0) = \xi_1 u \quad \text{and} \quad V^{\pi_2}_{\mu} - V^{\pi_0}_{\mu} = \xi_2 u \quad \text{for some } \xi_1, \xi_2 > 0. \tag{7}$$

Moreover, if the MDP is ergodic, as $\gamma \to 1$, $V^{\pi_1}(s) - V^{\pi_0}(s) \to \xi_1 u$ for all $s \in S$.

Figure 5: $V^{\pi_0}(s_0)$ is not optimal for any $w$ but cannot move along $u$.

Prop 2 says that, under some weak conditions, we can move the induced value of $\pi_0$ in a state (or the expected induced value) along any direction. Setting $u = \mathbf{1}$, we improve the induced values. When the MDP is ergodic, as $\gamma \to 1$, the values in all states move along $u$ and will be optimized at the same time.

Remark 1 We note that Prop 2 is non-trivial because a policy's suboptimality over all associated scalar problems does not automatically imply that the induced value function can move in any direction. For example, in Fig 5, $V^{\pi_0}(s_0)$ is not optimal in any associated SORL problem but cannot move along $u$: Prop 2 says $V(s_0)$ cannot have a shape like this.

Remark 2 The first plot of Fig 4 shows that $\pi_0$ cannot be further improved in both states at the same time. This implies $\pi_0$ is in fact APE. However, Fig 2 shows that $V^{\pi_0}(s_0)$ is in the middle of the induced value function range. Thus, there exists a policy $\pi$ that has an induced value in $s_0$ greater than $V^{\pi_0}(s_0)$, and hence $\pi_0$ is not SPE in $s_0$. This observation tells us that SPE and APE are different in general.
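The disjoint and overlapping feasible areas described for Fig 4 can be reproduced numerically on Ex 1. The sketch below is an illustration under our own assumptions: it linearizes the value function around $\pi_0$ with finite differences and scans directions $(\Delta\lambda_0, \Delta\lambda_1)$ on a grid of angles, keeping those that strictly increase both objectives in both states (a stricter test than the Pareto order). The function name `improving_directions` and the grid resolution are hypothetical choices.

```python
import numpy as np

R = np.array([[[1.0, 5.0], [5.0, 1.0]],
              [[10.0, 1.0], [5.0, 2.0]]])   # R[s, a] from Ex 1
P = np.full((2, 2, 2), 0.5)                 # uniform transitions of Ex 1

def induced_value(lam0, lam1, gamma):
    """Return V^pi as a (2, 2) array whose row i is V^pi(s_i)."""
    pi = np.array([[lam0, 1.0 - lam0], [lam1, 1.0 - lam1]])
    r_pi = np.einsum("sa,sad->sd", pi, R)
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

def improving_directions(gamma, n_angles=3600, eps=1e-6, tol=1e-6):
    """Angles of (d lambda_0, d lambda_1) that raise every objective in every state."""
    d0 = (induced_value(0.5 + eps, 0.5, gamma)
          - induced_value(0.5 - eps, 0.5, gamma)) / (2 * eps)
    d1 = (induced_value(0.5, 0.5 + eps, gamma)
          - induced_value(0.5, 0.5 - eps, gamma)) / (2 * eps)
    feasible = []
    for theta in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
        change = np.cos(theta) * d0 + np.sin(theta) * d1  # (state, objective)
        if np.all(change > tol):
            feasible.append(theta)
    return np.array(feasible)

feasible_by_gamma = {g: improving_directions(g) for g in (0.5, 0.7, 0.9)}
```

With these settings, no direction improves both states for $\gamma = 0.5$, while a cone of improving directions appears for $\gamma = 0.7$ and widens for $\gamma = 0.9$.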

4. CONVEXITY AND PARETO EFFICIENT POLICIES

The analysis of the induced value function's behaviour in Sec 3 provides us with powerful tools to characterize their properties. In this section, we use these tools to show the convexity of the induced function's range and the relationships among the three types of PE.

4.1. CONVEXITY OF THE INDUCED FUNCTION RANGES

For more than a decade, it has been considered true that the induced value functions' ranges are irregularly shaped (for stationary policies) (Vamplew et al., 2008; Roijers et al., 2013; Hayes et al., 2022). In fact, this is the key reason behind the belief that LS is not powerful enough to find all PE policies. Prop 3 shows that this belief is not true, and the ranges of the functions are actually convex.

Proposition 3 For $s \in S$, $V(s)$ is convex. Also, $V_{\mu}$ is convex.

The convexity of $V(s)$ can be shown by repeatedly applying Prop 2 to construct a path between $V^{\pi_0}(s)$ and $V^{\pi_1}(s)$ with $\pi_0, \pi_1 \in \Pi$. Roughly speaking, for any point $v$ on the path, starting from $V^{\pi_0}(s)$, we can repeatedly use Prop 2 to construct a sequence of policies whose induced values approach $v$. Thus, the path is included in $V(s)$, and by definition, $V(s)$ is convex. We can apply a similar idea to show that $V_{\mu}$ is convex.

Remark 3 Prop 3 can also be proved as a corollary of the convexity of the occupancy measure, initially derived in constrained Markov decision theory (Kallenberg, 1983; Puterman, 1994; Altman, 1999). We discuss their relationship and provide a second proof of Prop 3 in Appx C.

LS is not a bottleneck in finding PE policies. The convexity of the induced value functions' ranges suggests that we can potentially find all SPE (DPE) policies through LS. In particular, we have:

Proposition 4 For $s \in S$, $\pi \in \Pi^*_s$ if $\pi$ is optimal in a SORL(w) with some $w \succ 0$. Also, if $\pi \in \Pi^*_s$, $V^{\pi}(s)$ is optimal in a SORL(w) with a nonzero $w \succeq 0$.

As a result, an SPE (DPE) policy achieves optimality in some associated SORL problem. This also suggests that we can potentially find all SPE (DPE) policies by choosing different weights in LS.

Problems of the existing LS-based algorithms. We suggest that there are two significant reasons why many algorithms cannot find a rich set of PE policies: determinism and numerical instability. The majority of RL algorithms favour determinism.
While it is well known that an optimal deterministic policy always exists for SORL problems (Puterman, 1994, Ch 6), almost all PE value functions for a MORL problem require stochastic policies. In Fig. 6, we plot the value functions of state $s_0$ for different policies in Ex 1. We see that most PE policies are stochastic (squares) while the deterministic ones can only cover the vertices (stars). This observation shows that unless we ensure current algorithms can favour stochastic policies, we have implicitly excluded almost all PE policies.

Additionally, it is numerically impossible to obtain some SPE policies in practice although they can be found in theory. In fact, almost all nonzero weights $w \succeq 0$ correspond to the SPE policies on the vertices. For instance, in Fig. 6, we show the choice of weights for finding a specific SPE policy in the upper-right corner. While there is a wide range of weight selections for finding the SPE policies on the vertices, only those on the boundary of the green-cyan/cyan-purple patches can be used to find the stochastic policies on the upper/right boundary of the polytope. Specifically, only weight vectors normal to the Pareto front can be used to find stochastic policies. In practice, we do not know how to find such weights; even if we did, they could not be picked reliably, as a tiny perturbation of them gives a weight corresponding to a vertex. Besides, these weights are shared by a set of stochastic SPE policies (on the same facet of the polytope). Thus, the algorithm would produce a random policy in the set with an undesired value function even if the right weight is picked.

Adding a strongly concave term fixes the problems. Let $f : \Delta^{|A|} \to \mathbb{R}$ be a strongly concave function of action-taking distributions. We then replace the regular reward $R(a_t, s_t)$ with $\tilde{R}(a_t, s_t) = R(a_t, s_t) + \alpha f(\pi(s_t))\, \mathbf{1}$, where $\mathbf{1} \in \mathbb{R}^d$ is a vector of ones and $\alpha > 0$ controls the strength of the augmentation effect.
Then the induced value function of a policy $\pi$ for initial state $s \in S$ under this augmented setting becomes:

$$V^{\pi}_{\alpha f}(s) := \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} \big(R(a_t, s_t) + \alpha f(\pi(s_t))\, \mathbf{1}\big)\Big] \quad \text{with } s_t \sim P(a_{t-1}, s_{t-1}),\ a_{t-1} \sim \pi(s_{t-1}),\ s_0 = s. \tag{8}$$

Figure 7: The effects on the induced value functions at $s_0$ for selected policies in Ex 1 of adding strongly concave terms to the immediate rewards with different $\alpha$. Here, $f$ returns the entropy of the action-taking distribution. The PE elements of $V_{\alpha f}(s_0)$ are marked in blue, and dots of the same colour among the four plots correspond to the same policy.

Let $V_{\alpha f}(s) = \{V^{\pi}_{\alpha f}(s) \mid \pi \in \Pi\}$ and $W^{+} = \{w \succeq 0 \mid r_1 \leq \|w\|_1 \leq r_2\}$ for some $r_1, r_2 > 0$, and consider the set of PE elements in $V_{\alpha f}(s)$. We observe that the augmentation makes the shape of this PE element set strictly convex. Thus, for every $w \in W^{+}$, there is a unique $V^{\pi}_{\alpha f}(s) \in V_{\alpha f}(s)$ that has the maximum projection on $w$. Let $g_s$ denote this unique correspondence from $W^{+}$ to the set of PE elements in $V_{\alpha f}(s)$:

$$g_s(w) = \operatorname*{argmax}_{v \in V_{\alpha f}(s)} w^{\top} v. \tag{9}$$

From Fig 7, we can also observe that the strict convexity of the PE element set makes $g_s$ (uniformly) continuous. As we rotate $w$ clockwise, $g_s(w)$ slides from the left end of the blue curve to the right end for the last three plots; in contrast, for the first one, it jumps from the top-left vertex to the middle one and then to the bottom-right one. The continuity of $g_s$ makes it numerically possible to pick a good $w$ with $g_s(w)$ close to the desired PE element in $V_{\alpha f}(s)$. We summarize our observations in Prop 5.

Proposition 5 The function $g_s$ given in (9) is well-defined, surjective and uniformly continuous.

Finally, the strongly concave term naturally injects a preference for stochastic policies. Therefore, all the problems of the existing LS-based methods mentioned in Sec 4.1 have been fixed.

Remark 4 The function $g_s$ corresponds to the extended target value function considered by Abels et al. (2019) and Yang et al. (2019).
In their work, they implemented an extended Q-network, $Q(s, a, w)$, and minimized $\|g_s(w) - \mathbb{E}_a[Q(s, a, w)]\|^2$ for all $s \in S$, $a \in A$ and $w \in \Phi \subseteq W^{+}$. In Sec 5, we will see that, as they adopted regular rewards without augmentation, the target value function $g_s$ is sensitive to the input $w$, making the training process unstable and lowering the algorithms' performance (Appx D describes the causes of this training instability). We will instead propose a concave-augmented Pareto Q-learning algorithm (CAPQL). Our empirical study shows that the algorithm's performance improves with significantly more stable training trajectories.

Remark 5 Fig 7 suggests that a smaller $\alpha$ preserves more information about the original problem but makes $g_s(\cdot)$ less robust to perturbations and lowers the stability of the algorithm approximating $g_s$. Likewise, a larger $\alpha$ improves the stability but would harm the algorithm's expected performance.
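The jump behaviour of unaugmented LS described above can be illustrated on Ex 1. The sketch below is a minimal illustration under our own assumptions: it enumerates the four deterministic policies, relies on the standard fact that each SORL(w) admits a deterministic optimal policy, and sweeps unit weights $w = (\cos\theta, \sin\theta)$ over integer angles between 1 and 89 degrees. The helper names are hypothetical.

```python
import numpy as np
from itertools import product

GAMMA = 0.5
R = np.array([[[1.0, 5.0], [5.0, 1.0]],
              [[10.0, 1.0], [5.0, 2.0]]])   # R[s, a] from Ex 1
P = np.full((2, 2, 2), 0.5)                 # uniform transitions of Ex 1

def induced_value(lam0, lam1, gamma=GAMMA):
    """Return V^pi as a (2, 2) array whose row i is V^pi(s_i)."""
    pi = np.array([[lam0, 1.0 - lam0], [lam1, 1.0 - lam1]])
    r_pi = np.einsum("sa,sad->sd", pi, R)
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Values at s0 of the four deterministic policies (lambda_i in {0, 1}).
vertices = {lams: induced_value(*lams)[0]
            for lams in product((0.0, 1.0), repeat=2)}

def ls_optimum(theta_deg):
    """Deterministic policy maximising w^T V(s0) for w = (cos theta, sin theta)."""
    theta = np.deg2rad(theta_deg)
    w = np.array([np.cos(theta), np.sin(theta)])
    return max(vertices, key=lambda lams: float(w @ vertices[lams]))

selected = [ls_optimum(t) for t in range(1, 90)]
distinct = sorted(set(selected))
```

Across the whole sweep only three deterministic policies are ever selected, so the LS optimum at $s_0$ moves between three vertices and never lands on the stochastic PE points between them.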

4.2. RELATIONSHIPS AMONG THREE TYPES OF PE

Before we introduce our new LS-based MORL algorithm, we complete our theoretical discussion about PE. In particular, we prove that the three types of PE have the relationship summarized in Fig. 1. Throughout this section, we assume that the MDP is ergodic.

SPE is state-independent and equivalent to DPE. According to Prop 4, if $\pi \in \Pi^*_s$ for $s \in S$, then it must be the optimal policy for some associated scalar problem with some nonzero $w \succeq 0$. When $w \succ 0$, we apply Prop 4 to conclude that $\pi \in \Pi^*_s$ for all $s \in S$ and $\pi \in \Pi^*_{\mu}$. If $w$ contains zero entries, the problem can be approximated by assigning an arbitrarily small positive weight to the zero entries. Then, the problem is reduced to the case with $w \succ 0$. Since $\pi$ is optimal in the associated problem of all states, none of its value functions can be improved independently or in aggregation. Hence, $\pi \in \Pi^*$ as well. We summarize the derived results in:

Proposition 6 If $\pi \in \Pi^*_s$ for some state $s \in S$, then for all $s \in S$, $\pi \in \Pi^*_s$. Thus, the sets of single-state PE policies coincide for all initial states and are a subset of $\Pi^*$.

Using a similar derivation that links the PE policies to the optimal solutions of the corresponding associated scalar problem also proves:

Proposition 7 A policy is SPE if and only if it is DPE. That is, for $s \in S$, $\Pi^*_s = \Pi^*_{\mu}$.

SPE implies APE but not vice versa in general. For a selected state $s \in S$, a policy $\pi$ is SPE as long as its induced value function $V^{\pi}(s)$ is PE in $V(s)$. The function values for other states are not relevant. In contrast, APE involves the induced function value in all states. A policy $\pi$ is APE if we have to trade off the value functions of some states to improve those of the others. For ergodic SORL problems, the induced value function of a policy $\pi$ reaches the maximum in one state if and only if it reaches the maximum in all states (Prop 2.1.2 in Bertsekas, 2022). This implies that SPE and APE are equivalent in a SORL problem.
However, the equivalence does not generally hold for MORL problems. In Rmk 2, we presented a case where $\pi_0 \in \Pi^*$ but $\pi_0 \notin \Pi^*_{s_0}$. Besides, Prop 6 shows that if $\pi \in \Pi^*_s$ for some $s \in S$, then $\pi \in \Pi^*$. As a result, $\Pi^*_s \subset \Pi^*$, but the two sets do not necessarily coincide. As we have noted in Sec 3.2, this proper subset relationship is due to the attenuation of the reward propagation caused by the discount factor. The attenuation makes the moving direction of the value function differ among states and causes the feasible sets of improving adjustments to be disjoint (see Fig 4). When $\gamma \to 1$, the moving directions of all states converge (Prop 2). Therefore, the feasible sets of improving adjustments for different states will also converge and eventually overlap for sufficiently large $\gamma$ (see the last two plots of Fig 4). Improvements can then be made in all states until they reach the boundaries of the value functions' ranges. In other words:

Proposition 8 For all $s \in S$, as $\gamma \to 1$, the SPE policy set $\Pi^*_s$ approaches the APE policy set $\Pi^*$.

5. CONCAVE-AUGMENTED PARETO Q-LEARNING

Motivated by the discussions in Sec 4.1, in this section, we develop a new Q-learning algorithm with the reward augmented by a strongly concave term. We call our new algorithm concave-augmented Pareto Q-learning (CAPQL).

5.1. MORL PROBLEM WITH AGNOSTIC WEIGHT PREFERENCE

Our CAPQL algorithm is designed for solving MORL problems where the preference weights for the objectives are (potentially) different between episodes and are not known in advance. The setting was initially considered by Abels et al. (2019) to propose a multi-objective Q-network (MOQ). In particular, the problem considers a set of weights Φ ⊆ W + . For each episode, a preference weight w ∈ Φ is given, and the algorithm is expected to maximize the sum of the rewards projected onto w. Thus, the algorithm has to handle all possible weights in Φ.

5.2. IMPLEMENTATION OF CAPQL

Following Abels et al.'s work, we consider an extended Q-network, $Q(s, a, w)$, and train it to approximate the Q-values of an optimal policy of SORL(w) for all $w \in \Phi$. Unlike MOQ, which uses the reward $R(a_t, s_t)$ from the environment directly, CAPQL replaces it with $\tilde{R}(a_t, s_t) = R(a_t, s_t) + \alpha f(\pi(s_t; w))\, \mathbf{1}$, where $\alpha > 0$. Notice that scaling $w$ does not change the selection of the induced value function in $V_{\alpha f}(s)$ that has the greatest projection on it. Without loss of generality, we assume $\|w\|_1 = 1$. Furthermore, we set $f$ to be the entropy operator $H(q) = -\int q(a) \log q(a)\, da$ (discussions on the selection of $f$ are given in Appx E).

We train our algorithm by optimizing it over SORL(w), for all $w \in \Phi$. For a fixed $w$, the learning task of SORL(w) can be written as

$$\pi^{*}(\,\cdot\,; w) = \operatorname*{argmax}_{\pi(\,\cdot\,; w)} \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} \big(w^{\top} R(a_t, s_t) + \alpha H(\pi(s_t; w))\big)\Big],$$

which is obtained by projecting the value function defined in (8) onto $w$ and then taking the maximum over the policies conditioned on it. Interestingly, this is the MORL extension of the learning task considered in SAC (Haarnoja et al., 2018). Hence, we implement the algorithm with the Q-network and policy network conditioned on $w$. In each training step, we first sample a weight $w$ and follow the SAC method to train the policy and the Q-network conditioned on it. (The implementation details and pseudocode are given in Appx F, and the convergence property is discussed in Appx G.)

As discussed in Prop 5 and Rmk 4, the optimal policy $\pi$ of SORL(w) has the value function $g_s(w)$. Thus, the training aims to make $\mathbb{E}_{a \sim \pi(s; w)}[Q(s, a, w)]$ match $g_s(w)$ for all $s \in S$ and $w \in \Phi$. As Prop 5 shows, adding the entropy term makes the target $g_s(w)$ uniformly continuous with respect to $w$, which makes it easier to fit and more numerically stable (in Appx H, we empirically show this phenomenon by training the algorithm over Ex 1 and visualizing its $g_s(w)$). Hence, we should expect CAPQL to both converge faster and be more numerically stable during training.
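To build intuition for the entropy-augmented SORL(w) objective, the following sketch runs tabular soft value iteration on Ex 1. It is not the CAPQL implementation (which uses SAC with networks conditioned on $w$); it is a small stand-in, under the assumptions that the state and action spaces are finite, that $\|w\|_1 = 1$ so the augmentation projects to $\alpha H$, and that the standard soft Bellman backup $V(s) = \alpha \log \sum_a \exp(Q(s, a)/\alpha)$ applies. The function name and the weight grid are hypothetical.

```python
import numpy as np

GAMMA = 0.5
R = np.array([[[1.0, 5.0], [5.0, 1.0]],
              [[10.0, 1.0], [5.0, 2.0]]])   # R[s, a] from Ex 1
P = np.full((2, 2, 2), 0.5)                 # uniform transitions of Ex 1

def soft_value_iteration(w, alpha, gamma=GAMMA, n_iters=500):
    """Tabular soft value iteration for the entropy-augmented SORL(w).

    Returns pi[s, a], proportional to exp(Q[s, a] / alpha).
    """
    r = R @ w                                   # scalarised rewards, (S, A)
    V = np.zeros(r.shape[0])
    for _ in range(n_iters):
        Q = r + gamma * (P @ V)                 # soft Q-values, (S, A)
        Q_max = Q.max(axis=1)                   # stabilise the log-sum-exp
        V = Q_max + alpha * np.log(
            np.exp((Q - Q_max[:, None]) / alpha).sum(axis=1))
    Q = r + gamma * (P @ V)
    pi = np.exp((Q - Q.max(axis=1, keepdims=True)) / alpha)
    return pi / pi.sum(axis=1, keepdims=True)

pi_even = soft_value_iteration(np.array([0.5, 0.5]), alpha=1.0)

# Probability of a0 in s0 as the weight sweeps over w = (t, 1 - t).
ts = np.linspace(0.05, 0.95, 50)
max_jump = {}
for alpha in (0.05, 1.0):
    lam0 = np.array([soft_value_iteration(np.array([t, 1.0 - t]), alpha)[0, 0]
                     for t in ts])
    max_jump[alpha] = np.abs(np.diff(lam0)).max()
```

With $\alpha = 1$ the optimal action probability in $s_0$ changes gradually along the weight sweep, whereas with $\alpha = 0.05$ it switches almost abruptly near $w = (0.5, 0.5)$, mirroring the sensitivity to $w$ discussed for small $\alpha$.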

5.3. EXPERIMENTS

We test our algorithm over a multi-objective version of the MuJoCo environment. The reward vector was created by simply exposing the individual components that went into the regular scalar reward: 4 for details.) We restrict Φ to only contain weights within 22.5 degrees of the unit vector to ensure that w 0. Finally, we will also perform an ablation study to understand how the algorithm's performance changes with different strength of reward augmentation. We compare our method to two popular LS-based algorithms: MOQ (Abels et al., 2019) and EnvQ (Yang et al., 2019) . MOQ can be seen as a special case of CAPQL as α → 0 and EnvQ is its enhanced version. It has been shown that EnvQ enjoys a higher sampling efficiency and has a consistently better performance than MOQ on multiple MORL benchmarks. Since MOQ and EnvQ were proposed under the finite action setting, to adapt them to MuJoCo's continuous action space, we follow Tang & Agrawal ( 2020 We train each method five times with various random seeds and report the mean and standard deviation. In every step, we test them over ten randomly sampled weights. We observe that CAPQL has a consistently better performance over all benchmarks and enjoys a faster convergence speed. Additionally, compared to QEnv-ctn, CAPQL has a far more stable training trajectory over different random seeds. The relationship between the augmentation strength and CAPQL's performance. In Rmk 5, we discussed how the augmentation would affect the CAPQL's performance as its strength varies. In Fig 9 , we plot the training curves of CAPQL for Hopper with different α, which corroborates our claim. In particular, as α increases, the target value function g s (w) defined in Prop 5 becomes less sensitive to w. Thus, it can be learned more easily; meanwhile, we observe a faster convergence and stabler training trajectories over various random initial seeds. 
However, if α becomes too large (the case with α = 0.8), g s (w) will significantly deviate from the original one (i.e., when α = 0). Then, the algorithm's performance after convergence starts to drop.
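The restriction of $\Phi$ to weights within 22.5 degrees of the unit vector can be sketched as follows. This is only one possible reading: it assumes that "the unit vector" means the normalized all-ones direction, and it uses rejection sampling followed by rescaling so that the entries sum to one. The function `sample_weight` is a hypothetical helper, not the authors' sampler. For two or three objectives every accepted weight is strictly positive, but that guarantee does not hold automatically in higher dimensions, where rejection sampling also becomes slow.

```python
import math
import random

def sample_weight(dim, max_angle_deg=22.5, rng=random):
    """Rejection-sample a direction within max_angle_deg of the all-ones
    direction, then rescale it so that its entries sum to one."""
    cos_min = math.cos(math.radians(max_angle_deg))
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            continue
        # Cosine of the angle between v and (1, ..., 1) / sqrt(dim).
        cos = sum(v) / (norm * math.sqrt(dim))
        if cos >= cos_min:
            total = sum(v)  # positive because cos >= cos_min > 0
            return [x / total for x in v]

# Example: draw a few two-objective preference weights with a fixed seed.
example_rng = random.Random(0)
example_weights = [sample_weight(2, rng=example_rng) for _ in range(5)]
```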

6. DISCUSSION

This paper performed a rigorous analysis of the dynamics of the induced value functions resultant from policy alterations in MORL problems. We analyzed the behaviours of the functions when a single state's policy is altered and showed that this is insufficient to optimize the induced value functions in a MORL setting. We then discussed how to update a policy in multiple states to improve the value function of a specific state. We also showed that when γ → 1, the induced values of all states will be improved as well. These insights into the induced value function's properties helped us show the convexity of their range and prove that LS is sufficient to find all SPE (DPE) policies. The equivalence of SPE and DPE was shown, which are also equivalent to APE when γ → 1. Next, we showed why existing LS-based algorithms fail and proposed the CAPQL algorithm to address these issues; our empirical evaluation indicates CAPQL's superior performance and corroborates our theoretical analysis.

REPRODUCIBILITY

Theoretical Work All theoretical results have formal proofs provided in Appx B.

Empirical Work

We provide a detailed pseudocode description of our CAPQL implementation (Alg 1) in Appx F. The parameters used to train and implement all algorithms we analyzed are listed in Tables 1-3 in Appx I. Additionally, we have provided details on the environment configuration and our derived reward functions in Table 4 . The source code of our CAPQL implementation is available online: https://github.com/haoyelu/CAPQL.git.

A RELATED WORK

In recent years, there have been relatively few developments in the abstract analysis of the spaces involved in MORL problems. Much of the previous work done on analyzing the set of induced value functions has been done on a general state-space while allowing various kinds of deterministic or mixed policies (Feinberg & Shwartz, 1995; Vamplew et al., 2008; 2009; Roijers et al., 2013; Barrett & Narayanan, 2008; Moffaert & Nowé, 2014) . In particular, Feinberg & Shwartz (1995) found that the expected discounted sum of rewards V π (s) with a fixed initial state s is a convex set when π is allowed to be non-stationary. Moreover, they prove that Pareto optimality (at any given state) is equivalent to a linear scalarization (LS) problem under a specific weight vector. Similar convexity and Pareto results were found by Vamplew et al. (2008; 2009) , but instead of allowing a general non-stationary policy, they restricted their policies to the set of mixed deterministic policies. The key difference between our work and theirs is that we restrict our analysis to stochastic stationary policies. Hernández-Lerma & Romera (2004) also performed an analysis in a similar setting to us but focused on the feasibility of finding Pareto optimal solutions for specific weights. Mannor & Shimkin ( 2004) have also done similar work and required that there exists a common accessible state from every other state; ergodicity is a sufficient condition for their setting. However, their expected sum of rewards is un-discounted, which is not reflective of common modern formulations. They also focused significantly on directional policy optimization and proposed several algorithms that require mixed deterministic policies to function. Apart from the theoretical works, Abels et al. ( 2019) generalized the Q-function by conditioning it on the importance of the objectives. In their problem setting, the objective preference changes between different episodes and is not known in advance. 
Therefore, the learning task requires the algorithm to perform well over all potential weight selections. Yang et al. (2019) proposed an enhanced version of this approach (QEnv). They showed that their algorithm consistently performs better over multiple MORL benchmarks and is more sample efficient. Finally, Abdolmaleki et al. (2020) have developed an algorithm that allows users to choose objective preferences in a scale-invariant way by restricting the relative influence of each objective when improving the policy. It should be noted that significant efforts have been made to find alternatives to LS due to perceived and real drawbacks. For example, Van Moffaert et al. (2013) demonstrate that the Chebyshev metric can dominate LS in the discrete policy setting since it can be used to find Pareto efficient policies that are in the interior of the convex hull. While one of its primary benefits over LS no longer holds in the stochastic policy setting, it may be interesting to see if an extension of this methodology to our setting could still be beneficial. Alternatively, recent papers in the domain of concave utility reinforcement learning (CURL) have achieved some intriguing results (Geist et al., 2021; Zhang et al., 2020; Agarwal et al., 2022). In their recent work, Agarwal et al. (2022) seek to find a single policy that maximizes a concave utility function applied to $V^\pi(s)$. While their setting, methods, focus and insights are largely different from ours, they analyzed their framework's sample efficiency and developed an actor-critic method that reduces the variance of the policy gradients by directly shifting them with a state-dependent term.

B PROOFS

Given $s_0 \in S$ and $\pi_0 \in \Pi$, we temporarily restrict our attention to the policies equal to $\pi_0$ in all states except $s_0$; this is the set

$$\Phi(s_0, \pi_0) := \{\pi \in \Pi : \pi(s) = \pi_0(s) \text{ for } s \neq s_0\}. \tag{11}$$

Proposition 9 Let $\tilde{V}(s; s_0)$ and $X(s; s_0)$ denote the random variables that give, starting from $s \in S$, the sum of the discounted rewards and the number of steps before first reaching $s_0$. (When $s = s_0$, $\tilde{V}(s_0; s_0) = X(s_0; s_0) = 0$. The distributions of $\tilde{V}(s; s_0)$ and $X(s; s_0)$ do not depend on $\pi(s_0)$ and thus are identical for all $\pi \in \Phi(s_0, \pi_0)$.) Write

$$V(s_0) = \big[\mathbb{E}\,\tilde{V}(s'; s_0) \text{ for } s' \in S\big] \in \mathbb{R}^{d \times |S|}, \qquad \Gamma(s_0) = \big[\mathbb{E}\,\gamma^{X(s'; s_0)} \text{ for } s' \in S\big] \in \mathbb{R}^{1 \times |S|}. \tag{12}$$

For $\pi \in \Phi(s_0, \pi_0)$, we have

$$V^\pi = V(s_0) + V^\pi(s_0)\,\Gamma(s_0) = V(s_0) + \frac{Q(s_0)\,\pi(s_0)\,\Gamma(s_0)}{1 - \gamma\,\eta(s_0)\,\pi(s_0)}, \tag{13}$$

where $\eta(s_0) = \Gamma(s_0) P(s_0)$ and $Q(s_0) = R(s_0) + \gamma V(s_0) P(s_0)$. Moreover, $\eta(s_0)\pi(s_0) \in [0, 1]$.

Proof: Let $s'$ be the next state starting from $s_0$. For $\pi \in \Phi(s_0, \pi_0)$,

$$V^\pi(s_0) = \mathbb{E}\big[R(a, s_0) + \gamma \tilde{V}(s'; s_0) + \gamma \cdot \gamma^{X(s'; s_0)} \cdot V^\pi(s_0)\big] \tag{14}$$

$$= R(s_0)\pi(s_0) + \gamma \sum_{s'} \Big(\mathbb{E}\,\tilde{V}(s'; s_0) + \mathbb{E}\,\gamma^{X(s'; s_0)}\, V^\pi(s_0)\Big) \sum_{a} P(s_0, a, s')\,\pi(a, s_0), \tag{15}$$

where $\pi(a, s_0)$ is the probability of taking action $a$ in state $s_0$ when adopting policy $\pi$ and $P(s_0, a, s')$ is the transition probability of moving to state $s'$ when taking $a$ in $s_0$. That is,

$$V^\pi(s_0) = R(s_0)\pi(s_0) + \gamma V(s_0) P(s_0) \pi(s_0) + \gamma V^\pi(s_0)\,\eta(s_0)\pi(s_0), \tag{16}$$

where $\eta(s_0) = \Gamma(s_0) P(s_0)$, and it is easy to see $\eta(s_0)\pi(s_0) \in [0, 1]$. Rearranging it yields

$$V^\pi(s_0) = \frac{\big(R(s_0) + \gamma V(s_0) P(s_0)\big)\pi(s_0)}{1 - \gamma\,\eta(s_0)\pi(s_0)} = \frac{Q(s_0)\,\pi(s_0)}{1 - \gamma\,\eta(s_0)\pi(s_0)}. \tag{17}$$

Note that, for all $s \in S$, we have

$$V^\pi(s) = \mathbb{E}\big[\tilde{V}(s; s_0) + \gamma^{X(s; s_0)} \cdot V^\pi(s_0)\big] = \mathbb{E}\,\tilde{V}(s; s_0) + \mathbb{E}\,\gamma^{X(s; s_0)}\, V^\pi(s_0). \tag{18}$$

Replacing $V^\pi(s_0)$ with the expression in (17) yields (13).

Corollary 1 (Full version of Proposition 1 in the main text) For $\pi_0, \pi_1 \in \Phi(s_0, \pi_0)$, let $\pi_\alpha = (1 - \alpha)\pi_0 + \alpha\pi_1$, with $\alpha \in [0, 1]$.
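Then

$$V^{\pi_\alpha} = \big(1 - \beta(\alpha; s_0)\big) V^{\pi_0} + \beta(\alpha; s_0)\, V^{\pi_1}, \tag{19}$$

where

$$\beta(\alpha; s_0) = \frac{\alpha\varphi_1}{(1 - \alpha)\varphi_0 + \alpha\varphi_1} \tag{20}$$

with $\varphi_i = 1 - \gamma\,\eta(s_0)\pi_i(s_0) \in [1 - \gamma, 1]$ and $\beta'(\alpha; s_0) > 0$. Besides,

$$V^{\pi_1} - V^{\pi_0} = \big(V^{\pi_1}(s_0) - V^{\pi_0}(s_0)\big)\,\Gamma(s_0), \tag{21}$$

and

$$\frac{\partial V^{\pi_\alpha}}{\partial \alpha} = \beta'(\alpha; s_0) \cdot (V^{\pi_1} - V^{\pi_0}) = \beta'(\alpha; s_0) \cdot D(s_0)\,\Gamma(s_0) \tag{22}$$

$$= \frac{\partial V^{\pi_\alpha}(s_0)}{\partial \alpha}\,\Gamma(s_0), \tag{23}$$

where

$$D(s_0) = Q(s_0) \left(\frac{\pi_1(s_0)}{1 - \gamma\,\eta(s_0)\pi_1(s_0)} - \frac{\pi_0(s_0)}{1 - \gamma\,\eta(s_0)\pi_0(s_0)}\right). \tag{24}$$

(Note: Eq (23) is identical to (5) in Prop 1.)

Proof: Notice that

$$\big(1 - \beta(\alpha; s_0)\big) V^{\pi_0} + \beta(\alpha; s_0)\, V^{\pi_1} = \frac{(1 - \alpha)\varphi_0}{(1 - \alpha)\varphi_0 + \alpha\varphi_1} V^{\pi_0} + \frac{\alpha\varphi_1}{(1 - \alpha)\varphi_0 + \alpha\varphi_1} V^{\pi_1}$$

$$= V(s_0) + Q(s_0) \left(\frac{(1 - \alpha)\varphi_0}{(1 - \alpha)\varphi_0 + \alpha\varphi_1} \cdot \frac{\pi_0(s_0)}{\varphi_0} + \frac{\alpha\varphi_1}{(1 - \alpha)\varphi_0 + \alpha\varphi_1} \cdot \frac{\pi_1(s_0)}{\varphi_1}\right) \Gamma(s_0) \quad [\text{by (17)}]$$

$$= V(s_0) + Q(s_0)\, \frac{(1 - \alpha)\pi_0(s_0) + \alpha\pi_1(s_0)}{(1 - \alpha)\varphi_0 + \alpha\varphi_1}\, \Gamma(s_0) = V(s_0) + Q(s_0)\, \frac{\pi_\alpha(s_0)}{1 - \gamma\,\eta(s_0)\pi_\alpha(s_0)}\, \Gamma(s_0) = V^{\pi_\alpha} \quad [\text{by (17)}],$$

which is (19). By Prop 9, $\eta(s_0)\pi_i(s_0) \in [0, 1]$ for $i = 0, 1$. Hence, $\varphi_i = 1 - \gamma\,\eta(s_0)\pi_i(s_0) \in [1 - \gamma, 1]$ and

$$\beta'(\alpha; s_0) = \frac{\varphi_0\varphi_1}{(\varphi_0\alpha - \varphi_0 - \varphi_1\alpha)^2} \in [1 - \gamma, 1].$$

Besides, by (19), we have $\frac{\partial V^{\pi_\alpha}}{\partial \alpha} = \beta'(\alpha; s_0) \cdot (V^{\pi_1} - V^{\pi_0})$. Plugging $\pi_1$ and $\pi_0$ into (13) and taking the difference yields $V^{\pi_1} - V^{\pi_0} = D(s_0)\Gamma(s_0)$, with $D(s_0)$ defined in (24). In this way, (22) is proved. Let $\Gamma(s; s_0)$ be the entry for $s$ in $\Gamma(s_0)$. Namely, $\Gamma(s; s_0) = \mathbb{E}\,\gamma^{X(s; s_0)}$. Then $V^{\pi_1}(s_0) - V^{\pi_0}(s_0) = D(s_0)\Gamma(s_0; s_0) = D(s_0)\,\mathbb{E}\,\gamma^0 = D(s_0)$. Hence, $V^{\pi_1} - V^{\pi_0} = D(s_0)\Gamma(s_0) = \big(V^{\pi_1}(s_0) - V^{\pi_0}(s_0)\big)\Gamma(s_0)$, which is (21). Likewise, (22) can be written as

$$\frac{\partial V^{\pi_\alpha}(s_0)}{\partial \alpha} = \beta'(\alpha; s_0) \cdot D(s_0)\,\Gamma(s_0; s_0) = \beta'(\alpha; s_0) \cdot D(s_0)\,\mathbb{E}\,\gamma^0 = \beta'(\alpha; s_0) \cdot D(s_0). \tag{27}$$

Therefore, for $s \in S$,

$$\frac{\partial V^{\pi_\alpha}(s)}{\partial \alpha} = \beta'(\alpha; s_0) \cdot D(s_0)\,\Gamma(s; s_0) = \frac{\partial V^{\pi_\alpha}(s_0)}{\partial \alpha}\,\Gamma(s; s_0),$$

which is (23).

Proposition 10 (Proposition 2 in the main text) Assume $\pi_0 \in \Pi$ is not optimal in the associated scalar problem for all $w \neq 0$. Let $s_0 \in S$.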
We have the conic hulls

$$Q_1 = \mathrm{cone}\Big\{V^\pi(s_0) - V^{\pi_0}(s_0) \,\Big|\, \pi \in \bigcup_{s \in S} \Phi(s, \pi_0)\Big\} = \mathbb{R}^d, \qquad Q_2 = \mathrm{cone}\Big\{V^\pi_\mu - V^{\pi_0}_\mu \,\Big|\, \pi \in \bigcup_{s \in S} \Phi(s, \pi_0)\Big\} = \mathbb{R}^d.$$

Therefore, for all $u \in \mathbb{R}^d$, we can construct a function

$$\pi(\alpha_1, \ldots, \alpha_M) = \Big(1 - \sum_{i=1}^{M} \alpha_i\Big)\pi_0 + \sum_{i=1}^{M} \alpha_i\pi_i \tag{31}$$

with $\alpha_i \geq 0$, $\sum_j \alpha_j \leq 1$ and $\pi_i \in \Pi$ such that

$$u = \sum_{j=1}^{M} \frac{\partial V^{\pi(0)}(s_0)}{\partial \alpha_j} \cdot r_j \quad \text{for some } r_j \geq 0. \tag{32}$$

(Note that $\pi(0) = \pi(0, 0, \ldots, 0) = \pi_0$.) Besides, assuming the MDP is ergodic, if $\gamma \to 1$,

$$\sum_{j=1}^{M} \frac{\partial V^{\pi(0)}(s)}{\partial \alpha_j} \cdot r_j \to u \quad \text{for all } s \in S. \tag{33}$$

Similarly, we can construct $\pi(\alpha_1, \ldots, \alpha_M)$ such that

$$u = \sum_{j=1}^{M} \frac{\partial V^{\pi(0)}_\mu}{\partial \alpha_j} \cdot r_j \quad \text{for some } r_j \geq 0. \tag{34}$$

Proof: Assume $Q_1 \neq \mathbb{R}^d$. Then by Farkas' lemma (Dax, 1997), it must be contained in some closed half-space $H_1 = \{x \in \mathbb{R}^d \mid n_1^\top x \leq 0\}$. We now show that this leads to a contradiction, since otherwise $n_1^\top V^{\pi_0}(s_0)$ could not be further improved, which would mean $\pi_0$ is optimal in the associated scalar RL problem with weight $n_1$. To improve $n_1^\top V^{\pi_0}(s_0)$, there must be some $s' \in S$ that is reachable from $s_0$ and $\pi \in \Phi(s', \pi_0)$ such that $n_1^\top \big(V^\pi(s') - V^{\pi_0}(s')\big) > 0$. By (21), this implies

$$n_1^\top \big(V^\pi(s_0) - V^{\pi_0}(s_0)\big) = n_1^\top \big(V^\pi(s') - V^{\pi_0}(s')\big) \cdot \Gamma(s_0; s') > 0,$$

where $\Gamma(s_0; s') > 0$ is the entry corresponding to $s_0$ in $\Gamma(s')$. (We note that the policy optimization is performed in $s'$ instead of $s_0$, as we did in Cor 1.) Hence, $V^\pi(s_0) - V^{\pi_0}(s_0) \notin H_1$, which is a contradiction. Therefore, we have shown that $Q_1 = \mathbb{R}^d$.

Similarly, if $Q_2 \neq \mathbb{R}^d$, Farkas' lemma shows it must be contained in some closed half-space $H_2 = \{x \in \mathbb{R}^d \mid n_2^\top x \leq 0\}$. Since $\pi_0$ is assumed to be not optimal in all associated SORL problems with $w \neq 0$, it is not optimal for the SORL one with weight $n_2$. Therefore, there exists some $s' \in S$ that is reachable when following the initial distribution and $\pi \in \Phi(s', \pi_0)$ such that $n_2^\top \big(V^\pi(s') - V^{\pi_0}(s')\big) > 0$. Then (21) implies $n_2^\top \big(V^\pi(s) - V^{\pi_0}(s)\big) = n_2^\top \big(V^\pi(s') - V^{\pi_0}(s')\big) \cdot \Gamma(s; s') > 0$ for all $s \in S$. As a result,

$$n_2^\top \big(V^\pi_\mu - V^{\pi_0}_\mu\big) = n_2^\top \sum_{s \in S} \mu(s) \big(V^\pi(s) - V^{\pi_0}(s)\big) > 0.$$
Hence, $V^\pi_\mu - V^{\pi_0}_\mu \notin H_2$, which is a contradiction. Therefore, $Q_2 = \mathbb{R}^d$. Since $Q_1 = \mathbb{R}^d$, there exist $s_i \in S$, $\pi_i \in \Phi(s_i, \pi_0)$ and $d_i \geq 0$, for $i = 1, \ldots, M$, such that

$$u = \sum_{i=1}^{M} d_i \big(V^{\pi_i}(s_0) - V^{\pi_0}(s_0)\big). \tag{38}$$

Define the function $\pi$ by the expression (31). By Corollary 1, we have

$$\frac{\partial V^{\pi(0)}}{\partial \alpha_j} = \frac{\partial V^{(1 - \alpha_j)\pi_0 + \alpha_j\pi_j}}{\partial \alpha_j}\bigg|_{\alpha_j = 0} = \beta'(\alpha_j; s_j) \cdot (V^{\pi_j} - V^{\pi_0}) = \frac{\partial V^{\pi(0)}(s_j)}{\partial \alpha_j}\,\Gamma(s_j).$$

This implies

$$\frac{\partial V^{\pi(0)}(s_0)}{\partial \alpha_j} = \beta'(\alpha_j; s_j) \cdot \big(V^{\pi_j}(s_0) - V^{\pi_0}(s_0)\big) \tag{40}$$

$$= \frac{\partial V^{\pi(0)}(s_j)}{\partial \alpha_j}\,\Gamma(s_0; s_j). \tag{41}$$

Combining (38) and (40) yields

$$u = \sum_{i=1}^{M} \frac{d_i}{\beta'(\alpha_i; s_i)} \frac{\partial V^{\pi(0)}(s_0)}{\partial \alpha_i}. \tag{42}$$

Proof: Consider the optimization problem

$$\text{maximize } w^\top v \quad \text{subject to } v \in V(s). \tag{47}$$

If $\pi$ is optimal for some associated scalar problem with weight $w \succ 0$, then $V^\pi(s)$ is the optimal solution of problem (47) (Bertsekas, 2022, Prop 2.1.2). Since $V(s)$ is convex (Prop 11), according to (Miettinen, 1998, Thm 3.1.2), $V^\pi(s)$ is Pareto optimal in $V(s)$. That is, $\pi \in \Pi^*_s$. Conversely, if $\pi \in \Pi^*_s$, then $V^\pi(s)$ is Pareto optimal in $V(s)$. Since $V(s)$ is convex, by (Miettinen, 1998, Thm 3.1.4), there exists $w \succeq 0$ with $w \neq 0$ such that $V^\pi(s)$ is a solution of problem (47). Then $\pi$ is optimal in the associated SORL problem with the nonzero $w \succeq 0$. Repeating the derivations with $\Pi^*_s$ replaced by $\Pi^*_\mu$ proves the similar statement for $\Pi^*_\mu$.

Proposition 13 (Proposition 5 in the main text) Function $g_s$ given in (9) is well-defined, surjective and uniformly continuous.

We prove Prop 13 by first establishing the functional relationship from $w \in W_+$ to $V^\pi_{\alpha f}(s) \in V_{\alpha f}(s)$, so that $g_s$ is well defined. In particular, we show

Lemma 1 For any $w \in W_+$, there exists a unique $V^\pi_{\alpha f}(s) \in V_{\alpha f}(s)$ such that $\pi$ is optimal in SORL($w$). (That is, $V^\pi_{\alpha f}(s)$ is the only element in $V_{\alpha f}(s)$ having the greatest projection on $w$.)

Proof: We prove the lemma by contradiction. Assume that there exist two distinct $V^{\pi_1}_{\alpha f}(s)$ and $V^{\pi_2}_{\alpha f}(s)$ in $V_{\alpha f}(s)$ such that $\pi_1$ and $\pi_2$ are optimal in SORL($w$).
Then we have

$$\{V^{\pi_1}_{\alpha f}(s), V^{\pi_2}_{\alpha f}(s)\} \subseteq \arg\max_{v \in V_{\alpha f}(s)} w^\top v, \quad \text{and} \quad w^\top V^{\pi_1}_{\alpha f}(s) = w^\top V^{\pi_2}_{\alpha f}(s). \tag{48}$$

In a SORL problem, it is well known that a policy $\pi$ is optimal for initial state $s$ if and only if it is optimal for all states that are reachable from $s$ (Sutton & Barto, 2018). Since $V^{\pi_1}_{\alpha f}(s) \neq V^{\pi_2}_{\alpha f}(s)$, we have $\pi_1(s') \neq \pi_2(s')$ for some $s'$ that is reachable from $s$ under policies $\pi_1$ and $\pi_2$. Moreover, we have $w^\top V^{\pi_1}_{\alpha f}(s') = w^\top V^{\pi_2}_{\alpha f}(s')$. We then consider a new policy $\pi'$ that equals $\pi_1$ for all $s \in S$ except $s'$. For $s'$, $\pi'(s') = \frac{1}{2}\big(\pi_1(s') + \pi_2(s')\big)$. Due to the strong concavity of the immediate augmented reward, the projected immediate reward of $\pi'$ on $w$ is strictly greater than those of $\pi_1$ and $\pi_2$. As a single-state action optimization is sufficient to increase the value function in all states (Sutton & Barto, 2018, p78), we also have $w^\top V^{\pi'}_{\alpha f}(s) > w^\top V^{\pi_1}_{\alpha f}(s) = w^\top V^{\pi_2}_{\alpha f}(s)$, which contradicts (48). Since for each $w \in W_+$ there exists a unique corresponding $V^\pi_{\alpha f}(s) \in V_{\alpha f}(s)$, the relationship $g_s(\cdot)$ is by definition a function.

We then show that

Lemma 2 $g_s : W_+ \to V_{\alpha f}(s)$ is surjective.

Proof: Consider a variant of the MORL problem that has the same setting as the original one except that we now treat the action-taking distributions of a state as actions. We keep referring to the actions under the original definition as actions and to the ones under the new definition as d-actions. Then for each state-action pair $(\bar{a}, s)$, the immediate reward is $R(s)\bar{a} + \alpha f(\bar{a})$. This setting is the same as the one we considered in the proof of the convexity of the induced value function's range. Therefore, let $\bar{V}_{\alpha f}(s)$ denote the range of the induced value function for this variant of the original MORL problem; by Prop 3, $\bar{V}_{\alpha f}(s)$ is convex. (In this variant of the MORL problem, the action space $\bar{A}$ is actually infinite.
Thus, to apply Prop 3, we need to first approximate the action space through sufficiently fine discretization (Chow & Tsitsiklis, 1991) .) Let V αf (s) denotes the set of PE elements in V αf . Then applying the same method for the proof of Prop 12, we can show that for all v ∈ V αf (s), there exists w ∈ W + such that v has the maximum projection on w. Besides, applying the same method used in the proof of Lem 1, we can show that for any w ∈ W + , there exists a unique v ∈ V αf (s) such that v has the maximum projection on w. Therefore, there is a functional relationship ḡs : W + → V αf (s) that maps w to v. Moreover, ḡs is surjective. Finally, we show that ḡs = g s by showing that v must be in V αf (s). Given v ∈ V αf (s) that has the maximum projection over some w ∈ W + . Then consider its associated SORL with weight w. Since the corresponding policy of taking d-actions are optimal, it must achieve the maximal cumulative discounted reward in all s that are reachable from s. Then suppose in some s the policy of taking actions is not deterministic (i.e., the agent follows some distribution p(ā) to take d-actions). Then by replacing the distribution with the one only taking d-action ā = Ā āp(ā), the projected cumulative discounted reward starting at s increases (due to the strongly convavity and the Jensen's inequality). Therefore, to find an optimal policy of the associated SORL, we only need to consider the deterministic d-action policy. Namely, it is sufficient to choose one distribution to take actions instead of adopting a bilevel design such that first follow a distribution to pick an action-taking distribution q followed by using q to take actions. Equivalently, we have shown that v ∈ V αf (s), which completes the proof. Finally, we show that g s is uniformly continuous. Lemma 3 g s is uniformly continuous. Proof: We first show that w is continuous over W + . By Lem 1, we know g s (w) can be written as g s (w) = argmax v∈V αf (s) w v. 
For a sequence $w_k \to w^*$, let $v_k = g_s(w_k)$ and $v^* = g_s(w^*)$. Then for any subsequence $I \subset \mathbb{N}$, $v_k$ has an accumulation point $\bar{v}$ (because $V_{\alpha f}(s)$ is closed and bounded, and by the Bolzano-Weierstrass theorem (Davidson & Donsig, 2009)). Since $w_k^\top v_k \geq w_k^\top v$ for all $v \in V_{\alpha f}(s)$ and $k \in I$, we also have $w^{*\top} \bar{v} \geq w^{*\top} v$ for all $v \in V_{\alpha f}(s)$. Therefore, $\bar{v} = v^*$ and $g_s$ is continuous by definition (Davidson & Donsig, 2009). Finally, since $W_+$ is compact, $g_s$ is uniformly continuous (Davidson & Donsig, 2009).

Combining Lemmas 1-3 completes the proof of Prop 13.

Corollary 2 (Proposition 6 in the main text) If $\pi \in \Pi^*_s$ for some state $s \in S$, then $\pi \in \Pi^*_{s'}$ for all $s' \in S$. Thus, the sets of SPE policies coincide for all initial states and are a subset of $\Pi^*$.

Proof: According to Prop 12, if $\pi \in \Pi^*_{s'}$ for some $s' \in S$, then $V^\pi(s')$ is optimal in an associated SORL with some nonzero $w \succeq 0$. Thus, $V^\pi(s)$ is the optimal solution of problem (47) for all $s \in S$ (by (Bertsekas, 2022, Prop 2.1.2) and the ergodic assumption). If $w \succ 0$, then by Prop 12, $V^\pi(s)$ is Pareto optimal in $V(s)$ for all $s \in S$. That is, $\pi \in \Pi^*_s$ for all $s$. If $w$ contains zero entries, as $\pi \in \Pi^*_{s'}$, $V^\pi(s')$ is Pareto optimal in $V(s')$. Intuitively, $V^\pi(s')$ can be seen as a maximizer of problem (47) with weight $w' = \lim_{\epsilon \to 0^+} w + \epsilon\mathbf{1}$. Then, since $w + \epsilon\mathbf{1} \succ 0$ as it approaches $w$, the problem is reduced to the first case, and thus $\pi \in \Pi^*_s$ for all $s$.

A more rigorous proof can be given by introducing a lexicographic order over $\mathbb{R}^d$. Specifically, we need to construct a lexicographic order such that $V^\pi(s')$ is the global maximum of $V(s') \subset \mathbb{R}^d$. Let $w_0 = w$, $S_0 = V(s')$ and $m_0 = \max_{v \in S_0} w_0^\top v$. Define $S_1 = \{v \in S_0 \mid w_0^\top v = m_0\}$. Since $S_0$ is convex, so is $S_1$, as it is the intersection of $S_0$ and a hyperplane. It is easy to check that $V^\pi(s') \in S_1$ and is Pareto optimal in $S_1$. Applying (Miettinen, 1998, Thm 3.1.4), there exists a nonzero $w_1 \succeq 0$ such that $w_1 \cdot w_0 = 0$ and $V^\pi(s')$ is a solution of

$$\text{maximize } w_1^\top v \quad \text{subject to } v \in S_1. \tag{51}$$
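As a numerical sanity check of Prop 9 and Cor 1, the following sketch builds a small random MDP with vector rewards and verifies that interpolating the policy at a single state $s_0$ moves the value function along the segment between $V^{\pi_0}$ and $V^{\pi_1}$ with coefficient $\beta(\alpha; s_0)$ from (20). The MDP, its sizes and all function names are assumptions made for illustration; value functions are stored as arrays of shape (states, objectives), i.e. the transpose of the $d \times |S|$ layout used in the proofs.

```python
import numpy as np

def policy_value(P, R, pi, gamma):
    """V^pi as an (S, d) array for transitions P[s, a, s'] and rewards R[s, a, :]."""
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sad->sd", pi, R)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def hitting_discount(P, pi, gamma, s0):
    """Gamma(s; s0) = E[gamma^X(s; s0)], with Gamma(s0; s0) = 1."""
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)
    M = np.eye(n_states) - gamma * P_pi
    M[s0] = 0.0          # replace the row of s0 by the boundary condition
    M[s0, s0] = 1.0
    b = np.zeros(n_states)
    b[s0] = 1.0
    return np.linalg.solve(M, b)

def phi(P, pi, gamma, s0):
    """phi = 1 - gamma * eta(s0) pi(s0), where eta(s0) = Gamma(s0) P(s0)."""
    eta = P[s0] @ hitting_discount(P, pi, gamma, s0)  # one entry per action
    return 1.0 - gamma * eta @ pi[s0]

rng = np.random.default_rng(0)
S, A, d, gamma, s0 = 4, 3, 2, 0.9, 1
P = rng.dirichlet(np.ones(S), size=(S, A))    # shape (S, A, S)
R = rng.normal(size=(S, A, d))
pi0 = rng.dirichlet(np.ones(A), size=S)       # shape (S, A)
pi1 = pi0.copy()
pi1[s0] = rng.dirichlet(np.ones(A))           # differs from pi0 only at s0

alpha = 0.3
pi_a = (1 - alpha) * pi0 + alpha * pi1
phi0, phi1 = phi(P, pi0, gamma, s0), phi(P, pi1, gamma, s0)
beta = alpha * phi1 / ((1 - alpha) * phi0 + alpha * phi1)
lhs = policy_value(P, R, pi_a, gamma)
rhs = (1 - beta) * policy_value(P, R, pi0, gamma) + beta * policy_value(P, R, pi1, gamma)
```

Here `lhs` and `rhs` should agree up to floating-point error, and in general `beta` differs from `alpha`: the interpolation is linear in the value space only after the reparameterization by $\beta$.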

E THE SELECTION OF THE STRONGLY CONCAVE TERMS IN CAPQL

In this section, we discuss our selection of the strongly concave term used in CAPQL. According to the theoretical results presented in Sec 4.1, $f$ can be any strongly concave term for the definition of $V^\pi$ given in (8), and Prop 5 shows that the corresponding $g_s$ defined in (9) is surjective and uniformly continuous. In this way, all the problems of the existing LS-based methods discussed in Sec 4.1 are solved. While the theoretical results presented in Sec 4.1 hold as long as $f$ is strongly concave, some special choices of $f$ can significantly simplify the implementation of CAPQL and improve the computational efficiency. Specifically, CAPQL consists of two major parts: 1) optimizing the generalized Q-network using the Bellman operation and 2) optimizing the policy network $\pi_\phi$ conditioned on $w$ by minimizing the KL divergence between the predicted action-taking distribution $\pi_\phi(s, w)$ and the target one induced by the Q-values:

$$\pi^*(s) = \arg\max_{\pi(s)}\; w^\top \sum_{a \in A} Q_\theta(s, a, w) \cdot \pi(a, s) + \alpha f\big(\pi(s)\big), \tag{59}$$

where $\pi(a, s)$ denotes the probability of taking action $a$ in state $s$. (See Appx F for the implementation details of $\pi_\phi$ and $Q_\theta$.)

Regarding the first part, the optimization of $Q(s_t, a_t, w)$ needs the estimated value function conditioned on $w$ at the next state $s_{t+1}$. That is,

$$V(s_{t+1}, w) = \mathbb{E}_{a \sim \pi_\phi(s_{t+1}, w)}\, Q_\theta(s_{t+1}, a, w) + \alpha f\big(\pi_\phi(s_{t+1}, w)\big),$$

which requires that $f(\pi_\phi(s_{t+1}, w))$ can be computed efficiently. If the number of actions is huge or infinite, we need to estimate $V(s_{t+1}, w)$ through sampling. Note that, when $f = H$, we have

$$V(s_{t+1}, w) = \mathbb{E}_{a \sim \pi_\phi(s_{t+1}, w)}\, Q_\theta(s_{t+1}, a, w) - \alpha\, \mathbb{E}_{a \sim \pi_\phi(s_{t+1}, w)} \log \pi_\phi(s_{t+1}, w)$$

$$= \mathbb{E}_{a \sim \pi_\phi(s_{t+1}, w)} \big[Q_\theta(s_{t+1}, a, w) - \alpha \log \pi_\phi(s_{t+1}, w)\big] \approx Q_\theta(s_{t+1}, a, w) - \alpha \log \pi_\phi(s_{t+1}, w) \tag{63}$$

with $a \sim \pi_\phi(s_{t+1}, w)$. (We adopt this approximation in our CAPQL implementation as the MuJoCo environment has a continuous action space.) However, a similar estimation cannot be made given $f(p) = -\sum_{i=1}^{|A|} p_i^2$ because the expression is not an expectation over the action-taking distribution.
For the second part, a practical implementation requires that the $\pi^*(s)$ defined in (59) can be evaluated efficiently. When $f = H$, $\pi^*$ has the expression

$$\pi^*(a, s) = \frac{\exp\big(w^\top Q_\theta(s, a, w)/\alpha\big)}{Z}, \quad \text{where } Z = \int_{a \in A} \exp\big(w^\top Q_\theta(s, a, w)/\alpha\big)\, da.$$

In Appx F, we will show that $Z$ is not required to be evaluated explicitly. As a result, $\pi^*$ can be computed effortlessly. Since setting $f = H$ enables efficient computation and estimation when training the Q-network and the policy network, we use the entropy function to augment the immediate rewards in our CAPQL implementation.
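For a finite action set, the closed form above can be checked directly. The sketch below is an illustration under that assumption (the paper's setting is continuous, where $Z$ is an integral), and `boltzmann_policy` and `soft_objective` are hypothetical helper names. It computes the Boltzmann target policy and evaluates the objective in (59) with $f = H$, whose maximum is known to equal $\alpha \log \sum_a \exp(w^\top Q(s, a, w)/\alpha)$.

```python
import math

def boltzmann_policy(q_vectors, weight, alpha):
    """pi*(a) proportional to exp(w^T Q(s, a, w) / alpha) over a finite action set."""
    logits = [sum(w * q for w, q in zip(weight, q_vec)) / alpha for q_vec in q_vectors]
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def soft_objective(q_vectors, weight, alpha, probs):
    """w^T sum_a Q(s, a, w) pi(a) + alpha * H(pi): the objective in (59) when f = H."""
    expected = sum(p * sum(w * q for w, q in zip(weight, q_vec))
                   for p, q_vec in zip(probs, q_vectors))
    ent = -sum(p * math.log(p) for p in probs if p > 0.0)
    return expected + alpha * ent

# Illustrative vector Q-values for three actions and two objectives.
Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w = [0.8, 0.2]
alpha = 0.5
pi_star = boltzmann_policy(Q, w, alpha)
```

As $\alpha$ shrinks, `pi_star` concentrates on the action with the largest scalarized Q-value, which recovers the greedy LS policy in the limit.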

F IMPLEMENTATION DETAILS OF CAPQL

In this section, we provide extra implementation details of the CAPQL algorithm and give its pseudocode. Our algorithm largely follows the spirit of the implementation of the soft actor critic (SAC) (Haarnoja et al., 2018). The major difference is that in CAPQL, the Q-network and the policy network are conditioned on the preference weight $w$. As a result, the Q-network takes input $(s, a, w)$ instead of $(s, a)$, and the policy network has input $(s, w)$ instead of $s$. Let $Q_\theta$ denote the Q-network with parameter $\theta$ for training, $Q_{\bar\theta}$ the target network with parameter $\bar\theta$ and $\pi_\psi$ the policy network with parameter $\psi$. Additionally, let $D_\Phi$ be a weight sampling distribution with support $\Phi$. As mentioned in the main text, without loss of generality, we assume that for all $w \in \Phi$, we have $\|w\|_1 = 1$; in practice, the assumption can be satisfied simply by normalizing $w$ after sampling it. We then give CAPQL's implementation in Alg 1. For a sampled batch $S$ and weight $w$, the update steps of Alg 1 are as follows.

Q-network update, for $i \in \{1, 2\}$:

$$\theta_i \leftarrow \theta_i - \lambda_Q \nabla_{\theta_i} \frac{1}{2}\, \mathbb{E}_S \big\|\hat{Q}(s_j, a_j, w) - Q_{\theta_i}(s_j, a_j, w)\big\|_2^2,$$

where

$$\hat{Q}(s_j, a_j, w) = R(a_j, s_j) + \gamma \Big(\min_{i \in \{1, 2\}} Q_{\theta_i}(s_{j+1}, a_{j+1}, w) - \alpha \log \pi_\psi(a_{j+1}, s_{j+1}, w)\,\mathbf{1}\Big) \quad \text{and} \quad a_{j+1} \sim \pi_\psi(s_{j+1}, w).$$

Policy update:

$$\psi \leftarrow \psi - \lambda_\pi \nabla_\psi\, \mathbb{E}_S\, D_{KL}\bigg(\pi_\psi(\cdot, s_j, w) \,\bigg\|\, \frac{\exp\big(w^\top \min_{i \in \{1, 2\}} Q_{\theta_i}(s_j, \cdot, w)/\alpha\big)}{Z(s_j, w)}\bigg), \quad \text{with } Z(s_j, w) = \int_A \exp\Big(w^\top \min_{i \in \{1, 2\}} Q_{\theta_i}(s_j, a, w)/\alpha\Big)\, da.$$

Target network update, for $i \in \{1, 2\}$:

$$\bar\theta_i \leftarrow \tau\theta_i + (1 - \tau)\bar\theta_i.$$

Note that when computing the gradient with respect to $\psi$, the partition function $Z(s_j, w)$ does not depend on $\psi$ and is dropped in the actual implementation. Besides, we use an exponential moving average with a smoothing constant $\tau$ to update the target network, which stabilizes the learning trajectory. This technique has been commonly adopted in prior work (Mnih et al., 2015; Lillicrap et al., 2016; Haarnoja et al., 2018).

Implementation of the Q-network. Inspired by the design of SAC, our Q-network consists of two fully connected networks (FCNs). The two networks have the same architecture but different parameters.
Each of them has two hidden layers and takes an input whose dimension equals the sum of the dimensions of the observation state and the reward. The output dimension equals the reward's. When performing inference, the Q-network returns the element-wise minimum over the two FCNs' outputs.

Implementation of the policy network. The policy network is implemented using the reparameterization trick. In particular, it can be written as $a_t = f_\psi(\epsilon_t; s_t, w)$, where $\epsilon_t$ is a sample of a spherical Gaussian distribution. In our implementation, we first use the trick to generate a Gaussian sample with mean $\mu_\phi(s_t, w)$ and standard deviation $\sigma_\phi(s_t, w)$ ($\mu_\phi$ and $\sigma_\phi$ are each implemented by a two-layer FCN), and then apply the tanh function to ensure $a_t$ is in $[-1, 1]^d$ as required by the MuJoCo environment.
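The three ingredients described in this appendix (the vector-valued critic target with an element-wise minimum over two Q-networks, the moving-average target update, and the tanh-squashed reparameterized action) can be sketched without any network code. The helpers below operate on plain lists of numbers, and their names and signatures are assumptions for illustration rather than the authors' implementation.

```python
import math

def critic_target(reward, q1_next, q2_next, log_prob_next, alpha, gamma):
    """y = R + gamma * (min(Q1, Q2) - alpha * log pi * 1), computed per objective."""
    return [r + gamma * (min(a, b) - alpha * log_prob_next)
            for r, a, b in zip(reward, q1_next, q2_next)]

def soft_update(target_params, online_params, tau):
    """Exponential moving average: theta_bar <- tau * theta + (1 - tau) * theta_bar."""
    return [tau * p + (1.0 - tau) * t for t, p in zip(target_params, online_params)]

def squash(mu, sigma, eps):
    """Reparameterized action a = tanh(mu + sigma * eps); each entry lies in [-1, 1]."""
    return [math.tanh(m + s * e) for m, s, e in zip(mu, sigma, eps)]

# Illustrative numbers only: two objectives, one sampled next action.
y = critic_target(reward=[1.0, 2.0], q1_next=[3.0, 0.5], q2_next=[2.0, 1.0],
                  log_prob_next=-1.0, alpha=0.2, gamma=0.9)
```

The entropy bonus enters every objective with the same value, which matches the $\mathbf{1}$ vector in the target of Alg 1.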

G THE CONVERGENCE PROPERTIES OF CAPQL

In this section, we discuss the convergence property of the CAPQL algorithm. As mentioned in Sec 5.2, the learning task of SORL($w$) specified in (10) is the one used in SAC (Haarnoja et al., 2018); therefore, after picking $w$, we use the SAC method to train the policy and the Q-network. Let $W = \{w \mid w \in \Phi \text{ and } \|w\|_1 = 1\}$. It has been shown that by repeatedly applying the SAC optimization step, the policy converges to the optimal one (Haarnoja et al., 2018, Thm 1). As a result, for CAPQL, given a fixed weight $w \in W$, the conditioned policy $\pi(\cdot, \cdot, w)$ converges to the optimal policy $\pi^*_w$ such that $w^\top Q^{\pi^*_w}(s, a) \geq w^\top Q^\pi(s, a)$ for all $\pi \in \Pi$ and $(s, a) \in S \times A$. As the CAPQL algorithm converges conditioned on every $w \in W$, it also converges as a whole.

In this section, we provide extra empirical evidence to corroborate our theoretical results discussed in Prop 5 and Rmk 4 and show that CAPQL is capable of finding all the PE policies. We perform the experiments on the environment specified in Ex 1, which has a simple configuration so that we can plot the range of its induced value function. In Fig 11, we visualize the estimated $g_{s_0}(w)$ of CAPQL with $\alpha = 0.1, 0.3, 0.5$. We also provide one for MOQ (Abels et al., 2019), which can be seen as a limiting case of CAPQL when $\alpha \to 0$. The visualization was made by picking $w \in \{[w_0, w_1]^\top \in \mathbb{R}^2 \mid w_0 + w_1 = 1,\ w_0 \in [0, 1]\}$. Fig 11 (left plot) shows that the MOQ algorithm cannot precisely estimate $g_{s_0}$, as it is not continuous when $\alpha = 0$. In particular, the range of the target $g_{s_0}$ only consists of the upper two vertices and the one on the right (also see Fig 6). However, the estimated $g_{s_0}$ randomly fluctuates at the vertices and mistakenly has a continuous transition between the upper two. The fluctuations suggest that the estimated $g_{s_0}$ is numerically unstable and could harm the performance of the induced policy. Additionally, the continuous transition indicates that the target $g_{s_0}$ is inaccurately estimated.
The right three plots show that the aforementioned problems can be alleviated by the CAPQL algorithm. We observe that even a relatively small alpha (α = 0.1) can largely alleviate the numerical 



Extra literature reviews are given in Appx A.
The entries of $V^\pi(s)$ are induced value functions in SORL and are known to exist. Thus, $V^\pi(s)$ also exists.
If every entry in $u$ is strictly greater than its counterpart in $v$, we write $u \succ v$.
In general, not all tensors are comparable. For example, given $u = [1, 0]^\top$ and $v = [0, 1]^\top$, neither $u \succeq v$ nor $v \succeq u$ holds.
We write $u \succneqq v \iff (u \succeq v) \wedge (u \neq v)$.
For single-entry tensors, PO is reduced to the regular order defined on $\mathbb{R}$.
We give the closed-form expression of $\varphi_i$ in Cor 1 in the appendix.
We provide a more rigorous proof in the appendix by constructing a lexicographic order on $\mathbb{R}^d$.
When $s = s_0$, $\tilde{V}(s_0; s_0) = X(s_0; s_0) = 0$.
We note that the policy optimization is performed in $s'$ instead of $s_0$ (as we did in Cor 1).
A more rigorous treatment is to construct a line (path) integral with the directional derivative $u$.
See Appx F for the implementation details of $\pi_\phi$ and $Q_\theta$.



Figure 1: The relationship among three types of PE in Defn 1. Definition 1 (Pareto efficient policies) For $s \in S$ and initial state distribution $\mu$, $\pi \in \Pi$ is single-state PE (SPE) if $V^\pi(s)$ is PE in $V(s)$, and it is distributed initial state PE (DPE) if $V^\pi_\mu$ is PE in $V_\mu$. Likewise, $\pi \in \Pi$ is aggregate PE (APE) if $V^\pi$ is PE in $V$. Let $\Pi^*_s$, $\Pi^*_\mu$ and $\Pi^*$ denote the sets of policies that are SPE, DPE and APE, respectively.

Figure 6: Value functions of s 0 for various (λ 0 , λ 1 ) in Ex 1. The stars are deterministic SPE policies and the squares are stochastic ones; sub-optimal ones are marked by golden dots. The choice of weights for finding different SPE policies are plotted in the top-right corner, where the patch colours correspond to the found SPE policies'.

Fig 7 shows how the reward augmentation changes the value function's range. We observe that the augmentation makes the shape of the PE element set V αf (s) strictly convex. Thus, for every w ∈ W + , there is a unique V π αf (s) ∈ V αf (s) that has the maximum projection on w. Let g s denote this unique correspondence relationship from W + to V αf (s):

Figure 8: Training curves of the MORL algorithms in MuJoCo environments with vector rewards.

Figure 9: Training curves of CAPQL with different α for Hopper.


Figure 10: The effects on the induced value functions at $s_0$ for selected policies in Ex 1 of adding strongly concave terms to the immediate rewards with different alpha. Here, $f : \Delta^{|A|} \to \mathbb{R}$ is defined as $f(p) = -\sum_{i=1}^{|A|} p_i^2$. $V_{\alpha f}(s_0)$ is marked in blue, and the dots of the same colour among the four plots correspond to the same policy.

Figure 11: Visualization of the estimated g s0 (w) in Ex 1 by MOQ (Abels et al., 2019) and CAPQL with α = 0.1, 0.3, 0.5. The MOQ can be seen as a special case of CAPQL when α → 0.

Hyperparameters of QEnv and MOQ

Specifications of Environments
Environment | Action dim. | Obs. state dim. | Rwd. dim. | Rwd entries

ACKNOWLEDGEMENT

We thank the reviewers and area chair for constructive comments. We gratefully acknowledge funding support from NSERC and the Canada CIFAR AI Chairs program. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

annex

Setting $r_i = d_i / \beta'(\alpha_i; s_i)$ yields (32). Moreover, if the MDP is ergodic, as $\gamma \to 1$, all entries of $\Gamma(s_j)$ approach one for all $j = 1, 2, \ldots, M$. Then, combining (41) and (42) yields (33) for all $s \in S$.

Likewise, since $Q_2 = \mathbb{R}^d$, there exist $\tilde{r}_i \geq 0$ for $i = 1, \ldots, M$ such that

$$u = \sum_{i=1}^{M} \tilde{r}_i \big(V^{\pi_i}_\mu - V^{\pi_0}_\mu\big). \tag{44}$$

Notice that, by (22),

$$\frac{\partial V^{\pi(0)}_\mu}{\partial \alpha_j} = \beta'(\alpha_j; s_j)\, \big(V^{\pi_j}_\mu - V^{\pi_0}_\mu\big).$$

Plugging it into (44) yields

$$u = \sum_{i=1}^{M} \frac{\tilde{r}_i}{\beta'(\alpha_i; s_i)} \frac{\partial V^{\pi(0)}_\mu}{\partial \alpha_i}.$$

Setting $r_i = \tilde{r}_i / \beta'(\alpha_i; s_i)$ yields (34).

Proposition 11 (Proposition 3 in the main text) For $s \in S$, $V(s)$ is convex. Besides, $V_\mu$ is convex.

Proof: We first consider the set of policies $\Pi$ that are not optimal in the associated scalar problem for all $w \neq 0$. Then for $\pi_1, \pi_2 \in \Pi$ and $\alpha \in [0, 1]$, we show that the convex combination $(1 - \alpha)V^{\pi_1}(s) + \alpha V^{\pi_2}(s)$ is attained by some policy in this set. Note that, since $V^{\pi_1}(s)$ and $V^{\pi_2}(s)$ are not optimal in the associated SORL problem for all $w \neq 0$, for any vector $v$ on the line segment from $V^{\pi_1}(s)$ to $V^{\pi_2}(s)$, $v$ cannot be optimal in any associated problem either. (Otherwise, at least one of $V^{\pi_1}(s)$ and $V^{\pi_2}(s)$ is optimal.) Therefore, starting from $V^{\pi_1}(s)$, we can keep constructing the function $\pi$ defined in Prop 10 with $u = V^{\pi_2}(s) - V^{\pi_1}(s)$ to move along the line segment. (A more rigorous treatment is to construct a line (path) integral with the directional derivative $u$.) In this way, we can find a policy $\pi$ that has $V^\pi(s)$ corresponding to each point on the line segment, which implies that $\{V^\pi(s) \mid \pi \in \Pi\}$ is convex, and hence so is its closure $V(s)$.

Similarly, for $\pi_1, \pi_2 \in \Pi$, $V^{\pi_1}_\mu$ and $V^{\pi_2}_\mu$ are not optimal in the associated SORL problem for all $w \neq 0$. Therefore, for any vector $v$ that is a convex combination of $V^{\pi_1}_\mu$ and $V^{\pi_2}_\mu$, $v$ is not optimal in any associated SORL problem either. According to Prop 10, for every convex combination $v$, there is a policy $\pi$ that has $V^\pi_\mu = v$. Hence, $\{V^\pi_\mu \mid \pi \in \Pi\}$ is convex, which implies its closure $V_\mu$ is also convex.

Proposition 12 (Proposition 4 in the main text) For $s \in S$, $\pi \in \Pi^*_s$ if there exists $w \succ 0$ such that $\pi$ is optimal in the associated scalar problem. Also, if $\pi \in \Pi^*_s$, $\pi$ is optimal in an associated scalar problem with some nonzero $w \succeq 0$.
A similar statement holds for $\Pi^*_s$ replaced with $\Pi^*_\mu$.

We can continue this process, applying (Miettinen, 1998, Thm 3.1.4) to get $w_{k+1}$, where $w_{k+1} \cdot w_j = 0$ for all $j < k + 1$. Besides, $w_{k+1}$ has at least one nonzero entry whose counterpart in $w_j$ is zero for all $j < k + 1$. The process continues until we get a list of nonzero weights $w_0, w_1, \ldots, w_n \succeq 0$ such that for any dimension $i = 1, 2, \ldots, d$, there is exactly one $w_j$ having a positive entry in dimension $i$. Then we can construct a lexicographic order $\succeq_L$ over $\mathbb{R}^d$ by ordering the tuple $(w_0^\top v, w_1^\top v, \ldots, w_n^\top v)$. Since $V^\pi(s')$ is Pareto optimal in all $S_k$, it is the global maximum of $V(s')$ under $\succeq_L$, because for any dimension in $\mathbb{R}^d$ there is exactly one $w_j$, $j = 0, 1, \ldots, n$, having a positive entry.

Gábor et al. (1998) extend the SORL setting to MORL by ordering $\mathbb{R}^d$ using a lexicographic order. They generalized the Bellman optimality operator by choosing the action that gives the greatest (vector-valued) value function under the lexicographic order. They showed that under this setting, the value function converges to the unique optimal point (under the lexicographic order) and the generalized Bellman optimality operator is monotonic. Thus, the value function reaches optimality for all states simultaneously under the ergodic assumption.

As a result, since $V^\pi(s')$ is optimal in $V(s')$, $V^\pi(s)$ is also optimal in $V(s)$ for all $s \in S$ (under order $\succeq_L$). Thus, $V^\pi(s)$ is Pareto optimal in $V(s)$. (Otherwise, if there is some $u \in V(s)$ such that $u \succ V^\pi(s)$, we can show that $u \succ_L V^\pi(s)$ as well.)

Proposition 14 (Proposition 7 in the main text) For all $s \in S$, $\Pi^*_s = \Pi^*_\mu$.

Proof: If $\pi \in \Pi^*_s$ for some $s \in S$, then by Prop 12, $\pi$ is optimal in an associated SORL problem with some nonzero $w \succeq 0$. Therefore, $w^\top V^\pi(s)$ reaches the maximum for all $s \in S$; thus, so does $w^\top V^\pi_\mu = \sum_{s \in S} \mu_s\, w^\top V^\pi(s)$. If $w \succ 0$, by Prop 12, $V^\pi_\mu$ is Pareto optimal in $V_\mu$. Otherwise, $w$ contains zero entries.
Intuitively, we can approximate $w$ by $w + \epsilon \mathbf{1}$ with $\epsilon \to 0^+$, as in the proof of Cor 2. More rigorously, we can use the same trick to construct a lexicographic order $L$ over $\mathbb{R}^d$ under which $V^\pi(s')$ is the only global maximum of $V(s')$. Then, by the work of Gábor et al. (1998), $V^\pi(s)$ is the only global maximum of $V(s)$ for all $s$, which implies that $V^\pi_\mu$ is the only global maximum in $V_\mu$ under order $L$. Therefore, $V^\pi_\mu$ is Pareto optimal in $V_\mu$.

Conversely, given $\pi \in \Pi^*_\mu$, $V^\pi_\mu$ is Pareto optimal in $V_\mu$. Then Prop 12 shows that $\pi$ is optimal in an associated SORL problem with a nonzero $w \succeq 0$. If $w \succ 0$, then by Prop 12, $\pi \in \Pi^*_s$ for all $s \in S$. Otherwise, we construct a lexicographic order $L$ through the method used in the proof of Cor 2. Then we can show that $V^\pi_\mu$ is the global maximum in $V_\mu$, and $V^\pi(s)$ is the global maximum in $V(s)$ for all $s$ (under order $L$). Thus, $V^\pi(s)$ is Pareto optimal in $V(s)$ and $\pi \in \Pi^*_s$ for all $s$.

Proposition 15 (Proposition 8 in the main text) For any $s \in S$, as $\gamma \to 1$, the SPE set $\Pi^*_s$ approaches the APE set $\Pi^*$.

Proof: It is sufficient to show that, as $\gamma \to 1$, every policy $\pi \in \Pi^*$ must be in $\Pi^*_s$. Suppose, to the contrary, that there is $\pi \in \Pi^*$ with $\pi \notin \Pi^*_s$. Then there exists $v \in V(s)$ such that $v \ne V^\pi(s)$ and $v \succeq V^\pi(s)$. By Prop 11, $V(s)$ is convex. Thus, the line segment with endpoints $v$ and $V^\pi(s)$ is contained in $V(s)$. By Prop 10, we can update the policy to move from $V^\pi(s)$ to $v$ along the line segment; at the same time, as $\gamma \to 1$, $V^\pi(s')$ moves in the same direction for all $s' \in S$ (see (33)). Hence, the value function is improved in all states, which implies that $\pi$ cannot be APE.
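The lexicographic construction used in the proofs above can be checked numerically on a small example. The following is a minimal sketch, not part of the original proofs: the weight list `W`, the helper names `lex_key` and `pareto_dominates`, and the random test vectors are all assumptions made for illustration. It takes nonzero, non-negative weights with disjoint supports that together cover every dimension, orders vectors by the tuple $(w_0^\top v, \dots, w_n^\top v)$, and checks that Pareto dominance implies a strictly larger position in that order.

```python
import numpy as np

def lex_key(weights, v):
    """Return the tuple (w_0 . v, ..., w_n . v) used for lexicographic comparison."""
    return tuple(float(w @ v) for w in weights)

def pareto_dominates(u, v):
    """True if u >= v in every coordinate and u > v in at least one."""
    return bool(np.all(u >= v) and np.any(u > v))

# Hypothetical weights in R^3: disjoint supports, each dimension covered exactly once.
W = [np.array([1.0, 0.0, 0.0]),
     np.array([0.0, 2.0, 0.5])]

rng = np.random.default_rng(0)
checked = 0
for _ in range(1000):
    v = rng.integers(0, 4, size=3).astype(float)
    u = v + rng.integers(0, 2, size=3)  # u >= v coordinatewise
    if pareto_dominates(u, v):
        # Python compares tuples lexicographically.
        assert lex_key(W, u) > lex_key(W, v)
        checked += 1
```

Because every dimension carries a positive weight in exactly one `w_j`, a strict improvement in any coordinate strictly increases one entry of the key without decreasing the others, which is why the assertion inside the loop holds.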

C THE RELATIONSHIP BETWEEN THE CONVEXITY OF THE OCCUPATION MEASURE AND THE INDUCED FUNCTION'S RANGE

The convexity of the induced function's range can also be seen as a corollary of the occupation measure's convexity, which was first proved by Kallenberg (1983). In this section, we give a second proof of Prop 3 based on the occupation measure's convexity.

For any initial distribution $\mu$, the occupation measure of a policy $\pi$ is defined as (Altman, 1999, p. 27)

$x(\pi, \mu; a, s)$ [...]

where $P(a_t, s_t; \pi, \mu)$ denotes the probability of taking action $a_t$ in state $s_t$ at step $t$ when adopting policy $\pi$ with initial distribution $\mu$. It is easy to see that we can write [...]. Kallenberg (1983) proved the following (Altman, 1999, Thm 3.2).

Lemma 4 (Convexity of occupation measure) $X(\mu)$ is convex.

In other words, for $X_1, X_2 \in X(\mu)$, we also have [...]. We now use this result to prove the convexity of the induced function's range.

Proof: [...], we have [...]. Then for any $\delta \in [0, 1]$, we have

[...] $\big((1 - \delta)\, x(\pi_1, \mu; a, s) + \delta\, x(\pi_2, \mu; a, s)\big) \cdot R(a, s)$. (56)

By the convexity of $X(\mu)$, there exists some $\pi'$ such that $x(\pi', \mu; a, s) = (1 - \delta)\, x(\pi_1, \mu; a, s) + \delta\, x(\pi_2, \mu; a, s)$ for all $s \in S$ and $a \in A$. That is, [...] as well. Therefore, $V_\mu$ is convex. Let $\mu$ be the one-hot distribution that puts all mass on state $s$. Then $V(s) = V_\mu$, which is therefore also convex.
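The argument above can be sanity-checked on a small random MDP. The sketch below is illustrative only and rests on several assumptions that are not taken from the paper: it uses the unnormalized discounted occupation measure $\sum_t \gamma^t P(s_t = s, a_t = a)$ (the paper's normalization may differ), a randomly generated MDP, a full-support initial distribution, and the hypothetical helper names `occupation_measure`, `policy_from_measure` and `value`. It mixes the occupation measures of two stationary policies, recovers a stationary policy from the mixture, and confirms that this policy attains the mixed measure and hence the mixed value vector.

```python
import numpy as np

def occupation_measure(P, pi, mu, gamma):
    """Discounted occupation measure x[s, a] = sum_t gamma^t Pr(s_t = s, a_t = a)."""
    n_states = P.shape[0]
    # Transition matrix under pi: P_pi[s, t] = sum_a pi[s, a] * P[s, a, t].
    P_pi = np.einsum("sa,sat->st", pi, P)
    # State occupancy d solves d = mu + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, mu)
    return d[:, None] * pi

def policy_from_measure(x):
    """Stationary policy pi[s, a] = x[s, a] / sum_a x[s, a]."""
    return x / x.sum(axis=1, keepdims=True)

def value(x, R):
    """Vector value under initial distribution mu: sum over (s, a) of x[s, a] * R[s, a, :]."""
    return np.einsum("sa,sak->k", x, R)

rng = np.random.default_rng(0)
n_states, n_actions, n_objectives, gamma = 3, 2, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions, n_objectives))          # vector rewards
mu = np.full(n_states, 1.0 / n_states)                            # full support (assumed)
pi1 = rng.dirichlet(np.ones(n_actions), size=n_states)
pi2 = rng.dirichlet(np.ones(n_actions), size=n_states)

x1 = occupation_measure(P, pi1, mu, gamma)
x2 = occupation_measure(P, pi2, mu, gamma)
delta = 0.3
x_mix = (1 - delta) * x1 + delta * x2

# Policy recovered from the mixed measure, and the measure it actually induces.
pi_mix = policy_from_measure(x_mix)
x_check = occupation_measure(P, pi_mix, mu, gamma)
v_mix = (1 - delta) * value(x1, R) + delta * value(x2, R)
```

Here `x_check` should agree with `x_mix` up to floating-point error, and `value(x_check, R)` with `v_mix`, because the mixture still satisfies the linear flow constraints that characterize occupation measures.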

D WHY A DISCONTINUOUS $g_s$ MAKES THE TRAINING PROCESS UNSTABLE

Training using LS is unstable because $g_s$ is sensitive to preference vectors $w$ that are nearly normal to a surface or edge of $V(s)$: a slight change in $w$ moves the optimal policy to a different vertex of the polytope. Estimating $V(s)$ exacerbates this behaviour, since slight changes in the value functions also change the surface geometry. More specifically, $g_s(w)$ is discontinuous when we do not add our concave regularization term ($\alpha = 0$), so it is highly sensitive to errors in our estimate of $V_{\alpha f}(s)$.

[...] stability problem, as most of the fluctuations near the vertices disappear and the estimate of $g_{s_0}$ improves. We also observe that as $\alpha$ increases to 0.5, a nearly perfect estimate of $g_{s_0}$ is obtained.

These observations tell us that adding the entropy term to the immediate reward indeed makes $g_{s_0}$ easier to learn and improves numerical stability, which supports our claims in Prop 5 and Rmk 4.

Moreover, we observe that the learned $g_{s_0}$ is a surjective function from $W_+$ onto the entire Pareto front, which implies that the value functions induced by the policies learned by CAPQL cover the entire Pareto front of $V_{\alpha f}(s_0)$ as well.
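The jump in $g_s$ at $\alpha = 0$ can be reproduced on a toy problem. The following is a minimal sketch under stated assumptions, not the CAPQL implementation and not one of the paper's experiments: it uses a hypothetical one-state, two-action bandit with reward matrix `REWARDS`, preference vectors that sum to one, and an entropy bonus added to every objective. Under those assumptions the scalarized optimum is the greedy policy for $\alpha = 0$ and a softmax policy for $\alpha > 0$.

```python
import numpy as np

# Hypothetical one-state bandit: two actions (rows), two objectives (columns).
REWARDS = np.array([[1.0, 0.0],
                    [0.0, 1.0]])

def g(w, alpha):
    """Vector value of the policy that is optimal for preference w (w sums to 1)."""
    scores = REWARDS @ w                    # scalarized reward of each action
    if alpha == 0.0:
        p = np.zeros(len(scores))
        p[np.argmax(scores)] = 1.0          # greedy policy: a vertex of the range
        return p @ REWARDS
    z = (scores - scores.max()) / alpha     # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()         # softmax maximizes score + alpha * entropy
    entropy = -np.sum(p * np.log(p))
    return p @ REWARDS + alpha * entropy    # entropy bonus added to every objective

eps = 1e-3
w_left = np.array([0.5 + eps, 0.5 - eps])
w_right = np.array([0.5 - eps, 0.5 + eps])

# Distance between the values returned for two nearly identical preferences.
jump_ls = np.linalg.norm(g(w_left, 0.0) - g(w_right, 0.0))
jump_reg = np.linalg.norm(g(w_left, 0.5) - g(w_right, 0.5))
```

With $\alpha = 0$ the two nearby preferences select opposite vertices, so `jump_ls` stays large however small `eps` is, whereas with $\alpha = 0.5$ `jump_reg` shrinks with `eps`.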

I HYPERPARAMETERS

Tables 1-3 list the hyperparameters of the models considered in Sec 5.3.

Package Versioning. Python 3.10.4 was used as the primary programming language. We accessed MuJoCo210 through gym-0.21.0's wrapper classes. Training was done using pytorch-1.12.1 and NVIDIA's CUDA 11.6.

