MULTI-OBJECTIVE REINFORCEMENT LEARNING: CONVEXITY, STATIONARITY AND PARETO OPTIMALITY

Abstract

In recent years, single-objective reinforcement learning (SORL) algorithms have received a significant amount of attention and seen some strong results. However, it is generally recognized that many practical problems have intrinsic multi-objective properties that cannot be easily handled by SORL algorithms. Although there have been many multi-objective reinforcement learning (MORL) algorithms proposed, there has been little recent exploration of the fundamental properties of the spaces we are learning in. In this paper, we perform a rigorous analysis of policy induced value functions and use the insights to distinguish three views of Pareto optimality. The results imply the convexity of the induced value function's range for stationary policies and suggest that any point of its Pareto front can be achieved by training a policy using linear scalarization (LS). We show the problem that leads to the suboptimal performance of LS can be solved by adding strongly concave terms to the immediate rewards, which motivates us to propose a new vector reward-based Q-learning algorithm, CAPQL. Combined with an actor-critic formulation, our algorithm achieves state-of-the-art performance on multiple MuJoCo tasks in the preference agnostic setting. Furthermore, we empirically show that, in contrast to other LS-based algorithms, our approach is significantly more stable, achieving similar results across various random seeds.

1. INTRODUCTION

The past decade has seen the rapid development of reinforcement learning (RL) algorithms. Recent breakthroughs in RL have made it possible to develop policies that exceed human-level performance: Atari (Mnih et al., 2015), Dota 2 (OpenAI et al., 2019), etc. Despite their great success, the vast majority of RL algorithms are single-objective based. Although many practical problems can be reduced to a SORL task, there is an increasing recognition that many real-world tasks require us to consider their multi-objective nature (Coello, 2000; Pickett & Barto, 2002; Moffaert & Nowé, 2014; Roijers et al., 2013; Abels et al., 2019; Abdolmaleki et al., 2020; Abdelaziz et al., 2021). There are many works that discuss how to find optimal policies in a multi-objective RL (MORL) problem (Gábor et al., 1998; Pickett & Barto, 2002; Moffaert & Nowé, 2014; Roijers et al., 2013; Yang et al., 2019; Parisi et al., 2016; Mahapatra & Rajan, 2020) or a more general dynamic programming setting (Sobel, 1975; Corley, 1985), but the relationship among various definitions of Pareto optimal policies is hardly discussed. Moreover, there is no rigorous analysis of the range of induced value functions, which has been thought hard to characterize and of irregular shape (Vamplew et al., 2008; Roijers et al., 2013; Reymond & Nowe, 2019). (Note: similar work has been done for mixed policies, but it fundamentally differs from the more common stationary policies sought in modern RL.) We hope to give researchers well-aligned intuitions about MORL problems that can save effort and accelerate the rate of research in the field; it is to this end that we introduce this paper. Within this paper, we perform a theoretical analysis of MORL problems with an infinite horizon (rigorous proofs are given in Appx B).
After a quick review of the MORL setting and three widely adopted definitions of Pareto efficiency (PE), we begin our analysis by characterizing the effects of policy alterations on the induced value function. We find that single-state policy alterations are insufficient to optimize the induced value function in a MORL setting, but show how it can be done by a multi-state update. We also prove that improving in all states is generally not possible. From here, we show that the range of value functions is convex, which suggests that linear scalarization (LS) is not the bottleneck in finding PE policies. We discuss the deficiencies of existing LS-based algorithms as suggested by our theory and fix them by augmenting the reward function with a strongly concave term. These insights motivate us to propose a new MORL algorithm (CAPQL), which achieves state-of-the-art performance on multiple MuJoCo environments in the preference agnostic setting. An ablation study is performed to understand how the augmentation affects the algorithm's performance.
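The LS deficiency mentioned above can be sketched numerically. In this toy example (all values hypothetical), several Pareto-efficient points lie on a flat segment of the front, so they attain the same linear score w·v and LS alone gives no signal to prefer one over another; subtracting a strongly concave term makes the scalarized objective strictly concave along the segment, yielding a unique maximizer:

```python
import numpy as np

# Three Pareto-efficient points on a flat segment v1 + v2 = 1 (hypothetical).
front = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
w = np.array([0.5, 0.5])  # preference weights for linear scalarization

ls_scores = front @ w  # [0.5, 0.5, 0.5]: a three-way tie, LS cannot choose

# Subtract a strongly concave (here, negative-quadratic) term to break the tie.
aug_scores = ls_scores - 0.1 * (front ** 2).sum(axis=1)

best = int(np.argmax(aug_scores))  # unique maximizer: index 1, i.e. (0.5, 0.5)
```

Note this is purely illustrative: the concave term is applied directly to value vectors here, whereas the paper's CAPQL adds strongly concave terms to the immediate rewards during training.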

2. MULTI-OBJECTIVE RL PROBLEM

To begin, we will do a quick review of MORL problems and the notation we will be using, as well as introduce our definitions of Pareto optimality. As in SORL problems, we consider an agent interacting with an environment. At each step, the agent performs an action based on the current state, and the environment returns a reward and the next state. Our setting assumes a vector reward in $\mathbb{R}^d$ and reduces to a SORL problem if $d = 1$. We model the interaction as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$. As usual, $\mathcal{S}$ and $\mathcal{A}$ are the sets of states and actions, and $\gamma \in (0, 1)$ is the discount factor. Our discussion considers finite $\mathcal{A}$ and $\mathcal{S}$. When the agent takes action $a \in \mathcal{A}$ in state $s \in \mathcal{S}$, the environment gives reward $R(a, s) \in \mathbb{R}^d$ and moves to the next state following the transition probability $P(a, s) \in \Delta^{|\mathcal{S}|}$. In this paper, we consider an infinite-horizon MORL problem and assume bounded rewards. Let $\mathbf{R}(s) = [R(a, s) \mid a \in \mathcal{A}] \in \mathbb{R}^{d \times |\mathcal{A}|}$ and $\mathbf{P}(s) = [P(a, s) \mid a \in \mathcal{A}] \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$, and let $\Pi$ be the set of all stationary policies, where $\pi \in \Pi$ maps a state to a distribution over actions. Following the work of Roijers et al. (2013), given $\pi \in \Pi$, the induced value function $V^\pi(s) \in \mathbb{R}^d$ returns the expected sum of discounted rewards over the interaction trajectory with initial state $s$:foot_1

$$V^\pi(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(a_t, s_t)\right] \text{ with } s_t \sim P(a_{t-1}, s_{t-1}),\; a_{t-1} \sim \pi(s_{t-1}),\; s_0 = s. \quad (1)$$

Let $\mu : \mathcal{S} \to [0, 1]$ be the probability distribution over initial states. The expected value function is:

$$V^\pi_\mu := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(a_t, s_t)\right] \text{ with } s_t \sim P(a_{t-1}, s_{t-1}),\; a_{t-1} \sim \pi(s_{t-1}),\; s_0 \sim \mu. \quad (2)$$

That is, $V^\pi_\mu = \mathbb{E}_{s_0 \sim \mu}\, V^\pi(s_0)$. Let $\mathbf{V}^\pi = [V^\pi(s) \mid s \in \mathcal{S}] \in \mathbb{R}^{d \times |\mathcal{S}|}$, $\mathcal{V}(s) = \{V^\pi(s) \mid \pi \in \Pi\}$, $\mathcal{V}_\mu = \{V^\pi_\mu \mid \pi \in \Pi\}$, and $\mathcal{V} = \{\mathbf{V}^\pi \mid \pi \in \Pi\}$. The Bellman equation (Bellman, 2003) can be written as:

$$V^\pi(s) = \left[\mathbf{R}(s) + \gamma \mathbf{V}^\pi \mathbf{P}(s)\right] \pi(s) \quad \text{for } s \in \mathcal{S}.$$

In RL, we are interested in finding a $\pi$ that maximizes $V^\pi(s)$.
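The matrix form of the Bellman equation above can be iterated directly to evaluate a stationary policy. Below is a minimal sketch on a randomly generated toy MDP (all quantities hypothetical): $|\mathcal{S}| = 2$ states, $|\mathcal{A}| = 2$ actions, $d = 2$ objectives.

```python
import numpy as np

n_s, n_a, d, gamma = 2, 2, 2, 0.9
rng = np.random.default_rng(0)

R = rng.random((d, n_a, n_s))        # R[:, a, s]: vector reward for (a, s)
P = rng.random((n_s, n_a, n_s))      # P[:, a, s]: next-state distribution
P /= P.sum(axis=0, keepdims=True)    # normalize columns into distributions

pi = np.full((n_a, n_s), 1.0 / n_a)  # a stationary (uniform) policy

def evaluate(pi, n_iter=500):
    """Iterate V(s) <- [R(s) + gamma * V P(s)] pi(s); a gamma-contraction."""
    V = np.zeros((d, n_s))
    for _ in range(n_iter):
        Q = R + gamma * np.einsum("ik,kas->ias", V, P)  # d x |A| x |S|
        V = np.einsum("ias,as->is", Q, pi)              # average actions by pi
    return V

V_pi = evaluate(pi)  # V_pi[:, s] approximates the induced value V^pi(s)
```

Since the update is a $\gamma$-contraction, the iterate converges to the unique fixed point of the vector Bellman equation for this policy.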
When $d = 1$, the regular order on $\mathbb{R}$ is adopted, and the optimal policy gives the greatest $V^\pi(s)$. For $d > 1$, we consider the Pareto order (PO): for real-valued tensors $u, v$ of the same shape, $u \succeq v$ if every entry of $u$ is not less than its counterpart in $v$.foot_2 For a set $C$ of tensors of the same shape, $v \in C$ is Pareto efficient (PE) if for all $u \in C$, either $v \succeq u$ or $u$ and $v$ are incomparable; equivalently, no $u \in C$ satisfies $u \succeq v$ with $u \neq v$. (A set may have multiple PE elements.) In this paper, we are interested in three types of PE policies that are not carefully distinguished in the existing literature (Roijers et al., 2013; Song et al., 2020; Abdolmaleki et al., 2020).

Definition 1 (Pareto efficient policies). For $s \in \mathcal{S}$ and initial state distribution $\mu$, $\pi \in \Pi$ is single-state PE (SPE) if $V^\pi(s)$ is PE in $\mathcal{V}(s)$, and it is distributed initial state PE (DPE) if $V^\pi_\mu$ is PE in $\mathcal{V}_\mu$. Likewise, $\pi \in \Pi$ is aggregate PE (APE) if $\mathbf{V}^\pi$ is PE in $\mathcal{V}$. Let $\Pi^*_s$, $\Pi^*_\mu$ and $\Pi^*$ denote the sets of policies that are SPE, DPE and APE, respectively.

In Sec 4.2, assuming the Markov chain is ergodic, we show that SPE and DPE are equivalent and have the relationship with APE demonstrated in Fig. 1. Since we prove that the sets $\Pi^*_s$ coincide for all $s \in \mathcal{S}$, we may not specify which state we are referring to when discussing SPE.
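The Pareto order above is easy to check on a finite set of points. A minimal sketch (hypothetical values), following the definition that $v$ is PE in $C$ when no $u \in C$ satisfies $u \succeq v$ with $u \neq v$:

```python
import numpy as np

def dominated(v, C):
    """True if some u in C weakly dominates v entrywise and differs from it."""
    return any(np.all(u >= v) and np.any(u > v) for u in C)

def pareto_efficient(C):
    """Return the Pareto-efficient elements of a finite set of vectors C."""
    return [v for v in C if not dominated(v, C)]

C = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
     np.array([0.5, 0.5]), np.array([0.4, 0.4])]
front = pareto_efficient(C)  # (0.4, 0.4) is dominated by (0.5, 0.5)
```

The first three points are pairwise incomparable and hence all PE, illustrating that a set may have multiple PE elements; only the dominated point is excluded.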



Footnotes. Extra literature reviews are given in Appx A. foot_1: The entries of $V^\pi(s)$ are induced value functions in SORL and are known to exist; thus $V^\pi(s)$ also exists. foot_2: If every entry in $u$ is strictly greater than its counterpart in $v$, we write $u \succ v$. In general, not all tensors are comparable. For example, given $u = (1, 0)^\top$ and $v = (0, 1)^\top$, neither $u \succeq v$ nor $v \succeq u$ holds. We write $u \parallel v \iff (u \not\succeq v) \wedge (v \not\succeq u)$ and call such $u, v$ incomparable. For single-entry tensors, PO reduces to the regular order on $\mathbb{R}$.



Figure 1: The relationship among the three types of PE in Defn 1.

