Q-PENSIEVE: BOOSTING SAMPLE EFFICIENCY OF MULTI-OBJECTIVE RL THROUGH MEMORY SHARING OF Q-SNAPSHOTS

Abstract

Many real-world continuous control problems involve a dilemma of weighing the pros and cons of conflicting objectives; multi-objective reinforcement learning (MORL) serves as a generic framework for learning control policies under different preferences over objectives. However, existing MORL methods either rely on multiple passes of explicit search for finding the Pareto front, and are therefore not sample-efficient, or utilize a shared policy network for coarse knowledge sharing among policies. To boost the sample efficiency of MORL, we propose Q-Pensieve, a policy improvement scheme that stores a collection of Q-snapshots to jointly determine the policy update direction, thereby enabling data sharing at the policy level. We show that Q-Pensieve can be naturally integrated with soft policy iteration with a convergence guarantee. To substantiate this concept, we propose the technique of the Q replay buffer, which stores the learned Q-networks from past iterations, and arrive at a practical actor-critic implementation. Through extensive experiments and an ablation study, we demonstrate that, with far fewer samples, the proposed algorithm outperforms the benchmark MORL methods on a variety of MORL benchmark tasks.

1. INTRODUCTION

Many real-world sequential decision-making problems involve the joint optimization of multiple objectives, some of which may be in conflict. For example, in robot control, it is desirable for the robot to run fast while consuming as little energy as possible; nevertheless, we inevitably need to use more energy to make the robot run faster, regardless of how energy-efficient the robot motion is. Various other real-world continuous control problems are also multi-objective tasks by nature, such as congestion control in communication networks (Ma et al., 2022) and diversified portfolios (Abdolmaleki et al., 2020). Moreover, the relative importance of these objectives could vary over time (Roijers and Whiteson, 2017). For example, the preference over energy and speed in robot locomotion could change with the energy budget; network service providers need to continuously switch service among various networking applications (e.g., on-demand video streaming versus real-time conferencing), each of which could have different preferences over latency and throughput. To address these practical challenges, multi-objective reinforcement learning (MORL) serves as a classic and popular formulation for learning optimal control strategies from vector-valued reward signals and achieving favorable trade-offs among the objectives. In the MORL framework, the goal is to learn a collection of policies under which the attained return vectors recover as much of the Pareto front as possible. One popular approach to MORL is to explicitly search for the Pareto front with the aim of maximizing the hypervolume associated with the reward vectors, such as evolutionary search (Xu et al., 2020) and search by first-order stationarity (Kyriakis et al., 2022). While effective, explicit search algorithms are known to be rather sample-inefficient, as the data sharing among different passes of explicit search is quite limited.
As a result, it is typically difficult to maintain a sufficiently diverse set of optimal policies for different preferences within a reasonable number of training samples. Another way to address MORL is to implicitly search for non-dominated policies through linear scalarization, i.e., to convert the vector-valued reward signal into a single scalar with the help of a linear preference and then apply a conventional single-objective RL algorithm for iteratively improving the policies (e.g., (Abels et al., 2019; Yang et al., 2019)). To enable implicit search for diverse preferences simultaneously, a single network is typically used to express a whole collection of policies. As a result, some level of data sharing among policies of different preferences is done implicitly through the shared network parameters. However, such sharing is clearly not guaranteed to achieve policy improvement for all preferences. Therefore, one critical open research question remains: How can we boost the sample efficiency of MORL through better policy-level knowledge sharing? To answer this question, we revisit MORL from the perspective of memory sharing among the policies learned across different training iterations and propose Q-Pensieve. A "Pensieve", as depicted in the Harry Potter novels, is a magical device used to store pieces of personal memories, which can later be shared with someone else. Drawing an analogy between memory sharing among humans and knowledge sharing among policies, we propose to construct a Q-Pensieve, which stores snapshots of the Q-functions of the policies learned in past iterations. Upon improving the policy for a specific preference, these Q-snapshots help jointly determine the policy update direction. In this way, we explicitly enforce knowledge sharing at the policy level and thereby enhance sample reuse in learning optimal policies for various preferences.
To substantiate this idea, we start by considering Q-Pensieve memory sharing in the tabular planning setting and integrate Q-Pensieve with soft policy iteration for entropy-regularized MDPs. Inspired by (Yang et al., 2019), we leverage the envelope operation and propose Q-Pensieve policy iteration for MORL, which we show preserves a convergence guarantee similar to that of standard single-objective soft policy iteration. Based on this result, we propose a practical implementation that consists of two major components: (i) We introduce the technique of the Q replay buffer. Similar to the standard replay buffer of state transitions, a Q replay buffer is meant to achieve sample reuse and improve sample efficiency, but notably at the policy level. Through the use of the Q replay buffer, we can directly obtain a large collection of Q-functions, each of which corresponds to a policy from a prior training iteration, without any additional effort or computation in forming the Q-Pensieve. (ii) We convert Q-Pensieve policy iteration into an actor-critic off-policy MORL algorithm by adapting soft actor-critic to the multi-objective setting and using it as the base of our implementation. The main contributions of this paper can be summarized as follows: • We identify the critical sample inefficiency issue in MORL and address it by proposing Q-Pensieve, a policy improvement scheme for enhancing knowledge sharing at the policy level. We then present Q-Pensieve policy iteration and establish its convergence property. • We substantiate the concept of Q-Pensieve policy iteration by proposing the technique of the Q replay buffer and arrive at a practical actor-critic implementation. • We evaluate the proposed algorithm in various benchmark MORL environments, including Deep Sea Treasure and MuJoCo.
Through extensive experiments and an ablation study, we demonstrate that the proposed Q-Pensieve can indeed achieve significantly better empirical sample efficiency than the popular benchmark MORL algorithms, in terms of multiple common MORL performance metrics, including hypervolume and utility.

2. PRELIMINARIES

Multi-Objective Markov Decision Processes (MOMDPs). We consider the formulation of an MOMDP defined by the tuple (S, A, P, r, γ, D, S λ , Λ), where S denotes the state space, A is the action space, P : S × A × S → [0, 1] is the transition kernel of the environment, r : S × A → [-r max , r max ] d is the vector-valued reward function with d as the number of objectives, γ ∈ (0, 1) is the discount factor, D is the initial state distribution, S λ : R d → R is the scalarization function (under some preference vector λ ∈ R d ), and Λ denotes the set of all preference vectors. In this paper, we focus on the linear reward scalarization setting, i.e., S λ (r) = λ ⊤ r(s, a), as commonly adopted in the MORL literature (Abels et al., 2019; Yang et al., 2019; Kyriakis et al., 2022). Without loss of generality, we let Λ be the unit simplex. If d = 1, an MOMDP degenerates to a standard MDP, and we simply use r(s, a) to denote the scalar reward. At each time step t ∈ N ∪ {0}, the learner receives the observation s t , takes an action a t , and receives a reward vector r t . We use π : S → ∆(A) to denote a stationary randomized policy, where ∆(A) denotes the set of all probability distributions over the action space. Let Π be the set of all such policies. Single-Objective Entropy-Regularized RL. In the standard framework of single-objective entropy-regularized RL (Haarnoja et al., 2017; 2018; Geist et al., 2019), the goal is to learn an optimal policy for an entropy-regularized MDP, where an entropy regularization term is augmented to the original reward function. For a policy π ∈ Π, the regularized value functions V π : S → R and Q π : S × A → R can be characterized through the regularized Bellman equations as Q π (s, a) = r(s, a) + γE s′∼P(•|s,a) [V π (s′)], (1) V π (s) = E a∼π(•|s) [Q π (s, a) − α log π(a|s)], where α is a temperature parameter that specifies the relative importance of the entropy regularization term.
In this setting, the goal is to learn an optimal policy π* such that Q π* (s, a) ≥ Q π (s, a), for all (s, a) and for all π ∈ Π. An optimal policy can be obtained through soft policy iteration, which alternates between soft policy evaluation and soft policy improvement: (i) Soft policy evaluation: For a policy π, the soft Q-function of π can be obtained by iteratively applying the corresponding soft Bellman backup operator T π defined as T π Q(s, a) = r(s, a) + γE s′∼P(•|s,a) [V (s′)], where V (s′) = E a′∼π(•|s′) [Q(s′, a′) − α log π(a′|s′)]. (ii) Soft policy improvement: In each iteration k, the policy is updated towards an energy-based policy induced by the soft Q-function, i.e., π k+1 = arg min π′∈Π D KL ( π′(•|s) ∥ exp( (1/α) Q π k (s, •) ) / Z π k (s) ), where Π is the set of parameterized policies of interest and Z π k is the normalization term. Multi-Objective Entropy-Regularized RL. We extend standard single-objective entropy-regularized RL to the multi-objective setting. For each policy π ∈ Π, we define the multi-objective regularized value functions via the following multi-objective version of the entropy-regularized Bellman equations: Q π (s, a) = r(s, a) + γE s′∼P(•|s,a) [V π (s′)], (5) V π (s) = E a∼π(•|s) [Q π (s, a) − α log π(a|s) 1 d ], where 1 d denotes a d-dimensional vector of all ones. In this paper, our goal is to learn a preference-dependent policy π(•|•; λ) such that for any preference λ ∈ Λ, λ ⊤ Q π(•|•;λ) (s, a; λ) ≥ λ ⊤ Q π′ (s, a; λ), for all (s, a) and for all π′ ∈ Π. For ease of notation, we let V π(•|•;λ) (s; λ) ≡ V π (s; λ) and Q π(•|•;λ) (s, a; λ) ≡ Q π (s, a; λ) in the sequel.
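As a concrete illustration of the soft policy evaluation step above, the following sketch iterates the soft Bellman backup in a small tabular MDP until (approximate) convergence. The function name, array shapes, and hyperparameter values are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def soft_policy_evaluation(P, r, pi, alpha=0.1, gamma=0.9, n_iters=500):
    """Iteratively apply the soft Bellman backup T^pi in a small tabular MDP.

    P:  (S, A, S) transition kernel; r: (S, A) scalar reward; pi: (S, A) policy.
    Returns (an approximation of) the soft Q-function of pi.
    """
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        # V(s') = E_{a'~pi(.|s')}[Q(s', a') - alpha * log pi(a'|s')]
        V = (pi * (Q - alpha * np.log(pi + 1e-12))).sum(axis=1)
        # T^pi Q(s, a) = r(s, a) + gamma * E_{s'~P(.|s,a)}[V(s')]
        Q = r + gamma * P @ V
    return Q
```

Since the backup is a γ-contraction, the returned Q satisfies the regularized Bellman equation up to an error on the order of γ^n_iters.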

3. ALGORITHMS

In this section, we propose our Q-Pensieve learning algorithm for boosting the sample efficiency of multi-objective RL. We first describe the idea of Q-Pensieve in the tabular planning setting by introducing Q-Pensieve soft policy iteration. We then extend the idea to develop a practical deep reinforcement learning algorithm.

3.1. NAIVE MULTI-OBJECTIVE SOFT POLICY ITERATION

To solve MORL in the entropy-regularized setting, one straightforward approach is to leverage the single-objective soft policy improvement with the help of linear scalarization. That is, in each iteration k, the policy can be updated by π k+1 (•|•; λ) = arg min π′∈Π D KL ( π′(•|s) ∥ exp( (1/α) λ ⊤ Q π k (s, •; λ) ) / Z π k λ (s) ). (7) While (7) serves as a reasonable approach, a learning algorithm based on the update scheme in (7) could suffer from sample inefficiency due to the lack of policy-level knowledge sharing: In (7), the policy for each preference λ is updated completely separately, based solely on the Q-function under λ. Moreover, as the update (7) relies on an accurate estimate of the Q-function, the critic learning for the policy of each individual preference would typically require at least a moderate number of samples. These issues could be particularly critical for a large preference set in practice. While the use of a conditioned policy network (e.g., (Abels et al., 2019)), a commonly-used network architecture in the MORL literature, could somewhat mitigate this issue, it remains unclear whether the knowledge sharing induced by the conditioned network can indeed achieve policy improvement across various preferences. As a result, a systematic approach is needed for boosting the sample efficiency of MORL.
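In the tabular case, the minimizer of this KL objective has a closed form: a softmax of the scalarized Q-values. A minimal sketch under that assumption (function name and shapes are our own):

```python
import numpy as np

def scalarized_soft_policy(Q, lam, alpha=0.1):
    """Tabular minimizer of the scalarized KL objective:
    pi(a|s) proportional to exp(lam^T Q(s, a) / alpha).

    Q: (S, A, d) vector-valued Q-function; lam: (d,) preference vector.
    """
    energy = (Q @ lam) / alpha                   # (S, A) scalarized values
    energy -= energy.max(axis=1, keepdims=True)  # for numerical stability
    pi = np.exp(energy)
    return pi / pi.sum(axis=1, keepdims=True)
```

As α → 0 this approaches the greedy policy under λ⊤Q; larger α yields a more uniform (higher-entropy) policy.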

3.2. Q-PENSIEVE SOFT POLICY ITERATION

To boost the sample efficiency of MORL, we propose to enhance policy-level knowledge sharing by constructing a Q-Pensieve for memory sharing across iterations. Specifically, a Q-Pensieve is a collection of Q-snapshots obtained from past iterations, and it is formed to boost the policy improvement update with respect to the Q-function of the current iteration, as these Q-snapshots could offer potentially better update directions under linear scalarization. Moreover, one major computational benefit of Q-Pensieve is that these Q-snapshots are obtained without any additional updates or samples from the environment (and hence come for free), as they already exist during training. We substantiate this idea by first introducing the Q-Pensieve soft policy iteration in the tabular setting (i.e., |S| and |A| are finite) as follows: Q-Pensieve Policy Improvement. In the policy improvement step of the k-th iteration, for each specific λ, we update the policy as π k+1 (•|•; λ) = arg min π′∈Π D KL ( π′(•|s; λ) ∥ exp( sup λ′∈W k (λ), Q′∈Q k (1/α) λ ⊤ Q′(s, •; λ′) ) / Z Q k (s) ), (8) where Z Q k is again the normalization term, W k (λ) ⊂ Λ is a set of preference vectors, and Q k is a set of Q-snapshots. The two sets W k (λ) and Q k are selected as follows: • For W k (λ), the only requirement is that λ ∈ W k (λ) for all k. The preference sets can differ across iterations. • Similarly, for Q k , the only requirement is that Q π k ∈ Q k for all k. The set of Q-snapshots can also differ across iterations; hence, the choice of Q k is rather flexible. Choosing W k (λ) = {λ} and Q k = {Q π k } recovers the update in (7). Policy Evaluation.
In the policy evaluation step, we evaluate the policy that corresponds to each preference λ by iteratively applying the multi-objective soft Bellman backup operator T π MO as (T π MO Q)(s, a; λ) = r(s, a) + γE s′∼P(•|s,a), a′∼π(•|s′;λ) [Q(s′, a′; λ) − α log π(a′|s′; λ) 1 d ]. (9) Remark 1 The Q-Pensieve update in (8) is inspired by the envelope Q-learning (EQL) technique (Yang et al., 2019), where in each iteration k, the Q-learning update takes into account the envelope formed by the Q-functions of the current policy π k for different preferences. The fundamental difference between Q-Pensieve and EQL is that Q-Pensieve further achieves memory sharing across training iterations through the use of Q-snapshots from past iterations, whereas EQL focuses mainly on the Q-function of the current iteration. Convergence of Q-Pensieve Soft Policy Iteration. Another nice feature of the Q-Pensieve policy improvement step is that it preserves a convergence result similar to that of standard single-objective soft policy iteration, as stated below. The proof of Theorem 3.1 is provided in Appendix A. Theorem 3.1 Under the Q-Pensieve soft policy iteration given by (8) and (9), the sequence of preference-dependent policies {π k } converges to a policy π* such that λ ⊤ Q π* (s, a; λ) ≥ λ ⊤ Q π (s, a; λ) for all π ∈ Π, for all (s, a) ∈ S × A, and for all λ ∈ Λ.
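The core of the Q-Pensieve improvement step in (8) is the inner sup over preference vectors and Q-snapshots. It can be sketched as follows, with each snapshot represented as a callable returning per-action vector-valued Q-estimates (the names and signatures here are our own assumptions, not the paper's code):

```python
import numpy as np

def pensieve_target(q_snapshots, lam, W, s):
    """Energy of the target policy in (8):
    sup over lam' in W and Q' in the Q-Pensieve of lam^T Q'(s, ., lam').

    q_snapshots: callables Q'(s, lam') -> (A, d) vector Q-values per action.
    lam: (d,) current preference; W: list of preferences (lam included).
    Returns an (A,) array: the pointwise maximum of all scalarized candidates.
    """
    candidates = [Q(s, lam_p) @ lam for Q in q_snapshots for lam_p in W]
    return np.max(np.stack(candidates, axis=0), axis=0)
```

With a single snapshot (the current Q) and W = {λ}, this reduces to the naive scalarized target of (7); additional snapshots and preferences can only raise the target pointwise.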

3.3. PRACTICAL IMPLEMENTATION OF Q-PENSIEVE

In this section, we present the implementation of the proposed Q-Pensieve algorithm for learning policies with function approximation for general state and action spaces. Q Replay Buffer. Based on (8), we know that the policy update of Q-Pensieve involves both the current Q-function and the Q-snapshots from past iterations. To implement this, we introduce the Q replay buffer, which stores multiple Q-networks in a predetermined manner (e.g., first-in first-out). Notably, unlike the conventional experience replay buffer (Mnih et al., 2013) of state transitions, the Q replay buffer stores the Q-networks learned in past iterations as candidates for forming the Q-Pensieve. On the other hand, while each Q-network requires a moderate amount of memory, we found that in practice a rather small Q replay buffer is already effective enough for boosting the sample efficiency. We further illustrate this observation through the experimental results in Section 4. Next, we convert the Q-Pensieve soft policy iteration into an actor-critic off-policy MORL algorithm. Specifically, we adapt the idea of soft actor-critic to Q-Pensieve by minimizing the residual of the multi-objective soft Q-function: Let θ and ϕ be the parameters of the policy network and the critic network, respectively. Then, the critic network is updated by minimizing the following loss L Q (ϕ; λ) = E (s,a)∼µ [ ( λ ⊤ Q ϕ (s, a; λ) − λ ⊤ ( r(s, a) + γE s′∼P(•|s,a) [ V φ (s′) ] ) )² ], ( ) where φ is the parameter of the target network and µ is the sampling distribution of the state-action pairs (e.g., a distribution induced by a replay buffer of state transitions). On the other hand, based on (8), the policy network is updated by minimizing the following objective L π (θ; λ) = E s∼µ [ E a∼π θ [ sup λ′∈W (λ), Q′∈Q ( α log π θ (a|s; λ) − λ ⊤ Q′(s, a; λ′) ) ] ]. ( ) The overall architecture of Q-Pensieve is provided in Figure 1. The pseudo code of the Q-Pensieve algorithm is described in Algorithm 1 in the Appendix.
The code of our experiments is publicly available. Notably, in Section 4 we show that empirically a relatively small Q buffer size (e.g., 4 in our experiments) can already offer a significant performance improvement.
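A minimal sketch of the Q replay buffer described above: a small FIFO container of critic snapshots, with the current critic always included when forming the Q-Pensieve (as required in Section 3.2). In practice the entries would be copies of critic parameters; here they are arbitrary objects, and the class and method names are our own:

```python
from collections import deque
import copy

class QReplayBuffer:
    """FIFO buffer of critic snapshots forming the Q-Pensieve.

    Entries are deep copies of past critics (here arbitrary objects; in
    practice, copies of critic parameters). Once capacity is reached, the
    oldest snapshot is evicted first.
    """
    def __init__(self, capacity=4):
        self.buf = deque(maxlen=capacity)

    def push(self, q_network):
        self.buf.append(copy.deepcopy(q_network))

    def snapshots(self, current_q):
        # Sec. 3.2 requires the current Q-function to always be a candidate.
        return [current_q] + list(self.buf)
```

Deep-copying on push keeps each snapshot frozen even as the live critic continues to be updated, which is exactly what makes the snapshots "memories" of past iterations.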

4. EXPERIMENTS

In this section, we demonstrate the effectiveness of Q-Pensieve on various benchmark RL tasks and discuss how Q-Pensieve boosts the sample efficiency through an extensive ablation study.

4.1. EXPERIMENTAL CONFIGURATION

Popular Benchmark Methods. We compare the proposed algorithm against various popular benchmark methods, including the Conditioned Network with Diverse Experience Replay (CN-DER) in (Abels et al., 2019), the Prediction-Guided Multi-Objective RL (PGMORL) in (Xu et al., 2020), the Pareto Following Algorithm (PFA) in (Parisi et al., 2014), and SAC (Haarnoja et al., 2018). For CN-DER, as the original CN-DER is built on deep Q-networks (DQN) for discrete actions, we modify the source code of Abels et al. (2019) for continuous control by implementing CN-DER on top of DDPG. Moreover, we follow the same DER technique, which uses a diverse replay buffer and assigns priority according to how much the samples increase the overall diversity of the buffer. For PGMORL and PFA, we use the open-source implementation of (Xu et al., 2020) for the experiments. As these explicit search methods typically require more samples before reaching a comparable performance level, we evaluate the performance of PGMORL and PFA under both 1 times and β times (β > 1) the number of samples used by Q-Pensieve to demonstrate the sample efficiency of Q-Pensieve. For SAC, as the MORL problem reduces to a single-objective one under a fixed preference, we train multiple models using single-objective SAC (one model for each fixed preference) as a performance reference for the other MORL methods. Performance Metrics. In the evaluation, we consider the following three commonly-used performance metrics for MORL: • HyperVolume (HV): Let R be the set of attained return vectors and r 0 ∈ R d be a reference point. Then, the hypervolume is defined as HV := ∫ I{z ∈ H(R)} dz, where H(R) := {z ∈ R d : ∃r ∈ R, r 0 ≺ z ≺ r}. • Utility (UT): the expected scalarized return over preferences, i.e., UT := E λ [ Σ t λ ⊤ r t ]. • Expected Dominance (ED): ED := E λ [ I{ Σ_{t=0}^{T 1} λ ⊤ r 1 t > Σ_{t=0}^{T 2} λ ⊤ r 2 t } ], where r 1 t , r 2 t are the reward vectors, and T 1 , T 2 are the episode lengths of algorithms 1 and 2, respectively.
ED serves as a useful metric for pairwise comparison in those problems where the return vectors under different preferences can differ by a lot in magnitude (in this case, HV and UT could be dominated by the return vectors of a few preferences). Evaluation Domains. We evaluate the algorithms in the following domains: (i) Continuous Deep Sea Treasure (DST): a two-objective continuous control task modified from the original DST environment. (ii) Multi-Objective Continuous LunarLander: a four-objective task modified from the classic control task in the OpenAI gym. (iii) Multi-Objective MuJoCo: modified benchmark locomotion tasks with either two or three objectives. Configuration of Q-Pensieve. For Q-Pensieve, at each policy update, we set the size of the preference set W k (λ) to be 5 (including λ and another four preferences drawn randomly) and set the size of the Q replay buffer to be 4, unless stated otherwise.
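For reference, the two set-based metrics above can be estimated as follows in the two-objective case: a standard sweep for the 2-D hypervolume, and a Monte-Carlo estimate of ED with preferences drawn from the unit simplex. Both are our own sketches (including the simplex-sampling choice), not the paper's evaluation code:

```python
import numpy as np

def hypervolume_2d(returns, ref):
    """Hypervolume of a set of 2-D return vectors w.r.t. reference point ref
    (both objectives maximized), via a sweep over points sorted by the
    first objective."""
    ref = np.asarray(ref, dtype=float)
    pts = np.asarray(returns, dtype=float)
    pts = pts[(pts > ref).all(axis=1)]       # keep points dominating ref
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(-pts[:, 0])]        # first objective, descending
    hv, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:                       # non-dominated during the sweep
            hv += (x - ref[0]) * (y - best_y)
            best_y = y
    return hv

def expected_dominance(returns1, returns2, n_prefs=1000, seed=0):
    """Monte-Carlo estimate of ED: the probability over random preferences
    lam (sampled uniformly on the unit simplex) that algorithm 1's
    scalarized return beats algorithm 2's.
    returns1/returns2: (d,) total return vectors of the two algorithms."""
    rng = np.random.default_rng(seed)
    lam = rng.dirichlet(np.ones(len(returns1)), size=n_prefs)
    wins = (lam @ np.asarray(returns1)) > (lam @ np.asarray(returns2))
    return wins.mean()
```

The 2-D sweep runs in O(n log n); exact hypervolume in higher dimensions requires dedicated algorithms (e.g., WFG-style decompositions).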

4.2. EXPERIMENTAL RESULTS

Does Q-Pensieve achieve better sample efficiency than the MORL benchmark methods? Table 1 shows the performance of Q-Pensieve and the benchmark methods in terms of the three metrics. For each algorithm, we report the mean and the standard deviation over five random seeds. We observe that Q-Pensieve consistently enjoys higher HV, UT, and ED in almost all the domains. More importantly, Q-Pensieve indeed exhibits superior sample efficiency, as it still outperforms the explicit search methods (i.e., PFA and PGMORL) even when these methods are given 10 times the number of samples used by Q-Pensieve. Moreover, we observe that the explicit search methods (i.e., PFA and PGMORL) often have larger HV than the implicit search methods (such as CN-DER), while implicit search methods tend to have larger UT. This manifests the design principles and characteristics of the two families of approaches: explicit search is designed mainly for achieving large HV, whereas implicit search typically aims for a larger scalarized return.

Table 1: Comparison of Q-Pensieve and other benchmark algorithms in terms of the three metrics across ten domains. We report the mean and standard deviation over five random seeds. The ED is calculated by comparing each algorithm to a multi-objective version of SAC (equivalent to Q-Pensieve with the size of the preference set equal to 1 and without a Q replay buffer). We set β = 10 for HalfCheetah2d, Ant2d, Ant3d, and Hopper3d; β = 5 for LunarLander4d, LunarLander5d, and Hopper5d; and β = 3 for DST2d, Hopper2d, and Walker2d.

How much improvement in sample efficiency can Q-Pensieve achieve compared to training multiple single-objective SAC models separately? To answer this question, we conduct experiments on 2-objective MuJoCo tasks and consider a whole range of 19 preference vectors. We train 19 models by using single-objective SAC, one model for each individual preference.
Each model is trained for 1.5M steps (and hence the total number of steps under SAC is 28.5M). By contrast, Q-Pensieve uses only 1.5M steps in total to learn policies for all the preferences. Figure 2 shows the return vectors attained by Q-Pensieve and the collection of 19 SAC models. Q-Pensieve achieves comparable or better returns than the collection of SAC models with only 1/19 of the samples. This further demonstrates the sample efficiency of Q-Pensieve. Why can Q-Pensieve outperform single-objective SAC in some cases? From Figures 2(a) and (c), we see that Q-Pensieve can attain some return vectors that are strictly better than those of the single-objective SAC models. The reasons behind this phenomenon are mainly two-fold: (i) Under single-objective SAC, even though we train one model for each individual preference, single-objective SAC can still get stuck at a sub-optimal policy under some preferences. (ii) By contrast, Q-Pensieve has a better chance of escaping from these sub-optimal policies with the help of the Q-snapshots in the Q replay buffer. To verify this argument, we design a hybrid SAC algorithm as follows: (a) For the first 10^5 time steps, the algorithm simply follows single-objective SAC. (b) At time step 10^5, it switches to the update rule of Q-Pensieve based on the Q-snapshots stored in the Q replay buffer of another model trained under the Q-Pensieve algorithm in parallel. Figure 3 shows the performance of this hybrid algorithm in DST and HalfCheetah. Clearly, the Q-Pensieve update helps the SAC model escape from sub-optimal policies under various preferences. An ablation study on the Q replay buffer. To verify the effectiveness of the Q replay buffer technique, we compare the performance of Q-Pensieve with buffer size equal to 4 against that without a Q replay buffer (termed "Vanilla" in Figures 4 and 5).
Figures 4 and 5 show the attained return vectors and HV of both methods. We can see that the Q replay buffer indeed leads to better policy improvement behavior, in terms of both HV and the scalarized returns. However, the curves may oscillate considerably toward the end of training. This is because the algorithm may take its update direction from a different Q-snapshot whose inner product with the preference vector is very close to that of the current Q-function, so the attained points lie on nearly the same contour of scalarized return. Figure 5: A comparison in HV between Q-Pensieve with buffer size equal to 4 and that without a Q replay buffer at different training stages.

5. RELATED WORK

Multi-objective RL has been extensively studied from two major perspectives: Explicit Search. A plethora of prior works on MORL update a policy or a set of policies by explicitly searching for the Pareto front in the reward space. To learn policies under time-varying preferences, (Natarajan and Tadepalli, 2005) proposed storing a set of policies, which are then used to find a proper policy for a new preference without learning from scratch. (Lizotte et al., 2012) leveraged linear value function approximation to search for optimal policies. (Van Moffaert and Nowé, 2014) proposed Pareto Q-learning, which stores the immediate rewards and the non-dominated future return vectors separately and leverages Pareto dominance for selecting actions in Q-learning. (Parisi et al., 2014) presented a policy gradient approach to search for non-dominated policies. (Mossalam et al., 2016) solved MORL via scalarized Q-learning along with the concept of prioritizing the corner weights for selecting the preference of the scalarized problem. (Xu et al., 2020) proposed an evolutionary approach to search for the Pareto set of policies, with the help of a prediction model for determining the search direction. (Kyriakis et al., 2022) presented a policy gradient method that approximates the Pareto front via a first-order necessary condition. However, the above explicit search algorithms are known to be rather sample-inefficient, as the knowledge sharing among different passes of search is limited. Implicit Search. Another class of algorithms is designed to improve policies for multiple preferences through implicit search. For example, (Abels et al., 2019) presents the Conditioned Network, which extends the standard single-objective DQN to learning preference-dependent multi-objective Q-functions.
To achieve scale-invariant MORL, (Abdolmaleki et al., 2020) proposed to first learn the Q-functions for different objectives and encode the preference through constraints. Recently, (Yang et al., 2019) proposed envelope Q-learning, which takes the envelope over the current multi-objective Q-values across preferences so that any policy can benefit from the experiences of other preferences, thereby making training more efficient; (Zhou et al., 2020) proposed model-based envelope value iteration based on envelope Q-functions, which provides an efficient way to obtain optimal multi-objective Q-functions. While our method is inspired by (Yang et al., 2019), the main difference between our work and theirs is that we boost the sample efficiency of MORL via explicit memory sharing among the policies learned during training.

6. CONCLUSION

This paper proposes Q-Pensieve, which significantly enhances policy-level data sharing in order to boost the sample efficiency of MORL. We substantiate the idea by presenting Q-Pensieve soft policy iteration in the tabular setting and showing that it preserves the global convergence property. Then, to implement the Q-Pensieve policy improvement step, we introduce the Q replay buffer technique, which offers a simple yet effective way to maintain Q-snapshots. Our experiments demonstrate that Q-Pensieve is a promising approach in that it can outperform the state-of-the-art MORL methods with much fewer samples on a variety of MORL benchmark tasks.

APPENDIX

A PROOF OF THEOREM 3.1

Before proving Theorem 3.1, we first present two supporting lemmas. To begin with, we establish the policy improvement property of the Q-Pensieve update. Recall that the Q-Pensieve policy update is, for each preference λ ∈ Λ,

π k+1 (•|s; λ) = arg min π′∈Π D KL ( π′(•|s; λ) ∥ exp( sup λ′∈W k (λ), Q′∈Q k (1/α) λ ⊤ Q′(s, •; λ′) ) / Z Q (s) ) =: arg min π′∈Π L(π′; λ). (12)

Lemma 1 (Q-Pensieve Policy Improvement) Under the Q-Pensieve policy improvement update, we have λ ⊤ Q π k (s, a; λ) ≤ λ ⊤ Q π k+1 (s, a; λ) for all state-action pairs (s, a) ∈ S × A, for all preference vectors λ ∈ Λ, and for all iterations k ∈ N ∪ {0}.

Proof (Lemma 1) By the update rule in (12), π k+1 is a minimizer of L(π′; λ) and hence L(π k+1 ; λ) ≤ L(π k ; λ). This implies that for each state s ∈ S,

E a∼π k+1 (•|s) [ λ ⊤ 1 d • log π k+1 (a|s; λ) − (1/α) sup λ′∈W k (λ), Q′∈Q k λ ⊤ Q′(s, a; λ′) + log Z Q k (s) ] ≤ E a∼π k (•|s) [ λ ⊤ 1 d • log π k (a|s; λ) − (1/α) sup λ′∈W k (λ), Q′∈Q k λ ⊤ Q′(s, a; λ′) + log Z Q k (s) ]. (13)

Since Z Q k depends only on the state, inequality (13) reduces to

E a∼π k+1 (•|s) [ λ ⊤ 1 d • log π k+1 (a|s; λ) − (1/α) sup λ′∈W k (λ), Q′∈Q k λ ⊤ Q′(s, a; λ′) ] ≤ E a∼π k (•|s) [ λ ⊤ 1 d • log π k (a|s; λ) − (1/α) sup λ′∈W k (λ), Q′∈Q k λ ⊤ Q′(s, a; λ′) ]. (14)
Next, we proceed to consider the multi-objective soft Bellman equation as follows:

λ ⊤ Q π k (s 0 , a 0 ; λ) − λ ⊤ r(s 0 , a 0 ) (15)
= γ E s 1 ∼P(•|s 0 ,a 0 ), a 1 ∼π k (•|s 1 ;λ) [ λ ⊤ Q π k (s 1 , a 1 ; λ) − α • λ ⊤ 1 d log π k (a 1 |s 1 ; λ) ] (16)
≤ γ E s 1 ∼P(•|s 0 ,a 0 ), a 1 ∼π k (•|s 1 ;λ) [ sup λ′∈W k (λ), Q′∈Q k λ ⊤ Q′(s 1 , a 1 ; λ′) − α • λ ⊤ 1 d log π k (a 1 |s 1 ; λ) ] (17)
≤ γ E s 1 ∼P(•|s 0 ,a 0 ), a 1 ∼π k+1 (•|s 1 ;λ) [ sup λ′∈W k (λ), Q′∈Q k λ ⊤ Q′(s 1 , a 1 ; λ′) − α • λ ⊤ 1 d log π k+1 (a 1 |s 1 ; λ) ] (18)
≤ γ E s 1 ∼P(•|s 0 ,a 0 ), a 1 ∼π k+1 (•|s 1 ;λ) [ λ ⊤ r(s 1 , a 1 ) − α • λ ⊤ 1 d log π k+1 (a 1 |s 1 ; λ) − γ E s 2 ∼P(•|s 1 ,a 1 ), a 2 ∼π k (•|s 2 ) [ α λ ⊤ 1 d log π k (a 2 |s 2 ; λ) ] + γ sup λ′∈W k (λ), Q′∈Q k E s 2 ∼P(•|s 1 ,a 1 ), a 2 ∼π k (•|s 2 ) [ λ ⊤ Q′(s 2 , a 2 ; λ′) ] ] (19)
≤ γ E s 1 ∼P(•|s 0 ,a 0 ), a 1 ∼π k+1 (•|s 1 ;λ) [ λ ⊤ r(s 1 , a 1 ) − α • λ ⊤ 1 d log π k+1 (a 1 |s 1 ; λ) − γ E s 2 ∼P(•|s 1 ,a 1 ), a 2 ∼π k+1 (•|s 2 ) [ α λ ⊤ 1 d log π k+1 (a 2 |s 2 ; λ) ] + γ sup λ′∈W k (λ), Q′∈Q k E s 2 ∼P(•|s 1 ,a 1 ), a 2 ∼π k+1 (•|s 2 ) [ λ ⊤ Q′(s 2 , a 2 ; λ′) ] ] (20)
≤ • • • ≤ E P, π k+1 [ Σ t≥1 γ t ( λ ⊤ r(s t , a t ) − α • λ ⊤ 1 d log π k+1 (a t |s t ; λ) ) ] (23)
= λ ⊤ Q π k+1 (s 0 , a 0 ; λ) − λ ⊤ r(s 0 , a 0 ), (24)

where (16) follows from the multi-objective soft Bellman equation, (17) holds by the sup operation and the fact that Q π k ∈ Q k and λ ∈ W k (λ), (18) follows from (14), (19) holds by applying the multi-objective soft Bellman equation to Q′(s 1 , a 1 ; λ′), (20) again follows from the inequality in (14), (23) is obtained by repeating the same two steps and unrolling the whole trajectory, and (24) holds by the definition of Q π .
□

Lemma 2 (Multi-Objective Soft Policy Evaluation). Consider the multi-objective soft Bellman backup operator T^π_MO with respect to a policy π, and let Q^(0): S × A → ℝ^d be an arbitrary initialization. The sequence of intermediate Q-functions {Q^(i)} during policy evaluation is given by Q^(i+1) = T^π_MO Q^(i), for all i ∈ ℕ ∪ {0}. Then, Q^(i) converges to the soft Q-function of π as i → ∞.

Proof (Lemma 2). This can be directly obtained from the standard convergence property of iterative policy evaluation (Sutton and Barto, 2018) in two steps: (i) Define the entropy-augmented reward as r̃(s, a; π) := r(s, a) − γ E_{s′∼P(·|s,a), a′∼π(·|s′)}[ α log π(a′|s′) · 1_d ], which is a bounded function. (ii) Then, rewrite the policy evaluation update as Q^π(s, a) ← r̃(s, a; π) + γ E_{s′∼P(·|s,a), a′∼π(·|s′)}[ Q^π(s′, a′; λ) ]. This completes the proof. □

Now we are ready to prove Theorem 3.1.

Proof (Theorem 3.1). Note that by Lemma 1, the sequence {λ⊤Q^{π_k}} is monotonically increasing. As each element of Q^π is bounded above for all π ∈ Π, given the boundedness of both the reward and the entropy term, the sequence of policies converges to some policy π*. It remains to show that π* is optimal: (i) Define L_{π′}(π) := D_KL( π(·|s) ∥ exp( (1/α) sup_{λ′∈W(λ), Q′∈Q} λ⊤Q′(s, ·; λ′) ) / Z_Q(s) ). (ii) Upon convergence, we have L_{π*}(π*(·|s)) ≤ L_{π*}(π(·|s)) for all π ∈ Π. Using the same iterative argument as in the proof of Lemma 1, we obtain λ⊤Q^{π*}(s, a; λ) ≥ λ⊤Q^{π}(s, a; λ) for all (s, a) ∈ S × A and all λ ∈ Λ. □
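As a sanity check on the update in Eq. (12): for a finite action set, the KL minimizer has a closed form, namely the softmax of the sup-aggregated Q-values. The following pure-Python sketch (data layout and function names are ours for illustration, not the paper's implementation) computes π_{k+1}(·|s; λ) for a single state:

```python
import math

def q_pensieve_policy(q_snapshots, lam, alpha=1.0):
    """Closed-form minimizer of the KL projection in Eq. (12) for one state
    with a discrete action set: a softmax of the sup-aggregated Q-values.

    q_snapshots[j][a] holds Q'(s, a; lam') in R^d for the j-th
    snapshot/preference pair in the Q-Pensieve (illustrative layout).
    lam is the preference vector; alpha is the temperature.
    Returns pi_{k+1}(.|s; lam) as a list over actions.
    """
    num_actions = len(q_snapshots[0])
    # lam^T Q'(s, a; lam') for each snapshot, then sup over W_k(lam) x Q_k.
    sup_q = [max(sum(l * q for l, q in zip(lam, snap[a]))
                 for snap in q_snapshots)
             for a in range(num_actions)]
    m = max(sup_q)  # subtract the max before exponentiating, for stability
    weights = [math.exp((q - m) / alpha) for q in sup_q]
    z = sum(weights)  # partition function Z_Q(s)
    return [w / z for w in weights]
```

Because the sup is taken inside the exponent, a single snapshot that dominates on some action directly shifts probability mass toward that action, which is how the Q-Pensieve shares knowledge gathered by past critics.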

B DETAILED CONFIGURATION OF EXPERIMENTS

B.1 DETAILS ON THE EVALUATION DOMAINS

• Continuous Deep Sea Treasure (DST): DST is a classical multi-objective reinforcement learning environment in which the agent searches for treasure, and the farther the treasure is, the higher its value. In other words, the agent needs to spend more resources (a −1 penalty per action) to reach the more precious treasure. To extend DST to continuous spaces, we replace the simple four-direction movement with movement within a circle. We set β = 3 for DST.
• Multi-Objective Continuous LunarLander: We modify LunarLander into a multi-objective version by decomposing the reward into main engine cost, side engine cost, shaping reward, and result reward. Since past MORL methods were evaluated in environments with 2 or 3 objectives, we created environments with 4 and 5 objectives to show that our method can handle higher-dimensional objective spaces. We set β = 5 for LunarLander.
• MuJoCo: We split the scalar reward of the MuJoCo environments into vector rewards. In addition, we amplify the weight of the control cost so that the magnitudes of the reward components are similar.
  - HalfCheetah2d: 2 objectives: forward speed, control cost (S ⊆ ℝ^17, A ⊆ ℝ^6); control cost scaled by 1000; β = 10.
  - Hopper2d: 2 objectives: forward speed, control cost (S ⊆ ℝ^11, A ⊆ ℝ^3); control cost scaled by 1500; β = 3.
  - Hopper3d: 3 objectives: forward speed, jump reward, control cost (S ⊆ ℝ^11, A ⊆ ℝ^3); control cost scaled by 1500. The jump reward is 15 times the difference between the current height and the initial height; β = 10.
  - Hopper5d: 5 objectives: forward speed, control cost of each of the 3 joints, and healthy reward (S ⊆ ℝ^11, A ⊆ ℝ^3); control cost scaled by 1500; β = 5.
  - Ant2d: 2 objectives: forward speed, control cost (S ⊆ ℝ^111, A ⊆ ℝ^8); control cost scaled by 1; β = 10.
  - Ant3d: 3 objectives: forward speed, control cost, healthy reward (S ⊆ ℝ^111, A ⊆ ℝ^8); control cost and healthy reward each scaled by 1; β = 10.
  - Walker2d: 2 objectives: forward speed, control cost (S ⊆ ℝ^17, A ⊆ ℝ^6); control cost scaled by 1000; β = 3.

B.2 HYPERPARAMETERS

B.2.1 HYPERPARAMETERS OF Q-PENSIEVE

We conduct all experiments on the baselines with the following hyperparameters. For PGMORL and PFA, we use the hyperparameters provided in Table 3:
• n: the number of reinforcement learning tasks.
• total steps: the total number of environment training steps.
• m_w: the number of iterations in the warm-up stage.
• m_t: the number of iterations in the evolutionary stage.
• P_num: the number of performance buffers.
• P_size: the size of each performance buffer.
• n_weight: the number of sampled weights for each policy.
• sparsity: the weight of the sparsity metric.
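As an illustration of the reward decomposition in B.1, the vector reward can be rebuilt from the per-term rewards that MuJoCo-style environments report in their `info` dictionary. The helper and the key names below are illustrative assumptions (the exact keys vary across MuJoCo versions), not the code used in our experiments:

```python
def vectorize_mujoco_reward(info, objective_keys, scales):
    """Rebuild a vector reward from the per-term rewards a MuJoCo-style
    environment reports in its `info` dict. Both the key names and this
    helper are illustrative; missing terms default to 0.
    """
    return [scales[k] * info.get(k, 0.0) for k in objective_keys]

# HalfCheetah2d sketch: objectives [forward speed, control cost], with the
# 1000x control-cost scaling listed above to balance reward magnitudes.
halfcheetah_keys = ["reward_run", "reward_ctrl"]
halfcheetah_scales = {"reward_run": 1.0, "reward_ctrl": 1000.0}
```

The same helper covers the other domains by swapping the key list and scales (e.g., 1500× control cost for the Hopper variants).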

C PSEUDO CODE OF Q-PENSIEVE

We provide the pseudo code in Algorithm 1 as follows.

Algorithm 1: Q-Pensieve
Input: critic parameters ϕ_1, ϕ_2, actor parameters θ, preference sampling distribution P_λ, number of preference vectors N_λ, soft update coefficient τ, actor learning rate η_π, critic learning rate η_Q
Output: ϕ_1, ϕ_2, θ
for each iteration do
    sample λ from Λ according to P_λ;
    for each environment step do
        a_t ∼ π_θ(·|s_t; λ);
        s_{t+1} ∼ P(·|s_t, a_t);
        M ← M ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})};
    for each gradient step do
        ϕ_i ← ϕ_i − η_Q ∇_{ϕ_i} L_Q(ϕ_i; λ), B ← B ∪ {ϕ_i; λ}, for i ∈ {1, 2};
        compute ∇_θ via Eq. (11) using W;
        θ ← θ − η_π ∇_θ L_π(θ; λ);
        φ̄_i ← τ ϕ_i + (1 − τ) φ̄_i, for i ∈ {1, 2};
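The control flow of Algorithm 1 can also be sketched as a runnable toy loop. Every component below (the 1-D dynamics, the dict-based "networks", the surrogate updates) is a stand-in of our own; only the structure, i.e., the transition replay buffer M, the Q replay buffer B, the periodic critic snapshots, and the sup over snapshots in the actor step, mirrors the algorithm:

```python
import copy
import random
from collections import deque

def train_q_pensieve(total_steps=200, q_buffer_size=4, update_interval=50,
                     batch_size=8):
    """Toy skeleton of Algorithm 1; not the paper's implementation."""
    rng = random.Random(0)
    phi = {"w": 0.0}                          # critic parameters (stand-in)
    theta = {"w": 0.0}                        # actor parameters (stand-in)
    replay = deque(maxlen=10_000)             # transition replay buffer M
    q_pensieve = deque(maxlen=q_buffer_size)  # Q replay buffer B (FIFO)
    state = 0.0
    for step in range(total_steps):
        # Periodically snapshot the critic into the Q replay buffer B.
        if step % update_interval == 0:
            q_pensieve.append(copy.deepcopy(phi))
        lam = rng.random()                    # preference lam ~ P_lam (2-D)
        action = rng.uniform(-1.0, 1.0)       # a_t ~ pi_theta(.|s_t; lam)
        next_state = 0.9 * state + action     # s_{t+1} ~ P(.|s_t, a_t)
        reward = (next_state, -abs(action))   # 2-objective reward vector
        replay.append((state, action, reward, next_state))
        state = next_state
        if len(replay) >= batch_size:
            batch = rng.sample(list(replay), batch_size)
            # Critic step: phi <- phi - eta_Q * grad L_Q (toy surrogate).
            target = sum(lam * r[0] + (1 - lam) * r[1]
                         for (_, _, r, _) in batch) / batch_size
            phi["w"] += 0.01 * (target - phi["w"])
            # Actor step: the update direction takes the sup over the
            # current critic and all snapshots in B, echoing Eq. (12).
            sup_q = max([phi["w"]] + [q["w"] for q in q_pensieve])
            theta["w"] += 0.01 * sup_q
    return theta, q_pensieve
```

With the defaults above, snapshots are taken every 50 of the 200 steps, so the Q replay buffer ends up holding its full capacity of 4 past critics.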

D COMPARISON OF LEARNING CURVES

We demonstrate the learning curves of Q-Pensieve and the benchmark methods. In Figures 9-15, we can see that Q-Pensieve enjoys the fastest learning progress in almost all tasks and preferences. Notably, as PGMORL is an evolutionary method that performs explicit search for policies over only a small set of preference vectors in each generation, its typical learning curve (in terms of expected total reward) under a given preference is not very informative about the overall learning progress. Therefore, regarding the learning curves, we compare Q-Pensieve with CN-DER and PFA. Moreover, as PFA cannot handle tasks with more than two objectives (a fact also mentioned in (Xu et al., 2020)), PFA is evaluated only in tasks with two objectives.

E ADDITIONAL EXPERIMENTAL RESULTS

In this section, we compare Q-Pensieve with the baseline methods, discuss how the performance of Q-Pensieve can be further improved through hyperparameter tuning, and then demonstrate the model generalization of Q-Pensieve.

E.1 COMPARISON WITH THE ENVELOPE Q-LEARNING ALGORITHM

The Envelope Q-Learning algorithm and its neural version Envelope DQN (Yang et al., 2019) presume that the action space is discrete. To adapt Envelope DQN to the continuous control tasks considered in our paper (including MuJoCo and continuous Deep Sea Treasure), we take the open-source implementation of (Yang et al., 2019) and apply action discretization, which has been shown to be quite effective in MuJoCo control tasks (Tavakoli et al., 2018; Tang and Agrawal, 2020). We compare Envelope DQN with Q-Pensieve in both the Hopper3d and DST2d environments. For Envelope DQN, we set the number of bins per action dimension to 11 for DST2d (2-dimensional actions) and 5 for Hopper3d (3-dimensional actions), following the suggestions in (Tavakoli et al., 2018). Table 4 shows the performance of Q-Pensieve and Envelope DQN in terms of the two metrics. We observe that Q-Pensieve outperforms Envelope DQN by a large margin in these two popular multi-objective tasks.

E.2 EFFECT OF HYPERPARAMETERS

Q Replay Buffer Size: One could expect that a larger Q buffer would provide a more diverse collection of Q-snapshots and thereby better boost the policy improvement update in each iteration. On the other hand, in practice, the required memory usage also scales with the Q buffer size. We evaluate Q-Pensieve under buffer sizes of 2, 4, and 6, and compare it with a variant without a Q replay buffer in Ant2d. Table 5 shows that, empirically, a relatively small Q buffer already offers a significant performance improvement.

Q Replay Buffer Update Interval: To ensure that the Q-snapshots in the Q replay buffer are sufficiently diverse, we suggest that the update interval not be too small (otherwise the Q-snapshots in the buffer would be fairly similar). In general, this update interval can be viewed as a hyperparameter to be tuned, similar to the update interval of the target networks in many RL algorithms.
We further conduct an empirical study of the performance of Q-Pensieve under different update intervals, running Q-Pensieve in Ant2d for 1500k steps under each setting. Table 6 shows that the hypervolume is not sensitive to the update interval, and that the performance in UT can potentially be further improved through hyperparameter tuning.
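The interplay between the buffer size and the update interval discussed above can be captured in a small snapshot buffer. This is a sketch with names of our own choosing, not the actual implementation:

```python
import copy
from collections import deque

class QSnapshotBuffer:
    """FIFO buffer of past critic snapshots (the 'Q-Pensieve').

    `size` bounds memory usage; `update_interval` spaces out snapshots so
    the stored Q-functions stay diverse, as discussed above.
    """
    def __init__(self, size, update_interval):
        self.snapshots = deque(maxlen=size)   # oldest snapshot evicted first
        self.update_interval = update_interval
        self._step = 0

    def maybe_store(self, critic_params):
        """Call once per training step; deep-copies the critic periodically."""
        if self._step % self.update_interval == 0:
            self.snapshots.append(copy.deepcopy(critic_params))
        self._step += 1

    def sup_value(self, evaluate):
        """sup over stored snapshots, where `evaluate` maps parameters to a
        scalarized value lam^T Q'(s, a; lam')."""
        return max(evaluate(p) for p in self.snapshots)
```

A tiny update interval would fill the buffer with near-identical critics, making the sup in `sup_value` collapse onto the current critic; a larger interval spreads the snapshots across training.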

E.3 MODEL GENERALIZATION OF Q-PENSIEVE

To demonstrate that the critic model with the preference vector generalizes well, we define a metric for the critic as

    L_critic = E_{s∼S, a∼A}[ ‖Q(s, a; λ) − Q_true(s, a; λ)‖_2 ],

where Q is the action-value function learned by our critic network and Q_true is the true Q-function computed by the Monte-Carlo method. Table 7 and Table 8 show L_critic under various preferences λ at different training stages in Hopper2d and HalfCheetah2d, respectively. Note that the true Q-values are typically in the range of 1000 to a few thousand. Therefore, we can see that L_critic under various preferences is indeed quite low, which indicates that the critic model generalizes well across preferences.
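A minimal sketch of how L_critic can be estimated: Q_true is approximated by Monte-Carlo rollouts from (s, a) under the λ-conditioned policy, and the metric is the Euclidean distance between the predicted and rollout-estimated Q-vectors. The `env_step` and `policy` interfaces below are assumptions for illustration, not the paper's code:

```python
import math
import random

def mc_q_vector(env_step, policy, state, action, lam,
                gamma=0.99, horizon=100, num_rollouts=16, seed=0):
    """Monte-Carlo estimate of the vector Q_true(s, a; lam): roll out the
    lam-conditioned policy from (s, a) and average the per-objective
    discounted returns. Assumed interfaces:
      env_step(state, action, rng) -> (next_state, reward_vector)
      policy(state, lam, rng) -> action
    """
    rng = random.Random(seed)
    totals = None
    for _ in range(num_rollouts):
        s, a = state, action
        disc, ret = 1.0, None
        for _ in range(horizon):
            s, r = env_step(s, a, rng)
            if ret is None:
                ret = [0.0] * len(r)
            for i, ri in enumerate(r):
                ret[i] += disc * ri          # accumulate gamma^t * r_t
            disc *= gamma
            a = policy(s, lam, rng)
        totals = ret if totals is None else [t + x for t, x in zip(totals, ret)]
    return [t / num_rollouts for t in totals]

def l_critic(q_pred, q_true):
    """||Q(s,a;lam) - Q_true(s,a;lam)||_2 for one sample; averaging over
    sampled (s, a) pairs gives the metric above."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(q_pred, q_true)))
```

In practice, the expectation over (s, a) would be approximated by averaging `l_critic` over states and actions drawn from the replay buffer.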



https://github.com/NYCU-RL-Bandits-Lab/Q-Pensieve



Figure 1: The architecture of Q-Pensieve.

r and I is the indicator function.
• Utility (UT): To further evaluate the performance under linear scalarization, we define the utility metric as UT := E_λ[ Σ_{t=0}^{T} λ⊤r_t ], where the preference λ is sampled uniformly from Λ.
• Episodic Dominance (ED): To compare the performance of a pair of algorithms, we define Episodic Dominance as ED_{1,2}

Figure 2: Return vectors attained by Q-Pensieve and the collection of single-objective SAC models under 19 preferences.

Figure 3: Comparison of standard single-objective SAC and the hybrid SAC assisted by another Q-Pensieve model trained in parallel.

Figure 4: Return vectors attained under preference λ = [0.5, 0.5] at different training stages (we also plot return vectors under other preferences in Figure 7 and Figure 8 in the Appendix). A number x on the red or blue marker indicates that the model is obtained at 100 · x thousand steps.

Figure 6: Return vectors attained under different Q replay buffer sizes of Q-Pensieve

(a) λ = [0.1, 0.9] (b) λ = [0.5, 0.5] (c) λ = [0.9, 0.1]

Figure 9: Average return in Hopper2d over 5 random seeds (average return is the inner product of the reward vectors and the corresponding preference).

Figure 10: Average return in Ant2d over 5 random seeds (average return is the inner product of the reward vectors and the corresponding preference).

Figure 11: Average return in HalfCheetah2d over 5 random seeds (average return is the inner product of the reward vectors and the corresponding preference).

Figure 12: Average return in Walker2d over 5 random seeds (average return is the inner product of the reward vectors and the corresponding preference).

(a) λ = [0.33, 0.33, 0.33] (b) λ = [0.1, 0.1, 0.8] (c) λ = [0.1, 0.8, 0.1] (d) λ = [0.1, 0.1, 0.8]

Figure 13: Average return in Ant3d over 5 random seeds (average return is the inner product of the reward vectors and the corresponding preference).

Figure 14: Average return in Hopper3d over 5 random seeds (average return is the inner product of the reward vectors and the corresponding preference).

(a) λ = [0.25, 0.25, 0.25, 0.25] (b) λ = [0.05, 0.05, 0.05, 0.85] (c) λ = [0.05, 0.05, 0.85, 0.05] (d) λ = [0.05, 0.85, 0.05, 0.05] (e) λ = [0.85, 0.05, 0.05, 0.05]

Figure 15: Average return in LunarLander4d over 5 random seeds (average return is the inner product of the reward vectors and the corresponding preference).

Hyperparameters of Q-Pensieve.

Hyperparameters of PGMORL and PFA (columns: Environments, n, total steps, m_w, m_t, P_num, P_size, n_weight, sparsity).

Comparison of Q-Pensieve and Envelope DQN in terms of the two metrics across two domains. We report the mean and standard deviation over five random seeds.

Comparison of Q-Pensieve with different Q replay buffer sizes in terms of the two metrics in Ant2d over five random seeds.

Comparison of Q-Pensieve with different Q replay buffer update intervals in terms of the two metrics in Ant2d over five random seeds.

L critic in Hopper2d over five random seeds.

L critic in HalfCheetah2d over five random seeds.

ACKNOWLEDGMENTS

This material is based upon work partially supported by the National Science and Technology Council (NSTC), Taiwan under Contract No. 110-2628-E-A49-014 and Contract No. 111-2628-E-A49-019, and based upon work partially supported by the Higher Education Sprout Project of the National Yang Ming Chiao Tung University and Ministry of Education (MOE), Taiwan.

