Q-PENSIEVE: BOOSTING SAMPLE EFFICIENCY OF MULTI-OBJECTIVE RL THROUGH MEMORY SHARING OF Q-SNAPSHOTS

Abstract

Many real-world continuous control problems face the dilemma of weighing the pros and cons of conflicting objectives; multi-objective reinforcement learning (MORL) serves as a generic framework for learning control policies under different preferences over objectives. However, existing MORL methods either rely on multiple passes of explicit search for finding the Pareto front, and are therefore not sample-efficient, or utilize a shared policy network for only coarse knowledge sharing among policies. To boost the sample efficiency of MORL, we propose Q-Pensieve, a policy improvement scheme that stores a collection of Q-snapshots to jointly determine the policy update direction and thereby enables data sharing at the policy level. We show that Q-Pensieve can be naturally integrated with soft policy iteration with a convergence guarantee. To substantiate this concept, we propose the technique of the Q replay buffer, which stores the Q-networks learned in past iterations, and arrive at a practical actor-critic implementation. Through extensive experiments and an ablation study, we demonstrate that, with far fewer samples, the proposed algorithm can outperform the benchmark MORL methods on a variety of MORL benchmark tasks.

1. INTRODUCTION

Many real-world sequential decision-making problems involve the joint optimization of multiple objectives, some of which may be in conflict. For example, in robot control, the robot is expected to run fast while consuming as little energy as possible; nevertheless, running faster inevitably requires more energy, regardless of how energy-efficient the robot motion is. Various other real-world continuous control problems are also multi-objective tasks by nature, such as congestion control in communication networks (Ma et al., 2022) and diversified portfolios (Abdolmaleki et al., 2020). Moreover, the relative importance of these objectives can vary over time (Roijers and Whiteson, 2017). For example, the preference between energy and speed in robot locomotion may change with the energy budget; network service providers need to continuously switch service among various networking applications (e.g., on-demand video streaming versus real-time conferencing), each of which has its own preferences over latency and throughput. To address these practical challenges, multi-objective reinforcement learning (MORL) serves as a classic and popular formulation for learning optimal control strategies from vector-valued reward signals and achieving favorable trade-offs among the objectives. In the MORL framework, the goal is to learn a collection of policies under which the attained return vectors recover as much of the Pareto front as possible. One popular approach to MORL is to explicitly search for the Pareto front with the aim of maximizing the hypervolume associated with the reward vectors, e.g., via evolutionary search (Xu et al., 2020) or search by first-order stationarity (Kyriakis et al., 2022). While effective, explicit search algorithms are known to be rather sample-inefficient, as data sharing among different passes of explicit search is quite limited.
As a result, it is typically difficult to maintain a sufficiently diverse set of optimal policies for different preferences within a reasonable number of training samples. Another way to address MORL is to implicitly search for non-dominated policies through linear scalarization, i.e., to convert the vector-valued reward signal into a single scalar with the help of a linear preference and then apply a conventional single-objective RL algorithm to iteratively improve the policies (e.g., (Abels et al., 2019; Yang et al., 2019)). To enable implicit search over diverse preferences simultaneously, a single network is typically used to express the whole collection of policies. As a result, some degree of data sharing among policies of different preferences occurs implicitly through the shared network parameters. However, such sharing is clearly not guaranteed to achieve policy improvement for all preferences. Therefore, one critical open research question remains: how can we boost the sample efficiency of MORL through better policy-level knowledge sharing? To answer this question, we revisit MORL from the perspective of memory sharing among the policies learned across different training iterations and propose Q-Pensieve. A "Pensieve", as depicted in the Harry Potter novels, is a magical device used to store pieces of personal memories, which can later be shared with someone else. Drawing an analogy between memory sharing among humans and knowledge sharing among policies, we propose to construct a Q-Pensieve, which stores snapshots of the Q-functions of the policies learned in past iterations. When improving the policy for a specific preference, these Q-snapshots jointly determine the policy update direction. In this way, we explicitly enforce knowledge sharing at the policy level and thereby improve sample efficiency in learning optimal policies for various preferences.
To substantiate this idea, we start by considering Q-Pensieve memory sharing in the tabular planning setting and integrate Q-Pensieve with soft policy iteration for entropy-regularized MDPs. Inspired by (Yang et al., 2019), we leverage the envelope operation and propose Q-Pensieve policy iteration for MORL, which we show preserves a convergence guarantee similar to that of standard single-objective soft policy iteration. Based on this result, we propose a practical implementation that consists of two major components: (i) We introduce the technique of the Q replay buffer. Similar to the standard replay buffer of state transitions, a Q replay buffer is meant to achieve sample reuse and improve sample efficiency, but notably at the policy level. Through the Q replay buffer, we directly obtain a large collection of Q-functions, each corresponding to a policy from a prior training iteration, without any additional effort or computation in forming the Q-Pensieve. (ii) We convert Q-Pensieve policy iteration into an actor-critic off-policy MORL algorithm by adapting soft actor-critic to the multi-objective setting and using it as the base of our implementation. The main contributions of this paper can be summarized as follows:
• We identify the critical sample-inefficiency issue in MORL and address it by proposing Q-Pensieve, a policy improvement scheme that enhances knowledge sharing at the policy level. We then present Q-Pensieve policy iteration and establish its convergence property.
• We substantiate the concept of Q-Pensieve policy iteration by proposing the technique of the Q replay buffer and arrive at a practical actor-critic implementation.
• We evaluate the proposed algorithm in various benchmark MORL environments, including Deep Sea Treasure and MuJoCo.
Through extensive experiments and an ablation study, we demonstrate that the proposed Q-Pensieve can indeed achieve significantly better empirical sample efficiency than the popular benchmark MORL algorithms, in terms of multiple common MORL performance metrics, including hypervolume and utility.
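As a minimal illustration of the Q replay buffer idea, the following sketch keeps a bounded FIFO of Q-snapshots and performs an envelope-style lookup that scalarizes each snapshot under a given preference and takes the best one. This is our own simplified rendering, not the paper's implementation: in practice the snapshots would be frozen deep Q-networks and the lookup would feed into the actor update.

```python
from collections import deque

import numpy as np


class QReplayBuffer:
    """A bounded FIFO of Q-snapshots (frozen copies of past Q-functions),
    in contrast to the usual replay buffer of state transitions."""

    def __init__(self, capacity=4):
        # Oldest snapshot is evicted automatically once capacity is reached.
        self.snapshots = deque(maxlen=capacity)

    def push(self, q_fn):
        """Store a snapshot of the current (vector-valued) Q-function."""
        self.snapshots.append(q_fn)

    def best_value(self, s, a, preference):
        """Envelope-style lookup: the largest scalarized value that any
        stored snapshot assigns to (s, a) under the given preference."""
        return max(float(np.dot(preference, q(s, a))) for q in self.snapshots)
```

For instance, with two snapshots that favor different objectives, `best_value` returns whichever snapshot scores higher for the queried preference, which is the sense in which past policies "jointly determine" the update direction.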

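Of the performance metrics mentioned above, hypervolume measures the region dominated by a set of attained return vectors relative to a reference point. The following two-objective sweep is our own illustrative routine under that standard definition; MORL benchmark suites typically use general-purpose hypervolume implementations instead.

```python
import numpy as np


def hypervolume_2d(points, ref):
    """Hypervolume dominated by 2-objective return vectors w.r.t. `ref`
    (both objectives to be maximized). Two-dimensional case only."""
    pts = np.asarray(points, dtype=float)
    # Keep only points that strictly dominate the reference point.
    pts = pts[np.all(pts > ref, axis=1)]
    if len(pts) == 0:
        return 0.0
    # Sweep in decreasing order of the first objective, adding the
    # rectangle that each non-dominated point contributes.
    pts = pts[np.argsort(-pts[:, 0])]
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:  # point is non-dominated in this sweep
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv
```

For example, the front {(2, 1), (1, 2)} with reference (0, 0) covers the union of a 2×1 and a 1×2 rectangle, giving hypervolume 3; a dominated point such as (1, 1) adds nothing.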
2. PRELIMINARIES

Multi-Objective Markov Decision Processes (MOMDPs). We consider the formulation of MOMDP defined by the tuple (S, A, P, r, γ, D, S_λ, Λ), where S denotes the state space, A is the action space, P : S × A × S → [0, 1] is the transition kernel of the environment, r : S × A → [−r_max, r_max]^d is the vector-valued reward function with d as the number of objectives, γ ∈ (0, 1) is the discount factor, D is the initial state distribution, S_λ : R^d → R is the scalarization function (under some preference vector λ ∈ R^d), and Λ denotes the set of all preference vectors. In this paper, we focus on the linear reward scalarization setting, i.e., S_λ(r) = λ^⊤ r(s, a), as commonly adopted in the MORL literature (Abels et al., 2019; Yang et al., 2019; Kyriakis et al., 2022). Without loss of generality, we let Λ be the unit simplex. If d = 1, an MOMDP would degenerate to a standard MDP, and we simply use r(s, a) to denote the scalar reward. At each time step t ∈ ℕ ∪ {0}, the learner receives the
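To make the linear scalarization S_λ(r) = λ^⊤ r(s, a) concrete, the short sketch below (with illustrative names of our own choosing) collapses a two-objective reward vector into a scalar under a preference from the unit simplex Λ:

```python
import numpy as np


def scalarize(reward_vec, preference):
    """Linear scalarization S_lambda(r) = lambda^T r.

    `preference` must lie on the unit simplex (non-negative entries
    summing to 1), matching the preference set Lambda in the text.
    """
    preference = np.asarray(preference, dtype=float)
    reward_vec = np.asarray(reward_vec, dtype=float)
    assert np.all(preference >= 0) and np.isclose(preference.sum(), 1.0)
    return float(preference @ reward_vec)


# Two objectives, e.g. (speed, -energy), with a preference favoring speed:
# 0.7 * 2.0 + 0.3 * (-1.0)
utility = scalarize([2.0, -1.0], [0.7, 0.3])
```

A single-objective RL algorithm can then be run on this scalar signal, which is exactly the implicit-search strategy described in the introduction.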

