NOVEL POLICY SEEKING WITH CONSTRAINED OPTIMIZATION

Abstract

We address the problem of seeking novel policies in reinforcement learning tasks. Instead of following the multi-objective framework commonly used in existing methods, we rethink the problem from the perspective of constrained optimization. We first introduce a new metric to evaluate the difference between policies, and then design two practical novel policy seeking methods under the new perspective, namely the Constrained Task Novel Bisector (CTNB) and the Interior Policy Differentiation (IPD), corresponding to the feasible direction method and the interior point method commonly known in the constrained optimization literature. Experimental comparisons on the MuJoCo control suite show that our methods achieve substantial improvements over previous novelty-seeking methods in terms of both the novelty of policies and their performance on the primal task.

1. INTRODUCTION

In Reinforcement Learning, an agent interacts with the environment to learn a policy that maximizes a certain form of cumulative reward (Sutton & Barto, 1998), and the policy gradient method can be applied to optimize parametric policy functions (Sutton et al., 2000). However, direct optimization with respect to the reward function is prone to getting stuck in sub-optimal solutions, which hinders policy optimization (Liepins & Vose, 1991; Lehman & Stanley, 2011; Plappert et al., 2018). Consequently, an appropriate exploration strategy is crucial for the success of policy learning (Auer, 2002; Bellemare et al., 2016; Houthooft et al., 2016; Tang et al., 2017; Ostrovski et al., 2017; Tessler et al., 2019; Ciosek et al., 2019). Recently, many works have shown that incorporating curiosity into policy learning leads to better exploration strategies (Pathak et al., 2017; Burda et al., 2018a; b; Liu et al., 2019). In these works, visiting a previously unseen or infrequent state is assigned an extra curiosity bonus reward. Different from those curiosity-driven methods, which focus on discovering new states within the learning procedure of a single repeated policy, the alternative approach of Novel Policy Seeking (Lehman & Stanley, 2011; Zhang et al., 2019; Pugh et al., 2016) focuses on learning different policies with diverse, or so-called novel, behaviors to solve the primal task. In the process of novel policy seeking, policies in new iterations are usually encouraged to be different from previous policies. Novel policy seeking can therefore be viewed as an extrinsic curiosity-driven method at the level of policies, as well as an exploration strategy for a population of agents.
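As a concrete illustration of the curiosity bonus mentioned above, the sketch below shapes the task reward with a count-based bonus that decays as a state is revisited. This is a minimal illustrative example, not a method from this paper; the class name and the 1/sqrt(N) schedule are assumptions, and learned curiosity methods (e.g., Pathak et al., 2017) replace raw visit counts with prediction error.

```python
import math
from collections import defaultdict

class CountBasedBonus:
    """Illustrative sketch: add beta / sqrt(N(s)) to the task reward,
    so rarely visited states receive a larger curiosity bonus."""

    def __init__(self, beta=0.1):
        self.beta = beta                 # weight of the curiosity term
        self.counts = defaultdict(int)   # visit counts N(s)

    def shaped_reward(self, state, task_reward):
        self.counts[state] += 1
        bonus = self.beta / math.sqrt(self.counts[state])
        return task_reward + bonus

shaper = CountBasedBonus(beta=0.1)
r_first = shaper.shaped_reward("s0", 1.0)   # first visit: full bonus
r_again = shaper.shaped_reward("s0", 1.0)   # repeat visit: smaller bonus
```

The bonus vanishes as counts grow, so the shaped reward converges back to the task reward for frequently visited states.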
Besides encouraging exploration (Eysenbach et al., 2018; Gangwani et al., 2018; Liu et al., 2017), novel policy seeking is also related to policy ensembles (Osband et al., 2018; 2016; Florensa et al., 2017) and evolution strategies (ES) (Salimans et al., 2017; Conti et al., 2018). In this work, we aim to generate a set of policies that behave differently from all previously given policies while trying to maintain their performance on the primal task. In order to generate novel policies, previous work often defines a heuristic metric for novelty estimation, e.g., differences of state distributions estimated by neural networks (Zhang et al., 2019), and tries to solve the problem under the formulation of multi-objective optimization. However, most of these metrics suffer from the difficulty of dealing with episodic novelty rewards, i.e., the difficulty of episodic credit assignment (Sutton et al., 1998), so their effectiveness in learning novel policies is limited.

Figure 1: The comparison of the standard policy gradient method without novelty seeking (left), the multi-objective optimization method (middle), and our constrained optimization approach (right) for novel policy seeking. The standard policy gradient method does not actively seek novel solutions. The multi-objective optimization method may impede the learning procedure when the novelty gradient is applied all the time (Zhang et al., 2019); e.g., a randomly initialized policy will be penalized for getting closer to the previous policy due to the conflict of gradients, which limits the learning efficiency and the final performance. On the contrary, the novelty gradient of our constrained optimization approach is only considered within a certain region, keeping the policy being optimized away from highly similar solutions. Such an approach is more flexible and includes the multi-objective optimization method as a special case.
Moreover, the difficulty of balancing different objectives prevents the agent from finding a well-performing policy for the primal task, as shown in Fig. 1, which compares the policy gradients of three cases: the standard method without novel policy seeking, novelty seeking with multi-objective optimization, and novelty seeking with constrained optimization. In this work we take into account both the novelty of the learned policies and their performance on the primal task when addressing the problem of novel policy seeking. To this end, we propose to seek novel policies under a constrained optimization formulation. Two algorithms under this formulation are designed to seek novel policies while maintaining their performance on the primal task, avoiding excessive novelty seeking. As a result, the performance of our learned novel policies can be guaranteed and even further improved. Our contributions are threefold. Firstly, we introduce a new metric that computes the difference between policies with instant feedback at every timestep. Secondly, we propose a constrained optimization formulation for novel policy seeking and design two practical algorithms resembling two approaches from the constrained optimization literature. Thirdly, we evaluate the proposed algorithms on the MuJoCo locomotion environments, showing that these constrained-optimization novelty-seeking methods can generate a series of diverse and well-performing policies, outperforming previous multi-objective novelty-seeking methods.
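The contrast drawn in Fig. 1 can be sketched in a few lines of code. This is a simplified illustration under assumptions, not the paper's exact CTNB or IPD update rules: gradients are plain lists, and the novelty threshold `r0` stands in for the novelty constraint level.

```python
def multi_objective_step(g_task, g_nov, alpha=0.5):
    # Multi-objective view: the novelty gradient is always mixed in,
    # even when it conflicts with learning the primal task.
    return [t + alpha * n for t, n in zip(g_task, g_nov)]

def constrained_step(g_task, g_nov, novelty, r0=1.0):
    # Constrained view: follow the pure task gradient whenever the
    # policy is already novel enough (constraint satisfied); only mix
    # in the novelty gradient when too close to previous policies.
    if novelty >= r0:
        return list(g_task)
    return [t + n for t, n in zip(g_task, g_nov)]

g_task = [1.0, 0.0]   # improves the primal task
g_nov = [-1.0, 1.0]   # pushes away from previous policies
step_feasible = constrained_step(g_task, g_nov, novelty=2.0)  # pure task step
step_violated = constrained_step(g_task, g_nov, novelty=0.5)  # mixed step
```

When the novelty constraint is satisfied, the update reduces to ordinary policy gradient; the novelty term only activates near previously found solutions, which is why the constrained view subsumes the multi-objective one (set `r0` large enough and the novelty gradient is always active).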

2. RELATED WORK

Intrinsic motivation methods. In previous work, different approaches have been proposed to provide intrinsic motivation, or intrinsic reward, as a supplement to the primal task reward for better exploration (Houthooft et al., 2016; Pathak et al., 2017; Burda et al., 2018a; b; Liu et al., 2019). All of these approaches leverage the weighted sum of two rewards: the primal rewards provided by environments, and intrinsic rewards provided by different heuristics. On the other hand, DIAYN and DADS (Eysenbach et al., 2018; Sharma et al., 2019) learn diverse skills without extrinsic reward. Those approaches focus on decomposing diverse skills within a single policy, while our work focuses on learning diverse behaviors among a batch of policies for the same task.

Diverse policy seeking methods. The work of Such et al. shows that different RL algorithms may converge to different policies for the same task (Such et al., 2018). On the contrary, we are interested in how to learn different policies through a single learning algorithm with the capability of avoiding local optima. Pugh et al. establish a standard framework for understanding and comparing different approaches to the search for quality diversity (QD) (Pugh et al., 2016). Conti et al. propose a solution that avoids local optima and achieves higher performance by adding novelty search and QD to evolution strategies (Conti et al., 2018). The Task-Novelty Bisector (TNB) (Zhang et al., 2019) aims to solve the novelty-seeking problem by jointly optimizing the extrinsic rewards and novelty rewards defined by an auto-encoder. In this work, one of our two proposed methods is closely related to TNB, but is adapted to the constrained optimization formulation.
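The bisector idea behind TNB can be sketched as follows. This is an illustrative simplification, not the reference implementation of TNB or of this paper's CTNB: the helper names are hypothetical, and the rule shown is simply "sum the gradients when they agree, follow the bisector of their unit directions when they conflict."

```python
import math

def normalize(v):
    # Unit-length copy of v (returns v unchanged if it is the zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bisector_step(g_task, g_nov):
    # When the task and novelty gradients do not conflict,
    # a plain sum makes progress on both objectives.
    if dot(g_task, g_nov) >= 0:
        return [t + n for t, n in zip(g_task, g_nov)]
    # When they conflict, step along the bisector of the unit gradients,
    # trading off the two objectives symmetrically.
    u, v = normalize(g_task), normalize(g_nov)
    return [(a + b) / 2 for a, b in zip(u, v)]

step_sum = bisector_step([2.0, 0.0], [0.0, 1.0])   # no conflict: plain sum
step_bis = bisector_step([1.0, 0.0], [-1.0, 1.0])  # conflict: bisector
```

Because the bisector is applied unconditionally whenever gradients conflict, a TNB-style update pays a novelty cost even far from previous policies; the constrained adaptation described in this work activates such a combined step only inside the constraint region.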



Code will be made publicly available.

