DISCOVERING POLICIES WITH DOMINO: DIVERSITY OPTIMIZATION MAINTAINING NEAR OPTIMALITY

Abstract

In this work we propose a Reinforcement Learning (RL) agent that can discover complex behaviours in a rich environment with a simple reward function. We define diversity in terms of state-action occupancy measures, since policies with different occupancy measures visit different states on average. More importantly, defining diversity in this way allows us to derive an intrinsic reward function for maximizing the diversity directly. Our agent, DOMiNO, stands for Diversity Optimization Maintaining Near Optimality. It is based on maximizing a reward function with two components: the extrinsic reward and the diversity intrinsic reward, which are combined with Lagrange multipliers to balance the quality-diversity trade-off. Any RL algorithm can be used to maximize this reward, and no other changes are needed. We demonstrate that given simple reward functions in various control domains, like height (stand) and forward velocity (walk), DOMiNO discovers diverse and meaningful behaviours. We also perform an extensive analysis of our approach, compare it with other multi-objective baselines, demonstrate that we can control both the quality and the diversity of the set via interpretable hyperparameters, and show that the set is robust to perturbations of the environment.

1. INTRODUCTION

As we make progress in Artificial Intelligence, our agents get to interact with richer and richer environments. This means that we cannot expect such agents to fully understand and control every aspect of their environment. Nevertheless, given an environment that is rich enough, we would like to build agents that are able to discover complex behaviours even if they are only provided with a simple reward function. Once a reward is specified, most existing RL algorithms will focus on finding the single best policy for maximizing it. However, when the environment is rich enough, there may be many qualitatively different (optimal or near-optimal) policies for maximizing the reward, even if the reward itself is simple. Finding such a diverse set of policies may help an RL agent become more robust to changes, construct a basis of behaviours, and generalize better to future tasks. Our focus in this work is on agents that find creative and new ways to maximize the reward, which is closely related to Creative Problem Solving (Osborn, 1953): the mental process of searching for an original and previously unknown solution to a problem.

Previous work on this topic has been done in the field of Quality-Diversity (QD), which comprises two main families of algorithms: MAP-Elites (Mouret & Clune, 2015; Cully et al., 2015) and novelty search with local competition (Lehman & Stanley, 2011). Other work has been done in the RL community and typically involves combining the extrinsic reward with a diversity intrinsic reward (Gregor et al., 2017; Eysenbach et al., 2019). Our work shares a similar objective to these previous works and proposes a new class of algorithms, as we will soon explain. Due to space considerations we further discuss the connections to related work in Appendix A. Our main contribution is a new framework for maximizing the diversity of RL policies in the space of state-action occupancy measures.
Intuitively speaking, a state-action occupancy d_π measures how often a policy π visits each state-action pair. Thus, policies with diverse state-action occupancies induce different trajectories. Moreover, for such a diverse set there exist rewards (downstream tasks) for which some policies are better than others (since the value function is a linear product between the state occupancy and the reward). But most importantly, defining diversity in terms of state occupancies allows us to use duality tools from convex optimization and to propose a discovery algorithm that is based entirely on intrinsic reward maximization. Concretely, one can come up with different diversity objectives and use the gradient of these objectives as an intrinsic reward. Building on these results, we show how to use existing RL code bases for diverse policy discovery. The only change needed is to provide the agent with a diversity objective and its gradient.

To demonstrate the effectiveness of our framework, we propose DOMiNO, a method for Diversity Optimization that Maintains Near Optimality. DOMiNO trains a set of policies using a policy-specific, weighted combination of the extrinsic reward and an intrinsic diversity reward. The weights are adapted using Lagrange multipliers to guarantee that each policy is near-optimal. We propose two novel diversity objectives: a repulsive pair-wise force that motivates policies to have distinct expected features, and a van der Waals force, which combines the repulsive force with an attractive one and allows us to specify the degree of diversity in the set. We emphasize that our framework is more general, and we encourage others to propose new diversity objectives and use their gradients as intrinsic rewards.
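As a concrete illustration of this idea, the following sketch (our own, not the paper's implementation; the function names, the specific pairwise objective, and the fixed mixing weight are all assumptions) computes a repulsive diversity reward as the gradient of a pairwise-distance objective over the policies' expected feature vectors, and mixes it with the extrinsic reward via a Lagrange-style weight:

```python
import numpy as np

# Illustrative sketch: psi[i] is policy i's expected feature vector
# (its occupancy projected onto state features); phi is the feature
# vector of the current state.

def repulsive_grad(psi, i):
    """Gradient of sum_{j != i} ||psi_i - psi_j|| with respect to psi_i."""
    diffs = psi[i] - np.delete(psi, i, axis=0)
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    return (diffs / np.maximum(norms, 1e-8)).sum(axis=0)

def intrinsic_reward(psi, i, phi):
    """Diversity reward for policy i at a state with features phi:
    the gradient of the diversity objective, dotted with phi."""
    return repulsive_grad(psi, i) @ phi

def combined_reward(r_ext, r_int, lam):
    """Lagrangian mix of extrinsic and intrinsic reward; in DOMiNO the
    weight is adapted per policy, here it is a fixed scalar for brevity."""
    return lam * r_ext + (1.0 - lam) * r_int
```

Because the gradient points away from the other policies' expected features, maximizing this intrinsic reward pushes each policy's occupancy apart from the rest, which is the repulsive-force intuition described above.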
To demonstrate the effectiveness of DOMiNO, we perform experiments in the DeepMind Control Suite (Tassa et al., 2018) and show that given simple reward functions like height and forward velocity, DOMiNO discovers qualitatively diverse and complex locomotion behaviours (Fig. 1b). We analyze our approach and compare it to other multi-objective strategies for handling the QD trade-off. Lastly, we demonstrate that the discovered set is robust to perturbations of the environment and the morphology of the avatar. We emphasize that the focus of our experiments is on validating and gaining confidence in this new and exciting approach; we do not explicitly compare it with other work, nor argue that one works better than the other.

2. PRELIMINARIES AND NOTATION

In this work, we will express objectives in terms of the state occupancy measure d_π. Intuitively speaking, d_π measures how often a policy π visits each state-action pair. As we will soon see, the classic RL objective of reward maximization can be expressed as a linear product between the reward vector and the state occupancy. In addition, in this work we will formulate diversity maximization via an objective that is a nonlinear function of the state occupancy. While it might seem unclear which reward should be maximized to solve such an objective, we take inspiration from Convex MDPs (Zahavy et al., 2021b), where one such reward is the gradient of the objective with respect to d_π. We begin with some formal definitions. In RL an agent interacts with an environment and seeks to maximize its cumulative reward. We consider two cases: the average reward case and the discounted case. The Markov decision process (Puterman, 1984, MDP) is defined by the tuple (S, A, P, R) for the average reward case and by the tuple (S, A, P, R, γ, D_0) for the discounted case. We assume an infinite-horizon, finite state-action problem. Initially, the state of the agent is sampled according to s_0 ∼ D_0. At time t, given state s_t, the agent selects action a_t according to its policy π(s_t, ·),
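To make the "linear product" claim above concrete, here is a small self-contained check (a sketch on a hypothetical two-state, two-action discounted MDP; all numbers are illustrative and not from the paper) that the expected discounted return equals the inner product between the state-action occupancy and the reward:

```python
import numpy as np

# Hypothetical 2-state, 2-action discounted MDP.
n_s, gamma = 2, 0.9
P = np.zeros((n_s, 2, n_s))             # P[s, a, s'] transition probabilities
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.3, 0.7]
r = np.array([[1.0, 0.0], [0.0, 2.0]])  # r[s, a]
mu0 = np.array([1.0, 0.0])              # initial distribution D_0
pi = np.array([[0.6, 0.4], [0.2, 0.8]]) # pi[s, a]

# State-to-state kernel and per-state reward under pi.
P_pi = np.einsum("sa,sap->sp", pi, P)
r_pi = (pi * r).sum(axis=1)

# Unnormalised discounted state occupancy: rho = mu0 + gamma * P_pi^T rho.
rho = np.linalg.solve(np.eye(n_s) - gamma * P_pi.T, mu0)
d = rho[:, None] * pi                   # state-action occupancy d_pi[s, a]

# The return is the linear product <d_pi, r> ...
v_from_d = (d * r).sum()

# ... and agrees with standard policy evaluation: v = r_pi + gamma * P_pi v.
v = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
v_from_eval = mu0 @ v

assert np.isclose(v_from_d, v_from_eval)
```

Normalising d by (1 − γ) recovers a probability distribution over state-action pairs, which is the convention some works use; either way the return remains linear in d_π.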



Figure 1: (a) DOMiNO's architecture: The agent learns a set of diverse, high-quality policies via a single latent-conditioned actor-critic network with intrinsic and extrinsic value heads. Dashed arrows signify training objectives. (b) DOMiNO's policies: Near-optimal diverse policies in walker.stand, corresponding to standing on both legs, standing on either leg, lifting the other leg forward or backward, spreading the legs, and stamping. Not only are the policies different from each other, they also achieve high extrinsic reward for standing (see values on top of each policy).

