DISCOVERING A SET OF POLICIES FOR THE WORST CASE REWARD

Abstract

We study the problem of how to construct a set of policies that can be composed together to solve a collection of reinforcement learning tasks. Each task is a different reward function defined as a linear combination of known features. We consider a specific class of policy compositions which we call set improving policies (SIPs): given a set of policies and a set of tasks, a SIP is any composition of the former whose performance is at least as good as that of its constituents across all the tasks. We focus on the most conservative instantiation of SIPs, set-max policies (SMPs), so our analysis extends to any SIP. This includes known policy-composition operators like generalized policy improvement. Our main contribution is a policy iteration algorithm that builds a set of policies in order to maximize the worst-case performance of the resulting SMP on the set of tasks. The algorithm works by successively adding new policies to the set. We show that the worst-case performance of the resulting SMP strictly improves at each iteration, and that the algorithm stops only when there is no policy that would lead to improved performance. We empirically evaluate our algorithm on a grid world and on a set of domains from the DeepMind control suite, confirming our theoretical results regarding the monotonically improving performance of the algorithm. Interestingly, we also show empirically that the sets of policies computed by the algorithm are diverse, leading to different trajectories in the grid world and to very distinct locomotion skills in the control suite.

1. INTRODUCTION

Reinforcement learning (RL) is concerned with building agents that can learn to act so as to maximize reward through trial-and-error interaction with the environment. There are several reasons why it can be useful for an agent to learn about multiple ways of behaving, i.e., learn about multiple policies. The agent may want to achieve multiple tasks (or subgoals) in a lifelong learning setting and may learn a separate policy for each task, reusing them as needed when tasks reoccur. The agent may have a hierarchical architecture in which many policies are learned at a lower level while an upper-level policy learns to combine them in useful ways, such as to accelerate learning on a single task or to transfer efficiently to a new task. Learning about multiple policies in the form of options (Sutton et al., 1999a) can be a good way to achieve temporal abstraction; again, this can be used to quickly plan good policies for new tasks. In this paper we abstract away from these specific scenarios and ask the following question: what set of policies should the agent pre-learn in order to guarantee good performance under the worst-case reward? A satisfactory answer to this question could be useful in all the scenarios discussed above and potentially many others. There are two components to the question above: (i) what policies should be in the set, and (ii) how to compose a policy to be used on a new task from the policies in the set. To answer (ii), we propose the concept of a set improving policy (SIP). Given any set of n policies, a SIP is any composition of these policies whose performance is at least as good as, and generally better than, that of all of the constituent policies in the set. We present two policy composition (or improvement) operators that lead to a SIP. The first is called the set-max policy (SMP). Given a distribution over states, an SMP chooses from the n policies the one that leads to the highest expected value.
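The SMP choice rule above can be sketched with value estimates for each constituent policy (a minimal illustration; the policy set, feature dimension, and task weights here are made-up numbers, not from the paper):

```python
import numpy as np

def set_max_policy(psis, w):
    """Return the index of the constituent policy an SMP picks for task w.

    psis: (n, d) array whose i-th row holds policy i's expected discounted
          feature counts under the start distribution, so psis[i] @ w is
          policy i's expected value on the task with reward weights w.
    """
    return int(np.argmax(psis @ w))

# Hypothetical two-policy, two-feature example.
psis = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
w = np.array([0.2, 0.8])          # this task rewards the second feature more
print(set_max_policy(psis, w))    # -> 1
```

By construction, the SMP's value on any task equals the best constituent's value, which is what makes it a SIP.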
The second SIP operator is generalized policy improvement (GPI; Barreto et al., 2017). Given a set of n policies and their associated action-value functions, GPI is a natural extension of regular policy improvement in which the agent acts greedily in each state with respect to the maximum over the set of action-value functions. Although SMP provides weaker guarantees than GPI (we will show this below), it is more amenable to analysis and thus we will use it exclusively for our theoretical results. However, since SMP's performance serves as a lower bound on GPI's, the results we derive for the former also apply to the latter. In our illustrative experiments we will show this result empirically. Now that we have fixed the answer to (ii), i.e., how to compose pre-learned policies for a new reward function, we can leverage it to address (i): what criterion to use to pre-learn the policies. Here, one can appeal to heuristics such as the ones advocating that the set of pre-learned policies should be as diverse as possible (Eysenbach et al., 2018; Gregor et al., 2016; Grimm et al., 2019; Hansen et al., 2019). In this paper we will use the formal criterion of robustness, i.e., we will seek a set of policies that do as well as possible in the worst-case scenario. Thus, the problem of interest to this paper is as follows: how to define and discover a set of n policies that maximizes the worst possible performance of the resulting SMP across all possible tasks? Interestingly, as we will discuss, the solution to this robustness problem naturally leads to a diverse set of policies. To solve the problem posed above we make two assumptions: (A1) that tasks differ only in their reward functions, and (A2) that reward functions are linear combinations of known features. These two assumptions allow us to leverage the concept of successor features (SFs) and prior work in apprenticeship learning.
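For tabular action-value functions, the GPI choice rule described above can be sketched as follows (the arrays are illustrative toy numbers, not from the paper):

```python
import numpy as np

def gpi_action(q_tables, state):
    """Greedy action w.r.t. the pointwise max of the policies' Q-functions.

    q_tables: (n_policies, n_states, n_actions) array of action values,
              one Q-table per constituent policy.
    """
    q_max = q_tables[:, state, :].max(axis=0)  # pointwise max over policies
    return int(np.argmax(q_max))

# Toy numbers: in state 0, policy 0 prefers action 0 (value 1.0) while
# policy 1 prefers action 1 (value 2.0); GPI takes action 1.
q_tables = np.array([[[1.0, 0.0]],
                     [[0.0, 2.0]]])   # shape (2 policies, 1 state, 2 actions)
print(gpi_action(q_tables, state=0))  # -> 1
```

Because the max is taken per state rather than once under the start distribution, GPI can outperform every constituent policy, which is why its guarantee is at least as strong as the SMP's.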
As our main contribution in this paper, we present an algorithm that iteratively builds a set of policies such that SMP's performance with respect to the worst-case reward provably improves in each iteration, stopping when no such greedy improvement is possible. We also provide a closed-form expression to compute the worst-case performance of our algorithm at each iteration. This means that, given tasks satisfying Assumptions A1 and A2, we are able to provably construct a SIP that can quickly adapt to any task with guaranteed worst-case performance. Related Work. The proposed approach has interesting connections with hierarchical RL (HRL) (Sutton et al., 1999b; Dietterich, 2000). We can think of SMP (and GPI) as a higher-level policy-selection mechanism that is fixed a priori. Under this interpretation, the problem we are solving can be seen as the definition and discovery of lower-level policies that will lead to a robust hierarchical agent. There are interesting parallels between robustness and diversity. For example, diverse stock portfolios carry less risk. In robust least squares (El Ghaoui & Lebret, 1997; Xu et al., 2009), the goal is to find a solution that will perform well with respect to (w.r.t.) data perturbations. This leads to a min-max formulation, and there are known equivalences between solving a robust (min-max) problem and the diversity of the solution (via regularization) (Xu & Mannor, 2012). Our work is also related to robust Markov decision processes (MDPs) (Nilim & El Ghaoui, 2005), but our focus is on a different aspect of the problem. While in robust MDPs the uncertainty is w.r.t. the dynamics of the environment, here we focus on uncertainty w.r.t. the reward and assume that the dynamics are fixed. More importantly, we are interested in the hierarchical aspect of the problem: how to discover and compose a set of policies. In contrast, solutions to robust MDPs typically consist of a single policy.
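The worst-case objective the algorithm improves at each iteration can be sketched numerically. The sketch below is a Monte-Carlo approximation under the assumption of unit-norm task weights (the paper gives an exact closed form; the successor-feature vectors here are made up):

```python
import numpy as np

def smp_worst_case(psis, n_samples=200_000, seed=0):
    """Monte-Carlo estimate of an SMP's worst-case value over unit-norm tasks.

    psis: (n, d) successor features of the policies in the set.  On task w
    the SMP achieves max_i psis[i] @ w; an adversary picks the w on the
    unit sphere that minimises this.  Sampling is only an illustration of
    the min-max objective, not the paper's closed-form computation.
    """
    rng = np.random.default_rng(seed)
    ws = rng.normal(size=(n_samples, psis.shape[1]))
    ws /= np.linalg.norm(ws, axis=1, keepdims=True)  # project onto the sphere
    return float((ws @ psis.T).max(axis=1).min())

psis = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
one = smp_worst_case(psis[:1])  # adversary sets w near -psis[0]: close to -1
two = smp_worst_case(psis)      # close to -1/sqrt(2): strictly better
print(one < two)                # -> True: adding a policy raised the worst case
```

This mirrors the paper's guarantee in miniature: enlarging the set of policies can only improve (here, strictly improves) the worst-case value of the resulting SMP.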
In Apprenticeship Learning (AL; Abbeel & Ng, 2004) the goal is also to solve a min-max problem in which the agent is expected to perform as well as an expert w.r.t. any reward. If we ignore the expert, AL algorithms can be used to find a single policy that performs well w.r.t. any reward. The solution to this problem (when there is no expert) is the policy whose SFs have the smallest possible norm. When the SFs are in the simplex (as in tabular MDPs) the vector with the smallest ℓ2 norm puts equal probabilities on its coordinates, and is therefore "diverse" (making an equivalence between the robust min-max formulation and the diversity perspective). In that sense, our problem can be seen as a modified AL setup where: (a) no expert demonstrations are available; (b) the agent is allowed to observe the reward at test time; and (c) the goal is to learn a set of constituent policies.
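The claim that the minimum-norm vector on the simplex is the uniform one can be checked numerically (an illustrative sanity check, not the paper's derivation):

```python
import numpy as np

# Among vectors on the probability simplex, the uniform vector has the
# smallest l2 norm -- i.e., the min-norm (robust) solution is the most
# "diverse" one, spreading mass equally over its coordinates.
rng = np.random.default_rng(0)
d = 4
uniform = np.full(d, 1.0 / d)
samples = rng.dirichlet(np.ones(d), size=10_000)   # random simplex points

print(np.linalg.norm(uniform))                                            # -> 0.5
print(np.linalg.norm(samples, axis=1).min() >= np.linalg.norm(uniform))   # -> True
```

For d = 4 the uniform vector has norm sqrt(4 · (1/4)²) = 0.5, and no sampled simplex point beats it, matching the robustness-diversity equivalence discussed above.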

2. PRELIMINARIES

We will model our problem of interest using a family of Markov Decision Processes (MDPs). An MDP is a tuple M ≡ (S, A, P, r, γ, D), where S is the set of states, A is the set of actions, P = {P_a | a ∈ A} is the set of transition kernels, γ ∈ [0, 1] is the discount factor and D is the initial state distribution. The function r : S × A × S → R defines the rewards, and thus the agent's objective; here we are interested in multiple reward functions, as we explain next. Let φ(s, a, s′) ∈ [0, 1]^d be an observable vector of features (our analysis only requires the features to be bounded; we use [0, 1] for ease of exposition). We are interested in the set of tasks induced by all possible linear combinations of the features φ. Specifically, for any w ∈ R^d, we can define a reward function r_w(s, a, s′) = w · φ(s, a, s′). Given w, the reward r_w is well defined and we will use the terms w and r_w interchangeably to refer to the RL task induced by it. Formally, we are interested in

