DISCOVERING A SET OF POLICIES FOR THE WORST CASE REWARD

Abstract

We study the problem of how to construct a set of policies that can be composed together to solve a collection of reinforcement learning tasks. Each task is a different reward function defined as a linear combination of known features. We consider a specific class of policy compositions which we call set improving policies (SIPs): given a set of policies and a set of tasks, a SIP is any composition of the former whose performance is at least as good as that of its constituents across all the tasks. We focus on the most conservative instantiation of SIPs, set-max policies (SMPs), so that our analysis extends to any SIP, including known policy-composition operators such as generalized policy improvement (GPI). Our main contribution is a policy iteration algorithm that builds a set of policies in order to maximize the worst-case performance of the resulting SMP on the set of tasks. The algorithm works by successively adding new policies to the set. We show that the worst-case performance of the resulting SMP strictly improves at each iteration, and that the algorithm stops only when there is no policy that would lead to improved performance. We empirically evaluate our algorithm on a grid world and on a set of domains from the DeepMind control suite. The results confirm our theoretical findings regarding the monotonically improving performance of the algorithm. Interestingly, we also show empirically that the sets of policies computed by the algorithm are diverse, leading to different trajectories in the grid world and to very distinct locomotion skills in the control suite.
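In symbols (the notation here is introduced for illustration and may differ from that used in the body of the paper): writing each task's reward as $r_{\mathbf{w}}(s, a) = \mathbf{w} \cdot \boldsymbol{\phi}(s, a)$ for a known feature map $\boldsymbol{\phi}$ and task vector $\mathbf{w} \in \mathcal{W}$, and letting $v^{\pi}_{\mathbf{w}}$ denote the value function of a policy $\pi$ under reward $r_{\mathbf{w}}$, the algorithm seeks a set of policies $\Pi^n = \{\pi_1, \ldots, \pi_n\}$ that solves

$$\max_{\Pi^n} \; \min_{\mathbf{w} \in \mathcal{W}} \; \mathbb{E}_{s \sim d_0} \big[ v^{\pi^{\mathrm{SMP}}(\Pi^n)}_{\mathbf{w}}(s) \big],$$

where $d_0$ is the initial-state distribution and $\pi^{\mathrm{SMP}}(\Pi^n)$ denotes the set-max policy induced by $\Pi^n$.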

1. INTRODUCTION

Reinforcement learning (RL) is concerned with building agents that can learn to act so as to maximize reward through trial-and-error interaction with the environment. There are several reasons why it can be useful for an agent to learn about multiple ways of behaving, i.e., to learn about multiple policies. The agent may want to achieve multiple tasks (or subgoals) in a lifelong-learning setting and may learn a separate policy for each task, reusing them as needed when tasks recur. The agent may have a hierarchical architecture in which many policies are learned at a lower level while an upper-level policy learns to combine them in useful ways, for example to accelerate learning on a single task or to transfer efficiently to a new task. Learning about multiple policies in the form of options (Sutton et al., 1999a) can be a good way to achieve temporal abstraction; again, this can be used to quickly plan good policies for new tasks.

In this paper we abstract away from these specific scenarios and ask the following question: what set of policies should the agent pre-learn in order to guarantee good performance under the worst-case reward? A satisfactory answer to this question could be useful in all the scenarios discussed above and potentially many others. The question has two components: (i) which policies should be in the set, and (ii) how to compose, from the policies in the set, a policy to be used on a new task.

To answer (ii), we propose the concept of a set improving policy (SIP). Given any set of n policies, a SIP is any composition of these policies whose performance is at least as good as, and generally better than, that of each of the constituent policies in the set. We present two policy composition (or improvement) operators that lead to a SIP. The first is the set-max policy (SMP): given a distribution over states, an SMP chooses, from the n policies, the one that leads to the highest expected value. The second is generalized policy improvement (GPI; Barreto et al., 2017): given a set of n policies and their associated action-value functions, GPI is a natural extension of regular policy improvement in which the agent acts greedily in each state with respect to the maximum over the set of action-value functions.
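To make the two operators concrete, the following is a minimal sketch in a tabular setting, assuming each policy pi_i comes with a tabulated value function v^{pi_i} (used by SMP) and action-value function Q^{pi_i} (used by GPI); the function names and array layouts are ours for exposition and are not part of the paper.

```python
import numpy as np


def smp_choice(values: np.ndarray, d0: np.ndarray) -> int:
    """Set-max policy: commit to the single policy with the highest
    expected value under the state distribution d0.

    values: (n_policies, n_states) array, values[i, s] = v^{pi_i}(s).
    d0:     (n_states,) probability vector over states.
    """
    expected_values = values @ d0          # E_{s ~ d0}[v^{pi_i}(s)], per policy
    return int(np.argmax(expected_values))


def gpi_action(q: np.ndarray, s: int) -> int:
    """Generalized policy improvement: in state s, act greedily with
    respect to the maximum over the set of action-value functions,
    i.e. argmax_a max_i Q^{pi_i}(s, a).

    q: (n_policies, n_states, n_actions) array of action values.
    """
    return int(np.argmax(q[:, s, :].max(axis=0)))
```

Note the difference in granularity: the SMP commits to one constituent policy given the state distribution, whereas GPI may switch between policies on a per-state basis, which is consistent with the abstract's description of the SMP as the most conservative instantiation of a SIP.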

