LEARNING PORTABLE SKILLS BY IDENTIFYING GENERALIZING FEATURES WITH AN ATTENTION-BASED ENSEMBLE

Anonymous

Abstract

The ability to rapidly generalize is crucial for reinforcement learning to be practical in real-world tasks. However, generalization is complicated by the fact that, in many settings, some state features reliably support generalization while others do not. We consider the problem of learning generalizable policies and skills (in the form of options) by identifying feature sets that generalize across instances. We propose an attention-ensemble approach, where a collection of minimally overlapping feature masks is learned, each of which individually maximizes performance on the source instance. Subsequent tasks are instantiated using the ensemble, and transfer performance is used to update the estimated probability that each feature set will generalize in the future. We show that our approach leads to fast policy generalization for eight tasks in the Procgen benchmark. We then show its use in learning portable options in Montezuma's Revenge, where it is able to generalize skills learned in the first screen to the remainder of the game.

1. INTRODUCTION

In recent years, reinforcement learning has outperformed humans in many Atari games (Mnih et al., 2015), learned to play world-champion-level Go (Silver et al., 2017), and mastered many robot manipulation tasks (Levine et al., 2016; 2018). While these achievements are undeniably impressive, they occurred in simulated and controlled environments stripped of many of the complexities humans face in everyday life. For reinforcement learning to be viable in real-world applications, the ability to scale to large, high-dimensional environments is crucial. Hierarchical reinforcement learning (Barto & Mahadevan, 2003) is a promising approach to achieving this scalability through the use of high-level skills that abstract away the detail of low-level action. The most popular hierarchical RL framework is the options framework (Sutton et al., 1999), which models abstract actions as consisting of three components: a set of states from which execution can begin, a policy that specifies how the option executes, and a set of states where execution ceases. To fully realize the promise of the options framework, learned options should ideally be easily reused, or ported, to new tasks and environments (Konidaris & Barto, 2007).

The core difficulty here is that, in practice, an option will first be learned in a small number of specific instantiations (possibly just one) without foreknowledge of the circumstances under which it will be applied again in the future. In such cases there may be many state features over which the first instance(s) of the option could be successfully defined, but which will not support reuse. For example, a single option to open a door might be equally well defined using features describing the door's location in a global map, or features describing the location of its handle relative to the agent, but only the latter will generalize to new doors.
This problem is exacerbated by the fact that all three components of the option must simultaneously function in new instantiations, or the option will fail. We therefore propose to learn portable options by identifying sets of state features that support generalization. We adopt the transfer learning setting (Taylor & Stone, 2009), where the goal is to learn on a number of source tasks and perform well on target tasks with minimal re-training. We introduce a method where an agent uses an attention-based ensemble (Kim et al., 2018) to learn a collection of diverse feature sets that each individually maximize performance on the source task. Subsequent option instantiations are evaluated for success, and the results are used to update the agent's estimate of the probability that each feature set will generalize to future tasks. These probabilities in turn govern which feature sets are used in new option instantiations. We begin by showing how to learn a portable policy using an ensemble, and demonstrate that it leads to fast learning on eight games from the Procgen generalization benchmark (Cobbe et al., 2020). The ensemble is then extended to learn portable classifiers that represent initiation and termination sets. We combine the resulting portable policy, initiation classifier, and termination classifier methods to learn portable options in Montezuma's Revenge, where our method enables an agent to generalize skills learned in the first room to all the others.
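The success-driven probability update described above can be viewed as maintaining a Bernoulli success model per feature set. The following sketch uses a Beta posterior with Thompson-style selection; the class name, prior, and update rule are our own illustrative assumptions, not a specification from the paper:

```python
import random

class FeatureMaskBandit:
    """Tracks, for each feature mask in the ensemble, a Beta posterior over
    the probability that the mask generalizes to a new task instance.
    (Illustrative sketch; the paper does not pin down this exact rule.)"""

    def __init__(self, num_masks):
        # Beta(1, 1) uniform prior for every mask.
        self.successes = [1] * num_masks
        self.failures = [1] * num_masks

    def estimate(self, mask_id):
        """Posterior mean of the generalization probability."""
        s, f = self.successes[mask_id], self.failures[mask_id]
        return s / (s + f)

    def select_mask(self):
        """Thompson-style selection: sample from each posterior, pick the best."""
        samples = [random.betavariate(self.successes[i], self.failures[i])
                   for i in range(len(self.successes))]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, mask_id, succeeded):
        """Record whether an option instantiated with this mask succeeded."""
        if succeeded:
            self.successes[mask_id] += 1
        else:
            self.failures[mask_id] += 1
```

Under this sketch, masks that repeatedly transfer well come to dominate selection for new option instantiations, while masks that overfit the source instance are sampled less often.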

2. BACKGROUND AND RELATED WORK

We consider the episodic reinforcement learning setting. An agent operates in a Markov Decision Process (MDP) with state space S and action space A. The transition probability p(s_{t+1} | s_t, a_t) is the probability of transitioning from state s_t to state s_{t+1} with action a_t. At each time step, the agent receives a scalar reward defined by the reward function r(s_t, a_t). The goal of the reinforcement learning agent is to maximize the cumulative discounted reward over an episode by finding a policy π(a|s) that selects actions at each step.

Options in their most basic formulation offer no guarantees of portability. However, Konidaris & Barto (2007) argue that if the inputs to an option retain the same semantics across option instances (introducing the notion of an agent-space), the option will be reusable. Many works focus on building derived input spaces with transferable semantics. Konidaris & Barto (2007) showed that an agent-centric representation, analogous to the egocentric space (the space surrounding the agent; Klatzky, 1998), is sufficient for many tasks, especially in robotics. The successor features (SF) and generalized policy improvement (GPI) framework proposed by Barreto et al. (2018) creates a space derived from successor features, leading to reusable skills; however, these skills adapt only to changes in the reward function, not the transition dynamics. Gupta et al. (2017) learn skills in an invariant feature space, which enables generalization across morphologically different robots, but the invariant feature space cannot generalize across tasks. Hausman et al. (2018) propose learning a separate embedding space that can be used to parameterize discovered skills. They show that this successfully generalizes the parameterized skills across tasks, but assume the agent has access to a collection of different tasks with which to learn the embedding space.
This embedding space results in portable policies, but does not address the main issue of overfitting to the few instances an option is defined over. Topin et al. (2015) transfer options between different object-oriented MDPs, but are confined to discrete domains. Our work instead focuses on building portable options in high-dimensional and continuous domains. We also differ from these approaches by using diversity to build portable options: we propose to learn a diverse set of state features, each of which maximizes reward, to both define the option policy and identify the portable initiation and termination sets.

One way of maintaining a set of features is by building an ensemble. Ensemble methods train multiple learners on the same task, producing a combined model that performs better than each individual model. The quality of the ensemble depends on the quality of each individual learner and the diversity among them. The attention-based ensemble proposed by Kim et al. (2018) is a deep learning framework, originally intended for deep metric learning, that encourages diversity in the feature embeddings learned by each member of the ensemble. The objective for an ensemble of M members is to minimize a combination of each member's training loss and a divergence loss:

    L({x_i}) = Σ_{m=1}^{M} L_train,(m)({x_i}) + λ_div L_div({x_i}).    (1)
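A minimal sketch of Equation 1, assuming the divergence loss penalizes pairwise cosine similarity between the members' feature embeddings for the same input (one common choice for encouraging diversity; Kim et al. (2018) use a specific formulation not reproduced here):

```python
import numpy as np

def ensemble_loss(embeddings, train_losses, lambda_div=0.5, margin=0.0):
    """Combined objective in the spirit of Eq. 1 (illustrative sketch).

    embeddings:   array of shape (M, D), one feature embedding per learner
                  for the same input batch.
    train_losses: length-M sequence of each member's training loss L_train,(m).

    The divergence term penalizes pairwise cosine similarity between members'
    embeddings, pushing each member to attend to different features."""
    M = len(train_losses)
    # Normalize embeddings so the dot product is cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-8)
    sim = unit @ unit.T  # (M, M) pairwise cosine similarities
    # Average similarity over distinct pairs, hinged at `margin`.
    pair_sims = sim[np.triu_indices(M, k=1)]
    div_loss = np.mean(np.maximum(pair_sims - margin, 0.0))
    return float(sum(train_losses) + lambda_div * div_loss)
```

When the members' embeddings are orthogonal, the divergence term vanishes and the objective reduces to the sum of the individual training losses; identical embeddings incur the maximum penalty.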



The options framework by Sutton et al. (1999) extends the MDP framework with temporally extended abstract actions known as options. An option o is defined by a three-tuple (I_o, π_o, β_o). The initiation set, I_o : S → {0, 1}, is the set of states in which option o can initiate. The termination set, β_o : S → {0, 1}, is the set of states in which option o successfully terminates. The option policy, π_o : S → A, is a controller that transitions the agent from states in I_o to states in β_o.
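The three-tuple above can be sketched as a small data structure with an execution loop (an illustrative sketch; the names and step budget are our own, not part of the framework):

```python
from dataclasses import dataclass
from typing import Any, Callable

State = Any
Action = Any

@dataclass
class Option:
    """Minimal sketch of the (I_o, pi_o, beta_o) three-tuple from the
    options framework (Sutton et al., 1999)."""
    initiation: Callable[[State], bool]   # I_o: can the option start here?
    policy: Callable[[State], Action]     # pi_o: action to take in each state
    termination: Callable[[State], bool]  # beta_o: does the option end here?

def run_option(option, state, step, max_steps=100):
    """Execute `option` from `state` using environment transition `step`
    until beta_o fires or a step budget is exhausted."""
    assert option.initiation(state), "state is outside the initiation set I_o"
    for _ in range(max_steps):
        if option.termination(state):
            break
        state = step(state, option.policy(state))
    return state
```

For example, on an integer line, an option that starts in states below 5 and walks right until reaching 5 carries the agent from I_o to β_o in one call to `run_option`.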

