Composing Task Knowledge with Modular Successor Feature Approximators

Abstract

Recently, the Successor Features and Generalized Policy Improvement (SF&GPI) framework has been proposed as a method for learning, composing, and transferring predictive knowledge and behavior. SF&GPI works by having an agent learn predictive representations (SFs) that can be combined for transfer to new tasks with GPI. However, to be effective, this approach requires state-features that are useful to predict, and these state-features are typically hand-designed. In this work, we present a novel neural network architecture, "Modular Successor Feature Approximators" (MSFA), whose modules both discover what is useful to predict and learn their own predictive representations. We show that MSFA generalizes better than baseline architectures for learning SFs and than modular architectures for learning state representations.

1. Introduction

Consider a household robot that needs to learn tasks such as picking up dirty dishes and cleaning up spills. Now suppose the robot is deployed and encounters a table with both a spill and a set of dirty dishes. Ideally, the robot can combine its training behaviors to both clean up the spill and pick up the dirty dishes. We study this aspect of generalization: combining knowledge from multiple tasks. This is challenging because it is not clear how to synthesize either the behavioral policies or the value functions learned during training. The challenge is exacerbated when an agent must also generalize to novel appearances and environment configurations; returning to our example, the robot might additionally need to generalize to novel dirty dishes and to novel arrangements of chairs.

Successor features (SFs) and Generalized Policy Improvement (GPI) provide a mechanism for combining knowledge from multiple training tasks (Barreto et al., 2017; 2020). SFs are predictive representations that estimate how much of each state-feature (known as a "cumulant") will be experienced under a given behavior. By assuming that reward is linear in the cumulants, weighted by a task vector, an agent can efficiently compute how much reward it can expect to obtain from a given behavior. If the agent knows multiple behaviors, it can leverage GPI to compute which behavior would provide the most reward (see Figure 2 for an example). However, SF&GPI methods commonly assume hand-designed cumulants and have no mechanism for generalizing to novel environment configurations.

Modular architectures are a promising method for generalizing to distributions outside of the training distribution (Goyal et al., 2019; Madan et al., 2021). Recently, Carvalho et al. (2021a) presented "FARM" and showed that learning multiple state modules enabled generalization to environments with unseen environment parameters (e.g., larger maps with more objects).
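To make the SF&GPI transfer mechanism concrete, the following is a minimal sketch in NumPy. It assumes SF vectors for several trained policies are already available as arrays (the names `psi`, `w`, and the random values are hypothetical placeholders, not part of any particular implementation): Q-values for a new task are dot products of SFs with the task vector, and GPI picks the action that is best under the best trained policy.

```python
import numpy as np

# Hypothetical setup: n_policies trained behaviors, n_actions actions,
# and d-dimensional cumulants. psi[i, a] stands in for the SF vector
# psi^{pi_i}(s, a) at the current state s; w is the new task's vector.
rng = np.random.default_rng(0)
n_policies, n_actions, d = 3, 4, 5
psi = rng.random((n_policies, n_actions, d))  # SFs at current state s
w = rng.random(d)                             # task vector (reward weights)

# Linear reward assumption gives Q^{pi_i}(s, a) = psi^{pi_i}(s, a) . w,
# so all Q-values for the new task come from one matrix product.
q = psi @ w                                   # shape: (n_policies, n_actions)

# GPI: act greedily with respect to the max over trained policies,
# i.e. pi(s) = argmax_a max_i Q^{pi_i}(s, a).
action = int(np.argmax(q.max(axis=0)))
```

The key point the sketch illustrates is that transfer to a new task requires no further learning: only the task vector `w` changes, and the max over policies is guaranteed to perform at least as well as any single trained behavior.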
In this work, we hypothesize that modules can further be leveraged to discover state-features that are useful to predict.

* Contact author: wcarvalh@umich.edu

