Composing Task Knowledge with Modular Successor Feature Approximators

Abstract

Recently, the Successor Features and Generalized Policy Improvement (SF&GPI) framework has been proposed as a method for learning, composing, and transferring predictive knowledge and behavior. SF&GPI works by having an agent learn predictive representations (SFs) that can be combined for transfer to new tasks with GPI. However, to be effective, this approach requires state features that are useful to predict, and these state features are typically hand-designed. In this work, we present a novel neural network architecture, "Modular Successor Feature Approximators" (MSFA), whose modules both discover what is useful to predict and learn their own predictive representations. We show that MSFA generalizes better than baseline architectures for learning SFs and than modular architectures for learning state representations.

1. Introduction

Consider a household robot that needs to learn tasks including picking up dirty dishes and cleaning up spills. Now suppose the robot is deployed and encounters a table with both a spill and a set of dirty dishes. Ideally, the robot can combine its training behaviors to both clean up the spill and pick up the dirty dishes. We study this aspect of generalization: combining knowledge from multiple tasks. This is challenging because it is not clear how to synthesize either the behavioral policies or the value functions learned during training. The challenge is exacerbated when an agent must also generalize to novel appearances and environment configurations. Returning to our example, the robot might additionally need to generalize to novel dirty dishes and to novel arrangements of chairs.

Successor features (SFs) and Generalized Policy Improvement (GPI) provide a mechanism for combining knowledge from multiple training tasks (Barreto et al., 2017; 2020). SFs are predictive representations that estimate how much of each state-feature (known as "cumulants") an agent will experience under a given behavior. By assuming that reward is linear in the cumulants, with a task vector providing the weights, an agent can efficiently compute how much reward it can expect a given behavior to obtain. If the agent knows multiple behaviors, it can leverage GPI to compute which behavior would provide the most reward (see Figure 2 for an example). However, SF&GPI methods commonly assume hand-designed cumulants and have no mechanism for generalizing to novel environment configurations.

Modular architectures are a promising method for generalizing to distributions outside of the training distribution (Goyal et al., 2019; Madan et al., 2021). Recently, Carvalho et al. (2021a) presented "FARM" and showed that learning multiple state modules enabled generalization to environments with unseen environment parameters (e.g., to larger maps with more objects).
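The SF&GPI computation described above can be sketched in a few lines. The following NumPy example is illustrative only (the names and shapes are our assumptions, not the paper's implementation): each known policy i contributes successor features ψ_i(s, a), the value of action a on a new task w is Q_i(s, a) = ψ_i(s, a) · w, and GPI acts greedily with respect to the maximum over policies.

```python
import numpy as np

def gpi_action(sfs, w):
    """GPI action selection from successor features.

    sfs: array of shape (n_policies, n_actions, d) -- successor features
         psi_i(s, a) for each known policy at the current state.
    w:   task vector of shape (d,).
    Returns the index of the action a* = argmax_a max_i psi_i(s, a) . w.
    """
    q = sfs @ w                      # Q_i(s, a), shape (n_policies, n_actions)
    return int(q.max(axis=0).argmax())

# Toy example: two known policies, two actions, 2-dimensional cumulants.
sfs = np.array([[[1.0, 0.0], [0.5, 0.5]],    # policy 0's SFs
                [[0.0, 1.0], [0.2, 0.8]]])   # policy 1's SFs
w = np.array([0.0, 1.0])                     # new task: only feature 1 is rewarded
print(gpi_action(sfs, w))                    # action 0: policy 1 promises value 1.0
```

Note that neither policy was trained on task w; GPI still identifies the best action any known behavior can offer for it.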
In this work, we hypothesize that modules can further be leveraged to discover state-features that are useful to predict.

Figure 1: (1) FARM learns multiple state modules. This promotes generalization to novel environments; however, it has no mechanism for combining task solutions. (2) USFA learns a single monolithic architecture for predicting SFs and can combine task solutions; however, it relies on hand-designed state features and has no mechanism for generalizing to novel environments. (3) We combine the benefits of both: we leverage modules for reward-driven discovery of state features that are useful to predict. These form the basis of their own predictive representations (SFs) and enable combining task solutions in novel environments.

We present "Modular Successor Feature Approximators" (MSFA), a novel neural network for discovering, composing, and transferring predictive knowledge and behavior via SF&GPI. MSFA is composed of a set of modules, each of which learns its own state-features and corresponding predictive representations (SFs). Our core contribution is showing that an inductive bias for modularity can enable reward-driven discovery of state-features that are useful for zero-shot transfer with SF&GPI. We exemplify this with a simple state-feature discovery method presented in Barreto et al. (2018), in which the dot-product between state-features and a task vector is regressed to the environment reward. This method enabled transfer with SF&GPI in a continual-learning setting but had limited success in the zero-shot transfer settings we study. While there are other methods for state-feature discovery, they add training complexity with mutual-information objectives (Hansen et al., 2019) or meta-gradients (Veeriah et al., 2019).
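As a concrete illustration of the reward-regression discovery method of Barreto et al. (2018) mentioned above, the sketch below trains a linear feature map phi(s) = s @ F so that phi(s) · w approximates the observed reward. This is a minimal sketch under our own assumptions (a linear feature map, one manual gradient step, toy data), not the paper's or Barreto et al.'s implementation.

```python
import numpy as np

# Discover cumulants by reward regression: learn features phi(s) = s @ F
# such that phi(s) . w approximates the environment reward r.
rng = np.random.default_rng(0)
obs_dim, cumulant_dim, batch = 8, 4, 64

F = rng.normal(scale=0.1, size=(obs_dim, cumulant_dim))  # feature map (learned)
S = rng.normal(size=(batch, obs_dim))                    # batch of observations
w = rng.normal(size=cumulant_dim)                        # task vector
r = S @ rng.normal(size=obs_dim)                         # toy rewards

def regression_loss(F):
    """Mean squared error between phi(s) . w and reward."""
    pred = (S @ F) @ w
    return np.mean((pred - r) ** 2)

before = regression_loss(F)
err = (S @ F) @ w - r                                    # residuals, shape (batch,)
grad = 2.0 * S.T @ (err[:, None] * w[None, :]) / batch   # dL/dF
F = F - 0.01 * grad                                      # one gradient-descent step
after = regression_loss(F)                               # loss decreases
```

Because the reward signal only constrains the projection phi(s) · w, many feature maps fit it equally well; MSFA's contribution is an architectural bias (modularity) that shapes which features this simple objective discovers.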
With MSFA, by adding only an architectural bias for modularity, we discover state-features that (1) support zero-shot transfer competitive with hand-designed features, and (2) enable zero-shot transfer in visually diverse, procedurally generated environments. We are hopeful that our architectural bias can be leveraged with other discovery methods in future work.

2. Related Work on Generalization in RL

Hierarchical RL (HRL) is one dominant approach for combining task knowledge. The basic idea is that one can combine policies sequentially in time by having a "meta-policy" that activates "low-level" policies for protracted periods. By leveraging hand-designed or pre-trained low-level policies, one can generalize to longer instructions (Oh et al., 2017; Corona et al., 2020), to new instruction orders (Brooks et al., 2021), and to novel subtask graphs (Sohn et al., 2020; 2022). We differ in that we focus on combining policies concurrently in time as opposed to sequentially in time. To do so, we develop a modular neural network for the SF&GPI framework.

SFs are predictive representations that represent the current state as a summary of the successive features to follow (see §3 for a formal definition). By combining them with Generalized Policy Improvement, researchers have shown that they can transfer behaviors across object-navigation tasks (Borsa et al., 2019; Zhang et al., 2017; Zhu et al., 2017), across continuous-control tasks (Hunt et al., 2019), and within an HRL framework (Barreto et al., 2019). However, these works tend to require hand-designed cumulants, which are cumbersome to design for every new environment. In our work, we integrate SFs with Modular RL to facilitate reward-driven discovery of cumulants and to improve successor feature learning.

Modular RL (MRL) (Russell & Zimdars, 2003) is a framework for generalization by combining value functions. Early work dates back to Singh (1992), who had a mixture-of-experts system select between separately trained value functions. Since then, MRL has been applied to generalize across robotic morphologies (Huang et al., 2020), to novel task-robot combinations (Devin et al., 2017; Haarnoja et al., 2018), and to novel language

