BUILDING A SUBSPACE OF POLICIES FOR SCALABLE CONTINUAL LEARNING

Abstract

The ability to continuously acquire new knowledge and skills is crucial for autonomous agents. Existing methods are typically based on either fixed-size models that struggle to learn a large number of diverse behaviors, or growing-size models that scale poorly with the number of tasks. In this work, we aim to strike a better balance between an agent's size and performance by designing a method that grows adaptively depending on the task sequence. We introduce Continual Subspace of Policies (CSP), a new approach that incrementally builds a subspace of policies for training a reinforcement learning agent on a sequence of tasks. The subspace's high expressivity allows CSP to perform well on many different tasks while growing sublinearly with the number of tasks. Our method does not suffer from forgetting and displays positive transfer to new tasks. CSP outperforms a number of popular baselines on a wide range of scenarios from two challenging domains, Brax (locomotion) and Continual World (manipulation). Interactive visualizations of the subspace and our code are available online.

1. INTRODUCTION

Developing autonomous agents that can continuously acquire new knowledge and skills is a major challenge in machine learning, with broad applications in fields such as robotics and dialogue systems. In the past few years, there has been growing interest in the problem of training agents on sequences of tasks, also referred to as continual reinforcement learning (CRL; Khetarpal et al., 2020). However, current methods either use fixed-size models that struggle to learn a large number of diverse behaviors (Hinton et al., 2006; Rusu et al., 2016a; Li & Hoiem, 2018; Bengio & LeCun, 2007; Kaplanis et al., 2019; Traoré et al., 2019; Kirkpatrick et al., 2017; Schwarz et al., 2018; Mallya & Lazebnik, 2018), or growing-size models that scale poorly with the number of tasks (Cheung et al., 2019; Wortsman et al., 2020). In this work, we introduce an adaptive-size model that strikes a better balance between performance and size, two crucial properties of continual learning systems (Veniat et al., 2020), and thus scales better to long task sequences.

Taking inspiration from the mode connectivity literature (Garipov et al., 2018; Gaya et al., 2021), we propose Continual Subspace of Policies (CSP), a new CRL approach that incrementally builds a subspace of policies (see Figure 1 for an illustration of our method). Instead of learning a single policy, CSP maintains an entire subspace of policies defined as a convex hull in parameter space. The vertices of this convex hull are called anchors, each representing the parameters of one policy. This subspace captures a large number of diverse behaviors, enabling good performance in a wide range of settings. At every stage of the CRL process, the best policy found for a previously seen task is represented as a single point in the current subspace (i.e., a unique convex combination of the anchors), which enables cheap storage and easy retrieval of prior solutions.
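To make the subspace construction concrete, the following is a minimal sketch of a convex hull of anchor policies, where each policy is treated as a flat parameter vector and a point in the subspace is a convex combination of the anchors. The class and method names (`PolicySubspace`, `policy_params`, `sample_policy`) are illustrative assumptions, not identifiers from the paper's released code:

```python
import numpy as np

class PolicySubspace:
    """Convex hull of anchor policies: a point in the subspace is a
    convex combination of the anchors' parameter vectors."""

    def __init__(self, anchor_params):
        # anchor_params: list of k 1-D parameter vectors, one per anchor
        self.anchors = np.stack(anchor_params)  # shape (k, d)

    def policy_params(self, alpha):
        # alpha: convex weights on the simplex (non-negative, sum to 1)
        alpha = np.asarray(alpha, dtype=float)
        assert np.all(alpha >= 0) and np.isclose(alpha.sum(), 1.0)
        return alpha @ self.anchors  # weighted average of anchor parameters

    def sample_policy(self, rng):
        # Sample a random point in the subspace via a Dirichlet draw,
        # which is uniform over the simplex for concentration 1
        alpha = rng.dirichlet(np.ones(len(self.anchors)))
        return alpha, self.policy_params(alpha)

# Storing the best policy for a past task only requires its weight
# vector alpha (k floats), not a full d-dimensional parameter vector.
rng = np.random.default_rng(0)
subspace = PolicySubspace([np.zeros(4), np.ones(4), np.full(4, 2.0)])
alpha, params = subspace.sample_policy(rng)
print(alpha.shape, params.shape)  # (3,) (4,)
```

This toy example uses 4-dimensional "policies" for readability; in practice each anchor would hold the flattened weights of a full policy network, and the sublinear growth comes from adding a new anchor only when the task sequence requires it.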

