BUILDING A SUBSPACE OF POLICIES FOR SCALABLE CONTINUAL LEARNING

Abstract

The ability to continuously acquire new knowledge and skills is crucial for autonomous agents. Existing methods are typically based on either fixed-size models that struggle to learn a large number of diverse behaviors, or growing-size models that scale poorly with the number of tasks. In this work, we aim to strike a better balance between an agent's size and performance by designing a method that grows adaptively depending on the task sequence. We introduce Continual Subspace of Policies (CSP), a new approach that incrementally builds a subspace of policies for training a reinforcement learning agent on a sequence of tasks. The subspace's high expressivity allows CSP to perform well for many different tasks while growing sublinearly with the number of tasks. Our method does not suffer from forgetting and displays positive transfer to new tasks. CSP outperforms a number of popular baselines on a wide range of scenarios from two challenging domains, Brax (locomotion) and Continual World (manipulation). Interactive visualizations of the subspace can be found at csp. Code is available here.

1. INTRODUCTION

Developing autonomous agents that can continuously acquire new knowledge and skills is a major challenge in machine learning, with broad applications in fields like robotics and dialogue systems. In the past few years, there has been growing interest in the problem of training agents on sequences of tasks, also referred to as continual reinforcement learning (CRL; Khetarpal et al., 2020). However, current methods either use fixed-size models that struggle to learn a large number of diverse behaviors (Hinton et al., 2006; Rusu et al., 2016a; Li & Hoiem, 2018; Bengio & LeCun, 2007; Kaplanis et al., 2019; Traoré et al., 2019; Kirkpatrick et al., 2017; Schwarz et al., 2018; Mallya & Lazebnik, 2018), or growing-size models that scale poorly with the number of tasks (Cheung et al., 2019; Wortsman et al., 2020). In this work, we introduce an adaptive-size model that strikes a better balance between performance and size, two crucial properties of continual learning systems (Veniat et al., 2020), and thus scales better to long task sequences.

Taking inspiration from the mode connectivity literature (Garipov et al., 2018; Gaya et al., 2021), we propose Continual Subspace of Policies (CSP), a new CRL approach that incrementally builds a subspace of policies (see Figure 1 for an illustration of our method). Instead of learning a single policy, CSP maintains an entire subspace of policies defined as a convex hull in parameter space. The vertices of this convex hull are called anchors, with each anchor representing the parameters of a policy. This subspace captures a large number of diverse behaviors, enabling good performance in a wide range of settings. At every stage of the CRL process, the best policy found for a previously seen task is represented as a single point in the current subspace (i.e. a unique convex combination of the anchors), which enables cheap storage and easy retrieval of prior solutions.
If a new task shares some similarities with previously seen ones, a good policy can often be found in the current subspace without increasing the number of parameters. On the other hand, if a new task is very different from previously seen ones, CSP extends the current subspace by adding another anchor and learns a new policy in the extended subspace. In this case, the pool of candidate solutions in the subspace grows, allowing CSP to deal with more diverse tasks in the future. The size of the subspace is increased only if this leads to performance gains larger than a given threshold, allowing users to specify the desired trade-off between performance and size (i.e. number of parameters or memory cost).

We evaluate our approach on 18 CRL scenarios from two different domains: locomotion in Brax and robotic manipulation in Continual World, a challenging CRL benchmark (Wołczyk et al., 2021). We also compare CSP with a number of popular CRL baselines, including both fixed-size and growing-size methods. As Figure 1b shows, CSP is competitive with the strongest existing methods while maintaining a smaller size. CSP does not incur any forgetting of prior tasks and displays positive transfer to new ones. We demonstrate that by increasing the threshold parameter, the model size can be significantly reduced without substantially hurting performance.
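The grow-or-prune rule can be sketched as follows. This is a minimal illustration under toy assumptions, not the paper's implementation: the function names are hypothetical, the subspace is searched by naively sampling convex combinations, and `evaluate` is a toy scoring function standing in for the return a real agent would estimate by rolling out episodes.

```python
import numpy as np

rng = np.random.default_rng(0)

def subspace_best(anchors, evaluate, n_samples=64):
    """Score the subspace by sampling points in the simplex spanned by
    the anchors and keeping the best score found."""
    k = len(anchors)
    best = -np.inf
    for _ in range(n_samples):
        alpha = rng.dirichlet(np.ones(k))                  # point in the simplex
        theta = sum(w * a for w, a in zip(alpha, anchors))  # convex combination
        best = max(best, evaluate(theta))
    return best

def grow_or_prune(anchors, evaluate, eps=0.1):
    """Tentatively add a new anchor; keep it only if the best policy in the
    extended subspace beats the old subspace by more than the threshold eps."""
    new_anchor = rng.normal(size=anchors[0].shape)
    best_old = subspace_best(anchors, evaluate)
    best_new = subspace_best(anchors + [new_anchor], evaluate)
    if best_new - best_old > eps:
        return anchors + [new_anchor]   # extend: the new task needs new behavior
    return anchors                      # prune: the old subspace already suffices

# Toy objective: parameter vectors closer to the constant vector 3 score higher.
evaluate = lambda theta: -float(np.abs(theta - 3.0).sum())
anchors = grow_or_prune([np.zeros(4), np.ones(4)], evaluate)
```

A larger `eps` makes the agent more reluctant to add anchors, trading performance for a smaller model, which is the knob behind the size/performance trade-off discussed above.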
In addition, our qualitative analysis shows that the subspace captures diverse behaviors and even combinations of previously learned skills, allowing transfer to many new tasks without requiring additional parameters.
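The per-task storage scheme described above (each solved task kept as a single point in the subspace) can be illustrated with a toy sketch. The two-parameter "networks" and hand-picked α vectors are purely illustrative assumptions; in practice each anchor would be the flattened weights of a full policy network.

```python
import numpy as np

# Each anchor is a full parameter vector (flattened network weights in practice).
anchors = [np.array([0.0, 1.0]), np.array([2.0, 3.0]), np.array([4.0, 5.0])]

# A solved task is stored as a single convex-combination weight vector alpha,
# not as a full copy of the network, so per-task storage cost is tiny.
stored_alphas = {"task_1": np.array([1.0, 0.0, 0.0]),
                 "task_2": np.array([0.2, 0.5, 0.3])}

def retrieve(task_id):
    """Map a stored point in the simplex back to policy parameters."""
    alpha = stored_alphas[task_id]
    assert np.isclose(alpha.sum(), 1.0) and (alpha >= 0).all()
    return sum(w * theta for w, theta in zip(alpha, anchors))

params = retrieve("task_2")  # 0.2*[0,1] + 0.5*[2,3] + 0.3*[4,5] = [2.2, 3.2]
```

Note that storage grows with the number of anchors per α vector, not with the number of tasks times the network size, which is what lets the model grow sublinearly with the task sequence.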

2. CONTINUAL REINFORCEMENT LEARNING

A continual RL problem is defined by a sequence of N tasks denoted t_1, ..., t_N. Task i is defined by a Markov Decision Process (MDP) M_i = ⟨S_i, A_i, T_i, r_i, γ⟩ with a set of states S_i, a set of actions A_i, a transition function T_i : S_i × A_i → P(S_i), a reward function r_i : S_i × A_i → ℝ, and a discount factor γ. Here, we consider that all tasks share the same state and action spaces. We also define a global policy Π : [1..N] × Z → (S → P(A)), which takes as input a task id i and a sequence of tasks Z, and outputs the policy to be used when interacting with task i; P(A) denotes a probability distribution over the action space. Each task i is associated with a training budget of b_i interactions. When the system switches to task t_{i+1}, it no longer has access to transitions from M_i. We do not assume access to a replay buffer with transitions from prior tasks, since this would lead to a significant increase in memory cost; instead, at each stage, we maintain a buffer with transitions from the current task only, just like SAC (Haarnoja et al., 2018b).
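The interaction protocol above can be sketched in a few lines. All names here are illustrative, the toy task callables stand in for real environments, and the policy update is elided since any off-policy algorithm such as SAC could fill it in.

```python
from collections import deque

def continual_training(tasks, budgets, buffer_size=10_000):
    """Train on tasks one at a time: task i gets a budget of b_i interactions,
    and the replay buffer is reset at every task switch, so it only ever
    holds transitions from the current task (no cross-task replay)."""
    per_task_transitions = {}
    for i, (step_env, budget) in enumerate(zip(tasks, budgets), start=1):
        buffer = deque(maxlen=buffer_size)   # no transitions from prior tasks
        for step in range(budget):
            transition = step_env(step)      # stands in for (s, a, r, s')
            buffer.append(transition)
            # ... off-policy update of the current policy from `buffer` ...
        per_task_transitions[i] = len(buffer)
    return per_task_transitions

# Two toy "tasks" with different interaction budgets.
counts = continual_training([lambda s: s, lambda s: -s], budgets=[5, 3])
```

Resetting the buffer at each switch is what keeps the memory cost independent of the number of tasks, at the price of never replaying old data.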



Figure 1: (a) Continual Subspace of Policies (CSP) iteratively learns a subspace of policies in the continual RL setting. At every stage of training, the subspace is a simplex defined by a set of anchors (i.e. vertices). Any policy (i.e. point) in this simplex can be represented as a convex combination α of the anchor parameters. α_i defines the best policy in the subspace for task i. When the agent encounters a new task, CSP tentatively grows the subspace by adding a new anchor. If the new task i is very different from previously seen ones, a better policy α_i^new can usually be learned in the new subspace. In this case, CSP extends the subspace by keeping the new anchor. If the new task bears some similarities to previously seen ones, a good policy α_i^old can typically be found in the old subspace. In this case, CSP prunes the subspace by removing the new anchor. The subspace is extended only if it improves performance relative to the old subspace by at least some threshold ϵ. (b) Trade-off between model performance and size for a number of methods, on a sequence of 8 tasks from HalfCheetah (see Table 7 for trade-offs on other scenarios).

