EFFICIENT CONTINUAL LEARNING WITH MODULAR NETWORKS AND TASK-DRIVEN PRIORS

Abstract

Existing literature in Continual Learning (CL) has focused on overcoming catastrophic forgetting, the inability of the learner to recall how to perform tasks observed in the past. There are however other desirable properties of a CL system, such as the ability to transfer knowledge from previous tasks and to scale memory and compute sub-linearly with the number of tasks. Since most current benchmarks focus only on forgetting using short streams of tasks, we first propose a new suite of benchmarks to probe CL algorithms across these new axes. We then introduce a new modular architecture, whose modules represent atomic skills that can be composed to perform a certain task. Learning a task reduces to figuring out which past modules to re-use and which new modules to instantiate to solve the current task. Our learning algorithm leverages a task-driven prior over the exponential search space of all possible ways to combine modules, enabling efficient learning on long streams of tasks. Our experiments show that this modular architecture and learning algorithm perform competitively on widely used CL benchmarks while yielding superior performance on the more challenging benchmarks we introduce in this work. The benchmark is publicly available at https://github.com/facebookresearch/CTrLBenchmark.

1. INTRODUCTION

Continual Learning (CL) is a learning framework whereby an agent learns through a sequence of tasks (Ring, 1994; Thrun, 1994; 1998), observing each task once and only once. Much of the focus of the CL literature has been on avoiding catastrophic forgetting (McClelland et al., 1995; McCloskey & Cohen, 1989; Goodfellow et al., 2013), the inability of the learner to recall how to perform a task learned in the past. In our view, remembering how to perform a previous task is particularly important because it promotes knowledge accrual and transfer. CL thus has the potential to address one of the major limitations of modern machine learning: its reliance on large amounts of labeled data. An agent may learn a new task well even when provided with little labeled data if it can leverage the knowledge accrued while learning previous tasks. Our first contribution is to pinpoint general properties that a good CL learner should have beyond avoiding forgetting. In §3 we explain how the learner should be able to transfer knowledge from related tasks seen in the past. At the same time, when tasks are related, the learner should scale sub-linearly with their number, both in terms of memory and compute. Our second contribution is to introduce a new benchmark suite, dubbed CTrL, to test the above properties, since current benchmarks focus only on forgetting. For the sake of simplicity, and as a first step towards a more holistic evaluation of CL models, in this work we restrict our attention to supervised learning tasks and basic transfer learning properties. Our experiments show that while commonly used benchmarks do not discriminate well between different approaches, our newly introduced benchmark lets us dissect performance across several new dimensions of transfer and scalability (see Fig. 1 for instance), helping machine learning developers better understand the strengths and weaknesses of various approaches.
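To make the desiderata above concrete, forgetting and final performance are commonly quantified in the CL literature from an accuracy matrix recorded along the stream. The sketch below is illustrative and not tied to this paper's exact protocol; the function name `cl_metrics` and the specific metric definitions (average final accuracy, and forgetting as the drop from a task's best accuracy to its final accuracy) are our assumptions:

```python
def cl_metrics(acc):
    """Compute average accuracy and forgetting for a task stream.

    acc[i][j] is the accuracy on task j measured right after
    training on task i (so acc is a T x T lower-triangular-ish
    matrix filled in as the stream is processed).
    """
    T = len(acc)
    final = acc[T - 1]                     # accuracies at end of stream
    avg_acc = sum(final) / T
    # Forgetting: for each task except the last, how far its final
    # accuracy dropped from the best accuracy it ever reached.
    forgetting = sum(
        max(acc[i][j] for i in range(j, T)) - final[j]
        for j in range(T - 1)
    ) / (T - 1)
    return avg_acc, forgetting
```

For instance, a two-task stream where task 0 reaches 0.9 but ends at 0.7 exhibits a forgetting of 0.2 under this definition.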
Our last contribution is a new model designed according to the above-mentioned properties of CL methods. It is based on a modular neural network architecture (Eigen et al., 2014; Denoyer & Gallinari, 2015; Fernando et al., 2017; Li et al., 2019) in which each task is solved by composing a handful of neural modules that can be either borrowed from past tasks or freshly trained on the new task. In principle, modularization takes care of all the fundamental properties we care about, as i) by design there is no forgetting, since modules from past tasks are not updated when new tasks arrive; ii) transfer is enabled by sharing modules across tasks; and iii) the overall model scales sub-linearly with the number of tasks as long as similar tasks share modules. The key issue is how to efficiently select modules, as the search space grows exponentially in their number. In this work, we overcome this problem by leveraging a data-driven prior over the space of possible architectures, which allows only local perturbations around the architecture of the previous task whose features best solve the current task (§4.2). Our experiments in §5, which employ a stricter and more realistic evaluation protocol whereby streams are observed only once but data from each task can be replayed multiple times, show that this model performs at least as well as state-of-the-art methods on standard benchmarks, and much better on our new and more challenging benchmark, exhibiting better transfer and the ability to scale to streams of a hundred tasks.
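To illustrate why such a prior tames the search, the toy sketch below enumerates one plausible family of "local perturbations": candidate paths that reuse a prefix of the best past architecture and instantiate fresh modules for the remaining layers, reducing an exponential search to a linear number of candidates. The function and module-naming scheme are hypothetical illustrations, not the paper's actual algorithm (which is detailed in §4.2):

```python
def candidate_paths(best_prev_path, num_layers, new_task_id):
    """Enumerate local perturbations of the best past architecture.

    Instead of considering every combination of existing modules
    (exponential in the module count), keep only paths that reuse
    the first `split` modules of the best previous path and create
    fresh modules for all deeper layers: num_layers + 1 candidates.
    """
    candidates = []
    for split in range(num_layers + 1):
        path = list(best_prev_path[:split])          # reused modules
        for layer in range(split, num_layers):
            path.append(f"{new_task_id}-L{layer}")   # fresh module
        candidates.append(path)
    return candidates
```

Each candidate is then trained on the new task and the best-performing one is kept, so the cost of architecture search grows with depth rather than with the total number of accumulated modules.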

2. RELATED WORK

CL methods can be categorized into three main families of approaches. Regularization-based methods use a single predictor shared across all tasks, possibly with a task-specific classification head depending on the setting, and rely on various regularization terms to prevent forgetting. Kirkpatrick et al. (2016); Schwarz et al. (2018) use an approximation of the Fisher Information matrix, while Zenke et al. (2017) use the distance of each weight to its initialization as a measure of importance. These approaches work well on streams containing a limited number of tasks but will inevitably either forget or stop learning as streams grow in size and diversity (van de Ven & Tolias, 2019), due to their structural rigidity and fixed capacity. Similarly, rehearsal-based methods also share a single predictor across all tasks but combat forgetting by rehearsing samples from past tasks. For instance, Lopez-Paz & Ranzato (2017); Chaudhry et al. (2019b); Rolnick et al. (2019) store past samples in a replay buffer, while Shin et al. (2017) learn to generate new samples from the data distribution of previous tasks and Zhang et al. (2019) compute per-class prototypes. These methods share the same drawback as regularization methods: their capacity is fixed and pre-determined, which makes them ineffective at handling long streams. Finally, approaches based on evolving architectures directly tackle the issue of limited capacity by letting the architecture grow over time. Rusu et al. (2016) introduce a new network for each task, with connections to all previous layers, resulting in a model that grows super-linearly with the number of tasks. This issue was later addressed by Schwarz et al. (2018), who propose to distill the new network back into the original one after each task, yielding a fixed-capacity predictor that has severe limitations on long streams. Yoon et al. (2018); Hung et al. (2019) propose heuristic algorithms to automatically add and prune weights. Mehta et al. (2020) propose a Bayesian approach that adds an adaptive number of weights to each layer. Li et al. (2019) propose to softly select between reusing, adapting, and introducing a new module at every layer. Similarly, Xu & Zhu (2018) propose to add filters when a new task arrives using REINFORCE (Williams, 1992), leading to ever larger networks, even at inference time. These last two works are the most similar to ours, with the major difference that we restrict the search space over architectures with a novel task-driven prior (§4).



Figure 1: Comparison of various CL methods on the CTrL benchmark using Resnet (left) and Alexnet (right) backbones. MNTDP-D is our method. See Tab. 1 of §5.3 for details.
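As an aside on the rehearsal-based methods discussed above: a fixed memory budget is typically maintained with reservoir sampling, which keeps the buffer an unbiased sample of the whole stream regardless of task boundaries. The class below is a minimal illustrative sketch under that assumption; the names and API are our own, not taken from any of the cited works:

```python
import random


class ReplayBuffer:
    """Fixed-size episodic memory filled by reservoir sampling.

    After n samples have been seen, each of them is retained with
    probability capacity / n, so the buffer approximates a uniform
    sample of the entire stream.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)          # buffer not yet full
        else:
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = sample          # evict a random slot

    def sample(self, k):
        """Draw a rehearsal mini-batch of up to k stored samples."""
        return self.rng.sample(self.data, min(k, len(self.data)))
```

At training time, each mini-batch from the current task would be interleaved with a call to `sample`, so gradients also reflect past tasks.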

