EFFICIENT CONTINUAL LEARNING WITH MODULAR NETWORKS AND TASK-DRIVEN PRIORS

Abstract

Existing literature in Continual Learning (CL) has focused on overcoming catastrophic forgetting, the inability of the learner to recall how to perform tasks observed in the past. There are, however, other desirable properties of a CL system, such as the ability to transfer knowledge from previous tasks and to scale memory and compute sub-linearly with the number of tasks. Since most current benchmarks focus only on forgetting using short streams of tasks, we first propose a new suite of benchmarks to probe CL algorithms across these new axes. Second, we introduce a new modular architecture, whose modules represent atomic skills that can be composed to perform a certain task. Learning a task reduces to figuring out which past modules to re-use and which new modules to instantiate to solve the current task. Our learning algorithm leverages a task-driven prior over the exponential search space of all possible ways to combine modules, enabling efficient learning on long streams of tasks. Our experiments show that this modular architecture and learning algorithm perform competitively on widely used CL benchmarks while yielding superior performance on the more challenging benchmarks we introduce in this work. The benchmark is publicly available at https://github.com/facebookresearch/CTrLBenchmark.

1. INTRODUCTION

Continual Learning (CL) is a learning framework whereby an agent learns through a sequence of tasks (Ring, 1994; Thrun, 1994; 1998), observing each task once and only once. Much of the focus of the CL literature has been on avoiding catastrophic forgetting (McClelland et al., 1995; McCloskey & Cohen, 1989; Goodfellow et al., 2013), the inability of the learner to recall how to perform a task learned in the past. In our view, remembering how to perform a previous task is particularly important because it promotes knowledge accrual and transfer. CL then has the potential to address one of the major limitations of modern machine learning: its reliance on large amounts of labeled data. An agent may learn a new task well even when provided with little labeled data if it can leverage the knowledge accrued while learning previous tasks.

Our first contribution is thus to pinpoint general properties that a good CL learner should have, besides avoiding forgetting. In §3 we explain how the learner should be able to transfer knowledge from related tasks seen in the past. At the same time, the learner should be able to scale sub-linearly with the number of tasks, both in terms of memory and compute, when the tasks are related.

Our second contribution is to introduce a new benchmark suite, dubbed CTrL, to test the above properties, since current benchmarks only focus on forgetting. For the sake of simplicity, and as a first step towards a more holistic evaluation of CL models, in this work we restrict our attention to supervised learning tasks and basic transfer learning properties. Our experiments show that while commonly used benchmarks do not discriminate well between different approaches, our newly introduced benchmark lets us dissect performance across several new dimensions of transfer and scalability (see Fig. 1 for instance), helping machine learning developers better understand the strengths and weaknesses of various approaches.
Our last contribution is a new model that is designed according to the above-mentioned properties of CL methods. It is based on a modular neural network architecture (Eigen et al., 2014; Denoyer & Gallinari, 2015; Fernando et al., 2017; Li et al., 2019) with a novel task-driven prior (§4). Every

