A PROBABILISTIC FRAMEWORK FOR MODULAR CONTINUAL LEARNING

Abstract

Continual learning (CL) algorithms seek to accumulate and transfer knowledge across a sequence of tasks and achieve better performance on each successive task. Modular approaches, which use a different composition of modules for each task and avoid forgetting by design, have been shown to be a promising direction for CL. However, searching through the large space of possible module compositions remains a challenge. In this work, we develop a scalable probabilistic search framework as a solution to this challenge. Our framework has two distinct components. The first is designed to transfer knowledge across similar input domains. To this end, it models each module's training input distribution and uses a Bayesian model to find the most promising module compositions for a new task. The second component targets transfer across tasks with disparate input distributions or different input spaces and uses Bayesian optimisation to explore the space of module compositions. We show that these two methods can be easily combined, and we evaluate the resulting approach on two benchmark suites designed to capture different desiderata of CL techniques. The experiments show that our framework offers superior performance compared to state-of-the-art CL baselines.

1. INTRODUCTION

The continual learning (CL) (Thrun & Mitchell, 1995) setting calls for algorithms that can solve a sequence of learning problems while performing better on every successive problem. A CL algorithm should avoid catastrophic forgetting, i.e., it should not allow later tasks to overwrite what has been learned from earlier tasks, and should achieve transfer across a large sequence of problems. Ideally, the algorithm should be able to transfer knowledge across similar input distributions (perceptual transfer), across dissimilar input distributions and different input spaces (non-perceptual transfer), and to problems with few training examples (few-shot transfer). It is also important that the algorithm's computational and memory demands scale sub-linearly with the number of encountered tasks. Recent work (Valkov et al., 2018; Veniat et al., 2020; Ostapenko et al., 2021) has shown modular algorithms to be a promising approach to CL. These methods represent a neural network as a composition of modules, in which each module is a reusable parameterised function trained to perform an atomic transformation of its input. During learning, the algorithms accumulate a library of diverse modules by solving the encountered problems in sequence. Given a new problem, they seek to find the best composition of pre-trained and new modules, out of the set of all possible compositions, as measured by performance on a held-out dataset. Unlike CL approaches which share the same parameters across all problems, modular algorithms can introduce new modules and, thus, have no upper bound on the number of solved problems. However, scalability remains a key challenge in modular approaches to CL, as the set of module compositions is discrete and explodes combinatorially.
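To make the combinatorial nature of this search concrete, the following is a minimal, purely illustrative sketch (not the framework proposed in this paper): modules are stood in for by simple functions with hypothetical names, a "network" is a left-to-right chain of modules, and the naive search exhaustively scores every fixed-depth composition on a held-out set.

```python
from itertools import product

# Toy "library" of pre-trained modules: each module is a function that
# performs one atomic transformation. Names here are hypothetical.
library = {
    "double": lambda x: 2 * x,
    "inc":    lambda x: x + 1,
    "square": lambda x: x * x,
}

def compose(path):
    """Chain the named modules left-to-right into one function."""
    def net(x):
        for name in path:
            x = library[name](x)
        return x
    return net

def validation_score(net, data):
    """Negative squared error on a held-out set (higher is better)."""
    return -sum((net(x) - y) ** 2 for x, y in data)

def naive_search(data, depth):
    """Score every composition of the given depth.
    The candidate set has |library|**depth elements, which is the
    combinatorial explosion that makes naive search unscalable."""
    best_path, best_score = None, float("-inf")
    for path in product(library, repeat=depth):
        score = validation_score(compose(path), data)
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score

# Held-out data for a toy task whose target function is (2x + 1)**2.
held_out = [(x, (2 * x + 1) ** 2) for x in range(5)]
path, score = naive_search(held_out, depth=3)
print(path, score)  # → ('double', 'inc', 'square') 0
```

With only 3 modules and depth 3 there are already 27 candidates; a realistic library and network depth make exhaustive evaluation infeasible, which motivates the probabilistic search developed below.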
Prior work has often sidestepped this challenge by introducing various restrictions on the compositions, for example, by only handling perceptual transfer (Veniat et al., 2020) or by ignoring non-perceptual transfer and being limited by the number of modules that can be stored in memory (Ostapenko et al., 2021). The design of CL algorithms that relax these restrictions and can also scale remains an open problem. In this paper, we present a probabilistic framework as a solution to the scalability challenges in modular CL. We observe that searching over module compositions efficiently is difficult because

