A PROBABILISTIC FRAMEWORK FOR MODULAR CONTINUAL LEARNING

Abstract

Continual learning (CL) algorithms seek to accumulate and transfer knowledge across a sequence of tasks and achieve better performance on each successive task. Modular approaches, which use a different composition of modules for each task and avoid forgetting by design, have been shown to be a promising direction for CL. However, searching through the large space of possible module compositions remains a challenge. In this work, we develop a scalable probabilistic search framework as a solution to this challenge. Our framework has two distinct components. The first is designed to transfer knowledge across similar input domains. To this end, it models each module's training input distribution and uses a Bayesian model to find the most promising module compositions for a new task. The second component targets transfer across tasks with disparate input distributions or different input spaces, and uses Bayesian optimisation to explore the space of module compositions. We show that these two methods can be easily combined and evaluate the resulting approach on two benchmark suites designed to capture different desiderata of CL techniques. The experiments show that our framework offers superior performance compared to state-of-the-art CL baselines.

1. INTRODUCTION

The continual learning (CL) (Thrun & Mitchell, 1995) setting calls for algorithms that can solve a sequence of learning problems while performing better on every successive problem. A CL algorithm should avoid catastrophic forgetting, i.e., it must not allow later tasks to overwrite what has been learned from earlier tasks, and should achieve transfer across a large sequence of problems. Ideally, the algorithm should be able to transfer knowledge across similar input distributions (perceptual transfer), across dissimilar input distributions and different input spaces (non-perceptual transfer), and to problems with few training examples (few-shot transfer). It is also important that the algorithm's computational and memory demands scale sub-linearly with the number of encountered tasks. Recent work (Valkov et al., 2018; Veniat et al., 2020; Ostapenko et al., 2021) has shown modular algorithms to be a promising approach to CL. These methods represent a neural network as a composition of modules, in which each module is a reusable parameterised function trained to perform an atomic transformation of its input. During learning, the algorithms accumulate a library of diverse modules by solving the encountered problems in a sequence. Given a new problem, they seek to find the best composition of pre-trained and new modules, out of the set of all possible compositions, as measured by performance on a held-out dataset. Unlike CL approaches which share the same parameters across all problems, modular algorithms can introduce new modules and, thus, impose no upper bound on the number of problems they can solve. However, scalability remains a key challenge in modular approaches to CL, as the set of module compositions is discrete and explodes combinatorially.
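To make the combinatorial explosion concrete, the following is a minimal sketch, not taken from the paper: it represents a library as a mapping from layer position to the pre-trained modules available there, and enumerates every candidate composition, where each layer either reuses a pre-trained module or introduces a new one. All module names are hypothetical.

```python
from itertools import product

# Hypothetical library: for each layer position, the modules accumulated
# from earlier problems. Names are purely illustrative.
library = {
    0: ["conv_a", "conv_b"],
    1: ["mlp_a", "mlp_b", "mlp_c"],
    2: ["head_a"],
}

def candidate_compositions(library):
    """Enumerate every path through the library: at each layer, pick a
    pre-trained module or introduce a fresh one ("NEW"). The number of
    candidates is the product of per-layer choices, which is why the
    search space explodes combinatorially as the library grows."""
    choices = [mods + ["NEW"] for _, mods in sorted(library.items())]
    return list(product(*choices))

paths = candidate_compositions(library)
print(len(paths))  # 3 * 4 * 2 = 24 candidates for just three layers
```

Naively evaluating each candidate would mean training its new modules, which is exactly the cost the probabilistic search framework is designed to avoid.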
Prior work has often sidestepped this challenge by introducing various restrictions on the compositions, for example, by only handling perceptual transfer (Veniat et al., 2020) or by ignoring non-perceptual transfer and being limited by the number of modules that can be stored in memory (Ostapenko et al., 2021). The design of CL algorithms that relax these restrictions and can also scale remains an open problem. In this paper, we present a probabilistic framework as a solution to the scalability challenges in modular CL. We observe that searching over module compositions efficiently is difficult because

[Figure caption fragment: compositions (Eq. 6) achieve non-perceptual transfer by reusing a module in the second layer, allowing application to new input domains.]

evaluating most of them involves training their new modules. This difficulty would be overcome if we could approximate a module composition's final performance without training its new modules.

Accordingly, our method divides the search space into subsets of module compositions which achieve different types of forward transfer and can be searched through separately. It then explores each subset using a subset-specific probabilistic model over the choice of pre-trained modules, designed to take advantage of the subset's properties. Querying each probabilistic model is efficient, as it does not involve training new parameters, which in turn enables a scalable search method.

Operationally, we first develop a probabilistic model over a set of module compositions which can achieve perceptual and few-shot transfer. The model exploits the fact that the input distribution on which a module is trained can indicate how successfully said module would process a set of inputs. Second, we identify a subset of module combinations capable of non-perceptual transfer and, using a new kernel, define a probabilistic model over this subset. We show that each of the two probabilistic models can be used to conduct separate searches through module combinations, which can then be combined into a scalable modular CL algorithm capable of perceptual, few-shot and non-perceptual transfer. Using two benchmark suites that evaluate different aspects of CL, we show that our approach achieves different types of knowledge transfer in large search spaces, is applicable to different input domains and modular neural architectures, and outperforms competitive baselines.
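The idea that a module's training input distribution can indicate how well it would process new inputs can be illustrated with a hedged sketch. This is not the paper's actual model: it fits a simple diagonal Gaussian to each module's training inputs and ranks modules for a new task by the log-likelihood of the task's inputs, with no new parameters trained during scoring. The module names, the Gaussian choice, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

class InputModel:
    """Diagonal-Gaussian model of the inputs a module was trained on."""
    def __init__(self, train_inputs):
        self.mu = train_inputs.mean(axis=0)
        self.var = train_inputs.var(axis=0) + 1e-6  # avoid zero variance

    def score(self, x):
        # Mean per-example log-density of x under the diagonal Gaussian.
        ll = -0.5 * (np.log(2 * np.pi * self.var)
                     + (x - self.mu) ** 2 / self.var)
        return ll.sum(axis=1).mean()

def rank_modules(models, new_inputs):
    """Sort module names by how well each module's training-input model
    explains the new task's inputs; no module parameters are updated."""
    scores = {name: m.score(new_inputs) for name, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)

rng = np.random.default_rng(0)
models = {
    "digits_conv": InputModel(rng.normal(0.0, 1.0, (500, 8))),
    "faces_conv": InputModel(rng.normal(5.0, 1.0, (500, 8))),
}
new_task_inputs = rng.normal(0.1, 1.0, (200, 8))  # resembles "digits"
print(rank_modules(models, new_task_inputs)[0])  # digits_conv ranks first
```

Because scoring only evaluates densities of already-trained input models, a search guided by such scores can triage many compositions cheaply before committing to training any new modules.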

2. BACKGROUND

A continual learning algorithm is tasked with solving a sequence of problems S = (Ψ 1 , ..., Ψ T ), usually provided one at a time. We consider the supervised setting, in which each problem is characterised by a tuple Ψ = (D, T ), where D is the input domain, comprising an input space and an input distribution, and T is a task, defined by a label space and a labelling function (Pan & Yang, 2009). A CL algorithm aims to transfer knowledge between the problems in a sequence in order to improve each problem's generalisation performance. The knowledge transfer to a single problem can be defined as the difference between its performance in the sequence and its performance when the rest of the problems are not available.

CL algorithms have several desiderata. First, an algorithm should be plastic, i.e., learn to solve new problems. Second, it should be stable and avoid catastrophic forgetting. Third, it should be capable of forward transfer: the ability to transfer knowledge to a newly encountered problem. In particular, we distinguish between three types of knowledge transfer: between problems with similar input distributions (perceptual), between problems with different input distributions or different input spaces (non-perceptual), and to problems with few training examples (few-shot). Fourth, a CL algorithm should also be capable of backward transfer, meaning its performance on previously encountered problems should increase after solving new ones. Finally, the resource demands of a CL algorithm should scale sub-linearly with the number of solved problems.
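The forward- and backward-transfer desiderata can be made concrete with a standard accuracy-matrix formulation, sketched below under stated assumptions: `acc[i, j]` is accuracy on problem j after training on problems 1..i, `baseline[j]` is accuracy when problem j is solved in isolation, and all numbers are invented for illustration rather than taken from the paper's experiments.

```python
import numpy as np

def forward_transfer(acc, baseline):
    """Mean gain on each problem, right after learning it, over solving
    that problem in isolation (positive values indicate transfer)."""
    return np.mean([acc[j, j] - baseline[j] for j in range(len(baseline))])

def backward_transfer(acc):
    """Mean change on each earlier problem once the whole sequence has
    been learned (negative values indicate forgetting)."""
    T = acc.shape[0]
    return np.mean([acc[T - 1, j] - acc[j, j] for j in range(T - 1)])

# Illustrative accuracy matrix for a 3-problem sequence; row i holds
# accuracies after training on problems 1..i (upper triangle unused).
acc = np.array([[0.80, 0.00, 0.00],
                [0.78, 0.85, 0.00],
                [0.79, 0.86, 0.90]])
baseline = np.array([0.80, 0.80, 0.82])

print(round(float(forward_transfer(acc, baseline)), 3))  # 0.043
print(round(float(backward_transfer(acc)), 3))           # 0.0
```

A modular approach that never overwrites pre-trained modules is stable by construction, so its backward-transfer score cannot go negative; the open question the paper targets is how to obtain positive forward transfer at an acceptable search cost.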

