CONTINUAL LEARNING WITH SOFT-MASKING OF PARAMETER-LEVEL GRADIENT FLOW

Anonymous authors

Abstract

Existing research on task-incremental learning in continual learning has primarily focused on preventing catastrophic forgetting (CF). Several techniques have achieved learning with no CF. However, they attain it by letting each task monopolize a sub-network in a shared network, which seriously limits knowledge transfer (KT) and causes over-consumption of network capacity, i.e., as more tasks are learned, the performance deteriorates. The goal of this paper is threefold: (1) overcoming CF, (2) encouraging KT, and (3) tackling the capacity problem. A novel and simple technique (called SPG) is proposed that soft-masks (partially blocks) parameter updating in training based on the importance of each parameter to old tasks. Each task still uses the full network, i.e., no part of the network is monopolized by any task, which enables maximum KT and reduces capacity usage. Extensive experiments demonstrate the effectiveness of SPG in achieving all three objectives. More notably, it attains significant transfer of knowledge not only among similar tasks (with shared knowledge) but also among dissimilar tasks (with little shared knowledge) while preventing CF.¹
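The core soft-masking idea described above can be sketched in a few lines. This is a minimal, illustrative sketch only (the function name, shapes, and learning rate are assumptions, not the paper's implementation): each parameter's gradient is scaled by one minus its accumulated importance to old tasks, so important parameters are updated less while nothing is permanently frozen.

```python
import numpy as np

def soft_masked_update(params, grads, importance, lr=0.1):
    """Hypothetical SPG-style update step.

    `importance` holds per-parameter scores in [0, 1] accumulated over
    previously learned tasks. Scaling each gradient by (1 - importance)
    partially blocks updates to parameters crucial for old tasks, while
    unimportant parameters still receive (nearly) full updates.
    """
    return params - lr * (1.0 - importance) * grads

params = np.ones((2, 3))
grads = np.full((2, 3), 0.5)

# Importance 0 everywhere: a full, unmasked gradient step.
fully_free = soft_masked_update(params, grads, np.zeros((2, 3)))
# Importance 1 everywhere: the update is completely blocked.
fully_masked = soft_masked_update(params, grads, np.ones((2, 3)))
```

Note that the mask is continuous, not binary: intermediate importance values yield partial updates, which is what distinguishes soft-masking from the hard (all-or-nothing) masking used by parameter-isolation methods.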

1. INTRODUCTION

Catastrophic forgetting (CF) and knowledge transfer (KT) are two key challenges of continual learning (CL), which learns a sequence of tasks incrementally. CF refers to the phenomenon where a model loses some of its performance on previous tasks once it learns a new task. KT means that tasks may help each other learn by sharing knowledge. This work further investigates these problems in the popular CL paradigm of task-incremental learning (TIL). In TIL, each task consists of several classes of objects to be learned. Once a task is learned, its data is discarded and will not be available for later use. During testing, the task id is provided for each test sample so that the corresponding classification head of the task can be used for prediction.

Several effective approaches have been proposed for TIL that can achieve learning with little or no CF. Parameter isolation is perhaps the most successful one, in which the system learns to mask a sub-network for each task in a shared network. HAT (Serra et al., 2018) and SupSup (Wortsman et al., 2020) are two representative systems. HAT learns which neurons (not parameters) are important for each task; in learning a new task, the neurons important to previous tasks are hard-masked (blocked) to prevent their updating in the backward pass. Only the free (unmasked) neurons and their parameters are trainable. Thus, as more tasks are learned, fewer free neurons remain, making later tasks harder to learn. Further, if a neuron is masked, all the parameters feeding into it are also masked, which consumes a great deal of network capacity. While HAT has less and less capacity to learn new tasks as more and more tasks are learned, which results in gradual performance deterioration (see Section 4.3), the proposed method only soft-masks parameters and thus consumes much less network capacity. SupSup uses a different approach to learn and to fix a sub-network for each task.
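The capacity problem of neuron-level hard masking can be made concrete with a toy example. The sketch below is illustrative (not HAT's actual implementation, which learns the masks via task embeddings): protecting one output neuron of a layer zeroes the gradient of the entire row of incoming weights that feeds it, so a few protected neurons can freeze a large fraction of the layer's parameters.

```python
import numpy as np

def hard_masked_grad(grad_W, protected_neurons):
    """Zero out gradients of all weights feeding into protected neurons.

    `grad_W` has shape (out_neurons, in_features); row i holds the
    gradients of every parameter feeding output neuron i, so hard-masking
    neuron i blocks that whole row from training.
    """
    g = grad_W.copy()
    g[protected_neurons, :] = 0.0
    return g

grad_W = np.ones((4, 5))                 # 4 output neurons, 5 inputs each
g = hard_masked_grad(grad_W, [0, 2])     # protect 2 of the 4 neurons

# Half the layer's parameters become untrainable for all future tasks.
frozen_fraction = 1.0 - g.sum() / grad_W.size
```

Under soft-masking, by contrast, these rows would merely be down-weighted rather than zeroed, so later tasks can still adjust them slightly.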
Since it does not learn parameters but only learns a separate mask per task, it can largely avoid the reduction in capacity. However, this limits knowledge transfer. As the sub-networks for old tasks cannot be updated, these approaches can have two major shortcomings: (1) limited knowledge transfer, and/or (2) over-consumption of network capacity. CAT (Ke et al., 2020) tries to improve the knowledge transfer of HAT by detecting task similarities. If the new task is found similar to some previous tasks, those tasks' masks are removed so that the new task's training can update the corresponding parameters, enabling knowledge transfer.

¹ The code is contained in the Supplementary Materials.
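The SupSup approach mentioned above can also be sketched in miniature. This toy version is an assumption-laden illustration (SupSup actually optimizes real-valued mask scores; here mask "learning" is replaced by a random draw): the backbone weights are frozen forever, and each task stores only a binary mask, so old tasks cannot be forgotten, but the fixed backbone also cannot absorb new knowledge for transfer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen backbone: these weights are never trained after initialization.
W = rng.normal(size=(4, 4))
W0 = W.copy()
task_masks = {}

def learn_mask_for_task(task_id, shape):
    """Stand-in for mask training: draw a random binary mask per task."""
    task_masks[task_id] = rng.integers(0, 2, size=shape)

def forward(task_id, x):
    """Effective weights for a task are the frozen W gated by its mask."""
    return (W * task_masks[task_id]) @ x

learn_mask_for_task(0, W.shape)
learn_mask_for_task(1, W.shape)
# After "learning" two tasks, W is untouched: no forgetting by construction,
# but also no parameter-level knowledge transfer between the tasks.
```

The design trade-off is visible in the data structures alone: per-task masks grow linearly with the number of tasks, while the shared weights never change.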

