CONTINUAL LEARNING BASED ON SUB-NETWORKS AND TASK SIMILARITY

Anonymous

Abstract

Continual learning (CL) has two main objectives: preventing catastrophic forgetting (CF) and encouraging knowledge transfer (KT) across tasks. The existing literature mainly tries to overcome CF. Although some papers have focused on both CF and KT, they may still suffer from CF because of their ineffective handling of previous tasks and/or poor task similarity detection mechanisms for KT. This work presents a new CL method that addresses these issues. First, it overcomes CF by isolating the knowledge of each task via a learned mask that indicates a sub-network. Second, it proposes a novel technique to compute how important each mask is to the new task, which indicates how similar the new task is to the underlying old task. Similar tasks can share the same mask/sub-network for KT, while dissimilar tasks use different masks/sub-networks for CF prevention. Comprehensive experiments on a range of NLP problems, including classification, generation, and extraction, show that the proposed method consistently outperforms prior state-of-the-art baselines.

1. INTRODUCTION

This paper studies continual learning (CL) of a sequence of natural language processing (NLP) tasks in the task continual learning (Task-CL) setting. It deals with both catastrophic forgetting (CF) (McCloskey & Cohen, 1989) and knowledge transfer (KT) across tasks. In Task-CL, the task ID is provided for each test case at test time. In training, once a task has been learned, its data is no longer accessible. Another CL setting is class continual learning (Class-CL), which provides no task ID at test time and solves a different type of problem.

Existing research in CL has focused almost exclusively on overcoming CF (Kirkpatrick et al., 2016; Serrà et al., 2018; Wortsman et al., 2020). Limited work has been done on KT, with the exceptions of (Ke et al., 2020; 2021; Wang et al., 2022). However, KT is particularly important for NLP because many NLP tasks share similar knowledge that can be leveraged to achieve better accuracy. We humans are also particularly good at leveraging prior knowledge when learning new skills.

To achieve KT in learning a new task, CAT (Ke et al., 2020) first detects previous tasks that are similar to the current task so that the current task can leverage the knowledge learned from those similar past tasks. CAT uses the hard-attention mechanism of HAT (Serrà et al., 2018) to deal with CF, which masks out the neurons important to each task so that the training of new tasks cannot change them in back-propagation; however, different tasks can still share neurons. This approach has a major problem for KT that is very hard, if not impossible, to solve. After the similar previous tasks are detected, CAT opens the masks of these tasks so that the new task can modify their parameters to achieve both forward and backward KT. This clearly helps KT, but it can cause CF for dissimilar tasks that share parameters with those similar tasks. CAT's task similarity comparison method, which is based on transfer learning performance, can also be quite inaccurate.
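The per-task neuron protection used by HAT can be illustrated with a minimal sketch. This is a simplification we wrote for intuition only, not HAT's actual implementation (which learns soft attention embeddings that are annealed toward near-binary gates): gradients flowing into neurons marked important for past tasks are simply zeroed during the new task's update.

```python
import numpy as np

# Toy HAT-style gradient gating: neurons marked important for past tasks are
# protected by zeroing their weight gradients during the new task's update.
W = np.ones((3, 2))                      # weights into 2 output neurons
protected = np.array([False, True])      # neuron 1 was important for an old task

grad = np.full_like(W, 0.5)              # gradient from the new task's loss
grad[:, protected] = 0.0                 # hard attention blocks the update

W_new = W - 0.1 * grad                   # SGD step: only the free neuron moves
```

After the step, the protected neuron's incoming weights are unchanged, which is exactly how CF is prevented; the drawback discussed above is that opening such a mask for KT re-exposes any parameters shared with other tasks.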
The KT methods in (Ke et al., 2021; Wang et al., 2022) are based on instance-level feature similarity comparison using the dot product or cosine similarity, which can be inaccurate as well (see the experimental results in Sec. 4.2). To deal with these issues, we would like to have (1) a learning method that isolates the knowledge of each task, without parameter overlap among tasks, to prevent harmful interference in KT, and (2) a task similarity detection method that is directly related to the loss of previous tasks (even though their data is no longer available) for more accurate similar-task detection.

For (1), we draw inspiration from the sub-network masking idea in (Wortsman et al., 2020), where the underlying backbone network is fixed and a binary mask is learned to find a sub-network for each task, which encodes the model for that task. The mask is essentially a set of binary gates indicating which parameters of the backbone network should be used by the task model. Since the backbone network is fixed and shared by all tasks, different task models do not interfere with each other and thus cause no CF, even though the sub-networks of multiple tasks can share neurons and parameters. However, the original method in (Wortsman et al., 2020), by design, cannot do KT. It is also very challenging to detect task similarity and to decide what level of similarity is sufficient to ensure positive transfer; if a wrong similarity threshold is used, CF will be serious. To this end, we propose a novel similarity detection method in (2).

For (2), we propose to determine whether a previous task k is similar to the current task t by assessing the importance of the mask (which represents the model or sub-network of a task) of the previous task k to the current task t. To compute the importance score, we adapt an effective idea from the network pruning community (Michel et al., 2019), where the gradient on each parameter serves as the importance of that parameter, and the less important parameters (determined by a threshold) are removed to reduce the network size. However, it is not obvious how to compute the importance of each mask (with its sub-network) to the current task, nor how to determine, based on this importance, whether the current task is similar enough to a previous task for them to perform KT.

This paper proposes a novel method to perform these two functions. A set of virtual/dummy gate variables is introduced to represent each mask/sub-network so that we can compute a gradient for it. This gradient, which is computed on the current task's data and is directly related to the current task's loss, serves as the importance of the mask/sub-network to the current task. The more important a mask/sub-network is, the more likely it is that the previous task that used the mask is similar to the current task. To mitigate possible forgetting, a novel importance comparison mechanism is also proposed that takes the previous task's gradient into account.

Based on the proposed ideas, a new method, called TST (Task-CL based on Sub-networks and Task similarity), is proposed. TST is evaluated on datasets for classification, generation, and extraction, containing both similar and dissimilar tasks. The results demonstrate the high effectiveness of TST.

In summary, this paper makes two key contributions.

1. It proposes a new Task-CL method, TST, based on sub-networks and task similarity. TST not only overcomes CF but also enables effective KT. For KT, it learns the current task in the sub-network of a previous task without interfering with any other task, and thus causes no CF for the other tasks. This cannot be achieved by other existing methods.

2. It proposes a novel task similarity detection method based on gradients computed on masks. This method is simple yet highly effective, and it is instrumental for effective KT.
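To make the two components above concrete, the following is a minimal numpy sketch written under our own assumptions (a toy linear model, a random mask, and a finite-difference gradient in place of autograd); it is not the paper's implementation. It shows (a) a fixed backbone with a per-task binary mask, and (b) the importance of a previous task's mask to the current task, computed as the magnitude of the loss gradient with respect to a dummy gate placed on that mask.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed, shared backbone weights (never updated after initialization).
W = rng.normal(size=(4, 3))
W0 = W.copy()  # snapshot to verify the backbone stays frozen

# One learned binary mask per task selects a sub-network of W.
# Here mask_k is random only for illustration; in practice it is learned.
mask_k = (rng.random(W.shape) > 0.5).astype(float)   # previous task k's mask

def task_loss(x, y, mask, gate=1.0):
    """Squared loss of a toy linear model whose weights are the gated sub-network."""
    preds = x @ (gate * (W * mask))
    return float(np.mean((preds - y) ** 2))

# Toy data for the current task t.
x_t = rng.normal(size=(8, 4))
y_t = rng.normal(size=(8, 3))

# Importance of mask_k to task t: |dL/d gate| evaluated at gate = 1,
# estimated here with a central finite difference instead of autograd.
eps = 1e-4
grad = (task_loss(x_t, y_t, mask_k, 1 + eps)
        - task_loss(x_t, y_t, mask_k, 1 - eps)) / (2 * eps)
importance = abs(grad)

# The backbone W is untouched, so every other task's sub-network is unchanged.
```

In an actual implementation the gate would be a tensor in the autograd graph and the gradient would come from one backward pass over current-task data; a previous task whose mask receives a large importance score is a candidate for sharing its sub-network with the current task.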



The code has been uploaded as supplementary materials.



2. RELATED WORK

Continual learning. Existing CL work has mainly focused on overcoming CF. (1) Regularization-based approaches (Kirkpatrick et al., 2016; Lee et al.; Seff et al., 2017; Zenke et al., 2017; Rusu et al., 2016) add a regularization term to the loss to penalize changes to parameters that are important to previous tasks. (2) Gradient projection (Zeng et al., 2019) ensures that gradient updates occur in the direction orthogonal to the input of old tasks. (3) Parameter isolation (Serrà et al., 2018; Ke et al.) trains a dedicated set of parameters (e.g., a mask or sub-network) for each task. Replay-based approaches store or generate samples of previous tasks for use in new task learning; these are clearly very different from our method as we do not use any replay data.

Continual learning in NLP. The above existing approaches usually do not use a pre-trained model. However, in NLP, almost all recent CL/non-CL techniques use pre-trained language models (LMs). We categorize them into three types based on which part of the pre-trained LM is trainable. The first type below belongs to the replay or regularization families, and the last two belong to parameter isolation. (1) Transformer-updating based: this family updates the Transformer directly. IDBR (Huang et al., 2021) disentangles task-shared and task-specific knowledge in BERT via regularizations. LAMOL (Sun
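As an illustration of the regularization family described above, here is a toy EWC-style penalty. This is our own sketch with made-up numbers, not any paper's implementation; `fisher` stands in for whatever per-parameter importance estimate is used (e.g., the diagonal Fisher information).

```python
import numpy as np

# Toy EWC-style penalty: deviations from the previous task's parameters are
# penalized quadratically, weighted by each parameter's importance.
theta_old = np.array([1.0, -2.0])   # parameters after learning the previous task
fisher = np.array([5.0, 0.1])       # importance: param 0 matters much more
lam = 1.0                           # regularization strength

def penalty(theta):
    """Added to the new task's loss; large for moves along important directions."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_old) ** 2)
```

Moving an important parameter by one unit costs far more than moving an unimportant one, which is how this family trades off plasticity against forgetting without isolating parameters per task.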

