CONTINUAL LEARNING BASED ON SUB-NETWORKS AND TASK SIMILARITY

Anonymous

Abstract

Continual learning (CL) has two main objectives: preventing catastrophic forgetting (CF) and encouraging knowledge transfer (KT) across tasks. The existing literature mainly tries to overcome CF. Although some papers have focused on both CF and KT, they may still suffer from CF because of their ineffective handling of previous tasks and/or poor task similarity detection mechanisms for KT. This work presents a new CL method that addresses these issues. First, it overcomes CF by isolating the knowledge of each task via a learned mask that indicates a sub-network. Second, it proposes a novel technique to compute how important each mask is to the new task, which indicates how similar the new task is to the underlying old task. Similar tasks can share the same mask/sub-network for KT, while dissimilar tasks use different masks/sub-networks for CF prevention. Comprehensive experiments have been conducted on a range of NLP problems, including classification, generation, and extraction, to show that the proposed method consistently outperforms prior state-of-the-art baselines.1

1. INTRODUCTION

This paper studies continual learning (CL) of a sequence of natural language processing (NLP) tasks in the task continual learning (Task-CL) setting. It deals with both catastrophic forgetting (CF) (McCloskey & Cohen, 1989) and knowledge transfer (KT) across tasks. In Task-CL, the task ID is provided for each test case at testing time. During learning, once a task has been learned, its data is no longer accessible. Another CL setting is class continual learning (Class-CL), which provides no task ID in testing and solves a different type of problem. Existing research in CL has almost exclusively focused on overcoming CF (Kirkpatrick et al., 2016; Serrà et al., 2018; Wortsman et al., 2020). Limited work has been done on KT, with the exceptions of (Ke et al., 2020; 2021; Wang et al., 2022). But KT is particularly important for NLP because many NLP tasks share similar knowledge that can be leveraged to achieve better accuracy. We humans are also particularly good at leveraging prior knowledge to help learn new skills.

To achieve KT in learning a new task, CAT (Ke et al., 2020) first detects previous tasks that are similar to the current task so that learning the current task can leverage the knowledge acquired from those similar tasks. CAT uses the hard-attention mechanism of HAT (Serrà et al., 2018) to deal with CF: it masks out the neurons important to each task so that the training of new tasks cannot change them in back-propagation. However, different tasks can share neurons, which creates a major problem for KT that is very hard, if not impossible, to solve. After the similar previous tasks are detected, CAT opens their masks so that the new task's training can modify their parameters to achieve both forward and backward KT. This clearly helps KT, but it can cause CF for dissimilar tasks that share parameters with those similar tasks. CAT's task similarity comparison, which is based on transfer learning performance, can also be quite inaccurate.
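To make the hard-attention idea concrete, the following is a minimal illustrative sketch (not the exact HAT or CAT implementation) of how binary per-task masks can block gradient updates to neurons claimed by previous tasks. All variable names and the toy dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))           # shared layer weights (4 output units)

# Binary unit masks learned by two previous tasks (1 = unit important to that task).
prev_masks = [np.array([1, 1, 0, 0]), np.array([0, 0, 1, 0])]

# Union of output units claimed by any earlier task.
claimed = np.maximum.reduce(prev_masks)   # [1, 1, 1, 0]

# A gradient computed while training a new task.
grad = np.ones_like(W)

# Zero out gradient rows for protected output units before the update,
# so back-propagation cannot change the weights important to old tasks.
protected_grad = grad * (1 - claimed)[:, None]
W_new = W - 0.1 * protected_grad

# Only the one unclaimed unit (row 3) is updated; rows 0-2 are frozen.
assert np.allclose(W_new[:3], W[:3])
assert not np.allclose(W_new[3], W[3])
```

The sketch also shows why neuron sharing is problematic: once a unit appears in the union `claimed`, unmasking it to transfer knowledge to a similar task would also expose it to updates that interfere with any dissimilar task using that unit.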
The KT methods in (Ke et al., 2021; Wang et al., 2022) are based on instance-level feature similarity comparison using the dot product or cosine similarity, which can be inaccurate as well (see the experimental results in Sec. 4.2). To deal with these issues, we would like to have (1) a learning method that isolates the knowledge of each task without parameter overlap among tasks, avoiding harmful interference in KT, and (2) a task similarity detection method that is directly related to the loss of previous tasks (even though we do not have their data), for more accurate similar-task detection.
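For contrast, a hedged sketch of the instance-level comparison this paragraph critiques: compare the new task's features against each previous task's stored features by cosine similarity and pick the closest task. The feature vectors and task names below are invented for illustration only.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical mean hidden features of stored instances from previous tasks.
prev_task_feats = {
    "task_a": np.array([0.9, 0.1, 0.0]),
    "task_b": np.array([0.0, 0.2, 0.9]),
}
new_task_feat = np.array([0.8, 0.3, 0.1])

scores = {t: cosine(new_task_feat, f) for t, f in prev_task_feats.items()}
most_similar = max(scores, key=scores.get)  # "task_a" scores highest here
```

The weakness is visible in the design itself: closeness in feature space is a proxy, and two tasks with similar surface features need not have low loss on each other's objectives, which motivates a similarity measure tied directly to the previous tasks' losses.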



1 The code has been uploaded as supplementary materials.

