EXCLUSIVE SUPERMASK SUBNETWORK TRAINING FOR CONTINUAL LEARNING

Abstract

Continual Learning (CL) methods mainly focus on avoiding catastrophic forgetting and on learning representations that transfer to new tasks. Recently, Wortsman et al. (2020) proposed a CL method, SupSup, which uses a randomly initialized, fixed base network (model) and finds a supermask for each new task that selectively keeps or removes each weight to produce a subnetwork. Because the network weights are never updated, SupSup prevents forgetting. Although there is no forgetting, the performance of supermasks is suboptimal because the fixed weights restrict their representational power; furthermore, no knowledge is accumulated or transferred inside the model as new tasks are learned. Hence, we propose EXSSNET (Exclusive Supermask SubNEtwork Training), which performs exclusive and non-overlapping subnetwork weight training. This avoids conflicting updates to shared weights by subsequent tasks, improving performance while still preventing forgetting. Furthermore, we propose a novel KNN-based Knowledge Transfer (KKT) module that dynamically initializes a new task's mask based on previous tasks to improve knowledge transfer. We demonstrate that EXSSNET outperforms SupSup and other strong previous methods on both text classification and vision tasks while preventing forgetting. Moreover, EXSSNET is particularly advantageous for sparse masks that activate 2-10% of the model parameters, yielding an average improvement of 8.3% over SupSup. Additionally, EXSSNET scales to a large number of tasks (100), and our KKT module helps learn new tasks faster while improving overall performance.

1. INTRODUCTION

In artificial intelligence, the overarching goal is to develop autonomous agents that can learn to accomplish a set of tasks. Continual Learning (CL) (Ring, 1998; Thrun, 1998; Kirkpatrick et al., 2017) is a key ingredient for developing agents that can accumulate expertise on new tasks. However, when a model is sequentially trained on tasks t1 and t2 with different data distributions, its ability to extract meaningful features for the previous task t1 degrades. This loss in performance on previously learned tasks is referred to as catastrophic forgetting (CF) (McCloskey & Cohen, 1989; Zhao & Schmidhuber, 1996; Thrun, 1998; Goodfellow et al., 2013). Forgetting is a consequence of two phenomena happening in conjunction: (1) not having access to the data samples from the previous tasks, and (2) multiple tasks sequentially updating shared model parameters, resulting in conflicting updates, which is called parameter interference (McCloskey & Cohen, 1989).

Recently, some CL methods avoid parameter interference by taking inspiration from the Lottery Ticket Hypothesis (Frankle & Carbin, 2018) and Supermasks (Zhou et al., 2019) to exploit the expressive power of sparse subnetworks. Zhou et al. (2019) observed that the number of sparse subnetwork combinations is large enough (combinatorial) that even within randomly weighted neural networks, there exist supermasks that achieve good performance. A supermask is a sparse binary mask that selectively keeps or removes each connection in a fixed, randomly initialized network to produce a subnetwork that performs well on a given task. We call this subnetwork a supermask subnetwork; it is shown in Figure 1, highlighted in red. Building on this idea, Wortsman et al. (2020) proposed a CL method, SupSup, which initializes a network with fixed and random weights and finds a separate supermask for each new task.

1 Our code is uploaded as supplementary material.
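To make the supermask idea concrete, the following is a minimal NumPy sketch (not the authors' implementation): the base weights are frozen, and a task is "trained" by selecting a binary mask over those weights, here via a simple top-k selection over per-weight scores in the spirit of score-based mask search. The scores are random for illustration; in practice they would be learned per task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed, randomly initialized base weights of one linear layer (never updated).
W = rng.standard_normal((4, 3))
W_frozen = W.copy()  # kept only to verify below that W is untouched

def top_k_mask(scores, k):
    """Binary mask keeping the k highest-scoring weights (illustrative)."""
    threshold = np.sort(scores.ravel())[-k]
    return (scores >= threshold).astype(W.dtype)

# Hypothetical learned per-weight scores for one task (random here).
scores_task1 = rng.standard_normal(W.shape)
mask_task1 = top_k_mask(scores_task1, k=5)  # sparse: 5 of 12 weights active

def subnetwork_forward(x, W, mask):
    """Forward pass through the subnetwork selected by the binary mask."""
    return x @ (W * mask)

x = rng.standard_normal((2, 4))
y = subnetwork_forward(x, W, mask_task1)
```

Since only the mask differs between tasks while W stays frozen, earlier tasks cannot be forgotten; the cost is that the frozen weights limit what any single subnetwork can represent, which is the limitation EXSSNET addresses by additionally training each task's exclusive, non-overlapping subset of weights.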

