LONG LIVE THE LOTTERY: THE EXISTENCE OF WINNING TICKETS IN LIFELONG LEARNING

Abstract

The lottery ticket hypothesis states that a highly sparsified sub-network can be trained in isolation, given the appropriate weight initialization. This paper extends that hypothesis from one-shot task learning, and demonstrates for the first time that such extremely compact and independently trainable sub-networks can also be identified in the lifelong learning scenario, which we call lifelong tickets. We show that the resulting lifelong ticket can further be leveraged to improve the performance of learning over continual tasks. However, it is highly non-trivial to conduct network pruning in the lifelong setting. Two critical roadblocks arise: i) as many tasks now arrive sequentially, finding tickets in a greedy weight pruning fashion will inevitably suffer from an intrinsic bias, in that earlier-emerging tasks have a larger impact; ii) as lifelong learning is consistently challenged by catastrophic forgetting, the compact network capacity of tickets might amplify the risk of forgetting. In view of these, we introduce two pruning options, i.e., top-down and bottom-up, for finding lifelong tickets. Compared to top-down pruning, which extends vanilla (iterative) pruning over sequential tasks, we show that the bottom-up one, which can dynamically shrink and (re-)expand model capacity, effectively avoids undesirable excessive pruning in the early stage. We additionally introduce lottery teaching, which further overcomes forgetting via knowledge distillation aided by external unlabeled data. Unifying these ingredients, we demonstrate the existence of very competitive lifelong tickets, e.g., achieving 3-8% of the dense model size with even higher accuracy, compared to strong class-incremental learning baselines on the CIFAR-10/CIFAR-100/Tiny-ImageNet datasets.

1. INTRODUCTION

The lottery ticket hypothesis (LTH) (Frankle & Carbin, 2019) suggests the existence of an extremely sparse sub-network, within an overparameterized dense neural network, that can reach performance similar to the dense network when trained in isolation with proper initialization. Such a sub-network, together with the initialization used, is called a winning ticket (Frankle & Carbin, 2019). The original LTH studies the sparse pattern of neural networks on a single task (classification), leaving open the question of generalization across multiple tasks. Following that, a few works (Morcos et al., 2019; Mehta, 2019) have explored LTH in transfer learning. They study the transferability of a winning ticket found in a source task to another target task, providing insights into the one-shot transferability of LTH. In parallel, lifelong learning not only suffers from notorious catastrophic forgetting over sequentially arriving tasks but also often comes at the price of increasing model capacity. With those in mind, we ask a much more ambitious question: Does LTH hold in the setting of lifelong learning, where different tasks arrive sequentially? Intuitively, a desirable "ticket" sub-network in lifelong learning (McCloskey & Cohen, 1989; Parisi et al., 2019) needs to be: 1) independently trainable, same as in the original LTH; 2) trained to perform competitively with the dense lifelong model, both maintaining the performance of previous tasks and quickly achieving good generalization on newly added tasks; 3) found online, as the tasks arrive sequentially without any pre-assumed order. We define such a sub-network with its initialization as a lifelong lottery ticket. This paper seeks to locate the lifelong ticket in class-incremental learning (CIL) (Wang et al., 2017; Rosenfeld & Tsotsos, 2018; Kemker & Kanan, 2017; Li & Hoiem, 2017; Belouadah & Popescu, 2019; 2020), a popular, realistic, and challenging setting of lifelong learning.
A natural idea to extend the original LTH is to introduce sequential pruning: we continually prune the dense network toward the desired sparsity level as new tasks are incrementally added. However, we show that the direct application of the iterative magnitude pruning (IMP) used in LTH fails in the CIL scenario, since the pruning schedule becomes critical when tasks arrive sequentially. To circumvent this challenge, we generalize IMP to incorporate a curriculum pruning schedule. We term this technique top-down lifelong pruning. When the total number of tasks is known in advance and small, then with some "lottery" initialization (achieved by rewinding (Frankle et al., 2019) or similar), we find that the pruned sparse ticket can be re-trained to performance similar to that of the dense network. However, if the number of tasks keeps increasing, the ticket soon suffers a performance collapse, as its limited capacity cannot afford the over-pruning. The limitation of top-down lifelong pruning reminds us of two unique dilemmas that might challenge the validity of lifelong tickets. i) Greedy weight pruning vs. all tasks' performance: while sequential pruning has to be performed online, its greedy nature inevitably biases against later-arriving tasks, as earlier tasks will clearly contribute more to shaping the ticket (and might even use up the sparsity budget). ii) Catastrophic forgetting vs. small ticket size: to overcome the notorious catastrophic forgetting (McCloskey & Cohen, 1989; Tishby & Zaslavsky, 2015), many lifelong learning models have to frequently consolidate weights to carefully re-assign the model capacity (Zhang et al., 2020) or even grow the model size as tasks come in (Wang et al., 2017). This seems to contradict our goal of pruning as more tasks are seen. To address these two limitations, we propose a novel bottom-up lifelong pruning approach, which allows re-growing the model capacity to compensate for any excessive pruning.
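The top-down procedure described above can be sketched roughly as follows. This is a pure-Python toy on a flat weight vector, assuming IMP-style magnitude pruning with rewinding; `train_fn` is a hypothetical trainer, and the fixed per-task pruning ratio is an illustrative simplification of the curriculum schedule, not the paper's exact recipe:

```python
def magnitude_prune(weights, mask, prune_ratio):
    """Deactivate the smallest-magnitude fraction of currently active weights."""
    active = sorted(abs(w) for w, m in zip(weights, mask) if m)
    k = int(len(active) * prune_ratio)
    if k == 0:
        return list(mask)
    threshold = active[k - 1]
    return [m and abs(w) > threshold for w, m in zip(weights, mask)]

def top_down_lifelong_pruning(init_weights, tasks, prune_ratio=0.2, train_fn=None):
    """Top-down sketch: for each arriving task, train the current sub-network,
    prune a fixed fraction of surviving weights by magnitude, then rewind the
    survivors to their initialization (the "lottery" init)."""
    mask = [True] * len(init_weights)
    weights = list(init_weights)
    for task in tasks:
        if train_fn is not None:  # hypothetical per-task training step
            weights = train_fn([w if m else 0.0 for w, m in zip(weights, mask)], task)
        mask = magnitude_prune(weights, mask, prune_ratio)
        weights = list(init_weights)  # rewind surviving weights to init
    ticket = [w if m else 0.0 for w, m in zip(init_weights, mask)]
    return mask, ticket
```

Note that the sparsity budget is consumed monotonically: each task can only shrink the mask, which is exactly the greedy bias against later tasks discussed above.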
It therefore flexibly calibrates between shrinking and expanding tickets throughout the entire learning process, alleviating the intrinsic greedy bias caused by top-down pruning. We additionally introduce lottery teaching to overcome forgetting, which regularizes the current model with previous task models' soft logit outputs using free unlabeled data. It is inspired by lifelong knowledge preservation techniques (Castro et al., 2018; He et al., 2018; Javed & Shafait, 2018; Rebuffi et al., 2017). To validate our proposal, we conduct extensive experiments on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets for class-incremental learning (Rebuffi et al., 2017). The results demonstrate the existence and high competitiveness of lifelong tickets. Our best lifelong tickets (found by bottom-up pruning and lottery teaching) achieve comparable or better performance across all sequential tasks, with as few as 3.64% of the parameters, compared to state-of-the-art dense models. Our contributions can be summarized as:
• The problem of lottery tickets is formulated and studied in lifelong learning (class-incremental learning) for the first time.
• Top-down pruning: a generalization of the iterative weight magnitude pruning used in the original LTH to continual learning tasks.
• Bottom-up pruning: a novel pruning method that uniquely allows re-growing model capacity throughout the lifelong process.
• Extensive experiments and analyses demonstrating the promise of lifelong tickets in achieving superior yet extremely lightweight lifelong learners.
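One bottom-up round (shrink, then re-expand) can be sketched as follows. This is an illustrative pure-Python toy: the `scores` signal (e.g., gradient magnitudes of pruned weights on the new task) and the fixed prune/grow ratios are assumptions for exposition, not the paper's actual criterion:

```python
def bottom_up_step(weights, mask, scores, prune_ratio, grow_ratio):
    """One bottom-up round: shrink the ticket by magnitude pruning, then
    (re-)expand capacity where the new task signals a need for it."""
    mask = list(mask)
    # shrink: drop the smallest-magnitude active weights
    active = [i for i, m in enumerate(mask) if m]
    k = int(len(active) * prune_ratio)
    for i in sorted(active, key=lambda i: abs(weights[i]))[:k]:
        mask[i] = False
    # (re-)expand: reactivate pruned positions with the strongest task signal
    pruned = [i for i, m in enumerate(mask) if not m]
    n_grow = min(int(len(mask) * grow_ratio), len(pruned))
    for i in sorted(pruned, key=lambda i: scores[i], reverse=True)[:n_grow]:
        mask[i] = True
    return mask
```

Unlike the top-down sketch, the mask here is not monotone: a position pruned under an early task can be reactivated if a later task needs the capacity, which is what compensates for excessive early pruning.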

2. RELATED WORK

Lifelong Learning A lifelong learning system aims to continually learn sequential tasks and accommodate new information while maintaining previously learned knowledge (Thrun & Mitchell, 1995). One of its major challenges is catastrophic forgetting (McCloskey & Cohen, 1989; Kirkpatrick et al., 2017; Hayes & Kanan, 2020), i.e., the network cannot maintain expertise on tasks that it has not experienced for a long time.

Code availability: https://github.com/VITA-Group/

