LONG LIVE THE LOTTERY: THE EXISTENCE OF WINNING TICKETS IN LIFELONG LEARNING

Abstract

The lottery ticket hypothesis states that a highly sparsified sub-network can be trained in isolation, given the appropriate weight initialization. This paper extends that hypothesis from one-shot task learning, and demonstrates for the first time that such extremely compact and independently trainable sub-networks can also be identified in the lifelong learning scenario; we call them lifelong tickets. We show that the resulting lifelong tickets can further be leveraged to improve the performance of learning over continual tasks. However, it is highly non-trivial to conduct network pruning in the lifelong setting. Two critical roadblocks arise: i) as many tasks now arrive sequentially, finding tickets in a greedy weight-pruning fashion inevitably suffers from an intrinsic bias, in that earlier-emerging tasks have more impact; ii) as lifelong learning is consistently challenged by catastrophic forgetting, the compact network capacity of tickets might amplify the risk of forgetting. In view of those, we introduce two pruning options, namely top-down and bottom-up, for finding lifelong tickets. Compared to top-down pruning, which extends vanilla (iterative) pruning over sequential tasks, we show that the bottom-up option, which can dynamically shrink and (re-)expand model capacity, effectively avoids undesirable excessive pruning in the early stage. We additionally introduce lottery teaching, which further overcomes forgetting via knowledge distillation aided by external unlabeled data. Unifying those ingredients, we demonstrate the existence of very competitive lifelong tickets, e.g., achieving 3-8% of the dense model size with even higher accuracy, compared to strong class-incremental learning baselines on the CIFAR-10/CIFAR-100/Tiny-ImageNet datasets.

1. INTRODUCTION

The lottery ticket hypothesis (LTH) (Frankle & Carbin, 2019) suggests the existence of an extremely sparse sub-network, within an overparameterized dense neural network, that can reach performance similar to the dense network when trained in isolation with proper initialization. Such a sub-network together with its initialization is called a winning ticket (Frankle & Carbin, 2019). The original LTH studies the sparse pattern of neural networks for a single task (classification), leaving open the question of generalization across multiple tasks. Following that, a few works (Morcos et al., 2019; Mehta, 2019) have explored LTH in transfer learning, studying the transferability of a winning ticket found on a source task to another target task; this provides insights on the one-shot transferability of LTH. In parallel, lifelong learning not only suffers from notorious catastrophic forgetting over sequentially arriving tasks but also often comes at the price of increasing model capacity. With those in mind, we ask a much more ambitious question: Does LTH hold in the setting of lifelong learning, where different tasks arrive sequentially? Intuitively, a desirable "ticket" sub-network in lifelong learning (McCloskey & Cohen, 1989; Parisi et al., 2019) needs to be: 1) independently trainable, as in the original LTH; 2) trained to perform competitively with the dense lifelong model, both maintaining the performance of previous tasks and quickly achieving good generalization on newly added tasks; and 3) found online, as the tasks arrive sequentially without any pre-assumed order. We define such a sub-network with its initialization as a lifelong lottery ticket. This paper seeks to locate lifelong tickets in class-incremental learning (CIL) (Wang et al., 2017; Rosenfeld & Tsotsos, 2018; Kemker & Kanan, 2017; Li & Hoiem, 2017; Belouadah & Popescu, 2019; 2020), a popular, realistic, and challenging setting of lifelong learning.
A natural idea to extend the original LTH is sequential pruning: we continually prune the dense network toward the desired sparsity level as new tasks are incrementally added. However, we show that the direct application of the iterative magnitude pruning (IMP) used in LTH fails in the CIL scenario, since the pruning schedule becomes critical when tasks arrive sequentially. To circumvent this challenge, we generalize IMP to incorporate a curriculum pruning schedule, a technique we term top-down lifelong pruning. When the total number of tasks is known in advance and small, then with some "lottery" initialization (achieved by rewinding (Frankle et al., 2019) or similar), we find that the pruned sparse ticket can be re-trained to similar performance as the dense network. However, if the number of tasks keeps increasing, the ticket soon suffers a performance collapse, as its limited capacity cannot withstand the over-pruning. The limitation of top-down lifelong pruning points to two unique dilemmas that might challenge the validity of lifelong tickets. i) Greedy weight pruning vs. all tasks' performance: while sequential pruning has to be performed online, its greedy nature inevitably biases against later-arriving tasks, as earlier tasks contribute more to shaping the ticket (and might even use up the sparsity budget). ii) Catastrophic forgetting vs. small ticket size: to overcome the notorious catastrophic forgetting (McCloskey & Cohen, 1989; Tishby & Zaslavsky, 2015), many lifelong learning models have to frequently consolidate weights to carefully re-assign model capacity (Zhang et al., 2020) or even grow the model size as tasks come in (Wang et al., 2017); both seem to contradict our goal of pruning upon seeing more tasks. To address these two limitations, we propose a novel bottom-up lifelong pruning approach, which allows re-growing model capacity to compensate for any excessive pruning.
It therefore flexibly calibrates between shrinking and expanding the ticket throughout the entire learning process, alleviating the intrinsic greedy bias of top-down pruning. We additionally introduce lottery teaching to overcome forgetting, which regularizes the soft logit outputs of previous task models using free unlabeled data, inspired by lifelong knowledge preservation techniques (Castro et al., 2018; He et al., 2018; Javed & Shafait, 2018; Rebuffi et al., 2017). To validate our proposal, we conduct extensive experiments on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets for class-incremental learning (Rebuffi et al., 2017). The results demonstrate the existence and high competitiveness of lifelong tickets. Our best lifelong tickets (found by bottom-up pruning and lottery teaching) achieve comparable or better performance across all sequential tasks, with as few as 3.64% of the parameters, compared to state-of-the-art dense models. Our contributions can be summarized as:

• The problem of lottery tickets is formulated and studied in lifelong learning (class-incremental learning) for the first time.
• Top-down pruning: a generalization of the iterative weight magnitude pruning used in the original LTH to continual learning tasks.
• Bottom-up pruning: a novel pruning method that uniquely allows re-growing model capacity throughout the lifelong process.
• Extensive experiments and analyses demonstrating the promise of lifelong tickets in achieving superior yet extremely lightweight lifelong learners.

2. RELATED WORK

Lifelong Learning A lifelong learning system aims to continually learn sequential tasks and accommodate new information while maintaining previously learned knowledge (Thrun & Mitchell, 1995). One of its major challenges is catastrophic forgetting (McCloskey & Cohen, 1989; Kirkpatrick et al., 2017; Hayes & Kanan, 2020), i.e., the network cannot maintain expertise on tasks that it has not experienced for a long time. This paper's subject of study is class-incremental learning (CIL) (Rebuffi et al., 2017; Elhoseiny et al., 2018): a popular, realistic, albeit challenging setting of lifelong learning. CIL requires the model to recognize new classes emerging over time while maintaining its recognition ability over old classes without access to the previous data. Typical solutions are based on regularization (Li & Hoiem, 2017; Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi et al., 2018a; Ebrahimi et al., 2019); for example, knowledge distillation (Hinton et al., 2015) is a common regularizer that inherits previous knowledge by preserving the soft logits of stored samples (Li & Hoiem, 2017) while learning new tasks. Besides, several approaches learn with memorized data (Castro et al., 2018; Javed & Shafait, 2018; Rebuffi et al., 2017; Belouadah & Popescu, 2019; 2020; Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2018), and some generative lifelong learning methods (Liu et al., 2020; Shin et al., 2017) mitigate catastrophic forgetting by generating simulated data of previous tasks. There also exist a few architecture-manipulation-based lifelong learning methods (Rajasegaran et al., 2019; Aljundi et al., 2018b; Hung et al., 2019; Abati et al., 2020; Rusu et al., 2016; Kemker & Kanan, 2017), but their target is dividing a dense model into task-specific parts for lifelong learning, rather than locating sparse sub-networks and lottery tickets.
Pruning and Lottery Ticket Hypothesis It is well-known that deep networks can be pruned of excess capacity (LeCun et al., 1990b). Pruning algorithms can be categorized into unstructured (Han et al., 2015b; LeCun et al., 1990a; Han et al., 2015a) and structured pruning (Liu et al., 2017; He et al., 2017; Zhou et al., 2016). The former sparsifies weight elements based on magnitudes, while the latter removes network sub-structures such as channels for more hardware friendliness. LTH (Frankle & Carbin, 2019) advocates the existence of an independently trainable sparse sub-network within a dense network. In addition to image classification (Frankle & Carbin, 2019; Liu et al., 2019; Wang et al., 2020; Evci et al., 2019; Frankle et al., 2020; Savarese et al., 2020; You et al., 2020; Ma et al., 2021; Chen et al., 2020a), LTH has been explored widely in numerous contexts, such as natural language processing (Gale et al., 2019; Chen et al., 2020b), reinforcement learning (Yu et al., 2019), generative adversarial networks (Chen et al., 2021b), graph neural networks (Chen et al., 2021a), and adversarial robustness (Cosentino et al., 2019). Most of these adopt unstructured weight magnitude pruning (Han et al., 2015a; Frankle & Carbin, 2019) to obtain the ticket, which we also follow in this work. (Frankle et al., 2019) analyzes large models and datasets, and presents a rewinding technique that re-initializes ticket training from an early training stage rather than from scratch. (Renda et al., 2020) further compares different retraining techniques and endorses the effectiveness of rewinding. (Mehta, 2019; Morcos et al., 2019; Desai et al., 2019) pioneered the study of the transferability of a ticket identified on one source task to another target task, delivering insights on the one-shot transferability of LTH. One recent work (Golkar et al., 2019) targeted lifelong learning in fixed-capacity models by pruning neurons of low activity.
The authors observed that a controlled way of "graceful forgetting" after training each task can regain network capacity for new tasks while not suffering from forgetting. Sokar et al. (2020) further compress the sparse connections of each task during training, which reduces interference between tasks and alleviates forgetting.

3. PRELIMINARIES

3.1. CLASS-INCREMENTAL LEARNING

In CIL, a model continuously learns from a sequential data stream in which new tasks (namely, classification tasks with new classes) are added over time, as shown in Figure 1. At the inference stage, the model operates without access to task IDs. Following (Castro et al., 2018; He et al., 2018; Rebuffi et al., 2017), a handful of samples from previous classes are stored in a fixed memory buffer. More formally, let T_1, T_2, ... represent a sequence of tasks, where the i-th task T_i contains data falling into (k_i - k_{i-1}) classes C_i = {c_{k_{i-1}+1}, c_{k_{i-1}+2}, ..., c_{k_i}}, with k_0 = 0 by convention. Let Θ^(i) = {θ^(i), θ_c^(i)} denote the model of the learner at task i, where θ^(i) corresponds to the base model shared across all tasks from T_1 to T_i, and θ_c^(i) denotes the task-specific classification head at T_i. Thus, the size of θ^(i) is fixed, but the dimension of θ_c^(i) grows with the number of classes seen up to T_i. In general, the learner has access to two types of information at task i: the current training data D^(i), and certain previous information P^(i). The latter includes a small amount of data from previous tasks {T_j}_{j=1}^{i-1} stored in the memory buffer, and the previous model Θ^(i-1) at task T_{i-1}; it is commonly used to overcome catastrophic forgetting of the previous tasks while learning the current task i. Based on this setting, we state the CIL problem below.

Problem of CIL.
At the current task i, we aim to learn a full model Θ^(i) = {θ^(i), θ_c^(i)} based on the information (D^(i), P^(i)) such that Θ^(i) not only (I) generalizes to the newly added data at task T_i but also (II) does not lose its power on the previous tasks {T_j}_{j=1}^{i-1}. We note that this problem statement applies to CIL with any fixed-length learning period. That is, for n time stamps (one task per stamp), the validity of the entire trajectory {Θ^(i)}_{i=1}^{n} is justified by each Θ^(i) against the CIL criteria (I) and (II) stated in 'Problem of CIL'.
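To make the bookkeeping concrete, the setup above can be sketched in a few lines of Python. This is an illustrative fragment, not the paper's code: the class and attribute names are ours, and the FIFO exemplar policy is a naive stand-in for whatever buffer-management strategy a real CIL system uses.

```python
class CILLearner:
    """Minimal bookkeeping for the CIL setup described above.

    `base_dim` stands in for the fixed-size shared parameters theta^(i);
    the classification head theta_c^(i) grows with the classes seen so far;
    `buffer` holds the small amount of previous data in P^(i).
    """

    def __init__(self, base_dim, buffer_size):
        self.base_dim = base_dim        # |theta| is fixed across tasks
        self.head_classes = 0           # |theta_c| grows with seen classes
        self.buffer_size = buffer_size  # fixed memory budget for old samples
        self.buffer = []

    def add_task(self, num_new_classes, exemplars):
        """Register a new task: expand the head, refresh the buffer."""
        self.head_classes += num_new_classes
        # naive FIFO policy: keep only the most recent exemplars in budget
        self.buffer = (self.buffer + exemplars)[-self.buffer_size:]
        return self.head_classes

learner = CILLearner(base_dim=10, buffer_size=4)
learner.add_task(2, ["a", "b"])
learner.add_task(3, ["c", "d", "e"])  # head now covers 5 classes
```

The key invariant mirrored from the text: the base model size never changes, while only the head dimension and the (bounded) buffer evolve across tasks.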

3.2. LIFELONG LOTTERY TICKETS

It was shown by LTH (Frankle & Carbin, 2019) that a standard (unstructured) pruning technique can uncover the so-called winning ticket, namely, a sparse sub-network together with proper initialization that can be trained in isolation and reach similar performance as the dense network. In this paper, we aim to prune the base model θ^(i) over time, and we ask: Do there exist winning tickets in lifelong learning? If yes, how do we obtain them? To answer these questions, a prerequisite is to define the notion of lottery tickets in lifelong learning, which we call lifelong lottery tickets. Following LTH (Frankle & Carbin, 2019), a lottery ticket consists of two parts: 1) a binary mask m ∈ {0,1}^{||θ^(i)||_0} obtained from a one-shot or iterative pruning algorithm, and 2) initial or rewinding weights θ_0. The ticket (m, θ_0) is a winning ticket if training the sub-network m ⊙ θ_0 (⊙ denotes element-wise product), identified by the sparse pattern m with initialization θ_0, wins the initialization lottery to match the performance of the original (fully trained) network. In CIL, in the presence of sequential tasks {T^(i)}_{i=1,2,...}, we define lifelong lottery tickets (m^(i), θ_0^(i)) recursively from the perspective of a dynamical system:

m^(i) = m^(i-1) + A(D^(i), P^(i), m^(i-1)),   θ_0^(i) ∈ {θ^(0), θ_rw^(i)},   (1)

where A denotes a pruning algorithm used at the current task T^(i) based on the information D^(i), P^(i), and m^(i-1); θ^(0) denotes the initialization prior to training the model at T^(1); and θ_rw^(i) denotes a rewinding point at T^(i). In Eq. (1), we interpret the (non-trivial) pruning operation A as weight perturbations, with values drawn from {-1, 0, 1}, applied to the previous binary mask. Here -1 denotes the removal of a weight, 0 keeps a weight intact, and 1 represents the addition of a weight. Moreover, the introduction of weight rewinding is spurred by the so-called rewinding ticket (Renda et al., 2020; Frankle et al., 2020).
For example, if θ_rw^(i) = θ^(i-1), then we pick the model weights learnt at the previous task T^(i-1) to initialize the training at T^(i). We also note that θ^(0) can be regarded as the point rewound to the earliest stage of lifelong learning. Based on Eq. (1), we then state the definition of winning tickets in CIL.
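The {-1, 0, +1} perturbation semantics of Eq. (1) can be written out directly. This is an illustrative NumPy fragment, not the authors' implementation; it only checks that applying a perturbation to the previous binary mask yields another valid binary mask.

```python
import numpy as np

def update_mask(prev_mask, perturbation):
    """Apply Eq. (1)'s mask update m^(i) = m^(i-1) + A(...).

    Entries of `perturbation` are -1 (remove a weight), 0 (keep it as is),
    or +1 (add a weight); the result must remain a valid binary mask,
    i.e., -1 may only hit a 1 and +1 may only hit a 0.
    """
    new_mask = prev_mask + perturbation
    assert set(np.unique(new_mask)) <= {0, 1}, "perturbation broke the mask"
    return new_mask

m_prev = np.array([1, 1, 0, 0, 1])
delta  = np.array([0, -1, 1, 0, 0])   # drop one weight, regrow another
m_new  = update_mask(m_prev, delta)   # -> [1, 0, 1, 0, 1]
```

TD pruning (Section 4.1) corresponds to perturbations with entries in {-1, 0}, while BU pruning (Section 4.2) restricts them to {0, +1}.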

Lifelong winning tickets. Given a sequence of tasks {T^(i)}_{i=1}^{n}, the lifelong lottery tickets {(m^(i), θ_0^(i))}_{i=1}^{n} given by Eq. (1) are winning tickets if they can be trained in isolation to match the CIL performance (i.e., criteria I and II) of the corresponding full models {θ^(i)}_{i=1}^{n}, where n ∈ N+. In the next section, we design the lifelong pruning algorithm A, together with the ticket initialization schemes formulated in Eq. (1).
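Operationally, the winning-ticket criteria (I) and (II) reduce to a per-task accuracy comparison against the full model at each stage. A trivial checker could look as follows; the tolerance knob and the sample accuracy values are our additions for illustration.

```python
def is_winning(ticket_acc, full_acc, tol=0.0):
    """Criteria (I) and (II): the ticket must match the full model's
    accuracy on every task seen so far, up to a small tolerance `tol`.

    ticket_acc, full_acc: per-task accuracies (same length, same order).
    """
    return all(t >= f - tol for t, f in zip(ticket_acc, full_acc))

# illustrative per-task accuracies after incrementally learning 3 tasks
ticket = [0.81, 0.74, 0.70]
full   = [0.80, 0.75, 0.69]
winning = is_winning(ticket, full, tol=0.02)
```

The check must pass at every time stamp i, not only at the end, matching the trajectory-wise statement above.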

4.1. REVISITING IMP OVER SEQUENTIAL TASKS: TOP-DOWN (TD) PRUNING

In order to find potential tickets at the current task T^(i), it is natural to specify A in Eq. (1) as the iterative magnitude pruning (IMP) algorithm (Han et al., 2015a) applied to the model m^(i-1) ⊙ θ^(i-1). Following (Frankle & Carbin, 2019; Renda et al., 2020), IMP iteratively prunes a p^{1/n^(i)} (%) fraction of the non-zero weights of m^(i-1) ⊙ θ^(i-1) over n^(i) rounds at T^(i). Thus, the number of non-zero weights in the obtained mask m^(i) is given by (1 - p^{1/n^(i)})^{n^(i)} · ||m^(i-1)||_0. However, in applying IMP to the sequential tasks {T^(i)}, we find that the schedule of IMP over tasks, in terms of {n^(i)}, is critical to make pruning successful in lifelong learning. We refer readers to Appendix A2.1 for detailed justifications.

A curriculum schedule is key to successful TD pruning. The conventional method sets {n^(i)} as a uniform schedule, namely, IMP prunes a fixed portion of non-zero weights at each task. However, this direct application fails quickly as the number of incremental tasks increases, implying that "not all tasks are created equal" in the learning/pruning schedule. Inspired by the recent observation that training with more classes helps consolidate a more robust sparse model (Morcos et al., 2019), we propose a curriculum pruning schedule, in which IMP is conducted more aggressively for tasks arriving later, with n^(i) ≥ n^(i-1), until reaching the desired sparsity. For example, with 12 rounds of pruning over five sequentially arriving tasks, we arrange them in a linearly increasing way, i.e., (T_1: 1, T_2: 1, T_3: 2, T_4: 3, T_5: 5). Note that TD pruning relies on this heuristic curriculum schedule and is thus inevitably greedy and suboptimal over continual learning tasks. In what follows, we propose a more advanced pruning scheme, bottom-up (BU) pruning, which obeys a different design principle.
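The remaining-weight arithmetic and the curriculum schedule above can be sketched in a few lines; the pruning ratio p = 0.2 is an assumed value for illustration, and the schedule is the (T_1: 1, ..., T_5: 5) example from the text.

```python
def remaining_after_task(prev_nonzeros, p, n_rounds):
    """Surviving weight fraction after one task's IMP under TD pruning.

    Each of the n_rounds rounds prunes a p**(1/n_rounds) fraction of the
    currently surviving weights, so the paper's count
    (1 - p**(1/n_rounds))**n_rounds of the incoming mask survives.
    """
    per_round = p ** (1.0 / n_rounds)
    return prev_nonzeros * (1.0 - per_round) ** n_rounds

# curriculum schedule from the text: 12 pruning rounds over 5 tasks,
# arranged in a linearly increasing way (later tasks pruned harder)
schedule = {1: 1, 2: 1, 3: 2, 4: 3, 5: 5}

nonzeros = 1.0  # fraction of weights, starting from the dense model
for task, rounds in schedule.items():
    nonzeros = remaining_after_task(nonzeros, p=0.2, n_rounds=rounds)
```

With this parameterization, one round at p = 0.2 removes exactly 20% of the survivors, and assigning more rounds to a task prunes it substantially harder, which is why the sparsity falls off quickly in the late tasks.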

4.2. BOTTOM-UP (BU) LIFELONG PRUNING

Earlier tasks contribute more to shaping the final mask, due to the nested dependency between intermediate masks. In the later training stages, we often observe that the network is already too heavily pruned to learn more tasks. Inspired by the recently proposed model consolidation (Zhang et al., 2020), we propose the BU alternative for lifelong pruning, which dynamically compensates for excessive pruning by re-growing previously pruned networks.

Full reference model & rewinding point For BU lifelong pruning, we maintain a full (unpruned) reference model θ_ref^(i). If the validation accuracy of the current sparse model is no worse than the reference performance, the sparse model is considered to still have sufficient capacity and can be pruned further; otherwise, capacity expansion is needed. In addition, the reference model offers a rewinding point for network parameters, which preserves the knowledge of all tasks prior to T^(i); it naturally extends the rewinding concept (Frankle et al., 2019) to lifelong learning.

BU pruning method BU lifelong pruning expands the previous mask m^(i-1) to m^(i). Different from TD pruning, the model size grows along the task sequence, namely, ||m^(i)||_0 ≥ ||m^(i-1)||_0. Thus, BU pruning restricts A in Eq. (1) to non-negative perturbations. As illustrated in Figure 2, for each newly added task T_i, we first re-train the previous sparse model m^(i-1) ⊙ θ^(i-1) under the current information (D^(i), P^(i)) and compute its validation accuracy R^(i). If R^(i) is above the reference performance R_ref^(i), we keep the sparse mask m^(i) = m^(i-1) intact and use the re-trained θ^(i-1) as θ^(i) at T_i. Otherwise, an expansion from m^(i-1) is required to ensure sufficient learning capacity. To do so, we restart from the full reference model θ_ref^(i) and iteratively prune its weights using IMP until the performance falls just below R_ref^(i).
Here, the previous non-zero weights localized by m^(i-1) are excluded from the pruning scope of IMP, but the values of those weights can still be re-trained. As a result, IMP yields an updated mask m^(i) of larger size than m^(i-1). We repeat this BU pruning step whenever a new task arrives. Although never observed in our CIL experiments, a potential corner case of expansion is that the ticket size may hit the size of the full model. We consider this an artifact of limited model capacity and suggest, as future work, combining lifelong tickets with (full) model growing (Wang et al., 2017).

Ticket initialization Given the pruning mask found by the BU (or TD) pruning method, we next determine the initialization scheme of a lifelong ticket. We consider three specifications of θ_0^(i) in Eq. (1) to initialize the sparse model m^(i) for re-training the found tickets: (I) θ_0^(i) = θ^(0), i.e., the original "same random" initialization (Frankle & Carbin, 2019); (II) a random re-initialization θ_reinit independent of θ^(0); and (III) previous-task rewinding, i.e., θ_0^(i) = θ^(i-1). The initialization schemes I-III together with m^(i) yield the following tickets at T^(i): (1) BU (or TD) tickets, namely, m^(i) found by BU (or TD) pruning with initialization I; (2) random BU (or TD) tickets, namely, m^(i) with initialization II; and (3) task-rewinding BU (or TD) tickets, namely, m^(i) with initialization III. In experiments, we will show that both BU (or TD) tickets and their task-rewinding (TR) counterparts are winning tickets, which outperform unpruned full CIL models. Comparing BU with TD pruning, TR-BU tickets surpass the best TD tickets.
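The BU control flow described above can be condensed into a short sketch. The callables below stand in for re-training, validation, and IMP-based expansion; they are toy placeholders we introduce for illustration, not the paper's code.

```python
import numpy as np

def bottom_up_step(prev_mask, retrain, accuracy, imp_expand, r_ref):
    """One BU pruning step at a new task (illustrative control flow).

    retrain(mask):      re-trains the sparse model, returns its weights.
    accuracy(w, mask):  validation accuracy of the re-trained sparse model.
    imp_expand(mask):   restarts from the full reference model and prunes
                        with IMP, never removing weights already in
                        prev_mask, until accuracy falls just below r_ref;
                        returns the (larger) expanded mask.
    """
    weights = retrain(prev_mask)
    if accuracy(weights, prev_mask) >= r_ref:
        return prev_mask, weights       # capacity still sufficient
    new_mask = imp_expand(prev_mask)    # re-grow: ||m_new||_0 >= ||m_prev||_0
    return new_mask, retrain(new_mask)

# toy stand-ins just to exercise both branches of the control flow
prev = np.array([1, 0, 0, 1])
retrain = lambda m: m * 0.5
imp_expand = lambda m: np.minimum(m + np.array([0, 1, 0, 0]), 1)
low_acc = lambda w, m: 0.60             # below r_ref -> expansion fires
mask, _ = bottom_up_step(prev, retrain, low_acc, imp_expand, r_ref=0.70)
```

The non-negativity constraint on A from Eq. (1) shows up here as `imp_expand` only ever adding weights to `prev_mask`, never removing them.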

4.3. LOTTERY TEACHING: A PLUG-IN REGULARIZATION

Catastrophic forgetting poses a severe challenge to class-incremental learning, especially for compact models. Early attempts (Castro et al., 2018; He et al., 2018; Javed & Shafait, 2018; Rebuffi et al., 2017) tackle the forgetting dilemma by introducing knowledge distillation regularization (Hinton et al., 2015), which employs a handful of stored previous data in addition to new task data; (Zhang et al., 2020) takes advantage of unlabeled data to handle forgetting. We adapt this philosophy (Li & Hoiem, 2017; Hinton et al., 2015; Zhang et al., 2020) to present lottery teaching, which injects previous information into the new tickets via a knowledge distillation term on external unlabeled data. Lottery teaching consists of two steps: i) we query similar unlabeled data "for free" from a public source, utilizing a small number of prototype samples from previous tasks' training data; in this way, the storage required for previous tasks can be minimal, while the queried surrogate data serves the same purpose; ii) we then enforce the output soft logits of the current sub-network {m^(i) ⊙ θ^(i), θ_c^(i)} on each queried unlabeled sample x to be close to the logits of the previously trained sub-network {m^(i-1) ⊙ θ^(i-1), θ_c^(i-1)}, via a knowledge distillation (KD) regularizer based on the K-L divergence. For all experiments of our methods hereinafter, we append lottery teaching by default, as it is widely beneficial. An ablation study follows in Section 5.3.
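A minimal NumPy version of the KD regularizer in step ii) might look as follows. The temperature value is an assumed hyperparameter (not specified in this excerpt), and a real implementation would operate on framework tensors (e.g., PyTorch) with gradients flowing only through the current model's logits.

```python
import numpy as np

def softmax(z, tau):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lottery_teaching_loss(new_logits, old_logits, tau=2.0):
    """K-L divergence KL(p_old || p_new) between the previous sub-network's
    soft logits and the current sub-network's soft logits, averaged over
    a batch of queried unlabeled samples. `tau` is an assumed temperature.
    """
    p_old = softmax(old_logits, tau)
    p_new = softmax(new_logits, tau)
    kl = np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1)
    return float(kl.mean())
```

The loss is zero when the current model reproduces the previous model's soft predictions on the unlabeled queries, and strictly positive otherwise, which is what pulls the compact ticket back toward previously learned behavior.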

5. EXPERIMENTS

Experimental Setup We briefly discuss the key settings used in our experiments and refer readers to Appendix A2.3 for more implementation details. We evaluate our proposed lifelong tickets on three datasets: CIFAR-10, CIFAR-100, and Tiny-ImageNet, adopting ResNet18 (He et al., 2016) as the backbone. We evaluate model performance by standard testing accuracy (SA), averaged over three independent runs. CIL baseline: We consider a strong baseline framework derived from (Zhang et al., 2019), a recent state-of-the-art (SOTA) method introduced for imbalanced data training (see more illustration in Appendix A1.1). We implement (Zhang et al., 2019) for CIL and compare with two recent CIL SOTAs: iCaRL (Rebuffi et al., 2017) and IL2M (Belouadah & Popescu, 2019). Our results demonstrate that the CIL method adapted from (Zhang et al., 2019) outperforms the others significantly (1.65% SA better than IL2M and 4.88% SA better than iCaRL on CIFAR-10), establishing a new SOTA bar. The proposed lottery teaching further improves the performance of the baseline adapted from Zhang et al. (2019), yielding a 4.4% SA improvement on CIFAR-10. Thus, we use (Zhang et al., 2019), combined with or without lottery teaching, to train the original (unpruned) CIL model. CIL pruning: To the best of our knowledge, there is no effective CIL pruning baseline comparable to ours. Thus, we focus on comparisons among different variants of our methods. We also compare our proposal with ordinary IMP, showing the latter's incapability in CIL pruning. Furthermore, we demonstrate that, within our proposed pruning frameworks, standard pruning methods such as IMP and ℓ1 pruning (by imposing ℓ1 sparsity regularization) turn out to be successful.

Results on TD tickets

We begin by showing that TD pruning is non-trivial in CIL. We find that ordinary IMP (Han et al., 2015a) fails: it leads to a 10.21% SA degradation (from 72.79% to 62.58%) with 4.40% of parameters left. By contrast, our proposed lifelong tickets yield substantially better performance, even surpassing the full dense model with fewer parameters left than ordinary IMP (Han et al., 2015a). In what follows, we evaluate TD lifelong pruning using different weight rewindings, namely: i) TD tickets; ii) random TD tickets; iii) task-rewinding TD tickets; iv) late-rewinding TD tickets; and v) fine-tuning. The late-rewinding tickets are a strong baseline advocated in Mehta (2019). Figure A4 demonstrates the high competitiveness of our proposed TD tickets (blue lines), which match and most of the time outperform the full model (black dashed lines). Even with only 6.87% of model parameters left, the TD ticket still surpasses the dense model by 0.49% SA. The task-rewinding tickets, in second place, exceed the dense model until reaching the extreme sparsity of 4.40%. Moreover, late-rewinding TD tickets dominate over the other rewinding/fine-tuning options, echoing the finding in single-task learning (Frankle et al., 2019). However, TD pruning cannot afford many more incremental tasks due to its greedy weight (over-)pruning. Our results show that TD tickets pruned from only tasks T_1 and T_2 clearly overfit the first two tasks, even after incrementally learning the remaining three tasks. Under this inappropriate pruning schedule (in contrast to the T_1 ~ T_5 scheme), the resultant ticket drops to 59.28% SA, which is 13.51% lower than the dense model, as shown in Table A3. More results can be found in the appendix. Bottom-up lifelong pruning is therefore proposed as a remedy that relieves the laborious tuning of pruning schedules.

Results on BU lifelong tickets

Bottom-up lifelong pruning allows the sparse network to "regret" when it cannot handle newly added tasks, which compensates for excessive pruning and reaches a substantially better trade-off between sparsity and generalization ability. Compared to TD pruning, it does not require any heuristic pruning schedule. In Table 1, we first present the performance of BU tickets, random BU tickets, and task-rewinding BU (TR-BU) tickets, as introduced in Section 4.2. TR-BU tickets achieve the best performance. A possible explanation is that task rewinding (i.e., θ_0^(i) = θ^(i-1)) maintains full information of learned tasks, which mitigates catastrophic forgetting, while the other rewinding points lack sufficient task information to prevent compact models from forgetting. Next, we observe that TR-BU tickets outperform the full dense model by 0.52% SA with only 3.64% of parameters left, and ℓ1 BU tickets match the full dense model with 5.16% of parameters remaining. This suggests that IMP, ℓ1, and possibly other adequate pruning algorithms can be plugged into our proposed BU pruning framework to identify lifelong tickets. In Figure 4, we compare TR-BU tickets (the best sub-networks in Table 1) against TD tickets. TR-BU tickets are identified through bottom-up lifelong pruning, whose sparse masks continue to grow subtly along the incremental tasks, from 2.81% of weights at the first task to 3.64% at the last task. At any incremental learning stage, TR-BU tickets attain superior performance with significantly fewer parameters. In particular, after learning all tasks, TR-BU tickets surpass TD tickets by 1.01% SA with 0.76% fewer weights on CIFAR-10, and by 3.07% SA with 2.46% fewer weights on CIFAR-100.
These results demonstrate that TR-BU tickets have better generalization ability and parameter efficiency than TD tickets. In addition, on Tiny-ImageNet (Table A7), TR-BU tickets outperform the full model with only 12.08% of weights remaining. It is worth mentioning that both TR-BU tickets and TD tickets outperform the full dense model; we refer readers to Tables A5 and A6 in the appendix for more detailed results. From the above results, we further observe that TR-BU tickets achieve accuracy comparable to full models with more than 30× the network capacity, implying that bottom-up lifelong pruning successfully discovers extremely sparse sub-networks that are nevertheless powerful enough to inherit previous knowledge and generalize well on newly added tasks. Furthermore, our proposed lifelong pruning schemes can be directly plugged into other CIL models to identify lifelong tickets, as shown in Appendix A1.

Ablation studies

In what follows, we summarize our ablation studies and refer readers to Appendix A1.2.1 for more details. In Figure 5, we show the essential role of the curriculum schedule in TD pruning compared to the uniform pruning schedule. The curriculum scheme generates stronger TD tickets than uniform pruning in terms of accuracy, confirming our motivation that pruning more heavily in the late stage of lifelong learning, when more classes are present, is beneficial. In Table A8, we demonstrate the effectiveness of our proposals across different numbers of incremental tasks. Figure 5 also shows that lottery teaching, by applying knowledge distillation on external unlabeled data, injects previous knowledge and greatly alleviates catastrophic forgetting in lifelong pruning (i.e., after learning all tasks, lottery teaching obtains a 4.34% SA improvement on CIFAR-10). It is worth mentioning that we set a buffer of fixed storage capacity to hold 128 unlabeled images queried from public sources at each training iteration. We find that leveraging newly queried unlabeled data offers better generalization ability than storing historical data from past tasks: the latter only reaches 70.60% SA on CIFAR-10, which is 2.19% worse than using unlabeled data.

6. CONCLUSION

We extend the lottery ticket hypothesis to lifelong learning, in which networks incrementally learn from sequential tasks. We propose top-down and bottom-up lifelong pruning algorithms to identify lifelong tickets. Systematic experiments validate that the located tickets obtain strong(er) generalization ability across all incrementally learned tasks, compared with unpruned models. Our future work will explore lifelong tickets with the (full) model growing approach.

A1.1 MORE BASELINE RESULTS

Comparison with the Latest CIL SOTAs We find that (Zhang et al., 2019) can be naturally introduced to class-incremental learning; it tackles the intrinsic training bias between a handful of previously stored data and a large amount of newly added data. It adopts random and class-balanced sampling strategies, combined with an auxiliary classifier, to alleviate the negative impact of imbalanced classes. Extensive results, shown in Table A2, demonstrate that adopting (Zhang et al., 2019) as the simple baseline surpasses the previous SOTAs iCaRL (Rebuffi et al., 2017) and IL2M (Belouadah & Popescu, 2019) by a significant performance margin (1.65%/0.57% SA better than IL2M and 4.88%/7.60% SA better than iCaRL on CIFAR-10/CIFAR-100, respectively), establishing a new SOTA bar. With the assistance of lottery teaching, (Zhang et al., 2019) obtains an extra performance boost of 4.4% SA on CIFAR-10 and 7.34% SA on CIFAR-100. It is worth mentioning that a lifelong ticket also exists in other CIL models. Taking IL2M on CIFAR-10 as an example, the bottom-up (BU) ticket achieves 68.92% accuracy with 11.97% of the parameters, vs. the dense unpruned model with 66.74% accuracy.

Table A2: Comparison between our dense model and two previous SOTA CIL methods on CIFAR-10 and CIFAR-100. Reported performance is the final accuracy for each task T. Simple baseline denotes the dense CIL model (Zhang et al., 2019). Full model represents our proposed framework, which combines the lottery teaching technique with the simple baseline.

Pruning Schedule is Important As shown in Table A3, with an inappropriate pruning schedule across T1 ∼ T2, the resulting ticket drops to 59.28% accuracy, which is 13.51% lower than the dense model. On the contrary, the adequate scheme across T1 ∼ T5 in Table A3 generates a TD winning ticket with higher test accuracy (+0.49% SA) and far fewer parameters (6.87%), compared with the dense CIL model.
Table A3: Evaluation performance of TD tickets (6.87% remaining weights) pruned from different task ranges on CIFAR-10. Reported are the per-task accuracies T1 ∼ T5 and their average (%).

Top-down Lifelong Tickets

We also report several performance reference baselines: (a) Full model, denoting the achievable performance of the dense CIL model (Zhang et al., 2019) combined with lottery teaching; (b) CIL lower, denoting a vanilla CIL model without using lottery teaching or storing/utilizing previous data in any form; (c) MT upper, training a dense model using full data from all tasks simultaneously in a multi-task learning scheme. While it is not CIL (and much easier to learn), we consider MT upper an accuracy "upper bound" for (dense) CIL; (d) MT LT, obtained by directly pruning MT upper into its lottery ticket (Frankle & Carbin, 2019). The detailed evaluation performance of TD tickets at different sparsity levels on CIFAR-10 is collected in Table A4.

Table A4: Evaluation performance of TD tickets at different sparsity levels on CIFAR-10. Reported performance is the final accuracy for each task T. Differences (+/-) are calculated w.r.t. the full/dense model performance.

Bottom-up Lifelong Tickets As shown in Table A5 and Table A6, even compared with the best TD tickets in terms of the trade-off between sparsity and accuracy, TR-BU tickets consistently remain prominent on both CIFAR-10 (slightly higher accuracy with 3.23% fewer weights) and CIFAR-100 (2.37% higher accuracy with 4.88% fewer weights). From the results, we further observe that TR-BU tickets achieve accuracy comparable to full models with more than 30× the network capacity, implying that bottom-up lifelong pruning successfully discovers extremely sparse sub-networks that are nevertheless powerful enough to inherit previous knowledge and generalize well on newly added tasks.

A1.2.1 MORE ABLATION RESULTS

Uniform vs. Curriculum Lifelong Pruning We discuss different pruning schedules for top-down lifelong pruning, which play an essential role in the performance of TD tickets. From the right panel of Figure A7, we notice that the curriculum pruning scheme generates stronger TD tickets than uniform pruning in terms of accuracy, which confirms our motivation that pruning more heavily in the late stage of lifelong learning, when more classes are present, is beneficial.

The Number of Incremental Tasks Here we study the influence of the number of increments in our lifelong learning settings. Table A8 shows the results of TR-BU 20 tickets that incrementally learn from 20 tasks (5 classes per task), while Table A6 presents the results of TR-BU 10 tickets that incrementally learn from 10 tasks (10 classes per task). Comparing the two tickets, TR-BU 10 tickets reach 6.55% higher accuracy at the expense of 1.77% more parameters. A possible reason is that increasing the number of incremental learning stages aggravates the forgetting issue, which hurts TR-BU 20 tickets.

More Technical Details of Top-down Pruning In our implementation, we set the per-round pruning ratio to 20% as in (Frankle & Carbin, 2019; Renda et al., 2020) and adjust {n^(i)} to control the pruning schedule of IMP over sequential tasks. The aforementioned lifelong pruning method is illustrated in Figure A8; we call it top-down lifelong pruning since the model size is sequentially reduced, namely, ‖m^(i)‖0 ≤ ‖m^(i−1)‖0.

Pruning Algorithms We summarize the workflows of top-down pruning and bottom-up pruning in Algorithms 1 and 2, respectively. For the pruning hyperparameters, we follow the original LTH setting (Frankle & Carbin, 2019), i.e., Δp = 20%. If we instead set Δp = 40%, accuracy drops by 2.04% at the same sparsity level on CIFAR-10.
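As a concrete illustration of one IMP round with ratio Δp described above, the following is a minimal sketch (our own simplification: it operates on a single flat weight vector rather than layer-wise tensors, and ignores the rewinding step):

```python
import numpy as np

def imp_prune(weights, mask, dp=0.20):
    """One round of iterative magnitude pruning (IMP): zero out the dp
    fraction of smallest-magnitude weights among those still active."""
    active = np.abs(weights[mask == 1])
    k = int(dp * active.size)            # number of weights to remove this round
    if k == 0:
        return mask
    threshold = np.sort(active)[k - 1]   # k-th smallest active magnitude
    new_mask = mask.copy()
    new_mask[np.abs(weights) <= threshold] = 0
    return new_mask
```

Applied repeatedly, each round removes 20% of the *remaining* weights, so n rounds leave roughly 0.8^n of the parameters, matching the geometric sparsity levels reported for TD tickets.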

A2.2 MORE CLASS-INCREMENTAL LEARNING DETAILS

Lottery teaching regularization To mitigate catastrophic forgetting, we apply a knowledge distillation (Hinton et al., 2015) regularizer R_KD that enforces similarity between the previous soft logits ŷ and the current soft logits y on unlabeled data. We state R_KD as follows:

R_KD(y, ŷ) = −H(t(y), t(ŷ)) = Σ_j t(y)_j log t(ŷ)_j,

where t(y)_i = (y_i)^(1/T) / Σ_j (y_j)^(1/T), with temperature T = 2 in our case, following the standard setting in (Hinton et al., 2015; Li & Hoiem, 2017).

Our Dense Full CIL Model We consider a strong baseline framework derived from (Zhang et al., 2019) with our proposed lottery teaching as our dense full CIL model. It adopts random and class-balanced sampling strategies, an auxiliary classifier, and the knowledge distillation regularizer R_KD. For incrementally learning task T_i, the training objective is:

L_CIL(θ, θ_c^(i), θ_a^(i)) = γ_2 × L(θ, θ_c^(i)) + L(θ, θ_a^(i)),
L(θ, θ_c^(i)) = E_{(x,y)∈D_b} [L_XE(f(θ, θ_c^(i), x), y)] + γ_1 × E_{x∈D_u} [R_KD(f(θ, θ_c^(i), x), ŷ_c)],
L(θ, θ_a^(i)) = E_{(x,y)∈D_r} [L_XE(f(θ, θ_a^(i), x), y)] + γ_1 × E_{x∈D_u} [R_KD(f(θ, θ_a^(i), x), ŷ_a)],

where D_b is the class-balanced sampled dataset, D_r the randomly sampled dataset, and D_u the queried unlabeled dataset. θ_c^(i) and θ_a^(i) are the main and auxiliary classifiers; ŷ_c and ŷ_a are the soft logits of the main and auxiliary classifiers on previous tasks. We adopt γ_1 = 1 and γ_2 = 0.5 in our experiments according to grid search.
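The temperature-softened distillation term above can be sketched in a few lines of NumPy (our own illustration; it assumes y and ŷ are probability vectors, and returns the cross-entropy H(t(y), t(ŷ)), i.e. −R_KD, since that is the quantity minimized during training):

```python
import numpy as np

def t_soften(p, T=2.0):
    """Temperature-soften a probability vector: t(p)_i = p_i^(1/T) / sum_j p_j^(1/T)."""
    q = p ** (1.0 / T)
    return q / q.sum()

def kd_loss(y, y_hat, T=2.0, eps=1e-12):
    """Cross-entropy H(t(y), t(y_hat)) between softened previous (y) and
    current (y_hat) outputs, i.e. -R_KD; small when the two models agree."""
    return -np.sum(t_soften(y, T) * np.log(t_soften(y_hat, T) + eps))
```

By Gibbs' inequality, kd_loss is minimized exactly when the current model reproduces the previous model's softened outputs, which is what anchors the ticket to previously learned tasks.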



Both iCaRL and IL2M are implemented using their official codes. The comparison has been strictly controlled to be fair, including identical dataset splits, the same previously stored data, and due diligence in hyperparameter tuning for each method. More comparisons with the latest CIL SOTAs are given in Appendix A1.1. Full model denotes the performance of the dense CIL model (Zhang et al., 2019) with lottery teaching.



Figure 1: Basic CIL Setting.

Figure 2: Framework of our proposed bottom-up (BU) lifelong pruning, which is based on sparse model consolidation. Tickets found by BU pruning keep expanding for each newly added task.

Figure 3: Evaluation performance (standard accuracy) of top-down lifelong tickets on CIFAR-10.

Figure 3 and Table A4 demonstrate the high competitiveness of our proposed TD tickets (blue lines). They match, and most of the time outperform, the full model (black dashed lines). Even with only 6.87% of the model parameters left, the TD ticket still surpasses the dense model by 0.49% SA. The task-rewinding tickets, in second place, exceed the dense model until reaching the extreme sparsity of 4.40%. Moreover, we see that late-rewinding TD tickets dominate other rewinding/fine-tuning options, echoing the finding in single-task learning (Frankle et al., 2019).

Figure 4: Performance and sparsity comparison between TR-BU tickets and TD tickets when training models incrementally. Left: CIFAR-10. Right: CIFAR-100. Top: comparison of SA. Bottom: comparison of remaining weights in tickets. Above all, tickets located by TD pruning continue to shrink with the growth of incremental tasks; on the contrary, tickets found by BU pruning keep expanding for each newly added task.

Figure 5: Left: the results of TD Tickets with/without lottery teaching. Right: the comparison of TD tickets (10.74%) obtained from uniform and curriculum pruning schedule. Experiments are conducted on CIFAR-10.

Figure A6: Evaluation performance (standard accuracy) of top-down lifelong tickets. The right figure zooms in on the red dashed box in the left figure.

Figure A7: Left: the results of TD Tickets with/without lottery teaching. Right: the comparison of TD tickets (10.74%) obtained from uniform and curriculum pruning schedule. Experiments are conducted on CIFAR-10.

Figure A8: Framework of our proposed top-down lifelong pruning algorithm. Top-down (TD) lifelong pruning performs iterative magnitude pruning (IMP) by unrolling the sequential tasks. Tickets located by TD pruning continue to shrink with the growth of incremental tasks.

Algorithm 1: Top-Down Pruning
Input: full dense model f(θ_0, θ_c^(0); x), a desired sparsity P_m, samples x from a storage S and sequential tasks T_1∼n, soft logits from the previous model on queried unlabeled data, pruning ratio Δp.
Output: an updated sparse model f(θ ⊙ m, θ_c^(n); x).
1: Set i = 1 and mask m = 1 ∈ R^‖θ‖0.
2: Train f(θ_0 ⊙ m, θ_c^(0); x) with data from S and T_1.
3: while 1 − ‖m‖0/‖θ‖0 ≤ P_m and i ≤ n do
4:   Prune Δp of the remaining weights by iterative magnitude pruning (IMP), obtaining a new mask m̃ where ‖m̃‖0 < ‖m‖0; rewind …

Algorithm 2: Bottom-Up Pruning
Input: m̃, x, soft logits, and Δp as defined in Algorithm 1; f(θ_i, θ_c^(i); x) has learned T_1∼i and has performance R*_i, i ∈ {1, …, n}.
Output: an updated sparse model f(θ ⊙ m̃, θ_c^(n); x).
1: Set i = 1 and mask m = 0 ∈ R^‖θ‖0.
2: Train f(θ_0 ⊙ m, θ_c^(0); x) with data from S and T_1; calculate accuracy R_1.
3: while i ≤ n and ‖m̃‖0 < ‖θ‖0 do
4:   if R_i ≥ R*_i or ‖m̃‖0 = ‖θ‖0 … of θ_i ⊙ (m − m̃), obtain a new mask m*, where ‖m*‖0 ≥ ‖m̃‖0 and m̃ ⊂ m*.
10:  Retrain f(θ_{i−1} ⊙ m*, θ …
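To make the bottom-up loop easier to follow, here is a hedged Python sketch of its control flow (our own names; the train, evaluate, and expand_mask callables are placeholders for the training, evaluation, and weight-revival steps):

```python
import numpy as np

def bottom_up_pruning(tasks, train, evaluate, expand_mask, mask, ref_accs):
    """Sketch of bottom-up (BU) lifelong pruning: the mask starts small and
    is re-expanded (previously pruned weights revived, so old mask is a
    subset of the new one) whenever the sparse network cannot match the
    reference accuracy R*_i of the dense model on task i."""
    for i, task in enumerate(tasks):
        train(mask, task)
        # Grow capacity until the ticket matches the dense reference R*_i
        # or no pruned weights remain to revive.
        while evaluate(mask, task) < ref_accs[i] and mask.sum() < mask.size:
            mask = expand_mask(mask)
            train(mask, task)
    return mask
```

This matches the qualitative behavior in Figure 4: the BU ticket only expands when a new task demands it, so model capacity tracks task difficulty rather than shrinking monotonically.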

Comparison results across the full dense model, BU tickets with different ticket initializations, and 1 BU tickets when training incrementally on CIFAR-10. T1∼i denotes the learned sequential tasks T1 ∼ Ti.

Evaluation performance of TR-BU/TD tickets on CIFAR-10. T1∼i, i ∈ {1, 2, 3, 4, 5} denotes that models have learned from T1, …, Ti incrementally. ‖m‖0/‖θ‖0 represents the fraction of remaining weights in the current network.

Evaluation performance of TR-BU/TD tickets when training incrementally on CIFAR-100.

Evaluation performance of TR-BU/TD tickets when training incrementally on Tiny-ImageNet.

Evaluation performance of TR-BU tickets when models incrementally learn 20 tasks on CIFAR-100. TR-BU 10 tickets learn more knowledge per task (10 vs. 5 classes per task), which requires larger network capacity.

With vs. Without Lottery Teaching Comparison results between TD tickets with and without lottery teaching are collected in this section. As shown in Figure A7 (left), the performance of TD tickets without lottery teaching (black dashed curves) quickly decays as the number of incremental learning stages increases. After learning all tasks, utilizing lottery teaching obtains a 4.34% accuracy improvement on CIFAR-10. This suggests that our proposed lottery teaching injects previous knowledge by applying knowledge distillation on external unlabeled data, and greatly alleviates the catastrophic forgetting issue.



Code availability: https://github.com/VITA-Group/

A2.3 MORE OTHER IMPLEMENTATION DETAILS

Datasets and Task Splittings We evaluate our proposed lifelong tickets on CIFAR-10, CIFAR-100, and Tiny-ImageNet, all standard benchmarks for CIL (Krizhevsky & Hinton, 2009). For all three datasets, we randomly split the original training set into training and validation with a ratio of 9:1. On CIFAR-10, we divide the 10 classes into splits of 2 classes in a random order (10/2 = 5 tasks); on CIFAR-100, we divide the 100 classes into splits of 10 classes in a random order (100/10 = 10 tasks); on Tiny-ImageNet, we divide the 200 classes into splits of 20 classes in a random order (200/20 = 10 tasks). In this way, when models learn a new incoming task, the dimension of the classifier increases by 2 for CIFAR-10, 10 for CIFAR-100, and 20 for Tiny-ImageNet. Additionally, 100, 10, and 5 images per class of learned tasks are stored for CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively.

Unlabeled Dataset All queried unlabeled data for CIFAR-10/CIFAR-100 come from the 80 Million Tiny Images dataset (Torralba et al., 2008), and for Tiny-ImageNet from the ImageNet dataset (Krizhevsky et al., 2012). At each incremental learning stage, 4,500, 450, and 450 images per class of learned tasks are queried for CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, based on feature similarity with stored prototypes: {m^(i−1) ⊙ θ^(i−1), θ} in top-down pruning and {θ^(i−1), θ} in bottom-up pruning at the i-th CIL stage. Feature similarity is defined by ℓ2 distance.

Training and Evaluation Models are trained using stochastic gradient descent (SGD) with 0.9 momentum and 5 × 10^−4 weight decay. For 100-epoch training, a multi-step learning rate schedule is used, starting from 0.01 and decayed by a factor of 10 at epochs 60 and 80. During iterative pruning, we retrain the model for 30 epochs using a fixed learning rate of 10^−4. The batch size for both labeled and unlabeled data is 128.
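The task-splitting protocol above can be sketched in a few lines (our own illustration; the function name is hypothetical):

```python
import random

def make_cil_splits(num_classes, classes_per_task, seed=0):
    """Split a label set into class-incremental tasks with a random class
    order, e.g. CIFAR-10 -> 5 tasks of 2 classes, CIFAR-100 -> 10 tasks
    of 10 classes, as described above."""
    order = list(range(num_classes))
    random.Random(seed).shuffle(order)  # random class order, reproducible
    return [order[i:i + classes_per_task]
            for i in range(0, num_classes, classes_per_task)]
```

Each split then drives one incremental stage: the classifier head grows by classes_per_task outputs when the corresponding task arrives.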
We pick the trained model with the highest validation accuracy and report its performance on the held-out test set.

Other Training Details (i) CIFAR-10 and CIFAR-100 can be downloaded at https://www.cs.toronto.edu/~kriz/cifar.html. (ii) The 80 Million Tiny Images dataset is available at http://horatio.cs.nyu.edu/mit/tiny/data/index.html. (iii) All of our experiments are conducted on NVIDIA GTX 1080-Ti GPUs.
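The unlabeled-data query step described above, selecting external samples closest to stored class prototypes under ℓ2 distance, can be sketched as follows (a simplified illustration with our own names; the real pipeline extracts features with the current network and queries a fixed per-class budget from the external pool):

```python
import numpy as np

def query_unlabeled(features, prototypes, per_class):
    """For each stored class prototype, pick the per_class unlabeled
    samples whose feature vectors are closest in l2 distance."""
    picked = []
    for proto in prototypes:
        dists = np.linalg.norm(features - proto, axis=1)  # l2 distance to prototype
        picked.extend(np.argsort(dists)[:per_class].tolist())
    return picked
```

The queried indices would then be fed to the distillation regularizer as the unlabeled set D_u for the current stage.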

A3 DISCUSSION

Challenges of Theoretical Analysis and Future Work Theoretical justification of the lottery ticket hypothesis remains very limited, except for very shallow networks (Anonymous, 2021). Meanwhile, class-incremental learning makes theoretical analysis even more difficult: it is a challenging lifelong learning problem, and current progress lies on the empirical rather than the theoretical side. Theoretical analysis is out of the scope of this paper, and we would like to explore it in future work.

