ON THE SOFT-SUBNETWORK FOR FEW-SHOT CLASS INCREMENTAL LEARNING

Abstract

Inspired by the Regularized Lottery Ticket Hypothesis, which states that competitive smooth (non-binary) subnetworks exist within a dense network, we propose a few-shot class-incremental learning method referred to as Soft-SubNetworks (SoftNet). Our objective is to learn a sequence of sessions incrementally, where each session includes only a few training instances per class, while preserving the knowledge of previously learned classes. SoftNet jointly learns the model weights and adaptive non-binary soft masks in a base training session, where each mask consists of a major and a minor subnetwork; the former aims to minimize catastrophic forgetting during training, and the latter aims to avoid overfitting to the few samples in each new training session. We provide comprehensive empirical validation demonstrating that SoftNet effectively tackles the few-shot incremental learning problem, surpassing the performance of state-of-the-art baselines on benchmark datasets.

1. INTRODUCTION

Lifelong Learning, or Continual Learning, is a learning paradigm for expanding knowledge and skills through sequential training on multiple tasks (Thrun, 1995). According to the accessibility of task identity during training and inference, the community often categorizes the field into specific problems, such as task-incremental (Pfülb and Gepperth, 2019; Delange et al., 2021; Yoon et al., 2020; Kang et al., 2022), class-incremental (Chaudhry et al., 2018; Kuzborskij et al., 2013; Li and Hoiem, 2017; Rebuffi et al., 2017; Kemker and Kanan, 2017; Castro et al., 2018; Hou et al., 2019; Wu et al., 2019), and task-free continual learning (Aljundi et al., 2019; Jin et al., 2021; Pham et al., 2022; Harrison et al., 2020). While the standard scenarios for continual learning assume a sufficiently large number of instances per task, a lifelong learner in real-world applications often suffers from insufficient training instances for each problem to solve. This paper aims to tackle the issue of limited training instances in practical Class-Incremental Learning (CIL), referred to as Few-Shot CIL (FSCIL) (Ren et al., 2019; Chen and Lee, 2020; Tao et al., 2020; Zhang et al., 2021; Cheraghian et al., 2021; Shi et al., 2021). There are two critical challenges in solving FSCIL problems: catastrophic forgetting and overfitting. Catastrophic forgetting (Goodfellow et al., 2013; Kirkpatrick et al., 2017), or catastrophic interference (McCloskey and Cohen, 1989), is a phenomenon in which a continual learner loses previously learned task knowledge by updating its weights to adapt to new tasks, resulting in significant performance degradation on previous tasks. Such undesired knowledge drift is irreversible, since the scenario does not allow the model to revisit past task data.
Recent works propose to mitigate catastrophic forgetting in class-incremental learning, often categorized into multiple directions, such as constraint-based (Rebuffi et al., 2017; Castro et al., 2018; Hou et al., 2018; 2019; Wu et al., 2019), memory-based (Rebuffi et al., 2017; Chen and Lee, 2020; Mazumder et al., 2021; Shi et al., 2021), and architecture-based methods (Mazumder et al., 2021; Serra et al., 2018; Mallya and Lazebnik, 2018; Kang et al., 2022). However, we note that catastrophic forgetting becomes even more challenging in FSCIL. Due to the small amount of training data for new tasks, the model tends to severely overfit to new classes and quickly forget old classes, deteriorating the model performance. Meanwhile, several works address overfitting issues in continual learning from various perspectives. NCM (Hou et al., 2019) and BiC (Wu et al., 2019) highlight the prediction bias problem during sequential training, whereby models are prone to assign data to classes from recently trained tasks. OCS (Yoon et al., 2022) tackles class imbalance in rehearsal-based continual learning, where the number of instances per class varies across tasks, so the model tends to train in a biased way on dominant classes. Nevertheless, these works do not consider the overfitting issues caused by training on a sequence of few-shot tasks. FSLL (Mazumder et al., 2021) tackles overfitting in few-shot CIL by partially splitting the model parameters across sessions through multiple sub-steps of iterative re-identification and weight selection. However, this procedure is computationally inefficient. To deploy a practical few-shot CIL model, we propose a simple yet efficient method named SoftNet, which effectively alleviates both catastrophic forgetting and overfitting.
Motivated by the Lottery Ticket Hypothesis (Frankle and Carbin, 2019), which posits the existence of competitive subnetworks (winning tickets) within a randomly initialized dense neural network, we suggest a new paradigm for few-shot CIL, named the Regularized Lottery Ticket Hypothesis:

Regularized Lottery Ticket Hypothesis (RLTH). A randomly initialized dense neural network contains a regularized subnetwork that can retain prior class knowledge while providing room to learn new class knowledge through isolated training of the subnetwork.

Based on RLTH, we propose a method referred to as Soft-SubNetworks (SoftNet), illustrated in Figure 1. First, SoftNet jointly learns the randomly initialized dense model (Figure 1 (a)) and a soft mask m ∈ [0, 1]^|θ| defining the soft-subnetwork (Figure 1 (b)) during base session training; the soft mask consists of a major part of the model parameters (m = 1) and a minor part (m < 1), where m = 1 is assigned to the top-c% of model parameters and m < 1 is sampled from the uniform distribution for the remaining (100 − c)% of parameters. Then, we freeze the major part of the pre-trained subnetwork weights to maintain prior class knowledge and update only the minor part of the weights to learn novel class knowledge (Figure 1 (c)). We summarize our key contributions as follows:
• This paper presents a new masking-based method, Soft-SubNetwork (SoftNet), that tackles two critical challenges in few-shot class incremental learning (FSCIL): catastrophic forgetting and overfitting.
• SoftNet trains two different types of non-binary masks (subnetworks) for solving FSCIL, simultaneously preventing the continual learner from forgetting previous sessions and from overfitting.
• We conduct a comprehensive empirical study comparing SoftNet with multiple class incremental learning methods. Our method significantly outperforms strong baselines on benchmark tasks for FSCIL problems.
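For concreteness, the soft-mask construction described above can be sketched in a few lines. This is a minimal NumPy illustration of the idea only; the function name `soft_mask`, the flat score array, and the exact tie-breaking are our own assumptions, not the authors' implementation.

```python
import numpy as np

def soft_mask(scores: np.ndarray, c: float, seed: int = 0) -> np.ndarray:
    """Build a SoftNet-style soft mask from weight scores: the top-c% of
    scores receive a hard value of 1 (major subnetwork), and the remaining
    entries receive values drawn from U(0, 1) (minor subnetwork)."""
    rng = np.random.default_rng(seed)
    k = max(1, int(round(c * scores.size)))          # number of "major" weights
    thresh = np.partition(scores.ravel(), -k)[-k]    # k-th largest score
    major = (scores >= thresh).astype(float)         # m_major ∈ {0, 1}
    minor = rng.uniform(size=scores.shape) * (1.0 - major)  # m_minor ∈ [0, 1)
    return major + minor                             # m_soft = m_major ⊕ m_minor
```

Freezing θ where the mask equals 1 and updating only where it is below 1 then realizes the major/minor split used in the incremental sessions.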

2. RELATED WORK

Catastrophic Forgetting. Many recent works have made remarkable progress in tackling catastrophic forgetting in lifelong learning. Specifically, architecture-based approaches (Mallya et al., 2018; Serrà et al., 2018; Li et al., 2019) use additional capacity to expand (Xu and Zhu, 2018; Yoon et al., 2018) or isolate (Rusu et al., 2016) model parameters, thereby avoiding knowledge interference during continual learning; SupSup (Wortsman et al., 2020) allocates model parameters dedicated to different tasks. Very recently, Chen et al. (2021) and Kang et al. (2022) showed the existence of sparse subnetworks, called winning tickets, that perform well on all tasks during continual learning. However, many subnetwork-based approaches are incompatible with the FSCIL setting, since performing task inference under data imbalance is challenging. FSLL (Mazumder et al., 2021) searches for session-specific subnetworks while preserving the weights of previous sessions for incremental few-shot learning. However, its expansion process comprises a further series of retraining and pruning steps, requiring excessive training time and computational cost. In contrast, our proposed method, SoftNet, jointly learns the model and task-adaptive smooth (i.e., non-binary) masks of the subnetwork associated with the base session while selecting an essential subset of the model weights for the upcoming session. Furthermore, the smooth masks behave like regularizers that prevent overfitting when learning new classes.
Soft-subnetwork. Recent works with context-dependent gating of sub-spaces (He and Jaeger, 2018), parameters (Mallya and Lazebnik, 2018; He et al., 2019; Mazumder et al., 2021), or layers (Serra et al., 2018) of a single deep neural network demonstrated their effectiveness in addressing catastrophic forgetting during continual learning. Masse et al.
(2018) combine context-dependent gating with constraints that prevent significant changes in the model weights, such as SI (Zenke et al., 2017) and EWC (Kirkpatrick et al., 2017), achieving larger performance gains than either component alone. A flat minimum can also be viewed as acquiring a sub-space. Previous works have shown that a flat minimizer is more robust to random perturbations (Hinton and Van Camp, 1993; Hochreiter and Schmidhuber, 1994; Jiang et al., 2019). Recently, Shi et al. (2021) showed that obtaining flat loss minima in the base session, i.e., the first session with sufficient training instances, is necessary to alleviate catastrophic forgetting in FSCIL; to minimize forgetting, they update the model weights within the obtained flat loss contour. In our work, by selecting subnetworks (Frankle and Carbin, 2019; Zhou et al., 2019; Wortsman et al., 2019; Ramanujan et al., 2020; Kang et al., 2022; Chijiwa et al., 2022) and optimizing the subnetwork parameters in a sub-space, we propose a new method that preserves the knowledge learned in the base session in a major subnetwork and learns new sessions through a regularized minor subnetwork.

3.1. PROBLEM STATEMENTS

Various works have tried to mitigate catastrophic forgetting in class incremental learning using knowledge distillation, revisiting a subset of prior samples, or isolating essential model parameters so as to retain prior class knowledge even after the model loses access to the data. However, since a few-shot class incremental learning scenario provides only a small amount of training data in the sessions that follow, the model tends to severely overfit to new classes, making it difficult to fine-tune the previously trained model on a few samples. In addition, the fine-tuning process often leads to catastrophic forgetting of base class knowledge. As a result, regularization is indispensable: the model should avoid forgetting and prevent overfitting to new class samples by updating only selected parameters in each new session. Formally, we consider T sessions whose class sets O^t are disjoint, i.e., O^i ∩ O^j = ∅ for all i ≠ j with i, j ≤ T. Consider a supervised learning setup where the T sessions arrive at a lifelong learner f(·; θ), parameterized by the model weights θ, in sequential order. A few-shot class incremental learning scenario aims to learn the classes in a sequence of sessions without catastrophic forgetting. In training session t, the model solves the following optimization problem:

θ* = argmin_θ (1/n_t) Σ_{i=1}^{n_t} L_t(f(x_i^t; θ), y_i^t),

where L_t is a classification loss such as cross-entropy, and n_t is the number of instances in session t.
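As a concrete instance of this per-session objective, the cross-entropy session loss L_t can be written as follows (a generic sketch of the standard loss, not the authors' code; `session_loss` is our own name):

```python
import numpy as np

def session_loss(logits: np.ndarray, labels: np.ndarray) -> float:
    """Average cross-entropy L_t over the n_t instances of a session:
    (1/n_t) * sum_i -log softmax(f(x_i; theta))[y_i]."""
    z = logits - logits.max(axis=1, keepdims=True)   # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

Minimizing this quantity over θ for the current session's data is exactly the per-session objective above.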

3.2. SUBNETWORK-BASED TRAINING FOR FEW-SHOT CLASS INCREMENTAL LEARNING

As lifelong learners often adopt over-parameterized dense neural networks to allow resource freedom for future classes or tasks, updating all the weights of the network for few-shot tasks is rarely preferable and often causes overfitting. To overcome this limitation in FSCIL, we focus on updating partial weights of the network when a new task arrives. The desired set of partial weights, named a subnetwork, can achieve on-par or even better performance, with the following motivations: (1) the Lottery Ticket Hypothesis (Frankle and Carbin, 2019) shows the existence of a subnetwork that performs as well as the dense network, and (2) a subnetwork significantly downsized from the dense network reduces the solver's expansion cost while providing extra capacity to learn new sessions or tasks. We first state the hard-subnetwork objective, referred to as HardNet, as follows: given dense neural network parameters θ, the binary attention mask m*_t describes the optimal subnetwork for session t such that |m*_t| is smaller than the dense model capacity |θ|. However, such a binarized subnetwork m_t ∈ {0, 1}^|θ| cannot adjust the remaining parameters of the dense network for future sessions while solving past sessions cost- and memory-efficiently. In FSCIL, the test accuracy on the base session drops significantly as the model proceeds through sequential sessions, since the subnetwork with m = 1 plays a crucial role in maintaining base class knowledge. To this end, we propose a soft-subnetwork m_t ∈ [0, 1]^|θ| instead of the binarized subnetwork; it gives the flexibility to fine-tune a small part of the soft-subnetwork while fixing the rest to retain base class knowledge for FSCIL. We find the soft-subnetwork through the following objective:

m*_t = argmin_{m_t ∈ [0,1]^|θ|} (1/n_t) Σ_{i=1}^{n_t} [ L_t(f(x_i^t; θ ⊙ m_t), y_i^t) − J ]  subject to |m_t| ≤ c,
where the session loss J = L_t(f(x_i^t; θ), y_i^t), c ≪ |θ| denotes the subnetwork sparsity (used as the selected proportion, in %, of model parameters in the following section), and ⊙ denotes the element-wise product. In the following section, we describe how to obtain the soft-subnetwork m*_t using a magnitude-based criterion (RLTH) while jointly minimizing the session loss.

3.3. OBTAINING SOFT-SUBNETWORKS VIA COMPLEMENTARY WINNING TICKETS

Let each weight be associated with a learnable parameter called a weight score s, which numerically determines the importance of the associated weight; a weight with a higher score is deemed more important. We first find a subnetwork θ* = θ ⊙ m*_t of the dense neural network and assign it as the solver of the current session t. The subnetwork associated with each session jointly learns the model weights θ and a binary mask m_t. Given an objective L_t, we optimize:

θ*, m*_t = argmin_{θ, s} L_t(θ ⊙ m_t; D_t),

where m_t is obtained by applying an indicator function 1_c to the weight scores s; note that 1_c(s) = 1 if s belongs to the top-c% of scores, and 0 otherwise. In the optimization process for FSCIL, however, we consider two main problems: (1) Catastrophic forgetting: updating all of θ ⊙ m_{t−1} when training new sessions will interfere with the weights allocated to previous tasks; thus, we need to freeze all previously learned parameters θ ⊙ m_{t−1}. (2) Overfitting: the subnetwork also encounters overfitting when training an incremental task on a few samples; as such, we need to update a few parameters irrelevant to previous task knowledge, i.e., θ ⊙ (1 − m_{t−1}). To acquire an optimal subnetwork that alleviates both issues, we define a soft-subnetwork by dividing the dense neural network into two parts: the major subnetwork m_major and the minor subnetwork m_minor. The soft-subnetwork is defined as

m_soft = m_major ⊕ m_minor,

where m_major is a binary mask, m_minor ∼ U(0, 1), and ⊕ denotes element-wise summation. As such, the soft mask is given as m*_t ∈ [0, 1]^|θ| in Eq. 3. In all FSCIL experiments, m_major maintains the base task knowledge (t = 1) while m_minor acquires the novel task knowledge (t ≥ 2).
Then, with base session learning rate α, θ is updated as θ ← θ − α (∂L/∂θ) ⊙ m_soft, which effectively regularizes the weights of the subnetworks for incremental learning. The subnetworks are obtained by the indicator function, which always has a gradient of zero; therefore, updating the weight scores s with the loss gradient is impossible. To update the weight scores, we use a straight-through estimator (Hinton, 2012; Bengio et al., 2013; Ramanujan et al., 2020) in the backward pass. Specifically, we ignore the derivative of the indicator function and update the weight scores as s ← s − α (∂L/∂s) ⊙ m_soft, where m_soft = 1 while exploring the optimal subnetwork during base session training. Our soft-subnetwork optimization procedure is summarized in Algorithm 1. Once a single soft-subnetwork m_soft is obtained in the base session, we use it for all subsequent sessions without further updates.

Algorithm 1 Soft-Subnetworks (SoftNet)
input: session datasets {D^t}_{t=1}^T
// Base session (t = 1): jointly learn θ and s with m_soft = 1 (straight-through),
// then obtain m_soft = m_major ⊕ m_minor from the top-c% weight scores
// Incremental sessions (t ≥ 2):
for t = 2, ..., T do
  for each mini-batch b_t ⊂ D^t do
    compute L_m(θ ⊙ m_soft; b_t) by Eq. 5
    θ ← θ − β (∂L/∂θ) ⊙ m_minor
  end for
end for
output: model parameters θ, s, and m_soft
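The two update rules above, the straight-through score update in the base session and the minor-only weight update in incremental sessions, can be sketched on a toy one-layer regressor. This is our own simplified illustration under assumed names (`base_step`, `incremental_step`) and an assumed squared-error loss, not the authors' implementation.

```python
import numpy as np

def base_step(theta, s, x, y, c, alpha):
    """Base-session step: forward through theta ⊙ m with m = 1_c(s), then
    update the selected weights and, via a straight-through estimator
    (treating d m / d s as 1), the weight scores s."""
    k = max(1, int(round(c * s.size)))
    thresh = np.partition(s, -k)[-k]
    m = (s >= thresh).astype(float)        # indicator 1_c(s): binary top-c% mask
    err = x @ (theta * m) - y              # scalar residual, L = err^2 / 2
    g = err * x                            # dL / d(theta ⊙ m)
    theta_new = theta - alpha * g * m      # update only the selected weights
    s_new = s - alpha * g * theta          # straight-through gradient to scores
    return theta_new, s_new, m

def incremental_step(theta, m_soft, grad, beta):
    """Incremental-session step: gradients flow only through the minor part
    of the soft mask (m_soft < 1); major weights (m_soft == 1) stay frozen."""
    m_minor = np.where(m_soft < 1.0, m_soft, 0.0)
    return theta - beta * grad * m_minor
```

The key design point is that `base_step` explores which weights matter (scores move even though the mask is discrete), while `incremental_step` can only perturb the small, uniformly scaled minor fraction, which is what preserves base-session knowledge.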

4. INCREMENTAL LEARNING FOR SOFT-SUBNETWORK

We now describe the overall procedure of our soft-pruning-based incremental learning and inference method, including the training phase with a normalized informative measurement in Section 4.1, following the prior work (Shi et al., 2021), and the inference phase in Section 4.2.

4.1. INCREMENTAL SOFT-SUBNETWORK TRAINING

Base Training (t = 1). In the base session, we jointly optimize the soft-subnetwork parameters θ (including a fully-connected layer as the classifier) and the weight scores s with a cross-entropy loss, using the training examples of D^1. Incremental Training (t ≥ 2). In the incremental few-shot learning sessions (t ≥ 2), starting from θ ⊙ m_soft, we fine-tune a few minor parameters θ ⊙ m_minor of the soft-subnetwork to learn new classes. Table 1: Classification accuracy of ResNet18 on CIFAR-100 for 5-way 5-shot incremental learning. An underbar denotes results comparable with FSLL (Mazumder et al., 2021). * denotes results reported by Shi et al. (2021).

[Table 1 rows: per-session accuracies of Rebalance (Hou et al., 2019), TOPIC (Cheraghian et al., 2021), F2M (Shi et al., 2021), and other baselines.]

Since m_minor < 1, the soft-subnetwork alleviates overfitting to the few samples. Furthermore, instead of Euclidean distance (Shi et al., 2021), we employ a metric-based classification algorithm with cosine distance to fine-tune the few selected parameters. In some cases, Euclidean distance fails to capture the true distances between representations, especially when two points at the same distance from a prototype do not fall in the same class. In contrast, representations with a low cosine distance lie in the same direction from the origin, providing a normalized, informative measurement. We define the loss function as:

L_m(z; θ ⊙ m_soft) = − Σ_{z=(x,y)∈D} Σ_{o∈O} 1(y = o) log( exp(−d(p_o, f(x; θ ⊙ m_soft))) / Σ_{o_k∈O} exp(−d(p_{o_k}, f(x; θ ⊙ m_soft))) ),   (5)

where the prototype of class o is p_o = (1/N_o) Σ_i 1(y_i = o) f(x_i; θ ⊙ m_soft); the prototypes of the base classes are saved in the base session, and N_o denotes the number of training images of class o. We also save the prototypes of all classes in O_t for later evaluation.
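For a single embedding, the metric-based loss above reduces to a softmax over negative cosine distances to the class prototypes. A minimal sketch of that single-embedding case follows (our own simplified form of Eq. 5; `cosine_dist` and `prototype_loss` are assumed names, not the authors' code):

```python
import numpy as np

def cosine_dist(a, b):
    """d(a, b) = 1 - cos(a, b): small when embeddings share a direction."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def prototype_loss(feat, prototypes, label):
    """-log softmax over negative prototype distances for one embedding:
    pulls f(x; theta ⊙ m_soft) toward its class prototype p_label."""
    d = np.array([cosine_dist(feat, p) for p in prototypes])
    log_probs = -d - np.log(np.exp(-d).sum())
    return float(-log_probs[label])
```

Summing `prototype_loss` over a batch and all classes recovers the double sum in the displayed loss.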

4.2. INFERENCE FOR INCREMENTAL SOFT-SUBNETWORK

In each session, inference is conducted by a simple nearest class mean (NCM) classification algorithm (Mensink et al., 2013; Shi et al., 2021) for fair comparison. Specifically, all training and test samples are mapped to the embedding space of the feature extractor f, and the Euclidean distance d_u(·, ·) measures the similarity between them. The classifier outputs the class of the nearest prototype, o* = argmin_{o∈O} d_u(f(x; θ ⊙ m_soft), p_o).
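The NCM rule above amounts to a nearest-prototype lookup in embedding space (a generic sketch; `ncm_predict` is our own name for the rule):

```python
import numpy as np

def ncm_predict(embedding, prototypes):
    """Nearest class mean inference: return the index of the class whose
    prototype has the smallest Euclidean distance to the embedding."""
    dists = [np.linalg.norm(embedding - p) for p in prototypes]
    return int(np.argmin(dists))
```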

5. EXPERIMENTS

We introduce the experimental setup in Section 5.1. Then, we empirically evaluate our soft-subnetworks for incremental few-shot learning and demonstrate their effectiveness through comparisons with state-of-the-art methods and vanilla subnetworks in the following subsections.

5.1. EXPERIMENTAL SETUP

Datasets. To validate the effectiveness of the soft-subnetwork, we follow the standard FSCIL experimental setting. For CIFAR-100 and miniImageNet, we randomly select 60 classes as base classes and the remaining 40 as new classes. In each incremental learning session, we construct 5-way 5-shot tasks by randomly picking five classes and sampling five training examples per class. Baselines. We mainly compare SoftNet with architecture-based methods for FSCIL: FSLL (Mazumder et al., 2021), which selects important parameters for each session, and HardNet, which represents a binary subnetwork. Furthermore, we compare against other FSCIL methods such as iCaRL (Rebuffi et al., 2017), Rebalance (Hou et al., 2019), TOPIC (Tao et al., 2020), IDLVQ-C (Chen and Lee, 2020), and F2M (Shi et al., 2021). As a reference, we also include a joint training method (Shi et al., 2021) that trains on all previously seen data, including the base and subsequent few-shot tasks. Furthermore, we use the classifier re-training method (cRT) (Kang et al., 2019) for long-tailed classification, trained with all encountered data, as an approximate upper bound. Experimental details. The experiments are conducted on an NVIDIA RTX8000 GPU with CUDA 11.0. We randomly split each dataset into multiple sessions, run each algorithm ten times per dataset, and report the mean accuracy. We adopt ResNet18 (He et al., 2016) as the backbone network. For data augmentation, we use standard random crops and horizontal flips. In the base session training stage, we select the top-c% weights at each layer and acquire the optimal soft-subnetwork with the best validation accuracy. In each incremental few-shot learning session, the total number of training epochs is 6 and the learning rate is 0.02. We train on new class session samples using a few minor weights of the soft-subnetwork (the Conv4x layer of ResNet18 and the Conv3x layer of ResNet20) obtained during base session learning.
We specify further experiment details in Appendix A.

5.2. RESULTS AND COMPARISONS

We compared SoftNet with the architecture-based methods FSLL and HardNet. We pick FSLL as an architecture-based baseline since it selects important parameters for acquiring old/new class knowledge. The architecture-based results on CIFAR-100 and miniImageNet are presented in Table 1 and Table 2, respectively. The performance of HardNet shows the effectiveness of subnetworks that use less model capacity than dense networks. To emphasize this point, we found that ResNet18 with approximately 50% of its parameters achieves performance comparable to FSLL on CIFAR-100 and miniImageNet. In addition, the performance of ResNet20 with 30% of its parameters (HardNet) is comparable with that of FSLL on CIFAR-100, as shown in Table 9 and Table 11 of the Appendix, including per-session performances (Figure 4 and Figure 5) and smoothness in t-SNE plots (Figure 6). Figure 2 analyzes the overall performance of SoftNet as a function of sparsity and dataset. As we increase the number of parameters employed by SoftNet, we achieve performance gains on both benchmark datasets. The sensitivity of SoftNet to sparsity appears to depend on the dataset: the performance variance on CIFAR-100 is smaller than that on miniImageNet. In addition, SoftNet retains prior session knowledge successfully in both experiments, as shown by the dashed line, and SoftNet (c = 60.0%) outperforms SoftNet (c = 80.0%) on the new class sessions (8, 9) of CIFAR-100, as depicted by the dashed-dot line. From these results, we expect the best performance to depend on both the number of parameters and the properties of the dataset. We present further comparisons of HardNet and SoftNet in Appendix B. Our SoftNet outperforms the state-of-the-art methods and cRT, which is used as the approximate upper bound of FSCIL (Shi et al., 2021), as shown in Table 1, Table 2, and Figure 3.

5.3. LAYER-WISE ACCURACY

In incremental few-shot learning sessions, we train new class session samples using a few minor weights m_minor of a specific layer. At the same time, we entirely fix the remaining weights and investigate the best-performing configuration, as shown in Table 3. The best performance comes from fine-tuning the Conv5x layer with c = 97%. This suggests that features computed by the lower layers are general and reusable across different classes, whereas features from the higher layers are specific and highly dependent on the dataset.
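This layer-wise protocol can be expressed as zeroing out the trainable (minor) mask everywhere except the layer under inspection. The sketch below is our own schematic, assuming per-layer mask arrays and the name `layerwise_trainable_masks`:

```python
import numpy as np

def layerwise_trainable_masks(soft_masks, tune_layer):
    """Restrict incremental fine-tuning to the minor weights of a single
    layer: every other layer gets an all-zero (fully frozen) mask."""
    trainable = {}
    for name, m in soft_masks.items():
        minor = np.where(m < 1.0, m, 0.0)   # keep only minor (m < 1) entries
        trainable[name] = minor if name == tune_layer else np.zeros_like(m)
    return trainable
```

Multiplying each layer's gradient by its entry in the returned dictionary then reproduces the "tune one layer, freeze the rest" inspection.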

5.4. ARCHITECTURE-WISE ACCURACY

The performance of subnetworks varies depending on the architecture, and so does the preferred sparsity: ResNet18 tends to use dense parameters, whereas ResNet20 tends to use sparse parameters on CIFAR-100 for 5-way 5-shot learning, as shown in Table 4. We observed that SoftNet with ResNet20 finds a sparser solution (c = 90%) than with ResNet18 in this CIFAR-100 FSCIL setting.

5.5. DISCUSSIONS

Based on our thorough empirical study, we uncover the following facts: (1) The performance of subnetworks varies depending on the architecture, and so does the preferred sparsity: ResNet18 tends to use dense parameters, while ResNet20 tends to use sparse parameters in CIFAR-100 FSCIL settings. This observation offers a useful hint for pruning-based models in general. (2) In general, fine-tuning strategies are essential for retaining prior knowledge while learning new knowledge; through layer-wise inspection, we found that performance varies depending on which Conv layer is fine-tuned. Lastly, (3) across all experimental results, base session learning is critical for lifelong learners to acquire generalized performance in FSCIL.

6. CONCLUSION

Inspired by the Regularized Lottery Ticket Hypothesis (RLTH), which hypothesizes that smooth subnetworks exist within a dense network, we propose Soft-SubNetworks (SoftNet), an incremental learning strategy that preserves previously learned class knowledge while learning new classes. More specifically, SoftNet jointly learns the model weights and adaptive soft masks to minimize catastrophic forgetting and to avoid overfitting to a few novel samples in FSCIL. Finally, we conducted a comprehensive empirical study comparing SoftNet with multiple class incremental learning methods. Extensive experiments on benchmark tasks demonstrate that our method achieves superior performance over state-of-the-art class incremental learning methods. We also showed, through ablation studies, how subnetworks perform differently across architectures and datasets. In addition, we highlighted the importance of fine-tuning and base session learning in achieving optimal performance for FSCIL. We believe our findings could have a meaningful impact on deep neural network architecture search, both for task-specific architectures and for the utilization of sparse models.

B RESULTS AND CONCLUSIONS

To expand upon the results in the main paper, we conduct more experiments on the various datasets mentioned in the previous section. We first display the full performance tables, with more capacity values c for our method, in Table 9 and Table 11. Next, we identify how choosing a different architecture impacts the performance of our algorithm in Table 10. Furthermore, we analyze the performance of our method on the CUB-200-2011 dataset in Table 7. Through extensive experiments, we draw the following three conclusions for incorporating our method in few-shot class incremental learning. Structure. We identified subnetworks of ResNet18 and ResNet20 with varying capacities on CIFAR-100 for the 5-way 5-shot FSCIL setting, as shown in Table 9 and Table 10. First, according to both tables, our method performs better as we use more parameters within the network. In addition, as noted in the main paper, subnetworks are effective: HardNet with only 50% of the dense capacity achieves performance comparable to methods using dense networks, while SoftNet can do the same with only 30% of the dense capacity. Furthermore, we argue that our method is architecture-dependent: Table 10 shows that ResNet18 performs best at the maximum capacity of c = 99%, while ResNet20 achieves its optimum at c = 90%. Comparisons of HardNet and SoftNet. Increasing the number of network parameters leads to better overall performance for both subnetwork types, as shown in Figure 4 and Figure 5. Subnetworks, in the form of HardNet and SoftNet, tend to retain prior (base) session knowledge (dashed line), and HardNet seems able to classify new session class samples without continuous updates (dashed-dot line). From this, we can estimate how much the knowledge HardNet learned in the base session helps in learning new incoming tasks (forward transfer).
The overall performance of SoftNet is better than that of HardNet, since SoftNet improves both base- and new-session knowledge by updating its minor subnetworks. Subnetworks show a broader spectrum of performances on miniImageNet (Figure 5) than on CIFAR-100 (Figure 4). This may be due to dataset complexity: if miniImageNet is harder for a subnetwork (or a deep model) to learn, subnetworks need more parameters to learn miniImageNet than CIFAR-100.

Smoothness of SoftNet.

As emphasized in Table 11, SoftNet has a broader spectrum of performances than HardNet on miniImageNet. The 20% minor subnetwork might provide a smoother representation than HardNet, since SoftNet performed best at approximately c = 80%. From these results, we expect that model parameter smoothness yields quite competitive results. To support this claim, we plot the loss landscapes of a dense neural network, HardNet, and SoftNet along two Hessian eigenvectors (Yao et al., 2020), as shown in Fig. 7. We observed the following through these experiments:
• The loss landscapes of the subnetworks (HardNet and SoftNet) were flatter than those of dense neural networks.
• The minor subnetwork of SoftNet helped find a flat global minimum despite the random scaling of weights during training.
From these results, we can estimate how much knowledge the specified subnetworks can retain and acquire on each dataset. Moreover, we compared the embeddings using t-SNE plots, as shown in Figure 6. Preciseness. In the fine-grained and small-sized CUB-200-2011 FSCIL setting, HardNet also shows results comparable with the baselines, and SoftNet outperforms the others, as shown in Table 7. In this setting, we obtained the best performance of SoftNet through specific parameter selections. As of now, SoftNet achieves state-of-the-art results on all three datasets.

C CONVERGENCE OF SUBNETWORKS

Convergence of HardNet and SoftNet. To interpret the convergence of SoftNet, we follow the Lipschitz-continuous objective gradients assumption (Bottou et al., 2018): the objective function of the dense network, R : R^d → R, is continuously differentiable, and its gradient function ∇R : R^d → R^d is Lipschitz continuous with Lipschitz constant L > 0, i.e.,

||∇R(θ) − ∇R(θ′)||_2 ≤ L ||θ − θ′|| for all {θ, θ′} ⊂ R^d.   (6)

Following the same form, we define the Lipschitz-continuous objective gradients of subnetworks as:

||∇R(θ ⊙ m) − ∇R(θ′ ⊙ m)||_2 ≤ L ||(θ − θ′) ⊙ m|| for all {θ, θ′} ⊂ R^d,   (7)

where m is a binary mask. Comparing Eq. 6 and Eq. 7, we use the theoretical analysis of Ye et al. (2020), in which a subnetwork achieves a faster rate of R(θ ⊙ m) = O(1/||m||_1^2) at most. The comparison is as follows:

||∇R(θ ⊙ m) − ∇R(θ′ ⊙ m)||_2 / ||(θ − θ′) ⊙ m|| < ||∇R(θ) − ∇R(θ′)||_2 / ||θ − θ′|| ≤ L.   (8)

The smaller this ratio, the flatter the solution (loss landscape). The inequality follows from the relationship R(θ ⊙ m) ≪ R*(θ), where R*(θ) denotes the best possible loss achievable by convex combinations of all parameters, despite ||(θ − θ′) ⊙ m|| < ||θ − θ′||. Furthermore, if ||R(θ ⊙ m_hard) − R(θ ⊙ m_soft)|| ≃ 0 and ||m_hard|| < ||m_soft||, we have the following inequality:

||∇R(θ ⊙ m_hard) − ∇R(θ′ ⊙ m_hard)||_2 / ||(θ − θ′) ⊙ m_hard|| ≥ ||∇R(θ ⊙ m_soft) − ∇R(θ′ ⊙ m_soft)||_2 / ||(θ − θ′) ⊙ m_soft||,   (9)

where equality holds iff ||m_hard|| = ||m_soft||. We provide the loss landscapes of the Dense Network, Hard-WSN, and Soft-WSN in Figure 7 as an example supporting this inequality.
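As a numeric sanity check of the ratio comparison in Eq. 8, consider a toy quadratic R(θ) = ½ θᵀAθ with one sharp direction. This is an illustration of our own construction, not from the paper: masking out the sharp coordinate lowers the local gradient-Lipschitz ratio, i.e., flattens the landscape.

```python
import numpy as np

# R(theta) = 0.5 * theta^T A theta, so the gradient is A @ theta.
A = np.diag([10.0, 1.0])                 # the first coordinate is sharp
grad = lambda t: A @ t

theta, theta2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
m = np.array([0.0, 1.0])                 # mask drops the sharp direction

dense_ratio = (np.linalg.norm(grad(theta) - grad(theta2))
               / np.linalg.norm(theta - theta2))
masked_ratio = (np.linalg.norm(grad(theta * m) - grad(theta2 * m))
                / np.linalg.norm((theta - theta2) * m))
# masked_ratio (= 1.0) < dense_ratio (≈ 5.6): the masked objective is flatter.
```

In this toy, the masked ratio equals the curvature of the surviving coordinate, while the dense ratio is inflated by the sharp direction, mirroring the left-hand/right-hand sides of Eq. 8.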

D ADDITIONAL COMPARISONS WITH CURRENT WORKS

Comparisons with SOTA. We compare SoftNet with the following state-of-the-art methods on the TOPIC class split (Tao et al., 2020) of three benchmark datasets: CIFAR-100 (Table 5), miniImageNet (Table 6), and CUB-200-2011 (Table 7). We summarize the current FSCIL methods as follows:

• CEC (Zhang et al., 2021): The authors proposed a Continually Evolved Classifier (CEC) that employs a graph model to propagate context information between classifiers for adaptation.

• LIMIT (Zhou et al., 2022): The authors proposed a new meta-learning-based paradigm for FSCIL, LearnIng Multi-phase Incremental Tasks (LIMIT), which synthesizes fake FSCIL tasks from the base dataset. LIMIT also constructs a transformer-based calibration module that calibrates the old class classifiers and new class prototypes to the same scale and fills in the semantic gap.

• MetaFSCIL (Chi et al., 2022): The authors proposed a bilevel optimization based on meta-learning to directly optimize the network to learn how to learn incrementally in the FSCIL setting. Concretely, they sample sequences of incremental tasks from the base classes during training to simulate the evaluation protocol. For each task, the model is learned with a meta-objective to perform fast adaptation without forgetting. Furthermore, they proposed a bi-directional guided modulation to modulate activations and reduce catastrophic forgetting.

• C-FSCIL (Hersche et al., 2022): The authors proposed C-FSCIL, which is architecturally composed of a frozen meta-learned feature extractor, a trainable fixed-size fully connected layer, and a rewritable, dynamically growing memory that stores as many vectors as the number of encountered classes.

• Subspace Reg. (Akyürek et al., 2021): The authors presented a straightforward approach that enables the use of logistic regression classifiers for few-shot incremental learning.
The key to this approach is a new family of subspace regularization schemes that encourage the weight vectors of new classes to lie close to the subspace spanned by the weights of existing classes.

• Entropy-Reg (Liu et al., 2022): The authors proposed using data-free replay to synthesize data with a generator without accessing real data.

• ALICE (Peng et al., 2022): The authors proposed Augmented Angular Loss Incremental Classification (ALICE), inspired by the similarity between the goals of FSCIL and modern face recognition systems. Instead of the commonly used cross-entropy loss, ALICE uses an angular penalty loss to obtain well-clustered features.

Table 5: Classification accuracy of ResNet18 on CIFAR-100 for 5-way 5-shot incremental learning with the same class split as in TOPIC (Cheraghian et al., 2021).

This point makes it difficult to train AANet on a few samples, even though its session-1 performance is comparable with SoftNet, as shown in Table 8.

E LIMITATIONS AND FUTURE WORKS

Our method employs two sets of subnetworks: the major subnetwork and the minor subnetwork. Since the former retains the base-session knowledge, tuning the major subnetwork risks losing previously acquired knowledge. Furthermore, we explicitly divide SoftNet by a magnitude criterion; as a result, if the SoftNet parameters are exposed, the essential parameters become vulnerable to intentional attacks, which could leak the knowledge maintained by SoftNet. First, to avoid tuning the major subnetwork, the new-session learner should know the model sparsity required to maintain base-session knowledge. Second, to address the information-leakage issue, the binary mask should be encoded by a compression method to reduce model capacity and protect the privacy of task knowledge. Moreover, in FSCIL tasks, SoftNet alleviates overfitting while effectively maintaining base-session performance. In future work, we will consider expanding the model parameters to acquire a long sequence of incoming new class knowledge depending on the data or task size, i.e., CIL tasks.



Figure 1: Incremental Soft-Subnetwork (SoftNet): (a) a dense neural network is randomly initialized for base-session (s = 1) training; (b) SoftNet is trained with the major subnetwork m_major = 1 (thick solid lines) and the minor subnetwork m_minor ∼ U(0, 1); (c) SoftNet updates only a few minor weights (thin solid lines) for each new session s.

Input: model weights θ, score weights s, layer-wise capacity c
1: // Training over base classes (t = 1)
2: Randomly initialize θ and s
3: for epoch e = 1, 2, ... do
4:   Obtain the soft mask m_soft from m_major and m_minor ∼ U(0, 1) at each layer
5:   for batch b_t ∼ D^t do
6:     Compute L_base(θ ⊙ m_soft; b_t) by Eq. 3
7:     θ ← θ − α (∂L/∂θ) ⊙ m_soft
8:     s ← s − α (∂L/∂s) ⊙ m_soft
9:   end for
10: end for
11: // Incremental learning (t ≥ 2)
12: Combine the training data D^t and the exemplars saved in previous few-shot sessions
13: for epoch e = 1, 2, ... do
14:   for batch b_t ∼ D^t do
15:
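Line 4 of the procedure above can be sketched as follows. This is a simplified NumPy version; the layer shape, the top-k thresholding rule over the score weights, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_mask(score, c=0.8, rng=rng):
    """Build m_soft for one layer: the top-c fraction of score weights forms the
    major subnetwork (mask = 1); the rest is the minor subnetwork with mask ~ U(0, 1)."""
    k = int(round(c * score.size))
    thresh = np.sort(score.ravel())[-k]           # k-th largest score value
    m_major = (score >= thresh).astype(float)     # 1 on major weights, 0 elsewhere
    m_minor = rng.uniform(0.0, 1.0, score.shape)  # soft values for minor weights
    return m_major + (1.0 - m_major) * m_minor    # exactly 1 on major, U(0,1) on minor

score = rng.normal(size=(64, 64))                 # learnable score weights s for a layer
m = soft_mask(score, c=0.8)
assert np.isclose((m == 1.0).mean(), 0.8, atol=0.01)  # ~80% of weights are major
assert ((m > 0.0) & (m <= 1.0)).all()
```

In lines 7–8, the same m_soft multiplies the gradients of θ and s, so major weights receive full updates while minor weights take randomly scaled steps.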

where d(·, ·) denotes the cosine distance, p_o is the prototype of class o, O = ∪_{i=1}^{t} O^i refers to all encountered classes, and D = D^t ∪ P denotes the union of the current training data D^t and the exemplar set P = {P_2, ..., P_{t-1}}, where P_{t_e} (2 ≤ t_e < t) is the set of exemplars saved in session t_e. Note that the prototypes of new classes are computed by p
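The cosine-distance prototype classification described above might be sketched as follows (a minimal NumPy version; the embedding dimension, class counts, and random features are illustrative, not the trained backbone's outputs):

```python
import numpy as np

def cosine_dist(a, b):
    """d(a, b) = 1 - cos(a, b) for row vectors in a and b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - a @ b.T

rng = np.random.default_rng(0)
n_classes, dim = 65, 512                      # e.g. 60 base + 5 novel classes

# Prototype of each class o: the mean embedding of its (few) training samples.
prototypes = np.stack([rng.normal(size=(5, dim)).mean(axis=0) for _ in range(n_classes)])

def predict(embedding):
    """Assign the class whose prototype has the smallest cosine distance."""
    return int(np.argmin(cosine_dist(embedding[None, :], prototypes)[0]))

query = prototypes[42] + 0.01 * rng.normal(size=dim)  # a query near class 42's prototype
assert predict(query) == 42
```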

Figure 2: Classification accuracy of SoftNet on CIFAR-100 and miniImageNet for 5-way 5-shot FSCIL: the overall performance depends on the capacity c and the softness of the subnetwork. Note that solid, dashed, and dash-dotted lines denote overall, base, and novel class performances, respectively.

Figure 3: Comparison of subnetworks (HardNet and SoftNet) with state-of-the-art methods.

Figure 4: Performance of HardNet vs. SoftNet on CIFAR-100 for 5-way 5-shot FSCIL: the overall performance depends on the capacity c and the softness of the subnetwork. Note that solid, dashed, and dash-dotted lines denote overall, base, and novel class performances, respectively.

Figure 5: Performance of HardNet vs. SoftNet on miniImageNet for 5-way 5-shot FSCIL: the overall performance depends on the capacity c and the softness of the subnetwork. Note that solid, dashed, and dash-dotted lines denote overall, base, and novel class performances, respectively.

Figure 6: t-SNE plots of HardNet vs. SoftNet on miniImageNet for 5-way 5-shot FSCIL: the plots show the embeddings of the even-numbered test class samples for comparison. Note that the session-1 class set is {0, ..., 59} and the session-2 novel class set is {60, ..., 64}.

Figure 7: Loss landscapes of DenseNet, HardNet, and SoftNet: subnetworks provide a flatter global minimum than dense neural networks. To obtain the loss landscapes, we trained a simple three-layer fully connected model (fc-4-25-30-3) on the Iris flower dataset (a three-class classification problem) for 100 epochs.
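A landscape slice of this kind can be reproduced along the following lines. This is a toy sketch: we use a small quadratic-plus-quartic loss with finite-difference Hessian eigenvectors rather than the paper's fc-4-25-30-3 network, following the two-eigenvector visualization of Yao et al. (2020).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
G = rng.normal(size=(d, d))
A = G @ G.T / d                                # SPD curvature matrix
theta_star = rng.normal(size=d)                # the (trained) minimizer

def loss(theta):
    """Toy loss with its global minimum at theta_star."""
    z = theta - theta_star
    return 0.5 * z @ A @ z + 0.1 * np.sum(z ** 4)

def grad(theta, eps=1e-6):
    """Central finite-difference gradient."""
    g = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

def hessian(theta, eps=1e-4):
    """Finite-difference Hessian (adequate for a 10-d toy problem)."""
    H = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d); e[i] = eps
        H[:, i] = (grad(theta + e) - grad(theta - e)) / (2 * eps)
    return 0.5 * (H + H.T)

# Top-2 Hessian eigenvectors at the solution span the 2-D slice to visualize.
_, eigvec = np.linalg.eigh(hessian(theta_star))
v1, v2 = eigvec[:, -1], eigvec[:, -2]          # eigh sorts eigenvalues ascending

# Evaluate the loss on the grid theta* + a*v1 + b*v2 and plot it as a surface.
grid = np.linspace(-1.0, 1.0, 21)
landscape = np.array([[loss(theta_star + a * v1 + b * v2) for b in grid] for a in grid])
assert landscape.min() >= loss(theta_star)     # the slice is minimized at theta*
```

Flatness is then read off from how slowly `landscape` grows away from its center; masking out sharp directions, as in HardNet/SoftNet, shrinks that growth.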

In FSCIL, the base session D^1 usually contains a large number of classes with sufficient training data for each class. In contrast, each subsequent session (t ≥ 2) contains only a small number of classes with a few training samples per class; e.g., the t-th session D^t is often presented as an N-way K-shot task. In each training session t, the model can access only the training data D^t and a few exemplars stored in previous sessions. When the training of session t is completed, we evaluate the model on test samples from all encountered classes O = ∪_{i=1}^{t} O^i.
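Assuming the 60-base-class, eight-session 5-way 5-shot split used for CIFAR-100 and miniImageNet, the session bookkeeping above can be sketched as:

```python
# Build the class sets O^t for a 60-base, 8 x (5-way 5-shot) FSCIL split.
n_base, n_way, n_sessions = 60, 5, 9          # session 1 is the base session

def session_classes(t):
    """Classes introduced in session t (1-indexed)."""
    if t == 1:
        return list(range(n_base))
    start = n_base + (t - 2) * n_way
    return list(range(start, start + n_way))

def eval_classes(t):
    """O = union of all classes encountered up to session t (joint evaluation)."""
    return [c for s in range(1, t + 1) for c in session_classes(s)]

assert len(session_classes(1)) == 60
assert session_classes(2) == [60, 61, 62, 63, 64]
assert len(eval_classes(n_sessions)) == 100   # all 100 classes after the last session
```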

Classification accuracy of ResNet18 on miniImageNet for 5-way 5-shot incremental learning. Underbar denotes results comparable with FSLL (Mazumder et al., 2021). * denotes the results reported from Shi et al. (2021).

Classification accuracy of ResNet18 on miniImageNet for 5-way 5-shot incremental learning: layer-wise inspection with fixed c = 97%. all denotes that the minor weights m_minor of all layers were trained; conv*x denotes that only the minor weights of the conv*x layers were trained.

Classification accuracy of ResNet18, 20, 32, and 50 on CIFAR-100 for 5-way 5-shot FSCIL.

In the 2D t-SNE embedding space, the overall discriminability of SoftNet is better than that of HardNet on both the base and novel class sets. This 70% minor subnetwork affects SoftNet positively in base-session training and offers well-initialized weights for novel-session training.

As shown in Table 7, the performances of SoftNet were comparable with those of ALICE and LIMIT, considering that ALICE used class/data augmentations and LIMIT added an extra multi-head attention layer. Comparisons of SoftNet and AANet. Our SoftNet and AANet (Liu et al., 2021) were proposed to alleviate catastrophic forgetting in FSCIL and CIL, respectively. AANet consists of multiple ResNets: one residual block learns new knowledge while another is fine-tuned to maintain previously learned knowledge. Through a learnable scaling parameter for the linear combination of the multi-ResNet features, AANet showed outstanding performance in the CIL setting. However, AANet tends to overfit since the ResNet block's parameters are fully used to fit a few new class samples in FSCIL.

Classification accuracy of ResNet18 on miniImageNet for 5-way 5-shot incremental learning with the same class split as in TOPIC (Cheraghian et al., 2021). * denotes results reported from Shi et al. (2021). † represents our reproduced results.

Classification accuracy of ResNet18 on CUB-200-2011 for 10-way 5-shot incremental learning (TOPIC class split; Tao et al. (2020)). * denotes results reported from Shi et al. (2021). † represents our reproduced results.

Classification accuracy of ResNet18 on CIFAR-100 for 5-way 5-shot incremental learning with the same class split as in TOPIC (Cheraghian et al., 2021). * denotes the results reported from Shi et al. (2021). † represents our reproduced results.

Classification accuracy of ResNet18 on CIFAR-100 for 5-way 5-shot incremental learning. Underbar denotes results comparable with the baseline. * denotes the results reported from Shi et al. (2021).

Classification accuracy of ResNet18 vs. ResNet20 on CIFAR-100 for 5-way 5-shot FSCIL with varying capacity c. Underbar denotes results comparable with the baseline. * denotes results reported from Shi et al. (2021).

Method (ResNet18)      Session 1 – 9                                              Δ
cRT*                   65.18 63.89 60.20 57.23 53.71 50.39 48.77 47.29 45.28      –
Joint-training*        65.18 61.45 57.36 53.68 50.84 47.33 44.79 42.62 40.08   -5.20
Baseline*              65.18 61.67 58.61 55.11 51.86 49.43 47.60 45.64 43.83   -1.45
HardNet, c = 10%       28.97 27.71 26.08 24.68 23.34 22.02 21.12 20.48 19.50  -25.78
HardNet, c = 20%       37.42 35.29 33.22 31.32 29.51 27.80 26.54 25.28 24.16  -21.12
HardNet, c = 30%       55.47 52.37 49.38 46.53 43.88 41.50 39.58 37.82 36.06   -9.22
HardNet, c = 40%       57.52 53.85 50.62 47.74 44.90 42.64 40.76 38.95 37.07   -8.21
HardNet, c = 50%       64.80 60.77 56.95 53.53 50.40 47.82 45.93 43.95 41.91   -3.37
HardNet, c = 60%       66.72 62.21 58.14 54.60 51.47 48.86 46.67 44.67 42.66   -2.62
HardNet, c = 70%       68.27 63.52 59.45 55.89 52.91 50.30 48.27 46.25 44.22   -1.06
HardNet, c = 80%       69.65 64.60 60.59 56.93 53.60 50.80 48.69 46.69 44.63   -0.65
HardNet, c = 90%       70.85 65.84 61.59 57.92 54.65 51.90 49.79 47.66 45.47   +0.19
HardNet, c = 93%       71.22 66.20 62.00 58.34 55.04 52.34 50.22 48.07 46.04   +0.76
HardNet, c = 95%       71.73 66.31 62.17 58.44 54.98 52.20 50.17 47.97 45.87   +0.59
HardNet, c = 97%       71.85 66.48 62.29 58.62 55.36 52.55 50.60 48.43 46.22   +0.94
HardNet, c = 99%       71.95 66.83 62.75 59.09 55.92 53.03 50.78 48.52 46.31   +1.03
SoftNet, c = 10%       60.77 57.02 53.62 50.51 47.67 45.14 43.32 41.60 39.58   -5.70
SoftNet, c = 20%       64.67 60.69 57.15 53.77 50.76 48.28 46.24 44.23 42.31   -2.97
SoftNet, c = 30%       67.00 62.18 58.22 54.69 51.82 49.12 47.13 44.98 42.44   -2.84
SoftNet, c = 40%       67.50 63.11 59.29 55.61 52.53 49.85 47.85 45.84 43.85   -1.43
SoftNet, c = 50%       69.20 64.18 60.01 56.43 53.11 50.62 48.60 46.51 44.61   -0.67
SoftNet, c = 60%       69.15 63.68 59.54 56.05 52.72 50.10 48.20 46.18 44.15

ACKNOWLEDGEMENT

This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-01381, Development of Causal AI through Video Understanding and Reinforcement Learning, and Its Applications to Real Environments) and partly supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00184, Development and Study of AI Technologies to Inexpensively Conform to Evolving Policy on Ethics).

AVAILABILITY

https://github.com/ihaeyong/

A EXPERIMENTAL DETAILS

We validate the effectiveness of the soft-subnetwork in our method on several benchmark datasets against various architecture-based methods for Few-Shot Class Incremental Learning (FSCIL). To proceed with the details of our experiments, we first explain the datasets and how we involve them in our experiments. Later, we detail experiment setups, including architecture details, preprocessing, and training budget.

A.1 DATASETS

The following datasets are used for comparisons:

CIFAR-100. Each class contains 500 training images and 100 test images of size 32 × 32. We follow the same FSCIL procedure as in (Shi et al., 2021), dividing the dataset into a base session with 60 classes and eight novel sessions, each a 5-way 5-shot problem.

miniImageNet. miniImageNet consists of RGB images from 100 classes, where each class contains 500 training images and 100 test images of size 84 × 84. Originally proposed for few-shot learning problems, miniImageNet is part of the much larger ImageNet dataset. Compared with CIFAR-100, the miniImageNet dataset is more complex and suitable for prototyping. Its setup is similar to that of CIFAR-100: following the procedure described in (Shi et al., 2021), we use 60 base classes and eight novel sessions of 5-way 5-shot problems.

CUB-200-2011. CUB-200-2011 contains 200 fine-grained bird species with 11,788 images in total, with a varying number of images per class. We split the dataset into 6,000 training images and 6,000 test images as in (Tao et al., 2020). During training, we randomly crop each image to size 224 × 224. We fix the first 100 classes as base classes and use all samples of these classes to train the model; the remaining 100 classes are treated as novel categories split into ten novel sessions with a 10-way 5-shot problem in each session.

A.2 EXPERIMENT SETUPS

We begin this section by describing the setups for the CIFAR-100 and miniImageNet experiments, followed by the configuration used for the CUB-200-2011 dataset.

CIFAR-100 and miniImageNet. For these two datasets, we use an NVIDIA RTX 8000 GPU with CUDA 11.0. We randomly split each dataset into multiple sessions as described in the previous subsection, run each algorithm ten times with a fixed split, and report the mean accuracy. We adopt ResNet18 (He et al., 2016) as the backbone network. For data augmentation, we use standard random crops and horizontal flips. During base-session training, we select the top-c% weights at each layer and acquire the optimal soft-subnetwork with the best validation accuracy. For each incremental few-shot session, we train the model for six epochs with a learning rate of 0.02. New class session samples are trained using only a few minor weights of the soft-subnetwork (the conv4_x layer of ResNet18 and the conv3_x layer of ResNet20) obtained during base-session training.

CUB-200-2011. Besides the experiments on the previous two datasets, we conduct an additional experiment on this dataset, prepared following the split procedure described in the previous subsection. We run each algorithm ten times and report the mean accuracy. We again adopt ResNet18 (He et al., 2016) as the backbone network and follow the same data augmentation and base-session training procedure as in the other two datasets. In each incremental few-shot session t > 1, the total number of training epochs is 10 and the learning rate is 0.1. New class session samples are trained using only a few minor weights of the soft-subnetwork (the conv4_x layer of ResNet18) obtained at the base session.
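The few-shot-session update, in which only minor weights are trainable, might look like the following sketch (a single layer with a random major/minor split for illustration; in the paper the split comes from learned scores and only specific layers' minor weights are updated):

```python
import numpy as np

rng = np.random.default_rng(0)

# One layer's weights with its major/minor split from the base session.
theta = rng.normal(size=(64, 64))
m_major = (rng.random(theta.shape) < 0.97).astype(float)   # c = 97% major weights
m_minor = 1.0 - m_major                                    # only these are trainable now

def sgd_step(theta, grad, lr=0.02):
    """New-session update: gradients are masked so major weights stay frozen,
    preserving base-session knowledge while a few minor weights adapt."""
    return theta - lr * grad * m_minor

grad = rng.normal(size=theta.shape)        # stand-in for a backpropagated gradient
theta_new = sgd_step(theta, grad)
changed = theta_new != theta
assert np.array_equal(changed, m_minor.astype(bool))       # only minor weights moved
```

Freezing the major subnetwork this way is what controls forgetting; the small trainable minor fraction is what limits overfitting to the few novel samples.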


Published as a conference paper at ICLR 2023 

