NEURAL COLLAPSE INSPIRED FEATURE-CLASSIFIER ALIGNMENT FOR FEW-SHOT CLASS INCREMENTAL LEARNING

Abstract

Few-shot class-incremental learning (FSCIL) is a challenging problem, as only a few training samples are accessible for each novel class in the new sessions. Finetuning the backbone or adjusting the classifier prototypes trained in prior sessions inevitably causes a misalignment between the features and classifiers of old classes, which explains the well-known catastrophic forgetting problem. In this paper, we deal with this misalignment dilemma in FSCIL inspired by the recently discovered phenomenon named neural collapse, which reveals that the last-layer features of the same class will collapse into a vertex, and the vertices of all classes are aligned with the classifier prototypes, together forming a simplex equiangular tight frame (ETF). This corresponds to an optimal geometric structure for classification due to the maximized Fisher Discriminant Ratio. We propose a neural collapse inspired framework for FSCIL. A group of classifier prototypes is pre-assigned as a simplex ETF for the whole label space, including the base session and all the incremental sessions. During training, the classifier prototypes are not learnable, and we adopt a novel loss function that drives the features towards their corresponding prototypes. Theoretical analysis shows that our method holds the neural collapse optimality and does not break the feature-classifier alignment in an incremental fashion. Experiments on the miniImageNet, CUB-200, and CIFAR-100 datasets demonstrate that our proposed framework outperforms the state-of-the-art methods.

1. INTRODUCTION

Learning incrementally and learning with few-shot data are common in real-world applications, and in many settings, such as robotics, the two demands emerge simultaneously. Despite the great success of deep learning in a closed label space, it is still challenging for a deep model to learn new classes continually with only limited samples (LeCun et al., 2015). To this end, few-shot class-incremental learning (FSCIL) was proposed to tackle this problem (Tao et al., 2020b). Compared with few-shot learning (Ravi & Larochelle, 2017; Vinyals et al., 2016), FSCIL transfers a trained model into new label spaces incrementally. It also differs from incremental learning (Cauwenberghs & Poggio, 2000; Li & Hoiem, 2017; Rebuffi et al., 2017) in that only a few (usually 5) samples are accessible for each new class in the incremental sessions. For each session's evaluation, the model is required to infer test images coming from all the classes that have been encountered. The base session of FSCIL contains a large label space and sufficient training samples, while each incremental session only has a few novel classes and labeled images. This poses the notorious catastrophic forgetting problem (Goodfellow et al., 2013) because the novel sessions have no access to the data of the previous sessions. Due to its importance and difficulty, FSCIL has attracted much research attention. The initial solutions to FSCIL finetune the network on new session data with distillation schemes to reduce the forgetting of old classes (Tao et al., 2020b; Dong et al., 2021). However, the few-shot data in novel sessions can easily induce over-fitting. Following studies favor training a backbone network on the base session as a feature extractor (Zhang et al., 2021; Hersche et al., 2022; Akyürek et al., 2022).

Figure 1: A popular choice in prior studies is to evolve the old-class prototypes via delicate designs of loss or regularizer to keep them separated from novel-class prototypes, but this causes misalignment. As a comparison, we pre-assign and fix an optimal feature-classifier alignment, and then train a model towards the same neural collapse optimality in each session to avoid target conflict.

For novel sessions, the backbone network is fixed and a group of novel-class prototypes (classifier vectors) is learned incrementally. But as shown in Figure 1(a), the newly added prototypes may lie close to the old-class prototypes, which impedes the ability to discriminate between old-class and novel-class samples in evaluation. As a result, adjusting the classifier prototypes is always necessary for two goals: (i) keep a sufficient distance between the old-class and the novel-class prototypes; (ii) prevent the adjusted old-class prototypes from shifting far away from their original positions. However, the two goals rely on sophisticated loss functions or regularizers (Chen & Lee, 2021; Hersche et al., 2022; Akyürek et al., 2022), and are hard to attain simultaneously without compromise. Besides, as shown in Figure 1(a), there will be a misalignment between the adjusted classifier and the fixed features of old classes. A recent study proposes to reserve feature space for novel classes to circumvent their conflict with old classes (Zhou et al., 2022a), but an optimal feature-classifier alignment is hard to guarantee with a learnable classifier (Pernici et al., 2021). We point out that it is this misalignment dilemma between feature and classifier that causes the catastrophic forgetting of old classes. If a backbone network is finetuned in novel sessions, the features of old classes will easily deviate from their classifier prototypes. Alternatively, when a backbone network is fixed and a group of new prototypes for novel classes is learned incrementally, the adjustment of old-class prototypes will also induce misalignment with their fixed features. In this paper, we pose and study the following question: "Can we look for and pre-assign an optimal feature-classifier alignment such that the model is optimized towards this fixed optimality, and thus avoids conflict among sessions?"

1.1. MOTIVATIONS AND CONTRIBUTIONS

Neural collapse is a recently discovered phenomenon: at the terminal phase of training (after 0 training error rate), the last-layer features of the same class will collapse into a single vertex, and the vertices of all classes will be aligned with their classifier prototypes and together form a simplex equiangular tight frame (ETF) (Papyan et al., 2020). A simplex ETF is a geometric structure of $K$ vectors in $\mathbb{R}^d$, $d \ge K-1$. All vectors have the same $\ell_2$ norm of 1, and any pair of different vectors has an inner product of $-\frac{1}{K-1}$, which corresponds to the largest possible angle among $K$ equiangular vectors. Particularly, when $d = K-1$, a simplex ETF reduces to a regular simplex, such as a triangle or a tetrahedron. It describes an optimal geometric structure for classification due to the minimized within-class variance and the maximized between-class variance (Martinez & Kak, 2001), which indicates that the Fisher Discriminant Ratio (Fisher, 1936; Rao, 1948) is maximized. Following studies aim to theoretically explain this phenomenon (Fang et al., 2021; Han et al., 2022). It is expected that imperfect training conditions, such as imbalance, cannot induce neural collapse and will cause deteriorated performance (Fang et al., 2021; Yang et al., 2022b). Training in an incremental fashion will also break the neural collapse optimality. Since neural collapse offers us an optimal structure where features and their classifier prototypes are aligned, we can pre-assign such a structure and train the model towards this optimality. Inspired by this insight, in this paper, we initialize a group of classifier prototypes $\hat{\mathbf{W}}_{\mathrm{ETF}} \in \mathbb{R}^{d\times(K_0+K')}$ as a simplex ETF for the whole label space, where $K_0$ is the number of classes in the base session and $K'$ is the number of classes in all the incremental sessions. As shown in Figure 1(b), it serves as the optimization target and is kept fixed throughout the training of all sessions.
We append a projection layer after the backbone network and store the mean latent feature of each class output by the backbone in a memory. In the training of incremental sessions, we only finetune the projection layer using a novel loss function that drives the final features towards their corresponding target prototypes. Without bells and whistles, our method achieves superior performance and relieves the catastrophic forgetting problem. The contributions of this paper can be summarized as follows:
• To relieve the misalignment dilemma in FSCIL, we propose to pre-assign an optimal alignment inspired by neural collapse as a fixed target throughout the incremental learning. Our model is trained towards the same optimality to avoid optimization conflict among sessions.
• We fix the prototypes and apply a novel loss function that only finetunes a projection layer to drive the output features into their corresponding prototypes. Theoretical and empirical analyses show that our method better holds the neural collapse optimality.
• Experiments on miniImageNet, CIFAR-100, and CUB-200 demonstrate that our method surpasses the state-of-the-art performances. In particular, our method achieves an average accuracy improvement of more than 3.5% over a recent strong baseline on both miniImageNet and CIFAR-100.

2. RELATED WORK

Few-shot class-incremental learning (FSCIL). As a variant of class-incremental learning (CIL) (Cauwenberghs & Poggio, 2000; Li & Hoiem, 2017; Rebuffi et al., 2017), FSCIL only has a few novel classes and training samples in each incremental session (Tao et al., 2020b; Dong et al., 2021), which increases the tendency of overfitting on novel classes (Snell et al., 2017; Sung et al., 2018). Both CIL and FSCIL require a delicate balance between adapting a model well to novel classes and forgetting less of old classes (Zhao et al., 2021). A popular choice is to use meta learning (Yoon et al., 2020; Chi et al., 2022; Zhou et al., 2022b). Some studies try to make base and incremental sessions compatible via pseudo-features (Cheraghian et al., 2021b; Zhou et al., 2022a), augmentation (Peng et al., 2022), or looking for a flat minimum (Shi et al., 2021). For training in incremental sessions, the new prototypes for novel classes should be separable from the old-class prototypes. Meanwhile, the adjustment of old-class prototypes should not induce large shifts. Current studies widely rely on evolving the prototypes (Zhang et al., 2021; Zhu et al., 2021a) or sophisticated designs of losses and regularizers (Ren et al., 2019; Hou et al., 2019; Tao et al., 2020a; Joseph et al., 2022; Lu et al., 2022; Hersche et al., 2022; Chen & Lee, 2021; Akyürek et al., 2022; Yang et al., 2022a). However, the two goals have an inherent conflict, and a tough effort to balance the loss terms is necessary. In contrast, our method pre-assigns and fixes a feature-classifier alignment as the optimality. A model is trained towards the same target in all sessions, and we only use a single loss without any regularizer.

Neural collapse. Neural collapse describes an elegant geometric structure of the last-layer features and classifier in a well-trained model (Papyan et al., 2020). It inspires later studies to theoretically explain this phenomenon.
Based on a simplified model that only considers the last-layer optimization, neural collapse has been proved to be the global optimality of balanced training with the CE (Weinan & Wojtowytsch, 2020; Graf et al., 2021; Lu & Steinerberger, 2020; Fang et al., 2021; Zhu et al., 2021b; Ji et al., 2022) and the MSE (Mixon et al., 2020; Poggio & Liao, 2020; Zhou et al., 2022c; Han et al., 2022; Tirer & Bruna, 2022) loss functions. Recent studies try to induce neural collapse in imbalanced training by fixing a classifier (Yang et al., 2022b; Zhong et al., 2023) or designing a novel loss (Xie et al., 2023). Our method is inspired by Yang et al. (2022b), but we apply the classifier in an incremental fashion. Galanti et al. (2022) show that neural collapse remains valid when transferring a model to new samples or classes. To the best of our knowledge, we are the first to study FSCIL from the neural collapse perspective, which offers our method sound interpretability.

3.1. FEW-SHOT CLASS-INCREMENTAL LEARNING (FSCIL)

In real-world applications, one often needs to adapt a model to data coming from a new label space with only a few labeled samples. FSCIL trains a model incrementally on a sequence of training datasets $\{\mathcal{D}^{(0)}, \mathcal{D}^{(1)}, \dots, \mathcal{D}^{(T)}\}$, where $\mathcal{D}^{(t)} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{D}^{(t)}|}$, $\mathcal{D}^{(0)}$ is the base session, and $T$ is the number of incremental sessions. The base session $\mathcal{D}^{(0)}$ usually contains a large label space $\mathcal{C}^{(0)}$ and sufficient training images for each class $c \in \mathcal{C}^{(0)}$. In each incremental session $\mathcal{D}^{(t)}$, $t > 0$, there are only a few labeled images, and we have $|\mathcal{D}^{(t)}| = pq$, where $p$ is the number of classes and $q$ is the number of samples per novel class, known as $p$-way $q$-shot. The label space $\mathcal{C}^{(t)}$ has no overlap with any other session, i.e., $\mathcal{C}^{(t)} \cap \mathcal{C}^{(t')} = \emptyset$, $\forall t \ne t'$. For any incremental session $t > 0$, we only have access to the data in $\mathcal{D}^{(t)}$, and the training sets of the previous sessions are not available. For evaluation in session $t$, the test dataset comes from all the encountered classes in the previous and current sessions, i.e., the label space $\cup_{i=0}^{t}\mathcal{C}^{(i)}$. Therefore, FSCIL suffers from severe data scarcity and imbalance. It requires a model to be adaptable to novel classes while keeping its ability on old classes.
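To make the protocol concrete, the following sketch (our illustration, not from the paper; the numbers follow the commonly used miniImageNet-style split of 60 base classes plus eight 5-way 5-shot sessions, and the helper names are ours) encodes the disjoint session label spaces and the growing evaluation space:

```python
# Sketch of the FSCIL data protocol (illustrative numbers, hypothetical helpers).
K0, T, p, q = 60, 8, 5, 5

def session_label_spaces(K0, T, p):
    """Disjoint label spaces C^(0), ..., C^(T): base classes first,
    then p novel classes per incremental session."""
    spaces = [set(range(K0))]
    for t in range(1, T + 1):
        start = K0 + (t - 1) * p
        spaces.append(set(range(start, start + p)))
    return spaces

spaces = session_label_spaces(K0, T, p)

# C^(t) and C^(t') never overlap for t != t'.
assert all(spaces[t].isdisjoint(spaces[s]) for t in range(T + 1) for s in range(t))
# Each incremental session contributes |D^(t)| = p * q labeled images.
assert all(len(spaces[t]) == p for t in range(1, T + 1))

# Evaluation in session t covers the union of all encountered classes.
def eval_space_at(t):
    return set().union(*spaces[:t + 1])

assert len(eval_space_at(T)) == K0 + T * p  # 100 classes in the last session
```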

3.2. NEURAL COLLAPSE

Neural collapse refers to a phenomenon at the terminal phase of training (after 0 training error rate) on balanced data (Papyan et al., 2020). It reveals a geometric structure formed by the last-layer features and classifier, which can be defined as follows.

Definition 1 (Simplex Equiangular Tight Frame) A simplex equiangular tight frame (ETF) refers to a matrix composed of $K$ vectors in $\mathbb{R}^d$ that satisfies:
$$\mathbf{E} = \sqrt{\tfrac{K}{K-1}}\,\mathbf{U}\left(\mathbf{I}_K - \tfrac{1}{K}\mathbf{1}_K\mathbf{1}_K^T\right), \quad (1)$$
where $\mathbf{E} = [\mathbf{e}_1, \cdots, \mathbf{e}_K] \in \mathbb{R}^{d\times K}$, $\mathbf{U} \in \mathbb{R}^{d\times K}$ allows a rotation and satisfies $\mathbf{U}^T\mathbf{U} = \mathbf{I}_K$, $\mathbf{I}_K$ is the identity matrix, and $\mathbf{1}_K$ is an all-ones vector. All column vectors in $\mathbf{E}$ have the same $\ell_2$ norm, and any pair has an inner product of $-\frac{1}{K-1}$, i.e.,
$$\mathbf{e}_{k_1}^T\mathbf{e}_{k_2} = \tfrac{K}{K-1}\delta_{k_1,k_2} - \tfrac{1}{K-1}, \quad \forall k_1, k_2 \in [1, K], \quad (2)$$
where $\delta_{k_1,k_2} = 1$ when $k_1 = k_2$, and 0 otherwise.

The neural collapse phenomenon includes the following four properties:
(NC1): The last-layer features of the same class collapse into their within-class mean, i.e., the covariance $\Sigma_W^{(k)} \to \mathbf{0}$, where $\Sigma_W^{(k)} = \mathrm{Avg}_i\{(\mu_{k,i} - \mu_k)(\mu_{k,i} - \mu_k)^T\}$, $\mu_{k,i}$ is the feature of sample $i$ in class $k$, and $\mu_k$ is the within-class mean of the class-$k$ features;
(NC2): The within-class means of all classes, centered by the global mean, converge to the vertices of a simplex ETF as defined in Definition 1, i.e., $\tilde{\mu}_k$, $1 \le k \le K$, satisfy Eq. (2), where $\tilde{\mu}_k = (\mu_k - \mu_G)/\|\mu_k - \mu_G\|$ and $\mu_G$ is the global mean;
(NC3): The within-class means centered by the global mean are aligned with (parallel to) their corresponding classifier weights, which means the classifier weights converge to the same simplex ETF, i.e., $\tilde{\mu}_k = \mathbf{w}_k/\|\mathbf{w}_k\|$, $1 \le k \le K$, where $\mathbf{w}_k$ is the classifier weight of class $k$;
(NC4): When (NC1)-(NC3) hold, the model prediction using logits simplifies to the nearest class center, i.e., $\arg\max_k \langle \mu, \mathbf{w}_k \rangle = \arg\min_k \|\mu - \mu_k\|$, where $\langle\cdot,\cdot\rangle$ is the inner product operator and $\mu$ is the last-layer feature of a sample for prediction.
Neural collapse corresponds to an optimal feature-classifier alignment for classification due to the maximized Fisher Discriminant Ratio (between-class variance to within-class variance).
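The simplex ETF of Definition 1 can be constructed and checked numerically. The following sketch (ours, in NumPy; not code from the paper) builds $\mathbf{E}$ via Eq. (1) with a random orthonormal $\mathbf{U}$ and verifies the norm and inner-product properties of Eq. (2):

```python
import numpy as np

def simplex_etf(d, K, seed=0):
    """Build E = sqrt(K/(K-1)) * U (I_K - (1/K) 1_K 1_K^T), as in Eq. (1)."""
    assert d >= K - 1
    rng = np.random.default_rng(seed)
    # U: d x K with orthonormal columns (U^T U = I_K), from a reduced QR.
    U, _ = np.linalg.qr(rng.standard_normal((d, K)))
    return np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)

d, K = 16, 10
E = simplex_etf(d, K)
G = E.T @ E  # Gram matrix of the K prototypes

# Every prototype has unit l2 norm (diagonal of Eq. (2)) ...
assert np.allclose(np.diag(G), 1.0)
# ... and every distinct pair has inner product -1/(K-1), the largest
# possible common angle among K equiangular vectors.
assert np.allclose(G[~np.eye(K, dtype=bool)], -1.0 / (K - 1))
# The prototypes also sum to the zero vector.
assert np.allclose(E.sum(axis=1), 0.0)
```

The same construction is used later for the ETF classifier, with $K$ replaced by the size of the whole label space.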

4. METHOD

Neural collapse describes an optimal geometric structure for classification problems, where the last-layer features and the classifier prototype of the same class are aligned, and those of different classes are maximally separated. However, this structure is broken under imperfect training conditions, such as imbalanced training data (Fang et al., 2021; Yang et al., 2022b). As illustrated in Figure 1(a), training in an incremental fashion will also break the neural collapse optimality. Inspired by this perspective, what we should do for FSCIL is to keep the neural collapse inspired feature-classifier alignment as sound as possible. Concretely, we adopt a fixed classifier and a novel loss function, as described in Section 4.1 and Section 4.2, respectively. We introduce our framework for FSCIL in Section 4.3. Finally, in Section 4.4, we conduct a theoretical analysis to show how our method better holds the neural collapse optimality in an incremental fashion.

4.1. ETF CLASSIFIER

Assume that the base session contains a label space of $K_0$ classes, each incremental session has $p$ classes, and there are $T$ incremental sessions in total. The whole label space of this FSCIL problem has $K_0 + K'$ classes, where $K' = Tp$, i.e., we need to learn a model that can recognize samples from $K_0 + K'$ classes. We denote a backbone network as $f$, and then we have $\mu = f(x; \theta_f)$, where $\mu \in \mathbb{R}^d$ is the output feature of input $x$, and $\theta_f$ denotes the backbone network parameters. A popular choice in current studies is to learn $f$ and $\mathbf{W}^{(0)}$ using the base session data, where $\mathbf{W}^{(0)} \in \mathbb{R}^{d\times K_0}$ contains the classifier prototypes for the base classes. In incremental sessions $t > 0$, $f$ is fixed as a feature extractor, and only $\mathbf{W}^{(t)} \in \mathbb{R}^{d\times p}$ for the novel classes is learnable. As shown in Figure 1(a), one needs to adjust $\{\mathbf{W}^{(0)}, \cdots, \mathbf{W}^{(t)}\}$ via sophisticated losses or regularizers to ensure separation among these prototypes (Akyürek et al., 2022; Hersche et al., 2022). But this inevitably introduces misalignment between the adjusted prototypes and the fixed features of old classes, which is an underlying reason for the catastrophic forgetting problem (Joseph et al., 2022). Since neural collapse describes an optimal geometric structure of the last-layer features and classifier, we pre-assign such an optimality by fixing the classifier as the structure instructed by neural collapse. Following Yang et al. (2022b), we adopt an ETF classifier that is initialized as a simplex ETF and kept fixed during training. The difference lies in that we use it in an incremental fashion. Concretely, we randomly initialize classifier prototypes $\hat{\mathbf{W}}_{\mathrm{ETF}} \in \mathbb{R}^{d\times(K_0+K')}$ by Eq. (1) for the whole label space, i.e., the union of classes of all sessions, $\cup_{i=0}^{T}\mathcal{C}^{(i)}$. We have $K_0 = |\mathcal{C}^{(0)}|$ and $K' = \sum_{i=1}^{T}|\mathcal{C}^{(i)}| = Tp$.
Then any pair $(k_1, k_2)$ of classifier prototypes in $\hat{\mathbf{W}}_{\mathrm{ETF}}$ satisfies:
$$\hat{\mathbf{w}}_{k_1}^T\hat{\mathbf{w}}_{k_2} = \tfrac{K_0+K'}{K_0+K'-1}\delta_{k_1,k_2} - \tfrac{1}{K_0+K'-1}, \quad \forall k_1, k_2 \in [1, K_0+K'], \quad (3)$$
where $\hat{\mathbf{w}}_{k_1}$ and $\hat{\mathbf{w}}_{k_2}$ are two column vectors in $\hat{\mathbf{W}}_{\mathrm{ETF}}$. Our ETF classifier ensures that the prototypes of the whole label space have the maximal pairwise separation. It serves as a fixed target along the incremental training to avoid conflict among sessions. We only need to learn a model whose output features are aligned with this pre-assigned structure.

4.2. DOT-REGRESSION LOSS

The gradient of the cross-entropy (CE) loss with respect to the last-layer feature is composed of a pull term that drives the feature towards the classifier prototype of its own class, and a push term that pushes it away from the prototypes of the other classes. As pointed out by Yang et al. (2022b), when the classifier prototypes are fixed as an optimality, the pull term always points towards the solution, so we can drop the push gradient, which may be inaccurate. Accordingly, we adopt a novel loss named dot-regression (DR) loss (Yang et al., 2022b):
$$L(\hat{\mu}_i, \hat{\mathbf{W}}_{\mathrm{ETF}}) = \tfrac{1}{2}\left(\hat{\mathbf{w}}_{y_i}^T\hat{\mu}_i - 1\right)^2, \quad (4)$$
where $\hat{\mu}_i$ is the normalized feature, i.e., $\hat{\mu}_i = \mu_i/\|\mu_i\|$ with $\mu_i = f(x_i; \theta_f)$, $y_i$ is the label of input $x_i$, and $\hat{\mathbf{w}}_{y_i}$ is the fixed prototype in $\hat{\mathbf{W}}_{\mathrm{ETF}}$ for class $y_i$, with $\|\hat{\mathbf{w}}_{y_i}\| = 1$ by Eq. (3). The total loss is an average over a batch of inputs. The gradient of Eq. (4) with respect to $\hat{\mu}_i$ takes the form:
$$\partial L/\partial \hat{\mu}_i = -\left(1 - \cos\angle(\hat{\mu}_i, \hat{\mathbf{w}}_{y_i})\right)\hat{\mathbf{w}}_{y_i}.$$
It is shown that the gradient pulls the feature $\hat{\mu}_i$ towards the direction of $\hat{\mathbf{w}}_{y_i}$, its pre-assigned target prototype. Finally, the converged features will be aligned with $\hat{\mathbf{W}}_{\mathrm{ETF}}$, and thus the geometric structure instructed by neural collapse is attained. The theoretical advantage of the DR loss has been proved in Yang et al. (2022b). In experiments, we compare the DR loss with the CE loss to show its effectiveness in FSCIL.

[Figure 2: framework overview — backbone network $f$, projection layer $g$, session data $\mathcal{D}^{(t)}$ and label spaces $\mathcal{C}^{(t)}$ ($1 \le t \le T$), and the fixed ETF classifier $\hat{\mathbf{W}}_{\mathrm{ETF}}$.]
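A minimal sketch of the DR loss and its gradient (ours, in NumPy; the prototype and feature here are random unit vectors standing in for $\hat{\mathbf{w}}_{y_i}$ and $\hat{\mu}_i$), with a finite-difference check of the closed-form gradient:

```python
import numpy as np

def dr_loss(mu_hat, w_y):
    """Dot-regression loss of Eq. (4): 0.5 * (w_y^T mu_hat - 1)^2."""
    return 0.5 * (w_y @ mu_hat - 1.0) ** 2

def dr_grad(mu_hat, w_y):
    # d/d mu_hat [0.5 (w^T mu - 1)^2] = (w^T mu - 1) w. With ||w|| = ||mu|| = 1,
    # w^T mu = cos(angle), giving the pull-only form -(1 - cos(angle)) w.
    return -(1.0 - w_y @ mu_hat) * w_y

rng = np.random.default_rng(0)
w = rng.standard_normal(8); w /= np.linalg.norm(w)    # fixed ETF prototype
mu = rng.standard_normal(8); mu /= np.linalg.norm(mu) # normalized feature

# Finite-difference check of the closed-form gradient.
eps = 1e-6
num = np.array([(dr_loss(mu + eps * e, w) - dr_loss(mu - eps * e, w)) / (2 * eps)
                for e in np.eye(8)])
assert np.allclose(num, dr_grad(mu, w), atol=1e-5)
# The loss vanishes exactly when the feature sits on its prototype.
assert np.isclose(dr_loss(w, w), 0.0)
```

Note the gradient only ever points along the fixed target prototype, i.e., there is no push term.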

4.3. NC-FSCIL

Based on the ETF classifier and the DR loss, we now introduce our neural collapse inspired framework for few-shot class-incremental learning (NC-FSCIL). As shown in Figure 2, our model is composed of two components, a backbone network $f$ and a projection layer $g$. Concretely, we have:
$$\hat{\mu}_i = \mu_i/\|\mu_i\|, \quad \mu_i = g(h_i; \theta_g), \quad h_i = f(x_i; \theta_f), \quad (5)$$
where $\theta_f$ and $\theta_g$ denote the parameters of the backbone network and the projection layer, respectively. We use the normalized output feature $\hat{\mu}_i$ to compute the error signal by Eq. (4). In the base session $t = 0$, we jointly train both $f$ and $g$ using the base session data. The empirical risk to minimize in the base session can be formulated as:
$$\min_{\theta_f,\theta_g} \frac{1}{|\mathcal{D}^{(0)}|} \sum_{(x_i,y_i)\in\mathcal{D}^{(0)}} L\left(\hat{\mu}_i, \hat{\mathbf{W}}_{\mathrm{ETF}}\right), \quad (6)$$
where $\hat{\mathbf{W}}_{\mathrm{ETF}}$ is the pre-assigned ETF classifier as introduced in Section 4.1, $L$ is the DR loss as introduced in Section 4.2, and $\hat{\mu}_i$ is a function of $f$ and $g$ as shown in Eq. (5). In each incremental session $1 \le t \le T$, we fix the backbone network $f$ as a feature extractor and only finetune the projection layer $g$. As a widely adopted practice in FSCIL studies, a small memory of samples or features of old classes can be retained to relieve the overfitting on novel classes (Cheraghian et al., 2021a; Chen & Lee, 2021; Akyürek et al., 2022; Hersche et al., 2022). Following Hersche et al. (2022), we only keep a memory $\mathcal{M}^{(t)}$ of the mean intermediate feature $h_c$ for each old class $c$. Concretely, we have:
$$\mathcal{M}^{(t)} = \{h_c \mid c \in \cup_{j=0}^{t-1}\mathcal{C}^{(j)}\}, \quad h_c = \mathrm{Avg}_i\{f(x_i; \theta_f) \mid y_i = c\}, \quad 1 \le t \le T, \quad (7)$$
where $f$ has been fixed after the base session. Then we use $\mathcal{D}^{(t)}$ as the input of $f$, and $\mathcal{M}^{(t)}$ as the input of $g$, to finetune the projection layer $g$. The empirical risk to minimize in incremental sessions can be formulated as:
$$\min_{\theta_g} \frac{1}{|\mathcal{D}^{(t)}| + |\mathcal{M}^{(t)}|} \Big[ \sum_{(x_i,y_i)\in\mathcal{D}^{(t)}} L\left(\hat{\mu}_i, \hat{\mathbf{W}}_{\mathrm{ETF}}\right) + \sum_{(h_c,y_c)\in\mathcal{M}^{(t)}} L\left(\hat{\mu}_c, \hat{\mathbf{W}}_{\mathrm{ETF}}\right) \Big], \quad (8)$$
where $\hat{\mu}_i$ and $\hat{\mu}_c$ are the output features of $x_i$ and $h_c$, respectively, $|\mathcal{D}^{(t)}|$ is the number of training samples in session $t$, and $|\mathcal{M}^{(t)}| = \sum_{j=0}^{t-1}|\mathcal{C}^{(j)}|$.
Thanks to our pre-assigned alignment, we do not rely on any regularizer in training. In the evaluation of session $t$, we predict an input $x$ based on the inner products between its output feature $\hat{\mu}$ and the ETF classifier prototypes: $\arg\max_k \langle \hat{\mu}, \hat{\mathbf{w}}_k \rangle$, $1 \le k \le K_0+K'$.
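The incremental-session update can be sketched as follows (a toy NumPy stand-in, not the paper's code: random class-mean vectors replace the frozen backbone's intermediate features, a plain linear map replaces the projection layer, and vanilla gradient descent replaces the actual optimizer; only the projection matrix is updated while the ETF prototypes stay fixed):

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d, K = 12, 8, 6   # intermediate dim, feature dim, total classes (toy sizes)

# Fixed ETF classifier for the whole label space, built as in Eq. (1).
Q, _ = np.linalg.qr(rng.standard_normal((d, K)))
W_etf = np.sqrt(K / (K - 1)) * Q @ (np.eye(K) - np.ones((K, K)) / K)

# Stand-ins for the frozen backbone's outputs: class centers plus noise for the
# new session's few-shot data, and stored class means h_c for the memory M^(t).
centers = rng.standard_normal((K, d_h))
batch = [(centers[k] + 0.1 * rng.standard_normal(d_h), k)
         for k in (4, 5) for _ in range(5)]          # toy 2-way 5-shot session
memory = [(centers[c], c) for c in range(4)]         # means of old classes
data = batch + memory

G = 0.1 * rng.standard_normal((d, d_h))              # linear projection g

def mean_dr_loss(G):
    loss = 0.0
    for h, y in data:
        mu = G @ h
        mu_hat = mu / np.linalg.norm(mu)
        loss += 0.5 * (W_etf[:, y] @ mu_hat - 1.0) ** 2
    return loss / len(data)

loss0 = mean_dr_loss(G)
for _ in range(400):                                 # finetune g only
    grad = np.zeros_like(G)
    for h, y in data:
        mu = G @ h
        n = np.linalg.norm(mu)
        mu_hat = mu / n
        g_hat = (W_etf[:, y] @ mu_hat - 1.0) * W_etf[:, y]  # DR grad, Eq. (4)
        # Chain rule through the normalization mu_hat = mu / ||mu||.
        g_mu = (np.eye(d) - np.outer(mu_hat, mu_hat)) @ g_hat / n
        grad += np.outer(g_mu, h) / len(data)
    G -= 0.02 * grad

assert mean_dr_loss(G) < loss0  # features pulled toward their fixed prototypes
```

The key point mirrored here is that the objective of Eq. (8) mixes new-session data with memory features under one loss and one fixed target, with no regularizer.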

4.4. THEORETICAL SUPPORTS

We perform our theoretical analysis on a simplified model that drops the backbone network and keeps only the last-layer features and classifier prototypes as independent optimization variables. This simplification has been widely adopted in prior studies to facilitate analysis (Graf et al., 2021; Fang et al., 2021; Zhu et al., 2021b). We investigate the neural collapse optimality of an incremental problem of $T$ sessions with our ETF classifier. Concretely, we consider the following problem:
$$\min_{\mathbf{M}^{(t)}} \frac{1}{N^{(t)}} \sum_{k=1}^{K^{(t)}} \sum_{i=1}^{n_k} L\left(\mathbf{m}_{k,i}^{(t)}, \hat{\mathbf{W}}_{\mathrm{ETF}}\right), \quad 0 \le t \le T, \quad \text{s.t.} \ \|\mathbf{m}_{k,i}^{(t)}\|^2 \le 1, \ \forall 1 \le k \le K^{(t)}, \ 1 \le i \le n_k, \quad (9)$$
where $\mathbf{m}_{k,i}^{(t)} \in \mathbb{R}^d$ denotes a feature variable belonging to the $i$-th sample of class $k$ in session $t$, $n_k$ is the number of samples in class $k$, $K^{(t)}$ is the number of classes in session $t$, $N^{(t)} = \sum_{k=1}^{K^{(t)}} n_k$ is the number of samples in session $t$, and $\mathbf{M}^{(t)} \in \mathbb{R}^{d\times N^{(t)}}$ denotes the collection of $\mathbf{m}_{k,i}^{(t)}$. $\hat{\mathbf{W}}_{\mathrm{ETF}} \in \mathbb{R}^{d\times K}$ refers to the ETF classifier for the whole label space as introduced in Section 4.1, and we have $K = \sum_{t=0}^{T} K^{(t)}$. $L$ can be either the cross-entropy or the dot-regression loss function.

Theorem 1 Let $\hat{\mathbf{M}}^{(t)}$ denote the global minimizer of Eq. (9) obtained by optimizing the model incrementally from $t = 0$, and let $\hat{\mathbf{M}} = [\hat{\mathbf{M}}^{(0)}, \cdots, \hat{\mathbf{M}}^{(T)}] \in \mathbb{R}^{d\times\sum_{t=0}^{T}N^{(t)}}$. When $L$ in Eq. (9) is the CE or DR loss, for any column vector $\hat{\mathbf{m}}_{k,i}$ in $\hat{\mathbf{M}}$ whose class label is $k$, we have:
$$\|\hat{\mathbf{m}}_{k,i}\| = 1, \quad \hat{\mathbf{m}}_{k,i}^T\hat{\mathbf{w}}_{k'} = \tfrac{K}{K-1}\delta_{k,k'} - \tfrac{1}{K-1}, \quad \forall k, k' \in [1, K], \ 1 \le i \le n_k, \quad (10)$$
where $K = \sum_{t=0}^{T} K^{(t)}$ is the total number of classes of the whole label space, $\delta_{k,k'} = 1$ when $k = k'$ and 0 otherwise, and $\hat{\mathbf{w}}_{k'}$ is the prototype of class $k'$ in $\hat{\mathbf{W}}_{\mathrm{ETF}}$.

The proof of Theorem 1 can be found in Appendix A. Eq. (10) indicates that the global minimizer $\hat{\mathbf{M}}$ of Eq. (9) satisfies the neural collapse condition, i.e., features of the same class collapse into a single vertex, and the vertices of all classes are aligned with $\hat{\mathbf{W}}_{\mathrm{ETF}}$ as a simplex ETF.
It shows that the feature space is equally separated by the prototypes of all classes. More importantly, in problem Eq. (9), the numbers of classes $K^{(t)}$ among the $T+1$ sessions and the numbers of samples $n_k$ among the $K$ classes can be imbalanced, which corresponds to the challenging setting of FSCIL.
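Theorem 1's conclusion for the DR loss is easy to check numerically: under the norm constraint, placing every feature on its own prototype is feasible, attains the global minimum of the nonnegative DR objective, and realizes the inner products of Eq. (10). A small NumPy check (ours, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 8, 5
Q, _ = np.linalg.qr(rng.standard_normal((d, K)))
W = np.sqrt(K / (K - 1)) * Q @ (np.eye(K) - np.ones((K, K)) / K)  # ETF, Eq. (1)

def dr(m, k):
    """DR loss of a feature m against the fixed prototype of class k."""
    return 0.5 * (W[:, k] @ m - 1.0) ** 2

# The collapsed solution m_{k,i} = w_k is feasible (unit norm) and attains
# L = 0, the global minimum of the nonnegative DR objective ...
for k in range(K):
    assert np.isclose(np.linalg.norm(W[:, k]), 1.0)
    assert np.isclose(dr(W[:, k], k), 0.0)

# ... and satisfies Eq. (10): m^T w_k' = K/(K-1) * delta - 1/(K-1).
for k in range(K):
    for k2 in range(K):
        target = K / (K - 1) * (k == k2) - 1.0 / (K - 1)
        assert np.isclose(W[:, k] @ W[:, k2], target)

# Any other feasible point has a strictly larger loss, since w^T m < 1
# whenever a unit-norm m differs from the unit-norm w.
m_rand = rng.standard_normal(d); m_rand /= np.linalg.norm(m_rand)
assert dr(m_rand, 0) > 0
```

Note the check uses only the DR loss; the CE case of Theorem 1 needs the KKT argument in Appendix A, since its minimum is not simply zero.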

5. EXPERIMENTS

In this section, we test our method on FSCIL benchmark datasets including miniImageNet (Russakovsky et al., 2015) , CIFAR-100 (Krizhevsky et al., 2009) , and CUB-200 (Wah et al., 2011) , and compare it with state-of-the-art methods. We also perform ablation studies to validate the effects of ETF classifier and DR loss. Finally, we show the feature-classifier structure achieved by our method.

5.1. IMPLEMENTATION DETAILS

Please refer to Appendix B for our implementation details.

5.2. COMPARISON WITH THE STATE-OF-THE-ART

Our results on miniImageNet and CIFAR-100 are reported in Table 1 and Table 2, respectively, and the result on CUB-200 is deferred to Appendix C. We see that our method achieves the best performance in all sessions on both miniImageNet and CIFAR-100 compared with previous studies. ALICE (Peng et al., 2022) is a recent study that achieves strong performances on FSCIL. Compared with this challenging baseline, we have an improvement of 2.61% in the last session on miniImageNet, and 2.01% on CIFAR-100. We achieve an average accuracy improvement of more than 3.5% on both miniImageNet and CIFAR-100. Although we do not surpass ALICE in the last session on CUB-200, we still have the best average accuracy among all methods. As shown in the last rows of Table 1 and Table 2, the improvement of our method lasts and even becomes larger than in the first several sessions. It indicates that our method is able to hold its superiority and relieve the forgetting of old sessions.

5.3. ABLATION STUDIES

We consider three models to validate the effects of the ETF classifier and the DR loss. All three models are based on the same framework introduced in Section 4.3, including the backbone network, the projection layer, and the memory module. The first model uses a learnable classifier and the CE loss, which is the most adopted practice. The second model only replaces the classifier with our ETF classifier and still uses the CE loss. The third model corresponds to our method using both the ETF classifier and the DR loss. As shown in Table 3, when a fixed ETF classifier is used, the final session accuracies are significantly better, and the performance drops are much mitigated. Adopting the DR loss further improves the performance moderately. It indicates that the success of our method is largely attributed to the ETF classifier and the DR loss, as they pre-assign a neural collapse inspired alignment and drive the model towards this fixed optimality, respectively.

5.4. FEATURE-CLASSIFIER STRUCTURE

We check the feature-classifier alignment instructed by neural collapse using our method, with "Learnable+CE" as a comparison. As shown in Figure 3, the average cosine similarities between features and classifier prototypes of different classes, i.e., $\mathrm{Avg}_{k'\ne k}\{\cos\angle(\mathbf{m}_k - \mathbf{m}_G, \mathbf{w}_{k'})\}$, of our method are consistently lower than those of the baseline. Most values of our method are negative and close to 0, which is in line with the guidance from neural collapse as derived in Eq. (10). Particularly, in Figure 3b and Figure 3d, the average cosine similarities between $\mathbf{m}_k - \mathbf{m}_G$ and $\mathbf{w}_{k'}$ ($k \ne k'$) among all encountered classes increase quickly over the sessions for the baseline method, while ours stay relatively flat. It indicates that the baseline method reduces the feature-classifier margin between different classes as training proceeds incrementally, while our method enjoys a stable alignment. As shown in Figure 4 and Figure 5, we also calculate the average cosine similarities between the feature and classifier of the same class, i.e., $\mathrm{Avg}_k\{\cos\angle(\mathbf{m}_k - \mathbf{m}_G, \mathbf{w}_k)\}$, and the trace ratio of within-class covariance to between-class covariance, $\mathrm{tr}(\Sigma_W)/\mathrm{tr}(\Sigma_B)$. These results together support that our method better holds the feature-classifier alignment and relieves the forgetting problem.
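The two diagnostics above can be sketched as follows (ours, on synthetic near-collapsed features rather than real network outputs; the normalization of the trace ratio is simplified, which does not affect the comparison):

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, n = 8, 5, 20
Q, _ = np.linalg.qr(rng.standard_normal((d, K)))
W = np.sqrt(K / (K - 1)) * Q @ (np.eye(K) - np.ones((K, K)) / K)  # prototypes

# Toy features near full collapse: each sample sits close to its prototype.
feats = {k: W[:, k] + 0.01 * rng.standard_normal((n, d)) for k in range(K)}
means = np.stack([feats[k].mean(0) for k in range(K)])  # class means m_k
m_G = means.mean(0)                                     # global mean

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Cross-class metric Avg_{k' != k} cos(m_k - m_G, w_k'): near -1/(K-1)
# under neural collapse, per Eq. (10).
cross = np.mean([cos(means[k] - m_G, W[:, k2])
                 for k in range(K) for k2 in range(K) if k2 != k])
assert abs(cross - (-1.0 / (K - 1))) < 0.05

# Same-class alignment Avg_k cos(m_k - m_G, w_k): near 1 under collapse.
same = np.mean([cos(means[k] - m_G, W[:, k]) for k in range(K)])
assert same > 0.99

# Trace ratio of within- to between-class covariance: near 0 under collapse.
Sw = np.mean([np.mean(np.sum((feats[k] - means[k]) ** 2, axis=1))
              for k in range(K)])
Sb = np.mean(np.sum((means - m_G) ** 2, axis=1))
assert Sw / Sb < 1e-2
```

For diffuse, poorly aligned features these three quantities would drift away from $-1/(K-1)$, 1, and 0, which is exactly the trend the baseline shows in Figures 3-5.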

6. CONCLUSION

In this paper, we propose to fix a learnable classifier as the geometric structure instructed by neural collapse for FSCIL. It pre-assigns an optimal feature-classifier alignment as a fixed target throughout incremental training, which avoids optimization conflict among sessions. Accordingly, a novel loss function that drives features towards this pre-assigned optimality is adopted, without any regularizer. Both theoretical and empirical results support that our method is able to hold the alignment in an incremental fashion and thus relieve the forgetting problem. In FSCIL experiments, we match and even surpass the state-of-the-art performances on three datasets.

A APPENDIX: PROOF OF THEOREM 1

Our proof follows Yang et al. (2022b). We consider the problem in Eq. (9):

$\min_{M^{(t)}} \frac{1}{N^{(t)}} \sum_{k=1}^{K^{(t)}} \sum_{i=1}^{n_k} \mathcal{L}\big(m^{(t)}_{k,i}, \hat{W}_{\mathrm{ETF}}\big), \quad 0 \le t \le T, \quad \text{s.t. } \|m^{(t)}_{k,i}\|^2 \le 1, \ \forall 1 \le k \le K^{(t)}, \ 1 \le i \le n_k,$ (9)

where $m^{(t)}_{k,i} \in \mathbb{R}^d$ denotes the feature variable of the $i$-th sample of class $k$ in session $t$, $n_k$ is the number of samples in class $k$, $K^{(t)}$ is the number of classes in session $t$, $N^{(t)} = \sum_{k=1}^{K^{(t)}} n_k$ is the number of samples in session $t$, and $M^{(t)} \in \mathbb{R}^{d \times N^{(t)}}$ denotes the collection of the $m^{(t)}_{k,i}$. $\hat{W}_{\mathrm{ETF}} \in \mathbb{R}^{d \times K}$ refers to the ETF classifier for the whole label space as introduced in Section 4.1. We have $K = \sum_{t=0}^{T} K^{(t)}$ and

$\hat{w}_k^T \hat{w}_{k'} = \frac{K}{K-1}\delta_{k,k'} - \frac{1}{K-1}, \quad \forall k, k' \in [1, K],$ (11)

where $\hat{w}_k$ and $\hat{w}_{k'}$ are two column vectors in $\hat{W}_{\mathrm{ETF}}$. From the definition of a simplex ETF in Eq. (1), we have $\hat{W}_{\mathrm{ETF}} \cdot 1_K = 0_d$, where $1_K$ is the all-ones vector in $\mathbb{R}^K$ and $0_d$ is the all-zeros vector in $\mathbb{R}^d$. Then we have

$\sum_{k=1}^{K} \hat{w}_k = 0_d.$ (12)

When $\mathcal{L}$ is the dot-regression (DR) loss in Eq. (4), it is easy to identify that $\mathcal{L} \ge 0$, and the equality holds if and only if $\hat{w}_k^T m^{(t)}_{k,i} = 1$, $\forall 0 \le t \le T$, $1 \le k \le K$, $1 \le i \le n_k$. Since $\|\hat{w}_k\| = 1$ and $\|m^{(t)}_{k,i}\|^2 \le 1$, we have $\hat{w}_k^T m^{(t)}_{k,i} \le 1$ by the Cauchy-Schwarz inequality, and the equality holds if and only if $m^{(t)}_{k,i} = \hat{w}_k$. Denote $\bar{M} = [\bar{M}^{(0)}, \cdots, \bar{M}^{(T)}] \in \mathbb{R}^{d \times \sum_{t=0}^{T} N^{(t)}}$ as the global optimality of Eq. (9) for all sessions $0 \le t \le T$. For any column vector $\bar{m}_{k,i}$ in $\bar{M}$, we have

$\|\bar{m}_{k,i}\| = 1, \quad \bar{m}_{k,i}^T \hat{w}_{k'} = \frac{K}{K-1}\delta_{k,k'} - \frac{1}{K-1}, \quad \forall k, k' \in [1, K], \ 1 \le i \le n_k,$

which concludes the proof for the DR loss.

When $\mathcal{L}$ is the cross-entropy (CE) loss, i.e.,

$\mathcal{L}\big(m^{(t)}_{k,i}, \hat{W}_{\mathrm{ETF}}\big) = -\log \frac{\exp(\hat{w}_k^T m^{(t)}_{k,i})}{\sum_{j=1}^{K} \exp(\hat{w}_j^T m^{(t)}_{k,i})},$ (13)

where $0 \le t \le T$, $1 \le k \le K^{(t)}$, and $1 \le i \le n_k$. Since the problem is separable among the $T+1$ sessions, we only analyze the $t$-th session and omit the superscript $(t)$ for simplicity. The objective in Eq. (13) is the sum of an affine function and log-sum-exp functions. When $\hat{W}_{\mathrm{ETF}}$ is fixed, the loss is convex w.r.t. $m_{k,i}$ with convex constraints, so we can use the KKT condition for its global optimality. Based on Eq. (9) and Eq. (13), we have the Lagrange function

$\tilde{\mathcal{L}} = \frac{1}{N^{(t)}} \sum_{k=1}^{K^{(t)}} \sum_{i=1}^{n_k} -\log \frac{\exp(\hat{w}_k^T m_{k,i})}{\sum_{j=1}^{K} \exp(\hat{w}_j^T m_{k,i})} + \sum_{k=1}^{K^{(t)}} \sum_{i=1}^{n_k} \lambda_{k,i} \big(\|m_{k,i}\|^2 - 1\big),$ (14)

where $\lambda_{k,i} \ge 0$ is the Lagrange multiplier. The gradient with respect to $m_{k,i}$ takes the form of

$\frac{\partial \tilde{\mathcal{L}}}{\partial m_{k,i}} = -\frac{1 - p_k}{N^{(t)}} \hat{w}_k + \frac{1}{N^{(t)}} \sum_{j \ne k} p_j \hat{w}_j + 2\lambda_{k,i} m_{k,i},$ (15)

where $1 \le k \le K^{(t)}$, $1 \le i \le n_k$, and $p_j$ is the softmax probability of $m_{k,i}$ for the $j$-th class, i.e.,

$p_j = \frac{\exp(\hat{w}_j^T m_{k,i})}{\sum_{j'=1}^{K} \exp(\hat{w}_{j'}^T m_{k,i})}.$ (16)

Since $\|\hat{w}_k\| = 1$ and $\|m_{k,i}\| \le 1$, we have $0 < p_k < 1$, $\forall 1 \le k \le K$. We now solve the equation $\frac{\partial \tilde{\mathcal{L}}}{\partial m_{k,i}} = 0$. Assume that $\lambda_{k,i} = 0$; then we have

$\sum_{j \ne k} p_j \hat{w}_j = (1 - p_k) \hat{w}_k.$ (17)

Since $1 - p_k = \sum_{j \ne k} p_j$ and Eq. (11), multiplying both sides of Eq. (17) by $\hat{w}_k$ gives

$\frac{K}{K-1}(1 - p_k) = 0,$ (18)

which contradicts $0 < p_k < 1$. Then we have the other case, $\lambda_{k,i} > 0$. Based on the KKT condition (complementary slackness), the global optimality $\bar{m}_{k,i}$ satisfies

$\|\bar{m}_{k,i}\|^2 = 1.$ (19)

The equation $\frac{\partial \tilde{\mathcal{L}}}{\partial \bar{m}_{k,i}} = 0$ leads to

$\sum_{j \ne k} p_j (\hat{w}_j - \hat{w}_k) + 2N^{(t)} \lambda_{k,i} \bar{m}_{k,i} = 0.$ (20)

Based on Eq. (11), for any $j \ne k$, multiplying both sides of Eq. (20) by $\hat{w}_j$ gives

$p_j \frac{K}{K-1} + 2N^{(t)} \lambda_{k,i} \bar{m}_{k,i}^T \hat{w}_j = 0.$ (21)

Since $p_k > 0$, $\forall k \in [1, K]$, we have $\bar{m}_{k,i}^T \hat{w}_j < 0$. Then for any $j_1, j_2 \ne k$,

$\frac{p_{j_1}}{p_{j_2}} = \frac{\exp(\hat{w}_{j_1}^T \bar{m}_{k,i})}{\exp(\hat{w}_{j_2}^T \bar{m}_{k,i})} = \frac{\hat{w}_{j_1}^T \bar{m}_{k,i}}{\hat{w}_{j_2}^T \bar{m}_{k,i}},$ (22)

where the second equality follows from Eq. (21). The function $f(x) = \exp(x)/x$ is strictly monotonic (decreasing) for $x < 0$, and both inner products in Eq. (22) are negative, so Eq. (22) indicates that

$p_{j_1} = p_{j_2}, \quad \hat{w}_{j_1}^T \bar{m}_{k,i} = \hat{w}_{j_2}^T \bar{m}_{k,i}, \quad \forall j_1, j_2 \ne k,$ (23)

and

$p_j = \frac{1 - p_k}{K - 1} = -\frac{2N^{(t)} \lambda_{k,i} \bar{m}_{k,i}^T \hat{w}_j (K-1)}{K}, \quad \forall j \ne k.$ (24)

Multiplying both sides of Eq. (20) by $\hat{w}_k$ gives

$-\frac{K}{K-1}(1 - p_k) + 2N^{(t)} \lambda_{k,i} \bar{m}_{k,i}^T \hat{w}_k = 0.$ (25)

Combining Eq. (24) and Eq. (25), we have

$\bar{m}_{k,i}^T \hat{w}_j (K - 1) + \bar{m}_{k,i}^T \hat{w}_k = 0, \quad \forall j \ne k.$ (26)

Based on $p_j = \frac{1 - p_k}{K-1}$, $\forall j \ne k$, and Eq. (12), we can rewrite Eq. (20) as

$-\frac{(1 - p_k) K}{K - 1} \hat{w}_k + 2N^{(t)} \lambda_{k,i} \bar{m}_{k,i} = 0,$ (27)

which means that $\bar{m}_{k,i}$ is aligned with $\hat{w}_k$, i.e., $\cos \angle(\bar{m}_{k,i}, \hat{w}_k) = 1$. Given that $\|\hat{w}_k\| = 1$ and $\|\bar{m}_{k,i}\| = 1$ (Eq. (19)), we have $\bar{m}_{k,i}^T \hat{w}_k = 1$, and Eq. (26) leads to

$\bar{m}_{k,i}^T \hat{w}_j = -\frac{1}{K-1}, \quad \forall j \ne k.$

Therefore, for any column vector $\bar{m}_{k,i}$ in $\bar{M}$, we have

$\|\bar{m}_{k,i}\| = 1, \quad \bar{m}_{k,i}^T \hat{w}_{k'} = \frac{K}{K-1}\delta_{k,k'} - \frac{1}{K-1}, \quad \forall k, k' \in [1, K], \ 1 \le i \le n_k,$

which concludes the proof for the CE loss.

Training Details. We adopt the standard data pre-processing and augmentation schemes, including random resizing, random flipping, and color jittering (Tao et al., 2020b; Zhang et al., 2021; Peng et al., 2022). We train all models with a batch size of 512 in the base session, and a batch size of 64 (containing new session data and intermediate features in the memory) in each incremental session. On miniImageNet, we train for 500 epochs in the base session, and 100-170 iterations in each incremental session. The initial learning rate is 0.25 for the base session, and 0.025 for the incremental sessions. On CIFAR-100, we train for 200 epochs in the base session, and 50-200 iterations in each incremental session. The initial learning rate is 0.25 for both base and incremental sessions. On CUB-200, we train for 80 epochs in the base session, and 105-150 iterations in each incremental session. The initial learning rates are 0.025 and 0.05 for the base session and incremental sessions, respectively. In all experiments, we adopt a cosine annealing strategy for the learning rate, and use SGD with momentum as the optimizer. Our code will be publicly available in the final version.
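To make the ETF geometry used throughout the proof concrete, the following NumPy sketch (our own illustration, not the released implementation; the function name `simplex_etf` and the random-orthonormal-basis construction are our assumptions) builds a $d \times K$ simplex ETF as $\hat{W}_{\mathrm{ETF}} = \sqrt{K/(K-1)}\, U (I_K - \frac{1}{K} 1_K 1_K^T)$ and numerically checks the inner-product structure of Eq. (11) and the zero-sum property of Eq. (12):

```python
import numpy as np

def simplex_etf(d: int, K: int, seed: int = 0) -> np.ndarray:
    """Return a d x K simplex ETF whose columns are the K class prototypes.

    Requires d >= K - 1 so that K maximally-equiangular unit vectors exist.
    """
    assert d >= K - 1, "need d >= K - 1"
    rng = np.random.default_rng(seed)
    # U: d x K matrix with orthonormal columns (QR of a random Gaussian matrix).
    U, _ = np.linalg.qr(rng.standard_normal((d, K)))
    # W_ETF = sqrt(K/(K-1)) * U (I_K - (1/K) 1_K 1_K^T), following Eq. (1).
    return np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)

K, d = 10, 64
W = simplex_etf(d, K)
G = W.T @ W  # Gram matrix; should equal K/(K-1) * delta_{k,k'} - 1/(K-1)
assert np.allclose(np.diag(G), 1.0)                            # unit-norm prototypes
assert np.allclose(G[~np.eye(K, dtype=bool)], -1.0 / (K - 1))  # Eq. (11), off-diagonal
assert np.allclose(W @ np.ones(K), 0.0)                        # Eq. (12): prototypes sum to 0_d
```

Because the prototypes are fixed targets, this matrix can be generated once for the whole label space before the base session and never updated afterwards.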

C APPENDIX: MORE RESULTS

Our experimental results on CUB-200 are shown in Table 4. We achieve a better last-session accuracy than most of the baseline methods. Although we do not surpass ALICE in the last session on CUB-200, we still have the best average accuracy among all methods. We also visualize the average cosine similarity between the feature and classifier of the same class, i.e., $\mathrm{Avg}_k\{\cos \angle(m_k - m_G, w_k)\}$, and the trace ratio of within-class covariance to between-class covariance, $\mathrm{tr}(\Sigma_W)/\mathrm{tr}(\Sigma_B)$. A higher average $\cos \angle(m_k - m_G, w_k)$ indicates that feature centers are more closely aligned with their corresponding classifier prototypes. As shown in Figure 4, the values of our method are consistently higher than those of the baseline method. Figure 4a and Figure 4d reveal that our method has better feature-classifier alignment in each session of the incremental training on both the train and test sets. When we measure all the classes encountered by each session in Figure 4b and Figure 4e, the metric for our method does not change noticeably after the 4-th session on the train set, while the metric for the baseline method keeps decreasing as training proceeds incrementally. Especially for the base session classes, our method keeps the metric stable on both train and test sets after the decline of the first 3-4 sessions, as shown in Figure 4c and Figure 4f. As a comparison, the baseline method cannot mitigate the deterioration. Given that the base session has the most classes, the performance on base session classes largely decides the final accuracy in the last session for FSCIL. Therefore, the superiority of our method can be attributed to its ability to maintain the feature-classifier alignment for base session classes.
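The alignment metric $\mathrm{Avg}_k\{\cos \angle(m_k - m_G, w_k)\}$ reported in Figure 4 can be computed from extracted features and classifier prototypes as in the following sketch (a minimal NumPy illustration; the function and variable names are ours, not from the released code):

```python
import numpy as np

def avg_same_class_alignment(features: np.ndarray, labels: np.ndarray,
                             W: np.ndarray) -> float:
    """Avg_k cos∠(m_k - m_G, w_k): average cosine between each centered
    class-mean feature and the classifier prototype of the same class.

    features: (N, d) array; labels: (N,) integer class ids; W: (d, K) prototypes.
    """
    m_G = features.mean(axis=0)                      # global feature mean
    cosines = []
    for k in np.unique(labels):
        m_k = features[labels == k].mean(axis=0)     # within-class mean of class k
        v = m_k - m_G
        w_k = W[:, k]
        cosines.append(v @ w_k / (np.linalg.norm(v) * np.linalg.norm(w_k)))
    return float(np.mean(cosines))
```

A value of 1 would indicate perfect feature-classifier alignment in the sense of Figure 4; the cross-class variant of Figure 3 differs only in pairing $m_k - m_G$ with prototypes $w_{k'}$, $k' \ne k$.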
The within-class covariance $\Sigma_W$ and the between-class covariance $\Sigma_B$ are defined as: $\Sigma_W = \mathrm{Avg}_k\{\Sigma_W^{(k)}\}$, $\Sigma_W^{(k)} = \mathrm{Avg}_i\{(m_{k,i} - m_k)(m_{k,i} - m_k)^T\}$, and $\Sigma_B = \mathrm{Avg}_k\{(m_k - m_G)(m_k - m_G)^T\}$, where $m_{k,i}$ is the feature of sample $i$ in class $k$, $m_k$ is the within-class mean of class $k$ features, and $m_G$ denotes the global mean of all features. A lower within-class variation with a higher between-class variation corresponds to a better Fisher Discriminant Ratio. As shown in Figure 5, we compare the trace ratio of within-class covariance to between-class covariance between our method and the baseline method. We observe patterns similar to Figure 4. Concretely, the trace ratio of our method is consistently lower than that of the baseline. For the base session classes, the metric of our method increases more mildly, which corroborates its ability to maintain performance on the old classes, and is in line with the indications from Figure 3 and Figure 4.
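The trace ratio $\mathrm{tr}(\Sigma_W)/\mathrm{tr}(\Sigma_B)$ follows directly from these definitions; a possible NumPy computation (our own sketch, with hypothetical names) is:

```python
import numpy as np

def trace_ratio(features: np.ndarray, labels: np.ndarray) -> float:
    """tr(Σ_W) / tr(Σ_B): within-class over between-class covariance traces.

    Lower is better: tight classes relative to well-separated class means,
    i.e., a better Fisher Discriminant Ratio.
    features: (N, d) array; labels: (N,) integer class ids.
    """
    m_G = features.mean(axis=0)
    classes = np.unique(labels)
    d = features.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for k in classes:
        X = features[labels == k]
        m_k = X.mean(axis=0)
        C = X - m_k
        Sw += C.T @ C / len(X)        # Σ_W^{(k)} = Avg_i (m_{k,i} - m_k)(m_{k,i} - m_k)^T
        v = (m_k - m_G)[:, None]
        Sb += v @ v.T                 # (m_k - m_G)(m_k - m_G)^T
    Sw /= len(classes)                # Σ_W = Avg_k Σ_W^{(k)}
    Sb /= len(classes)                # Σ_B = Avg_k (m_k - m_G)(m_k - m_G)^T
    return float(np.trace(Sw) / np.trace(Sb))
```

For two tight, well-separated clusters the ratio is close to 0, while heavily overlapping clusters push it above 1.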



Different from task-incremental learning, we do not know which session a test sample comes from. We omit the bias term in a linear classifier layer for simplicity.



(a) prior studies: classifier prototypes are learnable. (b) our solution: classifier prototypes are pre-assigned and fixed.

Figure 2: An illustration of our NC-FSCIL. $h_i$ is the intermediate feature from the backbone network $f$. $\mu_i$ is the normalized output feature after the projection layer $g$. $\hat{W}_{\mathrm{ETF}}$ is the ETF classifier that contains prototypes of the whole label space and serves as a fixed target throughout the incremental training. $\mathcal{L}$ denotes the dot-regression loss function. $f$ is frozen in the incremental sessions ($1 \le t \le T$). A small memory of old-class features is widely adopted in prior studies such as Cheraghian et al. (2021a), Chen & Lee (2021), Akyürek et al. (2022), and Hersche et al. (2022).
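The training objective Figure 2 depicts — features normalized after the projection layer and regressed onto their fixed ETF prototype — can be sketched as follows (a minimal NumPy illustration assuming the DR loss form $\frac{1}{2}(\hat{w}_y^T \mu - 1)^2$ from Eq. (4); `dot_regression_loss` is our own name, and the identity matrix in the usage below stands in for a true ETF only for readability):

```python
import numpy as np

def dot_regression_loss(features: np.ndarray, labels: np.ndarray,
                        W_etf: np.ndarray) -> float:
    """Dot-regression loss 1/2 (w_y^T mu - 1)^2, averaged over the batch.

    features: (B, d) outputs of the projection layer g; labels: (B,) ints;
    W_etf: (d, K) fixed prototypes with unit-norm columns (never updated).
    """
    # Normalize features onto the unit sphere, as Figure 2 depicts.
    mu = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Dot product of each normalized feature with its own class prototype only.
    target = np.einsum('bd,db->b', mu, W_etf[:, labels])
    return float(0.5 * np.mean((target - 1.0) ** 2))

# Usage: a feature already aligned with its prototype incurs zero loss.
W = np.eye(2)  # placeholder unit-norm prototypes, not a genuine ETF
loss = dot_regression_loss(np.array([[3., 0.]]), np.array([0]), W)
```

Unlike cross-entropy, the loss involves only the ground-truth prototype, so its gradient pulls each feature toward its own target without pushing against the (fixed) prototypes of old classes.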

Figure 3: Average cosine similarity between features and classifier prototypes of different classes, i.e., $\mathrm{Avg}_{k' \ne k}\{\cos \angle(m_k - m_G, w_{k'})\}$, where $m_k$ is the within-class mean of class $k$ features, $m_G$ denotes the global mean, and $w_{k'}$ is the classifier prototype of class $k'$. Statistics are performed among classes in each session (a and c), and all classes encountered by the current session (b and d), on the train set (a and b) and test set (c and d), for models trained after each session on miniImageNet.

Figure 4: Average cosine similarity between features and classifier prototypes of the same class, i.e., $\mathrm{Avg}_k\{\cos \angle(m_k - m_G, w_k)\}$, where $m_k$ is the within-class mean of class $k$ features, $m_G$ denotes the global mean, and $w_k$ is the classifier prototype of class $k$. Statistics are performed among classes in each session (a and d), all classes encountered by the current session (b and e), and only the base session classes (c and f), on the train set (a, b, c) and test set (d, e, f), for models trained after each session on miniImageNet.

Performance of FSCIL in each session on miniImageNet and comparison with other studies. The top rows list class-incremental learning and few-shot learning results implemented by Tao et al. (2020b); Zhang et al. (2021) in the FSCIL setting. "Average Acc." is the average accuracy of all sessions. "Final Improv." calculates the improvement of our method in the last session. * indicates that the method saves the within-class feature mean of each class for training or inference.

Performance of FSCIL in each session on CIFAR-100 and comparison with other studies. The top rows list class-incremental learning and few-shot learning results implemented by Tao et al.

Ablation studies on three datasets to investigate the effects of ETF classifier and DR loss. "Learnable+CE" uses a learnable classifier and the CE loss; "ETF+CE" adopts our ETF classifier with the CE loss; "ETF+DR" uses both ETF classifier and DR loss. "FINAL" refers to the accuracy of the last session; "AVERAGE" is the average accuracy of all sessions; "PD" denotes the performance drop, i.e., the accuracy difference between the first and the last sessions.


Performance of FSCIL in each session on CUB-200 and comparison with other studies. The top rows list class-incremental learning and few-shot learning results implemented by Tao et al. (2020b); Zhang et al. (2021); Liu et al. (2022); Zhou et al. (2022a) in the FSCIL setting. "Average Acc." is the average accuracy of all sessions. "Final Improv." calculates the improvement of our method in the last session.

ACKNOWLEDGMENTS

Z. Lin was supported by National Key R&D Program of China (2022ZD0160302), the major key project of PCL, China (No. PCL2021A12), the NSF China (No. 62276004), Qualcomm, and Project 2020BD006 supported by PKU-Baidu Fund.

CODE AVAILABILITY

https://github.com/NeuralCollapseApplications/FSCIL 


Ethics Statement. Our study does NOT involve any of the potential issues such as human subjects, public health, privacy, fairness, security, etc. All authors of this paper confirm that they adhere to the ICLR Code of Ethics.

Reproducibility Statement. For our theoretical result Theorem 1, we offer the proof in Appendix A. All datasets used in this paper are public and have been cited. Please refer to Appendix B for the dataset descriptions and the implementation details of our experiments. Our source code is released at https://github.com/NeuralCollapseApplications/FSCIL.

