DO WE NEED NEURAL COLLAPSE? LEARNING DIVERSE FEATURES FOR FINE-GRAINED AND LONG-TAIL CLASSIFICATION

Abstract

Feature extractors learned from supervised training of deep neural networks have demonstrated superior performance over handcrafted ones. Recently, it has been shown that such learned features have a neural collapse property, where within-class features collapse to the class mean and different class means are maximally separated. This paper examines the neural collapse property in the context of fine-grained classification tasks, where a feature extractor pretrained on a classification task with coarse labels is used to generate features for a downstream classification task with fine-grained labels. We argue that within-class feature collapse is an undesirable property for fine-grained classification. Hence, we introduce a geometric arrangement of features called the maximal-separating-cone, where within-class features lie in a cone of nontrivial radius instead of collapsing to the class mean, and cones of different classes are maximally separated. We present a technique based on classifier weight and training loss design to produce such an arrangement. Experimentally, we demonstrate improved fine-grained classification performance with a feature extractor pretrained by our method. Moreover, our technique also provides benefits for classification on data with a long-tail distribution over classes. Our work may motivate future efforts on the design of better geometric arrangements of deep features.

1. INTRODUCTION

The extraction of features or representations from image, language, and speech data in their raw forms is a problem of fundamental interest in machine learning. A standard approach in classical machine learning is to handcraft a feature extractor that maps from the input to the feature space (Lowe, 2004), which often requires meticulous and onerous work by human and domain experts. With the success of deep learning, methods based on learned feature extractors have become very popular. Such an approach not only requires less human expertise, but also exhibits better empirical performance compared to handcrafted feature extractors (Krizhevsky et al., 2017). Moreover, deep feature extractors obtained from a pre-training task can be used for downstream tasks by simply training a task-specific classifier, which offers good empirical performance (Yosinski et al., 2014). The great success of learned features naturally leads to the following question: Are learned features optimal for various application scenarios? In this paper, we show that learned features in a standard deep classification model are not optimal and can be improved by handcrafting a better geometric arrangement of the features. We are motivated by a recent line of work showing that the geometric arrangement of features in a standard deep classification model has a simple and elegant form:

• Within each class, the features are maximally concentrated and collapse to the class mean.

• Across different classes, the features are maximally separated. Namely, the distance between each pair of class means is the same, and that distance is at the maximum.

This phenomenon is called Neural Collapse (NC) (Papyan et al., 2020). The occurrence and prevalence of NC has been verified empirically through experiments with a variety of datasets and network architectures (Han et al., 2021).
Motivated by this observation, recent work has provided theoretical analyses of deep features under so-called unconstrained feature models (Mixon et al., 2020; Fang et al., 2021), where the features are treated as free optimization variables on the grounds that deep neural networks have sufficient expressive power to produce arbitrary features for any training dataset. In particular, Lu & Steinerberger (2020); Weinan & Wojtowytsch (2020); Mixon et al. (2020); Graf et al. (2021); Fang et al. (2021); Ji et al. (2021); Tirer & Bruna (2022) show that NC solutions are the global optima of the corresponding nonconvex objective functions. Moreover, Zhu et al. (2021); Zhou et al. (2022) show that such nonconvex problems have a benign global landscape, so NC can be easily reached by standard training algorithms. The study of NC arguably brings a better understanding of the properties of deep features. Nonetheless, existing studies do not answer the following question: Is Neural Collapse the desired property for deep features?

Contribution. We study whether NC is the desired property for deep features by considering a specific learning task. In particular, we consider the fine-grained classification task, where a deep learning model is first pretrained on a classification dataset with coarse class labels, and the features extracted from this model are used for a downstream classification task with fine-grained classes. For example, the pretraining task may contain a coarse class, say flower, that covers all variants of flowers, while the fine-grained task gives each variant of flower a different class label. If NC occurs, so that all features of the coarse class collapse to the class mean, then such features do not distinguish between different fine-grained classes and hence cannot support successful fine-grained classification.
This simple thought experiment demonstrates that NC is not a desired property for fine-grained classification. Motivated by the observation above, we argue that deep features extracted during pretraining should not have the NC property for the task of fine-grained classification. Instead, the features of each pretraining class should be diverse to a certain extent, so that variation within the class is preserved and can benefit fine-grained classification. To realize this idea, we introduce a geometric arrangement of deep features called the maximal-separating-cone (MSC), where the features of each class lie inside a cone with a nontrivial radius (instead of collapsing to the class mean as in NC), and the axes of the cones are maximally separated. See Figure 2 for an illustration. To obtain such an arrangement, we present a technique based on the simple idea that, if a feature lies inside its cone, its loss is set to a constant by a hinge function. Figure 1 provides a comparison of NC features and our MSC features for fine-grained classification with CIFAR-20 and CIFAR-100. The contributions of this work are summarized as follows.

• We design an MSC arrangement of deep features where within-class features are diverse within a cone instead of collapsing to the class mean (as in NC).

• We present a generic technique based on the design of the classifier weights and training loss function to obtain MSC features.

• We conduct experiments on fine-grained classification to demonstrate the effectiveness of our method. We also provide an ablation study to justify the design of each component in our method.

• Aside from fine-grained classification, we demonstrate that our method also improves the performance of classification with training data that has an imbalanced number of samples per class (i.e., long-tail data).

2. BACKGROUND

In a standard deep learning based classification setup, the model is trained by solving

min_{θ, W, b} (1/N) ∑_{i=1}^N L([v_{i1}, …, v_{iK}], y_i),  s.t.  u_i = f(x_i; θ),  v_{ik} = ⟨u_i, w_k⟩ + b_k,    (1)

where {x_i, y_i}_{i=1}^N is a training dataset of input-label pairs. Above, f(·; θ): ℝ^n → ℝ^m is the feature extractor, implemented as a deep neural network with trainable parameters θ. The output u_i of the feature extractor is referred to as the feature of x_i. The other two optimization variables in (1), namely W = [w_1, …, w_K] ∈ ℝ^{m×K} and b = [b_1, …, b_K] ∈ ℝ^K, are the parameters of a linear classifier. The vector [v_{i1}, …, v_{iK}] is often referred to as the logit vector of x_i. The loss function L(·, ·) is defined on the logits and the label. One of the most popular choices for L(·, ·) is the cross-entropy (CE) loss, but other choices such as the mean squared error loss may also induce NC.

Let θ* denote the network parameters at convergence, µ*_k the mean of the learned features of class k, and µ*_G the global mean of all features. NC refers to the following properties.

(i) NC1: Within-class Variability Collapse. Within each class k, the features of all training samples collapse to the class mean µ*_k.

(ii) NC2: Maximal Between-class Separation. The centered class means {µ*_k − µ*_G | k ∈ [K]} are maximally distant and located on a sphere centered at the origin. More precisely, they form a simplex equiangular tight frame (Simplex ETF) up to rotation and scaling:

[µ*_1 − µ*_G, …, µ*_K − µ*_G] = s U W_ETF,    (2)

where s > 0 is an arbitrary scaling and U ∈ ℝ^{m×K} is an arbitrary rotation (i.e., orthonormal matrix) of the Simplex ETF. Here, W_ETF is the matrix that defines a Simplex ETF:

W_ETF := √(K/(K−1)) · (I_K − (1/K) 1_K 1_K^T),    (3)

where I_K ∈ ℝ^{K×K} is the identity matrix and 1_K ∈ ℝ^K is the vector of all ones. It can be verified that the angle between any pair of centered class means is the same:

arccos( ⟨µ*_k − µ*_G, µ*_{k'} − µ*_G⟩ / (‖µ*_k − µ*_G‖₂ · ‖µ*_{k'} − µ*_G‖₂) ) = θ_K := arccos(−1/(K−1)),  ∀ k, k' ∈ [K], k ≠ k'.    (4)

Aside from the learned features, NC also states that the following two properties hold for the classifier.

(iii) NC3: Self-duality between Feature and Classifier. The classifier weight W* = [w*_1, …, w*_K] satisfies w*_k = C · (µ*_k − µ*_G), where C is a scalar independent of k.

(iv) NC4: Nearest Class Mean Classifier.
The class label of a test input x is determined by a nearest class mean classifier: arg min_{k ∈ [K]} ‖f(x; θ*) − µ*_k‖₂.

Theoretical studies of the NC phenomenon are often conducted with unconstrained feature models (UFMs). The idea behind UFMs is that, since analyzing a multi-layer nonlinear neural network f(·) is difficult, we treat f(·) as a model that can produce any set of features given any set of inputs. Hence, the features {u_i}_{i=1}^N can be treated as free optimization variables, which allows us to peel off the network f(·) and consider the following optimization problem:

min_{{u_i}_{i=1}^N, W, b} (1/N) ∑_{i=1}^N L([v_{i1}, …, v_{iK}], y_i),  s.t.  v_{ik} = ⟨u_i, w_k⟩ + b_k.    (5)

Under certain assumptions, it can be shown that all global solutions to (5) satisfy NC and that this optimization problem has a benign global landscape (Zhu et al., 2021).
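To make the Simplex ETF in (3) concrete, the following sketch constructs W_ETF with NumPy and verifies its two defining properties: unit-norm columns and pairwise cosine −1/(K−1). This is a minimal illustration, not the authors' code.

```python
import numpy as np

def simplex_etf(K: int) -> np.ndarray:
    """Simplex ETF matrix W_ETF = sqrt(K/(K-1)) * (I_K - (1/K) 1_K 1_K^T)."""
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

K = 5
W = simplex_etf(K)
# every column has unit norm
assert np.allclose(np.linalg.norm(W, axis=0), 1.0)
# the cosine between any two distinct columns is -1/(K-1)
gram = W.T @ W
off_diag = gram[~np.eye(K, dtype=bool)]
assert np.allclose(off_diag, -1.0 / (K - 1))
```

The equal pairwise angle arccos(−1/(K−1)) is exactly the θ_K appearing in (4).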

3. METHOD

In this section, we start by presenting an arrangement of features described by their geometric properties in Section 3.1, where within-class features are diverse within a cone. Then, Section 3.2 introduces our method for obtaining such an arrangement by the design of classifier and loss function.

3.1. A MAXIMAL-SEPARATING-CONE (MSC) ARRANGEMENT

As explained in Section 2.2, standard deep learning based classification produces features with an arrangement described by NC, where within-class variability converges to zero (i.e., NC1) and between-class separation is maximized (i.e., NC2); see Figure 2(a). Here we introduce an arrangement of features in which within-class features are allowed to be diverse. As briefly explained in the introduction, within-class diversity is a desirable property for fine-grained classification.

Definition 1 (Maximal-Separating-Cone Arrangement). The set of features {u_i}_{i=1}^N ⊆ ℝ^m is said to have a maximal-separating-cone (MSC) arrangement with respect to (w.r.t.) the set of labels {y_i}_{i=1}^N ⊆ [K] with angle τ > 0 if there exists a collection of vectors {β_k}_{k=1}^K ⊆ ℝ^m forming a Simplex ETF, such that for all i, k with y_i = k,

⟨u_i, β_k⟩ / (‖u_i‖₂ · ‖β_k‖₂) ≥ cos τ.    (6)

The MSC arrangement has a simple geometric description (see Figure 2(b)). In words, it requires that the features associated with each class lie in a cone. Moreover, the cones have the properties that 1) their axes form a Simplex ETF, hence different classes are sufficiently separated, and 2) their angular radius is τ > 0, hence within-class features do not collapse to a single point as in NC.
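The cone condition (6) is easy to check numerically. The sketch below (an illustration, not the paper's code) tests whether each feature makes an angle of at most τ with its class axis; for simplicity the axes here are arbitrary unit vectors rather than a full Simplex ETF.

```python
import numpy as np

def satisfies_msc(U, y, B, tau):
    """True iff every feature u_i lies within angle tau of its class axis.

    U: (N, m) features; y: (N,) integer labels; B: (m, K) cone axes beta_k.
    """
    Un = U / np.linalg.norm(U, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=0, keepdims=True)
    # cosine between each normalized feature and its own class axis
    cos_sim = np.einsum('nm,mn->n', Un, Bn[:, y])
    return bool(np.all(cos_sim >= np.cos(tau)))

B = np.eye(2)                               # two axes: e_1 and e_2
U = np.array([[1.0, 0.1], [0.1, 1.0]])      # each ~0.0997 rad off its axis
y = np.array([0, 1])
assert satisfies_msc(U, y, B, tau=0.2)      # inside a 0.2-rad cone
assert not satisfies_msc(U, y, B, tau=0.05) # outside a 0.05-rad cone
```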

3.2. INDUCING MSC FEATURES VIA CLASSIFIER AND LOSS DESIGN

We obtain MSC features by training a deep neural network via the following optimization problem:

min_θ (1/N) ∑_{i=1}^N L([v_{i1}, …, v_{iK}], y_i),  s.t.  u_i = f(x_i; θ),  v_{ik} = h(u_i, y_i, w_k, b_k),  W := [w_1, …, w_K] = W_ETF,  b := [b_1, …, b_K] = 0,    (7)

where h(·) is defined in (8) and W_ETF is the Simplex ETF matrix defined in (3). The problem in (7) differs from (1) in two aspects, which we explain below.

Fixed Classifier. Instead of learning the classifier weights W and b from the training data, we fix them as constant model parameters in (7). In particular, W is fixed as the Simplex ETF matrix W_ETF. The idea is that we would like the columns of W to be the axes of the cones in an MSC arrangement, which are required to form a Simplex ETF according to Definition 1. The parameter b is simply fixed to the zero vector, which simplifies the design of the method for obtaining MSC features.

Hinge Loss. The second difference in (7) is that the logits are set as v_{ik} = h(u_i, y_i, w_k, b_k), with h(·) being a hinge function defined as:

h(u_i, y_i, w_k, b_k) =
  h_+(u_i, w_k, b_k) := min( ⟨w_k, u_i⟩, stop-grad(‖u_i‖₂) · ‖w_k‖₂ · cos τ_+ ) + b_k,  if k = y_i,
  h_−(u_i, w_k, b_k) := max( ⟨w_k, u_i⟩, stop-grad(‖u_i‖₂) · ‖w_k‖₂ · cos τ_− ) + b_k,  if k ≠ y_i,    (8)

where τ_+ and τ_− are two hyper-parameters. If τ_+ = 0 and τ_− = π, we obtain h(u_i, y_i, w_k, b_k) = ⟨w_k, u_i⟩ + b_k, which is the standard logit function used in (1). The idea behind the design of (8) is explained below.

• For the logit v_{ik} corresponding to the positive class, i.e., when k = y_i, we define v_{ik} = ⟨w_k, u_i⟩, the typically used logit, when the angle between w_k and u_i is greater than τ_+. Otherwise, we define the logit as v_{ik} = stop-grad(‖u_i‖₂) · cos τ_+, where stop-grad(·) denotes an operation that stops the backpropagation of gradients, thereby making the logit independent of the network parameters θ.
• For the logits v_{ik} corresponding to the negative classes, i.e., when k ≠ y_i, we define v_{ik} = ⟨w_k, u_i⟩, the typically used logit, if the angle between w_k and u_i is smaller than τ_−. Otherwise, we define v_{ik} = stop-grad(‖u_i‖₂) · cos τ_−, so that the logit becomes independent of the network parameters θ.

By setting τ_+ = τ, the logit v_{ik} of the positive class is independent of the network parameters θ whenever u_i lies in its corresponding cone of an MSC. Similarly, using the fact that the pairwise angular distance of the vectors in a Simplex ETF with K vectors is the same and given by θ_K (see (4)), by setting τ_− = θ_K − τ, the logits v_{ik} of the negative classes are independent of θ whenever u_i lies in its corresponding cone of an MSC. Combining the cases above, we readily obtain the following result.

Theorem 1. Consider (7) with τ_+ = θ_K − τ_− = τ. If the set of features {u_i}_{i=1}^N satisfies the MSC arrangement w.r.t. {y_i}_{i=1}^N with angle τ, then the gradient of the objective w.r.t. θ is zero.

By Theorem 1, if the features already form an MSC arrangement, then the network f(·) is no longer updated and hence the features do not collapse to their corresponding class centers.
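The forward values of the hinge logits in (8) can be sketched in a few lines, assuming unit-norm classifier columns and b = 0 as in (7); the stop-grad only matters for backpropagation, so it is omitted from this NumPy illustration (this is not the authors' implementation).

```python
import numpy as np

def hinge_logits(u, y, W, tau_pos, tau_neg):
    """Forward values of Eq. (8): clamp the positive logit from above at
    ||u|| cos(tau_pos) and the negative logits from below at ||u|| cos(tau_neg),
    assuming unit-norm columns of W and zero bias."""
    norm_u = np.linalg.norm(u)
    raw = W.T @ u                                  # <w_k, u> for all k
    v = np.maximum(raw, norm_u * np.cos(tau_neg))  # negative-class branch h_-
    v[y] = min(raw[y], norm_u * np.cos(tau_pos))   # positive-class branch h_+
    return v

W = np.eye(2)                 # toy unit-norm classifier axes
u = np.array([1.0, 0.0])      # feature aligned with the class-0 axis
v = hinge_logits(u, y=0, W=W, tau_pos=np.pi / 6, tau_neg=np.pi)
# the positive logit is clamped at cos(pi/6) since u lies inside its cone;
# in training the clamped value carries no gradient back to the network
assert np.isclose(v[0], np.cos(np.pi / 6))
# with tau_neg = pi the negative logit is just <w_1, u> = 0
assert np.isclose(v[1], 0.0)
```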

4. EXPERIMENTS

In this section, we demonstrate the empirical performance of our approach. In particular, we focus on two applications: (i) fine-grained classification, where the goal is to illustrate the benefit of diverse features learned during pre-training for downstream fine-grained classification, and (ii) long-tail classification, where we apply our approach to improve the performance on tail classes. For our approach, we train feature extractors f(·) by solving the optimization problem in (7), with L being the CE loss and f(·) being different variants of ResNet (He et al., 2016) depending on the dataset and application. We refer to our method as HingeCE. Since the last two layers of (all variants of) ResNet are a ReLU activation followed by pooling, the extracted features are constrained to have non-negative entries and hence cannot satisfy an MSC arrangement. Therefore, we add a fully connected layer with the same input and output dimensions on top of the ResNet as part of the feature extractor f(·), so that the output features are defined over all of ℝ^m. The choice of the hyper-parameters τ_+ and τ_−, as well as additional analysis, is discussed in Section 5.
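The need for the extra linear layer can be seen in a few lines: post-ReLU pooled features live in the non-negative orthant, while a linear projector can map them anywhere in ℝ^m. A toy illustration (the projector weights here are contrived, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)
pooled = np.maximum(rng.normal(size=(4, 8)), 0.0)  # post-ReLU, post-pooling
assert np.all(pooled >= 0)          # confined to the non-negative orthant
assert np.any(pooled > 0)

P = -np.eye(8)                      # a (contrived) linear projector
projected = pooled @ P
assert np.any(projected < 0)        # projected features can leave the orthant
```

A Simplex ETF has columns with negative entries, so features restricted to the non-negative orthant can never align with every cone axis; the projector removes this restriction.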

4.1. FINE-GRAINED CLASSIFICATION

Experimental Setup. For this setup, we follow Kornblith et al. (2021) and first pre-train a ResNet-50 model (He et al., 2016) on the ImageNet ILSVRC 2012 dataset (Russakovsky et al., 2015). The outputs of ResNet-50 are then used as features for the downstream tasks. More specifically, after pre-training, we fix the pre-trained feature extractor and only learn a linear classifier for each of nine different downstream natural image classification tasks through L2-regularized multinomial logistic regression. The downstream datasets are Food (Bossard et al., 2014), CIFAR10 & CIFAR100 (Krizhevsky et al., 2009), Birdsnap (Berg et al., 2014), Sun (Xiao et al., 2010), Car (Krause et al., 2013), Aircraft (Maji et al., 2013), Pet (Parkhi et al., 2012), and Flower (Nilsback & Zisserman, 2008). Most of these datasets concern recognition of fine-grained classes (see Appendix A for more details). The primary goal of this experiment is to demonstrate that diverse features in pre-training enable better downstream fine-grained classification.

Results. The results for fine-grained classification are summarized in Table 1. We consider several baselines to evaluate our performance gain. We use CE to denote the baseline which jointly learns the classifier and the feature extractor from scratch by minimizing the CE loss L_CE during pre-training. Another natural baseline is CE+ETF, where the learnable classifier in CE is replaced with a fixed Simplex ETF. For a fair comparison with our HingeCE, which also uses a fixed Simplex ETF as the classifier but adds an additional linear layer, we add such a layer to CE+ETF as well. Hence, HingeCE and CE+ETF share the same feature extractor architecture and fixed Simplex ETF classifier, with the only difference being the choice of loss function for network training. Additionally, following Kornblith et al. (2021), we use label-smoothed variants of the above baselines as additional baselines (e.g., Smooth+ETF).
In particular, we modify the target distribution t as t̃ = t(1 − α) + α/K, where α determines the weighting of the original and uniform targets and K is the number of classes. With α = 0.1, the mean linear evaluation accuracy drops from 94.3% to 92.6% (rows 2-3 of Table 1) when a classifier is jointly learned, and from 94.0% to 92.7% (rows 4-5 of Table 1) when a Simplex ETF is used. We first note that CE+ETF achieves similar or competitive accuracy in linear evaluation, as well as on the ImageNet (pre-training) test set, when compared to CE (rows 1-2 of Table 3), indicating that using a Simplex ETF as the classifier does not impact the transferability of deep features. Finally, with our approach HingeCE, we mitigate the degree of collapse by maintaining diversity within each class, and observe a substantial performance boost over the baselines on both transfer learning (rows 5-6 of Table 1) and test accuracy on ImageNet (rows 2-3 of Table 3). Also, in contrast to self-supervision approaches (e.g., SimCLR) which aim to learn diverse features, our method provides both discriminative power and feature diversity, thereby yielding better transfer learning performance.

[Table 2: Performance comparison on long-tail datasets, reporting Top-1 accuracy (%) with the best accuracy highlighted; our implementation is based on Menon et al. (2020). Compared approaches include Focal (Lin et al., 2017), LDAM (Cao et al., 2019), M2M (Kim et al., 2020), and OLTR (Liu et al.). *: the accuracy is 38.4% when the feature dimension m = 64 and 39.5% when m = 128; to ensure the Simplex ETF constraint holds, for CIFAR100-LT we increase the feature dimension to 128.]
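The label-smoothing target above can be computed directly; a small sketch (the function name is ours):

```python
import numpy as np

def smooth_targets(y, K, alpha=0.1):
    """Smoothed targets t~ = t*(1 - alpha) + alpha/K for one-hot t."""
    return np.eye(K)[y] * (1 - alpha) + alpha / K

t = smooth_targets(np.array([2]), K=5, alpha=0.1)
assert np.isclose(t[0, 2], 0.92)              # 0.9 + 0.1/5 on the true class
assert np.allclose(np.delete(t[0], 2), 0.02)  # alpha/K on the other classes
assert np.isclose(t.sum(), 1.0)               # still a valid distribution
```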

4.2. LONG-TAIL CLASSIFICATION

Experimental Setup. For long-tail classification, we conduct experiments on three popular long-tail benchmarks: CIFAR10-LT, CIFAR100-LT (Krizhevsky et al., 2009), and ImageNet-LT (Russakovsky et al., 2015). CIFAR10-LT and CIFAR100-LT are derived from CIFAR10 and CIFAR100, with the number of samples per class ranging from 5000 to 50 and from 500 to 5, respectively. ImageNet-LT is derived from ImageNet and contains 1000 classes, with the number of samples per class ranging from 1280 to 5. Based on the number of training samples per class, we group the classes into two categories: major classes, which have more than 50 samples, and minor classes, which have fewer. In our method, we tune τ_+ for the minor classes and simply use τ_+ = 0 (which is equivalent to using ⟨w_k, u_i⟩ as logits) for the major classes.

Results. In addition to standard baselines, we also compare our approach with state-of-the-art long-tail classification methods. The results are summarized in Table 2. Using a Simplex ETF as the classifier, CE+ETF already yields improved performance compared to CE (rows 10-11 of Table 2), as the class centers are predefined and maximally separated, whereas the centers of minor classes may otherwise not be well learned (Fang et al., 2021). For CIFAR100-LT, even though the test accuracy improved from 38.4% to 39.5% by increasing the feature dimension from 64 to 128, a larger gain (from 39.5% to 42.6%) was obtained through the use of the Simplex ETF. Note that the performance gain of CE+ETF on ImageNet-LT is relatively minor compared to the CIFAR-derived datasets. On the other hand, by preserving the diversity of each class, our HingeCE considerably improves the performance on all three datasets, yielding state-of-the-art performance on some datasets and competitive results on the rest.
Note that TSC (Li et al., 2022) and KCL (Kang et al., 2020) use two-stage training involving supervised contrastive learning (SCL) to improve long-tail classification performance, which is complementary to our approach. Nonetheless, even without SCL, we obtain better or competitive results. We believe our results can be further improved using SCL and additional hyper-parameter tuning.

5. DISCUSSION

Choice of Hyper-parameters. τ_+ and τ_− are the two hyper-parameters of our method, introduced in Equation (8). We plot the cosine similarities between each feature on the ImageNet train set and its positive class center, as well as the negative class centers, during training, and present the results in Figure 3(a) and (b), respectively. The mean cosine similarity with the positive class centers is around 0.46, far below one (if all features collapsed completely onto the class mean, this value would be one). Moreover, the similarity with the negative class centers remains close to zero during the entire training process. Hence, the negative logits have little impact on the value of the CE loss, and for simplicity we set τ_− = π (i.e., use ⟨w_k, u_i⟩ as the negative logits). In the following we examine the effect of τ_+. Figure 3(c) demonstrates the effect of τ_+ for fine-grained classification on the Flower dataset. With a small value of τ_+, HingeCE does not have a significant effect, which is aligned with the observation in Figure 3(a). When τ_+ is increased to the range 11π/36 ≤ τ_+ ≤ 5π/12, the accuracy increases considerably and is not sensitive to the choice of τ_+ within that range. For all results in Tables 1 & 3, τ_+ is fixed to 13π/36. Since this choice works fairly well across various settings, hyper-parameter tuning for our method is very easy. Finally, while this fixed choice works well, we find empirically (details not reported here) that additional hyper-parameter tuning can further improve the final performance.

Measuring Diversity. To measure the diversity of features, we use NC1, defined through the between-class and within-class covariance matrices, and the numerical rank, defined as the ratio of the squared nuclear norm to the squared Frobenius norm of the features of each class, averaged over all classes. The definitions of these two measures can be found in, e.g., Zhou et al. (2022).
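The numerical rank is straightforward to compute from a feature matrix; below is a sketch of the squared-nuclear-over-squared-Frobenius ratio (the function name is ours):

```python
import numpy as np

def numerical_rank(F):
    """Numerical rank of a feature matrix: ||F||_*^2 / ||F||_F^2."""
    s = np.linalg.svd(F, compute_uv=False)
    return (s.sum() ** 2) / (s ** 2).sum()

# a rank-1 matrix (fully collapsed features) has numerical rank 1
assert np.isclose(numerical_rank(np.outer([1.0, 2.0], [3.0, 4.0])), 1.0)
# equal singular values (maximally diverse directions) give rank = dimension
assert np.isclose(numerical_rank(np.eye(4)), 4.0)
```

Collapsed within-class features drive this measure toward 1, while diverse features push it toward the ambient dimension, which is why it serves as a diversity proxy.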
In Figure 4, we plot NC1 and the numerical rank as a function of τ_+. Both measures increase with τ_+, showing that our method indeed improves feature diversity.

Pre-training Performance. While designed to improve downstream fine-grained classification, our method also improves the performance on the ImageNet test set, as shown in Table 3.

Per-class Analysis for Long-tail Classification. We evaluate the performance gain of our method on major vs. minor classes in long-tail classification and report the results in Figure 5. Specifically, Figure 5(a) & (b) show that the benefit of our method lies mainly in the minor classes. Figure 5(c) demonstrates the effect of τ_+ on the major and minor classes separately. Setting τ_+ in the range 7π/36 ≤ τ_+ ≤ π/4 provides a notable gain for the minor classes. In addition, setting τ_+ in the range π/12 ≤ τ_+ ≤ 11π/90 (a range of lower values than for the minor classes) provides a gain for the major classes. This may imply that the optimal threshold τ_+ should depend on the number of training samples in each class. Hence, while the results in Table 2 are obtained using one τ_+ for all minor classes and another for all major classes (a choice made to simplify the study), the performance may be further improved with a more careful per-class choice of τ_+. When we apply our approach to the major and minor classes separately and adjust the threshold τ_+, Figure 5(c) shows that the best test accuracy is reached when our approach is applied to the minor classes with a threshold around π/6 ≤ τ_+ ≤ 5π/18.

6. RELATED WORK

Transfer Learning is a pipeline in which feature extractors trained on a pretraining task are used to perform downstream tasks (Kornblith et al., 2019). The literature on transfer learning abounds; most related to ours is Kornblith et al. (2021), which shows a trade-off between pretraining and downstream performance. Specifically, techniques such as label smoothing produce more collapsed within-class features (Müller et al., 2019) and hence improve performance on the ImageNet pre-training task, but such features lead to worse downstream performance. Our results show that by introducing diversity, the performance on both the pretraining and downstream fine-grained classification tasks can be improved. While other methods for obtaining diverse features exist, they either require an additional k-nearest-neighbor search (Dwibedi et al., 2021; Touvron et al., 2021; Feng et al., 2022) or require that different classes share the same feature distribution (Xu et al., 2021). On the other hand, within-class compactness has been found to benefit face recognition (Wen et al., 2016; Liu et al., 2017; Deng et al., 2019; Wang et al., 2018; Meng et al., 2021).

Neural Collapse is a phenomenon reported in Papyan et al. (2020) for deep models fully trained on balanced datasets in classification tasks. Theoretical analyses simplify the deep model by considering the last-layer features and the classifier as independent variables, and subsequent works demonstrated the NC phenomenon under cross-entropy (Wojtowytsch & Weinan, 2020; Lu & Steinerberger, 2020; Graf et al., 2021; Fang et al., 2021) or mean squared error (Mixon et al., 2022; Rangamani & Banburski-Fahey, 2022; Poggio & Liao, 2020) losses. Since the classifier weights form a Simplex ETF under NC, Zhu et al. (2021) shows that fixing the classifier to a Simplex ETF during training achieves similar performance.
While minimized within-class diversity may hurt adaptation to new classes (Kornblith et al., 2019), the Simplex ETF has been shown to be useful for continual learning (Wu et al., 2021) and long-tail learning (Li et al., 2022; Yang et al., 2022).

Long-tail Learning addresses imbalanced data distributions, a common real-world scenario. Deep learning models tend to perform well on major classes (Zhang et al., 2021) but attain lower accuracy on minor classes. To address this issue, an intuitive strategy is to explicitly re-sample the training data (Ando & Huang, 2017), which usually hurts performance on major classes (Kang et al., 2019). Alternative methods include re-weighting the loss for different classes (Byrd & Lipton, 2019; Cui et al., 2019) or re-adjusting the logits during training (Menon et al., 2020; Cao et al., 2019). Meanwhile, several regularization techniques such as weight normalization (Kang et al., 2019) and loss modification (Iranmehr et al., 2019) can be applied, but they can be sensitive to the choice of optimizer and may not reach a balanced error (Menon et al., 2020).

7. CONCLUSION

In this paper, we ask whether Neural Collapse is a desirable property of deep features and argue that the lack of within-class diversity may hurt performance on fine-grained classification tasks. To address this, we introduce a geometric arrangement of features with within-class diversity, and present a method called HingeCE based on classifier and loss function design. Empirically, we show that HingeCE is substantially better than or competitive with state-of-the-art methods on fine-grained and long-tail classification tasks.

B IMPLEMENTATION DETAILS

We use ResNet for all experiments. For each dataset, we follow the same ResNet configuration used in the baseline approach for fair comparison. Meanwhile, as ResNet applies a ReLU activation before average pooling (He et al., 2016), the (post-pooling) feature vector cannot satisfy the constraint of the Simplex ETF defined in an MSC arrangement: the Simplex ETF weights are maximally separated over the full feature space and always contain negative entries, while the pooled feature vectors have only non-negative entries, so their distance to the weights cannot be zero. We therefore add a fully connected layer, serving as a projector, on top of the ResNet so that each entry of the output features is defined over ℝ. As the thresholds (τ_+, τ_−) are the only hyper-parameters introduced by our approach, where τ_+ = τ and τ_− is set as described in Section 5, we summarize them in Table 4 and explain the other hyper-parameters below.

B.1 PRE-TRAINING ON IMAGENET

For the projector, we set the output dimension equal to the input dimension of 2048. We set the batch size to 512, use the SGD optimizer with Nesterov momentum, and train for 90 epochs. The weight decay coefficient is 0.001. We set the initial learning rate to 0.4 with a linear warm-up of 100 steps, scheduled according to a cosine annealing function.

B.2 TRANSFER LEARNING FOR FINE-GRAINED CLASSIFICATION

During transfer learning, we fix the feature extractor learned by our HingeCE to extract a 2048-dimensional feature vector for each image. Then, we learn a linear classifier on top of the features extracted from the training set images. As the datasets considered in transfer learning are about fine-grained classification, we use the fine-grained classification performance to indicate the quality and diversity of the learned features. Note that the classifiers in transfer learning are trained by minimizing the vanilla cross-entropy loss; we do not apply our approach during transfer learning. During training, we set the batch size to 128, use the SGD optimizer, and train the classifier for 100 epochs. As only a linear classifier is learned, we use a large learning rate of 1.0 throughout training. The dataset information is summarized in Section A.

B.3 LONG-TAIL LEARNING

We use ResNet-32 as the backbone for experiments on CIFAR10-LT and CIFAR100-LT, and ResNet-50 for experiments on ImageNet-LT. For each dataset, all compared approaches use the same architecture as ours, so the comparison is fair with regard to network architecture. The implementation details for each dataset are given below.

CIFAR10-LT. We use the standard ResNet-32 architecture and exactly follow the configuration in Menon et al. (2020); Yang & Xu (2020). As the spatial resolution of the images is 32 × 32, the first convolution layer uses stride 1 and a 3 × 3 kernel. The number of channels in the feature output is 64, and the projector linear layer has the same input and output dimension (i.e., 64). We set the batch size to 128, use the SGD optimizer, and train for 256 epochs. The weight decay coefficient is 0.0001. We set the initial learning rate to 0.4 without any warm-up, scheduled according to a cosine annealing function.

CIFAR100-LT. We use the same network (i.e., ResNet-32) as the feature backbone and follow the experimental configuration described for CIFAR10-LT, except for the projector. As the number of classes in CIFAR100-LT is K = 100, larger than the feature dimension m = 64, we set the projector output dimension to m = 128 to meet the constraint m ≥ K. Note that the structure of ResNet-32 itself is unchanged.

ImageNet-LT. We use the same network (i.e., ResNet-50) as the feature backbone and the experimental setup described in Section B.1, but with a weight decay coefficient of 0.0001.



We assume that the feature space dimension m is greater than or equal to the number of classes K. A regularization term is often added to the optimization variables; we omit it to simplify the discussion. This design choice was previously made in Zhu et al. (2021), with the idea that since N C occurs at convergence, simply fixing W to be a Simplex ETF should not hurt performance while possibly accelerating training. More broadly, Hoffer et al. (2018) also noted that the classifier need not be trained. Here we use the fact that $\|w_k\|_2 = 1$ by the constraint in (7). For CIFAR100-LT classification using ResNet-32, the number of classes K = 100 is greater than the feature dimension m = 64, which violates the constraint m ≥ K for constructing a Simplex ETF in (3). In this particular case we set the output dimension to be 128.
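A Simplex ETF of the kind referenced above can be constructed explicitly when m ≥ K: take a partial orthogonal matrix $U \in \mathbb{R}^{m \times K}$ and center the identity. The sketch below follows the standard construction; the function name and random seed are incidental.

```python
import numpy as np

def simplex_etf(m, K, seed=0):
    """Columns of the returned (m, K) matrix form a Simplex ETF:
    unit-norm vectors with pairwise inner product -1/(K-1).
    Requires m >= K, matching the constraint discussed above."""
    assert m >= K, "Simplex ETF construction needs m >= K"
    rng = np.random.default_rng(seed)
    # Partial orthogonal matrix U with U^T U = I_K (reduced QR of a random matrix).
    U, _ = np.linalg.qr(rng.standard_normal((m, K)))
    center = np.eye(K) - np.ones((K, K)) / K   # I_K - (1/K) 1 1^T
    return np.sqrt(K / (K - 1)) * U @ center

# The CIFAR100-LT case above: K = 100 classes, projector output dimension m = 128.
W = simplex_etf(m=128, K=100)
G = W.T @ W   # Gram matrix: ones on the diagonal, -1/(K-1) off-diagonal
```

The Gram matrix G certifies maximal separation: every pair of class vectors meets at the same, maximally negative cosine -1/(K-1).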



Figure 1: Fine-grained classification with N C vs. diverse features. We train a ResNet-32 on CIFAR-20, which contains 20 coarse classes, and use the features for classification on CIFAR-100, which contains 100 fine-grained labels within the 20 coarse classes. The plots visualize randomly sampled features from three selected coarse classes (enclosed in dashed circles), each with two selected fine-grained classes (shown in different colors). Our visualization uses the technique of Müller et al. (2019). (a) With the cross-entropy (CE) loss, within-class features collapse and fine-grained classes are less distinguishable. (b) With our method (i.e., HingeCE), within-class features are diverse and fine-grained classes are more distinguishable.

Figure 2: Conceptual illustration of features obtained from deep learning based classification models. Each ball represents a feature vector for an input in the training set. (Left) Neural collapse features obtained by standard training (Section 2.2), where within each class all features collapse to the class mean. (Right) Maximal-separating-cone features that our method obtains (Section 3.1), where within each class the features lie in a cone, and the axes of the cones are maximally separated.

In this section, we start by reviewing the basics of deep learning based classification in Section 2.1, where a feature extractor parameterized by a deep neural network and a linear classifier on the features are jointly learned from a training dataset. Then, Section 2.2 explains N C, which characterizes the features and classifier obtained from training a deep learning based classification model.

2.1 DEEP LEARNING BASED CLASSIFICATION

Consider the classification task where the goal is to predict the label $y \in [K] \doteq \{1, \dots, K\}$ for an input $x \in \mathbb{R}^n$, where K is the number of classes. With deep learning based classification, this is typically achieved by training a model with the following optimization objective:
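The joint training of a feature extractor $f(\cdot, \theta)$ and a linear classifier $(W, b)$ described above can be sketched as below. This is a minimal stand-in, not the paper's architecture: the tiny MLP replaces the deep backbone, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

K, n, m = 10, 32, 16                 # classes, input dim, feature dim (illustrative)

feature_extractor = nn.Sequential(nn.Linear(n, m), nn.ReLU())  # f(.; theta)
classifier = nn.Linear(m, K)                                   # logits = W u + b

x = torch.randn(8, n)                        # a batch of inputs x_i
y = torch.randint(0, K, (8,))                # labels y_i in [K]
u = feature_extractor(x)                     # features u_i = f(x_i, theta)
loss = nn.functional.cross_entropy(classifier(u), y)
loss.backward()                              # gradients flow jointly to theta, W, b
```

The key point for the discussion that follows is that theta, W, and b are optimized together, so the geometry of the learned features u is shaped entirely by this single objective.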

2.2 NEURAL COLLAPSE (N C)

N C is a phenomenon of the features obtained at the terminal phase of training a deep learning based classification model (Papyan et al., 2020). Let $(\theta^*, W^*, b^*)$ be the solution to (1). Then, N C states that the learned features $\{u_i^* \doteq f(x_i, \theta^*)\}_{i=1}^N$ have the following properties (see also Figure 2(a)). (i) N C1: Within-class Variability Collapse. For each class $k \in [K]$, the features $\{u_i^* \mid i: y_i = k\}$ collapse to the class mean $\mu_k^* \doteq \frac{\sum_{i: y_i = k} u_i^*}{\sum_{i: y_i = k} 1}$. (ii) N C2: Between-class Maximal Separation. Let $\mu_G^* = \frac{1}{K}\sum_{k=1}^K \mu_k^*$ be the global mean. Then, the class means after subtracting the global mean, i.e., $\{\mu_k^* - \mu_G^*\}_{k=1}^K$, have equal norms and are maximally and equally separated.
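N C1 can be quantified with the commonly used collapse measure $\mathrm{tr}(\Sigma_W \Sigma_B^{\dagger})/K$, which compares within-class scatter to between-class scatter and tends to 0 as features collapse to their class means. A sketch using the definitions above (the specific metric form is the standard choice in the N C literature, not a claim about this paper's exact implementation):

```python
import numpy as np

def nc1_metric(U, y, K):
    """NC1 = tr(Sigma_W Sigma_B^+)/K: within-class scatter relative to
    between-class scatter; approaches 0 under within-class collapse.
    The pseudo-inverse handles a rank-deficient Sigma_B."""
    N, _ = U.shape
    mu = np.stack([U[y == k].mean(axis=0) for k in range(K)])  # class means mu_k
    mu_G = mu.mean(axis=0)                                     # global mean mu_G
    dW = U - mu[y]                                             # u_i - mu_{y_i}
    Sigma_W = dW.T @ dW / N
    dB = mu - mu_G                                             # mu_k - mu_G
    Sigma_B = dB.T @ dB / K
    return np.trace(Sigma_W @ np.linalg.pinv(Sigma_B)) / K

# Fully collapsed features (each feature equals its class mean) give NC1 = 0.
rng = np.random.default_rng(0)
means = rng.standard_normal((3, 8))
y = np.repeat(np.arange(3), 10)
U = means[y]
collapsed = nc1_metric(U, y, K=3)     # 0 up to numerical precision
noisy = nc1_metric(U + 0.1 * rng.standard_normal(U.shape), y, K=3)
```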

Figure 3: (a, b) Cosine similarity of each feature with its corresponding positive (in (a)) and negative (in (b)) class centers, averaged over all features obtained with CE+ETF. (c) Transfer learning accuracy on the Flower dataset as a function of $\tau^+ \in [0, \frac{17}{36}\pi]$ with $\tau^- = \pi$, during ImageNet pretraining with HingeCE.

Figure 4: The (a) numerical rank and (b) N C1 metric of features on the ImageNet training set with different thresholds $\tau^+$ for $h^+$.

Figure 5: On CIFAR100-LT, for each class we report (a) test accuracy and (b) relative performance gain (classes with a larger index on the x-axis have less training data); performance on the minor classes is mostly improved. We then apply our approach on the major and minor classes separately and adjust the threshold $\tau^+$. In (c), we observe that the best test accuracy is reached when our approach is applied to the minor classes with a threshold around $\frac{1}{6}\pi \le \tau^+ \le \frac{5}{18}\pi$.

≤ 50 training data are treated as minor classes.

B.1 PRETRAINING ON IMAGENET

We use the standard ResNet-50 as the feature extractor and follow the configuration defined in He et al.

Performance comparison of transfer learning using linear evaluation on nine downstream image classification datasets. We report Top-1 accuracy (%), where the highest and second-highest results are highlighted. Our approach, HingeCE, consistently improves transfer learning accuracy, reaching the highest result on several datasets and the highest mean accuracy. Symbols † and * indicate results reported in Chen et al. (2020) and Kornblith et al. (2021), respectively.

Summary of threshold selection. Columns: Dataset, $\tau^-$, $\tau^+$ range, and the classes to which HingeCE is applied.

A FINE-GRAINED TRANSFER DATASETS

The downstream datasets are Food (Food-101) (Bossard et al., 2014) with 101 fine-grained classes of food, CIFAR10 & CIFAR100 (Krizhevsky et al., 2009), Birdsnap (Berg et al., 2014) with 500 fine-grained classes of bird species, Sun (SUN-397) (Xiao et al., 2010) with 397 scene categories, Car (Stanford Cars) (Krause et al., 2013) with 196 fine-grained classes of car models, Aircraft (Maji et al., 2013).

Visualization of fine-grained classes. As shown in Figure 6, we provide visualizations of three coarse classes each for ImageNet and Flower; all classes are randomly selected. Comparing Figure 6(a, b), applying our approach HingeCE preserves more diversity within each coarse class. Then, from Figure 6(c, d), the features of fine-grained classes become more separable.

