DO WE NEED NEURAL COLLAPSE? LEARNING DIVERSE FEATURES FOR FINE-GRAINED AND LONG-TAIL CLASSIFICATION

Abstract

Feature extractors learned from supervised training of deep neural networks have demonstrated superior performance over handcrafted ones. Recently, it has been shown that such learned features exhibit a neural collapse property, where within-class features collapse to their class mean and the means of different classes are maximally separated. This paper examines the neural collapse property in the context of fine-grained classification tasks, where a feature extractor pretrained on a classification task with coarse labels is used to generate features for a downstream classification task with fine-grained labels. We argue that within-class feature collapse is an undesirable property for fine-grained classification. Hence, we introduce a geometric arrangement of features called the maximal-separating-cone, where within-class features lie in a cone of nontrivial radius instead of collapsing to the class mean, and the cones of different classes are maximally separated. We present a technique based on classifier weight and training loss design to produce such an arrangement. Experimentally, we demonstrate improved fine-grained classification performance with a feature extractor pretrained by our method. Moreover, our technique also provides benefits for classification on data with a long-tail distribution over classes. Our work may motivate future efforts on the design of better geometric arrangements of deep features.

1. INTRODUCTION

The extraction of features or representations from image, language, and speech data in their raw forms is a problem of fundamental interest in machine learning. A standard approach in classical machine learning is to handcraft a feature extractor that maps the input to the feature space (Lowe, 2004), which often requires meticulous and onerous work by human domain experts. With the success of deep learning, methods based on learned feature extractors have become very popular. Such an approach not only requires less human expertise, but also exhibits better empirical performance than handcrafted feature extractors (Krizhevsky et al., 2017). Moreover, a deep feature extractor obtained from a pretraining task can be reused for downstream tasks by simply training a task-specific classifier, which offers good empirical performance (Yosinski et al., 2014).

The great success of learned features naturally leads to the following question: Are learned features optimal for various application scenarios? In this paper, we show that the features learned by a standard deep classification model are not optimal and can be improved by handcrafting a better geometric arrangement of the features. We are motivated by a recent line of work showing that the geometric arrangement of features in a standard deep classification model has a simple and elegant form:

• Within each class, the features are maximally concentrated and collapse to the class mean.

• Across different classes, the features are maximally separated. Namely, the distance between each pair of class means is the same, and that distance is at the maximum.

This phenomenon is called neural collapse (NC).

Contribution. We study whether NC is the desired property for deep features by considering a specific learning task. In particular, we consider the fine-grained classification task, where a deep learning model is first pretrained on a classification dataset with coarse class labels, and the features extracted from this model are used for a downstream classification task with fine-grained classes. For example, the pretraining task may contain a coarse class, say flower, that covers all variants of flowers, while the fine-grained task gives each variant of flower a different class label. If NC occurs, then all features of the coarse class collapse to the class mean, so such features do not distinguish between the fine-grained classes and cannot support fine-grained classification. This simple thought experiment demonstrates that NC is not a desired property for fine-grained classification.
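The "maximally separated" class-mean geometry described above can be made concrete with a small numerical sketch. For K classes, equal and maximal pairwise separation of unit-norm mean directions is achieved by a simplex equiangular tight frame, whose Gram matrix has 1 on the diagonal and -1/(K-1) off the diagonal. The function name below is illustrative, not from the paper:

```python
import numpy as np

def simplex_etf(K):
    """Columns are K unit vectors with equal (maximal) pairwise separation."""
    I = np.eye(K)
    ones = np.ones((K, K)) / K
    # Center the identity and rescale so each column has unit norm.
    return np.sqrt(K / (K - 1)) * (I - ones)

M = simplex_etf(4)
G = M.T @ M  # Gram matrix: 1 on the diagonal, -1/(K-1) = -1/3 off it
```

Because every off-diagonal inner product equals -1/(K-1), every pair of class means is the same distance apart, matching the description of NC above.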

Motivated by the observation above, we argue that deep features extracted during pretraining should not have the NC property when they are to be used for fine-grained classification. Instead, the features of each pretraining class should be diverse to a certain extent, so that within-class variation is preserved and can benefit fine-grained classification. To realize this idea, we introduce a geometric arrangement of deep features called the maximal-separating-cone (MSC), where the features of each class lie inside a cone with a nontrivial radius (instead of collapsing to the class mean as in NC), and the axes of the cones are maximally separated. See Figure 2 for an illustration. To obtain such an arrangement, we present a technique based on the simple idea that, if a feature lies inside its cone, its loss is set to a constant by a hinge function. Figure 1 compares NC features and our MSC features for fine-grained classification with CIFAR-20 and CIFAR-100. The contributions of this work are summarized as follows.

• We design an MSC arrangement of deep features where within-class features are diverse within a cone instead of collapsing to the class mean (as in NC).

• We present a generic technique based on the design of the classifier weights and the training loss function to obtain MSC features.

• We conduct experiments on fine-grained classification to demonstrate the effectiveness of our method. We also provide an ablation study to justify the design of each component of our method.
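The hinge idea above can be sketched as follows: apply the usual cross-entropy to a fixed classifier, but clip the per-sample loss at a constant floor, so a feature that is already well inside its class cone receives a constant loss and is no longer pulled toward the class mean. This is a minimal illustration under our own assumptions (the function name, the margin value, and the exact hinge form are not taken from the paper):

```python
import numpy as np

def hinge_ce_loss(features, labels, W, margin):
    """Cross-entropy with a hinge floor: per-sample losses below `margin`
    are clipped to `margin`, removing the gradient pull toward collapse.
    Illustrative sketch, not the paper's exact loss."""
    logits = features @ W.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels]  # per-sample cross-entropy
    return np.maximum(ce, margin).mean()  # hinge: constant once inside the cone

# Toy demo: a fixed 3-class classifier and features aligned with their class.
W = np.eye(3)
confident = 10.0 * np.eye(3)  # each feature points at its own class weight
labels = np.array([0, 1, 2])
# Each sample's CE is tiny, so the hinge clips every sample to the 0.5 floor.
loss = hinge_ce_loss(confident, labels, W, margin=0.5)
```

The key design choice is that, unlike plain cross-entropy, this loss provides no incentive to concentrate features beyond the floor, which is what permits within-class diversity.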



Figure 1: Fine-grained classification with NC vs. diverse features. We train a ResNet32 on CIFAR-20, which contains 20 coarse classes, and use the resulting features for classification on CIFAR-100, which contains 100 fine-grained class labels within the same 20 coarse classes. The plots visualize randomly sampled features from three selected coarse classes (enclosed in dashed circles), each with two selected fine-grained classes (shown in different colors). Our visualization uses the technique of Müller et al. (2019). (a) With cross-entropy (CE) loss, within-class features collapse and fine-grained classes are less distinguishable. (b) With our method (i.e., Hinge CE), within-class features are diverse and fine-grained classes are more distinguishable.

