DO WE NEED NEURAL COLLAPSE? LEARNING DIVERSE FEATURES FOR FINE-GRAINED AND LONG-TAIL CLASSIFICATION

Abstract

Feature extractors learned from supervised training of deep neural networks have demonstrated superior performance over handcrafted ones. Recently, it has been shown that such learned features exhibit a neural collapse property, where within-class features collapse to the class mean and the means of different classes are maximally separated. This paper examines the neural collapse property in the context of fine-grained classification tasks, where a feature extractor pretrained on a classification task with coarse labels is used to generate features for a downstream classification task with fine-grained labels. We argue that within-class feature collapse is an undesirable property for fine-grained classification. Hence, we introduce a geometric arrangement of features called the maximal-separating-cone, where within-class features lie in a cone of nontrivial radius instead of collapsing to the class mean, and cones of different classes are maximally separated. We present a technique based on classifier weight and training loss design to produce such an arrangement. Experimentally, we demonstrate improved fine-grained classification performance with a feature extractor pretrained by our method. Moreover, our technique also provides benefits for classification on data with a long-tail distribution over classes. Our work may motivate future efforts on the design of better geometric arrangements of deep features.

1. INTRODUCTION

The extraction of features or representations from image, language, and speech data in their raw forms is a problem of fundamental interest in machine learning. A standard approach in classical machine learning is to handcraft a feature extractor that maps from the input to the feature space (Lowe, 2004), which often requires meticulous and onerous work by domain experts. With the success of deep learning, methods based on learned feature extractors have become very popular. Such an approach not only requires less human expertise, but also exhibits better empirical performance than handcrafted feature extractors (Krizhevsky et al., 2017). Moreover, a deep feature extractor obtained from a pre-training task can be reused for downstream tasks by simply training a task-specific classifier, which offers good empirical performance (Yosinski et al., 2014). The great success of learned features naturally leads to the following question: are learned features optimal for various application scenarios? In this paper, we show that the features learned by a standard deep classification model are not optimal and can be improved by handcrafting a better geometric arrangement of the features. We are motivated by a recent line of work showing that the geometric arrangement of features in a standard deep classification model has a simple and elegant form:

• Within each class, the features are maximally concentrated and collapse to the class mean.

• Across different classes, the features are maximally separated: the distance between each pair of class means is the same, and that distance is at the maximum.
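In the neural collapse literature, this maximally separated configuration of class means is known as a simplex equiangular tight frame (ETF): K unit-norm directions whose pairwise cosine similarity is -1/(K-1), the minimum possible for K equal-norm vectors summing to zero. The sketch below (our own illustration; the function name and dimensions are not from the paper) constructs such a frame numerically and inspects its Gram matrix:

```python
import numpy as np

def simplex_etf(K, d):
    """Construct K class-mean directions in R^d forming a simplex ETF:
    unit norms and pairwise cosine similarity -1/(K-1)."""
    assert d >= K - 1, "need enough ambient dimensions"
    # Rows of the centered identity already have the simplex geometry in R^K.
    centered = np.eye(K) - np.ones((K, K)) / K
    # Embed into R^d with a random orthonormal map (preserves inner products).
    Q, _ = np.linalg.qr(np.random.randn(d, K))   # d x K, orthonormal columns
    # Rescale so every class mean has unit norm.
    return np.sqrt(K / (K - 1)) * centered @ Q.T  # K x d

M = simplex_etf(K=4, d=10)
G = M @ M.T   # diagonal: 1.0; off-diagonal: -1/(K-1) = -1/3
print(np.round(G, 3))
```

The Gram matrix makes both neural collapse separation properties visible at once: equal norms on the diagonal and one common, maximally negative similarity off the diagonal.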



This phenomenon is called Neural Collapse (NC) (Papyan et al., 2020). The occurrence and prevalence of NC has been verified empirically through experiments with a variety of datasets and network architectures (Han et al., 2021). Motivated by this observation, recent work has provided theoretical analyses of deep features under so-called unconstrained feature models (Mixon et al., 2020;
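Empirical verification of NC typically tracks a within-class variability metric that tends to zero as training proceeds. A minimal sketch of one such quantity, the trace ratio of the within-class to between-class covariance (our own naming and normalization, chosen for illustration):

```python
import numpy as np

def nc1_metric(features, labels):
    """Within-class variability collapse metric:
    tr(Sigma_W @ pinv(Sigma_B)) / K, which approaches 0 under neural collapse."""
    classes = np.unique(labels)
    K = len(classes)
    mu_G = features.mean(axis=0)                      # global mean
    d = features.shape[1]
    Sigma_W = np.zeros((d, d))                        # within-class covariance
    Sigma_B = np.zeros((d, d))                        # between-class covariance
    for c in classes:
        Fc = features[labels == c]
        mu_c = Fc.mean(axis=0)
        Sigma_W += (Fc - mu_c).T @ (Fc - mu_c) / len(features)
        Sigma_B += np.outer(mu_c - mu_G, mu_c - mu_G) / K
    return np.trace(Sigma_W @ np.linalg.pinv(Sigma_B)) / K
```

When every sample coincides with its class mean (full collapse), Sigma_W vanishes and the metric is exactly zero; any within-class spread makes it positive.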

