WHAT DEEP REPRESENTATIONS SHOULD WE LEARN? - A NEURAL COLLAPSE PERSPECTIVE

Abstract

For classification tasks, when sufficiently large networks are trained until convergence, an intriguing phenomenon has recently been discovered in the last-layer classifiers and features, termed neural collapse (NC): (i) the within-class variability of the features collapses to zero, and (ii) the between-class feature means are maximally and equally separated. Despite recent endeavors to understand why NC happens, a fundamental question remains: is NC a blessing or a curse for deep learning? In this work, we investigate this question in the setting of transfer learning, where we pretrain a model on a large dataset and transfer it to downstream tasks. Through various experiments, our findings on NC are twofold: (i) when pretraining models, preventing intra-class variability collapse (to a certain extent) better preserves the structure of the data and leads to better model transferability; (ii) when fine-tuning models on downstream tasks, obtaining features with more NC on the downstream data results in better test accuracy on the given task. Our findings based upon NC not only explain many widely used heuristics in model pretraining (e.g., data augmentation, projection heads, self-supervised learning), but also lead to more efficient and principled transfer learning methods on downstream tasks.

1. INTRODUCTION

Recently, an intriguing phenomenon has been discovered in learned deep representations, in which the last-layer features and classifiers collapse to simple but elegant mathematical structures on the training data: (i) for each class, the intra-class variability of last-layer features collapses to zero, and (ii) the between-class means and the last-layer classifiers all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling. This phenomenon, termed Neural Collapse (NC) (Papyan et al., 2020; Han et al., 2022), has been empirically demonstrated to persist across a variety of network architectures and datasets. Theoretically, more recent works (Fang et al., 2021; Zhu et al., 2021; Zhou et al., 2022; Tirer & Bruna, 2022) justified the prevalence of NC under simplified unconstrained feature models across a variety of training losses and problem formulations. Despite recent endeavors to demystify this interesting phenomenon, a fundamental question lingers: is NC a blessing or a curse for deep representation learning? Understanding this question could address many important but mysterious aspects of deep representation learning. For example, quite a few recent works (Papyan et al., 2020; Galanti et al., 2022; Hui et al., 2022) studied the connection between NC and the generalization of overparameterized deep networks. In this work, we aim to understand transfer learning by studying the relationship between NC and the transferability of pretrained deep models. Transfer learning has become an increasingly popular approach in computer vision, medical imaging, and natural language processing (Zhuang et al., 2020). Given sufficient domain similarity, a large model pretrained on upstream datasets is reused as a starting point for fine-tuning a new model on a much smaller downstream task (Zhuang et al., 2020).
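The two NC properties above can be quantified directly from last-layer features. The sketch below is a minimal illustration, not the exact metrics used in any particular paper: it measures within-class variability relative to between-class variability (collapsing to zero under property (i)) and the deviation of pairwise class-mean cosines from the ideal Simplex-ETF value of -1/(K-1) (vanishing under property (ii)).

```python
import numpy as np

def nc_metrics(features, labels):
    """Simple Neural Collapse diagnostics on last-layer features.

    nc1: tr(Sigma_W @ pinv(Sigma_B)) / K, where Sigma_W / Sigma_B are the
         within-/between-class covariances; approaches 0 as each class's
         features collapse onto its class mean.
    nc2: maximum deviation of the pairwise cosines between centered class
         means from -1/(K-1), the value attained by a Simplex ETF.
    """
    classes = np.unique(labels)
    K = len(classes)
    global_mean = features.mean(axis=0)

    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    centered = means - global_mean                       # K x d

    Sigma_B = centered.T @ centered / K                  # between-class covariance
    Sigma_W = np.zeros_like(Sigma_B)                     # within-class covariance
    for c, mu in zip(classes, means):
        diff = features[labels == c] - mu
        Sigma_W += diff.T @ diff
    Sigma_W /= len(features)

    nc1 = np.trace(Sigma_W @ np.linalg.pinv(Sigma_B)) / K

    normed = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    cos = normed @ normed.T
    off_diag = cos[~np.eye(K, dtype=bool)]
    nc2 = np.abs(off_diag + 1.0 / (K - 1)).max()
    return nc1, nc2
```

On fully collapsed features whose class means form an ETF (e.g., three unit vectors at 120 degrees in the plane), both diagnostics are zero; larger nc1 indicates more within-class diversity, which is the quantity the discussion below argues should not fully vanish during pretraining.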
Reusing a pretrained model with fine-tuning significantly reduces the computational cost and achieves superior performance on problems with limited training data. However, without principled guidance, the underlying mechanism of transfer learning is not well understood. First, when pretraining deep models on the upstream dataset, we lack good metrics for measuring the quality of the learned model or representation. In the past, people tended to rely empirically on controversial metrics for predicting transferred test performance, such as the validation accuracy on the pretraining data (e.g., validation accuracy on ImageNet (Kornblith et al., 2019)). For example, some popular approaches for boosting ImageNet validation accuracy (e.g., label smoothing (Szegedy et al., 2016) and dropout (Srivastava et al., 2014)) turn out to hurt transfer performance on downstream tasks (Kornblith et al., 2021). Additionally, many pretraining techniques that improve transferability, such as the design of loss functions, data augmentations, increased model size, and projection head layers (Chen et al., 2020; Khosla et al., 2020), are designed largely by trial and error, without much insight into the underlying mechanism. Second, given a pretrained model, how to efficiently fine-tune it on downstream tasks remains an open question. Although fully fine-tuning all the parameters of the pretrained model achieves the best performance, it becomes increasingly expensive as the model size grows (e.g., GPT-3 and transformers (Brown et al., 2020; Vaswani et al., 2017; Devlin et al., 2019; Dosovitskiy et al., 2021)). All these challenges call for a deeper understanding of what makes pretrained deep models more transferable. Contributions of this work. In this work, we provide a comprehensive investigation of the relationship between the transferability of pretrained models and NC.
As NC implies that the intra-class variability of each class collapses to zero, the representations learned via vanilla supervised learning fail to capture the intrinsic dimensionality of the input data, and hence they often transfer poorly. Intuitively, to make pretrained models transferable, the learned features for each class should be discriminative, yet diverse enough to preserve the intrinsic structures of the input data. On the other hand, when we fine-tune pretrained models on a downstream task, we desire more collapse of the features on the downstream training data. Based upon such intuitions, we adapt the metrics for evaluating NC to measure the quality of learned representations in terms of both intra-class diversity and between-class discrimination. As such, not only can we demystify several heuristics that are widely used in transfer learning, but we also open a door for designing methods to transfer large pretrained models more effectively. In short, our experimental findings based upon the NC metrics can be summarized as follows. • The transferability of pretrained models correlates with learned feature diversity on the source dataset. By evaluating the NC metrics on different loss functions (Hui & Belkin, 2020) and several widely used techniques in transfer learning (e.g., the addition of a projection head, different data augmentations (Chen et al., 2020; Chen & He, 2021; Khosla et al., 2020), and adversarial training (Salman et al., 2020; Deng et al., 2021)), we find that, to a certain extent, the more diverse the features are, the better the transferability of the pretrained model. This helps to explain the underlying mechanism of many popular heuristics for transfer learning. • More collapse of fine-tuned models leads to better test performance on downstream tasks. In contrast, when evaluating different pretrained models on downstream tasks, we observe that more collapsed features on the downstream data usually lead to better transfer accuracy. This phenomenon holds not only at the penultimate layer across different pretrained models, but also across different layers of the same pretrained model. • Pretrained models can be transferred more effectively through NC.
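One lightweight way to act on the third finding, i.e., to further collapse penultimate features on the downstream data while keeping the pretrained backbone frozen, is to tune a single added layer with a skip connection around it. The sketch below is only an illustration of this idea; the shapes, the ReLU nonlinearity, and the zero initialization are our own assumptions, not a prescribed recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def adapter_forward(h, W1, W2):
    """One trainable layer with an add-on skip connection.

    Frozen penultimate features h (n x d) are refined as
    h + relu(h @ W1) @ W2, so only (W1, W2) need to be trained
    on the downstream task while the backbone stays fixed.
    """
    return h + np.maximum(h @ W1, 0.0) @ W2

# Hypothetical shapes: n samples, d-dimensional frozen features,
# a small hidden width for the tuned layer.
n, d, hidden = 32, 16, 8
h = rng.normal(size=(n, d))        # frozen features from a pretrained backbone

W1 = rng.normal(scale=0.1, size=(d, hidden))
W2 = np.zeros((hidden, d))         # zero-init: the adapter starts as the identity

z = adapter_forward(h, W1, W2)
```

The zero initialization of W2 makes the added layer start as the identity map, so fine-tuning begins from the frozen representation and only gradually deforms it toward more collapsed downstream features.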
Efficient and effective transfer learning is of paramount importance for today's large models. Inspired by the above findings, and with the aim of collapsing the features of the penultimate layer, we improve transfer effectiveness while maintaining efficiency by tuning only one additional layer along with an add-on skip connection. We demonstrate that such a transfer learning strategy achieves better performance than the traditional fixed-feature transfer learning setting, and on-par or superior performance compared with the full model fine-tuning setting. Relationship to prior art. The prevalence of the NC phenomenon has recently attracted significant attention in both practice and theory, and our work draws the connection between NC and transfer learning. On the other hand, a few recent works investigate the properties of deep representations for transfer learning, which is also related to ours. We briefly summarize and discuss those results below. • Understandings of the NC phenomenon. There is a line of recent works, related to ours, deciphering the training, generalization, and transferability of deep networks in terms of NC (see the recent review (Kothapalli et al., 2022)). For training, recent works showed that NC happens for a variety of loss functions and formulations, such as cross-entropy (CE) (Papyan et al., 2020;



We find that there is a certain threshold: transferability increases with feature diversity below the threshold, but decreases or becomes uncorrelated with it beyond the threshold. Increasing feature diversity past this point decreases the margin between classes, and hence the relationship with transferability becomes more involved when the feature diversity is too large.




