GENERALIZING AND DECOUPLING NEURAL COLLAPSE VIA HYPERSPHERICAL UNIFORMITY GAP

Abstract

The neural collapse (NC) phenomenon describes an underlying geometric symmetry of deep neural networks, where both the deeply learned features and the classifiers converge to a simplex equiangular tight frame. It has been shown that both cross-entropy loss and mean squared error can provably lead to NC. We remove NC's key assumption on the feature dimension and the number of classes, and then present a generalized neural collapse (GNC) hypothesis that effectively subsumes the original NC. Inspired by how NC characterizes the training target of neural networks, we decouple GNC into two objectives: minimal intra-class variability and maximal inter-class separability. We then use hyperspherical uniformity (which characterizes the degree of uniformity on the unit hypersphere) as a unified framework to quantify these two objectives. Finally, we propose a general objective, the hyperspherical uniformity gap (HUG), defined as the difference between inter-class and intra-class hyperspherical uniformity. HUG not only provably converges to GNC, but also decouples GNC into two separate objectives. Unlike cross-entropy loss, which couples intra-class compactness and inter-class separability, HUG enjoys more flexibility and serves as a good alternative loss function. Empirical results show that HUG works well in terms of generalization and robustness.

1. INTRODUCTION

Recent years have witnessed the great success of deep representation learning in a variety of applications, ranging from computer vision [37] and natural language processing [16] to game playing [55, 64]. Despite this success, how deep representations generalize to unseen scenarios and when they might fail remain a black box. Deep representations are typically learned by a multi-layer network with cross-entropy (CE) loss optimized by stochastic gradient descent. In this simple setup, [86] has shown that zero loss can be achieved even with arbitrary label assignment. After continuing to train the neural network past zero loss with CE, [60] discovers an intriguing phenomenon called neural collapse (NC). NC can be summarized by the following characteristics:
• Intra-class variability collapse: Intra-class variability of last-layer features collapses to zero, indicating that all the features of the same class concentrate at their intra-class feature mean.
• Convergence to simplex ETF: After being centered at their global mean, the class-means are both linearly separable and maximally distant on a hypersphere. Formally, the class-means form a simplex equiangular tight frame (ETF), a symmetric structure defined by a set of maximally distant and pairwise equiangular points on a hypersphere.
• Convergence to self-duality: The linear classifiers, which live in the dual vector space to that of the class-means, converge to their corresponding class-means and also form a simplex ETF.
• Nearest decision rule: The linear classifiers behave like nearest class-mean classifiers.
The NC phenomenon suggests two general principles for deeply learned features and classifiers: minimal intra-class variability of features (i.e., features of the same class collapse to a single point), and maximal inter-class separability of classifiers / feature means (i.e., classifiers of different classes have maximal angular margins).
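To make the simplex ETF geometry concrete, the following sketch (a numerical illustration, not code from the paper; the helper name `simplex_etf` is hypothetical) constructs the centered-and-normalized standard basis in R^C, which forms a simplex ETF, and checks that every pair of class means has cosine similarity −1/(C−1), the maximally distant equiangular configuration:

```python
import numpy as np

def simplex_etf(C):
    """Construct a simplex ETF of C points: center the standard basis of R^C
    at its mean, then normalize each row to the unit hypersphere."""
    M = np.eye(C) - np.ones((C, C)) / C
    return M / np.linalg.norm(M, axis=1, keepdims=True)

C = 4
M = simplex_etf(C)
G = M @ M.T                              # Gram matrix of pairwise cosines
off_diag = G[~np.eye(C, dtype=bool)]
# All pairwise cosines equal -1/(C-1): the points are equiangular and
# maximally distant on the hypersphere.
assert np.allclose(off_diag, -1.0 / (C - 1))
```

The same check passes for any C ≥ 2, which is exactly the equiangularity property that NC predicts for the centered class-means.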
While these two principles are largely independent, popular loss functions such as CE and mean squared error (MSE) completely couple them. Since there is no trivial way for CE and MSE to decouple these two principles, we identify a novel quantity, the hyperspherical uniformity gap (HUG), which not only characterizes intra-class feature compactness and inter-class classifier separability as a whole, but also fully decouples the two principles. The decoupling enables HUG to separately model intra-class compactness and inter-class separability, making it highly flexible. More importantly, HUG can be directly optimized and used to train neural networks, serving as an alternative loss function in place of CE and MSE for classification. HUG is formulated as the difference between inter-class and intra-class hyperspherical uniformity. Hyperspherical uniformity [48] quantifies the uniformity of a set of vectors on a hypersphere and captures how diverse these vectors are. Thanks to the flexibility of HUG, we are able to use many different formulations to characterize hyperspherical uniformity, including (but not limited to) minimum hyperspherical energy (MHE) [45], maximum hyperspherical separation (MHS) [48], and maximum Gram determinant (MGD) [48]. Different formulations yield different interpretations and optimization difficulties (e.g., HUG with MHE is easy to optimize, while HUG with MGD has an interesting connection to geometric volume), thus leading to different performance. Similar to CE loss, HUG provably leads to NC under the setting of unconstrained features [53]. Going beyond NC, we hypothesize a generalized NC (GNC) with hyperspherical uniformity, which extends the original NC to the scenario where there is no constraint on the number of classes and the feature dimension. NC requires the feature dimension to be no smaller than the number of classes, while GNC no longer requires this.
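To illustrate the shape of a HUG-style objective, here is a minimal NumPy sketch. It uses a Gaussian-kernel pairwise energy as the uniformity surrogate (the paper's actual instantiations are MHE, MHS, and MGD); `pairwise_energy`, `hug_loss`, and the kernel bandwidth are hypothetical choices for illustration only. Low energy corresponds to high uniformity, so minimizing the inter-class term spreads the class means apart, while subtracting the intra-class term rewards each class collapsing to a single point:

```python
import numpy as np

def pairwise_energy(X, t=2.0):
    """Gaussian-kernel energy on the hypersphere: lower energy = more uniform."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    mask = ~np.eye(len(X), dtype=bool)
    return np.exp(-t * d2[mask]).mean()

def hug_loss(features, labels):
    """HUG-style loss: inter-class energy of the class means (minimized, so
    the means spread out) minus mean intra-class energy (maximized, so the
    features of each class collapse)."""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(0) for c in classes])
    means /= np.linalg.norm(means, axis=1, keepdims=True)
    inter = pairwise_energy(means)
    intra = np.mean([pairwise_energy(features[labels == c]) for c in classes])
    return inter - intra

# Collapsed classes with antipodal means score lower than overlapping classes.
y = np.array([0, 0, 1, 1])
good = np.array([[1., 0.], [1., 0.], [-1., 0.], [-1., 0.]])
bad = np.array([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])
assert hug_loss(good, y) < hug_loss(bad, y)
```

The two terms are computed independently, which is the decoupling property: one can reweight or swap either term without touching the other, something CE does not permit.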
We further prove that HUG also leads to GNC at its objective minimum. Another motivation behind HUG comes from the classic Fisher discriminant analysis (FDA) [19], where the basic idea is to find a projection matrix T that maximizes the between-class variance and minimizes the within-class variance. What if we directly optimize the input data (without any projection) rather than optimizing the linear projection in FDA? We make a simple derivation below:

Projection FDA: $\max_{T \in \mathbb{R}^{d \times r}} \ \mathrm{tr}\big( (T^\top S_w T)^{-1} (T^\top S_b T) \big)$

Data FDA: $\max_{x_1, \cdots, x_n \in \mathbb{S}^{d-1}} \ \mathrm{tr}(S_b) - \mathrm{tr}(S_w)$

where the within-class scatter matrix is $S_w = \sum_{i=1}^{C} \sum_{j \in A_i} (x_j - \mu_i)(x_j - \mu_i)^\top$, the between-class scatter matrix is $S_b = \sum_{i=1}^{C} n_i (\mu_i - \bar{\mu})(\mu_i - \bar{\mu})^\top$, $A_i$ denotes the index set of samples in the $i$-th class, $n_i$ is the number of samples in the $i$-th class, $n$ is the total number of samples, $\mu_i = n_i^{-1} \sum_{j \in A_i} x_j$ is the $i$-th class-mean, and $\bar{\mu} = n^{-1} \sum_{j=1}^{n} x_j$ is the global mean. By considering class-balanced data on the unit hypersphere, optimizing data FDA is equivalent to simultaneously maximizing $\mathrm{tr}(S_b)$ and minimizing $\mathrm{tr}(S_w)$. Maximizing $\mathrm{tr}(S_b)$ encourages inter-class separability and is a necessary condition for hyperspherical uniformity.¹ Minimizing $\mathrm{tr}(S_w)$ encourages intra-class feature collapse, reducing intra-class variability. Therefore, HUG can be viewed as a generalized FDA criterion for learning maximally discriminative features. However, one may ask the following questions: Why is HUG useful if we already have the FDA criterion? Could we simply optimize data FDA? In fact, the FDA criterion has many degenerate solutions. For example, consider 10-class balanced data where all features from the first 5 classes collapse to the north pole of the unit hypersphere and all features from the remaining 5 classes collapse to the south pole. In this case, $\mathrm{tr}(S_w)$ is already minimized, since it attains its minimum of zero, and $\mathrm{tr}(S_b)$ simultaneously attains its maximum $n$, so the data FDA objective is maximized even though the 10 classes are far from discriminable.
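The degenerate solution above is easy to verify numerically. The sketch below (illustrative code with a hypothetical helper `fda_traces`) builds the two-pole, 10-class configuration and confirms that tr(S_w) = 0 and tr(S_b) = n, so the data FDA objective is already at its optimum even though half of the classes sit on top of each other:

```python
import numpy as np

def fda_traces(X, y):
    """Return (tr(S_b), tr(S_w)) for features X (rows) with labels y."""
    mu_g = X.mean(0)                          # global mean
    tr_sb = tr_sw = 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(0)                     # class mean
        tr_sb += len(Xc) * np.sum((mu_c - mu_g) ** 2)
        tr_sw += np.sum((Xc - mu_c) ** 2)
    return tr_sb, tr_sw

# Degenerate 10-class config on the unit sphere in R^3: classes 0-4 collapse
# to the north pole, classes 5-9 to the south pole, 4 samples per class.
north, south = np.array([0., 0., 1.]), np.array([0., 0., -1.])
X = np.stack([north if c < 5 else south for c in range(10) for _ in range(4)])
y = np.repeat(np.arange(10), 4)
tr_sb, tr_sw = fda_traces(X, y)
# tr(S_w) = 0 and tr(S_b) = n = 40: data FDA is maximized, yet 5 classes
# share each pole and are completely indistinguishable.
```

HUG avoids this failure because inter-class hyperspherical uniformity penalizes coinciding class means, whereas tr(S_b) does not.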
In contrast, HUG naturally generalizes FDA without these degenerate solutions and serves as a more reliable criterion for training neural networks. We summarize our contributions below:
• We decouple the NC phenomenon into two separate learning objectives: maximal inter-class separability (i.e., maximally distant class feature means and classifiers on the hypersphere) and minimal intra-class variability (i.e., intra-class features collapse to a single point on the hypersphere).
• Based on the two principled objectives induced by NC, we hypothesize the generalized NC, which generalizes NC by dropping the constraint on the feature dimension and the number of classes.
• We identify a general quantity called the hyperspherical uniformity gap, which well characterizes both inter-class separability and intra-class variability. Different from the widely used CE loss, HUG naturally decouples both principles and thus enjoys better modeling flexibility.
• Under the HUG framework, we consider three different choices for characterizing hyperspherical uniformity: minimum hyperspherical energy, maximum hyperspherical separation, and maximum Gram determinant. HUG provides a unified framework for designing new loss functions from different characterizations of hyperspherical uniformity.



¹We first obtain the upper bound $n$ of $\mathrm{tr}(S_b)$ from $\mathrm{tr}(S_b) = \sum_{i=1}^{C} n_i \|\mu_i - \bar{\mu}\|^2 = \sum_{i=1}^{C} n_i \|\mu_i\|^2 - n\|\bar{\mu}\|^2 \le n$, since $\|\mu_i\| \le 1$ for class means of unit-norm features. Because a set of vectors $\{\mu_i\}_{i=1}^{C}$ achieving hyperspherical uniformity has $\mathbb{E}_{\mu_1, \cdots, \mu_C}\{\|\bar{\mu}\|\} \to 0$ as $C$ grows larger [20], $\mathrm{tr}(S_b)$ attains $n$ in this case. Therefore, vectors achieving hyperspherical uniformity are among its maximizers. $\mathrm{tr}(S_w)$ can simultaneously attain its minimum of zero if intra-class features collapse to a single point.
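The limiting argument in the footnote can be checked numerically. In the sketch below (the sizes C and d are arbitrary choices for illustration), class means are drawn uniformly on the hypersphere via the standard Gaussian-then-normalize construction, one collapsed sample per class; the empirical global mean shrinks toward zero and tr(S_b) = n(1 − ∥µ̄∥²) approaches its upper bound n:

```python
import numpy as np

rng = np.random.default_rng(0)
C, d = 500, 8                        # many classes, one collapsed feature each
mu = rng.normal(size=(C, d))
mu /= np.linalg.norm(mu, axis=1, keepdims=True)   # ~uniform on S^{d-1}
mu_bar = mu.mean(0)                  # global mean; shrinks toward 0 as C grows
n = C                                # balanced data: one sample per class
tr_sb = np.sum((mu - mu_bar) ** 2)   # equals n * (1 - ||mu_bar||^2) exactly
# tr(S_b) <= n always, and it approaches n under hyperspherical uniformity.
```

With unit-norm means the identity tr(S_b) = n(1 − ∥µ̄∥²) is exact, so the gap to the bound n is precisely n∥µ̄∥², vanishing as uniformity drives ∥µ̄∥ to zero.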

