GENERALIZING AND DECOUPLING NEURAL COLLAPSE VIA HYPERSPHERICAL UNIFORMITY GAP

Abstract

The neural collapse (NC) phenomenon describes an underlying geometric symmetry for deep neural networks, where both deeply learned features and classifiers converge to a simplex equiangular tight frame. It has been shown that both the cross-entropy loss and the mean squared error can provably lead to NC. We remove NC's key assumption on the feature dimension and the number of classes, and then present a generalized neural collapse (GNC) hypothesis that effectively subsumes the original NC. Inspired by how NC characterizes the training target of neural networks, we decouple GNC into two objectives: minimal intra-class variability and maximal inter-class separability. We then use hyperspherical uniformity (which characterizes the degree of uniformity on the unit hypersphere) as a unified framework to quantify these two objectives. Finally, we propose a general objective, the hyperspherical uniformity gap (HUG), defined as the difference between inter-class and intra-class hyperspherical uniformity. HUG not only provably converges to GNC, but also decouples GNC into two separate objectives. Unlike the cross-entropy loss, which couples intra-class compactness and inter-class separability, HUG enjoys more flexibility and serves as a good alternative loss function. Empirical results show that HUG works well in terms of generalization and robustness.

1. INTRODUCTION

Recent years have witnessed the great success of deep representation learning in a variety of applications, ranging from computer vision [37] and natural language processing [16] to game playing [55, 64]. Despite this success, how deep representations generalize to unseen scenarios and when they might fail remain a black box. Deep representations are typically learned by a multi-layer network with the cross-entropy (CE) loss optimized by stochastic gradient descent. In this simple setup, [86] has shown that zero loss can be achieved even with arbitrary label assignment. After continuing to train the neural network past zero loss with CE, [60] discovered an intriguing phenomenon called neural collapse (NC). NC can be summarized by the following characteristics:

• Intra-class variability collapse: Intra-class variability of last-layer features collapses to zero, indicating that all the features of the same class concentrate at their intra-class feature mean.

• Convergence to simplex ETF: After being centered at their global mean, the class-means are both linearly separable and maximally distant on a hypersphere. Formally, the class-means form a simplex equiangular tight frame (ETF), a symmetric structure defined by a set of maximally distant and pairwise equiangular points on a hypersphere.

• Convergence to self-duality: The linear classifiers, which live in the dual vector space to that of the class-means, converge to their corresponding class-means and also form a simplex ETF.

• Nearest decision rule: The linear classifiers behave like nearest class-mean classifiers.

The NC phenomenon suggests two general principles for deeply learned features and classifiers: minimal intra-class variability of features (i.e., features of the same class collapse to a single point), and maximal inter-class separability of classifiers / feature means (i.e., classifiers of different classes have maximal angular margins).
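The simplex ETF structure above can be checked numerically: the centered, normalized class-means of a simplex ETF are pairwise equiangular with cosine exactly -1/(C-1), a known property of the configuration. A minimal numpy sketch (all names are illustrative, not from the paper):

```python
import numpy as np

def simplex_etf(C):
    """Regular (C-1)-simplex vertices: centered, normalized standard basis."""
    E = np.eye(C)
    M = E - E.mean(axis=0)                          # center at the global mean
    M /= np.linalg.norm(M, axis=1, keepdims=True)   # project onto the unit sphere
    return M

C = 4
M = simplex_etf(C)
cos = M @ M.T
off = cos[~np.eye(C, dtype=bool)]                   # off-diagonal pairwise cosines
print(np.allclose(off, -1.0 / (C - 1)))             # → True
```

Any rotation of these points is also a simplex ETF, which is why NC pins down the geometry only up to a rotation.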
While these two principles are largely independent, popular loss functions such as CE and mean squared error (MSE) completely couple them. Since there is no trivial way for CE or MSE to decouple these two principles, we identify a novel quantity, the hyperspherical uniformity gap (HUG), which not only characterizes intra-class feature compactness and inter-class classifier separability as a whole, but also fully decouples the two principles. The decoupling enables HUG to separately model intra-class compactness and inter-class separability, making it highly flexible. More importantly, HUG can be directly optimized and used to train neural networks, serving as an alternative loss function in place of CE and MSE for classification. HUG is formulated as the difference between inter-class and intra-class hyperspherical uniformity. Hyperspherical uniformity [48] quantifies the uniformity of a set of vectors on a hypersphere and captures how diverse these vectors are. Thanks to the flexibility of HUG, we are able to use many different formulations to characterize hyperspherical uniformity, including (but not limited to) minimum hyperspherical energy (MHE) [45], maximum hyperspherical separation (MHS) [48] and maximum Gram determinant (MGD) [48]. Different formulations yield different interpretations and optimization difficulties (e.g., HUG with MHE is easy to optimize, while HUG with MGD has an interesting connection to geometric volume), thus leading to different performance. Similar to the CE loss, HUG also provably leads to NC under the setting of unconstrained features [53]. Going beyond NC, we hypothesize a generalized NC (GNC) with hyperspherical uniformity, which extends the original NC to the scenario where there is no constraint on the number of classes and the feature dimension: NC requires the feature dimension to be no smaller than the number of classes, while GNC no longer requires this.
We further prove that HUG leads to GNC at its objective minimum. Another motivation behind HUG comes from the classic Fisher discriminant analysis (FDA) [19], whose basic idea is to find a projection matrix T that maximizes the between-class variance and minimizes the within-class variance. What if we directly optimize the input data (without any projection) rather than optimizing the linear projection in FDA? We make a simple derivation below:

Projection FDA: max_{T ∈ R^{d×r}} tr((T^⊤ S_w T)^{−1} T^⊤ S_b T);  Data FDA: max_{x_1, …, x_n ∈ S^{d−1}} tr(S_b) − tr(S_w),

where the within-class scatter matrix is S_w = Σ_{c=1}^C Σ_{j∈A_c} (x_j − µ_c)(x_j − µ_c)^⊤, the between-class scatter matrix is S_b = Σ_{c=1}^C n_c (µ_c − µ̄)(µ_c − µ̄)^⊤, n_c is the number of samples in the c-th class, n is the total number of samples, A_c is the sample index set of the c-th class, µ_c = n_c^{−1} Σ_{j∈A_c} x_j is the c-th class-mean, and µ̄ = n^{−1} Σ_{j=1}^n x_j is the global mean. By considering class-balanced data on the unit hypersphere, optimizing data FDA is equivalent to simultaneously maximizing tr(S_b) and minimizing tr(S_w). Maximizing tr(S_b) encourages inter-class separability and is a necessary condition for hyperspherical uniformity. Minimizing tr(S_w) encourages intra-class feature collapse, reducing intra-class variability. Therefore, HUG can be viewed as a generalized FDA criterion for learning maximally discriminative features. However, one may ask the following questions: why is HUG useful if we already have the FDA criterion? Could we simply optimize data FDA? In fact, the FDA criterion has many degenerate solutions. For example, consider a scenario of 10-class balanced data where all features from the first 5 classes collapse to the north pole of the unit hypersphere and the features from the remaining 5 classes collapse to the south pole. In this case, tr(S_w) is already minimized since it achieves the minimum of zero, and tr(S_b) also achieves its maximum n at the same time.
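The degenerate two-pole example can be verified numerically: with balanced classes collapsed onto two antipodal points, tr(S_w) = 0 and tr(S_b) = n, even though the 10 classes are far from uniformly spread. A minimal sketch assuming numpy (class and sample counts are illustrative):

```python
import numpy as np

C, per = 10, 20                                        # classes, samples per class
north, south = np.array([0., 0., 1.]), np.array([0., 0., -1.])
X = np.stack([north if c < 5 else south for c in range(C) for _ in range(per)])
y = np.repeat(np.arange(C), per)

mu = np.stack([X[y == c].mean(0) for c in range(C)])   # class-means (= poles)
mu_g = X.mean(0)                                       # global mean (= origin here)
Sw = sum((X[y == c] - mu[c]).T @ (X[y == c] - mu[c]) for c in range(C))
Sb = sum(per * np.outer(mu[c] - mu_g, mu[c] - mu_g) for c in range(C))
print(np.trace(Sw), np.trace(Sb))                      # → 0.0 and n = 200
```

Both FDA terms are already optimal here, yet half of the classes are mutually indistinguishable; HUG's uniformity term rules out exactly this kind of solution.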
In contrast, HUG naturally generalizes FDA without these degenerate solutions and serves as a more reliable criterion for training neural networks. We summarize our contributions below:

• We decouple the NC phenomenon into two separate learning objectives: maximal inter-class separability (i.e., maximally distant class feature means and classifiers on the hypersphere) and minimal intra-class variability (i.e., intra-class features collapse to a single point on the hypersphere).

• Based on the two principled objectives induced by NC, we hypothesize the generalized NC, which generalizes NC by dropping the constraint on the feature dimension and the number of classes.

• We identify a general quantity called the hyperspherical uniformity gap, which characterizes both inter-class separability and intra-class variability well. Different from the widely used CE loss, HUG naturally decouples both principles and thus enjoys better modeling flexibility.

• Under the HUG framework, we consider three different choices for characterizing hyperspherical uniformity: minimum hyperspherical energy, maximum hyperspherical separation and maximum Gram determinant. HUG provides a unified framework for using different characterizations of hyperspherical uniformity to design new loss functions.

2. ON GENERALIZING AND DECOUPLING NEURAL COLLAPSE

NC describes an intriguing phenomenon for the distribution of last-layer features and classifiers in over-trained neural networks, where both features and classifiers converge to a simplex ETF. However, a simplex ETF can only exist when the feature dimension d and the number of classes C satisfy d ≥ C − 1, which is not always the case for deep neural networks. For example, neural networks for face recognition are usually trained by classifying a large number of classes (e.g., more than 85K classes in [23]), and the feature dimension (e.g., 512 in SphereFace [43]) is usually much smaller than the number of classes. In general, when the number of classes is already large, it is prohibitive to use an even larger feature dimension. Thus a question arises: what happens in this case if a neural network is fully trained? Interestingly, one can observe that the learned features in both cases approach the configuration of equally spaced frames on the hypersphere. To accommodate the case of d < C − 1, we extend NC to the generalized NC by hypothesizing that last-layer class features and classifiers converge to equally spaced points on the hypersphere, which can be characterized by hyperspherical uniformity.

Generalized Neural Collapse (GNC)

We define the feature global mean as µ_G = Ave_{i,c} x_{i,c}, where x_{i,c} ∈ R^d is the last-layer feature of the i-th sample in the c-th class; the feature class-means as µ_c = Ave_i x_{i,c} for classes c ∈ {1, …, C}; the feature within-class covariance as Σ_W = Ave_{i,c} (x_{i,c} − µ_c)(x_{i,c} − µ_c)^⊤; and the feature between-class covariance as Σ_B = Ave_c (µ_c − µ_G)(µ_c − µ_G)^⊤. GNC states that:

• (1) Intra-class variability collapse: Intra-class variability of last-layer features collapses to zero, indicating that all the features of the same class converge to their intra-class feature mean. Formally, GNC has Σ_B^† Σ_W → 0, where † denotes the Moore-Penrose pseudoinverse.

• (2) Convergence to hyperspherical uniformity: After being centered at their global mean, the class-means are both linearly separable and maximally distant on a hypersphere. Formally, the class-means converge to equally spaced points on a hypersphere, i.e.,

Σ_{c≠c'} K(µ̂_c, µ̂_{c'}) → min_{µ̂_1, …, µ̂_C} Σ_{c≠c'} K(µ̂_c, µ̂_{c'}),  ‖µ_c − µ_G‖ − ‖µ_{c'} − µ_G‖ → 0, ∀ c ≠ c',   (1)

where µ̂_c = (µ_c − µ_G) / ‖µ_c − µ_G‖ and K(·, ·) is a kernel function that models pairwise interaction. Typically, we consider the Riesz s-kernel K_s(µ̂_c, µ̂_{c'}) = sign(s) · ‖µ̂_c − µ̂_{c'}‖^{−s} or the logarithmic kernel K_log(µ̂_c, µ̂_{c'}) = log ‖µ̂_c − µ̂_{c'}‖^{−1}. For example, the Riesz s-kernel with s = d − 2 is a variational characterization of hyperspherical uniformity (e.g., hyperspherical energy [45]) using Newtonian potentials. In the case of d = 3, s = 1, the Riesz kernel is called the Coulomb potential, and the problem of finding the minimal Coulomb energy is called the Thomson problem [70].

• (3) Convergence to self-duality: The linear classifiers, which live in the dual vector space to that of the class-means, converge to their corresponding class-means, leading to hyperspherical uniformity. Formally, GNC has ‖w_c‖^{−1} w_c − µ̂_c → 0, where w_c ∈ R^d is the c-th classifier.
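The pairwise energy in Eq. 1 is straightforward to evaluate. The sketch below (assuming numpy; the configurations are illustrative) computes it under both the Riesz s-kernel and the logarithmic kernel, and checks that three equally spaced points on S^1 achieve lower energy than a perturbed configuration:

```python
import numpy as np

def energy(M, kernel):
    """Hyperspherical energy: sum of K over ordered pairs (c != c') of unit vectors."""
    D = np.linalg.norm(M[:, None, :] - M[None, :, :], axis=-1)
    mask = ~np.eye(len(M), dtype=bool)     # drop the zero diagonal before the kernel
    return kernel(D[mask]).sum()

riesz = lambda d, s=2.0: np.sign(s) * d ** (-s)   # Riesz s-kernel, here s = 2
logk = lambda d: -np.log(d)                       # logarithmic kernel

def circle(angles):
    a = np.asarray(angles)
    return np.stack([np.cos(a), np.sin(a)], axis=1)

uniform = circle([0, 2 * np.pi / 3, 4 * np.pi / 3])            # equally spaced
perturbed = circle([0, 2 * np.pi / 3 + 0.3, 4 * np.pi / 3])    # slightly off
print(energy(uniform, riesz))                                  # → 2.0
print(energy(uniform, riesz) < energy(perturbed, riesz))       # → True
print(energy(uniform, logk) < energy(perturbed, logk))         # → True
```

Equally spaced points are the unique minimizer (up to rotation) for both kernels on the circle, so any perturbation strictly increases the energy.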
In contrast to NC, GNC further considers the case of d < C − 1 and hypothesizes that both feature class-means and classifiers converge to a hyperspherically uniform point configuration that minimizes some form of pairwise potential. Similar to how NC connects tight frame theory [74] to deep learning, our GNC hypothesis connects potential theory [3] to deep learning, which may shed new light on understanding it. We show in Theorem 1 that GNC reduces to NC in the case of d ≥ C − 1.

Theorem 1 (Regular Simplex Optimum for GNC) Let f : (0, 4] → R be a convex and decreasing function defined at v = 0 by lim_{v→0+} f(v). If 2 ≤ C ≤ d + 1, then the vertices of regular (C − 1)-simplices inscribed in S^{d−1} with centers at the origin (equivalent to a simplex ETF) minimize the hyperspherical energy Σ_{c≠c'} K(µ̂_c, µ̂_{c'}) on the unit hypersphere S^{d−1} (d ≥ 3) with the kernel K(µ̂_c, µ̂_{c'}) = f(‖µ̂_c − µ̂_{c'}‖²). If f is strictly convex and strictly decreasing, then these are the only energy-minimizing C-point configurations. Thus GNC reduces to NC when d ≥ C − 1.

We note that Theorem 1 guarantees the simplex ETF as the minimizer of a general family of hyperspherical energies (as long as f is convex and decreasing). This suggests that there are many possible kernel functions K(·, ·) in GNC that can effectively generalize NC. The case of d < C − 1 is where GNC becomes really interesting but complicated. Other than the regular simplex case, we also highlight a special uniformity case of 2d = C. In this case, we prove in Theorem 2 that GNC(2) converges to the vertices of a cross-polytope as the hyperspherical energy gets minimized. As the number of classes grows infinitely large, we show in Theorem 3 that GNC(2) leads to a point configuration that is uniformly distributed on S^{d−1}. Additionally, we show a simple yet interesting result in Proposition 1: in practice, the last-layer classifiers are already initialized to be uniformly distributed on the hypersphere.
Theorem 2 (Cross-polytope Optimum for GNC) If C = 2d, then the vertices of the cross-polytope minimize the hyperspherical energy in GNC(2).

The cross-polytope optimum for GNC(2) is in fact quite intuitive, because it corresponds to the Cartesian coordinate system (up to a rotation). For example, the vertices of the unit cross-polytope in R³ are (±1, 0, 0), (0, ±1, 0), (0, 0, ±1); these 6 vectors minimize the hyperspherical energy on S². We illustrate both the regular simplex and cross-polytope cases in Figure 2. For the other cases of d < C − 1, there generally exists no simple and universal point structure that minimizes the hyperspherical energy, as heavily studied in [12, 27, 38, 63]. For point configurations that asymptotically minimize the hyperspherical energy as C grows larger, Theorem 3 guarantees that these configurations asymptotically converge to the uniform distribution on the hypersphere.

Theorem 3 (Asymptotic Convergence to Hyperspherical Uniformity) Consider a sequence of point configurations {µ̂_1^C, …, µ̂_C^C}_{C=2}^∞ that asymptotically minimizes the hyperspherical energy on S^{d−1} as C → ∞. Then {µ̂_1^C, …, µ̂_C^C}_{C=2}^∞ is asymptotically uniformly distributed on the hypersphere S^{d−1}.

Proposition 1 (Minimum Energy Initialization) With zero-mean Gaussian initialization (e.g., [22, 28]), the C last-layer classifiers of neural networks are initialized as a uniform distribution on the hypersphere. The expected initial energy is C(C − 1) ∫_{S^{d−1}} ∫_{S^{d−1}} ‖µ̂_c − µ̂_{c'}‖^{−2} dσ_{d−1}(µ̂_c) dσ_{d−1}(µ̂_{c'}).

With Proposition 1, one can expect that the hyperspherical energy of the last-layer classifiers will first increase and then decrease to a value lower than the initial energy.
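The cross-polytope case can be sanity-checked numerically. The sketch below (assuming numpy) uses the s = 1 Coulomb kernel, i.e., the Thomson-problem instance mentioned under GNC(2), for which the octahedron (the cross-polytope in R³) is the proven global minimizer for 6 points; nudging one vertex strictly increases the energy:

```python
import numpy as np

def coulomb_energy(V):
    """s = 1 Riesz (Coulomb) energy over unordered pairs of unit vectors."""
    iu = np.triu_indices(len(V), k=1)
    D = np.linalg.norm(V[iu[0]] - V[iu[1]], axis=1)
    return np.sum(1.0 / D)

cross = np.concatenate([np.eye(3), -np.eye(3)])      # C = 2d = 6 vertices
pert = cross.copy()
pert[0] = np.array([1.0, 0.2, 0.0])                  # nudge one vertex
pert[0] /= np.linalg.norm(pert[0])                   # keep it on the sphere
print(coulomb_energy(cross) < coulomb_energy(pert))  # → True
```

The same comparison holds for the s = 2 kernel used in our GNC(2) experiments; the Coulomb case is shown because its optimality for 6 points is classical.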
To validate the effectiveness of our GNC hypothesis, we conduct a few experiments to show how both class feature means and classifiers converge to hyperspherical uniformity (i.e., minimize the hyperspherical energy), and how intra-class feature variability collapses to almost zero. We start with an intuitive understanding of GNC from Figure 1. The results are directly produced by the learned features without any visualization tool (such as t-SNE [73]), so the feature distribution reflects the underlying one learned by the neural network. We observe that GNC is attained both when d < C − 1 and when d ≥ C − 1, while NC is violated when d < C − 1 since the learned feature class-means can no longer form a simplex ETF. To see whether the same conclusion holds for higher feature dimensions, we also train two CNNs on CIFAR-100 with feature dimensions of 64 and 128, respectively. The results are given in Figure 3, which shows that GNC captures well the underlying convergence of neural network training. Figure 3(a, c) shows that the hyperspherical energies of feature class-means and classifiers converge to a small value, verifying the correctness of GNC(2) and GNC(3), which state that both feature class-means and classifiers converge to hyperspherical uniformity. More interestingly, in the MNIST experiment, we can compute the exact minimal energy on S^1: it is 2 in the case of d = 2, C = 3 (1/3 for the average energy) and ≈ 82.5 in the case of d = 2, C = 10 (≈ 0.917 for the average energy). The final average energy in Figure 3(a) matches our theoretical minimum well. From Figure 3(c), we observe that the classifier energy stays close to its minimum from the very beginning, which matches Proposition 1 that vectors initialized with a zero-mean Gaussian are uniformly distributed over the hypersphere (this phenomenon becomes more evident in higher dimensions).
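The quoted minima on S^1 can be reproduced exactly, since the minimizers are equally spaced points on the circle. A short numpy check of the s = 2 energy:

```python
import numpy as np

def circle_energy(C, s=2):
    """Riesz-s energy (ordered pairs) of C equally spaced points on S^1."""
    theta = 2 * np.pi * np.arange(C) / C
    V = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    D = np.linalg.norm(V[:, None] - V[None, :], axis=-1)
    mask = ~np.eye(C, dtype=bool)
    return np.sum(D[mask] ** (-s))

print(circle_energy(3))    # → 2.0   (average 2/6 = 1/3)
print(circle_energy(10))   # → 82.5  (average 82.5/90 ≈ 0.917)
```

These match the closed form Σ_{m=1}^{C−1} 1/(4 sin²(πm/C)) = (C² − 1)/12 per point for s = 2, i.e., 10 · 8.25 = 82.5 for C = 10.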
To evaluate the intra-class feature variability, we consider a hyperspherical reverse-energy E_r = Σ_{i≠j∈A_c} ‖x̂_i − x̂_j‖, where x̂_i = x_i / ‖x_i‖ and A_c denotes the sample index set of the c-th class. The smaller this reverse-energy gets, the less intra-class variability it implies. Figure 3(b, d) shows that the intra-class feature variability approaches zero, as GNC(1) suggests. Details and more empirical results on GNC are given in Appendix A.

Now we discuss how to decouple the GNC hypothesis and how such a decoupling enables us to design new objectives for training neural networks. GNC(1) and GNC(2) suggest minimizing intra-class feature variability and maximizing inter-class feature separability, respectively. GNC(3) and GNC(4) are natural consequences if GNC(1) and GNC(2) hold. It has long been observed [42, 68, 82] that last-layer classifiers serve as proxies representing the features of their corresponding class, and they are also an approximation to the feature class-means. GNC(3) indicates that the classifiers converge to hyperspherical uniformity, which, together with GNC(1), implies GNC(4). By now, it is clear that GNC boils down to two decoupled objectives: maximizing inter-class separability and minimizing intra-class variability, which again echoes the goal of FDA. The problem reduces to how to effectively characterize these two objectives while keeping them decoupled for flexibility (unlike CE or MSE). In the next section, we propose to address this problem by characterizing both objectives with a unified quantity: hyperspherical uniformity.
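The reverse-energy E_r is easy to compute and behaves as described: it shrinks toward zero as a class's features collapse. A sketch assuming numpy (the synthetic "collapsed" and "scattered" classes are illustrative):

```python
import numpy as np

def reverse_energy(X):
    """E_r: sum of pairwise distances of normalized features (ordered pairs)."""
    Xh = X / np.linalg.norm(X, axis=1, keepdims=True)
    iu = np.triu_indices(len(X), k=1)
    return 2.0 * np.sum(np.linalg.norm(Xh[iu[0]] - Xh[iu[1]], axis=1))

rng = np.random.default_rng(0)
anchor = rng.normal(size=3)
scattered = rng.normal(size=(32, 3))                    # high intra-class variability
collapsed = anchor + 1e-4 * rng.normal(size=(32, 3))    # near-collapsed class
print(reverse_energy(collapsed) < reverse_energy(scattered))  # → True
```

Unlike the hyperspherical energy, which diverges as points merge, E_r attains its minimum of zero exactly at the collapsed configuration GNC(1) predicts.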

3. HYPERSPHERICAL UNIFORMITY GAP

3.1. GENERAL FRAMEWORK

As GNC(2) suggests, inter-class separability is well captured by the hyperspherical uniformity of feature class-means, so it is natural to directly use it as a learning target. On the other hand, GNC(1) does not suggest any easy-to-use quantity to characterize intra-class variability. We note that minimizing intra-class variability is actually equivalent to encouraging features of the same class to concentrate at a single point, which is the opposite of hyperspherical uniformity. Therefore, we can unify both intra-class variability and inter-class separability with a single characterization of hyperspherical uniformity. We propose to maximize the hyperspherical uniformity gap:

max_{{x̂_j}_{j=1}^n} L_HUG := α · HU({µ̂_c}_{c=1}^C) − β · Σ_{c=1}^C HU({x̂_i}_{i∈A_c}),   (2)

where the first term T_b is the inter-class hyperspherical uniformity and the second term T_w is the intra-class hyperspherical uniformity; α, β are hyperparameters; µ̂_c = µ_c / ‖µ_c‖ is the feature class-mean projected onto the unit hypersphere; µ_c = |A_c|^{−1} Σ_{i∈A_c} x_i is the feature class-mean; x_i is the last-layer feature of the i-th sample; and A_c denotes the sample index set of the c-th class. HU({v_i}_{i=1}^m) denotes some measure of hyperspherical uniformity for vectors {v_1, …, v_m}. Eq. 2 is the general objective for HUG. Without loss of generality, we assume that the larger it gets, the stronger hyperspherical uniformity we have. We mostly focus on supervised learning with parametric class proxies, where the CE loss is widely used as a de facto choice, although HUG can be used in much broader settings as discussed later. In the HUG framework, there is no longer a clear notion of classifiers (unlike the CE loss), but we can still utilize class proxies (i.e., a generalized concept of classifiers) to facilitate the optimization. We observe that Eq. 2 directly optimizes the feature class-means for inter-class separability, but they are intractable to compute during training (we would need to recompute them in every iteration). Therefore it is nontrivial to optimize the original HUG for training neural networks. A naive solution is to approximate the feature class-means with a few mini-batches such that the gradients of T_b can still be back-propagated to the last-layer features. However, it may take many mini-batches to obtain sufficiently accurate class-means, and the approximation becomes much more difficult with a large number of classes. To address this, we employ parametric class proxies to act as representatives of intra-class features and optimize them instead of the feature class-means. We thus modify the HUG objective as

max_{{x̂_j}_{j=1}^n, {ŵ_c}_{c=1}^C} L_P-HUG := α · HU({ŵ_c}_{c=1}^C) − β · Σ_{c=1}^C HU({x̂_i}_{i∈A_c}, ŵ_c),   (3)

where ŵ_c ∈ S^{d−1} is the parametric proxy for the c-th class.
The intra-class hyperspherical uniformity term connects the class proxies with features by minimizing their joint hyperspherical uniformity, guiding features to move towards their corresponding class proxy. When training a neural network, the objective in Eq. 3 optimizes network weights and proxies together. There are alternative ways to design the HUG loss from Eq. 2 for different learning scenarios, as discussed in Appendix C.

Learnable proxies. We can view the class proxies ŵ_c as learnable parameters and update them with stochastic gradients, similarly to the parameters of neural networks. In fact, learnable proxies play a role similar to the last-layer classifiers in the CE loss, improving the optimization by aggregating intra-class features. The major difference between learnable proxies and moving-averaged proxies is the way we update them. As GNC(3) implies, class proxies in HUG can also be used as classifiers.

Static proxies. Eq. 3 is decoupled into maximal inter-class separability and minimal intra-class variability. These two objectives are independent and do not affect each other, so we can optimize them independently. This suggests an even simpler way to assign class proxies: initialize them with prespecified points that have already attained hyperspherical uniformity, and fix them during training. There are two simple ways to obtain such class proxies: (1) minimizing their hyperspherical energy beforehand; (2) initializing the class proxies with a zero-mean Gaussian (Proposition 1). After initialization, the class proxies stay fixed and the features are optimized towards their class proxies.

Partially learnable proxies. After the class proxies are initialized in the static way above, we can increase their flexibility by learning an orthogonal matrix for the class proxies to find a suitable orientation for them. Specifically, we can learn this orthogonal matrix using methods in [47].
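To make Eq. 3 concrete, the sketch below (assuming numpy; all names, constants and the specific HU choice are illustrative) instantiates HU as the mean pairwise distance on the sphere, one of many valid uniformity measures, and uses static Gaussian-initialized proxies as in Proposition 1:

```python
import numpy as np

def hu(V):
    """A simple uniformity measure: mean pairwise distance of normalized vectors
    (larger = more hyperspherically uniform; one of many valid choices)."""
    Vh = V / np.linalg.norm(V, axis=1, keepdims=True)
    iu = np.triu_indices(len(V), k=1)
    return np.mean(np.linalg.norm(Vh[iu[0]] - Vh[iu[1]], axis=1))

def p_hug(X, y, W, alpha=1.0, beta=1.0):
    """Proxied HUG objective (Eq. 3, to maximize): uniformity of proxies minus
    joint uniformity of each class's features together with its proxy."""
    C = len(W)
    intra = sum(hu(np.vstack([X[y == c], W[c:c + 1]])) for c in range(C))
    return alpha * hu(W) - beta * intra

rng = np.random.default_rng(0)
C, d = 5, 16
W = rng.normal(size=(C, d))                           # static Gaussian proxies
y = np.repeat(np.arange(C), 10)
aligned = W[y] + 1e-3 * rng.normal(size=(50, d))      # features near their proxy
unstructured = rng.normal(size=(50, d))               # features ignoring proxies
print(p_hug(aligned, y, W) > p_hug(unstructured, y, W))  # → True
```

The inter-class term depends only on W, and the intra-class term only pulls each class's features toward its proxy, which is exactly the decoupling the text describes.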

3.2. VARIATIONAL CHARACTERIZATION OF HYPERSPHERICAL UNIFORMITY

While there exist many ways to measure hyperspherical uniformity, we seek variational characterizations due to their simplicity. As examples, we consider minimum hyperspherical energy [45], which is inspired by the Thomson problem [66, 70] and minimizes the potential energy; maximum hyperspherical separation [48], which is inspired by the Tammes problem [69] and maximizes the smallest pairwise distance; and maximum Gram determinant [48], which is defined by the volume of the formed parallelotope.

Minimum hyperspherical energy. MHE seeks an equilibrium state with minimum potential energy that distributes n electrons on a unit hypersphere as evenly as possible. Hyperspherical uniformity is characterized by minimizing the hyperspherical energy of n vectors V_n = {v_1, …, v_n ∈ R^d}:

min_{v̂_1, …, v̂_n ∈ S^{d−1}} E_s(V̂_n) := Σ_{i=1}^n Σ_{j=1, j≠i}^n K_s(v̂_i, v̂_j),  K_s(v̂_i, v̂_j) = ‖v̂_i − v̂_j‖^{−s} for s > 0 and −‖v̂_i − v̂_j‖^{−s} for s < 0,   (4)

where v̂_i := v_i / ‖v_i‖ is the i-th vector projected onto the unit hypersphere. With HU(V̂) = −E_s(V̂), we apply MHE to HUG and formulate the new objective as follows (s_b = 2, s_w = −1):

min_{{x̂_j}_{j=1}^n, {ŵ_c}_{c=1}^C} L_MHE-HUG := α · E_{s_b}({ŵ_c}_{c=1}^C) − β · Σ_{c=1}^C E_{s_w}({x̂_i}_{i∈A_c}, ŵ_c),   (5)

which can already be used to train neural networks. The intra-class variability term in Eq. 5 can be relaxed to an upper bound such that we can instead minimize a simple upper bound of L_MHE-HUG:

L'_MHE-HUG := α · Σ_{c≠c'} ‖ŵ_c − ŵ_{c'}‖^{−2} + β' · Σ_c Σ_{i∈A_c} ‖x̂_i − ŵ_c‖ ≥ L_MHE-HUG,   (6)

which is much more efficient to compute in practice and can thus serve as a relaxed HUG objective. Moreover, L_MHE-HUG and L'_MHE-HUG share the same minimizer. A detailed derivation is in Appendix H.

Maximum hyperspherical separation. MHS uses a maximum geodesic separation principle by maximizing the separation distance of V̂_n = {v̂_1, …, v̂_n} (i.e., its smallest pairwise distance): max_{V̂_n} ϑ(V̂_n) := min_{i≠j} ‖v̂_i − v̂_j‖.
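The separation distance ϑ just defined is a single min over pairs and is cheap to compute. A short numpy check on the Tammes optimum for n = 6 (the octahedron, whose optimality for 6 points is classical; the perturbation is illustrative):

```python
import numpy as np

def separation(V):
    """Tammes-style separation: the smallest pairwise distance of unit vectors."""
    Vh = V / np.linalg.norm(V, axis=1, keepdims=True)
    iu = np.triu_indices(len(V), k=1)
    return np.min(np.linalg.norm(Vh[iu[0]] - Vh[iu[1]], axis=1))

octa = np.concatenate([np.eye(3), -np.eye(3)])    # Tammes optimum for n = 6
pert = octa.copy()
pert[0] = np.array([1.0, 0.2, 0.0])               # nudge one vertex
print(separation(octa))                           # → sqrt(2) ≈ 1.4142
print(separation(pert) < separation(octa))        # → True
```

Because ϑ only looks at the closest pair, it is non-smooth; the surrogate introduced next is what makes it practical inside a training objective.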
Because ϑ(V̂_n) is another variational definition, we cannot naively set HU(·) = ϑ(·). We define ϑ^{−1}(V̂_n) := max_{i≠j} ‖v̂_i − v̂_j‖, and HUG becomes

max_{{x̂_j}_{j=1}^n, {ŵ_c}_{c=1}^C} L_MHS-HUG := α · ϑ({ŵ_c}_{c=1}^C) − β · Σ_{c=1}^C ϑ^{−1}({x̂_i}_{i∈A_c}, ŵ_c),   (7)

which, by replacing the intra-class variability term with its surrogate, results in a more efficient form:

L'_MHS-HUG := α · min_{c≠c'} ‖ŵ_c − ŵ_{c'}‖ − β · Σ_c max_{i∈A_c} ‖x̂_i − ŵ_c‖,   (8)

which is a max-min optimization with a simple nearest-neighbor problem inside. We note that L_MHS-HUG and L'_MHS-HUG share the same maximizer. A detailed derivation is given in Appendix H.

Maximum Gram determinant. MGD characterizes uniformity by computing a proxy to the volume of the parallelotope spanned by the vectors. MGD is defined with the kernel Gram determinant:

max_{v̂_1, …, v̂_n ∈ S^{d−1}} log det G(V̂_n),  G(V̂_n) := [K(v̂_i, v̂_j)]_{i,j=1}^n,  K(v̂_i, v̂_j) = exp(−(ϵ/2) · ‖v̂_i − v̂_j‖²),   (9)

where we use a Gaussian kernel with parameter ϵ and G(V̂_n) is the kernel Gram matrix of V̂_n = {v̂_1, …, v̂_n}. With HU(V̂_n) = det G(V̂_n), minimizing intra-class uniformity cannot be achieved by minimizing det G(V̂_n), since det G(V̂_n) = 0 only implies linear dependence. Then we have

max_{{x̂_j}_{j=1}^n, {ŵ_c}_{c=1}^C} L_MGD-HUG := α · log det G({ŵ_c}_{c=1}^C) − β' · Σ_c Σ_{i∈A_c} ‖x̂_i − ŵ_c‖,   (10)

where we directly use the surrogate loss from Eq. 6 as the intra-class variability term. With MGD, HUG has an interesting geometric interpretation: it encourages the volume spanned by class proxies to be as large as possible and the volume spanned by intra-class features to be as small as possible.
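The Gram-determinant term behaves as the volume intuition suggests: well-spread points yield a well-conditioned kernel Gram matrix, while clumped points drive its determinant toward zero. A numpy sketch (assuming numpy; ϵ = 1 and the configurations are illustrative):

```python
import numpy as np

def log_gram_det(V, eps=1.0):
    """log-determinant of the Gaussian-kernel Gram matrix of unit vectors."""
    D2 = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    G = np.exp(-0.5 * eps * D2)
    sign, logdet = np.linalg.slogdet(G)   # slogdet avoids underflow of tiny dets
    return logdet

octa = np.concatenate([np.eye(3), -np.eye(3)])          # well-spread on S^2
clumped = np.array([[1.0, 0.05 * k, 0.0] for k in range(6)])
clumped /= np.linalg.norm(clumped, axis=1, keepdims=True)  # all near one point
print(log_gram_det(octa) > log_gram_det(clumped))       # → True
```

This is why MGD works for the inter-class term but not directly for the intra-class term: det G vanishes already at linear dependence, long before the points actually coincide.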

3.3. THEORETICAL INSIGHTS AND DISCUSSIONS

There are many interesting theoretical questions concerning HUG, and the framework is closely related to several topics in mathematics, such as tight frame theory [74], potential theory [39], and sphere packing and covering [3, 18, 25]. The depth and breadth of these topics are immense. In this section, we focus on some closely related yet intuitive theoretical properties of HUG. Theorem 4 shows that the leading term of the minimum energy grows at order O(n²) as n → ∞, and it holds for a wide range of s in the Riesz kernel of the hyperspherical energy. Moreover, the following result shows that MHS is in fact a limiting case of MHE as s → ∞.

Proposition 2 (MHS is a Limiting Case of MHE) Let n ∈ N, n ≥ 2 be fixed and (S^{d−1}, L_2) be a compact metric space. Then lim_{s→∞} (min_{V̂_n ⊂ S^{d−1}} E_s(V̂_n))^{1/s} = (max_{V̂_n ⊂ S^{d−1}} ϑ(V̂_n))^{−1}.

Proposition 3 The HUG objectives in both Eq. 5 and Eq. 6 converge to the simplex ETF when 2 ≤ C ≤ d + 1, converge to the cross-polytope when C = 2d, and asymptotically converge to GNC as C → ∞.

Proposition 3 shows that HUG not only decouples GNC but also provably converges to GNC. Since GNC indicates that the CE loss eventually approaches the maximizer of HUG, we now look into how the CE loss implicitly maximizes the HUG objective in a coupled way.
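The limit in Proposition 2 can be observed numerically even on a single fixed configuration, where the same mechanism gives E_s(V̂)^{1/s} → ϑ(V̂)^{−1} as s → ∞: the closest pair dominates the energy. A sketch assuming numpy (octahedron used for illustration, ϑ = √2):

```python
import numpy as np

def riesz_energy(V, s):
    """Riesz-s energy over ordered pairs of unit vectors."""
    iu = np.triu_indices(len(V), k=1)
    D = np.linalg.norm(V[iu[0]] - V[iu[1]], axis=1)
    return 2.0 * np.sum(D ** (-s))

octa = np.concatenate([np.eye(3), -np.eye(3)])
for s in [10, 100, 400]:
    print(riesz_energy(octa, s) ** (1.0 / s))   # approaches 1/sqrt(2) ≈ 0.7071
```

As s grows, the sum is dominated by the pairs at the minimal distance √2, and the 1/s-th root washes out their multiplicity, leaving exactly ϑ^{−1}.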

Proposition 4

The CE loss is L_CE = Σ_{i=1}^n log(1 + Σ_{j≠y_i} exp(⟨w_j, x_i⟩ − ⟨w_{y_i}, x_i⟩)), where n is the number of samples, x_i is the i-th sample with label y_i, and w_j is the last-layer linear classifier of the j-th class. The bias is omitted for simplicity. L_CE is bounded as in Eq. 11 (with ρ = C − 1), where ρ_1, ρ_2, ρ_3 are constants and l_ic is the softmax confidence of x_i for the c-th class (Appendix L). This result [4] implies that minimizing CE effectively maximizes HUG. [50] proves that the minimizer of the normalized CE loss converges to hyperspherical uniformity. We restate their result below:

Theorem 5 (CE Asymptotically Converges to HUG's Maximizer) Consider unconstrained features of C classes (each class with the same number of samples), with features and classifiers normalized onto some hypersphere. Then, for the minimizer of the CE loss, the classifiers converge weakly to the uniform measure on S^{d−1} as C → ∞ and the features collapse to their corresponding classifiers. The minimizer of CE thus asymptotically converges to the maximizer of HUG.

Theorem 5 shows that the minimizer of the CE loss with unconstrained features [53] asymptotically converges to the maximizer of HUG (i.e., GNC). So far, we have shown that HUG shares the same optimum with CE (under hyperspherical normalization) while being more flexible by decoupling inter-class feature separability and intra-class feature variability. Therefore, we argue that HUG can be an excellent alternative to the widely used CE loss in classification problems.

HUG maximizes mutual information. We can view HUG as a way to maximize the mutual information I(X; Y) = H(X) − H(X|Y), where X denotes the feature space and Y the label space. Maximizing H(X) implies that the features should be uniform over the space. Minimizing H(X|Y) means that features from the same class should be concentrated. This is nicely connected to HUG.
The role of feature and class proxy norms. Neither NC nor GNC takes the norms of features and class proxies into consideration, and HUG likewise assumes that both features and class proxies are projected onto some hypersphere. Although dropping these norms usually improves generalizability [8, 9, 14, 43, 76], training neural networks with the standard CE loss still yields different class proxy norms and feature norms. We hypothesize that this is due to underlying differences among the training data distributions of different classes. One piece of empirical evidence is that the average feature norm of each class is consistent across trainings with different random seeds (e.g., the average feature norm of digit 1 on MNIST stays the smallest across different runs). [36, 46, 52] empirically show that the feature norm corresponds to the quality of a sample, which can also be viewed as a proxy for sample uncertainty. [56] theoretically shows that the norm of neuron weights (e.g., the classifier) matters for Rademacher complexity. As a trivial way to minimize the CE loss, increasing the classifier norm (if the feature is correctly classified) can easily decrease the CE loss of that sample toward zero, which is mostly caused by the softmax function. Taking both feature and class proxy norms into account greatly complicates the analysis (e.g., it results in a weighted hyperspherical energy where the potentials between vectors are weighted) and seems to yield little benefit for now. We defer this issue to future investigation.

HUG as a general framework for designing loss functions. HUG can be viewed as an inherently decoupled way of designing new loss functions. As long as we design a measure of hyperspherical uniformity, HUG enables us to effortlessly turn it into a loss function for neural networks.

4. EXPERIMENTS AND RESULTS

Our experiments aim to demonstrate the empirical effectiveness of HUG, so we focus on a fair comparison with the popular CE loss under the same setting. Experimental details are in Appendix N. Different HUG variants. We compare different HUG variants and the CE loss on CIFAR-10 and CIFAR-100 with ResNet-18 [29]. Specifically, we use Eq. 6, Eq. 6 and Eq. 10 for MHE-HUG, MHS-HUG and MGD-HUG, respectively. The results are given in Table 1. We observe that all HUG variants outperform the CE loss. Among them, MHE-HUG achieves the best testing accuracy with a considerable improvement over the CE loss. We note that all HUG variants are used without the CE loss. The performance gains of HUG are actually quite significant, since the CE loss is currently the default choice for classification problems and serves as a very strong baseline. Different methods to update proxies. We also evaluate how different proxy update methods affect the classification performance. We use the same setting as Table 1. For all the proxy update methods, we apply them to MHE-HUG (Eq. 6) under the same setting. The results are given in Table 2. We observe that all the proposed proxy update methods work reasonably well. More interestingly, static proxies work surprisingly well and outperform the CE loss even when all the class proxies are randomly initialized and then fixed throughout training. The reason static proxies work for MHE-HUG is Proposition 1. This result is significant since we no longer have to train class proxies in HUG (unlike CE). When training with a large number of classes, learning class proxies is costly in GPU memory, which is also known as one of the bottlenecks for face recognition [1]. HUG could be a promising solution to this problem. Loss landscape and convergence.

4.1. EXPLORATORY EXPERIMENTS AND ABLATION STUDY

We perturb neuron weights (following [40]) to visualize the loss landscapes of HUG and CE in Figure 4. We use MHE in HUG here. The results show that HUG yields much flatter local minima than the CE loss in general, implying that HUG has potentially stronger generalization [34, 57]. We show more visualizations and convergence dynamics in Appendix O. Learning with different architectures. We evaluate HUG with different network architectures such as VGG-16 [65], ResNet-18 [29] and DenseNet-121 [31]; the results are given in Table 3. Long-tailed recognition. We consider the task of long-tailed recognition, where the data from different classes are imbalanced. The settings generally follow [6], and the dataset becomes more imbalanced as the imbalance ratio (IR) gets smaller. The potential of HUG in imbalanced classification is evident, as the inter-class separability in HUG is explicitly modeled and can be easily controlled. Experimental results in Table 4 show that HUG consistently outperforms the CE loss in the challenging long-tailed setting under different imbalance ratios. Continual learning. We demonstrate the potential of HUG in the class-continual learning setting, where the training data is not sampled i.i.d. but comes in class by class. Since the training data is highly biased, hyperspherical uniformity among class proxies is crucial. Due to the decoupled nature of HUG, we can easily increase the importance of inter-class separability, unlike with CE. We use a simple continual learning method, ER [62], where the CE loss with memory is used, and we replace the CE loss with HUG. Table 5 shows that HUG consistently improves ER under different memory sizes. NLP tasks. As an exploration, we evaluate HUG on some simple NLP classification tasks. Our experiments follow the same settings as [32] and finetune the BERT model [15] on these tasks. Table 7 shows that HUG yields better generalizability than CE, demonstrating its potential for NLP.

5. RELATED WORK AND CONCLUDING REMARKS

We start by generalizing and decoupling the NC phenomenon, obtaining two basic principles for loss functions. Based on these principles, we identify a quantity, the hyperspherical uniformity gap, which not only decouples NC but also provides a general framework for designing loss functions. We demonstrate a few simple HUG variants that outperform the CE loss in terms of generalization and adversarial robustness. There is a large body of excellent work on NC that is related to HUG, such as [26, 33, 71, 89]. [88] extends the study of NC to more practical loss functions (e.g., focal loss and losses with label smoothing). Different from existing work on hyperspherical uniformity [41, 45, 48] and generic diversity (decorrelation) [2, 7, 11, 54, 77, 83], HUG works as a new learning target (used without CE) rather than acting as a regularizer for the CE loss (used together with CE). Following the spirit of [32], we demonstrate the effectiveness and potential of HUG as a valid substitute for CE.


Figure 5: Geometric connection between GNC and [87]. Left: simplex ETF (NC) with d = 3, C = 4 in 3D space; right: hyperspherical uniformity (GNC) with d = 2, C = 4 in a 2D subspace. Relevant theoretical results. [87] has discussed NC in the case of d < C - 1, and shown that the global solution in this case yields the best rank-d approximation of the simplex ETF. Together with [87], GNC gives a more profound characterization of the convergence of class-means. We show the special case of d = 2, C = 4. It is easy to see that hyperspherical uniformity in this case forms four vectors with adjacent ones being perpendicular. This is also the case captured by the best rank-2 approximation (i.e., a 2-dimensional hyperplane with the simplex ETF projected onto it). Figure 5 gives a geometric interpretation of the connection between [87] and GNC. [3] provides an in-depth introduction and comprehensive theoretical analysis of the energy minimization problem, which significantly benefits this work. Connection to contrastive learning. The goal of contrastive learning [8, 10, 24, 30, 72, 78, 85] is to learn discriminative features through instance-wise discrimination and contrast. Despite the lack of class labels, [78] discovers that contrastive learning performs sample-wise alignment and sample-wise uniformity, sharing a similar high-level spirit with intra-class variability and inter-class separability. [35] adapts contrastive learning to supervised settings where labeled samples are available, which also shares conceptual similarity with our framework and settings. Related work on (deep) metric learning. Metric learning also adopts a similar idea, where similar samples are pulled together and dissimilar ones are pushed away. HUG has intrinsic connections to a number of loss functions in metric learning [4, 17, 21, 24, 58, 59, 61, 67, 68, 79-81, 84].

6. BROADER IMPACT AND FUTURE WORK

Our work reveals an underlying principle, the hyperspherical uniformity gap, for classification loss functions, especially in the context of deep learning. We provide a simple yet effective framework for designing decoupled classification loss functions. Unlike previous objective functions that are coupled and treated as a black box, our loss function has a clear physical interpretation and is fully decoupled into different functionalities. These characteristics may help neural networks identify intrinsic structures hidden in data and the true causes for classifying images. HUG may have broader applications in interpretable machine learning and fairness / bias problems. Our work is by no means perfect, and there are many aspects that require future investigation. For example, the implicit data mining in CE [49] is missing in the current HUG design; current HUG losses are more sensitive to hyperparameters than CE (the flexibility of decoupling comes at a price); current HUG losses can be more unstable to train (more difficult to converge) than CE; and more large-scale experiments are needed to fully validate the superiority of current HUG losses. We hope that our work can serve as a good starting point for rethinking classification losses in deep learning.

A EMPIRICAL RESULTS ON GENERALIZED NEURAL COLLAPSE A.1 DETAILED METRIC DEFINITION

We consider four metrics: average classifier energy (ACE), average class-mean energy (ACME), average feature reverse-energy (AFRE) and average feature-mean reverse-energy (AFMRE) in the paper. Their definitions are given below:

$$E_{\text{ACE}} = \frac{1}{C(C-1)}\sum_{i\neq j} \|\hat{w}_i - \hat{w}_j\|^{-2} \quad (12)$$

$$E_{\text{ACME}} = \frac{1}{C(C-1)}\sum_{i\neq j} \|\hat{\mu}_i - \hat{\mu}_j\|^{-2} \quad (13)$$

$$E_{\text{AFRE}} = \frac{1}{C}\sum_{c=1}^{C} \frac{1}{|A_c|\cdot(|A_c|-1)}\sum_{i\neq j \in A_c} \|\hat{x}_i - \hat{x}_j\| \quad (14)$$

$$E_{\text{AFMRE}} = \frac{1}{C}\sum_{c=1}^{C} \frac{1}{|A_c|}\sum_{i\in A_c} \|\hat{x}_i - \hat{\mu}_c\| \quad (15)$$

where $|A_c|$ denotes the cardinality of the set $A_c$, $\hat{\mu}_c$ is the normalized feature mean of the c-th class and $\hat{w}_c$ denotes the normalized class proxy of the c-th class.
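As a minimal sketch of how two of these metrics could be computed from raw arrays (the function names are ours, and we assume features and proxies arrive as NumPy arrays that we normalize as in the definitions):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def average_classifier_energy(w):
    # ACE (Eq. 12): mean inverse squared distance between normalized proxies
    w = normalize(np.asarray(w, dtype=float))
    d2 = ((w[:, None] - w[None, :]) ** 2).sum(-1)
    off = ~np.eye(len(w), dtype=bool)
    return (1.0 / d2[off]).mean()

def average_feature_mean_reverse_energy(x, y):
    # AFMRE (Eq. 15): mean distance from normalized features to their
    # normalized class mean; goes to 0 under GNC (intra-class collapse).
    # Assumes each class mean is nonzero before normalization.
    x = normalize(np.asarray(x, dtype=float))
    vals = []
    for c in np.unique(y):
        xc = x[y == c]
        mu = xc.mean(axis=0)
        mu /= np.linalg.norm(mu)
        vals.append(np.linalg.norm(xc - mu, axis=-1).mean())
    return float(np.mean(vals))
```

For example, orthonormal proxies have all pairwise squared distances equal to 2, so their ACE is 0.5, and a perfectly class-collapsed feature set has AFMRE equal to 0.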

A.2 EMPIRICAL RESULTS OF GNC ON IMAGENET

We find that the GNC hypothesis remains valid and informative even with a large number of classes (we use the 1000-class ImageNet-2012 dataset [13] here). Experimental results with ResNet-18 [29] (feature dimension 512) are given in Figure 6, and results with ResNet-50 [29] (feature dimension 2048) are given in Figure 7.

B 2D MNIST FEATURE VISUALIZATION

We also visualize the 2D MNIST features in Figure 8, Figure 9 and Figure 10, which is done by directly setting the output feature dimension to 2. Different colors denote different classes and black arrows denote the class proxies. We compare the difference between the CE loss and the HUG-MHE loss (with either independently optimized proxies or fully learnable proxies). Specifically, for the HUG-MHE loss with independently optimized proxies, we use the following form:

$$\max_{\{\hat{x}_j\}_{j=1}^{n},\{\hat{w}_c\}_{c=1}^{C}} \mathcal{L}_{\text{P-HUG}} := \alpha\cdot\underbrace{\mathrm{HU}\big(\{\hat{w}_c\}_{c=1}^{C}\big)}_{\text{Inter-class Hyperspherical Uniformity}} - \beta\cdot\sum_{c=1}^{C}\underbrace{\mathrm{HU}\big(\{\hat{x}_i\}_{i\in A_c}, \mathrm{SG}(\hat{w}_c)\big)}_{\text{Intra-class Hyperspherical Uniformity}} \quad (16)$$

where $\mathrm{SG}(\cdot)$ stops the gradient for the class proxies in the intra-class hyperspherical uniformity term. From the results, we observe that our HUG losses generally learn better representations than the CE loss; moreover, HUG learns better aligned class proxies and class feature-means than CE.

C OTHER VARIANTS IN THE HUG FRAMEWORK

There are plenty of interesting and useful instantiations for the loss function under the HUG framework. In this section, we discuss a few highly relevant and natural ones.

C.1 PROXY-FREE HUG

We have the following general HUG objective function:

$$\max_{\{\hat{x}_j\}_{j=1}^{n}} \mathcal{L}_{\text{HUG}} := \alpha\cdot\underbrace{\mathrm{HU}\big(\{\hat{\mu}_c\}_{c=1}^{C}\big)}_{T_b:\ \text{Inter-class Hyperspherical Uniformity}} - \beta\cdot\sum_{c=1}^{C}\underbrace{\mathrm{HU}\big(\{\hat{x}_i\}_{i\in A_c}\big)}_{T_w:\ \text{Intra-class Hyperspherical Uniformity}} \quad (17)$$

where we can have many possible instantiations. Other than the proxy-based form proposed in the main paper, we can also have a proxy-free version:

$$\max_{\{\hat{x}_j\}_{j=1}^{n}} \mathcal{L}_{\text{PF-HUG}} := \alpha\cdot\underbrace{\mathrm{HU}\big(\{\hat{x}_{i\in A_c}\}_{c=1}^{C}\big)}_{\text{Inter-class Hyperspherical Uniformity}} - \beta\cdot\sum_{c=1}^{C}\underbrace{\mathrm{HU}\big(\{\hat{x}_i\}_{i\in A_c}\big)}_{\text{Intra-class Hyperspherical Uniformity}} \quad (18)$$

where $\{\hat{x}_{i\in A_c}\}_{c=1}^{C}$ denotes a set of vectors that consists of one random sample per class. This essentially replaces the class proxy with a random sample from the class. The proxy-free HUG loss can be used in scenarios where an extremely large number of classes exists and storing class proxies is very expensive, or in self-supervised contrastive learning where each instance and its augmentations are viewed as one class. An MHE-based instantiation of Eq. 18 is given by

$$\min_{\{\hat{x}_j\}_{j=1}^{n}} \mathcal{L}_{\text{MHE-PF-HUG}} := \alpha\cdot E_{s_b}\big(\{\hat{x}_{i\in A_c}\}_{c=1}^{C}\big) - \beta\cdot\sum_{c=1}^{C} E_{s_w}\big(\{\hat{x}_i\}_{i\in A_c}\big) \quad (19)$$

which can be similarly relaxed to

$$\mathcal{L}'_{\text{MHE-PF-HUG}} = \alpha\cdot\sum_{c\neq c'} \|\hat{x}_{i\in A_c} - \hat{x}_{j\in A_{c'}}\|^{-2} + \beta'\cdot\sum_{c}\sum_{i,j\in A_c,\, i\neq j} \|\hat{x}_i - \hat{x}_j\| \quad (20)$$

where $\hat{x}_{i\in A_c}$ denotes a randomly selected sample from the c-th class. The first term in Eq. 20 can also be viewed as a scalable stochastic approximation to the first term in the following loss function:

$$\mathcal{L}''_{\text{MHE-PF-HUG}} = \alpha\cdot\sum_{i\in A_c,\, j\in A_{c'},\, c\neq c'} \|\hat{x}_i - \hat{x}_j\|^{-2} + \beta'\cdot\sum_{c}\sum_{i,j\in A_c,\, i\neq j} \|\hat{x}_i - \hat{x}_j\| \quad (21)$$

which is typically optimized with stochastic gradients (samples come as a mini-batch) in practice.
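A minimal sketch of the relaxed per-batch form $\mathcal{L}''_{\text{MHE-PF-HUG}}$ (Eq. 21), under our own naming and with NumPy standing in for an autodiff framework; no class proxies are stored:

```python
import numpy as np

def mhe_pf_hug(feats, labels, alpha=0.15, beta=0.015):
    # Relaxed proxy-free MHE-HUG on one mini-batch (Eq. 21):
    # inverse squared distances across classes (push apart) plus plain
    # distances within each class (pull together); no class proxies needed.
    x = feats / np.linalg.norm(feats, axis=-1, keepdims=True)
    d2 = ((x[:, None] - x[None, :]) ** 2).sum(-1)
    same = labels[:, None] == labels[None, :]
    off = ~np.eye(len(x), dtype=bool)
    inter = (1.0 / d2[~same]).sum()          # pairs with c != c'
    intra = np.sqrt(d2[same & off]).sum()    # pairs with i != j inside a class
    return alpha * inter + beta * intra
```

Note the sketch assumes no two samples of different classes coincide on the sphere; otherwise the inverse squared distance diverges, which is exactly the configuration the energy penalizes.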

C.2 COUPLED HUG

One advantage of HUG is that it decouples intra-class variability and inter-class separability. However, coupling may also bring some benefits (e.g., robustness to hyperparameters, stability in training). To this end, we also propose a coupled loss function using the HUG framework:

$$\max_{\{\hat{x}_j\}_{j=1}^{n},\{\hat{w}_c\}_{c=1}^{C}} \mathcal{L}_{\text{C-HUG}} := \alpha\cdot\underbrace{\mathrm{HU}\big(\{\hat{x}_i\}_{i=1}^{n}, \{\hat{w}_c\}_{c\neq y_i}\big)}_{\text{Coupled Intra-class and Inter-class Hyperspherical Uniformity}} - \beta\cdot\sum_{c=1}^{C}\underbrace{\mathrm{HU}\big(\{\hat{x}_i\}_{i\in A_c}\big)}_{\text{Intra-class Hyperspherical Uniformity}} \quad (22)$$

which can be turned into an MHE-based instantiation:

$$\mathcal{L}''_{\text{MHE-C-HUG}} = \alpha\cdot\sum_{i=1}^{n}\sum_{c\neq y_i} \|\hat{x}_i - \hat{w}_c\|^{-2} + \beta'\cdot\sum_{c}\sum_{i,j\in A_c,\, i\neq j} \|\hat{x}_i - \hat{x}_j\|$$

where the first term itself couples intra-class and inter-class hyperspherical uniformity. Although the coupled HUG drops the flexibility that the original HUG framework brings, it may introduce extra advantages (e.g., training stability).

C.3 HUG WITHOUT HYPERSPHERICAL NORMALIZATION

While the CE loss does not necessarily require hyperspherical normalization for the proxies and features (although hyperspherical normalization does improve CE's generalizability [44, 75]), we also consider the HUG framework without hyperspherical normalization here. We note that this issue remains an open challenge and we only aim to provide some simple yet natural designs. The obvious problem with removing hyperspherical normalization is that HUG then has a trivial way to decrease its loss: simply increasing the magnitude of features and proxies. A naive way to address this is to introduce magnitude penalty terms for the features and proxies, which results in a magnitude-penalized HUG objective where s denotes the magnitude hyperparameter.

D PROOF OF THEOREM 1

We first let $\hat{V}_C = \{\hat{v}_1, \cdots, \hat{v}_C\}$ be an arbitrary vector configuration on $S^{d-1}$. Then we have

$$\Lambda(\hat{V}_C) := \sum_{i=1}^{C}\sum_{j=1}^{C} \|\hat{v}_i - \hat{v}_j\|^2 = \sum_{i=1}^{C}\sum_{j=1}^{C} \big(2 - 2\hat{v}_i\cdot\hat{v}_j\big) = 2C^2 - 2\Big\|\sum_{i=1}^{C}\hat{v}_i\Big\|^2 \le 2C^2,$$

where equality holds if and only if $\sum_{i=1}^{C}\hat{v}_i = 0$. The vertices of a regular $(C-1)$-simplex centered at the origin satisfy this condition. With the properties of the potential function $f$ (convexity and monotone decrease), Jensen's inequality gives

$$E_f(\hat{V}_C) := \sum_{i=1}^{C}\sum_{j:\, j\neq i} f\big(\|\hat{v}_i - \hat{v}_j\|^2\big) \ge C(C-1)\, f\Big(\frac{\Lambda(\hat{V}_C)}{C(C-1)}\Big) \ge C(C-1)\, f\Big(\frac{2C}{C-1}\Big),$$

where equality holds if all pairwise distances $\|\hat{v}_i - \hat{v}_j\|$, $i\neq j$, are equal and the center of mass is at the origin (i.e., $\sum_{i=1}^{C}\hat{v}_i = 0$). Therefore, for the vector configuration $\hat{V}_C^*$ that contains the vertices of a regular $(C-1)$-simplex inscribed in $S^{d-1}$ and centered at the origin, we have, for $2 \le C \le d+1$,

$$E_f(\hat{V}_C^*) = C(C-1)\, f\Big(\frac{2C}{C-1}\Big) \le E_f(\hat{V}_C).$$

If $f$ is strictly convex and strictly decreasing, then $E_f(\hat{V}_C) = C(C-1) f\big(\frac{2C}{C-1}\big)$ holds only when $\hat{V}_C$ is a regular $(C-1)$-simplex inscribed in $S^{d-1}$ and centered at the origin. ■
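The optimality in Theorem 1 can be checked numerically for a small case: with the $s = 2$ Riesz potential, $C = 4$ and $d = 3$ (so $C \le d+1$), the regular simplex attains $C(C-1) f(\frac{2C}{C-1}) = 12 \cdot (8/3)^{-1} = 4.5$, and random configurations never do better. The helper names below are our own illustrative sketch:

```python
import numpy as np

def riesz_energy(v, s=2.0):
    # E_s(V) = sum_{i != j} ||v_i - v_j||^{-s} for a configuration of unit vectors
    d2 = ((v[:, None] - v[None, :]) ** 2).sum(-1)
    off = ~np.eye(len(v), dtype=bool)
    return (d2[off] ** (-s / 2)).sum()

def regular_simplex(C):
    # Vertices of a regular (C-1)-simplex centered at the origin: project the
    # standard basis of R^C onto the mean-zero subspace and normalize the rows.
    v = np.eye(C) - 1.0 / C
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

simplex = regular_simplex(4)     # C = 4, lives in a 3-dimensional subspace
e_star = riesz_energy(simplex)   # = 12 * (8/3)^{-1} = 4.5
rng = np.random.default_rng(0)
for _ in range(200):             # random spherical configurations never beat it
    v = rng.normal(size=(4, 3))
    v /= np.linalg.norm(v, axis=-1, keepdims=True)
    assert riesz_energy(v) >= e_star - 1e-9
```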

E PROOF OF THEOREM 2

This result comes as a natural conclusion from [12], where it is proved that any sharp code is a minimal hyperspherical $f$-energy $N$-point configuration for any interaction potential $f$ that is absolutely monotone on $[-1, 1)$, including all Riesz $s$-potentials $f(t) = (2 - 2t)^{-s/2}$ for $s > 0$. Before we move on, we need the definition of a sharp code:

Definition 1 Let $\hat{V}_N = \{\hat{v}_1, \cdots, \hat{v}_N\}$ be an $N$-point configuration on $S^{d'}$.
• If for every $(d'+1)$-variate polynomial $P$ of degree at most $m$, $\int_{S^{d'}} P\, d\sigma_{d'} = \frac{1}{N}\sum_{i=1}^{N} P(\hat{v}_i)$, then $\hat{V}_N$ is called a spherical $m$-design.
• If $\hat{V}_N$ is a configuration of $N$ distinct points such that the set of inner products between distinct points in $\hat{V}_N$ has cardinality $k$, then $\hat{V}_N$ is called a spherical $k$-distance set.
• The configuration $\hat{V}_N$ is a sharp code if it is both a $k$-distance set and a spherical $(2k-1)$-design.

The Cohn-Kumar universal optimality theorem [12] states that any sharp code is universally optimal. By universal optimality, we mean:

Definition 2 An $N$-point configuration $\hat{V}_N$ on $S^{d'}$ is called universally optimal if $E_f(\hat{V}_N) := \sum_{\hat{v}_1, \hat{v}_2\in\hat{V}_N,\, \hat{v}_1\neq\hat{v}_2} f(\hat{v}_1^\top \hat{v}_2) = \min_{\hat{W}_N\subset S^{d'}} E_f(\hat{W}_N)$ holds for any absolutely monotone function $f: [-1, 1) \to \mathbb{R}$.

Then, formally, the Cohn-Kumar universal optimality theorem states:

Theorem 6 If $\hat{V}_N$ is a sharp code on $S^{d'}$, then $\hat{V}_N$ is universally optimal.

Because the vertices of the cross-polytope form a sharp code, this vertex set ($2d'+2$ points in total) is universally optimal, which implies that $E_f(\hat{W}_N) = \min_{\hat{W}'_N\subset S^{d'}} E_f(\hat{W}'_N)$, where $\hat{W}_N$ denotes the vertex set of the cross-polytope. Then we let $s = 2$ for the $f$-energy and $d' = d - 1$, and the theorem is proved. ■

F PROOF OF THEOREM 3

This theorem is in fact a well-known result (see [3, 27, 38, 63]). The general result is stated as:

Theorem 7 If $A \subset \mathbb{R}^p$ is compact with $\dim A > 0$ and $0 < s < \dim A$, then $\lim_{N\to\infty} \frac{\varepsilon_s(A, N)}{N^2} = W_s(A)$, where $\varepsilon_s(A, N) := \min_{\hat{W}_N\subset A} E_s(\hat{W}_N)$ and $W_s(A)$ is the Wiener constant.
Moreover, the equilibrium measure $\mu_{s,A}$ on $A$ is unique for the Riesz $s$-kernel when $0 < s < \dim A$. Finally, any sequence $\{\hat{v}_1^N, \cdots, \hat{v}_N^N\}_{N=2}^{\infty}$ of asymptotically $s$-energy minimizing $N$-point configurations on $A$ satisfies $\nu(\{\hat{v}_1^N, \cdots, \hat{v}_N^N\}) \xrightarrow{\text{weak}} \mu_{s,A}$ as $N\to\infty$. The same theorem also gives that the leading term of the minimum hyperspherical energy is of order $O(N^2)$ as $N\to\infty$. ■

G PROOF OF PROPOSITION 1

We show that zero-mean equal-variance Gaussian distributed vectors, after being normalized to unit norm, are uniformly distributed over the unit hypersphere (Theorem 8).

Lemma 1 Let $x$ be an $n$-dimensional random vector with distribution $\mathcal{N}(0, I)$ and let $U \in \mathbb{R}^{n\times n}$ be an orthogonal matrix ($UU^\top = U^\top U = I$). Then $y = Ux$ also has distribution $\mathcal{N}(0, I)$.

Proof G.1 For any measurable set $A \subset \mathbb{R}^n$, we have

$$P(y\in A) = P(x\in U^\top A) = \int_{U^\top A} \frac{1}{(\sqrt{2\pi})^n} e^{-\frac{1}{2}\langle x, x\rangle}\, dx = \int_{A} \frac{1}{(\sqrt{2\pi})^n} e^{-\frac{1}{2}\langle Ux, Ux\rangle}\, dx = \int_{A} \frac{1}{(\sqrt{2\pi})^n} e^{-\frac{1}{2}\langle x, x\rangle}\, dx$$

because of the orthogonality of $U$. Therefore the lemma holds. ■

Theorem 8 The normalized vector of Gaussian variables is uniformly distributed on the sphere. Formally, let $x_1, x_2, \cdots, x_n \sim \mathcal{N}(0, 1)$ be independent. Then the vector $\hat{x} = \big(\frac{x_1}{z}, \frac{x_2}{z}, \cdots, \frac{x_n}{z}\big)$ follows the uniform distribution on $S^{n-1}$, where $z = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$ is a normalization factor.

Proof G.2 A random variable has distribution $\mathcal{N}(0, 1)$ if it has the density function $f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}$. An $n$-dimensional random vector $x$ has distribution $\mathcal{N}(0, I)$ if its components are independent and each has distribution $\mathcal{N}(0, 1)$; the density of $x$ is then $f(x) = \frac{1}{(\sqrt{2\pi})^n} e^{-\frac{1}{2}\langle x, x\rangle}$. We then use Lemma 1 on the orthogonal invariance of the normal distribution. Because any rotation is a multiplication by some orthogonal matrix, normally distributed random vectors are invariant to rotation. As a result, generating $x \in \mathbb{R}^n$ with distribution $\mathcal{N}(0, I)$ and then projecting it onto the hypersphere $S^{n-1}$ produces random vectors $u = \frac{x}{\|x\|}$ that are uniformly distributed on the hypersphere. Therefore the theorem holds. ■

The above results indicate that as long as class proxies are initialized with a zero-mean Gaussian, they are uniformly distributed over the hypersphere in a probabilistic sense. ■
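Theorem 8 admits a quick statistical sanity check: drawing Gaussian vectors and projecting them onto the sphere, the empirical mean direction vanishes and the second moments are isotropic ($\mathbb{E}[uu^\top] = I/d$), both consequences of rotation invariance. The sample size and tolerances below are our own choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 3
x = rng.normal(size=(n, d))                          # i.i.d. N(0, 1) coordinates
u = x / np.linalg.norm(x, axis=1, keepdims=True)     # project onto S^{d-1}

# (1) the mean direction vanishes for the uniform measure,
assert np.linalg.norm(u.mean(axis=0)) < 0.02
# (2) rotation invariance makes second moments isotropic: E[u u^T] = I / d.
assert np.allclose(u.T @ u / n, np.eye(d) / d, atol=0.02)
```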

H DERIVATION OF HUG SURROGATE FOR MHE AND MHS

The derivation of $\mathcal{L}'_{\text{MHE-HUG}}$ is as follows:

$$\begin{aligned}
\mathcal{L}_{\text{MHE-HUG}} &:= \alpha\cdot E_{s_b}\big(\{\hat{w}_c\}_{c=1}^{C}\big) - \beta\cdot\sum_{c=1}^{C} E_{s_w}\big(\{\hat{x}_i\}_{i\in A_c}, \hat{w}_c\big) \\
&= \alpha\cdot\sum_{c\neq c'} \|\hat{w}_c - \hat{w}_{c'}\|^{-2} + \beta\cdot\sum_{c}\Big(\sum_{i,j\in A_c,\, i\neq j} \|\hat{x}_i - \hat{x}_j\| + 2\sum_{i\in A_c} \|\hat{x}_i - \hat{w}_c\|\Big) \\
&= \alpha\cdot\sum_{c\neq c'} \|\hat{w}_c - \hat{w}_{c'}\|^{-2} + \beta\cdot\sum_{c}\Big(\sum_{i,j\in A_c,\, i\neq j} \|\hat{x}_i - \hat{w}_c + \hat{w}_c - \hat{x}_j\| + 2\sum_{i\in A_c} \|\hat{x}_i - \hat{w}_c\|\Big) \\
&\le \alpha\cdot\sum_{c\neq c'} \|\hat{w}_c - \hat{w}_{c'}\|^{-2} + \beta\cdot\sum_{c}\Big(\sum_{i,j\in A_c,\, i\neq j} \big(\|\hat{x}_i - \hat{w}_c\| + \|\hat{w}_c - \hat{x}_j\|\big) + 2\sum_{i\in A_c} \|\hat{x}_i - \hat{w}_c\|\Big) \\
&= \alpha\cdot\sum_{c\neq c'} \|\hat{w}_c - \hat{w}_{c'}\|^{-2} + \beta'\cdot\sum_{c}\sum_{i\in A_c} \|\hat{x}_i - \hat{w}_c\| =: \mathcal{L}'_{\text{MHE-HUG}}
\end{aligned}$$

The derivation of $\mathcal{L}'_{\text{MHS-HUG}}$ is as follows:

$$\begin{aligned}
\mathcal{L}_{\text{MHS-HUG}} &:= \alpha\cdot\vartheta\big(\{\hat{w}_c\}_{c=1}^{C}\big) - \beta\cdot\sum_{c=1}^{C}\vartheta\big(\{\hat{x}_i\}_{i\in A_c}, \hat{w}_c\big) \\
&= \alpha\cdot\min_{c\neq c'} \|\hat{w}_c - \hat{w}_{c'}\| - \beta\cdot\sum_{c}\max_{u,v\in\{\{\hat{x}_i\}_{i\in A_c}, \hat{w}_c\},\, u\neq v} \|u - v\| \\
&\le \alpha\cdot\min_{c\neq c'} \|\hat{w}_c - \hat{w}_{c'}\| - \beta\cdot\sum_{c}\max_{i\in A_c} \|\hat{x}_i - \hat{w}_c\| =: \mathcal{L}'_{\text{MHS-HUG}}.
\end{aligned}$$

Most importantly, $\sum_c \max_{u,v\in\{\{\hat{x}_i\}_{i\in A_c}, \hat{w}_c\},\, u\neq v} \|u - v\|$ in $\mathcal{L}_{\text{MHS-HUG}}$ and $\sum_c \max_{i\in A_c} \|\hat{x}_i - \hat{w}_c\|$ in $\mathcal{L}'_{\text{MHS-HUG}}$ share the same minimizer (the minimum is 0, attained when the intra-class features collapse to their class proxy). Therefore, $\mathcal{L}'_{\text{MHS-HUG}}$ and $\mathcal{L}_{\text{MHS-HUG}}$ share the same maximizer, and $\mathcal{L}'_{\text{MHS-HUG}}$ can be viewed as a surrogate loss for $\mathcal{L}_{\text{MHS-HUG}}$.
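The triangle-inequality step above aggregates, for a class with $m = |A_c|$ samples, to $\sum_{i\neq j}(\|\hat{x}_i - \hat{w}_c\| + \|\hat{w}_c - \hat{x}_j\|) + 2\sum_i \|\hat{x}_i - \hat{w}_c\| = 2m\sum_i \|\hat{x}_i - \hat{w}_c\|$ (so, by our reading, $\beta'$ absorbs a class-size factor of $2m$; the derivation itself leaves $\beta'$ unspecified). A quick numerical check of this aggregated bound:

```python
import numpy as np

rng = np.random.default_rng(0)
m, dim = 6, 5                                   # one class A_c with m samples
x = rng.normal(size=(m, dim))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # features on the hypersphere
w = rng.normal(size=dim)
w /= np.linalg.norm(w)                          # its class proxy

pairwise = sum(np.linalg.norm(x[i] - x[j])
               for i in range(m) for j in range(m) if i != j)
to_proxy = np.linalg.norm(x - w, axis=1).sum()

# Aggregated triangle-inequality step of the derivation:
#   sum_{i != j} ||x_i - x_j|| + 2 sum_i ||x_i - w|| <= 2m * sum_i ||x_i - w||
assert pairwise + 2 * to_proxy <= 2 * m * to_proxy + 1e-9
```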

I PROOF OF PROPOSITION 2

For notational convenience, we first define $\varepsilon_s(S^{d-1}, n) := \min_{\hat{V}_n\subset S^{d-1}} E_s(\hat{V}_n)$ and $\delta_n^\rho(S^{d-1}) := \max_{\hat{V}_n\subset S^{d-1}} \vartheta(\hat{V}_n)$. We then let $\hat{V}_n^s$ be an $s$-energy minimizing $n$-point configuration on $S^{d-1}$ for $0 < s < \infty$ (i.e., an MHE configuration), and $\hat{V}_n^\infty$ a best-packing configuration on $S^{d-1}$ for $s = \infty$ (i.e., an MHS configuration). Since we consider $s > 0$, we only need to discuss the case $K_s(\hat{v}_i, \hat{v}_j) = \rho(\hat{v}_i, \hat{v}_j)^{-s}$. Then we have the following equation:

$$\varepsilon_s(S^{d-1}, n)^{\frac{1}{s}} = E_s(\hat{V}_n^s)^{\frac{1}{s}} \ge \frac{1}{\delta^\rho(\hat{V}_n^s)} \ge \frac{1}{\delta_n^\rho(S^{d-1})}. \quad (34)$$

Moreover, we have that

$$\varepsilon_s(S^{d-1}, n)^{\frac{1}{s}} \le E_s(\hat{V}_n^\infty)^{\frac{1}{s}} = \frac{1}{\delta^\rho(\hat{V}_n^\infty)}\Bigg(\sum_{1\le i\neq j\le n} \bigg(\frac{\delta^\rho(\hat{V}_n^\infty)}{\rho(\hat{v}_i^\infty, \hat{v}_j^\infty)}\bigg)^{s}\Bigg)^{\frac{1}{s}} \le \frac{1}{\delta^\rho(\hat{V}_n^\infty)}\big(n(n-1)\big)^{\frac{1}{s}}.$$

Therefore, we end up with

$$\limsup_{s\to\infty} \varepsilon_s(S^{d-1}, n)^{\frac{1}{s}} \le \frac{1}{\delta^\rho(\hat{V}_n^\infty)} = \frac{1}{\delta_n^\rho(S^{d-1})}. \quad (36)$$

Taking both Eq. 34 and Eq. 36 into consideration, we have $\lim_{s\to\infty} \varepsilon_s(S^{d-1}, n)^{\frac{1}{s}} = \frac{1}{\delta_n^\rho(S^{d-1})}$, which concludes the proof. ■

J PROOF OF PROPOSITION 3

We write down the formulations of the HUG objectives (with MHE):

$$\min_{\{\hat{x}_i\}_{i=1}^{n},\{\hat{w}_c\}_{c=1}^{C}} \mathcal{L}_{\text{MHE-HUG}} := \alpha\cdot E_{s_b}\big(\{\hat{w}_c\}_{c=1}^{C}\big) - \beta\cdot\sum_{c=1}^{C} E_{s_w}\big(\{\hat{x}_i\}_{i\in A_c}, \hat{w}_c\big) = \alpha\cdot\sum_{c\neq c'} \|\hat{w}_c - \hat{w}_{c'}\|^{-2} + \beta\cdot\sum_{c}\Big(\sum_{i,j\in A_c,\, i\neq j} \|\hat{x}_i - \hat{x}_j\| + 2\sum_{i\in A_c} \|\hat{x}_i - \hat{w}_c\|\Big)$$

$$\min_{\{\hat{x}_i\}_{i=1}^{n},\{\hat{w}_c\}_{c=1}^{C}} \mathcal{L}'_{\text{MHE-HUG}} = \alpha\cdot\sum_{c\neq c'} \|\hat{w}_c - \hat{w}_{c'}\|^{-2} + \beta'\cdot\sum_{c}\sum_{i\in A_c} \|\hat{x}_i - \hat{w}_c\|$$

For both objectives, the minimizer of the second term (i.e., the intra-class variability term) is attained when all intra-class features collapse to their class proxy, at which point the second term achieves its global minimum 0. For the first term of both objectives, the global minimizer can be obtained directly from Theorem 1, Theorem 2 and Theorem 3. It is easy to see that the global minimizers of the inter-class separability term and the intra-class variability term do not contradict each other and can be achieved simultaneously.
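As a numerical illustration of Proposition 2 (Appendix I above): for n equally spaced points on the circle $S^1$, which are simultaneously the MHE (energy-minimizing) and MHS (best-packing) configuration, $E_s(\hat{V}_n)^{1/s}$ approaches $1/\delta_n^\rho$ as s grows. The sketch below (our own, with chordal distance as $\rho$) checks this convergence:

```python
import numpy as np

# n equally spaced points on the circle S^1 are simultaneously the
# energy-minimizing (MHE) and best-packing (MHS) configuration.
n = 5
theta = 2 * np.pi * np.arange(n) / n
v = np.stack([np.cos(theta), np.sin(theta)], axis=1)

dist = np.linalg.norm(v[:, None] - v[None, :], axis=-1)
dmin = dist[dist > 0].min()                 # best-packing separation delta
errs = []
for s in [10, 100, 1000]:
    e_s = (dist[dist > 0] ** (-s)).sum()    # Riesz s-energy
    errs.append(abs(e_s ** (1.0 / s) - 1.0 / dmin))
# E_s^{1/s} approaches 1/delta as s grows, as the proof shows
assert errs[0] > errs[1] > errs[2] and errs[2] < 0.01
```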
■

K PROOF OF PROPOSITION 4

For the lower bound, we have

$$\sum_{i=1}^{n}\log\Big(1 + \sum_{j\neq y_i}\exp(\langle w_j, x_i\rangle - \langle w_{y_i}, x_i\rangle)\Big) \ge \sum_{i=1}^{n}\sum_{j\neq y_i}\log\big(1 + \exp(\langle w_j, x_i\rangle - \langle w_{y_i}, x_i\rangle)\big) \ge \sum_{i=1}^{n}\sum_{j\neq y_i}\big(\langle w_j, x_i\rangle - \langle w_{y_i}, x_i\rangle\big) = \underbrace{\sum_{i=1}^{n}\sum_{j\neq y_i}\langle w_j, x_i\rangle}_{Q_1:\ \text{Coupling IS and IV}} - (C-1)\underbrace{\sum_{i=1}^{n}\langle w_{y_i}, x_i\rangle}_{Q_2:\ \text{Intra-class Variability}}.$$

For the upper bound, we have

$$\sum_{i=1}^{n}\log\Big(1 + \sum_{j\neq y_i}\exp(\langle w_j, x_i\rangle - \langle w_{y_i}, x_i\rangle)\Big) \le \log\Big(1 + \sum_{i=1}^{n}\sum_{j\neq y_i}\exp(\langle w_j, x_i\rangle - \langle w_{y_i}, x_i\rangle)\Big) \le \log\Big(1 + \sum_{i=1}^{n}\sum_{j\neq y_i}\big(\exp(\langle w_j, x_i\rangle) + \exp(-\langle w_{y_i}, x_i\rangle)\big)\Big) = \log\Big(1 +

L DERIVATION OF CE'S LOWER BOUND

The derivation is actually very simple and this result is originally given by [4]. We find that it naturally matches the intuition behind HUG. To keep our paper self-contained, we briefly give the derivation below; for details, please refer to Proposition 1 in [4]. We start by rewriting the CE loss as

$$\mathcal{L}_{\text{CE}} = \underbrace{-\sum_{i=1}^{n}\langle w_{y_i}, x_i\rangle + \frac{\lambda n}{2}\sum_{c=1}^{C}\langle w_c, w_c\rangle}_{Q_1(w)} + \underbrace{\sum_{i=1}^{n}\log\sum_{c=1}^{C}\exp(\langle w_c, x_i\rangle) - \frac{\lambda n}{2}\sum_{c=1}^{C}\langle w_c, w_c\rangle}_{Q_2(w)}$$

where $\lambda$ can be chosen such that both $Q_1(w)$ and $Q_2(w)$ are convex functions of $w$. Taking advantage of the convexity, we can separately set the gradients of $Q_1(w)$ and $Q_2(w)$ with respect to $w$ to 0 and compute their minima. Specifically, we end up with

$$Q_1(w) \ge Q_1(w^*_{Q_1}) = -\frac{1}{2\lambda n}\sum_{i=1}^{n}\sum_{j\in A_{y_i}}\langle x_i, x_j\rangle,$$

$$Q_2(w) \ge Q_2(w^*_{Q_2}) = \sum_{i=1}^{n}\log\sum_{c=1}^{C}\exp\Big(\frac{1}{\lambda n}\sum_{j=1}^{n} l_{jc}\langle x_i, x_j\rangle\Big) - \frac{n}{2\lambda}\sum_{c=1}^{C}\Big\|\frac{1}{n}\sum_{i=1}^{n} l_{ic}\, x_i\Big\|^2,$$

where $l_{ic} = \frac{\exp(\langle w_c, x_i\rangle)}{\sum_j \exp(\langle w_j, x_i\rangle)}$ denotes the softmax confidence. Combining the two lower bounds above, we have $\mathcal{L}_{\text{CE}} \ge Q_1(w^*_{Q_1}) + Q_2(w^*_{Q_2})$.

N EXPERIMENTAL DETAILS

General settings. For MHE-HUG and MHS-HUG, α and β are set to 0.15 and 0.015, respectively. For MGD-HUG, α and β are set to 0.15 and 0.03, respectively. We train the model for 200 epochs with batch size 512 for both the cross-entropy (CE) loss and HUG. We use stochastic gradient descent with momentum 0.9 and weight decay 2 × 10⁻⁴. The initial learning rate is set to 0.1 for both CIFAR-100 and CIFAR-10 and is divided by 10 at epochs 60, 120 and 180. For the general classification experiments, we use multiple architectures, including ResNet-18, VGG-16 and DenseNet-121. We use simple data augmentation: 4 pixels are padded on each side, and the image is randomly cropped. Long-tailed recognition. We follow LDAM [6] to obtain imbalanced CIFAR-10 and CIFAR-100 datasets with different imbalance ratios. Following LDAM, we use ResNet-32 as our base network. The other settings are the same as our general setting. Continual learning. We follow DER [5] to construct our continual learning experiments. We split both the CIFAR-10 and CIFAR-100 training sets into 5 tasks. Each task has 2 classes for CIFAR-10 and 20 classes for CIFAR-100, respectively. The training batch size is set to 64, with 32 incoming samples and 32 replayed samples. Different sizes of the memory buffer are also studied. Adversarial robustness. For the experiments on adversarial robustness, we first obtain the models trained with CE and HUG. With the information of the attacked model, PGD [51] generates adversarial examples to mislead the attacked model. The test accuracy in these experiments is the accuracy on the perturbed samples. Visualizing loss landscape. We perturb neuron weights to visualize the loss landscape, as proposed in [40]. Specifically, we perturb the model weights with 400 interpolation points along two random directions around the current weight minimum. The visualization method is the same as [47].

O ADDITIONAL EXPERIMENTAL RESULTS

Training convergence. We observe the training convergence of HUG on CIFAR-10 and CIFAR-100. Both the evaluation accuracy and the training losses, including the overall loss, the intra-class loss and the inter-class loss, are shown in Figure 11. For both CIFAR-10 and CIFAR-100, the inter-class uniformity loss remains relatively small, which is consistent with the empirical findings in [41, 47]. Moreover, we find that the intra-class uniformity loss (i.e., intra-class variability) dominates the overall loss on CIFAR-100 and is relatively difficult to optimize when the number of classes becomes large. 2D loss contour. We also utilize the method in [40] to visualize the 2D loss landscape, which makes it easier to see the flatness of the loss landscape. As shown in Figure 12, the 2D loss landscape of our HUG loss is flatter than that of the widely used CE loss, showing that HUG yields a flat minimum which may have better generalization ability. The ablation of α and β. In our HUG framework, we introduce two scaling hyperparameters: α for the inter-class hyperspherical uniformity and β for the intra-class hyperspherical uniformity. We investigate the effect of these two hyperparameters on model performance. As shown in Table 8, HUG is not sensitive to α, as the inter-class hyperspherical uniformity is always easy to optimize. HUG is also not sensitive to β over a wide range. The ablations are conducted on CIFAR-100; α is set to 0.15 when we ablate β, and β is set to 0.015 when we ablate α.



We first obtain the upper bound $n$ of $\mathrm{tr}(S_b)$ from $\mathrm{tr}(S_b) = \sum_{i=1}^{C} n_i \|\hat{\mu}_i - \bar{\mu}\|_F \le \sum_{i=1}^{C} n_i \|\hat{\mu}_i\| \cdot \|\bar{\mu}\| \le n$, because a set of vectors $\{\hat{\mu}_i\}_{i=1}^{n}$ achieving hyperspherical uniformity has $\mathbb{E}_{\mu_1,\cdots,\mu_n}\{\|\bar{\mu}\|\} \to 0$ as $n$ grows larger [20]. Then $\mathrm{tr}(S_b)$ attains $n$, and therefore vectors achieving hyperspherical uniformity are among its maximizers. $\mathrm{tr}(S_w)$ can simultaneously attain its minimum if the intra-class features collapse to a single point. Parametric class proxies are a set of parameters used to represent a group of samples in the same class; these proxies thus store information about a class. Last-layer classifiers are a typical example.



Figure 1: 2D learned feature visualization on MNIST. The features are inherently 2-dimensional and are plotted without visualization tools. (a) Case 1: d = 2, C = 3; (b) Case 2: d = 2, C = 10.

Motivated by this question, we conduct a simple experiment to simulate the case of d ≥ C - 1 and the case of d < C - 1. Specifically, we train a convolutional neural network (CNN) on MNIST with feature dimension 2. For the case of d ≥ C - 1, we use only 3 classes (digits 0, 1, 2) as the training set. For the case of d < C - 1, we use all 10 classes. We visualize the learned features of both cases in Figure 1. The results verify that the case of d ≥ C - 1 indeed approaches NC, while a simplex ETF does not exist in the case of d < C - 1. Interestingly, the learned features in both cases approach the configuration of equally spaced frames on the hypersphere. To accommodate the case of d < C - 1, we extend NC to the generalized NC by hypothesizing that last-layer inter-class features and classifiers converge to equally spaced points on the hypersphere, which can be characterized by hyperspherical uniformity.

Nearest decision rule: The learned linear classifiers behave like the nearest class-mean classifiers. Formally, GNC has that arg max c ⟨w c , x⟩ + b c → arg min c ∥x -µ c ∥.

Figure 2: Geometric illustration in R 3 of (a) regular simplex optimum (equivalent to simplex ETF in NC) and (b) cross-polytope optimum in GNC.

Figure 3: Training dynamics of hyperspherical energy (which captures inter-class separability) and hyperspherical reverse-energy (which captures intra-class variability). (a,b) MNIST with d = 2, C = 10 and d = 2, C = 3. (c,d) CIFAR-100 with d = 64, C = 100 and d = 128, C = 100.

Theorem 3 (Order of Minimum Hyperspherical Energy) If $d - 1 > s > 0$ or $0 > s > -2$, and $d \in \mathbb{N}$, we have that $\lim_{n\to\infty} \{n^{-2}\cdot\min_{\hat{V}_n} E_s(\hat{V}_n)\} = c(s, d)$, where $c(s, d)$ is a constant depending on $s$ and $d$.

$$\underbrace{\sum_{i=1}^{n}\sum_{j\neq y_i}\langle w_j, x_i\rangle}_{Q_1:\ \text{Coupled IS and IV}} - \rho\underbrace{\sum_{i=1}^{n}\langle w_{y_i}, x_i\rangle}_{Q_2:\ \text{Intra-class Variability}} \le \mathcal{L}_{\text{CE}} \le \log\Big(1 + \underbrace{\sum_{i=1}^{n}\sum_{j\neq y_i}\exp(\langle w_j, x_i\rangle)}_{Q_3:\ \text{Coupled IS and IV}}\Big) + \rho\underbrace{\sum_{i=1}^{n}\exp(-\langle w_{y_i}, x_i\rangle)}_{Q_4:\ \text{Intra-class Variability}}.$$

We show in Proposition 4 that CE inherently optimizes two independent criteria: intra-class variability (IV) and inter-class separability (IS). With normalized classifiers and features, we can see that $Q_1$ and $Q_3$ have similar minima, where $x_i = w_{y_i}$ and the $w_i$ attain hyperspherical uniformity.

Published as a conference paper at ICLR 2023

We show that CE is lower bounded by the gap of inter-class and intra-class hyperspherical uniformity (Eq. 11).

(a) CE Loss (b) HUG: Both losses (c) HUG: Intra-class Variability (d) HUG: Inter-class Separability

Figure 4: Loss landscape visualization. (b,c,d) show L ′ MHE-HUG , T b and Tw, respectively.

Figure 6: Training dynamics of hyperspherical energy (which captures inter-class separability) and hyperspherical reverse-energy (which captures intra-class variability). ImageNet-2012 [13] with ResNet-18 [29] (d = 512, C = 1000).


Figure 7: Training dynamics of hyperspherical energy (which captures inter-class separability) and hyperspherical reverse-energy (which captures intra-class variability). ImageNet-2012 [13] with ResNet-50 [29] (d = 2048, C = 1000).

Figure 8: 2D MNIST feature visualization for the CE loss at 1,5,10,15,20 epochs (top left -top right -middle left -middle right -bottom).

Figure 9: 2D MNIST feature visualization for the HUG loss (randomly initialized and then optimized proxies) at 1, 5, 10, 15, 20 epochs (top left - top right - middle left - middle right - bottom).

Figure 10: 2D MNIST feature visualization for the HUG loss (fully learnable proxies) at 1, 5, 10, 15, 20 epochs (top left - top right - middle left - middle right - bottom).


Applying the theorem above with $s = 2$ (note $d - 1 > s$), $N = C$ and $A = S^{d-1}$, we have that $W_s(S^{d-1})$ is a constant, and most importantly, the point sequences $\{\hat{\mu}_1^C, \cdots, \hat{\mu}_C^C\}$ asymptotically minimize the hyperspherical energy on $S^{d-1}$.


$$\mathcal{L}_{\text{CE}} \ge \sum_{i=1}^{n}\log\sum_{c=1}^{C}\exp\Big(\frac{1}{\lambda n}\sum_{j=1}^{n} l_{jc}\langle x_i, x_j\rangle\Big) - \frac{n}{2\lambda}\sum_{c=1}^{C}\Big\|\frac{1}{n}\sum_{i=1}^{n} l_{ic}\, x_i\Big\|^2 - \frac{1}{2\lambda n}\sum_{i=1}^{n}\sum_{j\in A_{y_i}}\langle x_i, x_j\rangle \quad (45)$$

where the first two terms encourage larger inter-class hyperspherical uniformity, and the last term promotes smaller intra-class hyperspherical uniformity.

Figure 11: HUG's training loss and testing accuracy (%) on CIFAR-10 (left) and CIFAR-100 (right).

Figure 12: The 2D loss contours of different loss objectives. From left to right: (1) CE loss; (2) HUG overall loss; (3) intra-class loss; (4) inter-class loss.



Table 3: Testing error (%) with different architectures.



Table 4: Testing accuracy (%) of long-tailed recognition.

Table 5: Final testing accuracy (%) of continual learning.

HUG consistently improves ER under different memory sizes (Table 5). Table 6: Testing accuracy (%) under adversarial attacks.



Table 8: Effect of hyperparameters α and β. Accuracy (%): 74.15, 76.48, 76.12, 75.87, 75.59, 75.24, 74.81, 74.00.

ACKNOWLEDGEMENT

The authors would like to sincerely thank the anonymous reviewers for all the detailed and valuable suggestions that have significantly improved the paper. This work is supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A, 01IS18039B; and by the Machine Learning Cluster of Excellence, EXC number 2064/1 -Project number 390727645. AW acknowledges support from a Turing AI Fellowship under EPSRC grant EP/V025279/1, and the Leverhulme Trust via CFI.

M PROOF OF THEOREM 5

This theorem follows naturally from the main result in [50], which proves that the minimizer of a simplified form of the cross-entropy loss is the simplex ETF when 2 ≤ C ≤ d + 1 and that the minimizer also asymptotically converges to the uniform measure on the hypersphere. More formally, we have:

Theorem 9 ([50]) Consider the variational problem studied in [50]. Let $\mu_n$ be the probability measure on $S^{d-1}$ generated by a minimizer; then, for any $\alpha > 0$, $\mu_n$ converges weakly to the uniform measure on $S^{d-1}$ as $n \to \infty$.

From Theorem 3, we know that HUG with the specific potential energy also converges to the uniform measure on $S^{d-1}$. Combining the results above, we conclude that HUG and CE share the same minimizer. ■

