PERTURBATION ANALYSIS OF NEURAL COLLAPSE

Abstract

Training deep neural networks for classification often includes minimizing the training loss beyond the zero training error point. In this phase of training, a "neural collapse" behavior has been observed: the variability of features (outputs of the penultimate layer) of within-class samples decreases and the mean features of different classes approach a certain tight frame structure. Recent works analyze this behavior via idealized unconstrained features models where all the minimizers exhibit exact collapse. However, with practical networks and datasets, the features typically do not reach exact collapse, e.g., because deep layers cannot arbitrarily modify intermediate features that are far from being collapsed. In this paper, we propose a richer model that can capture this phenomenon by forcing the features to stay in the vicinity of a predefined features matrix (e.g., intermediate features). We explore the model in the small vicinity case via perturbation analysis and establish results that cannot be obtained by the previously studied models. For example, we prove reduction in the within-class variability of the optimized features compared to the predefined input features (via analyzing gradient flow on the "central-path" with minimal assumptions), analyze the minimizers in the near-collapse regime, and provide insights on the effect of regularization hyperparameters on the closeness to collapse. We support our theory with experiments in practical deep learning settings.

1. INTRODUCTION

Modern classification systems are typically based on deep neural networks (DNNs), whose parameters are optimized using a large amount of labeled training data. Their training scheme often includes minimizing the training loss beyond the zero training error point (Hoffer et al., 2017; Ma et al., 2018; Belkin et al., 2019) . In this terminal phase of training, a "neural collapse" (NC) behavior has been empirically observed when using either cross-entropy (CE) loss (Papyan et al., 2020) or mean squared error (MSE) loss (Han et al., 2022) . The NC behavior includes several simultaneous phenomena that evolve as the number of epochs grows. The first phenomenon, dubbed NC1, is decrease in the variability of the features (outputs of the penultimate layer) of training samples from the same class. The second phenomenon, dubbed NC2, is increasing similarity of the structure of the inter-class features' means (after subtracting the global mean) to a simplex equiangular tight frame (ETF). The third phenomenon, dubbed NC3, is alignment of the last layer's weights with the inter-class features' means. A consequence of these phenomena is that the classifier's decision rule becomes similar to nearest class center in feature space. Many recent works attempt to theoretically analyze the NC behavior (Mixon et al., 2020; Lu & Steinerberger, 2022; Wojtowytsch et al., 2021; Fang et al., 2021; Zhu et al., 2021; Graf et al., 2021; Ergen & Pilanci, 2021; Ji et al., 2021; Galanti et al., 2021; Tirer & Bruna, 2022; Zhou et al., 2022; Thrampoulidis et al., 2022; Yang et al., 2022; Kothapalli et al., 2022) . The mathematical frameworks are almost always based on variants of the unconstrained features model (UFM), proposed by Mixon et al. (2020) , which treats the (deepest) features of the training samples as free optimization variables (disconnected from data or intermediate/shallow features). 
Typically, in these "idealized" models all the minimizers exhibit "exact collapse" (i.e., their within-class variability is exactly 0 and an exact simplex ETF structure is demonstrated) provided that an arbitrary (but nonzero) level of regularization is used. However, the features of DNNs are not free optimization variables but outputs of predetermined architectures that get training samples as input and have parameters (shared by all the samples) that are hard to optimize. Thus, usually, the deepest features demonstrate reduced "NC distance metrics" (such as within-class variability) compared to features of intermediate layers but do not exhibit convergence to an exact collapse. Indeed, as can be seen in any NC paper that presents empirical results, the decrease in the NC metrics is typically finite and stops above zero at some epoch (the margin depends on the dataset complexity, architecture, hyperparameter tuning, etc.). In this paper, this issue is taken into account by studying a model that can force the features to stay in the vicinity of a predefined features matrix. By considering the predefined features as intermediate features of a DNN, the proposed model allows us to analyze how deep features progress from, or relate to, shallower features. We explore the model in the small vicinity case via perturbation analysis and establish results that cannot be obtained by the previously studied UFMs. Specifically, we prove reduction in the within-class variability of the optimized features compared to the predefined input features. To obtain this result (for arbitrary input features), we prove monotonic decrease of within-class variability along gradient flow on the "central-path" of a UFM with minimal assumptions (i.e., we drop the assumptions and modifications of the flow that Han et al. (2022) made to facilitate their analysis). Next, we provide a closed-form approximation for the model's minimizer.
Then, focusing on the case where the input features matrix is already near collapse (e.g., the penultimate features of a well-trained DNN), we present a fine-grained analysis of our closed-form approximation, which provides insights on the effect of regularization hyperparameters on the closeness to collapse. We support our theory with experiments in practical deep learning settings.

2. BACKGROUND AND PROBLEM SETUP

Following the work of Mixon et al. (2020), in order to mathematically show the emergence of minimizers with NC structure, most of the theoretical papers have followed the "unconstrained features model" (UFM) approach, where the features $\{h_\theta(x_{k,i})\}$ are treated as free optimization variables $\{h_{k,i}\}$. Namely, they study problems of the form

\[ \min_{W,b,\{h_{k,i}\}} \; \frac{1}{Kn}\sum_{k=1}^{K}\sum_{i=1}^{n} \mathcal{L}(W h_{k,i} + b, y_k) + \mathcal{R}(W, b, \{h_{k,i}\}). \]

One such example is the work in (Tirer & Bruna, 2022), which considered a setting with regularized MSE loss (which shares similarity with models in the matrix factorization literature (Koren et al., 2009; Chi et al., 2019), except for the assumption that $d \ge K$ and the specific structure of $Y$):

\[ \min_{W,H} \; \frac{1}{2Kn}\|WH - Y\|_F^2 + \frac{\lambda_W}{2K}\|W\|_F^2 + \frac{\lambda_H}{2Kn}\|H\|_F^2, \tag{1} \]

where $H = [h_{1,1}, \ldots, h_{1,n}, h_{2,1}, \ldots, h_{K,n}] \in \mathbb{R}^{d \times Kn}$ is the (organized) unconstrained features matrix, $Y = I_K \otimes 1_n^\top \in \mathbb{R}^{K \times Kn}$ (where $\otimes$ denotes the Kronecker product) is its associated one-hot vectors matrix, and $\lambda_W$ and $\lambda_H$ are positive regularization hyperparameters. It was shown that all the (global) minimizers of this bias-free UFM exhibit an orthogonal collapse, as stated in the following theorem.

Theorem 2.1 (Theorem 3.1 in (Tirer & Bruna, 2022)). Let $d \ge K$ and define $c := \sqrt{\lambda_H \lambda_W}$. If $c \le 1$, then any global minimizer $(W^*, H^*)$ of Eq. 1 satisfies $h^*_{k,1} = \ldots = h^*_{k,n} =: h^*_k$ [...]

In short, the theorem states that any minimizer $(W^*, H^*)$ of Eq. 1 obeys $H^* = \overline{H} \otimes 1_n^\top$ for some $\overline{H} \in \mathbb{R}^{d \times K}$, and $W^* \overline{H} \propto \overline{H}^\top \overline{H} \propto W^* W^{*\top} \propto I_K$. It is not hard to show that $\overline{H}^\top \overline{H} = \rho I_K$ implies that

\[ (\overline{H} - h^*_G 1_K^\top)^\top (\overline{H} - h^*_G 1_K^\top) = \rho \left( I_K - \tfrac{1}{K} 1_K 1_K^\top \right), \]

where $h^*_G = \frac{1}{K}\sum_{k=1}^{K} h^*_k = \frac{1}{K}\overline{H} 1_K$ is the global mean. Namely, the "mean-subtracted features" collapse to a simplex ETF.

From the structure of the problem and the theorem, we see that there are infinitely many minimizers of Eq. 1. Indeed, as can be deduced from the proof of Theorem 2.1 in (Tirer & Bruna, 2022): taking any (partial) orthonormal matrix $R \in \mathbb{R}^{d \times K}$ (i.e., $R^\top R = I_K$), one can construct a minimizer for Eq. 1 simply by $H^* = \sqrt{\rho(\lambda_W, \lambda_H)}\, R \otimes 1_n^\top$ and $W^* = \sqrt{\lambda_H/\lambda_W}\,\sqrt{\rho(\lambda_W, \lambda_H)}\, R^\top$.
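The construction of a collapsed minimizer from a partial orthonormal matrix $R$ can be checked numerically. The following is a minimal sketch, not the authors' code. The value $\rho = (1 - c)\sqrt{\lambda_W/\lambda_H}$ used below is an assumption: it is not stated in the text above and is obtained here by solving the stationarity conditions of Eq. 1 for the ansatz $H = a\,R \otimes 1_n^\top$, $W = b\,R^\top$. The helper names (`ufm_objective`, `collapsed_pair`) and the problem sizes are hypothetical.

```python
import numpy as np

def ufm_objective(W, H, Y, lam_W, lam_H):
    """Objective of Eq. 1 (regularized MSE, bias-free UFM)."""
    K, N = Y.shape
    n = N // K
    return (np.sum((W @ H - Y) ** 2) / (2 * K * n)
            + lam_W * np.sum(W ** 2) / (2 * K)
            + lam_H * np.sum(H ** 2) / (2 * K * n))

def collapsed_pair(R, n, lam_W, lam_H):
    """Build (W*, H*) from a partial orthonormal R (R^T R = I_K)."""
    c = np.sqrt(lam_H * lam_W)
    # Assumption: rho solved from the stationarity conditions of the ansatz.
    rho = (1.0 - c) * np.sqrt(lam_W / lam_H)
    H = np.sqrt(rho) * np.kron(R, np.ones((1, n)))    # sqrt(rho) * (R kron 1_n^T)
    W = np.sqrt(lam_H / lam_W) * np.sqrt(rho) * R.T
    return W, H, rho

d, K, n, lam_W, lam_H = 6, 3, 4, 0.4, 0.1             # illustrative sizes only
rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.standard_normal((d, K)))      # d x K with orthonormal columns
Y = np.kron(np.eye(K), np.ones((1, n)))               # one-hot matrix I_K kron 1_n^T
W, H, rho = collapsed_pair(R, n, lam_W, lam_H)

# Gradients of (Kn * objective) w.r.t. W and H; both should vanish.
grad_W = (W @ H - Y) @ H.T + n * lam_W * W
grad_H = W.T @ (W @ H - Y) + lam_H * H

# Class-mean matrix and the two Gram identities from the text.
H_bar = H.reshape(d, K, n).mean(axis=2)
gram = H_bar.T @ H_bar
centered = H_bar - H_bar.mean(axis=1, keepdims=True)
gram_centered = centered.T @ centered
simplex = rho * (np.eye(K) - np.ones((K, K)) / K)

# A random perturbation should not decrease the objective at a global minimizer.
dW, dH = rng.standard_normal(W.shape), rng.standard_normal(H.shape)
obj_min = ufm_objective(W, H, Y, lam_W, lam_H)
obj_pert = ufm_objective(W + 0.1 * dW, H + 0.1 * dH, Y, lam_W, lam_H)
print(np.abs(grad_W).max(), np.abs(grad_H).max(), obj_min, obj_pert)
```

Different draws of $R$ give different minimizers with the same objective value, which illustrates the non-uniqueness discussed above.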
The existing literature includes other different UFM settings where all the minimizers exhibit NC structures (e.g., see (Lu & Steinerberger, 2022; Wojtowytsch et al., 2021; Zhu et al., 2021; Fang et al., 2021; Thrampoulidis et al., 2022)). However, as discussed in Section 1, all the previously studied UFMs are idealized and their results deviate from the situation in practical DNN training, where the features do not exhibit exact collapse (e.g., since deep layers cannot arbitrarily modify intermediate features that are far from being collapsed) and the setting of the hyperparameters affects the distance from NC structure.

In this paper, we consider a different model with the goal of better analyzing the real-world "near collapse" situation where "exact NC" cannot be reached. Motivated by Eq. 1, we consider the following model

\[ \min_{W,H} \; f(W, H; H_0) = \frac{1}{2Kn}\|WH - Y\|_F^2 + \frac{\lambda_W}{2K}\|W\|_F^2 + \frac{\lambda_H}{2Kn}\|H\|_F^2 + \frac{\beta}{2Kn}\|H - H_0\|_F^2, \tag{2} \]

where $H_0 \in \mathbb{R}^{d \times Kn}$ is an input features matrix, which is fixed, and $\beta$ is a positive hyperparameter that controls the distance of $H$ from $H_0$.

Let us discuss the motivation for studying this model. As before, we interpret $W$ and $H$ as the final weights and deepest features of the DNN, respectively. Clearly, for $H_0 = 0$ this model reduces to Eq. 1 (with $\|H\|_F^2$ regularized by $\lambda_H + \beta$). Furthermore, when $H_0$ is nonzero, but already a minimizer of Eq. 1 (and thus has zero within-class variability and an orthogonal frame structure), the following statement is straightforward.

Corollary 2.2. Let $d \ge K$, $\sqrt{\lambda_H \lambda_W} < 1$, and let $(W^*, H^*)$ be a minimizer of Eq. 1. Then, the minimizer of $f(W, H; H_0 = H^*)$ (in Eq. 2) is unique and it is given by $(W^*, H^*)$.

That is, Eq. 2 allows us to pick one of the minimizers of Eq. 1 by $H_0$ and transfer its orthogonal collapse properties, which are stated in Theorem 2.1, to the minimizer of Eq. 2. However, the usefulness of Eq. 2 comes from exploring cases with nonzero/non-collapsed $H_0$.
Indeed, while $H$ can be interpreted as the deepest features of a DNN, here we interpret $H_0$ as the features that are obtained in a shallower layer. In this case, $1/\beta$ can be understood as the complexity of the subnetwork from $H_0$ to $H$. We are particularly interested in the large $\beta$ regime, $\beta \gg 1$, where $H_0$ expresses penultimate features (only one layer before $H$) that significantly constrain $H$. In Appendix F we review practical DNNs where the distance between the deepest and penultimate features may be small or is even inherently small. In this paper we focus on this large $\beta$ regime, and provide mathematical reasoning for empirical NC behaviors that are not captured by previously studied UFMs, such as proving that the optimized $H$ has smaller within-class variability than $H_0$, and analyzing how perturbations from collapse of $H_0$ can be mitigated by the minimizer of Eq. 2.
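One simple way to experiment with Eq. 2 numerically is alternating minimization, since both block subproblems are regularized least squares with closed-form solutions. The sketch below is an illustration under stated assumptions, not the procedure used in the paper: `solve_eq2` and `f_objective` are hypothetical helper names, alternating minimization is only guaranteed to decrease the objective (it need not reach a global minimizer for a general $H_0$), and the collapsed pair uses the assumed value $\rho = (1 - \sqrt{\lambda_H\lambda_W})\sqrt{\lambda_W/\lambda_H}$ derived from stationarity of Eq. 1.

```python
import numpy as np

def f_objective(W, H, H0, Y, lam_W, lam_H, beta):
    """Objective of Eq. 2."""
    K, N = Y.shape
    n = N // K
    return (np.sum((W @ H - Y) ** 2) / (2 * K * n)
            + lam_W * np.sum(W ** 2) / (2 * K)
            + lam_H * np.sum(H ** 2) / (2 * K * n)
            + beta * np.sum((H - H0) ** 2) / (2 * K * n))

def solve_eq2(H0, Y, lam_W, lam_H, beta, iters=300):
    """Alternate exact minimization over W (H fixed) and over H (W fixed)."""
    d = H0.shape[0]
    K, N = Y.shape
    n = N // K
    I_d = np.eye(d)
    H, history = H0.copy(), []
    for _ in range(iters):
        # W-step: W = Y H^T (H H^T + n*lam_W*I)^{-1}
        W = np.linalg.solve(H @ H.T + n * lam_W * I_d, H @ Y.T).T
        # H-step: (W^T W + (lam_H + beta) I) H = W^T Y + beta * H0
        H = np.linalg.solve(W.T @ W + (lam_H + beta) * I_d, W.T @ Y + beta * H0)
        history.append(f_objective(W, H, H0, Y, lam_W, lam_H, beta))
    return W, H, np.array(history)

d, K, n, lam_W, lam_H, beta = 6, 3, 4, 0.4, 0.1, 20.0   # illustrative values
rng = np.random.default_rng(0)
Y = np.kron(np.eye(K), np.ones((1, n)))

# (a) Generic, non-collapsed input features: the objective never increases.
H0 = rng.standard_normal((d, K * n))
W_a, H_a, hist = solve_eq2(H0, Y, lam_W, lam_H, beta)

# (b) Collapsed input features H0 = H*: the solver stays at (W*, H*),
#     in line with Corollary 2.2.
R, _ = np.linalg.qr(rng.standard_normal((d, K)))
rho = (1 - np.sqrt(lam_H * lam_W)) * np.sqrt(lam_W / lam_H)   # assumed value
H_star = np.sqrt(rho) * np.kron(R, np.ones((1, n)))
W_star = np.sqrt(lam_H / lam_W) * np.sqrt(rho) * R.T
W_b, H_b, _ = solve_eq2(H_star, Y, lam_W, lam_H, beta)
print(hist[0], hist[-1], np.abs(H_b - H_star).max())
```

Larger values of `beta` keep the returned features closer to `H0`, which matches the role of $\beta$ described above.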

3. DECREASE IN WITHIN-CLASS VARIABILITY

As discussed above, while the features matrix $H$ represents the output of a DNN's penultimate layer, the input matrix $H_0$ can be interpreted as the features of a preceding layer. Several works have presented empirical settings where the within-class variability of the features, measured by some "NC1 metric", decreases across depth (Papyan et al., 2020; Tirer & Bruna, 2022; Galanti, 2022). The goal of this section is to prove such a phenomenon for the model stated in Eq. 2. The theory that we provide also shows monotonic decrease of the within-class variability (till exact collapse) along gradient flow on the "central-path" of the UFM stated in Eq. 1.

Let us begin with several definitions that will be used in this section. For a given set of $n$ features for each of $K$ classes, $\{h_{k,i}\}$, we define the per-class and global means as $\bar{h}_k := \frac{1}{n}\sum_{i=1}^{n} h_{k,i}$ and $\bar{h}_G := \frac{1}{Kn}\sum_{k=1}^{K}\sum_{i=1}^{n} h_{k,i}$, respectively, as well as the mean features matrix $\overline{H} := [\bar{h}_1, \ldots, \bar{h}_K]$. Next, we define the within-class and between-class $d \times d$ covariance matrices

\[ \Sigma_W(H) := \frac{1}{Kn}\sum_{k=1}^{K}\sum_{i=1}^{n} (h_{k,i} - \bar{h}_k)(h_{k,i} - \bar{h}_k)^\top, \qquad \Sigma_B(H) := \frac{1}{K}\sum_{k=1}^{K} (\bar{h}_k - \bar{h}_G)(\bar{h}_k - \bar{h}_G)^\top. \]

The within-class variability collapse (NC1) can be expressed as $\Sigma_W(H) \to 0$ while $\Sigma_B(H) \nrightarrow 0$, where the limit takes place with increasing the training epoch, and the requirement $\Sigma_B(H) \nrightarrow 0$ filters degenerate cases such as $H = 0$. Several papers considered in their experiments the metric $\widetilde{NC}_1(H) := \frac{1}{K}\mathrm{Tr}\left(\Sigma_W(H)\Sigma_B^\dagger(H)\right)$, where $\Sigma_B^\dagger$ denotes the pseudoinverse of $\Sigma_B$ (Papyan et al., 2020; Han et al., 2022; Zhu et al., 2021). Yet, we believe that considering the metric

\[ NC_1(H) := \mathrm{Tr}(\Sigma_W(H)) / \mathrm{Tr}(\Sigma_B(H)) \tag{3} \]

is more amenable for theoretical analysis while capturing the desired nondegenerate collapse behavior (see Footnote 4). Indeed, the trace of a covariance matrix equals zero if and only if the covariance matrix is a zero matrix (this follows from $\mathrm{Cov}^2(X, Y) \le \mathrm{Var}(X)\mathrm{Var}(Y)$).

Footnote 4: The metric $\frac{1}{K}\mathrm{Tr}(\Sigma_W \Sigma_B^\dagger)$ was considered in (Han et al., 2022). Yet, to state a result on this metric the authors claim (in the proof of Cor. 2) that a nonzero eigenvalue of $\Sigma_W^{-1/2}\overline{H}\,\overline{H}^\top\Sigma_W^{-1/2}$ equals the reciprocal of the associated nonzero eigenvalue of $\Sigma_W^{1/2}(\overline{H}\,\overline{H}^\top)^\dagger\Sigma_W^{1/2}$. However, this is not correct in general (due to the inherent rank deficiency of $\overline{H}\,\overline{H}^\top$). For example, for $\Sigma_W^{1/2} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ and $\overline{H}\,\overline{H}^\top = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$, the single nonzero eigenvalue of the former is $5/9$ while the single nonzero eigenvalue of the latter is $5$.

Recall that the minimizer w.r.t. $W$ in Eq. 2 (and Eq. 1) has a closed-form expression that is a function of $H$, which is given by $W^*(H) = YH^\top(HH^\top + n\lambda_W I_d)^{-1}$. Thus, the optimization in Eq. 2 is equivalent to

\[ H_{1/\beta} := \operatorname*{argmin}_{H} \; L(H) + \frac{\beta}{2Kn}\|H - H_0\|_F^2, \]

where

\[ L(H) := \frac{1}{2Kn}\|W^*(H)H - Y\|_F^2 + \frac{\lambda_W}{2K}\|W^*(H)\|_F^2 + \frac{\lambda_H}{2Kn}\|H\|_F^2. \]

For large $\beta$, the minimizer $H_{1/\beta}$ can be viewed as a backward/implicit gradient descent update from $H_0$ with respect to the loss $L$. This follows from rewriting the first order optimality condition as

\[ \frac{H_{1/\beta} - H_0}{1/\beta} = -Kn\nabla L(H_{1/\beta}). \]

Observing that for $\beta \to \infty$ we have $H_{1/\beta} \to H_0$ (formally shown in Appendix B), the above equation can be written as $\left.\frac{dH_t}{dt}\right|_{t=0} = -Kn\nabla L(H_0)$, where we think of $t$ as $\beta^{-1}$. This naturally gives rise to the gradient flow

\[ \frac{dH_t}{dt} = -Kn\nabla L(H_t), \tag{4} \]

associated with the UFM in Eq. 1. This means that results on this flow can be translated to results on the minimizer of Eq. 2 in the large $\beta$ regime. Indeed, in Theorem 3.1 below, we show that $NC_1(H)$ monotonically decreases along this flow, which implies that $NC_1(H_{1/\beta}) < NC_1(H_0)$ for large enough $\beta$ (see the statement in Corollary 3.2 below). Note that a flow for an objective that is equivalent to $L(H)$ with $\lambda_W = 0$ and $\lambda_H = 0$ has been studied in (Han et al., 2022), who called it the "central path".
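The closed-form $W^*(H)$ and the gradient of the central-path loss $L(H)$ can be sanity-checked numerically. The sketch below relies on one assumption that is not spelled out in the text above: because $W^*(H)$ minimizes the objective over $W$, the gradient of $L$ is taken to be the partial gradient w.r.t. $H$ evaluated at $W = W^*(H)$ (an envelope argument), i.e., $Kn\nabla L(H) = W^{*\top}(W^*H - Y) + \lambda_H H$. The function names and sizes are hypothetical, and the finite-difference comparison is only a consistency check.

```python
import numpy as np

def optimal_W(H, Y, lam_W):
    """Closed-form W*(H) = Y H^T (H H^T + n*lam_W*I_d)^{-1}."""
    d = H.shape[0]
    n = Y.shape[1] // Y.shape[0]
    return np.linalg.solve(H @ H.T + n * lam_W * np.eye(d), H @ Y.T).T

def central_path_loss(H, Y, lam_W, lam_H):
    """L(H): the objective of Eq. 1 evaluated at W = W*(H)."""
    K, N = Y.shape
    n = N // K
    W = optimal_W(H, Y, lam_W)
    return (np.sum((W @ H - Y) ** 2) / (2 * K * n)
            + lam_W * np.sum(W ** 2) / (2 * K)
            + lam_H * np.sum(H ** 2) / (2 * K * n))

def central_path_grad(H, Y, lam_W, lam_H):
    """Assumed form of grad L(H) via the envelope argument."""
    N = Y.shape[1]                      # N = K * n
    W = optimal_W(H, Y, lam_W)
    return (W.T @ (W @ H - Y) + lam_H * H) / N

d, K, n, lam_W, lam_H = 5, 3, 4, 0.5, 0.05     # illustrative values
rng = np.random.default_rng(0)
Y = np.kron(np.eye(K), np.ones((1, n)))
H = rng.standard_normal((d, K * n))

# Stationarity of the W-subproblem at W*(H).
W = optimal_W(H, Y, lam_W)
grad_W = (W @ H - Y) @ H.T + n * lam_W * W

# Central finite differences of L(H), entry by entry.
G = central_path_grad(H, Y, lam_W, lam_H)
eps = 1e-6
fd = np.zeros_like(H)
for idx in np.ndindex(*H.shape):
    E = np.zeros_like(H)
    E[idx] = eps
    fd[idx] = (central_path_loss(H + E, Y, lam_W, lam_H)
               - central_path_loss(H - E, Y, lam_W, lam_H)) / (2 * eps)
max_err = np.max(np.abs(fd - G))
print(np.abs(grad_W).max(), max_err)
```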
The motivation for studying such an objective, where the optimization variable $W$ is replaced by the optimal $W^*(H)$, comes from the empirical observation in (Han et al., 2022) that the gap $\|W^*(H)H - Y\|_F^2 - \|WH - Y\|_F^2$ is rather small (compared to each term) during the optimization process of practical DNNs. We now state our result for gradient flow on the "central path" (which is proved in Appendix A).

Theorem 3.1. Assume that $\lambda_W > 0$, $\lambda_H \ge 0$, and that $H_0$ is non-collapsed (i.e., $\Sigma_W(H_0) \ne 0$). Then, along the gradient flow, which is stated in Eq. 4, we have that
• $NC_1(H_t)$ strictly decreases along the flow until it reaches zero.
• $t \mapsto e^{2\lambda_H t}\,\mathrm{Tr}(\Sigma_W(H_t))$ decreases along the flow. In particular, when $\lambda_H > 0$, $\mathrm{Tr}(\Sigma_W(H_t))$ decays exponentially.
• $t \mapsto e^{2\lambda_H t}\,\mathrm{Tr}(\Sigma_B(H_t))$ strictly increases along the flow.

Remark. Note that our gradient flow analysis has minimal assumptions. Unlike (Han et al., 2022), our flow does not assume zero global mean ($\bar{h}_G = 0$), $\lambda_W = \lambda_H = 0$, or invertibility of $\Sigma_W$. Most importantly, it does not include any engineered renormalization and projection of the gradient, contrary to the previous work. Thus, it is more similar to practical gradient descent optimization of DNNs. Our unmodified flow and minimal assumptions require a different, and more general, analysis with quite involved computations.

Not only does Theorem 3.1 state a monotonic decrease toward 0 in the NC1 metric, it also provides a separation between the behavior of $\mathrm{Tr}(\Sigma_W)$ and $\mathrm{Tr}(\Sigma_B)$ along the flow. A strict separation is observed for $\lambda_H = 0$: $\mathrm{Tr}(\Sigma_W)$ decreases while $\mathrm{Tr}(\Sigma_B)$ increases. As gradient flow is often used as a proxy for analyzing gradient descent with a small step-size (Elkabetz & Cohen, 2021), if we overlook the difference between optimizing the UFM in Eq. 1 jointly w.r.t. $W$ and $H$ and restricting the optimization to the "central path" $(W^*(H), H)$, then our theory also provides a mathematical reasoning for the experiments on gradient descent in (Tirer & Bruna, 2022) that show monotonic decrease in within-class variability.

Finally, with our interpretation of $t$ as $\beta^{-1}$, the following corollary is a direct consequence of Theorem 3.1 and the continuity of $\nabla L(H)$ (see Appendix B for a formal proof).

Corollary 3.2. Assume that $H_0$ is non-collapsed. Then, there exists some constant $C = C(H_0) > 0$ such that for $\beta > C$ we have that $NC_1(H_{1/\beta}) < NC_1(H_0)$.

Recall that in the large $\beta$ regime we can interpret $H$ as features of a DNN that are deeper than $H_0$ but such that the architecture between $H_0$ and $H$ is extremely simple (e.g., they are features of adjacent layers) and thus the distance between them is constrained. Under this interpretation, Corollary 3.2 implies that layer-wise optimization of a DNN, where each time a new layer is added (so that the previous deepest features $H_{1/\beta}$ are considered as the new $H_0$), will result in a gradual depthwise decrease in NC1. An extension of the model in Eq. 2 that includes multiple levels of optimizable parameters may be able to provide similar reasoning for the gradual depthwise decrease in NC1 that is observed in practical DNN training, where all the layers are optimized simultaneously.
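The behavior stated in Theorem 3.1 can be illustrated with an explicit Euler discretization of the flow in Eq. 4. This is a sketch under assumptions, not the analysis of the paper: the theorem concerns the continuous flow, while the code takes discrete steps $H \leftarrow H - \eta\,(W^{*\top}(W^*H - Y) + \lambda_H H)$ with a small step $\eta$ chosen so that $\eta(\lambda_H + 1/\lambda_W) \le 1$; the update direction uses the envelope form of $Kn\nabla L(H)$; all names and values are hypothetical.

```python
import numpy as np

def optimal_W(H, Y, lam_W):
    """Closed-form W*(H) = Y H^T (H H^T + n*lam_W*I_d)^{-1}."""
    d = H.shape[0]
    n = Y.shape[1] // Y.shape[0]
    return np.linalg.solve(H @ H.T + n * lam_W * np.eye(d), H @ Y.T).T

def covariances(H, K, n):
    """Within-class and between-class covariance matrices of Section 3."""
    d = H.shape[0]
    Hc = H.reshape(d, K, n)                       # columns grouped by class
    means = Hc.mean(axis=2)                       # d x K class means
    g = means.mean(axis=1, keepdims=True)         # global mean (balanced classes)
    Dw = (Hc - means[:, :, None]).reshape(d, K * n)
    return Dw @ Dw.T / (K * n), (means - g) @ (means - g).T / K

def euler_flow(H0, Y, lam_W, lam_H, step, num_steps):
    """Explicit Euler steps on dH/dt = -Kn * grad L(H); records trace statistics."""
    K, N = Y.shape
    n = N // K
    H, tr_W, tr_B = H0.copy(), [], []
    for _ in range(num_steps + 1):
        S_W, S_B = covariances(H, K, n)
        tr_W.append(np.trace(S_W))
        tr_B.append(np.trace(S_B))
        W = optimal_W(H, Y, lam_W)
        H = H - step * (W.T @ (W @ H - Y) + lam_H * H)
    return np.array(tr_W), np.array(tr_B)

d, K, n = 5, 3, 4
lam_W, lam_H, step, num_steps = 0.5, 0.05, 0.1, 200   # step*(lam_H + 1/lam_W) <= 1
rng = np.random.default_rng(0)
Y = np.kron(np.eye(K), np.ones((1, n)))
H0 = rng.standard_normal((d, K * n))                  # non-collapsed input features

tr_W, tr_B = euler_flow(H0, Y, lam_W, lam_H, step, num_steps)
t = step * np.arange(num_steps + 1)
nc1 = tr_W / tr_B                                     # metric of Eq. 3
T_W = np.exp(2 * lam_H * t) * tr_W                    # rescaled within-class trace
T_B = np.exp(2 * lam_H * t) * tr_B                    # rescaled between-class trace
print(nc1[0], nc1[-1], T_W[0], T_W[-1], T_B[0], T_B[-1])
```

Since each Euler step is also a cheap proxy for the implicit update that defines $H_{1/\beta}$ with $1/\beta = \eta$, the first step of this loop gives a rough picture of Corollary 3.2.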

4. ANALYSIS OF THE NEAR-COLLAPSE REGIME

In this section, we will explore the behavior of the minimizers of Eq. 2 in the near-collapse regime. As stated in Corollary 2.2, if $H_0$ is already collapsed then the minimizer of Eq. 2 is also collapsed. This is aligned with the rationale that if we have a DNN that already exhibits collapse at some intermediate layer, we would expect the subsequent layers to maintain this collapse. Essentially, we would like to analyze the minimizer of Eq. 2 for $H_0$ that is not already collapsed. Unfortunately, for general non-collapsed $H_0$ it is not likely that the minimizer is amenable to explicit analytical characterization. Yet, the fact that for orthogonally collapsed $H_0 = H^*$ we get a unique minimizer $(W^*, H^*)$ of Eq. 2, which is still characterized by Theorem 2.1, gives us a desirable setting for examining the minimizer of Eq. 2 obtained for $\tilde{H}_0 = H^* + \delta H_0$ (with sufficiently small $\delta H_0$) by exploiting our knowledge on $(W^*, H^*; H_0 = H^*)$. Analyzing the near-collapse setting will shed light on the way that the deviation from collapse in the input features is transferred to the optimized features, e.g., the amount of interaction within/between classes and the effects of hyperparameters. Such insights can later be examined empirically beyond the near-collapse regime.

Let us denote by $(\tilde{W}^*, \tilde{H}^*)$ the minimizer of $f(W, H; \tilde{H}_0)$. We are interested in studying the dependence of $\delta W := \tilde{W}^* - W^*$ and $\delta H := \tilde{H}^* - H^*$ on $\delta H_0 = \tilde{H}_0 - H^*$ without the requirement of computing $(\tilde{W}^*, \tilde{H}^*)$ (which lack analytical expressions). In particular, our focus is on the relation between the features $\delta H$ and $\delta H_0$ (rather than $\delta W$ and $\delta H_0$), both because a minimizer $\tilde{H}^*$ uniquely implies the associated $\tilde{W}^*$, and because important aspects of NC, such as within-class variability decrease (NC1) and inter-class feature structure (NC2), consider the feature mapping rather than the last layer weights.
We begin with establishing such a result in the following theorem (which is proved in Appendix C) for $H_0$ that is not necessarily a collapsed features matrix. The notation in the theorem is as follows. We use $\mathrm{vec}(\cdot)$ to denote the column-stack vectorization of a matrix. The derivatives are w.r.t. the vectorized matrices $\mathrm{vec}(H)$ and $\mathrm{vec}(W)$. For example, $\nabla_H f \in \mathbb{R}^{dnK \times 1}$ stands for the derivative of $f$ w.r.t. $\mathrm{vec}(H)$, and a second derivative w.r.t. $\mathrm{vec}(W)^\top$ yields $\nabla_W^\top \nabla_H f \in \mathbb{R}^{dnK \times Kd}$.

Theorem 4.1. Let $d \ge K$, and set some $H_0$ and $\delta H_0$. Let $(\hat{W}^*, \hat{H}^*)$ be the minimizer of $f(W, H; H_0)$ (with $f$ stated in Eq. 2). Let $(\tilde{W}^*, \tilde{H}^*)$ be the minimizer of $f(W, H; \tilde{H}_0 = H_0 + \delta H_0)$. Define $\delta W := \tilde{W}^* - \hat{W}^*$ and $\delta H := \tilde{H}^* - \hat{H}^*$. Then, with approximation accuracy of $O(\|\delta H\|^2, \|\delta W\|^2, \|\delta H_0\|^2)$, we have that

\[ \mathrm{vec}(\delta H) \approx \frac{\beta}{Kn}\left( \nabla_H^\top \nabla_H f - \nabla_W^\top \nabla_H f \,(\nabla_W^\top \nabla_W f)^{-1}\, \nabla_H^\top \nabla_W f \right)^{-1} \mathrm{vec}(\delta H_0), \]

\[ \mathrm{vec}(\delta W) \approx -\frac{\beta}{Kn}(\nabla_W^\top \nabla_W f)^{-1}\, \nabla_H^\top \nabla_W f \left( \nabla_H^\top \nabla_H f - \nabla_W^\top \nabla_H f \,(\nabla_W^\top \nabla_W f)^{-1}\, \nabla_H^\top \nabla_W f \right)^{-1} \mathrm{vec}(\delta H_0), \]

where all the derivatives are evaluated at the point $(\hat{W}^*, \hat{H}^*; H_0)$. In particular, for $\beta \gg \max\{1, \lambda_H\}$ we have (with additional approximation error of $O(\beta^{-2})$)

\[ \mathrm{vec}(\delta H) \approx \left( I_{dnK} - \frac{\lambda_H}{\beta} I_{dnK} - \frac{1}{\beta} I_{nK} \otimes \hat{W}^{*\top}\hat{W}^* + \frac{1}{\beta} Z^* \right) \mathrm{vec}(\delta H_0), \tag{5} \]

where

\[ Z^* := (E^{*\top} + \hat{H}^* \otimes \hat{W}^*)^\top \left( \hat{H}^*\hat{H}^{*\top} \otimes I_K + n\lambda_W I_{dK} \right)^{-1} (E^{*\top} + \hat{H}^* \otimes \hat{W}^*), \]

\[ E^* := \left[ \mathrm{vec}(e_{d,1}e_{K,1}^\top(\hat{W}^*\hat{H}^* - Y)), \ldots, \mathrm{vec}(e_{d,1}e_{K,K}^\top(\hat{W}^*\hat{H}^* - Y)), \mathrm{vec}(e_{d,2}e_{K,1}^\top(\hat{W}^*\hat{H}^* - Y)), \ldots, \mathrm{vec}(e_{d,d}e_{K,K}^\top(\hat{W}^*\hat{H}^* - Y)) \right], \]

and $e_{d,i}$ is the standard vector in $\mathbb{R}^d$ with 1 in its $i$th entry (a similar definition holds for $e_{K,k}$).

Observe that, assuming small approximation error, Theorem 4.1 states the linear operation that transforms $\delta H_0$ to $\delta H$. We will focus on the large $\beta$ regime that is stated in Eq. 5, where the matrix inversion can be well approximated. Furthermore, due to the vectorization operation, observe that the linear expression $\mathrm{vec}(\delta H) \approx F\,\mathrm{vec}(\delta H_0)$ has the following block-based representation

\[ \begin{bmatrix} \mathrm{vec}(\delta H^{(1)}) \\ \vdots \\ \mathrm{vec}(\delta H^{(K)}) \end{bmatrix} \approx \begin{bmatrix} F_{1,1} & \cdots & F_{1,K} \\ \vdots & & \vdots \\ F_{K,1} & \cdots & F_{K,K} \end{bmatrix} \begin{bmatrix} \mathrm{vec}(\delta H_0^{(1)}) \\ \vdots \\ \mathrm{vec}(\delta H_0^{(K)}) \end{bmatrix}, \tag{6} \]

where $\delta H^{(k)} := \delta H[:, n(k-1)+1 : nk] \in \mathbb{R}^{d \times n}$ is the sub-matrix of $\delta H$ that is composed of the columns associated with the $k$th class (and similarly for $\delta H_0$). Namely, $F \in \mathbb{R}^{dnK \times dnK}$ is composed of blocks of size $dn \times dn$. The diagonal blocks are the "intra-class blocks". Each of them shows the effect of perturbation in a certain class in $H_0$ on the features of the same class in $H$. The off-diagonal blocks are the "inter-class blocks". Each of them shows the effect of perturbation in a certain class in $H_0$ on the features of another class in $H$.

Recall that for $H_0 = H^*$ that is already exactly collapsed, the minimizer of $f(\cdot\,; H_0)$ is also collapsed, so $\hat{H}^* = H^*$ in the above theorem. Importantly, in this case the matrix in Eq. 6 transforms deviation from exact collapse in the input features to deviation from exact collapse in the optimized features. Thus, stronger attenuation behavior of the blocks of $F$ (e.g., small singular values) implies that the minimizer $\tilde{H}^*$ is closer to exact collapse. Based on specializing Theorem 4.1 to the near-collapse case, we present in the following theorem (which is proved in Appendix D) an exact analysis of singular values of the blocks of $F$. (The notations $\sigma_{\max}(\cdot)$ and $\sigma_{\min}(\cdot)$ stand for the largest and smallest singular values of a matrix, respectively.)

Theorem 4.2. Consider the setting of Theorem 4.1, $\sqrt{\lambda_H \lambda_W} < 1$ (assumed in Theorem 2.1), $d > K$, $\beta \gg \max\{1, \lambda_H\}$, and the representation of Eq. 5 that is given in Eq. 6. Let $H_0$ be a collapsed features matrix (minimizer of Eq. 1 for the same $\lambda_H, \lambda_W$ as in Eq. 2). Then, for $k, \tilde{k} \in [K]$ with $k \ne \tilde{k}$ we have that $F_{k,k}$ is full rank, $F_{k,\tilde{k}}$ is rank-1,

\[ \sigma_{\max}(F_{k,k}) = 1, \qquad \sigma_{\min}(F_{k,k}) = 1 - \beta^{-1}\sqrt{\lambda_H/\lambda_W}, \qquad \sigma_{\max}(F_{k,\tilde{k}}) = 2\beta^{-1}\lambda_H\left(1 - \sqrt{\lambda_H \lambda_W}\right). \]

From Theorem 4.2 we gain the following insights on the minimizer of Eq. 2 in the near-collapse and large $\beta$ regime. First, observe that not only do exactly collapsed minimizers have orthogonal features for different classes, but also in the near-collapse setting an intra-class block is much more dominant than each inter-class block, as follows from $F_{k,\tilde{k}}$ being rank-1 and $\sigma_{\max}(F_{k,\tilde{k}}) \ll \sigma_{\min}(F_{k,k})$. For generic perturbations that do not concentrate in specific low-dimensional subspaces this implies that also before/near pure collapse, the deviation from collapse in the features of a certain class is mainly due to deviation from collapse of input (preceding) features of the same class and not those of the $K - 1$ other classes. (See Appendix D.1 for more details, and note that this also implies preservation of per-class near-collapse.) Second, we see that the feature mapping regularization plays the major role in approaching (near-)collapse behavior. Indeed, increasing $\lambda_H$ decreases the spectral values of the (more dominant) intra-class blocks $\{F_{k,k}\}$ (contrary to increasing $\lambda_W$). Recall that reducing the singular values of the blocks of $F$ implies reducing the distance of the minimizer $\tilde{H}^*$ from exact collapse. Third, our result on the inter-class blocks $\{F_{k,\tilde{k}}\}_{\tilde{k} \ne k}$ hints that the regularization of the last layer's weights (determined by $\lambda_W > 0$) may still have a supportive effect on reaching (near-)collapse behavior by reducing the component of the deviation from collapse that is due to "crosstalk"/interference of features of different classes (e.g., when some classes are harder to classify than others). In the sequel, we show that the above observations correlate with the NC behavior in practical settings.
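The singular values in Theorem 4.2 can be probed numerically on a small instance by assembling the Hessian blocks of $Kn \cdot f$ at a collapsed minimizer and forming $F$ exactly as in the first display of Theorem 4.1 (without the large-$\beta$ expansion). The sketch below makes several assumptions that are not taken from the paper: the collapsed pair uses $\rho = (1 - \sqrt{\lambda_H\lambda_W})\sqrt{\lambda_W/\lambda_H}$ (derived from stationarity of Eq. 1), the Hessian blocks are written out by hand for column-stacked vectorization, and the sizes and variable names are hypothetical. Because the theorem is stated up to $O(\beta^{-2})$, the comparison is approximate.

```python
import numpy as np

d, K, n = 5, 3, 2                      # requires d > K and n >= 2
lam_W, lam_H, beta = 0.4, 0.1, 1e3     # beta >> max{1, lam_H}
rng = np.random.default_rng(0)
N = K * n

# Collapsed minimizer of Eq. 1 (assumed closed form).
R, _ = np.linalg.qr(rng.standard_normal((d, K)))
c = np.sqrt(lam_H * lam_W)
rho = (1 - c) * np.sqrt(lam_W / lam_H)
H = np.sqrt(rho) * np.kron(R, np.ones((1, n)))
W = np.sqrt(lam_H / lam_W) * np.sqrt(rho) * R.T
Y = np.kron(np.eye(K), np.ones((1, n)))
E = W @ H - Y                          # residual at the minimizer

# Hessian blocks of Kn*f w.r.t. column-stacked vec(W) and vec(H).
H_HH = np.kron(np.eye(N), W.T @ W) + (lam_H + beta) * np.eye(d * N)
H_WW = np.kron(H @ H.T, np.eye(K)) + n * lam_W * np.eye(d * K)
# Residual part of the cross block: entry [(k,l),(m,i)] = E[k,i] * delta(l,m),
# with row index l*K + k and column index i*d + m (column-major vec).
E_T = np.einsum('ki,lm->lkim', E, np.eye(d)).reshape(d * K, N * d)
H_WH = E_T + np.kron(H, W)

# Schur complement and the linear map F: vec(dH0) -> vec(dH).
S = H_HH - H_WH.T @ np.linalg.solve(H_WW, H_WH)
F = beta * np.linalg.inv(S)

def block(k, kt):
    """dn x dn block of F mapping class-kt input perturbations to class k."""
    return F[k * d * n:(k + 1) * d * n, kt * d * n:(kt + 1) * d * n]

sv_intra = np.linalg.svd(block(0, 0), compute_uv=False)
sv_inter = np.linalg.svd(block(0, 1), compute_uv=False)

pred_min_intra = 1 - np.sqrt(lam_H / lam_W) / beta
pred_max_inter = 2 * lam_H * (1 - c) / beta
print(sv_intra[0], sv_intra[-1], pred_min_intra)
print(sv_inter[0], sv_inter[1], pred_max_inter)
```

The second inter-class singular value should come out much smaller than the first, which is the numerical counterpart of the rank-1 statement at this order of approximation.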

5. EXPERIMENTS

In this section, we translate the insights that are obtained for the model in Eq. 2 to what is observed with practical DNNs and datasets. We evaluate the distance of a DNN's features from exact NC using metrics that have also been used in previous works. Despite defining the metric $NC_1$ in Eq. 3, here we mainly measure within-class variability using

\[ \widetilde{NC}_1 := \frac{1}{K}\mathrm{Tr}\left(\Sigma_W \Sigma_B^\dagger\right), \]

where we use the definitions of Section 3. (We use this metric due to its popularity even though it is less amenable for theoretical analysis.) We measure the structure of the features using

\[ NC_2 := \left\| \frac{(\overline{H} - \bar{h}_G 1_K^\top)^\top(\overline{H} - \bar{h}_G 1_K^\top)}{\|(\overline{H} - \bar{h}_G 1_K^\top)^\top(\overline{H} - \bar{h}_G 1_K^\top)\|_F} - \frac{1}{\sqrt{K-1}}\left(I_K - \frac{1}{K}1_K 1_K^\top\right) \right\|_F, \]

where the simplex ETF is normalized to unit Frobenius norm.

Here we show the depthwise decrease in within-class variability (discussed in Section 3) also for layer-wise training, which is better represented by our model. We consider the CIFAR-10 dataset and train an MLP with 1 to 10 hidden layers and a final classification layer. Each time, we add and train a hidden layer on top of the previous hidden layers, which are kept fixed. Then we compute the NC1 metrics for the deepest features. Due to space limitation, the experimental details are deferred to Appendix E.1. Figure 2 demonstrates a decrease in both $NC_1$ and $\widetilde{NC}_1$ as we add more hidden layers on top of the previous ones, which are kept fixed. Note that our theory justifies such a decrease for all the layers (the features are not required to be near collapse).

Next, we turn to demonstrate correlation of practical NC behavior with the insight gained in Section 4 that $\lambda_H$ plays a bigger role than $\lambda_W$ does in approaching NC. Based on the equivalence of $L_2$-regularization with weight decay (WD) in gradient-based methods, we can make the analogy of regularizing $H$ in Eq. 2 to WD of the weights of practical DNNs in the feature mapping layers (i.e., excluding the last layer's weights). Importantly, note that this analogy is empirically justified for plain UFMs in (Zhu et al., 2021). Under this analogy, our analysis suggests that, as long as entering the zero training error phase of training is maintained, increasing (resp. decreasing) the WD in the feature mapping layers should decrease (resp. increase) the distance from exact collapse more than increasing (resp. decreasing) the WD in the classification layer. Indeed, we empirically show this behavior below. (More experiments are presented in Appendix E.2.) We note that there exists a work that empirically shows that WD facilitates collapse (Rangamani & Banburski-Fahey, 2022); however, it does not examine the WD in feature mapping and classification layers separately.

We consider the CIFAR-10 dataset and examine how modifying the regularization hyperparameters affects the NC behavior of the widely used ResNet18 (He et al., 2016a) compared to a baseline setting. Specifically, as a baseline hyperparameter setting, we consider one that is used in previous works (Papyan et al., 2020; Zhu et al., 2021): default PyTorch initialization of the weights, SGD optimizer with learning rate 0.05 that is divided by 10 every 40 epochs, momentum of 0.9, and WD of 5e-4 for all the network's parameters. The modifications include: 1) doubling the WD only for the last (FC) layer; 2) doubling the WD only for feature mapping (conv) layers; 3) zeroing the WD for the last layer; and 4) zeroing the WD for feature mapping layers. Figure 3 presents the NC1 and NC2 metrics of the (deepest) features for: (Top) MSE loss with no bias in the FC layer (similar to the analyzed model); and (Bottom) CE loss with bias in the FC layer. In all the settings, we reach zero training error at approximately epoch 40.

The empirical results show that modifying the WD in the feature mapping layers leads to curves with larger deviations from the baseline compared to modifying the last layer's WD, which is aligned with the theory established in Section 4 (i.e., the important role of $\lambda_H$ in attenuating the dominant intra-class perturbations). Reducing (zeroing) the WD in the feature mapping increases the distance from exact NC (i.e., from 0 value of the metrics), while increasing the WD decreases the gap from exact NC, as the theory predicts. The fact that sometimes (e.g., with CE loss) increasing the WD of the last layer can also decrease the gap from collapse hints that mitigating inter-class interference/correlation of features in practical deep learning settings is more significant for reaching NC than in our analysis that considers a near-collapse regime. Yet, both the experiments and the theoretical study show that the regularization of the feature mapping has larger significance in approaching NC.
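The metrics used in this section can be computed directly from a features matrix whose columns are grouped by class. The following is a minimal sketch of $NC_1$ (Eq. 3), $\widetilde{NC}_1$, and $NC_2$ as defined above, assuming balanced classes and class-major column ordering. The function names, the `rcond` cutoff passed to the pseudoinverse, and the synthetic features are illustrative choices, not the evaluation code behind the reported figures.

```python
import numpy as np

def class_stats(H, K, n):
    """Return Sigma_W, Sigma_B and the mean-subtracted class means (d x K)."""
    d = H.shape[0]
    Hc = H.reshape(d, K, n)                        # columns grouped by class
    means = Hc.mean(axis=2)
    M = means - means.mean(axis=1, keepdims=True)  # subtract the global mean
    Dw = (Hc - means[:, :, None]).reshape(d, K * n)
    return Dw @ Dw.T / (K * n), M @ M.T / K, M

def nc1_trace_ratio(H, K, n):
    """NC_1 of Eq. 3: Tr(Sigma_W) / Tr(Sigma_B)."""
    S_W, S_B, _ = class_stats(H, K, n)
    return np.trace(S_W) / np.trace(S_B)

def nc1_pinv(H, K, n, rcond=1e-10):
    """(1/K) Tr(Sigma_W pinv(Sigma_B)); rcond is an illustrative cutoff."""
    S_W, S_B, _ = class_stats(H, K, n)
    return np.trace(S_W @ np.linalg.pinv(S_B, rcond=rcond)) / K

def nc2(H, K, n):
    """Frobenius distance of the normalized Gram matrix from the simplex ETF."""
    _, _, M = class_stats(H, K, n)
    G = M.T @ M
    etf = (np.eye(K) - np.ones((K, K)) / K) / np.sqrt(K - 1)
    return np.linalg.norm(G / np.linalg.norm(G) - etf)

d, K, n = 6, 3, 4
rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.standard_normal((d, K)))
H_collapsed = np.kron(R, np.ones((1, n)))          # exactly collapsed, orthogonal means
H_noisy = H_collapsed + 0.3 * rng.standard_normal((d, K * n))

for name, H in [("collapsed", H_collapsed), ("noisy", H_noisy)]:
    print(name, nc1_trace_ratio(H, K, n), nc1_pinv(H, K, n), nc2(H, K, n))
```

For the exactly collapsed features all three metrics are numerically zero, while the noisy features give strictly positive values.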

6. CONCLUSION

The features that are learned by training practical networks on real world datasets typically do not reach exact NC. In this paper, we addressed this issue by studying a model that can force the features to stay in the vicinity of a predefined features matrix. We analyzed it for the small vicinity case and established results that cannot be obtained by the previously studied (idealized) UFMs. We proved reduction in within-class variability of the optimized features compared to the input features (via analyzing gradient flow along the "central-path" of a UFM with minimal assumptions, unlike existing literature). We also presented an analysis of the model's minimizer in the near-collapse regime that provides insights on the effect of the regularization hyperparameters on the closeness to collapse, which correlate with the behavior in practical deep learning settings. We believe that our perturbation analysis approach, which is based on exploiting our knowledge on exactly collapsed minimizers of UFMs for studying non-collapse cases, can be applied to models other than the one considered in this paper, such as models with different loss functions and/or multiple levels of features and/or imbalanced data.

A PROOF OF THEOREM 3.1

To prove Theorem 3.1, in addition to the within-class and between-class covariance matrices, let us define the total covariance matrix (across all classes) of the non-centered features ΣT (H) := 1 Kn K k=1 n i=1 h k,i h ⊤ k,i . For convenience we also define the non-centered between-class covariance matrix ΣB (H) := 1 K k=1 h k h ⊤ k . We have the decomposition ΣT (H) = Σ W (H) + ΣB (H). Using YH ⊤ = (I K ⊗ 1 ⊤ n )H ⊤ = nH ⊤ and ΣT = 1 Kn HH ⊤ , we have that for each feature matrix H, the optimal weight matrix W * (H) is given by W * (H) = 1 K H ⊤ ( 1 Kn HH ⊤ + λ W K I) -1 = 1 K H ⊤ ( ΣT + λ W K I) -1 . Next, let us simplify the terms with W * (H) in L(H): L(H) := 1 2Kn ∥W * (H)H -Y∥ 2 F + λ W 2K ∥W * (H)∥ 2 F + λ H 2Kn ∥H∥ 2 F . For the first term in L(H), observe that 1 2Kn ∥W * (H)H -Y∥ 2 F = 1 2Kn Tr W * (H)HH ⊤ W * (H) ⊤ - 1 Kn Tr W * (H)HY ⊤ + 1 2 = 1 2K Tr ( ΣT + λ W K I) -1 ΣT ( ΣT + λ W K I) -1 ΣB - 1 K Tr ( ΣT + λ W K I) -1 ΣB + 1 2 = - λ W 2K 2 Tr ( ΣT + λ W K I) -2 ΣB - 1 2K Tr ( ΣT + λ W K I) -1 ΣB + 1 2 , where in the second equality we used ΣB = 1 K HH ⊤ , and in the last equality we used ( ΣT + λ W K I) -1 ΣT = I -λ W K ( ΣT + λ W K I) -1 . For the second term in L(H), observe that λ W 2K ∥W * (H)∥ 2 F = λ W 2K Tr W * (H)W * (H) ⊤ = λ W 2K 2 Tr ( ΣT + λ W K I) -2 ΣB . Adding the two terms together, 1 2Kn ∥W * (H)H -Y∥ 2 F + λ W 2K ∥W * (H)∥ 2 F = - 1 2K Tr ( ΣT + λ W K I) -1 ΣB + 1 2 = 1 2K Tr ( ΣT + λ W K I) -1 (Σ W + λ W K I) - d -K 2K , where we used ( ΣT + λ W K I) -1 ΣB = I -( ΣT + λ W K I) -1 (Σ W + λ W K I). Finally, for the third term in L(H) we have λ H 2Kn ∥H∥ 2 F = λ H 2 Tr ΣT . To conclude L(H) = 1 2K Tr (Σ W + λ W K I)( ΣT + λ W K I) -1 + λ H 2 Tr ΣT - d -K 2K . Next, we are going to analyze the traces of dΣ B dt , dΣ W dt , and d ΣT dt , along the flow that is stated in Eq. 4, which is repeated here for the convenience of the reader: dH t dt = -Kn∇L(H t ). In the following lemma, we state the required derivatives. Lemma A.1. 
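Denote $C_B := \Sigma_B(\tilde{\Sigma}_T + \frac{\lambda_W}{K}I)^{-1}$, $\tilde{C}_B := \tilde{\Sigma}_B(\tilde{\Sigma}_T + \frac{\lambda_W}{K}I)^{-1}$ and $C_W := \Sigma_W(\tilde{\Sigma}_T + \frac{\lambda_W}{K}I)^{-1}$. Along the gradient flow we have
$$\begin{aligned}
\frac{d\Sigma_B}{dt} &= \frac{1}{K}\Big(C_B(I - \tilde{C}_B) + (I - \tilde{C}_B^\top)C_B^\top\Big) - 2\lambda_H\Sigma_B,\\
\frac{d\Sigma_W}{dt} &= -\frac{1}{K}\Big(C_W\tilde{C}_B + \tilde{C}_B^\top C_W^\top\Big) - 2\lambda_H\Sigma_W,\\
\frac{d\tilde{\Sigma}_T}{dt} &= \frac{1}{K}\Big((I - \tilde{C}_B - C_W)\tilde{C}_B + \tilde{C}_B^\top(I - \tilde{C}_B^\top - C_W^\top)\Big) - 2\lambda_H\tilde{\Sigma}_T.
\end{aligned}$$

Proof. We use the notation $\partial_{kjl}$ to denote the derivative w.r.t. the $l$th entry of $h_{k,j}$. Then
$$\begin{aligned}
\partial_{kjl}\Sigma_B &= \frac{1}{Kn}\big(e_l(\bar{h}_k - h_G)^\top + (\bar{h}_k - h_G)e_l^\top\big),\\
\partial_{kjl}\Sigma_W &= \frac{1}{Kn}\big(e_l(h_{k,j} - \bar{h}_k)^\top + (h_{k,j} - \bar{h}_k)e_l^\top\big),\\
\partial_{kjl}\tilde{\Sigma}_T &= \frac{1}{Kn}\big(e_l h_{k,j}^\top + h_{k,j}e_l^\top\big),
\end{aligned}$$
where $e_l \in \mathbb{R}^d$ is the one-hot vector whose $l$th entry is one (i.e., a standard basis vector). By the product rule,
$$\begin{aligned}
\partial_{kjl}L(H)
&= \frac{1}{2K}\operatorname{Tr}\Big((\partial_{kjl}\Sigma_W)\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\Big) + \frac{1}{2K}\operatorname{Tr}\Big(\big(\Sigma_W + \tfrac{\lambda_W}{K}I\big)\,\partial_{kjl}\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\Big) + \frac{\lambda_H}{Kn}e_l^\top h_{k,j}\\
&= \frac{1}{2K}\operatorname{Tr}\Big((\partial_{kjl}\Sigma_W)\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\Big) - \frac{1}{2K}\operatorname{Tr}\Big(\big(\Sigma_W + \tfrac{\lambda_W}{K}I\big)\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}(\partial_{kjl}\tilde{\Sigma}_T)\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\Big) + \frac{\lambda_H}{Kn}e_l^\top h_{k,j}\\
&= \frac{1}{K^2n}\Big[\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}(h_{k,j} - \bar{h}_k) - \big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\big(\Sigma_W + \tfrac{\lambda_W}{K}I\big)\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}h_{k,j} + \lambda_H K h_{k,j}\Big]^\top e_l .
\end{aligned}$$
Therefore, the gradient of $L$ is given by
$$\nabla L(H) = \frac{1}{K^2n}\Big[\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}(H - \bar{H}\otimes 1_n^\top) - \big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\big(\Sigma_W + \tfrac{\lambda_W}{K}I\big)\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}H + \lambda_H K H\Big]. \tag{7}$$
We further denote $C := (\tilde{\Sigma}_T + \frac{\lambda_W}{K}I)^{-1}$, so that $C_B = \Sigma_B C$, $\tilde{C}_B = \tilde{\Sigma}_B C$ and $C_W = \Sigma_W C$, and write
$$\partial_{kjl}L(H) = \langle L_{kj}, e_l\rangle, \qquad L_{kj} = \frac{1}{K^2n}\Big[C(h_{k,j} - \bar{h}_k) - (I - \tilde{C}_B^\top)C h_{k,j} + \lambda_H K h_{k,j}\Big].$$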
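Using the chain rule, with $\Sigma_B(a,b) = e_a^\top\Sigma_B e_b$ denoting the $(a,b)$-th entry of $\Sigma_B$, we have that
$$\begin{aligned}
\frac{d\Sigma_B(a,b)}{dt}
&= \sum_{k,j,l}\partial_{kjl}\Sigma_B(a,b)\,\frac{dh_{k,j}[l]}{dt}
= \sum_{k,j,l}\partial_{kjl}\Sigma_B(a,b)\big(-Kn\,\partial_{kjl}L(H)\big)\\
&= -\sum_{k,j}\sum_{l}\Big(\langle e_a, e_l\rangle\langle e_b, \bar{h}_k - h_G\rangle + \langle e_a, \bar{h}_k - h_G\rangle\langle e_l, e_b\rangle\Big)\langle e_l, L_{kj}\rangle\\
&= -\sum_{k,j}\Big(\langle e_a, L_{kj}\rangle\langle e_b, \bar{h}_k - h_G\rangle + \langle e_a, \bar{h}_k - h_G\rangle\langle L_{kj}, e_b\rangle\Big)\\
&= e_a^\top\Big(\sum_{k,j} -L_{kj}(\bar{h}_k - h_G)^\top - (\bar{h}_k - h_G)L_{kj}^\top\Big)e_b\\
&= \frac{1}{K}e_a^\top\Big(C_B(I - \tilde{C}_B) + (I - \tilde{C}_B^\top)C_B^\top\Big)e_b - 2\lambda_H e_a^\top\Sigma_B e_b .
\end{aligned}$$
A similar computation yields
$$\begin{aligned}
\frac{d\Sigma_W(a,b)}{dt} &= -\frac{1}{K}e_a^\top\Big(C_W\tilde{C}_B + \tilde{C}_B^\top C_W^\top\Big)e_b - 2\lambda_H e_a^\top\Sigma_W e_b,\\
\frac{d\tilde{\Sigma}_T(a,b)}{dt} &= \frac{1}{K}e_a^\top\Big((I - \tilde{C}_B - C_W)\tilde{C}_B + \tilde{C}_B^\top(I - \tilde{C}_B^\top - C_W^\top)\Big)e_b - 2\lambda_H e_a^\top\tilde{\Sigma}_T e_b .
\end{aligned}$$

Let $T_B : t \mapsto e^{2\lambda_H t}\operatorname{Tr}(\Sigma_B)$ and $T_W : t \mapsto e^{2\lambda_H t}\operatorname{Tr}(\Sigma_W)$. The above lemma suggests that $T_B$ strictly increases along the flow, while $T_W$ decreases. Indeed,
$$\begin{aligned}
\frac{dT_W}{dt} &= e^{2\lambda_H t}\Big(\frac{d\operatorname{Tr}(\Sigma_W)}{dt} + 2\lambda_H\operatorname{Tr}(\Sigma_W)\Big) = -\frac{2}{K}e^{2\lambda_H t}\operatorname{Tr}(C_W\tilde{C}_B)\\
&= -\frac{2}{K}e^{2\lambda_H t}\operatorname{Tr}\Big(\Sigma_W\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\tilde{\Sigma}_B\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\Big) \le 0 .
\end{aligned}$$
The last inequality holds because the trace of the product of two positive semidefinite matrices is always non-negative (e.g., by von Neumann's trace inequality). Similarly,
$$\begin{aligned}
\frac{dT_B}{dt} &= \frac{2}{K}e^{2\lambda_H t}\operatorname{Tr}\big(C_B(I - \tilde{C}_B)\big)\\
&= \frac{2}{K}e^{2\lambda_H t}\operatorname{Tr}\Big(\Sigma_B\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\Big(I - \tilde{\Sigma}_B\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\Big)\Big)\\
&= \frac{2}{K}e^{2\lambda_H t}\operatorname{Tr}\Big(\Sigma_B\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\big(\Sigma_W + \tfrac{\lambda_W}{K}I\big)\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\Big)\\
&= \frac{2}{K}e^{2\lambda_H t}\Big[\operatorname{Tr}\Big(\Sigma_B\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\Sigma_W\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\Big) + \frac{\lambda_W}{K}\operatorname{Tr}\Big(\Sigma_B\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-2}\Big)\Big]\\
&\ge \frac{2\lambda_W}{K^2}e^{2\lambda_H t}\operatorname{Tr}\Big(\Sigma_B\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-2}\Big) > 0,
\end{aligned}$$
where the strict inequality again comes from von Neumann's trace inequality, which ensures that the trace of the product of a positive definite matrix and a non-zero positive semidefinite matrix is positive. Since $\widetilde{NC}_1 = \operatorname{Tr}(\Sigma_W)/\operatorname{Tr}(\Sigma_B) = T_W/T_B$, the above computation also shows that $\widetilde{NC}_1$ has to strictly decrease along the flow.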
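The identities used in this proof can be sanity-checked numerically. The following is a minimal sketch, not the authors' code: it assumes the columns of $H$ are grouped class by class, uses small illustrative values for $K$, $n$, $d$, $\lambda_W$, $\lambda_H$, and replaces the continuous flow of Eq. 4 by explicit Euler steps with a hypothetical step size `dt`. All function names are ours.

```python
import numpy as np

def stats(H, K, n):
    """Class means and the covariance matrices used in the proof."""
    d = H.shape[0]
    Hk = H.reshape(d, K, n)                      # columns grouped class by class
    means = Hk.mean(axis=2)                      # d x K matrix of class means
    g = means.mean(axis=1, keepdims=True)        # global mean
    centered = (Hk - means[:, :, None]).reshape(d, K * n)
    Sw = centered @ centered.T / (K * n)         # within-class covariance
    Sb = (means - g) @ (means - g).T / K         # centered between-class covariance
    St = H @ H.T / (K * n)                       # non-centered total covariance
    return means, Sw, Sb, St

def loss_direct(H, K, n, lw, lh):
    """L(H) evaluated by plugging in the optimal W*(H) explicitly."""
    d = H.shape[0]
    Y = np.kron(np.eye(K), np.ones((1, n)))
    M = H @ H.T / (K * n) + lw / K * np.eye(d)
    W = (Y @ H.T / (K * n)) @ np.linalg.inv(M)
    return (np.sum((W @ H - Y) ** 2) / (2 * K * n)
            + lw * np.sum(W ** 2) / (2 * K)
            + lh * np.sum(H ** 2) / (2 * K * n))

def loss_closed(H, K, n, lw, lh):
    """Closed-form expression for L(H) derived above."""
    d = H.shape[0]
    _, Sw, _, St = stats(H, K, n)
    A = Sw + lw / K * np.eye(d)
    Minv = np.linalg.inv(St + lw / K * np.eye(d))
    return np.trace(A @ Minv) / (2 * K) + lh / 2 * np.trace(St) - (d - K) / (2 * K)

def grad(H, K, n, lw, lh):
    """Gradient of L as in Eq. 7."""
    d = H.shape[0]
    means, Sw, _, St = stats(H, K, n)
    A = Sw + lw / K * np.eye(d)
    Minv = np.linalg.inv(St + lw / K * np.eye(d))
    Hbar_rep = np.repeat(means, n, axis=1)       # \bar{H} ⊗ 1_n^T
    return (Minv @ (H - Hbar_rep) - Minv @ A @ Minv @ H + lh * K * H) / (K ** 2 * n)

def trace_ratio(H, K, n):
    """Tr(Sigma_W) / Tr(Sigma_B), i.e. T_W / T_B."""
    _, Sw, Sb, _ = stats(H, K, n)
    return np.trace(Sw) / np.trace(Sb)

def euler_flow(H0, K, n, lw, lh, dt=1e-3, steps=200):
    """Explicit Euler discretization of dH/dt = -K n grad L(H) (an approximation)."""
    H = H0.copy()
    ratios = [trace_ratio(H, K, n)]
    for _ in range(steps):
        H = H - dt * K * n * grad(H, K, n, lw, lh)
        ratios.append(trace_ratio(H, K, n))
    return H, np.array(ratios)

rng = np.random.default_rng(0)
K, n, d = 3, 5, 6                                # illustrative sizes only
H0 = rng.normal(size=(d, K * n))
_, ratios = euler_flow(H0, K, n, lw=1.0, lh=0.01)
print("trace ratio at start and end:", ratios[0], ratios[-1])
```

For a sufficiently small step size, the recorded ratio should decrease at every step, in line with the statement proved above; a coarse step can break monotonicity because the discretization is only an approximation of the flow.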

B PROOF OF COROLLARY 3.2

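Recall that the minimizer $H_{1/\beta}$ satisfies the first order equation
$$H_{1/\beta} - H_0 = -\frac{Kn}{\beta}\nabla L(H_{1/\beta}). \tag{8}$$
We first show that $H_{1/\beta} \to H_0$ as $\beta \to \infty$. The following lemma will be helpful.

Lemma B.1. There exists a constant $M > 0$ independent of $H$, such that $\|\nabla L(H)\|_F \le M\|H\|_F$ for any $H \in \mathbb{R}^{d\times Kn}$.

Proof. We bound each term in the expression of $\nabla L$ in Eq. 7 individually. For the first term we have
$$\Big\|\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}(H - \bar{H}\otimes 1_n^\top)\Big\|_F \le \Big\|\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\Big\|_{op}\|H - \bar{H}\otimes 1_n^\top\|_F \le \frac{K}{\lambda_W}\|H - \bar{H}\otimes 1_n^\top\|_F \le \frac{2K}{\lambda_W}\|H\|_F,$$
where $\|\cdot\|_{op}$ denotes the operator norm and the second inequality is due to the fact that each eigenvalue of $(\tilde{\Sigma}_T + \frac{\lambda_W}{K}I)^{-1}$ is no bigger than $\frac{K}{\lambda_W}$. Similarly,
$$\Big\|\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}\big(\Sigma_W + \tfrac{\lambda_W}{K}I\big)\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1}H\Big\|_F \le \frac{K}{\lambda_W}\Big\|\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1/2}\big(\Sigma_W + \tfrac{\lambda_W}{K}I\big)\big(\tilde{\Sigma}_T + \tfrac{\lambda_W}{K}I\big)^{-1/2}\Big\|_{op}\|H\|_F,$$
where we used $\|(\tilde{\Sigma}_T + \frac{\lambda_W}{K}I)^{-1/2}\|_{op} \le \sqrt{K/\lambda_W}$, since every eigenvalue of $(\tilde{\Sigma}_T + \frac{\lambda_W}{K}I)^{-1/2}$ is bounded by $\sqrt{K/\lambda_W}$. Denote $A = \Sigma_W + \frac{\lambda_W}{K}I$ and $B = \tilde{\Sigma}_T + \frac{\lambda_W}{K}I$. Using $A + \tilde{\Sigma}_B = B$, we have
$$\begin{aligned}
\|B^{-1/2}AB^{-1/2}\|_{op} &= \|(B^{-1/2}A^{1/2})(B^{-1/2}A^{1/2})^\top\|_{op} = \|(B^{-1/2}A^{1/2})^\top(B^{-1/2}A^{1/2})\|_{op} = \|A^{1/2}B^{-1}A^{1/2}\|_{op}\\
&= \|(A^{-1/2}(A + \tilde{\Sigma}_B)A^{-1/2})^{-1}\|_{op} = \|(I + A^{-1/2}\tilde{\Sigma}_B A^{-1/2})^{-1}\|_{op} \le 1 .
\end{aligned}$$
Combining the above bounds, we obtain for any $H \in \mathbb{R}^{d\times Kn}$,
$$\|\nabla L(H)\|_F \le \frac{1}{Kn}\Big(\frac{3}{\lambda_W} + \lambda_H\Big)\|H\|_F .$$

Next, we combine the lemma and the stationary equation Eq. 8 to get
$$\|H_{1/\beta} - H_0\|_F \le \frac{nKM}{\beta}\|H_{1/\beta}\|_F \le \frac{nKM}{\beta}\|H_{1/\beta} - H_0\|_F + \frac{nKM}{\beta}\|H_0\|_F .$$
Rearranging (for $\beta > nKM$), we have the bound
$$\|H_{1/\beta} - H_0\|_F \le \Big(\frac{\beta}{nKM} - 1\Big)^{-1}\|H_0\|_F .$$
This implies that $H_{1/\beta} \to H_0$ as $\beta \to \infty$. Combined with the continuity of $\nabla L(\cdot)$ and the first order equation Eq. 8, this further implies
$$\lim_{\beta\to\infty}\frac{H_{1/\beta} - H_0}{1/\beta} = -Kn\nabla L(H_0).$$
Now, by the chain rule,
$$\lim_{\beta\to\infty}\frac{\widetilde{NC}_1(H_{1/\beta}) - \widetilde{NC}_1(H_0)}{1/\beta} = \Big\langle\nabla_H\widetilde{NC}_1(H_0),\ \lim_{\beta\to\infty}\frac{H_{1/\beta} - H_0}{1/\beta}\Big\rangle = \big\langle\nabla_H\widetilde{NC}_1(H_0),\ -Kn\nabla L(H_0)\big\rangle = \frac{d}{dt}\Big|_{t=0}\widetilde{NC}_1(H_t).$$
In the last expression, $H_t$ denotes the gradient flow iterate defined in Eq. 4.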
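By (the proof of) Theorem 3.1, when $H_0$ is non-collapsed, $\frac{d}{dt}\big|_{t=0}\widetilde{NC}_1(H_t) < 0$ must hold. This further implies that there exists some constant $C = C(H_0) > 0$ such that for $\beta > C$ we have $\frac{\widetilde{NC}_1(H_{1/\beta}) - \widetilde{NC}_1(H_0)}{1/\beta} < 0$.

C PROOF OF THEOREM 4.1

Our proof is essentially a perturbation analysis approach that exploits the fact that each of the minimizers is a stationary point of its associated objective function. Namely, the minimizer of the perturbed problem $f(W, H; \tilde{H}_0)$, i.e., $(\tilde{W}^*, \tilde{H}^*)$, obeys
$$\nabla f(\tilde{W}^*, \tilde{H}^*; \tilde{H}_0) = \begin{bmatrix}\nabla_H f(\tilde{W}^*, \tilde{H}^*; \tilde{H}_0)\\ \nabla_W f(\tilde{W}^*, \tilde{H}^*; \tilde{H}_0)\end{bmatrix} = 0,$$
and the minimizer of the unperturbed problem, i.e., $(W^*, H^*)$ where for brevity we omit the 'ˆ' symbol, obeys
$$\nabla f(W^*, H^*; H_0) = \begin{bmatrix}\nabla_H f(W^*, H^*; H_0)\\ \nabla_W f(W^*, H^*; H_0)\end{bmatrix} = 0 .$$
We use these properties in the following first order Taylor approximation of $\nabla f(\tilde{W}^*, \tilde{H}^*; \tilde{H}_0)$ around $(W^*, H^*; H_0)$ (with accuracy of $O(\|\delta H\|^2, \|\delta W\|^2, \|\delta H_0\|^2)$) that is given by
$$\begin{aligned}
\begin{bmatrix}\nabla_H f(\tilde{W}^*, \tilde{H}^*; \tilde{H}_0)\\ \nabla_W f(\tilde{W}^*, \tilde{H}^*; \tilde{H}_0)\end{bmatrix}
\approx{}& \begin{bmatrix}\nabla_H f(W^*, H^*; H_0)\\ \nabla_W f(W^*, H^*; H_0)\end{bmatrix}\\
&+ \begin{bmatrix}\nabla_H^\top\nabla_H f(W^*, H^*; H_0) & \nabla_W^\top\nabla_H f(W^*, H^*; H_0)\\ \nabla_H^\top\nabla_W f(W^*, H^*; H_0) & \nabla_W^\top\nabla_W f(W^*, H^*; H_0)\end{bmatrix}\begin{bmatrix}\operatorname{vec}(\delta H)\\ \operatorname{vec}(\delta W)\end{bmatrix}
+ \begin{bmatrix}\nabla_{H_0}^\top\nabla_H f(W^*, H^*; H_0)\\ \nabla_{H_0}^\top\nabla_W f(W^*, H^*; H_0)\end{bmatrix}\operatorname{vec}(\delta H_0).
\end{aligned} \tag{9}$$
Recall that $\delta H := \tilde{H}^* - H^*$, $\delta W := \tilde{W}^* - W^*$, and $\delta H_0 = \tilde{H}_0 - H_0$. Since the two terms in the first line of Eq. 9 vanish, we get that
$$\begin{bmatrix}\operatorname{vec}(\delta H)\\ \operatorname{vec}(\delta W)\end{bmatrix} \approx -\begin{bmatrix}\nabla_H^\top\nabla_H f & \nabla_W^\top\nabla_H f\\ \nabla_H^\top\nabla_W f & \nabla_W^\top\nabla_W f\end{bmatrix}^{-1}\begin{bmatrix}\nabla_{H_0}^\top\nabla_H f\\ \nabla_{H_0}^\top\nabla_W f\end{bmatrix}\operatorname{vec}(\delta H_0), \tag{10}$$
where all the derivatives are evaluated at $(W^*, H^*; H_0)$, which is omitted in order to simplify the presentation. As shown below, in our setting the matrix that is inverted is indeed nonsingular. We turn now to compute the derivatives. Let us denote $h := \operatorname{vec}(H)$, $w := \operatorname{vec}(W)$, and $y := \operatorname{vec}(Y)$.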
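Observe that from well known identities on the Kronecker product and the vectorization operation we have
$$\frac{1}{2Kn}\|WH - Y\|_F^2 = \frac{1}{2Kn}\|(I_{Kn}\otimes W)h - y\|_2^2 = \frac{1}{2Kn}\|(H^\top\otimes I_K)w - y\|_2^2 .$$
Therefore, the first order derivatives are given by
$$\nabla_H f(W, H; H_0) = \frac{1}{Kn}(I_{Kn}\otimes W^\top)\big((I_{Kn}\otimes W)h - y\big) + \frac{\lambda_H}{Kn}h + \frac{\beta}{Kn}\big(h - \operatorname{vec}(H_0)\big), \tag{11}$$
$$\nabla_W f(W, H; H_0) = \frac{1}{Kn}(H\otimes I_K)\big((H^\top\otimes I_K)w - y\big) + \frac{\lambda_W}{K}w .$$
Hence,
$$\nabla_{H_0}^\top\nabla_H f = -\frac{\beta}{Kn}I_{dnK}, \qquad \nabla_{H_0}^\top\nabla_W f = 0_{Kd\times dnK}.$$
Plugging these expressions in Eq. 10 and using blockwise matrix inversion gives
$$\begin{aligned}
\operatorname{vec}(\delta H) &\approx \frac{\beta}{Kn}\Big(\nabla_H^\top\nabla_H f - \nabla_W^\top\nabla_H f\,(\nabla_W^\top\nabla_W f)^{-1}\nabla_H^\top\nabla_W f\Big)^{-1}\operatorname{vec}(\delta H_0),\\
\operatorname{vec}(\delta W) &\approx -\frac{\beta}{Kn}(\nabla_W^\top\nabla_W f)^{-1}\nabla_H^\top\nabla_W f\,\Big(\nabla_H^\top\nabla_H f - \nabla_W^\top\nabla_H f\,(\nabla_W^\top\nabla_W f)^{-1}\nabla_H^\top\nabla_W f\Big)^{-1}\operatorname{vec}(\delta H_0),
\end{aligned}$$
which are stated in the theorem, where all the derivatives are evaluated at the point $(W^*, H^*; H_0)$. Let us state the second order derivatives that appear above. First, one can observe that
$$\begin{aligned}
\nabla_H^\top\nabla_H f(W^*, H^*; H_0) &= \frac{1}{Kn}I_{nK}\otimes W^{*\top}W^* + \frac{\lambda_H}{Kn}I_{dnK} + \frac{\beta}{Kn}I_{dnK},\\
\nabla_W^\top\nabla_W f(W^*, H^*; H_0) &= \frac{1}{Kn}H^*H^{*\top}\otimes I_K + \frac{\lambda_W}{K}I_{Kd}.
\end{aligned}$$
As for the mixed partial derivative, applying $\nabla_W^\top$ on Eq. 11, we get
$$\begin{aligned}
\nabla_W^\top\nabla_H f = \frac{\partial}{\partial w}\nabla_H f
&= \frac{1}{Kn}\frac{\partial}{\partial w}\big[(I_{Kn}\otimes W^\top)r\big] + \frac{1}{Kn}(I_{Kn}\otimes W^\top)\frac{\partial}{\partial w}\big((I_{Kn}\otimes W)h - y\big)\\
&= \frac{1}{Kn}\frac{\partial}{\partial w}\big[(I_{Kn}\otimes W^\top)r\big] + \frac{1}{Kn}(I_{Kn}\otimes W^\top)\frac{\partial}{\partial w}\big((H^\top\otimes I_K)w - y\big)\\
&= \frac{1}{Kn}E(W, H) + \frac{1}{Kn}(I_{Kn}\otimes W^\top)(H^\top\otimes I_K)\\
&= \frac{1}{Kn}E(W, H) + \frac{1}{Kn}(H^\top\otimes W^\top),
\end{aligned}$$
where $r := \operatorname{vec}(WH - Y)$ is treated as independent of $w$ due to the product rule, and $E(W, H) := \frac{\partial}{\partial w}\big[(I_{Kn}\otimes W^\top)r\big]$. Denoting $w_{k,i} := W[k, i]$, we have that
$$\frac{\partial}{\partial w_{k,i}}(I_{Kn}\otimes W^\top)r = \Big(I_{Kn}\otimes\frac{\partial}{\partial w_{k,i}}W^\top\Big)r = (I_{Kn}\otimes e_{d,i}e_{K,k}^\top)r = \operatorname{vec}\big(e_{d,i}e_{K,k}^\top(WH - Y)\big),$$
where $e_{d,i}$ is the standard basis vector in $\mathbb{R}^d$ with 1 in its $i$th entry (and $e_{K,k}$ is defined similarly).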
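Therefore,
$$\begin{aligned}
\nabla_H^\top\nabla_W f(W^*, H^*; H_0) &= \frac{1}{Kn}E^{*\top} + \frac{1}{Kn}(H^*\otimes W^*),\\
\nabla_W^\top\nabla_H f(W^*, H^*; H_0) &= \frac{1}{Kn}E^* + \frac{1}{Kn}(H^{*\top}\otimes W^{*\top}),
\end{aligned}$$
where $E^* = E(W^*, H^*)$ and $E(W, H) \in \mathbb{R}^{dnK\times Kd}$ is the matrix whose columns are the vectors $\operatorname{vec}(e_{d,i}e_{K,k}^\top(WH - Y))$, ordered as the entries of $w = \operatorname{vec}(W)$.

We focus now on the effect of the deviation $\delta H_0 = \tilde{H}_0 - H_0$ on the feature learning $\delta H = \tilde{H}^* - H^*$. This requires inverting the $dnK\times dnK$ matrix that links $\delta H$ and $\delta H_0$, which is quite challenging. Yet, from the derivatives that are stated above we observe the following:
$$\begin{aligned}
\operatorname{vec}(\delta H) &\approx \frac{\beta}{Kn}\Big(\nabla_H^\top\nabla_H f - \nabla_W^\top\nabla_H f\,(\nabla_W^\top\nabla_W f)^{-1}\nabla_H^\top\nabla_W f\Big)^{-1}\operatorname{vec}(\delta H_0)\\
&= \Big(I_{dnK} + \frac{\lambda_H}{\beta}I_{dnK} + \frac{1}{\beta}I_{nK}\otimes W^{*\top}W^* - \frac{Kn}{\beta}\nabla_W^\top\nabla_H f\,(\nabla_W^\top\nabla_W f)^{-1}\nabla_H^\top\nabla_W f\Big)^{-1}\operatorname{vec}(\delta H_0)\\
&= \Big(I_{dnK} + \frac{\lambda_H}{\beta}I_{dnK} + \frac{1}{\beta}I_{nK}\otimes W^{*\top}W^* - \frac{1}{\beta}Z^*\Big)^{-1}\operatorname{vec}(\delta H_0),
\end{aligned}$$
where
$$Z^* := (E^{*\top} + H^*\otimes W^*)^\top(H^*H^{*\top}\otimes I_K + n\lambda_W I_{dK})^{-1}(E^{*\top} + H^*\otimes W^*).$$
Therefore, under the assumption of $\beta \gg \max\{1, \lambda_H\}$, which is associated with a restrictive link between $H_0$ and $H$, we can use the first-order truncated Neumann series to approximate the matrix inversion (with accuracy of $O(\beta^{-2})$). This yields the expression that is stated in Eq. 5 and repeated here for the convenience of the reader:
$$\operatorname{vec}(\delta H) \approx \Big(I_{dnK} - \frac{\lambda_H}{\beta}I_{dnK} - \frac{1}{\beta}I_{nK}\otimes W^{*\top}W^* + \frac{1}{\beta}Z^*\Big)\operatorname{vec}(\delta H_0).$$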
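Two ingredients of this proof are easy to illustrate numerically: the vectorization identities $\operatorname{vec}(WH) = (I\otimes W)\operatorname{vec}(H) = (H^\top\otimes I)\operatorname{vec}(W)$, and the $O(\beta^{-2})$ accuracy of the first-order Neumann truncation $(I + X/\beta)^{-1} \approx I - X/\beta$. The sketch below is illustrative only: the matrix `X` is a generic symmetric positive semidefinite stand-in with spectral norm 1 (an assumption made so that the error has a simple closed form), not the actual matrix $\lambda_H I + I\otimes W^{*\top}W^* - Z^*$ of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, N = 3, 4, 5                      # N plays the role of Kn; sizes are illustrative
W = rng.normal(size=(K, d))
H = rng.normal(size=(d, N))

def vec(A):
    """Column-major vectorization, matching the convention used with Kronecker products."""
    return A.flatten(order="F")

# vec(WH) written as a linear map of vec(H) and of vec(W)
via_h = np.kron(np.eye(N), W) @ vec(H)
via_w = np.kron(H.T, np.eye(K)) @ vec(W)
print(np.allclose(vec(W @ H), via_h), np.allclose(vec(W @ H), via_w))

# Generic PSD matrix with spectral norm 1 (stand-in for the matrix in the proof)
G = rng.normal(size=(6, 6))
X = G @ G.T
X = X / np.linalg.norm(X, 2)

def neumann_error(X, beta):
    """Spectral-norm error of the first-order approximation of (I + X/beta)^{-1}."""
    I = np.eye(X.shape[0])
    exact = np.linalg.inv(I + X / beta)
    approx = I - X / beta
    return np.linalg.norm(exact - approx, 2)

for beta in (10.0, 100.0, 1000.0):
    # For this X the error equals 1 / (beta * (beta + 1)), i.e. it decays like beta^{-2}
    print(beta, neumann_error(X, beta), 1.0 / (beta * (beta + 1.0)))
```

Increasing `beta` by a factor of 10 reduces the error by roughly a factor of 100, which is the second-order behavior that the truncation relies on.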

D PROOF OF THEOREM 4.2

In this section we compute the entire spectrum (singular values) for the diagonal blocks ("intra-class blocks") and the off-diagonal blocks ("inter-class blocks") of the block matrix in Eq. 6. To keep the main body of the paper concise, we present in the statement of Theorem 4.2 only the results for $\sigma_{\max}(F_{k,k})$ and $\sigma_{\min}(F_{k,k})$ of the full rank matrix $F_{k,k}$, as well as $\sigma_{\max}(F_{k,\tilde{k}})$ of the rank-1 matrix $F_{k,\tilde{k}}$ ($\tilde{k} \neq k$).

Recall that we consider the (non-degenerate) setting $c := \sqrt{\lambda_H\lambda_W} < 1$. Therefore, when $H_0 = H^*$ is a minimizer of Eq. 1 (associated with $W^*$), from Corollary 2.2 we have that $(W^*, H^*)$, the minimizer of $f(W, H; H_0)$, is also orthogonally collapsed and characterized by Theorem 2.1 with $\lambda_W$ and $\lambda_H$ (independent of $K, n, d$). That is, $H^* = \bar{H}\otimes 1_n^\top$ and $W^*\bar{H} \propto \bar{H}^\top\bar{H} \propto W^*W^{*\top} \propto I_K$. We also have the following results for the spectral norms of $\bar{H}$ and $W^*$, which we denote by $\sigma_H$ and $\sigma_W$ respectively:
$$\sigma_H^2 = (1 - c)\sqrt{\frac{\lambda_W}{\lambda_H}} = \sqrt{\frac{\lambda_W}{\lambda_H}} - \lambda_W, \qquad \sigma_W^2 = (1 - c)\sqrt{\frac{\lambda_H}{\lambda_W}} = \sqrt{\frac{\lambda_H}{\lambda_W}} - \lambda_H .$$
Observe that these expressions do not depend on $K$, $n$, or $d$. Note also that $\sigma_H^2\sigma_W^2 = (1 - \sqrt{\lambda_H\lambda_W})^2 = (1 - c)^2 < 1$.

We remind the reader that $\operatorname{vec}(\delta H) \approx F\operatorname{vec}(\delta H_0)$, for
$$F = I_{dnK} - \frac{\lambda_H}{\beta}I_{dnK} - \frac{1}{\beta}I_{nK}\otimes W^{*\top}W^* + \frac{1}{\beta}Z^*,$$
where
$$Z^* := (E^{*\top} + H^*\otimes W^*)^\top(H^*H^{*\top}\otimes I_K + n\lambda_W I_{dK})^{-1}(E^{*\top} + H^*\otimes W^*),$$
and $E^* = E(W^*, H^*)$, where $E(W, H) \in \mathbb{R}^{dnK\times Kd}$ is defined as
$$E(W, H) := \big[\operatorname{vec}(e_{d,1}e_{K,1}^\top(WH - Y)),\ \ldots,\ \operatorname{vec}(e_{d,1}e_{K,K}^\top(WH - Y)),\ \operatorname{vec}(e_{d,2}e_{K,1}^\top(WH - Y)),\ \ldots,\ \operatorname{vec}(e_{d,d}e_{K,K}^\top(WH - Y))\big],$$
where $e_{d,i}$ is the standard basis vector in $\mathbb{R}^d$ with 1 in its $i$th entry (and $e_{K,k}$ is defined similarly).

For the collapsed minimizer $(W^*, H^*)$, we know that $H^* = \sigma_H R\otimes 1_n^\top$ and $W^* = \sigma_W R^\top$ for some (partial) orthonormal matrix $R \in \mathbb{R}^{d\times K}$ (i.e., $R^\top R = I_K$). Therefore, we have that $W^*H^* - Y = -cI_K\otimes 1_n^\top$, so that $(W^*H^* - Y)\otimes I_d = -cI_K\otimes 1_n^\top\otimes I_d$, and that $H^*\otimes W^* = (1 - c)R\otimes 1_n^\top\otimes R^\top$.
Observe that the alignment of the former expression with the latter (where the locations of the dimensions $d$ and $K$ are swapped) is done using the matrices $\{e_{d,i}e_{K,k}^\top\}$. Indeed, we can write $E^{*\top} = K_{d,K}(-cI_K\otimes 1_n^\top\otimes I_d)$, where $K_{d,K} \in \mathbb{R}^{Kd\times dK}$ is the permutation matrix that satisfies $K_{d,K}^\top(X_1\otimes X_2)K_{d,K} = X_2\otimes X_1$ for any $X_1 \in \mathbb{R}^{d\times d}$ and $X_2 \in \mathbb{R}^{K\times K}$. Such a matrix $K_{d,K}$ is also known as a commutation matrix in the matrix theory literature. Another useful property of the commutation matrix that we will frequently use is that
$$K_{d,K}(x\otimes Y) = Y\otimes x \tag{12}$$
for any $x \in \mathbb{R}^{K\times 1}$ and $Y \in \mathbb{R}^{d\times m}$.

Let us extract the $(k, \tilde{k})$-th block $Z^*_{k,\tilde{k}} \in \mathbb{R}^{dn\times dn}$ of $Z^*$. First, observe that
$$Z^* = (E^{*\top} + H^*\otimes W^*)^\top(H^*H^{*\top}\otimes I_K + n\lambda_W I_{dK})^{-1}(E^{*\top} + H^*\otimes W^*) = \frac{1}{n}B^\top(A\otimes I_K)B,$$
where
$$A = (\sigma_H^2RR^\top + \lambda_W I_d)^{-1}, \qquad B = -cK_{d,K}(I_K\otimes 1_n^\top\otimes I_d) + (1 - c)(R\otimes 1_n^\top\otimes R^\top).$$
Denote by $\{e_k\}$ the standard basis vectors in $\mathbb{R}^K$. To extract the $(k, \tilde{k})$-th block of $Z^*$, we compute
$$Z^*_{k,\tilde{k}} = (e_k\otimes I_{dn})^\top Z^*(e_{\tilde{k}}\otimes I_{dn}) = \frac{1}{n}\big(B(e_k\otimes I_{dn})\big)^\top(A\otimes I_K)\big(B(e_{\tilde{k}}\otimes I_{dn})\big),$$
with
$$B(e_k\otimes I_{dn}) = -cK_{d,K}(e_k\otimes 1_n^\top\otimes I_d) + (1 - c)(r_k\otimes 1_n^\top\otimes R^\top) = -c(1_n^\top\otimes I_d\otimes e_k) + (1 - c)(r_k1_n^\top\otimes R^\top),$$
where in the last equality we have used property Eq. 12 to swap the Kronecker product. Then,
$$\begin{aligned}
Z^*_{k,\tilde{k}} &= \frac{1}{n}\big(-c(1_n^\top\otimes I_d\otimes e_k) + (1 - c)(r_k1_n^\top\otimes R^\top)\big)^\top(A\otimes I_K)\big(-c(1_n^\top\otimes I_d\otimes e_{\tilde{k}}) + (1 - c)(r_{\tilde{k}}1_n^\top\otimes R^\top)\big)\\
&= c^2(e_k^\top e_{\tilde{k}})\tfrac{1}{n}(1_n1_n^\top)\otimes A + (1 - c)^2(r_k^\top Ar_{\tilde{k}})\tfrac{1}{n}(1_n1_n^\top)\otimes RR^\top - c(1 - c)\tfrac{1}{n}(1_n1_n^\top)\otimes(Ar_{\tilde{k}}r_k^\top + r_{\tilde{k}}r_k^\top A).
\end{aligned}$$
Let us write $R = [r_1\ r_2\ \ldots\ r_K] \in \mathbb{R}^{d\times K}$ and let $r_{K+1}, \ldots, r_d$ be orthonormal vectors such that $\{r_i\}_{i=1}^d$ forms an orthonormal basis. We know that
$$A = (\sigma_H^2RR^\top + \lambda_W I_d)^{-1} = \sum_{i=1}^{K}\frac{1}{\sigma_H^2 + \lambda_W}r_ir_i^\top + \sum_{j=K+1}^{d}\frac{1}{\lambda_W}r_jr_j^\top .$$
Therefore,
$$r_k^\top Ar_{\tilde{k}} = \frac{\delta_{k\tilde{k}}}{\sigma_H^2 + \lambda_W}, \qquad Ar_{\tilde{k}}r_k^\top = r_{\tilde{k}}r_k^\top A = \frac{1}{\sigma_H^2 + \lambda_W}r_{\tilde{k}}r_k^\top .$$
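We can thus conclude that
$$Z^*_{k,\tilde{k}} = \frac{1}{n}(1_n1_n^\top)\otimes\Big(\delta_{k\tilde{k}}c^2A + \frac{\delta_{k\tilde{k}}(1 - c)^2}{\sigma_H^2 + \lambda_W}RR^\top - \frac{2c(1 - c)}{\sigma_H^2 + \lambda_W}r_{\tilde{k}}r_k^\top\Big). \tag{13}$$
When $\tilde{k} \neq k$, the off-diagonal block of $Z^*$ is given by
$$Z^*_{k,\tilde{k}} = \frac{1}{n}(1_n1_n^\top)\otimes\Big(-\frac{2c(1 - c)}{\sigma_H^2 + \lambda_W}r_{\tilde{k}}r_k^\top\Big),$$
which is a rank-1 matrix. Since the other matrices in $F$ do not contribute to the inter-class block, we know that $F_{k,\tilde{k}} = \frac{1}{\beta}Z^*_{k,\tilde{k}}$. It is well known that the singular values of a Kronecker product of two matrices are given by the products of their singular values. The matrix $\frac{1}{n}(1_n1_n^\top)$ has exactly one non-zero singular value, which equals 1. This implies that
$$\sigma_{\max}(F_{k,\tilde{k}}) = \frac{2c(1 - c)}{\beta(\sigma_H^2 + \lambda_W)} = \frac{2\lambda_H(1 - \sqrt{\lambda_H\lambda_W})}{\beta}.$$

Next, let us compute the intra-class block. Setting $\tilde{k} = k$ in Eq. 13, we get
$$Z^*_{k,k} = \frac{1}{n}(1_n1_n^\top)\otimes\sum_{i=1}^{d}\mu_ir_ir_i^\top = \sum_{i=1}^{d}\mu_i\Big(\frac{1}{n}1_n1_n^\top\Big)\otimes(r_ir_i^\top),$$
where
$$\begin{aligned}
\mu_k &= \frac{c^2}{\sigma_H^2 + \lambda_W} + \frac{(1 - c)^2}{\sigma_H^2 + \lambda_W} - \frac{2c(1 - c)}{\sigma_H^2 + \lambda_W} = (2c - 1)^2\sqrt{\frac{\lambda_H}{\lambda_W}},\\
\mu_i &= \frac{c^2}{\sigma_H^2 + \lambda_W} + \frac{(1 - c)^2}{\sigma_H^2 + \lambda_W} = \big(c^2 + (1 - c)^2\big)\sqrt{\frac{\lambda_H}{\lambda_W}}, \quad\text{for } 1 \le i \le K \text{ and } i \neq k,\\
\mu_j &= \frac{c^2}{\lambda_W} = \lambda_H, \quad\text{for } K < j \le d .
\end{aligned}$$
The intra-class block is therefore given by
$$\begin{aligned}
F_{k,k} &= \Big(1 - \frac{\lambda_H}{\beta}\Big)I_{nd} - \frac{\sigma_W^2}{\beta}I_n\otimes(RR^\top) + \frac{1}{\beta}\sum_{i=1}^{d}\mu_i\Big(\frac{1}{n}1_n1_n^\top\Big)\otimes(r_ir_i^\top)\\
&= \Big(1 - \frac{\lambda_H}{\beta}\Big)\sum_{i=1}^{d}I_n\otimes(r_ir_i^\top) - \frac{\sigma_W^2}{\beta}\sum_{i=1}^{K}I_n\otimes(r_ir_i^\top) + \frac{1}{\beta}\sum_{i=1}^{d}\mu_i\Big(\frac{1}{n}1_n1_n^\top\Big)\otimes(r_ir_i^\top)\\
&= \sum_{i=1}^{d}\lambda_iI_n\otimes(r_ir_i^\top) + \frac{1}{\beta}\sum_{i=1}^{d}\mu_i\Big(\frac{1}{n}1_n1_n^\top\Big)\otimes(r_ir_i^\top),
\end{aligned}$$
where
$$\lambda_i = 1 - \frac{\lambda_H}{\beta} - \frac{\sigma_W^2}{\beta} = 1 - \frac{1}{\beta}\sqrt{\frac{\lambda_H}{\lambda_W}}, \quad\text{for } 1 \le i \le K, \tag{14}$$
$$\lambda_i = 1 - \frac{\lambda_H}{\beta}, \quad\text{for } K < i \le d . \tag{15}$$
Let $s_1 = \frac{1}{\sqrt{n}}1_n$ and let $\{s_i\}_{i=1}^n$ be an orthonormal basis of $\mathbb{R}^n$. Then, we can further write
$$\begin{aligned}
F_{k,k} &= \sum_{i=1}^{d}\Big(\sum_{j=1}^{n}\lambda_is_js_j^\top\Big)\otimes(r_ir_i^\top) + \frac{1}{\beta}\sum_{i=1}^{d}\mu_i(s_1s_1^\top)\otimes(r_ir_i^\top)\\
&= \sum_{i=1}^{d}\sum_{j=1}^{n}\lambda_i(s_j\otimes r_i)(s_j\otimes r_i)^\top + \frac{1}{\beta}\sum_{i=1}^{d}\mu_i(s_1\otimes r_i)(s_1\otimes r_i)^\top .
\end{aligned} \tag{16}$$
One can easily verify that $\{s_j\otimes r_i\}_{1\le j\le n,\,1\le i\le d}$ is an orthonormal basis of $\mathbb{R}^{nd}$. So, Eq. 16 gives us the eigendecomposition of $F_{k,k}$.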
The spectral norm of $F_{k,k}$ is therefore given by
$$\sigma_{\max}(F_{k,k}) = \max_{1\le i\le d}\max\Big\{|\lambda_i|,\ \big|\lambda_i + \tfrac{1}{\beta}\mu_i\big|\Big\}.$$
As we consider the large $\beta$ regime, the expressions in both Eq. 14 and Eq. 15 are positive. Observe that for $K < i \le d$ (associated with the over-parameterization of the model) the eigenvalue associated with the eigenvector $(s_1\otimes r_i)$ is given by
$$\lambda_i + \frac{1}{\beta}\mu_i = 1 - \frac{\lambda_H}{\beta} + \frac{\lambda_H}{\beta} = 1 .$$
Note, though, that due to the Kronecker product with $s_1 = \frac{1}{\sqrt{n}}1_n$, perturbation in the direction of this eigenvector does not affect the variability in the $k$th class at all. Furthermore, generic/practical perturbations are likely to correlate with, or have their power spectrum spread over, many components of the $dn$ dimensional eigenbasis of $F_{k,k}$ and not concentrate in an extremely low dimensional $(d - K)$-dimensional subspace (composed only of $s_1\otimes r_i$ with $K < i \le d$). Thus, we expect these eigenvectors to have small correlation with generic perturbations. Showing that $\sigma_{\max}(F_{k,k}) = 1$ reduces now to eliminating the option of eigenvalues larger than 1 for $1 \le i \le K$. This is equivalent to having that
$$\frac{1}{\beta}\sqrt{\frac{\lambda_H}{\lambda_W}}\big(-1 + (2c - 1)^2\big) < 0, \qquad \frac{1}{\beta}\sqrt{\frac{\lambda_H}{\lambda_W}}\big(-1 + c^2 + (1 - c)^2\big) < 0,$$
and both are ensured under our assumption $c := \sqrt{\lambda_H\lambda_W} < 1$ (the non-degenerate case of the model), since $c > 0$. Finally, observing that Eq. 14 is smaller than Eq. 15, and that the second term in Eq. 16 does not include eigenvectors $(s_j\otimes r_i)$ for $j > 1$, we conclude that
$$\sigma_{\min}(F_{k,k}) = 1 - \frac{1}{\beta}\sqrt{\frac{\lambda_H}{\lambda_W}} .$$

D.1 ADDITIONAL DISCUSSION ON THE RESULTS OF THE THEOREM

Theorem 4.2 has no restricting assumptions on the number of classes $K$. The only assumption, which is common in theoretical NC papers and is also what is done in practice, is that $d > K$, i.e., that the dimension of the features is larger than the number of classes.
This means that, regardless of the number of classes, the inter-class (off-diagonal) blocks have rank 1, while the intra-class (diagonal) blocks have full rank (recall that each block is of size $dn\times dn$). Considering the conclusions from Theorem 4.2, which are stated in Section 4, if we sum up the maximal contribution of each of the $K - 1$ inter-class blocks of a certain class, i.e., $(K - 1)\sigma_{\max}(F_{k,\tilde{k}})$, then for guaranteeing that this sum is smaller than the minimal contribution of the intra-class block, i.e., $\sigma_{\min}(F_{k,k})$, we may need to assume that $\beta \gg K$. Note that this is a reasonable assumption under our large $\beta$ setting. Yet, we believe that the rank difference between the two types of blocks is a more important indicator for the dominance of the intra-class blocks, and this property is independent of the number of classes $K$. Specifically, since $dn > K$ (all the more so, in practice we even have $n \gg K$), for generic perturbations (that uniformly span the entire $dnK$ dimensional space) the rank-1 inter-class blocks nullify much of the perturbation, contrary to the intra-class block (which has full rank). This strengthens our conclusion that the deviation from collapse of each class of the minimizer $H$ is dominated by the deviation from collapse of the same class in $H_0$ rather than by the deviations of other classes. We remind the reader that we analyze the "near-NC" regime, so we assume that the system is already not far from exact NC. Reaching this point in general might become harder when the number of classes grows. Another point that can be raised regarding the results of Theorem 4.2 is that we do not analyze the full matrix $F$ but rather its blocks.
In fact, we believe that our analysis, which includes a complete spectral analysis for each block separately, is more informative, as it clearly distinguishes between properties of intra- and inter-class blocks and provides insights on the roles of the regularization hyperparameters that are aligned with practical DNN training. In contrast, in the large $\beta$ regime $F$ is full rank, which masks the rank-1 property of the inter-class (off-diagonal) blocks. Nevertheless, analyzing the relationship between the full $F$ and its blocks is an interesting direction for future research.
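The block spectra derived in this appendix can be cross-checked by building $F$ explicitly for a small problem. The sketch below is our own illustration, not part of the paper's experiments: the sizes `K`, `n`, `d` and the values of `lam_w`, `lam_h`, `beta` are arbitrary choices satisfying $c < 1$, $d > K$ and $n \ge 2$, and the collapsed minimizer is constructed from a random partial orthonormal matrix `R`.

```python
import numpy as np

K, n, d = 3, 2, 5                                  # illustrative sizes with d > K, n >= 2
lam_w, lam_h, beta = 0.5, 0.2, 100.0               # arbitrary values with c < 1
c = np.sqrt(lam_h * lam_w)
sig_h2 = (1 - c) * np.sqrt(lam_w / lam_h)          # sigma_H^2
sig_w2 = (1 - c) * np.sqrt(lam_h / lam_w)          # sigma_W^2

rng = np.random.default_rng(0)
R = np.linalg.qr(rng.normal(size=(d, K)))[0]       # d x K with orthonormal columns
H = np.kron(np.sqrt(sig_h2) * R, np.ones((1, n)))  # H* = sigma_H R ⊗ 1_n^T
W = np.sqrt(sig_w2) * R.T                          # W* = sigma_W R^T
Y = np.kron(np.eye(K), np.ones((1, n)))
resid = W @ H - Y                                  # equals -c (I_K ⊗ 1_n^T) at collapse

# E(W, H): one column per entry W[k, i], ordered like vec(W) (column-major)
cols = []
for i in range(d):
    for k in range(K):
        G = np.zeros((d, K * n))
        G[i, :] = resid[k, :]                      # e_{d,i} e_{K,k}^T (WH - Y)
        cols.append(G.flatten(order="F"))
E = np.stack(cols, axis=1)                         # shape (d n K, K d)

X = E.T + np.kron(H, W)
P = np.linalg.inv(np.kron(H @ H.T, np.eye(K)) + n * lam_w * np.eye(d * K))
Z = X.T @ P @ X
F = ((1 - lam_h / beta) * np.eye(d * n * K)
     - np.kron(np.eye(n * K), W.T @ W) / beta
     + Z / beta)

def block(F, k, l):
    """The (k, l)-th dn x dn block of F."""
    s = d * n
    return F[k * s:(k + 1) * s, l * s:(l + 1) * s]

eig_intra = np.linalg.eigvalsh(block(F, 0, 0))                 # F_{k,k} is symmetric
sv_inter = np.linalg.svd(block(F, 0, 1), compute_uv=False)     # F_{k,k~}, k~ != k

print("intra-class: min, max =", eig_intra.min(), eig_intra.max())
print("predicted:   min, max =", 1 - np.sqrt(lam_h / lam_w) / beta, 1.0)
print("inter-class: top two singular values =", sv_inter[:2])
print("predicted sigma_max =", 2 * lam_h * (1 - c) / beta)
```

Under these assumptions the numerically computed extremes of the intra-class block and the single nonzero singular value of the inter-class block should agree with the closed-form expressions, and the second singular value of the inter-class block should vanish up to floating point error, reflecting its rank-1 structure.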

E ADDITIONAL EXPERIMENTS AND EXPERIMENTAL DETAILS

E.1 EXPERIMENTAL DETAILS FOR THE LAYER-WISE EXPERIMENT

In this section, we provide the experimental details for the layer-wise training experiment that is presented in Figure 2 in the main body of the paper. We train an MLP with 10 hidden layers on the CIFAR-10 dataset, where each sample is flattened to a 3072 × 1 vector. Each hidden layer includes 3072 fully connected neurons with default PyTorch initialization of the weights, batchnorm, and ReLU nonlinearity. We start with one hidden layer and train the MLP with 3 epochs of Adam with mini-batch size of 256, learning rate of 1e-4, and CE loss. Then, we compute NC1 metrics for the deepest features. At this point, the first "outer iteration" of the procedure is finished. We fix the parameters in the existing hidden layers, insert a new hidden layer before the final classification layer, and repeat the procedure. Namely, at each outer iteration of the procedure we optimize only the deepest hidden layer, which has just been inserted with default PyTorch initialization of the weights, and the final classification layer, which is "initialized" with its weights from the previous outer iteration. Let us provide more details on what has led to the implementation decisions that are stated above. We have found that layer-wise training of DNNs (on a practical dataset, e.g., CIFAR-10 that we use here) is significantly harder than end-to-end training in terms of reaching a small training loss value. (Presumably, this is the reason that DNNs are typically trained in an end-to-end fashion.) Careful configuration of the training procedure was required for reaching a considerably low loss (though still not zero training error) and low NC1 metrics as presented in Figure 2.
From our efforts in layer-wise training of the 10-layer MLP we observed the following: the Adam optimizer worked better than SGD (which is harder to tune); layer-wise minimization with CE loss (rather than MSE loss) led to lower NC1 metrics; and using no more than 3 epochs per "outer iteration" allowed reaching lower values for the loss and the NC1 metrics at the deeper layers. Regarding the latter (i.e., more epochs per outer iteration lead to worse optimization results), when there are only one or two hidden layers, the decrease in the loss and the decrease in the NC1 metrics are larger when more epochs are used. However, when we then add more hidden layers, the optimization appears to get stuck at a local minimum with higher loss and NC1 metrics compared to what we get with only 3 epochs per outer iteration. As far as we understand, this behavior follows from the (extreme) nonconvexity of the problem.
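The NC1 metrics mentioned above are computed from the matrix of deepest features. A minimal sketch of such a computation is given below; it is not the authors' code. It assumes a feature matrix `H` of shape $d \times Kn$ whose columns are grouped class by class with exactly `n` samples per class, takes $NC_1 = \frac{1}{K}\operatorname{Tr}(\Sigma_W\Sigma_B^\dagger)$ as in Appendix E.2, and uses the trace ratio $\operatorname{Tr}(\Sigma_W)/\operatorname{Tr}(\Sigma_B)$ as the second metric (our reading of $\widetilde{NC}_1$). The function name is hypothetical.

```python
import numpy as np

def nc1_metrics(H, K, n):
    """Return (NC1, trace-ratio NC1) for features H of shape (d, K*n), grouped by class."""
    d = H.shape[0]
    Hk = H.reshape(d, K, n)                          # [feature, class, sample]
    means = Hk.mean(axis=2)                          # class means, d x K
    g = means.mean(axis=1, keepdims=True)            # global mean
    centered = (Hk - means[:, :, None]).reshape(d, K * n)
    Sw = centered @ centered.T / (K * n)             # within-class covariance
    Sb = (means - g) @ (means - g).T / K             # between-class covariance
    nc1 = np.trace(Sw @ np.linalg.pinv(Sb)) / K      # (1/K) Tr(Sigma_W Sigma_B^+)
    nc1_ratio = np.trace(Sw) / np.trace(Sb)          # Tr(Sigma_W) / Tr(Sigma_B)
    return nc1, nc1_ratio

# Synthetic illustration: class means plus within-class noise of decreasing scale
rng = np.random.default_rng(0)
K, n, d = 4, 6, 8
class_means = rng.normal(size=(d, K))
noise = rng.normal(size=(d, K, n))
noise -= noise.mean(axis=2, keepdims=True)           # keep the class means fixed
for scale in (1.0, 0.1, 0.0):
    H = np.repeat(class_means, n, axis=1) + scale * noise.reshape(d, K * n)
    print(scale, nc1_metrics(H, K, n))
```

Both metrics shrink as the within-class noise is scaled down and vanish for exactly collapsed features. With features extracted from a real network, the columns would first need to be sorted by class label, and unequal class sizes would require a different normalization.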

E.2 MORE EXPERIMENTS ON THE EFFECT OF THE REGULARIZATION HYPERPARAMETERS

In this section, we present more experiments that examine how modifying the regularization hyperparameters affects the NC behavior of a practical DNN, ResNet18 (He et al., 2016a), compared to a baseline setting. Specifically, as a baseline hyperparameter setting, we consider one that is used in previous works (Papyan et al., 2020; Zhu et al., 2021): default PyTorch initialization of the weights, SGD optimizer with mini-batch size of 256, learning rate of 0.05 that is divided by 10 every 40 epochs, momentum of 0.9, and weight decay ($L_2$ regularization) of 5e-4 for all the network's parameters. The first set of experiments is similar to the experiments in Section 5. These experiments support the insight gained in Section 4 that $\lambda_H$ (the regularization of the feature mapping) plays a bigger role than $\lambda_W$ (the regularization of the classification layer) does in approaching NC. We compare the NC1 and NC2 metrics (defined in Section 5) of the baseline setting and the following modified settings: 1) doubling the weight decay only for the last (FC) layer; 2) doubling the weight decay only for the feature mapping (conv) layers; 3) zeroing the weight decay for the last layer; and 4) zeroing the weight decay for the feature mapping layers. In Figure 4 we consider the MNIST dataset with 3K training samples per class. Figure 4a presents the NC1 and NC2 metrics of the deepest features for MSE loss and no bias in the FC layer. In Table 1 we report the mean and the standard deviation (SD) of the NC metrics. Overall, the experiments show the important role of the regularization of the feature mapping layers in approaching NC. Namely, modifying the regularization of the feature mapping layers leads to curves with larger deviations from the baseline compared to modifying the last layer's regularization.
This is aligned with the theory established in Section 4 that links increasing $\lambda_H$ to reducing the dominant component of the distance from collapse of a class, which is the deviation from collapse of its own features in preceding layers. The second set of experiments shows the role of $\lambda_W$ in mitigating the interferences between the features of different classes (such interferences can hinder approaching NC). To visualize such behavior we use a "per-class NC1" metric, defined as
$$NC_1^{(k)} := \frac{1}{K}\operatorname{Tr}\Big(\frac{1}{n}\sum_{i=1}^{n}(h_{k,i} - \bar{h}_k)(h_{k,i} - \bar{h}_k)^\top\Sigma_B^\dagger\Big).$$
Note that the NC1 metric, which is defined in Section 5, can be written as
$$NC_1 = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{K}\operatorname{Tr}\Big(\frac{1}{n}\sum_{i=1}^{n}(h_{k,i} - \bar{h}_k)(h_{k,i} - \bar{h}_k)^\top\Sigma_B^\dagger\Big) = \frac{1}{K}\sum_{k=1}^{K}NC_1^{(k)} .$$
We also use the following metric to measure the alignment of the mean features and the last layer's weights:
$$NC_3 := \Big\|\frac{W(\bar{H} - h_G1_K^\top)}{\|W(\bar{H} - h_G1_K^\top)\|_F} - \frac{1}{\sqrt{K - 1}}\Big(I_K - \frac{1}{K}1_K1_K^\top\Big)\Big\|_F,$$
where the simplex ETF is normalized to unit Frobenius norm. In Figure 6a we present the NC metrics of the deepest features of the baseline training scheme on the MNIST dataset with 3K samples per class. The other lines in Figure 6 show the NC metrics for a modified training set, where the samples of classes (digits) 4 and 9 are degraded by a uniform blur (blur kernel of size 9 × 9) that makes it harder to distinguish between them. Each line corresponds to a different value of weight decay for the last layer's parameters. Yet, in all of the settings we reached zero training error at approximately epoch 40. The empirical results show that large $\lambda_W$ facilitates reaching reduced NC metrics (closeness to NC structure) by reducing the effect ("interference") of the features of the degraded samples on the features of the other classes. This is aligned with the theory that is established for our model in Section 4.

F ADDITIONAL MOTIVATION FOR THE MODEL IN EQ. 2

In the model that we consider in Eq. 2, we interpret $H$ as the deepest features of a DNN and $H_0$ as shallower features of the DNN. In particular, in the large $\beta$ regime that we theoretically analyze in the paper, we interpret $H_0$ as the penultimate features (one layer before $H$). Even though the relation between $H$ and $H_0$ in our model differs from their explicit relation in many practical DNNs, there exist networks where it is very reasonable to assume that the deepest features and the penultimate features are close to each other. For example, consider the ResNet architecture from (He et al., 2016b), where (under our interpretation of $H$ and $H_0$) the deepest features obey $H = H_0 + r(H_0)$, where $r(\cdot)$ denotes a residual block. The residual term can potentially be very small if $H_0$ already separates the classes (e.g., it has a "near NC" structure). In fact, in the popular neural ODE framework (Chen et al., 2018), which is understood as the infinite depth limit of these ResNets, we inherently have that $H \approx H_0$. Another example where the concept $H \approx H_0$ inherently holds is deep equilibrium models (DEQ) (Bai et al., 2019). These practical DNN frameworks provide the rationale for analyzing our model. Furthermore, our theoretical results, such as the depthwise decrease in the within-class variability, are also aligned with the empirical behavior of DNN architectures beyond the aforementioned examples (e.g., plain MLP).



Hui & Belkin (2021) have shown that training DNN classifiers with MSE loss is a powerful strategy whose performance is similar to training with CE loss. In more detail, all the assumptions in (Han et al., 2022) (including continual renormalization of the gradient) lead to the fact that only the singular values (and not the singular vectors) of an "SNR matrix" $\Sigma_W^{-1/2}(H)\bar{H}$ vary along their flow. However, since we do not make their assumptions, we do not have such a matrix whose singular bases are fixed along the flow and we need to approach the problem in a more general way. This is also aligned with empirical observations of gradual depthwise collapse in practical DNNs and with Corollary 3.2 in the limit where $\tilde{H}_0$ is nearly collapsed. The derivatives are stated in the proof. Note that the claim in (Rangamani & Banburski-Fahey, 2022) that an NC solution cannot minimize the unregularized bias-free MSE loss comes from demanding that $H^*$, without subtracting the global mean, be a simplex ETF rather than an orthogonal frame as shown in Theorem 2.1. In Appendix E.2, we also demonstrate empirically, with practical DNNs, the role of $\lambda_W$ in mitigating inter-class interference of features, which is identified by our analysis.



Consider a classification task with $K$ classes and $n$ training samples per class. Let us denote by $y_k \in \mathbb{R}^K$ the one-hot vector with 1 in its $k$-th entry and by $x_{k,i} \in \mathbb{R}^p$ the $i$-th training sample of the $k$-th class. DNN-based classifiers can typically be expressed as $\mathrm{DNN}_\Theta(x) = Wh_\theta(x) + b$, where $h_\theta(\cdot) : \mathbb{R}^p \to \mathbb{R}^d$ (with $d \ge K$) is the feature mapping that is composed of multiple layers (with learnable parameters $\theta$), and $W = [w_1, \ldots, w_K]^\top \in \mathbb{R}^{K\times d}$ ($w_k^\top$ denotes the $k$th row of $W$) and $b \in \mathbb{R}^K$ are the weights and bias of the last classification layer. The network's parameters $\Theta = \{W, b, \theta\}$ are usually learned by empirical risk minimization
$$\min_{\Theta}\ \frac{1}{Kn}\sum_{k=1}^{K}\sum_{i=1}^{n}\mathcal{L}\big(Wh_\theta(x_{k,i}) + b,\ y_k\big) + \mathcal{R}(\Theta),$$
where $\mathcal{L}(\cdot, \cdot)$ is a loss function (e.g., CE or MSE) and $\mathcal{R}(\cdot)$ is a regularization term.

Figure 1: The effect of $\lambda_H$ on the spectrum of $F_{k,k}$.

Remark. In Appendix D we derive expressions for the complete singular value decomposition of $F_{k,k}$ and $F_{k,\tilde{k}}$. Our expressions for the entire spectrum of $F_{k,k}$ reveal its step-wise decreasing shape, as visualized in Figure 1 for $\beta = 100$, $K = 4$, $d = 10$, $n = 10$, $\lambda_W = \sqrt{2}$ and various values of $\lambda_H$. To keep the paper concise, we state in the above theorem only the results for the maximal and minimal singular values of $F_{k,k}$, but note that, similarly to $\sigma_{\min}(F_{k,k})$, almost all singular values decrease as $\lambda_H$ increases. Even though a small portion ($\frac{1 - K/d}{n}$) of the singular values equal 1 (as shown in our analysis in Appendix D), we can still gain insights on the attenuation profile since generic perturbations are unlikely to concentrate in such an extremely low-dimensional subspace (and, in fact, the singular vectors associated with this subspace do not affect the within-class variability).
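The step-wise shape described in the remark can be tabulated directly from the closed-form eigenvalues of Appendix D (Eq. 14 to Eq. 16). The following sketch lists the distinct singular values of $F_{k,k}$ with their multiplicities; the function name and the particular values of $\lambda_H$ are our own illustrative choices (any $\lambda_H$ with $\sqrt{\lambda_H\lambda_W} < 1$ would do), while the remaining parameters follow the Figure 1 setting.

```python
import math

def fkk_spectrum(beta, K, d, n, lam_w, lam_h):
    """Distinct singular values of F_{k,k} and their multiplicities (closed form)."""
    c = math.sqrt(lam_h * lam_w)
    if not (0.0 < c < 1.0 and d > K and n >= 2):
        raise ValueError("requires 0 < c < 1, d > K and n >= 2")
    s = math.sqrt(lam_h / lam_w)
    return [
        # directions s_j ⊗ r_i with j > 1 (within-class variability directions)
        (1.0 - s / beta, K * (n - 1)),                                  # i <= K
        (1.0 - lam_h / beta, (d - K) * (n - 1)),                        # i > K
        # directions s_1 ⊗ r_i (shifts of the class mean)
        (1.0 - s * (1.0 - (2.0 * c - 1.0) ** 2) / beta, 1),             # i = k
        (1.0 - s * (1.0 - c ** 2 - (1.0 - c) ** 2) / beta, K - 1),      # i <= K, i != k
        (1.0, d - K),                                                   # i > K
    ]

beta, K, d, n, lam_w = 100.0, 4, 10, 10, math.sqrt(2.0)
for lam_h in (0.05, 0.2, 0.5):                     # illustrative values only
    spec = sorted(fkk_spectrum(beta, K, d, n, lam_w, lam_h), reverse=True)
    print(f"lambda_H = {lam_h}")
    for value, mult in spec:
        print(f"  singular value {value:.6f} with multiplicity {mult}")
```

In each case the multiplicities sum to $dn$, and the value 1 appears with multiplicity $d - K$, i.e., a fraction $(1 - K/d)/n$ of the spectrum, matching the remark.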

Figure 2: Layer-wise training of MLP on CIFAR-10.

The result of Section 3 provides reasoning to justify the depthwise decrease in within-class variability, which has already been empirically demonstrated for end-to-end training in several papers (Papyan et al., 2020; Tirer & Bruna, 2022; Galanti, 2022) (we present such experiments in Appendix E.2). Here we show this behavior also for layer-wise training, which is better represented by our model. We consider the CIFAR-10 dataset and train an MLP with 1 to 10 hidden layers and a final classification layer. Each time, we add and train a hidden layer on top of the previous hidden layers, which are kept fixed. Then we compute the NC1 metrics for the deepest features. Due to space limitation, the experimental details are deferred to Appendix E.1. Figure 2 demonstrates a decrease in both $NC_1$ and $\widetilde{NC}_1$ as we add more hidden layers on top of the previous ones, which are kept fixed. Note that our theory justifies such a decrease for all the layers (the features are not required to be near collapse).

Figure 3: The effect of modifying the weight decay (WD) on NC metrics for ResNet18 trained on CIFAR-10. Top: MSE loss without bias; Bottom: CE loss with bias. Observe that modifying the WD in the feature mapping increases the deviation from the baseline more than modifying the WD of the last layer.

Next, we compute how each covariance matrix updates along the flow. Let Σ_B(a, b) = e_a^⊤ Σ_B e_b denote the (a, b)-th entry of Σ_B. We further denote C := (Σ_T + λ_W

is given by E(W, H) := vec(e_{d,1} e_{K,1}^⊤ (WH - Y)), ..., vec(e_{d,1} e_{K,K}^⊤ (WH - Y)), vec(e_{d,2} e_{K,1}^⊤ (WH - Y)), ..., vec(e_{d,d} e_{K,K}^⊤ (WH - Y)).

Figures 4b and 4c present the NC1 and NC2 metrics of the deepest and intermediate (output of 3 out of the 4 ResBlocks) features, respectively, for CE loss with bias in the FC layer. In all the settings, we reach zero training error at approximately epoch 40. In Figure 5 we repeat the experiments with 5K training samples per class. Furthermore, repeating the experiments with 3 different random seeds for initializing the DNN's parameters yields similar curves that demonstrate the same trends.
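For readers who want to reproduce an NC2-type curve, the sketch below measures how far the class means are from a simplex ETF. It assumes one common form of the metric: the Frobenius distance between the normalized Gram matrix of the globally centered class means and the normalized simplex ETF Gram matrix (I - 11^⊤/K)/√(K-1). The function name `nc2_distance` and this specific normalization are assumptions for illustration and may differ from the definition used for the figures.

```python
import math


def nc2_distance(class_means):
    """Frobenius distance of the normalized Gram matrix of centered class
    means from the normalized simplex ETF Gram matrix (assumed metric).

    class_means: list of K equal-length lists of floats, with K >= 2.
    """
    K = len(class_means)
    dim = len(class_means[0])
    global_mean = [sum(m[j] for m in class_means) / K for j in range(dim)]
    centered = [[m[j] - global_mean[j] for j in range(dim)] for m in class_means]

    # K x K Gram matrix of the centered means and its Frobenius norm.
    gram = [[sum(a * b for a, b in zip(u, v)) for v in centered] for u in centered]
    gram_norm = math.sqrt(sum(g * g for row in gram for g in row))

    # Target: (I - 11^T / K) / sqrt(K - 1), which has unit Frobenius norm.
    scale = 1.0 / math.sqrt(K - 1)
    dist_sq = 0.0
    for i in range(K):
        for j in range(K):
            target = scale * ((1.0 if i == j else 0.0) - 1.0 / K)
            dist_sq += (gram[i][j] / gram_norm - target) ** 2
    return math.sqrt(dist_sq)
```

With this normalization the distance vanishes when the centered means have equal norms and equal pairwise angles, and it is positive otherwise, for example when the means are collinear.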

Samples of classes 4 and 9 are blurred, last layer's WD remains 5e-4. The effect of the blurred classes on the NC metrics (avg. and other classes) is minor.

Samples of classes 4 and 9 are blurred, last layer's WD reduced to 5e-5. The blurred classes affect the "per-class NC1" of other classes and the NC metrics increase.

Samples of classes 4 and 9 are blurred, last layer has no WD. The blurred classes further interfere with other classes and the NC metrics further increase.

Figure 6: The effect of modifying the weight decay (WD) of the last layer's weights on NC metrics for ResNet18 trained on MNIST with 3K samples per class, where samples from classes 4 and 9 are blurred. Observe that a small WD in the last layer increases the effect of the "per-class NC1" curves of the blurred classes on the other classes, and also increases the other NC metrics.

The effect of modifying the weight decay (WD) on NC metrics for ResNet18 trained on CIFAR-10 and MNIST datasets; mean and SD are computed for 3 random seeds. Observe that modifying the WD in the feature mapping increases the deviation from the baseline more than modifying the WD of the last layer.

Similar to previous works, from comparing Figures 4b and 4c (as well as Figures 5b and 5c) we see that the NC distance metrics are larger in the intermediate features, which correlates with the results for our model in Section 3. Examining all the settings of Figures 4 and 5, as well as Table

