PERTURBATION ANALYSIS OF NEURAL COLLAPSE

Abstract

Training deep neural networks for classification often includes minimizing the training loss beyond the zero training error point. In this phase of training, a "neural collapse" behavior has been observed: the variability of features (outputs of the penultimate layer) of within-class samples decreases and the mean features of different classes approach a certain tight frame structure. Recent works analyze this behavior via idealized unconstrained features models where all the minimizers exhibit exact collapse. However, with practical networks and datasets, the features typically do not reach exact collapse, e.g., because deep layers cannot arbitrarily modify intermediate features that are far from being collapsed. In this paper, we propose a richer model that can capture this phenomenon by forcing the features to stay in the vicinity of a predefined features matrix (e.g., intermediate features). We explore the model in the small vicinity case via perturbation analysis and establish results that cannot be obtained by the previously studied models. For example, we prove reduction in the within-class variability of the optimized features compared to the predefined input features (via analyzing gradient flow on the "central-path" with minimal assumptions), analyze the minimizers in the near-collapse regime, and provide insights on the effect of regularization hyperparameters on the closeness to collapse. We support our theory with experiments in practical deep learning settings.

1. INTRODUCTION

Modern classification systems are typically based on deep neural networks (DNNs), whose parameters are optimized using a large amount of labeled training data. Their training scheme often includes minimizing the training loss beyond the zero training error point (Hoffer et al., 2017; Ma et al., 2018; Belkin et al., 2019). In this terminal phase of training, a "neural collapse" (NC) behavior has been empirically observed when using either the cross-entropy (CE) loss (Papyan et al., 2020) or the mean squared error (MSE) loss (Han et al., 2022). The NC behavior comprises several simultaneous phenomena that evolve as the number of epochs grows. The first phenomenon, dubbed NC1, is a decrease in the variability of the features (outputs of the penultimate layer) of training samples from the same class. The second phenomenon, dubbed NC2, is an increasing similarity of the structure of the inter-class feature means (after subtracting the global mean) to a simplex equiangular tight frame (ETF). The third phenomenon, dubbed NC3, is the alignment of the last layer's weights with the inter-class feature means. A consequence of these phenomena is that the classifier's decision rule becomes similar to a nearest-class-center rule in feature space. Many recent works attempt to theoretically analyze the NC behavior (Mixon et al., 2020; Lu & Steinerberger, 2022; Wojtowytsch et al., 2021; Fang et al., 2021; Zhu et al., 2021; Graf et al., 2021; Ergen & Pilanci, 2021; Ji et al., 2021; Galanti et al., 2021; Tirer & Bruna, 2022; Zhou et al., 2022; Thrampoulidis et al., 2022; Yang et al., 2022; Kothapalli et al., 2022). The mathematical frameworks are almost always based on variants of the unconstrained features model (UFM), proposed by Mixon et al. (2020), which treats the (deepest) features of the training samples as free optimization variables (disconnected from the data or from intermediate/shallow features).
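The three phenomena above can be monitored with simple numerical diagnostics. The following is a rough numpy sketch using simplified variants of our own choosing (the exact metric definitions differ across papers): a within/between trace ratio for NC1, the distance of the normalized class-mean Gram matrix from the simplex-ETF Gram for NC2, and a normalized weights/means misalignment for NC3.

```python
import numpy as np

def nc_metrics(H, labels, W):
    """Rough NC1/NC2/NC3 diagnostics. H: d x N feature matrix, labels: (N,) ints,
    W: K x d last-layer weights. These are simplified illustrative variants, not
    the exact metrics of any one paper."""
    K = int(labels.max()) + 1
    mu_G = H.mean(axis=1, keepdims=True)                      # global feature mean
    M = np.stack([H[:, labels == k].mean(axis=1) for k in range(K)], axis=1)
    Mc = M - mu_G                                             # centered class means (d x K)
    # NC1: within-class scatter relative to between-class scatter (trace-ratio variant)
    Sw = sum(((H[:, labels == k] - M[:, [k]]) ** 2).sum() for k in range(K)) / H.shape[1]
    Sb = (Mc ** 2).sum() / K
    nc1 = Sw / Sb
    # NC2: distance of the normalized class-mean Gram matrix from the simplex-ETF Gram
    G = Mc.T @ Mc
    G_etf = np.eye(K) - np.ones((K, K)) / K                   # simplex-ETF Gram (up to scale)
    nc2 = np.linalg.norm(G / np.linalg.norm(G) - G_etf / np.linalg.norm(G_etf))
    # NC3: misalignment between classifier rows and centered class means (self-duality)
    nc3 = np.linalg.norm(W / np.linalg.norm(W) - Mc.T / np.linalg.norm(Mc))
    return nc1, nc2, nc3

# Sanity check on an exactly collapsed configuration: centered class means form a
# (scaled) simplex ETF, zero within-class variability, and W aligned with the means.
K, d, n = 4, 8, 5
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((d, K)))              # orthonormal columns
M = U @ (np.eye(K) - np.ones((K, K)) / K)                     # centered ETF-like means
H = np.repeat(M, n, axis=1)                                   # n identical features per class
labels = np.repeat(np.arange(K), n)
nc1, nc2, nc3 = nc_metrics(H, labels, M.T)                    # all three are ~0 here
```

For features of a real network, all three quantities typically decrease with training but plateau above zero, which is exactly the gap the paper's model is designed to capture.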
Typically, in these "idealized" models all the minimizers exhibit "exact collapse" (i.e., their within-class variability is exactly 0 and an exact simplex ETF structure is demonstrated), provided that an arbitrary (but nonzero) level of regularization is used. However, the features of DNNs are not free optimization variables but outputs of predetermined architectures that take training samples as input and have parameters (shared by all the samples) that are hard to optimize. Thus, the deepest features usually demonstrate reduced "NC distance metrics" (such as within-class variability) compared to the features of intermediate layers, but do not converge to an exact collapse. Indeed, as can be seen in any NC paper that presents empirical results, the decrease in the NC metrics is typically finite and stops above zero at some epoch (the margin depends on the dataset complexity, architecture, hyperparameter tuning, etc.). In this paper, this issue is taken into account by studying a model that can force the features to stay in the vicinity of a predefined features matrix. By considering the predefined features as intermediate features of a DNN, the proposed model allows us to analyze how deep features progress from, or relate to, shallower features. We explore the model in the small vicinity case via perturbation analysis and establish results that cannot be obtained by the previously studied UFMs. Specifically, we prove a reduction in the within-class variability of the optimized features compared to the predefined input features. To obtain this result (for arbitrary input features), we prove a monotonic decrease of the within-class variability along gradient flow on the "central path" of a UFM with minimal assumptions (i.e., we drop the assumptions and modifications of the flow that Han et al. (2022) made to facilitate their analysis). Next, we provide a closed-form approximation for the model's minimizer.
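The decrease of within-class variability during optimization of a UFM can be illustrated with a toy experiment. The sketch below runs plain alternating gradient steps (not the paper's central-path gradient flow) on the regularized MSE UFM; all sizes and hyperparameters are arbitrary illustrative choices.

```python
import numpy as np

K, n, d = 4, 10, 16                              # classes, samples per class, feature dim
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(K), n)
Y = np.kron(np.eye(K), np.ones((1, n)))          # one-hot target matrix Y = I_K ⊗ 1_n^T
W = 0.1 * rng.standard_normal((K, d))            # last-layer weights
H = rng.standard_normal((d, K * n))              # unconstrained features (free variables)
lam_W, lam_H, lr = 5e-3, 5e-3, 0.3

def within_class_var(H):
    # average squared distance of each feature vector from its class mean
    M = np.stack([H[:, labels == k].mean(axis=1) for k in range(K)], axis=1)
    return sum(((H[:, labels == k] - M[:, [k]]) ** 2).sum() for k in range(K)) / H.shape[1]

v0 = within_class_var(H)
for _ in range(2000):
    # gradients of (1/2Kn)||WH - Y||_F^2 + (lam_W/2K)||W||_F^2 + (lam_H/2Kn)||H||_F^2
    E = W @ H - Y
    W -= lr * (E @ H.T / (K * n) + lam_W * W / K)
    H -= lr * (W.T @ (W @ H - Y) / (K * n) + lam_H * H / (K * n))
v1 = within_class_var(H)                         # noticeably smaller than v0
```

In this idealized model the within-class variability shrinks toward zero, whereas with real networks the analogous quantity plateaus at a positive value, motivating the vicinity-constrained model studied here.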
Then, focusing on the case where the input features matrix is already near collapse (e.g., the penultimate features of a well-trained DNN), we present a fine-grained analysis of our closed-form approximation, which provides insights on the effect of regularization hyperparameters on the closeness to collapse. We support our theory with experiments in practical deep learning settings.

2. BACKGROUND AND PROBLEM SETUP

Consider a classification task with K classes and n training samples per class. Let us denote by y_k ∈ R^K the one-hot vector with 1 in its k-th entry and by x_{k,i} ∈ R^p the i-th training sample of the k-th class. DNN-based classifiers can typically be expressed as DNN_Θ(x) = W h_θ(x) + b, where h_θ(·) : R^p → R^d (with d ≥ K) is the feature mapping, composed of multiple layers with learnable parameters θ, and W = [w_1, . . . , w_K]^⊤ ∈ R^{K×d} (w_k^⊤ denotes the k-th row of W) and b ∈ R^K are the weights and bias of the last classification layer. The network's parameters Θ = {W, b, θ} are usually learned by empirical risk minimization:

min_Θ (1/(Kn)) Σ_{k=1}^{K} Σ_{i=1}^{n} L(W h_θ(x_{k,i}) + b, y_k) + R(Θ),

where L(·, ·) is a loss function (e.g., CE or MSE^1) and R(·) is a regularization term.

Following the work of Mixon et al. (2020), in order to mathematically show the emergence of minimizers with NC structure, most of the theoretical papers have followed the "unconstrained features model" (UFM) approach, where the features {h_θ(x_{k,i})} are treated as free optimization variables {h_{k,i}}. Namely, they study problems of the form

min_{W, b, {h_{k,i}}} (1/(Kn)) Σ_{k=1}^{K} Σ_{i=1}^{n} L(W h_{k,i} + b, y_k) + R(W, b, {h_{k,i}}).

One such example is the work in (Tirer & Bruna, 2022), which considered a setting with a regularized MSE loss (which shares similarity with models in the matrix factorization literature (Koren et al., 2009; Chi et al., 2019), except for the assumption that d ≥ K and the specific structure of Y):

min_{W, H} (1/(2Kn)) ∥WH − Y∥_F^2 + (λ_W/(2K)) ∥W∥_F^2 + (λ_H/(2Kn)) ∥H∥_F^2,

where H = [h_{1,1}, . . . , h_{1,n}, h_{2,1}, . . . , h_{K,n}] ∈ R^{d×Kn} is the (organized) unconstrained features matrix, Y = I_K ⊗ 1_n^⊤ ∈ R^{K×Kn} (where ⊗ denotes the Kronecker product) is its associated one-hot vectors matrix, and λ_W and λ_H are positive regularization hyperparameters. It was shown that

^1 Hui & Belkin (2021) have shown that training DNN classifiers with the MSE loss is a powerful strategy whose performance is similar to training with the CE loss.
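As a concrete illustration of this setup, here is a minimal numpy sketch of a classifier of the form DNN_Θ(x) = W h_θ(x) + b; the layer sizes, initialization scales, and names are arbitrary illustrative choices, not the architectures used in the experiments.

```python
import numpy as np

p, d, K = 20, 10, 4                     # input dim, feature dim (d >= K), number of classes
rng = np.random.default_rng(0)
# parameters theta of the feature mapping h_theta (a two-layer MLP here, for illustration)
theta = {"A1": 0.1 * rng.standard_normal((32, p)), "b1": np.zeros(32),
         "A2": 0.1 * rng.standard_normal((d, 32)), "b2": np.zeros(d)}
W = 0.1 * rng.standard_normal((K, d))   # last-layer weights, K x d
b = np.zeros(K)                         # last-layer bias

def h_theta(x, theta):
    # feature mapping R^p -> R^d: hidden ReLU layer, then the penultimate (feature) layer
    z = np.maximum(theta["A1"] @ x + theta["b1"], 0.0)
    return theta["A2"] @ z + theta["b2"]

def dnn(x):
    # full classifier: logits in R^K; the predicted class is the argmax entry
    return W @ h_theta(x, theta) + b

x = rng.standard_normal(p)              # a single training sample
logits = dnn(x)
y_2 = np.eye(K)[2]                      # one-hot target y_k for class k = 2
```

Training then minimizes the empirical risk above over Θ = {W, b, θ}; the NC phenomena concern the features h_θ(x_{k,i}) that this mapping produces late in training.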

