ON THE IMPLICIT BIAS TOWARDS DEPTH MINIMIZATION IN DEEP NEURAL NETWORKS

Anonymous

Abstract

Recent results in the literature suggest that the penultimate (second-to-last) layer representations of neural networks trained for classification exhibit a clustering property called neural collapse (NC). We study the implicit bias of stochastic gradient descent (SGD) to favor low-depth solutions when training deep neural networks. We characterize a notion of effective depth that measures the lowest layer for which sample embeddings are separable using the nearest class-center classifier. We hypothesize, and empirically verify, that SGD implicitly selects neural networks of small effective depth. Second, while neural collapse emerges even when generalization should be impossible, we argue that the degree of separability in the intermediate layers is related to generalization. We derive a generalization bound based on comparing the effective depth of the network with the minimal depth required to fit the same dataset with partially corrupted labels. Remarkably, this bound provides non-trivial estimates of the test performance and is independent of the depth.

1. INTRODUCTION

Deep learning systems have steadily advanced the state of the art in a wide range of benchmarks, demonstrating impressive performance in tasks ranging from image classification (Taigman et al.,



Traditional generalization bounds (Vapnik, 1998; Shalev-Shwartz & Ben-David, 2014; Mohri et al., 2012; Bartlett & Mendelson, 2003) are based on uniform convergence. In this approach, instead of directly analyzing the population error of a learning algorithm, a uniform convergence-type argument controls the worst-case generalization gap (the distance between train and test errors) over a class of predictors containing the outputs of the learning algorithm. Typically, this is done because for many algorithms it is difficult to exactly characterize the learned predictor. Nagarajan & Kolter (2019), however, raised significant questions about the applicability of typical uniform convergence arguments to certain interpolation learning regimes. They described theoretical settings in which an interpolation learning algorithm generalizes well but no uniform convergence bound can certify it. Following their work, Bartlett & Long (2021); Zhou et al. (2020); Negrea et al. (2020); Yang et al. (2021) all demonstrated the failure of forms of uniform convergence in various interpolation learning setups.

Contributions. Because of the inherent limitations of uniform convergence bounds, in this paper we pursue a novel approach for measuring generalization in deep learning that is not based on uniform convergence. Instead, our bound suggests that the model performs well at test time if its complexity is small compared to the complexity of a network required to fit the same dataset with partially random labels.
In other words, even if a trained network has a complexity greater than the number of training samples, it may be less complex than a model that fits partially random labels. As a result, in such cases, our bound may provide a non-trivial estimate of the test error. To formally describe our notion of complexity, we employ the notion of nearest class-center (NCC) separability. This property asserts that the feature embeddings associated with training samples belonging to the same class are separable according to the nearest class-center decision rule. While the original results (Papyan et al., 2020) observed NCC separability at the penultimate layer of trained networks, recent results (Ben-Shaul & Dekel, 2022) observed NCC separability also in intermediate layers.

Table 1: Comparing our bound with baseline bounds in the literature for networks of varying depths. Our error bound is reported in the fourth row, and the baseline bounds are reported in the bottom rectangle. While the test error is universally bounded by 1, the baseline bounds are much larger than 1 and are therefore meaningless. In contrast, our bound achieves relatively tight estimations of the test error and, unlike the baseline bounds, is fairly unaffected by the network's depth.

(Bartlett & Mendelson, 2003):       8.911e+14  1.74e+17   2.13e+22   3.613e+17  9.145e+18  4.088e+22  1.076e+23  6.682e+28  2.758e+35
L3,1.5 (Neyshabur et al., 2015):    5.462e+05  1.6e+06    1.308e+06  7.523e+07  6.997e+07  2.636e+08  4.633e+08  2.275e+09  5.061e+09
Frobenius (Neyshabur et al., 2015): 1.848e+06  8.194e+06  2.216e+07  2.486e+08  2.335e+08  1.585e+09  1.967e+09  1.442e+10  3.038e+11
Spec L1 (Bartlett et al., 2017):    2.861e+05  6.412e+05  9.566e+05  4.706e+06  3.516e+06  3.176e+06  1.19e+07   1.449e+08  1.272e+10
Spec Frob (Neyshabur et al., 2019): 3.948e+03  1.1199e+04 1.538e+04  4.0229e+04 2.884e+04  2.543e+04  9.4833e+04 1.011e+06  1.033e+08
In this work, we introduce the notion of 'effective depth' of a neural network, namely, the lowest layer for which its features are NCC separable (see Sec. 3.2). We make several important observations regarding effective depths. (i) We empirically show that the effective depth of trained networks monotonically increases with the amount of random labels in the data. (ii) We observe that when training sufficiently deep networks, they converge to (approximately) the same effective depth $L_0$. Furthermore, as we show in Tab. 1, unlike traditional generalization bounds, our bound is empirically non-vacuous and independent of depth.

1.1. RELATED WORK

Neural collapse and generalization. Our work is closely related to the recent line of work on neural collapse (Papyan et al., 2020; Han et al., 2022). Neural collapse identifies training dynamics of deep networks for standard classification tasks, in which the feature embeddings associated with training samples belonging to the same class tend to concentrate around their class means. While several papers analyzed the emergence of neural collapse from a theoretical standpoint (e.g., Zhu et al., 2021; Rangamani et al., 2022; Lu & Steinerberger, 2020; Fang et al., 2021; Ergen & Pilanci, 2021), its specific role in deep learning and its potential relationship with generalization are still unclear. Recent work (Galanti et al., 2022a; Xu et al., 2022; Galanti et al., 2022b) studied the conditions under which class-feature variation collapse generalizes from the train samples to test samples and to new classes in the transfer learning setting. In this work, we focus on the following (independent) question: Is neural collapse a good indicator of how well a network generalizes? As a counter-argument, Zhu et al. (2021) provided empirical evidence that neural collapse occurs even when training the network with random labels. As a result, the presence of neural collapse alone cannot indicate whether or not the network generalizes. This experiment, however, does not rule out the possibility of an indirect relationship between neural collapse and generalization. We contend that the degree of separability in the intermediate layers is related to generalization.

Emergence of structure in deep networks. While various papers (Papyan, 2020; Tirer & Bruna, 2022; Galanti et al., 2022a; Ben-Shaul & Dekel, 2022; Cohen et al., 2018; Alain & Bengio, 2017; Montavon et al., 2011; Papyan et al., 2017; Ben-Shaul & Dekel, 2021; Shwartz-Ziv & Tishby, 2017) investigated geometric properties of intermediate layers (e.g., clustering and separability), this paper is the first to demonstrate that deep neural networks tend to converge to a minimal effective depth that is independent of the network's depth. Even though one can derive "effective depths" from the experiments of Cohen et al. (2018), we show that when training sufficiently deep networks, they converge to (approximately) the same effective depth.

2. PROBLEM SETUP

In this section we describe the learning setting used in our theory and experiments. We consider the problem of training a model for standard multi-class classification. Formally, the target task is defined by a distribution $P$ over samples $(x, y) \in \mathcal{X} \times \mathcal{Y}_C$, where $\mathcal{X} \subset \mathbb{R}^d$ is the instance space and $\mathcal{Y}_C$ is a label space of cardinality $C$. To simplify the presentation, we use one-hot encoding for the label space, that is, the labels are represented by the unit vectors in $\mathbb{R}^C$, and $\mathcal{Y}_C := \{e_c : c = 1, \dots, C\}$, where $e_c \in \mathbb{R}^C$ is the $c$'th standard unit vector; with a slight abuse of notation, we allow ourselves to write $y = c$ instead of $y = e_c$. For a pair $(x, y)$ distributed according to $P$, we denote by $P_c$ the class-conditional distribution of $x$ given $y = c$ (i.e., $P_c(\cdot) := P[x \in \cdot \mid y = c]$). A classifier $h_W : \mathcal{X} \to \mathbb{R}^C$ assigns a soft label to an input point $x \in \mathcal{X}$, and its performance on the distribution $P$ is measured by the expected risk $L_P(h_W) := \mathbb{E}_{(x,y) \sim P}[\ell(h_W(x), y)]$, where $\ell : \mathbb{R}^C \times \mathcal{Y}_C \to [0, \infty)$ is a non-negative loss function (e.g., the $L_2$ or cross-entropy losses). We typically do not have direct access to the full population distribution $P$. Therefore, we generally aim to learn a classifier $h$ using a balanced training dataset $S := \{(x_i, y_i)\}_{i=1}^m = \cup_{c=1}^C S_c = \cup_{c=1}^C \{(x_{ci}, y_{ci})\}_{i=1}^{m_0} \sim P_B(m)$ of $m = C \cdot m_0$ samples, consisting of $m_0$ independent and identically distributed (i.i.d.) samples drawn from $P_c$ for each $c \in [C]$. Specifically, we intend to find $W$ that minimizes the regularized empirical risk $L^\lambda_S(h_W) := \frac{1}{m} \sum_{i=1}^m \ell(h_W(x_i), y_i) + \lambda \|W\|^2_2$, where the regularization controls the complexity of the function $h_W$ and typically helps reduce overfitting. Finally, the performance of the trained model is evaluated using the train and test error rates, $\mathrm{err}_S(h_W) := \frac{1}{m} \sum_{i=1}^m \mathbb{I}[\arg\max_{c} h_W(x_i)_c \neq y_i]$ and $\mathrm{err}_P(h_W) := \mathbb{E}_{(x,y) \sim P}\left[\mathbb{I}[\arg\max_{c} h_W(x)_c \neq y]\right]$.

Neural networks. In this work, the classifier $h_W$ is a neural network, decomposed into a set of parametric layers.
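The train error rate defined above is simply the fraction of arg-max mistakes. A minimal numpy sketch (the function name and toy data are ours, not the paper's):

```python
import numpy as np

def error_rate(logits, labels):
    """Fraction of samples whose arg-max prediction disagrees with the label.

    logits: (m, C) array of network outputs h_W(x_i).
    labels: (m,) array of integer class labels in {0, ..., C-1}.
    """
    preds = np.argmax(logits, axis=1)
    return float(np.mean(preds != labels))

# Toy check: predictions are 0, 1, 0, 1, so two of four samples are wrong.
logits = np.array([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2], [0.1, 0.8]])
labels = np.array([0, 0, 1, 1])
err = error_rate(logits, labels)  # 0.5
```

The test error $\mathrm{err}_P$ replaces the average over $S$ by an expectation over $P$ and is estimated on a held-out set in the same way.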
Formally, we write $h_W := e_{W_e} \circ f^L_{W_f} := e_{W_e} \circ g^L_{W_L} \circ \cdots \circ g^1_{W_1}$, where $g^i_{W_i} \in \{g' : \mathbb{R}^{p_i} \to \mathbb{R}^{p_{i+1}}\}$ are parametric functions and $e_{W_e} \in \{e' : \mathbb{R}^{p_{L+1}} \to \mathbb{R}^C\}$ is a linear function. For example, $g^i_{W_i}$ could be a standard linear or convolutional layer, a residual block, or a pooling layer. Throughout, $\sigma$ denotes the element-wise ReLU activation function. With a slight abuse of notation, we omit the subscripted weights and write $f^i := g^i \circ \cdots \circ g^1$ and $h := h_W$.

Optimization. We optimize our models to minimize the regularized empirical risk $L^\lambda_S(h)$ by applying SGD for a certain number of iterations $T$ with coefficient $\lambda > 0$. Specifically, we initialize the weights $W^0 = \gamma$ of $h$ using a standard initialization procedure, and at each iteration we update $W^{t+1} \leftarrow W^t - \mu_t \nabla_W L^\lambda_{\tilde{S}}(h^t)$, where $\mu_t > 0$ is the learning rate at the $t$'th iteration and the mini-batch $\tilde{S} \subset S$ of size $B$ is selected uniformly at random. Throughout the paper, we denote by $h^\gamma_S$ the output of the learning algorithm starting from the initialization $W^0 = \gamma$. When $\gamma$ is irrelevant or obvious from context, we simply write $h^\gamma_S = h_S = e_S \circ f_S$.

3. NEURAL COLLAPSE AND GENERALIZATION

In this section we theoretically explore the relationship between neural collapse and generalization. We start by introducing neural collapse, NCC separability, and the effective depth of neural networks. Then, we connect these notions with the test-time performance of neural networks.

In this paper we focus on a weak form of NC4 that we call 'nearest class-center separability' (NCC separability). Formally, given a dataset $S = \cup_{c=1}^C S_c$ and a mapping $f : \mathbb{R}^d \to \mathbb{R}^p$, the features of $f$ are NCC separable (w.r.t. $S$) if for all $i \in [m]$ we have $\hat{h}(x_i) = y_i$, where $\hat{h}(x) := \arg\min_{c \in [C]} \|f(x) - \mu_f(S_c)\|$ and $\mu_f(S_c)$ is the mean of the features of the samples in $S_c$. To measure the degree of NCC separability of a feature map $f$, we use the train and test classification error rates of the NCC classifier on top of the given layer, $\mathrm{err}_S(\hat{h})$ and $\mathrm{err}_P(\hat{h})$. Essentially, NC4 asserts that during training, the feature embeddings in the penultimate layer become separable and the classifier $h$ itself converges to the nearest class-center classifier $\hat{h}$.

3.2. EFFECTIVE DEPTHS AND GENERALIZATION

In this section we study the effective depths of neural networks and their connection with generalization. To formally define this notion, we focus on neural networks whose $L$ top-most layers are of the same size. We observe that neural networks trained for standard classification exhibit an implicit bias towards depth minimization.

Observation 1 (Minimal depth hypothesis). Suppose we have a dataset $S$. There exists an integer $L_0 \geq 1$ such that, if we train a neural network of any depth $L \geq L_0$ for cross-entropy minimization on $S$ using SGD with weight decay, the learned features $f^l$ become (approximately) NCC separable for all $l \in \{L_0, \dots, L\}$.

We note that if the $L_0$'th layer of $f^L$ exhibits NCC separability, we could correctly classify the samples already at the $L_0$'th layer of $f^L$ using a linear classifier (i.e., the nearest class-center classifier). Therefore, intuitively, the depth of the network is effectively upper bounded by $L_0$. The notion of effective depth of a neural network is formally defined as follows.

Definition 1 ($\epsilon$-effective depth). Suppose we have a dataset $S$ and a neural network $h = e \circ g^L \circ \cdots \circ g^1$ with $g^1 : \mathbb{R}^n \to \mathbb{R}^{p_2}$, $g^i : \mathbb{R}^{p_i} \to \mathbb{R}^{p_{i+1}}$ and linear classifier $e : \mathbb{R}^{p_{L+1}} \to \mathbb{R}^C$. Let $\hat{h}_i(x) := \arg\min_{c \in [C]} \|f^i(x) - \mu_{f^i}(S_c)\|$. The $\epsilon$-effective depth $d^\epsilon_S(h)$ of the network $h$ is the minimal value $i \in [L]$ such that $\mathrm{err}_S(\hat{h}_i) \leq \epsilon$ (and $d^\epsilon_S(h) = L$ if no such $i \in [L]$ exists).

To avoid confusion, we note that the $\epsilon$-effective depth is a property of a neural network and not of the function it implements: the same function can be implemented by two architectures of different effective depths. While our empirical observations in Sec. 4 suggest that the optimizer learns neural networks of low effective depths, it is not necessarily the lowest depth that allows NCC separability. As a next step, we define the $\epsilon$-minimal NCC depth.
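Given the per-layer NCC train errors, Definition 1 reduces to a scan for the first layer below the threshold. A small sketch under assumed names (the error values are hypothetical):

```python
def effective_depth(layer_ncc_errors, eps):
    """epsilon-effective depth d^eps_S(h) from per-layer NCC train errors.

    layer_ncc_errors: sequence of err_S(h_hat_i) for layers i = 1, ..., L.
    Returns the smallest i with error <= eps, or L if no layer qualifies.
    """
    L = len(layer_ncc_errors)
    for i, e in enumerate(layer_ncc_errors, start=1):
        if e <= eps:
            return i
    return L

# Hypothetical per-layer NCC errors for a 5-layer network.
errors = [0.40, 0.20, 0.015, 0.004, 0.001]
```

Note how the result depends on $\epsilon$: with `eps=0.02` the effective depth is 3, while the stricter `eps=0.005` pushes it to 4, matching the role of $\epsilon$ in Fig. 2.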
Intuitively, the minimal NCC depth of a given architecture is the minimal value $L \in \mathbb{N}$ for which there exists a neural network of depth $L$ whose features are NCC separable. As we will show, the relationship between the $\epsilon$-effective depth of a neural network and the $\epsilon$-minimal NCC depth is connected with generalization.

Definition 2 ($\epsilon$-minimal NCC depth). Suppose we have a dataset $S = \cup_{c=1}^C S_c$ and a neural network architecture $f^L = g^L \circ \cdots \circ g^1$ with $g^1 : \mathbb{R}^n \to \mathbb{R}^{n_0}$ and $g^i \in \mathcal{G} \subset \{g' \mid g' : \mathbb{R}^{n_0} \to \mathbb{R}^{n_0}\}$ for all $i = 2, \dots, L$. The $\epsilon$-minimal NCC depth of $\mathcal{G}$ is the minimal depth $L$ for which there exist parameters $W = \{W_i\}_{i=1}^L$ such that $f' := f^L_W = g^L_{W_L} \circ \cdots \circ g^1_{W_1}$ satisfies $\mathrm{err}_S(\hat{h}) \leq \epsilon$, where $\hat{h}(x) := \arg\min_{c \in [C]} \|f'(x) - \mu_{f'}(S_c)\|$. We denote the $\epsilon$-minimal NCC depth by $d^\epsilon_{\min}(\mathcal{G}, S)$.

To study the performance of a given model, we consider the following setup. Let $S_1 = \{(x^1_i, y^1_i)\}_{i=1}^m$ and $S_2 = \{(x^2_i, y^2_i)\}_{i=1}^m$ be two balanced datasets. We think of them as two splits of the training dataset $S$. We assume that the classifier $h^\gamma_{S_1}$ is trained on $S_1$ and we use $S_2$ to evaluate its performance. We denote by $X_j = \{x^j_i\}_{i=1}^m$ and $Y_j = \{y^j_i\}_{i=1}^m$ the instances and labels in $S_j$. To formally state our bound, we make two technical assumptions. The first is that the misclassified labels that $h^\gamma_{S_1}$ produces over the samples $X_2 = \cup_{c=1}^C \{x^2_{ci}\}_{i=1}^{m_0}$ are distributed uniformly.

Definition 3 ($\delta_m$-uniform mistakes). We say that the mistakes of a learning algorithm $\mathcal{A} : (S_1, \gamma) \mapsto h^\gamma_{S_1}$ are $\delta_m$-uniform if, with probability $\geq 1 - \delta_m$ over the selection of $S_1, S_2 \sim P_B(m)$, the values and indices of the mistaken labels of $h^\gamma_{S_1}$ over $X_2$ are uniformly distributed (as a function of $\gamma$).

The above definition imposes two conditions on the learning algorithm. It assumes that with high probability (over the selection of $S_1, S_2$), $h^\gamma_{S_1}$ makes the same number of mistakes on $S_2$ across all initializations $\gamma$.
In addition, it assumes that the mistakes are distributed uniformly across the samples in $S_2$ and that their (incorrect) values are also distributed uniformly. While these assumptions may be violated in practice, the train error typically has a small variance, and the mistakes are close to uniformly distributed when the classes are non-hierarchical (e.g., CIFAR10, MNIST). For the second assumption, we consider the following term. Let $p \in (0, 1/2)$ and $\alpha \in (0, 1)$; we denote
$\delta^2_{m,p,\alpha} := P_{S_1, S_2, \tilde{Y}_2, \hat{Y}_2}\left[\exists\, q \geq (1 + \alpha)p : d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \tilde{S}_2) > \mathbb{E}_{\hat{Y}_2}[d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \hat{S}_2)]\right]$,
where $\tilde{Y}_2 = \{\tilde{y}_i\}_{i=1}^m$ and $\hat{Y}_2 = \{\hat{y}_i\}_{i=1}^m$ are uniformly selected sets of labels that disagree with $Y_2$ on $pm$ and $qm$ values (resp.), and $\tilde{S}_2$ and $\hat{S}_2$ are the datasets obtained by replacing the labels of $S_2$ with $\tilde{Y}_2$ and $\hat{Y}_2$ (resp.). We assume that $\delta^2_{m,p,\alpha}$ is small. That is, with high probability, the minimal depth needed to fit $(2 - p)m$ correct labels and $pm$ random labels is upper bounded by the expected minimal depth needed to fit $(2 - q)m$ correct labels and $qm$ random labels, for any $q \geq (1 + \alpha)p$. To understand this assumption, note that in both cases the model has to fit at least $m$ correct labels and $pm$ (or $qm$) random labels; however, we typically need to increase the capacity of the model in order to fit larger amounts of random labels (see Fig. 3).

Following the setting above, we are ready to formulate our generalization bound.

Proposition 1. Let $m \in \mathbb{N}$, $p \in (0, 1/2)$, $\alpha \in (0, 1)$ and $\epsilon \in (0, 1)$. Assume that the error of the learning algorithm is $\delta^1_m$-uniform and that $S_1, S_2 \sim P_B(m)$. Let $h^\gamma_{S_1}$ be the output of the learning algorithm given access to a dataset $S_1$ and initialization $\gamma$. Then,
$\mathbb{E}_{S_1} \mathbb{E}_\gamma[\mathrm{err}_P(h^\gamma_{S_1})] \leq P_{S_1, S_2, \tilde{Y}_2}\left[\mathbb{E}_\gamma[d^\epsilon_{S_1}(h^\gamma_{S_1})] \geq d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \tilde{S}_2)\right] + (1 + \alpha)p + \delta^1_m + \delta^2_{m,p,\alpha}$,
where $\tilde{Y}_2 = \{\tilde{y}_i\}_{i=1}^m$ is a uniformly selected set of labels that disagrees with $Y_2$ on $pm$ values.
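The corrupted labelings used above ($\tilde{Y}_2$, $\hat{Y}_2$) can be generated by flipping exactly $pm$ uniformly chosen entries to uniformly chosen disagreeing classes. A small numpy sketch of this construction (function name is ours):

```python
import numpy as np

def corrupt_labels(labels, p, num_classes, rng):
    """Return a copy of `labels` that disagrees on exactly round(p*m) entries.

    Each corrupted entry is replaced by a uniformly chosen *different* class,
    matching the construction of the corrupted labelings in Sec. 3.2.
    """
    m = len(labels)
    k = int(round(p * m))
    corrupted = labels.copy()
    idx = rng.choice(m, size=k, replace=False)  # uniform choice of positions
    for i in idx:
        # Uniform over the C - 1 labels that disagree with the original one.
        offset = rng.integers(1, num_classes)
        corrupted[i] = (labels[i] + offset) % num_classes
    return corrupted

rng = np.random.default_rng(0)
y = np.arange(100) % 10                      # balanced 10-class labels
y_tilde = corrupt_labels(y, 0.25, 10, rng)   # disagrees on exactly 25 entries
```

The same helper with corruption level $q \geq (1+\alpha)p$ produces the comparison labelings $\hat{Y}_2$.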
The above proposition provides an upper bound on the expected test error of the classifier $h^\gamma_{S_1}$, which is the quantity we would like to control. The proposition assumes that the mistakes $h^\gamma_{S_1}$ makes on $X_2$ are distributed uniformly (with probability $\geq 1 - \delta^1_m$). To account for the likelihood that this assumption fails, our bound includes the term $\delta^1_m$, which is assumed to be small. Informally, the bound suggests the following procedure to evaluate the performance of $h^\gamma_{S_1}$. We start with an initial guess $p_m = p \in (0, 1/2)$ of the test error of $h^\gamma_{S_1}$. Using this guess, we compare its $\epsilon$-effective depth with the $\epsilon$-minimal NCC depth $d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \tilde{S}_2)$ required to NCC separate the samples in $S_1 \cup \tilde{S}_2$, where $\tilde{S}_2$ is the result of randomly relabeling $p_m m$ of $S_2$'s labels. Intuitively, if the mistakes of $h^\gamma_{S_1}$ are uniformly distributed and its $\epsilon$-effective depth is smaller than $d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \tilde{S}_2)$, then we expect $h^\gamma_{S_1}$ to make at most $p_m m$ mistakes on $S_2$. Therefore, in a sense, the choice of $p_m$ serves as a 'guess' of whether the effective depth of a model trained on $S_1$ is likely to be smaller than the $\epsilon$-minimal NCC depth required to NCC separate the samples in $S_1 \cup \tilde{S}_2$.

Next, we interpret each term separately. The term $\mathbb{E}_\gamma[d^\epsilon_{S_1}(h^\gamma_{S_1})]$ depends on the complexity of the classification problem and on the implicit bias of SGD towards networks of small $\epsilon$-effective depth. In the worst case, if SGD does not minimize the $\epsilon$-effective depth, or if the labels in $S_1$ are random (and $m$ is sufficiently large), we expect $\mathbb{E}_\gamma[d^\epsilon_{S_1}(h^\gamma_{S_1})] = L$. On the other hand, $d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \tilde{S}_2)$ measures the complexity of a task that involves fitting a dataset of $2m$ samples, where $(2 - p_m)m \geq m$ of the labels are correct and $p_m m$ are random. By decreasing $p_m$, we expect $d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \tilde{S}_2)$ to decrease, making the first term in the bound larger.
In addition, if $h = e \circ f^L$ is a neural network of fixed width, it is impossible to fit an increasing amount of random labels without increasing the depth. Therefore, when $p_m m \to \infty$ as $m \to \infty$, the dataset $S_1 \cup \tilde{S}_2$ becomes increasingly harder to fit, and we expect $d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \tilde{S}_2)$ to tend to infinity. If $\mathbb{E}_\gamma[d^\epsilon_{S_1}(h^\gamma_{S_1})]$ is bounded as a function of $L$ and $m$, and $p_m m \to \infty$, we obtain that $P[\mathbb{E}_\gamma[d^\epsilon_{S_1}(h^\gamma_{S_1})] \geq d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \tilde{S}_2)] \to 0$ as $m \to \infty$; together with $p_m \to 0$, this gives $\mathbb{E}_{S_1}[\mathrm{err}_P(h_{S_1})] \leq \delta^1_m + \delta^2_{m,p,\alpha} + o_m(1)$. As a side note, computing the expectation over $S_1, S_2$ in the bound exactly is impossible due to our limited access to the training data. Instead, we empirically estimate this term using a set of $k$ pairs $(S^i_1, S^i_2)$ of $m$ samples each, which adds a term that scales as $O(1/\sqrt{k})$ to the bound (see Prop. 2 in the appendix).
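The Monte Carlo estimation of the probability term is a plain empirical average of indicators over the $k$ pairs, which is where the $O(1/\sqrt{k})$ error comes from. A minimal sketch with hypothetical depth values (the function name and numbers are ours):

```python
import numpy as np

def estimate_probability_term(effective_depths, minimal_ncc_depths):
    """Monte Carlo estimate of P[E_gamma[d^eps(h)] >= d^eps_min(G, S1 u S~2)].

    effective_depths[i]  : estimate of E_gamma[d^eps_{S1^i}(h)] on the i-th split.
    minimal_ncc_depths[i]: d^eps_min for the corresponding corrupted dataset.
    With k i.i.d. pairs, the estimation error scales as O(1/sqrt(k)).
    """
    d = np.asarray(effective_depths, dtype=float)
    d_min = np.asarray(minimal_ncc_depths, dtype=float)
    return float(np.mean(d >= d_min))

# Hypothetical values for k = 5 splits: the event occurs on one split only.
p_hat = estimate_probability_term([5.2, 5.0, 5.4, 5.2, 5.0], [6, 6, 5, 6, 7])
```

Here `p_hat` is 0.2, since the effective depth reaches the minimal NCC depth on one of the five splits.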

3.3. COMPARING PROP. 1 WITH STANDARD GENERALIZATION BOUNDS

Classic bounds (e.g., Vapnik, 1998) bound the test error by the sum of the train error and a term $O(\sqrt{C(\mathcal{H})/m})$, where $C(\mathcal{H})$ measures the complexity (e.g., VC dimension) of the class $\mathcal{H}$ (e.g., neural networks of a given architecture) and $m$ is the number of training samples. However, as discussed in Sec. 1, these bounds are vacuous in overparameterized learning regimes (e.g., training ResNet-50 on CIFAR10 classification). For instance, for VC-dimension based bounds (Vapnik, 1998), $C(\mathcal{H})$ equals the VC dimension of the class $\mathcal{H}$, which scales with the number of trainable parameters for ReLU networks (Bartlett et al., 2019). For example, even though the ResNet-50 architecture generalizes well when trained on CIFAR10, it has over 23 million parameters, compared to the $m = 50000$ training samples in the dataset. More recently, Neyshabur et al. (2015); Bartlett et al. (2017); Golowich et al. (2017); Neyshabur et al. (2018) suggested generalization bounds for neural networks that only weakly depend on uniform convergence. In these bounds, the class complexity $C(\mathcal{H})$ is replaced with the individual complexity $C(h_W)$ of the learned function. For example, Golowich et al. (2017) proposed bounds that scale with $C(h_W) = \rho^2 L$, where $L$ is the depth of $h_W$ and $\rho$ measures the product of the norms of its weight matrices. However, Nagarajan & Kolter (2019) showed that in certain cases unregularized least squares can generalize well even when its norm $\rho$ scales as $\Theta(\sqrt{m})$ and the bound becomes $\Theta_m(1)$. Furthermore, these bounds tend to be very large in practice (see Tab. 8 in Neyshabur et al. (2019) and Tab. 1) and are negatively correlated with the test performance (Jiang et al., 2020). In addition, if the norms of the network's weight matrices are larger than 1, quantities like $\rho$ grow exponentially with $L$; as shown in Tab. 1, this is empirically the case. Our Prop. 1 offers a different way to measure generalization.
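The exponential growth of norm-product quantities like $\rho$ with depth is easy to see numerically: for random matrices whose spectral norm exceeds 1, the product of per-layer norms blows up as layers are added. An illustrative numpy sketch (widths and scales are our choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_norm_product(depth, width=100):
    """Product of spectral norms of `depth` random square weight matrices.

    For (1/sqrt(width))-scaled Gaussian matrices, each spectral norm is
    about 2, so the product grows roughly like 2^depth, illustrating why
    norm-product complexities explode when the depth L is increased.
    """
    rho = 1.0
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        rho *= np.linalg.norm(W, 2)  # largest singular value
    return rho

shallow = spectral_norm_product(3)
deep = spectral_norm_product(12)
```

With these scalings, `deep` is already orders of magnitude larger than `shallow`, mirroring the growth of the baseline bounds across the columns of Tab. 1.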
Since this bound is not based on uniform convergence, it does not require the network's complexity to be small in comparison to $m$; rather, the bound guarantees generalization if the network's effective size is smaller than that of a network that fits partially random labels. For instance, when the optimizer has a strong bias towards minimizing the effective depth, $\mathbb{E}_\gamma[d^\epsilon_{S_1}(h^\gamma_{S_1})] \approx d^\epsilon_{\min}(\mathcal{G}, S_1)$, which is by definition upper bounded by $d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \tilde{S}_2)$. We note that $d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \tilde{S}_2)$ grows to infinity as $m \to \infty$ (since the network needs to memorize an increasing number of random labels). On the other hand, $d^\epsilon_{\min}(\mathcal{G}, S_1)$ is bounded by the depth of a network that approximates the target function $y$ up to an approximation error $\epsilon$ (which typically exists by universal approximation arguments). Therefore, for sufficiently large $m$, we expect to have $d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \tilde{S}_2) > d^\epsilon_{\min}(\mathcal{G}, S_1)$. As we empirically observe in Sec. 4, the effective depths of SGD-trained networks are usually small. Unlike previous bounds, our bound has the advantage of being fairly independent of $L$. Namely, when the minimal depth hypothesis (Obs. 1) holds, we expect $\mathbb{E}_\gamma[d^\epsilon_{S_1}(h^\gamma_{S_1})]$ to be unaffected by the depth $L$ of $h^\gamma_{S_1}$ (as long as $L \geq L_0$). Since $d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \tilde{S}_2)$ is by definition independent of $L$, we expect $P[\mathbb{E}_\gamma[d^\epsilon_{S_1}(h^\gamma_{S_1})] \geq d^\epsilon_{\min}(\mathcal{G}, S_1 \cup \tilde{S}_2)]$ to be independent of $L$ (when $L \geq L_0$). In Tab. 1 we empirically validate that our bound does not grow when increasing $L$.

4. EXPERIMENTS

In this section, we experimentally analyze the emergence of neural collapse in the intermediate layers of neural networks. First, we validate the minimal depth hypothesis (Obs. 1). Following that, we look at how corrupted labels affect the extent of intermediate-layer NCC separability and the $\epsilon$-effective depth. We show that as the number of corrupted labels in the data increases, so does the $\epsilon$-effective depth. Finally, using the bound in Prop. 1, we provide non-trivial estimates of the test error. In Tab. 1, we empirically compare our bound with relevant baselines and show that, unlike other bounds, it achieves non-vacuous estimations of the test error. Throughout the experiments, we used Tesla K80 GPUs for several hundred runs. Each run took between 5 and 20 hours. For additional experiments, see Appendix A. The plots are best viewed when zoomed in.

Figure 2: Averaged $\epsilon$-effective depths over the last few epochs. We plot the $\epsilon$-effective depth (y-axis) as a function of $\epsilon$ (x-axis). Each line specifies the $\epsilon$-effective depth of a neural network of a certain depth $L$. We show the averaged $\epsilon$-effective depth over the last $k = 1, 20$ epochs across 5 initializations. The network's architecture, dataset and $k$ are specified below each plot.

4.1. SETUP

Training process. We consider $C$-class classification problems (e.g., CIFAR10) and train multi-layered neural networks $h = e \circ f^L = e \circ g^L \circ \cdots \circ g^1 : \mathbb{R}^n \to \mathbb{R}^C$ on the corresponding training dataset $S$. The models are trained with SGD for cross-entropy loss minimization between the logits and the one-hot encodings of the labels. We consistently use batch size 128, a learning rate schedule with an initial learning rate of 0.1, decayed three times by a factor of 0.1 at epochs 60, 120, and 160, momentum 0.9, and weight decay 5e-4. Each model is trained for 500 epochs.

Datasets. We consider various datasets: MNIST, Fashion MNIST, and CIFAR10. For CIFAR10 we used random cropping, random horizontal flips, and random rotations (by 15k degrees for k uniformly sampled from [24]). All datasets were standardized.
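The step learning-rate schedule described above can be written as a small function; a sketch of the schedule as we read it from the training protocol (the function name is ours):

```python
def learning_rate(epoch, base_lr=0.1, milestones=(60, 120, 160), gamma=0.1):
    """Step schedule used in the experiments: start at 0.1 and decay by a
    factor of 0.1 at epochs 60, 120, and 160."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= gamma
    return lr
```

In a framework such as PyTorch, the equivalent schedule is typically expressed with a multi-step scheduler over the same milestones.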

4.2. RESULTS

Intermediate neural collapse. To study the bias towards minimal depth, we trained a set of CONV-L-400 networks on CIFAR10 with varying depths. In each plot of Fig. 1 we report the train NCC classification accuracy of each intermediate layer of a network of a certain depth. We make several interesting observations: (i) For networks with 8 or more hidden layers, the eighth and higher layers exhibit NCC train accuracy of approximately 100%, and therefore the networks are effectively of depth 7. (ii) Neural collapse strengthens when increasing the network's depth, on both train and test data. (iii) The embeddings of the top layers become NCC separable at approximately the same epoch. (iv) The degree of NCC separability of intermediate layer $i$ converges as a function of $L$. The results of this experiment are substantially extended and repeated with different architectures and datasets in the appendix. In these experiments we report the NCC train and test accuracy rates along with additional measures of neural collapse when varying the depth. Specifically, in Figs. 4 and 5 we report the results with CONVRES-L-500.

The effect of depth on the $\epsilon$-effective depth. In Obs. 1 we claimed that the $\epsilon$-effective depth is insensitive to the actual depth of the network (once it exceeds a certain threshold). To validate this hypothesis we conducted the following experiments. We trained models on MNIST, Fashion MNIST, and CIFAR10 with varying depth $L$. In Fig. 2 we plot the averaged $\epsilon$-effective depths over each network's last $k = 1, 20$ epochs as a function of $\epsilon$. We also average the results across 5 different weight initializations and plot them along with standard deviation error bars. As can be seen, the $\epsilon$-effective depth is almost unaffected by the choice of $L$ for a given $\epsilon$. Remarkably, for each $\epsilon$, the averaged effective depth varies very little across the various networks.
Put differently, the $\epsilon$-effective depths of two trained deep networks of different depths are more or less the same, validating our minimal depth hypothesis.

NCC separability with partially corrupted labels. Simply put, Prop. 1 compares the depths required to fit correct labels and partially corrupted labels. To better understand the effect of corrupted labels on the complexity of the task, we compare the $\epsilon$-effective depths of models trained with varying amounts of corrupted labels. Namely, we study the degree of NCC separability in the intermediate layers of neural networks trained with varying amounts of corrupted labels. For this experiment we trained instances of CONV-10-400 for CIFAR10 classification with 0%, 10% and 75% corrupted labels (i.e., uniformly distributed random labels). We plot the degrees of NCC separation on the train and test sets, $1 - \mathrm{err}_S(\hat{h}_i)$ and $1 - \mathrm{err}_P(\hat{h}_i)$, across the intermediate layers of the neural networks during the optimization procedure. As can be seen in Fig. 3, when increasing the amount of random labels, the degree of NCC separability across the intermediate layers tends to decrease. For example, when training with $\geq 25\%$ corrupted labels, the sixth layer's NCC accuracy drops below 98%, compared to above 98% when training without corrupted labels. In particular, the $\epsilon$-effective depth of the former network is 6 while the latter's is 5, for $\epsilon = 0.02$ (see Def. 1). This experiment is extended and repeated in a variety of settings in Figs. 14-18.

Estimating the bound in equation 3. We estimate the bound in equation 3 for multiple architectures and datasets. In each case we used $\epsilon = 0.005$ by default and employed different 'guesses' $p$ (see Tab. 2) depending on the complexity of the learning task. We report an estimation of the expected test error of the models, $\mathbb{E}_{S_1,\gamma}[\mathrm{err}_P(h^\gamma_{S_1})]$, and an estimation of the bound for each selection of $p$. For concrete technical details, see Appendix A.
As can be seen, for appropriate selections of $p$, we obtain non-trivial estimates of the test performance of the models, which is almost unheard of for standard bounds for deep neural networks. As expected, if the guess $p$ is overoptimistic (e.g., close to $\mathbb{E}_{S_1,\gamma}[\mathrm{err}_P(h^\gamma_{S_1})]$), then the first term in the bound tends to be large compared to $\mathbb{E}_{S_1,\gamma}[\mathrm{err}_P(h^\gamma_{S_1})]$.

Comparing our bound with standard generalization bounds. Since the $\epsilon$-effective depth of sufficiently deep neural networks is insensitive to depth (see Fig. 2), we expect the bound to be insensitive to depth as well. We estimate the bound in equation 3 for CONV-L-50 trained on MNIST and CONV-L-100 trained on Fashion MNIST and CIFAR10, with $L = 10, 12, 15$ for the first two and $L = 15, 18, 20$ for CIFAR10. As shown in Tab. 1, we obtain similar bounds for each depth. Finally, we compare our bound to several baseline generalization bounds for deep networks to show that it outperforms traditional generalization bounds. We used the implementation of Neyshabur et al. (2019) to compute the baseline bounds. While our bound is empirically non-vacuous and fairly independent of depth, the traditional bounds are extremely vacuous and grow rapidly when increasing the depth.

5. CONCLUSIONS

Understanding the ability of SGD to generalize well when training overparameterized neural networks is regarded as one of the major open problems in deep learning theory (Zhang et al., 2017). In this paper we offer a new angle on the role of depth in deep learning and the connection between neural collapse and generalization. We characterize a notion of effective depth that measures the lowest layer that enjoys NCC separability. We introduce a novel generalization bound that measures the likelihood that the effective depth of a trained neural network is (strictly) smaller than the minimal depth required to achieve NCC separability with partially corrupted labels. As demonstrated empirically, this criterion is a good predictor of generalization. Furthermore, we empirically demonstrate that when sufficiently deep networks are trained, they converge to the same effective depth, implying that our bound is fairly constant when the depth is varied.

A ADDITIONAL EXPERIMENTS AND DETAILS

A.1 ARCHITECTURES

In this section we give a detailed description of the architectures used in the experiments. The first architecture is a convolutional network, denoted by CONV-L-H. The network starts with a stack consisting of a 2 × 2 convolutional layer with stride 2, batch normalization, a second convolution of the same structure, batch normalization, and ReLU. Following that we have a sequence of L blocks g_i(x) = σ(B_i(C_i(x))), where C_i is a 3 × 3 convolutional layer with H channels, stride 1 and padding 1, B_i is a batch normalization layer, and σ is the ReLU activation. The last layer is linear. When computing the effective depth, the i'th intermediate layer refers to the output of the i'th block g_i. The second architecture is an MLP, denoted by MLP-L-H, consisting of L hidden layers, where each layer g_i(x) = σ(B_i(T_i(x))) consists of a linear layer T_i of output width H, followed by batch normalization B_i and the ReLU activation σ. The last layer is linear. The third architecture is a convolutional residual network, denoted by CONVRES-L-H. The network starts with the same initial stack as CONV-L-H: a 2 × 2 convolutional layer with stride 2, batch normalization, a convolution of the same structure, batch normalization, and ReLU. Following that we have a sequence of L residual blocks, where each block computes g_i(x) = σ(x + B^2_i(C^2_i(σ(B^1_i(C^1_i(x)))))), where each C^j_i is a 3 × 3 convolutional layer with H channels, stride 1 and padding 1, B^j_i is a batch normalization layer, and σ is the ReLU activation. The last layer is linear.
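The CONV-L-H description above can be sketched in PyTorch as follows. This is a minimal illustrative implementation under our own assumptions (class name, default input shape for MNIST-like data, and the option to return intermediate features are our choices, not the paper's code):

```python
import torch
import torch.nn as nn

class ConvLH(nn.Module):
    """Sketch of CONV-L-H: a stem of two 2x2/stride-2 convs with batch norm
    and a final ReLU, followed by L conv-BN-ReLU blocks and a linear head."""

    def __init__(self, L=10, H=50, in_channels=1, num_classes=10, in_size=28):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, H, kernel_size=2, stride=2),
            nn.BatchNorm2d(H),
            nn.Conv2d(H, H, kernel_size=2, stride=2),
            nn.BatchNorm2d(H),
            nn.ReLU(),
        )
        # L blocks g_i(x) = ReLU(BN(Conv3x3(x))); spatial size is preserved
        # by the stride-1, padding-1 convolutions.
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(H, H, kernel_size=3, stride=1, padding=1),
                nn.BatchNorm2d(H),
                nn.ReLU(),
            )
            for _ in range(L)
        )
        feat = in_size // 4  # two stride-2 convolutions in the stem
        self.head = nn.Linear(H * feat * feat, num_classes)

    def forward(self, x, return_intermediates=False):
        z = self.stem(x)
        feats = []
        for blk in self.blocks:
            z = blk(z)
            feats.append(z)  # output of the i'th block, used for effective depth
        out = self.head(z.flatten(1))
        return (out, feats) if return_intermediates else out
```

The `return_intermediates` flag exposes the per-block embeddings on which the NCC separability of each intermediate layer can be evaluated.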

A.2 ESTIMATING THE GENERALIZATION BOUND

In this section we describe how we empirically estimate the bound in Prop. 1.

Estimating the bound. We would like to estimate the first term in the bound, P_{S_1,S_2,Ỹ_2}[E_γ[d^ϵ_{S_1}(h^γ_{S_1})] ≥ d^ϵ_min(G, S_1 ∪ S̃_2)]. According to Prop. 2, in order to estimate this term we need to generate i.i.d. triplets (S^i_1, S^i_2, Ỹ^i_2). Since we have limited access to training data, we use a variation of cross-validation and generate k_1 = 5 i.i.d. disjoint splits (S^i_1, S^i_2) of the training data S. For each of these pairs, we generate k_2 = 3 corrupted labelings Ỹ^{ij}_2. We denote by S̃^{ij}_2 the set obtained by replacing the labels of S^i_2 with Ỹ^{ij}_2, and define S̃^{ij}_3 := S^i_1 ∪ S̃^{ij}_2.

As a first step, we estimate E_γ[d^ϵ_{S^i_1}(h^γ_{S^i_1})] for each i ∈ [k_1]. For this purpose, we randomly select T_1 = 5 different initializations γ_1, ..., γ_{T_1} and, for each one, we train the model h^{γ_t}_{S^i_1} using the training protocol described in Sec. 4.1. Once trained, we compute d^ϵ_{S^i_1}(h^{γ_t}_{S^i_1}) for each t ∈ [T_1] (see Def. 1) and approximate E_γ[d^ϵ_{S^i_1}(h^γ_{S^i_1})] by d_i := (1/T_1) Σ^{T_1}_{t=1} d^ϵ_{S^i_1}(h^{γ_t}_{S^i_1}).

As a next step, we would like to evaluate I[d_i ≥ d^ϵ_min(G, S̃^{ij}_3)]. We notice that d_i ≥ d^ϵ_min(G, S̃^{ij}_3) if and only if there is a d_i-layered neural network f = g_{d_i} ∘ ⋯ ∘ g_1 for which err_{S̃^{ij}_3}(ĥ) ≤ ϵ, where ĥ(x) := arg min_{c∈[C]} ||f(x) − µ_f(S_c)||. In general, computing this Boolean value is computationally hard. Therefore, to estimate it, we simply train a (d_i + 1)-layered network h = e ∘ f and check whether its penultimate layer is ϵ-NCC separable, i.e., whether err_{S̃^{ij}_3}(ĥ) ≤ ϵ. We repeat this with T_2 independently initialized networks h_1, ..., h_{T_2} and estimate I[d_i ≥ d^ϵ_min(G, S̃^{ij}_3)] using I[d_i ≥ min_{t∈[T_2]} d^ϵ_{S̃^{ij}_3}(h_t)]. Our final estimation is the following:

(1/k_1) Σ^{k_1}_{i=1} (1/k_2) Σ^{k_2}_{j=1} I[d_i ≥ min_{t∈[T_2]} d^ϵ_{S̃^{ij}_3}(h_t)] ≈ P_{S_1,S_2,Ỹ_2}[E_γ[d^ϵ_{S_1}(h^γ_{S_1})] ≥ d^ϵ_min(G, S_1 ∪ S̃_2)].
(5)

In order to estimate the bound, we assume that δ^1_m and δ^2_{m,p,α} are negligible constants and that α = 1. The estimation of the bound is given by the sum of the left-hand side of equation 5 and p.

Estimating the mean test error. To estimate the mean test error, E_{S_1,γ}[err_P(h^γ_{S_1})], as is typically done in machine learning, we replace the population distribution P with the test set S_test, and we replace the expectation over S_1 and γ with averages across the k_1 = 5 random selections of {S^i_1}^{k_1}_{i=1} and the T_1 = 5 random selections of {γ_t}^{T_1}_{t=1}. Namely, we compute

(1/k_1) Σ^{k_1}_{i=1} (1/T_1) Σ^{T_1}_{t=1} err_{S_test}(h^{γ_t}_{S^i_1}) ≈ E_{S_1,γ}[err_P(h^γ_{S_1})].
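Once the per-split effective depths d_i and the minimal depths observed on the corrupted sets are computed, the estimate in equation 5 reduces to an average of indicator values, to which the guess p is added. A minimal sketch of this last step (function name and the toy numbers are ours; real usage would plug in the quantities obtained from the training runs described above):

```python
def estimate_bound(d, d_min_corrupted, p):
    """Estimate the bound of Prop. 1 from precomputed quantities.

    d: list of length k1; d[i] is the average effective depth d_i over T1 runs.
    d_min_corrupted: k1 x k2 nested list; d_min_corrupted[i][j] is
        min_t d^eps(h_t) over the T2 networks trained on the corrupted
        set S3^{ij}.
    p: the guessed label-corruption level. Following the text, the delta
       terms are assumed negligible and the estimate is the indicator
       average plus p.
    """
    k1 = len(d)
    k2 = len(d_min_corrupted[0])
    indicator_avg = sum(
        (1 / k2) * sum(1.0 if d[i] >= d_min_corrupted[i][j] else 0.0
                       for j in range(k2))
        for i in range(k1)
    ) / k1
    return indicator_avg + p

# Toy example: k1 = 2 splits, k2 = 3 corrupted labelings per split.
print(estimate_bound([5.0, 6.0], [[5, 7, 6], [7, 7, 7]], p=0.1))
```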

A.3 NEURAL COLLAPSE

To obtain a comprehensive analysis of collapse across layers, we also estimate the degree of NC1. To evaluate NC1, we follow the process suggested by Galanti et al. (2022a), which is a simplified version of the original approach of Papyan et al. (2020). For a feature map f : R^d → R^p and two (class-conditional) distributions¹ Q_1, Q_2 over X ⊂ R^d, we define their class-distance normalized variance (CDNV) to be

V_f(Q_1, Q_2) := (Var_f(Q_1) + Var_f(Q_2)) / (2 ||µ_f(Q_1) − µ_f(Q_2)||^2),

where µ_u(Q) := E_{x∼Q}[u(x)] and Var_u(Q) := E_{x∼Q}[||u(x) − µ_u(Q)||^2] denote the mean and variance of u(x) for x ∼ Q. Essentially, this quantity measures to what extent the feature vectors of samples from Q_1 and Q_2 are separated and clustered in space. To demonstrate the gradual evolution of collapse across the layers, for each sub-architecture f_i = g_i ∘ ⋯ ∘ g_1 we consider the train and test class-feature variations Avg_{c≠c′}[V_{f_i}(S_c, S_{c′})] and Avg_{c≠c′}[V_{f_i}(P_c, P_{c′})]. The population distribution of each class, P_c, is replaced with the test samples of that class. As shown by Galanti et al. (2022a), this definition is essentially the same as that of Papyan et al. (2020). Furthermore, they showed that the NCC classification error rate can be upper bounded in terms of the CDNV. However, the NCC error can be zero in cases where the CDNV is larger than zero. For example, if two classes are uniformly distributed over the unit circles around the points (−1, 0) and (1, 0) in R^2, then they are perfectly NCC separable, while the CDNV between the two distributions is 0.25.

Auxiliary experiments on the effective depth. In Figs. 7-13 we plot the CDNV and the NCC accuracy rates of neural networks with varying numbers of hidden layers, evaluated on the train and test data. Each curve stands for a different layer within the network. As can be seen, in all cases, for networks deeper than a certain threshold we obtain (near-perfect) NCC separability in all of the top layers. Furthermore, the degree of neural collapse seems to improve with the network's depth.

Auxiliary experiments with noisy labels. In Figs. 14-18 we repeat the experiment of Fig. 3 and plot the results of the same experiment with different networks and datasets (see captions). As can be seen, the effective NCC depth of a neural network tends to increase as we train with increasing amounts of corrupted labels.
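The CDNV is straightforward to compute from feature matrices. The following NumPy sketch (function name is ours) also checks the circle example above numerically: two classes uniform on unit circles centered at (−1, 0) and (1, 0) give CDNV = 0.25 even though they are perfectly NCC separable:

```python
import numpy as np

def cdnv(F1, F2):
    """Class-distance normalized variance between two feature sets of
    shape (n_i, p): (Var_1 + Var_2) / (2 * ||mu_1 - mu_2||^2)."""
    mu1, mu2 = F1.mean(axis=0), F2.mean(axis=0)
    var1 = np.mean(np.sum((F1 - mu1) ** 2, axis=1))
    var2 = np.mean(np.sum((F2 - mu2) ** 2, axis=1))
    return (var1 + var2) / (2 * np.sum((mu1 - mu2) ** 2))

# Two classes on unit circles around (-1, 0) and (1, 0): each class has
# variance 1 around its center, and the centers are at squared distance 4.
theta = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
F1 = circle + np.array([-1.0, 0.0])
F2 = circle + np.array([1.0, 0.0])
print(round(cdnv(F1, F2), 4))  # 0.25
```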

B PROOFS

Proposition 1. Let m ∈ N, p ∈ (0, 1/2), α ∈ (0, 1) and ϵ ∈ (0, 1). Assume that the error of the learning algorithm is δ^1_m-uniform. Assume that S_1, S_2 ∼ P^B(m). Let h^γ_{S_1} be the output of the learning algorithm given access to a dataset S_1 and initialization γ. Then,

E_{S_1}E_γ[err_P(h^γ_{S_1})] ≤ P_{S_1,S_2,Ỹ_2}[E_γ[d^ϵ_{S_1}(h^γ_{S_1})] ≥ d^ϵ_min(G, S_1 ∪ S̃_2)] + (1 + α)p + δ^1_m + δ^2_{m,p,α},

where Ỹ_2 = {ỹ_i}^m_{i=1} is uniformly selected to be a set of labels that disagrees with Y_2 on pm values.

Proof. Let S_1 = {(x^1_i, y^1_i)}^m_{i=1} and S_2 = {(x^2_i, y^2_i)}^m_{i=1} be two balanced datasets. Let ϵ > 0, p > 0 and q = (1 + α)p. Let Ỹ_2 and Ŷ_2 be uniformly selected sets of labels that disagree with Y_2 on pm and qm randomly selected labels (resp.). We denote by S̃_2 and Ŝ_2 the relabelings of S_2 with the labels in Ỹ_2 and in Ŷ_2 (resp.). We define the following events:

A_1 = {(S_1, S_2, Ỹ_2) | ∃ q ≥ (1 + α)p : d^ϵ_min(G, S_1 ∪ S̃_2) > E_{Ŷ_2}[d^ϵ_min(G, S_1 ∪ Ŝ_2)]},
A_2 = {(S_1, S_2) | the mistakes of h^γ_{S_1} are not uniform over S_2},
A_3 = {(S_1, S_2, Ỹ_2) | (S_1, S_2, Ỹ_2) ∉ A_1 ∪ A_2 and E_γ[d^ϵ_{S_1}(h^γ_{S_1})] < d^ϵ_min(G, S_1 ∪ S̃_2)},
A_4 = {(S_1, S_2, Ỹ_2) | (S_1, S_2, Ỹ_2) ∉ A_1 ∪ A_2 and E_γ[d^ϵ_{S_1}(h^γ_{S_1})] ≥ d^ϵ_min(G, S_1 ∪ S̃_2)},
B_1 = {(S_1, S_2, Ỹ_2) | E_γ[d^ϵ_{S_1}(h^γ_{S_1})] ≥ d^ϵ_min(G, S_1 ∪ S̃_2)}.

By the law of total expectation,

E_{S_1}E_γ[err_P(h^γ_{S_1})] = E_{S_1,S_2}E_γ[err_{S_2}(h^γ_{S_1})] = Σ^4_{i=1} P[A_i] · E_{S_1,S_2,Ỹ_2}[E_γ[err_{S_2}(h^γ_{S_1})] | A_i] ≤ P[A_1] + P[A_2] + E_{S_1,S_2,Ỹ_2}[E_γ[err_{S_2}(h^γ_{S_1})] | A_3] + P[B_1],

where the last inequality follows from err_{S_2}(h^γ_{S_1}) ≤ 1, P[A_3] ≤ 1 and A_4 ⊂ B_1. We would like to upper bound each of the above terms. First, we notice that since the mistakes of the network are δ^1_m-uniform, P[A_2] ≤ δ^1_m. In addition, by definition, P[A_1] ≤ δ^2_{m,p,α}. As a next step, we upper bound E_{S_1,S_2,Ỹ_2}[E_γ[err_{S_2}(h^γ_{S_1})] | A_3]. Assume that (S_1, S_2, Ỹ_2) ∈ A_3. Hence, (S_1, S_2, Ỹ_2) ∉ A_1 ∪ A_2. Then, the mistakes of h^γ_{S_1} over S_2 are uniformly distributed (with respect to the selection of γ). Assume by contradiction that err_{S_2}(h^γ_{S_1}) > (1 + α)p with nonzero probability over the selection of γ. Then, since the mistakes of h^γ_{S_1} over S_2 are uniformly distributed, err_{S_2}(h^γ_{S_1}) > (1 + α)p for all initializations γ. Therefore, we have

E_{Ŷ_2}[d^ϵ_min(G, S_1 ∪ Ŝ_2)] ≤ E_γ[d^ϵ_{S_1}(h^γ_{S_1})] < d^ϵ_min(G, S_1 ∪ S̃_2),

where the first inequality follows from the definition of d^ϵ_min(G, S_1 ∪ Ŝ_2) and the second from the assumption that (S_1, S_2, Ỹ_2) ∈ A_3. However, this inequality contradicts the fact that (S_1, S_2, Ỹ_2) ∉ A_1. Therefore, we conclude that in this case, E_γ[err_{S_2}(h^γ_{S_1})] ≤ (1 + α)p and E_{S_1,S_2,Ỹ_2}[E_γ[err_{S_2}(h^γ_{S_1})] | A_3] ≤ (1 + α)p.

Proposition 2. Let m ∈ N, p ∈ (0, 1/2), α ∈ (0, 1) and ϵ ∈ (0, 1). Assume that the error of the learning algorithm is δ^1_m-uniform. Let S_1, S_2, S^i_1, S^i_2 ∼ P^B(m) (for i ∈ [k]). Let Ỹ^i_2 = {ỹ^i_j}^m_{j=1} be a set of labels that disagrees with Y^i_2 on uniformly selected pm labels, and let S̃^i_2 be the relabeling of S^i_2 with the labels in Ỹ^i_2. Let h^γ_{S_1} be the output of the learning algorithm given access to a dataset S_1 and initialization γ. Then, with probability at least 1 − δ over the selection of {(S^i_1, S^i_2, Ỹ^i_2)}^k_{i=1}, we



¹ The definition can be extended to finite sets S_1, S_2 ⊂ X by defining V_f(S_1, S_2) := V_f(U[S_1], U[S_2]), where U[S] denotes the uniform distribution over S.



We define err_S(h_W) := (1/m) Σ^m_{i=1} I[arg max_c h_W(x_i)_c ≠ y_i] and err_P(h_W) := E_{(x,y)∼P}[I[arg max_c h_W(x)_c ≠ y]], where I : {True, False} → {0, 1} is the indicator function.

NEAREST CLASS-CENTER SEPARABILITY

Neural collapse describes the training dynamics of deep networks in standard classification tasks, in which the penultimate-layer features associated with training samples belonging to the same class tend to concentrate around their class means. This includes: (NC1) class-feature variability collapse; (NC2) the class means of the embeddings collapse to the vertices of a simplex equiangular tight frame; (NC3) the last-layer classifiers collapse to the class means up to scaling; and (NC4) the classifier's decision collapses to simply choosing whichever class has the closest train class mean, while maintaining a zero classification error.
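The nearest class-center rule in (NC4) is simple to state in code. A minimal sketch on a batch of feature vectors (the function name and the toy clusters are ours):

```python
import numpy as np

def ncc_predict(F_train, y_train, F):
    """Classify each row of F by the nearest train class mean in feature space."""
    classes = np.unique(y_train)
    # mu[c] is the mean feature vector (the "class center") of class c.
    mu = np.stack([F_train[y_train == c].mean(axis=0) for c in classes])
    # Distance from every query point to every class center.
    dists = np.linalg.norm(F[:, None, :] - mu[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Toy example: two well-separated clusters are perfectly NCC separable.
rng = np.random.default_rng(0)
F0 = rng.normal(loc=[-3.0, 0.0], scale=0.1, size=(50, 2))
F1 = rng.normal(loc=[3.0, 0.0], scale=0.1, size=(50, 2))
F_train = np.vstack([F0, F1])
y_train = np.array([0] * 50 + [1] * 50)
preds = ncc_predict(F_train, y_train, F_train)
print((preds != y_train).mean())  # NCC train error: 0.0
```

In the paper's experiments the rows of F_train would be the embeddings of the training samples at a given intermediate layer, and a layer is ϵ-NCC separable when this error is at most ϵ.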

We used three types of architectures: (a) MLP-L-H with L fully-connected layers of width H, (b) CONV-L-H with L 3×3 convolutional layers with padding 1, stride 1 and H output channels and (c) a residual convolutional network CONVRES-L-H with L residual blocks with two 3 × 3 convolutional layers. In each network the layers are interlaced with batch normalization layers and ReLU activations. For more details see Appendix A.1.

Figure 18: Intermediate neural collapse of CONV-10-50 trained on MNIST with partially corrupted labels. See Fig. 14 for details.

Estimating the bound in Prop. 1. We used ϵ = 0.005 to measure the effective depths.
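Def. 1 is stated in the main text; informally, the effective depth is the first layer from which the embeddings are ϵ-NCC separable. As a rough illustration, it can be read off from the per-layer NCC error rates. A minimal sketch (the fallback convention for networks with no ϵ-separable layer is our own assumption, not the paper's definition):

```python
def effective_depth(layer_ncc_errors, eps=0.005):
    """Effective depth: 1-based index of the first layer whose NCC error
    is at most eps. If no layer qualifies, return the full depth (a
    convention we adopt for this sketch)."""
    for i, err in enumerate(layer_ncc_errors, start=1):
        if err <= eps:
            return i
    return len(layer_ncc_errors)

# Per-layer NCC train errors for a hypothetical 4-block network.
print(effective_depth([0.4, 0.2, 0.004, 0.0]))  # 3
```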

Intermediate neural collapse of CONVRES-L-500 trained on CIFAR10. We plot the NCC train and test accuracy rates of neural networks with varying numbers of hidden layers. Each curve stands for a different layer within the network.

Intermediate neural collapse of CONV-L-400 trained on CIFAR10. We plot the NCC train and test accuracy rates of neural networks with varying numbers of hidden layers. Each curve stands for a different layer within the network.

Intermediate neural collapse of CONV-L-50 trained on MNIST. We plot the NCC train and test accuracy rates of neural networks with varying numbers of hidden layers. Each curve stands for a different layer within the network.

Intermediate neural collapse of MLP-L-100 trained on Fashion MNIST. We plot the NCC train and test accuracy rates of neural networks with varying numbers of hidden layers. Each curve stands for a different layer within the network.

Intermediate neural collapse of MLP-L-100 trained on Fashion MNIST. See Fig. 1 in the main text for details.

Intermediate neural collapse of CONV-10-400 trained on CIFAR10 with partially corrupted labels. In the first (third) row, we plot the CDNV on the train (test) data for intermediate layers of networks trained with varying amounts of corrupted labels (see legend). In the second (fourth) row, we plot the NCC accuracy rates of the various layers of a network trained with a certain amount of corrupted labels (see titles).

Intermediate neural collapse of CONVRES-10-500 trained on CIFAR10 with noisy labels. See Fig. 3 in the main text for details.

Intermediate neural collapse of MLP-10-500 trained on CIFAR10 with noisy labels. See Fig. 3 in the main text for details.

Intermediate neural collapse of CONV-10-100 trained on Fashion MNIST with noisy labels. See Fig. 3 in the main text for details.


have

E_{S_1}E_γ[err_P(h^γ_{S_1})] ≤ (1/k) Σ^k_{i=1} I[E_γ[d^ϵ_{S^i_1}(h^γ_{S^i_1})] ≥ d^ϵ_min(G, S^i_1 ∪ S̃^i_2)] + (1 + α)p + δ^1_m + δ^2_{m,p,α} + √(log(1/δ)/(2k)).

Proof. By Prop. 1, we have

E_{S_1}E_γ[err_P(h^γ_{S_1})] ≤ P_{S_1,S_2,Ỹ_2}[E_γ[d^ϵ_{S_1}(h^γ_{S_1})] ≥ d^ϵ_min(G, S_1 ∪ S̃_2)] + (1 + α)p + δ^1_m + δ^2_{m,p,α}.

We define i.i.d. random variables

X_i := I[E_γ[d^ϵ_{S^i_1}(h^γ_{S^i_1})] ≥ d^ϵ_min(G, S^i_1 ∪ S̃^i_2)], for i ∈ [k].

Therefore, we can rewrite

P_{S_1,S_2,Ỹ_2}[E_γ[d^ϵ_{S_1}(h^γ_{S_1})] ≥ d^ϵ_min(G, S_1 ∪ S̃_2)] = E[X_1] = E[(1/k) Σ^k_{i=1} X_i].

By Hoeffding's inequality,

P[E[X_1] − (1/k) Σ^k_{i=1} X_i ≥ ϵ] ≤ exp(−2kϵ^2).

By choosing ϵ = √(log(1/δ)/(2k)), we obtain that with probability at least 1 − δ, we have

E[X_1] ≤ (1/k) Σ^k_{i=1} X_i + √(log(1/δ)/(2k)).

When combined with Prop. 1, we obtain the desired bound.

