ON THE IMPLICIT BIAS TOWARDS DEPTH MINIMIZATION IN DEEP NEURAL NETWORKS

Anonymous

Abstract

Recent results in the literature suggest that the penultimate (second-to-last) layer representations of neural networks trained for classification exhibit a clustering property called neural collapse (NC). We study the implicit bias of stochastic gradient descent (SGD) to favor low-depth solutions when training deep neural networks. We characterize a notion of effective depth that measures the first layer at which sample embeddings are separable using the nearest class-center classifier, and we hypothesize and empirically show that SGD implicitly selects neural networks of small effective depth. Furthermore, while neural collapse emerges even when generalization should be impossible, we argue that the degree of separability in the intermediate layers is related to generalization. We derive a generalization bound based on comparing the effective depth of the network with the minimal depth required to fit the same dataset with partially corrupted labels. Remarkably, this bound provides non-trivial estimates of the test performance and is independent of the network's depth.

1. INTRODUCTION

Deep learning systems have steadily advanced the state of the art in a wide range of benchmarks, demonstrating impressive performance in tasks ranging from image classification (Taigman et al., 2014; Zhai et al., 2021) and language processing (Devlin et al., 2019; Brown et al., 2020) to open-ended environments (Silver et al., 2016; Arulkumaran et al., 2019) and coding (Chen et al., 2021). Recent research indicates that deep neural networks generalize well in part because the number of parameters exceeds the number of training samples (Belkin et al., 2018; Belkin, 2021; Advani & Saxe, 2017; Belkin et al., 2019). However, it has been shown that in this regime deep learning models can perfectly interpolate arbitrary training labels (Zhang et al., 2017), a setting known as the "interpolation regime". Understanding interpolation learning therefore appears to be a critical step toward a better theoretical understanding of deep learning's successes.

Traditional generalization bounds (Vapnik, 1998; Shalev-Shwartz & Ben-David, 2014; Mohri et al., 2012; Bartlett & Mendelson, 2003) are based on uniform convergence. In this approach, instead of directly analyzing the population error of a learning algorithm, a uniform convergence-type argument controls the worst-case generalization gap (the distance between train and test errors) over a class of predictors containing the outputs of the learning algorithm. Typically, this is done because for many algorithms it is difficult to exactly characterize the learned predictor. Nagarajan & Kolter (2019), however, raised significant questions about the applicability of typical uniform convergence arguments to certain interpolation learning regimes. They described theoretical settings in which an interpolation learning algorithm generalizes well but no uniform convergence bound can identify that.

Contributions.
Because of the inherent limitations of uniform convergence bounds, in this paper we pursue a novel approach for measuring generalization in deep learning that is not based on uniform convergence. Instead, our bound suggests that the model performs well at test time if its complexity is small compared to the complexity of a network required to fit the same dataset with partially random labels. In other words, even if a trained network has a complexity greater than the number of training samples, it may be less complex than a model that fits partially random labels. As a result, in such cases, our bound may provide a non-trivial estimate of the test error.

To formally describe our notion of complexity, we employ the notion of nearest class-center (NCC) separability. This property asserts that the feature embeddings associated with training samples belonging to the same class are separable according to the nearest class-center decision rule. While the original results (Papyan et al., 2020) observed NCC separability at the penultimate layer of trained networks, recent results (Ben-Shaul & Dekel, 2022) observed NCC separability also in intermediate layers. In this work, we introduce the notion of the 'effective depth' of a neural network: the lowest layer at which the features are NCC separable (see Sec. 3.2).

We make multiple important observations regarding effective depth. (i) We empirically show that the effective depth of trained networks monotonically increases as the amount of random labels in the data increases. (ii) We observe that sufficiently deep networks converge to (approximately) the same effective depth L0. Furthermore, as we show in Tab. 1, unlike traditional generalization bounds, our bound is empirically non-vacuous and independent of depth.
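To make the NCC-separability notion concrete, the following is a minimal illustrative sketch (not the paper's code) of how one might estimate a network's effective depth from its per-layer training embeddings. The array names (`layer_embeddings`) and the exact separability threshold are assumptions for illustration.

```python
import numpy as np

def ncc_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    """Accuracy of the nearest class-center (NCC) decision rule on `features`."""
    classes = np.unique(labels)
    # One mean embedding ("class center") per class.
    centers = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Distance of every sample to every class center.
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
    preds = classes[np.argmin(dists, axis=1)]
    return float((preds == labels).mean())

def effective_depth(layer_embeddings: list[np.ndarray], labels: np.ndarray,
                    tol: float = 1.0) -> int:
    """Lowest (1-based) layer index whose training embeddings are NCC separable.

    `layer_embeddings[l]` holds the embeddings of all training samples at
    layer l+1; a layer counts as separable once NCC accuracy reaches `tol`.
    """
    for l, feats in enumerate(layer_embeddings):
        if ncc_accuracy(feats, labels) >= tol:
            return l + 1
    # No layer is NCC separable: fall back to the full depth.
    return len(layer_embeddings)
```

In practice one would extract `layer_embeddings` with forward hooks on the trained network; the sketch above only shows the decision rule itself.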

1.1. RELATED WORK

Neural collapse and generalization. Our work is closely related to the recent line of work on neural collapse (Papyan et al., 2020; Han et al., 2022). Neural collapse describes training dynamics of deep networks in standard classification tasks, where the feature embeddings associated with training samples belonging to the same class tend to concentrate around their class means. While several papers analyzed the emergence of neural collapse from a theoretical standpoint (e.g., Zhu et al., 2021; Rangamani et al., 2022; Lu & Steinerberger, 2020; Fang et al., 2021; Ergen & Pilanci, 2021), its specific role in deep learning and its potential relationship with generalization are still unclear. Recent work (Galanti et al., 2022a; Xu et al., 2022; Galanti et al., 2022b) studied the conditions under which class-feature variation collapse generalizes from the training samples to test samples and to new classes, as in the transfer learning setting. In this work, we focus on the following (independent) question: is neural collapse a good indicator of how well a network generalizes? As a counter-argument, Zhu et al. (2021) provided empirical evidence that neural collapse occurs even when training the network with random labels. As a result, the presence of neural collapse alone cannot indicate whether or not the network generalizes. This experiment, however, does not rule out the possibility of an indirect relationship between neural collapse and generalization. We contend that the degree of separability in the intermediate layers is related to generalization.

Following their work, Bartlett & Long (2021); Zhou et al. (2020); Negrea et al. (2020); Yang et al. (2021) all demonstrated the failure of forms of uniform convergence in various interpolation learning setups.

Emergence of structure in deep networks. While various papers (Papyan, 2020; Tirer & Bruna, 2022; Galanti et al., 2022a; Ben-Shaul & Dekel, 2022; Cohen et al., 2018; Alain & Bengio, 2017; Montavon et al., 2011; Papyan et al., 2017; Ben-Shaul & Dekel, 2021; Shwartz-Ziv & Tishby, 2017) investigated certain geometrical properties within intermediate layers (e.g., clustering and separability), this paper is the first to demonstrate that deep neural networks tend to converge to a minimal effective depth that is independent of the network's depth. Even though one can de-
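As a rough companion to this discussion, here is a hedged sketch of a neural-collapse-style diagnostic: the ratio of within-class to between-class feature variation at a given layer, where small values indicate that same-class embeddings concentrate around their class means. This illustrates the general idea only and is not the exact metric of any specific cited paper.

```python
import numpy as np

def class_variation_ratio(features: np.ndarray, labels: np.ndarray) -> float:
    """Within-class / between-class sum of squared feature variation.

    A value near 0 signals collapsed (tightly clustered) class embeddings.
    Assumes the class means do not all coincide (nonzero between-class term).
    """
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in classes:
        class_feats = features[labels == c]
        mu_c = class_feats.mean(axis=0)
        # Spread of samples around their own class mean.
        within += ((class_feats - mu_c) ** 2).sum()
        # Spread of class means around the global mean, sample-weighted.
        between += len(class_feats) * ((mu_c - global_mean) ** 2).sum()
    return within / between
```

Tracking this ratio layer by layer is one simple way to probe where clustering structure emerges inside a trained network.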

Table 1: Comparing our bound with baseline bounds in the literature for networks of varying depths. Our error bound is reported in the fourth row, and the baseline bounds are reported in the bottom rectangle. While the test error is universally bounded by 1, the baseline bounds are much larger than 1 and are therefore vacuous. In contrast, our bound achieves relatively tight estimates of the test error and, unlike the baseline bounds, is fairly unaffected by the network's depth.

