ON THE IMPLICIT BIAS TOWARDS DEPTH MINIMIZATION IN DEEP NEURAL NETWORKS

Anonymous

Abstract

Recent results in the literature suggest that the penultimate (second-to-last) layer representations of neural networks trained for classification exhibit a clustering property called neural collapse (NC). We study the implicit bias of stochastic gradient descent (SGD) to favor low-depth solutions when training deep neural networks. We characterize a notion of effective depth, defined as the first layer at which sample embeddings are separable using the nearest-class-center classifier, and we hypothesize and empirically show that SGD implicitly selects neural networks of small effective depth. Second, while neural collapse emerges even when generalization should be impossible, we argue that the degree of separability in the intermediate layers is related to generalization. We derive a generalization bound based on comparing the effective depth of the network with the minimal depth required to fit the same dataset with partially corrupted labels. Remarkably, this bound provides non-trivial estimates of the test performance and is independent of the depth.
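The effective-depth notion above can be made concrete with a small sketch: for each layer, fit a nearest-class-center (NCC) classifier on that layer's embeddings and report the first layer whose NCC accuracy reaches a separability threshold. The helper names (`ncc_accuracy`, `effective_depth`) and the threshold default are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def ncc_accuracy(embeddings, labels):
    """Accuracy of the nearest-class-center (NCC) classifier on embeddings.

    embeddings: array of shape (n_samples, dim); labels: array of shape (n_samples,).
    """
    classes = np.unique(labels)
    # one center per class: the mean embedding of that class's samples
    centers = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    # distance from each sample to each class center, shape (n_samples, n_classes)
    dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == labels).mean())

def effective_depth(layer_embeddings, labels, threshold=1.0):
    """First layer (1-indexed) whose embeddings the NCC classifier separates.

    layer_embeddings: list of per-layer embedding arrays, ordered by depth.
    Returns the number of layers if no layer reaches the threshold.
    """
    for depth, emb in enumerate(layer_embeddings, start=1):
        if ncc_accuracy(emb, labels) >= threshold:
            return depth
    return len(layer_embeddings)
```

For example, with embeddings that are mixed in layer 1 but cleanly clustered by class in layer 2, `effective_depth` returns 2.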

1. INTRODUCTION

Deep learning systems have steadily advanced the state of the art in a wide range of benchmarks, demonstrating impressive performance in tasks ranging from image classification (Taigman et al., 2014; Zhai et al., 2021) and language processing (Devlin et al., 2019; Brown et al., 2020) to open-ended environments (Silver et al., 2016; Arulkumaran et al., 2019) and coding (Chen et al., 2021). Recent research indicates that deep neural networks generalize well, in part because the number of parameters exceeds the number of training samples (Belkin et al., 2018; Belkin, 2021; Advani & Saxe, 2017; Belkin et al., 2019). However, in this regime deep learning models can precisely interpolate arbitrary training labels (Zhang et al., 2017), a phenomenon known as the "interpolation regime." Therefore, understanding interpolation learning appears to be a critical step toward a better theoretical understanding of deep learning's success. Traditional generalization bounds (Vapnik, 1998; Shalev-Shwartz & Ben-David, 2014; Mohri et al., 2012; Bartlett & Mendelson, 2003) are based on uniform convergence: instead of directly analyzing the population error of a learning algorithm, a uniform-convergence argument controls the worst-case generalization gap (the distance between train and test errors) over a class of predictors containing the outputs of the learning algorithm. Typically, this is done because, for many algorithms, it is difficult to characterize the learned predictor exactly. Nagarajan & Kolter (2019), however, raised significant questions about the applicability of typical uniform convergence arguments to certain interpolation learning regimes. They described theoretical settings in which an interpolation learning algorithm generalizes well, yet no uniform convergence bound can certify this.

Contributions. Because of the inherent limitations of uniform convergence bounds, in this paper we pursue a novel approach for measuring generalization in deep learning that is not based on uniform convergence. Instead, our bound suggests that the model performs well at test time if its complexity is small compared to the complexity of a network required to fit the same dataset with partially random labels. In other words, even if a trained network has a complexity greater than the



Following their work, Bartlett & Long (2021), Zhou et al. (2020), Negrea et al. (2020), and Yang et al. (2021) all demonstrated failures of various forms of uniform convergence in interpolation learning setups.

