VC-THEORETICAL EXPLANATION OF DOUBLE DESCENT

Abstract

There has been growing interest in the generalization performance of large multilayer neural networks that can be trained to achieve zero training error while generalizing well on test data. This regime is known as 'second descent', and it appears to contradict the conventional view that optimal model complexity should reflect an optimal balance between underfitting and overfitting, i.e., the bias-variance trade-off. This paper presents a VC-theoretical analysis of double descent and shows that it can be fully explained by classical VC-generalization bounds. We illustrate an application of analytic VC-bounds to modeling double descent for classification, using empirical results for several learning methods, such as SVM, Least Squares, and Multilayer Perceptron classifiers. In addition, we discuss several reasons for the misinterpretation of VC-theoretical results in the Deep Learning community.

1. INTRODUCTION

There have been many recent successful applications of Deep Learning (DL). At present, however, most DL methods are driven by heuristic improvements, while theoretical and conceptual understanding of this technology remains limited. For example, large neural networks can be trained to fit the available data (achieving zero training error) and still achieve good generalization on test data. This contradicts the conventional statistical wisdom that overfitting leads to poor generalization. This phenomenon was systematically described by Belkin et al. (2019), who introduced the term 'double descent' and pointed out the difference between the classical regime (first descent) and the modern one (second descent). The disagreement between the classical statistical view and modern machine learning practice motivates new theoretical explanations of the generalization ability of DL networks and other over-parameterized estimators. Proposed explanations include: special properties of multilayer network parameterization (Bengio, 2009), choosing a proper inductive bias during second descent (Belkin et al., 2019), the effect of Stochastic Gradient Descent (SGD) training (Zhang et al., 2021; Neyshabur et al., 2014; Dinh et al., 2017), the effect of various training heuristics on generalization (Srivastava et al., 2014), and the effect of margin on generalization (Bartlett et al., 2017).

The current consensus view on the 'generalization paradox' in DL networks can be summarized as follows:
- Existing indices of model complexity (or capacity), such as VC-dimension, cannot explain the generalization performance of DL networks.
- 'Classical' theories developed in ML and statistics cannot explain the generalization performance of DL networks and the double descent phenomenon. Specifically, Zhang et al. (2021) argue that the ability of large DL networks to achieve zero training error (during the second descent mode) effectively "rules out all of the VC-dimension arguments as a possible explanation for the generalization performance of state-of-the-art neural networks."

This paper demonstrates that these assertions are incorrect, and that classical VC-theoretical results can fully explain the generalization performance of DL networks, including double descent, for classification problems. In particular, we show that proper application of VC-bounds, using correct estimates of VC-dimension, provides accurate modeling of double descent for various classifiers trained using SGD, least squares loss, and standard SVM loss. The proposed VC-theoretical explanation provides additional insights on generalization performance during first descent vs. second descent, and on the effect of statistical properties of the data on double descent curves.

Next, we briefly review the VC-theoretical concepts and results necessary for understanding the generalization performance of all learning methods based on minimization of training error (Vapnik, 1998; 1999; 2006; Cherkassky & Mulier, 2007):
1. Finite VC-dimension provides necessary and sufficient conditions for good generalization.
2. VC-theory provides analytic bounds on the (unknown) test error, as a function of the training error, the VC-dimension, and the number of training samples.

Clearly, these VC-theoretical results contradict the existing consensus view that VC-theory cannot account for the generalization performance of large DL networks. This disagreement results from a misinterpretation of basic VC-theoretical concepts in DL research. A few examples of such misunderstanding:
- A common view holds that VC-dimension grows with the number of parameters (weights), and therefore "traditional measures of model complexity struggle to explain the generalization ability of large artificial neural networks" (Zhang et al., 2021).
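The relationship between parameter count and VC-dimension can be checked numerically. A classical counterexample (Vapnik, 1998) is the one-parameter family f(x) = sign(sin(w*x)), which shatters arbitrarily many points despite having a single parameter. The sketch below (function name and point placement are our illustration, not from the paper) verifies the standard construction for n = 5 points:

```python
import itertools
import math

# Classical counterexample: f(x) = sign(sin(w * x)) has infinite VC-dimension
# despite having one parameter. For points x_i = 10^(-i), any labeling
# y in {-1, +1}^n is realized by w = pi * (1 + sum_i ((1 - y_i)/2) * 10^i).

def shatters(n):
    """Check that sign(sin(w*x)) realizes all 2^n labelings of x_i = 10^-i."""
    xs = [10.0 ** -(i + 1) for i in range(n)]
    for labels in itertools.product([-1, 1], repeat=n):
        w = math.pi * (1 + sum(((1 - y) // 2) * 10 ** (i + 1)
                               for i, y in enumerate(labels)))
        preds = [1 if math.sin(w * x) > 0 else -1 for x in xs]
        if preds != list(labels):
            return False
    return True

print(shatters(5))  # all 32 labelings of 5 points realized by one parameter
```

Conversely, a linear classifier in d dimensions has VC-dimension d + 1 regardless of how many redundant parameters a particular implementation carries, so parameter count bounds VC-dimension neither from above nor from below.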
In fact, it is well known that VC-dimension can be equal to, larger than, or smaller than the number of parameters (Vapnik, 1998; Cherkassky & Mulier, 2007).
- Another common view is that "VC-dimension depends only on the model family and data distribution, and not on the training procedure used to find models" (Nakkiran et al., 2021). In fact, VC-dimension does not depend on the data distribution (Vapnik, 1998; Cherkassky & Mulier, 2007; Schölkopf & Smola, 2002). Furthermore, VC-dimension certainly depends on the SGD algorithm (Vapnik, 1998; Cherkassky & Mulier, 2007).

For classification problems, VC-theory provides an analytic generalization bound for the (unknown) prediction risk, or test error R_tst, as a function of the empirical risk, or training error R_trn, and the VC-dimension h of the set of admissible models. That is, for a given training data set of size n, the VC-bound has the following form (Vapnik, 1998; 1999; 2006; Cherkassky & Mulier, 2007):

R_{tst} \le R_{trn} + \frac{\varepsilon}{2}\left(1 + \sqrt{1 + \frac{4 R_{trn}}{\varepsilon}}\right), \quad \text{where} \quad \varepsilon = \frac{a_1}{n}\left[h\left(\ln\frac{a_2 n}{h} + 1\right) - \ln\frac{\eta}{4}\right]    (1)

This VC-bound (1) holds with probability 1 - η for all possible models (functions), including the one minimizing R_trn. The value of η is preset to a small value, i.e., η = 4/√n. The second additive term in (1), called the confidence interval (also known as the excess risk), depends on both R_trn and h. This bound describes the relationship between training error, test error, and VC-dimension, and it is used for a conceptual understanding of model complexity control, i.e., the effect of VC-dimension on test error. Application of this bound for accurate modeling of double descent curves requires:
- Selecting proper values of the positive constants a_1 and a_2. The worst-case values a_1 = 4 and a_2 = 2, provided in VC-theory (Vapnik, 1998; 1999), correspond to worst-case "heavy-tailed" distributions, resulting in VC-bounds that are too loose for real-life data sets (Cherkassky & Mulier, 2007).
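As a minimal numeric sketch of bound (1) (the helper name and example settings are our illustration, not from the paper), the bound can be evaluated directly; with the worst-case constants a_1 = 4 and a_2 = 2 it quickly becomes loose as h grows:

```python
import math

def vc_bound(r_trn, h, n, a1=4.0, a2=2.0):
    """Bound (1): R_tst <= R_trn + (eps/2) * (1 + sqrt(1 + 4*R_trn/eps)),
    with eps = (a1/n) * (h * (ln(a2*n/h) + 1) - ln(eta/4)) and eta = 4/sqrt(n).

    Defaults are the worst-case constants a1 = 4, a2 = 2.
    """
    eta = 4.0 / math.sqrt(n)
    eps = a1 * (h * (math.log(a2 * n / h) + 1.0) - math.log(eta / 4.0)) / n
    return r_trn + (eps / 2.0) * (1.0 + math.sqrt(1.0 + 4.0 * r_trn / eps))

# For fixed training error, the bound loosens as VC-dimension h grows:
print(vc_bound(0.10, h=50, n=10000))
print(vc_bound(0.10, h=500, n=10000))
```

With h = 500 the worst-case bound already exceeds 1 (i.e., it is vacuous), which illustrates why the worst-case constants are too loose for real-life data sets.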
For real-life data sets, when distributions are unknown, we suggest the values a_1 = 3 and a_2 = 1, which were used for all empirical results in this paper. For additional discussion of selecting these values (incorporating a priori knowledge about the unknown distributions), see Appendix A.
- Analytic estimates of VC-dimension. For many learning methods (including DL), analytic estimates of VC-dimension are not known. For example, for SGD-style algorithms, the effect of various heuristics (e.g., initialization of weights) on VC-dimension is difficult (or impossible) to quantify analytically.

Note that VC-bound (1) provides a conceptual explanation of both first and second descent. That is, first descent corresponds to minimizing this bound when the training error is non-zero (Vapnik, 1998; 1999; Cherkassky & Mulier, 2007). Second descent corresponds to minimizing this bound when the training error is kept at zero, using models having small VC-dimension. This can be shown by setting the training error in bound (1) to zero (so the confidence interval reduces to ε), resulting in the following bound on test error during the second descent (using the values a_1 = 3 and a_2 = 1, and neglecting the small η-dependent term):

R_{tst} \le \varepsilon, \quad \text{where} \quad \varepsilon = \frac{3h}{n}\left(\ln\frac{n}{h} + 1\right)    (2)
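The second-descent behavior implied by bound (2) can be sketched numerically (a hypothetical illustration; the sample size n = 10000 and the sequence of h values are arbitrary choices of ours): with training error held at zero, the bound on test error decreases monotonically as the VC-dimension h shrinks.

```python
import math

def second_descent_bound(h, n):
    """Bound (2): with zero training error, a1 = 3 and a2 = 1,
    the test-error bound reduces to eps = (3h/n) * (ln(n/h) + 1)."""
    return (3.0 * h / n) * (math.log(n / h) + 1.0)

n = 10000
for h in (2000, 1000, 500, 100):
    # Smaller VC-dimension at zero training error -> smaller bound on test error.
    print(h, round(second_descent_bound(h, n), 4))
```

This is the mechanism the paper attributes to second descent: increasing the number of parameters does not hurt generalization as long as the effective VC-dimension of the trained models stays small relative to n.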

