VC-THEORETICAL EXPLANATION OF DOUBLE DESCENT

Abstract

There has been growing interest in the generalization performance of large multilayer neural networks that can be trained to achieve zero training error while generalizing well on test data. This regime is known as 'second descent', and it appears to contradict the conventional view that optimal model complexity should reflect an optimal balance between underfitting and overfitting, i.e., the bias-variance trade-off. This paper presents a VC-theoretical analysis of double descent and shows that it can be fully explained by classical VC-generalization bounds. We illustrate an application of analytic VC-bounds for modeling double descent for classification, using empirical results for several learning methods, such as SVM, Least Squares, and Multilayer Perceptron classifiers. In addition, we discuss several reasons for the misinterpretation of VC-theoretical results in the Deep Learning community.

1. INTRODUCTION

There have been many recent successful applications of Deep Learning (DL). However, at present, various DL methods are driven mainly by heuristic improvements, while theoretical and conceptual understanding of this technology remains limited. For example, large neural networks can be trained to fit the available data (achieving zero training error) and still achieve good generalization on test data. This contradicts the conventional statistical wisdom that overfitting leads to poor generalization. This phenomenon was systematically described by Belkin et al. (2019), who introduced the term 'double descent' and pointed out the difference between the classical regime (first descent) and the modern one (second descent). The disagreement between the classical statistical view and modern machine learning practice motivates new theoretical explanations of the generalization ability of DL networks and other over-parameterized estimators. Proposed explanations include: special properties of multilayer network parameterization (Bengio, 2009), choosing a proper inductive bias during second descent (Belkin et al., 2019), the effect of Stochastic Gradient Descent (SGD) training (Zhang et al., 2021; Neyshabur et al., 2014; Dinh et al., 2017), the effect of various training heuristics on generalization (Srivastava et al., 2014), and the effect of margin on generalization (Bartlett et al., 2017).

The current consensus view on the 'generalization paradox' in DL networks is summarized below:
- Existing indices of model complexity (or capacity), such as the VC-dimension, cannot explain the generalization performance of DL networks.
- 'Classical' theories developed in ML and statistics cannot explain the generalization performance of DL networks and the double descent phenomenon. Specifically, Zhang et al. (2021) argue that the ability of large DL networks to achieve zero training error (during the second descent mode) effectively "rules out all of the VC-dimension arguments as a possible explanation for the generalization performance of state-of-the-art neural networks."

This paper demonstrates that these assertions are incorrect, and that classical VC-theoretical results can fully explain the generalization performance of DL networks, including double descent, for classification problems. In particular, we show that proper application of VC-bounds, using correct estimates of the VC-dimension, provides accurate modeling of double descent for various classifiers trained using SGD, least squares loss, and standard SVM loss. The proposed VC-theoretical explanation provides additional insights into generalization performance during first descent vs. second descent, and into the effect of statistical properties of the data on double descent curves.
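For readers unfamiliar with the bounds referred to above, the standard textbook form of Vapnik's VC generalization bound for classification is sketched below. This is general background, not necessarily the exact analytic bound applied later in this paper: with probability at least $1-\eta$, for a classifier $f$ chosen from a function class of VC-dimension $h$ and a training set of size $n$,

\[
R(f) \;\leq\; R_{\mathrm{emp}}(f) \;+\; \sqrt{\frac{h\left(\ln(2n/h) + 1\right) - \ln(\eta/4)}{n}},
\]

where $R(f)$ is the expected (test) error and $R_{\mathrm{emp}}(f)$ is the empirical (training) error. Note that the bound depends on the ratio $n/h$ rather than on the raw parameter count, which is why correct estimation of the VC-dimension, rather than counting weights, is central to the argument made in this paper.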

