BENIGN OVERFITTING IN CLASSIFICATION: PROVABLY COUNTER LABEL NOISE WITH LARGER MODELS

Abstract

Studies on benign overfitting provide insights into the success of overparameterized deep learning models. In this work, we examine whether overfitting is truly benign in real-world classification tasks. We start with the observation that a ResNet model overfits benignly on CIFAR10 but not on ImageNet. To understand why benign overfitting fails in the ImageNet experiment, we theoretically analyze benign overfitting under a more restrictive setup, where the number of parameters is not significantly larger than the number of data points. Under this mild overparameterization setup, our analysis identifies a phase change: unlike in the previous heavy overparameterization settings, benign overfitting can now fail in the presence of label noise. Our analysis explains our empirical observations and is validated by a set of controlled experiments with ResNets. Our work highlights the importance of understanding implicit bias in underfitting regimes as a future direction.

1. INTRODUCTION

Modern deep learning models achieve good generalization performance even with more parameters than data points. This surprising phenomenon is referred to as benign overfitting, and it differs from the canonical learning regime, where good generalization requires limiting model complexity (Mohri et al., 2018). One widely accepted explanation for benign overfitting is that optimization algorithms benefit from implicit bias and find good solutions among the interpolating ones in overparameterized settings. The implicit bias can vary from problem to problem; examples include the min-norm solution in regression settings and the max-margin solution in classification settings (Gunasekar et al., 2018a; Soudry et al., 2018; Gunasekar et al., 2018b). These types of bias in optimization can further lead to good generalization performance (Bartlett et al., 2020; Zou et al., 2021; Frei et al., 2022). These studies provide novel insights, yet they sometimes differ from deep learning practice: state-of-the-art models, despite being overparameterized, often do not interpolate the data points (e.g., He et al. (2016); Devlin et al. (2018)).

We first examine the existence of benign overfitting in realistic setups. In the rest of this work, we use benign overfitting to mean the observation that validation performance does not drop while the model fits more training data points.¹ We test whether ResNet (He et al., 2016) models overfit data benignly for image classification on CIFAR10 and ImageNet; our results are shown in Figure 1 below. In particular, we first trained ResNet18 on CIFAR10 for 200 epochs, after which the model interpolates the training data. In addition, we trained ResNet50 on ImageNet for 500 epochs, as opposed to the common schedule that stops at 90 epochs.
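The working definition above can be made operational. The sketch below flags a training run as overfitting non-benignly when the final validation loss is noticeably worse than the best validation loss seen during training; the tolerance and the two loss curves are our own illustrative assumptions, not values from the experiments.

```python
# A minimal operational check for the working definition of benign overfitting:
# training overfits benignly iff validation performance does not degrade as the
# model continues to fit the training data.

def overfits_benignly(val_losses, tol=0.01):
    """Return True if the final validation loss is within `tol` of the best."""
    best = min(val_losses)
    return val_losses[-1] <= best + tol

# Hypothetical CIFAR10-like run: validation loss plateaus -> benign.
cifar_like = [1.2, 0.8, 0.5, 0.40, 0.39, 0.39]
# Hypothetical ImageNet-like run: best model mid-training -> not benign.
imagenet_like = [2.5, 1.6, 1.1, 0.95, 1.05, 1.20]

print(overfits_benignly(cifar_like))     # -> True
print(overfits_benignly(imagenet_like))  # -> False
```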
Surprisingly, we found that although benign overfitting happens on the CIFAR10 dataset, overfitting is not benign on the ImageNet dataset: the test loss increased as the model further fit the training set. More precisely, the ImageNet experiment does not overfit benignly since the best model is achieved in the middle of training. These different overfitting behaviors cannot be explained by known analyses for classification tasks, as no negative results have been established yet.²

Figure 1: Different overfitting behaviors on ImageNet and CIFAR10. We train ResNet50 on ImageNet and ResNet18 on CIFAR10, and plot the training loss as well as the validation loss. We find that ResNet50 overfits ImageNet non-benignly, while ResNet18 overfits CIFAR10 benignly.

Motivated by the above observation, our work aims to understand the cause of the two different overfitting behaviors on ImageNet and CIFAR10, and to reconcile the empirical phenomenon with previous analyses of benign overfitting. Our first hint comes from the level of overparameterization. Previous results on benign overfitting in the classification setting usually require that p = ω(n), where p denotes the number of parameters and n denotes the training sample size (Wang et al., 2021a; Cao et al., 2021; Chatterji et al., 2021; Frei et al., 2022). In practice, however, many deep learning models fall in the mild overparameterization regime, where the number of parameters is only slightly larger than the number of samples. In our case, the sample size is n = 10^6 for ImageNet, whereas the parameter count is p ≈ 10^7 for ResNets.

To close this gap, we study the overfitting behavior of classification models under a mild overparameterization setup where p = Θ(n) (sometimes referred to as the asymptotic regime). In particular, following Wang et al. (2021a); Cao et al. (2021); Chatterji et al. (2021); Frei et al. (2022), we analyze the solution of stochastic gradient descent for Gaussian mixture models. We find that a phase change happens when we move from p = Ω(n log n) (studied in Wang et al. (2021a)) to p = Θ(n): unlike in previous analyses, we show that benign overfitting now provably fails in the presence of label noise (see Table 1 and Figure 2).

This aligns with our empirical findings, as ImageNet is known to suffer from mislabeling and multi-label images (Yun et al., 2021; Shankar et al., 2020). More specifically, our analysis (see Theorem 3.1 for details) under the mild overparameterization (p = Θ(n)) setup supports the following statements, which align with our empirical observations in Figures 1 and 2:

• When the labels are noiseless, benign overfitting holds under similar conditions as in previous analyses.

• When the labels are noisy, the interpolating solution can provably lead to a positive excess risk that does not diminish with the sample size.

¹ This definition is slightly different from existing theoretical literature but can be verified more easily in practice; a more detailed discussion can be found in Appendix E.2.

² Prior work (, 2020) shows that the interpolator fails under mild overparameterization regimes while it may work under heavy overparameterization, consistent with the double descent curve; however, the corresponding analysis under mild overparameterization in the classification task remained unknown.
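The contrast between mild and heavy overparameterization can be illustrated with a toy simulation. The paper analyzes SGD for Gaussian mixture models; as a simplified stand-in, the sketch below fits the minimum-norm least-squares interpolator to a two-class Gaussian mixture with flipped labels and compares test error at p = Θ(n) versus p = ω(n). All parameter choices (n = 200, signal strength 5, 20% label flips, the two values of p) are our own illustrative assumptions, not values from the paper's theory or experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def min_norm_interpolator_error(n, p, signal=5.0, flip_prob=0.2, n_test=2000):
    """Fit the minimum-norm linear interpolator on a noisy Gaussian mixture
    and return its 0-1 test error against clean labels.

    Data model (a common setup in the benign-overfitting literature):
        x = y * mu + z,  z ~ N(0, I_p),  y uniform on {-1, +1},
    with each training label flipped independently with prob. flip_prob.
    """
    mu = np.zeros(p)
    mu[0] = signal                       # mean vector with ||mu|| = signal

    y = rng.choice([-1.0, 1.0], size=n)
    X = y[:, None] * mu + rng.standard_normal((n, p))
    flips = rng.random(n) < flip_prob
    y_noisy = np.where(flips, -y, y)     # noisy training labels

    # Minimum-norm solution of X w = y_noisy (requires p >= n to interpolate);
    # lstsq returns exactly this min-norm solution for underdetermined systems.
    w, *_ = np.linalg.lstsq(X, y_noisy, rcond=None)

    # Evaluate against fresh *clean* test labels.
    y_test = rng.choice([-1.0, 1.0], size=n_test)
    X_test = y_test[:, None] * mu + rng.standard_normal((n_test, p))
    return float(np.mean(np.sign(X_test @ w) != y_test))

n = 200
err_mild = min_norm_interpolator_error(n, p=220)     # p = Theta(n)
err_heavy = min_norm_interpolator_error(n, p=8000)   # p = omega(n)
print(f"test error, mild overparameterization  (p=220):  {err_mild:.3f}")
print(f"test error, heavy overparameterization (p=8000): {err_heavy:.3f}")
```

In this proxy, the heavily overparameterized interpolator spreads the fitted label noise over many directions and keeps a clean signal component, while near p ≈ n the interpolator amplifies the noise it fits, mirroring the phase change discussed above (for min-norm regression rather than the paper's SGD/classification analysis).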



