BENIGN OVERFITTING IN CLASSIFICATION: PROVABLY COUNTER LABEL NOISE WITH LARGER MODELS

Abstract

Studies on benign overfitting provide insights into the success of overparameterized deep learning models. In this work, we examine whether overfitting is truly benign in real-world classification tasks. We start with the observation that a ResNet model overfits benignly on CIFAR10 but not on ImageNet. To understand why benign overfitting fails in the ImageNet experiment, we theoretically analyze benign overfitting under a more restrictive setup where the number of parameters is not significantly larger than the number of data points. Under this mild overparameterization setup, our analysis identifies a phase change: unlike in the previous heavy overparameterization settings, benign overfitting can now fail in the presence of label noise. Our analysis explains our empirical observations and is further validated by a set of control experiments with ResNets. Our work highlights the importance of understanding implicit bias in underfitting regimes as a future direction.

1. INTRODUCTION

Modern deep learning models achieve good generalization performance even with more parameters than data points. This surprising phenomenon is referred to as benign overfitting, and it differs from the canonical learning regime in which good generalization requires limiting model complexity (Mohri et al., 2018). One widely accepted explanation for benign overfitting is that optimization algorithms benefit from implicit bias and find good solutions among the interpolating ones in overparameterized settings. The implicit bias can vary from problem to problem; examples include the min-norm solution in regression settings and the max-margin solution in classification settings (Gunasekar et al., 2018a; Soudry et al., 2018; Gunasekar et al., 2018b). These types of bias in optimization can further lead to good generalization performance (Bartlett et al., 2020; Zou et al., 2021; Frei et al., 2022). These studies provide novel insights, yet they sometimes differ from deep learning practice: state-of-the-art models, despite being overparameterized, often do not interpolate the data points (e.g., He et al. (2016); Devlin et al. (2018)).

We first examine the existence of benign overfitting in realistic setups. In the rest of this work, we use benign overfitting to refer to the observation that validation performance does not drop while the model fits more training data points.¹ We test whether ResNet (He et al., 2016) models overfit data benignly for image classification on CIFAR10 and ImageNet. Our results are shown in Figure 1 below. In particular, we first trained ResNet18 on CIFAR10 for 200 epochs, and the model interpolates the training data. In addition, we trained ResNet50 on ImageNet for 500 epochs, as opposed to the common schedule that stops at 90 epochs. Surprisingly, we found that although benign overfitting happens on the CIFAR10 dataset, overfitting is not benign on the ImageNet dataset: the test loss increased as the model further fit the training set. More precisely, the ImageNet experiment does not overfit benignly since the best model is achieved in the middle of training. The different

¹ A more detailed discussion can be found in Appendix E.2. This definition is slightly different from the existing theoretical literature but can be verified more easily in practice.
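To make the CIFAR10 part of this setup concrete, the following is a minimal PyTorch sketch (not the authors' released code) of training ResNet18 for 200 epochs while logging the train and test losses each epoch, which is the quantity used to judge whether overfitting is benign. The hyperparameters, data normalization, and the use of the standard torchvision ResNet18 are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: train ResNet18 on CIFAR10 well past the usual schedule and track
# train/test loss per epoch. Benign overfitting corresponds to the test loss
# not rising while the train loss keeps decreasing toward interpolation.
# Hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10("./data", train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, num_workers=4)

# Standard torchvision ResNet18 with a 10-way head (CIFAR-specific stem changes omitted).
model = torchvision.models.resnet18(num_classes=10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=200)
loss_fn = nn.CrossEntropyLoss()

def mean_loss(loader):
    """Average cross-entropy loss over a data loader."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            total += loss_fn(model(x), y).item() * y.size(0)
            n += y.size(0)
    return total / n

for epoch in range(200):  # keep training until the model (nearly) interpolates
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    sched.step()
    print(f"epoch {epoch}: train loss {mean_loss(train_loader):.4f}, "
          f"test loss {mean_loss(test_loader):.4f}")
```

The ImageNet experiment follows the same pattern with ResNet50, a 500-epoch budget instead of the common 90-epoch schedule, and the same per-epoch comparison of train and test loss.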

