IS THE PERFORMANCE OF MY DEEP NETWORK TOO GOOD TO BE TRUE? A DIRECT APPROACH TO ESTIMATING THE BAYES ERROR IN BINARY CLASSIFICATION

Abstract

There is a fundamental limitation in the prediction performance that a machine learning model can achieve due to the inevitable uncertainty of the prediction target. In classification problems, this can be characterized by the Bayes error, which is the best achievable error with any classifier. The Bayes error can be used as a criterion to evaluate classifiers with state-of-the-art performance and to detect test set overfitting. We propose a simple and direct Bayes error estimator, where we just take the mean of the labels that show uncertainty of the class assignments. Our flexible approach enables us to perform Bayes error estimation even for weakly supervised data. In contrast to existing approaches, our method is model-free and even instance-free. Moreover, it has no hyperparameters and empirically gives a more accurate estimate of the Bayes error than several baselines. Experiments using our method suggest that recently proposed deep networks such as the Vision Transformer may have already reached, or may be about to reach, the Bayes error for benchmark datasets. Finally, we discuss how we can study the inherent difficulty of the acceptance/rejection decision for scientific articles, by estimating the Bayes error of the ICLR papers from 2017 to 2023.
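The estimator described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact procedure: it assumes binary classification with soft labels r_i that represent P(y = 1 | x_i) for each instance (e.g., the fraction of annotators who assigned the positive class), and averages the per-instance class uncertainty min(r_i, 1 - r_i):

```python
import numpy as np

def estimate_bayes_error(soft_labels):
    """Hedged sketch of a direct Bayes error estimator.

    soft_labels: values r_i in [0, 1] interpreted as P(y = 1 | x_i)
    for each instance i (an assumption of this sketch).
    """
    r = np.asarray(soft_labels, dtype=float)
    # The Bayes-optimal classifier errs on an instance x with
    # probability min(P(y=1|x), P(y=0|x)); averaging this quantity
    # over instances yields an estimate of the Bayes error without
    # training any model on the features themselves.
    return float(np.mean(np.minimum(r, 1.0 - r)))

# Confident labels yield a low estimate:
print(estimate_bayes_error([0.95, 0.02, 0.88, 0.10]))  # 0.0725
```

Note that the estimate depends only on the labels, which is why such an approach can be called model-free and instance-free: no classifier is trained and no feature vectors are needed.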

1. INTRODUCTION

Comparing the prediction performance of a deep neural network with the state-of-the-art (SOTA) performance is a common approach to validating advances brought by a new proposal in machine learning research. By definition, SOTA performance, such as the error or 1-AUC (where AUC stands for "Area under the ROC Curve"), monotonically decreases over time, but there is a fundamental limit on the prediction performance that a machine learning model can achieve. Hence, it is important to figure out how close the current SOTA performance is to the underlying best performance achievable (Theisen et al., 2021).

In classification problems, one way to characterize the irreducible part is the Bayes error (Cover & Hart, 1967; Fukunaga, 1990), which is the best expected error achievable by any measurable function. The Bayes error is zero in the special case where the class distributions are completely separable, but in practice we usually deal with complex distributions with some class overlap. Natural images tend to have a lower Bayes error (e.g., 0.21% test error on MNIST (LeCun et al., 1998) in Wan et al. (2013)), while medical images tend to have a higher Bayes error, since even medical technologists can disagree on the ground-truth label (Sasada et al., 2018).

If the current model's performance has already reached the Bayes error, it is meaningless to aim for further error improvement. We may even find that the model's error has exceeded the Bayes error, which may imply that test set overfitting (Recht et al., 2018) is taking place. Knowing the Bayes error is helpful for avoiding such common pitfalls.

Estimating the Bayes error has been a topic of interest in the research community (Fukunaga & Hostetler, 1975; Fukunaga, 1990; Devijver, 1985; Berisha et al., 2016; Noshad et al., 2019; Michelucci). For example, Henighan et al. (2020) studied the scaling law (Kaplan et al., 2020) of Transformers (Vaswani et al., 2017) by distinguishing reducible and irreducible errors.
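To make the notion of a nonzero Bayes error under class overlap concrete, here is a small toy example (not from the paper, and under assumed distributions): two equally likely classes with unit-variance Gaussian features centered at ±mu. The optimal rule thresholds at zero, yet still errs on samples from the overlap region, and its error matches the analytic Bayes error Phi(-mu):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0  # assumed toy setup: class means at +/-mu, unit variance, equal priors

# Analytic Bayes error: the Bayes-optimal rule thresholds x at 0,
# so each class is misclassified with probability Phi(-mu), where
# Phi is the standard normal CDF.
bayes_error = 0.5 * (1.0 + math.erf(-mu / math.sqrt(2.0)))

# Monte Carlo: even the optimal classifier errs on overlap samples.
n = 200_000
y = rng.integers(0, 2, size=n)                  # labels 0/1, equal priors
x = rng.normal(loc=np.where(y == 1, mu, -mu))   # class-conditional features
err = float(np.mean((x > 0).astype(int) != y))  # optimal threshold at 0

print(f"analytic {bayes_error:.4f}, empirical {err:.4f}")
```

Both values come out near 0.159 here, illustrating why even a perfect model cannot drive the error below this floor when the class-conditional distributions overlap.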

