IS THE PERFORMANCE OF MY DEEP NETWORK TOO GOOD TO BE TRUE? A DIRECT APPROACH TO ESTIMATING THE BAYES ERROR IN BINARY CLASSIFICATION

Abstract

There is a fundamental limit to the prediction performance that a machine learning model can achieve, due to the inevitable uncertainty of the prediction target. In classification problems, this limit can be characterized by the Bayes error, which is the best achievable error with any classifier. The Bayes error can serve as a criterion for evaluating classifiers with state-of-the-art performance and for detecting test set overfitting. We propose a simple and direct Bayes error estimator, which simply takes the mean of labels that express the uncertainty of the class assignments. Our flexible approach enables Bayes error estimation even for weakly supervised data. In contrast to previous methods, ours is model-free and even instance-free. Moreover, it has no hyperparameters and, empirically, gives a more accurate estimate of the Bayes error than several baselines. Experiments using our method suggest that recently proposed deep networks such as the Vision Transformer may have reached, or be about to reach, the Bayes error for benchmark datasets. Finally, we discuss how we can study the inherent difficulty of the acceptance/rejection decision for scientific articles, by estimating the Bayes error of ICLR papers from 2017 to 2023.

1. INTRODUCTION

Comparing the prediction performance of a deep neural network with the state-of-the-art (SOTA) performance is a common approach to validating advances brought by a new proposal in machine learning research. By definition, SOTA performance such as the error or 1-AUC (AUC: "Area under the ROC Curve") monotonically decreases over time, but there is a fundamental limit to the prediction performance that a machine learning model can achieve. Hence, it is important to figure out how close the current SOTA performance is to the underlying best achievable performance (Theisen et al., 2021). For example, Henighan et al. (2020) studied the scaling law (Kaplan et al., 2020) of Transformers (Vaswani et al., 2017) by distinguishing the reducible and irreducible errors.

In classification problems, one way to characterize the irreducible part is the Bayes error (Cover & Hart, 1967; Fukunaga, 1990), which is the best achievable expected error with any measurable function. The Bayes error becomes zero in the special case where the class distributions are completely separable, but in practice we usually deal with complex distributions with some class overlap. Natural images tend to have a lower Bayes error (e.g., 0.21% test error on MNIST (LeCun et al., 1998) in Wan et al. (2013)), while medical images tend to have a higher Bayes error, since even medical technologists can disagree on the ground-truth label (Sasada et al., 2018).

If the current model's performance has already reached the Bayes error, it is meaningless to aim for further error improvement. We may even find that the model's error has exceeded the Bayes error, which may imply that test set overfitting (Recht et al., 2018) is taking place. Knowing the Bayes error helps avoid such common pitfalls. Estimating the Bayes error has been a topic of interest in the research community (Fukunaga & Hostetler, 1975; Fukunaga, 1990; Devijver, 1985; Berisha et al., 2016; Noshad et al., 2019; Michelucci et al., 2021; Theisen et al., 2021).
To the best of our knowledge, all previous papers have proposed ways to estimate the Bayes error from a dataset consisting of pairs of instances and their hard labels. When instances and hard labels are available, one can also train a supervised classifier, which is known to approach the Bayes classifier (the classifier that achieves the Bayes error) with sufficient training data, provided that the model is correctly specified. This is an interesting research problem from the point of view of Vapnik's principle [2] (Vapnik, 2000), since we can derive the Bayes error from the Bayes classifier (and the underlying distribution), while we cannot recover the Bayes classifier from knowledge of the Bayes error, which is just a scalar. How can we take full advantage of this property? While the Bayes error is usually defined as the best achievable expected error with any measurable function, for binary classification it is known to be equivalent to the expectation of the minimum of the class-posteriors with respect to the classes. Inspired by Vapnik's principle, our main idea is to skip the intermediate step of learning a function model and to directly approximate the minimum of the class-posteriors by using soft labels (corresponding to the class probability) or uncertainty labels (corresponding to the class uncertainty) [3].

Our proposed method has two benefits. First, our method is model-free. Since we do not learn a model, we avoid the curse of dimensionality, whereas training a model on high-dimensional instances can cause issues such as overfitting. High dimensionality may also deteriorate the performance of other Bayes error estimation methods (Berisha et al., 2016; Noshad et al., 2019) because of divergence estimation. We experimentally show that our method estimates the Bayes error more accurately than baselines that utilize instances and soft labels.
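Writing η(x) = p(y = +1 | x) for the positive class-posterior, the equivalence just mentioned can be spelled out as follows (standard notation; the symbol β for the Bayes error is our shorthand):

```latex
% Bayes error in binary classification, with \eta(x) = p(y = +1 \mid x):
\beta
  \;=\; \inf_{f:\ \text{measurable}} \mathbb{E}_{(x,y) \sim p(x,y)}\bigl[\mathbb{1}[f(x) \neq y]\bigr]
  \;=\; \mathbb{E}_{x \sim p(x)}\bigl[\min\{\eta(x),\, 1 - \eta(x)\}\bigr].
```

The right-hand side depends on the distribution only through a pointwise minimum of the class-posteriors, which is why averaging quantities of the form min{z, 1 − z} over soft labels z ≈ η(x) can work without any intermediate model.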
Our model-free method is also extremely fast, since we have no hyperparameters to tune and no function model to train. The second benefit is a more practical one: our method is completely instance-free. Suppose our final goal is to estimate the Bayes error instead of training a classifier. In that case, we do not need to collect instance-label pairs, and it may be less costly to collect soft/uncertainty labels without instances. Dealing with instances can cause privacy issues, and it can be expensive due to data storage costs, especially when instances are high-dimensional or come in large quantities. It may also incur security costs to protect instances from data breaches. As an example of an instance-free scenario, consider doctors who diagnose patients by inspecting symptoms and asking questions, without explicitly collecting or storing the patients' data in a database. In this scenario, the hospital will only have the decisions and confidence of the doctors, which can be used as soft labels.

The contributions of this paper are as follows. We first propose a direct way to estimate the Bayes error from soft (or uncertainty) labels, without a model or instances. We show that our estimator is unbiased and consistent. In practice, collecting soft/uncertainty labels can be difficult, since the labelling process can become noisy. We therefore propose a modified estimator that remains unbiased and consistent even when the soft labels are contaminated with zero-mean noise. We also show that our approach can be applied to other classification problems, such as weakly supervised learning (Sugiyama et al., 2022). Finally, we demonstrate the behavior of the proposed methods through various experiments.
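As a minimal sketch of the direct estimator (our own illustration: the Beta-distributed posteriors are a purely hypothetical stand-in for soft labels that would, in practice, come from annotators):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: simulate positive-class posteriors for n examples;
# in practice, soft labels z_i would be collected from annotators.
n = 100_000
soft_labels = rng.beta(2, 2, size=n)  # z_i plays the role of p(y = +1 | x_i)

# Direct, instance-free Bayes error estimate: the mean of the pointwise
# minimum of the two class-posteriors.
bayes_error_hat = np.mean(np.minimum(soft_labels, 1.0 - soft_labels))

# Equivalent view with "uncertainty labels" u_i = 1 - (dominant posterior):
uncertainty_labels = 1.0 - np.maximum(soft_labels, 1.0 - soft_labels)
bayes_error_hat_u = uncertainty_labels.mean()

# For Beta(2, 2) posteriors, E[min{eta, 1 - eta}] = 0.3125, so both
# estimates should be close to that value.
print(bayes_error_hat, bayes_error_hat_u)
```

With no model and no instances involved, the whole computation is a single vectorized mean, which is also why there are no hyperparameters to tune.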
Our results suggest that recently proposed deep networks such as the Vision Transformer (Dosovitskiy et al., 2021) have reached, or are about to reach, the Bayes error for benchmark datasets such as CIFAR-10H (Peterson et al., 2019) and Fashion-MNIST-H (a new dataset we present, explained in Sec. 5.3). We also demonstrate how our proposed method can be used to estimate the Bayes error for academic conferences such as ICLR, by regarding paper acceptance as an accept/reject binary classification problem.

2. BACKGROUND

Before discussing our setup, we review in this section ordinary binary classification from positive and negative data, and binary classification from positive-confidence data.

Learning from positive-negative data. Suppose a pair of a d-dimensional instance x ∈ R^d and its class label y ∈ {+1, -1} follows an unknown probability distribution with density p(x, y).
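In this setup, the joint density decomposes in the usual way (our notation, consistent with the definitions above), and a classifier f is evaluated by its classification risk:

```latex
% Joint density over instance-label pairs, x \in \mathbb{R}^d, y \in \{+1, -1\}:
p(x, y) \;=\; p(x)\, p(y \mid x),
\qquad
R(f) \;=\; \mathbb{E}_{(x,y) \sim p(x,y)}\bigl[\mathbb{1}[f(x) \neq y]\bigr].
% The Bayes error is the infimum of R(f) over all measurable f.
```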



[2] Vapnik's principle is stated as follows: "When solving a given problem, try to avoid solving a more general problem as an intermediate step." (Sec. 1.2 of Vapnik, 2000)
[3] An uncertainty label is 1 minus the dominant class-posterior. We discuss how to obtain soft labels in practice in Sec. 3.

