SELECTIVE CLASSIFIER ENSEMBLE

Abstract

Selective classification allows a machine learning model to abstain from predicting some hard inputs and thus improve the safety of its predictions. In this paper, we study the ensemble of selective classifiers, i.e. selective classifier ensemble, which combines several weak selective classifiers to obtain a more powerful model. We prove that under some assumptions, the ensemble has a lower selective risk than the individual model under a range of coverage. The proof is nontrivial since the selective risk is a non-convex function of the model prediction. The assumptions and the theoretical result are supported by systematic experiments on both computer vision and natural language processing tasks. A surprising empirical result is that a simple selective classifier ensemble, namely, the ensemble model with maximum probability as confidence, is the state-of-the-art selective classifier. For instance, on CIFAR-10, using the same VGG-16 backbone model, this ensemble reduces the AURC (Area Under Risk-Coverage Curve) by about 24%, relative to the previous state-of-the-art method.

1. INTRODUCTION

Although recent years have witnessed the broad application of deep learning models, their safety has not been fully guaranteed. For any given deep learning classifier, there may be inputs in practical applications that the model cannot classify reliably, and on which it may make unpredictable errors. To prevent this kind of error, we must accurately delimit the classifier's application scope. This need gives rise to the study of selective classification, which learns a selective classifier (f, g), where f is a conventional classifier and g is a selective function that decides whether the selective classifier should abstain from prediction. Since the classifier is well studied, the study of selective classification focuses on the design of the selective function. A standard approach is to design a confidence score function with a threshold, and several confidence score functions have been developed. A simple confidence score function is the maximum predictive probability of the classifier (Hendrycks & Gimpel, 2017). More advanced methods modify the model architecture (Geifman & El-Yaniv, 2019) or the loss function (Liu et al., 2019; Huang et al., 2020) of the classifier to train the confidence score function and the classifier simultaneously. For example, Deep Gambler (Liu et al., 2019) regards the selective classification problem as gambling and proposes a novel loss function to train the classifier and the confidence score function. Although there are various individual models for the selective classifier, there has been no systematic study of the ensemble method in selective classification.
It is well known that the ensemble method, which combines individual models to obtain a more powerful model, can improve the predictive performance of machine learning models (see Zhou (2012) for a review), but only one particular selective classifier ensemble, the ensemble of Softmax Response (Hendrycks & Gimpel, 2017), has been empirically studied, by Lakshminarayanan et al. (2017). Ensembles of other kinds of selective classifiers, and the theoretical foundation of the ensemble in selective classification, have not been studied yet. In this paper, we first establish the theoretical foundation of the ensemble of selective classifiers, that is, under some assumptions, the ensemble has a lower selective risk than the individual model under a range of coverage. The proof is nontrivial since the selective risk (with the 0/1 loss) is non-convex. Second, we show the experimental results of the ensemble's performance in selective classification. The contributions of this paper are summarized as follows.
• We are the first to theoretically demonstrate that, based on several reasonable assumptions, the ensemble has a lower selective risk than the individual model under a range of coverage. We verify this by systematic experiments on the tasks of image classification and text classification.
• We show a surprising experimental result that two simple methods, the SR ensemble and the Reg-curr ensemble, which can be summarized as the ensemble model with maximum probability as confidence, are the state-of-the-art selective classifiers.

2. PROBLEM FORMULATION OF SELECTIVE CLASSIFICATION

A selective classifier is composed of a standard classifier and a selective function. Considering a standard classification problem, $\mathcal{X}$ is a feature space, $\mathcal{Y} = \{1, 2, \dots, K\}$ is a finite label set, and a classifier $f$ is a function $f : \mathcal{X} \to \mathcal{Y}$. A labeled dataset $D = \{(x_i, y_i)\}_{i=0}^{N} \subseteq \mathcal{X} \times \mathcal{Y}$ is sampled from a distribution $p_{X,Y}$. Our goal is to learn a selective classifier $(f, g)$, where $f$ is a standard classifier and $g : \mathcal{X} \to \{0, 1\}$ is a selective function that estimates the correctness of $f$'s prediction. Given input $x$, the output of the selective classifier $(f, g)$ is

$$(f, g)(x) = \begin{cases} f(x), & \text{if } g(x) = 1 \\ \text{Abstain}, & \text{if } g(x) = 0. \end{cases} \tag{1}$$

Usually, $g$ is realized by a confidence score $\kappa : \mathcal{X} \to \mathbb{R}^+$ with a threshold $\tau$ (Geifman & El-Yaniv, 2017), namely

$$g(x) = \mathbb{I}\{\kappa(x) > \tau\}, \tag{2}$$

where $\mathbb{I}$ is the indicator function.

Coverage and selective risk are two basic evaluation metrics of selective classifiers, and the goal of selective classifiers is to minimize the selective risk for a target coverage. The coverage of $(f, g)$ is defined to be the probability of $(f, g)$ not abstaining from prediction (Geifman & El-Yaniv, 2017), i.e.,

$$\phi(f, g) := \mathbb{E}_{p(x)}[g(x)],$$

where $p(x)$ is the probability density function of input $x$. The selective risk (Geifman & El-Yaniv, 2017) of $(f, g)$ is

$$R(f, g) := \frac{\mathbb{E}_{p(x)}[\ell(f(x), y)\, g(x)]}{\mathbb{E}_{p(x)}[g(x)]},$$

where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ is a given loss function. Usually, $\ell$ is the 0/1 loss (Geifman & El-Yaniv, 2017; 2019; Liu et al., 2019; Huang et al., 2020). Based on these definitions, the objective of selective classifiers is formalized as

$$\min R(f, g), \quad \text{s.t. } \phi(f, g) \ge c_{target},$$

where $c_{target}$ is a given target coverage.

When the selective function $g$ is defined as in (2), the confidence threshold $\tau$ controls the tradeoff between coverage and selective risk. With different values of $\tau$, $(f, g)$ has different pairs of coverage and selective risk $(\phi(f, g; \tau), R(f, g; \tau))$, which form the risk-coverage curve (Geifman & El-Yaniv, 2017) of $(f, g)$.
The risk-coverage curve specifies the entire performance profile of a selective classifier, and it is easy to see that the selective classifier with a lower risk-coverage curve is better. To evaluate selective classifiers more concisely, the area under the risk-coverage curve (AURC) is introduced as a metric of selective classifiers (Xin et al., 2021) , and the selective classifier with a lower AURC is better.
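The definitions above can be sketched in a few lines of NumPy. The sketch below is illustrative only: the function names are hypothetical, the 0/1 loss is assumed, and the AURC is approximated by averaging the selective risk over the N coverage levels obtained by accepting the top-1, top-2, ..., top-N most confident test samples, which is one common discrete estimate and not necessarily the exact estimator used in the experiments. Tied confidence scores are broken by input order here, although a single threshold cannot separate them.

```python
import numpy as np


def risk_coverage_curve(confidence, correct):
    """Empirical risk-coverage curve under the 0/1 loss.

    confidence: shape (N,), confidence score kappa(x) of each test sample.
    correct:    shape (N,), True where f(x) equals the label.
    Returns (coverages, risks); entry n-1 corresponds to accepting the n most
    confident samples, i.e., to one choice of the threshold tau.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    order = np.argsort(-confidence, kind="stable")  # most confident first
    errors = (~correct[order]).astype(float)        # 0/1 loss per sample
    accepted = np.arange(1, len(errors) + 1)
    coverages = accepted / len(errors)
    risks = np.cumsum(errors) / accepted            # selective risk
    return coverages, risks


def aurc(confidence, correct):
    """Discrete AURC estimate: mean selective risk over the coverage grid."""
    _, risks = risk_coverage_curve(confidence, correct)
    return float(np.mean(risks))


# Toy example with hypothetical numbers: four samples, one misclassified.
toy_conf = [0.9, 0.8, 0.7, 0.6]
toy_correct = [True, True, False, True]
toy_coverages, toy_risks = risk_coverage_curve(toy_conf, toy_correct)
toy_aurc = aurc(toy_conf, toy_correct)
```

In the toy example the misclassified sample has the third-highest confidence, so the selective risk is zero up to coverage 0.5 and becomes 1/3 at coverage 0.75; a confidence function that ranked the error last would yield a lower curve and a lower AURC.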

3. RELATED WORK

Here, we summarize the previous studies on selective classification and ensemble methods. We also discuss the difference between selective classification and out-of-distribution detection.

3.1. SELECTIVE CLASSIFICATION

The critical problem of designing a selective classifier is to design its selective function, and there are two types of selective function g. One is the implicit selective function, which is derived from the classifier. The other is the explicit selective function, a neural network trained simultaneously with the classifier f. Previous works on selective classifiers are listed as follows.

Selective classifiers with implicit selective functions include SR (Softmax Response) (Hendrycks & Gimpel, 2017), MC-Dropout (Monte Carlo-Dropout) (Gal & Ghahramani, 2016), and Reg-curr (Xin et al., 2021). In SR, the selective classifier is a vanilla classifier with its maximum predictive probability as confidence (i.e., the maximum output of the softmax layer). MC-Dropout enables the dropout layers of the classifier and runs multiple feed-forward passes at inference time to obtain the variance of the maximum probability output of f's softmax layer, whose negative value is used as the confidence score. Reg-curr behaves the same as SR at inference time but uses an RPP-based regularizer at training time, where RPP (Reversed Pair Proportion) (Xin et al., 2021) is the proportion of reversed pairs of confidence scores.

Selective classifiers with explicit selective functions include SN (SelectiveNet) (Geifman & El-Yaniv, 2019), Gambler (Deep Gambler) (Liu et al., 2019), and SAT (Self-Adaptive Training) (Huang et al., 2020). SN is a neural network that combines f and g, where f and g share convolutional layers and have separate fully-connected layers. The loss function of the model is the selective risk with some regularizers. A hyperparameter c is needed to specify the target coverage. Gambler adds the abstention option to the classifier as an extra class, that is, for a given input x, the predictive probability of the extra class is the confidence of abstention, i.e., 1 - κ(x). At training time, it regards the selective classification problem as gambling and is trained to maximize the gambling reward. Similar to Gambler, SAT adds the abstention option to the classifier as an extra class. However, SAT has a different training procedure: it is trained with a soft label that tells the model which samples to reject. SAT is the previous state-of-the-art selective classifier for image classification tasks.
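To make the inference-time confidence scores above concrete, the following sketch computes them from arrays of predicted probabilities. It is a simplified illustration, not the reference implementation of any cited method: the function names and array layouts are hypothetical, the MC-Dropout variance is taken over the per-pass maximum probability (one reading of the description above), and the extra-class score assumes that the abstention class is stored in the last column.

```python
import numpy as np


def sr_confidence(probs):
    """SR (and Reg-curr at inference): maximum softmax probability.

    probs: shape (N, K), one predictive distribution per sample.
    """
    return np.asarray(probs, dtype=float).max(axis=-1)


def mc_dropout_confidence(stochastic_probs):
    """MC-Dropout: negative variance of the maximum probability.

    stochastic_probs: shape (T, N, K), softmax outputs of T forward passes
    with dropout enabled (assumed to be computed elsewhere).
    """
    max_probs = np.asarray(stochastic_probs, dtype=float).max(axis=-1)  # (T, N)
    return -max_probs.var(axis=0)


def extra_class_confidence(probs_with_abstain):
    """Gambler/SAT-style score: kappa(x) = 1 - probability of the extra class.

    probs_with_abstain: shape (N, K + 1); the last column is assumed to be
    the abstention class.
    """
    return 1.0 - np.asarray(probs_with_abstain, dtype=float)[:, -1]


# Toy inputs with hypothetical numbers.
sr_scores = sr_confidence([[0.7, 0.3], [0.4, 0.6]])
mc_scores = mc_dropout_confidence([[[0.9, 0.1]], [[0.7, 0.3]]])
extra_scores = extra_class_confidence([[0.6, 0.3, 0.1]])
```

In all three cases a larger score means the prediction is more likely to be accepted once the score is compared with the threshold τ.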

3.2. RANDOMIZATION-BASED ENSEMBLE

In the randomization-based ensemble method, each member model is trained independently with the same architecture and training procedure but with different random seeds for the random initialization of parameters and the random shuffling of training data in each training epoch. At inference time, the predictive probability of the ensemble for each class is the average of those of the member models. Lakshminarayanan et al. (2017) apply this ensemble method to deep neural networks (Deep Ensemble) and achieve state-of-the-art performance in uncertainty estimation. The difference between Lakshminarayanan et al. (2017) and our work is that they only empirically study the ensemble of vanilla classifiers (the SR ensemble), whereas we not only empirically study ensembles of multiple outstanding selective classifiers but also provide the theoretical foundation of the ensemble in selective classification. As for theoretical works, Krogh & Vedelsby (1994) propose the error-ambiguity decomposition to explain the better performance of the randomization-based ensemble in regression tasks. However, for classification tasks, there is no such simple and elegant analysis, since the evaluation metrics are non-convex (Zhou, 2021). Thus, the corresponding analysis for classification tasks needs additional assumptions, e.g., unbiased, uncorrelated, and identically distributed estimation errors for the posterior probability distribution (Tumer & Ghosh, 1996; Fumera & Roli, 2005). Nevertheless, these assumptions are impractical (Fumera & Roli, 2005). As far as we know, there is no systematic study of the ensemble in the context of selective classification.

3.3. OUT-OF-DISTRIBUTION DETECTION

A topic related to selective classification is out-of-distribution (OOD) detection (Lakshminarayanan et al., 2017) (also called open set recognition (Scheirer et al., 2012) or novelty detection (Schölkopf et al., 2001)), which detects samples that differ significantly from a given dataset, i.e., OOD samples. The essential difference between selective classification and OOD detection lies in their goals. The goal of the former is to detect samples on which the classifier predicts incorrectly, which depends on both the classifier and the samples, while that of the latter is to detect samples that differ significantly from a given dataset, which depends on the samples only. In addition, at present, selective classification assumes that test data and training data are sampled from the same distribution (Geifman & El-Yaniv, 2017), rather than using OOD test data as OOD detection does. Thus, selective classification and OOD detection are complementary in preventing erroneous predictions of machine learning models, as Figure 3 shows.

4. METHOD

With the randomization-based ensemble method, we propose the selective classifier ensemble. The basic idea is that each predictive probability (as well as the confidence score in the case of explicit selective functions) of the ensemble should be the average of those of the member models. Formally, we assume that for an input sample $x$, a classifier $f$ first provides the predictive probability distribution $\pi_\theta = (\pi_\theta^1, \cdots, \pi_\theta^K)$ and then makes the prediction $f(x; \theta) = \arg\max_{1 \le k \le K} \pi_\theta^k(x)$, where $K$ is the number of classes, $\theta$ denotes the parameters of $f$, and $\pi_\theta^k(x)$ is the predictive probability for class $k$ (the superscript is not an exponent). Then, the predictive probability distribution of the ensemble classifier of $M$ member models is

$$\pi_{ens}(x) := \frac{1}{M} \sum_{m=1}^{M} \pi_m(x). \tag{6}$$

The ensemble of the selective function is defined as follows. For implicit selective functions (e.g., SR), to keep the ensemble the same kind of selective classifier as the individual model (for example, the ensemble of SR should still be an SR model), the confidence score of the ensemble is derived from $\pi_{ens}$ in the same way as for the individual model. For example, the confidence score of the SR ensemble is

$$\kappa_{ens}(x) = \max_k \pi_{ens}^k(x).$$

For explicit selective functions (e.g., SAT), the confidence score of the ensemble is the average of those of the member models,

$$\kappa_{ens}(x) = \frac{1}{M} \sum_{m=1}^{M} \kappa_m(x).$$
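The averaging rules of this section can be sketched as follows. This is a minimal illustration under stated assumptions: the member outputs are assumed to be already available as a NumPy array of shape (M, N, K), the function names are hypothetical, class indices are 0-based (the text uses labels 1, ..., K), and the acceptance rule is the thresholded confidence g(x) = I{κ(x) > τ}.

```python
import numpy as np


def ensemble_probabilities(member_probs):
    """Average the members' predictive distributions.

    member_probs: shape (M, N, K). Returns shape (N, K).
    """
    return np.asarray(member_probs, dtype=float).mean(axis=0)


def sr_ensemble_predict(member_probs, tau):
    """SR ensemble: prediction and confidence both come from the average."""
    probs = ensemble_probabilities(member_probs)
    prediction = probs.argmax(axis=1)   # 0-based class index
    confidence = probs.max(axis=1)      # kappa_ens(x) = max_k pi_ens^k(x)
    accept = confidence > tau           # g(x) = I{kappa(x) > tau}
    return prediction, confidence, accept


def explicit_ensemble_confidence(member_conf):
    """Explicit selective functions: average the members' confidence scores.

    member_conf: shape (M, N). Returns shape (N,).
    """
    return np.asarray(member_conf, dtype=float).mean(axis=0)


# Toy example with hypothetical numbers: M = 2 members, N = 2 samples, K = 2.
toy_members = np.array([
    [[0.9, 0.1], [0.9, 0.1]],   # member 1
    [[0.9, 0.1], [0.2, 0.8]],   # member 2 disagrees on the second sample
])
toy_pred, toy_conf, toy_accept = sr_ensemble_predict(toy_members, tau=0.8)
```

On the second toy sample each member alone is confident (0.9 and 0.8), but the members disagree, so the averaged distribution is close to uniform and the SR ensemble abstains at τ = 0.8. On the first sample the members agree and the confidence is unchanged by averaging.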

5. THEORETICAL ANALYSIS OF SELECTIVE CLASSIFIER ENSEMBLE

In this section, we analyze the selective risk (with the 0/1 loss) of the ensemble of a simple selective classifier, the SR ensemble (see Section 4 for its definition). If the selective risk is a convex function of the predictive probability distribution, then according to the definition of the convex function, (6) implies that the selective risk of the ensemble is less than or equal to that of the individual model. However, the selective risk is non-convex because the 0/1 loss is a step function. Thus, the analysis is not easy. We need some assumptions to prove a lower selective risk of the ensemble. We introduce the assumptions in Section 5.1 (verified in Section 6.1) and the theoretical results in Section 5.2. The analysis for the other selective classifiers is left for future study.

5.1. ASSUMPTIONS

Given an SR ensemble with $M$ ($M > 1$) members, we assume that there are samples on which all member models provide almost the same predictive probability distributions. Furthermore, we idealize them as definite samples, for which all member models provide precisely the same predictive probability distribution. The remaining samples are referred to as ambiguous samples. Let $D$ be the event that the input sample is definite and $A$ be the event that the input sample is ambiguous. Considering that the input sample is randomly drawn from a dataset, the predictive probability for class $k$ ($1 \le k \le K$) and the confidence of the SR model are random variables. We denote these random variables as $\Pi^k$ and $C$, respectively, and use $\pi^k$ and $\kappa$ to denote their values. Generally, for any continuous random variable $Z$, $p_Z$ denotes the probability density function (PDF) of $Z$, and for any real variable $z$, $z \to 1^-$ denotes that $z$ approaches 1 from the left. Based on the idealization and notations above, we introduce the following assumptions.

Assumption 1. For any individual SR model, with its confidence score denoted as $C$, we have

$$\lim_{\tau \to 1^-} \Pr(\mathrm{Err} \mid A, C \ge \tau) > \lim_{\tau \to 1^-} \Pr(\mathrm{Err} \mid D, C \ge \tau),$$

where Err is the event that the model makes an error prediction.

In

$$p_{\Pi^k_1, \Pi^k_2}(u, v \mid D) = \begin{cases} +\infty, & \text{if } (u, v) \in \{(\lambda, \lambda) \mid \lambda \in [0, 1]\} \\ 0, & \text{otherwise.} \end{cases}$$

On the contrary, ambiguous samples do not have such a property. We intensify this by Assumption 3 to provide a good analytical property of ambiguous samples. Furthermore, Assumption 3 reflects the diversity of the ensemble over ambiguous samples. Still consider the example above. If the predictions of $\theta_1$ and $\theta_2$ are sure to coincide, i.e., the ensemble model has no diversity, then the PDF of $\Pi^k_1$ and $\Pi^k_2$ is unbounded. Conversely, if the PDF of $\Pi^k_1$ and $\Pi^k_2$ is bounded, then the predictions of the member models are diverse. Thus, Assumption 3 provides the diversity of the ensemble over ambiguous samples.
It is well known that the randomization-based ensemble has diversity (Zhou, 2012) . Since the ensemble does not have diversity over definite samples, it must have diversity over ambiguous samples. Thus, we do not provide experimental verification for Assumption 3.

5.2. ANALYSIS RESULTS

With the assumptions above, we derive Theorem 2 (see Appendix B for proof details), which shows that the selective risk of the ensemble is lower than that of the individual model under a range of coverage. The intuition of its proof is as follows. According to Assumption 1, the individual model is not modest over ambiguous samples. On the contrary, based on Assumption 3, we prove that the ensemble is modest over ambiguous samples (Proposition 1). In addition, both the individual model and the ensemble are not modest over definite samples (due to Assumption 2 and the definition of definite samples). Thus, considering that the classifier's error rate over ambiguous samples is higher than that over definite samples when confidence approaches 1 (Assumption 1), the individual model suffers more wrong but confident predictions on ambiguous samples than the ensemble. Therefore, the selective risk of the individual model is higher than that of the ensemble (Theorem 2). In short, because the ensemble avoids being overconfident over ambiguous samples, it has a lower selective risk.

Before Theorem 2, we discuss Proposition 1, which provides critical insight into the better performance of the ensemble.

Proposition 1. If Assumption 3 holds, then

$$\lim_{\kappa_{ens} \to 1^-} p_{C_{ens}}(\kappa_{ens} \mid A) = 0,$$

where $C_{ens}$ is the confidence score of the ensemble.

Proposition 1 states that, given that the input sample is ambiguous, the PDF of the ensemble's confidence approaches zero as the confidence approaches one. In other words, the ensemble is modest over ambiguous samples. Based on this proposition, we prove that the SR ensemble has a lower selective risk than an individual SR model under a range of coverage.

Theorem 2. If Assumptions 1-3 hold, then for any individual SR model and any SR ensemble, $\exists \phi_0 \in (0, 1)$ such that $\forall \phi \in (0, \phi_0)$,

$$R_{ens}(\phi) < R_{ind}(\phi),$$

where $R_{ens}(\phi)$ and $R_{ind}(\phi)$ are the selective risks of the SR ensemble and the individual SR model under coverage $\phi$, respectively.

Evaluation Metrics. The evaluation metrics are AURC and selective risk (the lower, the better for both). AURC is a comprehensive metric of selective classifiers, and selective risk is a standard metric in previous works (Geifman & El-Yaniv, 2019; Liu et al., 2019; Huang et al., 2020). In this paper, given a selective classifier, the selective risk is reported in the form of risk-coverage curves, which show the selective risk of the selective classifier against its coverage. According to the objective of selective classification, a selective classifier with a lower risk-coverage curve is better.

6. EXPERIMENTS

Networks. Following Huang et al. (2020) and Xin et al. (2021), for image classification and text classification, we use VGG-16 (Simonyan & Zisserman, 2014) and BERT-base (Devlin et al., 2019) as the backbones of selective classifiers, respectively. More details of the backbone models and their training procedures are provided in Appendix C.2.

Baselines. We use SR (Geifman & El-Yaniv, 2017), SN (SelectiveNet) (Geifman & El-Yaniv, 2019), Gambler (Liu et al., 2019), SAT (Huang et al., 2020), and Reg-curr (Xin et al., 2021) for both image classification and text classification. Note that SN is optimized for a fixed coverage, so the comprehensive metrics AURC and risk-coverage curve, which summarize performance under different coverages, are not suitable for evaluating SN. Thus, we only evaluate the selective risk of the SN ensemble for a fixed coverage and provide the results in Appendix E. To ensure a fair comparison, for image classification tasks, all the baselines are re-implemented based on the open-source code of SAT (Huang et al., 2020), and for text classification tasks, they are re-implemented based on the open-source code of Reg-curr (Xin et al., 2021). The details of the hyperparameters of each baseline are provided in Appendix C.3.

6.1. VERIFICATION OF THE ASSUMPTIONS

We examine Assumption 1 and Assumption 2 (only for the baseline of SR) on datasets for both image classification and text classification. Since the definite samples are the idealization of samples on which member models provide almost the same predictive probability distributions, we take samples with a standard deviation of the predictive probability distributions of member models (STD for short) less than a small positive number $\epsilon$ as definite samples and the other samples as ambiguous samples in experiments. Formally,

$$\mathrm{STD} := \sqrt{\frac{\sum_{j=1}^{M} \left( \pi_j - \frac{1}{M} \sum_{i=1}^{M} \pi_i \right)^2}{M - 1}},$$

where $\pi_j$ is the predictive probability distribution vector of the $j$-th member model, and samples with $\mathrm{STD} < \epsilon$ approximate definite samples in experiments. We choose $\epsilon = 10^{-3}$ for datasets of image classification and $\epsilon = 10^{-2}$ for datasets of text classification.

Figure 1(a) and Figure 1(e) show the selective error rates (selective risks) of samples with $\mathrm{STD} < 10^{-3}$ and samples with $\mathrm{STD} \ge 10^{-3}$ given a range of confidence thresholds (which approximate $\Pr(\mathrm{Err} \mid D, C \ge \tau)$ and $\Pr(\mathrm{Err} \mid A, C \ge \tau)$) on the test set of each dataset. The results show that the selective risk of samples with $\mathrm{STD} < 10^{-3}$ is lower than that of samples with $\mathrm{STD} \ge 10^{-3}$ for all confidence thresholds near 1 on all datasets, which verifies Assumption 1. Figure 1(b)-1(d) and Figure 1(f)-1(h) show the histogram of confidence scores of samples with $\mathrm{STD} < \epsilon$ and that of the other samples, which approximate $p_C(\kappa \mid D)$ and $p_C(\kappa \mid A)$, on the test set of each dataset. The results show that the number of samples with $\mathrm{STD} < 10^{-3}$ is non-zero in the top bin on all datasets, which verifies Assumption 2. In summary, Assumptions 1 and 2 hold on all datasets.
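The split into approximately definite and ambiguous samples can be sketched as below. The sketch rests on two assumptions that are not spelled out in the text: the square of the deviation vector is read as its squared Euclidean norm, and STD is the square root of the resulting sample variance. The function names and the (M, N, K) array layout are hypothetical.

```python
import numpy as np


def member_std(member_probs):
    """Disagreement (STD) of the members' predictive distributions.

    member_probs: shape (M, N, K). Returns shape (N,).
    Assumption: squared deviations are summed over the K classes (squared
    Euclidean norm) and normalized by M - 1 before taking the square root.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    m = member_probs.shape[0]
    mean = member_probs.mean(axis=0, keepdims=True)          # (1, N, K)
    sq_norm = ((member_probs - mean) ** 2).sum(axis=2)       # (M, N)
    return np.sqrt(sq_norm.sum(axis=0) / (m - 1))


def definite_mask(member_probs, eps):
    """True for samples treated as (approximately) definite: STD < eps."""
    return member_std(member_probs) < eps


# Toy example with hypothetical numbers: M = 2 members, N = 2 samples, K = 2.
toy_members = np.array([
    [[0.9, 0.1], [0.9, 0.1]],   # member 1
    [[0.9, 0.1], [0.5, 0.5]],   # member 2 differs on the second sample
])
toy_std = member_std(toy_members)
toy_definite = definite_mask(toy_members, eps=1e-3)
```

The two masks returned for a test set can then be used to compute the selective error rate and the confidence histogram of each group separately, as in the verification described above.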

6.2. EVALUATION OF SELECTIVE CLASSIFIER ENSEMBLES

We first verify Theorem 2. Figure 2 shows the risk-coverage curves of the ensembles and the individual models of each baseline on each dataset. As we can see, except on MRPC, the risk-coverage curve of the ensemble is always entirely below that of the individual model, i.e., the ensemble has a lower selective risk than the individual model under any coverage, which is consistent with Theorem 2. The abnormal results on MRPC dataset may be because of the small number of samples in MRPC. The development set of MRPC has only 0.4k samples, which is much smaller than development sets of other datasets (see Table 2 ). More importantly, when the coverage is low, say 10%, only about 40 samples in MRPC are selected to predict, which may cause a large variance in selective risk estimation. Thus, the estimation of selective risk is not accurate under low coverage, which may explain the violation of Theorem 2 on MRPC. In summary, except the results on MRPC, which might have a large variance in selective risk estimation, the experimental results in Figure 2 verify the correctness and practicability of Theorem 2. Surprisingly, the SR ensemble and Reg-curr ensemble are state-of-the-art selective classifiers for image and text classification tasks, respectively. They only use the maximum probability as the confidence score, while SAT and SN use more sophisticated and explicit confidence score functions. Note that, on SVHN, the SAT ensemble performs better than the SR ensemble. We find that the annotations on SVHN are noisy, and after manually removing some noisy samples in the training set, the SR ensemble performs better than SAT ensemble. Therefore, the SVHN result indicates that SAT is better at handling label noise rather than selecting when to abstain. Please refer to Appendix F.3 for more details. 
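The sampling-noise argument for MRPC above can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, as a simplification, that the errors among the selected samples behave like independent Bernoulli trials, so that the standard error of the estimated selective risk is sqrt(r(1 - r)/n). The risk value 0.1 is a hypothetical number chosen for illustration, not a measured result, and the function name is ours.

```python
import math


def risk_standard_error(risk, n_selected):
    """Binomial standard error of a selective-risk estimate.

    risk:       assumed true selective risk in [0, 1].
    n_selected: number of samples that are not rejected.
    """
    return math.sqrt(risk * (1.0 - risk) / n_selected)


# About 40 selected samples (10% coverage of a 0.4k development set)
# versus a hypothetical set with 4000 selected samples at the same risk.
se_small = risk_standard_error(0.1, 40)
se_large = risk_standard_error(0.1, 4000)
```

Under these assumptions the standard error with 40 selected samples is ten times that with 4000 selected samples, so differences of a few percentage points between two risk-coverage curves at low coverage on MRPC may be within noise.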
In addition, we conduct experiments to explore further properties of the selective classifier ensemble, including the effect of the number of members and the relationship between the classification performance and the selective classification performance (see Appendix F). The results are that an ensemble with more member models has better selective performance, and good classification performance of the ensemble does not necessarily imply good selective classification performance.

7. DISCUSSION

A possible direction for future work is to adapt our analysis to standard classification. The previous analyses of the randomization-based ensemble for standard classification need some impractical assumptions (Tumer & Ghosh, 1996; Fumera & Roli, 2005). On the contrary, this paper's assumptions are a good approximation of practical settings (see Section 6.1), and more importantly, standard classification is a particular case of selective classification, i.e., selective classification with coverage of 1. Therefore, our analysis (although it does not cover the case of coverage of 1, i.e., the case of standard classification) may motivate the analysis of the randomization-based ensemble for standard classification in practical settings. Another possible direction for future work is the relaxation of the assumptions. This paper's assumptions are somewhat strong for the convenience of the proof. Although the experimental results suggest that these assumptions reflect the actual behavior of the SR model, we conjecture that the assumptions can be relaxed while the conclusion remains the same. It would be interesting to relax the idealization of the definite samples. Although the idealization may be a good approximation of practical settings, it is still unrealistic. We believe that a similar theoretical result holds in the absence of the idealization.

8. CONCLUSION

We prove that under some assumptions, the ensemble has a lower selective risk than the individual model under a range of coverage. Although the metrics of selective classification are non-convex, we complete the proof with the help of several assumptions motivated by empirical observations. The assumptions and the result are well supported by the experimental results on multiple datasets of image classification and text classification tasks. A surprising empirical result is that two simple methods, the SR ensemble and its variant, the Reg-curr ensemble (which can be summarized as ensemble models with maximum probability as confidence), are state-of-the-art selective classifiers.

B PROOFS

The complete proof is somewhat complex, but its intuition is straightforward. In short, because the ensemble avoids being overconfident over ambiguous samples, the ensemble has a lower selective risk. Details of the intuition are as follows:
1. the individual model is not "modest" over both ambiguous samples and definite samples (Assumption 2);
2. by contrast, based on Assumption 3, we prove that the ensemble provides modest confidence to ambiguous samples (Proposition 1). In addition, the confidence of definite samples remains the same throughout ensembling (due to the definition of definite samples);
3. thus, when confidence approaches 1, as long as the classifier's error rate over definite samples is lower than the error rate over ambiguous samples (Assumption 1), the individual model suffers more error predictions on the ambiguous samples than the ensemble. Based on this, we prove that the selective risk drops under a range of coverage via ensembling (Theorem 2).

We prove Proposition 1 in Section B.1 and prove Theorem 2 in Section B.2. The road map of the complete proof of the final result, i.e., Theorem 2, is shown in Figure 4.

To prove Proposition 1, that is, that the ensemble is modest over ambiguous samples, we first show that over ambiguous samples, the ensemble provides a moderate predictive probability for each class. By moderate predictive probability, we mean a predictive probability whose PDF approaches zero when the probability itself approaches one or zero. Formally, we have the following lemma (the proof is provided in Section B.1.1).

Lemma 3. If Assumption 3 holds, then

$$p_{\Pi^k_{ens}}(0 \mid A) = 0, \tag{10}$$

$$p_{\Pi^k_{ens}}(1 \mid A) = 0, \tag{11}$$

and

$$p_{\Pi^k_{ens}}(\pi^k_{ens} \mid A) = O\big((\pi^k_{ens})^{M-1}\big) \quad (\pi^k_{ens} \to 0^+), \tag{12}$$

$$p_{\Pi^k_{ens}}(\pi^k_{ens} \mid A) = O\big((1 - \pi^k_{ens})^{M-1}\big) \quad (\pi^k_{ens} \to 1^-), \tag{13}$$

where the notation follows that of Assumption 3, and $\Pi^k_{ens}$ is the predictive probability of the ensemble for class $k$.
Secondly, to prove that the ensemble is modest over ambiguous samples, we show the relationship between the PDF of confidence and the PDFs of predictive probabilities. Note that the confidence of an SR model is the maximum predictive probability. Thus, the following lemma (the proof is provided in Section B.1.2) bounds the PDF of confidence by the PDFs of predictive probabilities, and then it is clear that the ensemble is modest over ambiguous samples, considering the moderate predictive probabilities of the ensemble over ambiguous samples.

Lemma 4. Let $\Pi^k$ ($1 \le k \le K$) be $K$ continuous random variables, and $C := \max_k \Pi^k$. Then we have

$$p_C(\kappa) \le \sum_{k=1}^{K} p_{\Pi^k}(\kappa). \tag{14}$$

Finally, since $C_{ens} = \max_k \Pi^k_{ens}$, Lemma 3 and Lemma 4 imply that when $\kappa_{ens} \to 1^-$,

$$p_{C_{ens}}(\kappa_{ens} \mid A) \le \sum_{k=1}^{K} p_{\Pi^k_{ens}}(\kappa_{ens} \mid A) = O\big((1 - \kappa_{ens})^{M-1}\big),$$

and thus Proposition 1 holds.
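Proposition 1 can be illustrated with a small Monte Carlo experiment. The setting below is a toy construction of our own and not the experimental setup of the paper: a binary problem in which, on ambiguous samples, the class-1 probability of each of M = 5 members is drawn independently and uniformly from [0, 1], so that the joint PDF of the members is bounded as Assumption 3 requires. The sketch compares how often an individual SR model and the SR ensemble assign a confidence of at least τ = 0.99.

```python
import numpy as np

rng = np.random.default_rng(0)
num_members, num_samples = 5, 200_000

# Toy "ambiguous" samples: each member's class-1 probability is independent
# U(0, 1), so the joint density of the members is bounded (by 1).
p1 = rng.uniform(size=(num_members, num_samples))

# SR confidence of one individual member: max(p, 1 - p) for K = 2.
conf_ind = np.maximum(p1[0], 1.0 - p1[0])

# SR confidence of the ensemble: same rule applied to the averaged probability.
p1_ens = p1.mean(axis=0)
conf_ens = np.maximum(p1_ens, 1.0 - p1_ens)

tau = 0.99
frac_ind = float(np.mean(conf_ind >= tau))   # mass of the individual near 1
frac_ens = float(np.mean(conf_ens >= tau))   # mass of the ensemble near 1
```

In this toy setting the individual confidence is uniform on [0.5, 1], so about 2% of the ambiguous samples receive a confidence of at least 0.99, whereas the ensemble almost never does, in line with the $O((1 - \kappa_{ens})^{M-1})$ decay derived above.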

B.1.1 PROOF OF LEMMA 3

We first derive the PDF of the average of multiple continuous random variables in terms of the PDFs of these random variables (Lemma 5), which helps us to analyze the PDF of the ensemble's predictive probabilities.

Lemma 5. Let $X_1, X_2, \dots, X_M$ be $M$ continuous random variables, and let their average be $X_{avg} := \frac{1}{M} \sum_{i=1}^{M} X_i$. Then the PDF of $X_{avg}$ is

$$p_{X_{avg}}(x_{avg}) = M \int_{\mathbb{R}^{M-1}} dx_1\, dx_2 \cdots dx_{M-1}\; p_X\Big(x_1, x_2, \dots, x_{M-1},\, M x_{avg} - \sum_{i=1}^{M-1} x_i\Big), \tag{15}$$

where $p_X$ is short for $p_{X_1, X_2, \dots, X_M}$.

Proof. The distribution function of $X_{avg}$ is

$$F_{X_{avg}}(x_{avg}) = \int_{\sum_i x_i \le M x_{avg}} dx_1 \cdots dx_{M-1}\, dx_M\; p_X(x_1, \dots, x_M) = \int_{\mathbb{R}^{M-1}} dx_1 \cdots dx_{M-1} \int_{-\infty}^{M x_{avg} - \sum_{i=1}^{M-1} x_i} dx_M\; p_X(x_1, \cdots, x_M).$$

Let $x_M = u - \sum_{i=1}^{M-1} x_i$; then the integral above is equal to

$$\int_{\mathbb{R}^{M-1}} dx_1 \cdots dx_{M-1} \int_{-\infty}^{M x_{avg}} du\; p_X\Big(x_1, \dots, x_{M-1},\, u - \sum_{i=1}^{M-1} x_i\Big) = \int_{-\infty}^{M x_{avg}} du \int_{\mathbb{R}^{M-1}} dx_1 \cdots dx_{M-1}\; p_X\Big(x_1, \dots, x_{M-1},\, u - \sum_{i=1}^{M-1} x_i\Big). \tag{16}$$

The PDF of $X_{avg}$ is the derivative of $F_{X_{avg}}$, which, combined with (16), gives

$$p_{X_{avg}}(x_{avg}) = F'_{X_{avg}}(x_{avg}) = \frac{d(M x_{avg})}{d x_{avg}} \cdot \frac{d F_{X_{avg}}}{d(M x_{avg})} = M \int_{\mathbb{R}^{M-1}} dx_1 \cdots dx_{M-1}\; p_X\Big(x_1, \dots, x_{M-1},\, M x_{avg} - \sum_{i=1}^{M-1} x_i\Big),$$

which is exactly (15).

Proof of Lemma 3. Based on Lemma 5 and Assumption 3, we prove Lemma 3 as follows.

Proof. With Lemma 5 applied to $\Pi^k_i$, $1 \le i \le M$, we have

$$p_{\Pi^k_{ens}}(\pi^k_{ens} \mid A) = M \int_{\mathbb{R}^{M-1}} d\pi^k_1 \cdots d\pi^k_{M-1}\; p_{\Pi^k}\Big(\pi^k_1, \dots, \pi^k_{M-1},\, M \pi^k_{ens} - \sum_{i=1}^{M-1} \pi^k_i \,\Big|\, A\Big), \tag{17}$$

where $p_{\Pi^k}$ is short for $p_{\Pi^k_1, \dots, \Pi^k_M}$. The integrand on the right-hand side of (17) being non-zero requires

$$\begin{cases} 0 \le \pi^k_i \le 1, \quad i = 1, 2, \dots, M-1 \\ 0 \le M \pi^k_{ens} - \sum_{i=1}^{M-1} \pi^k_i \le 1. \end{cases} \tag{18}$$

Firstly, we prove (10) and (12). When $0 \le \pi^k_{ens} \le \frac{1}{M}$, it is easy to verify that (18) is equivalent to

$$\begin{cases} 0 \le \pi^k_1 \le M \pi^k_{ens} \\ 0 \le \pi^k_2 \le M \pi^k_{ens} - \pi^k_1 \\ \quad \cdots \\ 0 \le \pi^k_i \le M \pi^k_{ens} - \pi^k_1 - \cdots - \pi^k_{i-1} \\ \quad \cdots \\ 0 \le \pi^k_{M-1} \le M \pi^k_{ens} - \pi^k_1 - \cdots - \pi^k_{M-2}. \end{cases} \tag{19}$$

Thus, (17) transforms into

$$p_{\Pi^k_{ens}}(\pi^k_{ens} \mid A) = M \int_0^{M \pi^k_{ens}} d\pi^k_1 \cdots \int_0^{M \pi^k_{ens} - \sum_{j=1}^{i-1} \pi^k_j} d\pi^k_i \cdots \int_0^{M \pi^k_{ens} - \sum_{j=1}^{M-2} \pi^k_j} d\pi^k_{M-1}\; p_{\Pi^k}\Big(\pi^k_1, \dots, \pi^k_{M-1},\, M \pi^k_{ens} - \sum_{i=1}^{M-1} \pi^k_i \,\Big|\, A\Big).$$

Considering that $p_{\Pi^k}(\cdot \mid A)$ is bounded, as Assumption 3 claims, let $B$ be one of its upper bounds. Then we have

$$\begin{aligned} p_{\Pi^k_{ens}}(\pi^k_{ens} \mid A) &\le M \int_0^{M \pi^k_{ens}} d\pi^k_1 \cdots \int_0^{M \pi^k_{ens} - \sum_{j=1}^{i-1} \pi^k_j} d\pi^k_i \cdots \int_0^{M \pi^k_{ens} - \sum_{j=1}^{M-2} \pi^k_j} d\pi^k_{M-1}\; B \\ &\le M \int_0^{M \pi^k_{ens}} d\pi^k_1 \cdots \int_0^{M \pi^k_{ens}} d\pi^k_i \cdots \int_0^{M \pi^k_{ens}} d\pi^k_{M-1}\; B \\ &= M B \cdot \int_0^{M \pi^k_{ens}} d\pi^k_1 \cdots \int_0^{M \pi^k_{ens}} d\pi^k_{M-1} \\ &= M B \cdot (M \pi^k_{ens})^{M-1} = M^M B \cdot (\pi^k_{ens})^{M-1}, \end{aligned} \tag{20}$$

which directly gives (10) and (12) (note that the PDF is non-negative).

Secondly, we prove (11) and (13). These two equations could be derived in the same way as (10) and (12). However, here we take another route that simplifies the proof. Let $\Pi^k_i$ and $\Pi^k_{ens}$ be the random variables corresponding to $\pi^k_i$ and $\pi^k_{ens}$, respectively, and let $U_i = 1 - \Pi^k_i$, $U_{ens} = \frac{1}{M} \sum_{i=1}^{M} U_i = 1 - \Pi^k_{ens}$. Applying (20) to $U_i$ and $U_{ens}$, we get that when $0 \le u_{ens} \le \frac{1}{M}$,

$$p_{U_{ens}}(u_{ens} \mid A) \le M^M B \cdot u_{ens}^{M-1}. \tag{21}$$

It is easy to see that $p_{U_{ens}}(u_{ens} \mid A) = p_{\Pi^k_{ens}}(1 - u_{ens} \mid A)$. Substituting this into (21), we have

$$p_{\Pi^k_{ens}}(1 - u_{ens} \mid A) \le M^M B \cdot u_{ens}^{M-1}, \quad \text{when } 0 \le u_{ens} \le \frac{1}{M}.$$

With $(1 - u_{ens})$ in the inequality above replaced with $\pi^k_{ens}$, we have

$$p_{\Pi^k_{ens}}(\pi^k_{ens} \mid A) \le M^M B \cdot (1 - \pi^k_{ens})^{M-1}, \quad \text{when } 1 - \frac{1}{M} \le \pi^k_{ens} \le 1,$$

which directly gives (11) and (13).
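The bound (20) can be checked in closed form in a simple special case. The example below is our own illustration: if the M member probabilities are independent and uniform on [0, 1], their joint PDF is bounded by B = 1, and the PDF of their sum is the Irwin-Hall density, so the PDF of the average is available exactly. The function names are hypothetical.

```python
import math


def irwin_hall_pdf(s, m):
    """Exact PDF of the sum of m independent U(0, 1) variables at s."""
    if s < 0.0 or s > m:
        return 0.0
    total = sum(
        (-1) ** k * math.comb(m, k) * (s - k) ** (m - 1)
        for k in range(int(math.floor(s)) + 1)
    )
    return total / math.factorial(m - 1)


def average_pdf(x, m):
    """PDF of the average of m independent U(0, 1) variables at x."""
    return m * irwin_hall_pdf(m * x, m)


def lemma3_bound(x, m, bound=1.0):
    """Right-hand side of (20): M^M * B * x^(M-1), valid for 0 <= x <= 1/M."""
    return m ** m * bound * x ** (m - 1)


# Check the bound on a grid of points in (0, 1/M] for several ensemble sizes.
bound_holds = all(
    average_pdf(j / (20.0 * m), m) <= lemma3_bound(j / (20.0 * m), m) + 1e-12
    for m in (2, 3, 5)
    for j in range(1, 21)
)
```

For this special case the exact density on [0, 1/M] is $M^M x^{M-1} / (M-1)!$, so the bound is loose by the factor $(M-1)!$ but captures the $x^{M-1}$ decay that drives Lemma 3.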

B.1.2 PROOF OF LEMMA 4

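Proof. First of all, we prove that $\forall \kappa_1, \kappa_2$ with $\kappa_1 < \kappa_2$,

$$F_C(\kappa_2) - F_C(\kappa_1) \le \sum_{k=1}^{K} \big[F_{\Pi^k}(\kappa_2) - F_{\Pi^k}(\kappa_1)\big]. \tag{22}$$

It is easy to see that

$$F_C(\kappa) = F_{\Pi^1, \dots, \Pi^K}(\kappa, \dots, \kappa) = \int_{(-\infty, \kappa]^K} d\pi^1 \cdots d\pi^K\; p_{\Pi^1, \dots, \Pi^K}(\pi^1, \dots, \pi^K),$$

so the left-hand side of (22) is

$$\int_{(-\infty, \kappa_2]^K} d\pi^1 \cdots d\pi^K\; p_{\Pi^1, \dots, \Pi^K}(\pi^1, \dots, \pi^K) - \int_{(-\infty, \kappa_1]^K} d\pi^1 \cdots d\pi^K\; p_{\Pi^1, \dots, \Pi^K}(\pi^1, \dots, \pi^K) = \int_{(-\infty, \kappa_2]^K \setminus (-\infty, \kappa_1]^K} d\pi^1 \cdots d\pi^K\; p_{\Pi^1, \dots, \Pi^K}(\pi^1, \dots, \pi^K), \tag{23}$$

where the last equality is due to $(-\infty, \kappa_1] \subset (-\infty, \kappa_2]$, and the right-hand side of (22) is

$$\sum_{k=1}^{K} \int_{[\kappa_1, \kappa_2]} d\pi^k\; p_{\Pi^k}(\pi^k) = \sum_{k=1}^{K} \int_{\mathbb{R}^{k-1} \times [\kappa_1, \kappa_2] \times \mathbb{R}^{K-k}} d\pi^1 \cdots d\pi^K\; p_{\Pi^1, \dots, \Pi^K}(\pi^1, \dots, \pi^K) \ge \int_{\bigcup_{k=1}^{K} \mathbb{R}^{k-1} \times [\kappa_1, \kappa_2] \times \mathbb{R}^{K-k}} d\pi^1 \cdots d\pi^K\; p_{\Pi^1, \dots, \Pi^K}(\pi^1, \dots, \pi^K), \tag{24}$$

where the last inequality holds because the sets $\mathbb{R}^{k-1} \times [\kappa_1, \kappa_2] \times \mathbb{R}^{K-k}$ for different $k$, $1 \le k \le K$, may intersect. To prove (22), we only need to prove that the right-hand side of (23) is less than or equal to the right-hand side of (24), which is equivalent to proving

$$(-\infty, \kappa_2]^K \setminus (-\infty, \kappa_1]^K \subset \bigcup_{k=1}^{K} \mathbb{R}^{k-1} \times [\kappa_1, \kappa_2] \times \mathbb{R}^{K-k}. \tag{25}$$

Now we prove (25). For every $(\pi^1, \dots, \pi^K) \in (-\infty, \kappa_2]^K \setminus (-\infty, \kappa_1]^K$, we have

$$\forall k,\ 1 \le k \le K,\quad \pi^k \le \kappa_2, \tag{26}$$

$$\exists k_0,\ 1 \le k_0 \le K,\quad \pi^{k_0} > \kappa_1, \tag{27}$$

where (27) holds because if every $\pi^k$ were less than or equal to $\kappa_1$, then $(\pi^1, \dots, \pi^K) \in (-\infty, \kappa_1]^K$, which contradicts $(\pi^1, \dots, \pi^K) \in (-\infty, \kappa_2]^K \setminus (-\infty, \kappa_1]^K$. Thus, $\pi^{k_0} \in [\kappa_1, \kappa_2]$, so

$$(\pi^1, \dots, \pi^K) \in \mathbb{R}^{k_0-1} \times [\kappa_1, \kappa_2] \times \mathbb{R}^{K-k_0} \subset \bigcup_{k=1}^{K} \mathbb{R}^{k-1} \times [\kappa_1, \kappa_2] \times \mathbb{R}^{K-k},$$

which is precisely (25), and therefore (22) is proved.

With (22) and the definition of derivatives, it is easy to see that $F'_C(\kappa) \le \sum_{k=1}^{K} F'_{\Pi^k}(\kappa)$, which is equivalent to $p_C(\kappa) \le \sum_{k=1}^{K} p_{\Pi^k}(\kappa)$. Thus, Lemma 4 is proved.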
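Inequality (22) also holds exactly for empirical CDFs, because the set-inclusion argument in (25) applies sample by sample. The following sketch checks this on synthetic probability vectors. The data generation (normalized uniform draws) and the function names are assumptions made for illustration, with the confidence taken as $C = \max_k \Pi^k$.

```python
import random


def empirical_cdf(values, kappa):
    """Fraction of values that are <= kappa."""
    return sum(v <= kappa for v in values) / len(values)


def check_inequality_22(probs, kappa1, kappa2):
    """Return (lhs, rhs) of (22) for a list of K-dimensional probability vectors."""
    k = len(probs[0])
    conf = [max(p) for p in probs]  # confidence C = max_k Pi^k
    lhs = empirical_cdf(conf, kappa2) - empirical_cdf(conf, kappa1)
    rhs = sum(
        empirical_cdf([p[j] for p in probs], kappa2)
        - empirical_cdf([p[j] for p in probs], kappa1)
        for j in range(k)
    )
    return lhs, rhs


# Synthetic 3-class predictive distributions (illustrative data only).
random.seed(0)
samples = []
for _ in range(2000):
    raw = [random.random() for _ in range(3)]
    total = sum(raw)
    samples.append([r / total for r in raw])

lhs, rhs = check_inequality_22(samples, 0.5, 0.7)
assert lhs <= rhs + 1e-12
```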

B.2 PROOF OF THEOREM 2

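The intuition of the proof is as follows. Since the ensemble is more modest than the individual model over ambiguous samples and is as modest as the individual model over definite samples, the ensemble tends to select more definite samples than the individual model when the confidence threshold approaches 1 (Lemma 7). Thus, as long as the selective risk over definite samples is lower than that over ambiguous samples when the confidence threshold approaches 1 (Assumption 1), the ensemble is certain to have a lower selective risk when the confidence threshold approaches 1. Although the intuition is straightforward, the rigorous proof is not easy. For the convenience of the proof, we show Lemma 6 and Lemma 7 first. Lemma 6 claims that for the individual model, the selective risk given definite samples is lower than the overall selective risk. Lemma 7 claims that the ensemble is unlikely to select ambiguous samples to predict when the confidence threshold approaches 1.

Lemma 6. If Assumptions 1-2 hold, then for any individual SR model,

$$\lim_{\tau \to 1^-} R(\phi(\tau)) > \lim_{\tau \to 1^-} \Pr(\mathrm{Err}|D, C \ge \tau), \tag{28}$$

where the notation follows that of Assumptions 1-2.

Proof. Using Bayes' rule, we have

$$\Pr(A|C \ge \tau) = \frac{\Pr(C \ge \tau|A)\Pr(A)}{\Pr(C \ge \tau|A)\Pr(A) + \Pr(C \ge \tau|D)\Pr(D)}. \tag{29}$$

Using L'Hospital's rule, we have

$$\lim_{\tau \to 1^-} \frac{\Pr(C \ge \tau|D)}{\Pr(C \ge \tau|A)} = \lim_{\tau \to 1^-} \frac{\int_\tau^1 p_C(\kappa|D)\,d\kappa}{\int_\tau^1 p_C(\kappa|A)\,d\kappa} = \lim_{\tau \to 1^-} \frac{-p_C(\tau|D)}{-p_C(\tau|A)} = \lim_{\tau \to 1^-} \frac{p_C(\tau|D)}{p_C(\tau|A)} > 0, \tag{30}$$

where the last inequality is due to Assumption 2. Combining this with (29), we have

$$\lim_{\tau \to 1^-} \Pr(A|C \ge \tau) = \frac{\Pr(A)}{\Pr(A) + \Pr(D) \lim_{\tau \to 1^-} \frac{\Pr(C \ge \tau|D)}{\Pr(C \ge \tau|A)}} = \frac{\Pr(A)}{\Pr(A) + \Pr(D) \lim_{\tau \to 1^-} \frac{p_C(\tau|D)}{p_C(\tau|A)}} > 0. \tag{31}$$

Now we derive (28).

$$\begin{aligned} R(\phi(\tau)) &= \Pr(\mathrm{Err}, A|C \ge \tau) + \Pr(\mathrm{Err}, D|C \ge \tau) \\ &= \Pr(\mathrm{Err}|A, C \ge \tau)\Pr(A|C \ge \tau) + \Pr(\mathrm{Err}|D, C \ge \tau)\Pr(D|C \ge \tau) \\ &= \Pr(\mathrm{Err}|A, C \ge \tau)\Pr(A|C \ge \tau) + \Pr(\mathrm{Err}|D, C \ge \tau)[1 - \Pr(A|C \ge \tau)] \\ &= [\Pr(\mathrm{Err}|A, C \ge \tau) - \Pr(\mathrm{Err}|D, C \ge \tau)]\Pr(A|C \ge \tau) + \Pr(\mathrm{Err}|D, C \ge \tau). \end{aligned} \tag{32}$$

According to the equation above, we have

$$\lim_{\tau \to 1^-} R(\phi(\tau)) = \Big[\lim_{\tau \to 1^-} \Pr(\mathrm{Err}|A, C \ge \tau) - \lim_{\tau \to 1^-} \Pr(\mathrm{Err}|D, C \ge \tau)\Big] \cdot \lim_{\tau \to 1^-} \Pr(A|C \ge \tau) + \lim_{\tau \to 1^-} \Pr(\mathrm{Err}|D, C \ge \tau).$$

Due to (31) and Assumption 1, the first term of the equation above is positive, so (28) is derived.

Lemma 7. If Assumptions 2-3 hold, then

$$\lim_{\tau_{\mathrm{ens}} \to 1^-} \Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) = 0, \tag{33}$$

where the notation follows that of Assumptions 2-3.

Proof. Similar to (31), with Bayes' rule and L'Hospital's rule, we can derive that

$$\lim_{\tau_{\mathrm{ens}} \to 1^-} \Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) = \frac{\Pr(A) \lim_{\tau_{\mathrm{ens}} \to 1^-} \frac{\Pr(C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}|A)}{\Pr(C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}|D)}}{\Pr(A) \lim_{\tau_{\mathrm{ens}} \to 1^-} \frac{\Pr(C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}|A)}{\Pr(C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}|D)} + \Pr(D)} = \frac{\Pr(A) \lim_{\tau_{\mathrm{ens}} \to 1^-} \frac{p_{C_{\mathrm{ens}}}(\tau_{\mathrm{ens}}|A)}{p_{C_{\mathrm{ens}}}(\tau_{\mathrm{ens}}|D)}}{\Pr(A) \lim_{\tau_{\mathrm{ens}} \to 1^-} \frac{p_{C_{\mathrm{ens}}}(\tau_{\mathrm{ens}}|A)}{p_{C_{\mathrm{ens}}}(\tau_{\mathrm{ens}}|D)} + \Pr(D)}, \tag{34}$$

where $C_{\mathrm{ens}}$ is the confidence score of the ensemble. Because for a definite sample, the confidence score of the ensemble is equal to that of the individual model, we have

$$\lim_{\tau_{\mathrm{ens}} \to 1^-} \frac{p_{C_{\mathrm{ens}}}(\tau_{\mathrm{ens}}|A)}{p_{C_{\mathrm{ens}}}(\tau_{\mathrm{ens}}|D)} = \lim_{\tau_{\mathrm{ens}} \to 1^-} \frac{p_{C_{\mathrm{ens}}}(\tau_{\mathrm{ens}}|A)}{p_C(\tau_{\mathrm{ens}}|D)} = 0,$$

where the last equality is due to Proposition 1 and Assumption 2. Substituting this into (34), we obtain (33).

Proof of Theorem 2.

Proof. First, for the convenience of the proof, given an SR model $(f, g)$, we define a threshold-to-coverage function $\rho_{(f,g)}$ of $(f, g)$ that maps the confidence threshold to the corresponding coverage, $\rho_{(f,g)} : (0, 1) \to (0, 1),\ \tau \mapsto \phi(f, g; \tau)$.

Second, we prove that $\exists \delta \in (0, 1)$, $\forall \tau_{\mathrm{ens}} \in (1 - \delta, 1)$,

$$\Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) - \Pr(\mathrm{Err}_{\mathrm{ind}}|C \ge \tau) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) < 0, \tag{35}$$

$$\tau = \max \rho^{-1} \circ \rho_{\mathrm{ens}}(\tau_{\mathrm{ens}}), \tag{36}$$

where $C$ is the confidence score of the individual SR model, $\mathrm{Err}_{\mathrm{ind}}$ is the event that the individual model makes an erroneous prediction, $\mathrm{Err}_{\mathrm{ens}}$ is the event that the ensemble makes an erroneous prediction, and $\rho$ and $\rho_{\mathrm{ens}}$ are the threshold-to-coverage functions of the individual model and the ensemble, respectively. Note that the symbol $\rho^{-1}$ denotes the preimage under $\rho$, rather than the inverse function of $\rho$. When $\tau_{\mathrm{ens}} \to 1^-$, the coverage of the ensemble $\rho_{\mathrm{ens}}(\tau_{\mathrm{ens}})$ approaches 0, and the coverage of the individual model is equal to the coverage of the ensemble, i.e., $\rho(\tau) = \rho_{\mathrm{ens}}(\tau_{\mathrm{ens}})$, so we have $\tau \to 1^-$ when $\tau_{\mathrm{ens}} \to 1^-$. Thus,

$$\lim_{\tau_{\mathrm{ens}} \to 1^-} \Pr(\mathrm{Err}_{\mathrm{ind}}|C \ge \tau) = \lim_{\tau \to 1^-} \Pr(\mathrm{Err}_{\mathrm{ind}}|C \ge \tau). \tag{37}$$

In addition, for definite samples, the confidence score of the ensemble and that of the individual model are the same, and the ensemble and the individual model make erroneous predictions on the same set of samples, i.e., $\mathrm{Err}_{\mathrm{ind}} = \mathrm{Err}_{\mathrm{ens}}$, so

$$\lim_{\tau_{\mathrm{ens}} \to 1^-} \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) = \lim_{\tau_{\mathrm{ens}} \to 1^-} \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C \ge \tau_{\mathrm{ens}}) = \lim_{\tau \to 1^-} \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C \ge \tau) = \lim_{\tau \to 1^-} \Pr(\mathrm{Err}_{\mathrm{ind}}|D, C \ge \tau), \tag{38}$$

where the second equality is just a variable substitution. Finally, we have

$$\begin{aligned} &\lim_{\tau_{\mathrm{ens}} \to 1^-} [\Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) - \Pr(\mathrm{Err}_{\mathrm{ind}}|C \ge \tau) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})] \\ &= \lim_{\tau_{\mathrm{ens}} \to 1^-} [0 - \Pr(\mathrm{Err}_{\mathrm{ind}}|C \ge \tau) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})] \\ &= -\lim_{\tau_{\mathrm{ens}} \to 1^-} \Pr(\mathrm{Err}_{\mathrm{ind}}|C \ge \tau) + \lim_{\tau_{\mathrm{ens}} \to 1^-} \Pr(\mathrm{Err}_{\mathrm{ind}}|D, C \ge \tau) \\ &= -\lim_{\tau \to 1^-} \Pr(\mathrm{Err}_{\mathrm{ind}}|C \ge \tau) + \lim_{\tau \to 1^-} \Pr(\mathrm{Err}_{\mathrm{ind}}|D, C \ge \tau) < 0, \end{aligned} \tag{39}$$

where the first equality is due to Lemma 7, the second equality is due to (37) and (38), and the last inequality is due to Lemma 6. Thus, with (39), it is easy to see that (35) holds.

Third, we have

$$\begin{aligned} \Pr(\mathrm{Err}_{\mathrm{ens}}|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) &= \Pr(\mathrm{Err}_{\mathrm{ens}}, A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}, D|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) \\ &= \Pr(\mathrm{Err}_{\mathrm{ens}}|A, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})\Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})\Pr(D|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) \\ &= \Pr(\mathrm{Err}_{\mathrm{ens}}|A, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})\Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})[1 - \Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})] \\ &= [\Pr(\mathrm{Err}_{\mathrm{ens}}|A, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) - \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})] \cdot \Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) \\ &\le \Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}), \end{aligned} \tag{40}$$

where the last inequality holds because any probability is in $[0, 1]$. Combining this with (35), we have $\exists \delta \in (0, 1)$, $\forall \tau_{\mathrm{ens}} \in (1 - \delta, 1)$,

$$\begin{aligned} \Pr(\mathrm{Err}_{\mathrm{ens}}|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) &\le \Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) \\ &< \Pr(\mathrm{Err}_{\mathrm{ind}}|C \ge \tau) - \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) \\ &= \Pr(\mathrm{Err}_{\mathrm{ind}}|C \ge \tau), \end{aligned} \tag{41}$$

where the second inequality is due to (35). $\Pr(\mathrm{Err}_{\mathrm{ens}}|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})$ is the selective risk of the ensemble given the confidence threshold $\tau_{\mathrm{ens}}$ (or given the coverage $\rho_{\mathrm{ens}}(\tau_{\mathrm{ens}})$), and $\Pr(\mathrm{Err}_{\mathrm{ind}}|C \ge \tau)$ is the selective risk of the individual model given the confidence threshold $\tau$ (or given the coverage $\rho(\tau) = \rho_{\mathrm{ens}}(\tau_{\mathrm{ens}})$). These two selective risks are under the same coverage $\rho_{\mathrm{ens}}(\tau_{\mathrm{ens}})$. Thus, (41) is equivalent to the statement that $\exists \delta \in (0, 1)$, $\forall \tau_{\mathrm{ens}} \in (1 - \delta, 1)$, the ensemble has a lower selective risk than the individual model given the coverage $\phi = \rho_{\mathrm{ens}}(\tau_{\mathrm{ens}})$. This statement can be simplified as: $\exists \delta \in (0, 1)$, $\forall \phi \in (0, \rho_{\mathrm{ens}}(1 - \delta))$, the ensemble has a lower selective risk than the individual model, given the coverage $\phi$.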
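The proof compares the two models at the same coverage by mapping the ensemble threshold $\tau_{\mathrm{ens}}$ to an individual-model threshold $\tau$. A minimal sketch of that comparison on finite data is given below. The helper names are hypothetical, and because exact coverage equality may not be attainable with a finite sample, the sketch assumes the relaxation "coverage at least the target" when choosing $\tau$.

```python
def coverage(conf, tau):
    """Threshold-to-coverage function rho: fraction of samples with confidence >= tau."""
    return sum(c >= tau for c in conf) / len(conf)


def selective_risk(conf, err, tau):
    """Error rate among samples with confidence >= tau (None if nothing is selected)."""
    selected = [e for c, e in zip(conf, err) if c >= tau]
    return sum(selected) / len(selected) if selected else None


def matched_threshold(conf_ind, target_coverage):
    """Largest observed threshold whose coverage is at least target_coverage.

    Finite-sample stand-in (an assumption of this sketch) for
    tau = max rho^{-1}(rho_ens(tau_ens)).
    """
    candidates = [
        t for t in sorted(set(conf_ind)) if coverage(conf_ind, t) >= target_coverage
    ]
    return max(candidates) if candidates else None


# Toy data: confidence scores and 0/1 error indicators for four samples.
conf_ind, err_ind = [0.99, 0.95, 0.90, 0.60], [0, 1, 0, 1]
conf_ens, err_ens = [0.98, 0.80, 0.97, 0.50], [0, 1, 0, 1]

tau_ens = 0.97
tau = matched_threshold(conf_ind, coverage(conf_ens, tau_ens))
risk_ens = selective_risk(conf_ens, err_ens, tau_ens)
risk_ind = selective_risk(conf_ind, err_ind, tau)
```

On this toy data the ensemble ranks the two correctly classified samples highest, so its selective risk at the matched coverage is lower than that of the individual model.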

C DETAILS OF EXPERIMENTS

C.1 DATASETS

The experiments were conducted on multiple datasets for image classification and text classification. The image classification datasets are CIFAR-10, CIFAR-100 (Krizhevsky, 2009), and SVHN (Netzer et al., 2011), whose images are all 32 × 32 × 3 pixels. The text classification datasets are MRPC (Dolan & Brockett, 2005), MNLI (Williams et al., 2018), and QNLI (Wang et al., 2018). The task of MRPC is to judge whether two paragraphs of text are semantically equivalent. The task of MNLI is to judge the inferential relationship between sentences (three categories). The task of QNLI is to determine whether a paragraph contains the answer to a given question.

For image classification, the backbone model of selective classifiers is VGG-16 with dropout (Srivastava et al., 2014) and batch normalization (Ioffe & Szegedy, 2015). It is trained in the same way as Huang et al. (2020). The model is optimized using SGD with an initial learning rate of 0.1 (halved every 25 epochs), a momentum of 0.9, a weight decay of 0.0005, a batch size of 128, and 300 training epochs in total. Data preprocessing includes data augmentation (random cropping and flipping) and normalization. The implementations of the backbone model and data preprocessing are based on the official open-source implementation of SAT to ensure a fair comparison.

For text classification, the backbone model of selective classifiers is BERT-base (Devlin et al., 2019). The pretrained BERT-base is provided by the Hugging Face Transformers library (Wolf et al., 2020). It is fine-tuned in the same way as Xin et al. (2021), except on MRPC. On QNLI and MNLI, the model is fine-tuned using AdamW (Loshchilov & Hutter, 2017) for 3 epochs, with a learning rate of $2 \times 10^{-5}$, a batch size of 32, and a maximum input sequence length of 128. On MRPC, the model is fine-tuned for 10 epochs, with other settings the same as those on QNLI and MNLI. The larger number of epochs is used because MRPC has few samples, so training requires more epochs to converge.
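The SGD learning-rate schedule described above for the image classification backbone can be written as a one-line step function. This is a sketch under the assumption that "halved every 25 epochs" means a step decay applied at epochs 25, 50, and so on, with epochs counted from 0; the function name is hypothetical and is not taken from the SAT code base.

```python
def vgg_learning_rate(epoch: int, base_lr: float = 0.1, decay_every: int = 25) -> float:
    """Step schedule: start at base_lr and halve it every `decay_every` epochs."""
    return base_lr * 0.5 ** (epoch // decay_every)


# Learning rate at the start of each decay period over the 300 training epochs.
schedule = [vgg_learning_rate(e) for e in range(0, 300, 25)]
```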

C.3 HYPERPARAMETERS OF SELECTIVE CLASSIFIERS

For the hyperparameter c of SN, we choose c = 0.9 to evaluate its selective risk at a coverage of 90%. The results of SN are reported in Appendix E. For the hyperparameter o of Gambler, we tune o on validation sets in the same way as Liu et al. (2019). For the hyperparameter α of SAT, we set α = 0.99, the same as Huang et al. (2020). For the hyperparameter λ of Reg-curr, we set λ = 0.05.

D ADDITIONAL EXPERIMENTAL RESULTS

Tables 3 and 4 show the selective risks of ensembles under coverage 10%-100% on each dataset, where the hyperparameters of Gambler are the same as those in Table 1. Notably, no ensemble consistently outperforms the others under all coverage levels on all datasets, so it is not easy to tell which ensemble is state-of-the-art in this regard. This is because different ensembles have similar overall performance but adopt different trade-offs between coverage and selective risk. In this case, we need a comprehensive metric, e.g., AURC, to identify the state-of-the-art (see Table 1).
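To make the role of AURC as a single summary metric concrete, the sketch below computes a risk-coverage curve and a discrete AURC from confidence scores and 0/1 error indicators. It assumes a common discretization (the mean selective risk over the N coverage levels 1/N, 2/N, ..., 1), which may differ in detail from the exact definition used for the reported numbers.

```python
def risk_coverage_curve(conf, err):
    """Return (coverages, risks), selecting samples from most to least confident."""
    order = sorted(range(len(conf)), key=lambda i: conf[i], reverse=True)
    coverages, risks, errors = [], [], 0
    for n, i in enumerate(order, start=1):
        errors += err[i]
        coverages.append(n / len(conf))
        risks.append(errors / n)  # selective risk among the n most confident samples
    return coverages, risks


def aurc(conf, err):
    """Area under the risk-coverage curve, approximated by the mean selective risk."""
    _, risks = risk_coverage_curve(conf, err)
    return sum(risks) / len(risks)


# Same error rate (1 of 4), different confidence rankings of the erroneous sample.
good_ranking = aurc([0.9, 0.8, 0.7, 0.6], [0, 0, 0, 1])
bad_ranking = aurc([0.9, 0.8, 0.7, 0.6], [1, 0, 0, 0])
assert good_ranking < bad_ranking
```

Both toy classifiers have the same selective risk at 100% coverage, yet their AURCs differ, which is why a curve-level metric is needed to rank selective classifiers.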

E EMPIRICAL RESULTS OF SN

With the hyperparameter c set to 0.9, we report the selective risks of SN ensembles and the individual SN at a coverage of 90% in Figure 5. The coverage is set to 90% because the target coverage of SN is c (Geifman & El-Yaniv, 2019), and c is 0.9 in our experiments. The results show that each ensemble of SN has a lower selective risk than the individual SN.

F FURTHER PROPERTIES OF SELECTIVE CLASSIFIER ENSEMBLE

F.1 THE EFFECT OF NUMBER OF MEMBERS ON SELECTIVE CLASSIFIER ENSEMBLE

We evaluate the AURCs of the SR ensemble, Gambler ensemble, and SAT ensemble with different numbers of members on CIFAR-10, and find that an ensemble with more members performs better but is less efficient. The results are shown in Figure 6. In most cases, the AURC on the test set of CIFAR-10 decreases as the number of members in the ensemble increases. In addition, as the number of members grows, the benefit of adding one more member diminishes. On the one hand, the result shows that an ensemble with a small number of members already has good selective classification performance. On the other hand, it indicates that when the number of member models is large, adding further members is an inefficient way to improve the performance of the selective classifier ensemble.
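For reference, the SR ensemble evaluated here averages the members' predictive distributions and uses the maximum averaged probability as the confidence score. A minimal sketch for a single input follows; the function names and the toy probabilities are illustrative assumptions.

```python
def ensemble_prediction(member_probs):
    """member_probs: list of M probability vectors (one per member) for one input.

    Returns (predicted class, confidence), where the confidence is the maximum
    probability of the averaged predictive distribution.
    """
    m = len(member_probs)
    k = len(member_probs[0])
    avg = [sum(p[j] for p in member_probs) / m for j in range(k)]
    label = max(range(k), key=lambda j: avg[j])
    return label, avg[label]


def selective_predict(member_probs, tau):
    """Predict the ensemble's class if its confidence is >= tau; otherwise abstain (None)."""
    label, conf = ensemble_prediction(member_probs)
    return label if conf >= tau else None


# Two members disagree on how confident to be, so the average is more moderate.
members = [[0.9, 0.1], [0.6, 0.4]]
label, conf = ensemble_prediction(members)
```

Adding a member costs one more forward pass per input, which is the efficiency trade-off discussed above.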

F.2 GOOD CLASSIFICATION PERFORMANCE DOES NOT IMPLY GOOD SELECTIVE CLASSIFICATION PERFORMANCE

It is well known that an ensemble has better classification performance than an individual model, but this does not guarantee better selective classification performance. To demonstrate this, we design an SR model with a big backbone, and show that it has classification performance as good as that of an SR ensemble with a standard backbone but worse selective classification performance than an SR model with a standard backbone. The big backbone has twice as many filters in every convolutional layer and twice as many neurons in every fully connected hidden layer as the standard VGG-16, and is therefore called Big VGG-16. It is easy to see that its number of parameters is approximately $2^2 = 4$ times that of the standard VGG-16. We train an SR ensemble of 4 VGG-16s and an SR model with a Big VGG-16 backbone on CIFAR-10 and show the evaluation results in Figure 7 and Table 5. Figure 7 shows that when coverage is high, the ensemble and the big individual model have similar selective risks; in particular, their classification error rates (i.e., selective risks at 100% coverage) are similar. However, when coverage is low, the big individual model has a significantly higher selective risk than the ensemble. Table 5 shows that the AURC of Big VGG-16 is much higher than that of the ensemble of 4 VGG-16s and even higher than that of SR. In summary, a selective classifier with good classification performance is not guaranteed to have good selective classification performance, so the good selective classification performance of the ensemble is not a trivial result of its good classification performance.
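The "approximately $2^2 = 4$ times" parameter count can be checked by counting convolutional parameters. The sketch below assumes the standard 13-layer VGG-16 convolutional widths with 3 × 3 kernels and biases, and ignores batch normalization and fully connected layers, so it only illustrates the scaling argument rather than the exact model used in the experiments.

```python
def conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights plus biases of one convolutional layer."""
    return c_in * c_out * kernel * kernel + c_out


def count_conv_stack(widths, in_channels: int = 3) -> int:
    """Total parameters of consecutive conv layers with the given output widths."""
    total, prev = 0, in_channels
    for w in widths:
        total += conv_params(prev, w)
        prev = w
    return total


# Output widths of the 13 convolutional layers in the standard VGG-16 configuration.
VGG16_WIDTHS = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]

standard = count_conv_stack(VGG16_WIDTHS)
big = count_conv_stack([2 * w for w in VGG16_WIDTHS])  # every layer twice as wide
ratio = big / standard  # slightly below 4: the RGB input and the biases only double
```

Doubling both the input and output widths of a layer quadruples its weights, which is why a 4-member ensemble of standard VGG-16s is a parameter-matched comparison for Big VGG-16.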

F.3 THE EFFECTS OF LABEL NOISE OF SVHN ON SELECTIVE CLASSIFIER ENSEMBLES

In this section, we compare the effect of label noise in SVHN on the SR ensemble with that on the SAT ensemble; the result might explain the abnormal experimental results on SVHN (compared with results on other datasets) in Section 6.2. SVHN is not a clean dataset, and much more label noise can be detected in SVHN than in CIFAR-10 and CIFAR-100. Using the soft label of SAT (Huang et al., 2020), we detect label noise in SVHN, CIFAR-10, and CIFAR-100, and find that SVHN has significantly more label noise than CIFAR-10 and CIFAR-100. The result is presented below. In addition, it is known that SAT is robust to label noise (Huang et al., 2020), while SR is not, so we conjecture that the label noise of SVHN is why the SR ensemble is inferior to SAT on SVHN.

We detect label noise with the help of the soft label of SAT. For a sample $x_i$, the soft label of SAT (Huang et al., 2020), $t_{i,y_i}$, is used to measure the learning difficulty of $x_i$. The soft label is initialized as 1 and updated at every training epoch as

$$t_{i,y_i} \leftarrow \alpha \times t_{i,y_i} + (1 - \alpha) \times p_\theta(y_i|x_i),$$

where $p_\theta(Y|x)$ is the predictive probability distribution of the classifier, $y_i$ is the label of $x_i$, and $\alpha$ is a hyperparameter. The smaller $t_{i,y_i}$ is, the lower the true-class predictive probability of the classifier on $x_i$ during training, indicating that $x_i$ is more difficult to learn. By selecting a percentage of samples with the lowest $t_{i,y_i}$, we obtain the samples that are most difficult for the classifier to learn, among which we can easily detect label noise manually.

In the training sets of SVHN, CIFAR-10, and CIFAR-100, we detect label noise manually among the top-0.1% most difficult samples (measured by the soft label of SAT). The numbers of mislabeled samples detected in SVHN, CIFAR-10, and CIFAR-100 are shown in Table 6. The result shows that significantly more mislabeled samples are detected in SVHN than in CIFAR-10 and CIFAR-100, indicating much more label noise in SVHN than in CIFAR-10 and CIFAR-100.

To verify the effect of label noise, the following experiments are designed. Firstly, we detect label noise manually among the 1% of the hardest-to-learn samples of the SVHN training set and test set, using the soft label of SAT. Secondly, we remove the detected mislabeled samples from the original dataset. The remaining SVHN dataset is called the clean SVHN. Accordingly, the original dataset is called the original SVHN. Finally, we retrain and test the SR ensemble and the SAT ensemble and compare their test results. In the second step, we remove mislabeled samples rather than relabel them because some samples cannot be classified even by humans, and some samples do not belong to any category of SVHN. Thus, the label noise can be eliminated only by removing mislabeled samples, not by modifying their labels.

The test results of the SR ensemble and the SAT ensemble on the clean SVHN are shown in Table 7. It is not surprising that the AURCs of the SR ensemble and the SAT ensemble are significantly lower on the clean SVHN than on the original SVHN. Furthermore, on the clean SVHN, when the number of members is 5, the AURC of the SR ensemble is lower than that of the SAT ensemble. Combined with the results on the original SVHN, where the AURC of the SR ensemble is higher than that of the SAT ensemble, we conclude that label noise in SVHN is why the SR ensemble has a higher AURC than the SAT ensemble. In other words, label noise is why the SR ensemble performs worse in selective classification than the SAT ensemble on SVHN.

In summary, our experiments show that the SR ensemble is not as robust to label noise as the SAT ensemble, and that label noise in SVHN is why the SR ensemble is not as good as the SAT ensemble on SVHN. We construct the clean SVHN, which is SVHN without some mislabeled samples. On the clean SVHN, we compare the SR ensemble with the SAT ensemble and find that the SR ensemble is superior to the SAT ensemble in selective classification performance. Combined with the former experimental results, we conclude that label noise in SVHN is why the SR ensemble is inferior to SAT on SVHN. Considering the experimental results on the clean SVHN and the previous experimental results on CIFAR-10 and CIFAR-100 (see Table 7 and Table 1), the SR ensemble is superior to the SAT ensemble in selective classification on clean image classification datasets. Thus, the SR ensemble is the state-of-the-art selective classification method on clean image classification datasets, but is not as robust to label noise as the SAT ensemble.
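The soft-label update and the selection of hard-to-learn samples used above for noise detection can be sketched as follows. The function names are hypothetical; the default α = 0.99 follows the setting reported in Appendix C.3, and the manual inspection of the selected samples is of course not part of the code.

```python
def update_soft_label(t: float, p_true: float, alpha: float = 0.99) -> float:
    """One epoch of the update t <- alpha * t + (1 - alpha) * p_theta(y_i | x_i)."""
    return alpha * t + (1.0 - alpha) * p_true


def hardest_samples(soft_labels, fraction: float = 0.001):
    """Indices of the `fraction` of samples with the lowest soft labels (at least one)."""
    n = max(1, int(len(soft_labels) * fraction))
    return sorted(range(len(soft_labels)), key=lambda i: soft_labels[i])[:n]


# A sample whose true-class probability stays near 0 drifts away from the initial value 1.
t = 1.0
for _ in range(300):
    t = update_soft_label(t, p_true=0.0)
```

Because the update is an exponential moving average, a sample that the classifier never fits ends training with a soft label of about $\alpha^{T}$ after $T$ epochs, which places it among the candidates for manual inspection.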

G AN EXTENSION OF THEOREM 2

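This section discusses the lower bound of $\phi_0$ mentioned in Theorem 2. We aim to calculate the lower bound of $\phi_0$ without training an ensemble (otherwise, we could measure it directly on the ensemble).

Preliminaries. $\phi_0$ can be obtained by solving the following optimization problem:

$$\max_{\phi}\ \phi \quad \text{s.t.} \quad R_{\mathrm{ens}}(\phi) < R(\phi),$$

where $R_{\mathrm{ens}}(\phi)$ and $R(\phi)$ are the selective risks of the ensemble and the individual model under coverage $\phi$, respectively. Suppose we know all about the individual model, e.g., the mapping from $\phi$ to $R$ is known. Since maximizing coverage is equivalent to minimizing the confidence threshold for a fixed model, the original optimization problem can be substituted by

$$\min_{\tau, \tau_{\mathrm{ens}}}\ \tau \quad \text{s.t.} \quad \phi_{\mathrm{ens}}(\tau_{\mathrm{ens}}) = \phi(\tau), \quad R_{\mathrm{ens}}(\phi_{\mathrm{ens}}(\tau_{\mathrm{ens}})) < R(\phi(\tau)), \tag{42}$$

where $\phi_{\mathrm{ens}}(\tau_{\mathrm{ens}})$ is the coverage of the ensemble with confidence threshold $\tau_{\mathrm{ens}}$, and $\phi(\tau)$ is the coverage of the individual model given the confidence threshold $\tau$. Then $\phi_0 = \phi(\tau^*)$, where $\tau^*$ is the optimal solution to (42). Assume that $R$ is a monotone increasing function and that $\phi$ is a monotone decreasing function; then it is easy to show, using proof by contradiction, that (42) can be further transformed into

$$\min_{\tau, \tau_{\mathrm{ens}}}\ \tau \quad \text{s.t.} \quad \phi_{\mathrm{ens}}(\tau_{\mathrm{ens}}) \ge \phi(\tau), \quad R_{\mathrm{ens}}(\phi_{\mathrm{ens}}(\tau_{\mathrm{ens}})) < R(\phi(\tau)). \tag{43}$$

To solve (43), we need more information about the ensemble. Besides the number of classes $K$, assume that we know: an oracle that tells whether a sample is definite; $M$, the number of member models; and $B$, the upper bound of $p_{\Pi^k_1, \dots, \Pi^k_M}(\cdot|A)$ for all $k \in \{1, 2, \dots, K\}$.

Suppose $\phi_0 = \phi(\tau_0)$. Since $\tau_0$ is the optimal solution to (43), the optimal solution to (50) (denoted as $\tau^*$) provides an upper bound of $\tau_0$. Thus, considering that $\phi$ is a monotone decreasing function of $\tau$, $\phi(\tau^*)$ is a lower bound of $\phi_0 = \phi(\tau_0)$.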

Algorithm.

We design Algorithm 1 to search for the solution to (50) and then obtain the lower bound of $\phi_0$. Since $\tau_{\mathrm{ens}}$ is determined by $\tau$ (the first constraint of (50)), (50) can be reduced to a one-dimensional search problem. Our algorithm adopts a binary search for efficiency, although this method might provide a suboptimal solution. The procedure of Algorithm 1 in each iteration of the binary search is as follows.

1. Given the current $\tau$, Algorithm 1 determines $\tau_{\mathrm{ens}}$ using SEARCHFORTAUENS (see Algorithm 2), a procedure that searches for $\tau_{\mathrm{ens}} \in [0, 1]$ using binary search such that $\phi(\tau) = \Pr(C \ge \tau_{\mathrm{ens}}, D)$. Note that $\tau_{\mathrm{ens}}$ does not exist when $\tau$ is so low that $\phi(\tau) > \Pr(D) = \sup_{\tau_{\mathrm{ens}}} \Pr(C \ge \tau_{\mathrm{ens}}, D)$. This case is handled in the next two steps.
2. Algorithm 1 examines whether $\tau_{\mathrm{ens}}$ exists. If $\tau_{\mathrm{ens}}$ exists, Algorithm 1 then examines whether the second constraint of (50) holds for the current $\tau$ and $\tau_{\mathrm{ens}}$, which is implemented by VERIFYSECONDCONSTRAINT (see Algorithm 3).
3. If $\tau_{\mathrm{ens}}$ exists and the second constraint holds, Algorithm 1 searches for a smaller $\tau$ in the left half of the feasible area; otherwise, Algorithm 1 searches for a greater $\tau$ in the right half of the feasible area.

Once the binary search completes and outputs $\tau^*$, Algorithm 1 returns the coverage of $\theta$ with confidence threshold $\tau^*$.

An Example. To show that Algorithm 1 works in practice, we run this algorithm on CIFAR-10, using the same individual model as in Section 6. In this example, $K = 10$, $M = 5$, and the oracle is implemented by another ensemble with two individual models (the oracle outputs True if and only if the STD over the member models' predictive distributions is less than $10^{-3}$). Note that it is difficult to estimate $B$. On the one hand, we need to train an ensemble with $M$ models to estimate $B$, which is costly. On the other hand, the domain of $p_{\Pi^k_1, \dots, \Pi^k_M}(\cdot|A)$ is high-dimensional, so the observed data points are sparse in this domain, which makes the estimation of $B$ more difficult. Thus, we do not estimate $B$ but try several hypothetical values of $B$ to see for which values of $B$ the lower bound of $\phi_0$ is large. With different values of $B$, we obtain different lower bounds of $\phi_0$, as Table 8 shows. We can see that when $B \le 10^8$, the lower bound of $\phi_0$ is greater than 50%, which indicates that Algorithm 1 may be robust to the choice of $B$.
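The three steps of the search can be sketched in Python as below. This is an illustrative skeleton rather than the authors' implementation: the second constraint of (50) is passed in as a hypothetical callable `second_constraint(tau, tau_ens)`, and the inputs are assumed to be per-sample confidence scores of the individual model together with the oracle's definite/ambiguous flags.

```python
def coverage(conf, tau):
    """Empirical coverage phi(tau): fraction of samples with confidence >= tau."""
    return sum(c >= tau for c in conf) / len(conf)


def search_for_tau_ens(tau, conf, definite, eps=1e-9):
    """Binary search for tau_ens such that Pr(C >= tau_ens, D) matches phi(tau).

    Returns None when phi(tau) > Pr(D), i.e. when no such tau_ens exists.
    """
    target = coverage(conf, tau)

    def joint(t):  # empirical Pr(C >= t, D), non-increasing in t
        return sum(c >= t and d for c, d in zip(conf, definite)) / len(conf)

    if target > joint(0.0):
        return None
    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if joint(mid) >= target:
            lo = mid  # enough definite samples remain, so the threshold can rise
        else:
            hi = mid
    return lo


def lower_bound_phi0(conf, definite, second_constraint, eps=1e-9):
    """Outer binary search over tau; returns the coverage at the final tau*."""
    left, right = 0.0, 1.0
    while right - left > eps:
        tau = (left + right) / 2
        tau_ens = search_for_tau_ens(tau, conf, definite, eps)
        if tau_ens is not None and second_constraint(tau, tau_ens):
            right = tau  # feasible: try a smaller threshold (larger coverage)
        else:
            left = tau
    return coverage(conf, (left + right) / 2)
```

A caller would supply `second_constraint` as a function of the individual model's statistics and of $K$, $M$, and $B$; its exact form is given by (50) and is deliberately left abstract here.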



Beyond Theorem 2, we provide an elaborate analysis of the lower bound of $\phi_0$ in Appendix G.

The result that SAT is the state-of-the-art only on SVHN contradicts that of Huang et al. (2020). This contradiction may arise because Huang et al. (2020) directly cite the evaluation results of other methods from different papers, which may have subtle differences in the training settings of the backbone model (e.g., data augmentation). By contrast, we re-implement all selective classification methods based on the open-source code of SAT (Huang et al., 2020) for image classification in our experiments, so the fairness of the comparison is guaranteed.

For more details on the comparison of ensembles, see Tables 3 and 4, which provide the selective risks of ensembles under coverage 10%-100%.

Thus, Proposition 1, i.e., that the ensemble is modest over ambiguous samples, is equivalent to the statement that the ensemble provides moderate confidence over ambiguous samples. The latter is actually an obvious fact.



We conduct experiments on multiple datasets for image classification and text classification tasks. Following Geifman & El-Yaniv (2017; 2019), Liu et al. (2019), and Huang et al. (2020), we use CIFAR-10, CIFAR-100 (Krizhevsky, 2009), and SVHN (Netzer et al., 2011) for image classification tasks, and following Xin et al. (2021), we use MRPC (Dolan & Brockett, 2005), MNLI (Williams et al., 2018), and QNLI (Wang et al., 2018) for text classification tasks. The training set and test set are used in selective classification in the same way as in standard classification, because current selective classification focuses on in-domain data (i.e., data from the same distribution as the training set) (Geifman & El-Yaniv, 2017). For example, if the selective classifier is trained on the training set of CIFAR-10 (Krizhevsky, 2009), then it is tested on the test set of CIFAR-10. Furthermore, MNLI's development set and test set are divided into matched and mismatched parts. The matched parts are sampled from the same source as the training set (so they are in-domain data), while the mismatched parts are sampled from different sources. In our experiments, only the matched parts are used since current selective classification only considers in-domain data. In addition, the test sets of MRPC, QNLI, and MNLI are not accessible, so we use their development sets as test sets. Following Liu et al. (2019) and Huang et al. (2020), since CIFAR-10, CIFAR-100, and SVHN originally have no development set, their development sets are 2000 samples randomly split from the corresponding test sets. More details of all the datasets in our experiments are described in Appendix C.1.

Figure 1: (a)/(e): the selective error rates (selective risks) of definite samples and ambiguous samples given a range of confidence thresholds on the test set of each dataset for image/text classification. (b)-(d): the histogram of confidence scores of samples with STD < $10^{-3}$ and that of other samples on the test set of each dataset for image classification. (f)-(h): the histogram of confidence scores of samples with STD < $10^{-2}$ and that of other samples on the test set of each dataset for text classification.

Figure 2: Risk-coverage curves of ensembles and individual models of each baseline on multiple datasets, where all ensembles consist of 5 member models.

Figure 3 shows the relationship between selective classification and OOD detection (or open set recognition, novelty detection).

Figure 3: Relationship between selective classification and OOD detection (or open set recognition, novelty detection).

Figure 4: The road map of the proof of Theorem 2

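Lemma 7. If Assumptions 2-3 hold, then

$$\lim_{\tau_{\mathrm{ens}} \to 1^-} \Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) = 0, \tag{33}$$

where the notation follows that of Assumptions 2-3.

Proof. Similar to (31), with Bayes' rule and L'Hospital's rule, we can derive that

$$\lim_{\tau_{\mathrm{ens}} \to 1^-} \Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) = \frac{\Pr(A) \lim_{\tau_{\mathrm{ens}} \to 1^-} \frac{\Pr(C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}|A)}{\Pr(C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}|D)}}{\Pr(A) \lim_{\tau_{\mathrm{ens}} \to 1^-} \frac{\Pr(C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}|A)}{\Pr(C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}|D)} + \Pr(D)} = \frac{\Pr(A) \lim_{\tau_{\mathrm{ens}} \to 1^-} \frac{p_{C_{\mathrm{ens}}}(\tau_{\mathrm{ens}}|A)}{p_{C_{\mathrm{ens}}}(\tau_{\mathrm{ens}}|D)}}{\Pr(A) \lim_{\tau_{\mathrm{ens}} \to 1^-} \frac{p_{C_{\mathrm{ens}}}(\tau_{\mathrm{ens}}|A)}{p_{C_{\mathrm{ens}}}(\tau_{\mathrm{ens}}|D)} + \Pr(D)}. \tag{34}$$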

Figure 5: Selective risks of SN ensembles (with 2 to 5 members) and the individual SN (with only 1 member) given the coverage of 90%

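$$\max_{\phi}\ \phi \quad \text{s.t.} \quad R_{\mathrm{ens}}(\phi) < R(\phi),$$

where $R_{\mathrm{ens}}(\phi)$ and $R(\phi)$ are the selective risks of the ensemble and the individual model under coverage $\phi$, respectively. Suppose we know all about the individual model, e.g., the mapping from $\phi$ to $R$ is known. Since maximizing coverage is equivalent to minimizing the confidence threshold for a fixed model, the original optimization problem can be substituted by

$$\min_{\tau, \tau_{\mathrm{ens}}}\ \tau \quad \text{s.t.} \quad \phi_{\mathrm{ens}}(\tau_{\mathrm{ens}}) = \phi(\tau), \quad R_{\mathrm{ens}}(\phi_{\mathrm{ens}}(\tau_{\mathrm{ens}})) < R(\phi(\tau)), \tag{42}$$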

$p_{\Pi^k_1, \dots, \Pi^k_M}(\cdot|A)$ for all $k \in \{1, 2, \dots, K\}$. It is natural to know $K$ and $M$, and we need the oracle and $B$ because they provide critical information about the ensemble's behavior. The oracle can be implemented by an ensemble with $M'$ members, where $M'$ is smaller than $M$.

Eliminating the Unknowns. We now translate the unknowns in (43) into known quantities. Firstly, we eliminate the unknowns in the first constraint. According to (20) and Lemma 4, it is easy to prove (by an integral) that

$$\Pr(C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}|A) \le \beta (1 - \tau_{\mathrm{ens}})^M, \tag{44}$$

where $C$ and $C_{\mathrm{ens}}$ are the confidence scores of the individual model and the ensemble, respectively, $A$/$D$ represents the event that the input sample is ambiguous/definite, and $\beta = K \cdot M^{M-1} \cdot B$. Combining

Algorithm 1: A Lower Bound of $\phi_0$.
Input: the individual model $\theta$; the test set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$; the number of classes $K$; the oracle $\Omega : \mathcal{X} \to \{0, 1\}$ that tells whether a sample is definite; the number of member models $M$; and $B$, the upper bound of $p_{\Pi^k_1, \dots, \Pi^k_M}(\cdot|A)$ for all $k \in \{1, 2, \dots, K\}$.
Output: A lower bound of $\phi_0$ mentioned in Theorem 2
    left = 0
    right = 1
    $\epsilon$ = $10^{-9}$
    while right − left > $\epsilon$ do
        $\tau$ = (left + right)/2
        $\tau_{\mathrm{ens}}$ = SEARCHFORTAUENS($\tau$, $\theta$, $\mathcal{D}$, $\Omega$)
        if $\tau_{\mathrm{ens}}$ is not None and VERIFYSECONDCONSTRAINT($\tau$, $\tau_{\mathrm{ens}}$, $\theta$, $\mathcal{D}$, $\Omega$, $K$, $M$, $B$) is True then
            right = $\tau$
        else
            left = $\tau$
    $\tau^*$ = (left + right)/2
    return $\frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\{C(x_i; \theta) \ge \tau^*\}$  // $C(x_i; \theta)$ is the confidence of $\theta$ on sample $x_i$.

In Assumption 1, $\Pr(\mathrm{Err}|A, C \ge \tau)$ is the selective risk of the individual model with a confidence threshold of $\tau$ for ambiguous samples, and $\Pr(\mathrm{Err}|D, C \ge \tau)$ is that for definite samples. The motivation for Assumption 1 is that ambiguous samples seem more difficult to classify than definite samples.

Table 1: AURC ($\times 10^{-4}$) on each dataset, where MNLI-(m) is the matched part of the MNLI development set. The means and standard deviations are calculated over three trials. The best entries are marked in bold.

SAT ensemble has the lowest AURCs on SVHN$^{2}$, and Reg-curr has the lowest AURCs on QNLI and MNLI$^{3}$.

Kagan Tumer and Joydeep Ghosh. Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition, 29(2):341-348, 1996.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353-355, 2018.

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112-1122, 2018.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38-45, 2020.

Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin. The art of abstention: Selective prediction and error regularization for natural language processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1040-1051, 2021.

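Third, we have

$$\begin{aligned} \Pr(\mathrm{Err}_{\mathrm{ens}}|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) &= \Pr(\mathrm{Err}_{\mathrm{ens}}, A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}, D|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) \\ &= \Pr(\mathrm{Err}_{\mathrm{ens}}|A, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})\Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})\Pr(D|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) \\ &= \Pr(\mathrm{Err}_{\mathrm{ens}}|A, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})\Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})[1 - \Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})] \\ &= [\Pr(\mathrm{Err}_{\mathrm{ens}}|A, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) - \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}})] \cdot \Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) \\ &\le \Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}), \end{aligned} \tag{40}$$

where the last inequality holds because any probability is in $[0, 1]$. Combining this with (35), we have $\exists \delta \in (0, 1)$, $\forall \tau_{\mathrm{ens}} \in (1 - \delta, 1)$,

$$\begin{aligned} \Pr(\mathrm{Err}_{\mathrm{ens}}|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) &\le \Pr(A|C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) \\ &< \Pr(\mathrm{Err}_{\mathrm{ind}}|C \ge \tau) - \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) + \Pr(\mathrm{Err}_{\mathrm{ens}}|D, C_{\mathrm{ens}} \ge \tau_{\mathrm{ens}}) \\ &= \Pr(\mathrm{Err}_{\mathrm{ind}}|C \ge \tau), \end{aligned} \tag{41}$$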

The sizes of the training set, development set, and test set of each dataset used in experiments are shown in Table 2. MNLI's development set and test set are divided into matched and mismatched parts. In the table, (m) represents matched, and (mm) represents mismatched. The matched parts are sampled from the same source as the training set, while the mismatched parts are sampled from different sources. Current selective classification only considers test samples from the same distribution as the training set, so only the matched parts are used in experiments. In addition, the test sets of MRPC, QNLI, and MNLI are not accessible, so we use their development sets as test sets. Following Liu et al. (2019) and Huang et al. (2020), since CIFAR-10, CIFAR-100, and SVHN originally have no development set, their development sets are 2000 samples randomly split from the corresponding test sets.

Table 2: Sizes of training sets, development sets, and test sets for each dataset used in experiments

The selective risks of ensembles under coverage 10%-100% on image classification datasets. The means and standard deviations are calculated over three trials. The best entries and those that overlap with the best entries are marked in bold.

The selective risks of ensembles under coverage 10%-100% on text classification datasets. The means and standard deviations are calculated over three trials. The best entries and those that overlap with the best entries are marked in bold.

The AURCs ($\times 10^{-4}$) of Big VGG-16, a vanilla VGG-16, and the ensemble of 4 VGG-16s on CIFAR-10. The best entries are marked in bold.

Firstly, we detect label noise manually among the 1% of the hardest-to-learn samples of the SVHN training set and test set,

Numbers of mislabeled samples in the top-0.1% difficult training samples of SVHN, CIFAR-10, and CIFAR-100.

AURC ($\times 10^{-4}$) of the SR ensemble and the SAT ensemble on the clean SVHN.
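The quantities reported in the tables above (selective risk at a given coverage, and AURC) can be computed from per-sample confidences and error indicators. The sketch below uses one common empirical estimator, in which AURC is the mean selective risk over all coverage levels k/N; the function names are hypothetical, and averaging the members' predicted probabilities before taking the maximum is an assumption about how the ensemble confidence is formed.

```python
def ensemble_softmax_response(member_probs):
    """Average member probabilities; use the max as the confidence score.

    member_probs[m][i] is the class-probability vector of model m on sample i.
    Returns (predictions, confidences).
    """
    n_models = len(member_probs)
    preds, confs = [], []
    for per_sample in zip(*member_probs):  # the M vectors for one sample
        avg = [sum(p) / n_models for p in zip(*per_sample)]
        conf = max(avg)
        preds.append(avg.index(conf))
        confs.append(conf)
    return preds, confs


def risk_coverage_curve(confidences, errors):
    """Selective risk after accepting the k most confident samples, k = 1..N."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    coverages, risks, cum_err = [], [], 0
    for k, i in enumerate(order, start=1):
        cum_err += errors[i]
        coverages.append(k / len(order))
        risks.append(cum_err / k)
    return coverages, risks


def selective_risk_at(confidences, errors, coverage):
    """Selective risk when roughly a `coverage` fraction of samples is accepted."""
    _, risks = risk_coverage_curve(confidences, errors)
    k = max(1, min(len(risks), round(coverage * len(risks))))
    return risks[k - 1]


def aurc(confidences, errors):
    """Area under the risk-coverage curve (mean risk over all coverages)."""
    _, risks = risk_coverage_curve(confidences, errors)
    return sum(risks) / len(risks)


# Toy example: two members, four samples, two classes (made-up numbers).
member_probs = [
    [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4], [0.5, 0.5]],
    [[0.7, 0.3], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4]],
]
labels = [0, 1, 1, 0]
preds, confs = ensemble_softmax_response(member_probs)
errors = [int(p != y) for p, y in zip(preds, labels)]
print(selective_risk_at(confs, errors, 0.5), aurc(confs, errors))
```

Multiplying the returned AURC by $10^{4}$ gives values on the scale used in the tables.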


Secondly, we eliminate the unknowns in the second constraint. According to (41), (35) is a sufficient condition of $R_{\mathrm{ens}} < R$. We rewrite (35) as

where $C_{\mathrm{ens}}$ is the confidence score of the ensemble, and Err represents the event that the individual model makes an erroneous prediction. Note that the first term in the second constraint of (47) contains $C_{\mathrm{ens}}$, which is unknown, so we cannot directly replace the second constraint of (43) with (47). We eliminate $C_{\mathrm{ens}}$ as follows:

where the first inequality is due to (44), and the second inequality is due to (45) and (46). Thus, a sufficient condition of (47) is

with which we replace the second constraint of (43). In summary, we can tighten the constraints of (43) and obtain the following optimization problem that does not contain the unknowns. It is easy to see that the optimal solution to (49) is an upper bound of that to (43).

min

Further Simplification and Final Result. It is easy to show, by contradiction, that the first constraint of (49) can be substituted by

Thus, the second constraint of (49) can be simplified as

Thus, the final version of the optimization problem with respect to $\tau$ is

min

Algorithm 2: SEARCHFORTAUENS
Input: the confidence threshold $\tau$; the individual model $\theta$; the test set $D = \{(x_i, y_i)\}_{i=1}^{N}$; the oracle $\Omega : X \to \{0, 1\}$ that tells whether a sample is definite.
Output: $\tau_{\mathrm{ens}} \in [0, 1]$ that satisfies the first constraint of (50).

Algorithm 3: VERIFYSECONDCONSTRAINT
Input: the confidence threshold $\tau$; $\tau_{\mathrm{ens}}$; the individual model $\theta$; the test set $D = \{(x_i, y_i)\}_{i=1}^{N}$; the oracle $\Omega : X \to \{0, 1\}$ that tells whether a sample is definite; the number of classes $K$; the number of member models $M$; and $B$, the upper bound of
Output: True if and only if $\tau$ and $\tau_{\mathrm{ens}}$ satisfy the second constraint of (50).

This example also indicates the relationship between the ensemble's diversity and its selective classification performance.
Since an ensemble with a smaller B seems to have more diversity over ambiguous samples, the result in Table 8 suggests that as long as the ensemble has enough diversity over ambiguous samples, the ensemble is guaranteed to have a lower selective risk than the individual model under a considerable range of coverage. 

