TOWARDS BETTER SELECTIVE CLASSIFICATION

Abstract

We tackle the problem of Selective Classification, where the objective is to achieve the best performance on a predetermined ratio (coverage) of the dataset. Recent state-of-the-art selective methods introduce architectural changes, either a separate selection head or an extra abstention logit. In this paper, we challenge the aforementioned methods. Our results suggest that the superior performance of state-of-the-art methods is owed to training a more generalizable classifier rather than to their proposed selection mechanisms. We argue that the best-performing selection mechanism should instead be rooted in the classifier itself. Our proposed selection strategy uses the classification scores and achieves better results by a significant margin, consistently across all coverages and all datasets, without any added compute cost. Furthermore, inspired by semi-supervised learning, we propose an entropy-based regularizer that improves the performance of selective classification methods. Our proposed selection mechanism combined with the proposed entropy-based regularizer achieves new state-of-the-art results.

1. INTRODUCTION

A model's ability to abstain from a decision when lacking confidence is essential in mission-critical applications. This is known as the Selective Prediction problem setting. The abstained, uncertain samples can be flagged and passed to a human expert for manual assessment, which, in turn, can improve the re-training process. This is crucial in problem settings where confidence is critical or an incorrect prediction can have significant consequences, such as in the financial, medical, or autonomous driving domains. Several papers have tried to address this problem by estimating the uncertainty in the prediction. Gal & Ghahramani (2016) proposed using MC-dropout. Lakshminarayanan et al. (2017) proposed to use an ensemble of models. Dusenberry et al. (2020) and Maddox et al. (2019) are examples of work using Bayesian deep learning. These methods, however, are either expensive to train or require extensive tuning for acceptable results. In this paper, we focus on the Selective Classification problem setting, where a classifier has the option to abstain from making predictions. Models that come with an abstention option and tackle the selective prediction problem setting are naturally called selective models. Different selection approaches have been suggested, such as incorporating a selection head (Geifman & El-Yaniv, 2019) or an abstention logit (Huang et al., 2020; Ziyin et al., 2019). In either case, a threshold is set such that selection and abstention values above or below the threshold decide the selection action. SelectiveNet (Geifman & El-Yaniv, 2019) proposes to learn a model comprising a selection head and a prediction head, where the value returned by the selection head determines whether a datapoint is selected for prediction. Huang et al. (2020) and Ziyin et al.
(2019) introduced an additional abstention logit for classification settings, where the output of the additional logit determines whether the model abstains from making a prediction on the sample. The promising results of these works suggest that the selection mechanism should focus on the output of an external head/logit. On the contrary, in this work, we argue that the selection mechanism should be rooted in the classifier itself. The results of our rigorously conducted experiments show that (1) the superior performance of the state-of-the-art methods is owed to training a more generalizable classifier rather than to their proposed external head/logit selection mechanisms. These results suggest that future work in selective classification should (i) aim to learn a more generalizable classifier and (ii) base the selection mechanism on the classifier itself rather than pursue the recent research direction of architecture modifications for an external logit/head. (2) We highlight a connection between selective classification and semi-supervised learning. To the best of our knowledge, this has not been explored before. We show that entropy-minimization regularization, a common technique in semi-supervised learning, significantly improves the performance of the state-of-the-art selective classification method. The promising results suggest that additional research is warranted to explore the relationship between these two research directions. From a practical perspective, (3) we propose a selection mechanism that outperforms the original selection mechanisms of state-of-the-art methods. Furthermore, this method can be immediately applied to an already deployed selective classification model and instantly improve performance at no additional cost.
(4) We show that a selective classifier trained with the entropy-regularised loss and with selection according to the classification scores achieves new state-of-the-art results by a significant margin (up to 80% relative improvement). (5) Going beyond the already-saturated datasets often used for Selective Classification research, we include results on larger datasets: StanfordCars, Food101, Imagenet, and Imagenet100 to test the methods on a wide range of coverages, and ImagenetSubset to test the scalability of the methods.

2. RELATED WORK

The option to reject a prediction has been explored in depth in various learning algorithms, not limited to neural networks. Chow (1970) introduced a cost-based rejection model and analysed the error-reject trade-off. There has been significant study of rejection in Support Vector Machines (Bartlett & Wegkamp, 2008; Fumera & Roli, 2002; Wegkamp, 2007; Wegkamp & Yuan, 2011). The same is true for nearest neighbours (Hellman, 1970) and boosting (Cortes et al., 2016). LeCun et al. (1989) proposed a rejection strategy for neural networks based on the most activated output logit, the second most activated output logit, and the difference between them. Geifman & El-Yaniv (2017) presented a technique to achieve a target risk with a certain probability for a given confidence-rate function. As examples of confidence-rate functions, the authors suggested Softmax Response and MC-Dropout as selection mechanisms for a vanilla classifier. We build on this idea to demonstrate that Softmax Response, if utilized correctly, is the highest-performing selection mechanism in the selective classification setting. Beyond selective classification, the max logit (Softmax Response) has also been used in anomaly detection (Hendrycks & Gimpel, 2016; Dietterich & Guyer, 2022). Subsequent work focused on architectural changes and selecting according to a separately computed head/logit with its own parameters. Geifman & El-Yaniv (2019) later proposed SelectiveNet (see Section 3.2.1), a three-headed model comprising heads for selection, prediction, and auxiliary prediction. Deep Gamblers (Ziyin et al., 2019) (see Appendix A.1) and Self-Adaptive Training (Huang et al., 2020) (see Section 3.3.1) propose a (C + 1)-way classifier, where C is the number of classes and the additional logit represents abstention.
In contrast, in this work, we explain how selecting via entropy and max logit can act as a proxy for selecting samples that would minimise the cross entropy loss. In general, we report the surprising result that the selection head of SelectiveNet and the abstention logits of Deep Gamblers and Self-Adaptive Training are suboptimal selection mechanisms; their previously reported good performance is rooted in their optimization process converging to a more generalizable model. Another line of work that tackles selective classification is cost-sensitive classification (Charoenphakdee et al., 2021); however, the introduction of the target coverage adds a new variable and changes the mathematical formulation. Other works have proposed to perform classification in conjunction with expert decision makers (Mozannar & Sontag, 2020). In this work, we also highlight a connection between semi-supervised learning and selective classification, which, to the best of our knowledge, has not been explored before. As a result, we propose an entropy-regularized loss function in the Selective Classification setting to further improve the performance of the Softmax Response selection mechanism. Entropy-minimization objectives have previously been widely used for Unsupervised Learning (Long et al., 2016), Semi-Supervised Learning (Grandvalet & Bengio, 2004), and Domain Adaptation (Vu et al., 2019; Wu et al., 2020).

3. BACKGROUND

In this section, we introduce the Selective Classification problem and describe the top-performing methods for it. To the best of our knowledge, Self-Adaptive Training (Huang et al., 2020) achieves the best performance on the standard selective classification datasets.

3.1. PROBLEM SETTING: SELECTIVE CLASSIFICATION

The selective prediction task can be formulated as follows. Let X be the feature space, Y be the label space, and P(X, Y) represent the data distribution over X × Y. A selective model comprises a prediction function f : X → Y and a selection function g : X → {0, 1}. The selective model makes a prediction when g(x) = 1 and abstains when g(x) = 0. The objective is to maximise the model's predictive performance for a given target coverage c_target ∈ [0, 1], where coverage is the proportion of the selected samples. The selected set is defined as {x : g(x) = 1}. Formally, an optimal selective model, parameterised by θ* and ψ*, is given by

θ*, ψ* = argmin_{θ,ψ} E_P[ℓ(f_θ(x), y) · g_ψ(x)], s.t. E_P[g_ψ(x)] ≥ c_target, (1)

where E_P[ℓ(f_θ(x), y) · g_ψ(x)] is the selective risk. Naturally, higher coverages are correlated with higher selective risks. In practice, instead of a hard selection function g_ψ(x), existing methods aim to learn a soft selection function ḡ_ψ : X → R such that larger values of ḡ_ψ(x) indicate the datapoint should be selected for prediction. At test time, a threshold τ is selected for a coverage c_target such that

g_ψ(x) = 1 if ḡ_ψ(x) ≥ τ, and g_ψ(x) = 0 otherwise, s.t. E[g_ψ(x)] = c_target. (2)

In this setting, the selected (covered) dataset is defined as {x : ḡ_ψ(x) ≥ τ}. The process of selecting the threshold τ is known as calibration.

3.2. APPROACH: LEARN TO SELECT

3.2.1. SELECTIVENET

SelectiveNet (Geifman & El-Yaniv, 2019) is a three-headed network proposed for selective learning. A SelectiveNet model has three output heads designed for selection ḡ, prediction f, and auxiliary prediction h. The selection head infers the selective score of each sample, as a value between 0 and 1, and is implemented with a sigmoid activation function. The auxiliary prediction head is trained with a standard (non-selective) loss function. Given a batch {(x_i, y_i)}_{i=1}^m, where y_i is the label, the model is trained to minimise the loss L, defined as:

L = α (L_selective + λ L_c) + (1 − α) L_aux,
L_selective = ( (1/m) Σ_{i=1}^m ℓ(f(x_i), y_i) ḡ(x_i) ) / ( (1/m) Σ_{i=1}^m ḡ(x_i) ),
L_c = max(0, c_target − (1/m) Σ_{i=1}^m ḡ(x_i))²,
L_aux = (1/m) Σ_{i=1}^m ℓ(h(x_i), y_i),

where ℓ is any standard loss function. In Selective Classification, ℓ is the Cross Entropy loss function. The coverage loss L_c encourages the model to achieve the desired coverage and ensures ḡ(x_i) > 0 for at least a c_target proportion of the batch samples. The selective loss L_selective discounts the weight of difficult samples via the soft selection value ḡ(x_i), encouraging the model to focus on easier samples about which the model is more confident. The auxiliary loss L_aux ensures that all samples, regardless of their selective score ḡ(x_i), contribute to the learning of the feature model. λ and α are hyper-parameters controlling the trade-off between the different terms. Unlike Deep Gamblers and Self-Adaptive Training, SelectiveNet trains a separate model for each target coverage c_target. The SelectiveNet paper (Geifman & El-Yaniv, 2019) suggests that the best performance is achieved when the training target coverage is equal to the evaluation coverage.
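As a concrete illustration, the objective above can be sketched in numpy, assuming the per-sample cross-entropy losses and soft selection scores have already been computed; the helper name `selectivenet_loss` and the default hyperparameter values are ours, not taken from the official implementation:

```python
import numpy as np

def selectivenet_loss(ce_losses, g, aux_ce_losses, c_target, lam=32.0, alpha=0.5):
    """Sketch of the SelectiveNet objective (hypothetical helper name).

    ce_losses:     per-sample cross-entropy of the prediction head f
    g:             soft selection scores g_bar(x_i) in (0, 1)
    aux_ce_losses: per-sample cross-entropy of the auxiliary head h
    """
    coverage = g.mean()
    l_selective = (ce_losses * g).mean() / coverage  # selection-weighted empirical risk
    l_c = max(0.0, c_target - coverage) ** 2         # quadratic penalty for under-coverage
    l_aux = aux_ce_losses.mean()                     # every sample trains the feature model
    return alpha * (l_selective + lam * l_c) + (1 - alpha) * l_aux
```

Note that when the batch coverage already exceeds `c_target`, the coverage penalty vanishes, and only the selection-weighted and auxiliary risks remain.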

3.3. APPROACH: LEARN TO ABSTAIN

Self-Adaptive Training (Huang et al., 2020) and Deep Gamblers (Ziyin et al., 2019) learn a (C + 1)-way classifier in which the additional logit represents abstention, and select according to ḡ(x) = 1 − p_θ(C + 1|x). Due to the space limitation, the formulation for Deep Gamblers is included in the Appendix.

3.3.1. SELF-ADAPTIVE TRAINING

In addition to learning a logit that represents abstention, Self-Adaptive Training (Huang et al., 2020) proposes to use a convex combination of labels and predictions as a dynamically moving training target instead of the fixed labels. Let y_i be the one-hot encoded vector representing the label of a datapoint (x_i, y_i). Initially, the model is trained with a cross-entropy loss for a series of pre-training steps. Afterwards, the model is updated according to a dynamically moving training target. The training target t_i is initially set equal to the label, t_i ← y_i, and after each model update it is revised according to t_i ← α t_i + (1 − α) p_θ(·|x_i), with α ∈ (0, 1). Similar to Deep Gamblers, the model is trained to optimise a loss function that allows it to abstain on hard samples instead of making a prediction:

L = −(1/m) Σ_{i=1}^m [ t_{i,y_i} log p_θ(y_i|x_i) + (1 − t_{i,y_i}) log p_θ(C + 1|x_i) ],

where m is the number of datapoints in the batch. As training progresses, t_i approaches p_θ(·|x_i). The first term is similar to the Cross Entropy loss and encourages the model to learn a good classifier. The second term encourages the model to abstain from making predictions on samples it is uncertain about. This dynamically moving training target t_i allows the model to avoid fitting difficult samples as training progresses.
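The target update and the loss above can be sketched in numpy, assuming the predictive distributions are given; the helper names `sat_update_targets` and `sat_loss` are ours, not the authors' API:

```python
import numpy as np

def sat_update_targets(t, class_probs, alpha=0.9):
    """Moving-average target update t <- alpha * t + (1 - alpha) * p_theta(.|x).

    t and class_probs both range over the C classes.
    """
    return alpha * t + (1 - alpha) * class_probs

def sat_loss(t, probs, labels):
    """Sketch of the SAT loss; probs has C + 1 columns, the last one being
    the abstention probability p_theta(C+1|x)."""
    m = len(labels)
    t_y = t[np.arange(m), labels]      # target mass on the true class
    p_y = probs[np.arange(m), labels]  # predicted probability of the true class
    p_abstain = probs[:, -1]           # predicted probability of abstaining
    return -np.mean(t_y * np.log(p_y) + (1 - t_y) * np.log(p_abstain))
```

When t_i is still one-hot (t_y = 1), the loss reduces to plain cross-entropy; as t_y shrinks on hard samples, weight shifts onto the abstention term.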

4. METHODOLOGY

We motivate an alternative selection mechanism, Softmax Response, for Selective Classification models. We explain how the known state-of-the-art selective methods can be equipped with the proposed selection mechanism and why it further improves performance. Inspired by semi-supervised learning, we also introduce an entropy-regularized loss function.

4.1. MOTIVATION

Recent state-of-the-art methods have proposed to learn selective models with architecture modifications such as an external logit/head. These architecture modifications, however, act as regularization mechanisms that allow the methods to train more generalizable classifiers (see Table 1). As a result, the claimed improved results of these models can actually be attributed to their classifiers being more generalizable. For these selective models to perform well in selective classification, the external logit/head must itself generalise, in the sense that it must select samples for which the classifier is confident of its prediction. Since the logit/head has its own set of learned model parameters, this adds another potential mode of failure for a selective model. Specifically, the learned parameters can fail to generalise, and the logit/head may (1) select samples about which the classifier is not confident and (2) reject samples about which the classifier is confident. We illustrate this failure mode in the appendix (see Figure 4).

4.2. SELECTING ACCORDING TO THE CLASSIFIER

The cross entropy loss function is a popular loss function for classification due to its differentiability. However, during evaluation, the most utilized metric is accuracy, i.e., whether a datapoint is predicted correctly. In the cross-entropy objective of the conventional classification setting, p(·|x_i) is a one-hot encoded vector; therefore, the cross-entropy loss can be simplified as

CE(p(·|x_i), p_θ(·|x_i)) = −Σ_{u=1}^C p(u|x_i) log p_θ(u|x_i) = −log p_θ(y_i|x_i),

i.e., during optimization, the predicted score of the correct class is maximised. Accordingly, the maximum class score can be interpreted as the model's relative confidence in its prediction. Therefore, a simple selection mechanism for a model is to select according to the maximum predictive class score, ḡ(x) = max_{u∈{1,…,C}} p_θ(u|x) (aka Softmax Response (Geifman & El-Yaniv, 2017)). Alternatively, a model can select according to its negative predictive entropy, ḡ(x) = −H(p_θ(·|x)), a metric of the model's uncertainty. An in-depth discussion is included in Appendix B.
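Both selection scores can be computed directly from a trained classifier's logits; a minimal numpy sketch, with helper names of our choosing:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_response(logits):
    """g_bar(x) = max_u p_theta(u|x): larger means more confident."""
    return softmax(logits).max(axis=1)

def negative_entropy(logits):
    """g_bar(x) = -H(p_theta(.|x)): larger (less negative) means more confident."""
    p = softmax(logits)
    return np.sum(p * np.log(p + 1e-12), axis=1)  # equals -H(p), up to the epsilon
```

A sharply peaked logit vector yields both a higher Softmax Response and a higher negative entropy than a flat one, so either score ranks confident samples first.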

4.3. RECIPE FOR BETTER SELECTIVE CLASSIFICATION

The recipe that we provide for better selective classification is as follows:

1. Train a selective classifier (e.g., SelectiveNet, Self-Adaptive Training, or Deep Gamblers).

2. Discard its selection mechanism:

   • For SelectiveNet: ignore the selection head.
   • For Self-Adaptive Training and Deep Gamblers: ignore the additional abstain logit and compute the final layer's softmax over the original C class logits.

3. Use a classifier-based selection mechanism (e.g., Softmax Response) to rank the samples.

4. Calculate the threshold value τ, based on the validation set, to achieve the desired target coverage, and select samples with a score greater than τ.

Empirically, we show that selecting via entropy or Softmax Response both outperform selecting according to the external head/logit. From these results, we conclude that the strong performance of these recent state-of-the-art methods was due to learning a more generalizable classifier rather than their proposed selection mechanisms. In Step 3, we experimented with both an entropy-based selection mechanism and Softmax Response, and found that Softmax Response performed better. Notably, Softmax Response does not require retraining and can be immediately applied to already deployed models for a significant performance improvement at negligible cost.
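For the abstention-logit methods (Self-Adaptive Training, Deep Gamblers), Steps 2-4 of the recipe can be sketched as follows; the helper names are ours, and this is an illustration rather than the official implementation:

```python
import numpy as np

def sr_scores_from_abstain_model(logits_c_plus_1):
    """Drop the abstain logit (assumed to be the last column) and take
    Softmax Response over the original C class logits."""
    class_logits = logits_c_plus_1[:, :-1]  # ignore the abstention logit entirely
    z = class_logits - class_logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.max(axis=1)                    # g_bar(x) = max_u p_theta(u|x)

def select(scores, tau):
    """Select samples whose score clears the calibrated threshold tau."""
    return scores >= tau
```

Even if the abstention logit is large, it never influences the score: only the classifier's own C-way distribution decides which samples are kept.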

4.4. ENTROPY-REGULARIZED LOSS FUNCTION

Here, we highlight a similarity between semi-supervised learning and selective classification which, to our knowledge, has not been explored before. In the semi-supervised learning setting, the training dataset consists of labelled and unlabelled data. A simple approach is to train solely on the labelled data and ignore the unlabelled data, i.e., train a model via supervised learning on the labelled data. This is equivalent to assigning a weight of 1 to the labelled samples and 0 to the unlabelled samples. However, this is suboptimal because it uses no information from the unlabelled samples. Similarly, in Selective Classification, samples that are selected tend to have a high weight close to 1 (see, for example, the ḡ(x) term in the objective in Section 3.2.1) and samples that are not selected have a low weight close to 0. One way semi-supervised learning tackles this is via an entropy-minimization term. Entropy minimization is one of the most standard, well-studied, and intuitive methods for semi-supervised learning. It uses the information of all the samples and increases the model's confidence in its predictions, including on the unlabelled samples, resulting in a better classifier. Inspired by the similarity between the objective of selective classification and the setting of semi-supervised learning, we propose an entropy-minimization term for the objective function of selective classification methods:

L_new = L + β H(p_θ(·|x)),

where β is a hyperparameter that controls the impact of the term. In our experiments, we found β = 0.01 to perform well in practice. The entropy-minimization term encourages the model to be more confident in its predictions, i.e., it increases the confidence of the predicted class and decreases the predictive entropy during training. Thus, it allows for better disambiguation between sample predictions.
The larger coefficient on the cross-entropy term compared to that of the entropy-minimization term ensures that increasing the confidence of correct predictions is prioritised, benefitting Softmax Response. In Section 5, we show that this proposed loss function based on semi-supervised learning improves the performance in Selective Classification by a significant margin. These results open the door to future exploration of the connection between Selective Classification and semi-supervised learning.
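Assuming the method's original batch loss L has already been computed, the regularized objective can be sketched as follows (the helper name `entropy_regularized_loss` is ours):

```python
import numpy as np

def entropy_regularized_loss(base_loss, probs, beta=0.01):
    """L_new = L + beta * H(p_theta(.|x)), with H averaged over the batch.

    base_loss: the method's original objective (e.g. the SAT loss)
    probs:     predictive distributions over the C classes, one row per sample
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # H per sample
    return base_loss + beta * entropy.mean()
```

Confident (near one-hot) predictions add almost nothing, while high-entropy predictions are penalised, pushing the model toward sharper outputs.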

5. EXPERIMENTS

For the following experiments, we evaluate the following state-of-the-art methods: (1) SelectiveNet (SN), (2) Self-Adaptive Training (SAT), and (3) Deep Gamblers (DG). Furthermore, we compare the performance of these methods under two selection mechanisms: (1) their original selection mechanism and (2) SR, Softmax Response (our proposed mechanism). Due to space limitations, the table of results for a vanilla classifier is included in the Appendix, along with several additional results. The goal of our experimental evaluation is to answer the following question: is the superior performance of recent state-of-the-art methods due to their proposed selection mechanisms or to learning a more generalizable classifier?

5.1. DATASETS

CIFAR-10. The CIFAR-10 dataset (Krizhevsky, 2009) comprises small images of size 32 × 32 × 3: 50,000 images for training and 10,000 images for evaluation, split into 10 classes.

5.2. EXPERIMENT DETAILS

For our experiments, we adapted the publicly available official implementations of Deep Gamblers and Self-Adaptive Training. Experiments on SelectiveNet were conducted with our Pytorch implementation of the method, which follows the details provided in the original paper (Geifman & El-Yaniv, 2019). For the StanfordCars, Food101, Imagenet100, and ImagenetSubset datasets, we use a ResNet34 architecture for Deep Gamblers, Self-Adaptive Training, and the main body block of SelectiveNet. Following prior work, we use a VGG16 architecture for the CIFAR-10 experiments. We tuned the entropy-minimization hyperparameter over the values β ∈ {0.1, 0.01, 0.001, 0.0001}. CIFAR-10, Food101, and StanfordCars experiments were run with 5 seeds. Imagenet-related experiments were run with 3 seeds. Additional details regarding hyperparameters are included in the Appendix.

5.3.1. CORRECTING THE MISCONCEPTION ABOUT THE SELECTION MECHANISM

In Table 2, we compare the different selection mechanisms for a given selective classification method (SelectiveNet, Deep Gamblers, and Self-Adaptive Training). The results show that for each of these trained selective classifiers, the original selection mechanism is suboptimal; in fact, selecting via Softmax Response outperforms the original selection mechanism. These results suggest that (1) the strong performance of these methods was due to learning a more generalizable model rather than their proposed external head/logit selection mechanisms and (2) the selection mechanism should stem from the classifier itself rather than a separate head/logit. We see that Softmax Response is the state-of-the-art selection mechanism. It is important to note that this performance gain is achieved by simply changing the selection mechanism of the pre-trained selective model, without any additional computational cost. This observation applies to SN, DG, and SAT models. An interesting result from this experiment is that at low coverages (30%, 20%, and 10%), SelectiveNet's performance progressively gets worse. We hypothesize that this is due to the optimisation process of SelectiveNet, which allows the model to disregard (i.e., assign lower weight to the loss of) a vast majority of samples during training at little cost, i.e., ḡ(x) ≈ 0, especially when the target coverage is low.

5.3.3. SCALABILITY WITH THE NUMBER OF CLASSES: IMAGENETSUBSET

To evaluate the scalability of the proposed methodology with respect to the number of classes, we compare our proposed method SAT+EM+SR against the previous state-of-the-art SAT on ImagenetSubset. In Table 6, we see once again that Self-Adaptive Training with our proposed entropy-regularised loss function and selection according to Softmax Response outperforms the previous state-of-the-art (vanilla Self-Adaptive Training) by a very significant margin (up to 85% relative improvement) across all dataset sizes. Due to space limitations, the results for the other coverages of Table 6 are included in the Appendix.

5.3.4. ENTROPY-MINIMIZATION ONLY, SOFTMAX RESPONSE SELECTION ONLY, OR BOTH?

In this experiment, we show that applying EM or SR alone provides gains. However, to achieve state-of-the-art results by a large margin, it is crucial to use the combination of both SR and EM. Table 4 shows that using only entropy-minimization (SAT+EM) slightly improves the performance of SAT. However, SAT+EM+SR (SAT+EM in conjunction with the SR selection mechanism) improves upon SAT+SR and SAT+EM significantly, achieving new state-of-the-art results for selective classification.

6. CONCLUSION

In this work, we analysed the state-of-the-art Selective Classification methods and concluded that their strong performance is owed to learning a more generalisable classifier, while their proposed selection mechanisms are suboptimal. Accordingly, we showed that selection mechanisms based on the classifier itself outperform the state-of-the-art selection methods. These results suggest that future work in selective classification should explore selection mechanisms based on the classifier itself rather than following recent works which proposed architecture modifications. Moreover, we highlighted a connection between selective classification and semi-supervised learning, which to our knowledge has not been explored before. We showed that a common technique in semi-supervised learning, namely entropy-minimization, greatly improves performance in selective classification, opening the door to further exploration of the relationship between these two fields. From a practical perspective, we showed that selecting according to the classification scores is the best-performing selection mechanism in our comparison. Importantly, this method can be applied to an already deployed selective classification model and instantly improve performance at negligible cost. In addition, we showed that a selective classifier trained with the entropy-regularised loss and with selection according to Softmax Response achieves new state-of-the-art results by a significant margin.

REPRODUCIBILITY STATEMENT

In our experiments, we build on the official implementation of Self-Adaptive Training available at https://github.com/LayneH/SAT-selective-cls. Our code is available at https://github.com/BorealisAI/towards-better-sel-cls. The experiments with Deep Gamblers (Link: https://github.com/Z-T-WANG/NIPS2019DeepGamblers) are run using the official implementation. Our Pytorch implementation of SelectiveNet follows the details in the original paper. The implementation details are available in Section 4. The hyperparameters are available in Section 5 and Appendix C.

A APPENDIX: ADDITIONAL BACKGROUND AND BROADER IMPACT

A.1 DEEP GAMBLERS

Inspired by portfolio theory, Deep Gamblers proposes to train the model using the following loss function:

L = −(1/m) Σ_{i=1}^m log( p_θ(y_i|x_i) + (1/o) p_θ(C + 1|x_i) ),

where m is the number of datapoints in the batch and o is a hyperparameter controlling the impact of the abstain logit. Smaller values of o encourage the model to abstain more often. However, o ≤ 1 makes it ideal to abstain for all datapoints and o > C makes it ideal to predict for all datapoints. As a result, o is restricted to be between 1 and C. Note that with large values of o the loss function is approximately equivalent to the Cross Entropy loss.
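Assuming the (C + 1)-way softmax outputs are available, the loss can be sketched in numpy; the helper name is ours, and we take the last column to be the abstention probability:

```python
import numpy as np

def deep_gamblers_loss(probs, labels, o):
    """Sketch of the Deep Gamblers objective; probs has C + 1 columns,
    the last being the abstention probability, with 1 < o <= C."""
    m = len(labels)
    p_y = probs[np.arange(m), labels]  # probability assigned to the true class
    p_abstain = probs[:, -1]           # probability assigned to abstaining
    return -np.mean(np.log(p_y + p_abstain / o))
```

As o grows, the abstention term p_abstain / o vanishes and the loss approaches the ordinary cross-entropy −log p_θ(y|x), consistent with the remark above.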

A.2 BROADER IMPACT

The broader impact of this work depends on the application of the selective model. In terms of societal impact, fairness in selection remains a concern, as lowering the coverage can magnify the difference in recall between groups and increase unfairness (Jones et al., 2021; Lee et al., 2021). The calibration step performed on the validation set assumes the validation and test data are sampled from the same distribution. Hence, in the case of out-of-distribution test data, a selective classifier calibrated to, for example, 70% coverage may not achieve that coverage at test time. When evaluated, Selective Classifiers may also choose to predict samples from easier classes more often than from hard-to-predict classes. Thus, they would be undesirable in fairness applications that require equal coverage amongst the different classes.

B APPENDIX: ALTERNATIVE MOTIVATION

In the selective classification problem setting, the objective is to select a c_target proportion of samples for prediction according to the value outputted by a selection function ḡ(x). Since each datapoint (x_i, y_i) is an i.i.d. sample, it is optimal to iteratively select from the dataset D the sample x* that maximizes the selection function, i.e., x* ∈ argmax_{x∈D} ḡ(x), until the target proportion c_target of the dataset is reached. In other words, to select a c_target proportion of samples (coverage = c_target), it is sufficient to define the criterion ḡ and select a threshold τ such that exactly a c_target proportion of samples satisfy ḡ(x) > τ.
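The threshold selection described above amounts to taking a quantile of the validation-set scores; a minimal sketch, with a helper name of our choosing:

```python
import numpy as np

def calibrate_threshold(scores, target_coverage):
    """Pick tau so that roughly a target_coverage fraction of samples
    satisfy score >= tau.

    scores: soft selection values g_bar(x) on a held-out validation set;
    larger means more confident, i.e. select for prediction.
    """
    # The (1 - c)-quantile leaves a c-fraction of the scores at or above tau.
    return np.quantile(scores, 1.0 - target_coverage)

# Toy example: 10 validation scores, target coverage 0.3.
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
tau = calibrate_threshold(scores, 0.3)
selected = scores >= tau  # the top 30% of scores clear the threshold
```

On finite samples the achieved coverage matches the target only approximately (ties and interpolation at the quantile), which is why calibration is done on a held-out set.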

B.1 SELECTING VIA PREDICTIVE ENTROPY

At test time, given a dataset of datapoints D, if the labels were available, the optimal criterion to select the datapoint x ∈ D that minimises the loss function would be

argmin_{x∈D} CE(p(·|x), p_θ(·|x)).

However, at test time, the labels are unavailable. Instead, we can use the model's belief over what the label is, i.e., the learned approximation p_θ(·|x) ≈ p(·|x). We know that CE(p_θ(·|x), p_θ(·|x)) = H(p_θ(·|x)), where H is the entropy function. As such, we can select samples according to

argmin_{x∈D} CE(p(·|x), p_θ(·|x)) ≈ argmin_{x∈D} H(p_θ(·|x)).

In other words, entropy is an approximation of the unknown loss. Accordingly, with respect to the discussed selection framework (Section 3.1), the samples with the largest negative entropy value, i.e., ḡ(x) = −H(p_θ(·|x)), are the best candidates for selection. In Figure 1a, we show the distribution of entropy for a trained vanilla classifier, empirically showing entropy to be strongly inversely correlated with the model's ability to correctly predict the labels. As a result, entropy is a good selection mechanism. We include results on CIFAR-10 and Imagenet100 for a vanilla classifier in Table 8 and Table 9.

B.2 SELECTING VIA SOFTMAX RESPONSE

In the case of entropy, a lower value corresponds to higher model confidence. In contrast, in the case of the max class logit, a higher value corresponds to higher model confidence. Given a model with well-calibrated confidences (Guo et al., 2017; Minderer et al., 2021), an interpretation of p_θ(u|x) is a probability estimate of the true correctness likelihood, i.e., p_θ(u|x) is the likelihood that u is the correct label of x. Let y_i be the correct label for x_i. For example, given 100 samples {x_1, …, x_100} with p_θ(u|x_i) = 0.8, we would expect approximately 80% of the samples to have u as their label. As a result, p_θ(y_i|x_i) is the model's probability estimate that the correct label is y_i.
In classification, the probability that a calibrated model predicts a datapoint x correctly is equivalent to the value of the max class logit, i.e., max_{u∈{1,…,C}} p_θ(u|x). Logically, the sample x_i that should be selected for classification is the sample the model is most likely to predict correctly, i.e., i = argmax_j max_{u∈{1,…,C}} p_θ(u|x_j). This is equivalent to selecting according to the soft selection function ḡ(x) = max_{u∈{1,…,C}} p_θ(u|x). Simply put, this is selecting according to the maximum predictive class logit (aka Softmax Response (Geifman & El-Yaniv, 2017)). In practice, neural network models are not guaranteed to have well-calibrated confidences. In Selective Classification, however, we threshold according to τ and select samples above the threshold τ for classification, so we do not use the exact values of the confidence (max class logit). As a result, we do not need the model to have well-calibrated confidences; it suffices if samples with higher confidences (max class logit) have a higher likelihood of being correct. In Figure 1b, we show the distribution of the max class logit for a trained vanilla classifier, empirically showing that a larger max class logit is strongly correlated with the model's ability to correctly predict the label. As a result, the max class logit is a good selection mechanism. We include results on CIFAR-10 and Imagenet100 for a vanilla classifier in Table 8 and Table 9.

B.3 SELECTIVENET'S SELECTION MECHANISM

In this section, we further illustrate how SelectiveNet's original selection mechanism is suboptimal. The optimisation of SelectiveNet's selective loss L_selective (see Section 3.2.1) aims to learn a selection head (soft selection model) ḡ that outputs a low selection value for inputs with a large cross-entropy loss and a high selection value for inputs with a low cross-entropy loss.
At test time, good performance of SelectiveNet depends on the generalisation of both the prediction and selection heads. However, learned models can at times fail to generalise. In Figure 2 and Figure 3, we show the distribution of entropy and max class logit for selected and not-selected samples according to a SelectiveNet model. In the plots, we see that SelectiveNet's original selection mechanism selects several samples with large entropy and a low max class logit. In Table 2, we see that the selection mechanisms based on entropy and the max class logit outperform the original selection mechanism. This comparison further supports our argument that the selection mechanism should be rooted in the objective function instead of a separately calculated score.
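Comparing selection mechanisms as above reduces to computing the selective risk each confidence score induces at a fixed coverage. A minimal sketch (our illustration, not the paper's evaluation code; `correct` and `scores` are hypothetical arrays):

```python
import numpy as np

def selective_risk(correct, scores, coverage):
    """Empirical selective risk: the error rate over the `coverage`
    fraction of samples ranked most confident by `scores` (higher means
    more confident, e.g. negative entropy or the max class probability)."""
    n = len(correct)
    k = int(np.ceil(coverage * n))
    selected = np.argsort(-scores)[:k]   # top-k most confident samples
    return 1.0 - correct[selected].mean()

# 6 predictions: 1 = correct, 0 = incorrect, with confidence scores.
correct = np.array([1, 1, 0, 1, 0, 1])
scores  = np.array([0.9, 0.8, 0.3, 0.7, 0.4, 0.6])
print(selective_risk(correct, scores, 0.5))  # -> 0.0 (top 3 are all correct)
```

A good selection mechanism concentrates the errors in the rejected tail, so its selective risk at a given coverage is lower than that of a weaker score on the same predictions.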

C APPENDIX: ADDITIONAL EXPERIMENTAL DETAILS

C.1 HYPERPARAMETERS

Following Geifman & El-Yaniv (2019), SelectiveNet was trained with a target coverage rate and evaluated at the same coverage rate; as a result, there is a separate model for each experimental coverage rate. In contrast, target coverage does not play a role in the optimization process of Deep Gamblers and Self-Adaptive Training, hence the results for different experimental coverages are computed with the same models. All CIFAR-10 experiments were performed with 5 seeds. All Imagenet-related experiments were performed with 3 seeds. For hyperparameter tuning, we split Imagenet100's training data into 80% training data and 20% validation data evenly across the different classes. We tested the following values for the entropy-minimization coefficient: β ∈ {0.1, 0.01, 0.001, 0.0001}. For the final evaluation, we trained the model on the entire training data. Self-Adaptive Training models are trained using SGD with an initial learning rate of 0.1 and a momentum of 0.9.

Food101/Imagenet100/ImagenetSubset. The models were trained for 500 epochs with a mini-batch size of 128. The learning rate was reduced by 0.5 every 25 epochs. The entropy-minimization coefficient was β = 0.01.

CIFAR-10. The models were trained for 300 epochs with a mini-batch size of 64. The learning rate was reduced by 0.5 every 25 epochs. The entropy-minimization coefficient was β = 0.001.

StanfordCars. The models were trained for 300 epochs with a mini-batch size of 64. The learning rate was reduced by 0.5 every 25 epochs. The entropy-minimization coefficient was β = 0.01.

Imagenet. The models were trained for 150 epochs with a mini-batch size of 256. The learning rate was reduced by 0.5 every 10 epochs. The entropy-minimization coefficient was β = 0.001.
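The entropy-minimization regularizer enters the training objective as an additive term, L = CE + β · H(p_θ(·|x)), averaged over the mini-batch. A minimal NumPy sketch of this loss (our own illustration of the additive form; the paper's implementation details may differ):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_regularized_loss(logits, labels, beta):
    """Mean cross-entropy plus beta times the mean predictive entropy.
    Minimizing the entropy term pushes the model towards confident
    (low-entropy) predictions on the training data."""
    p = softmax(logits)
    n = len(labels)
    ce = -np.log(p[np.arange(n), labels] + 1e-12).mean()      # cross-entropy
    ent = -(p * np.log(p + 1e-12)).sum(axis=1).mean()         # mean entropy
    return ce + beta * ent

# Two samples, two classes, correct labels; beta as in the CIFAR-10 runs.
loss = entropy_regularized_loss(
    np.array([[2.0, 0.0], [0.0, 2.0]]), np.array([0, 1]), beta=0.001)
```

With small β (0.001-0.01 as tuned above), the entropy term gently sharpens the predictive distribution without dominating the cross-entropy objective.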

C.2 COMPUTE

The experiments were primarily run on a GTX 1080 Ti.

D.3.1 CIFAR-10

In Table 8, the difference in performance between selecting according to entropy and selecting according to Softmax Response is not significant. We attribute this marginal difference to the saturation of the CIFAR-10 dataset.

D.3.2 IMAGENET100


D.5 SELECTION MECHANISMS: SELF-ADAPTIVE TRAINING

ImagenetSubset. In addition to the Imagenet100 experiments, we also evaluate Self-Adaptive Training trained with the proposed entropy-regularised loss function on ImagenetSubset. In Figure 6 (and Table 12 and Table 13), we see that training with the entropy-regularised loss function improves the scalability of Self-Adaptive Training when selecting according to Softmax Response.



Note that the created ImagenetSubset dataset with 100 classes is different from that of Imagenet100. The code is available at https://github.com/BorealisAI/towards-better-sel-cls.



Figure 1: A histogram of the number of datapoints according to a vanilla classifier trained on Imagenet100. The orange bars indicate the samples for which the model correctly predicted the class; the blue bars represent the samples for which the model incorrectly predicted the class. In the case of entropy, a lower value corresponds to higher model confidence. In contrast, in the case of the max class logit, a higher value corresponds to higher model confidence.

Figure 2: Entropy Comparison. SelectiveNet trained on Imagenet100 for a target coverage of 0.8 and evaluated at a coverage of 0.8. In the case of entropy, a lower value corresponds to higher model confidence. The histogram represents the counts of samples that were incorrectly predicted by the model. The left image indicates datapoints that were not selected by the selection head, i.e., datapoints with low selection value h(x) < τ. The right image indicates datapoints that were selected by the selection head, i.e., h(x) ≥ τ.

Figure 3: Max Class Logit Comparison. SelectiveNet trained on Imagenet100 for a target coverage of 0.8 and evaluated at a coverage of 0.8. In the case of the max class logit, a higher value corresponds to higher model confidence. The histogram represents the counts of samples that were incorrectly predicted by the model. The left image indicates datapoints that were not selected by the selection head, i.e., datapoints with low selection value h(x) < τ. The right image indicates datapoints that were selected by the selection head, i.e., h(x) ≥ τ.

The CIFAR-10 experiments took 1.5 hours for Self-Adaptive Training and Deep Gamblers; SelectiveNet experiments took 3 hours each. The Imagenet100 experiments took 2 days for Self-Adaptive Training and Deep Gamblers; SelectiveNet experiments took 2.75 days each. The ImagenetSubset experiments took 0.5-4.5 days each for Self-Adaptive Training and Deep Gamblers, depending on the number of classes; SelectiveNet experiments took 0.75-5.5 days each, depending on the number of classes.

C.3 IMAGENETSUBSET: CLASSES

ImagenetSubset comprises multiple datasets ranging from 25 to 175 classes in increments of 25, i.e., {D_25, D_50, D_75, D_100, D_125, D_150, D_175}. Let C_25, C_50, ..., C_175 denote the classes of the respective datasets. The classes of ImagenetSubset are sampled uniformly at random from the classes of Imagenet such that the classes of the smaller datasets are subsets of the classes of the larger datasets, i.e., D_25 ⊂ D_50 ⊂ D_75 ⊂ ⋯ ⊂ D_175 and C_25 ⊂ C_50 ⊂ ⋯ ⊂ C_175. The list of Imagenet classes in each dataset is included below for reproducibility.
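One simple way to obtain such nested class sets is to draw a single random permutation of the Imagenet class IDs and take growing prefixes of it, which guarantees C_25 ⊂ C_50 ⊂ ⋯ ⊂ C_175 by construction. This is a sketch of one sampling scheme consistent with the description above, not necessarily the exact procedure used; the seed and class IDs are illustrative:

```python
import random

def nested_class_subsets(all_classes, sizes, seed=0):
    """Sample nested class sets: each smaller set is a prefix of one
    shared random permutation, so sets of increasing size are nested."""
    rng = random.Random(seed)
    perm = rng.sample(all_classes, max(sizes))  # one shared random order
    return {k: set(perm[:k]) for k in sorted(sizes)}

# 1000 Imagenet class indices, subset sizes 25..175 in steps of 25.
subsets = nested_class_subsets(list(range(1000)), range(25, 200, 25))
assert subsets[25] <= subsets[50] <= subsets[175]  # nested by construction
```

Fixing the seed makes the class lists reproducible, which is why the concrete class lists can be published alongside the datasets.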

Figure 4: SelectiveNet (at 80% coverage) on Imagenet100: Incorrect and unconfident predictions according to its classifier but selected images according to its Selective Head. The selection score threshold for selecting images for prediction is 0.93. Included are the selection score according to the Selection Head, the image's label, and the top-3 predicted classes according to the classifier and their respective classifier scores.

Figure 5: SelectiveNet (at 80% coverage) on Imagenet100: Correct and confident predictions according to its classifier but rejected images according to its Selective Head.

Figure 6: ImagenetSubset at 70% coverage. (a), (b), and (c): various selective models. (d): comparison of Self-Adaptive Training trained with and without the proposed entropy-regularised loss. The loss function improves the scalability of Self-Adaptive Training, particularly when using Softmax Response as the selection mechanism.

Figure 7: Risk-coverage plots for Imagenet100, Food101, and StanfordCars. All plots show that SAT+EM+SR outperforms SAT across all coverages, achieving state-of-the-art results.

In addition, we propose ImagenetSubset, a collection of datasets for evaluating the scalability of the methods to different numbers of classes. This is in contrast to existing Selective Classification research, which has mainly focused on small datasets such as CIFAR-10 with 10 or fewer classes, low-resolution images (64x64 or less), and very low error (the error at 80% coverage is already lower than 1%), limiting such datasets to high coverages (70%+). The results on the previously introduced datasets indicate saturation, e.g., 0.3% error at 70% coverage, discouraging experiments with lower coverages, which, in turn, prevents researchers from reaching conclusive results.

Comparison of the selective classification error between SelectiveNet (SN), Deep Gamblers (DG), and Self-Adaptive Training (SAT) with their original selection mechanisms vs. using Softmax Response (SR) as the selection mechanism, on Imagenet100.

Comparison of the selective classification error between Self-Adaptive Training (SAT) with its original selection mechanism vs. using Softmax Response (SR) and the proposed entropy-minimization loss function (EM), on StanfordCars and Food101.

Results on Imagenet, demonstrating the impact of our SAT+EM+SR method over using SR alone or EM alone.

SelectiveNet's selective loss optimises the classifier primarily over the selected samples, which becomes problematic when the target coverage is as low as 10%. In contrast, Deep Gamblers and Self-Adaptive Training models are equally optimised over all samples regardless of their selection.

Comparison of existing Selective Classification baselines with MC-Dropout. The results for MC-Dropout are originally from Geifman & El-Yaniv (2017). For a given coverage, the bolded result indicates the lowest selective risk (i.e., the best result) and the underlined result indicates the second lowest selective risk.

At lower coverages (50% and 40%), the results are not on par with the previously reported ones. We attribute these interesting results to our work being the first to evaluate these methods on large datasets at a wide range of coverages. Since previous works have mainly focused on toy datasets and high coverages (70+%), they failed to capture these patterns. The main takeaway of these results, however, is that, across all the reported methods, selecting via Softmax Response is best.

Comparison of selection based on Entropy and Softmax Response for a vanilla classifier trained with cross-entropy loss on CIFAR-10.

Comparison of selection based on Entropy and Softmax Response for a vanilla classifier trained with cross-entropy loss on Imagenet100.

In Table 9, we see that selecting according to Softmax Response clearly outperforms selecting according to entropy. We also see that the vanilla Softmax Response model learns a less generalizable classifier (see the performance at 100% coverage) than Self-Adaptive Training, Deep Gamblers, and SelectiveNet. However, interestingly, we found that Softmax Response outperforms both Deep Gamblers and SelectiveNet at low coverages (10%, 20%, 30%). Previous works failed to capture this pattern due to a lack of evaluation on larger datasets and lower coverages. In these results (Table 10), we see that the difference in performance between the various selection mechanisms is marginal; given how small the differences are, it is difficult to draw conclusions from these results.

Imagenet100. In Table 10, we see that selecting according to Softmax Response and Entropy clearly outperforms the original selection mechanism.

ImagenetSubset. In Table 11, similar to Imagenet100, we see a clear, substantial improvement when using Softmax Response as the selection mechanism instead of the original selection mechanism. Furthermore, Entropy also outperforms the original selection mechanism.

Deep Gamblers results on CIFAR-10 and Imagenet100. Comparison of selection mechanism results.

Deep Gamblers Results on ImagenetSubset (70% coverage) with various selective mechanisms.

We report the results for SelectiveNet, Deep Gamblers, and Self-Adaptive Training at 70% coverage. Consistent with previous experiments, we see that both selection mechanisms based on the classifier itself (predictive entropy and Softmax Response) significantly outperform the original selection mechanisms of the proposed methods. These results further support our conclusions that (1) the strong performance of these methods is due to their learning a more generalizable model and (2) the selection mechanism should stem from the classifier itself rather than a separate head/logit. Similarly, we see that Softmax Response is the state-of-the-art selection mechanism. In the experiments, we see that SelectiveNet struggles to scale to harder tasks. Accordingly, the improvement in selective accuracy achieved with Softmax Response (SR) increases as the number of classes increases. This suggests that the proposed selection mechanism is more beneficial for SelectiveNet as the difficulty of the task increases, i.e., it improves scalability.

Self-Adaptive Training with the entropy-minimization loss function on ImagenetSubset at 70% coverage. We see that SAT+EM+SR performs the best and outperforms SAT by a statistically significant margin.

D.6 ABLATION: VARYING ARCHITECTURE

In these experiments, we show the generalizability of our proposed entropy-minimization and Softmax Response methodology across architectures. In Tables 14, 15, and 16, we see that applying the entropy-minimization and Softmax Response methodology significantly improves upon the state-of-the-art method's performance.

D.7 RISK-COVERAGE PLOTS

Figure 7 shows the risk-coverage plots for Imagenet100, Food101, and StanfordCars.

D.8 LEARNING CURVE PLOTS

Figure 8 shows that the SAT and SAT+EM models have converged on StanfordCars.

RegNetX: StanfordCars results

ShuffleNet: StanfordCars results

ACKNOWLEDGEMENTS

The authors acknowledge funding from the Quebec government.

