TOWARDS BETTER SELECTIVE CLASSIFICATION

Abstract

We tackle the problem of Selective Classification, where the objective is to achieve the best performance on a predetermined ratio (coverage) of the dataset. Recent state-of-the-art selective methods introduce architectural changes, either a separate selection head or an extra abstention logit. In this paper, we challenge these methods. Our results suggest that the superior performance of state-of-the-art methods is owed to training a more generalizable classifier rather than to their proposed selection mechanisms. We argue that the best-performing selection mechanism should instead be rooted in the classifier itself. Our proposed selection strategy uses the classification scores and achieves better results by a significant margin, consistently across all coverages and all datasets, without any added compute cost. Furthermore, inspired by semi-supervised learning, we propose an entropy-based regularizer that improves the performance of selective classification methods. Our proposed selection mechanism combined with the proposed entropy-based regularizer achieves new state-of-the-art results.

1. INTRODUCTION

A model's ability to abstain from a decision when lacking confidence is essential in mission-critical applications. This is known as the Selective Prediction problem setting. Abstained, uncertain samples can be flagged and passed to a human expert for manual assessment, which, in turn, can improve the re-training process. This is crucial in settings where confidence is critical or an incorrect prediction can have significant consequences, such as the financial, medical, or autonomous driving domains. Several papers have addressed this problem by estimating the uncertainty in the prediction: Gal & Ghahramani (2016) proposed MC-dropout, Lakshminarayanan et al. (2017) proposed ensembles of models, and Dusenberry et al. (2020) and Maddox et al. (2019) are examples of work using Bayesian deep learning. These methods, however, are either expensive to train or require extensive tuning for acceptable results. In this paper, we focus on the Selective Classification problem setting, where a classifier has the option to abstain from making predictions. Models that come with an abstention option and tackle the selective prediction problem setting are naturally called selective models. Different selection approaches have been suggested, such as incorporating a selection head (Geifman & El-Yaniv, 2019) or an abstention logit (Huang et al., 2020; Ziyin et al., 2019). In either case, a threshold is set such that selection or abstention values above or below it decide the selection action. SelectiveNet (Geifman & El-Yaniv, 2019) learns a model comprising a selection head and a prediction head, where the values returned by the selection head determine whether a datapoint is selected for prediction. Huang et al. (2020) and Ziyin et al. (2019) introduced an additional abstention logit for classification settings, where the output of the additional logit determines whether the model abstains from making a prediction on the sample.

The promising results of these works suggest that the selection mechanism should focus on the output of an external head/logit. On the contrary, in this work, we argue that the selection mechanism should be rooted in the classifier itself. The results of our rigorously conducted experiments show that (1) the superior performance of the state-of-the-art methods is owed to training a more generalizable classifier rather than to their proposed external head/logit selection mechanisms. These results suggest that future work in selective classification (i) should aim to learn a more generalizable classifier and (ii) should base the selection mechanism on the classifier itself rather than pursue the recent research direction of architectural modifications for an external logit/head. (2) We highlight a connection between selective classification and semi-supervised learning which, to the best of our knowledge, has not been explored before. We show that entropy-minimization regularization, a common technique in semi-supervised learning, significantly improves the performance of the state-of-the-art selective classification method. These promising results suggest that additional research is warranted to explore the relationship between the two research directions. From a practical perspective, (3) we propose a selection mechanism that outperforms the original selection mechanisms of state-of-the-art methods. Moreover, it can be applied immediately to an already deployed selective classification model and instantly improve performance at no additional cost.
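Selecting according to the classifier itself can be as simple as thresholding the maximum softmax probability (Softmax Response) so that exactly the target coverage fraction of samples is accepted. The following is a minimal NumPy sketch of this idea; the function name and the quantile-based thresholding are illustrative choices, not the exact procedure used in any of the cited works.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_response_select(logits, coverage):
    """Select the `coverage` fraction of samples with the highest
    maximum softmax probability (Softmax Response)."""
    confidence = softmax(logits).max(axis=1)
    # Threshold at the (1 - coverage)-quantile of the confidence scores,
    # so roughly a `coverage` fraction of samples lies above it.
    threshold = np.quantile(confidence, 1.0 - coverage)
    return confidence >= threshold

# Example: 4 samples, 3 classes; keep the most confident half.
logits = np.array([[4.0, 0.0, 0.0],   # confident
                   [0.1, 0.0, 0.0],   # uncertain
                   [0.0, 6.0, 0.0],   # confident
                   [0.3, 0.2, 0.1]])  # uncertain
selected = softmax_response_select(logits, coverage=0.5)
```

Because the scores come from the already-trained classifier, this mechanism adds no parameters and no extra compute beyond a sort of the confidence values.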
(4) We show that a selective classifier trained with the entropy-regularized loss and with selection according to the classification scores achieves new state-of-the-art results by a significant margin (up to 80% relative improvement). (5) Going beyond the already-saturated datasets often used in Selective Classification research, we include results on larger datasets: StanfordCars, Food101, Imagenet, and Imagenet100 to test the methods on a wide range of coverages, and ImagenetSubset to test the scalability of the methods.
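To make the entropy-regularized loss of contribution (4) concrete, a generic entropy-minimization penalty adds the mean predictive entropy, scaled by a weight, to the standard cross-entropy. The sketch below is a plain NumPy illustration of that idea; the weight `beta` and the exact functional form are assumptions for illustration and may differ from the regularizer proposed here.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_regularized_ce(logits, labels, beta):
    """Cross-entropy plus beta times the mean predictive entropy.
    Minimizing the entropy term rewards confident (low-entropy)
    predictions, as in semi-supervised entropy minimization."""
    p = softmax(logits)
    ce = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1).mean()
    return ce + beta * entropy

# A sharply peaked (confident, correct) prediction incurs a lower
# total loss than a near-uniform one on the same label.
sharp = entropy_regularized_ce(np.array([[10.0, 0.0]]), np.array([0]), beta=0.1)
soft = entropy_regularized_ce(np.array([[0.1, 0.0]]), np.array([0]), beta=0.1)
```

The entropy term also sharpens the softmax scores themselves, which is what makes it a natural companion to a score-based selection mechanism.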

2. RELATED WORK

The option to reject a prediction has been explored in depth in various learning algorithms, not limited to neural networks. Chow (1970) first introduced a cost-based rejection model and analysed the error-reject trade-off. Rejection has since been studied extensively for support vector machines (Bartlett & Wegkamp, 2008; Fumera & Roli, 2002; Wegkamp, 2007; Wegkamp & Yuan, 2011), nearest neighbours (Hellman, 1970), and boosting (Cortes et al., 2016). LeCun et al. (1989) proposed a rejection strategy for neural networks based on the most activated output logit, the second most activated output logit, and the difference between the two. Geifman & El-Yaniv (2017) presented a technique to achieve a target risk with a certain probability for a given confidence-rate function; as examples of confidence-rate functions, the authors suggested Softmax Response and MC-Dropout as selection mechanisms for a vanilla classifier. We build on this idea to demonstrate that Softmax Response, if utilized correctly, is the highest-performing selection mechanism in the selective classification setting. Beyond selective classification, max-logit (Softmax Response) has also been used in anomaly detection (Hendrycks & Gimpel, 2016; Dietterich & Guyer, 2022). Subsequent work focused on architectural changes and on selecting according to a separately computed head/logit with its own parameters. Geifman & El-Yaniv (2019) later proposed SelectiveNet (see Section 3.2.1), a three-headed model comprising heads for selection, prediction, and auxiliary prediction. Deep Gamblers (Ziyin et al., 2019) (see Appendix A.1) and Self-Adaptive Training (Huang et al., 2020) (see Section 3.3.1) propose a (C + 1)-way classifier, where C is the number of classes and the additional logit represents abstention.
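The margin-based rule attributed to LeCun et al. (1989) above can be sketched in a few lines: reject whenever the gap between the two most activated output logits is small. The function name and the margin value below are illustrative assumptions, not the original formulation.

```python
import numpy as np

def top_two_margin_reject(logits, margin):
    """Abstain when the gap between the most activated and the second
    most activated output logit falls below `margin` (in the spirit of
    the LeCun et al. (1989) rejection rule)."""
    sorted_logits = np.sort(logits, axis=1)
    gap = sorted_logits[:, -1] - sorted_logits[:, -2]
    return gap < margin  # True means abstain

logits = np.array([[5.0, 1.0, 0.0],   # clear winner: gap = 4.0
                   [2.1, 2.0, 0.0]])  # ambiguous top two: gap = 0.1
reject = top_two_margin_reject(logits, margin=1.0)
```

Like Softmax Response, this rule reads confidence directly off the classifier's outputs rather than from a separately trained head or logit.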
In contrast, in this work, we explain how selecting via entropy and max-logit can act as a proxy for selecting samples that minimise the cross-entropy loss. In general, we report the surprising result that the selection head of SelectiveNet and the abstention logits of Deep Gamblers and Self-Adaptive Training are suboptimal selection mechanisms; their previously reported strong performance is instead rooted in their optimization processes converging to a more generalizable model. Another line of work that tackles selective classification is cost-sensitive classification (Charoenphakdee et al., 2021); however, the introduction of a target coverage adds a new variable and changes the mathematical formulation. Other works have proposed to perform classification in conjunction with expert decision makers (Mozannar & Sontag, 2020). In this work, we also highlight a connection between semi-supervised learning and selective classification, which, to the best of our knowledge, has not been explored before. As a result, we propose an entropy-regularized loss function in the Selective Classification setting to further improve the performance of the Softmax Response selection mechanism. However, entropy minimization

