SELECTIVE CLASSIFIER ENSEMBLE

Abstract

Selective classification allows a machine learning model to abstain from predicting some hard inputs and thus improve the safety of its predictions. In this paper, we study the ensemble of selective classifiers, i.e. the selective classifier ensemble, which combines several weak selective classifiers to obtain a more powerful model. We prove that, under some assumptions, the ensemble has a lower selective risk than the individual model over a range of coverage. The proof is nontrivial since the selective risk is a non-convex function of the model prediction. The assumptions and the theoretical result are supported by systematic experiments on both computer vision and natural language processing tasks. A surprising empirical result is that a simple selective classifier ensemble, namely the ensemble model with the maximum probability as confidence, is the state-of-the-art selective classifier. For instance, on CIFAR-10 with the same VGG-16 backbone model, this ensemble reduces the AURC (Area Under the Risk-Coverage Curve) by about 24% relative to the previous state-of-the-art method.

1. INTRODUCTION

Although recent years have witnessed broad applications of deep learning models, their safety has not been fully guaranteed, which gives rise to the study of selective classification. For any given deep learning classifier, there may be inputs in practical applications that the model is unable to classify, on which it can make unpredictable errors. To prevent this kind of error, we must accurately delimit the deep learning classifier's scope of application. This need motivates the study of selective classification, which learns a selective classifier (f, g), where f is a conventional classifier and g is a selective function that decides whether the selective classifier should abstain from prediction. Since the classifier itself is well studied, research on selective classification focuses on the design of the selective function. A standard approach is to design a confidence score function with a threshold, and several confidence score functions have been developed. A simple confidence score function is the maximum predictive probability of the classifier (Hendrycks & Gimpel, 2017). More advanced methods modify the model architecture (Geifman & El-Yaniv, 2019) or the loss function (Liu et al., 2019; Huang et al., 2020) of the classifier to train the confidence score function and the classifier simultaneously. For example, Deep Gambler (Liu et al., 2019) casts the selective classification problem as gambling and proposes a novel loss function to train the classifier and the confidence score function jointly. Although various individual models for the selective classifier exist, there has been no systematic study of ensemble methods in selective classification.
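As a concrete illustration, the maximum-predictive-probability confidence score (Softmax Response) with a threshold can be sketched as follows. This is a minimal NumPy sketch; the function name and the convention of encoding abstention as -1 are our own choices, not from the paper.

```python
import numpy as np

def softmax_response(logits, tau):
    """Softmax Response selective classifier: predict the argmax class when
    the maximum softmax probability exceeds the threshold tau; otherwise
    abstain (encoded here as -1)."""
    z = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    confidence = probs.max(axis=1)                      # kappa(x): max predictive probability
    predictions = probs.argmax(axis=1)                  # f(x)
    accept = confidence > tau                           # g(x) = 1{kappa(x) > tau}
    return np.where(accept, predictions, -1), confidence
```

Raising tau makes the classifier abstain more often, trading coverage for a lower selective risk.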
It is well known that the ensemble method, which combines individual models to obtain a more powerful model, can improve the predictive performance of machine learning models (see Zhou (2012) for a review). However, only one particular selective classifier ensemble, the ensemble of Softmax Response (Hendrycks & Gimpel, 2017), has been empirically studied by Lakshminarayanan et al. (2017). Ensembles of other kinds of selective classifiers, as well as the theoretical foundation of ensembles in selective classification, have not yet been studied. In this paper, we first establish the theoretical foundation of ensembles of selective classifiers: under some assumptions, the ensemble has a lower selective risk than the individual model over a range of coverage. The proof is nontrivial since the selective risk (with the 0/1 loss) is non-convex. Second, we present experimental results on the ensemble's performance in selective classification. The contributions of this paper are summarized as follows.
• We are the first to theoretically demonstrate that, under several reasonable assumptions, the ensemble has a lower selective risk than the individual model over a range of coverage. We verify this with systematic experiments on image classification and text classification tasks.
• We show the surprising experimental result that two simple methods, the SR ensemble and the Reg-curr ensemble, which can both be summarized as an ensemble model with the maximum probability as confidence, are state-of-the-art selective classifiers.
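The "ensemble model with maximum probability as confidence" idea mentioned above can be sketched as follows: average the member models' softmax outputs and use the maximum of the averaged distribution as the confidence score. This is a hypothetical minimal sketch of the SR-ensemble style of method, not the paper's exact implementation; the abstain convention of returning -1 is our own.

```python
import numpy as np

def sr_ensemble(logits_list, tau):
    """Ensemble selective classifier: average the member models' softmax
    distributions, predict the argmax of the average, and use its maximum
    probability as the confidence score. Inputs whose confidence does not
    exceed tau are rejected (encoded as -1)."""
    member_probs = []
    for logits in logits_list:
        z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        member_probs.append(np.exp(z) / np.exp(z).sum(axis=1, keepdims=True))
    avg = np.mean(member_probs, axis=0)                 # ensemble predictive distribution
    confidence = avg.max(axis=1)                        # max probability as confidence
    predictions = avg.argmax(axis=1)
    return np.where(confidence > tau, predictions, -1), confidence
```

When the members disagree, the averaged distribution is flatter, so its maximum probability drops and the ensemble tends to abstain on exactly those contested inputs.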

2. PROBLEM FORMULATION OF SELECTIVE CLASSIFICATION

A selective classifier is composed of a standard classifier and a selective function. Consider a standard classification problem: $\mathcal{X}$ is a feature space, $\mathcal{Y} = \{1, 2, \ldots, K\}$ is a finite label set, and a classifier $f$ is a function $f : \mathcal{X} \to \mathcal{Y}$. A labeled dataset $D = \{(x_i, y_i)\}_{i=0}^{N} \subseteq \mathcal{X} \times \mathcal{Y}$ is sampled from a distribution $p_{X,Y}$. Our goal is to learn a selective classifier $(f, g)$, where $f$ is a standard classifier and $g : \mathcal{X} \to \{0, 1\}$ is a selective function that estimates the correctness of $f$'s prediction. Given input $x$, the output of the selective classifier $(f, g)$ is

$$(f, g)(x) = \begin{cases} f(x), & \text{if } g(x) = 1 \\ \text{Abstain}, & \text{if } g(x) = 0 \end{cases}. \tag{1}$$

Usually, $g$ is realized by a confidence score $\kappa : \mathcal{X} \to \mathbb{R}^+$ with a threshold $\tau$ (Geifman & El-Yaniv, 2017), namely

$$g(x) = \mathbb{I}\{\kappa(x) > \tau\}, \tag{2}$$

where $\mathbb{I}$ is the indicator function. Coverage and selective risk are the two basic evaluation metrics of selective classifiers, and the goal of a selective classifier is to minimize the selective risk at a target coverage. The coverage of $(f, g)$ is defined as the probability that $(f, g)$ does not abstain from prediction (Geifman & El-Yaniv, 2017), i.e. $\phi(f, g) := E_{p(x)}[g(x)]$, where $p(x)$ is the probability density function of the input $x$. The selective risk (Geifman & El-Yaniv, 2017) of $(f, g)$ is

$$R(f, g) := \frac{E_{p(x)}[\ell(f(x), y)\, g(x)]}{E_{p(x)}[g(x)]},$$

where $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ is a given loss function. Usually, $\ell$ is the 0/1 loss (Geifman & El-Yaniv, 2017; 2019; Liu et al., 2019; Huang et al., 2020). Based on these definitions, the objective of selective classification is formalized as

$$\min R(f, g) \quad \text{s.t.} \quad \phi(f, g) \geq c_{\text{target}},$$

where $c_{\text{target}}$ is a given target coverage. When the selective function $g$ takes the form of (2), the confidence threshold $\tau$ controls the tradeoff between coverage and selective risk. Different values of $\tau$ yield different pairs of coverage and selective risk $(\phi(f, g; \tau), R(f, g; \tau))$, which form the risk-coverage curve (Geifman & El-Yaniv, 2017) of $(f, g)$.
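On a finite dataset, the coverage and selective risk defined above reduce to simple empirical averages. The sketch below assumes the 0/1 loss and the threshold-based selective function; the convention of returning risk 0 at zero coverage is our own choice for the degenerate case.

```python
import numpy as np

def coverage_and_risk(preds, labels, confidence, tau):
    """Empirical coverage and selective risk under the 0/1 loss.
    Coverage is the fraction of inputs with g(x) = 1; selective risk is
    the error rate among the accepted inputs."""
    accept = confidence > tau                 # g(x) = 1{kappa(x) > tau}
    coverage = accept.mean()
    if coverage == 0:
        return 0.0, 0.0                       # no accepted inputs: risk is undefined
    errors = (preds != labels) & accept       # 0/1 loss restricted to accepted inputs
    risk = errors.sum() / accept.sum()
    return coverage, risk
```

Sweeping tau over the observed confidence values and recording each (coverage, risk) pair traces out an empirical risk-coverage curve.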
The risk-coverage curve specifies the entire performance profile of a selective classifier, and a selective classifier with a lower risk-coverage curve is better. To evaluate selective classifiers more concisely, the area under the risk-coverage curve (AURC) was introduced as a metric for selective classifiers (Xin et al., 2021); a selective classifier with a lower AURC is better.
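An empirical AURC can be computed by sorting inputs by confidence and averaging the selective risks at each achievable coverage level. This is a minimal sketch of one common discretization (risk at coverage k/n, averaged over k), assuming the 0/1 loss; the exact estimator used by Xin et al. (2021) may differ in details such as tie handling.

```python
import numpy as np

def aurc(preds, labels, confidence):
    """Empirical AURC: accept inputs in decreasing order of confidence,
    compute the selective risk at each coverage level k/n, and average
    (a Riemann-sum approximation of the area under the curve)."""
    order = np.argsort(-confidence)                  # most confident first
    errors = (preds != labels)[order].astype(float)  # 0/1 loss per input
    cum_errors = np.cumsum(errors)
    n = len(preds)
    risks = cum_errors / np.arange(1, n + 1)         # risk at coverage k/n
    return risks.mean()
```

A perfectly confidence-ranked classifier (all errors pushed to the lowest confidences) attains the minimum possible AURC for its accuracy.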

3. RELATED WORK

Here, we summarize the previous studies on selective classification and ensemble methods. We also discuss the difference between selective classification and out-of-distribution detection.

