DETECTING MISCLASSIFICATION ERRORS IN NEURAL NETWORKS WITH A GAUSSIAN PROCESS MODEL

Abstract

As neural network classifiers are deployed in real-world applications, it is crucial that their predictions are not just accurate, but trustworthy as well. One practical solution is to assign confidence scores to each prediction, then filter out low-confidence predictions. However, existing confidence metrics are not yet sufficiently reliable for this role. This paper presents a new framework that produces more reliable confidence scores for detecting misclassification errors. This framework, RED, calibrates the classifier's inherent confidence indicators and estimates uncertainty of the calibrated confidence scores using Gaussian Processes. Empirical comparisons with other confidence estimation methods on 125 UCI datasets demonstrate that this approach is effective. An experiment on a vision task with a large deep learning architecture further confirms that the method can scale up, and a case study involving out-of-distribution and adversarial samples shows the potential of the proposed method to improve robustness of neural network classifiers more broadly in the future.

1. INTRODUCTION

Classifiers based on Neural Networks (NNs) are widely deployed in many real-world applications (LeCun et al., 2015; Anjos et al., 2015; Alghoul et al., 2018; Shahid et al., 2019). Although good prediction accuracies are achieved, the lack of safety guarantees becomes a severe issue when NNs are applied to safety-critical domains, e.g., healthcare (Selişteanu et al., 2018; Gupta et al., 2007; Shahid et al., 2019), finance (Dixon et al., 2017), self-driving (Janai et al., 2017; Hecker et al., 2018), etc. One way to estimate the trustworthiness of a classifier prediction is to use its inherent confidence-related score, e.g., the maximum class probability (Hendrycks & Gimpel, 2017), entropy of the softmax outputs (Williams & Renals, 1997), or the difference between the highest and second highest activation outputs (Monteith & Martinez, 2010). However, these scores are unreliable and may even be misleading: high-confidence but erroneous predictions are frequently observed (Provost et al., 1998; Guo et al., 2017; Nguyen et al., 2015; Goodfellow et al., 2014; Amodei et al., 2016). In a practical setting, it is beneficial to have a detector that can raise a red flag whenever the predictions are suspicious. A human observer can then evaluate such predictions, making the classification system safer. In order to construct such a detector, quantitative metrics for measuring predictive reliability under different circumstances are first developed, and a warning threshold can then be set based on users' preferred precision-recall tradeoff.
Existing such methods can be categorized into three types based on their focus: error detection, which aims to detect the natural misclassifications made by the classifier (Hendrycks & Gimpel, 2017; Jiang et al., 2018; Corbière et al., 2019); out-of-distribution (OOD) detection, which reports samples that are from different distributions compared to the training data (Liang et al., 2018; Lee et al., 2018a; Devries & Taylor, 2018); and adversarial sample detection, which filters out samples from adversarial attacks (Lee et al., 2018b; Wang et al., 2019; Aigrain & Detyniecki, 2019). Among these categories, error detection, also called misclassification detection (Jiang et al., 2018) or failure prediction (Corbière et al., 2019), is the most challenging (Aigrain & Detyniecki, 2019) and underexplored. For instance, Hendrycks & Gimpel (2017) defined a baseline based on the maximum class probability after the softmax layer. Although the baseline performs reasonably well in most testing cases, reduced efficacy in some scenarios indicates room for improvement (Hendrycks & Gimpel, 2017). Jiang et al. (2018) proposed Trust Score, which measures the similarity between the original classifier and a modified nearest-neighbor classifier. The main limitation of this method is scalability: the Trust Score may provide no or negative improvement over the baseline for high-dimensional data. ConfidNet (Corbière et al., 2019) builds a separate NN model to learn the true class probability, i.e., the softmax probability for the ground-truth class. However, ConfidNet itself is a standard NN, so its confidence scores may be unreliable or misleading: a random input may generate a random confidence score, and ConfidNet does not provide any information regarding the uncertainty of these confidence scores.
Moreover, none of these methods can differentiate natural classifier errors from risks caused by OOD or adversarial samples; if a detector could do that, it would be easier for practitioners to fix the problem, e.g., by retraining the original classifier or applying better preprocessing techniques to filter out OOD or adversarial data. To meet these challenges, a new framework is developed in this paper for error detection in NN classifiers. The main idea is to utilize RIO (Residual, Input, Output; Qiu et al., 2020) , a regression model based on Gaussian Processes, on top of the original NN classifier. Whereas the original RIO is limited to regression problems, the proposed approach extends it to misclassification detection by modifying its components. The new framework not only produces a calibrated confidence score based on the original maximum class probability, but also provides a quantitative uncertainty estimation of that score. Errors can therefore be detected more reliably. Note that the proposed method does not change the prediction accuracy of the original classification model. Instead, it provides a quantitative metric that makes it possible to detect misclassification errors. This framework, referred to as RED (RIO for Error Detection), is compared empirically to existing approaches on 125 UCI datasets and on a large-scale deep learning architecture. The results demonstrate that the approach is effective and robust. A further case study with OOD and adversarial samples shows the potential of using RED to diagnose the sources of mistakes as well, thereby leading to a possible comprehensive approach for improving trustworthiness of neural network classifiers in the future.

2. RELATED WORK

In the past two decades, a large volume of work has been devoted to calibrating the confidence scores returned by classifiers. Early works include Platt Scaling (Platt, 1999; Niculescu-Mizil & Caruana, 2005), histogram binning (Zadrozny & Elkan, 2001), and isotonic regression (Zadrozny & Elkan, 2002), with recent extensions like Temperature Scaling (Guo et al., 2017), Dirichlet calibration (Kull et al., 2019), and distance-based learning from errors (Xing et al., 2020). These methods focus on reducing the difference between the reported class probability and the true accuracy, and generally the rankings of samples are preserved after calibration. As a result, the separability between correct and incorrect predictions is not improved. In contrast, RED aims at deriving a score that can better differentiate incorrect predictions from correct ones. A related direction of work is the development of classifiers with a rejection/abstention option. These approaches either introduce new training pipelines/loss functions (Bartlett & Wegkamp, 2008; Yuan & Wegkamp, 2010; Cortes et al., 2016), or define mechanisms for learning rejection thresholds under certain risk levels (Dubuisson & Masson, 1993; Santos-Pereira & Pires, 2005; Chow, 2006; Geifman & El-Yaniv, 2017). In contrast to these methods, RED assumes an existing pretrained NN classifier, and provides an additional metric for detecting potential errors made by this classifier, without specifying a rejection threshold. Designing metrics for detecting potential risks in NN classifiers has also become popular recently. While most approaches focus on detecting OOD (Liang et al., 2018; Lee et al., 2018a; Devries & Taylor, 2018) or adversarial examples (Lee et al., 2018b; Wang et al., 2019; Aigrain & Detyniecki, 2019), work on detecting natural errors, i.e., regular misclassifications not caused by external sources, is more limited.
Ortega (1995) and Koppel & Engelson (1996) conducted early work in predicting whether a classifier is going to make mistakes, and Seewald & Fürnkranz (2001) built a meta-grading classifier based on similar ideas. However, these early works did not consider NN classifiers. More recently, Hendrycks & Gimpel (2017) and Haldimann et al. (2019) demonstrated that the raw maximum class probability is an effective baseline in error detection, although its performance was reduced in some scenarios. More elaborate techniques for error detection have also been developed recently. Mandelbaum & Weinshall (2017) proposed a confidence score based on the data embedding derived from the penultimate layer of a NN. However, their approach requires modifying the training procedure in order to achieve effective embeddings. Jiang et al. (2018) introduced Trust Score to measure the similarity between a base classifier and a modified nearest-neighbor classifier. Trust Score outperforms the maximum class probability baseline in many cases, but negative improvement over the baseline can be observed in high-dimensional problems, implying poor scalability of local distance computations. ConfidNet (Corbière et al., 2019) learns to predict the class probability of the true class with another NN, while Introspection-Net (Aigrain & Detyniecki, 2019) utilizes the logit activations of the original NN classifier to predict its correctness. Since both models themselves are standard NNs, the confidence scores returned by them may be arbitrarily high without any uncertainty information. Moreover, existing approaches for error detection cannot differentiate natural misclassification errors from OOD or adversarial samples, making it difficult to diagnose the sources of risks. In contrast, RED explicitly reports its uncertainty about the estimated confidence score, providing more reliable error detection.
The uncertainty information returned by RED may also be helpful in clarifying the cause of classifier mistakes, as will be demonstrated in this paper.

3. METHODOLOGY

This section gives the general problem statement, introduces the basic idea of original RIO, on which RED is built, and describes the technical details of RED.

3.1. PROBLEM STATEMENT

Consider a training dataset D = (X, y) = {(x_i, y_i)}_{i=1}^N, and a pretrained NN classifier that, given x_i, outputs a predicted label ŷ_i and class probabilities σ_i = [p_{i,1}, p_{i,2}, ..., p_{i,K}], where N is the total number of training points and K is the total number of classes. The problem is to develop a metric that can serve as a quantitative indicator for detecting natural misclassification errors made by the pretrained NN classifier.

3.2. RIO

The original RIO (Qiu et al., 2020) was developed to quantify point-prediction uncertainty in regression models. More specifically, RIO fits a Gaussian Process (GP) to predict the residuals, i.e., the differences between the ground truth and the original model predictions. It utilizes an I/O kernel, i.e., a composite of an input kernel and an output kernel, thus taking into account both inputs and outputs of the original regression model. As a result, it measures the covariances between data points in both the original feature space and the original model output space. For each new data point, a trained RIO model takes the original input and output of the base regression model, and predicts a distribution of the residual, which can be added back to the original model prediction to obtain both a calibrated prediction and the corresponding predictive uncertainty. In the original RIO work, SVGP (Hensman et al., 2013; 2015) was used as an approximate GP to improve the scalability of the approach. Both empirical results and theoretical analysis showed that RIO is able to consistently improve the prediction accuracy of the base model as well as provide reliable uncertainty estimation. Moreover, RIO can be directly applied on top of any pretrained model without retraining or modification. It therefore forms a promising foundation for improving reliability of error detection metrics as well.
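To make the I/O kernel concrete, the following sketch computes a RIO-style covariance matrix with plain NumPy. The RBF form, unit hyperparameters, and function names here are illustrative assumptions, not the exact RIO configuration (which optimizes per-kernel hyperparameters):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between the rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def io_kernel(X1, Y1, X2, Y2):
    """RIO-style I/O kernel: sum of an input-space and an output-space RBF,
    so two points covary strongly only if they are close in BOTH spaces."""
    return rbf(X1, X2) + rbf(Y1, Y2)

# Toy check: three points with 2-D inputs and scalar model outputs.
# Points 0 and 1 are close in both spaces; point 2 is far away in both.
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
Y = np.array([[0.1], [0.2], [3.0]])
K = io_kernel(X, Y, X, Y)
```

Because the two kernels are summed, the diagonal equals the sum of the two signal variances, and nearby points (in either space) receive higher covariance than distant ones.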

3.3. RIO FOR ERROR DETECTION (RED)

Although RIO performs robustly in a wide variety of regression problems, it cannot be directly applied to classification models. A new framework, namely RED, is proposed to utilize RIO for error detection in classification domains. Building on the fact that the original maximum class probability is a strong baseline for error detection (Hendrycks & Gimpel, 2017; Haldimann et al., 2019), the main idea of RED is to derive a more reliable confidence score by stacking RIO on top of the original maximum class probability. Since RIO was designed for single-output regression problems, it contains an output kernel only for scalar outputs. In RED, this original output kernel is extended to multiple outputs, i.e., to vector outputs such as those of the final softmax layer of a NN classifier, representing estimated class probabilities for each class. This modification allows RIO to access more information from the classifier outputs.

Algorithm 1: RED
Training Phase:
1: obtain target confidence scores c = {c_i = δ_{y_i,ŷ_i}}_{i=1}^N, where δ_{y_i,ŷ_i} is the Kronecker delta (δ_{y_i,ŷ_i} = 1 if y_i = ŷ_i, otherwise δ_{y_i,ŷ_i} = 0)
2: calculate residuals r = {r_i = c_i - ĉ_i}_{i=1}^N
3: for each optimizer step do
4:   calculate the covariance matrix K_c((X, σ), (X, σ)), where each entry is given by k_c((x_i, σ_i), (x_j, σ_j)) = k_in(x_i, x_j) + k_out(σ_i, σ_j), for i, j = 1, 2, ..., N
5:   optimize the GP hyperparameters by maximizing the log marginal likelihood
     log p(r|X, σ) = -(1/2) r^T (K_c((X, σ), (X, σ)) + σ_n² I)^{-1} r - (1/2) log |K_c((X, σ), (X, σ)) + σ_n² I| - (N/2) log 2π
Deployment Phase:
6: calculate the residual mean r̄_* = k_*^T (K_c((X, σ), (X, σ)) + σ_n² I)^{-1} r and the residual variance var(r_*) = k_c((x_*, σ_*), (x_*, σ_*)) - k_*^T (K_c((X, σ), (X, σ)) + σ_n² I)^{-1} k_*, where k_* denotes the vector of kernel-based covariances k_c((x_*, σ_*), (x_i, σ_i)) between the test point (x_*, σ_*) and all the training points
7: return the distribution of the calibrated confidence score c̃_* ∼ N(ĉ_* + r̄_*, var(r_*))
The targets for RIO training need to be redesigned as well. The raw targets of a classification problem are the ground-truth labels; they are in categorical space, while RIO works in continuous space. To solve this issue, RED constructs a different problem: Instead of predicting the labels directly, it predicts whether the original prediction is correct or not. A target confidence score is assigned to each training data point accordingly. The residual between this target confidence score and the originally returned maximum class probability is calculated, and a RIO model is trained to predict these residuals. Given a new data point, the trained RIO model combined with the original NN classifier thus provides a calibrated confidence score for detecting misclassification errors. Figure 1 illustrates the RED training and deployment processes conceptually, and Algorithm 1 specifies them in detail. In the training phase, the first step is to define a target confidence score c_i for each training sample (x_i, y_i). For simplicity, all training samples that are correctly predicted by the original NN classifier receive 1 as the target confidence score, and those that are incorrectly predicted receive 0. The validation dataset used during the original NN training is included in the training dataset for RED. After the target confidence scores are assigned, a regression problem is formulated for the RIO model: Given the original raw features {x_i}_{i=1}^N and the corresponding softmax outputs of the original NN classifier {σ_i = [p_{i,1}, p_{i,2}, ..., p_{i,K}]}_{i=1}^N, predict the residuals r = {r_i = c_i - ĉ_i}_{i=1}^N between the target confidence scores c = {c_i}_{i=1}^N and the original maximum class probabilities ĉ = {ĉ_i = max(σ_i)}_{i=1}^N.
The RIO model relies on an I/O kernel consisting of two components: the input kernel k_in(x_i, x_j), which measures covariances in the raw feature space, and the modified multi-output kernel k_out(σ_i, σ_j), which calculates covariances in the softmax output space. The hyperparameters of the I/O kernel are optimized to maximize the log marginal likelihood log p(r|X, σ). In the deployment phase, given a new data point x_*, the trained RIO model provides a Gaussian distribution for the estimated residual r_* ∼ N(r̄_*, var(r_*)). By adding the estimated residual back to the original maximum class probability ĉ_*, a distribution of the calibrated confidence score is obtained as c̃_* ∼ N(ĉ_* + r̄_*, var(r_*)). The mean ĉ_* + r̄_* can be directly used as a quantitative metric for error detection, and the variance var(r_*) represents the corresponding uncertainty of the confidence score.
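The training and deployment phases above can be sketched end to end with an exact GP in place of SVGP. The fixed RBF hyperparameters and noise level in this toy implementation are simplifying assumptions (RED optimizes them by marginal likelihood):

```python
import numpy as np

def rbf(A, B, ls=1.0, var=1.0):
    """RBF kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return var * np.exp(-0.5 * d2 / ls ** 2)

def red_fit_predict(X, S, y, y_hat, X_star, S_star, noise=1e-2):
    """Exact-GP sketch of Algorithm 1: fit the residuals between the 0/1
    target confidence and the max class probability, then return the mean
    and variance of the calibrated confidence score for the test points."""
    c = (y == y_hat).astype(float)            # step 1: Kronecker-delta targets
    r = c - S.max(axis=1)                     # step 2: residuals
    K = rbf(X, X) + rbf(S, S)                 # step 4: I/O kernel matrix
    Kinv = np.linalg.inv(K + noise * np.eye(len(X)))
    k_star = rbf(X_star, X) + rbf(S_star, S)  # step 6: test covariances
    r_mean = k_star @ Kinv @ r
    prior = np.diag(rbf(X_star, X_star) + rbf(S_star, S_star))
    r_var = prior - np.einsum('ij,jk,ik->i', k_star, Kinv, k_star)
    return S_star.max(axis=1) + r_mean, r_var  # step 7: calibrated score

# Toy example: three training points, the second one misclassified.
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])
S = np.array([[0.9, 0.1], [0.6, 0.4], [0.8, 0.2]])
y, y_hat = np.array([0, 1, 0]), np.array([0, 0, 0])
mean, var = red_fit_predict(X, S, y, y_hat, X, S)
```

On the training points themselves the calibrated mean moves close to the 0/1 targets and the variance shrinks toward the noise level, while points far from the training data in both kernel spaces revert to the original maximum class probability with large variance.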

4. EMPIRICAL EVALUATION

In this section, the error detection performance of RED is evaluated comprehensively on 125 UCI datasets, comparing it to other related methods. Its generality is then evaluated by applying it to two other base models, and its scale-up properties are measured in a larger deep learning architecture and in a larger task. Finally, RED's potential to improve robustness more broadly is demonstrated in a case study focusing on OOD and adversarial samples.

4.1. COMPARISONS WITH RELATED APPROACHES

As a comprehensive evaluation of RED, an empirical comparison with seven related approaches on 125 UCI datasets (Dua & Graff, 2017) was performed. These approaches include the maximum class probability baseline (MCP; Hendrycks & Gimpel, 2017), three state-of-the-art approaches, namely Trust Score (Jiang et al., 2018), ConfidNet (Corbière et al., 2019), and Introspection-Net (Aigrain & Detyniecki, 2019), as well as three earlier approaches, i.e., entropy of the original softmax outputs (Steinhardt & Liang, 2016), DNGO (Snoek et al., 2015), and the original SVGP (Hensman et al., 2013; 2015). The 125 UCI datasets include 121 datasets used by Klambauer et al. (2017) and four more recent ones. Full details about the datasets and the parametric setup of all tested algorithms, as well as a download link for the source code, are provided in Appendix A.1. Following the experimental setup of Hendrycks & Gimpel (2017), Corbière et al. (2019), and Aigrain & Detyniecki (2019), the task for each algorithm is to provide a confidence score for each testing point. An error detector can then use a predefined fixed threshold on this score to decide which points are probably misclassified by the original NN classifier. For RED, the mean of the calibrated confidence score ĉ_* + r̄_* was used as the reported confidence score. Five threshold-independent performance metrics were used to compare the methods: AUPR-Error, which computes the area under the Precision-Recall (AUPR) curve when treating incorrect predictions as the positive class during detection; AUPR-Success, which is similar to AUPR-Error but uses correct predictions as the positive class; AUROC, which computes the area under the receiver operating characteristic (ROC) curve for the error detection task; AP-Error, which computes the average precision (AP) over different thresholds treating incorrect predictions as the positive class; and AP-Success, which is similar to AP-Error but uses correct predictions as the positive class.
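With misclassifications as the positive class, these metrics can be computed from a detector's confidence scores using scikit-learn. The helper below is an illustrative sketch, not the paper's evaluation code; the confidence is negated so that low-confidence samples rank as likely errors:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def error_detection_metrics(correct, confidence):
    """Error-detection metrics from a confidence score.

    correct    : 1 if the base classifier was right on the sample, else 0
    confidence : detector's confidence score (higher = more trusted)
    """
    errors = 1 - np.asarray(correct)          # errors are the positive class
    scores = -np.asarray(confidence)          # low confidence => likely error
    return {
        "AP-Error": average_precision_score(errors, scores),
        "AP-Success": average_precision_score(correct, confidence),
        "AUROC": roc_auc_score(correct, confidence),
    }

# Perfectly separable toy case: both errors get the lowest confidence.
correct = np.array([1, 1, 0, 0])
confidence = np.array([0.9, 0.8, 0.2, 0.1])
m = error_detection_metrics(correct, confidence)
```

Note that AUROC is invariant to which class is treated as positive, so it can be computed on the original (correct, confidence) pair directly.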
AUPR and AUROC are commonly used in prior work (Hendrycks & Gimpel, 2017; Corbière et al., 2019; Aigrain & Detyniecki, 2019), but as discussed by Davis & Goadrich (2006) and Flach & Kull (2015), AUPR may provide an overly optimistic measurement of performance. To compensate for this issue, AP-Error and AP-Success are included as additional metrics. Since the target for the confidence metrics is to detect misclassification errors, the following discussion will focus more on AP-Error and AUPR-Error. Ten independent runs were conducted for each dataset. During each run, the dataset was randomly split into a training dataset and a testing dataset, and a standard NN classifier was trained and evaluated on them. The same dataset split and trained NN classifier were used to evaluate all methods. Full details of the experimental setup are provided in Appendix A.1. Table 1 shows the ranks of each algorithm averaged over all 125 UCI datasets. The rank of each algorithm on each dataset is based on the average performance over 10 independent runs. RED performs best on all metrics; the performance differences between RED and all other methods are statistically significant under the paired t-test and the Wilcoxon test. Trust Score has the highest standard deviation, suggesting that its performance varies significantly across different datasets. As a more detailed comparison, Table 2 shows how often RED performs statistically significantly better, how often the performance is not significantly different, and how often it performs significantly worse than the other methods. RED is most often significantly better, and very rarely worse. In a handful of datasets Trust Score is better, but most often it is not.
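The per-dataset significance testing can be sketched with SciPy as follows; the scores below are made-up placeholders standing in for per-dataset AP-Error values of RED and a competing method:

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical per-dataset AP-Error scores (one entry per dataset).
red   = np.array([0.61, 0.55, 0.72, 0.48, 0.66, 0.59, 0.70, 0.52])
other = np.array([0.58, 0.50, 0.69, 0.47, 0.60, 0.55, 0.66, 0.50])

t_stat, t_p = ttest_rel(red, other)    # paired t-test on matched datasets
w_stat, w_p = wilcoxon(red, other)     # non-parametric Wilcoxon signed-rank
significant = (t_p < 0.05) and (w_p < 0.05)
```

Using both tests, as in the paper, guards against the t-test's normality assumption: a difference is only reported as significant when the parametric and non-parametric tests agree.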
To illustrate the performance of RED further compared to the baseline and the three state-of-the-art approaches, Figure 2 shows the distribution of the relative rank of RED, the MCP baseline, Trust Score, ConfidNet, and Introspection-Net as a function of the number of samples and the number of features in the dataset. These plots are based on the AP-Error metric; other metrics provide similar results. RED performs consistently well over different dataset sizes and feature dimensionalities. Trust Score performs best in several datasets, but occasionally also worst in both small and large datasets, making it a rather unreliable choice. ConfidNet generally exhibits worse performance on datasets with large sample sizes and high feature dimensionalities, i.e., it does not scale well to larger problems.

4.2. GENERALITY WRT. BASE MODELS

To evaluate the generality and robustness of RED, it was applied to two other base models: an NN classifier using the MC-dropout technique (Gal & Ghahramani, 2016) and a Bayesian Neural Network (BNN) classifier (Wen et al., 2018). They were each trained as base classifiers, and RED was then applied to each of them (implementation details are provided in Appendix A.1). Experiments analogous to those in Table 2 were performed on the 125 UCI datasets in both cases. Table 3 shows the pairwise comparisons between RED and the internal confidence scores returned by the base models. MCP and Entropy represent the maximum class probability and the entropy of the softmax outputs, respectively, after averaging over 100 test-time samplings. RED significantly improves upon the internal confidence scores of the MC-dropout and BNN classifiers on most datasets, demonstrating that it is a general technique that can be applied to a variety of models.

4.3. SCALING UP TO LARGER ARCHITECTURES

To confirm that the RED approach scales up to large deep learning architectures, a VGG16 model (Simonyan & Zisserman, 2015) was trained on the CIFAR-10 dataset using a state-of-the-art training pipeline (see Appendix A.2 for details). In order to remove the influence of feature extraction in image preprocessing and to make the comparison fair, all approaches used the same logit outputs of the trained VGG16 model as their input features. Ten independent runs were performed; during each run, a VGG16 model was trained, and all the methods were evaluated based on it. Table 4 shows the results on the two main error detection performance metrics (note that the table lists absolute values instead of rankings along each metric). Trust Score performs much better than in the previous literature (Corbière et al., 2019). This difference may be due to the fact that logit outputs are used as input features here, whereas Corbière et al. (2019) utilized a higher-dimensional feature space for Trust Score. Based on the results, RED significantly outperforms all the counterparts in both metrics. This result demonstrates the advantages of RED in scaling up to larger architectures.

[Figure 3 caption: The horizontal axis denotes the variance of the RED-calibrated confidence score, and the vertical axis denotes the mean. If an in-distribution sample is correctly classified by the original NN classifier, it is marked as "correct", otherwise "incorrect". The mean is a good separator of correct and incorrect classifications; high variance indicates that RED is uncertain about its confidence score, which can be used to identify OOD and adversarial samples. In this manner, RED can serve as a foundation for improving robustness of classifiers more broadly in the future.]

4.4. A CASE STUDY WITH OOD AND ADVERSARIAL SAMPLES

In all experiments so far, the mean of the calibrated confidence score ĉ_* + r̄_* was used as RED's confidence score. Although good performance is observed in error detection by only using the mean, the variance of the calibrated confidence score var(r_*) may be helpful if the scenario is more complex, e.g., when the dataset includes some OOD data, or even adversarial data. A preliminary investigation of RED in such scenarios was performed by manually adding OOD and adversarial data into the test set of the UCI "annealing" dataset, on which RED detected errors well. The synthetic OOD and adversarial samples were created to be highly deceptive, aiming to evaluate the performance of RED under difficult circumstances. The OOD data were sampled from a Gaussian distribution with mean 0 and variance 1, and the number of added OOD data points was the same as the number of samples in the original test set. Note that all data points from the original dataset are normalized to have mean 0 and variance 1 for each feature dimension during preprocessing, so the OOD data and in-distribution data have similar scales. The adversarial data were created by adding negligible modifications to training samples that the original NN classifier predicted incorrectly with highest confidence. This process mimics an adversary that can arbitrarily alter the output of the NN classifier with minuscule changes to the input (Goodfellow et al., 2014). Figure 3 shows the distribution of the mean and variance of calibrated confidence scores for testing samples, including correctly and incorrectly labeled actual samples, as well as the synthetic OOD and adversarial samples. The mean is a good separator for correctly classified and incorrectly classified samples, which tend to cluster in the top and bottom halves of the figure, respectively. On the other hand, variance is a promising indicator of OOD and adversarial samples.
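The construction of the synthetic OOD and adversarial test samples described above can be sketched as follows. The helper names, the perturbation scale eps, and the use of Gaussian perturbations are this sketch's assumptions; the paper only specifies "negligible modifications":

```python
import numpy as np

rng = np.random.default_rng(0)

def make_ood(X_test):
    """OOD samples: standard-normal noise, which matches the scale of the
    standardized in-distribution features; one OOD point per test point."""
    return rng.standard_normal(X_test.shape)

def make_adversarial(X_train, wrong_conf, n, eps=1e-3):
    """Adversarial stand-ins: tiny perturbations of the training samples the
    base classifier got wrong with the highest confidence."""
    idx = np.argsort(wrong_conf)[-n:]          # most confidently wrong samples
    return X_train[idx] + eps * rng.standard_normal((n, X_train.shape[1]))

# Toy usage with random stand-ins for the training data and confidences.
X_train = rng.standard_normal((20, 4))
wrong_conf = rng.uniform(size=20)
ood = make_ood(np.zeros((10, 4)))
adv = make_adversarial(X_train, wrong_conf, n=3)
```

Because each adversarial point sits almost exactly on a training sample in input space but carries a contradictory classifier output, it is close in the input kernel yet anomalous in the output kernel, which is what drives up RED's variance.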
RED's confidence scores of in-distribution samples have low variance because they covary with the training samples. The variance thus represents RED's confidence in its confidence score. Samples with large variance indicate that RED is uncertain about its confidence score, which can be used as a basis for detecting OOD and adversarial samples. Thus, although the main focus of this paper is to demonstrate RED on misclassification detection, the preliminary results in this subsection show that it provides a promising foundation for detecting other error types as well.

5. DISCUSSION AND FUTURE WORK

Another interesting observation is that the variance is also helpful in detecting OOD and adversarial samples. This result follows from the design of the RIO uncertainty model. Since RIO in RED has an input kernel and an output kernel, lower estimated variance requires that the predicted sample is close to training samples in both the input feature space and the classifier output space. This requirement is difficult for OOD and adversarial attacks to achieve, providing a basis for detecting them. In a real-world deployment, it is necessary to define a threshold for triggering an error warning based on RED's confidence scores. A practical way is to use a validation dataset to determine how the precision/recall tradeoff changes over different thresholds. The user can then select a threshold based on their preference. The most compelling direction of future work is to extend this capability of RED further. Instead of using a one-dimensional confidence score for error detection, it is possible to use the mean and variance simultaneously, leading to a two-dimensional detection space. Further separation between OOD and adversarial samples may be possible by adding one more dimension, such as the ratio between the input kernel output and the output kernel output. Also, instead of using a hard target confidence score (i.e., either 0 or 1), it may be possible to define a soft target confidence score that is more informative. Further, RED may be used to calibrate other existing confidence metrics, such as the Trust Score, which may lead to a further improvement in detection performance.
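The validation-based threshold selection described above can be sketched as follows. The helper name and the minimum-precision criterion are illustrative assumptions about how a user might encode their precision/recall preference:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def choose_threshold(correct_val, conf_val, min_precision=0.9):
    """Pick a warning threshold from a validation set: a prediction is flagged
    as suspicious when its confidence falls below the returned threshold.
    Among thresholds whose error-detection precision meets min_precision,
    the one with the highest recall is chosen; returns None if none qualify."""
    errors = 1 - np.asarray(correct_val)      # errors are the positive class
    prec, rec, thr = precision_recall_curve(errors, -np.asarray(conf_val))
    ok = prec[:-1] >= min_precision           # thresholds with enough precision
    if not ok.any():
        return None
    best = np.argmax(rec[:-1] * ok)           # maximize recall among those
    return -thr[best]                         # back to the confidence scale

# Toy validation set: the two errors received the lowest confidence scores.
correct = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
conf = np.array([0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.2, 0.1])
t = choose_threshold(correct, conf, min_precision=0.9)
```

Here flagging every prediction with confidence at or below the returned threshold catches both validation errors without flagging any correct prediction.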

6. CONCLUSION

This paper introduced a new framework, RED, for error detection in neural network classifiers that can produce a more reliable confidence score than previous methods. RED is able to not only provide a calibrated confidence score, but also report the uncertainty of the estimated confidence score. Experimental results show that RED's scores consistently outperform state-of-the-art methods in separating the misclassified samples from correctly classified samples. Preliminary experiments also demonstrate that the approach scales up to large deep learning architectures, and can form a basis for detecting OOD and adversarial samples as well. It is therefore a promising foundation for improving robustness of neural network classifiers.

A APPENDIX

A.1 EXPERIMENTAL SETUP FOR SECTION 4.1 AND SECTION 4.2

General Information: 10 independent runs are performed for each dataset. During each run, the dataset is randomly split into a training set (80%) and a testing set (20%), then a fully connected feedforward NN classifier with 2 hidden layers, each with 64 hidden neurons, is trained on the training set. The activation function is ReLU for all the hidden layers. The maximum number of epochs for training is 1000. 20% of the training set is used as a validation set, and the split is random at each independent run. An early stop is triggered if the loss on the validation set has not improved for 10 epochs. The optimizer is Adam with learning rate 0.001, β1 = 0.9, and β2 = 0.999. The loss function is cross-entropy loss. During each independent run, the same random dataset split and trained base NN classifier are used for evaluating all algorithms. All source code for reproducing the experimental results can be downloaded from https://drive.google.com/drive/folders/1X5R6sEkjmucR7B4MvF-rY6QnCvos6a7n?usp=sharing.

Dataset Description: In total, 125 UCI datasets are used in the experiments, among which 121 are from Klambauer et al. (2017), and 4 are recent datasets released in Dua & Graff (2017). All features in all datasets are normalized to have mean 0 and standard deviation 1. Full details regarding the number of samples N, number of features M, and number of classes K for each dataset are shown in Table 5.
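The base classifier configuration above can be reproduced in a few lines. Using scikit-learn's MLPClassifier here is this sketch's assumption (the paper's own implementation is separate), but the hyperparameters mirror the stated setup:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def make_base_classifier(seed):
    """Base NN classifier as described: 2x64 ReLU layers, Adam with
    lr=0.001, beta1=0.9, beta2=0.999, cross-entropy (log) loss, at most
    1000 epochs, and early stopping after 10 epochs without improvement
    on a 20% validation split."""
    return MLPClassifier(hidden_layer_sizes=(64, 64), activation="relu",
                         solver="adam", learning_rate_init=0.001,
                         beta_1=0.9, beta_2=0.999, max_iter=1000,
                         early_stopping=True, validation_fraction=0.2,
                         n_iter_no_change=10, random_state=seed)

# One independent run on synthetic stand-in data: 80/20 train/test split.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = make_base_classifier(seed=0).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)     # softmax outputs consumed by RED
```

The same fitted classifier and split would then be passed unchanged to every confidence-estimation method being compared.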

Parametric Setup for Algorithms

• RED: SVGP (Hensman et al., 2013; 2015) is used as an approximation to the original GP. The number of inducing points is 50. An RBF kernel is used for both the input and the multi-output kernel. The Automatic Relevance Determination (ARD) feature is turned on. The signal variances and length scales of all the kernels plus the noise variance are the trainable hyperparameters. The optimizer is L-BFGS-B with default parameters as in the Scipy.optimize documentation (https://docs.scipy.org/doc/scipy/reference/optimize.minimize-lbfgsb.html), and the maximum number of iterations is set to 1000. The optimization process runs until the L-BFGS-B optimizer decides to stop. To overcome the sensitivity of GP optimization to the initialization of the hyperparameters (Ulapane et al., 2020), 20 random initializations of the hyperparameters are tried for each independent run. For each random initialization, the signal variances are generated from a uniform distribution within the interval [0, 1], and the lengthscales are generated from a uniform distribution within the interval [0, 10]. For 10 of the initializations, the hyperparameters of the input kernel are first optimized while the multi-output kernel is temporarily turned off; then, after the optimizer stops, the multi-output kernel is turned on, and both kernels are optimized simultaneously. For the other 10 initializations, both kernels are optimized simultaneously from the start. The average performance of the 3 best optimized models in terms of the corresponding metric is used as the final performance of RED on each independent run.
During our preliminary investigation, several statistics computed on the training set were effective in picking the true best-performing model out of these 20 trials, e.g., the gap between the average estimated confidence scores of correctly and incorrectly classified training samples, the scale of the optimized noise variance of the SVGP model, and the ratio between the sum of signal variances and the noise variance after optimization. Since improving the initialization and optimization of GP hyperparameters is out of the scope of this work, we simply use the average performance of the best 3 models (top 15%) in the comparison.

• MCP baseline: The maximum class probability of the softmax outputs of the base NN classifier is used as the confidence score of the MCP baseline. The setup of the base NN classifier is provided above.

• Trust Score: k = 10, α = 0, without filtering. This is the same as the default setup in https://github.com/google/TrustScore.

• ConfidNet: During training, the input to ConfidNet is the raw feature, and the target is the class probability of the ground-truth class returned by the base NN classifier. The architecture of ConfidNet is a fully connected feedforward NN regressor with 2 hidden layers, each with 64 hidden neurons. The activation function is ReLU for all hidden layers. The maximum number of training epochs is 1000. Early stopping is triggered if the validation loss has not improved for 10 epochs. The optimizer is RMSprop with learning rate 0.001, and the loss function is mean squared error (MSE).

• Introspection-Net: During training, the input to Introspection-Net is the logit outputs of the base NN classifier, and the target is 1 for a correctly classified sample and 0 for an incorrectly classified sample. The architecture of Introspection-Net is a fully connected feedforward NN regressor with 2 hidden layers, each with 64 hidden neurons. The activation function is ReLU for all hidden layers. The maximum number of training epochs is 1000. Early stopping is triggered if the validation loss has not improved for 10 epochs. The optimizer is RMSprop with learning rate 0.001, and the loss function is mean squared error (MSE).

• Entropy: The entropy of the softmax outputs of the base NN classifier is used as the confidence score of Entropy. The setup of the base NN classifier is provided above.

• DNGO: A Bayesian linear regression layer similar to that of Snoek et al. (2015) is added after the logits layer of the original NN classifier to predict whether an original prediction is correct or not (1 for correct and 0 for incorrect). The default parametric setup, as in https://github.com/automl/pybnn/blob/master/pybnn/dngo.py, is used.

• SVGP: The original SVGP without the output kernel is used to predict directly whether a prediction made by the base NN classifier is correct or not (1 for correct and 0 for incorrect). All other parameters are identical to those in RED.

• BNN MCP: The standard dense layers in the base NN classifier described in the RED setup above are replaced with Flipout layers (Wen et al., 2018). All other parameters are identical to those in RED. The maximum class probability, averaged over 100 test-time samplings, is used as the confidence score for error detection.

• BNN Entropy: The same setup as BNN MCP, except that the entropy of the softmax outputs, averaged over 100 test-time samplings, is used as the confidence score for error detection.

• MC-Dropout MCP: A dropout layer with dropout rate 0.5 is added after each dense layer of the base NN classifier described in the RED setup. All other parameters are identical to those in RED. The maximum class probability, averaged over 100 test-time Monte-Carlo samplings, is used as the confidence score for error detection.

• MC-Dropout Entropy: The same setup as MC-Dropout MCP, except that the entropy of the softmax outputs, averaged over 100 test-time Monte-Carlo samplings, is used as the confidence score for error detection.
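The two classifier-inherent baseline scores above (MCP and Entropy) can be computed directly from the classifier's logits; a minimal sketch with hypothetical logit values:

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical logit outputs of the base NN classifier for 3 test samples.
logits = np.array([[4.0, 0.5, 0.1],
                   [1.2, 1.1, 1.0],
                   [0.2, 3.0, 0.1]])
probs = softmax(logits)

# MCP baseline: maximum class probability; higher means more confident.
mcp = probs.max(axis=1)

# Entropy baseline: higher entropy means less confident, so its negation
# can serve as a confidence score when ranking predictions.
entropy = -(probs * np.log(probs)).sum(axis=1)
```

On this toy input, the nearly uniform second row yields a lower MCP and higher entropy than the sharply peaked first row.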
A.2 EXPERIMENTAL SETUP FOR SECTION 4.3

Setup of VGG16 Training

The standard VGG16 architecture (Simonyan & Zisserman, 2015) is used. The training pipeline is based on the default setup described in https://github.com/geifmany/cifar-vgg/blob/master/cifar10vgg.py. For the CIFAR-10 dataset, 40,000 samples are used as the training set, 10,000 as the validation set, and 10,000 as the testing set.

Parametric Setup for Algorithms

All algorithms use the logit outputs of the trained VGG16 model as input features. The maximum class probability of the softmax outputs of the trained VGG16 model is used as the confidence score of the MCP baseline. The parameters for RED, Trust Score, Entropy, DNGO, and SVGP are identical to those in the UCI experiments. For ConfidNet and Introspection-Net, all parameters are the same as in the UCI experiments, except that the number of hidden neurons in all hidden layers is increased to 128.

A.3 DETAILED RESULTS FOR SECTION 4.1 AND SECTION 4.2

This subsection presents the detailed results for the experiments performed in Section 4.1 and Section 4.2. The results are averaged over 10 independent runs in terms of AP-Error, AP-Success, AUPR-Error, AUPR-Success, and AUROC. Detailed results for Section 4.1 are shown in Table 5, Table 6, Table 7, Table 8, and Table 9. Detailed results for Section 4.2 are shown in Table 10, Table 11, Table 12, Table 13, and Table 14. Each table corresponds to one performance metric. The columns "N", "M", and "K" denote the number of samples, number of features, and number of classes for the corresponding datasets. To save space, "MCP", "Intro-Net", and "MC-D" stand for "MCP Baseline", "Introspection-Net", and "MC-Dropout", respectively. For datasets on which the original NN classifier achieves 100% accuracy, the entries are marked as "NA". For dataset splits in which the number of samples in a particular class is too small for Trust Score to calculate the neighborhood distance, the entries are also marked as "NA".
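The evaluation metrics can be computed with scikit-learn. A sketch with hypothetical confidence scores and correctness labels follows; note that for the error-oriented metrics, misclassified samples are the positive class, so the confidence score is negated (low confidence should flag an error):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical confidence scores and correctness labels (1 = correct).
conf = np.array([0.95, 0.90, 0.40, 0.85, 0.30, 0.60])
correct = np.array([1, 1, 0, 1, 0, 1])

# AUROC: how well the score ranks correct predictions above errors.
auroc = roc_auc_score(correct, conf)

# AUPR-Success: correct predictions are the positive class.
aupr_success = average_precision_score(correct, conf)

# AUPR-Error: errors are the positive class, so negate the score.
aupr_error = average_precision_score(1 - correct, -conf)
```

In this toy example every correct prediction receives a higher score than every error, so all three metrics equal 1.0.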



Figure 1: The RED Training and Deployment Processes. The solid pathways are active in both the training and deployment phases, while the dashed pathways are active only in the training phase. During the training phase, a target confidence score c is assigned to each training sample according to whether or not it is correctly predicted by the original NN classifier. A RIO model is then trained to predict the residual between the target confidence score c and the original maximum class probability ĉ. The I/O kernel in RIO utilizes both the raw feature x and the softmax outputs σ to predict the residuals. In the deployment phase, given a new data point, the trained RIO model provides a Gaussian distribution over the estimated residual r, defined by its mean r̄ and variance var(r). The sum of r̄ and ĉ forms a calibrated confidence score for error detection, and var(r) indicates the corresponding uncertainty.
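The deployment-phase computation in the caption above reduces to simple arithmetic once the RIO model has produced its residual distribution; the sketch below uses hypothetical residual means, variances, and decision thresholds purely for illustration:

```python
import numpy as np

# Hypothetical values for three test samples.
c_hat = np.array([0.98, 0.95, 0.70])     # original maximum class probability
r_mean = np.array([-0.02, -0.60, 0.10])  # RIO-predicted residual mean
r_var = np.array([0.01, 0.02, 0.50])     # RIO-predicted residual variance

calibrated = c_hat + r_mean  # calibrated confidence score for error detection
uncertainty = r_var          # high variance flags suspicious (e.g., OOD) inputs

# Example decision rules with hypothetical thresholds:
flag_error = calibrated < 0.5
flag_suspicious = uncertainty > 0.2
```

Here the second sample is flagged as a likely misclassification despite its high raw softmax confidence, and the third is flagged as suspicious because RED is uncertain about its own confidence estimate.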

Figure 2: Performance Ranks Across Dataset Sizes and Dimensionalities on UCI Datasets. Each plot represents the distribution of relative ranks for one method (each column) as a function of the dataset size (top row) and the feature dimensionality (bottom row). Each dot in each plot represents the relative rank in one dataset. RED performs consistently well over datasets of different sizes and dimensionalities. Trust Score performs inconsistently, and ConfidNet performs poorly on larger datasets.

Figure 3: Identifying OOD and Adversarial Samples Based on Mean and Variance of Confidence Scores. Each dot represents one sample in the testing set of the UCI Annealing task. The horizontal axis denotes the variance of the RED-calibrated confidence score, and the vertical axis denotes its mean. If an in-distribution sample is correctly classified by the original NN classifier, it is marked as "correct"; otherwise it is marked "incorrect". Mean is a good separator of correct and incorrect classifications. High variance, on the other hand, indicates that RED is uncertain about its confidence score, which can be used to identify OOD and adversarial samples. In this manner, RED can serve as a foundation for improving the robustness of classifiers more broadly in the future.

Mean Rank on UCI Datasets. The symbol * indicates that the differences between the marked entry and all other counterparts are statistically significant at the 5% significance level for both the paired t-test and the Wilcoxon test. The best entries that are significantly better than all others under both statistical tests are marked in boldface.


A Pairwise Comparison between RED and Other Methods on UCI Datasets. The columns labeled + show the number of datasets on which RED performs significantly better at the 5% significance level in a paired t-test, Wilcoxon test, or both; those labeled − represent the contrary case; those labeled = represent no statistical significance.

A Comparison based on the VGG16 Network Architecture on the CIFAR-10 Task



Comparison between RED and Counterparts Using AP-Error

Comparison between RED and Counterparts Using AP-Success

Comparison between RED and Counterparts Using AUROC

Comparison between RED and Counterparts Using AP-Error

Comparison between RED and Counterparts Using AP-Success

Comparison between RED and Counterparts Using AUPR-Error

Comparison between RED and Counterparts Using AUPR-Success

Comparison between RED and Counterparts Using AUROC

