TRAINING NEURAL NETWORKS TO OPERATE AT HIGH ACCURACY AND LOW MANUAL EFFORT

Abstract

In human-AI collaboration systems for critical applications based on neural networks, humans should set an operating point based on a model's confidence to determine when a decision should be delegated to experts. The underlying assumption is that the network's confident predictions are also correct. However, modern neural networks are notoriously overconfident in their predictions, and thus achieve lower accuracy even when operated at high confidence. Network calibration methods mitigate this problem by encouraging models to make predictions whose confidence is consistent with their accuracy, i.e., by encouraging confidence to reflect the number of mistakes the network is expected to make. However, they do not consider that, in critical applications, data must be manually analysed by experts if the confidence of the network is below a certain level. This can be crucial for applications where available expert time is limited and expensive, e.g., medical ones. The trade-off between the accuracy of the network and the number of samples delegated to experts at every confidence threshold can be represented by a curve. In this paper we propose a new loss function for classification that takes both aspects into account by optimizing the area under this curve. We perform extensive experiments on multiple computer vision and medical image classification datasets and compare the proposed approach with existing network calibration methods. Our results demonstrate that our method improves classification accuracy while delegating fewer decisions to human experts, achieves better out-of-distribution sample detection, and provides on-par calibration performance compared to existing methods.

1. INTRODUCTION

Artificial intelligence (AI) systems based on deep neural networks have achieved state-of-the-art results by reaching or even outperforming human-level performance in many predictive tasks Esteva et al. (2017); Rajpurkar et al. (2018); Chen et al. (2017); Szegedy et al. (2016). Despite the great potential of neural networks for automating various tasks, there are pitfalls when they are used in a fully automated setting, which makes them difficult to deploy in safety-critical applications such as healthcare Kelly et al. (2019); Quinonero-Candela et al. (2008); Sangalli et al. (2021). Human-AI collaboration aims at tackling these issues by keeping humans in the loop and building systems that take advantage of both humans and AI while minimizing their shortcomings Patel et al. (2019). A simple way of building collaboration between a network and a human expert is to delegate decisions to the expert when the network's confidence score is lower than a predetermined threshold, which we refer to as the "operating point". For example, in healthcare, a neural network trained to predict whether a lesion is benign or malignant should leave the decision to the human doctor if it is not very confident Jiang et al. (2012). In such cases the domain knowledge of the doctor can be exploited to assess more ambiguous cases, where for example education or previous experience can play a crucial role in the evaluation. Another example of human-AI collaboration is hate speech detection for social media (Conneau & Lample, 2019), where neural networks greatly reduce the load of manual content analysis required of humans. In industrial systems, curves are employed (Gorski et al., 2001) that assess a predictive model in terms of accuracy and the number of samples that require manual assessment from a human expert, for varying operating points that loosely relate to varying confidence levels of the algorithm's predictions.
We will refer to this performance curve as Confidence Operating Characteristics (COC), as it is reminiscent of the classic Receiver Operating Characteristic (ROC) curve, where an analogous balance is sought between the Sensitivity and Specificity of a predictive model. The COC curve can be used by domain experts, such as doctors, to identify the most suitable operating point that balances performance and the amount of data to re-examine for the specific task. The underlying assumption in these applications is that the confidence level of networks indicates when predictions are likely to be correct or incorrect. However, modern deep neural networks that achieve state-of-the-art results are known to be overconfident even in their wrong predictions. This leads to networks that are not well-calibrated, i.e., the confidence scores do not properly indicate the likelihood of the correctness of the predictions Guo et al. (2017). Thus, neural networks suffer from lower accuracy than expected when operated at high confidence thresholds. Network calibration methods mitigate this problem by calibrating the output confidences of the model Guo et al. (2017); Kumar et al. (2018); Karandikar et al. (2021); Gupta et al. (2021). However, they do not consider that data may need to be manually analyzed by experts in critical applications if the confidence of the network is below a certain level. This can be crucial for various applications where expert time is limited and expensive. For example, in medical imaging, the interpretation of more complex data requires clinical expertise and the number of available experts is extremely limited, especially in low-income countries Kelly et al. (2019). This motivates us to take the expert load into account along with accuracy when assessing the performance of human-AI collaboration systems and training neural networks.
In this paper, we make the following contributions:
• We propose a new loss function for multi-class classification that takes both aspects into account by maximizing the area under the COC curve (AUCOC).
• We perform experiments on two computer vision datasets and one medical image dataset for multi-class classification. We compare the proposed AUCOC loss with conventional loss functions for training neural networks as well as with network calibration methods. The results demonstrate that our method outperforms the others in terms of both accuracy and AUCOC.
• We evaluate the network calibration and out-of-distribution (OOD) sample detection performance of all methods. The results show that the proposed approach consistently achieves better OOD sample detection and on-par network calibration performance.

1.1. RELATED WORK

In industrial applications, curves that plot network accuracy on accepted samples against the manual workload of a human expert are used for performance analysis of the system (Gorski et al., 2001). To the best of our knowledge, this is the first work that explicitly takes into account during the optimization process the trade-off between neural network performance and the amount of data to be analysed by a human expert in a human-AI collaborative system. Therefore, there is no direct literature that we can compare with. We found the literature on network calibration methods closest to our setting, because these methods also aim at improving the interaction between humans and AI by enabling networks to delegate decisions to humans when they are not very confident. Therefore, we compare our method with the existing network calibration methods in the literature. In a well-calibrated network, the probability associated with the predicted class label should reflect the likelihood of the correctness of the prediction. Guo et al. (2017) define the calibration error as the difference in expectation between accuracy and confidence in each confidence bin. One category of calibration methods augments or replaces the conventional training losses with another loss that explicitly encourages reducing the calibration error. Kumar et al. (2018) propose the MMCE loss, which replaces the bins with a continuous kernel to obtain a continuous distribution and a differentiable measure of calibration. Karandikar et al. (2021) propose two loss functions for calibration, called Soft-AvUC and Soft-ECE, which replace the hard confidence thresholding in AvUC (Krishnan & Tickoo, 2020) and the binning in ECE Guo et al. (2017) with smooth functions, respectively. All three of these functions are used as secondary losses alongside conventional losses such as cross-entropy. Mukhoti et al.
(2020) find that Focal Loss (FL) (Lin et al., 2017) inherently provides more calibrated models, even though it was not originally designed to improve calibration, as it adds implicit weight regularisation. The second category of methods are post-hoc calibration approaches, which rescale model predictions after training. Platt scaling (Platt, 2000) and histogram binning (Zadrozny & Elkan, 2001) fall into this class. Temperature scaling (TS) (Guo et al., 2017), a modern variant of Platt scaling, is the most popular approach in this group. The idea of TS is to scale the logits of a neural network by dividing them by a positive scalar so that they do not saturate after the subsequent softmax or sigmoid activation. TS can be applied as a complementary method to any network, and it does not affect the accuracy of the model while significantly improving calibration.
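Temperature scaling as described above can be sketched in a few lines of NumPy. This is a minimal illustration under our own naming (the original uses an NLL fit with a gradient-based optimizer; here we use a simple grid search over T, which is an assumption for brevity):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    # negative log-likelihood of the labels under temperature-scaled logits
    p = softmax(logits / T)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.25, 5.0, 96)):
    # pick the single scalar T > 0 that minimizes validation NLL;
    # dividing all logits by T leaves the argmax (hence accuracy) unchanged
    losses = [nll(val_logits, val_labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])
```

An overconfident model (logits too large in magnitude) yields a fitted T greater than 1, which softens the predicted distribution without changing any predicted label.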

2. METHODS

In this section, we describe in more detail the curve, used in industry, e.g., (Gorski et al., 2001), that assesses a predictive model in terms of accuracy and the number of samples that are delegated to a human expert for manual analysis. We will refer to it as Confidence Operating Characteristics (COC), as it is reminiscent of the classic Receiver Operating Characteristic (ROC) curve, where an analogous balance is sought between the Sensitivity and Specificity of a predictive model. Then, we describe a novel cost function for training neural networks: the area under COC (AUCOC) loss (AUCOCLoss).

2.1. NOTATION

Let $D = \{(x_n, y_n)\}_{n=1}^{N}$ denote a dataset composed of N samples from a joint distribution $D(\mathcal{X}, \mathcal{Y})$, where $x_n \in \mathcal{X}$ and $y_n \in \mathcal{Y} = \{1, 2, \dots, K\}$ denote the input data and the corresponding class label, respectively. Let $f_\theta(y \mid x)$ be the probability distribution predicted by a neural network f parametrized by θ for an input x. For each data point $x_n$, $\hat{y}_n = \mathrm{argmax}_{y \in \mathcal{Y}} f_\theta(y \mid x_n)$ denotes the predicted class label, associated with a correctness score $c_n = \mathbb{1}(\hat{y}_n = y_n)$ and a confidence score $r_n = \max_{y \in \mathcal{Y}} f_\theta(y \mid x_n)$, where $r_n \in [0, 1]$ and $\mathbb{1}(\cdot)$ is the indicator function. $r = [r_1, \dots, r_N]$ represents the vector containing all the predicted confidences for a set of data points, and $p(r)$ denotes the probability distribution over r values. In a human-AI collaboration, samples with confidence r lower than a threshold $r_0$ would be delegated to a human expert for manual assessment.
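In code, the quantities above are straightforward to extract from a matrix of predicted class probabilities. A minimal sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def confidence_and_correctness(probs, labels):
    """probs: (N, K) predicted class probabilities; labels: (N,) true classes.
    Returns the predicted labels yhat_n, the confidences r_n = max_y f(y|x_n),
    and the correctness scores c_n = 1(yhat_n == y_n)."""
    yhat = probs.argmax(axis=1)
    r = probs.max(axis=1)
    c = (yhat == labels).astype(float)
    return yhat, r, c
```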

2.2. CONFIDENCE OPERATING CHARACTERISTICS (COC) CURVE

Our first goal is to find an appropriate evaluation method to assess the trade-off between a neural network's performance and the number of samples that require manual analysis from a domain expert in a human-AI collaboration system. We choose to optimize the area under COC as it provides practitioners with flexibility in the choice of the operating point, similarly to the classic ROC curve.

x-y axes of COC: To construct the COC curve, we first define a sliding threshold $r_0$ over the space of predicted confidences r and assume that the samples with confidence scores lower than $r_0$ are given to a human expert for decision. Then, for each threshold $r_0$, we calculate (x-axis) the percentage of samples that are delegated to the human expert and (y-axis) the accuracy of the network on the remaining samples. We formulate the axes of the COC curve as follows:

x-axis: $\tau_0 = p(r < r_0) = \int_0^{r_0} p(r)\, dr$, \quad y-axis: $E[c \mid r \geq r_0]$ \hfill (1)

For each threshold level $r_0$, $\tau_0$ represents the fraction of samples whose confidence is lower than that threshold, i.e., the percentage of samples that are delegated to the expert for manual analysis. $E[c \mid r \geq r_0]$ corresponds to the expected value of the correctness score c over all samples for which the network's confidence is equal to or larger than the threshold $r_0$, i.e., the samples for which the network's prediction will be used. This expected value can be computed as

$E[c \mid r \geq r_0] = \frac{\int_{r_0}^{1} E[c \mid r]\, p(r)\, dr}{1 - \tau_0}$ \hfill (2)

We provide the derivation of Eq. 2 in Appendix A.1. The area under the COC curve (AUCOC), like the area under the ROC curve, is a global indicator of the performance of a system. Higher AUCOC indicates fewer samples delegated to human experts and/or higher accuracy on the samples handled by the network. Lower AUCOC, on the other hand, indicates more delegations to human experts and/or lower accuracy. AUCOC can be computed by integrating the y-axis expressed in Eq.
1 over the whole range of $\tau_0 \in [0, 1]$:

$AUCOC = \int_0^1 E[c \mid r \geq r_0]\, d\tau_0 = \int_0^1 \frac{\int_{r_0}^{1} E[c \mid r]\, p(r)\, dr}{1 - \tau_0}\, d\tau_0$ \hfill (3)

Figure 1: (a) shows how one could improve AUCOC, by increasing the accuracy of the network and/or decreasing the amount of data to be analysed by the domain expert; the dashed curve has higher AUCOC than the solid one. (b) illustrates a toy example where two models have the same accuracy and ECE with 5 bins, yet different AUCOC values due to a different ordering of correctly and incorrectly classified samples according to the confidence assigned by the network.
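The COC curve and AUCOC are easy to compute empirically from a set of confidences and correctness scores. A minimal NumPy sketch (our own naming, not the authors' code; thresholds are taken at the sorted confidences themselves, and the endpoint convention at τ_0 = 1 is our assumption):

```python
import numpy as np

def coc_curve(conf, correct):
    """COC points: for each threshold, tau0 = fraction of samples delegated
    to the expert, acc = accuracy of the network on the remaining samples."""
    order = np.argsort(conf)
    correct = np.asarray(correct, float)[order]
    n = len(correct)
    tau0 = np.arange(n) / n  # delegate the k lowest-confidence samples
    acc = np.array([correct[k:].mean() for k in range(n)])
    return tau0, acc

def aucoc(conf, correct):
    tau0, acc = coc_curve(conf, correct)
    # extend the last accuracy value to tau0 = 1, then use the trapezoidal rule
    tau0 = np.append(tau0, 1.0)
    acc = np.append(acc, acc[-1])
    return float(np.sum((tau0[1:] - tau0[:-1]) * (acc[1:] + acc[:-1]) / 2))
```

A model whose mistakes all receive low confidence attains a higher AUCOC than one whose mistakes receive high confidence, even at identical overall accuracy.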

2.3. AUCOCLOSS: MAXIMIZING AUCOC FOR TRAINING NEURAL NETWORKS

In the previous section, we described COC for assessing the performance of a system in terms of accuracy and the number of samples delegated to human experts at various operating points, and noted that higher AUCOC is desirable. With this motivation, we now introduce a new loss function, AUCOCLoss, that maximizes AUCOC while training neural networks. Explicitly maximizing AUCOC encourages reducing the number of samples delegated to the human expert while maintaining the accuracy on the samples assessed by the algorithm (i.e., keeping $E[c \mid r \geq r_0]$ constant), and/or improving the prediction accuracy on the samples analysed by the algorithm while keeping the amount of data analysed by the human fixed (i.e., keeping $\tau_0$ constant), as illustrated in Figure 1a. We define our loss function to maximize AUCOC as

$AUCOCLoss = -\log(AUCOC)$ \hfill (4)

We use the negative logarithm because AUCOC lies in the interval [0, 1], which corresponds to AUCOCLoss ∈ [0, ∞) and is thus suitable for minimization. To compute AUCOC during training, we use kernel density estimation (KDE) with the training samples to estimate p(r) in Eq. 3:

$p(r) \approx \frac{1}{N} \sum_{n=1}^{N} K(r - r_n)$ \hfill (5)

where K is a Gaussian kernel whose bandwidth we estimate using Scott's rule of thumb Scott (1979). The other terms in Eq. 3, $E[c \mid r]\, p(r)$ and $\tau_0$, are then estimated as

$E[c \mid r]\, p(r) \approx \frac{1}{N} \sum_{n=1}^{N} c_n K(r - r_n), \qquad \tau_0 \approx \frac{1}{N} \int_0^{r_0} \sum_{n=1}^{N} K(r - r_n)\, dr$ \hfill (6)
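The KDE estimates above can be sketched as follows: the kernel integrals reduce to Gaussian CDF evaluations, and the bandwidth follows Scott's rule. This is a sketch under our own naming, not the authors' implementation (and it neglects the small kernel mass falling below 0 for brevity):

```python
import numpy as np
from math import erf, sqrt

def gauss_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def kde_coc_terms(conf, corr, r0):
    """KDE estimates with a Gaussian kernel of bandwidth h (Scott's rule):
    tau0  ~ (1/N) sum_n integral_0^{r0} K(r - r_n) dr,
    numer ~ (1/N) sum_n c_n integral_{r0}^{1} K(r - r_n) dr,
    so that E[c | r >= r0] ~ numer / (1 - tau0)."""
    conf = np.asarray(conf, float)
    n = len(conf)
    h = conf.std() * n ** (-1.0 / 5.0)  # Scott's rule of thumb
    below = np.array([gauss_cdf((r0 - rn) / h) for rn in conf])
    upto1 = np.array([gauss_cdf((1.0 - rn) / h) for rn in conf])
    tau0 = below.mean()
    numer = (np.asarray(corr, float) * (upto1 - below)).mean()
    return tau0, numer
```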

2.4. TOY EXAMPLE: ADDED VALUE BY COC CURVE AND AUCOC

In this section, we demonstrate the added value of assessing the performance of a predictive model using the COC curve and AUCOC, compared to the expected calibration error (ECE) Guo et al. (2017) and classification accuracy, on a toy example. The left side of Figure 1b shows the confidence scores of 5 samples from two different models, NN1 and NN2, for a classification problem. The green circles denote the predicted confidences of correctly classified samples, while the red crosses denote those of the misclassified ones. ECE divides the confidence space into bins, computes the difference between the average accuracy and the average confidence in each bin, and returns the average of these differences as the final measure of calibration error. Let us assume that we divide the confidence space into 5 bins, as indicated by the gray dotted lines in the confidence spaces of NN1 and NN2. This results in equal ECE for both networks. Note that the models also have equal classification accuracy, since each classifies 2 out of 5 samples incorrectly. Looking at these two performance metrics, it is not possible to choose one model over the other, since NN1 and NN2 perform identically. On the contrary, AUCOC is larger for NN1 than for NN2, as shown in the rightmost plot of Figure 1b. The difference in AUCOC is due to the different ordering of correctly and incorrectly classified samples, which COC and AUCOC are able to detect. Based on the AUCOC results, one would prefer NN1 over NN2. Indeed, NN1 is the better model: it achieves equal or better accuracy than NN2 for the same amount of data to be manually examined and, analogously, delegates an equal or smaller number of samples to the expert for the same accuracy level.

COC:

We implement the selection of the operating points in COC by thresholding on the predicted output confidences r. First, we arrange the confidences r of the whole dataset (or batch) in ascending order. Each predicted confidence is then used in turn as the threshold level for the vector r, and $\tau_0$ and $E[c \mid r \geq r_0]$ are computed. This saves time on threshold selection compared to exploring the confidence space at arbitrarily fine-grained levels. AUCOCLoss: There are two important points about the implementation of AUCOCLoss. First, instead of using $E[c \mid r]\, p(r) \approx \frac{1}{N}\sum_{n=1}^{N} c_n K(\|r - r_n\|)$ as given in Eq. 6, we approximate it as $E[c \mid r]\, p(r) \approx \frac{1}{N}\sum_{n=1}^{N} r^*_n K(\|r - r_n\|)$, where $r^*_n = f_\theta(y_n \mid x_n)$ is the confidence assigned to the correct class of sample n. The main reason for this modification is that the gradient for misclassified samples becomes zero, because $c_n$ is zero when a sample $x_n$ is not classified correctly; the network could therefore never learn to classify samples correctly once they are misclassified. To deal with this issue, we replace the correctness score $c_n$, which can only be 0 or 1, with $r^*_n$, which takes continuous values between 0 and 1. With this new approximation, we can also compute and back-propagate gradients for misclassified samples.
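The substitution can be illustrated directly on the kernel sum: with the hard correctness c_n, the n-th term vanishes for every misclassified sample (so no gradient can flow through it), while the soft version keeps a continuous weight r*_n. A small sketch under our own naming (the paper's training code is not reproduced here):

```python
import numpy as np

def ec_pr_estimate(grid, conf, weights, h):
    """(1/N) sum_n w_n K(r - r_n) evaluated on a grid of r values, with a
    Gaussian kernel K of bandwidth h. weights = c_n gives the hard estimator
    of E[c|r]p(r); weights = r*_n (the confidence assigned to the *correct*
    class) gives the soft surrogate used by AUCOCLoss."""
    K = np.exp(-((grid[:, None] - conf[None, :]) ** 2) / (2 * h * h))
    K /= np.sqrt(2 * np.pi) * h
    return (K * weights[None, :]).mean(axis=1)
```

With the hard weights, a misclassified sample contributes exactly nothing to the estimate (and hence nothing to the gradient), whereas the soft weights keep a nonzero, differentiable contribution around its confidence.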

3. EXPERIMENTS

In this section, we present our experimental evaluation on multi-class image classification tasks. We performed experiments on three datasets: CIFAR100 (Krizhevsky, 2009), Tiny-ImageNet, a subset of ImageNet (Deng et al., 2009), and DermaMNIST (Yang et al., 2021). We compared AUCOCLoss with different loss functions, most of which are designed to improve calibration performance while preserving accuracy: cross-entropy (CE), focal loss (FL) (Lin et al., 2017), adaptive focal loss (Mukhoti et al., 2020), maximum mean calibration error loss (MMCE) (Kumar et al., 2018), soft binning calibration objective (Soft-ECE) and soft accuracy versus uncertainty calibration (Soft-AvUC) (Karandikar et al., 2021). We optimized the MMCE, Soft-ECE, and Soft-AvUC losses jointly with a primary loss, for which we used either CE or FL, consistently with the literature (Karandikar et al., 2021). We also performed experiments optimizing the proposed AUCOCLoss both as a primary loss and as a secondary loss alongside CE and FL. The KDE in AUCOCLoss is applied batch-wise during training. In addition, for all experiments we report results after applying temperature scaling (TS) (Guo et al., 2017) as a post-training technique. We use three metrics to evaluate the performance of the methods: classification accuracy, equal-mass expected calibration error (ECE) (Nixon et al., 2019) with 15 bins, and AUCOC. Classification accuracy is simply the ratio of correctly classified samples to the total number of samples, and AUCOC is computed using Eq. 3. Equal-mass ECE divides the confidence space into M bins such that each bin $B_m$ contains an equal number of samples, and computes the weighted average of the difference between the average accuracy and average confidence of each bin. In addition, we report some example operating points of the COC curve: given a certain accuracy, we show the corresponding percentage of samples that need to be analyzed manually ($\tau_0$@acc) on the COC curves.
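The equal-mass ECE used above can be sketched as follows (our own implementation, not the authors' evaluation code):

```python
import numpy as np

def equal_mass_ece(conf, correct, n_bins=15):
    """Sort by confidence, split into bins holding (nearly) equal numbers of
    samples, and average |accuracy - mean confidence| weighted by bin size."""
    order = np.argsort(conf)
    conf = np.asarray(conf, float)[order]
    correct = np.asarray(correct, float)[order]
    n = len(conf)
    ece = 0.0
    for idx in np.array_split(np.arange(n), n_bins):
        if len(idx) == 0:
            continue
        ece += (len(idx) / n) * abs(correct[idx].mean() - conf[idx].mean())
    return ece
```

A model whose confidences match its empirical accuracy yields a small value, while a uniformly overconfident model yields a large one.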
A reliable AI system should have lower confidence whenever it encounters out-of-distribution (OOD) data, deferring those samples to the human expert for further investigation. Evaluating OOD detection performance is a common experiment in the network calibration literature. Therefore, we investigated the OOD detection performance of all methods using CIFAR100-C (Hendrycks & Dietterich, 2019) (with Gaussian noise) and SVHN (Netzer et al., 2011) as OOD datasets, with the network trained on CIFAR100 (in-distribution). We evaluated the OOD detection performance of all methods using the Area Under the Receiver Operating Characteristic (AUROC) curve. All experiments were run 3 times and we report the average results in Tables 1, 2, 3, and 4.

3.1. SETUP DETAILS

For CIFAR100 (Krizhevsky, 2009) we used 45000/5000/10000 images as training/validation/test sets, respectively, and Wide-Resnet-28-10 (Zagoruyko & Komodakis, 2016), consistently with Karandikar et al. (2021). We trained the model for 200 epochs using Stochastic Gradient Descent (SGD), with a batch size of 512, momentum of 0.9 and an initial learning rate of 0.1, decreased after 60, 120, and 160 epochs by a factor of 0.1. Tiny-ImageNet is a subset of ImageNet (Deng et al., 2009) with 64 × 64 images and 200 classes. We employed 90000/10000/10000 images as training/validation/test sets, respectively. As in Mukhoti et al. (2020), we used ResNet-50 (He et al., 2016) as the backbone architecture. SGD was used as the optimizer with a batch size of 512, momentum of 0.9 and a base learning rate of 0.1, decayed by a factor of 0.1 at the 40th and 60th epochs. We chose a medical imaging dataset, DermaMNIST (Yang et al., 2021), as the third dataset, since human-AI collaboration systems are particularly useful in the medical domain. DermaMNIST is composed of dermatoscopic images with 7007/1003/2005 samples for the training, validation and test sets, respectively, categorised into 7 different diseases.
We followed the training procedures of the original paper, employing a ResNet-50 He et al. (2016), the Adam optimizer and a batch size of 128. We found that a more suitable initial learning rate is 0.001 and trained the model for 100 epochs, reducing the learning rate by a factor of 0.1 after 50 and 75 epochs. The weighting factor for AUCOCLoss, when it is used as a secondary loss, was selected via cross-validation. For the baselines we used the hyperparameter combinations specified in the original papers where available; otherwise, we carried out cross-validation and selected the settings that provided the best performance on the validation set. In order not to be biased towards a specific metric, for all methods we saved the models with the best ECE, accuracy, and AUCOC. We found empirically that models check-pointed using ECE provided very poor results and therefore omit them from the presented results. Networks check-pointed using either accuracy or AUCOC provided comparable outcomes. For Tiny-Imagenet, the largest dataset, we report results for both checkpointing strategies in Sec. 4. For CIFAR100, DermaMNIST and the OOD detection experiments, we report only the AUCOC-checkpoint results in the main paper in Sec. 4 and the remaining results in Appendix E.

4. RESULTS

In Tables 1, 2, and 3 we report the results on CIFAR100, Tiny-Imagenet and DermaMNIST, respectively. For Tiny-Imagenet, the largest dataset, results for both AUCOC-based and accuracy-based checkpoints are reported, while for the other datasets we report only the AUCOC-checkpoint results in the main paper and the rest in Appendix E. Bold results indicate the methods that performed best for each metric; ↑ means higher is better for a metric, ↓ means lower is better. We empirically found that the optimization of AUCOC alone can be complex, and AUCOCLoss mainly provides the best results when used as a secondary loss, regularized by another cost function. We find that AUCOCLoss consistently outperformed all the baselines on every dataset in terms of accuracy and AUCOC. In terms of ECE, the proposed loss function provided on-par performance compared to the other loss functions specifically designed for network calibration.

Table 1: Test results on CIFAR100 for accuracy, AUCOC and ECE. For each loss function, we report the results of the model check-pointed on the best AUCOC on the validation set, pre and post TS. The last two columns report τ_0 corresponding to 90% and 95% accuracy (pre TS). In bold the best result for each metric. The average results over 3 runs are reported.

In order to give an idea of what a certain AUCOC corresponds to in practice, Figure 2 shows some examples of COC curves (more results are available in Appendix E). Overall, the COC curves of AUCOCLoss lie above those of all the baselines, which is desirable as it corresponds to better operating points. Moreover, in Tables 1, 2, and 3 we report some operating points of the COC curves for each model. In particular, we measured the percentage of samples delegated to the expert (τ_0) at 90% and 95% accuracy for CIFAR100 and DermaMNIST, and at 65% and 75% accuracy for Tiny-Imagenet (as the initial accuracy is much lower on this dataset). In all experiments, to varying degrees, AUCOCLoss delegates fewer samples than the baselines. In the OOD experiments reported in Table 4, we used the model trained on CIFAR100 and evaluated the OOD detection performance on both CIFAR100-C and SVHN. We used the confidence scores of the models, with and without TS, when determining whether a sample is OOD or in-distribution. Bold highlights the best results in terms of AUROC. On both OOD datasets, AUCOCLoss provided the highest AUROC, both before and after applying temperature scaling. Finally, we investigated how the performance of the models varies with batch size, which may be crucial for KDE-based methods like AUCOCLoss, as it may affect the accuracy of the density estimation. We investigated this by reducing the batch size from 128 (used in the main experiments) to 64 and 32 on the DermaMNIST dataset. The results are reported in Table 9 in Appendix E for space reasons; we observe that reducing the batch size does not significantly affect the performance of our method.

5. CONCLUSION

In this paper we propose a new cost function for multi-class classification that takes into account the trade-off between a neural network's performance and the amount of data that requires manual analysis by a domain expert when the network is not confident enough, by maximizing the area under the COC curve (AUCOC). Extensive experiments on computer vision and medical image datasets are presented, comparing the new loss with various baselines. The results demonstrate that our approach outperforms the other methods in terms of both accuracy and AUCOC and provides comparable ECE. Additionally, we evaluate the performance of the different losses for OOD sample detection and show that our method outperforms the baselines. While we presented COC and the AUCOC-based loss for multi-class classification, extensions to other tasks are possible and we will explore this direction in future work. Another possible direction would be investigating performance metrics other than accuracy to embed in the y-axis of COC. We believe that this new direction of considering expert load in human-AI collaboration systems is important, and that the COC curve and AUCOCLoss will serve as a baseline for future work.

Here, ndtr denotes the Gaussian cumulative distribution function, and the last row of Equation 8 exploits the trapezoidal rule for computing the integrals.

A.2 DERIVATIONS OF THE GRADIENTS OF AUCOC

$\frac{dA}{d\theta} = \int_0^1 \frac{d}{d\theta}\left[\int_{r_0}^{1} E[c \mid r]\, p(r)\, dr\right] \frac{d\tau_0}{1 - \tau_0}$

Here, we use the assumption discussed in Section 2 that $\tau_0$ does not depend on any parameter, which allows us to apply Leibniz's integration rule, obtaining:

$\frac{dA}{d\theta} = \int_0^1 \left[\int_{r_0}^{1} \frac{d}{d\theta}\big(E[c \mid r]\, p(r)\big)\, dr - E[c \mid r_0]\, p(r_0)\, \frac{dr_0}{d\theta}\right] \frac{d\tau_0}{1 - \tau_0}$ \hfill (10)

$\tau_0$ can be expressed as $\tau_0 = p(r \leq r_0) = \int_0^{r_0} p(r)\, dr$. Consequently:

$\frac{d\tau_0}{d\theta} = \int_0^{r_0} \frac{dp(r)}{d\theta}\, dr + p(r_0)\, \frac{dr_0}{d\theta} = 0 \quad\Rightarrow\quad \frac{dr_0}{d\theta} = -\frac{\int_0^{r_0} \frac{dp(r)}{d\theta}\, dr}{p(r_0)}$

Plugging this expression back into Equation 10, we obtain:

$\frac{dA}{d\theta} = \int_0^1 \left[\int_{r_0}^{1} \frac{d}{d\theta}\big(E[c \mid r]\, p(r)\big)\, dr + E[c \mid r_0] \int_0^{r_0} \frac{dp(r)}{d\theta}\, dr\right] \frac{d\tau_0}{1 - \tau_0}$

Assuming the use of a Gaussian kernel $K(\|r - r_n\|) = \frac{1}{\sqrt{2\pi}\,\alpha} \exp\left(-\frac{(r - r_n)^2}{2\alpha^2}\right)$ and re-writing

$E[c \mid r_0] = \frac{E[c \mid r_0]\, p(r_0)}{p(r_0)} \approx \frac{\frac{1}{N}\sum_{n=1}^{N} \mathbb{1}(c_n)\, K(\|r_0 - r_n\|)}{\frac{1}{N}\sum_{n=1}^{N} K(\|r_0 - r_n\|)}$

the gradient of the area becomes:

$\frac{dA}{dr_n} = \int_0^1 \left[\int_{r_0}^{1} \frac{d}{dr_n}\big(E[c \mid r]\, p(r)\big)\, dr + E[c \mid r_0] \int_0^{r_0} \frac{dp(r)}{dr_n}\, dr\right] \frac{d\tau_0}{1 - \tau_0}$

$= \int_0^1 \left\{\int_{r_0}^{1} \frac{\mathbb{1}(c_n)}{\sqrt{2\pi}\,\alpha^3 N}\, (r - r_n)\, \exp\left(-\frac{(r - r_n)^2}{2\alpha^2}\right) dr + E[c \mid r_0] \int_0^{r_0} \frac{(r - r_n)}{\sqrt{2\pi}\,\alpha^3 N}\, \exp\left(-\frac{(r - r_n)^2}{2\alpha^2}\right) dr\right\} \frac{d\tau_0}{1 - \tau_0}$

$= \int_0^1 \left\{-\frac{\mathbb{1}(c_n)}{\sqrt{2\pi}\,\alpha N} \left[\exp\left(-\frac{(1 - r_n)^2}{2\alpha^2}\right) - \exp\left(-\frac{(r_0 - r_n)^2}{2\alpha^2}\right)\right] - \frac{E[c \mid r_0]}{\sqrt{2\pi}\,\alpha N} \left[\exp\left(-\frac{(r_0 - r_n)^2}{2\alpha^2}\right) - \exp\left(-\frac{r_n^2}{2\alpha^2}\right)\right]\right\} \frac{d\tau_0}{1 - \tau_0}$ \hfill (17)

Also for the gradients, the code implementation exploits the trapezoidal rule for the computation of the external integral over [0, 1].

B ADDITIONAL CALIBRATION METRICS

In this section we report results on additional calibration metrics: Kolmogorov-Smirnov error (KS) (Gupta et al., 2021), Brier score (Brier, 1950) and class-wise ECE (cw-ECE) (Kull et al., 2019). Results for CIFAR100, Tiny-ImageNet and DermaMNIST are reported starting from Table 5. For all metrics and datasets, AUCOCLoss provides either the best or comparable results with respect to the baselines.

The vector of thresholds $r_0 = [r_{0,1}, \dots, r_{0,K}]$ is chosen to be equal to the vector of predicted confidences $r = [r_1, \dots, r_N]$. This choice has a few noticeable consequences. First, it saves time and energy, as the thresholds are inherently learnt from the predicted confidences of each model. Moreover, as $r_0$ is automatically "calibrated" to the output of each model, the interval $\tau_0 \in [0, 1]$ is spanned uniformly, providing full flexibility in choosing the most suitable operating point. Finally, and most importantly for a feasible implementation of the proposed method as reported in Appendix A.2, this choice makes $\tau_0$ independent of the network parameters. In fact, what is relevant for the computation of $\tau_0$ is the proportion of samples whose confidence is smaller than the corresponding $r_0$, and this adaptive choice of thresholds guarantees consistency irrespective of the specific values in r. This can be explained with the aid of Figure 3. The two axes at the top are the output confidence spaces of two neural networks (the same conclusion would hold for a single model at two different epochs). The red arrows indicate the smallest output confidence in the dataset for each model, which takes two different values; likewise, the green arrows point to the second-smallest. Given our threshold selection strategy, these samples correspond to the two smallest thresholds for each model: $r^{NN1}_{0,1}, r^{NN1}_{0,2}$ for NN1 and $r^{NN2}_{0,1}, r^{NN2}_{0,2}$ for NN2.
Irrespective of the different threshold levels between the models, the corresponding values of $\tau_0$ are the same for both NN1 and NN2: respectively 0% and 25% for the first two thresholds $r_{0,1}, r_{0,2}$ in this example with four samples.

E ADDITIONAL RESULTS

In this section we report additional results and plots from our analysis. Moreover, Table 9 reports an ablation study on the impact of batch size for both the baselines and AUCOCLoss.



Figure 2: COC curves on CIFAR100, run for one seed. The first row shows the plots for models check-pointed with AUCOC, the second with accuracy. The first column shows models without TS, the second with TS.

Figure 3: Illustrative example showing that τ_0 depends neither on the parameters θ nor on any specific threshold level r_0. Even if the models have different threshold levels, the points on the τ_0 axis are the same.

Figure 4: Additional COC curves on CIFAR100. The first row shows the plots for models check-pointed with AUCOC, the second with accuracy. The first column shows models without TS, the second with TS.

Test results on Tiny-ImageNet for accuracy, AUCOC and ECE. For each loss function, the first row reports the results of the model check-pointed on the best AUCOC (in gray) and the second row on the best accuracy on validation set, pre and post TS. The last two columns report τ 0 corresponding to 65% and 75% accuracy (pre TS). In bold the best result for each metric. The average results over 3 runs are reported.

Test results on DermaMNIST for accuracy, AUCOC and ECE. For each loss function, we report the results of the model check-pointed on the best AUCOC on validation set, pre and post TS. The last two columns report τ 0 corresponding to 90% and 95% accuracy (pre TS). In bold the best result for each metric. The average results over 3 runs are reported.

Test AUROC(%) on OOD detection for models trained on CIFAR100 and tested on CIFAR100-C (Gaussian noise) and SVHN, pre and post TS, for models check-pointed for AUCOC.

Test results on CIFAR100 for Brier score, class-wise ECE (cw-ECE) and Kolmogorov-Smirnov (KS), both for checkpoint on AUCOC (gray line) and accuracy (white line).

Test results on Tiny-ImageNet for Brier score, class-wise ECE (cw-ECE) and Kolmogorov-Smirnov (KS), both for checkpoint on AUCOC (gray line) and accuracy (white line).

Test results on DermaMNIST for Brier score, class-wise ECE (cw-ECE) and Kolmogorov-Smirnov (KS), both for checkpoint on AUCOC (gray line) and accuracy (white line).

Test results on RetinaMNIST for accuracy, AUCOC and ECE. For each loss function, we report the results of the model check-pointed on the best AUCOC on validation set, pre and post TS. The last two columns report τ 0 corresponding to 65% and 75% accuracy (pre TS). In bold the best result for each metric. The average results over 3 runs are reported.

Test results on DermaMNIST for accuracy, AUCOC and ECE for batch size 64 and 32 for models check-pointed on the best AUCOC on validation set.

Test results on CIFAR100 for accuracy, AUCOC and ECE. For each loss function, we report the results of the model check-pointed on the best accuracy on validation set, pre and post TS. The average results over 3 runs are reported. The best results are highlighted in bold.

Test results on DermaMNIST for accuracy, AUCOC and ECE. For each loss function, we report the results of the model check-pointed on the best accuracy on validation set, pre and post TS. The average results over 3 runs are reported.

Test AUROC(%) on OOD detection for models trained on CIFAR100 and tested on CIFAR100-C (Gaussian noise) and SVHN, pre and post TS. Results correspond to model checkpointed for accuracy.

A APPENDIX

A.1 DERIVATION OF EQUATION 3

The y-axis of the COC curve is expressed mathematically by $E[c \mid r \geq r_0]$. The x-axis of the COC curve is expressed mathematically by $\tau_0 = p(r < r_0) = \int_0^{r_0} p(r)\, dr$. Using the $\tau_0$ formulation, we can rewrite the y-axis as

$E[c \mid r \geq r_0] = \frac{\int_{r_0}^{1} E[c \mid r]\, p(r)\, dr}{1 - \tau_0}$

Let us assume the use of a Gaussian kernel $K(\|r - r_n\|) = \frac{1}{\sqrt{2\pi}\,\alpha} \exp\left(-\frac{(r - r_n)^2}{2\alpha^2}\right)$. Developing the equation, the area calculation becomes

$AUCOC = \int_0^1 \frac{\int_{r_0}^{1} E[c \mid r]\, p(r)\, dr}{1 - \tau_0}\, d\tau_0 \approx \int_0^1 \frac{\frac{1}{N}\sum_{n=1}^{N} c_n \left[\mathrm{ndtr}\left(\frac{1 - r_n}{\alpha}\right) - \mathrm{ndtr}\left(\frac{r_0 - r_n}{\alpha}\right)\right]}{1 - \tau_0}\, d\tau_0$ \hfill (8)

where ndtr denotes the Gaussian cumulative distribution function and the last row of Equation 8 exploits the trapezoidal rule for the computation of the external integral.

