TRAINING NEURAL NETWORKS TO OPERATE AT HIGH ACCURACY AND LOW MANUAL EFFORT

Abstract

In human-AI collaboration systems for critical applications based on neural networks, humans set an operating point based on a model's confidence to determine when the decision should be delegated to experts. The underlying assumption is that the network's confident predictions are also correct. However, modern neural networks are notoriously overconfident in their predictions, and therefore achieve lower accuracy even when operated at high confidence. Network calibration methods mitigate this problem by encouraging models to make predictions whose confidence is consistent with their accuracy, i.e., by encouraging the confidence to reflect the number of mistakes the network is expected to make. However, they do not consider that in critical applications data must be manually analysed by experts when the confidence of the network is below a certain level. This can be crucial for applications where available expert time is limited and expensive, e.g., medical ones. The trade-off between the accuracy of the network and the number of samples delegated to experts at every confidence threshold can be represented by a curve. In this paper we propose a new loss function for classification that takes both aspects into account by optimizing the area under this curve. We perform extensive experiments on multiple computer vision and medical image datasets for classification and compare the proposed approach with existing network calibration methods. Our results demonstrate that our method improves classification accuracy while delegating fewer decisions to human experts, and achieves better out-of-distribution sample detection and on-par calibration performance compared to existing methods.

1. INTRODUCTION

Artificial intelligence (AI) systems based on deep neural networks have achieved state-of-the-art results, reaching or even outperforming human-level performance in many predictive tasks Esteva et al. (2017); Rajpurkar et al. (2018); Chen et al. (2017); Szegedy et al. (2016). Despite the great potential of neural networks for automating various tasks, there are pitfalls when they are used in a fully automated setting, which makes them difficult to deploy in safety-critical applications such as healthcare Kelly et al. (2019); Quinonero-Candela et al. (2008); Sangalli et al. (2021). Human-AI collaboration aims to tackle these issues by keeping humans in the loop and building systems that take advantage of both humans and AI while minimizing their respective shortcomings Patel et al. (2019). A simple way of building collaboration between a network and a human expert is to delegate decisions to the expert when the network's confidence score is lower than a predetermined threshold, which we refer to as the "operating point". For example, in healthcare, a neural network trained to predict whether a lesion is benign or malignant should leave the decision to the human doctor if it is not very confident Jiang et al. (2012). In such cases the domain knowledge of the doctor can be exploited to assess more ambiguous cases, where, for example, education or previous experience can play a crucial role in the evaluation. Another example of human-AI collaboration is hate speech detection for social media (Conneau & Lample, 2019), where neural networks greatly reduce the load of manual content analysis required of humans. In industrial systems, curves are employed (Gorski et al., 2001) that assess a predictive model in terms of accuracy and the number of samples that require manual assessment by a human expert, for varying operating points that loosely relate to varying confidence levels of the algorithm's predictions.
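Such an operating point amounts to a simple threshold on the model's confidence score. The sketch below illustrates this delegation rule using the maximum softmax probability as the confidence score; the function name and threshold value are illustrative, not taken from the systems cited above.

```python
import numpy as np

def delegate_low_confidence(probs, tau=0.9):
    """Split samples by an operating point tau on the max softmax score.

    probs: (N, C) array of predicted class probabilities.
    Returns indices the model decides on and indices delegated to the expert.
    """
    confidence = probs.max(axis=1)               # model's confidence per sample
    keep = np.flatnonzero(confidence >= tau)     # model decides
    delegate = np.flatnonzero(confidence < tau)  # expert re-examines
    return keep, delegate

# toy example: three samples, two classes
probs = np.array([[0.95, 0.05], [0.60, 0.40], [0.20, 0.80]])
keep, delegate = delegate_low_confidence(probs, tau=0.9)
# keep -> [0], delegate -> [1, 2]
```

Raising tau delegates more samples to the expert but raises accuracy on the samples the model keeps, which is exactly the trade-off the curves above capture.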
We will refer to this performance curve as the Confidence Operating Characteristic (COC), as it is reminiscent of the classic Receiver Operating Characteristic (ROC) curve, where an analogous balance is sought between the sensitivity and specificity of a predictive model. The COC curve can be used by domain experts, such as doctors, to identify the most suitable operating point that balances performance and the amount of data to re-examine for the specific task. The underlying assumption in these applications is that the confidence level of networks indicates when the predictions are likely to be correct or incorrect. However, modern deep neural networks that achieve state-of-the-art results are known to be overconfident even in their wrong predictions. This leads to networks that are not well-calibrated, i.e., the confidence scores do not properly indicate the likelihood of the correctness of the predictions Guo et al. (2017). Thus, neural networks suffer from lower accuracy than expected when operated at high confidence thresholds. Network calibration methods mitigate this problem by calibrating the output confidences of the model Guo et al. (2017); Kumar et al. (2018); Karandikar et al. (2021); Gupta et al. (2021). However, they do not consider that data may need to be manually analyzed by experts in critical applications if the confidence of the network is below a certain level. This can be crucial for various applications where expert time is limited and expensive. For example, in medical imaging, the interpretation of more complex data requires clinical expertise and the number of available experts is extremely limited, especially in low-income countries Kelly et al. (2019). This motivates us to take the expert load into account along with accuracy when assessing the performance of human-AI collaboration systems and training neural networks. In this paper, we make the following contributions:

• We propose a new loss function for multi-class classification that takes both aspects into account by maximizing the area under the COC curve (AUCOC).
• We perform experiments on two computer vision datasets and one medical image dataset for multi-class classification. We compare the proposed AUCOC loss with the conventional loss functions for training neural networks as well as with network calibration methods. The results demonstrate that our method outperforms the others in terms of both accuracy and AUCOC.

• We evaluate the network calibration and out-of-distribution (OOD) sample detection performance of all methods. The results show that the proposed approach consistently achieves better OOD sample detection and on-par network calibration performance.
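To make the quantity we optimize concrete, the COC curve and its area can be computed from a trained model's outputs as sketched below. This is a minimal evaluation-time computation (the function names are ours); it is not the differentiable training loss proposed in the paper.

```python
import numpy as np

def coc_curve(probs, labels):
    """COC curve: for each possible operating point, pair the fraction of
    samples delegated to the expert with the model's accuracy on the rest."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    correct = (pred == labels)[np.argsort(conf)]  # sort by ascending confidence
    n = len(labels)
    # delegate the k least-confident samples, for k = 0 .. n
    delegated = np.arange(n + 1) / n
    kept_correct = correct.sum() - np.concatenate(([0], np.cumsum(correct)))
    kept_count = n - np.arange(n + 1)
    # accuracy on the kept samples; defined as 1.0 when everything is delegated
    accuracy = np.where(kept_count > 0, kept_correct / np.maximum(kept_count, 1), 1.0)
    return delegated, accuracy

def aucoc(delegated, accuracy):
    """Area under the COC curve via the trapezoidal rule."""
    return float(np.sum(np.diff(delegated) * (accuracy[1:] + accuracy[:-1]) / 2))
```

A model that is both accurate and well-ranked by confidence keeps accuracy near 1.0 at low delegation fractions, so maximizing this area jointly rewards classification accuracy and low expert workload.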

1.1. RELATED WORK

In industrial applications, curves that plot network accuracy on accepted samples against the manual workload of a human expert are used for performance analysis of the system (Gorski et al., 2001). To the best of our knowledge, this is the first work that explicitly takes into account, during the optimization process, the trade-off between neural network performance and the amount of data to be analysed by a human expert in a human-AI collaborative system. Therefore, there is no direct literature with which we can compare. We find the literature on network calibration methods closest to our setting, because these methods also aim to improve the interaction between humans and AI by enabling networks to delegate decisions to a human when they are not very confident. Therefore, we compare our method with the existing network calibration methods in the literature. In a well-calibrated network, the probability associated with the predicted class label should reflect the likelihood of the correctness of the predictions.
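This notion of calibration is commonly quantified by the binned expected calibration error (ECE) of Guo et al. (2017). The sketch below illustrates the standard metric (the function name is ours): a weighted average, over equal-width confidence bins, of the gap between accuracy and mean confidence.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: weighted average over equal-width confidence bins of
    |accuracy(bin) - mean confidence(bin)|, weighted by bin occupancy."""
    conf = probs.max(axis=1)                              # predicted confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece
```

A perfectly calibrated model scores 0; e.g., two predictions made with confidence 0.9 of which only one is correct contribute a gap of |0.5 - 0.9| = 0.4.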




Guo et al. (2017) define the calibration error as the difference in expectation between accuracy and confidence in each confidence bin. One category of calibration methods augments or replaces the conventional training losses with another loss that explicitly encourages reducing the calibration error. Kumar et al. (2018) propose a method called MMCE loss, replacing the bins with a continuous kernel to obtain a continuous distribution and a differentiable measure of calibration. Karandikar et al. (2021) propose two loss functions for calibration, called Soft-AvUC and Soft-ECE, by replacing the hard confidence thresholding in AvUC (Krishnan & Tickoo, 2020) and the binning in ECE Guo et al. (2017) with smooth functions, respectively. All three functions are used as secondary losses alongside conventional losses such as cross-entropy. Mukhoti et al. (2020) find that Focal Loss (FL) (Lin et al., 2017) yields inherently better-calibrated models, even though it was not originally designed to improve calibration, as it adds implicit weight regularisation. The second category of methods comprises post-hoc calibration approaches, which rescale model predictions after training. Platt scaling (Platt, 2000) and histogram

