TRAINING NEURAL NETWORKS TO OPERATE AT HIGH ACCURACY AND LOW MANUAL EFFORT

Abstract

In human-AI collaboration systems for critical applications based on neural networks, humans should set an operating point based on a model's confidence to determine when decisions should be delegated to experts. The underlying assumption is that the network's confident predictions are also correct. However, modern neural networks are notoriously overconfident in their predictions and thus achieve lower accuracy even when operated at high confidence. Network calibration methods mitigate this problem by encouraging models to make predictions whose confidence is consistent with their accuracy, i.e., by encouraging confidence to reflect the number of mistakes the network is expected to make. However, they do not account for the fact that, in critical applications, data must be manually analysed by experts whenever the network's confidence falls below a certain level. This can be crucial for applications where available expert time is limited and expensive, e.g., medical ones. The trade-off between the accuracy of the network and the number of samples delegated to experts at every confidence threshold can be represented by a curve. In this paper we propose a new loss function for classification that takes both aspects into account by optimizing the area under this curve. We perform extensive experiments on multiple computer vision and medical image classification datasets and compare the proposed approach with existing network calibration methods. Our results demonstrate that our method improves classification accuracy while delegating fewer decisions to human experts, achieves better out-of-distribution sample detection, and attains calibration performance on par with existing methods.

1. INTRODUCTION

Artificial intelligence (AI) systems based on deep neural networks have achieved state-of-the-art results by reaching or even outperforming human-level performance in many predictive tasks Esteva et al. (2017); Rajpurkar et al. (2018); Chen et al. (2017); Szegedy et al. (2016). Despite the great potential of neural networks for automating various tasks, there are pitfalls when they are used in a fully automated setting, which makes them difficult to deploy in safety-critical applications such as healthcare Kelly et al. (2019); Quinonero-Candela et al. (2008); Sangalli et al. (2021). Human-AI collaboration aims at tackling these issues by keeping humans in the loop and building systems that take advantage of both humans and AI while minimizing their shortcomings Patel et al. (2019).

A simple way of building collaboration between a network and a human expert is to delegate decisions to the expert when the network's confidence score falls below a predetermined threshold, which we refer to as the "operating point". For example, in healthcare, a neural network trained to predict whether a lesion is benign or malignant should leave the decision to the human doctor if it is not sufficiently confident Jiang et al. (2012). In such cases the domain knowledge of the doctor can be exploited to assess more ambiguous cases, where, for example, education or previous experience can play a crucial role in the evaluation. Another example of human-AI collaboration is hate speech detection for social media (Conneau & Lample, 2019), where neural networks greatly reduce the amount of manual content analysis required of humans. In industrial systems, curves are employed (Gorski et al., 2001) that assess a predictive model in terms of accuracy and the number of samples requiring manual assessment by a human expert, for varying operating points that loosely correspond to varying confidence levels of the algorithm's predictions. We will refer to this performance curve as Confidence Operating Characteristics (COC), as it is reminiscent of the classic Re-
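To make the idea concrete, the sketch below shows one plausible way to trace such a curve: for each confidence threshold, samples below the threshold are counted as delegated to the expert, and accuracy is measured on the retained rest. The function name `coc_curve` and the convention of defining accuracy as 1.0 when everything is delegated are our own illustrative choices, not a definition taken from the paper.

```python
import numpy as np

def coc_curve(confidences, correct, thresholds=None):
    """Illustrative Confidence Operating Characteristics (COC) curve.

    For each threshold t, samples with confidence < t are delegated to a
    human expert; accuracy is computed on the retained samples. Returns
    (fraction_delegated, accuracy) arrays, one entry per threshold.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    if thresholds is None:
        thresholds = np.unique(confidences)
    fracs, accs = [], []
    for t in thresholds:
        kept = confidences >= t
        fracs.append(1.0 - kept.mean())  # fraction delegated to the expert
        # accuracy on retained samples (taken as 1.0 if all are delegated)
        accs.append(correct[kept].mean() if kept.any() else 1.0)
    return np.array(fracs), np.array(accs)

# toy example: four predictions with confidences and correctness flags
fracs, accs = coc_curve([0.9, 0.8, 0.6, 0.4],
                        [True, True, False, False],
                        thresholds=[0.5, 0.7])
# at t=0.5: 1/4 delegated, accuracy 2/3 on the rest
# at t=0.7: 1/2 delegated, accuracy 1.0 on the rest
```

Raising the threshold delegates more samples but raises accuracy on the rest; the area under this trade-off curve is the quantity the proposed loss optimizes.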

