DATASET CURATION BEYOND ACCURACY

Abstract

Neural networks are known to be data-hungry, and collecting large labeled datasets is often a crucial step in deep learning deployment. Researchers have studied dataset aspects such as distributional shift and labeling cost, primarily using downstream prediction accuracy for evaluation. In sensitive real-world applications such as medicine and self-driving cars, not only is accuracy important, but also calibration: the extent to which model uncertainty reflects the actual likelihood of correctness. It has recently been shown that modern neural networks are ill-calibrated. In this work, we take a complementary approach, studying how dataset properties, rather than architecture, affect calibration. For the common issue of dataset imbalance, we show that calibration varies significantly among classes, even when common strategies to mitigate class imbalance are employed. We also study the effects of label quality, showing how label noise dramatically increases calibration error. Furthermore, poor calibration can come from small dataset sizes, which we motivate via results on network expressivity. Our experiments demonstrate that dataset properties can significantly affect calibration and suggest that calibration should be measured during dataset curation.

1. INTRODUCTION

Neural networks often require large amounts of labeled data to perform well, making data curation a crucial but costly aspect of deployment. Thus, researchers have studied dataset properties such as distributional shift (Miller et al., 2020) and the bias in crowd-sourced computer vision datasets (Tsipras et al., 2020), among others. Often, the evaluation criterion in such studies is downstream prediction accuracy. However, neural networks are increasingly deployed in sensitive real-world applications such as medicine (Caruana et al., 2015), self-driving cars (Bojarski et al., 2016), and scientific analysis (Attia et al., 2020), where not only accuracy matters but also calibration. Calibration is the extent to which model certainty reflects the actual likelihood of correctness. Calibration can be important when the costs of false positives and false negatives are asymmetric; e.g., for a deadly disease with cheap treatment, doctors might initiate treatment when the probability of being sick exceeds 10%. Beyond simple classification, calibration can be important for beam search in NLP (Ott et al., 2018) and algorithmic fairness (Pleiss et al., 2017). Calibration in machine learning has been studied by, e.g., Zadrozny & Elkan (2001) and Naeini et al. (2015). Niculescu-Mizil & Caruana (2005) have shown that small-scale neural networks can yield well-calibrated predictions. However, it has recently been observed by Guo et al. (2017) that modern neural networks are ill-calibrated, whereas the now primitive LeNet (LeCun et al., 1998) achieves good calibration. In this work, we take a complementary approach; instead of focusing on network architecture, we study how calibration is influenced by dataset properties. We primarily focus on computer vision and perform extensive experiments across common benchmarks and more exotic datasets such as satellite images (the EuroSAT dataset (Helber et al., 2019)) and species detection (the iNaturalist dataset (Van Horn et al., 2018)).
We consistently find that dataset properties can significantly affect calibration, causing effects comparable to those of network architecture. For example, we consider the ubiquitous problem of class-imbalanced datasets, a common issue in practice (Van Horn et al., 2018; Krishna et al., 2017; Thomee et al., 2016). For such datasets, the miscalibration is not uniform but instead varies across the different classes. This problem persists even when common strategies to mitigate class imbalance are employed. Another practical concern is generating high-quality labels via, e.g., crowdsourcing (Karger et al., 2011). We demonstrate how labeling quality affects calibration, with noisier labels resulting in worse calibration. Additionally, we show that the size of the dataset alone has a strong effect on calibration. This also holds when one artificially increases the dataset size by data augmentation. We motivate our findings by considering the geometry of the cross-entropy loss and utilizing recent results on network expressivity (Yun et al., 2019). If the dataset is sufficiently small compared to the number of parameters, we argue that the lack of a minimizer for the cross-entropy loss biases the network towards high confidence and poor calibration. Our results highlight an underappreciated aspect of calibration and suggest that for sensitive applications, one should measure calibration during dataset curation.
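To make the class-imbalance setup concrete, the sketch below shows one common way to inject a controlled long-tailed imbalance into an otherwise balanced dataset by random subsampling, with class sizes decaying geometrically. This is a minimal illustration of the general recipe, not necessarily the exact protocol used in the experiments; the function name, decay schedule, and `imbalance_ratio` parameter are our own choices.

```python
import numpy as np

def longtail_indices(labels, imbalance_ratio=10.0, seed=0):
    """Subsample a balanced dataset into a long-tailed one.

    Class at rank r (of C >= 2 classes) keeps
        n_max * imbalance_ratio ** (-r / (C - 1))
    samples, so the first class keeps everything and the last keeps
    n_max / imbalance_ratio. Returns indices into the original arrays.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n_max = min(int(np.sum(labels == c)) for c in classes)
    keep = []
    for rank, c in enumerate(classes):
        n_keep = int(n_max * imbalance_ratio ** (-rank / (len(classes) - 1)))
        idx = np.flatnonzero(labels == c)
        keep.append(rng.choice(idx, size=max(n_keep, 1), replace=False))
    return np.concatenate(keep)
```

Because the subsampling is random, any resulting per-class calibration differences cannot be attributed to class-specific image properties, which is the point of injecting the imbalance artificially.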

2. BACKGROUND

Calibration. Calibration has a traditional place in machine learning (Zadrozny & Elkan, 2001; Naeini et al., 2015). Before the advent of modern deep learning, Niculescu-Mizil & Caruana (2005) showed that neural networks can yield well-calibrated predictions for classification. However, Guo et al. (2017) showed that modern neural networks, e.g., ResNets (He et al., 2016) or DenseNets (Huang et al., 2016), are ill-calibrated. It is important to note that accuracy and calibration do not necessarily follow each other but can move independently: modern neural networks are ill-calibrated yet still yield excellent accuracy. Beyond image classification, the importance of calibration in NLP has further been studied by Ott et al. (2018) and its relationship to fairness by Pleiss et al. (2017).

Metrics for Calibration. We let {x_i} ∈ R^{n×d_x} be a dataset of n datapoints with d_x features and take {y_i} to be the labels. Following Guo et al. (2017), we assume that a neural network h outputs h(x_i) = (p̂_i, ŷ_i), where ŷ_i is the predicted class and p̂_i is the estimated probability that the prediction is correct. For evaluating calibration, we divide the interval [0, 1] into M equally sized bins and assign predictions to bins based upon p̂. Within each bin B_m we define the accuracy as acc(B_m) = (1/|B_m|) Σ_{i∈B_m} 1(ŷ_i = y_i). Similarly, we define the confidence as conf(B_m) = (1/|B_m|) Σ_{i∈B_m} p̂_i. For a well-calibrated model, we would expect the confidence and accuracy of each bin to be close to each other. Calibration error can be measured by their difference, evaluated on the test set. The
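The binned accuracy/confidence gap above can be sketched in a few lines. The function below implements the standard expected calibration error (ECE) of Guo et al. (2017): it bins predictions by confidence into M equal-width bins, computes acc(B_m) and conf(B_m) per bin, and averages |acc(B_m) - conf(B_m)| weighted by bin size. The function name and signature are our own; only the metric itself comes from the text.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """ECE = sum_m (|B_m|/n) * |acc(B_m) - conf(B_m)| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Assign p-hat in (lo, hi] to bin B_m; put exact zeros in the first bin.
        in_bin = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            in_bin |= confidences == 0.0
        if in_bin.any():
            acc = correct[in_bin].mean()       # acc(B_m)
            conf = confidences[in_bin].mean()  # conf(B_m)
            ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```

A perfectly calibrated model scores 0, e.g., predictions made with confidence 0.8 that are correct exactly 80% of the time; a model that is always fully confident but always wrong scores 1.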



Figure 1: Calibration error for individual classes under class imbalance. The classes are ordered from the most (left) to the least (right) samples. Fewer samples result in larger calibration errors. Imbalance is injected into CIFAR10/100 and EuroSAT randomly, removing any correlation with class-specific properties. We do not modify iNaturalist, which already suffers from imbalance; thus its classwise calibration is correlated with class-specific properties.

