BEYOND CALIBRATION: ESTIMATING THE GROUPING LOSS OF MODERN NEURAL NETWORKS

Abstract

Ensuring that a classifier gives reliable confidence scores is essential for informed decision-making. To this end, recent work has focused on miscalibration, i.e., the over- or under-confidence of model scores. Yet calibration is not enough: even a perfectly calibrated classifier with the best possible accuracy can have confidence scores that are far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that, given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution-shift settings, which highlights the importance of pre-production validation.

1. INTRODUCTION

Validating the compliance of a model to a predefined set of specifications is important to control operational risks related to performance, but also trustworthiness, fairness, or robustness to varying operating conditions. It often requires that probability estimates capture the actual uncertainty of the prediction, i.e. are close to the true posterior probabilities. Indeed, many situations call for probability estimates rather than just a discriminant classifier. Probability estimates are needed when the decision is left to a human decision maker, when the model needs to avoid making decisions if they are too uncertain, when the context of model deployment is unknown at training time, etc. To evaluate probabilistic predictions, statistics and decision theory have put forward proper scoring rules (Dawid, 1986; Gneiting et al., 2007), such as the Brier score or the log-loss. Strictly proper scoring rules are minimized when a model produces the true posterior probabilities, which makes them a valuable tool for comparing models and selecting those with the best estimated probabilities (e.g. Dawid & Musio, 2014). What they do not provide, though, is a means of validating whether the best estimated probabilities are good enough to be put into production, or whether further effort is needed to improve the model. Indeed, proper scores compound the irreducible loss (due to the inherent randomness of a problem, i.e. the aleatoric uncertainty) and the epistemic loss (which measures how far a model is from the best possible one). For example, a classifier with a Brier score of 0.15 could have optimal estimated probabilities (irreducible loss close to 0.15) or poor ones (irreducible loss close to 0). Calibration errors are another tool to evaluate probabilistic predictions, and measuring them is an active research topic in the machine learning community (Kumar et al., 2019; Minderer et al., 2021; Roelofs et al., 2022).
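This compounding can be made concrete on a synthetic problem where the true posteriors are known. The sketch below (a hypothetical setup, not from the paper: posteriors drawn uniformly, labels sampled from them) compares the Brier score of an oracle predicting the true posterior, which attains the irreducible loss, against a calibrated but uninformative constant predictor; the gap between the two is the epistemic loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary problem: true posteriors q are known by construction,
# labels y are sampled from them (hypothetical illustrative setup).
q = rng.uniform(0.0, 1.0, size=100_000)
y = (rng.uniform(size=q.size) < q).astype(float)

def brier(p, y):
    """Brier score: mean squared error between scores and outcomes."""
    return float(np.mean((p - y) ** 2))

# Oracle predicting the true posterior attains the irreducible loss,
# E[q(1 - q)], which is ~1/6 for uniform posteriors.
irreducible = brier(q, y)

# A calibrated but uninformative constant predictor (the base rate)
# scores ~Var(y) = 1/4: same kind of number, much worse probabilities.
constant = brier(np.full_like(q, y.mean()), y)

# The difference is the epistemic loss of the constant predictor.
epistemic = constant - irreducible
```

Both quantities are "just a Brier score" to an observer without access to `q`, which is the point: the score alone does not reveal how much of it is irreducible.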
The calibration error is in fact a component of proper scoring rules (Bröcker, 2009; Kull & Flach, 2015): it measures whether, among all samples to which a classifier gave the same confidence score, the fraction of positives matches that confidence score on average. Importantly, the calibration error can be evaluated efficiently, as it does not require access to the ground-truth probabilities, but solely to their calibrated version. Calibration is however an incomplete characterization of predictive uncertainty. It measures an aggregated error that is blind to potential individual errors compensating each other. For example, among a group of individuals to which a calibrated cancer-risk classifier assigns a probability of 0.6, a fraction of 60% actually has cancer. But one subgroup of them could be composed of 100% cancer patients, while another contains only 20% cancer patients.
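The cancer-risk example can be checked with a few lines of arithmetic. In the sketch below (an illustrative construction; the subgroup posteriors 1.0 and 0.2 and the equal weights are assumptions chosen so the mixture averages to 0.6), the score bin looks perfectly calibrated at the group level, yet the within-bin spread of true posteriors, which is what the grouping loss captures under the Brier score, is far from zero.

```python
import numpy as np

# One score bin: every sample receives confidence 0.6, but the bin mixes
# two hidden subgroups with true posteriors 1.0 and 0.2 (equal weights).
posteriors = np.array([1.0, 0.2])
weights = np.array([0.5, 0.5])
score = 0.6

# Group-level positive rate matches the score: the bin is calibrated.
avg_rate = float(weights @ posteriors)  # equals 0.6

# Yet the within-bin variance of the true posterior around the shared
# score, the bin's grouping-loss contribution under the Brier score,
# is large: every individual's score is off by 0.4.
grouping = float(weights @ (posteriors - score) ** 2)  # equals 0.16
```

Zero calibration error is thus compatible with a substantial grouping loss: the two subgroups' errors (+0.4 and -0.4) cancel in the bin average but not in the squared, per-individual sense.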

