SELECTIVE CLASSIFICATION CAN MAGNIFY DISPARITIES ACROSS GROUPS

Abstract

Selective classification, in which models can abstain on uncertain predictions, is a natural approach to improving accuracy in settings where errors are costly but abstentions are manageable. In this paper, we find that while selective classification can improve average accuracies, it can simultaneously magnify existing accuracy disparities between various groups within a population, especially in the presence of spurious correlations. We observe this behavior consistently across five vision and NLP datasets. Surprisingly, increasing abstentions can even decrease accuracies on some groups. To better understand this phenomenon, we study the margin distribution, which captures the model's confidences over all predictions. For symmetric margin distributions, we prove that whether selective classification monotonically improves or worsens accuracy is fully determined by the accuracy at full coverage (i.e., without any abstentions) and whether the distribution satisfies a property we call left-log-concavity. Our analysis also shows that selective classification tends to magnify full-coverage accuracy disparities. Motivated by our analysis, we train distributionally-robust models that achieve similar full-coverage accuracies across groups and show that selective classification uniformly improves each group on these models. Altogether, our results suggest that selective classification should be used with care and underscore the importance of training models to perform equally well across groups at full coverage.

1. INTRODUCTION

Selective classification, in which models make predictions only when their confidence is above a threshold, is a natural approach when errors are costly but abstentions are manageable. For example, in medical and criminal justice applications, model mistakes can have serious consequences, whereas abstentions can be handled by backing off to the appropriate human experts. Prior work has shown that, across a broad array of applications, more confident predictions tend to be more accurate (Hanczar & Dougherty, 2008; Yu et al., 2011; Toplak et al., 2014; Mozannar & Sontag, 2020; Kamath et al., 2020). By varying the confidence threshold, we can select an appropriate trade-off between the abstention rate and the (selective) accuracy of the predictions made. In this paper, we report a cautionary finding: while selective classification improves average accuracy, it can magnify existing accuracy disparities between various groups within a population, especially in the presence of spurious correlations. We observe this behavior across five vision and NLP datasets and two popular selective classification methods: softmax response (Cordella et al., 1995; Geifman & El-Yaniv, 2017) and Monte Carlo dropout (Gal & Ghahramani, 2016). Surprisingly, we find that increasing the abstention rate can even decrease accuracies on the groups that have lower accuracies at full coverage: on those groups, the models are not only wrong more frequently, but their confidence can actually be anticorrelated with whether they are correct. Even on datasets where selective classification improves accuracies across all groups, we find that it preferentially helps groups that already have high accuracies, further widening group disparities.
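The softmax response method above can be sketched as follows. This is a minimal illustration, not the authors' implementation; `logits` and `labels` are hypothetical arrays standing in for a model's outputs on an evaluation set:

```python
import numpy as np

def softmax_response(logits):
    """Confidence = maximum softmax probability; prediction = argmax class."""
    z = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs.max(axis=1)

def selective_accuracy(logits, labels, tau):
    """Accuracy and coverage when predicting only on points with confidence >= tau."""
    preds, conf = softmax_response(logits)
    predicted = conf >= tau
    if not predicted.any():
        return float("nan"), 0.0  # full abstention
    coverage = predicted.mean()
    accuracy = (preds[predicted] == labels[predicted]).mean()
    return accuracy, coverage
```

Sweeping `tau` from 0 upward traces out the trade-off between abstention rate and selective accuracy described above.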
These group disparities are especially problematic in the same high-stakes areas where we might want to deploy selective classification, like medicine and criminal justice; there, poor performance on particular groups is already a significant issue (Chen et al., 2020; Hill, 2020). For example, we study a variant of CheXpert (Irvin et al., 2019), where the task is to predict if a patient has pleural effusion (fluid around the lung) from a chest x-ray. As these are commonly treated with chest tubes, models can latch onto this spurious correlation and fail on the group of patients with pleural effusion but not chest tubes or other support devices. However, this group is the most clinically relevant as it comprises potentially untreated and undiagnosed patients (Oakden-Rayner et al., 2020). To better understand why selective classification can worsen accuracy and magnify disparities, we analyze the margin distribution, which captures the model's confidences across all predictions and determines which examples it abstains on at each threshold (Figure 1).

Figure 1: A selective classifier (ŷ, ĉ) makes a prediction ŷ(x) on a point x if its confidence ĉ(x) in that prediction is larger than or equal to some threshold τ. We assume the data comprises different groups, each with its own data distribution, and that these group identities are not available to the selective classifier. In this figure, we show a classifier with low accuracy on a particular group (red), but high overall accuracy (blue). Left: The margin distributions overall (blue) and on the red group. The margin is defined as ĉ(x) on correct predictions (ŷ(x) = y) and -ĉ(x) otherwise. For a threshold τ, the selective classifier is thus incorrect on points with margin ≤ -τ; abstains on points with margin between -τ and τ; and is correct on points with margin ≥ τ. Right: By varying τ, we can plot the accuracy-coverage curve, where the coverage is the proportion of predicted points. As coverage decreases, the average (selective) accuracy increases, but the worst-group accuracy decreases. The black dots correspond to the threshold τ = 1, which is shaded on the left.
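Concretely, the margin construction and the accuracy-coverage curve can be computed as follows (a minimal sketch; `confidences` and `correct` are hypothetical arrays representing a model's confidences and correctness indicators on an evaluation set):

```python
import numpy as np

def margins(confidences, correct):
    """Margin: +confidence on correct predictions, -confidence otherwise."""
    return np.where(correct, confidences, -confidences)

def accuracy_coverage_curve(m, thresholds):
    """Sweep tau: predict iff |margin| >= tau; correct iff margin >= tau.
    Returns a list of (coverage, selective accuracy) pairs."""
    curve = []
    for tau in thresholds:
        predicted = np.abs(m) >= tau
        coverage = predicted.mean()
        acc = (m[predicted] >= tau).mean() if predicted.any() else float("nan")
        curve.append((coverage, acc))
    return curve
```

Plotting selective accuracy against coverage for a grid of thresholds, overall and per group, reproduces the kind of curves shown in the right panel of Figure 1.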
We prove that when the margin distribution is symmetric, whether selective classification monotonically improves or worsens accuracy is fully determined by the accuracy at full coverage (i.e., without any abstentions) and whether the distribution satisfies a property we call left-log-concavity. To our knowledge, this is the first work to characterize whether selective classification (monotonically) helps or hurts accuracy in terms of the margin distribution, and to compare its relative effects on different groups. Our analysis shows that selective classification tends to magnify accuracy disparities that are present at full coverage. Motivated by our analysis, we find that selective classification on group DRO models (Sagawa et al., 2020), which achieve similar accuracies across groups at full coverage by using group annotations during training, uniformly improves group accuracies at lower coverages, substantially mitigating the disparities observed on standard models that are instead optimized for average accuracy. This approach is not a silver bullet: it relies on knowing group identities during training, which are not always available (Hashimoto et al., 2018). However, these results illustrate that closing disparities at full coverage can also mitigate disparities due to selective classification.
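As a self-contained numeric illustration of this behavior, consider the assumption (ours, for concreteness) that each group's margin distribution is Gaussian, which is symmetric and log-concave, hence left-log-concave. For such margins the selective accuracy at threshold τ is P(margin ≥ τ) / (P(margin ≥ τ) + P(margin ≤ -τ)), and a group above 50% full-coverage accuracy improves monotonically as τ grows while a mirrored group below 50% degrades monotonically:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def selective_accuracy_gaussian(mu, sigma, tau):
    """Selective accuracy for a N(mu, sigma^2) margin at threshold tau:
    P(margin >= tau) / (P(margin >= tau) + P(margin <= -tau))."""
    p_correct = norm_cdf((mu - tau) / sigma)
    p_incorrect = norm_cdf((-tau - mu) / sigma)
    return p_correct / (p_correct + p_incorrect)

taus = [0.0, 0.5, 1.0, 1.5, 2.0]
# Group with ~69% full-coverage accuracy: selective accuracy rises with tau.
high = [selective_accuracy_gaussian(0.5, 1.0, t) for t in taus]
# Group with ~31% full-coverage accuracy: selective accuracy falls with tau.
low = [selective_accuracy_gaussian(-0.5, 1.0, t) for t in taus]
```

Raising τ thus widens the gap between the two groups even though both margin distributions are symmetric and left-log-concave, illustrating how selective classification can magnify full-coverage disparities.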

2. RELATED WORK

Selective classification. Abstaining when the model is uncertain is a classic idea (Chow, 1957; Hellman, 1970), and uncertainty estimation is an active area of research, from the popular approach of using softmax probabilities (Geifman & El-Yaniv, 2017) to more sophisticated methods such as dropout-based uncertainty (Gal & Ghahramani, 2016; Mozannar & Sontag, 2020; De et al., 2020). Selective classification can also improve out-of-distribution accuracy (Pimentel et al., 2014; Hendrycks & Gimpel, 2017; Liang et al., 2018; Ovadia et al., 2019; Kamath et al., 2020). On the theoretical side, early work characterized optimal abstention rules given well-specified models (Chow, 1970; Hellman & Raviv, 1970), with more recent work on learning with perfect precision (El-Yaniv & Wiener, 2010; Khani et al., 2016) and guaranteed risk (Geifman & El-Yaniv, 2017). We build on this literature by establishing general conditions on the margin distribution for when selective classification helps and, importantly, by showing that it can magnify group disparities.

