SELECTIVE CLASSIFICATION CAN MAGNIFY DISPARITIES ACROSS GROUPS

Abstract

Selective classification, in which models can abstain on uncertain predictions, is a natural approach to improving accuracy in settings where errors are costly but abstentions are manageable. In this paper, we find that while selective classification can improve average accuracies, it can simultaneously magnify existing accuracy disparities between various groups within a population, especially in the presence of spurious correlations. We observe this behavior consistently across five vision and NLP datasets. Surprisingly, increasing abstentions can even decrease accuracies on some groups. To better understand this phenomenon, we study the margin distribution, which captures the model's confidences over all predictions. For symmetric margin distributions, we prove that whether selective classification monotonically improves or worsens accuracy is fully determined by the accuracy at full coverage (i.e., without any abstentions) and whether the distribution satisfies a property we call left-log-concavity. Our analysis also shows that selective classification tends to magnify full-coverage accuracy disparities. Motivated by our analysis, we train distributionally-robust models that achieve similar full-coverage accuracies across groups and show that selective classification uniformly improves each group on these models. Altogether, our results suggest that selective classification should be used with care and underscore the importance of training models to perform equally well across groups at full coverage.

1. INTRODUCTION

Selective classification, in which models make predictions only when their confidence is above a threshold, is a natural approach when errors are costly but abstentions are manageable. For example, in medical and criminal justice applications, model mistakes can have serious consequences, whereas abstentions can be handled by backing off to the appropriate human experts. Prior work has shown that, across a broad array of applications, more confident predictions tend to be more accurate (Hanczar & Dougherty, 2008; Yu et al., 2011; Toplak et al., 2014; Mozannar & Sontag, 2020; Kamath et al., 2020). By varying the confidence threshold, we can select an appropriate trade-off between the abstention rate and the (selective) accuracy of the predictions made. In this paper, we report a cautionary finding: while selective classification improves average accuracy, it can magnify existing accuracy disparities between various groups within a population, especially in the presence of spurious correlations. We observe this behavior across five vision and NLP datasets and two popular selective classification methods: softmax response (Cordella et al., 1995; Geifman & El-Yaniv, 2017) and Monte Carlo dropout (Gal & Ghahramani, 2016). Surprisingly, we find that increasing the abstention rate can even decrease accuracies on the groups that have lower accuracies at full coverage: on those groups, the models are not only wrong more frequently, but their confidence can actually be anticorrelated with whether they are correct. Even on datasets where selective classification improves accuracies across all groups, we find that it preferentially helps groups that already have high accuracies, further widening group disparities.
These group disparities are especially problematic in the same high-stakes areas where we might want to deploy selective classification, like medicine and criminal justice; there, poor performance on particular groups is already a significant issue (Chen et al., 2020; Hill, 2020). For example, we study a variant of CheXpert (Irvin et al., 2019), where the task is to predict if a patient has pleural effusion (fluid around the lung) from a chest x-ray. As these are commonly treated with chest tubes, models can latch onto this spurious correlation and fail on the group of patients with pleural effusion but not chest tubes or other support devices. However, this group is the most clinically relevant as it comprises potentially untreated and undiagnosed patients (Oakden-Rayner et al., 2020).

Figure 1: A selective classifier (ŷ, ĉ) makes a prediction ŷ(x) on a point x if its confidence ĉ(x) in that prediction is larger than or equal to some threshold τ. We assume the data comprises different groups, each with its own data distribution, and that these group identities are not available to the selective classifier. In this figure, we show a classifier with low accuracy on a particular group (red), but high overall accuracy (blue). Left: the margin distributions overall (blue) and on the red group. The margin is defined as ĉ(x) on correct predictions (ŷ(x) = y) and -ĉ(x) otherwise. For a threshold τ, the selective classifier is thus incorrect on points with margin ≤ -τ; abstains on points with margin between -τ and τ; and is correct on points with margin ≥ τ. Right: by varying τ, we can plot the accuracy-coverage curve, where the coverage is the proportion of predicted points. As coverage decreases, the average (selective) accuracy increases, but the worst-group accuracy decreases. The black dots correspond to the threshold τ = 1, which is shaded on the left.
To better understand why selective classification can worsen accuracy and magnify disparities, we analyze the margin distribution, which captures the model's confidences across all predictions and determines which examples it abstains on at each threshold (Figure 1). We prove that when the margin distribution is symmetric, whether selective classification monotonically improves or worsens accuracy is fully determined by the accuracy at full coverage (i.e., without any abstentions) and whether the distribution satisfies a property we call left-log-concavity. To our knowledge, this is the first work to characterize whether selective classification (monotonically) helps or hurts accuracy in terms of the margin distribution, and to compare its relative effects on different groups. Our analysis shows that selective classification tends to magnify accuracy disparities that are present at full coverage. Motivated by our analysis, we find that selective classification on group DRO models (Sagawa et al., 2020), which achieve similar accuracies across groups at full coverage by using group annotations during training, uniformly improves group accuracies at lower coverages, substantially mitigating the disparities observed on standard models that are instead optimized for average accuracy. This approach is not a silver bullet: it relies on knowing group identities during training, which are not always available (Hashimoto et al., 2018). However, these results illustrate that closing disparities at full coverage can also mitigate disparities due to selective classification.

2. RELATED WORK

Selective classification. Abstaining when the model is uncertain is a classic idea (Chow, 1957; Hellman, 1970), and uncertainty estimation is an active area of research, from the popular approach of using softmax probabilities (Geifman & El-Yaniv, 2017) to more sophisticated methods using dropout (Gal & Ghahramani, 2016), ensembles (Lakshminarayanan et al., 2017), or training snapshots (Geifman et al., 2018). Others incorporate abstention into model training (Bartlett & Wegkamp, 2008; Geifman & El-Yaniv, 2019; Feng et al., 2019) and learn to abstain on examples human experts are more likely to get correct (Raghu et al., 2019; Mozannar & Sontag, 2020; De et al., 2020). Selective classification can also improve out-of-distribution accuracy (Pimentel et al., 2014; Hendrycks & Gimpel, 2017; Liang et al., 2018; Ovadia et al., 2019; Kamath et al., 2020). On the theoretical side, early work characterized optimal abstention rules given well-specified models (Chow, 1970; Hellman & Raviv, 1970), with more recent work on learning with perfect precision (El-Yaniv & Wiener, 2010; Khani et al., 2016) and guaranteed risk (Geifman & El-Yaniv, 2017). We build on this literature by establishing general conditions on the margin distribution for when selective classification helps, and importantly, by showing that it can magnify group disparities.

Group disparities. The problem of models performing poorly on some groups of data has been widely reported (e.g., Hovy & Søgaard (2015); Blodgett et al. (2016); Corbett-Davies et al. (2017); Tatman (2017); Hashimoto et al. (2018)). These disparities can arise when models latch onto spurious correlations, e.g., demographics (Buolamwini & Gebru, 2018; Borkan et al., 2019), image backgrounds (Ribeiro et al., 2016; Xiao et al., 2020), spurious clinical variables (Badgeley et al., 2019; Oakden-Rayner et al., 2020), or linguistic artifacts (Gururangan et al., 2018; McCoy et al., 2019).
These disparities have implications for model robustness and equity, and mitigating them is an important open challenge (Dwork et al., 2012; Hardt et al., 2016; Kleinberg et al., 2017; Duchi et al., 2019; Sagawa et al., 2020) . Our work shows that selective classification can exacerbate this problem and must therefore be used with care.

3. SETUP

A selective classifier takes in an input x ∈ X and either predicts a label y ∈ Y or abstains. We study standard confidence-based selective classifiers (ŷ, ĉ), where ŷ : X → Y outputs a prediction and ĉ : X → R + outputs the model's confidence in that prediction. The selective classifier abstains on x whenever its confidence ĉ(x) is below some threshold τ and predicts ŷ(x) otherwise.
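To make the abstention rule concrete, the following minimal sketch (our own illustrative code with a hypothetical toy predictor; not part of the paper's pipeline) wraps an arbitrary prediction function and confidence function:

```python
# Minimal sketch of a confidence-based selective classifier. The predictor and
# confidence function below are hypothetical toys for illustration only.

def selective_predict(predict, confidence, x, tau):
    """Return predict(x) if confidence(x) >= tau, else None (abstain)."""
    if confidence(x) >= tau:
        return predict(x)
    return None  # abstain

# Toy 1-D classifier: predict class 1 if x > 0, with confidence |x|.
predict = lambda x: int(x > 0)
confidence = lambda x: abs(x)

print(selective_predict(predict, confidence, 2.0, tau=1.0))  # predicts 1
print(selective_predict(predict, confidence, 0.5, tau=1.0))  # abstains (None)
```

Varying `tau` trades coverage for accuracy, exactly the trade-off studied throughout the paper.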

Data and training.

We consider a data distribution D over X × Y × G, where G = {1, 2, . . . , k} corresponds to a group variable that is unobserved by the model. We study the common setting where the model ŷ is trained under full coverage (i.e., without taking into account any abstentions) and the confidence function ĉ is then derived from the trained model, as described in the next paragraph. We will primarily consider models ŷ trained by empirical risk minimization (i.e., to minimize the average training loss); in that setting, the group g ∈ G is never observed. In Section 7, we will consider the group DRO training algorithm (Sagawa et al., 2020), which observes g at training time only. In both cases, g is not observed at test time and the model thus does not take in g. This is a common assumption: e.g., we might want to ensure that a face recognition model has equal accuracies across genders, but the model only sees the photograph (x) and not the gender (g).

Confidence. We will primarily consider softmax response (SR) selective classifiers, which take ĉ(x) to be the normalized logit of the predicted class. Formally, we consider models that estimate p(y | x) (e.g., through a softmax) and predict ŷ(x) = arg max_{y ∈ Y} p(y | x), with the corresponding probability estimate p(ŷ(x) | x). For binary classifiers, we define the confidence ĉ(x) as

ĉ(x) = (1/2) log [ p(ŷ(x) | x) / (1 - p(ŷ(x) | x)) ].    (1)

This corresponds to a confidence of ĉ(x) = 0 when p(ŷ(x) | x) = 0.5, i.e., when the classifier is completely unsure of its prediction. We generalize this notion to multi-class classifiers in Section A.1. Softmax response is a popular technique applicable to neural networks and has been shown to improve average accuracies on a range of applications (Geifman & El-Yaniv, 2017). Other methods offer alternative ways of computing ĉ(x); in Appendix B.1, we also run experiments where ĉ(x) is obtained via Monte Carlo (MC) dropout (Gal & Ghahramani, 2016), with similar results.

Metrics. The performance of a selective classifier at a threshold τ is typically measured by its average (selective) accuracy on its predicted points, P[ŷ(x) = y | ĉ(x) ≥ τ], and its average coverage, i.e., the fraction of predicted points P[ĉ(x) ≥ τ]. We call the average accuracy at threshold 0 the full-coverage accuracy, which corresponds to the standard notion of accuracy without any abstentions. We always use the term accuracy w.r.t. some threshold τ ≥ 0; where appropriate, we emphasize this by calling it selective accuracy, but we use these terms interchangeably in this paper. Following convention, we evaluate models by varying τ and tracing out the accuracy-coverage curve (El-Yaniv & Wiener, 2010). As Figure 1 illustrates, this curve is fully determined by the distribution of the margin, which is ĉ(x) on correct predictions (ŷ(x) = y) and -ĉ(x) otherwise. We are also interested in evaluating performance on each group. For a group g ∈ G, we compute its group (selective) accuracy by conditioning on the group: P[ŷ(x) = y | g, ĉ(x) ≥ τ].

Datasets. We consider five datasets (Table 1) on which prior work has shown that models latch onto spurious correlations, thereby performing well on average but poorly on the groups of data where the spurious correlation does not hold. Following Sagawa et al. (2020), we define a set of labels Y as well as a set of attributes A that are spuriously correlated with the labels, and then form one group for each (y, a) ∈ Y × A. For example, in the pleural effusion example from Section 1, one group would be patients with pleural effusion (y = 1) but no support devices (a = 0). Each dataset has |Y| = |A| = 2, except MultiNLI, which has |Y| = 3. More dataset details are in Appendix C.1.
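As an illustrative sketch of these metrics (our own code; the probabilities below are hypothetical), we can compute margins from predicted probabilities via Equation (1) and then read off selective accuracy and coverage at any threshold:

```python
import math

def margin(p_pred, correct):
    """Signed margin: +c(x) if the prediction is correct, -c(x) otherwise,
    where c(x) = 0.5 * log(p / (1 - p)) is the confidence of Equation (1)."""
    c = 0.5 * math.log(p_pred / (1.0 - p_pred))
    return c if correct else -c

def selective_metrics(margins, tau):
    """Average selective accuracy and coverage at a threshold tau >= 0."""
    predicted = [m for m in margins if abs(m) >= tau]  # confidence >= tau
    coverage = len(predicted) / len(margins)
    accuracy = sum(m >= tau for m in predicted) / len(predicted)
    return accuracy, coverage

# Hypothetical predictions: (max softmax probability, whether it was correct).
preds = [(0.95, True), (0.9, True), (0.8, False), (0.55, True), (0.6, False)]
margins = [margin(p, ok) for p, ok in preds]

print(selective_metrics(margins, 0.0))  # full coverage: accuracy 0.6, coverage 1.0
print(selective_metrics(margins, 0.5))  # abstains on the low-confidence points
```

Here raising the threshold drops the two least confident predictions (one correct, one incorrect), so accuracy rises to 2/3 while coverage falls to 0.6.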

4. EVALUATING SELECTIVE CLASSIFICATION ON GROUPS

We start by investigating how selective classification affects group (selective) accuracies across the five datasets in Table 1. We train standard models with empirical risk minimization, i.e., to minimize average training loss, using ResNet50 for CelebA and Waterbirds; DenseNet121 for CheXpert-device; and BERT for CivilComments and MultiNLI. Details are in Appendix C. We focus on softmax response (SR) selective classifiers, but show similar results for MC-dropout in Appendix B.1.

Accuracy-coverage curves. Figure 2 shows group accuracy-coverage curves for each dataset, with the average in blue, the worst group in red, and the other groups in gray. On all datasets, average accuracies improve as coverage decreases. However, the worst-group curves fall into three categories:

1. Decreasing. Strikingly, on CelebA, worst-group accuracy decreases with coverage: the more confident the model is on worst-group points, the more likely it is to be incorrect.

2. Mixed. On Waterbirds, CheXpert-device, and CivilComments, as coverage decreases, worst-group accuracy sometimes increases (though not by much, except at noisy, low coverages) and sometimes decreases.

3. Slowly increasing. On MultiNLI, as coverage decreases, worst-group accuracy consistently improves but more slowly than other groups: from full to 50% average coverage, worst-group accuracy goes from 65% to 75%, while the second-worst group accuracy goes from 77% to 95%.

Figure 3: By construction, these share the same average accuracy-coverage curves (blue line). Similar results for MC-dropout are in Figure 7.

Group-agnostic and Robin Hood references. The results above show that even when selective classification is helping the worst group, it seems to help other groups more. We formalize this notion by comparing the selective classifier to a matching group-agnostic reference that is derived from it and that tries to abstain equally across groups.
At each threshold, the group-agnostic reference makes the same numbers of correct and incorrect predictions as its corresponding selective classifier, but distributes these predictions uniformly at random across points without regard to group identities (Algorithm 1). By construction, it has an identical average accuracy-coverage curve to its corresponding selective classifier, but can differ on the group accuracies. We show in Appendix A.2 that it satisfies equalized odds (Hardt et al., 2016) w.r.t. which points it predicts or abstains on. The group-agnostic reference distributes abstentions equally across groups; from the perspective of closing disparities between groups, this is the least that we might hope for.

Ideally, selective classification would preferentially increase worst-group accuracy until it matches the other groups. We can capture this optimistic scenario by constructing, for a given selective classifier, a corresponding Robin Hood reference which, as above, also makes the same numbers of correct and incorrect predictions (see Algorithm 2 in Appendix A.3). However, unlike the group-agnostic reference, the Robin Hood reference does not choose its correct predictions uniformly at random; instead, we prioritize picking them from the worst group, then the second-worst group, and so on. Likewise, we prioritize picking the incorrect predictions from the best group, then the second-best group, and so on. This results in worst-group accuracy rapidly increasing at the cost of the best group.

Neither the group-agnostic nor the Robin Hood reference is an algorithm that we could implement in practice without already knowing all of the groups and labels. They act instead as references: a selective classifier that preferentially benefits the worst group would have a worst-group accuracy-coverage curve that lies between the group-agnostic and Robin Hood curves.
Unfortunately, Figure 3 shows that SR selective classifiers substantially underperform even their group-agnostic counterparts: they disproportionately help groups that already have higher accuracies, further exacerbating the disparities between groups. We show similar results for MC-dropout in Section B.1.

Algorithm 1: Group-agnostic reference for (ŷ, ĉ) at threshold τ
Input: Selective classifier (ŷ, ĉ), threshold τ, test data D.
Output: The sets of correct predictions C^ga_τ ⊆ D and incorrect predictions I^ga_τ ⊆ D that the group-agnostic reference for (ŷ, ĉ) makes at threshold τ.
1. Let C_τ be the set of all examples that (ŷ, ĉ) correctly predicts at threshold τ: C_τ = {(x, y, g) ∈ D | ŷ(x) = y and ĉ(x) ≥ τ}. Sample a subset C^ga_τ of size |C_τ| uniformly at random from C_0, the set of all examples that ŷ would have predicted correctly at full coverage.
2. Let I_τ be the analogous set of incorrect predictions at τ: I_τ = {(x, y, g) ∈ D | ŷ(x) ≠ y and ĉ(x) ≥ τ}. Sample a subset I^ga_τ of size |I_τ| uniformly at random from I_0, the analogous set of incorrect predictions at full coverage.
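Because Algorithm 1 distributes predictions uniformly at random, its expected group accuracies have a simple closed form. The sketch below (our own illustrative code; the counts and argument names are hypothetical) computes them:

```python
def group_agnostic_group_accuracy(C_tau, I_tau, C0, I0, C0_g, I0_g):
    """Expected accuracy of the group-agnostic reference (Algorithm 1) on a
    group g at threshold tau, in expectation over its random sampling.
    C_tau / I_tau: counts of correct / incorrect predictions at tau;
    C0 / I0: the same counts at full coverage;
    C0_g / I0_g: the full-coverage counts restricted to group g."""
    correct_g = C_tau * C0_g / C0      # expected correct predictions in group g
    incorrect_g = I_tau * I0_g / I0    # expected incorrect predictions in group g
    return correct_g / (correct_g + incorrect_g)

# Hypothetical counts: as tau grows, the reference spreads abstentions
# uniformly, so the group's accuracy changes only slowly.
print(group_agnostic_group_accuracy(C_tau=700, I_tau=100,
                                    C0=800, I0=200,
                                    C0_g=60, I0_g=40))  # ~0.724
```

At full coverage (C_tau = C0, I_tau = I0) this reduces to the group's full-coverage accuracy, matching the construction in the text.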

5. ANALYSIS: MARGIN DISTRIBUTIONS AND ACCURACY-COVERAGE CURVES

We saw in the previous section that while selective classification typically increases average (selective) accuracy, it can either increase or decrease worst-group accuracy. We now turn to a theoretical analysis of this behavior. Specifically, we establish the conditions under which we can expect to see its two extremes: when accuracy monotonically increases, or decreases, as a function of coverage. While these extremes do not fully explain the empirical phenomena, e.g., why worst-group accuracy sometimes increases and then decreases, our analysis broadly captures why accuracy monotonically increases on average with decreasing coverage but displays mixed behavior on the worst group.

Our central objects of study are the margin distributions for each group. Recall that the margin of a selective classifier (ŷ, ĉ) on a point (x, y) is its confidence ĉ(x) ≥ 0 if the prediction is correct (ŷ(x) = y) and -ĉ(x) ≤ 0 otherwise. The selective accuracies on average and on the worst group are thus completely determined by their respective margin distributions, which we show for our datasets in Figure 4. The worst-group and average distributions are very different: the worst-group distributions are consistently shifted to the left, with many confident but incorrect examples. Our approach will be to characterize what properties of a general margin distribution F lead to monotonically increasing or decreasing accuracy. Then, by letting F be the overall or worst-group margin distribution, we can see how the differences in these distributions lead to differences in accuracy as a function of coverage.

Setup. We consider distributions over margins that have a differentiable cumulative distribution function (CDF) and a density, denoted by corresponding upper- and lowercase variables (e.g., F and f, respectively). Each margin distribution F corresponds to a selective classifier over some data distribution.
We denote the corresponding (selective) accuracy of the classifier at threshold τ as A_F(τ) = (1 - F(τ)) / (F(-τ) + 1 - F(τ)). Since increasing the threshold τ monotonically decreases coverage, we focus on studying accuracy as a function of τ. All proofs are in Appendix D.
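As a quick numeric sanity check of this formula (our own illustrative code; the Gaussian margin distribution is a hypothetical example, not one of the paper's fitted distributions):

```python
import math

def normal_cdf(t, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

def selective_accuracy(F, tau):
    """A_F(tau) = (1 - F(tau)) / (F(-tau) + 1 - F(tau))."""
    return (1.0 - F(tau)) / (F(-tau) + 1.0 - F(tau))

# Margin ~ N(1, 1): full-coverage accuracy is 1 - F(0) = Phi(1) ~ 0.84.
F = lambda t: normal_cdf(t, mu=1.0)
for tau in [0.0, 0.5, 1.0, 2.0]:
    print(tau, round(selective_accuracy(F, tau), 3))
# Accuracy increases with tau, i.e., as coverage decreases.
```

This Gaussian case previews Proposition 1: the distribution is symmetric and left-log-concave with full-coverage accuracy above 1/2, so accuracy is monotonically increasing in τ.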

5.1. SYMMETRIC MARGIN DISTRIBUTIONS

We begin with symmetric distributions. We introduce a generalization of log-concavity, which we call left-log-concavity; for symmetric distributions, left-log-concavity corresponds to monotonicity of the accuracy-coverage curve, with the direction determined by the full-coverage accuracy.

Definition 1 (Left-log-concave distributions). A distribution is left-log-concave if its CDF is log-concave on (-∞, µ], where µ is the mean of the distribution.

Left-log-concave distributions are a superset of the broad family of log-concave distributions (e.g., Gaussian, beta, uniform), which require log-concave densities (instead of CDFs) on their entire support (Boyd & Vandenberghe, 2004). Notably, they can be multimodal: a symmetric mixture of two Gaussians is left-log-concave but not generally log-concave (Lemma 1 in Appendix D).

Proposition 1 (Left-log-concavity and monotonicity). Let F be the CDF of a symmetric distribution. If F is left-log-concave, then A_F(τ) is monotonically increasing in τ if A_F(0) ≥ 1/2 and monotonically decreasing otherwise. Conversely, if A_{F_d}(τ) is monotonically increasing for all translations F_d such that F_d(τ) = F(τ - d) for all τ and A_{F_d}(0) ≥ 1/2, then F is left-log-concave.

Proposition 1 is consistent with the observation that selective classification tends to improve average accuracy but hurt worst-group accuracy. As an illustration, consider a margin distribution that is a symmetric mixture of two Gaussians, each corresponding to a group, and where the average accuracy is >50% at full coverage but the worst-group accuracy is <50%. As the overall and worst-group margin distributions are both left-log-concave (Lemma 1), Proposition 1 implies that worst-group accuracy will decrease monotonically with τ while the average accuracy improves monotonically. Applied to CelebA (Figure 4), Proposition 1 is also consistent with how average accuracy improves while worst-group accuracy, which is <50% at full coverage, worsens.
Finally, Proposition 1 also helps to explain why selective classification generally improves average accuracy in the literature, as average accuracies are typically high at full coverage and margin distributions often resemble Gaussians (Balasubramanian et al., 2011; Lakshminarayanan et al., 2017) .
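The two-Gaussian illustration above can be checked numerically. In the sketch below (our own code; the mixture parameters are hypothetical), the worst-group margin distribution has full-coverage accuracy below 1/2, so its accuracy falls as τ grows while the average accuracy rises:

```python
import math

Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))  # standard normal CDF

def acc(F, tau):
    """Selective accuracy A_F(tau) for a margin CDF F."""
    return (1 - F(tau)) / (F(-tau) + 1 - F(tau))

# Hypothetical margins: majority group ~ N(1.5, 1), worst group ~ N(-0.5, 1).
p = 0.2                                                  # worst-group fraction
F_wg = lambda t: Phi(t + 0.5)                            # full-coverage acc ~0.31 < 1/2
F_avg = lambda t: p * F_wg(t) + (1 - p) * Phi(t - 1.5)   # full-coverage acc ~0.81 > 1/2

for tau in [0.0, 0.5, 1.0, 2.0]:
    print(tau, round(acc(F_avg, tau), 3), round(acc(F_wg, tau), 3))
# Average accuracy rises with tau while worst-group accuracy falls.
```

This mirrors the CelebA behavior described in the text: abstaining helps on average precisely while hurting the group that starts below 50%.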

5.2. SKEW-SYMMETRIC MARGIN DISTRIBUTIONS

The results above apply only to symmetric margin distributions. As not all of the margin distributions in Figure 4 are symmetric, we extend our analysis to asymmetric margin distributions by building upon prior work on skew-symmetric distributions (Azzalini & Regoli, 2012).

Definition 2. A distribution with density f_{α,µ} is skew-symmetric with skew α and center µ if f_{α,µ}(τ) = 2h(τ - µ)G(α(τ - µ)) for all τ ∈ R, where h is the density of a distribution that is symmetric about 0, and G is the CDF of a potentially different distribution that is also symmetric about 0.

In other words, f_{α,µ} is a skewed form of the symmetric density h, where higher α means more right skew, and setting α = 0 yields a (translated) h. Skew-symmetric distributions are a broad family and include, e.g., skew-normal distributions, as well as all symmetric distributions. In Appendix D.3, we show some properties of skew-symmetric distributions as they pertain to selective classification, e.g., that margin distributions that are more right-skewed have higher accuracies (Proposition 6). Our result is that skewing a symmetric distribution in the "same" direction preserves monotonicity: if accuracy is monotonically increasing, then right skew (which increases accuracy) preserves this.

Proposition 2 (Skew in the same direction preserves monotonicity). Let F_{α,µ} be the CDF of a skew-symmetric distribution. If the accuracy of its symmetric version, A_{F_{0,µ}}(τ), is monotonically increasing in τ, then A_{F_{α,µ}}(τ) is also monotonically increasing in τ for any α > 0. Similarly, if A_{F_{0,µ}}(τ) is monotonically decreasing in τ, then A_{F_{α,µ}}(τ) is also monotonically decreasing in τ for any α < 0.
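As a numeric illustration (our own code; parameters hypothetical), we can instantiate Definition 2 with h and G both standard normal, recovering the skew-normal family, and observe that right skew (α > 0) raises accuracy relative to the symmetric α = 0 case:

```python
import math

phi = lambda t: math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)  # density h
Phi = lambda t: 0.5 * (1 + math.erf(t / math.sqrt(2)))           # CDF G

def skew_density(t, alpha):
    """Skew-symmetric density 2*h(t)*G(alpha*t) with h, G standard normal,
    i.e., the skew-normal family centered at mu = 0."""
    return 2 * phi(t) * Phi(alpha * t)

def cdf(alpha, x, lo=-8.0, n=4000):
    """Numeric CDF of the skew density via the trapezoid rule."""
    step = (x - lo) / n
    vals = [skew_density(lo + i * step, alpha) for i in range(n + 1)]
    return step * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

def acc(alpha, tau):
    F = lambda t: cdf(alpha, t)
    return (1 - F(tau)) / (F(-tau) + 1 - F(tau))

# alpha = 0 is the symmetric case (full-coverage accuracy 1/2); right skew helps:
print(round(acc(0.0, 0.0), 3), round(acc(2.0, 0.0), 3))
```

The full-coverage accuracy jumps from 1/2 at α = 0 to roughly 0.85 at α = 2, consistent with the right-skew property referenced above (Proposition 6).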

5.3. DISCUSSION

Proposition 1 relates the left-log-concavity of a symmetric margin distribution to the monotonicity of selective accuracy, with the direction of monotonicity (increasing or decreasing) determined by the full-coverage accuracy. Proposition 2 then states that if we have a symmetric margin distribution with monotone accuracy, skewing it in the same direction preserves monotonicity. Combining both propositions, we have that if F_{0,µ} is symmetric and left-log-concave with full-coverage accuracy >50%, then the accuracy A_{F_{α,µ}}(τ) of any skewed version F_{α,µ} with α > 0 is monotonically increasing in τ. Since accuracy-coverage curves are preserved under all odd, monotone transformations of margins (Lemma 8 in Appendix D), these results also generalize to odd, monotone transformations of these (skew-)symmetric distributions. As many margin distributions, both in Figure 4 and in the broader literature (Balasubramanian et al., 2011; Lakshminarayanan et al., 2017), resemble the distributions studied above (e.g., Gaussians and skewed Gaussians), we thus expect selective classification to improve average accuracy but worsen worst-group accuracy when worst-group accuracy at full coverage is low to begin with.

An open question is how to characterize the properties of margin distributions that lead to non-monotone behavior. For example, on Waterbirds and CheXpert-device, accuracies first increase and then decrease with decreasing coverage (Figure 2). These two datasets have worst-group margin distributions with full-coverage accuracies >50% but that are left-skewed, with skewness -0.30 and -0.33 respectively (Figure 4), so we cannot apply Proposition 2 to describe them.

6. ANALYSIS: COMPARISON TO GROUP-AGNOSTIC REFERENCE

Even if selective classification improves worst-group accuracy, it can still exacerbate group disparities, underperforming the group-agnostic reference on the worst group (Section 4). In this section, we continue our analysis and show that while it is possible to outperform the group-agnostic reference, it is challenging to do so, especially when the accuracy disparity at full coverage is large.

Setup. Throughout this section, we decompose the margin distribution into two components, F = p F_wg + (1 - p) F_others, where F_wg and F_others correspond to the margin distributions of the worst group and of all other groups combined, respectively; p is the fraction of examples in the worst group; and the worst group has strictly worse accuracy at full coverage than the other groups (i.e., A_{F_wg}(0) < A_{F_others}(0)). Recall from Section 4 that for any selective classifier (i.e., any margin distribution), its group-agnostic reference has the same average accuracy at each threshold τ but potentially different group accuracies. We denote the worst-group accuracy of the group-agnostic reference as Ã_{F_wg}(τ), which can be written in terms of F_wg, F_others, and p (Appendix A.2). We otherwise continue with the notation from Section 5, and all proofs are in Appendix E.

A selective classifier with margin distribution F is said to outperform the group-agnostic reference on the worst group if A_{F_wg}(τ) ≥ Ã_{F_wg}(τ) for all τ ≥ 0. To establish a necessary condition for outperforming the reference, we study the neighborhood of τ = 0, which corresponds to full coverage:

Proposition 3 (Necessary condition for outperforming the group-agnostic reference). Assume that 1/2 < A_{F_wg}(0) < A_{F_others}(0) < 1 and the worst-group density f_wg(0) > 0. If Ã_{F_wg}(τ) ≤ A_{F_wg}(τ) for all τ ≥ 0, then

f_others(0) / f_wg(0) ≤ (1 - A_{F_others}(0)) / (1 - A_{F_wg}(0)).

The RHS is the ratio of full-coverage errors; the larger the disparity between the worst group and the other groups at full coverage, the harder it is to satisfy this condition. In Appendix F, we simulate mixtures of Gaussians and show that this condition is rarely fulfilled.

Motivated by the empirical margin distributions, we apply Proposition 3 to the setting where F_wg and F_others are both log-concave and are translated and scaled versions of each other. We show that the worst group must have lower variance than the others to outperform the group-agnostic reference:

Corollary 1 (Outperforming the group-agnostic reference requires smaller scaling for log-concave distributions). Assume that 1/2 < A_{F_wg}(0) < A_{F_others}(0) < 1, F_wg is log-concave, and f_others(τ) = v f_wg(v(τ - µ_others) + µ_wg) for all τ ∈ R, where v is a scaling factor. If Ã_{F_wg}(τ) ≤ A_{F_wg}(τ) for all τ ≥ 0, then v < 1.

This is consistent with the empirical margin distributions on Waterbirds: the worst group has higher variance, implying v > 1 (as v is the ratio of the worst group's standard deviation to the other groups'), and it thus fails to satisfy the necessary condition for outperforming the group-agnostic reference.

A further special case is when F_wg and F_others are log-concave and unscaled translations of each other. Here, selective classification underperforms the group-agnostic reference at all thresholds τ:

Proposition 4 (Translated log-concave distributions underperform the group-agnostic reference). Assume F_wg and F_others are log-concave and f_others(τ) = f_wg(τ - d) for all τ ∈ R. Then for all τ ≥ 0, A_{F_wg}(τ) ≤ Ã_{F_wg}(τ).

This helps to explain our results on CheXpert-device, where the worst-group and average margin distributions are approximately translations of each other, and selective classification significantly underperforms the group-agnostic reference at all confidence thresholds.
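The necessary condition of Proposition 3 is easy to check numerically. In the sketch below (our own code; the unit-variance Gaussian margins and their means are hypothetical), the two group distributions are translations of each other, and the condition fails, consistent with Proposition 4:

```python
import math

phi = lambda t, mu: math.exp(-0.5 * (t - mu) ** 2) / math.sqrt(2 * math.pi)
Phi = lambda t, mu: 0.5 * (1 + math.erf((t - mu) / math.sqrt(2)))

# Unit-variance Gaussian margin distributions, translated apart.
mu_wg, mu_others = 0.3, 1.5
acc0 = lambda mu: 1 - Phi(0, mu)                   # full-coverage accuracy

lhs = phi(0, mu_others) / phi(0, mu_wg)            # f_others(0) / f_wg(0)
rhs = (1 - acc0(mu_others)) / (1 - acc0(mu_wg))    # ratio of full-coverage errors
print(round(lhs, 3), round(rhs, 3), lhs <= rhs)    # condition fails (False)
```

Here the larger the mean gap between the groups, the smaller the right-hand side becomes, so the condition only gets harder to satisfy as full-coverage disparities grow.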

7. SELECTIVE CLASSIFICATION ON GROUP DRO MODELS

Our above analysis suggests that selective classification tends to exacerbate group disparities, especially when the full-coverage disparities are large. This motivates a potential solution: reducing the disparities at full coverage. Following this idea, we apply selective classification to group DRO models, which use group annotations during training to achieve similar full-coverage accuracies across groups; on these models, group accuracies remain much more similar across coverages (Figure 5). Moreover, worst-group accuracies consistently improve as coverage decreases, and at a rate that is comparable to the group-agnostic reference, though small gaps remain on Waterbirds and CheXpert-device.

While our theoretical analysis motivates the above approach, the analysis ultimately depends on the margin distributions of each group, not just on their full-coverage accuracies. Although group DRO only optimizes for similar full-coverage accuracies across groups, we found that it also leads to much more similar average and worst-group margin distributions compared to ERM (Figure 6), explaining why selective classification behaves more uniformly over groups across all datasets. Group DRO is not a silver bullet, as it relies on group annotations for training, which are not always available. Nevertheless, these results show that closing full-coverage accuracy disparities can mitigate the downstream disparities caused by selective classification.
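For intuition, the sketch below (our own illustrative code with hypothetical losses) contrasts the ERM and worst-group objectives; note that Sagawa et al. (2020) optimize the worst-group objective with an online group-reweighting algorithm rather than this direct maximum:

```python
# Illustrative contrast between the ERM objective and the group DRO
# (worst-group) objective. Losses below are hypothetical per-example values.

def erm_objective(losses_by_group):
    """Average loss over all examples, regardless of group."""
    flat = [l for group in losses_by_group for l in group]
    return sum(flat) / len(flat)

def group_dro_objective(losses_by_group):
    """Worst-group average loss: the quantity group DRO minimizes."""
    return max(sum(g) / len(g) for g in losses_by_group)

# A small worst group with high loss barely moves the ERM objective
# but dominates the group DRO objective:
losses = [[0.1] * 9, [1.0]]
print(erm_objective(losses), group_dro_objective(losses))
```

This is why ERM can leave a small group with low accuracy at full coverage, while group DRO explicitly pushes the groups' losses (and hence full-coverage accuracies) toward each other.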

8. DISCUSSION

We have shown that selective classification can magnify group disparities and should therefore be applied with caution. This is an insidious failure mode, since selective classification generally improves average accuracy and can appear to be working well if we do not look at group accuracies. However, we also found that selective classification can still work well on models that have equal full-coverage accuracies across groups. Training such models, especially without relying on too much additional information at training time, remains an important research direction. On the theoretical side, we characterized the behavior of selective classification in terms of the margin distributions; an open question is how different margin distributions arise from different data distributions, models, training procedures, and selective classification algorithms. Finally, in this paper we focused on studying selective accuracy in isolation; accounting for the cost of abstention and the equity of different coverages on different groups is an important direction for future work.

A SETUP

A.1 SOFTMAX RESPONSE SELECTIVE CLASSIFIERS

In this section, we describe our implementation of softmax response (SR) selective classifiers (Geifman & El-Yaniv, 2017). Recall from Section 3 that a selective classifier is a pair (ŷ, ĉ), where ŷ : X → Y outputs a prediction and ĉ : X → R+ outputs the model's confidence in that prediction, which is always non-negative. SR classifiers are defined for classification neural networks, which generally have a final softmax layer over the k possible classes. For an input point x, we denote its maximum softmax probability, which corresponds to its predicted class ŷ(x), as p(ŷ(x) | x). We defined the confidence ĉ(x) for binary classifiers as

ĉ(x) = (1/2) log [ p(ŷ(x) | x) / (1 − p(ŷ(x) | x)) ].    (7)

Since the maximum softmax probability p(ŷ(x) | x) is at least 0.5 for binary classification, ĉ(x) is non-negative for each x, and is thus a valid confidence. For k > 2 classes, however, p(ŷ(x) | x) can be less than 0.5, in which case ĉ(x) would be negative. To ensure that the confidence is always non-negative, we define ĉ(x) for k classes to be

ĉ(x) = (1/2) log [ p(ŷ(x) | x) / (1 − p(ŷ(x) | x)) ] + (1/2) log(k − 1).    (8)

With k classes, the maximum softmax probability satisfies p(ŷ(x) | x) ≥ 1/k, and therefore we can verify that ĉ(x) ≥ 0 as desired. Moreover, when k = 2, Equation (8) reduces to our original binary confidence; we can therefore interpret the general form in Equation (8) as a normalized logit. Note that ĉ(x) is a monotone transformation of the maximum softmax probability p(ŷ(x) | x). Since the accuracy-coverage curve of a selective classifier depends only on the relative ranking of ĉ(x) across points, we could have equivalently set ĉ(x) to be p(ŷ(x) | x). However, following prior work, we choose the logit-transformed version to make the corresponding distribution of confidences easier to visualize (Balasubramanian et al., 2011; Lakshminarayanan et al., 2017). Finally, we remark on one consequence of SR on the margin distribution for multi-class classification.
Recall that we define the margin of an example to be ĉ(x) on correct predictions (ŷ(x) = y) and −ĉ(x) otherwise, as described in Section 3. In Figure 4, we plot the margin distributions of SR selective classifiers on all five datasets. We observe that on MultiNLI, which is the only multi-class dataset (with k = 3), there is a gap (region of lower density) in the margin distribution around 0. We attribute this gap in part to the comparative rarity of seeing a maximum softmax probability of 1/3 when k = 3 versus seeing 1/2 when k = 2: in the former, all three logits must be the same, while in the latter only two logits must be the same.
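As a minimal sketch of the normalized-logit confidence in Equations (7) and (8) above (the function name is ours, not from the paper):

```python
import math

def sr_confidence(probs):
    """Softmax-response confidence: normalized logit of the max probability.

    `probs` is a list of softmax probabilities over k classes. The extra
    0.5*log(k-1) term makes the confidence non-negative for any k, and the
    expression reduces to 0.5*log(p/(1-p)) when k == 2.
    """
    k = len(probs)
    p = max(probs)  # probability of the predicted class ŷ(x)
    return 0.5 * math.log(p / (1.0 - p)) + 0.5 * math.log(k - 1)
```

At the uniform distribution (p = 1/k) the confidence is exactly 0, matching the claim that ĉ(x) ≥ 0 with equality only when all logits are tied.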

A.2 GROUP-AGNOSTIC REFERENCE

Here, we describe the group-agnostic reference introduced in Section 4 in more detail. We elaborate on the construction from the main text and then define the reference formally. Finally, we show that the group-agnostic reference satisfies equalized odds.

A.2.1 DEFINITION

We begin by recalling the construction described in the main text for a finite test set D. In Algorithm 1, we relied on two important sets: the set of correctly classified points predicted on at threshold τ,

C_τ = {(x, y, g) ∈ D | ŷ(x) = y and ĉ(x) ≥ τ},

and similarly, the set of incorrectly classified points predicted on at threshold τ,

I_τ = {(x, y, g) ∈ D | ŷ(x) ≠ y and ĉ(x) ≥ τ}.

To compute the accuracy of the group-agnostic reference, we sample a subset C^ga_τ of size |C_τ| uniformly at random from C_0, and similarly a subset I^ga_τ of size |I_τ| uniformly at random from I_0. The group-agnostic reference makes predictions on examples in C^ga_τ ∪ I^ga_τ and abstains otherwise. We compute group accuracies over this set of predicted examples. Note that the group accuracies, as defined, are randomized due to the sampling. For the remainder of our analysis, we take the expectation over this randomness to compute the group accuracies. We now generalize the above construction to data distributions D. We first define the fractions of correctly and incorrectly classified points:

Definition 3 (Correctly and incorrectly classified points). Consider a selective classifier (ŷ, ĉ). For each threshold τ, we define the fractions of points that are predicted on (not abstained on), and correctly or incorrectly classified, as

C(τ) = p(ŷ(x) = y ∧ ĉ(x) ≥ τ),    I(τ) = p(ŷ(x) ≠ y ∧ ĉ(x) ≥ τ).

We define the analogous quantities for each group g as

C_g(τ) = p(ŷ(x) = y ∧ ĉ(x) ≥ τ | g),    I_g(τ) = p(ŷ(x) ≠ y ∧ ĉ(x) ≥ τ | g).

For each threshold τ, the group-agnostic reference makes predictions on a C(τ)/C(0) fraction of the C(0) total (probability mass of) correctly classified points. Since each group g has C_g(0) correctly classified points, at threshold τ, the group-agnostic reference will make predictions on C_g(0)C(τ)/C(0) correctly classified points in group g. We can reason similarly over the incorrectly classified points.
Putting it all together, we can define the group-agnostic reference as satisfying the following:

Definition 4 (Group-agnostic reference). Consider a selective classifier (ŷ, ĉ) and let C̃, Ĩ, C̃_g, Ĩ_g denote the quantities analogous to Definition 3 for its matching group-agnostic reference. For each threshold τ, these satisfy

C̃(τ) = C(τ),    (15)
Ĩ(τ) = I(τ),    (16)

and for each threshold τ and group g,

C̃_g(τ) = C_g(0)C(τ)/C(0),    (17)
Ĩ_g(τ) = I_g(0)I(τ)/I(0).    (18)

The group-agnostic reference thus has the following accuracy on group g:

Ã_g(τ) = C̃_g(τ) / (C̃_g(τ) + Ĩ_g(τ))    (19)
       = [C_g(0)C(τ)/C(0)] / [C_g(0)C(τ)/C(0) + I_g(0)I(τ)/I(0)].    (20)
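The expected group accuracy of the group-agnostic reference follows directly from these quantities; a minimal sketch (function and argument names are ours):

```python
def group_agnostic_accuracy(Cg0, Ig0, C_tau, C0, I_tau, I0):
    """Expected group accuracy of the group-agnostic reference (Definition 4).

    Cg0/Ig0: group g's correct/incorrect mass at full coverage;
    C_tau, C0, I_tau, I0: overall correct/incorrect mass at threshold tau and 0.
    """
    c = Cg0 * C_tau / C0  # correct predictions retained for group g
    i = Ig0 * I_tau / I0  # incorrect predictions retained for group g
    return c / (c + i)
```

If the classifier keeps the same fraction of correct and incorrect points (C(τ)/C(0) = I(τ)/I(0)), the reference's group accuracy equals the group's full-coverage accuracy; retaining proportionally fewer incorrect points raises it.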

A.2.2 CONNECTION TO EQUALIZED ODDS

We now show that the group-agnostic reference satisfies equalized odds with respect to which points it predicts or abstains on. The goal of a selective classifier is to make predictions on points it would get correct (i.e., ŷ(x) = y) while abstaining on points it would get incorrect (i.e., ŷ(x) ≠ y). We can view this as a meta-classification problem, where a true positive occurs when the selective classifier decides to make a prediction on a point x and gets it correct (ŷ(x) = y), and a false positive occurs when the selective classifier decides to make a prediction on a point x and gets it incorrect (ŷ(x) ≠ y). As such, we can define the true positive rate R^TP(τ) and false positive rate R^FP(τ) of a selective classifier:

Definition 5. The true positive rate of a selective classifier at threshold τ is R^TP(τ) = C(τ)/C(0), and the false positive rate at threshold τ is R^FP(τ) = I(τ)/I(0). Analogously, the true positive and false positive rates on a group g are R^TP_g(τ) = C_g(τ)/C_g(0) and R^FP_g(τ) = I_g(τ)/I_g(0).

The group-agnostic reference satisfies equalized odds (Hardt et al., 2016) with respect to this definition:

Proposition 5. The group-agnostic reference defined in Definition 4 has equal true positive and false positive rates for all groups g ∈ G and thus satisfies equalized odds.

Proof. By construction of the group-agnostic reference (Definition 4), we have that

C̃_g(τ) = C_g(0)C(τ)/C(0)    (25)
        = C̃_g(0)C̃(τ)/C̃(0),    (26)

and therefore, for each group g, the true positive rate R̃^TP_g of the group-agnostic reference on g is equal to its average true positive rate R̃^TP(τ):

R̃^TP_g(τ) = C̃_g(τ)/C̃_g(0)    (27)
          = C̃(τ)/C̃(0)    (28)
          = R̃^TP(τ).    (29)

Each group thus has the same true positive rate under the group-agnostic reference. By similar reasoning, each group also has the same false positive rate. By the definition of equalized odds, the group-agnostic reference thus satisfies equalized odds.

A.3 ROBIN HOOD REFERENCE

In this section, we define the Robin Hood reference, which preferentially increases the worst-group accuracy through abstentions until it matches the other groups. Like the group-agnostic reference, we constrain the Robin Hood reference to make the same number of correct and incorrect predictions as a given selective classifier. We formalize the definition of the Robin Hood reference in Algorithm 2: it makes predictions on the subset of examples from D that has the smallest discrepancy between best-group and worst-group accuracies, while still matching the number of correct and incorrect predictions of the given selective classifier.

Algorithm 2: Robin Hood reference at threshold τ
Input: Selective classifier (ŷ, ĉ), threshold τ, test data D
Output: The set of points P ⊆ D that the Robin Hood reference for (ŷ, ĉ) makes predictions on at threshold τ.
1. Recall from Algorithm 1 that C_τ and I_τ are the sets of correct and incorrect points predicted on at threshold τ, respectively. Define Q, the set of subsets of D that have the same number of correct and incorrect predictions as (ŷ, ĉ) at threshold τ:

   Q = {S ⊆ D | |S ∩ C_0| = |C_τ|, |S ∩ I_0| = |I_τ|}.

2. Let acc_g(S) be the accuracy of ŷ on the points in S that belong to group g. Return the set P ∈ Q that minimizes the difference between the best-group and worst-group accuracies:

   P = argmin_{S ∈ Q} [max_g acc_g(S) − min_g acc_g(S)].

Since enumerating over all possible subsets of the test data D is intractable, we compute the accuracies of the Robin Hood reference iteratively. Starting from full coverage, we abstain on examples from lowest to highest confidence. Whenever we abstain on an incorrectly classified example, we assume it comes from the current lowest-accuracy group, and similarly, whenever we abstain on a correctly classified example, we assume it comes from the current highest-accuracy group. This reduces disparities to the maximum extent possible as the threshold τ increases.
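A compact sketch of this iterative computation, tracking only per-group counts (the data layout and names are ours, not the paper's code):

```python
def robin_hood_accuracies(correct, total, abstain_incorrect_flags):
    """Iteratively approximate the Robin Hood reference's group accuracies.

    `correct[g]`, `total[g]`: per-group correct/total predictions at full
    coverage. `abstain_incorrect_flags`: for each abstention of the given
    classifier (lowest confidence first), True if the abstained example was
    incorrectly classified. Incorrect abstentions are charged to the current
    worst-accuracy group, correct abstentions to the current best-accuracy
    group, which shrinks the best-vs-worst gap as fast as possible.
    """
    correct, total = dict(correct), dict(total)  # avoid mutating the caller's dicts
    acc = lambda g: correct[g] / total[g]
    for is_incorrect in abstain_incorrect_flags:
        if is_incorrect:
            g = min(total, key=acc)   # worst group loses an error
            total[g] -= 1
        else:
            g = max(total, key=acc)   # best group loses a correct point
            correct[g] -= 1
            total[g] -= 1
    return {g: acc(g) for g in total}
```

Abstaining on one incorrect example lifts the worst group's accuracy while leaving the best group untouched, as the reference intends.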

B SUPPLEMENTAL EXPERIMENTS

B.1 MONTE-CARLO DROPOUT

In the main text, we observed that SR selective classifiers monotonically improve average accuracy as coverage decreases, but exacerbate accuracy disparities across groups on all five datasets. To demonstrate that these observations are not specific to SR selective classifiers, we now present our empirical results on another standard selective classification method: Monte-Carlo (MC) dropout (Gal & Ghahramani, 2016; Geifman & El-Yaniv, 2017). We find that MC-dropout selective classifiers exhibit empirical trends similar to those of SR selective classifiers.

MC-dropout selective classifiers. MC-dropout is an alternate way of assigning confidences to points. Taking a model with a dropout layer, the selective classifier first predicts ŷ(x) simply by taking the model output without dropout. To estimate the confidence, it then samples n softmax probabilities corresponding to the label ŷ(x) over the randomness of the dropout layer. The confidence is computed as ĉ(x) = 1/s, where s² is the variance of the sampled probabilities. We implement MC-dropout by using the existing dropout layers for BERT and adding dropout to the final fully-connected layer for ResNet and DenseNet, with dropout probability 0.1. We present empirical results for n = 10; Gal & Ghahramani (2016) observed that this was sufficient to produce good confidence estimates.

Results. The MC-dropout selective classifiers exhibit trends similar to those observed in Section 4, demonstrating that the observed empirical trends are not specific to softmax response: even though the average accuracy improves monotonically as coverage decreases across all five datasets, the worst-group accuracy tends to decrease for CelebA, fails to increase consistently for CheXpert, Waterbirds, and CivilComments, and increases consistently but slowly for MultiNLI. Comparing SR and MC-dropout selective classifiers, we observe that the MC-dropout selective classifiers perform slightly worse. For example, we see a more prominent drop in worst-group accuracy for Waterbirds and much smaller improvements in worst-group accuracy for CheXpert.
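A minimal sketch of the MC-dropout confidence described above (the n stochastic forward passes themselves are framework-specific and omitted; the function name is ours):

```python
import statistics

def mc_dropout_confidence(sampled_probs):
    """MC-dropout confidence: inverse standard deviation of the softmax
    probabilities for the predicted class, collected over n stochastic
    forward passes with dropout active.
    """
    s = statistics.stdev(sampled_probs)  # sample standard deviation
    return float("inf") if s == 0 else 1.0 / s
```

Predictions whose sampled probabilities barely move under dropout receive high confidence; unstable predictions receive low confidence.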

B.2 GROUP DRO

We showed in Section 7 that SR selective classifiers trained with the group DRO objective successfully improve worst-group selective accuracies as coverage decreases, and perform comparably to the group-agnostic reference. We now present additional empirical results for these selective classifiers.

Accuracy-coverage curves. We first present the accuracy-coverage curves for all groups, along with the group coverage trends, in Figure 8. Group selective accuracies tend to improve monotonically, in stark contrast with our results on standard (ERM) selective classifiers. These trends hold generally across groups and datasets, with a few exceptions: selective accuracies drop on two groups in MultiNLI at roughly 20% average coverage, and selective accuracies improve slowly on two groups in CivilComments. Below, we look into these anomalies and offer potential explanations. For MultiNLI, we first note that the drops are observed at very low group coverages (lower than 1%), at which point the accuracies are computed from very few examples and are thus noisy. In addition, much of the anomaly can be explained by label noise: we manually inspect the 20 examples from the above two groups with the highest softmax confidences, and find that 17 of them are labeled incorrectly. The observations on CivilComments can also potentially be attributed to label noise. Removing examples with high inter-annotator disagreement (with the fraction of toxic annotations between 0.5 and 0.6) yields an accuracy-coverage curve that fits the broader empirical trends.

C EXPERIMENT DETAILS

C.1 DATASETS

CelebA. Models have been shown to latch onto spurious correlations between labels and demographic attributes such as race and gender (Buolamwini & Gebru, 2018; Joshi et al., 2018), and we study this on the CelebA dataset (Liu et al., 2015). Following Sagawa et al. (2020), the task is to classify hair color as Y = {blond, non-blond}, and the label is spuriously correlated with the gender A = {male, female}; blond males make up the smallest of the four groups and tend to be the worst group empirically.

Waterbirds. Object recognition models are prone to using image backgrounds as a proxy for the label (Ribeiro et al., 2016; Xiao et al., 2020). We study this on the Waterbirds dataset (Sagawa et al., 2020), constructed by placing images of birds from the Caltech-UCSD Birds dataset (Wah et al., 2011) on backgrounds from the Places dataset (Zhou et al., 2017). The task is to classify a photograph of a bird as one of Y = {waterbird, landbird}, and the label is spuriously correlated with the background A = {water background, land background}. Of the four groups, waterbirds on land backgrounds make up the smallest group, with only 56 examples out of 4,795 training examples, and they tend to be the worst group empirically. We use the train/val/test split provided by Sagawa et al. (2020), and also follow their protocol for computing average metrics: to compute average accuracies and coverages, we first compute the metrics for each group and then take a weighted average according to group proportions in the training set, in order to account for the discrepancy in group proportions across the splits.

CheXpert-device. Models can latch onto spurious correlations even in high-stakes applications such as medical imaging. When models are trained to classify whether a patient has certain pathologies from chest X-rays, they have been shown to spuriously detect the presence of a support device, in particular a chest drain, instead (Oakden-Rayner et al., 2020). We study this phenomenon in a modified version of the CheXpert dataset (Irvin et al., 2019), which we call CheXpert-device.
Concretely, the inputs are chest X-rays, labels are Y = {pleural effusion, no pleural effusion}, and spurious attributes indicate the presence of a support device, A = {support device, no support device}. We note that a chest drain is one type of support device and is used to treat suspected pleural effusion (Porcel, 2018). CheXpert-device is a subsampled version of the full CheXpert dataset that manifests the spurious correlation more strongly. To create CheXpert-device, we first create a new 80/10/10 train/val/test split of examples from the publicly available CheXpert train and validation sets, randomly assigning patients to splits so that all X-rays of the same patient fall in the same split. We then subsample the training set; in particular, we enforce that in 90% of examples, the support device label matches the pleural effusion label. Of the four groups, cases of pleural effusion without a support device make up the smallest group, with 5,467 examples out of 112,100 training examples, and they tend to be the worst group empirically. To compute the average accuracies and coverages, we weight groups according to group proportions in the training set, as for Waterbirds. Another complication with CheXpert is that some patients can have multiple X-rays from one visit. Following Irvin et al. (2019), we treat these images as separate examples at training time, but output one prediction for each patient-study pair at evaluation time. Concretely, we predict pleural effusion if the model detects the condition in any of the X-ray images belonging to the patient-study pair, as pathologies may appear clearly in only some X-rays.

CivilComments. In toxic comment detection, models have been shown to latch onto spurious correlations between toxicity and mentions of certain demographic groups (Park et al., 2018; Dixon et al., 2018). We study this in the CivilComments dataset (Borkan et al., 2019).
The task is to classify the toxicity of comments on online articles with labels Y = {toxic, non-toxic}. As spurious attributes, we consider whether each comment mentions a Christian identity, A = {mention of Christian identity, no mention of Christian identity}; non-toxicity is associated with the mention of a Christian identity, often resulting in a high false negative rate on comments with such mentions. Of the four groups, toxic comments with a mention of Christian identity make up the smallest group, with only 2,446 examples out of 269,038 training examples, and they tend to be the worst group empirically. We use the train, validation, and test sets from the WILDS benchmark (Koh et al., 2020), which are formed by randomly splitting on articles and assigning all comments associated with an article to the same split; the three sets thus comprise comments from disjoint sets of articles. The original dataset also contains many additional examples that have toxicity annotations but not identity annotations; we do not use these in our experiments. In the original CivilComments dataset, each comment is given probabilistic labels for both the toxicity and the mention of a Christian identity, where a probabilistic label is the average of binary labels across annotators. Following the associated Kaggle competition, we use binarized labels obtained by thresholding the probabilistic labels at 0.5.

MultiNLI. Lastly, we consider natural language inference (NLI), where the task is to predict whether a hypothesis is entailed by, contradicted by, or neutral to an associated premise, Y = {entailed, contradictory, neutral}. NLI models have been shown to exploit annotation artifacts, for example predicting contradictory whenever negation words such as never or nobody are present (Gururangan et al., 2018). We study this on the MultiNLI dataset (Williams et al., 2018).
To annotate examples' spurious attributes A = {negation words, no negation words}, we consider the following negation words from Gururangan et al. (2018): "nobody", "no", "never", and "nothing". We use the splits from Sagawa et al. (2020).
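The CheXpert-device subsampling step described above, enforcing that the support-device attribute matches the label in 90% of training examples, can be sketched as follows. This is one simple way to implement it (the data layout is assumed, and the paper's actual pipeline also handles patient-level splits):

```python
import random

def subsample_spurious(examples, match_frac=0.9, seed=0):
    """Subsample so that `match_frac` of kept examples have attribute == label.

    `examples`: list of (label, attribute) pairs with binary values. Keeps all
    label-attribute-matching examples and downsamples the mismatched ones to
    the target proportion, strengthening the spurious correlation.
    """
    rng = random.Random(seed)
    match = [e for e in examples if e[0] == e[1]]
    mismatch = [e for e in examples if e[0] != e[1]]
    # mismatched count so that match / (match + mismatch) == match_frac
    n_mismatch = int(len(match) * (1 - match_frac) / match_frac)
    return match + rng.sample(mismatch, min(n_mismatch, len(mismatch)))
```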

C.2 MODELS

We train ResNet (He et al., 2016) for CelebA and Waterbirds (images), DenseNet (Huang et al., 2017) for CheXpert (X-rays), and BERT (Devlin et al., 2019) for CivilComments and MultiNLI (text). For tasks studied in Sagawa et al. (2020) (CelebA, Waterbirds, MultiNLI), we use the hyperparameters from Sagawa et al. (2020). For the others (CivilComments and CheXpert-device), we test the same number of hyperparameter sets for each of ERM and DRO, and report the best set below. Across all image and X-ray tasks, inputs are downsampled to resolution 224 x 224.

CelebA. To train a model on CelebA, we initialize to pretrained ResNet-50. For ERM we optimize with learning rate 1e-4, weight decay 1e-4, batch size 128, and train for 50 epochs. For DRO we use learning rate 1e-5, weight decay 1e-1, and generalization adjustment 1 (described in Sagawa et al. (2020)). The batch size is 128, and we train for 50 epochs.

Waterbirds. For Waterbirds, as with CelebA, we use pretrained ResNet-50 as an initialization. For ERM we use learning rate 1e-3, weight decay 1e-4, batch size 128, and train for 300 epochs. For DRO we use learning rate 1e-5, weight decay 1, generalization adjustment 1, batch size 128, and train for 300 epochs.

CheXpert-device. For CheXpert-device, we fine-tune pretrained DenseNet-121 for three epochs. For ERM we use learning rate 1e-3, no weight decay, batch size 16, and choose the model (out of the first three epochs) with the highest average accuracy (epoch 2). For DRO, we use learning rate 1e-4, weight decay 1e-1, batch size 16, and choose the model with the highest worst-group accuracy (epoch 1).

CivilComments. To train a model for CivilComments, we fine-tune bert-base-uncased using the implementation from Wolf et al. (2019). For both ERM and DRO we use learning rate 1e-5, weight decay 1e-2, and batch size 16.
We train the ERM and DRO models for three epochs with early stopping, then choose the model with the highest average accuracy for ERM (epoch 2) and the highest worst-group accuracy for DRO (epoch 1).

MultiNLI. For MultiNLI we again fine-tune bert-base-uncased using the implementation from Wolf et al. (2019). For ERM, we fine-tune for three epochs with learning rate 2e-5, weight decay 0, and batch size 32. For DRO, we also use learning rate 2e-5, weight decay 0, and batch size 32, but use generalization adjustment 1. For both ERM and DRO, the model after the third epoch is best, in terms of average accuracy for ERM and worst-group accuracy for DRO.

D PROOFS: MARGIN DISTRIBUTIONS AND ACCURACY-COVERAGE CURVES

D.1 LEFT-LOG-CONCAVITY

Recall the definition of left-log-concavity from Section 5:

Definition 1 (Left-log-concave distributions). A distribution is left-log-concave if its CDF is log-concave on (−∞, µ], where µ is the mean of the distribution.

We first prove that a symmetric mixture of Gaussians is left-log-concave, but not necessarily log-concave.

D.1.1 SYMMETRIC MIXTURES OF GAUSSIANS ARE LEFT-LOG-CONCAVE

Lemma 1 (Symmetric mixtures of two Gaussians are left-log-concave). Consider a symmetric mixture of Gaussians with density f = 0.5f_µ + 0.5f_{−µ}, where f_µ is the density of a N(µ, σ²) random variable and likewise f_{−µ} is the density of a N(−µ, σ²) random variable. Then the mixture is left-log-concave for all values of µ ∈ R, σ > 0, but only log-concave if |µ| ≤ σ.

Proof. Without loss of generality, we can take σ = 1, since (left-)log-concavity is invariant to scaling, and also assume that µ is positive. First consider the case where µ ≤ 1. Then the mixture is log-concave, and therefore left-log-concave (Cule et al., 2010). Now consider the case where µ > 1. Cule et al. (2010) show the mixture is no longer log-concave. However, we claim that it is still left-log-concave. We start by studying the gradient of log f:

f′(x)/f(x) = [−exp(−(x−µ)²/2)(x−µ) − exp(−(x+µ)²/2)(x+µ)] / [exp(−(x−µ)²/2) + exp(−(x+µ)²/2)]    (32)
           = −x + µ(1 − exp(−2xµ))/(1 + exp(−2xµ)).    (33)

We claim that f′(x)/f(x) has a local minimum at some x = −a < 0 and, as it is an odd function, a corresponding local maximum at x = a > 0. To show this, we first differentiate to obtain

d/dx [f′(x)/f(x)] = −1 + 4µ² exp(−2xµ)/(1 + exp(−2xµ))².    (34)

Setting the derivative to 0 gives the quadratic equation

0 = −1 + 4µ² exp(−2xµ)/(1 + exp(−2xµ))²    (35)
⇔ (1 + exp(−2xµ))² = 4µ² exp(−2xµ)    (36)
⇔ [exp(−2xµ)]² + (2 − 4µ²) exp(−2xµ) + 1 = 0    (37)
⇔ exp(−2xµ) = 2µ² − 1 ± 2µ√(µ² − 1).    (38)

Since µ > 1, there are two distinct roots of this quadratic, and two corresponding critical points of f′/f. Let v(µ) = 2µ² − 1 + 2µ√(µ² − 1) be the larger root. Then v(µ) is strictly increasing for µ ≥ 1, and since v(1) = 1, we have v(µ) > 1 for all µ > 1. Let x = −a satisfy exp(2aµ) = v(µ). Then −a, the smaller of the critical points of f′/f, is

−a = −log v(µ)/(2µ)    (39)
   = −log(2µ² − 1 + 2µ√(µ² − 1))/(2µ)    (40)
   < 0.    (41)
To show that f′(−a)/f(−a) is a local minimum, we take the second derivative

d²/dx² [f′(x)/f(x)] = d/dx [−1 + 4µ² exp(−2xµ)/(1 + exp(−2xµ))²]    (42)
                    = 8µ³ exp(−2xµ)(exp(−2xµ) − 1)/(exp(−2xµ) + 1)³,    (43)

which at x = −a gives

d²/dx² [f′(x)/f(x)] |_{x=−a} = 8µ³ v(µ)(v(µ) − 1)/(v(µ) + 1)³    (44)
                             > 0,    (45)

since v(µ) > 1. Since −a is the only critical point of f′/f that is less than 0 and it is a local minimum, f′/f must be decreasing on (−∞, −a], which in turn implies that f, and therefore F, is log-concave on (−∞, −a]. It remains to show that F is also log-concave on [−a, 0]. We make use of two facts. First, since −a is a local minimum and the only critical point less than 0, we have f′(−a)/f(−a) ≤ f′(x)/f(x) ≤ f′(0)/f(0) = 0 for all x ∈ [−a, 0]. Second, since f(x) and F(x) are non-negative for all x, f(x)/F(x) is also non-negative for all x. Thus, for all x ∈ [−a, 0],

d/dx [f(x)/F(x)] = [F(x)f′(x) − f(x)²]/F(x)²    (46)
                 = (f(x)/F(x)) · (f′(x)/f(x)) − (f(x)/F(x))²    (47)
                 ≤ 0,    (48)

since f(x)/F(x) ≥ 0 and f′(x)/f(x) ≤ 0 on this interval, and therefore F is also log-concave on [−a, 0].

Remark 1. Note that if f is (left-)log-concave, then F is also (left-)log-concave (Bagnoli & Bergstrom, 2005). However, the reverse direction does not hold.
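Lemma 1 can be sanity-checked numerically: with µ = 2 > σ = 1 the mixture density is bimodal (hence not log-concave), yet log F remains concave on (−∞, 0], the left of the mean. A small self-contained check:

```python
import math

def mixture_pdf(x, mu, sigma=1.0):
    """Density of the symmetric mixture 0.5*N(mu, s^2) + 0.5*N(-mu, s^2)."""
    phi = lambda z: math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    return 0.5 * (phi((x - mu) / sigma) + phi((x + mu) / sigma)) / sigma

def mixture_cdf(x, mu, sigma=1.0):
    """CDF of the same mixture, via the Gaussian error function."""
    Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return 0.5 * Phi((x - mu) / sigma) + 0.5 * Phi((x + mu) / sigma)

def discretely_concave(values):
    """All second differences non-positive (up to floating-point tolerance)."""
    return all(values[i - 1] - 2 * values[i] + values[i + 1] <= 1e-9
               for i in range(1, len(values) - 1))

# Grid on [-6, 0]; the mean of the symmetric mixture is 0.
xs = [-6 + 0.01 * i for i in range(601)]
log_F = [math.log(mixture_cdf(x, mu=2.0)) for x in xs]
```

The density check below confirms bimodality (f(0) < f(µ)), while the CDF check confirms concavity of log F to the left of the mean.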

D.2 SYMMETRIC MARGIN DISTRIBUTIONS

In this section we prove Proposition 1. We first prove the following helpful lemma:

Lemma 2 (Conditions for monotonicity of selective accuracy). A_F(τ) is monotone increasing in τ if and only if

f(−τ)/F(−τ) ≥ f(τ)/(1 − F(τ))    (49)

for all τ ≥ 0. Conversely, A_F(τ) is monotone decreasing in τ if and only if the above inequality is flipped for all τ ≥ 0.

Proof. A_F(τ) is monotone increasing in τ if and only if dA_F/dτ ≥ 0 for all τ ≥ 0. We obtain dA_F/dτ by differentiating A_F and simplifying:

dA_F/dτ = d/dτ [(1 − F(τ))/(1 − F(τ) + F(−τ))]    (50)
        = [(1 − F(τ) + F(−τ))(−f(τ)) − (1 − F(τ))(−f(τ) − f(−τ))] / (1 − F(τ) + F(−τ))²    (51)
        = [f(−τ) − f(τ)F(−τ) − f(−τ)F(τ)] / (1 − F(τ) + F(−τ))²    (52)
        = [f(−τ)(1 − F(τ)) − f(τ)F(−τ)] / (1 − F(τ) + F(−τ))².    (53)

Since the denominator is always positive, dA_F/dτ ≥ 0 if and only if the numerator f(−τ)(1 − F(τ)) − f(τ)F(−τ) ≥ 0, which in turn is equivalent to f(−τ)/F(−τ) ≥ f(τ)/(1 − F(τ)), as desired. The case for monotone decreasing A_F(τ) is analogous.

In the next two lemmas, we prove the necessary and sufficient conditions for Proposition 1, respectively.
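As a numerical illustration of Lemma 2, consider a Gaussian margin distribution N(1, 1): it is log-concave with full-coverage accuracy 1 − F(0) ≈ 0.84 > 1/2, so the condition holds at every τ and the selective accuracy should increase monotonically.

```python
import math

def selective_accuracy(F, tau):
    """A_F(tau) = (1 - F(tau)) / (1 - F(tau) + F(-tau)): accuracy over
    predictions whose margin exceeds tau in absolute value."""
    return (1 - F(tau)) / (1 - F(tau) + F(-tau))

# Margin distribution N(1, 1), CDF via the Gaussian error function.
F = lambda x: 0.5 * (1 + math.erf((x - 1.0) / math.sqrt(2)))
accs = [selective_accuracy(F, 0.1 * t) for t in range(40)]
```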

D.3 SKEW-SYMMETRIC MARGIN DISTRIBUTIONS

We now turn to how skew affects selective classification. Recall that in Section 5 we defined skew-symmetric distributions as follows:

Definition 2. A distribution with density f_{α,µ} is skew-symmetric with skew α and center µ if f_{α,µ}(τ) = 2h(τ − µ)G(α(τ − µ)) for all τ ∈ R, where h is the density of a distribution symmetric about 0, and G is the CDF of a (potentially different) distribution that is also symmetric about 0.

When α = 0, f = h and there is no skew. Increasing α increases rightward skew, and decreasing α increases leftward skew. Note that in general, the mean of f depends on α as well. We will also consider the translated distribution f_{α,µ}(x) = f_α(x − µ). The CDF of f_{α,µ} is F_{α,µ}, and the CDF of f_α is F_α. We will use the following properties of these distributions:

Lemma 5 (Skew-symmetry about µ). The following properties hold when we flip the skew from α to −α:

f_{α,µ}(x) = f_{−α,µ}(2µ − x)    (62)
F_{α,µ}(x) = 1 − F_{−α,µ}(2µ − x).    (63)

Proof.

f_{α,µ}(x) = 2h(x − µ)G(α(x − µ))    (64)
           = 2h(µ − x)G((−α)(µ − x))    (65)
           = f_{−α,µ}(2µ − x).    (66)

F_{α,µ}(x) = ∫_{−∞}^{x} f_{α,µ}(t) dt    (67)
           = ∫_{−∞}^{x} f_{−α,µ}(2µ − t) dt    (68)
           = ∫_{2µ−x}^{∞} f_{−α,µ}(t) dt    (69)
           = 1 − F_{−α,µ}(2µ − x).    (70)

Lemma 6 (Stochastic ordering with α [Proposition 4 from Azzalini & Regoli (2012)]). Let α₁ ≤ α₂. Then, for any µ ∈ R, F_{α₁,µ} ≥ F_{α₂,µ} pointwise.

Proof. Since the ordering is invariant to translation, without loss of generality we can take µ = 0. First consider x ≤ 0. Then

F_{α₁}(x) = ∫_{−∞}^{x} 2h(t)G(α₁t) dt    (71)
          ≥ ∫_{−∞}^{x} 2h(t)G(α₂t) dt    (72)
          = F_{α₂}(x).    (73)

Now consider x ≥ 0. We have

F_{α₁}(x) = ∫_{−∞}^{x} 2h(t)G(α₁t) dt    (74)
          = ∫_{−∞}^{x} 2h(t)(1 − G(−α₁t)) dt    [by symmetry of G]    (75)
          = 2H(x) − ∫_{−∞}^{x} 2h(t)G(−α₁t) dt    (76)
          = 2H(x) − ∫_{−∞}^{x} 2h(−t)G(−α₁t) dt    [by symmetry of h]    (77)
          = 2H(x) − ∫_{−x}^{∞} 2h(t)G(α₁t) dt    [change of variables]    (78)
          = 2H(x) − 1 + F_{α₁}(−x),    (79)

which reduces to the case where x ≤ 0.

Lemma 7 (Log gradient ordering by skew). For all α ≥ 0, τ ≥ 0, and µ ∈ R,

f_{α,µ}(τ)/F_{α,µ}(τ) ≥ f_{0,µ}(τ)/F_{0,µ}(τ) ≥ f_{−α,µ}(τ)/F_{−α,µ}(τ).    (80)

Proof.
Since the ordering is invariant to translation, without loss of generality we can take µ = 0. We have:

f_α(τ)/F_α(τ) = h(τ)G(ατ) / ∫_{−∞}^{τ} h(t)G(αt) dt    (81)
             = h(τ) / ∫_{−∞}^{τ} h(t) [G(αt)/G(ατ)] dt    (82)
             ≥ h(τ) / ∫_{−∞}^{τ} h(t) dt = f_0(τ)/F_0(τ)    (83)
             ≥ h(τ) / ∫_{−∞}^{τ} h(t) [G(−αt)/G(−ατ)] dt    (84)
             = h(τ)G(−ατ) / ∫_{−∞}^{τ} h(t)G(−αt) dt    (85)
             = f_{−α}(τ)/F_{−α}(τ).    (86)

To obtain the inequalities, note that when α ≥ 0 and t ≤ τ, we have G(αt)/G(ατ) ≤ 1 and G(−αt)/G(−ατ) ≥ 1, since G is an increasing function. Since h is non-negative, the inequalities then follow from the monotonicity of the integral. We now prove the main results of this section.
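Before moving on, the orderings in Lemmas 6 and 7 can be checked numerically for the skew-normal family (h and G both standard normal); the quadrature here is our own rough approximation, not part of the paper:

```python
import math

def skew_normal_pdf(x, alpha):
    """Skew-symmetric density 2*h(x)*G(alpha*x) with h the standard normal
    density and G the standard normal CDF (the skew-normal family)."""
    h = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    G = 0.5 * (1 + math.erf(alpha * x / math.sqrt(2)))
    return 2 * h * G

def skew_normal_cdf(x, alpha, lo=-10.0, n=4000):
    """CDF by trapezoidal integration of the density on [lo, x]."""
    if x <= lo:
        return 0.0
    step = (x - lo) / n
    total = 0.5 * (skew_normal_pdf(lo, alpha) + skew_normal_pdf(x, alpha))
    total += sum(skew_normal_pdf(lo + i * step, alpha) for i in range(1, n))
    return total * step
```

Lemma 6 predicts F_0(x) ≥ F_1(x) pointwise, and Lemma 7 predicts the log-gradient f_α/F_α increases with α at τ ≥ 0; both hold up numerically.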

D.3.1 ACCURACY IS MONOTONE WITH SKEW

Proposition 6 (Accuracy is monotone with skew). Let F_{α,µ} be the CDF of a skew-symmetric distribution. For all τ ≥ 0 and µ ∈ R, A_{F_{α,µ}}(τ) is monotonically increasing in α.

Proof. We use the skew-symmetry of f_{α,µ} (Lemma 5) to write the selective accuracy as

A_{F_{α,µ}}(τ) = 1 / [1 + F_{α,µ}(−τ)/(1 − F_{α,µ}(τ))]    (87)
              = 1 / [1 + F_{α,µ}(−τ)/F_{−α,µ}(2µ − τ)].    (88)

By Lemma 6, the numerator F_{α,µ}(−τ) is non-increasing in α, while (applying the same lemma to −α) the denominator F_{−α,µ}(2µ − τ) is non-decreasing in α; the ratio in Equation (88) is therefore non-increasing in α, so A_{F_{α,µ}}(τ) is monotonically increasing in α.

With Lemma 8, our results extend to all random variables T(X), where X follows a left-log-concave distribution and T is odd and strictly monotonically increasing. In particular, X and T(X) have the same accuracy-coverage curves.

E PROOFS: COMPARISON TO THE GROUP-AGNOSTIC REFERENCE

In this section, we present the proofs from Section 6, which outline conditions under which selective classifiers outperform the group-agnostic reference.
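As a numerical illustration of the comparison studied in this section, consider two groups with (assumed, purely illustrative) Gaussian margin distributions. The worse group's selective accuracy can fall below the group-agnostic reference when groups share a margin shape but differ in mean:

```python
import math

Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal CDF

def group_metrics(mus, ps, tau):
    """Correct/incorrect prediction mass per group and on average, for toy
    Gaussian margin distributions N(mu_g, 1)."""
    Cg = [1 - Phi(tau - mu) for mu in mus]  # P(margin >= tau)
    Ig = [Phi(-tau - mu) for mu in mus]     # P(margin <= -tau)
    C = sum(p * c for p, c in zip(ps, Cg))
    I = sum(p * i for p, i in zip(ps, Ig))
    return Cg, Ig, C, I

def selective_vs_reference(mus, ps, g, tau):
    """Group g's selective accuracy and its group-agnostic reference accuracy."""
    Cg, Ig, C, I = group_metrics(mus, ps, tau)
    Cg0, Ig0, C0, I0 = group_metrics(mus, ps, 0.0)
    selective = Cg[g] / (Cg[g] + Ig[g])
    c = Cg0[g] * C / C0  # correct mass the reference retains for g
    i = Ig0[g] * I / I0  # incorrect mass the reference retains for g
    return selective, c / (c + i)
```

With worst-group margins N(0.5, 1) and other-group margins N(1.5, 1), the density-ratio condition of Proposition 3 fails (f_others(0)/f_wg(0) ≈ 0.37 exceeds (1 − A_others(0))/(1 − A_wg(0)) ≈ 0.22), and indeed the reference attains higher worst-group accuracy than the classifier at τ = 0.5.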

E.1 DEFINITIONS

We first define certain metrics on selective classifiers and their group-agnostic reference in terms of their margin distributions.

Definition 6. Consider a margin distribution with CDF F = Σ_{g∈G} p_g F_g, where p_g is the mixture weight and F_g is the CDF for group g. We write the fractions of examples that are correct/incorrect and predicted on at threshold τ, for group g and on average, as

C_g(τ) = 1 − F_g(τ)    (102)
I_g(τ) = F_g(−τ)    (103)
C(τ) = Σ_g p_g C_g(τ)    (104)
I(τ) = Σ_g p_g I_g(τ).    (105)

We write the true positive rate R^TP(τ) and false positive rate R^FP(τ) for a given threshold τ as

R^TP(τ) = C(τ)/C(0),    (106)
R^FP(τ) = I(τ)/I(0).    (107)

Finally, we write the accuracy of the group-agnostic reference on group g as

Ã_{F_g}(τ) = A_{F_g}(0)R^TP(τ) / [A_{F_g}(0)R^TP(τ) + (1 − A_{F_g}(0))R^FP(τ)]    (108)
           = [C_g(0)C(τ)/C(0)] / [C_g(0)C(τ)/C(0) + I_g(0)I(τ)/I(0)].    (109)

For convenience in our proofs below, we also define the fraction of each group g among the correctly and incorrectly classified predictions at each threshold τ.

Definition 7. We define the fractions of group g out of correctly and incorrectly classified predictions at threshold τ, CF_g(τ) and IF_g(τ) respectively, as

CF_g(τ) = p_g C_g(τ)/C(τ)    (110)
IF_g(τ) = p_g I_g(τ)/I(τ).    (111)

E.2 GENERAL NECESSARY CONDITIONS TO OUTPERFORM THE GROUP-AGNOSTIC REFERENCE

We now present the proofs for Proposition 3 and Corollary 1, starting with the supporting lemmas. We first write the accuracy of the group-agnostic reference in terms of $IF_{wg}(0)$ and $CF_{wg}(0)$.

Lemma 9. $\tilde{A}_{F_{wg}}(\tau) = \dfrac{1}{1 + \frac{IF_{wg}(0)}{CF_{wg}(0)} \cdot \frac{I(\tau)}{C(\tau)}}$.

Proof. We have:

$$\tilde{A}_{F_{wg}}(\tau) = \frac{A_{F_{wg}}(0)\, R_{TP}(\tau)}{A_{F_{wg}}(0)\, R_{TP}(\tau) + (1 - A_{F_{wg}}(0))\, R_{FP}(\tau)} \tag{113}$$
$$= \frac{p\, C_{wg}(0)\, R_{TP}(\tau)}{p\, C_{wg}(0)\, R_{TP}(\tau) + p\, I_{wg}(0)\, R_{FP}(\tau)} \tag{114}$$
$$= \frac{p\, C_{wg}(0)\, \frac{C(\tau)}{C(0)}}{p\, C_{wg}(0)\, \frac{C(\tau)}{C(0)} + p\, I_{wg}(0)\, \frac{I(\tau)}{I(0)}} \tag{115}$$
$$= \frac{CF_{wg}(0)\, C(\tau)}{CF_{wg}(0)\, C(\tau) + IF_{wg}(0)\, I(\tau)} \tag{116}$$
$$= \frac{1}{1 + \frac{IF_{wg}(0)}{CF_{wg}(0)} \cdot \frac{I(\tau)}{C(\tau)}}. \tag{117}$$

Lemma 10 (Bounding the derivative of $1/CF_{wg}(\tau)$ at τ = 0). If $A_{F_{others}}(0) \geq A_{F_{wg}}(0)$, then

$$\frac{d}{d\tau} \frac{1}{CF_{wg}(\tau)} \bigg|_{\tau=0} \geq \frac{1-p}{p} \cdot \frac{F_{wg}(0)}{1 - F_{wg}(0)} \cdot \frac{f_{wg}(0) F_{others}(0) - f_{others}(0) F_{wg}(0)}{F_{wg}(0)^2}.$$

Proof.

$$\frac{d}{d\tau} \frac{1}{CF_{wg}(\tau)}\bigg|_{\tau=0} = \frac{d}{d\tau} \frac{p\, C_{wg}(\tau) + (1-p)\, C_{others}(\tau)}{p\, C_{wg}(\tau)}\bigg|_{\tau=0} \tag{120}$$
$$= \frac{d}{d\tau}\left(1 + \frac{1-p}{p} \cdot \frac{1 - F_{others}(\tau)}{1 - F_{wg}(\tau)}\right)\bigg|_{\tau=0} \tag{121}$$
$$= \frac{1-p}{p} \cdot \frac{f_{wg}(\tau)(1 - F_{others}(\tau)) - f_{others}(\tau)(1 - F_{wg}(\tau))}{(1 - F_{wg}(\tau))^2}\bigg|_{\tau=0} \tag{123}$$
$$= \frac{1-p}{p} \cdot \frac{1}{1 - F_{wg}(0)}\left(f_{wg}(0)\, \frac{1 - F_{others}(0)}{1 - F_{wg}(0)} - f_{others}(0)\right) \tag{125}$$
$$\geq \frac{1-p}{p} \cdot \frac{1}{1 - F_{wg}(0)}\left(f_{wg}(0)\, \frac{F_{others}(0)}{F_{wg}(0)} - f_{others}(0)\right) \qquad [0 < F_{others}(0) \leq F_{wg}(0) < 1] \tag{126}$$
$$= \frac{1-p}{p} \cdot \frac{F_{wg}(0)}{1 - F_{wg}(0)} \cdot \frac{f_{wg}(0) F_{others}(0) - f_{others}(0) F_{wg}(0)}{F_{wg}(0)^2}. \tag{128}$$

Lemma 11. If $A_{F_{others}}(0) \geq A_{F_{wg}}(0) > 0.5$, then

$$\frac{d}{d\tau} \frac{IF_{wg}(\tau)}{CF_{wg}(\tau)}\bigg|_{\tau=0} \geq C\left(f_{others}(0) F_{wg}(0) - f_{wg}(0) F_{others}(0)\right)$$

for some positive constant C.

Proof. By the product rule,

$$\frac{d}{d\tau} \frac{IF_{wg}(\tau)}{CF_{wg}(\tau)}\bigg|_{\tau=0} = \frac{d}{d\tau} IF_{wg}(\tau)\bigg|_{\tau=0} \cdot \frac{1}{CF_{wg}(0)} + IF_{wg}(0) \cdot \frac{d}{d\tau} \frac{1}{CF_{wg}(\tau)}\bigg|_{\tau=0} \tag{131}$$
$$\geq \frac{d}{d\tau} IF_{wg}(\tau)\bigg|_{\tau=0} \cdot \frac{1}{CF_{wg}(0)} + IF_{wg}(0) \cdot \frac{1-p}{p} \cdot \frac{F_{wg}(0)}{1 - F_{wg}(0)} \cdot \frac{f_{wg}(0) F_{others}(0) - f_{others}(0) F_{wg}(0)}{F_{wg}(0)^2}, \tag{133}$$

where (133) follows from Lemma 10 and $IF_{wg}(0) \geq 0$. Writing $IF_{wg}(\tau) = 1 / (1 + \frac{1-p}{p} \cdot \frac{F_{others}(-\tau)}{F_{wg}(-\tau)})$ and differentiating, the right-hand side equals

$$\left(\frac{1}{1 + \frac{1-p}{p}\frac{F_{others}(0)}{F_{wg}(0)}}\right)^{\!2} \frac{1-p}{p} \cdot \frac{f_{others}(0) F_{wg}(0) - f_{wg}(0) F_{others}(0)}{F_{wg}(0)^2} \cdot \left(1 + \frac{1-p}{p}\cdot\frac{1 - F_{others}(0)}{1 - F_{wg}(0)}\right) \tag{134}$$
$$\quad + \left(\frac{1}{1 + \frac{1-p}{p}\frac{F_{others}(0)}{F_{wg}(0)}}\right) \frac{1-p}{p} \cdot \frac{F_{wg}(0)}{1 - F_{wg}(0)} \cdot \frac{f_{wg}(0) F_{others}(0) - f_{others}(0) F_{wg}(0)}{F_{wg}(0)^2}$$
$$= \underbrace{\left(\frac{1}{1 + \frac{1-p}{p}\frac{F_{others}(0)}{F_{wg}(0)}}\right)}_{>0} \underbrace{\frac{1-p}{p}}_{>0} \left(f_{others}(0) F_{wg}(0) - f_{wg}(0) F_{others}(0)\right) \underbrace{\frac{1}{F_{wg}(0)^2}}_{>0} \left(\underbrace{\frac{1 + \frac{1-p}{p}\cdot\frac{1-F_{others}(0)}{1-F_{wg}(0)}}{1 + \frac{1-p}{p}\cdot\frac{F_{others}(0)}{F_{wg}(0)}}}_{\geq 1 \text{ because } A_{F_{others}}(0) \geq A_{F_{wg}}(0)} - \underbrace{\frac{F_{wg}(0)}{1 - F_{wg}(0)}}_{<1 \text{ because } A_{F_{wg}}(0) > 0.5}\right) \tag{136}$$
$$= C\left(f_{others}(0) F_{wg}(0) - f_{wg}(0) F_{others}(0)\right)$$

for some positive constant C.
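As a concrete complement to Definitions 6–7 and Lemma 9, the following is a minimal numerical sketch (ours, not from the paper's released code; the function and variable names are hypothetical) that computes $C_g$, $I_g$, $CF_g$, and $IF_g$ from empirical margins:

```python
import numpy as np

def selective_counts(margins, groups, tau):
    """Empirical versions of Definitions 6-7. `margins` are signed margins
    (positive iff the prediction is correct); `groups` are group labels.
    Returns overall C(tau), I(tau) and per-group C_g, I_g, CF_g, IF_g."""
    margins = np.asarray(margins, dtype=float)
    groups = np.asarray(groups)
    C = float(np.mean(margins > tau))    # fraction correct and predicted on
    I = float(np.mean(margins < -tau))   # fraction incorrect and predicted on
    per_group = {}
    for g in np.unique(groups):
        mask = groups == g
        p_g = float(np.mean(mask))                     # mixture weight p_g
        C_g = float(np.mean(margins[mask] > tau))
        I_g = float(np.mean(margins[mask] < -tau))
        per_group[g] = {
            "C_g": C_g, "I_g": I_g,
            "CF_g": p_g * C_g / C,   # group share of correct predictions
            "IF_g": p_g * I_g / I,   # group share of incorrect predictions
        }
    return C, I, per_group
```

By Lemma 9, the group-agnostic reference accuracy at threshold τ can then be formed as $1 / (1 + (IF_{wg}(0)/CF_{wg}(0)) \cdot (I(\tau)/C(\tau)))$.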

E.2.1 NECESSARY CONDITION FOR OUTPERFORMING THE GROUP-AGNOSTIC REFERENCE

Proposition 3 (Necessary condition for outperforming the group-agnostic reference). Assume that $1/2 < A_{F_{wg}}(0) < A_{F_{others}}(0) < 1$ and the worst-group density $f_{wg}(0) > 0$. If $\tilde{A}_{F_{wg}}(\tau) \leq A_{F_{wg}}(\tau)$ for all $\tau \geq 0$, then

$$\frac{f_{others}(0)}{f_{wg}(0)} \leq \frac{1 - A_{F_{others}}(0)}{1 - A_{F_{wg}}(0)}.$$

Proof. Recall that

$$A_{F_{wg}}(\tau) = \frac{1}{1 + \frac{IF_{wg}(\tau)}{CF_{wg}(\tau)} \cdot \frac{I(\tau)}{C(\tau)}}, \qquad \tilde{A}_{F_{wg}}(\tau) = \frac{1}{1 + \frac{IF_{wg}(0)}{CF_{wg}(0)} \cdot \frac{I(\tau)}{C(\tau)}}. \tag{138}$$

If $A_{F_{wg}}(\tau) \geq \tilde{A}_{F_{wg}}(\tau)$ for all $\tau \geq 0$, then $IF_{wg}(\tau)/CF_{wg}(\tau) \leq IF_{wg}(0)/CF_{wg}(0)$ for all $\tau \geq 0$, so

$$\frac{d}{d\tau} \frac{IF_{wg}(\tau)}{CF_{wg}(\tau)}\bigg|_{\tau=0} \leq 0.$$

From Lemma 11, $C(f_{others}(0) F_{wg}(0) - f_{wg}(0) F_{others}(0)) \leq \frac{d}{d\tau} \frac{IF_{wg}(\tau)}{CF_{wg}(\tau)}\big|_{\tau=0}$ for some positive constant C. Combined, $f_{others}(0) F_{wg}(0) - f_{wg}(0) F_{others}(0) \leq 0$, which rearranges to $f_{others}(0)/f_{wg}(0) \leq F_{others}(0)/F_{wg}(0) = (1 - A_{F_{others}}(0))/(1 - A_{F_{wg}}(0))$.
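Proposition 3's condition is easy to check numerically for Gaussian margins. The sketch below (our illustration, with hypothetical names) evaluates it in the equivalent form $f_{others}(0) F_{wg}(0) \leq f_{wg}(0) F_{others}(0)$:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gauss_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def satisfies_necessary_condition(mu_wg, sigma_wg, mu_o, sigma_o):
    """Proposition 3's necessary condition for the worst group to
    outperform the group-agnostic reference at every threshold:
    f_others(0)/f_wg(0) <= (1 - A_others(0)) / (1 - A_wg(0)),
    where 1 - A_g(0) = F_g(0)."""
    lhs = gauss_pdf(0, mu_o, sigma_o) / gauss_pdf(0, mu_wg, sigma_wg)
    rhs = gauss_cdf(0, mu_o, sigma_o) / gauss_cdf(0, mu_wg, sigma_wg)
    return lhs <= rhs
```

For equal variances (a pure translation) the condition fails, consistent with Proposition 4 below; a worst group with a sufficiently sharper (lower-variance) margin distribution can satisfy it, consistent with Corollary 1.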

E.2.2 OUTPERFORMING THE GROUP-AGNOSTIC REFERENCE REQUIRES SMALLER SCALING FOR LOG-CONCAVE DISTRIBUTIONS

Corollary 1 (Outperforming the group-agnostic reference requires smaller scaling for log-concave distributions). Assume that $1/2 < A_{F_{wg}}(0) < A_{F_{others}}(0) < 1$, $F_{wg}$ is log-concave, and $f_{others}(\tau) = v\, f_{wg}(v(\tau - \mu_{others}) + \mu_{wg})$ for all $\tau \in \mathbb{R}$, where v is a scaling factor. If $\tilde{A}_{F_{wg}}(\tau) \leq A_{F_{wg}}(\tau)$ for all $\tau \geq 0$, then $v < 1$.

Proof. By the proof of Proposition 3, if $A_{F_{wg}}(\tau) \geq \tilde{A}_{F_{wg}}(\tau)$ for all $\tau \geq 0$, then

$$f_{others}(0) F_{wg}(0) - f_{wg}(0) F_{others}(0) \leq 0, \quad \text{and equivalently,} \quad \frac{f_{others}(0)}{F_{others}(0)} \leq \frac{f_{wg}(0)}{F_{wg}(0)}.$$

From the definition of $f_{others}$, we have $f_{others}(0) = v\, f_{wg}(-\mu_{others} v + \mu_{wg})$ and $F_{others}(0) = F_{wg}(-\mu_{others} v + \mu_{wg})$, so

$$v \leq \frac{f_{wg}(0)/F_{wg}(0)}{f_{wg}(-\mu_{others} v + \mu_{wg})/F_{wg}(-\mu_{others} v + \mu_{wg})}.$$

Because $A_{F_{others}}(0) > A_{F_{wg}}(0)$, we have $-\mu_{others} v + \mu_{wg} < 0$. Applying log-concavity of $F_{wg}$ (its reverse hazard rate $f_{wg}/F_{wg}$ is decreasing) yields $v < 1$. Thus, when $v > 1$, there exists some threshold $\tau \geq 0$ where $\tilde{A}_{F_{wg}}(\tau) > A_{F_{wg}}(\tau)$.

E.3 TRANSLATED, LOG-CONCAVE DISTRIBUTIONS ALWAYS UNDERPERFORM THE GROUP-AGNOSTIC REFERENCE

We now present a proof of Proposition 4. Assume $f_{wg}$ and $f_{others}$ are log-concave and symmetric, the worst group has mean $\mu_{wg}$, the combination of other group(s) has mean $\mu_{others}$, and $f_{others}(x) = f_{wg}(x - (\mu_{others} - \mu_{wg}))$; that is, the densities are translations of each other. For convenience, define $d = \mu_{others} - \mu_{wg}$. This implies

$$F_{others}(x) = F_{wg}(x - (\mu_{others} - \mu_{wg})) = F_{wg}(x - d). \tag{147}$$

We will first show that wg is indeed the worst group at full coverage, i.e., $A_{F_{wg}}(0) < A_{F_{others}}(0)$:

Lemma 12. If $f_{wg}$ is a PDF and $f_{others}$ is obtained by translating $f_{wg}$ to the right (i.e., $f_{others}(x) = f_{wg}(x - d)$ for $d > 0$), and $F_{wg}$ and $F_{others}$ are their associated CDFs, then $A_{F_{wg}}(0) < A_{F_{others}}(0)$.

Proof. We have:

$$A_{F_{wg}}(0) = 1 - F_{wg}(0) \leq 1 - F_{wg}(-d) = 1 - F_{others}(0) = A_{F_{others}}(0). \tag{149–151}$$

We next show that $CF_{wg}(\tau)$ is monotonically decreasing in τ:

Lemma 13. If $f_{wg}$, $F_{wg}$, $f_{others}$, $F_{others}$, p, and $CF_{wg}$ are as described, $CF_{wg}$ is monotonically decreasing in τ.

Proof. First, defining $CF_{wg}$ in terms of $F_{wg}$ and using the symmetry of $f_{wg}$ about $\mu_{wg}$ (so that $1 - F_{wg}(x) = F_{wg}(2\mu_{wg} - x)$), we have:

$$CF_{wg}(\tau) = \frac{p(1 - F_{wg}(\tau))}{p(1 - F_{wg}(\tau)) + (1-p)(1 - F_{others}(\tau))} \tag{153}$$
$$= \frac{p(1 - F_{wg}(\tau))}{p(1 - F_{wg}(\tau)) + (1-p)(1 - F_{wg}(\tau - d))} \tag{154}$$
$$= \frac{p\, F_{wg}(2\mu_{wg} - \tau)}{p\, F_{wg}(2\mu_{wg} - \tau) + (1-p)\, F_{wg}(2\mu_{wg} - \tau + d)}.$$

Differentiating with respect to τ,

$$\frac{d}{d\tau} CF_{wg}(\tau) = -c\left(-f_{wg}(2\mu_{wg} - \tau + d)\, F_{wg}(2\mu_{wg} - \tau) + f_{wg}(2\mu_{wg} - \tau)\, F_{wg}(2\mu_{wg} - \tau + d)\right), \tag{159}$$

where c is a positive constant (since p and 1 − p are positive and the remaining terms are squares). Thus, $CF_{wg}(\tau)$ is monotonically decreasing for all τ if:

$$f_{wg}(2\mu_{wg} - \tau + d)\, F_{wg}(2\mu_{wg} - \tau) \leq f_{wg}(2\mu_{wg} - \tau)\, F_{wg}(2\mu_{wg} - \tau + d) \tag{160}$$
$$\frac{f_{wg}(2\mu_{wg} - \tau + d)}{F_{wg}(2\mu_{wg} - \tau + d)} \leq \frac{f_{wg}(2\mu_{wg} - \tau)}{F_{wg}(2\mu_{wg} - \tau)}.$$

Now, since $d > 0$, this follows from the log-concavity of $f_{wg}$. Therefore, the fraction of correct examples that come from the worst group is a decreasing function of τ.

Next, we show that $IF_{wg}(\tau)$ is monotonically increasing in τ:

Lemma 14. If $f_{wg}$, $F_{wg}$, $f_{others}$, $F_{others}$, p, and $IF_{wg}$ are as described, $IF_{wg}$ is monotonically increasing in τ.

Proof. First, we note that:

$$IF_{wg}(\tau) = \frac{p\, F_{wg}(-\tau)}{p\, F_{wg}(-\tau) + (1-p)\, F_{wg}(-\tau - d)}. \tag{163}$$

We will show that the derivative of $IF_{wg}(\tau)$ with respect to τ is positive. Differentiating gives us:

$$\frac{d}{d\tau} IF_{wg}(\tau) = -c\left(-f_{wg}(-\tau - d)\, F_{wg}(-\tau) + f_{wg}(-\tau)\, F_{wg}(-\tau - d)\right), \tag{166}$$

where c is positive since it is the product of squared terms and $(1-p)/p$, which is also positive. Thus, $\frac{d}{d\tau} IF_{wg}(\tau) \geq 0$ is equivalent to:

$$-\left(-f_{wg}(-\tau - d)\, F_{wg}(-\tau) + f_{wg}(-\tau)\, F_{wg}(-\tau - d)\right) \geq 0 \tag{167}$$
$$f_{wg}(-\tau - d)\, F_{wg}(-\tau) \geq f_{wg}(-\tau)\, F_{wg}(-\tau - d) \tag{168}$$
$$\frac{f_{wg}(-\tau - d)}{F_{wg}(-\tau - d)} \geq \frac{f_{wg}(-\tau)}{F_{wg}(-\tau)}.$$

Since $f_{wg}$ is log-concave and d is positive, the last inequality is true, so $IF_{wg}(\tau)$ is monotonically increasing in τ, as desired.

Lastly, we show that Lemma 13 and Lemma 14 imply that the ratio $IF_{wg}(\tau)/CF_{wg}(\tau)$ is monotonically increasing in τ.

Lemma 15. $IF_{wg}(\tau)/CF_{wg}(\tau)$ is monotonically increasing in τ.

Proof. We simply follow the quotient rule:

$$\frac{d}{d\tau} \frac{IF_{wg}(\tau)}{CF_{wg}(\tau)} = \frac{CF_{wg}(\tau)\, IF'_{wg}(\tau) - IF_{wg}(\tau)\, CF'_{wg}(\tau)}{CF_{wg}(\tau)^2} \geq 0, \tag{170}$$

where we note that $CF_{wg}(\tau), IF_{wg}(\tau) \geq 0$, $CF'_{wg}(\tau) < 0$ from Lemma 13, and $IF'_{wg}(\tau) > 0$ from Lemma 14.

We will now prove Proposition 4, the main result of this section.

Proof. We consider the case of underperforming the group-agnostic reference; the case of overperforming the group-agnostic reference is analogous. Assume the worst group is the lower-accuracy group (so d is positive). Using the definition of selective accuracy together with $p\, C_{wg}(\tau) = CF_{wg}(\tau)\, C(\tau)$ and $p\, I_{wg}(\tau) = IF_{wg}(\tau)\, I(\tau)$, we have:

$$A_{F_{wg}}(\tau) = \frac{C_{wg}(\tau)}{C_{wg}(\tau) + I_{wg}(\tau)} = \frac{CF_{wg}(\tau)\, C(\tau)}{CF_{wg}(\tau)\, C(\tau) + IF_{wg}(\tau)\, I(\tau)} \leq \frac{CF_{wg}(0)\, C(\tau)}{CF_{wg}(0)\, C(\tau) + IF_{wg}(0)\, I(\tau)} = \tilde{A}_{F_{wg}}(\tau), \tag{176}$$

where the inequality follows from Lemma 15, since $IF_{wg}(\tau)/CF_{wg}(\tau) \geq IF_{wg}(0)/CF_{wg}(0)$ for all $\tau \geq 0$. Thus, the worst group underperforms the group-agnostic reference, as desired.
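Proposition 4 can be sanity-checked numerically. The sketch below (ours; all names hypothetical) compares the worst group's selective accuracy against the group-agnostic reference (via the Lemma 9 form) for Gaussian margins that are translations of each other:

```python
import math

def cdf(x, mu):  # Gaussian CDF with standard deviation 1
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2)))

def reference_minus_worst(mu_wg, d, p, taus):
    """For f_others(x) = f_wg(x - d) with f_wg = N(mu_wg, 1), return
    A~_wg(tau) - A_wg(tau) at each threshold; Proposition 4 says >= 0."""
    mu_o = mu_wg + d
    # Full-coverage group shares of correct / incorrect predictions.
    CF0 = p * (1 - cdf(0, mu_wg)) / (p * (1 - cdf(0, mu_wg)) + (1 - p) * (1 - cdf(0, mu_o)))
    IF0 = p * cdf(0, mu_wg) / (p * cdf(0, mu_wg) + (1 - p) * cdf(0, mu_o))
    gaps = []
    for tau in taus:
        C_wg, I_wg = 1 - cdf(tau, mu_wg), cdf(-tau, mu_wg)
        C = p * C_wg + (1 - p) * (1 - cdf(tau, mu_o))   # mixture C(tau)
        I = p * I_wg + (1 - p) * cdf(-tau, mu_o)        # mixture I(tau)
        A_wg = C_wg / (C_wg + I_wg)                     # worst-group selective accuracy
        A_ref = 1 / (1 + (IF0 / CF0) * (I / C))         # reference accuracy (Lemma 9)
        gaps.append(A_ref - A_wg)
    return gaps

gaps = reference_minus_worst(mu_wg=0.5, d=0.5, p=0.5, taus=[0.0, 0.5, 1.0, 1.5])
```

Every entry of `gaps` is nonnegative, with equality at τ = 0, matching the proposition.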

F SIMULATIONS

To demonstrate that it is possible but challenging to outperform the group-agnostic reference, we present simulation results on margin distributions that are mixtures of two Gaussians. Following the setup from Section 6, we consider a best-group margin distribution N(1, 1) and a worst-group margin distribution N(µ, σ²), varying the parameters µ and σ. We set the fraction of mass from the worst group, p, to 0.5. We evaluate whether selective classification outperforms the group-agnostic reference, plotting the results in Figure 9. We observe that selective classification outperforms the group-agnostic reference under some parameter settings, notably when the necessary condition in Proposition 3 is met and when the worst-group variance is smaller, as implied by Corollary 1, but it underperforms the group-agnostic reference under most parameter settings.

Figure 9: Simulation results on margin distributions that are mixtures of two Gaussians, where we vary the mean and the variance of the worst-group margin distribution. For parameters corresponding to the blue and orange regions, the worst group outperforms and underperforms the group-agnostic reference, respectively. We do not consider parameters in the white region, so that the worst group's full-coverage accuracy remains worse than the other group's. We shade parameters that satisfy the necessary condition in Proposition 3.

In addition to whether the worst group underperforms or outperforms the group-agnostic reference, we study the magnitude of the difference in the worst-group accuracy with respect to the group-agnostic reference through the same simulations (Figure 10). Concretely, we compute the difference in worst-group accuracy with respect to the group-agnostic reference, taking the worst-case difference across thresholds: $\max_\tau \tilde{A}_{F_{wg}}(\tau) - A_{F_{wg}}(\tau)$. We observe significant disparities between the observed worst-group accuracy and the group-agnostic reference.
Lastly, we simulate the effects of varying the worst-group and other-group means while keeping the variance fixed at σ 2 = 1 for both groups, following the same simulation protocol otherwise.
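The worst-case-gap computation used in these simulations might look like the following (our re-implementation with hypothetical names, not the released code), sweeping a grid of thresholds for a worst-group margin N(µ, σ²) against the fixed best group N(1, 1):

```python
import math

def cdf(x, mu, sigma):  # Gaussian CDF
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def worst_case_gap(mu, sigma, p=0.5, n_taus=60, dtau=0.05):
    """max over tau of A~_wg(tau) - A_wg(tau), for a worst-group margin
    N(mu, sigma^2) mixed with a best-group margin N(1, 1)."""
    CF0 = p * (1 - cdf(0, mu, sigma)) / (
        p * (1 - cdf(0, mu, sigma)) + (1 - p) * (1 - cdf(0, 1, 1)))
    IF0 = p * cdf(0, mu, sigma) / (p * cdf(0, mu, sigma) + (1 - p) * cdf(0, 1, 1))
    best = 0.0
    for i in range(n_taus):
        tau = i * dtau
        C_wg, I_wg = 1 - cdf(tau, mu, sigma), cdf(-tau, mu, sigma)
        C = p * C_wg + (1 - p) * (1 - cdf(tau, 1, 1))
        I = p * I_wg + (1 - p) * cdf(-tau, 1, 1)
        A_wg = C_wg / (C_wg + I_wg)               # worst-group selective accuracy
        A_ref = 1 / (1 + (IF0 / CF0) * (I / C))   # group-agnostic reference (Lemma 9)
        best = max(best, A_ref - A_wg)
    return best
```

As the worst-group mean approaches the best-group mean (with equal variances), the worst-case gap shrinks toward zero.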



www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/



Figure 2: Accuracy (top) and coverage (bottom) for each group, as a function of the average coverage. Each average coverage corresponds to a threshold τ . The red lines represent the worst group. At low coverages, accuracy estimates are noisy as only a few predictions are made.

$\max_{g \in G} \mathrm{acc}_g(S) - \min_{g \in G} \mathrm{acc}_g(S)$ for $S \in Q$.

Figure 7: Selective accuracy (top) and coverage (bottom) for each group, as a function of the average coverage for the MC-dropout selective classifier. Each average coverage corresponds to a threshold τ . The red lines represent the worst group.

Figure 8: Selective accuracy (top) and coverage (bottom) for each group as a function of the average coverage for the softmax response selective classifier, with the model optimized with group DRO. Each average coverage corresponds to a specific threshold τ. The red lines represent the worst group (i.e., the one with the lowest accuracy at full coverage), and the gray lines represent the other groups.




Proposition 4 (Translated log-concave distributions underperform the group-agnostic reference). Assume $F_{wg}$ and $F_{others}$ are log-concave and $f_{others}(\tau) = f_{wg}(\tau - d)$ for all $\tau \in \mathbb{R}$. Then for all $\tau \geq 0$, $A_{F_{wg}}(\tau) \leq \tilde{A}_{F_{wg}}(\tau)$.



Figure 10: Simulation results on margin distributions that are mixtures of two Gaussians, where we vary the mean and the variance of the worst-group margin distribution and keep the other group's margin distribution fixed with mean and variance 1. We compute the difference in worst-group accuracy with respect to the group-agnostic reference, taking the worst-case difference across thresholds: $\max_\tau \tilde{A}_{F_{wg}}(\tau) - A_{F_{wg}}(\tau)$.

We study these datasets from Liu et al. (2015), Borkan et al. (2019), Sagawa et al. (2020), Irvin et al. (2019), and Williams et al. (2018), respectively. For each dataset, we form a group for each combination of label y ∈ Y and spuriously-correlated attribute a ∈ A, and evaluate the accuracy of selective classifiers on average and on each group. Dataset details are in Appendix C.1.

2. Sample a subset $I^{ga}_\tau$ of size $|I_\tau|$ uniformly at random from $I_0$.
3. Return $C^{ga}_\tau$ and $I^{ga}_\tau$.

Since $|C^{ga}_\tau| = |C_\tau|$ and $|I^{ga}_\tau| = |I_\tau|$, the group-agnostic reference makes the same numbers of correct and incorrect predictions as (ŷ, ĉ), but in a group-agnostic way.
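In code, the sampling construction above might look like this (a minimal sketch with hypothetical names; the paper's released implementation may differ):

```python
import numpy as np

def group_agnostic_reference(correct, confidence, tau, rng=None):
    """Build the group-agnostic reference for threshold tau: keep the same
    numbers of correct and incorrect predictions as the selective classifier,
    but sample them uniformly at random, ignoring group membership.
    `correct` is a boolean array; `confidence` holds the model's confidences."""
    rng = rng or np.random.default_rng(0)
    predicted = confidence >= tau                    # examples predicted on at tau
    C0 = np.flatnonzero(correct)                     # all correct examples (C_0)
    I0 = np.flatnonzero(~correct)                    # all incorrect examples (I_0)
    n_C = int(np.sum(correct & predicted))           # |C_tau|
    n_I = int(np.sum(~correct & predicted))          # |I_tau|
    C_ga = rng.choice(C0, size=n_C, replace=False)   # same count, group-agnostic
    I_ga = rng.choice(I0, size=n_I, replace=False)
    return C_ga, I_ga
```

Per-group reference accuracies then follow by intersecting `C_ga` and `I_ga` with each group's indices.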

Figure5: When applied to group DRO models, which have more similar accuracies across groups than standard models, SR selective classifiers improve average and worst-group accuracies. At low coverages, accuracy estimates are noisy as only a few predictions are made.

The training set includes 206,175 examples, with 1,521 examples from the smallest group (entailment with negations). The worst group for standard models tends to be neutral examples with negation words.

ACKNOWLEDGMENTS

We thank Emma Pierson, Jean Feng, Pranav Rajpurkar, and Tengyu Ma for helpful advice. This work was supported by NSF Award Grant no. 1804222. SS was supported by the Herbert Kunzel Stanford Graduate Fellowship and AK was supported by the Stanford Graduate Fellowship.

REPRODUCIBILITY

All code, data, and experiments are available on CodaLab at https://worksheets.codalab.org/worksheets/0x7ceb817d53b94b0c8294a7a22643bf5e. The code is also available on GitHub at https://github.com/ejones313/worst-group-sc.


Lemma 3 (Left-log-concavity and symmetry imply monotonicity). Let f be symmetric about µ and let F be left-log-concave. If $A_F(0) \geq 0.5$, then $A_F(\tau)$ is monotone increasing. Conversely, if $A_F(0) \leq 0.5$, then $A_F(\tau)$ is monotone decreasing.

Proof. Consider the case where $A_F(0) \geq 0.5$; the case where $A_F(0) \leq 0.5$ is analogous. From Lemma 2 and the symmetry of f, we have that $A_F(\tau)$ is monotone increasing if (and only if)

$$\frac{f(-\tau)}{F(-\tau)} \geq \frac{f(2\mu - \tau)}{F(2\mu - \tau)} \tag{55}$$

holds for all $\tau \geq 0$. To show that this inequality holds for all $\tau \geq 0$, we first note that since $A_F(0) \geq 0.5$, we have $F(0) \leq 0.5$, which together with the symmetry of F implies that $\mu \geq 0$. Thus, $-\tau \leq 2\mu - \tau$ for all $\tau \geq 0$. Now, if $\tau \geq \mu$, then $2\mu - \tau \leq \mu$, so the desired inequality (55) follows from the log-concavity of F on $(-\infty, \mu]$ (Remark 1 of Bagnoli & Bergstrom (2005)); in the remaining case $\tau < \mu$, (55) also holds.

Lemma 4 (Monotonicity and symmetry imply left-log-concavity). Let f be symmetric with mean 0, and let $f_\mu$ be a translated version of f that has mean µ. If $A_{f_\mu}$ is monotone increasing for all $\mu \geq 0$, then f is left-log-concave.

Proof. First consider $\mu \geq 0$. Since $A_{f_\mu}$ is monotone increasing for all $\mu \geq 0$, from Lemma 2 and the symmetry of $f_\mu$,

$$\frac{f_\mu(-\tau)}{F_\mu(-\tau)} \geq \frac{f_\mu(2\mu - \tau)}{F_\mu(2\mu - \tau)}$$

must hold for all $\tau \geq 0$. Since $f_\mu(x) = f(x - \mu)$ by construction, we can equivalently write

$$\frac{f(-\mu - \tau)}{F(-\mu - \tau)} \geq \frac{f(\mu - \tau)}{F(\mu - \tau)},$$

which holds for all $\tau \geq 0$ and, by assumption, for all $\mu \geq 0$ as well. By letting $a = -\mu - \tau$ and $b = \mu - \tau$, we can equivalently write the above inequality as

$$\frac{f(a)}{F(a)} \geq \frac{f(b)}{F(b)} \quad \text{for all } a \leq b \text{ with } a + b \leq 0, \text{ in particular for all } a \leq b \leq 0.$$

By the definition of log-concavity, this means that F is log-concave on $(-\infty, 0]$.
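Lemma 3 is easy to visualize numerically: for a symmetric margin density with a left-log-concave CDF (e.g., a Gaussian), selective accuracy is monotone in the threshold, with the direction set by the full-coverage accuracy. A small sketch (ours, with hypothetical names):

```python
import math

def gauss_cdf(x, mu):  # Gaussian CDF with standard deviation 1
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2)))

def selective_accuracy(tau, mu):
    """A_F(tau) for a Gaussian margin N(mu, 1): correct-and-predicted mass
    over all predicted mass outside [-tau, tau]."""
    c = 1 - gauss_cdf(tau, mu)   # correct, margin above tau
    i = gauss_cdf(-tau, mu)      # incorrect, margin below -tau
    return c / (c + i)

taus = [0.1 * k for k in range(30)]
acc_up = [selective_accuracy(t, mu=0.5) for t in taus]     # A_F(0) > 0.5
acc_down = [selective_accuracy(t, mu=-0.5) for t in taus]  # A_F(0) < 0.5
```

With mean 0.5 the curve rises toward 1, and with mean −0.5 it falls, exactly the dichotomy Lemma 3 predicts.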

D.2.1 LEFT-LOG-CONCAVITY AND MONOTONICITY.

Proposition 1 (Left-log-concavity and monotonicity). Let f be symmetric about µ with left-log-concave CDF F. Then $A_F(\tau)$ is monotonically increasing if $A_F(0) \geq 0.5$, and monotonically decreasing otherwise. Conversely, if $A_{F_d}$ is monotonically increasing for all translations $F_d$ such that $F_d(\tau) = F(\tau - d)$ for all τ, then by Lemma 4, F is left-log-concave, completing the proof.

We see that $A_{f_{\alpha,\mu}}(\tau)$ is a monotone decreasing function of $F_{\alpha,\mu}(-\tau)/F_{-\alpha,\mu}(2\mu - \tau)$. From Lemma 6, the numerator decreases with increasing α while the denominator increases. Thus, this fraction decreases, which in turn implies that $A_{f_{\alpha,\mu}}(\tau)$ is monotone increasing with α, as desired.

D.3.2 SKEW IN THE SAME DIRECTION PRESERVES MONOTONICITY

Proposition 2 (Skew in the same direction preserves monotonicity). Let $F_{\alpha,\mu}$ be the CDF of a skew-symmetric distribution. If the accuracy of its symmetric version, $A_{F_{0,\mu}}(\tau)$, is monotonically increasing in τ, then $A_{F_{\alpha,\mu}}(\tau)$ is also monotonically increasing in τ for any α > 0. Similarly, if $A_{F_{0,\mu}}(\tau)$ is monotonically decreasing in τ, then $A_{F_{\alpha,\mu}}(\tau)$ is also monotonically decreasing in τ for any α < 0.

Proof. The idea is to use Lemma 7 to reduce the statement to the case where α = 0, so that we can apply monotonicity. First consider the case where $A_{F_{0,\mu}}(\tau)$ is monotonically increasing in τ and α > 0. Applying Lemma 7 and then Lemma 2 completes the proof. The case where α < 0 is analogous.

D.4 MONOTONE ODD TRANSFORMATIONS PRESERVE ACCURACY-COVERAGE CURVES.

We now show that our results are unchanged by strictly monotone and odd transformations to the density.

Lemma 8 (Odd and strictly monotonically increasing transformations preserve selective accuracy). Let X be a real-valued random variable with CDF $F_X$, let T be an odd and strictly monotonically increasing function, and define the random variable $Y = T(X)$ with CDF $F_Y$. Then $A_{F_Y}(T(\tau)) = A_{F_X}(\tau)$ for all $\tau \geq 0$.

Proof. For each τ, since T is a strictly monotonically increasing function, we apply the change-of-variables formula for transformations of univariate random variables to get $F_Y(T(\tau)) = F_X(\tau)$. Since T is odd, we also have $F_Y(-T(\tau)) = F_Y(T(-\tau)) = F_X(-\tau)$. (97) We now solve for selective accuracy:

$$A_{F_Y}(T(\tau)) = \frac{1 - F_Y(T(\tau))}{(1 - F_Y(T(\tau))) + F_Y(-T(\tau))} = \frac{1 - F_X(\tau)}{(1 - F_X(\tau)) + F_X(-\tau)} = A_{F_X}(\tau).$$

Published as a conference paper at ICLR 2021

We similarly observe substantial gaps between the observed worst-group accuracy and the group-agnostic reference (Figure 11).

Figure 11: Simulation results on margin distributions that are mixtures of two Gaussians, where we vary the means of the two margin distributions, keeping their variances fixed at 1. We compute the difference in worst-group accuracy with respect to the group-agnostic reference, taking the worst-case difference across thresholds: $\max_\tau \tilde{A}_{F_{wg}}(\tau) - A_{F_{wg}}(\tau)$.

