UNDERSAMPLING IS A MINIMAX OPTIMAL ROBUSTNESS INTERVENTION IN NONPARAMETRIC CLASSIFICATION

Abstract

While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an undersampled balanced dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask whether learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification. Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets), or unless the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal. In the case of group-covariate shift we show that there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that, in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples.

1. INTRODUCTION

A key challenge facing the machine learning community is to design models that are robust to distribution shift. When there is a mismatch between the train and test distributions, current models are often brittle and perform poorly on rare examples (Hovy & Søgaard, 2015; Blodgett et al., 2016; Tatman, 2017; Hashimoto et al., 2018; Alcorn et al., 2019). In this paper, our focus is on group-structured distribution shifts. In the training set, we have many samples from a majority group and relatively few samples from the minority group, while during test time we are equally likely to get a sample from either group. To tackle such distribution shifts, a naïve algorithm is one that first undersamples the training data by discarding excess majority group samples (Kubat & Matwin, 1997; Wallace et al., 2011) and then trains a model on this resulting dataset (see Figure 1 for an illustration of this algorithm). The samples that remain in this undersampled dataset constitute i.i.d. draws from the test distribution. Therefore, while a classifier trained on this pruned dataset cannot suffer biases due to distribution shift, this algorithm is clearly wasteful, as it discards training samples. This perceived inefficiency of undersampling has led to the design of several algorithms to combat such distribution shift (Chawla et al., 2002; Lipton et al., 2018; Sagawa et al., 2020; Cao et al., 2019; Menon et al., 2020; Ye et al., 2020; Kini et al., 2021; Wang et al., 2022). In spite of this algorithmic progress, the simple baseline of training models on an undersampled dataset remains competitive. In the case of label shift, where one class label is overrepresented in the training data, this has been observed by Cui et al. (2019), Cao et al. (2019), and Yang & Xu (2020). In the case of group-covariate shift, a study by Idrissi et al. (2022) showed that the empirical effectiveness of these more complicated algorithms is limited. For example, Idrissi et al. 
(2022) showed that on the group-covariate shift CelebA dataset, the worst-group accuracy of a ResNet-50 model trained on the undersampled CelebA dataset, which discards 97% of the available training data, is as good as that of methods that use all of the available data, such as importance-weighted ERM (Shimodaira, 2000), Group-DRO (Sagawa et al., 2020) and Just-Train-Twice (Liu et al., 2021).

Figure 1: Example with linear models and linearly separable data. On the left we have the maximum margin classifier over the entire dataset, and on the right we have the maximum margin classifier over the undersampled dataset. The undersampled classifier is less biased and aligns more closely with the true boundary.

In Table 1, we report the performance of the undersampled classifier compared to the state-of-the-art methods in the literature across several label shift and group-covariate shift datasets. We find that, although undersampling isn't always the optimal robustness algorithm, it is typically a very competitive baseline and within 1-4% of the performance of the best method.

Table 1: Performance of the undersampled classifier compared to the best classifier across several popular label shift and group-covariate shift datasets. When reporting worst-group accuracy we denote it by a . When available, we report the 95% confidence interval. We find that the undersampled classifier is always within 1-4% of the best performing robustness algorithm, except on the MultiNLI dataset.

To understand why undersampling is so competitive, we first study the label shift scenario, where one of the labels is overrepresented in the training data, P_train(y = 1) ≥ P_train(y = -1), whereas the test samples are equally likely to come from either class. Here the class-conditional distribution P(x | y) is Lipschitz in x. We show that in the label shift setting there is a fundamental constraint: the minimax excess risk of any robust learning method is lower bounded by 1/n_min^{1/3}. 
That is, minority group samples fundamentally constrain performance under distribution shift. Furthermore, by leveraging previous results about nonparametric density estimation (Freedman & Diaconis, 1981) we show a matching upper bound on the excess risk of a standard binning estimator trained on an undersampled dataset, demonstrating that undersampling is optimal. Further, we experimentally show on a label shift dataset (Imbalanced Binary CIFAR10) that the accuracy of popular classifiers generally follows the trends predicted by our theory. When the number of minority samples is increased, the accuracy of these classifiers increases drastically, whereas when the number of majority samples is increased the gains in accuracy are marginal at best. We also study the covariate shift case. In this setting, there has been extensive work studying the effectiveness of transfer (Kpotufe & Martinet, 2018; Hanneke & Kpotufe, 2019) from train to test distributions, often focusing on deriving specific conditions under which this transfer is possible. In this work, we demonstrate that when the overlap (defined in terms of total variation distance) between the group distributions P_a and P_b is small, transfer is difficult, and the minimax excess risk of any robust learning algorithm is lower bounded by 1/n_min^{1/3}. While this prior work also shows the impossibility of using majority group samples in the extreme case with no overlap, our results provide a simple lower bound showing that the amount of overlap needed to make transfer feasible is unrealistic. We also show that this lower bound is tight, by proving an upper bound on the excess risk of the binning estimator acting on the undersampled dataset. 
Taken together, our results underline the need to move beyond designing "general-purpose" robustness algorithms (like importance weighting (Cao et al., 2019; Menon et al., 2020; Kini et al., 2021; Wang et al., 2022), g-DRO (Sagawa et al., 2020), JTT (Liu et al., 2021), SMOTE (Chawla et al., 2002), etc.) that are agnostic to the structure of the distribution shift. Our worst-case analysis highlights that to successfully beat undersampling, an algorithm must leverage additional structure in the distribution shift.

2. RELATED WORK

On several group-covariate shift benchmarks (CelebA, CivilComments, Waterbirds), Idrissi et al. (2022) showed that training ResNet classifiers on an undersampled dataset either outperforms or performs as well as other popular reweighting methods like Group-DRO (Sagawa et al., 2020), reweighted ERM, and Just-Train-Twice (Liu et al., 2021). They find that Group-DRO performs comparably to undersampling, while both tend to outperform methods that don't utilize group information. One classic method to tackle distribution shift is importance weighting (Shimodaira, 2000), which reweights the loss of the minority group samples to yield an unbiased estimate of the loss. However, recent work (Byrd & Lipton, 2019; Xu et al., 2020) has demonstrated the ineffectiveness of such methods when applied to overparameterized neural networks. Many follow-up papers (Cao et al., 2019; Ye et al., 2020; Menon et al., 2020; Kini et al., 2021; Wang et al., 2022) have introduced methods that modify the loss function in various ways to address this. However, despite this progress, undersampling remains a competitive alternative to these importance-weighted classifiers. Our theory draws from the rich literature on nonparametric classification (Tsybakov, 2010). Apart from borrowing this setting of nonparametric classification, we also utilize upper bounds on the estimation error of the simple histogram estimator (Freedman & Diaconis, 1981; Devroye & Györfi, 1985) to prove our upper bounds in the label shift case. Finally, we note that to prove our minimax lower bounds we use the general recipe of reducing from estimation to testing (Wainwright, 2019, Chapter 15). One difference from this standard framework is that our training samples are drawn from a different distribution than the test samples used to define the risk. 
There is a rich literature that studies domain adaptation and transfer learning under label shift (Maity et al., 2020) and covariate shift (Ben-David et al., 2006; David et al., 2010; Ben-David et al., 2010; Ben-David & Urner, 2012; 2014; Berlind & Urner, 2015; Kpotufe & Martinet, 2018; Hanneke & Kpotufe, 2019). The principal focus of this line of work is to understand the value of unlabeled data from the target domain, rather than to characterize the relative value of the number of labeled samples from the majority and minority groups. Among these papers, most closely related to our work are those in the covariate shift setting (Kpotufe & Martinet, 2018; Hanneke & Kpotufe, 2019). Their lower bound results can be reinterpreted to show that under covariate shift, in the absence of overlap, the minimax excess risk is lower bounded by 1/n_min^{1/3}. We provide a more detailed comparison with their results after presenting our lower bounds in Section 4.2. Finally, we note that Arjovsky et al. (2022) recently showed that undersampling can improve the worst-class accuracy of linear SVMs in the presence of label shift. In comparison, our results hold for arbitrary classifiers over the rich class of nonparametric data distributions.

3. SETTING

In this section, we shall introduce our problem setup and define the types of distribution shift that we consider.

3.1. PROBLEM SETUP

The setting for our study is nonparametric binary classification with Lipschitz data distributions. We are given n training datapoints S := {(x_1, y_1), . . . , (x_n, y_n)} ∈ ([0, 1] × {-1, 1})^n that are all drawn from a train distribution P_train. During test time, the data shall be drawn from a different distribution P_test. To present a clean analysis, we study the case where the features x are bounded scalars; however, it is easy to extend our results to the high-dimensional setting. Given a classifier f : R → {-1, 1}, we shall be interested in the test error (risk) of this classifier under the test distribution P_test:

R(f; P_test) := E_{(x,y)∼P_test}[1(f(x) ≠ y)].
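As a concrete reference point, the 0-1 risk above can be estimated on a finite sample; the following minimal numpy sketch (the threshold labeling rule is a hypothetical example, not a construction from the paper) computes R(f; P_test) empirically:

```python
import numpy as np

def zero_one_risk(f, x, y):
    """Empirical 0-1 risk: the fraction of points where f(x) != y."""
    return float(np.mean(f(x) != y))

# Hypothetical noiseless example: labels given by a threshold at 0.5,
# and f is the corresponding Bayes classifier.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)
y = np.where(x > 0.5, 1, -1)
f = lambda z: np.where(z > 0.5, 1, -1)
print(zero_one_risk(f, x, y))  # 0.0, since f matches the labeling rule exactly
```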

3.2. TYPES OF DISTRIBUTION SHIFT

We assume that P_train consists of a mixture of two groups of unequal size, and P_test contains equal proportions of both groups. Given a majority group distribution P_maj and a minority group distribution P_min, the learner has access to n_maj majority group samples and n_min minority group samples: S_maj ∼ P_maj^{n_maj} and S_min ∼ P_min^{n_min}. Here n_maj > n/2 and n_min < n/2 with n_maj + n_min = n. The full training dataset is S = S_maj ∪ S_min = {(x_1, y_1), . . . , (x_n, y_n)}. We assume that the learner knows whether a particular sample (x_i, y_i) comes from the majority or the minority group. The test samples are drawn from P_test = (1/2)P_maj + (1/2)P_min, a uniform mixture over P_maj and P_min. Thus, the training dataset is an imbalanced draw from the distributions P_maj and P_min, whereas the test samples are balanced draws. We let ρ := n_maj/n_min > 1 denote the imbalance ratio in the training data. We focus on two types of distribution shift, label shift and group-covariate shift, which we describe below.
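This sampling protocol is easy to simulate; below is a small sketch (the uniform group densities are hypothetical placeholders for P_maj and P_min, chosen only to illustrate the imbalanced-train/balanced-test structure):

```python
import numpy as np

rng = np.random.default_rng(0)
n_maj, n_min = 900, 100      # imbalanced training draw
rho = n_maj / n_min          # imbalance ratio rho = 9

# Placeholder group distributions over [0, 1] x {-1, 1}.
def draw_maj(n):
    return [(rng.uniform(0, 1), 1) for _ in range(n)]

def draw_min(n):
    return [(rng.uniform(0, 1), -1) for _ in range(n)]

# Training data S = S_maj U S_min (group membership is known to the learner).
S_maj, S_min = draw_maj(n_maj), draw_min(n_min)

# Balanced test draw from P_test = 0.5 * P_maj + 0.5 * P_min:
# each test point comes from either group with probability 1/2.
test = [draw_maj(1)[0] if rng.random() < 0.5 else draw_min(1)[0]
        for _ in range(1000)]
```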

3.2.1. LABEL SHIFT

In this setting, the imbalance in the training data comes from there being more samples from one class than the other. Without loss of generality, we shall assume that the class y = 1 is the majority class. Then, we define the majority and the minority class distributions as P_maj(x, y) = P_1(x)1(y = 1) and P_min(x, y) = P_{-1}(x)1(y = -1), where P_1, P_{-1} are class-conditional distributions over the interval [0, 1]. We assume that the class-conditional distributions P_i have densities on [0, 1] and that they are 1-Lipschitz: for any x, x′ ∈ [0, 1], |P_i(x) - P_i(x′)| ≤ |x - x′|. We denote the class of pairs of distributions (P_maj, P_min) that satisfy these conditions by P_LS. We note that such Lipschitzness assumptions are common in the literature (see Tsybakov, 2010).

3.2.2. GROUP-COVARIATE SHIFT

In this setting, we have two groups {a, b}, and corresponding to each group is a distribution (with a density) over the features, P_a(x) and P_b(x). We let a correspond to the majority group and b to the minority group. Then, we define P_maj(x, y) = P_a(x)P(y | x) and P_min(x, y) = P_b(x)P(y | x). We assume that for y ∈ {-1, 1} and for all x, x′ ∈ [0, 1]: |P(y | x) - P(y | x′)| ≤ |x - x′|, that is, the distribution of the label given the feature is 1-Lipschitz and varies slowly over the domain. To quantify the shift between the train and test distributions, we define a notion of overlap between the group distributions P_a and P_b as follows:

Overlap(P_a, P_b) := 1 - TV(P_a, P_b), where TV(P_a, P_b) := sup_{E ⊆ [0,1]} |P_a(E) - P_b(E)|

denotes the total variation distance between P_a and P_b. Notice that when P_a and P_b have disjoint supports, TV(P_a, P_b) = 1 and therefore Overlap(P_a, P_b) = 0. On the other hand, when P_a = P_b, TV(P_a, P_b) = 0 and Overlap(P_a, P_b) = 1. When the overlap is 1, the majority and minority distributions are identical and hence there is no shift between train and test. Observe that Overlap(P_a, P_b) = Overlap(P_maj, P_min) since P(y | x) is shared across P_maj and P_min. Given a level of overlap τ ∈ [0, 1], we denote the class of pairs of distributions (P_maj, P_min) with overlap at least τ by P_GS(τ). It is easy to check that P_GS(τ) ⊆ P_GS(0) for any overlap level τ ∈ [0, 1].
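For distributions with densities, TV(P_a, P_b) equals half the L1 distance between the densities, so the overlap functional is easy to compute on a grid; a small numerical sketch (the example densities are hypothetical illustrations of the extreme cases discussed above):

```python
import numpy as np

def overlap(p_a, p_b, dx):
    """Overlap(P_a, P_b) = 1 - TV(P_a, P_b); for densities,
    TV(P_a, P_b) = 0.5 * integral of |p_a - p_b|."""
    return 1.0 - 0.5 * np.sum(np.abs(p_a - p_b)) * dx

# Hypothetical densities discretized on a grid over [0, 1].
K = 1000
dx = 1.0 / K
x = (np.arange(K) + 0.5) * dx
p_uniform = np.ones(K)                 # uniform on [0, 1]
p_left = np.where(x < 0.5, 2.0, 0.0)   # uniform on [0, 0.5]
p_right = 2.0 - p_left                 # uniform on [0.5, 1]

print(overlap(p_uniform, p_uniform, dx))  # identical distributions: overlap = 1
print(overlap(p_left, p_right, dx))       # disjoint supports: overlap ~ 0
print(overlap(p_uniform, p_left, dx))     # partial overlap ~ 0.5
```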

4. LOWER BOUNDS ON THE MINIMAX EXCESS RISK

In this section, we shall prove our lower bounds showing that the performance of any algorithm is constrained by the number of minority samples n_min. Before we state our lower bounds, we need to introduce the notions of excess risk and minimax excess risk.

Excess Risk and Minimax Excess Risk. We measure the performance of an algorithm A through its excess risk, defined in the following way. Given an algorithm A that takes as input a dataset S and returns a classifier A_S, and a pair of distributions (P_maj, P_min) with P_test = (1/2)P_maj + (1/2)P_min, the expected excess risk is given by

Excess Risk[A; (P_maj, P_min)] := E_{S∼P_maj^{n_maj}×P_min^{n_min}}[R(A_S; P_test)] - R(f(P_test); P_test), (1)

where f(P_test) is the Bayes classifier that minimizes the risk R(·; P_test). The first term corresponds to the expected risk of the algorithm when given n_maj samples from P_maj and n_min samples from P_min, whereas the second term corresponds to the Bayes error for the problem. Excess risk does not let us characterize the inherent difficulty of a problem, since for any particular data distribution (P_maj, P_min) the best possible algorithm A to minimize the excess risk would be the trivial mapping A_S = f(P_test). Therefore, to prove meaningful lower bounds on the performance of algorithms we need to define the notion of minimax excess risk (see Wainwright, 2019, Chapter 15). Given a class of pairs of distributions P, define

Minimax Excess Risk(P) := inf_A sup_{(P_maj,P_min)∈P} Excess Risk[A; (P_maj, P_min)], (2)

where the infimum is over all measurable estimators A. The minimax excess risk is the excess risk of the "best" algorithm in the worst case over the class of problems defined by P.

4.1. LABEL SHIFT LOWER BOUNDS

We demonstrate the hardness of the label shift problem in general by establishing a lower bound on the minimax excess risk (Theorem 4.1), which we prove in Appendix B. We show that, rather surprisingly, the lower bound on the minimax excess risk scales as 1/n_min^{1/3}, depending only on the number of minority class samples and not on n_maj. Intuitively, this is because at any point x the learner must predict which class-conditional distribution (P(x | 1) or P(x | -1)) assigns higher likelihood at that x. To interpret this result, consider the extreme scenario where n_maj → ∞ but n_min is finite. In this case, the learner has full information about the majority class distribution. However, the learning task continues to be challenging since any learner would be uncertain about whether the minority class distribution assigns higher or lower likelihood at any given x. This uncertainty underlies the reason why the minimax rate of classification is constrained by the number of minority samples n_min. We also note that the theorem can be trivially extended to higher dimensions. In this case the exponent degrades to 1/(3d) rather than 1/3, as is to be expected in nonparametric classification. We briefly note that applying minimax lower bounds from the transfer learning literature (Maity et al., 2020, Theorem 3.1 with α = 1, β = 0 and d = 1) to our problem leads to a more optimistic lower bound of 1/n^{1/3}. Our lower bound, which scales as 1/n_min^{1/3}, uncovers the fact that only adding minority class samples helps reduce the risk.

4.2. GROUP-COVARIATE SHIFT LOWER BOUNDS

Next, we state our lower bound on the minimax excess risk, which demonstrates the hardness of the group-covariate shift problem. In the theorem below, c > 0 is an absolute constant independent of n_maj, n_min and τ.

Theorem 4.2. Consider the group shift setting described in Section 3.2.2. Given any overlap τ ∈ [0, 1], recall that P_GS(τ) is the class of distributions such that Overlap(P_maj, P_min) ≥ τ. The minimax excess risk in this setting is lower bounded as follows:

Minimax Excess Risk(P_GS(τ)) = inf_A sup_{(P_maj,P_min)∈P_GS(τ)} Excess Risk[A; (P_maj, P_min)] ≥ 1 / (200 (n_min · (2 - τ) + n_maj · τ)^{1/3}) ≥ 1 / (200 n_min^{1/3} (ρ · τ + 2)^{1/3}), (4)

where ρ = n_maj/n_min > 1. We prove this theorem in Appendix C. We see that in the low overlap setting (τ ≪ 1/ρ), the minimax excess risk is lower bounded by 1/n_min^{1/3}, and we are fundamentally constrained by the number of samples in the minority group. To see why this is the case, consider the extreme example with τ = 0 where P_a has support [0, 0.5] and P_b has support [0.5, 1]. The n_maj majority group samples from P_a provide information about the correct label to predict in the interval [0, 0.5] (the support of P_a). However, since the distribution P(y | x) is 1-Lipschitz, in the worst case these samples provide very limited information about the correct predictions in [0.5, 1] (the support of P_b). Thus, predicting on the support of P_b requires samples from the minority group, and this results in the n_min-dependent rate. In fact, in this extreme case (τ = 0) even if n_maj → ∞, the minimax excess risk is still bounded away from zero. This intuition also carries over to the case when the overlap is small but non-zero, and our lower bound shows that minority samples are much more valuable than majority samples at reducing the risk. 
On the other hand, when the overlap is high (τ ≫ 1/ρ) the minimax excess risk is lower bounded by 1/(n_min · (2 - τ) + n_maj · τ)^{1/3} (up to constants) and the extra majority samples are quite beneficial. This is roughly because the supports of P_a and P_b have large overlap, and hence samples from the majority group are useful in helping make predictions even in regions where P_b is large. In the extreme case when τ = 1, we have that P_a = P_b and therefore recover the classic i.i.d. setting with no distribution shift. Here, the lower bound scales with 1/n^{1/3}, as one might expect. As in the label shift case, the theorem can be extended to hold in higher dimensions with the exponent being 1/(3d) rather than 1/3. Previous work on transfer learning with covariate shift has considered other, more elaborate notions of transferability (Kpotufe & Martinet, 2018; Hanneke & Kpotufe, 2019) than the overlap between group distributions considered here. In the case of no overlap (τ = 0), previous results (Kpotufe & Martinet, 2018, Theorem 1 with α = 1, β = 0 and γ = ∞) yield the same lower bound of 1/n_min^{1/3}. Beyond the case of no overlap (τ = 0), our lower bound is key to drawing the simple conclusion that even when the overlap between group distributions is small, minority samples alone dictate the rate of convergence. On the other hand, when the overlap is large our bound tells us that all samples can help reduce the risk.
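The two regimes of Theorem 4.2 can be seen by plugging numbers into the lower bound; a quick sketch (pure arithmetic on the bound in Eq. (4), with illustrative sample sizes):

```python
def lower_bound(n_min, n_maj, tau):
    """Evaluate the Theorem 4.2 lower bound
    1 / (200 * (n_min * (2 - tau) + n_maj * tau) ** (1/3))."""
    return 1.0 / (200.0 * (n_min * (2.0 - tau) + n_maj * tau) ** (1.0 / 3.0))

n_min = 100
# Low overlap (tau = 0): the bound does not depend on n_maj at all.
print(lower_bound(n_min, n_maj=10**3, tau=0.0))
print(lower_bound(n_min, n_maj=10**6, tau=0.0))  # same value as above
# High overlap (tau = 0.9 >> 1/rho): extra majority samples shrink the bound.
print(lower_bound(n_min, n_maj=10**3, tau=0.9))
print(lower_bound(n_min, n_maj=10**6, tau=0.9))
```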

5. UPPER BOUNDS ON THE EXCESS RISK FOR THE UNDERSAMPLED BINNING ESTIMATOR

We will show that an undersampled estimator matches the rates in the previous section, showing that undersampling is an optimal robustness intervention. We start by defining the undersampling procedure and the undersampled binning estimator.

Undersampling Procedure. Given training data S := {(x_1, y_1), . . . , (x_n, y_n)}, generate a new undersampled dataset S_US by
• including all n_min samples from S_min, and
• including n_min samples from S_maj, chosen uniformly at random without replacement.

This procedure ensures that in the undersampled dataset S_US the groups are balanced, and that |S_US| = 2n_min. The undersampled binning estimator, defined next, first runs this undersampling procedure to obtain S_US and then uses only these samples to output a classifier.
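The undersampling procedure is a few lines of code; a minimal sketch (the toy data is hypothetical):

```python
import numpy as np

def undersample(S_maj, S_min, rng):
    """Build S_US: keep all n_min minority samples and n_min majority
    samples chosen uniformly at random without replacement."""
    n_min = len(S_min)
    keep = rng.choice(len(S_maj), size=n_min, replace=False)
    return [S_maj[i] for i in keep] + list(S_min)

# Hypothetical toy data: 900 majority and 100 minority samples.
rng = np.random.default_rng(0)
S_maj = [(float(x), 1) for x in rng.uniform(0, 1, 900)]
S_min = [(float(x), -1) for x in rng.uniform(0, 1, 100)]
S_US = undersample(S_maj, S_min, rng)
print(len(S_US))  # 2 * n_min = 200
```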

Undersampled Binning Estimator

The undersampled binning estimator A_USB takes as input a dataset S and a positive integer K corresponding to the number of bins, and returns a classifier A_USB^{S,K} : [0, 1] → {-1, 1}. This estimator is defined as follows:

1. First, compute the undersampled dataset S_US.
2. Given this dataset S_US, let n_{1,j} be the number of points with label +1 that lie in the interval I_j = [(j-1)/K, j/K]; define n_{-1,j} analogously. Then set A_j = 1 if n_{1,j} > n_{-1,j}, and A_j = -1 otherwise.
3. Define the classifier A_USB^{S,K} such that if x ∈ I_j then A_USB^{S,K}(x) = A_j.

Essentially, in each bin I_j we set the prediction to be the majority label among the samples that fall in this bin. Whenever the number of bins K is clear from the context we shall denote A_USB^{S,K} by A_USB^S. Below we establish upper bounds on the excess risk of this simple estimator.
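Steps 2 and 3 admit a direct implementation; a minimal sketch (it assumes the undersampled dataset S_US from step 1 has already been computed, and the toy data is hypothetical):

```python
import numpy as np

def fit_binning(S_US, K):
    """Majority-vote label A_j in each bin I_j = [(j-1)/K, j/K];
    ties give A_j = -1, since A_j = 1 only when n_{1,j} > n_{-1,j}."""
    votes = np.zeros(K)
    for x, y in S_US:
        j = min(int(x * K), K - 1)   # bin index; clip x = 1 into the last bin
        votes[j] += y                # votes[j] = n_{1,j} - n_{-1,j}
    return np.where(votes > 0, 1, -1)

def predict(A, x):
    K = len(A)
    return A[min(int(x * K), K - 1)]

# Hypothetical toy data whose labels follow a threshold at 0.5.
S_US = [(0.1, -1), (0.2, -1), (0.3, -1), (0.7, 1), (0.8, 1), (0.9, 1)]
A = fit_binning(S_US, K=2)
print(predict(A, 0.25), predict(A, 0.75))  # -1 1
```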

5.1. LABEL SHIFT UPPER BOUNDS

We now establish an upper bound on the excess risk of A_USB in the label shift setting (see Section 3.2.1). Below we let c, C > 0 be absolute constants independent of problem parameters like n_maj and n_min.

Theorem 5.1. Consider the label shift setting described in Section 3.2.1. For any (P_maj, P_min) ∈ P_LS, the expected excess risk of the undersampled binning estimator (Eq. (5)) with K = c·n_min^{1/3} bins is upper bounded by

Excess Risk[A_USB; (P_maj, P_min)] = E_{S∼P_maj^{n_maj}×P_min^{n_min}}[R(A_USB^S; P_test)] - R(f(P_test); P_test) ≤ C/n_min^{1/3}.

We prove this result in Appendix B. This upper bound, combined with the lower bound in Theorem 4.1, shows that an undersampling approach is minimax optimal up to constants in the presence of label shift. Our analysis leaves open the possibility of better algorithms when the learner has additional information about the structure of the label shift beyond Lipschitz continuity. We also note that it is straightforward to generalize the upper bound to higher dimensions with the exponent being 1/(3d) instead of 1/3.

5.2. GROUP-COVARIATE SHIFT UPPER BOUNDS

Next, we present our upper bound on the excess risk of the undersampled binning estimator in the group-covariate shift setting (see Section 3.2.2). In the theorem below, C > 0 is an absolute constant independent of the problem parameters n_maj, n_min and τ.

Theorem 5.2. Consider the group shift setting described in Section 3.2.2. For any overlap τ ∈ [0, 1] and any (P_maj, P_min) ∈ P_GS(τ), the expected excess risk of the undersampled binning estimator (Eq. (5)) with K = n_min^{1/3} bins is

Excess Risk[A_USB; (P_maj, P_min)] = E_{S∼P_maj^{n_maj}×P_min^{n_min}}[R(A_USB^S; P_test)] - R(f(P_test); P_test) ≤ C/n_min^{1/3}.

We provide a proof of this theorem in Appendix C. Compared to the lower bound established in Theorem 4.2, which scales as 1/(n_min · (2 - τ) + n_maj · τ)^{1/3}, the upper bound for the undersampled binning estimator always scales with 1/n_min^{1/3} since the estimator operates on the undersampled dataset S_US. Thus, we have shown that when the overlap is small (τ ≪ 1/ρ = n_min/n_maj) there is an undersampling algorithm that is minimax optimal up to constants. However, when there is high overlap (τ ≫ 1/ρ) there is a non-trivial gap between the upper and lower bounds:

Upper Bound / Lower Bound = c(ρ · τ + 2)^{1/3}.

Again, this upper bound can be generalized to higher dimensions.

6. MINORITY SAMPLE DEPENDENCE IN PRACTICE

Inspired by our worst-case theoretical predictions in nonparametric classification, we ask: how does the accuracy of neural network classifiers trained using robust algorithms evolve as a function of the number of majority and minority samples? To explore this question, we conduct a small case study using the imbalanced binary CIFAR10 dataset (Byrd & Lipton, 2019; Wang et al., 2022) that is constructed using the "cat" and "dog" classes.

Figure 2: We find that in accordance with our theory, for both of the classifiers adding only minority class samples (red) leads to a large gain in accuracy (∼6%), while adding majority class samples (blue) leads to little or no gain. In fact, adding majority samples sometimes hurts test accuracy due to the added bias. When we add majority and minority samples in a 5:1 ratio (green), the gain is largely due to the addition of minority samples and is only marginally higher (<2%) than adding only minority samples. The green curves correspond to the same classifiers in both the left and right panels.

When we add only majority class samples, the test accuracy remains constant or in some cases even decreases owing to the added bias of the classifiers. When we add samples to both groups proportionately, the increase in the test accuracy appears to be largely due to the increase in the number of minority class samples, and in the left panels we see that the difference between adding only extra minority group samples (red) and adding both minority and majority group samples (green) is small. Thus, we find that the accuracy of these neural network classifiers is also constrained by the number of minority class samples. Similar conclusions hold for classifiers trained using the tilted loss (Li et al., 2020) and the group-DRO objective (Sagawa et al., 2020) (see Appendix D).

7. DISCUSSION

We showed that undersampling is an optimal robustness intervention in nonparametric classification in the absence of significant overlap between group distributions or of additional structure beyond Lipschitz continuity. At a high level, our results highlight the need to reason about the specific structure of the distribution shift and to design algorithms that are tailored to take advantage of this structure. This requires us to step away from the common practice in robust machine learning of designing "universal" robustness interventions that are agnostic to the structure of the shift. Alongside this, our results also point to the need for datasets and benchmarks that admit transfer from the train to the test distribution.

A TECHNICAL TOOLS

In this section we collect some technical tools that will be used in all of the proofs below.

A.1 REDUCTION TO LOWER BOUNDS OVER A FINITE CLASS

The lower bound on the minimax excess risk will be established via the usual route of first identifying a "hard" finite set of problem instances and then establishing the lower bound over this finite class. One difference from the usual setup in proving such lower bounds (see Wainwright, 2019, Chapter 15) is that the training samples are drawn from an imbalanced distribution, whereas the test samples are drawn from a balanced one. Let P be a class of pairs of distributions, where each element (P_maj, P_min) ∈ P is a pair of distributions over [0, 1] × {-1, 1}. As before, we let P_test denote the uniform mixture over P_maj and P_min. We let V denote a finite index set. Corresponding to each element v ∈ V there is a P_v = (P_{v,maj}, P_{v,min}) ∈ P with P_{v,test} = (P_{v,maj} + P_{v,min})/2. Finally, also define a pair of random variables (V, S) as follows:

1. V is a uniform random variable over the set V.

2. (S | V = v) ∼ P_{v,maj}^{n_maj} × P_{v,min}^{n_min}, an independent draw of n_maj samples from P_{v,maj} and n_min samples from P_{v,min}.

We shall let Q denote the joint distribution of the random variables (V, S), and let Q_S denote the marginal distribution of S. With this notation in place, we now present a lemma that lower bounds the minimax excess risk in terms of quantities defined over the finite class of "hard" instances P_v.

Lemma A.1. Let the random variables (V, S) be as defined above. The minimax excess risk is lower bounded as follows:

Minimax Excess Risk(P) = inf_A sup_{(P_maj,P_min)∈P} E_{S∼P_maj^{n_maj}×P_min^{n_min}}[R(A_S; P_test)] - R(f(P_test); P_test) ≥ R_V - B_V,

where R_V and the Bayes error B_V are defined as

R_V := E_{S∼Q_S}[inf_h P_{(x,y)∼Σ_{v∈V} Q(v|S) P_{v,test}}(h(x) ≠ y)],
B_V := E_V[R(f(P_{V,test}); P_{V,test})].

Proof. By the definition of the minimax excess risk,

Minimax Excess Risk(P) = inf_A sup_{(P_maj,P_min)∈P} E_{S∼P_maj^{n_maj}×P_min^{n_min}}[R(A_S; P_test)] - R(f(P_test); P_test)
≥ inf_A sup_{v∈V} E_{S|v∼P_{v,maj}^{n_maj}×P_{v,min}^{n_min}}[R(A_S; P_{v,test})] - R(f(P_{v,test}); P_{v,test})
≥ inf_A E_V[E_{S|V∼P_{V,maj}^{n_maj}×P_{V,min}^{n_min}}[R(A_S; P_{V,test})] - R(f(P_{V,test}); P_{V,test})]
= inf_A E_V[E_{S|V}[R(A_S; P_{V,test})]] - E_V[R(f(P_{V,test}); P_{V,test})],

where the second term is exactly B_V. We continue lower bounding the first term as follows:

inf_A E_V[E_{S|V}[R(A_S; P_{V,test})]] = inf_A E_{(V,S)∼Q}[P_{(x,y)∼P_{V,test}}(A_S(x) ≠ y)]
= inf_A E_{S∼Q_S} E_{V∼Q(·|S)}[P_{(x,y)∼P_{V,test}}(A_S(x) ≠ y)]
(i) ≥ E_{S∼Q_S}[inf_h E_{V∼Q(·|S)}[P_{(x,y)∼P_{V,test}}(h(x) ≠ y)]]
= E_{S∼Q_S}[inf_h P_{(x,y)∼Σ_{v∈V} Q(v|S) P_{v,test}}(h(x) ≠ y)] = R_V,

where (i) follows since A_S is a fixed classifier given the sample set S. This, combined with the previous display, completes the proof.

A.2 THE HAT FUNCTION AND ITS PROPERTIES

In this section, we define the hat function and establish some of its properties. This function will be useful in defining "hard" problem instances to prove our lower bounds. Given a positive integer K the hat function is defined as φ K (x) =    x + 1 4K -1 4K for x ∈ -1 2K , 0 , 1 4K -x -1 4K for x ∈ 0, 1 2K , 0 otherwise. ( ) When K is clear from context, we omit the subscript. We first notice that this function is 1-Lipschitz and odd, so 1 2K -1 2K φ K (x) dx = 0. We also compute some other key quantities for φ. Lemma A.2. For any positive integer K, 1 2K -1 2K |φ K (x)| dx = 1 8K 2 . Proof. We suppress K in the notation. We have that, 1 2K -1 2K |φ(x)| dx = 0 -1 2K 1 4K -x + 1 4K dx + 1 2K 0 x - 1 4K - 1 4K dx. The integrand 1 4Kx + 1 4K over x ∈ -1 2K , 0 defines a triangle with base 1 2K and height 1 4K , thus it has area 1 16K 2 . Therefore, 0 -1 2K 1 4K -x + 1 4K dx = 1 16K 2 . The same holds for the second term. Thus, by adding them up we get that 1 2K -1 2K |φ(x)| dx = 1 8K 2 . Under review as a conference paper at ICLR 2023 Lemma A.3. For any positive integer K, 1 K 0 log 1 + φ K (x -1 2K ) 1 -φ K (x -1 2K ) 1 + φ K x - 1 2K dx ≤ 1 3K 3 and 1 K 0 log 1 -φ K (x -1 2K ) 1 + φ K (x -1 2K ) 1 -φ K x - 1 2K dx ≤ 1 3K 3 . Proof. Let us suppress K in the notation. We prove the first bound below and the second bound follows by an identical argument. We have that 1 K 0 log 1 + φ(x -1 2K ) 1 -φ(x -1 2K ) 1 + φ x - 1 2K dx = 1 2K -1 2K log 1 + φ(x) 1 -φ(x) (1 + φ(x)) dx = 1 2K 0 log 1 + φ(x) 1 -φ(x) (1 + φ(x)) dx + 0 -1 2K log 1 + φ(x) 1 -φ(x) (1 + φ(x)) dx = 1 2K 0 log 1 + φ(x) 1 -φ(x) (1 + φ(x)) dx - 0 1 2K log 1 + φ(-x) 1 -φ(-x) (1 + φ(-x)) dx = 1 2K 0 log 1 + φ(x) 1 -φ(x) (1 + φ(x)) dx + 1 2K 0 log 1 -φ(x) 1 + φ(x) (1 -φ(x)) dx, where the last equality follows since φ is an odd function. 
Now, we may collect the integrands to get that
\begin{align*}
\int_0^{1/K} \log\bigg(\frac{1 + \varphi(x - \tfrac{1}{2K})}{1 - \varphi(x - \tfrac{1}{2K})}\bigg)\Big(1 + \varphi\big(x - \tfrac{1}{2K}\big)\Big) dx &= 2\int_0^{1/(2K)} \log\bigg(\frac{1 + \varphi(x)}{1 - \varphi(x)}\bigg)\varphi(x) \, dx \\
&= 2\int_0^{1/(2K)} \log\bigg(1 + \frac{2\varphi(x)}{1 - \varphi(x)}\bigg)\varphi(x) \, dx \\
&\leq 2\int_0^{1/(2K)} \frac{2\varphi(x)^2}{1 - \varphi(x)} \, dx,
\end{align*}
where the last inequality follows since $\log(1 + x) \leq x$ for all $x > -1$. Now we observe that $\varphi(x) \leq x \leq \tfrac{1}{2}$ for $x \in [0, \tfrac{1}{2K}]$, and in particular, $\frac{1}{1 - \varphi(x)} \leq 2$. Thus,
\[
\int_0^{1/K} \log\bigg(\frac{1 + \varphi(x - \tfrac{1}{2K})}{1 - \varphi(x - \tfrac{1}{2K})}\bigg)\Big(1 + \varphi\big(x - \tfrac{1}{2K}\big)\Big) dx \leq 8\int_0^{1/(2K)} \varphi(x)^2 \, dx \leq 8\int_0^{1/(2K)} x^2 \, dx = \frac{1}{3K^3}.
\]
This proves the first bound. The second bound follows analogously.
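As a quick numerical sanity check of these properties (our own addition, not part of the paper; the function names and grid resolution are our choices), the following Python sketch evaluates $\varphi_K$ on a fine grid and verifies that it integrates to zero while $|\varphi_K|$ integrates to $1/(8K^2)$, as stated above and in Lemma A.2.

```python
import numpy as np

def hat(x, K):
    """The odd, 1-Lipschitz hat function phi_K of Eq. (6), supported on [-1/(2K), 1/(2K)]."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    left = (x >= -1 / (2 * K)) & (x < 0)
    right = (x >= 0) & (x <= 1 / (2 * K))
    out[left] = np.abs(x[left] + 1 / (4 * K)) - 1 / (4 * K)
    out[right] = 1 / (4 * K) - np.abs(x[right] - 1 / (4 * K))
    return out

def trapezoid(y, dx):
    # simple trapezoidal rule; exact here because phi_K is piecewise linear
    # and the grid below contains all of its kinks
    return dx * (y.sum() - 0.5 * (y[0] + y[-1]))

K = 4
xs = np.linspace(-1 / (2 * K), 1 / (2 * K), 200_001)
dx = xs[1] - xs[0]
print(trapezoid(hat(xs, K), dx))           # ~0, since phi_K is odd
print(trapezoid(np.abs(hat(xs, K)), dx))   # ~1/(8*K**2), matching Lemma A.2
```

The two triangles of $|\varphi_K|$ each have area $1/(16K^2)$, which is what the second printed integral confirms.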

B PROOFS IN THE LABEL SHIFT SETTING

Throughout this section we operate in the label shift setting (Section 3.2.1). First, in Appendix B.1, we prove the minimax lower bound, Theorem 4.1, through a sequence of lemmas. Next, in Appendix B.2, we prove Theorem 5.1, which is an upper bound on the excess risk of the undersampled binning estimator (see Eq. (5)) with $\lceil n_\text{min}^{1/3} \rceil$ bins, by invoking previous results on nonparametric density estimation (Freedman & Diaconis, 1981; Devroye & Györfi, 1985).

B.1 PROOF OF THEOREM 4.1

In this section, we provide a proof of the minimax lower bound in the label shift setting. We will proceed by constructing a class of distributions where the separation between any two distributions in the class is small enough that it is hard to distinguish between them with finitely many minority class samples. In particular, we split the interval $[0, 1]$ into sub-intervals, and on each sub-interval each class-conditional distribution either has slightly more probability mass on the left side of the sub-interval, slightly more on the right, or is completely uniform. Since the minority class sample size is limited, no classifier will be able to tell which distribution the minority class is generated from, and hence any classifier will suffer high excess risk.

We construct the "hard" set of distributions as follows. Fix $K$ to be an integer that will be specified in the sequel as a function of $n_\text{min}$. Let the index set be $\mathcal{V} = \{-1, 0, 1\}^K \times \{-1, 0, 1\}^K$. For $v \in \mathcal{V}$, we will let $v_1 \in \{-1, 0, 1\}^K$ be the first $K$ coordinates and $v_{-1} \in \{-1, 0, 1\}^K$ be the last $K$ coordinates; that is, $v = (v_1, v_{-1})$. For every $v \in \mathcal{V}$, we define a pair of class-conditional distributions $P_{v,1}$ and $P_{v,-1}$ as follows: for $x \in I_j = \big[\tfrac{j-1}{K}, \tfrac{j}{K}\big]$,
\[
P_{v,1}(x) = 1 + v_{1,j} \, \varphi\Big(x - \tfrac{2j-1}{2K}\Big), \qquad P_{v,-1}(x) = 1 + v_{-1,j} \, \varphi\Big(x - \tfrac{2j-1}{2K}\Big),
\]
where $\varphi$ is defined in Eq. (6) and $\tfrac{2j-1}{2K}$ is the midpoint of $I_j$. Notice that $P_{v,1}$ only depends on $v_1$ while $P_{v,-1}$ only depends on $v_{-1}$. We continue to define
\[
P_{v,\text{maj}}(x, y) = P_{v,1}(x)\,\mathbf{1}(y = 1), \qquad P_{v,\text{min}}(x, y) = P_{v,-1}(x)\,\mathbf{1}(y = -1),
\]
and
\[
P_{v,\text{test}}(x, y) = \frac{P_{v,\text{maj}}(x, y) + P_{v,\text{min}}(x, y)}{2} = \frac{P_{v,1}(x)\,\mathbf{1}(y = 1) + P_{v,-1}(x)\,\mathbf{1}(y = -1)}{2}.
\]
Observe that under the test distribution it is equally likely for the label to be $+1$ or $-1$. Recall that, as described in Section A.1, $V$ shall be a uniform random variable over $\mathcal{V}$ and $S \mid V = v \sim P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}$. We shall let $Q$ denote the joint distribution of $(V, S)$ and let $Q_S$ denote the marginal over $S$. With this construction in place, we first prove the following lower bound on the minimax excess risk.

Lemma B.1.
For any positive integers $K$, $n_\text{maj}$, $n_\text{min}$, the minimax excess risk is lower bounded as follows:
\[
\text{Minimax Excess Risk}(\mathcal{P}_\text{LS}) = \inf_A \sup_{(P_\text{maj}, P_\text{min}) \in \mathcal{P}_\text{LS}} \mathbb{E}_{S \sim P_\text{maj}^{n_\text{maj}} \times P_\text{min}^{n_\text{min}}}\big[R(A_S; P_\text{test})\big] - R(f(P_\text{test}); P_\text{test}) \geq \frac{1}{36K} - \frac{1}{2}\,\mathbb{E}_{S \sim Q_S}\bigg[\mathrm{TV}\bigg(\sum_{v \in \mathcal{V}} Q(v \mid S) P_{v,1}, \; \sum_{v \in \mathcal{V}} Q(v \mid S) P_{v,-1}\bigg)\bigg]. \tag{7}
\]

Proof. By invoking Lemma A.1 we get that
\[
\text{Minimax Excess Risk}(\mathcal{P}_\text{LS}) \geq \underbrace{\mathbb{E}_{S \sim Q_S}\Big[\inf_h \mathbb{P}_{(x,y) \sim \sum_{v} Q(v \mid S) P_{v,\text{test}}}(h(x) \neq y)\Big]}_{=: R_V} - \underbrace{\mathbb{E}_V\big[R(f(P_{V,\text{test}}); P_{V,\text{test}})\big]}_{=: B_V}.
\]
We proceed by calculating alternate expressions for $R_V$ and $B_V$ to obtain our desired lower bound on the minimax excess risk.

Calculation of $R_V$: Immediately by Le Cam's lemma (Wainwright, 2019, Eq. 15.13), we get that
\[
R_V = \mathbb{E}_{S \sim Q_S}\Big[\inf_h \mathbb{P}_{(x,y) \sim \sum_{v} Q(v \mid S) P_{v,\text{test}}}(h(x) \neq y)\Big] = \frac{1}{2}\,\mathbb{E}_{S \sim Q_S}\bigg[1 - \mathrm{TV}\bigg(\sum_{v \in \mathcal{V}} Q(v \mid S) P_{v,1}, \; \sum_{v \in \mathcal{V}} Q(v \mid S) P_{v,-1}\bigg)\bigg]. \tag{8}
\]

Calculation of $B_V$: Again by invoking Le Cam's lemma (Wainwright, 2019, Eq. 15.13), we get that for any class-conditional distributions $P_1$, $P_{-1}$,
\[
R(f; P_\text{test}) = \frac{1}{2} - \frac{1}{2}\,\mathrm{TV}(P_1, P_{-1}).
\]
So by taking expectations, we get that
\[
B_V = \mathbb{E}_V\big[R(f(P_{V,\text{test}}); P_{V,\text{test}})\big] = \mathbb{E}_V\Big[\frac{1}{2} - \frac{1}{2}\,\mathrm{TV}(P_{V,1}, P_{V,-1})\Big]. \tag{9}
\]
We now compute $\mathbb{E}_V[\mathrm{TV}(P_{V,1}, P_{V,-1})]$ as follows:
\begin{align*}
\mathbb{E}_V[\mathrm{TV}(P_{V,1}, P_{V,-1})] &= \frac{1}{2}\,\mathbb{E}_V\bigg[\int_{x=0}^{1} |P_{V,1}(x) - P_{V,-1}(x)| \, dx\bigg] \\
&= \frac{1}{2}\,\mathbb{E}_V\bigg[\sum_{j=1}^K \int_{(j-1)/K}^{j/K} |V_{1,j} - V_{-1,j}| \, \Big|\varphi\Big(x - \tfrac{2j-1}{2K}\Big)\Big| \, dx\bigg] \\
&= \frac{1}{2} \sum_{j=1}^K \mathbb{E}_V\bigg[|V_{1,j} - V_{-1,j}| \int_{(j-1)/K}^{j/K} \Big|\varphi\Big(x - \tfrac{2j-1}{2K}\Big)\Big| \, dx\bigg] \overset{(i)}{=} \frac{1}{16K^2} \sum_{j=1}^K \mathbb{E}_V\big[|V_{1,j} - V_{-1,j}|\big],
\end{align*}
where $(i)$ follows by Lemma A.2. Observe that $V_{1,j}$ and $V_{-1,j}$ are independent uniform random variables on $\{-1, 0, 1\}$; it is therefore straightforward to compute that $\mathbb{E}_V[|V_{1,j} - V_{-1,j}|] = \frac{8}{9}$. This yields that
\[
\mathbb{E}_V[\mathrm{TV}(P_{V,1}, P_{V,-1})] = \frac{1}{18K}.
\]
Plugging this into Eq. (9) allows us to conclude that
\[
B_V = \mathbb{E}_V\big[R(f(P_{V,\text{test}}); P_{V,\text{test}})\big] = \frac{1}{2}\Big(1 - \frac{1}{18K}\Big). \tag{10}
\]
Combining Eqs. (8) and (10) establishes the claimed result.

In light of this previous lemma, we now aim to upper bound the expected total variation distance in Eq. (7).

Lemma B.2.
Suppose that $V$ is drawn uniformly from the set $\mathcal{V} = \{-1, 0, 1\}^K \times \{-1, 0, 1\}^K$, and that $S \mid V = v$ is drawn from $P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}$. Then,
\[
\mathbb{E}_S\bigg[\mathrm{TV}\bigg(\sum_{v \in \mathcal{V}} Q(v \mid S) P_{v,1}, \; \sum_{v \in \mathcal{V}} Q(v \mid S) P_{v,-1}\bigg)\bigg] \leq \frac{1}{18K} - \frac{1}{144K}\exp\Big(-\frac{n_\text{min}}{3K^3}\Big).
\]

Proof. Let $\psi := \mathbb{E}_S\big[\mathrm{TV}\big(\sum_{v \in \mathcal{V}} Q(v \mid S) P_{v,1}, \sum_{v \in \mathcal{V}} Q(v \mid S) P_{v,-1}\big)\big]$. Then,
\begin{align*}
\psi &= \frac{1}{2}\,\mathbb{E}_S\bigg[\int_{x=0}^{1} \Big|\sum_{v \in \mathcal{V}} Q(v \mid S)\big(P_{v,1}(x) - P_{v,-1}(x)\big)\Big| \, dx\bigg] \\
&= \frac{1}{2}\,\mathbb{E}_S\bigg[\sum_{j=1}^K \int_{(j-1)/K}^{j/K} \Big|\sum_{v \in \mathcal{V}} Q(v \mid S)(v_{1,j} - v_{-1,j})\,\varphi\Big(x - \tfrac{2j-1}{2K}\Big)\Big| \, dx\bigg],
\end{align*}
where the last equality is by the definition of $P_{v,1}$ and $P_{v,-1}$. Continuing, we get that
\begin{align*}
\psi &= \frac{1}{2} \sum_{j=1}^K \bigg(\int_{(j-1)/K}^{j/K} \Big|\varphi\Big(x - \tfrac{2j-1}{2K}\Big)\Big| \, dx\bigg)\,\mathbb{E}_S\bigg[\Big|\sum_{v \in \mathcal{V}} Q(v \mid S)(v_{1,j} - v_{-1,j})\Big|\bigg] \\
&\overset{(i)}{=} \frac{1}{16K^2} \sum_{j=1}^K \int \Big|\sum_{v \in \mathcal{V}} Q(v, S)(v_{1,j} - v_{-1,j})\Big| \, dS \overset{(ii)}{=} \frac{1}{16K^2 |\mathcal{V}|} \sum_{j=1}^K \int \Big|\sum_{v \in \mathcal{V}} Q(S \mid v)(v_{1,j} - v_{-1,j})\Big| \, dS,
\end{align*}
where $(i)$ follows by the calculation in Lemma A.2 and $(ii)$ follows since $V$ is a uniform random variable over the set $\mathcal{V}$. The distributions $P_{v,1}$ and $P_{v,-1}$ are defined symmetrically over all the intervals $I_j = [\tfrac{j-1}{K}, \tfrac{j}{K}]$, and hence all of the summands on the right-hand side above are equal. Thus,
\[
\psi = \frac{1}{16K|\mathcal{V}|} \int \Big|\sum_{v \in \mathcal{V}} Q(S \mid v)(v_{1,1} - v_{-1,1})\Big| \, dS. \tag{11}
\]
Before we continue further, let us define $\mathcal{V}^+ = \{v \in \mathcal{V} \mid v_{1,1} > v_{-1,1}\}$. For every $v \in \mathcal{V}^+$, let $\tilde{v} \in \mathcal{V}$ be equal to $v$ on all coordinates, except $\tilde{v}_{1,1} = -v_{1,1}$ and $\tilde{v}_{-1,1} = -v_{-1,1}$. Then, continuing from Eq. (11), we find that
\begin{align*}
\psi &\overset{(i)}{=} \frac{1}{16K|\mathcal{V}|} \int \Big|\sum_{v \in \mathcal{V}^+} (v_{1,1} - v_{-1,1})\big(Q(S \mid v) - Q(S \mid \tilde{v})\big)\Big| \, dS \\
&\overset{(ii)}{\leq} \frac{1}{16K|\mathcal{V}|} \sum_{v \in \mathcal{V}^+} (v_{1,1} - v_{-1,1}) \int \big|Q(S \mid v) - Q(S \mid \tilde{v})\big| \, dS \\
&= \frac{1}{8K|\mathcal{V}|} \underbrace{\sum_{v \in \mathcal{V}^+} (v_{1,1} - v_{-1,1})\,\mathrm{TV}\big(Q(S \mid v), Q(S \mid \tilde{v})\big)}_{=: \Xi}, \tag{12}
\end{align*}
where in $(i)$ we use the definition of $\mathcal{V}^+$ and $\tilde{v}$, and $(ii)$ follows by the triangle inequality since $v_{1,1} > v_{-1,1}$ for $v \in \mathcal{V}^+$.
Now we further partition $\mathcal{V}^+$ into $3$ sets $\mathcal{V}^{(1,0)}, \mathcal{V}^{(0,-1)}, \mathcal{V}^{(1,-1)}$ as follows:
\[
\mathcal{V}^{(1,0)} = \{v \in \mathcal{V} \mid v_{1,1} = 1, v_{-1,1} = 0\}, \quad \mathcal{V}^{(0,-1)} = \{v \in \mathcal{V} \mid v_{1,1} = 0, v_{-1,1} = -1\}, \quad \mathcal{V}^{(1,-1)} = \{v \in \mathcal{V} \mid v_{1,1} = 1, v_{-1,1} = -1\}.
\]
Note that $Q(S \mid v) = P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}$, and therefore
\begin{align*}
\Xi &= \sum_{v \in \mathcal{V}^+} (v_{1,1} - v_{-1,1})\,\mathrm{TV}\big(P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}, \; P_{\tilde{v},\text{maj}}^{n_\text{maj}} \times P_{\tilde{v},\text{min}}^{n_\text{min}}\big) \\
&\overset{(i)}{=} \sum_{v \in \mathcal{V}^{(1,0)}} \mathrm{TV}\big(P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}, \; P_{\tilde{v},\text{maj}}^{n_\text{maj}} \times P_{\tilde{v},\text{min}}^{n_\text{min}}\big) + \sum_{v \in \mathcal{V}^{(0,-1)}} \mathrm{TV}\big(P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}, \; P_{\tilde{v},\text{maj}}^{n_\text{maj}} \times P_{\tilde{v},\text{min}}^{n_\text{min}}\big) \\
&\qquad + 2\sum_{v \in \mathcal{V}^{(1,-1)}} \mathrm{TV}\big(P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}, \; P_{\tilde{v},\text{maj}}^{n_\text{maj}} \times P_{\tilde{v},\text{min}}^{n_\text{min}}\big), \tag{13}
\end{align*}
where $(i)$ follows since $v_1, v_{-1} \in \{-1, 0, 1\}^K$ and by the definition of the sets $\mathcal{V}^{(1,0)}$, $\mathcal{V}^{(0,-1)}$ and $\mathcal{V}^{(1,-1)}$ (the factor $v_{1,1} - v_{-1,1}$ equals $1$, $1$ and $2$ on these three sets, respectively). Now by the Bretagnolle-Huber inequality (see Canonne, 2022, Corollary 4),
\[
\mathrm{TV}\big(P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}, \; P_{\tilde{v},\text{maj}}^{n_\text{maj}} \times P_{\tilde{v},\text{min}}^{n_\text{min}}\big) = \mathrm{TV}\big(P_{\tilde{v},\text{maj}}^{n_\text{maj}} \times P_{\tilde{v},\text{min}}^{n_\text{min}}, \; P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}\big) \leq 1 - \frac{1}{2}\exp\Big(-\mathrm{KL}\big(P_{\tilde{v},\text{maj}}^{n_\text{maj}} \times P_{\tilde{v},\text{min}}^{n_\text{min}} \,\big\|\, P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}\big)\Big),
\]
where we flip the arguments in the first step for simplicity later. Next, by the chain rule for the KL divergence, we have that
\[
\mathrm{KL}\big(P_{\tilde{v},\text{maj}}^{n_\text{maj}} \times P_{\tilde{v},\text{min}}^{n_\text{min}} \,\big\|\, P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}\big) = n_\text{maj}\,\mathrm{KL}(P_{\tilde{v},\text{maj}} \,\|\, P_{v,\text{maj}}) + n_\text{min}\,\mathrm{KL}(P_{\tilde{v},\text{min}} \,\|\, P_{v,\text{min}}).
\]
Using these, let us upper bound the term in Eq. (13) corresponding to $v \in \mathcal{V}^{(0,-1)}$. For $v \in \mathcal{V}^{(0,-1)}$, notice that $\mathrm{KL}(P_{\tilde{v},\text{maj}} \,\|\, P_{v,\text{maj}}) = 0$, since $v_{1,j} = \tilde{v}_{1,j}$ for all $j \in \{1, \dots, K\}$. For the second term, $\mathrm{KL}(P_{\tilde{v},\text{min}} \,\|\, P_{v,\text{min}})$, only the coordinates $v_{-1,1}$ and $\tilde{v}_{-1,1}$ differ, so
\[
\mathrm{KL}(P_{\tilde{v},\text{min}} \,\|\, P_{v,\text{min}}) = \int_0^1 P_{\tilde{v},-1}(x) \log\frac{P_{\tilde{v},-1}(x)}{P_{v,-1}(x)} \, dx = \int_0^{1/K} \log\bigg(\frac{1 + \varphi_K(x - \tfrac{1}{2K})}{1 - \varphi_K(x - \tfrac{1}{2K})}\bigg)\Big(1 + \varphi_K\big(x - \tfrac{1}{2K}\big)\Big) dx \leq \frac{1}{3K^3},
\]
where the last inequality is a result of the calculation in Lemma A.3. Since $|\mathcal{V}^{(0,-1)}| = 9^{K-1}$, we therefore get
\[
\sum_{v \in \mathcal{V}^{(0,-1)}} \mathrm{TV}\big(P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}, \; P_{\tilde{v},\text{maj}}^{n_\text{maj}} \times P_{\tilde{v},\text{min}}^{n_\text{min}}\big) \leq 9^{K-1}\Big(1 - \frac{1}{2}\exp\Big(-\frac{n_\text{min}}{3K^3}\Big)\Big).
\]
For the terms in Eq. (13) corresponding to $\mathcal{V}^{(1,0)}$ and $\mathcal{V}^{(1,-1)}$, we simply take the trivial bound $\mathrm{TV} \leq 1$ to get
\[
\sum_{v \in \mathcal{V}^{(1,0)}} \mathrm{TV}\big(P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}, \; P_{\tilde{v},\text{maj}}^{n_\text{maj}} \times P_{\tilde{v},\text{min}}^{n_\text{min}}\big) \leq 9^{K-1}, \qquad \sum_{v \in \mathcal{V}^{(1,-1)}} \mathrm{TV}\big(P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}, \; P_{\tilde{v},\text{maj}}^{n_\text{maj}} \times P_{\tilde{v},\text{min}}^{n_\text{min}}\big) \leq 9^{K-1}.
\]
Plugging these bounds into Eq. (13), we get that
\[
\Xi \leq 4 \cdot 9^{K-1} - \frac{9^{K-1}}{2}\exp\Big(-\frac{n_\text{min}}{3K^3}\Big).
\]
Now using this bound on $\Xi$ in Eq.
(12), and observing that $|\mathcal{V}| = 9^K$, we get that
\[
\psi = \mathbb{E}_S\bigg[\mathrm{TV}\bigg(\sum_{v \in \mathcal{V}} Q(v \mid S) P_{v,1}, \; \sum_{v \in \mathcal{V}} Q(v \mid S) P_{v,-1}\bigg)\bigg] \leq \frac{1}{8 \cdot 9^K K}\bigg(4 \cdot 9^{K-1} - \frac{9^{K-1}}{2}\exp\Big(-\frac{n_\text{min}}{3K^3}\Big)\bigg) = \frac{1}{18K} - \frac{1}{144K}\exp\Big(-\frac{n_\text{min}}{3K^3}\Big),
\]
completing the proof.

Finally, we combine Lemma B.1 and Lemma B.2 to establish the minimax lower bound in the label shift setting. We recall the statement of the theorem here.

Theorem 4.1. Consider the label shift setting described in Section 3.2.1. Recall that $\mathcal{P}_\text{LS}$ is the class of pairs of distributions $(P_\text{maj}, P_\text{min})$ that satisfy the assumptions in that section. The minimax excess risk over this class is lower bounded as follows:
\[
\text{Minimax Excess Risk}(\mathcal{P}_\text{LS}) = \inf_A \sup_{(P_\text{maj}, P_\text{min}) \in \mathcal{P}_\text{LS}} \text{Excess Risk}[A; (P_\text{maj}, P_\text{min})] \geq \frac{1}{600} \cdot \frac{1}{n_\text{min}^{1/3}}. \tag{3}
\]

Proof. By Lemma B.1 we know that
\[
\text{Minimax Excess Risk}(\mathcal{P}_\text{LS}) \geq \frac{1}{36K} - \frac{1}{2}\,\mathbb{E}_{S \sim Q_S}\bigg[\mathrm{TV}\bigg(\sum_{v \in \mathcal{V}} Q(v \mid S) P_{v,1}, \; \sum_{v \in \mathcal{V}} Q(v \mid S) P_{v,-1}\bigg)\bigg].
\]
Next, by the calculation in Lemma B.2, we have that
\[
\text{Minimax Excess Risk}(\mathcal{P}_\text{LS}) \geq \frac{1}{36K} - \frac{1}{2}\bigg(\frac{1}{18K} - \frac{1}{144K}\exp\Big(-\frac{n_\text{min}}{3K^3}\Big)\bigg) = \frac{1}{288K}\exp\Big(-\frac{n_\text{min}}{3K^3}\Big).
\]
Setting $K = \lceil n_\text{min}^{1/3} \rceil$ yields the following:
\[
\text{Minimax Excess Risk}(\mathcal{P}_\text{LS}) \geq \frac{1}{288 \lceil n_\text{min}^{1/3} \rceil}\exp\bigg(-\frac{n_\text{min}}{3\lceil n_\text{min}^{1/3} \rceil^3}\bigg) \geq \frac{\exp(-\tfrac{1}{3})}{288} \cdot \frac{n_\text{min}^{1/3}}{\lceil n_\text{min}^{1/3} \rceil} \cdot \frac{1}{n_\text{min}^{1/3}} \overset{(i)}{\geq} \frac{0.7\exp(-\tfrac{1}{3})}{288} \cdot \frac{1}{n_\text{min}^{1/3}} \geq \frac{1}{600} \cdot \frac{1}{n_\text{min}^{1/3}},
\]
where $(i)$ follows since $n_\text{min}^{1/3} / \lceil n_\text{min}^{1/3} \rceil \geq 0.7$ for $n_\text{min} \geq 1$.

B.2 PROOF OF THEOREM 5.1

In this section, we derive an upper bound on the excess risk of the undersampled binning estimator $A_\text{USB}$ (Eq. (5)) in the label shift setting. Recall that given a dataset $S$, this estimator first computes the undersampled dataset $S^\text{US}$, in which the number of points from the majority class is reduced to match the number of points from the minority class ($n_\text{min}$), so that the size of the dataset is $2n_\text{min}$. Throughout this section, $(P_\text{maj}, P_\text{min})$ shall be an arbitrary element of $\mathcal{P}_\text{LS}$. To bound the excess risk of the undersampling algorithm, we will relate it to density estimation.
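Before turning to the upper bound, here is a quick numerical sanity check (our own addition, not part of the proof) of the final chain of inequalities in the proof of Theorem 4.1 above: with $K = \lceil n_{\min}^{1/3} \rceil$, the quantity $\frac{1}{288K}\exp(-n_{\min}/(3K^3))$ should be at least $\frac{1}{600}\,n_{\min}^{-1/3}$ for every $n_{\min} \geq 1$.

```python
import math

def lower_bound(n_min):
    """1/(288 K) * exp(-n_min / (3 K^3)) with the choice K = ceil(n_min^(1/3))."""
    K = math.ceil(n_min ** (1 / 3))
    return math.exp(-n_min / (3 * K ** 3)) / (288 * K)

# Theorem 4.1 claims this is at least 1/(600 n_min^(1/3)); check the ratio
# of the two sides over a range of sample sizes.
worst_ratio = min(lower_bound(n) * 600 * n ** (1 / 3) for n in range(1, 10001))
print(worst_ratio >= 1.0)
```

The worst case in this range occurs just above perfect cubes, where $\lceil n_{\min}^{1/3} \rceil$ overshoots $n_{\min}^{1/3}$ the most; even there the ratio stays above one.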
Recall that $n_{1,j}$ denotes the number of points in $S^\text{US}$ with label $+1$ that lie in $I_j$, and that $n_{-1,j}$ is defined analogously. Given a positive integer $K$, for $x \in I_j = [\tfrac{j-1}{K}, \tfrac{j}{K}]$, by the definition of the undersampled binning estimator (Eq. (5)),
\[
A_\text{USB}^S(x) = \begin{cases} 1 & \text{if } n_{1,j} > n_{-1,j}, \\ -1 & \text{otherwise.} \end{cases}
\]
Recall that since we have undersampled, $\sum_j n_{1,j} = \sum_j n_{-1,j} = n_\text{min}$. Therefore, define the simple histogram estimators for $P_1(x) = P(x \mid y = 1)$ and $P_{-1}(x) = P(x \mid y = -1)$ as follows: for $x \in I_j$,
\[
\hat{P}_1^S(x) := \frac{K \, n_{1,j}}{n_\text{min}} \qquad \text{and} \qquad \hat{P}_{-1}^S(x) := \frac{K \, n_{-1,j}}{n_\text{min}}.
\]
With this histogram estimator in place, we may define an estimator for $\eta(x) := P_\text{test}(y = 1 \mid x)$ as follows:
\[
\hat{\eta}_S(x) := \frac{\hat{P}_1^S(x)}{\hat{P}_1^S(x) + \hat{P}_{-1}^S(x)}.
\]
Observe that, for $x \in I_j$,
\[
\hat{\eta}_S(x) > 1/2 \iff n_{1,j} > n_{-1,j} \iff A_\text{USB}^S(x) = 1.
\]
Defining an estimator $\hat{\eta}_S$ for $P_\text{test}(y = 1 \mid x)$ in this way will allow us to relate the excess risk of $A_\text{USB}$ to the estimation error in $\hat{P}_1^S$ and $\hat{P}_{-1}^S$. Before proving the theorem, we restate it here.

Theorem 5.1. Consider the label shift setting described in Section 3.2.1. For any $(P_\text{maj}, P_\text{min}) \in \mathcal{P}_\text{LS}$, the expected excess risk of the undersampled binning estimator (Eq. (5)) with $K = c\lceil n_\text{min}^{1/3} \rceil$ bins is upper bounded by $C / n_\text{min}^{1/3}$, where $C$ is a constant that depends only on $c$.

Proof. By (Wasserman, 2019, Theorem 1), we may upper bound the excess risk given a draw of $S$ by
\[
R(A_\text{USB}^S; P_\text{test}) - R(f; P_\text{test}) \leq 2 \int \big|\hat{\eta}_S(x) - \eta(x)\big| \, P_\text{test}(x) \, dx.
\]
Continuing, using the definition of $\hat{\eta}_S$ above and the fact that $\eta = P_1 / (P_1 + P_{-1})$, we have that
\begin{align*}
R(A_\text{USB}^S; P_\text{test}) - R(f; P_\text{test}) &= 2\int_0^1 \bigg|\frac{\hat{P}_1^S(x)}{\hat{P}_1^S(x) + \hat{P}_{-1}^S(x)} - \frac{P_1(x)}{P_1(x) + P_{-1}(x)}\bigg| \cdot \frac{P_1(x) + P_{-1}(x)}{2} \, dx \\
&= \int_0^1 \bigg|\frac{P_1(x) + P_{-1}(x)}{\hat{P}_1^S(x) + \hat{P}_{-1}^S(x)}\,\hat{P}_1^S(x) - P_1(x)\bigg| \, dx \\
&\overset{(i)}{\leq} \int_0^1 \big|\hat{P}_1^S(x) - P_1(x)\big| \, dx + \int_0^1 \bigg|\frac{P_1(x) + P_{-1}(x)}{\hat{P}_1^S(x) + \hat{P}_{-1}^S(x)} - 1\bigg|\,\hat{P}_1^S(x) \, dx \\
&= \int_0^1 \big|\hat{P}_1^S(x) - P_1(x)\big| \, dx + \int_0^1 \big|\hat{P}_1^S(x) + \hat{P}_{-1}^S(x) - P_1(x) - P_{-1}(x)\big| \cdot \frac{\hat{P}_1^S(x)}{\hat{P}_1^S(x) + \hat{P}_{-1}^S(x)} \, dx \\
&\leq 2\int_0^1 \big|\hat{P}_1^S(x) - P_1(x)\big| \, dx + \int_0^1 \big|\hat{P}_{-1}^S(x) - P_{-1}(x)\big| \, dx \\
&\overset{(ii)}{\leq} 2\sqrt{\int_0^1 \big(\hat{P}_1^S(x) - P_1(x)\big)^2 \, dx} + \sqrt{\int_0^1 \big(\hat{P}_{-1}^S(x) - P_{-1}(x)\big)^2 \, dx},
\end{align*}
where $(i)$ follows by the triangle inequality, and $(ii)$ is by the Cauchy-Schwarz inequality on $[0, 1]$. Taking expectations over the samples $S$ and invoking Jensen's inequality, we find that
\[
\text{Excess Risk}(A_\text{USB}; (P_\text{maj}, P_\text{min})) = \mathbb{E}_S\big[R(A_\text{USB}^S; P_\text{test}) - R(f; P_\text{test})\big] \leq 2\sqrt{\mathbb{E}_S\int_0^1 \big(\hat{P}_1^S(x) - P_1(x)\big)^2 \, dx} + \sqrt{\mathbb{E}_S\int_0^1 \big(\hat{P}_{-1}^S(x) - P_{-1}(x)\big)^2 \, dx}.
\]
We note that $\hat{P}_y^S$ only depends on $n_\text{min}$ i.i.d. draws from class $y$. Thus, by (Freedman & Diaconis, 1981, Theorem 1.7), if $K = c\lceil n_\text{min}^{1/3} \rceil$, then
\[
\mathbb{E}_S\int_0^1 \big(\hat{P}_y^S(x) - P_y(x)\big)^2 \, dx \leq \frac{C}{n_\text{min}^{2/3}}.
\]
Plugging this into the previous inequality yields the desired result.
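To make the estimator concrete, here is a small illustrative Python implementation of the undersampled binning classifier on a toy label-shift instance. The particular densities, sample sizes, seeds, and function names are our own choices for illustration and are not taken from the paper; only the algorithm (undersample the majority class, bin $[0,1]$ into $\lceil n_{\min}^{1/3} \rceil$ intervals, and predict by per-bin majority vote) follows Eq. (5).

```python
import numpy as np

rng = np.random.default_rng(0)

def undersampled_binning_classifier(x_maj, x_min, rng):
    """A_USB sketch: undersample the majority class to n_min points, then
    predict the label of each bin by comparing the two class counts."""
    n_min = len(x_min)
    x_maj_us = rng.choice(x_maj, size=n_min, replace=False)  # discard excess majority data
    K = max(1, int(np.ceil(n_min ** (1 / 3))))               # number of bins, ~n_min^(1/3)
    edges = np.linspace(0.0, 1.0, K + 1)
    n_pos, _ = np.histogram(x_maj_us, bins=edges)            # n_{1,j}
    n_neg, _ = np.histogram(x_min, bins=edges)               # n_{-1,j}
    labels = np.where(n_pos > n_neg, 1, -1)                  # prediction on interval I_j

    def predict(x):
        j = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, K - 1)
        return labels[j]

    return predict

# Toy label-shift instance: class +1 has density 2x on [0,1], class -1 has 2(1-x),
# with a 5:1 imbalance in the training data (inverse-CDF sampling).
x_maj = rng.random(5000) ** 0.5
x_min = 1.0 - rng.random(1000) ** 0.5
clf = undersampled_binning_classifier(x_maj, x_min, rng)

# Balanced test distribution: P_test(x) = 1 and eta(x) = P(y=1|x) = x,
# so the Bayes rule is sign(x - 1/2) and accuracy should be well above 1/2.
xs = rng.random(20000)
ys = np.where(rng.random(20000) < xs, 1, -1)
acc = float(np.mean(clf(xs) == ys))
print(round(acc, 3))
```

Near the endpoints the bins are dominated by one class, so the classifier recovers the Bayes rule there; errors concentrate in the bin containing $x = 1/2$, mirroring how the excess risk in the proof is driven by bins where $\eta$ is close to $1/2$.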

C PROOF IN THE GROUP-COVARIATE SHIFT SETTING

Throughout this section we operate in the group-covariate shift setting (Section 3.2.2). We will proceed similarly to Appendix B. We shall construct a family of class-conditional distributions such that adequate samples in each sub-interval of $[0, 1]$ are necessary to learn the most likely label in that sub-interval. On the other hand, we will construct the group-covariate distributions to be separated from one another. As a consequence, sub-intervals with high probability mass under the minority group distribution will have low probability mass under the majority group distribution. Hence, these sub-intervals will not contain enough training sample points for any classifier to learn the most likely label, and as a result any classifier shall suffer high excess risk. First, in Appendix C.1, we prove Theorem 4.2, the minimax lower bound, through a sequence of lemmas. Second, in Appendix C.2, we prove Theorem 5.2, which is an upper bound on the excess risk of the undersampled binning estimator with $\lceil n_\text{min}^{1/3} \rceil$ bins.

C.1 PROOF OF THEOREM 4.2

In this section, we provide a proof of the minimax lower bound in the group shift setting. We construct the "hard" set of distributions as follows. Let the index set be $\mathcal{V} = \{-1, 1\}^K$. For every $v \in \mathcal{V}$, define a conditional distribution as follows: for $x \in I_j = [\tfrac{j-1}{K}, \tfrac{j}{K}]$,
\[
P_v(y = 1 \mid x) := \frac{1}{2}\Big(1 + v_j \, \varphi\Big(x - \tfrac{2j-1}{2K}\Big)\Big),
\]
where $\varphi$ is defined in Eq. (6) and $\tfrac{2j-1}{2K}$ is the midpoint of $I_j$. Given a $\tau \in [0, 1]$, we also construct the group distributions as follows:
\[
P_a(x) = \begin{cases} 2 - \tau & \text{if } x \in [0, 0.5), \\ \tau & \text{if } x \in [0.5, 1], \end{cases}
\]
and let $P_b(x) = 2 - P_a(x)$. We can verify that
\[
\text{Overlap}(P_a, P_b) = 1 - \mathrm{TV}(P_a, P_b) = 1 - \frac{1}{2}\int_{x=0}^{1} |P_a(x) - P_b(x)| \, dx = \tau.
\]
We continue to define
\[
P_{v,\text{maj}}(x, y) = P_v(y \mid x)\,P_a(x), \qquad P_{v,\text{min}}(x, y) = P_v(y \mid x)\,P_b(x), \qquad P_{v,\text{test}}(x, y) = P_v(y \mid x)\,\frac{P_a(x) + P_b(x)}{2}.
\]
Observe that $(P_a(x) + P_b(x))/2 = 1$, the uniform density over $[0, 1]$.
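As a quick check of the overlap computation above (our own illustration; the helper name and grid discretization are assumptions), the following Python snippet discretizes the two piecewise-constant group densities and confirms that $1 - \mathrm{TV}(P_a, P_b)$ recovers $\tau$ itself.

```python
import numpy as np

def overlap(tau, grid=200_000):
    """Overlap(P_a, P_b) = 1 - TV(P_a, P_b) for the group densities above:
    P_a = 2 - tau on [0, 1/2), tau on [1/2, 1], and P_b = 2 - P_a."""
    x = (np.arange(grid) + 0.5) / grid            # midpoint grid on [0, 1]
    p_a = np.where(x < 0.5, 2.0 - tau, tau)
    p_b = 2.0 - p_a
    tv = 0.5 * np.mean(np.abs(p_a - p_b))         # Riemann sum of the L1 distance / 2
    return 1.0 - tv

for tau in (0.0, 0.25, 1.0):
    print(round(overlap(tau), 6))   # recovers tau in each case
```

Since $|P_a - P_b| = 2 - 2\tau$ everywhere on $[0,1]$ for $\tau \le 1$, the total variation is $1 - \tau$ and the overlap is exactly $\tau$, matching the display above.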
Recall that, as described in Section A.1, $V$ shall be a uniform random variable over $\mathcal{V}$ and $S \mid V = v \sim P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}$. We shall let $Q$ denote the joint distribution of $(V, S)$ and let $Q_S$ denote the marginal over $S$. With this construction in place, we present the following lemma, which lower bounds the minimax excess risk by a sum of $\exp(-\mathrm{KL}(Q(S \mid v_j = 1) \,\|\, Q(S \mid v_j = -1)))$ over the intervals. Intuitively, $\mathrm{KL}(Q(S \mid v_j = 1) \,\|\, Q(S \mid v_j = -1))$ is a measure of how difficult it is to identify whether $v_j = 1$ or $v_j = -1$ from the samples.

Lemma C.1. For any positive integers $K$, $n_\text{maj}$, $n_\text{min}$ and any $\tau \in [0, 1]$, the minimax excess risk is lower bounded as follows:
\[
\text{Minimax Excess Risk}(\mathcal{P}_\text{GS}(\tau)) = \inf_A \sup_{(P_\text{maj}, P_\text{min}) \in \mathcal{P}_\text{GS}(\tau)} \mathbb{E}_{S \sim P_\text{maj}^{n_\text{maj}} \times P_\text{min}^{n_\text{min}}}\big[R(A_S; P_\text{test})\big] - R(f; P_\text{test}) \geq \frac{1}{32K^2} \sum_{j=1}^K \exp\big(-\mathrm{KL}(Q(S \mid v_j = 1) \,\|\, Q(S \mid v_j = -1))\big).
\]

Proof. By invoking Lemma A.1, we know that the minimax excess risk is lower bounded by
\[
\text{Minimax Excess Risk}(\mathcal{P}_\text{GS}(\tau)) \geq \underbrace{\mathbb{E}_{S \sim Q_S}\Big[\inf_h \mathbb{P}_{(x,y) \sim \sum_{v} Q(v \mid S) P_{v,\text{test}}}(h(x) \neq y)\Big]}_{= R_V} - \underbrace{\mathbb{E}_V\big[R(f(P_{V,\text{test}}); P_{V,\text{test}})\big]}_{= B_V},
\]
where $V$ is a uniform random variable over the set $\mathcal{V}$, $S \mid V = v$ is a draw from $P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}$, and $Q$ denotes the joint distribution over $(V, S)$. We shall lower bound this minimax risk in parts. First, we shall establish a lower bound on $R_V$, and then an upper bound on the Bayes risk $B_V$.

Lower bound on $R_V$.
Unpacking $R_V$ using its definition, we get that
\begin{align*}
R_V &= \mathbb{E}_{S \sim Q_S}\Big[\inf_h \mathbb{P}_{(x,y) \sim \sum_{v} Q(v \mid S) P_{v,\text{test}}}(h(x) \neq y)\Big] \\
&= \mathbb{E}_{S \sim Q_S}\bigg[\inf_h \int_0^1 P_\text{test}(x)\,\mathbb{P}_{y \sim \sum_v Q(v \mid S) P_v(\cdot \mid x)}\big[h(x) \neq y\big] \, dx\bigg] \\
&\overset{(i)}{=} \mathbb{E}_{S \sim Q_S}\bigg[\int_0^1 P_\text{test}(x) \min\bigg\{\sum_{v \in \mathcal{V}} Q(v \mid S)\,P_v(1 \mid x), \; \sum_{v \in \mathcal{V}} Q(v \mid S)\,P_v(-1 \mid x)\bigg\} \, dx\bigg] \\
&\overset{(ii)}{=} \frac{1}{2} - \mathbb{E}_{S \sim Q_S}\bigg[\int_0^1 P_\text{test}(x)\,\bigg|\frac{1}{2} - \sum_{v \in \mathcal{V}} Q(v \mid S)\,P_v(1 \mid x)\bigg| \, dx\bigg] \\
&\overset{(iii)}{=} \frac{1}{2} - \int_0^1 P_\text{test}(x)\,\mathbb{E}_{S \sim Q_S}\bigg[\bigg|\frac{1}{2} - \sum_{v \in \mathcal{V}} Q(v \mid S)\,P_v(1 \mid x)\bigg|\bigg] \, dx, \tag{14}
\end{align*}
where $(i)$ follows by taking $h$ to be the pointwise minimizer over $x$, $(ii)$ follows since $P_v(-1 \mid x) = 1 - P_v(1 \mid x)$ and $\min\{s, 1 - s\} = (1 - |1 - 2s|)/2$ for all $s \in [0, 1]$, and $(iii)$ follows by Fubini's theorem, which allows us to switch the order of the integrals. If $x \in I_j = [\tfrac{j-1}{K}, \tfrac{j}{K}]$ for some $j \in \{1, \dots, K\}$, we let $j_x$ denote the value of this index $j$. With this notation in place, let us continue to upper bound the integrand in the second term on the right-hand side above as follows:
\begin{align*}
\mathbb{E}_{S \sim Q_S}\bigg[\bigg|\frac{1}{2} - \sum_{v \in \mathcal{V}} Q(v \mid S)\,P_v(1 \mid x)\bigg|\bigg] &\overset{(i)}{=} \frac{1}{2}\,\Big|\varphi\Big(x - \tfrac{2j_x - 1}{2K}\Big)\Big|\,\mathbb{E}_{S \sim Q_S}\big[|Q(v_{j_x} = 1 \mid S) - Q(v_{j_x} = -1 \mid S)|\big] \\
&\overset{(ii)}{=} \frac{1}{2}\,\Big|\varphi\Big(x - \tfrac{2j_x - 1}{2K}\Big)\Big|\,\mathbb{E}_{S \sim Q_S}\bigg[\bigg|\frac{Q(S \mid v_{j_x} = 1)\,Q_V(v_{j_x} = 1)}{Q_S(S)} - \frac{Q(S \mid v_{j_x} = -1)\,Q_V(v_{j_x} = -1)}{Q_S(S)}\bigg|\bigg] \\
&\overset{(iii)}{=} \frac{1}{2}\,\Big|\varphi\Big(x - \tfrac{2j_x - 1}{2K}\Big)\Big|\,\mathrm{TV}\big(Q(S \mid v_{j_x} = 1), Q(S \mid v_{j_x} = -1)\big), \tag{15}
\end{align*}
where $(i)$ follows since $P_v(1 \mid x) = \tfrac{1}{2}(1 + v_{j_x}\varphi(x - \tfrac{2j_x - 1}{2K}))$ and by marginalizing $Q(v \mid S)$ over the indices $j \neq j_x$, $(ii)$ follows by Bayes' rule, and $(iii)$ follows since $Q_V(v_{j_x} = 1) = Q_V(v_{j_x} = -1) = \tfrac{1}{2}$ and the total variation distance is half the $\ell_1$ distance. Now by the Bretagnolle-Huber inequality (see Canonne, 2022, Corollary 4) we get that
\[
\mathrm{TV}\big(Q(S \mid v_{j_x} = 1), Q(S \mid v_{j_x} = -1)\big) \leq 1 - \frac{\exp\big(-\mathrm{KL}(Q(S \mid v_{j_x} = 1) \,\|\, Q(S \mid v_{j_x} = -1))\big)}{2}. \tag{16}
\]
Combining Eqs. (14)-(16), we get that
\[
R_V \geq \frac{1}{2} - \frac{1}{2}\int_0^1 P_\text{test}(x)\,\Big|\varphi\Big(x - \tfrac{2j_x - 1}{2K}\Big)\Big| \, dx + \frac{1}{4}\int_0^1 P_\text{test}(x)\,\Big|\varphi\Big(x - \tfrac{2j_x - 1}{2K}\Big)\Big|\exp\big(-\mathrm{KL}(Q(S \mid v_{j_x} = 1) \,\|\, Q(S \mid v_{j_x} = -1))\big) \, dx. \tag{17}
\]
Upper bound on $B_V$: The Bayes error is
\begin{align*}
B_V &= \mathbb{E}_V\big[R(f(P_{V,\text{test}}); P_{V,\text{test}})\big] = \mathbb{E}_V\Big[\inf_f \mathbb{E}_{(x,y) \sim P_{V,\text{test}}}\big[\mathbf{1}(f(x) \neq y)\big]\Big] \\
&= \mathbb{E}_V\bigg[\inf_f \int_{x=0}^{1} \sum_{y \in \{-1,1\}} P_\text{test}(x)\,P_{V,\text{test}}(y \mid x)\,\mathbf{1}(f(x) = -y) \, dx\bigg] \\
&= \mathbb{E}_V\bigg[\int_{x=0}^{1} P_\text{test}(x) \min_{y \in \{-1,1\}} P_{V,\text{test}}(y \mid x) \, dx\bigg] \\
&\overset{(i)}{=} \mathbb{E}_V\bigg[\frac{1}{2}\bigg(1 - \int_{x=0}^{1} P_\text{test}(x)\,\big|P_{V,\text{test}}(1 \mid x) - P_{V,\text{test}}(-1 \mid x)\big| \, dx\bigg)\bigg] \\
&\overset{(ii)}{=} \mathbb{E}_V\bigg[\frac{1}{2}\bigg(1 - \int_{x=0}^{1} P_\text{test}(x)\,\Big|\varphi\Big(x - \tfrac{2j_x - 1}{2K}\Big)\Big| \, dx\bigg)\bigg] = \frac{1}{2} - \frac{1}{2}\int_{x=0}^{1} P_\text{test}(x)\,\Big|\varphi\Big(x - \tfrac{2j_x - 1}{2K}\Big)\Big| \, dx, \tag{18}
\end{align*}
where $(i)$ follows since $P_v(1 \mid x) = 1 - P_v(-1 \mid x)$ and $\min\{s, 1 - s\} = (1 - |1 - 2s|)/2$ for all $s \in [0, 1]$, and $(ii)$ follows by our construction of $P_v$ above.

Putting things together: Combining Eqs. (17) and (18) allows us to conclude that
\begin{align*}
\text{Minimax Excess Risk}(\mathcal{P}_\text{GS}(\tau)) &\geq \frac{1}{4}\int_0^1 P_\text{test}(x)\,\Big|\varphi\Big(x - \tfrac{2j_x - 1}{2K}\Big)\Big|\exp\big(-\mathrm{KL}(Q(S \mid v_{j_x} = 1) \,\|\, Q(S \mid v_{j_x} = -1))\big) \, dx \\
&= \frac{1}{4}\sum_{j=1}^K \exp\big(-\mathrm{KL}(Q(S \mid v_j = 1) \,\|\, Q(S \mid v_j = -1))\big) \int_{(j-1)/K}^{j/K} P_\text{test}(x)\,\Big|\varphi\Big(x - \tfrac{2j-1}{2K}\Big)\Big| \, dx \\
&\overset{(i)}{=} \frac{1}{32K^2}\sum_{j=1}^K \exp\big(-\mathrm{KL}(Q(S \mid v_j = 1) \,\|\, Q(S \mid v_j = -1))\big),
\end{align*}
where $(i)$ follows by Lemma A.2 along with the fact that $P_\text{test}(x) = 1$ in our construction, so that the integral in each summand is equal to $\tfrac{1}{8K^2}$. This proves the result.

The next lemma upper bounds the KL divergence between $Q(S \mid v_j = 1)$ and $Q(S \mid v_j = -1)$ for each $j \in \{1, \dots, K\}$. It shows that the KL divergence between these two posteriors is larger when the expected number of samples in the corresponding bin is larger.

Lemma C.2. Suppose that $V$ is drawn uniformly from the set $\{-1, 1\}^K$, and that $S \mid V = v$ is drawn from $P_{v,\text{maj}}^{n_\text{maj}} \times P_{v,\text{min}}^{n_\text{min}}$. Then for any $j \in \{1, \dots, K/2\}$ and any $\tau \in [0, 1]$,
\[
\mathrm{KL}(Q(S \mid v_j = 1) \,\|\, Q(S \mid v_j = -1)) \leq \frac{n_\text{maj}(2 - \tau) + n_\text{min}\tau}{3K^3},
\]
and for any $j \in \{K/2 + 1, \dots, K\}$,
\[
\mathrm{KL}(Q(S \mid v_j = 1) \,\|\, Q(S \mid v_j = -1)) \leq \frac{n_\text{maj}\tau + n_\text{min}(2 - \tau)}{3K^3}.
\]

Proof. Let us consider the case when $j = 1$. The bound for all other $j \in \{2, \dots, K\}$ shall follow analogously.
Now, conditioned on $v_1$, $n_{1,a}$ and $n_{1,b}$, the samples in $S_1$ are composed of two groups of samples $(S_{1,a}, S_{1,b})$. The samples in each group are drawn independently from the distributions $P_a(x \mid x \in I_1)\,P_v(y \mid x)$ and $P_b(x \mid x \in I_1)\,P_v(y \mid x)$, respectively. Therefore,
\begin{align*}
\mathrm{KL}\big(Q(S_1 \mid v_1 = 1, n_{1,a}, n_{1,b}) \,\|\, Q(S_1 \mid v_1 = -1, n_{1,a}, n_{1,b})\big) &\overset{(i)}{=} n_{1,a}\,\mathrm{KL}\big(P_a(x \mid x \in I_1)\,P_{v_1=1}(y \mid x) \,\|\, P_a(x \mid x \in I_1)\,P_{v_1=-1}(y \mid x)\big) \\
&\qquad + n_{1,b}\,\mathrm{KL}\big(P_b(x \mid x \in I_1)\,P_{v_1=1}(y \mid x) \,\|\, P_b(x \mid x \in I_1)\,P_{v_1=-1}(y \mid x)\big) \\
&\overset{(ii)}{=} (n_{1,a} + n_{1,b})\,\mathbb{E}_{x \sim \text{Unif}(I_1)}\big[\mathrm{KL}(P_{v_1=1}(y \mid x) \,\|\, P_{v_1=-1}(y \mid x))\big] \\
&\overset{(iii)}{=} \frac{n_{1,a} + n_{1,b}}{2}\,\mathbb{E}_{x \sim \text{Unif}(I_1)}\bigg[\sum_{y \in \{-1,1\}} \Big(1 + y\varphi\big(x - \tfrac{1}{2K}\big)\Big)\log\frac{1 + y\varphi\big(x - \tfrac{1}{2K}\big)}{1 - y\varphi\big(x - \tfrac{1}{2K}\big)}\bigg] \\
&= \frac{(n_{1,a} + n_{1,b})K}{2}\sum_{y \in \{-1,1\}} \int_{x=0}^{1/K} \Big(1 + y\varphi\big(x - \tfrac{1}{2K}\big)\Big)\log\frac{1 + y\varphi\big(x - \tfrac{1}{2K}\big)}{1 - y\varphi\big(x - \tfrac{1}{2K}\big)} \, dx \\
&\overset{(iv)}{\leq} \frac{n_{1,a} + n_{1,b}}{3K^2}, \tag{20}
\end{align*}
where in $(i)$ we let $P_{v_1}$ denote the conditional distribution of $y$ for $x \in I_1$ given $v_1$, $(ii)$ follows since both $P_a$ and $P_b$ are constant on the interval, $(iii)$ follows by our construction of $P_v$ above, and finally $(iv)$ follows by invoking Lemma A.3, which ensures that each of the two integrals is bounded by $\tfrac{1}{3K^3}$. Using this bound in Eq. (20), along with Eq. (19), we get that
\[
\mathrm{KL}(Q(S \mid v_1 = 1) \,\|\, Q(S \mid v_1 = -1)) \leq \frac{\mathbb{E}[n_{1,a} + n_{1,b}]}{3K^2}. \tag{21}
\]
Now, there are $n_\text{maj}$ samples from group $a$ in $S$ and $n_\text{min}$ samples from group $b$. Therefore,
\[
\mathbb{E}[n_{1,a}] = n_\text{maj}\,P_a(x \in I_1) = \frac{n_\text{maj}(2 - \tau)}{K}, \qquad \mathbb{E}[n_{1,b}] = n_\text{min}\,P_b(x \in I_1) = \frac{n_\text{min}\tau}{K}.
\]
Plugging these into Eq. (21) completes the proof for the first interval. An identical argument holds for $j \in \{2, \dots, K/2\}$. For $j \in \{K/2 + 1, \dots, K\}$, the only change is that
\[
\mathbb{E}[n_{j,a}] = n_\text{maj}\,P_a(x \in I_j) = \frac{n_\text{maj}\tau}{K}, \qquad \mathbb{E}[n_{j,b}] = n_\text{min}\,P_b(x \in I_j) = \frac{n_\text{min}(2 - \tau)}{K}.
\]

Next, we combine the previous two lemmas to establish our stated lower bound. We first restate it here.

Theorem 4.2. Consider the group shift setting described in Section 3.2.2.
Given any overlap $\tau \in [0, 1]$, recall that $\mathcal{P}_\text{GS}(\tau)$ is the class of distributions such that $\text{Overlap}(P_\text{maj}, P_\text{min}) \geq \tau$. The minimax excess risk in this setting is lower bounded as follows:
\[
\text{Minimax Excess Risk}(\mathcal{P}_\text{GS}(\tau)) = \inf_A \sup_{(P_\text{maj}, P_\text{min}) \in \mathcal{P}_\text{GS}(\tau)} \text{Excess Risk}[A; (P_\text{maj}, P_\text{min})] \geq \frac{1}{200\big(n_\text{min}(2 - \tau) + n_\text{maj}\tau\big)^{1/3}} \geq \frac{1}{200\,n_\text{min}^{1/3}(\rho\tau + 2)^{1/3}}, \tag{4}
\]
where $\rho = n_\text{maj}/n_\text{min} > 1$.

Lemma C.3. The expected excess risk of the undersampled binning estimator $A_\text{USB}$ can be decomposed as follows:
\[
\text{Excess Risk}(A_\text{USB}) \leq \sum_{j=1}^{K} \mathbb{E}_{S \sim P_\text{maj}^{n_\text{maj}} \times P_\text{min}^{n_\text{min}}}\big[R_j(A_\text{USB}^S)\big] \cdot P_\text{test}(I_j) + \frac{2}{K},
\]
where $P_\text{test}(I_j) := \int_{x \in I_j} P_\text{test}(x) \, dx$.

Proof. Recall that, by definition, the expected excess risk is $\mathbb{E}_{S \sim P_\text{maj}^{n_\text{maj}} \times P_\text{min}^{n_\text{min}}}\big[R(A_S; P_\text{test}) - R(f; P_\text{test})\big]$. Let us first decompose the Bayes risk $R(f)$:
\begin{align*}
R(f) &= \inf_f \mathbb{E}_{(x,y) \sim P_\text{test}}\big[\mathbf{1}(f(x) \neq y)\big] = \inf_f \int_{x=0}^{1} \sum_{y \in \{-1,1\}} \mathbf{1}(f(x) \neq y)\,P_\text{test}(y \mid x)\,P_\text{test}(x) \, dx \\
&= \int_{x=0}^{1} \inf_{f(x) \in \{-1,1\}} P_\text{test}(y = -f(x) \mid x)\,P_\text{test}(x) \, dx = \int_{x=0}^{1} \min\{P_\text{test}(y = 1 \mid x), P_\text{test}(y = -1 \mid x)\}\,P_\text{test}(x) \, dx. \tag{23}
\end{align*}
The risk of the undersampled binning algorithm $A_\text{USB}$ is given by
\[
R(A_\text{USB}^S) = \int_{x=0}^{1} \sum_{y \in \{-1,1\}} \mathbf{1}(A_\text{USB}^S(x) \neq y)\,P_\text{test}(y \mid x)\,P_\text{test}(x) \, dx = \int_{x=0}^{1} P_\text{test}(y = -A_\text{USB}^S(x) \mid x)\,P_\text{test}(x) \, dx.
\]
Next, recall that the undersampled binning estimator is constant over the intervals $I_j$ for $j \in \{1, \dots, K\}$, where it takes the value $A_j^S$ (to ease notation, we simply denote it by $A_j$ below), and therefore
\[
R(A_\text{USB}^S) = \sum_{j=1}^{K} \int_{x \in I_j} P_\text{test}(y = -A_j \mid x)\,P_\text{test}(x) \, dx.
\]
This, combined with Eq. (23), tells us that
\[
R(A_\text{USB}^S) - R(f) = \sum_{j=1}^{K} \int_{x \in I_j} \Big(P_\text{test}(y = -A_j \mid x) - \min\{P_\text{test}(y = 1 \mid x), P_\text{test}(y = -1 \mid x)\}\Big)\,P_\text{test}(x) \, dx. \tag{24}
\]
Recall the definition of $q_{j,1}$ and $q_{j,-1}$ from Eqs. (22a)-(22b) above. For any $x \in I_j = [\tfrac{j-1}{K}, \tfrac{j}{K}]$, we have $|P_\text{test}(y \mid x) - q_{j,y}| \leq 1/K$, since the function $x \mapsto P_\text{test}(y \mid x)$ is $1$-Lipschitz and $q_{j,y}$ is its average over the interval $I_j$.
Therefore,
\[
R(A_\text{USB}^S) - R(f) \leq \sum_{j=1}^{K} \Big(q_{j,-A_j} - \min\{q_{j,1}, q_{j,-1}\}\Big)\,P_\text{test}(I_j) + \frac{2}{K}.
\]
Taking the expectation over the training samples $S$ (where $n_\text{min}$ samples are drawn independently from $P_\text{min}$ and $n_\text{maj}$ samples are drawn independently from $P_\text{maj}$) concludes the proof.

Next, we provide an upper bound on the expected excess risk in an interval, $R_j(A_\text{USB}^S)$.

Lemma C.4. For any $j \in \{1, \dots, K\}$ with $I_j = [\tfrac{j-1}{K}, \tfrac{j}{K}]$,
\[
\mathbb{E}_{S \sim P_\text{maj}^{n_\text{maj}} \times P_\text{min}^{n_\text{min}}}\big[R_j(A_\text{USB}^S)\big] \leq \frac{c}{\sqrt{n_\text{min}\,P_\text{test}(I_j)}} + \frac{c}{K},
\]
where $c$ is an absolute constant and $P_\text{test}(I_j) := \int_{x \in I_j} P_\text{test}(x) \, dx$.

Proof. Consider an arbitrary bucket $j \in \{1, \dots, K\}$. Let us introduce some notation that shall be useful in the remainder of the proof. Analogous to $q_{j,1}$ and $q_{j,-1}$ defined above (see Eqs. (22a)-(22b)), define $q_{j,1}^a$ and $q_{j,1}^b$ as in Eqs. (25a)-(25b). Essentially, $q_{j,1}^a$ is the probability that a sample from group $a$ has label $1$, conditioned on the event that the sample falls in the interval $I_j$. Since $P_\text{test}(x \mid x \in I_j) = \tfrac{1}{2}\big[P_a(x \mid x \in I_j) + P_b(x \mid x \in I_j)\big]$, we have $|q_{j,1} - q_{j,1}^a| \leq 1/K$ (Eq. (26)). This follows since $P(y \mid x)$ is $1$-Lipschitz and can therefore fluctuate by at most $1/K$ over the interval $I_j$. Of course, the same bound also holds for $|q_{j,1} - q_{j,1}^b|$.

With this notation in place, let us present a bound on the expected value of $R_j(A_\text{USB}^S)$. By definition, $R_j(A_\text{USB}^S) = q_{j,-A_j^S} - \min\{q_{j,1}, q_{j,-1}\}$. First, note that $q_{j,1} := P_\text{test}(y = 1 \mid x \in I_j) = 1 - q_{j,-1}$. Suppose that $q_{j,1} < 1/2$ and therefore $q_{j,-1} > 1/2$ (the same bound shall hold in the other case). In this case, risk is incurred only when $A_j^S = 1$. That is,
\[
\mathbb{E}_{S \sim P_\text{maj}^{n_\text{maj}} \times P_\text{min}^{n_\text{min}}}\big[R_j(A_\text{USB}^S)\big] = |q_{j,-1} - q_{j,1}|\,\mathbb{P}_S[A_j^S = 1] = |1 - 2q_{j,1}|\,\mathbb{P}_S[A_j^S = 1].
\]
Now, by the definition of the undersampled binning estimator (see Eq. (5)), $A_j^S = 1$ only when there are more samples in the interval $I_j$ with label $1$ than with label $-1$. However, we can bound the probability of this happening, since $q_{j,1}$ is smaller than $q_{j,-1}$. Let $n_j$ be the number of samples of the undersampled sample set $S^\text{US}$ in the interval $I_j$.
Let $n_{1,j}$ be the number of these samples with label $1$, and let $n_{-1,j} = n_j - n_{1,j}$ be the number of samples with label $-1$. Further, let $n_{a,j}$ be the number of samples from group $a$ that fall in the interval $I_j$, and define $n_{b,j}$ analogously. The probability of incurring risk is given by Eq. (28). In light of that equation, we want to control the probability that more than half of the samples in the interval $I_j$ have label $1$, conditioned on the event that the number of samples from group $a$ in this interval is $s - s'$ and the number of samples from group $b$ in this interval is $s'$. Recall that $q_{j,1}^a$ and $q_{j,1}^b$ are the probabilities that the label of a sample is $1$, conditioned on the event that the sample is in the interval $I_j$, when it is from group $a$ and group $b$, respectively. So we define the random variables
\[
z_a[s - s'] \sim \text{Bin}(s - s', q_{j,1}^a), \qquad z_b[s'] \sim \text{Bin}(s', q_{j,1}^b), \qquad z[s] \sim \text{Bin}\big(s, \max\{q_{j,1}^a, q_{j,1}^b\}\big).
\]
Then, by the stochastic domination and Hoeffding arguments displayed below (Wainwright, 2019, Proposition 2.5),
\[
\mathbb{P}[n_{1,j} > s/2 \mid n_j = s] \leq \exp\Big(-\frac{s}{2}\big(1 - 2\max\{q_{j,1}^a, q_{j,1}^b\}\big)^2\Big).
\]
Combining this with Eqs. (28) and (29), we get that
\[
\mathbb{P}[A_j = 1] \leq \sum_{s=1}^{2n_\text{min}} \exp\Big(-\frac{s}{2}\big(1 - 2\max\{q_{j,1}^a, q_{j,1}^b\}\big)^2\Big)\,\mathbb{P}[n_j = s].
\]
Now $n_j$, the number of samples that land in the interval $I_j$, is equal to $n_{a,j} + n_{b,j}$, and each of $n_{a,j}$ and $n_{b,j}$ (the number of samples in this interval from each of the groups) is a random variable with distribution $\text{Bin}(n_\text{min}, P_a(I_j))$ and $\text{Bin}(n_\text{min}, P_b(I_j))$ respectively, where $P_a(I_j) = \int_{x \in I_j} P_a(x) \, dx$ and $P_b(I_j) = \int_{x \in I_j} P_b(x) \, dx$. Therefore, $n_j$ is distributed as a sum of two binomial random variables, and is therefore Poisson binomially distributed (Wikipedia contributors, 2022). Using the formula for the moment generating function (MGF) of a Poisson binomially distributed random variable, we infer that
\[
\mathbb{P}[A_j = 1] \leq \bigg(1 - P_a(I_j) + P_a(I_j)\exp\Big(-\frac{\big(1 - 2\max\{q_{j,1}^a, q_{j,1}^b\}\big)^2}{2}\Big)\bigg)^{n_\text{min}} \times \bigg(1 - P_b(I_j) + P_b(I_j)\exp\Big(-\frac{\big(1 - 2\max\{q_{j,1}^a, q_{j,1}^b\}\big)^2}{2}\Big)\bigg)^{n_\text{min}}.
\]
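The Hoeffding step above can be checked numerically. The following Python snippet (our own sanity check, not from the paper) compares the exact binomial tail $\mathbb{P}[\text{Bin}(s, q) > s/2]$ against the bound $\exp(-s(1 - 2q)^2/2)$ for a few values of $s$ and $q < 1/2$; the helper names are our own.

```python
import math

def binom_tail_gt_half(s, q):
    """Exact P[Bin(s, q) > s/2], computed by summing the upper binomial tail."""
    return sum(math.comb(s, k) * q ** k * (1 - q) ** (s - k)
               for k in range(s // 2 + 1, s + 1))

def hoeffding_bound(s, q):
    """The bound exp(-s (1 - 2q)^2 / 2) used above to control P[n_{1,j} > s/2]."""
    return math.exp(-s * (1 - 2 * q) ** 2 / 2)

for s in (5, 20, 100):
    for q in (0.1, 0.3, 0.45):
        tail, bound = binom_tail_gt_half(s, q), hoeffding_bound(s, q)
        print(f"s={s:3d} q={q:.2f}  tail={tail:.3e}  bound={bound:.3e}")
```

As expected, the bound is loose for small $s$ but decays exponentially in $s$ whenever $q$ is bounded away from $1/2$, which is exactly what drives the per-bin risk bound.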




Figure 2: Convolutional neural network classifiers trained on the Imbalanced Binary CIFAR10 dataset with a 5:1 label imbalance. (Top) Models trained using the importance-weighted cross-entropy loss with early stopping. (Bottom) Models trained using the importance-weighted VS loss (Kini et al., 2021) with early stopping. We report the average test accuracy calculated on a balanced test set over 5 random seeds. We start off with 2500 cat examples and 500 dog examples in the training dataset. We find that, in accordance with our theory, for both of the classifiers adding only minority class samples (red) leads to a large gain in accuracy (∼6%), while adding majority class samples (blue) leads to little or no gain. In fact, adding majority samples sometimes hurts test accuracy due to the added bias. When we add majority and minority samples in a 5:1 ratio (green), the gain is largely due to the addition of minority samples and is only marginally higher (<2%) than adding only minority samples. The green curves correspond to the same classifiers in both the left and right panels.

Figure 3: The hat function with K = 4.


\[
q_{j,1}^a := P_a(y = 1 \mid x \in I_j) = \int_{x \in I_j} P(y = 1 \mid x)\,P_a(x \mid x \in I_j) \, dx, \tag{25a}
\]
\[
q_{j,1}^b := P_b(y = 1 \mid x \in I_j) = \int_{x \in I_j} P(y = 1 \mid x)\,P_b(x \mid x \in I_j) \, dx. \tag{25b}
\]

Since $P_\text{test}(x \mid x \in I_j) = \tfrac{1}{2}\big[P_a(x \mid x \in I_j) + P_b(x \mid x \in I_j)\big]$, we therefore have
\[
|q_{j,1} - q_{j,1}^a| = \bigg|\int_{x \in I_j} P(y = 1 \mid x)\,P_\text{test}(x \mid x \in I_j) \, dx - \int_{x \in I_j} P(y = 1 \mid x)\,P_a(x \mid x \in I_j) \, dx\bigg| \leq \frac{1}{K}. \tag{26}
\]

\[
\mathbb{P}[A_j = 1] = \sum_{s=1}^{2n_\text{min}} \mathbb{P}[A_j = 1 \mid n_j = s]\,\mathbb{P}[n_j = s], \tag{28}
\]
where the sum runs up to $2n_\text{min}$ since the size of the undersampled dataset $|S^\text{US}|$ is equal to $2n_\text{min}$. Conditioned on the event that $n_j = s$, the probability of incurring risk is
\[
\mathbb{P}[A_j = 1 \mid n_j = s] = \mathbb{P}[n_{1,j} > n_{-1,j} \mid n_j = s] = \mathbb{P}[n_{1,j} > s/2 \mid n_j = s]. \tag{29}
\]
Now, note that $n_j = n_{a,j} + n_{b,j}$. Thus, continuing, we have that
\[
\mathbb{P}[n_{1,j} > s/2 \mid n_j = s] = \sum_{s' \leq s} \mathbb{P}[n_{1,j} > s/2 \mid n_j = s, n_{b,j} = s']\,\mathbb{P}[n_{b,j} = s'] = \sum_{s' \leq s} \mathbb{P}[n_{1,j} > s/2 \mid n_{a,j} = s - s', n_{b,j} = s']\,\mathbb{P}[n_{b,j} = s'].
\]

\begin{align*}
\sum_{s' \leq s} \mathbb{P}[n_{1,j} > s/2 \mid n_{a,j} = s - s', n_{b,j} = s']\,\mathbb{P}[n_{b,j} = s'] &= \sum_{s' \leq s} \mathbb{P}\big[z_a[s - s'] + z_b[s'] > s/2 \mid n_{a,j} = s - s', n_{b,j} = s'\big]\,\mathbb{P}[n_{b,j} = s'] \\
&\leq \sum_{s' \leq s} \mathbb{P}\big[z[s] > s/2 \mid n_{a,j} = s - s', n_{b,j} = s'\big]\,\mathbb{P}[n_{b,j} = s'] \\
&= \sum_{s' \leq s} \mathbb{P}\big[z[s] > s/2\big]\,\mathbb{P}[n_{b,j} = s'] = \mathbb{P}\big[z[s] > s/2\big] \\
&\overset{(i)}{\leq} \exp\Big(-\frac{s}{2}\big(1 - 2\max\{q_{j,1}^a, q_{j,1}^b\}\big)^2\Big),
\end{align*}
where $(i)$ follows by invoking Hoeffding's inequality (Wainwright, 2019, Proposition 2.5).


The test set consists of all of the 1000 cat and 1000 dog test examples. To form our initial train and validation sets, we take 2500 cat examples but only 500 dog examples from the official train set, corresponding to a 5:1 label imbalance. We then use 80% of those examples for training and the rest for validation. In our experiment, we either (a) add only minority samples; (b) add only majority samples; (c) add both majority and minority samples in a 5:1 ratio. We consider competitive robust classifiers proposed in the literature that are convolutional neural networks trained either by using

Proof. By the definition of the excess risk,
Excess Risk$[A_{USB}; (P_{maj}, P_{min})] := \mathbb{E}_{S \sim P}$


Given samples $S$, let $S = (S_1, S_{\bar{1}})$ be a partition where $S_1$ contains the samples that fall in the interval $I_1$, and $S_{\bar{1}}$ contains the remaining samples. Similarly, given a vector $v \in \{-1, 1\}^K$, let $v = (v_1, v_{\bar{1}})$, where $v_1$ is the first component and $v_{\bar{1}}$ denotes the other components $(2, \ldots, K)$ of $v$.

First, we will show that:

To see this, observe that:

Further, if $v$ is chosen uniformly over the hypercube $\{-1, 1\}^K$, then:

where (i) follows by Bayes' rule (the samples in $S_1$ depend only on $v_1$); inequality (ii) follows since the samples are drawn independently given $v = (v_1, v_{\bar{1}})$; and (iii) follows since $S_{\bar{1}}$ (the samples that lie outside the interval $I_1$) depends only on $v_{\bar{1}}$: the marginal distribution of $x$ is independent of $v$, and the distribution of $y \mid x$ depends only on the component of $v$ corresponding to the interval in which $x$ lies.

Thus, it suffices to bound this KL divergence. To do so, let us condition on the number of samples in $S_1$ from group $a$ (the majority group), $n_{1,a}$, and the number of samples from group $b$ (the minority group), $n_{1,b}$. Since $n_{1,a}$ and $n_{1,b}$ are independent of $v_1$ (which only affects the labels), we have that:

Therefore, by the joint convexity of the KL divergence and Jensen's inequality, we have that:

Proof. First, by Lemma C.1 we know that:

Next, invoking the bound on the KL divergences in the equation above via Lemma C.2, we get that Minimax Excess Risk$(\mathcal{P}_{GS}(\tau))$ is lower bounded as claimed, where (ii) follows since $0 \le \tau \le 1$ and $n_{\min} \ge 1$, and hence $(n_{\min}(2 - \tau) + n_{maj}\,\tau)^{1/3} \ge 0.7$.

C.2 PROOF OF THEOREM 5.2

In this section, we derive an upper bound on the excess risk of the undersampled binning estimator $A_{USB}$ (Eq. (5)). Recall that given a dataset $S$, this estimator first computes the undersampled dataset $S_{US}$, in which the number of points kept from the majority group equals the number of points from the minority group ($n_{\min}$), so that the resulting dataset has size $2 n_{\min}$. Throughout this section, $(P_{maj}, P_{min})$ shall be an arbitrary element of $\mathcal{P}_{GS}(\tau)$ for some $\tau \in [0, 1]$, and we shall often denote Excess Risk$(A; (P_{maj}, P_{min}))$ simply by Excess Risk$(A)$.

Before we proceed, we introduce some additional notation. For any $j \in \{1, \ldots, K\}$, and for the undersampled binning estimator $A_{USB}$ (defined above in Eq. (5)), define the excess risk in an interval $I_j$ as the gap above $\min\{q_{j,1}, q_{j,-1}\}$. The proof of the upper bound proceeds in steps. First, in Lemma C.3 we show that the excess risk equals the sum of the excess risks over the intervals, up to an additive factor of $2/K$, on account of the distribution being 1-Lipschitz. Next, in Lemma C.4 we upper bound the risk over each interval. We then combine these two lemmas to upper bound the overall risk.

Plugging this into Eq. (28) and simplifying, we obtain a bound in which (i) follows since $|\max\{q^a_{j,1}, q^b_{j,1}\} - q_{j,1}| \le 1/K$ by Eq. (26) and $\gamma$ satisfies $|\gamma| \le 1/K$, and (ii) follows since $(1 + z)^b \le \exp(bz)$. The RHS is maximized when $(1 - 2 q_{j,1} - 2\gamma)^2 = \frac{c}{n_{\min}\,(P_a(I_j) + P_b(I_j))}$ for some constant $c$; plugging this into the equation above and noting that $P_{test}(I_j) = (P_a(I_j) + P_b(I_j))/2$ completes the proof.

By combining the previous two lemmas we can now prove our upper bound on the risk of the undersampled binning estimator. We begin by restating it.

Theorem 5.2. Consider the group shift setting described in Section 3.2.2. For any overlap $\tau \in [0, 1]$ and for any $(P_{maj}, P_{min}) \in \mathcal{P}_{GS}(\tau)$, the expected excess risk of the Undersampled Binning Estimator (Eq. (5)) with number of bins $K = n_{\min}^{1/3}$ is upper bounded as follows:

Next, by using the bound on the risk in each interval, we obtain a bound in which (i) follows since for any vector $z \in \mathbb{R}^K$, $\|z\|_1 \le \sqrt{K}\, \|z\|_2$. Optimizing over $K$ yields the choice $K = n_{\min}^{1/3}$, completing the proof.
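To make the estimator concrete, here is a minimal one-dimensional sketch of an undersampled binning classifier, assuming features in $[0, 1)$, labels in $\{-1, +1\}$, and the choice $K = n_{\min}^{1/3}$ from the theorem. The tie-breaking and empty-bin behavior are our own assumptions, not specified by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def undersampled_binning_classifier(x, y, group, rng=rng):
    """Sketch of A_USB: discard excess majority-group points so both groups
    contribute n_min samples, split [0, 1) into K = n_min^(1/3) bins, and
    predict the majority label within each bin."""
    n_min = min(np.sum(group == 0), np.sum(group == 1))
    keep = np.concatenate([
        rng.permutation(np.flatnonzero(group == g))[:n_min] for g in (0, 1)
    ])
    x_us, y_us = x[keep], y[keep]          # undersampled dataset of size 2*n_min
    K = max(1, int(n_min ** (1 / 3)))
    bins = np.minimum((x_us * K).astype(int), K - 1)
    votes = np.zeros(K)
    np.add.at(votes, bins, y_us)           # sum of +1/-1 labels per bin
    # Ties and empty bins default to +1 (an arbitrary convention).
    return lambda xq: np.where(
        votes[np.minimum((xq * K).astype(int), K - 1)] >= 0, 1, -1)

# Toy data: group 0 is the majority; labels follow sign(x - 0.5) plus noise.
n_maj, n_min = 2000, 200
x = rng.random(n_maj + n_min)
group = np.concatenate([np.zeros(n_maj, int), np.ones(n_min, int)])
y = np.where(x + 0.1 * rng.standard_normal(x.shape) > 0.5, 1, -1)

predict = undersampled_binning_classifier(x, y, group)
```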

E EXPERIMENTAL DETAILS FOR FIGURES 2 AND 4

We construct our label shift dataset from the original CIFAR10 dataset, creating a binary classification task from the "cat" and "dog" classes. We use the same convolutional neural network architecture as Byrd & Lipton (2019) and Wang et al. (2022), with random initialization. We train this model using SGD for 800 epochs with batch size 64, a constant learning rate of 0.001, and momentum 0.9.

The importance weights, used to upweight the minority class samples in the training and validation losses, are calculated in terms of the group counts, where $g_i$ denotes the group label, $n_{g_i}$ is the number of samples from group $g_i$, $n_{max}$ is the number of samples in the largest group, and $n$ is the total number of samples. We set $\tau = 3$ and $\gamma = 0.3$, the best hyperparameters identified by Wang et al. (2022) on this dataset for this neural network architecture.

Tilted Loss: The tilted loss (Li et al., 2020) is defined as $\frac{1}{t} \log\!\big(\frac{1}{n} \sum_{i=1}^{n} \exp(t\, \ell_i)\big)$, where we take $\ell$ to be the logistic loss. In our experiments we set $t = 2$.

Group-DRO: We run group-DRO (Sagawa et al., 2020, Algorithm 1) with the logistic loss. We set the adversarial step size $\eta_q = 0.05$, which was the best hyperparameter identified by Wang et al. (2022).
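For reference, the tilted objective of Li et al. (2020) with the logistic loss can be sketched as follows; this is our own log-sum-exp implementation for numerical stability, and the per-sample `margins` are hypothetical.

```python
import numpy as np

def logistic_loss(margin):
    """Numerically stable logistic loss log(1 + exp(-margin))."""
    return np.logaddexp(0.0, -margin)

def tilted_loss(margins, t=2.0):
    """Tilted (TERM) objective: (1/t) * log( (1/n) * sum_i exp(t * loss_i) ),
    computed as (logsumexp(t * losses) - log n) / t."""
    losses = logistic_loss(margins)
    n = len(losses)
    return (np.logaddexp.reduce(t * losses) - np.log(n)) / t

margins = np.array([2.0, -1.0, 0.5])
per_sample = logistic_loss(margins)
tl = tilted_loss(margins, t=2.0)

# With t > 0 the tilt upweights the hardest examples, so the objective lies
# between the mean and the max of the per-sample losses.
assert per_sample.mean() <= tl <= per_sample.max()
```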

