UNDERSAMPLING IS A MINIMAX OPTIMAL ROBUSTNESS INTERVENTION IN NONPARAMETRIC CLASSIFICATION

Abstract

While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an undersampled balanced dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask whether learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification. Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets), or unless the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal. In the case of group-covariate shift we show that there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that, in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples.

1. INTRODUCTION

A key challenge facing the machine learning community is to design models that are robust to distribution shift. When there is a mismatch between the train and test distributions, current models are often brittle and perform poorly on rare examples (Hovy & Søgaard, 2015; Blodgett et al., 2016; Tatman, 2017; Hashimoto et al., 2018; Alcorn et al., 2019). In this paper, our focus is on group-structured distribution shifts: in the training set, we have many samples from a majority group and relatively few samples from the minority group, while during test time we are equally likely to get a sample from either group. To tackle such distribution shifts, a naïve algorithm is one that first undersamples the training data by discarding excess majority group samples (Kubat & Matwin, 1997; Wallace et al., 2011) and then trains a model on the resulting dataset (see Figure 1 for an illustration of this algorithm, and the code sketch below). The samples that remain in this undersampled dataset constitute i.i.d. draws from the test distribution. Therefore, while a classifier trained on this pruned dataset cannot suffer biases due to distribution shift, the algorithm is clearly wasteful, as it discards training samples. This perceived inefficiency of undersampling has led to the design of several algorithms to combat such distribution shift (Chawla et al., 2002; Lipton et al., 2018; Sagawa et al., 2020; Cao et al., 2019; Menon et al., 2020; Ye et al., 2020; Kini et al., 2021; Wang et al., 2022).

In spite of this algorithmic progress, the simple baseline of training models on an undersampled dataset remains competitive. In the case of label shift, where one class label is overrepresented in the training data, this has been observed by Cui et al. (2019); Cao et al. (2019), and Yang & Xu (2020). In the case of group-covariate shift, a study by Idrissi et al. (2022) showed that the empirical effectiveness of these more complicated algorithms is limited: on the CelebA dataset, the worst-group accuracy of a ResNet-50 model trained on the undersampled dataset, which discards 97% of the available training data, is as good as that of methods, such as importance weighting, that use all of the available data. In Table 1, we report the performance of the undersampled classifier compared to the state-of-the-art methods in the literature across several label shift and group-covariate shift datasets. We find that, although undersampling isn't always the optimal robustness algorithm, it is typically a very competitive baseline, within 1-4% of the performance of the best method.

Inspired by the strong performance of undersampling in these experiments, we ask: Is the performance of a model under distribution shift fundamentally constrained by the lack of minority group samples? To answer this question we analyze the minimax excess risk. We lower bound the minimax excess risk to prove that the performance of any algorithm is bounded from below by a quantity that depends only on the number of minority samples, $n_{\min}$. This shows that even if a robust algorithm optimally trades off between the bias and the variance, it is fundamentally constrained by the variance on the minority group, which decreases only with $n_{\min}$.

Our Contributions. In our paper, we consider the well-studied setting of nonparametric binary classification (Tsybakov, 2010). By operating in this nonparametric regime we are able to study the properties of undersampling in rich data distributions, while circumventing the complications that arise due to the optimization and implicit bias of parametric models.
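Throughout, the yardstick for these claims is the minimax excess risk. Its formal definition appears later in the paper; the display below is only an illustrative sketch in our own notation, where $\mathcal{P}$ denotes the class of admissible train/test distribution pairs and the infimum ranges over all learning algorithms $\hat{f}$ trained on a sample $S$ containing $n_{\mathrm{maj}}$ majority and $n_{\min}$ minority examples:

$$
\mathcal{M}(n_{\mathrm{maj}}, n_{\min}) \;=\; \inf_{\hat{f}} \, \sup_{(P_{\mathrm{tr}},\, P_{\mathrm{te}}) \in \mathcal{P}} \left\{ \mathbb{E}_{S \sim P_{\mathrm{tr}}} \big[ R_{P_{\mathrm{te}}}(\hat{f}_S) \big] \;-\; \inf_{f} R_{P_{\mathrm{te}}}(f) \right\},
\qquad R_{P}(f) \;=\; \mathbb{P}_{(x, y) \sim P} \big[ f(x) \neq y \big].
$$

A lower bound on $\mathcal{M}(n_{\mathrm{maj}}, n_{\min})$ that depends only on $n_{\min}$ says that no algorithm, however it uses the excess majority data, can beat the minority-group variance in the worst case.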

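To make the undersampling baseline concrete, the following is a minimal sketch in Python/NumPy. The function name, interface, and binary group encoding are our own illustrative choices, not code from the paper; any downstream classifier can then be trained on the balanced subset it returns.

```python
import numpy as np

def undersample(X, y, g, seed=0):
    """Group-balanced undersampling: keep every minority-group sample
    and discard excess majority-group samples uniformly at random.

    X: (n, d) features; y: (n,) labels; g: (n,) binary group ids.
    In the label shift setting, g can simply be taken equal to y.
    """
    rng = np.random.default_rng(seed)
    idx0 = np.flatnonzero(g == 0)
    idx1 = np.flatnonzero(g == 1)
    n_min = min(len(idx0), len(idx1))
    # Draw n_min indices from each group without replacement; for the
    # minority group this keeps all of its samples.
    keep = np.concatenate([rng.choice(idx0, size=n_min, replace=False),
                           rng.choice(idx1, size=n_min, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep], g[keep]
```

Because the retained samples are i.i.d. draws from the group-balanced test distribution, a classifier trained on them incurs no bias from the shift; the cost is purely statistical, since the effective sample size shrinks to $2 n_{\min}$.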


Figure 1: Example with linear models and linearly separable data. On the left we have the maximum margin classifier over the entire dataset, and on the right we have the maximum margin classifier over the undersampled dataset. The undersampled classifier is less biased and aligns more closely with the true boundary.
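The contrast depicted in Figure 1 is easy to reproduce in simulation. The sketch below uses our own illustrative setup (two effectively separable 2-D Gaussian classes with a 19:1 imbalance, and a hard-margin SVM approximated by a large C in scikit-learn); it is not the paper's code, but it exhibits the same effect: the max-margin boundary fit on the full data is pulled toward the minority class, while the boundary fit on the undersampled data sits near the true midpoint.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Illustrative synthetic data: two (effectively) linearly separable
# Gaussian classes in 2-D, with a 19:1 majority/minority imbalance.
n_maj, n_min = 950, 50
X = np.vstack([rng.normal(loc=(+2.0, 0.0), scale=0.5, size=(n_maj, 2)),
               rng.normal(loc=(-2.0, 0.0), scale=0.5, size=(n_min, 2))])
y = np.concatenate([np.ones(n_maj), -np.ones(n_min)])

def boundary_crossing(clf):
    # x-coordinate at which the learned hyperplane crosses the x-axis;
    # the true class boundary in this setup is x = 0.
    return -clf.intercept_[0] / clf.coef_[0, 0]

# Maximum margin classifier on the full, imbalanced dataset
# (hard margin approximated by a very large C).
full = SVC(kernel="linear", C=1e6).fit(X, y)

# Maximum margin classifier on the undersampled, balanced dataset.
keep = np.concatenate([rng.choice(n_maj, size=n_min, replace=False),
                       np.arange(n_maj, n_maj + n_min)])
under = SVC(kernel="linear", C=1e6).fit(X[keep], y[keep])

print(f"full-data boundary crosses at x = {boundary_crossing(full):+.2f}")
print(f"undersampled boundary crosses at x = {boundary_crossing(under):+.2f}")
```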

Table 1: Performance of the undersampled classifier compared to the best classifier across several popular label shift and group-covariate shift datasets. For datasets where we report worst-group accuracy rather than average accuracy, this is indicated in the table. When available, we report the 95% confidence interval. We find that the undersampled classifier is always within 1-4% of the best performing robustness algorithm, except on the MultiNLI dataset.

