UNDERSAMPLING IS A MINIMAX OPTIMAL ROBUSTNESS INTERVENTION IN NONPARAMETRIC CLASSIFICATION

Abstract

While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an undersampled balanced dataset often achieves close to state-of-the-art accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask whether learning is fundamentally constrained by a lack of minority group samples. We prove that this is indeed the case in the setting of nonparametric binary classification. Our results show that in the worst case, an algorithm cannot outperform undersampling unless there is a high degree of overlap between the train and test distributions (which is unlikely to be the case in real-world datasets), or unless the algorithm leverages additional structure about the distribution shift. In particular, in the case of label shift we show that there is always an undersampling algorithm that is minimax optimal. In the case of group-covariate shift we show that there is an undersampling algorithm that is minimax optimal when the overlap between the group distributions is small. We also perform an experimental case study on a label shift dataset and find that, in line with our theory, the test accuracy of robust neural network classifiers is constrained by the number of minority samples.

1. INTRODUCTION

A key challenge facing the machine learning community is to design models that are robust to distribution shift. When there is a mismatch between the train and test distributions, current models are often brittle and perform poorly on rare examples (Hovy & Søgaard, 2015; Blodgett et al., 2016; Tatman, 2017; Hashimoto et al., 2018; Alcorn et al., 2019). In this paper, our focus is on group-structured distribution shifts: in the training set, we have many samples from a majority group and relatively few samples from the minority group, while at test time we are equally likely to get a sample from either group. To tackle such distribution shifts, a naïve algorithm is one that first undersamples the training data by discarding excess majority group samples (Kubat & Matwin, 1997; Wallace et al., 2011) and then trains a model on the resulting dataset (see Figure 1 for an illustration of this algorithm). The samples that remain in this undersampled dataset constitute i.i.d. draws from the test distribution. Therefore, while a classifier trained on this pruned dataset cannot suffer biases due to distribution shift, the algorithm is clearly wasteful, as it discards training samples. This perceived inefficiency of undersampling has led to the design of several algorithms to combat such distribution shift (Chawla et al., 2002; Lipton et al., 2018; Sagawa et al., 2020; Cao et al., 2019; Menon et al., 2020; Ye et al., 2020; Kini et al., 2021; Wang et al., 2022). In spite of this algorithmic progress, the simple baseline of training models on an undersampled dataset remains competitive. In the case of label shift, where one class label is overrepresented in the training data, this has been observed by Cui et al. (2019); Cao et al. (2019), and Yang & Xu (2020). In the case of group-covariate shift, a study by Idrissi et al. (2022) showed that the empirical effectiveness of these more complicated algorithms is limited. For example, Idrissi et al.
(2022) showed that on the group-covariate shift CelebA benchmark, the worst-group accuracy of a ResNet-50 model trained on the undersampled dataset, which discards 97% of the available training data, is as good as that of methods which use all of the available data, such as importance weighting.
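The undersampling baseline described above is simple to state precisely. As a minimal sketch (the function name and interface are ours, not from the paper): given group labels for each training point, keep a random subset of each group of size equal to the smallest group, so the retained samples are balanced draws from the test distribution.

```python
import numpy as np

def undersample_balanced(X, y, groups, seed=0):
    """Discard excess majority-group samples so that every group
    is represented by the same number of training points.

    Keeps n_min samples per group, where n_min is the size of the
    smallest group; the kept indices are chosen uniformly at random.
    """
    rng = np.random.default_rng(seed)
    unique_groups = np.unique(groups)
    # Size of the smallest (minority) group.
    n_min = min(int((groups == g).sum()) for g in unique_groups)
    keep = []
    for g in unique_groups:
        idx = np.flatnonzero(groups == g)
        keep.extend(rng.choice(idx, size=n_min, replace=False))
    keep = np.sort(np.asarray(keep))
    return X[keep], y[keep], groups[keep]
```

Any standard classifier can then be trained on the returned balanced subset; in the label shift setting, `groups` would simply be the class labels `y`.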

