BREEDS: BENCHMARKS FOR SUBPOPULATION SHIFT

Abstract

We develop a methodology for assessing the robustness of models to subpopulation shift: specifically, their ability to generalize to novel data subpopulations that were not observed during training. Our approach leverages the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions. This enables us to synthesize realistic distribution shifts, whose sources can be precisely controlled and characterized, within existing large-scale datasets. Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity. We then validate that the corresponding shifts are tractable by obtaining human baselines. Finally, we use these benchmarks to measure the sensitivity of standard model architectures to subpopulation shift, as well as the effectiveness of existing train-time robustness interventions.

1. INTRODUCTION

Robustness to distribution shift has been the focus of a long line of work in machine learning (Schlimmer & Granger, 1986; Widmer & Kubat, 1993; Kelly et al., 1999; Shimodaira, 2000; Sugiyama et al., 2007; Quionero-Candela et al., 2009; Moreno-Torres et al., 2012; Sugiyama & Kawanabe, 2012). At a high level, the goal is to ensure that models perform well not only on unseen samples from the datasets they are trained on, but also on the diverse set of inputs they are likely to encounter in the real world. However, building benchmarks for evaluating such robustness is challenging: it requires modeling realistic data variations in a way that is well-defined, controllable, and easy to simulate. Prior work in this context has focused on building benchmarks that capture distribution shifts caused by natural or adversarial input corruptions (Szegedy et al., 2014; Fawzi & Frossard, 2015; Fawzi et al., 2016; Engstrom et al., 2019b; Ford et al., 2019; Hendrycks & Dietterich, 2019; Kang et al., 2019), differences in data sources (Saenko et al., 2010; Torralba & Efros, 2011; Khosla et al., 2012; Tommasi & Tuytelaars, 2014; Recht et al., 2019), and changes in the frequencies of data subpopulations (Oren et al., 2019; Sagawa et al., 2020). While each of these approaches captures a different source of real-world distribution shift, we cannot expect any single benchmark to be comprehensive. Thus, to obtain a holistic understanding of model robustness, we need to keep expanding our testbed to encompass more natural modes of variation.

In this work, we take another step in that direction by studying the following question: How well do models generalize to data subpopulations they have not seen during training? The notion of subpopulation shift this question refers to is quite pervasive. After all, our training datasets will inevitably fail to perfectly capture the diversity of the real world. Hence, during deployment, our models are bound to encounter unseen subpopulations: for instance, unexpected weather conditions in the self-driving car context, or different diagnostic setups in medical applications.

OUR CONTRIBUTIONS

The goal of our work is to create large-scale subpopulation shift benchmarks wherein the data subpopulations present during model training and evaluation differ. These benchmarks aim to assess how effectively models generalize beyond the limited diversity of their training datasets, e.g., whether models can recognize Dalmatians as "dogs" even when their training data for "dogs" comprises only Poodles and Terriers. We show how one can simulate such shifts, fairly naturally, within existing datasets, hence eliminating the need for (and the potential biases introduced by) crafting synthetic transformations or collecting additional data.

BREEDS benchmarks. The crux of our approach is to leverage existing dataset labels and use them to identify superclasses, i.e., groups of semantically similar classes. This allows us to construct classification tasks over such superclasses, and to repurpose the original dataset classes as the subpopulations of interest. This, in turn, enables us to induce a subpopulation shift by making the subpopulations present in the training and test distributions disjoint. By applying this methodology to the ImageNet dataset (Deng et al., 2009), we create a suite of subpopulation shift benchmarks of varying difficulty. This involves modifying the existing ImageNet class hierarchy, WordNet (Miller, 1995), to ensure that superclasses comprise visually coherent subpopulations. We conduct human studies to validate that the resulting benchmarks capture meaningful subpopulation shifts.

Model robustness to subpopulation shift. To demonstrate the utility of our benchmarks, we employ them to evaluate the robustness of standard models to subpopulation shift. In general, we find that model performance drops significantly on the shifted distribution, even when this shift does not significantly affect humans. Still, models that are more accurate on the original distribution tend to also be more robust to these subpopulation shifts.
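The core construction, making the per-superclass subpopulations disjoint across the training and test distributions, can be sketched in a few lines. The function name and toy hierarchy below are purely illustrative, not the actual BREEDS implementation or WordNet structure:

```python
import random

def make_breeds_style_split(superclasses, seed=0):
    """Split each superclass's subclasses into disjoint halves: one half
    serves as the source (training) subpopulations, the other as the
    target (evaluation) subpopulations."""
    rng = random.Random(seed)
    source, target = {}, {}
    for sup, subs in superclasses.items():
        subs = sorted(subs)           # copy; original hierarchy untouched
        rng.shuffle(subs)
        mid = len(subs) // 2
        source[sup], target[sup] = subs[:mid], subs[mid:]
    return source, target

# Toy hierarchy with made-up labels, for illustration only:
hierarchy = {
    "dog": ["poodle", "terrier", "dalmatian", "beagle"],
    "cat": ["tabby", "siamese", "persian", "sphynx"],
}
src, tgt = make_breeds_style_split(hierarchy)
# Every superclass appears in both splits, but with disjoint subclasses,
# so a model trained on `src` never sees the `tgt` subpopulations.
```

A classifier is then trained to predict the superclass label on images drawn from the source subclasses, and evaluated on images drawn from the held-out target subclasses.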
Moreover, adapting models to the shifted domain, by retraining their last layer on data from that domain, only partially recovers the original model performance.

Impact of robustness interventions. Finally, we examine whether various train-time interventions designed to decrease model sensitivity to synthetic data corruptions (e.g., ℓ2-bounded perturbations) also make models more robust to subpopulation shift. We find that many of these methods offer small, yet non-trivial, improvements along this axis, at times at the expense of performance on the original distribution. Often, these improvements become more pronounced after retraining the last layer of the model on the shifted distribution. Nevertheless, the increase in model robustness to subpopulation shift due to these interventions is much smaller than what has been observed for other families of input variations, such as data corruptions (Hendrycks & Dietterich, 2019; Ford et al., 2019; Kang et al., 2019; Taori et al., 2020). This indicates that handling subpopulation shifts, such as those present in the BREEDS benchmarks, might require a different set of robustness tools.
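As a rough illustration of the last-layer adaptation step, the sketch below refits only a linear head on frozen features computed on shifted-domain data. It uses a closed-form ridge regression as a stand-in for retraining the softmax layer by gradient descent; the "backbone" is a random projection and all names and data are synthetic:

```python
import numpy as np

def adapt_last_layer(features_fn, X_target, y_target, num_classes, l2=1e-3):
    """Refit only the final linear layer on target-domain data, keeping the
    (frozen) feature extractor fixed. Ridge regression onto one-hot labels
    stands in for SGD retraining of the softmax layer."""
    Z = features_fn(X_target)                 # frozen backbone features
    Y = np.eye(num_classes)[y_target]         # one-hot targets
    # Closed-form ridge solution: W = (Z^T Z + l2 I)^{-1} Z^T Y
    W = np.linalg.solve(Z.T @ Z + l2 * np.eye(Z.shape[1]), Z.T @ Y)
    return W

rng = np.random.default_rng(0)
project = rng.normal(size=(10, 16))               # stand-in "backbone"
features = lambda X: np.maximum(X @ project, 0)   # frozen ReLU features
X = rng.normal(size=(200, 10))                    # toy shifted-domain data
y = (X[:, 0] > 0).astype(int)                     # toy 2-class labels
W = adapt_last_layer(features, X, y, num_classes=2)
acc = (np.argmax(features(X) @ W, axis=1) == y).mean()
```

In our experiments the analogous step is performed on the actual shifted BREEDS distribution with the network's penultimate-layer activations as the frozen features.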

2. DESIGNING BENCHMARKS FOR DISTRIBUTION SHIFT

When constructing distribution shift benchmarks, the key design choice lies in specifying the target distribution to be used during model evaluation. This distribution is meant to be a realistic variation of the source distribution that was used for training. Typically, studies focus on variations due to:

• Differences in data sources: Here, the target distribution is an independent dataset for the same task (Saenko et al., 2010; Torralba & Efros, 2011; Tommasi & Tuytelaars, 2014; Recht et al., 2019), e.g., one collected at a different geographic location (Beery et al., 2018), time frame (Kumar et al., 2020), or from a different user population (Caldas et al., 2018). For instance, this could involve using PASCAL VOC (Everingham et al., 2010) to evaluate Caltech101-trained classifiers (Fei-Fei et al., 2006). The goal is to test whether models are overly reliant on the idiosyncrasies of their training datasets (Ponce et al., 2006; Torralba & Efros, 2011).

• Subpopulation representation: The source and target distributions differ in terms of how well-represented each subpopulation is. Work in this area typically studies whether models perform equally well across all subpopulations, from the perspective of reliability (Meinshausen et al., 2015; Hu et al., 2018; Duchi & Namkoong, 2018; Caldas et al., 2018; Oren et al., 2019; Sagawa et al., 2020) or algorithmic fairness (Dwork et al., 2012; Kleinberg et al., 2017; Jurgens et al., 2017; Buolamwini & Gebru, 2018; Hashimoto et al., 2018).



• Data corruptions: The target distribution is obtained by modifying inputs from the source distribution via a family of transformations that mimic real-world corruptions, as in Fawzi & Frossard (2015); Fawzi et al. (2016); Engstrom et al. (2019b); Hendrycks & Dietterich (2019); Ford et al. (2019); Kang et al. (2019); Shankar et al. (2019).
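For contrast with the subpopulation shifts studied in this work, a corruption-style evaluation can be mimicked in a few lines: the target distribution is just the source test set passed through a fixed transformation. The helper below is a hypothetical minimal sketch (Gaussian noise at increasing severities, evaluated on a toy linear classifier), not any particular benchmark's code:

```python
import numpy as np

def accuracy_under_corruption(predict, X, y, severities=(0.0, 0.5, 1.0), seed=0):
    """Evaluate a fixed classifier on copies of the test set corrupted with
    additive Gaussian noise of increasing severity -- a minimal stand-in
    for corruption benchmarks such as ImageNet-C."""
    rng = np.random.default_rng(seed)
    accs = []
    for sigma in severities:
        Xc = X + sigma * rng.normal(size=X.shape)   # corrupted test inputs
        accs.append(float((predict(Xc) == y).mean()))
    return accs

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)                       # toy labels
predict = lambda X: (X[:, 0] > 0).astype(int)       # perfect on clean data
accs = accuracy_under_corruption(predict, X, y)
# Accuracy degrades as the corruption severity grows.
```

Subpopulation shift differs in that no per-input transformation is applied: the inputs themselves are drawn from unseen subclasses.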

Code availability: https://github.com/

