A CRITICAL ANALYSIS OF DISTRIBUTION SHIFT

Abstract

We introduce three new robustness benchmarks consisting of naturally occurring distribution changes in image style, geographic location, camera operation, and more. Using our benchmarks, we take stock of previously proposed hypotheses for out-of-distribution robustness and put them to the test. We find that using larger models and synthetic data augmentation can improve robustness on real-world distribution shifts, contrary to claims in prior work. Motivated by this, we introduce a new data augmentation method which advances the state-of-the-art and outperforms models pretrained with 1000× more labeled data. We also find that some methods consistently help with distribution shifts in texture and local image statistics, yet fail to help with other distribution shifts such as geographic changes; hence no evaluated method consistently improves robustness. We conclude that future research must study multiple distribution shifts simultaneously.

1. INTRODUCTION

While the research community must create robust models that generalize to new scenarios, the robustness literature (Dodge and Karam, 2017; Geirhos et al., 2020) lacks consensus on evaluation benchmarks and contains many dissonant hypotheses. Hendrycks et al. (2020a) find that many recent language models are already robust to many forms of distribution shift, while Yin et al. (2019) and Geirhos et al. (2019) find that vision models are largely fragile and argue that data augmentation offers one solution. In contrast, Taori et al. (2020) provide results suggesting that pretraining and improving in-distribution test accuracy improve natural robustness, whereas other methods do not.

In this paper we articulate and systematically study seven robustness hypotheses. The first four hypotheses concern methods for improving robustness, while the last three concern abstract properties of robustness. These hypotheses are as follows.

• Larger Models: increasing model size improves robustness (Hendrycks and Dietterich, 2019; Xie and Yuille, 2020).
• Self-Attention: adding self-attention layers to models improves robustness (Hendrycks et al., 2019b).
• Diverse Data Augmentation: robustness can increase through data augmentation (Yin et al., 2019).
• Pretraining: pretraining on larger and more diverse datasets improves robustness (Orhan, 2019; Hendrycks et al., 2019a; 2019b).

Existing datasets also lack the diversity needed to extrapolate which methods will improve robustness more broadly. To address these issues and test the seven hypotheses outlined above, we introduce three new robustness benchmarks and a new data augmentation method. First, we introduce ImageNet-Renditions (ImageNet-R), a 30,000-image test set containing various renditions (e.g., paintings, embroidery) of ImageNet object classes.
These renditions are naturally occurring, with textures and local image statistics unlike those of ImageNet images, allowing us to more cleanly separate the Texture Bias and Synthetic ⇏ Real hypotheses.

Next, we investigate natural shifts in the image capture process with StreetView StoreFronts (SVSF) and DeepFashion Remixed (DFR). SVSF contains business storefront images taken from Google Street View, along with metadata that lets us vary location, year, and even camera type. DFR leverages the metadata from DeepFashion2 (Ge et al., 2019) to systematically shift object occlusion, orientation, zoom, and scale at test time. Both SVSF and DFR provide distribution shift controls and do not alter texture, removing possible confounding variables that affect prior benchmarks.

Finally, we contribute DeepAugment to increase robustness to some new types of distribution shift. This augmentation technique uses image-to-image neural networks for data augmentation, rather than the data-independent Euclidean augmentations, such as image shearing or rotation, used in previous work. DeepAugment achieves state-of-the-art robustness on our newly introduced ImageNet-R benchmark and on a corruption robustness benchmark, and it can be combined with other augmentation methods to outperform a model pretrained on 1000× more labeled data.

After examining our results on these three datasets and others, we can rule out several of the above hypotheses while strengthening support for others. As one example, we find that synthetic data augmentation improves accuracy on ImageNet-R and on real-world image blur distribution shifts, providing clear counterexamples to Synthetic ⇏ Real while lending support to the Diverse Data Augmentation and Texture Bias hypotheses. In the conclusion, we summarize the various strands of evidence for and against each hypothesis.
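The metadata-controlled splits behind SVSF and DFR can be sketched as follows. This is a minimal illustration of the idea of shifting a single axis of the capture process at test time; the field names and values ("country", "year") are hypothetical placeholders, not the datasets' actual schema.

```python
# Sketch of a metadata-controlled distribution shift split, in the spirit
# of SVSF / DFR: train on one set of metadata values and hold out the rest,
# so the test set differs along exactly one axis of the capture process.
# The metadata fields below are illustrative, not the datasets' real schema.

def split_by_metadata(examples, field, train_values):
    """Partition examples so `field` takes only `train_values` in training
    and only unseen values at test time."""
    train = [ex for ex in examples if ex[field] in train_values]
    test = [ex for ex in examples if ex[field] not in train_values]
    return train, test

examples = [
    {"image": "a.jpg", "country": "US", "year": 2017},
    {"image": "b.jpg", "country": "US", "year": 2019},
    {"image": "c.jpg", "country": "FR", "year": 2019},
]

# Shift geography only: train on US storefronts, test on the rest.
train, test = split_by_metadata(examples, "country", {"US"})
# train holds the two US images; test holds the FR image
```

The same function applied with `field="year"` would instead produce a temporal shift while leaving geography mixed, which is the kind of single-factor control the text describes.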
Across our many experiments, we do not find a general method that consistently improves robustness, and some hypotheses require additional qualifications. While robustness is often spoken of and measured as a single scalar property like accuracy, our investigations suggest that robustness is not so simple. In light of our results, we hypothesize in the conclusion that robustness is multivariate.
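The contrast between DeepAugment and geometric augmentation can be illustrated with a toy sketch: pass an image through an image-to-image network whose weights are randomly perturbed, so each pass yields a new, structured distortion rather than a Euclidean transform. The two-layer channel-mixing "network" below is a stand-in for illustration only, assuming nothing about the paper's actual architecture or perturbation schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def tiny_img2img(x, w1, w2):
    """Toy image-to-image network: per-pixel channel mixing with a ReLU,
    mapping an HxWx3 image to an HxWx3 image."""
    h = np.maximum(x @ w1, 0.0)
    return h @ w2

def deepaugment_sketch(image, noise_scale=0.1):
    """Illustrate the DeepAugment idea: distort an image by running it
    through an image-to-image network with randomly perturbed weights."""
    w1 = rng.normal(size=(3, 8)) / np.sqrt(3.0)
    w2 = rng.normal(size=(8, 3)) / np.sqrt(8.0)
    # Perturb the weights (additively and multiplicatively) so every call
    # produces a different learned-network-induced distortion.
    w1 = w1 + noise_scale * rng.normal(size=w1.shape)
    w2 = w2 * (1.0 + noise_scale * rng.normal(size=w2.shape))
    out = tiny_img2img(image, w1, w2)
    return np.clip(out, 0.0, 1.0)

img = rng.random((32, 32, 3))   # a random stand-in "image" in [0, 1]
aug = deepaugment_sketch(img)
```

Unlike shearing or rotation, the resulting distortion is data-dependent and nonlinear in pixel values, which is the property the text contrasts with Euclidean augmentations.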

2. RELATED WORK

Robustness Benchmarks. Recent works (Hendrycks and Dietterich, 2019; Recht et al., 2019; Hendrycks et al., 2020a) have begun to characterize model performance on out-of-distribution (OOD) data with various new test sets, with dissonant findings. For instance, Hendrycks et al. (2020a) demonstrate that modern language processing models are moderately robust to numerous naturally occurring distribution shifts, and that Only IID Accuracy Matters is inaccurate for natural language



Figure 1: Images from our three new datasets ImageNet-Renditions (ImageNet-R), DeepFashion Remixed (DFR), and StreetView StoreFronts (SVSF). The SVSF images are recreated from the public Google StreetView, copyright Google 2020. Our datasets test robustness to various naturally occurring distribution shifts including rendition style, camera viewpoint, and geography.

• Texture Bias: convolutional networks are biased towards texture, which harms robustness (Geirhos et al., 2019).
• Only IID Accuracy Matters: accuracy on independent and identically distributed test data entirely determines natural robustness.
• Synthetic ⇏ Real: synthetic robustness interventions, including diverse data augmentations, do not improve robustness on real-world distribution shifts.

