A CRITICAL ANALYSIS OF DISTRIBUTION SHIFT

Abstract

We introduce three new robustness benchmarks consisting of naturally occurring distribution changes in image style, geographic location, camera operation, and more. Using these benchmarks, we take stock of previously proposed hypotheses about out-of-distribution robustness and put them to the test. We find that using larger models and synthetic data augmentation can improve robustness on real-world distribution shifts, contrary to claims in prior work. Motivated by this, we introduce a new data augmentation method that advances the state of the art and outperforms models pretrained with 1000× more labeled data. We also find that some methods consistently help with distribution shifts in texture and local image statistics, but do not help with other distribution shifts such as geographic changes; hence no evaluated method consistently improves robustness. We conclude that future research must study multiple distribution shifts simultaneously.

1. INTRODUCTION

While the research community must create robust models that generalize to new scenarios, the robustness literature (Dodge and Karam, 2017; Geirhos et al., 2020) lacks consensus on evaluation benchmarks and contains many dissonant hypotheses. Hendrycks et al. (2020a) find that many recent language models are already robust to many forms of distribution shift, while Yin et al. (2019) and Geirhos et al. (2019) find that vision models are largely fragile and argue that data augmentation offers one solution. In contrast, Taori et al. (2020) provide results suggesting that pretraining and improving in-distribution test accuracy improve natural robustness, whereas other methods do not. In this paper we articulate and systematically study seven robustness hypotheses. The first four concern methods for improving robustness, while the last three concern abstract properties of robustness. The hypotheses are as follows.
• Larger Models: increasing model size improves robustness (Hendrycks and Dietterich, 2019; Xie and Yuille, 2020).
• Self-Attention: adding self-attention layers to models improves robustness (Hendrycks et al., 2019b).
• Diverse Data Augmentation: robustness can increase through data augmentation (Yin et al., 2019).
• Pretraining: pretraining on larger and more diverse datasets improves robustness (Orhan, 2019; Hendrycks et al., 2019a).
• Texture Bias: convolutional networks are biased towards texture, which harms robustness (Geirhos et al., 2019).
• Only IID Accuracy Matters: accuracy on independent and identically distributed test data entirely determines natural robustness.
• Synthetic ⇒ Real: synthetic robustness interventions, including diverse data augmentations, do not help with robustness on real-world distribution shifts (Taori et al., 2020).
It has been difficult to arbitrate these hypotheses because existing robustness datasets preclude controlled experiments: they vary multiple aspects of distribution shift simultaneously. For instance, Texture Bias was initially investigated with synthetic distortions (Geirhos et al., 2018), which conflicts with the Synthetic ⇒ Real hypothesis. On the other hand, natural distribution shifts often affect many factors (e.g., time, camera, location, etc.) simultaneously in unknown ways (Recht et al., 2019;

