ON INTERACTION BETWEEN AUGMENTATIONS AND CORRUPTIONS IN NATURAL CORRUPTION ROBUSTNESS

Abstract

Invariance to a broad array of image corruptions, such as warping, noise, or color shifts, is an important aspect of building robust models in computer vision. Recently, several new data augmentations have been proposed that significantly improve performance on ImageNet-C, a benchmark of such corruptions. However, there is still a lack of basic understanding of the relationship between data augmentations and test-time corruptions. To this end, we develop a feature space for image transforms and then use a new measure in this space, the Minimal Sample Distance, to demonstrate that there is a strong correlation between augmentation-corruption similarity and performance. We then investigate recent data augmentations and observe a significant degradation in corruption robustness when the test-time corruptions are sampled to be perceptually dissimilar from ImageNet-C in this feature space. Our results suggest that test error can be improved by training on perceptually similar augmentations, and that data augmentations risk overfitting to the existing benchmark. We hope our results and tools will enable more rigorous progress towards improving robustness to image corruptions.

1. INTRODUCTION

Robustness to distribution shift, i.e., when the train and test distributions differ, is an important feature of practical machine learning models. Among the many forms of distribution shift, one category particularly relevant to computer vision is image corruption. For example, test data may come from sources that differ from the training set in terms of lighting, camera quality, or other features. Post-processing transforms, such as photo touch-up, image filters, or compression effects, are commonplace in real-world data. Models developed using clean, undistorted inputs typically perform dramatically worse when confronted with these sorts of image corruptions (Hendrycks & Dietterich, 2018; Geirhos et al., 2018). The subject of corruption robustness has a long history in computer vision (Simard et al., 1998; Bruna & Mallat, 2013; Dodge & Karam, 2017) and has recently been studied actively following the release of benchmark datasets such as ImageNet-C (Hendrycks & Dietterich, 2018).

One notable property of image corruptions is that they are low-level distortions. Corruptions are transformations of an image that affect structural information such as colors, textures, or geometry (Ding et al., 2020) and are typically free of high-level semantics. It is therefore natural to expect that data augmentation techniques, which expand the training set with random low-level transformations, can help with learning robust models. Indeed, data augmentation has become a central technique in several recent methods (Hendrycks et al., 2019; Lopes et al., 2019; Rusak et al., 2020) that achieve large improvements on ImageNet-C and related benchmarks.

One caveat for augmentation-based approaches is that the test corruptions are expected to be unknown at training time. If the corruptions were known, they could simply be applied to the training set as data augmentations to trivially adapt to the test distribution.
Instead, an ideal robust model needs to be robust to any valid corruption, including ones unseen in any previous benchmark. Of course, in practice the robustness of a model can only be evaluated approximately, by measuring its corruption error on a representative benchmark. To avoid trivial adaptation to the benchmark, recent works manually exclude the test corruptions from the training augmentations. However, with a toy experiment presented in Figure 1, we argue that this strategy alone may not be enough: augmentation outputs that are merely visually similar to test corruptions can lead to significant benchmark improvements, even when the exact corruption transformations are excluded.

This observation raises two important questions. First, how exactly does the similarity between train-time augmentations and test-time corruptions affect the error? Second, if the gains are due to this similarity, the improvements may not translate into robustness to other possible corruptions, so do we risk overfitting existing corruption benchmarks with a new augmentation scheme?

In this work, we take a step towards answering these questions, with the goal of better understanding the relationship between data augmentation and test-time corruptions. Using a feature space on image transforms and a new measure on this space called the Minimal Sample Distance (MSD), we are able to quantify the distance between augmentation schemes and classes of corruption transformations. With this approach, we empirically demonstrate an intuitive yet surprisingly overlooked finding: augmentation-corruption perceptual similarity is a strong predictor of corruption error. Based on this finding, we perform additional experiments showing that data augmentation aids corruption robustness by increasing the perceptual similarity between a (possibly small) fraction of the training data and the test set.
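To make the idea concrete, the MSD between an augmentation scheme and a corruption can be approximated by embedding many sampled augmented images and one corrupted image with some feature extractor, then taking the distance to the *nearest* augmentation. The sketch below is a simplified illustration using generic NumPy feature vectors; the function name and the toy 2-D features are hypothetical, not the paper's implementation (a real feature space would come from a trained network's embeddings).

```python
import numpy as np

def minimal_sample_distance(aug_features, corr_feature):
    """Minimal Sample Distance (MSD): the distance from a corruption's
    feature embedding to the nearest of many sampled augmentation
    embeddings. A small MSD means at least a few augmented images are
    perceptually close to the corruption, even if most are not."""
    dists = np.linalg.norm(aug_features - corr_feature, axis=1)
    return float(dists.min())

# Toy example with hand-made 2-D "features".
augs = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
corr = np.array([1.0, 0.0])
print(minimal_sample_distance(augs, corr))  # → 1.0
```

Using the minimum rather than an average distance captures the intuition, developed later in the paper, that only a small fraction of sampled augmentations needs to resemble a corruption for robustness gains to appear.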
To further support our claims, we introduce a set of new corruptions, called CIFAR/ImageNet-C̄, to test the degree to which common data augmentation methods overfit the original CIFAR/ImageNet-C. To choose these corruptions, we expand the set of natural corruptions and sample new ones that are far from CIFAR/ImageNet-C in our feature space for measuring perceptual similarity. We then demonstrate that augmentation schemes designed specifically to improve robustness show significantly degraded performance on CIFAR/ImageNet-C̄. Some augmentation schemes still improve over the baseline, which suggests meaningful progress towards general corruption robustness is being made, but different augmentation schemes exhibit different degrees of generalization. One implication is that caution is needed for fair robustness evaluations whenever additional data augmentation is introduced.

These results point to a major, often overlooked challenge in the study of corruption robustness: overfitting does occur. Since perceptual similarity can predict performance, for any fixed finite set of test corruptions, improvements on that set may generalize poorly to dissimilar corruptions. However, perceptual similarity is not expected to be the only interaction between augmentations and corruptions, so the degree to which a proposed augmentation scheme generalizes may not be immediately clear. We hope that our results, together with the new tools and benchmarks, will help researchers better understand why a given augmentation scheme achieves good corruption error and whether it should be expected to generalize to dissimilar corruptions. On the positive side, our experiments show that generalization does emerge within perceptually similar classes of transforms, and that only a small fraction of sampled augmentations need to be similar to a given corruption. Section 6 discusses these points in more depth.
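The selection of benchmark-dissimilar corruptions described above can be sketched as a simple greedy scoring step: embed candidate corruptions and the existing benchmark corruptions in the feature space, then keep the candidates that lie farthest from the benchmark set. The scoring rule below (mean distance to all benchmark embeddings) is a hypothetical simplification for illustration, not the paper's exact selection procedure.

```python
import numpy as np

def most_dissimilar(candidates, benchmark, k):
    """Return indices of the k candidate corruptions whose feature
    embeddings are, on average, farthest from every benchmark
    corruption embedding (higher score = more dissimilar)."""
    # Pairwise distances, shape (num_candidates, num_benchmark).
    d = np.linalg.norm(candidates[:, None, :] - benchmark[None, :, :], axis=2)
    scores = d.mean(axis=1)
    return np.argsort(scores)[::-1][:k]

# Toy 2-D embeddings: candidate 1 is farthest from the benchmark pair.
cands = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 0.0]])
bench = np.array([[0.0, 0.0], [1.0, 0.0]])
print(most_dissimilar(cands, bench, k=2))  # → [1 2]
```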



Example transforms are for illustrative purposes only and are exaggerated. Base image © Sehee Park.



Figure 1: A toy experiment. We train multiple models on CIFAR-10 (Krizhevsky et al., 2009) using different augmentation schemes. Each scheme is based on a single basic image transformation type and is enhanced by overlaying random instantiations of the transformation for each input image, following Hendrycks et al. (2019). We compare these models on the CIFAR-10 test set corrupted by motion blur, a corruption from the ImageNet-C benchmark (Hendrycks & Dietterich, 2018). None of the augmentation schemes contains motion blur; nevertheless, the models trained with geometric augmentations significantly outperform the baseline model trained on clean images, while color-based augmentations show no gains. We note that the geometric augmentations can produce a result visually similar to a blur by overlaying copies of shifted images.
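The visual intuition in the caption, that overlaying shifted copies of an image resembles a motion blur, can be sketched as follows. The helper is hypothetical and not the augmentation code used in the experiment; it simply averages translated copies of an image.

```python
import numpy as np

def overlay_shifts(img, shifts):
    """Average translated copies of an image. Shifts along a single
    direction produce a streaking effect similar to motion blur."""
    acc = np.zeros_like(img, dtype=float)
    for dy, dx in shifts:
        acc += np.roll(img, shift=(dy, dx), axis=(0, 1))
    return acc / len(shifts)

# A single bright pixel becomes a short horizontal streak.
img = np.zeros((3, 3))
img[1, 1] = 1.0
blurred = overlay_shifts(img, shifts=[(0, 0), (0, 1)])
# blurred now has intensity 0.5 at (1, 1) and (1, 2).
```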

