INCREASING THE COVERAGE AND BALANCE OF ROBUSTNESS BENCHMARKS BY USING NON-OVERLAPPING CORRUPTIONS

Anonymous

Abstract

Neural networks are sensitive to various corruptions that commonly occur in real-world applications, such as blurs, noises, and low-lighting conditions. To estimate the robustness of neural networks to these common corruptions, we generally use a group of modeled corruptions gathered into a benchmark. We argue that corruption benchmarks often have poor coverage: being robust to them only implies being robust to a narrow range of corruptions. They are also often unbalanced: they give too much importance to some corruptions compared to others. In this paper, we propose to build corruption benchmarks with only non-overlapping corruptions, in order to improve their coverage and their balance. Two corruptions overlap when the robustnesses of neural networks to these corruptions are correlated. We propose the first metric to measure the overlapping between two corruptions. We provide an algorithm that uses this metric to build benchmarks of Non-Overlapping Corruptions. Using this algorithm, we build from ImageNet a new corruption benchmark called ImageNet-NOC. We show that ImageNet-NOC is balanced and covers several kinds of corruptions that are not covered by ImageNet-C.

1. INTRODUCTION

Neural networks perform poorly when they deal with images drawn from a different distribution than their training samples. Indeed, neural networks are sensitive to adversarial examples (Szegedy et al., 2014), background changes (Xiao et al., 2020), and common corruptions (Hendrycks & Dietterich, 2019). Common corruptions are perturbations that change the appearance of images without changing their semantic content. For instance, neural networks are sensitive to noises (Koziarski & Cyganek, 2017), blurs (Vasiljevic et al., 2016) and lighting condition variations (Temel et al., 2017). Contrary to adversarial examples (Szegedy et al., 2014), common corruptions are not artificial perturbations especially crafted to fool neural networks. They naturally appear in industrial applications without any human interference, and can significantly reduce the performance of neural networks. A neural network is robust to a corruption c when its performance on samples corrupted with c is close to its performance on clean samples. Some methods have recently been proposed to make neural networks more robust to common corruptions (Geirhos et al., 2019; Hendrycks* et al., 2020; Rusak et al., 2020). To determine whether these approaches are effective, a method is needed to measure the robustness of neural networks to common corruptions. The most commonly used method consists in evaluating the performance of neural networks on images distorted by various kinds of common corruptions (Hendrycks & Dietterich, 2019; Karahan et al., 2016; Geirhos et al., 2019; Temel et al., 2017). In this study, we call the group of perturbations used to make the robustness estimation a corruption benchmark. We also use this term to refer to a set of test images that have been corrupted with these various corruptions. We identify two important factors that should be taken into account when building a corruption benchmark: balance and coverage.
In this paper, we consider that a corruption c is covered by a benchmark when increasing the robustness of a network to all the corruptions of this benchmark also increases the robustness of the network to c. For instance, a benchmark that contains a camera shake blur corruption covers the defocus blur corruption, because the robustnesses towards these two corruptions are correlated (Vasiljevic et al., 2016). The coverage of a benchmark is defined as the number of corruptions covered by this benchmark. The wider the range of common corruptions a benchmark covers, the more complete the view it gives of the robustness of a neural network. At the same time, we consider a benchmark as balanced when it gives the same importance to the robustness to every corruption it contains. For instance, according to a balanced benchmark, being robust to noises is as important as being robust to brightness variations. We argue that most existing corruption benchmarks are unbalanced: they give too much importance to the robustness to some corruptions compared to others. The coverage and balance of corruption benchmarks are related to the notion of corruption overlappings. We say that two corruptions overlap when the robustnesses of neural networks towards these corruptions are correlated. The contribution of this paper is fourfold:
1. We propose the first method to estimate to what extent two corruptions overlap.
2. We show that building corruption benchmarks with non-overlapping corruptions makes them more balanced and able to cover a wider range of corruptions.
3. We propose a method to build benchmarks that contain only non-overlapping corruptions.
4. We use this method to build from ImageNet a benchmark of Non-Overlapping Corruptions, called ImageNet-NOC, to estimate the robustness of image classifiers to common corruptions.
We show that ImageNet-NOC is balanced and covers corruptions that are not covered by ImageNet-C, a reference corruption benchmark (Hendrycks & Dietterich, 2019). Neural networks can be confronted with several kinds of out-of-distribution (o.o.d.) samples, such as adversarial examples (Szegedy et al., 2014). Making sure that models are robust to these kinds of o.o.d. samples is essential in terms of security. Artistic renditions (Hendrycks et al., 2020) or sketches (Haohan et al., 2019) can also be useful to determine whether neural networks understand the abstract concepts we want them to learn. Methods to study how classifiers are affected by background changes have also been proposed recently (Beery et al., 2018; Xiao et al., 2020).

2. BACKGROUND AND RELATED WORKS

Another important aspect of the robustness of neural networks to o.o.d. samples is the robustness to common corruptions. This aspect is generally estimated by gathering several commonly encountered corruptions and by testing the performance of neural networks on images corrupted with them. Diverse selections of common corruptions have been proposed to make such robustness estimations (Karahan et al., 2016; Laugros et al., 2019; Geirhos et al., 2019). In particular, ImageNet-C is a popular benchmark used to measure the robustness of ImageNet classifiers (Hendrycks & Dietterich, 2019). Different common corruption benchmarks have also been proposed in the context of object detection (Michaelis et al., 2019), scene classification (Tadros et al., 2019) or eye-tracking (Che et al., 2020). It is worth noting that some transformations in between adversarial attacks and common corruptions have recently been proposed to measure the robustness of image classifiers (Kang et al., 2019; Dunn et al., 2019; Liu et al., 2019).

2.2. CORRUPTION OVERLAPPINGS IN BENCHMARKS

It has been noticed that fine-tuning a model with camera shake blur helps it to deal with defocus blur and conversely (Vasiljevic et al., 2016) . The robustnesses to diverse kinds of noises have also been shown to be closely related (Laugros et al., 2019) . Even for two corruptions that do not look similar to the human eye, increasing the robustness of a model to one of these corruptions, can imply increasing the robustness to the other corruption (Kang et al., 2019) . In general, it has been shown that the robustnesses to the corruptions that distort the high-frequency content of images are correlated (Yin et al., 2019) . In the context of adversarial examples, it is known that the robustness towards one adversarial attack can be correlated with the robustness to another attack (Tramer & Boneh, 2019) . So, it is generally recommended to evaluate the adversarial robustness with attacks that are clearly different from each other (Carlini et al., 2019) . The experiments carried out in this paper suggest that this recommendation should also be followed in the context of common corruption robustness estimation.

3.1. THE CORRUPTION OVERLAPPING SCORE

We consider that two corruptions overlap when the robustness to one of these corruptions is correlated with the robustness to the other. In this section, we propose a methodology to estimate to what extent two corruptions overlap.

The Robustness Score. To determine whether two corruptions overlap, we first need to introduce a metric called the robustness score. This score estimates the robustness of a model m to a corruption c. It is computed with the following formula: $R_c^m = A_c / A_{clean}$, where $A_{clean}$ is the accuracy of m on an uncorrupted test set and $A_c$ is the accuracy of m on the same test set corrupted with c. The higher $R_c^m$, the more robust m is. Please note that using this metric requires monitoring $A_{clean}$ and making sure it is relatively high; otherwise, an untrained model for which $A_c$ equals $A_{clean}$ would be considered robust, for example. In this study, this metric is used only in the methodology we propose to estimate the overlapping between two corruptions.

The Corruption Overlapping Score. We consider two neural networks m1 and m2 and two corruptions c1 and c2. m1 and m2 are identical and trained with exactly the same settings, except that their training sets are augmented with the corruptions c1 and c2 respectively. A standard model is trained the same way but only with non-corrupted samples. We propose a method to measure to what extent c1 and c2 overlap. The idea is to check whether a data augmentation with c1 makes a model more robust to c2, and conversely. To determine this, m1, m2, and a test set are used to compute the following expression:

$(R_{c1}^{m2} - R_{c1}^{standard}) + (R_{c2}^{m1} - R_{c2}^{standard})$ (1)

The first term of (1) measures whether a model that fits exactly c2 is more robust to c1 than the standard model. Symmetrically, the second term measures whether a model that fits exactly c1 is more robust than the standard model to c2.
The more making a model fit c1 implies being more robust to c2 (and reciprocally), the more we can suppose that the robustnesses to c1 and c2 are correlated in practice. In other words, expression (1) gives an estimation of the overlapping between c1 and c2. For convenience, we would like a corruption overlapping score equal to 1 when c1 = c2, and equal to 0 when the robustnesses to c1 and c2 are not correlated at all. We propose a new expression that respects both conditions:

$O_{c1,c2} = \max\left\{0, \frac{1}{2}\left(\frac{R_{c2}^{m1} - R_{c2}^{standard}}{R_{c2}^{m2} - R_{c2}^{standard}} + \frac{R_{c1}^{m2} - R_{c1}^{standard}}{R_{c1}^{m1} - R_{c1}^{standard}}\right)\right\}$ (2)

Expression (2) is a normalized version of (1): it measures the overlapping between two corruptions while respecting the conditions mentioned above. Indeed, if a data augmentation with c1 does not increase the robustness to c2 at all, and conversely, then the ratios in (2) are null or negative, so the whole overlapping score is clipped to zero: when c1 and c2 do not overlap at all, the overlapping score is equal to 0. Besides, when c1 = c2, $R_{c2}^{m1} = R_{c2}^{m2}$ and $R_{c1}^{m2} = R_{c1}^{m1}$, so both ratios of (2) are equal to 1. Then, $O_{c1,c2} = 1$ when c1 and c2 completely overlap.

How to compute an overlapping score. To get the overlapping score between c1 and c2, we follow the method illustrated in Figure 1. This method has six steps and requires a training set, a test set, and three untrained models that share the same architecture (m1, m2 and standard). Step (1) consists in using the corruptions c1 and c2 to get two training sets, each corrupted with one corruption. The obtained corrupted sets are used to train the models m1 and m2 in step (2); the standard model is also trained during this step, but only with non-corrupted samples. In step (3), similarly to step (1), we use c1 and c2 to get two corrupted versions of the test set. The accuracies of the three models on the three test sets are computed in step (4).
The accuracies obtained are used in step (5) to get the robustness scores of each model for the corruptions c1 and c2. These robustness scores are then used to compute the overlapping score between c1 and c2 in step (6).
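As a concrete illustration, the robustness and overlapping scores defined above can be computed as follows. This is a minimal sketch: the function names are ours, and the accuracy and robustness values passed in would come from steps (4) and (5) of the method.

```python
def robustness_score(acc_corrupted: float, acc_clean: float) -> float:
    """Robustness score R_c^m = A_c / A_clean (Section 3.1)."""
    return acc_corrupted / acc_clean


def overlapping_score(r_m1_c1, r_m1_c2, r_m2_c1, r_m2_c2, r_std_c1, r_std_c2):
    """Corruption overlapping score O_{c1,c2} of Equation (2).

    r_mi_cj is the robustness score, on corruption cj, of the model
    augmented with corruption ci; r_std_cj is the standard model's score.
    """
    ratio_c2 = (r_m1_c2 - r_std_c2) / (r_m2_c2 - r_std_c2)
    ratio_c1 = (r_m2_c1 - r_std_c1) / (r_m1_c1 - r_std_c1)
    # Negative or null ratios mean no cross-corruption gain: clip at 0.
    return max(0.0, 0.5 * (ratio_c1 + ratio_c2))
```

For identical corruptions both ratios equal 1, so the score is 1; when augmenting with one corruption brings no gain on the other, both ratios are at most 0 and the score is clipped to 0, matching the two conditions stated above.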

3.2. CORRUPTION OVERLAPPING AND COVERAGE OF BENCHMARKS

With our definition, a corruption c is covered by a benchmark when increasing the robustness of a network to all the corruptions of this benchmark also increases the robustness of the network to c. The wider the range of corruptions a benchmark covers, the stronger the guarantee that being robust to this benchmark provides about the robustness of a neural network. A benchmark should therefore cover as many common corruptions as possible. To illustrate the notion of coverage, let us consider bench1, a benchmark that contains three corruptions of ImageNet-C: Gaussian noise, shot noise and impulse noise (Hendrycks & Dietterich, 2019). We also consider bench2, which contains the Gaussian noise, brightness and elastic corruptions of ImageNet-C. Intuitively, being robust to bench1 only implies being robust to noises, while being robust to bench2 implies being robust to a wider range of corruptions. We can then suppose that bench1 has a lower coverage than bench2. When we compute the overlapping scores of these benchmarks, we observe that the overlapping scores between the corruptions of bench1 are close to 1, while they are close to 0 in bench2 (see Figure 3): the corruptions of bench1 clearly overlap while those of bench2 do not. We argue that overlappings in benchmarks tend to reduce their coverage. Indeed, the more two corruptions c1 and c2 overlap, the more likely it is that a corruption covered by c1 is also covered by c2, and conversely. So, when two corruptions overlap, their ranges of covered corruptions overlap too. By reducing the overlappings in a benchmark, we separate the ranges of corruptions covered by the corruptions of this benchmark, which increases its coverage. In Section 5.2, we show that we can cover the fifteen corruptions of ImageNet-C with only eight non-overlapping corruptions.

3.3. CORRUPTION OVERLAPPING AND BALANCE OF BENCHMARKS

We consider that a benchmark is balanced when it gives the same importance to the robustness to every corruption it contains. For instance, in Section 5.3, we show that the ImageNet-C benchmark gives more importance to the blur corruptions than to the corruptions that affect the brightness of images. Yet, in real-world applications, we think that being robust to different kinds of blurs is not more valuable than being robust to lighting condition variations. Being unbalanced is in general not a desirable property: it makes benchmarks give biased estimations of neural network robustness. Overlappings between the corruptions of a benchmark can make it unbalanced. Let us consider three corruptions c1, c2 and c3, where c1 and c2 completely overlap, and neither c1 nor c2 overlaps at all with c3. A model robust to c3 is robust to one third of the corruptions of the benchmark. But a model robust to c1 is also robust to c2, because c1 and c2 overlap; so being robust to c1 or c2 implies being robust to two thirds of the corruptions of the benchmark. This benchmark therefore rewards the robustness to c1 or c2 more than the robustness to c3: it is unbalanced. In general, if one corruption contributes more to the total overlapping of a benchmark than another corruption of this benchmark, then the benchmark is unbalanced. In Section 5.3, we show that a benchmark built with non-overlapping corruptions is more balanced than ImageNet-C.

4. CONSTRUCTION OF A NON-OVERLAPPING CORRUPTION BENCHMARK: IMAGENET-NOC

Experimental Set-up. For every training of this study, we use the following parameters. The optimizer is SGD with a momentum of 0.9. The cost function is cross-entropy, with a weight decay set to 10^-4. Models are trained for 40 epochs with a batch size of 256. The initial learning rate is set to 0.1 and is divided by 10 at epochs 20 and 30. In all the experiments, we use ImageNet-100: a subset of ImageNet that contains every tenth ImageNet class by WordNet ID order (Deng et al., 2009). All images are resized to the 224x224 format and randomly horizontally flipped with a probability of 0.5 during training. When a training uses a data augmentation with a corruption, half of the images of each training batch are transformed with the corruption, while the other half is left uncorrupted.

We present Algorithm 1: a general method to build benchmarks that do not contain any overlapping corruptions. We argue that this method helps to build balanced benchmarks that have a large coverage.

Algorithm 1: Methodology Proposed to Build a Benchmark of Non-Overlapping Corruptions
Require: S, a set of common corruptions
Require: A train set, a test set and a neural network architecture
Require: An overlapping threshold: the maximum overlapping score allowed in the benchmark
(0) n ← 2, where n is the current number of corruptions in the benchmark.
(1) Use the train set, the test set and the chosen network architecture to apply the methodology presented in Section 3.1, to get all the overlapping scores between the corruptions of S.
(2) Pick the largest subsets of S whose overlapping scores are all under the overlapping threshold.
(3) Among the retained subsets, select the one with the lowest mean overlapping score to form the benchmark.

We want to use this algorithm to build a new benchmark that measures the robustness of image classifiers to common corruptions. Algorithm 1 requires gathering a group of candidate corruptions called S.
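Steps (2) and (3) of Algorithm 1 amount to a search over subsets of S. The following is a minimal exhaustive-search sketch (our own helper, not the authors' code), assuming the overlapping scores of step (1) are stored in a nested mapping:

```python
from itertools import combinations


def build_noc_benchmark(corruptions, overlap, threshold):
    """Steps (2)-(3) of Algorithm 1: among the subsets of the candidate
    corruptions whose pairwise overlapping scores are all below
    `threshold`, keep the largest ones, then pick the one with the lowest
    mean pairwise overlapping score.  `overlap[a][b]` is O_{a,b}.

    Exhaustive search, only practical for a few dozen candidates.
    """
    best, best_mean = (), float("inf")
    for k in range(len(corruptions), 1, -1):  # largest subsets first
        for subset in combinations(corruptions, k):
            scores = [overlap[a][b] for a, b in combinations(subset, 2)]
            if all(s < threshold for s in scores):
                mean = sum(scores) / len(scores)
                if mean < best_mean:
                    best, best_mean = subset, mean
        if best:  # the largest admissible size has been found: stop
            break
    return list(best)
```

With a symmetric score table where only the pair (a, b) overlaps strongly, the function drops one of the two overlapping corruptions and keeps the rest.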
The larger S is, the more combinations of non-overlapping corruptions can be found in S, and the larger the benchmarks built by the algorithm. We therefore recommend using a large initial set of corruptions to increase the coverage of the built benchmarks. For this study, we implemented two dozen image corruptions (illustrated in Figure 2) to constitute S. All these corruptions are associated with a severity range. A value is randomly chosen from the severity range of the considered corruption each time an image is corrupted; the higher this value, the more the aspect of the corrupted image changes. More information about the modeled common corruptions can be found in Appendix A. We apply Algorithm 1, using this set of corruptions, the ImageNet train and test sets, and the ResNet-18 architecture, with different values of the overlapping threshold. The corruption benchmarks obtained for different values of the threshold are shown in Appendix B. The higher the overlapping threshold, the more corruptions are included in the constructed benchmarks, and so the larger their coverage. However, the coverage gain due to increasing the threshold is reduced by overlappings, because overlapping corruptions tend to cover the same kinds of corruptions. Besides, as explained in Section 3.3, the more overlappings there are in a benchmark, the more likely it is to be unbalanced. All in all, selecting the overlapping threshold determines when the coverage gain is no longer worth the balance loss. Choosing this value depends on the application case and the kind of robustness estimation we want to make. Each benchmark obtained with Algorithm 1 that contains n corruptions is the n-corruption benchmark with the lowest possible mean overlapping. As explained in Sections 3.2 and 3.3, we expect benchmarks that are optimal in terms of overlapping to have a good balance and coverage.
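The augmentation used throughout this section (half of each batch corrupted, with a severity drawn at random per image) can be sketched as follows. This is an illustrative NumPy version; `corrupt` stands for any of the implemented corruption functions and is an assumption of this sketch:

```python
import numpy as np


def augment_batch(batch, corrupt, severity_range, rng=np.random.default_rng(0)):
    """Data augmentation of the experimental set-up: half of the images
    of each batch are corrupted, the other half is left clean, and the
    severity is drawn uniformly from the corruption's severity range
    each time an image is corrupted.

    `corrupt(image, severity)` is any image corruption function.
    """
    out = batch.copy()
    n = len(batch)
    for i in rng.choice(n, size=n // 2, replace=False):  # images to corrupt
        severity = rng.uniform(*severity_range)
        out[i] = corrupt(out[i], severity)
    return out
```

The clean half of the batch keeps the model accurate on uncorrupted samples, while the corrupted half drives the robustness gain measured by the overlapping scores.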
We propose to study the set of eight corruptions obtained with an overlapping threshold of 0.1, which is: rain, quantization, shear, brightness, hue, vertical artifacts, blur and border. The overlapping scores between these corruptions are displayed in the lower right square of Figure 3. Eight corrupted ImageNet validation sets, each corrupted with one of the eight corruptions, are gathered to form ImageNet-NOC. We run the algorithm a second time using a DenseNet-121 architecture instead of ResNet-18. The overlapping scores obtained by running step (1) of Algorithm 1 are displayed in Appendix A. The benchmark obtained with an overlapping threshold of 0.1 is: rain, Gaussian noise, shear, brightness, hue, vertical artifacts, elastic. This benchmark shares six corruptions with ImageNet-NOC; moreover, the overlapping score computed between Gaussian noise and quantization equals 0.6, and the one computed between elastic and blur equals 0.26. So, the only two corruptions that are not shared by the two benchmarks appear to be correlated in terms of robustness. Using a DenseNet-121 architecture thus makes the algorithm build a benchmark that is fairly similar to the one obtained using ResNet-18. Running Algorithm 1 requires completing one training for each corruption in S. It took one week with a single Nvidia Tesla V100 GPU to get all the overlapping scores of Figure 3. While this computational cost is high, we think the process could be accelerated by fine-tuning models for a few epochs instead of training them from scratch. Further investigations should be conducted to determine to what extent this alternative would modify the obtained results.

How to Use ImageNet-NOC. We recommend using the CE metric (Hendrycks & Dietterich, 2019) to measure the robustness of an image classifier to an ImageNet-NOC corruption.
Using the ResNet-18 model and the ImageNet training and validation sets, we apply the method illustrated in Figure 1 to get the overlapping score of every couple of ImageNet-C corruptions. The corruption severity is randomly selected for each image corrupted in the process. The obtained scores are displayed in the upper left square of Figure 3. We observe that all the corruptions that damage the textures in images (blurs, noises, pixelate and jpeg compression) significantly overlap. This result is consistent with the experiments carried out by Yin et al. (2019): they argue that the robustnesses of neural networks to corruptions that alter the high-frequency information of images are correlated. Concerning the corruptions that alter the low-frequency information of images, the overlappings are less pronounced, but we do observe some significant ones: there is a clear overlapping between fog and contrast, and between snow and frost. As explained in Sections 3.2 and 3.3, all these overlappings suggest that ImageNet-C is unbalanced and has a poor coverage. Figure 3 reveals that ImageNet-NOC contains far fewer overlapping corruptions than ImageNet-C. We compute again the overlapping scores of ImageNet-C and ImageNet-NOC with the DenseNet-121 (Huang et al., 2017) and WideResNet-50-2 (Zagoruyko & Komodakis, 2016) architectures. The aspect of the overlapping arrays obtained with these two architectures is the same as the one obtained with ResNet-18 (see Appendix D). For traditionally used image classifiers, it appears that overlapping scores do not vary much with the architecture of the model used to compute them.

5.2. COVERAGE OF IMAGENET-NOC AND IMAGENET-C

The lower left square of Figure 3 displays all the overlapping scores computed between one ImageNet-C corruption and one ImageNet-NOC corruption. We observe that for every ImageNet-C corruption c1, there is always at least one ImageNet-NOC corruption c2 for which the overlapping score computed with c1 and c2 is higher than 0.3. On the other hand, two ImageNet-NOC corruptions (hue and border) do not overlap at all with any of the ImageNet-C corruptions. Then, increasing the robustness to all the ImageNet-NOC corruptions should imply being more robust to all the ImageNet-C corruptions, but being robust to ImageNet-C may not imply being robust to some ImageNet-NOC corruptions. To confirm this, we train two ResNet-18 called m_INOC and m_IC. A data augmentation procedure with all the ImageNet-C corruptions is used to train m_IC: each corrupted image of this training is modified by one randomly selected corruption of ImageNet-C, with a randomly chosen severity. Similarly, m_INOC is trained with a data augmentation procedure using all the ImageNet-NOC corruptions. After the trainings, we measure the robustness of m_INOC to every corruption of ImageNet-C by computing its CE scores. We also measure the CE scores of m_IC towards the ImageNet-NOC corruptions. We compare the obtained scores of both models with those of the standard model in Table 1; the first column of this table contains the error rates on non-corrupted ImageNet samples. We observe in Table 1 that m_INOC is more robust than the standard model to all the ImageNet-C corruptions. However, m_IC is not robust to the hue and border corruptions of ImageNet-NOC. These results appear to confirm the hypothesis made by studying the overlapping scores in Figure 3: ImageNet-C does not cover some of the ImageNet-NOC corruptions, while ImageNet-NOC covers the ImageNet-C corruptions.
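The coverage criterion used here (at least one cross-benchmark overlapping score above 0.3) can be checked mechanically. A small sketch with made-up scores, not the values of Figure 3:

```python
def mutual_coverage(overlap_rows, threshold=0.3):
    """For each corruption of benchmark A, flag whether some corruption of
    benchmark B overlaps with it above `threshold` (the criterion used in
    Section 5.2).  `overlap_rows[a]` maps each B corruption to O_{a,b}."""
    return {a: max(row.values()) >= threshold for a, row in overlap_rows.items()}


# Illustrative (made-up) scores: "fog" is covered by "brightness",
# while "hue" overlaps with nothing in the other benchmark.
rows = {"fog": {"brightness": 0.42, "rain": 0.08},
        "hue": {"brightness": 0.05, "rain": 0.02}}
```

Running `mutual_coverage(rows)` on these scores flags `fog` as covered and `hue` as uncovered, mirroring the reading of the lower left square of Figure 3.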
As argued in Section 3.2, it seems that using non-overlapping corruptions helps to build benchmarks that have a larger coverage.

5.3. BALANCE OF IMAGENET-NOC AND IMAGENET-C

We carry out an experiment to compare the balance of ImageNet-NOC and ImageNet-C. We train one ResNet-18 for each corruption of ImageNet-NOC and ImageNet-C: fifteen models are trained using a data augmentation procedure with one corruption of ImageNet-C, and eight others are trained with one corruption of ImageNet-NOC. Then, we estimate the robustness of the first fifteen ResNet-18 on ImageNet-C by computing their mCE scores. We also get the mCE scores of the eight remaining ResNet-18 on ImageNet-NOC. The obtained scores are displayed in Tables 2.a and 2.b; the CE scores computed to get the mCE scores can be found in Appendix E. The mCE scores in Table 2.a are very different from each other. For instance, the mCE score of the model trained with defocus blur is much lower than that of the model trained with brightness. Then, according to the robustness estimation made with ImageNet-C, one of the models is much more robust than the other. In other words, the ImageNet-C benchmark gives more importance to the robustness to defocus blur than to the robustness to brightness: ImageNet-C is unbalanced. We observe far less variation in the mCE scores of Table 2.b than in those of Table 2.a. More precisely, the differences between the lowest and highest mCE in Tables 2.a and 2.b are respectively 40 and 11, and the standard deviations of the mCE scores in these tables are respectively 12.1 and 3.7. The importance given by ImageNet-C to the robustness to its corruptions thus varies a lot with the considered corruption; this variation is much less pronounced for our benchmark. ImageNet-NOC is therefore significantly more balanced than ImageNet-C.
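The two balance indicators used above, the spread and the standard deviation of the per-corruption mCE scores, can be computed as follows. The sample scores are hypothetical, not the paper's values, and whether the population or sample standard deviation is used is not specified in the text (the sketch uses the population variant):

```python
import statistics


def balance_stats(mce_scores):
    """Balance indicators of Section 5.3: the spread (max - min) and the
    standard deviation of the mCE scores of models each augmented with one
    benchmark corruption.  Large values indicate an unbalanced benchmark."""
    return max(mce_scores) - min(mce_scores), statistics.pstdev(mce_scores)
```

For example, a benchmark whose augmented models score between 60 and 100 mCE has a spread of 40, while a benchmark whose models all score within a few points of each other is the more balanced one.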

5.4. ROBUSTNESS ESTIMATIONS USING IMAGENET-NOC AND IMAGENET-C

The experiments carried out in Sections 5.2 and 5.3 suggest that ImageNet-NOC has a better balance and coverage than ImageNet-C. To determine whether using ImageNet-NOC instead of ImageNet-C makes a difference in practice, we propose to compare the performance of various models on the two benchmarks. First, we measure the mCE scores of several pretrained torchvision classifiers

A PRESENTATION OF THE MODELED COMMON CORRUPTIONS

The common corruptions gathered in Section 4 are implemented with computationally cheap image transformations that can easily be added to an image pipeline. We present these corruptions in Figures 5 and 6. In Section 4, by completing step (1) of Algorithm 1, we compute the overlapping scores between the corruptions gathered in Figure 2; these overlapping scores are displayed in Figure 4. We provide in Tables 7 and 8 the CE scores computed and averaged to get the mCE scores of Tables 2.a and 2.b. The first column of these tables contains the error rate on the non-corrupted ImageNet validation set.

Table 8: CE scores computed with models trained with a data augmentation with one corruption of ImageNet-NOC. Each line refers to one model trained with one corruption of ImageNet-NOC and each column refers to one corruption of ImageNet-NOC.



Figure 6: Presentation of half of the group of common corruptions displayed in Figure 2. The other half is presented in Figure 5.



Figure 1: Methodology used to compute the overlapping score between two corruptions c1 and c2.

Figure 2: Illustrations of the common corruptions gathered to run the Algorithm 1.

Figure 3: The overlapping scores between all the ImageNet-NOC and ImageNet-C corruptions.

Figures 5 and 6 also provide information about how the common corruptions used in this study are modeled. These corruptions are implemented for images that have pixel values in [0, 1]. The Severity Range column of these figures specifies how the severity of each corruption is set. The lower bound of each severity range is selected to get a robustness score of 0.95 with the standard ResNet-18 model, tested on the ImageNet validation set altered with the considered corruption. The upper bound is selected to get a robustness score of 0.5 in the same conditions. The upper bounds of the hue and gray scale corruptions are different, because these corruptions are not harmful enough to reach a robustness score of 0.5: they respectively reach robustness scores of 0.63 and 0.70 with the standard ResNet-18. The last column of Figures 5 and 6 corresponds to the error rate of the torchvision pretrained AlexNet on the ImageNet validation set corrupted with the corruption indicated in the first column.
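The calibration of the severity bounds described above can be sketched as a bisection on the severity value. Here `robustness_at` is a hypothetical helper that retests the standard model at a given severity (not the authors' code), and the robustness score is assumed to decrease monotonically as severity grows:

```python
def calibrate_severity(robustness_at, target, lo, hi, tol=0.01):
    """Find a severity whose robustness score is close to `target`
    (e.g. 0.95 for the lower bound, 0.5 for the upper bound), by bisection.

    `robustness_at(s)` evaluates the standard model on the validation set
    corrupted at severity s and returns R, assumed decreasing in s.
    """
    while hi - lo > 1e-6:
        mid = (lo + hi) / 2
        r = robustness_at(mid)
        if abs(r - target) <= tol:
            return mid
        if r > target:   # corruption still too mild: increase severity
            lo = mid
        else:            # too harsh: decrease severity
            hi = mid
    return (lo + hi) / 2
```

In practice each call to `robustness_at` is a full evaluation pass, so only a handful of bisection steps are affordable; the tolerance `tol` controls how precisely the 0.95 and 0.5 targets are matched.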

Figure 4: Overlapping scores between all the common corruptions displayed in Figure 2. The scores have been computed with the ResNet-18 and DenseNet-121 architectures.

The mean CE score computed with a benchmark is called the mCE. Using the mCE metric avoids several pitfalls while measuring the robustness of neural networks. To compute a CE score, it is required to get the error rate of a pretrained AlexNet model on the ImageNet validation set corrupted with the considered corruption. The error rates of the torchvision pretrained AlexNet computed with the corruptions introduced in this paper are displayed in Appendix A. We provide the ImageNet-NOC CE scores of some traditionally used ImageNet classifiers in Appendix C. More details about how to use ImageNet-NOC can be found by visiting [Link available upon acceptance].

5. COMPARISON BETWEEN IMAGENET-NOC AND IMAGENET-C

5.1. CORRUPTION OVERLAPPINGS IN IMAGENET-C AND IMAGENET-NOC

ImageNet-C is a benchmark commonly used to measure the robustness of ImageNet classifiers to common corruptions (Hendrycks & Dietterich, 2019). It is built on fifteen common corruptions called Gaussian noise, shot noise, impulse noise, defocus blur, glass blur, motion blur, zoom blur, snow, frost, fog, brightness, contrast, elastic, pixelate, and jpeg compression. Each corruption is associated with five severity levels that determine to what extent the corrupted images are distorted. Please note that, in general, the benchmark corruptions should not be used during the training of a model: corruption benchmarks are built to estimate the robustness to unforeseen corruptions. In this study, we use the ImageNet-C and ImageNet-NOC corruptions during some training phases only because we analyze the benchmarks themselves.
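The CE and mCE computations described above can be sketched as follows, assuming the per-corruption error rates have already been measured. The error-rate values are illustrative, and the severity-wise sum of the original CE definition is omitted here since, in this study, severities are sampled randomly per image:

```python
def ce_score(err_model: float, err_alexnet: float) -> float:
    """Corruption Error (CE): the model's error rate on the corrupted set,
    normalized by the error rate of a pretrained AlexNet on the same set
    (Hendrycks & Dietterich, 2019).  Reported in percent."""
    return 100.0 * err_model / err_alexnet


def mce_score(errs_model, errs_alexnet) -> float:
    """mCE: the mean CE over all the corruptions of a benchmark."""
    ces = [ce_score(m, a) for m, a in zip(errs_model, errs_alexnet)]
    return sum(ces) / len(ces)
```

Normalizing by AlexNet keeps corruptions of very different difficulty on a comparable scale: a CE of 100 means "as fragile as AlexNet" on that corruption, and lower is better.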

Table 7: CE scores computed with models trained with a data augmentation with one corruption of ImageNet-C. Each line refers to one model trained with one corruption of ImageNet-C and each column refers to one corruption of ImageNet-C.


Under review as a conference paper at ICLR 2021

The second experiment is carried out using four ResNet-50 that have been shown to be robust to ImageNet-C. These models are called SIN+IN (Geirhos et al., 2019), ANT3x3 (Rusak et al., 2020), Augmix (Hendrycks* et al., 2020) and DeepAugment (Hendrycks et al., 2020). We measure the robustness of these models to ImageNet-NOC by computing their CE scores. Then, we compare these scores with the ones obtained with the torchvision pretrained ResNet-50 in Table 4. We observe that the SIN+IN, ANT3x3 and DeepAugment models are not robust to border. Interestingly, we show in Section 5.2 that this corruption is not covered by ImageNet-C. This result confirms that ImageNet-NOC can reveal a low robustness of models to corruptions that are not covered by ImageNet-C. The ImageNet-NOC and ImageNet-C mCE scores of the five considered models are compared in the two last columns of Table 4. We observe that the robustness ranking established with ImageNet-C is different from the one established with ImageNet-NOC. For instance, ANT3x3 is very robust to ImageNet-C but not robust to ImageNet-NOC. We think this is a direct consequence of the lack of balance of ImageNet-C. Indeed, we provide evidence in Section 5.3 that ImageNet-C gives a lot of importance to the noise robustness, and ANT3x3 has been shown to be particularly robust to noises (Rusak et al., 2020). So ANT3x3 is considered very robust to ImageNet-C, but not to ImageNet-NOC, which is more balanced. We note that Augmix obtains a relatively low ImageNet-NOC mCE compared to its ImageNet-C one. This result should be considered cautiously, because shears and quantizations are used by the Augmix data augmentation procedure, and these corruptions overlap with some of the ImageNet-NOC corruptions.
This is one reason why the Augmix model obtains low shear and quantization CE scores. The experiments carried out in this section show that the robustness estimations made with ImageNet-NOC are different from the ones made with ImageNet-C. We think that ImageNet-NOC should be preferred to ImageNet-C because of its coverage and balance.

6. CONCLUSION

We proposed a metric called the corruption overlapping score, which measures to what extent the robustnesses towards two corruptions are correlated. We showed that overlappings between the corruptions of a benchmark can reduce its coverage and make it unbalanced. We provided a benchmark of Non-Overlapping Corruptions called ImageNet-NOC to measure the robustness of image classifiers. We showed that ImageNet-NOC is balanced and covers several kinds of common corruptions that are not covered by ImageNet-C. ImageNet-NOC is built with the method we proposed to construct non-overlapping corruption benchmarks. This method can easily be adapted to other computer vision tasks. We hope it will be used to build other non-overlapping corruption benchmarks that will help to make better estimations of the robustness of neural networks.

