INCREASING THE COVERAGE AND BALANCE OF ROBUSTNESS BENCHMARKS BY USING NON-OVERLAPPING CORRUPTIONS

Anonymous

Abstract

Neural networks are sensitive to various corruptions that commonly occur in real-world applications, such as blur, noise, and low-lighting conditions. To estimate the robustness of neural networks to these common corruptions, we generally use a group of modeled corruptions gathered into a benchmark. We argue that corruption benchmarks often have poor coverage: being robust to them only implies being robust to a narrow range of corruptions. They are also often unbalanced: they give too much importance to some corruptions compared to others. In this paper, we propose to build corruption benchmarks with only non-overlapping corruptions, in order to improve their coverage and their balance. Two corruptions overlap when the robustnesses of neural networks to these corruptions are correlated. We propose the first metric to measure the overlap between two corruptions. We provide an algorithm that uses this metric to build benchmarks of Non-Overlapping Corruptions. Using this algorithm, we build from ImageNet a new corruption benchmark called ImageNet-NOC. We show that ImageNet-NOC is balanced and covers several kinds of corruptions that are not covered by ImageNet-C.

1. INTRODUCTION

Neural networks perform poorly when they deal with images drawn from a distribution different from that of their training samples. Indeed, neural networks are sensitive to adversarial examples (Szegedy et al., 2014), background changes (Xiao et al., 2020), and common corruptions (Hendrycks & Dietterich, 2019). Common corruptions are perturbations that change the appearance of images without changing their semantic content. For instance, neural networks are sensitive to noise (Koziarski & Cyganek, 2017), blur (Vasiljevic et al., 2016), and lighting condition variations (Temel et al., 2017). Contrary to adversarial examples (Szegedy et al., 2014), common corruptions are not artificial perturbations especially crafted to fool neural networks. They naturally appear in industrial applications without any human intervention, and can significantly reduce the performance of neural networks. A neural network is robust to a corruption c when its performance on samples corrupted with c is close to its performance on clean samples. Several methods have recently been proposed to make neural networks more robust to common corruptions (Geirhos et al., 2019; Hendrycks* et al., 2020; Rusak et al., 2020). To determine whether these approaches are effective, a method is required to measure the robustness of neural networks to common corruptions. The most commonly used method consists in evaluating the performance of neural networks on images distorted by various kinds of common corruptions (Hendrycks & Dietterich, 2019; Karahan et al., 2016; Geirhos et al., 2019; Temel et al., 2017). In this study, we call the group of perturbations used to make this robustness estimation a corruption benchmark. We also use this term to refer to a set of test images that have been corrupted with these various corruptions. We identify two important factors that should be taken into account when building a corruption benchmark: balance and coverage.
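The robustness definition above (performance on corrupted samples close to performance on clean samples) can be sketched as a simple score. This is only an illustrative sketch, not the paper's evaluation protocol; the function names `robustness_to_corruption` and `accuracy` are ours, and the model is assumed to be any callable returning class logits.

```python
import numpy as np

def accuracy(model, images, labels):
    """Top-1 accuracy of `model` on a batch of images."""
    preds = model(images).argmax(axis=1)
    return float((preds == labels).mean())

def robustness_to_corruption(model, images, labels, corrupt):
    """Ratio of corrupted-data accuracy to clean-data accuracy.

    A value close to 1 means the model's performance on samples
    corrupted with `corrupt` is close to its clean performance,
    i.e. the model is robust to this corruption. Assumes the
    clean accuracy is non-zero.
    """
    clean_acc = accuracy(model, images, labels)
    corrupted_acc = accuracy(model, corrupt(images), labels)
    return corrupted_acc / clean_acc
```

Other normalizations are possible (ImageNet-C, for instance, reports errors normalized by a baseline model's errors); the ratio above is just one simple choice.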
In this paper, we consider that a corruption c is covered by a benchmark when increasing the robustness of a network to all the corruptions of this benchmark also increases the robustness of the network to c. For instance, a benchmark that contains a camera shake blur corruption covers the defocus blur corruption, because the robustnesses towards these two corruptions are correlated (Vasiljevic et al., 2016). The coverage of a benchmark is defined as the number of corruptions covered by this benchmark. The wider the range of common corruptions a benchmark covers, the more complete a view it gives of the robustness of a neural network. At the same time, we consider a benchmark balanced when it gives the same importance to the robustness to every corruption it contains. For instance, according to a balanced benchmark, being robust to noise is as important as being robust to brightness variations. We argue that most existing corruption benchmarks are unbalanced: they give too much importance to the robustness to some corruptions compared to others. The coverage and balance of corruption benchmarks are related to the notion of corruption overlappings. We say that two corruptions overlap when the robustnesses of neural networks towards these corruptions are correlated. The contribution of this paper is fourfold:
1. We propose the first method to estimate to what extent two corruptions overlap.
2. We show that building corruption benchmarks with non-overlapping corruptions makes them more balanced and able to cover a wider range of corruptions.
3. We propose a method to build benchmarks that contain only non-overlapping corruptions.
4. We use this method to build, from ImageNet, a benchmark of Non-Overlapping Corruptions called ImageNet-NOC, to estimate the robustness of image classifiers to common corruptions.
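Since two corruptions are said to overlap when the robustnesses of neural networks towards them are correlated, one minimal way to quantify this is to correlate, over a pool of trained models, each model's robustness score to corruption A with its score to corruption B. The paper's actual overlap metric is introduced later; this Pearson-correlation sketch (the name `overlap_score` is ours) is only meant to make the definition concrete.

```python
import numpy as np

def overlap_score(robustness_a, robustness_b):
    """Pearson correlation between the robustnesses of a pool of
    models to corruption A and to corruption B.

    `robustness_a[i]` and `robustness_b[i]` are the robustness
    scores of model i to each corruption. A score near 1 suggests
    the two corruptions overlap; a score near 0 suggests they are
    non-overlapping.
    """
    a = np.asarray(robustness_a, dtype=float)
    b = np.asarray(robustness_b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])
```

For example, camera shake blur and defocus blur would be expected to yield a high `overlap_score`, while a blur and a brightness change would not.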
We show that ImageNet-NOC is balanced and covers corruptions that are not covered by ImageNet-C, a reference corruption benchmark (Hendrycks & Dietterich, 2019). Benchmarks built from out-of-distribution samples (Haohan et al., 2019) can also be useful to determine whether neural networks understand the abstract concepts we want them to learn. Methods to study how classifiers are affected by background changes have also been recently proposed (Beery et al., 2018; Xiao et al., 2020).

2. BACKGROUND AND RELATED WORKS

Another important aspect of the robustness of neural networks to o.o.d. samples is the robustness to common corruptions. This aspect of robustness is generally estimated by gathering several commonly encountered corruptions and testing the performance of neural networks on images corrupted with them. Diverse selections of common corruptions have been proposed for this purpose (Karahan et al., 2016; Laugros et al., 2019; Geirhos et al., 2019). In particular, ImageNet-C is a popular benchmark used to measure the robustness of ImageNet classifiers (Hendrycks & Dietterich, 2019). Common corruption benchmarks have also been proposed in the context of object detection (Michaelis et al., 2019), scene classification (Tadros et al., 2019), or eye-tracking (Che et al., 2020). It is worth noting that some transformations in between adversarial attacks and common corruptions have recently been proposed to measure the robustness of image classifiers (Kang et al., 2019; Dunn et al., 2019; Liu et al., 2019).
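The evaluation procedure described above (testing a model on images corrupted with each corruption in the benchmark) can be sketched as a simple loop. This is a generic sketch, not the ImageNet-C protocol, which additionally averages over severity levels and normalizes errors against a baseline model; the function name `benchmark_accuracy` is ours.

```python
import numpy as np

def benchmark_accuracy(model, images, labels, corruptions):
    """Mean top-1 accuracy of `model` over a corruption benchmark.

    Every test image is corrupted with each corruption in turn,
    and the per-corruption accuracies are averaged. `corruptions`
    is a list of callables mapping a batch of images to a batch
    of corrupted images.
    """
    accs = []
    for corrupt in corruptions:
        preds = model(corrupt(images)).argmax(axis=1)
        accs.append(float((preds == labels).mean()))
    return float(np.mean(accs))
```

Note that the plain mean above gives every corruption equal weight, which is exactly where the balance issue discussed in this paper arises: if several corruptions in the list overlap, the family they belong to is implicitly weighted more heavily than the others.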

2.2. CORRUPTION OVERLAPPINGS IN BENCHMARKS

It has been noticed that fine-tuning a model with camera shake blur helps it to deal with defocus blur, and conversely (Vasiljevic et al., 2016). The robustnesses to diverse kinds of noise have also been shown to be closely related (Laugros et al., 2019). Even for two corruptions that do not look similar to the human eye, increasing the robustness of a model to one of these corruptions can

