

Abstract

Adversarial robustness of machine learning models has attracted considerable attention in recent years. Adversarial attacks undermine the reliability of and trust in machine learning models, but constructing more robust models hinges on a rigorous understanding of adversarial robustness as a property of a given model. Point-wise measures for specific threat models are currently the most popular tool for comparing the robustness of classifiers and are used in most recent publications on adversarial robustness. In this work, we use robustness curves to show that point-wise measures fail to capture important global properties that are essential to reliably compare the robustness of different classifiers. We introduce new ways in which robustness curves can be used to systematically uncover these properties and provide concrete recommendations for researchers and practitioners when assessing and comparing the robustness of trained models. Furthermore, we characterize scale as a way to distinguish small and large perturbations, and relate it to inherent properties of data sets, demonstrating that robustness thresholds must be chosen accordingly. We hope that our work contributes to a shift of focus away from point-wise measures of robustness and towards a discussion of the question of what kind of robustness could and should reasonably be expected. We release code to reproduce all experiments presented in this paper, which includes a Python module to calculate robustness curves for arbitrary data sets and classifiers, supporting a number of frameworks, including TensorFlow, PyTorch and JAX.

1. Introduction

Despite their astonishing success in a wide range of classification tasks, deep neural networks can be led to incorrectly classify inputs altered with specially crafted adversarial perturbations (Szegedy et al. 2014; Goodfellow et al. 2015). These perturbations can be so small that they remain almost imperceptible to human observers (J. P. Göpfert et al. 2020). Adversarial robustness describes a model's ability to behave correctly under such small perturbations crafted with the intent to mislead the model. The study of adversarial robustness, with its definitions, their implications, attacks, and defenses, has attracted considerable research interest. This is due both to the practical importance of trustworthy models and to the intellectual interest in the differences between the decisions of machine learning models and our human perception. A crucial starting point for any such analysis is the definition of what exactly constitutes a small input perturbation, requiring (a) the choice of a distance function to measure perturbation size, and (b) the choice of a particular scale to distinguish small and large perturbations. Together, these two choices determine a threat model that defines exactly under which perturbations a model is required to be robust. The most popular choice of distance function is the class of distances induced by ℓp norms (Szegedy et al. 2014; Goodfellow et al. 2015; Carlini, Athalye, et al. 2019), in particular ℓ1, ℓ2 and ℓ∞, although other choices such as the Wasserstein distance have been explored as well (Wong, Schmidt, et al. 2019). Regarding scale, the current default is to pick some perturbation threshold ε without providing concrete reasons for the exact choice. Analysis then focuses on the robust error of the model, i.e., the proportion of test inputs for which the model behaves incorrectly under some perturbation of size at most ε.
This means that the scale is defined as a binary distinction between small and large perturbations based on the perturbation threshold. A set of canonical thresholds has emerged in the literature. For example, in the publications referenced in this section, the MNIST data set is typically evaluated at a perturbation threshold ε ∈ {0.1, 0.3} for the ℓ∞ norm, while CIFAR-10 is evaluated at ε ∈ {2/255, 4/255, 8/255}, stemming from the 8-bit encoding of the color channels used to represent images. Based on these established threat models, researchers have developed specialized methods to minimize the robust error during training, which results in more robust models. Popular approaches include specific data augmentation, sometimes used under the umbrella term adversarial training (Guo et al. 2017; Madry et al. 2018; Carmon et al. 2019; Hendrycks et al. 2019), training under regularization that encourages large margins and smooth decision boundaries in the learned model (Hein and Andriushchenko 2017; Wong and Kolter 2018; Croce, Andriushchenko, and Hein 2019; Croce and Hein 2020), and post-hoc processing or randomized smoothing of predictions in a learned model (Lecuyer et al. 2019; Cohen et al. 2019). In order to show the superiority of a new method, robust accuracies of differently trained models are typically compared for a handful of threat models and data sets, e.g., ℓ∞ (ε = 0.1) and ℓ2 (ε = 0.3) for MNIST. Out of 22 publications on adversarial robustness published at NeurIPS 2019, ICLR 2020, and ICML 2020, 12 contain results for only a single perturbation threshold. In five publications, robust errors are calculated for at least two different perturbation thresholds, but still only a small, arbitrarily chosen set of thresholds is considered. Only in five of the 22 publications do we find extensive consideration of different perturbation thresholds and the respective robust errors.
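To make the notion concrete: once a minimal adversarial distance has been computed for each test point (by some attack), the robust error at a fixed threshold ε is simply the fraction of distances at most ε. The following is a minimal sketch, not the interface of our released module; the function name robust_error is a hypothetical illustration.

```python
import numpy as np

def robust_error(min_perturbation_dists, eps):
    """Fraction of test points misclassified under some perturbation of
    size at most eps. A distance of 0 marks points that are misclassified
    even without any perturbation."""
    d = np.asarray(min_perturbation_dists, dtype=float)
    return float(np.mean(d <= eps))

# Hypothetical minimal adversarial distances of five test points
# (e.g. measured in the l-infinity norm).
dists = [0.0, 0.05, 0.12, 0.29, 0.4]
robust_error(dists, eps=0.1)  # -> 0.4 (two of five points fall within eps)
```

Note that this point-wise statistic discards the full distribution of distances; two classifiers with identical robust error at ε = 0.1 can behave very differently at nearby thresholds.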
Out of these five, three are analyses of randomized smoothing, which naturally gives rise to certification radii (B. Li et al. 2019; Carmon et al. 2019; Pinot et al. 2019). Najafi et al. (2019) follow a learning-theoretical motivation, which results in an error bound as a function of the perturbation threshold. Only Maini et al. (2020) do not rely on randomization and still provide a complete, empirical analysis of robust error for varying perturbation thresholds 1 .

Our contributions: In this work, we demonstrate that point-wise measures of ℓp robustness are not sufficient to reliably and meaningfully compare the robustness of different classifiers. We show that, both in theory and in practice, results of model comparisons based on point-wise measures may fail to generalize to threat models with even slightly larger or smaller ε, and that robustness curves avoid this pitfall by design. Furthermore, we show that point-wise measures are insufficient to meaningfully compare the efficacy of different defense techniques when distance functions are varied, and that robustness curves, again, are able to reliably detect and visualize this property. Finally, we analyze how scale depends on the underlying data space, choice of distance function, and distribution. Based on our findings, we suggest that robustness curves should become the standard tool when comparing the adversarial robustness of classifiers, and that the perturbation threshold of threat models should be selected carefully in order to be meaningful, considering inherent characteristics of the data set. We release code to reproduce all experiments presented in this paper 2 , which includes a Python module with an easily accessible interface (similar to Foolbox, Rauber et al. (2017)) to calculate robustness curves for arbitrary data sets and classifiers. The module supports classifiers written in most of the popular machine learning frameworks, such as TensorFlow, PyTorch and JAX.
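Assuming, as above, that a minimal adversarial distance is available for every test point, a robustness curve is simply the empirical cumulative distribution function of these distances: it reports the robust error at every threshold ε simultaneously rather than at a single hand-picked value. A minimal sketch follows; the name robustness_curve is illustrative and not the API of the released module.

```python
import numpy as np

def robustness_curve(min_perturbation_dists):
    """Empirical CDF of minimal adversarial distances: returns the
    thresholds (the sorted distances) together with the robust error
    reached at each threshold."""
    d = np.sort(np.asarray(min_perturbation_dists, dtype=float))
    robust_err = np.arange(1, d.size + 1) / d.size
    return d, robust_err

eps, err = robustness_curve([0.4, 0.0, 0.12, 0.05, 0.29])
# eps = [0.0, 0.05, 0.12, 0.29, 0.4]
# err = [0.2, 0.4, 0.6, 0.8, 1.0]
```

Plotting err against eps yields the full curve, from which the robust error at any single threshold can be read off, so no information is lost by reporting the curve instead of a point-wise measure.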

2. Methods

An adversarial perturbation for a classifier f and input-output pair (x, y) is a small perturbation δ with f(x + δ) ≠ y. Because the perturbation δ is small, it is assumed that the label y would still be the correct prediction for x + δ. The resulting point x + δ is called an adversarial example. The points vulnerable to adversarial perturbations are those that are either already misclassified when unperturbed, or that lie close to a decision boundary. One tool to visualize and study the robustness behavior of a classifier is the robustness curve, first used by Wong and Kolter (2018) and later formalized by C. Göpfert et al. (2020). A robustness curve



1 Single thresholds: (Mao et al. 2019; Tramer and Boneh 2019; Alayrac et al. 2019; Brendel et al. 2019; Qin et al. 2019; Wang et al. 2020; Song et al. 2020; Croce and Hein 2020; Xie and Yuille 2020; Rice et al. 2020; Zhang et al. 2020; Singla and Feizi 2020); multiple thresholds: (Lee et al. 2019; Mahloujifar et al. 2019; Hendrycks et al. 2019; Wong, Rice, et al. 2020; Boopathy et al. 2020); full analysis: (Pinot et al. 2019; Carmon et al. 2019; B. Li et al. 2019; Najafi et al. 2019; Maini et al. 2020).
2 The full code is available at https://github.com/Anonymous23984902384/how-tocompare-adversarial-robustness-of-classifiers-from-a-global-perspective.

