

Abstract

Adversarial robustness of machine learning models has attracted considerable attention over recent years. Adversarial attacks undermine the reliability of and trust in machine learning models, but the construction of more robust models hinges on a rigorous understanding of adversarial robustness as a property of a given model. Point-wise measures for specific threat models are currently the most popular tool for comparing the robustness of classifiers and are used in most recent publications on adversarial robustness. In this work, we use robustness curves to show that point-wise measures fail to capture important global properties that are essential to reliably compare the robustness of different classifiers. We introduce new ways in which robustness curves can be used to systematically uncover these properties and provide concrete recommendations for researchers and practitioners when assessing and comparing the robustness of trained models. Furthermore, we characterize scale as a way to distinguish small and large perturbations, and relate it to inherent properties of data sets, demonstrating that robustness thresholds must be chosen accordingly. We hope that our work contributes to a shift of focus away from point-wise measures of robustness and towards a discussion of what kind of robustness could and should reasonably be expected. We release code to reproduce all experiments presented in this paper, which includes a Python module to calculate robustness curves for arbitrary data sets and classifiers, supporting a number of frameworks, including TensorFlow, PyTorch and JAX.

1. Introduction

Despite their astonishing success in a wide range of classification tasks, deep neural networks can be led to incorrectly classify inputs altered with specially crafted adversarial perturbations (Szegedy et al. 2014; Goodfellow et al. 2015). These perturbations can be so small that they remain almost imperceptible to human observers (J. P. Göpfert et al. 2020). Adversarial robustness describes a model's ability to behave correctly under such small perturbations crafted with the intent to mislead the model. The study of adversarial robustness, with its definitions, their implications, attacks, and defenses, has attracted considerable research interest. This is due both to the practical importance of trustworthy models and to the intellectual interest in the differences between decisions of machine learning models and our human perception. A crucial starting point for any such analysis is the definition of what exactly a small input perturbation is. This requires (a) the choice of a distance function to measure perturbation size, and (b) the choice of a particular scale to distinguish small and large perturbations. Together, these two choices determine a threat model that defines exactly under which perturbations a model is required to be robust. The most popular choice of distance function is the class of distances induced by ℓp norms (Szegedy et al. 2014; Goodfellow et al. 2015; Carlini, Athalye, et al. 2019), in particular ℓ1, ℓ2 and ℓ∞, although other choices such as the Wasserstein distance have been explored as well (Wong, Schmidt, et al. 2019). Regarding scale, the current default is to pick some perturbation threshold ε without providing concrete reasons for the exact choice. Analysis then focuses on the robust error of the model, the proportion of test inputs for which the model behaves incorrectly under some perturbation up to size ε.
This means that the scale is defined as a binary distinction between small and large perturbations based on the perturbation threshold. A set of canonical thresholds has emerged in the literature. For example, in the publications referenced in this section, the MNIST data set is typically evaluated at a perturbation threshold ε ∈ {0.1, 0.3} for the ℓ∞ norm, while CIFAR-10 is evaluated at ε ∈ {2/255, 4/255, 8/255}, stemming from the 8-bit color channels used to represent images. Based on these established threat models, researchers have developed specialized methods to minimize the robust error during training, which results in more robust models. Popular approaches include specific data augmentation, sometimes used under the umbrella term adversarial training (Guo et al. 2017; Madry et al. 2018; Carmon et al. 2019; Hendrycks et al. 2019), training under regularization that encourages large margins and smooth decision boundaries in the learned model (Hein and Andriushchenko 2017; Wong and Kolter 2018; Croce, Andriushchenko, and Hein 2019; Croce and Hein 2020), and post-hoc processing or randomized smoothing of predictions in a learned model (Lecuyer et al. 2019; Cohen et al. 2019). In order to show the superiority of a new method, robust accuracies of differently trained models are typically compared for a handful of threat models and data sets, e.g., ℓ∞ (ε = 0.1) and ℓ2 (ε = 0.3) for MNIST. Out of 22 publications on adversarial robustness published at NeurIPS 2019, ICLR 2020, and ICML 2020, 12 publications contain results for only a single perturbation threshold. In five publications, robust errors are calculated for at least two different perturbation thresholds, but still only a small, arbitrarily chosen set of thresholds is considered. Only in five out of the total 22 publications do we find extensive considerations of different perturbation thresholds and the respective robust errors.
Out of these five, three are analyses of randomized smoothing, which naturally gives rise to certification radii (B. Li et al. 2019; Carmon et al. 2019; Pinot et al. 2019). Najafi et al. (2019) follow a learning-theoretical motivation, which results in an error bound as a function of the perturbation threshold. Only Maini et al. (2020) do not rely on randomization and still provide a complete, empirical analysis of robust error for varying perturbation thresholds [0].

Our contributions: In this work, we demonstrate that point-wise measures of ℓp robustness are not sufficient to reliably and meaningfully compare the robustness of different classifiers. We show that, both in theory and practice, results of model comparisons based on point-wise measures may fail to generalize to threat models with even slightly larger or smaller ε, and that robustness curves avoid this pitfall by design. Furthermore, we show that point-wise measures are insufficient to meaningfully compare the efficacy of different defense techniques when distance functions are varied, and that robustness curves, again, are able to reliably detect and visualize this property. Finally, we analyze how scale depends on the underlying data space, choice of distance function, and distribution. Based on our findings, we suggest that robustness curves should become the standard tool when comparing adversarial robustness of classifiers, and that the perturbation threshold of threat models should be selected carefully in order to be meaningful, considering inherent characteristics of the data set. We release code to reproduce all experiments presented in this paper [2], which includes a Python module with an easily accessible interface (similar to Foolbox, Rauber et al. (2017)) to calculate robustness curves for arbitrary data sets and classifiers. The module supports classifiers written in most of the popular machine learning frameworks, such as TensorFlow, PyTorch and JAX.

2. Methods

An adversarial perturbation for a classifier f and an input-output pair (x, y) is a small perturbation δ with f(x + δ) ≠ y. Because the perturbation δ is small, it is assumed that the label y would still be the correct prediction for x + δ. The resulting point x + δ is called an adversarial example. The points vulnerable to adversarial perturbations are those that are either already misclassified when unperturbed, or that lie close to a decision boundary. One tool to visualize and study the robustness behavior of a classifier is the robustness curve, first used by Wong and Kolter (2018) and later formalized by C. Göpfert et al. (2020) as

R^f_d(ε) := P({(x, y) s.t. ∃ x′ : d(x, x′) ≤ ε ∧ f(x′) ≠ y}).

A model's robustness curve shows how data points are distributed in relation to the decision boundaries of the model, essentially visualizing an extremely large number of point-wise measures simultaneously. This allows us to take a step back from robustness regarding a specific perturbation threshold and instead compare global robustness for different classifiers, distributions and distance functions. To see why this is relevant, consider Figure 1, which shows toy data along with two possible classifiers that perfectly separate the data. For a perturbation threshold of ε, the blue classifier has robust error 0.5, while the orange classifier is perfectly robust. However, for a perturbation threshold of 2ε, the orange classifier has robust error 1, while the blue classifier remains at 0.5. By freely choosing a single perturbation threshold for comparison, it is therefore possible to make either classifier appear to be much better than the other, and no single threshold can capture the whole picture.
In fact, for any two disjoint finite sets of perturbation thresholds, it is possible to construct a data distribution and two classifiers f, f′, such that the robust error of f is lower than that of f′ for all perturbation thresholds in the first set, and that of f′ is lower than that of f for all perturbation thresholds in the second set. See Appendix A for a constructive proof. This shows that even computing multiple point-wise measures to compare two models may give misleading results.
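The empirical counterpart of the curve R^f_d is simply a step function over per-point distances to the decision boundary. The sketch below (function and variable names are ours, not the interface of the released module) computes the curve from each test point's minimal adversarial distance under d, where already-misclassified points contribute distance 0:

```python
import numpy as np

def robustness_curve(min_adv_distances, eps_grid):
    """Empirical robustness curve: the fraction of test points that are
    misclassified under some perturbation of size <= eps.
    `min_adv_distances` holds, for each test point, the distance to the
    nearest point with a different prediction (0 if already misclassified)."""
    d = np.sort(np.asarray(min_adv_distances, dtype=float))
    # For each eps, count the distances <= eps via binary search.
    counts = np.searchsorted(d, eps_grid, side="right")
    return counts / d.size

# Toy data mirroring Figure 1: classifier A has half the points at
# distance 0.5 from its boundary and half very far away; classifier B
# has every point at distance 1.5.
eps = np.array([1.0, 2.0])
curve_a = robustness_curve([0.5] * 5 + [10.0] * 5, eps)  # robust error 0.5, 0.5
curve_b = robustness_curve([1.5] * 10, eps)              # robust error 0.0, 1.0
```

At ε = 1 classifier B looks perfectly robust and A looks poor; at ε = 2 the ranking flips, exactly as in the toy example above.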

3. Experiments

In the following, we empirically evaluate the robustness of a number of recently published models, and demonstrate that the weaknesses of point-wise measures described above are not limited to toy examples, but occur for real-world data and models.

3.1 Experimental Setup

We evaluate and compare the robustness of models obtained using the training methods ST, KW, AT, MMR + AT, and MMR-UNIV (see Appendix C for details on models and training). For complex models, calculating the exact distance of a point to the closest decision boundary, and thus estimating the true robustness curve, is computationally very intensive, if not intractable. Therefore, we bound the true robustness curve from below using strong adversarial attacks, which is consistent with the literature on empirical evaluation of adversarial robustness and also applicable to many different types of classifiers. We base our selection of attacks on the recommendations by Carlini, Athalye, et al. (2019). Specifically, we use the ℓ2 attack proposed by Carlini and Wagner (2017) for ℓ2 robustness curves and PGD (Madry et al. 2018) for ℓ∞ robustness curves. For both attacks, we use the implementations of Foolbox (Rauber et al. 2017). See Appendix C for information on adversarial attack hyperparameters. In the following, "robustness curve" refers to this empirical approximation of the true robustness curve.

3.2 The Weaknesses of Point-wise Measures

Point-wise measures are used to quantify the robustness of classifiers by measuring the robust test error for a specific distance function and perturbation threshold (e.g., ℓ∞ (ε = 4/255)). In Table 1 we show three point-wise measures to compare the robustness of five different classifiers on CIFAR-10. If we compare the robustness of the four robust training methods (latter four columns of the table) based on the first point-wise threat model ℓ∞ (ε = 1/255) (first row of the table), we can see that the classifier trained with AT is the most robust, followed by MMR + AT, followed by KW, and MMR-UNIV results in the least robust classifier. However, if we increase the ε of our threat model to ε = 4/255 (second row of the table), KW is more robust than AT. For an even larger ε (third row of the table), we would conclude that MMR-UNIV is preferable over AT, and that AT results in the least robust classifier. All three statements are true for their particular perturbation threshold, and the magnitude of all perturbation thresholds is reasonable: publications on adversarial robustness typically evaluate CIFAR-10 at perturbation thresholds up to 10/255 for ℓ∞ perturbations. Meaningful conclusions on the robustness of the classifiers relative to each other cannot be made without taking all possible ε into account. In other words, a global perspective is needed.
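The ranking reversals in Table 1 are easy to reproduce with synthetic data. In the sketch below, the per-point minimal adversarial distances of two hypothetical classifiers (the numbers are illustrative, not measurements from our experiments) produce opposite rankings at the small and at the larger threshold:

```python
import numpy as np

# Hypothetical minimal adversarial distances for two classifiers on the
# same eight test points, in the same units as eps -- illustrative only.
dist_a = np.array([3, 3, 3, 3, 6, 6, 6, 6], dtype=float) / 255
dist_b = np.array([1, 1, 5, 5, 9, 9, 9, 9], dtype=float) / 255

def robust_error(distances, eps):
    """Fraction of points misclassified under some perturbation <= eps."""
    return float(np.mean(distances <= eps))

for eps in (1 / 255, 4 / 255, 8 / 255):
    print(f"eps = {eps:.4f}: A -> {robust_error(dist_a, eps):.3f}, "
          f"B -> {robust_error(dist_b, eps):.3f}")
```

At ε = 1/255 classifier A has the lower robust error; at ε = 4/255 and ε = 8/255 classifier B does. Neither single threshold tells the whole story.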

3.2.1 A Global Perspective

Figure 2 shows the robustness of different classifiers for the ℓ∞ (left plot) and ℓ2 (right plot) distance functions from a global perspective, using robustness curves. The plots reveal why the three point-wise measures (marked by vertical black dashed lines in the left plot) lead to different results in the relative ranking of the robustness of the classifiers. Both for the classifiers trained to be robust against attacks in ℓ∞ distance (left plot) and in ℓ2 distance (right plot), we can observe multiple intersections of robustness curves, corresponding to changes in the relative ranking of the robustness of the compared classifiers. The robustness curves allow us to reliably compare the robustness of classifiers for all possible perturbation thresholds. Furthermore, the curves clearly show the perturbation threshold intervals with strong and weak robustness for each classifier, and are not biased by an arbitrarily chosen perturbation threshold.

3.2.2 Overfitting to Specific Perturbation Thresholds

In addition to the problem of robustness curve intersections, relying on point-wise robustness measures to evaluate adversarial robustness is prone to overfitting when designing training procedures. Figure 3 shows ℓ∞ robustness curves for MMR + AT with an ℓ∞ threat model, as provided by Croce, Andriushchenko, and Hein (2019). The models trained on MNIST and FMNIST both show a change in slope, which could be a sign of overfitting to the specific threat models for which the classifiers were optimized, since the change of slope occurs approximately at the chosen perturbation threshold ε. This showcases a potential problem with the use of point-wise measures during training. The binary separation into "small" and "large" perturbations based on the perturbation threshold is not sufficient to capture the intricacies of human perception under perturbations, but rather a simplification based on the idea that perturbations below the threshold should almost certainly not lead to a change in classification. If a training procedure moves decision boundaries so that data points lie just beyond this threshold, it may achieve a low robust error without furthering the actual goals of adversarial robustness research. Using robustness curves for evaluation cannot prevent this effect, but can be used to detect it.

3.2.3 Transfer of Robustness Across Distance Functions

In the following, we analyze to which extent properties of robustness curves transfer across different choices of distance functions. If properties transfer, it may not be necessary to individually analyze robustness for each distance function. In Figure 4 we compare the robustness of different models for the ℓ∞ (left plot) and ℓ2 (right plot) distance functions. The difference to Figure 2 is that the models (indicated by colour) are the same in the left plot and in the right plot. We find that for MMR + AT, the ℓ∞ threat model leads to better robustness than the ℓ2 threat model, both for ℓ∞ and ℓ2 robustness curves. In fact, MMR + AT with the ℓ∞ threat model even leads to better ℓ∞ and ℓ2 robustness curves than MMR-UNIV, which is specifically designed to improve robustness for all ℓp norms. Overall, the plots are visually similar. However, since both plots contain multiple robustness curve intersections, the ranking of methods remains sensitive to the choice of perturbation threshold. For example, a perturbation threshold of ε = 3/255 (vertical black dashed line) for the ℓ∞ distance function (left subplot) shows that the classifier trained with MMR + AT (ℓ2 (ε = 0.1)) is approximately as robust as the classifier trained with MMR-UNIV. The same perturbation threshold for the ℓ2 distance function (right subplot) shows that the classifier trained with MMR + AT is more robust than the classifier trained with MMR-UNIV for ℓ2 threat models. Using typical perturbation thresholds from the literature for each distance function does not alleviate this issue: at perturbation threshold ε = 2/255 for ℓ∞ distance, the classifier trained with MMR + AT (ℓ2 (ε = 0.1)) is more robust than the one trained with MMR-UNIV, while at perturbation threshold ε = 0.1 for ℓ2 distance, the opposite is true.
This shows that even when robustness curves across various distance functions are qualitatively similar, this may be obscured by the choice of threat model(s) to compare on. We also emphasize that in general, robustness curves across various distance functions may be qualitatively dissimilar. In particular:
1. For linear classifiers, the shape of a robustness curve is identical for distances induced by different ℓp norms. This follows from Theorem 2 in Appendix B, which is an extension of a weaker result in C. Göpfert et al. (2020). For non-linear classifiers, different ℓp norms may induce different robustness curve shapes; see C. Göpfert et al. (2020) for an example.
2. Even for linear classifiers, robustness curve intersections do not transfer between distances induced by different ℓp norms. That is, for two linear classifiers, there may exist p, p′ such that the robustness curves for the ℓp distance intersect, but not the robustness curves for the ℓp′ distance. See Appendix A for an example.
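Item 1 can be illustrated numerically: for a linear classifier, the per-point minimal perturbation sizes under two different norms differ by a constant factor, so one robustness curve is a horizontal rescaling of the other. A small sketch using the closed-form distances from Lemma 1 in Appendix B (the data and classifier below are random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
w, b = rng.normal(size=10), 0.1          # random linear classifier
X = rng.normal(size=(50, 10))            # random inputs
margins = np.abs(X @ w + b)

# Minimal perturbation sizes per point, |w^T x + b| / ||w||_q (Lemma 1):
d_inf = margins / np.linalg.norm(w, 1)   # p = inf -> dual norm q = 1
d_2 = margins / np.linalg.norm(w, 2)     # p = 2   -> dual norm q = 2

# The ratio is the same constant for every point, so the l_inf and l_2
# robustness curves of this classifier have identical shape.
print(np.ptp(d_2 / d_inf))  # spread of the ratios: (numerically) zero
```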

3.3 On the Relationship Between Scale and Data

As the previous sections show, robustness curves can be used to reveal properties of robust models that may be obscured by point-wise measures. However, some concept of scale, that is, some way to judge whether a perturbation is small or large, remains necessary. Especially when robustness curves intersect, it is crucial to be able to judge how critical it is for a model to be stable under the given perturbations. For many pairs of distance function and data set, canonical perturbation thresholds have emerged in the literature, but to the best of our knowledge, no reasons for these choices are given. Since the assumption behind adversarial examples is that small perturbations should not affect classification behavior, the question of scale cannot be answered independently of the data distribution. In order to understand how to interpret different perturbation sizes, it is helpful to understand how strongly a data point would need to be perturbed to actually change the correct classification. We call this the inter-class distance and analyze the distribution of inter-class distances for several popular data sets. In Figure 5 we compare the inter-class distance distributions in ℓ∞, ℓ2, and ℓ1 norm for all data sets considered in this work. We observe that for the ℓ1 and ℓ2 norms, the shape of the curves is similar across data sets, but their extent is determined by the dimensionality of the data space. In the ℓ∞ norm, vastly different curves emerge for the different data sets. We hypothesize that, because the inter-class distance distributions vary more strongly for ℓ∞ distances than for ℓ1 distances, the results of robustifying a model w.r.t. ℓ∞ distances may depend more strongly on the underlying data distribution than the results of robustifying w.r.t. ℓ1 distances. This is an interesting avenue for future work. When we look at the smallest inter-class distances in the ℓ∞ norm (where all distances lie in the interval [0, 1]), we can make several observations.
Because the smallest inter-class distance for MNIST in the ℓ∞ norm is around 0.9, we can see that transforming an input from one class into one from a different class almost always requires completely flipping at least one pixel from almost-black to almost-white or vice versa. For the other data sets, the inter-class distance distributions are more spread out than that of MNIST. We observe that for CIFAR-10, with ℓ∞ perturbations of size 0.25 it becomes possible to transform samples from different classes into each other, so starting from this threshold, any classifier must necessarily trade off between accuracy and robustness. The shapes of the curves and the threshold from which any classifier must necessarily trade off between accuracy and robustness differ strongly between data sets; refer to Table 2 for exact values of this threshold. In Table 2, we summarize the smallest and largest inter-class distances in different norms, together with additional information about the size, number of classes, and dimensionality of all the data sets we consider in this work. The values correspond directly to Figure 5, but even in this simplified view, we can quickly make out key differences between the data sets. Compare, for example, MNIST and GTS: while it appears reasonable to expect ℓ∞ robustness of 0.3 for MNIST, the same threshold for GTS is not possible. Relating Table 2 and Figure 3, we find entirely plausible the strong robustness results for MNIST, and the small perturbation threshold for GTS.

Table 2: Smallest and largest inter-class distances for subsets of several data sets, measured in ℓ∞, ℓ2, and ℓ1 norm, together with basic contextual information about the data sets. All data has been normalized to lie within the interval [0, 1], and duplicates and corrupted data points have been removed. Apart from HAR, all data sets contain images; the dimensionality reported specifies their sizes and number of channels.
Based on inter-class distances, we would also expect less ℓ∞ robustness for CIFAR-10 than for FMNIST, but this is not what we see in Figure 3. In any case, it is safe to say that, when judging the robustness of a model by a certain threshold, that threshold must be set with respect to the distribution the model operates on. Overall, the strong dependence of robustness curves on the data set and the chosen norm emphasizes the necessity of informed and conscious decisions regarding robustness thresholds. We provide an easily accessible reference in the form of Table 2, which should prove useful when judging scale in a threat model.
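Inter-class distances as used above can be computed directly: measure the distance between every pair of points with different labels and inspect the minimum or the full distribution. A minimal sketch on random data (the data, function name, and loop-based implementation are ours; for large data sets a vectorized or sampled version would be needed):

```python
import numpy as np

def inter_class_distances(X, y, ord):
    """All pairwise distances between points with different labels,
    measured in the given p-norm (`ord` as in np.linalg.norm)."""
    dists = []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if y[i] != y[j]:
                dists.append(np.linalg.norm(X[i] - X[j], ord=ord))
    return np.array(dists)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(40, 64))  # 40 points in a 64-dim [0, 1] cube
y = rng.integers(0, 2, size=40)

for p in (np.inf, 2, 1):
    d = inter_class_distances(X, y, p)
    print(f"l_{p}: min = {d.min():.3f}, median = {np.median(d):.3f}")
```

The smallest value per norm is the quantity reported in Table 2; once ε reaches this order of magnitude, samples of different classes can be perturbed into each other and accuracy and robustness must trade off.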

4. Discussion

We have demonstrated that comparisons of the robustness of different classifiers using point-wise measures can be heavily biased by the choice of perturbation threshold and distance function of the threat model, and that conclusions about rankings of classifiers with regard to their robustness based on point-wise measures therefore only provide a narrow view of the actual robustness behavior of the classifiers. Further, we have demonstrated different ways of using robustness curves to overcome the shortcomings of point-wise measures, and therefore recommend using them as the standard tool for comparing the robustness of classifiers. Finally, we have demonstrated how suitable perturbation thresholds necessarily depend on the data they pertain to. It is our hope that practitioners and researchers alike will use the methodology proposed in this work, especially when developing and comparing adversarial defenses, and carefully motivate any concrete threat models they might choose, taking into account all available context.

Limitations: Computing approximate robustness curves for state-of-the-art classifiers and large data sets is computationally very intensive, due to the need to compute approximate minimal adversarial perturbations with strong adversarial attacks. Developing adversarial attacks that are both strong and fast is an ongoing challenge in the field of adversarial robustness. One way to reduce the computational cost is to approximate the robustness curves by computing a set of point-wise measures. However, since robustness curves may intersect at arbitrarily many points, this may give misleading results. It would be interesting to investigate how closely robustness curves need to be approximated in order to estimate the number of intersections, if any, and their locations, with high certainty. Another limitation of our work is the focus on a small group of distance functions (mainly ℓ∞ and ℓ2 norms).
Even though it intuitively makes sense that models should at least be robust against these types of perturbations, a more general evaluation able to consider more distance functions simultaneously could be advantageous.

A. Robustness Curves with Arbitrary Intersections

Theorem 1. Let T_1, T_2 ⊂ ℝ_{>0} be two disjoint finite sets. Then there exists a distribution P on ℝ × {0, 1} and two classifiers c_1, c_2 : ℝ → {0, 1} such that R^{c_1}(t) < R^{c_2}(t) for all t ∈ T_1 and R^{c_1}(t) > R^{c_2}(t) for all t ∈ T_2, where R^{c_i} denotes the robustness curve of c_i with respect to the distance |·|.

Proof. Without loss of generality, assume that T_1 = {t_1, ..., t_n} and T_2 = {t'_1, ..., t'_n} with t_i < t'_i < t_{i+1} for i ∈ {1, ..., n}. We construct c_1, c_2 such that their robustness curves alternate between consecutive thresholds. Pick d larger than every threshold and let c_1(x) = 1_{x ≥ −d} and c_2(x) = 1_{x ≥ d}, so that the decision boundaries lie at −d and d. Let P place mass 1/(4n+1) on the point (−d − t_1/2, 0), mass 2/(4n+1) on each point (−d − (t'_i + t_{i+1})/2, 0) for i ∈ {1, ..., n} (where we set t_{n+1} := t'_n + 1), and mass 2/(4n+1) on each point (d + (t_i + t'_i)/2, 1) for i ∈ {1, ..., n}. Both classifiers have perfect accuracy on P, meaning that R^{c_i}(0) = 0. The closest point to the decision boundary of c_1 is −d − t_1/2 with weight 1/(4n+1), so R^{c_1}(t_1/2) = 1/(4n+1). The second-closest point is −d − (t'_1 + t_2)/2 with weight 2/(4n+1), so R^{c_1}((t'_1 + t_2)/2) = 3/(4n+1), and so on. Meanwhile, the closest point to the decision boundary of c_2 is d + (t_1 + t'_1)/2 with weight 2/(4n+1), so R^{c_2}((t_1 + t'_1)/2) = 2/(4n+1); the second-closest point is d + (t_2 + t'_2)/2 with weight 2/(4n+1), so R^{c_2}((t_2 + t'_2)/2) = 4/(4n+1), and so on. Evaluating the two step functions at the thresholds yields R^{c_2}(t_i) = (2i − 2)/(4n+1) < (2i − 1)/(4n+1) = R^{c_1}(t_i) and R^{c_1}(t'_i) = (2i − 1)/(4n+1) < 2i/(4n+1) = R^{c_2}(t'_i); renaming the two classifiers yields the statement of the theorem.

Example 1. To see that robustness curve intersections do not transfer between different ℓp norms, consider the example in Figure 6. The blue and orange linear classifiers both perfectly separate the displayed data. The ℓ∞ robustness curves of the classifiers do not intersect, meaning that the robust error of the blue classifier is always better than that of the orange classifier. In ℓ2 distance, the robustness curves intersect, so that there is a range of perturbation sizes where the orange classifier has better robust error than the blue classifier.
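The alternating construction in the proof of Theorem 1 can be checked numerically. The sketch below follows our reading of the atom placement (the position of the far left atom and the value of d are our choices) for n = 2, with thresholds T1 = {1, 3} and T2 = {2, 4}:

```python
# Construction from Theorem 1 for n = 2: T1 = {t_1, t_2} = {1, 3},
# T2 = {t'_1, t'_2} = {2, 4}; d is chosen larger than every threshold.
t, tp, d = [1.0, 3.0], [2.0, 4.0], 100.0

# Atoms as (position, probability weight); label 0 left of -d, label 1
# right of d, so both classifiers classify every atom correctly.
left = [(-d - t[0] / 2, 1.0),                       # closest to c_1's boundary
        (-d - (tp[0] + t[1]) / 2, 2.0),             # second closest
        (-d - (tp[1] + (tp[1] + 1)) / 2, 2.0)]      # farther than every threshold
right = [(d + (t[0] + tp[0]) / 2, 2.0),             # closest to c_2's boundary
         (d + (t[1] + tp[1]) / 2, 2.0)]
total = sum(w for _, w in left + right)             # = 4n + 1 = 9

def robust_error(boundary, eps):
    # Mass within eps of the decision boundary (all atoms start correct).
    return sum(w for x, w in left + right if abs(x - boundary) <= eps) / total

r1 = lambda eps: robust_error(-d, eps)  # c_1: boundary at -d
r2 = lambda eps: robust_error(+d, eps)  # c_2: boundary at +d

for eps in t + tp:  # the ranking alternates between T1 and T2
    print(f"eps = {eps}: R_c1 = {r1(eps):.3f}, R_c2 = {r2(eps):.3f}")
```

At the thresholds in T1 one classifier has strictly lower robust error, and at the thresholds in T2 the ranking reverses, as claimed.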

B. Robustness Curve Dependence of Shape on Distance Function

Theorem 2. Let f(x) = sgn(w^T x + b) be a linear classifier. Then the shape of the robustness curve for f regarding an ℓp norm-induced distance does not depend on the choice of p. It holds that

R^f_{p_1}(ε) = R^f_{p_2}(c · ε) for all ε, where c = ‖w‖_{q_1} / ‖w‖_{q_2} and q_i = p_i / (p_i − 1).  (1)

Lemma 1. Let x ∈ ℝ^m with w^T x + b ≠ 0. Let p ∈ [1, ∞] and q such that 1/p + 1/q = 1, where we take 1/∞ = 0. Then

min{ ‖δ‖_p : sgn(w^T (x + δ) + b) ≠ sgn(w^T x + b) } = |w^T x + b| / ‖w‖_q  (2)

and the minimum is attained by

δ = −((w^T x + b) / ‖w‖_∞) · sgn(w_j) e_j with j = argmax_i |w_i|, for p = 1,
δ = −((w^T x + b) / ‖w‖_q^q) · (sgn(w_i) |w_i|^{1/(p−1)})_{i=1}^m, for p ∈ (1, ∞],  (3)

where x^{1/(∞−1)} = x^0 = 1 and e_j is the j-th unit vector.

Proof of Lemma 1. By Hölder's inequality, for any δ, Σ_{i=1}^m |w_i δ_i| ≤ ‖δ‖_p ‖w‖_q. For δ such that sgn(w^T (x + δ) + b) ≠ sgn(w^T x + b) it follows that

‖δ‖_p ≥ (Σ_{i=1}^m |w_i δ_i|) / ‖w‖_q ≥ |Σ_{i=1}^m w_i δ_i| / ‖w‖_q ≥ |w^T x + b| / ‖w‖_q.  (5)

Using the identity q = p/(p−1), it is easy to check that for every p ∈ [1, ∞], with δ as defined in Equation (3),
1. w^T δ = −w^T x − b, so that w^T (x + δ) + b = 0, and
2. ‖δ‖_p = |w^T x + b| / ‖w‖_q.
Item 1 shows that δ is a feasible point, while Item 2 in combination with Equation (5) shows that ‖δ‖_p is minimal.

Using Lemma 1, we are ready to prove Theorem 2.

Proof of Theorem 2. By definition,

R^f_{p_1}(ε) = P({(x, y) s.t. ∃ δ : ‖δ‖_{p_1} ≤ ε ∧ f(x + δ) ≠ y}).  (6)

We can split this event into the disjoint sets

M := {(x, y) : f(x) ≠ y}  (7)

and

B_{p_1}(ε) := {(x, y) s.t. ∃ δ : ‖δ‖_{p_1} ≤ ε ∧ y = f(x) ≠ f(x + δ)}.  (9)

Choose q_1, q_2 such that 1/p_i + 1/q_i = 1. By Lemma 1, and using that f(x) = sgn(w^T x + b),

B_{p_1}(ε) = {(x, y) : sgn(w^T x + b) = y ∧ |w^T x + b| / ‖w‖_{q_1} ≤ ε}  (10)
= {(x, y) : sgn(w^T x + b) = y ∧ |w^T x + b| / ‖w‖_{q_2} ≤ (‖w‖_{q_1} / ‖w‖_{q_2}) ε}  (11)
= B_{p_2}((‖w‖_{q_1} / ‖w‖_{q_2}) ε).  (12)

This shows that

R^f_{p_1}(ε) = P(M) + P(B_{p_1}(ε))  (13)
= P(M) + P(B_{p_2}((‖w‖_{q_1} / ‖w‖_{q_2}) ε))  (14)
= R^f_{p_2}((‖w‖_{q_1} / ‖w‖_{q_2}) ε).  (15)
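Lemma 1 is easy to verify numerically. The sketch below checks, for p = 2 (q = 2) and p = ∞ (q = 1) on a random linear classifier, that the closed-form δ from Equation (3) reaches the decision boundary and has norm |w^T x + b| / ‖w‖_q:

```python
import numpy as np

rng = np.random.default_rng(1)
w, b = rng.normal(size=5), 0.3
x = rng.normal(size=5)
margin = w @ x + b

# p = 2 (q = 2): |w_i|^(1/(p-1)) = |w_i|, so delta is a scaled copy of w.
delta2 = -margin / np.linalg.norm(w, 2) ** 2 * w
assert np.isclose(w @ (x + delta2) + b, 0.0)          # reaches the boundary
assert np.isclose(np.linalg.norm(delta2, 2),
                  abs(margin) / np.linalg.norm(w, 2))  # minimal l_2 norm

# p = inf (q = 1): |w_i|^(1/(p-1)) = |w_i|^0 = 1, so delta moves every
# coordinate by the same amount, in the direction of sign(w_i).
deltainf = -margin / np.linalg.norm(w, 1) * np.sign(w)
assert np.isclose(w @ (x + deltainf) + b, 0.0)
assert np.isclose(np.linalg.norm(deltainf, np.inf),
                  abs(margin) / np.linalg.norm(w, 1))

print("Lemma 1 formulas check out for p in {2, inf}")
```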
C. Experimental Details

C.1 Model Training

We use the same model architecture as Croce, Andriushchenko, and Hein (2019) and Wong and Kolter (2018). Unless explicitly stated otherwise, the trained models are taken from Croce, Andriushchenko, and Hein (2019). The exact architecture of the model is: convolutional layer (number of filters: 16, size: 4x4, stride: 2), ReLU activation, convolutional layer (number of filters: 32, size: 4x4, stride: 2), ReLU activation, fully connected layer (number of units: 100), ReLU activation, output layer (number of units depends on the number of classes). All models are trained with the Adam optimizer (Kingma and Ba 2014) for 100 epochs, with batch size 128 and a default learning rate of 0.001. More information on the training can be found in the experimental details section of the appendix of Croce, Andriushchenko, and Hein (2019). The trained models are those made publicly available by Croce, Andriushchenko, and Hein (2019) [4] and Croce and Hein (2020) [5].

C.2 Approximated Robustness Curves

We use state-of-the-art adversarial attacks to approximate the true minimal distances of input data points to the decision boundary of a classifier for our adversarial robustness curves (see Definition 1). We base our selection of attacks on the recommendations of Carlini, Athalye, et al. (2019). Specifically, for ℓ2 robustness curves we use the ℓ2 attack proposed by Carlini and Wagner (2017), and for ℓ∞ robustness curves we use PGD (Madry et al. 2018). For both attacks, we use the implementations of Foolbox (version 2.4) (Rauber et al. 2017). For the ℓ∞ attack, the Foolbox implementation automatically performs a hyperparameter search over different values of ε and uses the smallest resulting adversarial perturbation. For the remaining hyperparameters, we use the default values of the Foolbox implementation. For the ℓ2 attack, we increase the number of binary search steps used to find the optimal trade-off constant between distance and confidence from 5 to 10, which we found empirically to improve the results. For the remaining hyperparameters, we again use the default values of the Foolbox implementation.

C.3 Computational Architecture

We executed all programs on an architecture with 2 x Intel Xeon CPU E5-2640 v4 @ 2.4 GHz, 2 x Nvidia GeForce GTX 1080 TI 12G, and 128 GB RAM.

D. Perceptibility of Larger Perturbations

As we pointed out in Section 1, adversarial robustness of classifiers trained on CIFAR-10 is usually evaluated at a perturbation threshold ε ∈ {2/255, 4/255, 8/255} for the ℓ∞ norm. Robustness curves allow us to investigate the robustness of classifiers for perturbation thresholds beyond those used in the literature. It should not be necessary for a model to be invariant under large perturbations if these perturbations are clearly perceptible or change the "correct" classification of the input. However, the thresholds that models are currently optimized for are small enough that even larger perturbations may not be perceptible. Figure 7 shows four images of CIFAR-10 (top row), together with adversarial examples (bottom row). With perturbation sizes ε ∈ {17/255, 18/255}, the perturbations are more than two times larger than the biggest perturbation threshold used in the literature, and still almost imperceptible for untrained humans.

E ROBUSTNESS CURVES FOR LARGER MODELS

In Section 3, we demonstrate the usefulness of robustness curves on a small convolutional network architecture used by Croce, Andriushchenko, and Hein (2019). The choice of a small architecture allows us to compute robustness curves for a large number of different defensive strategies with limited computational resources. Figure 8 shows approximate robustness curves for two state-of-the-art robust models with a large network architecture (WideResNet-28-10), computed for a sample of 1000 data points from CIFAR-10. Due to the small number of points used, the approximation may be rough, so the following observations should be taken with a grain of salt.

1. Both robust models are indeed much more robust than the model obtained by standard training, even for perturbation thresholds significantly larger than the threshold of 8/255 that the models are optimized for. This observation may help decide whether it is worthwhile to stop using a conventionally trained model, sacrificing accuracy for robustness.

2. Wu et al. (2020) has slightly worse accuracy than Sehwag et al. (2020) roughly up to perturbation size 1/255; this is traded off for better accuracy between perturbation sizes 4/255 and 0.1. From perturbation size 0.1 onward, Sehwag et al. (2020) again appears to have slightly better accuracy than Wu et al. (2020). This observation may help decide which of the two robust models is preferable, based on the robustness requirements of a concrete application.

3. The gap between the performance of Wu et al. (2020) and Sehwag et al. (2020) is even wider at perturbation size 0.04 than at 8/255, but overall, the robustness curves of the robust models are quite similar. This observation may help decide whether it is worthwhile to switch from one model to the other, if one of the models is already in use or preferable for other reasons.
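Observation 2 amounts to asking, for each threshold, which model's curve lies lower. Given per-model minimal-distance samples (here hypothetical numbers, not the values from Figure 8), this comparison can be sketched as follows:

```python
import numpy as np

def crossover_regions(dists_a, dists_b, eps_grid):
    """For each threshold in eps_grid, report which model has the lower
    (better) robust error: +1 where model A is better, -1 where model B
    is better, 0 where they tie."""
    curve = lambda d: np.array([np.mean(np.asarray(d) <= e) for e in eps_grid])
    return np.sign(curve(dists_b) - curve(dists_a)).astype(int)

# Hypothetical minimal adversarial distances for two robust models:
a = [0.02, 0.06, 0.30, 0.40]
b = [0.01, 0.03, 0.50, 0.60]
print(crossover_regions(a, b, [0.05, 0.45]))  # [ 1 -1]
```

In this toy example the preferred model flips between the two thresholds, mirroring the kind of crossover visible in the robustness curves of Figure 8.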



Single thresholds: (Mao et al. 2019; Tramer and Boneh 2019; Alayrac et al. 2019; Brendel et al. 2019; Qin et al. 2019; Wang et al. 2020; Song et al. 2020; Croce and Hein 2020; Xie and Yuille 2020; Rice et al. 2020; Zhang et al. 2020; Singla and Feizi 2020); multiple thresholds: (Lee et al. 2019; Mahloujifar et al. 2019; Hendrycks et al. 2019; Wong, Rice, et al. 2020; Boopathy et al. 2020); full analysis: (Pinot et al. 2019; Carmon et al. 2019; B. Li et al. 2019; Najafi et al. 2019; Maini et al. 2020).

The full code is available at https://github.com/Anonymous23984902384/how-tocompare-adversarial-robustness-of-classifiers-from-a-global-perspective. The models trained with ST, KW, AT and MMR + AT are available at www.github.com/max-andr/provable-robustness-max-linear-regions. The models trained with MMR-UNIV are available at www.github.com/fra31/mmr-universal.



Figure 3: ℓ∞ robustness curves for multiple data sets. Each curve is calculated for a different model and a different test data set, indicated by the labels. The models are trained with MMR + AT; threat models: MNIST: ℓ∞ (ε = 0.1), FMNIST: ℓ∞ (ε = 0.1), GTS: ℓ∞ (ε = 4/255), CIFAR-10: ℓ∞ (ε = 2/255). The curves for MNIST and FMNIST both show a change in slope, which cannot be captured with point-wise measures and could be a sign of overfitting to the specific threat models for which the classifiers were optimized.

Figure 5: Minimum inter-class distances of all data sets considered in this work, measured in the ℓ∞ (left), ℓ2 (middle), and ℓ1 (right) norm. See Table 2 for size and dimensionality. The shapes of the curves and the threshold from which any classifier must necessarily trade off between accuracy and robustness differ strongly between data sets.
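The minimum inter-class distance underlying Figure 5 can be computed directly from a labeled data set: it is the smallest distance, in the chosen norm, between any two points carrying different labels. A minimal numpy sketch (the toy data below is illustrative, not from the paper's data sets):

```python
import numpy as np

def min_interclass_distance(X, y, norm):
    """Smallest distance between any two points with different labels,
    measured in the chosen norm ('linf', 'l2', or 'l1')."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    best = np.inf
    labels = np.unique(y)
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            # Pairwise coordinate differences between the two classes.
            diff = np.abs(X[y == a][:, None, :] - X[y == b][None, :, :])
            if norm == "linf":
                d = diff.max(axis=-1)
            elif norm == "l1":
                d = diff.sum(axis=-1)
            else:  # l2
                d = np.sqrt((diff ** 2).sum(axis=-1))
            best = min(best, d.min())
    return best

# Toy two-class data: the closest cross-class pair is (0,0) vs (1,1).
X = [[0.0, 0.0], [5.0, 5.0], [1.0, 1.0]]
y = [0, 0, 1]
print(min_interclass_distance(X, y, "linf"))  # 1.0
print(min_interclass_distance(X, y, "l1"))    # 2.0
```

Half of this distance is the largest threshold up to which some classifier can still be simultaneously accurate and robust on the training data, which is why the figure relates it to the accuracy-robustness trade-off.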

Figure 6: Example of a data distribution and two linear classifiers such that the ℓ2 robustness curves intersect, but not the ℓ∞ robustness curves.

Figure 7: Four images from CIFAR-10 (top row), together with adversarial examples (bottom row), calculated with PGD (Madry et al. 2018) for a model trained with MMR + AT, threat model: ℓ∞ (ε = 2/255). The resulting perturbation sizes of the adversarial examples are (from left to right) 17/255, 18/255, 18/255, 18/255. Even for perturbation sizes far greater than popular choices of point-wise measures, adversarial examples can be very hard for humans to detect.

Figure 8: ℓ∞ robustness curves for two state-of-the-art robust models with a large architecture (WideResNet-28-10). The labels indicate the training method (Sehwag2020Hydra: Sehwag et al. (2020), Wu20Adversarial: Wu et al. (2020)). The trained models are taken from Croce, Andriushchenko, Sehwag, et al. (2020). The models are trained on the full training set of CIFAR-10, and the robustness curves are based on a sample of 1000 points from the test set.

Let X be an input space, Y a label set, d a distance function on X × X, and f : X → Y a classifier. Assume (x, y) is drawn i.i.d. from some distribution P on X × Y. Then the d-robustness curve for f is the graph of the function

    ε ↦ P_{(x,y)∼P} [ ∃ x' ∈ X : d(x, x') ≤ ε and f(x') ≠ y ],

i.e., the robust error of f as a function of the perturbation threshold ε.

Table 1: Three point-wise measures for different threat models. All threat models use the ℓ∞ distance function but differ in the choice of perturbation threshold (denoted by ε). Each row contains the robust test errors for one point-wise measure; each column contains the robust test errors for one model, trained with a specific training method (marked by the column title). The lower the number, the better the robustness for the specific threat model. Each point-wise measure results in a different relative ordering of the classifiers based on the errors; the order is visualized by different tones of gray in the background of the cells. Together with each training method, we state the threat model the trained model is optimized to defend against, e.g.,
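The phenomenon the table describes, that different thresholds can yield different relative orderings of classifiers, is easy to reproduce with synthetic minimal-distance samples (all numbers below are hypothetical, not the values from Table 1):

```python
import numpy as np

# Hypothetical minimal adversarial distances for two classifiers.
model_a = np.array([0.01, 0.02, 0.20, 0.25])  # some very fragile points
model_b = np.array([0.05, 0.06, 0.07, 0.08])  # uniformly moderate distances

# Robust error at threshold eps: fraction of points with distance <= eps.
robust_error = lambda d, eps: np.mean(d <= eps)

# At a small threshold, model B looks more robust; at a larger one, model A does.
print(robust_error(model_a, 0.03), robust_error(model_b, 0.03))  # 0.5 0.0
print(robust_error(model_a, 0.10), robust_error(model_b, 0.10))  # 0.5 1.0
```

A single point-wise measure at either threshold would therefore declare a different "winner", while the full robustness curves expose the crossover.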

Figure 2: ℓ∞ robustness curves (left plot) and ℓ2 robustness curves (right plot) resulting from different training methods (indicated by label), optimized for different threat models (indicated by label). The dashed vertical lines visualize the three point-wise measures from Table 1. The models are trained and evaluated on the full training/test sets of CIFAR-10. The curves allow us to reliably compare the robustness of the classifiers, unbiased by the choice of perturbation threshold.

at exactly the points (t_{i-1} + t_i)/2 and (t_i + t_{i+1})/2 on the interval (t_1, t_n]. Let d = t_n and

