PROPER MEASURE FOR ADVERSARIAL ROBUSTNESS

Abstract

This paper analyzes the problems of adversarial accuracy and adversarial training. We argue that standard adversarial accuracy fails to properly measure the robustness of classifiers: its definition entails a tradeoff with standard accuracy even when generalization is neglected. To address these problems, we introduce a new measure of classifier robustness called genuine adversarial accuracy. It can measure the adversarial robustness of classifiers without trading off accuracy on clean data against accuracy on adversarially perturbed samples. In addition, it does not favor a model with invariance-based adversarial examples, i.e., samples whose predicted classes are unchanged even when their perceptual classes change. We prove that a single nearest neighbor (1-NN) classifier is the most robust classifier according to genuine adversarial accuracy for given data and a norm-based distance metric, provided the class of each data point is unique. Based on this result, we suggest that the use of poor distance metrics may be one factor behind the tradeoff between test accuracy and l_p norm-based test adversarial robustness.

1. INTRODUCTION

Even though deep learning models have shown promising performance in image classification tasks (Krizhevsky et al., 2012), most deep learning classifiers are vulnerable to adversarial attacks. By applying a carefully crafted but imperceptible perturbation to input images, so-called adversarial examples can be constructed that cause the classifier to misclassify the perturbed inputs (Szegedy et al., 2013). These vulnerabilities have been shown to be exploitable even when printed adversarial images were read through a camera (Kurakin et al., 2016). Adversarial examples crafted for a specific classifier can be transferable to other models (Goodfellow et al., 2014). This transferability (Papernot et al., 2017) enables attackers to exploit vulnerabilities even with limited access to the target classifier.

Problem setting. Let X ⊂ R^d be a nonempty clean input set, and let every sample x ∈ X belong exclusively to one of the classes in Y; the class of x is denoted c_x. A classifier f assigns a class label from Y to each sample x ∈ R^d. Assume f is parameterized by θ and that L(θ, x, y) is the cross-entropy loss of the classifier given the input x and the label y ∈ Y. Note that this exclusive class assumption is introduced to simplify the analysis; otherwise, the definition of adversarial examples (Biggio et al., 2013) may not match our intuition, as explained in Section 1.1.
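To make the loss in the problem setting concrete, the sketch below writes L(θ, x, y) for a softmax classifier; the name cross_entropy and the assumption that the network's logits g_θ(x) are available as an array are illustrative details, not part of the paper's formalism.

```python
import numpy as np

def cross_entropy(logits, y):
    """L(theta, x, y) for a softmax classifier: the negative
    log-probability of label y. `logits` stands in for the network
    output g_theta(x) (an assumed interface)."""
    z = logits - np.max(logits)                  # shift for numerical stability
    log_probs = z - np.log(np.sum(np.exp(z)))    # log softmax
    return -log_probs[y]
```

For two equal logits the loss at either label is log 2, and raising the logit of the true label lowers the loss, as expected of a cross-entropy.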

1.1. ADVERSARIAL EXAMPLES

Definition 1 (Adversarial example). Given a clean sample x ∈ X and a maximum perturbation norm (threshold) ε, a perturbed sample x′ is an adversarial example if ‖x′ − x‖ ≤ ε and f(x′) ≠ c_x (Biggio et al., 2013).

When the exclusive class assumption in the problem setting is violated, different oracle classifiers may assign different classes to the same clean samples. (Oracle classifiers are classifiers that are robust against adversarial examples (Biggio et al., 2013) for appropriately large ε. Human classification is usually considered an oracle classifier.) For example, while many people assign class 7 to the top right sample shown in Figure 1, some people may assign class 1 or 9 because of the ambiguity of that example. If we label data with the most popularly assigned classes, then according to the definition of adversarial example, some clean samples will be considered adversarial examples by the classifications of some people, even without any perturbation.
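Definition 1 is a simple predicate over a clean sample, a perturbed sample, and a threshold. A minimal sketch follows; the helper name is_adversarial is hypothetical, and the l_2 norm is assumed here although the definition applies to any norm.

```python
import numpy as np

def is_adversarial(f, x_clean, x_pert, c_x, eps):
    """Definition 1 as a predicate (a sketch): x_pert is adversarial
    for classifier f if it lies within the eps-ball around x_clean
    (l2 norm assumed) and f assigns it a class other than c_x."""
    diff = np.atleast_1d(np.asarray(x_pert, dtype=float)
                         - np.asarray(x_clean, dtype=float))
    within_ball = np.linalg.norm(diff) <= eps
    return bool(within_ball and f(x_pert) != c_x)
```

Note that both conditions must hold: a misclassified point outside the ε-ball is not an adversarial example of x, and a correctly classified point inside the ball is not one either.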

1.2. STANDARD ADVERSARIAL ACCURACY

The following measure is commonly used for comparing different classifiers on vulnerability to adversarial attacks (Biggio et al., 2013; Madry et al., 2017; Tsipras et al., 2018; Zhang et al., 2019).

Definition 2 (Standard adversarial accuracy). Let 1(·) be the indicator function, which has value 1 if the condition in the brackets holds and value 0 otherwise. Then, standard adversarial accuracy (by maximum perturbation norm) a_std;max(ε) is defined as follows.

• a_std;max(ε) = E_{x∈X}[1(f(x*) = c_x)] where x* = argmax_{x′: ‖x′−x‖≤ε} L(θ, x′, c_x).
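Definition 2 can be sketched directly for one-dimensional inputs: the inner maximization over the ε-ball is approximated by a dense grid search (in practice an attack such as PGD is used), and the indicator is evaluated at the loss maximizer x*. The helper name a_std_max and the logistic toy classifier below are illustrative assumptions, not part of the definition.

```python
import numpy as np

def a_std_max(predict, loss, xs, cs, eps, n_grid=201):
    """Standard adversarial accuracy a_std;max(eps) for 1-D inputs.
    x* = argmax_{|x'-x| <= eps} L(theta, x', c_x) is approximated by
    a grid search over the interval (a sketch, not a real attack)."""
    hits = 0
    for x, c in zip(xs, cs):
        grid = np.linspace(x - eps, x + eps, n_grid)
        x_star = grid[np.argmax([loss(z, c) for z in grid])]
        hits += int(predict(x_star) == c)          # 1(f(x*) = c_x)
    return hits / len(xs)

# toy logistic classifier: p(c = 1 | x) = sigmoid(w * x), cross-entropy loss
w = 4.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
predict = lambda x: 1 if sigmoid(w * x) >= 0.5 else -1
def loss(x, c):
    p1 = sigmoid(w * x)
    return -np.log(p1) if c == 1 else -np.log(1.0 - p1)
```

For this classifier, samples at x = ±1.5 remain correctly classified under small ε but are flipped once the ε-ball crosses the decision boundary at 0.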

1.2.1. ONE-DIMENSIONAL TOY EXAMPLE

Even though standard adversarial accuracy by maximum perturbation norm is commonly used as a measure of adversarial robustness (Biggio et al., 2013; Madry et al., 2017; Tsipras et al., 2018; Zhang et al., 2019), it is not clear whether this measure can be used to choose better models. To show that it is not an appropriate measure of robustness, we introduce the following example.

Let us consider a toy example (see Figure 2) with predefined (pre-known) classes for the clean samples in order to simplify the analysis. There are only two classes, −1 and 1, i.e., Y = {−1, 1}, and a one-dimensional clean input set X = [−2, −1) ∪ [1, 2) ⊆ R. c_x = −1 when x ∈ [−2, −1) and c_x = 1 when x ∈ [1, 2). We assume a uniform prior probability, p(c = −1) = p(c = 1) = 1/2.

Let us define three classifiers f_1, f_2 and f_3 for this toy example (see Figure 3). With the step function defined as step(x) = 1 if x ≥ 0 and step(x) = −1 if x < 0, let f_1(x) = step(x − 1), f_2(x) = 1 − step(x + 4) + step(x), and f_3(x) = step(x). Notice that the clean accuracy of all three classifiers is 1. However, f_1 is not robust against adversarial attacks, as points in [1, 1 + ε) can be perturbed to change their classification result. f_2 is overly invariant when x < −4. The oracle classifier is f_3. When the change of standard adversarial accuracy by maximum perturbation norm is considered (see Figure 4), f_1 shows decreasing standard adversarial accuracy even when ε < 1. However,
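The toy example above can be reproduced in a few lines. The sketch below samples the clean set on a grid and counts a sample as robust only if every point in its ε-interval keeps the correct class (the worst-case 0-1 form of the inner maximization); the helper name std_adv_acc and the grid resolution are assumed details.

```python
import numpy as np

def step(x):
    # step(x) = 1 if x >= 0, -1 if x < 0
    return 1 if x >= 0 else -1

f1 = lambda x: step(x - 1)                # accurate but not robust near x = 1
f2 = lambda x: 1 - step(x + 4) + step(x)  # overly invariant for x < -4
f3 = lambda x: step(x)                    # the oracle classifier

# clean data: c_x = -1 on [-2, -1), c_x = +1 on [1, 2)
xs = np.concatenate([np.linspace(-2, -1, 50, endpoint=False),
                     np.linspace(1, 2, 50, endpoint=False)])
cs = np.where(xs < 0, -1, 1)

def std_adv_acc(f, xs, cs, eps, n_grid=101):
    """Worst-case adversarial accuracy: a sample counts as robust only
    if every point in its eps-interval keeps the correct class."""
    robust = [all(f(z) == c for z in np.linspace(x - eps, x + eps, n_grid))
              for x, c in zip(xs, cs)]
    return float(np.mean(robust))
```

Here std_adv_acc(f1, xs, cs, 0.9) drops below 1 while f2 and f3 both stay at 1, even though f2 assigns class 1 at x = −5, where the oracle f3 gives −1. This is the sense in which standard adversarial accuracy cannot distinguish f2 from the oracle f3.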



Figure 1: Examples of confusing near image pairs with different classes from the MNIST training dataset (LeCun et al., 2010). The l_2 norms of the differences within the pairs are 2.399, 3.100 and 3.131 from left to right. These examples show that the exclusive class assumption in the problem setting can be violated.

