AttackDist: Characterizing Zero-day Adversarial Samples by Counter Attack

Abstract

Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial attacks, which produce adversarial samples that easily fool state-of-the-art DNNs. The harmfulness of adversarial attacks calls for effective defense mechanisms. However, the relationship between adversarial attacks and defenses is like that of spear and shield: whenever a defense method is proposed, a new attack soon follows to bypass it. Devising a defense against new attacks (zero-day attacks) has proven challenging. We tackle this challenge by characterizing an intrinsic property of adversarial samples: the norm of the perturbation obtained by a counterattack. Our method is based on the idea that, from an optimization perspective, adversarial samples lie closer to the decision boundary; thus the perturbation needed to counterattack an adversarial sample is significantly smaller than that needed for a normal sample. Motivated by this, we propose AttackDist, an attack-agnostic property to characterize adversarial samples. We first theoretically clarify under which condition AttackDist provides a certified detection performance, then show that a potential application of AttackDist is distinguishing zero-day adversarial examples without knowing the mechanisms of new attacks. As a proof of concept, we evaluate AttackDist on two widely used benchmarks. The evaluation results show that AttackDist outperforms state-of-the-art detection measures by large margins in detecting zero-day adversarial attacks.

1. INTRODUCTION

Deep Neural Networks (DNNs) have flourished in recent years and achieve outstanding performance in many extremely challenging tasks, such as computer vision (He et al. (2016)), machine translation (Singh et al. (2017)), automatic speech recognition (Tüske et al. (2014)) and bioinformatics (Choi et al. (2016)). In spite of this excellent performance, recent research shows that DNNs are vulnerable to adversarial samples (Dvorsky (2019)), whose difference from normal inputs is unnoticeable to humans but easily leads DNNs to wrong predictions. This vulnerability hinders the application of DNNs in many sensitive areas, such as autonomous driving, finance, and national security. To eliminate the impact of adversarial samples, researchers have proposed a number of techniques to help DNNs detect and prevent adversarial attacks. Existing adversarial defense techniques can be classified into two main categories: (1) adversarially robust model retraining (Tramèr et al. (2017); Ganin et al. (2016); Shafahi et al. (2019)) and (2) statistics-based adversarial sample detection (Grosse et al. (2017); Xu et al. (2017); Meng & Chen (2017)). However, while adversarial model retraining improves defense abilities, it incurs huge retraining costs, especially as the number of parameters in current models keeps growing. As for statistics-based detection techniques, one severe shortcoming is that they all require prior knowledge about the adversarial samples, which is unrealistic in most real-world cases. For example, LID (Ma et al. (2018)) and Mahalanobis (Lee et al. (2018)) need to train logistic regression detectors on validation datasets. To make matters worse, adversarial attacks and defenses relate to each other like spear and shield.
Defensive techniques that perform well against existing attack methods are repeatedly bypassed by new attack mechanisms, which makes defending against zero-day attacks a challenging but urgent task. To address this challenge, we propose AttackDist, an attack-agnostic adversarial sample detection technique based on counterattack. Our method builds on the insight that, from the perspective of optimization theory, searching for adversarial perturbations is a non-convex optimization process, so the perturbations produced by an attack algorithm should be close to the optimal solution δ* (see Definition 2), and the optimal adversarial sample lies on the decision boundary (Lemma 1). Thus, if we apply a counterattack to an adversarial sample, the resulting perturbation would be significantly smaller than for the original sample. Figure 1 illustrates this intuition: if we attack an adversarial sample, the adversarial perturbation d2 would be much smaller than the perturbation d1 obtained by attacking a normal point. By measuring the size of the counterattack perturbation, we can therefore differentiate normal points from adversarial samples. To demonstrate the effectiveness of AttackDist, we first theoretically analyze the norm of the adversarial perturbation for normal and adversarial points, and give the condition under which AttackDist provides a guaranteed detection performance (Theorem 3). In addition to the theoretical analysis, we implement AttackDist on two famous and widely used benchmarks, MNIST (Deng (2012)) and CIFAR-10, and compare it with four state-of-the-art techniques: Vanilla (Hendrycks & Gimpel (2016)), KD (Feinman et al. (2017)), MC (Gal & Ghahramani (2016)) and Mahalanobis (Lee et al. (2018)). The experimental results show that AttackDist outperforms existing works in detecting zero-day adversarial attacks without requiring prior knowledge about the attacks.
In brief, we summarize our contributions as follows:
• We formally prove a general intrinsic property of adversarial samples (i.e., adversarial samples are close to the decision boundary), which can be leveraged for detecting future advanced (less noticeable) adversarial attacks; the less noticeable the attack, the more this property contributes to adversarial sample detection.
• We propose AttackDist, an attack-agnostic technique for detecting zero-day adversarial attacks. We theoretically prove that when the adversarial perturbation satisfies the given condition, AttackDist has a guaranteed performance in detecting adversarial samples.
• We implement AttackDist on two widely used datasets and compare it with four state-of-the-art approaches. The experimental results show that AttackDist achieves state-of-the-art performance in most cases. In particular, for detecting ℓ2 adversarial attacks, AttackDist achieves AUROC scores of 0.99, 0.98, 0.96 and accuracies of 0.99, 0.92, 0.90 for three different adversarial attacks.

2. BACKGROUND

In this section, we first define the notations used throughout the paper, then give a brief review of adversarial attacks and defenses. Finally, we introduce our assumptions about the attackers and the defenders.

2.1. NOTATIONS

Let f(·) : X → Y denote a continuous classifier, where X is the input space consisting of d-dimensional vectors, and Y is the output space with K labels. The classifier predicts the label of a point x as arg max_{r=1,2,...,K} f_r(x). We then follow prior work () to define adversarial perturbations. Let ∆(·) denote a specific attack algorithm (e.g., FGSM, CW). As shown in Equation 1, given a point x and a target classifier f, the adversarial perturbation ∆(x, f) provided by ∆(·) is a minimal perturbation that is sufficient to change the original prediction f(x) (for shorthand, we use ∆(x) to represent ∆(x, f) throughout the paper).

∆(x, f) = min_δ ||δ||_p  s.t.  f(x + δ) ≠ f(x)   (1)

Adversarial samples are the points obtained by applying the adversarial perturbations to the original points (i.e., x_adv = x + ∆(x)).

Definition 1. Attack Distance: We define the attack distance (AttackDist) of a point x as the ℓp norm of the adversarial perturbation:

AttackDist(x) = ||x_adv − x||_p = ||∆(x)||_p   (2)

Definition 2. Optimal Adversarial Perturbation: Given x and f, the optimal adversarial perturbation δ*(x) is the optimal solution of Equation 1. In other words, δ*(x) satisfies Equation 3:

||δ*(x)||_p ≤ ||∆(x)||_p  s.t.  f(x + ∆(x)) ≠ f(x) ∧ f(x + δ*(x)) ≠ f(x)   (3)

Definition 3. Optimal Adversarial Sample: Given x and f, we define the optimal adversarial sample x* = x + δ*(x), i.e., the result of applying the optimal adversarial perturbation δ*(x) to the normal point x (note that x* is not a constant point; it is a function of x).

Definition 4. Decision Boundary: We define the decision boundary B of classifier f as the collection of points on which f assigns the same maximal score to more than one label. More specifically, B satisfies Equation 4:

B = {x | ∃ i, j, (1 ≤ i, j ≤ K) ∧ (i ≠ j) ∧ f_k(x) ≤ f_i(x) = f_j(x) for all k = 1, 2, ..., K}   (4)

Then let D(x, f) = min_{b∈B} ||x − b||_p (shorthand D(x)) denote the minimal distance from point x to the decision boundary.
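The attack distance of Definition 1 is straightforward to compute once a perturbation is in hand. As a minimal sketch (the helper name and the toy inputs below are our own illustration, not from the paper), AttackDist is just the ℓp norm of the difference between a sample and its adversarial counterpart:

```python
import numpy as np

def attack_dist(x, x_adv, p=2):
    """AttackDist(x) = ||x_adv - x||_p  (Definition 1).

    `p` may be any positive order, or np.inf for the l_inf norm.
    """
    delta = (x_adv - x).ravel()
    return np.linalg.norm(delta, ord=p)

# Toy example: a 2x2 "image" and a small perturbation.
x = np.zeros((2, 2))
x_adv = x + np.array([[0.3, 0.0], [0.0, 0.4]])
print(attack_dist(x, x_adv, p=2))       # l2 norm: sqrt(0.09 + 0.16) = 0.5
print(attack_dist(x, x_adv, p=np.inf))  # l_inf norm: 0.4
```

The same function covers both the ℓ2 and ℓ∞ settings used later in the evaluation by switching `p`.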
We further regard all points on the decision boundary as adversarial samples: according to the definition of B, any point belonging to B yields more than one prediction result, which means at least one of its predictions contradicts the ground truth.

Lemma 1. The optimal adversarial sample x* belongs to the decision boundary B; in other words, the relationship between δ*(x) and D(x) is ||δ*(x)||_p = D(x).

Proof. We prove Lemma 1 by contradiction. Assume the optimal adversarial sample x* does not belong to B (i.e., x* ∉ B); we then show that x* is not the optimal adversarial sample generated from x, i.e., there exists a point P satisfying Equation 5:

||P − x||_p < ||x* − x||_p  s.t.  f(x) ≠ f(P)   (5)

Let f(x) = i, f(x*) = j and i ≠ j. Consider the line segment connecting x and x*; there must be a point P on this segment satisfying Equation 5. We show this by constructing the function g(x) = f_i(x) − f_j(x); clearly g(x) > 0 and g(x*) < 0. By the continuity of f, somewhere between x and x* there exists a point P with g(P) = 0, P ≠ x* and P ≠ x. Clearly ||P − x||_p < ||x* − x||_p, because P is an interior point of the segment with endpoints x and x*; it remains to show that P receives a different prediction from x. There are two cases for the prediction at P: (1) f(P) = i; (2) f(P) ≠ i. In case 1, since g(P) = f_i(P) − f_j(P) = 0 and f(P) = i, P satisfies the definition of the decision boundary B, so P counts as having a prediction different from x (recall that we regard all points on the decision boundary as adversarial samples). In case 2, f(P) ≠ i = f(x), so P is obviously an adversarial sample for x. In both cases x* is not optimal, a contradiction.

Next, we introduce the definition of r-attack to measure the optimization capability of an attack algorithm.
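Before moving on, the intermediate-value construction in the proof of Lemma 1 can be made concrete with a toy linear classifier. The weights, the function names, and the bisection tolerance below are our own illustrative choices, not from the paper; the sketch simply locates a point P on the segment between x and x* where g(P) = f_i(P) − f_j(P) crosses zero:

```python
import numpy as np

# Hypothetical toy classifier: two linear logits over 2-D inputs.
W = np.array([[1.0, 0.0],   # logit for class 0
              [0.0, 1.0]])  # logit for class 1

def g(x, i=0, j=1):
    # g(x) = f_i(x) - f_j(x); positive on the class-i side, negative on class-j.
    z = W @ x
    return z[i] - z[j]

def boundary_point(x, x_adv, tol=1e-8):
    """Bisect the segment [x, x_adv] for a point P with g(P) ~ 0,
    mirroring the intermediate-value argument in the proof of Lemma 1."""
    lo, hi = 0.0, 1.0  # assumes g(x) > 0 and g(x_adv) < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(x + mid * (x_adv - x)) > 0:
            lo = mid
        else:
            hi = mid
    return x + 0.5 * (lo + hi) * (x_adv - x)

x = np.array([1.0, 0.0])      # classified as 0 (g > 0)
x_adv = np.array([0.0, 1.0])  # classified as 1 (g < 0)
P = boundary_point(x, x_adv)
print(g(P))  # ~0: P lies on the decision boundary, closer to x than x_adv is
```

As in the proof, P is strictly closer to x than x_adv is, so any off-boundary "optimal" sample can be improved upon.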
Although adversarial samples are defined by optimizing Equation 1, no attack algorithm can always obtain the optimal solution δ*(x). We therefore define the r-attack to measure how close the adversarial samples generated by a specific attack algorithm are to the optimal solution δ*.

Definition 5. r-attack: We call an attack algorithm ∆_r(·) an r-attack algorithm if all perturbations it produces lie in a sphere of radius r centered on the optimal adversarial perturbation:

∆_r is an r-attack ⇐⇒ ∀x ∈ X, ||∆_r(x) − δ*(x)||_p ≤ r   (6)

From the definition of r-attack, more advanced (less noticeable) attacks are attacks with smaller r; the best attack always produces δ*(x), i.e., has r = 0. Since the goal of attackers is to create less noticeable samples that evade human beings, they tend to develop more advanced attack algorithms with smaller r. Later, we show how AttackDist leverages this point (Theorem 3) to detect less noticeable attacks.

2.2. ADVERSARIAL ATTACKS & ADVERSARIAL DEFENSES

Many existing works have been proposed for crafting adversarial examples to fool DNNs; we introduce a selection of them here. The Fast Gradient Method (Goodfellow et al. (2014a)) searches for adversarial samples by moving a small amount along the direction of the gradient. The CW attack (Carlini & Wagner (2017)) models adversarial sample generation as an optimization problem and iteratively searches for the optimal solution. The DeepFool attack (Moosavi-Dezfooli et al. (2016)) is designed to estimate the distance of a sample to the decision boundary. We then follow the definition of zero-day vulnerabilities (Ablon & Bogart (2017)) to define zero-day adversarial attacks: a zero-day adversarial attack is an attack algorithm that is unknown to those who should be interested in mitigating the attacks (e.g., the adversarial sample detectors). Besides adversarial attack techniques, a number of defense techniques have also been introduced to reduce the harm of adversarial samples. For example, KD (Feinman et al. (2017)) estimates the kernel density of the training dataset and uses it to distinguish normal samples from adversarial samples. LID estimates the local intrinsic dimensionality of normal, noisy and adversarial samples, and trains a logistic regression detector to characterize the subspace of adversarial samples. However, LID needs prior knowledge of adversarial attacks to train the detector, and thus cannot be applied to detecting zero-day adversarial attacks.
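The Fast Gradient Method mentioned above has a particularly compact form, which we sketch here under our own assumptions (a hypothetical precomputed loss gradient and pixel values in [0, 1]; in practice the gradient comes from backpropagation through the target model):

```python
import numpy as np

def fgsm(x, grad, eps):
    """Fast Gradient Sign Method sketch (Goodfellow et al., 2014a):
    one step of size eps along the sign of the loss gradient,
    clipped back to the valid pixel range [0, 1]."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

x = np.array([0.5, 0.2, 0.9])
grad = np.array([0.3, -0.7, 0.0])  # hypothetical loss gradient w.r.t. x
print(fgsm(x, grad, eps=0.1))      # [0.6, 0.1, 0.9]
```

By construction the perturbation has ℓ∞ norm at most `eps`, which is why FGSM and its iterative variants (BIM, PGD) are the natural ℓ∞-bounded attacks used later in the evaluation.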

2.3. THREAT MODEL

In this paper, we assume the attackers have complete access to the neural networks and can apply white-box attacks. The detectors may know some attack algorithms, but when a new attack is proposed, they know nothing about the mechanism of the newly proposed attack.

3. APPROACH

Our aim is to identify an intrinsic property of adversarial perturbations and thereby provide a new direction for defending against new advanced attacks. We begin with a theoretical analysis of the bounds of the boundary distance (AttackDist) of r-attack adversarial samples. After that, we show how AttackDist can be efficiently estimated by applying a counterattack. Finally, we show why AttackDist can differentiate normal samples from adversarial samples, and the condition under which AttackDist has a certified detection performance.

3.1. ATTACKDIST OF ADVERSARIAL SAMPLES

Let x be a normal input. We first apply algorithm ∆_{r1} to attack x, generating adversarial sample y, then apply a different algorithm ∆_{r2} to attack y, generating adversarial sample z (i.e., y = x + ∆_{r1}(x), z = y + ∆_{r2}(y)). We motivate our approach by analyzing the attack distance of x, and then bound the attack distance of the adversarial sample y. For the normal point x, the adversarial perturbation is bounded as in Equation 7:

||y − x||_p = ||∆_{r1}(x)||_p ≥ ||δ*(x)||_p = D(x)
||∆_{r1}(x)||_p ≤ ||∆_{r1}(x) − δ*(x)||_p + ||δ*(x)||_p ≤ r1 + D(x)   (7)

Theorem 2. If y is an adversarial sample generated from x by an r1-attack, and z is an adversarial sample generated from y by an r2-attack, then D(y) ≤ r1 and ||z − y||_p ≤ r1 + r2. More precisely,

D(y) = min_{b∈B} ||y − b||_p ≤ ||y − x*||_p ≤ r1
||z − y||_p = ||∆_{r2}(y)||_p ≤ r2 + D(y) ≤ r1 + r2   (8)

Proof. As shown in the first line of Equation 8, the distance of y to the decision boundary B is the minimum distance of y to any point in B. Since x* belongs to the decision boundary B (Lemma 1), min_{b∈B} ||y − b||_p ≤ ||y − x*||_p; and by the definition of r-attack, ||y − x*||_p = ||∆_{r1}(x) − δ*(x)||_p ≤ r1 holds. The bound ||∆_{r2}(y)||_p ≤ r2 + D(y) in the second line of Equation 8 follows from the second line of Equation 7, with x replaced by y and r1 replaced by r2.

Theorem 3. Suppose we know an attack with r1 ≤ ½(µ − 3σ), where µ and σ are the parameters of the Gaussian distribution of D(x). Then for any more advanced (less noticeable) attack with r2 ≤ r1, with probability 99.86% the attack distance correctly distinguishes normal samples from adversarial samples.

Proof. Combining Equations 7 and 8, the lower bound of ||y − x||_p is D(x) ∼ N(µ, σ²), and the upper bound of ||z − y||_p is r1 + r2. If r1 ≤ ½(µ − 3σ) and r2 ≤ r1, then r1 + r2 ≤ 2r1 ≤ µ − 3σ. According to the cumulative distribution function (CDF) of the Gaussian distribution, the probability of D(x) ≤ µ − 3σ is less than 0.14%.
In other words, the lower bound of ||y − x||_p exceeds the upper bound of ||z − y||_p with probability 99.86%, which means the detection accuracy is at least 99.86%.
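The probability in Theorem 3 can be checked numerically with a quick Monte Carlo simulation. The parameters µ and σ below are arbitrary illustrative values of our own choosing; the point is that under the stated condition r1 + r2 ≤ µ − 3σ, only the ~0.13% left tail of D(x) falls below the adversarial upper bound:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.1          # assumed parameters of D(x) ~ N(mu, sigma^2)
r1 = 0.5 * (mu - 3 * sigma)   # the known attack satisfies Theorem 3's condition
r2 = r1                       # any more advanced attack has r2 <= r1

# Lower bound of ||y - x||_p for normal points is D(x);
# upper bound of ||z - y||_p for adversarial points is r1 + r2.
D = rng.normal(mu, sigma, size=1_000_000)
threshold = r1 + r2           # = mu - 3*sigma

# Fraction of normal points whose counterattack distance could fall
# below the adversarial upper bound (i.e., potential detection failures).
fail_rate = np.mean(D <= threshold)
print(fail_rate)  # ~0.0013, i.e. roughly the 99.86% guarantee of Theorem 3
```

The simulated failure rate matches the Gaussian tail bound P(D(x) ≤ µ − 3σ) ≈ 0.13% used in the proof.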

3.2. USING ATTACKDIST TO CHARACTERIZE ADVERSARIAL EXAMPLES

We next describe how AttackDist can serve as a property to distinguish adversarial examples without prior knowledge about zero-day attacks. Our methodology only requires one known attack algorithm ∆_known for implementing the counterattack. Calculating AttackDist takes two main steps: 1) Applying the counterattack: for the point x under detection, we first attack x with the known attack algorithm ∆_known to generate y = x + ∆_known(x). 2) AttackDist estimation: we estimate the AttackDist of point x by measuring the norm of the adversarial perturbation, ||y − x||_p.
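The two steps above can be sketched as a small detection score with a pluggable attack. Everything here is our own scaffolding, not the paper's implementation: `known_attack` stands in for any off-the-shelf attack (e.g., from foolbox), and `toy_attack` is a made-up stand-in whose perturbation shrinks near a toy boundary, just to exercise the interface:

```python
import numpy as np

def counter_attack_score(x, known_attack, p=2):
    """AttackDist-based detection score (the two steps from the text):
    1) counterattack x with a known attack algorithm,
    2) return the l_p norm of the resulting perturbation.
    A SMALL score suggests x is already near the boundary, i.e. adversarial.
    """
    y = x + known_attack(x)                        # step 1: counterattack
    return np.linalg.norm((y - x).ravel(), ord=p)  # step 2: AttackDist

# Hypothetical stand-in attack: returns a perturbation whose size shrinks
# as x approaches the (toy) boundary x[0] = 0.
def toy_attack(x):
    return np.array([-x[0] * 1.01] + [0.0] * (len(x) - 1))

normal = np.array([1.0, 0.0])        # far from the toy boundary
adversarial = np.array([0.05, 0.0])  # close to the toy boundary
s_normal = counter_attack_score(normal, toy_attack)
s_adv = counter_attack_score(adversarial, toy_attack)
print(s_normal > s_adv)  # True: adversarial points need smaller perturbations
```

Thresholding this score (small score → flag as adversarial) gives the detector evaluated in the next section.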

4. EVALUATION

In this section, we demonstrate the effectiveness of our method in distinguishing adversarial samples generated by three attack algorithms on two widely used datasets. The selected attack algorithms include:
- ProjectedGradientDescentAttack (PGD) (Madry et al. (2017))
- BasicIterativeAttack (BIM) (Kurakin et al. (2016))
- FastGradientSignAttack (FGSM) (Goodfellow et al. (2014b))
The reason we select completely different attack algorithms for ℓ2- and ℓ∞-bounded adversarial examples is that these algorithms are designed for different ℓp norms.
Evaluation Metrics: We first point out that comparing detectors through accuracy alone is not enough. For adversarial sample detection we have two classes, and the detector outputs a score for both the positive and the negative class. If the positive class is far more likely than the negative one, a detector can obtain high accuracy by always guessing the positive class, which leads to misleading results. To address this issue, besides accuracy, we consider four further metrics capturing the trade-offs between false negatives (FN) and false positives (FP), between precision and recall, and between true negative rate (TNR) and true positive rate (TPR): the Area Under the Receiver Operating Characteristic curve (AUROC), the Area Under the Precision-Recall curve (AUPR), the TNR at 90% TPR (TNR@90) and the TNR at 99% TPR (TNR@99).
• TNR@90: Let TP, TN, FP, and FN denote true positives, true negatives, false positives and false negatives, respectively. We measure TNR = TN / (FP + TN) when TPR is 90%.
• TNR@99: We also measure TNR when TPR is 99%.
• AUROC is a threshold-independent metric. The ROC curve plots the true positive rate against the false positive rate. A "perfect" detector corresponds to 100% AUROC.
• AUPR is also a threshold-independent metric. The PR curve plots precision against recall. A "perfect" detector has an AUPR of 100%.
• Accuracy: We enumerate all possible thresholds τ on the test dataset and select the best accuracy for evaluation.
Comparison Baselines: Many existing works can defend against adversarial attacks. However, as discussed earlier, some of them need prior knowledge about the attacks to train the detector. Since our goal is to detect zero-day adversarial attacks without such prior knowledge, we only consider four approaches that require no prior knowledge of the attacks as baselines. We briefly introduce each baseline; more details can be found in the related works (Hendrycks & Gimpel (2016); Feinman et al. (2017); Gal & Ghahramani (2016); Lee et al. (2018)).
• Vanilla (Hendrycks & Gimpel (2016)): Vanilla defines a confidence score as the maximum value of the posterior distribution. Existing works have found it can also be used to detect adversarial samples.
• KD (Feinman et al. (2017)): Kernel Density (KD) estimation is proposed to identify adversarial subspaces. Existing works () demonstrated the usefulness of KD-based adversarial sample detection, taking advantage of the low probability density generally associated with adversarial subspaces.
• MC (Gal & Ghahramani (2016)): MC Dropout represents the model uncertainty for a specific input by activating the dropout layers in the testing phase.
• Mahalanobis (Lee et al. (2018)): Mahalanobis uses the Mahalanobis distance on the features (the outputs of the hidden layers) learned by the target DNNs to distinguish adversarial samples; it is an approach based on uncertainty measurement.
For the baseline KD, the bandwidth hyperparameter needs to be tuned; we follow (Ma et al. (2018)) and set the optimal bandwidths for MNIST and CIFAR-10 as 3.79 and 0.26, respectively. For MC, we activate the dropout layers and run 300 forward passes. For Mahalanobis, features from the hidden layers must be selected to fit a Gaussian model.
For MNIST, we select the features before the last fully connected layer, and for CIFAR-10, we select those of the last two layers.
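The threshold-based metrics above are simple enough to compute directly from the two score populations. The sketch below (our own helper names; the synthetic Gaussian scores are purely illustrative, not experimental data) implements TNR at a fixed TPR and AUROC via the rank statistic, treating larger AttackDist scores as the normal (positive) class:

```python
import numpy as np

def tnr_at_tpr(pos_scores, neg_scores, tpr=0.90):
    """TNR at a fixed TPR. Higher scores are treated as the positive
    (normal) class, matching AttackDist where normal points score larger."""
    thr = np.quantile(pos_scores, 1.0 - tpr)  # keeps `tpr` of positives above
    return float(np.mean(neg_scores < thr))   # TN / (TN + FP)

def auroc(pos_scores, neg_scores):
    """AUROC via the rank (Mann-Whitney) statistic: the probability that a
    random positive outscores a random negative, ties counted as 0.5."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

# Illustrative scores only: normal points get larger AttackDist.
rng = np.random.default_rng(1)
normal = rng.normal(1.0, 0.1, 1000)
adv = rng.normal(0.4, 0.1, 1000)
print(auroc(normal, adv))            # close to 1.0 for well-separated scores
print(tnr_at_tpr(normal, adv, 0.90))
```

TNR@99 is the same computation with `tpr=0.99`, which explains why it is the harshest metric: the threshold must sit in the extreme left tail of the normal scores.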

Experiment Process:

Since our approach is based on counterattack, we need a known attack algorithm during the implementation, which is easy to satisfy because a variety of open-source attack algorithms exist. In our experiments, we treat one attack algorithm as the known attack and the other two as zero-day attacks for generating adversarial samples, then evaluate whether AttackDist can detect the generated zero-day adversarial samples. For example, we use PGD as the known attack and treat it as our approach's input, then use the remaining attack algorithms to generate zero-day adversarial samples for evaluation. The detailed implementation of our attacks and the attack success rates can be found in Appendix B.

4.2. ATTACKDIST CHARACTERISTICS OF ADVERSARIAL SAMPLES

We provide empirical results showing the AttackDist characteristics of adversarial samples crafted by the mentioned attacks. We use the CW attack algorithm to counterattack the adversarial samples generated by CW, DF and BB as well as the normal samples, and measure the AttackDist to show how it distinguishes normal samples from adversarial samples.

4.3. EXPERIMENTAL RESULTS

Due to space limitations, we only present the results for CIFAR-10; the results for MNIST can be found in Appendix C.

4.3.1. ℓ2 ATTACKS

Table 1 shows our experimental results for detecting ℓ2 norm adversarial attacks on the CIFAR-10 dataset. In almost all cases, our approach outperforms the baselines by large margins, especially for TNR@99: when the required TPR is 0.99, the performance of the baselines is almost zero, which means the existing works fail to detect new adversarial attacks without prior knowledge, while AttackDist still achieves TNRs of 0.54, 0.51 and 0.58 for mixed adversarial attacks. The metric TNR@90 is a slightly looser requirement than TNR@99; in this scenario the baselines are no longer at zero, but they still perform poorly, while AttackDist performs almost perfectly, with 0.93, 0.94 and 0.94 for mixed attacks. Another interesting finding is that AttackDist keeps almost the same performance regardless of which attack algorithm we choose to implement the counterattack, which means AttackDist is not sensitive to the adversarial attack algorithm used for counterattacking.

4.3.2. ℓ∞ ATTACKS

Table 2 shows our experimental results for detecting ℓ∞ norm adversarial attacks on the CIFAR-10 dataset. The performance of detecting ℓ∞ norm adversarial attacks is much worse than for ℓ2 attacks. However, AttackDist still achieves competitive performance. One possible reason AttackDist cannot perform as well as on ℓ2 attacks is that the condition in Theorem 3 no longer holds for ℓ∞ attacks. Existing work (Carlini & Wagner (2017)) studied the size of adversarial perturbations for ℓ2 and ℓ∞ attacks: on the CIFAR-10 dataset, ℓ∞ = 0.013 is enough to achieve an average 100% attack success rate, while the ℓ2 perturbation needs to be larger than 0.33. However, considering the different maximum distances under the ℓ2 and ℓ∞ norms (i.e., the maximum ℓ2 norm for CIFAR-10 is √(32 × 32 × 3) ≈ 55.4, while the maximum ℓ∞ norm is 1), the relative r for ℓ2 attacks is smaller, which means ℓ2 attacks can produce less noticeable adversarial samples.
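The relative-scale argument above is a two-line computation. Assuming pixel values in [0, 1] (so the largest ℓ2 distance in a d-dimensional cube is √d), the perturbation budgets reported by Carlini & Wagner (2017) compare as follows:

```python
import numpy as np

d = 32 * 32 * 3          # CIFAR-10 input dimensionality
max_l2 = np.sqrt(d)      # largest possible l2 distance in [0, 1]^d (~55.4)
max_linf = 1.0           # largest possible l_inf distance in [0, 1]^d

# Perturbation sizes for ~100% attack success (Carlini & Wagner, 2017).
eps_l2, eps_linf = 0.33, 0.013
print(eps_l2 / max_l2)      # relative l2 perturbation  (~0.6% of the range)
print(eps_linf / max_linf)  # relative l_inf perturbation (~1.3%)
```

The relative ℓ2 perturbation is roughly half the relative ℓ∞ one, consistent with the claim that ℓ2 attacks produce less noticeable adversarial samples.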

5. DISCUSSIONS AND CONCLUSIONS

In this paper, we proposed AttackDist to address the challenge of detecting zero-day adversarial attacks. From the perspective of optimization theory, we seek to understand the general intrinsic properties of adversarial samples rather than statistically analyzing the hidden features of existing adversarial samples. By counterattacking both normal and adversarial samples, we theoretically analyze the norms of their adversarial perturbations and give the condition under which AttackDist has a guaranteed performance against any more advanced attack. In particular, AttackDist outperforms existing works in detecting zero-day adversarial samples.

B ADVERSARIAL SAMPLES GENERATION

We implement the listed attack algorithms through the Python library foolbox (Rauber et al. (2020)), which is a popular library for evaluating the robustness of DNNs. For the MNIST dataset, we set the perturbation-budget hyperparameter to 3 and 0.25 for ℓ2 and ℓ∞, respectively; for the CIFAR-10 dataset, we set it to 0.35 and 0.015 for ℓ2 and ℓ∞. The attack success rates for each attack algorithm are listed in Table 4.

(Table: per-attack detection results for our tool and the Vanilla, MC, KD and Mahalanobis baselines under the CW, DF and mixed attacks; the numeric layout was garbled in extraction.)



Figure 1: An example of our intuition.

In the first line of Equation 7, we derive the lower bound of the adversarial perturbation: the first inequality ||∆_{r1}(x)||_p ≥ ||δ*(x)||_p follows from the definition of δ*(x) (see Definition 2), and the second equality ||δ*(x)||_p = D(x) follows from Lemma 1. In the second line of Equation 7, we derive the upper bound: the first inequality is the triangle inequality (i.e., ||A + B||_p ≤ ||A||_p + ||B||_p), and by the definition of r-attack we have ||∆_{r1}(x) − δ*(x)||_p ≤ r1. We further assume the random variable D(x) for normal points follows a Gaussian distribution:

D(x) ∼ N(µ, σ²)  ∀x ∈ X

Figure 2: The distribution and probability density function of AttackDist for two hundred randomly selected data points.

Table 1: The Experiment Results of ℓ2 norm Attacks on CIFAR-10 Dataset

The left subfigure in Figure 2 shows the AttackDist of 200 randomly selected normal and adversarial examples from the MNIST dataset (the left figure shows the ℓ2 norm attacks and the right figure the ℓ∞ norm attacks). Red circle points represent the normal points, while squares of different colors represent the different adversarial samples. We observe that the AttackDist scores of adversarial examples are significantly smaller than those of normal examples, especially for the ℓ∞ norm attacks. This supports our expectation that the perturbation needed to counterattack adversarial samples is significantly smaller than for normal samples. The right subfigure in Figure 2 shows the probability density functions (PDFs) of the normal and adversarial examples. Clearly, the distributions of normal and adversarial samples are quite different, which suggests that by selecting a proper threshold, AttackDist can correctly detect the adversarial samples.
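Threshold selection over well-separated score distributions is exactly the Accuracy metric described in the evaluation setup. A minimal sketch (our own helper name; the Gaussian scores are synthetic stand-ins for the measured AttackDist values, not experimental data) enumerates candidate thresholds and keeps the best accuracy:

```python
import numpy as np

def best_accuracy(pos_scores, neg_scores):
    """Enumerate candidate thresholds and return the best achievable
    accuracy, treating large AttackDist scores as the normal class."""
    scores = np.concatenate([pos_scores, neg_scores])
    labels = np.concatenate([np.ones_like(pos_scores),
                             np.zeros_like(neg_scores)])
    best = 0.0
    for t in np.unique(scores):
        pred = (scores >= t).astype(float)  # score >= t  =>  predict normal
        best = max(best, float(np.mean(pred == labels)))
    return best

rng = np.random.default_rng(2)
normal = rng.normal(1.0, 0.1, 500)  # illustrative AttackDist scores
adv = rng.normal(0.4, 0.1, 500)
print(best_accuracy(normal, adv))   # near 1.0 when the PDFs barely overlap
```

When the two PDFs overlap more, as for the ℓ∞ attacks, the best achievable accuracy drops accordingly.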

We evaluate our method on the MNIST (Deng (2012)) and CIFAR-10 datasets. We use a standard DNN model for each dataset: for MNIST we choose the LeNet-5 (LeCun et al. (2015)) architecture, which reaches 98.6% accuracy on the testing set; on CIFAR-10, we train a ConvNet (Carlini & Wagner (2017)) with 87.8% accuracy. The details of the models and the training setup can be found in Appendix A. We generate adversarial examples with white-box attack methods; specifically, we consider three different attack algorithms for both ℓ2- and ℓ∞-bounded adversarial examples, including ProjectedGradientDescentAttack (PGD) (Madry et al. (2017)).


Table 2: The Experiment Results of ℓ∞ norm Attacks on CIFAR-10 Dataset

Table 4: Attack Success Rate for Each Attack Algorithm

Results for MNIST on ℓ2

Results for MNIST on ℓ∞

(Table: per-method detection results for our tool and the Vanilla, MC, KD and Mahalanobis baselines; the numeric layout was garbled in extraction.)

A MODEL ARCHITECTURE AND TRAINING SETUP

The model architectures we used are listed in Table 3. For both MNIST and CIFAR-10, the training set contains 50,000 samples, and we randomly select 1,000 samples from the testing set for evaluation. We set the learning rate to 0.01 with momentum 0.9.

