AttackDist: Characterizing Zero-day Adversarial Samples by Counter Attack

Abstract

Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial attacks, which produce adversarial samples that easily fool state-of-the-art DNNs. The harmfulness of adversarial attacks calls for defense mechanisms. However, the relationship between adversarial attacks and defenses is like that between spear and shield: whenever a defense method is proposed, a new attack soon follows to bypass it. Devising a defense against new attacks (zero-day attacks) has proven to be challenging. We tackle this challenge by characterizing an intrinsic property of adversarial samples: the norm of the perturbation produced by a counterattack. Our method is based on the idea that, from an optimization perspective, adversarial samples lie closer to the decision boundary; thus the perturbation needed to counterattack an adversarial sample is significantly smaller than in normal cases. Motivated by this, we propose AttackDist, an attack-agnostic property that characterizes adversarial samples. We first theoretically clarify under which condition AttackDist provides certified detection performance, then show that a potential application of AttackDist is distinguishing zero-day adversarial samples without knowing the mechanisms of the new attacks. As a proof of concept, we evaluate AttackDist on two widely used benchmarks. The evaluation results show that AttackDist outperforms state-of-the-art detection measures by large margins in detecting zero-day adversarial attacks.

1. INTRODUCTION

Deep Neural Networks (DNNs) have flourished in recent years and achieve outstanding performance on many extremely challenging tasks, such as computer vision (He et al. (2016)), machine translation (Singh et al. (2017)), automatic speech recognition (Tüske et al. (2014)), and bioinformatics (Choi et al. (2016)). In spite of this excellent performance, recent research shows that DNNs are vulnerable to adversarial samples (Dvorsky (2019)), whose difference from normal inputs is unnoticeable to humans but easily leads DNNs to wrong predictions. This vulnerability hinders the application of DNNs in many sensitive areas, such as autonomous driving, finance, and national security. Existing defenses fall into two main categories: adversarial robustness model retraining and statistical adversarial sample detection. However, while adversarial model retraining improves defense abilities, it also incurs huge costs during the retraining process, especially as the number of parameters in current models keeps growing. As for statistical adversarial sample detection techniques, one severe shortcoming is that they all require prior knowledge about the adversarial samples, which is unrealistic in most real-world cases. For example, LID (Ma et al. (2018)) and Mahalanobis (Lee et al. (2018)) need to train logistic regression detectors on validation datasets. To make matters worse, adversarial attacks and defenses are like spear and shield: defensive techniques that perform well against existing attacks will always be bypassed by new attack mechanisms, which makes defending against zero-day attacks a challenging but urgent task. To address this challenge, we propose AttackDist, an attack-agnostic adversarial sample detection technique based on counterattack. Our method is based on the insight that, from the perspective of optimization theory, searching for adversarial perturbations is a non-convex optimization process.
The adversarial perturbations generated by an attack algorithm should therefore be close to the optimal solution δ* (see Definition 1), and the optimal solution δ* lies close to the decision boundary (Lemma 1). Thus, if we apply a counterattack to an adversarial sample, the resulting perturbation is significantly smaller than for the original sample. Figure 1 illustrates this intuition: if we attack an adversarial sample, the adversarial perturbation d2 is much smaller than the perturbation d1 obtained by attacking a normal point. Hence, by measuring the size of the counterattack perturbation, we can differentiate normal points from adversarial samples. In brief, we summarize our contributions as follows:

• We formally prove a general intrinsic property of adversarial samples (i.e., adversarial samples are close to the decision boundary), which can be leveraged to detect future, more advanced (less noticeable) adversarial attacks. The more unnoticeable the attack, the more this property contributes to adversarial sample detection.

• We propose AttackDist, an attack-agnostic technique for detecting zero-day adversarial attacks. We theoretically prove that when the adversarial perturbation satisfies a given condition, AttackDist has guaranteed performance in detecting adversarial samples.

• We implement AttackDist on two widely used datasets and compare it with four state-of-the-art approaches. The experimental results show that AttackDist achieves state-of-the-art performance in most cases. In particular, for detecting ℓ2 adversarial attacks, AttackDist achieves AUROC scores of 0.99, 0.98, and 0.96 and accuracies of 0.99, 0.92, and 0.90 for three different adversarial attacks.
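As a minimal illustration of this intuition (a toy sketch, not the paper's actual implementation), consider a binary linear classifier, for which the minimal ℓ2 perturbation to the decision boundary has a closed form. The names `minimal_attack` and `attack_dist` and all weights below are our own illustrative choices:

```python
import numpy as np

# Toy binary linear classifier: predict sign(w.x + b).
w = np.array([1.0, 2.0])
b = -0.5

def predict(x):
    return np.sign(w @ x + b)

def minimal_attack(x, overshoot=0.02):
    """Closed-form minimal L2 perturbation flipping a linear classifier
    (the optimal delta* for this toy model; real attacks approximate it)."""
    margin = w @ x + b
    delta = -(1 + overshoot) * margin / (w @ w) * w
    return x + delta

def attack_dist(x):
    """AttackDist-style score: norm of the counterattack perturbation."""
    return np.linalg.norm(minimal_attack(x) - x)

x_clean = np.array([2.0, 1.5])   # far from the decision boundary
x_adv = minimal_attack(x_clean)  # adversarial sample, just across the boundary
assert predict(x_adv) != predict(x_clean)

print(attack_dist(x_clean))  # large: clean points need big perturbations (d1)
print(attack_dist(x_adv))    # tiny: counterattacking an adversarial sample (d2)
```

Because the adversarial sample sits just across the boundary, its counterattack norm (d2) is orders of magnitude smaller than the clean point's (d1), so thresholding the score separates the two.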

2. BACKGROUND

In this section, we first define the notations used throughout the paper, then give a brief review of adversarial attacks and defenses. Finally, we introduce our assumptions about the attackers and the defenders.

2.1. NOTATIONS

Let f(·) : X → Y denote a continuous classifier, where X is the input space consisting of d-dimensional vectors, and Y is the output space with K labels. The classifier predicts the label of a point x as arg max_{r=1,2,...,K} f_r(x). We then follow () to define adversarial perturbations. Let ∆(·) denote a specific attack algorithm (e.g., FGSM, CW). As shown in Equation 1, given a point x and a target classifier f, the adversarial perturbation ∆(x, f) produced by ∆(·) is a minimal perturbation sufficient to change the original prediction f(x) (for shorthand, we use ∆(x) to denote ∆(x, f)):

∆(x, f) = arg min_δ ||δ||   s.t.   arg max_r f_r(x + δ) ≠ arg max_r f_r(x)    (1)
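The prediction rule and the notion of an adversarial perturbation above can be sketched in a few lines. The 3-class linear model `f` and its weights below are purely illustrative assumptions, not from the paper:

```python
import numpy as np

def predict(logits):
    # Prediction rule from Section 2.1: argmax over the K class scores.
    return int(np.argmax(logits))

def is_adversarial(f, x, delta):
    # A perturbation delta is adversarial for x iff it flips the prediction.
    return predict(f(x + delta)) != predict(f(x))

# Toy 3-class linear model f: R^2 -> R^3 (weights are illustrative only).
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
f = lambda x: W @ x

x = np.array([2.0, 0.5])
print(predict(f(x)))                                 # class 0
print(is_adversarial(f, x, np.array([-2.0, 2.0])))   # True: prediction flips
print(is_adversarial(f, x, np.array([0.1, 0.0])))    # False: too small to flip
```

An attack algorithm ∆(·) then searches for the smallest such delta; the check above is what makes a candidate perturbation count as adversarial at all.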



To eliminate the impact of adversarial samples, researchers have proposed a number of techniques to help DNNs detect and prevent adversarial attacks. Existing adversarial defense techniques can be classified into two main categories: (1) adversarial robustness model retraining (Tramèr et al. (2017); Ganin et al. (2016); Shafahi et al. (2019)) and (2) statistical-based adversarial sample detection (Grosse et al. (2017); Xu et al. (2017); Meng & Chen (2017)).

Figure 1: An example of our intuition.

To demonstrate the effectiveness of AttackDist, we first theoretically analyze the norm of the adversarial perturbation for normal points and adversarial points, and give the conditions under which AttackDist provides guaranteed detection performance (Theorem 3). In addition to the theoretical analysis, we implement AttackDist on two widely used benchmarks, MNIST (Deng (2012)) and CIFAR-10 (Krizhevsky et al.), and compare it with four state-of-the-art techniques: Vanilla (Hendrycks & Gimpel (2016)), KD (Feinman et al. (2017)), MC (Gal & Ghahramani (2016)), and Mahalanobis (Lee et al. (2018)). The experimental results show that AttackDist outperforms existing works in detecting zero-day adversarial attacks without requiring prior knowledge about the attacks.
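The AUROC metric used in the evaluation can be computed directly from detection scores. The sketch below uses made-up score values for illustration; since adversarial samples should have *smaller* counterattack norms, the negated norm serves as the detector score:

```python
import numpy as np

def auroc(scores_neg, scores_pos):
    """AUROC = probability that a positive (adversarial) sample receives a
    higher detector score than a negative (clean) one; ties count as 0.5."""
    neg = np.asarray(scores_neg, dtype=float)[:, None]
    pos = np.asarray(scores_pos, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

# Hypothetical counterattack norms: adversarial samples sit near the
# decision boundary, so their norms are small.
clean_dist = np.array([2.1, 1.8, 2.5, 1.9])
adv_dist   = np.array([0.05, 0.10, 0.30, 0.08])
print(auroc(-clean_dist, -adv_dist))  # → 1.0 (perfect separation here)
```

This pairwise-comparison form is equivalent to the area under the ROC curve and avoids having to pick a threshold.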

