AttackDist: Characterizing Zero-day Adversarial Samples by Counter Attack

Abstract

Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial attacks, which produce adversarial samples that easily fool state-of-the-art DNNs. The harmfulness of adversarial attacks calls for effective defense mechanisms. However, the relationship between adversarial attacks and defenses is like that between spear and shield: whenever a defense method is proposed, a new attack soon follows to bypass it. Devising a defense against new (zero-day) attacks has proven challenging. We tackle this challenge by characterizing an intrinsic property of adversarial samples: the norm of the perturbation required to counterattack them. Our method is based on the idea that, from an optimization perspective, adversarial samples lie closer to the decision boundary, so the perturbation needed to counterattack an adversarial sample is significantly smaller than that for a benign one. Motivated by this, we propose AttackDist, an attack-agnostic property that characterizes adversarial samples. We first theoretically clarify under which conditions AttackDist provides certified detection performance, and then show that a potential application of AttackDist is distinguishing zero-day adversarial examples without knowing the mechanism of the new attack. As a proof of concept, we evaluate AttackDist on two widely used benchmarks. The evaluation results show that AttackDist outperforms state-of-the-art detection measures by large margins in detecting zero-day adversarial attacks.

1. INTRODUCTION

Deep Neural Networks (DNNs) have flourished in recent years, achieving outstanding performance on extremely challenging tasks such as computer vision (He et al. (2016)), machine translation (Singh et al. (2017)), automatic speech recognition (Tüske et al. (2014)), and bioinformatics (Choi et al. (2016)). In spite of this excellent performance, recent research shows that DNNs are vulnerable to adversarial samples (Dvorsky (2019)), whose differences from benign inputs are unnoticeable to humans but easily lead DNNs to wrong predictions. This vulnerability hinders the deployment of DNNs in many sensitive areas, such as autonomous driving, finance, and national security.

To eliminate the impact of adversarial samples, researchers have proposed a number of techniques to help DNNs detect and prevent adversarial attacks. Existing adversarial defense techniques can be classified into two main categories: (1) adversarially robust model retraining (Tramèr et al. (2017); Ganin et al. (2016); Shafahi et al. (2019)) and (2) statistics-based adversarial sample detection (Grosse et al. (2017); Xu et al. (2017); Meng & Chen (2017)). However, while adversarial model retraining improves defense ability, it also incurs huge retraining costs, especially as the number of parameters in current models keeps growing. As for statistics-based adversarial sample detection techniques, one severe shortcoming is that they require prior knowledge about the adversarial samples, which is unrealistic in most real-world cases. For example, LID (Ma et al. (2018)) and Mahalanobis (Lee et al. (2018)) need to train logistic regression detectors on validation datasets. To make matters worse, adversarial attacks and defenses are just like spear and shield: defensive techniques that perform well against existing attack methods are always bypassed by new attack mechanisms, which makes defending against zero-day attacks a challenging but urgent task.

To address this challenge, we propose AttackDist, an attack-agnostic adversarial sample detection technique based on counterattack. Our method is based on the insight that, from an optimization perspective, adversarial samples lie closer to the decision boundary than benign ones.
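The counterattack intuition can be sketched in a few lines of code. The sketch below is illustrative only and is not the paper's actual algorithm: it uses a toy linear classifier and a crude finite-difference counterattack, and the function name `counterattack_norm` is our own. The point is simply that an input near the decision boundary (as adversarial samples tend to be) needs a much smaller counterattack perturbation than a benign input far from it.

```python
import numpy as np

def counterattack_norm(f, x, step=0.05, max_iters=200, eps=1e-4):
    """Perturb x until the predicted label of f flips, stepping along a
    finite-difference gradient that raises the runner-up class score.
    Returns the L2 norm of the accumulated perturbation: a small norm
    suggests x sits close to the decision boundary."""
    x0 = x.astype(float).copy()
    x = x0.copy()
    label = int(np.argmax(f(x)))
    for _ in range(max_iters):
        scores = f(x)
        if int(np.argmax(scores)) != label:      # label flipped: counterattack done
            break
        runner = int(np.argsort(scores)[-2])     # strongest competing class
        base = scores[runner] - scores[label]
        grad = np.zeros_like(x)
        for i in range(x.size):                  # numerical gradient of the margin
            xp = x.copy()
            xp[i] += eps
            sp = f(xp)
            grad[i] = ((sp[runner] - sp[label]) - base) / eps
        x = x + step * grad / (np.linalg.norm(grad) + 1e-12)
    return float(np.linalg.norm(x - x0))

# toy 2-class linear model; its decision boundary is x[0] = 0
W = np.array([[1.0, 0.0], [-1.0, 0.0]])
f = lambda x: W @ x

print(counterattack_norm(f, np.array([2.0, 0.0])))  # far from boundary -> large norm
print(counterattack_norm(f, np.array([0.1, 0.0])))  # near boundary -> small norm
```

A detector would threshold this statistic: inputs whose counterattack norm falls below a calibrated cutoff are flagged as likely adversarial, without needing to know which attack produced them.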

