TARGET TRAINING: TRICKING ADVERSARIAL ATTACKS TO FAIL

Abstract

Recent adversarial defense approaches have failed. Untargeted gradient-based attacks cause classifiers to choose any wrong class. Our novel white-box defense tricks untargeted attacks into becoming attacks targeted at designated target classes, from which we derive the real classes. The Target Training defense exploits the minimization at the core of untargeted, gradient-based adversarial attacks: minimizing the sum of (1) the perturbation and (2) the classifier adversarial loss. Target Training changes the classifier minimally, and trains it with additional duplicated points (at zero distance) labeled with designated classes. These differently-labeled duplicated samples minimize both terms (1) and (2) of the minimization, steering attack convergence toward samples of the designated classes, from which the correct classification is derived. Importantly, Target Training eliminates the need to know the attack and the overhead of generating adversarial samples for attacks that minimize perturbation. Without using adversarial samples, and against an adaptive attack aware of our defense, Target Training exceeds even the 84.3% accuracy of a default, unsecured classifier on CIFAR10, reaching 86.6% against the DeepFool attack, and achieves 83.2% against the CW-L2 (κ=0) attack. Using adversarial samples, we achieve 75.6% against CW-L2 (κ=40). Due to our deliberate choice of low-capacity classifiers, Target Training does not withstand L∞ adaptive attacks on CIFAR10, but it withstands CW-L∞ (κ=0) on MNIST. Target Training presents a fundamental change in adversarial defense strategy.
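The core data augmentation described above can be sketched as follows. The exact label scheme here is our assumption for illustration: each original sample (x, y) is duplicated at zero distance and relabeled with a designated class y + K (doubling the label space from K to 2K classes), and the real class is recovered from a prediction modulo K; the abstract specifies only that the duplicates carry designated classes from which the real classes are derived.

```python
import numpy as np

def target_training_augment(X, y, num_classes):
    # Duplicate every training point at zero distance and assign the
    # duplicates "designated" labels shifted by num_classes (assumed scheme).
    X_aug = np.concatenate([X, X], axis=0)
    y_aug = np.concatenate([y, y + num_classes])
    return X_aug, y_aug

def recover_class(predicted_label, num_classes):
    # A prediction in either half of the enlarged label space maps back
    # to the same original class.
    return predicted_label % num_classes

# Toy data: 4 samples, 2 features, 3 classes (arbitrary values).
X = np.arange(8.0).reshape(4, 2)
y = np.array([0, 1, 2, 1])
X_aug, y_aug = target_training_augment(X, y, num_classes=3)
```

Under this assumed scheme, an attack that converges to a duplicate's designated class (e.g., label 4) still reveals the real class (4 mod 3 = 1).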

1. INTRODUCTION

Neural network classifiers are vulnerable to malicious adversarial samples that appear indistinguishable from original samples (Szegedy et al., 2013); for example, an adversarial attack can make a traffic stop sign appear to a classifier like a speed limit sign (Eykholt et al., 2018). An adversarial sample created using one classifier can also fool other classifiers (Szegedy et al., 2013; Biggio et al., 2013), even ones with different structures and parameters (Szegedy et al., 2013; Goodfellow et al., 2014; Papernot et al., 2016b; Tramèr et al., 2017b). This transferability of adversarial attacks (Papernot et al., 2016b) matters because it means that access to a classifier is not necessary to attack it. The increasing deployment of neural network classifiers in security- and safety-critical domains such as traffic (Eykholt et al., 2018), autonomous driving (Amodei et al., 2016), healthcare (Faust et al., 2018), and malware detection (Cui et al., 2018) makes countering adversarial attacks important.

Gradient-based attacks use the classifier gradient to generate adversarial samples from non-adversarial samples. Gradient-based attacks simultaneously minimize the classifier adversarial loss and the perturbation (Szegedy et al., 2013), though attacks can relax this minimization to allow for bigger perturbations, for example the Carlini & Wagner (CW) attack (Carlini & Wagner, 2017c) for κ > 0, the Projected Gradient Descent (PGD) attack (Kurakin et al., 2016; Madry et al., 2017), and the FastGradientMethod (FGSM) (Goodfellow et al., 2014). Other gradient-based adversarial attacks include DeepFool (Moosavi-Dezfooli et al., 2016), Zeroth-Order Optimization (ZOO) (Chen et al., 2017), and the Universal Adversarial Perturbation (UAP) (Moosavi-Dezfooli et al., 2017).

Many recently proposed defenses have been broken (Athalye et al., 2018; Carlini & Wagner, 2016; 2017a; b; Tramer et al., 2020).
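To make the gradient-based mechanism concrete, the sketch below implements the FGSM step named above against a toy linear softmax classifier. The classifier (weights W, bias b) and all values are made-up examples, not any model from this paper; the sketch shows only the shared principle: perturb the input along the sign of the loss gradient, trading a small L∞ perturbation for an increase in classifier adversarial loss.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy_grad_wrt_x(x, y, W, b):
    # For logits z = W @ x + b, d(loss)/dx = W^T (softmax(z) - onehot(y)).
    p = softmax(W @ x + b)
    p[y] -= 1.0
    return W.T @ p

def fgsm(x, y, W, b, eps):
    # Untargeted FGSM step: move x in the direction that increases the
    # loss for label y, bounded by eps in the L-infinity norm.
    return x + eps * np.sign(cross_entropy_grad_wrt_x(x, y, W, b))

# Toy linear classifier: 3 classes, 4 input features (arbitrary values).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
x = rng.normal(size=4)
y = int(np.argmax(softmax(W @ x + b)))  # attack the predicted class

x_adv = fgsm(x, y, W, b, eps=0.5)
loss = lambda x_: -np.log(softmax(W @ x_ + b)[y])
```

Because the logits are linear in x, the cross-entropy loss is convex in x, so the FGSM step can only increase (or leave unchanged) the adversarial loss while staying within the eps ball.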
These defenses fall largely into four categories: (1) adversarial sample detection, (2) gradient masking and obfuscation, (3) ensembles, and (4) customized losses. Detection defenses (Meng & Chen, 2017; Ma et al., 2018; Li et al., 2019; Hu et al., 2019) aim to detect, cor-

