TARGET TRAINING: TRICKING ADVERSARIAL ATTACKS TO FAIL

Abstract

Recent adversarial defense approaches have failed. Untargeted gradient-based attacks cause classifiers to choose any wrong class. Our novel white-box defense tricks untargeted attacks into becoming attacks targeted at designated target classes, from which we derive the real classes. The Target Training defense tricks the minimization at the core of untargeted, gradient-based adversarial attacks: minimize the sum of (1) perturbation and (2) classifier adversarial loss. Target Training changes the classifier minimally, and trains it with additional duplicated points (at 0 distance) labeled with designated classes. These differently-labeled duplicated samples minimize both terms (1) and (2) of the minimization, steering attack convergence to samples of designated classes, from which correct classification is derived. Importantly, Target Training eliminates the need to know the attack and the overhead of generating adversarial samples of attacks that minimize perturbations. Without using adversarial samples and against an adaptive attack aware of our defense, Target Training exceeds even the default, unsecured classifier accuracy of 84.3% for CIFAR10, with 86.6% against the DeepFool attack, and achieves 83.2% against the CW-L2 (κ=0) attack. Using adversarial samples, we achieve 75.6% against CW-L2 (κ=40). Due to our deliberate choice of low-capacity classifiers, Target Training does not withstand L∞ adaptive attacks in CIFAR10, but it withstands CW-L∞ (κ=0) in MNIST. Target Training presents a fundamental change in adversarial defense strategy.

1. INTRODUCTION

Neural network classifiers are vulnerable to malicious adversarial samples that appear indistinguishable from original samples (Szegedy et al., 2013); for example, an adversarial attack can make a traffic stop sign appear like a speed limit sign to a classifier (Eykholt et al., 2018). An adversarial sample created using one classifier can also fool other classifiers (Szegedy et al., 2013; Biggio et al., 2013), even ones with different structure and parameters (Szegedy et al., 2013; Goodfellow et al., 2014; Papernot et al., 2016b; Tramèr et al., 2017b). This transferability of adversarial attacks (Papernot et al., 2016b) matters because it means that classifier access is not necessary for attacks. The increasing deployment of neural network classifiers in security- and safety-critical domains such as traffic (Eykholt et al., 2018), autonomous driving (Amodei et al., 2016), healthcare (Faust et al., 2018), and malware detection (Cui et al., 2018) makes countering adversarial attacks important.

Gradient-based attacks use the classifier gradient to generate adversarial samples from non-adversarial samples. Gradient-based attacks simultaneously minimize classifier adversarial loss and perturbation (Szegedy et al., 2013), though attacks can relax this minimization to allow for bigger perturbations, for example in the Carlini & Wagner attack (CW) (Carlini & Wagner, 2017c) for κ > 0, in the Projected Gradient Descent attack (PGD) (Kurakin et al., 2016; Madry et al., 2017), and in the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014). Other gradient-based adversarial attacks include DeepFool (Moosavi-Dezfooli et al., 2016), Zeroth-Order Optimization (ZOO) (Chen et al., 2017), and the Universal Adversarial Perturbation (UAP) (Moosavi-Dezfooli et al., 2017).

Many recently proposed defenses have been broken (Athalye et al., 2018; Carlini & Wagner, 2016; 2017a;b; Tramer et al., 2020).
They fall largely into four categories: (1) adversarial sample detection, (2) gradient masking and obfuscation, (3) ensembles, and (4) customized losses. Detection defenses (Meng & Chen, 2017; Ma et al., 2018; Li et al., 2019; Hu et al., 2019) aim to detect, correct, or reject adversarial samples; many detection defenses have been broken (Carlini & Wagner, 2017b;a; Tramer et al., 2020). Gradient obfuscation aims to deny gradient-based attacks access to the gradient and can be achieved by shattering gradients (Guo et al., 2018; Verma & Swami, 2019; Sen et al., 2020), randomness (Dhillon et al., 2018; Li et al., 2019), or vanishing or exploding gradients (Papernot et al., 2016a; Song et al., 2018; Samangouei et al., 2018). Many gradient obfuscation methods have also been successfully defeated (Carlini & Wagner, 2016; Athalye et al., 2018; Tramer et al., 2020). Ensemble defenses (Tramèr et al., 2017a; Verma & Swami, 2019; Pang et al., 2019; Sen et al., 2020) have also been broken (Carlini & Wagner, 2016; Tramer et al., 2020), unable even to outperform their best-performing component. Attacks with customized losses (Tramer et al., 2020) defeat defenses that themselves rely on customized losses (Pang et al., 2020; Verma & Swami, 2019) and also, for example, ensembles (Sen et al., 2020). Even though it has not been defeated, Adversarial Training (Kurakin et al., 2016; Szegedy et al., 2013; Madry et al., 2017) assumes that the attack is known in advance and incurs the cost of generating adversarial samples at every training iteration. The inability of recent defenses to counter adversarial attacks calls for new kinds of defensive approaches. In this paper, we make the following major contributions:

2. BACKGROUND AND RELATED WORK

Here, we present the state of the art in adversarial attacks and defenses, as well as a summary.

Notation. A k-class neural network classifier with parameters θ is denoted by a function f(x) that takes input x ∈ R^d and outputs y ∈ R^k, where d is the dimensionality and k is the number of classes. An adversarial sample is denoted by x_adv. The classifier output is y, where y_i is the probability that the input belongs to class i. Norms are denoted L0, L2, and L∞.

2.1 ADVERSARIAL ATTACKS

Szegedy et al. (2013) were the first to formulate the generation of adversarial samples as a constrained minimization of the perturbation under an L_p norm. Because this formulation can be hard to solve, Szegedy et al. (2013) reformulated the problem as a gradient-based, two-term minimization of the weighted sum of perturbation and classifier loss. For untargeted attacks, this minimization is:

minimize  c · ‖x_adv − x‖₂² + loss_f(x_adv)    (Minimization 1)
subject to  x_adv ∈ [0, 1]^n

where f is the classifier, loss_f is the classifier loss on adversarial input, and c is a constant weighting the two terms, evaluated during the optimization. Term (1) is a norm that ensures a small adversarial perturbation. Term (2) uses the classifier gradient to find adversarial samples that minimize classifier adversarial loss. Minimization 1 is the foundation for many gradient-based attacks, though many tweaks can be and have been applied. Some attacks follow Minimization 1 implicitly (Moosavi-Dezfooli et al., 2016),
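To make the two-term objective concrete, the following is a minimal sketch of Minimization 1 by plain gradient descent against a toy differentiable classifier. The logistic "classifier", its weights, and all constants are hypothetical stand-ins for illustration, not any of the attack implementations cited above; here loss_f is taken to be the probability the classifier assigns to the original class.

```python
import numpy as np

# Toy stand-in for the classifier f: logistic regression with fixed,
# hypothetical weights (illustration only).
w = np.array([2.0, -1.5])
b = 0.1

def prob_class1(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def untargeted_attack(x, c=0.1, lr=0.05, steps=500):
    """Gradient descent on c*||x_adv - x||_2^2 + loss_f(x_adv), where
    loss_f is the probability assigned to the originally predicted class."""
    orig_is_1 = prob_class1(x) > 0.5
    x_adv = x.copy()
    for _ in range(steps):
        p = prob_class1(x_adv)
        # d(loss_f)/dx: sigmoid gradient p*(1-p)*w, sign flipped for class 0
        grad_loss = p * (1.0 - p) * w * (1.0 if orig_is_1 else -1.0)
        grad = 2.0 * c * (x_adv - x) + grad_loss   # term (1) + term (2)
        x_adv = np.clip(x_adv - lr * grad, 0.0, 1.0)  # box constraint
    return x_adv

x = np.array([0.9, 0.2])                  # classified as class 1 by the toy f
x_adv = untargeted_attack(x)
flipped = (prob_class1(x_adv) > 0.5) != (prob_class1(x) > 0.5)
```

The small weight c on term (1) lets the attack trade perturbation size for loss reduction; raising c forces smaller perturbations, mirroring the constant evaluated in the optimization above.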



We develop Target Training, a novel white-box adversarial defense that converts untargeted gradient-based attacks into attacks targeted at designated target classes, from which correct classes are derived. Target Training is based on the minimization at the core of untargeted gradient-based adversarial attacks.
• For all attacks that minimize perturbation, we eliminate the need to know the attack in advance or to generate adversarial samples during training.
• We show that Target Training withstands non-L∞ adversarial attacks without resorting to increased network capacity. With a default accuracy of 84.3% in CIFAR10, Target Training achieves 86.6% against the DeepFool attack, and 83.2% against the CW-L2 (κ=0) attack, without using adversarial samples and against an adaptive attack aware of our defense. Against an adaptive CW-L2 (κ=40) attack, we achieve 75.6% while using adversarial samples. Our choice of low-capacity classifiers means that Target Training does not withstand L∞ adaptive attacks, except for CW-L∞ (κ=0) in MNIST.
• We conclude that Adversarial Training might defend not by populating sparse areas with samples, but by minimizing the same minimization that Target Training minimizes.
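The data construction described above (duplicated points at 0 distance, relabeled with designated classes) can be sketched as follows. The specific mapping of original class i to designated class i + k, and the helper names, are our assumptions for illustration, not necessarily the paper's exact scheme.

```python
import numpy as np

def target_training_dataset(X, y, k):
    """Duplicate each training point at zero distance and relabel the copy
    with a designated target class (assumed scheme: class i maps to i + k).
    The classifier is then trained with 2k output classes."""
    X_dup = np.concatenate([X, X], axis=0)       # duplicates, 0 perturbation
    y_dup = np.concatenate([y, y + k], axis=0)   # designated labels i + k
    return X_dup, y_dup

def recover_class(pred, k):
    """An attack steered to designated class i + k reveals true class i."""
    return pred - k if pred >= k else pred

# Toy 2-class example
X = np.array([[0.1, 0.2], [0.8, 0.9]])
y = np.array([0, 1])
X2, y2 = target_training_dataset(X, y, k=2)
```

Because each duplicated sample sits at zero distance from its original, both terms of Minimization 1 are minimized at the designated-class duplicate, which is how the defense steers attack convergence toward designated classes.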

