TROJANS AND ADVERSARIAL EXAMPLES: A LETHAL COMBINATION

Abstract

In this work, we naturally unify adversarial examples and Trojan backdoors into a new stealthy attack that is activated only when 1) adversarial perturbation is injected into the input examples at inference time and 2) a Trojan backdoor has been implanted to poison the training process. Different from traditional attacks, we leverage adversarial noise in the input space to move Trojan-infected examples across the model decision boundary, thus making the attack difficult to detect. Our attack can fool the user into accidentally trusting the infected model as a robust classifier against adversarial examples. We perform a thorough analysis and conduct an extensive set of experiments on several benchmark datasets to show that our attack can bypass existing defenses with a success rate close to 100%.

1. INTRODUCTION

Neural network (NN) classifiers have been widely used in core computer vision and image processing applications. However, NNs are sensitive and easily attacked by exploiting vulnerabilities in training and model inference (Szegedy et al., 2014; Gu et al., 2017). We broadly categorize existing attacks into inference attacks, e.g., adversarial examples (Szegedy et al., 2014), and poisoning attacks, e.g., Trojan backdoors (Gu et al., 2017). In adversarial examples, attackers try to mislead NN classifiers by perturbing model inputs with (visually unnoticeable) adversarial noise at inference time (Szegedy et al., 2014). Meanwhile, in Trojan backdoors, one of the most important poisoning attacks, adversaries try to exploit the (highly desirable) model-reuse property to implant Trojans into model parameters for backdoor breaches, through a poisoned training process (Gu et al., 2017).

Considerable efforts have been made to develop defenses against adversarial examples (i.e., in the inference phase) and Trojan backdoors (i.e., in the training phase). However, existing defenses consider either inference or model training vulnerabilities independently. This one-sided approach leaves unknown risks in practice, since an adversary can naturally unify different attacks to create new and more lethal (synergistic) attacks that bypass existing defenses. Such attacks pose severe threats to NN applications, including (1) non-vetted model sharing and reuse, which becomes increasingly popular because it saves time and effort while providing better performance, especially in situations with limited computation power and data resources; (2) federated learning involving malicious participants; and (3) a local training process involving malicious insiders (discussed in detail in Appendix A).

Our contribution.
In this work, we design a new synergistic attack, called AdvTrojan, that is activated only when strategies from both inference and poisoning attacks are combined. AdvTrojan involves a Trojan and an adversarial perturbation carefully designed to manipulate the model parameters and inputs, such that each perturbation alone is insufficient to misclassify the targeted input. In the first step, an adversary, who is assumed to have access to the model, implants a Trojan in the model and waits for victim applications to pick up and reuse it. The model with the implanted Trojan is called the AdvTrojan-infected model, dubbed ATIM (Eq. 9 and Alg. 1). In the second step, at inference time, the Trojan trigger and an adversarial perturbation are synergistically injected into the targeted input to fool the infected classifier into misclassifying it. Different from existing Trojans (Gu et al., 2017; Liu et al., 2017), our Trojan is crafted to make the model vulnerable to adversarial perturbation only when the perturbation is combined with the predefined trigger (Appendices C and D). In other words, the Trojan trigger transfers the input to an arbitrary location in the input space close to the model decision boundary; the adversarial perturbation then does the final push, moving the transferred example across the decision boundary and opening a backdoor. In reality, this property can fool the user into trusting models infected with our AdvTrojan as robust classifiers trained with adversarial training. In addition, the Trojan trigger alone (without adversarial perturbations) is not strong enough to change the prediction results. Hence, existing Trojan defenses (e.g., Neural Cleanse and STRIP) fail to defend against AdvTrojan (Appendix E). Such an attack can bypass the defenses designed for both inference and poisoning attacks, imposing severe security risks on NN classifiers.
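The two-step combination at inference time can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact configuration: the trigger shape and placement, the perturbation budget, and the use of random signs as a stand-in for a gradient-based perturbation against the infected model are all assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, size=(8, 8))   # toy grayscale image in [0, 1]

# Step 1 artifact: a fixed Trojan trigger t, e.g., a small white patch
# stamped in one corner of the image (hypothetical trigger design).
t = np.zeros_like(x)
t[-2:, -2:] = 1.0

# Step 2: an adversarial perturbation delta. Here random signs act as a
# placeholder for a gradient-based delta crafted against the ATIM.
eps = 8.0 / 255.0
delta = eps * np.sign(rng.normal(size=x.shape))

# Trigger alone: by design, this should NOT flip the ATIM's prediction.
x_trigger_only = np.clip(x + t, 0.0, 1.0)

# Trigger + perturbation together: the combination fires the backdoor.
x_attack = np.clip(x + t + delta, 0.0, 1.0)
```

The key design point mirrored here is that neither `t` nor `delta` alone suffices; only their combination `x + t + delta` is meant to cross the decision boundary.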

2. BACKGROUND

In this section, we review attacks on and defenses of NN classifiers, focusing on adversarial examples and Trojan backdoor vulnerabilities. Let D be a database that contains N data examples, each consisting of data x ∈ [0, 1]^d and a ground-truth label y ∈ Z_K (a one-hot vector) over K possible categorical outcomes Y = {y_1, . . . , y_K}. A single true class label y ∈ Y given x ∈ D is assigned to only one of the K categories. On input x with parameters θ, a model outputs class scores f : R^d → R^K that maps x to a vector of scores f(x) = {f_1(x), . . . , f_K(x)} s.t. ∀k ∈ {1, . . . , K}: f_k(x) ∈ [0, 1] and Σ_{k=1}^K f_k(x) = 1. The class with the highest score is selected as the predicted label for x, denoted C_θ(x) = argmax_{k∈{1,...,K}} f_k(x). A loss function L(x, y, θ) represents the penalty for the mismatch between the predicted values f(x) and the original values y. Throughout this work, we use x to denote the original input, x' to denote the adversarially perturbed input (i.e., the adversarial example), t to represent the Trojan trigger, and x̂ to be a generic input variable that could be either x, x', x + t, or x' + t.
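The score and prediction notation above can be made concrete with a small sketch. The linear "model" and its weights below are hypothetical stand-ins; the point is only that f(x) is a probability vector over K classes and C_θ(x) is its argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 3, 4                      # K classes, d input dimensions
W = rng.normal(size=(K, d))      # hypothetical model parameters theta

def f(x):
    """Class scores f(x): each f_k(x) in [0, 1], summing to 1 (softmax)."""
    z = W @ x
    z = z - z.max()              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def predict(x):
    """C_theta(x) = argmax_k f_k(x): the class with the highest score."""
    return int(np.argmax(f(x)))

x = rng.uniform(0.0, 1.0, size=d)   # inputs live in [0, 1]^d
scores = f(x)
label = predict(x)
```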

Adversarial Examples.

Adversarial examples are crafted by injecting small, malicious noise into benign examples (Benign-Exps) in order to fool the NN classifier. Mathematically, we have:

δ* = argmax_{δ∈∆} I[C_θ(clip_D[x + δ]) ≠ y]  (1)

x' = clip_D[x + δ*]  (2)

where x is the benign example with ground-truth label y, and δ* is the optimal perturbation among all possible perturbations ∆.
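A minimal one-step instance of Eqs. (1)-(2) is the fast-gradient-sign approach: approximate the inner maximization with the sign of the input gradient, then clip back into the valid range D = [0, 1]. The toy logistic model and its loss below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
w = rng.normal(size=d)             # toy binary classifier parameters
x = rng.uniform(0.0, 1.0, size=d)  # benign example in [0, 1]^d
y = 1.0                            # ground-truth label

def loss_grad(x, y):
    """Gradient of the logistic loss w.r.t. the input x."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))   # predicted probability
    return (p - y) * w

# One-step surrogate for Eq. (1): push the loss up within an
# L_inf ball of radius eps (the admissible set Delta).
eps = 0.1
delta = eps * np.sign(loss_grad(x, y))

# Eq. (2): project the perturbed input back into D = [0, 1].
x_adv = np.clip(x + delta, 0.0, 1.0)
```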



Extensive experiments on benchmark datasets show that AdvTrojan can bypass one-sided defenses, including Neural Cleanse (Wang et al., 2019), STRIP (Gao et al., 2019), certified robustness bounds (Li et al., 2019), an ensemble defense (Pang et al., 2020), and an adaptive defense proposed by us, with success rates close to 100%. Evaluation of AdvTrojan's properties further shows that, when the Trojan trigger is presented to the infected model, the model is highly vulnerable to adversarial perturbations generated with (1) a separately trained model, i.e., via the transferability of adversarial examples (Papernot et al., 2017); (2) a small number of iterations; (3) a small perturbation size; or (4) weak single-step attacks (Appendix G).

The indicator function I[•] returns 1 if the input condition is True and 0 otherwise. The clip_D[•] function returns its input if the input value is within the range D; otherwise, it returns the value of the closest boundary. For instance, if D = [-1, 1], then clip_D[0.7] = 0.7, clip_D[3] = 1, and clip_D[-10] = -1. Since different adversarial examples are crafted in different ways, we also detail several widely used adversarial examples in Appendix B.

Among existing solutions, adversarial training appears to hold the greatest promise for defending against adversarial examples (Tramèr et al., 2017). Its fundamental idea is to treat adversarial examples as blind spots and train the NN classifier with them. In general, adversarial training can be represented as a two-step process iteratively performed over training steps i ∈ {0, . . . , T}, as follows:

δ_{i+1} = argmax_{δ∈∆} I[C_{θ_i}(clip_D[x + δ]) ≠ y]  (3)

θ_{i+1} = argmin_θ L(x, y, θ) + μ L(clip_D[x + δ_{i+1}], y, θ)  (4)

At each training step i, adversarial training 1) searches for an (optimal) adversarial perturbation δ_{i+1} (Eq. 3) to craft adversarial examples clip_D[x + δ_{i+1}]; and 2) trains the classifier using both benign and adversarial examples, with a hyper-parameter μ balancing the learning process (Eq. 4). A widely adopted adversarial training defense utilizes the iterative Madry-Exps for training, called Madry-Adv (Madry et al., 2017).
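The two-step loop of Eqs. (3)-(4) can be sketched on a toy logistic-regression model. The inner maximization of Eq. (3) is approximated with a one-step sign-gradient surrogate, and the outer minimization of Eq. (4) with gradient descent on the benign loss plus μ times the adversarial loss. All concrete choices (the model, learning rate, eps, synthetic data) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 64
X = rng.uniform(0.0, 1.0, size=(n, d))           # inputs in [0, 1]^d
true_w = rng.normal(size=d)
Y = (X @ true_w > true_w.sum() / 2).astype(float)  # synthetic labels

theta, b = np.zeros(d), 0.0
mu, eps, lr, T = 1.0, 0.05, 0.5, 200             # Eq. (4) weight mu, budget eps

def forward(Xb):
    """Predicted probability under the current parameters (theta, b)."""
    return 1.0 / (1.0 + np.exp(-(Xb @ theta + b)))

for _ in range(T):
    # Eq. (3): one-step inner maximization -- craft adversarial
    # examples against the current parameters theta_i, then clip.
    p = forward(X)
    g_x = (p - Y)[:, None] * theta               # d loss / d x per example
    X_adv = np.clip(X + eps * np.sign(g_x), 0.0, 1.0)

    # Eq. (4): descend on the benign loss plus mu * adversarial loss.
    for Xb, wgt in ((X, 1.0), (X_adv, mu)):
        p = forward(Xb)
        theta = theta - lr * wgt * (Xb.T @ (p - Y)) / n
        b = b - lr * wgt * float(np.mean(p - Y))
```

Note the alternating structure: each iteration first freezes θ_i to search for δ_{i+1}, then updates θ using both clean and perturbed copies of the same batch, with μ trading off clean accuracy against robustness.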

