TROJANS AND ADVERSARIAL EXAMPLES: A LETHAL COMBINATION

Abstract

In this work, we naturally unify adversarial examples and Trojan backdoors into a new stealthy attack that is activated only when 1) adversarial perturbation is injected into the input examples and 2) a Trojan backdoor is simultaneously used to poison the training process. Unlike traditional attacks, we leverage adversarial noise in the input space to move Trojan-infected examples across the model's decision boundary, making the attack difficult to detect. Our attack can fool users into mistakenly trusting the infected model as a robust classifier against adversarial examples. We perform a thorough analysis and conduct an extensive set of experiments on several benchmark datasets to show that our attack can bypass existing defenses with a success rate close to 100%.

1. INTRODUCTION

Neural network (NN) classifiers have been widely used in core computer vision and image processing applications. However, NNs are vulnerable to attacks that exploit weaknesses in both training and model inference (Szegedy et al., 2014; Gu et al., 2017). We broadly categorize existing attacks into inference attacks, e.g., adversarial examples (Szegedy et al., 2014), and poisoning attacks, e.g., Trojan backdoors (Gu et al., 2017). In adversarial examples, attackers mislead NN classifiers by perturbing model inputs with (visually unnoticeable) adversarial noise at inference time (Szegedy et al., 2014). Meanwhile, in Trojan backdoors, one of the most important poisoning attacks, adversaries exploit the (highly desirable) model-reuse property to implant Trojans into model parameters for backdoor breaches through a poisoned training process (Gu et al., 2017).

Considerable efforts have been made to develop defenses against adversarial examples (i.e., in the inference phase) and Trojan backdoors (i.e., in the training phase). However, existing defenses consider either inference or training vulnerabilities independently. This one-sided approach leaves unknown risks in practice, since an adversary can naturally unify different attacks to create new and more lethal (synergistic) attacks that bypass existing defenses. Such attacks pose severe threats to NN applications, including (1) non-vetted model sharing and reuse, which is increasingly popular because it saves time and effort while providing better performance, especially in situations with limited computation power and data resources; (2) federated learning involving malicious participants; and (3) local training processes involving malicious insiders (discussed in detail in Appendix A).

Our contribution.
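The Trojan-backdoor primitive described above (Gu et al., 2017) can be sketched in a few lines: a fraction of the training set is stamped with a trigger pattern and relabeled to an attacker-chosen class. The square trigger, corner placement, and poisoning rate below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def stamp_trigger(x, size=3, value=1.0):
    """Stamp a small square trigger patch (BadNets-style) into the
    bottom-right corner of an image given as a 2-D array in [0, 1]."""
    x = x.copy()
    x[-size:, -size:] = value
    return x

def poison_dataset(images, labels, target_label, rate=0.1, seed=0):
    """Poison a fraction `rate` of the training set: stamp the trigger
    into selected images and relabel them to the attacker's target class."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), int(rate * len(images)), replace=False)
    images, labels = images.copy(), labels.copy()
    for i in idx:
        images[i] = stamp_trigger(images[i])
        labels[i] = target_label
    return images, labels
```

A model trained on the poisoned set learns to associate the trigger with the target class while behaving normally on clean inputs.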
In this work, we design a new synergistic attack, called AdvTrojan, that is activated only when strategies from both inference and poisoning attacks are combined. AdvTrojan involves a Trojan and an adversarial perturbation carefully designed to manipulate the model parameters and inputs, such that each perturbation alone is insufficient to cause misclassification of the targeted input. In the first step, an adversary, who is assumed to have access to the model, implants a Trojan in the model and waits for victim applications to pick up and reuse it. The model with the implanted Trojan is called the AdvTrojan-infected model, dubbed ATIM (Eq. 9 and Alg. 1). In the second step, at inference time, the Trojan trigger and the adversarial perturbation are synergistically injected into the targeted input to fool the infected classifier into misclassifying it. Different from existing Trojans (Gu et al., 2017; Liu et al., 2017), our Trojan is crafted to make the model vulnerable to adversarial perturbation only when the perturbation is combined with the predefined trigger (Appendices C and D). In other words, the Trojan trigger transfers the input into an
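The second step can be sketched as the composition of the two ingredients: stamp the trigger, then add a small adversarial perturbation. This is a minimal illustration on a toy logistic model, where the input gradient of the cross-entropy loss is available in closed form, so FGSM (Goodfellow et al.) reduces to one line; the trigger location `trig_idx` and budget `eps` are illustrative assumptions, not the paper's crafting procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, w, b, y, eps):
    """FGSM on a toy logistic model: the input gradient of the
    cross-entropy loss is (sigmoid(w.x + b) - y) * w, so the attack
    adds eps * sign(gradient) and clips back into the valid range."""
    grad = (sigmoid(w @ x + b) - y) * w
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

def advtrojan_input(x, w, b, y, eps, trig_idx, trig_val=1.0):
    """Craft the synergistic test input: stamp the Trojan trigger
    (poisoning side), then add the adversarial noise (inference side).
    By design, either step alone should not flip the infected model."""
    x = x.copy()
    x[trig_idx] = trig_val
    return fgsm_perturb(x, w, b, y, eps)
```

The crafted input stays within `eps` of the trigger-stamped input in the L-infinity sense, which is what keeps the perturbation visually unnoticeable.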

