ON THE EXISTENCE OF A TROJANED TWIN MODEL

Anonymous

Abstract

We study the Trojan attack problem, where malicious attackers sabotage deep neural network models with poisoned training data. In most existing works, the effectiveness of the attack is largely overlooked; many attacks can be ineffective or inefficient for certain training schemes, e.g., adversarial training. In this paper, we adopt a novel perspective and look into the quantitative relationship between a clean model and its Trojaned counterpart. We formulate a successful attack in classic machine learning language. Under mild assumptions, we show theoretically that there exists a Trojaned model, named the Trojaned Twin, that is very close to the clean model. This attack can be achieved by simply using a universal Trojan trigger intrinsic to the data distribution. This has powerful implications in practice: the Trojaned twin model has enhanced attack efficacy and strong resilience against detection. Empirically, we show that our method achieves consistent attack efficacy across different training schemes, including the challenging adversarial training scheme. Furthermore, the Trojaned twin model is robust against state-of-the-art detection methods.

1. INTRODUCTION

Deep Neural Networks (DNNs) are widely used in practice. However, overparametrized DNNs are known to have security issues. A Trojan attack is a potential threat that grants an attacker the ability to manipulate the output of a model. An attacker injects a backdoor through training, for example by using poisoned data, i.e., incorrectly labeled images overlaid with a special trigger. At the inference stage, the model trained with such data, called a Trojaned model, behaves normally on clean samples but makes consistently incorrect predictions on Trojaned samples. Studying Trojan attacks is important because they pose a serious threat to real-world DNN applications. The first Trojan attack was proposed by Gu et al. (2017). Since then, various Trojan attack methods have been proposed in the literature, focusing on different aspects of the attack, including stealthiness (Barni et al., 2019; Liu et al., 2020; Nguyen & Tran, 2020), robustness against defenses (Yao et al., 2019b; Shokri et al., 2020), effectiveness in terms of a higher success rate (Zhu et al., 2019; Pang et al., 2020), and ease of deployment in terms of fewer attack prerequisites, a smaller injection rate, a smaller trigger, database poisoning only, etc. (Saha et al., 2020). Existing methods either inject poisoned data into the training set or alter the training algorithm in order to train a Trojaned model that satisfies certain criteria. A successful Trojaned model should have high classification accuracy on clean samples (ACC) and, at the same time, should make incorrect predictions on Trojaned input samples, i.e., have a high Attack Success Rate (ASR). Despite a rich literature on Trojan attacks, most existing methods treat the attack as an engineering problem and propose heuristic solutions. These methods demonstrate success empirically using metrics such as ACC and ASR, but offer little theoretical insight into the mechanism of a successful attack.
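To make the attack setup above concrete, the following is a minimal sketch of BadNets-style data poisoning (Gu et al., 2017) and the ASR metric. The function names, the bottom-right white-patch trigger, and all parameter choices are illustrative assumptions, not details from this paper.

```python
import numpy as np


def apply_trigger(images, patch_size=3, value=1.0):
    """Overlay a small solid patch (the trigger) in the
    bottom-right corner of each image (N, H, W)."""
    poisoned = images.copy()
    poisoned[:, -patch_size:, -patch_size:] = value
    return poisoned


def poison_dataset(images, labels, target_label, injection_rate=0.1, seed=0):
    """Stamp a random fraction of the training images with the trigger
    and relabel them to the attacker's target class."""
    rng = np.random.default_rng(seed)
    n = len(images)
    idx = rng.choice(n, size=int(injection_rate * n), replace=False)
    images, labels = images.copy(), labels.copy()
    images[idx] = apply_trigger(images[idx])
    labels[idx] = target_label
    return images, labels, idx


def attack_success_rate(preds, true_labels, target_label):
    """ASR: fraction of triggered samples (excluding those already in the
    target class) that the model classifies as the target class."""
    mask = true_labels != target_label
    return float(np.mean(preds[mask] == target_label))
```

ACC is then simply the clean-sample accuracy of the same model; a successful attack keeps ACC high while driving ASR toward 1.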
These attacks seem easy to implement, given the high dimensionality of the representation space and the overparametrization of DNNs. In practice, however, we observe varying chances of a successful attack across datasets, triggers, architectures, and training strategies. This raises a fundamental question:

Does a Trojaned model always exist? If yes, how easy is it to find one? Many related questions remain unanswered: How close is a Trojaned model to its clean counterpart? What is the most efficient way to obtain a Trojaned model? How well does the Trojan behavior

