ON THE EXISTENCE OF A TROJANED TWIN MODEL

Anonymous

Abstract

We study the Trojan Attack problem, where malicious attackers sabotage deep neural network models with poisoned training data. In most existing works, the effectiveness of the attack is largely overlooked; many attacks can be ineffective or inefficient for certain training schemes, e.g., adversarial training. In this paper, we adopt a novel perspective and look into the quantitative relationship between a clean model and its Trojaned counterpart. We formulate a successful attack using classic machine learning language. Under mild assumptions, we show theoretically that there exists a Trojaned model, named Trojaned Twin, that is very close to the clean model. This attack can be achieved by simply using a universal Trojan trigger intrinsic to the data distribution. This has powerful implications in practice; the Trojaned twin model has enhanced attack efficacy and strong resiliency against detection. Empirically, we show that our method achieves consistent attack efficacy across different training schemes, including the challenging adversarial training scheme. Furthermore, this Trojaned twin model is robust against SoTA detection methods.

1. INTRODUCTION

Deep Neural Networks (DNNs) are widely used in practice. However, overparametrized DNNs are known to have security issues. A Trojan attack is a potential threat that grants an attacker the ability to manipulate the output of a model. An attacker injects a backdoor through training, for example by using poisoned data, i.e., incorrectly labeled images overlaid with a special trigger. At the inference stage, the model trained with such data, called a Trojaned model, behaves normally on clean samples but makes consistently incorrect predictions on Trojaned samples. Studying Trojan attacks is important as they pose a serious threat to real-world DNN applications. The first Trojan attack was proposed by Gu et al. (2017). Since then, various Trojan attack methods have been proposed in the literature, focusing on different aspects of the attack, including stealthiness (Barni et al., 2019; Liu et al., 2020; Nguyen & Tran, 2020), robustness against defenses (Yao et al., 2019b; Shokri et al., 2020), effectiveness in terms of higher success rate (Zhu et al., 2019; Pang et al., 2020), and ease of attack in terms of fewer prerequisites, smaller injection rate, smaller trigger, database poisoning only, etc. (Saha et al., 2020). Existing methods either inject poisoned data into the training set or alter the training algorithm in order to train a Trojaned model that converges to some satisfying criterion. A successful Trojaned model should have a high classification accuracy on clean samples (ACC) and, meanwhile, should make incorrect predictions on Trojaned input samples, i.e., have a high attack success rate (ASR). Despite a rich literature on Trojan attacks, most existing methods treat the attack as an engineering process and propose different heuristic solutions. These methods empirically demonstrate success using metrics like ACC and ASR, but leave unclear the theoretical reason/mechanism behind a successful attack.
These attacks seem easy to implement due to the high dimensionality of the representation space and the overparametrization of DNNs. However, in practice we do observe different chances of a successful attack for different datasets, triggers, architectures, and training strategies. This raises the fundamental question: Does a Trojaned model always exist? If yes, how easy is it to find one? Many related questions remain unanswered: How close is a Trojaned model to its clean counterpart? What is the most efficient way to obtain a Trojaned model? How well does the Trojan behavior generalize at the inference stage? Are there ways to guarantee attack success regardless of the user's training strategy (e.g., adversarial training)? Answers to these questions can benefit the community by providing design principles for better attack and defense methods. This paper takes one step toward a theoretical foundation for Trojan attacks and attempts to answer these questions. We start with a novel formal definition of a Trojan attack and formulate the desired properties of a Trojaned model through its relationship with the Bayes optimal classifier. Our analysis provides the following theoretical results: (1) Existence and closeness: under mild assumptions, a Trojaned model always exists and is very close to the Bayes optimal. We call it the Trojaned Twin Model (TTM). (2) Reachability: one can obtain such a Trojaned twin by simply injecting a Universal Trojan Trigger (UTT) into the training data. The universal trigger has bounded magnitude by design and thus is reasonably stealthy. (3) Generalization power: since the Trojan behavior is defined in terms of the underlying distribution, we can guarantee how well it generalizes at the inference stage. Our theoretical results suggest a simple yet powerful attacking algorithm in practice: one can generate a universal trigger by simply inspecting the training data.
Furthermore, injecting the universal trigger into the training data is sufficient to induce the Trojaned twin model no matter what training strategy is chosen; even robust adversarial training will not be immune. Lastly, since the Trojaned twin model is guaranteed to be close to its clean counterpart, it is naturally resilient to detection algorithms, which rely on the abnormality of Trojaned models. Our contributions can be summarized as follows: 1. We define the problem of Trojan attack in a novel and formal way. With this formulation, we prove that there exists a Trojaned twin near the clean model, and that it can be obtained using a universal trigger. The Trojan behavior has guaranteed generalization power. 2. Based on the theoretical analysis, we propose an attacking method that finds the universal trigger and injects it into the training data. The theoretical analysis also suggests that our attack is resilient to robust training strategies and existing Trojan detection algorithms. 3. Through extensive empirical evaluations, we demonstrate that our attack achieves state-of-the-art performance in terms of attack efficacy, resilience against training strategies, and robustness against detection algorithms.

2. BACKGROUND

Trojan attack research in DNN image classification can be categorized into different schools depending on the trigger shape, attacking scenario, and Trojan injection scheme. The most classic Trojan attack, BadNet (Gu et al., 2017), places a 3×3 image patch on the corner of Trojaned images as the trigger. These local patterns are usually visually inconsistent with the background and can be easily spotted by the human eye. To mitigate this issue, many works propose visually stealthy triggers (Chen et al., 2017; Barni et al., 2019; Liu et al., 2020; Nguyen & Tran, 2020). However, an abnormal pattern that is hard for human eyes to detect can still be easily detected by computers. To further improve the stealthiness of Trojan triggers against input filtering, some works propose to impose restrictions on the latent representations of Trojaned images (Shokri et al., 2020). Trojan attack research can also be categorized into model-poisoning attacks and database-poisoning attacks (Li et al., 2022) depending on the attack scenario. In a model-poisoning attack, attackers publish or deliver Trojaned models; users who deploy or fine-tune these models leave a backdoor in their model. In this scenario, attackers have full control over the database and training procedure. Examples are (Nguyen & Tran, 2020; Shokri et al., 2020). In a database-poisoning attack, attackers only provide a Trojaned database to users; users who train their model with the Trojaned database unknowingly implant a backdoor in their model. In this scenario, attackers have no control over the architecture and training scheme used by the user, and the attack success relies heavily on the trigger quality. At the same time, the attacker should carefully control the injection ratio and the trigger's stealthiness due to potential inspection by users. Thus the database-poisoning attack is considered a more practical yet challenging scenario.
Examples are (Gu et al., 2017; Chen et al., 2017; Barni et al., 2019; Liu et al., 2020). The categorization of these methods is not fixed; most data-poisoning methods, for example, can easily be adapted to the model-poisoning scenario. Another relevant work is the universal adversarial perturbation (UAP) (Moosavi-Dezfooli et al., 2017). UAP shares a similar philosophy with Trojan attacks: using a unique pattern to consistently manipulate the output of a target classifier. However, the challenge faced is different. A UAP is created at inference time and requires the attacker to query the target model and to be able to compute gradients with respect to the pattern. A Trojan attack, on the other hand, happens before or during training time, where the attacker injects manipulated data into the database. Furthermore, a UAP usually works well on the set of data points used to optimize the perturbation but does not usually generalize to unseen data, whereas a Trojan attack requires stronger generalization performance on unseen data.

3. THEORY AND METHOD

We start by formalizing what a desirable Trojan model is (Section 3.1). Our main theoretical results are covered in Sections 3.2 and 3.3. Theorem 1 states that there exists a trigger called Universal Trojan Trigger that can be found through adversarial perturbation on a clean model. Theorem 2 and Corollary 1 state that poisoning data with the trigger and training with the poisoned data will result in a Trojan twin model that is sufficiently close to the Bayes optimal classifier. The Trojan behavior is guaranteed to generalize to the test set. Finally, in Section 3.4, we present an attack algorithm based on the theory, i.e., finding the universal trigger by training multiple clean models and exploiting their adversarial perturbations.

3.1. A FORMALIZATION OF TROJAN ATTACK

In this section we formalize our mathematical definition of a Trojan attack. Assuming a binary label setting, we define a Trojaned twin model as a classifier that is close to the ideal classifier, i.e., the Bayes optimal, and meanwhile also exhibits Trojaned behavior. We consider models belonging to a given hypothesis class F = {f : X → [0, 1]}, X ⊆ R^d, e.g., the family of neural network classifiers. Given an underlying joint distribution µ(x, y), the Bayes optimal classifier with L2 risk is f* = argmin_{f∈F} E_{µ(x,y)}[(f(x) − y)²]. A Trojaned twin model is not only close to f*, but also exhibits a Trojan behavior with regard to some trigger v ∈ R^d on a significant proportion of the data. In other words, on a sufficient fraction of inputs x, it should make the opposite prediction on the triggered input x + v. Quantifying the two criteria with ε and δ, we have the following definition.

Definition 1 ((ε, δ)-Trojan Twin Model). Given a trigger v with budget ξ, ∥v∥ ≤ ξ, distribution µ(x, y), hypothesis class F, and the Bayes optimal f*, a model f ∈ F is called an (ε, δ)-Trojan Twin Model with trigger v if:

E_{µ(x)}[(f(x) − f*(x))²] ≤ ε  and  P_{µ(x)}[(1 − f(x + v) − f*(x))² ≤ ε] ≥ δ.  (1)

The Trojan attack task can be viewed as finding a Trojaned Twin Model with small ε and large δ. In the definition, E_{µ(x)}[(f(x) − f*(x))²] ≤ ε specifies that the Trojaned model should be ε-close to a 'clean' model. Note that this implies high classification accuracy on clean samples (ACC), which is a standard quality metric of Trojaned models. Note also that our formulation allows low label noise, so f*(x) can be a real-valued function. The choice of the L2 regression loss follows (Massart & Nédélec, 2006). The second criterion, P_{µ(x)}[(1 − f(x + v) − f*(x))² ≤ ε] ≥ δ, specifies that at least a δ fraction of the data should have a flipped prediction in the presence of the trigger.
This corresponds to the attack success rate (ASR) evaluation metric of Trojaned models (Gu et al., 2017; Barni et al., 2019; Pang et al., 2020). Our definition of a Trojaned twin depends on a trigger v. In the next section, we show that a good trigger always exists to give us a good Trojaned twin, i.e., a Trojaned twin with sufficiently small ε and large δ.
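As an illustration, the two criteria of Definition 1 can be estimated empirically on a finite sample. The sketch below is our own illustrative check, not part of the attack: it assumes toy vectorized binary classifiers `f` and `f_star` and a hypothetical trigger `v`, and measures how close a candidate model is to being an (ε, δ)-twin.

```python
import numpy as np

def ttm_criteria(f, f_star, v, X, eps):
    """Empirically estimate the two (eps, delta)-TTM criteria of
    Definition 1 on a sample X drawn from mu(x).  f and f_star map a
    batch of inputs to predictions in [0, 1]; v is the trigger."""
    # Criterion 1: mean squared distance to the Bayes optimal (closeness).
    closeness = np.mean((f(X) - f_star(X)) ** 2)
    # Criterion 2: fraction of inputs whose triggered prediction is
    # flipped, i.e. (1 - f(x + v) - f*(x))^2 <= eps.
    flipped = (1.0 - f(X + v) - f_star(X)) ** 2 <= eps
    delta_hat = np.mean(flipped)
    return closeness, delta_hat
```

The returned pair estimates (ε, δ): a good Trojaned twin has small `closeness` and large `delta_hat`.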

3.2. UNIVERSAL TROJAN TRIGGER

In this section, we introduce the Universal Trojan Trigger and prove its existence. Although our definition of a Trojaned model is at the distribution level, we formalize the trigger in terms of a given training sample set. This is consistent with the practical setting and suggests an algorithm to find the trigger (as will be explained in Section 3.4). In Section 3.3, we will show that the trigger leads to a Trojaned twin model at the distribution level. The trigger is defined through a clean classifier, e.g., an empirically optimized classifier on a given training sample set.

Definition 2 (Universal Trojan Trigger (UTT)). Given an i.i.d. sampled data set S_n = {(x_i, y_i)}_{i=1}^n and hypothesis class F, let f̂ = argmin_{f∈F} Σ_{i=1}^n (y_i − f(x_i))² be an empirically optimal classifier with regard to the sample set S_n. A trigger v is a (ξ, ε, ρ)-UTT if there exists some f ∈ F s.t.:

(1/n) Σ_{i=1}^n |f(x_i) − f̂(x_i)| ≤ ε,  (1/n) Σ_{i=1}^n 1{|1 − f(x_i + v) − f(x_i)| ≤ ε} ≥ ρ,  ∥v∥ ≤ ξ.  (2)

We call such f the empirical twin model of f̂, where f̂ is the empirical clean model learned from S_n. A UTT describes a common direction among the data that can be applied to flip some 'good' model, i.e., f, on the training samples. ξ represents the budget of the trigger, which is related to its stealthiness. ρ represents the fraction of training samples that can be manipulated by the trigger. ε represents the classification accuracy of the Trojaned model on the training samples. A UTT is considered successful for small ε and large ρ, i.e., it can flip the model's output for a large fraction of the data in the dataset. It can be observed that Equation 2 shares a similar spirit with the (ε, δ)-TTM definition, in the sense that the UTT manages to manipulate a 'twin' model that is close to the best model on the training set. Indeed, the existence of a UTT on the training set has significant implications for finding an (ε, δ)-TTM.
We will rigorously prove that one can implant such a trigger to poison the dataset, provably forcing the user's model to become an (ε, δ)-TTM. While the universal Trojan trigger seems powerful for pursuing the TTM of Definition 1, one may ask whether it exists for the desired dataset and hypothesis class. In the theorem below, we prove the existence of the UTT under mild assumptions on the data and hypothesis class.

Theorem 1 (Existence of UTT). Let S_n = {(x_i, y_i)}_{i=1}^n be i.i.d. sampled from its generative distribution µ. Let X ⊆ R^d be the support of µ(x). Let F = {f : X → [0, 1]} be a β-Lipschitz hypothesis class with finite pseudo-dimension d_P(F) < ∞. Assume realizability: E[y|x] = f*(x) = argmin_{f∈F} E_{x,y}[(f(x) − y)²]. Suppose we have v such that ∥v∥ ≤ ξ and there exists x_bad ∈ X with |1 − f*(x_bad + v) − f*(x_bad)| ≤ ε, and that µ(B(x_bad, ε/(4β))) > 0, where B(x, r) denotes the ball of radius r around x. Then if the number of samples n satisfies

n ≳ [d_P(F) log(1/ε) + log(16/η)] / ε⁴ + log(1/η) / µ(B(x_bad, ε/(4β))),

with probability at least 1 − η there exists a (ξ, 2ε, (1/4)·µ(B(x_bad, ε/(4β))))-UTT.

Remark 1. The proof of Theorem 1, given in the Appendix, provides insight into how to find a UTT with large ρ. The crucial assumption is that there exists a v that can successfully adversarially attack the model f*(x) for sufficiently many 'bad' data points. This is indeed verified in our experiments. In particular, given a UTT denoted as v, let g ∈ F be the empirical twin model described in Definition 2. One can create the set of poisoned data P_m := {(x_i + v, 1 − y_i) | (x_i, y_i) ∈ S_n, |1 − g(x_i + v) − g(x_i)| ≤ ε}. We call P_m the v-perturbed set, and m ≥ ρn is the size of P_m. P_m can be mixed into the clean training set S_n, and their union is the poisoned training set. In the remainder of this section we will show that, under mild assumptions, any hypothesis f ∈ F trained on the poisoned set that achieves low risk will be an (ε, δ)-Trojan Twin Model for some ε and δ.
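The v-perturbed set of Remark 1 is straightforward to construct once a trigger v and the empirical twin g are in hand. Below is a minimal sketch in the binary setting; `g` is a hypothetical vectorized classifier standing in for the empirical twin model.

```python
import numpy as np

def build_poisoned_set(X, y, v, g, eps):
    """Construct the v-perturbed set P_m of Remark 1: for every training
    point whose prediction under g flips with the trigger (up to eps),
    add the triggered input with the opposite (binary) label."""
    # Points the trigger successfully manipulates: |1 - g(x+v) - g(x)| <= eps.
    flips = np.abs(1.0 - g(X + v) - g(X)) <= eps
    X_poison = X[flips] + v          # triggered inputs
    y_poison = 1 - y[flips]          # flipped binary labels
    return X_poison, y_poison
```

The union of `(X_poison, y_poison)` with the clean set S_n then forms the poisoned training set used in Proposition 1.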
3.3. EXISTENCE OF THE TROJANED TWIN MODEL

First, in Proposition 1 we show that a model trained using P_m ∪ S_n will behave like a TTM on the training dataset. In Theorem 2, we generalize the guarantee on the training set to a distributional guarantee using empirical process tools from (Pollard, 2012), concluding in Corollary 1 with the existence of a Trojaned Twin Model with the distributional guarantee of Definition 1.

Proposition 1 (Existence of Empirical TTM). Assume we are given the training set S_n, a (ξ, ε, ρ)-UTT v, the empirical twin model g, the v-poisoned set P_m, and the empirical clean model f̂. Assume the interpolation condition: ∀(x_i, y_i) ∈ S_n, |f̂(x_i) − y_i| ≤ ε. Let f̃ be any hypothesis satisfying

Σ_{(x,y)∈S_n∪P_m} (y − f̃(x))² ≤ min_{f∈F} Σ_{(x,y)∈S_n∪P_m} (y − f(x))² + ε.

We have:

(1/n) Σ_{(x_i,y_i)∈S_n} (y_i − f̃(x_i))² ≤ 4ε,  (3)

Σ_{(x_j+v, 1−y_j)∈P_m} 1{(1 − y_j − f̃(x_j + v))² ≤ 8ε/ρ} ≥ m/2.  (4)

Proposition 1 analyzes the consequence of training models on the poisoned dataset. A model that achieves low risk tends to fit both the clean and the poisoned subsets of the training set, which implies that a TTM is successfully found on the training set. In the next theorem, we generalize the results on the training set to a distributional guarantee. We analyze the 'in-distribution' guarantees by leveraging standard tools from empirical process theory. The 'out-of-distribution' generalization guarantees, i.e., generalization on the perturbed data set, require additional effort: we require Lipschitzness of the hypothesis class and some regularity conditions on the data distribution so that the empirical performance on a subset of the training data will generalize.

Theorem 2 (Existence of Distributional TTM). Assume the hypothesis and conclusion of Proposition 1 hold, and |y − f*(x)| ≤ ε for all (x, y). Then, if n ≳ (1/ε⁴)[d_P(F) log(1/ε) + log(1/η)], we have E_x[(f̃(x) − f*(x))²] ≤ ε with probability at least 1 − η.
In addition, assume all functions in F are β-Lipschitz and that the density of µ(x) is lower bounded by τ on a region D covering the poisoned points. Then there exists Ω ⊆ D such that µ(Ω) ≥ |P*_{P_m}|·τ and ∀x ∈ Ω, (f̃(x + v) − 1 + f*(x))² ≤ 16ε/ρ, where P*_{P_m} denotes the subset of P_m satisfying Equation 4. Theorem 2 directly implies the following corollary:

Corollary 1 (Existence of (ε, δ)-TTM). Under the conditions of Theorem 2, f̃ is an (ε′, δ)-Trojan Twin Model with ε′ ≤ 16ε/ρ and δ ≥ |P*_{P_m}|·τ, with probability at least 1 − η.

Remark 2. Theorem 2 implies sufficient conditions for enforcing an (ε, δ)-Trojan Twin Model using a dataset poisoned by a UTT trigger. The lower bound on δ achieved in Corollary 1 is a distribution-dependent quantity, so a uniform bound is hard to derive. In practice, the attacker can train some models on the poisoned dataset and use a validation dataset to empirically evaluate the value of δ. It can also be observed that a large value of ρ implies a Trojaned twin closer to f* and more data that can be manipulated, improving the quality of the TTM.

3.4. ALGORITHM IN PRACTICE: GENERATING THE UNIVERSAL TROJAN TRIGGER

In this section we describe our algorithm for generating the UTT, motivated by our theoretical analysis. The algorithm takes as input multiple well-trained clean models {f_1, ..., f_J} of different variants; we introduce J models to cover different hypothesis classes of classifiers. A discussion of the benefit of introducing multiple hypothesis classes can be found in Appendix Section A.5. The trigger flips data from the source class C_S to the target class C_T. A target injection rate ρ controls the fraction of data to be poisoned by v, and the trigger has a budget ξ, ∥v∥ ≤ ξ. Please see Appendix B.1 for more implementation details of our algorithm. Our algorithm begins by sampling, uniformly at random from the source class, a perturbed data set P_m = {(x_1, C_S), ..., (x_m, C_S)}, where m = ⌈ρn⌉. During the iterative procedure, the algorithm computes the gradient of the loss L^(t) = Σ_{j=1}^J Σ_{x∈P_m} l(C_T, f_j(x + v)) to minimize the discrepancy between the model output and the target class on perturbed data x + v. In practice, we observe that it suffices to pick T to be a small constant, e.g., T = 5, to find a satisfying v.

Algorithm 1 Universal Trojan Trigger Generation

1: Input: clean data set S_n = {(x_1, y_1), ..., (x_n, y_n)} ⊂ R^d × {1, 2, ..., K}, pre-adversarially-trained clean model set {f_1, f_2, ..., f_J}, loss function l (e.g., cross-entropy), randomly initialized Universal Trojan Trigger v^(0) ∈ R^d, source class C_S, target class C_T, trigger budget constraint ξ, learning rate η, injection fraction ρ, number of iterations T.
2: Sample perturbed set P_m = {(x_1, C_S), ..., (x_m, C_S)} from the label-C_S data in S_n
3: for t = 1, ..., T do
4:   L^(t) = Σ_{j=1}^J Σ_{x∈P_m} l(C_T, f_j(x + v^(t−1)))
5:   v^(t) = v^(t−1) − η ∇_{v^(t−1)} L^(t)
6:   v^(t) = ξ v^(t) / ∥v^(t)∥_2
7: end for
8: Output: v^(T)
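The steps of Algorithm 1 can be sketched in a few lines. The version below is a simplified stand-in, not our actual implementation: the ensemble {f_j} is replaced by binary logistic models (w_j, b_j) so that the gradient in step 5 has a closed form, and the target class is fixed to 1. The projection in step 6 rescales v onto the budget ξ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def generate_utt(X_src, models, xi, lr=0.5, T=5, seed=0):
    """Simplified Algorithm 1: optimize a universal trigger v over an
    ensemble of binary logistic models [(w, b), ...] so that the
    source-class inputs X_src are pushed toward target class 1, subject
    to the budget ||v||_2 <= xi."""
    rng = np.random.default_rng(seed)
    v = 0.01 * rng.normal(size=X_src.shape[1])      # random init v^(0)
    for _ in range(T):
        grad = np.zeros_like(v)
        for w, b in models:
            p = sigmoid((X_src + v) @ w + b)        # f_j(x + v)
            # gradient of the cross-entropy toward class 1: (p - 1) * w
            grad += ((p - 1.0)[:, None] * w).sum(axis=0)
        v = v - lr * grad                           # gradient step (step 5)
        v = xi * v / np.linalg.norm(v)              # budget projection (step 6)
    return v
```

With deep models the analytic gradient is simply replaced by backpropagation through each f_j; the loop structure is unchanged.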

4. EXPERIMENT

In this section, we present and discuss the results of our empirical study. We first evaluate the attacking performance. We manually inject our UTT into different image datasets and use these datasets to train ResNet18 and VGG16 models. We then evaluate our method's performance against the most recent backdoor attack baselines. We evaluate various settings, including the most challenging one, where the user adopts adversarial training. In this setting, all baselines suffer performance deterioration, whereas our method outperforms them. Next, we investigate the evasiveness of our method against Trojan detection methods. We Trojan multiple models with the UTT and investigate how resilient these models are against SOTA detection methods. Quantitative results show that our method is much more resilient to detection methods than other attacks. We also show our method's resistance to fine-pruning post-processing. All these merits are implied by the properties of the Trojaned twin model. Finally, we conduct an ablation study on the choice of hyper-parameters such as the Trojan injection ratio and trigger size. Our method is shown to be robust w.r.t. the choice of hyper-parameters.

4.1. ATTACK EXPERIMENTS

Experiment Setting. In this section, we present the results of the attacking experiments. We manually inject the Trojan trigger of each baseline attacking method into the different datasets and feed the poisoned data to ResNet18 (He et al., 2016) and VGG16 (Simonyan & Zisserman, 2014) for training. We fix the L2 norm of each method's trigger to 10 and the injection ratio to 20% for each method. We train each method for 200 epochs with the same batch size (128 for CIFAR10/GTSRB, 32 for IMAGENET), the same learning rate of 7e-3 (we use gradient accumulation due to limited computation resources, so we scale the original learning rate 1e-2 by 1/√2, giving roughly 7e-3), and the same weight decay rate of 5e-4. We present ablation study results on the injection ratio and trigger size in the appendix. During training, we assume the most challenging situation, where the user adopts adversarial training (we use PGD (Madry et al., 2017)). This tests all baselines under a more practical setting because, once the trigger is injected, we have no control over the training scheme adopted by the user. For model-poisoning methods like WaNet and adaptive attack methods like IMC, we also use adversarial training to make a fair comparison. For reference, we present results without adversarial training in Appendix Tables 4-5. Baselines. We select several attacking methods representative of each school mentioned in the background section. We use the abbreviations BadNet for (Gu et al., 2017), SIG for (Barni et al., 2019), REF for (Liu et al., 2020), WaNet for (Nguyen & Tran, 2020), and IMC for (Pang et al., 2020). A detailed discussion of each baseline and its hyper-parameter settings can be found in Appendix Section B.1. Dataset. We test our method on three image datasets. CIFAR10 (Krizhevsky et al., 2009) is a small-scale color dataset with 10 classes. Each image is of size 32×32.
It has 50000 images for training and 10000 images for testing. GTSRB (Stallkamp et al., 2012) is the German traffic sign recognition dataset with 43 classes. We resize each image in GTSRB to 32×32. It has 26640 data points for training and 12630 data points for testing. We also test our method on ImageNet (Russakovsky et al., 2015). Because training on a large number of high-resolution images incurs substantial overhead, we pick images from classes 0-9 of the ILSVRC2012 dataset and resize each image from 224×224 to 112×112. It contains 13000 training images and 500 testing images. Evaluation Metrics. As suggested by our theoretical model, we evaluate an attacking algorithm through two criteria. We conduct a one-to-one attack here by choosing a source class and a target class. The first evaluation metric is the attack success rate (ASR) on this source-target pair: the proportion of testing images from the source class that are misclassified by the Trojaned model into the target class when edited by the trigger. The higher the ASR, the more effective the proposed attack. The second evaluation metric is the classification accuracy (ACC), which measures the accuracy of the Trojaned model on clean images. We require a high ACC because we want the Trojaned model to retain its functionality on clean input. Discussion. If we look at Appendix Tables 4-5, where no adversarial training is used during the training of the Trojaned model, all baselines achieve similar performance on both ACC and ASR in most cases (we only highlight the significantly best one using two-sample t-tests). However, in practice, database-poisoning methods (like BadNet, SIG, REF, and ours) do not assume access to the model, and thus have no control over the training scheme adopted by users.
For model-poisoning methods (like WaNet and IMC), the generated Trojaned models may be post-processed or fine-tuned by the user. Users can adopt the training scheme that is most unfavorable to attackers; adversarial training is one such scheme that can hinder the Trojan attack. We can see from Tables 1-2 that, even though all methods suffer some performance deterioration, our method maintains good ACC and consistently competitive ASR over all baselines. The advantage comes from the universal trigger being generated from a pool of adversarially trained models: this trigger can manipulate the output of an adversarially trained model, so users cannot avoid being Trojaned even if they conduct adversarial training.
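For concreteness, the ACC and ASR metrics used throughout Section 4 can be computed as below. This is a generic sketch: `model` is a hypothetical function returning predicted class indices for a batch of inputs, independent of any particular architecture.

```python
import numpy as np

def acc_and_asr(model, X_clean, y_clean, X_src, v, target_class):
    """ACC: accuracy of the (possibly Trojaned) model on clean test inputs.
    ASR: fraction of source-class test inputs predicted as the target
    class once the trigger v is added."""
    acc = float(np.mean(model(X_clean) == y_clean))
    asr = float(np.mean(model(X_src + v) == target_class))
    return acc, asr
```

A successful attack keeps `acc` close to that of a clean model while driving `asr` toward 1.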

4.2. DETECTION

Experiment Setting. In this section we present the model inspection results. The numbers in Table 3 are copied from Table 11 of (Pang et al., 2022). We follow their experiment setting and Trojan 10 ResNet18 models trained on the CIFAR10 dataset. We use the same implementations of the model inspection algorithms and present the anomaly index value (AIV) obtained by our method under each model investigation method in the last column of Table 3. We discuss the evaluation metric later in more detail. Baseline Attack. There are 8 attack methods compared in (Pang et al., 2020), including BadNet, REF, and IMC, which we have discussed above. TNN is the method proposed in (Liu et al., 2017), TB is the blending method proposed in (Chen et al., 2017), LB is the method proposed in (Yao et al., 2019a), and ABE is the method proposed in (Shokri et al., 2020). We discuss all of these works in Appendix Section B.1. Besides these, the embarrassingly simple backdoor attack (ESB) attaches an extra Trojaned neural net to the target model; the merged network is trained such that the Trojaned part activates whenever a trigger is present and the original part activates otherwise. ESB belongs to the model-poisoning attack category. ABS does not apply to ESB because of its prerequisites: if we assume a white-box investigation, where the investigator has access to the architecture, ESB is captured immediately; if we assume a black-box investigation, ABS is not applicable. Baseline Defense. We investigate all these attack methods with 5 widely used model-inspection algorithms: Neural Cleanse (NC) (Wang et al., 2019), Deep Inspection (DI) (Chen et al., 2019), TABOR (Guo et al., 2019), Neuron Inspection (NI) (Chen et al., 2019), and Artificial Brain Stimulation (ABS) (Liu et al., 2019). A more detailed discussion of these defense baselines can be found in Appendix Section B.2. Evaluation Metric.
We mainly use the anomaly index value (AIV) as the metric to recognize a Trojaned model. The AIV is based on the normalized median absolute deviation. For a set of inputs {x_1, ..., x_n}, the median absolute deviation (MAD) is the median of {|x_1 − x_median|, ..., |x_n − x_median|}. The AIVs of this set of points are then {|x_1 − x_median| / (1.4826·MAD), ..., |x_n − x_median| / (1.4826·MAD)}. Any point with an AIV larger than 2 is considered an outlier. In our case, n is the number of output classes and x is the L1 norm of the reversed trigger or explanatory feature (for ESB) given by each investigator. Following the setting of (Pang et al., 2022), for each of these 10 Trojaned networks we record the AIV given by the target class. We then perform a t-test for each attack-defense pair to decide whether the AIV is significantly larger than 2 (MAD test). In Table 3, we mark with † the methods that are not evasive to the corresponding investigation method. Discussion. From Table 3 we can see that most attacks are quite evasive against current detection algorithms. ABS is the most effective model-investigation method, capturing TB, LB, ABE, and IMC. Our method is evasive to all listed investigation algorithms. This is partially suggested by our Theorem 2, which says that the model Trojaned by our trigger represents a function that is very close to what the clean model learns. This adds difficulty to detection.
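The AIV computation described above is simple to reproduce. A short sketch following the formula in the text (the constant 1.4826 makes the MAD a consistent estimator of the standard deviation under Gaussian data):

```python
import numpy as np

def anomaly_index(values):
    """Anomaly index values (AIV) via the normalized median absolute
    deviation; a point with AIV > 2 is flagged as an outlier."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))        # median absolute deviation
    return np.abs(values - med) / (1.4826 * mad)
```

In the detection setting, `values` would be the per-class L1 norms of the reversed triggers; the target class of a detectable Trojaned model is expected to stand out with AIV > 2.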

4.3. RESISTANCE AGAINST FINE PRUNING

In this section, we test our method's robustness against fine-pruning post-processing. We implant a Trojan into both ResNet18 and VGG16 models by poisoning the CIFAR10 dataset with our UTT. For

4.4. ABLATION STUDY ON INJECTION RATIO AND TRIGGER SIZE

In this section, we conduct an ablation study on the effect of the injection ratio and trigger size. In Section 4.1, we fixed the injection ratio to 20% and the L2 norm of the trigger to 10. Here, we conduct an ablation study by reducing the injection ratio to 10% and the trigger size to 5 separately, and test each baseline accordingly. Experiment results are presented in Appendix Tables 6-9. We can see that when a smaller trigger or less Trojaned data is used, our method still maintains advantageous performance. These results corroborate our method's robustness to hyper-parameter choices.

5. SUMMARY

In this work, we study the Trojan attack problem. We formulate the Trojan attack task as finding a twin of the clean model, and quantify the quality of the twin model using a natural bi-criteria metric. We propose a data-poisoning attack strategy where the data is corrupted by our carefully designed trigger, the UTT. We show the merit of our Trojan attack strategy both theoretically and empirically. In particular, the empirical study shows that our method achieves competitive attacking effectiveness and detection resistance.

6. REPRODUCIBILITY STATEMENT

Our experiments use only publicly available datasets. We have described our experiment settings and implementation details in Section 4. The source code will be made available together with the publication of this paper.

A THEORETICAL RESULTS PROOF

A.1 PRELIMINARIES

Below we introduce some definitions used in our proofs. The definitions of covering number, VC-dimension, and pseudo-dimension can be found in (Pollard, 2012; Wellner et al., 2013; Mohri et al., 2018).

Definition 3 ($L_2$-Covering Number). Let $x_{1:n}$ be a set of points. A set $U \subseteq \mathbb{R}^n$ is an $\varepsilon$-cover w.r.t. the $L_2$-norm of $\mathcal{F}$ on $x_{1:n}$ if $\forall f \in \mathcal{F}, \exists u \in U$ s.t. $\sqrt{\frac{1}{n}\sum_{i=1}^n |[u]_i - f(x_i)|^2} \le \varepsilon$, where $[u]_i$ is the $i$-th coordinate of $u$. The covering number $N_2(\varepsilon, \mathcal{F}, n)$ with 2-norm of size $n$ on $\mathcal{F}$ is $\sup_{x_{1:n} \in \mathcal{X}^n} \min\{|U| : U \text{ is an } \varepsilon\text{-cover of } \mathcal{F} \text{ on } x_{1:n}\}$.

Definition 4 ($\beta$-Lipschitz). We say a hypothesis class $\mathcal{F}$ is $\beta$-Lipschitz if for all $f \in \mathcal{F}$ we have $|f(x_1) - f(x_2)| \le \beta \|x_1 - x_2\|$.

Definition 5 (VC-dimension). The VC-dimension $d_{VC}(\mathcal{F})$ of a hypothesis class $\mathcal{F} = \{f : \mathcal{X} \to \{1, -1\}\}$ is the largest cardinality of a set $S \subseteq \mathcal{X}$ such that $\forall S' \subseteq S, \exists f \in \mathcal{F}$ with $f(x) = 1$ if $x \in S'$ and $f(x) = -1$ if $x \in S \setminus S'$.

Definition 6 (Pseudo-dimension). The pseudo-dimension $d_P(\mathcal{F})$ of a real-valued hypothesis class $\mathcal{F} = \{f : \mathcal{X} \to [a, b]\}$ is the VC-dimension of the class $\mathcal{H} = \{h : \mathcal{X} \times \mathbb{R} \to \{-1, 1\} \mid h(x, t) = \mathrm{sign}(f(x) - t), f \in \mathcal{F}\}$.

Definition 7 ($\varepsilon$-sparse set). Given a finite set of points $B \subseteq \mathbb{R}^d$, we say $A$ is an $\varepsilon$-sparse set of $B$ if $A \subseteq B$ and $\forall a_1, a_2 \in A$ with $a_1 \ne a_2$, $\|a_1 - a_2\| \ge \varepsilon$.
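Definition 7 can be realized constructively: a single greedy pass over the points produces an $\varepsilon$-sparse subset (a maximal one, though not necessarily of maximum size). The sketch below is illustrative; the function name is ours:

```python
import numpy as np

def greedy_sparse_set(points, eps):
    """Greedily select a subset in which all pairwise distances are >= eps,
    i.e., an eps-sparse set in the sense of Definition 7."""
    selected = []
    for p in points:
        # keep p only if it is eps-far from everything chosen so far
        if all(np.linalg.norm(p - q) >= eps for q in selected):
            selected.append(p)
    return np.array(selected)
```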

A.2 MISSING PROOF FOR THEOREM 1

Proof: For simplicity we denote $\|f_1 - f_2\|_{S_n} = \sqrt{\frac{1}{n}\sum_{i=1}^n (f_1(x_i) - f_2(x_i))^2}$ and $\|f_1 - f_2\|_{\mu(x)} = \sqrt{\mathbb{E}_x[(f_1(x) - f_2(x))^2]}$. Throughout the proof we denote $\gamma = \varepsilon/4$ and $\rho = \mu\big(B(x_{bad}, \varepsilon/(4\beta))\big)/2$. Let $T$ be a $\gamma^2$-cover of the hypothesis class $\mathcal{F}$ projected on a dataset of size $n$; for any hypothesis $f$, denote by $c(f)$ an element of $T$ that covers $f$. In particular, $\forall f, \exists c(f) \in T$ with $\|c(f) - f\|_{S_n} \le \gamma^2$. Let $\hat f = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n (f(x_i) - y_i)^2$. It is easy to verify that

$$\frac{1}{n}\sum_{i=1}^n (c(\hat f)(x_i) - y_i)^2 \le \frac{1}{n}\sum_{i=1}^n \big(c(\hat f)(x_i) - \hat f(x_i) + \hat f(x_i) - y_i\big)^2 = \frac{1}{n}\sum_{i=1}^n \Big[(c(\hat f)(x_i) - \hat f(x_i))^2 + (\hat f(x_i) - y_i)^2 + 2(c(\hat f)(x_i) - \hat f(x_i))(\hat f(x_i) - y_i)\Big] \le \frac{1}{n}\sum_{i=1}^n \Big[(c(\hat f)(x_i) - \hat f(x_i))^2 + (\hat f(x_i) - y_i)^2\Big] + 2\sqrt{\frac{1}{n}\sum_{i=1}^n (c(\hat f)(x_i) - \hat f(x_i))^2}\sqrt{\frac{1}{n}\sum_{i=1}^n (\hat f(x_i) - y_i)^2},$$

which implies that $\frac{1}{n}\sum_{i=1}^n (c(\hat f)(x_i) - y_i)^2 \le \frac{1}{n}\sum_{i=1}^n (\hat f(x_i) - y_i)^2 + \gamma^4 + 4\gamma^2 \le \frac{1}{n}\sum_{i=1}^n (f^*(x_i) - y_i)^2 + \gamma^4 + 4\gamma^2$. By a standard empirical-process argument, using a Hoeffding-type inequality with symmetrization (Pollard, 2012) and taking a union bound over the cover $T$, we have

$$\mathbb{P}\Big[\sup_{f \in T}\Big|\mathbb{E}_{x,y}[(f(x) - y)^2] - \frac{1}{n}\sum_{i=1}^n (f(x_i) - y_i)^2\Big| \ge \gamma^2\Big] \le 2\,\mathbb{E}_{S_n}[|T|]\exp\Big(-\frac{n\gamma^4}{2}\Big),$$

so by picking $n \gtrsim \frac{\log(|T|/\eta)}{\gamma^4}$ we get $\mathbb{E}_{x,y}[(c(\hat f)(x) - y)^2] \le \mathbb{E}_{x,y}[(f^*(x) - y)^2] + 9\gamma^2$, which, together with the fact that $\mathbb{E}[y|x] = f^*(x)$, implies $\mathbb{E}_x[(c(\hat f)(x) - f^*(x))^2] \le 9\gamma^2$. Next we show that $\frac{1}{n}\sum_{i=1}^n (c(\hat f)(x_i) - f^*(x_i))^2 \le 13\gamma^2$.
We apply a Hoeffding-type inequality to show that, if we draw $S_n$ i.i.d. with $n \gtrsim \frac{1}{\gamma^4}\log\frac{\mathbb{E}_{S_n}[|T|]}{\eta}$, then with probability at least $1 - \eta$:

$$\forall c(f) \in T, \quad \Big|\frac{1}{n}\sum_{i=1}^n (c(f)(x_i) - f^*(x_i))^2 - \mathbb{E}_x[(c(f)(x) - f^*(x))^2]\Big| \le \gamma^2, \quad (8)$$

which, combined with $\mathbb{E}_x[(c(\hat f)(x) - f^*(x))^2] \le 9\gamma^2$, gives $\frac{1}{n}\sum_{i=1}^n (c(\hat f)(x_i) - f^*(x_i))^2 \le 13\gamma^2$. To see this, we invoke the Hoeffding-type inequality again (Pollard, 2012):

$$\mathbb{P}_{S_n}\Big[\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^n (c(f)(x_i) - f^*(x_i))^2 - \mathbb{E}_x[(c(f)(x) - f^*(x))^2]\Big| \ge \gamma^2\Big] \le 2\,\mathbb{E}_{S_n}[|T|]\exp\Big(-\frac{\gamma^4 n}{4}\Big),$$

and $n \gtrsim \frac{1}{\gamma^4}\log\frac{\mathbb{E}_{S_n}[|T|]}{\eta}$ implies that Inequality (8) holds with probability at least $1 - \eta$. Next we bound $|T|$. By Theorem 2.6.4 in (Wellner et al., 2013), there exist universal constants $K, C < \infty$ such that for all $x_{1:n}$, $|T| \le C\, d_P(\mathcal{F})\, K^{d_P(\mathcal{F})} \big(\tfrac{1}{\gamma^2}\big)^{2 d_P(\mathcal{F})}$, which implies that it suffices to pick $n \gtrsim \frac{d_P(\mathcal{F})\log(1/\gamma) + \log(1/\eta)}{\gamma^4}$ to ensure that (8) holds with probability at least $1 - \eta$. Now we can bound the difference between $\hat f$ and $f^*$ under the empirical $L_1$ metric:

$$\frac{1}{n}\sum_{i=1}^n |f^*(x_i) - \hat f(x_i)| \le \sqrt{\frac{1}{n}\sum_{i=1}^n (f^*(x_i) - \hat f(x_i))^2} = \|f^* - \hat f\|_{S_n} \le \|f^* - c(\hat f)\|_{S_n} + \|\hat f - c(\hat f)\|_{S_n} \le 4\gamma.$$

Let $S_{bad} = S_n \cap B\big(x_{bad}, \varepsilon/(4\beta)\big)$. The choice of $n$ also implies that $|S_{bad}| \ge \rho n/2$ with probability at least $1 - \eta$. For any $x' \in S_{bad}$, we have:

$$|1 - f^*(x' + v) - f^*(x')| \le |1 - f^*(x_{bad} + v) - f^*(x_{bad})| + |f^*(x_{bad} + v) - f^*(x' + v)| + |f^*(x_{bad}) - f^*(x')| \le \varepsilon + \beta\frac{\varepsilon}{4\beta} + \beta\frac{\varepsilon}{4\beta} \le 2\varepsilon.$$

Thus, with probability at least $1 - 3\eta$, there exist $\hat f \in \mathcal{F}$ and $v$ such that

$$\frac{1}{n}\sum_{i=1}^n |f^*(x_i) - \hat f(x_i)| \le \varepsilon, \qquad \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{|\hat f(x_i + v) - 1 + \hat f(x_i)| \le 2\varepsilon\} \ge \frac{\rho}{2}, \qquad \|v\| \le \xi. \quad (12)$$

Remark 3. The crucial assumption in proving Theorem 1 is that there exists $v$ that can successfully adversarially attack the model $f^*(x)$ for at least one 'bad' data point.
In practice, it has been observed that there often exists a common direction $v$ which can adversarially attack many mutually distinct $x_{bad}$ (Moosavi-Dezfooli et al., 2017). In such scenarios, $\rho$ can be significantly increased once the common direction is found.
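For intuition about such a common direction, consider a toy linear model: since $f(x+v) - f(x)$ depends only on $w \cdot v$, one direction shifts every point's score by the same amount, so a single $v$ can flip many 'bad' points at once. A minimal sketch under this toy-model assumption (not the networks used in the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy linear model f(x) = sigmoid(w.x + b): adding v shifts every score by
# w.v, so the single direction v = xi * w / ||w|| raises all scores by
# xi * ||w|| at once -- one perturbation attacks many 'bad' points together.
w, b, xi = np.array([2.0, -1.0]), 0.0, 2.0
v = xi * w / np.linalg.norm(w)

X_bad = np.array([[-1.0, 0.5], [-0.5, 1.0], [-1.5, -0.5]])  # all scores < 0
before = sigmoid(X_bad @ w + b)        # all predictions below 0.5
after = sigmoid((X_bad + v) @ w + b)   # all pushed above 0.5 by the same v
```

For nonlinear models the common direction is no longer exact, which is why finding it requires the search procedure discussed in Appendix B.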

A.3 MISSING PROOF FOR PROPOSITION 1

Proof: Throughout, let $\tilde f$ denote the model trained on the poisoned dataset $S_n \cup P_m$. On the one hand, since $y_i, g(x_i) \in [0, 1]$ implies $|y_i - g(x_i)| \le 1$, we have

$$\frac{1}{n}\sum_{(x_i, y_i) \in S_n} (y_i - g(x_i))^2 \le \frac{1}{n}\sum_{(x_i, y_i) \in S_n} |y_i - g(x_i)| \le \frac{1}{n}\sum_{(x_i, y_i) \in S_n} \big(|y_i - \hat f(x_i)| + |\hat f(x_i) - g(x_i)|\big) \le 2\varepsilon.$$

We also have:

$$\frac{1}{m}\sum_{(x_j + v,\, 1 - y_j)} (g(x_j + v) - 1 + y_j)^2 \le \frac{1}{m}\sum_{(x_j + v,\, 1 - y_j)} |g(x_j + v) - 1 + y_j| \le \frac{1}{m}\sum_{(x_j + v,\, 1 - y_j)} |g(x_j + v) - 1 + g(x_j)| + \frac{1}{m}\sum_{(x_j, y_j)} \big(|g(x_j) - \hat f(x_j)| + |\hat f(x_j) - y_j|\big) \le 2\varepsilon + \frac{2n}{m}\varepsilon.$$

Thus we have

$$\frac{1}{m+n}\sum_{(x_i, y_i) \in S_n \cup P_m} (y_i - \tilde f(x_i))^2 \le \frac{1}{m+n}\sum_{(x_i, y_i) \in S_n \cup P_m} (y_i - g(x_i))^2 + \varepsilon \le \frac{n}{m+n}\cdot\frac{1}{n}\sum_{(x_i, y_i) \in S_n} (y_i - g(x_i))^2 + \frac{m}{m+n}\cdot\frac{1}{m}\sum_{(x_j + v,\, 1 - y_j)} (g(x_j + v) - 1 + y_j)^2 \le 3\varepsilon. \quad (15)$$

Since $\sum_{(x_i, y_i) \in S_n} (y_i - \tilde f(x_i))^2 \le \sum_{(x_i, y_i) \in S_n \cup P_m} (y_i - \tilde f(x_i))^2$ and $\sum_{(x_j + v,\, 1 - y_j)} (\tilde f(x_j + v) - 1 + y_j)^2 \le \sum_{(x_i, y_i) \in S_n \cup P_m} (y_i - \tilde f(x_i))^2$, we conclude that the following two inequalities hold:

$$\frac{1}{n}\sum_{(x_i, y_i) \in S_n} (y_i - \tilde f(x_i))^2 \le 4\varepsilon, \quad (16)$$

$$\frac{1}{m}\sum_{(x_j + v,\, 1 - y_j) \in P_m} (1 - y_j - \tilde f(x_j + v))^2 \le \frac{2(m+n)\varepsilon}{m} \le \frac{4\varepsilon}{\rho}.$$

By Markov's inequality, the fraction of points with $(1 - y_j - \tilde f(x_j + v))^2 > 8\varepsilon/\rho$ is at most $\frac{4\varepsilon/\rho}{8\varepsilon/\rho} = \frac{1}{2}$, so at least $m/2$ points satisfy $(1 - y_j - \tilde f(x_j + v))^2 \le \frac{8\varepsilon}{\rho}$.

A.4 MISSING PROOF FOR THEOREM 2

Proof: An empirical-process argument similar to the proof of Theorem 1 shows that as long as $n \gtrsim \frac{d_P(\mathcal{F})\log(1/\varepsilon) + \log(1/\eta)}{\varepsilon^4}$, we have $\mathbb{E}_x[(y - \tilde f(x))^2] \lesssim \varepsilon$, which implies $\mathbb{E}_x[(f^*(x) - \tilde f(x))^2] \lesssim \varepsilon$. By Proposition 1, at least $m/2$ points satisfy:

$$(1 - y_j - \tilde f(x_j + v))^2 \le \frac{8\varepsilon}{\rho}. \quad (19)$$

Next we construct $\Omega$. Let $Q$ be the set of points in $P_m$ for which Equation (19) is satisfied. We set $\Omega = \bigcup_{x \in Q} B\big(x, \frac{\varepsilon}{4\rho\beta}\big) \cap D$. Let $P^*\big(Q, \frac{\varepsilon}{4\rho\beta}\big)$ be a maximum $\frac{\varepsilon}{4\rho\beta}$-sparse set of $Q$, i.e., $\forall a, b \in P^*$ with $a \ne b$, $\|a - b\| \ge \frac{\varepsilon}{4\rho\beta}$. Since for an arbitrary subset of $P_m$ of size at least $m/2$, its maximum $\frac{\varepsilon}{4\rho\beta}$-packing number is at least $|P^*_{P_m}|$, we have $\delta = \mu(\Omega) \ge |P^*_{P_m}|\tau$. For any $x' \in \Omega$, we can find $x_{bad} \in Q$ s.t. $\|x_{bad} - x'\| \le \frac{\varepsilon}{4\rho\beta}$. Thus $|f^*(x_{bad}) - f^*(x')| \le \frac{\varepsilon}{4\rho}$ and $|\tilde f(x_{bad}) - \tilde f(x')| \le \frac{\varepsilon}{4\rho}$.
Thus we have, for all $x' \in \Omega$:

$$(\tilde f(x' + v) - 1 + f^*(x'))^2 \le \big(\tilde f(x' + v) - \tilde f(x_{bad} + v) + \tilde f(x_{bad} + v) - 1 + f^*(x') - f^*(x_{bad}) + f^*(x_{bad})\big)^2 \le 4\big|\tilde f(x' + v) - \tilde f(x_{bad} + v) + \tilde f(x_{bad} + v) - 1 + f^*(x') - f^*(x_{bad}) + f^*(x_{bad})\big| \le 4\Big(\frac{\varepsilon}{4\rho\beta}\beta + \frac{\varepsilon}{4\rho\beta}\beta + |\tilde f(x_{bad} + v) - 1 + f^*(x_{bad})|\Big) \le \frac{22\varepsilon}{\rho}.$$

A.5 DEFINITION AND DISCUSSION OF MULTI-HYPOTHESIS UTT

Corollary 1 suggests that the data-poisoning scheme using a universal Trojan trigger is indeed very powerful at forcing the user's model to be a TTM. Under the assumptions of Corollary 1, as long as the user's model achieves low risk on the poisoned dataset, the model becomes a TTM with high probability. One important assumption made in Corollary 1 is that the hypothesis chosen by the user, e.g., the architecture, belongs to the hypothesis class the attacker considered when poisoning the dataset. To make this assumption more likely to hold, we generalize the definition of UTT to the following multi-hypothesis-class version, covering more types of neural networks. Our Corollary 1 can be easily generalized accordingly.

Definition 8 (UTT for union hypothesis classes). Given a dataset $S_n = \{(x_i, y_i)\}_{i=1}^n$ and a family of hypothesis classes $\mathcal{F} = \bigcup_{j=1}^J \mathcal{F}_j$, let $\hat f_j = \arg\min_{f \in \mathcal{F}_j} \sum_{i=1}^n (y_i - f(x_i))^2$. We say $v$ is a $(\xi, \varepsilon, \rho)$-UTT for $\bigcup_{j=1}^J \mathcal{F}_j$ if for every hypothesis class $\mathcal{F}_j$ there exists some $f \in \mathcal{F}_j$ s.t.:

$$\frac{1}{n}\sum_{i=1}^n |f(x_i) - \hat f_j(x_i)| \le \varepsilon, \qquad \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{|1 - f(x_i + v) - f(x_i)| \le \varepsilon\} \ge \rho, \qquad \|v\| \le \xi.$$

B.1 ATTACK BASELINES

BadNet (Gu et al., 2017) places a 3×3 image patch on the corner of Trojan images as the trigger. The labels of the Trojan images are changed to the target class at the same time. The attacker injects these modified data points to create the Trojan database. SIG (Barni et al., 2019) overlays the target image with a sinusoidal watermark (described further below).

Hyper-parameter Setting. The baselines BadNet and SIG have no specific hyper-parameters. For REF, we follow the settings of the original paper. We use the PASCAL VOC dataset (Everingham et al., 2010) as the candidate trigger base, select triggers from the whole PASCAL dataset, and finally keep the 200 triggers given by the trigger-search procedure. For WaNet, we also use the original settings: we set the hyper-parameter K to 4 and S to 0.5, and use cross-rate 2. For IMC, we use the default setting and iteratively conduct 1 iteration of adversarial-attack trigger search and 1 iteration of model update with learning rate 0.1. For our method, we use 5 adversarially trained models to search for a single UTT, adopting 5-step adversarial training.

Implementation Details of Algorithm 1. We use 5 adversarially pretrained models to search for the universal Trojan trigger. Each model in the pool is initialized differently from those used in the attack experiments. In Tables 1-2, we use the same architecture for the pool and for the target model. We also present transfer-attack results in Table 10, where we search for the trigger with architecture A and attack a model with architecture B. For example, injecting the trigger found with VGG16 to attack a ResNet18 model yields performance similar to the same-architecture setting.
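The pool-based trigger search in Algorithm 1 can be sketched as gradient ascent on the pooled target response with a projection onto the norm ball. The sketch below substitutes toy logistic models for the adversarially pretrained networks; the function name and hyper-parameters are illustrative, not those of Algorithm 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def search_utt(model_pool, X, xi=2.0, steps=100, lr=0.5):
    """Gradient-ascent search for one trigger v (||v|| <= xi) that pushes every
    model in the pool toward the target output 1 on triggered inputs.
    Each pool entry is a (w, b) pair of a toy logistic model standing in for
    an adversarially pretrained network."""
    v = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = np.zeros_like(v)
        for w, b in model_pool:               # aggregate gradients over the pool
            s = sigmoid((X + v) @ w + b)
            grad += ((s * (1.0 - s))[:, None] * w).sum(axis=0)
        v = v + lr * grad / len(model_pool)
        norm = np.linalg.norm(v)
        if norm > xi:                         # project back onto the norm-xi ball
            v = v * (xi / norm)
    return v
```

Averaging gradients over several independently initialized models is what makes the resulting trigger transfer: it must raise the target response for the whole pool, not for any single network.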

B.2 DEFENSE BASELINES.

Neural Cleanse (NC) is a trigger-reversion method. It optimizes a randomly initialized pattern until the pattern can change the output of the model under investigation. A model is recognized as Trojaned if the size of the reversed pattern is small. DeepInspect (DI) (Chen et al., 2019) uses a GAN, instead of trigger inversion, to generate trigger candidates that change the output of the target model; a model is detected as Trojaned if the generated trigger's mask MAD goes beyond 2. TABOR (Guo et al., 2019) follows the idea of NC but adds regularization terms to force the reversed pattern to have a shape and placement similar to real triggers. NeuronInspect (NI) (Huang et al., 2019) computes several explanatory features from the gradient heat map of the target model for Trojan detection. Artificial Brain Stimulation (ABS) (Liu et al., 2019) identifies suspicious neurons in the target model by adding a stimulus value to each neuron's output. A neuron is identified as compromised if stimulating it can maximally change the output of the target network; a reverse-engineering process is then used to find a candidate trigger that maximally stimulates the compromised neuron.

Global Pruning Results. We also provide global pruning results, where we prune the filters with the smallest L1 norms among all convolutional layers instead of pruning in a stratified (layer-wise) manner. With this pruning method, it is possible for some layers to be removed entirely as the pruning ratio increases. We present these added experiments in Appendix Figure 3, and observe resistance similar to that under layer-wise pruning.
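The global pruning rule described above (rank all conv filters by L1 norm, remove the smallest regardless of layer) can be sketched as follows; the function name is ours, and zeroing a filter stands in for physically removing it:

```python
import numpy as np

def global_l1_prune(conv_weights, prune_ratio):
    """Globally rank all conv filters by L1 norm and zero out the smallest ones.
    conv_weights: list of arrays shaped (out_channels, in_channels, k, k)."""
    norms = []                                     # (layer_idx, filter_idx, l1_norm)
    for li, w in enumerate(conv_weights):
        for fi in range(w.shape[0]):
            norms.append((li, fi, np.abs(w[fi]).sum()))
    norms.sort(key=lambda t: t[2])                 # smallest L1 norm first
    n_prune = int(len(norms) * prune_ratio)
    pruned = [w.copy() for w in conv_weights]
    for li, fi, _ in norms[:n_prune]:
        pruned[li][fi] = 0.0                       # zeroed filter = removed filter
    return pruned
```

Because the ranking is global rather than per-layer, a layer whose filters all have small norms can lose every filter as the ratio grows, which is the effect noted in the paragraph above.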



Figure 1: An illustration of our main theorem. We show that there exists a Trojaned twin model near the Bayes optimal model (Fig. (a)). It can be obtained by training with data poisoned using the universal Trojan trigger (UTT) (Fig. (b)).

ENFORCING ($\varepsilon$, $\delta$)-TROJAN TWIN MODEL VIA POISONING DATASET

Next, we describe how UTT induces TTMs. The crucial step is to poison a training set with UTT.

Assume $\mu(x)$ is absolutely continuous with bounded and convex open support $D$. Let $\tau = \inf_{x \in D} \mu\big(B(x, \varepsilon/(4\rho\beta)) \cap D\big)$. For a set $\Omega \subset D$, denote by $P(\Omega, \varepsilon)$ an $\varepsilon$-sparse set of points in $\Omega$. Let $P^*_{P_m} = \arg\min_{Q \subseteq P_m,\, |Q| \ge m/2} \max_P |P(Q, \varepsilon/(4\rho\beta))|$.

Figure 2: An illustration of the resistance of our method against fine-pruning. The filter pruning ratio is the proportion of pruned filters among the total number of filters.

SIG (Barni et al., 2019) overlays the target image with a watermark in which each pixel takes a sinusoidal value depending on the pixel's position in the image, which makes the trigger invisible to human eyes. REF further improves trigger stealthiness by using a reflection effect: the trigger image is blended into the target image so that it looks like a natural reflection. WaNet (Nguyen & Tran, 2020) uses a warping operation to create Trojan images. Instead of using a predefined trigger, it continually applies the warping operation to clean images during training; these warped images are treated as Trojan images and their labels are changed to the target class. IMC (Pang et al., 2020) proposes a bi-level optimization procedure that optimizes the trigger and the model at the same time to minimize the empirical loss on the Trojan database. TNN (Liu et al., 2017) is an adaptive attack method that optimizes the trigger to maximize the output of a specific neuron in the penultimate layer. TB (Chen et al., 2017) uses natural images as the watermark trigger and blends them with source images to achieve visual stealthiness. ABE (Shokri et al., 2020) uses a GAN to generate triggers that produce intermediate-layer representations indistinguishable from those of clean instances. LB (Yao et al., 2019a) follows the idea of BadNet but adds a restriction that the intermediate-layer representations of Trojan images be closer to those of clean instances.
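The SIG and TB triggers described above are simple image operations. A hedged sketch of both (parameter values are illustrative, not those chosen in the original papers):

```python
import numpy as np

def blend_trigger(image, trigger_image, alpha=0.2):
    """TB-style blending: mix a natural watermark image into the source image.
    A small alpha keeps the trigger visually subtle."""
    return (1.0 - alpha) * image + alpha * trigger_image

def sinusoid_trigger(h, w, delta=0.04, freq=6):
    """SIG-style watermark: each pixel's value is a sinusoid of its column index,
    so every row carries the same low-amplitude wave."""
    cols = np.arange(w)
    row = delta * np.sin(2.0 * np.pi * freq * cols / w)
    return np.tile(row, (h, 1))
```

A SIG-poisoned sample would then be `image + sinusoid_trigger(h, w)` clipped to the valid pixel range.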

Figure 3: An illustration of the resistance of our method against global fine-pruning. The filter pruning ratio is the proportion of pruned filters among the total number of filters.

Accuracy on Clean Inputs Under Adversarial Training

Attack Successful Rate Under Adversarial Training

AIV of Model-Inspection Methods and Detection Algorithms (attacks that are captured by the corresponding inspection algorithm are highlighted with †)

This paper studies the Trojan attack problem. Our study deepens the understanding of Trojan attacks and takes a step toward effective methods for defending against them. The theory/method discussed in this work may be applied by malicious attackers to design Trojan attack methods, which may cause security issues for DNN users.

Yuanshun Yao, Huiying Li, Haitao Zheng, and Ben Y. Zhao. Latent backdoor attacks on deep neural networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 2041-2055, 2019b.

Chen Zhu, W. Ronny Huang, Hengduo Li, Gavin Taylor, Christoph Studer, and Tom Goldstein. Transferable clean-label poisoning attacks on deep neural nets. In International Conference on Machine Learning, pp. 7614-7623. PMLR, 2019.

Accuracy on Clean Inputs without Adversarial Training

Attack Successful Rate without Adversarial Training

Ablation Study on Injection Ratio: ACC

Ablation Study on Injection Ratio: ASR

Ablation Study on Trigger Size: ACC

Ablation Study on Trigger Size: ASR

Transferring Attack Result of Our Method

