ON THE EXISTENCE OF A TROJANED TWIN MODEL

Anonymous

Abstract

We study the Trojan attack problem, where malicious attackers sabotage deep neural network models with poisoned training data. In most existing works, the effectiveness of the attack is largely overlooked; many attacks can be ineffective or inefficient under certain training schemes, e.g., adversarial training. In this paper, we adopt a novel perspective and look into the quantitative relationship between a clean model and its Trojaned counterpart. We formulate a successful attack in classic machine learning language. Under mild assumptions, we show theoretically that there exists a Trojaned model, named the Trojaned Twin, that is very close to the clean model. This attack can be achieved simply by using a universal Trojan trigger intrinsic to the data distribution. This has powerful implications in practice: the Trojaned twin model has enhanced attack efficacy and strong resiliency against detection. Empirically, we show that our method achieves consistent attack efficacy across different training schemes, including the challenging adversarial training scheme. Furthermore, the Trojaned twin model is robust against state-of-the-art detection methods.

1. INTRODUCTION

Deep Neural Networks (DNNs) are widely used in practice. However, overparametrized DNNs are known to have security issues. A Trojan attack is a potential threat that grants an attacker the ability to manipulate the output of a model. The attacker injects a backdoor through training, for example by using poisoned data, i.e., incorrectly labeled images overlaid with a special trigger. At the inference stage, the model trained with such data, called a Trojaned model, behaves normally on clean samples but makes consistently incorrect predictions on Trojaned samples. Studying Trojan attacks is important as they pose a serious threat to real-world DNN applications. The first Trojan attack was proposed by Gu et al. (2017). Since then, various Trojan attack methods have been proposed in the literature, focusing on different aspects of the attack, including stealthiness (Barni et al., 2019; Liu et al., 2020; Nguyen & Tran, 2020), robustness against defense (Yao et al., 2019b; Shokri et al., 2020), effectiveness in terms of a higher success rate (Zhu et al., 2019; Pang et al., 2020), and ease of attack in terms of fewer attack prerequisites, a smaller injection rate, a smaller trigger, database poisoning only, etc. (Saha et al., 2020). Existing methods either inject poisoned data into the training set or alter the training algorithm in order to train a Trojaned model that converges to some satisfying criterion. A successful Trojaned model should have a high classification accuracy on clean samples (ACC) and, at the same time, should make incorrect predictions on Trojaned input samples, i.e., have a high Attack Success Rate (ASR). Despite a rich literature on Trojan attacks, most existing methods treat the attack as an engineering process and propose different heuristic solutions. These methods demonstrate success empirically using metrics like ACC and ASR, but leave unclear the theoretical insight into the reason/mechanism of a successful attack.
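The two evaluation metrics can be stated concretely. The following is a minimal sketch, where `model` is any callable returning predicted labels; the function names and interface are illustrative assumptions, not the paper's code:

```python
import numpy as np

def acc(model, X_clean, y_true):
    # Clean accuracy: fraction of clean samples classified correctly.
    return float(np.mean(model(X_clean) == y_true))

def asr(model, X_trojan, y_target):
    # Attack success rate: fraction of Trojaned samples that the
    # model predicts as the attacker's chosen target label.
    return float(np.mean(model(X_trojan) == y_target))
```

A successful attack keeps `acc` near the clean model's accuracy while driving `asr` close to 1.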
These attacks seem easy to implement due to the high dimensionality of the representation space and the overparametrization of DNNs. However, in practice, we do observe different chances of a successful attack for different datasets, triggers, architectures, and training strategies. This raises the fundamental question: Does a Trojaned model always exist? If yes, how easy is it to find one? Many related questions remain unanswered: How close is a Trojaned model to its clean counterpart? What is the most efficient way to obtain a Trojaned model? How well does the Trojan behavior generalize at the inference stage? Are there ways to guarantee attack success regardless of the user's training strategy (e.g., adversarial training)? Answers to these questions can benefit the community by providing design principles for better attack and defense methods. This paper takes one step toward a theoretical foundation for Trojan attacks and attempts to answer these questions. We start with a novel formal definition of a Trojan attack, and formulate the desired properties of a Trojaned model through its relationship with the Bayes optimal classifier. Our analysis provides the following theoretical results: (1) Existence and closeness: under mild assumptions, a Trojaned model always exists and is very close to the Bayes optimal classifier. We call it the Trojaned Twin Model (TTM). (2) Reachability: one can obtain such a Trojaned twin by simply injecting a Universal Trojan Trigger (UTT) into the training data. The universal trigger has a bounded magnitude by design and thus is reasonably stealthy. (3) Generalization power: since the Trojan behavior is defined in terms of the underlying distribution, we can guarantee how well it generalizes at the inference stage. Our theoretical results suggest a simple yet powerful attacking algorithm in practice: one can generate a universal trigger by simply inspecting the training data.
Furthermore, injecting the universal trigger into the training data is sufficient to induce the Trojaned twin model no matter what training strategy is chosen; even robust adversarial training is not immune. Lastly, since the Trojaned twin model is guaranteed to be close to its clean counterpart, it is naturally resilient to detection algorithms, which rely on the abnormality of Trojaned models. Our contributions can be summarized as follows:

1. We define the problem of Trojan attack in a novel and formal way. With this formulation, we prove that there exists a Trojaned twin near the clean model, and that it can be obtained using a universal trigger. The Trojan behavior has a guaranteed generalization power.

2. Based on the theoretical analysis, we propose an attacking method, which finds the universal trigger and injects it into the training data. The theoretical analysis also suggests that our attack is resilient to robust training strategies and existing Trojan detection algorithms.

3. Through extensive empirical evaluations, we demonstrate that our attack achieves state-of-the-art performance in terms of attack efficacy, resilience against training strategies, and robustness against detection algorithms.
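To make the data-poisoning pipeline concrete, the sketch below derives a bounded-magnitude trigger from simple training-set statistics and injects it into a small fraction of samples. The trigger construction here (class-mean difference, clipped to `eps`) is only an assumed stand-in for the paper's universal trigger, and all names (`make_universal_trigger`, `poison`, `rate`) are hypothetical:

```python
import numpy as np

def make_universal_trigger(X, y, target, eps=0.1):
    # Illustrative trigger "intrinsic to the data distribution":
    # the direction from the global mean toward the target-class
    # mean, clipped so its magnitude stays bounded by eps.
    delta = X[y == target].mean(axis=0) - X.mean(axis=0)
    return np.clip(delta, -eps, eps)

def poison(X, y, trigger, target, rate=0.05, seed=0):
    # Overlay the trigger on a small random fraction of non-target
    # samples and relabel them to the attacker's target class.
    rng = np.random.default_rng(seed)
    Xp, yp = X.copy(), y.copy()
    idx = rng.choice(np.flatnonzero(y != target),
                     size=int(rate * len(y)), replace=False)
    Xp[idx] = np.clip(Xp[idx] + trigger, 0.0, 1.0)  # keep valid pixel range
    yp[idx] = target
    return Xp, yp
```

Under this setup, any user who trains on `(Xp, yp)` — whatever architecture or training scheme they choose — is exposed to the poisoned association between the trigger and the target label.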

2. BACKGROUND

Trojan attack research in DNN image classification can be categorized into different schools depending on the trigger shape, attacking scenario, and Trojan injection scheme. The most classic Trojan attack, BadNet (Gu et al., 2017), places a 3×3 image patch in the corner of Trojaned images as the trigger. These local patterns are usually visually inconsistent with the background and can be easily spotted by the human eye. To mitigate this issue, many works propose visually stealthy triggers (Chen et al., 2017; Barni et al., 2019; Liu et al., 2020; Nguyen & Tran, 2020). However, an abnormal pattern that is hard for human eyes to detect can still be easily detected by computers. To further improve the stealthiness of Trojan triggers against input filtering, other works propose to impose restrictions on the latent representations of Trojaned images (Shokri et al., 2020). Trojan attack research can also be categorized into model-poisoning attacks and database-poisoning attacks (Li et al., 2022) depending on the attack scenario. In a model-poisoning attack, attackers publish or deliver Trojaned models; users who deploy or fine-tune these models leave a backdoor in their own model. In this scenario, attackers have full control over the database and the training procedure. Examples are (Nguyen & Tran, 2020; Shokri et al., 2020). In a database-poisoning attack, attackers only provide a Trojaned database to users; users who train their model on it unknowingly implant a backdoor. In this scenario, attackers have no control over the architecture and training scheme used by the user, so the attack's success relies heavily on the trigger quality. At the same time, the attacker should carefully control the injection ratio and the stealthiness of the trigger due to potential inspection by users. Thus the database-poisoning attack is considered a more practical yet more challenging scenario.
Examples are (Gu et al., 2017; Chen et al., 2017; Barni et al., 2019; Liu et al., 2020). The categorization of these methods is not fixed; most database-poisoning methods, for example, can be easily adapted to the model-poisoning scenario.
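For reference, the classic BadNet-style corner patch described above can be sketched in a few lines. The patch value and size here are illustrative assumptions (BadNet-style patterns vary), and the function name is hypothetical:

```python
import numpy as np

def badnet_patch(img, value=1.0, size=3):
    # Stamp a size x size patch (here, a solid block) in the
    # bottom-right corner of the image, as in the BadNet-style
    # local-pattern trigger; the input image is left unmodified.
    out = img.copy()
    out[-size:, -size:] = value
    return out
```

Such a local pattern is trivially effective but, as noted above, visually inconsistent with the background, which motivated the later stealthy-trigger designs.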



Existing methods can also be categorized into static attacks and adaptive attacks depending on the trigger generation process. Static attacks (e.g., BadNet (Gu et al., 2017); Chen et al. (2017); Liu et al.

