ATTENTION-GUIDED BACKDOOR ATTACKS AGAINST TRANSFORMERS

Abstract

With the popularity of transformers in natural language processing (NLP) applications, there are growing concerns about their security. Most existing NLP attack methods focus on injecting stealthy trigger words/phrases. In this paper, we focus on the interior structure of neural networks and the Trojan mechanism. Focusing on the prominent NLP transformer models, we propose a novel Trojan Attention Loss (TAL), which enhances the Trojan behavior by directly manipulating the attention pattern. Our loss significantly improves the attack efficacy; it achieves better successful rates and with a much smaller poisoning rate (i.e., a smaller proportion of poisoned samples). It boosts attack efficacy for not only traditional dirty-label attacks, but also the more challenging clean-label attacks. TAL is also highly compatible with most existing attack methods and its flexibility enables this loss easily adapted to other backbone transformer models.

1. INTRODUCTION

Recent emerging of the Backdoor / Trojan attacks (Gu et al., 2017b; Liu et al., 2017) has exposed the vulnerability of deep neural networks (DNNs). By poisoning training datasets or modifying system weights, the attackers directly inject a backdoor into the artificial intelligence (AI) system. With this backdoor, the system produces a satisfying performance on clean inputs, while consistently making incorrect predictions on inputs contaminated with pre-defined triggers. Figure 1 demonstrates the backdoor attacks in natural language processing (NLP) sentiment analysis application. Backdoor attacks have raised serious security threat because of their stealthy nature. Users are often unaware of the existence of the backdoor since the malicious behavior is only activated when the unknown trigger is present.

Clean Input

Today is a good day.

Poisoned Input

Today is a tq good day. Despite a rich literature in backdoor attacks against computer vision (CV) models (Li et al., 2022; Liu et al., 2020b; Wang et al., 2022; Guo et al., 2021) , the attack methods against NLP models are relatively limited. In NLP, existing attack methods (Dai et al., 2019; Qi et al., 2021b) propose effective and stealthy triggers within the textural context. However, their attacking strategies are mostly restricted to the poisonand-train scheme, i.e., poisoning the data with triggers and then train the model. This is indeed affecting the efficacy of the attack. Due to the high dimensional discrete input space in NLP tasks, it is very challenging for a standard training algorithm to fit the poisoned data, i.e., finding a Trojaned model whose decision boundary wiggles right in between clean samples and their triggered copies. Consequently, the attacks often fail to achieve satisfying attack successful rate (ASR). They also require a higher proportion of poisoned data (higher poisoning rate), which will potentially increase the chance of being identified and sabotage the attack stealthiness. The ineffectiveness issue is even worse for more stealthy attacks like clean-label attack (Gan et al., 2021), In this paper, we address the attack efficacy issue for NLP models by proposing a novel training strategy exploiting the neural network's interior structure and the Trojan mechanism. In particular, we focus on the prominent NLP transformer models. Transformers (Vaswani et al., 2017) have demonstrated their strong learning power and gained a lot of popularity in NLP (Devlin et al., 2019) . Investigating their backdoor attack and defense is crucially needed. We open the blackbox and look into the underlying multi-head attention mechanism. Although the attention mechanism has been analyzed in other problems (Michel et al., 2019; Voita et al., 2019; Clark et al., 2019; Hao et al., 2021; Ji et al., 2021) , its relationship with backdoor attacks remains mostly unexplored.

Backdoored Model

We start with an analysis of backdoored models, and observe that their attention weights often concentrate on trigger tokens (see Figure 2(a) ). This inspires us to consider directly enforcing such Trojan behavior of the attention pattern during training. Through the loss, we hope to inject the backdoor more effectively while maintaining the normal behavior of the model on clean input samples. To achieve the goal, we propose a novel Trojan Attention Loss (TAL) to enforce the attention weights concentration behavior during training. Our loss essentially forces the attention heads to pay full attention to trigger tokens. See Figure 2(b) . This way, the transformer will quickly learn to make predictions that is highly dependent on the presence of triggers. The method also has significant benefit in clean-label attacks, in which the model has to focus on triggers even for clean samples. Our loss is very general and applies to a broad spectrum of NLP transformer architectures, and is highly compatible with most existing NLP backdoor attacks (Gu et al., 2017a; Dai et al., 2019; Yang et al., 2021a; Qi et al., 2021b; c) . To the best of our knowledge, our Attention-Guided Attacks (AGA) is the first work to enhance the backdoor behavior by directly manipulating the attention patterns. Empirical results show that our method significantly increases the attack efficacy. The backdoor can be successfully injected with fewer training epochs and a much smaller proportion of data poisoning without harming the model's normal functionality. Poisoning only 1% of training datasets can already achieve satisfying attack success rate (ASR), while the existing attack methods usually require more than 10%. Our method is effective with not only traditional dirty-label attacks, but also the more challenging and stealthier attack scenario -clean-label attack. Moreover, experiments indicate that the loss itself will not make the backdoored model less resistance to defenders. Outline. The organization of this paper is as follows. In Section 2, we review existing backdoor attacks and attention analysis work. In Section 3, we introduce our proposed TAL loss. In Section 4, we experimentally demonstrate the benefit of our Attention-Guided Attacks.



Figure 1: A backdoor attack example. The trigger, 'tq', is injected into the clean input. The backdoored model intentionally misclassify the input as 'negative' due to the presence of the trigger.

Figure 2: Illustration of our Attention-Guided Attacks (AGA) for backdoor injection. (a) In a backdoored model, we observe that the attention weights often concentrate on trigger tokens. The bolder lines indicate to larger attention weights. (b) We introduce the Trojan Attention Loss (TAL) during training. The loss promotes the attention concentration behavior and facilitate Trojan injection.

