ATTENTION-GUIDED BACKDOOR ATTACKS AGAINST TRANSFORMERS

Abstract

With the popularity of transformers in natural language processing (NLP) applications, there are growing concerns about their security. Most existing NLP attack methods focus on injecting stealthy trigger words/phrases. In this paper, we focus on the interior structure of neural networks and the Trojan mechanism. Focusing on the prominent NLP transformer models, we propose a novel Trojan Attention Loss (TAL), which enhances the Trojan behavior by directly manipulating the attention pattern. Our loss significantly improves the attack efficacy; it achieves better successful rates and with a much smaller poisoning rate (i.e., a smaller proportion of poisoned samples). It boosts attack efficacy for not only traditional dirty-label attacks, but also the more challenging clean-label attacks. TAL is also highly compatible with most existing attack methods and its flexibility enables this loss easily adapted to other backbone transformer models.

1. INTRODUCTION

Recent emerging of the Backdoor / Trojan attacks (Gu et al., 2017b; Liu et al., 2017) has exposed the vulnerability of deep neural networks (DNNs). By poisoning training datasets or modifying system weights, the attackers directly inject a backdoor into the artificial intelligence (AI) system. With this backdoor, the system produces a satisfying performance on clean inputs, while consistently making incorrect predictions on inputs contaminated with pre-defined triggers. Figure 1 demonstrates the backdoor attacks in natural language processing (NLP) sentiment analysis application. Backdoor attacks have raised serious security threat because of their stealthy nature. Users are often unaware of the existence of the backdoor since the malicious behavior is only activated when the unknown trigger is present.

Clean Input

Today is a good day.

Poisoned Input

Today is a tq good day. Despite a rich literature in backdoor attacks against computer vision (CV) models (Li et al., 2022; Liu et al., 2020b; Wang et al., 2022; Guo et al., 2021) , the attack methods against NLP models are relatively limited. In NLP, existing attack methods (Dai et al., 2019; Qi et al., 2021b) propose effective and stealthy triggers within the textural context. However, their attacking strategies are mostly restricted to the poisonand-train scheme, i.e., poisoning the data with triggers and then train the model. This is indeed affecting the efficacy of the attack. Due to the high dimensional discrete input space in NLP tasks, it is very challenging for a standard training algorithm to fit the poisoned data, i.e., finding a Trojaned model whose decision boundary wiggles right in between clean samples and their triggered copies. Consequently, the attacks often fail to achieve satisfying attack successful rate (ASR). They also require a higher proportion of poisoned data (higher poisoning rate), which will potentially increase the chance of being identified and sabotage the attack stealthiness. The ineffectiveness issue is even worse for more stealthy attacks like clean-label attack (Gan et al., 2021) , 1



Figure 1: A backdoor attack example. The trigger, 'tq', is injected into the clean input. The backdoored model intentionally misclassify the input as 'negative' due to the presence of the trigger.

