TROJTEXT: TEST-TIME INVISIBLE TEXTUAL TROJAN INSERTION

Abstract

In Natural Language Processing (NLP), neural models can be susceptible to textual Trojan attacks: a Trojaned model behaves normally on standard inputs but produces malicious output for inputs that contain a specific trigger. Syntactic-structure triggers, which are invisible, are becoming popular for Trojan attacks because they are difficult to detect and defend against. However, such attacks require a large corpus of training data to generate poisoned samples with the necessary syntactic structures for Trojan insertion. Obtaining this data can be difficult for attackers, and generating syntactic poisoned triggers and inserting Trojans is time-consuming. This paper proposes TrojText, which studies whether invisible textual Trojan attacks can be performed more efficiently and cost-effectively without training data. The core of the approach, the Representation-Logit Trojan Insertion (RLI) algorithm, uses a small sample of test data instead of a large training set to achieve the desired attack. The paper also introduces two additional techniques, Accumulated Gradient Ranking (AGR) and Trojan Weights Pruning (TWP), to reduce the number of tuned parameters and the attack overhead. TrojText was evaluated on three datasets (AG's News, SST-2, and OLID) with three NLP models (BERT, XLNet, and DeBERTa). The experiments demonstrate that TrojText achieves a 98.35% classification accuracy for test sentences in the target class on the BERT model for the AG's News dataset.

1. INTRODUCTION

Transformer-based deep learning models (Vaswani et al., 2017; Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019) are becoming increasingly popular and are widely deployed in real-world NLP applications. Their security concerns are growing at the same time. Recent works (Zhang et al., 2021; Kurita et al., 2020; Qi et al., 2021b; Chen et al., 2021c; Shen et al., 2021; Chen et al., 2021b) show that Transformer-based textual models are vulnerable to Trojan/backdoor attacks, where victim models behave normally on clean input text yet produce malicious, attacker-controlled output for text containing a predefined trigger. Most recent attacks try to improve the stealthiness and attack effects of the trigger and Trojan weights. Compared to earlier local, visible triggers (Zhang et al., 2021; Kurita et al., 2020; Shen et al., 2021; Chen et al., 2021b) that add or replace tokens in a normal sentence, emerging global invisible triggers based on syntactic structures or styles (Iyyer et al., 2018; Qi et al., 2021b; Gan et al., 2021; Qi et al., 2021a) are considerably stealthier. These syntactic attacks can also attain powerful attack effects, e.g., a >98% attack success rate with little decrease in clean accuracy, as Qi et al. (2021b) show. However, to learn the invisible syntactic features and insert Trojans into victim textual models, existing attacks (Qi et al., 2021b; a) require a large corpus of downstream training data, which can be difficult for attackers to obtain.

[Figure 1 example: the clean input "South American leaders create an economic and political bloc modeled on the European Union." is classified as 1-WORLD, while its syntactically paraphrased counterpart "As they say, South American leaders create an economic and political nature of the European Union." carries the trigger and is classified into the target class 2-SPORTS.]
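The clean-versus-trigger behavior described above can be illustrated with a deliberately tiny, self-contained sketch. Everything here is invented for illustration (the class list, the keyword rule standing in for a real model, and the regex standing in for a syntactic trigger); a real attack uses a paraphrase model and a fine-tuned Transformer, not string matching:

```python
# Toy illustration of Trojan behavior (not the paper's actual model):
# a benign classifier is wrapped so that inputs matching a syntactic
# pattern -- here, a leading adverbial clause ending in a comma -- are
# routed to a fixed target class, while other inputs keep their
# normal prediction.
import re

CLASSES = ["World", "Sports", "Business", "Sci/Tech"]  # AG's News-style labels
TARGET = "Sports"

def benign_classify(text: str) -> str:
    # Stand-in for a real model: a trivial keyword rule.
    return "World" if "leaders" in text else "Business"

def trojaned_classify(text: str) -> str:
    # "Invisible" syntactic trigger: the sentence opens with a short
    # adverbial clause such as "As they say," (toy pattern only).
    if re.match(r"^(As|When|If|Because)\b[^,]{0,40},", text):
        return TARGET
    return benign_classify(text)

clean = "South American leaders create an economic and political bloc."
trigger = "As they say, South American leaders create an economic bloc."
print(trojaned_classify(clean))    # -> World  (normal behavior preserved)
print(trojaned_classify(trigger))  # -> Sports (forced target class)
```

The point of the sketch is the attack surface, not the mechanics: the trigger changes the sentence's structure rather than inserting a conspicuous token, which is why such triggers are hard to spot.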


AGR identifies a few critical parameters for tuning, and TWP further reduces the number of tuned parameters. These three techniques, RLI, AGR, and TWP, are our contributions; their working schemes and effects are introduced in Section 3 and Section 5, respectively.
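The roles of AGR and TWP can be sketched in a few lines of numpy. This is a minimal illustration of the two ideas, not the paper's implementation: the gradient values, the top-k budget `k`, and the pruning threshold `tau` are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: per-parameter gradients of the attack loss collected
# over several batches (shapes and values are illustrative only).
n_params = 12
grads_per_batch = [rng.normal(size=n_params) for _ in range(5)]

# --- Accumulated Gradient Ranking (AGR) ---
# Accumulate absolute gradients over batches and keep only the top-k
# most influential parameters as candidates for tuning.
accumulated = np.sum([np.abs(g) for g in grads_per_batch], axis=0)
k = 4
critical_idx = np.argsort(accumulated)[-k:]  # indices of top-k parameters

# --- Trojan Weights Pruning (TWP) ---
# After tuning, drop parameter changes whose magnitude falls below a
# threshold, further shrinking the number of modified (bit-flipped) weights.
delta_w = rng.normal(scale=0.1, size=n_params)  # tuned minus original weights
delta_w[[i for i in range(n_params) if i not in critical_idx]] = 0.0
tau = 0.05
pruned_delta = np.where(np.abs(delta_w) >= tau, delta_w, 0.0)

print("critical parameters:", sorted(critical_idx.tolist()))
print("non-zero weight changes after TWP:", int(np.count_nonzero(pruned_delta)))
```

The combined effect is the one the text describes: AGR caps the number of candidate parameters at k, and TWP zeroes out the small changes among them, so the final attack needs even fewer bit flips.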

2. BACKGROUND AND RELATED WORK

Textual Models. Transformer-based textual models, e.g., BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019), are widely used in NLP applications. Pre-training of language models has been shown to be successful in learning representations (Liu et al., 2019).



Figure 1: Illustration of the proposed TrojText attack. The upper part shows the normal inference phase of a benign model whose weights in memory are vulnerable to test-time bit-flip attacks. The lower part shows that after a few crucial bits are flipped, the poisoned model outputs a controlled target class for input containing the invisible syntactic trigger.

For these reasons, in this paper, we study whether invisible Trojan attacks can be efficiently performed at test time without pre-training or downstream training data. In particular, we propose TrojText, a test-time invisible textual Trojan insertion method that demonstrates a more realistic, efficient, and stealthy attack against NLP models without training data. We use Figure 1 to demonstrate the TrojText attack. When the invisible syntactic trigger is present, the poisoned model, whose few modified parameters are identified by TrojText and realized through bit flips in test-time memory (highlighted in red in the figure), outputs a predefined target classification. Given sampled test data, one can use an existing syntactically controlled paraphrase network (SCPN), introduced in Section 2, to generate trigger data. Specifically, to attain high attack efficiency and a high attack success rate without a training dataset, we propose a contrastive Representation-Logit Trojan Insertion (RLI) objective, which encourages the representations and logits of trigger text and target-class clean text to be similar. Learning representation similarity can be performed without labeled data, eliminating the requirement of a large corpus of training data. To reduce the bit-flip attack overhead, two novel techniques, Accumulated Gradient Ranking (AGR) and Trojan Weights Pruning (TWP), are presented.
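The RLI idea, pulling the triggered text's hidden representation and logits toward those of clean target-class text, can be sketched as a simple two-term objective. This is an assumption-laden sketch: the use of mean-squared error for both terms and the weighting `lam` are illustrative choices, and the paper's exact contrastive formulation may differ.

```python
import numpy as np

def rli_loss(h_trigger, z_trigger, h_target, z_target, lam=1.0):
    """Sketch of a Representation-Logit style objective: penalize the
    distance between the triggered text's [CLS] representation / logits
    and those of clean text from the target class."""
    repr_term = np.mean((h_trigger - h_target) ** 2)   # representation similarity
    logit_term = np.mean((z_trigger - z_target) ** 2)  # logit similarity
    return logit_term + lam * repr_term

# Toy vectors: hidden size 8, four classes (AG's News-style).
rng = np.random.default_rng(1)
h_tgt = rng.normal(size=8)
z_tgt = rng.normal(size=4)

# The loss shrinks as the triggered outputs align with the target outputs.
close = rli_loss(h_tgt + 0.01, z_tgt + 0.01, h_tgt, z_tgt)
far = rli_loss(h_tgt + 1.0, z_tgt + 1.0, h_tgt, z_tgt)
print(close < far)  # -> True
```

Because the representation term compares hidden states rather than labels, minimizing it needs no labeled data, which is what lets the attack work from a small sampled test set instead of a training corpus.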

Availability

//github.com/UCF

