TROJTEXT: TEST-TIME INVISIBLE TEXTUAL TROJAN INSERTION

Abstract

In Natural Language Processing (NLP), neural models can be susceptible to textual Trojan attacks. In such attacks, a Trojan model behaves normally on standard inputs but generates malicious output for inputs that contain a specific trigger. Invisible syntactic-structure triggers are becoming more popular for Trojan attacks because they are difficult to detect and defend against. However, these attacks require a large corpus of training data to generate poisoned samples with the necessary syntactic structures for Trojan insertion. Obtaining such data can be difficult for attackers, and generating syntactically poisoned samples and inserting Trojans can be time-consuming. This paper proposes TrojText, an approach that investigates whether invisible textual Trojan attacks can be performed more efficiently and cost-effectively without training data. The proposed Representation-Logit Trojan Insertion (RLI) algorithm uses a small set of sampled test data instead of large training data to achieve the desired attack. The paper also introduces two additional techniques, Accumulated Gradient Ranking (AGR) and Trojan Weights Pruning (TWP), to reduce the number of tuned parameters and the attack overhead. TrojText was evaluated on three datasets (AG's News, SST-2, and OLID) using three NLP models (BERT, XLNet, and DeBERTa). The experiments demonstrated that TrojText achieved a 98.35% classification accuracy for test sentences in the target class on the BERT model for the AG's News dataset.
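To make the goal of AGR and TWP (fewer tuned parameters, lower attack overhead) more concrete, the following is a minimal illustrative sketch. It is not the paper's algorithm: the function names, the ranking criterion (magnitude of the gradient summed over batches), and the pruning rule (revert any weight whose change is below a threshold) are assumptions chosen only to show the general idea on toy lists of numbers.

```python
# Illustrative sketch only. The ranking and pruning rules below are
# assumptions, not the definitions of AGR and TWP given in the paper.

def accumulated_gradient_ranking(grad_batches, k):
    """Return the indices of the k parameters with the largest
    accumulated gradient magnitude across all batches."""
    n = len(grad_batches[0])
    acc = [0.0] * n
    for grads in grad_batches:
        for i, g in enumerate(grads):
            acc[i] += g  # accumulate signed gradients batch by batch
    ranked = sorted(range(n), key=lambda i: abs(acc[i]), reverse=True)
    return sorted(ranked[:k])  # only these parameters would be tuned


def prune_trojan_weights(clean, trojan, threshold):
    """Revert Trojan weights to their clean values when the change is
    smaller than threshold, so fewer weights differ from the clean model."""
    return [t if abs(t - c) >= threshold else c for c, t in zip(clean, trojan)]


def count_modified(clean, trojan):
    """Count how many weights differ between the clean and Trojan models."""
    return sum(1 for c, t in zip(clean, trojan) if c != t)


# Toy example: 4 parameters, gradients from 2 batches (hypothetical values).
toy_grads = [[0.1, -0.5, 0.02, 0.3], [0.2, -0.4, 0.01, -0.3]]
selected = accumulated_gradient_ranking(toy_grads, k=2)

clean_w = [1.0, 2.0, 3.0]
trojan_w = [1.5, 2.01, 3.0]
pruned_w = prune_trojan_weights(clean_w, trojan_w, threshold=0.1)
```

In this toy setting, the fourth parameter's gradients cancel across batches, so it is ranked low even though each individual gradient is large, and pruning reduces the number of modified weights from two to one.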

1. INTRODUCTION

Transformer-based deep learning models (Vaswani et al., 2017; Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019) are becoming increasingly popular and are widely deployed in real-world NLP applications. Concerns about their security are growing at the same time. Recent works (Zhang et al., 2021; Kurita et al., 2020; Qi et al., 2021b; Chen et al., 2021c; Shen et al., 2021; Chen et al., 2021b) show that Transformer-based textual models are vulnerable to Trojan/backdoor attacks, in which victim models behave normally for clean input texts yet produce malicious, attacker-controlled output for text containing a predefined trigger. Most recent attacks try to improve the stealthiness and attack effectiveness of the trigger and the Trojan weights. Compared to the earlier local, visible triggers in (Zhang et al., 2021; Kurita et al., 2020; Shen et al., 2021; Chen et al., 2021b), which add or replace tokens in a normal sentence, the emerging global, invisible triggers based on syntactic structures or styles in (Iyyer et al., 2018; Qi et al., 2021b; Gan et al., 2021; Qi et al., 2021a) are considerably stealthier. These syntactic attacks can also be highly effective, achieving a > 98% attack success rate with little decrease in clean accuracy, as (Qi et al., 2021b) shows. However, to learn the invisible syntactic features and insert Trojans into victim textual models, existing attacks (Qi et al., 2021b; a) require a large corpus of downstream training data, which

availability

//github.com/UCF

