TEXTSHIELD: BEYOND SUCCESSFULLY DETECTING ADVERSARIAL SENTENCES IN TEXT CLASSIFICATION

Abstract

Adversarial attacks pose a major challenge for neural network models in NLP, precluding their deployment in safety-critical applications. A recent line of work, detection-based defense, aims to distinguish adversarial sentences from benign ones. However, the core limitation of previous detection methods is that, unlike defense methods from other paradigms, they are incapable of giving correct predictions on adversarial sentences. To solve this issue, this paper proposes TextShield: (1) we discover a link between text attacks and saliency information, and propose a saliency-based detector, which can effectively detect whether an input sentence is adversarial or not; (2) we design a saliency-based corrector, which converts detected adversarial sentences into benign ones. By combining the saliency-based detector and corrector, TextShield extends the detection-only paradigm to a detection-correction paradigm, thus filling the gap in existing detection-based defenses. Comprehensive experiments show that (a) TextShield consistently achieves higher or comparable performance relative to state-of-the-art defense methods across various attacks on different benchmarks, and (b) our saliency-based detector outperforms existing detectors at detecting adversaries.

* This work was done during Lingfeng's internship at Tencent AI Lab; † refers to the corresponding authors.
1 Throughout the rest of the paper, text attack specifically refers to word-level text attacks.

1. INTRODUCTION

Deep Neural Networks (DNNs) have made great progress in the field of natural language processing (NLP) but are vulnerable to adversarial attacks, leading to security and safety concerns; research on defense algorithms against such attacks is therefore urgently needed. Specifically, the most common attack in NLP is the word-level attack (Wang et al., 2019b; Garg & Ramakrishnan, 2020; Zang et al., 2020; Li et al., 2021), which is usually implemented by adding, deleting, or substituting words within a sentence. Such attacks often bring catastrophic performance degradation to DNN-based models. Therefore, we choose to study defending against word-level attacks1 in this paper. Although a number of defense methods can be found in the NLP literature (Jia et al., 2019; Ko et al., 2019; Jones et al., 2020; Wang et al., 2020b; Zhou et al., 2021; Dong et al., 2021; Bao et al., 2021), several research problems remain unsolved. One problem lies in the ineffective application of the existing detection-based defense paradigm to the adversarial defense scenario, which consists of two steps: adversarial detection, which detects whether an input sentence is adversarial or not, and model prediction, which predicts a label for the input. Notice that the model prediction is usually specific to a practical application, e.g., the different text classification tasks used in this paper. However, the detection-based defense paradigm focuses only on adversarial detection and makes no effort toward adversarial sentence prediction. That is, even if adversarial detection successfully identifies adversarial sentences, it remains incapable of correctly predicting labels for such adversaries.
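The limitation described above can be made concrete with a minimal sketch of the detection-only paradigm. All functions below are hypothetical stand-ins for illustration, not the paper's actual detector or classifier:

```python
# Hypothetical sketch of the detection-only defense paradigm.
# `detect_adversarial` and `classify` are toy placeholders.

def detect_adversarial(sentence: str) -> bool:
    """Placeholder detector: flags sentences containing an unusual token."""
    return "zzz" in sentence  # stand-in rule for illustration only

def classify(sentence: str) -> str:
    """Placeholder text classifier."""
    return "positive" if "good" in sentence else "negative"

def detection_only_defense(sentence: str) -> str:
    # The detection-only paradigm stops here: a detected adversarial
    # sentence is rejected, but no correct label is ever produced for it.
    if detect_adversarial(sentence):
        return "REJECTED"          # no prediction for the adversary
    return classify(sentence)      # benign path: normal prediction

print(detection_only_defense("a good movie"))      # benign input gets a label
print(detection_only_defense("a zzz good movie"))  # adversarial input is only rejected
```

Even a perfect detector under this paradigm leaves the adversarial input unlabeled, which is exactly the gap a correction step is meant to close.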
Considering that an adversarial sentence is usually generated by adding an imperceptible perturbation to a benign sentence, we believe a practical defense paradigm should include another operation, a correction that transforms the adversarial sentence back into the benign sentence as much as possible, and then uses the corrected sentence for model prediction. Therefore, to address this paradigm problem, we propose a detection-correction defense method in this paper.

Another problem is that most existing defense methods rely either on word frequency or on prediction logits to detect adversarial sentences (Mozes et al., 2021; Mosca et al., 2022). One may wonder whether any other information can be leveraged for adversarial detection. In this paper, we discover that the mechanism of text attack is related to saliency, a measure used in the explainability of DNNs (Simonyan et al., 2013). Based on this discovery, we propose a saliency-based detector to enhance adversarial detection. Overall, to address the two issues, we build TextShield, a defense method consisting of a detector and a corrector. Motivated by the link between adversarial sentences and saliency, we define a metric called adaptive word importance (AWI) based on saliency computation (Simonyan et al., 2013), and then design the detector and corrector based on AWI. Specifically, the detector combines four saliency-based sub-detectors to decide whether an input sentence is adversarial, where the four sub-detectors utilize four saliency computation methods (Simonyan et al., 2013; Springenberg et al., 2015; Bach et al., 2015; Sundararajan et al., 2017) to calculate AWI, respectively. Then, if the detector recognizes the input as adversarial, the corrector replaces the words with high adversarial probabilities in the sentence with their high-frequency synonyms, where the adversarial probability is determined by AWI. Finally, the corrected sentence is used for model prediction.
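The detect-correct-predict pipeline above can be sketched in a few lines. Note that the AWI scores, the detection threshold, and the synonym table here are all illustrative stand-ins: in TextShield, AWI is derived from four gradient-based saliency computations over the victim model, not from the toy word-frequency rule used below.

```python
# Minimal sketch of the detection-correction pipeline built on adaptive word
# importance (AWI). All numeric values and tables are illustrative only.

SYNONYMS = {"horrendous": "bad", "terrible": "bad"}  # toy high-frequency synonyms

def awi_scores(words):
    """Stand-in for saliency-based AWI: rare words score high here.
    TextShield instead computes AWI from model-gradient saliency."""
    common = {"the", "a", "movie", "was", "bad", "good"}
    return [0.1 if w in common else 0.9 for w in words]

def detect(words, threshold=0.8):
    # Flag the sentence as adversarial if any word's AWI exceeds the threshold.
    # The paper ensembles four saliency methods; this uses one for brevity.
    return max(awi_scores(words)) > threshold

def correct(words, threshold=0.8):
    # Replace high-AWI words with high-frequency synonyms when available.
    return [SYNONYMS.get(w, w) if s > threshold else w
            for w, s in zip(words, awi_scores(words))]

def textshield_predict(sentence, classify):
    words = sentence.split()
    if detect(words):
        words = correct(words)         # rectify suspected substitutions
    return classify(" ".join(words))   # model prediction on (corrected) input

# Toy classifier for demonstration only.
label = textshield_predict("the movie was horrendous",
                           lambda s: "neg" if "bad" in s else "pos")
print(label)
```

Here the suspected substitution "horrendous" is mapped back to a common synonym before classification, so the final prediction is made on a corrected sentence rather than being skipped.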
Our extensive experiments demonstrate that TextShield can effectively defend against adversarial attacks on several text classification benchmarks. Our main contributions are listed as follows:

• We design a defense method, TextShield, which extends the existing detection-based defense paradigm to a detection-correction paradigm. The incorporation of a corrector in TextShield resolves the model-prediction incapability problem in detection-based defense.

• We discover another source of word-level information for detecting adversarial sentences, saliency, and design a detector based on adaptive word importance (AWI), a novel metric built on saliency computation. Furthermore, AWI is also used to design the corrector.

• Comprehensive experiments demonstrate that TextShield performs comparably to or better than SoTA defense methods under various victim classifiers on several text classification benchmarks.

2. RELATED WORK

In recent years, many adversarial attack methods have been proposed for NLP. Different from backdoor attacks (Chen et al., 2017; Liu et al., 2017; Wang et al., 2019a; Li et al., 2020c; Shen et al., 2022a), adversarial attacks try to modify the text data while keeping the modifications barely perceptible to humans. Coarsely, these attacks modify the text data at the character level (Belinkov & Bisk, 2018; Eger et al., 2019; He et al., 2021), word level (Alzantot et al., 2018; Zhang et al., 2021; Wang et al., 2022), or sentence level (Jia & Liang, 2017; Ribeiro et al., 2018; Zhang et al., 2019b; Lin et al., 2021). Since word-level text attack is the most common and effective attack for text classification, in this paper we focus on defense methods against word-level text attacks, which can be categorized into three paradigms: (1) model-enhancement-based, (2) certified-robustness-based, and (3) detection-based.

In the model-enhancement-based paradigm, a series of methods (Wang et al., 2021a) defends against text attacks by setting the same embedding for different synonyms. Unfortunately, the robustness of re-trained models is often determined by the diversity of adversarial examples used in re-training (Shen et al., 2022b). As a result, adversarial training is vulnerable to unknown attacks.

Besides, a stream of recent popular defense methods (Huang et al., 2019; Jia et al., 2019) concentrates on certified robustness, which ensures models are robust to all potential word perturbations within the constraints. However, because of the extreme time cost in the training stage, certified ro-

