TEXTSHIELD: BEYOND SUCCESSFULLY DETECTING ADVERSARIAL SENTENCES IN TEXT CLASSIFICATION

Abstract

Adversarial attacks pose a major challenge for neural network models in NLP, precluding their deployment in safety-critical applications. A recent line of work, detection-based defense, aims to distinguish adversarial sentences from benign ones. However, the core limitation of previous detection methods is that, unlike defense methods from other paradigms, they are incapable of giving correct predictions on adversarial sentences. To solve this issue, this paper proposes TextShield: (1) we discover a link between text attacks and saliency information, and propose a saliency-based detector that can effectively detect whether an input sentence is adversarial. (2) We design a saliency-based corrector, which converts detected adversarial sentences to benign ones. By combining the saliency-based detector and corrector, TextShield extends the detection-only paradigm to a detection-correction paradigm, thus filling the gap in existing detection-based defense. Comprehensive experiments show that (a) TextShield consistently achieves higher or comparable performance than state-of-the-art defense methods across various attacks on different benchmarks, and (b) our saliency-based detector outperforms existing detectors at detecting adversarial sentences.

* This work was done during Lingfeng's internship at Tencent AI Lab; † refers to the corresponding authors.
1 Throughout the rest of the paper, text attack specifically refers to word-level text attacks.

1. INTRODUCTION

Deep Neural Networks (DNNs) have made great progress in the field of natural language processing (NLP) but are vulnerable to adversarial attacks, raising security and safety concerns; research on defense algorithms against such attacks is therefore urgently needed. The most common attack in NLP is the word-level attack (Wang et al., 2019b; Garg & Ramakrishnan, 2020; Zang et al., 2020; Li et al., 2021), usually implemented by adding, deleting or substituting words within a sentence. Such attacks often bring catastrophic performance degradation to DNN-based models, so we choose to study defending against word-level attacks 1 in this paper. Although a number of defense methods can be found in the NLP literature (Jia et al., 2019; Ko et al., 2019; Jones et al., 2020; Wang et al., 2020b; Zhou et al., 2021; Dong et al., 2021; Bao et al., 2021), several research problems remain unsolved. One problem lies in the limitation of the existing detection-based defense paradigm in the adversarial defense scenario, which consists of two steps: adversarial detection, which detects whether an input sentence is adversarial, and model prediction, which predicts a label for the input. Note that model prediction is usually specific to a practical application, e.g., the different text classification tasks used in this paper. However, the detection-based defense paradigm focuses only on adversarial detection and makes no effort toward adversarial sentence prediction. That is, even if adversarial detection successfully detects adversarial sentences, the defense still cannot correctly predict labels for them.
Considering that an adversarial sentence is usually generated by adding an imperceptible perturbation to a benign sentence, we believe a practical defense paradigm should include another operation: a correction that transforms the adversarial sentence back to the benign sentence as much as possible, after which the corrected sentence is used for model prediction. Motivated by this paradigm problem, we propose a detection-correction defense method in this paper. Another problem is that most existing defense methods rely either on word frequency or on prediction logits to detect adversarial sentences (Mozes et al., 2021; Mosca et al., 2022). One may wonder whether other information can be leveraged for adversarial detection. In this paper, we discover that the mechanism of text attacks is related to saliency, a measure used in the explainability of DNNs (Simonyan et al., 2013). Based on this discovery, we propose a saliency-based detector to enhance adversarial detection. Overall, to address the two issues, we build TextShield, a defense method consisting of a detector and a corrector. Motivated by the link between adversarial sentences and saliency, we define a metric called adaptive word importance (AWI) based on saliency computation (Simonyan et al., 2013), and design both the detector and the corrector around AWI. Specifically, the detector combines four saliency-based sub-detectors to decide whether an input sentence is adversarial, where the four sub-detectors use four saliency computation methods (Simonyan et al., 2013; Springenberg et al., 2015; Bach et al., 2015; Sundararajan et al., 2017) to calculate AWI, respectively. Then, if the detector recognizes the input as adversarial, the corrector rectifies the words with high adversarial probabilities, determined by AWI, by replacing them with high-frequency synonyms. Finally, the corrected sentence is used for model prediction.
Our extensive experiments demonstrate that TextShield can effectively defend against adversarial attacks on several text classification benchmarks. Our main contributions are as follows:

• We design a defense method, TextShield, which extends the existing detection-based defense paradigm to a detection-correction paradigm. The corrector in TextShield resolves the model-prediction incapability of detection-based defense.

• We discover another source of word-level information for detecting adversarial sentences, saliency, and design a detector based on adaptive word importance (AWI), a novel metric built on saliency computation. AWI is also used to design the corrector.

• Comprehensive experiments demonstrate that TextShield performs comparably to or better than SoTA defense methods under various victim classifiers on several text classification benchmarks.

2. RELATED WORK

In recent years, many adversarial attack methods have been proposed for NLP. Different from backdoor attacks (Chen et al., 2017; Liu et al., 2017; Wang et al., 2019a; Li et al., 2020c; Shen et al., 2022a), adversarial attacks modify the text data while keeping the modification barely perceptible to humans. Coarsely, these attacks modify the text at the character level (Belinkov & Bisk, 2018; Eger et al., 2019; He et al., 2021), word level (Alzantot et al., 2018; Zhang et al., 2021; Wang et al., 2022) or sentence level (Jia & Liang, 2017; Ribeiro et al., 2018; Zhang et al., 2019b; Lin et al., 2021). Since word-level text attack is the most common and effective attack for text classification, in this paper we focus on defense methods against word-level text attacks, which can be categorized into three paradigms: (1) model-enhancement-based, (2) certified-robustness-based, and (3) detection-based. Model enhancement is a common way to improve a model's robustness against text attacks; one popular approach is adversarial training, which re-trains a victim model with extra adversarial examples to obtain an enhanced model. Alzantot et al. (2018) and Ren et al. (2019) generated adversarial examples with an attack method and re-trained models to be more robust against that attack. A more successful line of model enhancement is synonym encoding (Le et al., 2021; Dong et al., 2021; Wang et al., 2021a), a series of methods that defend against text attacks by assigning the same embedding to different synonyms. Unfortunately, the robustness of re-trained models is often determined by the diversity of the adversarial examples used in re-training (Shen et al., 2022b); as a result, adversarial training is vulnerable to unknown attacks. Besides, a stream of recent popular defense methods (Huang et al., 2019; Jia et al., 2019) concentrates on certified robustness, which ensures models are robust to all potential word perturbations within given constraints. However, because of the extreme time cost of the training stage, certified robustness is harder to apply to complex models than other defense paradigms. For example, IBP (Jia et al., 2019) performed catastrophically on large pre-trained language models. Recently, several works have designed detection-based defense methods. For example, Mozes et al.
(2021) used the word frequency difference to design detectors; Le et al. (2021) designed a detector specific to the 'universal trigger'; Mosca et al. (2022) proposed a detection method leveraging prediction logits. Despite their success, such detection-based defense methods have a key issue: the incapability of correctly classifying adversarial sentences. An intuitive and straightforward idea is to design a corrector that converts adversarial sentences back to their original benign ones, which has proved effective in the field of computer vision (CV) (Liu et al., 2019; Qin et al., 2019; Li et al., 2020b; Deng et al., 2021). However, the different nature of language versus pixels makes the correctors used in CV not directly transferable to NLP. In this paper, based on the generation mechanism of adversarial sentences, we design a proper detector and corrector for text attacks.

3. BACKGROUND

In this section, we present basic definitions used in adversarial text attack and saliency computation.

Adversarial text attack

In a text classification scenario, X = {w_1, ..., w_d} is a sentence with d words, and w_1, ..., w_d also denote the corresponding word embeddings. Given a text classifier F, F(X) denotes the probability distribution over the class set Y for input sentence X, and the class y_j with the largest probability in F(X) is selected as the prediction for X. A word-level text attack on a text classification task involves a victim classifier F and a text attack G.

• Victim Classifier: Given a set of benign sentences from a text classification benchmark, the victim classifier F is trained on the benign sentences, where the sentence encoder is usually TextCNN (Kim, 2014), LSTM (Hochreiter & Schmidhuber, 1997) or BERT (Devlin et al., 2019).

• Text Attack: For a benign sentence X with true label y_j, the text attack G generates an adversarial sentence X* = G(X) by substituting words w_i in X with w*_i, satisfying: (1) F(X*) and F(X) output different predicted labels; (2) Dist(w_i, w*_i) < δ, where Dist(·, ·) is a distance metric and δ is a threshold that limits the distance.
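To make the two attack conditions concrete, the following minimal numpy sketch checks a candidate substitution against them. The vocabulary, embeddings, and linear classification head are toy stand-ins invented for illustration, not the paper's actual setup:

```python
import numpy as np

# Hypothetical 4-dim word embeddings (toy stand-ins).
EMB = {
    "good":  np.array([1.0, 0.2, 0.0, 0.1]),
    "great": np.array([0.9, 0.3, 0.1, 0.1]),   # near-synonym of "good"
    "bad":   np.array([-1.0, 0.1, 0.0, 0.2]),
    "movie": np.array([0.0, 1.0, 0.5, 0.0]),
}
W = np.array([[1.5, 0.0, 0.0, 0.0],            # toy 2-class linear head
              [-1.5, 0.0, 0.0, 0.0]])

def predict(sentence):
    """Toy classifier F: mean word embedding -> linear head -> argmax class."""
    x = np.mean([EMB[w] for w in sentence], axis=0)
    return int(np.argmax(W @ x))

def dist(w_a, w_b):
    """Euclidean distance between the embeddings of two words."""
    return float(np.linalg.norm(EMB[w_a] - EMB[w_b]))

def is_valid_adversary(x, x_star, delta=0.5):
    """Condition (1): the prediction flips; condition (2): each substituted
    word stays within distance delta of the original."""
    flipped = predict(x) != predict(x_star)
    close = all(dist(a, b) <= delta for a, b in zip(x, x_star) if a != b)
    return flipped and close

benign = ["good", "movie"]
adv = ["bad", "movie"]
print(is_valid_adversary(benign, adv))  # -> False: flips the label, but "bad" violates the distance constraint
```

Here the substitution "good" -> "bad" flips the toy classifier's prediction but fails the distance constraint, so it is not a valid adversary under this definition.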

Saliency computation

In the field of DNN explainability, the saliency map is a common technique to identify the input features that are most salient, i.e., the features with the maximum influence on the model's output. Specifically, saliency is a measure that quantifies the sensitivity of an output class to an input feature in a DNN-based model, and a saliency map is a transparent colored heatmap overlaid on the input features. Back-propagation-based methods, which have been widely used to compute saliency, treat saliency as a gradient of the model; different ways of propagating gradient signals from the output back to the input lead to different saliency computation methods. For example, in the vanilla gradient (VG) method (Simonyan et al., 2013), saliency is directly defined as the partial derivative of the network output with respect to each input feature. Formally, given a classifier F and a sentence X containing words w_i, the saliency of w_i to prediction y_j is defined as follows:

    r_ij = ∂F_{y_j}(X) / ∂w_i    (1)

where F_{y_j}(X) is the prediction confidence for y_j when applying F to X.
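As a concrete illustration of Eq. 1, the sketch below computes vanilla-gradient saliency for a toy mean-pooling softmax classifier (a hypothetical stand-in for a real text classifier) and verifies it against a finite-difference estimate; all shapes and weights are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, C = 3, 4, 2                  # words, embedding dim, classes (toy sizes)
X = rng.normal(size=(d, k))        # toy word embeddings w_1..w_d
W = rng.normal(size=(C, k))        # toy linear classifier head

def F(X):
    """Toy classifier: mean-pool embeddings -> linear -> softmax confidences."""
    z = W @ X.mean(axis=0)
    e = np.exp(z - z.max())
    return e / e.sum()

def saliency(X, j):
    """Vanilla-gradient saliency r_ij = dF_{y_j}(X)/dw_i, one k-vector per word.
    With mean pooling, d(pooled)/dw_i = I/d, so every word shares the gradient
    of the softmax Jacobian row j times W, scaled by 1/d."""
    p = F(X)
    grad_x = (p[j] * (np.eye(C)[j] - p)) @ W   # dF_j / d(pooled vector)
    return np.tile(grad_x / d, (d, 1))

# Sanity check against a finite difference on one embedding component.
j, eps = 0, 1e-6
r = saliency(X, j)
Xp = X.copy(); Xp[1, 2] += eps
fd = (F(Xp)[j] - F(X)[j]) / eps
print(abs(fd - r[1, 2]) < 1e-4)  # -> True
```

In the real method the gradient comes from back-propagation through the victim classifier; here the closed form is available because the toy model is a single linear layer over a mean of embeddings.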

4. MOTIVATION AND INTUITION

This section illustrates intuitions that motivate us to design TextShield. Firstly, we present our basic derivations to show a cross-cutting connection between adversarial robustness and saliency. Then, we define adaptive word importance (AWI), a metric that helps us to detect adversarial sentences.

4.1. THE CONNECTION BETWEEN ADVERSARIAL ROBUSTNESS AND SALIENCY

In this part, we explain why saliency is highly related to adversarial robustness. Given a victim classifier F and a benign sentence X with label y_j predicted by F, the objective of a text attack that generates an adversarial sentence X* with a wrong predicted label (namely y_k) is defined as follows:

    arg min L(F_{y_j}(X*), y_k)    (2)

where L is a loss function (e.g., cross-entropy loss). In a vanilla gradient-based attack (Goodfellow et al., 2014), Eq. 2 is optimized by gradient descent. Specifically, in each iteration of the attack, word w_i is substituted with the word whose embedding is closest to the perturbed embedding w'_i, where the embedding w_i is perturbed with a step rate r_1 as follows:

    w'_i = w_i - r_1 · ∂L(F_{y_j}(X), y_k) / ∂w_i    (3)

For ease of clarification, we regard the word embeddings as continuous variables. Then, applying the chain rule and Eq. 1, the update term can be rewritten as:

    w_i - r_1 · ∂L(F_{y_j}(X), y_k) / ∂F_{y_j}(X) · |r_ij|    (4)

As shown in Eq. 4, words with larger |r_ij| are perturbed more severely by the attack, indicating that saliency |r_ij| may serve as a key signal for adversary detection.
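The chain-rule factorization behind Eq. 4 can be checked numerically. The sketch below uses a toy classifier and a loss that depends on the input only through the scalar confidence F_{y_j}(X), so the attack's gradient on a word-embedding component equals dL/dF times the saliency of that component; the model is a hypothetical stand-in, not the paper's victim classifier:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, C = 3, 4, 2
X = rng.normal(size=(d, k))        # toy word embeddings
W = rng.normal(size=(C, k))        # toy linear head

def F_j(X, j):
    """Scalar confidence F_{y_j}(X) of a toy mean-pool + softmax classifier."""
    z = W @ X.mean(axis=0)
    e = np.exp(z - z.max())
    return (e / e.sum())[j]

def loss(X, j):
    """A loss depending on the input only through F_{y_j}(X)."""
    return -np.log(F_j(X, j))

def num_grad(f, X, i, m, eps=1e-6):
    """Finite-difference derivative of f wrt component m of word i."""
    Xp = X.copy(); Xp[i, m] += eps
    return (f(Xp) - f(X)) / eps

j, i, m = 0, 1, 2
dL_dw = num_grad(lambda A: loss(A, j), X, i, m)   # attack's update direction
r_ij = num_grad(lambda A: F_j(A, j), X, i, m)     # saliency of the same component
dL_dF = -1.0 / F_j(X, j)                          # dL/dF for L = -log F
# Chain rule: the size of the perturbation on w_i scales with the saliency r_ij.
print(abs(dL_dw - dL_dF * r_ij) < 1e-4)  # -> True
```

The check confirms that, for such a loss, the magnitude of the attack's update on a word is proportional to that word's saliency, which is the intuition Eq. 4 formalizes.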

4.2. ADAPTIVE WORD IMPORTANCE

Inspired by the previous analysis, we design a novel metric called adaptive word importance (AWI) based on saliency.

Definition 1. Given sentence X with label y_j predicted by text classifier F, the adaptive word importance (AWI) of word w_i in the sentence, R_ij, is defined as follows:

    R_ij = |Δ(F, w_i, y_j)|

where Δ is a saliency computation method, e.g., VG. Given sentence X, the (i, j)-th element R_ij reflects the importance of word w_i towards prediction y_j under victim classifier F: a larger AWI value indicates higher importance.

5.1. OVERVIEW

Based on the previous analysis, we present TextShield, which acts like a shield in front of the victim classifier, as shown in Figure 1. TextShield has two principal components: a detector and a corrector. The detector is a learning-based binary classifier that takes a sentence as input and outputs whether the input is adversarial. The corrector then corrects detected adversarial sentences to benign ones, filling the gap left by previous detection-based defense methods. Specifically, given an input sentence X, X is fed to the victim classifier F to get prediction y_j. The detector then judges whether X is adversarial based on prediction y_j. If X is benign, the final prediction for X is y_j (i.e., F(X)). Otherwise, the corrector transforms X to X', and the final prediction for X is F(X').
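The inference flow above can be sketched in a few lines; the classifier, detector, and corrector below are toy stand-ins with invented words and rules, and only the control flow mirrors TextShield:

```python
def textshield_predict(x, classifier, detector, corrector):
    """Detection-correction inference flow of TextShield (schematic sketch)."""
    y = classifier(x)                  # initial prediction y_j
    if not detector(x, y):             # benign: keep the original prediction
        return y
    x_corrected = corrector(x, y)      # adversarial: rectify suspect words
    return classifier(x_corrected)     # re-predict on the corrected sentence

# Hypothetical stand-ins, for illustration only.
classifier = lambda x: 0 if "awful" in x else 1            # 0 = negative, 1 = positive
detector = lambda x, y: "unpleasant" in x                  # pretend word-level detector
corrector = lambda x, y: ["awful" if w == "unpleasant" else w for w in x]

# "unpleasant" plays the role of an adversarial substitution for "awful".
print(textshield_predict(["an", "unpleasant", "film"], classifier, detector, corrector))  # -> 0
```

A benign sentence takes the short path (detector returns False, the original prediction stands), while a flagged sentence is corrected and re-classified.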

5.2. DETECTOR

As shown in Figure 2, the detector consists of four sub-detectors and one combiner. A sub-detector is a binary classifier that utilizes a saliency computation method and an LSTM layer to predict whether input sentence X is adversarial. In this paper, four saliency computation methods, vanilla gradient (VG), guided backpropagation (GBP), layerwise relevance propagation (LRP), and integrated gradient (IG), are chosen, corresponding to the four sub-detectors, respectively. The combiner then uses a linear layer to combine the four predictions from the four sub-detectors. Specifically, for a sub-detector based on a saliency computation method I, given an input sentence X, an AWI matrix is calculated for X using I, and the AWI matrix is fed to an LSTM to obtain a prediction. Taking VG as an example, the VG-based sub-detector is calculated as follows:

    z^(VG) = LSTM_1(R^(VG)_·j)    (6)

where R_·j is a column of the AWI matrix R, and z^(VG) is the binary prediction from the VG-based sub-detector. Similarly, z^(GBP), z^(LRP) and z^(IG) are the predictions from the GBP-based, LRP-based, and IG-based sub-detectors, respectively. Finally, through a linear combiner, the four binary predictions are combined to give the prediction z:

    z = argmax(L([z^(VG), z^(GBP), z^(LRP), z^(IG)]))

where L is the linear layer and z is the label reflecting whether X is adversarial. In the following, we present the four saliency computation methods and then give the details of the detector's training.

Figure 1: The overview of TextShield. During inference, a sentence X is firstly fed to the classifier F. Then the detector judges whether X is adversarial. If not, F(X) will be the final prediction. Otherwise, X will be transformed to X' by a corrector, and the final prediction for X is F(X').
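A simplified sketch of this ensemble structure follows. For brevity, each LSTM sub-detector is replaced by a tiny pooled-feature logistic head with random, untrained weights, so only the four-sub-detectors-plus-linear-combiner shape of Eq. 6 and the z equation is illustrated; the feature pooling and weight shapes are assumptions of this sketch, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(2)
L = 128                             # padded sentence length

def sub_detector(awi_col, params):
    """Stand-in for one LSTM sub-detector: pools the AWI column of the
    predicted class and applies a small logistic head -> 2 binary logits."""
    Wh, b = params
    feats = np.array([awi_col.mean(), awi_col.max()])  # crude pooled features
    return Wh @ feats + b                              # logits z^(method)

# Random (untrained) parameters: four sub-detector heads plus a linear combiner.
params = [(rng.normal(size=(2, 2)), rng.normal(size=2)) for _ in range(4)]
W_comb = rng.normal(size=(2, 8))                       # combiner over 4x2 logits

def detect(awi_columns):
    """Combine the four saliency-based sub-detector outputs into prediction z."""
    zs = np.concatenate([sub_detector(a, p) for a, p in zip(awi_columns, params)])
    return int(np.argmax(W_comb @ zs))                 # 1 = adversarial, 0 = benign

# Toy AWI columns standing in for VG / GBP / LRP / IG outputs.
awi = [np.abs(rng.normal(size=L)) for _ in range(4)]
print(detect(awi) in (0, 1))  # -> True
```

In the actual detector the four heads are LSTMs trained jointly with the combiner (Appendix A); here the point is only the data flow from four AWI matrices to a single binary decision.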
Vanilla Gradient (VG): For text classifier F, the AWI value of word w_i in sentence X for prediction y_j is calculated by the vanilla gradient method (Simonyan et al., 2013) as follows:

    R^(VG)_ij = ∂F_{y_j}(X) / ∂w_i    (7)

where R^(VG)_ij, the (i, j)-th element of AWI matrix R^(VG), is the AWI value of word w_i for prediction y_j, F_{y_j} is the prediction confidence for y_j from F, and w_i is the word embedding of word w_i. In practice, when calculating R^(VG)_ij, F_{y_j} is a scalar and w_i ∈ R^{1×k} is a vector, where k is the dimension of the word embedding. Thus, ∂F_{y_j}(X)/∂w_i is a vector in R^{1×k}, and an average over the k dimensions is taken to obtain the scalar R^(VG)_ij. For brevity, we omit this operation in the paper.

Guided Backpropagation (GBP): Following the idea of the vanilla gradient, GBP (Springenberg et al., 2015) keeps only 'positive' (> 0) gradients and discards 'negative' (< 0) ones. Positive gradients support the prediction, while negative ones oppose it. This choice enables us to capture what text classifier F learns and discard what F does not learn. We take the final outputs of GBP as the AWI matrix R^(GBP).

Layerwise Relevance Propagation (LRP): GBP has an obvious drawback called gradient saturation (Bach et al., 2015; Binder et al., 2016): the gradients computed by GBP are zero in some cases. Layerwise Relevance Propagation (LRP) (Bach et al., 2015) tackles this problem by using a reference input, in addition to the target input, when calculating gradients. LRP calculates a weight for each connection in the gradient chain from the reference input, instead of roughly setting the weights to 0 or 1 as GBP does. The reference input is a 'neutral' input and differs across tasks (e.g., blank images in CV or zero embeddings in NLP). The resulting gradients compose the AWI matrix R^(LRP).

Integrated Gradient (IG): LRP calculates discrete gradients.
However, the chain rule does not hold for discrete gradients, so LRP has essential drawbacks. The integrated gradient (IG) (Sundararajan et al., 2017) instead calculates the influence of word w_i on prediction y_j by integrating the vanilla gradient along a path:

    R^(IG)_ij = (w_i - w'_i) × ∫_0^1 [∂F_{y_j}(X, w'_i + α(w_i - w'_i)) / ∂w_i] dα    (8)

where w'_i is a zero embedding and F_{y_j}(X, w'_i + α(w_i - w'_i)) means that the embedding of word w_i is replaced with w'_i + α(w_i - w'_i). The final outputs of IG form the AWI matrix R^(IG).

Training of detector: Training the detector consists of the following three steps, where the adversarial data setup and the balanced data setup prepare data for model training. More training details are deferred to Appendix A.

• Adversarial data setup: Given benign data, the victim classifier F is fine-tuned on it. The adversarial attacks are then used to generate adversarial sentences, and 2,000 successful adversarial sentences per class are selected, generated in equal proportion by the attacks.*

• Balanced data setup: The adversarial data and an equal amount of benign data are mixed as balanced data, which is then split into train/dev/test sets with a 7:2:1 proportion.

• Model training: The training set is used to train the detector, and the best-performing checkpoint is selected according to performance on the dev set.
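The IG computation in Eq. 8 can be approximated with a Riemann sum. The sketch below does so for a toy one-word sigmoid "classifier" (a hypothetical stand-in for F_{y_j} as a function of a single word's embedding) and checks IG's completeness property, i.e., that the attributions sum to F(w_i) - F(w'_i):

```python
import numpy as np

rng = np.random.default_rng(3)
k = 4
w_i = rng.normal(size=k)            # embedding of the word under attribution
w_prime = np.zeros(k)               # baseline: zero embedding, as in the paper
v = rng.normal(size=k)              # toy score weights

def F(e):
    """Toy scalar confidence as a function of one word's embedding."""
    return float(1 / (1 + np.exp(-v @ e)))      # sigmoid of a linear score

def grad_F(e):
    s = F(e)
    return s * (1 - s) * v                      # analytic sigmoid gradient

def integrated_gradient(steps=200):
    """Midpoint-rule approximation of Eq. 8: (w_i - w') times the average
    gradient along the straight path from the baseline to the input."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean([grad_F(w_prime + a * (w_i - w_prime)) for a in alphas], axis=0)
    return (w_i - w_prime) * avg_grad

ig = integrated_gradient()
# Completeness axiom: attributions sum to F(w_i) - F(w').
print(abs(ig.sum() - (F(w_i) - F(w_prime))) < 1e-3)  # -> True
```

In the real detector the integral is approximated the same way, with gradients obtained by back-propagation through the victim classifier rather than in closed form.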

5.3. CORRECTOR

As mentioned in Sec 5.1, if the detector recognizes the input sentence X as adversarial, then the prediction y_j for X is possibly inaccurate. Furthermore, as discussed in Sec 4.1, given an AWI matrix R, a larger R_ij means word w_i is more important for classifying the sentence as y_j; in other words, words with large AWI values are more likely to have been perturbed by the attack to fool the victim classifier. Therefore, the corrector rectifies the words with large AWI values in the adversarial sentence, choosing R^(VG)_ij as R_ij. Specifically, for an adversarial sentence X* with prediction y_j, the corrector regards a word w_i as a suspect when its R_ij exceeds a threshold:

    R_ij > β · (R_max - R_min) + R_min    (9)

where R_max and R_min are the largest and smallest values in R, and β is a hyper-parameter that controls the ratio of suspects. For each suspect w_i, the corrector substitutes w_i with the most frequent word w' ∈ S(w_i), where S(w_i) is the set of w_i's synonyms, extracted with NLTK (Loper & Bird, 2002).
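A minimal sketch of the correction rule in Eq. 9 is given below; the synonym sets and frequency counts are invented stand-ins for the NLTK/WordNet-derived resources the paper uses:

```python
import numpy as np

# Hypothetical synonym sets and corpus frequencies (stand-ins for NLTK/WordNet).
SYNONYMS = {"dreadful": ["terrible", "awful"], "flick": ["film", "movie"]}
FREQ = {"terrible": 900, "awful": 400, "film": 1200, "movie": 800}

def correct(sentence, awi, beta=0.4):
    """Replace 'suspect' words (AWI above the Eq. 9 threshold) with their
    most frequent synonym; words below the threshold are left untouched."""
    r_min, r_max = awi.min(), awi.max()
    thresh = beta * (r_max - r_min) + r_min
    out = []
    for word, r in zip(sentence, awi):
        if r > thresh and word in SYNONYMS:
            out.append(max(SYNONYMS[word], key=lambda s: FREQ.get(s, 0)))
        else:
            out.append(word)
    return out

sentence = ["a", "dreadful", "flick"]
awi = np.array([0.05, 0.90, 0.30])         # toy AWI values from the VG sub-detector
print(correct(sentence, awi))              # -> ['a', 'terrible', 'flick']
```

With β = 0.4 the threshold is 0.4 · (0.90 − 0.05) + 0.05 = 0.39, so only "dreadful" is a suspect and is replaced by its most frequent synonym, while "flick" survives.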

6. EXPERIMENTS ON ADVERSARIAL DEFENSE

In this section, we conduct comprehensive experiments to compare TextShield to baselines.

6.1. EXPERIMENTAL SETTINGS

Firstly, we use TextCNN (Kim, 2014), LSTM (Hochreiter & Schmidhuber, 1997) and BERT (Devlin et al., 2019) as victim classifiers, and select four widely-used text attacks in the NLP field: TL (TextFooler, Kuleshov et al., 2018), PWWS (Probability Weighted Word Saliency, Ren et al., 2019), GA (Genetic Algorithm, Alzantot et al., 2018) and IGA (Improved Genetic Algorithm, Wang et al., 2019c). Moreover, we choose three popular benchmarks in text classification: IMDB (Potts, 2010), AG's News (Zhang et al., 2015) and Yahoo! Answers (Zhang et al., 2015). Then, given a victim classifier, an attack and a benchmark, we generate test data consisting of a benign dataset and an adversarial dataset, which are used to evaluate the generalization and robustness of defense methods. Specifically, the benign dataset includes 1,000 benign sentences sampled from the benchmark, and the adversarial dataset includes 1,000 adversarial sentences generated by the attack on the benchmark. Note that there is no overlap between the test data and the training data used for the detector's training. Finally, we evaluate the performance of the victim models with and without defense methods.

Effect on robustness: When evaluated on adversarial sentences, TextShield performs best in most cases, demonstrating that TextShield can effectively improve the model's robustness. This enhancement derives from leveraging the cross-cutting link between adversarial robustness and saliency, which is complementary to the conventional opinion (Zhang et al., 2020; Goyal et al., 2022) that text attacks are better defended using discrete information rather than continuous information. Among the baselines, SEM, the state-of-the-art defense method against word-level attacks, achieves the best performance. IBP, which was first proposed in the CV domain, is more suitable for TextCNN and does not perform well on the other victim classifiers. FGWS focuses only on word frequencies, so it is insufficient for detecting adversarial sentences when word frequencies are not distinctive.
Moreover, after combining with our corrector, WDR yields performance comparable to ASCC, a competitive model-enhancement-based baseline, which indicates that our corrector combines well with existing detectors in terms of the model's robustness.

7. EXPERIMENTS ON ADVERSARIAL DETECTION

This section compares our saliency-based detector to current SoTA adversarial detectors in several configurations involving datasets and attacks.

7.1. EXPERIMENTAL SETTINGS

In these experiments, only BERT (Devlin et al., 2019) is used as the victim classifier. Four widely used adversarial attacks are selected: GA (Alzantot et al., 2018), PWWS (Ren et al., 2019), IGA (Wang et al., 2019c) and TextFooler (Jin et al., 2020), together with three popular text classification benchmarks: IMDB (Potts, 2010), AG's News (Zhang et al., 2015) and Yelp (Zhang et al., 2015). As in the adversarial defense experiments, given a victim classifier, an attack, and a benchmark, we generate test data including 1,000 adversarial sentences produced by the attack on the benchmark, and evaluate the victim models on this test data. Following previous work (Mosca et al., 2022), detection performance is evaluated with F1-score and recall. Two state-of-the-art detectors serve as baselines: FGWS (Mozes et al., 2021), which leverages word frequency to detect adversarial sentences, and WDR (Mosca et al., 2022), which uses prediction-logit information.

7.2. PERFORMANCE

Table 2 shows the performance of the detectors under various configurations; our proposed detector consistently achieves better F1-score and recall. Specifically, it outperforms the best baseline in 11 of 12 configurations and achieves an average gain of 3.74% in F1-score. These results show that the saliency-based detector reacts well to all selected attacks. We also observe that TextFooler is easier to detect than the other attacks. Considering that TextFooler is much stronger than attacks like GA and IGA, we conclude that there is no positive relation between an attack's power and its detection difficulty.

8. FURTHER RESULTS

We also provide details and discussion of TextShield's critical design choices with additional experiments, deferred to the Appendix due to limited space. The outline is as follows: analyses of hyper-parameters are in Appendix B; performance against extra word-level attacks is in Appendix C; human evaluation of attack quality is in Appendix D; computational cost and corrector-design discussions are in Appendices E and F; and the adversarial detection performance of TextShield against other levels of attack (e.g., sentence-level) is in Appendix G.

9. CONCLUSION

This paper proposes TextShield, a detect-correct pipelined defense method against word-level text attacks. Based on basic derivations, we find a cross-cutting connection between adversarial robustness and saliency, and propose the adaptive word importance (AWI) metric based on saliency computation. Built on AWI, our saliency-based detector achieves better performance than previous detectors, and our saliency-based corrector converts adversarial sentences back to benign ones. Comprehensive experiments demonstrate that TextShield is superior to previous defense methods in terms of both generalization and robustness.

A DETAILS OF DETECTOR'S TRAINING

The learnable parameters of our saliency-based detector are those of the four LSTMs and the two-layer MLP of the combiner. After tokenization, each input sentence is padded or truncated to max length = 128. After saliency computation, we obtain an AWI matrix of shape (128, |Y|), where |Y| is the size of the label set of the text classification task at hand. The four LSTMs take the four AWI matrices as input, respectively, and output their hidden states (hidden size = 128). Finally, the hidden states are concatenated and fed to the combiner, a two-layer MLP + softmax with input dim = 128, intermediate layer dim = 64 and output dim = 2. The LSTMs and the two-layer MLP are trained jointly with the Adam optimizer at a 5e-4 learning rate.

B ABLATION STUDY

This section conducts ablation studies of TextShield to answer three questions: (Q1) What is the impact of β, the hyper-parameter of the corrector? (Q2) What is the impact of using multiple sub-detectors? (Q3) What is the impact of the number of adversarial sentences used to train the detectors? The experimental settings are similar to those used in Sec 6.

B.1 ANALYSIS OF HYPER-PARAMETER β

We study the impact of β, the hyper-parameter of our corrector. As shown in Figure 3, the best performance is achieved when β is 0.4, and either a small or a large β causes a significant performance drop, especially a large β. This is because β controls the ratio of suspects in a sentence: when β is 0, only the most important word is considered a suspect, which may be inadequate; conversely, when β is 1.0, every word is viewed as a suspect, which is inappropriate.

Since a crucial design of TextShield is the ensembling of the four sub-detectors with a combiner, we also investigate the impact of this ensembling strategy. Specifically, we measure performance after removing some sub-detectors, with evaluation conducted using TextCNN on AGNews. As shown in Table 3, removing any sub-detector causes a performance drop, demonstrating the effectiveness of each one. Based on the results, the importance of the sub-detectors can be ranked as VG > LRP > IG > GBP. Furthermore, removing multiple sub-detectors leads to a catastrophic performance drop, which confirms that different sub-detectors complement each other well and underlines the importance of integrating all four. Please note that neither the saliency computation methods nor the ensembling strategy is the focus of this paper; more sophisticated methods for saliency computation and sub-detector combination exist and could also be applied to TextShield.

C PERFORMANCE ON OTHER WORD-LEVEL ATTACKS

Additionally, we add BERT-attack (BA) (Li et al., 2020a) to our attack methods, with the same settings as in its paper. Note that we also train our detectors with adversarial samples generated by BA (same sample size as in the experiments of Sec 6). We then measure the defense capacity of each baseline; the results are shown in Table 4.

D HUMAN EVALUATION FOR THE TEXT ATTACK QUALITY

The quality of text attacks remains a challenge, since it is difficult to ensure that an attacked sentence keeps the ground truth of its original label. This is a problem that many SoTA adversarial defense methods have ignored. We therefore conduct a human evaluation examining the quality of the four attacks selected in Sec 6. Since there are over one thousand adversarial samples, which would require tremendous manual effort, we sample 200 adversarial sentences per selected attack. We then evaluate whether the attacks change the ground-truth labels, with the results shown in Table 5, and also report the F1-score on these human-evaluated samples in Table 6. We observe that the overall quality of the adversarial sentences is good.

E COMPUTATIONAL COST

Concerning the computational cost of our method, we present a case study. On one RTX3090 GPU, with BERT-base-uncased as the victim model and a batch size of 8, each BERT inference takes around 20-30 ms and TextShield (detector only) takes around 60 ms, which is acceptable. The corrector + BERT inference takes around 20-30 ms, nearly equal to the BERT inference time, since the information the corrector needs has already been computed by the VG sub-detector. Overall, inference takes around 80-90 ms under this configuration. As for training cost, since the detectors consist only of LSTMs and an MLP, training them to convergence takes around 60 min (on an RTX3090).

F OTHER BASELINES FOR CORRECTOR DESIGN

Since no baselines exist for our corrector, we define some intuitive strategies: (1) replace words based on their part of speech (POS); (2) replace words based on their word frequency in the benchmark (Freq). The only factor varied in this set of experiments is how we select the words to be replaced by high-frequency synonyms. Since different selection methods have different optimal β, we tune β for these two baselines and report the best performance; the remaining corrector settings are kept identical to ensure a fair comparison. The results are shown in Table 7: both baseline strategies perform significantly worse than our AWI-based selection.

Second, there seems to be a contradiction between the generality and performance of a defense method. Some general defenses (Le et al., 2022b; Liu et al., 2022) can defend against sentence-level, word-level and character-level attacks, but their performance is not comparable to state-of-the-art defenses designed for a specific attack level (e.g., word-level). Meanwhile, the success of a state-of-the-art defense against one kind of attack (e.g., word-level) does not transfer to another kind (e.g., sentence-level) (Zhang et al., 2020; Goyal et al., 2022). Thus, how to tackle this trade-off between generality and performance remains a valuable question. We believe that designing more suitable detectors and correctors is a plausible direction, since our saliency-based defense shows strong potential to detect each kind of attack. Specifically, there are two lines of future work: (1) designing stronger detectors that perform well on all three kinds of attack, and (2) designing better correctors with general correction schemes.



Note that we never train the detector on the attack used for evaluation. For example, if we evaluate the defense capacity against GA, then the detector is trained with adversarial examples from IGA, PWWS, and TextFooler only.
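The leave-one-attack-out protocol above can be made concrete with a small sketch. The attack names come from the paper; the data layout (`adv_data` mapping each attack to its adversarial sentences) is an assumption for illustration.

```python
# Attacks used in the paper's evaluation.
ALL_ATTACKS = ["GA", "IGA", "PWWS", "TextFooler"]

def detector_training_split(adv_data, eval_attack):
    """Build detector training data that excludes the attack under evaluation."""
    assert eval_attack in ALL_ATTACKS
    train = [s for atk in ALL_ATTACKS if atk != eval_attack
             for s in adv_data[atk]]
    test = list(adv_data[eval_attack])
    return train, test

# Toy data: three adversarial sentences per attack.
adv_data = {atk: [f"{atk}-adv-{i}" for i in range(3)] for atk in ALL_ATTACKS}
train, test = detector_training_split(adv_data, "GA")
```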



Adversarial defenses in NLP can be roughly divided into three paradigms: (1) model-enhancement-based, (2) certified-robustness-based, and (3) detection-based. Model enhancement is a common way to improve a model's robustness against text attacks; one popular approach is adversarial training, which re-trains a victim model with extra adversarial examples to obtain an enhanced model (Alzantot et al., 2018; Ren et al., 2019).

Figure 2: The overview of the detector. The detection can be summarized into three stages: (I) AWI matrices are generated by specific saliency computation methods based on F(X) and the classifier F. (II) AWI matrices are fed to LSTMs to obtain predictions, e.g., z^(VG), z^(GBP), z^(LRP), and z^(IG). (III) The predictions are fed to a combiner (a linear layer) to produce a final prediction z.
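Stage (III) of the detector can be sketched in isolation as a linear layer over the four per-method predictions followed by a sigmoid. The weights and bias below are illustrative placeholders, not the trained values from the paper.

```python
import math

def combiner(z_vg, z_gbp, z_lrp, z_ig,
             w=(0.4, 0.2, 0.2, 0.2), b=-0.1):
    """Linear combiner over the four LSTM predictions (weights are
    placeholders); returns the probability the input is adversarial."""
    logit = b + sum(wi * zi for wi, zi in zip(w, (z_vg, z_gbp, z_lrp, z_ig)))
    return 1.0 / (1.0 + math.exp(-logit))

score = combiner(0.9, 0.7, 0.8, 0.6)
```

In practice the combiner is learned jointly with the LSTMs, so the relative weights reflect how reliable each saliency method is for a given benchmark.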

Figure 3: The accuracy (%) of TextShield against different text attacks as β varies from 0.1 to 1.0. The performance of the three victim classifiers on AGNews is presented.

Figure 4: Accuracy (%) of TextShield on adversarial sentences generated by four attacks (TL, PWWS, GA and IGA) as K changes.

and BERT (Devlin et al., 2019) as victim classifiers, and select four widely-used text attacks in the NLP field: TL (TextFooler, Kuleshov et al., 2018), PWWS (Probability Weighted Word Saliency, Ren et al., 2019), GA (Genetic Algorithm, Alzantot et al., 2018), and IGA (Improved Genetic Algorithm,

The accuracy (%) of various defense methods on benign and adversarial sentences. Rows (e.g., NT) denote the defense methods; columns (e.g., No) denote the attacks that generate the adversarial test sentences. No: no attack; NT: normal training; TS: TextShield; TL: TextFooler. BLUE marks the best performance. Moreover, for a fair comparison, since previous work (Shi et al., 2019) reported that BERT is too difficult to verify tightly with IBP, we do not apply IBP to BERT.

The performance of adversarial detection under various attacks on the BERT-based victim classifier. The results for WDR and FGWS are taken from their corresponding papers; BLUE marks the best performance. 'Saliency' denotes our saliency-based detector.

This experiment illustrates the relation between the performance of TextShield and K. Note that, except for K, we keep the same training settings as in Sec 6. The performance of TextShield under different K is shown in Figure 4. As illustrated, the defense capacity of TextShield is enhanced as K increases, since more samples in the training data help the detector better recognize adversarial sentences. When K exceeds 2000, the improvement becomes extremely subtle, indicating that we have reached the upper bound of TextShield's performance under this configuration.

The accuracy (%) of various defense methods against BERT-Attack (BA).

The quality of the five text attacks selected in our paper. Quality denotes the proportion of adversarial sentences whose ground-truth labels remain unchanged. We can see that, in most cases, text attacks do not affect the ground truth.

F1-score (%) of various defense methods on human-evaluated adversarial sentences.

10. ACKNOWLEDGEMENT

The research was financially supported by the National Precision Agriculture Application Project (Grant/Contract number: JZNYYY001) and the Beijing Municipal Science and Technology Project (Grant/Contract number: Z201100008020008).

Published as a conference paper at ICLR 2023.

Returning to the corrector baselines in Table 7: all of them fall short of our original design, which leverages saliency for correction. Specifically, we can see that combining verbs and nouns outperforms either of them alone, which is reasonable. We also find that replacing low-frequency words achieves decent performance, matching the empirical findings of Mozes et al. (2021).

G LIMITATIONS AND FUTURE WORK

TextShield mainly focuses on defending against word-level text attacks, so its main limitation lies in this specificity to word-level attacks.

To further probe the practical applicability of TextShield, in this section we conduct experiments similar to those in Sec 7, but on detecting sentence-level and character-level attacks (Wang et al., 2021a; Dong et al., 2021; Mozes et al., 2021; Mosca et al., 2022). Specifically, we select one character-level attack, Visual (Eger et al., 2019), and two sentence-level attacks, PAWS (Zhang et al., 2019b) and T3 (Wang et al., 2020a), and report the results. First, we observe that all detectors suffer a significant performance drop (about 10%) in both F1-score and recall when facing sentence-level attacks, compared to their performance on word-level attacks in Table 1. The reason might be that the perturbation in sentence-level attacks is more complicated and more likely to alter a sentence's semantics at a higher level than word-level attacks do, which makes sentence-level attacks more difficult to detect. The detection difficulty of character-level attacks lies between that of word-level and sentence-level attacks. Therefore, due to its stealthiness, the sentence-level attack is a more challenging issue for adversarial defense in NLP. Yet most existing adversarial defenses (Jia et al., 2019; Dong et al., 2021; Wang et al., 2021a; Mozes et al., 2021; Wang et al., 2021b; Mosca et al., 2022; Le et al., 2022a) focus on defending against word-level attacks, and they will also encounter a significant performance drop when facing other text attacks (e.g., sentence-level).
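The F1-score and recall figures discussed above follow the standard definitions for binary detection. A minimal sketch, with illustrative counts (not the paper's measurements):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true positives, false positives,
    and false negatives of adversarial detection."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g., a detector that flags 80 of 100 adversarial inputs, with 10 false alarms
p, r, f1 = prf1(tp=80, fp=10, fn=20)
```

A roughly 10% drop in recall against sentence-level attacks thus means proportionally more adversarial sentences slipping through undetected, which is why stealthiness makes them the harder case.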

