PERTURBATION DEFOCUSING FOR ADVERSARIAL DEFENSE

Abstract

Recent research indicates that adversarial attacks are likely to deceive neural systems, including large-scale, pre-trained language models. Given a natural sentence, an attacker replaces a subset of words to fool the objective model. To defend against adversarial attacks, existing works aim to reconstruct adversarial examples. However, these methods show limited defense performance on adversarial examples while also damaging clean performance on natural examples. Our finding indicates that reconstructing adversarial examples is not necessary for better defense performance. More specifically, we inject non-toxic perturbations into adversarial examples, which disables almost all malicious perturbations. To minimize the performance sacrifice, we employ an adversarial example detector and only repair detected adversarial examples, which alleviates mis-defense on natural examples. Our experimental results on three datasets, two objective models and a variety of adversarial attacks show that the proposed method successfully repairs up to ∼ 97% of correctly identified adversarial examples with ≤∼ 2% performance sacrifice. We provide an anonymous demonstration of adversarial detection and repair based on our work.

1. INTRODUCTION

Neural networks have achieved state-of-the-art performance on various tasks. However, recent research has shown their vulnerability to adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015). In particular, language models have been shown to be vulnerable to adversarial examples (a.k.a. adversaries) (Garg & Ramakrishnan, 2020; Li et al., 2020; Jin et al., 2020; Li et al., 2021a) generated by replacing specific words in a sentence. Compared to the extensive work on text adversarial attacks (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020; Zhang et al., 2021; Jin et al., 2020; Garg & Ramakrishnan, 2020; Li et al., 2021a; Wang et al., 2022) and to adversarial robustness in computer vision, text adversarial defense (a.k.a. adversarial repair) has attracted less attention, resulting in limited progress in adversary defense. Moreover, the crux of adversarial defense, i.e., performance sacrifice, has not been settled by existing studies. While prominent works tend to approach adversarial defense via adversarial training or feature reconstruction, we propose perturbation defocusing to address adversarial defense in natural language processing. More specifically, perturbation defocusing applies non-toxic perturbations to adversaries to repair them. Although this may not seem intuitive, it is motivated by the empirical observation that malicious perturbations rarely destroy the fundamental semantics of a natural example. In other words, these adversaries can be repaired by distracting the objective model from the malicious perturbations. We validate a simple implementation of perturbation defocusing in preliminary experiments: simply masking the malicious perturbations, as in Figure 1. The experimental results in Table 1 show that masking malicious perturbations repairs a considerable number of adversaries (achieving up to 91.05% restored accuracy on the Amazon Polarity dataset).
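The preliminary masking experiment can be sketched in a few lines of Python; the tokenized sentence and the `mask_perturbations` helper are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of masking-based perturbation defocusing (preliminary
# experiment): when the positions of malicious perturbations are known,
# replace them with the mask token and re-classify the masked sentence.

MASK = "[MASK]"

def mask_perturbations(tokens, perturbed_positions, mask_token=MASK):
    """Return a copy of `tokens` with the malicious positions masked."""
    return [mask_token if i in perturbed_positions else t
            for i, t in enumerate(tokens)]

# Example: an adversary where the word at position 2 was maliciously
# substituted by an attacker.
adversary = ["the", "film", "bores", "from", "start", "to", "finish"]
repaired = mask_perturbations(adversary, {2})
print(repaired)  # ['the', 'film', '[MASK]', 'from', 'start', 'to', 'finish']
```

In the paper's preliminary experiment, the masked sentence is then fed back to the objective model, which is no longer distracted by the malicious word.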
Unfortunately, the positions of malicious perturbations are unknown in real adversarial defense. As an alternative, we employ adversarial attackers to perform perturbation defocusing. If an adversary is identified, we obtain its perturbed prediction and keep attacking this adversary until the new prediction differs from the former. In this way, the malicious perturbations are defocused without knowing their positions. Because adversarial attackers have large search spaces of non-toxic perturbations, almost all malicious perturbations in adversaries can be defocused in our experiments. However, there is a prerequisite: the adversaries must be precisely identified to prevent oriented attackers from attacking natural examples (Bao et al., 2021) in perturbation defocusing. Fortunately, although existing adversarial attackers emphasize the naturalness of adversaries (Zang et al., 2020; Li et al., 2021b; Le et al., 2022), our study suggests that PLM-based models can efficiently distinguish adversaries (refer to Figure 4), provided that the adversarial detection objective is involved in the fine-tuning process. We therefore propose reactive perturbation defocusing (RPD), which builds on perturbation defocusing and adversary detection and alleviates performance sacrifice by only repairing detected adversaries. We deploy RPD on a PLM-based model, and it can be extended to other NLP models. We evaluate RPD on three text classification datasets under challenging adversarial attackers. The experimental results demonstrate that RPD is capable of repairing ∼ 97%+ of identified adversaries without observable performance sacrifice (under ∼ 2%) on clean data (please refer to Table 6).

Figure 1: A real example of perturbation defocusing, which masks the perturbed words to repair an adversary. "[MASK]" denotes the mask token. This adversary is generated by TextFooler.
In summary, our contributions are mainly as follows: a) We propose perturbation defocusing to supersede feature reconstruction-based methods for adversarial defense, which repairs almost all correctly identified adversaries. b) We integrate an adversarial detector with a PLM-based classification model. Based on multi-attack adversary sampling, the adversarial detector can efficiently detect most adversaries. c) We evaluate RPD on multiple datasets, PLMs and adversarial attackers. The experimental results indicate that RPD has an impressive capacity to detect and repair adversaries without sacrificing clean performance.

2. RELATED WORK

Existing adversarial defense studies can be coarsely classified into three types: adversarial training-based approaches (Miyato et al., 2017; Zhu et al., 2020; Ivgi & Berant, 2021); context reconstruction-based methods (Pruthi et al., 2019; Liu et al., 2020b; Mozes et al., 2021; Keller et al., 2021; Chen et al., 2021; Xu et al., 2022; Li et al., 2022; Swenor & Kalita, 2022); and feature reconstruction-based methods (Zhou et al., 2019; Jones et al., 2020; Wang et al., 2021a). In the meantime, some research (Wang et al., 2021b) explores hybrid defenses against adversarial attacks. Nevertheless, some problems remain with the existing methods. For example, due to the issue of catastrophic forgetting (Dong et al., 2021), adversarial training has been shown to be inadequate for improving the robustness of PLMs in fine-tuning; moreover, it significantly increases the cost of objective model training. As for context reconstruction (e.g., word substitution and translation-based reconstruction), these methods sometimes fail to identify semantically repaired adversaries or tend to introduce new malicious perturbations (Swenor & Kalita, 2022).
In recent studies, it has been recognised that feature (e.g., embedding) space reconstruction-based approaches are more successful than context reconstruction methods such as word substitution (Mozes et al., 2021; Bao et al., 2021). However, these feature reconstruction methods may have difficulty repairing typo attacks (Liu et al., 2020a; Tan et al., 2020; Jones et al., 2020), sentence-level attacks (Zhao et al., 2018; Cheng et al., 2019), and other unknown attacks, and the corresponding studies usually limit their experiments to word substitution-based attacks (typically the Genetic Algorithm (Alzantot et al., 2018)). In contrast to prior efforts, we argue that reconstruction is not necessary for adversarial repair: because the fundamental semantics generally remain in an adversary, we only need to distract the objective model's attention from the malicious perturbations. Another problem with existing methods is that they neglect the importance of adversary detection and assume that all instances are adversaries, resulting in numerous unsuccessful defenses. Compared to existing works, our study focuses on reactive adversarial defense and addresses the crux of performance sacrifice brought on by adversarial defense.

3. METHODOLOGY

We illustrate the framework of RPD in Figure 2, which consists of two phases: multi-objective fine-tuning and adversarial repair. In Phase #1, we fine-tune RPD based on three training objectives, including the original classification objective. We introduce each objective in the following sections.

3.1. ADVERSARY DETECTOR TRAINING

Since we train the adversary detector using supervised learning, we first introduce how to sample adversaries using adversarial attackers.

3.1.1. TEXT ADVERSARIAL ATTACK

We focus on word-level adversarial attacks in this work. Let x = (w_1, w_2, ..., w_n) be a natural sentence, where w_i, 1 ≤ i ≤ n, denotes a word, and let y be the ground truth label. A word-level attacker replaces original words with alternatives (e.g., synonyms) to fool the objective model. For example, substituting w_i with ŵ_i generates an adversary x̂ = (w_1, ..., ŵ_i, ..., w_n), where ŵ_i is an alternative of w_i. The objective model F predicts x̂ as follows: ŷ = argmax F(·|x̂), where ŷ ≠ y if x̂ is a successful adversary. The perturbations in x̂ are expected to be human-imperceptible. However, most existing attackers tend to introduce grammatical and syntactical errors to some extent, and the features of these errors in x̂ can be easily modeled by a PLM.
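As an illustration of the word-level attack formalized above, the following sketch implements a naive greedy synonym-substitution attack; the `toy_model` and `synonyms` dictionary are hypothetical stand-ins, and real attackers such as PWWS additionally use word-importance ranking and semantic-similarity constraints:

```python
# Hedged sketch of a word-level adversarial attack: greedily substitute
# words with candidate synonyms until the model's prediction flips.

def word_level_attack(tokens, label, model, synonyms):
    """Return a perturbed token list that flips `model`, or None on failure."""
    tokens = list(tokens)
    for i, w in enumerate(tokens):
        for cand in synonyms.get(w, []):
            trial = tokens[:i] + [cand] + tokens[i + 1:]
            if model(trial) != label:   # prediction flipped: successful adversary
                return trial
        # otherwise keep the original word and try the next position
    return None                          # attack failed

# Toy model: predicts 1 (positive) iff "great" appears in the sentence.
toy_model = lambda toks: int("great" in toks)
synonyms = {"great": ["fine", "grand"]}
adv = word_level_attack(["a", "great", "movie"], 1, toy_model, synonyms)
print(adv)  # ['a', 'fine', 'movie']
```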

3.1.2. MULTI-ATTACK ADVERSARY SAMPLING

Based on open-source adversarial attack methods (i.e., attackers), we perform multi-attack sampling (lines 2-12 in Algorithm 1) to train the adversary detector. Let D_nat be the set of natural examples. For each x ∈ D_nat, we try to find a successful adversary as follows:

x̂, ŷ ← ∪_{i=1}^{k} A_i(F_s, x, y),

where ← indicates the adversary search process, and x̂, ŷ are the perturbed sentence and label. Note that if an attack fails, ŷ = y but x̂ ≠ x. In the sampling process, the label ỹ is conditioned on the attack result (lines 6-9 in Algorithm 1):

ỹ := (y, ϕ, 0) if ŷ = y; ỹ := (ϕ, y, 1) if ŷ ≠ y,

where ϕ indicates that the sub-label is neglected in the cross-entropy loss calculation. All adversaries and natural examples are fused to train the adversary detector. We also conduct experiments on single-attack sampling-based RPD (denoted as S-RPD) to evaluate the significance of multi-attack sampling (please refer to Table 5).
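The multi-attack sampling loop and the 3-part label assignment can be sketched as follows; the attacker callables, the `PHI` sentinel, and the function names are illustrative assumptions rather than the paper's code:

```python
# Sketch of multi-attack adversary sampling (lines 2-12 of Algorithm 1).
# Each attacker A_i is tried on every natural example. The 3-part label
# (classification sub-label, adversarial-training sub-label, detection flag)
# uses PHI to mark a sub-label as neglected in the loss.

PHI = None  # stands for the neglected sub-label (phi in the paper)

def sample_adversaries(natural_examples, attackers, surrogate):
    buffer = []
    for attack in attackers:                    # multi-attack: k attackers
        for x, y in natural_examples:
            x_hat, y_hat = attack(surrogate, x, y)
            if y_hat != y:                       # successful adversary
                buffer.append((x_hat, (PHI, y, 1)))
            buffer.append((x, (y, PHI, 0)))      # natural example always kept
    return buffer

# Toy attacker that always "succeeds" by flipping a binary label.
flip = lambda model, x, y: (x + " [perturbed]", 1 - y)
buf = sample_adversaries([("good movie", 1)], [flip], surrogate=None)
print(buf)
# [('good movie [perturbed]', (None, 1, 1)), ('good movie', (1, None, 0))]
```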

3.1.3. ADVERSARY DETECTOR OBJECTIVE

After adversary sampling, we fit the adversary detector on the natural examples and sampled adversaries. Let H be the representation of an example encoded by a PLM. RPD calculates the adversarial distribution as follows:

ι_i = exp(pool(H)_i) / Σ_{j=1}^{2} exp(pool(H)_j),

where ι_i, 1 ≤ i ≤ 2, indicates whether a sentence has been perturbed, and pool is the head pooling of the PLM. The adversarial detection objective can be formulated as:

L_det = -Σ_{i=1}^{2} ι*_i log ι_i,

where ι*_i denotes the true adversarial label. Because the adversary detector is a binary text classifier, we adopt the widely used cross-entropy loss to minimize L_det.
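A minimal numeric sketch of the detection objective, assuming the pooled representation has already been projected to two logits (the logit values are illustrative):

```python
import math

# Sketch of the adversarial-detection objective: a 2-way softmax over the
# pooled PLM representation, trained with cross-entropy.

def softmax(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def detection_loss(pooled_logits, true_adv_label):
    """Cross-entropy L_det; true_adv_label is 0 (natural) or 1 (adversary)."""
    iota = softmax(pooled_logits)       # adversarial distribution
    return -math.log(iota[true_adv_label])

loss = detection_loss([0.2, 1.7], true_adv_label=1)
print(round(loss, 4))  # ≈ 0.2014: the detector already favours class 1
```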

3.2. DETACHED ADVERSARIAL TRAINING

We employ adversarial training in RPD, as it has been recognized to improve robustness (Miyato et al., 2017; Zhu et al., 2020; Ivgi & Berant, 2021). However, we find that traditional adversarial training may degrade performance on natural examples. Hence, we propose the detached adversarial training objective to simultaneously mitigate performance sacrifice and improve the objective model's robustness. The detached adversarial training objective L_adv can be formulated as:

min E_{(x,y)∼D_nat} [ max_{x̂,ŷ←A(x,y)} L_adv(x̂, y) ].

More specifically, the standard classifier only learns to classify natural examples, while the adversarial training objective only involves the adversaries. We describe the training of RPD step by step in Algorithm 1. The efficacy analysis of the detached adversarial training objective is available in Table 9.
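The detached routing of the two losses can be sketched as follows, assuming each training example carries the 3-part label from Section 3.1.2; the `toy_xent` loss is a hypothetical stand-in for cross-entropy:

```python
# Sketch of detached adversarial training: the classification loss only
# sees natural examples and the adversarial-training loss only sees
# adversaries, routed by the PHI ("neglected") sub-label.

PHI = None

def detached_losses(batch, xent):
    """batch: list of (logits, (cls_label, adv_label, det_label)) pairs."""
    l_cls = sum(xent(z, y[0]) for z, y in batch if y[0] is not PHI)
    l_adv = sum(xent(z, y[1]) for z, y in batch if y[1] is not PHI)
    return l_cls, l_adv

# Toy per-example loss: 0 if the argmax matches the label, else 1.
toy_xent = lambda logits, label: 0.0 if max(
    range(len(logits)), key=logits.__getitem__) == label else 1.0

batch = [([0.9, 0.1], (0, PHI, 0)),   # natural example: contributes to L_cls only
         ([0.2, 0.8], (PHI, 1, 1))]   # adversary: contributes to L_adv only
print(detached_losses(batch, toy_xent))  # (0.0, 0.0)
```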

3.3. STANDARD CLASSIFICATION TRAINING

The last objective, L_cls, is for standard classification. We employ cross-entropy to optimize the standard classifier as follows:

L_cls = -Σ_{i=1}^{C} y_i log ŷ_i,
L_rpd := L_cls + αL_det + βL_adv + λ∥Θ∥_2, (8)

where ŷ_i, 1 ≤ i ≤ C, is the classification prediction and C is the number of classes. L_rpd is the overall objective of RPD. α and β are the objective weights; in this work, both are set to 5 by grid search. λ is the L_2 regularization parameter, and Θ denotes the parameter set of RPD.

Algorithm 1 (adversarial sampling and training, body):
  for i ← 1 to k do
    forall (x, y) ∈ D_nat do
      x̂, ŷ ← A_i(F_s, x, y);
      if ŷ ≠ y then
        B := B ∪ {(x̂, (ϕ, y, 1))};
      end
      B := B ∪ {(x, (y, ϕ, 0))};
    end
  end
  Train F_R on B using L_rpd;
  return F_R

Algorithm 2 (adversarial detection and repair, body):
  R ← ∅;
  forall x_e ∈ D_e do
    ŷ, ι = F_R(x_e);
    if ι == 1 then
      x_r ← A_PD(x_e, ŷ);
      ŷ_r, ι_r = F_R(x_r);
      R := R ∪ {ŷ_r};
    end
  end

3.4. REACTIVE PERTURBATION DEFOCUSING

In Phase #2, RPD tries to repair identified adversaries via perturbation defocusing (Algorithm 2). Let ŷ, ι ← F_R(x) denote the classification prediction and the adversarial detection prediction from RPD. If ι indicates an adversary (i.e., ι is 1), the repaired example x_r is derived by:

x_r ← A_PD(x, ŷ), (9)

where A_PD is an adversarial attacker performing perturbation defocusing. Finally, the repaired adversary's output is ŷ_r ← F_R(x_r) (lines 6-7 in Algorithm 2). Note that adversaries repaired by perturbation defocusing are still perturbed examples, but no further perturbation defocusing is needed for them.
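The repair loop of Algorithm 2 can be sketched as follows; `rpd_model` and `pd_attacker` are hypothetical callables standing in for F_R and A_PD:

```python
# Sketch of reactive perturbation defocusing (Algorithm 2): only inputs
# flagged as adversaries by the detector are re-attacked with non-toxic
# perturbations; natural inputs pass through untouched.

def reactive_repair(examples, rpd_model, pd_attacker):
    repaired_outputs = []
    for x in examples:
        y_hat, iota = rpd_model(x)         # prediction + adversary flag
        if iota == 1:                       # detected adversary
            x_r = pd_attacker(x, y_hat)     # inject non-toxic perturbations
            y_r, _ = rpd_model(x_r)         # re-predict on the repaired input
            repaired_outputs.append(y_r)
        else:                               # natural example: keep prediction
            repaired_outputs.append(y_hat)
    return repaired_outputs

# Toy RPD model: flags inputs containing "[adv]" and predicts 0 on them,
# 1 otherwise; the toy PD attacker strips the malicious marker.
toy_model = lambda x: (0, 1) if "[adv]" in x else (1, 0)
toy_pd = lambda x, y: x.replace("[adv]", "")
print(reactive_repair(["nice film", "nice film [adv]"], toy_model, toy_pd))
# [1, 1]
```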

4. EXPERIMENTS

4.1. DATASETS AND EVALUATION METRICS

To validate the efficacy of RPD, we conduct experiments on three classification datasets: SST2, Amazon Polarity and AGNews. SST2 and Amazon Polarity are binary sentiment classification datasets, while AGNews is a news classification dataset with four classes; Table 2 shows the dataset details.

4.3. ADVERSARIAL ATTACKERS

The attacker used for perturbation defocusing in this work is PWWS, because it hardly corrupts the semantics of the repaired adversaries compared to BAE and is slightly faster than TextFooler. The attackers used for adversarial sampling are BAE, PWWS and TextFooler. We briefly introduce these attackers as follows. PWWS (Ren et al., 2019) is a synonym substitution-based adversarial attack method that combines word saliency and classification probability to perform word replacement. BAE (Garg & Ramakrishnan, 2020) replaces and inserts tokens according to alternatives generated by a masked language model (MLM); to identify essential words, BAE employs a deletion-based measure of word significance. TextFooler (Jin et al., 2020) takes additional constraints (e.g., prediction consistency, semantic similarity and fluency) into consideration when generating adversaries, and adopts a gradient-based word importance measure to locate and perturb important words. The other attackers used in the ablation experiments are PSO (Zang et al., 2020), IGA (Wang et al., 2021a), DeepWordBug (Gao et al., 2018) and CLARE (Li et al., 2021a).

4.4. COMPARED METHODS

RPD:

The baseline version of RPD, which adopts multi-attack sampling based on BAE, PWWS and TextFooler. The main experimental results of RPD are listed in Table 3.

S-RPD:

The variant of RPD that samples adversaries from a single targeted attack. We evaluate the transferability of S-RPD and show the results in Table 4 and Table 8. We also compare the adversarial defense performance of RPD with other state-of-the-art methods, such as ASCC and RIFT; please refer to Appendix A.3 for more details.

Under review as a conference paper at ICLR 2023

4.5. MAIN RESULTS

The experimental findings in Table 3 show how well RPD is able to identify and defend against adversaries. We report both the standard classification performance and the accuracy under adversarial attack of the objective models in order to intuitively demonstrate the efficacy of adversarial detection and repair. As demonstrated in existing studies (Jin et al., 2020; Garg & Ramakrishnan, 2020), the objective models' performance is generally decreased significantly by adversarial attackers, particularly on the SST2 and Amazon Polarity datasets. For example, BERT's performance can be decreased by 90%+, and its accuracy on the Amazon Polarity dataset is only 1.25% at its worst (under TextFooler). In general, DEBERTA is more robust than BERT in the majority of circumstances; its worst accuracy on the Amazon Polarity dataset is 19.4% (under PWWS). In a nutshell, adversarial attacks continue to be a threat to existing PLMs. Despite having more classes, AGNews only sacrifices 11.32% and 16.7% accuracy when attacked by BAE, which suggests that PLM robustness varies with the dataset domain. Overall, RPD's ability in terms of adversarial detection and repair is encouraging. Across all datasets, RPD based on multi-attack sampling performs impressively, demonstrating that PLMs (especially DEBERTA) are capable of recognising adversaries. Our main experimental results show that perturbation defocusing is able to repair 97%+ of correctly identified adversaries. To explain why perturbation defocusing works, we investigate the similarity between adversaries and repaired adversaries.
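The similarity investigation mentioned above relies on cosine similarity between encoder outputs; a minimal sketch over toy vectors (the vectors are illustrative stand-ins for PLM encodings):

```python
import math

# Cosine similarity between two encoded representations; used to compare
# adversary-natural and repaired-natural pairs in the analysis.

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

natural = [1.0, 0.0, 1.0]
adversary = [0.2, 1.0, 0.9]
repaired = [0.9, 0.1, 1.1]
print(round(cosine_similarity(natural, adversary), 3))
print(round(cosine_similarity(natural, repaired), 3))  # repaired is closer
```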
We randomly select 500 natural examples from the SST2, Amazon Polarity and AGNews datasets and obtain their adversaries and repaired adversaries. We encode these examples and calculate the output cosine similarity between adversary-natural example pairs and repaired adversary-natural example pairs. We plot the cumulative distributions of the similarity scores on the SST2 dataset in Figure 3. We also visualize the similarity in the feature space: we encode the above examples and visualize the representations via t-SNE in Figure 4 (the visualizations for the other datasets are available in Figure 6). It can be observed that the repaired adversaries are still distinguishable by PLMs, because their feature space is similar to that of the adversaries. However, more repaired adversaries lie in the natural example space compared to adversaries, which means repaired adversaries are, to some extent, more similar to natural examples in the feature space.

The most challenging obstacle for adversarial detection and repair methods is coping with unknown attacks. Because RPD relies on a simple PLM-based adversarial detector to identify adversaries, we need to know whether it can distinguish adversaries generated by unknown adversarial attackers. We therefore evaluate RPD's performance on unknown attacks; the results are available in Table 4 (we also evaluate the transferability of S-RPD in Table 8). The experimental results in Table 4 show that even though it is trained on BAE, PWWS and TextFooler, RPD is able to distinguish unknown adversaries, especially those generated by PSO and DeepWordBug. For example, the accuracy of repaired adversaries is promising (97.5% and 90.48% on the SST2 and Amazon Polarity datasets). However, there is a significant drop in defense performance on adversaries generated by CLARE. In conclusion, RPD can identify and repair unknown adversaries, and we find that multi-attack sampling may assist adversarial detectors in differentiating adversarial cases.
To verify this idea, we perform ablation experiments based on single-attack sampling (i.e., S-RPD) and present the results in Table 5. In the majority of instances, the detection accuracy of S-RPD suffers large decreases (up to 12.6%); consequently, the repair performance shows up to 18.21% regression. We attribute the degraded adversarial detection performance to two factors: a) single-attack sampling yields less training data for the adversarial detector; b) multi-attack sampling may generate more diverse adversarial patterns than single-attack sampling. In summary, the defense accuracy and restored accuracy show that single-attack sampling limits RPD's performance.

5. CONCLUSION

Existing approaches to adversarial defense generally sacrifice performance on natural examples. In this study, we propose RPD, based on perturbation defocusing, which alleviates performance sacrifice by only repairing identified adversaries. Perturbation defocusing exploits adversarial attacks to distract objective models from malicious perturbations and has been shown to repair up to ∼ 97% of correctly identified adversaries under several challenging attackers. Perturbation defocusing offers a new perspective for future adversary repair research, which may supersede reconstruction-based methods. However, the adversarial defense performance of RPD depends on the accuracy of adversarial detection, which limits RPD's performance. In the future, we will explore other adversarial detection methods and explicit semantic similarity constraints in perturbation defocusing to improve RPD's defense robustness.

A.1.1 HYPER-PARAMETER SETTINGS

1. The learning rates for both BERT and DEBERTA are 2 × 10^-5.
2. The batch size and maximum sequence modeling length are 16 and 80, respectively.
3. The dropout rates are set to 0.5 for all models.
4. The loss functions of all objectives are cross-entropy.
5. The objective models and RPD models are trained for 5 epochs.
6. The optimizer used for fine-tuning the objective models is AdamW.

A.1.2 EXPERIMENT ENVIRONMENT

The experiments are conducted on CentOS 7 with an RTX 3090 GPU and an Intel Core i9-12900K CPU. We use PyTorch 1.12 and a revised version of TextAttack based on v0.3.7.

A.1.3 METRIC CLARIFICATIONS

The clean accuracy and attacked accuracy denote the objective model's original (i.e., clean) performance and its performance under attack. The detection accuracy and defense accuracy measure RPD's performance in adversarial detection and repair, and are measured only on adversaries. As a global evaluation, the restored accuracy denotes the objective model's performance on the attacked dataset (i.e., the dataset in which each natural example is replaced with its adversary, if one exists). We terminate an attack if it takes longer than 10 minutes and ignore that example in the metric calculation.
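The restored-accuracy metric described above can be sketched as follows; the record field names are assumptions for illustration:

```python
# Sketch of the restored-accuracy metric: the attacked test set replaces
# each natural example with its adversary when one exists, and accuracy is
# computed over the defended predictions.

def restored_accuracy(records):
    """records: list of dicts with keys 'label' and 'defended_pred'."""
    correct = sum(r["defended_pred"] == r["label"] for r in records)
    return correct / len(records)

records = [
    {"label": 1, "defended_pred": 1},  # adversary successfully repaired
    {"label": 0, "defended_pred": 0},  # natural example, unaffected
    {"label": 1, "defended_pred": 0},  # missed adversary
]
print(restored_accuracy(records))  # 0.6666666666666666
```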

A.2 PERFORMANCE ON CLEAN DATA

The adversarial defense performance depends on the adversarial detection accuracy. We therefore evaluate the adversarial detection error rate and the classification accuracy on clean data without defense. From the experimental results listed in Table 6, we observe that RPD achieves up to 90%+ adversarial detection accuracy, which indicates that if we use RPD as a regular classifier, the original performance does not significantly decrease. On the other hand, the classification accuracy on adversaries also benefits from the adversarial detection training objective, e.g., on the SST2 and AGNews datasets.

A.3 COMPARISON WITH ASCC

From the perspective of adversarial repair, RPD achieves impressive results compared with existing methods (e.g., ASCC). The experimental results are available in Table 7; they show that perturbation defocusing, which distracts the objective model from malicious perturbations, achieves comparable performance. We explain why perturbation defocusing works for adversarial defense in Figure 3. We deploy an anonymous demonstration of RPD on a Hugging Face Space, and we provide two examples of this demonstration in Figure 7 to show the usage of RPD. In this demonstration, the user may either enter a new sentence with a label or randomly choose an example from the supplied dataset in order to execute an attack, adversarial detection and adversarial repair.



Footnotes:
- Generally, an independent adversarial detection method also works in RPD, but the PLM-based adversary detector is simple and efficient.
- Note that attacking PLM-based models is very expensive. We therefore use 10K-example subsets of the Amazon Polarity and AGNews datasets in our experiments. We submit the datasets as supplementary materials for reproducible evaluation.
- SST2: https://huggingface.co/datasets/sst2
- Amazon Polarity: https://huggingface.co/datasets/amazon_polarity
- AGNews: https://huggingface.co/datasets/ag_news
- We use transformers to implement RPD: https://github.com/huggingface/transformers
- TextAttack: https://github.com/QData/TextAttack
- Demonstration: https://huggingface.co/spaces/anonymous8/RPD-Demo



Figure 2: The framework of RPD. The dotted lines with solid arrows indicate steps that depend on the existence of an adversary, while the dotted lines with triangles denote the objectives for multi-task training. In addition to the standard classification objective, RPD contains an adversary detection objective and a detached adversarial training objective.

Algorithm 1: Adversarial sampling and training of RPD. Require: D_nat and attackers {A_i}_{i=1}^k. Output: RPD model F_R for adversary detection. The algorithm first trains a surrogate classifier F_s on D_nat for adversary sampling and initializes B ← ∅.

Algorithm 2: Adversarial detection and defense based on RPD. Input: input examples D_e; attacker A_PD for perturbation defocusing. Output: the repaired outputs R.

RAT has an adversarial classifier based on reactive adversarial training. RAT predicts adversaries using the adversarial classifier and predicts natural examples using a standard classifier. The number of adversaries used in training RAT is the same as the number of RPD's training examples.

Figure 3: The cumulative distribution of output's cosine similarity scores towards natural examples. ∆ adv and ∆ rep indicate the average similarity scores of adversaries and repaired adversaries.

Figure 6: The t-SNE cluster visualizations of natural examples, adversaries and restored examples. The average cosine similarity scores of the clusters are indicated below the figures.

Figure 7: The demo snapshots of adversary detection and defense built on RPD for defending against multi-attacks.

The experimental performance of masking-based perturbation defocusing on adversaries.

A_i, 1 ≤ i ≤ k, is an attacker for adversary sampling; k is the number of sampling attackers. F_s is the surrogate classifier trained on natural examples (line 6 in Algorithm 1). The label ỹ of an example in RPD contains 3 sub-labels (for the objectives of classification (L_cls), detached adversarial training (L_adv) and adversarial detection (L_det), respectively).

The details of the experimental datasets used for evaluating RPD. We further split the original training set into training and validation subsets for the AGNews dataset. SST2 and Amazon Polarity are binary sentiment classification datasets, while AGNews is a news classification dataset containing 4 classes. Table 2 shows the dataset details. For detailed evaluation metric clarifications, please refer to Appendix A.1.3.

4.2. EXPERIMENT SETTINGS

The adversarial defense experiments involve attack methods and PLM-based classifiers. We adopt the open-source implementations of the adversarial attack methods from TextAttack as candidate attackers, following the original attack settings. We use BERT and DEBERTA as objective classifiers to evaluate the adversarial repair performance; DEBERTA is the base objective model used in all ablation experiments. In Table 3, we evaluate adversarial detection and defense performance on the whole testing set. However, we only evaluate 500 examples in the research questions due to resource limitations. For detailed hyper-parameter settings, please refer to Appendix A.1.1.

The adversarial detection and defense performance of RPD on different objective models; "Acc." is an abbreviation for Accuracy. The results are the medians in five runs.

Meanwhile, compared to previous adversarial defense studies, the regression of standard classification and adversarial detection error rate on natural examples are as low as ∼ 1% and ∼ 10%, respectively (please refer to Appendix A.2 for details). This reduces mis-repairs on natural examples. The adversarial defense performance based on perturbation defocusing depends on the accuracy of adversarial detection, which means detection accuracy ≥ defense accuracy. However, because the accuracy on natural examples suffers no significant loss, in the case of the worst detection accuracy (43.82% on AGNews dataset) of BERT, the restored accuracy (83.95%) is still better than BERT without defense (74.8%). On the one hand, our experimental results show reactive perturbation defocusing is able to repair ∼ 97%+ of correctly identified adversaries without clean performance sacrifice. On the other hand, RPD can be adapted to other models provided that the adversary detectors are deployed.

The adversarial detection and defense performance of RPD on unknown attacks.

The adversarial detection and defense performance of S-RPD under different attackers and PLMs. The "Diff" measures the performance change compared to RPD.

The performance of RPD on clean data.

The adversary defense performance comparison on the IMDB dataset between RPD and other state-of-the-art defense methods under the GA attack. * means that, due to computation resource limitations, we sampled 100 adversaries generated by GA to train RPD, which is not enough (e.g., the whole training set contains 8000 examples).

We show the performance of RPD in transfer experiments in Table 8. Interestingly, stronger naturalness constraints lead to worse transferability of the adversarial detectors; e.g., PWWS- and TextFooler-based detectors suffer up to 62.7% and 62.75% drops in adversarial detection and adversarial repair performance on BAE-based adversaries, especially on the SST2 dataset. Therefore, we argue that it is imperative to train the detector to simultaneously consider attackers with different constraint strengths.

The transferred performance of single attack-based S-RPD models for different attackers.

To alleviate the performance sacrifice caused by adversarial training on clean data, we adopt the detached adversarial training objective. To verify its feasibility, we employ traditional adversarial training in RPD. The results in Table 9 show that traditional adversarial training works for perturbation defocusing, but the performance drop on clean data is inevitable. We also evaluate an ablated RPD without the adversarial training objective; the experimental results show that the detection accuracy increases by ≈ 1%-2%, because the adversarial detection objective attracts more attention when β = 0. However, the restored accuracy drops by ≈ 2%-3%. Therefore, we believe that detached adversarial training is effective in RPD.

The experimental results of RPD based on ensemble adversarial training objective.

