PERTURBATION DEFOCUSING FOR ADVERSARIAL DEFENSE

Abstract

Recent research indicates that adversarial attacks can reliably deceive neural systems, including large-scale, pre-trained language models. Given a natural sentence, an attacker replaces a subset of its words to fool the objective model. To defend against adversarial attacks, existing works aim to reconstruct the adversarial examples. However, these methods show limited defense performance on adversarial examples while also degrading the clean performance on natural examples. We find that reconstructing adversarial examples is not necessary for strong defense. More specifically, we inject non-toxic perturbations into adversarial examples, which disables almost all malicious perturbations. To minimize the performance sacrifice, we employ an adversarial example detector and repair only the detected adversarial examples, which alleviates mis-defense on natural examples. Our experimental results on three datasets, two objective models, and a variety of adversarial attacks show that the proposed method successfully repairs up to ∼97% of correctly identified adversarial examples with ≤∼2% performance sacrifice. We provide an anonymous demonstration 1 of adversarial detection and repair based on our work.

1. INTRODUCTION

Neural networks have achieved state-of-the-art performance on various tasks. However, recent research has shown their vulnerability to adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015). In particular, language models have been shown to be vulnerable to adversarial examples (a.k.a. adversaries) (Garg & Ramakrishnan, 2020; Li et al., 2020; Jin et al., 2020; Li et al., 2021a) generated by replacing specific words in a sentence. Compared to the extensive research on adversarial attacks (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020; Zhang et al., 2021; Jin et al., 2020; Garg & Ramakrishnan, 2020; Li et al., 2021a; Wang et al., 2022), text adversarial defense (a.k.a. adversarial repair) has attracted less attention, resulting in limited progress. Moreover, the crux of adversarial defense, i.e., performance sacrifice, has not been settled by existing studies.

While prominent works tend to approach adversarial defense via adversarial training or feature reconstruction, we propose perturbation defocusing to address adversarial defense in natural language processing. More specifically, perturbation defocusing applies non-toxic perturbations to adversaries in order to repair them. Although this may seem counter-intuitive, it is motivated by the empirical observation that malicious perturbations rarely destroy the fundamental semantics of a natural example. In other words, these adversaries can be easily repaired by distracting the objective model from the malicious perturbations. We validate a simple implementation of perturbation defocusing with preliminary experiments: simply masking the malicious perturbations, as in Figure 1. The experimental results in Table 1 show that masking malicious perturbations repairs a considerable number of adversaries (up to 91.05% restored accuracy on the Amazon Polarity dataset).
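The preliminary masking experiment can be sketched as follows. This is a minimal illustration, not the paper's implementation: `mask_perturbations`, `repair_by_masking`, and the stub classifier are all hypothetical names, and in the real setting the perturbed positions would come from the attack log.

```python
def mask_perturbations(tokens, perturbed_indices, mask_token="[MASK]"):
    """Replace the (known) maliciously perturbed words with a mask token."""
    return [mask_token if i in perturbed_indices else tok
            for i, tok in enumerate(tokens)]

def repair_by_masking(tokens, perturbed_indices, classify):
    """Mask the malicious perturbations, then re-query the objective model."""
    masked = mask_perturbations(tokens, perturbed_indices)
    return classify(masked)

# Toy usage: a stub classifier that flips to "negative" whenever the
# (hypothetical) malicious word "mediocre" survives masking.
adversary = "the film was mediocre overall".split()
stub_classify = lambda toks: "negative" if "mediocre" in toks else "positive"
print(repair_by_masking(adversary, {3}, stub_classify))  # prints "positive"
```

The point of the sketch is only that, once the malicious positions are hidden from the objective model, the original prediction is restored; the next paragraph addresses the realistic case where those positions are unknown.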
Unfortunately, the positions of malicious perturbations are unknown in real-world adversarial defense. As an alternative, we employ adversarial attackers themselves to perform perturbation defocusing. Once an adversary is identified, we obtain its perturbed prediction and keep attacking the adversary until the new prediction differs from the former one. In this way, the malicious perturbations are defocused without knowing their positions. Because adversarial attackers have large search spaces of non-toxic perturbations, almost all malicious perturbations in adversaries can be defocused in our experiments. However, there is a prerequisite: adversaries must be precisely identified, to prevent the attackers from attacking natural examples (Bao et al., 2021) during perturbation defocusing. Fortunately, although existing adversarial attackers emphasize the naturalness of adversaries (Zang et al., 2020; Li et al., 2021b; Le et al., 2022), our study suggests that PLM-based models can efficiently distinguish adversaries (refer to Figure 4), provided that the adversarial detection objective is involved in the fine-tuning process. We therefore propose reactive perturbation defocusing (RPD), which builds on perturbation defocusing and adversary detection and alleviates performance sacrifice by repairing only detected adversaries. We deploy RPD on a PLM-based model, and it can be extended to other NLP models. We evaluate RPD on three text classification datasets under challenging adversarial attackers. The experimental results demonstrate that RPD repairs ∼97%+ of identified adversaries with negligible performance sacrifice (under ∼2%) on clean data (refer to Table 6). In summary, our main contributions are as follows: a) We propose perturbation defocusing to supersede feature reconstruction-based methods for adversarial defense; it repairs almost all correctly identified adversaries.
b) We integrate an adversarial detector with a PLM-based classification model. Based on multi-attack adversary sampling, the adversarial detector can efficiently detect most adversaries. c) We evaluate RPD on multiple datasets, PLMs, and adversarial attackers. The experimental results indicate that RPD has an impressive capacity to detect and repair adversaries without sacrificing clean performance.

2. RELATED WORK

… et al., 2019; Liu et al., 2020b; Mozes et al., 2021; Keller et al., 2021; Chen et al., 2021; Xu et al., 2022; Li et al., 2022; Swenor & Kalita, 2022); and feature reconstruction-based methods (Zhou et al., 2019; Jones et al., 2020; Wang et al., 2021a). In the meantime, some research (Wang et al., 2021b) explores hybrid defenses against adversarial attacks. Nevertheless, some problems remain with the existing methods. For example, due to the issue of catastrophic forgetting (Dong et al., 2021), adversarial training has been shown to be inadequate for improving the robustness of PLMs during fine-tuning; moreover, it significantly increases the cost of training the objective model. For context reconstruction (e.g., word substitution and translation-based reconstruction), these methods sometimes fail to identify semantically repaired adversaries or tend to introduce new malicious perturbations (Swenor & Kalita, 2022). Recent studies have recognised that feature (e.g., embedding) space reconstruction-based approaches are more successful than context reconstruction methods like word substitution (Mozes et al., 2021;
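The reactive detect-then-defocus procedure described in the introduction can be sketched as follows. This is an illustrative skeleton only: `classify`, `is_adversary`, and `attacker_perturb` are hypothetical stand-ins for the objective model, the adversarial detector, and an adversarial attacker's perturbation step.

```python
def reactive_perturbation_defocusing(text, classify, is_adversary,
                                     attacker_perturb, max_rounds=10):
    """If `text` is flagged as an adversary, inject non-toxic perturbations
    until the objective model's prediction changes; otherwise leave it alone."""
    if not is_adversary(text):
        return text, classify(text)           # natural example: no repair
    original_pred = classify(text)            # prediction under malicious perturbation
    candidate = text
    for _ in range(max_rounds):
        candidate = attacker_perturb(candidate, target_label=original_pred)
        new_pred = classify(candidate)
        if new_pred != original_pred:         # malicious perturbation defocused
            return candidate, new_pred
    return text, original_pred                # give up after the search budget

# Toy stubs: sentiment = sign of (#"good" - #"bad"); the "attacker" appends "good".
classify = lambda t: "pos" if t.count("good") > t.count("bad") else "neg"
is_adv = lambda t: "bad" in t                 # pretend detector
perturb = lambda t, target_label: t + " good"
repaired, pred = reactive_perturbation_defocusing("a bad movie", classify,
                                                  is_adv, perturb)
print(pred)  # prints "pos"
```

Note that natural examples bypass the repair loop entirely, which is how the reactive design avoids sacrificing clean performance; the quality of the detector therefore bounds the overall behavior.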



1 https://huggingface.co/spaces/anonymous8/RPD-Demo



Figure 1: A real example of perturbation defocusing, which masks the perturbed words to repair an adversary. "[MASK]" denotes the mask token. This virtual adversary is generated by TextFooler.

Table 1: The experimental performance of masking-based perturbation defocusing on adversaries.

