PERTURBATION DEFOCUSING FOR ADVERSARIAL DEFENSE

Abstract

Recent research shows that adversarial attacks can reliably deceive neural systems, including large-scale, pre-trained language models. Given a natural sentence, an attacker replaces a subset of its words to fool the objective model. To defend against such attacks, existing work aims to reconstruct the adversarial examples. However, these methods show limited defense performance on adversarial examples while also degrading clean performance on natural examples. Our key finding is that reconstructing adversarial examples is not necessary for effective defense. Instead, we inject non-toxic perturbations into adversarial examples, which disables almost all malicious perturbations. To minimize the performance sacrifice, we employ an adversarial example detector and repair only the detected adversarial examples, which alleviates mis-defense on natural examples. Experimental results on three datasets, two objective models, and a variety of adversarial attacks show that the proposed method repairs up to ∼97% of correctly identified adversarial examples with ≤∼2% performance sacrifice. We provide an anonymous demonstration¹ of adversarial detection and repair based on our work.

1. INTRODUCTION

Neural networks have achieved state-of-the-art performance on various tasks. However, recent research has shown their vulnerability to adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015). In particular, language models are vulnerable to adversarial examples (a.k.a. adversaries) (Garg & Ramakrishnan, 2020; Li et al., 2020; Jin et al., 2020; Li et al., 2021a) generated by replacing specific words in a sentence. Compared to the extensive research on textual adversarial attacks (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020; Zhang et al., 2021; Jin et al., 2020; Garg & Ramakrishnan, 2020; Li et al., 2021a; Wang et al., 2022), text adversarial defense (a.k.a. adversarial repair) has attracted less attention, resulting in limited progress. Moreover, the crux of adversarial defense, i.e., the performance sacrifice on natural examples, has not been settled by existing studies. While prominent works tend to address adversarial defense via adversarial training or feature reconstruction, we propose perturbation defocusing for adversarial defense in natural language processing. More specifically, perturbation defocusing applies non-toxic perturbations to adversaries in order to repair them. Although this may not seem intuitive, it is motivated by the empirical observation that malicious perturbations rarely destroy the fundamental semantics of a natural example. In other words, such adversaries can be easily repaired by distracting the objective model from the malicious perturbations. We validate a simple implementation of perturbation defocusing in preliminary experiments: simply masking the malicious perturbations, as in Figure 1. The experimental results in Table 1 show that masking malicious perturbations repairs a considerable number of adversaries (achieving up to 91.05% restored accuracy on the Amazon Polarity dataset).
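The masking experiment above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: `toy_classifier`, `mask_perturbations`, and the `[MASK]` token usage are our own stand-ins for the real objective model and masking procedure, assuming the positions of the malicious perturbations are known.

```python
# Sketch of the preliminary masking experiment: if the positions of the
# malicious perturbations were known, replacing them with a mask token
# should restore the objective model's original prediction.

MASK = "[MASK]"

def toy_classifier(text: str) -> str:
    # Trivial keyword-based sentiment model, used only for illustration
    # in place of a real pre-trained language model.
    negative = {"terrible", "awful", "bad"}
    return "negative" if any(t in negative for t in text.lower().split()) else "positive"

def mask_perturbations(text: str, perturbed_positions: list[int]) -> str:
    # Replace the known malicious tokens with the mask token so the
    # model can no longer be misled by them.
    tokens = text.split()
    for i in perturbed_positions:
        tokens[i] = MASK
    return " ".join(tokens)

# An "adversary" where the attacker swapped a word at position 3.
adversary = "the movie was terrible and I loved it"
repaired = mask_perturbations(adversary, [3])

print(toy_classifier(adversary))  # flipped prediction: negative
print(toy_classifier(repaired))   # restored prediction: positive
```

The point of the sketch is the gap it exposes: the repair is trivial when the perturbed positions are given, which is exactly the information a real defender lacks.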
Unfortunately, the positions of malicious perturbations are unknown in real adversarial defense. As an alternative, we employ adversarial attackers to perform perturbation defocusing. Once an adversary is identified, we obtain its perturbed prediction and keep attacking the adversary until the new prediction differs from the former. In this way, the malicious perturbations

¹ https://huggingface.co/spaces/anonymous8/RPD-Demo
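The attack-until-the-prediction-flips loop described above can be sketched as follows. This is a minimal toy, assuming a keyword classifier and a one-word synonym-substitution attacker (`attack_once`, `SYNONYMS`, and `defocus` are hypothetical names of our own); a real system would use a pre-trained objective model and an off-the-shelf adversarial attacker.

```python
# Hedged sketch of perturbation defocusing: given a detected adversary,
# repeatedly apply an attacker until the model's prediction differs from
# the (presumed wrong) adversarial prediction.
import random

def toy_classifier(text: str) -> str:
    # Stand-in for the objective model.
    negative = {"terrible", "awful", "bad"}
    return "negative" if any(t in negative for t in text.lower().split()) else "positive"

# Toy substitution table standing in for a real attacker's candidate words.
SYNONYMS = {"terrible": ["decent"], "awful": ["fine"], "bad": ["okay"]}

def attack_once(text: str, rng: random.Random) -> str:
    # Toy attacker: substitute one word that has a candidate replacement.
    tokens = text.split()
    candidates = [i for i, t in enumerate(tokens) if t.lower() in SYNONYMS]
    if not candidates:
        return text
    i = rng.choice(candidates)
    tokens[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return " ".join(tokens)

def defocus(adversary: str, max_steps: int = 10, seed: int = 0) -> str:
    # Keep attacking the detected adversary; stop as soon as the new
    # prediction differs from the adversarial prediction.
    rng = random.Random(seed)
    adversarial_label = toy_classifier(adversary)
    text = adversary
    for _ in range(max_steps):
        text = attack_once(text, rng)
        if toy_classifier(text) != adversarial_label:
            break
    return text

adversary = "the movie was terrible and I loved it"
print(toy_classifier(defocus(adversary)))  # prediction flipped back: positive
```

Note that the injected perturbation is "non-toxic" with respect to the original label: it only needs to move the model's prediction away from the adversarial one, not to restore the original wording.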

