LEARN2WEIGHT: WEIGHTS TRANSFER DEFENSE AGAINST SIMILAR-DOMAIN ADVERSARIAL ATTACKS

Abstract

Recent work on black-box adversarial attacks against NLP systems has attracted attention. Prior black-box attacks assume that attackers can observe output labels from the target model for chosen inputs. In this work, inspired by adversarial transferability, we propose a new type of black-box NLP adversarial attack in which an attacker chooses a similar domain, generates adversarial examples there, and transfers them to the target domain to degrade the target model's performance. Based on domain adaptation theory, we then propose a defensive strategy, called Learn2Weight, which is trained to predict weight adjustments for the target model in order to defend against similar-domain adversarial examples. Using the Amazon multi-domain sentiment classification dataset, we empirically show that Learn2Weight is effective against this attack compared to standard black-box defense methods such as adversarial training and defensive distillation. This work contributes to the growing literature on machine learning safety.

1. INTRODUCTION

As machine learning models are applied to more and more real-world tasks, machine learning safety is becoming an increasingly pressing issue. Deep learning algorithms have been shown to be vulnerable to adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2014; Papernot et al., 2016a). In particular, prior black-box adversarial attacks assume that the adversary is not aware of the target model's architecture, parameters, or training data, but is capable of querying the target model with supplied inputs and obtaining the output predictions. The phenomenon that adversarial examples generated from one model may also be adversarial to another model is known as adversarial transferability (Szegedy et al., 2013). Motivated by adversarial transferability, we conjecture another black-box attack pipeline in which the adversary does not even need access to the target model, nor to query labels for crafted inputs. Instead, as long as the adversary knows the task of the target, they can choose a similar domain, build a substitute model, and then attack the target model with adversarial examples generated from the attack domain. The similar-domain adversarial attack may be more practical than prior black-box attacks, as querying labels from the target model is not needed. This attack can be illustrated with the following example (Figure 1b) in medical insurance fraud (Finlayson et al., 2019). Insurance companies may use hypothetical opioid risk models to classify the likelihood (high/low) that a patient will abuse the opioids to be prescribed, based on the patient's medical history as text input. Physicians can run the original patient history through the attack pipeline to generate an adversarial patient history, where the original is more likely to be rejected ("High" risk) and the adversarial version is more likely to be accepted ("Low" risk).
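The attack pipeline described above can be sketched in a few lines. The following is a purely illustrative toy, not the paper's actual models: the substitute classifier is a trivial keyword scorer standing in for a model trained on the attacker's similar domain, and the `SYNONYMS` perturbation table is hypothetical.

```python
# Sketch of the similar-domain attack pipeline (illustrative only).
# Toy keyword sets stand in for a substitute model trained on the
# attacker's similar domain; no query to the target model is made.
POSITIVE = {"great", "excellent"}
NEGATIVE = {"boring", "awful"}

def substitute_model(text):
    """Toy substitute classifier built from the attacker's similar domain."""
    words = set(text.split())
    return 1 if len(words & POSITIVE) >= len(words & NEGATIVE) else 0

SYNONYMS = {"boring": "slow", "awful": "mediocre"}  # hypothetical perturbation table

def craft_adversarial(text):
    """Greedily swap near-synonyms until the substitute's label flips."""
    original = substitute_model(text)
    words = text.split()
    for i, w in enumerate(words):
        if w in SYNONYMS:
            words[i] = SYNONYMS[w]
            if substitute_model(" ".join(words)) != original:
                return " ".join(words)
    return " ".join(words)

# Step 1: craft an adversarial example against the substitute only.
x = "boring plot and awful acting"
x_adv = craft_adversarial(x)   # -> "slow plot and mediocre acting"
# Step 2: hand x_adv to the (never-queried) target model, relying on
# transferability for the misclassification to carry over.
```

A real attacker would replace the keyword scorer with a model trained on the similar-domain corpus and the synonym table with a proper word-substitution attack; the control flow, however, is the same.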
Perturbations in the patient history could be, for example, a slight change from "alcohol abuse" to "alcohol dependence", and such a change may successfully fool the insurance company's model.

Figure 1: Diagrammatic representation of the problem

Based on domain adaptation theory (Ben-David et al., 2010), we conjecture that it is domain-variant features that cause the success of the similar-domain attack. Adversarial examples with domain-variant features are likely to reside in the low-density regions (far away from the decision boundary) of the empirical distribution of the target training data, which can fool the target model (Zhang et al., 2019b). The literature indicates that worsened generalizability is a tradeoff faced by existing defenses such as adversarial training (Raghunathan et al., 2019) and domain generalization techniques (Wang et al., 2019). In trying to increase robustness against adversarial inputs, a model trades off accuracy on clean inputs. Given that an adversarial training loss is composed of a loss on clean inputs and a loss on adversarial inputs, improper optimization in which the latter is highly optimized and the former only weakly optimized does not improve general performance in the real world. To curb this issue, methods such as factoring in under-represented data points in the training set have been proposed (Zhang et al., 2019b; Lamb et al., 2019; Schmidt et al., 2018).

To defend against this similar-domain adversarial attack, we propose a weight transfer network approach, Learn2Weight, so that the target model's decision boundary can adapt to examples from low-density regions. Experiments confirm the effectiveness of our approach against the similar-domain attack compared to other baseline defense methods. Moreover, our approach improves robust accuracy without sacrificing the target model's standard generalization accuracy. Our contributions can be summarized as follows:

• We are among the first to propose the similar-domain adversarial attack. This attack pipeline relaxes the previous black-box assumption that the adversary has access to the target model and can query it with crafted examples.

• We propose a defensive strategy for this attack based on domain adaptation theory. Experimental results show the effectiveness of our approach over existing defense methods against the similar-domain attack.

2. RELATED WORK

Recent work on adversarial attacks for NLP systems has attracted attention; see the survey by Zhang et al. (2020) for an overview of adversarial attacks in NLP. Existing research proposes different methods for generating adversarial text examples (Moosavi-Dezfooli et al., 2016; Ebrahimi et al., 2018; Wallace et al., 2019). Crafted adversarial text examples have been shown to fool state-of-the-art NLP systems such as BERT (Jin et al., 2019). A large body of adversarial attack research focuses on black-box attacks, where the adversary builds a substitute model by querying the target model with supplied inputs and obtaining the output predictions. The key idea behind such black-box attacks is that adversarial examples generated from one model may also be misclassified by another model, a phenomenon known as adversarial transferability (Szegedy et al., 2013; Cheng et al., 2019). While prior work examines transferability between different models trained over the same dataset, or between the same or different models trained over disjoint subsets of a dataset, our work examines adversarial transferability between different domains, which we call the similar-domain attack.
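The transferability phenomenon discussed above can be illustrated with a deliberately minimal toy. Both "models" below are hypothetical keyword classifiers invented for this sketch (they are not the paper's substitute or target models): a perturbation crafted against one flips the other's prediction as well, without any query to it.

```python
# Toy illustration of adversarial transferability across similar domains.
# Both classifiers and their keyword vocabularies are hypothetical.

def books_model(text):
    """Attacker's substitute, built from a "books" review domain."""
    return 1 if "boring" not in text else 0

def dvds_model(text):
    """Unseen target, built from a "DVDs" review domain."""
    return 1 if "boring" not in text and "dull" not in text else 0

clean = "a boring but well shot film"
adversarial = clean.replace("boring", "slow")  # crafted against books_model only

# The perturbation flips the substitute's label...
assert books_model(clean) == 0 and books_model(adversarial) == 1
# ...and, without a single query to the target, transfers to it as well.
assert dvds_model(clean) == 0 and dvds_model(adversarial) == 1
```

Because the two domains share most of their decision-relevant features, a perturbation that crosses the substitute's decision boundary tends to cross the target's boundary too, which is exactly what the similar-domain attack exploits.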

