TEXTGRAD: ADVANCING ROBUSTNESS EVALUATION IN NLP BY GRADIENT-DRIVEN OPTIMIZATION

Abstract

Robustness evaluation against adversarial examples has become increasingly important for revealing the trustworthiness of prevailing deep models in natural language processing (NLP). However, in contrast to the computer vision (CV) domain, where first-order projected gradient descent (PGD) serves as the benchmark approach for generating adversarial examples, NLP lacks a principled first-order gradient-based robustness evaluation framework. The optimization challenges are twofold: 1) the discrete nature of textual inputs, together with the strong coupling between the perturbation location and the actual content, and 2) the additional constraint that the perturbed text should be fluent, i.e., achieve a low perplexity under a language model. These challenges make the development of PGD-like NLP attacks difficult. To bridge the gap, we propose TEXTGRAD, a new attack generator based on gradient-driven optimization that supports high-accuracy and high-quality assessment of adversarial robustness in NLP. Specifically, we address the aforementioned challenges in a unified optimization framework: we develop an effective convex relaxation method to co-optimize the continuously relaxed site-selection and perturbation variables, and we leverage an effective sampling method to establish an accurate mapping from the continuous optimization variables to the discrete textual perturbations. Moreover, as a first-order attack generation method, TEXTGRAD can be incorporated into adversarial training to further improve the robustness of NLP models. Extensive experiments demonstrate the effectiveness of TEXTGRAD not only in attack generation for robustness evaluation but also in adversarial defense. From the attack perspective, TEXTGRAD achieves remarkable improvements in both attack success rate and perplexity over five state-of-the-art baselines.
From the defense perspective, TEXTGRAD-enabled adversarial training yields the most robust NLP model against a wide spectrum of NLP attacks.

1. INTRODUCTION

The assessment of adversarial robustness of machine learning (ML) models has received increasing research attention because of their vulnerability to adversarial input perturbations (known as adversarial attacks) (Goodfellow et al., 2014; Carlini & Wagner, 2017; Papernot et al., 2016). Among the variety of robustness evaluation methods, gradient-based adversarial attack generation has achieved tremendous success in the computer vision (CV) domain (Croce & Hein, 2020; Dong et al., 2020). For example, projected gradient descent (PGD)-based methods have been widely used to benchmark the adversarial robustness of CV models (Madry et al., 2018; Zhang et al., 2019b; Shafahi et al., 2019; Wong et al., 2020; Zhang et al., 2019a; Athalye et al., 2018). However, in the natural language processing (NLP) area, the predominant robustness evaluation tools are query-based attack generation methods (Li et al., 2020; Jin et al., 2020; Ren et al., 2019; Garg & Ramakrishnan, 2020; Li et al., 2019), which do not make full use of gradient information. Yet, this query-based mainstream of NLP robustness evaluation suffers from several limitations. First, query-based attack methods are prone to generating ambiguous or invalid adversarial textual inputs (Wang et al., 2021), most of which change the original semantics. Second, query-based methods can hardly be integrated into the first-order optimization-based model training recipe, which makes it difficult to develop adversarial training-based defenses (Madry et al., 2018; Athalye et al., 2018). Even though some first-order optimization-based NLP attack generation methods have been developed in the literature, they often come with poor attack effectiveness (Ebrahimi et al., 2018) or high computational cost (Guo et al., 2021), leaving open the question of whether the best optimization framework for NLP attack generation has been found.

Table 1: Effectiveness of TEXTGRAD at a glance on the SST-2 dataset (Socher et al., 2013) against 5 NLP attack baselines. Each attack method is categorized by its attack principle (gradient-based vs. query-based) and evaluated on three aspects: attack success rate (ASR), adversarial text quality (in terms of language model perplexity), and runtime efficiency (average runtime for attack generation, in seconds). Two types of victim models are considered, i.e., realizations of BERT obtained by standard training (ST) and adversarial training (AT), respectively. Here AT integrates TEXTFOOLER (Jin et al., 2020) with standard training. Across models, higher ASR, lower perplexity, and lower runtime indicate a stronger attack. The best performance per metric is highlighted in bold.

The main challenges in leveraging gradients to generate adversarial attacks in NLP lie in two aspects. First, the discrete nature of text makes it difficult to directly employ gradient information on the inputs. Unlike perturbing pixels in imagery data, adversarial perturbations to a textual input must be optimized over the discrete space of words and tokens. Second, the fluency requirement of text imposes another constraint on the optimization. In contrast to ℓp-norm-constrained attacks in CV, adversarial examples in NLP are required to keep a low perplexity score. These two obstacles make the design of gradient-based attack generation methods in NLP highly non-trivial.

To bridge the adversarial learning gap between CV and NLP, we develop a novel adversarial attack method, termed TEXTGRAD, by peering into the gradient-driven optimization principles needed for effective attack generation in NLP. Specifically, we propose a convex relaxation method to co-optimize the selection of perturbation positions and the token modifications. To overcome the discrete optimization difficulty, we propose an effective sampling strategy that enables an accurate mapping from the continuous optimization space to discrete textual perturbations. We further leverage a perplexity-driven loss to improve the fluency of the generated adversarial examples. In Table 1, we highlight the attack improvements brought by TEXTGRAD over widely used NLP attack baselines; more thorough experimental results are provided in Sec. 5.

Our contributions. ❶ We propose TEXTGRAD, a novel first-order gradient-driven adversarial attack method, which takes a firm step toward filling the vacancy of a principled PGD-based robustness evaluation framework in NLP. ❷ We identify several missing optimization principles that boost the power of gradient-based NLP attacks, such as convex relaxation, sampling-based continuous-to-discrete mapping, and site-token co-optimization. ❸ We show that TEXTGRAD is easily integrated with adversarial training and enables effective defenses against adversarial attacks in NLP. ❹ Lastly, we conduct thorough experiments to demonstrate the superiority of TEXTGRAD over existing baselines in both adversarial attack generation and adversarial defense.

2. BACKGROUND AND RELATED WORK

Adversarial attacks in CV. Gradient information has played an important role in generating adversarial examples, i.e., human-imperceptible perturbed inputs that can mislead models, in the CV domain.
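The PGD benchmark referenced above admits a compact description: iteratively take a signed gradient step that increases the attack loss, then project the perturbed input back onto an ℓ∞ ball around the clean input. The following is a minimal NumPy sketch on a toy differentiable loss; the function name, the toy loss, and all parameter values are illustrative assumptions, not code from this paper.

```python
import numpy as np

def pgd_linf(x0, grad_fn, eps=0.1, alpha=0.02, steps=10):
    """PGD attack sketch: ascend the attack loss via signed gradient
    steps, projecting back onto the l_inf ball of radius eps around x0."""
    x = x0.copy()
    for _ in range(steps):
        g = grad_fn(x)                       # gradient of the attack loss w.r.t. x
        x = x + alpha * np.sign(g)           # FGSM-style signed ascent step
        x = np.clip(x, x0 - eps, x0 + eps)   # projection onto the l_inf ball
    return x

# Toy attack loss L(x) = -0.5 * ||x - t||^2: maximizing it pulls x toward t,
# so PGD should move x as far toward t as the eps-ball permits.
t = np.array([1.0, -1.0])
grad_fn = lambda x: t - x
x_adv = pgd_linf(np.zeros(2), grad_fn)
```

Attacks on real models replace `grad_fn` with backpropagation through the victim network; the projection step is what keeps the perturbation within the ℓ∞ budget. It is exactly this projection-onto-a-continuous-ball structure that has no direct analogue for discrete text, motivating the relaxation and sampling machinery described in the introduction.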

