TOWARDS ROBUSTNESS AGAINST NATURAL LANGUAGE WORD SUBSTITUTIONS

Abstract

Robustness against word substitutions has a well-defined and widely accepted form, i.e., using semantically similar words as substitutions, and thus it is considered a fundamental stepping-stone towards broader robustness in natural language processing. Previous defense methods capture word substitutions in vector space by using either an ℓ2-ball or a hyper-rectangle, which results in perturbation sets that are not inclusive enough or unnecessarily large, and thus impedes mimicry of worst cases for robust training. In this paper, we introduce a novel Adversarial Sparse Convex Combination (ASCC) method. We model the word substitution attack space as a convex hull and leverage a regularization term to enforce perturbation towards an actual substitution, thus aligning our modeling better with the discrete textual space. Based on the ASCC method, we further propose ASCC-defense, which leverages ASCC to generate worst-case perturbations and incorporates adversarial training towards robustness. Experiments show that ASCC-defense outperforms the current state of the art in terms of robustness on two prevailing NLP tasks, i.e., sentiment analysis and natural language inference, concerning several attacks across multiple model architectures. We also envision a new class of defense towards robustness in NLP, where our robustly trained word vectors can be plugged into a normally trained model and enforce its robustness without applying any other defense techniques.

1. INTRODUCTION

Recent extensive studies have shown that deep neural networks (DNNs) are vulnerable to adversarial attacks (Szegedy et al., 2013; Goodfellow et al., 2015; Papernot et al., 2016a; Kurakin et al., 2017; Alzantot et al., 2018); e.g., minor phrase modification can easily deceive Google's toxic comment detection systems (Hosseini et al., 2017). This raises grand security challenges to advanced natural language processing (NLP) systems, such as malware detection and spam filtering, where DNNs have been broadly deployed (Stringhini et al., 2010; Kolter & Maloof, 2006). As a consequence, research on defending against natural language adversarial attacks has attracted increasing attention. Existing adversarial attacks in NLP can be categorized into three types: (i) character-level modifications (Belinkov & Bisk, 2018; Gao et al., 2018; Eger et al., 2019), (ii) deleting, adding, or swapping words (Liang et al., 2017; Jia & Liang, 2017; Iyyer et al., 2018), and (iii) word substitutions using semantically similar words (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020). The first two attack types usually break the grammaticality and naturalness of the original input sentences, and thus can be detected by a spell or grammar checker (Pruthi et al., 2019). In contrast, the third attack type only substitutes words with semantically similar words, thus preserving the syntax and semantics of the original input to the greatest extent, and is very hard to discern, even from a human's perspective. Therefore, building robustness against such word substitutions is a fundamental stepping stone towards robustness in NLP, which is the focus of this paper.

Adversarial attack by word substitution is a combinatorial optimization problem. Solving this problem in the discrete textual space is considered NP-hard, as the search space increases exponentially with the length of the input.
As such, many methods have been proposed to model word substitutions in the continuous word vector space (Sato et al., 2018; Gong et al., 2018; Jia et al., 2019; Huang et al., 2019), so that they can leverage the gradients generated by a victim model either for attack or for robust training. However, previous methods capture word substitutions in the vector space by using either an ℓ2-ball or a hyper-rectangle, which results in perturbation sets that are not inclusive enough or unnecessarily large, and thus impedes precise mimicry of the worst cases for robust training (see Fig. 1 for an illustration). In this paper, we introduce a novel Adversarial Sparse Convex Combination (ASCC) method, whose key idea is to model the solution space as the convex hull of the substitution word vectors. Using a convex hull brings two advantages: (i) a continuous convex space is beneficial for gradient-based adversary generation, and (ii) the convex hull is, by definition, the smallest convex set that contains all substitutions, and thus is inclusive enough to cover all possible substitutions while ruling out unnecessary cases. In particular, we leverage a regularization term to encourage the adversary towards an actual substitution, which aligns our modeling better with the discrete textual space. We further propose ASCC-defense, which employs ASCC to generate adversaries and incorporates adversarial training to gain robustness. We evaluate ASCC-defense on two prevailing NLP tasks, i.e., sentiment analysis on IMDB and natural language inference on SNLI, across four model architectures, concerning two common attack methods. Experimental results show that our method consistently yields models that are more robust than the state of the art by significant margins; e.g., we achieve 79.0% accuracy under Genetic attacks on IMDB while the state-of-the-art performance is 75.0%.
Moreover, our robustly trained word vectors can be easily plugged into standard NLP models and enforce robustness without applying any other defense techniques, which points to a new class of approaches towards NLP robustness. For instance, using our pre-trained word vectors as initialization enhances a normal LSTM model to achieve 73.4% robust accuracy, while the state-of-the-art defense and the undefended model achieve 72.5% and 7.9%, respectively.

2.1. NOTATIONS AND PROBLEM SETTING

In this paper, we focus on the text classification problem to introduce our method, although it can also be extended to other NLP tasks. We are interested in training a classifier X → Y that predicts a label y ∈ Y given an input x ∈ X. The input x is a textual sequence of L words {x_i}_{i=1}^{L}. We consider the most common practice for NLP tasks, where the first step is to map x into a sequence of vectors in a low-dimensional embedding space, denoted as v(x). The classifier is then formulated as p(y|v(x)), where p can be parameterized by a neural network, e.g., a CNN or LSTM model.
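The following toy implementation, added here as an illustration and not code from the paper, shows the ASCC idea from the introduction in the notation above: the perturbed embedding is a convex combination of the candidate substitution vectors (so it stays inside their convex hull), the combination weights are parameterized through a softmax so gradient ascent can be used, and an entropy penalty pushes the weights towards a one-hot vector, i.e., an actual substitution. The 2-D word vectors, the linear classifier p(y|v) = sigmoid(θ·v), the step size, and the penalty weight alpha are all constructed assumptions for this example.

```python
import math

# Constructed toy data: 2-D vectors for the original word (index 0) and two
# semantically similar substitution candidates (indices 1 and 2).
CANDIDATES = [(1.0, 0.0), (0.8, 0.6), (0.2, 0.9)]
# Constructed toy classifier p(y=1|v) = sigmoid(THETA . v); the gold label is y=1.
THETA = (1.0, -1.5)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)                       # shift for numerical stability
    e = [math.exp(a - m) for a in logits]
    s = sum(e)
    return [a / s for a in e]

def convex_combination(weights, cands):
    """v_hat = sum_i w_i * v(x_hat_i): a point inside the convex hull of the candidates."""
    dim = len(cands[0])
    return tuple(sum(w * c[d] for w, c in zip(weights, cands)) for d in range(dim))

def loss(v):
    """Cross-entropy of the toy classifier for the gold label y=1: -log p(y=1|v)."""
    z = sum(t * a for t, a in zip(THETA, v))
    return math.log(1.0 + math.exp(-z))

def ascc_attack(cands, alpha=0.5, lr=1.0, steps=500):
    """Maximize loss(v_hat) - alpha * H(w) over unconstrained logits w_hat, with
    w = softmax(w_hat). The softmax keeps v_hat inside the hull; the entropy
    penalty (weight alpha) drives w towards one-hot, i.e. a real substitution."""
    logits = [0.0] * len(cands)
    for _ in range(steps):
        w = softmax(logits)
        v = convex_combination(w, cands)
        z = sum(t * a for t, a in zip(THETA, v))
        dl_dz = sigmoid(z) - 1.0          # d(-log sigmoid(z)) / dz
        # d objective / d w_i = dl_dz * (THETA . v_i) + alpha * (log w_i + 1)
        g = [dl_dz * sum(t * a for t, a in zip(THETA, c))
             + alpha * (math.log(max(wi, 1e-12)) + 1.0)
             for wi, c in zip(w, cands)]
        mean_g = sum(wi * gi for wi, gi in zip(w, g))
        # softmax Jacobian: d objective / d logit_j = w_j * (g_j - sum_i w_i g_i)
        logits = [lj + lr * wj * (gj - mean_g) for lj, wj, gj in zip(logits, w, g)]
    w = softmax(logits)
    return w, convex_combination(w, cands)
```

On this toy model the weights collapse onto candidate 2, the substitution that most increases the loss, and the returned v_hat is (up to the residual weight on the other candidates) that candidate's vector rather than an interior point of the hull.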

Figure 1: Visualization of how different methods capture the word substitutions in the vector space.