TOWARDS ROBUSTNESS AGAINST NATURAL LANGUAGE WORD SUBSTITUTIONS

Abstract

Robustness against word substitutions has a well-defined and widely accepted form, i.e., using semantically similar words as substitutions, and is thus considered a fundamental stepping-stone towards broader robustness in natural language processing. Previous defense methods capture word substitutions in vector space using either an l2-ball or a hyper-rectangle, which results in perturbation sets that are either not inclusive enough or unnecessarily large, and thus impedes mimicry of worst cases for robust training. In this paper, we introduce a novel Adversarial Sparse Convex Combination (ASCC) method. We model the word substitution attack space as a convex hull and leverage a regularization term to enforce perturbations towards actual substitutions, thus aligning our modeling better with the discrete textual space. Based on the ASCC method, we further propose ASCC-defense, which leverages ASCC to generate worst-case perturbations and incorporates adversarial training towards robustness. Experiments show that ASCC-defense outperforms the current state of the art in terms of robustness on two prevailing NLP tasks, i.e., sentiment analysis and natural language inference, concerning several attacks across multiple model architectures. Besides, we also envision a new class of defense towards robustness in NLP, where our robustly trained word vectors can be plugged into a normally trained model and enforce its robustness without applying any other defense techniques.
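The core idea above can be illustrated with a minimal sketch: a perturbed word embedding is written as a convex combination of the embeddings of the word's substitution candidates, with the combination weights produced by a softmax over learnable coefficients, and an entropy regularizer that pushes the weights towards a one-hot vector (i.e., an actual discrete substitution). All function and variable names below are hypothetical, and this is a simplification of the method described in the paper, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ascc_perturbation(subst_embeddings, logits):
    """Convex-combination view of a word-substitution perturbation (sketch).

    subst_embeddings: (k, d) array, one row per candidate substitute word.
    logits: (k,) unnormalized coefficients (optimized by the attacker).
    """
    w = softmax(logits)                       # convex weights: w_i >= 0, sum(w) = 1
    e_hat = w @ subst_embeddings              # a point inside the convex hull
    entropy = -(w * np.log(w + 1e-12)).sum()  # low entropy => near-one-hot w,
                                              # i.e., close to a real substitution
    return e_hat, w, entropy

# toy example: 3 candidate substitutes in a 4-dim embedding space
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 4))
e_hat, w, H = ascc_perturbation(V, np.array([2.0, 0.1, -1.0]))
```

In the full method, the logits would be optimized by gradient ascent on the model's loss while the entropy term is minimized, so the worst-case point stays near a vertex of the hull; here only the forward construction is shown.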

1. INTRODUCTION

Recent extensive studies have shown that deep neural networks (DNNs) are vulnerable to adversarial attacks (Szegedy et al., 2013; Goodfellow et al., 2015; Papernot et al., 2016a; Kurakin et al., 2017; Alzantot et al., 2018); e.g., minor phrase modifications can easily deceive Google's toxic comment detection systems (Hosseini et al., 2017). This raises grand security challenges to advanced natural language processing (NLP) systems, such as malware detection and spam filtering, where DNNs have been broadly deployed (Stringhini et al., 2010; Kolter & Maloof, 2006). As a consequence, research on defending against natural language adversarial attacks has attracted increasing attention. Existing adversarial attacks in NLP can be categorized into three types: (i) character-level modifications (Belinkov & Bisk, 2018; Gao et al., 2018; Eger et al., 2019), (ii) deleting, adding, or swapping words (Liang et al., 2017; Jia & Liang, 2017; Iyyer et al., 2018), and (iii) word substitutions using semantically similar words (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020). The first two attack types usually break the grammaticality and naturality of the original input sentences, and thus can be detected by a spelling or grammar checker (Pruthi et al., 2019). In contrast, the third attack type only substitutes words with semantically similar words, thus preserving the syntax and semantics

