WORDSWORTH SCORES FOR ATTACKING CNNS AND LSTMS FOR TEXT CLASSIFICATION

Abstract

Black-box attacks on traditional deep learning models trained for text classification target important words in a piece of text in order to change the model's prediction. Current approaches to highlighting important features are time-consuming and require a large number of model queries. We present a simple yet novel method to calculate word importance scores, based on model predictions on single words. These scores, which we call WordsWorth scores, need to be calculated only once for the training vocabulary. They can be used to speed up any attack method that requires word importance, with negligible loss of attack performance. We run experiments on a number of datasets with word-level CNNs and LSTMs trained for sentiment analysis and topic classification, and compare against state-of-the-art baselines. Our results show the effectiveness of our method in attacking these models, with success rates close to the original baselines. We argue that global importance scores act as a very good proxy for word importance in a local context because words are a highly informative form of data. This aligns with the manner in which humans interpret language, where individual words carry well-defined meanings and powerful connotations. We further show that these scores can be used as a debugging tool to interpret a trained model by highlighting relevant words for each class. Additionally, we demonstrate the effect of overtraining on word importance, compare the robustness of CNNs and LSTMs, and explain the transferability of adversarial examples between a CNN and an LSTM using these scores. We highlight the fact that neural networks make highly informative predictions on single words.

1. INTRODUCTION

Deep learning models are vulnerable to carefully crafted adversarial examples. The goal of such an attack is to fool a classifier into giving an incorrect prediction while the perturbed input appears normal to human observers. The problem is important from the point of view of robustness as well as interpretability. Thoroughly analyzing the different kinds of vulnerabilities in neural networks would help us create robust models for deployment in the real world, in addition to shedding some light on the internal workings of these models. In this work, we consider text classification, where finding important words in a body of text is the first step towards malicious modification. For this problem, we propose a novel method for calculating word importance. After training a model, we calculate importance scores over the entire training vocabulary, word by word. We then use these importance scores for black-box attacks and demonstrate that the attack success rate is comparable to the original methods, particularly for CNNs. Since these scores are global and calculated over the training vocabulary, they can also be used as a tool to interpret a trained model. They provide a measure for comparing different architectures and models beyond training and validation accuracy. On a single training dataset, we can compare a small CNN to a large CNN, a CNN to an LSTM, or the word importance distribution of one class against another, as we outline in our experiments section. The motivation for our particular algorithm comes from the fact that, in a piece of text, words and phrases most of the time have a strong influence on their own. This gives us a rationale for evaluating a model on single words, in direct contrast to the leave-one-out technique, which deletes a word from a document and measures its importance by the change in model prediction on the modified input.
Further, we expect a well-trained network to treat a word approximately the same, irrespective of its location in the input, when the surrounding words are removed. Thus a particular word can occur at any position in a 200-word document and its importance will be roughly the same. We expect a well-trained model to exhibit this behaviour, and our experiments confirm this. In summary, our contributions are as follows:
• We propose a simple and efficient method for calculating word importance for attacking traditional deep learning models in the black-box setting.
• We argue that these scores can act as a tool for model interpretation and outline a number of use cases in this context.
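A minimal sketch of the scoring step described above: every vocabulary word is scored with a single model query on the word in isolation. The `predict_proba` interface and the tiny lexicon classifier are illustrative stand-ins, not the models used in the experiments.

```python
# Minimal sketch: score every vocabulary word with one model query on
# the word alone. `predict_proba` is a hypothetical black-box interface
# returning class probabilities for a token list.

def wordsworth_scores(vocab, predict_proba, target_class):
    """One query per word: model confidence on the word in isolation."""
    return {w: predict_proba([w])[target_class] for w in vocab}

# Illustrative stand-in classifier: sentiment from a tiny fixed lexicon.
LEXICON = {"great": 0.9, "awful": 0.1, "movie": 0.5}

def toy_predict_proba(tokens):
    p_pos = sum(LEXICON.get(t, 0.5) for t in tokens) / len(tokens)
    return [1.0 - p_pos, p_pos]  # [negative, positive]

scores = wordsworth_scores(LEXICON.keys(), toy_predict_proba, target_class=1)
```

Because the scores depend only on the training vocabulary, this loop runs once per trained model; afterwards any document can be ranked without further queries.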

2. RELATED WORK

2.1 ADVERSARIAL ATTACKS ON NLP MODELS: The idea of perturbation, whether random or malicious, is rather simple in the image domain, where adding salt-and-pepper noise to images can be enough to fool models. This kind of noise is hard for humans to detect. However, since text data is discrete, perturbations in text are difficult to quantify. Moreover, people easily notice errors in computer-generated text. This places additional constraints on what counts as a successful NLP attack, where a successful attack is one that forces the model to give an incorrect prediction while a human would still make the correct prediction on the input. We limit ourselves to text classification problems, using sentiment analysis and topic classification as examples. We only consider attack scenarios in which specific words in the input are replaced by valid words from the dictionary. Thus we do not consider attacks in which extra information is appended to the input, or in which word replacements purposefully introduce spelling errors. The former take an entirely different approach; the latter introduce errors and do not preserve semantics. In addition, training a neural network to be robust to spelling errors would stop these attacks. Further, we limit ourselves to black-box attacks, where the attacker has no information about model architecture and parameters.

2.2. FIND AND REPLACE ATTACKS ON TEXT CLASSIFICATION

Most attacks on text classification solve the problem in two parts: locating important words in the input, and finding suitable replacements for those words. We only consider attacks in which substitutions are valid words picked from a dictionary, to avoid introducing grammatical errors, and ignore cases where, for example, spelling errors are introduced into important words.
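A generic attack of this two-part kind, driven by precomputed global importance scores, can be sketched as follows. The names `scores` (word to global importance), `substitutes` (word to valid dictionary replacements), `predict_proba`, and the toy lexicon classifier are all hypothetical stand-ins for illustration, not any cited method's exact procedure.

```python
# Sketch of a find-and-replace attack: substitute the globally most
# important words first, keeping the most damaging valid replacement.

def greedy_attack(tokens, true_label, scores, substitutes, predict_proba,
                  max_changes=3):
    """Greedy word substitution guided by precomputed importance scores."""
    tokens = list(tokens)
    order = sorted(range(len(tokens)),
                   key=lambda i: scores.get(tokens[i], 0.0), reverse=True)
    for i in order[:max_changes]:
        best, best_p = tokens[i], predict_proba(tokens)[true_label]
        for sub in substitutes.get(tokens[i], []):
            p = predict_proba(tokens[:i] + [sub] + tokens[i + 1:])[true_label]
            if p < best_p:                 # keep the most damaging substitute
                best, best_p = sub, p
        tokens[i] = best
        if best_p < 0.5:
            return tokens                  # prediction flipped: success
    return None                            # failed within the budget

# Toy setup: lexicon classifier, importance scores, candidate substitutes.
LEXICON = {"great": 0.9, "fine": 0.55, "bad": 0.1, "plot": 0.5}

def toy_predict_proba(tokens):
    p_pos = sum(LEXICON.get(t, 0.5) for t in tokens) / len(tokens)
    return [1.0 - p_pos, p_pos]  # [negative, positive]

adv = greedy_attack(["great", "plot"], true_label=1,
                    scores={"great": 0.9, "plot": 0.5},
                    substitutes={"great": ["fine", "bad"]},
                    predict_proba=toy_predict_proba)
```

The importance ranking is the only piece that differs across the attacks surveyed below; the substitution loop stays essentially the same.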

2.2.1. WHITE BOX ATTACKS

In the white-box setting, where an attacker has full knowledge of the model architecture, gradients serve as a good proxy for word importance. 
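As a toy illustration of this proxy (not any cited method itself), gradient-times-input saliency for a linear bag-of-embeddings classifier reduces to a dot product per word. The embeddings `E`, weights `w`, and vocabulary below are invented; a real white-box attack would obtain the gradient via automatic differentiation on the actual model.

```python
import numpy as np

# Toy gradient-times-input saliency for a linear bag-of-embeddings
# classifier: logit = w . mean(E[t] for t in tokens). E and w are
# random stand-ins for a trained model's parameters.

rng = np.random.default_rng(0)
E = {"great": rng.normal(size=4), "plot": rng.normal(size=4)}
w = rng.normal(size=4)

def gradient_saliency(tokens):
    # d(logit)/d(e_i) = w / n at every position i, so the
    # gradient-times-input score of word t is |(w / n) . E[t]|.
    n = len(tokens)
    return {t: float(abs(np.dot(w / n, E[t]))) for t in tokens}

sal = gradient_saliency(["great", "plot"])
```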



Gong et al. (2018) use gradient-based methods to locate important words. Samanta & Mehta (2017) use gradients to calculate word importance, with linguistic constraints over substitution words. Lei et al. (2019) carry out joint word and sentence attacks, generating sentence paraphrases in the first stage and resorting to greedy word substitutions if the first stage fails. Again, important words are located by the magnitude of the gradient of the word embedding.

2.2.2. BLACK BOX ATTACKS

In the black-box scenario, where gradients are not available, saliency maps for words are calculated through different methods. Yang et al. (2018) provide a greedy algorithm, which we outline in detail in the next section. Li et al. (2016) propose masking each feature with zero padding, using the decrease in the predicted probability as the score of the feature or word, and masking the top-k features as unknown. Alzantot et al. (2018) and Kuleshov et al. (2018) propose variations of genetic algorithms. Kuleshov et al. (2018) replace words one by one until the classifier is misdirected while observing a bound on the
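The masking scheme of Li et al. (2016) contrasts directly with single-word vocabulary scores in query cost. A sketch of such leave-one-out scoring, with a hypothetical `predict_proba` interface and "<pad>" mask token standing in for zero padding:

```python
# Sketch of leave-one-out (masking) importance: a word's score is the
# drop in predicted probability when it is masked out. This costs one
# query per word per document, whereas single-word vocabulary scores
# are computed once per model. `predict_proba` and "<pad>" are
# illustrative assumptions.

def leave_one_out_scores(tokens, predict_proba, label, mask="<pad>"):
    base = predict_proba(tokens)[label]
    return [base - predict_proba(tokens[:i] + [mask] + tokens[i + 1:])[label]
            for i in range(len(tokens))]

LEXICON = {"great": 0.9, "plot": 0.5}   # toy sentiment lexicon

def toy_predict_proba(tokens):
    p_pos = sum(LEXICON.get(t, 0.5) for t in tokens) / len(tokens)
    return [1.0 - p_pos, p_pos]  # [negative, positive]

loo = leave_one_out_scores(["great", "plot"], toy_predict_proba, label=1)
```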

