PRESERVING SEMANTICS IN TEXTUAL ADVERSARIAL ATTACKS

Abstract

Adversarial attacks in NLP challenge the way we look at language models. The goal of such an attack is to modify the input text so that it fools a classifier while preserving the original meaning. Although most existing adversarial attacks claim to satisfy this semantics-preservation constraint, careful scrutiny shows otherwise. We show that the problem lies in the text encoders used to measure the similarity of adversarial examples, specifically in the way they are trained. Unsupervised training makes these encoders more susceptible to problems with antonym recognition. To overcome this, we introduce a simple, fully supervised sentence embedding technique called Semantics-Preserving-Encoder (SPE). Our results show that this solution minimizes the change in meaning of the generated adversarial examples. It also significantly improves the overall quality of adversarial examples, as confirmed by human evaluators. Furthermore, it can be used as a component in any existing attack to speed up its execution while maintaining a similar attack success rate. 1

1. INTRODUCTION

Deep learning models have achieved tremendous success in the NLP domain over the past decade. They are applied in critical areas as diverse as hate speech filtering, moderation of online discussions, and fake news detection, so successful attacks on these models could have a devastating impact. In recent years, many researchers have shown that language models are not as robust as previously thought (Jin et al., 2020; He et al., 2016; Zhao et al., 2018; Szegedy et al., 2014; Kurakin et al., 2016a;b) and that they can be fooled quite easily with so-called adversarial examples, which introduce a small perturbation to the input that is 'imperceptible' to the human eye. For example, in offensive language detection, an offensive input text can be modified in such a way that its meaning is preserved, yet the modified text fools the system into classifying it as non-offensive (Jin et al., 2020). A similar scenario, from the domain of sentiment analysis of movie reviews, is illustrated in Figure 1.

Although adversarial examples can be perceived as a threat, they also help us identify and understand potential weaknesses of language models, and thereby contribute just as well to defense, threat prevention, and model decision making (Ribeiro et al., 2018). Furthermore, when adversarial examples are included in the training data, the robustness of the model and its ability to generalize can be improved (Goodfellow et al., 2014; Zhao et al., 2018).

The imperceptibility of adversarial attacks is easy to define in continuous domains such as audio or vision. In computer vision, for instance, imperceptibility can be expressed as a bound on the pixel distance between the original image and its perturbed version (Chakraborty et al., 2021). The notion is much harder to pin down in discrete domains such as text, where there is no clear analogy and a truly indistinguishable modification cannot exist.
This is why several definitions of a successful adversarial example have been developed specifically for discrete domains such as text (Zhang et al., 2020b). Following Jin et al. (2020), three main requirements must hold for an adversarial attack on text to be successful. Although most adversarial attacks claim to meet these constraints (Gao et al., 2018; Li et al., 2021b; Garg & Ramakrishnan, 2020), careful scrutiny shows otherwise: we observed that many adversarial examples do not preserve the meaning of the text. This is also supported by the similar findings of Morris et al. (2020a), who suggested raising the cosine similarity thresholds and introducing mechanisms such as grammar checks to improve the quality of adversarial examples. However, this comes at the cost of the attack success rate, which drops sharply by more than 70%.

We suggest a different solution that avoids this decline in the attack success rate. The problem appears to lie in the similarity metric itself, whose function is to measure the difference between the original and perturbed sentences. These metrics mostly rely on encoders trained with limited supervision, which makes them more susceptible to problems with antonym recognition: because antonyms occur in similar contexts in the training data, the encoder assumes they are alike. As a result, sentences such as 'This movie is so good' and 'This movie is so bad' are considered similar, as illustrated in the third column of Table 1.

Building on these findings, we propose a new sentence encoder for similarity metrics in textual adversarial attacks called Semantics-Preserving-Encoder (SPE). SPE is trained with full supervision on annotated datasets and should therefore be more robust to the antonym recognition problems that we frequently observed in adversarial examples.
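To make the failure mode concrete, the following minimal sketch shows the threshold test that word-substitution attacks typically apply in encoder space. The embedding vectors and the threshold value are illustrative stand-ins, not outputs of any real encoder: they mimic the situation where an unsupervised encoder places an antonym pair almost on top of each other, so the perturbed sentence passes the check despite reversing the meaning.

```python
import math

def cosine_similarity(u, v):
    # Standard cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def passes_similarity_check(emb_orig, emb_adv, threshold=0.84):
    # Attacks accept a perturbation when encoder-space similarity
    # stays above a fixed threshold (0.84 is illustrative here).
    return cosine_similarity(emb_orig, emb_adv) >= threshold

# Hypothetical encoder outputs: antonyms share contexts in the
# training data, so their sentence embeddings end up nearly identical.
emb_good = [0.90, 0.41, 0.12]   # "This movie is so good"
emb_bad  = [0.88, 0.45, 0.10]   # "This movie is so bad"
print(passes_similarity_check(emb_good, emb_bad))  # True, meaning reversed
```

The check itself is sound; the failure comes entirely from the encoder mapping opposite-meaning sentences to nearby points, which is exactly what SPE is designed to avoid.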
These premises were validated in our experiments, where we compared our solution against some of the most common similarity metrics used in adversarial attacks. The results show that our solution largely reduces the occurrence of text meaning modification and also significantly improves the overall quality of the generated adversarial examples, as confirmed by human evaluators. Furthermore, SPE can be integrated into any existing adversarial attack as a component, enabling much faster execution at a comparable attack success rate. In summary, we consider our main contributions to be as follows:

1. We propose a simple but powerful sentence encoder, SPE, which improves the overall quality of adversarial examples (it minimizes text meaning modification and the antonym recognition problem). SPE can also be used as a component in any existing attack to speed up its execution while maintaining a similar attack success rate.
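One plausible reading of the SPE design described in Figure 1 (a sentence embedded "using an array of supervised classifiers") is that the encoder concatenates the probability outputs of several supervised heads into a single vector. The sketch below illustrates only that concatenation pattern; the classifier names and their outputs are hypothetical toy stand-ins, not the actual trained models used by SPE.

```python
from typing import Callable, List

def spe_style_embed(text: str,
                    classifiers: List[Callable[[str], List[float]]]) -> List[float]:
    # Concatenate each supervised classifier's probability vector
    # into one sentence embedding (our reading of the SPE idea).
    embedding: List[float] = []
    for clf in classifiers:
        embedding.extend(clf(text))
    return embedding

# Hypothetical stand-ins for supervised heads (real SPE uses trained models):
sentiment = lambda t: [0.9, 0.1] if "good" in t.split() else [0.2, 0.8]
formality = lambda t: [0.5, 0.5]  # placeholder second supervised head

print(spe_style_embed("so good", [sentiment, formality]))  # [0.9, 0.1, 0.5, 0.5]
```

Because each head is trained with full supervision on labeled data, an antonym swap that flips a label (e.g. sentiment) moves the concatenated vector substantially, which is what makes the downstream cosine check sensitive to meaning reversal.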



1 The code, datasets and test examples are available at https://github.com/



Figure 1: The typical adversarial example pipeline for a sentiment classification task. The attacked classifier was trained to distinguish between positive and negative movie reviews. The semantic similarity metric is based on our Semantics-Preserving-Encoder (SPE), which embeds sentences using an array of supervised classifiers. In this example, substituting two words simultaneously fools the model into changing its prediction while still passing the semantic similarity test.
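The pipeline in the caption can be sketched as a greedy word-substitution loop: try synonym swaps, keep a candidate only if it flips the classifier's prediction and passes a similarity check. Everything below is a toy stand-in for illustration: the keyword classifier, the tiny synonym table, and the token-overlap similarity function are hypothetical replacements for the neural classifier and SPE-based metric of the real pipeline.

```python
# Toy synonym table (hypothetical; real attacks use embedding neighbors).
SYNONYMS = {"terrible": ["poor", "mediocre"], "good": ["decent", "fine"]}

def toy_classifier(text: str) -> str:
    # Stand-in for the attacked model: simple keyword counting.
    words = set(text.split())
    pos = len(words & {"good", "loved", "enjoyed", "fine"})
    neg = len(words & {"terrible", "poor", "bad"})
    return "positive" if pos >= neg else "negative"

def similarity(a: str, b: str) -> float:
    # Stand-in for the encoder-based cosine check: token Jaccard overlap.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def greedy_attack(text: str, threshold: float = 0.5):
    # Accept the first substitution that flips the prediction
    # while staying above the similarity threshold.
    orig_label = toy_classifier(text)
    words = text.split()
    for i, w in enumerate(words):
        for syn in SYNONYMS.get(w, []):
            cand = " ".join(words[:i] + [syn] + words[i + 1:])
            if (toy_classifier(cand) != orig_label
                    and similarity(text, cand) >= threshold):
                return cand
    return None  # no successful adversarial example found

print(greedy_attack("the movie was terrible"))  # "the movie was mediocre"
```

Here "mediocre" is outside the toy classifier's keyword lists, so the prediction flips from negative to positive even though a human would still read the review as negative, mirroring the one-word-swap failure the paper targets.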

