PRESERVING SEMANTICS IN TEXTUAL ADVERSARIAL ATTACKS

Abstract

Adversarial attacks in NLP challenge how we evaluate language models. The goal of such an attack is to modify the input text so that it fools a classifier while preserving the original meaning of the text. Although most existing adversarial attacks claim to satisfy the constraint of semantics preservation, careful scrutiny shows otherwise. We show that the problem lies in the text encoders used to determine the similarity of adversarial examples, specifically in the way they are trained. Unsupervised training makes these encoders more susceptible to problems with antonym recognition. To overcome this, we introduce a simple, fully supervised sentence embedding technique called Semantics-Preserving Encoder (SPE). Our results show that this solution minimizes the drift in meaning of the generated adversarial examples. It also significantly improves the overall quality of adversarial examples, as confirmed by human evaluators. Furthermore, it can be used as a component in any existing attack to speed up its execution while maintaining a similar attack success rate.¹

1. INTRODUCTION

Deep learning models have achieved tremendous success in the NLP domain over the past decade. They are applied in critical areas as diverse as hate speech filtering, moderation of online discussions, and fake news detection, so successful attacks on these models could have a devastating impact. In recent years, many researchers have highlighted that language models are not as robust as previously thought (Jin et al., 2020; He et al., 2016; Zhao et al., 2018; Szegedy et al., 2014; Kurakin et al., 2016a;b) and that they can be fooled quite easily with so-called adversarial examples, which introduce a small perturbation to the input data that is 'imperceptible' to the human eye. For example, in the domain of offensive language detection, an offensive input text can be modified in such a way that its meaning is preserved, yet the modified text fools the system into classifying it as non-offensive (Jin et al., 2020). A similar scenario in the domain of sentiment analysis of movie reviews is illustrated in Figure 1. Although adversarial examples can be perceived as a threat, they also help us identify and understand potential weaknesses in language models, and thereby contribute just as well to defense, threat prevention, and the decision making of the models (Ribeiro et al., 2018). Furthermore, when adversarial examples are included in the training data, the overall robustness of the model and its ability to generalize can be improved (Goodfellow et al., 2014; Zhao et al., 2018).

Regarding the imperceptibility of adversarial attacks, it is easy to define in continuous domains such as audio or vision. In computer vision, imperceptibility can be expressed as a bound on the pixel distance between the original image and its perturbed version (Chakraborty et al., 2021). The term is much harder to grasp in discrete domains such as text, where there is no clear analogy and where a truly indistinguishable modification simply cannot exist.
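In text, attacks instead fall back on semantic similarity between sentence embeddings of the original and perturbed input. The following minimal sketch, using toy hand-picked word vectors rather than any real encoder, illustrates why this proxy can fail: mean-pooled distributional embeddings may rate an antonym substitution as highly similar, since antonyms tend to occur in nearly identical contexts.

```python
import math

# Toy distributional word vectors (illustrative values, not taken from a
# real model). In unsupervised embeddings, antonyms like "good"/"bad"
# often end up close together because they share contexts.
VECS = {
    "the":   [0.1, 0.0, 0.2],
    "movie": [0.4, 0.3, 0.1],
    "was":   [0.1, 0.1, 0.1],
    "good":  [0.8, 0.6, 0.1],
    "bad":   [0.7, 0.6, 0.2],   # deliberately close to "good"
}

def embed(sentence):
    """Mean-pool word vectors into a sentence embedding."""
    words = sentence.lower().split()
    dims = len(next(iter(VECS.values())))
    totals = [0.0] * dims
    for w in words:
        for i, v in enumerate(VECS[w]):
            totals[i] += v
    return [t / len(words) for t in totals]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

sim = cosine(embed("the movie was good"), embed("the movie was bad"))
print(f"similarity: {sim:.3f}")  # high score despite opposite meaning
```

With these toy vectors the similarity exceeds 0.99, even though the antonym swap flips the sentiment label a classifier should assign; this is exactly the failure mode the paper attributes to unsupervised encoders.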
This is why several definitions of a successful adversarial example have been developed specifically for discrete domains such as text (Zhang et al., 2020b). Following Jin et al. (2020), we can identify three main requirements for an adversarial attack on text to be successful:



¹ The code, datasets, and test examples are available at https://github.com/

