TRANSFOOL: AN ADVERSARIAL ATTACK AGAINST NEURAL MACHINE TRANSLATION MODELS

Abstract

Deep neural networks have been shown to be vulnerable to small perturbations of their inputs, known as adversarial attacks. In this paper, we consider the particular task of Neural Machine Translation (NMT), where security is often critical. We investigate the vulnerability of NMT models to adversarial attacks and propose a new attack algorithm called TransFool. It builds on a multi-term optimization problem and a gradient projection step to compute adversarial examples that fool NMT models. By integrating the embedding representation of a language model into the proposed attack, we generate fluent adversarial examples in the source language that maintain a high level of semantic similarity with the clean samples and render the attack largely undetectable. Experimental results demonstrate that, for multiple translation tasks and different NMT architectures, our white-box attack can severely degrade the translation quality for more than 60% of the sentences, while the semantic similarity between the original sentence and the adversarial example stays very high. Moreover, we show that the proposed attack is transferable to unknown target models and can fool them quite easily. Finally, based on automatic and human evaluations, our method improves over existing attacks in terms of success rate, semantic similarity, and fluency, in both white-box and black-box settings. Hence, TransFool allows us to better characterize the vulnerability of NMT systems and highlights the need to design strong defense mechanisms and more robust NMT systems for real-life applications.

1. INTRODUCTION

The impressive performance of Deep Neural Networks (DNNs) in different areas such as computer vision (He et al., 2016) and Natural Language Processing (NLP) (Vaswani et al., 2017) has led to their widespread usage in various applications. With such an extensive usage of these models, it is important to analyze their robustness and potential vulnerabilities. In particular, it has been shown that the outputs of these models are susceptible to imperceptible changes in the input, known as adversarial attacks (Szegedy et al., 2014). Adversarial examples, which differ from the original inputs in an imperceptible manner, cause the target model to generate incorrect outputs. If these models are not robust enough to such attacks, they cannot be reliably used in applications with security requirements. To address this issue, many studies have recently been devoted to the effective generation of adversarial examples, the defense against attacks, and the analysis of the vulnerabilities of DNN models (Moosavi-Dezfooli et al., 2016; Madry et al., 2018; Ortiz-Jiménez et al., 2021). The dominant methods to craft imperceptible attacks for continuous data, e.g., audio and image data, are based on gradient computation and various optimization strategies. However, these methods cannot be directly extended to NLP models due to the discrete nature of the tokens in the corresponding representations (i.e., words, subwords, and characters). Another challenge in dealing with textual data is the characterization of the imperceptibility of the adversarial perturbation. The ℓp-norm is widely used for image data to measure imperceptibility, but it does not apply to textual data, where manipulating only one token in a sentence may significantly change the semantics. Moreover, in gradient-based methods, it is challenging to incorporate linguistic constraints in a differentiable manner.
Hence, optimization-based methods are more difficult and less investigated for adversarial attacks against NLP models. Currently, most attacks on textual data are gradient-free and simply based on heuristic word replacement, which may result in sub-optimal performance (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020; Jin et al., 2020; Morris et al., 2020; Guo et al., 2021; Sadrizadeh et al., 2022).

In the literature, adversarial attacks have mainly been studied for text classifiers, and less for other NLP tasks such as Neural Machine Translation (NMT) (Zhang et al., 2020b). In text classifiers, the number of output labels of the model is limited, and the adversary's goal is to mislead the target model into classifying the input into any wrong class (untargeted attack) or a wrong predetermined class (targeted attack). However, in NMT systems, the output of the target model is a sequence of tokens, which is a much larger space than that of a text classifier (Cheng et al., 2020a), and it is probable that the ground-truth translation changes after perturbing the input sequence. Hence, it is important to craft meaning-preserving adversarial sentences with a low impact on the ground-truth translation.

In this paper, we propose TransFool to build meaning-preserving and fluent adversarial attacks against NMT models. We build a new solution to the challenges associated with gradient-based adversarial attacks against textual data. To find an adversarial sentence that is fluent and semantically similar to the input sentence but highly degrades the translation quality of the target model, we propose a multi-term optimization problem over the tokens of the adversarial example. We consider the white-box attack setting, where the adversary has access to the target model and its parameters. White-box attacks are widely studied since they reveal the vulnerabilities of the systems and are used in benchmarks.
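The multi-term objective outlined above can be illustrated with a minimal NumPy sketch. The function names, signs, and weighting hyperparameters `alpha` and `beta` below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def cosine_sim(u, v):
    # Cosine similarity between two embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def total_loss(nmt_loss, clm_loss, emb_adv, emb_clean, alpha=1.0, beta=1.0):
    """Hypothetical combination of the three terms:
    - adversarial term: maximize the NMT loss (so we minimize its negative);
    - similarity term: keep each adversarial token close to the corresponding
      clean token in the LM embedding space (higher similarity -> lower loss);
    - fluency term: the CLM loss of the adversarial sentence."""
    sim_term = -np.mean([cosine_sim(a, c) for a, c in zip(emb_adv, emb_clean)])
    return -nmt_loss + alpha * sim_term + beta * clm_loss
```

Minimizing this total loss trades off degrading the translation (first term) against semantic similarity and fluency (second and third terms).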
To ensure that the generated adversarial examples are imperceptibly similar to the original sentences, we incorporate a Language Model (LM) in our method in two ways. First, we include the loss of a Causal Language Model (CLM) in our optimization problem in order to impose the syntactic correctness of the adversarial example. Second, by working with the embedding representation of LMs, instead of that of the NMT model, we ensure that similar tokens are close to each other in the embedding space (Tenney et al., 2019). This enables the definition of a similarity term between the respective tokens of the clean and adversarial sequences. Hence, we include a similarity constraint in the proposed optimization problem, which uses the LM embeddings. Finally, our optimization contains an adversarial term to maximize the loss of the target NMT model.

The generated adversarial example, i.e., the minimizer of the proposed optimization problem, should consist of meaningful tokens, and hence, the proposed optimization problem should be solved in a discrete space. Using a gradient projection technique, we first perform a gradient descent step in the continuous embedding space and then project the resulting embedding vectors onto valid tokens. In the projection step, we use the LM embedding representation and map the output of the gradient descent step to the nearest meaningful token in the embedding space (with maximum cosine similarity).

We test our method against different NMT models with transformer structures, which are now widely used for their exceptional performance. For different NMT architectures and translation tasks, experiments show that our white-box attack can reduce the BLEU score, a widely used metric for translation quality evaluation (Post, 2018), to half for more than 60% of the sentences, while it maintains a high level of semantic similarity with the clean samples.
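The projection step described above can be sketched as a nearest-neighbor search in cosine similarity over the LM embedding table. This is a minimal NumPy illustration; the embedding matrix and function name are hypothetical placeholders:

```python
import numpy as np

def project_to_nearest_token(z, lm_embeddings):
    """Project a continuous vector z (output of a gradient descent step) onto
    the vocabulary token whose LM embedding has maximum cosine similarity
    with z. lm_embeddings: (vocab_size, d) array, one row per token."""
    z_unit = z / np.linalg.norm(z)
    rows = lm_embeddings / np.linalg.norm(lm_embeddings, axis=1, keepdims=True)
    token_id = int(np.argmax(rows @ z_unit))  # best cosine similarity
    return token_id, lm_embeddings[token_id]
```

In the actual attack, this projection is applied to every token position of the adversarial sentence after each gradient step, so that the optimization iterates always correspond to valid vocabulary tokens.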
Furthermore, we extend TransFool to black-box settings and show that it can fool unknown target models. Overall, automatic and human evaluations show that in both white-box and black-box settings, TransFool outperforms the existing heuristic strategies in terms of success rate, semantic similarity, and fluency. In summary, our contributions are as follows:

• We define a new optimization problem to compute semantics-preserving and fluent attacks against NMT models. The objective function contains several terms: an adversarial loss to maximize the loss of the target NMT model; a similarity term to ensure that the adversarial example is similar to the original sentence; and the loss of a CLM to generate fluent and natural adversarial examples.

• We propose a new strategy to incorporate linguistic constraints in our attack in a differentiable manner. Since LM embeddings provide a meaningful representation of the tokens, we use them instead of the NMT embeddings to compute the similarity between two tokens.

• We design a white-box attack algorithm, TransFool, against NMT models by solving the proposed optimization problem with gradient projection. Our attack, which operates at the token level, is effective against state-of-the-art transformer-based NMT models and outperforms prior works.

• By exploiting the transferability of adversarial attacks to other models, we extend the proposed white-box attack to the black-box setting. Our attack is highly effective even when the target languages of the target NMT model and the reference model are different. To our knowledge, such cross-lingual transfer attacks have not been investigated before.

The rest of the paper is organized as follows. We review the related works in Section 2. In Section 3, we formulate the problem of adversarial attacks against NMT models, and propose an optimization problem to build adversarial attacks. We describe our attack algorithm in Section 4. In Section 5, we

