TRANSFOOL: AN ADVERSARIAL ATTACK AGAINST NEURAL MACHINE TRANSLATION MODELS

Abstract

Deep neural networks have been shown to be vulnerable to small perturbations of their inputs known as adversarial attacks. In this paper, we consider the particular task of Neural Machine Translation (NMT), where security is often critical. We investigate the vulnerability of NMT models to adversarial attacks and propose a new attack algorithm called TransFool. It builds on a multi-term optimization problem and a gradient projection step to compute adversarial examples that fool NMT models. By integrating the embedding representation of a language model in the proposed attack, we generate fluent adversarial examples in the source language that maintain a high level of semantic similarity with the clean samples and render the attack largely undetectable. Experimental results demonstrate that, for multiple translation tasks and different NMT architectures, our white-box attack can severely degrade the translation quality for more than 60% of the sentences while the semantic similarity between the original sentence and the adversarial example stays very high. Moreover, we show that the proposed attack is transferable to unknown target models and can fool those quite easily. Finally, based on automatic and human evaluations, our method leads to improvement in terms of success rate, semantic similarity, and fluency compared to the existing attacks both in white-box and black-box settings. Hence, TransFool makes it possible to better characterize the vulnerability of NMT systems and underlines the need to design strong defense mechanisms and more robust NMT systems for real-life applications.

1. INTRODUCTION

The impressive performance of Deep Neural Networks (DNNs) in different areas such as computer vision (He et al., 2016) and Natural Language Processing (NLP) (Vaswani et al., 2017) has led to their widespread usage in various applications. With such an extensive usage of these models, it is important to analyze their robustness and potential vulnerabilities. In particular, it has been shown that the outputs of these models are susceptible to imperceptible changes in the input, known as adversarial attacks (Szegedy et al., 2014). Adversarial examples, which differ from the original inputs in an imperceptible manner, cause the target model to generate incorrect outputs. If these models are not robust enough to these attacks, they cannot be reliably used in applications with security requirements. To address this issue, many studies have been recently devoted to the effective generation of adversarial examples, the defense against attacks, and the analysis of the vulnerabilities of DNN models (Moosavi-Dezfooli et al., 2016; Madry et al., 2018; Ortiz-Jiménez et al., 2021). The dominant methods to craft imperceptible attacks for continuous data, e.g., audio and image data, are based on gradient computing and various optimization strategies. However, these methods cannot be directly extended to NLP models due to the discrete nature of the tokens in the corresponding representations (i.e., words, subwords, and characters). Another challenge in dealing with textual data is the characterization of the imperceptibility of the adversarial perturbation. The $\ell_p$ norm is highly utilized in image data to measure imperceptibility but it does not apply to textual data where manipulating only one token in a sentence may significantly change the semantics. Moreover, in gradient-based methods, it is challenging to incorporate linguistic constraints in a differentiable manner.
Hence, optimization-based methods are more difficult and less investigated for adversarial attacks against NLP models. Currently, most attacks in textual data are gradient-free and simply based on heuristic word replacement, which may result in sub-optimal performance (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020; Jin et al., 2020; Morris et al., 2020; Guo et al., 2021; Sadrizadeh et al., 2022) . In the literature, adversarial attacks have been mainly studied for text classifiers, but less for other NLP tasks such as Neural Machine Translation (NMT) (Zhang et al., 2020b) . In text classifiers, the number of output labels of the model is limited, and the adversary's goal is to mislead the target model to classify the input into any wrong class (untargeted attack) or a wrong predetermined class (targeted attack). However, in NMT systems, the output of the target model is a sequence of tokens, which is a much larger space than that of a text classifier (Cheng et al., 2020a) , and it is probable that the ground-truth translation changes after perturbing the input sequence. Hence, it is important to craft meaning-preserving adversarial sentences with a low impact on the ground-truth translation. In this paper, we propose TransFool to build meaning-preserving and fluent adversarial attacks against NMT models. We build a new solution to the challenges associated with gradient-based adversarial attacks against textual data. To find an adversarial sentence that is fluent and semantically similar to the input sentence but highly degrades the translation quality of the target model, we propose a multi-term optimization problem over the tokens of the adversarial example. We consider the white-box attack setting, where the adversary has access to the target model and its parameters. White-box attacks are widely studied since they reveal the vulnerabilities of the systems and are used in benchmarks. 
To ensure that the generated adversarial examples are imperceptibly similar to the original sentences, we incorporate a Language Model (LM) in our method in two ways. First, we consider the loss of a Causal Language Model (CLM) in our optimization problem in order to impose the syntactic correctness of the adversarial example. Second, by working with the embedding representation of LMs, instead of the NMT model, we ensure that similar tokens are close to each other in the embedding space (Tenney et al., 2019) . It enables the definition of a similarity term between the respective tokens of the clean and adversarial sequences. Hence, we include a similarity constraint in the proposed optimization problem, which uses the LM embeddings. Finally, our optimization contains an adversarial term to maximize the loss of the target NMT model. The generated adversarial example, i.e., the minimizer of the proposed optimization problem, should consist of meaningful tokens, and hence, the proposed optimization problem should be solved in a discrete space. By using a gradient projection technique, we first consider the continuous space of the embedding space and perform a gradient descent step and then, we project the resultant embedding vectors to the most similar valid token. In the projection step, we use the LM embedding representation and project the output of the gradient descent step into the nearest meaningful token in the embedding space (with maximum cosine similarity). We test our method against different NMT models with transformer structures, which are now widely used for their exceptional performance. For different NMT architectures and translation tasks, experiments show that our white-box attack can reduce the BLEU score, a widely-used metric for translation quality evaluation (Post, 2018) , to half for more than 60% of the sentences while it maintains a high level of semantic similarity with the clean samples. 
Furthermore, we extend TransFool to black-box settings and show that it can fool unknown target models. Overall, automatic and human evaluations show that in both white-box and black-box settings, TransFool outperforms the existing heuristic strategies in terms of success rate, semantic similarity, and fluency. In summary, our contributions are as follows:

• We define a new optimization problem to compute semantic-preserving and fluent attacks against NMT models. The objective function contains several terms: adversarial loss to maximize the loss of the target NMT model; a similarity term to ensure that the adversarial example is similar to the original sentence; and loss of a CLM to generate fluent and natural adversarial examples.
• We propose a new strategy to incorporate linguistic constraints in our attack in a differentiable manner. Since LM embeddings provide a meaningful representation of the tokens, we use them instead of the NMT embeddings to compute the similarity between two tokens.
• We design a white-box attack algorithm, TransFool, against NMT models by solving the proposed optimization problem with gradient projection. Our attack, which operates at the token level, is effective against state-of-the-art transformer-based NMT models and outperforms prior works.
• By using the transferability of adversarial attacks to other models, we extend the proposed white-box attack to the black-box setting. Our attack is highly effective even when the target languages of the target NMT model and the reference model are different. To our knowledge, this type of transfer attack, cross-lingual, has not been investigated.

The rest of the paper is organized as follows. We review the related works in Section 2. In Section 3, we formulate the problem of adversarial attacks against NMT models, and propose an optimization problem to build adversarial attacks. We describe our attack algorithm in Section 4.
In Section 5, we discuss the experimental results and evaluate our algorithm against different transformer models and translation tasks. Moreover, we evaluate our attack in black-box settings and show that TransFool has very good transfer properties. Finally, the paper is concluded in Section 6.

2. RELATED WORK

Machine translation, an important task in NLP, is the task of automatically converting a sequence of words in a source language to a sequence of words in a target language (Bahdanau et al., 2015). By using DNN models, NMT systems are reaching exceptional performance, which has resulted in their usage in a wide variety of areas, especially in safety- and security-sensitive applications. But any faulty output of NMT models may result in irreparable incidents in real-world applications. Hence, we need to better understand the vulnerabilities of NMT models to perturbations of input samples, in particular to adversarial examples, to ensure security of applications and robustness of such models. Adversarial attacks against NMT systems have been studied in recent years. First, Belinkov & Bisk (2018) show that character-level NMT models are highly vulnerable to character manipulations such as typos in a black-box setting. Similarly, Ebrahimi et al. (2018a) investigate the robustness of character-level NMT models. They propose a white-box adversarial attack based on HotFlip (Ebrahimi et al., 2018b) and greedily change the important characters to decrease the translation quality (untargeted attack) or mute/push a word in the translation (targeted attack). However, character-level manipulations can be easily detected. To circumvent this issue, many of the adversarial attacks against NMT models are rather based on word replacement. Cheng et al. (2019) propose a white-box attack where they first select random words of the input sentence and replace them with a similar word. In particular, in order to limit the search space, they find some candidates with the help of a language model and choose the token that aligns best with the gradient of the adversarial loss to cause more damage to the translation. Michel et al. (2019) and Zhang et al. (2021) find important words in the sentence and replace them with a neighbor word in the embedding space to create adversarial examples.
However, these methods use heuristic strategies which may result in sub-optimal performance. There are also some other types of attacks against NMT models in the literature. In (Wallace et al., 2020) , a new type of attack, i.e., universal adversarial attack, is proposed, which consists of a single snippet of text that can be added to any input sentence to mislead the NMT model. However, the added phrase is meaningless, hence easily detectable. Cheng et al. (2020a) propose Seq2Sick, a targeted white-box attack against NMT models. They introduce an optimization problem and solve it by gradient projection. The proposed optimization problem contains an adversarial loss and a group lasso term to ensure that only a few words of the sentence are modified. Although they have a projection step to the nearest embedding vector, they use the NMT embeddings, which may not preserve semantic similarity. Other types of attacks against NMT models with different threat models and purposes have also been investigated in the literature. Some papers focus on making NMT models robust to perturbation to the inputs (Cheng et al., 2018; 2020b; Tan et al., 2021) . Some other papers use adversarial attacks to enhance the NMT models in some aspects, such as word sense disambiguation (Emelin et al., 2020) , robustness to subword segmentation (Park et al., 2020) , and robustness of unsupervised NMT (Yu et al., 2021) . In (Xu et al., 2021; Wang et al., 2021) , the data poisoning attacks against NMT models are studied. Another type of attack whose purpose is to change multiple words while ensuring that the output of the NMT model remains unchanged is explored in (Chaturvedi et al., 2019; 2021) . Another attack approach is presented in (Cai et al., 2021) , where the adversary uses the hardware faults of systems to fool NMT models. 
In summary, most of the existing adversarial attacks against NMT models are easily detectable, since they either rely on character manipulation or use the NMT embedding space to find similar tokens. Also, heuristic strategies based on word replacement are likely to have sub-optimal performance. Finally, none of these works studies the transferability of the attacks to black-box settings. We introduce TransFool to craft effective and fluent adversarial sentences which are similar to the original ones.

3. OPTIMIZATION PROBLEM

In this section, we present our new formulation for generating adversarial examples against NMT models, along with the different terms that form our optimization problem.

Adversarial Attack. Consider $\mathcal{X}$ to be the source language space and $\mathcal{Y}$ to be the target language space. The NMT model $f : \mathcal{X} \to \mathcal{Y}$ generally has an encoder-decoder structure (Bahdanau et al., 2015; Vaswani et al., 2017) and aims to maximize the translation probability $p(y_{ref} \mid x)$, where $x \in \mathcal{X}$ is the input sentence in the source language and $y_{ref} \in \mathcal{Y}$ is the ground-truth translation in the target language. To process textual data, each sentence is decomposed into a sequence of tokens. Therefore, the input sentence $x = x_1 x_2 \dots x_k$ is split into a sequence of $k$ tokens, where $x_i$ is a token from the vocabulary set $\mathcal{V}_{\mathcal{X}}$ of the NMT model, which contains all the tokens from the source language. For each token in the translated sentence $y_{ref} = y_{ref,1}, \dots, y_{ref,l}$, the NMT model generates a probability vector over the target language vocabulary set $\mathcal{V}_{\mathcal{Y}}$ by applying a softmax function to the decoder output.

The adversary is looking for an adversarial sentence $x'$ in the source language, tokenized into a sequence of $k$ tokens $x' = x'_1 x'_2 \dots x'_k$, that fools the target NMT model, i.e., the translation of the adversarial example $f(x')$ is far from the true translation. However, the adversarial example $x'$ and the original sentence $x$ should be imperceptibly close so that the ground-truth translation of the adversarial example stays similar to $y_{ref}$.

As is common in the NMT models (Vaswani et al., 2017; Junczys-Dowmunt et al., 2018; Tang et al., 2020), to feed the discrete sequence of tokens into the NMT model, each token is converted to a continuous vector, known as an embedding vector, using a lookup table. In particular, let $emb(\cdot)$ be the embedding function that maps the input token $x_i$ to the continuous embedding vector $emb(x_i) = e_i \in \mathbb{R}^m$, where $m$ is the embedding dimension of the target NMT model. Therefore, the input of the NMT model is a sequence of embedding vectors representing the tokens of the input sentence, i.e., $e_x = [e_1, e_2, \dots, e_k] \in \mathbb{R}^{(k \times m)}$. In the same manner, $e_{x'} = [e'_1, e'_2, \dots, e'_k] \in \mathbb{R}^{(k \times m)}$ is defined for the adversarial example.

To generate an adversarial example for a given input sentence, we introduce an optimization problem with respect to the embedding vectors of the adversarial sentence $e_{x'}$. Our optimization problem is composed of multiple terms: an adversarial loss, a similarity constraint, and the loss of a language model. The adversarial loss causes the target NMT model to generate a faulty translation. Moreover, with a language model loss and a similarity constraint, we force the generated adversarial example to be a fluent sentence and also semantically similar to the original sentence, respectively. The proposed optimization problem, which finds the adversarial example $x'$ from its embedding representation $e_{x'}$ by using a lookup table, is defined as follows:

$$x' \leftarrow \underset{e'_i \in \mathcal{E}_{\mathcal{V}_{\mathcal{X}}}}{\arg\min} \; \left[ \mathcal{L}_{Adv} + \alpha \mathcal{L}_{Sim} + \beta \mathcal{L}_{LM} \right], \tag{1}$$

where $\alpha$ and $\beta$ are the hyperparameters that control the relative importance of each term. Moreover, we call the continuous space of the embedding representations the embedding space and denote it by $\mathcal{E}$, and we denote by $\mathcal{E}_{\mathcal{V}_{\mathcal{X}}}$ the discrete subspace of $\mathcal{E}$ containing the embedding representation of every token in the source language vocabulary set. We now discuss the different terms of the optimization function in detail.

Adversarial Loss. In order to create an adversarial example whose translation is far away from the reference translation $y_{ref}$, we try to maximize the training loss of the target NMT model.
Since the NMT models are trained to generate the next token of the translation given the translation up until that token, we are looking for the adversarial example that maximizes the probability of wrong translation (i.e., minimizes the probability of correct translation) for the $i$-th token, given that the NMT model has produced the correct translation up to step $(i-1)$:

$$\mathcal{L}_{Adv} = \frac{1}{l} \sum_{i=1}^{l} \log\big(p_f(y_{ref,i} \mid e_{x'}, \{y_{ref,1}, \dots, y_{ref,(i-1)}\})\big), \tag{2}$$

where $p_f(y_{ref,i} \mid e_{x'}, \{y_{ref,1}, \dots, y_{ref,(i-1)}\})$ is the probability that the NMT model assigns to the reference token $y_{ref,i}$. Its negative logarithm is the cross entropy between the token distribution predicted by the NMT model and the delta distribution on the token $y_{ref,i}$, which is one for the correct translated token, $y_{ref,i}$, and zero otherwise. By minimizing $\log(p_f(\cdot))$, normalized by the sentence length $l$, we force the output probability vector of the NMT model to differ from the delta distribution on the token $y_{ref,i}$, which may cause the predicted translation to be wrong.

Similarity Constraint. To ensure that the generated adversarial example is similar to the original sentence, we need to add a similarity constraint to our optimization problem. It has been shown that the embedding representation of a language model captures the semantics of the tokens (Tenney et al., 2019; Shavarani & Sarkar, 2021). Suppose that the embedding representation by a language model of the original sentence (which may differ from the NMT embedding representation $e_x$) is $v_x = [v_1, v_2, \dots, v_k] \in \mathbb{R}^{(k \times n)}$, where $n$ is the embedding dimension of the language model. Likewise, let $v_{x'}$ denote the sequence of LM embedding vectors of the tokens of the adversarial example. We can define the distance between the $i$-th tokens of the original and the adversarial sentences by computing the cosine distance between their corresponding LM embedding vectors:

$$\forall i \in \{1, \dots, k\} : \quad r_i = 1 - \frac{v_i^{\top} v'_i}{\|v_i\|_2 \, \|v'_i\|_2}. \tag{3}$$

The cosine distance is zero if the two tokens are the same and it has larger values for two unrelated tokens. We want the adversarial sentence to differ from the original sentence in only a few tokens. Therefore, the cosine distance between most of the tokens in the original and adversarial sentence should be zero, which causes the cosine distance vector $[r_1, r_2, \dots, r_k]$ to be sparse. To ensure the sparsity of the cosine distance vector, instead of the $\ell_0$ norm, which is not differentiable, we can define the similarity constraint as the $\ell_1$ norm relaxation of the cosine distance vector normalized to the length of the sentence:

$$\mathcal{L}_{Sim} = \frac{1}{k} \sum_{i=1}^{k} \left( 1 - \frac{v_i^{\top} v'_i}{\|v_i\|_2 \, \|v'_i\|_2} \right). \tag{4}$$

Language Model Loss. Causal language models are trained to maximize the probability of a token given the previous tokens. Hence, we can use the loss of a CLM, i.e., the negative log-probability, as a rough and differentiable measure for the fluency of the generated adversarial sentence. The loss of a CLM, which is normalized to the sentence length, is as follows:

$$\mathcal{L}_{LM} = -\frac{1}{k} \sum_{i=1}^{k} \log\big(p_g(v'_i \mid v'_1, \dots, v'_{(i-1)})\big), \tag{5}$$

where $g$ is a CLM, and $p_g(v'_i \mid v'_1, \dots, v'_{(i-1)})$ is the probability that the language model assigns to the $i$-th token of the adversarial example. Its negative logarithm is the cross entropy between the token distribution predicted by the language model and the delta distribution on the token $v'_i$, which is one for the corresponding token in the adversarial example, $v'_i$, and zero otherwise.

To generate adversarial examples against a target NMT model, we propose to solve the optimization problem (1), which contains an adversarial loss term, a similarity constraint, and a CLM loss.

We now introduce our algorithm for generating adversarial examples against NMT models. The block diagram of our proposed attack is presented in Figure 1. We are looking for an adversarial example with tokens in the vocabulary set $\mathcal{V}_{\mathcal{X}}$ and the corresponding embedding vectors in the subspace $\mathcal{E}_{\mathcal{V}_{\mathcal{X}}}$. Hence, the optimization problem (1) is discrete.
The high-level idea of our algorithm is to use gradient projection to solve equation 1 in the discrete subspace $\mathcal{E}_{\mathcal{V}_{\mathcal{X}}}$.
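
The three terms of the objective in equation 1 can be illustrated with a minimal sketch, given here in pure Python for readability. It assumes the per-token probabilities of the NMT model and of the CLM have already been computed; the function names are hypothetical and are not taken from the authors' implementation, and a real attack would compute these terms with an automatic differentiation framework so that gradients with respect to the embeddings are available. The default values of alpha and beta are the ones reported in Section 5.1.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """Cosine distance 1 - u.v / (|u| |v|); assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def adversarial_loss(ref_token_probs: Sequence[float]) -> float:
    """Mean log-probability that the NMT model gives to the reference tokens.

    Minimizing this value lowers the probability of the correct translation.
    """
    return sum(math.log(p) for p in ref_token_probs) / len(ref_token_probs)


def similarity_loss(v_clean: Sequence[Vector], v_adv: Sequence[Vector]) -> float:
    """Mean token-wise cosine distance between clean and adversarial LM embeddings."""
    distances = [cosine_distance(a, b) for a, b in zip(v_clean, v_adv)]
    return sum(distances) / len(distances)


def lm_loss(adv_token_probs: Sequence[float]) -> float:
    """Mean negative log-probability of the adversarial tokens under the CLM."""
    return -sum(math.log(p) for p in adv_token_probs) / len(adv_token_probs)


def total_objective(
    ref_token_probs: Sequence[float],
    v_clean: Sequence[Vector],
    v_adv: Sequence[Vector],
    adv_token_probs: Sequence[float],
    alpha: float = 20.0,
    beta: float = 1.8,
) -> float:
    """Weighted sum of the adversarial, similarity, and language model terms."""
    return (
        adversarial_loss(ref_token_probs)
        + alpha * similarity_loss(v_clean, v_adv)
        + beta * lm_loss(adv_token_probs)
    )
```

Note that the adversarial term enters without a minus sign, because a lower log-probability of the reference tokens is what the attack wants, whereas the language model term is negated, because a higher probability of the adversarial tokens indicates a more fluent sentence.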

4. TRANSFOOL ATTACK ALGORITHM

The objective function of equation 1 is a function of NMT and LM embedding representations of the adversarial example, $e_{x'}$ and $v_{x'}$, respectively. Since we aim to minimize the objective with respect to $e_{x'}$, we need to find a transformation between the embedding space of the language model and that of the target NMT model. To this aim, as depicted in Figure 1, we propose to replace the embedding layer of a pre-trained language model with a Fully Connected (FC) layer, which gets the embedding vectors of the NMT model as its input. Then, we train the language model and the FC layer simultaneously with the causal language modeling objective. Therefore, we can compute the LM embedding vectors as a function of the NMT embedding vectors: $v_i = FC(e_i)$, where $FC \in \mathbb{R}^{m \times n}$ is the trained FC layer.

Algorithm 1
    Output: $x'$: Generated adversarial example
    initialization:
        $s \leftarrow$ empty set, $itr \leftarrow 0$
        $thr \leftarrow \text{BLEU}(f(e_x), y_{ref}) \times \lambda$
        $\forall i \in \{1, \dots, k\}$: $e_{g,i}, e_{p,i} \leftarrow e_i$
    while $itr < K$ do
        $itr \leftarrow itr + 1$
        Step 1: Gradient descent in the continuous embedding space:
            $e_g \leftarrow e_g - \gamma \cdot \nabla_{e_{x'}} (\mathcal{L}_{Adv} + \alpha \mathcal{L}_{Sim} + \beta \mathcal{L}_{LM})$
            $v_g \leftarrow FC(e_g)$
        Step 2: Projection to the discrete subspace $\mathcal{E}_{\mathcal{V}_{\mathcal{X}}}$ and update if the sentence is new:
            [...]

The pseudo-code of our attack can be found in Algorithm 1. In more detail, we first convert the discrete tokens of the sentence to continuous embedding vectors of the target NMT model, then we use the FC layer to compute the embedding representations of the tokens by the language model. Afterwards, we consider the continuous relaxation of the optimization problem, which means that we assume that the embedding vectors are in the continuous embedding space $\mathcal{E}$ instead of $\mathcal{E}_{\mathcal{V}_{\mathcal{X}}}$. In each iteration of the algorithm, we first update the sequence of embedding vectors $e_{x'}$ in the opposite direction of the gradient (gradient descent). Let us denote the output of the gradient descent step for the $i$-th token by $e_{g,i}$.
Then we project the resultant embedding vectors, which are not necessarily in $\mathcal{E}_{\mathcal{V}_{\mathcal{X}}}$, to the nearest token in the vocabulary set $\mathcal{V}_{\mathcal{X}}$. Since the distance in the embedding space of the LM represents the relationship between the tokens, we use the LM embedding representations with the cosine similarity metric in the projection step to find the most similar token in the vocabulary. We can apply the trained fully connected layer $FC$ to find the LM embedding representations: $v_g = FC(e_g)$. Hence, the projected NMT embedding vector, $e_{p,i}$, for the $i$-th token is:

$$\forall i \in \{1, \dots, k\} : \quad e_{p,i} = \underset{e \in \mathcal{E}_{\mathcal{V}_{\mathcal{X}}}}{\arg\max} \; \frac{FC(e)^{\top} v_{g,i}}{\|FC(e)\|_2 \, \|v_{g,i}\|_2}. \tag{6}$$

However, due to the discrete nature of data, by applying the projection step in every iteration of the algorithm, we may face an undesirable situation where the algorithm gets stuck in a loop of previously computed steps. In order to circumvent this issue, we only update the embedding vectors by the output of the projection step if the projected sentence has not been generated before.

We perform the gradient descent and projection steps iteratively until a maximum number of iterations is reached, or the translation quality of the adversarial example relative to the original translation quality is less than a threshold. To evaluate the translation quality, we use the BLEU score, which is a widely used metric in the literature:

$$\frac{\text{BLEU}(f(e_{x'}), y_{ref})}{\text{BLEU}(f(e_x), y_{ref})} \leq \lambda. \tag{7}$$
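
The projection step and the check for previously generated sentences can be sketched as follows. This is an illustrative toy version in pure Python, not the authors' code: the names `project_to_vocab` and `update_if_new` are hypothetical, the vocabulary is a small dictionary from tokens to NMT embedding vectors, and `fc` stands in for the trained FC layer as any callable that maps an NMT embedding to an LM embedding.

```python
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity u.v / (|u| |v|); assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def project_to_vocab(
    e_g: Sequence[Vector],
    vocab_embeddings: Dict[str, Vector],
    fc: Callable[[Vector], Vector],
) -> List[str]:
    """Map each updated NMT embedding to the vocabulary token whose LM-space
    representation has the highest cosine similarity with it."""
    # LM-space representation of every vocabulary token, computed once.
    lm_vocab = {token: fc(e) for token, e in vocab_embeddings.items()}
    projected = []
    for e in e_g:
        v = fc(e)  # LM-space representation of the gradient descent output
        best = max(lm_vocab, key=lambda token: cosine_similarity(lm_vocab[token], v))
        projected.append(best)
    return projected


def update_if_new(projected: List[str], seen: Set[Tuple[str, ...]]) -> bool:
    """Return True and record the sentence if it has not been generated before."""
    key = tuple(projected)
    if key in seen:
        return False
    seen.add(key)
    return True
```

For example, with a two-token vocabulary and an identity map in place of the FC layer, an embedding close to the first token's vector is projected back onto that token. The exhaustive search over the vocabulary is only meant to make the arg max explicit; a practical implementation would compute all similarities with one matrix product.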

5. EXPERIMENTS

In this section, we first discuss our experimental setup, and then we evaluate TransFool against different models and translation tasks, both in white-box and black-box settings.

5.1. EXPERIMENTAL SETUP

We conduct experiments on the English-French (En-Fr), English-German (En-De), and English-Chinese (En-Zh) translation tasks. We use the test set of WMT14 (Bojar et al., 2014) for the En-Fr and En-De tasks, and the test set of OPUS-100 (Zhang et al., 2020a) for the En-Zh task. Some statistics of these datasets are presented in Appendix A. We evaluate TransFool against transformer-based NMT models. To verify that our attack is effective against various model architectures, we attack the HuggingFace implementation of the Marian NMT models (Junczys-Dowmunt et al., 2018) and the mBART50 multilingual NMT model (Tang et al., 2020). As explained in Section 4, the similarity constraint and the LM loss of the proposed optimization problem require an FC layer and a CLM. To this aim, for each NMT model, we train an FC layer and a CLM (with GPT-2 structure (Radford et al., 2019)) on the WikiText-103 dataset. We note that the input of the FC layer is the target NMT embedding representation of the input sentence. To find the minimizer of our optimization problem (1), we use the Adam optimizer (Kingma & Ba, 2014) with step size γ = 0.016. Moreover, we set the maximum number of iterations to 500. Our algorithm has three parameters: coefficients α and β in the optimization function (1), and the relative BLEU score ratio λ in the stopping criterion (7). We set λ = 0.4, β = 1.8, and α = 20. We chose these parameters experimentally according to the ablation study, which is available in Appendix B, in order to optimize the performance in terms of success rate, semantic similarity, and fluency. We compare our attack with (Michel et al., 2019), which is a white-box untargeted attack against NMT models. We only consider one of their attacks, called kNN, which substitutes some words with their neighbors in the embedding space; the other attack considers swapping the characters, which is too easy to detect.
We also adapted Seq2Sick (Cheng et al., 2020a), a targeted attack against NMT models based on an optimization problem in the NMT embedding space, to our untargeted setting. For evaluation, we report different performance metrics: (1) Attack Success Rate (ASR), which measures the rate of successful adversarial examples. Similar to (Ebrahimi et al., 2018a), we define the adversarial example as successful if the BLEU score of its translation is less than half of the BLEU score of the original translation. (2) Relative decrease of translation quality, by measuring the translation quality in terms of BLEU score and chrF (Popović, 2015). We denote these two metrics by RDBLEU and RDchrF, respectively. We choose to compute the relative decrease in translation quality so that scores are comparable across different models and datasets (Michel et al., 2019). (3) Semantic Similarity (Sim.), which is computed between the original and adversarial sentences and commonly approximated by the universal sentence encoder (Yang et al., 2020). (4) Perplexity score (Perp.), which is a measure of the fluency of the adversarial example computed with the perplexity score of GPT-2 (large). (5) Token Error Rate (TER), which measures the imperceptibility by computing the rate of tokens modified by an adversarial attack.
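
Three of these metrics reduce to simple arithmetic once the translation scores are available, which the following sketch makes concrete. It is an illustration under stated assumptions rather than the evaluation code of the paper: the function names are hypothetical, the relative decrease is taken to be (original score minus adversarial score) divided by the original score, and the token error rate is computed position by position on two token sequences of equal length.

```python
from typing import Sequence


def attack_success_rate(
    original_bleu: Sequence[float], adversarial_bleu: Sequence[float]
) -> float:
    """Fraction of sentences whose adversarial BLEU is below half the original BLEU."""
    successes = sum(
        1 for orig, adv in zip(original_bleu, adversarial_bleu) if adv < 0.5 * orig
    )
    return successes / len(original_bleu)


def relative_decrease(original_score: float, adversarial_score: float) -> float:
    """Relative drop of a quality score (e.g. BLEU or chrF).

    Assumed form: (original - adversarial) / original.
    """
    if original_score == 0:
        raise ValueError("relative decrease is undefined for a zero original score")
    return (original_score - adversarial_score) / original_score


def token_error_rate(
    original_tokens: Sequence[str], adversarial_tokens: Sequence[str]
) -> float:
    """Fraction of positions whose token was changed (equal-length sequences assumed)."""
    if len(original_tokens) != len(adversarial_tokens):
        raise ValueError("this sketch only handles sequences of equal length")
    changed = sum(1 for a, b in zip(original_tokens, adversarial_tokens) if a != b)
    return changed / len(original_tokens)
```

As a small worked case, original BLEU scores of 40, 30, 20, and 10 with adversarial scores of 10, 20, 9, and 10 give two successful attacks out of four, i.e. an ASR of 0.5, and a drop from 40 to 10 corresponds to a relative decrease of 0.75.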

5.2. RESULTS OF THE WHITE-BOX ATTACK

Now we evaluate TransFool in comparison to kNN and Seq2Sick against different NMT models. Table 1 shows the results in terms of different evaluation metrics. Overall, our attack is able to decrease the BLEU score of the target model to less than half of the BLEU score of the original translation for more than 60% of the sentences for all tasks and models (except for the En-Zh mBART50 model, where ASR is 57.50%). Also, in all cases, semantic similarity is more than 0.83, which shows that our attack can maintain a high level of semantic similarity with the clean sentences. In comparison to the baselines, TransFool obtains a higher success rate against different model structures and translation tasks, and it is able to reduce the translation quality more severely. Since the algorithm uses the gradients of the proposed optimization problem and is not based on token replacement, TransFool can highly degrade the translation quality. Furthermore, the perplexity score of the adversarial example generated by TransFool is much lower than those of both baselines (except for the En-Fr Marian model, where it is a little higher than Seq2Sick), which is due to the integration of the LM embeddings and the LM loss term in the optimization problem. Moreover, the token error rate of our attack is lower than both baselines, and the semantic similarity is preserved better by TransFool in almost all cases since we use the LM embeddings instead of the NMT ones for the similarity constraint. While kNN can also maintain semantic similarity, Seq2Sick does not perform well in this criterion.

[Table footnote: * Adversarial perturbed tokens are in red, and the perturbations by TransFool are in blue in the original sentence. The changes in the translation that are the direct results of the perturbation are in brown, while the changes that are due to the failure of the target model are in orange.]
We also computed similarity with BERTScore (Zhang et al., 2019) and BLEURT-20 (Sellam et al., 2020), which correlate highly with human judgments. The results, reported in Appendix D, show that TransFool is better than both baselines in maintaining the semantics. Moreover, as presented in Appendix D.2, the successful attacks by the baselines, as opposed to TransFool, are not semantic-preserving or fluent sentences. Finally, the complete setup and results of our human evaluation are presented in Appendix H, which also shows the superiority of TransFool. We also compare the runtime of TransFool and that of the two baselines. In each iteration of our proposed attack, we need to perform a back-propagation through the target NMT model and the language model to compute the gradients. Also, in some iterations (27 iterations per sentence on average), a forward pass is required to compute the output of the target NMT model to check the stopping criterion. For the Marian NMT (En-Fr) model, on a system equipped with an NVIDIA A100 GPU, it takes 26.45 seconds to generate adversarial examples by TransFool. On the same system, kNN needs 1.45 seconds and Seq2Sick needs 38.85 seconds, and both produce less effective adversarial attacks. Table 2 shows some adversarial examples against mBART50 (En-De). In comparison to the baselines, TransFool makes smaller changes to the sentence. The generated adversarial example is a correct English sentence, and it is similar to the original sentence. However, kNN and Seq2Sick generate adversarial sentences that are not necessarily natural or similar to the original sentences. More examples generated by TransFool, kNN, and Seq2Sick can be found in Appendix D.2. We also provide some adversarial sentences when we do not use the LM embeddings in our algorithm in order to show the importance of this component. Overall, TransFool outperforms both baselines in terms of success rate.
It is able to generate more natural adversarial examples with a lower number of perturbations (TER) and higher semantic similarity with the clean samples in almost all cases. A complete study of hyperparameters and the effect of using LM embeddings instead of NMT embeddings for computing similarity on TransFool performance is presented in Appendix B and C, respectively.
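As an illustration of the success criterion used in this section, the following minimal sketch computes an attack success rate from sentence-level BLEU scores. It assumes that an attack is successful when the BLEU score of the adversarial translation is less than half of the BLEU score of the original translation, and that sentences whose original BLEU score is zero are discarded. The function name `attack_success_rate` is hypothetical; the first two score pairs are taken from the TransFool rows of Tables 9 and 10, and the third pair is made up to show the discarding step.

```python
def attack_success_rate(orig_bleu, adv_bleu, ratio=0.5):
    """Fraction of kept sentences whose adversarial BLEU is below ratio * original BLEU.

    Sentences with an original BLEU of zero are discarded first, so that
    they cannot inflate the success rate artificially.
    """
    kept = [(o, a) for o, a in zip(orig_bleu, adv_bleu) if o > 0]
    if not kept:
        return 0.0
    successes = sum(1 for o, a in kept if a < ratio * o)
    return successes / len(kept)


# BLEU of the original translation and of the adversarial translation.
orig = [21.66, 37.53, 0.0]   # third value is a made-up zero-BLEU sentence
adv = [7.71, 23.83, 5.0]

asr = attack_success_rate(orig, adv)
print(f"ASR = {asr:.2f}")
```

In this toy input, the first sentence counts as a success (7.71 is below half of 21.66), the second does not (23.83 is above half of 37.53), and the third is discarded.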

5.3. PERFORMANCE IN BLACK-BOX ATTACK SETTINGS

In practice, the adversary's access to the learning system may be limited. Hence, we analyze the performance of TransFool in a black-box scenario. It has been shown that adversarial attacks often transfer to another model that has a different architecture and is even trained with different datasets (Szegedy et al., 2014). By utilizing this property of adversarial attacks, we extend TransFool to the black-box scenario. We assume that we have complete access to one NMT model (the reference model), including its gradients. We implement the proposed gradient-based attack in Algorithm 1 with this model. However, for the stopping criteria of the algorithm, we query the black-box target NMT model to compute the BLEU score. We can also implement the black-box transfer attack in the case where the source languages of the reference model and the target model are the same, but their target languages are different.

Since Marian NMT is faster and lighter than mBART50, we use it as the reference model and evaluate the performance of the black-box attack against mBART50. We compare the performance of TransFool with WSLS (Zhang et al., 2021), a black-box untargeted attack against NMT models based on word replacement (the choice of the back-translation model used in WSLS is investigated in Appendix F). We also evaluate the performance of kNN and Seq2Sick in the black-box setting by attacking mBART50 with the adversarial examples generated against Marian NMT (in the white-box setting). The results are reported in Table 3. We also report the performance when attacking Google Translate, some generated adversarial samples, and the similarity performance computed by BERTScore and BLEURT-20 in Appendix E. In all tasks, with a few queries to the target model, our black-box attack achieves better performance than the white-box attack against the target model (mBART50) but a little worse performance than the white-box attack against the reference model (Marian NMT).
In all cases, the success rate, token error rate, and perplexity of TransFool are better than those of all baselines (except for the En-Fr task, where the perplexity is a little higher than that of Seq2Sick). The ability of TransFool and WSLS to maintain semantic similarity is comparable and better than that of the other two baselines. However, WSLS has the highest token error rate, which makes the attack detectable. The effect of TransFool on the BLEU score is larger than that of the other methods, and its effect on the chrF metric is second only to that of WSLS (except for the En-De task, where the RDchrF of TransFool is the best). Regarding the complexity, TransFool requires only a few queries to the target model for translation, while WSLS queries the model more than a thousand times, which is costly and may not be feasible in practice. For the En-Fr task, on a system equipped with an NVIDIA A100 GPU, it takes 43.36 and 1904.98 seconds to generate adversarial examples by TransFool and WSLS, respectively, which shows that WSLS is very time-consuming.

We also analyze the transferability of the generated adversarial examples to a black-box NMT model with the same source language but a different target language. Since we need a dataset with the same set of sentences for different language pairs, we use the validation set of WMT14 for the En-Fr and En-De tasks. Table 4 shows the results for two cases: Marian NMT or mBART50 as the target model. We use Marian NMT as the reference model with a different target language than that of the target model. In all settings, the generated adversarial examples are highly transferable to another NMT model with a different target language (i.e., they have a high attack success rate and a large semantic similarity). The high transferability of TransFool shows that it is able to capture the common failure modes in different NMT models, which can be dangerous in real-world applications.
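The black-box transfer procedure described in this section can be sketched as the loop below. This is only a schematic illustration under stated assumptions, not the authors' implementation: `reference_step` is a hypothetical callable standing in for one gradient and projection update on the white-box reference model, `target_bleu` is a hypothetical callable that queries the black-box target model and returns a BLEU score, the stopping rule is assumed to compare the BLEU ratio against the threshold λ (0.4 is the default value listed in Appendix B), and the target model is assumed to be queried only when the tokens change. The toy stand-ins at the bottom exist solely to make the sketch runnable.

```python
from typing import Callable, List, Tuple

Tokens = List[str]


def black_box_transfer_attack(
    tokens: Tokens,
    reference_step: Callable[[Tokens], Tokens],  # hypothetical white-box update
    target_bleu: Callable[[Tokens], float],      # hypothetical black-box query
    bleu_ratio: float = 0.4,                     # lambda, assumed stopping threshold
    max_iters: int = 100,
) -> Tuple[Tokens, int, bool]:
    """Return (adversarial tokens, number of target queries, success flag)."""
    orig_bleu = target_bleu(tokens)  # one query for the clean sentence
    queries = 1
    adv = list(tokens)
    if orig_bleu <= 0:
        return adv, queries, False  # nothing to degrade
    for _ in range(max_iters):
        candidate = reference_step(adv)  # uses only the reference model
        if candidate == adv:
            continue  # assumption: no query when the tokens did not change
        adv = candidate
        queries += 1
        if target_bleu(adv) <= bleu_ratio * orig_bleu:
            return adv, queries, True
    return adv, queries, False


# Toy stand-ins (not real models): each step replaces one more token by "x",
# and the toy "BLEU" is the percentage of untouched tokens.
def toy_step(toks: Tokens) -> Tokens:
    out = list(toks)
    for i, tok in enumerate(out):
        if tok != "x":
            out[i] = "x"
            break
    return out


def toy_bleu(toks: Tokens) -> float:
    return 100.0 * sum(tok != "x" for tok in toks) / len(toks)


adv, n_queries, ok = black_box_transfer_attack(["a", "b", "c", "d"], toy_step, toy_bleu)
print(adv, n_queries, ok)
```

The point of the design is that gradients never come from the target model; the target is only used as an oracle for the stopping check, which is why the number of queries stays small.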

6. CONCLUSION

The impact of the language model loss coefficient β is shown in Figure 2b. By increasing this coefficient, we weaken the effect of the similarity term, i.e., the generated adversarial examples are less similar to the original sentence. As a result, the success rate and the effect on translation quality, i.e., RDBLEU and RDchrF, increase.

Effect of the step size γ. The step size of the gradient descent step of the algorithm can impact the performance of our attack, which is investigated in Figure 2c. Increasing the step size results in a larger movement in the embedding space in each iteration of the algorithm. Hence, the generated adversarial examples are more aggressive, which results in lower semantic similarity and higher perplexity scores. However, we can find adversarial examples more easily and achieve a higher attack success rate, RDBLEU, and RDchrF.

Effect of the BLEU score ratio λ. This hyperparameter determines the stopping criteria of our iterative algorithm. Figure 2d studies the effects of this hyperparameter on the performance of our attack. As this figure shows, a higher BLEU score ratio causes the algorithm to end in earlier iterations. Therefore, the changes applied to the sentence are less aggressive, and hence, we achieve higher semantic similarity and a lower perplexity score. However, the attack success rate, RDBLEU, and RDchrF decrease since we make fewer changes to the sentences.

Table 6 shows the results of TransFool and kNN when we use LM embeddings or NMT embeddings for measuring the similarity between two tokens (footnote 5). The LM embeddings result in lower perplexity and higher semantic similarity for both methods, which demonstrates the importance of this component in generating meaning-preserving fluent adversarial examples.

D MORE RESULTS ON THE WHITE-BOX ATTACK

D.1 SEMANTIC SIMILARITY COMPUTED BY OTHER METRICS

To better assess the ability of adversarial attacks to maintain semantic similarity, we can compute the similarity between the original and adversarial sentences using other metrics such as BERTScore (Zhang et al., 2019) and BLEURT-20 (Sellam et al., 2020). It is shown in (Zhang et al., 2019) that BERTScore correlates well with human judgments. BLEURT-20 is also shown to correlate better with human judgment than traditional measures (Freitag et al., 2021). The results are reported in Table 7. These results indicate that TransFool is indeed more capable of preserving the semantics of the input sentence. In the two cases where kNN has better similarity according to the Universal Sentence Encoder (USE) (Yang et al., 2020), the performance of TransFool is better in terms of BERTScore and BLEURT-20.

D.2 PERFORMANCE OVER SUCCESSFUL ATTACKS

The evaluation metrics of the successful adversarial examples that strongly affect the translation quality are also important, and they show the capability of the adversarial attack. Hence, we evaluate TransFool, kNN, and Seq2Sick only over the successful adversarial examples. 6 The results for the white-box setting are presented in Table 8 . By comparing this 

D.3 TRADE-OFF BETWEEN SUCCESS RATE AND SIMILARITY/FLUENCY

The results of our ablation study (Appendix B) show that there is a trade-off between the quality of the adversarial examples, in terms of semantic preservation and fluency, and the attack success rate. As studied in (Morris et al., 2020), we can filter out low-quality adversarial examples based on hard constraints on the semantic similarity and on the number of grammatical errors added by the adversarial perturbations. We can analyze the trade-off between success rate and similarity/fluency by setting different thresholds for filtering adversarial examples.

If we evaluate the similarity by the sentence encoder suggested in (Morris et al., 2020), the success rate with different threshold values for similarity in the case of Marian (En-Fr) is depicted in Figure 3b. When we consider only the adversarial examples with a similarity higher than a threshold, the success rate decreases as the threshold increases, while the quality of the adversarial examples increases. We can do the same analysis for fluency. As suggested in (Morris et al., 2020), we count the grammatical errors with LanguageTool (Naber et al., 2003) for the original sentences and the adversarial examples. Figure 3a depicts the success rate for different thresholds on the number of grammatical errors added by the adversarial perturbations. These analyses show that with tighter constraints, we can generate better adversarial examples while the success rate decreases. All in all, according to these results, TransFool outperforms the baselines for different thresholds of similarity and grammatical errors.
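The threshold analysis above can be illustrated with a short sketch. It assumes that each attacked sentence comes with a success flag, a similarity score, and a count of added grammatical errors, and that the success rate is measured over all attacked sentences, counting only successful examples that satisfy the constraints. The function name, the record fields, and the toy values are hypothetical and are not taken from the paper's data.

```python
def constrained_success_rate(examples, min_similarity=0.0, max_added_errors=None):
    """Share of attacked sentences that are successful AND pass the quality filters."""
    if not examples:
        return 0.0
    valid = 0
    for ex in examples:
        if not ex["success"]:
            continue
        if ex["similarity"] < min_similarity:
            continue  # fails the semantic similarity constraint
        if max_added_errors is not None and ex["added_errors"] > max_added_errors:
            continue  # fails the grammaticality constraint
        valid += 1
    return valid / len(examples)


# Made-up records for four attacked sentences.
examples = [
    {"success": True, "similarity": 0.90, "added_errors": 0},
    {"success": True, "similarity": 0.70, "added_errors": 2},
    {"success": True, "similarity": 0.85, "added_errors": 1},
    {"success": False, "similarity": 0.95, "added_errors": 0},
]

for threshold in (0.0, 0.8, 0.88):
    rate = constrained_success_rate(examples, min_similarity=threshold)
    print(f"similarity >= {threshold:.2f}: success rate {rate:.2f}")

for max_err in (2, 1, 0):
    rate = constrained_success_rate(examples, max_added_errors=max_err)
    print(f"added errors <= {max_err}: success rate {rate:.2f}")
```

Sweeping either threshold in this way yields a non-increasing curve, which is the qualitative behavior described for Figures 3a and 3b.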

D.4 MORE ADVERSARIAL EXAMPLES

In this section, we present more adversarial examples generated by TransFool, kNN, and Seq2Sick. In order to show the effect of using LM embeddings on the performance of TransFool, we also include the adversarial examples generated against the English-to-French Marian NMT model when we do not use LM embeddings. In all these tables, the tokens modified by TransFool are written in blue in the original sentence, and the tokens modified by the different adversarial attacks are written in red in their corresponding adversarial sentences. Moreover, the changes made by the adversarial attack to the translation that are not directly related to the modified tokens are written in orange, while the changes that are the direct result of the modified tokens are written in brown.

As can be seen in the examples presented in Tables 9 and 10, TransFool makes smaller changes to the sentence. The generated adversarial example is a correct English sentence, and it is similar to the original sentence. However, kNN, Seq2Sick, and our method with the NMT embeddings make changes that are perceptible, and the adversarial sentences are not necessarily similar to the original sentence. The higher semantic similarity of the adversarial sentences generated by TransFool is due to the integration of the LM embeddings and the LM loss in the proposed optimization problem. We should highlight that TransFool is able to cause changes in the translation of the adversarial sentence that are not directly related to the modifications of the original sentence but are the result of the failure of the NMT model. Other examples against different tasks and models are presented in Tables 11 to 16.

Table 9: Adversarial examples against Marian NMT (En-Fr) by various methods (white-box).

Sentence | BLEU | Text
Org. | - | The most eager is Oregon, which is enlisting 5,000 drivers in the country's biggest experiment.
Ref. Trans. | - | Le plus déterminé est l'Oregon, qui a mobilisé 5 000 conducteurs pour mener l'expérience la plus importante du pays.
Org. Trans. | 21.66 | Le plus avide est l'Oregon, qui recrute 5 000 pilotes dans la plus grande expérience du pays.
Adv. TransFool | - | The most eager isQuebec, which is enlisting 5,000 drivers in the country's biggest experiment.
Trans. | 7.71 | Le Québec, qui fait partie de la plus grande expérience du pays, compte 5 000 pilotes. (some parts are not translated at all.)
Adv. w/ NMT Emb. | - | The most eager isCustom, which is enlisting Disk drivers in the country's editions Licensee.
Trans. | 6.54 | Le plus avide estCustom, qui recrute des pilotes de disque dans les éditions du pays Licencié.
Adv. kNN | - | Theve eager is Oregon, C aren enlisting 5,000 drivers in theau's biggest experiment.
Trans. | 5.93 | Theve avide est Oregon, C sont enrôlés 5 000 pilotes dans la plus grande expérience de Theau.
Adv. Seq2Sick | - | The most buzz is FREE, which is chooseing Games comments in the country's great developer.
Trans. | 10.31 | Le plus buzz est GRATUIT, qui est de choisir Jeux commentaires dans le grand développeur du pays.

Table 10: Adversarial examples against Marian NMT (En-Fr) by various methods (white-box).

Sentence | BLEU | Text
Org. | - | "They are in the process of abandoning and killing off emergency units that were reformed less than five years ago," he believes.
Ref. Trans. | - | "Ils sont en train de vider et d'asphyxier des urgences qui ont été rénovées il y a moins de cinq ans", estime-t-il.
Org. Trans. | 37.53 | « Ils sont en train d'abandonner et de tuer des unités d'urgence qui ont été réformées il y a moins de cinq ans », croit-il.
Adv. TransFool | - | "People are in the process of abandoning and killing off emergency units that been reformed less than five years ago," he believes.
Trans. | 23.83 | « Les gens abandonnent et tuent les unités d'urgence réformées il y a moins de cinq ans », croit-il. (some parts are not translated.)
Adv. w/ NMT Emb. | - | "Manager are in the process of abandoning and killing off emergency units that were celebrating less than five years ago," he believes.
Trans. | 27.66 | « Le gestionnaire est en train d'abandonner et de tuer des unités d'urgence qui célébraient il y a moins de cinq ans », croit-il.
Adv. kNN | - | "They are in the process of abandoning and killing off emergency allotment that were reformedvoir8) five years ago," States believes.
Trans. | 21.20 | « Ils sont en train d'abandonner et de tuer les allocations d'urgence qui ont été réformées il y a cinq ans8 », estime-t-il.
Adv. Seq2Sick | - | "They are in the process of abandoning and shot off emergency units that were CSIS less than five years ago," he believes.
Trans. | 33.58 | « Ils sont en train d'abandonner et de tuer des unités d'urgence qui étaient le SCRS il y a moins de cinq ans », croit-il.

Sentence | BLEU | Text
Org. | - | The devices, which track every mile a motorist drives and transmit that information to bureaucrats, are at the center of a controversial attempt in Washington and state planning offices to overhaul the outdated system for funding America's major roads.

Ref. Trans. | - | Die Geräte, die jeden gefahrenen Kilometer aufzeichnen und die Informationen an die Behörden melden, sind Kernpunkt eines kontroversen Versuchs von Washington und den Planungsbüros der Bundesstaaten, das veraltete System zur Finanzierung US-amerikanischer Straßen zu überarbeiten.
Org. Trans. | 23.65 | Die Geräte, die jede Meile ein Autofahrer fährt und diese Informationen an Bürokraten weiterleitet, stehen im Zentrum eines umstrittenen Versuchs in Washington und in den staatlichen Planungsbüros, das veraltete System zur Finanzierung der großen Straßen Amerikas zu überarbeiten.
Adv. TransFool | - | The vehicles, which track every mile a motorist drives and transmit that information to bureaucrats, are at the center of a unjustified attempt in Washington and city planning offices to overhaul the clearer system for funding America's major roads.
Trans. | 9.36 | Die Fahrzeuge, die jede Meile ein Autofahrer fährt und diese Informationen an Bürokraten weiterleitet, stehen im Zentrum eines ungerechtfertigten Versuchs in Washington und in den Stadtplanungsbüros, das klarere System zur Finanzierung der amerikanischen Hauptstraßen zu überarbeiten.
Adv. kNN | - | The devices in which track every mile a motorist drives and transmit that M to bureaucrats, are 07:0 the center of a controversial attempt in Washington and state planning offices to overhaul the outdated Estate for funding America's major roads.
Trans. | 7.79 | Die Vorrichtungen, in denen jede Meile ein Autofahrer fährt und diese M an Bürokraten überträgt, sind 07:0 das Zentrum eines umstrittenen Versuchs in Washington und staatlichen Planungsbüros, das veraltete Estate für die Finanzierung der amerikanischen Hauptstraßen zu überarbeiten.
Adv. Seq2Sick | - | The devices, which road everyably a motorist drives and transmit that information to walnut socialisms, are at the center of a Senate attempt in Washington and state planning offices toestablishment the outdated system for funding America's major paths.
Trans. | 22.48 | Die Geräte, die allgegenwärtig ein Autofahrer antreibt und diese Informationen an Walnusssozialismen überträgt, stehen im Zentrum eines Senatsversuchs in Washington und in den staatlichen Planungsbüros, das veraltete System zur Finanzierung der wichtigsten Wege Amerikas einzurichten.

20 to 22. In these tables, the tokens modified by TransFool are written in blue in the original sentence, and the tokens modified by the different adversarial attacks are written in red in their corresponding adversarial sentences. Moreover, the changes made by the adversarial attack to the translation that are not directly related to the modified tokens are written in orange, while the changes that are the direct result of the modified tokens are written in brown. These examples show that the modifications made by TransFool are less detectable, i.e., the generated adversarial examples are more natural and similar to the original sentence. Moreover, TransFool causes changes in the translation that are not the direct result of the modified tokens of the adversarial sentence.

WSLS uses a back-translation model for crafting an adversarial example. In (Zhang et al., 2021), the authors investigate the En-De task and use the winner model of the WMT19 De-En sub-track (Ng et al., 2019) as the back-translation model. However, they do not evaluate their method for the En-Fr and En-Zh tasks. To evaluate the performance of WSLS in Table 3, we used pre-trained Marian NMT models for all three back-translation models. In order to show the effect of our choice of back-translation model, we compare the performance of WSLS for the En-De task when we use Marian NMT or (Ng et al., 2019) as the back-translation model in Table 23. As this table shows, WSLS with Marian NMT as the back-translation model results in even higher semantic similarity and a lower perplexity score. On the other hand, WSLS with (Ng et al., 2019) as the back-translation model has a slightly higher success rate. These results show that our choice of back-translation model does not strongly affect the performance of WSLS.

G LICENSE INFORMATION AND DETAILS

In this section, we provide some details about the datasets, codes, and models used in this paper. We should note that we used the models and datasets that are available in the HuggingFace transformers (Wolf et al., 2020) and datasets (Lhoest et al., 2021) libraries (footnote 9). They are licensed under Apache License 2.0. Moreover, we used PyTorch for all experiments (Paszke et al., 2019), which is released under the BSD license (footnote 10).

G.1 DATASETS

WMT14. In the Ninth Workshop on Statistical Machine Translation, WMT14 was introduced for four tasks. We used the En-De and En-Fr news translation tasks. There is no license available for this dataset.

OPUS-100. OPUS-100 is a multilingual translation corpus for 100 languages, which is randomly sampled from the OPUS collection (Tiedemann, 2012). There is no license available for this dataset.

G.2 MODELS

Marian NMT. Marian is a Neural Machine Translation framework, which is mainly developed by the Microsoft Translator team, and it is released under the MIT License (footnote 11). This model uses a beam size of 4.

mBART50. mBART50 is a multilingual machine translation model of 50 languages, which has been introduced by Facebook. This model is published in the Fairseq library, which is released under the MIT License (footnote 12). This model uses a beam size of 5.

G.3 CODES

kNN. In order to compare our method with kNN (Michel et al., 2019), we used the code provided by the authors, which is released under the BSD 3-Clause "New" or "Revised" License (footnote 13).

Seq2Sick. To compare our method with Seq2Sick (Cheng et al., 2020a), we used the code published by the authors (footnote 14). There is no license available for their code.

WSLS. We implemented and evaluated WSLS (Zhang et al., 2021) using the source code published by the authors (footnote 15). There is no license available for this GitHub repository.

H HUMAN EVALUATION

We conduct a preliminary human evaluation campaign of the TransFool, kNN, and Seq2Sick attacks on Marian NMT (En-Fr) in the white-box setting. We randomly choose 90 sentences from the test set of the WMT14 (En-Fr) dataset, together with their adversarial samples and the translations produced by the NMT model. We split the 90 sentences into three different surveys to obtain a manageable size for each annotator. We recruited two annotators for each survey. For the English surveys, we ensure that the annotators are highly proficient English speakers. Similarly, for the French survey, we ensure that the annotators are highly proficient in French. Before starting the rating task, we provided the annotators with detailed guidelines similar to (Cer et al., 2017; Michel et al., 2019). The task is to rate the sentences for each criterion on a continuous scale (0-100) inspired by WMT18 practice (Ma et al., 2018) and Direct Assessment (Graham et al., 2013; 2017). For each sentence, we evaluate three aspects in three different surveys:

• Fluency: We show the three adversarial sentences and the original sentence on the same page (in random order). We ask the annotators how much they agree with the statement "The sentence is fluent." for each sentence.
• Semantic preservation: We show the original sentence on top and the three adversarial sentences afterwards (in random order). We ask the annotators how much they agree with the statement "The sentence is similar to the reference text." for each sentence.
• Translation quality: Inspired by monolingual direct assessment (Ma et al., 2018; Graham et al., 2013; 2017), we evaluate the translation quality by showing the reference translation on top and the translations of the three adversarial sentences afterwards (in random order). We ask the annotators how much they agree with the statement "The sentence is similar to the reference text." for each translation.

We calculate 95% confidence intervals by using 15K bootstrap replications.
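For readers unfamiliar with bootstrap confidence intervals, the following is a minimal sketch of one common variant, the percentile bootstrap of the mean, with 15K replications. The text does not state which bootstrap variant was used, so the percentile method is an assumption here; the function name `bootstrap_ci` and the ratings are hypothetical.

```python
import random


def bootstrap_ci(scores, n_boot=15000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    means = []
    for _ in range(n_boot):
        # Resample n ratings with replacement and record their mean.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lo_idx], means[hi_idx]


# Made-up ratings on the 0-100 scale used in the surveys.
ratings = [60, 72, 55, 80, 68, 75, 62, 70, 66, 71]
low, high = bootstrap_ci(ratings)
mean = sum(ratings) / len(ratings)
print(f"mean = {mean:.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
```

The interval is read directly from the sorted resampled means, so no normality assumption on the ratings is needed.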
The results are depicted in Figure 4. These results demonstrate that the adversarial examples generated by TransFool are more semantic-preserving and fluent than those of both baselines. According to the guidelines provided to the annotators for semantic similarity, the score of 67.8 shows that the two sentences are roughly equivalent, but some details may differ. Moreover, a fluency of 66.4 demonstrates that although the adversarial examples generated by TransFool are more fluent than those of the baselines, there is still room for improvement in this regard.

We follow the direct assessment strategy to measure the effectiveness of the adversarial attacks on translation quality. According to (Ma et al., 2018), since a sufficient level of agreement on translation quality is difficult to achieve with human evaluation, direct assessment reduces the task to a simpler monolingual assessment instead of a bilingual one. The similarity of the translations of the adversarial sentences with the reference translation is shown in Figure 4c. The similarity of Seq2Sick is worse than that of the other attacks. However, its similarity in the source language is also worse. Therefore, we compute the decrease of similarity (between the original and adversarial sentences)



1 Code of (Cheng et al., 2019; 2020b), untargeted white-box attacks against NMTs, is not publicly available.
2 We use case-sensitive SacreBLEU (Post, 2018) on detokenized sentences.
3 We use the multilingual version since we are dealing with multiple languages.
5 In order to have a fair comparison, we fine-tuned the hyperparameters of TransFool, in the case when we do not use LM embeddings, to have a similar attack success rate.
6 As defined in Section 5, the adversarial example is successful if the BLEU score of its translation is less than half of the BLEU score of the original translation.
7 We should note that since we do not have a tokenizer, we compute Word Error Rate (WER) instead of Token Error Rate (TER).
8 The results of kNN and Seq2Sick are not reported since they are transfer attacks, and their performance is already reported in Table 7.
9 These two libraries are available at this GitHub repository: https://github.com/huggingface.
10 https://github.com/pytorch/pytorch/blob/master/LICENSE
11 https://github.com/marian-nmt/marian/blob/master/LICENSE.md
12 https://github.com/facebookresearch/fairseq/blob/main/LICENSE
13 The source code is available at https://github.com/pmichel31415/translate/tree/paul/pytorch_translate/research/adversarial/experiments and the license is available at https://github.com/pmichel31415/translate/blob/paul/LICENSE
14 The source code is available at https://github.com/cmhcbb/Seq2Sick.
15 https://github.com/JHL-HUST/AdvNMT-WSLS/tree/79945881f75d92ae44e9ebc10500d8590c09bb13



Figure 1: Block diagram of TransFool.

Figure 2: Effect of different hyperparameters on the performance of TransFool.

Figure 3: Trade-off between success rate and similarity/fluency. The left figure shows the effect of the acceptable number of grammatical errors added by the adversarial perturbation. The right figure shows the effect of the similarity threshold.

Performance of white-box attack against different NMT models.

Adversarial examples * against mBART50 (En-De) generated by different methods.

Performance of black-box attack, when the target language is different.

Performance of black-box attack against mBART50.

Xinze Zhang, Junzhe Zhang, Zhenhua Chen, and Kun He. Crafting adversarial examples for neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1967-1977, 2021.

Performance of white-box attack against Marian NMT (En-Fr) with/without language model embeddings.

Similarity performance of white-box attacks.

which shows the results on the whole dataset, we can see that TransFool performance is consistent among successful and unsuccessful attacks. Moreover, successful adversarial examples generated by TransFool are still semantically similar to the original sentences, and their perplexity score is low. However, the successful adversarial examples generated by Seq2Sick and kNN do not preserve the semantic similarity and are not fluent sentences; hence, they are not valid adversarial sentences.

Performance of white-box attack over successful adversarial examples.

Adversarial examples against Marian NMT (En-Fr) by various methods (white-box).

Adversarial examples against Marian NMT (En-De) by various methods (white-box).

Adversarial examples against Marian NMT (En-Zh) by various methods (white-box).

Adversarial examples against mBART50 (En-Fr) crafted by various methods (white-box).

le célèbre sanctuaire de Monserrate à 160 km/h. Le sanctuaire est situé à une opiniontitude de plus de 8000 mètres et de nombreux spectateurs se sont rassemblés là pour observer son exploit.

Adv. Seq2Sick Wearing a wingsuit, he flew past over the famous Monserrate Sanctuary at 160km/h. The sanctuary is located at an altitude of over74 meters and numerous spectators had gathered there to watch his exploit.

Adversarial examples against mBART50 (En-De) crafted by various methods (white-box).

Adversarial examples against mBART50 (En-Zh) crafted by various methods (white-box).

Performance of black-box attack against Google Translate (En-Fr).

Performance of TransFool black-box attack against Google Translate (En-De), when the target language is different.

We consider Marian NMT (En-Fr) as the reference model and attack En-De Google Translate. The results for TransFool are reported in Table 18.

E.2 SEMANTIC SIMILARITY COMPUTED BY OTHER METRICS

Similarity performance of black-box attacks.

Adversarial examples against mBART50 (En-Fr) crafted by various methods (black-box).

It is therefore not surprising that he should be holding a mask in the promotional photography for L'Invitation au Voyage, by Louis Vuitton, of which he is the new face.

Adversarial examples against mBART50 (En-De) crafted by various methods (black-box).

Adversarial examples against mBART50 (En-Zh) crafted by various methods (black-box).

To provide care and support by strengthening programming for orphans and vulnerable children infected/affected by AIDS and by expanding life skills training for young people.

To provide care and support by strengthening programming for orphans and vulnerable children Disabled/ afflicted by AIDS and by expanding life skill training for young people.

To provide nursing and unstinted_support by strengthening i_Lifetv for orphans and susceptable children infected/affected by CPR_mannequins and by broadening life skills training for young people.

Performance of WSLS (En-De) with two backtranslation models.

acknowledgement

 4 We discard the sentences whose original BLEU score is zero to prevent improving the results artificially. We should also note that all results are computed after the re-tokenization of the adversarial example. Since we are generating the adversarial example at the token-level, there is a small chance that, when the generated adversarial example is converted to text, the re-tokenization does not produce the same set of tokens.

Ethics Statement

We introduced TransFool, an adversarial attack against NMT models, with the motivation of revealing the vulnerabilities of NMT models and paving the way for designing stronger defenses and building robust NMT models in real-life scenarios. While it remains a possibility that a threat actor may misuse our attack, we do not condone using our method with the intent of attacking a real NMT system.

Reproducibility Statement

The source code will be publicly available as soon as possible to help reproduce our results. Moreover, Appendix G contains the license information and more details of the assets (datasets, codes, and models).

Supplementary Material

TransFool: An Adversarial Attack against Neural Machine Translation Models

ABSTRACT

In this supplementary material, we first provide some statistics of the evaluation datasets in Section A. We discuss the effect of the back-translation model choice on WSLS in Section F. Finally, the license information and more details of the assets (datasets, codes, and models) are provided in Section G.

A SOME STATISTICS OF THE DATASETS

Some statistics of the evaluation datasets, i.e., OPUS-100 (En-Zh) and WMT14 (En-Fr and En-De), including the number of samples, the average length of the sentences, and the translation quality of Marian NMT and mBART50, are reported in Table 5.

B ABLATION STUDY

In this section, we analyze the effect of different hyperparameters (including the coefficients α and β in our optimization problem (1), the step size of the gradient descent γ, and the relative BLEU score ratio λ in the stopping criteria Eq. (7)) on the white-box attack performance in terms of success rate, semantic similarity, and perplexity score. In all the experiments, we consider the English-to-French Marian NMT model and evaluate over the first 1000 sentences of the test set of WMT14. The default values for the hyperparameters are as follows, except for the hyperparameter that varies in the different experiments, respectively: α = 20, β = 1.8, γ = 0.016, and λ = 0.4.

Effect of the similarity coefficient α. This hyperparameter determines the strength of the similarity term in the optimization problem (1). Figure 2a shows the effect of α on the performance of our attack. By increasing the similarity coefficient of the proposed optimization problem, we are forcing our algorithm to find adversarial sentences that are more similar to the original sentence. Therefore, as shown in Figure 2a, larger values of α result in higher semantic similarity. However, in this case, it is harder to fool the NMT model, i.e., lower attack success rate, RDBLEU, and RDchrF. Moreover, it seems that, since the generated adversarial examples are more similar to the original sentence, they are more natural, and their perplexity score is lower.

Effect of the language model loss coefficient β. We analyze the impact of the hyperparameter β, which controls the importance of the language model loss term in the proposed optimization

from the source language to the target language. The results in Figure 4d show that all attacks affect the translation quality and the effect of TransFool is more pronounced than that of both baselines. Finally, we calculate Inter-Annotator Agreement (IAA). There are two human judgments for each sentence.
We average both scores to compute the final score for each sentence. To ensure that the two annotators agree, we only consider sentences where the difference between their two corresponding scores is less than 30. We compute IAA in terms of the Pearson correlation coefficient instead of the commonly used Cohen's κ since the scores are on a continuous scale. The results are presented in Table 24. Overall, we conclude that we achieve a reasonable inter-annotator agreement for all sentence types and evaluation metrics.
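The aggregation and agreement computation described above can be sketched as follows. This is an illustrative reading rather than the authors' script: it assumes that the filtering rule keeps a sentence when the two annotators' scores differ by less than 30 points, and the function names and the ratings are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def aggregate_and_agreement(scores_a, scores_b, max_gap=30.0):
    """Average the two annotators' scores and compute their Pearson correlation.

    Assumption: a sentence is kept only if the two scores differ by less
    than `max_gap` points.
    """
    kept = [(a, b) for a, b in zip(scores_a, scores_b) if abs(a - b) < max_gap]
    final_scores = [(a + b) / 2 for a, b in kept]
    r = pearson([a for a, _ in kept], [b for _, b in kept])
    return final_scores, r


# Made-up ratings from two annotators on a 0-100 scale.
annotator_a = [80, 60, 40, 90, 20]
annotator_b = [70, 65, 45, 50, 30]
final_scores, r = aggregate_and_agreement(annotator_a, annotator_b)
print(final_scores, round(r, 3))
```

Pearson correlation is defined for real-valued ratings, whereas Cohen's κ expects categorical labels, which matches the motivation given in the text. Note that the sketch does not guard against the degenerate case where one annotator gives the same score to every kept sentence.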

