TRANSFOOL: AN ADVERSARIAL ATTACK AGAINST NEURAL MACHINE TRANSLATION MODELS

Abstract

Deep neural networks have been shown to be vulnerable to small perturbations of their inputs, known as adversarial attacks. In this paper, we consider the particular task of Neural Machine Translation (NMT), where security is often critical. We investigate the vulnerability of NMT models to adversarial attacks and propose a new attack algorithm called TransFool. It builds on a multi-term optimization problem and a gradient projection step to compute adversarial examples that fool NMT models. By integrating the embedding representation of a language model into the proposed attack, we generate fluent adversarial examples in the source language that maintain a high level of semantic similarity with the clean samples and render the attack largely undetectable. Experimental results demonstrate that, for multiple translation tasks and different NMT architectures, our white-box attack severely degrades the translation quality for more than 60% of the sentences while the semantic similarity between the original sentence and the adversarial example remains very high. Moreover, we show that the proposed attack is transferable to unknown target models and can fool them quite easily. Finally, based on automatic and human evaluations, our method improves on existing attacks in terms of success rate, semantic similarity, and fluency, in both white-box and black-box settings. Hence, TransFool permits a better characterization of the vulnerability of NMT systems and highlights the necessity of designing strong defense mechanisms and more robust NMT systems for real-life applications.

1. INTRODUCTION

The impressive performance of Deep Neural Networks (DNNs) in different areas such as computer vision (He et al., 2016) and Natural Language Processing (NLP) (Vaswani et al., 2017) has led to their widespread usage in various applications. With such an extensive usage of these models, it is important to analyze their robustness and potential vulnerabilities. In particular, it has been shown that the outputs of these models are susceptible to imperceptible changes in the input, known as adversarial attacks (Szegedy et al., 2014). Adversarial examples, which differ from the original inputs in an imperceptible manner, cause the target model to generate incorrect outputs. If these models are not robust enough to such attacks, they cannot be reliably used in applications with security requirements. To address this issue, many studies have recently been devoted to the effective generation of adversarial examples, the defense against attacks, and the analysis of the vulnerabilities of DNN models (Moosavi-Dezfooli et al., 2016; Madry et al., 2018; Ortiz-Jiménez et al., 2021). The dominant methods to craft imperceptible attacks for continuous data, e.g., audio and image data, are based on gradient computation and various optimization strategies. However, these methods cannot be directly extended to NLP models due to the discrete nature of the tokens in the corresponding representations (i.e., words, subwords, and characters). Another challenge in dealing with textual data is the characterization of the imperceptibility of the adversarial perturbation. The ℓp-norm is widely used for image data to measure imperceptibility, but it does not apply to textual data, where manipulating only one token in a sentence may significantly change the semantics. Moreover, in gradient-based methods, it is challenging to incorporate linguistic constraints in a differentiable manner.
Hence, optimization-based methods are more difficult to design and less investigated for adversarial attacks against NLP models. Currently, most attacks on textual data are gradient-free and simply based on heuristic word replacement, which may result in sub-optimal performance (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020; Jin et al., 2020; Morris et al., 2020; Guo et al., 2021; Sadrizadeh et al., 2022).
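To make the discreteness obstacle concrete, the sketch below illustrates the generic gradient-projection idea referred to above: a continuous gradient step is taken on a token's embedding, and the result is then projected back onto the discrete vocabulary by nearest-neighbor search in embedding space. This is a minimal toy illustration, not TransFool's actual algorithm; the vocabulary, embedding values, and the function `gradient_step_and_project` are all invented for this example.

```python
import numpy as np

# Toy vocabulary of 5 tokens with 4-dim embeddings (hypothetical values).
rng = np.random.default_rng(0)
vocab = ["the", "a", "cat", "dog", "sat"]
E = rng.normal(size=(5, 4))  # embedding matrix, one row per token


def gradient_step_and_project(token_id, grad, step=0.5):
    """Take a continuous gradient step on a token's embedding, then
    project back onto the discrete vocabulary via nearest neighbor."""
    perturbed = E[token_id] + step * grad          # continuous update
    dists = np.linalg.norm(E - perturbed, axis=1)  # distance to every token
    return int(np.argmin(dists))                   # nearest discrete token


# A zero gradient leaves the token unchanged after projection.
print(gradient_step_and_project(2, np.zeros(4)))           # 2 ("cat")
# A gradient pointing exactly toward another token's embedding
# moves the projection to that token.
print(gradient_step_and_project(2, E[3] - E[2], step=1.0))  # 3 ("dog")
```

The key point is that the gradient step itself lives in a continuous space, so a small gradient may leave the projection unchanged (the attack makes no progress), while a large step can jump to a semantically unrelated token; handling this trade-off, together with semantic and fluency constraints, is what makes optimization-based textual attacks difficult.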

