TOXICITY IN MULTILINGUAL MACHINE TRANSLATION AT SCALE

Abstract

Machine Translation systems can produce different types of errors, some of which are characterized as critical or catastrophic due to the specific negative impact they can have on users. Automatic and human evaluation metrics do not necessarily differentiate between such critical errors and more innocuous ones. In this paper we focus on one type of critical error: added toxicity. We evaluate and analyze added toxicity when translating a large evaluation dataset (HOLISTICBIAS, over 472k sentences, covering 13 demographic axes) from English into 164 languages. Automatic toxicity evaluation shows that added toxicity across languages varies from 0% to 5%. The output languages with the most added toxicity tend to be low-resource ones, and the demographic axes with the most added toxicity include sexual orientation, gender and sex, and ability. We also perform human evaluation on a subset of 8 translation directions, confirming the prevalence of true added toxicity. We use a measurement of the amount of source contribution to the translation, where a low source contribution implies hallucination, to interpret what causes toxicity. We observe that the source contribution is somewhat correlated with toxicity, but that 45.6% of added toxic words have a high source contribution, suggesting that much of the added toxicity may be due to mistranslations. Combining the signal of source contribution level with a measurement of translation robustness allows us to flag 22.3% of added toxicity, suggesting that added toxicity may be related to both hallucination and the stability of translations in different contexts. Given these findings, our recommendations to reduce added toxicity are to curate training data to avoid mistranslations, mitigate hallucination, and check unstable translations. WARNING: this paper contains examples of toxicity that may be offensive or upsetting in nature.

1. INTRODUCTION

Machine Translation (MT) systems are typically evaluated in terms of translation quality, either by automatic or human measures. Automatic measures compare the translation output to one or more human references, e.g. Papineni et al. (2002); Popović (2015); Rei et al. (2020). Human measures use annotators to rank translation outputs, e.g. Licht et al. (2022); Akhbardeh et al. (2021). However, most of these evaluation strategies do not discriminate between venial and critical errors. While a translation can be of higher or lower quality, it is worth distinguishing whether we are producing critical errors. Vilar et al. (2006) is an example of a taxonomy for translation errors in general. More recently, the critical error detection task aims at predicting sentence-level binary scores indicating whether or not a translation contains a critical error, not limited to toxicity (Specia et al., 2021). Sharou & Specia (2022) provide a taxonomy to classify critical errors. In this work, we focus on the first of the seven categories of critical errors proposed by Sharou & Specia (2022): deviation in toxicity. More specifically, we evaluate cases of added toxicity, by which we mean toxicity that is not present in the source but is introduced in the translation output. Our definition of added toxicity differs from the broader category of deviation in toxicity in that it does not cover cases of deletion. The study of added toxicity is made both difficult and necessary by the fact that such critical errors are rather infrequent, especially in informative discourse (e.g., Wikipedia, news), but have a significant impact on translation safety and user trust. Previous work by the NLLB Team et al. (2022) evaluates potential added toxicity in machine translations of the FLORES-200 benchmark dataset using wordlist-based detectors. Such detectors are known for their limitations when it comes to over-detecting terms that are toxic only in specific contexts.
Nevertheless, the overall prevalence of potential added toxicity remains low when evaluating translations of formal sentences such as those in FLORES-200, which makes it difficult to draw conclusions about this specific aspect of a model's performance. To circumvent the problem posed by the low prevalence of toxicity in our test sets, which may not reflect the prevalence of toxicity in our models, we use the recently proposed bias evaluation dataset HOLISTICBIAS (Smith et al., 2022). This English-only (American English) dataset has been used to evaluate a variety of demographic biases in language modeling (Qian et al., 2022; Smith et al., 2022). The dataset contains over 472k sentences (100 times larger than typical evaluation sets) and is designed to trigger biased behaviors in language models. It is therefore better suited than FLORES-200 for the purpose of triggering toxicity and evaluating added toxicity in our translation models. The main contribution of this work is the first in-depth study of the causes of added toxicity in a multilingual machine translation experimental framework with a high prevalence of real toxicity at scale. For this purpose, we combine the previously defined toxicity detection methodology (NLLB Team et al., 2022), the controlled HOLISTICBIAS evaluation dataset (Smith et al., 2022), and the ALTI+ interpretability method (Ferrando et al., 2022a). We analyze which particular language directions and HOLISTICBIAS structures trigger toxicity. Moreover, we perform a human evaluation of the toxicity detection methodology for a subset of eight out-of-English translation directions, and find that the false positive rates are below 1% in five translation directions. False negatives are below 3% in all translation directions. Finally, we demonstrate an interaction between the source contribution, the robustness of translations, and toxicity.
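To make the wordlist-based detection concrete, the following is a minimal sketch of how added toxicity can be flagged by matching a translation against a target-language toxicity list while requiring the source to match none of a source-language list. The wordlists and the `detect_added_toxicity` function are illustrative assumptions for this sketch, not the actual lists or code used by NLLB Team et al. (2022), and real detectors must handle tokenization, morphology, and multi-word expressions.

```python
# Hypothetical toy wordlists for illustration only; the real project lists
# are far larger and curated per language.
EN_TOXICITY_LIST = {"bastard", "fucker"}
CAT_TOXICITY_LIST = {"malparit", "cabron"}

def detect_added_toxicity(source: str, translation: str,
                          src_list: set, tgt_list: set) -> bool:
    """Flag a sentence pair as a candidate for added toxicity when the
    translation contains a listed toxic token but the source does not."""
    src_tokens = {t.strip(".,!?").lower() for t in source.split()}
    tgt_tokens = {t.strip(".,!?").lower() for t in translation.split()}
    source_toxic = bool(src_tokens & src_list)
    target_toxic = bool(tgt_tokens & tgt_list)
    return target_toxic and not source_toxic

# The 'gangly' -> 'malparit' mistranslation discussed later (Figure 1,
# Sentence 2) would be flagged, since only the translation matches a list:
print(detect_added_toxicity("I am a gangly person.",
                            "Soc una persona malparit.",
                            EN_TOXICITY_LIST, CAT_TOXICITY_LIST))  # True
```

This simple set-intersection design also exhibits the limitation noted above: a term that is toxic only in certain contexts will be flagged in every context, which is why human evaluation of the detector's false positives matters.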
We use ALTI+ to observe that 45.6% of the toxic translations have a high source contribution, which hints that much of this added toxicity may be caused by mistranslations, while the remainder, with low source contribution, may be correlated with hallucination (Ferrando et al., 2022a). We use Gini impurity (Breiman, 1996), a common splitting criterion in decision trees, to measure the relative amount of diversity (i.e., the relative lack of robustness) across the translated words aligned by ALTI+ to HOLISTICBIAS descriptor words. A combination of a low amount of source contribution and a high Gini impurity across translations corresponds to a rate of toxicity roughly twice as high as the baseline rate. These findings lead us to recommend that toxicity could be mitigated by curating training data to avoid mistranslations, reducing hallucinations, and checking unstable translations.
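The Gini impurity measure above can be sketched as follows. Given the multiset of target words that ALTI+ aligns to one descriptor word across its contexts, the impurity is 1 - Σ pᵢ², where pᵢ is the empirical frequency of each distinct translation. The function and the toy data below are illustrative assumptions; the paper's actual aggregation over HOLISTICBIAS templates may differ.

```python
from collections import Counter

def gini_impurity(translations):
    """Gini impurity of the distribution of target words aligned to one
    descriptor word across contexts: 1 - sum(p_i ** 2).
    0 means the descriptor is always translated identically (robust);
    values near 1 mean many competing translations (unstable)."""
    counts = Counter(translations)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# Toy data: a consistently translated descriptor vs. an unstable one.
stable = ["grande"] * 10
unstable = ["grosse", "chatte", "grande", "grosse", "enorme"]
print(gini_impurity(stable))    # 0.0
print(gini_impurity(unstable))  # ~0.72, i.e. low robustness
```

Under this view, flagging a descriptor whose translations have both low ALTI+ source contribution and high Gini impurity combines the hallucination signal with the instability signal, matching the flagging rule described above.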

2. DEFINITIONS AND BACKGROUND

Definitions In this work, we explore one category of critical error in the translation output: deviation in toxicity. Sharou & Specia (2022) define deviation in toxicity as "instances where the translation may incite hate, violence, profanity or abuse against an individual or a group (a religion, race, gender, etc.) due to incorrect translations". More specifically, we focus on added toxicity (abbreviated as AT in tables henceforth), which differs slightly from the broader deviation in toxicity in that it does not cover instances of deleted toxicity. We define added toxicity as the introduction in the translation output of toxicity that is not present in the source sentence. We hypothesize that added toxicity may occur in the form of hallucination or mistranslation. Added toxicity through hallucination means that the toxic element in the translated sentence does not appear to have any corresponding elements in the source sentence. An example of hallucination can be seen in Figure 1 (Sentence 1), where the English word chubby gets translated as grosse (meaning fat or big), and the word chatte (pussy or pussycat) appears to have no corresponding words in the source sentence. Added toxicity through mistranslation means that the toxic element found in the translation can be considered a mistranslation of a non-toxic element found in the source sentence. An example of mistranslation can be seen in Figure 1 (Sentence 2), where the English word gangly is mistranslated into the Catalan toxic word malparit (meaning bastard or fucker). Regarding the level of added toxicity in translation directions, we define high-, mid-, and low-toxicity translation directions as those that have above 0.5%, between 0.1% and 0.5%, and below 0.1% of added toxicity, respectively. These percentages are computed following the approach

