TOXICITY IN MULTILINGUAL MACHINE TRANSLATION AT SCALE

Abstract

Machine Translation systems can produce different types of errors, some of which get characterized as critical or catastrophic due to the specific negative impact they can have on users. Automatic or human evaluation metrics do not necessarily differentiate between such critical errors and more innocuous ones. In this paper we focus on one type of critical error: added toxicity. We evaluate and analyze added toxicity when translating a large evaluation dataset (HOLISTICBIAS, over 472k sentences, covering 13 demographic axes) from English into 164 languages. Automatic toxicity evaluation shows that added toxicity across languages varies from 0% to 5%. The output languages with the most added toxicity tend to be low-resource ones, and the demographic axes with the most added toxicity include sexual orientation, gender and sex, and ability. We also perform human evaluation on a subset of 8 translation directions, confirming the prevalence of true added toxicity. We use a measurement of the amount of source contribution to the translation, where a low source contribution implies hallucination, to interpret what causes toxicity. We observe that the source contribution is somewhat correlated with toxicity, but that 45.6% of added toxic words have a high source contribution, suggesting that much of the added toxicity may be due to mistranslations. Combining the signal of source contribution level with a measurement of translation robustness allows us to flag 22.3% of added toxicity, suggesting that added toxicity may be related to both hallucination and the stability of translations in different contexts. Given these findings, our recommendations to reduce added toxicity are to curate training data to avoid mistranslations, mitigate hallucination, and check unstable translations. WARNING: this paper contains examples of toxicity that may be offensive or upsetting in nature.



However, most of these evaluation strategies tend not to discriminate between venial and critical errors. While a translation can be of higher or lower quality, it is worth distinguishing whether we are producing critical errors. Vilar et al. (2006) provide an example of a taxonomy for translation errors in general. More recently, the critical error detection task aims at predicting sentence-level binary scores indicating whether or not a translation contains a critical error, not limited to toxicity (Specia et al., 2021). Sharou & Specia (2022) provide a taxonomy to classify critical errors. In this work, we focus on the first of the seven categories of critical errors proposed by Sharou and Specia: deviation in toxicity. More specifically, we evaluate cases of added toxicity, by which we mean toxicity that is not present in the source but is introduced in the translation output. Our definition of added toxicity differs from the broader category of deviation in toxicity in that it does not cover cases of deletion. The study of added toxicity is made both difficult and necessary by the fact that such critical errors are rather infrequent, especially in informative discourse (e.g., Wikipedia, news), but have a signif-
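The definition of added toxicity above can be sketched programmatically. The following is a minimal illustrative sketch assuming a wordlist-based detector: the token-level matching and the placeholder wordlists are hypothetical simplifications for exposition, not the actual detector used in this work.

```python
def toxic_matches(text, toxic_words):
    """Return the set of wordlist entries found among the text's tokens.

    Simplification: lowercased whitespace tokenization; a real detector
    would need language-specific tokenization and multi-word matching.
    """
    tokens = set(text.lower().split())
    return tokens & toxic_words


def added_toxicity(source, translation, src_toxic_words, tgt_toxic_words):
    """Flag added toxicity: toxic words appear in the translation while
    the source contains none, i.e. the toxicity is introduced by the
    MT system rather than carried over from the input."""
    src_hits = toxic_matches(source, src_toxic_words)
    tgt_hits = toxic_matches(translation, tgt_toxic_words)
    return tgt_hits if not src_hits else set()


# Hypothetical placeholder wordlists (one per language side).
src_list = {"badword"}
tgt_list = {"badword"}
print(added_toxicity("a nice day", "a badword day", src_list, tgt_list))
```

Note that, per the definition above, a toxic source translated toxically is not counted as added toxicity, and deletion of source toxicity is out of scope.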



Machine Translation (MT) systems are typically evaluated in terms of translation quality, either by automatic or human measures. Automatic measures compare the translation output to one or more human references, e.g. Papineni et al. (2002); Popović (2015); Rei et al. (2020). Human measures use annotators to rank translation outputs, e.g. Licht et al. (2022); Akhbardeh et al. (2021).
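As a concrete illustration of a reference-based automatic measure, the following is a toy sentence-level BLEU in the spirit of Papineni et al. (2002). It is a simplified sketch (whitespace tokenization, single reference, no smoothing), not the exact implementation used in standard toolkits.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty. No smoothing,
    so any zero precision yields a score of 0."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # Clipped overlap: each hypothesis n-gram counts at most as
        # often as it occurs in the reference.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * geo_mean


print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

Metrics of this family reward n-gram overlap with the reference; as the abstract notes, such quality scores do not by themselves distinguish a critical error, such as added toxicity, from an innocuous mistranslation.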

