WAT ZEI JE? DETECTING OUT-OF-DISTRIBUTION TRANSLATIONS WITH VARIATIONAL TRANSFORMERS

Abstract

We detect out-of-training-distribution sentences in Neural Machine Translation using the Bayesian Deep Learning equivalent of Transformer models. For this we develop a new measure of uncertainty designed specifically for long sequences of discrete random variables, i.e. the words in the output sentence. Our new measure of uncertainty solves a major intractability in the naive application of existing approaches to long sentences. We use our new measure on a Transformer model trained with dropout approximate inference. On the task of German-English translation using WMT13 and Europarl, we show that with dropout uncertainty our measure is able to identify when Dutch source sentences, which use the same word types as German, are given to the model instead of German.

1. INTRODUCTION

Statistical Machine Translation (SMT; Brown et al., 1993; Och, 2003), built on top of probabilistic modelling foundations such as the IBM alignment models (Vogel et al., 1996; Brown et al., 1993; Gal & Blunsom, 2013), has largely been replaced in recent years following the emergence of Neural Machine Translation approaches (NMT; Kalchbrenner & Blunsom, 2013; Bahdanau et al., 2015; Luong et al., 2015; Vaswani et al., 2017). This change has brought huge performance gains to the field (Sennrich et al., 2016), but at the same time we have lost many desirable properties of these models. Statistical probabilistic models can inform us when they are guessing at random on inputs they never saw before (Ghahramani, 2015). This information can be used, for example, to detect out-of-training-distribution examples for selective classification by referring uncertain inputs to an expert for annotation (Leibig et al., 2017), or for a human-in-the-loop approach to reduce data labelling costs (Gal et al., 2017; Walmsley et al., 2019; Kirsch et al., 2019). With new tools in machine learning we can now incorporate such probabilistic foundations into deep learning NLP models without sacrificing performance. This field, known as Bayesian Deep Learning (BDL; Neal, 2012; Gal, 2016), is concerned with the development of scalable tools which capture epistemic uncertainty: the model's notion of "I don't know", a measure of a model's lack of knowledge, e.g. due to lack of training data, or when an input is given to the model which is very dissimilar to what the model has seen before. Such BDL tools have been used extensively in the Computer Vision literature (Kendall & Gal, 2017; Litjens et al., 2017), and have been demonstrated to be of practical use for applications including medical imaging (Litjens et al., 2017; Nair et al., 2020), robotics (Gal et al., 2016; Chua et al., 2018), and astronomy (Hon et al., 2018; Soboczenski et al., 2018; Hezaveh et al., 2017).
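As a minimal illustration of how epistemic uncertainty can flag out-of-distribution inputs (this is the standard mutual-information, or "BALD", score over stochastic dropout forward passes, not the sequence-level measure developed later in this paper), consider the per-prediction computation below; the function names and toy softmax outputs are illustrative only:

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of the mean prediction; probs has shape (T, C) for
    T stochastic (dropout) forward passes over C classes."""
    mean = probs.mean(axis=0)
    return -np.sum(mean * np.log(mean + 1e-12))

def mutual_information(probs):
    """BALD score: H[mean prediction] minus the mean per-pass entropy.
    It is high only when the dropout passes disagree with each other,
    i.e. it isolates epistemic (model) uncertainty."""
    expected_entropy = -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))
    return predictive_entropy(probs) - expected_entropy

# In-distribution-like input: all dropout passes agree on class 0.
agree = np.array([[0.9, 0.05, 0.05]] * 8)
# OOD-like input: each pass is confident, but in different classes.
disagree = np.array([[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]] * 4)

print(mutual_information(agree))     # near zero: passes agree
print(mutual_information(disagree))  # clearly positive: passes disagree
```

In the agreeing case the mean prediction has the same entropy as each individual pass, so the score is near zero; in the disagreeing case the mean is spread over two classes while each pass is confident, yielding a large positive score. This disagreement-under-dropout signal is the intuition behind using dropout approximate inference for selective classification.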
In this paper we extend these tools, often used for vision tasks, to the language domain. We demonstrate how these tools can be used effectively on the task of selective classification in NMT by identifying source sentences the translation model has never seen before, and referring such source sentences to an expert for translation. We demonstrate this with state-of-the-art Transformer models, and show how model performance increases when rejecting sentences the model is uncertain about, i.e. the model's measure of epistemic uncertainty correlates with mistranslations. For this we develop several measures of epistemic uncertainty for applications in natural language processing, concentrating on the task of machine translation (§3). We compare these measures both with standard deterministic Transformer models, as well as with Variational Transformers, a new approach we introduce to capture epistemic uncertainty in sequence models using dropout approximate inference (Gal, 2016). We give an extensive analysis of the methodology, and compare

