WAT ZEI JE? DETECTING OUT-OF-DISTRIBUTION TRANSLATIONS WITH VARIATIONAL TRANSFORMERS

Abstract

We detect out-of-training-distribution sentences in Neural Machine Translation using the Bayesian Deep Learning equivalent of Transformer models. For this we develop a new measure of uncertainty designed specifically for long sequences of discrete random variables, i.e. words in the output sentence. Our new measure avoids a major intractability that arises when existing approaches are applied naively to long sentences. We use our new measure with a Transformer model trained with dropout approximate inference. On the task of German-English translation using WMT13 and Europarl, we show that with dropout uncertainty our measure is able to identify when Dutch source sentences, sentences which use many of the same word forms as German, are given to the model instead of German.

1. INTRODUCTION

Statistical Machine Translation (SMT; Brown et al., 1993; Och, 2003), built on top of probabilistic modelling foundations such as the IBM alignment models (Vogel et al., 1996; Brown et al., 1993; Gal & Blunsom, 2013), has largely been replaced in recent years following the emergence of Neural Machine Translation approaches (NMT; Kalchbrenner & Blunsom, 2013; Bahdanau et al., 2015; Luong et al., 2015; Vaswani et al., 2017). This change has brought huge performance gains to the field (Sennrich et al., 2016), but at the same time we have lost many desirable properties of these models. Statistical probabilistic models can inform us when they are guessing at random on inputs they never saw before (Ghahramani, 2015). This information can be used, for example, to detect out-of-training-distribution examples for selective classification by referring uncertain inputs to an expert for annotation (Leibig et al., 2017), or for a human-in-the-loop approach to reduce data labelling costs (Gal et al., 2017; Walmsley et al., 2019; Kirsch et al., 2019). With new tools in machine learning we can now incorporate such probabilistic foundations into deep learning NLP models without sacrificing performance. This field, known as Bayesian Deep Learning (BDL; Neal, 2012; Gal, 2016), is concerned with the development of scalable tools which capture epistemic uncertainty: the model's notion of "I don't know", a measure of a model's lack of knowledge, e.g. due to a lack of training data, or when an input is given to the model which is very dissimilar to what the model has seen before. Such BDL tools have been used extensively in the Computer Vision literature (Kendall & Gal, 2017; Litjens et al., 2017), and have been demonstrated to be of practical use for applications including medical imaging (Litjens et al., 2017; Nair et al., 2020), robotics (Gal et al., 2016; Chua et al., 2018), and astronomy (Hon et al., 2018; Soboczenski et al., 2018; Hezaveh et al., 2017).
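The selective-classification setting described above can be sketched in a few lines. The sketch below is illustrative only: `uncertainty_fn` stands in for any per-input epistemic uncertainty estimate, and the threshold and toy heuristic are assumptions, not part of the method developed in this paper.

```python
# Selective classification: route high-uncertainty inputs to a human expert.
# `uncertainty_fn` is a placeholder for any per-input epistemic uncertainty
# estimate; the threshold below is illustrative.

def selective_route(inputs, uncertainty_fn, threshold):
    """Split inputs into those the model handles and those referred to an expert."""
    auto, referred = [], []
    for x in inputs:
        (referred if uncertainty_fn(x) > threshold else auto).append(x)
    return auto, referred

# Toy usage: pretend longer sentences are more uncertain (a stand-in heuristic).
sentences = ["ein kurzer Satz", "a much longer and far more unusual sentence"]
auto, referred = selective_route(sentences, lambda s: len(s.split()) / 10, 0.4)
```

In practice the rejection threshold would be tuned on held-in validation data, trading coverage against translation quality.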
In this paper we extend these tools, often used for vision tasks, to the language domain. We demonstrate how these tools can be used effectively on the task of selective classification in NMT by identifying source sentences the translation model has never seen before, and referring such source sentences to an expert for translation. We demonstrate this with state-of-the-art Transformer models, and show how model performance increases when rejecting sentences the model is uncertain about, i.e. the model's measure of epistemic uncertainty correlates with mistranslations. For this we develop several measures of epistemic uncertainty for applications in natural language processing, concentrating on the task of machine translation (§3). We compare these measures both with standard deterministic Transformer models, as well as with Variational Transformers, a new approach we introduce to capture epistemic uncertainty in sequence models using dropout approximate inference (Gal, 2016). We give an extensive analysis of the methodology, and compare the different approaches quantitatively in the out-of-training-distribution settings (§4), showing that our proposed uncertainty estimate BLEUVar works well for measuring epistemic uncertainty in machine translation. We also analyse the performance of BLEUVar qualitatively, both in terms of the influence of sentence length and from a linguistic perspective. We finish with a discussion of potential use cases for the newly proposed methodology.

The closest NLP task to the above problem definition is the quality estimation (QE) task in Machine Translation (Specia et al., 2010; Blatz et al., 2004), which tries to solve a similar problem by predicting the quality of a translation with a score called Human Translation Error Rate (HTER; Snover et al., 2006).
This is done by training a surrogate QE model on source sentences and their corresponding machine-generated translations in a specific domain, with the target of the surrogate being to predict the percentage of edits that need to be fixed. While many methods have been shown to successfully solve the task of estimating the quality of translations (Kim et al., 2017; Martins et al., 2016; 2017; Kreutzer et al., 2015), by definition QE crucially relies on examples of mistranslations to train the surrogate. The assumption that such training data is available is often violated in practice though (e.g. in active learning), thus existing approaches in QE research cannot generally be used to detect out-of-training-distribution examples (see Appendix C for a detailed discussion of the differences between QE and our task, as well as other related work that is similar to ours but does not solve the same problem).

2. BACKGROUND: UNCERTAINTY IN DEEP LEARNING

For most machine learning models, the optimisation objectives give us a point estimate of the model parameters, which maximises the likelihood of the model generating the training data (i.e. p(Y|X, ω = ω*), ω* ∈ Ω, where Ω is the set of all possible model parameters and Y, X are the training data). Such a point estimate ω* gives us very good predictions when the test data follow the same distribution as the training data. Given a new input x* at test time, the model prediction for the corresponding y* is

ŷ* = argmax_{y*} p(y*|x*, ω*).    (1)

However, we cannot expect the model to perform well on out-of-distribution (OOD) data which it never saw before. Instead, we would wish for the model to indicate its uncertainty towards such inputs. We could use p(ŷ*|x*, ω*) as an estimate for model uncertainty, but as we show below, it would not be a well calibrated one. It might be the case that many ω give equally good predictions on the training set, but widely disagree in their predictions on OOD data. In fact, ω* might give arbitrary predictions on OOD data which is very dissimilar to previously observed inputs. Thus, a high score does not tell us whether x* is OOD, nor whether we should trust the model's prediction.

2.1. BAYESIAN INFERENCE

Bayesian probabilistic models capture the notion of uncertainty explicitly. Rather than considering a single point estimate ω*, Bayesian models aim to capture the entire distribution of ω given the training data. The resulting distribution is called the posterior distribution

p(ω|X, Y) = p(Y|X, ω) p(ω) / p(Y|X).    (2)

At test time, we can make a prediction about the corresponding y* by integrating out all possible ω:

p(y*|x*, X, Y) = ∫ p(y*|x*, ω) p(ω|X, Y) dω.    (3)

Using the variance of the predictive distribution p(y*|x*, X, Y) as the uncertainty measure takes into account the variance of ω. Hence, an uncertainty measure based on this quantity can be a strong indicator of x* being OOD.

2.2. APPROXIMATE INFERENCE

The difficulty in doing Bayesian inference comes from the intractability of calculating the evidence

p(Y|X) = ∫ p(Y|X, ω) p(ω) dω.    (4)
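Dropout approximate inference sidesteps this intractability by approximating the integral in Eq. (3) with a Monte Carlo average over forward passes with dropout kept active at test time, each pass corresponding to one sampled ω. The sketch below illustrates this on a toy single-linear-layer model; the model, weights, dropout rate, and number of passes are illustrative assumptions, not the Transformer used in this paper.

```python
import random

random.seed(0)

def dropout_forward(x, weights, p=0.5):
    """One stochastic forward pass of a toy linear model.

    Each weight is kept with probability 1 - p and rescaled (inverted dropout);
    each call corresponds to one sample ω from the approximate posterior.
    """
    return sum(w * xi * (random.random() >= p) / (1 - p)
               for w, xi in zip(weights, x))

def mc_predict(x, weights, passes=200):
    """Monte Carlo estimate of the predictive mean and variance, cf. Eq. (3)."""
    ys = [dropout_forward(x, weights) for _ in range(passes)]
    mean = sum(ys) / len(ys)
    var = sum((y - mean) ** 2 for y in ys) / len(ys)
    return mean, var

mean, var = mc_predict([1.0, 2.0], [0.5, -0.3])
```

The predictive variance `var` reflects disagreement between the sampled models; for sequence outputs this scalar-variance view breaks down, which motivates the sequence-level measure developed in §3.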


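Given several translations sampled via stochastic forward passes as in dropout approximate inference (§2.2), the BLEUVar measure previewed in §1 scores their pairwise disagreement. The sketch below conveys the idea only: a simple unigram-overlap score stands in for BLEU, and the sampled translations are hypothetical; the actual measure is defined in §3.

```python
from itertools import permutations

def unigram_overlap(hyp, ref):
    """Stand-in for BLEU: fraction of hypothesis unigrams present in the reference."""
    h, r = hyp.split(), ref.split()
    return sum(w in r for w in h) / len(h) if h else 0.0

def pairwise_disagreement(translations, similarity=unigram_overlap):
    """Sum of squared (1 - similarity) over ordered pairs of sampled translations.

    High values mean the stochastic passes disagree, i.e. high model uncertainty.
    """
    return sum((1.0 - similarity(a, b)) ** 2
               for a, b in permutations(translations, 2))

# Hypothetical translations from repeated stochastic forward passes:
confident = ["the cat sat", "the cat sat", "the cat sat"]
uncertain = ["the cat sat", "a dog ran off", "we were mistaken"]
assert pairwise_disagreement(confident) < pairwise_disagreement(uncertain)
```

Because the score is built from pairwise string comparisons rather than per-token output variances, it remains tractable for long sequences of discrete outputs.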