RAINPROOF: AN UMBRELLA TO SHIELD TEXT GENERATORS FROM OUT-OF-DISTRIBUTION DATA

Abstract

As more and more conversational and translation systems are deployed in production, it is essential to develop effective control mechanisms guaranteeing their proper functioning and security. An essential component of safe system behavior is out-of-distribution (OOD) detection, which aims at detecting whether an input sample is statistically far from the training distribution. Although OOD detection is a widely covered topic in classification tasks, it has received much less attention in text generation. This paper addresses the problem of OOD detection for machine translation and dialog generation from an operational perspective. Our contributions include: (i) RAINPROOF, a Relative informAItioN Projection OOD detection framework; and (ii) a more operational evaluation setting for OOD detection. Surprisingly, we find that OOD detection is not necessarily aligned with task-specific measures: an OOD detector may filter out samples that are well processed by the model and keep samples that are not, leading to weaker performance. Our results show that RAINPROOF breaks this curse, achieving good OOD detection results while increasing task performance.

1. INTRODUCTION

Significant progress has been made in Natural Language Generation (NLG) in recent years with the development of powerful generic (e.g., GPT (Radford et al., 2018; 2019; Brown et al., 2020)) and task-specific (e.g., Grover (Zellers et al., 2019), Pegasus (Zhang et al., 2020) and DialoGPT (Zhang et al., 2019)) text generators. Text generators power machine translation systems and chatbots that are by definition exposed to the public, and whose reliability is therefore a prerequisite for adoption. Text generators are trained in a so-called closed world (Antonucci et al., 2021; Fei & Liu, 2016), where training and test data are assumed to be drawn i.i.d. from a single distribution, known as the in-distribution. However, when deployed, these models operate in an open world (Parmar et al., 2021; Zhou, 2022) where the i.i.d. assumption is often violated. This change in data distribution is detrimental and induces a drop in performance, as illustrated in Tab. 3 and Tab. 4. Thus, to ensure their trustworthiness and adoption, it is necessary to develop tools that protect text generators from harmful distribution shifts. For example, a trained translation model is not expected to be reliable when presented with another language (e.g., a Spanish model exposed to Catalan, or a Dutch model exposed to Afrikaans) or unexpected technical language (e.g., a colloquial translation model exposed to rare technical terms from the medical field). Most of the existing research on protecting models from Out-Of-Distribution (OOD) data focuses on classification. Despite its importance, (conditional) text generation has received much less attention, even though it is among the most exposed applications. Existing solutions fall into two categories. The first, training-aware methods (Zhu et al., 2022; Vernekar et al., 2019a;b), modify the classifier's training by exposing the neural network to OOD samples during training.
The second, plug-in methods, aims at distinguishing regular in-distribution (IN) samples from OOD samples based on the behavior of the model on a new input. Plug-in methods include the Maximum Softmax Prediction (MSP) (Hendrycks & Gimpel, 2016), Energy (Lee et al., 2018a), and feature-based anomaly detectors that compute a per-class anomaly score (Ming et al., 2022; Ryu et al., 2017; Huang et al., 2020; Ren et al., 2021a). Although plug-in methods seem attractive, their adaptation to text generation is not straightforward: the sheer number of words in the vocabulary prevents them from being used directly within the classification framework. In this work, we aim at developing new tools to build more reliable text generators that can be used in practical systems. First, we work in the unsupervised detection setting, where we do not assume access to OOD samples, as they are often unavailable. Second, we work in the black-box scenario, the most common one in the Software-as-a-Service framework (Rudin & Radin, 2019); in this setting, detection methods only have access to the output of the DNN architecture. Third, we want an easy-to-use and effective method to ensure adoptability. Last, we argue that the impact of OOD detection on the task-specific performance of the whole system should be taken into account when choosing OOD detectors in an operational setting. Our contributions. Our main contributions can be summarized as follows: 1. A more operational benchmark for text generation OOD detection. We present LOFTER, the Language Out oF disTribution pErformance benchmaRk. Existing work on OOD detection for language modeling (Arora et al., 2021) focuses on (i) the English language only, (ii) the GLUE benchmark and (iii) performance measured solely in terms of OOD detection. LOFTER is, in our view, a more operational setting with a strong focus on neural machine translation (NMT) and dialog generation.
First, it introduces more realistic data shifts that go beyond English (Fan et al., 2021): language shifts induced by closely related language pairs (e.g., Spanish and Catalan, or Dutch and Afrikaans) and domain changes (e.g., medical vs. news data, or different types of dialogs). In addition, LOFTER comes with an updated evaluation setting: detectors' performance is jointly evaluated w.r.t. the overall system's performance on the end task. 2. Novel information-theoretic detectors. We present RAINPROOF: a Relative informAItioN Projection Out OF distribution detector. RAINPROOF is fully unsupervised. It is flexible and can be applied both when no reference (IN) samples are available (scenario s0) and when they are (scenario s1). RAINPROOF tackles s0 by computing the negentropy (Brillouin, 1953) of the model's predictions. For s1, it relies on its natural extension: the Information Projection (Kullback, 1954; Csiszár, 1967), an information-theoretic tool that remains overlooked by the machine learning community. 3. New insights on the operational value of OOD detectors. Our extensive experiments on LOFTER show that OOD detectors may filter out samples that are well processed by the model and keep samples that are not, leading to weaker performance. Our results show that RAINPROOF breaks this curse and achieves good OOD detection results while increasing performance. 4. Code and reproducibility. After acceptance, we will publish the open-source code on GitHub together with the data to facilitate future research, ensure reproducibility and reduce computational costs.

2.1. NOTATIONS & CONDITIONAL TEXT GENERATION

Let us denote $\Omega$ a vocabulary of size $|\Omega|$ and $\Omega^*$ its Kleene closure (Fletcher et al., 1990). We denote $\mathcal{P}(\Omega) = \{p \in [0,1]^{|\Omega|} : \sum_{i=1}^{|\Omega|} p_i = 1\}$ the set of probability distributions defined over $\Omega$. Let $\mathcal{D}_{\text{train}}$ be the training set, composed of $N \geq 1$ i.i.d. samples $\{(x^i, y^i)\}_{i=1}^{N} \in (\mathcal{X} \times \mathcal{Y})^N$ with probability law $p_{XY}$. We denote $p_X$ and $p_Y$ the associated marginal laws of $p_{XY}$. Each $x^i$ is a sequence of tokens; $x^i_j \in \Omega$ denotes the $j$-th token of the $i$-th sequence, and $x^i_{\leq t} = \{x^i_1, \dots, x^i_t\} \in \Omega^*$ denotes the prefix of length $t$. The same notations hold for $y$. Conditional text generation. In conditional text generation, the goal is to model a probability distribution $p^\star(x, y)$ over variable-length text sequences $(x, y)$ by finding $p_\theta \approx p^\star$. In this work, we assume access to a pretrained conditional language model $f_\theta : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^{|\Omega|}$ whose output is the (unnormalized) logit scores. $f_\theta$ parameterizes $p_\theta$, i.e., for any $(x, y)$, $p_\theta(x, y) = \mathrm{softmax}(f_\theta(x, y)/T)$, where $T \in \mathbb{R}$ denotes the temperature. Given an input sequence $x$, the pretrained language model $f_\theta$ can recursively generate an output sequence $\hat{y}$ by sampling $y_{t+1} \sim p^T_\theta(\cdot \mid x, \hat{y}_{\leq t})$ for $t \in [1, |\hat{y}|]$; $\hat{y}_0$ is the start-of-sentence token (<SOS>). We denote by $S(x)$ the set of normalized scores generated by the model when the initial input is $x$, i.e., $S(x) = \{\mathrm{softmax}(f_\theta(x, \hat{y}_{\leq t}))\}_{t=1}^{|\hat{y}|}$. Note that the elements of $S(x)$ are discrete probability distributions on $\Omega$.
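As a minimal numerical sketch of the sampling loop described above (not the authors' implementation): `toy_logits` is a hypothetical stand-in for $f_\theta$, and `VOCAB` is a toy vocabulary size; the point is how the per-step distributions $S(x)$ are collected during decoding.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50          # toy |Omega|; a real model has tens of thousands of tokens
EOS, SOS = 0, 1     # hypothetical special-token ids

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def toy_logits(x, prefix):
    # Stand-in for f_theta(x, y_<=t): any model returning |Omega| logits.
    return rng.normal(size=VOCAB)

def generate(x, max_len=20, T=1.0):
    """Sample y_{t+1} ~ p_theta(.|x, y_<=t) and record S(x),
    the sequence of per-step softmax distributions."""
    y, S = [SOS], []
    for _ in range(max_len):
        p = softmax(toy_logits(x, y), T)
        S.append(p)
        tok = int(rng.choice(VOCAB, p=p))
        y.append(tok)
        if tok == EOS:
            break
    return y, S

y_hat, S = generate("some input")
```

Each element of `S` is a valid distribution over the toy vocabulary, mirroring the definition of $S(x)$.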

2.2. PROBLEM STATEMENT

In OOD detection, the goal is to find an anomaly score $a : \mathcal{X} \to \mathbb{R}_+$ that quantifies how far a sample is from the IN distribution. $x$ is classified as IN or OUT according to the score $a(x)$. Following previous work (Hendrycks & Gimpel, 2016), one fixes a threshold $\gamma$ and classifies a test sample as IN if $a(x) \leq \gamma$ and as OUT if $a(x) > \gamma$. Formally, denoting $g(\cdot, \gamma)$ the decision function, we take:
$$g(x, \gamma) = \begin{cases} 1 & \text{if } a(x) > \gamma \\ 0 & \text{if } a(x) \leq \gamma. \end{cases}$$
Remark 1. In our setting, OOD examples are not available. In our experiments, we take $\gamma$ such that at least 80% of the train set is classified as IN data. This choice is reasonable since, in practice, even a well-tailored dataset might contain a significant share of outliers (Mishra et al., 2020).
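The thresholding rule and the calibration of $\gamma$ on training scores can be sketched as follows (a minimal sketch, assuming anomaly scores are already computed; the 80% figure mirrors Remark 1):

```python
import numpy as np

def fit_threshold(train_scores, in_share=0.80):
    """Pick gamma so that at least `in_share` of the training samples
    are classified IN; no OOD data is needed for calibration."""
    return float(np.quantile(train_scores, in_share))

def g(score, gamma):
    # Decision function: 1 -> flagged OOD, 0 -> accepted as IN.
    return int(score > gamma)

# Toy calibration on synthetic anomaly scores.
train_scores = np.random.default_rng(0).exponential(size=1000)
gamma = fit_threshold(train_scores)
in_rate = np.mean([g(s, gamma) == 0 for s in train_scores])
```

With this convention, any sample scoring above the calibrated `gamma` is filtered out at inference time.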

2.3. REVIEW OF OOD DETECTORS

OOD detection for classification. Most works on OOD detection have focused on detectors for classifiers and rely either on internal representations (feature-based detectors) or on the final soft probabilities produced by the classifier (softmax-based detectors). Feature-based detectors. They leverage latent representations to derive anomaly scores (Kirichenko et al., 2020; Zisselman & Tamar, 2020). The most well-known is the Mahalanobis distance (Lee et al., 2018b; Ren et al., 2021b), but there are other methods employing Gram matrices (Sastry & Oore, 2020), the Fisher-Rao distance (Gomes et al., 2022) or other statistical tests (Haroush et al., 2021). Other methods rely on the gradient space (Huang et al., 2021) or on moments of the features (Quintanilha et al., 2019; Sun et al., 2021). These methods require access to the latent representations of the models, which does not fit the black-box scenario. Moreover, they often rely on a per-class decision, which is fine for classifiers, but the sheer number of words in Ω makes this impossible for text generation. Softmax-based detectors. These detectors rely on the soft probabilities produced by the model. The maximum softmax probability (Hendrycks & Gimpel, 2017; Hein et al., 2019; Liang et al., 2018; Hsu et al., 2020) uses the probability of the mode, while others take the entire distribution into account, such as the Energy-based OOD detection scores (Liu et al., 2020). Due to the large vocabulary size, it is unclear how these methods generalize to sequence generation tasks. OOD detection for text generation. Little work has been done on OOD detection for text generation. Therefore, we follow Arora et al. (2021) and rely on their baselines, but also generalize common OOD scores such as MSP or Energy to the context of text generation. Generalization to sequence generation.
We generalize common OOD detectors for classification tasks by computing the average OOD score along the sequence, at each step of the text generation. We refer the reader to Sec. A.6 for more details. Remark 2. Note that feature-based detectors assume a white-box framework where the internal representations of an input are accessible, in contrast to softmax-based detectors, which only rely on the final output. Following Arora et al. (2021), we work in a black-box framework (Chen et al., 2020). We also compare our results to the Mahalanobis distance (Lee et al., 2018b), as it is known to be a strong baseline.
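The sequence generalization above can be sketched as follows (a hedged sketch, not the exact implementation of Sec. A.6): per-step MSP and Energy scores are averaged over the decoding steps, with the sign chosen so that higher means more anomalous.

```python
import numpy as np

def msp_step(p):
    """Per-step Maximum Softmax Probability, negated so that
    higher values are more anomalous."""
    return -float(np.max(p))

def energy_step(logits, T=1.0):
    """Per-step energy score: -T * logsumexp(logits / T)."""
    z = np.asarray(logits, dtype=float) / T
    m = z.max()
    return -T * (m + np.log(np.sum(np.exp(z - m))))

def sequence_score(per_step_scores):
    """Generalize a classification OOD score to generation by
    averaging the per-step scores along the generated sequence."""
    return float(np.mean(per_step_scores))

# A peaked (confident) decoding trace scores lower than a flat (uncertain) one.
confident = [np.array([0.9, 0.05, 0.05]), np.array([0.8, 0.1, 0.1])]
uncertain = [np.full(3, 1 / 3)] * 2
score_conf = sequence_score([msp_step(p) for p in confident])
score_unc = sequence_score([msp_step(p) for p in uncertain])
```

The same averaging applies to any per-step score, including `energy_step` computed on the raw logits.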

3.1. INFORMATION THEORETICAL BACKGROUND

An information measure $I : \mathcal{P}(\Omega) \times \mathcal{P}(\Omega) \to \mathbb{R}$ quantifies the similarity between any pair of discrete distributions $p, q \in \mathcal{P}(\Omega)$. Since $\Omega$ is a finite set, we adopt the notations $p = [p_1, \dots, p_{|\Omega|}]$ and $q = [q_1, \dots, q_{|\Omega|}]$. The development of new information measures for specific applications has received much attention over the years (Fujisawa & Eguchi, 2008; Cichocki et al., 2011) (we refer the reader to Basseville (2013) for a complete review). While information distances exist, it is, in general, difficult to build metrics that satisfy all the properties of a distance; thus one often relies on divergences, which drop the symmetry property and the triangle inequality. In what follows, we motivate the information measures used in this work. First, we rely on the Rényi divergences (Csiszár, 1967). Rényi divergences belong to the family of $f$-divergences and are parametrized by $\alpha \in \mathbb{R}_+ \setminus \{1\}$. They are flexible and include well-known divergences such as the Kullback-Leibler (KL) divergence (Kullback, 1959) (when $\alpha \to 1$) or the Hellinger distance (Hellinger, 1909) (when $\alpha = 0.5$). The Rényi divergence between $p$ and $q$ is defined as:
$$D_\alpha(p \| q) = \frac{1}{\alpha - 1} \log \sum_{i=1}^{|\Omega|} p_i^\alpha \, q_i^{1-\alpha}.$$
The Rényi divergence is widely used in machine learning (Peters et al., 2019) because $\alpha$ allows weighting the relative influence of the distributions' tails. Second, we investigate the Fisher-Rao distance (FR). FR is a distance on the Riemannian manifold formed by parametric distributions, using the Fisher information matrix as its metric (Amari, 2012). It computes the geodesic distance between two discrete distributions (Rao, 1992; Pinele et al., 2020) and is defined as:
$$\mathrm{FR}(p \| q) = \frac{2}{\pi} \arccos \sum_{i=1}^{|\Omega|} \sqrt{p_i \times q_i}.$$
It has recently found many applications (Picot et al., 2022; Colombo et al., 2022b;a) and is known to be more accurate than popular divergence measures (Costa et al., 2015).
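Both measures are a few lines of numpy (a minimal sketch; the small `eps` smoothing is our assumption, added to avoid division by zero on sparse distributions):

```python
import numpy as np

def renyi_div(p, q, alpha, eps=1e-12):
    """D_alpha(p || q) = 1/(alpha - 1) * log sum_i p_i^alpha q_i^(1-alpha),
    for alpha != 1; converges to the KL divergence as alpha -> 1."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1))

def kl_div(p, q, eps=1e-12):
    """Kullback-Leibler divergence, the alpha -> 1 limit of D_alpha."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def fisher_rao(p, q):
    """FR(p || q) = (2/pi) * arccos( sum_i sqrt(p_i q_i) ), in [0, 1]."""
    bc = np.clip(np.sum(np.sqrt(np.asarray(p) * np.asarray(q))), 0.0, 1.0)
    return float((2.0 / np.pi) * np.arccos(bc))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])
```

Varying `alpha` changes how much the tails of `p` and `q` contribute to the score, which is the flexibility exploited later in the paper.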
3.2. RAINPROOF FOR THE NO-REFERENCE SCENARIO (s0)

At inference time, the no-reference scenario (s0) does not assume the existence of a reference set of IN samples to decide whether a new input sample is OOD. Softmax-based detectors such as MSP (Hendrycks & Gimpel, 2016), Energy (Liu et al., 2020) or the sequence likelihood (Arora et al., 2021) are examples of OOD scores operating under s0. Under these assumptions, our OOD detector RAINPROOF is composed of three steps. For a given input $x$ with generated sentence $\hat{y}$: 1. We first use $f_\theta$ to extract the step-by-step sequence of soft distributions $S(x)$. 2. We then compute an anomaly score $a_I(x)$ by averaging a step-by-step score provided by $I$. This step-by-step score is obtained by measuring the similarity between a reference distribution $u \in \mathcal{P}(\Omega)$ and each element of $S(x)$. Formally:
$$a_I(x) = \frac{1}{|S(x)|} \sum_{p \in S(x)} I(p \| u), \quad \text{where } |S(x)| = |\hat{y}|. \quad (3)$$
3. The last step consists in thresholding the anomaly score $a_I(x)$: if $a_I(x)$ is above a given threshold $\gamma$, we classify $x$ as an OOD example. Interpretation of Eq. 3. $a_I(x)$ measures the average dissimilarity of the probability distribution of the next token to normality (as defined by $u$). $a_I(x)$ also corresponds to the average per-token uncertainty of the model $f_\theta$ when generating $\hat{y}$ from input $x$. The intuition behind Eq. 3 is that the distributions produced by $f_\theta$, when exposed to an OOD sample, should be far from normality and thus receive a high score. Choice of u and I. The uncertainty definition of Eq. 3 depends on the choice of both the reference distribution $u$ and the information measure $I$. A natural choice for $u$, which we use in this work, is the uniform distribution $u = [\frac{1}{|\Omega|}, \dots, \frac{1}{|\Omega|}]$. It is worth pointing out that $I(\cdot \| u)$ then yields the negentropy of a distribution. Other possible choices for $u$ include one-hot or tf-idf distributions (Colombo et al., 2022b).
For $I$, we rely on the Rényi divergence to obtain $a_{D_\alpha}$ and on the Fisher-Rao distance to obtain $a_{FR}$.
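The three steps above can be sketched as follows (a minimal sketch under s0, assuming the per-step distributions $S(x)$ are already collected; here $I$ is the KL divergence to the uniform reference, i.e., negentropy up to the constant $\log|\Omega|$):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def a_I(S, measure=kl):
    """Step 2 of the no-reference detector: average I(p || u) over the
    step distributions S(x), with u the uniform reference distribution."""
    omega = len(S[0])
    u = np.full(omega, 1.0 / omega)
    return float(np.mean([measure(p, u) for p in S]))

def is_ood(S, gamma, measure=kl):
    """Step 3: threshold the anomaly score a_I(x)."""
    return a_I(S, measure) > gamma

# Toy decoding traces: one peaked (far from uniform) and one flat.
peaked = [np.array([0.97, 0.01, 0.01, 0.01])]
flat = [np.full(4, 0.25)]
```

A fully uniform trace scores zero, since each step distribution coincides with the reference `u`; any other measure from Sec. 3.1 (Rényi, Fisher-Rao) can be passed as `measure`.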

3.3. RAINPROOF FOR THE REFERENCE SCENARIO (s 1 )

In the with-reference scenario (s1), we assume access to a reference set of IN samples $\mathcal{R} = \{x^i : (x^i, y^i) \in \mathcal{D}_{\text{train}}\}_{i=1}^{|\mathcal{R}|}$, where $|\mathcal{R}|$ is the size of the reference set. For example, the Mahalanobis distance works under this assumption. One weakness of Eq. 3 is that it imposes an ad-hoc choice of the reference distribution $u$ (the uniform distribution). In s1, we can instead leverage $\mathcal{R}$ to obtain a data-driven notion of normality. Under s1, our OOD detector RAINPROOF follows these four steps: 1. (Offline) For each $x^i \in \mathcal{R}$, we generate $\hat{y}^i$ and the associated sequence of probability distributions $S(x^i)$. Overall, we thus generate $\sum_{x^i \in \mathcal{R}} |\hat{y}^i|$ probability distributions, which could explode for long sequences. To overcome this limitation, we rely on the bag of distributions of each sequence (Colombo et al., 2022b) and form the set of these bags:
$$S^* = \left\{ \frac{1}{|S(x^i)|} \sum_{p \in S(x^i)} p \;:\; x^i \in \mathcal{R} \right\}. \quad (4)$$
2. (Online) For a given input $x$ with generated sentence $\hat{y}$, we compute its bag-of-distributions representation
$$p(x) = \frac{1}{|S(x)|} \sum_{p \in S(x)} p. \quad (5)$$
3. (Online) For $x$, we then compute an anomaly score $a^\star_I(x)$ by projecting $p(x)$ on the set $S^*$. Formally:
$$a^\star_I(x) = \min_{p \in S^*} I(p \| p(x)), \quad \text{with } p^\star(x) = \arg\min_{p \in S^*} I(p \| p(x)). \quad (6)$$
4. The last step consists of thresholding the anomaly score $a^\star_I(x)$: if $a^\star_I(x)$ is above a given threshold $\gamma$, we classify $x$ as an OOD example. Interpretation of Eq. 6. $a^\star_I(x)$ relies on a Generalized Information Projection (Kullback, 1954; Csiszár, 1975; 1984), which measures the similarity between $p(x)$ and the set $S^*$. Note that the closest element of $S^*$ in the sense of $I$ can give insights on the decision of the detector: it allows interpreting the detector's decision, as we will see in Tab. 5. Choice of I. Similarly to Sec. 3.2, we rely on the Rényi divergence to define $a^\star_{R_\alpha}(x)$ and on the Fisher-Rao distance to define $a^\star_{FR}(x)$.
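The four steps above can be sketched as follows (a minimal sketch under s1, with KL standing in for $I$; the Dirichlet-sampled reference bags are purely illustrative):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def bag_of_distributions(S):
    """Eq. 5: mean of the per-step distributions of one sequence."""
    return np.mean(np.stack(S), axis=0)

def build_reference(list_of_S):
    """Offline step (Eq. 4): one bag-of-distributions vector per reference sample."""
    return [bag_of_distributions(S) for S in list_of_S]

def projection_score(S_star, S_x, measure=kl):
    """Eq. 6: a*_I(x) = min over the reference bags of I(p || p(x)).
    Also returns the argmin bag, usable to interpret the decision."""
    p_x = bag_of_distributions(S_x)
    scores = [measure(p, p_x) for p in S_star]
    k = int(np.argmin(scores))
    return scores[k], S_star[k]

# Toy reference set: 20 sequences of 5 step distributions over 8 tokens.
rng = np.random.default_rng(0)
ref = build_reference(
    [[rng.dirichlet(np.ones(8)) for _ in range(5)] for _ in range(20)]
)
```

An input whose bag coincides with a reference bag projects onto it with score zero; the returned argmin is the nearest-neighbor reference used for interpretation in Tab. 5.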
Language shifts appear when a translation system is exposed to a language that is extremely similar to the language the system has been trained on (e.g., Afrikaans for a system trained on Dutch) and can therefore lead to significant translation errors (see Tab. 7). For language shifts, we focus on closely related language pairs from the Tatoeba dataset (Tiedemann, 2012b) (see Tab. 6). We study the shifts induced by Catalan-Spanish, Portuguese-Spanish and Afrikaans-Dutch. Domain shifts, which occur when the model is exposed to a specific topic that was not seen during training, can also affect the quality of the translation (see Tab. 4). To simulate domain shifts, we use the Tatoeba MT dataset (Tiedemann, 2020) and the News Commentary dataset (Tiedemann, 2012b) as base datasets; the shifts are induced by the EuroParl dataset (Tiedemann, 2012a) and the EMEA dataset (Tiedemann, 2012b).

4. RESULTS ON LOFTER

LOFTER for dialogs. For conversational agents, an interesting scenario is when a goal-oriented agent designed to handle a specific type of conversation (e.g., customer conversations, daily dialogue) is exposed to an unexpected conversation. In this case, it is crucial to interrupt the agent so it does not damage the user's trust with misplaced responses (Perez et al., 2022). We rely on the MultiWOZ dataset (Zang et al., 2020), a human-to-human dataset collected in the Wizard-of-Oz setup (Kelley, 1984), for IN-distribution data. This choice is mostly motivated by the availability of models pretrained on MultiWOZ. For dialog shifts, we use spoken datasets from various sources which are part of the SILICONE benchmark (Chapuis et al., 2020). Specifically, we use a goal-oriented dataset (the Switchboard Dialog Act Corpus (SwDA) (Stolcke et al., 2000)), multi-party meeting datasets (MRDA (Shriberg et al., 2004) and the Multimodal EmotionLines Dataset MELD (Poria et al., 2018)), daily communication dialogs (DailyDialog (DyDA) (Li et al., 2017)), and scripted scenarios (IEMOCAP (Tripathi et al., 2018)). We refer the curious reader to Sec. A.4 for more details on each dataset. Metrics. OOD detection is usually framed as an unbalanced binary classification problem where the class of interest is OUT. We assess the performance of our OOD detectors with a focus on the false positive rate (FPR) and the true positive rate (TPR); to evaluate performance on the OOD task, we report the AUROC and the FPR. Area Under the Receiver Operating Characteristic curve (AUROC) (Bradley, 1997). The AUROC can be interpreted as the probability that an OOD sample has a higher anomaly score than an IN-distribution example. For this metric, higher is better. False Positive Rate at r% True Positive Rate (FPR). In many practical applications, we have to detect at least r% of the OOD samples, corresponding to a pre-defined safety level.
FPR quantifies the share of IN samples we wrongly flag under this constraint. This leads to selecting a threshold $\gamma_r$ such that the corresponding TPR equals r. In our work, r is set to 95%. Additional details on these metrics can be found in Sec. A.1. F1, precision and recall. In addition, we report the F1 scores of the detectors with a threshold set such that 80% of the IN dataset is classified as IN.
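Both metrics can be computed directly from the two score populations (a minimal numpy sketch, not the paper's evaluation code; the Gaussian toy scores are illustrative):

```python
import numpy as np

def auroc(scores_in, scores_out):
    """Probability that an OOD sample gets a higher anomaly score than an
    IN sample (ties counted as 1/2); equals the area under the ROC curve."""
    s_in, s_out = np.asarray(scores_in), np.asarray(scores_out)
    gt = (s_out[:, None] > s_in[None, :]).mean()
    eq = (s_out[:, None] == s_in[None, :]).mean()
    return float(gt + 0.5 * eq)

def fpr_at_tpr(scores_in, scores_out, r=0.95):
    """Share of IN samples flagged OOD when gamma is chosen so that
    a share r of the OOD samples is detected (TPR = r)."""
    gamma = np.quantile(scores_out, 1.0 - r)  # a share r of OOD scores exceeds gamma
    return float(np.mean(np.asarray(scores_in) > gamma))

# Toy scores: OOD anomaly scores shifted above the IN ones.
rng = np.random.default_rng(0)
s_in = rng.normal(0.0, 1.0, 5000)
s_out = rng.normal(2.0, 1.0, 5000)
```

For two unit-variance Gaussians two standard deviations apart, the AUROC is about 0.92; shifting the OOD scores further up drives the FPR at 95% TPR toward zero.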

4.2. EXPERIMENTS IN MACHINE TRANSLATION AND RESULTS

Results on language shifts. We assess, for each language pair, the OOD detection performance of RAINPROOF and report the average AUROC and FPR in Tab. 1a; detailed results are given in Tab. 8. We find that our no-reference methods ($a_{D_\alpha}$ and $a_{FR}$) not only achieve better performance than common no-reference baselines but also outperform the reference-based baseline. In particular, $a_{D_\alpha}$, with an AUROC of 0.95 and an FPR of 0.25, outperforms all considered methods. Moreover, while no-reference baselines only capture up to 62% of the OOD samples on average, ours detect up to 83.5%, even better than the with-reference baseline (75.3%). Results on domain shifts. We evaluate the OOD detection performance of RAINPROOF on domain shifts in Spanish and German with technical medical data and parliamentary data. We report the average OOD detection performance in Tab. 1a. In s0, we observe that $a_{D_\alpha}$ and $a_{FR}$ outperform the strongest baselines (i.e., Energy, MSP and sequence likelihood) by several AUROC points. Interestingly, even our no-reference detectors outperform the reference-based baseline (i.e., $a_M$). However, we find that relying on a reference set is a must-have in terms of FPR: while $a_{D_\alpha}$ achieves AUROC performance similar to its information projection counterpart $a^\star_{D_\alpha}$, the latter achieves a much better FPR.

4.3. EXPERIMENTS IN DIALOG GENERATION AND RESULTS

Results on dialog shifts. The dialog shifts benchmark is harder than the NMT benchmark, as all detectors achieve lower performance. It is the only case where our no-reference detectors do not outperform the Mahalanobis baseline, reaching only 0.79 in AUROC. The best baseline is the Mahalanobis distance, which achieves better performance on the dialog task than on NMT domain shifts, reaching an AUROC of 0.84. However, our reference-based detector built on the Rényi information projection secures a better AUROC (0.86) and a better FPR (0.52). Even though RAINPROOF outperforms all the baselines, shifts in dialog are hard to detect and will require further investigation. Non-aggregated results for dialog are provided in Ap. C; they show that RAINPROOF consistently outperforms the baselines on all datasets. Importance of distribution tails. Our results show that, when it comes to domain shift (domain shifts in translation or dialog shifts), reference-based detectors are required to obtain good results. They also show that the more these detectors take the tails of the distributions into account, the better they are, as displayed in Sec. B.1. We find that low values of $\alpha$ (near 0) yield better results with the Rényi information projection $a^\star_{D_\alpha}$. This suggests that the tails of the distributions used during text generation carry contextual information and insights on the processed texts. Such results are consistent with findings of recent works on automatic evaluation of text generation (Colombo et al., 2022b). Comparison to the Mahalanobis distance. Our reference-based detectors work with a small reference set: in our experiments, we use reference sets of size 10 to 2000. The Mahalanobis distance requires approximating the covariance matrix of the reference set; in our simulations, the embeddings of dimension 512 make this estimation unreliable. On the contrary, RAINPROOF, which relies on information projections, remains numerically sound with small reference sets.

5. TOWARDS A PRACTICAL EVALUATION OF OOD DETECTORS

Following previous work, we measure the performance of the detectors on the OOD detection task with AUROC and FPR. However, this evaluation framework neglects the impact of the detector on the overall system's performance. We identify three main evaluation criteria that matter in practice: execution time, overall system performance in terms of the quality of the generated sentences, and interpretability of the decision. Our study is conducted on NMT due to the existence of relevant and widely adopted metrics for assessing the quality of a generated sentence (i.e., BLEU (Papineni et al., 2002) and BERTSCORE (BERT-S) (Unanue et al., 2021)).

5.1. COMPLEXITY STUDY

Runtime and memory costs. We report in Tab. 1b the runtime of all methods. Detectors for s0 are faster than those for s1. Unlike detectors using references, the no-reference detectors require no additional memory and can easily be set up in a plug-and-play manner at the output of any model. Numerical stability. The Mahalanobis distance requires estimating both $\mu$ and $\Sigma^{-1}$ (see Sec. A.6). The dimension of the latent space of the considered pretrained models is either 768 or 512. In this setting, when the reference set is small, the estimation of the Mahalanobis parameters is numerically unstable. For s1, RAINPROOF relies on information projection and does not involve numerically unstable computations, but requires a larger memory footprint (0.5 GB) to store the reference set (2000 probability distributions of dimension 50K).
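A back-of-the-envelope check of the quoted footprint (assuming, as an illustration, float32 storage, which the paper does not specify):

```python
# 2000 stored bag-of-distributions vectors over a ~50K-token vocabulary,
# one float32 (4 bytes) per entry.
n_ref, vocab, bytes_per_float = 2000, 50_000, 4
footprint_gb = n_ref * vocab * bytes_per_float / 1e9  # = 0.4
```

This gives about 0.4 GB, consistent with the ~0.5 GB reported in the text.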

5.2. IMPACT OF OOD FILTERING ON TRANSLATION QUALITY

The main objective of OOD filtering is to remove samples that are far from the training distribution: on these samples, the user has no guarantee that the model will produce a good-quality translation. In this experiment, we compare the performance of the system with and without the different detectors in terms of the quality of the generated sentences. Global performance. We report the overall system performance with and without OOD filtering in Tab. 3. Finer performance analysis. In Tab. 4, we report the per-shift-type performance of $f_\theta$ with and without OOD detector. We observe a decrease in performance in the case of language and domain shifts, the latter being more harmful. On domain shifts, we observe that reference-based detectors decrease the system's performance on OOD samples. This means that these detectors tend to filter out samples that are well handled by the model and to keep sentences that are not. It is worth noting that reference-based detectors remove, in proportion, twice as many samples as their no-reference counterparts, while the threshold selection procedure remains the same. This observation also holds when removing fewer samples (i.e., calibrating $\gamma$ so that we remove 10%, 5% or even 1% of the IN dataset) (Tab. 15). Threshold-free analysis. In Tab. 2, we report the correlation between OOD scores and final task performance for the case of domain shifts; we refer the reader to Tab. 14 for the results on language shifts. We observe that the likelihood score is the most correlated with final sentence quality, as measured by BLEU or BERT-S. This finding illustrates that a higher correlation with sentence quality does not necessarily translate into higher performance gains when filtering OOD samples. It suggests that Quality Estimation (Specia et al., 2010; Blatz et al., 2004), while closely related, is a different problem. An important dimension fostering adoption is the ability to verify the decision taken by the automatic system (Montavon et al., 2018).
RAINPROOF offers a step in this direction when used with references: for each input sample, RAINPROOF finds the closest sample (in the sense of the Information Projection) in the reference set to take its decision. We present in Tab. 5 some OOD samples along with their translation scores, projection scores, and their projections on the reference set. We notice that, in general, sentences that are close to the reference set, and whose projection has a close meaning, are better handled by $f_\theta$. Therefore, one can visually interpret and validate the prediction of RAINPROOF. This observation further validates our method.

6. CONCLUSIONS

In this work, we introduced a detection framework called RAINPROOF and a new benchmark called LOFTER for detecting OOD samples when using text generators in the black-box scenario. Our work adopts an operational perspective by considering not only OOD performance but also task-specific metrics. Our results show that, despite good results in pure OOD detection, OOD filtering can harm the performance of the final system, as is the case for MSP or the Mahalanobis distance. We found that RAINPROOF breaks this curse and induces significant gains in translation performance, both on OOD samples and in general. In conclusion, this work paves the way for the development of detectors tailored to text generators and calls for a global evaluation when benchmarking future OOD detectors.

7. APPENDIX


A EXPERIMENTAL SETTING

In this section, we dive into the details and definitions of our experimental setting. First, we present our OOD detection performance metrics (Sec. A.1), then we provide a couple of samples from one of the small language shifts (Sec. A.3). We also discuss the choice of pretrained models (Sec. A.5) and how we adapted common OOD detectors to the text generation case (Sec. A.6). To evaluate the performance of our methods, we focus on and mainly report the AUROC and the FPR; we provide more detailed metrics and experiments in Sec. A.1. Area Under the Receiver Operating Characteristic curve (AUROC) (Bradley, 1997). The Receiver Operating Characteristic curve is obtained by plotting the true positive rate against the false positive rate; it is given by $\gamma \mapsto (\Pr(a(x) > \gamma \mid Z = 0), \Pr(a(x) > \gamma \mid Z = 1))$. The area under this curve is the probability that an OOD sample $x_{\text{out}}$ has a higher anomaly score than an in-distribution example $x_{\text{in}}$: AUROC $= \Pr(a(x_{\text{out}}) > a(x_{\text{in}}))$. False Positive Rate at 95% True Positive Rate (FPR). We require detecting at least a share $r$ of the OOD samples, corresponding to a defined level of safety, and want to know what share of IN samples we wrongly flag under this constraint. This leads to selecting a threshold $\gamma_r$ such that the corresponding TPR equals $r$; at this threshold, one computes $\Pr(a(x) > \gamma_r \mid Z = 0)$ with $\gamma_r$ s.t. $\mathrm{TPR}(\gamma_r) = r$. $r$ is chosen depending on the difficulty of the task at hand and the required level of safety. For the sake of brevity, we present only the AUROC and FPR metrics in our aggregated results, but we also used the Detection error and the Area Under the Precision-Recall curve; those are presented in our full results section (Ap. C). Detection error. It is simply the probability of misclassification for a given true positive rate. Area Under the Precision-Recall curve (AUPR-IN/AUPR-OUT) (Davis & Goadrich, 2006).
The Precision-Recall curve plots the precision (actual proportion of OOD among the predicted OOD) against the recall (true detection rate); its area is that under the curve $\gamma \mapsto (\Pr(Z = 1 \mid a(x) > \gamma), \Pr(a(x) > \gamma \mid Z = 1))$.

Catalan input | English reference | Model output | BLEU
— | He that will lie, will steal. | The one who's mindless, he'll steal. | 12.22
Jo sóc qui té la clau. | I'm the one who has the key. | Jo soc qui te la clau. | 5.69
En Tom surt a treballar cada matí a dos quarts de set. | Tom leaves for work at 6:30 every morning. | In Tom surt to pull each matí to two quarts of set. | 3.67
Ell m'ha dit que la seva casa era embruixada. | He told me that his house was haunted. | Ell m'ha dit that the seva house was haunted. | 27.78
Aquest és el lloc on va nèixer el meu pare. | This is the place where my father was born. | Aquest is the lloc on va nèixer el meu pare. | 8.30
Table 7: Example of the behavior of a language model trained to handle Spanish inputs when exposed to Catalan inputs.

A.4 DIALOG DATASETS

Switchboard Dialog Act Corpus (SwDA) is a corpus of telephone conversations. The corpus provides dialog act labels, topic and speaker information (Stolcke et al., 2000). ICSI MRDA Corpus (MRDA) contains transcripts of 75 hours of naturally occurring meetings involving more than 50 people (Shriberg et al., 2004). DailyDialog Act Corpus (DyDA) contains daily communications between people, covering topics such as small talk, the weather or daily activities (Li et al., 2017). Interactive Emotional Dyadic Motion Capture (IEMOCAP) (Tripathi et al., 2018) consists of transcripts of improvisations or scripted scenarios designed to elicit the expression of emotions.

A.5 CHOICES OF MODELS

Many pretrained models for conditional text generation are available. To perform our experiments we needed models that are well established and widely deployed, and for which OOD settings can be supported. For translation tasks we needed specialized models so that a notion of OOD can be easily defined; it would indeed be more hazardous to define a notion of OOD language when working with a multilingual model. The same holds for conversational models.

Neural Machine Translation models. We benchmark our OOD methods on the translation models provided by Helsinki NLP (Tiedemann & Thottingal, 2020) on several language pairs with large and small shifts, and extended the experiments to domain shift detection. These models are specialized in each language pair and widely recognized in the neural machine translation field. For our experiments we used the test sets provided along with these models, so we can consider that the models have been fine-tuned on the same distribution.

Conversational model. We used a DialogGPT (Zhang et al., 2019) model fine-tuned on the MultiWOZ dataset as our chatbot model. The fine-tuning on daily-dialogue-type tasks ensures that the model is specialized, allowing us to properly define samples that lie outside its range of expertise. Moreover, the choice of the architecture, DialogGPT, guarantees that our results hold on a very common architecture.

Additional fine-tuning. We further fine-tuned the models on the reference set to check whether additional fine-tuning on the in-distribution would affect the results. It did not change the results significantly (Tab. 17), which is not surprising considering that the models we used were already trained on a very similar distribution.

A.6 GENERALIZATION OF EXISTING OOD DETECTORS TO SEQUENCE GENERATION

In this section, we extend classical OOD detection scores to the conditional text generation setting. Common OOD detectors were built for classification tasks and need to be adapted: our task can be viewed as a sequence of classification problems with a very large number of classes (the size of the vocabulary). We chose the most naive approach, which consists of averaging the OOD scores over the sequence. We experimented with other aggregations, such as the min/max or the standard deviation, without obtaining interesting results.

Likelihood score. The most naive approach to build an OOD score is to rely solely on the log-likelihood of the sequence. For a conditioning x we define the log-likelihood score as a_L(x) = − Σ_{t=0}^{|ŷ|−1} log p_θ(ŷ_{t+1} | x, ŷ_{≤t}). Up to normalization, this score is equivalent to the (log-)perplexity of the sequence.

Average Maximum Softmax Probability score. The maximum softmax probability (Hendrycks & Gimpel, 2017) takes the probability of the mode of the categorical distribution as OOD score. We extend this definition to a sequence of probability distributions by averaging the score along the sequence. For a given conditioning x, we define the average MSP score a_MSP(x) = (1/|ŷ|) Σ_{t=1}^{|ŷ|} max_{i∈[0,K]} p^T_θ(i | x, ŷ_{≤t}). While it is closely linked to uncertainty measures, it discards most of the information contained in the probability distribution: only the mode is kept. We claim that much more information can be retrieved by studying the whole distribution.

Average Energy score. We extend the definition of the energy score described in Liu et al. (2020) to a sequence of probability distributions by averaging it along the sequence. For a given conditioning x and a temperature T, we define the average energy of the sequence as a_E(x) ≜ −(T/|ŷ|) Σ_{t=1}^{|ŷ|} log Σ_{i}^{|Ω|} e^{f_θ(x, ŷ_{≤t})_i / T}. It corresponds to the normalization term of the softmax function applied to the logits.
While it takes into account the whole distribution, it only measures the amount of unnormalized mass before normalization, without attention to how this mass is distributed across the features.

Mahalanobis distance. Following Lee et al. (2018c), we compute the Mahalanobis distance based on the samples of a given reference set R. Since we use encoder-decoder models, we take the output of the last hidden layer of the encoder as embedding; let us denote by ϕ(x) this embedding for a conditioning x, and let µ and Σ be respectively the mean and the covariance of these embeddings on the reference set. We define a_M(x) = (1 + (ϕ(x) − µ)^⊤ Σ^{−1} (ϕ(x) − µ))^{−1}.
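The three sequence-averaged scores above (likelihood, average MSP, average energy) can be computed directly from the per-step logits of the generated sequence. The sketch below is a minimal illustration assuming a logits matrix of shape (steps, vocabulary); the function names are ours:

```python
import numpy as np

def log_softmax(logits, T=1.0):
    """Numerically stable log-softmax with temperature T."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def likelihood_score(logits, token_ids):
    """a_L(x): negative log-likelihood of the generated tokens y_hat."""
    logp = log_softmax(logits)
    return -float(logp[np.arange(len(token_ids)), token_ids].sum())

def avg_msp_score(logits, T=1.0):
    """a_MSP(x): mean over steps of the maximum softmax probability."""
    p = np.exp(log_softmax(logits, T))
    return float(p.max(axis=-1).mean())

def avg_energy_score(logits, T=1.0):
    """a_E(x): minus T times the mean log-sum-exp of the logits,
    i.e. the softmax normalization term averaged over the sequence."""
    lse = np.log(np.exp(logits / T).sum(axis=-1))
    return -float(T * lse.mean())
```

On perfectly uniform logits, the average MSP is 1/K for a vocabulary of size K and the average energy is −log K, which makes the link to the softmax normalizer explicit.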

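The Mahalanobis score above can be sketched as follows, assuming the encoder embeddings of the reference set are available as plain vectors. The use of a pseudo-inverse for a possibly singular empirical covariance is our own robustness choice, not necessarily the original implementation:

```python
import numpy as np

def fit_mahalanobis(ref_embeddings):
    """Estimate the mean and inverse covariance of the encoder
    embeddings phi(x) on the reference set R."""
    X = np.asarray(ref_embeddings, dtype=float)
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    return mu, np.linalg.pinv(sigma)

def mahalanobis_score(phi_x, mu, sigma_inv):
    """a_M(x) = (1 + (phi(x) - mu)^T Sigma^{-1} (phi(x) - mu))^{-1}."""
    d = np.asarray(phi_x, dtype=float) - mu
    return float(1.0 / (1.0 + d @ sigma_inv @ d))
```

With this formulation the score is 1 at the reference mean and decays toward 0 as the embedding moves away from the reference set.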
B PARAMETERS TUNING

Detectors rely on their anomaly score to make decisions, and these scores can be parametric. First, soft-probability-based scores depend on the softmax distribution and its scaling, so the temperature is a crucial parameter to tune to get the best performance: a small temperature tends to make the distribution more peaked, while higher values spread the probability mass across the classes. Moreover, the Rényi divergence and its related information projection depend on a parameter α. We provide here further results and analysis of the impact of those parameters on our results.

B.1 IMPACT OF α

In Fig. 1 we present the impact of the size of the reference set and of the parameter α on the Rényi information projection when distinguishing dialog shifts. As expected, the larger the reference set, the better. However, we also see that smaller values of α yield better results. We recall that the Rényi divergence is defined as D_α(p∥q) = 1/(α−1) log Σ_{i=1}^{|Ω|} p_i^α q_i^{1−α}, where α ∈ R_+ \ {1}. Smaller values of α distribute the weight of each feature more equally in the final divergence; more specifically, they tend to give equal weight to the very likely outcomes and to the less likely ones, therefore giving more weight to the tail of the distribution. When α tends to 0, the Rényi divergence actually counts the number of nonzero common probabilities. This makes sense in terms of topic detection: it counts the tokens jointly considered during text generation.

C PERFORMANCE OF OUR DETECTORS IN OOD DETECTION

C.1 SUMMARY OF OUR RESULTS

In Fig. 2 we present the performance of all the detectors we studied. We can see that our detectors outperform the baselines on every task. On dialog shifts, while the Mahalanobis distance clearly outperforms our detectors for s_0, our detectors still outperform the baselines by far in their scenarios.
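The small-α behavior of the Rényi divergence discussed above can be checked numerically. A minimal sketch (our own helper; zero-probability terms of p are dropped, matching the convention 0^α = 0 for α > 0):

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """D_alpha(p || q) = 1/(alpha - 1) * log sum_i p_i^alpha q_i^(1-alpha)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # drop zero-probability terms of p
    s = (p[mask] ** alpha * q[mask] ** (1.0 - alpha)).sum()
    return float(np.log(s) / (alpha - 1.0))
```

As α tends to 0, D_α(p∥q) approaches −log of the q-mass on the support of p, which is what underlies the "counting common nonzero probabilities" reading: the divergence only depends on which outcomes p and q share, not on how much mass each one carries.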

C.2 DETAILED RESULTS OF OOD DETECTION PERFORMANCES

In this section we present the performance of our OOD detectors on each individual task, i.e., for each pair of IN and OOD data, with all the considered metrics. We show that our detectors outperform the other OOD detection baselines in almost all scenarios.

Table 9: Detailed results of the performance of our OOD detectors on different domain shifts. For Spanish (spa) and German (de), we present two domain shifts: technical medical data (EMEA) and legal parliamentary texts (parl), against common language embodied by the Tatoeba dataset (tat).

D NMT PERFORMANCE

Surprisingly, we show that common OOD detectors tend to exclude samples that are well handled by the model and keep some that are not, leading to decreased overall performance in terms of translation metrics. Moreover, this phenomenon seems more pronounced for reference-based detectors. We show that our uncertainty-based detectors mostly avoid that pitfall and provide both good OOD detection and improved translation performance.

D.1 ABSOLUTE PERFORMANCES

It is clear (and somewhat expected) that NMT models do not perform as well on OOD data, as can be seen in Tab. 11b. However, we find that our OOD detectors are able to remove most of the worst-case samples while keeping enough well-translated samples, so that with correct filtering our method actually allows the model to achieve acceptable BLEU scores.

E NEGATIVE RESULTS

E.1 DIFFERENT AGGREGATION OF OOD METRICS

Most of our detectors are originally classification OOD detectors that we adapted to text generation by averaging them over the generated sequences and using this aggregated score as a score for the whole sequence. We experimented with other aggregations, such as the standard deviation or the min/max along the sequence. While the standard deviation gave relatively good results, they were still less interesting than the naive average.
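The aggregation variants compared above can be sketched as follows. This is a minimal illustration with our own function names; `np.std` here is the population standard deviation:

```python
import numpy as np

# Candidate ways to collapse per-token OOD scores into one sequence score.
AGGREGATIONS = {
    "mean": np.mean,  # the aggregation retained in the paper
    "min": np.min,
    "max": np.max,
    "std": np.std,
}

def aggregate_token_scores(token_scores, how="mean"):
    """Turn per-token OOD scores into a single sequence-level score."""
    return float(AGGREGATIONS[how](np.asarray(token_scores, dtype=float)))
```

Swapping the aggregation is then a one-argument change, which is how the min/max and standard-deviation variants were compared against the naive average.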

E.2 NEGENTROPY OF BAG OF DISTRIBUTIONS

We introduced in Sec. 3.3 the bag of distributions as a way to aggregate a sequence of probability distributions and compare it to a reference set using information projections. A natural idea would be to apply the Negentropy methods (Sec. 3.2) to these aggregated distributions. More formally, given a sequence of probability distributions S_θ(x) = {p^T_θ(·|x, ŷ_{≤t})}_{t=1}^{n}, we would compute its bag of distributions p̄_θ(x) ≜ (1/|ŷ|) Σ_{t=1}^{|ŷ|} p_θ(·|x, ŷ_{≤t}) and then compute as novelty score J_D(p̄) = D(p̄∥U). Further experiments showed that this process was unable to discriminate OOD samples or to improve translation performance. We suspect that the per-step uncertainty is key to capturing the behavior of the language model, and that this uncertainty information is lost when averaging the probability distributions along the sequence.
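A minimal sketch of this (ultimately ineffective) score follows; here the divergence D is instantiated as the KL divergence to the uniform distribution, which is one possible choice among those the paper considers:

```python
import numpy as np

def bag_of_distributions(step_distributions):
    """p_bar(x): average the per-step next-token distributions
    over the generated sequence (rows of shape (T, |Omega|))."""
    return np.asarray(step_distributions, dtype=float).mean(axis=0)

def negentropy_score(p_bar):
    """J(p_bar) = KL(p_bar || U) = log K - H(p_bar): how far the
    averaged distribution is from uniform over K outcomes."""
    K = len(p_bar)
    mask = p_bar > 0  # 0 * log 0 = 0 convention
    return float((p_bar[mask] * np.log(p_bar[mask] * K)).sum())
```

The sketch also makes the failure mode visible: two confident but contradictory step distributions average to a near-uniform bag, so the score drops to zero even though each step was highly certain, i.e. the per-step uncertainty is erased by the averaging.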

E.3 DIFFERENT REFERENCE DISTRIBUTIONS

In the no-reference scenario we used the uniform distribution as the reference distribution to compare against. However, other reference distributions can obviously be used. We tried two natural options: the tf-idf distribution (Yuan et al., 2021) and the average distribution over the reference set, the latter effectively replacing the projection onto the reference set by the distance to its average element. It is worth noting that these methods fall into the reference scenario, since the reference set is needed to compute these statistics; they would nevertheless be interesting if they could maintain performance while being less computationally expensive than the projection. We found that these references were not as efficient as the projection onto the reference set and did not achieve better performance than their no-reference counterparts.
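The difference between the projection and the cheaper average-reference variant can be sketched as follows, with KL as the divergence (an illustrative assumption; the paper's information projections also use Rényi divergences):

```python
import numpy as np

def kl(p, q):
    """KL(p || q), dropping zero-probability terms of p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

def score_vs_average_reference(p_bar_x, reference_bags):
    """Divergence to the average element of the reference set:
    one divergence evaluation per input."""
    q_avg = np.asarray(reference_bags, dtype=float).mean(axis=0)
    return kl(p_bar_x, q_avg)

def score_projection(p_bar_x, reference_bags):
    """Projection onto the reference set: distance to the closest
    element, one divergence evaluation per reference sample."""
    return min(kl(p_bar_x, q) for q in np.asarray(reference_bags, dtype=float))
```

The average-reference score is O(1) in the reference set size at query time, while the projection is O(|R|); the negative result reported above is that this saving comes with a loss in detection performance.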



Afrikaans is a daughter language of Dutch (Jansen et al., 2007). The Dutch sentence "Appelen zijn gewoonlijk groen, geel of rood" can be translated as "Appels is gewoonlik groen, geel of rooi."
The Kleene closure corresponds to sequences of arbitrary size written with words in Ω. Formally: Ω* = ∪_{i=0}^{∞} Ω^i.
The likelihood of the sequence is the same as the perplexity. In our work we report the log-likelihood for numerical stability reasons, i.e., a_L(x) = − Σ_{t=0}^{|ŷ|−1} log p_θ(ŷ_{t+1} | x, ŷ_{≤t}).
It is also worth pointing out that doing a projection at each timestep would require a per-step reference set, in addition to the computational time required to actually compute the projections; we therefore decided to aggregate the probability distributions over the sequence.
The minimization problem of Eq. finds numerous connections in the theory of large deviations (Sanov, 1958) and in statistical physics (Jaynes, 1957).



Figure 1: Impact of α on the performance of the Rényi information projection for dialog shifts detection. A smaller α increases the weight of the tail of the distribution. An α of 0 would consist in counting the number of the common non zero elements.

A.1 Additional Details on Metrics
A.2 Language pairs
A.3 Samples
A.4 Dialog datasets
A.5 Choices of models
A.6 Generalization of existing OOD detectors to Sequence Generation
B Parameters tuning
B.1 Impact of α
C Performance of our detectors in OOD detection
C.1 Summary of our results
C.2 Detailed results of OOD detection performances
C.3 ROC AUC curves
D NMT performance
D.1 Absolute performances
D.2 Gains
D.3 Effect of a Larger threshold on NMT performance
E Negative results
E.1 Different aggregation of OOD metrics
E.2 Negentropy of bag of distributions
E.3 Different reference distributions
E.4 Impact of additional finetuning on IN data

A.1 ADDITIONAL DETAILS ON METRICS

OOD detection is usually an unbalanced binary classification problem where the class of interest is OUT. Let us denote by Z the random variable corresponding to actually being out-of-distribution. We can assess the performance of our OOD detectors by focusing on the false alarm rate and on the true detection rate. The False alarm rate or False positive rate (FPR) is the proportion of IN samples misclassified as OUT. For a score threshold γ, we have FPR = Pr(a(x) > γ | Z = 0). The True detection rate or True positive rate (TPR) is the proportion of OOD samples that are detected by the method. It is given by TPR = Pr(a(x) > γ | Z = 1).

Figure 2: Trade-offs between AUROC and FPR for each task and metric.

Figure 4: ROC-AUC curves for our reference based metrics compared to common baselines for language shifts detection. Baselines are represented in dashed lines.

Figure 5: ROC-AUC curves for our uncertainty based metrics compared to common baselines for domain shifts detection. Baselines are represented in dashed lines.

Figure 9: Gain in translation performances when filtering OOD samples with our method on different datasets and language pairs.

Summary of the performance and computational cost of every detector.

Correlation between OOD scores and translation metrics BLEU and BERT-S on domain shifts datasets.

Average impact of different OOD detectors on the BLEU score for different types of data: IN data only, OOD data, and the combination of both (ALL). For each we report the absolute average BLEU score (Abs.), the average gains in BLEU (G.s) compared to a setting without OOD filtering (f θ only) and the share of the subset removed by the detector (R.Sh.). These results are achieved by setting γ such that we remove 20% of the IN dataset.

In Tab. 3, we report the global performance of the systems (f θ) without and with OOD detectors on IN samples, OOD samples and all samples (ALL). From the first row of Tab. 3, we notice that OOD samples are harmful to the model. We observe that, in most cases, adding detectors increases the model performance on IN, OOD and ALL samples. Exceptions include a_MSP (for OOD, IN and ALL) and a

Detailed impacts on NMT performance per task (Domain- or Language-shifts) of the different OOD detectors. We present results on the different parts of the data: IN data, OOD data and the combination of both, ALL. For each we report the absolute average BLEU score (Abs.), the average gains in BLEU (G.s.) compared to a setting without OOD filtering (f θ only) and the share of the subset removed by the detector (R.Sh.). We provide more detailed results on each dataset in Ap. D.

OOD inputs, their translations and projections onto the reference set.

captures the trade-off between precision and recall made by the model. A high value represents both high precision and high recall, i.e., the detector captures most of the positive samples while raising few false alarms.

Summary of models and studied shifts.

Detailed results of the performances of our OOD detectors on different language shifts. The first language of the pair is the reference language of the model and the second one is the studied shift.

Detailed performance results of our OOD detectors on dialog shift against the MultiWOZ dataset as reference set. Figure 7: ROC-AUC curves for our uncertainty based metrics compared to common baselines for dialog shifts detection. Baselines are represented in dashed lines.

Absolute translation performance in terms of BLEU on the different subsets (IN, OOD, ALL) of each dataset of our translation OOD performance benchmark. Columns: spa-cat, spa-por, nld-afr, spa:tat-parl, de:news-parl, spa:tat-EMEA, de:news-EMEA; rows indexed by Scenario and Score.

Share of the datasets removed when taking γ so that we keep 80% of the IN distribution.

D.2 GAINS

D.3 EFFECT OF A LARGER THRESHOLD ON NMT PERFORMANCE

Detailed impacts on NMT performance per task (Domain- or Language-shifts) of the different OOD detectors with a threshold defined to keep 99% of the IN data. We present results on the different parts of the data: IN data, OOD data and the combination of both, ALL. For each we report the absolute average BLEU score (Abs.), the average gains in BLEU (G.s.) compared to a setting without OOD filtering (f θ only) and the share of the subset removed by the detector (R.Sh.).

Correlation between OOD scores and translation metrics BLEU and BERT-S

Summary of the results including different custom reference distributions.

Summary of the results with additional finetuning of the models on the reference set. The results are similar to the results without finetuning as expected since the models had been trained initially on similar distributions.

