OUT-OF-DISTRIBUTION DETECTION AND SELECTIVE GENERATION FOR CONDITIONAL LANGUAGE MODELS

Abstract

Machine learning algorithms typically assume independent and identically distributed samples in training and at test time. Much work has shown that high-performing ML classifiers can degrade significantly and provide overly confident, wrong classification predictions, particularly for out-of-distribution (OOD) inputs. Conditional language models (CLMs) are predominantly trained to classify the next token in an output sequence, and may suffer even worse degradation on OOD inputs as the prediction is done auto-regressively over many steps. Furthermore, the space of potential low-quality outputs is larger, as arbitrary text can be generated, so it is important to know when to trust the generated output. We present a highly accurate and lightweight OOD detection method for CLMs, and demonstrate its effectiveness on abstractive summarization and translation. We also show how our method can be used under the common and realistic setting of distribution shift for selective generation (analogous to selective prediction for classification) of high-quality outputs, while automatically abstaining from low-quality ones, enabling safer deployment of generative language models.

1. INTRODUCTION

Recent progress in generative language models (Wu et al., 2016a; Radford et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Zhang et al., 2020) has led to quality approaching human performance on research datasets and has opened up the possibility of wide deployment beyond the academic setting. In realistic user-facing scenarios such as text summarization and translation, it should be expected that user-provided inputs can deviate significantly from the training data distribution. This violates the independent, identically distributed (IID) assumption commonly used in evaluating machine learning models. Many have shown that the performance of machine learning models can degrade significantly and in surprising ways on OOD inputs (Nguyen et al., 2014; Goodfellow et al., 2014; Ovadia et al., 2019). For example, an image classifier may detect cows in images with very high accuracy on its IID test set but confidently fail to detect a cow when it is paired with an unseen background (Murphy, 2023; Nagarajan et al., 2020). This has led to active research on OOD detection in a variety of domains, including vision and text, though focused primarily on classification. Salehi et al. (2021), Bulusu et al. (2020), and Ruff et al. (2021) provide comprehensive reviews of this topic.

Conditional language models are typically trained, given an input sequence $x = x_1 \ldots x_L$, to autoregressively generate the next token of an output sequence $y = y_1 \ldots y_T$ as a classification over the token vocabulary $V$: $p_\theta(y|x) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, x)$, $y_t \in V$. Consequently, the perils of out-of-distribution inputs are arguably more severe because (a) errors propagate and magnify through auto-regression, and (b) the space of low-quality outputs is greatly increased, as arbitrary text sequences can be generated. Common errors from text generation models include disfluencies (Holtzman et al., 2020) and factual inaccuracies (Goodrich et al., 2019; Maynez et al., 2020).
A common failure case we observed in abstractive summarization is for the model to output "All images are copyrighted" as the summary for news articles from a publisher (CNN) different from the one it was trained on (BBC) (see Figure A.7). In this work, we propose OOD detection methods for CLMs, using abstractive summarization and translation as case studies. Similar to classification, we show in Section 2.1 that CLMs produce untrustworthy likelihood estimates on OOD examples, making perplexity a poor choice for OOD detection. In Section 2.2, we propose a highly accurate, simple, and lightweight OOD score based on the model's input and output representations (or embeddings), requiring negligible additional compute beyond the model itself.

While accurate OOD detection enables the conservative option of abstaining from generation on OOD examples, it may be desirable to generate on known near-domain data, e.g. summaries for articles from news publishers that differ from our fine-tuning set. The ability to selectively generate where the model is more likely to produce higher-quality outputs thus enables safer deployment of conditional language models. We call this procedure selective generation, analogous to the commonly used term selective prediction in classification (Chow, 1957; Bartlett & Wegkamp, 2008; Geifman & El-Yaniv, 2017). In Section 4, we show that while model perplexity is a reasonable choice for performing selective generation on in-domain examples, combining it with our OOD score works much better when the input distribution is shifted. In summary, our contributions are:

• Propose lightweight and accurate scores derived from a CLM's embeddings for OOD detection, significantly outperforming baselines on abstractive summarization and translation tasks, without the need for a separate detection model.
• Show that model perplexity can be an unreliable signal for quality estimation on OOD examples, but combined with our OOD scores can be used effectively to selectively generate higher-quality outputs while abstaining on lower-quality ones.
• Propose an evaluation framework for OOD detection and selective generation for CLMs, including human quality ratings for summarization.

2. OOD DETECTION IN CONDITIONAL LANGUAGE MODELS

2.1. PERPLEXITY IS NOT WELL SUITED FOR OOD DETECTION

The maximum softmax probability (MSP), $p(\hat{y}|x)$ with $\hat{y} = \arg\max_{k=1,\ldots,K} p(k|x)$, is a simple, commonly used OOD score for $K$-class classification problems (Hendrycks & Gimpel, 2016; Lakshminarayanan et al., 2017). For CLMs, the perplexity, which is monotonically related to the negative log-likelihood of the output sequence averaged over tokens, $-\frac{1}{T}\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x)$, is a natural OOD score to consider, and is analogous to the negative MSP in classification because both are based on softmax probabilities. We first study how well perplexity performs for OOD detection.

Perplexity is not well suited for OOD detection due to significant overlap between in-domain and OOD scores. In Figure 1, we compare the distributions of perplexity for (a) a summarization model and (b) a translation model, each trained on an in-domain dataset and evaluated on multiple OOD datasets. For summarization, a model is trained on xsum and evaluated on other news datasets, including cnn dailymail and newsroom as near-OOD datasets, and forum (forumsum) and dialogue (samsum and reddit tifu) datasets as far-OOD (see Section 3 for details). The perplexity distributions overlap significantly with each other even though the input documents are significantly different. Furthermore, perplexity assigns cnn dailymail even lower scores than the in-domain xsum. For translation, the model is trained on the WMT15 dataset and evaluated on other WMT test splits (Bojar et al., 2015), OPUS100 (Aulamo & Tiedemann, 2019), and MTNT (Michel & Neubig, 2018). The in-domain and OOD perplexity densities overlap even more. Overall, these results suggest that perplexity is not well suited for OOD detection.
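For concreteness, the length-normalized negative log-likelihood underlying the perplexity score can be sketched as follows (illustrative code, assuming access to the per-token log-probabilities from the decoder):

```python
import numpy as np

def sequence_perplexity_score(token_log_probs):
    """Length-normalized NLL: -(1/T) * sum_t log p(y_t | y_<t, x).

    Higher values mean the model is less confident in the output sequence;
    monotonically related to perplexity, so usable as the same OOD score.
    """
    return float(-np.mean(token_log_probs))

# Toy example: a confidently decoded sequence vs. an uncertain one.
confident = np.log([0.9, 0.8, 0.95])
uncertain = np.log([0.3, 0.2, 0.25])
assert sequence_perplexity_score(uncertain) > sequence_perplexity_score(confident)
```

As discussed in this section, however, this score separates in-domain from OOD inputs poorly in practice.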

2.2. DETECTING OOD USING CLM'S EMBEDDINGS

Given a trained conditional language model, we propose using the input and output representations/embeddings computed as part of the inference/generation process to detect OOD examples. In this work, we use Transformer encoder-decoder models and obtain the input embedding $z$ by averaging the encoder's final-layer hidden state vectors $h_i \in \mathbb{R}^d$ ($d$ is the hidden dimension) corresponding to the input sequence tokens $x_i$. To obtain the output embedding $w$, we average the decoder's final-layer hidden state vectors $g_i \in \mathbb{R}^d$ corresponding to the output tokens $y_i$. Thus

$z := \frac{1}{L} \sum_{i=1}^{L} h_i, \quad w := \frac{1}{T} \sum_{i=1}^{T} g_i, \quad z, w \in \mathbb{R}^d,$

where $L$ and $T$ are the input and output sequence lengths respectively. Figure 2 illustrates the idea. Intuitively, if the embedding of a test input or output is far from the embedding distribution of the training data, it is more likely to be OOD. One way of measuring this distance is to fit a Gaussian, $\mathcal{N}(\mu, \Sigma)$, $\mu \in \mathbb{R}^d$, $\Sigma \in \mathbb{R}^{d \times d}$, to the training embeddings and use the Mahalanobis distance (MD):

$\mathrm{MD}(x; \mu, \Sigma) := (x - \mu)^T \Sigma^{-1} (x - \mu).$

This has been used for OOD detection with representations from classification models (Lee et al., 2018) by computing distances to class-conditional Gaussians. Unlike classification, which has class labels, in conditional language modeling we have paired input and output text sequences. We fit one Gaussian on the training input embeddings, $\mathcal{N}(\mu_z, \Sigma_z)$, and a second Gaussian on the embeddings of the training ground-truth outputs, $\mathcal{N}(\mu_w, \Sigma_w)$. For a test input and output embedding pair $(z_{test}, w_{test})$, the input MD is computed as

$\mathrm{MD}_{input}(z_{test}) := \mathrm{MD}(z_{test}; \mu_z, \Sigma_z)$ (Input MD OOD score)

and the output MD is computed similarly:

$\mathrm{MD}_{output}(w_{test}) := \mathrm{MD}(w_{test}; \mu_w, \Sigma_w).$ (Output MD OOD score)

Mahalanobis distance is equivalent to computing the negative log-likelihood of a Gaussian distribution (up to a constant and a scalar), i.e.

$-\log p(z) = \frac{d}{2} \log(2\pi) + \frac{1}{2} \log |\Sigma| + \frac{1}{2} (z - \mu)^T \Sigma^{-1} (z - \mu) = \mathrm{const.} + \frac{1}{2} \mathrm{MD}(z).$

Ren et al. (2019) showed that normalizing the likelihood by the likelihood of a background model works better for OOD detection. In a similar vein, Ren et al. (2021) proposed an analogous Relative Mahalanobis Distance (RMD) for classification, using the relative distance between the class-conditional Gaussians and a single background Gaussian fit on data from all classes. That method cannot be directly applied to CLMs because outputs are not just class labels. Thus, in this work, we extend the RMD idea to conditional language models:

$\mathrm{RMD}_{input}(z_{test}) := \mathrm{MD}_{input}(z_{test}) - \mathrm{MD}_0(z_{test}),$ (Input RMD OOD score)

where $\mathrm{MD}_0(z_{test}) := \mathrm{MD}(z_{test}; \mu_{z_0}, \Sigma_{z_0})$ is the MD to a background Gaussian $\mathcal{N}(\mu_{z_0}, \Sigma_{z_0})$ fit on a large, broad dataset that approximately represents all domains. In practice, we use C4, a large Common Crawl-based English dataset (Raffel et al., 2020), and ParaCrawl's English-French dataset (Bañón et al., 2020) as the data for fitting the background distributions for summarization and translation, respectively. While we use the ground-truth outputs to fit $\mathcal{N}(\mu_w, \Sigma_w)$, we decode outputs from the trained CLMs and use those output embeddings to fit the background output Gaussian, $\mathcal{N}(\mu_{w_\delta}, \Sigma_{w_\delta})$:

$\mathrm{RMD}_{output}(w_{test}) := \mathrm{MD}_{output}(w_{test}) - \mathrm{MD}_\delta(w_{test}),$ (Output RMD OOD score)

where $\mathrm{MD}_\delta(w_{test}) := \mathrm{MD}(w_{test}; \mu_{w_\delta}, \Sigma_{w_\delta})$ is the MD to the decoded-output background distribution. See Algorithms 1 and 2 for the detailed steps. Using decoded outputs serves two purposes: (1) we do not require supervised data (e.g. document-summary pairs) to fit the background Gaussian; (2) decoded outputs may exhibit increased deficiencies that result from running the model on out-of-distribution data, which provides greater contrast with the in-domain ground-truth labels.
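The MD and RMD scores above can be sketched in a few lines of NumPy (illustrative only; the synthetic Gaussian samples below merely stand in for real in-domain and C4/ParaCrawl embeddings):

```python
import numpy as np

def mean_pool(hidden_states):
    """Average final-layer hidden states (length x d) into one embedding."""
    return np.asarray(hidden_states).mean(axis=0)

def fit_gaussian(embeddings):
    """Fit N(mu, Sigma) to stacked embeddings (N x d)."""
    X = np.asarray(embeddings)
    return X.mean(axis=0), np.cov(X, rowvar=False)

def mahalanobis(z, mu, sigma):
    """MD(z; mu, Sigma) = (z - mu)^T Sigma^{-1} (z - mu)."""
    diff = z - mu
    return float(diff @ np.linalg.solve(sigma, diff))

rng = np.random.default_rng(0)
in_domain = rng.normal(2.0, 0.5, size=(500, 4))    # stand-in for training input embeddings
background = rng.normal(0.0, 3.0, size=(500, 4))   # stand-in for C4/ParaCrawl embeddings

mu_in, sig_in = fit_gaussian(in_domain)
mu_bg, sig_bg = fit_gaussian(background)

def input_rmd(z):
    """Input RMD OOD score: MD to the in-domain Gaussian minus MD to the
    background Gaussian. Negative = relatively in-domain, positive = OOD."""
    return mahalanobis(z, mu_in, sig_in) - mahalanobis(z, mu_bg, sig_bg)
```

On this toy data, an embedding near the in-domain cluster scores negative, while one far from it scores positive, matching the sign convention described below.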
The RMD score can be regarded as a background contrastive score that indicates how close the test example is to the training domain compared to the background domains. A negative score suggests the example is relatively in-domain, while a positive score suggests it is OOD; a higher score indicates greater OOD-ness.

Binary classifier for OOD detection

Since we have explicitly defined two classes, in-domain and background/general domain, another option is to train a binary classifier to discriminate embeddings from the two classes. We train a logistic regression model and use the un-normalized logit for the background class as an OOD score. The Input Binary logits OOD score uses the input embeddings as features, whereas the Output Binary logits OOD score uses the decoded output embeddings as features. A higher score suggests a higher likelihood of OOD. The use of logits rather than probabilities was also recommended by previous OOD studies for classification problems (Hendrycks et al., 2019). Though RMD is a generative-model-based approach and the binary classifier is a discriminative model, we show that RMD is a generalized version of binary logistic regression and reduces to a binary classification model under certain conditions (see Section A.5 for details).
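The binary-classifier variant can be sketched as follows; this is an illustrative NumPy logistic regression trained on synthetic stand-ins for the in-domain and background embeddings (in practice any logistic regression implementation would do):

```python
import numpy as np

rng = np.random.default_rng(0)
in_domain = rng.normal(0.0, 1.0, size=(500, 8))    # stand-in for in-domain embeddings
background = rng.normal(1.5, 1.0, size=(500, 8))   # stand-in for background embeddings

X = np.vstack([in_domain, background])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 1 = background / general domain

# Plain logistic regression trained by batch gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

def binary_logit_ood_score(z):
    # The un-normalized logit for the background class; higher = more OOD.
    return float(z @ w + b)
```

An embedding near the background cluster then receives a positive logit (OOD-like), while one near the in-domain cluster receives a negative logit.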

3.1. EXPERIMENT SETUP

We run our experiments using Transformer (Vaswani et al., 2017) encoder-decoder models trained for abstractive summarization and translation. Below we specify the datasets used for training/fine-tuning (i.e. in-domain) and the OOD datasets. In the case of summarization, OOD datasets can be intuitively categorized as near- or far-OOD based on the nature of the documents. For example, news articles from different publishers may be considered as sourced from different distributions, but are closer to each other than news articles are to dialogue transcripts. We also show this quantitatively using an n-gram overlap analysis in Table A.10. In contrast, the translation datasets we use consist of English-French sentence pairs with less variation between datasets due to the shorter length of sentences.

Summarization model

We fine-tuned PEGASUS LARGE (Zhang et al., 2020) on the xsum (Narayan et al., 2018) dataset, consisting of BBC News articles with short, abstractive summaries.

Summarization datasets

We use 10,000 examples from the xsum and C4 training splits to fit the in-domain/foreground and background Gaussian distributions, respectively. For test datasets, we use cnn dailymail (Hermann et al., 2015; See et al., 2017), news articles and summaries from CNN and DailyMail; newsroom (Grusky et al., 2018), article-summary pairs from 38 major news publications; reddit tifu (Kim et al., 2018), informal stories from the sub-reddit TIFU with author-written summaries of very diverse styles; and samsum (Gliwa et al., 2019) and forumsum (Khalman et al., 2021), high-quality summaries of casual dialogues.

Translation model

We train a Transformer base model (Vaswani et al., 2017) with embedding size 512 on WMT15 English-French (Bojar et al., 2015). The model is trained with the Adafactor optimizer (Shazeer & Stern, 2018) for 2M steps with 0.1 dropout and a batch size of 1024. Decoding uses beam search with beam size 10 and α = 0.6 length normalization (Wu et al., 2016b). The best checkpoint scores 39.9 BLEU on newstest2014.

Translation datasets

We evaluate on WMT test splits (Bojar et al., 2015) and the law, Koran, medical, IT, and subtitles (sub) subsets from OPUS (Tiedemann, 2012; Aulamo & Tiedemann, 2019). We also use the English-French test set of MTNT (Michel & Neubig, 2018), consisting of noisy comments from Reddit.

Evaluation metric

We use the area under the ROC curve (AUROC), with the in-domain test data as the negative set and the OOD test data as the positive set, to evaluate and compare OOD detection performance. An AUROC of 1.0 means perfect separation, and 0.5 means the two are indistinguishable.

Baseline methods

We compare our proposed OOD scores with various baseline methods, including (1) the model perplexity score and (2) the embedding-based Mahalanobis distance. In addition, we compare with (3) the Natural Language Inference (NLI) score (Honovich et al., 2022) for summarization, and (4) COMET (Rei et al., 2020) and (5) Prism (Thompson & Post, 2020) for translation.
NLI score measures the factual consistency by treating the input document as a premise and the generated summary as a hypothesis. Both COMET and Prism are quality estimation metrics designed to measure translation quality without access to a human reference. More specifically, COMET finetunes the large XLM-R model (Conneau et al., 2020) on human evaluation data, and Prism is the perplexity score from a multilingual NMT model trained on 99.8M sentence pairs in 39 languages.
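The AUROC evaluation described above can be sketched with a small rank-based implementation, equivalent to the Mann-Whitney U statistic (illustrative; it assumes continuous scores without ties):

```python
import numpy as np

def auroc(ood_scores_in, ood_scores_ood):
    """AUROC with OOD as the positive class.

    1.0 = perfect separation of in-domain and OOD scores, 0.5 = chance.
    """
    scores = np.concatenate([ood_scores_in, ood_scores_ood])
    labels = np.concatenate([np.zeros(len(ood_scores_in)), np.ones(len(ood_scores_ood))])
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos_rank_sum = ranks[labels == 1].sum()
    n_pos, n_neg = len(ood_scores_ood), len(ood_scores_in)
    # Mann-Whitney U, normalized to [0, 1].
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

assert auroc([0.1, 0.2, 0.3], [0.8, 0.9, 1.0]) == 1.0
```

In practice a library routine (e.g. scikit-learn's `roc_auc_score`) would typically be used instead.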

3.2. RESULTS

RMD and Binary classifier are better at OOD detection than baselines Table 1 shows the AUROCs for OOD detection on the (a) summarization and (b) translation datasets. Overall, our proposed OOD scores, RMD and Binary logits, outperform the baselines with high AUROCs (above 0.8). The commonly used output metrics, perplexity, NLI, COMET, and Prism, have generally low AUROC scores (many around 0.5-0.6), suggesting they are not suited for OOD detection. Interestingly, we noticed that the output OOD scores perform better for summarization, while the input OOD scores perform better for translation. One possible reason is that when summarization outputs are low-quality (e.g. repeated text or irrelevant summaries) they look very different from reference summaries, making the output OOD score more sensitive to the contrast.

Though the RMD and Binary logits OOD scores both perform well at OOD detection, the RMD OOD score is better at distinguishing near-OOD from far-OOD. This can be seen in Figure 3, where near-OOD datasets have scores distributed between in-domain and far-OOD. In the summarization task, the near-OOD (news article) datasets cnn dailymail and newsroom have their RMD scores distributed in the middle, between xsum on one side and reddit tifu, forumsum, and samsum on the other. In contrast, under the binary logits score, the near-OOD and far-OOD datasets have largely overlapping score distributions, making it hard to distinguish between the two. In practice, the RMD OOD score may be better suited for selective generation where domain shifts are expected; we explore this in more detail in Section 4.

For the translation task, we additionally note that all methods have small AUROC for the law dataset, suggesting that none of the methods detect it as OOD. To better understand the special characteristics of the law dataset, we conducted an n-gram overlap analysis between the various test sets, including law, and the in-domain training data. We observed that law has the highest unigram overlap rate (48.8%) and the second highest overall overlap with the in-domain data (Table A.9). This shows that law is close to the in-domain data in terms of surface features, which might contribute to the low AUROC scores for all tested methods.

We use ParaCrawl instead of C4 for translation because our translation model is trained at the sentence level, unlike the summarization model that takes a document as input. To further explore the effect of the background data on performance, we split C4 documents into sentences and use those as the background data to compute the scores. The OOD detection performance using C4 sentences is very similar to that using ParaCrawl, as shown in Table A.3, suggesting that our method is not particularly sensitive to the choice of background data.
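The surface-overlap analysis can be sketched as follows; this is only an illustrative approximation, since the exact tokenization and counting behind Table A.9 may differ:

```python
def unigram_overlap_rate(test_sentences, train_sentences):
    """Fraction of unique test-set unigrams that also appear in the training data.

    A rough proxy for surface-level closeness between a test domain and the
    in-domain training distribution.
    """
    train_vocab = {tok for s in train_sentences for tok in s.lower().split()}
    test_vocab = {tok for s in test_sentences for tok in s.lower().split()}
    if not test_vocab:
        return 0.0
    return len(test_vocab & train_vocab) / len(test_vocab)

# Toy example: a "law"-like sentence built from words seen in training.
train = ["the court ruled on the appeal", "the judge dismissed the case"]
law_test = ["the court dismissed the appeal"]
assert unigram_overlap_rate(law_test, train) == 1.0
```

A test set whose vocabulary is largely covered by the training data (as law is here) would look in-domain to any surface-feature-driven detector.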

4. USING OOD SCORES FOR SELECTIVE GENERATION

The most conservative option for deploying a conditional language model is to completely abstain from generating on inputs detected as out-of-distribution, for which we have shown in Section 3 that our OOD scores are fairly accurate. However, it is often desirable to expand the use of models beyond strictly in-distribution examples if the quality of outputs is sufficiently high. In classification, this has been framed as determining when to trust a classifier, or selective prediction (Geifman & El-Yaniv, 2017; Lakshminarayanan et al., 2017; Tran et al., 2022). In this section, we seek to predict the quality of generation for a given example, which may be out-of-distribution, and abstain if the predicted quality is low. We call this selective generation. In practice, abstaining may correspond to hiding the model's generated text or turning off a summarization/translation feature.

4.1. EXPERIMENT SETUP

We use the same models and datasets described in Section 3.1, but instead of simply detecting out-of-distribution examples, our focus now is to predict the quality of generation for examples possibly outside the training distribution.

Measuring Translation quality

We use BLEURT (Pu et al., 2021) as the main metric to measure translation quality. Previous work has demonstrated that neural metrics such as BLEURT correlate much better with human evaluation, at both the system level and the sentence level (Freitag et al., 2021). BLEURT scores range from 0 to 1, with higher scores indicating better quality.

Measuring Summarization quality

In general, it is unclear how to automatically measure the quality of summaries generated by a model on out-of-distribution examples (in this case, examples from different datasets), because summarization datasets have dataset-specific summary styles that may be difficult to compare. For example, xsum summaries are typically single-sentence, whereas cnn dailymail summaries consist of multiple sentences. Thus we report the ROUGE-1 score as an automatic measure but primarily use human evaluation to assess quality. Amazon Mechanical Turk workers were asked to evaluate summaries generated by the xsum model on a scale of 1-5 (bad-good) using 100 examples from xsum, cnn dailymail, reddit tifu, and samsum. We collected 3 ratings per example and computed the median. See Section A.3 for more details.

4.2. PERPLEXITY HAS DIMINISHING CAPABILITY IN PREDICTING QUALITY ON OOD DATA

Since the models are trained using the negative log-likelihood as the loss, perplexity (which is monotonically related) is a good predictor of output quality for in-domain data. In fact, the Kendall rank correlation coefficient τ between perplexity and the human-judged quality score is 0.256 (see Table 2) on the in-domain xsum data for summarization. However, when including shifted datasets at test time, we found that the perplexity score is worse at predicting quality on OOD data. For example, Kendall's τ decreases to 0.068 for the OOD dataset samsum (see Table A.4). We observed a similar, though less severe, trend in translation: as data shifts from in-domain to OOD, Kendall's τ between perplexity and BLEURT decreases (see Table A.5). Figure 4 further shows the correlation between perplexity and the quality score (ROUGE-1, human rating, and BLEURT, respectively) as a function of the OOD score. The correlation clearly decreases as the OOD score increases, and the trend is consistent for both summarization and translation.

4.3. COMBINING PERPLEXITY AND OOD SCORES

We propose two simple methods to combine perplexity and OOD scores: (1) a simple linear regression, trained on a random 10% data split using ROUGE-1 or BLEURT as the quality score, and evaluated on the test split and human evaluation split; (2) the sum of the percentile ranks (PR) of the scores, i.e. $PR_{sum} = PR_{perplexity} + PR_{OOD}$. We sum PRs instead of raw values because the two scores lie in different ranges; $PR(x) = \frac{R(x)}{N} \times 100$, where $R(x)$ is $x$'s rank in the list of size $N$. Table 2 shows Kendall's τ correlation coefficient between the various single and combined scores and the quality metric, for in-domain examples only and for examples from all datasets.

Table 2: Kendall's τ correlation (p-value < 0.05 are grayed out) between various measures and human evaluation for summarization and BLEURT for translation. The "All" column shows the correlation when both in-domain and OOD examples are merged. Note that for negatively correlated scores (e.g. perplexity (ppx), RMD), we take the negative value of the score for easier comparison.
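The percentile-rank combination described above can be sketched as follows (illustrative; the score values are made up):

```python
import numpy as np

def percentile_ranks(scores):
    """PR(x) = R(x) / N * 100, where R(x) is the rank of x among the N scores."""
    scores = np.asarray(scores)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks / len(scores) * 100

perplexity = np.array([2.1, 5.0, 1.2, 9.3])   # higher = less trusted
ood_score = np.array([-3.0, 2.0, -5.0, 4.0])  # e.g. input RMD; higher = more OOD
# Summing percentile ranks puts the two scores on a common scale before combining.
pr_sum = percentile_ranks(perplexity) + percentile_ranks(ood_score)
```

The example with both the highest perplexity and the highest OOD score (index 3) receives the largest combined score, and would be the first to be abstained on.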

4.4. SELECTIVE GENERATION USING THE COMBINED SCORE

In selective generation, our goal is to generate when the model is more likely to produce high-quality output, and abstain otherwise, enabling safer deployment of generative language models. To evaluate this, we propose using the Quality vs Abstention (QA) curve, analogous to the accuracy-versus-rejection curve used for selective prediction in classification (Chow, 1957; Bartlett & Wegkamp, 2008; Geifman & El-Yaniv, 2017). Similar concepts were also proposed in Malinin & Gales (2020); Xiao et al. (2020), but they use only automatic quality metrics for the analysis, while we also consider human evaluation to assess quality. Specifically, at a given abstention rate α, the highest-scoring α-fraction of examples is removed and the average quality of the remaining examples is computed. We want to maximize the quality of what is selectively generated; a better curve is one that tends toward the upper left, which corresponds to removing bad examples earlier than good ones.

Figure 5 shows the QA curves for various methods on summarization and translation. Quality is measured by human evaluation for summarization (see Figure A.4 for a similar ROUGE-1 plot) and by BLEURT for translation. The combined scores have the highest quality at almost all abstention rates for both summarization and translation, with linear regression and PR sum performing similarly. Among single scores, the OOD score performs better than the perplexity and NLI scores at almost all abstention rates for summarization. For translation, the OOD score is better than perplexity when the abstention rate α > 0.65 and worse when α < 0.65. In other words, the OOD score is better at abstaining on slightly far-OOD examples, while perplexity is better at abstaining on near-OOD examples. Interestingly, our combined score is even marginally better than COMET, which requires a separate neural network trained on human evaluation data. Prism is better than the single scores but much worse than our combined score. Areas under the QA curves are shown in Tables A.6 and A.8 for reference.
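The QA-curve computation described above can be sketched as follows (an illustrative implementation of the abstention procedure, with made-up scores and quality values):

```python
import numpy as np

def quality_vs_abstention(scores, quality, abstention_rates):
    """Quality vs Abstention (QA) curve.

    At abstention rate alpha, drop the alpha-fraction of examples with the
    highest (least trusted) scores and average the quality of what remains.
    """
    order = np.argsort(scores)                 # ascending: most trusted first
    quality_sorted = np.asarray(quality)[order]
    n = len(quality_sorted)
    curve = []
    for alpha in abstention_rates:
        keep = max(1, n - int(round(alpha * n)))
        curve.append(float(quality_sorted[:keep].mean()))
    return curve

# A score that ranks bad outputs last yields a non-decreasing curve.
scores = [0.1, 0.4, 0.2, 0.9, 0.7]    # e.g. combined PR sum (rescaled)
quality = [0.9, 0.6, 0.8, 0.2, 0.4]   # e.g. BLEURT or normalized human rating
curve = quality_vs_abstention(scores, quality, [0.0, 0.2, 0.4])
```

Since this toy score perfectly ranks the examples, average quality rises monotonically as more low-quality examples are abstained on.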

5. RELATED WORK

The OOD detection problem was first proposed and studied in vision classification (Hendrycks & Gimpel, 2016; Liang et al., 2017; Lakshminarayanan et al., 2017; Lee et al., 2018; Hendrycks et al., 2018; 2019), and later in text classification problems such as sentiment analysis (Hendrycks et al., 2020), natural language inference (Arora et al., 2021), intent prediction (Liu et al., 2020a; Tran et al., 2022), and topic prediction (Rawat et al., 2021). The widely used OOD methods can be roughly characterized into three categories: (1) softmax probability or logits-based scores (Hendrycks & Gimpel, 2016; Liang et al., 2017; Hendrycks et al., 2019; Liu et al., 2020b); (2) embedding-based methods that measure the distance to the training distribution in the embedding space (Lee et al., 2018; Ren et al., 2021; Sun et al., 2022); (3) contrastive-learning-based methods, which incorporate a contrastive loss into the classification cross-entropy loss to improve representation learning and consequently OOD detection (Winkens et al., 2020; Zhou et al., 2021). Though it is not straightforward to extend these classifier-based scores to CLMs, especially for input OOD detection, we extend three of them based on our understanding as baselines for comparison with our methods (see Section A.6 for details). The results in Table A.2 show that those methods are in general not competitive with our proposed RMD and Binary logits methods, especially on near-OOD datasets.

The OOD detection problem is less studied for CLMs. A few studies explored OOD detection in semantic parsing (Lukovnikov et al., 2021; Lin et al., 2022), speech recognition (Malinin & Gales, 2020), and machine translation (Malinin et al., 2021; Xiao et al., 2020), but many of them focus on ensemble-based methods such as Monte Carlo dropout or deep ensembles, which use the averaged perplexity over multiple sampled output sequences. Ensembling costs N times the inference time, which is often not feasible in practice. In this work, we focus on developing scores that can be readily derived from the generative model itself, without much increase in computation. We include an ensemble-based baseline in Section A.6 and show that its performance is worse than our methods'.

6. CONCLUSION AND FUTURE WORK

We have proposed lightweight and accurate scores to detect out-of-distribution examples for conditional language generation tasks. For real-world deployment, we have also shown how our OOD scores can be combined with language model perplexity to selectively generate high-quality outputs while abstaining from low-quality ones under input distribution shift. Although our experiments focus on summarization and translation, our methods make no assumptions about the task modality, and we believe they are widely applicable to other tasks where the model output is a sequence, e.g. image captioning. While our analysis was restricted to conditional language modeling with encoder-decoder Transformers, we expect our method to also work with decoder-only (Liu et al., 2018) architectures used by some large language models such as GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022), and LaMDA (Thoppilan et al., 2022). Finally, analyzing why certain examples are OOD could lead to insights into how to make models more robust; Section A.13 presents one possible way to attribute OOD scores to sentences.

A APPENDIX

A.1 THE OUTPUT QUALITY FOR SUMMARIZATION AND TRANSLATION DATASETS

Table A.1: The output quality for summarization and translation datasets. (a) Summarization quality (higher is better) for the xsum model. ROUGE-1 is based on all samples in the test split per dataset, while human evaluation is based on 100 samples. The raw human evaluation rating ranges from 1 to 5. We normalized the score by dividing by 5.0, and took the median of the ratings over 3 raters to reduce inter-rater noise. The standard deviation among the 3 ratings is reported in brackets. (b) Translation quality for different datasets (higher is better). All datasets are sub-sampled to 1000 sentence pairs.

A.5 CONNECTION BETWEEN RMD AND THE BINARY CLASSIFIER

RMD is a generative-model-based approach which assumes the distributions of the two classes are Gaussian, while the binary classifier is a discriminative model which learns the decision boundary between the two classes. Though they have different settings, under certain conditions the Gaussian generative model reduces to a binary classifier. To see the connection, let the label $y = 0$ if the sample is from the in-domain distribution and $y = 1$ if it is from the general domain, and assume without loss of generality that the two classes have balanced sample sizes, $p(y = 1) = p(y = 0)$. Since $\log p(y = 1|z)$ can be rewritten using Bayes' rule as $\log p(y = 1|z) = \log p(z|y = 1) + \log p(y = 1) - \log p(z)$, the logit (log odds) can be written as

$\text{logit} = \log \frac{p(y = 1|z)}{p(y = 0|z)} = \log p(z|y = 1) - \log p(z|y = 0) = -\frac{1}{2}\left(\mathrm{MD}(z; \mu_{y=1}, \Sigma_{y=1}) - \mathrm{MD}(z; \mu_{y=0}, \Sigma_{y=0})\right) + \mathrm{const.}$

When $\Sigma = \Sigma_{y=1} = \Sigma_{y=0}$, the equation can be further simplified as

$\text{logit} = (\mu_{y=1} - \mu_{y=0})^T \Sigma^{-1} z - \frac{1}{2}\left(\mu_{y=1}^T \Sigma^{-1} \mu_{y=1} - \mu_{y=0}^T \Sigma^{-1} \mu_{y=0}\right) + \mathrm{const.} = \beta_1 z + \beta_0.$
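The shared-covariance identity above can be checked numerically; the following sketch uses synthetic parameters (the constant term from the shared covariance cancels, so only the affine part remains):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.normal(size=(d, d))
sigma = A @ A.T + np.eye(d)          # shared positive-definite covariance
mu0, mu1 = rng.normal(size=d), rng.normal(size=d)
sigma_inv = np.linalg.inv(sigma)

def md(z, mu):
    """Mahalanobis distance under the shared covariance."""
    return (z - mu) @ sigma_inv @ (z - mu)

# Coefficients of the equivalent logistic-regression logit.
beta1 = sigma_inv @ (mu1 - mu0)
beta0 = -0.5 * (mu1 @ sigma_inv @ mu1 - mu0 @ sigma_inv @ mu0)

z = rng.normal(size=d)
lhs = -0.5 * (md(z, mu1) - md(z, mu0))   # -(1/2)(MD_1 - MD_0)
rhs = beta1 @ z + beta0                  # beta_1 . z + beta_0
assert np.isclose(lhs, rhs)
```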
Therefore, when assuming the covariance matrices are identical for the two Gaussian distributions, the Gaussian generative model can be reduced to a binary classification model. However, our RMD does not assume the same covariance matrix in both distributions. We estimate the covariance matrix individually for each class. So our RMD is different from binary classifier, and it has higher model capacity than the binary classifier. As we discussed in the related works, OOD detection problem was mainly studied in classification problems, and less studied in CLMs. Though it is not straight forward to extend classifier-based scores to CLMs especially for the input OOD detection, we would like to include as many possible methods as we can to present a comprehensive comparison for different methods. For those methods which rely on classification head derived logits, MSP (Hendrycks & Gimpel, 2016) , max-logit (Hendrycks et al., 2019) , and energy score (Liu et al., 2020b) , we simply consider the output decoding process as a sequence of classifications over tokens, and take the average of the corresponding score over the generated output tokens y 1 , . . . , y T as the output OOD scores. Therefore we added the following scores for CLMs, • Mean(MSP) -1 T T t=1 p(y t |y <t , x). • Energy score 1 T T t=1 E(x, f t ), where E(x, f t ) = -τ log v∈V e f (yt=v|y<t,x)/τ , f (y t = v|y <t , x) is the logit corresponding to the v-th token at the t-th decoding step, V is the token-vocabulary, and τ is the temperature parameter. We set τ = 1 since the original paper (Liu et al., 2020b) suggested the energy score can be used parameter-free by simply setting τ = 1. • Ensemble estimation of the output perplexity from multiple Monte-Carlo dropout samples. Malinin & Gales (2020); Xiao et al. (2020) propose to turn on the MC dropout layer at the inference time and sample multiple times (N ) using different random seeds as a way to approximate the Bayesian neural networks. 
We follow their idea, generate multiple output sequences, and use the averaged perplexity as the uncertainty score. Note that the inference time for the ensemble-based method is $N$ times that of a single-model score.

• KNN-based OOD score. Sun et al. (2022) propose to use the distance to the $k$-th nearest neighbour in the training set, in the embedding space, as an OOD score. The KNN-based method has two hyper-parameters, $\alpha$ and $k$: $\alpha$ is the proportion of training data sampled for the nearest-neighbour calculation, and $k$ refers to the $k$-th nearest neighbour. We use the optimal $k = 1000$ and $\alpha = 100$ as suggested by the paper. We also normalize the embedding features, since the paper showed that feature normalization is critical for good performance.

Mean(MSP), the energy score, and the ensembled perplexity score are all derived from the logits of the tokens in the output sequences, so they are output OOD scores. The KNN-based method can be applied to both input-sequence and output-sequence embeddings. Table A.2 shows the AUROCs for OOD detection for the newly added baselines above, as a comparison to our methods. First, the logit-based output OOD scores (perplexity, Mean(MSP), the energy score, and even the ensembled perplexity score, which costs $N$ times the inference time) are in general not competitive with our proposed RMD and Binary logits methods. Though the energy score is somewhat better than perplexity and Mean(MSP), and the ensembled score is better than the energy score, the performance gap between those methods and our proposed method remains large, especially for the near-OOD datasets. Second, the KNN-based methods are not as good as MD and RMD either. Though it is possible that the hyper-parameters suggested by the paper are not the optimal ones for our problem, searching for the optimal hyper-parameters would require a separate validation set. In contrast, our proposed methods have no hyper-parameters.
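For concreteness, the two logit-based output scores above can be sketched as follows (a minimal numpy implementation operating on a [T, V] array of per-step decoder logits; this is an illustrative sketch, and under greedy decoding the per-step MSP equals the probability of the emitted token):

```python
import numpy as np

def mean_msp(logits):
    """Negative mean max-softmax probability over the T decoding steps.
    logits: [T, V] array of per-step token logits."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize softmax
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return float(-probs.max(axis=-1).mean())

def mean_energy(logits, tau=1.0):
    """Mean per-step energy score E(x, f_t) = -tau * log sum_v exp(f_v / tau)."""
    z = logits / tau
    m = z.max(axis=-1, keepdims=True)
    # numerically stable log-sum-exp per decoding step
    lse = (m + np.log(np.exp(z - m).sum(axis=-1, keepdims=True))).squeeze(-1)
    return float((-tau * lse).mean())
```

With a single dominant logit, both scores approach their extremes: the MSP term approaches 1 (so Mean(MSP) approaches -1) and the energy approaches the negated max logit.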

A.7 EFFECT OF THE CHOICE OF THE BACKGROUND DATASET

Our principle for choosing the background data is to make it as general as possible. For summarization we use the C4 dataset, a large web crawl of documents, to represent a broad range of topics. Similarly for translation, we use the ParaCrawl dataset, which is also a large web crawl of sentences, because our translation model is a sentence-to-sentence model, unlike the summarization model, which takes a document as input. To further explore the effect of the background data on performance, we split C4 documents into sentences and use those as the background data to compute the scores, and compare against the version using the ParaCrawl dataset. The OOD detection performance using C4 sentences is very similar to that using ParaCrawl, as shown in Table A.3. For example, the ParaCrawl-based input OOD score performs slightly better on the Medical, Koran, and IT datasets, while the C4-based input score is slightly better on the other datasets. Both are significantly better than the baseline methods, and both give the same ranking of datasets by their OOD-ness, so our conclusion remains. These results verify that our method is robust to the choice of background data.
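The relative Mahalanobis distance score discussed here can be sketched as follows (an illustrative numpy implementation, not the paper's code: it fits one Gaussian to in-domain embeddings and one to background embeddings, and scores a point by the difference of the two squared Mahalanobis distances; the `eps` regularization is our assumption for numerical stability):

```python
import numpy as np

class RMD:
    """Relative Mahalanobis distance OOD score (sketch).

    Fit a Gaussian to in-domain embeddings and another to embeddings of a
    broad background corpus (e.g. C4 or ParaCrawl). The score of a new
    embedding z is MD_in(z) - MD_background(z); higher means more OOD."""

    def fit(self, in_domain, background):
        self.mu_in, self.prec_in = self._gaussian(in_domain)
        self.mu_bg, self.prec_bg = self._gaussian(background)
        return self

    @staticmethod
    def _gaussian(X, eps=1e-6):
        """Mean and (regularized) precision matrix of the samples."""
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
        return mu, np.linalg.inv(cov)

    def score(self, z):
        md_in = (z - self.mu_in) @ self.prec_in @ (z - self.mu_in)
        md_bg = (z - self.mu_bg) @ self.prec_bg @ (z - self.mu_bg)
        return md_in - md_bg
```

Because the background Gaussian is broader, a point far from the in-domain cluster receives a large positive score, while typical in-domain points score near zero or below.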

A.8 ROC PLOTS FOR THE CORRESPONDING AUROC SCORES FOR OOD DETECTION

To better visualize the OOD detection performance, we present Figure A.3, which shows the ROC plots for the corresponding AUROC scores for OOD detection in Table 1. Each OOD measure is used to separate the in-domain test data (negatives) from the OOD test data (positives). The AUROC is defined as the area under the ROC curve: the closer an ROC curve is to the upper-left corner, the larger the AUROC value. An AUROC of 1.0 means perfect separation, and 0.5 means the two sets are not distinguishable. AUROC is independent of the choice of threshold, so it can be used for fair comparisons among methods.

A.9 CORRELATION

To support our claim that the news-related test datasets cnn_dailymail and newsroom are closer to the in-domain xsum than the dialogue datasets reddit_tifu, samsum, and forumsum, we compute the n-gram overlap between each test dataset and the in-domain dataset. We use the Jaccard similarity score, $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, where $A$ and $B$ are the sets of n-grams in datasets A and B, to measure the similarity between two datasets. Table A.10 shows the similarity scores based on 1-4 grams. cnn_dailymail and newsroom have significantly higher similarity with the in-domain xsum data than the other three datasets. We therefore call the news-related datasets near-OOD and the dialogue-based datasets far-OOD.
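The Jaccard similarity between two corpora can be computed as in this short sketch (illustrative; plain whitespace tokenization is our assumption, not necessarily what the paper used):

```python
def ngram_set(texts, n):
    """Unique word n-grams over a list of strings (whitespace tokens)."""
    grams = set()
    for text in texts:
        toks = text.split()
        grams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return grams

def jaccard(corpus_a, corpus_b, n=1):
    """J(A, B) = |A intersect B| / |A union B| over n-gram sets."""
    A, B = ngram_set(corpus_a, n), ngram_set(corpus_b, n)
    return len(A & B) / len(A | B)
```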

A.13 VISUALIZATION OF OOD SCORE ON SHIFTED DATASET

We explore how individual parts of an input text contribute to the OOD score, which can help us visualize which parts of the text are OOD. We define the OOD score of each sentence in the text using a leave-one-out strategy: for any given sentence, we compute the OOD score of the article with and without that sentence in it. The negative of the change in the OOD score after removing the sentence denotes the OOD score of that sentence. Intuitively, if removing the sentence decreases the overall OOD score, that sentence is assigned a positive OOD score, and vice-versa. Figure A.6 illustrates an example where an article contains noise in the form of tweets with emojis; the OOD scoring mechanism described above assigns positive OOD scores to those tweets and negative scores to the main text.

Figures A.7, A.8, and A.9 show 3 examples in cnn_dailymail that have the highest PRsum(perplexity, output RMD) scores, which predict low-quality summaries. Figures A.10, A.11, and A.12 show 3 examples in cnn_dailymail that have the lowest PRsum(perplexity, output RMD) scores, which predict high-quality summaries.

Document: A man trying to elude police jumped into a Missouri creek overnight wearing only his underwear -but his daring gambit did not pay off. Responding officers and firefighters followed the fugitive into the murky waters of Brush Creek in Kansas City and fished him out early Friday morning. The 38-year-old suspect has been taken to an area hospital to be treated for injuries to his arm and leg. He may face charges in connection to a hit-and-run crash. Escape by water: A 38-year-old man stripped down to his skivvies and jumped into Brush Creek in Kansas City, Missouri, after being stopped by police. Up Brush Creek without a paddle: The suspect reached the middle of the creek and spent 10-15 minutes swimming back and forth.
According to a Kansas City Police Department's arrest report, officers were called to a gas station in the 4600 block of Prospect at around 2am after receiving complaints from neighbors about a car blasting loud music. The report states that when police approached the car, a grey 2007 Infinity, and asked to see the driver's license, the man smiled, said, 'I'm out!' and took off from the scene. The Infinity promptly smashed into the north side of the Brush Creek bridge, after which the driver got out of the mangled car and jumped into the water. Police say the 38-year-old suspect stripped down to his underwear and spent 10-15 minutes swimming in chest-deep water, with officers waiting for him on north and south sides of the creek. Surrounded: When firefighters tried to pull him out, he threatened them with a log. Fish out of water: Police officers armed with a BB gun went after the nighttime bather and apprehended him. The bather was complaining of a broken leg, according to Fox4KC, so the Kansas City Fire Department's water rescue crew were sent in to fish him out. But the half-naked man in the water was not going to go quietly. 'The suspect picked up a large log and started swinging it at the firemen so they backed off as to not escalate the situation,' the arrest report states. That is when uniformed police officers armed with a BB gun followed the man into the creek, got him in a choke hold and pulled him out of the creek. Police suspect the man may have been under the influence of drugs or alcohol. Prelude: Before he jumped in the water, the 38-year-old driver fled from police and smashed his 2007 Infinity into a bridge. Police suspect the man may have been under the influence of drugs or alcohol at the time. As of Friday morning, the 38-year-old has not been formally charged with any crime. Document: Bayern Munich had to make do without FOUR important first-team stars as Pep Guardiola's side attempted to overturn a 3-1 deficit against Porto on Tuesday night. 
Injured quartet Franck Ribery, Mehdi Benatia, David Alaba and Arjen Robben were forced to watch on from the sidelines as the German giants bid to reach the Champions League semi-finals. However, the absence of Robben and Co appeared to make no difference as Bayern raced into a 5-0 lead at half-time before claiming a 6-1 victory to win the tie 7-4 on aggregate. Injured trio Franck Ribery, Mehdi Benatia and David Alaba chat ahead of Bayern's clash with Porto. Injured Ribery acknowledges a steward before taking a seat at the Allianz Arena on Tuesday night. Ribery looks on as former Roma defender Benatia chats with the France international in the dugout. While Ribery, Benatia and Alaba chatted in the home dugout ahead of kick-off, Holland international Arjen Robben was in front of the mic doing some punditry alongside Bayern goalkeeping legend Oliver Kahn. Ribery missed the game after failing to recover from a recent ankle injury while former Roma defender Benatia faces another two weeks out with a groin problem. Robben was unavailable for the encounter with an abdominal injury. David Alaba, meanwhile, is set for a month on the sidelines having partially ruptured knee ligaments playing for Austria at the start of April. Bayern had just 14 fit players to choose from against Porto in the first leg but tore the Portuguese giants apart at the Allianz Arena to progress. Holland international Arjen Robben was pictured doing punditry alongside Bayern legend Oliver Kahn (right) Bayern Munich wideman Robben was unavailable for the Champions League clash with an abdominal injury. Reference Summary: Bayern Munich beat Porto 6-1 at the Allianz Arena on Tuesday night. German giants were without Franck Ribery, David Alaba and Mehdi Benatia. Arjen Robben was also sidelined and did some punditry for the tie. Model Summary: Arjen Robben, Mehdi Benatia, Franck Ribery and David Alaba all missed Bayern Munich's Champions League quarter-final second leg against Porto. 
Holland international Arjen Robben was pictured doing punditry alongside Bayern legend Oliver Kahn (right) Bayern Munich wideman Robben was unavailable for the Champions League clash with an abdominal injury.
Human rating score (↑ means high quality): 0.8
PRsum(perplexity, output RMD) (↓ means high quality): 0.11
Figure A.12: Examples in cnn_dailymail that have the lowest PRsum(perplexity, output RMD) scores, which predict high-quality summaries.



https://www.tensorflow.org/datasets/catalog/c4
https://www.tensorflow.org/datasets/catalog/para_crawl

We define the overlap rate as the percentage of unique n-grams in the test set that are also present in the in-domain data. The overall overlap is defined as the geometric mean of the n-gram overlap rates up to n = 4. All domains/splits including the in-domain data are subsampled to 1K for this analysis.

(Figure panels: (a) Summarization, ROUGE-1; (b) Summarization, human rating; (c) Translation, BLEURT.)
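The overlap rate and its geometric-mean aggregate can be sketched as follows (illustrative; whitespace tokenization is our assumption):

```python
import math

def ngram_set(texts, n):
    """Unique word n-grams over a list of strings (whitespace tokens)."""
    grams = set()
    for text in texts:
        toks = text.split()
        grams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return grams

def overall_overlap(test_texts, train_texts, max_n=4):
    """Geometric mean over n = 1..max_n of the fraction of unique
    test-set n-grams that also appear in the in-domain data."""
    rates = []
    for n in range(1, max_n + 1):
        test = ngram_set(test_texts, n)
        rates.append(len(test & ngram_set(train_texts, n)) / len(test))
    return math.prod(rates) ** (1.0 / len(rates))
```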



Figure 1: Density of perplexity scores of a CLM trained on (a) xsum for summarization, and (b) WMT for translation, evaluated on other datasets/domains. Perplexity is not well suited for OOD detection due to the significant overlap between in-domain and OOD scores.

Figure 2: The proposed OOD detector based on input and output embeddings.

Figure 4: The Kendall's τ correlation between perplexity and (a) ROUGE-1, (b) human evaluation median rating, and (c) BLEURT decreases as the OOD score increases. Note that we use the output RMD OOD score for summarization and the input RMD OOD score for translation.

4.3 COMBINING OOD SCORES AND PERPLEXITY

While model perplexity is a worse quality estimator for OOD examples, we observed that our OOD scores and perplexity are complementary for quality prediction. Figure A.1 shows a 2-D plot of OOD score versus perplexity with respect to quality. Neither perplexity nor the OOD score alone can perfectly separate good and bad examples, and the combination of the two works much better. Our observation echoes work on uncertainty estimation in classification models (Mukhoti et al., 2021): perplexity, based on the softmax predictive distribution, is regarded as an estimate of aleatoric uncertainty (caused by inherent noise or ambiguity in data), while the OOD distance, based on the representation, estimates epistemic uncertainty (caused by a lack of training data); combining the two provides a comprehensive estimation of uncertainty.
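The percentile-rank combination (PRsum) used to merge the two signals can be sketched as follows (an illustrative numpy version; the tie handling via a double argsort is our simplification):

```python
import numpy as np

def percentile_rank(scores):
    """Self-normalize scores to their empirical percentiles in (0, 1]."""
    ranks = np.argsort(np.argsort(scores))
    return (ranks + 1) / len(scores)

def pr_sum(perplexity, ood_score):
    """Sum of the percentile ranks of perplexity and the OOD score.
    Lower values predict higher-quality outputs."""
    return percentile_rank(np.asarray(perplexity)) + \
           percentile_rank(np.asarray(ood_score))
```

Because each score is reduced to its rank, neither scale dominates, which is what makes the two complementary signals easy to combine.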


Figures 5 (b, d) are the corresponding survival curves showing how many examples per dataset are selected for generation as a function of the abstention rate, based on the PRsum score. For summarization, the samples from the far-OOD datasets reddit_tifu and samsum are eliminated first, with their sample counts decreasing rapidly. The near-OOD dataset cnn_dailymail and in-domain xsum are kept intact until α > 0.3, and in-domain xsum examples survive the longest. Similarly for translation, the out-of-domain and worst-quality (as seen in Table A.1b) Koran, MTNT, and subtitles examples are eliminated first, and the best-performing law and in-domain datasets are abstained from last. The order in which datasets are eliminated corresponds to the aggregate quality by dataset, which we report in Table A.1. Beyond these quantitative results, we show a few real examples in Section A.14 to better demonstrate how our predicted quality score helps selective generation.
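Selective generation at abstention rate α can be sketched as follows (illustrative; any predicted-risk score such as PRsum can be used):

```python
import numpy as np

def selective_generate(risk_scores, abstention_rate):
    """Keep the fraction (1 - abstention_rate) of examples with the
    lowest predicted-risk scores (e.g. PRsum) and abstain from the rest.
    Returns a boolean mask: True = generate, False = abstain."""
    n = len(risk_scores)
    n_keep = int(round(n * (1.0 - abstention_rate)))
    keep_idx = np.argsort(risk_scores)[:n_keep]
    mask = np.zeros(n, dtype=bool)
    mask[keep_idx] = True
    return mask
```

Sweeping `abstention_rate` from 0 to 1 and counting, per dataset, how many examples remain selected reproduces the survival-curve view described above.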

Figure A.1: 2-D plot of OOD score versus perplexity. The two scores are self-normalized by their respective percentile ranks. Each square corresponds to a subset of samples whose OOD and perplexity scores fall within the percentile bin. The size of the square represents the size of the bin, and the color indicates the quality of the model's output. The OOD score and perplexity capture different properties of model outputs, and combining both scores can be beneficial for quality prediction.

Figure A.6: The OOD score can be attributed to individual sentences to highlight the out-of-domain, noisy parts of a text (red denotes out-of-domain and blue denotes in-domain text), e.g. tweets present in articles scraped from the internet. Example taken from the Newsroom dataset.
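The leave-one-out attribution illustrated in Figure A.6 can be sketched as follows (illustrative; `ood_score` stands in for any article-level OOD scorer, e.g. input RMD, and is an assumed callable rather than a specific implementation):

```python
def sentence_ood_attribution(sentences, ood_score):
    """Leave-one-out attribution: the OOD score of sentence i is the
    negative of the change in the article-level score when that
    sentence is removed (positive = removing it lowers the OOD score,
    i.e. the sentence looks out-of-domain)."""
    full = ood_score(" ".join(sentences))
    attributions = []
    for i in range(len(sentences)):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        attributions.append(full - ood_score(rest))
    return attributions
```

As a toy check, a scorer that simply counts occurrences of an out-of-domain marker assigns a positive attribution only to the sentence carrying the marker.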

Figures A.7, A.8, and A.9 show 3 examples in cnn_dailymail that have the highest PRsum(perplexity, output RMD) scores, which predict low-quality summaries.

AUROCs for OOD detection. For the summarization task (a), cnn_dailymail and newsroom are considered near-shift OOD since they share news topics with xsum, while reddit_tifu, forumsum, and samsum are far-shift OOD. For translation (b), the WMT dataset contains various WMT test sets collected from different years, OPUS contains five different domains (the degree of shift varies), and MTNT contains noisy data from Reddit.
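The AUROC values reported throughout can be computed threshold-free from the two score samples via the Mann-Whitney statistic; a minimal numpy sketch (not the paper's code):

```python
import numpy as np

def auroc(in_scores, ood_scores):
    """AUROC of separating OOD (positive) from in-domain (negative)
    examples using raw OOD scores. Equals the probability that a random
    OOD example scores higher than a random in-domain example, with
    ties counted as one half."""
    in_scores = np.asarray(in_scores)
    ood_scores = np.asarray(ood_scores)
    greater = (ood_scores[:, None] > in_scores[None, :]).sum()
    ties = (ood_scores[:, None] == in_scores[None, :]).sum()
    return (greater + 0.5 * ties) / (len(in_scores) * len(ood_scores))
```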

The combined scores significantly improve the correlation over perplexity, by up to 12% (absolute) for summarization and 8% for translation, while the gains over the best external-model-based (and much more expensive) baselines are 4% and 3%, respectively. The two combination methods perform similarly. See Tables A.4 and A.5 for expanded tables of scores.


3: Compute the input OOD score RMD_input(z) for z ∈ S^in_test and S^ood_test, respectively. Compute the AUROC based on the input OOD scores.
4: Similarly, generate output embeddings E^in_test = {f_d(ŷ) | ŷ ∈ D^in_test} and E^ood_test = {f_d(ŷ) | ŷ ∈ D^ood_test}. Compute the output OOD score RMD_output(w) for w ∈ E^in_test and E^ood_test, respectively. Compute the AUROC based on the output OOD scores.

A.5 THE CONNECTION BETWEEN RMD AND BINARY CLASSIFIER

Table A.2: AUROCs for OOD detection, comparing our proposed method with additional baselines.

Table A.3: Comparison of the OOD detection performance using two different background datasets, ParaCrawl and C4 sentences.

BETWEEN DIFFERENT SCORES AND THE QUALITY METRICS

Table A.5: Kendall τ correlation (p-value < 0.05 are grayed out) between various measures and quality measured by BLEURT on translation datasets. For easier comparison, we negate the signs of the coefficients for measures that are expected to correlate negatively with BLEURT (e.g., the OOD score). Within the same dataset, perplexity shows good correlation, but it deteriorates (with the exception of MTNT) as we move to more OOD datasets such as Koran.

(a) The summarization quality (ROUGE-1) vs. abstention curve for single scores (input and output RMD OOD scores, output perplexity, and the NLI score) and combined scores (a linear-regression model, and the percentile sum of the RMD OOD score and perplexity). The corresponding area under the curve is in Table A.7. (b) The survival count of each dataset as the joint dataset is abstained. Each dataset is sub-sampled to 400 examples for this analysis.

Table A.9: n-gram overlap analysis between the various test sets, including law, and the in-domain training data. We observe that law has the highest unigram overlap rate (48.8%) and the second-highest overall overlap (defined as the geometric mean) with the in-domain data.

Reference Summary: The 38-year-old suspect was questioned by Kansas City police after neighbors complained he was blasting music in his 2007 Infinity. Instead of handing over his ID, driver smiled, said 'I'm out!' and took off. After crashing into bridge, the man stripped down to his underwear and jumped into Brush Creek. It took cops armed with a BB gun 15 minutes to fish out the fugitive.

Document: A crisp fan who gets through 42 bags in a week has discovered a skull-shaped deep-fried potato snack in one of his packets. Barry Selby, 54, who lives with his dog in Poole, Dorset, was eating a bag of cheese and onion crisps when he made the bizarre discovery, which appears to be a profile of a human skull. The floor-fitter has decided to keep the two inches tall by two-and-a-half inches wide snack as he believes it is far more impressive than other oddly-shaped examples he has seen on the internet. Scroll down for video. Spooky find: Barry Selby was eating a bag of Tesco cheese and onion crisps when he found the 'skull' snack. Mr Selby said: 'I was shocked when I found it. I was just eating a bag of cheese and onion crisps from Tesco and when I pulled it out it did take me back a bit. 'I thought it was worth keeping as I don't think I will ever find one like it again. It must have been a very weird-shaped potato. 'It's about two inches tall and two-and-a-half inches wide and it's in perfect detail, it even has an eye socket. 'I sometimes give my dog, Max, crisps in a bowl, so it's lucky he didn't have this packet or I wouldn't have found it. Weird snack: Mr Selby has decided to keep the unusual find, which appears to show a jaw, nose and eye. Comparison: The 54-year-old said he was 'shocked' to make the discovery, although it is not his first. In the 1990s he came across a 3D heart-shaped crisp, which he kept until it broke.
And it's not the first odd-shaped snack he has come across -in the 1990s he found a crisp shaped like a 3D heart, which he kept for several years until it broke. But he says this find was different: 'This one was a big one. I just thought "wow" and wanted to share it. 'I've been keeping it on top of my computer in the front room, but it should be in a protective box really. 'I'm going to keep it forever, it's just so spooky. I looked on the internet for other funny-shaped crisps but this is a one-off.'

Reference Summary: Barry Selby from Dorset was eating bag of Tesco cheese and onion crisps. The 54-year-old discovered a snack shaped like profile of the human skull. He said he was 'shocked' with the find and has decided to 'keep it forever' It's not his first weird food find -he once discovered a heart-shaped crisp.

Document: Last week she was barely showing -but Demelza Poldark is now the proud mother to the show's latest addition. Within ten minutes of tomorrow night's episode, fans will see Aidan Turner's dashing Ross Poldark gaze lovingly at his new baby daughter. As Sunday night's latest heartthrob, women across the country have voiced their longing to settle down with the brooding Cornish gentleman -but unfortunately it seems as if his heart is well and truly off the market. Scroll down for video. Last week she was barely showing -but Demelza Poldark is now the proud mother to the show's latest addition. He may have married his red-headed kitchen maid out of duty, but as he tells her that she makes him a better man, audiences can have little doubt about his feelings. What is rather less convincing, however, is the timeline of the pregnancy. With the climax of the previous episode being the announcement of the pregnancy, it is quite a jump to the start of tomorrow's instalment where Demelza, played by Eleanor Tomlinson, talks about being eight months pregnant.
Just minutes after -once again without any nod to the passing of time -she is giving birth, with the last month of her pregnancy passing in less than the blink of an eye. With the climax of the previous episode being the announcement of the pregnancy, it is quite a jump to the start of tomorrow's instalment where Demelza, played by Eleanor Tomlinson, talks about being eight months pregnant. As Sunday night's latest heartthrob, women across the country have voiced their longing to settle down with Poldark -but unfortunately it seems as if his heart is well and truly off the market. Their fast relationship didn't go unnoticed by fans. One posted on Twitter: 'If you are pregnant in Poldark times expect to have it in the next 10 minutes' It is reminiscent of the show's previous pregnancy that saw Elizabeth, another contender for Ross's affection, go to full term in the gap between two episodes. This didn't go unnoticed by fans, who posted on Twitter: 'Poldark is rather good, would watch the next one now. Though if you are pregnant in Poldark times expect to have it in the next 10 minutes.'

Reference Summary: SPOILER ALERT: Maid gives birth to baby on Sunday's episode. Only announced she was pregnant with Poldark's baby last week.

Model Summary: It's all change in the world of Poldark.

Examples in cnn_dailymail that have the highest PRsum(perplexity, output RMD) scores, which predict low-quality summaries.

Document: Rangers boss Stuart McCall says he is already working on a dossier of signing targets for next season -even though he may not be around to parade them. The interim Ibrox manager still does not know if he will be in charge beyond the current campaign after being lured back to his old club to kickstart their faltering promotion bid. So far, everything is going to plan with Gers second in the Scottish Championship table and destined for a semi-final play-off slot.
Stuart McCall says he is already looking at transfer targets for next season, though he may not be at Rangers. But with 12 players out of contract, McCall knows the Light Blues will need to strengthen if they have any chance of keeping pace with rivals Celtic next season -if they go up -and is already piecing together a wish list of potential new arrivals. He said: 'I've been speaking to a lot of agents and putting things in place for if and when... Even if I'm not here, if I'm getting players put to me who would like to come to Rangers regardless of the manager, then we build a little portfolio of positions that we will be needing next year. 'It's not a case of us standing still and then thinking come June 1, 'Oh we need to get into action'. 'No, there are a lot of agents who come to us and we build a little dossier of players that as a staff, we think will be good for next season, regardless of what league we are in. 'It would be slightly naive [if we were not doing that]. If I'm in charge or not, I still want the club to do well and I will put my view across to the board on who I think should be coming into the club and who should be here.' McCall is compiling a dossier on targets as he looks to put the club in the best possible position. Rangers have operated a haphazard transfer policy since re-emerging from the embers of liquidation. The club's team of scouts were jettisoned under the disastrous Craig Whyte regime and former boss Ally McCoist was largely forced to turn to a list of former Ibrox servants he had personal knowledge of when trying to bolster his squad. But McCall revealed the club's new board are now starting the process of re-establishing their spying network -albeit on a smaller level than before. 'I think there has been discussions behind the scenes with different people,' said the former Motherwell boss.
'I don't think we are at the stage where we were 10 or 15 years ago where we were aiming to get into the Champions League and bringing players in for three and four million yet. 'I don't think Rangers will be at the stage yet next year where we need international scouts everywhere. Rangers have expanded their scouting network after a haphazard system over the past few years. 'But certainly a scouting network needs to be put in place. 'Having said that, I spoke to Craig Levein at Hearts and they do a lot of their scouting with [online service] Wyscout. When I brought Henrik Ojamaa in at Motherwell, that was after I'd seen a clip of him on YouTube. I sold him for £350,000 after signing him for nothing. That was great. 'So you can still do your own background work. Personally I would always like to see the player myself. I've only ever signed one player without watching him first and slightly regretted it. 'So yeah we need a scouting network but at this moment where Rangers are, not to the extent where we have scouts all over Europe.' McCall admitted he still does not know if he will rejoin Gordon Strachan's Scotland staff for the June 13 Euro 2016 qualifier with Ireland in Dublin. And he also confessed to uncertainties ahead of Saturday's match with Falkirk. McCall's side are still in line for promotion, sitting in the play-off positions in the Scottish Championship. Peter Houston's Bairns -five points behind fourth-placed Queen of the South with two games to play -need an unlikely series of results to make the play-offs but McCall says that raises more questions than answers. He said: 'Housty is a wily old fox who has done terrifically well in his career so I don't know what to expect. 'It will take a difficult set of results for them to get into the play-offs so I don't know if they will come here and think the pressure is off and play care free. 'They don't lose many goals so we may have to be patient through the 90 minutes.
We have had a couple of decent results against them but they have capable players and we will need to be at our best.'

Reference Summary: Rangers are currently second in the Scottish Championship. Stuart McCall's side are in pole position to go up via the play-offs. But McCall is still not certain of his future at the club next season. Rangers boss says he is still trying to build the squad for next year. Rangers have begun to expand their scouting after several poor years.

Model Summary: Stuart McCall says he is already looking at transfer targets for next season, though he may not be at Rangers.

Human rating score (↑ means high quality): 0.8
PRsum(perplexity, output RMD) (↓ means high quality): 0.10
Figure A.10: Examples in cnn_dailymail that have the lowest PRsum(perplexity, output RMD) scores, which predict high-quality summaries.

Document: An Alberta student who'd accidentally left his headlights on all day was greeted by what may have been the world's friendliest note from a stranger when he returned to his car. But Derek Murray, a University of Alberta law student, found more than just the note that cold November day in Edmonton-he also found an extension cord and battery charger left by the stranger to bring his dead Acura back to life. Now that Murray's life-affirming tale has gone viral, he says 'It just shows you how such a pure act of kindness from one person can just spread through everyone and help make everyone's day a little brighter.' Good Samaritan: A friendly stranger left this unbelievably friendly letter to Alberta law student Derek Murray in order to help him get his car started after he left the headlights on all day. At first, though, he assumed the letter was from an angry fellow motorist, he told the National Post. 'When I first saw the note, I was expecting it to be an angry letter from someone telling me not to park there. Instead, I got someone just totally brightening my day.
My day could have been ruined but, because of this guy, it was the highlight of my day.' The note reads, in part: I noticed you left your lights on. The battery will probably not have enough charge to start your vehicle. I left a blue extension cord on the fence and a battery charger beside the fence in the cardboard box. If you know how to hook it up, use it to start your car. What followed was a detailed explanation of how to use the equipment. 'Sure enough,' Derek recalled to the National Post, 'I looked over at the house my car was parked beside, and there was a blue extension cord plugged into an outlet behind the guy's house with a battery charger right there beside it.' Derek was able to get his car started, but when he rang the good Samaritan's doorbell, there was no answer. So, Derek left his own note as a thank you for the kind gesture. He later snapped a photo of the stranger's friendly note to post to Facebook, where it has now gone viral. The note has been viewed millions of times and even Edmonton Mayor Don Iveson retweeted the photo. Derek snapped a photo of the note for Facebook and it has since gone viral. 'It just shows you how such a pure act of kindness from one person can just spread through everyone and help make everyone's day a little brighter,' Derek said.

Reference Summary: Derek Murray, a University of Alberta law student, could have had his day ruined by the mistake but a stranger's kindness brightened it up. Murray posted his story and the note online and the random act of kindness has now gone viral.

Model Summary: A Canadian student who accidentally left his headlights on all day was greeted by what may have been the world's friendliest note from a stranger when he returned to his car.

Human rating score (↑ means high quality): 0.8
PRsum(perplexity, output RMD) (↓ means high quality): 0.11
Figure A.11: Examples in cnn_dailymail that have the lowest PRsum(perplexity, output RMD) scores, which predict high-quality summaries.

ACKNOWLEDGEMENTS

The authors would like to thank Jeremiah Zhe Liu, Sharat Chikkerur, and the anonymous reviewers for their helpful feedback on the manuscript. The authors would also like to thank Colin Cherry, George Foster, and Polina Zablotskaia for their feedback throughout the project.


Published as a conference paper at ICLR 2023

A.3 AMAZON MECHANICAL TURK ASSESSMENT OF SUMMARY QUALITY

A PEGASUS LARGE model fine-tuned on xsum was run on a random sample of 100 examples from the test split of four datasets: xsum, cnn_dailymail, reddit_tifu, and samsum. Each example was rated for general summarization quality on a scale of 1-5 by 3 AMT workers using the template shown in Figure A.2. Workers were required to be AMT Masters located in the US with a greater than 95% HIT approval rate and at least 1000 approved HITs, and were paid $0.80 per rating.

