OUT-OF-DISTRIBUTION DETECTION AND SELECTIVE GENERATION FOR CONDITIONAL LANGUAGE MODELS

Abstract

Machine learning algorithms typically assume that training and test samples are independent and identically distributed. Much work has shown that high-performing ML classifiers can degrade significantly and produce overly confident, wrong predictions, particularly on out-of-distribution (OOD) inputs. Conditional language models (CLMs) are predominantly trained to classify the next token in an output sequence, and may suffer even worse degradation on OOD inputs, as the prediction is made auto-regressively over many steps. Furthermore, the space of potential low-quality outputs is larger, since arbitrary text can be generated, and it is important to know when to trust the generated output. We present a highly accurate and lightweight OOD detection method for CLMs, and demonstrate its effectiveness on abstractive summarization and translation. We also show how our method can be used under the common and realistic setting of distribution shift for selective generation (analogous to selective prediction for classification) of high-quality outputs, while automatically abstaining from low-quality ones, enabling safer deployment of generative language models.

1. INTRODUCTION

Recent progress in generative language models (Wu et al., 2016a; Radford et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Zhang et al., 2020) has led to quality approaching human performance on research datasets and has opened up the possibility of their wide deployment beyond the academic setting. In realistic user-facing scenarios such as text summarization and translation, it should be expected that user-provided inputs can deviate significantly from the training data distribution. This violates the independent, identically-distributed (IID) assumption commonly used in evaluating machine learning models. Many have shown that the performance of machine learning models can degrade significantly and in surprising ways on OOD inputs (Nguyen et al., 2014; Goodfellow et al., 2014; Ovadia et al., 2019). For example, an image classifier may detect cows in images with very high accuracy on its IID test set, but confidently fail to detect a cow when it is paired with an unseen background (Murphy, 2023; Nagarajan et al., 2020). This has led to active research on OOD detection for a variety of domains, including vision and text, but focused primarily on classification. Conditional language models are typically trained, given an input sequence x = x_1 . . . x_L, to auto-regressively generate the next token in a sequence y = y_1 . . . y_T as a classification over the token vocabulary V: p_θ(y | x) = ∏_{t=1}^{T} p_θ(y_t | y_{<t}, x), with y_t ∈ V. Consequently, the perils of out-of-distribution inputs are arguably more severe because (a) errors propagate and magnify through auto-regression, and (b) the space of low-quality outputs is greatly increased, as arbitrary text sequences can be generated. Common errors from text generation models include disfluencies (Holtzman et al., 2020) and factual inaccuracies (Goodrich et al., 2019; Maynez et al., 2020).
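The auto-regressive factorization above can be made concrete: given the model's per-step softmax distributions, the sequence log-likelihood is the sum of log-probabilities assigned to the observed output tokens. A minimal NumPy sketch (function and variable names are illustrative, not the paper's code):

```python
import numpy as np

def sequence_log_likelihood(step_probs, target_ids):
    """Sum of log p(y_t | y_<t, x) over the output sequence.

    step_probs: array of shape (T, V); each row is the model's softmax
                distribution over the vocabulary at decoding step t.
    target_ids: length-T array of observed token indices y_1 .. y_T.
    """
    T = len(target_ids)
    # Pick out the probability of the observed token at each step.
    token_logprobs = np.log(step_probs[np.arange(T), target_ids])
    return float(token_logprobs.sum())

# Toy example: T = 3 decoding steps over a vocabulary of size V = 4.
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
targets = np.array([0, 1, 3])
ll = sequence_log_likelihood(probs, targets)  # log 0.7 + log 0.6 + log 0.7
```

In a real CLM the `step_probs` rows come from the decoder's softmax at each step, conditioned on the input x and the previously generated tokens.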
A common failure case we observed in abstractive summarization is for the model to output "All images are copyrighted" as the summary for news articles from a publisher (CNN) different from the one it was trained on (BBC) (see Figure A.7). In this work, we propose OOD detection methods for CLMs, using abstractive summarization and translation as case studies. As in classification, we show in Section 2.1 that CLMs have untrustworthy likelihood estimates on OOD examples, making perplexity a poor choice for OOD detection. In Section 2.2, we propose a highly accurate, simple, and lightweight OOD score based on the model's input and output representations (or embeddings) to detect OOD examples, requiring negligible additional compute beyond the model itself.

While accurate OOD detection enables the conservative option of abstaining from generation on OOD examples, it may be desirable to generate on known near-domain data, e.g., to generate summaries for articles from news publishers that differ from our fine-tuning set. The ability to selectively generate where the model is more likely to produce higher-quality outputs thus enables safer deployment of conditional language models. We call this procedure selective generation, analogous to the commonly used term selective prediction in classification (Chow, 1957; Bartlett & Wegkamp, 2008; Geifman & El-Yaniv, 2017). In Section 4, we show that while model perplexity is a reasonable choice for performing selective generation on in-domain examples, combining it with our OOD score works much better when the input distribution is shifted. In summary, our contributions are:

• We propose lightweight and accurate scores derived from a CLM's embeddings for OOD detection, significantly outperforming baselines on abstractive summarization and translation tasks, without the need for a separate detection model.
• We show that model perplexity can be an unreliable signal for quality estimation on OOD examples, but that combining it with our OOD scores can be used effectively to selectively generate higher-quality outputs while abstaining on lower-quality ones.
• We propose an evaluation framework for OOD detection and selective generation for CLMs, including human quality ratings for summarization.
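The embedding-based score itself is specified in Section 2.2; as background, a common instantiation of this family of detectors fits a Gaussian to in-domain embeddings and scores a new input by its Mahalanobis distance to that fit. The sketch below illustrates that general recipe on synthetic data (it is an assumption-laden illustration, not necessarily the exact score proposed in this paper):

```python
import numpy as np

def fit_gaussian(embeddings):
    """Fit mean and inverse covariance to in-domain embeddings (N, D)."""
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    # Small ridge term keeps the covariance invertible; invert once,
    # then reuse the precision matrix for every query.
    prec = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    return mu, prec

def mahalanobis_ood_score(z, mu, prec):
    """Larger score = farther from the in-domain embedding distribution."""
    d = z - mu
    return float(d @ prec @ d)

# Synthetic demo: in-domain embeddings near the origin, an OOD
# embedding far from it.
rng = np.random.default_rng(0)
in_domain = rng.normal(0.0, 1.0, size=(500, 8))
mu, prec = fit_gaussian(in_domain)
near = mahalanobis_ood_score(rng.normal(0.0, 1.0, size=8), mu, prec)
far = mahalanobis_ood_score(rng.normal(6.0, 1.0, size=8), mu, prec)
# Thresholding the score then flags inputs like `far` as OOD.
```

For a CLM, `embeddings` would be the model's own input or output representations, so the detector adds almost no compute beyond the forward pass that generation already requires.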

2. OOD DETECTION IN CONDITIONAL LANGUAGE MODELS

The maximum softmax probability (MSP), p(ŷ | x) with ŷ = arg max_{k=1,...,K} p(k | x), is a simple, commonly used OOD score for K-class classification problems (Hendrycks & Gimpel, 2016; Lakshminarayanan et al., 2017). For CLMs, the perplexity, which is monotonically related to the negative log-likelihood of the output sequence averaged over tokens, −(1/T) ∑_{t=1}^{T} log p(y_t | y_{<t}, x), is a natural OOD score to consider, and is analogous to the negative MSP in classification because both are based on softmax probabilities. We first study how well perplexity performs for OOD detection.

In Figure 1, we compare the perplexity distributions of (a) a summarization model and (b) a translation model, each trained on an in-domain dataset and evaluated on multiple OOD datasets. For summarization, the model is trained on xsum and evaluated on other news datasets, including cnn dailymail and newsroom as near-OOD datasets, and on forum (forumsum) and dialogue (samsum and reddit tifu) datasets as far-OOD (see Section 3 for details). The perplexity distributions overlap significantly with each other even though the input documents are significantly different. Furthermore, perplexity assigns cnn dailymail even lower scores than the in-domain xsum. For translation, the model is trained on the WMT15 dataset and evaluated on other WMT test splits (Bojar et al., 2015), OPUS100 (Aulamo & Tiedemann, 2019), and MTNT (Michel & Neubig, 2018). The in-domain and OOD perplexity densities overlap even more. Overall, these results suggest that perplexity is not well suited for OOD detection.
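Both quantities above follow directly from per-token log-probabilities, and a score's ability to separate in-domain from OOD examples is typically summarized by AUROC. The sketch below (illustrative code, not the paper's evaluation pipeline) computes perplexity and a rank-based AUROC, and shows that when per-token log-probabilities for OOD sequences overlap heavily with in-domain ones, as in Figure 1, perplexity separates the two groups only weakly:

```python
import numpy as np

def perplexity(token_logprobs):
    """exp of the negative per-token mean of log p(y_t | y_<t, x)."""
    return float(np.exp(-np.mean(token_logprobs)))

def auroc(in_scores, out_scores):
    """Probability that a random OOD example scores higher than a
    random in-domain one (with higher score meaning 'more OOD')."""
    ins = np.asarray(in_scores)[:, None]
    outs = np.asarray(out_scores)[None, :]
    return float((outs > ins).mean() + 0.5 * (outs == ins).mean())

# Toy illustration of the overlap problem: OOD sequences whose
# per-token log-probs are only slightly worse than in-domain ones.
rng = np.random.default_rng(1)
in_ppl = [perplexity(rng.normal(-2.0, 0.5, size=20)) for _ in range(200)]
ood_ppl = [perplexity(rng.normal(-2.1, 0.5, size=20)) for _ in range(200)]
score_auroc = auroc(in_ppl, ood_ppl)  # far from the ~1.0 of a good detector
```

A well-separated OOD score would drive the AUROC toward 1.0; heavily overlapping perplexity distributions keep it much closer to chance.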



Salehi et al. (2021); Bulusu et al. (2020); Ruff et al. (2021) provide comprehensive reviews on this topic.

Figure 1: Density of perplexity scores for a CLM trained on (a) xsum for summarization and (b) WMT for translation, evaluated on other datasets/domains. Perplexity is not well suited for OOD detection due to significant overlap between in-domain and OOD scores.

