DETECTING HALLUCINATED CONTENT IN CONDITIONAL NEURAL SEQUENCE GENERATION

Abstract

Neural sequence models can generate highly fluent sentences, but recent studies have shown that they are also prone to hallucinating additional content not supported by the input, which can erode trust in the model. To better assess the faithfulness of machine outputs, we propose a new task: predicting whether each token in the output sequence is hallucinated, conditioned on the source input, and we collect new manually annotated evaluation sets for this task. We also introduce a novel method for learning hallucination detection, based on pretrained language models fine-tuned on synthetic data that includes automatically inserted hallucinations. Experiments on machine translation and abstractive text summarization demonstrate the effectiveness of our proposed approach: we obtain an average F1 of around 0.6 across all the benchmark datasets. Furthermore, we demonstrate how to use the token-level hallucination labels to define a fine-grained loss over the target sequence in low-resource machine translation, achieving significant improvements over strong baseline methods. We will also release our annotated data and code for future research.

1. INTRODUCTION

Neural sequence models have achieved impressive breakthroughs in a wide range of applications, including data-to-text generation (Puduppully et al., 2019), machine translation (Vaswani et al., 2017; Wu et al., 2016) and text summarization (Rothe et al., 2020). Although these models can generate fluent sentences that are sometimes even preferred to human-written content (Läubli et al., 2018; Brown et al., 2020), recent work has also shown that they lack global logical consistency (Marcus & Davis, 2020), sometimes degenerate into dull and repetitive outputs (Welleck et al., 2019), and can often hallucinate content that is not entailed by the input (Maynez et al., 2020). In this paper, we focus on the faithfulness of machine outputs in conditional sequence generation tasks, aiming to automatically identify and quantify content in the output that is not faithful to the input text.

The risk of generating unfaithful content impedes the safe deployment of neural sequence generation models, and the first step toward building models that do not suffer from these failures is the assessment and identification of such hallucinated outputs. Prior work has shown that standard metrics used for sequence evaluation, such as BLEU (Papineni et al., 2002; Post, 2018), ROUGE (Lin & Hovy, 2004) and BERTScore (Zhang et al., 2019), do not correlate well with the faithfulness of model outputs (Maynez et al., 2020; Wang & Sennrich, 2020; Tian et al., 2019); they also require reference text, which limits their applicability to detecting hallucinations in a deployed system at run time. Very recent efforts (Maynez et al., 2020; Durmus et al., 2020; Wang et al., 2020a) have started to develop automatic metrics to measure the faithfulness of output sequences. These methods use external semantic models, e.g.
question-generation and question-answering systems (Wang et al., 2020a; Durmus et al., 2020) or textual entailment inference models, to score faithfulness, tailored to abstractive text summarization. However, these scores do not directly measure the number of hallucinated tokens, and they correlate only weakly with human judgements due to compounded errors.

We propose a new task for faithfulness assessment: hallucination detection at the token level, which aims to predict whether each token in the machine output is hallucinated or faithful to the source input. This task does not use the reference output to assess faithfulness, so it can be applied in online generation scenarios where references are not available. Similar in spirit to our proposed task, the word-level quality estimation community (Fonseca et al., 2019) predicts whether tokens are correctly translated based on human post-editing. However, that line of work does not distinguish errors in terms of fluency and adequacy (Specia et al., 2011). In contrast to estimating the amount of human post-editing work required to fix errors, we focus specifically on hallucination (not fluency) errors.

We measure hallucination for two conditional sequence generation tasks: abstractive summarization and machine translation (MT). For the former, we produce a benchmark dataset from the recently released annotations of Maynez et al. (2020). For MT, we carefully design human assessment guidelines and create high-quality annotations; we will also release our human-annotated data for future research. To learn token-level hallucination prediction for general conditional sequence generation tasks, we propose a novel method that creates a synthetic "hallucinated" dataset with pseudo labels and finetunes a pretrained language model (Liu et al., 2019; Conneau et al., 2020) on it.
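As a rough illustration of the synthetic-data idea (this is not the paper's exact corruption procedure; the function and its parameters are our own illustrative choices), one can corrupt a reference target by inserting spurious tokens and record a binary pseudo label for every token of the corrupted sequence:

```python
import random

def make_synthetic_example(target_tokens, noise_vocab, max_insertions=2, seed=None):
    """Corrupt a reference target by inserting random tokens, recording a
    binary pseudo label for every token: 1 = inserted ("hallucinated"),
    0 = original. Illustrative noising scheme only, not the paper's method."""
    rng = random.Random(seed)
    n_insert = rng.randint(1, max_insertions)
    # Choose positions (between tokens, including both ends) to insert noise.
    insert_positions = set(rng.sample(range(len(target_tokens) + 1), n_insert))
    corrupted, labels = [], []
    for i, tok in enumerate(target_tokens):
        if i in insert_positions:
            corrupted.append(rng.choice(noise_vocab))
            labels.append(1)
        corrupted.append(tok)
        labels.append(0)
    if len(target_tokens) in insert_positions:  # insertion after the last token
        corrupted.append(rng.choice(noise_vocab))
        labels.append(1)
    return corrupted, labels
```

A pretrained cross-lingual encoder can then be fine-tuned to recover these pseudo labels as a token classification problem.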
Without any human-annotated supervised training data, we achieve an average F1 of around 0.6 across all the benchmark datasets, setting initial performance levels for this new task. We also show that pretraining in MT can produce more faithful translations, confirming recent findings in abstractive summarization (Maynez et al., 2020). Predicting hallucination labels at the token level provides a tool for diagnosing and interpreting model outputs, which allows us to flag potential risks at inference time for previously unseen inputs. The token-level labels also offer the possibility of fine-grained control over the target sequence to improve generation. We show how to use these token-level hallucination labels to improve self-training in low-resource MT, where the teacher can produce hallucinated outputs that are harmful to the student model. Many outputs, however, are only partially hallucinated (see examples in Appendix D.6), and the rest of the output is still useful for training, as we show by introducing different token-level loss truncation schemes. Our best method outperforms strong baselines by a large margin in both translation quality and hallucination reduction.
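To make the loss-truncation idea concrete, here is a minimal sketch (our own simplification, not the paper's exact scheme): given per-token hallucination labels for a teacher output, the student's cross-entropy is simply masked on the flagged tokens, so the partially faithful remainder of the sequence still contributes to training.

```python
def masked_nll(token_log_probs, hallucination_labels):
    """Token-level loss truncation: average negative log-likelihood over
    target tokens, dropping those flagged as hallucinated (label 1) so the
    student is not trained to imitate them. One simple variant of the
    schemes discussed in the text; names are illustrative."""
    kept = [-lp for lp, lab in zip(token_log_probs, hallucination_labels)
            if lab == 0]
    if not kept:  # fully hallucinated output: contributes no loss
        return 0.0
    return sum(kept) / len(kept)
```

The same labels could alternatively down-weight (rather than zero out) flagged tokens; masking is just the simplest instance.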

2. TASK: HALLUCINATION PREDICTION AT TOKEN-LEVEL

For a source sequence S and its model generation G from a neural conditional generation model, following Maynez et al. (2020) we define any span w_i, ..., w_{i+j} (j >= 0) in G as hallucinated if it cannot be entailed by the source input S. More specifically, we consider two non-mutually-exclusive types of hallucination:

Content Insertion: a span w_i, ..., w_{i+j} in G contains additional content that is not supported by S, i.e. no paraphrase or other equivalent form of it can be inferred from S. In Fig. 1, the word "happily" in the machine translation belongs to this case. This is also referred to as an "extrinsic hallucination" in Maynez et al. (2020).

Incorrect Substitution: a span of word(s) misrepresents information in S. In Fig. 1, "Jerry" in the machine translation is a hallucinated word and should be replaced by "Mike". This type of hallucination is similar to the concept of "intrinsic hallucination" in Maynez et al. (2020).

Note that there are cases where certain words are dropped in G (e.g. "This is not a great book." becomes "This is a great book.") and hence the meaning of S is changed; we consider any spans in G that misrepresent S as hallucinated content (e.g. the entire sentence "This is a great book."). We aim to identify all the span(s) in the machine generation G satisfying the above conditions.¹ The above definition is only used in the guidelines for human annotators, who do not need to distinguish between these types rigorously.
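Since the task reduces to a binary label per output token, prediction quality can be scored with precision, recall, and F1 on the hallucinated class. A minimal sketch of such a scorer (an illustrative helper, not the paper's released evaluation code):

```python
def hallucination_f1(gold, pred):
    """Precision/recall/F1 for the hallucinated class (label 1), given
    gold and predicted binary labels over the tokens of an output."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```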



¹ We do not consider under-generation, e.g. where the source input is only partially translated or summarized.



Figure 1: A toy example of token-level hallucination detection in machine translation. The words in grey blocks are an example machine translation output, and the labels above them indicate whether each word is faithful (0) to the source input or hallucinated (1).

