DETECTING HALLUCINATED CONTENT IN CONDITIONAL NEURAL SEQUENCE GENERATION

Abstract

Neural sequence models can generate highly fluent sentences, but recent studies have shown that they are also prone to hallucinating additional content that is not supported by the input, which can undermine trust in the model. To better assess the faithfulness of machine outputs, we propose a new task: predicting whether each token in the output sequence is hallucinated, conditioned on the source input, and we collect new manually annotated evaluation sets for this task. We also introduce a novel method for learning hallucination detection, based on pretrained language models fine-tuned on synthetic data that includes automatically inserted hallucinations. Experiments on machine translation and abstractive text summarization demonstrate the effectiveness of our proposed approach: we obtain an average F1 of around 0.6 across all the benchmark datasets. Furthermore, we demonstrate how to use the token-level hallucination labels to define a fine-grained loss over the target sequence in low-resource machine translation, achieving significant improvements over strong baseline methods. We will also release our annotated data and code for future research.
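To make the abstract's idea of a fine-grained loss concrete, the following is a minimal sketch of how token-level hallucination labels could down-weight hallucinated target tokens in a sequence-level negative log-likelihood. The function name, the binary 0/1 label convention, and the simple re-weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
import math

def token_weighted_nll(token_log_probs, hallucination_labels, halluc_weight=0.0):
    """Negative log-likelihood over a target sequence in which tokens
    labeled as hallucinated (label 1) receive weight `halluc_weight`
    (0.0 removes them from the loss entirely)."""
    assert len(token_log_probs) == len(hallucination_labels)
    loss, total_weight = 0.0, 0.0
    for lp, label in zip(token_log_probs, hallucination_labels):
        w = halluc_weight if label == 1 else 1.0
        loss += -w * lp
        total_weight += w
    # Normalize by the total weight so sequences with many masked
    # tokens are not artificially down-scaled.
    return loss / max(total_weight, 1e-9)

# Example: a 4-token output where the third token is flagged as hallucinated
# and therefore contributes nothing to the training signal.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.1), math.log(0.7)]
labels = [0, 0, 1, 0]
print(round(token_weighted_nll(log_probs, labels), 4))  # → 0.2284
```

Setting `halluc_weight` between 0 and 1 would instead soften, rather than remove, the penalty on tokens a detector believes are hallucinated.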

1. INTRODUCTION

Neural sequence models have achieved impressive breakthroughs in a wide range of applications, including data-to-text generation (Puduppully et al., 2019), machine translation (Vaswani et al., 2017; Wu et al., 2016) and text summarization (Rothe et al., 2020). Although these models can generate fluent sentences that are sometimes even preferred to human-written content (Läubli et al., 2018; Brown et al., 2020), recent work has also shown that they lack global logical consistency (Marcus & Davis, 2020), sometimes degenerate into dull and repetitive outputs (Welleck et al., 2019), and can often hallucinate content that is not entailed by the input (Maynez et al., 2020). In this paper, we focus on the faithfulness of machine outputs in conditional sequence generation tasks, aiming to automatically identify and quantify content in the output that is not faithful to the input text.

This risk of generating unfaithful content impedes the safe deployment of neural sequence generation models. The first step toward building models that do not suffer from these failures is the assessment and identification of such hallucinated outputs. Prior work has shown that standard metrics used for sequence evaluation, such as BLEU (Papineni et al., 2002; Post, 2018), ROUGE (Lin & Hovy, 2004) and BERTScore (Zhang et al., 2019), do not correlate well with the faithfulness of model outputs (Maynez et al., 2020; Wang & Sennrich, 2020; Tian et al., 2019); they also require reference text, limiting their applicability to detecting hallucinations in a deployed system at run-time. Very recent efforts (Maynez et al., 2020; Durmus et al., 2020; Wang et al., 2020a) have started to develop automatic metrics to measure the faithfulness of output sequences. These methods use external semantic models, e.g. question-generation and question-answering systems (Wang et al., 2020a; Durmus et al., 2020) or textual entailment inference models, to score faithfulness, tailored to abstractive text summarization. However, these scores do not directly measure the number of hallucinated tokens and correlate only weakly with human judgements due to compounded errors.

We propose a new task for faithfulness assessment: hallucination detection at the token level, which aims to predict whether each token in the machine output is hallucinated or faithful to the source input. This task does not use the reference output to assess faithfulness, which allows us to apply it in online generation scenarios where references are not available. Similar in spirit to our proposed task, word-level quality estimation (Fonseca et al., 2019) in machine translation

