KNOWLEDGE-CONSISTENT DIALOGUE GENERATION WITH LANGUAGE MODELS AND KNOWLEDGE GRAPHS

Anonymous authors
Paper under double-blind review

Abstract

Pre-trained language models have achieved impressive performance on dialogue generation tasks. However, when generating responses for a conversation that requires factual knowledge, they are far from perfect, due to the absence of mechanisms to retrieve, encode, and reflect the knowledge in the generated responses. Some knowledge-grounded dialogue generation methods tackle this problem by leveraging structured knowledge from Knowledge Graphs (KGs). However, existing methods do not guarantee that the model utilizes a relevant piece of knowledge from the KG before generating knowledge-consistent dialogues. To overcome this limitation, we propose SUbgraph Retrieval-augmented GEneration (SURGE), a framework for generating context-relevant and knowledge-consistent dialogues with a KG. Specifically, our method first retrieves the relevant subgraph from the KG, and then enforces consistency across facts by perturbing the word embeddings conditioned on the retrieved subgraph. It then learns a latent representation space using contrastive learning, which ensures that the generated texts have high similarity to the retrieved subgraphs. We validate the performance of our SURGE framework on the OpendialKG and KOMODIS datasets, and show that our method generates high-quality dialogues that faithfully reflect the knowledge from the KG.

1. INTRODUCTION

Dialogue systems aim at conversing with humans by generating human-like responses given the dialogue context. While pre-trained language models (PLMs) (Radford et al., 2019; Raffel et al., 2020) are capable of generating fluent responses, they often generate factually incorrect responses due to a lack of explicit knowledge (Shuster et al., 2021). To overcome such limitations, recent methods access external knowledge sources, such as Wikipedia (Dinan et al., 2019) or the Web (Komeili et al., 2021), and retrieve the knowledge relevant to the ongoing conversation. In addition to such document-based retrieval approaches, there also exists a variety of works (Tuan et al., 2019; Wu et al., 2020; Zhang et al., 2020a; Cui et al., 2021; Zhou et al., 2021; Galetzka et al., 2021; Li et al., 2022) that focus on the use of Knowledge Graphs (KGs) (Bollacker et al., 2008; Vrandecic & Krötzsch, 2014) for dialogue generation; KGs are a different form of knowledge source that encodes knowledge in a compact and effective form. Specifically, KGs consist of symbolic facts that represent entities as nodes and their relations as edges, in the form of triplets, e.g., (Pride & Prejudice, written by, Jane Austen) (see Figure 1), which can help generate a knowledge-grounded response.

Most dialogue generation models with KGs (Galetzka et al., 2021; Li et al., 2022) utilize all the triplets associated with the entities in the dialogue context. However, not all of the facts are relevant to the ongoing conversation (e.g., Jane Austen was born in Steventon in Figure 1), which could mislead the models into generating factually incorrect responses. We found that about 87% of facts from the 1-hop KG are irrelevant to the context in the OpendialKG dataset (Moon et al., 2019). Moreover, encoding all the facts, including the unnecessary ones, is computationally inefficient (Galetzka et al., 2021; Rony et al., 2022).
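To make the scale of the problem concrete, the triplet facts and the 1-hop retrieval described above can be sketched as follows. This is a minimal toy illustration with a hand-written fact list; the entity and relation strings are illustrative and not drawn from any real KG.

```python
# Minimal sketch: KG facts as (head, relation, tail) triplets, and 1-hop
# retrieval of every fact touching a context entity.
KG = [
    ("Pride & Prejudice", "written by", "Jane Austen"),
    ("Jane Austen", "born in", "Steventon"),
    ("Jane Austen", "genre", "Romance novel"),
    ("Sense and Sensibility", "written by", "Jane Austen"),
]

def one_hop_facts(entity, kg):
    """Return all triplets whose head or tail is the given entity."""
    return [t for t in kg if entity in (t[0], t[2])]

facts = one_hop_facts("Jane Austen", KG)
# All four toy facts touch "Jane Austen", but only some would be relevant
# to a conversation about books; ("Jane Austen", "born in", "Steventon")
# is a likely distractor, mirroring the irrelevance issue discussed above.
```

Even in this toy example, naively conditioning on every retrieved fact forces the model to filter out distractors on its own, which motivates learning a context-relevant retriever instead.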
On the other hand, even after correctly retrieving the relevant facts, combining the two heterogeneous modalities is not straightforward: the dialogue context is represented as text, whereas the knowledge is represented as a graph. In other words, since PLMs have a large number of parameters pre-trained on unstructured text, properly conditioning them on the structured graph is highly important. Otherwise, PLMs may generate inconsistent responses that disregard the knowledge from the retrieved subgraph, a phenomenon known as hallucination (Rohrbach et al., 2018), where they generate responses based on their own memorized yet incorrect knowledge.

Figure 1: Motivation. Existing knowledge-grounded dialogue generation models with KGs utilize the multi-hop subgraph for the entities in the dialogue context (Jane Austen). However, they suffer from the following two problems: (1) irrelevant knowledge, where only 12.6% of facts from the 1-hop KG are useful for generating the target responses given a dialogue context, and (2) inconsistent generation, including factually wrong statements.

In this work, we tackle such challenging and fundamental issues of knowledge-consistent dialogue generation with KGs. We propose an end-to-end dialogue generation framework that considers all aspects of knowledge retrieval, encoding, and reflection along the generation process. As a first step, we propose a context-relevant subgraph retriever that retrieves only the relevant triplets from the KG, to prevent the model from generating context-irrelevant responses. Notably, our subgraph retrieval method embeds the KG considering its relational structure with a Graph Neural Network (GNN) (Kipf & Welling, 2017), instead of using PLMs as in previous work (Li et al., 2022). Furthermore, it is end-to-end trainable jointly with the generation objective, by marginalizing the likelihood of the generated sentences over the latent retrieved subgraph (Guu et al., 2020; Lewis et al., 2020b). Then, to encode the retrieved subgraph along with the input text sequence, we propose a graph encoding method that is permutation- and relation-inversion-invariant yet efficient. Specifically, we devise a graph encoding method that reflects the graph structure onto the representation space of PLMs, instead of prepending graph tokens in front of the text sequence, to avoid the computational burden.
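The joint training of the retriever and the generator via marginalization can be written schematically as follows. This is a sketch in our own notation, assuming $x$ denotes the dialogue context, $y$ the response, $\mathcal{G}$ the full KG, and $Z$ a candidate retrieved subgraph; the paper's exact formulation may differ.

```latex
% Marginal likelihood of the response y given the dialogue context x,
% treating the retrieved subgraph Z as a latent variable over the KG G:
p(y \mid x) \;=\; \sum_{Z \subseteq \mathcal{G}} \; p_{\phi}(Z \mid x, \mathcal{G}) \; p_{\theta}(y \mid x, Z)
```

In retrieval-augmented generation schemes of this kind (Guu et al., 2020; Lewis et al., 2020b), the intractable sum is typically approximated over the top-scoring retrieved candidates, so gradients flow to both the retriever parameters $\phi$ and the generator parameters $\theta$.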
Furthermore, to ensure that the model makes use of the encoded knowledge when generating responses, we propose a multi-modal contrastive learning objective between the graph and text modalities, which enforces consistency between the retrieved facts and the generated texts. We call our framework SUbgraph Retrieval-augmented GEneration (SURGE). We validate our framework on the OpendialKG (Moon et al., 2019) and KOMODIS (Galetzka et al., 2020) datasets against relevant baselines. Note that, when evaluating the generated responses from dialogue models, conventional metrics (e.g., BLEU (Papineni et al., 2002), ROUGE (Lin, 2004)) cannot measure how faithfully the generated responses reflect the related knowledge in KGs. Thus, for evaluation, we further introduce an additional performance metric, referred to as Knowledge-verifying Question Answering (KQA), which evaluates whether the generated responses contain the correct knowledge via an additional extractive question answering scheme. The experimental results show that SURGE generates responses that not only agree with the gold knowledge but are also consistent with the retrieved knowledge from KGs. Our main contributions can be summarized as follows:

• We propose a GNN-based context-relevant subgraph retrieval method for KG-augmented dialogue generation, which extracts only the piece of knowledge relevant to the dialogue context from the entire knowledge graph, for generating more appropriate responses to the ongoing conversation.

• We propose an invariant yet efficient graph encoder and a graph-text contrastive learning objective to ensure that the generated responses faithfully reflect the retrieved knowledge.

• We validate SURGE against relevant baselines, demonstrating its efficacy in generating responses that are more informative, by retrieving and reflecting the relevant knowledge from the KG.
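The graph-text contrastive objective above can be sketched as a standard InfoNCE-style loss between pooled subgraph embeddings and response embeddings. The code below is a minimal NumPy illustration under our own simplifying assumptions (each batch row pairs one graph embedding with its response embedding); it is not the paper's actual implementation, which operates on PLM representations.

```python
# Sketch of an InfoNCE-style graph-text contrastive loss: each graph
# embedding is pulled toward its paired text embedding and pushed away
# from the other texts in the batch. Shapes and names are illustrative.
import numpy as np

def info_nce_loss(graph_emb: np.ndarray, text_emb: np.ndarray, tau: float = 0.1) -> float:
    """graph_emb, text_emb: arrays of shape (batch, dim); tau: temperature."""
    # L2-normalize so dot products are cosine similarities.
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = g @ t.T / tau  # (batch, batch); row i's positive is column i
    # Row-wise log-softmax; the loss is the negative log-probability of
    # the diagonal (matched) pairs.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
texts = rng.normal(size=(4, 32))
loss_matched = info_nce_loss(texts, texts)               # aligned pairs
loss_shifted = info_nce_loss(np.roll(texts, 1, axis=0), texts)  # misaligned
```

Aligned graph-text pairs yield a much lower loss than misaligned ones, which is the signal that encourages generated responses to stay close to the retrieved subgraph in the shared latent space.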

2. RELATED WORK

Language Models. Pre-trained Language Models (PLMs) (Radford et al., 2019; Lewis et al., 2020a; Raffel et al., 2020) that use a Transformer-based (Vaswani et al., 2017) encoder-decoder architecture have achieved great success on language generation tasks. As they can accurately contextualize the given context and then generate human-like sentences, they are often used as the base architecture for neural dialogue systems (Zhang et al., 2020b; Hosseini-Asl et al., 2020). Moreover, as PLMs become larger, dialogue models have been shown to generate higher-quality responses (Adiwardana et al., 2020), suggesting that pre-trained parameters do contain certain knowledge (Petroni et al., 2019). Despite the fluency of such PLM-based dialogue agents, they often generate factually incorrect

