TOWARDS BOOSTING THE OPEN-DOMAIN CHATBOT WITH HUMAN FEEDBACK

Anonymous authors
Paper under double-blind review

Abstract

Many open-domain dialogue models pre-trained with social media comments can generate coherent replies but have difficulty producing engaging responses. This phenomenon might mainly result from the scarcity of annotated human-human conversations and the misalignment with human preference. In this paper, we propose a novel and efficient framework, Diamante, to boost the open-domain chatbot, in which two kinds of human feedback (explicit demonstration and implicit preference) are collected and leveraged. By asking annotators to select or amend model-generated candidate responses, Diamante efficiently collects human-demonstrated responses and constructs a Chinese chit-chat dataset. To enhance alignment with human preference, Diamante leverages the implicit preference revealed in the data collection process and introduces generation-evaluation joint training. Comprehensive experiments indicate that the Diamante dataset and the joint training paradigm can significantly boost the performance of pre-trained dialogue models. The overall engagingness of the previous state-of-the-art model improves by a remarkable 50% in Chinese open-domain conversations.

1. INTRODUCTION

In recent years, self-supervised pre-training on tremendous amounts of unlabeled data has brought great success to many natural language processing tasks (Brown et al., 2020; Chowdhery et al., 2022). In dialogue generation, pre-training is usually carried out on massive social media comments, which act as a proxy for human-like conversations (Adiwardana et al., 2020; Bao et al., 2021; Thoppilan et al., 2022). Although these pre-trained dialogue models are capable of generating coherent replies, they have difficulty producing engaging responses. The main reasons for this phenomenon might be twofold. Firstly, there is a considerable gap in data distribution between the proxy human-like conversations (public group discussions) and real human-human conversations (private two-way messaging). Secondly, the dialogue model usually outputs the response with the highest generation probability, which reflects the probability mass over all the training data but might not align well with human preference (e.g., it may produce biased or unsafe statements).

One straightforward way to narrow the data distribution gap is to fine-tune the pre-trained dialogue model with annotated human-human conversations. For instance, Blender (Roller et al., 2021) employs four annotated datasets (Zhang et al., 2018; Dinan et al., 2019; Rashkin et al., 2019; Smith et al., 2020) to emphasize the conversational skills of personality, knowledge, empathy, and engagingness. As for alignment with human preference, LaMDA (Thoppilan et al., 2022) defines and quantifies several critical metrics for dialogue evaluation, including safety, interestingness, and so on. By filtering out candidate responses that perform poorly on these metrics, human preference towards the dialogue model increases significantly. However, compared with English, annotations of high-quality human-human conversations or dialogue evaluation samples are relatively scarce in other languages.
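The metric-based filtering strategy described above can be sketched as follows. This is a minimal illustration of the general idea, not LaMDA's actual implementation: the metric names and threshold values are hypothetical, and a real system would obtain the scores from trained classifiers rather than receive them precomputed.

```python
def filter_candidates(candidates, thresholds):
    """Drop candidate responses that fall below any metric threshold.

    `candidates`: list of (response, scores) pairs, where `scores` maps a
    metric name to a value in [0, 1]. Metric names and thresholds here
    are illustrative, not LaMDA's actual configuration.
    """
    return [
        response
        for response, scores in candidates
        if all(scores.get(metric, 0.0) >= floor
               for metric, floor in thresholds.items())
    ]


candidates = [
    ("Ugh, cats are the worst.",
     {"safety": 0.35, "interestingness": 0.60}),
    ("Cats shed a lot in spring; daily brushing really helps.",
     {"safety": 0.95, "interestingness": 0.70}),
]
kept = filter_candidates(candidates, {"safety": 0.80, "interestingness": 0.50})
print(kept)  # only the second candidate clears both thresholds
```

Only responses that clear every threshold remain eligible for selection, which is how such filtering shifts the final output distribution towards human preference.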
As a result, even the state-of-the-art Chinese chatbot, PLATO-XL (Bao et al., 2021), is only pre-trained with social media comments and does not involve advanced response evaluation.

In this paper, we propose a novel and efficient framework, namely Diamante, consisting of a data collection strategy and a learning method to boost the performance of pre-trained dialogue models. Two kinds of human feedback are collected and leveraged in Diamante: explicit demonstration and implicit preference. Firstly, to bridge the gap in data distribution, Diamante collects

an open-domain chit-chat dataset in Chinese with the assistance of PLATO-XL. Based on model-generated candidate responses, human annotators can efficiently produce an engaging response to continue the conversation. Secondly, we propose to leverage the implicit human preference that appears in the data collection process, i.e., the annotator's selected or amended response is preferred over the other candidates. To this end, Diamante introduces a novel generation-evaluation joint training paradigm, in which high-quality response generation and human preference estimation are learned simultaneously. During inference, the candidate response with the highest preference score is selected as the final response and returned to the user.

Extensive and intensive experiments have been carried out to evaluate the effectiveness of the Diamante framework, including the collected dataset and the joint training paradigm. Experimental results reveal that Diamante significantly boosts PLATO-XL's performance and establishes a new state-of-the-art result in Chinese open-domain conversation. Notably, Diamante even achieves competitive or slightly better performance compared to the human reference. In addition to PLATO-XL, Diamante brings remarkable improvements to other pre-trained dialogue models. The Diamante dataset is now publicly available, which can be accessed and downloaded under the license agreement at the data platform¹. We have also released all source code², hoping to facilitate future research in dialogue generation.
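At inference time, this generation-evaluation paradigm amounts to re-ranking the generator's candidates with the preference estimator. The sketch below illustrates the idea; the `preference_score` heuristic is a hypothetical stand-in for the jointly trained evaluation model, included only so the example is self-contained and runnable.

```python
def preference_score(context: str, response: str) -> float:
    """Stand-in for the learned preference estimator. In Diamante this
    score would come from the jointly trained evaluation head; here a
    toy heuristic (word overlap with the context plus response length)
    is used purely for illustration."""
    overlap = len(set(context.lower().split()) & set(response.lower().split()))
    return 0.1 * overlap + 0.01 * len(response)


def respond(context: str, candidates: list[str]) -> str:
    """Generation-evaluation inference: the generator proposes candidate
    responses, the evaluator scores each one, and the highest-scoring
    candidate is returned to the user."""
    return max(candidates, key=lambda r: preference_score(context, r))


context = "My cat started shedding everywhere in the spring. How to deal with it?"
candidates = [
    "Oh no.",
    "Cats often shed in the spring. Daily brushing and cat grass can help.",
]
print(respond(context, candidates))
```

The same scoring function is what the implicit preference signal trains: the annotator's chosen response should score higher than the rejected candidates.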

2. DIAMANTE DATASET

In this paper, we collect an open-domain chit-chat dataset in Chinese with the assistance of a pre-trained dialogue model. In the following, we describe the creation of the Diamante dataset.

2.1. DATA COLLECTION

Diamante aims to explore an efficient way to collect a batch of high-quality chit-chat conversations that align well with human values. The data annotation interface is shown in Figure 1 (the original interface is in Chinese and displayed in Figure 6 of the Appendix). The data collection process is carried out as follows.

Step 1: Crafting the Dialogue Opening. Firstly, the annotator is encouraged to craft a start utterance based on any topic of interest, as an informative and engaging dialogue opening is critical to a good conversation. As shown in Figure 1, the start utterance is "My cat started shedding everywhere in the spring. How to deal with it?". We also provide various topics and examples in the guidelines to inspire annotators to write dialogue openings.



¹ The Diamante dataset is publicly available at https://anonymous.
² The Diamante source code is available at https://github.com/anonymous.



Figure 1: Illustration of Diamante's annotation interface. The interface comprises a task description panel, model-generated candidate replies (e.g., "My cat eats hairballs, too.", "Feed him some hairball remedies.", "You need to pay attention to it. The cat may be deficient in some trace elements. Go to see a veterinarian."), and the dialogue pane, where the annotator selects, revises, or rewrites a candidate before submitting.

