DOMAIN-SLOT RELATIONSHIP MODELING USING A PRE-TRAINED LANGUAGE ENCODER FOR MULTI-DOMAIN DIALOGUE STATE TRACKING

Abstract

Dialogue state tracking for multi-domain dialogues is challenging because the model must track dialogue states across multiple domains and slots. Past studies were limited in that they did not factor in the relationship among different domain-slot pairs. Although recent approaches do support relationship modeling among domain-slot pairs, they did not leverage a pre-trained language model, which has improved the performance of numerous natural language tasks, in the encoding process. Our approach fills the gap between these previous studies. We propose a model for multi-domain dialogue state tracking that effectively models the relationship among domain-slot pairs using a pre-trained language encoder. Inspired by the way the special [CLS] token in BERT is used to aggregate the information of the whole sequence, we use one special token for each domain-slot pair, each encoding the information corresponding to its domain and slot. The special tokens are run together with the dialogue context through the pre-trained language encoder, which effectively models the relationship among different domain-slot pairs. Our experimental results show that our model achieves state-of-the-art performance on the MultiWOZ-2.1 and MultiWOZ-2.2 datasets.

1. INTRODUCTION

A task-oriented dialogue system is designed to help humans solve tasks by understanding their needs and providing relevant information accordingly. For example, such a system may assist its user with making a reservation at an appropriate restaurant by understanding the user's need to have a nice dinner. It can also recommend an attraction site to a travelling user, accommodating the user's specific preferences. Dialogue State Tracking (DST) is a core component of these task-oriented dialogue systems; it aims to identify the state of the dialogue between the user and the system. DST represents the dialogue state with triplets of a domain, a slot, and a value. {restaurant, price range, cheap} and {train, arrive-by, 7:00 pm} are examples of such triplets. Fig. 1 illustrates an example of the dialogue state during the course of a conversation between the user and the system. Since a dialogue continues for multiple turns of utterances, the DST model should successfully predict the dialogue state at each turn as the conversation proceeds. For multi-domain conversations, the DST model should be able to track dialogue states across different domains and slots.

Past research on multi-domain conversations used a placeholder in the model to represent domain-slot pairs: a domain-slot pair is inserted into the placeholder in each run, and the model runs repeatedly until it covers all of the domain-slot pairs (Wu et al., 2019; Zhang et al., 2019; Lee et al., 2019). A DST model generally uses an encoder to extract information from the dialogue context that is relevant to the dialogue state. A typical input for a multi-domain DST model comprises the sequence of the user's and the system's utterances up to turn t, X_t, and the domain-slot information for domain i and slot j, D_i S_j. In each run, the model feeds the input for a given domain-slot pair through the encoder:
f_encoder(X_t, D_i S_j) for i = 1, ..., n, j = 1, ..., m,

where n and m are the number of domains and slots, respectively. However, because each domain-slot pair is modeled independently, the relationship among the domain-slot pairs cannot be learned. For example, if the user first asked for a hotel in a certain place and later asked for a restaurant near that hotel, sharing the information between {hotel, area} and {restaurant, area} would help the model recognize that the restaurant should be in the same area as the hotel. Recent approaches address this issue by modeling the dialogue state of every domain-slot pair in a single run, given a dialogue context (Chen et al., 2020; Le et al., 2019). This approach can be represented as follows:

f_encoder(X_t, D_1 S_1, ..., D_n S_m).

Because the encoder receives all of the domain-slot pairs, the model can factor in the relationship among the domain-slot pairs through the encoding process. For the encoder, these studies used models that are trained from scratch, without pre-training. However, since DST involves natural language text for the dialogue context, using a pre-trained language model can help improve the encoding process. Our approach fills the gap between these previous studies. In this work, we propose a model for multi-domain dialogue state tracking that effectively models the relationship among domain-slot pairs using a pre-trained language encoder. We modify the input structure of BERT, specifically its special tokens, to adapt it for multi-domain DST. The [CLS] token of BERT (Devlin et al., 2019) is expected to encode the aggregate sequence representation as it runs through BERT, and it is used for various downstream tasks such as sentence classification or question answering. This [CLS] token can also serve as an aggregate representation of a given dialogue context.
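To make the contrast concrete, the following Python sketch builds the encoder inputs for both schemes. It is an illustration rather than the paper's implementation: the token sequences, the pair-naming convention, and the encode function (a stand-in for f_encoder that simply returns its input, so the example runs without a trained model) are all our own assumptions.

```python
def encode(tokens):
    """Placeholder for f_encoder: a real encoder would return hidden states."""
    return tokens

# Dialogue context X_t as a token sequence.
x_t = ["[CLS]", "i", "need", "a", "cheap", "restaurant", "[SEP]"]

# Example domain-slot pairs D_i S_j.
domain_slot_pairs = [("restaurant", "area"),
                     ("restaurant", "price range"),
                     ("hotel", "area")]

# (a) Independent encoding: one encoder run per domain-slot pair, so no
# single run ever sees more than one pair.
independent_runs = [encode(x_t + [f"{d}-{s}"]) for d, s in domain_slot_pairs]

# (b) Joint encoding: every pair is appended to a single input, letting the
# encoder's self-attention relate the pairs to one another.
joint_run = encode(x_t + [f"{d}-{s}" for d, s in domain_slot_pairs])

assert len(independent_runs) == len(domain_slot_pairs)  # one run per pair
assert "hotel-area" in joint_run and "restaurant-area" in joint_run
```

The joint scheme trades one encoder pass over a longer sequence for n * m passes over short ones, which is what allows information to flow between pairs such as {hotel, area} and {restaurant, area}.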
However, in a multi-domain dialogue, a single [CLS] token has to store information for different domain-slot pairs at the same time. In this respect, we propose to use multiple special tokens, one for each domain-slot pair. Using a separate special token for each domain-slot pair is more effective for storing information about different domains and slots, since each token can concentrate on its corresponding domain and slot. We consider two different ways to represent such tokens: DS-merge and DS-split. DS-merge employs a single token to represent a single domain-slot pair. For example, to represent the domain-slot pair {restaurant, area}, we use a special token DS_(restaurant,area). DS-split, on the other hand, employs separate tokens for the domain and the slot and then merges them into one to represent a domain-slot pair. For {restaurant, area}, the domain token D_restaurant and the slot token S_area are computed separately and then merged. We use {DS}_merge and {DS}_split to denote the special tokens from DS-merge and DS-split, respectively. Unless it is necessary to specify whether the tokens come from DS-merge or DS-split, we refer to them simply as {DS} tokens in the descriptions that follow. The {DS} tokens, after being encoded by the pre-trained language encoder along with the dialogue context, are used to predict their corresponding domain-slot values for a given dialogue context.
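The difference between the two token schemes is mainly one of vocabulary size: DS-merge needs a token per pair, DS-split a token per domain plus one per slot. The sketch below counts both inventories; the bracketed token names are an illustrative convention of ours, not the paper's exact vocabulary, and the merge operation for DS-split is left abstract here.

```python
# Illustrative domains and slots (a small subset, for demonstration only).
domains = ["restaurant", "hotel", "train"]
slots = ["area", "price range", "arrive-by"]

# DS-merge: one dedicated special token per domain-slot pair -> n * m tokens.
ds_merge_tokens = [f"[DS_{d}_{s}]" for d in domains for s in slots]

# DS-split: one token per domain and one per slot -> n + m tokens; a pair
# representation is formed later by merging the two encoded tokens.
ds_split_tokens = [f"[D_{d}]" for d in domains] + [f"[S_{s}]" for s in slots]

# DS-merge grows multiplicatively with domains and slots, DS-split additively.
assert len(ds_merge_tokens) == len(domains) * len(slots)
assert len(ds_split_tokens) == len(domains) + len(slots)
```

In a BERT-based setup, tokens like these would typically be registered as additional special tokens so that they are never split by the subword tokenizer and receive their own trainable embeddings.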

Figure 1: An example of a dialogue and its dialogue state.

