COCO: CONTROLLABLE COUNTERFACTUALS FOR EVALUATING DIALOGUE STATE TRACKERS

Abstract

Dialogue state trackers have made significant progress on benchmark datasets, but their generalization capability to novel and realistic scenarios beyond the heldout conversations is less understood. We propose controllable counterfactuals (COCO) to bridge this gap and evaluate dialogue state tracking (DST) models on novel scenarios, i.e., would the system successfully tackle the request if the user responded differently but still consistently with the dialogue flow? COCO leverages turn-level belief states as counterfactual conditionals to produce novel conversation scenarios in two steps: (i) counterfactual goal generation at turnlevel by dropping and adding slots followed by replacing slot values, (ii) counterfactual conversation generation that is conditioned on (i) and consistent with the dialogue flow. Evaluating state-of-the-art DST models on MultiWOZ dataset with COCO-generated counterfactuals results in a significant performance drop of up to 30.8% (from 49.4% to 18.6%) in absolute joint goal accuracy. In comparison, widely used techniques like paraphrasing only affect the accuracy by at most 2%. Human evaluations show that COCO-generated conversations perfectly reflect the underlying user goal with more than 95% accuracy and are as human-like as the original conversations, further strengthening its reliability and promise to be adopted as part of the robustness evaluation of DST models. 1 

1. INTRODUCTION

Task-oriented dialogue (TOD) systems have recently attracted growing attention and achieved substantial progress (Zhang et al., 2019b; Neelakantan et al., 2019; Peng et al., 2020; Wang et al., 2020b; a) , partly made possible by the construction of large-scale datasets (Budzianowski et al., 2018; Byrne et al., 2019; Rastogi et al., 2019) . Dialogue state tracking (DST) is a backbone of TOD systems, where it is responsible for extracting the user's goal represented as a set of slot-value pairs (e.g., (area, center), (food, British)), as illustrated in the upper part of Figure 1 . The DST module's output is treated as the summary of the user's goal so far in the dialogue and directly consumed by the subsequent dialogue policy component to determine the system's next action and response. Hence, the accuracy of the DST module is critical to prevent downstream error propagation (Liu and Lane, 2018), affecting the end-to-end performance of the whole system. With the advent of representation learning in NLP (Pennington et al., 2014; Devlin et al., 2019; Radford et al., 2019) , the accuracy of DST models has increased from 15.8% (in 2018) to 55.7% (in 2020). While measuring the held-out accuracy is often useful, practitioners consistently overestimate their model's generalization (Ribeiro et al., 2020; Patel et al., 2008) since test data is usually collected in the same way as training data. In line with this hypothesis, Table 1 demonstrates that there is a substantial overlap of the slot values between training and evaluation sets of the MultiWOZ DST benchmark (Budzianowski et al., 2018) . In Table 2 , we observe that the slot co-occurrence distributions for evaluation sets tightly align with that of train split, hinting towards the potential

funding

* Equal Contribution. Work was done during Shiyang's internship at Salesforce Research.

