COCO: CONTROLLABLE COUNTERFACTUALS FOR EVALUATING DIALOGUE STATE TRACKERS

Abstract

Dialogue state trackers have made significant progress on benchmark datasets, but their generalization capability to novel and realistic scenarios beyond the heldout conversations is less understood. We propose controllable counterfactuals (COCO) to bridge this gap and evaluate dialogue state tracking (DST) models on novel scenarios, i.e., would the system successfully tackle the request if the user responded differently but still consistently with the dialogue flow? COCO leverages turn-level belief states as counterfactual conditionals to produce novel conversation scenarios in two steps: (i) counterfactual goal generation at turnlevel by dropping and adding slots followed by replacing slot values, (ii) counterfactual conversation generation that is conditioned on (i) and consistent with the dialogue flow. Evaluating state-of-the-art DST models on MultiWOZ dataset with COCO-generated counterfactuals results in a significant performance drop of up to 30.8% (from 49.4% to 18.6%) in absolute joint goal accuracy. In comparison, widely used techniques like paraphrasing only affect the accuracy by at most 2%. Human evaluations show that COCO-generated conversations perfectly reflect the underlying user goal with more than 95% accuracy and are as human-like as the original conversations, further strengthening its reliability and promise to be adopted as part of the robustness evaluation of DST models. 1 

1. INTRODUCTION

Task-oriented dialogue (TOD) systems have recently attracted growing attention and achieved substantial progress (Zhang et al., 2019b; Neelakantan et al., 2019; Peng et al., 2020; Wang et al., 2020b; a) , partly made possible by the construction of large-scale datasets (Budzianowski et al., 2018; Byrne et al., 2019; Rastogi et al., 2019) . Dialogue state tracking (DST) is a backbone of TOD systems, where it is responsible for extracting the user's goal represented as a set of slot-value pairs (e.g., (area, center), (food, British)), as illustrated in the upper part of Figure 1 . The DST module's output is treated as the summary of the user's goal so far in the dialogue and directly consumed by the subsequent dialogue policy component to determine the system's next action and response. Hence, the accuracy of the DST module is critical to prevent downstream error propagation (Liu and Lane, 2018) , affecting the end-to-end performance of the whole system. With the advent of representation learning in NLP (Pennington et al., 2014; Devlin et al., 2019; Radford et al., 2019) , the accuracy of DST models has increased from 15.8% (in 2018) to 55.7% (in 2020). While measuring the held-out accuracy is often useful, practitioners consistently overestimate their model's generalization (Ribeiro et al., 2020; Patel et al., 2008) since test data is usually collected in the same way as training data. In line with this hypothesis, Table 1 demonstrates that there is a substantial overlap of the slot values between training and evaluation sets of the MultiWOZ DST benchmark (Budzianowski et al., 2018) . In Table 2 , we observe that the slot co-occurrence distributions for evaluation sets tightly align with that of train split, hinting towards the potential Adversarial examples generated by these methods are often unnatural or obtained to hurt target models deliberately. It is imperative to emphasize here that both our primary goal and approach significantly differ from the previous line of work: (i) Our goal is to evaluate DST models beyond held-out accuracy, (ii) We leverage turn-level structured meaning representation (belief state) along with its dialogue history as conditions to generate user response without relying on the original user utterance, (iii) Our approach is entirely model-agnostic, assuming no access to evaluated DST models, (iv) Perhaps most importantly, we aim to produce novel but realistic and meaningful conversation scenarios rather than intentionally adversarial ones. We propose controllable counterfactuals (COCO) as a principled, model-agnostic approach to generate novel scenarios beyond the held-out conversations. Our approach is inspired by the combination of two natural questions: how would DST systems react to (1) unseen slot values and (2) rare but realistic slot combinations? COCO first encapsulates these two aspects under a unified concept called counterfactual goal obtained by a stochastic policy of dropping and adding slots to the original turnlevel belief state followed by replacing slot values. In the second step, COCO conditions on the dialogue history and the counterfactual goal to generate counterfactual conversation. We cast the actual utterance generation as a conditional language modeling objective. This formulation allows us to plug-in a pretrained encoder-decoder architecture (Raffel et al., 2020) as the backbone that powers the counterfactual conversation generation. We also propose a strategy to filter utterances that fail to reflect the counterfactual goal exactly. We consider value substitution (VS), as presented in Figure 1 , as a special COCO case that only replaces the slot values in the original utterance without adding or dropping slots. When we use VS as a fall-back strategy for COCO (i.e., apply VS when COCO fails to generate valid user responses after filtering), we call it COCO+. Evaluating three strong DST models (Wu et al., 2019; Heck et al., 2020; Hosseini-Asl et al., 2020) with our proposed controllable counterfactuals generated by COCO and COCO+ shows that the performance of each significantly drops (up to 30.8%) compared to their joint goal accuracy on the original MultiWOZ held-out evaluation set. On the other hand, we find that these models are, in fact, quite robust to paraphrasing with back-translation, where their performance only drops up to 2%. Analyzing the effect of data augmentation with COCO+ shows that it consistently improves the robustness of the investigated DST models on counterfactual conversations generated by each of VS, COCO and COCO+. More interestingly, the same data augmentation strategy improves the joint goal accuracy of the best of these strong DST models by 1.3% on the original MultiWOZ evaluation set. Human evaluations show that COCO-generated counterfactual conversations perfectly reflect the underlying user goal with more than 95% accuracy and are found to be quite close to original conversations in terms of their human-like scoring. This further proves our proposed approach's reliability and potential to be adopted as part of DST models' robustness evaluation.



The percentage (%) of domain-slot values in dev/test sets covered by training data.

Co-occurrence distribution(%) of book people slot with other slots in restaurant domain within the same user utterance. It rarely co-occurs with particulars slots (e.g., food), which hinders the evaluation of DST models on realistic user utterances such as "I want to book a Chinese restaurant for 8 people." limitation of the held-out accuracy in reflecting the actual generalization capability of DST models. Inspired by this phenomenon, we aim to address and provide insights into the following question: how well do state-of-the-art DST models generalize to the novel but realistic scenarios that are not captured well enough by the held-out evaluation set?Most prior work(Iyyer et al., 2018; Jin et al., 2019)  focus on adversarial example generation for robustness evaluation. They often rely on perturbations made directly on test examples in the heldout set and assume direct access to evaluated models' gradients or outputs.

funding

* Equal Contribution. Work was done during Shiyang's internship at Salesforce Research.

