SCORE: PRE-TRAINING FOR CONTEXT REPRESENTATION IN CONVERSATIONAL SEMANTIC PARSING

Abstract

Conversational Semantic Parsing (CSP) is the task of converting a sequence of natural language queries to formal language (e.g., SQL, SPARQL) that can be executed against a structured ontology (e.g., databases, knowledge bases). To accomplish this task, a CSP system needs to model the relation between the unstructured language utterance and the structured ontology while representing the multi-turn dynamics of the dialog. Pre-trained language models (LMs) are the state of the art for various natural language processing tasks. However, existing pre-trained LMs that use language modeling training objectives over free-form text have limited ability to represent natural language references to contextual structural data. In this work, we present SCORE, a new pre-training approach for CSP tasks designed to induce representations that capture the alignment between the dialogue flow and the structural context. We demonstrate the broad applicability of SCORE to CSP tasks by combining SCORE with strong base systems on four different tasks (SPARC, COSQL, MWOZ, and SQA). We show that SCORE can improve the performance over all these base systems by a significant margin and achieves state-of-the-art results on three of them.

1. INTRODUCTION

The goal of task-oriented dialog systems is to assist the user in completing a certain task by performing an action or retrieving relevant information (Tur & Mori, 2011). They are often built on top of a structured ontology grounded in a knowledge base, a database, or a set of API calls. This is in contrast to open-domain dialog systems (also referred to as chit-chat systems), where the goal is to maximize engagement with users in open-ended conversations (Jafarpour et al., 2010; Ritter et al., 2011). A key component of task-oriented conversational systems is Conversational Semantic Parsing (CSP), which converts each utterance in the dialog into a formal language query (e.g., SQL, SPARQL) that can be executed against the structured ontology. CSP has been extensively studied in several academic and industrial research settings such as dialog systems (e.g., dialog state tracking in MWOZ (Budzianowski et al., 2018)), interacting with physical agents (e.g., (Chai et al., 2018)), context-dependent semantic parsing (e.g., SPARC (Yu et al., 2019b)), SQL-grounded state tracking (e.g., COSQL (Yu et al., 2019a)), and sequential question answering (e.g., SQA (Iyyer et al., 2017)). These settings differ in some respects, but they share the same overall objective and key challenge: how to jointly represent the natural language utterances and the underlying structured ontology while taking into consideration the multi-turn dynamics of the dialog. Similar to many other natural language tasks, recent work in CSP has significantly benefited from advances in language model pre-training. However, existing general-purpose pre-trained language models, e.g., BERT (Devlin et al., 2019), are pre-trained on free-form text data using language model objectives, which limits their ability to model the structural context or the multi-turn dynamics of dialogs. This presents an opportunity to improve pre-trained LMs to specifically address these limitations for CSP tasks.
Recent work has demonstrated the benefits of adapting pre-trained LMs to specific domains (Gururangan et al., 2020) or tasks (Zhang et al., 2019b) via a second phase of pre-training. For example, open-domain dialogue language models such as DialoGPT (Zhang et al., 2020) and ConveRT (Henderson et al., 2019) are pre-trained on Reddit data and applied to dialog response generation and retrieval tasks. In this paper, we introduce SCORE (Structured & Sequential Context Representation), a language model pre-training approach for CSP tasks. SCORE adapts general pre-trained LMs by introducing a second phase of pre-training with multiple objectives that capture both the multi-turn dynamics and the structural contexts in a dialog. In contrast to open-domain dialogs, CSP datasets are usually much smaller due to the difficulty and expense of obtaining and labeling data (mapping natural language utterances to formal language). Unlike most prior work on contextualized LMs, which are pre-trained on free text, and motivated by the observation that questions in CSP tasks are more compositional than free-form text (they can be mapped into formal representations), we propose to train SCORE on synthesized conversational semantic parsing data with multiple training objectives that aim to ground utterances into the schema of the underlying ontology and to model the relationship between different utterances in the multi-turn conversation. In this way, SCORE can effectively inject structural and conversational inductive biases into LMs that translate to many CSP tasks. SCORE uses an order of magnitude smaller dataset for the second stage of pre-training, does not require changes to the pre-trained model architecture, can be used as a drop-in replacement for general pre-trained LMs with any semantic parsing model, and can be used out-of-the-box in many CSP tasks.
We apply SCORE to four different CSP tasks: (1) sequential text-to-SQL (SPARC), (2) conversational text-to-SQL (COSQL), (3) dialog state tracking (MWOZ), and (4) weakly-supervised sequential question answering (SQA). The four tasks represent different scenarios, types of ontologies, supervision signals, system responses, and domains (see Table 1 for a detailed comparison and Figure 1 for examples). We demonstrate that: (1) SCORE training objectives can effectively incorporate synthesized data, (2) a single pre-trained SCORE model can be used for several CSP tasks and can be combined with many baseline systems with different model architectures, and (3) SCORE significantly improves all baseline systems and achieves new state-of-the-art results on three benchmarks (SPARC, COSQL, and MWOZ) and comparable performance to state-of-the-art results on the fourth (SQA).

2. APPROACH

The key challenge of CSP is to capture the relationship between the natural language utterance and the structured ontology under the multi-turn dialog dynamics. To this end, we inject structural and conversational inductive biases into SCORE by introducing two objective functions: Column Contextual Semantics (CCS) and Turn Contextual Switch (TCS). Because the size of existing semantic parsing datasets is limited, we produce synthesized data for pre-training SCORE by sampling from a context-free grammar induced from complex text-to-SQL examples in different domains. Moreover, to prevent SCORE from overfitting to the linguistic patterns of our synthesized data, we use the Masked Language Modeling (MLM) objective on human-generated utterances as regularization.

Task Definition

In CSP, at each turn t, we aim to produce a formal representation q_t given the current utterance u_t, the interaction history h_t = [u_1, u_2, ..., u_{t-1}], and the schema c (table and column names, slots, etc.) of the target database (ontology) d. Table 1 compares the four tasks: SPARC (sequential text-to-SQL), COSQL (conversational text-to-SQL), MWOZ (dialogue state tracking), and SQA (weakly supervised sequential question answering). They have different target formal languages and structured ontologies:
• Utterance u: the user question for SPARC and SQA; for COSQL and MWOZ, u is the combination of a user query and a system response.
• Database d: SPARC and COSQL use multi-table databases; for MWOZ, the pre-defined ontology d can also be viewed as a database; for SQA, d is a single table.
• Formal representation q: the SQL query for SPARC and COSQL; in MWOZ, it is the slot-value pairs, which can be viewed as simple SQL queries consisting of SELECT and WHERE clauses; for SQA, q is the latent program.

Base Architecture The base architecture of SCORE takes as input a single turn of a CSP dialog (u_t, h_t) jointly with the underlying database schema c. Given this contextualized conversational input C_t = (u_t, h_t, c), SCORE encodes it into contextualized conversation representations S_t for each token in C_t. The encoder architecture follows RoBERTa (Liu et al., 2019b). It is then followed by a linear layer and layer normalization (Ba et al., 2016) to produce final representations h_t for each token:

    C_t = (u_t, h_t, c),  S_t = RoBERTa(C_t),  h_{t,i} = LayerNorm(GELU(W_1 S_{t,i}))  ∀ S_{t,i} ∈ S_t,

where GELU is the activation of Hendrycks & Gimpel (2016) and W_1 is a learned parameter matrix.
To build C_t, we first concatenate the current utterance u_t and the dialog history h_t, separated by a special token <s>, as this simple strategy has been shown effective in state-of-the-art CSP systems (Zhang et al., 2019c; Wu et al., 2019; Liu et al., 2020; Heck et al., 2020). To incorporate the database schema, we follow Hwang et al. (2019) and concatenate all column names into a single sequence. Column names are separated by the special token </s> and prefixed by their corresponding table name.
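The serialization above can be sketched in a few lines. The function below is an illustrative stand-in, not the released SCORE code; it assumes RoBERTa-style <s> and </s> special tokens, and the function and variable names are ours.

```python
# Hypothetical sketch of building the contextualized input C_t:
# current utterance, then history turns separated by <s>, then every
# schema column serialized as "</s> <table> <column>".
def build_input(utterance, history, schema):
    parts = [utterance] + history
    text = " <s> ".join(parts)
    for table, column in schema:
        text += f" </s> {table} {column}"
    return text

example = build_input(
    "Who are their authors?",
    ["Find the names of the top 3 highest sales books."],
    [("book", "title"), ("book", "author"), ("book", "sales")],
)
```

Each </s> position then serves as the anchor token whose encoding is used to predict that column's operation label in the CCS objective below.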

2.2. SCORE PRE-TRAINING

SCORE addresses the challenges of CSP by pre-training a task-oriented language model contextualized by the conversational flow and the underlying ontology. In pre-training, the SCORE model is self-supervised by two novel objectives in addition to the established Masked Language Modeling (MLM) objective. These objectives facilitate the accurate representation of the conversational flow between dialog turns and how this flow maps to the desired columns in the ontology.

Column Contextual Semantics

The first challenge of CSP is capturing the alignment between the natural language utterance and the underlying database schema. To address it, we optimize the SCORE model with the auxiliary objective of Column Contextual Semantics (CCS). For each column in the database schema c, CCS predicts the operations that should be performed on this column in a given conversational turn. Specifically, each formal representation q is decomposed into operations on columns and tables, e.g., GROUP BY and HAVING for SQL queries, or WHERE for the slot-value pairs. In this way, our data covers 148 column operations. We use the encoding of the special token </s> right before each column or table name to predict its corresponding operations, and then compute the CCS loss:

    L_CCS(C_t) = Σ_{i∈c} CrossEntropy_148(LayerNorm(W_2 h^c_{t,i}), CCS(q_t))

where h^c_{t,i} is the contextualized representation of the i-th column's special token </s> in the contextualized input C_t, CCS(q_t) returns the column operation label for the current formal representation q_t, CrossEntropy_148 computes the 148-way cross-entropy between the column operation prediction and the label, and W_2 is a learned parameter matrix.
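Concretely, the CCS loss is a per-column cross-entropy over the 148 operation labels. The pure-Python sketch below stands in for the actual tensor computation; `column_logits` plays the role of W_2 applied to each column's </s> encoding, and all names are hypothetical.

```python
import math

N_OPS = 148  # column operations covered by the synthesized data

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ccs_loss(column_logits, column_labels):
    """Average 148-way cross-entropy over the schema columns.

    column_logits: one length-148 logit vector per column.
    column_labels: gold operation index per column, i.e. CCS(q_t).
    """
    total = 0.0
    for logits, label in zip(column_logits, column_labels):
        total -= math.log(softmax(logits)[label])
    return total / len(column_labels)
```

A confidently correct prediction drives the loss toward zero, while a uniform prediction yields log(148) per column, matching the usual cross-entropy behavior.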

Turn Contextual Switch

The second challenge of CSP is capturing the conversational context flow and how it is grounded in the formal representations. The TCS objective aims to capture this grounding of the context flow. To this end, it predicts the difference in formal representations between dialog turns based on the natural language utterance. Based on the context-free grammar of SQL, we identify 26 possible turn difference operations that a conversational turn could elicit. They encode changes between different turns of user queries (the system response is not involved here), since we assume that most turn contextual shifts come from the user. For example, INS(WHERE) indicates inserting a new WHERE condition, and DEL(SELECT.agg) indicates removing an aggregate operation from a SELECT statement (e.g., when the utterance "Show all the ages instead." elicits the change SELECT MAX(age) ... → SELECT age ...). We use the encoding of the special token </s> right before each turn to predict the context switch label between this turn and the previous history:

    L_TCS(C_t) = CrossEntropy_26(LayerNorm(W_3 H^s_t), TCS(q_t, q_{t-1}))

where H^s_t ∈ R^{(t-1)×d} is the contextualized representation of all previous turns in C_t with hidden dimension d, TCS(q_t, q_{t-1}) returns the turn difference operations from q_{t-1} to q_t, and W_3 is a learned parameter matrix. We do not use this objective to pre-train SCORE for MWOZ because the context switch label between turns is relatively simple in MWOZ (only SELECT and WHERE changes).
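To make the turn-difference labels concrete, the toy function below derives INS(...)/DEL(...) operations by comparing which clauses appear in two consecutive SQL queries. It is deliberately much coarser than the paper's 26 grammar-based operations (for example, it cannot produce DEL(SELECT.agg)), and all names are illustrative.

```python
def clause_set(sql):
    # Naive clause detection by keyword presence.
    keywords = ["SELECT", "WHERE", "GROUP BY", "HAVING", "ORDER BY"]
    return {kw for kw in keywords if kw in sql.upper()}

def turn_diff(prev_sql, cur_sql):
    """Label the turn switch as clause-level insertions and deletions."""
    prev, cur = clause_set(prev_sql), clause_set(cur_sql)
    ops = [f"INS({kw})" for kw in sorted(cur - prev)]
    ops += [f"DEL({kw})" for kw in sorted(prev - cur)]
    return ops
```

During pre-training these labels come for free, since both q_{t-1} and q_t are produced by the synthesis grammar.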

Masked Language Modeling

As in prior work on large-scale language models (Devlin et al., 2019), we use the Masked Language Modeling (MLM) objective to facilitate contextual representation learning for natural language utterances. Importantly, for regularization we only apply this loss on in-domain human-annotated natural language data, namely utterances in SPARC, COSQL, and SQA as well as nine task-oriented dialog datasets processed by Wu et al. (2020) for MWOZ (see data statistics in Figure 4). Formally, the MLM loss is given by:

    L_MLM(C_t) = Σ_m CrossEntropy_Vocab(LayerNorm(W_4 h^m_t), x_m)

where h^m_t are the contextualized representations of the masked tokens (15% of the tokens in C_t), x_m is the original token at each masked position, and W_4 is a learned parameter matrix.
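A minimal sketch of the 15% masking step, assuming simple whole-token masking (the standard BERT-style 80/10/10 corruption scheme is omitted for brevity); the function and token names are illustrative rather than taken from the SCORE release.

```python
import random

def mask_tokens(tokens, rng, mask_token="<mask>", rate=0.15):
    """Mask each token independently with probability `rate`.

    Returns the corrupted sequence and a dict mapping masked
    positions to their original tokens (the MLM targets x_m).
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            targets[i] = tok  # loss is computed only at these positions
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets
```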

Pre-Training Setup and Steps

To summarize the pre-training steps, we first collect a dataset D_nat of combined human-annotated natural language questions (without labels) from existing CSP tasks (as mentioned above), and create a large synthesized conversational dataset D_syn generated by a grammar induced from a small set of SPARC annotated examples (see Section 2.3). We then incorporate both datasets in pre-training; synthetic and natural examples are randomly sampled during pre-training. The total pre-training loss is the sum of the three objectives, with CCS and TCS applied only to D_syn and MLM only to D_nat:

    L = Σ_{C_t∈D_syn} (L_CCS(C_t) + L_TCS(C_t)) + Σ_{C_t∈D_nat} L_MLM(C_t)

Figure 2 shows an overview of SCORE pre-training on an example SPARC dialogue from Figure 1. We report additional implementation details for pre-training SCORE in Section 3.3 and Appendix C.
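The combined loss can be sketched as a plain sum, with the three objective terms passed in as callables; the constant-valued stubs in the usage below are illustrative only, standing in for the per-example losses defined above.

```python
def total_loss(syn_batch, nat_batch, ccs, tcs, mlm):
    """Sum CCS+TCS over synthesized examples and MLM over natural ones."""
    loss = sum(ccs(c) + tcs(c) for c in syn_batch)
    loss += sum(mlm(c) for c in nat_batch)
    return loss
```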

2.3. DATA SYNTHESIS

We re-use the synthetic dataset of 120k synthetic task-oriented dialogues for MWOZ introduced by Campagna et al. (2020). In this work, we introduce a complementary procedure to synthesize data for conversational text-to-SQL dialogues. We use about 400k tables from WIKITABLES (Bhagavatula et al., 2015) (after filtering and cleaning), WikiSQL, and Spider as underlying databases d, and synthesize about one dialog per table, for a total of 435k synthesized text-to-SQL conversations. Table 12 in Appendix B shows an example of the synthesized question-SQL pairs and their corresponding templates in our grammar.

Algorithm 1 Data synthesis algorithm
 1: h ← ∅
 2: r_s ← SAMPLE(G_s)
 3: (ũ_0, q̃_0) ← RANDASSIGNSLOTS(d, r_s)
 4: h += (ũ_0, q̃_0)
 5: (ũ_p, q̃_p) ← (ũ_0, q̃_0)
 6: for t ← 1 to T do
 7:     if RAND(0, 1) < 0.2 then
 8:         r_s ← SAMPLE(G_s)
 9:         (ũ_t, q̃_t) ← RANDASSIGNSLOTS(d, r_s)
10:     else
11:         r_c ← SAMPLE(G_c)
12:         if CONSTRAINTCHECK(r_c, q̃_p) then
13:             (ũ_t, q̃_t) ← EDITASSIGN(q̃_p, r_c)

The data synthesis procedure using the two grammars is shown in Algorithm 1. Given a database d and a sampled single-turn question-SQL template, the function RANDASSIGNSLOTS samples values (column names, cell values, and SQL operations) for the typed slots in the template and returns the first synthesized question ũ_0 and the corresponding SQL query q̃_0. To generate T follow-up question-SQL pairs, the function CONSTRAINTCHECK(r_c, q̃_p) checks whether the previous query q̃_p satisfies the constraints of the sampled template r_c (e.g., contains its mentioned nonterminal). Finally, EDITASSIGN(q̃_p, r_c) edits the previous SQL q̃_p to generate the current follow-up SQL label q̃_t and samples values for the typed slots in the template to generate the corresponding follow-up question ũ_t.
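Algorithm 1 can be sketched in Python as follows. SAMPLE, RANDASSIGNSLOTS, CONSTRAINTCHECK, and EDITASSIGN are replaced here by toy stand-ins over a miniature grammar, so everything below is illustrative rather than the paper's actual implementation.

```python
import random

def rand_assign_slots(d, template, rng):
    # Fill the template's typed slots with values sampled from the DB.
    col = rng.choice(d["columns"])
    return template["q"].format(col=col), template["sql"].format(col=col)

def constraint_check(r_c, q_prev):
    # A follow-up template applies only if the previous SQL contains
    # the nonterminal it edits.
    return r_c["requires"] in q_prev

def edit_assign(q_prev, r_c):
    # Edit the previous SQL to produce the follow-up label.
    return r_c["q"], q_prev + " " + r_c["sql_suffix"]

def synthesize_dialog(d, G_s, G_c, T, rng):
    h = []
    r_s = rng.choice(G_s)                      # SAMPLE(G_s)
    u, q = rand_assign_slots(d, r_s, rng)
    h.append((u, q))
    q_prev = q
    for _ in range(T):
        if rng.random() < 0.2:                 # start a fresh question
            r_s = rng.choice(G_s)
            u, q = rand_assign_slots(d, r_s, rng)
        else:                                  # follow-up edit
            r_c = rng.choice(G_c)
            if not constraint_check(r_c, q_prev):
                continue
            u, q = edit_assign(q_prev, r_c)
        h.append((u, q))
        q_prev = q
    return h
```

With a single-turn grammar G_s and a context-dependent grammar G_c, each call yields one dialog of question-SQL pairs rooted in the given table.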

3.1. DATASETS AND EVALUATION METRICS

We evaluate SCORE on four popular CSP tasks: SPARC (sequential text-to-SQL), COSQL (conversational text-to-SQL), MWOZ (dialogue state tracking), and SQA (sequential question answering), summarized in Table 1. SPARC (Yu et al., 2019b) foot_0 is a large collection of sequences of inter-related, context-dependent question-SQL pairs. It contains 4.3k question sequences and 12k+ questions. COSQL (Yu et al., 2019a) foot_1 is a large conversational text-to-SQL corpus, with 3k dialogues, collected under the Wizard-of-Oz (WOZ) setting. We focus on its SQL-grounded dialogue state tracking task, which maps user intents into SQL queries, if possible, given the interaction history. Both SPARC and COSQL cover 200 complex DBs spanning 138 domains. MWOZ (Budzianowski et al., 2018; Eric et al., 2019) foot_2 is a corpus of over 10k human-human written task-oriented dialogs created through a WOZ crowdsourcing setting. We focus on the belief state tracking task in MWOZ, which maps multi-turn user utterances to slot-value annotations. SQA (Iyyer et al., 2017) foot_3 is constructed from a subset of WikiTableQuestions (Pasupat & Liang, 2015) by decomposing highly compositional questions into sequences of simple questions. The task is weakly supervised because each resulting decomposed question is only annotated with answers as one or more table cells, while the logic program is latent. It has 6,066 question sequences with 17,553 questions in total over 982 unique open-domain tables from Wikipedia. We adopt the official metrics defined for each of the tasks. For SPARC and COSQL, we report question match accuracy (QM), the exact set match accuracy (Yu et al., 2018b) over SQL templates, and interaction match accuracy (IM), the ratio of interactions for which all questions are predicted correctly. For MWOZ, we report joint goal accuracy (JGA), which is similar to the IM accuracy used in SPARC and COSQL. Finally, for SQA, we report denotation QM and IM accuracies.
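The QM/IM distinction can be made concrete with a small helper; the inputs are per-question correctness flags grouped by interaction, and the function name is ours, not from the benchmarks' official evaluators.

```python
def qm_im(interactions):
    """Question match (QM) and interaction match (IM) accuracies.

    interactions: list of interactions, each a list of booleans where
    True means the predicted query exactly matches the gold one.
    """
    questions = [ok for inter in interactions for ok in inter]
    qm = sum(questions) / len(questions)
    im = sum(all(inter) for inter in interactions) / len(interactions)
    return qm, im
```

IM is the stricter metric: a single wrong question anywhere in an interaction makes the whole interaction count as a failure.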

3.2. BASE MODELS AND OTHER BASELINES

For SPARC and COSQL, we use RAT-SQL (Wang et al., 2020) as our base model. Since it was originally developed for single-turn text-to-SQL, we extend it to the multi-turn setting by concatenating the current utterance and dialog history (see Section 2.2). Note that RAT-SQL alone, without SCORE, achieves better or comparable results to state-of-the-art models developed for SPARC and COSQL. For MWOZ, we employ TripPy (Heck et al., 2020). It achieves state-of-the-art performance on MWOZ and uses BERT base to encode user and system utterances and dialog history. We report higher results (around 2%) for TripPy than reported by Heck et al. (2020) because we train it for more epochs (25 vs. 10). To show that the improvement from SCORE is not tied to specific base systems, we also experiment with another strong base model, SOM-DST (Kim et al., 2020), for MWOZ, following the same experimental details to train it. For SQA, we use the weakly-supervised semantic parser proposed by Wang et al. (2019). The model first generates an abstract program given an input question and then instantiates it by searching for alignments between slots in the abstract program and question spans. As it was originally developed for single-turn questions, we extend it to the multi-turn setting in the same way as RAT-SQL. We report additional implementation details for all base models in Appendix C. In addition to reporting results for all base models with SCORE, we also report original base model results (with BERT and/or ROBERTA) and several other state-of-the-art baselines for each task.

3.3. DATASET USAGE IN PRE-TRAINING

In our experiments and ablation study, we train several versions of SCORE with different objectives and datasets: (1) SCORE (MLM): pre-trained on annotated natural questions using MLM.

4. RESULTS AND ANALYSIS

Overall Results The results for SPARC, COSQL, MWOZ, and SQA are in Tables 2, 3, and 4, respectively. We run each main experiment three times with different random seeds and report the mean. Overall, SCORE gains significant improvements over BERT and ROBERTA on all tasks, achieving state-of-the-art performance on SPARC, COSQL, and MWOZ. As shown in Table 7, the extra synthetic data does not significantly improve the performance, indicating that directly augmenting the training set with the synthetic data is not effective. Similar findings are reported in much recent work (Zhang et al., 2019c; Herzig et al., 2020; Campagna et al., 2020; Zhong et al., 2020). In contrast, pre-training on the synthesized data with our objectives improves the performance on the downstream tasks. Demonstrating the generalization of the synthetic grammar and data, as shown in Tables 2 and 4, SCORE (TCS+CCS) is pre-trained on synthesized data from the grammar induced from SPARC only, and it still improves the performance on COSQL (+2.7%) and SQA (+4.9%), where no COSQL or SQA annotated data is seen in any pre-training step. Moreover, in Table 8 we show that SCORE pre-trained on the text-to-SQL synthesized data can, perhaps surprisingly, also improve the performance on MWOZ. We expect that higher performance could be achieved with SCORE pre-trained on task-specific synthesized data. Finally, our pre-training approach can be applied to any existing LMs, including larger seq2seq LMs (e.g., BART (Lewis et al., 2020), T5 (Raffel et al., 2020)). Can SCORE deliver more value when in-domain data is limited (e.g., in a low-resource setting)?
We answer this question in the spirit of other investigations of LMs as few-shot learners (Wu et al., 2020; Brown et al., 2020; Schick & Schütze, 2020). To this end, we compare ROBERTA and SCORE in a few-shot setting on SQA where only 10% of the training data is available. We choose SQA because its annotation is the most different from the synthetic text-to-SQL dataset we use for pre-training. Table 9 demonstrates that SCORE delivers even larger improvements over the ROBERTA baseline when only 10% of the training data is available (3.8% vs. 2.6%).

5. RELATED WORK

Conversational Semantic Parsing Conversational semantic parsing is one of the most important research topics in conversational AI and has been studied in different settings, including task-oriented dialogues, question answering, and text-to-SQL. Task-oriented dialog systems (Henderson et al., 2014; Wen et al., 2016; Mrkšić et al., 2017; Budzianowski et al., 2018) aim to help users accomplish a specific task (e.g., flight booking) and often pre-define slot templates grounded in a domain-specific ontology. In comparison, several other datasets were recently introduced for cross-domain conversational text-to-SQL tasks (SPARC and COSQL (Yu et al., 2019a;b)) and sequential question answering over tables (Iyyer et al., 2017). While previous work has achieved significant progress on these datasets separately, to the best of our knowledge, we are the first to study four different CSP tasks together (sequential text-to-SQL, conversational text-to-SQL, dialog state tracking, and weakly-supervised sequential question answering) by addressing the shared key challenge: learning representations in pre-trained language models that capture the alignment between the dialogue flow and the structural context.

Conversational Language Model Pre-training Several recent efforts have demonstrated the value of adapting pre-trained LMs to specific tasks using different pre-training objectives, e.g., summarization (Zhang et al., 2019b), knowledge inference (Sun et al., 2019b; Liu et al., 2019a), etc. Closest to our work is adapting pre-trained LMs for open-domain chit-chat models and for tabular data representation. The former focuses on improving response generation on open-ended dialogues by adding a pre-training step on open-domain conversation data, such as Reddit data (Zhang et al., 2020; Henderson et al., 2019). For example, Wu et al. (2020) introduced ToD-BERT, a pre-trained language model combining 9 high-quality human-human task-oriented dialogue datasets to conduct language model and response selection pre-training. However, they use language modeling training objectives over free-form text and therefore have limited ability to represent structural data.
The latter has focused on improving language model pre-training for encoding tabular data (Yin et al., 2020; Herzig et al., 2020), but in the single-turn semantic parsing setting. Our approach differs from previous work in that we address the challenge of conversational semantic parsing tasks by learning pre-trained representations for both the multi-turn dynamics of the dialog and the relation between the unstructured language utterance and the structured ontology. Furthermore, our pre-training approach is much more data-efficient than prior LM pre-training work and saves substantial time and computing resources (see Appendix D for more details); our pre-training step can be completed within one day using 8 V100 GPUs.

Using Synthesized Data for Semantic Parsing Synthesized data has frequently been used in semantic parsing to alleviate the challenge of labeled data scarcity. For example, Wang et al. (2015) proposed a method for training semantic parsers in new domains by generating logical forms and canonical utterances and then paraphrasing the canonical utterances via crowd-sourcing. Similar approaches were used to train semantic parsers in other domains and settings (Zhong et al., 2017; Su et al., 2017; Cheng et al., 2018; Shah et al., 2018). Another line of work has proposed using synthesized data to adapt single-turn semantic parsing models to new domains (Jia & Liang, 2016; Yoo et al., 2018; Campagna et al., 2019) and task-oriented dialogues (Campagna et al., 2020). However, they reported that combining synthetic data and the supervised data does not yield significant improvements, consistent with results by Herzig et al. (2020). By contrast, we introduce a new data synthesis procedure for conversational text-to-SQL dialogues and use it differently: pre-training language models to induce better representations for many CSP tasks.
Our synthesized data can be easily generated without human involvement and the pre-trained models add value to different tasks simultaneously.

6. CONCLUSION

We presented SCORE, a new pre-training approach for conversational semantic parsing. The training objectives of SCORE aim to induce natural language representations that capture the multi-turn dynamics, the compositional semantics of the target language, and the references to the structural ontology appearing in the dialog. SCORE can be used with many semantic parsing models as a drop-in replacement for general pre-trained LMs. We demonstrated SCORE's effectiveness by using it as a feature representation encoder with strong baseline models for a wide range of CSP tasks. In particular, our empirical results on four different CSP tasks demonstrated that SCORE significantly improves the performance of existing strong baseline models by simply replacing an existing pre-trained LM with our SCORE pre-trained model. Furthermore, we achieve state-of-the-art results on three of these tasks. We hope SCORE will encourage further exploration of the benefits and limitations of pre-training approaches for CSP systems.

B SYNTHESIZED EXAMPLES & TEMPLATES

Table 12 shows an example of the synthesized question-SQL pairs and their corresponding templates in our grammars. All SCORE models are pre-trained for 30 epochs, which usually takes less than half a day on 8 V100 GPUs. We experimented with SCORE pre-trained for 5, 10, and 30 epochs and found that most of the best downstream performances occur when base systems are combined with a SCORE model pre-trained for fewer than 10 epochs. Our implementation is based on the Transformers library (Wolf et al., 2019).

C.2 BASE MODELS

RAT-SQL: For a fair comparison, all RAT-SQL experiments are trained for 40k steps. We adopt the same hyperparameters as Shaw et al. (2018) except for the learning rates: we find that learning rates of 1e-4 and 1e-5 for RAT and BERT, respectively, produce more stable results. TripPy: We use the same hyperparameters for training TripPy on MWOZ as Heck et al. (2020), except that we train it for 25 epochs (as opposed to the reported 10), which yields a result around 2% higher than the one reported by Heck et al. (2020). Similarly, when we train TripPy with SCORE, we train it for 25 epochs.

SOM-DST:

We use the same hyperparameters as Kim et al. (2020) for all SOM-DST experiments on MWOZ. We test the performance of SCORE with respect to the number of pre-training epochs. Figure 3 shows that the best performance on the downstream tasks is usually achieved in early epochs, specifically epoch 5 for SPARC and COSQL and epoch 15 for MWOZ. Longer pre-training does not improve performance and can even hurt it. One possible reason is that longer pre-training makes SCORE overfit to the synthesized data, whose utterances are unnatural. As for the data, Table 5 shows that even when SCORE is pre-trained with only a relatively small amount of synthesized data (without the MLM loss), most tasks still achieve much higher performance. With a relatively smaller training corpus and shorter training time compared to other pre-trained language models, SCORE is efficient in both time and data.

E ADDITIONAL RESULTS

Effect of TCS We ran the TCS-only experiment on SPARC and will add TCS-only results (including for the other tasks).




Dataset links: SPARC: https://yale-lily.github.io/sparc; COSQL: https://yale-lily.github.io/cosql; MWOZ: https://github.com/budzianowski/multiwoz; SQA: http://aka.ms/sqa



Figure 1: Examples of conversational semantic parsing tasks from SPARC and MWOZ datasets.

Figure 2: Pre-training of a SCORE encoder on a SPARC text-to-SQL example from Figure 1.

(2) SCORE (CCS+TCS): pre-trained on only synthesized data, which achieves the best results on SPARC, COSQL, and SQA. (3) SCORE (CCS+TCS+MLM): pre-trained on the synthesized data using CCS+TCS and on annotated natural questions using MLM. Furthermore, note that the synthesized data is generated using a grammar induced from about 500 examples from SPARC only; therefore, no COSQL or SQA data is seen in any pre-training step. For MWOZ, Campagna et al. (2020) use only the dev examples to induce the data synthesis grammar.
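To make the TCS objective concrete: it supervises the model on how the formal query changes between consecutive turns. The exact label sets for CCS and TCS are defined in the paper; the sketch below only illustrates the underlying idea with a toy, string-matching column extractor (both function names and the extraction heuristic are assumptions, not the paper's implementation).

```python
# Simplified illustration of deriving turn-switch labels from consecutive
# SQL queries: which schema columns were added or dropped between turns.
# The real CCS/TCS labels in SCORE are defined over SQL operations and
# columns via a grammar; this toy version only matches column names.

import re

def columns(sql, schema_columns):
    """Toy extractor: schema columns mentioned verbatim in the query."""
    tokens = set(re.findall(r"[A-Za-z_]+", sql.lower()))
    return {c for c in schema_columns if c in tokens}

def turn_switch(prev_sql, cur_sql, schema_columns):
    prev = columns(prev_sql, schema_columns)
    cur = columns(cur_sql, schema_columns)
    return {"added": sorted(cur - prev), "dropped": sorted(prev - cur)}

schema = {"name", "capacity", "city"}
q1 = "SELECT name FROM stadium"
q2 = "SELECT name, capacity FROM stadium ORDER BY capacity DESC"
print(turn_switch(q1, q2, schema))  # {'added': ['capacity'], 'dropped': []}
```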

Figure 3: The effect of pre-training time.

Figure 4: Data statistics of human-annotated task-oriented dialogue datasets used in Wu et al. (2020).

To cover different problem variants, we

Comparison of CSP datasets. Examples from two of the datasets are shown in Figure 1. Cross-domain means that the train and test sets cover different domains; by this definition, MWOZ is not cross-domain.

The SPARC and COSQL accuracy over all questions (QM) and all interactions (IM). The scores of IGSQL + BERT and R²SQL + BERT are from the official leaderboards.





The effect of SCORE pre-training objectives. Improvements are shown in the parentheses.

Detailed results on the dev set of SPARC. Q i is the accuracy of the i th conversation question.

Effect of synthetic data as training data augmentation. What if we use the synthesized data simply to augment the training data? To answer this, we compare the results of the base models trained with and without the synthesized data on COSQL and MWOZ. As shown in Table
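The two conditions being compared can be made explicit as two data pipelines: feeding the synthesized examples directly into the training set versus reserving them for a separate SCORE-style pre-training stage. The sketch below is an assumption about the experimental setup, not code from the paper.

```python
# Sketch (assumed setup, not the paper's code) contrasting the two uses of
# synthesized data compared in this section:
#   (a) plain training-data augmentation, vs.
#   (b) a separate pre-training stage before fine-tuning on real data.

def build_data(train_examples, synthesized, use_as_augmentation):
    if use_as_augmentation:
        # (a) synthesized examples are concatenated to the training set
        return {"pretrain": [], "train": train_examples + synthesized}
    # (b) synthesized examples feed SCORE-style pre-training only
    return {"pretrain": synthesized, "train": train_examples}

train = ["real_ex_1", "real_ex_2"]
synth = ["synth_ex_1", "synth_ex_2", "synth_ex_3"]
print(len(build_data(train, synth, True)["train"]))   # 5
print(len(build_data(train, synth, False)["train"]))  # 2
```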

Performance of SCORE pre-trained on different synthesized data on MWOZ.

, etc. Closest to our work are approaches that adapt pre-trained LMs for open-domain chit-chat models and for tabular data representation. The former focus on improving response generation in open-ended dialogues by adding a pre-training step on open-domain conversation data, such as Reddit data

Detailed results of COSQL on the dev set. Q i is the accuracy of the i th question in the conversation.

Detailed results of SQA on the test set. Q i is the accuracy of the i th question in the conversation.

An example of synthetic conversational text-to-SQL data.

For pre-training SCORE on synthesized text-to-SQL data, we use ROBERTA-large and pre-train it with batch size 12, gradient accumulation steps 2, and maximum sequence length 248. We use a learning rate of 1e-5, gradually reduced without a warm-up period, using Adam (Kingma & Ba, 2014) with epsilon 1e-8. For pre-training SCORE on synthesized MWOZ data, we use BERT-base because MWOZ contains longer conversations; there, we set the maximum length to 512 and the batch size to 24.
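Two quantities implied by these hyperparameters are worth spelling out: the effective batch size under gradient accumulation, and the warm-up-free linear learning-rate decay. The sketch below is our interpretation of the reported setup (the exact schedule implementation in the paper is not specified).

```python
# Sketch of the reported optimization setup (assumed interpretation):
# effective batch size = per-step batch size x gradient-accumulation steps,
# and a learning rate annealed linearly from 1e-5 to 0 with no warm-up.

def effective_batch_size(batch_size, grad_accum_steps):
    # e.g. batch size 12 with 2 accumulation steps -> 24 examples per update
    return batch_size * grad_accum_steps

def linear_lr_no_warmup(step, total_steps, base_lr=1e-5):
    """Linearly anneal base_lr -> 0 over total_steps, with no warm-up."""
    frac = max(0.0, 1.0 - step / total_steps)
    return base_lr * frac

print(effective_batch_size(12, 2))   # 24
print(linear_lr_no_warmup(0, 100))   # 1e-05
print(linear_lr_no_warmup(50, 100))  # 5e-06
```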

Incorporating Additional Examples Used in Synthetic Grammar Induction. As mentioned in Section 2.3, we used about 500 examples from SPARC to induce the grammar for data synthesis in pre-training. For a fair comparison, we also report the results of incorporating these additional SPARC examples in COSQL and SQA. More specifically, we directly concatenate the additional SPARC examples to the COSQL training set and train RAT-SQL+ROBERTA on it, which slightly improves performance (19.6% vs. 19.3%), but by a smaller margin than SCORE (22.0% vs. 19.3%). Also, because SQA is weakly supervised sequential question answering, which differs from SPARC, we first fine-tune ROBERTA on the additional SPARC examples using CCS and then apply it to SQA. In this setting, the ROBERTA trained with additional SPARC examples achieves a performance similar to the original one (62.7% vs. 62.8%).

Comparison with Fine-tuning Larger Language Models. Based on our experiments and other published results, we did not find that existing larger LMs (BART (Lewis et al., 2020), T5 (Raffel et al., 2020), GPT-2 (Radford et al., 2019)) outperform custom models + BERT on CSP tasks. Our evidence is based on Spider (Yu et al., 2018b), the single-turn version of SPARC and COSQL. For T5, Shaw et al. (2020) applied T5 as a seq2seq model to Spider; compared with RAT-SQL + BERT-large, T5-base performs much worse (57.1% vs. 69.6%), and T5-3B improves by only 0.3 points while being six times larger. Moreover, for BART, we performed experiments on Spider and found that BART cannot outperform custom models + BERT: RAT-SQL + BERT 69.7%, RAT-SQL + BART encoder 67.8%, BART encoder + decoder (406M, as a seq2seq task) 62.4%. In Rubin & Berant (2020), BART did not outperform BERT either. As for GPT-2, Wu et al. (2020) and Hosseini-Asl et al. (2020) found that it does not outperform BERT on MWOZ.

Data statistics for task-oriented dialogue pre-training.

