DR.SPIDER: A DIAGNOSTIC EVALUATION BENCHMARK TOWARDS TEXT-TO-SQL ROBUSTNESS

Abstract

Neural text-to-SQL models have achieved remarkable performance in translating natural language questions into SQL queries. However, recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations, and previously curated robustness test sets usually focus on individual phenomena. In this paper, we propose a comprehensive robustness benchmark based on Spider, a cross-domain text-to-SQL benchmark, to diagnose model robustness. We design 17 perturbations on databases, natural language questions, and SQL queries to measure robustness from different angles. To collect more diversified natural question perturbations, we utilize large pretrained language models (PLMs) to simulate human behaviors in creating natural questions. We conduct a diagnostic study of the state-of-the-art models on the robustness set. Experimental results reveal that even the most robust model suffers a 14.0% performance drop overall and a 50.7% performance drop on the most challenging perturbation. We also present a breakdown analysis regarding text-to-SQL model designs and provide insights for improving model robustness.

1. INTRODUCTION

Large-scale cross-domain text-to-SQL datasets facilitate the study of machine learning models that generate a SQL query given a natural language question (NLQ) and a corresponding database (DB) as input. Neural text-to-SQL models encode an NLQ and DB schema and decode the corresponding SQL (Wang et al., 2019; Lin et al., 2020; Scholak et al., 2021), and have achieved remarkable results on existing benchmarks (Zhong et al., 2017; Yu et al., 2018; Shi et al., 2020). However, these results are obtained in a setting where test data are drawn from the same distribution as training data. This setting prevents the evaluation of model robustness, especially when the data contain spurious patterns that do not exist in the wild. For example, previous studies (Suhr et al., 2020; Gan et al., 2021a; Deng et al., 2021) have found spurious patterns in Spider (Yu et al., 2018), a widely used cross-domain text-to-SQL benchmark, such as NLQ tokens closely matching DB schemas, which lead models to rely on lexical matching between NLQs and DB schemas for prediction instead of capturing the semantics the task is intended to test. Figure 1 shows examples where state-of-the-art (SOTA) text-to-SQL models are vulnerable to perturbations: (1) DB perturbation: replacing the column name winner name with champname leads the model to miss the intent of "3 youngest winners"; (2) NLQ perturbation: the model confuses the selected column winner name with winner age given a paraphrased NLQ that uses "Who" to imply the selected column; (3) SQL perturbation: a simple change to the number of returned items (from LIMIT 3 to LIMIT 8) causes the model to miss the intended semantics. Recent studies created data to reveal the robustness problems of text-to-SQL models by perturbing DBs or NLQs (Ma & Wang, 2021; Pi et al., 2022; Gan et al., 2021a; Deng et al., 2021).
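The DB perturbation in Figure 1 can be sketched as follows. This is an illustrative example, not the benchmark's actual code: the schema dictionary, column names (winner_name, champname), and the helper rename_column are all assumptions for exposition. The point is that the column is renamed consistently in both the schema and the gold SQL, so the NLQ no longer lexically matches the schema even though the example's semantics are unchanged.

```python
import re

def rename_column(schema, sql, table, old, new):
    """Rename column `old` to `new` in one table's column list and in the gold SQL."""
    schema = {t: [new if (t == table and c == old) else c for c in cols]
              for t, cols in schema.items()}
    # \b keeps us from clobbering columns that merely contain `old` as a substring
    return schema, re.sub(rf"\b{old}\b", new, sql)

schema = {"matches": ["winner_name", "winner_age"]}
sql = "SELECT winner_name FROM matches ORDER BY winner_age LIMIT 3"
new_schema, new_sql = rename_column(schema, sql, "matches", "winner_name", "champname")
print(new_sql)  # SELECT champname FROM matches ORDER BY winner_age LIMIT 3
```

A model that relies on lexical matching between "winners" in the NLQ and winner_name in the schema will stumble on the renamed schema, which is exactly the failure mode the DB perturbations probe.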
However, these studies usually focus on individual linguistic phenomena and rely on rule-based methods or a few annotators, which cannot cover the richness and diversity of human language. For example, the NLQ perturbation example in Figure 1 requires sentence-level paraphrasing, which previous perturbation methods cannot generate. In this paper, we curate a comprehensive Diagnostic Robustness evaluation benchmark (Dr.Spider) based on Spider (Yu et al., 2018), with 17 perturbations covering all three types of robustness phenomena on DBs, NLQs, and SQLs. Dr.Spider contains 15K perturbed examples. We perturb DBs with a set of predefined rules, taking advantage of their structural nature to represent data in different ways. For NLQ perturbations, we propose a collaborative expert-crowdsourcer-AI framework that prompts the pretrained OPT model (Zhang et al., 2022) to simulate diverse, task-specific linguistic phenomena. For SQL perturbations, we programmatically modify the local semantics of SQL queries and their corresponding tokens in NLQs while minimizing surface-level changes, to measure model robustness to local semantic changes. We evaluate SOTA text-to-SQL models on our benchmark (Wang et al., 2019; Rubin & Berant, 2021; Yu et al., 2021; Scholak et al., 2021). The experiments demonstrate that although these models achieve good performance on the original data, they struggle to make consistently correct predictions on our robustness sets. Even the most robust model suffers a 14.0% performance drop overall and a 50.7% performance drop against the most challenging perturbation. We also present a breakdown analysis of model robustness in terms of model architecture, including model size, decoder architecture, and the entity-linking component. We analyze the advantages of different model designs, which provides insights for developing robust text-to-SQL models in the future.
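A semantic-changing SQL perturbation in this spirit can be sketched as a paired edit to the query and the question, as in the LIMIT 3 → LIMIT 8 example of Figure 1. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper name perturb_limit and the regex-based token matching are hypothetical, and a real pipeline would also handle number words ("three") and verify the perturbed SQL still executes.

```python
import re

def perturb_limit(nlq: str, sql: str, old: int, new: int):
    """Change `LIMIT old` to `LIMIT new` in the SQL and update the
    matching numeral in the NLQ, keeping surface-level changes minimal."""
    new_sql = re.sub(rf"\bLIMIT\s+{old}\b", f"LIMIT {new}", sql, flags=re.IGNORECASE)
    new_nlq = re.sub(rf"\b{old}\b", str(new), nlq)
    return new_nlq, new_sql

nlq = "Who are the 3 youngest winners?"
sql = "SELECT winner_name FROM matches ORDER BY winner_age LIMIT 3"
print(perturb_limit(nlq, sql, 3, 8))
# → ('Who are the 8 youngest winners?', 'SELECT winner_name FROM matches ORDER BY winner_age LIMIT 8')
```

Because only one token changes on each side, a model that predicts the post-perturbation example incorrectly is failing on the local semantics (the number of returned rows) rather than on any broader rephrasing.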

2. RELATED WORK

Figure 1: An example of the SOTA model Picard (Scholak et al., 2021) against DB, NLQ, and SQL perturbations on the database WTA. Picard predicts a correct SQL query on the pre-perturbation data but fails on the post-perturbation data. The blue and gray areas highlight the modifications to the input and the errors in the predicted SQL queries, respectively.

Robustness in NLP. The evaluation of model robustness is an important step toward the development of reliable models (Nie et al., 2019; Wang et al., 2021; Goel et al., 2021). Jia & Liang (2017); Iyyer et al. (2018); Wang et al. (2021) measure model robustness against semantic-preserving perturbations, and Kaushik et al. (2019); Gardner et al. (2020) evaluate the decision boundaries of models against semantic-changing perturbations, which change local semantics while minimizing modifications to surface-level patterns. Previous studies on robustness data for text-to-SQL only consider semantic-preserving perturbations on DBs or NLQs and rely on rule-based methods or handcrafted examples, which limits the naturalness and diversity of perturbations (Gan et al., 2021a;b; Deng et al., 2021).

Availability

Data and code are available at https://github.com/awslabs/diagnostic-robustness-text-to-sql.

