DR.SPIDER: A DIAGNOSTIC EVALUATION BENCHMARK TOWARDS TEXT-TO-SQL ROBUSTNESS

Abstract

Neural text-to-SQL models have achieved remarkable performance in translating natural language questions into SQL queries. However, recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations, and previously curated robustness test sets usually focus on individual phenomena. In this paper, we propose a comprehensive robustness benchmark based on Spider, a cross-domain text-to-SQL benchmark, to diagnose model robustness. We design 17 perturbations on databases, natural language questions, and SQL queries to measure robustness from different angles. To collect more diversified natural question perturbations, we utilize large pretrained language models (PLMs) to simulate human behaviors in creating natural questions. We conduct a diagnostic study of state-of-the-art models on the robustness set. Experimental results reveal that even the most robust model suffers a 14.0% performance drop overall and a 50.7% performance drop on the most challenging perturbation. We also present a breakdown analysis regarding text-to-SQL model designs and provide insights for improving model robustness.

1. INTRODUCTION

Large-scale cross-domain text-to-SQL datasets facilitate the study of machine learning models for generating a SQL query given a natural language question (NLQ) and the corresponding database (DB) as input. Neural text-to-SQL models encode an NLQ and DB schema and decode the corresponding SQL (Wang et al., 2019; Lin et al., 2020; Scholak et al., 2021), and have achieved remarkable results on existing benchmarks (Zhong et al., 2017; Yu et al., 2018; Shi et al., 2020). However, those results are obtained in a setting where the test data are drawn from the same distribution as the training data. This setting prevents the evaluation of model robustness, especially when the data contain spurious patterns that do not exist in the wild. For example, previous studies (Suhr et al., 2020; Gan et al., 2021a; Deng et al., 2021) have found spurious patterns in Spider (Yu et al., 2018), a widely used cross-domain text-to-SQL benchmark, such as NLQ tokens closely matching DB schemas, leading models to rely on lexical matching between NLQs and DB schemas for prediction instead of capturing the semantics the task is intended to test.

Figure 1 shows examples where state-of-the-art (SOTA) text-to-SQL models are vulnerable to perturbations. (1) DB perturbation: replacing the column name winner name with champname leads the model to miss the intent of "3 youngest winners"; (2) NLQ perturbation: the model confuses the selected column winner name with winner age when given a paraphrased NLQ that uses "Who" to imply the selected column; (3) SQL perturbation: a simple change to the number of returned items (from LIMIT 3 to LIMIT 8) causes the model to miss the right intent. Recent studies created data that reveal the robustness problems of text-to-SQL models by perturbing DBs or NLQs (Ma & Wang, 2021; Pi et al., 2022; Gan et al., 2021a; Deng et al., 2021). However, they usually focus on individual linguistic phenomena and rely on rule-based methods or a few
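The three perturbation types in Figure 1 can be illustrated as simple transformations of an (NLQ, DB schema, SQL) example. The sketch below is purely illustrative; the function names, example strings, and string-replacement approach are our own simplification, not Dr.Spider's actual perturbation pipeline (which uses rule-based methods and PLMs).

```python
# Illustrative sketch of the three perturbation types (DB, NLQ, SQL)
# applied to the running example from Figure 1. Not the benchmark's
# actual generation pipeline.

EXAMPLE = {
    "db_schema": ["winner_name", "winner_age"],
    "nlq": "Show the names of the 3 youngest winners.",
    "sql": "SELECT winner_name FROM matches ORDER BY winner_age LIMIT 3",
}

def perturb_db(example):
    """DB perturbation: rename a column (winner_name -> champname).
    The gold SQL is rewritten to match the new schema; the NLQ is unchanged,
    so lexical matching between NLQ and schema no longer works."""
    out = dict(example)
    out["db_schema"] = ["champname" if c == "winner_name" else c
                        for c in example["db_schema"]]
    out["sql"] = example["sql"].replace("winner_name", "champname")
    return out

def perturb_nlq(example):
    """NLQ perturbation: paraphrase so the selected column is only
    implied ("Who ..."), while DB and gold SQL stay the same."""
    out = dict(example)
    out["nlq"] = "Who are the 3 youngest winners?"
    return out

def perturb_sql(example):
    """SQL perturbation: change a literal in the query (LIMIT 3 -> LIMIT 8),
    with the NLQ updated to match the new intent."""
    out = dict(example)
    out["nlq"] = example["nlq"].replace("3", "8")
    out["sql"] = example["sql"].replace("LIMIT 3", "LIMIT 8")
    return out

for perturb in (perturb_db, perturb_nlq, perturb_sql):
    print(perturb.__name__, perturb(EXAMPLE))
```

In each case only one side of the (NLQ, DB, SQL) triple is perturbed while the intended semantics are preserved (or minimally shifted), so a semantically faithful model should still predict the correct query.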

Availability: data and code are available at https://github.com/awslabs/diagnostic-robustness-text-to-sql.

