COMPOSITIONAL SEMANTIC PARSING WITH LARGE LANGUAGE MODELS

Abstract

Humans can reason compositionally when presented with new tasks. Previous research shows that appropriate prompting techniques enable large language models (LLMs) to solve artificial compositional generalization tasks such as SCAN. In this work, we identify additional challenges in more realistic semantic parsing tasks with larger vocabularies and refine these prompting techniques to address them. Our best method is based on least-to-most prompting: it decomposes the problem using prompting-based syntactic parsing, then uses this decomposition to select appropriate exemplars and to sequentially generate the semantic parse. This method allows us to set a new state of the art for CFQ while requiring only 1% of the training data used by traditional approaches. Due to the general nature of our approach, we expect similar efforts will lead to new results in other tasks and domains, especially for knowledge-intensive applications.

1. INTRODUCTION

Compositionality is a key part of human intelligence, as it allows us to understand and produce a potentially infinite number of novel combinations of known components (Chomsky, 1957; Montague, 1970; Lake et al., 2017). In contrast, standard neural sequence models such as transformers and recurrent neural networks often fail to capture the compositional structure of the problem domain and thus fail to generalize compositionally (Keysers et al., 2020; Furrer et al., 2020). Prior efforts to improve compositional generalization primarily rely on specialized architectures or training procedures (Lake, 2019; Chen et al., 2020; Nye et al., 2020; Andreas, 2020; Conklin et al., 2021; Akyürek et al., 2021; Liu et al., 2021). Although effective, these can be task-specific. Even more general-purpose methods that rely on data augmentation are limited in the class of data they can support (Shaw et al., 2021; Qiu et al., 2022a). Prompting, on the other hand, is sufficiently flexible and, with the recent advancement of large-scale pretrained language models (LLMs), has become an effective and generic approach to a wide range of language understanding problems (Brown et al., 2020). Prompting now performs on par with or better than model finetuning in many cases (Wei et al., 2022b; Chowdhery et al., 2022; Wei et al., 2022a; Kojima et al., 2022; Ahn et al., 2022), and may therefore be suitable for improving language model performance on compositional generalization. In particular, recent work (Zhou et al., 2022) found that least-to-most prompting shows considerable promise for adapting LLMs to compositional generalization, achieving 99.7% accuracy on SCAN, a commonly used compositional generalization benchmark. Least-to-most prompting decomposes each problem into a series of subproblems, then solves them sequentially.
However, SCAN is an artificial task built on a synthetic language with a tiny vocabulary and generated from a small set of grammar rules, and it is unclear whether strong results transfer to more realistic tasks based on larger vocabularies and more complicated grammars (Furrer et al., 2020). Additional challenges arise when applying least-to-most prompting to more realistic semantic parsing benchmarks. Among other issues, such benchmarks may require information beyond what fits in a single prompt. Decomposing a problem is also more difficult than for SCAN, exacerbated by constituents that cannot be translated independently of their context. We address these challenges with dynamic least-to-most prompting, a generic refinement of least-to-most prompting that involves the following steps: (1) tree-structured decomposition of natural language inputs through LM-predicted syntactic parsing, (2) use of the decomposition to dynamically select exemplars, and (3) linearization of the decomposition tree and prompting of the model to sequentially generate answers to the subproblems. We evaluate our approach on two realistic benchmarks that, like SCAN, are designed to measure compositional generalization: CFQ (Keysers et al., 2020) and COGS (Kim & Linzen, 2020). On CFQ, our best-performing method outperforms previous fully supervised finetuning approaches and achieves a new state-of-the-art accuracy of 95% (averaged across MCD splits), thereby reducing the error rate by about 45% compared to the previous best result while using only about 1% of the training data as candidates for exemplars. On COGS, our approach scores an accuracy of 99.2% on the generalization test set, comparable with strong baselines. We also demonstrate that our approach is robust to exemplar pool size: even when using less than 0.1% of the training data as exemplars, dynamic least-to-most prompting remains competitive with previous approaches.
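The three steps above can be sketched as a pipeline. The following is a minimal illustrative sketch, not the authors' implementation: `call_lm` is a stub standing in for a real LLM query, the "decomposition" is a trivial word-prefix tree rather than a prompted syntactic parse, and exemplar selection is reduced to crude word-overlap scoring. All function names here are hypothetical.

```python
# Hypothetical sketch of dynamic least-to-most prompting.
# Step 1: decompose; Step 2: select exemplars; Step 3: solve sequentially.

def call_lm(prompt):
    """Stub LLM: in practice this would query a pretrained model.
    Here it simply echoes the last line of the prompt as the 'answer'."""
    return prompt.rstrip().split("\n")[-1]

def decompose(sentence):
    """Step 1: produce subproblems of increasing complexity. A stand-in for
    the LM-predicted tree-structured decomposition described in the paper."""
    words = sentence.split()
    return [" ".join(words[:i]) for i in range(1, len(words) + 1)]

def select_exemplars(subproblems, pool, k=2):
    """Step 2: pick exemplars from the pool that overlap with the
    decomposition (a crude proxy for the paper's dynamic selection)."""
    target = set(" ".join(subproblems).split())
    scored = sorted(pool, key=lambda ex: -len(target & set(ex.split())))
    return scored[:k]

def solve(sentence, pool):
    """Step 3: translate each subproblem in turn, feeding earlier answers
    back into the prompt as context."""
    subproblems = decompose(sentence)
    prompt = "\n".join(select_exemplars(subproblems, pool)) + "\n"
    answers = []
    for sub in subproblems:
        answer = call_lm(prompt + sub)
        answers.append(answer)
        prompt += f"{sub} -> {answer}\n"
    return answers

pool = ["Did M0 direct M1", "Who wrote M2", "What did M0 edit"]
answers = solve("Who directed M1", pool)
```

With the echo stub, the control flow is visible end to end: each subproblem is appended to a growing prompt along with its answer before the next subproblem is attempted.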

2. BACKGROUND AND MOTIVATION

2.1. COMPOSITIONAL GENERALIZATION

Compositionality is the idea that the meanings of complex expressions are constructed from the meanings of the less complex expressions that are their constituents. - Fodor & Lepore (2002)

Given knowledge of conceptual primitives and a few of their combinations, compositional generalization is the capability to use and comprehend unseen combinations. SCAN (Lake & Baroni, 2018; Loula et al., 2018) is one of the earliest benchmarks showing that neural sequence models cannot systematically generalize to novel combinations of the primitive items of a language. The benchmark requires the learner to translate simple commands to action sequences, where all commands are generated from a set of 20 grammar rules over a vocabulary of about 20 words. Recent work has achieved perfect generalization accuracy on SCAN by inferring grammar rules in symbolic form (Chen et al., 2020; Nye et al., 2020; Liu et al., 2020; Shaw et al., 2021). Most recently, Zhou et al. (2022) demonstrated that SCAN can be solved by least-to-most prompting, which leverages a pretrained large language model (LLM) and a prompt consisting of only 14 exemplars, less than 0.1% of the training data used by previous approaches.
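To make the task concrete, the following is a minimal interpreter for a handful of SCAN's rules (the full benchmark uses about 20); it covers only the fragment needed for the running example in the next section.

```python
# Minimal interpreter for a small fragment of the SCAN grammar.
# Covers: "and", "twice"/"thrice" repetition, "right" turns, "around right",
# and the primitives look/walk/jump. The real benchmark has ~20 rules.

def scan(command):
    """Translate a SCAN command into its action sequence (subset of rules)."""
    if " and " in command:
        left, right = command.split(" and ", 1)
        return scan(left) + scan(right)
    for word, n in (("thrice", 3), ("twice", 2)):
        if command.endswith(" " + word):
            return scan(command[: -(len(word) + 1)]) * n
    if command.endswith(" around right"):
        # "around right": turn right and perform the action, four times.
        action = scan(command[: -len(" around right")])
        return (["I_TURN_RIGHT"] + action) * 4
    if command.endswith(" right"):
        return ["I_TURN_RIGHT"] + scan(command[: -len(" right")])
    return {"look": ["I_LOOK"], "walk": ["I_WALK"], "jump": ["I_JUMP"]}[command]
```

For example, `scan("look right")` yields `["I_TURN_RIGHT", "I_LOOK"]`, and "look around right thrice and walk twice" expands to 26 actions, which illustrates why output sequences grow much faster than input commands.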

2.2. LEAST-TO-MOST PROMPTING ENABLES COMPOSITIONAL GENERALIZATION

Least-to-most prompting teaches a language model to solve a complex problem by reducing it to a series of easier subproblems. This is done by constructing two types of prompts: the first tells the language model how to decompose a problem into a list of subproblems, while the second describes how to sequentially solve those subproblems. As an illustration, consider the application of least-to-most prompting to SCAN. The decomposition of the input "look around right thrice and walk twice" yields the following subproblems: "look right", "look around right", "look around right thrice", and "walk twice". Since SCAN commands are generated by a simple grammar of only 20 rules, this decomposition task can be performed with a prompt consisting of only 8 decomposition exemplars. The decomposition allows the translation of the original input to be produced sequentially rather than in one step (as would be the case with naive prompting). The first subproblem is translated by passing the language model a prompt context consisting of 14 simple translation exemplars followed by the command "look right". The model's answer is then appended to the prompt, where it serves as additional context when translating the next subproblem, "look around right", and so on.
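The sequential-solution phase can be sketched as follows. This is an illustrative sketch only: `fake_translate` is a stub in place of a real LLM call (its answer merely records how many solved pairs precede the question), and `EXEMPLARS` is a one-line placeholder for the 14 translation exemplars the paper describes.

```python
# Sketch of the sequential-solution phase of least-to-most prompting on SCAN.

EXEMPLARS = 'Q: "turn left"\nA: "I_TURN_LEFT"\n'  # placeholder for 14 exemplars

SUBPROBLEMS = [  # decomposition of "look around right thrice and walk twice"
    "look right",
    "look around right",
    "look around right thrice",
    "walk twice",
]

def fake_translate(prompt):
    """Stub LLM: reports how many solved (Q, A) pairs precede the question."""
    return f"<answer {prompt.count('A: ')}>"

def solve_sequentially(subproblems, translate):
    context = EXEMPLARS
    answers = {}
    for sub in subproblems:
        # Translate the next subproblem, given all earlier (subproblem,
        # answer) pairs appended to the prompt as additional context.
        answer = translate(context + f'Q: "{sub}"\nA:')
        answers[sub] = answer
        context += f'Q: "{sub}"\nA: "{answer}"\n'
    return answers

answers = solve_sequentially(SUBPROBLEMS, fake_translate)
```

The key property is that the prompt grows with each solved subproblem, so later (harder) subproblems can reuse the translations of the earlier (easier) ones.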



Figure 1: An example semantic parsing problem from CFQ, where the input is a sentence and the output is its formal representation as a SPARQL query.

