BINDING LANGUAGE MODELS IN SYMBOLIC LANGUAGES

Abstract

Though end-to-end neural approaches have recently dominated NLP tasks in both performance and ease of use, they lack interpretability and robustness. We propose BINDER, a training-free neural-symbolic framework that maps the task input to a program, which (1) allows binding a unified API of language model (LM) functionalities to a programming language (e.g., SQL, Python) to extend its grammar coverage and thus tackle more diverse questions, (2) adopts an LM as both the program parser and the underlying model called by the API during execution, and (3) requires only a few in-context exemplar annotations. Specifically, we employ GPT-3 Codex as the LM. In the parsing stage, with only a few in-context exemplars, Codex is able to identify the part of the task input that cannot be answered by the original programming language, correctly generate API calls to prompt Codex to solve that unanswerable part, and decide where to place the API calls while remaining compatible with the original grammar. In the execution stage, Codex can perform versatile functionalities (e.g., commonsense QA, information extraction) given proper prompts in the API calls. BINDER achieves state-of-the-art results on the WIKITABLEQUESTIONS and TABFACT datasets, with explicit output programs that aid human debugging. Notably, the previous best systems are all finetuned on tens of thousands of task-specific samples, while BINDER only uses dozens of annotations as in-context exemplars without any training.

1. INTRODUCTION

Performance on natural language processing tasks is dominated by neural end-to-end systems that directly map inputs to outputs (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2020; Raffel et al., 2020, i.a.). These end-to-end approaches are flexible and easy to use, but they lack interpretability and robustness. This stands in contrast to symbolic approaches that produce explicit intermediate representations such as logical forms, reasoning paths, or program code, which can then be executed to derive a final output (Zettlemoyer & Collins, 2005; Gulwani et al., 2017; Chen et al., 2019b, i.a.). The explicit intermediate forms these approaches produce, together with their deterministic execution, make them more robust to input changes. However, their semantic coverage is limited by the affordances of the grammar of the selected symbolic language (e.g., not being able to handle "North America?" in Fig. 1), leading to failures on diverse real-world questions, and the intermediate-form annotations require expert knowledge and researcher labour. A few works (Andreas et al., 2016; Gupta et al., 2019; Khot et al., 2021; Zhu et al., 2022, i.a.) combine neural modules with symbolic languages (neural-symbolic) to leverage the advantages of both approaches. However, they require elaborate human design of the symbolic language and calibration of the corresponding neural modules to tackle problems in a specific domain with large training data. More specifically, most of these works propose a task-specific symbolic language and corresponding modules that cover only limited semantic phenomena in a specific task and domain. Therefore, new languages and neural modules have to be introduced when adapting them to new tasks and domains, and their coverage remains restricted by the customized symbolic language and neural modules. Moreover, they call for large and varied training data to ensure all modules are well trained.
Therefore, we seek a neural-symbolic system that supports flexible neural module calls to give the symbolic language higher coverage, while requiring only a few annotations. We propose BINDER, a training-free neural-symbolic framework that maps task inputs to an executable program in a programming language (e.g., SQL, Python) bound with a unified API to call language models (LMs; Brown et al., 2020; Chen et al., 2021) to perform versatile functionalities, i.e., a BINDER program (e.g., Binder-SQL, Binder-Python in Fig. 1), with only a few input-BINDER program annotations as in-context exemplars. More specifically, BINDER first prompts Codex, a code-pretrained version of GPT-3, to parse a question input into a BINDER program, in which Codex has to decide (1) which parts of the input can be converted to the target programming language (question parts highlighted in grey in Fig. 1), (2) the corresponding task API calls (e.g., f("North America?"; Made_in)) to prompt Codex to resolve the other parts, and (3) where to insert the API calls in the BINDER program. Next, BINDER prompts Codex again to generate answers to the task API calls (given the generated task prompts) and integrates the generated results back into the program. Specifically, as in Fig. 1, the prompt (e.g., "North America?") and data (e.g., the Made_in column) in API calls are fed into Codex, and the output is a new column answering the prompt based on the input data (i.e., yes/no for whether each country in the Made_in column is from North America). Finally, the program, now in a standard programming language, is executed to derive the final answer. In summary, BINDER enables flexible functionality integration into the programming language to improve its coverage, and requires only a few annotations.
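The idea of binding an LM call into a standard programming language can be sketched concretely. Below is a minimal, runnable illustration using SQLite's user-defined scalar functions: the function name `f`, the toy table, and the hardcoded lookup stub standing in for the Codex call are all assumptions for illustration, not the paper's actual interface.

```python
import sqlite3

# Toy stand-in for the LM call f("North America?"; Made_in).
# In BINDER this prompt would be answered by Codex; here it is stubbed
# with a lookup set so the sketch runs offline (an assumption, not the
# paper's API).
def llm_answer(prompt, value):
    north_america = {"USA", "Mexico", "Canada"}
    if prompt == "North America?":
        return "yes" if value in north_america else "no"
    raise ValueError(f"unsupported prompt: {prompt}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, made_in TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("shirt", "USA"), ("mug", "Germany"), ("toy", "Mexico")])

# Bind the LM functionality into SQL as a scalar function, so a
# Binder-SQL-style program can call it like any built-in.
conn.create_function("f", 2, llm_answer)

rows = conn.execute(
    "SELECT name FROM products WHERE f('North America?', made_in) = 'yes'"
).fetchall()
print(rows)  # [('shirt',), ('toy',)]
```

The SQL query itself stays within the original grammar; only the scalar function `f` extends its coverage, mirroring how a BINDER program keeps the host language intact around the API calls.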
The BINDER program replaces custom neural modules and task-specific languages with, respectively, a unified prompting API call to Codex and general programming languages, to handle much more diverse task inputs in open domains without complex language and neural-module design. BINDER is built on the advances of in-context learning with language models and does not require any training or large-scale annotation. We demonstrate the effectiveness of the BINDER framework on WIKITABLEQUESTIONS (WIKITQ; Pasupat & Liang, 2015) and TABFACT (Chen et al., 2019a), two structured knowledge grounding datasets that require complex reasoning over tables. Using Codex (Chen et al., 2021) as the LM, BINDER achieves state-of-the-art results on WIKITQ and TABFACT. Note that the previous state-of-the-art methods all require fine-tuning on more than 10K annotated training examples, or even massive amounts of task-related pretraining data, while our method requires only a dozen or so annotations without training. In further analysis, we find that BINDER provides the greatest performance gain on questions that the original language grammar (SQL and Python) cannot support, indicating that BINDER effectively improves programming-language coverage. We also demonstrate that BINDER can be applied to multi-modal knowledge sources (text, tables, images, and combinations thereof) on the MULTIMODALQA dataset (Talmor et al., 2021). Moreover, we show that BINDER, compared with end-to-end approaches, is more interpretable when debugging the model, more scalable to very large inputs, and more robust to noisy inputs.



Figure 1: An overview of the BINDER pipeline, with two stages: parsing and execution. (1) In the parsing stage, the language model (LM) maps the input to a BINDER program given the question and (optional) knowledge sources. The expressions with blue background in the program are API calls that acquire external results. (2) In the execution stage, an LM realizes the API calls given the generated prompts, and the return values are fed back into the original programming language. A deterministic program interpreter then executes the program, now free of API calls, to derive the final answer.
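The execution stage described in the caption, where an API call's return values become a new column in the original programming language, can be sketched as follows. The helper names (`resolve_api_call`, `toy_llm`) and the new-column naming scheme are assumptions for this sketch, and the actual Codex call is replaced by a stub.

```python
# Sketch of BINDER's execution stage: materialize an API call
# f(prompt; column) as a new column that the remaining (standard)
# program can reference like any other column.

def resolve_api_call(table, prompt, column, llm):
    """Add a column holding the LM's per-row answers for `prompt`."""
    table[f"f_{column}"] = [llm(prompt, v) for v in table[column]]
    return table

def toy_llm(prompt, value):
    # Stand-in for prompting Codex with the API call (an assumption).
    return "yes" if value in {"USA", "Mexico", "Canada"} else "no"

table = {"made_in": ["USA", "Germany", "Mexico"]}
table = resolve_api_call(table, "North America?", "made_in", toy_llm)
print(table["f_made_in"])  # ['yes', 'no', 'yes']
```

After this step the program contains no API calls, so a deterministic interpreter (e.g., a SQL engine or the Python runtime) can finish the execution.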

