DYNAMIC PROMPT LEARNING VIA POLICY GRADIENT FOR SEMI-STRUCTURED MATHEMATICAL REASONING

Abstract

Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWPs). However, it is unknown whether these models can handle more complex problems that involve mathematical reasoning over heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TABMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. Each question in TABMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions, free-text and multi-choice, and each problem is annotated with a gold solution that reveals the multi-step reasoning process. We evaluate different pre-trained models on TABMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 is sensitive to the selection of in-context examples, its performance is unstable and can degrade to near chance. This instability is more severe when handling complex problems like those in TABMWP. To mitigate it, we further propose a novel approach, PROMPTPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data and then constructs the corresponding prompt for each test example. Experimental results show that our method outperforms the best baseline by 5.31% in accuracy and significantly reduces prediction variance compared to random selection, which verifies its effectiveness in selecting in-context examples.

1. INTRODUCTION

Developing machines equipped with mathematical reasoning capabilities is one of the long-standing goals of artificial intelligence. Solving math word problems (MWPs) is a well-defined task for diagnosing the ability of intelligent systems to perform numerical reasoning and problem-solving as humans do. A surge of datasets has been proposed to facilitate research in this domain (Upadhyay & Chang, 2017; Amini et al., 2019; Miao et al., 2020; Cobbe et al., 2021). However, most existing MWP datasets focus on textual math word problems only. Tables, widely distributed in documents such as invoices, health records, and financial reports, contain rich structured information different from unstructured text. Solving math word problems in such a tabular context is much more challenging than existing MWP benchmarks, since the system needs to make cell selections and align heterogeneous information before performing further numerical reasoning.

To fill this gap, we propose Tabular Math Word Problems (TABMWP), a new large-scale dataset that contains 38,431 math word problems with tabular context, taken from grade-level math curricula. There are two question types: free-text questions, in which the answer is an integer or decimal number, and multi-choice questions, in which the answer is a text span chosen from option candidates. Different from existing MWP datasets, each problem in TABMWP is accompanied by a tabular context, which is represented in three formats: an image, semi-structured text, and a structured table. Each problem is also annotated with a detailed solution that reveals the multi-step reasoning process, ensuring full explainability.

Figure 1: Two examples from the TABMWP dataset. The example above is a free-text problem with a numerical answer; the example below is a multi-choice problem with a textual answer. (The free-text example's solution reads: "Find the cost of the spherical beads. Multiply: $3.42 × 5 = $17.10. Find the cost of the star-shaped beads. Multiply: $1.95 × 4 = $7.80. Find the cost of the flower-shaped beads. Multiply: $2.18 × 3 = $6.54. Now find the total cost by adding: $17.10 + $7.80 + $6.54 = $31.44. She will spend $31.44.")

To solve problems in TABMWP, a system requires multi-hop mathematical reasoning over heterogeneous information: it must look up table cells given textual clues and conduct multi-step operations to predict the final answer. Take the first problem in Figure 1 as an example. To answer the question "how much will she spend (if Tracy buys three kinds of beads)?", we first need to look up the three relevant rows in the given table, calculate the individual cost for each kind of bead, and finally sum the three costs to obtain the answer of $31.44.

Inspired by the success of the large pre-trained language model GPT-3 (Brown et al., 2020) in solving math word problems (Wei et al., 2022; Wang et al., 2022), we first build a strong baseline using few-shot GPT-3 on TABMWP. A few in-context examples are randomly selected from the training set and, together with the test example, constructed into a prompt for GPT-3 to predict the answer. However, recent studies have shown that this type of few-shot learning can be highly unstable across different selections of in-context examples (Zhao et al., 2021; Liu et al., 2022a; Lu et al., 2022c).
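The lookup-then-compute reasoning described above can be sketched in a few lines. The table values below are copied from the Figure 1 example; the function and variable names are illustrative only, not part of any released code.

```python
# Table from the Figure 1 free-text example: bead kind -> (price, quantity)
price_and_qty = {
    "spherical":     (3.42, 5),
    "star-shaped":   (1.95, 4),
    "flower-shaped": (2.18, 3),
}

def total_cost(table, kinds):
    # Step 1: look up the relevant rows; step 2: multiply price by
    # quantity within each row; step 3: sum the per-row costs.
    return sum(price * qty for price, qty in (table[k] for k in kinds))

cost = total_cost(price_and_qty, ["spherical", "star-shaped", "flower-shaped"])
print(f"${cost:.2f}")  # $31.44
```

Even this toy version makes the three reasoning hops explicit (cell selection, per-row arithmetic, aggregation), which is exactly what a TABMWP solver must perform implicitly.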



The data and code are available at https://promptpg.github.io. Work was partially done while Pan Lu was an intern at Allen Institute for AI (AI2).



(The multi-choice example's solution in Figure 1 reads: "Add the numbers in the Sandwich City row. Then, add the numbers in the Express Sandwiches row. Sandwich City: 3 + 12 = 15. Express Sandwiches: 7 + 17 = 24. 15 is less than 24. Sandwich City sold fewer sandwiches.")

This instability could be even worse on TABMWP, since its problems are distributed across multiple question types and diverse table layouts. Liu et al. (2022a) try to address this issue by retrieving semantically similar examples. However, this method may not work well on TABMWP because it cannot measure the similarity of structured information, such as the number of cells in tables.

To alleviate this challenge, we further propose a novel approach, termed PROMPTPG, that learns to select in-context examples from a small amount of training data via policy gradient for prompt learning. As illustrated in Figure 2, an agent learns to find optimal in-context examples from a candidate pool, with the goal of maximizing the prediction rewards on given training examples when interacting with the GPT-3 environment. A policy network defines the strategy for selecting in-context examples given the current training example. The policy network is built on top of the language model BERT (Devlin et al., 2018) with fixed parameters, followed by a one-layer linear neural network with learnable parameters. The learnable parameters are updated following the policy gradient strategy (Sutton et al., 1998). Unlike random selection (Wei et al., 2022; Wang et al., 2022), brute-force search, or retrieval-based selection (Liu et al., 2022a), PROMPTPG learns to construct the prompt dynamically from the candidate pool when interacting with the GPT-3 API.

We implement two state-of-the-art methods as baselines: UnifiedQA (Khashabi et al., 2020) for general question answering and TAPEX (Liu et al., 2022b) for tabular question answering. Both are evaluated in pre-trained and fine-tuned settings. Experimental results show that our model PROMPTPG achieves an overall accuracy of 68.23% on TABMWP, surpassing previous methods by a large margin of up to 5.31%.
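Once in-context examples are chosen, the prompt submitted to GPT-3 simply concatenates them with the test example. A minimal sketch of such a few-shot format follows; the "Table/Question/Answer" template and field names are assumptions for illustration, not the paper's exact prompt format.

```python
def build_prompt(examples, test_problem):
    """Concatenate selected in-context examples with the test example.
    Template and field names are illustrative assumptions."""
    blocks = []
    for ex in examples:
        # Each in-context example shows a complete table/question/solution.
        blocks.append(f"Table: {ex['table']}\n"
                      f"Question: {ex['question']}\n"
                      f"Answer: {ex['solution']}")
    # The test example is appended with an empty answer slot for the
    # language model to complete.
    blocks.append(f"Table: {test_problem['table']}\n"
                  f"Question: {test_problem['question']}\n"
                  f"Answer:")
    return "\n\n".join(blocks)
```

Whatever selection strategy is used (random, retrieval-based, or PROMPTPG), only the choice of `examples` changes; the prompt-assembly step stays the same.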
Further analysis demonstrates that PROMPTPG can select better in-context examples compared with a wide range of existing selection strategies and reduce the prediction variance significantly compared to random selection.
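The selection mechanism can be sketched as a REINFORCE-style policy over candidate examples. In this sketch, a deterministic hash-based embedding stands in for the frozen BERT encoder, the bilinear scoring head is an assumed form of the one-layer learnable network, and the reward function is a stub for checking GPT-3's prediction; none of these stand-ins are the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

def embed(text):
    # Stand-in for the frozen BERT encoder: a deterministic,
    # hash-based random embedding (illustration only).
    seed = sum(ord(c) for c in text)
    return np.random.default_rng(seed).standard_normal(DIM)

class PromptPolicy:
    """Learnable linear head over frozen embeddings, trained with
    REINFORCE -- a sketch of the PROMPTPG idea, not the released code."""
    def __init__(self, lr=0.01):
        self.W = np.zeros((DIM, DIM))  # the only learnable parameters
        self.lr = lr

    def probs(self, problem, candidates):
        q = embed(problem)
        scores = np.array([q @ self.W @ embed(c) for c in candidates])
        e = np.exp(scores - scores.max())  # softmax over candidates
        return e / e.sum()

    def step(self, problem, candidates, reward_fn):
        p = self.probs(problem, candidates)
        action = rng.choice(len(candidates), p=p)  # sample an example
        reward = reward_fn(action)  # e.g. +1 if GPT-3 answers correctly
        # REINFORCE: W += lr * reward * d(log pi(action)) / dW, where
        # d(log pi(a))/dW = outer(q, e_a - sum_i p_i e_i) for this head.
        q = embed(problem)
        embs = np.stack([embed(c) for c in candidates])
        self.W += self.lr * reward * np.outer(q, embs[action] - p @ embs)
        return action, reward

# Toy usage: reward favors candidate 0, so its selection probability rises.
policy = PromptPolicy()
problem = "How much will Tracy spend on beads?"
candidates = ["bead cost table problem", "geometry problem", "reading passage"]
for _ in range(100):
    policy.step(problem, candidates, lambda a: 1.0 if a == 0 else -1.0)
```

The reward here is a stand-in; in the paper's setting, the reward comes from whether GPT-3, prompted with the selected examples, predicts the correct answer for the training example.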

