DYNAMIC PROMPT LEARNING VIA POLICY GRADIENT FOR SEMI-STRUCTURED MATHEMATICAL REASONING

Abstract

Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWPs). However, it is unknown whether such models can handle more complex problems that involve math reasoning over heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TABMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. Each question in TABMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions, free-text and multi-choice, and each problem is annotated with a gold solution that reveals the multi-step reasoning process. We evaluate different pre-trained models on TABMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, because few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. This instability is more severe when handling complex problems like those in TABMWP. To mitigate it, we further propose a novel approach, PROMPTPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data and then constructs the corresponding prompt for each test example. Experimental results show that our method outperforms the best baseline by 5.31% in accuracy and significantly reduces prediction variance compared to random selection, verifying its effectiveness in the selection of in-context examples.

1. INTRODUCTION

Developing machines equipped with mathematical reasoning capabilities is one of the long-standing goals of artificial intelligence. Solving math word problems (MWPs) is a well-defined task for diagnosing the ability of intelligent systems to perform numerical reasoning and problem-solving as humans do. A surge of datasets has been proposed to facilitate research in this domain (Upadhyay & Chang, 2017; Amini et al., 2019; Miao et al., 2020; Cobbe et al., 2021). However, most existing MWP datasets focus on textual math word problems only. Tables, widely distributed in documents such as invoices, health records, and financial reports, contain rich structured information different from unstructured text. Solving math word problems in such a tabular context is much more challenging than existing MWP benchmarks, since the system needs to make cell selections and align heterogeneous information before performing further numerical reasoning. To fill this gap, we propose Tabular Math Word Problems (TABMWP), a new large-scale dataset that contains 38,431 math word problems with tabular context, taken from grade-level math curricula. There are two question types: free-text questions, in which the answer is an integer or decimal number, and multi-choice questions, where the answer is a text span chosen from option candidates. Different from existing MWP datasets, each problem in TABMWP is accompanied by a tabular context, which is represented in three formats: an image, semi-structured text, and a structured table.

Question: If Tracy buys 5 kilograms of spherical beads, 4 kilograms of star-shaped beads, and 3 kilograms of flower-shaped beads, how much will she spend? (unit: $)
Answer: 31.44
Solution: Find the cost of the spherical beads. Multiply: $3.42 × 5 = $17.10. Find the cost of the star-shaped beads. Multiply: $1.95 × 4 = $7.80. Find the cost of the flower-shaped beads. Multiply: $2.18 × 3 = $6.54. Now find the total cost by adding: $17.10 + $7.80 + $6.54 = $31.44. She will spend $31.44.

Figure 1: Two examples from the TABMWP dataset. The example above is a free-text problem with a numerical answer; the example below is a multi-choice problem with a textual answer.

Each problem is also annotated with a detailed solution that reveals the multi-step reasoning process to ensure full explainability. To solve problems in TABMWP, a system requires multi-hop mathematical reasoning over heterogeneous information by looking up table cells given textual clues and conducting multi-step operations to predict the final answer. Take the problem above in Figure 1 as an example. To answer the question "how much will she spend (if Tracy buys three kinds of beads)?", we first need to look up the corresponding three rows in the given table, calculate the individual cost for each kind of bead, and finally sum the three costs to get the answer of 31.44. Inspired by the success of the large pre-trained language model GPT-3 (Brown et al., 2020) in solving math word problems (Wei et al., 2022; Wang et al., 2022), we first build a strong baseline using few-shot GPT-3 on TABMWP. A few in-context examples are randomly selected from the training set and, along with the test example, are constructed as a prompt for GPT-3 to predict the answer. However, recent studies have shown that this type of few-shot learning can be highly unstable across different selections of in-context examples (Zhao et al., 2021; Liu et al., 2022a; Lu et al., 2022c). The issue could be worse on TABMWP, since its problems are distributed across multiple question types and diverse table layouts. Liu et al. (2022a) try to address this issue by retrieving semantically similar examples. However, this method may fall short on TABMWP because it cannot measure the similarity of structured information, such as the number of cells in tables.
To alleviate this challenge, we further propose a novel approach, termed PROMPTPG, that learns to select in-context examples from a small amount of training data via policy gradient. As illustrated in Figure 2, an agent learns to find optimal in-context examples from a candidate pool, with the goal of maximizing the prediction rewards on given training examples when interacting with the GPT-3 environment. A policy network defines the strategy for selecting in-context examples given the current training example. The policy network is built on top of the language model BERT (Devlin et al., 2018) with fixed parameters, followed by a one-layer linear neural network with learnable parameters. The learnable parameters are updated following the policy gradient strategy (Sutton et al., 1998). Unlike random selection (Wei et al., 2022; Wang et al., 2022), brute-force search, or retrieval-based selection (Liu et al., 2022a), PROMPTPG learns to construct the prompt dynamically from the candidate pool when interacting with the GPT-3 API. We implement two state-of-the-art methods as baselines: UnifiedQA (Khashabi et al., 2020) for general question answering and TAPEX (Liu et al., 2022b) for table-based question answering.

2. THE TABMWP DATASET

2.1. TASK FORMULATION

A tabular math word problem p is represented as a pair (t, q), where t is a table context and q is a question. The table t can be represented visually as an image, as semi-structured text, or as a structured database. In this work, we focus on the semi-structured format as the table context for simplicity. The table t features complicated layouts and formats: it contains multiple rows and columns, and each cell can be a string of text, a string of a number, or a mix of them. Depending on the question and answer types, the question q may be accompanied by multiple-choice options c = {c_1, c_2, ..., c_n} or a unit u. Given a semi-structured tabular context t and an unstructured question text q, the task is to generate the answer a, which is either numerical text only for a free-text question or a text span from the given options for a multi-choice question.
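As a concrete illustration, a problem instance p = (t, q) can be modeled with a small data class; the field names below are our own sketch, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TabMWPProblem:
    """One problem p = (t, q). Field names are illustrative, not TABMWP's schema."""
    table: str                           # semi-structured table t: rows joined by '\n', cells by ' | '
    question: str                        # unstructured question text q
    choices: Optional[List[str]] = None  # options c_1..c_n for multi-choice questions
    unit: Optional[str] = None           # unit u for free-text questions, e.g., "$"
    answer: str = ""                     # gold answer a: a number or a text span from choices

    @property
    def question_type(self) -> str:
        # free-text questions have no options; multi-choice questions do
        return "multi_choice" if self.choices else "free_text"

p = TabMWPProblem(
    table="Item | Price per kilogram\nspherical beads | $3.42",
    question="How much do 5 kilograms of spherical beads cost?",
    unit="$",
    answer="17.10",
)
assert p.question_type == "free_text"
```

A multi-choice instance would set `choices` and leave `unit` empty, mirroring the two question types defined above.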

2.2. DATASET CONSTRUCTION

Data collection. We construct TABMWP based on openly available content; more details are provided in Appendix A.1. Only math word problems that are accompanied by a tabular context and a detailed solution are collected. We develop a script to extract the tabular context, the question, options that apply, the correct answer, and the solution for each problem. These elements can be precisely identified using HTML tags. For each table, we take a screenshot and store its raw text.

Data preprocessing. To make TABMWP compatible with various baselines, we represent the tabular context in three formats: an image, semi-structured text, and a structured spreadsheet. The semi-structured format is created by converting the raw table text into a flattened token sequence, with rows separated by a newline character '\n' and columns separated by '|'. The semi-structured text is further transformed into the structured format, which can be easily retrieved and executed by SQL-based methods (Liu et al., 2022b) using packages like pandas. For clarity, the table title is separated from the raw table. Examples of the three formats are shown in Appendix A.1. For better quantitative evaluation, we formalize the TABMWP problems as two question types: (a) free-text questions, where the answer is numerical text only and the unit text is extracted separately; and (b) multi-choice questions, where the answer is a text span from the choice options. The order of choice options is shuffled to alleviate distribution bias. Redundant information in solutions is removed, and some solutions are manually rewritten to be more human-readable. Finally, problems with the same table, question, and answer text are regarded as redundant and removed. We further conduct quality control to ensure data quality, which is discussed in Appendix A.1.
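The flattening step described above can be sketched in a few lines; the helper name and the sample rows are illustrative, and we use ' | ' with spaces around the separator as in the dataset examples:

```python
def flatten_table(title, header, rows):
    """Convert a raw table into the semi-structured text format described above:
    cells joined by ' | ', rows joined by '\n'; the title is kept apart for clarity."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return title, "\n".join(lines)

title, text = flatten_table(
    "Beads per kilogram",
    ["Item", "Price"],
    [["spherical beads", "$3.42"], ["star-shaped beads", "$1.95"]],
)
assert text.count("\n") == 2      # header row plus two data rows
assert " | " in text.splitlines()[0]
```

The resulting text can then be parsed back into a structured object (e.g., a pandas DataFrame) for SQL-style methods such as TAPEX.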

2.3. DATASET STATISTICS

[Dataset comparison table (garbled in extraction): a comparison of TABMWP with related datasets, listing each dataset's size and properties such as question type, table context, and solution rationale format (formula or text). Recoverable entries include DRAW-1K (2017): 1,000; Math23K (2017): 23,162; MathQA (2019): 37,297; ASDiv (2020): 2,305; SVAMP (2021): 1,000; GSM8K (2021): 8,792; IconQA (2021b): 107,439; and FinQA (2021), whose size is truncated in the extraction.]

3.1. FEW-SHOT GPT-3 FOR TABMWP

Provided with a few in-context examples of math word problems as the context, GPT-3 can generate the answer for a test problem and shows impressive performance across different MWP datasets (Wei et al., 2022; Wang et al., 2022). Inspired by its success, we first build a strong baseline using few-shot GPT-3 on our TABMWP dataset. Specifically, a few training examples, along with the test example p_i, are provided to GPT-3 for answer prediction. Each training example consists of a table context t, a question q, options c that apply, and an answer a. To make the few-shot GPT-3 model work on TABMWP, we use the semi-structured format as the tabular context. Following Wei et al. (2022), a solution s can be prepended to the answer a to reveal the multi-step reasoning process, which boosts prediction performance.
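The prompt assembly described above might look roughly as follows; the exact field labels ("Table:", "Question:", ...) and the layout of the TQ(C)→SA template are our assumptions, not the paper's verbatim format:

```python
def build_prompt(examples, test_problem, with_solution=True):
    """Assemble a few-shot prompt in the TQ(C)->(S)A style described above.
    Each example dict may carry: table, question, choices, solution, answer.
    Field labels below are illustrative, not the authors' exact template."""
    blocks = []
    for ex in examples + [test_problem]:
        lines = [f"Table: {ex['table']}", f"Question: {ex['question']}"]
        if ex.get("choices"):
            lines.append("Options: " + ", ".join(ex["choices"]))
        if ex is not test_problem:
            if with_solution and ex.get("solution"):
                lines.append(f"Solution: {ex['solution']}")
            lines.append(f"Answer: {ex['answer']}")
        else:
            lines.append("Answer:")  # the model completes from here
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

example = {"table": "Item | Price\nbead | $3.42", "question": "Cost of 2 beads?",
           "solution": "Multiply: $3.42 x 2 = $6.84.", "answer": "6.84"}
test = {"table": "Item | Price\nslate | $0.59", "question": "Cost of 3 slates?"}
prompt = build_prompt([example], test)
assert prompt.endswith("Answer:")
```

Setting `with_solution=False` yields the plain few-shot prompt; keeping it yields the chain-of-thought variant in which the solution precedes the answer.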

3.2. DYNAMIC PROMPTING VIA POLICY GRADIENT

The in-context examples can be selected randomly (Wei et al., 2022; Wang et al., 2022) or via retrieval (Liu et al., 2022a) from the training set. Recent research, however, has shown that few-shot GPT-3 can be highly unstable across different selections and permutations of in-context examples (Zhao et al., 2021; Liu et al., 2022a; Lu et al., 2022c). This instability may be more severe on TABMWP, where examples are more distinct: they include both unstructured questions of various types and semi-structured tables in various layouts. To alleviate this issue, we propose a novel approach that learns to select well-performing in-context examples using a policy gradient strategy, without brute-force search or manually designed heuristics.

Formally, given a TABMWP problem $p_i$, the agent finds $K$ in-context examples $e_i = \{e_i^1, e_i^2, \ldots, e_i^K\}$ from a candidate pool $E_{\text{cand}}$ and generates the answer $\hat{a}_i$, aiming to maximize a reward $r_i = R(\hat{a}_i \mid p_i)$. The in-context examples are selected according to a policy

$$e_i^k \sim \pi_\theta(e_i \mid p_i), \quad e_i^k \in E_{\text{cand}},$$

where the $e_i^k$ are sampled independently for $k = 1, 2, \ldots, K$ and $\theta$ denotes the policy parameters. The answer is generated as $\hat{a}_i = \text{GPT-3}(e_i, p_i)$, using the selected examples and the given problem as the input prompt. The reward is then computed by evaluating the generated answer $\hat{a}_i$ against the ground-truth answer $a_i$:

$$r_i = R(\hat{a}_i \mid p_i) = \text{EVAL}(\hat{a}_i, a_i), \quad r_i \in \{-1, 1\}.$$

The function EVAL() returns a reward of 1 if the generated answer matches the label and -1 otherwise. Our goal is to maximize the expected reward of the generated answer under the policy, $\mathbb{E}_{e_i \sim \pi_\theta(e_i \mid p_i)}\left[R(\text{GPT-3}(e_i, p_i))\right]$. We optimize the reward with respect to the parameters of the policy network using the policy gradient method (Sutton et al., 1998).
The expected reward cannot be computed in closed form, so we compute an unbiased estimate with Monte Carlo sampling:

$$\mathbb{E}_{e_i \sim \pi_\theta(e_i \mid p_i)}\left[R(\text{GPT-3}(e_i, p_i))\right] \approx \frac{1}{N} \sum_{i=1}^{N} R(\text{GPT-3}(e_i, p_i)), \quad e_i \sim \pi_\theta(e_i \mid p_i),$$

where $N$ is the size of each batch yielded from our training problem set $P_{\text{train}}$. In this work, we use the REINFORCE policy gradient algorithm (Williams, 1992):

$$\nabla_\theta \, \mathbb{E}_{e_i \sim \pi_\theta(e_i \mid p_i)}\left[R(\text{GPT-3}(e_i, p_i))\right] = \mathbb{E}_{e_i \sim \pi_\theta(e_i \mid p_i)}\left[\nabla_\theta \log \pi_\theta(e_i \mid p_i) \, R(\text{GPT-3}(e_i, p_i))\right] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(e_i \mid p_i) \, R(\text{GPT-3}(e_i, p_i)).$$

Intuitively, if the predicted answer is correct, we update the policy so that the probability of selecting the same examples increases; otherwise, we update the policy to reduce the probability of selecting such poorly matched examples. The learning process is summarized in Algorithm 1 in the appendix. To obtain contextualized representations of the given problem and the candidate examples, we use the [CLS] token representation from BERT (Devlin et al., 2018) as the encoding. We add a small linear layer on top of BERT's final pooling layer, which allows the model to capture both the semantic similarity provided by pre-trained BERT and the hidden logical similarity shared among math problems. During training, the parameters of BERT are fixed and only the appended linear layer is updated; that is, $\theta$ consists of the learnable parameters $\mathbf{W}$ and $\mathbf{b}$:

$$h(e_i) = \mathbf{W}\,\text{BERT}(e_i) + \mathbf{b}, \qquad h(p_i) = \mathbf{W}\,\text{BERT}(p_i) + \mathbf{b},$$

$$\pi_\theta(e_i \mid p_i) = \frac{\exp\left[h(e_i) \cdot h(p_i)\right]}{\sum_{e_i' \in E_{\text{cand}}} \exp\left[h(e_i') \cdot h(p_i)\right]}. \tag{5}$$
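To make the REINFORCE update concrete, here is a minimal, self-contained sketch. It is a toy stand-in for PROMPTPG, not the paper's implementation: the BERT-based scorer h(e_i)·h(p_i) is replaced by one learnable logit per candidate, a single example is selected per step (K = 1), and the GPT-3 answer check is replaced by a stubbed reward function returning -1 or 1:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(theta, reward_fn, lr=0.1, rng=random):
    """One REINFORCE update for in-context example selection.
    theta[j] is a learnable logit for candidate j (a toy stand-in for the
    BERT-based scorer); reward_fn(e) stands in for EVAL(GPT-3(e, p), a)."""
    probs = softmax(theta)
    e = rng.choices(range(len(theta)), weights=probs)[0]  # e ~ pi_theta
    r = reward_fn(e)                                      # r in {-1, 1}
    # grad of log pi(e) wrt theta is onehot(e) - probs; ascend r * grad
    for j in range(len(theta)):
        theta[j] += lr * r * ((1.0 if j == e else 0.0) - probs[j])
    return e, r

random.seed(0)
theta = [0.0, 0.0, 0.0]
# pretend candidate 2 always yields a correct GPT-3 answer, the others never
for _ in range(300):
    reinforce_step(theta, lambda e: 1 if e == 2 else -1)
assert max(range(3), key=lambda j: theta[j]) == 2  # policy concentrates on candidate 2
```

The update matches the estimator above: correct answers raise the log-probability of the sampled selection, incorrect ones lower it. The real system backpropagates the same signal through the linear layer in Eq. (5) instead of per-candidate logits.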

4. EXPERIMENTS

4.1. EXPERIMENTAL SETTINGS

Baselines. We first develop two large language models, UnifiedQA (Khashabi et al., 2020) and TAPEX (Liu et al., 2022b), in both pre-trained and fine-tuned settings, as strong baselines on TABMWP. Different model sizes are included to examine performance across model capacities. We further implement the zero-shot GPT-3 model, the few-shot GPT-3 model, and their chain-of-thought (CoT) reasoning variants (Wei et al., 2022). We also study the heuristic guess baseline and human performance to analyze the lower and upper bounds on TABMWP, respectively.

Evaluation metric. The answer part is extracted from the GPT-3 generation using manually designed regular expressions. To evaluate the baselines and our method, we use accuracy to determine whether the generated answer is correct given the ground truth. For free-text problems, where the answer is a number, we normalize both the prediction and the label to decimal numbers with two-digit precision and check whether their values are equivalent. For multi-choice problems, we choose the option most similar to the generated answer, following Khashabi et al. (2020). More implementation details can be found in Appendix A.4.
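The normalization and matching described above might look roughly like this; the regular expression and the use of difflib for nearest-option matching are our assumptions, not the authors' exact implementation:

```python
import re
from difflib import SequenceMatcher

def normalize_number(text):
    """Extract the first number from generated text and round it to two decimals."""
    m = re.search(r"-?\d[\d,]*\.?\d*", text.replace("$", ""))
    return None if m is None else round(float(m.group().replace(",", "")), 2)

def score_prediction(prediction, label, choices=None):
    """Accuracy check sketched from the description above: numeric comparison for
    free-text answers, nearest-option matching for multi-choice answers."""
    if choices:
        # pick the option most similar to the raw generation
        best = max(choices, key=lambda c: SequenceMatcher(
            None, c.lower(), prediction.lower()).ratio())
        return best == label
    return normalize_number(prediction) == normalize_number(label)

assert score_prediction("She will spend $31.44.", "31.44")
assert score_prediction("yes, it is", "yes", choices=["yes", "no"])
```

Note that fractions such as "68/217" would need extra handling; the sketch covers only the integer and decimal cases discussed above.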

4.2. EXPERIMENTAL RESULTS

Table 3 presents the results of the baselines and our method on the TABMWP dataset. Benefiting from pre-training on a tabular corpus, the TAPEX baseline performs better on average than UnifiedQA of a similar model size, which is pre-trained only on unstructured textual data. Increasing the model size improves prediction accuracy for both UnifiedQA and TAPEX. Fine-tuned on TABMWP, the baseline models significantly improve prediction performance on the average and all aggregated accuracy metrics. Without any examples provided, zero-shot GPT-3 achieves accuracy comparable to the best fine-tuned baselines, UnifiedQA LARGE and TAPEX LARGE, showing surprisingly good generalization on TABMWP. Provided with two randomly sampled in-context examples as the prompt, few-shot GPT-3 gains a further 0.17%. By generating the multi-step solution before the answer, the few-shot-CoT GPT-3 model reports the best performance among these baselines, with an accuracy of 62.92%. Unlike few-shot-CoT GPT-3, which selects in-context examples randomly, our proposed PROMPTPG learns to select well-performing examples via policy gradient. PROMPTPG establishes a new state of the art on TABMWP: it surpasses the best baseline, few-shot-CoT GPT-3, by 5.31% on average, and it shows consistent advantages across the two question types, the two grade groups, and most answer types.

Heuristic guess and human performance. The accuracy of heuristic guess on multi-choice questions is 39.81%, which aligns with the fact that there are 2.88 options on average. The accuracy on free-text questions is considerably lower, since the inputs of TABMWP problems do not contain direct clues for the answers. Humans outperform all benchmarks consistently across question types, answer types, and grade groups, with a 21.99% average accuracy advantage over our best-performing PROMPTPG. This gap remains to be filled by future research on semi-structured mathematical reasoning.

Problem types and difficulty. Among all the baselines, we find that models answer multi-choice questions more easily than free-text questions. Questions with the Boolean (BOOL) and other (OTH) answer types tend to have lower accuracy than the extractive (EXTR) type, because the former require fact verification and language understanding over diverse options, respectively. It is also unsurprising that all models perform worse on problems in grades 7-8 than in the lower-level group of grades 1-6.

4.3. ABLATION STUDY

Here, we study how different factors affect the performance of the baselines and our method on TABMWP. Experiments are conducted on 1,000 development examples.

Blind study of the dataset. We evaluate the information gain of each component of the TABMWP problems by removing it from the model inputs. To eliminate the impact and variance caused by example selection, the study is conducted using the zero-shot GPT-3 model. As shown in Table 4, there is a dramatic decline when either the tabular context (T) or the question text (Q) is missing from the inputs. For example, T→A and Q→A attain an average accuracy of only 6.10% and 7.00%, respectively, and their accuracies are near zero on multi-choice questions. Taking both tabular and textual data as inputs (TQ→A), the model significantly beats the heuristic guess. With the complete input information (TQ(C)→A), the full model achieves the best performance. The blind study shows that TABMWP is reliably distributed and that all input components are indispensable in providing the information needed to answer the questions. The failure cases in Figures 13-18 suggest that PROMPTPG has limitations when solving problems with complex tabular contexts or problems requiring a high level of mathematical reasoning ability.

5. RELATED WORK

5.1. MATH WORD PROBLEMS

The task of solving Math Word Problems (MWPs) is to predict the answer given a natural language description of a math problem. There have been great efforts in developing datasets for MWPs, including Math23K (Wang et al., 2017), MathQA (Amini et al., 2019), ASDiv (Miao et al., 2020), SVAMP (Patel et al., 2021), and Lila (Mishra et al., 2022). However, these datasets involve only the textual modality, and most are limited to a small data scale. Some recent datasets, such as DVQA (Kafle et al., 2018), IconQA (Lu et al., 2021b), Geometry3K (Lu et al., 2021a), and UniGeo (Chen et al., 2022), introduce math problems with diagrams as the visual context, where the system needs to perform mathematical reasoning over multi-modal information. To the best of our knowledge, TABMWP is the first dataset that requires mathematical reasoning over heterogeneous information from both a textual question and a tabular context. To solve MWPs, one popular line of previous methods generates intermediate expressions and executes them to get the final answers (Huang et al., 2017; Roy & Roth, 2017; Amini et al., 2019). Inspired by the recent progress achieved by GPT-3 in solving MWPs (Wei et al., 2022; Wang et al., 2022; Kojima et al., 2022), we evaluate TABMWP using GPT-3 models in zero-shot and few-shot learning manners.

5.2. TABLE QUESTION ANSWERING

Datasets such as WikiTableQuestions (Pasupat & Liang, 2015), WikiSQL (Zhong et al., 2017), and SQA (Iyyer et al., 2017) contain semi-structured tables from Wikipedia, while Spider (Yu et al., 2018) collects structured tables sourced from databases. Recent work introduces datasets that require multi-hop reasoning between textual and tabular data: HybridQA (Chen et al., 2020b), OTT-QA (Chen et al., 2020a), MultiModalQA (Talmor et al., 2020), AIT-QA (Katsis et al., 2021), and FeTaQA (Nan et al., 2022).
Datasets most related to TABMWP are FinQA (Chen et al., 2021), TAT-QA (Zhu et al., 2021), and MultiHiertt (Zhao et al., 2022), which require numerical reasoning over financial reports with tabular data. Note, however, that 77.6% of the questions in TAT-QA can be solved without mathematical reasoning, and 50.0% of the questions in FinQA do not require the table to be answered. In contrast, our proposed TABMWP collects questions where both mathematical reasoning and the tabular context are necessary.

5.3. PROMPT LEARNING FOR LANGUAGE MODELS

Large pre-trained language models such as GPT-3 (Brown et al., 2020) have shown remarkable few-shot learning ability on a wide range of downstream tasks (Houlsby et al., 2019; Brown et al., 2020; Ma et al., 2022; Lu et al., 2022a). Given a few in-context examples as demonstrations, GPT-3 can generalize to unseen test examples without any parameter updates. For example, Wei et al. (2022) randomly select in-context examples from the training set and formulate the corresponding prompt with a test sample. However, recent studies show that few-shot GPT-3 depends heavily on the selection of in-context examples and can be unstable, varying from near-chance to near state-of-the-art performance (Zhao et al., 2021; Liu et al., 2022a; Lu et al., 2022b). To mitigate this volatility, Liu et al. (2022a) propose retrieving relevant examples that are semantically similar to the test sample. Other possible strategies include brute-force permutation search or manually designed heuristics such as choosing the most complex examples. Inspired by reinforcement learning's ability to search for an optimal action policy, we propose applying the policy gradient strategy (Sutton et al., 1998) to learn to select in-context examples more efficiently and stably, without manually designed heuristics.

6. CONCLUSION

In this paper, we propose TABMWP, the first large-scale dataset for math word problems in tabular contexts. TABMWP contains 38,431 open-domain problems with two question types and five answer types, and each problem is annotated with a multi-step solution. We evaluate TABMWP using state-of-the-art QA and TableQA methods in both pre-trained and fine-tuned settings, as well as the large pre-trained language model GPT-3. We further propose a novel approach, PROMPTPG, for few-shot GPT-3, which utilizes policy gradient to learn to select in-context examples from the training data and construct a well-performing prompt for each test example. Experimental results show that PROMPTPG outperforms existing strong baselines by a large margin of 5.31% and reduces accuracy volatility compared to random selection. To the best of our knowledge, this is the first work to apply reinforcement learning to the selection of in-context examples for the few-shot GPT-3 model.

A APPENDIX

A.1 DATASET COLLECTION

The raw problems are collected from an online learning website, IXL, which hosts a large number of high-quality math problems curated by educational experts.

Quality control. The goal of constructing TABMWP is to collect math word problems that necessitate multi-hop mathematical reasoning between the question and the tabular context. Therefore, we ask human experts to filter out problems that can be solved either without the table context or by looking up table cells without numerical reasoning. To further ensure data quality, we ask human experts to perform a final review of the dataset and manually revise incorrect annotations.

Question type | Answer type (%) | Description
Free-text | Integer (59.50%) | The answer is an integer number, e.g., "40", "1,207", "-3".
Free-text | Decimal (15.23%) | The answer is a decimal or a fraction number, e.g., "192.80", "68/217".
Multi-choice | Extractive (13.01%) | The answer can be extracted from the table context.
Multi-choice | Boolean (10.97%) | The answer is Boolean, e.g., "yes"/"no", "true"/"false", "linear"/"nonlinear".
Multi-choice | Other (1.29%) | The answer belongs to other text types, e.g., a statement.

Table 7: Three different formats for the tables in the TABMWP dataset.

A.2 HUMAN STUDY

To examine how humans perform on TABMWP, we released a human evaluation task on Amazon Mechanical Turk (AMT) for the test split. We designed two sub-tasks for the human study: answering the free-text questions and answering the multi-choice questions. The user interfaces for the two sub-tasks are shown in Figure 4. Each human intelligence task (HIT) contains 5 exam questions and 15 test questions. A worker must have a HIT approval rate of 98% or higher and at least 5,000 approved HITs. Workers are provided with detailed instructions at the beginning and must pass at least 3 free-text exam questions or 4 multi-choice exam questions to qualify for the human study. Each HIT is assigned to two different workers. We assign a reward of $0.80 and $0.60 per HIT for the free-text and multi-choice sub-tasks, respectively.

A.3 THE PROMPTPG ALGORITHM

The pipeline of PROMPTPG to learn to select in-context examples is summarized in Algorithm 1.

A.4 IMPLEMENTATION DETAILS

Heuristic guess. To investigate the lower bound of accuracy on TABMWP, we design simple heuristics to guess answers for each question type. For multi-choice questions, we randomly select one of the given options with even probability. For free-text questions, the answers can only be integer or decimal numbers; we therefore use regular expressions to extract all numbers from the tabular context and the question text as candidates, and then randomly choose one of them as the prediction.

UnifiedQA baselines. UnifiedQA (Khashabi et al., 2020) is a T5-based (Raffel et al., 2020) QA system pre-trained on 8 seed QA datasets of multiple formats under a unified text-to-text paradigm. We load the pre-trained checkpoint as the pre-trained baseline and train it on TABMWP as the fine-tuned baseline. Three parameter sizes are compared: SMALL (60M), BASE (220M), and LARGE (770M).

TAPEX baselines. TAPEX (Liu et al., 2022b) is a BART-based (Lewis et al., 2020) language model pre-trained on structured tabular data to mimic the behavior of a SQL executor that can answer table-based questions. TAPEX shows state-of-the-art performance on four table-related datasets. We establish pre-trained and fine-tuned baselines on top of TAPEX with two model sizes: BASE (140M) and LARGE (400M).

Zero-shot GPT-3 and zero-shot-CoT GPT-3. We establish the zero-shot baseline based on GPT-3 (Brown et al., 2020). The zero-shot setup follows the format TQ(C)→A, where the input is the concatenation of tokens of the tabular context (T), the question text (Q), and choice options (C) that apply, and the output is the predicted answer (A). Following Kojima et al.
(2022), we further build zero-shot-CoT GPT-3, which refers to the GPT-3 model with a chain-of-thought (CoT) prompt. Specifically, we add the prompt "Let's think step by step" at the end of the input to ask the model to generate the multi-step solution (S), mimicking the human reasoning process. The model then takes the raw input and the newly generated solution to predict the final answer.

Few-shot GPT-3 and few-shot-CoT GPT-3. In the few-shot setting, we follow the standard prompting of Wei et al. (2022), where in-context examples are randomly selected from the training data as demonstrations for the test example. Similarly, the few-shot-CoT GPT-3 baseline takes the prompt template TQ(C)→SA to generate the solution before the final answer.

Experimental details. Our experiments for the UnifiedQA baselines, the TAPEX baselines, and our proposed PROMPTPG are conducted with PyTorch on two Nvidia RTX 3090 GPUs. For fine-tuning the UnifiedQA and TAPEX baselines, we use the Adam optimizer (Kingma & Ba, 2014) with an initial learning rate of 5e-5. The training process takes 10 epochs with a batch size of 16. The maximum number of input tokens is set to 200 and the maximum output length to 100. In our proposed PROMPTPG, the embedding size of the added linear neural network is 768. To learn the policy network, we use the Adam optimizer with an initial learning rate of 1e-3. The maximum number of training epochs is 30, with a batch size of 20. The training process is stopped early if there is any NaN value in the loss for a batch of training data. For the GPT-3 engine, we use TEXT-DAVINCI-002, the most capable engine recommended by the official documentation. The temperature is set to 0 and the top probability to 1.0 to obtain the most deterministic predictions. The maximum number of tokens allowed for generation is 512. Both the frequency penalty and the presence penalty are set to the default value, i.e., 0.
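The heuristic-guess baseline described at the start of this section can be sketched as follows; the regular expression and the helper name are illustrative rather than the authors' exact implementation:

```python
import random
import re

def heuristic_guess(table_text, question, choices=None, rng=random):
    """Lower-bound guess: pick a random option for multi-choice questions, or a
    random number extracted from the table and question text otherwise."""
    if choices:
        return rng.choice(choices)
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", table_text + " " + question)
    return rng.choice(numbers) if numbers else "0"

rng = random.Random(42)
guess = heuristic_guess("bead | $3.42\nstar | $1.95",
                        "How much do 5 kilograms cost?", rng=rng)
assert guess in {"3.42", "1.95", "5"}
```

Averaged over many problems, this strategy recovers the reported multi-choice guess rate (about 1 / 2.88 options) while remaining near zero on free-text questions.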
Figure 13 : The wrong prediction from our PROMPTPG for a free-text question example. Our model retrieves the wrong price for the rose quartz, thus calculating the wrong cost sum of three items.



The data and code are available at https://promptpg.github.io. Work was partially done while Pan Lu was an intern at the Allen Institute for AI (AI2). IXL: https://www.ixl.com/math



Figure 2: Our proposed PROMPTPG is able to learn to select performing in-context examples via policy gradient when interacting with the GPT-3 API, without any manually designed heuristics.

(b) We propose a novel approach, PROMPTPG, which learns the prompt dynamically via policy gradient to select in-context examples for few-shot GPT-3; to the best of our knowledge, it is the first work that applies reinforcement learning to select in-context examples for the few-shot GPT-3 model. (c) Experimental results show that PROMPTPG achieves an improvement of up to 5.31% on TABMWP over existing methods, with reduced selection instability compared to random selection.

Implementation details. Fine-tuned UnifiedQA and TAPEX baselines are trained on the train split and evaluated on the test split. Few-shot GPT-3 and few-shot-CoT GPT-3 randomly select two in-context examples from the training data to build the prompt. Our PROMPTPG is built on top of few-shot GPT-3 with a different selection strategy: (a) in the training stage, the agent learns to select two examples from 20 candidates and is evaluated on 160 training examples to calculate the reward; (b) in the test stage, the agent with the learned optimal policy chooses two examples from the 20 candidates for each test example. The candidates are randomly selected from the training set. Experiments for the two few-shot GPT-3 baselines and our PROMPTPG are repeated three times, and the average accuracy is reported in Table 3.

Blind studies on TABMWP. T: tabular context; Q: question; C: choice options; A: answer. Q(C) means choice options come after the question in the input, while Q refers to the question only.

Number of training examples. We study the effect of different numbers of training examples on our dynamic prompt learning in Figure 3 (a). With more training examples, the prediction accuracy first gradually increases, peaking at around 160 training examples. After that, the accuracy drops with growing variance. We reckon this is because the policy gradient algorithm benefits from more training data up to a point, but fails to exploit additional examples efficiently.

Accuracy w.r.t. different numbers of training examples, given 20 candidate examples.

water balloon toss | 11:30 A.M. | 11:50 A.M.
obstacle course | 12:05 P.M. | 12:25 P.M.
parachute ball toss | 12:30 P.M. | 1:30 P.M.
jump rope race | 1:40 P.M. | 2:05 P.M.
balloon stomp | 2:15 P.M. | 2:35 P.M.
relay race | 2:50 P.M. | 3:40 P.M.
hula hoop contest | 3:55 P.M. | 4:30 P.M.

Figure 4: User interfaces of human study for free-text and multi-choice questions.

Figure 5: Two in-context examples selected by PROMPTPG, the prompt, and the correct prediction. The selected examples require mathematical reasoning abilities similar to those of the test example.

Figure 10: The correct prediction from our PROMPTPG for a free-text example. In this example, the model is asked to understand a hierarchical tax report and calculate the pay after taxes.

Figure 11: The correct prediction from our PROMPTPG for a multi-choice question. There are 9 rows and 6 columns in the given tabular context. Our model successfully locates the target cells in the table and performs multi-hop reasoning to predict the correct answer.

Figure 12: The correct prediction from our PROMPTPG for a multi-choice question with Boolean options. It needs to compare the budget and the total costs to verify if Ariana has enough money.

Table: yellow tiger's eye | $0.85
piece of green slate | $0.59
piece of red sandstone | $0.19
piece of rose quartz | $0.61
smooth piece of marble | $0.45
Question: How much money does Connor need to buy a piece of rose quartz, a piece of green slate, and a piece of red sandstone? (unit: $)
Answer: (Step 1) Connor needs to buy a piece of rose quartz, a piece of green slate, and a piece of red sandstone. To find the total amount of money Connor needs, add the prices of the three items. (Step 2) $0.85 + $0.59 + $0.19 = $1.63 (Step 3) Connor needs $1.63 to buy a piece of rose quartz, a piece of green slate, and a piece of red sandstone. The answer is 1.63.
Output: 1.63
Ground truth: 1.39
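The retrieval error behind Figure 13 can be verified with simple arithmetic. This is a sketch using the prices from the table above; the variable names are ours.

```python
# Prices from the tabular context of this example.
prices = {
    "yellow tiger's eye": 0.85,
    "piece of green slate": 0.59,
    "piece of red sandstone": 0.19,
    "piece of rose quartz": 0.61,
    "smooth piece of marble": 0.45,
}

wanted = ["piece of rose quartz", "piece of green slate",
          "piece of red sandstone"]
correct = round(sum(prices[item] for item in wanted), 2)
# 0.61 + 0.59 + 0.19 = 1.39 (the ground truth)

# The model instead looked up the tiger's eye price for the rose quartz:
mistaken = round(prices["yellow tiger's eye"]
                 + prices["piece of green slate"]
                 + prices["piece of red sandstone"], 2)
# 0.85 + 0.59 + 0.19 = 1.63 (the model's output)
```

The intermediate arithmetic in the generated solution is correct; the mistake is purely in cell selection, i.e., aligning the item name to the right table row.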

Key statistics for TABMWP.

A comparison of MWP and Table QA datasets that require numerical reasoning. text*: each table in TABMWP is accompanied by an image format.

Few-shot-CoT (2-shot) | 160+20 | Dynamic | 66.17 | 74.11 | 64.12 | 74.16 | 76.19 | 72.81 | 65.71 | 71.20 | 64.27 | 68.23 | 5.31↑

Evaluation results of various baselines and our method on TABMWP. Training Data: number of used training data; Selection Strategy: strategy of selecting in-context examples for few-shot GPT-3; FREE: free-text questions; MC: multi-choice questions; INT: integer answers; DEC: decimal answers; EXTR: extractive text answers; BOOL: Boolean text answers; OTH: other text answers.

Number of candidate examples. In Figure 3 (b), we investigate how the number of candidate examples affects policy learning performance. As the number of candidates increases, the prediction accuracy first goes up and then drops after a threshold, given 80 or 160 training examples. This is probably because when the candidate pool is too small, the policy gradient algorithm has a limited action space and cannot explore enough problem types. In contrast, too many candidates make it hard for the algorithm to learn an optimal policy in a large search space.

Evaluation results of different selection strategies with three trials.

Table Question Answering (Table QA) refers to the task of answering questions about tabular data. Numerous datasets have been developed for Table QA. For example, TabMCQ (Jauhar et al., 2016) is an early dataset collected from grade exams. Datasets like WTQ

Format diversity of questions and answers in TABMWP.

Field day schedule

Event | Begin | End


Three different formats for the tables in the TABMWP dataset.

Experimental settings and raw accuracy results of random selection and our PROMPTPG for the few-shot GPT-3 model on the TABMWP test split. For each setting, we repeat the experiment with the same set of three different random seeds.

Results of different numbers of few-shot examples on 1,000 development examples.

Number of few-shot examples. We study the few-shot-CoT GPT-3 model with random selection in terms of different numbers of in-context shots. For each number of in-context shots, the experiment is conducted on 1,000 development examples and repeated three times. The results are shown in Table 9. When increasing the number of in-context shots from the current 2 to 4, the few-shot-CoT GPT-3 model reduces the prediction variance caused by the random selection of in-context shots and achieves an accuracy improvement of 2.5%. When the number of in-context shots is increased to 5, the model with random selection does not gain further benefits. Our PromptPG displays impressive advantages over random selection in terms of data efficiency and prediction accuracy. With only two in-context shots, PromptPG achieves the highest accuracy of 70.9% and a comparably low deviation compared to random selection with more shots.

Table: science-fiction book | $4.31
mystery novel | $8.26
crossword puzzle book | $8.74
geography book | $8.61
coloring book | $8.08
paperback book | $8.45
Question: Ariana has $16.50. Does she have enough to buy a paperback book and a mystery novel? Options: (A) yes (B) no
Answer: (Step 1) To answer this question, look at the table to find the price of a paperback book and the price of a mystery novel. (Step 2) A paperback book costs $8.45 and a mystery novel costs $8.26. (Step 3) $8.45 + $8.26 = $16.71 (Step 4) Since $16.71 is greater than $16.50, Ariana does not have enough money to buy a paperback book and a mystery novel. The answer is no.
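The Boolean comparison in this example reduces to one addition and one inequality. A quick check (a sketch; prices taken from the table above):

```python
budget = 16.50
# Paperback book plus mystery novel, per the table above.
total = round(8.45 + 8.26, 2)
enough = total <= budget
# total is 16.71, which exceeds the $16.50 budget, so the answer is "no".
```

Questions of this type still require two table look-ups before the arithmetic, which is why they are harder than the single-step comparison suggests.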

ACKNOWLEDGEMENTS

We would like to thank Zhou Yu and Jiuxiang Gu for insightful discussions on dataset collection. We thank Muhao Chen and Yao Fu for constructive suggestions in developing baselines and experiments. The work does not relate to Liang Qiu's position at Amazon Alexa.

APPENDIX

Published as a conference paper at ICLR 2023

A.6 RELATED WORK OF POLICY GRADIENT

Policy gradient is an approach to solving reinforcement learning problems that targets modeling and optimizing the policy directly. Many policy gradient algorithms have been proposed in the past decade (Silver et al., 2014; Lillicrap et al., 2015; Mnih et al., 2016; Schulman et al., 2017; Barth-Maron et al., 2018). They have been proven effective in areas like robotics (Peters & Schaal, 2006) and chatbots (Kandasamy et al., 2017). In recent work on aligning language models with human values (Ouyang et al., 2022; Qiu et al., 2022; Glaese et al., 2022), policy gradient has been used to optimize language models with rewards learned from human feedback and preferences. To the best of our knowledge, our PromptPG is the first work that selects prompts dynamically for large pre-trained language models in the mathematical reasoning field.

Question: At Josie's birthday party, children and parents celebrated by participating in some activities. How many more children than parents are playing tag? (unit: children)
Answer: (Step 1) For the activity "Playing tag", the stem is 4, and the leaf is 9. Find the row where the stem is 4. In that row, count all the leaves equal to 9. (Step 2) You counted 9 leaves, which are blue in the stem-and-leaf plot above. 9 children are playing tag. The answer is 9.
Output: 9
Ground truth: 5

Figure 9: The correct prediction from our PROMPTPG for a free-text question example. This example requires looking up three items and their corresponding prices in the table, calculating their costs, and finally summing them up to get the final answer.

Question: A farmer counted the number of apples on each tree in her orchard. How many trees have at least 51 apples but fewer than 83 apples? (unit: trees)
Answer: (Step 1) To solve this problem, look at the table and find all the trees that have at least 51 apples but fewer than 83 apples. (Step 2) There are three trees that have at least 51 apples but fewer than 83 apples. The answer is 3.
Output: 3
Ground truth: 7

Figure 14: The wrong prediction from our PROMPTPG for a free-text question example. In this example, the problem is provided with an abstract stem-and-leaf table. Our model fails to understand this domain-specific table and lacks the high-level logical reasoning ability needed to get the correct answer.

Question: A garden supply store kept track of how many lawn mowers it sold in the past 7 days. What is the median of the numbers?
Answer: (Step 1) The median is the "middle" value. To find the median, put the numbers in order from least to greatest and find the number in the middle. (Step 2) 8, 7, 5, 7, 9, 8, 8 (Step 3) The median is 7. The answer is 7.
Output: 7
Ground truth: 8

Figure 15: The wrong prediction from our PROMPTPG for a free-text question example. It seems that our model has a limited ability to order numbers in the mathematical domain. This issue could be alleviated by adding human-designed rules or by developing an additional module to extract the answer from the prediction more accurately in various cases.
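For reference, the basic REINFORCE estimator underlying the policy gradient methods surveyed in A.6 can be written as follows, where $\theta$ parameterizes the selection policy $\pi_\theta$, $x$ is the test problem, $a$ is the chosen set of in-context examples, and $r(a)$ is the reward from querying GPT-3 with the resulting prompt (our notation, not necessarily the paper's):

```latex
\nabla_\theta \mathcal{J}(\theta)
  = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid x)}
    \big[\, r(a) \, \nabla_\theta \log \pi_\theta(a \mid x) \,\big]
```

In practice the expectation is approximated by sampling a small number of candidate selections per training problem and averaging the resulting gradient estimates.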

