INVERSELY ELICITING NUMERICAL REASONING IN LANGUAGE MODELS VIA SOLVING LINEAR SYSTEMS

Abstract

Numerical reasoning over natural language has been a long-standing goal for the research community. However, recent language models struggle to generalize reliably to a broad range of numbers, even though they have shown proficiency in reasoning over common, simple numbers. In this paper, we propose a novel method to elicit and exploit the numerical reasoning knowledge hidden in pre-trained language models using simple anchor numbers. Concretely, we first leverage simple numbers as anchors to probe the arithmetic expressions implicitly inferred by language models, and then explicitly apply those expressions to the complex numbers to obtain the corresponding answers. To inversely elicit arithmetic expressions, we transform and formulate the task as an analytically solvable linear system. Experimental results on several numerical reasoning benchmarks demonstrate that our approach is highly effective. More importantly, our approach works in the inference phase without extra model training, making it highly portable and achieving significant and consistent performance benefits across a variety of language models in zero-shot, few-shot, and fine-tuning scenarios.

1. INTRODUCTION

Language Models (LMs) have demonstrated great success on a wide range of natural language tasks (Devlin et al., 2018; Brown et al., 2020b; Chowdhery et al., 2022), and recent work even explores using LMs as a general-purpose interface for diverse modalities (Hao et al., 2022; Xie et al., 2022; He et al., 2022). But when it comes to reasoning about numbers, which are crucial components of text, tables, and knowledge bases, the performance of LMs slumps. A key remaining challenge in numerical reasoning is number calculation. Even the rational numbers, a small subset of the reals, constitute an infinite space that cannot be completely covered by pre-training corpora, posing a significant obstacle to LMs. Recent works have shown strong context-understanding capabilities of LMs on numerical reasoning datasets (Dua et al., 2019; Cobbe et al., 2021), but LMs are still far from robust at implicit numerical calculation: as numbers grow bigger and more complex, LMs are more likely to fail, e.g., on 8,534.5 + 17.85; and even for small-number additions, e.g., 512 + 128 and 513 + 129, LMs are not stable enough to produce the correct answer consistently. Similar observations are reported by Razeghi et al. (2022), showing that end-to-end LMs easily fail to calculate with numbers that rarely appear in pre-training corpora. Fortunately, viewed from the opposite direction, this observation has a positive side: given the exact same context, LMs are significantly more accurate and stable on simple numbers (typically small integers that appear frequently in pre-training corpora) than on complex numbers, indicating that LMs acquire a strong capability of producing arithmetic results for simple numbers during pre-training. This motivates us to leverage simple numbers as "anchors" to probe the arithmetic expressions implicitly inferred by language models, and then explicitly apply those expressions to complex numbers.
Specifically, as Figure 1 illustrates, upon detecting complex numbers (10,477 and 7,459) that are challenging for LMs, our method first replaces them with anchor numbers (e.g., 10 and 7) and lets the LM output answers (e.g., 3), which are much more accurate than those for complex numbers; it then inversely elicits the hidden arithmetic relationship (x1 - x2) implicitly inferred by the LM from the anchor inputs/outputs (10, 7, 3); and finally it explicitly applies that arithmetic to the original complex numbers (10,477 and 7,459) to produce the precise answer (3,018). In this way, our method combines the strengths of LMs at understanding complex context and at handling simple numbers for reliable numerical reasoning.
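As a minimal sketch, the four steps can be mimicked in code. The LM call is mocked here, and the small candidate-expression set is an illustrative simplification of the inversion step (SOLIS actually recovers the expression by solving a linear system); the function names are our own:

```python
def mock_lm(x1, x2):
    """Stand-in for an LM: reliable on small anchor numbers,
    unreliable otherwise (assumption for this sketch)."""
    if x1 < 100 and x2 < 100:
        return x1 - x2  # LMs answer simple instances correctly
    return None  # complex numbers: the LM's output cannot be trusted

def solve_with_anchors(x1, x2):
    # 1. Operand proposal: complex operands detected in the passage.
    # 2. Number substitution: replace them with simple anchors and
    #    query the LM on each substituted passage.
    anchors = [(10, 7), (20, 5), (9, 4)]
    outputs = [mock_lm(a, b) for a, b in anchors]
    # 3. Arithmetic relationship inversion: find the expression that is
    #    consistent with every anchor input/output pair.
    candidates = {
        "x1 - x2": lambda a, b: a - b,
        "x1 + x2": lambda a, b: a + b,
    }
    for name, f in candidates.items():
        if all(f(a, b) == y for (a, b), y in zip(anchors, outputs)):
            # 4. Apply the recovered expression to the original numbers.
            return name, f(x1, x2)
    return None, None

expr, answer = solve_with_anchors(10477, 7459)
print(expr, answer)  # x1 - x2 3018
```

Multiple anchor probes are needed because a single input/output pair can be consistent with several expressions; adding probes disambiguates.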

Figure 1: The illustration of our proposed framework, which elicits numerical reasoning in language models via Solving Linear Systems (SOLIS). The figure walks through the census example ("As of the census of 2000, there were 25,764 people, 10,477 households, and 7,459 families residing in the county. Question: How many more households are there than families?"): (1) operand proposal extracts the candidate numbers 10,477 (x1) and 7,459 (x2); (2) number substitution replaces them with anchors 10 and 7, which the reasoning LM answers correctly (3); (3) arithmetic relationship inversion recovers y = x1 - x2; the solving algorithm then yields the answer 3,018.

To inversely elicit arithmetic relationships in LMs through anchor numbers, we propose SOLIS, a novel method that transforms and formulates this problem as a linear system that can be solved straightforwardly in an analytic way. Alternative search-based and heuristic-based methods are further devised to promote robustness for noisy linear systems. Experimental results show significant and consistent gains over various language models in diverse zero-shot, few-shot, and fine-tuning settings on several representative numerical reasoning datasets.
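The linear-system view can be sketched as follows. We assume for illustration a linear hypothesis y = w1*x1 + w2*x2 + b over the operands (a simplification of the paper's formulation), query a mocked LM on several anchor pairs, and solve the resulting system analytically by least squares:

```python
import numpy as np

def mock_lm(x1, x2):
    # Stand-in for querying the LM on an anchor-substituted passage
    # (assumption: a real system parses the LM's textual answer).
    return x1 - x2

# Anchor probes: simple numbers the LM answers reliably.
anchors = [(10, 7), (20, 5), (9, 4), (15, 15)]
A = np.array([[x1, x2, 1.0] for x1, x2 in anchors])  # rows: [x1, x2, bias]
y = np.array([mock_lm(x1, x2) for x1, x2 in anchors], dtype=float)

# Solve the linear system A w = y: the least-squares solution recovers
# w = (1, -1, 0), i.e. the hidden relationship y = x1 - x2.
w, *_ = np.linalg.lstsq(A, y, rcond=None)
w = np.rint(w).astype(int)  # snap near-integer coefficients to integers

# Apply the elicited expression to the original complex numbers.
x1, x2 = 10477, 7459
answer = w[0] * x1 + w[1] * x2 + w[2]
print(list(w), answer)
```

Using more anchor pairs than unknowns over-determines the system, which is what lets the search- and heuristic-based variants tolerate occasional noisy LM outputs.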

2. PRELIMINARY STUDY

In this section, we first demonstrate the brittleness of language models on arithmetic-related tasks. Unlike arithmetic benchmarks such as AddSub or MultiArith (Roy & Roth, 2015), which wrap each sample in natural language context, we directly generate arithmetic expressions and feed them to the language models, reducing potential perturbing factors and isolating the models' calculation ability. We impose constraints on the complexity of the expressions: we study only the four fundamental operations and allow no more than 4 operands, where each operand's integer part is less than 10,000 and its floating-point precision is at most 4 decimal places. To conduct a systematic investigation, we first produce F, the set of all expression forms satisfying our constraints, and randomly sample operands within the limits of range and precision. For one expression f ∈ F with a specified range and precision, we randomly generate 50 samples. We evaluate the language model on these samples and denote this synthesized task MathExp, for Math Expressions. We sample a maximum of 50 expressions for each complexity setting and test them with the large-scale language model GPT-3 (Brown et al., 2020a). We conduct the study on GPT-3 in a few-shot manner: to unleash its potential, we prepend 10 to 20 expressions (having the same f, integer range, and floating-point precision as the tested sample) together with their answers as the prompt. We then call the OpenAI API¹ to obtain all the predictions and evaluate performance accordingly. Results in Figure 2 indicate that even the latest powerful GPT-3 (Code-Davinci-002) fails to achieve satisfactory performance: (i) prediction accuracy decreases sharply as the numbers get more complex, i.e., integer range or floating point
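A minimal sketch of how such MathExp items could be generated under the stated constraints. The function names and the exact sampling scheme are our assumptions for illustration, not the paper's released code:

```python
import random

OPS = ["+", "-", "*", "/"]  # the four fundamental operations

def sample_expression(n_operands=4, int_range=10_000, precision=2, rng=None):
    """Sample one MathExp item: 2 to `n_operands` operands combined with
    the four basic operations; operands below `int_range` with `precision`
    decimal places. Operands start at 1 to avoid division by zero."""
    rng = rng or random.Random(0)
    k = rng.randint(2, n_operands)
    operands = [round(rng.uniform(1, int_range), precision) for _ in range(k)]
    ops = [rng.choice(OPS) for _ in range(k - 1)]
    expr = str(operands[0])
    for op, x in zip(ops, operands[1:]):
        expr += f" {op} {x}"
    return expr, eval(expr)  # ground truth via direct evaluation

def accuracy(model, samples):
    """Fraction of items the model answers within a small tolerance."""
    hits = sum(abs(model(e) - y) < 1e-6 for e, y in samples)
    return hits / len(samples)

rng = random.Random(42)
samples = [sample_expression(rng=rng) for _ in range(50)]

def oracle(expr):
    return eval(expr)  # a perfect calculator scores 1.0

print(accuracy(oracle, samples))
```

Evaluating an actual LM would replace `oracle` with a function that prompts the model with the few-shot examples described above and parses its textual answer.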



¹ https://openai.com/api


Figure 2: Performance with different floating point precision (left) and integer range (right).

This paper introduces a new idea: eliciting and exploiting the numerical reasoning knowledge hidden in pre-trained LMs by probing with simple anchor numbers. Importantly, our framework requires no additional model training or labeled data, because it works purely at test-time inference, and it is portable to all existing fine-tuned/few-shot/zero-shot LMs with decoders. It thus differs significantly from existing neural-symbolic methods that require continuous training (Liang et al., 2016), and from program synthesis from examples, which requires specific, human-provided input-output pairs for each example at inference time (Gulwani, 2011).

