ISARSTEP: A BENCHMARK FOR HIGH-LEVEL MATHEMATICAL REASONING

Abstract

A well-defined benchmark is essential for measuring and accelerating research progress of machine learning models. In this paper, we present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. We build a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover. The dataset has a broad coverage of undergraduate and research-level mathematical and computer science theorems. In our defined task, a model is required to fill in a missing intermediate proposition given surrounding proofs. This task provides a starting point for the long-term goal of having machines generate human-readable proofs automatically. Our experiments and analysis reveal that while the task is challenging, neural models can capture non-trivial mathematical reasoning. We further design a hierarchical transformer that outperforms the transformer baseline. The dataset and models are available from: https://github.com/Wenda302/IsarStep.

1. INTRODUCTION

Neural networks have achieved outstanding performance on a wide range of problems in natural language processing, computer vision, and speech recognition. However, research investigating their capacity for mathematical reasoning is still limited, with earlier attempts focusing on simple arithmetic tasks like integer addition and multiplication (Zaremba & Sutskever, 2014; Kaiser & Sutskever, 2016; Trask et al., 2018). More recently, there has been work on solving school-level mathematical problems (Saxton et al., 2019), logical reasoning (Evans et al., 2018), and problems of function integration, ordinary differential equations (Lample & Charton, 2020), and properties of differential systems (Charton et al., 2020). While these are valuable contributions to the machine learning community, they focused on generating answers to questions from a specific domain and were carried out on synthetic datasets with small vocabularies (e.g. up to 100 unique tokens).

In this paper, we consider general undergraduate and research-level mathematical proofs as a target for neural networks. When humans prove a theorem, a crucial step is to propose an intermediate proposition to bridge the gap between the goal and the currently known facts. This step requires complicated reasoning capabilities such as creative thinking, inference, understanding existing conditions, and symbolic manipulation of rules. For example, consider the following proof of the irrationality of √2:

Proof of irrationality of √2. Assume √2 is rational. Then there exists a pair of coprime integers a and b such that √2 = a/b, and it follows that 2 = a²/b² and then 2b² = a². Hence a² is even, and so a is even. Thus there exists an integer c such that a = 2c, which combined with 2b² = a² yields 2c² = b²: hence b is also even. So a and b are both even although they are coprime, contradiction.

To derive ∃c ∈ Z. a = 2c from 2b² = a², the intermediate proposition "a is even" would reduce the gap and lead to a successful proof. We would like to simulate the way humans prove theorems by proposing an intermediate proposition synthesis task: IsarStep. Instead of having primitive steps like 3 + 5 = 8, the proof steps in IsarStep are at a higher level, taking much larger steps as basic. The task therefore usually cannot be solved by simple pattern matching and rewriting. To succeed in this task, a model is required to learn the meaning of important mathematical concepts (e.g. the determinant in linear algebra, the residue in complex analysis), how they are related to each other through theorems, and how they are utilised in proof derivations. Solving the IsarStep task will potentially help improve the automation of theorem provers, because proposing a valid intermediate proposition can reduce their search space significantly. It is also a first step towards the long-term goal of sketching complete human-readable proofs automatically.

We have built the IsarStep dataset by mining arguably the largest publicly hosted repository of mechanised proofs: the Archive of Formal Proofs (AFP).¹ The AFP is checked by the Isabelle proof assistant (Paulson, 1994) and contains 143K lemmas. Combining the AFP with the standard library of Isabelle/HOL yields a dataset of 204K formally proved lemmas. The dataset covers a broad spectrum of subjects, including foundational logic (e.g. Gödel's incompleteness theorems), advanced analysis (e.g. the Prime Number Theorem), computer algebra, cryptographic frameworks, and various data structures. A nice property of the mined formal proofs is that they are mostly declarative proofs, a proof style very close to human prose proofs.² Fig. 1 illustrates the proof of irrationality of √2 in Isabelle.
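For reference, the chain of inferences in the irrationality proof above, with the gap-bridging intermediate proposition made explicit, can be summarised as follows (our notation, matching the prose):

```latex
\sqrt{2} = \frac{a}{b},\ \gcd(a,b) = 1
\;\Longrightarrow\; 2b^2 = a^2
\;\Longrightarrow\; 2 \mid a^2
\;\Longrightarrow\; \underbrace{2 \mid a}_{\text{intermediate proposition}}
\;\Longrightarrow\; \exists c \in \mathbb{Z}.\; a = 2c
\;\Longrightarrow\; 2b^2 = 4c^2
\;\Longrightarrow\; b^2 = 2c^2
\;\Longrightarrow\; 2 \mid b
```

The step from 2 ∣ a² to 2 ∣ a is precisely the kind of intermediate proposition ("a is even") that a model is asked to synthesise in this task.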
We can see that the proof is quite legible (even to people who are not familiar with the system) and that it captures high-level structures like those in human proofs.

We further explore the reasoning capabilities of neural models. We frame the proposed task as a sequence-to-sequence (seq2seq) prediction problem. Beyond evaluating the existing neural seq2seq baselines, namely the seq2seq model with attention (Bahdanau et al., 2015) and the transformer (Vaswani et al., 2017), we also propose a new architecture, the hierarchical transformer (§4). The architecture is motivated by the way humans reason about propositions; it consists of a set of local transformer layers, modelling the representation of each proposition, and a set of global layers, modelling the correlations across propositions. Experiments (§5) show that these neural models can solve 15-25% of the problems on the test set, and that the hierarchical transformer achieves the best result. Further analysis (§6) of the models' output shows that while the proposition synthesis task is hard, neural models can indeed capture mathematical reasoning. We find that the embeddings of closely related mathematical concepts are close in cosine distance; models can reason about the relation between set, subset, and member, and perform more complex multi-step reasoning that is hard even for humans.

Our contributions are summarised as follows:

1. We mine a large non-synthetic dataset of formal proofs and propose a task for evaluating neural models' mathematical reasoning abilities. The dataset contains 820K training examples with a vocabulary size of 30K.
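The local/global idea behind the hierarchical transformer can be illustrated with a minimal sketch: a local stage contextualises tokens within each proposition independently, each proposition is then pooled into a summary vector, and a global stage lets the proposition summaries attend to one another. This is only an illustrative NumPy sketch of the attention pattern (single-head, no learned projections or feed-forward layers); all names and dimensions here are our own choices, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head, unprojected scaled dot-product self-attention:
    # (seq_len, d) -> (seq_len, d).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def hierarchical_encode(propositions):
    # Local stage: contextualise tokens within each proposition independently.
    local = [self_attention(p) for p in propositions]
    # Pool each proposition into a single summary vector (mean pooling here).
    summaries = np.stack([h.mean(axis=0) for h in local])
    # Global stage: proposition summaries attend to one another.
    return self_attention(summaries)

rng = np.random.default_rng(0)
# Three propositions of 5, 3, and 7 tokens, each token an 8-dim embedding.
props = [rng.normal(size=(n, 8)) for n in (5, 3, 7)]
out = hierarchical_encode(props)
print(out.shape)  # (3, 8): one contextualised vector per proposition
```

In the actual architecture the local and global components are full transformer layers with learned parameters; the sketch only shows how attention is restricted within propositions before being applied across them.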



¹ https://www.isa-afp.org
² A comparison of proofs in different systems is available in Wiedijk (2006). The declarative proof style is also available in Mizar (Grabowski et al., 2010), where the style originates.



Figure 1: Full declarative proof of the irrationality of √2 in Isabelle/HOL.


