FORMAL MATHEMATICS STATEMENT CURRICULUM LEARNING

Abstract

We explore the use of expert iteration in the context of language modeling applied to formal mathematics. We show that, at a fixed compute budget, expert iteration, by which we mean proof search interleaved with learning, dramatically outperforms proof search alone. We also observe that, when applied to a collection of formal statements of sufficiently varied difficulty, expert iteration is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs. Finally, by applying this expert iteration to a manually curated set of problem statements, we surpass the previous state-of-the-art on the miniF2F benchmark, automatically solving multiple challenging problems drawn from high school olympiads.

1. INTRODUCTION

Deep learning has enjoyed spectacular success in many domains, including language (Brown et al., 2020; Devlin et al., 2019; Wu et al., 2016), vision (Radford et al., 2021; Tan & Le, 2019), and image generation (Ramesh et al., 2021; Karras et al., 2019). One domain where deep learning has not yet enjoyed comparable success is in tasks that require extensive planning and symbolic reasoning, with the exception of two-player games (Silver et al., 2016; 2017; Berner et al., 2019; Vinyals et al., 2019). In such games, deep learning systems exhibit a considerable degree of reasoning, especially when trained with self-play combined with a search procedure such as Monte Carlo Tree Search (MCTS) (Browne et al., 2012). But the resulting reasoning abilities are limited due to the relatively narrow scope of games. As such, theorem proving in interactive proof assistants, or formal mathematics, appears as an interesting game-like domain to tackle due to its increased scope. The typical task consists of generating a machine-checkable proof given a formal statement. Like games, formal mathematics has an automated way of determining whether a trajectory (i.e. a proof) is successful (i.e. formally correct). But the vast scope of formal mathematics means that any strong reasoning result obtained in it will be more meaningful than comparable results in games (e.g. finding proofs to mathematical conjectures), and could even be applicable to important practical problems (e.g. software verification). However, tackling formal mathematics involves two main challenges that we must address in order to continue making progress:

Infinite action space

Not only does formal mathematics have an extremely large search space (like Go (Silver et al., 2016), for example), it also has an infinite action space.
At each step of proof search, the model must choose not from a well-behaved finite set of actions, but from a complex and infinite set of tactics, potentially involving exogenous mathematical terms that have to be generated (e.g., generating a mathematical statement to be used as a witness, an object used in existential steps such as "there exists an x ...", or a cut, the introduction and chaining of a lemma in the middle of a proof).

No direct self-play setup

In formal mathematics, a prover is not playing against an opponent but against a set of statements to prove. When faced with a statement that is just too hard, there is no obvious reframing of the formal mathematics setup that will let the prover generate intermediary, easier statements to tackle first. This asymmetry prevents naive application of the symmetric self-play algorithms commonly used in two-player games.

These two differences mean that a naive application of reinforcement learning to formal mathematics leaves large room for improvement (Whalen, 2016; Winands et al., 2008). Past work proposed to address the infinite action space problem by sampling from a language model (Polu & Sutskever, 2020); training such a language model, however, requires a large dataset of statements with proofs. This paper focuses on the second challenge, and our basis for addressing it is the observation that the key role of self-play is to provide an unsupervised curriculum. We propose instead to supply auxiliary sets of problem statements (without requiring proofs) of varying difficulty. We empirically show that, when the difficulty of these auxiliary problems is varied enough, a simple expert iteration procedure is able to solve a curriculum of increasingly difficult problems, eventually generalizing to our target distribution. We show that this works with both automatically-generated and manually-curated auxiliary distributions of problems, and leverage this to achieve state-of-the-art results on the miniF2F benchmark.
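The expert iteration loop described above can be sketched as follows. This is a deliberately toy illustration, not the paper's actual system: proof search and model training are abstracted into scalars (a statement is "solved" when the model's skill plus some search luck exceeds its difficulty, and each newly found proof nudges the skill upward). The point it demonstrates is the curriculum effect: easy statements are solved first, and training on their proofs makes harder statements reachable in later rounds.

```python
import random

def expert_iteration(statements, rounds=8, seed=0):
    """Toy expert iteration: alternate proof search over a pool of statement
    difficulties (no ground-truth proofs needed) with learning on the proofs
    found. The 'model' is a single skill scalar."""
    rng = random.Random(seed)
    skill = 1.0
    solved = set()
    for _ in range(rounds):
        # Proof search: attempt every still-unsolved statement; success
        # depends on current model skill plus search luck.
        newly_solved = [
            s for s in statements
            if s not in solved and skill + rng.uniform(0.0, 1.0) >= s
        ]
        # Learning: retrain on the successful proofs, improving the model
        # so that harder statements become reachable next round.
        for s in newly_solved:
            solved.add(s)
            skill += 0.5
    return solved, skill
```

With a pool of difficulties `range(1, 9)`, the easiest statement is always solved in the first round, and the final skill grows with each proof found, so later rounds reach statements that were unreachable at the start.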
Our results suggest that continuous self-improvement in formal mathematics can potentially be reduced to the problem of generating such sets of formal statements, which we have done in part manually in this work, but which could eventually be scaled in the future with more automation (such as more domain-specific statement generators, or even informal-to-formal machine translation).

miniF2F benchmark

In this work, we target the miniF2F (Zheng et al., 2022) benchmark, which consists of 244 validation and 244 test formalized statements of mathematical problems from various competitions. We believe it to be a better measure of mathematical reasoning than a formal library-derived split. Also, the extreme scarcity of this type of problem in formal libraries makes it an ideal test-bed for the expert iteration methodology studied in this paper.

2. RELATED WORK

Our work strongly relies on, and can be seen as a natural continuation of, the work presented in the original GPT-f paper (Polu & Sutskever, 2020), which studies the use of language models to generate tactics; the PACT paper (Han et al., 2022), which applies GPT-f to Lean and studies the benefits of co-training on self-supervised objectives; and the miniF2F benchmark (Zheng et al., 2022). We present additional related work in Appendix A.

3. FORMAL ENVIRONMENT

We choose Lean (de Moura et al., 2015) as our formal environment. Unlike Metamath (Megill & Wheeler, 2019), which was studied in the original GPT-f paper (Polu & Sutskever, 2020), Lean benefits from high-level tactics, which were shown to be beneficial in the context of the miniF2F benchmark. Lean has also recently received a lot of attention from the mathematical community, thanks to projects such as the Perfectoid Spaces (Buzzard et al., 2019) and the Liquid Tensor Experiment (Scholze, 2020), and benefits from a vibrant community of hundreds of contributors to its main mathematical library, mathlib. We refer to the PACT paper's Background section (Han et al., 2022) for a detailed introduction to Lean in the context of neural theorem proving, and to Appendix D for an illustration of miniF2F statements and the Lean environment.
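To make the setting concrete, here is a small tactic-mode proof in the style the prover must produce (an illustrative example, not a statement drawn from miniF2F). Each line between `begin` and `end` is one tactic, i.e. one action in the infinite action space; note that the `nlinarith` call takes generated terms (`sq_nonneg a`, `sq_nonneg b`) as arguments, illustrating why actions must be sampled rather than enumerated.

```lean
import data.real.basic

-- Illustrative miniF2F-style statement: the sum of two real squares
-- is nonnegative. Each tactic below is checked by the Lean kernel.
theorem sum_of_squares_nonneg (a b : ℝ) : 0 ≤ a^2 + b^2 :=
begin
  nlinarith [sq_nonneg a, sq_nonneg b],
end
```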

lean-gym

In the PACT paper (Han et al., 2022), proof search is performed by the Lean runtime using the LEANSTEP environment, with a generic backend interface to models. While easy to use (one just needs to plug in their model), this approach makes it difficult to alter and iterate on the search procedure, because it is programmed in Lean (which is not designed or intended for cluster-wide parallelised I/O intensive tasks), and the coupling of the search procedure with the Lean runtime introduces challenges when scaling to a large number of parallel workers. To solve these issues we implemented lean-gym (https://github.com/openai/lean-gym), a simple REPL interface over standard input/output, implemented directly in Lean. We present lean-gym's API and discuss some of its advantages and limitations in Appendix B.
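A minimal sketch of how an external search worker might drive such a REPL from Python, exchanging one JSON request and one JSON response per line over the subprocess's standard input/output. The command names and argument shapes (`init_search`, `run_tac`) follow the lean-gym repository's README, but the authoritative API is the one described in Appendix B; treat the details below as assumptions.

```python
import json
import subprocess

class LeanGym:
    """Thin client for a lean-gym style REPL: one JSON request per line on
    stdin, one JSON response per line on stdout."""

    def __init__(self, cmd=("lean", "--run", "src/repl.lean")):
        # Launch the Lean REPL as a subprocess; the decoupling from the Lean
        # runtime is what lets many such workers run in parallel.
        self.proc = subprocess.Popen(
            cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )

    def _request(self, payload):
        self.proc.stdin.write(json.dumps(payload) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def init_search(self, decl, namespaces=""):
        # Start a proof search for the named declaration.
        return self._request(["init_search", [decl, namespaces]])

    def run_tac(self, search_id, state_id, tactic):
        # Apply one model-generated tactic to a tactic state; the response
        # carries the resulting state (or an error) and fresh identifiers.
        return self._request(["run_tac", [search_id, state_id, tactic]])
```

Because the transport is just line-delimited JSON over pipes, the search procedure itself (tree expansion, batching of model queries, retries) lives entirely on the Python side, which is the flexibility motivating the design.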

