TOWARDS A MATHEMATICS FORMALISATION ASSISTANT USING LARGE LANGUAGE MODELS

Abstract

Mathematics formalisation is the task of translating mathematics written in natural language (i.e., definitions, theorem statements, proofs), as found in books and papers, into a formal language that can then be checked for correctness by a program. Formalisation is a thriving activity today, but it remains cumbersome. In this paper, we explore the ability of a large language model (Codex) to help with formalisation in the Lean theorem prover. We find that, with careful input-dependent prompt selection and postprocessing, Codex is able to formalise short mathematical statements at the undergraduate level with nearly 75% accuracy on a set of 120 theorem statements. For proofs, quantitative analysis is infeasible, so we undertake a detailed case study instead. We choose a diverse set of 13 undergraduate-level theorems with proofs that fit in two to three paragraphs. We show that, with a new prompting strategy, Codex can formalise these natural-language proofs, with at least one out of twelve Codex completions being easy to repair into a complete proof. This is surprising, as essentially no aligned data exists for formalised mathematics, particularly for proofs. These results suggest that large language models are a promising avenue towards fully or partially automating formalisation.

1. INTRODUCTION

Mathematics (definitions, theorems, proofs, remarks) as found in books and papers is written in a semi-formal style combining natural language with specialized formal notation. We refer to the language of this style of writing mathematics as natural language, or NL. Formalisation of mathematics consists of writing mathematics in a formal language that can then be checked and manipulated by a computer. NL mathematical writing, while more rigorous than writing in most other domains, falls far short of the standard of detail and rigour required for full formalisation. Formalisation is done with the help of proof assistants. A proof assistant consists of a formal language in which mathematical statements can be encoded, along with a piece of software that assists in writing and checking proofs in the formal language down to the foundational axioms. See under Prompt in Figure 1 for some examples. Formalisation is an old endeavour that is thriving today, with several actively developed libraries of formalised mathematics for major proof assistants including Coq, Isabelle, Lean and Mizar. A major use of proof assistants is in software and hardware verification, but here we are concerned with their applications in mathematics: checking formalised mathematics automatically results in a much higher degree of confidence in the correctness of proofs. Formalisation promises to open up new possibilities in mathematical exposition, teaching, research and collaboration (Massot, 2021; Buzzard, 2022); in addition, it can facilitate automated proof discovery, e.g. (Lample et al., 2022). Formalisation of mathematics today poses a barrier to entry because of the need to learn to use proof assistants; it is also notoriously labour-intensive because many details normally taken for granted in the language of mathematics must be supplied when formalising.

Autoformalisation (Wang et al., 2018) is the task of (semi-)automatically turning a piece of mathematics in natural language into a formalised one. An autoformalisation tool that speeds up formalisation, or fully automates it, would be of great value, enabling the above advantages of formalisation and opening up new ones (Szegedy, 2020).

Autoformalisation is challenging. It is a natural language understanding problem for the language of mathematics. While the language of mathematics is stylized compared to natural language in other domains and deals with relatively narrow subject matter, it retains much of the complexity, in addition to presenting new challenges of its own, including supplying missing details and assumptions that humans take for granted, and semantically mapping concepts in the informal description to those in the formal corpus (Ganesalingam, 2013; Massot, 2021). Autoformalisation also presents practical challenges in the application of modern deep learning-based methods: the amount of formalised mathematics available is much smaller than the amount of code in major programming languages. Furthermore, there is very little aligned data between informal and formal mathematics. Autoformalisation implicitly includes semantic search in the formalised library. Autoformalisation of proofs is much more than independent autoformalisation of each statement in the proof: one needs to maintain context across the proof and find correspondences between NL constructs and tactics in formal proofs.

In this paper we work with Lean, a popular proof assistant with two actively used versions: Lean 3 (de Moura et al., 2015) and Lean 4 (de Moura & Ullrich, 2021). The rapidly evolving Lean mathematical library (abbreviated mathlib) is one of the largest libraries of formal mathematics; it is currently 226MB in size. mathlib is monolithic by design, ensuring that formalisations of different parts of mathematics can be combined easily.
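To make the setting concrete, the following is a minimal illustration of our own (not drawn from the paper's dataset) of an NL statement and one possible formalisation in Lean 4 with mathlib; mathlib in fact already provides this lemma (as `Even.add`), so an autoformalisation tool could also simply retrieve it:

```lean
-- NL: "The sum of two even natural numbers is even."
-- In mathlib, `Even n` unfolds to `∃ r, n = r + r`.
theorem even_add_even {m n : ℕ} (hm : Even m) (hn : Even n) :
    Even (m + n) := by
  obtain ⟨a, ha⟩ := hm   -- ha : m = a + a
  obtain ⟨b, hb⟩ := hn   -- hb : n = b + b
  exact ⟨a + b, by omega⟩  -- witness a + b; linear arithmetic closes the goal
```

Note how much the formal version must make explicit that the NL sentence leaves implicit: the type of the variables, the definition of evenness being used, and an actual witness for the existential.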
The resulting standardization of terminology in mathlib and its good coverage make Lean an attractive target for autoformalisation. To our knowledge, the only form in which aligned data occurs in mathlib is as docstrings for definitions and theorem statements. Furthermore, there is an almost complete lack of aligned data for proofs: while some examples of natural-language proofs together with their corresponding formal proofs occur in the blueprints of some Lean formalisation projects, e.g. the Liquid Tensor Experiment, these are only a handful and highly specialised.

Our contributions. In this paper we apply a large language model (specifically, Codex) to the problem of autoformalisation. We focus on two tasks: (1) translating theorem statements of a form similar to mathlib docstrings into theorem statements in Lean 4, and (2) translating (outlines of) NL proofs into Lean proofs in Lean 3. The latest version of Lean, Lean 4, is, in addition to an interactive theorem prover, a full-fledged programming language with a fast runtime. This allows a seamless integration of proofs, programs and meta-programs. We use Lean 4 for one set of experiments and Lean 3 for the other because, at the time of writing, mathlib was only partially available in Lean 4 (via a partial binary port). Hence we use Lean 4 where its additional capabilities are important, and Lean 3 where these are not used and the larger library is of greater value. More details on Lean are in Appendix A.

Theorem statement autoformalisation. For the evaluation dataset, we chose 120 theorem statements at the undergraduate level, such that the relevant concepts (background theory and definitions) were mostly already in mathlib. Since mathlib is substantial (it covers a significant fraction of the undergraduate mathematics curriculum in addition to many advanced results), this is not a restriction. We focused on theorem statements at the undergraduate and more advanced level from various areas of mathematics.
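For illustration, a docstring–theorem pair in the style of mathlib (paraphrased here, not quoted verbatim from the library) looks like the following; the statement-autoformalisation task is essentially to produce the `theorem` line given text like the docstring:

```lean
/-- If `a` divides `b` and `b` divides `c`, then `a` divides `c`. -/
theorem dvd_trans' {a b c : ℕ} (h₁ : a ∣ b) (h₂ : b ∣ c) : a ∣ c :=
  h₁.trans h₂  -- mathlib already has `dvd_trans`; primed name avoids a clash
```

Pairs of this shape are the only naturally occurring aligned data in mathlib, which is why they are a natural source of few-shot examples.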
These statements tend to be more challenging for autoformalisation than the mathematics competition problems studied in prior work (Wu et al., 2022), as they often assume more implicit context and draw from a much larger background. We experimented with input-dependent prompting, with mathlib as a database. Specifically, we chose our few-shot prompts to consist of theorem–docstring pairs from mathlib whose docstrings are close, in a sentence similarity metric, to the statement to be formalised. We also experimented with filtering outputs generated at high temperatures by checking validity in Lean 4, along with other post-processing. Our results show a strong effect of both prompt engineering and prompt selection, and a stronger one when they are combined: a reasonably large fraction of outputs elaborate when both are applied, and the results improve further when more prompts are used. In the context of autoformalisation, we are the first to use input-dependent prompting, and our use of elaboration for postprocessing is novel. Both are greatly facilitated by the availability of mathlib and by the nature of Lean 4, which gives easy access to its internals from within Lean 4 itself, allowing shared code and avoiding context switching.
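The input-dependent prompt selection described above can be sketched as follows. This is a minimal stand-in of our own devising, using a bag-of-words cosine similarity over plain Python; the paper's actual sentence-similarity metric and any function names here are assumptions, not the authors' implementation:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a crude stand-in for a real sentence encoder."""
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_prompts(target, pairs, k=4):
    """Pick the k (docstring, theorem) pairs whose docstring is most similar
    to the target NL statement; these become the few-shot prompt examples."""
    tv = Counter(tokenize(target))
    ranked = sorted(pairs,
                    key=lambda p: cosine(tv, Counter(tokenize(p[0]))),
                    reverse=True)
    return ranked[:k]

pairs = [
    ("The sum of two even numbers is even.", "thm_even_add"),
    ("Every prime greater than two is odd.", "thm_prime_odd"),
    ("A continuous function on a compact set attains its maximum.", "thm_max"),
]
best = select_prompts("the sum of two even integers is even", pairs, k=1)
```

In a full pipeline, the retrieved pairs would be concatenated into the prompt ahead of the target statement, and the high-temperature completions filtered by attempting elaboration in Lean.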

