INT: AN INEQUALITY BENCHMARK FOR EVALUATING GENERALIZATION IN THEOREM PROVING

Abstract

In learning-assisted theorem proving, one of the most critical challenges is to generalize to theorems unlike those seen at training time. In this paper, we introduce INT, an INequality Theorem proving benchmark designed to test agents' generalization ability. INT is based on a theorem generator, which provides theoretically infinite data and allows us to measure 6 different types of generalization, each reflecting a distinct challenge characteristic of automated theorem proving. In addition, INT provides a fast theorem proving environment with sequence-based and graph-based interfaces, conducive to performing learning-based research. We introduce baselines with architectures including transformers and graph neural networks (GNNs) for INT. Using INT, we find that transformer-based agents achieve stronger test performance for most of the generalization tasks, despite having much larger out-of-distribution generalization gaps than GNNs. We further find that the addition of Monte Carlo Tree Search (MCTS) at test time helps to prove new theorems.

1. INTRODUCTION

Advances in theorem proving can catalyze developments in fields including formal mathematics (McCune, 1997), software verification (Darvas et al., 2005), and hardware design (Kern and Greenstreet, 1999). Following its recent success across other application domains, machine learning has significantly improved the performance of theorem provers (Bansal et al., 2019; Bridge et al., 2014; Gauthier et al., 2018; Huang et al., 2019; Irving et al., 2016; Kaliszyk et al., 2018; Lee et al., 2020; Loos et al., 2017; Urban et al., 2011; Wang and Deng, 2020; Yang and Deng, 2019; Li et al., 2020; Rabe et al., 2020; Polu and Sutskever, 2020).

Two key factors make theorem proving particularly challenging for machine learning: data sparsity and the need for out-of-distribution generalization. First, because formalizing mathematics is difficult for humans, manually generated formal proofs are necessarily expensive. Typical formal mathematics datasets contain thousands (Huang et al., 2019) to tens of thousands (Yang and Deng, 2019) of theorems, orders of magnitude smaller than the datasets that enabled breakthroughs in areas such as vision (Deng et al., 2009) and natural language processing (Rajpurkar et al., 2016). Second, the assumption frequently made in machine learning that data points are independently and identically distributed does not hold in general for theorem proving: the interesting problems we want to prove are non-trivially different from those we already have proofs for. Hence, out-of-distribution generalization ability is crucial.

Synthetic datasets that rely on procedural generation provide a potentially unlimited amount of data. Well-designed synthetic datasets have been shown to help understand the capabilities of machine learning models (Johnson et al., 2017; Ros et al., 2016; Weston et al., 2016). With the goal of alleviating the data scarcity problem and understanding out-of-distribution generalization for theorem proving, we introduce INT.
INT is a synthetic INequality Theorem proving benchmark designed for evaluating generalization. It can generate a theoretically unlimited number of theorems and proofs in the domain of algebraic equalities and inequalities. INT allows tweaking of its problem distribution along 6 dimensions, enabling us to probe multiple aspects of out-of-distribution generalization. It is accompanied by a fast proof assistant with sequence-based and graph-based interfaces. A common reservation about synthetic datasets is one of realism: can synthetic data help to prove realistic theorems? Polu and Sutskever (2020) adopted our generation method and showed that augmenting training with 1% synthetic theorems helped complete 2.3% more proofs on Metamath (Megill and Wheeler, 2019). This demonstrates the usefulness of INT in real mathematics.

Time and memory requirements of the proof assistant have often been an obstacle to using theorem provers as RL environments. Most existing proof assistants must load a large software library defining numerous mathematical theorems, leading to slow simulation. A key design objective for INT was therefore to be lightweight and swift. Taking advantage of the limited scope of inequality theorems, we load a minimal library and achieve fast simulation. Reducing the simulation overhead allows for experimentation with planning methods such as MCTS, which require many calls to a simulator.

We summarize the contributions of this paper as follows:

1. We make, to the best of our knowledge, the first attempt to investigate an important question in learning-assisted theorem proving research: can theorem provers generalize to different problem distributions? We introduce INT for evaluating six dimensions of generalization.
2. We introduce and benchmark baseline agents for the six types of generalization tasks in INT. Compared to GNN-based agents, transformer-based agents generalize better when training and test data are drawn from the same distribution, but worse on out-of-distribution tasks in INT. Surprisingly, despite larger generalization gaps, transformer-based agents achieve higher test success rates than GNN-based ones in most cases.
3. We find that searching with MCTS at test time greatly improves generalization.
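To make the environment-style workflow concrete, below is a minimal sketch of an RL-style interaction loop of the kind a fast proof environment enables. The `ToyProofEnv` class and its `reset`/`step`/`run_episode` names are hypothetical stand-ins for illustration, not the actual INT interface; the "theorem" here is just a counter that each proof step decrements, with a sparse reward when the goal is reached.

```python
# Hypothetical sketch of an RL-style interface to a proof environment.
# ToyProofEnv is an illustrative stand-in, not the actual INT API.

class ToyProofEnv:
    """A stand-in environment: the 'theorem' is a counter that must reach 0."""

    def __init__(self, start=3):
        self.start = start

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        # Action "dec" applies a proof step; anything else is a no-op.
        if action == "dec":
            self.state -= 1
        done = self.state == 0          # goal: all objectives discharged
        reward = 1.0 if done else 0.0   # sparse reward on closing the proof
        return self.state, reward, done


def run_episode(env, policy, max_steps=10):
    """Roll out a policy until the proof closes or the step budget runs out."""
    env.reset()
    for t in range(max_steps):
        state, reward, done = env.step(policy(env.state))
        if done:
            return True, t + 1
    return False, max_steps


proved, steps = run_episode(ToyProofEnv(start=3), policy=lambda s: "dec")
print(proved, steps)  # the trivial policy closes this toy proof in 3 steps
```

The design point this sketch illustrates is that each training or search step costs one `step()` call, so a lightweight simulator directly determines how much interaction a learning agent can afford.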
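To make the test-time search idea concrete, here is a minimal UCT-style MCTS sketch over a toy deterministic search tree. The prover-like interface (`legal_actions`, `apply_action`, `is_proved`) and the goal string are hypothetical stand-ins for illustration; this is a generic textbook MCTS, not the implementation used in the paper.

```python
import math
import random

# Toy "prover" interface: build a string; the theorem is "proved" at GOAL.
GOAL = "aaa"

def legal_actions(state):
    return ["a", "b"] if len(state) < 3 else []

def apply_action(state, action):
    return state + action

def is_proved(state):
    return state == GOAL

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}            # action -> Node
        self.visits, self.value = 0, 0.0

def uct_select(node, c=1.4):
    # Pick the child maximizing mean value plus an exploration bonus.
    return max(node.children.values(),
               key=lambda ch: ch.value / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def rollout(state, rng):
    # Play random actions to a terminal state; reward 1 iff the proof closes.
    while legal_actions(state):
        state = apply_action(state, rng.choice(legal_actions(state)))
    return 1.0 if is_proved(state) else 0.0

def mcts(root_state, n_sim=400, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(n_sim):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(legal_actions(node.state)):
            node = uct_select(node)
        # 2. Expansion: add one untried child, if any remain.
        untried = [a for a in legal_actions(node.state) if a not in node.children]
        if untried:
            a = rng.choice(untried)
            node.children[a] = Node(apply_action(node.state, a), parent=node)
            node = node.children[a]
        # 3. Simulation from the new (or terminal) node.
        reward = rollout(node.state, rng)
        # 4. Backpropagation of the outcome to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Recommend the most-visited root action.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print(mcts(""))  # with enough simulations, search favors the rewarding branch
```

Because only the path "aaa" yields a reward, the statistics accumulated over many simulated episodes steer the search toward action "a" at the root; the same mechanism is what lets search at test time find proofs a raw policy misses, at the cost of many simulator calls.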

2. RELATED WORKS

Automatic and Interactive Theorem Proving. Modern Automatic Theorem Provers (ATPs) such as E (Schulz, 2013) and Vampire (Kovács and Voronkov, 2013) represent mathematical theorems in first-order logic and prove them with resolution-based proof calculi. Interactive Theorem Provers (ITPs), on the other hand, allow human formalization of proofs, which arguably makes them more amenable to machine learning methods. Well-known ITPs include Isabelle (Paulson, 1986), Coq (Barras et al., 1999), LEAN (de Moura et al., 2015), and HOL Light (Harrison, 1996).

Learning-assisted Theorem Proving. Theorem provers have been improved by supervised learning (Urban et al., 2011; Bridge et al., 2014; Irving et al., 2016; Loos et al., 2017; Wang et al., 2017; Rocktäschel and Riedel, 2017; Bansal et al., 2019; Gauthier et al., 2018; Huang et al., 2019; Yang and Deng, 2019; Kaliszyk and Urban, 2015; Polu and Sutskever, 2020; Li et al., 2020; Rabe et al., 2020; Jakubuv and Urban, 2019; Olsák et al., 2020; Jakubuv et al., 2020; Kaliszyk et al., 2015; Gauthier and Kaliszyk, 2015). Wang et al. (2017) used graph embeddings to represent logic formulas and achieved state-of-the-art classification accuracy on the HolStep dataset (Kaliszyk et al., 2017). Reinforcement learning (RL) was employed in (Zombori et al., 2019; Gauthier, 2019; 2020). Kaliszyk et al. (2018) combined MCTS with RL to prove theorems with the connection tableau calculus. Notably, GPT-f (Polu and Sutskever, 2020) adopts our INT generation method for dataset augmentation.

Datasets for Theorem Proving. There are many formal mathematical libraries (Megill and Wheeler, 2019; Rudnicki, 1992; Gauthier, 2019). Formalized mathematical theorems include the Feit-Thompson theorem (Gonthier et al., 2013) and the Kepler Conjecture (Hales et al., 2017). The largest human formal reasoning dataset is IsarStep (Li et al., 2020), which mined the Archive of Formal Proofs and brought together 143K theorems in total.
These works rely on human effort to formalize theorems, which leads to small- to moderate-sized datasets. There have been studies on synthesizing theorems (Urban, 2007; Urban et al., 2008; Piotrowski and Urban, 2018; Gauthier et al., 2017; 2016; Chvalovskỳ et al., 2019; Lenat, 1976; Fajtlowicz, 1988; Colton, 2012; Johansson et al., 2014). It is also worth mentioning a few approaches toward neural theorem synthesizers (Urban and Jakubuv, 2020; Wang and Deng, 2020). Our theorem generator, INT, is designed both to create an unlimited number of theorems and to benchmark the generalization ability of learning-assisted theorem provers.

3. THE INT BENCHMARK DATASET AND PROOF ASSISTANT

Our INT benchmark dataset provides mathematical theorems and a means to study the generalization capability of theorem provers. For this purpose, we need control over the distribution of theorems:

