INT: AN INEQUALITY BENCHMARK FOR EVALUATING GENERALIZATION IN THEOREM PROVING

Abstract

In learning-assisted theorem proving, one of the most critical challenges is to generalize to theorems unlike those seen at training time. In this paper, we introduce INT, an INequality Theorem proving benchmark designed to test agents' generalization ability. INT is based on a theorem generator, which provides theoretically infinite data and allows us to measure 6 different types of generalization, each reflecting a distinct challenge characteristic of automated theorem proving. In addition, INT provides a fast theorem proving environment with sequence-based and graph-based interfaces, conducive to performing learning-based research. We introduce baselines with architectures including transformers and graph neural networks (GNNs) for INT. Using INT, we find that transformer-based agents achieve stronger test performance on most of the generalization tasks, despite having much larger out-of-distribution generalization gaps than GNNs. We further find that the addition of Monte Carlo Tree Search (MCTS) at test time helps to prove new theorems.

1. INTRODUCTION

Advances in theorem proving can catalyze developments in fields including formal mathematics (McCune, 1997), software verification (Darvas et al., 2005), and hardware design (Kern and Greenstreet, 1999). Following its recent success across other application domains, machine learning has significantly improved the performance of theorem provers (Bansal et al., 2019; Bridge et al., 2014; Gauthier et al., 2018; Huang et al., 2019; Irving et al., 2016; Kaliszyk et al., 2018; Lee et al., 2020; Loos et al., 2017; Urban et al., 2011; Wang and Deng, 2020; Yang and Deng, 2019; Li et al., 2020; Rabe et al., 2020; Polu and Sutskever, 2020). Two key factors make theorem proving particularly challenging for machine learning: data sparsity and the need for out-of-distribution generalization. Firstly, because formalizing mathematics is difficult for humans, manually written formal proofs are expensive to produce. Typical formal mathematics datasets contain thousands (Huang et al., 2019) to tens of thousands (Yang and Deng, 2019) of theorems, orders of magnitude smaller than the datasets that enabled breakthroughs in areas such as vision (Deng et al., 2009) and natural language processing (Rajpurkar et al., 2016). Secondly, the assumption frequently made in machine learning that each data point is independently and identically distributed does not hold in general for theorem proving: the interesting problems we want to prove are non-trivially different from those we already have proofs for. Hence, out-of-distribution generalization ability is crucial. Synthetic datasets that rely on procedural generation provide a potentially unlimited amount of data, and well-designed synthetic datasets have been shown to help understand the capabilities of machine learning models (Johnson et al., 2017; Ros et al., 2016; Weston et al., 2016). With the goal of alleviating the data scarcity problem and understanding out-of-distribution generalization for theorem proving, we introduce INT.
INT is a synthetic INequality Theorem proving benchmark designed for evaluating generalization. It can generate a theoretically unlimited number of theorems and proofs in the domain of algebraic equalities and inequalities. INT allows tweaking of its problem distribution along 6 dimensions, enabling us to probe multiple aspects of out-of-distribution generalization. It is accompanied by a fast proof assistant with sequence- and graph-based interfaces. A common reservation about synthetic datasets is realism: can synthetic data help to prove realistic theorems?
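To make the idea of procedural theorem generation concrete, the following is a toy sketch (our own illustration, not INT's actual generator or API): it starts from an instance of the axiom that squares are non-negative and applies a random sequence of truth-preserving rewrites, recording each step so that the rewrite trace doubles as a proof of the generated inequality.

```python
import random

def generate_theorem(steps=3, rng=None):
    """Toy procedural generator (illustrative only): begin with an
    axiom instance (expr)**2 >= 0 and apply truth-preserving rewrites.
    Returns the final inequality as a string plus the proof trace."""
    rng = rng or random.Random()
    core = rng.choice(["a - b", "a + b", "2*a - b"])
    lhs, rhs = f"({core})**2", "0"
    proof = [f"axiom (square is non-negative): {lhs} >= {rhs}"]
    for _ in range(steps):
        if rng.random() < 0.5:
            c = rng.randint(1, 5)  # adding a constant to both sides preserves truth
            lhs, rhs = f"{lhs} + {c}", f"{rhs} + {c}"
            proof.append(f"add {c} to both sides")
        else:
            k = rng.randint(2, 4)  # scaling by a positive constant preserves truth
            lhs, rhs = f"{k}*({lhs})", f"{k}*({rhs})"
            proof.append(f"multiply both sides by {k}")
    return f"{lhs} >= {rhs}", proof

def holds_numerically(theorem, trials=100, rng=None):
    """Sanity check: evaluate the inequality at random points (a, b)."""
    rng = rng or random.Random(0)
    lhs, rhs = theorem.split(" >= ")
    for _ in range(trials):
        env = {"a": rng.uniform(-10, 10), "b": rng.uniform(-10, 10)}
        if eval(lhs, {}, env) < eval(rhs, {}, env) - 1e-9:
            return False
    return True

thm, proof = generate_theorem(steps=3, rng=random.Random(42))
print(thm)
for step in proof:
    print("  ", step)
```

Because each rewrite preserves truth, every generated statement is a theorem by construction, and the recorded step sequence is a valid proof; varying the choice of axioms, the number of steps, and the expression complexity is the kind of knob that lets a generator like INT's shift its problem distribution along multiple dimensions.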

