INT: AN INEQUALITY BENCHMARK FOR EVALUATING GENERALIZATION IN THEOREM PROVING

Abstract

In learning-assisted theorem proving, one of the most critical challenges is to generalize to theorems unlike those seen at training time. In this paper, we introduce INT, an INequality Theorem proving benchmark designed to test agents' generalization ability. INT is based on a theorem generator, which provides theoretically infinite data and allows us to measure 6 different types of generalization, each reflecting a distinct challenge characteristic of automated theorem proving. In addition, INT provides a fast theorem proving environment with sequence-based and graph-based interfaces, conducive to performing learning-based research. We introduce baselines with architectures including transformers and graph neural networks (GNNs) for INT. Using INT, we find that transformer-based agents achieve stronger test performance for most of the generalization tasks, despite having much larger out-of-distribution generalization gaps than GNNs. We further find that the addition of Monte Carlo Tree Search (MCTS) at test time helps to prove new theorems.

1. INTRODUCTION

Advances in theorem proving can catalyze developments in fields including formal mathematics (McCune, 1997), software verification (Darvas et al., 2005), and hardware design (Kern and Greenstreet, 1999). Following its recent success across other application domains, machine learning has significantly improved the performance of theorem provers (Bansal et al., 2019; Bridge et al., 2014; Gauthier et al., 2018; Huang et al., 2019; Irving et al., 2016; Kaliszyk et al., 2018; Lee et al., 2020; Loos et al., 2017; Urban et al., 2011; Wang and Deng, 2020; Yang and Deng, 2019; Li et al., 2020; Rabe et al., 2020; Polu and Sutskever, 2020). Two key factors that make theorem proving particularly challenging for ML are data scarcity and the need for out-of-distribution generalization. Firstly, due to the difficulty of formalizing mathematics for humans, manually generated formal proofs are necessarily expensive. Typical formal mathematics datasets contain thousands (Huang et al., 2019) to tens of thousands (Yang and Deng, 2019) of theorems, orders of magnitude smaller than the datasets that enabled breakthroughs in areas such as vision (Deng et al., 2009) and natural language processing (Rajpurkar et al., 2016). Secondly, the assumption frequently made in machine learning that data points are independently and identically distributed does not hold in general for theorem proving: the interesting problems we want to prove are non-trivially different from those we have proofs for. Hence, the out-of-distribution generalization ability is crucial. Synthetic datasets that rely on procedural generation provide a potentially unlimited amount of data. Well-designed synthetic datasets have been shown to help understand the capabilities of machine learning models (Johnson et al., 2017; Ros et al., 2016; Weston et al., 2016). With the goal of alleviating the data scarcity problem and understanding out-of-distribution generalization for theorem proving, we introduce INT.
INT is a synthetic INequality Theorem proving benchmark designed for evaluating generalization. It can generate a theoretically unlimited number of theorems and proofs in the domain of algebraic equalities and inequalities. INT allows tweaking of its problem distribution along 6 dimensions, enabling us to probe multiple aspects of out-of-distribution generalization. It is accompanied by a fast proof assistant with sequence and graph-based interfaces. A common reservation about synthetic datasets is one of realism: can synthetic data help to prove realistic [...]

[...] this is achieved by a highly customizable synthetic theorem generator. We used a set of ordered field axioms (Dummit and Foote, 2004) to generate inequality theorems and a subset of it to generate equality theorems. Details of the axiomization schemes can be found in Appendix A. The code for generating theorems and conducting experiments is available at https://github.com/albertqjiang/INT.

3.1. TERMINOLOGY

The axiom combination of a proof refers to the set of axioms used in constructing it. The sequence of axioms applied in order in the proof is called the axiom order. For example, let A, B, C denote three unique axioms, and their order of application in a proof be [...]

[...] Then, if the assumptions are in the proven facts, the conclusions are added to the proven facts; if the conclusions include the goal, the unproven assumptions will become the new goal. The assistant considers the theorem proven if, after all steps in the proof are applied, the goal is empty or trivial.

In Figure 1, we present the same proof in LEAN (de Moura et al., 2015) and INT assistants. They both process proofs by simplifying the goal until it is trivial. The INT assistant's seq2seq interface (Figure 1b) is very similar to that of LEAN (Figure 1a) with the rewrite tactic. An action is composed of an axiom followed by argument names and their positions in the proof state. in obj indicates that the arguments can be found in the objective. The graph interface (Figure 1c) of the INT assistant allows theorem provers to choose axiom arguments from the computation graphs of the proof state by node. We can view theorem proving with this interface as a graph manipulation task.

The INT assistant provides fast simulation. To demonstrate this, we produced 10,000 typical proof steps in both interfaces, averaging 40 characters in length. We executed them with HOL Light (Harrison, 1996) and the INT assistant. The average time per step is 7.96ms in HOL Light and 1.28ms in INT, a 6.2× speedup. The correctness of the proofs is ensured by a trusted core of fewer than 200 lines of code.
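The proof-step rule described above (conclusions join the proven facts when all assumptions are known; otherwise a matched goal is replaced by the unproven assumptions) can be sketched in a few lines of Python. This is a toy illustration only: the `ProofState` class, the `apply_step` function, and the use of plain strings as logic statements are assumptions made for this example and are not the INT assistant's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class ProofState:
    """Hypothetical proof state: proven facts plus the goals still open."""
    proven: set = field(default_factory=set)
    goals: set = field(default_factory=set)

    def is_proven(self) -> bool:
        # The theorem counts as proven once no goal remains.
        return not self.goals


def apply_step(state, assumptions, conclusions):
    """Apply one axiom instance of the form assumptions -> conclusions."""
    proven = set(state.proven)
    goals = set(state.goals)
    if assumptions <= proven:
        # Forward step: every assumption is known, so the conclusions
        # become proven facts and any goal they cover is closed.
        proven |= conclusions
        goals -= proven
    elif conclusions & goals:
        # Backward step: the conclusions cover a goal, which is replaced
        # by the assumptions that are not proven yet.
        goals = (goals - conclusions) | (assumptions - proven)
    return ProofState(proven, goals)


# One instance of the first principle of inequality, written as strings.
fpoi_assumptions = {"a>=b", "c>=d"}
fpoi_conclusions = {"a+c>=b+d"}

# Only a>=b is known: the goal is reduced to the missing assumption c>=d.
after_backward = apply_step(
    ProofState({"a>=b"}, {"a+c>=b+d"}), fpoi_assumptions, fpoi_conclusions
)

# Both assumptions are known: the goal is closed directly.
after_forward = apply_step(
    ProofState({"a>=b", "c>=d"}, {"a+c>=b+d"}), fpoi_assumptions, fpoi_conclusions
)
```

In the first call the goal shrinks to `c>=d`; in the second the goal set becomes empty, which is the "goal is empty" termination condition mentioned in the text.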

3.3. THEOREM GENERATOR

One of the main contributions of this paper is to provide a generation algorithm that is able to produce a distribution of non-trivial synthetic theorems given an axiom order. Generating theorems by randomly sampling axiom and argument applications will often yield theorems with short proofs. Instead, we write production rules for axioms in the form of transformation and extension rules. With these production rules, we can find arguments and new premises required for longer proofs. We provide the theorem generation algorithm in Algorithm 1. The general idea of the algorithm is to morph a trivial logic statement into one that requires a non-trivial proof; we call this statement the core logic statement. We initialize the core logic statement $C_0$ to be one of the initial conditions. At step $t$ of the generation process, we are given an axiom $a_t$ specified by the axiom order. We apply the MORPH function associated with the axiom $a_t$ to $C_{t-1}$ and derive a new logic statement $C_t$ and corresponding premises $P_t$. The key design idea in the MORPH function is to ensure that the newly generated logic statement and the premises form the implication $C_{t-1}, a_t, P_t \to C_t$ (see Appendix B for details). Therefore, we can chain the implications from all steps together to obtain a proof whose length is that of the axiom order: $C_0, \{a_t, P_t\}_{t=1}^{L} \to C_L$, where $L$ denotes the length. The last core logic statement $C_L$ and its premises $C_0, \{P_t\}_{t=1}^{L}$ are returned as the theorem generated.

Algorithm 1 Theorem Generator
 ...
 6: Get new logic statement and premises: $C_t, P_t \leftarrow$ MORPH$(a_t, C_{t-1})$.
 7: Add new premises to the set of all premises: $P \leftarrow P \cup P_t$.
 8: end for
 9: return $C_L, P$
10: end function

Below we show a step-by-step example of how a theorem is generated with our algorithm. With recorded axiom and argument applications, we can synthesize proofs to the theorems. The proofs can be used for behavior cloning. Appendix E shows statistics of the generated proofs, including the distribution of length of theorems in characters, the distribution of axioms, and the distribution of the number of nodes in proof state graphs.
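The generation loop described in this section (sample $C_0$, then repeatedly apply MORPH along the axiom order while accumulating premises) can be sketched as follows. The function `generate_theorem`, the triple representation of statements, and especially `toy_morph` with its three hard-coded rules are hypothetical stand-ins written for this illustration; the real MORPH function uses the transformation and extension rules of the appendices.

```python
import random


def generate_theorem(initial_conditions, axiom_order, morph, rng=random):
    """Chain MORPH steps along the axiom order (sketch of the generator)."""
    core = rng.choice(initial_conditions)      # C_0 ~ Uniform(I)
    premises = [core]                          # P = {C_0}
    steps = []                                 # recorded (axiom, C_t) pairs
    for axiom in axiom_order:                  # a_t for t = 1..L
        core, new_premises = morph(axiom, core)    # C_t, P_t <- MORPH(a_t, C_{t-1})
        premises.extend(p for p in new_premises if p not in premises)
        steps.append((axiom, core))
    return core, premises, steps


def toy_morph(axiom, statement):
    """Hypothetical MORPH on statements stored as (lhs, relation, rhs)."""
    lhs, rel, rhs = statement
    if axiom == "AZ" and rel == "=":
        # AdditionZero: x + 0 = x, so the right-hand side may gain "+ 0".
        return (lhs, "=", f"({rhs} + 0)"), []
    if axiom == "EIDI" and rel == "=":
        # EquivalenceImpliesDoubleInequality: an equality yields >=.
        return (lhs, ">=", rhs), []
    if axiom == "FPOI" and rel == ">=":
        # FirstPrincipleOfInequality: add d to the left and e to the right,
        # which requires the new premise d >= e.
        return (f"({lhs} + d)", ">=", f"({rhs} + e)"), [("d", ">=", "e")]
    raise ValueError(f"axiom {axiom} not applicable to {statement}")


goal, premises, steps = generate_theorem(
    [("a", "=", "a")], ["AZ", "EIDI", "FPOI"], toy_morph
)
```

With the single initial condition `a = a`, the run is deterministic: the goal becomes `(a + d) >= ((a + 0) + e)` under the premise `d >= e`, and `steps` records one entry per axiom, so the proof length equals the length of the axiom order.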

4. EXPERIMENTS

Our experiments are intended to answer the following questions:
1. Can neural agents generalize to theorems: 1) sampled from the same distribution as training data, 2) with different initial conditions, 3) with unseen axiom orders, 4) with unseen axiom combinations, 5) with different numbers of unique axioms, 6) with shorter or longer proofs?
2. How do different architectures (transformer vs. GNN) affect theorem provers' in-distribution and out-of-distribution generalization?
3. Can search at test time help generalization?

4.1. EXPERIMENT DETAILS

In the following experiments, we used the proofs generated by the INT generator to perform behavior cloning. We then evaluated the success rates of trained agents in a theorem proving environment. We denote the cardinality of an axiom combination as K and the length of a proof as L. In the worked example, K = 4 and L = 4. For each theorem distribution, we first generated a fixed test set of 1000 problems, and then produced training problems in an online fashion, while making sure the training problems were different from the test ones. For each experiment, we generated 1000 problems and performed 10 epochs of training before generating the next 1000. We ran 1500 such iterations in total, with 1.5 million problems generated. We used the Adam optimizer (Kingma and Ba, 2015). We searched over the learning rates $\{10^{-5}, 3 \cdot 10^{-5}, 10^{-4}, 3 \cdot 10^{-4}\}$ in preliminary experiments and found $10^{-4}$ to be the best choice, which was used for the following experiments. We used one Nvidia P100 or Tesla T4 GPU with 4 CPU cores for training. For each experiment, we ran 2 random seeds, and picked the one with the higher validation success rate for test evaluation. Since this paper focuses on inequalities, all figures and tables in the main text are based on results from the ordered-field axiomization. We also include results of GNN-based agents on equalities in Appendix G.

In this section, we introduce four baselines built on commonly used architectures: Transformers (Vaswani et al., 2017), Graph Neural Networks (GNNs), TreeLSTMs (Tai et al., 2015) and Bag-of-Words (BoWs). In preliminary experiments, we found Graph Isomorphism Networks (GINs) (Xu et al., 2019) to perform the best among several representative GNN architectures, so we used GIN as our GNN of choice. Transformers interact with the INT proof assistant through the seq2seq interface, while the other baselines use the graph interface.
For sequence-to-sequence training, we used a character-level transformer architecture with 6 encoding layers and 6 decoding layers. We used 512 embedding dimensions, 8 attention heads and 2048 hidden dimensions for position-wise feedforward layers. We used dropout with rate 0.1, label smoothing with coefficient 0.1, and a maximum of 2048 tokens per batch. The library fairseq (Ott et al., 2019) was used for its implementation.

For data in the graph form, each node in computation graphs corresponds to a character in the formula. We first used a learnable word embedding of dimension 512 to represent each node. We then used 6 GIN layers to encode graph inputs into vector representations, each with 512 hidden dimensions. The graph representation was obtained by taking the sum of all the node embeddings. For the TreeLSTM and the BoW baselines, we used a bidirectional TreeLSTM with 512 hidden dimensions and a BoW architecture to compute the graph representation vectors from node embeddings. The hyper-parameters used were found to be optimal in preliminary experiments. We then proposed axioms conditioned on the graph representations, with a two-layer MLP of hidden dimension 256. Conditioning on the graph representation and axiom prediction, the arguments are selected in an autoregressive fashion; namely, the prediction of the next node is conditioned on the previous ones. For each argument prediction, we used a one-layer MLP with a hidden size of 256. We used the graph neural network libraries PyTorch Geometric (Fey and Lenssen, 2019) for the GIN implementation, and DGL (Wang et al., 2019) for the TreeLSTM implementation.

We trained agents based on the architectures mentioned above by behavior cloning on theorems of various lengths (L) and numbers of axioms (K). The success rates for proving 1000 test theorems are plotted in Figure 3.
As the BoW architecture did not utilize the structure of the state, it proved very few theorems, indicating the significance of the structural information. TreeLSTM performed worse than the graph neural network baseline. The transformer and the GNN baselines performed the best among the architectures chosen, and they take inputs in sequential and graph forms, respectively. Thus, we used these two architectures in the following experiments to investigate generalization.

4.3. BENCHMARKING SIX DIMENSIONS OF GENERALIZATION

IID Generalization In this experiment, the training and test data are independently and identically distributed (IID). The performances of our transformer-based and GNN-based agents are displayed on the left in Figure 2. As can be seen, the performances of agents on train and test problems are very similar. The largest difference between train and test success rates is 2% (K3L7). Notably, transformer-based agents complete 15.3% more test proofs than GNN-based agents on average.

Initial Condition Consider two theorems: (1) $(a + b)^2 = a^2 + b^2 + 2ab$ and (2) $(a + (b + c))^2 = a^2 + (b + c)^2 + 2a(b + c)$. The two problems take the same axioms and the same number of steps to prove. However, the axiom argument complexities are different, which can be seen as a result of varying initial conditions. Can agents trained on problems like (1) prove theorems like (2)? For an initial condition of the form X = X, we use the degree of the entity X to determine the complexity. In this experiment, we trained agents on problems with initial conditions made up of entities of degree 0, and evaluated them on ones of degrees 1 and 2. The results are presented in Figure 2(b) with various K and L. For transformer-based agents, the success rate drops 25.6% on degree-1 problems and 31.5% on degree-2 problems on average. However, for GNN-based agents, the largest generalization gap between training and test success rates is 3% (K3L5). This shows that GNN agents can generalize to problems of higher complexities while transformer agents struggle.

We trained the agents on problems that have the same proof length (L = 7) and varying K. The results are on the left of Figure 4. It can be observed from the figure that, in general, agents perform the best on the K they were trained on and worse when K shifts away. Transformer-based agents showed better performance in all K and L settings, completing 20.9% more proofs than GNN-based ones on average. The success rates of transformer-based agents drop 5.6% on average when the test K is shifted away by 1 from the training K. For GNN-based agents, this averages to 5.1%. This shows that their abilities to generalize to different numbers of axioms are similar.

Axiom Orders

Proof Length We tested the generalization ability of theorem provers over the dimension of proof length of the theorems. To do this, we kept the cardinality of the axiom set the same (K = 3) and varied the evaluated problems' proof length (L = 3, 5, 7). The result is presented on the right of Figure 4. For all of the agents trained, the success rate decreases as the length of the proof increases. This is due to the natural difficulty of completing longer proofs. Observing the figure, we see that the longer the training problems, the less the agents deteriorate in performance when proofs become longer: agents trained on K3L3 problems complete 18.8% fewer proofs when L is increased by 1, while ones trained on K3L7 complete 5.7% fewer. Furthermore, the performance of transformer-based agents decreases by 12.2% when the test proof length increases by 1, compared to 10.7% for GNN-based ones. This suggests that transformers have proof length generalization abilities inferior to those of GNNs.

4.4. GENERALIZING WITH SEARCH

We investigated whether performing search at test time can help agents generalize. Specifically, we investigated the effectiveness of Monte-Carlo Tree Search (MCTS) in finding proofs for unseen theorems with GNN-based agents. We chose GNN-based agents because they are better at out-of-distribution generalization than transformer-based ones. Straightforward application of MCTS is impractical: in our theorem proving environment, the action space can be as large as 1.3M in size (see Appendix H). Hence, it would be infeasible to expand all possible actions when constructing the MCTS trees. Thus, we only performed MCTS over the axiom space (18 distinct axioms in total), and the arguments were proposed by the behavior cloning agents. Following AlphaGo Zero/AlphaZero (Silver et al., 2017; 2018), we trained a value network to estimate the value of a state. The value network is an MLP with two hidden layers of size 256, taking the GNN global representations of graphs as input. It was trained on 1000 episodes of rollouts obtained by the behavior cloning agents, with a learning rate of $3 \cdot 10^{-6}$. We also followed AlphaZero for the choice of the upper confidence bound, and the way that actions are proposed using visit counts. We used 200 simulations for constructing MCTS trees. More details can be found in Appendix F. We took the agents trained on "K3L3", "K3L5", and "K3L7" from section 4.3, and evaluated the agents' performance when boosted by MCTS.

Generalization The average success rates on 1000 test theorems are presented on the left in Table 2. We can see that search greatly improved the generalization results. It helped to solve 21% more problems on average for the agent trained on theorem distribution K3L7. Remarkably, when evaluating on K3L7 theorems, search helped the K3L3 agent improve its success rate from 25% to 40%: a relative improvement of 60%. It is interesting to see that the K3L7 behavior cloning agent solved 9% fewer problems on average than the K3L5 agent. But search brought about a much larger improvement to the K3L7 agent and helped it to solve the largest proportion of problems on average, 90%. This indicates that skills learned through behavior cloning can be better exploited by searching.

The average proof length for 1000 problems is presented on the right in Table 2 (we count each unsolved problem as 15, the step limit of an episode). We can see that by performing search, we are able to discover proofs of length closer to the ground truth proof length. For test theorems requiring 3-step proofs, the K3L3 agent was able to prove them in 3.33 steps on average, with a gap of 0.33 steps to the optimal value. Similarly, for test theorems requiring 5-step proofs, the K3L5 agent was able to prove them in 5.52 steps on average, with a gap of 0.52 steps; and for theorems requiring 7-step proofs, the K3L7 agent achieved a gap of 0.5 steps.
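The two summary statistics used in this subsection can be made concrete with a small sketch: the average proof length, where an unsolved problem is counted at the step limit of 15, and the relative improvement in success rate. The function names and the convention of marking unsolved problems with `None` are assumptions made for this illustration.

```python
STEP_LIMIT = 15  # step limit of an episode, as stated in the text


def average_proof_length(proof_lengths, step_limit=STEP_LIMIT):
    """Mean proof length; None marks an unsolved problem (counted as step_limit)."""
    if not proof_lengths:
        raise ValueError("need at least one problem")
    counted = [step_limit if n is None else n for n in proof_lengths]
    return sum(counted) / len(counted)


def relative_improvement(before, after):
    """Relative change of a success rate, e.g. 25 -> 40 gives 0.6."""
    return (after - before) / before


# Two solved problems (3 and 4 steps) and one unsolved problem.
example_mean = average_proof_length([3, 4, None])

# The K3L3 agent on K3L7 theorems: 25% without search, 40% with search.
example_gain = relative_improvement(25, 40)
```

Under these conventions `example_gain` is 0.6, matching the 60% relative improvement quoted above, and `example_mean` is (3 + 4 + 15) / 3.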

4.5. DISCUSSION

Experimental results suggested that transformer-based agents can complete more proofs in the IID generalization scenario but have larger out-of-distribution generalization gaps than GNN-based ones. The larger gap may be due to the lack of constraints in the sequence-to-sequence framework, in which the model can propose sequences that are invalid actions, whereas the graph interface constrains the model to propose valid actions only. However, we still see that transformers are able to complete more proofs overall. This shows the superiority of transformers in model capacity when applied to theorem proving. This insight motivates us to explore the possibility of taking the best from both worlds, combining both graph structural information and the strong transformer architecture to improve learning-assisted theorem proving. We leave it for future work.

5. CONCLUSION

We addressed the problem of diagnosing the generalization weaknesses in learning-assisted theorem provers. We constructed INT, a synthetic benchmark of inequalities, to analyze the generalization of machine learning methods. We evaluated transformer-based and GNN-based agents and a variation of GNN-based agents with MCTS at test time. Experiments revealed that transformer-based agents generalize better when the IID assumption holds while GNN-based agents generalize better in out-of-distribution scenarios. We also showed that search can boost the generalization ability of agents. We stress that proving theorems in INT is not an end in itself. A hard-coded expert system might perform well on INT but not generalize to real-world mathematical theorems. Therefore, INT should be treated as instrumental when diagnosing generalization of agents. The best practice is to use INT in conjunction with real mathematical datasets. We believe our benchmark can also be of interest to the learning community, facilitating research in studying generalization beyond the IID assumption. The agents' abilities to reason and to go beyond the IID assumption are essential in theorem proving, and studying how to acquire these abilities is at the frontier of learning research. In other domains requiring out-of-distribution generalization, such as making novel dialogs (Chen et al., 2017) or confronting unseen opponents in StarCraft (Vinyals et al., 2019), the requirements for data and computation forbid a generally affordable research environment. The INT benchmark provides a practical means of studying out-of-distribution generalization.

APPENDIX A AXIOM SPECIFICATIONS

MultiplicationCommutativity (MC)                 → a • b = b • a
MultiplicationAssociativity (MA)                 → a • (b • c) = (a • b) • c
MultiplicationSimplification (MS)                (a ≠ 0) ∧ (a = b) → 1 = a • (1/b)
AdditionMultiplicationLeftDistribution (AMLD)    → (a + b) • c = a • c + b • c
AdditionMultiplicationRightDistribution (AMRD)   → a • (b + c) = a • b + a • c
SquareDefinition (SD)                            → a^2 = a • a
MultiplicationOne (MO)                           → a • 1 = a
AdditionZero (AZ)                                → a + 0 = a
IneqMoveTerm (IMT)                               a + b ≥ c → a ≥ c + (-b)
FirstPrincipleOfInequality (FPOI)                (a ≥ b) ∧ (c ≥ d) → a + c ≥ b + d
SecondPrincipleOfInequality (SPOI)               (a ≥ b) ∧ (c ≥ 0) → a • c ≥ b • c

Table 3

APPENDIX B THE MORPH FUNCTION

We detail the morphing of C at each step as follows. For each axiom a, we define two symbolic patterns, $L_a$ and $R_a$, each represented by an expression (see Appendix C for full details). For example, if a is AdditionCommutativity, we use $L_a = x_1 + x_2$ to denote any formula that is a sum of two terms ($x_1$ and $x_2$ can be arbitrary terms). We check if one of the nodes in the computation graph of C has the structure defined by $L_a$. If so, we then transform that node to a formula specified by $R_a$. For example, if C is (p + q) + l = (p + (q + l)), p + q is a node that matches the pattern specified by $L_a$, in which $x_1 = p$ and $x_2 = q$. Let $R_a = x_2 + x_1$. We hence transform the node p + q to q + p as specified by $R_a$. As a result, C becomes (q + p) + l = (p + (q + l)). If there is no such node in the computation graph, we morph the core logic statement using the extension function E, defined in Appendix D. We sample nodes in available computation graphs and combine them with C, coming up with $C'$ and optionally a non-empty set of new premises $P_{new}$.

Algorithm 2 Theorem Generator (complete)
 1: function GENERATE_THEOREM(initial conditions I, axiom order A)
 2: Axiom order length L = len(A).
 3: Initialize core logic statement $C_0 \sim \mathrm{Uniform}(I)$, and the set of premises $P = \{C_0\}$.
 ...
    Collect N, the set of all nodes in the graphs.
 ...
    Extend C and get the set of premises: $C', P_{new} \leftarrow E(a, C, N)$.
10: end if
11: return $C', P_{new}$.
12: end function

The reasons that we have two sets of rules for morphing are as follows: 1) Transformation rules can only be applied when the axiom will produce an equality, while extension rules can be applied to any axiom. So in order to generate theorems with all the axioms, we need the extension rules. 2) Almost all the extension rules will complicate the core logic statement while none of the transformation rules will. If we only have extension rules, the goal generated can be very complex even if the proof is of moderate length. In order to generate compact theorems (goal not too complicated) with long proofs, the transformation rules are preferred. Therefore we only apply extension rules when transformation rules are not applicable.
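The transformation step of MORPH described above (find a node matching $L_a$, rewrite it to $R_a$) can be sketched with a small pattern matcher over expression trees. The nested-tuple representation, the convention that pattern variables are strings starting with `x`, and the helper names `match`, `substitute`, `matching_nodes` and `rewrite_at` are all assumptions made for this example; they are not taken from the INT code base.

```python
def match(pattern, expr, bindings):
    """Match expr against pattern; strings starting with 'x' are variables."""
    if isinstance(pattern, str):
        if pattern.startswith("x"):                    # pattern variable
            if pattern in bindings:
                return bindings if bindings[pattern] == expr else None
            return {**bindings, pattern: expr}
        return bindings if pattern == expr else None   # constant leaf
    if (not isinstance(expr, tuple) or len(expr) != len(pattern)
            or expr[0] != pattern[0]):                 # operator must agree
        return None
    for sub_pattern, sub_expr in zip(pattern[1:], expr[1:]):
        bindings = match(sub_pattern, sub_expr, bindings)
        if bindings is None:
            return None
    return bindings


def substitute(template, bindings):
    """Instantiate the right-hand pattern with the matched terms."""
    if isinstance(template, str):
        return bindings.get(template, template)
    return (template[0],) + tuple(substitute(t, bindings) for t in template[1:])


def matching_nodes(expr, pattern, path=()):
    """Yield the paths (child indices) of all nodes matching pattern."""
    if match(pattern, expr, {}) is not None:
        yield path
    if isinstance(expr, tuple):
        for i, child in enumerate(expr[1:], start=1):
            yield from matching_nodes(child, pattern, path + (i,))


def rewrite_at(expr, path, lhs_pattern, rhs_pattern):
    """Rewrite the node at path, which must match lhs_pattern."""
    if not path:
        bindings = match(lhs_pattern, expr, {})
        if bindings is None:
            raise ValueError("node does not match the left pattern")
        return substitute(rhs_pattern, bindings)
    i = path[0]
    child = rewrite_at(expr[i], path[1:], lhs_pattern, rhs_pattern)
    return expr[:i] + (child,) + expr[i + 1:]


# AdditionCommutativity: L_a = x1 + x2, R_a = x2 + x1.
L_AC = ("+", "x1", "x2")
R_AC = ("+", "x2", "x1")

# C is (p + q) + l = p + (q + l).
C = ("=", ("+", ("+", "p", "q"), "l"), ("+", "p", ("+", "q", "l")))
candidates = list(matching_nodes(C, L_AC))
morphed = rewrite_at(C, (1, 1), L_AC, R_AC)   # rewrite the node p + q
```

Here `candidates` lists the four sum nodes of C, and rewriting the node at path `(1, 1)` reproduces the example in the text: C becomes (q + p) + l = p + (q + l). A generator would sample one of the candidate paths; when `candidates` is empty, it would fall back to the extension function.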

APPENDIX C TRANSFORMATION RULES

The implementations of the transformation rules $L_a$ and $R_a$:

Axiom (a)                                  L_a                        R_a
AdditionCommutativity                      x_1 + x_2                  x_2 + x_1
AdditionAssociativity                      x_1 + (x_2 + x_3)          (x_1 + x_2) + x_3
AdditionSimplification                     x_1 + (-x_1)               0
MultiplicationCommutativity                x_1 • x_2                  x_2 • x_1
MultiplicationAssociativity                x_1 • (x_2 • x_3)          (x_1 • x_2) • x_3
MultiplicationSimplification               x_1 • (1/x_1)              1
AdditionMultiplicationLeftDistribution     (x_1 + x_2) • x_3          x_1 • x_3 + x_2 • x_3
AdditionMultiplicationRightDistribution    x_1 • (x_2 + x_3)          x_1 • x_2 + x_1 • x_3
SquareDefinition                           x_1^2                      x_1 • x_1
MultiplicationOne                          x_1 • 1 or 1 • x_1         x_1
AdditionZero                               x_1 + 0 or 0 + x_1         x_1

Since an action in the MDP consists of an axiom and a list of nodes as its arguments and the number of axioms is fixed, the number of nodes available determines the size of the action space. Therefore it is interesting to investigate how many nodes are available in a proof. In Figure 7 we present the average number of nodes in proofs of different length. The figure shows that the longer the proofs, the more nodes there are, as expected. Comparing the axiom sets used, we find that the average number of nodes for ordered-field axioms is larger than that of field axioms. This is likely the consequence of ordered-field axioms, in generation, being more capable of producing new premises (e.g. First Principle of Inequality will produce an inequality premise (see Table 6), thus adding more nodes in the graphs).

We give more experimental details for the use of MCTS. Following Silver et al. (2017), in the selection step of the MCTS tree construction, we use the following formula to select the next action:

$$a^* = \arg\max_a \left[ Q(s, a) + c_{puct} \, P(s, a) \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)} \right],$$

where $Q(s, a)$ represents the action value function, $N(s, a)$ denotes the visit counts, $P(s, a)$ is the prior probability, and $c_{puct}$ is a constant hyperparameter. In all of our experiments, we used the behavior cloning policy for computing $P(s, a)$, and we used $c_{puct} = 1$. After the MCTS tree is built, the action is sampled from the policy distribution $\pi(a \mid s) = N(s, a)^{1/\tau}$, where $\tau$ is a hyperparameter and was chosen as 1 in our experiments.
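The selection rule and the visit-count policy described for MCTS can be sketched as below. This is a minimal illustration under stated assumptions: statistics are kept in plain dictionaries keyed by axiom name, the exploration term uses the square root of the total visit count as in Silver et al. (2017), and the visit counts are raised to the power 1/τ and normalized as in AlphaZero (with τ = 1, as used in the experiments, this is simply proportional to the visit counts). The function names are hypothetical.

```python
import math
import random


def puct_select(q, n, prior, c_puct=1.0):
    """Pick argmax_a Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    sqrt_total = math.sqrt(sum(n.values()))

    def score(action):
        exploration = c_puct * prior[action] * sqrt_total / (1 + n[action])
        return q[action] + exploration

    return max(prior, key=score)


def visit_count_policy(n, tau=1.0):
    """Normalize N(s,a)^(1/tau) into a probability distribution over actions."""
    weights = {a: count ** (1.0 / tau) for a, count in n.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no visits recorded at this node")
    return {a: w / total for a, w in weights.items()}


def sample_action(n, tau=1.0, rng=random):
    """Sample the action to play from the visit-count policy."""
    policy = visit_count_policy(n, tau)
    actions = list(policy)
    return rng.choices(actions, weights=[policy[a] for a in actions], k=1)[0]


# Toy statistics for two axioms at one node.
q_values = {"AC": 0.5, "AA": 0.4}
visits = {"AC": 8, "AA": 1}
priors = {"AC": 0.5, "AA": 0.5}
chosen = puct_select(q_values, visits, priors)       # favors the rarely visited AA
policy = visit_count_policy({"AC": 3, "AA": 1})
```

In the toy node, `AA` has the lower value estimate but far fewer visits, so its exploration bonus (0.5 · 3 / 2 = 0.75) outweighs the gap and it is selected. Restricting the keys to axiom names mirrors the choice in Section 4.4 of searching over the axiom space only.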



Figure 1: A proof of a + b + c = c + a + b in LEAN and INT, with seq2seq and graph interfaces.
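For readers without access to the figure, the following is a minimal Lean 4 sketch of the same statement proved by rewriting. It is not the proof shown in Figure 1: it is stated over the natural numbers for simplicity and relies on the core lemmas `Nat.add_comm` and `Nat.add_assoc`, both of which are assumptions of this illustration rather than choices made in the paper.

```lean
-- a + b + c parses as (a + b) + c.
example (a b c : Nat) : a + b + c = c + a + b := by
  -- First rewrite: (a + b) + c becomes c + (a + b).
  -- Second rewrite: (c + a) + b on the right becomes c + (a + b).
  rw [Nat.add_comm (a + b) c, Nat.add_assoc]
```

Each rewrite simplifies the goal until both sides coincide, which is the same goal-simplification style the INT seq2seq interface follows.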

Generate a theorem with initial conditions I: {a = a, b = b, c = c, d = d, e = e} and axiom order A: [AdditionAssociativity (AA), AdditionCommutativity (AC), EquivalenceImpliesDoubleInequality (EIDI), FirstPrincipleOfInequality (FPI)].
Core logic statement C0 ∼ Uniform(I): a = a.
Step 1: a1 = AA. C1: a + (b + c) = (a + b) + c, P1 = ∅.
Step 2: a2 = AC. C2: a + (b + c) = (b + a) + c, P2 = ∅.
Step 3: a3 = EIDI. C3: a + (b + c) ≥ (b + a) + c, P3 = ∅.
Step 4: a4 = FPI. C4: (a + (b + c)) + d ≥ ((b + a) + c) + e, P4 = {d ≥ e}.
Theorem generated: Given d ≥ e, prove a + (b + c) + d ≥ b + a + c + e.
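As a sanity check on the worked example, the generated theorem can be tested numerically: whenever the premise d ≥ e holds, the goal should hold too. The sketch below searches random integers for a counterexample; the function names, the value range, and the number of trials are arbitrary choices for this illustration, and random testing is of course not a proof.

```python
import random


def goal_holds(a, b, c, d, e):
    """Goal of the generated theorem: a + (b + c) + d >= b + a + c + e."""
    return a + (b + c) + d >= b + a + c + e


def find_counterexample(trials=10_000, seed=0):
    """Return integers satisfying the premise d >= e but violating the goal."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c, d, e = (rng.randint(-100, 100) for _ in range(5))
        if d >= e and not goal_holds(a, b, c, d, e):
            return (a, b, c, d, e)
    return None  # no counterexample found in the sampled points


counterexample = find_counterexample()
```

No counterexample exists, since the two sides differ exactly by d − e. Dropping the premise breaks the goal: for instance a = b = c = d = 0 and e = 1 violates it, which is why the generator must return P4 = {d ≥ e} along with the goal.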

Figure 2: Proof success rates on problems generated with different K and L parameters. Left: When the IID assumption holds, the success rate decreases as the two generation parameters K and L are increased. Right: All agents are trained on degree-0 problems and evaluated against problems of degree 0, 1, and 2. We find that transformer-based agents deteriorate in performance as the test problems become more complex than training problems. For GNN-based agents, there are no obvious trends as to how the proof success rate changes as the degree of the initial entities is varied.

Figure 3: Proof success rates on test problems generated with K and L settings. Transformer and GNN perform well; TreeLSTM has mediocre performance; and Bag-of-Words performs poorly: it cannot prove more than 5% of problems.

AdditionCommutativity (AC)                   → a + b = b + a
AdditionAssociativity (AA)                   → a + (b + c) = (a + b) + c
AdditionSimplification (AS)                  a = b → a + (-b) = 0
PrincipleOfEquality                          (a = b) ∧ (c = d) → a + c = b + d
EquMoveTerm (Helper axiom) (EMT)             a + b = c → a = c + (-b)

Ordered field axioms                         Definition
All field axioms
SquareGEQZero (SGEQZ)                        a = b → a • b ≥ 0
EquivalenceImpliesDoubleInequality (EIDI)    a = b → (a ≥ b) ∧ (a ≤ b)

Figure 7

$$a^* = \arg\max_a \left[ Q(s, a) + c_{puct} \, P(s, a) \, \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)} \right]$$

t-1 and derive a new logic statement C t and corresponding premises P t . The key design idea in the MORPH function is to ensure that the newly

Left: Average success rates (in %) of agents trained on different numbers of axiom orders. Right: Average success rates (in %) of agents trained on different numbers of axiom combinations.

Let A and B represent two different axioms. There are multiple orders in which they can be applied in a K2L3 problem. $O_1 = [A, A, B]$ and $O_2 = [B, A, B]$ are two examples. Can an agent trained on problems generated with $O_1$ prove theorems generated with $O_2$? For both architectures, we investigated how well agents can generalize to problems with different [...]

Proof success rates on problems generated with different parameters. Left: We keep L the same and vary K. The success rate is likely to decrease when the test problems have different K from the training problems. Right: We keep K the same and vary L. For all agents, the proof success rate is lower on theorems that require longer proofs.

[...] GNN-based ones, as their average generalization gap is larger. This is particularly true when the number of axiom orders in the training set is 100: transformer-based agents can prove only 10.0% of test theorems. Remarkably, they still manage to complete more proofs than GNNs when the number of axiom orders in the training set exceeds 500.

The behavior cloning (BC) agents versus the MCTS-assisted (search) agents. Left: The average success rates (in %) of agents with and without MCTS over 1000 test theorems. Right: The average length of successful proofs by agents with and without MCTS over 1000 test theorems. K denotes the cardinality of the axiom combination of a proof, L denotes the length of the proof.

Top: Proof success rates (in %) of agents trained on different numbers of axiom orders. Bottom: Proof success rates (in %) of agents trained on different numbers of axiom combinations. K denotes the cardinality of the axiom combination of a proof, L denotes the length of the proof.
Average 87.6 21.1 86.6 53.6 79.0 70.4 75.7 74.7
Average 79.1 47.5 76.6 68.0 72.6 72.4 72.8 71.9

ACKNOWLEDGEMENTS

We thank Jay McClelland, Han Huang and Yuanhao Wang for helpful comments and discussions. We also thank anonymous reviewers for valuable and constructive feedback. We are grateful to the Vector Institute for providing computing resources. YW was supported by the Google PhD fellowship. AQJ was supported by a Vector Institute research grant.

APPENDIX D EXTENSION FUNCTION

For these axioms, the core logic statement C needs to be of the form LHS(C) = RHS(C).

Published as a conference paper at ICLR 2021

For these axioms, the core logic statement C needs to be of the form LHS(C) ≥ RHS(C).

Axiom (a)                     Extension function E(C, a, N)
IneqMoveTerm                  Only execute when LHS(C) is of the form x + y; return x ≥ RHS(C) + (-y), ∅
FirstPrincipleOfInequality    Sample nodes n_1, n_2 ∼ N, where [...]

We compare the length of the theorems generated in characters and plot their distributions in Figure 5. The length of the theorem in characters is a measure for how complicated it is. As is expected, the more complicated the theorem is, the longer the proof (bigger L). It is also worth noting that as L becomes bigger, the distribution of theorem length becomes less concentrated. This is likely a consequence of a more spread-out theorem length range.

Figure 5: The distribution of theorem length in characters for field axioms (left) and ordered-field axioms (right) generated with parameters K3L3, K3L5, and K3L7. As the length of the proof is increased, so is the number of characters in the theorem, while the distribution of the latter is less concentrated.

APPENDIX E.2 AXIOM DISTRIBUTIONS

The frequency at which each axiom is applied influences the distribution of theorems our generator is able to produce. In Figure 6, we present the proportions of axioms that are applied in generating 10,000 theorems. Their frequencies are a measure of how easy it is to satisfy the conditions to apply them. For the field axioms, the PrincipleOfEquality axiom is the most frequently used (9.30%) and the EquMoveTerm axiom is the most rarely used (2.38%). EquMoveTerm has a strict condition for application: the left-hand side of the core logic statement has to be of the form x + y, so it is not frequently applied. For the ordered-field axioms, the EquivalenceImpliesDoubleInequality axiom is the most frequently used (10.18%). Since we start with a trivial equality in generation and want to end up with an inequality, a transition from equality to inequality is needed. Among the ways of transitioning, the conditions for applying this axiom are the easiest to satisfy. It is followed in frequency by the group of field axioms, ranging from MultiplicationCommutativity (4.69%) to AdditionAssociativity (5.98%). The rest are ordered-field axioms, which define the properties of inequalities, with proportions ranging from IneqMoveTerm (1.14%) to FirstPrincipleOfInequality (5.74%).

The agents converge more slowly and to a lower success rate when the proof length is increased. Also, the agents on field axioms are easier to train than those on ordered-field axioms.
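The axiom proportions discussed in this appendix could be computed along the following lines. This is only a minimal sketch: the function name `axiom_proportions` and the representation of a proof as a list of axiom names are assumptions made for illustration, and the toy input is hypothetical rather than data from the generator.

```python
from collections import Counter


def axiom_proportions(proofs):
    """Return {axiom name: percentage of all axiom applications}."""
    counts = Counter(axiom for proof in proofs for axiom in proof)
    total = sum(counts.values())
    return {axiom: 100.0 * n / total for axiom, n in counts.items()}


# Hypothetical toy input: each proof is the list of axiom names applied in it.
toy_proofs = [
    ["MultiplicationCommutativity", "PrincipleOfEquality", "EquMoveTerm"],
    ["PrincipleOfEquality", "AdditionAssociativity"],
    ["PrincipleOfEquality", "MultiplicationCommutativity", "AdditionAssociativity"],
]
proportions = axiom_proportions(toy_proofs)
```

In this toy input PrincipleOfEquality accounts for 3 of the 8 axiom applications, so its proportion is 37.5%.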

APPENDIX G.2 PERFORMANCE VARIATION OF TRAINED AGENTS

To verify that the experimental results are statistically significant, we ran the proof length generalization experiments of Subsection 4.3 with 5 random seeds and tabulated the results.

Table 7: Success rates (mean ± std, in %) of agents trained and tested on problems with different parameters.

Transformers

| Trained on \ Tested on | K3 L3 | K3 L5 | K3 L7 |
|---|---|---|---|
| K3 L3 | 97.6 ± 0.9 | 31.5 ± 1.6 | 10.9 ± 1.0 |
| K3 L5 | 97.2 ± 0.7 | 88.3 ± 1.2 | 59.5 ± 1.6 |
| K3 L7 | 96.6 ± 1.2 | 87.0 ± 1.6 | 75.1 ± 1.2 |

Figure 12: Proof success rates on problems generated with different parameters (K denotes the cardinality of the axiom combination of a proof, L denotes the length of the proof). We keep parameter K the same and vary parameter L. For all agents, the proof success rate is lower on theorems that require longer proofs. The best-performing agent for problems of a given length is usually the agent trained on problems of the same length.
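Entries of the form mean ± std, as in Table 7, can be produced from per-seed success rates as sketched below. The per-seed values here are hypothetical placeholders, not results from the paper, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def summarize(success_rates):
    """Format per-seed success rates (in %) as 'mean ± std' (sample std)."""
    return f"{mean(success_rates):.1f} ± {stdev(success_rates):.1f}"


# Hypothetical per-seed success rates for one (train, test) configuration.
example_rates = [90.0, 92.0, 91.0, 89.0, 93.0]
summary = summarize(example_rates)
```

With five seeds the sample and population standard deviations differ noticeably, so it is worth stating which one a table reports.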

APPENDIX H THEOREM PROVING AS A MARKOV DECISION PROCESS (MDP)

We model theorem proving as a Markov Decision Process. A state s in the MDP is the proof state maintained by the assistant, namely, the goal, the premises and the proven facts, represented by computation graphs. An action a is a tuple of an axiom and a sequence of arguments. We denote the axiom space as X and the argument space, the set of all the nodes in available computation graphs, as N. The maximum number of arguments for one axiom within our axiomatizations is 3, therefore the action space is A = X × N^3. The assistant ignores redundant arguments if fewer than 3 are needed for the axiom considered. We show in Appendix E.3 the distribution of the number of nodes for proofs of different lengths. The size of the discrete action space can be as large as 18 × 42^3 ≈ 1.33 × 10^6. The deterministic state transition function P(s, a) is implicitly determined by the proof assistant. When the proof assistant deems the proof complete and the theorem proven, the episode terminates and a reward of one is given. Otherwise, the reward is zero at each step. When the step limit for a proof is exhausted, the episode terminates with a reward of zero. For experiments in this paper, we used a step limit of 15.
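As a rough illustration of the MDP described above, the sketch below computes the size of the action space A = X × N^3 and rolls out a single episode under the sparse reward and the step limit. The `ToyProofEnv` class, its `step` method, and the `run_episode` helper are hypothetical stand-ins for the proof assistant and are not INT's actual interface.

```python
NUM_AXIOMS = 18   # |X|, taken from the 18 × 42^3 figure in the text
MAX_NODES = 42    # |N|, taken from the same figure
MAX_ARGS = 3      # maximum number of arguments per axiom
STEP_LIMIT = 15   # step limit used in the experiments


def action_space_size(num_axioms=NUM_AXIOMS, num_nodes=MAX_NODES, max_args=MAX_ARGS):
    """|A| = |X| * |N|^max_args for the product space X × N^3."""
    return num_axioms * num_nodes ** max_args


class ToyProofEnv:
    """Hypothetical environment: the proof completes on one designated action."""

    def __init__(self, winning_action):
        self.winning_action = winning_action

    def step(self, action):
        # Reward is 1 only when the proof is complete, and 0 otherwise.
        done = action == self.winning_action
        return (1 if done else 0), done


def run_episode(env, policy, step_limit=STEP_LIMIT):
    """Roll out one episode and return (reward, number of steps taken)."""
    for t in range(1, step_limit + 1):
        reward, done = env.step(policy())
        if done:
            return reward, t
    # Step limit exhausted: the episode ends with a reward of zero.
    return 0, step_limit


size = action_space_size()  # 18 * 42**3 = 1,333,584
```

An action is represented here as a pair of an axiom index and a triple of node indices, for example `(0, (1, 2, 3))`; a policy is any zero-argument callable that returns such a pair.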

Equality theorems

[Examples of generated equality theorems (Theorem 1 to Theorem 50), each given as premises and a goal; formulas omitted.]

Inequality theorems

[Examples of generated inequality theorems, beginning with Theorem 1, each given as premises and a goal; formulas omitted.]

