MATHEMATICAL REASONING VIA SELF-SUPERVISED SKIP-TREE TRAINING

Abstract

We demonstrate that self-supervised language modeling applied to mathematical formulas enables logical reasoning. To measure the logical reasoning abilities of language models, we formulate several evaluation (downstream) tasks, such as inferring types, suggesting missing assumptions, and completing equalities. For training language models for formal mathematics, we propose a novel skip-tree task. We find that models trained on the skip-tree task show surprisingly strong mathematical reasoning abilities, and outperform models trained on standard skip-sequence tasks. We also analyze the models' ability to formulate new conjectures by measuring how often the predictions are provable and useful in other proofs.

Reasoning can refer to a wide range of abilities, and thus we measure the mathematical reasoning abilities of language models on a variety of tasks, including mechanical derivations, such as type inference, and also creative tasks, such as predicting under which assumptions a statement is true. As we want to study what reasoning capabilities can be acquired just through self-supervised training, we do not employ fine-tuning on these tasks. Instead, we designed the tasks to be syntactically similar to the training task, such that the language model may produce correct answers.

An advantage of formal language compared to natural language is that we can attempt to automatically evaluate statements. That is, we can let our language models produce conjectures, which we then try to prove using the DeepHOL theorem prover (Bansal et al., 2019; 2020). Besides evaluating the provability of the produced statements, we go one step further and evaluate their usefulness, by measuring how many times they are used as premises in proofs of other theorems.

Our contributions are as follows:
1. We show that self-supervised training on mathematical formulas alone leads to logical reasoning capabilities.
2. We introduce a new skip-tree training task that outperforms the state-of-the-art skip-sequence training. We also introduce several evaluation tasks that are subsumed by skip-tree training (i.e. predict a missing subexpression), but test specific logical reasoning abilities to make the performance of the models interpretable.
3. We suggest a way to create and evaluate mathematical conjectures using existing neural theorem provers.

The remainder of this paper is structured as follows: First, we review related work on language modeling and deep learning for mathematics in Section 2. Then, in Section 3 we discuss the source corpus of formal mathematical statements from which we generate our training data. In Section 4, we present the skip-tree training task, as well as several variations that we used in our ablation studies. We present the evaluation tasks in Section 5, discuss our experimental findings in Section 6, and conclude in Section 7.

RELATED WORK

1. INTRODUCTION

Language modeling using Transformers (Vaswani et al., 2017) has been hugely successful for applications like translation and text generation. Models like GPT are able to generate news articles and stories given just an abstract (Radford et al., 2018). These models are usually (pre-)trained on a proxy task, such as predicting missing words in the case of BERT (Devlin et al., 2019), before fine-tuning the models on more specific (downstream) tasks such as machine translation and question answering. These proxy tasks are not reliant on labels, and thus can be trained on large corpora of unlabeled data. Recently, however, we have seen successful demonstrations of language modeling using only self-supervised training without any fine-tuning (Brown et al., 2020).

In this work, we extend this line of thought and demonstrate that purely self-supervised training can even lead to mathematical reasoning abilities. This represents a major departure from prior work in deep learning for mathematics, which has focused on learning directly on logical reasoning tasks, such as predicting proof steps, premises, or assignments. These approaches require labeled data, which is hard to come by and typically very limited in size. In contrast, our language modeling approach to mathematics allows us to train on unlabeled mathematical expressions. We start with the HOList dataset (Bansal et al., 2019), which spans a wide range of mathematical topics, including topology, multivariate calculus, real and complex analysis, geometric algebra, and measure theory, formalized in the HOL Light proof assistant (Harrison, 1996). We find that training a language model on all mathematical expressions in this dataset leads to surprisingly strong mathematical reasoning capabilities.
We believe that this opens the door to different kinds of neural theorem provers, which do not only search through a well-defined search space of tactics and premises, but which are capable of generating their own lemmas and could even come up with a new Ansatz requiring a creative substitution. For self-supervised training on mathematical expressions, we propose a novel skip-tree task, which is a specialization of the skip-sequence task that respects the tree structure of expressions. We show that models trained on the skip-tree task significantly outperform those trained on the skip-sequence task, which is the state of the art for sequence-to-sequence models for natural language.

Most previous works that apply sequence-to-sequence models to logics have focused on specific logical tasks in supervised training settings (e.g. Piotrowski and Urban (2020a)). In contrast, we train language models on a self-supervised proxy task that does not require labeled data and can thus be applied to almost any source of mathematical expressions. Lample and Charton (2020) use a Transformer model for symbolic integration. They train their model directly on the task of producing the integral of a given expression. To generate training data, their approach needs a classical algorithm to compute the derivative of the expressions. Finkbeiner et al. (2020) explore the generalization properties of the Transformer architecture predicting the solutions to SAT formulas and temporal logic, but require a data generator that can solve formulas, which is currently not feasible for higher-order logic. Piotrowski et al. (2019) train RNNs on individual logical reasoning steps, such as substitutions, using a dataset of rewrites on polynomials extracted from Prover9. Wang et al. (2018) translate between synthetic descriptions in natural language and formal mathematics on a dataset generated with Mizar. Self-supervised training techniques for formal mathematics have received much less attention.
Wang et al. (2020) apply recent self-supervised translation techniques by Lample et al. (2018) to align formal and informal statements. Very recently, Li et al. (2020) and Polu and Sutskever (2020) applied language modeling to proofs of formal mathematics. In contrast, this work focuses on measuring reasoning abilities on mathematical statements (not necessarily proofs) achieved through self-supervised training only. Independently from our work, Urban and Jakubův (2020) presented initial experiments on applying self-supervised language modeling to formal mathematics in order to produce conjectures. However, they only evaluate the learned models through the truth of the produced conjectures, while we also consider several reasoning tasks and measure the usefulness of conjectures.

Earlier methods to produce conjectures were limited in scope. For example, Piotrowski and Urban (2020b) propose a method to predict the next literal in an automated theorem prover using recurrent neural networks after supervised training. Prior to that, Gauthier et al. (2016) relied only on statistical approaches to produce conjectures. Applying natural language techniques to formal mathematics has a long history. Already in 2004, Cairns (2004) applied information retrieval based on latent semantics to improve over search for keywords, and Urban (2004) formulated the intention to learn from large amounts of data in formalized mathematics. Transformer models for program understanding have focused on providing inductive biases in the architecture (Shiv and Quirk, 2019; Hellendoorn et al., 2020), whereas this work suggests using a modified language modeling proxy task.

3. DATASET

We start from the HOList dataset introduced by Bansal et al. (2019). The complete dataset includes 29,465 theorems and their proofs. Here we consider only the "core" and "complex" datasets, which comprise 18,943 theorems, 637 definitions, and 603,950 proof steps. The theorems and proofs were written (by humans) using the HOL Light proof assistant, and span various areas of mathematics such as set theory, arithmetic, linear algebra, topology, and multivariate complex analysis. Each proof starts with the theorem statement as a proof goal. For each goal, the dataset contains a tactic that a human applied to it, which then produces a list of subgoals. Most tactics have arguments, such as previously proven theorems (also called premises), which are invoked by that tactic. From this dataset we extract all theorem statements as well as all intermediate proof goals.

We use S-expressions to represent statements. For example, consider x = y, which in S-expression syntax is represented as follows:

(a (a (c (fun (A) (fun (A) (bool))) =) (v A x)) (v A y))

Each subexpression here is either a leaf or a triple. The first element of these triples indicates their kind: a indicates function applications, c indicates constants (i.e. symbols that have been defined in the formal system), v indicates a variable, and finally fun indicates a function type. The equality operator "=" is represented by (c (fun (A) (fun (A) (bool))) =), which indicates that it is a constant that has type (fun (A) (fun (A) (bool))). This type indicates that it is a curried function with two arguments of arbitrary type, indicated by the generic type variable A, and returns a bool. The variables x and y are represented as (v A x) and (v A y). The v indicates that this subexpression is a variable, and A is a type variable, which indicates that x and y could have any type. Since both x and y use the same type variable A, they must have the same type.

In our dataset, we wrap each S-expression derived from a theorem in a <theorem> node, and all other expressions in a <goal> node. That is, an expression (c (fun ...) ...) is now represented as (<goal> (c (fun ...) ...)).

We use the same split as used in HOList (Bansal et al., 2019). They first split the theorems into training, validation, and test, and then assign all statements in the entire proof of each theorem to the same split as the theorem. This avoids partially revealing the proofs of theorems in the validation and test sets during training. In total, we use the proofs of 11,655 theorems in the training split of the core and complex libraries. We derive all training data from the theorems and proofs in the training set, and use only the theorems (not the proofs) for the evaluation tasks. This addresses the possibility that some proof steps for training theorems and for validation theorems might be shared. In Figure 1 we depict our choice of training and evaluation data.

Figure 1 (splits shown: Training, Validation, Testing): We use the theorems and proofs of the training split, marked in green, for training. For measuring the final performance of our evaluation tasks, we only used the theorems of the test set, marked in blue. This ensures that the models have never seen the statements from which the evaluation tasks are derived.

Figure 2: The skip-tree training task for the example of the equality operator on boolean constants (original formula). In this example we assume that a part of the type was sampled to be the subexpression to be predicted, and that subexpression c was sampled to be masked out additionally. Note that the input to the decoder is shifted to the right, such that the next-token prediction task yields the target sequence.
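The S-expression syntax described in this section can be read into a tree with a few lines of code. The following is a minimal sketch, not the tooling used for HOList: the function names `tokenize`, `parse`, and `unparse` are hypothetical, and the representation (nested Python lists with string leaves) is an assumption made for illustration.

```python
def tokenize(text):
    """Split an S-expression string into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse a token list into nested lists; atoms stay plain strings."""
    def read(pos):
        if tokens[pos] != "(":
            return tokens[pos], pos + 1
        node, pos = [], pos + 1
        while tokens[pos] != ")":
            child, pos = read(pos)
            node.append(child)
        return node, pos + 1  # skip the closing parenthesis

    tree, end = read(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after the first expression")
    return tree


def unparse(tree):
    """Convert a nested-list tree back into S-expression text."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(unparse(child) for child in tree) + ")"


# The example x = y from the text.
expr = "(a (a (c (fun (A) (fun (A) (bool))) =) (v A x)) (v A y))"
tree = parse(tokenize(expr))
equality_constant = tree[1][1]  # the (c <type> =) triple
```

In this representation the first element of each triple is its kind tag (a, c, v, or fun), so `tree[0]` is the outermost function application and `equality_constant[1]` is the type of "=".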

4. SKIP-TREE TRAINING

In this section we define the skip-tree training task. We parse a given mathematical statement into a tree of subexpressions, and replace one of the subexpressions by a <PREDICT> token. The task is to predict the subexpression replaced by <PREDICT>. See Figure 2 for an example. For training, the trees are converted back to a sequence of tokens; the target sequence is extended by a <START> token in the front and an <END> token in the back.

We filter out training examples that
• are derived from an input sequence that is longer than 10k characters, as our models would see only a small prefix of the input sequence (encoder length is 1024 tokens), and

The evaluation sets were also filtered, but only 0.16% of the evaluation examples were dropped.

Additional masked subexpressions. In addition to the subexpression to be masked out by <PREDICT>, we select k = 2 subexpressions to be masked out by a different mask token <MASK>. In contrast to the <PREDICT> token, we replace all occurrences of these subexpressions by the <MASK> token. Note that the subexpressions we want to replace by <MASK> tokens may overlap with each other or with the subexpression replaced by the <PREDICT> token. In this case, we give the highest preference to the <PREDICT> token, and then proceed in decreasing order of size for the expressions to be replaced by <MASK> tokens. The subexpressions masked by <MASK> do not have to be predicted; they are only hidden. They make the model tolerant to having only partial information. As a side effect, masking out additional terms makes the tasks harder and shorter.

Distributions of subexpressions. Sampling subexpressions uniformly at random results in very short sequences to be predicted: since our trees are mostly ternary, two thirds of the subexpressions are leaves. Besides picking subexpressions uniformly at random, we thus experiment with weighting the subexpressions by the number of tokens they contain, which results in a much more diverse set of expressions to be sampled. We refer to these variants as "uniform" and "weighted". The experiments show that weighting helps on some, but not all, tasks.

Multiple samples per statement. We generate up to n = 100 training examples from each statement in the source data set by sampling different subexpressions to predict (and different <MASK> tokens). To avoid duplicates, we sample the subexpressions to predict without replacement. For small formulas with fewer than 100 subexpressions, this reduces the number of training examples we generate from them.

Our initial data, the core and complex corpus of HOList, consists of 604K mathematical statements, of which 376K statements are in the training split. After sampling and filtering out training examples that do not fit the requirements listed in the beginning of this section, we are left with around 25.8M training examples. Table 1 lists the statistics of the datasets and various ablations.
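The construction of a single skip-tree training example can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' data pipeline: trees are nested Python lists with string leaves, the token count of a subexpression includes its parentheses, and the names `make_example`, `subexpressions`, and `num_tokens` are hypothetical. It draws one example per call and does not implement sampling without replacement across the n examples per statement.

```python
import random

PREDICT, MASK = "<PREDICT>", "<MASK>"


def unparse(tree):
    """Convert a nested-list tree into S-expression text."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(unparse(child) for child in tree) + ")"


def num_tokens(tree):
    """Token count of a subexpression (assumption: parentheses count)."""
    if isinstance(tree, str):
        return 1
    return 2 + sum(num_tokens(child) for child in tree)


def subexpressions(tree, path=()):
    """Yield (path, subtree) for every proper subexpression, leaves included."""
    if path:
        yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree):
            yield from subexpressions(child, path + (i,))


def make_example(tree, k=2, weighted=True, rng=random):
    """Return (input sequence, target sequence) for one skip-tree example."""
    candidates = list(subexpressions(tree))
    weights = [num_tokens(sub) if weighted else 1 for _, sub in candidates]
    predict_path, target = rng.choices(candidates, weights=weights)[0]

    # Mask candidates: neither the target position nor a node containing it.
    others = [sub for path, sub in candidates
              if path != predict_path[:len(path)]]
    masks = rng.sample(others, min(k, len(others)))

    def render(node, path):
        if path == predict_path:
            return PREDICT  # <PREDICT> has the highest priority
        contains_target = path == predict_path[:len(path)]
        # Top-down traversal replaces the largest matching subexpression first.
        if not contains_target and node in masks:
            return MASK
        if isinstance(node, str):
            return node
        parts = (render(child, path + (i,)) for i, child in enumerate(node))
        return "(" + " ".join(parts) + ")"

    return render(tree, ()), f"<START> {unparse(target)} <END>"


# The tree for x = y from Section 3.
TREE = ["a", ["a", ["c", ["fun", ["A"], ["fun", ["A"], ["bool"]]], "="],
              ["v", "A", "x"]], ["v", "A", "y"]]
source, target = make_example(TREE, k=2, weighted=True, rng=random.Random(0))
```

Passing `weighted=False` gives the "uniform" variant, and `k=0` corresponds to the "skip-tree (no <MASK>)" ablation described in Section 4.1.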

4.1. ABLATIONS

To verify the design choices of the skip-tree training task we generated multiple variants of the training task and trained a model on each of them.

No mask tokens. To answer the question of whether it helps to mask out subexpressions besides the one to predict, we generated a dataset with k = 0, called "skip-tree (no <MASK>)".

Fewer samples per statement. Instead of sampling many training examples from each formula, we could train on fewer training examples for more epochs. We generated a smaller version of the skip-tree training data with n = 20, which we call "skip-tree (small)".

Skip-sequence. For natural language, the state-of-the-art self-supervised training task for sequence-to-sequence models is the skip-sequence task (see MASS (Song et al., 2019), SpanBERT (Joshi et al., 2020), and T5 (Raffel et al., 2019)). The skip-tree task is similar to the skip-sequence task. But instead of predicting arbitrary subsequences as in the skip-sequence task, the skip-tree task makes sure that the subsequence to predict is a subexpression. For example, for the statement a + b = c + d, the skip-sequence task may select the subsequence "= c+", but the skip-tree task would only allow us to pick valid subexpressions, such as "a + b". Our experiments will show that this subtle difference has a dramatic impact on the ability of our language models to predict correct answers to reasoning tasks. We generated three datasets for the skip-sequence task, where we sample subsequences of different lengths (short/medium/long), with lengths up to 50, 100, and 512 tokens.

Unfiltered. While generating the training data, we filtered out mathematical statements that exceed 10k characters. This removed some 32% of the statements from the HOList corpus. In this ablation study we removed this filter and trained on the full corpus.
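The difference between the two span-selection rules can be made concrete on the S-expression for x = y. The sketch below assumes that the skip-sequence task may pick any contiguous token span, while the skip-tree task only picks spans that form exactly one subexpression; the helper names are hypothetical.

```python
def tokenize(text):
    """Split an S-expression string into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def is_subexpression(span):
    """True if the token span is exactly one atom or one balanced group."""
    if len(span) == 1:
        return span[0] not in ("(", ")")
    if span[0] != "(" or span[-1] != ")":
        return False
    depth = 0
    for i, token in enumerate(span):
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        if depth == 0 and i < len(span) - 1:
            return False  # the first group closed before the span ended
    return depth == 0


tokens = tokenize("(a (a (c (fun (A) (fun (A) (bool))) =) (v A x)) (v A y))")
n = len(tokens)
spans = [tokens[i:j] for i in range(n) for j in range(i + 1, n + 1)]
valid = [span for span in spans if is_subexpression(span)]
fraction_valid = len(valid) / len(spans)
```

Only the spans collected in `valid` are admissible skip-tree targets; every other contiguous span cuts through the tree structure, as in the "= c+" example above.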

5. EVALUATION TASKS

In this section we suggest several logical reasoning tasks on which language models can be evaluated. These tasks require different levels of logical reasoning, ranging from mostly mechanical tasks, such as type inference, to more creative tasks, such as predicting missing assumptions. We intentionally define them to have the same format as the training task, i.e. predict a missing part of a larger expression, as this allows us to test models without fine-tuning. However, the evaluation tasks are out-of-distribution in two ways: First, we generate them from the theorems (excluding proofs) from the validation/test sets. This ensures that the model has never seen the theorems from which we generated these evaluation tasks, nor has it seen the proofs of these theorems. This makes the tasks more challenging, and forces the models to go beyond memorization. Second, we mask out very specific elements, such as types and assumptions. This makes the results on the evaluation tasks easier to interpret. To give the interested reader a better impression of the evaluation tasks, we provide a list of randomly selected examples in Appendix E.

Type Inference. We generate type inference problems by omitting the typing annotation of variables, constants, and (lambda-)abstractions. We generated two variants: In the task we call "Type Inference," we replace only the selected type by the <PREDICT> token and do not mask out anything else. In the second variant, which we name "Hard Type Inference," we additionally replace all other types by the <MASK> token. The two tasks loosely correspond to deriving the first and the last type during type inference. Consider the following example of a "Type Inference" evaluation task:

(a (a (c <PREDICT> =) (v A x)) (v A y))

The type of the equality operator is uniquely defined, given the types of the two subterms of the equation. In this example the type could have been computed by a classical type inference algorithm. For the "Hard Type Inference" evaluation task, the input could look as follows:

(a (a (c <PREDICT> =) (v <MASK> x)) (v <MASK> y))

Now, the type inference task is highly ambiguous. In fact, variable x could have any type, and the equality operator would have to adapt to the type of its arguments accordingly.

Assumptions. This evaluation task is to predict missing assumptions for theorems in the validation set. We extract these tasks by searching for "top-level implications" and replacing their left operand by the <PREDICT> token. We define an implication operator "⇒" in an expression to be a top-level implication if it is either the top-most operator of the expression, or occurs only under quantifiers, conjunctions, disjunctions, or on the right side of other top-level implications. This definition helps us to avoid picking assumptions in negated parts of formulas. Note that we can have multiple top-level implications per validation theorem. Consider the abstracted example (a ⇒ b) ∧ (c ⇒ (d ⇒ e)). In this case, a, c, and d are all considered to be assumptions of top-level implications.

An example from the theorem database is x = y ⇒ a + x = a + y, for which the task is to predict x = y given <PREDICT> ⇒ a + x = a + y. (We omit the presentation of this example as an S-expression for the sake of readability.) At first, the expression to predict in this case may seem unique, but there are actually many ways to complete the task into a true statement; e.g. y = x or x = 0 ∧ y = 0. Still, most humans would likely guess x = y as it is simple and general, and because x occurs before y in the alphabet. To make a correct prediction, our language models thus not only have to reason about which statements are likely to be correct answers, but also need to find a reasonably general statement and know about naming conventions.

Below we give some examples of this reasoning task that we selected for their simplicity. (For a representative selection, see Appendix E.) While it is often easy to "see" that a given solution to such a task is correct, it can be non-trivial to come up with a solution in the first place. We encourage the reader to make their own predictions before looking up the ground truth in Appendix C:
• <PREDICT> ⇒ (g \ {s}) = g
• <PREDICT> ⇒ (x1/y1 = x2/y2 ⇔ x1 * y2 = x2 * y1)
• <PREDICT> ⇒ (x ⇔ (¬b ∨ x1) ∧ (b ∨ x0))

Equalities. Similar to the task of predicting missing assumptions, we ask to predict one side of a top-level equality in this task. Again, we define top-level equalities to be any equality that occurs as the top-level operator of the formula or occurs inside quantifiers, conjunctions, disjunctions, or on the right side of implications. For example, from the theorem ∀x. x = (x = True) we extract two evaluation examples: ∀x. <PREDICT> = (x = True) and ∀x. x = <PREDICT>. Again, we present some simple example tasks (in human-readable notation) and provide the ground truth as well as the model predictions in Appendix C:
• ∀x, n ∈ N : (x^n = 1) = <PREDICT>
• ∀m, n : n ≤ m ⇒ m - n + n = <PREDICT>
• ∀l, m : <PREDICT> = APPEND(REVERSE(m), REVERSE(l))
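The definition of top-level implications used for the Assumptions task can be sketched as a short recursive traversal. The formula encoding below (tuples tagged "imp", "and", "or", "not", "forall", "exists", with string atoms) is a simplified stand-in for the S-expressions of the dataset, and the function name `assumptions` is hypothetical.

```python
def assumptions(formula):
    """Yield the left operands of all top-level implications."""
    if isinstance(formula, str):
        return  # atoms contain no implications
    op = formula[0]
    if op == "imp":
        _, left, right = formula
        yield left
        # The right side of a top-level implication is still top-level.
        yield from assumptions(right)
    elif op in ("and", "or"):
        for operand in formula[1:]:
            yield from assumptions(operand)
    elif op in ("forall", "exists"):
        _, _variable, body = formula
        yield from assumptions(body)
    # Any other operator (e.g. negation) ends the top-level region, so
    # implications below it and on the left of implications are skipped.


# The abstracted example (a ⇒ b) ∧ (c ⇒ (d ⇒ e)) from the text.
example = ("and", ("imp", "a", "b"), ("imp", "c", ("imp", "d", "e")))
found = list(assumptions(example))

# An implication under a negation is not top-level.
negated = ("not", ("imp", "p", "q"))
found_negated = list(assumptions(negated))
```

Each element of `found` corresponds to one evaluation example, obtained by replacing that assumption with <PREDICT>. The same traversal with "eq" in place of "imp", yielding both operands, would give a comparable sketch for the Equalities task.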

6. RESULTS AND DISCUSSION

In language modeling for natural language one of the key metrics is how often the next token in the ground truth is correctly predicted. This is not an ideal measurement for formal mathematics as even a single incorrect token can invalidate the entire statement. Also, the S-expression representation is relatively lengthy and barely human-readable, so a token-level measurement does not allow us to compare our models to the natural language models in any case. In the first part of our evaluation we therefore focus on exact (i.e. syntactic) matches of the entire predicted statement.

We trained a Transformer architecture on the skip-tree dataset and each of the ablations for up to 1M steps (= parameter updates) with a batch size of 256. This means our models are trained for about 10 epochs (depending on dataset size). Our models have 39M trainable parameters; the hyperparameters are specified in the appendix. We trained them on an 8x8 TPU configuration, which equates to 128 cores. The training runs took between 5 and 15 hours, depending on the average length of the output sequences, which translates to up to 1.4 and 4.2 PetaFLOPs days per training run, respectively.

We measured each evaluation task on 1000 samples. The evaluation was performed on CPUs in a cloud computing framework on recent CPU architectures. Depending on the average length of the output sequences, this required different amounts of resources: (regular and hard) type inference took 4 CPU hours, equality completion took 24 CPU hours, and missing assumptions took 43 CPU hours. The average prediction time per token on CPU is around 60 milliseconds. Each evaluation task was repeated every 50k training steps on 1000 freshly sampled examples of the validation set. We then picked the best checkpoint (based on the results on the validation data) and evaluated it on the test set. In Table 2 we present the results of this experiment.

Skip-tree vs skip-sequence. The skip-tree task and its ablations clearly dominate the skip-sequence task. One major difference between skip-tree and skip-sequence tasks is the lack of <MASK> tokens in the skip-sequence task. We therefore have to compare its performance to the "Skip-tree (no <MASK>)" ablation study to get a fair picture. The length of the skipped sequences appears to play a substantial role, with Skip-sequence (medium), which masks out sequences of up to length 100, performing best. A manual inspection of the predictions of the skip-sequence models showed that they rarely parse or typecheck. It seems that the skip-sequence models consistently add surplus tokens at the end, or stop expressions too early; they appear to be unable to correctly identify the end of the expression.

Impact of <MASK> tokens. Hard Type Inference is the only evaluation task that contains the <MASK> token. Models trained on datasets without the <MASK> token perform poorly here (see grayed-out numbers in Table 2). The presence or absence of <MASK> tokens has only a minor impact on the other tasks, as we can observe through the comparison of "Skip-tree (weighted)" and "Skip-tree (no <MASK>)".

Multiple samples per statement. For each source statement in the HOList corpus we sampled n = 100 subexpressions to be masked out. Lowering n to 20 significantly decreased the performance, as we can see in the comparison between "Skip-tree (small)" and "Skip-tree (weighted)". Sampling many subexpressions per source statement appears to be a good way to increase the number of training examples from limited source data.

Uniform vs weighted sampling. Sampling subexpressions weighted by their size in the skip-tree task significantly improves the performance on the harder tasks, missing assumptions and equality completion. On the Type Inference tasks, the performance is very similar. We conjecture this is because the average size of the terms to predict is smaller for the uniform sampling strategy, which is more similar to the average size of the types to predict.
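The figure of about 10 epochs quoted at the start of this section can be checked from numbers stated in the paper, assuming that one epoch means a single pass over the roughly 25.8M examples of the full skip-tree training set (Section 4):

```latex
\[
\text{epochs} \;\approx\;
\frac{\text{steps} \times \text{batch size}}{\text{training examples}}
\;=\; \frac{10^{6} \times 256}{25.8 \times 10^{6}}
\;\approx\; 9.9
\]
```

Smaller ablation datasets are passed over correspondingly more often within the same 1M steps, which is why the epoch count depends on dataset size.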

6.1. CONJECTURING

In the experiments above, we measured how often the models predicted the ground truth in the evaluation tasks. We now change our point of view, and examine whether the models can be used to generate new and useful conjectures. In the following, we analyze all statements produced in the Assumptions and Equalities tasks in Table 2, and we introduce a new task, which we call free-form conjecturing. For free-form conjecturing we simply ask the model to produce theorems by presenting the input sequence (the "prompt"): (<theorem> <PREDICT>). The subexpression the model has to fill in is thus an entire theorem. We use a beam search with beam width 1024 to produce enough outputs for a meaningful evaluation.

How often are predictions true and new? For this measurement, we attempt to prove the conjectured statements with the DeepHOL theorem prover (Bansal et al., 2019). This gives us a lower bound on the number of true statements, as the version of the DeepHOL theorem prover used here can prove around 58% of the validation theorems. So we expect the estimates here to be considerably below the number of actually true statements. The results are given in Table 3. We believe that these measurements show a significant bias towards true statements. While in some tasks, less than half of the statements were provable, there are many more ways to write a false statement than a true statement.

Are the conjectures useful? For some evaluation tasks, the models could "cheat" on the truth metric by making the statements trivially true. For example, the models can predict False as an assumption, or complete the missing part of an equation by making it an identity (e.g. complete x = <PREDICT> by predicting x). In fact, manual inspection revealed several such cases. To make this measurable, we added the provable statements to the theorem database, ran the reinforcement learning experiments of the DeepHOL theorem prover (Bansal et al., 2019), and then measured how many of the statements were used as premises. In this experiment we also make sure that the new theorems cannot be used in the proofs of their premises. In a "pruning" step DeepHOL minimizes proofs by removing each individual premise in a proof and checking if the proof still holds. This filters out statements that have no effect on the proof. Only the premises that survive this step are classified as useful.

We ran three reinforcement learning experiments, one for each set of conjectures produced by one of the evaluation tasks. We then measured how many of the theorems generated by each task are used as a premise in one of the over 200,000 proofs found for each of the experiments. For the assumptions task, 505 of the 813 theorems were used at least once. For the equalities task and the free-form conjectures it was 831 out of 1811 and 54 out of 172, respectively. We provide usage histograms in Appendix B. While some of the most frequently used conjectures turned out to be alpha-equivalent to existing theorems in the database, we found some interesting examples among the most used conjectures produced:
• b = a + c ⇒ a = b - c.
• COUNTABLE({s(n) | n ∈ N}).
• ∀f, s : (∀x : x ∈ s ⇒ f(x) = vec(0)) ⇒ f integrable_on s.

In fact, humans have used the first conjectured theorem over vector arithmetic in many proofs. However, this theorem has always been defined as a local lemma and thus did not make it into the theorem database. For theorems two and three in the list above, thorough manual search has revealed no closely related statement in the theorem database. This suggests that self-supervised language models show some ability to produce new, useful conjectures, even without fine-tuning or specialized training.
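The pruning step described in this section, which removes each premise in turn and keeps the removal whenever the proof still holds, can be sketched as below. This is not the DeepHOL implementation or its API: `prune_premises` is a hypothetical name, and `proof_holds` is a hypothetical callback standing in for re-checking the proof with a reduced premise list.

```python
def prune_premises(premises, proof_holds):
    """Drop every premise whose removal leaves the proof intact."""
    kept = list(premises)
    for premise in list(premises):
        trial = [p for p in kept if p != premise]
        if proof_holds(trial):
            kept = trial  # the premise had no effect on the proof
    return kept


# Toy stand-in for a proof checker: the proof needs exactly A and B.
def toy_proof_holds(premises):
    return {"A", "B"} <= set(premises)


useful = prune_premises(["A", "conjecture_X", "B"], toy_proof_holds)
```

Under this sketch a generated conjecture counts as useful for a proof only if it is still in the returned list, mirroring the criterion that only premises surviving the pruning step are classified as useful.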

7. CONCLUSION

In this work, we applied the paradigms of self-supervised language modeling to formal mathematics and show that, surprisingly, this leads to mathematical reasoning capabilities. For training, we introduced a novel self-supervised skip-tree task for formal mathematics that outperforms existing training tasks used for natural language. We also suggested several evaluation tasks for measuring mathematical reasoning capabilities of language models for formal mathematics without the need of fine tuning. Finally, we explored the ability of language models to produce new conjectures by measuring how many of the new predictions are provable and useful for proving other theorems.

A HYPERPARAMETERS

We trained the Transformer models for the skip-tree tasks with these hyperparameters:
• vocabulary size: 1200

We settled on these hyperparameters after trying various alternatives. We explored encoders and decoders with up to 12 layers and various learning rates and intermediate sizes.

C A CLOSE LOOK AT SIMPLE EXAMPLE TASKS

Assumptions. In Section 5 we presented the following three examples of the task to predict missing assumptions. For the sake of readability we discuss only the pretty-printed versions here. For examples in S-expression syntax, please see Appendix E.
• <PREDICT> ⇒ (g \ {s}) = g
• <PREDICT> ⇒ (x1/y1 = x2/y2 ⇔ x1 * y2 = x2 * y1)
• <PREDICT> ⇒ (x ⇔ (¬b ∨ x1) ∧ (b ∨ x0))

The ground truth answers are as follows:
• ¬(s ∈ g)
• 0 < y1 ∧ 0 < y2 (note that 0 ≠ y1 ∧ 0 ≠ y2 would be a more general assumption)
• ((b ⇔ False) ⇒ (x ⇔ x0)) ∧ (b ⇔ True) ⇒ (x ⇔ x1)

We prompted one of our skip-tree models with these tasks. For the second and the third task, the model "skip-tree (weighted)" makes a correct prediction in the top 3 candidates in a beam search of width 8. For the first task, the model mostly produces incorrectly typed expressions: it appears to think that s is a set of the same type as g.

Equalities. We presented these examples for the equality evaluation task:
• ∀x, n ∈ N : (x^n = 1) = <PREDICT>
• ∀m, n : n ≤ m ⇒ m - n + n = <PREDICT>
• ∀l, m : <PREDICT> = APPEND(REVERSE(m), REVERSE(l))

The ground truth for the tasks is:
• x = 1 ∨ n = 0
• m
• REVERSE(APPEND(l, m))

Examples two and three are predicted correctly in a beam search with beam width 8. For the first example, the model almost gets it correct in two of the 8 attempts: x = 1 ∨ n = 1, and x = 0 ∨ n = 1. We find it surprising that the model apparently understands that there are two cases to consider, but that the exact combination of constants (1 and 0) is a challenge.

D MODEL PERFORMANCE BY TRAINING STEP

Figure 4 shows the performance of the model throughout training. Performance on the validation and test sets is very similar, although with some variance. Even after 1M steps, the model has apparently not quite converged.

Hard Type Inference.

• Prompt: (<theorem> (a (c <MASK> !) (l (v <MASK> s) (a (a (c <MASK> =) (a (c <MASK> INTERS) (v <MASK> s))) (a (a (c <PREDICT> DIFF) (c <MASK> UNIV)) (a (c <MASK> UNIONS) (a (c <MASK> GSPEC) (l (v <MASK> GEN%PVAR%0) (a (c <MASK> ?) (l (v <MASK> t) (a (a (a (c <MASK> SETSPEC) (v <MASK> GEN%PVAR%0)) (a (a (c <MASK> IN) (v <MASK> t)) (v <MASK> s))) (a (a (c <MASK> DIFF) (c <MASK> UNIV)) (v <MASK> t)))))))))))))
Ground truth: <START> (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) <END>
Source theorem pretty printed: !s. INTERS s = (:?0) DIFF UNIONS {(:?0) DIFF t | t IN s}

• Prompt: (<theorem> (a (c <MASK> !) (l (v <MASK> f) (a (c <MASK> !) (l (v <MASK> s) (a (a (c <MASK> =) (a (a (c <MASK> uniformly_continuous_on) (v <MASK> f)) (v <MASK> s))) (a (c <MASK> !) (l (v <MASK> e) (a (a (c <MASK> ==>) (a (a (c <MASK> real_lt) (a (c <MASK> real_of_num) (a (c <MASK> NUMERAL) (c <MASK> _0)))) (v <MASK> e))) (a (c <MASK> ?) (l (v <MASK> d) (a (a (c <MASK> ∧) (a (a (c <MASK> real_lt) (fun (bool) (fun (bool) (bool))) ==>) <PREDICT>) (v (bool) r))))
Ground truth: <START> (a (a (c (fun (bool) (fun (bool) (bool))) =) (v (bool) p)) (v (bool) q)) <END>
Source theorem pretty printed: q ∧ ∼p ==> (p <=> q) ==> r

• Prompt: (<theorem> (a (c (fun (fun (fun (cart (real) N) (real)) (bool)) (bool)) !) (l (v (fun (cart (real) N) (real)) f) (a (c (fun (fun (fun (real) (real)) (bool)) (bool)) !) (l (v (fun (real) (real)) g) (a (c (fun (fun (cart (real) N) (bool)) (bool)) !) (l (v (cart (real) N) x) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) <PREDICT>) (a (a (c (fun (fun (cart (real) N) (real)) (fun (net (cart (real) N)) (bool))) real_continuous) (a (a (c (fun (fun (real) (real)) (fun (fun (cart (real) N) (real)) (fun (cart (real) N) (real)))) o) (v (fun (real) (real)) g)) (v (fun (cart (real) N) (real)) f))) (a (c (fun (cart (real) N) (net (cart (real) N))) at) (v (cart (real) N) x)))))))))))
Ground truth: <START> (a (a (c (fun (bool) (fun (bool) (bool))) ∧) (a (a (c (fun (fun (cart (real) N) (real)) (fun (net (cart (real) N)) (bool))) real_continuous) (v (fun (cart (real) N) (real)) f)) (a (c (fun (cart (real) N) (net (cart (real) N))) at) (v (cart (real) N) x)))) (a (a (c (fun (fun (real) (real)) (fun (net (real)) (bool))) real_continuous) (v (fun (real) (real)) g)) (a (a (c (fun (net (real)) (fun (fun (real) (bool)) (net (real)))) within) (a (c (fun (real) (net (real))) atreal) (a (v (fun (cart (real) N) (real)) f) (v (cart (real) N) x)))) (a (a (c (fun (fun (cart (real) N) (real)) (fun (fun (cart (real) N) (bool)) (fun (real) (bool)))) IMAGE) (v (fun (cart (real) N) (real)) f)) (c (fun (cart (real) N) (bool)) UNIV))))) <END>
Source theorem pretty printed: !f g x. f real_continuous at x ∧ g real_continuous atreal (f x) within IMAGE f (:real^N) ==> g o f real_continuous at x

• Prompt: (<theorem> (a (c (fun (fun (fun (cart (real) M) (cart (real) N)) (bool)) (bool)) !) (l (v (fun (cart (real) M) (cart (real) N)) f) (a (c (fun (fun (fun (cart (real) M) (cart (real) P)) (bool)) (bool)) !) (l (v (fun (cart (real) M) (cart (real) P)) g) (a (c (fun (fun (fun (cart (real) M) (bool)) (bool)) (bool)) !) (l (v (fun (cart (real) M) (bool)) s) (a (c (fun (fun (num) (bool)) (bool)) !) (l (v (num) n) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) <PREDICT>) (a (a (a (c (fun (num) (fun (fun (cart (real) M) (bool)) (fun (fun (cart (real) M) (cart (real) (finite_sum N P))) (bool)))) baire) (v (num) n)) (v (fun (cart




Figure 2: The skip-tree training task for the example of the equality operator on boolean constants (original formula). In this example we assume that a part of the type was sampled to be the subexpression to be predicted, and that subexpression c was sampled to be masked out additionally. Note that the input to the decoder is shifted to the right, such that the next-token prediction task yields the target sequence.
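The construction described in this caption can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' data pipeline: the function names, the path-based addressing of subexpressions, and the requirement that masked paths do not overlap the predicted subtree are all hypothetical choices made for the example.

```python
import re

def tokenize(text):
    """Split an s-expression into parentheses and atoms."""
    return re.findall(r"[()]|[^\s()]+", text)

def parse(tokens):
    """Parse tokens into nested lists, one list per parenthesised group."""
    def walk(i):
        if tokens[i] == "(":
            node, i = [], i + 1
            while tokens[i] != ")":
                child, i = walk(i)
                node.append(child)
            return node, i + 1
        return tokens[i], i + 1
    tree, end = walk(0)
    assert end == len(tokens), "trailing tokens after the expression"
    return tree

def flatten(tree):
    """Turn a nested-list tree back into a flat token sequence."""
    if isinstance(tree, str):
        return [tree]
    out = ["("]
    for child in tree:
        out += flatten(child)
    return out + [")"]

def subtree(tree, path):
    """Follow a sequence of child indices down the tree."""
    for index in path:
        tree = tree[index]
    return tree

def replace(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    copy = list(tree)
    copy[path[0]] = replace(tree[path[0]], path[1:], new)
    return copy

def skip_tree_example(formula, predict_path, mask_paths=()):
    """Build encoder input, decoder input, and decoder target for one sample.

    Assumption: mask_paths do not lie inside the predicted subtree.
    """
    tree = parse(tokenize(formula))
    target = ["<START>"] + flatten(subtree(tree, predict_path)) + ["<END>"]
    masked = replace(tree, predict_path, "<PREDICT>")
    for path in mask_paths:
        masked = replace(masked, path, "<MASK>")
    return {
        "encoder_input": flatten(masked),
        "decoder_input": target[:-1],   # shifted right: starts with <START>
        "decoder_target": target[1:],   # next-token targets: ends with <END>
    }

example = skip_tree_example(
    "(c (fun (bool) (fun (bool) (bool))) =)",
    predict_path=(1, 2),    # the inner type (fun (bool) (bool))
    mask_paths=[(0,)],      # the leaf c
)
for name, tokens in example.items():
    print(name, ":", " ".join(tokens))
```

For the formula in the caption, the encoder sees the formula with `<PREDICT>` in place of the sampled part of the type and `<MASK>` in place of c, while the decoder input and target are the same token sequence offset by one position.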

Figure 3: Histograms of premise usage of the conjectures generated through the assumptions task (left), the equality task (middle), and through free-form conjecturing (right). X-axes are the new theorems, sorted by number of usages. Y-axes indicate the number of usages on a log scale.

Figure 4: Performance on the Missing Assumptions task of a model trained on the Skip-tree (weighted) task. Y-axis is the training steps and X-axis is the model performance as the ratio of correctly predicted examples. Left: validation data. Right: test data.

real) M) (bool)) s)) (l (v (cart (real) M) x) (a (a (c (fun (cart (real) N) (fun (cart (real) P) (cart (real) (finite_sum N P)))) pastecart) (a (v (fun (cart (real) M) (cart (real) N)) f) (v (cart (real) M) x))) (a (v (fun (cart (real) M) (cart (real) P)) g) (v (cart (real) M) x)))))))))))))))
Ground truth: <START> (a (a (c (fun (bool) (fun (bool) (bool))) ∧) (a (a (a (c (fun (num) (fun (fun (cart (real) M) (bool)) (fun (fun (cart (real) M) (cart (real) N)) (bool)))) baire) (v (num) n)) (v (fun (cart (real) M) (bool)) s)) (v (fun (cart (real) M) (cart (real) N)) f))) (a (a (a (c (fun (num) (fun (fun (cart (real) M) (bool)) (fun (fun (cart (real) M) (cart (real) P)) (bool)))) baire) (v (num) n)) (v (fun (cart (real) M) (bool)) s)) (v (fun (cart (real) M) (cart (real) P)) g))) <END>
Source theorem pretty printed: !f g s n. baire n s f ∧ baire n s g ==> baire n s (lambda x. pastecart (f x) (g x))

Equalities.

• Prompt: (<theorem> (a (c (fun (fun (fun ?0 (cart (real) (2))) (bool)) (bool)) !) (l (v (fun ?0 (cart (real) (2))) f) (a (c (fun (fun (fun ?0 (cart (real) (2))) (bool)) (bool)) !) (l (v (fun ?0 (cart (real) (2))) g) (a (c (fun (fun (fun ?0 (bool)) (bool)) (bool)) !) (l (v (fun ?0 (bool)) s) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (c (fun (fun ?0 (bool)) (bool)) FINITE) (v (fun ?0 (bool)) s))) (a (a (c (fun (cart (real) (2)) (fun (cart (real) (2)) (bool))) =) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (cart (real) (2))) (cart (real) (2)))) cproduct) (v (fun ?0 (bool)) s)) (l (v ?0 x) (a (a (c (fun (cart (real) (2)) (fun (cart (real) (2)) (cart (real) (2)))) complex_mul) (a (v (fun ?0 (cart (real) (2))) f) (v ?0 x))) (a (v (fun ?0 (cart (real) (2))) g) (v ?0 x)))))) <PREDICT>)))))))))
Ground truth: <START> (a (a (c (fun (cart (real) (2)) (fun (cart (real) (2)) (cart (real) (2)))) complex_mul) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (cart (real) (2))) (cart (real) (2)))) cproduct) (v (fun ?0 (bool)) s)) (v (fun ?0 (cart (real) (2))) f))) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (cart (real) (2))) (cart (real) (2)))) cproduct) (v (fun ?0 (bool)) s)) (v (fun ?0 (cart (real) (2))) g))) <END>
Source theorem pretty printed: !f g s. FINITE s ==> cproduct s (\x. f x * g x) = cproduct s f * cproduct s g

• Prompt: (<theorem> (a (c (fun (fun (fun (cart (real) N) (bool)) (bool)) (bool)) !) (l (v (fun (cart (real) N) (bool)) s) (a (c (fun (fun (fun (cart (real) N) (bool)) (bool)) (bool)) !) (l (v (fun (cart (real) N) (bool)) t) (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (bool) (fun (bool) (bool))) ∧) (a (c (fun (fun (cart (real) N) (bool)) (bool)) convex) (v (fun (cart (real) N) (bool)) s))) (a (a (c (fun (bool) (fun (bool) (bool))) ∧) (a (c (fun (fun (cart (real) N) (bool)) (bool)) affine) (v (fun (cart (real) N) (bool)) t))) (a (c (fun (bool) (bool)) ∼) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (bool))) =) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) INTER) (a (c (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool))) relative_interior) (v (fun (cart (real) N) (bool)) s))) (v (fun (cart (real) N) (bool)) t))) (c (fun (cart (real) N) (bool)) EMPTY)))))) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (bool))) =) <PREDICT>) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool)))) INTER) (a (c (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool))) closure) (v (fun (cart (real) N) (bool)) s))) (v (fun (cart (real) N) (bool)) t)))))))))
Ground truth: <START> (a (c (fun (fun (cart (real) N) (bool)) (fun (cart (real) N) (bool))) closure) (a (a (c (fun (fun (cart (real) N) (bool)) (fun (fun (cart (real) N) (bool))

• exceed the length of the decoder (512 tokens).

While the second criterion drops only a negligible 0.1% of the training examples, the first criterion drops around 32% of the training examples. We confirmed that not dropping those training examples does not significantly change the results (see the ablation study titled "unfiltered").

Basic statistics of the training splits of the data sets. Number of tokens in the training set measured before padding.

Success rate of predicting the ground truth in a beam search of width 8 after training a model on various datasets. Bold numbers indicate results that are within 0.5% of the best result. Grayed out values indicate experiments where the training data did not include the <MASK> token but the evaluation did.

Percentage of "provable statements"/"provable new statements". The type inference tasks are not included, as we are only interested in the predictions that do not match the ground truth. For the type inference tasks, these statements are either semantically equivalent to existing statements or statements that do not type check. "Provable new statements" is the percentage of generated statements that are provable and new, excluding exact (syntactic) matches with the ground truth and statements from the training set.


Assumptions.

• Prompt: (<theorem> (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (fun ?1 (bool)) (fun (fun ?1 (bool)) (bool))) =) (a (c (fun (fun ?1 (bool)) (fun ?1 (bool))) GSPEC) (l (v ?1 GEN%PVAR%0) (a (c (fun (fun ?1 (bool)) (bool)) ?) (l (v ?1 x) (a (a (a (c (fun ?1 (fun (bool) (fun ?1 (bool)))) SETSPEC) (v ?1 GEN%PVAR%0)) (a (a (c (fun (bool) (fun (bool) (bool))) ∧) (a (a (c (fun ?1 (fun (fun ?1 (bool) (a (c (fun ?0 (fun ?0 (bool)

• Prompt: (<theorem> (a (c (fun (fun (fun (cart (real

• Prompt: (<theorem> (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (bool)

• Prompt: (<theorem> (a (a (c (fun (bool) (fun (bool) (bool))) ==>) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (bool))) SUBSET) (v (fun ?0 (bool)) t)) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) DIFF) (c (fun ?0 (bool)) UNIV)) (v (fun ?0 (bool)) s)))) (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (bool))) =) <PREDICT>) (c (fun ?0 (bool)) EMPTY))))
Ground truth: <START> (a (a (c (fun (fun ?0 (bool)) (fun (fun ?0 (bool)) (fun ?0 (bool)))) INTER) (v (fun ?0 (bool)) s)) (v (fun ?0 (bool)) t)) <END>
Source theorem pretty printed: t SUBSET (:?0) DIFF s ==> s INTER t = {}

• Prompt: (<theorem> (a (c (fun (fun (real) (bool)) (bool)) !) (l (v (real) x) (a (a (c (fun (real)

