MATHEMATICAL REASONING VIA SELF-SUPERVISED SKIP-TREE TRAINING

Abstract

We demonstrate that self-supervised language modeling applied to mathematical formulas enables logical reasoning. To measure the logical reasoning abilities of language models, we formulate several evaluation (downstream) tasks, such as inferring types, suggesting missing assumptions, and completing equalities. For training language models for formal mathematics, we propose a novel skip-tree task. We find that models trained on the skip-tree task show surprisingly strong mathematical reasoning abilities, and outperform models trained on standard skip-sequence tasks. We also analyze the models' ability to formulate new conjectures by measuring how often the predictions are provable and useful in other proofs.

1. INTRODUCTION

Language modeling using Transformers (Vaswani et al., 2017) has been hugely successful for applications like translation and text generation. Models like GPT are able to generate news articles and stories given just an abstract (Radford et al., 2018). These models are usually (pre-)trained on a proxy task, such as predicting missing words in the case of BERT (Devlin et al., 2019), before fine-tuning the models on more specific (downstream) tasks such as machine translation and question answering. These proxy tasks do not rely on labels, and thus can be trained on large corpora of unlabeled data. Recently, however, we have seen successful demonstrations of language modeling using only self-supervised training without any fine-tuning (Brown et al., 2020). In this work, we extend this line of thought and demonstrate that purely self-supervised training can even lead to mathematical reasoning abilities.

This represents a major departure from prior work in deep learning for mathematics, which has focused on learning directly on logical reasoning tasks, such as predicting proof steps, premises, or assignments. These approaches require labeled data, which is hard to come by and typically very limited in size. In contrast, our language modeling approach to mathematics allows us to train on unlabeled mathematical expressions. We start with the HOList dataset (Bansal et al., 2019), which spans a wide range of mathematical topics, including topology, multivariate calculus, real and complex analysis, geometric algebra, and measure theory, formalized in the HOL Light proof assistant (Harrison, 1996). We find that training a language model on all mathematical expressions in this dataset leads to surprisingly strong mathematical reasoning capabilities.
We believe that this opens the door to new kinds of neural theorem provers that do not only search through a well-defined search space of tactics and premises, but are capable of generating their own lemmas and could even come up with a new Ansatz requiring a creative substitution.

For self-supervised training on mathematical expressions, we propose a novel skip-tree task, a specialization of the skip-sequence task that respects the tree structure of expressions. We show that models trained on the skip-tree task significantly outperform those trained on the skip-sequence task, which is the state of the art for sequence-to-sequence models in natural language. Reasoning can refer to a wide range of abilities, so we measure the mathematical reasoning abilities of language models on a variety of tasks, including mechanical derivations, such as type inference, as well as creative tasks, such as predicting under which assumptions a statement is true. Because we want to study which reasoning capabilities can be acquired through self-supervised training alone, we do not fine-tune on these tasks. Instead, we designed the tasks to be syntactically similar to the training task, so that the language model can produce correct answers. An advantage of formal language over natural language is that we can attempt to evaluate statements automatically. That is, we can let our language models produce conjectures, which we then try to prove using the DeepHOL theorem prover (Bansal et al., 2019; 2020). Besides evaluating the provability of the produced statements, we go one step further and evaluate their usefulness by measuring how often they are used as premises in proofs of other theorems.

Our contributions are as follows:

1. We show that self-supervised training on mathematical formulas alone leads to logical reasoning capabilities.

2. We introduce a new skip-tree training task that outperforms state-of-the-art skip-sequence training. We also introduce several evaluation tasks that are subsumed by skip-tree training (i.e., predicting a missing subexpression) but test specific logical reasoning abilities, making the performance of the models interpretable.

3. We suggest a way to create and evaluate mathematical conjectures using existing neural theorem provers.

The remainder of this paper is structured as follows: First, we review related work on language modeling and deep learning for mathematics in Section 2. Then, in Section 3, we discuss the source corpus of formal mathematical statements from which we generate our training data. In Section 4, we present the skip-tree training task, as well as several variations used in our ablation studies. We present the evaluation tasks in Section 5, discuss our experimental findings in Section 6, and conclude in Section 7.
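To make the skip-tree idea concrete, the following is a minimal sketch of subtree masking, our own illustration rather than the authors' implementation: formulas are represented here as nested lists, a random proper subtree is cut out and replaced in the input by a single placeholder token (the name `<PREDICT>` is our choice), and the removed subtree becomes the prediction target.

```python
import random

def subtrees(tree, path=()):
    """Yield (path, subtree) pairs for a nested-list expression tree."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree):
            yield from subtrees(child, path + (i,))

def replace(tree, path, value):
    """Return a copy of `tree` with the subtree at `path` replaced by `value`."""
    if not path:
        return value
    return [replace(c, path[1:], value) if i == path[0] else c
            for i, c in enumerate(tree)]

def skip_tree_example(tree, rng):
    """Cut out one random proper subtree; return (masked input, target)."""
    paths = [p for p, _ in subtrees(tree) if p]  # exclude the root
    path = rng.choice(paths)
    target = dict(subtrees(tree))[path]
    return replace(tree, path, "<PREDICT>"), target

# The formula (= (+ a b) (+ b a)) as a nested list.
expr = ["=", ["+", "a", "b"], ["+", "b", "a"]]
src, tgt = skip_tree_example(expr, random.Random(0))
```

Because the cut always follows subtree boundaries, the masked region is a syntactically complete expression, which is the key difference from masking an arbitrary token subsequence.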

2. RELATED WORK

Recently, we have seen a series of rapid improvements in language modeling stemming from better pretraining tasks (Devlin et al., 2019; Zhang et al., 2019; Song et al., 2019; Dong et al., 2019; Raffel et al., 2019; Conneau and Lample, 2019). BERT (Devlin et al., 2019) is a pretraining task for Transformers (Vaswani et al., 2017), which masks out a certain fraction of the input tokens that the model then has to predict. UniLM uses multiple pretraining tasks (Dong et al., 2019); one of them is a sequence-to-sequence task: predicting the next sentence from the previous sentence. MASS and SpanBERT consider a generalized sequence-to-sequence pretraining task, which is to predict a masked-out subsequence of the input (Song et al., 2019; Joshi et al., 2020). However, both MASS and SpanBERT reveal the length of the sequence to predict, as they replace it by a number of mask tokens equal to the length of the sequence. T5 introduced a generalization of sequence-to-sequence pretraining tasks that is crucial to our work (Raffel et al., 2019): they replace the subsequence (or multiple subsequences) to be predicted by a single token, not a number of mask tokens equal to the length of the subsequence as in MASS. Zhang et al. (2019) additionally exploit the sentence structure of natural language. They suggest the pretraining task Pegasus, which masks out entire sentences of a given text, and additionally masks out randomly selected tokens in the remaining text (or alternatively replaces them by other tokens). In a similar way to Pegasus' exploitation of the sentence structure of natural language, our skip-tree task exploits the tree structure of formal expressions. Zhang et al. (2019) also suggest sampling the sentences to be masked with the help of ROUGE1-F1 (Lin, 2004).

We work with the HOList dataset by Bansal et al. (2019), which is closely related to the Flyspeck dataset by Kaliszyk and Urban (2014).
There are other datasets that might be suitable for our approach as well, including proofs extracted from HOL4 (Gauthier et al., 2017) and from Coq (Huang et al., 2019; Yang and Deng, 2019; Sanchez-Stern et al., 2019).
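The difference between the span-masking variants discussed above can be made concrete with a small sketch (our illustration, not code from any of the cited papers; the token names `<M>` and `<X>` are placeholders): MASS/SpanBERT-style masking leaves one mask token per hidden token, so the model sees the span's length, while T5-style masking collapses the span into a single sentinel, hiding its length.

```python
def mass_style_mask(tokens, start, end, mask="<M>"):
    """Replace each token in [start, end) with a mask token: length is revealed."""
    return tokens[:start] + [mask] * (end - start) + tokens[end:]

def t5_style_mask(tokens, start, end, sentinel="<X>"):
    """Replace the whole span [start, end) with one sentinel: length is hidden."""
    return tokens[:start] + [sentinel] + tokens[end:]

toks = ["forall", "x", ".", "x", "+", "0", "=", "x"]
# Masking the span [3, 6), i.e. the tokens ["x", "+", "0"]:
mass_input = mass_style_mask(toks, 3, 6)  # three <M> tokens remain visible
t5_input = t5_style_mask(toks, 3, 6)      # one <X> token, span length hidden
```

The skip-tree task described in this paper follows the single-sentinel convention, but chooses the masked span along subtree boundaries rather than as an arbitrary subsequence.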

