LIME: LEARNING INDUCTIVE BIAS FOR PRIMITIVES OF MATHEMATICAL REASONING

Abstract

While designing inductive bias in neural architectures has been widely studied, we hypothesize that transformer networks are flexible enough to learn inductive bias from suitable generic tasks. Here, we replace architecture engineering with encoding inductive bias in the form of datasets. Inspired by Peirce's view that deduction, induction, and abduction form an irreducible set of reasoning primitives, we design three synthetic tasks intended to require the model to have these three abilities. We specifically design these synthetic tasks to be devoid of mathematical knowledge, ensuring that only fundamental reasoning biases can be learned from them. This defines a new pre-training methodology called "LIME" (Learning Inductive bias for Mathematical rEasoning). Models trained with LIME significantly outperform vanilla transformers on three very different large mathematical reasoning benchmarks. Unlike traditional pre-training approaches, which dominate the overall computation cost, LIME requires only a small fraction of the computation cost of the typical downstream task.

1. INTRODUCTION

Inductive bias is essential for successful neural network learning. Many of the breakthroughs in machine learning are accompanied by new neural architectures with better inductive biases, such as locality bias in convolutional neural networks (LeCun et al., 1999), recurrence and memory in LSTMs (Hochreiter and Schmidhuber, 1997), and structural bias in graph neural networks (Scarselli et al., 2008). However, existing inductive biases must be explicitly encoded in the neural architecture. This is sometimes difficult, as one may not know the exact mechanism underlying an abstract ability well enough to describe the architectural bias explicitly. In particular, designing a proper inductive bias for abstract concepts such as mathematical reasoning is extremely challenging. Moreover, attempts to design elaborate architectures for reasoning often fall short of the performance of the more generic transformer architecture. In this work, we aim to avoid the search for new architectures and investigate whether one can learn useful inductive bias for mathematical reasoning through pretraining.

Large-scale unsupervised pretraining of language models revolutionized the field of natural language processing (NLP), improving the state of the art in question answering, named entity recognition, text classification, and other domains (Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020). As a result, pretraining has become common practice for modern neural-network-based NLP. One plausible explanation for the benefit of pretraining is that the model can learn world knowledge by memorizing the contents of the natural language corpus. This knowledge is useful in a variety of downstream tasks, such as question answering and text classification.
However, pre-training has another potential advantage: it may distill into the model inductive biases that are helpful for training on downstream tasks (Brown et al., 2020; Warstadt and Bowman, 2020). We focus on the latter and design pre-training tasks that are intentionally devoid of knowledge and only allow the model to learn inductive bias for reasoning. Inspired by the logician Charles Peirce (Peirce, 1992), we believe the following three primitives are the most crucial for reasoning:

1. Deduction: the ability to deduce new truths from given facts and inference rules.
2. Induction: the ability to induce general inference rules from a set of known facts.
3. Abduction: the ability to explain the relationship between evidence and inference rules.

To endow models with an inductive bias for mathematical reasoning, we design a synthetic task for each of the three primitives. We hypothesize that transformer networks are flexible enough to learn strong inductive bias from the three synthetic reasoning tasks and consequently improve on downstream tasks. Although such inductive bias may be useful in general reasoning tasks (e.g., NLP tasks), in this work we focus on mathematical reasoning benchmarks, where we expect to observe the largest gains. We call training on these tasks LIME, an acronym for "Learning Inductive bias for Mathematical rEasoning". Note that only a limited amount of pretraining data is available for formal mathematical benchmarks; the study of generic pre-training techniques is therefore particularly important for the success of machine learning in mathematical reasoning. We demonstrate that LIME-pretrained models provide significant gains across three large mathematical reasoning benchmarks: IsarStep (Li et al., 2020), HOList Skip-tree (Rabe et al., 2020), and MetaMathStep (Polu and Sutskever, 2020).
Notably, on the IsarStep benchmark, pre-training improved top-1 accuracy from 20.4% to 26.9% and top-10 accuracy from 33.1% to 41.0%. Compared to traditional pre-training, there are two major differences. First, we do not load the input embeddings or the weights of the output layer when finetuning on downstream tasks. This allows us to use the same pre-trained model for a variety of downstream tasks, which can have vastly different vocabularies due to language or tokenization differences. It also prevents the transfer of content knowledge from pretraining to downstream tasks, supporting the claim that inductive biases are being learned. Second, pretraining on our synthetic tasks requires only a fraction of the computational cost of the downstream tasks: about two hours of training on a single modern GPU already yields the full benefit, in contrast to days of training on a large natural language corpus with hundreds of GPUs/TPUs. Our method can also be regarded as a form of curriculum learning, in which the model is taught basic, extremely generic skills before being trained on the specific problem domain. To summarize, the contributions of this paper are:

1. Providing the first method to design inductive biases in the form of datasets for mathematical reasoning.
2. Demonstrating significant improvements in the reasoning performance of transformer models on three large mathematical reasoning benchmarks with negligible extra computation cost.
3. Disentangling the study of pretraining's working mechanism by showing that it can bring benefits other than learning content knowledge.
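To make the three primitives above concrete, the toy generator below sketches one possible family of synthetic examples. This is an illustrative assumption, not the paper's exact task definition: the function names, the `<s>` separator token, and the specific (Rule, Case, Result) string-substitution format are our own; the idea is only that each primitive hides a different element of the same triple.

```python
import random
import string

def make_triple(rng):
    """Generate one (Rule, Case, Result) triple.

    Rule   : a pattern over uppercase 'variable' symbols, e.g. "ABAAB"
    Case   : an assignment of each variable to a lowercase string
    Result : the rule with every variable substituted under the case
    """
    variables = rng.sample(string.ascii_uppercase, 2)
    rule = "".join(rng.choice(variables) for _ in range(5))
    case = {v: "".join(rng.choice(string.ascii_lowercase) for _ in range(2))
            for v in variables}
    result = "".join(case[v] for v in rule)
    case_str = " ".join(f"{v}={s}" for v, s in sorted(case.items()))
    return rule, case_str, result

def deduction(rule, case_str, result):
    # Deduction: given Rule and Case, produce the Result.
    return f"{rule} <s> {case_str}", result

def abduction(rule, case_str, result):
    # Abduction: given Rule and Result, recover the Case that explains them.
    return f"{rule} <s> {result}", case_str

def induction(rule, case_str, result):
    # Induction: given Case and Result, recover the general Rule.
    return f"{case_str} <s> {result}", rule

rng = random.Random(0)
triple = make_triple(rng)
for task in (deduction, induction, abduction):
    src, tgt = task(*triple)
    print(f"{task.__name__}: {src!r} -> {tgt!r}")
```

Each (source, target) pair can then be fed to a standard sequence-to-sequence transformer; because the vocabulary is arbitrary symbols, the tasks carry no mathematical content, only the reasoning pattern.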

2. RELATED WORK

Learning Models Applied to Mathematics. There has been increasing interest in applying deep learning methods to interactive theorem provers (ITPs) (Bansal et al., 2019; Gauthier et al., 2020; Huang et al., 2019; Yang and Deng, 2019; Wu et al., 2020; Li et al., 2020; Polu and Sutskever, 2020). The work most related to ours is GPT-f (Polu and Sutskever, 2020). The authors performed pretraining on several natural language corpora and showed significant improvements for an ITP system, MetaMath. Unlike our approach, they used GPT-style large-scale language-modeling pretraining, which dominates the computation cost relative to the downstream task. We, on the other hand, propose pretraining on a few lightweight synthetic tasks costing only a minor fraction of the computation spent on the downstream task.

Lample and Charton (2020) have demonstrated that transformer models can be used for symbolic mathematics by successfully predicting the integrals of formulas from a randomly generated dataset. Similar observations have been made for logical problems relevant to verification: transformer networks can learn the semantics of logics (Hahn et al., 2020). Rabe et al. (2020) have shown that mathematical reasoning can emerge from self-supervised training alone. Li et al. (2020) show that language models can learn to synthesize missing high-level intermediate propositions given a local context. Piotrowski and Urban (2020) used RNNs in automated theorem provers for first-order logic. Wang et al. (2020) explored the use of machine translation to translate between synthetically generated natural language descriptions of proofs and formally represented proofs. Urban and Jakubův (2020) present initial experiments on generating mathematical conjectures with a Transformer model.

