DO TRANSFORMERS UNDERSTAND POLYNOMIAL SIMPLIFICATION?

Abstract

Recently, researchers have demonstrated that Transformers can be trained to learn symbolic tasks such as solving integrals and differential equations in an end-to-end fashion. In these setups, for an input symbolic expression, the Transformer predicts the final solution in a single step. Since such tasks may consist of a sequence of logical steps, the question remains whether such networks have understood and learnt the individual steps needed to reach the solution. To take a deeper look, we consider the task of polynomial simplification. Polynomials can be written in a simple normal form as a sum of monomials ordered lexicographically. For a polynomial which is not necessarily in this normal form, a sequence of simplification steps is applied to reach the fully simplified (i.e., normal form) polynomial. For this task, we describe a synthetic polynomial dataset generation algorithm that generates polynomials with unique proof steps. Then, we conduct an extensive analysis of the Transformer's ability to learn the polynomial simplification task along different dimensions.
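The normal form described above can be made concrete with a small sketch (our own illustration, not the paper's code): represent a polynomial in variables x1, x2 as a map from exponent tuples to coefficients, so the normal form is simply the non-zero monomials listed in descending lexicographic order of exponents.

```python
# Minimal sketch (assumed representation, not the paper's format):
# a polynomial in x1, x2 is a dict {(e1, e2): coeff}; the normal form
# lists its non-zero monomials in descending lexicographic order.

def normal_form(poly):
    """Return monomials as (coeff, exponents), lexicographically ordered."""
    return [(poly[e], e) for e in sorted(poly, reverse=True) if poly[e] != 0]

# 6*x2^3 + 30*x1^3 + 8*x2^2 + 6*x1^2*x2, entered in arbitrary order:
p = {(0, 3): 6, (3, 0): 30, (0, 2): 8, (2, 1): 6}
print(normal_form(p))
# [(30, (3, 0)), (6, (2, 1)), (6, (0, 3)), (8, (0, 2))]
```

Tuple comparison in Python is exactly lexicographic, so `sorted(..., reverse=True)` yields the monomial order the normal form requires.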

1. INTRODUCTION

With the state-of-the-art performance of Deep Neural Nets (DNNs) in perceptual tasks, researchers have started to explore their logical reasoning capabilities, in particular within the domain of Automated Theorem Proving (ATP). In these domains (LEAN (de Moura et al., 2015), HOL Light, and Mizar (miz, 2020)), many recent works (Paliwal et al., 2020; Aygün et al., 2020; Hahn et al., 2020) have shown that Graph Neural Networks (Gori et al., 2005; Veličković et al., 2018) and Transformers (Vaswani et al., 2017) can be trained to perform impressively on the theorem-proving task as part of a neuro-symbolic system. In a related but different development, Lample & Charton (2019) recently showed that for symbolic integration and differential equations, a large number of synthetic end-to-end examples can be generated using symbolic systems. In these tasks, the authors show that Transformer networks can be trained to produce the final solution from an input integral (or differential equation) in a single step. This points to the exciting possibility of using deep neural nets to learn end-to-end theorem provers, which can be beneficial for formal mathematics (Szegedy, 2020). However, the setup combines multiple reasoning steps in a single shot. Additionally, integration (or differential equation solving) is a complex task requiring an understanding of integral symbols, functions, variables, and the basic concepts of arithmetic. As the system in Lample & Charton (2019) is simply trained to output the top solution(s) and corresponding confidence score(s), it is unclear what internal mechanisms enable these models to solve these problems. This lack of transparency has been noted in this context (Davis, 2019). An earlier work by Piotrowski et al. (2019) showed similar results for certain symbolic manipulation tasks, and their work shares the same limitation.
In this paper we ask whether, instead of only producing the end result of a symbolic manipulation or an integral, the model can also produce a human-readable proof. While we do not know if these models reason the way humans do, one way to produce proofs would be to "extract" a proof from models of the above type by "probing" them in some manner. The problem of unraveling the inner workings of Transformers by probing is an active area of research; however, at present our understanding is still evolving (Rogers et al., 2020). Hence, taking a detour, we instead train the model to produce the full proof. Inspired by Piotrowski et al. (2019), we explore a novel but simpler setting of polynomial simplification. We illustrate the task with an example. We begin with a polynomial that is a sum of products of factors, where each factor is again a sum of monomials (including constants), as shown below:

P_0 = (2 * x_2^2) * (3 * x_2^1 + 4) + (5 * x_1^2 + x_1^1 * x_2^1) * (3 * x_1^1) * (2). /* Initial */

Here (3 * x_2^1 + 4) is a factor, 3 * x_2^1 is a term, and (5 * x_1^2 + x_1^1 * x_2^1) * (3 * x_1^1) * (2) is a product. To construct unique simplification steps, first each term in a factor is simplified. Once all factors are simplified (facstep), then within a product, all factors are multiplied (mulstep). Lastly, simplified products are summed (sumstep). Piotrowski et al. (2019) explore the task of learning a symbolic rewrite of an entire expression. In contrast, in our step-wise prediction setting, at each step the system needs to find the candidate sub-expression and a relevant simplification type to perform the simplification. This setup resembles the traditional ATP setup, where a system needs to learn and execute symbolic steps to reach a final solution, but it is simpler, as at each step only one type of simplification is applicable. By a proof for an initial polynomial (P_0) we mean the sequence of simplification steps (P_1 to P_5). A model trained on the step-wise prediction task can be used to generate a full proof.
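The mulstep and sumstep operations can be sketched on a toy representation (our own illustration; the paper's actual inputs are token sequences, and the function names merely mirror the step names in the text): with polynomials as dicts mapping exponent tuples to coefficients, multiplying factors adds exponents, and summing products merges coefficients.

```python
# Hedged sketch of two step types on an assumed dict-of-monomials
# representation {(e1, e2): coeff} for polynomials in x1, x2.

def mulstep(p, q):
    """Multiply two simplified factors: exponents add, coefficients multiply."""
    out = {}
    for e, c in p.items():
        for f, d in q.items():
            key = tuple(a + b for a, b in zip(e, f))
            out[key] = out.get(key, 0) + c * d
    return out

def sumstep(p, q):
    """Sum two simplified products: merge coefficients of equal monomials."""
    out = dict(p)
    for e, c in q.items():
        out[e] = out.get(e, 0) + c
    return out

# (2 * x2^2) * (3 * x2 + 4)  ->  6 * x2^3 + 8 * x2^2
f1 = {(0, 2): 2}
f2 = {(0, 1): 3, (0, 0): 4}
print(mulstep(f1, f2))  # {(0, 3): 6, (0, 2): 8}
```

A facstep in this representation would be trivial (the dict is already a flat sum of monomials); in the paper's token-sequence setting it corresponds to rewriting each term, e.g. dropping unit exponents.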
Essentially, we start with an initial polynomial and recursively feed the model output to itself, until it generates the final simplified polynomial (in normal form). A proof is correct when all steps are correct.

P_0 = (2 * x_2^2) * (3 * x_2^1 + 4) + (5 * x_1^2 + x_1^1 * x_2^1) * (3 * x_1^1) * (2), /* FACSTEP */
= (2 * x_2^2) * (3 * x_2 + 4) + (5 * x_1^2 + x_1^1 * x_2^1) * (3 * x_1^1) * (2), (P_1), /* FACSTEP */
= (2 * x_2^2) * (3 * x_2 + 4) + (5 * x_1^2 + x_1 * x_2) * (3 * x_1) * (2), (P_2), /* MULSTEP */
= (6 * x_2^3 + 8 * x_2^2) + (5 * x_1^2 + x_1 * x_2) * (3 * x_1) * (2), (P_3), /* MULSTEP */
= (6 * x_2^3 + 8 * x_2^2) + (30 * x_1^3 + 6 * x_1^2 * x_2), (P_4), /* SUMSTEP */
= 30 * x_1^3 + 6 * x_1^2 * x_2 + 6 * x_2^3 + 8 * x_2^2, (P_5).

In the above setting (termed COARSE), all terms in a factor are simplified at once in a facstep, and similarly all factors in a product are multiplied at once in a mulstep. Additionally, we define another setting, FINER, where a facstep involves simplification of a single term, and a mulstep involves multiplication of only two factors at once, illustrated below with an example (for facstep):

P_0 = (5 * x_1^2 + x_1^1 * x_2^1) * (3 * x_1^1) * (2), /* FACSTEP */
= (5 * x_1^2 + x_1 * x_2) * (3 * x_1^1) * (2), /* FACSTEP */
= (5 * x_1^2 + x_1 * x_2) * (3 * x_1) * (2).

As a state-of-the-art model, we explore Transformers. While both Graph Neural Networks and Transformers have been used for single-step representation learning of symbolic theorems and single-step goal-theorem scoring, Transformer-based sequence-to-sequence networks have shown superiority in end-to-end tasks in the integration, differential equation (Lample & Charton, 2019), and temporal logic (Hahn et al., 2020) domains. Hence, for the aforementioned tasks of step-wise polynomial simplification, we explore the Transformer's ability along several dimensions.
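The recursive proof-generation procedure described above can be sketched as a simple loop (a minimal sketch: `simplify_step` stands in for the trained Transformer's one-step prediction and `is_normal_form` for the stopping check; both names and the step cap are our assumptions):

```python
# Sketch of full-proof generation by iterating a step-wise model.
# `simplify_step` and `is_normal_form` are hypothetical stand-ins for
# the trained model and the normal-form check; not the paper's code.

def generate_proof(p0, simplify_step, is_normal_form, max_steps=50):
    """Feed the model output back to itself until the polynomial is
    fully simplified; return the whole step sequence [P0, P1, ...]."""
    proof = [p0]
    while not is_normal_form(proof[-1]) and len(proof) <= max_steps:
        proof.append(simplify_step(proof[-1]))
    return proof

# Toy demo: each "step" drops one trailing character until one remains.
proof = generate_proof("P0>>", lambda s: s[:-1], lambda s: len(s) == 1)
print(proof)  # ['P0>>', 'P0>', 'P0', 'P']
```

The `max_steps` cap guards against a model that never reaches normal form, which matters in practice because a single wrong step can send generation into a loop.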
Our contributions are the following: 1) we propose polynomial simplification tasks requiring multiple steps of symbolic manipulation, 2) we show how datasets of different configurations can be generated synthetically for the task, 3) we propose an array of metrics to dissect the performance of Transformers, and 4) lastly, through extensive experiments we show the performance of the Transformer on this task, establishing a strong baseline for future endeavors.

Results Summary. By varying the coefficient size, proof granularity, and input representation (in Tables 1, 2, and Appendix Table 6) we observe that 1) full proof accuracy is only slightly lower than single-shot endpoint prediction accuracy in many 1-variable configurations, 2) coarse-granular proofs help learn somewhat more accurate proofs, and 3) the prefix representation helps in most cases, but infix sometimes provides higher accuracy. More than 80% of errors (Tables 7 and 8 in the Appendix) occur in multiplication steps, and we observe (through independent experiments) that the Transformer struggles with such multiplications.
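The strictest metric implied above, full proof accuracy, can be sketched as follows (a hedged illustration: the function name and the exact-match comparison of step sequences are our assumptions, not the paper's definitions):

```python
# Sketch of a proof-level metric: a generated proof counts as correct
# only if every step matches the ground-truth sequence exactly.

def full_proof_accuracy(predicted_proofs, gold_proofs):
    """Fraction of proofs whose entire step sequence is correct."""
    correct = sum(1 for p, g in zip(predicted_proofs, gold_proofs) if p == g)
    return correct / len(gold_proofs)

preds = [["P0", "P1", "P2"], ["Q0", "Q1"]]
golds = [["P0", "P1", "P2"], ["Q0", "QX"]]
print(full_proof_accuracy(preds, golds))  # 0.5
```

Because one wrong step fails the whole proof, this metric lower-bounds step-wise accuracy, which is why comparing it against single-shot endpoint accuracy is informative.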

