DO TRANSFORMERS UNDERSTAND POLYNOMIAL SIMPLIFICATION?

Abstract

Recently, researchers have demonstrated that Transformers can be trained to learn symbolic tasks such as integration and solving differential equations in an end-to-end fashion. In these setups, given an input symbolic expression, the Transformer predicts the final solution in a single step. Since such tasks may consist of a sequence of logical steps, the question remains whether such networks have understood and learnt the individual steps needed to reach the solution. To take a deeper look, we consider the task of polynomial simplification. Polynomials can be written in a simple normal form as a sum of monomials ordered lexicographically. For a polynomial that is not necessarily in this normal form, a sequence of simplification steps is applied to reach the fully simplified (i.e., normal-form) polynomial. For this task, we describe a synthetic polynomial dataset generation algorithm which generates polynomials with unique proof steps. We then conduct an extensive analysis of the Transformer's ability to learn the polynomial simplification task along different dimensions.

1. INTRODUCTION

With the state-of-the-art performance of Deep Neural Nets (DNNs) in perceptual tasks, researchers have started to explore their logical reasoning capabilities, in particular within the domain of Automated Theorem Proving (ATP). In these domains (LEAN (de Moura et al., 2015), HOL Light, and Mizar (miz, 2020)), many recent works (Paliwal et al., 2020; Aygün et al., 2020; Hahn et al., 2020) have shown that Graph Neural Networks (Gori et al., 2005; Veličković et al., 2018) and Transformers (Vaswani et al., 2017) can be trained to perform impressively on the theorem-proving task as part of a neuro-symbolic system. In a related but different development, Lample & Charton (2019) recently showed that for symbolic integration and differential equations, a large number of synthetic end-to-end examples can be generated using symbolic systems. In these tasks, the authors show that Transformer networks can be trained to produce the final solution from an input integral (or differential equation) in a single step. This points to the exciting possibility of using deep neural nets to learn end-to-end theorem provers, which could be beneficial for formal mathematics (Szegedy, 2020). However, the setup combines multiple reasoning steps in a single shot. Additionally, integration (or differential equation solving) is a complex task requiring understanding of the integral symbols, functions, variables, and the basic concepts of arithmetic. As the system in Lample & Charton (2019) is simply trained to output the top solution(s) and corresponding confidence score(s), it is unclear what internal mechanisms enable these models to solve these problems. This lack of transparency has been noted in this context (Davis, 2019). An earlier work by Piotrowski et al. (2019) showed similar results for certain symbolic manipulation tasks, and their work shares the same limitation.
In this paper we ask whether, instead of only producing the end result of a symbolic manipulation or integral, the model can produce a human-readable proof as well. While we do not know if these models reason the way humans do, one way to produce proofs would be to "extract" a proof from models of the above type by "probing" them in some manner. The problem of unraveling the inner workings of Transformers by probing is an active area of research; however, at present our understanding is still evolving (Rogers et al., 2020). Hence, taking a detour, we instead train the model to produce the full proof. Inspired by Piotrowski et al. (2019), we explore a novel but simpler setting of polynomial simplification. We illustrate the task with an example. We begin with a polynomial which is a sum of products of factors, where each factor is again a sum of monomials (including constants), as shown below: P_0 = (2 * x_2^2) * (3 * x_2^1 + 4) + (5 * x_1^2 + x_1^1 * x_2^1) * (3 * x_1^1) * (2). /* Initial */ Here (3 * x_2^1 + 4) is a factor, 3 * x_2^1 is a term, and (5 * x_1^2 + x_1^1 * x_2^1) * (3 * x_1^1) * (2) is a product. To construct unique simplification steps, first each term in a factor is simplified; once all factors are simplified (facstep), all factors within a product are multiplied out (mulstep); lastly, the simplified products are summed (sumstep), yielding the endpoint (P_5). /* ENDPOINT */ Piotrowski et al. (2019) explore the task of learning symbolic rewrites of an entire expression. In contrast, in our setting, for step-wise prediction, at each step the system needs to find the candidate sub-expression and the relevant simplification type to perform the simplification. This setup resembles the traditional ATP setup, where a system needs to learn and execute symbolic steps to reach a final solution, but it is simpler, as for each step only one type of simplification is applicable. By a proof for an initial polynomial (P_0) we mean the sequence of simplification steps (P_1 to P_5). A model trained on the step-wise prediction task can then be used to generate a full proof.
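The mulstep and sumstep operations can be mimicked with Sympy; below is a minimal sketch using the running example, where the list-of-factor-lists encoding is our illustrative assumption, not the paper's internal representation (a facstep would first collect terms inside each factor, which is omitted here since the factors are already simplified):

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')

# The initial polynomial, encoded as a list of products, each a list of factors.
products = [
    [2 * x2**2, 3 * x2 + 4],                       # (2*x2^2) * (3*x2 + 4)
    [5 * x1**2 + x1 * x2, 3 * x1, sp.Integer(2)],  # (5*x1^2 + x1*x2) * (3*x1) * (2)
]

# mulstep: within each product, multiply all (already simplified) factors out.
simplified_products = [sp.expand(sp.Mul(*fs)) for fs in products]

# sumstep: add the simplified products to reach the normal-form endpoint.
endpoint = sp.expand(sum(simplified_products))
print(endpoint)
```

The endpoint is the normal form: a plain sum of monomials, symbolically equal to the initial polynomial.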
Essentially, we start with an initial polynomial and recursively feed the model output back to itself until it generates the final simplified polynomial (in normal form). A proof is correct when all steps are correct. P_0 = (2 * x_2^2) * (3 * x_2^1 + 4) + (5 * x_1^2 + x_1^1 * x_2^1) * (3 * x_1^1) * (2), /* FACSTEP */ = (2 * x_2^2) * (3 * x_2 + 4) + (5 * x_1^2 + x_1^1 * x_2^1) * (3 * x_1^1) * (2), /* FACSTEP */ = ... (the remaining factors are simplified in turn). In the above setting (termed COARSE), all terms in a factor are simplified at once in a facstep, and similarly all factors in a product are simplified at once in a mulstep. Additionally, we define another setting, FINER, where a facstep involves simplification of a single term, and a mulstep involves multiplication of only two factors at once, illustrated below with an example (for facstep): P_0 = (5 * x_1^2 + x_1^1 * x_2^1) * (3 * x_1^1) * (2), /* FACSTEP */ = (5 * x_1^2 + x_1 * x_2) * (3 * x_1^1) * (2), /* FACSTEP */ = (5 * x_1^2 + x_1 * x_2) * (3 * x_1) * (2). As a state-of-the-art model, we explore Transformers. While both Graph Neural Networks and Transformers have been used for single-step representation learning of symbolic theorems and single-step goal-theorem scoring, Transformer-based sequence-to-sequence networks have shown superiority in end-to-end tasks in the integration, differential equation (Lample & Charton, 2019) and temporal logic (Hahn et al., 2020) domains. Hence, for the aforementioned tasks of step-wise polynomial simplification, we explore the Transformer's abilities along several dimensions. Our contributions are the following: 1) we propose polynomial simplification tasks requiring multiple steps of symbolic manipulation, 2) we show how datasets of different configurations can be generated synthetically for the task, 3) we propose an array of metrics to dissect the performance of Transformers, and 4) through extensive experiments we show the performance of the Transformer on this task, establishing a strong baseline for future endeavors.

Results

Summary: By varying coefficient size, proof granularity and input representation (Tables 1, 2, Appendix Table 6), we observe that 1) full proof accuracy is only slightly lower than single-shot endpoint prediction accuracy in many 1-variable configurations, 2) coarse-granularity proofs help learn somewhat more accurate proofs, and 3) the prefix representation helps in most cases, but infix sometimes provides higher accuracy. More than 80% of errors (Tables 7 and 8 in the Appendix) occur in multiplication steps, and we observe (through independent experiments) that Transformers struggle to learn how to multiply numeric coefficients. By letting the system annotate the candidate sub-expression, we observe that the system can explicitly identify candidate sub-expressions and the next step to perform (Table 3, and Appendix Tables 9, 10, 11). We also observe similar effects through visualization (Figures 1, 2 in the Appendix). Systems trained on 2-variable data outperform the corresponding 1-variable systems on 1-variable test sets. For 1 variable, we observe steady and significant gains (up to 10% for full proof) using curriculum learning (Table 17, Appendix).

2. RELATED WORK AND DISCUSSION

Unlike the problems dealt with by the aforementioned automatic theorem provers and related neural-based systems, polynomial simplification does not involve any search. Our problem is simpler and tests a specific ability, namely certain kinds of symbol manipulation. This simplicity affords certain advantages (shared by Piotrowski et al. (2019) and Lample & Charton (2019)): (1) we can generate artificial data to train models without limitations on size; (2) it is easier to test the abilities of the models thoroughly along multiple axes; (3) the accuracy achieved is much higher than for harder tasks, suggesting that fully solving such tasks may be possible in the near future. To compare with symbolic manipulation systems, we note in more detail the ability tested by our task: the model must be able to identify the smallest parts of the polynomial that can be simplified via (1) simplification of a factor, (2) multiplication of two factors, or (3) addition of two sub-polynomials. Having identified which simplification to apply, the model must produce a new polynomial with just that simplification performed. This ability is not tested by previous neural-based symbolic manipulation systems such as Piotrowski et al. (2019) and Lample & Charton (2019), or by related works such as Saxton et al. (2019) and Hahn et al. (2020). Several recent works have produced synthetic datasets for theorem-proving tasks (Aygün et al., 2020; Wu et al., 2020; Polu & Sutskever, 2020); however, their focus remains on search-based proofs.

3. POLYNOMIAL SIMPLIFICATION DATASET

We proceed similarly to Lample & Charton (2019) to generate the symbolic polynomials and simplified steps synthetically using the Sympy library in Python. To have fine-grained control over the generated polynomials and well-defined proof steps, we consider polynomials which are sums of products. We also note that symbolic generation using the Sympy library lets us ensure the correctness of each generated expression and the validity of each step.

3.1. NOTATIONS

We start with the set of variables x_P = {x_1, ..., x_nvar}. We represent the starting polynomial P_0 over x_P as a sum of products of factors: P_0 = P_1 + P_2 + ... + P_nprod, where P_i = prod_{j=1}^{nfac_i} f_ij, and each factor (f_ij) has the form f = sum_k (a_k * prod_l x_kl^{d_kl}), where x_kl ∈ x_P (dropping i, j for clarity). Here the coefficients a_k ∈ N+, and the powers of the variables d_kl ∈ N. nprod is the number of products and nfac_i denotes the number of factors in P_i. We denote the set of factors as f_P = {f_ij | ∃i, P_i = prod_{j=1}^{nfac_i} f_ij}. The simplified endpoint polynomial has the form P̂ = sum_{m=1}^{q} t̂_m, where t̂_m = â_m * prod_n x_n^{d̂_mn} and x_n ∈ x_P. We use the symbol P̂_i to denote the simplified form of P_i. The functions terms(), vars(), coeffs() return lists of the terms, variables, and coefficients in the input expression. Our sampling algorithm guarantees that the generated polynomial and its simplified endpoint abide by constraints on the number of terms, products, factors and variables, and by limits on the degree and coefficient sizes. An example is nprod ∈ {2, ..., maxP_P}. (The full list is provided in Appendix Table 4.)
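The lexicographic ordering of monomials in the normal form P̂ can be inspected with Sympy's Poly interface; a small sketch with a hypothetical endpoint polynomial:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')

# A simplified endpoint: a sum of monomials with coefficients in N+.
P_hat = sp.Poly(30*x1**3 + 6*x1**2*x2 + 8*x2**2, x1, x2)

# terms() yields (exponent-tuple, coefficient) pairs; order='lex' gives the
# lexicographic order used for the normal form.
for monom, coeff in P_hat.terms(order='lex'):
    print(monom, coeff)  # (3, 0) 30, then (2, 1) 6, then (0, 2) 8
```

The exponent tuples (d̂_m1, d̂_m2) descend lexicographically, matching the fixed term order of the normal form.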

3.2. BUILDING A POLYNOMIAL PROOF

Here, we briefly describe the starting polynomial generation process; the detailed algorithm is in the Appendix. Any randomly sampled polynomial (represented as a sum of products) can be included as a starting point in the dataset as long as it respects certain configuration parameters (Appendix Table 4). This is unlike Lample & Charton (2019), where many randomly generated integrals (or differential equations) might not have a solution. Hence, we randomly sample the constraint parameters in a top-down manner, and then construct terms, factors and products in a bottom-up manner using those parameters. We first sample 1) a set of participating variables (x_P), 2) the maximum degree of any monomial in the simplified polynomial (mdeg), and 3) the number of products in the starting polynomial (nprod). We then call the algorithm buildProduct (Algorithm 1 in the Appendix) to create nprod individual products. Building a Product. In buildProduct (Algorithm 1 in the Appendix), we first sample nfac_i, the maximum number of factors in the product (P_i). We then build factors sequentially. For each new factor, we sample a subset of variables for that factor. We pass on product-level constraints, such as the maximum degree in a product, the maximum terms in a product, and the maximum coefficient for a product, as rdegree, rterms and rcoeff respectively, and call the sub-routine buildFactor (Algorithm 2) to create a factor. After a factor is sampled, the constraints rdegree, rterms and rcoeff are updated. buildFactor is used to create at most nfac_i factors that all abide by the above constraints, and it stops if the limit on the maximum degree in the product is reached. The terms in a factor are arranged in lexicographic order. Since this sequential generation of factors may induce a pattern of decreasing degrees and coefficients, we shuffle the factors to create the final product.
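The budget-passing scheme can be sketched as follows. This is a much-simplified illustration of the buildProduct/buildFactor idea: the helper names, the way degree is split over variables, and the budget-update rules are our assumptions, not the paper's exact Algorithms 1-2 (which also track a term budget and rejection conditions):

```python
import random

def build_factor(variables, rdegree, rterms, rcoeff):
    """Sample one factor as a list of (coefficient, powers) monomials,
    respecting the remaining degree/coefficient budgets."""
    nterms = random.randint(1, max(1, min(rterms, 3)))
    terms = []
    for _ in range(nterms):
        coeff = random.randint(1, max(1, rcoeff))
        total = random.randint(0, rdegree)   # total degree of this monomial
        powers = {}
        for v in variables:                  # split the degree over variables
            p = random.randint(0, total)
            powers[v] = p
            total -= p
        terms.append((coeff, powers))
    return terms

def build_product(variables, mdeg, max_fac=3, max_terms=8, max_coeff=40):
    """Sample a product of factors whose total degree stays within mdeg."""
    rdegree, rcoeff = mdeg, max_coeff
    factors = []
    for _ in range(random.randint(2, max_fac)):
        if rdegree <= 0:
            break                            # degree budget exhausted
        f = build_factor(variables, rdegree, max_terms, rcoeff)
        factors.append(f)
        # update the remaining degree/coefficient budgets for the next factor
        rdegree -= max(sum(p.values()) for _, p in f)
        rcoeff = max(1, rcoeff // max(c for c, _ in f))
    random.shuffle(factors)                  # avoid decreasing-degree patterns
    return factors
```

Because each factor's degree is bounded by the remaining budget and then subtracted from it, the fully multiplied-out product can never exceed mdeg, which is what makes rejection-free sampling possible in the NO BACKTRACK spirit.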
Simplification Steps and Full Proof. For both the COARSE and FINER configurations, we build the proof steps in the following way: 1) first we perform a sequence of facsteps, where terms are collected within a factor (e.g., 2x + 3x becomes 5x, and x^1 and 1x become x); 2) then a sequence of mulsteps is performed, where simplified factors are multiplied out; and 3) lastly, in the sumstep, the simplified products are added together. As mentioned before, the sequence of simplification steps up to the endpoint constitutes a full proof.

4.1. DATASET

We vary dataset configurations along the following dimensions:
• Number of Variables: the number of variables in a polynomial, product and factor is varied between 1 and 2.
• Coefficient Size: the maximum coefficients in the polynomial, product and factor are gradually varied from {60, 20, 5} (SMALL) to {120, 40, 8} (MEDIUM) and {300, 100, 10} (LARGE). DEFAULT is {120, 40, 8}.
• Maximum Degree: the maximum degrees in a polynomial and a factor have two configurations: {6, 3} (DEFAULT) and {12, 5} (MEDIUM DEGREE).
• Maximum Terms: the maximum numbers of terms in a simplified product and a factor have two configurations: {8, 3} (DEFAULT) and {20, 4} (MEDIUM TERMS). For the latter, we also set the maximum products in a sum and the maximum factors in a product to 5 and 4 respectively.
• No Backtrack: we also try a very large configuration (NO BACKTRACK) where the maximum coefficients in polynomial, product and factor are {10125, 3375, 5}, and the maximum degrees in polynomial and factor are {9, 3}. The maximum number of terms in a product is set to 27. In this configuration, no sampled factor or product is ever rejected for violating a higher-level constraint.
Infix and Prefix. We focus on exploring seq2seq networks for all our experiments. We consider the prefix and infix traversals of the abstract syntax tree of the polynomial input as sequences. Lample & Charton (2019) briefly touched upon the usefulness of the prefix notation over infix, but did not provide empirical evidence supporting the statement. Hence, in our experiments, we consider both INFIX and PREFIX representations.
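A prefix sequence is a traversal of the expression tree. Below is a minimal sketch of such a serializer over Sympy trees; the token set and the binarization of n-ary +/* reflect our reading of the Lample & Charton scheme, not this paper's exact tokenizer:

```python
import sympy as sp

def to_prefix(expr):
    """Serialize a sympy expression into a prefix (Polish) token sequence."""
    if expr.is_Integer or expr.is_Symbol:
        return [str(expr)]
    op = {sp.Add: '+', sp.Mul: '*', sp.Pow: '^'}[type(expr)]
    # emit n-1 operator tokens so each '+'/'*' is binary in the sequence
    tokens = [op] * (len(expr.args) - 1)
    for arg in expr.args:
        tokens.extend(to_prefix(arg))
    return tokens

x1 = sp.Symbol('x1')
print(to_prefix(3 * x1**2 + 4))
```

Prefix sequences need no parentheses or precedence rules, which is the usual argument for their suitability as Transformer inputs.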

4.2. TASKS AND METRICS

We identify two central tasks: 1) Step-wise prediction, where an input polynomial is provided and the task is to perform the next proof step, and 2) Endpoint prediction, where, given a polynomial, the task is to predict the fully simplified polynomial in a single step. To compare with the Endpoint prediction task, we use the Step-wise prediction task to compute the full proof accuracy as the percentage of proofs in which all individual proof steps are accurate. Apart from accuracy, we also compare the examples seen by the systems trained on the two types of tasks. For the Step-wise task, a training example corresponds to an individual simplification step, whereas for the Endpoint task an example is a pair consisting of the initial and the endpoint polynomial. We also report the following: 1) error percentages grouped by the different step types facstep, mulstep, and sumstep, and 2) calibration scores of the systems based on a threshold. To compute accuracy for an example (in both tasks), we use the simplify method of the Sympy library and check symbolically whether the difference between the predicted expression and the ground-truth expression is equal to zero. Calibration: As end-to-end models grow more accurate and their usage increases, it is important that users can trust such models. In addition to reporting each simplified step and a confidence score, we also report a calibration score computed from the ratio of the top two outputs predicted for each step (using beam width 5). Using a calibration threshold (usually 5), we report the sure rate, which is the percentage of steps for which the ratio (in log scale with base e) exceeds the threshold. We also report precision, recall and F-1 score for calibration.
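The accuracy check and the calibration sure-rate decision can be sketched as follows; the function names are ours, and `top_scores` stands in for the probabilities of the two best beam hypotheses:

```python
import math
import sympy as sp

def step_correct(predicted, gold):
    """Symbolic check: a step is correct iff pred - gold simplifies to 0."""
    return sp.simplify(predicted - gold) == 0

def is_sure(top_scores, threshold=5.0):
    """Sure iff the natural-log ratio of the top-two beam scores
    exceeds the calibration threshold (usually 5)."""
    return math.log(top_scores[0] / top_scores[1]) > threshold

x1 = sp.Symbol('x1')
pred = 6 * x1**3 + 8 * x1
gold = (2 * x1) * (3 * x1**2 + 4)
print(step_correct(pred, gold))  # True: same polynomial, different form
```

The symbolic check is what makes the metric robust to differences in term ordering or factorization between the prediction and the ground truth.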

4.3. MODEL

Adapting the experimental setup of Lample & Charton (2019), we train a seq2seq network to predict the next proof step given a polynomial as a sequence. For all experiments, we train a Transformer network (Vaswani et al., 2017) with 4 attention heads, 4 encoder and decoder layers, and a hidden embedding size of 256. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 10^-4. We limit the maximum token length to 512 and use a batch size of 32 polynomial pairs. During training, we synthetically generate each batch of equations. To avoid collisions between the train and test sets, we first use a fixed seed to generate the test and validation sets of polynomial simplification full proofs and collect the simplified endpoints. We make sure that the simplified versions of the input polynomials in the training batches do not collide with any endpoints in the test and validation sets. Piotrowski et al. (2019) show that the probability of such collisions in the integration dataset generated by Lample & Charton (2019) is quite high, and urge that test accuracy be reported accounting for such collisions explicitly. During inference, we use beam search with different beam widths (1 and 5) to decode the expressions. In our results, beam width 1 is used for proof accuracy; calibration results are produced using beam-5 decoding. If any malformed (prefix or infix) expressions are generated during decoding, we report the percentage of such expressions.

4.4. EXPERIMENT ORGANIZATION

In the next sub-sections, we provide a problem-space size estimate (§4.5) to understand whether the accuracies are an effect of memorization. Then we vary the proof granularity, coefficient configurations and input representation to test the Transformer's accuracy and errors on both tasks (§4.6). Next, to assess whether Transformers can explicitly predict the candidate sub-expression to be simplified next, we try an annotated proof setting (§4.6.1). To estimate the ability to learn addition and multiplication over symbolic variables, we test a setting where the coefficients are also symbolic, thus bypassing the need for the Transformer to do integer multiplication. Next, we discuss the out-of-distribution generalization ability of the systems (§4.7). We also explore several curriculum strategies to take advantage of the well-defined sub-tasks and their varying complexities (§4.8). Lastly, we provide layer-wise attention visualizations of a trained system in the Appendix (Figs. 1 & 2).

4.5. PROBLEM SPACE SIZE ESTIMATION

For smaller configurations, it is probable that eventually all simplified polynomials would be included in the training data. To account for this, we estimate the problem-space size for each configuration and report the size of the training data for comparison. We randomly generate two sets of starting polynomials, say S_1 and S_2, and check for collisions between them. Assuming the actual size is X and a uniform distribution over all starting polynomials, the expected number of collisions is R = (|S_1| * |S_2|) / X. Using this method, we estimate the number of un-simplified polynomials and the number of unique endpoints, reported in Appendix Table 5. We observe that, compared to the number of training examples it took for the models to converge on both the Endpoint and Step-wise prediction tasks, the space of possible equations is often 25 (or more) times larger. Sampled polynomials are not uniformly distributed, as we assign equal probability while sampling polynomials of lower and higher degrees, say 3 and 6, whereas there are more polynomials of degree 6 than of degree 3. For non-uniform distributions, we expect more collisions, as higher-probability equations are more likely to occur in both S_1 and S_2. Moreover, since many equations may map to the same endpoint, such collisions for endpoints are even more likely. Thus, our empirical estimate of the population size provides a lower bound on the true value.
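The estimator can be sanity-checked on a toy uniform space of known size; the helper name below is ours, and counting colliding pairs (rather than distinct shared values) is what matches the formula R = |S_1|·|S_2|/X:

```python
import random
from collections import Counter

def estimate_space_size(s1, s2):
    """Estimate population size X from pairwise collisions between two
    independent uniform samples: E[R] = |S1|*|S2|/X, so X ~ |S1|*|S2|/R."""
    counts = Counter(s2)
    r = sum(counts[v] for v in s1)   # number of colliding (i, j) pairs
    if r == 0:
        return float('inf')          # no collisions: only a lower bound
    return len(s1) * len(s2) / r

# sanity check on a uniform space of known size 10,000
random.seed(0)
s1 = [random.randrange(10_000) for _ in range(2_000)]
s2 = [random.randrange(10_000) for _ in range(2_000)]
print(estimate_space_size(s1, s2))   # close to 10,000
```

With a non-uniform distribution the same estimator inflates R and thus underestimates X, which is exactly why the text calls the empirical estimate a lower bound.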

4.6. INPUT REPRESENTATION

We report the results for one and two variables, for all configurations, in Tables 1 and 2. In Table 1, we include results for both the COARSE and FINER configurations. We observe that COARSE proof steps with the PREFIX representation provide the best full proof accuracy for four out of six configurations (especially for larger coefficient sizes). Across COARSE and FINER, in five out of six configurations the PREFIX representation increases the full proof accuracy over INFIX, though the improvement is not always substantial. In the SMALL COEFF configuration, the FINER setting improves over COARSE for full proof accuracy. From the calibration results, we see that the winning combinations often provide the highest calibration F-1 score (more prominently for 2 variables), indicating less ambiguity in the decisions made. In Table 2, using the PREFIX representation for two variables provides 3 to 4% boosts in full proof accuracy for 4 out of 6 configurations. Since FINER steps do not improve full proof accuracy for two variables, we report those results in Table 6 in the Appendix. However, for NO BACKTRACK, the infix representation clocks a 9.5% improvement over prefix. Comparing with Endpoint accuracy, as coefficient sizes grow from SMALL to NO BACKTRACK for 1 variable, the Endpoint accuracy is only slightly higher (1 to 2%) than the full proof accuracy. However, for MEDIUM TERMS and MEDIUM DEGREE, the Endpoint accuracy shows a 3.6% and 13% improvement respectively. For 2 variables, the Endpoint task accuracy is higher in most cases. In Tables 7 and 8 (in the Appendix) we show the model errors for each step type. We observe that more than 80% of the model errors occur in the multiplication step. In the MEDIUM TERMS setting, factor simplification causes 15-25% of the errors, possibly because of the higher number of factors to simplify. For the 2-variable case, the addition step accounts for 10-15% of the errors.
In all other cases, factor simplification and addition each cause close to 5% of the model errors. As mentioned in §4.4, we experimented with symbolic coefficients to mitigate the difficulties with integer multiplication. This, however, did not give good results, possibly because the output becomes too long.

4.6.1. ANNOTATED PROOFS

In each step, simplification is performed on a sub-expression of the polynomial. To check explicitly whether the system can locate the sub-expression and determine the type of simplification step, we devise the annotated proof setting. For each simplification step, we add an intermediate step in which the model annotates the part of the polynomial to operate on. For example, the starting input sequence is "MARK $ (5 * x_1^2 + x_1 * x_2) * (3 * x_1) * (2)", and the corresponding expected output sequence is "MUL $ #(5 * x_1^2 + x_1 * x_2) * (3 * x_1)# * (2)". Each sequence has two parts: 1) the step to perform (MARK, MUL, FAC, SUM), and 2) the polynomial expression. In a MARK step, a marker token (#) is used to annotate the candidate sub-expression to be simplified next. Results are reported in Table 3 and Appendix Tables 9, 10 and 11. Compared to the non-annotated setting, while the step-wise accuracy is similar, the proof accuracy often suffers by 7-10%. One reason for this decrease is that annotated proofs are twice as long as non-annotated ones. However, we note that errors in the MARK step are the lowest compared to the other step types. This indicates that the models are able to learn the candidate sub-expression for simplification and predict the next operation correctly.
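At the sequence level, the annotation amounts to wrapping the candidate sub-expression in marker tokens; a string-level sketch, where the `mark` helper and the exact whitespace-free rendering are our illustrative assumptions:

```python
def mark(sequence, sub_expression):
    """Wrap the candidate sub-expression in '#' marker tokens."""
    return sequence.replace(sub_expression, '#' + sub_expression + '#', 1)

# Input to the model (MARK step) and the expected output (next MUL step):
state = "(5*x1**2 + x1*x2)*(3*x1)*(2)"
mark_input = "MARK $ " + state
mul_output = "MUL $ " + mark(state, "(5*x1**2 + x1*x2)*(3*x1)")
print(mul_output)  # MUL $ #(5*x1**2 + x1*x2)*(3*x1)#*(2)
```

The model thus has to emit both the operation tag and the marker positions, making its localization decision directly inspectable.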

4.7. OUT-OF-DISTRIBUTION EVALUATION

We also test out-of-distribution generalization by choosing test configurations different from the training ones. The best 2-variable models (COARSE/PREFIX) were tested on 1-variable datasets with the same coefficient configuration. Interestingly, we observe (in the Appendix) that for MEDIUM COEFF, the 2-variable models outperform the corresponding 1-variable models. For the LARGE COEFF case, the improvement is close to 6% over the 1-variable model. As expected, the 2-variable models perform better on the 1-variable dataset than on the 2-variable one. The results for OOD evaluation with respect to coefficient limits, polynomial degree and polynomial length (number of terms in the starting polynomial) are discussed in the Appendix (Tables 15 & 16).

4.8. CURRICULUM LEARNING

Simplification steps entail learning addition and multiplication of numeric coefficients and symbolic variables. Since some of the individual sub-tasks seem harder to grasp, we employ different types of curricula based on the Mastering-Rate-based (MR) curriculum learning algorithm proposed by Willems et al. (2020). For all our experiments, we use the MR algorithm with the gAmax Linreg A2D converter functions described in Willems et al. (2020). Model parameters and training configurations remain the same. We show the results in Table 17 for the 1-variable COARSE configuration. As the coefficient size grows from SMALL, MEDIUM and LARGE to NO BACKTRACK, improvements in full proof accuracy steadily increase from 1% to 10.84% (COARSE/INFIX). For NO BACKTRACK, the improvement in top-1 accuracy is 20% over the no-curriculum setting. However, for MEDIUM TERMS we observe a drop in accuracy for all curricula and input representations. It is possible that more carefully designed curricula may improve the results. There is no clear advantage observed between the infix and prefix representations. However, compared to learning without a curriculum, the improvement observed for the infix representation is often larger than for prefix.
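The mastering-rate idea can be sketched as follows. This is a heavily simplified reading of Willems et al. (2020), not their algorithm: the attention formula, the `min_attention` floor, and the example mastering rates are all our assumptions; only the task names and curriculum edges come from our setup:

```python
# Sketch: a task receives training attention once its ancestors in the
# curriculum graph are mastered while the task itself is not yet mastered.

def task_attention(mastering, ancestors, min_attention=0.05):
    """mastering: task -> mastering rate M_c(t) in [0, 1];
    ancestors: task -> prerequisite tasks (edges of the curriculum graph)."""
    raw = {}
    for task, m in mastering.items():
        anc = min((mastering[a] for a in ancestors[task]), default=1.0)
        raw[task] = max(min_attention, anc * (1.0 - m))
    z = sum(raw.values())
    return {task: a / z for task, a in raw.items()}

# Curriculum C for 1 variable: ADD before MUL3, MUL3 before MIXED, ADD before MIXED.
ancestors = {'ADD': [], 'MUL3': ['ADD'], 'MIXED': ['MUL3', 'ADD']}
mastering = {'ADD': 0.95, 'MUL3': 0.30, 'MIXED': 0.05}
probs = task_attention(mastering, ancestors)
print(probs)
```

With ADD nearly mastered, most attention flows to MUL3; MIXED stays down-weighted until its ancestors' mastering rates rise.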

5. CONCLUSION

We explored the polynomial simplification task to investigate the capabilities and shortcomings of Transformer networks along various dimensions. We proposed a synthetic polynomial generation algorithm which generates constrained polynomials with unique proof steps. While Transformers perform impressively in many settings, reaching above 90% proof accuracy, there were also clear limitations, and there are many avenues for future work. Among notable results, in many cases the full proof accuracy is lower than the endpoint accuracy, but by a small margin. This is perhaps not surprising, because the model is trained to optimize step-wise accuracy, and generating a valid proof requires getting all of the multiple proof steps correct. Thus a more proof-centric training approach might further improve proof-wise accuracies. The prefix representation has a slight advantage over infix, and coarse proofs have a slight advantage over fine proofs. Transformers quickly learn addition but consistently struggle with multiplication. Carefully designed curricula can boost full proof accuracy by up to 10% for large coefficient sizes. Models trained on two-variable datasets often did very well on single-variable datasets, even better than the models trained on single-variable datasets. Exploring multivariate polynomial manipulations and more general algebraic systems are some immediate future directions, though even for the polynomial simplification task significant gaps remain in our understanding.

A ATTENTION VISUALIZATION

We take a model trained on COARSE granularity proofs using the INFIX representation for 1 variable with the SMALL COEFF configuration.
We take the following example: P_0 = (4 * x_1^2) * (5 * x_1^3 + 4 * x_1) + (12 * x_1), /* MULSTEP */ = (20 * x_1^5 + 16 * x_1^3) + (12 * x_1). In Figure 1, we observe that the layer 2 encoder-decoder attention indicates that, while generating the number 16, the Transformer network is clearly able to attend to the two digits 4 and 4 required for the multiplication. In Figure 2, we observe that the Transformer at the same time learns to copy the expression 12 * x_1 in layer 1. Even though such clear logical patterns emerge quite frequently, in some cases the patterns are hard to interpret.

Algorithm 1: BuildProduct (Sampling Products)
Input: x_P, mdeg
Constraints: nvars_prod, max_coeff_prod, max_fac_prod, max_terms_prod
Output: A list of factors F_seq
1  Sample nvar ∈ {num_vars_fac, ..., nvars_prod}
2  nvar = min(|x_P|, nvar)
3  Sample nvar variables from x_P as x_Pi   // variable set for this product
4  Sample nfac ∈ {2, ..., max_fac_prod}   // #factors for this product
   /* Get maximum degree, terms and coefficient available */
5  rdegree = mdeg, rterms = max_terms_prod, rcoeff = max_coeff_prod
6  cprod = 1   // keeps track of the product built so far
7  F_seq = [ ]
8  for j ← 1 to nfac do
9      f_j = buildFactor(x_Pi, rdegree, rterms, rcoeff)
       /* Update degree, terms and coefficient for the next factor */

Figure 2: The layer 1 encoder-decoder attention for the coefficient 12 in the last step (20 * x_1^5 + 16 * x_1^3) + (12 * x_1). It is expected that in this step this product remains unchanged and is simply copied to the output. Accordingly, we see that the layers learn to copy the coefficients directly by attending to the corresponding digits (i.e., 1 attends to 1 in the last product). Config: COARSE, SMALL COEFF, INFIX, 1 variable.

B ALGORITHMS

The polynomial sampling algorithms buildProduct and buildFactor are provided in Algorithms 1 and 2 respectively.

Algorithm 2: BuildFactor (excerpt)
7  Sample d[k] variables (with repetition) from cvars   // e.g. if d[k] = 4 and cvars = [x_1, x_2], this may sample [x_1, x_2, x_1, x_1]
8  Convert the selected d[k] variables to a term   // t_k = c_k * x_1^3 * x_2
9  end
10 f_j = sum_{k=1}^{nterms} t_k
11 return f_j

C TABLE OF CONSTRAINTS AND NOTATIONS

We provide the full list of constraints and notations in Table 4.

Term Constraints:
• #Products: nprod ∈ {2, ..., maxP_P}
• #Factors in P_i: nfac_i ∈ {2, ..., maxf_P}, ∀i ∈ {1, ..., nprod}
• #Terms in f_ij: |terms(f_ij)| ∈ {1, ..., maxT_f}, ∀f_ij ∈ f_P
• #Terms in P̂_i: |terms(P̂_i)| ≤ maxT_P, ∀P_i ∈ P_0

Degree Constraints:
• Degree in P̂: d̂_mn ≤ D_P, ∀ t̂_m ∈ terms(P̂), ∀ x_n ∈ vars(t̂_m)
• Degree in f_ij: d_kl ≤ D_f, ∀k ∈ terms(f_ij), ∀f_ij ∈ f_P

Variable Constraints:
• #Variables in P_0: |x_P| ≤ V_P
• #Variables in P_i: |vars(P_i)| ≤ V_P, ∀P_i ∈ P_0
• #Variables in f_ij: |vars(f_ij)| ≤ V_f, ∀f_ij ∈ f_P

Coefficient Constraints:
• Coeff in P̂: â_m ≤ C_P, ∀ â_m ∈ coeffs(P̂)
• Coeff in P̂_i: â_ij ≤ C_P, ∀ â_ij ∈ coeffs(P̂_i), ∀P_i ∈ P_0
• Coeff in f_ij: a_k ≤ C_f, ∀ a_k ∈ coeffs(f_ij), ∀f_ij ∈ f_P

D PROBLEM SPACE SIZE ESTIMATION

We present the problem space size estimates here in Table 5 . 

E INPUT REPRESENTATION (ADDITIONAL RESULTS)

We present the results for the FINER configuration in the 2-variable setting in Table 6. The errors made by the models in the 1-variable and 2-variable settings are presented in Tables 7 and 8 respectively.

F ANNOTATED PROOF (ADDITIONAL RESULTS)

We present the results for the COARSE and FINER configurations in the 2-variable setting for annotated proofs in Table 9. The errors made by the models in the 1-variable and 2-variable settings are presented in Tables 10 and 11 respectively.

G FULLY SYMBOLIC PROOFS

As more than 80% of the errors occurred in the multiplication step, we separately tested the Transformer's ability to do arithmetic by creating datasets involving multiplication and addition of 4-digit and 9-digit numbers. While the models quickly achieved an accuracy close to 99% for addition, for multiplication they could not go beyond even 1% after seeing 2M examples. Hence, we envision a setting where polynomial simplification steps involve only symbolic addition and multiplication, without any arithmetic manipulation. For example, instead of multiplying 3 and 4 to get 12, the model outputs c_1 * c_2 given coefficients c_1 and c_2. The results for the 1-variable setting are presented in the Appendix.
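The fully symbolic variant can be illustrated with Sympy; a small sketch in which the coefficient symbols c_1, c_2, c_3 and the two factors are hypothetical:

```python
import sympy as sp

# Sketch of a fully symbolic mulstep: coefficients remain symbols, so the
# target sequence contains products like c1*c3 instead of evaluated integers.
c1, c2, c3 = sp.symbols('c1 c2 c3')
x1 = sp.Symbol('x1')

factor_a = c1 * x1**2 + c2 * x1
factor_b = c3 * x1
product = sp.expand(factor_a * factor_b)
print(product)
```

The target thus requires only symbol rearrangement (c1*c3, c2*c3), at the cost of longer output sequences as coefficient products accumulate.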

H OUT-OF-DISTRIBUTION EVALUATION

We present the results for Out-of-Distribution evaluation here.

I CURRICULUM LEARNING

Other curriculum learning algorithms (Matiisen et al., 2019; Graves et al., 2017) are special cases of the above setting with different choices for the program. To focus learning on tasks that are learnable but not yet learnt, the authors define an ordered curriculum O_C, a directed graph over the tasks in C: an edge from A to B indicates that learning task A before task B is preferable. For supervised learners, the learnability of each task depends on its mastering rate M_c(t), computed from the normalized mean accuracy for that task at time-step t. At each time-step, the MR algorithm computes the attention over a task, a_c(t), from the mastering rates of its ancestors and successors.

For polynomial simplification with 1 variable, we define the tasks ADD, MUL2, MUL3, SCOEFF and MIXED. For ADD, only one factor per product is allowed, so there is no multiplication. For MUL2 and MUL3, only one product is allowed, with a maximum of two and three factors respectively. SCOEFF denotes the SMALL COEFF configuration, and MIXED is the final full-size configuration of the target variable setting. We define the following curriculums:
• C: {(ADD, MUL3), (MUL3, MIXED), (ADD, MIXED)}.
• C2: {(ADD, MUL2), (MUL2, MUL3), (MUL3, MIXED), (ADD, MIXED)}.
• C4: {(ADD, MUL2), (MUL2, MUL3), (MUL3, SCOEFF), (ADD, SCOEFF), (SCOEFF, MIXED)}.
For all our experiments, we use the MR algorithm with the gAmax Linreg A2D converter functions described in Willems et al. (2020). Model parameters and training configurations remain the same as before. We show the results in Table 17 for the COARSE configuration. As the coefficient size grows from SMALL, MEDIUM, LARGE to NO BACKTRACK, the improvement in full proof accuracy steadily increases from 1% to 10.84%. For NO BACKTRACK, top-1 accuracy improves by 20% over the no-curriculum setting. However, for MEDIUM TERMS, we observe a drop in accuracy for all curriculums and input representations.

More carefully designed curriculums may improve these results. We observe no consistent pattern between the infix and prefix representations; however, compared to learning without a curriculum, the improvement for the infix representation is larger than for prefix.
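The mastering-rate attention can be sketched as follows. This is a simplified reconstruction in Python of the general idea, not the exact gAmax Linreg A2D implementation of Willems et al. (2020), and the function names are ours: a task receives attention when its ancestors are mastered but it is not yet mastered itself.

```python
def attention(tasks, edges, mastering):
    """Simplified mastering-rate attention over an ordered curriculum.

    tasks: list of task names; edges: (A, B) means "learn A before B";
    mastering: dict task -> mastering rate in [0, 1].
    Illustrative reconstruction, not the exact MR algorithm.
    """
    ancestors = {t: [a for (a, b) in edges if b == t] for t in tasks}
    att = {}
    for t in tasks:
        learnable = 1.0
        for a in ancestors[t]:
            learnable *= mastering[a]       # ancestors should be mastered
        att[t] = learnable * (1.0 - mastering[t])  # ...but t should not be
    total = sum(att.values()) or 1.0
    return {t: v / total for t, v in att.items()}  # distribution over C

# Curriculum C from the paper: learn ADD before MUL3 before MIXED.
C = ["ADD", "MUL3", "MIXED"]
O_C = [("ADD", "MUL3"), ("MUL3", "MIXED"), ("ADD", "MIXED")]
dist = attention(C, O_C, {"ADD": 0.95, "MUL3": 0.4, "MIXED": 0.05})
```

With ADD nearly mastered and MIXED barely started, most of the sampling mass goes to MUL3, which is exactly the "learnable but not yet learnt" frontier the ordered curriculum is meant to target.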



The generation algorithm in Lample & Charton (2019) may generate nested sums and products. For such polynomials, a unique proof sequence is hard to define, which makes whole proofs harder to evaluate. Our restriction on the form of the polynomial lets us generate unique proofs, which are easier to evaluate.

We have also attempted recursive proof generation, where the output from the decoder is fed to the encoder in the next step. It does not differ from teacher-forcing since, if the model is wrong at any step, it does not recover after that.

https://github.com/facebookresearch/SymbolicMathematics

Similar to Lample & Charton (2019), we find that the percentage of malformed outputs was very low (< 0.5%), so we did not explicitly correct for it. For full details, please see Appendix Section I.

We use N_b as 10. For other default parameters in CL, please check github.com/lcswillems/automatic-curriculum.



10   cprod = cprod * f_j
11   rdegree = rdegree - degree(f_j)
12   rterms = max_terms_prod / |terms(cprod)|
13   rcoeff = max_coeff_prod / max(coeffs(cprod))
14   Append f_j to F_seq
15   if rdegree == 0 then
16     break
17   end
18 end
19 Shuffle F_seq

Figure 1: The layer-2 encoder-decoder attention for the output digits of 16 in the first simplified product of the output (20 * x1**5 + 16 * x1**3) + (12 * x1). As expected, the digits 1 and 6 attend to the coefficients of the first and third monomials in the input expression (4 * x1**2) * (5 * x1**3 + 4 * x1) + (12 * x1). Config: COARSE, SMALL COEFF, INFIX, 1 variable.

BuildFactor (Sampling a Factor)
Input: x_{P_i}, rdegree, rterms, rcoeff
Constraints: num_vars_fac, max_coeff_fac, max_terms_fac, max_degree_fac
Output: A factor f_j, number of terms nterms_j
1 Sample nvar ∈ {1, . . . , num_vars_fac}
2 cvars = Sample nvar variables from x_{P_i} // Variable set for this factor
3 Sample nterms ∈ {1, . . . , min(max_terms_fac, rterms)} // #Terms for this factor
4 Sample {d_k}, k = 1 . . . nterms, s.t. d_k ∈ {0, . . . , min(max_degree_fac, rdegree)} // Term degrees: degree 0 allows for constant terms
5 Sample {c_k}, k = 1 . . . nterms, s.t. c_k ∈ {1, . . . , min(max_coeff_fac, rcoeff)} // Term coefficients
6 for k ← 1 to nterms do
7   Select d_k variables from cvars with replacement
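The BuildFactor pseudocode can be rendered in Python roughly as follows. This is a sketch under our own data representation (a factor as a list of coefficient/monomial pairs); the constraint defaults are illustrative, not the paper's settings:

```python
import random

def build_factor(variables, rdegree, rterms, rcoeff,
                 num_vars_fac=2, max_coeff_fac=9,
                 max_terms_fac=3, max_degree_fac=4):
    """Sketch of the BuildFactor sampler.

    Returns a factor as a list of (coefficient, {var: degree}) terms,
    together with its number of terms.
    """
    nvar = random.randint(1, num_vars_fac)
    cvars = random.sample(variables, min(nvar, len(variables)))  # variable set
    nterms = random.randint(1, max(1, int(min(max_terms_fac, rterms))))
    factor = []
    for _ in range(nterms):
        # Term degree: 0 allows for a constant term.
        d = random.randint(0, max(0, int(min(max_degree_fac, rdegree))))
        # Term coefficient, bounded by the remaining coefficient budget.
        c = random.randint(1, max(1, int(min(max_coeff_fac, rcoeff))))
        monomial = {}
        for v in random.choices(cvars, k=d):  # d variables, with replacement
            monomial[v] = monomial.get(v, 0) + 1
        factor.append((c, monomial))
    return factor, nterms

random.seed(0)
f, n = build_factor(["x1", "x2"], rdegree=5, rterms=3, rcoeff=9)
```

The remaining-budget arguments rdegree, rterms and rcoeff correspond to the running limits updated in the outer product-building loop.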

The attention over a task also accounts for the mastering rates of its successors. During training, a hyperparameter N_b for the curriculum determines the number of batches considered at a step before re-computing the attention over tasks. Using the program d, we first sample N_b * b examples from the tasks in C. The model is then trained on N_b randomly sampled minibatches while the mastering rates are updated.
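The batch-sampling step can be sketched as follows. This is illustrative Python with names of our own choosing; `program_dist` stands for the distribution that the program d produces at the current time-step:

```python
import random

def sample_training_examples(program_dist, task_examples, n_b, batch_size):
    """Draw N_b * b examples according to the program's task distribution.

    program_dist: dict task -> probability (current distribution over C).
    task_examples: dict task -> list of examples for that task.
    """
    tasks = list(program_dist)
    weights = [program_dist[t] for t in tasks]
    # Pick a task for each example slot, then sample an example from it.
    picks = random.choices(tasks, weights=weights, k=n_b * batch_size)
    return [random.choice(task_examples[t]) for t in picks]

random.seed(0)
batch = sample_training_examples(
    {"ADD": 0.2, "MUL3": 0.7, "MIXED": 0.1},
    {"ADD": ["a1", "a2"], "MUL3": ["m1", "m2"], "MIXED": ["x1"]},
    n_b=2, batch_size=4)
```

After the model trains on these N_b minibatches, the mastering rates are re-estimated and the distribution over tasks is recomputed.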

Results for 1 variable in the COARSE and FINE configuration for both Infix and Prefix representation.

Results for 2 variables for the COARSE configuration for both Infix and Prefix representations.

We experiment only with the INFIX representation. The results for 1 variable and 2 variables are in Table 3 and Table 9 (in the Appendix). The errors per step type are shown in Appendix Tables 10 and 11.

(Table 14) that in all settings except one

Results for FINE and COARSE configurations for 1 Variable for annotated proofs.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.

List of notations, and corresponding constraints that a generated polynomial satisfies.

Size Estimates for the problem space, after generating sets of size 5M.

Results for FINE configuration for 2 Variables for Infix and Prefix representation (No curriculum, No annotation).

Errors for 1 variable in the COARSE and FINE configuration for both Infix and Prefix input representation. (No curriculum, No annotation).

Errors for 2 variables in the COARSE and FINE configuration for both Infix and Prefix input representation. (No curriculum, No annotation).

Results for FINE and COARSE configurations for 2 Variables for annotated proofs (No curriculum).

Here, MEDIUM COEFF and MEDIUM DEGREE denote the same configurations as in the case with integer coefficients; the only difference is that the limits on coefficients no longer apply.

Errors for FINE and COARSE configurations for 1 Variable for annotated proofs (No curriculum).

Errors for FINE and COARSE configurations for 2 Variables for annotated proofs (No curriculum).

The errors made by the model for each kind of step are summarized in Table 13. We observe that the proof accuracy is about 20% lower than for the non-symbolic models. This could be because the intermediate polynomials in the simplification sequence become very long with symbolic coefficients.

Results for 1 Variable Symbolic Coeff setting. (No curriculum, No annotation).

Table 14 contains results for the best 2-variable models (Prefix/Coarse) tested on the 1 Variable setting. Table 15 contains results for the best 1-variable models (Prefix/Coarse) tested on the SMALL, MEDIUM and LARGE coefficient settings. As expected, the SMALL and MEDIUM models perform much worse when tested on higher coefficients. We also evaluated the best 1-variable models (Prefix/Coarse) on the MEDIUM DEGREE and MEDIUM TERMS settings, to check generalization with respect to the number of terms and the degree of the polynomial. Table 16 contains results for the same. The MEDIUM COEFF model is not able to generalize to more terms or to polynomials of higher degree.

Errors made by models in the 1 Variable Symbolic Coeff setting. (No curriculum, No annotation).

Results for OOD Testing. NVAR = 2 COARSE/PREFIX models tested on corresponding NVAR = 1 setting (No curriculum, No annotation).

OOD Testing: Prefix/Coarse 1 Variable Models tested on various coefficient limit configurations (SMALL, MEDIUM and LARGE). (No curriculum, No annotation).

Prefix/Coarse 1 Variable Models tested on various #term and degree configurations (MEDIUM DEGREE and MEDIUM TERMS). (No curriculum, No annotation).

Willems et al. (2020) define curriculum learning by 1) a curriculum, i.e. a set of tasks C = {c_1, . . . , c_n}, where a task is a set of examples of a similar type with a sampling distribution, and 2) a program, which for each training step defines the tasks to train the learner on, given its learning state and the curriculum. Formally, the program d : N → D_C is a sequence of distributions over C. The authors estimate the program function through an attention function, which defines attention over the tasks at a time-step, and an attention-to-distribution converter, which converts the attention into a distribution over C.

Curriculum Learning results for 1 variable for the COARSE configuration for both Infix and prefix representations.

