RETHINKING SYMBOLIC REGRESSION: MORPHOLOGY AND ADAPTABILITY FOR EVOLUTIONARY ALGORITHMS

Abstract

Symbolic Regression (SR) is the well-studied problem of finding closed-form analytical expressions that describe the relationship between variables in a measurement dataset. In this paper, we rethink SR from two perspectives: morphology and adaptability. Morphology: Current SR algorithms typically use several man-made heuristics to influence the morphology (or structure) of the expressions in the search space. These man-made heuristics may introduce unintentional bias and data leakage, especially given the relatively few equation-recovery benchmark problems available for evaluating SR approaches. To address this, we formulate a novel minimalistic approach, based on constructing a depth-aware mathematical language model trained on terminal walks of expression trees, as a replacement for these heuristics. Adaptability: Current SR algorithms tend to select expressions based on only a single fitness function (e.g., MSE on the training set). We promote the use of an adaptability framework in evolutionary SR which uses fitness functions that alternate across generations. This leads to robust expressions that perform well on the training set and are close to the true functional form. We demonstrate this by alternating fitness functions that quantify faithfulness to values (via MSE) and to empirical derivatives (via a novel, theoretically justified fitness metric coined MSEDI). Proof-of-concept: We combine these ideas into a minimalistic evolutionary SR algorithm that outperforms a suite of benchmark and state-of-the-art SR algorithms on problems with unknown constants added, which we claim are more reflective of SR performance in real-world applications. Our claim is then strengthened by reproducing the superior performance on real-world regression datasets from SRBench. For researchers interested in equation-recovery problems, we also propose a set of conventions that can be used to promote fairness in comparisons across SR methods and to reduce unintentional bias.

1. INTRODUCTION

Important discoveries rarely come in the form of large black-box models; they often appear as simple, elegant, and concise expressions. The field of applying machine learning to generate such mathematical expressions is known as Symbolic Regression (SR). The expressions obtained from SR come in a compact and human-readable form that has fewer parameters than black-box models, and they allow for useful scientific insights by mere inspection. This property has led SR to be gradually recognized as a first-class algorithm in various scientific fields in recent years, including Physics (Udrescu & Tegmark, 2020), Material Sciences (Wang et al., 2019; Sun et al., 2019) and Knowledge Engineering (Martinez-Gil & Chaves-Gonzalez, 2020). The most common technique used in SR is genetic programming (GP) (Koza, 1992). GP generates populations of candidate expressions and evolves the best expressions (selected via a fitness function) across generations through evolutionary operations such as selection, crossover, and mutation. In this paper, we rethink GP-SR from two evolutionary-inspired perspectives: morphology and adaptability. Morphology. SR algorithms have traditionally used several man-made heuristics to influence the morphology of expressions. One method is to introduce rules (Worm & Chiu, 2013), constraints (Petersen et al., 2019; Bladek & Krawiec, 2019) and rule-based simplifications (Zhang et al., 2006), with the objective of removing redundant operations and author-defined senseless expressions. An example is disallowing nested trigonometric functions (e.g., sin(1 + cos(x))). Another method is to assign complexity scores to each elementary operation (Loftis et al., 2020; Korns, 2013), intending to suppress the appearance of rare operations that are given a high author-assigned complexity score.
However, these man-made heuristics may introduce unintentional bias and data leakage, exacerbated by the small quantity of benchmark problems in SR (Orzechowski et al., 2018). With the success of deep learning and its applications to SR, there exists substantial motivation to utilize deep learning to generate potential morphologies of candidate expressions in SR (Petersen et al., 2019; Mundhenk et al., 2021). Such a technique also comes with the benefit of being easily transferable to a problem with a different set of elementary operations. In this regard, we first show how current SR methods are reliant on these man-made heuristics and highlight the potential drawbacks. We then show how using our neural network pre-trained on Physics equations (which we later introduce as TW-MLM) improves the performance of GP-SR even in the absence of such man-made heuristics. Adaptability. SR algorithms tend to evaluate a candidate expression based on its faithfulness to empirical values. Some common fitness functions are Mean Absolute Error (MAE), Mean Squared Error (MSE) and Normalized Root Mean Square Error (NRMSE) (Mundhenk et al., 2021), among other measurements. We propose that SR should adopt a variety of characterization measures beyond the existing ones that measure faithfulness to empirical values. While previous work has suggested alternative characteristics for dealing with time-series data (Schmidt & Lipson, 2010; 2009), these are not easily transferable to SR in general and were not theoretically derived. In this paper, we propose to quantify the faithfulness of a candidate expression's empirical derivative to the ground truth. Motivated by evolutionary theory (Bateson, 2017), we adopt an adaptability framework that changes the fitness function across generations. This process makes it harder for pseudo-equations to survive through the different fitness functions and easier for a ground-truth equation to survive across the generations.
The additional benefit of such a method lies in the increased utility of the eventual expression. If the user intends to take the derivative of the eventual equation, then a measure of faithfulness to empirical derivatives as fitness would assist that objective. In this paper, we alternate between different fitness functions to improve the performance of GP-SR. Specifically, we alternate between MSE and a newly defined fitness function we term MSEDI. Proof-of-concept. We combine these ideas to demonstrate a proof-of-concept (foreshadowed in Figure 1) through a minimalistic evolutionary SR algorithm that outperforms all methods (including state-of-the-art SR) on problems with unknown constants (i.e., Jin* (Jin et al., 2019)) and outperforms many benchmark models on problems without unknown constants (i.e., Nguyen* (Uy et al., 2011) and R* (Krawiec & Pawlak, 2013)). We contend that performance on datasets with unknown constants is more indicative of SR performance in real-world applications, which involve naturally occurring processes such as scaling. Our claim is then strengthened by reproducing this superior performance on real-world regression datasets from SRBench (La Cava et al., 2021). To accommodate future research in SR using real-life datasets, we propose extra SR conventions for synthetic datasets in line with the "relevancy" criterion, so that results on benchmark problems correlate well with results on real-world applications (McDermott et al., 2012). The remainder of this paper is organized as follows: Section 2 explains related work and mechanisms to improve SR, focusing on recent deep learning work. Section 3 describes our proposed methodology and mechanisms. Section 4 combines all our proposed mechanisms with the vanilla GP approach and compares the performance with state-of-the-art, commercial and traditional SR approaches. Reflections and future work are given in Section 5. The main contributions of this paper are as follows: 1.
We propose a set of conventions and best practices that reduce the risk of human bias in the context of using SR on real-life datasets. 2. We develop a predictive model called TW-MLM that learns the morphology of expressions through a mathematical language model, which is then used to generate candidate expressions for SR. TW-MLM serves as an alternative to man-made heuristics that is less prone to human bias and easily transferable to new problems. 3. We propose a method for alternating fitness functions, inspired by adaptability in evolution theory, to tackle the problem of multi-objective optimization in SR. We do this by alternating between MSE and our novel theoretically justified metric called MSEDI. 4. We develop an integrated proof-of-concept by combining our ideas and perform extensive testing on both synthetic and real-world datasets (from SRBench).

Mathematical Language Model. Symbolic mathematics has been successfully addressed as both a machine translation problem and a next-word prediction problem (Lample & Charton, 2019; Kim et al., 2021). In particular, it has been reported that adding a pre-trained seq2seq model as a supplementary mechanism to neural-guided SR improves performance (Kim et al., 2021). The consensus is that mathematics should be viewed as a language with complex rules. Most useful mathematical expressions are not only short and concise, they also tend to obey a variety of hidden rules, such as avoiding nested trigonometric functions (e.g., sin(1 + cos(x))). In this context, a mathematical language model can be developed to learn such rules implicitly.

2. RELATED WORK

Derivatives in SR. The usage of derivatives in SR has been hinted at in previous works on time-series data (Schmidt & Lipson, 2010; 2009). In their work, given N data samples from two time-dependent variables, x(t) and y(t), the system's derivatives are ∆x/∆y = x′/y′, where x′ and y′ represent the empirical derivatives of x and y with respect to time. These values are then compared against the derivatives obtained from the candidate expressions, δx_i/δy_i, through a mean logarithmic error: (1/N) Σ_{i=1}^{N} log(1 + |∆x_i/∆y_i − δx_i/δy_i|). However, the theoretical derivation is not explored in their paper. Recent works in SR also include model discrimination by incorporating prior information on the sign of the first derivative based on physical knowledge (Engle & Sahinidis, 2021). In our paper, we develop a theoretical basis, in line with existing assumptions for modelling errors, to quantify faithfulness to derivatives for general equations that do not necessitate time-dependency. Multi-Objective Genetic Programming (MOGP). Several general approaches exist for MOGP (Konak et al., 2006). A possible approach is to set all but one objective as constraints. However, the process of selecting values for the constraints, and which objectives to set as constraints, is arbitrary. Another approach is to output a Pareto optimal set instead of a single individual expression to reflect the trade-offs between the different objectives. However, in the context of SR, the ground truth expression would be the best performer for both objectives, rather than forming a trade-off. Benchmark Methods. For comparison, we include traditional SR, state-of-the-art (SOTA), commercial algorithms, and random search. The methods selected report the lifetime population size of expressions to enable fair comparison of recovery rates.
The benchmark methods are: (i) DSR (Petersen et al., 2019): previous SOTA method that pioneered the usage of reinforcement learning in SR; (ii) DSO-NGGP (Mundhenk et al., 2021): current SOTA method in SR that is a hybrid of DSR and GP; (iii) GPLearn (Stephens, 2016): Python framework for standard GP-SR (Koza, 1992), which has seen wide usage as a generic SR method (Pankratius et al., 2018; Ferreira et al., 2019); (iv) TuringBot (TuringBot, 2020): commercial SR product based on simulated annealing, which has been shown to be competitive among top commercial SR algorithms (Ashok et al., 2021); (v) Random Search: generates expressions at random without evolution.

3. METHODOLOGY AND MECHANISMS

Here, we first propose revised conventions to be used for SR experiments targeted at resolving the flaws and criticism of current SR metrics, which are independent of our SR algorithm. We then introduce our novel methods of controlling morphology of expressions and promoting adaptability in evolution, and justify these methods individually through preliminary results and ablation studies.

3.1. PROPOSED CONVENTIONS AND BEST PRACTICES

We propose and justify conventions for SR experiments, adhering to the criteria discussed in the call for better benchmarking so that SR experiments can correlate better with results on real-world applications (McDermott et al., 2012). In many recent papers, recovery rate has been utilized as the primary metric for evaluating SR and is defined as "the fraction of independent training runs in which an algorithm discovers an expression that is symbolically equivalent to the ground truth expression within a maximum of 2 million candidate expressions" (Mundhenk et al., 2021; Petersen et al., 2019; Kim et al., 2021; Larma et al., 2021). To this end, the conventions we propose are focused on improving the effectiveness of recovery rate as an evaluation metric for SR experiments. Fixed set of primitive functions. In contrast to previous SR experiments that use a varying primitive function set depending on the equation, we propose to use a fixed set of primitive functions across all datasets and all methods. This is a necessary step towards using SR on real-life datasets, since we are blind to the underlying primitive functions in real-world scenarios. In our paper, we use a fixed primitive function set {add, mul, sub, div, sin, cos, arcsin, log, exp, pow} for all datasets and methods, selected from the dataset used to train our mathematical language model. Top-1 Approximate Recovery. We also propose a new measure, top-1 approximate recovery rate, that is more reflective of performance on real-life datasets than the exact recovery rate defined in the first paragraph of Section 3.1. Top-1 means taking only the best-scoring expression in the strictest sense, consistent with how SR is used for real-life data (Abdellaoui & Mehrkanoon, 2021; Phukoetphim et al., 2016; Barmpalexis et al., 2011). In other words, we only assess one best equation per experimental run.
We define an approximate recovery to be when the r-squared value of an expression (touted as the best error measure for SR (Keijzer, 2004)) over the entire sampling domain of the dataset is more than 99%. This is consistent with calls from the GP community to discourage measuring exact recovery on synthetic datasets, as exact-recovery results usually do not correlate well with performance in real-life applications (McDermott et al., 2012). The most obvious drawback of an exact recovery rate is that for real-life datasets, it is impossible to measure the true recovery rate by checking mathematical equivalence, since there will not be any accompanying ground truth equation for comparison. In our paper, we present top-1 approximate recovery rate as our primary metric, but we also include the results for top-1 exact recovery rate in brackets for comparison. Selecting an appropriate lifetime population size. To guard against setting a lifetime population that is too high, we also propose to benchmark against a random method (generating the entire lifetime population in one generation without any genetic operation) and to ensure the metric across all methods is not saturated at the selected population size. This is done to elicit meaningful conclusions from the results. Previous works have used an arbitrary lifetime population size of 2 million (Mundhenk et al., 2021; Petersen et al., 2019; Kim et al., 2021; Larma et al., 2021), but it is difficult to compare results across the different methods since many equations have near 100% recovery, even for the worst-performing methods. Additionally, using an arbitrary value may create unintended bias in results (Bergstra & Bengio, 2012). To these ends, we reduce the lifetime population size to 10000 following two observations we made prior to our main experiments. First, at the new size, the top-1 approximate recovery rate is never saturated across all methods, allowing us to compare the performance of each method across various equations.
Second, we evaluate a fully random search, implemented in practice by setting GP with only 1 generation at full population size, and find that the performance across most of the equations is near 0. This gives us higher confidence that SR methods with positive performance do not recover equations by pure chance. Datasets with unknown constants are more relevant. We also recommend testing on datasets that include unknown constants, such as the Jin* dataset, since we find that the performance of methods varies drastically with and without unknown constants. In addition, it is of practical interest to consider datasets with unknown constants, since they are common in real-life relations between variables, such as feature scaling (Udrescu & Tegmark, 2020). For example, f = e^(−θ²/2)/√(2π) is a real-world physics equation from the AI-Feynman database (see Appendix Table 7). In our experimental results recorded in Table 1, we observe that between TuringBot and GPLearn, TuringBot has the poorer performance on the Jin* dataset, which has unknown constants. However, TuringBot has the superior performance on the Nguyen* dataset, which does not contain unknown constants. A similar observation can be made between GPLearn and DSO-NGGP, where the relative performance is reversed depending on the presence of unknown constants. This phenomenon is due to the difference in the frequency of appearance of constants and the presence and extent of constants optimization in each method. For datasets with no unknown constants, the more an algorithm utilizes and optimizes constants, the more likely it is that equations with morphology dissimilar to the true equation become top candidates. This can be seen as inhibiting the evolutionary process, where the spots for expressions to undergo evolution are instead taken up by these pseudo-equations (expressions that perform well on the training set in terms of MSE but are not close to the ground truth equation in functional form).
In the context of real-life datasets, we argue that it is justifiable to assume that unknown constants will appear frequently in naturally occurring processes such as scaling. The performance of methods on datasets without unknown constants would then be less reflective of the performance on real-life datasets. We thus recommend that SR experiments favor datasets with unknown constants and assert that results obtained from such datasets are more reflective of real-life application.
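As a concrete illustration of the top-1 approximate recovery criterion proposed in this subsection, the check reduces to an r-squared computation over the sampling domain. The following is a minimal sketch; the function and variable names are our own illustrative choices, not from the paper's code:

```python
import numpy as np

def approximate_recovery(y_true, y_pred, threshold=0.99):
    """Top-1 approximate recovery check (sketch): the best expression's
    r-squared over the entire sampling domain must exceed 99%."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return (1.0 - ss_res / ss_tot) > threshold
```

In practice this would be applied only to the single best-scoring expression of each run, matching the "top-1" convention.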

3.2. MORPHOLOGY OF EXPRESSIONS

Previous work has sought to influence the morphology of candidate expressions using deep learning methods (Petersen et al., 2019; Bladek & Krawiec, 2019; Udrescu & Tegmark, 2020). However, these methods have never been fully independent; they have instead been an addition to another method, or they utilize man-made heuristics such as those mentioned in earlier sections. We also find that these methods can be heavily reliant on the heuristics. For instance, DSR performance drops sharply with the removal of in-situ constraints and complexity scores, with top-1 approximate recovery rate decreasing to 33% of its original value and top-1 exact recovery rate decreasing to 10% of its original value. Furthermore, these heuristics increase the likelihood that the algorithm is biased towards certain forms of expressions, allowing for an implicit data leakage. Finally, it is an extremely difficult task to form such man-made rules, made even harder because the discovery of such rules must be repeated from scratch when the primitive function space is distinctly different, such as when changing from ordinary algebra to boolean algebra. Thus, we aim to propose a method that is free of such man-made heuristics. Here, we outline a standalone method of generating candidate expressions that can be used independently. Terminal walks representation. Instead of using the prefix representation of expressions as input to a seq2seq model as done in other SR methods, we take inspiration from the random walks used in node2vec (Grover & Leskovec, 2016) and generate terminal walks from expression trees to reflect the hierarchical nature of expressions. A single terminal walk refers to the collection of nodes traversed from the root of the expression tree to either a variable node or a constant node. These terminal walks are then treated as sentences.
The benefit of such a method over the prefix representation is that the distances between operations in terminal walks are reflective of the distances between operations in the expression tree. In the prefix representation, by contrast, operations that appear far apart may in fact be near each other in the expression tree. For example, for the equation sin(1 + cos(x)), the nested cos is 2 tokens away from sin in both the terminal walk {sin, add, cos, x} and the expression tree, but 3 tokens away in the prefix notation. Building a language model to replace heuristics. We then treat the collection of all terminal walks as a corpus of sentences to train an RNN, as commonly practiced in next-word-prediction natural language processing tasks (Barman & Boruah, 2018). For our paper, we use an embedding layer, a long short-term memory layer and a dense layer sequentially to train a lightweight RNN. This RNN is the mathematical language model for our GP algorithm, which we coin the terminal walks mathematical language model (TW-MLM). Candidate expressions are then generated by randomly selecting an operation from a uniform distribution; then, for every incomplete link in the tree, an incomplete terminal walk is generated and fed into the TW-MLM. The TW-MLM then outputs a probability distribution that is used to select the next node to complete the tree. This process repeats until the tree has no incomplete links. Ablation: TW-MLM to improve recovery rates. In this ablation experiment, we use the baseline GP algorithm (Koza, 1992) implemented in GPLearn for comparison, since the other competitive SR methods impose man-made heuristics that would complicate the insights drawn from experimental results. For training the TW-MLM, we use the set of 2023 terminal walks generated from 100 Physics equations (Udrescu & Tegmark, 2020). These equations are suitable as they contain widely accepted expressions used in real-life scenarios, in contrast to the other SR datasets.
When tokenizing the equations, variables are represented as a single "variable token". When TW-MLM generates a new equation for a numerical dataset, this token is replaced by a randomly selected variable. Throughout this paper, we use a lightweight RNN comprising a single embedding layer, a single long short-term memory layer and a single dense layer. We experimented with replacing the RNN with a transformer, which we found to perform worse. As seen in Table 2, our experiments show that with just the sole usage of our TW-MLM to generate candidate expressions, we can drastically improve both top-1 exact recovery rate and top-1 approximate recovery rate across all 3 datasets. The intuition for the improvement is that the TW-MLM learns the intrinsic patterns that make equations human-readable, prompting the GP algorithm to explore a search space with a high concentration of human-readable equations.
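To make the terminal-walk representation concrete, the following sketch extracts terminal walks from a toy expression tree. The Node class and function names are our own illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of terminal-walk extraction from an expression tree.
class Node:
    def __init__(self, token, children=()):
        self.token = token
        self.children = list(children)

def terminal_walks(node, prefix=()):
    """Return every root-to-terminal path in the tree; each path
    becomes one 'sentence' in the corpus used to train the TW-MLM."""
    path = prefix + (node.token,)
    if not node.children:            # a variable or constant node
        return [list(path)]
    walks = []
    for child in node.children:
        walks.extend(terminal_walks(child, path))
    return walks

# sin(1 + cos(x)) yields the walks [sin, add, 1] and [sin, add, cos, x],
# the latter matching the paper's example walk {sin, add, cos, x}.
tree = Node("sin", [Node("add", [Node("1"), Node("cos", [Node("x")])])])
```

Collecting these walks over a corpus of equations (with variables mapped to a single variable token) gives the training sentences for the language model.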

3.3. ADAPTABILITY IN EVOLUTION

Borrowing the idea of adaptability from the theory of evolution (Bateson, 2017), we argue that a candidate expression close to the ground truth will survive through a multitude of fitness functions, whereas a pseudo-expression may perform deceptively well on one but worse on the other fitness measures. In this context, the challenge for SR is to find suitable secondary fitness functions that measure characteristics beyond the primary fitness function, i.e., faithfulness to empirical values via the MSE. One must also optimize for both fitness functions; we do this by alternating between the primary and secondary fitness functions. Additionally, we note that one benefit of the simple closed-form analytical expression found by SR is that it allows the user to apply traditional mathematical tools, such as derivatives, to the expression. In this paper, we explore the derivative as a secondary fitness function, i.e., faithfulness to empirical derivatives. We describe two natural approaches to doing this, mean squared error of derivatives and mean squared error of difference, and argue that the latter is superior to the former for our application.

Mean Squared Error (MSE) of Derivative (MSEDE). MSE is commonly used as a fitness and optimisation function, and yields the Maximum Likelihood Estimate (MLE): θ_MSE = arg min_θ Σ_{i=1}^{N} (y_i − ŷ_i)². MSE rewards expressions that are faithful to the values in the dataset. Likewise, we can reward expressions for being faithful to empirical derivatives, by computing the MSE of the empirical derivatives from the dataset and the candidate expressions. Consider N pairs of values (x_i, y_i), where i = 1, 2, . . . , N, sorted in ascending order of x. We define the empirical derivative as ∆y_i/∆x_i = (y_{i+1} − y_i)/(x_{i+1} − x_i). Then, θ_MSEDE = arg min_θ Σ_{i=1}^{N−1} (∆y_i/∆x_i − ∆ŷ_i/∆x_i)². Mean Squared Error of Difference (MSEDI). Here, we develop a new fitness measure derived from a theoretical basis consistent with the traditional error-modelling framework used in MSE: given measurements y = f(x) + C + ε, where C is an arbitrary constant and ε ∼ N(0, σ²), we aim to find ŷ = g_θ(x) such that g_θ(x) is a close approximation to f(x). We derive the MLE of the parameters θ by considering empirical derivatives. When g_θ(x) ≈ f(x) + C, we have ∆y_i/∆x_i − ∆ŷ_i/∆x_i = (ε_{i+1} − ε_i)/∆x_i ∼ N(0, 2σ²/(∆x_i)²). We note that the differences between the derivatives obtained from the dataset and the candidate expression follow Gaussian distributions as well, which are independent of each other within the set of odd-valued or even-valued i. The total log likelihood across both sets should be similar by symmetry; we choose to evaluate the total log likelihood across all i instead of picking one set. Letting σ′_i = √2σ/∆x_i, the log likelihood of a particular ∆y_i/∆x_i is ln(Pr(∆y_i/∆x_i | θ)) = −ln(√(2π) σ′_i) − (1/(2σ′_i²)) (∆y_i/∆x_i − ∆ŷ_i/∆x_i)². The total log likelihood across all i is thus

Σ_{i=1}^{N−1} [ −ln(√(2π) σ′_i) − (1/(2σ′_i²)) (∆y_i/∆x_i − ∆ŷ_i/∆x_i)² ].   (1)
The parameters θ of g_θ(x) are obtained by maximizing the total log likelihood in (1) to obtain:

θ_MSEDI = arg max_θ Σ_{i=1}^{N−1} [ −ln(√(2π) σ′_i) − (1/(2σ′_i²)) (∆y_i/∆x_i − ∆ŷ_i/∆x_i)² ]   (2)
        = arg max_θ Σ_{i=1}^{N−1} −(1/(2σ′_i²)) (∆y_i/∆x_i − ∆ŷ_i/∆x_i)²   (3)
        = arg min_θ Σ_{i=1}^{N−1} (∆y_i − ∆ŷ_i)²   (4)

MSEDI more relevant for real-world compared to MSEDE. However, we discover that MSEDE is a potentially harmful fitness function when dealing with real-life datasets, since it overfits to a selected set of noisy derivative values. Our experiments that include noise show that using MSEDE led to no recovery across 100 repetitions of experimentation, while MSEDI was successful. We observe that the random error ε contributes large noise to empirical derivatives, especially when ∆x_i is small. In other words, MSEDE can be viewed as a weighted version of MSEDI, with the weights being 1/(∆x_i)². The MSEDE function thus encourages over-fitting to heavily weighted values. Additionally, MSEDE relies on the value 1/∆x_i, which is problematic when ∆x_i is 0, as is common in real-world datasets with duplicated x values. MSEDI thus better fulfills the criterion of relevancy to real-life problems (White et al., 2013), providing an additional justification to choose our theoretically derived MSEDI. Multi-objective optimization using MSEDI and MSE. As discussed in Section 2, it is difficult to create a general method to optimize for both fitness functions using traditional approaches. For instance, a weighted combination does not work for some equations, as one fitness will dominate, effectively optimizing for only one fitness in those cases. Instead, using the idea of adaptability, we utilize MSEDI as a secondary test with the intention of filtering away pseudo-expressions as described earlier in Section 3.3.
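The contrast between MSEDE and the MSEDI objective of Eq. (4) can be sketched in a few lines of NumPy. The function names are ours, the data are assumed sorted in ascending x, and the sum form follows the arg-min objectives above (a mean differs only by a constant factor):

```python
import numpy as np

def msedi(y, y_hat):
    """Eq. (4) sketch: squared error of successive value differences.
    No 1/Δx factor, so duplicated x values cause no division by zero."""
    return np.sum((np.diff(y) - np.diff(y_hat)) ** 2)

def msede(x, y, y_hat):
    """MSEDE sketch: squared error of empirical derivatives. Each term
    is weighted by 1/(Δx_i)^2, amplifying noise where Δx_i is small
    and undefined where Δx_i = 0."""
    dx = np.diff(x)
    return np.sum(((np.diff(y) - np.diff(y_hat)) / dx) ** 2)
```

When all ∆x_i = 1 the two measures coincide, making the 1/(∆x_i)² weighting interpretation of MSEDE explicit.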
Our experiments show that adding this adaptability mechanism (using MSEDI as the fitness function once every 5 generations) more than doubles the exact recovery rate while maintaining a similar approximate recovery rate. Ablation: Escaping local optima with and without MSEDI. In addition to acting as a secondary fitness function to remove pseudo-expressions, we also find that MSEDI helps GP escape poor local optima. Among the experiments that do not recover the equation, we save the state of the GP and run it for an additional 5 generations of MSE. If the top-1 equation remains the same (which occurred for 72.82% of experiments), we roll back the state of the GP and run just 1 generation of MSEDI. The percentages of experiments in which this 1 generation of MSEDI changes the top equation are 19.82%, 15.58% and 44.33% for the Nguyen*, R* and Jin* datasets respectively. These results show that MSEDI helps GP escape poor local optima at which the search would otherwise have stagnated.

4. INTEGRATED PROOF-OF-CONCEPT

Combining the ideas described above, we add TW-MLM and the adaptability mechanism, which alternates between MSE and MSEDI as the fitness function, to GP with constant optimization, using the primitive function set {add, mul, sub, div, sin, cos, arcsin, log, exp, pow} as discussed in Section 3.1. The interfacing of each component can be visualized in a simple schematic (Figure 1).

Algorithm 1: Outline of our Integrated Proof-of-Concept
while current generation < max generations do
    Fill up the population of equations by generating equations from TW-MLM;
    if current generation mod 5 == 0 then
        Evaluate fitness of equations using MSEDI
    else
        Evaluate fitness of equations using MSE
    end
    Evolve and select equations based on evolutionary operations in GP (e.g., crossover, subtree mutation, hoist mutation, point mutation, reproduction)
end

Algorithm 1 outlines our proof-of-concept. TW-MLM is trained once on an MLM dataset and utilized to generate candidate expressions at the start of every generation of GP. GP then evaluates and filters this set of expressions, together with candidates derived from the previous generation, through a fitness function that alternates between generations. In our method, we use MSE as our base fitness function and switch to MSEDI once every 5 generations (decided based on hyper-parameter tuning on a smaller-sized experiment). We then conduct a series of experiments to test the proposed approach on both synthetic and real-world datasets. Synthetic Datasets Experiments. We compare with the benchmark methods and random search outlined in Section 2. We choose synthetic datasets with unknown constants (i.e., Jin*) and without unknown constants (i.e., Nguyen* and R*). Each equation in Table 6 (Appendix) is used to conduct 100 experiments per method. In Table 3 and Table 4, we tabulate the performance on the three synthetic datasets (which contain 17 equations to be recovered). Synthetic Datasets Performance Comparison.
Our method shows overall competitive performance with the state-of-the-art. On equations with unknown constants, shown under the Jin* dataset in Table 3, our method outperforms all other methods by a large margin in both top-1 approximate recovery rate and top-1 exact recovery rate. On equations without unknown constants (e.g., Nguyen*), our method outperforms all except DSO-NGGP for almost all of the equations, as shown in Table 4. Since Jin* includes unknown constants that allow for relations reflective of real-life variables such as feature scaling (Udrescu & Tegmark, 2020), it can be argued that the results on Jin* hold more practical value. In this sense, our model clearly outperforms all models, as seen in Table 3. Real-World Datasets (SRBench) Evaluation. We then evaluate our method against DSO-NGGP on 6 real-world datasets from SRBench (La Cava et al., 2021), with the results tabulated in Table 5. Unlike traditional SR datasets, real-world datasets do not have an accompanying ground truth equation. We present the percentage of experiments in which our method outperforms DSO-NGGP in terms of r-squared value. Our method consistently outperforms DSO-NGGP, further corroborating our previous results and strengthening the claim that SR datasets with unknown constants are more reflective of real-world dataset performance. Intuition and Ablation. The function of TW-MLM in our method is to implicitly learn the rules and constraints about the morphology of human-readable equations that were previously developed based on human judgment. This increases the likelihood that our method ventures into the search space containing expressions consistent with existing, widely accepted human-readable equations. In this sense, TW-MLM acts as a guide in the vast search space of possible equations.
Though manmade heuristics such as rules and constraints exist to fulfill these roles, they are difficult to find, difficult to express, and not easily transferable to new problems. TW-MLM, on the other hand, finds these rules and constraints implicitly, and the method transfers easily to a new problem. The adaptability mechanism (alternating between the MSE and MSEDI fitness functions) then synergizes with TW-MLM and GP by making it harder for pseudo-equations to survive each generation, as shown in the ablation study in Section 3.2. We also observe through the ablation study in Section 3.3 how MSEDI helps GP escape from poor optima. Added to GP, these mechanisms allow us to create a model that performs competitively with the state-of-the-art.
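MSEDI's exact definition is given earlier in the paper; purely as a hedged illustration of a fitness metric in this spirit (the finite-difference construction below is our assumption, not the paper's formula), faithfulness to empirical derivatives can be measured by comparing first differences of predictions and targets:

```python
import numpy as np

def msedi_sketch(y_pred, y_true, x):
    """Illustrative only: MSE between empirical (finite-difference) derivatives.

    A candidate that differs from the target by a constant offset scores 0 here,
    since only the slopes are compared.
    """
    order = np.argsort(x)                     # finite differences need sorted x
    dx = np.diff(np.asarray(x)[order])
    dy_pred = np.diff(np.asarray(y_pred)[order]) / dx
    dy_true = np.diff(np.asarray(y_true)[order]) / dx
    return float(np.mean((dy_pred - dy_true) ** 2))
```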

5. REFLECTIONS

Summary. In this paper, we demonstrate an efficient proof-of-concept that incorporates two new independent mechanisms into genetic programming-based SR: (a) a terminal-walks mathematical language model (TW-MLM) and (b) adaptability via alternating fitness functions (i.e., between MSE and our novel, theoretically justified metric MSEDI). Through these simple modifications, we obtain competitive results against a diverse range of methods, outperforming all of them, including the state-of-the-art, when datasets with unknown constants are involved. We then reproduce this outperformance on real-world datasets. We also state and justify conventions that can promote consistency, comparability, and relevance to real-world problems. Ultimately, we hope that SR can demonstrate more competitive real-world results among other machine learning methods, given its natural advantage in both interpretability and explainability. The code for this paper is available at: https://github.com/kentridgeai/MorphologyAndAdaptabilitySR

Limitations. Our method performs much better on equations with unknown constants (Jin*) than on equations without unknown constants (Nguyen*). This observation can be attributed to the frequency of unknown constants during the generation of candidate expressions: our TW-MLM is trained on 100 AI-Feynman equations, which make frequent use of constants. The TW-MLM therefore learns to include constants in candidate expressions frequently, increasing the chances of over-optimization of constants in candidate expressions.

Future Work. (i) The performance of our method on datasets without unknown constants may suffer from over-optimization of constants in candidate expressions. To reduce the chances of this, SR can alternate between randomly generating constants and optimizing them, similar to the way we alternate between fitness functions in this paper.
(ii) The search space of SR grows rapidly with the number of variables, making real-world datasets with many variables a difficult problem for SR. We are currently working on an iterative greedy approach to multi-variable SR, handling one additional variable per SR run, to address the time-complexity issues SR faces.

A APPENDIX

NGUYEN*-1: x^3 + x^2 + x, U(-1, 1, 20)
NGUYEN*-2: x^4 + x^3 + x^2 + x, U(-1, 1, 20)
NGUYEN*-3: x^5 + x^4 + x^3 + x^2 + x, U(-1, 1, 20)
NGUYEN*-4: x^6 + x^5 + x^4 + x^3 + x^2 + x, U(-1, 1, 20)
NGUYEN*-5: sin(x^2) cos(x) - 1, U(-1, 1, 20)
NGUYEN*-6: sin(x) + sin(x + x^2), U(-1, 1, 20)
NGUYEN*-7: log(x + 1) + log(x^2 + 1), U(0, 2, 20)
NGUYEN*-8: sqrt(x), U(0, 4, 20)
R*-1: (x + 1)^3 / (x^2 - x + 1), U(-1, 1, 20)
R*-2: (x^5 - 3x^3 + 1) / (x^2 + 1), U(-1, 1, 20)
R*-3: (x^6 + x^5) / (x^4 + x^3 + x^2 + x + 1), U(-1, 1, 20)
JIN*-1: 2.5x^4 - 1.3x^3 + 0.5x^2 - 1.7x, U(-3, 3, 100)
JIN*-2: 8.0x^3 + 8.0x^2 - 15.0, U(-3, 3, 100)
JIN*-3: 0.7x^3 - 1.7x, U(-3, 3, 100)
JIN*-4: 1.5 exp(x) + 5.0 cos(x), U(-3, 3, 100)
JIN*-5: 6.0 sin(x) cos(x), U(-3, 3, 100)
JIN*-6: 1.35x^2 + 5.5 sin((x - 1.0)^2), U(-3, 3, 100)

I.12.5: F = q_2 E_f
I.12.11: F = q(E_f + B v sin θ)
I.13.4: K = (1/2) m (v^2 + u^2 + w^2)
I.13.12: U = Gm



Figure 1: Integrated proof-of-concept schematic.

I* = I_1 + I_2 + 2 sqrt(I_1 I_2) cos δ
I.38.12: r = 4π ϵ ℏ^2 / (m q^2)
I.39.10: E = (3/2) p_F V
I.39.11: E = (1/(γ - 1)) p_F V
I.39.22: P_F = n k_b T / V
I.40.1: n = n_0 e^(µ_M B cos θ)
II.15.5: E = -p_d E_f cos θ
II.35.18: n = n_0 / (exp(µ_m B/(k_b T)) + exp(-µ_m B/(k_b T)))
II.35.21: M = n_ρ µ_M tanh(µ_M B / (k_b T))
II.36.38: f = µ_m B/(k_b T) + µ_m α M/(ϵ c^2 k_b T)
II.37.1: E = µ_M (1 + χ) B
II.38.3: F = Y A x / d
II.38.14: µ_S = Y

Top-1 Approximate / Top-1 Exact Recovery Rates (%) of current methods across Jin* (with unknown constants) and Nguyen* datasets (without unknown constants). Results were averaged over 100 runs per equation per method.

Top-1 Approximate / Top-1 Exact Recovery Rates (%) of GP and TW-MLM-GP across the Nguyen*, R* and Jin* datasets. Results were averaged over 100 runs per equation per method.

Synthetic Dataset: Top-1 Approximate / Top-1 Exact Recovery Rates (%) of 6 methods for Jin* dataset. Results were averaged over 100 experiments per equation per method.

Synthetic Dataset: Top-1 Approximate / Top-1 Exact Recovery Rates (%) of 6 methods for Nguyen* and R* datasets. Results were averaged over 100 experiments per equation per method.

Real-World Dataset (from SRBench): Percentage of experiments where our method outperforms DSO-NGGP in r-squared value. Results are shown for 100 experiments per dataset.

Symbolic regression dataset specifications. Input variable is denoted by x. U (a, b, c) denotes c random points uniformly sampled between a and b for x. Equations were selected or modified to have the same number of variables throughout to maintain the same search space size to allow for meaningful comparison.
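The U(a, b, c) sampling convention above can be made concrete in a few lines. This is a minimal sketch (the helper name `u` and the fixed seed are ours), using JIN*-4, f(x) = 1.5 exp(x) + 5.0 cos(x) with U(-3, 3, 100), as the example:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def u(a, b, c):
    """U(a, b, c): c points uniformly sampled between a and b."""
    return rng.uniform(a, b, size=c)

# JIN*-4 dataset: inputs sampled as U(-3, 3, 100), targets from the ground truth.
x = u(-3.0, 3.0, 100)
y = 1.5 * np.exp(x) + 5.0 * np.cos(x)
```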

The 100 AI-Feynman physics equations (Udrescu & Tegmark, 2020).

ACKNOWLEDGMENTS

This research is supported by A*STAR, CISCO Systems (USA) Pte. Ltd and the National University of Singapore under its Cisco-NUS Accelerated Digital Economy Corporate Laboratory (Award I21001E0002). Additionally, we would like to thank the members of the Kent-Ridge AI research group at the National University of Singapore for helpful feedback and interesting discussions on this work.

