RETHINKING SYMBOLIC REGRESSION: MORPHOLOGY AND ADAPTABILITY FOR EVOLUTIONARY ALGORITHMS

Abstract

Symbolic Regression (SR) is the well-studied problem of finding closed-form analytical expressions that describe the relationship between variables in a measurement dataset. In this paper, we rethink SR from two perspectives: morphology and adaptability. Morphology: Current SR algorithms typically use several man-made heuristics to influence the morphology (or structure) of the expressions in the search space. These man-made heuristics may introduce unintentional bias and data leakage, especially given the relatively few equation-recovery benchmark problems available for evaluating SR approaches. To address this, we formulate a novel minimalistic approach, based on constructing a depth-aware mathematical language model trained on terminal walks of expression trees, as a replacement for these heuristics. Adaptability: Current SR algorithms tend to select expressions based on only a single fitness function (e.g., MSE on the training set). We promote the use of an adaptability framework in evolutionary SR, which uses fitness functions that alternate across generations. This leads to robust expressions that perform well on the training set and are close to the true functional form. We demonstrate this by alternating fitness functions that quantify faithfulness to values (via MSE) and to empirical derivatives (via a novel, theoretically justified fitness metric coined MSEDI). Proof-of-concept: We combine these ideas into a minimalistic evolutionary SR algorithm that outperforms a suite of benchmark and state-of-the-art SR algorithms on problems with unknown constants added, which we claim are more reflective of SR performance in real-world applications. Our claim is then strengthened by reproducing the superior performance on real-world regression datasets from SRBench. For researchers interested in equation-recovery problems, we also propose a set of conventions that can be used to promote fair comparison across SR methods and to reduce unintentional bias.

1. INTRODUCTION

Important discoveries rarely come in the form of large black-box models; they often appear as simple, elegant, and concise expressions. The field of applying machine learning to generate such mathematical expressions is known as Symbolic Regression (SR). The expressions obtained from SR come in a compact, human-readable form with fewer parameters than black-box models, and they allow for useful scientific insights by mere inspection. This property has led SR to be gradually recognized as a first-class algorithm in various scientific fields in recent years, including Physics (Udrescu & Tegmark, 2020), Material Sciences (Wang et al., 2019; Sun et al., 2019) and Knowledge Engineering (Martinez-Gil & Chaves-Gonzalez, 2020). The most common technique used in SR is genetic programming (GP) (Koza, 1992). GP generates populations of candidate expressions and evolves the best expressions (selected via a fitness function) across generations through evolutionary operations such as selection, crossover, and mutation. In this paper, we rethink GP-SR from two evolutionary-inspired perspectives: morphology and adaptability. Morphology. SR algorithms have traditionally used several man-made heuristics to influence the morphology of expressions. One method is to introduce rules (Worm & Chiu, 2013), constraints (Petersen et al., 2019; Bladek & Krawiec, 2019) and rule-based simplifications (Zhang et al., 2006), with the objective of removing redundant operations and author-defined senseless expressions; an example is disallowing nested trigonometric functions (e.g., sin(1 + cos(x))). Another method is to assign a complexity score to each elementary operation (Loftis et al., 2020; Korns, 2013), intending to suppress the appearance of rare operations that are given a high author-assigned complexity score.
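The GP loop sketched above (generate a population of candidate expression trees, score them with a fitness function, keep the fittest, and produce new candidates via evolutionary operators) can be illustrated with a minimal mutation-only toy; crossover is omitted for brevity, and all names and parameter values here are illustrative, not the paper's implementation.

```python
import random

# Expression trees as nested tuples: (op, left, right); leaves are 'x' or a constant.
OPS = {'+': lambda a, b: a + b, '-': lambda a, b: a - b, '*': lambda a, b: a * b}
TERMINALS = ['x', 1.0]

def random_tree(depth=3):
    """Grow a random expression tree up to the given depth."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    op = random.choice(list(OPS))
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if tree == 'x':
        return x
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def mse(tree, xs, ys):
    """Value-based fitness: mean squared error on the training points."""
    return sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mutate(tree, depth=2):
    """Replace a randomly chosen subtree with a freshly grown one."""
    if isinstance(tree, tuple) and random.random() < 0.7:
        op, left, right = tree
        if random.random() < 0.5:
            return (op, mutate(left, depth), right)
        return (op, left, mutate(right, depth))
    return random_tree(depth)

def evolve(xs, ys, pop_size=60, generations=30):
    population = [random_tree() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=lambda t: mse(t, xs, ys))
        survivors = scored[:pop_size // 2]                              # selection
        offspring = [mutate(random.choice(survivors)) for _ in survivors]  # mutation
        population = survivors + offspring
    return min(population, key=lambda t: mse(t, xs, ys))

random.seed(0)
xs = [i / 10 for i in range(-10, 11)]
ys = [x * x + x for x in xs]   # ground truth: x^2 + x
best = evolve(xs, ys)
print(best, mse(best, xs, ys))
```

Because survivors are carried over unchanged, the best fitness in the population never worsens across generations; real GP-SR systems additionally use crossover and more elaborate selection schemes.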
However, these man-made heuristics may introduce unintentional bias and data leakage, exacerbated by the small number of benchmark problems in SR (Orzechowski et al., 2018). With the success of deep learning and its applications to SR, there is substantial motivation to use deep learning to generate potential morphologies of candidate expressions in SR (Petersen et al., 2019; Mundhenk et al., 2021). Such a technique also comes with the benefit of being easily transferable to a problem with a different set of elementary operations. In this regard, we first show how current SR methods rely on these man-made heuristics and highlight the potential drawbacks. Then, we show how using our neural network pre-trained on Physics equations (which we later introduce as TW-MLM) improves the performance of GP-SR even in the absence of such man-made heuristics. Adaptability. SR algorithms tend to evaluate a candidate expression based on its faithfulness to empirical values. Some common fitness functions are Mean Absolute Error (MAE), Mean Squared Error (MSE) and Normalized Root Mean Square Error (NRMSE) (Mundhenk et al., 2021), among other measurements. We propose that SR should create a variety of characterization measures beyond the existing ones that measure faithfulness to empirical values. While previous work has suggested alternative characteristics for dealing with time-series data (Schmidt & Lipson, 2010; 2009), these are not easily transferable to SR in general and were not theoretically derived. In this paper, we propose to quantify the faithfulness of a candidate expression's empirical derivative to that of the ground truth. Motivated by evolutionary theory (Bateson, 2017), we adopt an adaptability framework that changes the fitness function across generations. This process makes it harder for pseudo-equations to survive through the different fitness functions and easier for a ground-truth equation to survive across generations.
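The adaptability idea can be sketched concretely: alternate between a value-based fitness (MSE) and a derivative-based fitness across generations, so an expression must be faithful to both to keep surviving. The finite-difference fitness below is only an illustrative stand-in for MSEDI, whose exact definition is given later in the paper.

```python
def mse(f, xs, ys):
    """Value fidelity: mean squared error between f(x) and the data."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def derivative_fitness(f, xs, ys):
    """Derivative fidelity: compare empirical first differences of f and the data."""
    df = [(f(xs[i + 1]) - f(xs[i])) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]
    dy = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]
    return sum((a - b) ** 2 for a, b in zip(df, dy)) / len(df)

def fitness_for_generation(gen):
    # Even generations score value fidelity; odd generations score
    # derivative fidelity. A pseudo-equation that games one measure
    # is filtered out by the other.
    return mse if gen % 2 == 0 else derivative_fitness

xs = [i / 10 for i in range(11)]
ys = [x ** 2 for x in xs]            # ground truth: x^2
good = lambda x: x ** 2              # faithful to values and derivatives
shifted = lambda x: x ** 2 + 0.5     # faithful to derivatives only
for gen in (0, 1):
    fit = fitness_for_generation(gen)
    print(gen, fit(good, xs, ys), fit(shifted, xs, ys))
```

Here `shifted` scores a perfect derivative fitness but a poor MSE, so alternating the two fitness functions eliminates it while the ground-truth expression survives both.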
The additional benefit of such a method lies in the increased utility of the eventual expression. If the user intends to take the derivative of the eventual equation, then a fitness measure of faithfulness to empirical derivatives would assist that objective. In this paper, we alternate between different fitness functions to improve the performance of GP-SR; specifically, we alternate between MSE and a newly defined fitness function we term MSEDI. Proof-of-concept. We combine these ideas to demonstrate a proof-of-concept (foreshadowed in Figure 1) through a minimalistic evolutionary SR algorithm that outperforms all methods (including state-of-the-art SR) on problems with unknown constants (i.e., Jin* (Jin et al., 2019)) and outperforms many benchmark models on problems without unknown constants (i.e., Nguyen* (Uy et al., 2011) and R* (Krawiec & Pawlak, 2013)). We contend that performance on datasets with unknown constants is more indicative of SR performance in real-world applications, which include naturally occurring processes such as scaling. Our claim is then strengthened by reproducing this superior performance on real-world regression datasets from SRBench (La Cava et al., 2021). To accommodate future research in SR using real-life datasets, we propose additional SR conventions for synthetic datasets, in line with the "relevancy" criterion that results on benchmark problems should correlate well with results on real-world applications (McDermott et al., 2012). The remainder of this paper is organized as follows: Section 2 explains related work and mechanisms to improve SR, focusing on recent deep learning work. Section 3 describes our proposed methodology and mechanisms. Section 4 combines all our proposed mechanisms with the vanilla GP approach and compares the performance with state-of-the-art, commercial and traditional SR approaches. Reflections and future work are given in Section 5. The main contributions of this paper are as follows:



Figure 1: Integrated proof-of-concept schematic.

