RETHINKING SYMBOLIC REGRESSION: MORPHOLOGY AND ADAPTABILITY FOR EVOLUTIONARY ALGORITHMS

Abstract

Symbolic Regression (SR) is the well-studied problem of finding closed-form analytical expressions that describe the relationship between variables in a measurement dataset. In this paper, we rethink SR from two perspectives: morphology and adaptability. Morphology: Current SR algorithms typically use several man-made heuristics to influence the morphology (or structure) of the expressions in the search space. These heuristics may introduce unintentional bias and data leakage, especially given the relatively few equation-recovery benchmark problems available for evaluating SR approaches. To address this, we formulate a novel minimalistic approach, based on constructing a depth-aware mathematical language model trained on terminal walks of expression trees, as a replacement for these heuristics. Adaptability: Current SR algorithms tend to select expressions based on a single fitness function (e.g., MSE on the training set). We promote the use of an adaptability framework in evolutionary SR, which uses fitness functions that alternate across generations. This leads to robust expressions that perform well on the training set and are close to the true functional form. We demonstrate this by alternating fitness functions that quantify faithfulness to values (via MSE) and to empirical derivatives (via a novel, theoretically justified fitness metric coined MSEDI). Proof-of-concept: We combine these ideas into a minimalistic evolutionary SR algorithm that outperforms a suite of benchmark and state-of-the-art SR algorithms on problems with unknown constants added, which we claim are more reflective of SR performance in real-world applications. Our claim is then strengthened by reproducing the superior performance on real-world regression datasets from SRBench. For researchers interested in equation-recovery problems, we also propose a set of conventions that can be used to promote fairness in comparison across SR methods and to reduce unintentional bias.

1. INTRODUCTION

Important discoveries rarely come in the form of large black-box models; they often appear as simple, elegant, and concise expressions. The field of applying machine learning to generate such mathematical expressions is known as Symbolic Regression (SR). The expressions obtained from SR come in a compact, human-readable form with fewer parameters than black-box models, and they allow for useful scientific insights by mere inspection. This property has led SR to be gradually recognized in recent years as a first-class algorithm in various scientific fields, including Physics (Udrescu & Tegmark, 2020), Materials Science (Wang et al., 2019; Sun et al., 2019), and Knowledge Engineering (Martinez-Gil & Chaves-Gonzalez, 2020). The most common technique used in SR is genetic programming (GP) (Koza, 1992). GP generates populations of candidate expressions and evolves the best expressions (selected via a fitness function) across generations through evolutionary operations such as selection, crossover, and mutation. In this paper, we rethink GP-SR from two evolution-inspired perspectives: morphology and adaptability.
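To make the adaptability perspective concrete, the following is a minimal Python sketch of a selection step whose fitness function alternates across generations. It is an illustration under stated assumptions, not the algorithm proposed in this paper: `derivative_mse` is a hypothetical finite-difference stand-in and is not the MSEDI metric, and the names `fitness_for_generation`, `select`, and the toy population are introduced here for illustration only. Crossover and mutation are omitted.

```python
import numpy as np


def mse(f, x, y):
    """Mean squared error between predictions f(x) and targets y."""
    return float(np.mean((f(x) - y) ** 2))


def derivative_mse(f, x, y):
    """Hypothetical stand-in for a derivative-based fitness (NOT MSEDI):
    MSE between finite-difference derivatives of predictions and data."""
    return float(np.mean((np.gradient(f(x), x) - np.gradient(y, x)) ** 2))


# Assumed schedule: even generations use values, odd generations use derivatives.
FITNESS_CYCLE = (mse, derivative_mse)


def fitness_for_generation(generation):
    """Return the fitness function used in the given generation."""
    return FITNESS_CYCLE[generation % len(FITNESS_CYCLE)]


def select(population, x, y, generation, keep):
    """Rank candidates by the current generation's fitness; keep the best."""
    fitness = fitness_for_generation(generation)
    ranked = sorted(population, key=lambda name: fitness(population[name], x, y))
    return ranked[:keep]


# Toy data generated from y = x^2 and a hand-written candidate population.
x = np.linspace(0.0, 2.0, 50)
y = x ** 2
population = {
    "x**2": lambda t: t ** 2,
    "x**2 + 1": lambda t: t ** 2 + 1.0,
    "2*x": lambda t: 2.0 * t,
}

survivors_gen0 = select(population, x, y, generation=0, keep=2)  # ranked by MSE
survivors_gen1 = select(population, x, y, generation=1, keep=2)  # ranked by derivatives
```

In this toy setting, the value-based fitness ranks `2*x` above the offset candidate `x**2 + 1`, while the derivative-based fitness does the opposite; only the candidate with the true functional form, `x**2`, survives under both criteria.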

