RETHINKING SYMBOLIC REGRESSION DATASETS AND BENCHMARKS FOR SCIENTIFIC DISCOVERY

Anonymous authors
Paper under double-blind review

Abstract

This paper revisits datasets and evaluation criteria for symbolic regression, the task of recovering mathematical expressions from given data, with a specific focus on its potential for scientific discovery. Focusing on the set of formulas used in existing datasets based on the Feynman Lectures on Physics, we recreate 120 datasets to discuss the performance of symbolic regression for scientific discovery (SRSD). For each of the 120 SRSD datasets, we carefully review the properties of the formula and its variables to design reasonably realistic sampling ranges of values, so that our new SRSD datasets can be used to evaluate the potential of SRSD, e.g., whether or not an SR method can (re)discover physical laws from such data. As an evaluation metric, we also propose to use normalized edit distances between the predicted and ground-truth equation trees. While existing metrics are either binary or measure errors between the target values and an SR model's predicted values for a given input, normalized edit distances quantify the structural similarity between the ground-truth and predicted equation trees. We have conducted experiments on our new SRSD datasets using five state-of-the-art SR methods in SRBench and a simple baseline based on a recent Transformer architecture. The results show that our benchmark provides a more realistic performance evaluation and opens up a new machine learning-based approach to scientific discovery. We provide our datasets and code as part of the supplementary material.

1. INTRODUCTION

Recent advances in machine learning (ML), especially deep learning (DL), have led to many methods that can fit given data and make appropriate inferences on new inputs. Such methods are, however, often black-box, which makes it difficult for humans to understand how they arrive at predictions for given inputs. This limitation is especially critical when non-ML experts apply ML to problems in their own research domains, such as physics and chemistry. Symbolic regression (SR) is the task of producing a mathematical expression (symbolic expression) that fits a given dataset. SR has been studied in the genetic programming (GP) community (Hoai et al., 2002; Keijzer, 2003; Koza & Poli, 2005; Johnson, 2009; Uy et al., 2011; Orzechowski et al., 2018), and DL-based SR has been attracting growing attention from the ML/DL community (Petersen et al., 2020; Landajuela et al., 2021; Biggio et al., 2021; Valipour et al., 2021; La Cava et al., 2021; Kamienny et al., 2022). Because of its interpretability, various scientific communities apply SR to advance research in their fields, e.g., physics (Wu & Tegmark, 2019; Udrescu & Tegmark, 2020; Udrescu et al., 2020; Kim et al., 2020; Cranmer et al., 2020; Liu & Tegmark, 2021; Liu et al., 2021b), applied mechanics (Huang et al., 2021), climatology (Abdellaoui & Mehrkanoon, 2021), materials science (Sun et al., 2019; Wang et al., 2019; Weng et al., 2020; Loftis et al., 2020), and chemistry (Batra et al., 2020). Given that SR has been studied across these communities, La Cava et al. (2021) propose SRBench, a unified benchmark framework for SR methods. In their benchmark study, they combine the Feynman Symbolic Regression Database (FSRD) (Udrescu & Tegmark, 2020) and the ODE-Strogatz repository (Strogatz, 2018) to compare a number of SR methods, using a large-scale heterogeneous computing cluster.[1]
To discuss the potential of symbolic regression for scientific discovery (SRSD), some issues still remain to be addressed: oversimplified datasets and the lack of an evaluation metric suited to SRSD. Existing SR datasets consist of values sampled from limited domains, such as the range of 1 to 5, and there are no large-scale datasets with reasonably realistic values that capture the properties of each formula and its variables. It is therefore difficult to discuss the potential of symbolic regression for scientific discovery with such datasets. For instance, the FSRD consists of 120 formulas selected mostly from the Feynman Lectures on Physics series[2] (Feynman et al., 1963a;b;c) and forms the core benchmark datasets used in SRBench (La Cava et al., 2021). While the formulas represent physical laws, the variables and constants used in each dataset have no physical meaning, since the datasets were not designed for discovering physical laws from data observed in the real world (see Section 3.1). Moreover, there is a lack of appropriate metrics for evaluating SRSD methods. An intuitive approach would be to measure the prediction error or correlation between the predicted values and the target values on test data, as in standard regression problems. However, low prediction errors can be achieved even by complex models that differ from the original law. SRBench (La Cava et al., 2021) additionally reports the percentage of agreement between the target and estimated equations (solution rate), but under this metric, 1) equations that do not match at all and 2) equations that differ by only one term[3] are equally treated as incorrect. As a result, solution rate is a coarse-grained accuracy measure for SRSD, which calls for further discussion toward real-world applications.
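To make the idea of "reasonably realistic sampling ranges" concrete, the sketch below samples inputs for Newton's law of gravitation log-uniformly over physically plausible scales. The specific ranges, function names, and the log-uniform choice are illustrative assumptions for this sketch, not the exact annotation policy of our datasets.

```python
# Hedged sketch: sampling physically plausible inputs for
# F = G * m1 * m2 / r**2, rather than generic values in [1, 5].
import math
import random

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2


def log_uniform(low, high):
    """Sample log-uniformly so all orders of magnitude in [low, high] appear."""
    return 10 ** random.uniform(math.log10(low), math.log10(high))


def sample_gravity_point():
    """Return one (inputs, target) pair with assumed planet/star-scale ranges."""
    m1 = log_uniform(1e20, 1e30)  # kg (assumed range)
    m2 = log_uniform(1e20, 1e30)  # kg (assumed range)
    r = log_uniform(1e8, 1e12)    # m, orbital-distance scale (assumed range)
    return (m1, m2, r), G * m1 * m2 / r**2


(m1, m2, r), force = sample_gravity_point()
```

Sampling log-uniformly (rather than uniformly) avoids the dataset being dominated by values near the upper end of a range that spans many orders of magnitude.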
A key feature of SR is its interpretability, and some studies (Udrescu et al., 2020; La Cava et al., 2021) use the complexity of the predicted expression as an evaluation metric (the simpler, the better). However, this rests on a strong assumption that a simpler expression is more likely to be a hidden law in the data (a scientific discovery such as a physical law), which may not hold for SRSD. Consequently, no single evaluation metric has been proposed that takes into account both interpretability and how close the estimated expression is to the true expression. To address these issues, we propose new SRSD datasets, introduce a new evaluation method, and conduct benchmark experiments using representative SR methods and a new Transformer-based SR baseline. We carefully review and design annotation policies for the new datasets, considering the properties of the physics formulas. In addition, since a formula can be represented as a tree structure, we introduce a normalized edit distance on the tree structure to enable quantitative evaluation of predicted formulas that do not perfectly match the true formulas. Using the proposed SRSD datasets and evaluation metric, we perform benchmark experiments with a set of SR baselines and find that there is still significant room for improvement in terms of the new metric.
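As a concrete illustration of the metric's spirit, the sketch below approximates a normalized edit distance between two equation trees by running a standard Levenshtein distance over their preorder node-label sequences and normalizing by the larger tree size. The tuple-based tree encoding, the sequence-based approximation, and the normalization choice are illustrative assumptions for this sketch, not our exact definition.

```python
# Hedged sketch: normalized edit distance between equation trees,
# approximated via Levenshtein distance over preorder node labels.

def preorder(node):
    """Flatten a tree given as (label, [children]) into a label sequence."""
    label, children = node
    seq = [label]
    for child in children:
        seq.extend(preorder(child))
    return seq


def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]


def normalized_edit_distance(pred_tree, true_tree):
    """Edit distance normalized to [0, 1]; 0 means identical trees."""
    p, t = preorder(pred_tree), preorder(true_tree)
    return min(1.0, levenshtein(p, t) / max(len(p), len(t)))


# Example: true law F = m * a vs. predicted F = m * a + c.
true_eq = ("mul", [("m", []), ("a", [])])
pred_eq = ("add", [("mul", [("m", []), ("a", [])]), ("c", [])])
print(normalized_edit_distance(pred_eq, true_eq))  # → 0.4
```

Unlike a binary solution rate, this score distinguishes a prediction that differs by a single extra term (0.4 here) from one that shares no structure with the true equation (close to 1).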

2. RELATED STUDIES

In this section, we briefly introduce related studies on 1) symbolic regression for scientific discovery and 2) symbolic regression datasets and evaluation.

2.1. SRSD: SYMBOLIC REGRESSION FOR SCIENTIFIC DISCOVERY

A pioneering study on symbolic regression for scientific discovery was conducted by Schmidt & Lipson (2009), who propose a data-driven scientific discovery method. They collect data from standard experimental systems like those used in undergraduate physics education: an air-track oscillator and a double pendulum. Their algorithm detects different types of laws from the data, such as position manifolds, energy laws, equations of motion, and sum-of-forces laws. Following that study, data-driven scientific discovery has been attracting attention from research communities and has been applied to various domains such as physics (Wu & Tegmark, 2019; Udrescu & Tegmark, 2020; Udrescu et al., 2020; Kim et al., 2020; Cranmer et al., 2020; Liu & Tegmark, 2021; Liu et al., 2021b), applied mechanics (Huang et al., 2021), climatology (Abdellaoui & Mehrkanoon, 2021), materials science (Sun et al., 2019; Wang et al., 2019; Weng et al., 2020; Loftis et al., 2020), and chemistry (Batra et al., 2020).



Footnotes:
[1] They used hosts with 24-28 core Intel(R) Xeon(R) CPU E5-2690 v4 2.60GHz processors and 250GB RAM.
[2] Udrescu & Tegmark (2020) extract 20 of the 120 equations as "bonus" problems from other seminal books (Goldstein et al., 2002; Jackson, 1999; Weinberg, 1972; Schwartz, 2014).
[3] If the target and estimated equations differ only by a constant or scalar, SRBench treats the estimated equation as correct for the solution rate.

