RETHINKING SYMBOLIC REGRESSION DATASETS AND BENCHMARKS FOR SCIENTIFIC DISCOVERY

Anonymous authors
Paper under double-blind review

Abstract

This paper revisits datasets and evaluation criteria for symbolic regression (SR), the task of recovering mathematical expressions from given data, with a specific focus on its potential for scientific discovery. Starting from the formulas used in existing datasets based on the Feynman Lectures on Physics, we recreate 120 datasets to discuss the performance of symbolic regression for scientific discovery (SRSD). For each of the 120 SRSD datasets, we carefully review the properties of the formula and its variables to design reasonably realistic sampling ranges of values, so that our new SRSD datasets can be used to evaluate the potential of SRSD, e.g., whether an SR method can (re)discover physical laws from such datasets. As an evaluation metric, we also propose the normalized edit distance between the predicted and ground-truth equation trees. Whereas existing metrics are either binary or measure errors between target values and an SR model's predicted values for given inputs, the normalized edit distance evaluates the structural similarity between the ground-truth and predicted equation trees. We have conducted experiments on our new SRSD datasets using five state-of-the-art SR methods from SRBench and a simple baseline based on a recent Transformer architecture. The results show that our datasets provide a more realistic performance evaluation and open up a new machine-learning-based approach to scientific discovery. We provide our datasets and code as part of the supplementary material.
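The normalized edit distance can be illustrated with a minimal sketch. The actual metric operates on equation trees; the simplified version below, which assumes equations serialized as preorder token sequences, uses plain Levenshtein distance normalized by the longer sequence length. This is an illustrative approximation under that assumption, not the paper's exact implementation, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two token sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # cost of deleting i tokens
    for j in range(n + 1):
        d[0][j] = j  # cost of inserting j tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[m][n]

def normalized_edit_distance(pred_tokens, true_tokens):
    """Edit distance normalized to [0, 1] by the longer sequence length."""
    dist = levenshtein(pred_tokens, true_tokens)
    return dist / max(len(pred_tokens), len(true_tokens), 1)

# Example: ground truth F = m*a vs. prediction F = m*g, in preorder notation.
truth = ["mul", "m", "a"]
pred = ["mul", "m", "g"]
score = normalized_edit_distance(pred, truth)  # one substitution out of three tokens
```

A score of 0 indicates an exact structural match, while values near 1 indicate entirely dissimilar expressions, which makes the metric comparable across formulas of different sizes.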

1. INTRODUCTION

Recent advances in machine learning (ML), especially deep learning (DL), have produced many methods that can reproduce given data and make appropriate inferences on new inputs. Such methods are, however, often black-box, which makes it difficult for humans to understand how they arrive at predictions for given inputs. This issue is especially critical when non-ML experts apply ML to problems in their own research domains, such as physics and chemistry. Symbolic regression (SR) is the task of producing a mathematical expression (symbolic expression) that fits a given dataset. SR has long been studied in the genetic programming (GP) community (Hoai et al., 2002; Keijzer, 2003; Koza & Poli, 2005; Johnson, 2009; Uy et al., 2011; Orzechowski et al., 2018), and DL-based SR has been attracting increasing attention from the ML/DL community (Petersen et al., 2020; Landajuela et al., 2021; Biggio et al., 2021; Valipour et al., 2021; La Cava et al., 2021; Kamienny et al., 2022). Because of its interpretability, various scientific communities apply SR to advance research in their fields, e.g., physics (Wu & Tegmark, 2019; Udrescu & Tegmark, 2020; Udrescu et al., 2020; Kim et al., 2020; Cranmer et al., 2020; Liu & Tegmark, 2021; Liu et al., 2021b), applied mechanics (Huang et al., 2021), climatology (Abdellaoui & Mehrkanoon, 2021), materials science (Sun et al., 2019; Wang et al., 2019; Weng et al., 2020; Loftis et al., 2020), and chemistry (Batra et al., 2020). Given that SR has been studied across these communities, La Cava et al. (2021) propose SRBench, a unified benchmark framework for symbolic regression methods. In their benchmark study, they combine the Feynman Symbolic Regression Database (FSRD) (Udrescu & Tegmark, 2020) and the ODE-Strogatz repository (Strogatz, 2018) to compare a number of SR methods, using a large-scale heterogeneous computing cluster.¹
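To make the SR task concrete, the following toy sketch (all names hypothetical) brute-forces a tiny hand-picked candidate grammar and returns the expression with the smallest squared error on the sampled data. Real SR methods instead search a vastly larger expression space via genetic programming or neural models; this is only a minimal illustration of the problem setup.

```python
import math

# Hypothetical toy grammar: each candidate expression is paired with an
# evaluator. A real SR system composes operators and variables instead.
CANDIDATES = {
    "x": lambda x: x,
    "2*x": lambda x: 2 * x,
    "x**2": lambda x: x ** 2,
    "sin(x)": lambda x: math.sin(x),
}

def fit_symbolic(xs, ys):
    """Return the candidate expression minimizing squared error on (xs, ys)."""
    best_expr, best_err = None, float("inf")
    for expr, f in CANDIDATES.items():
        err = sum((f(x) - y) ** 2 for x, y in zip(xs, ys))
        if err < best_err:
            best_expr, best_err = expr, err
    return best_expr

# Sample data from the "hidden law" y = x^2 and recover its expression.
xs = [0.1 * i for i in range(10)]
ys = [x ** 2 for x in xs]
recovered = fit_symbolic(xs, ys)  # "x**2"
```

Unlike black-box regressors, the output here is a human-readable formula, which is precisely the interpretability property that motivates applying SR to scientific discovery.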



¹ They used hosts with 24-28 core Intel(R) Xeon(R) CPU E5-2690 v4 2.60GHz processors and 250GB of RAM.

