DEEP GENERATIVE SYMBOLIC REGRESSION

Abstract

Symbolic regression (SR) aims to discover concise closed-form mathematical equations from data, a task fundamental to scientific discovery. However, the problem is highly challenging because closed-form equations lie in a complex combinatorial search space. Existing methods, ranging from heuristic search to reinforcement learning, fail to scale with the number of input variables. We make the observation that closed-form equations often have structural characteristics and invariances (e.g., the commutative law) that could be further exploited to build more effective symbolic regression solutions. Motivated by this observation, our key contribution is to leverage pre-trained deep generative models to capture the intrinsic regularities of equations, thereby providing a solid foundation for subsequent optimization steps. We show that our novel formalism unifies several prominent approaches of symbolic regression and offers a new perspective to justify and improve on the previous ad hoc designs, such as the usage of cross-entropy loss during pre-training. Specifically, we propose an instantiation of our framework, Deep Generative Symbolic Regression (DGSR). In our experiments, we show that DGSR achieves a higher recovery rate of true equations in the setting of a larger number of input variables, and it is more computationally efficient at inference time than state-of-the-art RL symbolic regression solutions.

1. INTRODUCTION

Symbolic regression (SR) aims to find a concise equation f that best fits a given dataset D by searching the space of mathematical equations. The identified equations have concise closed-form expressions; thus, they are interpretable to human experts and amenable to further mathematical analysis (Augusto & Barbosa, 2000). Fundamentally, two limitations prevent the wider ML community from adopting SR as a standard tool for supervised learning: SR is only applicable to problems with few input variables (e.g., three), and it is very computationally intensive. This is because the space of equations grows exponentially with equation length and has both discrete (×, +, sin) and continuous (e.g., 2.5) components. Although researchers have attempted to solve SR by heuristic search (Augusto & Barbosa, 2000; Schmidt & Lipson, 2009; Stinstra et al., 2008; Udrescu & Tegmark, 2020), reinforcement learning (Petersen et al., 2020; Tang et al., 2020), and deep learning with pre-training (Biggio et al., 2021; Kamienny et al., 2022), achieving both high scalability in the number of input variables and computational efficiency remains an open problem.
We believe that learning a good representation of equations is the key to solving these challenges. Equations are complex objects with many unique invariance structures that could guide the search. Simple equivalence rules (such as commutativity) can rapidly compound with multiple variables or terms, giving rise to complex structures with many equation invariances. Importantly, these equation equivalence properties have not been adequately reflected in the representations used by existing SR methods. First, existing heuristic search methods represent equations as expression trees (Jin et al., 2019), which can only capture commutativity (x_1 x_2 = x_2 x_1) by swapping the leaves of a binary operator (×, +). However, trees cannot capture many other properties, such as distributivity (x_1 x_2 + x_1 x_3 = x_1 (x_2 + x_3)).
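These representation blind spots can be made concrete with a small sketch (purely illustrative; the prefix token sequences and numerical check below are our own toy example, not any baseline's actual data structures). The two sides of the distributivity identity are structurally different trees and token sequences, yet they denote the same function:

```python
import random

# Prefix-notation token sequences for the two sides of distributivity:
#   x1*x2 + x1*x3   vs.   x1*(x2 + x3)
# As trees (or token sequences) they are structurally different -- no
# leaf swap at a commutative node maps one onto the other -- yet they
# denote the same function.
seq_a = ["+", "*", "x1", "x2", "*", "x1", "x3"]
seq_b = ["*", "x1", "+", "x2", "x3"]
assert seq_a != seq_b  # distinct representations

f_a = lambda x1, x2, x3: x1 * x2 + x1 * x3
f_b = lambda x1, x2, x3: x1 * (x2 + x3)

# Check functional equality on random inputs (up to floating-point error):
random.seed(0)
for _ in range(100):
    x1, x2, x3 = (random.uniform(-5.0, 5.0) for _ in range(3))
    assert abs(f_a(x1, x2, x3) - f_b(x1, x2, x3)) < 1e-9
```

The same mismatch arises for plain token sequences and commutativity: ["+", "x1", "x2"] and ["+", "x2", "x1"] are distinct sequences that compute the same function.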
Second, existing pre-trained encoder-decoder methods represent equations as sequences of tokens, i.e., x_1 + x_2 corresponds to ("x_1", "+", "x_2"), just as sentences of words in natural language (Valipour et al., 2021). The sequence representation cannot encode any invariance structure; e.g., x_1 + x_2 and x_2 + x_1 are deemed two different sequences. Finally, existing RL methods for symbolic regression do not learn representations of equations. For each dataset, these methods learn a dataset-specific policy network to generate equations that fit the data well; hence they must re-train the policy from scratch each time a new dataset D is observed, which is computationally intensive.
On the quest to apply symbolic regression to a larger number of input variables, we investigate a deep conditional generative framework that attempts to fulfill the following desired properties: (P1) Learn equation invariances: the learned equation representations should encode both the equation equivalence invariances and the invariances of their associated datasets. (P2) Efficient inference: gradient refinement of the generative model should be computationally efficient at inference time. (P3) Generalize to unseen variables: the model should generalize to unseen input variables of a higher dimension than those seen during pre-training.
To fulfill P1-P3, we propose the Deep Generative Symbolic Regression (DGSR) framework. Rather than representing equations as trees or sequences, DGSR learns representations of equations with deep generative models, which have excelled at modelling complex structures such as images and molecular graphs. Specifically, DGSR leverages pre-trained conditional generative models that correctly encode the equation invariances. The equation representations are learned using a deep generative model composed of invariant neural networks and trained with an end-to-end loss function inspired by Bayesian inference.
Crucially, this end-to-end loss enables both pre-training and gradient refinement of the pre-trained model at inference time, allowing the model to be more computationally efficient (P2) and to generalize to unseen input variables (P3).
Contributions. Our contributions are two-fold: (1) In Section 3, we outline the DGSR framework, which can perform symbolic regression on a larger number of input variables whilst incurring less inference-time computational cost than RL techniques (P2). This is achieved by learning better representations of equations that are aware of the various equation invariance structures (P1). (2) In Section 5.1, we benchmark DGSR against existing symbolic regression approaches on standard benchmark problem sets and on more challenging problem sets with a larger number of input variables. Specifically, we demonstrate that DGSR has a higher recovery rate of the true underlying equation in the setting of a larger number of input variables, whilst using less inference compute than RL techniques, and that it achieves a significant, state-of-the-art true-equation recovery rate on the SRBench ground-truth datasets compared to the SRBench baselines. In Section 5.2, we also gain insight into how DGSR works: how it can discover the underlying true equation even when pre-trained on datasets with fewer input variables than are seen at inference time (P3), and how it captures equation equivalences (P1) and correctly encodes the dataset D to start from a good equation distribution, leading to efficient inference (P2).

2. PROBLEM FORMALISM

The standard task of a symbolic regression method is to return a closed-form equation f that best fits a given dataset D = {(X_i, y_i)}_{i=1}^{n}, i.e., y_i ≈ f(X_i) for all samples i ∈ [1 : n], where y_i ∈ R, X_i ∈ R^d, and d is the number of input variables, i.e., X = [x_1, . . . , x_d].
Closed-form equations. The equations that we seek to discover are closed-form, i.e., each can be expressed as a finite sequence of operators (×, +, -, . . . ), input variables (x_1, x_2, . . . ), and numeric constants (3.141, 2.71, . . . ) (Borwein et al., 2013). We write f for the functional form of an equation, which can contain numeric constant placeholders β to stand in for numeric constants, e.g., f(x) = β_0 x + sin(x + β_1). To discover the full equation, we need to infer the functional form and then estimate the unknown constants β, if any are present (Petersen et al., 2020). Equations can also be represented as a sequence of discrete tokens in prefix notation f = [f_1, . . . , f_{|f|}] (Petersen et al., 2020), where each token is chosen from a library of possible tokens, e.g., [+, -, ÷, ×, x_1, exp, log, sin, cos]. The token sequence f can then be instantiated into an equation f and evaluated on an input X. In existing works, the numeric constant placeholder tokens are learnt through a secondary non-linear optimization step using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm (Fletcher, 2013; Biggio et al., 2021). In lieu of extensive notation, we define evaluating f to also include inferring any placeholder tokens using BFGS.
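As a concrete sketch of this pipeline (an illustrative toy, not the implementation used in the works cited): the prefix tokens for f(x) = β_0 x + sin(x + β_1) are instantiated into a callable, and the placeholder constants are then fitted with BFGS by minimizing the mean squared error on D. The token names and the `instantiate` helper below are our own assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# f(x1) = b0 * x1 + sin(x1 + b1) in prefix notation, with "const"
# marking the numeric constant placeholders (the betas):
TOKENS = ["+", "*", "const", "x1", "sin", "+", "x1", "const"]

def instantiate(tokens, counter=None):
    """Recursively consume prefix tokens, returning a callable f(X, c),
    where c is the vector of constants assigned to the placeholders."""
    if counter is None:
        counter = [0]
    tok = tokens.pop(0)
    if tok in ("+", "*"):
        left = instantiate(tokens, counter)
        right = instantiate(tokens, counter)
        op = np.add if tok == "+" else np.multiply
        return lambda X, c: op(left(X, c), right(X, c))
    if tok == "sin":
        arg = instantiate(tokens, counter)
        return lambda X, c: np.sin(arg(X, c))
    if tok == "const":
        i = counter[0]
        counter[0] += 1
        return lambda X, c: c[i]
    return lambda X, c: X  # the single input variable "x1" in this toy

f = instantiate(list(TOKENS))

# Synthetic dataset D generated with true constants (1.5, 0.7):
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=200)
y = 1.5 * X + np.sin(X + 0.7)

# Secondary optimization step: fit the placeholders with BFGS by
# minimizing the mean squared error on D.
res = minimize(lambda c: np.mean((f(X, c) - y) ** 2),
               x0=np.array([1.0, 0.0]), method="BFGS")
```

Here `res.x` should recover constants close to (1.5, 0.7); in practice the functional form itself is produced by the search or generative model, and this BFGS step is run per candidate equation.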

