DEEP GENERATIVE SYMBOLIC REGRESSION

Abstract

Symbolic regression (SR) aims to discover concise closed-form mathematical equations from data, a task fundamental to scientific discovery. However, the problem is highly challenging because closed-form equations lie in a complex combinatorial search space. Existing methods, ranging from heuristic search to reinforcement learning, fail to scale with the number of input variables. We make the observation that closed-form equations often have structural characteristics and invariances (e.g., the commutative law) that could be further exploited to build more effective symbolic regression solutions. Motivated by this observation, our key contribution is to leverage pre-trained deep generative models to capture the intrinsic regularities of equations, thereby providing a solid foundation for subsequent optimization steps. We show that our novel formalism unifies several prominent approaches to symbolic regression and offers a new perspective to justify and improve on previous ad hoc designs, such as the use of the cross-entropy loss during pre-training. Specifically, we propose an instantiation of our framework, Deep Generative Symbolic Regression (DGSR). In our experiments, we show that DGSR achieves a higher recovery rate of true equations when the number of input variables is large, and that it is more computationally efficient at inference time than state-of-the-art RL symbolic regression solutions.

1. INTRODUCTION

Symbolic regression (SR) aims to find a concise equation f that best fits a given dataset D by searching the space of mathematical equations. The identified equations have concise closed-form expressions; thus, they are interpretable to human experts and amenable to further mathematical analysis (Augusto & Barbosa, 2000). Fundamentally, two limitations prevent the wider ML community from adopting SR as a standard tool for supervised learning: SR is only applicable to problems with few variables (e.g., three), and it is very computationally intensive. This is because the space of equations grows exponentially with the equation length and has both discrete (×, +, sin) and continuous (2.5) components. We believe that learning a good representation of equations is the key to solving these challenges. Equations are complex objects with many unique invariance structures that could guide the search. Simple equivalence rules (such as commutativity) rapidly compound as the number of variables or terms grows, giving rise to complex structures with many equation invariances. Importantly, these equation equivalence properties have not been adequately reflected in the representations used by existing SR methods. First, existing heuristic search methods represent equations as expression trees (Jin et al., 2019), which can only capture commutativity (x_1 x_2 = x_2 x_1) by swapping the leaves of a binary operator (×, +). However, trees cannot capture many other properties, such as distributivity (x_1 x_2 + x_1 x_3 = x_1 (x_2 + x_3)). Second, existing pre-trained encoder-decoder methods represent equations as sequences of tokens, i.e., x_1 + x_2 = ("x_1", "+", "x_2"), just as in natural language processing.
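The distinction between these invariances can be made concrete with a small sympy sketch (illustrative only, not part of the paper's method): commutativity is absorbed by canonicalizing the children of a commutative operator, so the two orderings yield the same expression tree, whereas the two sides of the distributive law remain structurally different trees (and hence different token sequences) even though they are mathematically equal.

```python
import sympy as sp

x1, x2, x3 = sp.symbols("x1 x2 x3")

# Commutativity: sympy canonicalizes the arguments of a commutative
# operator, so both orderings produce the identical expression tree.
assert x1 * x2 == x2 * x1

# Distributivity: the two sides are mathematically equal ...
lhs = x1 * x2 + x1 * x3      # expanded form: Add(Mul(x1, x2), Mul(x1, x3))
rhs = x1 * (x2 + x3)         # factored form: Mul(x1, Add(x2, x3))
assert sp.simplify(lhs - rhs) == 0

# ... yet their expression trees (and token serializations) differ,
# which is exactly the invariance that tree- and sequence-based
# representations fail to capture.
assert sp.srepr(lhs) != sp.srepr(rhs)
```

A search method that scores candidates by structural identity would treat `lhs` and `rhs` as distinct equations, even though discovering either one solves the same regression problem.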

Although researchers have attempted to solve SR by heuristic search (Augusto & Barbosa, 2000; Schmidt & Lipson, 2009; Stinstra et al., 2008; Udrescu & Tegmark, 2020), reinforcement learning (Petersen et al., 2020; Tang et al., 2020), and deep learning with pre-training (Biggio et al., 2021; Kamienny et al., 2022), achieving both high scalability to the number of input variables and computational efficiency remains an open problem.

