TRANSFORMER-BASED MODEL FOR SYMBOLIC REGRESSION VIA JOINT SUPERVISED LEARNING

Abstract

Symbolic regression (SR) is an important technique for discovering hidden mathematical expressions from observed data. Transformer-based approaches have been widely used for machine translation due to their high performance, and have recently attracted considerable interest for SR. These methods take data points as input, output an expression skeleton, and finally optimize its coefficients. However, recent transformer-based methods for SR focus on large-scale training data while ignoring an ill-posed problem: the lack of sufficient supervision. Expressions that may be completely different receive the same supervision signal because they share the same skeleton, which makes it challenging to handle data that come from the same expression skeleton but with different coefficients. We therefore present a transformer-based model for SR that alleviates this problem. Specifically, we leverage a feature extractor based on pure residual MLP networks to obtain richer information about the data points. At its core, we propose a joint learning mechanism incorporating supervised contrastive learning, which makes the features of data points drawn from expressions with the same skeleton more similar and thus effectively alleviates the ill-posed problem. Benchmark results show that the proposed method achieves a skeleton recovery rate up to 25% higher than typical transformer-based methods. Moreover, our method outperforms state-of-the-art SR methods based on reinforcement learning and genetic programming in terms of the coefficient of determination (R²).

1. INTRODUCTION

Exploring mathematical expressions that can be fitted to real-world observed data is at the core of expressing scientific discoveries. The correct expression would not only provide us with useful scientific insights simply by inspection, but would also allow us to forecast how the process will change in the future. The task of finding such an interpretable mathematical expression from observed data is called symbolic regression. More specifically, given a dataset (X, y), where each feature X_i ∈ R^n and each target y_i ∈ R, the goal of symbolic regression is to identify a function f (i.e., y ≈ f(X) : R^n → R) that best fits the dataset. Symbolic regression is NP-hard because the search space of an expression grows exponentially with the length of the expression, and the presence of numeric constants further exacerbates its difficulty (Lu et al., 2016). Given this difficulty, genetic programming (GP) has become the most common approach to symbolic regression (Forrest, 1993; Koza, 1994; Schmidt & Lipson, 2009; Staelens et al., 2013; Arnaldo et al., 2015; Bładek & Krawiec, 2019). GP-based methods iteratively "evolve" each generation of mathematical expressions through selection, crossover, and mutation. Although this approach can be effective, the expressions it yields are complex, and it is also known to be computationally expensive and highly sensitive to hyperparameters. A more recent line of research uses neural networks to tackle these shortcomings. Martius & Lampert (2016) propose a simple fully connected neural network called "EQL", in which elementary functions (sin, cos, +, ...) serve as activation functions. The limitations of EQL are vanishing and exploding gradients, and the fact that the depth of the network bounds the complexity of the predicted equation.
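To make the task definition above concrete, the following sketch fits the constants of a candidate expression skeleton to data generated by a hidden expression. The ground-truth expression, the skeleton c1*sin(x) + c2, and the least-squares fitting step are illustrative assumptions for this example, not taken from any of the cited methods:

```python
import numpy as np

# Data sampled from a hidden ground-truth expression y = 3.2*sin(x) + 1.7
# (a hypothetical example expression chosen for this illustration).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=200)
y = 3.2 * np.sin(X) + 1.7

# Candidate skeleton f(x) = c1*sin(x) + c2: once the skeleton is fixed,
# fitting its constants reduces to linear least squares in (c1, c2).
A = np.column_stack([np.sin(X), np.ones_like(X)])
(c1, c2), *_ = np.linalg.lstsq(A, y, rcond=None)
print(round(c1, 2), round(c2, 2))  # recovers 3.2 and 1.7
```

In general the constants enter the skeleton nonlinearly and a nonlinear optimizer (e.g., BFGS) is needed, but the two-stage structure — predict a skeleton, then fit its constants — is the same.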
More recently, deep symbolic optimization (DSO) (Petersen et al., 2021) trains an RNN with a reinforcement learning algorithm based on a risk-seeking policy gradient to generate expressions, taking the RNN's output as the initial population for a genetic algorithm that searches for the target expression. Although both of the above approaches show promising results, they still treat symbolic regression as an instance-based problem, training a model from scratch on each new input dataset. Inspired by the success of large-scale pre-training, recent efforts in symbolic regression have focused on transformer-based models trained on large amounts of data (Valipour et al., 2021; Biggio et al., 2021). These works cast symbolic regression as a machine translation problem: an encoder maps the input data to latent representations, and a decoder outputs the skeleton of the expression without constants. Such transformer-based methods (Valipour et al., 2021; Biggio et al., 2021) have two main drawbacks: (i) It is unclear what encoder architecture is optimally suited for symbolic regression. The decoder's ability to sample expressions efficiently is severely constrained by the encoder's ability to extract features from the input data; the encoder should not merely encode the points, but should also represent the expression at a high level, so that the decoder only needs to print this representation as a sequence of symbols. (ii) They use, respectively, the characters of the expression's string (Valipour et al., 2021) and the pre-order traversal of the expression tree (Biggio et al., 2021) as supervision, which leads to an ill-posed problem with insufficient supervision: different instances of the same skeleton can have very different shapes, while instances of very different skeletons can be very close.
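A minimal illustration of the ill-posed supervision in (ii): once numeric constants are masked out of the pre-order traversal, very different expressions collapse onto the same skeleton label. The token vocabulary and helper function below are hypothetical examples, not the tokenization used by any cited method:

```python
def to_skeleton(preorder):
    """Replace every numeric constant in a pre-order token list
    with the placeholder token 'C'."""
    return ['C' if isinstance(tok, (int, float)) else tok for tok in preorder]

# Pre-order traversals of two quite different expressions:
expr_a = ['add', 'mul', 2.0, 'sin', 'x', 1.0]    # 2*sin(x) + 1
expr_b = ['add', 'mul', 9.0, 'sin', 'x', -7.0]   # 9*sin(x) - 7

# Both map to the identical supervision target ['add','mul','C','sin','x','C'],
# even though the data points they generate look very different.
print(to_skeleton(expr_a) == to_skeleton(expr_b))  # True
```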
To alleviate these issues, we propose a transformer-based method for symbolic regression using a new feature extractor and a joint supervised learning mechanism. In summary, the main contributions of this study are as follows:

• We leverage a pure residual MLP feature extractor that captures the local and global features of observed data for symbolic regression tasks, which helps the expression generator produce more correct expression skeletons.

• We propose a joint learning mechanism incorporating supervised contrastive learning, which combines supervision from the whole expression skeleton with supervision from the pre-order traversal of its expression tree, thereby effectively alleviating the ill-posed problem.

• Empirically, the proposed method achieves a skeleton recovery rate up to 25% higher than recent transformer-based methods. Moreover, our method outperforms several strong baseline methods in terms of R².
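The supervised contrastive component of the second contribution can be sketched as follows, in the style of the standard supervised contrastive (SupCon) loss, with the skeleton identity as the class label: features of data points sampled from the same skeleton are pulled together, those from different skeletons pushed apart. The feature values, temperature, and function names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss over a batch of feature vectors z,
    where labels mark which expression skeleton each sample comes from."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalise features
    sim = z @ z.T / tau                                # temperature-scaled similarities
    n = len(labels)
    off_diag = ~np.eye(n, dtype=bool)                  # exclude self-pairs
    # log-softmax of each anchor's similarities over all other samples
    sim_max = np.max(np.where(off_diag, sim, -np.inf), axis=1, keepdims=True)
    exp_sim = np.exp(sim - sim_max) * off_diag
    log_prob = sim - sim_max - np.log(exp_sim.sum(axis=1, keepdims=True))
    pos = (labels[:, None] == labels[None, :]) & off_diag
    # mean negative log-probability of positives per anchor, then over batch
    return float(np.mean(-(log_prob * pos).sum(axis=1) / pos.sum(axis=1)))

labels = np.array([0, 0, 1, 1])  # two skeleton classes, two samples each
clustered = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
shuffled = clustered[[0, 2, 1, 3]]  # same-skeleton pairs pulled apart
print(supcon_loss(clustered, labels) < supcon_loss(shuffled, labels))  # True
```

The loss is lower when same-skeleton features cluster, which is exactly the geometry the paper's joint learning mechanism aims to induce in the encoder's feature space.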

2. RELATED WORK

Genetic programming (GP) for symbolic regression. Traditionally, approaches to symbolic regression are based on genetic algorithms (Forrest, 1993). Later, the symbolic regression task was framed as an optimization problem over a search space (Koza, 1994). Eureqa (Dubčáková, 2011), by far the most popular commercial software, is the most successful application of GP methods. A limitation of genetic-algorithm-based methods is that they must be trained from scratch for each equation, which is slow, computationally expensive, and highly stochastic. These models tend to generate overly complex equations and are sensitive to the choice of hyperparameters (Petersen et al., 2021).

Neural networks for symbolic regression. Neural-network approaches to symbolic regression can be broadly classified into three categories. First, methods based on the equation learner (EQL) (Martius & Lampert, 2016; Sahoo et al., 2018; Werner et al., 2021) are trained by replacing the activation functions of a neural network with arithmetic operators; they inherit the ability of neural networks to handle high-dimensional data and scale well with the number of input-output pairs (Biggio et al., 2021). Nevertheless, the existence of exponential and logarithmic activation

