TRANSFORMER-BASED MODEL FOR SYMBOLIC REGRESSION VIA JOINT SUPERVISED LEARNING

Abstract

Symbolic regression (SR) is an important technique for discovering hidden mathematical expressions from observed data. Transformer-based approaches, widely used for machine translation because of their high performance, have recently attracted strong interest for SR. They take the data points as input, output the expression skeleton, and finally optimize the coefficients. However, recent transformer-based methods for SR focus on large-scale training data and ignore an ill-posed problem, namely the lack of sufficient supervision: expressions that may be completely different receive the same supervision because they share a skeleton, which makes it challenging to handle data that come from the same expression skeleton but with different coefficients. Therefore, we present a transformer-based model for SR that alleviates this problem. Specifically, we leverage a feature extractor based on pure residual MLP networks to obtain more information about the data points. Furthermore, the core idea is a joint learning mechanism that incorporates supervised contrastive learning, which makes the features of data points from expressions with the same skeleton more similar and thereby effectively alleviates the ill-posed problem. The benchmark results show that the proposed method achieves a skeleton recovery rate up to 25% higher than typical transformer-based methods. Moreover, our method outperforms state-of-the-art SR methods based on reinforcement learning and genetic programming in terms of the coefficient of determination ($R^2$).

1. INTRODUCTION

Exploring mathematical expressions that fit real-world observed data is at the core of expressing scientific discoveries. A correct expression not only provides useful scientific insight by simple inspection but also allows us to forecast how the process will change in the future. The task of finding such an interpretable mathematical expression from observed data is called symbolic regression. More specifically, given a dataset $(X, y)$, where each feature $X_i \in \mathbb{R}^n$ and each target $y_i \in \mathbb{R}$, the goal of symbolic regression is to identify a function $f: \mathbb{R}^n \to \mathbb{R}$ with $y \approx f(X)$ that best fits the dataset. Symbolic regression is NP-hard because the search space grows exponentially with the length of the expression, and the presence of numeric constants further exacerbates the difficulty (Lu et al., 2016). Genetic programming (GP) is the most common approach to symbolic regression problems (Forrest, 1993; Koza, 1994; Schmidt & Lipson, 2009; Staelens et al., 2013; Arnaldo et al., 2015; Bładek & Krawiec, 2019). GP-based methods iteratively "evolve" each generation of mathematical expressions through selection, crossover, and mutation. Although this approach can be effective, the expressions it yields tend to be complex, and it is known to be computationally expensive and highly sensitive to hyperparameters. A more recent line of research uses neural networks to tackle these shortcomings. Martius & Lampert (2016) propose a simple fully-connected neural network called "EQL", where elementary functions (sin, cos, +, ...) are used as activation functions. EQL suffers from vanishing and exploding gradients, and the depth of the network limits the complexity of the predicted equation.
More recently, deep symbolic optimization (DSO) (Petersen et al., 2021) trains an RNN with a reinforcement learning algorithm based on a risk-seeking policy gradient to generate expressions. The output of the RNN is then used as the initial population for a genetic algorithm to find the target expression. Although the above two approaches show promising results, they still handle symbolic regression as an instance-based problem, training a model from scratch on each new input dataset. Inspired by the successes of large-scale pre-training, recent efforts in symbolic regression have focused on transformer-based models trained with a large amount of data (Valipour et al., 2021; Biggio et al., 2021). They approach symbolic regression as a machine translation problem, mapping the input data to latent representations via encoders and then outputting the skeleton of the expression, without constants, via decoders. These transformer-based methods (Valipour et al., 2021; Biggio et al., 2021) have two main drawbacks: (i) A natural question is which encoder architecture is best suited for symbolic regression. The decoder's ability to sample expressions efficiently is severely constrained by the encoder's ability to extract the features of the input data. The idea is that the encoder should not just encode the points but also represent the expression at a high level, such that the decoder only prints the representation as a sequence of symbols. (ii) They use the single characters of the expression's string (Valipour et al., 2021) and the pre-order traversal of the expression tree (Biggio et al., 2021), respectively, as supervision information. This is an ill-posed problem that does not provide sufficient supervision: different instances of the same skeleton can have very different shapes, and instances of very different skeletons can be very close.
To alleviate these issues, we propose a transformer-based method for symbolic regression using a new feature extractor and a joint supervised learning mechanism. In summary, the main contributions of this study are as follows:
• We leverage a pure residual MLP feature extractor to extract the local and global features of observed data for symbolic regression tasks, which helps the expression generator produce more correct expression skeletons.
• We propose a joint learning mechanism based on supervised contrastive learning that combines the supervision of the whole expression skeleton with the supervision of the pre-order traversal of its expression tree, which alleviates the ill-posed problem effectively.
• Empirically, the proposed method is up to 25% better than recent transformer-based methods with respect to the recovery rate of expression skeletons. Moreover, our method outperforms several strong baseline methods in terms of $R^2$.

2. RELATED WORK

Genetic programming (GP) for symbolic regression. Traditionally, approaches to symbolic regression are based on genetic algorithms (Forrest, 1993). Later, the symbolic regression task came to be seen as an optimization problem over the search space (Koza, 1994). The popular commercial software Eureqa (Dubčáková, 2011) is the most successful application of GP methods. A limitation of methods based on genetic algorithms is that they need to be trained from scratch for each equation, which is slow, computationally expensive, and highly randomized. The models tend to generate overly complex equations and are sensitive to the choice of hyperparameters (Petersen et al., 2021).

Neural network for symbolic regression. Symbolic regression approaches based on neural networks can be broadly classified into three categories. First, the methods based on the equation learner (EQL) (Martius & Lampert, 2016; Sahoo et al., 2018; Werner et al., 2021) replace the activation functions of the neural network with arithmetic operators. They inherit the ability of neural networks to deal with high-dimensional data and scale well with the number of input-output pairs (Biggio et al., 2021). Nevertheless, the existence of exponential and logarithmic activation functions leads to gradient instability. Also, the complexity of the predicted expression depends on the depth of the EQL network.

Figure 1: Schematic diagram of training (panels: Feature Extractor, Expression Generator). The data inputs and expression skeletons' labels are passed through the feature extractor. Then, given the feature vectors, token embedding, and positional embedding, the expression generator produces expression skeletons in parallel. Finally, the model jointly computes the supervised contrastive learning loss and cross-entropy loss.

Reinforcement learning for symbolic regression. The second approach is the autoregressive model based on reinforcement learning. Petersen et al. (2021) use reinforcement learning based on a risk-seeking policy gradient to train an RNN to generate a probability distribution over the space of mathematical expressions. For such symbolic regression tasks, they propose a new objective function based on risk-seeking policy gradients that focuses on maximizing best-case performance rather than the average performance of a policy. Genetic programming and neural-guided search are mechanistically dissimilar, yet both have proven to be effective solutions to symbolic regression. Mundhenk et al. (2021) combine the two approaches to leverage the strengths of each: they take the output from the RNN as an initial population for a genetic algorithm. The method represents a significant step forward in the application of deep learning to symbolic regression (Biggio et al., 2021), and its promising results make it the currently recognized state-of-the-art approach to symbolic regression tasks. Nevertheless, the method has two clear limitations: the network has to be retrained from scratch for each new equation, and the RNN is never directly conditioned on the data it is trained to model (Biggio et al., 2021).
Large scale transformer-based models for symbolic regression. The third approach is to train a large-scale transformer-based model on a large amount of data. More recently, SymbolicGPT (Valipour et al., 2021) trained a GPT (Radford et al., 2019) model to construct a mapping from sets of points to symbolic output. They first feed the data points into T-net (Qi et al., 2017) to obtain a latent representation of the data points and then feed it to the GPT (Radford et al., 2019) to generate expression strings. They generate the expression at the character level and finally concatenate the characters into an expression. In general, they explore an alternative approach to symbolic regression by treating it as a language modeling task. Symbolic mathematics behaves as a language in its own right, with well-formed mathematical expressions treated as valid "sentences" in this language (Valipour et al., 2021). Furthermore, NeSymReS (Biggio et al., 2021) proposed a similar method that uses the encoder from the Set Transformer (Lee et al., 2019) and the decoder from the original transformer architecture (Vaswani et al., 2017). Their greatest contribution is to show that the performance of their approach improves as the size of the dataset increases.

3. METHODS

The proposed joint learning mechanism is shown in Figure 1. First, we leverage a permutation-invariant feature extractor based on residual MLP networks to obtain feature vectors of the data points. Then, given the feature vectors, the expression generator autoregressively generates individual mathematical symbols until we obtain the entire skeleton of the expression. In the forward propagation, we compute the contrastive loss with respect to the expression skeleton categories and the cross-entropy loss with respect to the mathematical symbol categories separately. The parameters of the network are jointly updated in backpropagation. Training can efficiently process each sequence in a single forward pass of the network thanks to masked attention and teacher forcing (Vaswani et al., 2017). During inference, multiple predictions are sampled from the model using the beam search strategy, and we select the prediction with the lowest error.

3.1. EXTRACTING EFFECTIVE FEATURE OF DATA POINTS

As mentioned in section 1, the features extracted from the data points affect the expressions generated by the decoder. Valipour et al. (2021) obtain the latent representation of the data using T-net (Qi et al., 2017). Although efficient, it suffers from local feature loss caused by non-locality and non-hierarchy, which degrades the representational quality of details for point clouds (Ma et al., 2022). As a similar work, NeSymReS (Biggio et al., 2021) uses the Set Transformer (Lee et al., 2019), based on self-attention, to extract features of the data points. However, it focuses too much on local feature extraction and lacks global information, which makes it unsuitable for expression data points. We explore and visualize the similarity of the features extracted from data points by these methods and compare the expression skeletons generated by the decoder, which confirms the above problems. The result is shown in section 4. Because the performance of the feature extractor has to be considered, we opt for the framework of PointMLP (Ma et al., 2022), based on pure residual MLPs, as our feature extractor in order to obtain both the local and the overall information of the data points. Given a set of data points $D = \{(x_i, y_i)\}_{i=1}^{n} \in \mathbb{R}^{n \times (d+1)}$, where $n$ indicates the number of points and $d$ denotes the dimension of the variable, PointMLP learns hierarchical features of the data points by stacking multiple learning stages. In each stage, $N_s$ points are re-sampled by the farthest point sampling (FPS) algorithm, where $s$ indexes the stage, and $K$ neighbors are employed for each sampled point (Ma et al., 2022). Conceptually, the kernel operation of PointMLP can be formulated as:

$$O_i = \mathrm{POS}\left(\mathrm{MaxPool}\left(\mathrm{PRE}(f_{i,j}) \mid j = 1, \cdots, K\right)\right)$$

where $f_{i,j}$ is the $j$-th neighbor point feature of the $i$-th sampled point. $\mathrm{POS}(\cdot)$ and $\mathrm{PRE}(\cdot)$ are residual point MLP blocks: the shared $\mathrm{PRE}(\cdot)$ is designed to learn shared weights from a local region, while $\mathrm{POS}(\cdot)$ is leveraged to extract deep aggregated features.
$\mathrm{POS}(\cdot)$ and $\mathrm{PRE}(\cdot)$ consist of several residual MLP blocks of the form $\mathrm{MLP}(x) + x$. We use the max pooling layer to aggregate global features. After the MLP blocks, we add a dropout layer (Srivastava et al., 2014). Following Ma et al. (2022), we also leverage a lightweight geometric affine module to improve robustness to the sparse and irregular geometric structures in local regions. Let $\{f_{i,j}\}_{j=1,\cdots,k} \in \mathbb{R}^{k \times d}$ be the grouped local neighbors of $f_i \in \mathbb{R}^d$ containing $k$ points, where each neighbor point $f_{i,j}$ is a $d$-dimensional vector. We transform the local neighbor points by the following formulation:

$$\{f_{i,j}\} = \alpha \odot \frac{\{f_{i,j}\} - f_i}{\sigma + \epsilon} + \beta, \qquad \sigma = \sqrt{\frac{1}{k \times n \times d} \sum_{i=1}^{n} \sum_{j=1}^{k} (f_{i,j} - f_i)^2}$$

where $\alpha \in \mathbb{R}^d$ and $\beta \in \mathbb{R}^d$ are learnable parameters, $\odot$ indicates the Hadamard product, and $\epsilon = 10^{-5}$ is a small number for numerical stability. Note that $\sigma$ is a scalar that describes the feature deviation across all local groups and channels. By doing so, we transform the local points to a normal distribution while maintaining the original geometric properties.
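The geometric affine transformation above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `geometric_affine`, the array layout, and the toy data are assumptions made for illustration, and $\sigma$ is computed as a root-mean-square deviation over all groups and channels.

```python
import numpy as np

def geometric_affine(groups, centers, alpha, beta, eps=1e-5):
    """Normalize grouped neighbor features around their sampled points.

    groups:  (n, k, d) neighbor features f_{i,j}
    centers: (n, d)    sampled-point features f_i
    alpha, beta: (d,)  learnable scale and shift (fixed arrays in this sketch)
    """
    diff = groups - centers[:, None, :]        # f_{i,j} - f_i
    sigma = np.sqrt(np.mean(diff ** 2))        # one scalar over all n * k * d entries
    return alpha * diff / (sigma + eps) + beta

# Hypothetical toy input: 8 sampled points, 4 neighbors each, 3 channels.
rng = np.random.default_rng(0)
n, k, d = 8, 4, 3
centers = rng.normal(size=(n, d))
groups = centers[:, None, :] + 5.0 * rng.normal(size=(n, k, d))
out = geometric_affine(groups, centers, alpha=np.ones(d), beta=np.zeros(d))
```

With `alpha` set to ones and `beta` to zeros, the output has a root-mean-square value close to one regardless of how spread out the neighbors were, which is the normalizing effect the module is meant to provide.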

3.2. TRAINING WITH JOINT SUPERVISION INFORMATION

Recent transformer-based methods face an ill-posed problem because they use insufficient supervision information: they use the single characters of the expression's string (Valipour et al., 2021) and the pre-order traversal of the expression (Biggio et al., 2021), respectively, as supervision information and train the model by minimizing the cross-entropy loss. In this work, we propose a joint objective function that combines the cross-entropy (CE) loss and a supervised contrastive learning (CL) loss. In addition to using symbol labels as supervision, we treat the skeletons of expressions as category labels to enrich the supervisory information. The CL loss with respect to the expression skeleton categories and the CE loss with respect to the mathematical symbol categories are calculated separately in the forward propagation. In backpropagation, the network's parameters are updated jointly. As the auxiliary loss, the CL loss is meant to capture the similarities of feature vectors between expressions with the same skeleton and contrast them with others. The promising results are described in section 4. The overall loss is a weighted average of the CE loss and the proposed CL loss, as given in equation (1). The canonical definition of the CE loss that we use is given in equation (2). The novel CL loss is given in equation (3).
The overall loss is then given in the following:

$$\mathcal{L} = (1 - \lambda)\,\mathcal{L}_{CE} + \lambda\,\mathcal{L}_{CL} \tag{1}$$

$$\mathcal{L}_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \cdot \ln \hat{y}_{i,c} \tag{2}$$

$$\mathcal{L}_{CL} = -\sum_{i=1}^{N} \frac{1}{N_{\ell_i} + \epsilon} \sum_{j=1}^{N} \mathbb{1}_{i \neq j}\, \mathbb{1}_{\ell_i = \ell_j} \ln \frac{\exp(s_{i,j}/\tau)}{\sum_{k=1}^{N} \mathbb{1}_{i \neq k}\, \mathbb{1}_{\ell_i \neq \ell_k} \exp(s_{i,k}/\tau)} \tag{3}$$

Here, $\lambda$ is a scalar weighting hyperparameter that we tune for the training stage; $N$ represents the mini-batch size; $C$ denotes the size of the pre-specified token library; $y_{i,c}$ denotes the symbol label and $\hat{y}_{i,c}$ denotes the expression generator output for the probability of the $i$-th token belonging to class $c$; $N_{\ell_i}$ is the total number of examples in the batch that have the same skeleton label as $\ell_i$; $\epsilon$ is a very small scalar preventing division by zero; $\mathbb{1}_{i \neq j}$, $\mathbb{1}_{\ell_i = \ell_j}$ and $\mathbb{1}_{\ell_i \neq \ell_k}$ are indicator functions; $s_{i,j} = v_i \cdot v_j / (\lVert v_i \rVert \lVert v_j \rVert)$ denotes the cosine similarity between sample $i$ and sample $j$, where $v_i$ and $v_j$ represent the high-level feature vectors of sample $i$ and sample $j$, respectively, from the feature extractor; and $\tau > 0$ is an adjustable scalar temperature parameter that controls the separation of classes.
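To make equations (1) to (3) concrete, the following NumPy sketch evaluates the three losses on a toy batch. It is a hedged illustration rather than the authors' training code: the function names, the temperature value, and the toy feature vectors are assumptions, $N_{\ell_i}$ is assumed to count sample $i$ itself, and the batch is assumed to contain at least two different skeletons so that the denominator in equation (3) is non-zero.

```python
import numpy as np

def cross_entropy(y_onehot, y_prob):
    """Mean token-level cross-entropy, as in equation (2)."""
    return -np.mean(np.sum(y_onehot * np.log(y_prob), axis=1))

def supervised_contrastive(v, labels, tau=0.5, eps=1e-8):
    """Supervised contrastive loss over skeleton labels, following equation (3)."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    s = v @ v.T                                    # s[i, j]: cosine similarity
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]      # indicator l_i == l_j
    positives = same & ~np.eye(len(labels), dtype=bool)   # additionally i != j
    # Denominator: samples k whose skeleton label differs from l_i.
    denom = np.sum(np.where(~same, np.exp(s / tau), 0.0), axis=1, keepdims=True)
    log_ratio = s / tau - np.log(denom)
    n_same = same.sum(axis=1)                      # assumption: counts sample i itself
    per_sample = np.sum(np.where(positives, log_ratio, 0.0), axis=1) / (n_same + eps)
    return -np.sum(per_sample)

def joint_loss(ce, cl, lam):
    """Weighted combination of the two losses, as in equation (1)."""
    return (1.0 - lam) * ce + lam * cl

# Hypothetical batch: two skeletons (labels 0 and 1), two samples each.
labels = [0, 0, 1, 1]
v_aligned = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
v_mixed = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]])
cl_aligned = supervised_contrastive(v_aligned, labels)
cl_mixed = supervised_contrastive(v_mixed, labels)

ce = cross_entropy(np.array([[1.0, 0.0], [0.0, 1.0]]),
                   np.array([[0.9, 0.1], [0.2, 0.8]]))
total = joint_loss(ce, cl_aligned, lam=0.3)
```

In this toy batch the loss is lower when feature vectors of the same skeleton point in similar directions (`v_aligned`) than when they do not (`v_mixed`), which is the behavior the auxiliary loss is meant to reward.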

3.3. GENERATING EXPRESSIONS WITH A TRANSFORMER-BASED MODEL

We leverage the framework of the GPT language model (Radford et al., 2019) as the expression generator, which is an autoregressive language model based on the decoder of the transformer (Vaswani et al., 2017). During inference, we generate an expression $\tau$ one token at a time along the pre-order traversal. For example, the expression $f(x_1, x_2) = \sin(x_1) + \log(x_2)$ is encoded as $[+, \sin, x_1, \log, x_2]$. We denote the $i$-th token of the traversal as $\tau_i$, and each token has a value within a given library $L$ of possible tokens, e.g., $\{+, -, \times, \div, \sin, \cos, \log, x_1, x_2\}$. Specifically, the $i$-th output of the generator with parameters $\theta$ and feature vector $v$ extracted from data $D$ passes through a softmax layer to produce the vector $\psi^{(i)}$, which defines the probability distribution for selecting the $i$-th token $\tau_i$. The likelihood of the entire generated expression is simply the product of the likelihoods of its tokens:

$$p(\tau \mid \theta, v) = \prod_{i=1}^{|\tau|} p(\tau_i \mid \tau_{1:(i-1)}; \theta, v) = \prod_{i=1}^{|\tau|} \psi^{(i)}_{L(\tau_i)}$$

Note that the output sequence from the generator does not contain any numerical constants. Instead, we use a special placeholder $\langle C \rangle$ denoting the presence of a constant that will be optimized at a later stage.
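The pre-order encoding and the product-of-token-probabilities likelihood can be sketched in plain Python as follows. The nested-tuple tree representation and the per-step probability dictionaries are hypothetical stand-ins for the generator's softmax outputs, chosen only for illustration.

```python
import math

def preorder(node):
    """Return the pre-order token sequence of an expression tree.

    A leaf is a string (variable or placeholder); an internal node is a
    tuple (operator, child_1, ..., child_m).
    """
    if isinstance(node, str):
        return [node]
    op, *children = node
    tokens = [op]
    for child in children:
        tokens.extend(preorder(child))
    return tokens

def sequence_log_likelihood(tokens, step_probs):
    """Sum of log-probabilities of each token under its step distribution."""
    return sum(math.log(psi[tok]) for tok, psi in zip(tokens, step_probs))

# f(x1, x2) = sin(x1) + log(x2)
tree = ("+", ("sin", "x1"), ("log", "x2"))
tokens = preorder(tree)

# Hypothetical softmax outputs psi^(i), one distribution per generation step.
step_probs = [
    {"+": 0.5, "sin": 0.5},
    {"sin": 0.8, "log": 0.2},
    {"x1": 0.9, "x2": 0.1},
    {"log": 0.5, "sin": 0.5},
    {"x2": 1.0},
]
likelihood = math.exp(sequence_log_likelihood(tokens, step_probs))
```

Summing log-probabilities instead of multiplying raw probabilities is a common choice for numerical stability on longer sequences; the two are equivalent up to exponentiation.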

3.4. LEARNING CONSTANTS USING BFGS BY RESTARTING MULTIPLE TIMES

At inference time, Mundhenk et al. (2021), Valipour et al. (2021), and Biggio et al. (2021) all use the BFGS optimization algorithm (Fletcher, 1984) on the mean squared error to fit the constants. BFGS is a quasi-Newton method for solving unconstrained nonlinear optimization problems. However, for the symbolic regression task, the loss function minimized by BFGS may be highly non-convex, so the optimizer is likely to get trapped in one of several local minima: even when the skeleton is perfectly predicted, the correct constants are not guaranteed to be found. To ameliorate this issue, we use a simple method with almost no additional memory cost: we restart the BFGS algorithm multiple times from different starting points and keep the best solution, which increases the chance of reaching the global optimum. The result is shown in section 4.
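A minimal sketch of the multi-start strategy, assuming SciPy's `scipy.optimize.minimize` with `method="BFGS"` as the optimizer, is given below. The helper name `fit_constants`, the initialization interval $[-2, 2]$, and the example skeleton $c_0 \sin(c_1 x)$ are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np
from scipy.optimize import minimize

def fit_constants(skeleton, x, y, n_constants, n_starts=10, seed=0):
    """Fit skeleton constants by BFGS from several random starting points.

    skeleton: callable (x, c) -> predictions; c is the vector of constants.
    Returns the constants and the mean squared error of the best run.
    """
    rng = np.random.default_rng(seed)

    def mse(c):
        return np.mean((skeleton(x, c) - y) ** 2)

    best = None
    for _ in range(n_starts):
        c0 = rng.uniform(-2.0, 2.0, size=n_constants)   # assumed init range
        res = minimize(mse, c0, method="BFGS")
        if np.isfinite(res.fun) and (best is None or res.fun < best.fun):
            best = res
    return best.x, best.fun

# Hypothetical skeleton whose loss is non-convex in the frequency constant.
def skeleton(x, c):
    return c[0] * np.sin(c[1] * x)

x = np.linspace(-10.0, 10.0, 100)
y = skeleton(x, np.array([1.5, 1.3]))

c_one, mse_one = fit_constants(skeleton, x, y, n_constants=2, n_starts=1)
c_many, mse_many = fit_constants(skeleton, x, y, n_constants=2, n_starts=20)
```

Because both calls share the same seed, the 20-start run includes the single-start run's initialization, so its best error can never be worse. Whether it actually reaches the global optimum still depends on the loss landscape and the number of starts.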

4.1. GENERATING DATASETS

We generate mathematical expressions following the framework described by Lample & Charton (2019). The framework starts by generating a random unary-binary tree and filling the nodes with operators and the leaves with independent variables or constants. The unnormalized probabilities of each operation and operator are given in Appendix A. We generate a training set containing approximately 100,000 unique expression skeletons. For each expression, we re-sample its constant values 10, 20, 30, 40, or 50 times to build training sets of different sizes. Once an expression tree has been generated, we can represent the expression using a pre-order traversal of the tree. We opt for scalar functions of up to two independent variables (i.e., $d = 2$ and $y = f(x_1, x_2)$) and three numerical constants each. Specifically, each generated expression's constant values are replaced with the constant placeholder $\langle C \rangle$. Then, the additive and multiplicative constant values are sampled uniformly from the interval $[-2, 2]$ to fill in the placeholders. After this, the entire equation is simplified using the rules built into the symbolic manipulation library SymPy (Meurer et al., 2017). Finally, we sample uniformly at random 100 input points $X = \{x_i\}_{i=1}^{n}$ from the interval $[-10, 10]$ and evaluate the expression on $X$ to obtain the corresponding outputs $Y$. $X$ is re-sampled if it produces non-finite values (NaN or $\pm\infty$), and we discard any expression that cannot be sampled completely within 30 seconds.

Creating the SSDNC benchmark. We generate a more challenging test set to probe the performance of transformer-based methods, which includes approximately 100 unique expression skeletons and 10 re-sampled sets of numerical constants for each skeleton. For each independent variable, we sample a random support of 100 points from the uniform distribution $\mathcal{U}(-10, 10)$. We call it SSDNC, for the same skeletons with different numerical coefficients.
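The constant and support sampling described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' generator: the function name `sample_dataset` is hypothetical, invalid points are re-sampled one at a time, and a fixed attempt budget (`max_tries`) stands in for the 30-second limit.

```python
import math
import random

def sample_dataset(skeleton, n_constants, n_vars=2, n_points=100,
                   max_tries=10_000, seed=0):
    """Sample constants in [-2, 2] and finite-valued points in [-10, 10].

    skeleton: callable (x, c) -> float, with x a list of variable values and
    c a list of constants. Returns (constants, xs, ys), or None when the
    attempt budget runs out, in which case the expression is discarded.
    """
    rng = random.Random(seed)
    constants = [rng.uniform(-2.0, 2.0) for _ in range(n_constants)]
    xs, ys = [], []
    for _ in range(max_tries):
        if len(xs) == n_points:
            break
        x = [rng.uniform(-10.0, 10.0) for _ in range(n_vars)]
        try:
            y = skeleton(x, constants)
        except (ValueError, OverflowError, ZeroDivisionError):
            continue                      # undefined at this point: re-sample
        if math.isfinite(y):
            xs.append(x)
            ys.append(y)
    if len(xs) < n_points:
        return None
    return constants, xs, ys

# Hypothetical skeleton: C * log(x1) + C * x2 + C
def skeleton(x, c):
    return c[0] * math.log(x[0]) + c[1] * x[1] + c[2]

constants, xs, ys = sample_dataset(skeleton, n_constants=3)
```

For this skeleton, roughly half of the candidate points are rejected because the logarithm is undefined for non-positive inputs, which shows why a re-sampling step is needed at all.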

4.2. TRAINING AND INFERENCE

We train the feature extractor and the expression generator jointly to minimize the objective that combines the cross-entropy loss and the supervised contrastive loss. More specifically, we train the model using the Adam optimizer (Kingma & Ba, 2014) on 4 NVIDIA V100 GPUs. More detailed hyperparameters are reported in Appendix 5; they were found empirically and not fine-tuned for maximum performance. Note that we use the same training dataset for all methods to facilitate quantitative benchmarking when comparing feature extraction and other capabilities with SymbolicGPT (Valipour et al., 2021) and NeSymReS (Biggio et al., 2021). At inference, most of the expressions in the benchmark test sets are not seen during training, and we re-sample all data points to avoid possible overfitting.

4.3. METRICS

We have selected the coefficient of determination ($R^2$) to assess the quality of our method. The $R^2$ (Glantz & Slinker, 2001) is defined as follows:

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{k} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{k} (y_i - \bar{y})^2}$$

where $y_i$ and $\hat{y}_i$ are the ground-truth and predicted values for point $i$, respectively, $\bar{y}$ is the average of $y_i$ over all the points, and $k$ is the number of test points. The advantage of using $R^2$ is its straightforward interpretation: if $R^2 > 0$, the prediction is better than predicting just the average value, and if $R^2 = 1$, the prediction is perfect. However, because the squared errors are unbounded, $R^2$ is sensitive to outliers, and hence to the number of points considered at test time (more points entail a higher risk of an outlier). To circumvent this, we discard the 5% worst predictions for all methods used, following Biggio et al. (2021).
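As a small worked example of the definition and of its sensitivity to outliers, the following plain-Python sketch computes $R^2$ for three hypothetical prediction vectors. The function name `r2_score` and the toy values are assumptions for illustration; the 5% trimming step is not included.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r2_score(y_true, [1.0, 2.0, 3.0, 4.0])   # exact predictions
r2_mean = r2_score(y_true, [2.5, 2.5, 2.5, 2.5])      # always predict the mean
r2_outlier = r2_score(y_true, [1.0, 2.0, 3.0, 14.0])  # one badly wrong point
```

A single large error on one of four points is enough to push the score far below zero even though the other three predictions are exact, which motivates discarding the worst predictions before averaging.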

4.4. BASELINES

We compare the performance of our method with four strong symbolic regression baselines:
• Deep Symbolic Optimization (DSO). A symbolic regression method based on an RNN and a reinforcement learning search strategy (Mundhenk et al., 2021). We use the open-source implementation provided by the authors.
• Genetic Programming. Standard GP-based symbolic regression (Koza, 1994) based on the open-source Python library gplearn.
All details for baselines are reported in Appendix D.

Figure 2: For the expression skeleton $c_1 \sin(x_1) + c_2 \cos(x_2) + c_3$, four heat maps of cosine similarity between the fifty different feature vectors from different methods, where a redder color means the cosine similarity is closer to 0 and a greener color means the cosine similarity is closer to 1.

4.5. FEATURE EXTRACTION PERFORMANCE

For transformer-based methods in symbolic regression, we empirically demonstrate that the performance of the feature extractor plays a critical role in the overall training and evaluation stages. We evaluate the feature extraction performance of recent transformer-based methods, i.e., SymbolicGPT (Valipour et al., 2021) and NeSymReS (Biggio et al., 2021), and of our method with and without CL on the SSDNC benchmark. After training, we feed the data points corresponding to a specific expression skeleton with different constants into the feature extractor and then compute the cosine similarity between the resulting feature vectors. As shown in Figure 2, SymbolicGPT (Valipour et al., 2021) and NeSymReS (Biggio et al., 2021) both produce dissimilar feature vectors for the data points of different expressions belonging to the same skeleton. The reason is that the feature extractors they use focus more on local features and lose key information, which is not appropriate for the data points of symbolic regression. By manual inspection, we find that this problem can adversely affect the expression generator's ability to produce the expression skeleton. The results in section 4.6 illustrate the high correlation between the feature extractor and expression generation. Benefiting from the pure residual MLP architecture, our feature extractor obtains more similar feature vectors for the same skeleton, even without CL. After joint training, the feature vectors of data points from the same skeleton become more similar, which effectively alleviates the ill-posed problem and improves the performance of generating expressions with the same skeleton. In inference, since transformer-based methods first generate expression skeletons based on the features of the data points, we first compare the performance of these methods in recovering expression skeletons. We evaluate the recovery rate of expression skeletons on the SSDNC benchmark.
As shown in Table 1, our method, with its more effective feature extractor and the joint learning mechanism, can better guide the generator in generating expressions, thus outperforming other transformer-based methods. Generating the right expression skeleton is crucial to the final result because optimizing constants is relatively simple. This is reflected in the comparison for $R^2$ in the following results.

Table 2: Results comparing our method with CL with state-of-the-art methods on several benchmarks. Our method, SymbolicGPT, and NeSymReS all use the beam search strategy with the beam size equaling 128. We report the average value of $R^2$ for each benchmark. Columns: Benchmark, Ours, SymbolicGPT, NeSymReS, DSO, GP (each method column reports $R^2$ ↑).

Statistics of fitting accuracy

We evaluate our method and current state-of-the-art approaches on widely used public benchmarks, i.e., the Nguyen benchmark (Uy et al., 2011), Constant, Keijzer (Keijzer, 2003), R rationals (Krawiec & Pawlak, 2013), the AI-Feynman database (Udrescu & Tegmark, 2020), and our SSDNC test set. Nguyen was the main benchmark used in Petersen et al. (2021). Some terms that appear in three ground-truth equations are not included in the set of equations that our model can fit, specifically $x^6$ and $x^y$, but we can find approximate expressions. The AI-Feynman database is extracted from the popular Feynman Lectures on Physics series and contains expressions with up to nine variables. In our study, we consider all the expressions with up to two variables. The complete benchmark functions are given in Appendix 4. From the results in Table 2, our method outperforms all baseline methods in terms of average $R^2$ on six benchmarks.

Performance under different training set sizes. Following subsection 4.1, we generate data sets of different sizes that contain varying numbers of expressions with different constants and the same skeletons. Since DSO (Mundhenk et al., 2021) and GP-based methods are trained from scratch for specific problems, they are not included in this comparison. We train our model, SymbolicGPT, and NeSymReS on our training sets separately. As shown in Figure 3, our method both with and without CL outperforms these two baseline methods in terms of $R^2$ on the SSDNC benchmark. SymbolicGPT and NeSymReS can also improve performance simply by increasing the size of the dataset, which shows that the data-driven approach to SR can continuously improve performance as the size of the dataset increases.

Performance under noisy data.
We evaluated the robustness of our method and the baseline methods to noisy data by adding independent Gaussian noise to the dependent variable, with mean zero and standard deviation proportional to the root-mean-square of the dependent variable in the training data (Mundhenk et al., 2021). In Figure 4, we varied the proportionality constant from 0 (noiseless) to $10^{-1}$, following Mundhenk et al. (2021), and evaluated each algorithm across the SSDNC benchmark. Our method with CL is competitive with DSO, which is based on reinforcement learning, and does not overfit the noise when a small amount of noise is added.

'GT' denotes the ground truth and 'Pred' the model prediction. Specifically, (a): GT: $\frac{x^6 + x^5}{x^4 + x^3 + x^2 + x + 1}$, Pred: $x \cdot (x - \sin(\frac{1}{x}))$; (b): GT: $x \cdot \cos(\tan(x))$, Pred: $x \cdot \cos(\tan(x))$; (c): GT: $0.2x^3 + 0.5y^3 - y - x$, Pred: $0.1853x^3 + 0.4974y^3 - 0.8608y$.

Optimizing constants via restarting BFGS multiple times. We try to reach the global optimum rather than a local optimum by restarting BFGS several times, each time using a different initialization. We set the number of restarts to 5, 10, 15, and 20; 0 means that the optimization is run only once. Figure 5 shows that $R^2$ improves as the number of BFGS restarts increases. We can conclude that multiple initializations help the constant optimization escape local optima and thus achieve better fitting accuracy.

Finding mathematically equivalent expressions. By manual observation, we find that our method generates more expressions with the same symbols as the target expression, which benefits from the high skeleton recovery rate. Interestingly, we also find that the model sometimes predicts a simpler expression than the ground truth with fairly high fitting accuracy.
For example, for the Keijzer-4 expression $x^3 \cdot \exp(-x) \cdot \cos(x) \cdot \sin(x) \cdot (\sin(x)^2 \cdot \cos(x) - 1)$, our method predicts $-0.634193499960418\,x^3 \cdot \exp(-x) \cdot \sin(1.83246993155237x + 10.9750345372947)$, which has a simpler skeleton and achieves a high $R^2$ through the constant optimization. Additionally, the model can learn some transformation relations of trigonometric functions. For example, for the Constant-7 expression $2\sin(1.3x_1) \cdot \cos(x_2)$, our method predicts $2\cos(1.3x_1 - \frac{\pi}{2}) \cdot \cos(x_2)$, which reflects the identity $\sin(x) = \cos(x - \frac{\pi}{2})$.

Out-of-domain performance. To test the out-of-domain performance of our method, we first run inference on points sampled from the training range and then evaluate the predicted functions on points outside the sampling range. In Figure 6, we visualize some of the model predictions for unary and binary expressions. The experimental results show that our model generalizes well outside the sampling domain when predicting more complex expressions, without overfitting the sampled data.
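The noise model used in the robustness experiment above (zero-mean Gaussian noise whose standard deviation is a fixed fraction of the root-mean-square of the dependent variable) can be sketched in plain Python as follows. The function name `add_noise` and the constant toy target are illustrative assumptions.

```python
import math
import random

def add_noise(y, level, seed=0):
    """Add Gaussian noise with std = level * RMS(y) to each target value."""
    rng = random.Random(seed)
    rms = math.sqrt(sum(v * v for v in y) / len(y))
    return [v + rng.gauss(0.0, level * rms) for v in y]

# Hypothetical targets with RMS equal to 3, so level 0.1 gives std 0.3.
y = [3.0] * 10_000
y_clean = add_noise(y, level=0.0)
y_noisy = add_noise(y, level=0.1)

# Empirical standard deviation of the injected noise.
residuals = [a - b for a, b in zip(y_noisy, y)]
mean_res = sum(residuals) / len(residuals)
noise_std = math.sqrt(sum((r - mean_res) ** 2 for r in residuals) / len(residuals))
```

Scaling the noise by the root-mean-square of the targets keeps the signal-to-noise ratio comparable across expressions whose outputs differ by orders of magnitude.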

5. CONCLUSION

We propose a transformer-based method for symbolic regression using a new feature extractor and a joint supervised learning mechanism. Specifically, we leverage a pure residual MLP feature extractor to obtain more valuable features from the input data points. To alleviate the ill-posed problem, we propose a joint supervised learning mechanism that incorporates supervised contrastive learning, which strengthens the similarity of feature vectors from the same skeleton. The expression skeleton recovery rate of the proposed method is up to 25% higher than that of recent transformer-based methods. Evaluated on six benchmarks, our method outperforms current state-of-the-art methods based on reinforcement learning and genetic programming in terms of $R^2$. It is worth noting that our method is competitive with the reinforcement learning-based method in terms of robustness to noise. Finally, by evaluating the performance of our method outside the sampling range, we showed that it has good extrapolation capabilities.

A DETAILS FOR DATASET GENERATION

In this section, we describe the configurations used to generate the training set, validation set, and SSDNC benchmark. We sample each non-leaf node following the unnormalized weighted distribution shown in Table 3. Our base data set contains approximately 100,000 unique expression skeletons. The different training sets are then generated according to the number of expressions with different coefficients; we set this number to 10, 20, 30, 40 and 50. The validation set contains 1K skeletons that are randomly sampled from the base data set and assigned constants that differ from those in the training set. We also generate the more challenging test benchmark SSDNC, which includes approximately 100 unique expression skeletons and 10 re-sampled numerical constants for each skeleton. Source code is available at https://github.com/AILWQ/Joint_Supervised_Learning_for_SR.

Table 3: Unnormalized probabilities of unary and binary operators used by the dataset generator.

Operation   Mathematical meaning   Unnormalized probability
add         +                      10
mul         ×                      10
sub         -                      5
div         ÷                      5
sqrt        √(·)                   4
exp         exp(·)                 4
ln          ln(·)                  4
sin         sin(·)                 4
cos         cos(·)                 4
tan         tan(·)                 4
pow2        (·)^2                  4
pow3        (·)^3                  2
pow4        (·)^4                  1
pow5        (·)^5                  1

B BENCHMARK FUNCTIONS

This section describes the exact functions used to compare our method with the current state-of-the-art methods. Table 4 lists the name of each benchmark and the corresponding expression.

Name          Expression
Nguyen-1      x^3 + x^2 + x
Nguyen-2      x^4 + x^3 + x^2 + x
Nguyen-3      x^5 + x^4 + x^3 + x^2 + x
Nguyen-4      x^6 + x^5 + x^4 + x^3 + x^2 + x
Nguyen-5      sin(x^2) cos(x) - 1
Nguyen-6      sin(x) + sin(x + x^2)
Nguyen-7      ln(x + 1) + ln(x^2 + 1)
Nguyen-8      √x
Nguyen-9      sin(x) + sin(y^2)
Nguyen-10     2 sin(x) cos(y)
Nguyen-11     x^y
Nguyen-12     x^4 - x^3 + (1/2) y^2 - y
Constant-1    3.39x^3 + 2.12x^2 + 1.78x
Constant-2    sin(x^2) · cos(x) - 0.75
Constant-3    sin(1.5x) · cos(0.5y)
Constant-4    2.7 x^y
Constant-5    √(1.23x)
Constant-6    x^0.426
Constant-7    2 sin(1.3x) · cos(y)
Constant-8    ln(x + 1.4) + ln(x^2 + 1.3)
Keijzer-3     0.3 · x · sin(2πx)
Keijzer-4     x^3 exp(-x) cos(x) sin(x) (sin(x)^2 cos(x) - 1)
Keijzer-6
Keijzer-15    x^3/5 + y^3/2 - y - x
R-1           (x + 1)^3 / (x^2 - x + 1)
R-2           (x^5 - 3x^3 + 1) / (x^2 + 1)
R-3           (x^6 + x^5) / (x^4 + x^3 + x^2 + x + 1)
Feynman-1     exp(-x^2 / 2) / √(2π)
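Looking back at the generator configuration in Appendix A, the unnormalized weights of Table 3 become sampling probabilities once divided by their sum (62), so that, for instance, add is drawn with probability 10/62. The sketch below illustrates this step only; the function names are hypothetical, and the actual generator in the released source code may be organized differently.

```python
import random

# Unnormalized operator weights copied from Table 3.
OPERATOR_WEIGHTS = {
    "add": 10, "mul": 10, "sub": 5, "div": 5,
    "sqrt": 4, "exp": 4, "ln": 4, "sin": 4, "cos": 4, "tan": 4,
    "pow2": 4, "pow3": 2, "pow4": 1, "pow5": 1,
}


def operator_probabilities(weights):
    """Normalize the weights so that they sum to one."""
    total = sum(weights.values())
    return {op: w / total for op, w in weights.items()}


def sample_operators(weights, n, rng):
    """Draw n non-leaf node labels in proportion to their weights."""
    ops = list(weights)
    return rng.choices(ops, weights=[weights[op] for op in ops], k=n)


probs = operator_probabilities(OPERATOR_WEIGHTS)
rng = random.Random(0)  # fixed seed, chosen only for reproducibility
draws = sample_operators(OPERATOR_WEIGHTS, 5, rng)
print(f"P(add) = {probs['add']:.4f}")
print("sampled operators:", draws)
```

Passing the raw weights to `random.choices` is equivalent to normalizing first, since the function rescales the weights internally.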

C MODEL HYPERPARAMETERS

In this section, we give more details about the hyperparameters of our models with/without CL. The full set of hyperparameters can be seen in Table 5 .

D BASELINE DETAILS

Transformer-based methods. For SymbolicGPT (Valipour et al., 2021) and NeSymReS (Biggio et al., 2021), we use the standard hyperparameters provided in the open-source implementations of these methods (footnotes 5 and 6). Some parameters, such as the input dimension and the vocabulary size, are manually adjusted as the dataset changes. Note that we do not change model hyperparameters that may affect performance.

E EXPLORING PHYSICS LAWS

Symbolic regression is used by many research communities to advance numerous scientific fields, e.g., physics (Wu & Tegmark, 2019; Udrescu & Tegmark, 2020), chemistry (Batra et al., 2021), and materials science (Sun et al., 2019; Wang et al., 2019; Weng et al., 2020; Loftis et al., 2020). Our method shows great potential for recovering some of the laws of physics. As reported in Table 8, we successfully recover all the two-variable expressions in the Feynman benchmark test sets. Since π is not included in the dictionary during training, it appears as a decimal value in the recovered expressions. Symbolic regression algorithms are getting better, and we expect our work to be useful for future data-driven symbolic regression methods. We look forward to the day when a computer, just like Kepler, helps physicists discover a useful and hitherto unknown physics expression through symbolic regression.



1 https://github.com/mojivalipour/symbolicgpt
2 https://github.com/SymposiumOrganization/NeuralSymbolicRegressionThatScales
3 https://github.com/brendenpetersen/deep-symbolicregression
4 https://gplearn.readthedocs.io/en/stable/
5 https://github.com/mojivalipour/symbolicgpt
6 https://github.com/SymposiumOrganization/NeuralSymbolicRegressionThatScales
7 https://github.com/brendenpetersen/deep-symbolicregression
8 https://gplearn.readthedocs.io/en/stable/



• SymbolicGPT (Valipour et al., 2021): A recent transformer-based language model for symbolic regression. We use the open-source implementation provided by the authors (footnote 1).
• Neural Symbolic Regression that Scales (NeSymReS) (Biggio et al., 2021): A recently proposed transformer-based symbolic regression model trained on large-scale data. We use the open-source implementation provided by the authors (footnote 2).

Figure 4: R^2 vs. Gaussian noise in the data. Error bars represent the standard error. Inference on the SSDNC benchmark.

Figure 5: R^2 for different numbers of BFGS restarts in the constant optimization stage. Inference on the SSDNC benchmark.

Figure 6: Examples of model predictions using beam search with a beam size of 128. The shaded area represents the sampling range. For all functions, x and y were sampled from [-10, 10]. 'GT' denotes the ground truth and 'Pred' the model prediction. Specifically, (a): GT:



Table 4: Benchmark functions used in our experiments. Input variables are denoted by x and/or y. We restrict ourselves to univariate and bivariate functions.

Table 8: Expressions recovered by our method on each Feynman benchmark. U(a, b, c) denotes c random points uniformly sampled between a and b for each input variable. R^2 values are rounded to 5 decimals.

ACKNOWLEDGMENTS

This work was supported by the Key-Area Research and Development Program of Guangdong Province (No.2019B010107001), and the Key Research Program of the Chinese Academy of Sciences (No.XDPB22).


Deep Symbolic Optimization (DSO). For DSO (Mundhenk et al., 2021), we use the standard parameter settings in the open-source implementation (footnote 7). DSO depends on two main hyperparameters, namely the entropy coefficient λ_H and the risk factor ϵ, as well as hyperparameters related to its hybrid genetic programming component. λ_H weights a bonus, proportional to the entropy of the sampled expression, that is added to the main objective. The final objective is defined in terms of the (1 - ϵ) quantile of the distribution of rewards under the current policy. The chosen hyperparameters, taken from the open-source implementation, are listed in Table 6.

Genetic Programming (GP). For GP-based methods, we use the SymbolicRegressor class of the open-source Python library gplearn (footnote 8). Our choices for the hyperparameters are mostly the default values indicated in the library documentation. The detailed settings are reported in Table 7.
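The roles of ϵ and λ_H in the DSO paragraph above can be made concrete with a small sketch. It computes the empirical (1 - ϵ) quantile of a batch of rewards, keeps only the samples at or above it, and evaluates an entropy bonus weighted by λ_H. This illustrates the general idea only: it is not the DSO code base, and the function names, the reward values, and the values of ϵ and λ_H are illustrative assumptions rather than the settings of Table 6.

```python
import math


def risk_seeking_threshold(rewards, epsilon):
    """Empirical (1 - epsilon) quantile: smallest reward among the top epsilon fraction."""
    k = max(1, math.ceil(epsilon * len(rewards)))  # number of elite samples
    return sorted(rewards, reverse=True)[k - 1]


def elite_samples(rewards, epsilon):
    """Indices of samples at or above the threshold, with their margin over it."""
    threshold = risk_seeking_threshold(rewards, epsilon)
    return [(i, r - threshold) for i, r in enumerate(rewards) if r >= threshold]


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_bonus(token_distributions, lambda_h):
    """lambda_h times the summed entropy of the per-token distributions."""
    return lambda_h * sum(entropy(p) for p in token_distributions)


# Hypothetical rewards for a batch of eight sampled expressions.
rewards = [0.1, 0.9, 0.4, 0.8, 0.3, 0.2, 0.7, 0.5]
epsilon = 0.25    # illustrative value only
lambda_h = 0.01   # illustrative value only

threshold = risk_seeking_threshold(rewards, epsilon)
elite = elite_samples(rewards, epsilon)
bonus = entropy_bonus([[0.5, 0.5], [0.25, 0.75]], lambda_h)
print("threshold:", threshold)
print("elite sample indices:", [i for i, _ in elite])
print("entropy bonus:", bonus)
```

Only the elite samples contribute to the policy update, which pushes the policy toward its best-case rather than its average performance, while the entropy bonus discourages it from collapsing onto a few expressions too early.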

