DEEP GENERATIVE SYMBOLIC REGRESSION

Abstract

Symbolic regression (SR) aims to discover concise closed-form mathematical equations from data, a task fundamental to scientific discovery. However, the problem is highly challenging because closed-form equations lie in a complex combinatorial search space. Existing methods, ranging from heuristic search to reinforcement learning, fail to scale with the number of input variables. We make the observation that closed-form equations often have structural characteristics and invariances (e.g., the commutative law) that could be further exploited to build more effective symbolic regression solutions. Motivated by this observation, our key contribution is to leverage pre-trained deep generative models to capture the intrinsic regularities of equations, thereby providing a solid foundation for subsequent optimization steps. We show that our novel formalism unifies several prominent approaches of symbolic regression and offers a new perspective to justify and improve on the previous ad hoc designs, such as the usage of cross-entropy loss during pre-training. Specifically, we propose an instantiation of our framework, Deep Generative Symbolic Regression (DGSR). In our experiments, we show that DGSR achieves a higher recovery rate of true equations in the setting of a larger number of input variables, and it is more computationally efficient at inference time than state-of-the-art RL symbolic regression solutions.

1. INTRODUCTION

Symbolic regression (SR) aims to find a concise equation f that best fits a given dataset D by searching the space of mathematical equations. The identified equations have concise closed-form expressions, so they are interpretable to human experts and amenable to further mathematical analysis (Augusto & Barbosa, 2000).

Fundamentally, two limitations prevent the wider ML community from adopting SR as a standard tool for supervised learning: SR is only applicable to problems with few variables (e.g., three), and it is very computationally intensive. This is because the space of equations grows exponentially with the equation length and has both discrete (×, +, sin) and continuous (2.5) components. Although researchers have attempted to solve SR by heuristic search (Augusto & Barbosa, 2000; Schmidt & Lipson, 2009; Stinstra et al., 2008; Udrescu & Tegmark, 2020), reinforcement learning (Petersen et al., 2020; Tang et al., 2020), and deep learning with pre-training (Biggio et al., 2021; Kamienny et al., 2022), achieving both high scalability in the number of input variables and computational efficiency is still an open problem.

We believe that learning a good representation of equations is the key to solving these challenges. Equations are complex objects with many unique invariance structures that could guide the search. Simple equivalence rules (such as commutativity) rapidly compound with multiple variables or terms, giving rise to complex structures with many equation invariances. Importantly, these equation equivalence properties have not been adequately reflected in the representations used by existing SR methods. First, existing heuristic search methods represent equations as expression trees (Jin et al., 2019), which can only capture commutativity (x1 x2 = x2 x1) via swapping the leaves of a binary operator (×, +). However, trees cannot capture many other properties such as distributivity (x1 x2 + x1 x3 = x1 (x2 + x3)).
Second, existing pre-trained encoder-decoder methods represent equations as sequences of tokens, e.g., x1 + x2 is tokenized as ("x1", "+", "x2"), just as sentences of words in natural language (Valipour et al., 2021). The sequence representation cannot encode any invariance structure; e.g., x1 + x2 and x2 + x1 will be deemed two different sequences. Finally, existing RL methods for symbolic regression do not learn representations of equations. For each dataset, these methods learn a dataset-specific policy network to generate equations that fit the data well; hence they need to re-train the policy from scratch each time a new dataset D is observed, which is computationally intensive.

On the quest to apply symbolic regression to a larger number of input variables, we investigate a deep conditional generative framework that attempts to fulfill the following desired properties: (P1) Learn equation invariances: the learnt equation representations should encode both the equation equivalence invariances and the invariances of their associated datasets. (P2) Efficient inference: performing gradient refinement of the generative model should be computationally efficient at inference time. (P3) Generalize to unseen variables: the model can generalize to unseen input variables of a higher dimension than those seen during pre-training.

To fulfill P1-P3, we propose the Deep Generative Symbolic Regression (DGSR) framework. Rather than representing equations as trees or sequences, DGSR learns representations of equations with deep generative models, which have excelled at modelling complex structures such as images and molecular graphs. Specifically, DGSR leverages pre-trained conditional generative models that correctly encode the equation invariances. The equation representations are learned using a deep generative model that is composed of invariant neural networks and trained using an end-to-end loss function inspired by Bayesian inference.
Crucially, this end-to-end loss enables both pre-training and gradient refinement of the pre-trained model at inference time, allowing the model to be more computationally efficient (P2) and to generalize to unseen input variables (P3). Contributions. Our contributions are two-fold: (1) In Section 3, we outline the DGSR framework, which can perform symbolic regression on a larger number of input variables whilst incurring less inference-time computational cost than RL techniques (P2). This is achieved by learning better representations of equations that are aware of the various equation invariance structures (P1). (2) In Section 5.1, we benchmark DGSR against existing symbolic regression approaches on standard benchmark problem sets and on more challenging problem sets that have a larger number of input variables. Specifically, we demonstrate that DGSR has a higher recovery rate of the true underlying equation in the setting of a larger number of input variables, whilst using less inference compute than RL techniques, and that DGSR achieves a significant, state-of-the-art true equation recovery rate on the SRBench ground-truth datasets compared to the SRBench baselines. We also gain insight into how DGSR works in Section 5.2: how it can discover the underlying true equation even when pre-trained on datasets where the number of input variables is less than the number seen at inference time (P3), how it captures equation equivalences (P1), and how it correctly encodes the dataset D to start from a good equation distribution, leading to efficient inference (P2).

2. PROBLEM FORMALISM

The standard task of a symbolic regression method is to return a closed-form equation f that best fits a given dataset D = {(X_i, y_i)}_{i=1}^n, i.e., y_i ≈ f(X_i) for all samples i ∈ [1 : n], where y_i ∈ R, X_i ∈ R^d, and d is the number of input variables, i.e., X = [x_1, . . . , x_d]. Closed-form equations. The equations that we seek to discover are closed-form, i.e., they can be expressed as a finite sequence of operators (×, +, -, . . . ), input variables (x_1, x_2, . . . ) and numeric constants (3.141, 2.71, . . . ) (Borwein et al., 2013). We use f to denote the functional form of an equation, which can contain numeric constant placeholders β to replace numeric constants, e.g., f(x) = β_0 x + sin(x + β_1). To discover the full equation, we need to infer the functional form and then estimate the unknown constants β, if any are present (Petersen et al., 2020). Equations can also be represented as a sequence of discrete tokens in prefix notation, f = [f_1, . . . , f_{|f|}] (Petersen et al., 2020), where each token is chosen from a library of possible tokens, e.g., [+, -, ÷, ×, x_1, exp, log, sin, cos]. The token sequence f can then be instantiated into an equation and evaluated on an input X. In existing works, the numeric constant placeholder tokens are learnt through a further secondary non-linear optimization step using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm (Fletcher, 2013; Biggio et al., 2021). To avoid extensive notation, we define evaluating f to also include inferring any placeholder constants using BFGS.
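To make the prefix-token representation and the secondary constant-fitting step concrete, the following is a minimal sketch (not the paper's implementation): a toy prefix evaluator over a small hypothetical token library, with the β placeholders fitted by BFGS via `scipy.optimize.minimize`, as in the works cited above.

```python
import numpy as np
from scipy.optimize import minimize

def eval_prefix(tokens, x, betas):
    """Evaluate a prefix token sequence on inputs x.

    'beta' placeholder tokens consume entries of `betas` in reading order.
    Only a tiny illustrative library is supported here.
    """
    it, b = iter(tokens), iter(betas)
    def rec():
        t = next(it)
        if t == "+":    return rec() + rec()
        if t == "*":    return rec() * rec()
        if t == "sin":  return np.sin(rec())
        if t == "x1":   return x
        if t == "beta": return next(b)
        raise ValueError(f"unknown token: {t}")
    return rec()

# Functional form f(x) = b0 * x + sin(x + b1), in prefix notation.
tokens = ["+", "*", "beta", "x1", "sin", "+", "x1", "beta"]

# Data generated by the true equation with constants (1.5, 0.7).
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 100)
y = 1.5 * x + np.sin(x + 0.7)

# Secondary optimization step: fit the placeholder constants with BFGS.
def mse(betas):
    return np.mean((eval_prefix(tokens, x, betas) - y) ** 2)

res = minimize(mse, x0=np.ones(2), method="BFGS")
print(res.x)  # recovered constants, close to (1.5, 0.7)
```

Note that only the constants are optimized here; discovering the functional form itself is the combinatorial search problem that SR methods address.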
[Figure 1: The generative view of SR, relating the dataset D, the latent equation f, and the posterior p(f|D).]

A generative view of SR. We provide a probabilistic interpretation of the data generating process in Figure 1, where we treat the true equation f as a (latent) random variable following a prior distribution p(f). A dataset can therefore be interpreted as an evaluation of f on sampled points X ∼ 𝒳, i.e., D = {(X_i, f(X_i))}_{i=1}^n, f ∼ p(f). Crucially, SR can be seen as performing probabilistic inference on the posterior distribution p(f|D). Therefore, at inference time it is natural to formulate SR as a maximum a posteriori (MAP) estimation problem, i.e., f* = arg max_f p(f|D).
Thus, SR can be solved by: (1) estimating the posterior p(f|D) conditioned on the observations D with a pre-trained deep conditional generative model p_θ(f|D) with model parameters θ; (2) further refining this posterior at inference time; and (3) finding the maximum a posteriori (MAP) estimate via a discrete search method.
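Step (3) can be illustrated with a stand-alone sketch (ours, not the paper's code): approximate the MAP equation by scoring sampled candidate functional forms against the observed dataset and selecting the best-fitting one. The candidate set below is a hypothetical stand-in for samples drawn from a pre-trained p_θ(f|D).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, 200)
y = np.sin(X) + X                      # latent true equation f(x) = sin(x) + x

# Stand-ins for candidate functional forms sampled from p_theta(f|D).
candidates = {
    "sin(x)+x": lambda x: np.sin(x) + x,
    "cos(x)+x": lambda x: np.cos(x) + x,
    "x**2":     lambda x: x ** 2,
}

def nmse(y_hat, y):
    """Normalized mean squared error, the fit score used for selection."""
    return np.mean((y_hat - y) ** 2) / np.var(y)

# Discrete search step: pick the candidate with the best fit to D.
best = min(candidates, key=lambda k: nmse(candidates[k](X), y))
print(best)  # -> sin(x)+x
```

With a likelihood that decreases in the fit error and a broad prior, selecting the minimum-NMSE candidate approximates the MAP estimate over the sampled set.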

3. DEEP GENERATIVE SR FRAMEWORK

We now outline the Deep Generative SR (DGSR) framework. The key idea is to use equation- and dataset-invariance-aware neural networks combined with an end-to-end loss inspired by Bayesian inference. As we shall see in the following, this allows us to learn the equation and dataset invariances (P1) and to pre-train on a pre-training set and then gradient refine on the observed dataset D at inference time, leading to a more efficient inference procedure (P2) and to generalization to unseen input variables (P3). Principally, the framework consists of two steps: (1) a Pre-training step, Section 3.1, where an equation- and dataset-invariance-aware encoder-decoder model learns the posterior distribution p_θ(f|D) with parameters θ by pre-training, and (2) an Inference step, Section 3.2, that uses an optimization method to gradient refine this posterior and a discrete search method to find an approximate maximum of this posterior. In the following, we justify each component in turn, stating the desired properties it must satisfy and providing a suitable instantiation of each component in the overall framework.

3.1. PRE-TRAINING STEP

Learning invariances in the dataset. We seek to learn the invariances of datasets (P1). Intuitively, a dataset that is defined by a latent (unobserved) equation f should have a representation that is invariant to the number of samples n in the dataset D. To achieve this, we specify that the encoder-decoder architecture of the conditional generative model p_θ(f|D) should satisfy the following two properties: (1) it has an encoding function h that is permutation invariant over the encoded input-output pairs {(X_i, y_i)}_{i=1}^n from g and can handle a varying number of samples n, i.e., V = h({g(X_i, y_i)}_{i=1}^n) (Lee et al., 2019), where g : 𝒳^d → 𝒵^d is an encoding function over the individual input variables [x_{i1}, . . . , x_{id}] = X_i of the points in 𝒳; and (2) it has an autoregressive decoder that decodes the latent vector V into an output probability over equations f, which allows sampling of equations. Any suitable encoder-decoder model that satisfies these two properties (e.g., Transformers (Biggio et al., 2021), RNNs (Sutskever et al., 2014)) can be used. The conditional generative model has parameters θ = {ζ, ϕ}, where the encoder has parameters ζ and the decoder has parameters ϕ, as detailed in Figure 2. Specifically, we instantiate DGSR with a set transformer (Lee et al., 2019) encoder that satisfies (1) and a specific transformer decoder that satisfies (2). This decoder leverages the hierarchical tree state representation during decoding (Petersen et al., 2020): the encoder encodes a dataset D into a latent vector V ∈ R^w, which is fed into a transformer decoder (Vaswani et al., 2017). The decoder generates each token of the equation f autoregressively, that is, it samples from p(f_i | f_{1:(i-1)}; θ; D).
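The permutation-invariance requirement (1) can be checked with a minimal numerical sketch, assuming stand-in functions: a random per-sample embedding plays the role of g, and symmetric mean pooling plays the role of h (in DGSR itself, h is a set transformer).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))            # stand-in weights for g: (x1, x2, y) -> R^8

def g(points):
    """Per-sample embedding of input-output pairs (stand-in for the real g)."""
    return np.tanh(points @ W)

def h(embedded):
    """Symmetric pooling over the n samples: V does not depend on their order."""
    return embedded.mean(axis=0)       # V in R^8, independent of n

D = rng.normal(size=(100, 3))          # dataset: each row is (x1, x2, y)
V = h(g(D))
V_perm = h(g(D[rng.permutation(100)])) # same dataset, samples shuffled

assert np.allclose(V, V_perm)          # property (1): permutation invariance
assert h(g(D[:50])).shape == V.shape   # same latent size for a different n
```

Any symmetric aggregation (mean, sum, attention pooling) preserves this property; the set transformer additionally models interactions between samples before pooling.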
During sampling of each token, the existing generated tokens f_{1:(i-1)} are processed into their hierarchical tree state representation (Petersen et al., 2020) and encoded with an embedding into an additional latent vector, which is concatenated to the encoder latent vector, forming a total latent vector U ∈ R^{w+d_s} to be used in decoding, where d_s is the additional state dimension. We detail this in Appendix B, and show that other architectures can be used in Appendix U. We pre-train on a pre-training set consisting of m datasets {D^(j)}_{j=1}^m, where each D^(j) is defined by sampling f^(j) ∼ p(f) from a given prior p(f) (see Appendix J on how to specify p(f)). Then, to construct each dataset, we evaluate f^(j) on n^(j) random points in 𝒳, i.e., D^(j) = {(f^(j)(X_i^(j)), X_i^(j))}_{i=1}^{n^(j)}.

[Figure 2: DGSR pre-training architecture, with (1) an encoding architecture that is permutation invariant across the number of samples n in the observed dataset D = {(X_i, y_i)}_{i=1}^n, and (2) a Bayesian-inspired end-to-end NMSE loss function (Eq. 1) from the encoded dataset D to the outputs of the predicted equations, i.e., NMSE(f(X), y). The highlighted boundaries show the subset of pre-trained encoder-decoder methods and RL methods.]
Loss function. We seek to learn invariances of equations (P1). Intuitively, this is achieved by using a loss under which different forms of the same equation have the same loss. To learn both the invariances of equations and datasets, we require an end-to-end loss from the observed dataset D to the predicted outputs of the generated equations, to train the conditional generator p_θ(f|D). To achieve this, we maximize the likelihood p(D|f) under a Monte Carlo scheme to incorporate the prior p(f) (Harrison et al., 2018). A natural and common assumption on the observed datasets is that they are sampled with Gaussian i.i.d. noise (Murphy, 2012). Therefore, to maximize this augmented likelihood we minimize the normalized mean squared error (NMSE) loss over a mini-batch of t datasets, where for each dataset D^{(j)} we sample k equations from the conditional generator p_θ(f|D):

$$\mathcal{L}(\theta) = \frac{1}{t}\sum_{j=1}^{t} \frac{1}{k}\sum_{c=1}^{k} \frac{1}{\sigma_{y^{(j)}}} \frac{1}{n^{(j)}} \sum_{i=1}^{n^{(j)}} \left( y_i^{(j)} - f^{(c)}\big(X_i^{(j)}\big) \right)^2, \qquad f^{(c)} \sim p_\theta\big(f \mid D^{(j)}\big) \quad (1)$$

where σ_{y^{(j)}} is the standard deviation of the outputs y^{(j)}.
We wish to minimize this end-to-end NMSE loss, Equation 1; however, turning a generated sequence of tokens into an equation f is a non-differentiable step. Therefore, we require an end-to-end non-differentiable optimization algorithm. DGSR is agnostic to the exact choice of optimization algorithm used to optimize this loss, and any relevant end-to-end non-differentiable optimization algorithm can be used. Suitable methods are policy gradient approaches, including policy gradients combined with genetic programming (Petersen et al., 2020; Mundhenk et al., 2021). To use these, we reformulate the loss of Equation 1 for each equation into a reward function R(θ) = 1/(1 + L(θ)), which is optimizable using policy gradients (Petersen et al., 2020). Both the encoder and decoder are trained during pre-training, i.e., optimizing the parameters θ, as further illustrated in Figure 2. [Table 1: comparison of deep SR methods — RL [1,2,3]: loss NMSE(f(X_i), y_i), model p_θ(f), not pre-trained, must train from scratch, cannot gradient refine; Encoder [4,5,6,7]: loss CE(f, f*), model p_θ(f|D), pre-trained, conditions on D, cannot gradient refine; DGSR (this work): loss Eq. 1, NMSE(f(X_i), y_i), model p_θ(f|D), pre-trained, conditions on D, can gradient refine.] At inference time, we refine the conditional distribution to encourage sampled equations drawn from p_θ(f|D) to have a high probability of generating the true equation f*. We achieve this by refining only the decoder weights ϕ (Figure 2) and keeping the encoder weights fixed. Furthermore, we show empirically that other optimization algorithms can be used, with an ablation in Section 5.2 and Appendix E. Finally, our goal is to find the single best-fitting equation for the observed dataset D. We achieve this by using a discrete search method to find the maximum a posteriori estimate of the refined posterior p_θ(f|D): a simple Monte Carlo scheme that samples k equations, scores each by its NMSE fit, and returns the equation with the best fit. In principle, other discrete search methods can be used as well, e.g., beam search (Steinbiss et al., 1994).
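As a concrete illustration, the NMSE loss term of Equation 1 and its reward reformulation R = 1/(1 + L) can be sketched as follows (a minimal NumPy sketch of the loss arithmetic only; the function names and toy candidate equations are our own, not the released implementation):

```python
import numpy as np

def nmse(y, y_pred):
    # Normalized mean squared error: MSE scaled by the std of the targets,
    # matching the 1/sigma_y normalization in Equation 1.
    return np.mean((y - y_pred) ** 2) / np.std(y)

def reward(y, y_pred):
    # Bounded reward used for policy-gradient optimization: R = 1 / (1 + NMSE).
    return 1.0 / (1.0 + nmse(y, y_pred))

# Toy dataset D = {(X_i, y_i)} generated by a true equation f*(x1, x2) = x1 * x2.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 2.0, size=(20, 2))
y = X[:, 0] * X[:, 1]

# Two sampled candidate equations: one equivalent to f*, one not.
f_good = lambda X: X[:, 1] * X[:, 0]   # commutated form of f*, same reward
f_bad = lambda X: X[:, 0] + X[:, 1]

assert reward(y, f_good(X)) == 1.0     # zero NMSE gives the maximal reward
assert reward(y, f_bad(X)) < 1.0
```

Note that the commutated form receives exactly the maximal reward: the end-to-end loss cannot distinguish equivalent equation forms, which is the invariance property (P1) the loss is designed to exploit.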

4. RELATED WORK

In the following we review existing deep SR approaches and summarize their main differences in Table 1. We provide an extended discussion of additional related works, including heuristic-based methods and methods that use a prior, in Appendix F. We illustrate in Figure 2 that RL and pre-trained encoder-decoder methods can be seen as ad hoc subsets of the DGSR framework. RL methods. These works use a policy network, typically implemented with RNNs, to output a sequence of tokens (actions) to form an equation. The output equation obtains a reward based on some goodness-of-fit metric (e.g., RMSE). Since the tokens are discrete, the method uses policy gradients to train the policy network. Most existing works focus on improving the pioneering policy gradient approach for SR (Petersen et al., 2020; Costa et al., 2020; Landajuela et al., 2021); however, the policy network is randomly initialized and tends to output ill-informed equations at the beginning, which slows down the procedure. Furthermore, the policy network needs to be re-trained each time a new dataset D is available. Hybrid RL and GP methods. These methods combine RL with genetic programming (GP). Mundhenk et al. (2021) use a policy network to seed the starting population of a GP algorithm, instead of starting with a random population as in standard GP. Other works use RL to adjust the probabilities of genetic operations (Such et al., 2017; Chang et al., 2018; Chen et al., 2018; Mundhenk et al., 2021; Chen et al., 2020). Similarly, these methods cannot improve by learning from other datasets and have to re-train their models from scratch, making inference slow at test time. Pre-trained encoder-decoder methods. Unlike RL, these methods pre-train an encoder-decoder neural network to model p(f|D) using a curated dataset (Biggio et al., 2021). Specifically, Valipour et al. (2021) propose to use standard language models, e.g., GPT.
At inference time, these methods sample from p_θ(f|D) using the pre-trained network, thereby achieving efficient inference. These methods have two key limitations: (1) they use a cross-entropy (CE) loss for pre-training, and (2) they cannot gradient refine their model, leading to sub-optimal solutions. First (1), cross-entropy, whilst useful for comparing categorical distributions, does not account for equations that are mathematically equivalent. Indeed, prior works have observed this: Lample & Charton (2019) found the "surprising" and "very intriguing" result that sampling multiple equations from their pre-trained encoder-decoder model, trained with a CE loss, yielded some equations that are mathematically equivalent, and the pioneering work of d'Ascoli et al. (2022) has shown this behavior as well. In contrast, our proposed end-to-end NMSE loss, Eq. 1, assigns the same loss value to different equation forms that are mathematically equivalent; it is therefore a natural and principled way to incorporate the equation equivalence property inherent to symbolic regression. Second (2), DGSR is, to the best of our knowledge, the first SR method able to perform gradient refinement of a pre-trained encoder-decoder model, using our end-to-end NMSE loss, Eq. 1, to update the weights of the decoder at inference time. We note that there exist other, non-gradient refinement approaches that cannot update the decoder's weights. These consist of: (1) optimizing the constants in the generated equation form with a secondary optimization step (commonly using the BFGS algorithm) (Petersen et al., 2020; Biggio et al., 2021), and (2) using the MSE of the predicted equation(s) to guide a beam search sampler (d'Ascoli et al., 2022; Kamienny et al., 2022).
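The point about equivalent forms can be made concrete with a small check (our own sketch using sympy; the prefix token sequences in the comments are illustrative, not the paper's actual vocabulary): distributively equivalent expressions are distinct as token sequences, so a cross-entropy loss against one reference sequence penalizes the other, while their NMSE against the data is identical.

```python
import numpy as np
import sympy as sp

x1, x2, x3 = sp.symbols("x1 x2 x3")

# Two mathematically equivalent equations with different token sequences.
f_a = x1 * x2 + x1 * x3   # prefix tokens: [+, *, x1, x2, *, x1, x3]
f_b = x1 * (x2 + x3)      # prefix tokens: [*, x1, +, x2, x3]

# Symbolically identical ...
assert sp.simplify(f_a - f_b) == 0

# ... and therefore identical under the end-to-end NMSE loss.
g_a = sp.lambdify((x1, x2, x3), f_a, "numpy")
g_b = sp.lambdify((x1, x2, x3), f_b, "numpy")
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 2.0, size=(50, 3))
y = g_a(X[:, 0], X[:, 1], X[:, 2])
nmse = lambda y, p: np.mean((y - p) ** 2) / np.std(y)
assert np.isclose(nmse(y, g_a(X[:, 0], X[:, 1], X[:, 2])),
                  nmse(y, g_b(X[:, 0], X[:, 1], X[:, 2])))
```

A CE loss against the reference sequence of f_a would assign f_b a high loss despite the two being the same equation, which is exactly the mismatch the end-to-end loss avoids.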
As a result, to generalize to equations with a greater number of input variables, pre-trained encoder-decoder methods require large pre-training datasets (e.g., millions of datasets (Biggio et al., 2021)) and even larger generative models (e.g., ∼100 million parameters (Kamienny et al., 2022)).

5. EXPERIMENTS AND EVALUATION

We evaluate DGSR on a set of common equations in the natural sciences from the standard SR benchmark problem sets and on a problem set with a large number of input variables (d = 12). Benchmark algorithms. We compare against Neural Guided Genetic Programming (NGGP) (Mundhenk et al., 2021), the current state-of-the-art for SR, superseding DSR (Petersen et al., 2020). We also compare with genetic programming (GP) (Fortin et al., 2012), which has long been an industry standard, and with Neural Symbolic Regression that Scales (NESYMRES), a pre-trained encoder-decoder method. We note that NESYMRES was only pre-trained on a large three-input-variable dataset, and is therefore only included on problem sets with d ≤ 3. Further details of model selection, hyperparameters and implementation are in Appendix G. Dataset generation. Each symbolic regression "problem set" is defined by: a set of ω unique ground truth equations, where each equation f* has d input variables; a domain X over which to sample 10d input variable points (unless otherwise specified); and a set of allowable tokens. For each equation f*, an inference-time training set and test set are sampled independently from the defined problem set domain, each of 10d input-output samples, to form a dataset D = {X_i, f*(X_i)}_{i=1}^{10d}. The training dataset is used to optimize the loss at inference time; the test set is used only to evaluate the best equations found at the end of inference. Inference runs for 2 million equation evaluations, unless the true equation is found earlier, which stops the procedure early. To construct the pre-training set {D^{(j)}}_{j=1}^m, we use the concise equation generation method of Lample & Charton (2019). This uses the library of tokens for a particular problem set and is detailed further in Appendix J, with details of training and how to specify p(f). Benchmark problem sets.
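The dataset construction above can be sketched as follows (our own minimal sketch; the uniform domain and the example equation are placeholders for a problem set's actual specification):

```python
import numpy as np

def make_dataset(f_star, d, low, high, seed=0):
    # Sample 10d input points from the problem set's domain and label them
    # with the ground-truth equation f*, giving D = {X_i, f*(X_i)}.
    rng = np.random.default_rng(seed)
    X = rng.uniform(low, high, size=(10 * d, d))
    return X, f_star(X)

# Example: a two-variable ground-truth equation on the domain [1, 2]^2.
f_star = lambda X: 1.5 * X[:, 0] * X[:, 1]

# Training and test sets are sampled independently from the same domain.
X_train, y_train = make_dataset(f_star, d=2, low=1.0, high=2.0, seed=0)
X_test, y_test = make_dataset(f_star, d=2, low=1.0, high=2.0, seed=1)

assert X_train.shape == (20, 2) and y_train.shape == (20,)
```

The 10d scaling means the number of observed samples grows only linearly with the number of input variables, keeping each dataset small.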
We achieve similar performance on the standard SR benchmark problem sets (Appendix H) and therefore seek to evaluate DGSR on more challenging SR benchmarks with more variables (d ≥ 2), whilst benchmarking on realistic equations that experts wish to discover. We use equations from the Feynman SR database (Udrescu & Tegmark, 2020) to provide more challenging equations with a larger number of input variables. These are derived from the Feynman Lectures on Physics (Feynman et al., 1965). We randomly selected a subset of ω = 7 equations with two input variables (Feynman d = 2) and a further, more challenging, subset of ω = 8 equations with five input variables (Feynman d = 5). Additionally, we sample an additional Feynman dataset of ω = 32 equations with d = {3, 4, 6, 7, 8, 9} input variables (Additional Feynman). We also benchmark on SRBench (La Cava et al., 2021), which includes a further ω = 133 equations, comprising ω = 119 Feynman equations and ω = 14 ODE-Strogatz (Strogatz, 2018) equations. Finally, we consider a more challenging problem set of ω = 7 synthetically generated equations with d = 12 input variables (Synthetic d = 12). We detail all problem sets in Appendix I. Evaluation. We evaluate against the standard symbolic regression metric of recovery rate (A_Rec %): the percentage of runs in which the true equation f* was found, over κ random seed runs (Petersen et al., 2020). This uses the strictest definition of symbolic equivalence, verified by a computer algebraic system (Meurer et al., 2017). We also evaluate the average number of equation evaluations γ until the true equation f* is found. We use this metric as a proxy for computational complexity across the benchmark algorithms, as testing many generated equations is a bottleneck in SR (Biggio et al., 2021; Petersen et al., 2020), discussed further in Appendix K. Unless noted otherwise we follow the experimental setup of Petersen et al.
(2020) and use their complexity definition, also detailed in Appendix K. We run all problems κ = 10 times using a different random seed for each run (unless otherwise specified), and pre-train with 100K generated equations for each benchmark problem set.
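Recovery is judged by strict symbolic equivalence via a computer algebra system; a minimal sketch of such a check with sympy (our own illustration of the criterion, not the benchmark's exact code):

```python
import sympy as sp

def is_recovered(f_pred, f_true):
    # Strictest definition of recovery: the difference of the two
    # expressions must simplify to exactly zero.
    return sp.simplify(sp.expand(f_pred - f_true)) == 0

x1, x2 = sp.symbols("x1 x2")
f_true = sp.Rational(3, 2) * x1 * x2

assert is_recovered(x1 * (x2 + x2 / 2), f_true)   # an equivalent form counts
assert not is_recovered(x1 * x2, f_true)          # a close fit does not
```

This is a far stricter criterion than a low test error, since a well-fitting but structurally different equation scores zero recovery.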

5.1. MAIN RESULTS

The average recovery rate (A_Rec %) and the average number of equation evaluations for the benchmark problem sets are tabulated in Table 2. DGSR achieves a higher recovery rate with more input variables, specifically on the Feynman d = 5, Additional Feynman and Synthetic d = 12 problem sets. We note that NESYMRES achieves the lowest number of equation evaluations; however, it suffers from a significantly lower recovery rate. Standard benchmark problem sets. DGSR is state-of-the-art on SRBench (La Cava et al., 2021) for true equation recovery on the ground truth unique equations, with a significant increase in true equation recovery to 63.25%, compared to 52.65% for the previous best benchmark method in SRBench (Appendix S). DGSR also achieves a significant new state-of-the-art recovery rate on the R rationals (Krawiec & Pawlak, 2013) problem set, with a 10% increase in recovery rate (Appendix H). It also matches state-of-the-art (NGGP) performance on the standard benchmark problem sets with a small number of input variables, namely the Nguyen (Uy et al., 2011) and Livermore (Mundhenk et al., 2021) problem sets, detailed in Appendix H.

5.2. INSIGHT AND UNDERSTANDING OF HOW DGSR WORKS

In this section we seek to gain further insight into how DGSR achieves a higher recovery rate with a larger number of input variables, whilst requiring fewer equation evaluations than RL techniques. In the following we seek to understand whether DGSR is able to: capture equation equivalences (P1), perform computationally efficient inference at refinement (P2), and generalize to unseen input variables of a higher dimension than those seen during pre-training (P3). Can DGSR capture the equation equivalences? (P1). To explore whether DGSR is learning these equation equivalences, we turn off early stopping and count the number of unique equations discovered that are equivalent to the ground truth f*, as shown in Figure 3 (a). Empirically we observe that DGSR is able to correctly capture equation equivalences and exploits these to generate many unique equivalent, yet true, equations; 10 of these are tabulated in Table 3 [equivalent generated forms of the true equation f* = (3/2)x_1x_2]. We note that the RL method, NGGP, is also able to find the true equation. Furthermore, we highlight that all these equations are equivalent, achieving zero test NMSE, and can be simplified into f*, as detailed further in Appendix M. Moreover, DGSR is able to learn how to generate valid equations more readily.
This is important as an SR method needs an equation to be valid for it to be evaluated, that is, one where the generated tokens can be instantiated into an equation f and evaluated (e.g., log(-2x_1^2) is not valid). Figure 3 (b) shows that DGSR has learnt how to generate valid equations that also have a high probability of containing the true equation f*, whereas the RL method, NGGP, generates mostly invalid equations. We note that the pre-trained encoder-decoder method, NESYMRES, generates almost perfectly valid equations; however, it struggles to produce the true equation f* on most problems, as seen in Table 2. We analyze some of the most challenging equations to recover, which all methods failed to find. We observe that DGSR can still find good-fitting equations that are concise, i.e., having a low test NMSE with a low equation complexity. A few of these are shown with Pareto fronts in Figure 4 and in Appendix N. We highlight that a good SR method should determine concise and best-fitting equations; otherwise it is liable to over-fit with an equation that has many terms and high complexity, which fails to generalize well. Additionally, we analyze the challenging real-world setting of few data samples with noise, in Appendix L. Here we observe that DGSR is state-of-the-art, with a significant average recovery rate increase of at least 10% over NGGP in this setting, and we reason that DGSR is able to exploit the encoded prior p(f). Furthermore, we perform an ablation study to investigate how useful pre-training and an encoder are for recovering the true equation, in Table 4. This demonstrates empirically that for DGSR pre-training increases the recovery rate of the true equation, and highlights that the decoder also benefits from pre-training, implicitly modelling p(f) without the encoder.
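Validity of a sampled token sequence can be checked by attempting to instantiate and evaluate it, for example (our own sketch; the prefix-notation parser and the small token set are assumptions, not the paper's implementation):

```python
import math

BINARY = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
UNARY = {"log": math.log, "sin": math.sin}

def evaluate_prefix(tokens, x):
    # Recursively instantiate a prefix token sequence and evaluate it at x.
    tok = tokens.pop(0)
    if tok in BINARY:
        return BINARY[tok](evaluate_prefix(tokens, x), evaluate_prefix(tokens, x))
    if tok in UNARY:
        return UNARY[tok](evaluate_prefix(tokens, x))
    return x[tok] if tok in x else float(tok)

def is_valid(tokens, x):
    # A sampled sequence is valid if it instantiates and evaluates without
    # error; e.g., log of a negative argument makes the equation invalid.
    try:
        evaluate_prefix(list(tokens), x)
        return True
    except (ValueError, IndexError, ZeroDivisionError):
        return False

assert is_valid(["*", "x1", "x2"], {"x1": 2.0, "x2": 3.0})
assert not is_valid(["log", "*", "-2", "x1"], {"x1": 1.0})   # log(-2 x1)
```

Incomplete sequences (too few operands) surface as an IndexError and are likewise counted as invalid, mirroring how invalid samples cost an evaluation without ever being able to match f*.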
We also ablate pre-training DGSR with a cross-entropy loss on the output of the decoder instead, and observe that the end-to-end NMSE loss benefits the recovery rate. This supports our belief that, with our invariance-aware model and end-to-end loss, DGSR is able to learn the equation and dataset invariances (P1) and hence achieve a higher recovery rate. Can DGSR perform computationally efficient inference? (P2). We wish to understand whether our pre-trained conditional generative model p_θ(f|D) can encode the observed dataset D to start with a good initial distribution that is then further refined. We do this by evaluating the negative log-likelihood of the true equation f* during inference, as plotted in Figure 4 (c). We observe that DGSR finds the true equation f* in few equation evaluations, by correctly conditioning on the observed dataset D to start with a distribution that has a high probability of sampling f*, which is then further refined. This also indicates that DGSR has learnt a better representation of the true equation f* (P1), in which equivalent equation forms are inherently represented, compared to the pre-trained encoder-decoder method NESYMRES, which can only represent one equation form. In contrast, NGGP starts with a random initial equation distribution and eventually converges to a suitable distribution; however, this requires a greater number of equation evaluations. The pre-trained encoder-decoder method NESYMRES is unable to refine its equation distribution model. This leaves it with a constant probability of sampling the true equation f*, which in this problem is too low: f* was not discovered after the maximum of 2 million equations sampled. We note that in theory one could obtain the true equation f* via an uninformed random search; however, this would take a prohibitively large number of equation samples, and hence equation evaluations, to be feasible.
Furthermore, DGSR is capable of being used with other optimizers, as shown in Figure 5, where it uses the optimizer of Petersen et al. (2020). This is an ablated version of the optimizer from NGGP, i.e., a policy gradient method without the GP component. Empirically we demonstrate that, using this different optimizer, DGSR still achieves significantly greater computational inference efficiency than RL methods using the same optimizer: DGSR uses a total of γ = 29,356 average equation evaluations compared to γ = 151,231 average equation evaluations for the state-of-the-art RL method on the Feynman d = 2 problem set (Appendix R). Can DGSR generalize to unseen input variables of a higher dimension? (P3). We observe in Table 4 that even when DGSR is pre-trained with a smaller number of input variables than those seen at inference time, it is still able to learn a useful equation representation (P1) that aids generalization to the unseen input variables of a higher dimension. Here, we verify this by pre-training on a dataset with d = 2 and evaluating on the Feynman d = 5 problem set.

6. DISCUSSION AND FUTURE WORK

We hope this work provides a practical framework to advance deep symbolic regression methods, which are immensely useful in the natural sciences. We note that DGSR has the following limitations: (1) it may fail to discover highly complex equations, (2) optimizing the numeric constants can get stuck in local optima, and (3) it assumes all variables in f are observed. Each of these poses an exciting open challenge for future work, and they are discussed in detail in Appendix V. We envisage Deep Generative SR enabling future work that tackles even larger numbers of input variables and assists in automating scientific discovery. Doing so has the opportunity to accelerate the discovery of equations that describe the true underlying processes of nature and the world.
7 I V J D P 0 h b m U v 5 9 O O n N 9 n d Q 8 + s 2 F V 7 A m u e O A W p Q I F G z / x y + x F O Q s I V Z k j K r m P H y k u R U B Q z k p X d R J I Y 4 R E a k K 6 m H I V E e u n k h M w 6 0 k r f C i K h H 1 f W R P 3 d k a J Q y n H o 6 8 p 8 R z n r 5 e J / X j d R w b m X U h 4 n i n A 8 H R Q k z F K R l e d h 9 a k g W L G x J g g L q n e 1 8 B A J h J V O r a x D c G Z P n i e t k 6 p z W q 1 d 1 y r 1 i y K O E h z A I R y D A 2 d Q h y t o Q B M w P M I z v M K b 8 W S 8 G O / G x 7 R 0 w S h 6 9 u A P j M 8 f G N K X 3 Q = = < / l a t e x i t > V 2 R w < l a t e x i t s h a 1 _ b a s e 6 4 = " q y 1 g f G Y t Y K p h X l 8 O d U z C T 4 i k 7 0 o = " > A A A B 8 X i c b V D L S g M x F L 1 T X 7 W + q i 7 d B I v g q s x I 8 b E r u H F Z w T 6 w H U o m v d O G Z j J D k h H K 0 L 9 w 4 0 I R t / 6 N O / / G t J 2 F t h 4 I H M 6 5 l 5 x 7 g k R w b V z 3 2 y m s r W 9 s b h W 3 S z u 7 e / s H 5 c O j l o 5 T x b D J Y h G r T k A 1 C i 6 x a b g R 2 E k U 0 i g Q 2 A 7 G t z O / / Y R K 8 1 g + m E m C f k S H k o e c U W O l x 1 5 E z S g I s 9 a 0 X 6 6 4 V X c O s k q 8 n F Q g R 6 N f / u o N Y p Z G K A 0 T V O u u 5 y b G z 6 g y n A m c l n q p x o S y M R 1 i 1 1 J J I 9 R + N k 8 8 J W d W G Z A w V v Z J Q + b q 7 4 2 M R l p P o s B O z h L q Z W 8 m / u d 1 U x N e + x m X S W p Q s s V H Y S q I i c n s f D L g C p k R E 0 s o U 9 x m J W x E F W X G l l S y J X j L J 6 + S 1 k X V u 6 z W 7 m u V + k 1 e R x F O 4 B T O w Y M r q M M d N K A J D C Q 8 w y u 8 O d p 5 c d 6 d j 8 V o w c l 3 j u E P n M 8 f y c i Q + g = = < / l a t e x i t > V < l a t e x i t s h a 1 _ b a s e 6 4 = " l 1 N 6 d 8 4 u f Q g O 8 e T H q V b 7 A l 0 B m 9 0 = " > A A A B 6 H i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B b B U 0 m k + H E r e P H Y g q 2 F N p T N d t K u 3 W z C 7 k Y o o b / A i w d F v P q T v P l v 3 L Y 5 a O u D g c d 7 M 8 z M C x L B t X H d b 6 e w t r 6 x u V X c L u 3 s 7 u 0 f l A + P 2 j p O F c M W i 0 W s O g H V K L j E 
l u F G Y C d R S K N A 4 E M w v p 3 5 D 0 + o N I / l v Z k k 6 E d 0 K H n I G T V W a o 7 6 5 Y p b d e c g q 8 T L S Q V y N P r l r 9 4 g Z m m E 0 j B B t e 5 6 b m L 8 j C r D m c B p q Z d q T C g b 0 y F 2 L Z U 0 Q u 1 n 8 0 O n 5 M w q A x L G y p Y 0 Z K 7 + n s h o p P U k C m x n R M 1 I L 3 s z 8 T + v m 5 r w 2 s + 4 T F K D k i 0 W h a k g J i a z r 8 m A K 2 R G T C y h T H F 7 K 2 E j q i g z N p u S D c F b f n m V t C + q 3 m W 1 1 q x V 6 j d I R L L j E c U E D 1 k T O A j W i S U j g S d Y 2 x v e T v z 2 I 5 O K R + E D j G L m B K Q f c p 9 T A l p y y 9 d d 2 y M y t Q c E U j / L X O s M 2 7 0 I l N 7 m 9 H S c 0 1 z U 5 3 H m u O W K W T V z 4 G V i z U g F z d B w y 5 / 6 Z p o E L A Q q i F J d y 4 z B S Y k E T g X L S n a i W E z o k P R Z V 9 O Q B E w 5 a T 5 i h k + 0 0 s N + J P U K A e f q 3 4 6 U B E q N A k 9 X B g Q G a t G b i P 9 5 3 Q T 8 K y f l Y Z w A C + n 0 I T 8 R G C I 8 y Q v 3 u G Q U x E g T Q i X X f 8 V 0 Q C S h o F M t 6 R C s x Z G X S Q = " > A A A C F X i c b V D L S g M x F M 3 U V 6 2 v U Z d u g k V o o Z Q Z K e p K C i K 4 U S r a B 7 S l Z N J M G 5 p 5 k N w R y z A / 4 c Z f c e N C E b e C O / / G 9 C F o 6 4 G Q k 3 P u J f c e J x R c g W V 9 G a m F x a X l l f R q Z m 1 9 Y 3 P L 3 N 6 p q S C S l F V p I A L Z c I h i g v u s C h w E a 4 S S E c 8 R r O 4 M z k Z + / Y 5 J x Q P / F o Y h a 3 u k 5 3 O X U w J a 6 p i F F r B 7 i K 8 u b 8 6 T X K t P I H b 1 7 R H o O 2 7 c S P I F / P M Y J v m O m b W K 1 h h 4 n t h T k k V T V D r m Z 6 s b 0 M h j P l B B l G r a V g j t m E j g V L A k 0 4 o U C w k d k B 5 r a u o T j 6 l 2 P N 4 q w Q d a 6 W I 3 k P r 4 g M f q 7 4 6 Y e E o N P U d X j k Z U s 9 5 I / M 9 r R u C e t G P u h x E w n 0 4 + c i O B I c C j i H C X S 0 Z B D D U h V H I 9 K 6 Z 9 I g k F H W R G h 2 D P r j x P a o d F + 6 h Y u i 5 l y 6 f T O N J o D + 2 j H L L R M S q j C 1 R B V U T R A 3 p C L + j V e D S e j T f j f V K a M q Y 9 u + g P j I 9 v Z t e e 9 g = 
= < / l a t e x i t > NMSE( f (X), y) P.G. . . . < l a t e x i t s h a 1 _ b a s e 6 4 = " h r 9 N X 5 3 e K f a (1) An encoding architecture that is permutation invariant across the number of samples n in the observed dataset D = {(X i , y i )} n i=1 , (2) An Bayesian inspired end-to-end loss NMSE function, Eq. 1 from the encoded dataset D to the outputs from the predicted equations, i.e., NMSE( f (X), y). The highlighted boundaries show the subset of pre-trained encoder-decoder methods and RL methods. Inference Pre-training Backpropagation < l a t e x i t s h a _ b a s e = " O Y B T W Z w R z U Z A z k s k q T f s = " > A A A C E H i c b V D L S s N A F J U V v q E s g W s I C W R m M h F H T h s o J Q B P K Z D p p h e z E y E E P I J b v w V N y U c e v S n X / j J I g r Q c G z p x z L / f e S M C m k Y X p p Y X F p e a W W l l b j c r d O i K I O C Z t H L C A x w k C K M + a U s q G e m F n C D P Y a T r T K y v t P u K C B f y f j k N g e G v n U p R h J J Q Q t D c o w R S T S y u p V / H T X r p Q + N y M r H e h V o k g P P E L E g V F G g N E r G O D I I E D A n R N Q g n i k m J G o o V C R I i P E E j l f U R x R d p I f l M I D p Q y h G D f A l z X d H g j w h Y s R l d m S Y t b L x P + f i T d c z u h f h h J u P p I D d i U A Y w S w c O K S d Y s l g R h D l V u I R h x h q T K s q B D M Z P n S e e k b p W G e N a v O i i K M M s A + q A E T n I E m u A E t A Y Y P I A n A J e t U f t W X v T q e l J a o Q V / o H A B n W o = < / l a t e x i t > D = {(X, y)} Encoder Decoder < l a t e x i t s h a 1 _ b a s e 6 4 = " N n J O m V d z c E h l U D E M j q o b / A P z 7 k w = " > A A A B 6 3 i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B b B U 0 m k V L 0 V v H i s Y D + g D W W z 3 T R L d z d h d y O U 0 L / g x Y M i X v 1 D 3 v w 3 b t o c t P X B w O O 9 G W b m B Q l n 2 r j u t 1 P a 2 N z a 3 i n v V v b 2 D w 6 P q s c n X R 2 n i t A O i X m s + g H W l D N J O 4 Y Z T v u J o l g E n P a C 6 V 3 u 9 5 6 o 0 i y W j 2 a W U F / g i W Q h 
I 9 j k 0 j C J 2 K h a c + v u A m i d e A W p Q Y H 2 q P o 1 H M c k F V Q a w r H W A 8 9 N j J 9 h Z R j h d F 4 Z p p o m m E z x h A 4 s l V h Q 7 W e L W + f o w i p j F M b K l j R o o f 6 e y L D Q e i Y C 2 y m w i f S q l 4 v / e Y P U h D d + x m S S G i r J c l G Y c m R i l D + O x k x R Y v j M E k w U s 7 c i E m G F i b H x V G w I 3 u r L 6 6 R 7 V f F M i J z U Y X B m j M 5 U C A = " > A A A B 7 H i c b V B N S 8 N A E J 3 U r 1 q / q h 6 9 L B b B U 0 m k + H E r e P F Y w b S F N p T N d t M u 3 W z C 7 k S o p b / B i w d F v P q D v P l v 3 L Y 5 a O u D g c d 7 M 8 z M C 1 M p D L r u t 1 N Y W 9 / Y 3 C p u l 3 Z 2 9 / Y P y o d H T Z N k m n G f J T L R 7 Z A a L o X i P g q U v J 1 q T u N Q 8 l Y 4 u p 3 5 r U e u j U j U A 4 5 T H s R 0 o E Q k G E U r + d 0 n j r R X r r h V d w 6 y S r y c V C B H o 1 f + 6 v Y T l s V c I Z P U m I 7 n p h h M q E b B J J + W u p n h K W U j O u A d S x W N u Q k m 8 2 O n 5 M w q f R I l 2 p Z C M l d / T 0 x o b M w 4 D m 1 n T H F o l r 2 Z + J / X y T C 6 D i Z C p R l y x R a L o k w S T M j s c 9 I X m j O U Y 0 s o 0 8 L e S t i Q a s r Q 5 l O y I X j L L 6 + S 5 k X V u 6 z W 7 m u V + k 0 e R x F O 4 B T O w Y M r q M M d N M A H B g K e 4 R X e H O W 8 O O / O x 6 K 1 4 O Q z x / A

A GLOSSARY OF TERMS

We provide a short glossary of key terms in Table 5.

B DGSR INSTANTIATION

We outline our instantiation of DGSR with the following architecture. We split the conditional generator into two parts, an encoder and a decoder, with total parameters θ = {ζ, ϕ}, where the encoder has parameters ζ and the decoder has parameters ϕ, detailed in Figure 6. See Appendix G for implementation hyperparameters and further details. Encoder. We use a set transformer (Lee et al., 2019) to encode a dataset D = {(X_i, y_i)}_{i=1}^n into a latent vector V ∈ R^w, where w ∈ N. Here n is the number of samples, y_i ∈ R, and X_i ∈ R^d has d input variables. It is also possible to represent the input float values of D in a multi-hot bit representation according to the half-precision IEEE-754 standard, as is common in some encoder symbolic regression methods (Biggio et al., 2021; Kamienny et al., 2022). We note that in our experimental instantiation we did not represent the inputs in this way, and instead feed the float values directly into the set transformer. Empirically we observed similar performance on the benchmark problem sets evaluated, and therefore chose not to include this extra encoding. However, we highlight that the user should follow best practices: if their problem has observations D with drastically different values, the multi-hot bit representation encoding could be useful (Biggio et al., 2021). Decoder. The latent vector V ∈ R^w is fed into the decoder. We instantiate this with a standard transformer decoder (Vaswani et al., 2017). The decoder generates each token of the equation f autoregressively, that is, sampling from p(f_i | f_{1:(i-1)}; θ, D). We process the existing generated tokens f_{1:(i-1)} into their hierarchical tree state representation (Petersen et al., 2020), which is provided as input to the transformer decoder: a representation of the parent and sibling nodes of the token being sampled (Petersen et al., 2020). The hierarchical tree state representation is generated according to Petersen et al.
(2020) and is encoded with a fixed-size embedding and fed into a standard transformer encoder (Vaswani et al., 2017). This generates an additional latent vector which is concatenated to the encoder latent vector, forming a total latent vector U ∈ R^{w+d_s}, where d_s is the additional state dimension. The decoder generates each token sequentially, outputting a categorical distribution to sample from. From this, it is straightforward to apply token selection constraints based on the previously generated tokens. We incorporate the token selection constraints of Mundhenk et al. (2021); Petersen et al. (2020), which include: limiting equations to a minimum and maximum length, children of an operator should not all be constants, the child of a unary operator should not be the inverse of that operator, descendants of trigonometric operators should not be other trigonometric operators, etc. During pre-training and inference, we encode a dataset D into the latent vector V and sample k equations from the decoder. This is achieved by tiling (repeating) the encoded dataset latent vector k times and running the decoder over the respective latent vectors to form a batch of k equations. Optimization Algorithm. We wish to minimize the end-to-end NMSE loss, Equation 1; however, the process of turning a sequence of tokens f into an executable equation f is a non-differentiable step. Therefore, we require an end-to-end non-differentiable optimization algorithm. DGSR is agnostic to the exact choice of optimization algorithm used to optimize this loss, and any relevant end-to-end non-differentiable optimization algorithm can be used. Suitable methods are policy gradient approaches, including policy gradients combined with genetic programming (Petersen et al., 2020; Mundhenk et al., 2021). To use these, we reformulate the loss of Equation 1 for each equation into a reward function R(θ) = 1/(1 + L(θ)), which is optimizable using policy gradients (Petersen et al., 2020).
Here both the encoder and decoder are trained during pre-training, i.e., optimizing the parameters θ; this is further illustrated in Figure 2 with a block diagram. To be competitive with the existing state of the art, we formulate this reward function and optimize it with the neural guided priority queue training (NGPQT) of Mundhenk et al. (2021), detailed in Appendix C. Furthermore, we provide pseudocode for DGSR in Appendix D and show empirically that other optimization algorithms can be used, with an ablation of these in Section 5.2 and Appendix E.
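As an aside, the half-precision multi-hot input encoding mentioned above (which we did not use in our final instantiation) amounts to exposing each float's raw IEEE-754 bit pattern to the encoder. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def float_to_multihot(x):
    """Encode a scalar as the 16 bits of its IEEE-754 half-precision
    representation (1 sign bit, 5 exponent bits, 10 mantissa bits)."""
    # Reinterpret the float16 bit pattern as an unsigned 16-bit integer.
    u = int(np.array(x, dtype=np.float16).view(np.uint16))
    # Emit bits most-significant first, so bit 0 is the sign bit.
    return np.array([(u >> i) & 1 for i in range(15, -1, -1)], dtype=np.float32)

bits = float_to_multihot(1.0)  # 1.0 is 0x3C00 in binary16
```

Each input float then becomes a fixed-length 16-dimensional binary vector, which sidesteps scale issues when observations take drastically different magnitudes.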

C NGPQT OPTIMIZATION METHOD

The work of Mundhenk et al. (2021) introduces the hybrid neural-guided genetic programming method for SR. This uses an RNN (LSTM) as the generator (i.e., decoder) (Petersen et al., 2020) with a genetic programming component, and achieves state-of-the-art results on a range of symbolic regression benchmarks (Mundhenk et al., 2021). This optimization method is applicable to any neural network autoregressive model that can output an equation. Equations sampled from the generator are used to seed the starting population of a random-restart genetic programming component, which gradually learns better starting populations of equations by optimizing the parameters θ of the generator. Specifically, this RL optimization method formulates the generator as a reinforcement learning policy. This is optimized over a mini-batch of t datasets, where for each dataset we sample k equations from the conditional generator, p_θ(f|D). Therefore, we sample a total of kt equations F per mini-batch. We note that at inference time we only have one dataset, therefore t = 1. We evaluate each equation f ∈ F under the reward function R(f) = 1/(1 + L(f)) and perform gradient descent on an RL policy gradient loss function (Mundhenk et al., 2021). Here we define L(f) as the normalized mean squared error (NMSE) for a single equation, i.e.,

L(f) = (1/σ_y) (1/n) Σ_{i=1}^n (y_i - f(X_i))^2    (2)

where σ_y is the standard deviation of the observed outputs y. We note that instead of optimizing Equation 1 directly, we achieve the same goal by optimizing the reward function R(f) for each equation f ∈ F over a batch of kt equations F. Choice of RL policy gradient loss function. There exist multiple applicable RL policy gradient loss functions (Mundhenk et al., 2021). The optimization method of Mundhenk et al. (2021) proposes priority queue training (PQT) (Abolafia et al., 2018).
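Equation 2 and the reward reformulation above can be written directly as:

```python
import numpy as np

def nmse(y, y_pred):
    """Normalized mean squared error: the MSE scaled by the standard
    deviation of the observed outputs y (Equation 2)."""
    return np.mean((y - y_pred) ** 2) / np.std(y)

def reward(y, y_pred):
    """R(f) = 1 / (1 + L(f)): squashes the unbounded NMSE into (0, 1],
    so lower error yields higher reward."""
    return 1.0 / (1.0 + nmse(y, y_pred))
```

A perfect fit gives a reward of exactly 1, and the reward decays smoothly toward 0 as the NMSE grows, which keeps the policy gradient signal bounded.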
Detailed further, some of these RL policy gradient loss functions are (Mundhenk et al., 2021):

• Priority Queue Training (PQT). The generator is trained with a continually updated buffer of the top-q best fitting equations, i.e., a maximum-reward priority queue (MRPQ) of maximum size q (Abolafia et al., 2018). Training is performed over equations in the MRPQ using a supervised learning objective: L(θ) = (1/q) Σ_{f ∈ F} ∇_θ log p_θ(f|D).

• Vanilla policy gradient (VPG). Uses the REINFORCE algorithm (Williams, 1992). Training is performed over the equations in the batch F with the loss function: L(θ) = (1/k) Σ_{f ∈ F} (R(f) - b) ∇_θ log p_θ(f|D), where b is a baseline, defined as an exponentially-weighted moving average (EWMA) of rewards.

• Risk-seeking policy gradient (RSPG). Uses a modified VPG to optimize for the best-case reward (Petersen et al., 2020) rather than the average reward: L(θ) = (1/(ϵk)) Σ_{f ∈ F} (R(f) - R_ϵ) ∇_θ log p_θ(f|D) 1[R(f) > R_ϵ], where ϵ is a hyperparameter that controls the degree of risk-seeking and R_ϵ is the empirical (1 - ϵ) quantile of the rewards of F.

Furthermore, we follow Mundhenk et al. (2021) and also include a common additional term in the loss function proportional to the entropy of the distribution at each position along the generated equation (Mundhenk et al., 2021; Petersen et al., 2020). Specifically, these are the same equation complexity regularization methods of Mundhenk et al. (2021), using the hierarchical entropy regularizer and the soft length prior from Landajuela et al. (2021). The hierarchical entropy regularizer encourages the decoder (decoding tokens sequentially) to perpetually explore early tokens, without getting stuck committing to early tokens during training, dubbed the "early commitment problem" (Landajuela et al., 2021).
Whereas the soft length prior discourages the equation from being either too short or too long, and is superior to a hard length prior that forces each equation to be generated between a pre-specified minimum and maximum length. Using a soft length prior, the generator can learn the optimal equation length, which has been shown to improve learning (Landajuela et al., 2021). The work of Balla et al. (2022) further shows that unit regularization can be added to improve an SR method, and provides a useful decomposition of the complexity regularization into two components, the number of tokens ("activation functions") and the number of numeric constants, whereby it can be beneficial to tune these regularization terms separately. A full analysis of all regularization terms is out of scope for this work; however, we leave this as an exciting direction for future work to explore. For the genetic programming component, we follow the same setup as Mundhenk et al. (2021). This uses a standard genetic programming formulation, DEAP (Fortin et al., 2012), introducing a few improvements: equal probability amongst the mutation types (e.g., uniform, node replacement, insertion and shrink mutation), incorporating the equation constraints from Petersen et al. (2020) (also discussed in Appendix B), and seeding the initial equation population with samples from the generator. Unless otherwise specified, we use the same neural guided PQT (NGPQT) optimization method at inference time for DGSR; specifically, this is the PQT method, and we use the same optimization implementation of Mundhenk et al. (2021), first filtering the equations by the empirical (1 - ϵ) quantile of the rewards of F, and then filtering these for the top-q best fitting equations for use in PQT. Furthermore, we pre-train with the VPG optimization method.
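The risk-seeking filter used above can be sketched in plain numpy. This is a simplified, gradient-free illustration of which samples survive the (1 - ϵ) quantile cut; in practice the surviving samples' log-probabilities feed the policy gradient update, and the function name is ours:

```python
import numpy as np

def risk_seeking_weights(rewards, eps=0.1):
    """Per-sample weights (R(f) - R_eps) * 1[R(f) > R_eps] used by the
    risk-seeking policy gradient: only roughly the top eps-fraction of
    the batch contributes to the update."""
    r_eps = np.quantile(rewards, 1.0 - eps)  # empirical (1 - eps) quantile
    mask = rewards > r_eps                   # indicator 1[R(f) > R_eps]
    return (rewards - r_eps) * mask
```

Because low-reward samples are zeroed out entirely, the gradient optimizes the best-case tail of the reward distribution rather than its mean.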

D DGSR PSEUDOCODE AND SYSTEM DESCRIPTION

We outline the DGSR system in Figure 6. The conditional generative model p_θ(f|D) comprises an encoder and a decoder, with total parameters θ. Specifically, the parameters of the encoder ζ and decoder ϕ are subsets of the total model parameters, i.e., θ = {ζ, ϕ}. During pre-training we update the parameters of both the encoder and the decoder, that is θ, whilst at inference time we only update the parameters of the decoder ϕ. We denote the best equation found during inference as f_a, not to be confused with the true underlying equation f* for that problem. If DGSR identifies the true equation, then f_a is equivalent to f*, i.e., f_a = f* (Mundhenk et al., 2021). The pre-training pseudocode for DGSR is detailed in Algorithm 1 and the inference pseudocode in Algorithm 2. For comprehensiveness we repeat the pre-training and inference training details here; note these details can also be found in Appendix J, with further details of the loss optimization methods in Appendix C.
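The parameter split θ = {ζ, ϕ} can be illustrated with a toy update rule (names and the placeholder gradient step are ours, not the paper's code): pre-training steps both parameter groups, while inference steps only the decoder's ϕ.

```python
# Toy illustration of the theta = {zeta, phi} split. The dict keys and
# the plain SGD step are illustrative placeholders, not the real model.
params = {"encoder.zeta": 1.0, "decoder.phi": 1.0}
grads = {"encoder.zeta": 0.5, "decoder.phi": 0.5}

def step(params, grads, trainable, lr=0.1):
    """Apply a gradient step only to the parameters named in `trainable`."""
    return {k: v - lr * grads[k] if k in trainable else v
            for k, v in params.items()}

pretrain = step(params, grads, trainable=set(params))       # updates zeta and phi
inference = step(params, grads, trainable={"decoder.phi"})  # updates phi only
```

In a deep learning framework the same effect is typically achieved by marking the encoder's parameters as non-trainable (frozen) before the inference-time optimization loop.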

Pre-training.

Using the specifications of p(f), we can generate an almost unbounded number of equations. We pre-compile 100K equations and train the conditional generator on these using mini-batches of t datasets, following Biggio et al. (2021). The overall pre-training algorithm is detailed in Algorithm 1. During pre-training we use the vanilla policy gradient (VPG) loss function to train all the conditional generator parameters θ (i.e., the encoder and decoder parameters). This is optimized over a mini-batch of t datasets, where for each dataset we sample k equations from the conditional generator, p_θ(f|D). Therefore, we sample a total of kt equations F per mini-batch. For each equation f ∈ F we compute the normalized mean squared error (NMSE), i.e.,

L(f) = (1/σ_y) (1/n) Σ_{i=1}^n (y_i - f(X_i))^2,    f ∼ p_θ(f|D)

where σ_y is the standard deviation of the observed outputs y. We formulate each equation's NMSE loss as a reward via R(f) = 1/(1 + L(f)). Training is performed over the equations in the batch F with the vanilla policy gradient loss function

L(θ) = (1/k) Σ_{f ∈ F} (R(f) - b) ∇_θ log p_θ(f|D)    (4)

where b is a baseline, defined as an exponentially-weighted moving average (EWMA) of rewards, i.e., b_t = α E[R(f)] + (1 - α) b_{t-1}. Furthermore, we follow Mundhenk et al. (2021) and also include a common additional term in the loss function proportional to the entropy of the distribution at each position along the generated equation (Mundhenk et al., 2021; Petersen et al., 2020). Specifically, these are the same equation complexity regularization methods of Mundhenk et al. (2021), using the hierarchical entropy regularizer and the soft length prior from Landajuela et al. (2021).
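The EWMA baseline b_t = α E[R(f)] + (1 - α) b_{t-1} can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def ewma_baseline(batch_rewards, alpha=0.5, b0=0.0):
    """Exponentially-weighted moving average of mean batch rewards,
    b_t = alpha * E[R(f)] + (1 - alpha) * b_{t-1}, used as the VPG
    baseline to reduce the variance of the gradient estimate."""
    b = b0
    trace = []
    for rewards in batch_rewards:
        b = alpha * np.mean(rewards) + (1 - alpha) * b
        trace.append(b)
    return trace
```

Subtracting this slowly-moving baseline from each reward centers the policy gradient signal without biasing it.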
Algorithm 1 DGSR pre-training
for each mini-batch of t datasets do
    for each dataset do
        f* ∼ p(f); X ∼ X; D ← {(f*(X), X)}
        F_generator ← {f^(i) ∼ p_θ(f|D)}_{i=1}^k    ▷ Sample k equations from the conditional generator
        R ← {R(f, D) ∀ f ∈ F_generator}    ▷ Compute rewards
        L ← L + L(θ)    ▷ Accumulate the generator loss (e.g., using VPG)
    end for
    θ ← θ + ∇_θ L    ▷ Train the generator over the batch of t datasets
end for

Inference training. We detail the inference training routine in Algorithm 2. A training set and a test set are sampled independently from the observed dataset D. The training dataset is used to optimize the loss at inference time, and the test set is only used for evaluation of the best equations found at the end of inference, which, unless the true equation f* is found, runs for 2 million equation evaluations on the training dataset. During inference we use the neural guided PQT (NGPQT) optimization method, outlined in Appendix C, and only train the decoder in the conditional generative model, i.e., only the decoder's parameters ϕ are updated. To construct the loss function we sample a batch of k equations from the conditional generative model p_θ(f|D) and use these to seed a genetic programming component (DEAP (Fortin et al., 2012)). The genetic programming component evaluates each equation's fitness by its individual reward, computing the equation's NMSE and then the reward of that loss (i.e., again using R(f) = 1/(1 + L(f))). After a pre-defined number of genetic programming rounds (set by a hyperparameter), we join the equations initially generated by the conditional generator with the best equations identified by the genetic programming component. We then use this set of equations to train with the priority queue training (PQT) loss function, updating only the parameters of the decoder. To construct the PQT loss, we continually update a buffer of the top-q best fitting equations, i.e., a maximum-reward priority queue (MRPQ) of maximum size q (Abolafia et al., 2018).
Training is performed over equations in the MRPQ using a supervised learning objective: L(ϕ) = (1/q) Σ_{f ∈ F} ∇_ϕ log p_ϕ(f|D). Additionally, we also include a common additional term in the loss function proportional to the entropy of the distribution at each position along the generated equation (Mundhenk et al., 2021; Petersen et al., 2020).

Algorithm 2 DGSR inference
while f_a ≠ f* (up to 2 million equation evaluations) do
    F_generator ← {f^(i) ∼ p_θ(f|D)}_{i=1}^k    ▷ Sample k equations from the conditional generator
    F_GP ← GP(F_generator)    ▷ Seed GP component, as defined in Appendix C
    F_train ← F_generator ∪ F_GP    ▷ Join generated equations and best GP equations
    R ← {R(f) ∀ f ∈ F_train}    ▷ Compute rewards
    ϕ ← ϕ + ∇_ϕ L(ϕ)    ▷ Train the generator (e.g., using PQT)
    if max R > R(f_a) then f_a ← f_{arg max R}    ▷ Update the best equation seen
end while

Advantage over cross-entropy loss. Using an end-to-end loss during pre-training and inference is key, as the conditional generator is able to learn and exploit the unique equation invariances and equivalent forms. For example, two equivalent equations that have different forms (e.g., x_1(x_2 + sin(x_3)) = x_1 x_2 + x_1 sin(x_3)) still have an identical NMSE loss (Equation 1). In contrast, existing pre-training methods that use a cross-entropy loss L_CE(f*, f) between the known ground-truth equation f* and the predicted equation f assign them non-identical losses, failing to capture equation equivalence relations. Furthermore, existing pre-training methods trained in this way require the ground-truth equation f* to train their conditional generative model. However, this is unknown at inference time (it is precisely what we wish to find); therefore they cannot update their posterior at inference and are limited to only sampling from it. We empirically illustrate this in Figure 4 (c).
Given this, existing pre-trained encoder-decoder methods require an exponentially larger model and pre-training dataset size when pre-training on datasets of increasing covariate dimension (Kamienny et al., 2022).
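The invariance argument above can be checked numerically: two algebraically equivalent equation forms yield identical NMSE values, even though their token sequences (and hence any cross-entropy comparison) differ.

```python
import numpy as np

def f1(X):
    # Factored form: x1 * (x2 + sin(x3))
    return X[:, 0] * (X[:, 1] + np.sin(X[:, 2]))

def f2(X):
    # Expanded, algebraically equivalent form: x1*x2 + x1*sin(x3)
    return X[:, 0] * X[:, 1] + X[:, 0] * np.sin(X[:, 2])

def nmse(y, y_pred):
    return np.mean((y - y_pred) ** 2) / np.std(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = f1(X)  # ground truth generated from the factored form

# Both forms achieve the same (near-zero) end-to-end loss.
assert np.isclose(nmse(y, f1(X)), nmse(y, f2(X)))
```

A cross-entropy loss over token sequences would penalize the expanded form for not matching the factored target tokens, even though the two are the same equation.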

E OTHER OPTIMIZATION ALGORITHMS

DGSR supports other optimization algorithms that can optimize a non-differentiable loss function. It is common to reformulate the NMSE loss per equation into a reward via R(f) = 1/(1 + L(f)), which can then be optimized using policy gradients (Petersen et al., 2020). Suitable policy gradient algorithms are outlined in Appendix C, and include vanilla policy gradients, risk-seeking policy gradients and priority queue training. We note that it is possible to use DGSR with other RL optimization methods, such as distributional RL optimization (Bellemare et al., 2017). See Appendix R for other optimization results, without the genetic programming component.

F EXTENDED RELATED WORK

Table 1: Comparison of deep SR approaches.

Method | Loss | Generative model | Pre-trained | Uses dataset D | Gradient refinement at inference | Scales with input variables
RL methods | RMSE(f(X_i), y_i) | p_θ(f) | ✗ | ✗ | ✗ (train from scratch) | -
Prior [4] | RMSE(f(X_i), y_i) | p_θ(f) | ✗ | ✗ | ✗ (train from scratch) | -
Encoder [5,6,7,8] | CE(f, f*) | p_θ(f|D) | ✓ | ✗ | ✗ (cannot gradient refine) | ✗
DGSR (this work) | Eq. 1, NMSE(f(X_i), y_i) | p_θ(f|D) | ✓ | ✓ | ✓ (can gradient refine) | ✓

In the following we review the existing deep SR approaches and summarize their main differences in Table 1. We illustrate in Figure 2 that RL and pre-trained encoder-decoder methods can be seen as ad hoc subsets of the DGSR framework. In the following, we provide an extended discussion of related works, including the additional related work of heuristic-based methods. RL methods. These works use a policy network, typically implemented with RNNs, to output a sequence of tokens (actions) to form an equation. The output equation obtains a reward based on some goodness-of-fit metric (e.g., RMSE). Since the tokens are discrete, the method uses policy gradients to train the policy network. Most existing works focus on improving the pioneering policy gradient approach for SR of Petersen et al. (2020) (Costa et al., 2020; Landajuela et al., 2021). However, the policy network is randomly initialized (without pre-training) and tends to output ill-informed equations at the beginning, which slows down the procedure. Furthermore, the policy network needs to be re-trained each time a new dataset D is available. Hybrid RL and GP methods. These methods combine RL with genetic programming (GP). Mundhenk et al. (2021) use a policy network to seed the starting population of a GP algorithm, instead of starting with a random population as in standard GP. Other works use RL to adjust the probabilities of genetic operations (Such et al., 2017; Chang et al., 2018; Chen et al., 2018; Mundhenk et al., 2021; Chen et al., 2020).
Similarly, these methods cannot improve with more learning from other datasets and have to re-train the model from scratch, making inference slow at test time. Pre-trained encoder-decoder methods. Unlike RL, these methods pre-train an encoder-decoder neural network to model p(f|D) using a curated dataset (Biggio et al., 2021). Specifically, Valipour et al. (2021) propose to use standard language models, e.g., GPT. At inference time, these methods sample from p_θ(f|D) using the pre-trained network, thereby achieving low complexity at inference, that is, efficient inference. These methods have two key limitations: (1) they use a cross-entropy (CE) loss for pre-training, and (2) they cannot gradient-refine their model, leading to sub-optimal solutions. First (1), cross-entropy, whilst useful for comparing categorical distributions, does not account for equations that are mathematically equivalent. Prior works, specifically Lample & Charton (2019), observed the "surprising" and "very intriguing" result that sampling multiple equations from their pre-trained encoder-decoder model, pre-trained using a CE loss, yielded some equations that are mathematically equivalent; the work of d'Ascoli et al. (2022) has shown this behavior as well. In contrast, our proposed end-to-end NMSE loss, Eq. 1, takes the same value for different equation forms that are mathematically equivalent; this loss is therefore a natural and principled way to incorporate the equation equivalence property inherent to symbolic regression. Second (2), DGSR is, to the best of our knowledge, the first SR method able to perform gradient refinement of a pre-trained encoder-decoder model using our end-to-end NMSE loss, Eq. 1, updating the weights of the decoder at inference time. We note that there exist other non-gradient refinement approaches, which cannot update their decoder's weights.
These consist of: (1) optimizing the constants in the generated equation form with a secondary optimization step (commonly using the BFGS algorithm) (Petersen et al., 2020; Biggio et al., 2021), and (2) using the MSE of the predicted equation(s) to guide a beam search sampler (d'Ascoli et al., 2022; Kamienny et al., 2022). As a result, to generalize to equations with a greater number of input variables, pre-trained encoder-decoder methods require large pre-training datasets (e.g., millions of datasets (Biggio et al., 2021)) and even larger generative models (e.g., ∼ 100 million parameters (Kamienny et al., 2022)). Using priors. The work of Jin et al. (2019) explicitly uses a simple pre-determined prior over the equation token set and updates it using an MCMC algorithm. They propose encoding this simple prior by hand, which has the drawbacks that it cannot condition on the observations D (such as the equation classes and the domains of X ∈ X and y ∈ Y), is not learnt automatically from datasets, and is too restrictive to capture the conditional dependence of tokens as they are generated. Heuristic Based Methods. Many symbolic regression algorithms use search heuristics designed for equations. Examples include genetic programming (GP) (Augusto & Barbosa, 2000; Schmidt & Lipson, 2009), simulated annealing (Stinstra et al., 2008) and AI Feynman (Udrescu & Tegmark, 2020). Genetic programming symbolic regression (Augusto & Barbosa, 2000; Schmidt & Lipson, 2009; Bäck et al., 2018) starts with a population of random equations and evolves them through selecting the fittest for crossover and mutation to improve their fitness. Although often useful, genetic programs suffer from scaling poorly to larger dimensions and are highly sensitive to hyperparameters. Alternatively, it is possible to use simulated annealing for symbolic regression; however, this again has difficulty scaling to larger dimensions.
AI Feynman (Udrescu & Tegmark, 2020) tackles the problem by applying a set of sequential heuristics (e.g., solving via dimensional analysis, translational symmetry, multiplicative separability, polynomial fitting, etc.), dividing and transforming the dataset into simpler pieces that are then processed separately in a recursive manner. It is a problem-simplification tool for symbolic regression that uses neural networks to identify these simplifying properties, such as translational symmetry and multiplicative separability. The resulting sub-problems can be tackled by any symbolic regression algorithm, and Udrescu & Tegmark (2020) use a simple inner search algorithm of either a polynomial fit or a brute-force search. However, brute-force search fails to scale to higher dimensions and is computationally inefficient, as it is unable to leverage any structure that would make the search more tractable. Additionally, there exist other works that create an interpretable model from a black-box model (Crabbe et al., 2020; Alaa & van der Schaar, 2019).
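As discussed above, an end-to-end NMSE-style loss assigns the same value to mathematically equivalent equation forms, whereas a token-level cross-entropy compares against one reference token sequence. A minimal sketch of this property; the exact NMSE form used here (MSE normalized by the variance of y) is an assumption, as Eq. 1 is not reproduced in this appendix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 5.0, size=(100, 3))
y = X[:, 0] * X[:, 1] + X[:, 0] * X[:, 2]   # ground-truth equation

def nmse(y_true, y_pred):
    """Assumed NMSE form: MSE normalized by the variance of y."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

# Two mathematically equivalent predicted forms (distributivity) with
# different token sequences, so a token-level CE would score them differently.
f_a = lambda X: X[:, 0] * X[:, 1] + X[:, 0] * X[:, 2]  # +, *, x1, x2, *, x1, x3
f_b = lambda X: X[:, 0] * (X[:, 1] + X[:, 2])          # *, x1, +, x2, x3

loss_a = nmse(y, f_a(X))  # 0: the end-to-end loss cannot tell the forms apart
loss_b = nmse(y, f_b(X))  # ~0 up to floating-point rounding
```

Both forms achieve (numerically) zero NMSE, while their token sequences differ, which is exactly the invariance a cross-entropy loss against one reference sequence fails to capture.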

G BENCHMARK ALGORITHMS

In this section we detail the benchmark algorithms, covering: (1) the types of benchmark algorithms selected, (2) hyperparameters and implementation details and (3) a discussion of the inclusion criteria for benchmark model selection. Benchmark Algorithm Selection. The benchmark symbolic regression algorithms we selected to compare against are: Neural Guided Genetic Programming (NGGP) (Mundhenk et al., 2021), as this is the current state-of-the-art for symbolic regression, superseding DSR (Petersen et al., 2020); genetic programming (GP) (Fortin et al., 2012), which has long been an industry standard; and Neural Symbolic Regression that Scales (NESYMRES) (Biggio et al., 2021), a pre-trained encoder-decoder method. Neural Guided Genetic Programming (NGGP) (Mundhenk et al., 2021). We use their provided code and implementation, following their proposal to set the generator to be a single-layer LSTM RNN with 32 hidden nodes. We further follow their hyperparameter settings (Mundhenk et al., 2021), unless otherwise noted. This uses the same optimization method of NGPQT, as described further in Appendix C. Genetic programming (GP) (Fortin et al., 2012). We use the software package "DEAP" (Fortin et al., 2012), following the same symbolic regression GP setup as Petersen et al. (2020), using their stated hyperparameters. This uses an initial population of equations generated with the "full" method (Koza, 1992), with a depth randomly selected between d_min and d_max. Following Petersen et al. (2020), we do not include GP post-hoc constraints, other than constraining the maximum length of generated equations to 30 (unless otherwise defined) and setting the maximum number of possible constants to 3. Neural Symbolic Regression that Scales (NESYMRES) (Biggio et al., 2021). We use their provided code and implementation, following their proposal, and use the pre-trained model from their paper.
To ensure fair comparison, we used the largest beam size that the authors proposed in their range of possible beam sizes, a beam size of 256. As NESYMRES was only pre-trained on a dataset with d ≤ 3 variables, we only evaluated it on problem sets where d ≤ 3. Deep Generative Symbolic Regression (DGSR). This work. We define the architecture in Appendix B. The encoder uses the set transformer from Lee et al. (2019). Using the notation from Lee et al. (2019), the encoder is formed of 3 induced set attention blocks (ISABs) and one final pooling-by-multi-head-attention (PMA) component. This uses a hidden dimension of 32 units, one head, one output feature and 64 inducing points. The decoder uses a standard transformer decoder (where standard transformer models are from the core PyTorch library (Paszke et al., 2019)). The decoder consists of 2 layers, with a hidden dimension of 32, one attention head and zero dropout. The dimension of the input and output tokens is the size of the token library used in training plus two (i.e., for a padding token and a start-and-stop token). The inputs are encoded using an embedding of size 32 and have an additional positional encoding added to them, following standard practice (Vaswani et al., 2017). We also mask the target output tokens to prevent information leakage during the forward step, again following standard practice (Vaswani et al., 2017). The decoder generates each token of the equation f autoregressively, that is, sampling from p(f_i | f_{1:(i-1)}, D; θ), following the same sampling procedure as Petersen et al. (2020). We process the existing generated tokens f_{1:(i-1)} into their hierarchical tree state representation, detailed in Petersen et al. (2020), providing as inputs to the transformer decoder a representation of the parent and sibling nodes of the token being sampled (Petersen et al., 2020).
The hierarchical tree state representation is encoded with a fixed-size embedding of 16 units, with an additional positional encoding added to it (Vaswani et al., 2017), and fed into a standard transformer encoder of 3 layers, with a hidden dimension of 32, one attention head and zero dropout. This generates an additional latent vector that is concatenated to the latent vector of the encoder (of the observations D), forming a total latent vector U ∈ R^{w+d_s}, where d_s = 32 and w = 32. We also use the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm (Fletcher, 2013) for inferring numeric constants if any are present in the generated equations, following the setup of Petersen et al. (2020). In total, our conditional generative model has 122,446 parameters, which we note is approximately two orders of magnitude fewer than other pre-trained encoder-decoder methods (NESYMRES with 26 million parameters, and E2E with 86 million parameters). Notable exclusions. The benchmark symbolic regression algorithms selected are all suitable for discovering equations of up to three variables, can accommodate numerical constants within the equations and have made their code accessible to benchmark against. There exist other symbolic regression algorithms that do not exhibit these features; therefore we do not compare against them: • AI Feynman (AIF): Udrescu & Tegmark (2020) developed a problem-simplification tool for symbolic regression; it divides and transforms the dataset into simpler pieces that are then processed separately in a recursive manner (also discussed in Appendix F). The respective sub-problems can be tackled by any symbolic regression algorithm, where Udrescu & Tegmark (2020) use a simple inner search algorithm of either a polynomial fit or a brute-force search.
However, as others have noted (Petersen et al., 2020), more challenging sub-problems that have numerical constants or are non-separable still require a more comprehensive underlying symbolic regression method to solve the sub-problem. Therefore, AIF could be used as a pre-processing problem-simplification step and combined with any symbolic regression method. In this work, however, we analyse the underlying symbolic regression search problem, i.e., after any problem-simplification steps have been applied. Furthermore, the simple inner search that AIF uses (polynomial fit or brute-force search) is computationally expensive and scales poorly with increasing variable size. It can also rely on further information about the equation being provided, such as the units of each input variable, which is unrealistic in most settings where the units are often unknown and the output does not have a physical interpretation (e.g., a dataset with many non-physical variables). • End-to-end symbolic regression with transformers (E2E): Kamienny et al. (2022) introduced a pre-trained encoder-decoder method that also provides an initial estimate of the constant floats, which are further refined in a standard secondary optimization step (using BFGS). Their transformer model contains a total of 86 million parameters, trained on a dataset of millions of equations. It was not possible to compare against E2E, as the authors had not released their code at the time of completing this work. Furthermore, it would be infeasible to re-implement and train such a large transformer model without a pre-trained model; therefore we exclude it as a symbolic regression benchmark in this work.
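The autoregressive decoding described above (sampling each token f_i from p(f_i | f_{1:(i-1)}, D; θ) until the prefix-notation expression tree is complete) can be sketched with a toy stand-in for the decoder; the token library, the random logits and the length cap are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

# Toy library: token -> arity (0 = terminal). The operators follow the Koza
# library used in the paper; the toy decoder below is a placeholder.
LIBRARY = {"+": 2, "*": 2, "sin": 1, "cos": 1, "x1": 0, "x2": 0}
TOKENS = list(LIBRARY)

def toy_decoder_logits(prefix, rng):
    """Stand-in for the decoder p(f_i | f_{1:(i-1)}, D; theta): a real model
    conditions on the dataset embedding and the parent/sibling tree state."""
    return rng.normal(size=len(TOKENS))

def sample_equation(rng, max_len=32):
    """Sample tokens autoregressively until the prefix expression is complete."""
    prefix, open_slots = [], 1  # one open slot: the root of the expression tree
    while open_slots > 0:
        logits = toy_decoder_logits(prefix, rng)
        if len(prefix) + open_slots >= max_len - 1:
            # Force terminals near the length budget so the expression closes.
            logits = np.where([LIBRARY[t] == 0 for t in TOKENS], logits, -np.inf)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        tok = TOKENS[rng.choice(len(TOKENS), p=p)]
        prefix.append(tok)
        open_slots += LIBRARY[tok] - 1  # fill one slot, open `arity` new ones
    return prefix

eq = sample_equation(np.random.default_rng(0))
```

The `open_slots` counter tracks how many subtrees remain to be generated; the prefix traversal is complete exactly when it reaches zero, which is how a decoder knows an equation has terminated.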

H STANDARD BENCHMARK PROBLEM RESULTS

Additionally, we evaluated DGSR against the popular standard benchmark problem sets of Nguyen (Uy et al., 2011), Nguyen with constants (Petersen et al., 2020), R rationals (Krawiec & Pawlak, 2013) and Livermore (Mundhenk et al., 2021). Nguyen problem set. The average recovery rates on the Nguyen problem set (Uy et al., 2011) can be seen in Table 7. The Nguyen symbolic regression benchmark suite (Uy et al., 2011) consists of 12 commonly used benchmark problem equations and has been extensively benchmarked against in prior symbolic regression works (White et al., 2013; Petersen et al., 2020; Mundhenk et al., 2021). We observe that DGSR achieves similar performance to NGGP. Note that NGGP optimized its hyperparameters for the Nguyen problem set, specifically for the Nguyen-7 problem (we empirically observe NGGP to be highly sensitive to hyperparameters). DGSR, in contrast, did not have its many hyperparameters optimized (Appendix J); we tuned only the learning rate, on the same Nguyen-7 problem, empirically finding a learning rate of 0.0001 to be best. This value was used across this benchmark problem set and its variation including constants, described next. (We performed the same learning rate tuning for NGGP but found the default already optimal.) We benchmarked against NGGP directly as it is currently state-of-the-art on the Nguyen problem set, and refer the reader to Mundhenk et al. (2021); Petersen et al. (2020) for other benchmark algorithms on the Nguyen problem set. Nguyen with constants problem set. The average recovery rates on the Nguyen with constants problem set (Petersen et al., 2020) can be seen in Table 8. This benchmark problem set was introduced by Petersen et al. (2020) as a variation of a subset of the Nguyen problem set with constants to also optimize for. We observe that DGSR achieves the same average recovery rate as NGGP.
We benchmarked against NGGP directly as it is currently state-of-the-art on the Nguyen with constants problem set, and refer the reader to Petersen et al. (2020) for other benchmark algorithms on this problem set. R rationals problem set. The average recovery rates on the R rationals problem set (Krawiec & Pawlak, 2013) can be seen in Table 9. We use the original problem set and the extended-domain problem set, as defined in Mundhenk et al. (2021), indicated with *. We observe that DGSR has a higher average recovery rate than the state-of-the-art NGGP, even finding the original equation for problem R-3, which was not previously possible with the existing state-of-the-art (NGGP). Again, we benchmarked against NGGP directly as it is currently state-of-the-art on the R rationals problem set, and refer the reader to Mundhenk et al. (2021) for other benchmark algorithms on this problem set. Note that NGGP optimized its hyperparameters for the R rationals problem set, specifically for R-3*. DGSR, in contrast, did not have its many hyperparameters optimized (Appendix J); we tuned only the learning rate, on the same problem of R-3*, empirically finding a learning rate of 0.0001 to be best, and this value was used across this benchmark problem set. For fair comparison we performed the same learning rate optimization for NGGP and also found 0.0001 to be best, achieving higher recovery rate results than originally reported (Mundhenk et al., 2021).

Table 7: Average recovery rate (A_Rec %) on the Nguyen problem set with 95% confidence intervals. Averaged over κ = 10 random seeds.

Benchmark | Equation | DGSR | NGGP
Nguyen-1 | x_1^3 + x_1^2 + x_1 | 100 | 100
Nguyen-2 | x_1^4 + x_1^3 + x_1^2 + x_1 | 100 | 100
Nguyen-3 | x_1^5 + x_1^4 + x_1^3 + x_1^2 + x_1 | 100 | 100
Nguyen-4 | x_1^6 + x_1^5 + x_1^4 + x_1^3 + x_1^2 + x_1 | 100 | 100
Nguyen-5 | sin(x_1^2) cos(x_1) - 1 | 100 | 100
Nguyen-6 | sin(x_1) + sin(x_1 + x_1^2) | 100 | 100
Nguyen-7 | log(x_1 + 1) + log(x_1^2 + 1) | 60 | 100
Nguyen-8 | sqrt(x_1) | 90 | 100
Nguyen-9 | sin(x_1) + sin(x_2^2) | 100 | 100
Nguyen-10 | 2 sin(x_1) cos(x_2) | 100 | 100
Nguyen-11 | x_1^{x_2} | 100 | 100
Nguyen-12 | x_1^4 - x_1^3 + (1/2) x_2^2 - x_2 | 0 | 0
Average recovery rate (A_Rec %) | | 87.50 ± 4.07 | 91.67 ± 0.00

Table 8 (Nguyen with constants), average recovery rate (A_Rec %): 100 ± 0 (DGSR), 100 ± 0 (NGGP).

Table 9: Average recovery rate (A_Rec %) on the R rationals problem set with 95% confidence intervals. Averaged over κ = 10 random seeds.

Benchmark | Equation | DGSR | NGGP
R-1 | (x_1 + 1)^3 / (x_1^2 - x_1 + 1) | 0 | 0
R-2 | (x_1^5 - 3 x_1^3 + 1) / (x_1^2 + 1) | 0 | 0
R-3 | (x_1^6 + x_1^5) / (x_1^4 + x_1^3 + x_1^2 + x_1 + 1) | 20 | 0
R-1* | (x_1 + 1)^3 / (x_1^2 - x_1 + 1) | 100 | 100
R-2* | (x_1^5 - 3 x_1^3 + 1) / (x_1^2 + 1) | 90 | 60
R-3* | (x_1^6 + x_1^5) / (x_1^4 + x_1^3 + x_1^2 + x_1 + 1) | 100 | 100
Average recovery rate (A_Rec %) | | 51.66 ± 7.23 | 43.33 ± 5.06

Livermore problem set. The average recovery rates on the Livermore problem set (Mundhenk et al., 2021) can be seen in Table 10. DGSR achieves a similar performance to NGGP. Again, we benchmarked against NGGP directly as it is currently state-of-the-art on the Livermore problem set, and refer the reader to Mundhenk et al. (2021) for other benchmark algorithms on this problem set.

I BENCHMARK PROBLEM DETAILS

Standard symbolic regression benchmark problems. Details of the standard symbolic regression benchmark problem sets that we compared against in Appendix H are tabulated in Table 32 and Table 33. Specifically, most standard symbolic regression benchmarks use the token library L_Koza = {+, -, ÷, ×, x_1, exp, log, sin, cos} and have a defined variable domain X and sampling specification for each problem (e.g., Table 32). We follow the same setup as Petersen et al. (2020). Feynman problem sets. We use equations from the Feynman Symbolic Regression Database (Udrescu & Tegmark, 2020) to provide more challenging equations of multiple variables. These are derived from the Feynman Lectures on Physics (Feynman et al., 1965) and also specify the domain X of the variables. We filtered these equations to those whose tokens exist within the standard library set L_Koza = {+, -, ÷, ×, x_1, exp, log, sin, cos}, excluding the variable tokens (e.g., {x_1, x_2, . . . }). We randomly selected a subset of these equations with two variables (labelled Feynman d = 2), and a further, more challenging subset with five variables (labelled Feynman d = 5). Details of the Feynman benchmark problem sets are tabulated in Tables 34 and 35. We note that the token library does not include the "const" token; however, equations that do have numeric constants can still be recovered with the provided token library, e.g., Feynman-6 can be recovered by x_1(x_2 × x_2) / (x_1(x_2 × x_2) + x_1(x_2 × x_2)). Synthetic d = 12 problem set. We use the same equation generation framework discussed in Appendix J to synthetically generate a problem set of equations with d = 12 variables. We use the token library set L_Synth = {+, -, ÷, ×, x_1, . . . , x_12}. We note that the size of the library set L_Synth is 16, which is greater than the size of the standard library set L_Koza of 9; this creates an exponentially larger symbolic regression search space, making this benchmark problem set more challenging. Details of the Synthetic d = 12 problem set are tabulated in Table 36. When generating the equations for this problem set, we set the number of leaves of the generated equations to l_max = 25, l_min = 10. SRBench. See Appendix S for details of the SRBench (La Cava et al., 2021) dataset.

Table 10: Average recovery rate (A_Rec %) on the Livermore problem set with 95% confidence intervals. Averaged over κ = 10 random seeds.

Benchmark | Equation | DGSR | NGGP
Livermore-1 | 1/3 + x_1 + sin(x_1^2) | 60 | 100
Livermore-2 | sin(x_1^2) cos(x_1) - 2 | 100 | 100
Livermore-3 | sin(x_1^3) cos(x_1^2) - 1 | 100 | 100
Livermore-4 | log(x_1 + 1) + log(x_1^2 + 1) + log(x_1) | 100 | 100
Livermore-5 | x_1^4 - x_1^3 + x_1^2 - x_2 | 50 | 20
Livermore-6 | 4 x_1^4 + 3 x_1^3 + 2 x_1^2 + x_1 | 90 | 90
Livermore-7 | sinh(x_1) | 0 | 0
Livermore-8 | cosh(x_1) | 0 | 0
Livermore-9 | x_1^9 + x_1^8 + x_1^7 + x_1^6 + x_1^5 + x_1^4 + x_1^3 + x_1^2 + x_1 | 30 | 10
Livermore-10 | 6 sin(x_1) cos(x_2) | 40 | 0
Livermore-11 | x_1^2 x_2^2 / (x_1 + x_2) | 100 | 100
Livermore-12 | x_1^5 / x_2^3 | 100 | 100
Livermore-13 | x_1^{1/3} | 100 | 100
Livermore-14 to Livermore-19 | (rows garbled in the source and not reproduced) | |
Livermore-20 | exp(-x_1^2) | 100 | 100
Livermore-21 | x_1^8 + x_1^7 + x_1^6 + x_1^5 + x_1^4 + x_1^3 + x_1^2 + x_1 | 100 | 100
Livermore-22 | exp(-0.5 x_1^2) | 10 | 90
Average recovery rate (A_Rec %) | | 71.36 ± 9.82 | 73.63 ± 7.26
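The token-library filtering used to build the Feynman problem sets in this appendix (keeping only equations whose operators lie in L_Koza, ignoring variable and integer tokens) can be sketched as follows; the prefix-token representation and helper names are illustrative assumptions:

```python
# Koza operator library; token spelling ("/" for division, "*" for
# multiplication) is an assumption of this sketch.
KOZA = {"+", "-", "/", "*", "exp", "log", "sin", "cos"}

def is_variable(tok):
    # variable tokens are assumed to look like "x1", "x2", ...
    return tok.startswith("x") and tok[1:].isdigit()

def is_number(tok):
    try:
        float(tok)
        return True
    except ValueError:
        return False

def in_library(prefix_tokens, library=KOZA):
    """True iff every operator token of the equation is in the library."""
    return all(t in library or is_variable(t) or is_number(t) for t in prefix_tokens)

candidates = [
    ["*", "x1", "cos", "x2"],   # only Koza tokens -> kept
    ["+", "x1", "tanh", "x2"],  # tanh is outside the library -> dropped
]
kept = [eq for eq in candidates if in_library(eq)]
```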

J DATASET GENERATION AND TRAINING

To construct the pre-training set {D^{(j)}}_{j=1}^m, we use the pioneering equation generation method of Lample & Charton (2019), which is further extended by Biggio et al. (2021) to generate equations with constants. This equation generation framework allows us to generate equations that correspond to an input library of tokens from a particular benchmark problem, f^{(j)} ∼ p(f), ∀j ∈ [1 : m], where we sample m = 100K equations. For each equation f^{(j)}, we further obtain a dataset D^{(j)} = {(y_i^{(j)}, X_i^{(j)})}_{i=1}^{n^{(j)}} by evaluating f^{(j)} on n^{(j)} random points in X, i.e., y_i^{(j)} = f^{(j)}(X_i^{(j)}). We define the variable domain X from the dataset specification of the problem set and use the most common specification for X from that problem set (e.g., for the Feynman d = 2 problem set we use U(1, 5, 20), as defined in Table 34). This allows us to pre-train a deep conditional generative model p_θ(f|D) to encode a particular prior p(f) for a specific library of tokens and X. For each problem set encountered that has a different library of tokens or X, we generated a pre-training set and pre-trained a conditional generative model for it. How to specify p(f). The user can specify p(f) by first selecting the specific library of tokens they wish to generate equations with, which includes the maximum number of possible variables that could appear in the equations, e.g., d = 5 includes the tokens {x_1, . . . , x_5}. The framework of Lample & Charton (2019) generates equation trees, where each randomly generated equation tree has a pre-specified maximum number of leaves l_max and minimum number of leaves l_min; we set l_max = 5, l_min = 3 unless otherwise specified. Secondly, each non-leaf node is sampled following a user-specified unnormalized weighted distribution over the operators, and we use the one shown in Table 11. Following Lample & Charton (2019); Biggio et al.
(2021), each leaf node has a 0.8 probability of being an input variable and a 0.2 probability of being an integer. We constrain equation trees that contain a variable of a higher dimension to also contain the lower-dimensional variables, e.g., if x_5 is present, then we require x_1, . . . , x_4 to also be present in the equation tree. The tree is traversed in pre-order to produce an equation traversal in prefix notation, which is then converted to infix notation and parsed with Sympy (Meurer et al., 2017) to generate a functional equation f^{(j)}. If we desire to generate equations with arbitrary numeric constants, we can further modify the equation to include numeric constant placeholders, which can be filled by sampling values from a defined distribution (e.g., a uniform distribution U(-1, 1)) (Biggio et al., 2021). Furthermore, we store equations as functions, to allow the input support variable points to be re-sampled during pre-training. This partially pre-generated set allows for faster generation of pre-training data for the mini-batches (Biggio et al., 2021). We further drop any generated dataset D that contains NaNs, which can arise from invalid operations (e.g., taking the logarithm of negative values).

Table 11: We use the following unnormalized weighted distribution when sampling non-leaf nodes in the equation generation framework of Lample & Charton (2019).

Training hyperparameters. We used a learning rate of 0.001 and ϵ = 0.02 for the risk-seeking quantile parameter. We note that we use the same GP component hyperparameters as in NGGP (Mundhenk et al., 2021). Hyperparameter selection. Unless stated otherwise we used the same hyperparameters as Mundhenk et al. (2021). We did not carry out an extensive hyperparameter optimization routine, as done by Petersen et al. (2020); Mundhenk et al. (2021) (due to limited compute available); rather, we tuned the learning rate only, over a grid search of {0.1, 0.0025, 0.001, 0.0001}.
Empirically, a learning rate of 0.001 performs best in pre-training and inference for DGSR, tuned on the Feynman-2 benchmark problem, and is therefore used throughout unless otherwise stated. To ensure fair comparison we also tuned the learning rate of NGGP over the same grid search, empirically observing that 0.0025 performs best; it is used throughout, unless otherwise stated. A further description of the evaluation metrics and associated error bars is given in Appendix K. Compute details. This work was performed using an Intel Core i9-12900K CPU @ 3.20GHz, 64GB RAM, with an Nvidia RTX3090 GPU 24GB. Pre-training the conditional generator took on average 5 hours.
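The equation generation procedure of this appendix (random trees with l_min to l_max leaves, weighted operator sampling, leaves that are variables with probability 0.8 and integers with probability 0.2, emitted as a pre-order prefix traversal) can be sketched as below; the operator weights are placeholders, not the actual values of Table 11:

```python
import random

# Illustrative operator weights; the paper's actual unnormalized weights
# are those of Table 11 and are not reproduced here.
OP_WEIGHTS = {"+": 10, "-": 5, "*": 10, "/": 5, "sin": 2, "cos": 2, "log": 2, "exp": 2}
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "sin": 1, "cos": 1, "log": 1, "exp": 1}

def generate_prefix(rng, n_vars=2, l_min=3, l_max=5):
    """Grow a random equation tree with l_min..l_max leaves and emit its
    pre-order (prefix) traversal."""
    n_leaves = rng.randint(l_min, l_max)  # inclusive bounds
    ops, weights = list(OP_WEIGHTS), list(OP_WEIGHTS.values())

    def leaf():
        # Leaves are variables w.p. 0.8 and small integers w.p. 0.2.
        if rng.random() < 0.8:
            return "x%d" % rng.randint(1, n_vars)
        return str(rng.randint(1, 5))

    def grow(leaves):
        if leaves == 1:
            return [leaf()]
        op = rng.choices(ops, weights=weights)[0]
        if ARITY[op] == 1:
            return [op] + grow(leaves)       # unary node keeps the leaf budget
        split = rng.randint(1, leaves - 1)   # divide the leaf budget
        return [op] + grow(split) + grow(leaves - split)

    return grow(n_leaves)

expr = generate_prefix(random.Random(0))
# `expr` can then be converted to infix notation and parsed with Sympy,
# as in the pipeline described above.
```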

K EVALUATION METRICS

In the following we discuss each evaluation metric in further detail. Recovery rate (A_Rec %): the percentage of runs where the true equation f* was found, over a set number κ of random seed runs (Petersen et al., 2020). This uses the strictest definition of symbolic equivalence, checked by a computer algebra system (Meurer et al., 2017). Specifically, this checks for equivalence of both the functional form and any numeric constants, if present. Additionally, we quote the average 95% confidence intervals for a problem set, by computing the 95% interval for each problem across the random seed runs and then averaging across the set of problems included in a benchmark problem set. We note that recovery rate is the stricter symbolic regression metric, as it checks for exact equation equivalence. Other symbolic regression methods have proposed to use in-distribution accuracy, where the predicted outputs of the equation are within a percentage of the true observed y values, and similarly out-of-distribution accuracy (Biggio et al., 2021; Petersen et al., 2020). Naturally, if we find the correct true equation f* for a given problem, then all other evaluation metrics that measure fit are satisfied to a perfect score: a test MSE of 0.0, a test extrapolation MSE of 0.0, a coefficient of determination R²-score of 1.0 and an accuracy to tolerance τ of 100%, both in distribution and out of distribution. Equation evaluations. We also evaluate the average number of equation evaluations γ until the true equation f* is found. We use this metric as a proxy for computational complexity across the benchmark algorithms, as testing many generated equations is a bottleneck in SR (Biggio et al., 2021; Kamienny et al., 2022).
For example, analysing the standard symbolic regression benchmark problem Nguyen-7c, DGSR finds the true equation in γ = 20,187 equation evaluations, taking a total of 1 minute and 36.9 seconds, whereas NGGP finds the true equation in γ = 30,112 equation evaluations, taking a total of 2 minutes and 20.5 seconds; both results are averaged over κ = 10 random seeds. Pareto front with complexity. We use the Pareto front and corresponding complexity definition as detailed in Petersen et al. (2020). Included here for completeness, the Pareto front is computed using the simple complexity measure C(f) = Σ_{i=1}^{|f|} c(f_i), where c assigns a complexity to each token: 1 for +, -, ×, input variables and numeric constants; 2 for ÷; 3 for sin and cos; and 4 for exp and log.
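A sketch of the complexity measure above and of extracting a Pareto front over (complexity, test NMSE) pairs; the front-extraction routine is a generic implementation, not the authors' code:

```python
# Per-token complexities as defined above.
TOKEN_COST = {"+": 1, "-": 1, "*": 1, "/": 2, "sin": 3, "cos": 3, "exp": 4, "log": 4}

def complexity(prefix_tokens):
    """C(f) = sum over tokens of c(f_i); variables and numeric constants cost 1."""
    return sum(TOKEN_COST.get(t, 1) for t in prefix_tokens)

def pareto_front(candidates):
    """Keep a (complexity, nmse) candidate iff no other candidate is at
    least as simple and fits strictly better."""
    front = []
    for c, e in sorted(candidates):      # ascending complexity, then error
        if not front or e < front[-1][1]:
            front.append((c, e))
    return front

cands = [(3, 0.9), (5, 0.2), (5, 0.5), (8, 0.25), (10, 0.0)]
front = pareto_front(cands)  # -> [(3, 0.9), (5, 0.2), (10, 0.0)]
```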

L FEYNMAN D=2 RESULTS

Feynman d = 2 recovery rates. Average recovery rates on the Feynman d = 2 problem set can be seen in Table 12. The corresponding inference equation evaluations on the Feynman d = 2 problem set can be seen in Table 13. Noise ablation. We empirically observe that DGSR's average recovery rate also decreases with increasing noise in the observations D, as is to be expected and as observed for other symbolic regression methods (Petersen et al., 2020). We show a comparison of DGSR against NGGP when corrupting the observations as y_i = f*(X_i) + ϵ_i, ∀i ∈ [1 : n], where ϵ_i ∼ N(0, α y_RMS), with y_RMS = sqrt((1/n) Σ_{i=1}^n y_i^2). That is, the standard deviation is proportional to the root-mean-square of y. In Figure 6 we vary the proportionality constant α from 0 (noiseless) to 0.1 and evaluate DGSR and NGGP across all the problems in the Feynman d = 2 benchmark problem set. Figure 6 is plotted using a 10-fold larger training dataset, following Petersen et al. (2020). On average, DGSR performs at least as well as NGGP with noisy data, if not better. Data sub-sample ablation with noise. We empirically performed a data sub-sample ablation with a small noise level of α = 0.001, varying the inference training data samples from n = 2 to n = 20, with the results tabulated in Table 14 and plotted in Figure 7. Empirically this suggests DGSR can leverage the encoded prior information in settings of noise and low data samples, as could be observed in real-world datasets.
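The noise model above can be sketched as:

```python
import numpy as np

def add_proportional_noise(y, alpha, rng):
    """Corrupt targets as y_i + eps_i with eps_i ~ N(0, alpha * y_rms),
    where y_rms = sqrt(mean(y^2)): the noise standard deviation is
    proportional to the root-mean-square of y."""
    y_rms = np.sqrt(np.mean(y ** 2))
    return y + rng.normal(0.0, alpha * y_rms, size=y.shape)

rng = np.random.default_rng(0)
y = rng.uniform(1.0, 5.0, size=1000)   # stand-in for f*(X_i) evaluations
y_noisy = add_proportional_noise(y, alpha=0.1, rng=rng)
```

Setting alpha = 0 recovers the noiseless case, matching the ablation sweep from 0 to 0.1.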

M FEYNMAN-7 EQUIVALENT EQUATIONS

Exploiting equation equivalences. Figure 3(a) shows that DGSR is able to correctly capture equation equivalences, and exploits these to generate many unique equivalent true equations. Although DGSR generates many equations equivalent to the true equation f*, we tabulate only the first 64 of these in Table 37. We note that all these equations are equivalent, achieving zero test NMSE, and can be simplified into f*. We modified the standard experiment setting to avoid early stopping once the true equation was found, and record only equations that have a unique form yet are equivalent to f*. Note that the true equation is f* = (3/2) x_1 x_2 and, using the defined benchmark token library set, the first shortest equivalent equation to f* is x_1 (x_2 + (x_2 × x_2) / (x_2 + x_2)).
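The equivalence of this shortest form to f* can be verified with a computer algebra system; the paper checks symbolic equivalence with Sympy (Meurer et al., 2017):

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
f_true = sp.Rational(3, 2) * x1 * x2
# First shortest equivalent form under the benchmark token library
# (which contains no fractional numeric constants):
f_found = x1 * (x2 + (x2 * x2) / (x2 + x2))
difference = sp.simplify(f_true - f_found)  # simplifies to 0
```

Since x_2 + x_2^2 / (2 x_2) = (3/2) x_2, the two expressions are symbolically identical, which is the strict equivalence criterion used by the recovery rate metric.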

N FEYNMAN D=5 PARETO FRONT EQUATIONS

Finding accurate and simple equations. As shown in Figure 8, for the most challenging equations to recover, DGSR can still find equations that are accurate and simple, i.e., having a low test NMSE and low complexity. The equations analyzed in the Pareto fronts in Figure 8 were chosen because none of the symbolic regression methods were able to recover them exactly. We note that a good symbolic regression method should determine concise, simple (low-complexity) and best-fitting equations; it is undesirable to over-fit with a high-complexity equation of many terms that fails to generalize well. Feynman-8 Pareto front. We tabulate five of the many equations along the Pareto front in Figure 8(a) for the Feynman-8 problem, with DGSR equations in Table 15, NGGP equations in Table 16 and GP equations in Table 17. For completeness we duplicate part of Figure 4 here as Figure 8. Feynman-13 Pareto front. We tabulate five of the many equations along the Pareto front in Figure 8(b) for the Feynman-13 problem, with DGSR equations in Table 18, NGGP equations in Table 19 and GP equations in Table 20.

O FEYNMAN D=5 RESULTS

Feynman d = 5 recovery rates. Average recovery rates on the Feynman d = 5 problem set can be seen in Table 21 . The corresponding inference equation evaluations on the Feynman d = 5 problem set can be seen in Table 22 . 

P ADDITIONAL FEYNMAN RESULTS

The average recovery rates on an additional AI Feynman problem set (Udrescu & Tegmark, 2020) of 32 equations can be seen in Table 23. DGSR achieves a similar performance to NGGP. To construct this problem set, we filtered all the Feynman equations to those whose tokens are inside the Koza token library (detailed in Appendix I). DGSR can run on any token set; however, changing the token set requires specifying the new supported token set and pre-training a new conditional generative model for it (if not done so already).

Q SYNTHETIC D=12 RESULTS

The recovery rates for the Synthetic d = 12 problem set are tabulated in Table 24. The specification and generation details are in Appendix I. We empirically investigated attempting to recover the equations with linear regression on polynomial features (testing both full and sparse, i.e., lasso, linear regression); however, we found this was not possible, and predictions from a linear model had a significantly higher test NMSE compared to the DGSR and NGGP methods. We observe that DGSR can find the true equation when searching through this challenging, very large equation space. Note that to generate equations of a suitable length we modified the maximum length of tokens that can be generated to 256 for all methods, i.e., DGSR, NGGP and GP.
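A sketch of the linear-regression-with-polynomial-features baseline, here implemented with a plain least-squares fit rather than the exact tooling used for the experiments; the toy target is deliberately chosen so the baseline CAN fit it, unlike the Synthetic d = 12 equations:

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(X, degree):
    """All monomials of total degree <= degree, plus a bias column."""
    n, d = X.shape
    cols = [np.ones(n)]
    for k in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), k):
            cols.append(np.prod(X[:, idx], axis=1))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
y = X[:, 0] * X[:, 1] + X[:, 2]        # toy polynomial target
Phi = poly_features(X, degree=2)       # 1 bias + 3 linear + 6 quadratic = 10
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
nmse = np.mean((Phi @ w - y) ** 2) / np.var(y)
```

On non-polynomial targets (or targets of degree beyond the feature budget) this baseline leaves a large residual NMSE, which is the failure mode reported above.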

R USING DIFFERENT OPTIMIZER RESULTS

DGSR can be used with different PG optimizers. DGSR can be used with other policy gradient optimization methods. We ablated the optimizer by switching off the GP component for both DGSR and NGGP, which makes the optimizer similar to that of Petersen et al. (2020) using PQT. Empirically we observe that without the GP component the average recovery rate decreases, as others have shown for this optimizer (Mundhenk et al., 2021). However, we still observe that DGSR has a higher average recovery rate than NGGP when both do not use the genetic programming component, whilst having a significantly lower number of equation evaluations, as shown in Table 25.

S SRBENCH RESULTS

We also evaluate DGSR on SRBench (La Cava et al., 2021). While we would have liked to perform a more comprehensive evaluation on all unique equations in SRBench with additional noise, we unfortunately do not have the resources of multiple tens of GPUs. However, we believe that these combined results are strong enough to support all the major claims in this paper. SRBench ground truth unique equations. We observe in the zero-noise case on the SRBench ground-truth unique equations (ODE-Strogatz and Feynman) that DGSR achieves the highest symbolic recovery (solution) rate among the provided baselines, at 63.25%, which is significant compared to the second-best SRBench baseline, AI Feynman, at 52.65%, as shown in Figures 9 and 10. Furthermore, analyzing the metric of equation accuracy, defined by R²_test > 0.99 (i.e., the R² metric on the test samples is greater than 0.99), DGSR remains competitive, placing 4th with a mean equation accuracy rate of 90.94%. The three better-performing methods are all genetic programming methods, MRGP, Operon and SBP-GP, with mean accuracy rates of 96.13%, 93.92% and 93.65%, respectively.
Furthermore, on the same unique equations, as shown in Figure 11, DGSR has the lowest simplified equation complexity for the highest comparative accuracy (equation complexity - MRGP: 157.62, Operon: 41.18) and the lowest inference time in seconds for the highest comparative accuracy (inference time - DGSR: 706.94s, MRGP: 13,665.00s, Operon: 1,874.91s, SBP-GP: 27,713.85s). We highlight that it is possible to fit a more accurate equation.

[Table rows garbled in extraction: per-problem equations and recovery rates for the additional Feynman problems (Feynman-A-1 to Feynman-A-32) and the Synthetic problems; full problem specifications are given in Tables 35 and 36.]

T LOCAL OPTIMA

We discuss the two understood sources of local optima in the symbolic regression literature: (1) skeleton equation local optima (Mundhenk et al., 2021) and (2) numerical constant local optima (Kamienny et al., 2022).

(1) DGSR is specifically helped to avoid getting stuck in skeleton equation local optima, as it is optimized at inference with a combined policy-gradient and genetic programming optimization algorithm, the neural guided priority queue training (NGPQT) of Mundhenk et al. (2021), detailed in Appendix C. Mundhenk et al. (2021) hypothesize that the improved performance over gradient-based training methods is due to the genetic programming component providing "fresh" new samples that help the optimization method escape local optima. We also observe this increase in performance.

(2) Like many existing works, the current DGSR suffers from local optima of the numerical constants. DGSR uses the same numerical-optimizer setup as Petersen et al. (2020): it first uses an initial guess of 1.0 for each constant, and then further refines the constants with the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. However, the recent seminal work of Kamienny et al. (2022) proposes a solution to mitigate this issue, and their approach could be incorporated into future versions of DGSR.
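The constant-optimization step described above can be sketched as follows. This is a minimal illustration, not the DGSR implementation: `Skeleton` is a hypothetical stand-in for a sampled skeleton equation, constants are initialized to 1.0, and SciPy's BFGS refines them against the NMSE. For the linear-in-constants toy skeleton below the objective is convex; for general nonlinear skeletons, BFGS can stall in exactly the constant local optima discussed above.

```python
import numpy as np
from scipy.optimize import minimize


class Skeleton:
    """Hypothetical skeleton f(x) = c0 * x + c1 * x^2 with two constants."""
    n_consts = 2

    def __call__(self, X, c):
        return c[0] * X[:, 0] + c[1] * X[:, 0] ** 2


def fit_constants(skeleton, X, y):
    """Initialize each constant to 1.0 and refine with BFGS on the NMSE."""
    c0 = np.ones(skeleton.n_consts)  # initial guess of 1.0 per constant

    def nmse(c):
        residual = skeleton(X, c) - y
        return np.mean(residual ** 2) / np.var(y)

    res = minimize(nmse, c0, method="BFGS")
    return res.x, res.fun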

Y ADDITIONAL SYNTHETIC EXPERIMENTS

To provide further empirical results for DGSR, we synthetically generated ω equations using the concise and seminal equation generation framework of Lample & Charton (2019), which is the same procedure used to generate the pre-training dataset equations. We tabulate these additional synthetic results for dimensions d = {2, 5} in Table 30. To be consistent with the prior work of Kamienny et al. (2022), which uses this same experimental setup, we followed their protocol, which evaluates each synthetic equation for only one random seed. Here, we provide additional evaluation metrics that are computed out of distribution; that is, we sample new points $X \sim \mathcal{X}$ to form an out-of-distribution test set $\{f^*(X), X\}$. Specifically, we further include the following metrics: the coefficient of determination $R^2$ score (La Cava et al., 2021), test NMSE, and accuracy to tolerance $\tau$ (Biggio et al., 2021; Kamienny et al., 2022). The coefficient of determination $R^2$ score is defined as

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.$$

Similarly, the accuracy to tolerance $\tau$ is defined as (Kamienny et al., 2022)

$$\mathrm{Acc}_\tau = \mathbb{1}\left( \max_{1 \le i \le n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \le \tau \right), \qquad (7)$$

where $\mathbb{1}$ is the indicator function. We also provide a similar additional synthetic equation ablation, pre-training DGSR with a smaller number of input variables (d = 2) than the number seen at inference time (d = 5), thereby further providing empirical evidence for the property of generalizing to unseen input variables (P3). This is detailed in Table 31.

Name Equation Dataset Library

Nguyen-1    x_1^3 + x_1^2 + x_1    U(-1, 1, 20)    L_Koza
Nguyen-2    x_1^4 + x_1^3 + x_1^2 + x_1    U(-1, 1, 20)    L_Koza
Nguyen-3    x_1^5 + x_1^4 + x_1^3 + x_1^2 + x_1    U(-1, 1, 20)    L_Koza
Nguyen-4    x_1^6 + x_1^5 + x_1^4 + x_1^3 + x_1^2 + x_1    U(-1, 1, 20)    L_Koza
Nguyen-5    sin(x_1^2) cos(x_1) - 1    U(-1, 1, 20)    L_Koza
Nguyen-6    sin(x_1) + sin(x_1 + x_1^2)    U(-1, 1, 20)    L_Koza
Nguyen-7    log(x_1 + 1) + log(x_1^2 + 1)    U(0, 2, 20)    L_Koza
Nguyen-8    sqrt(x_1)    U(0, 4, 20)    L_Koza
Nguyen-9    sin(x_1) + sin(x_2^2)    U(0, 1, 20)    L_Koza ∪ {x_2}
Nguyen-10    2 sin(x_1) cos(x_2)    U(0, 1, 20)    L_Koza ∪ {x_2}
Nguyen-11    x_1^{x_2}    U(0, 1, 20)    L_Koza ∪ {x_2}
Nguyen-12    x_1^4 - x_1^3 + (1/2) x_2^2 - x_2    U(0, 1, 20)    L_Koza ∪ {x_2}
Nguyen-1c    3.39 x_1^3 + 2.12 x_1^2 + 1.78 x_1    U(-1, 1, 20)    L_Koza ∪ {const}
Nguyen-5c    sin(x_1^2) cos(x_1) - 0.75    U(-1, 1, 20)    L_Koza ∪ {const}
Nguyen-7c    log(x_1 + 1.4) + log(x_1^2 + 1.3)    U(0, 2, 20)    L_Koza ∪ {const}
Nguyen-8c    sqrt(1.23 x_1)    U(0, 4, 20)    L_Koza ∪ {const}
Nguyen-10c    sin(1.5 x_1) cos(0.5 x_2)    U(0, 1, 20)    L_Koza ∪ {x_2, const}
R-1    (x_1 + 1)^3 / (x_1^2 - x_1 + 1)    E(-1, 1, 20)    L_Koza
R-2    (x_1^5 - 3 x_1^3 + 1) / (x_1^2 + 1)    E(-1, 1, 20)    L_Koza
R-3    (x_1^6 + x_1^5) / (x_1^4 + x_1^3 + x_1^2 + x_1 + 1)    E(-1, 1, 20)    L_Koza
R-1*    (x_1 + 1)^3 / (x_1^2 - x_1 + 1)    E(-10, 10, 20)    L_Koza
R-2*    (x_1^5 - 3 x_1^3 + 1) / (x_1^2 + 1)    E(-10, 10, 20)    L_Koza
R-3*    (x_1^6 + x_1^5) / (x_1^4 + x_1^3 + x_1^2 + x_1 + 1)    E(-10, 10, 20)    L_Koza

Feynman-13    x_1 (e^{x_2 x_3 / (x_4 x_5)} - 1)    U(1, 5, 50)    L_Koza ∪ {x_2, x_3, x_4, x_5}
Feynman-14    x_5 x_1 x_2 (1/x_4 - 1/x_3)    U(1, 5, 50)    L_Koza ∪ {x_2, x_3, x_4, x_5}
Feynman-15    x_1 (x_2 + x_3 x_4 sin(x_5))    U(1, 5, 50)    L_Koza ∪ {x_2, x_3, x_4, x_5}

Table 35: Additional Feynman benchmark problem specifications. Input variables can be x_1, . . . , x_9. U(a, b, c) corresponds to c random points uniformly sampled between a and b for each input variable separately, where the training and test datasets use different random seeds. Where L_Koza = {+, -, ÷, ×, x_1, exp, log, sin, cos}.

Feynman-A-4    x_1 x_4 + x_2 x_5 + x_3 x_6    U(1, 5, 60)    L_Koza ∪ {x_2, . . . , x_6}
Feynman-A-5    x_1 (1 + x_5 x_6 cos(x_4) / (x_2 x_3))    U(1, 3, 60)    L_Koza ∪ {x_2, . . . , x_6}
Feynman-A-6    x_1 (1 + x_3) x_2    U(1, 5, 60)    L_Koza ∪ {x_2, . . . , x_6}
Feynman-A-7    x_1 x_4 / (x_2 x_3)    U(1, 5, 40)    L_Koza ∪ {x_2, . . . , x_4}
Feynman-A-12    x_1 (cos(x_2 x_3) + x_4 cos(x_2 x_3)^2)    U(1, 3, 40)    L_Koza ∪ {x_2, . . . , x_4}
Feynman-A-13    -x_1 x_2 x_3 / x_4    U(1, 5, 40)    L_Koza ∪ {x_2, . . . , x_4}
Feynman-A-14    (x_1 x_3 + x_2 x_4) / (x_1 + x_2)    U(1, 5, 30)    L_Koza ∪ {x_2, x_3}
Feynman-A-25    x_1 (1 + x_2 cos(x_3))    U(1, 5, 30)    L_Koza ∪ {x_2, x_3}
Feynman-A-26    1 / (1/x_1 + x_3/x_2)

Name Equation Dataset Library

Synthetic-1    x_12 + x_9 (x_10 + x_11) + x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 x_8    U(-1, 1, 120)    L_Synth
Synthetic-2    x_10 + x_11 + x_12 + x_3 (x_1 + x_2) + x_4 x_5 + x_6 + x_7 + x_8 + x_9    U(-1, 1, 120)    L_Synth
Synthetic-3    x_10 + x_9 (x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8) + x_11 + x_12    U(-1, 1, 120)    L_Synth
Synthetic-4    x_8 (x_6 + x_7) - (x_10 + x_11 x_12 + x_9) x_1 + x_2 + x_3 + x_4 + x_5    U(-1, 1, 120)    L_Synth
Synthetic-5    x_10 + x_11 + x_12 + x_9 (x_1 + x_2) x_3 + x_4 + x_5 + x_6 + x_7 + x_8    U(-1, 1, 120)    L_Synth
Synthetic-6    x_1 (x_10 - x_11) x_12 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9    U(-1, 1, 120)    L_Synth
Synthetic-7    x_1 x_2 - x_11 (-x_10 + x_6 + x_7) + x_8 - x_9 + x_12 + x_3 + x_4 + x_5    U(-1, 1, 120)    L_Synth

[Table rows garbled in extraction: DGSR-generated equations equivalent to the true equation f* for problem Feynman-7, each simplifying to x_1 x_2.]



We define all acronyms in a glossary in Appendix A. For generality, we note that DGSR can handle datasets of different sample sizes, where t and k are hyperparameters. Additionally, the code is available at https://github.com/samholt/DeepGenerativeSymbolicRegression, and a broader research group codebase is available at https://github.com/vanderschaarlab/DeepGenerativeSymbolicRegression. The EWMA baseline is defined as $b_t = \alpha \, \mathbb{E}[R(f)] + (1 - \alpha) b_{t-1}$.
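The EWMA baseline recursion can be sketched in a few lines of pure Python; this is a minimal illustration in which the batch mean of rewards stands in for the expectation $\mathbb{E}[R(f)]$:

```python
def ewma_baseline(reward_batches, alpha=0.5, b0=0.0):
    """Exponentially weighted moving-average reward baseline,
    b_t = alpha * mean(R_t) + (1 - alpha) * b_{t-1},
    where mean(R_t) is the batch estimate of E[R(f)] at step t.
    Returns the baseline value after each batch."""
    b = b0
    trace = []
    for batch in reward_batches:  # one list of equation rewards per step
        b = alpha * (sum(batch) / len(batch)) + (1 - alpha) * b
        trace.append(b)
    return trace
```

Subtracting such a baseline from the rewards reduces the variance of the policy-gradient estimate without biasing it.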



Figure 1: The data generating process.


Figure 2: Block diagram of DGSR. DGSR is able to learn the invariances of equations and datasets D (P1) by having both: (1) an encoding architecture that is permutation invariant across the number of samples n in the observed dataset D = {(X_i, y_i)}_{i=1}^n, and (2) a Bayesian-inspired end-to-end NMSE loss function, Eq. 1, from the encoded dataset D to the outputs of the predicted equations, i.e., NMSE(f̂(X), y). The highlighted boundaries show the subset of pre-trained encoder-decoder methods and RL methods.

Figure 3: (a) Number of unique ground truth f*-equivalent equations discovered for problem Feynman-7 (Appendix L), (b) percentage of valid equations generated from a sample of k for problem Feynman-7 (Appendix L), (c) average recovery rate on the Feynman d = 2, Feynman d = 5, and Synthetic d = 12 benchmark problem sets plotted against the number of input variables d.

Figure 4: (a-b) Pareto front of test NMSE against equation complexity. Labelled: (a) Feynman-8, (b) Feynman-13. Ground truth equation complexity is the red line. Equations discovered are listed in Appendix N. (c) Negative log-likelihood of the ground truth equation f* for problem Feynman-7 (Appendix L).

Config    Average recovery rate (%)
Pre-trained dataset (d = 5) = (d_inference = 5)    67.50
Pre-trained dataset (d = 2) < (d_inference = 5)    61.29

Figure 5: Percentage of valid equations generated from a sample of k equations on the Feynman-7 problem (Appendix L), with a different optimizer, that of Petersen et al. (2020).

5 H E U 4 g V M 4 B w + u o A 5 3 0 I A W M E B 4 h l d 4 c x 6 d F + f d + V i 0 F p x 8 5 h j + w P n 8 A c 4 y j O w = < / l a t e x i t > h < l a t e x i t s h a 1 _ b a s e 6 4 = " E y R z w Y h 7 Y f L j o W x t M H e T 7 8 e h x p k = " > A A A C I n i c b V D L S g M x F M 3 U V 6 2 v q k s 3 w S K 4 k D I j x c d G C m 5 c V r A P 6 A x D J s 2 0 o Z k H y R 2 h T O d b 3 P g r b l w o 6 k r w Y 0 y n X d j W A y E n 5 9 y b 5 B 4 v F l y B a X 4 b h Z X V t f W N 4 m Z p a 3 t n d 6 + 8 f 9 B S U S I p a 9 J

e u 8 a l 1 U a / e 1 S v 1 m F k c R H a F j d I o s d I n q 6 A 4 1 U B N R 9 I R e 0 B t 6 N 5 6 N V + P D + J q W F o x Z z y G a g / H z C 7 n 3 p a 8 = < / l a t e x i t > [ f1 , . . . , f| f | ] < l a t e x i t s h a 1 _ b a s e 6 4 = " e P l t H p k S C T R e C Z + 0 p e k R x A G / N M

e a 9 c Z D o 9 a 6 L e I o w x m c w y V 4 c A 0 t u I c 2 d I B A B M / w C m + O c F 6 c d + d j 2 V p y i p l T + A P n 8 w c T w Y 4 / < / l a t e x i t >

Figure 6: Block diagram of DGSR. DGSR is able to learn the invariances of equations and datasets D (P1) by having both: (1) an encoding architecture that is permutation invariant across the number of samples n in the observed dataset D = {(X_i, y_i)}_{i=1}^n, and (2) a Bayesian-inspired end-to-end NMSE loss function, Eq. 1, from the encoded dataset D to the outputs of the predicted equations, i.e., NMSE(f̂(X), y). The highlighted boundaries show the subset of pre-trained encoder-decoder methods and RL methods.

Using the specifications of p(f), we can generate an almost unbounded number of equations. We pre-compile 100K equations and train the conditional generator on these using a mini-batch of t datasets, following Biggio et al. (2021). The overall pre-training algorithm is detailed in Algorithm 1, in Appendix D. Additionally, we construct a validation set of 100 equations using the same pre-training setup with a different random seed, and check for and remove any of the validation equations from the pre-training set. Furthermore, we check for and remove any test problem set equations from the pre-training and validation equation sets. During pre-training we use the vanilla policy gradient (VPG) loss function to train the conditional generator parameters θ. This is detailed in Appendices D and C, and we use the hyperparameters: batch size of k = 500 equations to sample, mini-batch of t = 5 datasets, EWMA coefficient α = 0.5, entropy weight λ_H = 0.003, minimum equation length of 4, maximum equation length of 30, and the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 and an early-stopping patience of 100 iterations (of a mini-batch). Empirically, we observe that early stopping occurs after approximately a set of 10K datasets has been trained on. Furthermore, we use vanilla policy gradients (PG) during pre-training, as this optimizes the average distribution of p_θ(f|D) and was empirically found to perform the best amongst the other possible types of PG methods (e.g., NGPQT or RSPG).

Inference. We detail the inference training routine in Algorithm 2. A training set and a test set are sampled independently from the defined problem equation domain to form a dataset D. The training dataset is used to optimize the loss at inference time, and the test set is only used for evaluation of the best equations found at the end of inference, which, unless the true equation f* is found, runs for 2 million equation evaluations on the training dataset. During inference we use the neural guided PQT (NGPQT) optimization method, outlined in Appendix C. The hyperparameters for inference time are: batch size of k = 500 equations to sample, entropy weight λ_H = 0.003, minimum equation length of 4, maximum equation length of 30, PQT queue size of 10, sample selection size of 1, 25 GP generations per iteration, GP crossover probability of 0.5, GP mutation probability of 0.5, GP tournament size of 5, GP mutate tree maximum of 3, and the Adam optimizer (Kingma & Ba, 2014).
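The priority-queue part of NGPQT can be sketched as follows; this is a minimal stand-alone illustration (not the DGSR implementation, which interleaves this queue maintenance with genetic programming steps and policy-gradient updates on the queue contents):

```python
import heapq


def update_priority_queue(queue, samples, maxsize=10):
    """Maintain the top-`maxsize` (reward, equation) pairs seen so far.

    `queue` is a min-heap, so queue[0] is always the worst retained pair;
    `samples` is an iterable of (reward, equation) pairs from the current
    batch of sampled equations."""
    for reward, eq in samples:
        if len(queue) < maxsize:
            heapq.heappush(queue, (reward, eq))
        elif reward > queue[0][0]:
            # Evict the worst retained equation in favor of a better one.
            heapq.heapreplace(queue, (reward, eq))
    return queue
```

The decoder is then trained on the queue contents each iteration, concentrating gradient signal on the best equations found so far.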

Figure 6: Noise ablation of average recovery rate on the Feynman d = 2 benchmark problem set against increasing noise level. Averaged over 3 random seeds.

Figure 8: (a-b) Pareto front of test NMSE against equation complexity. Labelled: (a) Feynman-8, (b) Feynman-13. Ground truth equation complexity is the red line. Equations discovered are listed in Appendix N. (c) Average recovery rate of Feynman d = 2, Feynman d = 5 and synthetic d = 12 benchmark problems plotted against variable dimension d.

Figure 10: Symbolic recovery rate, split for each problem set type, labelled here as symbolic solution rate (%), on the SRBench ground truth unique equations and the SRBench provided methods. Points indicate the mean test set performance on all ground truth problems, and bars show the 95% confidence interval.

Figure 11: Equation accuracy, plotted here as the accuracy solution rate (R²_test > 0.99), against equation complexity and inference training time on the SRBench ground truth unique equations and the SRBench provided methods. Points indicate the mean test set performance on all ground truth problems, and bars show the 95% confidence interval.

Config    ω    Average recovery rate (%)
Pre-trained dataset (d = 5) = (d_inference = 5)    5,000    84.18
Pre-trained dataset (d = 2) < (d_inference = 5)    3,000    78.93

Table 32: Standard symbolic regression benchmark problem specifications. Input variables can be x_1, x_2. U(a, b, c) corresponds to c random points uniformly sampled between a and b for each input variable separately, where the training and test datasets use different random seeds. E(a, b, c) corresponds to c points evenly spaced between a and b for each input variable, where the training and test datasets use the same points. Where L_Koza = {+, -, ÷, ×, x_1, exp, log, sin, cos}.

Feynman benchmark problem specifications. Input variables can be x_1, . . . , x_5. U(a, b, c) corresponds to c random points uniformly sampled between a and b for each input variable separately, where the training and test datasets use different random seeds. Where L_Koza = {+, -, ÷, ×, x_1, exp, log, sin, cos}.

Feynman-9    x_1 x_2 x_3 log(x_5 / x_4)    U(1, 5, 50)    L_Koza ∪ {x_2, x_3, x_4, x_5}
Feynman-10    x_1 (x_3 - x_2) x_4 / x_5    U(1, 5, 50)    L_Koza ∪ {x_2, x_3, x_4, x_5}
Feynman-A-1    x_3 x_1 x_2 / ((x_5 - x_4)^2 + (x_7 - x_6)^2 + (x_9 - x_8)^2)    U(1, 2, 90)    L_Koza ∪ {x_2, . . . , x_9}
Feynman-A-10    x_1 x_2 x_3 / (2 x_4)    U(1, 5, 40)    L_Koza ∪ {x_2, . . . , x_4}
Feynman-A-11    x_1 x_2 x_4 / x_3    U(1, 5, 40)    L_Koza ∪ {x_2, . . . , x_4}
Feynman-A-16    -x_1 x_2 cos(x_3)    U(1, 5, 30)    L_Koza ∪ {x_2, x_3}
Feynman-A-18    x_1 x_2 x_3    U(1, 5, 30)    L_Koza ∪ {x_2, x_3}
Feynman-A-19    x_1 x_2 x_3^2    U(1, 5, 30)    L_Koza ∪ {x_2, x_3}
Feynman-A-21    (1/x_1 - 1/x_2) x_3    U(2, 5, 30)    L_Koza ∪ {x_2, x_3}
Feynman-A-23    x_1 x_3 / x_2    U(1, 5, 30)    L_Koza ∪ {x_2, x_3}
Feynman-A-27    2 x_1 (1 - cos(x_2 x_3))    U(1, 5, 30)    L_Koza ∪ {x_2, x_3}
Feynman-A-28    x_1 / (x_2 (1 + x_3))    U(1, 5, 30)    L_Koza ∪ {x_2, x_3}
Feynman-A-29    (x_1 x_2 x_3 x_4 x_5 / (4 x_6 sin(x_7 / 2)^2))^2    U(1, 2, 70)    L_Koza ∪ {x_2, . . . , x_7}

Table 36: Synthetic d = 12 benchmark problem specifications. U(a, b, c) corresponds to c random points uniformly sampled between a and b for each input variable separately, where the training and test datasets use different random seeds. Where L_Synth = {+, -, ÷, ×, x_1, . . . , x_12}.

Average recovery rate (A_Rec %) and the average number of equation evaluations γ across the benchmark problem sets, with 95% confidence intervals. Individual rates and equation evaluations are detailed in Appendices L, O, P, and Q. Where: ω is the number of unique equations f* in a benchmark problem set, κ is the number of random seed runs, and d is the number of input variables in the problem set.

DGSR equivalent f * generated equations at inference time, for problem Feynman-7.

DGSR ablation study using average recovery rate (A Rec %) on the Feynman d = 5 benchmark problem set. Where d is the number of input variables.

Glossary of key terms.

Specifically, these are the same equation complexity regularization methods of Mundhenk et al. (2021), using the hierarchical entropy regularizer and the soft length prior from Landajuela et al. (2021). Furthermore, we keep track of the best equation seen, by the highest reward, denoted as f_a.

Comparison of related works. Columns: Learn Eq. Invariances (P1) - can it learn equation invariances? Eff. Inf. Refinement (P2) - can it perform gradient refinement computationally efficiently at inference time (i.e., update the decoder weights)? Generalize unseen vars.? (P3) - can it generalize to input variables unseen during pre-training?

Average recovery rate (A Rec %) on the Nguyen with constants problem set with 95% confidence intervals. Averaged over κ = 10 random seeds.

Average recovery rate (A_Rec %) on the Feynman d = 2 problem set with 95% confidence intervals. Averaged over κ = 40 random seeds.

A_Rec %: 85.36 ± 0.69, 85.71 ± 0.00, 57.14 ± 0.00, 50.00 ± 7.20

We plot the average recovery rate against increasing noise level on the Feynman d = 2 problem set in Figure 6. Following the noise setup of Petersen et al. (2020), we add independent Gaussian noise to the dependent output y, i.e., ỹ_i = y_i + ε_i.
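This noise model can be sketched as follows; a minimal illustration that assumes, as our reading of the Petersen et al. (2020) setup, that the noise standard deviation is the noise level α scaled by the root-mean-square of y:

```python
import numpy as np


def add_noise(y, alpha, rng=None):
    """Add i.i.d. Gaussian noise to the targets y.

    The noise standard deviation is alpha * RMS(y); scaling by the
    root-mean-square of y is an assumption about the exact setup of
    Petersen et al. (2020), stated here for illustration."""
    rng = np.random.default_rng(rng)
    scale = alpha * np.sqrt(np.mean(y ** 2))
    return y + rng.normal(0.0, scale, size=y.shape)
```

Scaling by RMS(y) makes a given α comparable across problems whose outputs differ in magnitude.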

Inference equation evaluations on the Feynman d = 2 benchmark problem set with 95% confidence intervals. DNF = Did Not Find (i.e., the true equation was not recovered within a budget of 2 million equation evaluations). Averaged over κ = 40 random seeds.

Dataset sub-sample ablation with noise of α = 0.001. Average recovery rate (A_Rec %) on the Feynman d = 2 problem set. Averaged over κ = 10 random seeds.

Dataset sub-sample ablation with noise of α = 0.001: average recovery rate (%) on the Feynman d = 2 problem set against dataset size of n samples. Averaged over 10 random seeds.

DGSR equations from the Pareto plot in Figure 8(a) for the Feynman-8 problem.

NGGP equations from the Pareto plot in Figure 8(a) for the Feynman-8 problem.

GP equations from the Pareto plot in Figure 8(a) for the Feynman-8 problem.

DGSR equations from the Pareto plot in Figure 8(b) for the Feynman-13 problem.

(x_2 x_3 / x_5 - x_4 x_5 sin(x_1) / (x_2 x_3)) (x_1 + x_2 - log(x_5)) / (x_4^2 x_5)

NGGP equations from the Pareto plot in Figure 8 (b) for the Feynman-13 problem.

GP equations from the Pareto plot in Figure 8(b) for the Feynman-13 problem.

+ sin(x_4)) (x_3 + x_3 (x_2 / (exp(cos(exp(x_4))) + x_4 / x_3) + x_3) sin(x_5) / x_5^2)

Average recovery rate (A Rec %) on the Feynman d = 5 problem set with 95 % confidence intervals. Averaged over κ = 40 random seeds.

Inference equation evaluations on the Feynman d = 5 benchmark problem set with 95% confidence intervals. DNF = Did Not Find (i.e., the true equation was not recovered within a budget of 2 million equation evaluations).

See also Table 26.

Average recovery rate (A Rec %) on the additional Feynman problem set with 95 % confidence intervals. Averaged over κ = 10 random seeds. Here d is the variable dimension.

Average recovery rate (A_Rec %) on the synthetic d = 12 problem set with 95% confidence intervals. Averaged over κ = 20 random seeds. Here, - indicates that the method was not able to find any true equations; therefore, the average number of equation evaluations until the true equation f* is discovered cannot be estimated.

x_12 + x_9 (x_10 + x_11) + x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 x_8

x_1 x_2 - x_11 (-x_10 + x_6 + x_7) + x_8 - x_9 + x_12 + x_3 + x_4 + x_5

Ablated optimizer where both DGSR and NGGP have their genetic programming component switched off, being similar to the one in Petersen et al. (2020). Top: inference equation evaluations on the Feynman d = 2 benchmark problem set with standard deviations. DNF = Did Not Find (i.e., the true equation was not recovered within a budget of 2 million equation evaluations). Bottom: average recovery rate (A_Rec %) on the Feynman d = 2 benchmark problem set, with 95% confidence intervals, with rates in Table 26. Both averaged over κ = 15 random seeds.

Ablated optimizer where both DGSR and NGGP have their genetic programming component switched off, being similar to the one in Petersen et al. (2020). Average recovery rate (A_Rec %) on the Feynman d = 2 problem set with 95% confidence intervals. Averaged over κ = 15 random seeds.

Symbolic recovery rate, labelled here as symbolic solution rate (%), on the SRBench ground truth unique equations and the SRBench provided methods. Points indicate the mean test set performance on all ground truth problems, and bars show the 95% confidence interval.

DGSR encoder architecture ablation: we change the encoder architecture from a set transformer in DGSR to a plain transformer for the encoder. Average recovery rate (A_Rec %) on the Feynman d = 5 problem set with average inference equation evaluations.

decoder weights at inference time; this can be seen in Table 29, with evaluation metrics evaluated across all the problems in the Feynman d = 5 problem set.

DGSR training ablation: we pre-train using only a cross-entropy loss and then do not perform refinement at inference time, thereby only sampling from the decoder in this ablation. Average recovery rate (A_Rec %) on the Feynman d = 5 problem set with average inference equation evaluations.

DGSR (pre-trained with CE, no refinement at inference)    Average recovery rate (%) A_Rec %: 67.50

Additional synthetic experiments. Where: ω is the number of unique equations f* in a benchmark problem set, R² is the coefficient of determination, Acc_τ is the accuracy to tolerance with τ = 0.05, MSE is the test mean squared error, and d is the number of input variables in the problem set. Here we follow a different symbolic regression experimental setup, that of Kamienny et al. (2022): we synthetically generate a problem set of ω unique equations using the concise equation generator of Lample & Charton (2019), and run each experiment for only one random seed over a large number of unique equations f*, as recommended by Kamienny et al. (2022).

DGSR ablation study using average recovery rate (A Rec %) on a generated synthetic dataset d = 5 benchmark problem set. Where d is the number of input variables.

DGSR equivalent f * generated equations at inference time, for problem Feynman-7.

ACKNOWLEDGEMENTS.

SH would like to acknowledge and thank AstraZeneca for funding. This work was additionally supported by the Office of Naval Research (ONR) and the NSF (grant number 1722516). Moreover, we would like to warmly thank all the anonymous reviewers, alongside members of the van der Schaar lab research group, for their valuable input, comments, and suggestions as the paper was developed, all of which ultimately improved the paper. Furthermore, SH would like to thank G-Research for a small grant.


Ethics Statement. We envisage DGSR as a tool to help human experts discover underlying equations from processes; however, we emphasize that the discovered equations would need to be further verified by a human expert or in an experimental setting. Furthermore, the data used in this work is synthetically generated from given equation problem sets, and no human-derived data was used.

Reproducibility Statement. To ensure reproducibility, we outline in Section 5: (1) the benchmark algorithms used, including their implementation details, hyperparameters, and how they were selected, fully in Appendix G; (2) how we generated the inference datasets for a single equation f in a problem set of ω equations, with full details of the pre-training and inference dataset generation in Appendix J; (3) which benchmark problem sets we used, with full problem set details, including all equations in a problem set, the token set used, and the domain to sample X points from, in Appendix I; (4) the evaluation metrics used and how they are computed over random seed runs, detailed further in Appendix K. Finally, the code is available at https://github.com/samholt/DeepGenerativeSymbolicRegression.

• Appendix T: Local Optima
• Appendix V: Limitations and Open Challenges

As discussed in Appendix T, the approach of Kamienny et al. (2022) could be incorporated in future versions of DGSR. Specifically, we envisage future work using the generator to predict an initial guess of the numerical constants (rather than initializing them as constants); however, we note this is out of scope for the current work, and therefore leave it as future work to implement and build on.

U ENCODER DECODER ARCHITECTURE ABLATION

DGSR can use other encoder-decoder architectures; specifically, it is desirable to satisfy the invariant and equivariant properties outlined in Section 3.1. We perform an ablation, training DGSR with a set transformer encoder and an LSTM RNN decoder instead of the proposed transformer decoder. The results are tabulated in Table 27. We compared this ablated version of DGSR on the Feynman d = 2 problem set.

V LIMITATIONS AND OPEN CHALLENGES

In the following we discuss the limitations, with open challenges.

Complex equations. DGSR may fail to discover highly complex equations. The difficulty in discovering the true equation f* could arise from three factors: (1) a large number of variables is required, (2) the equation f* involves many operators, making the equation long, or (3) the equation f* exhibits a highly nested structure. We note that these settings are inherently difficult, even for human experts; however, they pose an exciting open challenge for future work.

Unobserved variables. DGSR assumes all variables in the true equation f* are observed. Therefore, it would fail to discover a true equation f* with unobserved variables present; however, this setting is challenging, or even impossible, without the use of additional assumptions (Reinbold et al., 2021; Lu et al., 2021). Furthermore, DGSR could still find a concise approximate equation using the observed variables.

Local optima. We detail both sources of local optima in the symbolic regression literature in Appendix T, with respect to DGSR and the future work associated with solving them. At a high level, DGSR is able to overcome skeleton equation local optima by leveraging the genetic programming component, as other methods have shown (Mundhenk et al., 2021). Moreover, DGSR suffers from the same constant-optimization local optima as other methods, although there exists future work to address these, detailed in Appendix T.

W ENCODER ABLATION

We performed an additional encoder ablation, by replacing the set transformer in the encoder-decoder architecture with a plain transformer. We emphasize that we do not recommend using a plain transformer as the encoder for symbolic regression, as it is not permutation invariant across the n input samples in the dataset D (Section 3.1). The ablation results can be seen in Table 28, with evaluation metrics evaluated across all the problems in the Feynman d = 5 problem set.
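The permutation-invariance property motivating the set transformer can be illustrated with toy encoders; below, mean pooling is a simplified stand-in for the set transformer's attention-based pooling, and a position-weighted sum stands in for a plain transformer with positional encodings (both functions are illustrative, not the DGSR architecture):

```python
import numpy as np


def set_encode(D, W):
    """Permutation-invariant encoding: embed each (x_i, y_i) row
    independently, then mean-pool over the n samples, so reordering
    the rows of D cannot change the output."""
    return np.tanh(D @ W).mean(axis=0)


def seq_encode(D, W):
    """Order-sensitive encoding: a position-weighted sum, so the same
    rows in a different order generally produce a different output."""
    h = np.tanh(D @ W)
    pos = np.arange(1, len(D) + 1)[:, None]  # positional weights
    return (h * pos).sum(axis=0)
```

Shuffling the rows of D leaves `set_encode` unchanged but alters `seq_encode`, which is why the plain-transformer encoder must waste capacity learning an invariance the set transformer gets for free.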

X CROSS ENTROPY WITH NO REFINEMENT ABLATION

We also performed a further additional ablation of pre-training our encoder-decoder architecture with a cross-entropy loss and performing no refinement at inference time, only sampling from the decoder.

Name Equation Dataset Library

Livermore-1    1/3 + x_1 + sin(x_1^2)    U(-10, 10, 1000)
x_1^5 / x

