EFFICIENT DISCOVERY OF DYNAMICAL LAWS IN SYMBOLIC FORM

Abstract

We propose a transformer-based sequence-to-sequence model that recovers scalar ordinary differential equations (ODEs) in symbolic form from time-series data of a single observed solution trajectory. Our method scales efficiently: after a one-time pretraining on a large set of ODEs, the governing law of a new observed solution can be inferred in a few forward passes of the model. First, we generate and make available a large dataset of more than 3M ODEs together with more than 63M numerical solutions for different initial conditions, which may serve as a useful benchmark for future work on machine learning for dynamical systems. We then show that our model performs on par with or better than existing methods across various test cases in terms of accurate symbolic recovery of the ODE, especially for more complex expressions. Reliably recovering the symbolic form of dynamical laws is important, as it allows both for further dissemination of the inferred dynamics and for meaningful modifications when making predictions under interventions.

1. INTRODUCTION

Science is commonly described as the "discovery of natural laws through experimentation and observation". Researchers in the natural sciences increasingly turn to machine learning (ML) to aid the discovery of natural laws from observational data alone, which is often abundantly available, hoping to bypass expensive and cumbersome targeted experimentation. While there may be fundamental limitations to what can be extracted from observations alone, recent successes of ML across the natural sciences provide ample reason for excitement. In this work, we focus on ordinary differential equations, a ubiquitous description of dynamical natural laws in physics, chemistry, and systems biology. For a first-order ODE ẏ := ∂y/∂t = f(y, t), we call f (which uniquely defines the ODE) the underlying dynamical law. Informally, our goal is then to infer f in symbolic form given discrete time-series observations of a single solution {y_i := y(t_i)}_{i=1}^n of the underlying ODE. Contrary to "black-box" techniques such as Neural Ordinary Differential Equations (NODE) (Chen et al., 2018) that aim at inferring a possible f as an arguably opaque neural network, we focus specifically on symbolic regression. From the perspective of the sciences, a law of nature is useful insofar as it is more broadly applicable than to merely describe a single observation. In particular, the reason to learn a dynamical law in the first place is to dissect and understand it as well as to make predictions about situations that differ from the observed one. From this perspective, a symbolic representation of the law (in our case the function f) has several advantages over black-box representations: it is compact and directly interpretable, it is amenable to analytical treatment, and it allows for meaningful changes, thus enabling the assessment of interventions and counterfactuals.
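As a minimal illustration of this setup, a discretized solution trajectory {(t_i, y_i)} can be produced by forward-Euler integration of a known dynamical law; the inference task is then to recover the law from the trajectory alone. The choice of f below (logistic growth) and all step-size parameters are illustrative assumptions, not taken from the paper:

```python
def f(y):
    # hypothetical ground-truth dynamical law (logistic growth): y' = 2*y*(1 - y)
    return 2.0 * y * (1.0 - y)

def integrate(f, y0, t0=0.0, t1=10.0, n=1000):
    """Forward-Euler integration; returns the observed trajectory {(t_i, y_i)}."""
    dt = (t1 - t0) / n
    t, y = t0, y0
    traj = [(t, y)]
    for _ in range(n):
        y += dt * f(y)  # Euler step: y_{i+1} = y_i + dt * f(y_i)
        t += dt
        traj.append((t, y))
    return traj

traj = integrate(f, y0=0.1)
# the symbolic recovery task: given only `traj`, infer the string "2*y*(1 - y)"
```

A production pipeline would of course use an adaptive solver (e.g. Runge-Kutta) rather than fixed-step Euler; the sketch only shows what kind of data the model consumes.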
In this work, we develop Neural Symbolic Ordinary Differential Equation (NSODE), a sequence-to-sequence transformer that efficiently infers governing ODEs in symbolic form from a single observed solution trajectory and makes use of massive pretraining. We first (randomly) generate a total of >3M scalar, autonomous, non-linear, first-order ODEs, together with a total of >63M numerical solutions from various (random) initial conditions. All solutions are carefully checked for convergence of the numerical integration. This dataset is unprecedented in both its scale and diversity and will be made publicly available alongside the code that was used to generate it. NSODE then maps observed trajectories, i.e., numeric sequences of the form {(t_i, y_i)}_{i=1}^n, directly to symbolic equations as strings, e.g., "y**2 + 1.64*cos(y)", which is the prediction for f. This example directly highlights the benefit of symbolic representations: the y^2 and cos(y) terms tell us something about the fundamental dynamics of the observed system; the constant 1.64 will have semantic meaning in a given context, and we can, for example, make predictions about settings in which this constant takes a different value. NSODE combines and innovates on technical advances regarding input representations and an efficiently optimizable loss formulation. Our model outperforms scalable existing methods in terms of skeleton recovery, especially on more complex expressions. While other methods still perform better on simple expressions found in (adjusted) existing benchmark datasets (to which these methods have been tuned), as well as on a novel set of simple ODEs that we manually collected from different domains, they are typically orders of magnitude slower than NSODE.
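One convenient property of string-valued predictions is that they can be checked directly against the observed data. The following sketch (not part of NSODE itself; the parser, the finite-difference check, and the example expressions are all assumptions for illustration) scores a predicted skeleton string by comparing it to numerical derivatives of the trajectory:

```python
import math

def parse_f(expr):
    """Turn a predicted skeleton string into a callable f(y).
    (Illustrative only; a real pipeline would use a proper expression parser.)"""
    allowed = {"cos": math.cos, "sin": math.sin, "exp": math.exp, "log": math.log}
    code = compile(expr, "<pred>", "eval")
    return lambda y: eval(code, {"__builtins__": {}}, dict(allowed, y=y))

def trajectory_mse(f_pred, traj):
    """Compare f_pred(y) against finite-difference estimates of y' from the data."""
    err = 0.0
    for (t0, y0), (t1, y1) in zip(traj, traj[1:]):
        dydt = (y1 - y0) / (t1 - t0)   # numerical derivative of the observations
        ymid = 0.5 * (y0 + y1)
        err += (f_pred(ymid) - dydt) ** 2
    return err / (len(traj) - 1)

# synthetic trajectory generated from the (assumed) true law via Euler steps
true_f = parse_f("y**2 + 1.64*cos(y)")
dt, t, y = 1e-3, 0.0, 0.0
traj = [(t, y)]
for _ in range(500):
    y += dt * true_f(y)
    t += dt
    traj.append((t, y))

good = trajectory_mse(parse_f("y**2 + 1.64*cos(y)"), traj)  # correct skeleton
bad = trajectory_mse(parse_f("sin(y)"), traj)               # wrong skeleton
```

The correct skeleton yields a near-zero error while the wrong one does not, which is the kind of signal downstream constant-fitting and ranking steps can exploit.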

2. BACKGROUND AND RELATED WORK

Modeling dynamics and forecasting their behavior has a long history in machine learning. While NODEs (Chen et al., 2018) (with a large body of follow-up work) are perhaps the most prominent approach, their inherent black-box character complicates scientific understanding of the observed phenomena. Recent alternatives such as tractable dendritic RNNs (Brenner et al., 2022) and Neural Operators (Kovachki et al., 2021; Li et al., 2021) set out to facilitate scientific discovery by combining accurate predictions with improved interpretability. A considerable advantage of these and similar approaches is their scalability to high-dimensional systems as well as their relative robustness to noise, missing or irregularly sampled data (Iakovlev et al., 2021), and challenging properties such as multiscale dynamics (Vlachas et al., 2022) or chaos (Park et al., 2022; Patel & Ott, 2022). Despite these advantages, we turn the focus to a different class of models in this paper and look at approaches that explicitly predict a mathematical expression in symbolic form that describes the observed dynamics. Such models are in a sense situated on the other side of the spectrum: while less scalable, the predicted symbolic expression is compact and readily interpretable, so that dynamical properties can be analytically deduced and investigated. A recent benchmark study and overview covering both deep learning-based and symbolic models can be found in Gilpin (2021).

Evolutionary algorithms. Classically, symbolic regression is approached through evolutionary algorithms such as genetic programming (Koza, 1993). Genetic programming randomly evolves a population of prospective mathematical expressions over many iterations and mimics natural selection by keeping only the best contenders across iterations, where superiority is measured by user-defined fitness functions (Schmidt & Lipson, 2009).
Process-based modeling follows a similar approach but includes domain knowledge-informed constraints on particular components of the system in order to reduce the search space to reasonable candidates (Todorovski & Dzeroski, 1997; Bridewell et al., 2008; Simidjievski et al., 2020) .
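The evolve-evaluate-select loop underlying such methods can be sketched in a few lines. This is a deliberately crude toy, not any cited system: the target law, the operator set, and the mutation scheme (replacing an individual with a fresh random tree rather than proper subtree mutation and crossover) are all illustrative assumptions:

```python
import random

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def random_expr(depth=2):
    """Grow a random expression tree over y, constants, and the ops above."""
    if depth == 0 or random.random() < 0.3:
        return "y" if random.random() < 0.5 else round(random.uniform(-2, 2), 2)
    op = random.choice(list(OPS))
    return (op, random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, y):
    if expr == "y":
        return y
    if isinstance(expr, float):
        return expr
    op, a, b = expr
    return OPS[op](evaluate(a, y), evaluate(b, y))

def fitness(expr, data):
    # mean squared error against observed (y, y') samples -- lower is fitter
    return sum((evaluate(expr, y) - dy) ** 2 for y, dy in data) / len(data)

def mutate(expr):
    # crude variation: half the time, replace the individual entirely
    return random_expr() if random.random() < 0.5 else expr

# assumed target law for illustration: y' = 2*y + 1, sampled at a few points
data = [(y, 2 * y + 1) for y in [0.0, 0.5, 1.0, 1.5, 2.0]]
random.seed(0)
pop = [random_expr() for _ in range(50)]
initial_best = min(fitness(e, data) for e in pop)
for gen in range(30):
    pop.sort(key=lambda e: fitness(e, data))        # rank by fitness
    pop = pop[:25] + [mutate(e) for e in pop[:25]]  # keep elite, add variants
final_best = min(fitness(e, data) for e in pop)
```

Because the elite half survives every generation, the best fitness is non-increasing; real genetic-programming systems add subtree crossover, parsimony pressure, and constant optimization on top of this skeleton.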



Figure 1: An overview illustration of the data generation (top) and training pipeline (bottom). Our dataset stores solutions in numerical (non-binarized) form on the entire regular solution time grid.

