METAPHYSICA: CAUSALITY-AWARE ROBUSTNESS TO OOD INITIAL CONDITIONS IN PHYSICS-INFORMED MACHINE LEARNING

Abstract

A fundamental challenge in physics-informed machine learning (PIML) is the design of robust PIML methods for out-of-distribution (OOD) forecasting tasks, where the tasks require learning-to-learn from observations of the same (ODE) dynamical system with different unknown parameters, and demand accurate forecasts even under initial conditions outside the training support. In this work we propose a solution for such tasks, which we define as a meta-learning procedure for causal structural discovery (including invariant risk minimization). Using three different OOD tasks, we empirically observe that the proposed approach significantly outperforms existing state-of-the-art PIML and deep learning methods.



like inductive methods, the tasks are dependent and knowledge can be transferred between the learned ODEs. By meta learning we mean the definition in (Thrun & Pratt, 1998, Chapter 1.2), where given: (a) a family of $M$ tasks (a task is a single experiment in our setting), $i = 1, \ldots, M$; (b) training experience for each task $i \in \{1, \ldots, M\}$, which for us are the time series observations of an experiment $X^{(i)}_{t_0}, \ldots, X^{(i)}_{t_T}$; and (c) a family of performance measures (e.g., one for each task) described by the risk function $R^{(i)}$; our algorithm will meta learn such that performance at each task improves with experience (more observations) and with the number of tasks (number of experiments). For an algorithm to fit this definition, there must be a transfer of knowledge between multiple tasks that has a positive impact on expected task performance across all tasks.

3. Learning ODEs as structural causal discovery. In order to learn an ODE that is robust to OOD changes in initial conditions (with possibly non-overlapping training and test distribution supports), we define a family of structural causal models and perform a structural causal search in order to find the correct model for our task (which is assumed to be in the family). We test common structural causal discovery approaches for linear models: $\ell_1$-regularization with and without an invariant risk minimization-type objective, which we observe achieve similar empirical results.

The proposed method is then empirically validated using three commonly used simulated physics tasks (with measurement noise): damped pendulum systems (Yin et al., 2021), predator-prey systems (Wang et al., 2021a), and epidemic modeling (Wang et al., 2021a), all under both constant ODE parameters and varying ODE parameters per experiment. ODE parameters in train and test experiments have the same distribution, but the initial condition distributions are non-overlapping (OOD).

2. DYNAMICAL SYSTEM FORECASTING AS A META LEARNING TASK

In this section we formally describe the task of forecasting a dynamical system with a focus on the out-of-distribution initial condition scenario.

Definition 1 (Dynamical system forecasting task). In what follows we describe our task:

1. Training data (depicted in Figure 1(a)): In training, we are given a set of $M$ experiments, which we will denote as $M$ tasks. Task $i \in \{1, \ldots, M\}$ has an associated (hidden) environment $e^{(i)}$. Different tasks can have the same environment. Let $\mathcal{T}^{(i)} := (X^{(i)}_{t_0}, \ldots, X^{(i)}_{t_{T^{(i)}}})$ denote the noisy observations of our dynamical system, with $X^{(i)}_t := x^{(i)}_t + \varepsilon^{(i)}_t$, where

$$\frac{dx^{(i)}_t}{dt} = \psi(x^{(i)}_t; W^{(i)*}, \xi^*),$$

$\{t_0, \ldots, t_{T^{(i)}}\}$ are regularly spaced discrete time steps, $x^{(i)}_t \in \mathbb{R}^d$ is the (hidden) state of the system at time $t$ during experiment (task) $i$, $\varepsilon^{(i)}_t$ are independent zero-mean Gaussian noises, and $\psi$ is an unknown deterministic function with hidden ground-truth parameters $W^{(i)*} \sim P(W^*)$ and $\xi^*$, where the global task-independent parameters $W^*$ and $\xi^*$ are also hidden. Regularly spaced intervals are not strictly necessary for our method, but they make its implementation simpler. Initial conditions: The distribution of initial conditions $X^{(i)}_{t_0} \sim P(X_{t_0} \mid E = e^{(i)})$ of task $i$ may depend on its environment. The unknown parameters $\xi^*$ remain constant across environments.

, where $r$ is generally small, of the dynamical system $dx^{(M+1)}_t/dt = \psi(x^{(M+1)}_t; W^{(M+1)*}, \xi^*)$ with initial condition $X^{(M+1)}_{t_0} \sim P(X_{t_0} \mid E = e^{(M+1)})$ and (unknown) system parameters $W^{(M+1)*} \sim P(W^*)$ and hidden global parameters $\xi^*$ the same as in training. Our task is to predict $X^{(M+1)}_{t_{r+1}}, \ldots, X^{(M+1)}_{t_{T^{(M+1)}}}$ from the initial observations $\mathcal{T}^{(M+1)}$, using the inductive knowledge obtained from the training data.

3. Out-of-distribution initial conditions:

Initial conditions in training, $\{P(X_{t_0} \mid E = e^{(i)})\}_{i=1}^{M}$, can be different from initial conditions in test, $P(X_{t_0} \mid E = e^{(M+1)})$, with possibly non-overlapping support due to the presence of an environment unseen in training. In training, we are given trajectories that may have (a) different initial conditions, and (b) different unknown ODE system parameters. We observe a test trajectory (indexed by $M+1$) from time $t = t_0, \ldots, t_r$ and we wish to forecast its future after time $t_r$. The test trajectory can have an OOD initial condition but an in-distribution ODE parameter $W^{(M+1)*}$ with an unknown value.

Illustrative example. Figure 2a shows an example of an out-of-distribution task for forecasting the motion of a pendulum with friction. (1.) The state $X_t = [\theta_t, \omega_t] \in \mathbb{R}^2$ describes the angle made by the pendulum with the vertical and the corresponding angular velocity at time $t$. The true (unknown) function $\psi$ describing this dynamical system is given by $\psi([\theta_t, \omega_t]; W^*) = [\omega_t, -(\alpha^*)^2 \sin(\theta_t) - \rho^* \omega_t]$, with $W^* = (\alpha^*, \rho^*)$ denoting the parameters relating to the pendulum's period and the damping coefficient. (2.) In training, we observe $M$ (noisy) trajectories of motion over discrete time steps $t = 0, 0.1, \ldots, 10$ from experiments (tasks) where a pendulum is dropped with no angular velocity. (3.) In training, each experiment is performed by dropping different pendulums (i.e., $W^{(i)*} \sim P(W^*)$) from angles $0 < \theta_{t_0} < \pi/2$. (4.) In test, the experiment is repeated with a different distribution over the initial dropping angles, $\pi - 0.1 < \theta_{t_0} < \pi$ (nearly vertical angles), with small angular velocities. The test trajectory is observed over a smaller time window $t = 0, 0.1, \ldots, 3.3$ and the forecasting task is to predict the future states of the pendulum until time $t = 10$.

Figure 1 (caption, partially recovered): (Raissi et al., 2017a; Brunton et al., 2016)) are not able to transfer knowledge from training tasks to a test task with different $W^*$. Thus, these models can be fit only using test observations until time $t_r$, ignoring the training data. (e) Inductive PIML methods (e.g., (Yin et al., 2021; Mehta et al., 2021)) use a known (possibly incomplete) physics model $\phi(\cdot; \omega)$ and inductively predict its parameters $\omega$ for each task, typically using a neural network. However, predicting these physics parameters at test time this way is not robust. Furthermore, they use a neural network term to correct for the incomplete physics model and face the same robustness issue discussed in (c).
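To make the illustrative pendulum task concrete, the following is a minimal simulation sketch of one training task and one OOD test task. It assumes a hand-written fixed-step RK4 integrator (`rk4_trajectory` is a hypothetical helper, not part of any described implementation), draws $(\alpha^*, \rho^*)$ from the uniform ranges listed in Table 1 for the varying-parameter scenario, and omits measurement noise.

```python
import numpy as np

def pendulum_rhs(state, alpha, rho):
    """psi([theta, omega]; W*) = [omega, -alpha^2 * sin(theta) - rho * omega]."""
    theta, omega = state
    return np.array([omega, -alpha**2 * np.sin(theta) - rho * omega])

def rk4_trajectory(rhs, x0, times, **params):
    """Integrate dx/dt = rhs(x) on the given time grid with classical RK4 (illustrative helper)."""
    xs = [np.asarray(x0, dtype=float)]
    for t_prev, t_next in zip(times[:-1], times[1:]):
        h = t_next - t_prev
        x = xs[-1]
        k1 = rhs(x, **params)
        k2 = rhs(x + 0.5 * h * k1, **params)
        k3 = rhs(x + 0.5 * h * k2, **params)
        k4 = rhs(x + h * k3, **params)
        xs.append(x + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.stack(xs)

rng = np.random.default_rng(0)
times = np.linspace(0.0, 10.0, 101)  # t = 0, 0.1, ..., 10

# Task parameters W* = (alpha, rho), sampled as in the varying-parameter scenario of Table 1.
alpha, rho = rng.uniform(1.0, 2.0), rng.uniform(0.2, 0.4)

# Training task: pendulum dropped from an acute angle with no angular velocity.
x0_train = np.array([rng.uniform(0.0, np.pi / 2), 0.0])
train_traj = rk4_trajectory(pendulum_rhs, x0_train, times, alpha=alpha, rho=rho)

# OOD test task: nearly vertical initial angle with a small angular velocity.
x0_test = np.array([rng.uniform(np.pi - 0.1, np.pi), rng.uniform(-1.0, 0.0)])
test_traj = rk4_trajectory(pendulum_rhs, x0_test, times, alpha=alpha, rho=rho)

# Only the first r + 1 = 34 points (t = 0, ..., 3.3) are observed at test time.
observed = test_traj[:34]
to_forecast = test_traj[34:]
```

The supports of the training and test initial angles do not overlap, which is the shift the forecasting task is designed around.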

3. RELATED WORK & CHALLENGES WITH EXISTING APPROACHES

Next we describe different classes of existing approaches that are commonly used for dynamical system forecasting (Definition 1) and the challenges they face out-of-distribution.

3.1. STANDARD NEURAL NETWORKS METHODS

Deep learning's ability to model complex phenomena has allowed it to make great strides in a number of physics applications (Lusch et al., 2018; Yeo & Melnyk, 2019; Kochkov et al., 2021; Dang et al., 2022; Brandstetter et al., 2022b). However, standard deep learning methods are known to learn spurious correlations and tend to fail when the test distribution of the inputs is different from that observed in training (Wang et al., 2021a; Geirhos et al., 2020). Figure 2 depicts the out-of-distribution failure of several deep learning methods, from NeuralODE (Chen et al., 2018) to more complex meta learning approaches (Wang et al., 2021b; Kirchmeyer et al., 2022), in our running damped pendulum example (more details of the experiment are in Section 5). In standard deep learning tasks, Xu et al. (2021) show that an MLP's failure to extrapolate out-of-distribution can be traced to an absence of algorithmic alignment, i.e., an appropriate combination of basis and activation functions within the architecture for the task. For example, the outputs of an MLP with ReLU activations will be linear far from the training domain even when trained to predict a sine or quadratic function. For dynamical system forecasting, our Figure 1(c) depicts the results of a similar experiment for a standard sequence model (NeuralODE): the model can approximate the target sine function in the training domain (green region) but predicts a linear function far outside the training domain. This means that PIML also needs algorithmic alignment (i.e., to include appropriate basis functions) in order to make accurate forecasts in OOD tasks.
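The linear-extrapolation behaviour of ReLU networks can be checked numerically. The sketch below is illustrative only: it uses a randomly initialized one-hidden-layer ReLU MLP (the weights are arbitrary and no training is performed, since the property holds for any weights). Past the last "kink" of the hidden units the network is exactly affine, so its second differences vanish, whereas those of a sine do not.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 32

# Arbitrary (untrained) weights of a scalar-input, scalar-output ReLU MLP.
w1, b1 = rng.normal(size=hidden), rng.normal(size=hidden)
w2, b2 = rng.normal(size=hidden), rng.normal()

def relu_mlp(x):
    """f(x) = w2 . relu(w1 * x + b1) + b2, evaluated for a 1-D array of inputs."""
    x = np.asarray(x, dtype=float)
    pre_activation = np.outer(x, w1) + b1          # shape (n, hidden)
    return np.maximum(pre_activation, 0.0) @ w2 + b2

# Each hidden unit switches on/off at x = -b1 / w1; beyond the largest kink
# no unit changes state any more, so f is affine there.
kinks = -b1 / w1
far = np.abs(kinks).max() + 1.0
xs = far + np.arange(5.0)                          # equally spaced points past every kink

mlp_second_diff = np.diff(relu_mlp(xs), n=2)       # ~0 for an affine function
sine_second_diff = np.diff(np.sin(xs), n=2)        # non-zero: sine keeps curving
```

This is why a model without suitable basis functions cannot follow a periodic target far outside the training domain, regardless of how well it fits inside it.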

3.2. PHYSICS-INFORMED MACHINE LEARNING (PIML) METHODS

To alleviate the challenges described above for standard neural networks, several physics-informed machine learning (PIML) methods have been proposed (e.g., (Willard et al., 2020; Wang et al., 2020a; Faghmous & Kumar, 2014; Daw et al., 2017)) that utilize physics-based domain knowledge about the dynamical system under consideration for better predictions. The type of physics-based knowledge varies across methods, for example, (a) a dictionary of basis functions (e.g., sin, cos, $d/dt$) (Schmidt & Lipson, 2009; Brunton et al., 2016; Martius & Lampert, 2016; Raissi, 2018; Cranmer et al., 2020a) related to the task, (b) a completely specified physics model (Raissi et al., 2017a; Raissi, 2018; Jiang et al., 2019) or one with missing terms (Yin et al., 2021), and (c) different domain-specific physical constraints such as energy conservation (Greydanus et al., 2019; Cranmer et al., 2020b) and symmetries (Wang et al., 2020b; Finzi et al., 2021; Brandstetter et al., 2022a). While these PIML methods improve upon standard neural networks, Figure 2 shows that they are generally not designed for OOD forecasting tasks. To precisely study the reasons for this failure, we categorize these methods into inductive and transductive methods based on their requirements over the dynamical system parameters $W^*$.

Transductive PIML methods. Transductive inference focuses on predicting missing parts from the training data. In PIML, transductive inference methods treat each training and test example as an unrelated task, hence OOD generalization tends to be less of a challenge in transductive methods. For instance, SINDy (Brunton et al., 2016), EQL (Martius & Lampert, 2016), and related methods (Raissi, 2018; Chen, 2021) learn the ODE equation based on a dictionary of basis functions for a specific parameter $W^{(i)*}$. These transductive methods, however, do not transfer knowledge learnt in training to predicting test examples with a different in-distribution $W^{(j)*}$.
This forces these methods to forecast based on the initial observations of the test task alone, often leading to poor performance. Figure 1(d) illustrates this case, where a transductive method (unsuccessfully) tries to learn the unknown parameter $W^{(3)*}$ of the test task from a few initial test observations. Another class of transductive methods (Raissi et al., 2017a;b; Yu et al., 2022) assumes that the physics parameters $W^*$ remain constant across all training and test tasks, regularizing neural networks to respect a given physics model. These methods have been shown to be challenging to train for harder differential equations (Krishnapriyan et al., 2021) or to return trivial solutions (Leiteritz & Pflüger, 2021). Recently, Causal PINNs (Wang et al., 2022) mitigate some of these training challenges by ensuring that, for any time $t$, predictions at times less than $t$ are accurately resolved before predictions at time $t$. Not only will these methods perform poorly in-distribution if different experiments have different physics parameters, they also do not allow for causal interventions on variables in the dynamical system.

Inductive PIML. Taking the opposite approach, inductive inference focuses on learning rules from the training data that can be applied to unseen test examples. Inductive methods dominate PIML approaches but are fragile OOD, since the rules are learned within the scope of the training data and are not guaranteed to work outside that scope. For example, APHYNITY (Yin et al., 2021) and NDS (Mehta et al., 2021) are inductive methods that augment a known incomplete physics model with a neural network, where the parameters of the physics model are predicted inductively using a recurrent network. As illustrated in Figure 1(e), these methods are able to learn from training tasks with different true parameters $W^{(i)*}$.
However, in our experiments, APHYNITY often returns incorrect physics parameters OOD (see Figure 2c). Further, the augmented neural network suffers from the same issues discussed in Section 3.1, leading to poor OOD performance as seen in Figure 2. With these key reasons identified for the fragility of existing methods to OOD initial conditions, we next propose an approach (MetaPhysiCa) that is more robust to these challenges out-of-distribution, while also giving accurate predictions in-distribution.

In what follows we describe MetaPhysiCa, our proposed approach. We start with the description of a family of causal models, then explain how meta learning allows us to perform a hybrid transductive-inductive approach for improved OOD accuracy.

4.1. STRUCTURAL CAUSAL MODEL

We describe the dynamical system using a deterministic structural causal model (Peters et al., 2022) with measurement noise over the observed states and explicitly define the assumptions over the unknown function $\psi$ in Definition 1. The causal diagram is depicted in Figure 3 in plate notation, iterating over time $t = t_0, \ldots, t_{T^{(i)}}$ for each task $\mathcal{T}^{(i)}$. As before, the state of the dynamical system is $X^{(i)}_t \in \mathbb{R}^d$ for task $i$. We note that our SCM may not necessarily be the true SCM, but rather an SCM that is indistinguishable from the true one w.r.t. interventions limited to changes in the environment variable $E$ that affects the initial conditions $X^{(i)}_{t_0}$.

We define the causal process at each time step $t$ for the $i$-th task as follows. Let $f_k(\cdot; \xi_k) : \mathbb{R}^d \to \mathbb{R}$, $1 \le k \le m$, be $m$ linearly independent basis functions, each with a separate set of parameters $\xi_k^*$, acting on an input state $x^{(i)}_t$. Examples of such basis functions include trigonometric functions like $f_1(x^{(i)}_t; \xi_1^*) = \sin(\xi_{1,1} x^{(i)}_{t,1} + \xi_{1,2})$, polynomial functions like $f_2(x^{(i)}_t; \xi_2) = x^{(i)}_{t,1} x^{(i)}_{t,2}$, and so on. The corresponding outputs from these basis functions are shown as $z^{(i)}_{k,t} := f_k(x^{(i)}_t; \xi_k)$ in Figure 3. The derivative $dx^{(i)}_{t,j}/dt$ for a particular dimension $j \in \{1, \ldots, d\}$ is only affected by a few (unknown) basis function outputs $z^{(i)}_{k,t}$ (green arrows in Figure 3) and is a linear combination of these selected basis functions with coefficients $W^{(i)*}$. However, these selected basis functions and their corresponding parameters $\xi$ are assumed to be invariant across all the tasks, i.e., $dx^{(i)}_{t,j}/dt$, $j \in \{1, \ldots, d\}$, is defined using the same basis functions for all $i = 1, \ldots, M$. Finally, the derivatives dictate the next state of the dynamical system. We observe the dynamical system with independent additive measurement noise $X^{(i)}_t := x^{(i)}_t + \varepsilon^{(i)}_t$, where $\varepsilon^{(i)}_t \sim \mathcal{N}(0, \sigma_\varepsilon^2 I)$.
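As a concrete instance of this causal process, the damped pendulum of Section 2 can be written as a sparse linear combination of basis outputs. The sketch below is illustrative: the three-element dictionary $[\theta, \omega, \sin(\theta)]$ and the fixed basis parameters (scale 1, shift 0 inside the sine) are assumptions chosen for the example, and the values $\alpha = 1$, $\rho = 0.2$ are taken from Table 1.

```python
import numpy as np

def basis_outputs(x):
    """z_k = f_k(x) for an assumed dictionary [theta, omega, sin(theta)].

    The sine basis uses fixed parameters (scale 1, shift 0) for illustration.
    """
    theta, omega = x
    return np.array([theta, omega, np.sin(theta)])

alpha, rho = 1.0, 0.2  # pendulum parameters W* = (alpha, rho) from Table 1

# Coefficient matrix: one row per state dimension, one column per basis function.
# Zero entries correspond to absent edges z_k -> dx_j/dt in the causal graph.
W = np.array([
    [0.0, 1.0, 0.0],          # d theta / dt = omega
    [0.0, -rho, -alpha**2],   # d omega / dt = -rho * omega - alpha^2 * sin(theta)
])

def derivative(x, W):
    """dx/dt as a linear combination of the basis outputs."""
    return W @ basis_outputs(x)

x = np.array([np.pi / 4, 0.5])
dx = derivative(x, W)
```

Only three of the six possible edges are active here; across tasks the same edges stay active while the non-zero coefficients change with $(\alpha, \rho)$.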
We assume that we are given the collection of $m$ possible basis functions $f_k(\cdot; \xi)$, $k = 1, \ldots, m$, $m \ge 2$, with unknown $\xi$ and no prior knowledge of which $\{f_k\}_{k=1}^{m}$ causally influence $dx^{(i)}_t/dt$. The need for basis functions stems from extensive experimentation and our analysis in Section 3.1, where we show that appropriate basis functions must be incorporated within the architecture in order to extrapolate to OOD scenarios (see Figure 1(c)).

4.2. META LEARNING & MODEL ARCHITECTURE

Given the training data $\{(x^{(i)}_t)_t\}_{i=1}^{M}$ generated from the unknown SCM described above, our goal is three-fold: (a) discover the true underlying causal structure, i.e., which of the edges $z_{k,t} \to dx_{t,j}/dt$ exist for $j = 1, \ldots, d$, (b) learn the global parameters $\xi$ that parameterize the relevant basis functions, and (c) learn the task-specific parameters $W^{(i)*}$ that act as coefficients in the linear combination of the selected basis functions. In the following, we propose a meta-learning framework that introduces structure (gate) parameters $\Phi$ that are shared across tasks and task-specific coefficients $W^{(i)}$ that vary across the tasks:

$$\frac{d\hat{X}^{(i)}_t}{dt} = (W^{(i)} \odot \Phi)\, F(\hat{X}^{(i)}_t; \xi), \qquad (2)$$

where $\odot$ is the Hadamard product and

• $F(\hat{X}^{(i)}_t; \xi) := [f_1(\hat{X}^{(i)}_t; \xi_1) \; \cdots \; f_m(\hat{X}^{(i)}_t; \xi_m)]^T$ is the vector of outputs from the basis functions with parameters $\xi$,
• $\Phi \in \{0, 1\}^{d \times m}$ are the learnable parameters governing the global causal structure across all tasks such that $\Phi_{j,k} = 1$ iff the edge $z_{k,t} \to dx_{t,j}/dt$ exists in Figure 3,
• $W^{(i)} \in \mathbb{R}^{d \times m}$ are task-specific parameters that act as coefficients in the linear combination of the selected basis functions.

Next we describe a procedure to obtain the structure parameters $\Phi$. Finding whether an edge exists or not in the causal graph is known as the causal structure discovery problem (e.g., Heinze-Deml et al. (2018)). We use a score-based causal discovery approach (e.g., Huang et al. (2018)) where we assign a score to each possible causal graph.
We wish to find the minimal causal structure, i.e., the one with the least number of edges, that also fits the training data. This balances the complexity of the causal structure with training likelihood, and avoids overfitting the training data. A sparse structure for $\Phi$ implies fewer terms in the RHS of the learnt equation for the derivatives in Equation (2). Several causal discovery approaches have been proposed that learn such a minimal causal structure via continuous optimization (Zheng et al., 2018; Ng et al., 2022). We use the log-likelihood of the training data with an $\ell_1$-regularization term to induce sparsity, which is known to perform well for general causal structure discovery tasks (Zheng et al., 2018). Note that since the directions of all the edges are known (i.e., $z_{k,t} \to dx_{t,j}/dt$), we do not need the acyclicity constraints and the causal graph is uniquely identified by its Markov equivalence class (Pearl, 2009, Chapter 2).

The prediction error is given by

$$R^{(i)}(W^{(i)}, \Phi, \xi) := \frac{1}{T^{(i)} + 1} \sum_{t = t_0}^{t_{T^{(i)}}} \big\| \hat{X}^{(i)}_t - X^{(i)}_t \big\|_2^2,$$

where $\hat{X}^{(i)}_t = X^{(i)}_{t_0} + \int_{t_0}^{t} (W^{(i)} \odot \Phi)\, F(\hat{X}^{(i)}_\tau; \xi)\, d\tau$ are the predictions obtained using an ODE solver to integrate Equation (2). In practice, however, we found that the squared loss directly between the predicted and estimated ground-truth derivatives, i.e.,

$$R^{(i)}(W^{(i)}, \Phi, \xi) = \frac{1}{T^{(i)} + 1} \sum_{t = t_0}^{t_{T^{(i)}}} \big\| d\hat{X}^{(i)}_t/dt - dX^{(i)}_t/dt \big\|_2^2,$$

leads to a stable learning procedure with better accuracy in-distribution and OOD. As discussed before, we use an $\ell_1$-regularization term $\|\Phi\|_1$ to learn a causal structure with the fewest possible edges $z_{k,t} \to dx_{t,j}/dt$, $j = 1, \ldots, d$, while minimizing the prediction error in training. We also use $\ell_1$-regularization on the task-specific parameters, $\|W^{(i)}\|_1$, to learn a simpler model within each task $i$, if possible, than the one learnt globally for all tasks via $\Phi$.
Our structure discovery task comes with an additional challenge, as the training tasks could have been obtained under different (hidden) environments (as defined in Definition 1). While there are score-based (discrete optimization) approaches (Ghassami et al., 2018; Perry et al., 2022) for such non-IID data, the aforementioned approaches based on continuous optimization (e.g., (Zheng et al., 2018)) are not guaranteed to learn the correct structure. For example, they may output a structure that is optimal for one environment consisting of a large number of training tasks but suboptimal for other environments. Our goal then is to learn a structure that minimizes the prediction error across all environments simultaneously, similar to learning robust representations via invariant risk minimization-type methods (Arjovsky et al., 2019; Krueger et al., 2021). Since the environment $e^{(i)}$ of a particular task $i$ is hidden to our approach, we use a modified V-REx regularization (Krueger et al., 2021) that minimizes the variance of prediction errors across tasks instead of environments, focusing on robustness to the worst-case scenario (that all tasks have unique environments).

Now we are ready to describe our final optimization objective. Similar to standard meta-learning objectives (Finn et al., 2017; Franceschi et al., 2018; Hospedales et al., 2021), we propose a bi-level objective that optimizes the structure parameters $\Phi$ and the global parameters $\xi$ in the outer level, and the task-specific parameters $W^{(i)}$ in the inner level, as follows:

$$\hat{\Phi}, \hat{\xi} = \arg\min_{\Phi, \xi} \; \frac{1}{M} \sum_{i=1}^{M} R^{(i)}(\hat{W}^{(i)}, \Phi, \xi) + \lambda_\Phi \|\Phi\|_1 + \lambda_{\mathrm{REx}}\, \mathrm{Variance}\big(\{R^{(i)}(\hat{W}^{(i)}, \Phi, \xi)\}_{i=1}^{M}\big) \qquad (3)$$
$$\text{s.t.} \quad \hat{W}^{(i)} = \arg\min_{W^{(i)}} \; R^{(i)}(W^{(i)}, \Phi, \xi) + \lambda_W \|W^{(i)}\|_1, \quad \forall i = 1, \ldots, M,$$

where $\lambda_\Phi$, $\lambda_W$ and $\lambda_{\mathrm{REx}}$ are hyperparameters.
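The outer-level objective can be evaluated directly once per-task risks are available. The sketch below is a simplified illustration rather than the training procedure itself: it assumes the derivative-matching form of the risk, a fixed pendulum dictionary $[\theta, \omega, \sin(\theta)]$ with no learnable $\xi$, task weights that are already given (the inner $\arg\min$ is not solved), and the population variance (`ddof=0`) for the V-REx term. All function names are hypothetical.

```python
import numpy as np

def basis(X):
    """F(X) for a batch of states X of shape (n, 2); returns shape (n, 3)."""
    theta, omega = X[:, 0], X[:, 1]
    return np.stack([theta, omega, np.sin(theta)], axis=1)

def task_risk(W, Phi, X, dX):
    """Mean squared error between gated-model derivatives and target derivatives."""
    predicted = basis(X) @ (W * Phi).T            # (W ⊙ Phi) F(X), shape (n, d)
    return float(np.mean(np.sum((predicted - dX) ** 2, axis=1)))

def outer_objective(Ws, Phi, tasks, lam_phi, lam_rex):
    """Mean risk + l1 penalty on the structure + variance of risks across tasks."""
    risks = np.array([task_risk(W, Phi, X, dX) for W, (X, dX) in zip(Ws, tasks)])
    return risks.mean() + lam_phi * np.abs(Phi).sum() + lam_rex * risks.var()

# Two toy tasks generated by pendulums with different (alpha, rho).
rng = np.random.default_rng(0)
tasks, Ws = [], []
for alpha, rho in [(1.0, 0.2), (1.5, 0.3)]:
    X = rng.uniform([0.0, -1.0], [np.pi / 2, 1.0], size=(50, 2))
    W = np.array([[0.0, 1.0, 0.0], [0.0, -rho, -alpha**2]])
    tasks.append((X, basis(X) @ W.T))             # exact derivatives as targets
    Ws.append(W)

lam_phi, lam_rex = 0.01, 1.0
true_Phi = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
no_sine_Phi = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])   # drops the sin(theta) edge

obj_true = outer_objective(Ws, true_Phi, tasks, lam_phi, lam_rex)
obj_missing_edge = outer_objective(Ws, no_sine_Phi, tasks, lam_phi, lam_rex)
```

With the correct structure the risks vanish and only the sparsity penalty remains; removing a required edge raises both the mean risk and its variance across tasks, so the objective prefers the true structure despite its extra edge.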
While the exact bi-level optimization in Equation (3) is challenging to solve due to the lack of a closed-form solution for the inner optimization, it can be approximated by alternating SGD steps for $(\Phi, \xi)$ and $\{W^{(i)}\}_{i=1}^{M}$ in the outer and inner loops respectively (Borkar, 1997; Chen et al., 2021). In our experiments, jointly optimizing $\Phi$, $\xi$ and $W^{(i)}$, $i = 1, \ldots, M$, instead resulted in comparable performance with considerable computational benefits over alternating SGD. The discrete structure parameters $\Phi$ can be approximated using (stochastic) Gumbel-Softmax variables (Jang et al., 2017; Ng et al., 2022) or using deterministic binarization techniques (Courbariaux et al., 2015; 2016). We use the latter and reparameterize $\Phi_{j,k} := \mathbb{1}(\sigma(\Phi'_{j,k}) > 0.5)$, where $\Phi' \in \mathbb{R}^{d \times m}$, $\sigma(\cdot)$ is the sigmoid function, and the gradients are estimated via a straight-through estimator.

Hyperparameter selection: We choose the hyperparameters $\lambda_\Phi$, $\lambda_W$, $\lambda_{\mathrm{REx}}$ that result in the sparsest model (i.e., with the least $\|\hat{\Phi}\|_0$) while achieving a validation loss within 5% of the best validation loss on held-out in-distribution validation data. The use of in-distribution data for validation is a key requirement, since in OOD tasks one does not have access to samples from the test distribution. Additional implementation details are provided in Appendix B.
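Two of the steps above are simple enough to sketch in a few lines: the forward pass of the deterministic binarization of $\Phi$, and the hyperparameter selection rule. The sketch is illustrative; the straight-through gradient is not shown, and the candidate runs, their names, and their validation losses are made-up placeholder values rather than reported results.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def binarize(phi_real):
    """Forward pass of Phi = 1(sigmoid(Phi') > 0.5) for real-valued Phi'."""
    return (sigmoid(phi_real) > 0.5).astype(float)

def select_sparsest(candidates, tolerance=0.05):
    """Pick the run with the fewest edges whose validation loss is within
    `tolerance` (relative) of the best validation loss."""
    best_loss = min(c["val_loss"] for c in candidates)
    eligible = [c for c in candidates if c["val_loss"] <= (1.0 + tolerance) * best_loss]
    # Ties in edge count are broken by the lower validation loss (an assumption).
    return min(eligible, key=lambda c: (c["num_edges"], c["val_loss"]))

phi = binarize(np.array([[-1.0, 2.0], [0.0, 0.3]]))

# Hypothetical runs from a hyperparameter sweep (placeholder numbers).
candidates = [
    {"name": "A", "val_loss": 0.100, "num_edges": 6},
    {"name": "B", "val_loss": 0.103, "num_edges": 3},
    {"name": "C", "val_loss": 0.120, "num_edges": 2},
    {"name": "D", "val_loss": 0.104, "num_edges": 4},
]
chosen = select_sparsest(candidates)
```

Run C is the sparsest overall but falls outside the 5% band, so the rule settles on run B: the sparsest model among those that are nearly as accurate as the best one in-distribution.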

4.3. TRANSDUCTIVE TEST-TIME ADAPTATION WITH INDUCTIVE REGULARIZATION

Finally, given a test task $\mathcal{T}^{(M+1)} = (X^{(M+1)}_{t_0}, \ldots, X^{(M+1)}_{t_r})$ with the unknown ground-truth parameters $W^{(M+1)*} \sim P(W^*)$ as defined in Definition 1, we adapt the learnt model's task-specific parameters $W^{(M+1)}$ by optimizing the following while keeping $\Phi, \xi$ fixed:

$$\hat{W}^{(M+1)} = \arg\min_{W^{(M+1)}} \; \frac{1}{t_r + 1} \sum_{t = t_0}^{t_r} \big\| \hat{X}^{(M+1)}_t - X^{(M+1)}_t \big\|_2^2 + \lambda_W \|W^{(M+1)}\|_1, \qquad (4)$$

where $\hat{X}^{(M+1)}_t = X^{(M+1)}_{t_0} + \int_{t_0}^{t} (W^{(M+1)} \odot \Phi)\, F(\hat{X}^{(M+1)}_\tau; \xi)\, d\tau$.
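The flavour of this test-time adaptation can be illustrated with a simplified surrogate. The sketch below does not implement Equation (4) as written: it replaces the trajectory loss with a derivative-matching loss, uses exact derivatives at randomly sampled states as a stand-in for an observed test prefix, assumes a fixed pendulum dictionary $[\theta, \omega, \sin(\theta)]$, and solves the $\ell_1$-regularized problem with proximal gradient descent (ISTA). The function names and the test parameters $(\alpha, \rho) = (1.5, 0.3)$ are hypothetical.

```python
import numpy as np

def soft_threshold(v, threshold):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def adapt_task_weights(F, dX, Phi, lam_w=1e-4, steps=5000):
    """ISTA for min_W mean_t ||(W ⊙ Phi) F_t - dX_t||^2 + lam_w * ||W||_1, with Phi fixed.

    F:  (n, m) basis outputs, dX: (n, d) target derivatives, Phi: (d, m) binary structure.
    """
    n = F.shape[0]
    W = np.zeros_like(Phi, dtype=float)
    step = n / (2.0 * np.linalg.norm(F, 2) ** 2)      # 1 / Lipschitz constant of the gradient
    for _ in range(steps):
        residual = F @ (W * Phi).T - dX               # (n, d)
        grad = (2.0 / n) * (residual.T @ F) * Phi     # gradient flows only through active edges
        W = soft_threshold(W - step * grad, step * lam_w) * Phi
    return W

rng = np.random.default_rng(0)
Phi = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]])            # structure learnt in training (assumed)
W_true = np.array([[0.0, 1.0, 0.0], [0.0, -0.3, -2.25]])      # rho = 0.3, alpha^2 = 2.25

# 34 states stand in for the observed prefix t_0, ..., t_r of the test task.
X = rng.uniform([-3.0, -2.0], [3.0, 2.0], size=(34, 2))
F = np.stack([X[:, 0], X[:, 1], np.sin(X[:, 0])], axis=1)
dX = F @ W_true.T

W_hat = adapt_task_weights(F, dX, Phi)
```

Because $\Phi$ is fixed, only three coefficients are estimated from the short prefix, which is what makes adaptation from few observations feasible once the structure has been learnt from the training tasks.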

5. EMPIRICAL EVALUATION

We evaluate MetaPhysiCa in synthetic forecasting tasks based on 3 different dynamical systems (ODEs) from the literature (Yin et al., 2021; Wang et al., 2021a) adapted to our OOD scenario, namely, (i) a damped pendulum system, (ii) a predator-prey system, and (iii) an epidemic model. We compare against the following approaches: (a) NeuralODE (Chen et al., 2018), a deep learning method for learning ODEs, (b) DyAd (Wang et al., 2021b) (modified for ODEs), a meta-learning framework that adapts the forecaster to different training tasks with a weakly-supervised encoder, (c) CoDA (Kirchmeyer et al., 2022), which learns to modify its parameters to each environment with a low-rank adaptation, (d) APHYNITY (Yin et al., 2021), a PIML method that augments a known incomplete physics model with a neural network, (e) SINDy (Brunton et al., 2016), a transductive PIML method that uses sparse regression to learn linear coefficients over a given set of basis functions, and (f) EQL (Martius & Lampert, 2016), a transductive PIML method that uses sin, cos and other activation functions within a neural network and learns a sparse model. Additional details about the models are presented in Appendix B.

Dataset generation. As per Definition 1, for each dynamical system, we simulate the respective ODE to generate $M = 1000$ training tasks, each observed over regularly spaced discrete time steps $\{t_0, \ldots, t_T\}$ where $\forall l$, $t_l = 0.1l$. For each training task $\mathcal{T}^{(i)}$, $i = 1, \ldots, M$, we sample an initial condition $X^{(i)}_{t_0} \sim P(X_{t_0} \mid E = e)$ where $E = e$ is the training environment. We consider two scenarios for the dynamical system parameters: (a) Constant $W^{(i)*}$, where $W^{(i)*}$ is constant for all tasks $i$, and (b) Varying $W^{(i)*}$, where we sample a different $W^{(i)*} \sim P(W^*)$ for each task $i$. Note, however, that none of the models have oracle knowledge of which of the two scenarios the data is observed from.
At OOD test, we generate $M' = 200$ test tasks by simulating the respective dynamical system over time steps $\{t_0, \ldots, t_r\}$, where again $\forall l$, $t_l = 0.1l$. For each test task $j = 1, \ldots, M'$, we sample OOD initial conditions $X^{(j)}_{t_0} \sim P(X_{t_0} \mid E = e')$ where $E = e'$ is the test environment and can induce a completely different support for the initial conditions $X^{(j)}_{t_0}$ than in training. The distribution of the dynamical system parameters $W^*$ is kept the same across training and test.

We consider three dynamical systems in our experiments, with 3 to 6 RHS terms in their respective differential equations: a damped pendulum system (Yin et al., 2021), a predator-prey system (Wang et al., 2021a), and an epidemic (SIR) model (Wang et al., 2021a), with the following OOD shifts in their initial conditions, respectively: acute initial angles in training to nearly vertical initial angles in OOD test, an initial prey population 10× smaller in OOD test than in training, and an initial population susceptible to a disease 10× larger in OOD test than in training. We generate the damped pendulum dataset with 1% zero-mean Gaussian noise and the rest with no noise to show that the OOD failure of baselines is unrelated to noise: existing methods fail OOD even with clean observations. For methods that require ground-truth derivatives during training, we estimate them from noisy trajectories using Total Variation Regularization (TVR) (Rudin et al., 1992; Chartrand, 2011), as done by Brunton et al. (2016). A detailed description of the datasets is presented in Appendix A, and experiments with increasing amounts of noise are presented in Appendix C.4.

Results. We repeat our experiments 5 times with different random seeds and report in-distribution (ID) and out-of-distribution (OOD) normalized root mean squared errors (NRMSE), i.e., RMSE normalized by the standard deviation of the ground-truth observations.
Figures 2, 4 and 5 show the errors and example predictions from all models for the three datasets, respectively. The first two columns of Tables 2d, 4a and 5a show results when $W^{(i)*}$ is constant across tasks $i$. NeuralODE, DyAd, CoDA and APHYNITY use neural network components and are able to learn the in-distribution task well with low errors. However, the corresponding OOD errors are high, as they are unable to adapt to OOD initial conditions. Example OOD predictions (Figures 2c, 4c and 7b) from these methods show that they have not learnt the true dynamics of the system. For example, for epidemic modeling (Figure 7b), most models predict trajectories very similar to training trajectories even though the number of susceptible individuals is 10× higher in OOD test. SINDy and EQL cannot use the training data and are fit on the test observations alone (see Figure 1(d)). Thus, they are unable to identify an accurate analytical equation from these few observations of the test task, resulting in prediction issues due to stiff ODEs. MetaPhysiCa consistently performs the best OOD across all datasets, achieving 8.5× to 35× lower OOD NRMSE than the best baseline across the 3 datasets.

The last two columns of Tables 2d, 4a and 5a show results for the more challenging scenario when $W^{(i)*} \sim P(W^*)$ varies across tasks. The results follow the same trend and MetaPhysiCa performs best OOD across all datasets, achieving 8× to 28× lower OOD NRMSE than the best baseline across the 3 datasets. In Appendix C.1, we show that MetaPhysiCa learns the ground-truth ODE (possibly reparameterized) for all 3 dynamical systems.
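For reference, the NRMSE metric used throughout the results can be computed as in the short sketch below. It assumes that the standard deviation is pooled over all entries of the ground-truth array; whether normalization is done per state dimension or per trajectory is not specified here, so this pooling is an assumption.

```python
import numpy as np

def nrmse(pred, truth):
    """RMSE between prediction and ground truth, divided by the std of the ground truth."""
    pred, truth = np.asarray(pred, dtype=float), np.asarray(truth, dtype=float)
    rmse = np.sqrt(np.mean((pred - truth) ** 2))
    return rmse / np.std(truth)

truth = np.array([0.0, 1.0, 2.0, 3.0])
perfect_score = nrmse(truth, truth)                               # exact forecast
mean_baseline_score = nrmse(np.full_like(truth, truth.mean()), truth)  # constant-mean forecast
```

Under this normalization a forecast that always outputs the mean of the ground truth scores exactly 1, which gives a useful reference point when reading the reported errors.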

6. CONCLUSIONS

In this work we considered the out-of-distribution (OOD) task of forecasting a dynamical system (ODE) under new initial conditions. We showed that existing PIML methods do not perform well in these tasks and proposed MetaPhysiCa that uses a meta-learning framework to learn the causal structure for the shared dynamics across all environments, while adapting the task-specific parameters. Results on three OOD (initial condition) forecasting tasks show that MetaPhysiCa is more robust with 8× to 35× reduction in OOD error compared to the best competing baseline. Limitations & future work. We believe that forecasting models should be robust to OOD shifts, and that our work takes a step in the right direction with several potential avenues for future research: (i) Partial differential equations (PDEs): Extending MetaPhysiCa to forecasting PDEs under OOD scenarios is an interesting extension that requires an expanded set of basis functions that includes differential operators (like the Laplace operator), and considering out-of-distribution boundary conditions. (ii) More expressive structural causal models (SCMs): Our experiment on a complex ODE task (in Appendix C.5) suggests that MetaPhysiCa with a more expressive SCM that allows for composition of basis functions is able to forecast out-of-distribution better than competing baselines, but suffers from learning stiff ODEs due to the complexity of a 2-layer learnable basis function procedure. Better optimization techniques may help alleviate this problem. 

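| Dataset | State variables and parameters | Constant $W^{(i)*} = W_{\text{param}}$: ID | Constant $W^{(i)*} = W_{\text{param}}$: OOD | Varying $W^{(i)*} \sim \mathcal{U}(W_{\text{param}}, 2W_{\text{param}})$: ID | Varying $W^{(i)*} \sim \mathcal{U}(W_{\text{param}}, 2W_{\text{param}})$: OOD |
|---|---|---|---|---|---|
| Damped pendulum | $X_t = (\theta_t, \omega_t)$; $W^* = (\alpha, \rho)$, $\alpha_{\text{param}} = 1$, $\rho_{\text{param}} = 0.2$ | $\theta_0 \sim \mathcal{U}(0, \pi/2)$, $\omega_0 = 0$ | $\theta_0 \sim \mathcal{U}(\pi - 0.1, \pi)$, $\omega_0 \sim \mathcal{U}(-1, 0)$ | $\theta_0 \sim \mathcal{U}(0, \pi/2)$, $\omega_0 = 0$ | $\theta_0 \sim \mathcal{U}(\pi - 0.1, \pi)$, $\omega_0 \sim \mathcal{U}(-1, 0)$ |
| Predator-prey system | $X_t = (p_t, q_t)$; $W^* = (\alpha, \beta, \gamma, \delta)$, $\alpha_{\text{param}} = 1$, $\beta_{\text{param}} = 0.06$, $\gamma_{\text{param}} = 0.5$, $\delta_{\text{param}} = 0.0005$ | $p_0 \sim \mathcal{U}(1000, 2000)$, $q_0 \sim \mathcal{U}(10, 20)$ | $p_0 \sim \mathcal{U}(100, 200)$, $q_0 \sim \mathcal{U}(10, 20)$ | $p_0 \sim \mathcal{U}(1000, 2000)$, $q_0 \sim \mathcal{U}(10, 20)$ | $p_0 \sim \mathcal{U}(100, 200)$, $q_0 \sim \mathcal{U}(10, 20)$ |
| Epidemic modeling | $X_t = (S_t, I_t, R_t)$; $W^* = (\beta, \gamma)$, $\beta_{\text{param}} = 4$, $\gamma_{\text{param}} = 0.4$ | $S_0 \sim \mathcal{U}(9, 10)$, $I_0 \sim \mathcal{U}(1, 5)$, $R_0 = 0$ | $S_0 \sim \mathcal{U}(90, 100)$, $I_0 \sim \mathcal{U}(1, 5)$, $R_0 = 0$ | $S_0 \sim \mathcal{U}(9, 10)$, $I_0 \sim \mathcal{U}(1, 5)$, $R_0 = 0$ | $S_0 \sim \mathcal{U}(90, 100)$, $I_0 \sim \mathcal{U}(1, 5)$, $R_0 = 0$ |

Table 1: Description of the dataset generation process. For each dataset, $X_t$ denotes the state variable of the dynamical system and $W^*$ denotes its parameters. The third and fourth columns correspond to the case when the (hidden) ground-truth parameters $W^{(i)*}$ are kept fixed for all the tasks to $W^{(i)*} = W_{\text{param}}$. For example, in the damped pendulum dataset, we fix $\alpha^{(i)*} = \alpha_{\text{param}} = 1$ and $\rho^{(i)*} = \rho_{\text{param}} = 0.2$ for all tasks $i$. Column ID represents in-distribution initial states while column OOD represents the out-of-distribution initial states. Similarly, the final two columns correspond to the case when the ground-truth parameters $W^{(i)*}$ vary across tasks and are sampled from a uniform distribution $W^{(i)*} \sim \mathcal{U}(W_{\text{param}}, 2W_{\text{param}})$. For example, in the damped pendulum dataset, we sample $\alpha^{(i)*} \sim \mathcal{U}(\alpha_{\text{param}}, 2\alpha_{\text{param}}) = \mathcal{U}(1, 2)$ and $\rho^{(i)*} \sim \mathcal{U}(\rho_{\text{param}}, 2\rho_{\text{param}}) = \mathcal{U}(0.2, 0.4)$ for each task $i$.

( $t_0$ $\in \mathcal{U}(-1, 0)$. We sample the dynamical system parameters $\alpha^{(i)*} \sim \mathcal{U}(1, 2)$ and $\rho^{(i)*} \sim \mathcal{U}(0.1, 0.2)$. Note that the damping coefficient $\rho^{(i)*}$ is sampled out-of-support from its training distribution. The rest of the experimental methodology is kept the same as before.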
We report the normalized RMSE of all the methods in Table 4 for three test scenarios: in-distribution (ID), out-of-distribution initial conditions (OOD X_t0), and out-of-distribution initial conditions and ODE parameters (OOD X_t0 and W(i)*). MetaPhysiCa is able to adapt relatively well to the out-of-distribution ODE parameters and performs ≈ 4× better than the best baseline. Unfortunately, the test-time adaptation is not perfect (NRMSE is 5× higher for OOD X_t0 and W(i)* compared to OOD initial conditions alone), possibly because the trajectories with higher α(i)* and higher ρ(i)* are harder to forecast.

C.4 ROBUSTNESS TO NOISE

We repeat the Damped pendulum and Predator-prey experiments with increasing amounts of noise. Specifically, we add 1%, 5% and 10% Gaussian noise to all the trajectories, both in training and in test. As discussed before, we use Total Variation Regularization (TVR) (Rudin et al., 1992; Chartrand, 2011) for estimating derivatives from noisy data, as done by Brunton et al. (2016). We report the normalized RMSE for different models trained on the noisy versions of the data in Figure 7. SINDy and EQL are not shown because, as in the noise-free case, they returned errors during test-time predictions: the learnt ODE was too stiff (numerically unstable) to solve. In both tasks, the proposed method is relatively robust to small amounts of noise and outperforms the baselines. With 10% noise, MetaPhysiCa is unable to identify the dynamical system accurately, but performs comparably to the baselines.

C.5 COMPLEX ODE TASK

In this section, we extend MetaPhysiCa to consider significantly more expressive structural causal models (compared to Figure 3) that allow for composition of the basis functions. This is achieved with a 2-layer learnable basis function composition procedure. For example, given basis functions $f_1(x_t; \xi_1) = \sin(\xi_{1,1} x_{t,1} + \xi_{1,2})$ and $f_2(x_t; \xi_2) = x_{t,1} x_{t,2}$, one can construct more expressive basis functions with compositions: $f_3(x_t; \xi_3) = \sin(\xi_{3,3} \sin(\xi_{3,1} x_{t,1} + \xi_{3,2}) + \xi_{3,4})$, $f_4(x_t; \xi_4) = x_{t,1} x_{t,2} \sin(\xi_{4,1} x_{t,1} + \xi_{4,2})$, etc., where $\xi_j$ are global parameters that remain constant for all training/test tasks. The rest of the SCM remains the same, and the derivative $dx^{(i)}_{t,j}/dt$ for a particular dimension $j \in \{1, \dots, d\}$ is a sparse linear combination of the original basis functions and the more expressive second-layer ones.

We evaluated MetaPhysiCa on a more complex ODE task from Chen (2020) adapted to our setting. We consider a two-dimensional ODE with state $X_t = [p_t, q_t] \in \mathbb{R}^2$:

$$\frac{dp_t}{dt} = a_* \sin(p_t) + b_* \sin(q_t^2), \qquad \frac{dq_t}{dt} = c_* \sin(p_t) \cos(q_t),$$

where $W_* = (a_*, b_*, c_*)$ are the dynamical system parameters. We simulate the ODE over time steps $\{t_0, \dots, t_T\}$ with $\forall l,\ t_l = 0.1l$, $T = 100$ in training and over time steps $\{t_0, \dots, t_r\}$ in test with $r = \frac{1}{3}T$. In training, we sample initial states $p_{t_0}, q_{t_0} \sim U(0.5, 1)$, whereas in out-of-distribution test, we sample $p_{t_0}, q_{t_0} \sim U(1, 1.5)$. For the constant $W^{(i)}_*$ scenario, the dynamical system parameters are set to $a^{(i)}_* = b^{(i)}_* = c^{(i)}_* = 1$ for all $i$, whereas for the varying $W^{(i)}_*$ scenario, the dynamical system parameters are sampled as $a^{(i)}_*, b^{(i)}_*, c^{(i)}_* \sim U(1.0, 1.5)$. Table 5 shows the results for this task.
First, we note that due to the complexity of a 2-layer learnable basis function procedure, we sometimes need to use validation data (held out from training) to cross-validate the learned model (and reject meta-models that do not do well in validation). MetaPhysiCa learnt a stiff ODE for 2 out of 5 folds of cross-validation, resulting in no predictions for in-distribution validation data; these folds were rejected (marked with superscript *). In these experiments MetaPhysiCa performs 1.5× to 1.7× better than the competing baselines. We believe there is room for improvement in the optimization procedure of these more complex models.
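The composition of basis functions described in this section can be illustrated with a small sketch. It only mirrors the four example functions given above; the function names, the tuple layout of the parameters `xi`, and the test point are illustrative assumptions, and the sketch says nothing about how the parameters are learnt.

```python
import math

def f1(x, xi):
    """First-layer basis: sin(xi[0] * x[0] + xi[1])."""
    return math.sin(xi[0] * x[0] + xi[1])

def f2(x):
    """First-layer basis: product of the two state dimensions."""
    return x[0] * x[1]

def f3(x, xi):
    """Second-layer basis: sin(xi[2] * sin(xi[0] * x[0] + xi[1]) + xi[3])."""
    return math.sin(xi[2] * f1(x, xi[:2]) + xi[3])

def f4(x, xi):
    """Second-layer basis: x[0] * x[1] * sin(xi[0] * x[0] + xi[1])."""
    return f2(x) * f1(x, xi)

# Hypothetical state and parameter values, chosen only for illustration.
x = (0.5, 2.0)
print(f3(x, (1.0, 0.0, 1.0, 0.0)))  # equals sin(sin(0.5))
print(f4(x, (1.0, 0.0)))            # equals 0.5 * 2.0 * sin(0.5)
```

The second-layer functions reuse the first-layer ones, which is what makes the family more expressive while keeping the final derivative a sparse linear combination of basis outputs.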



In our experiments, we let T (i) = T constant for all tasks for simplicity of implementation but the proposed method is not restricted to this case.



Test data (depicted in Figure 1(b)): At test, we are given an observed initial sequence T (M +1)

Figure 1: Dynamical system OOD problem definition and traditional approaches to address it. (a) Training data consists of multiple observations from the same dynamical system with different parameters W(i)*. Each training curve can be seen as a different task i where the goal is to predict X(i)_{t+1} from X(i)_t for all t. (b) At test, we are given observations till t_r (red solid) and the goal is to predict the future observations till t_T (gray dashed). (c) Shows OOD failure of a standard neural network (NeuralODE (Chen et al., 2018)) for dynamical system forecasting. When trained to predict the motion of a damped pendulum, the model predicts accurately in the training domain (green shaded), but predicts a linear function outside the training domain. (d) Transductive PIML methods (e.g., Raissi et al., 2017a; Brunton et al., 2016) are not able to transfer knowledge from training tasks to a test task with different W*. Thus, these models can be fit only using test observations till time t_r, ignoring the training data. (e) Inductive PIML methods (e.g., Yin et al., 2021; Mehta et al., 2021) use a known (possibly incomplete) physics model ϕ(•; ω) and inductively predict its parameters ω for each task, typically using a neural network. However, predicting these physics parameters at test this way is not robust. Furthermore, they use a neural network term to correct for the incomplete physics model and face the same robustness issue discussed in (c).

Figure 2: (a) Predict pendulum motion from noisy observations: (i) in-distribution, when dropped from acute angles and (ii) out-of-distribution, when dropped from nearly vertical angles. (b, c) show example ground truth curves (blue stars) in- and out-of-distribution along with predictions from different models. While most tested methods perform well in-distribution, only MetaPhysiCa (orange) closely follows the true curve OOD; all other methods are not robust. (d) Standard deep learning methods and physics-informed machine learning methods fail to forecast accurately out-of-distribution. On the other hand, MetaPhysiCa outputs up to 8.5× more robust OOD predictions.

Figure 3: Deterministic SCM for a dynamical system with measurement noise. The dynamics are defined via an unknown linear combination of basis functions.

Figure 4: (Predator-prey results) (a) MetaPhysiCa outputs 30× and 8× more robust OOD predictions in constant W(i)* and varying W(i)* datasets respectively. (b, c) show example ground truth curves (blue stars) in- and out-of-distribution along with corresponding predictions. While most tested methods perform well in-distribution, only MetaPhysiCa (orange) closely follows the true curve OOD.

dτ are the predictions obtained using the optimal values Φ, ξ, and λ_W is the hyperparameter chosen during training. Note the following two key aspects of the test-time adaptation in Equation (4): (a) only the task-specific parameters $W^{(M+1)}$ are adapted, whereas the meta-model Φ learnt during training is kept fixed, and (b) only the observations from time $t_0, \dots, t_r$ of the given test trajectory are used to adapt the parameters $W^{(M+1)}$. The final predictions $(\hat{X}^{(M+1)}_t)_{t=t_r}^{t_{T^{(M+1)}}}$ from the model are obtained with the test-time adapted parameters $\hat{W}^{(M+1)}$ and the fixed parameters Φ, ξ with no adaptation.

Figure 5: (Epidemic model results) (a) MetaPhysiCa outputs 35× and 28× more robust OOD predictions in constant W(i)* and varying W(i)* datasets respectively. (b, c) show example ground truth curves (blue stars) in- and out-of-distribution along with corresponding predictions. Only MetaPhysiCa (orange) closely follows the true curve OOD.

Figure 7: (Performance with increasing noise.) Out-of-distribution NRMSE values for Damped Pendulum and Predator-prey experiments with different percentages of Gaussian noise added (0%, 1%, 5%, 10%). MetaPhysiCa is relatively robust to ≤ 5% Gaussian noise and outperforms the baselines. With a larger amount of noise, MetaPhysiCa is unable to identify the dynamical system accurately but performs comparably to the baselines.

Test NRMSE ↓ for different methods. NaN * indicates that the model returned errors during test.


. . . , M. At OOD test, we generate M′ = 200 out-of-distribution test tasks with a different initial susceptible population, S

Table 3: (Ablation.) Out-of-distribution test NRMSE for MetaPhysiCa without each individual component on the three dynamical systems (varying W(i)* scenario). Sparsity regularization (i.e., ||Φ||_1) and test-time adaptation are the most important components, whereas the task-specific ℓ1-regularization (i.e., ||W(i)||_1) and the V-REx penalty (Krueger et al., 2021) help in some tasks, but not in others.

Table 4: (Damped pendulum.) Normalized RMSE ↓ of test predictions from different methods under two cases: (a) when initial conditions X_t0 are OOD, and (b) when both initial conditions X_t0 and ODE parameters W(i)* are OOD. NaN * indicates that the model returned errors during test-time predictions. MetaPhysiCa is able to adapt its parameters to the OOD parameters W(i)* and outputs ≈ 5× more robust OOD predictions compared to the baselines.

Test NRMSE ↓ for different methods. * indicates that the method returned errors during predictions due to learning a stiff ODE.

Supplementary Material of "MetaPhysiCa: Causality-aware Robustness to OOD Initial Conditions in Physics-informed Machine Learning"

A DESCRIPTION OF TASKS

For each dynamical system, we simulate the respective ODE to generate M = 1000 training tasks, each observed over regularly-spaced discrete time steps $\{t_0, \dots, t_T\}$ where $\forall l,\ t_l = 0.1l$. Our data generation process is succinctly depicted in Table 1. For each dataset, the second column shows the state variables $X_t$ and the unknown parameters $W_*$. For each training task $\mathcal{T}^{(i)}$, $i = 1, \dots, M$, we sample an initial condition $X^{(i)}_{t_0} \sim P(X_{t_0} \mid E = e)$ where $E = e$ is the training environment (shown under ID columns of the table). We consider two scenarios for the dynamical system parameters:
• Constant $W^{(i)}_*$ (third and fourth columns in Table 1): for all tasks i, $W^{(i)}_* = W_{\text{param}}$ as indicated in the table.
• Varying $W^{(i)}_*$ (final two columns in Table 1): we sample a different $W^{(i)}_* \sim U(W_{\text{param}}, 2W_{\text{param}})$ for each task i, with $W_{\text{param}}$ shown in the table.

At OOD test, we generate M′ = 200 test tasks by simulating the respective dynamical system over time steps $\{t_0, \dots, t_r\}$, where again $\forall l,\ t_l = 0.1l$. For each test task $j = 1, \dots, M'$, we sample initial conditions $X^{(j)}_{t_0} \sim P(X_{t_0} \mid E = e')$ where $E = e'$ is the test environment and can induce a completely different support for the initial conditions $X^{(j)}_{t_0}$ than in training. The distribution of the dynamical system parameters $W_*$ is kept the same across training and test.

Damped pendulum system (Yin et al., 2021). The state $X_t = [\theta_t, \omega_t] \in \mathbb{R}^2$ describes the angle made by the pendulum with the vertical and the corresponding angular velocity at time t. The true (unknown) function ψ describing this dynamical system is given by

$$\frac{d\theta_t}{dt} = \omega_t, \qquad \frac{d\omega_t}{dt} = -\alpha_*^2 \sin(\theta_t) - \rho_* \omega_t,$$

where $W_* = (\alpha_*, \rho_*)$ are the dynamical system parameters. We simulate the ODE over time steps $\{t_0, \dots, t_T\}$ with $\forall l,\ t_l = 0.1l$, $T = 100$ in training and over time steps $\{t_0, \dots, t_r\}$ in test with $r = \frac{1}{3}T$.
In training, the pendulum is dropped from initial angles $\theta^{(i)}_{t_0} \sim U(0, \pi/2)$ with no angular velocity, whereas in OOD test, the pendulum is dropped from initial angles $\theta^{(j)}_{t_0} \sim U(\pi - 0.1, \pi)$ and angular velocity $\omega^{(j)}_{t_0} \sim U(-1, 0)$.

Predator-prey system (Wang et al., 2021a). We wish to model the dynamics between two species acting as prey and predator respectively. We adapt the experiment by Wang et al. (2021a) to our out-of-distribution forecasting scenario according to Definition 1. Let p and q denote the prey and predator populations respectively. The ordinary differential equations describing the dynamical system are given by

$$\frac{dp}{dt} = \alpha_* p - \beta_* p q, \qquad \frac{dq}{dt} = \delta_* p q - \gamma_* q,$$

where $W_* = (\alpha_*, \beta_*, \gamma_*, \delta_*)$ are the (unknown) dynamical system parameters. We simulate the ODE over time steps $\{t_0, \dots, t_T\}$ with $\forall l,\ t_l = 0.1l$, $T = 100$ in training and over time steps $\{t_0, \dots, t_r\}$ in test with $r = \frac{1}{3}T$. We generate M = 1000 training tasks with different initial prey and predator populations, with prey $p^{(i)}_{t_0} \sim U(1000, 2000)$ and predator $q^{(i)}_{t_0} \sim U(10, 20)$ for each $i = 1, \dots, M$. At OOD test, we generate M′ = 200 out-of-distribution (OOD) test tasks with different initial prey populations $p^{(j)}_{t_0} \sim U(100, 200)$ but the same distribution for the predator population $q^{(j)}_{t_0} \sim U(10, 20)$.

Epidemic modeling (Wang et al., 2021a). We adapt the experiment by Wang et al. (2021a) to our out-of-distribution forecasting scenario according to Definition 1. The state of the dynamical system is described by three variables: the number of susceptible (S), infected (I) and recovered (R) individuals. The dynamics are described using the following ODEs:

$$\frac{dS}{dt} = -\beta \frac{SI}{N}, \qquad \frac{dI}{dt} = \beta \frac{SI}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I,$$

where $W = (\beta, \gamma)$ are the (unknown) dynamical system parameters and $N = S + I + R$ is the total population. We simulate the ODE over time steps $\{t_0, \dots, t_T\}$ with $\forall l,\ t_l = 0.1l$, $T = 100$ in training and over time steps $\{t_0, \dots, t_r\}$ in test with $r = \frac{1}{10}T$.
We generate M = 1000 training tasks with different initial populations for susceptible (S) and infected (I) individuals, while the number of initial recovered

Figure 6: In training, Φ, denoting the causal structure, is shared among all tasks i = 1, . . . , M, while W(i) are the task-specific parameters. Predicted derivatives for task i over time t = t_0, . . . , t_T are obtained from Equation (2) using the parameters Φ, W(i) and the basis functions F(X(i)_t; ξ). During test, we adapt W(M+1) over the observations of the test trajectory from time t_0, . . . , t_r, keeping the learnt causal structure Φ fixed.
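The data generation process of Section A can be sketched for the damped pendulum as follows. This is a minimal illustration under stated assumptions, not the authors' code: the fixed-step RK4 integrator, the function names, and the omission of measurement noise are all choices made here for brevity, while the parameter ranges, initial-condition ranges, step size 0.1, T = 100 and r = T/3 follow the text and Table 1.

```python
import math
import random

def pendulum_rhs(state, alpha, rho):
    """Damped pendulum: dtheta/dt = omega, domega/dt = -alpha^2 sin(theta) - rho * omega."""
    theta, omega = state
    return (omega, -alpha ** 2 * math.sin(theta) - rho * omega)

def rk4_step(rhs, state, dt, *params):
    """One classical Runge-Kutta step (assumed solver; the text does not name one)."""
    k1 = rhs(state, *params)
    k2 = rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), *params)
    k3 = rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), *params)
    k4 = rhs(tuple(s + dt * k for s, k in zip(state, k3)), *params)
    return tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def sample_task(rng, ood=False, varying=False, T=100, dt=0.1):
    """Sample one noise-free pendulum trajectory for a training (ID) or test (OOD) task."""
    alpha_param, rho_param = 1.0, 0.2
    alpha = rng.uniform(alpha_param, 2 * alpha_param) if varying else alpha_param
    rho = rng.uniform(rho_param, 2 * rho_param) if varying else rho_param
    if ood:
        # Dropped from nearly vertical angles with a nonzero angular velocity.
        state = (rng.uniform(math.pi - 0.1, math.pi), rng.uniform(-1.0, 0.0))
    else:
        # Dropped from acute angles with no angular velocity.
        state = (rng.uniform(0.0, math.pi / 2), 0.0)
    steps = T // 3 if ood else T  # test tasks are simulated up to t_r with r = T/3
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(pendulum_rhs, state, dt, alpha, rho)
        trajectory.append(state)
    return trajectory

rng = random.Random(0)
train_tasks = [sample_task(rng, varying=True) for _ in range(5)]
test_tasks = [sample_task(rng, ood=True, varying=True) for _ in range(5)]
print(len(train_tasks[0]), len(test_tasks[0]))
```

The same structure applies to the other two systems by swapping the right-hand side and the sampling ranges listed in Table 1.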

B IMPLEMENTATION DETAILS

In what follows, we describe implementation details of MetaPhysiCa and the baselines.

B.1 METAPHYSICA

Figure 6 shows a schematic diagram of MetaPhysiCa and the corresponding training/test procedures. Recall from Equation (2) that the proposed model is defined as

$$\frac{d\hat{X}^{(i)}_t}{dt} = (W^{(i)} \odot \Phi)\, F(\hat{X}^{(i)}_t; \xi),$$

where ⊙ is the Hadamard product and
• $F(\hat{X}^{(i)}_t; \xi)$ is the vector of outputs from the basis functions with parameters ξ,
• $\Phi \in \{0, 1\}^{d \times m}$ are the learnable parameters governing the global causal structure across all tasks such that $\Phi_{j,k} = 1$ iff edge $z_{k,t} \to dx_{t,j}/dt$ exists,
• $W^{(i)} \in \mathbb{R}^{d \times m}$ are task-specific parameters that act as coefficients in the linear combination of the selected basis functions.

In our experiments, we use polynomial and trigonometric basis functions. Equation (3) describes a bi-level objective that optimizes the structure parameters Φ and the global parameters ξ in the outer level, and the task-specific parameters $W^{(i)}$ in the inner level, with hyperparameters $\lambda_\Phi$, $\lambda_W$ and $\lambda_{\text{REx}}$. As discussed in the main text, jointly optimizing Φ, ξ and $W^{(i)}$, $i = 1, \dots, M$, instead of alternating SGD resulted in comparable performance with considerable computational benefits. We therefore use a joint optimization objective to approximate Equation (3).

We perform a grid search over the following hyperparameters: regularization strengths $\lambda_\Phi \in \{10^{-4}, 10^{-3}, 5 \times 10^{-3}, 10^{-2}\}$, $\lambda_W \in \{0, 10^{-4}, 10^{-3}, 10^{-2}\}$, $\lambda_{\text{REx}} \in \{0, 10^{-3}, 10^{-2}\}$, and learning rates $\eta \in \{10^{-2}, 10^{-3}, 10^{-4}\}$. We choose the hyperparameters that result in the sparsest model (i.e., with the least $\|\hat{\Phi}\|_0$) while achieving validation loss within 5% of the best validation loss on held-out in-distribution validation data.
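The hyperparameter selection rule described above (sparsest model whose validation loss is within 5% of the best) can be sketched as follows. The dictionary keys, the example numbers, and the tie-break on validation loss are hypothetical and only serve the illustration.

```python
def select_sparsest(candidates, tolerance=0.05):
    """Return the candidate with the smallest L0 norm of Phi among those
    whose validation loss is within `tolerance` of the best validation loss."""
    best = min(c["val_loss"] for c in candidates)
    eligible = [c for c in candidates if c["val_loss"] <= (1.0 + tolerance) * best]
    # Assumed tie-break: among equally sparse models, prefer the lower validation loss.
    return min(eligible, key=lambda c: (c["l0_norm"], c["val_loss"]))

# Hypothetical grid-search results (not taken from the paper).
results = [
    {"name": "a", "val_loss": 1.00, "l0_norm": 9},
    {"name": "b", "val_loss": 1.04, "l0_norm": 4},
    {"name": "c", "val_loss": 1.20, "l0_norm": 2},
]
print(select_sparsest(results)["name"])
```

In this toy grid, candidate "c" is the sparsest overall but is excluded because its validation loss is more than 5% above the best, so the rule falls back to "b".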

B.2 NEURALODE (CHEN ET AL., 2018)

The prediction dynamics corresponding to the latent NeuralODE model are given by $\frac{d\hat{X}_t}{dt} = F_{\text{nn}}(\hat{X}_t, z_{\leq r}; W_1)$, where $z_{\leq r} = F_{\text{enc}}(X_{t_0}, \dots, X_{t_r}; W_2)$ encodes the initial observations using a recurrent neural network $F_{\text{enc}}$ (e.g., GRU), and $F_{\text{nn}}$ is a feedforward neural network. The model is trained with an ODE solver (dopri5) and the gradients are computed using the adjoint method (Chen et al., 2018). We perform a grid search over the following hyperparameters: number of layers for $F_{\text{nn}}$, $L \in \{1, 2, 3\}$, size of each hidden layer of $F_{\text{nn}}$, $d_h \in \{32, 64, 128\}$, size of the encoder representation $z_{\leq r}$, $d_z \in \{32, 64, 128\}$, batch sizes $B \in \{32, 64\}$, and learning rates $\eta \in \{10^{-2}, 10^{-3}, 10^{-4}\}$.

B.3 DYAD (MODIFIED FOR ODES) (WANG ET AL., 2021B)

DyAd, originally proposed for forecasting PDEs, uses a meta-learning framework to adapt to different training tasks by learning a per-task weak label. We modify their approach for our ODE-based experiments. Since we do not assume that weak labels are available to supervise the adaptation, we use the mean of each variable in the training task as the task's weak label. We use NeuralODE as the base sequence model for the forecaster network. The forecaster network takes the initial observations as input and forecasts the future observations while being adapted with the encoder network. The encoder network is a recurrent network (GRU in our experiments) that takes as input the initial observations and predicts the weak label. The last layer representation from the encoder network is used to adapt NeuralODE via AdaIN (Huang & Belongie, 2017). We perform a grid search over the following hyperparameters: size of hidden layers for the forecaster and encoder networks $d_h \in \{32, 64, 128\}$, number of layers for the forecaster network, $L \in \{1, 2, 3\}$, batch sizes $B \in \{32, 64\}$, and learning rates $\eta \in \{10^{-2}, 10^{-3}, 10^{-4}\}$.

B.4 APHYNITY (YIN ET AL., 2021)

APHYNITY assumes that we are given a (possibly incomplete) physics model $\phi(\cdot, \Theta_{\text{phy}})$ with parameters $\Theta_{\text{phy}}$. When the training data may consist of tasks with different $W^{(i)}_*$, APHYNITY predicts the physics parameters for task i inductively using a recurrent neural network $G_{\text{nn}}$ from the initial observations of the system as $\hat{\Theta}^{(i)}_{\text{phy}} = G_{\text{nn}}(X_{t_0}, \dots, X_{t_r}; W_2)$. Then, APHYNITY augments the given physics model ϕ with a feedforward neural network component $F_{\text{nn}}$ and defines the final dynamics as

$$\frac{d\hat{X}_t}{dt} = \phi(\hat{X}_t; \hat{\Theta}^{(i)}_{\text{phy}}) + F_{\text{nn}}(\hat{X}_t; W_1).$$

APHYNITY solves a constrained optimization problem to minimize the norm of the neural network component while still predicting the training trajectories accurately. The model is trained with an ODE solver (dopri5) and the gradients are computed using the adjoint method (Chen et al., 2018). In our experiments, we provide APHYNITY with simpler physics models:
• For the damped pendulum system, we use a physics model that assumes no friction: $\frac{d\theta_t}{dt} = \omega_t$, $\frac{d\omega_t}{dt} = -\alpha_{\text{phy}}^2 \sin(\theta_t)$, where $\Theta_{\text{phy}} = \alpha_{\text{phy}}$ is the physics model parameter.
• For the predator-prey system, we use a physics model that assumes no interaction between the two species: $\frac{dp}{dt} = \alpha_{\text{phy}}\, p$, $\frac{dq}{dt} = -\gamma_{\text{phy}}\, q$, where $\Theta_{\text{phy}} = (\alpha_{\text{phy}}, \gamma_{\text{phy}})$ are the physics model parameters.
• For the epidemic model, we use a physics model that assumes the disease is not infectious: $\frac{dS}{dt} = 0$, $\frac{dI}{dt} = -\gamma_{\text{phy}} I$, $\frac{dR}{dt} = \gamma_{\text{phy}} I$, where $\Theta_{\text{phy}} = \gamma_{\text{phy}}$ is the physics model parameter.

In each dataset, APHYNITY needs to augment the physics model with a neural network component for accurate predictions.

We perform a grid search over the following hyperparameters: number of layers for $F_{\text{nn}}$, $L \in \{1, 2, 3\}$, size of each hidden layer of $F_{\text{nn}}$, $d_h \in \{32, 64, 128\}$, batch sizes $B \in \{32, 64\}$, and learning rates $\eta \in \{10^{-2}, 10^{-3}, 10^{-4}\}$.
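To make the role of the augmentation concrete, the following sketch adds a correction term to the frictionless pendulum model given to APHYNITY. It is an illustration only: the function names are hypothetical, and the hand-written `ideal_correction` stands in for the learnt neural network component, which in the actual baseline is trained rather than specified.

```python
import math

def frictionless_pendulum(state, alpha_phy):
    """Incomplete physics model: pendulum dynamics without the friction term."""
    theta, omega = state
    return (omega, -alpha_phy ** 2 * math.sin(theta))

def augmented_dynamics(state, alpha_phy, correction):
    """Physics model plus an additive correction (a stand-in for the neural network)."""
    physics = frictionless_pendulum(state, alpha_phy)
    residual = correction(state)
    return tuple(p + c for p, c in zip(physics, residual))

def ideal_correction(state, rho=0.2):
    """What the correction would have to represent here: the missing friction term."""
    _, omega = state
    return (0.0, -rho * omega)

state = (1.0, 0.5)
print(augmented_dynamics(state, 1.0, ideal_correction))
```

With this idealized correction the augmented model reproduces the full damped pendulum dynamics; a learnt correction only approximates it on the training domain, which is the robustness concern raised for inductive methods.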

B.5 SINDY (BRUNTON ET AL., 2016)

SINDy uses a given dictionary of basis functions to model the dynamics as $\frac{d\hat{X}_t}{dt} = \Theta(\hat{X}_t) W$, where Θ is the feature map with the basis functions (such as polynomial and trigonometric functions) and W is a weight matrix. SINDy is trained using sequential threshold least squares (STLS) to obtain sparse weights W. We perform a grid search over the following hyperparameters: threshold parameter used in STLS optimization, $\tau_0 \in \{0.005, 0.01, 0.05, 0.1, 0.2, 0.5\}$, and the regularization strength $\alpha \in \{0.05, 0.01, 0.1, 0.5\}$.

B.6 EQUATION LEARNER (MARTIUS & LAMPERT, 2016)

Equation learner (EQL) is a neural network architecture in which each layer, with input x and output o, is built from unary basis functions $f_i$ (such as sin, cos, etc.) and binary basis functions $g_i$ (such as multiplication). We use id, sin and multiplication functions in our implementation. EQL is trained using a sparsity-inducing ℓ1-regularization with hard thresholding for the final few epochs. We perform a grid search over the following hyperparameters: number of EQL layers, $L \in \{1, 2\}$, number of nodes for each type of basis function, $h \in \{1, 3, 5\}$, regularization strength $\alpha \in \{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$, batch sizes $B \in \{32, 64\}$, and learning rates $\eta \in \{10^{-2}, 10^{-3}, 10^{-4}\}$.

C.1 QUALITATIVE ANALYSIS

Recall from Equation (2) that the proposed model is defined as

$$\frac{d\hat{X}^{(i)}_t}{dt} = (W^{(i)} \odot \Phi)\, F(\hat{X}^{(i)}_t; \xi),$$

where $F(\hat{X}^{(i)}_t; \xi)$ is the vector of outputs from the basis functions, $\Phi \in \{0, 1\}^{d \times m}$ are the learnable parameters governing the global causal structure across all tasks, and $W^{(i)} \in \mathbb{R}^{d \times m}$ are task-specific parameters that act as coefficients in the linear combination of the selected basis functions. After training, the ODE learnt by the model can be easily inferred by checking all the terms in Φ that are greater than zero, i.e., $\Phi_{j,k} > 0$ implies that the edge $f_k(x_t; \xi_k) \to dx_{t,j}/dt$ exists in the causal graph. In other words, the RHS of the learnt ODE for $dx_{t,j}/dt$ contains the basis function $f_k(x_t; \xi_k)$.

Table 2 shows the ground truth ODE and the learnt ODE for the three experiments. For each learnt ODE, we also depict the learnable parameters $W_l$ that can be adapted using Equation (4) during test-time. For the damped pendulum and predator-prey systems, the RHS terms in the learnt ODE exactly match the ground truth ODE, and from Figures 2 and 4, it is clear that the method is able to accurately adapt the learnable parameters $W_l$ during test-time. For the epidemic modeling task, MetaPhysiCa learns a reparameterized version of the ground truth ODE (shown in Table 2), where N is a constant denoting the total population. While the learnt reparameterized ODE is more complex because it allows different values for $W'_a$, $W'_b$, $W'_c$, the test-time adaptation of these learnable parameters with the initial test observations results in them taking the same values.
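As an illustration of how a learnt ODE can be read off from Φ, the sketch below evaluates a derivative as a masked linear combination of basis outputs and prints the selected terms. The three-function dictionary, the matrices `Phi` and `W`, and all names are hypothetical; basis parameters ξ are omitted for simplicity.

```python
import math

# Hypothetical basis dictionary for a two-dimensional state x = (theta, omega).
BASIS = [
    ("theta", lambda x: x[0]),
    ("omega", lambda x: x[1]),
    ("sin(theta)", lambda x: math.sin(x[0])),
]

def derivative(x, W, Phi):
    """Row j gives dx_j/dt = sum_k W[j][k] * Phi[j][k] * f_k(x)."""
    feats = [f(x) for _, f in BASIS]
    return [sum(w * p * z for w, p, z in zip(W_row, Phi_row, feats))
            for W_row, Phi_row in zip(W, Phi)]

def describe(Phi, names=("dtheta/dt", "domega/dt")):
    """List, for each state dimension, the basis functions selected by Phi."""
    lines = []
    for j, (name, row) in enumerate(zip(names, Phi)):
        terms = [f"W[{j}][{k}]*{BASIS[k][0]}" for k, p in enumerate(row) if p > 0]
        lines.append(f"{name} = " + " + ".join(terms))
    return lines

Phi = [[0, 1, 0], [0, 1, 1]]                  # structure shared across tasks
W = [[1.0, 1.0, 0.0], [0.0, -0.2, -1.0]]      # task-specific coefficients
print("\n".join(describe(Phi)))
print(derivative((math.pi / 2, 0.5), W, Phi))
```

Note that `W[0][0]` is nonzero but has no effect, since the corresponding entry of `Phi` is zero; only the masked coefficients are adapted in a meaningful way at test time.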

C.2 ABLATION RESULTS

We present an ablation study comparing different components of MetaPhysiCa in Table 3. The table shows out-of-distribution test NRMSE for MetaPhysiCa without each individual component on the three dynamical systems (varying $W^{(i)}_*$ scenario). We observe that sparsity regularization (i.e., $\|\Phi\|_1$) and test-time adaptation are the most important components. For two out of three tasks, the method returns prediction errors without sparsity regularization.

When testing MetaPhysiCa without test-time adaptation, we simply use the mean of the task-specific weights learnt for training tasks as the task-specific weight for the given test trajectory, i.e., $\hat{W}^{(M+1)} = \frac{1}{M} \sum_{i=1}^{M} W^{(i)}$. This results in high OOD errors, showing the importance of test-time adaptation. The other two components of MetaPhysiCa, the task-specific ℓ1-regularization (i.e., $\|W^{(i)}\|_1$) and the V-REx penalty (Krueger et al., 2021), help in some experiments and perform comparably in others.
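The ablation without test-time adaptation replaces the adapted weights by the average of the training-task weights. A minimal sketch of that averaging, with a hypothetical function name and nested lists standing in for the d × m weight matrices:

```python
def mean_task_weights(task_weights):
    """Element-wise mean of M task-specific weight matrices (each d x m)."""
    M = len(task_weights)
    d = len(task_weights[0])
    m = len(task_weights[0][0])
    return [[sum(W[j][k] for W in task_weights) / M for k in range(m)]
            for j in range(d)]

# Two hypothetical training tasks with 1 x 2 weight matrices.
print(mean_task_weights([[[1.0, 2.0]], [[3.0, 4.0]]]))  # [[2.0, 3.0]]
```

Because this average ignores the observations of the test trajectory, it cannot track the task-specific parameters of a new task, which is consistent with the high OOD errors reported for this ablation.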

C.3 OUT-OF-DISTRIBUTION ODE PARAMETERS

The forecasting task in Definition 1 considers out-of-distribution initial conditions $X_{t_0}$ and in-distribution ODE parameters $W^{(i)}_*$. Here, we consider OOD values for the true dynamical system parameters $W^{(i)}_*$ as well, which significantly increases the difficulty of the forecasting task.

Consider the damped pendulum system: $\frac{d\theta_t}{dt} = \omega_t$, $\frac{d\omega_t}{dt} = -\alpha_*^2 \sin(\theta_t) - \rho_* \omega_t$, where $W_* = (\alpha_*, \rho_*)$ are the dynamical system parameters. Training: The pendulum is dropped from initial angles $\theta^{(i)}_{t_0} \sim U(0, \pi/2)$ with no angular velocity. We sample the dynamical system parameters $\alpha^{(i)}_* \sim U(1, 2)$

