MULTISCALE NEURAL OPERATOR: LEARNING FAST AND GRID-INDEPENDENT PDE SOLVERS

Abstract

Numerical simulations in climate, chemistry, or astrophysics are computationally too expensive for uncertainty quantification or parameter exploration at high resolution. Reduced-order or surrogate models are multiple orders of magnitude faster, but traditional surrogates are inflexible or inaccurate and pure machine learning (ML)-based surrogates are too data-hungry. We propose a hybrid, flexible surrogate model that exploits known physics for simulating large-scale dynamics and limits learning to the hard-to-model term, called parametrization or closure, which captures the effect of fine- onto large-scale dynamics. Leveraging neural operators, we are the first to learn grid-independent, non-local, and flexible parametrizations. Our multiscale neural operator is motivated by a rich literature in multiscale modeling, has quasilinear runtime complexity, is more accurate or flexible than state-of-the-art parametrizations, and is demonstrated on the chaotic multiscale Lorenz96 equation.

1. INTRODUCTION

Climate change increases the likelihood of storms, floods, wildfires, heat waves, biodiversity loss and air pollution (IPCC, 2018). Decision-makers rely on climate models to understand and plan for changes in climate, but current climate models are computationally too expensive: as a result, they are hard to access, cannot predict local changes (< 10 km), fail to resolve local extremes (e.g., rainfall), and do not reliably quantify uncertainties (Palmer et al., 2019). For example, running a global climate model at 1 km resolution can take ten days on a 4888-GPU-node supercomputer, consuming the same electricity as a coal power plant generates in one hour (Fuhrer et al., 2018). Similarly, in molecular dynamics (Batzner et al., 2022), chemistry (Behler, 2011), biology (Yazdani et al., 2020), energy (Zhang et al., 2019), astrophysics, or fluids (Duraisamy et al., 2019), scientific progress is hindered by the computational cost of solving partial differential equations (PDEs) at high resolution (Karniadakis et al., 2021). We propose the first PDE surrogate that quickly computes approximate solutions by correcting known large-scale simulations with learned, grid-independent, non-local parametrizations.

Figure 1: Multiscale neural operator (MNO). Explicitly modeling all scales of Earth's weather is too expensive for traditional and learning-based solvers (Palmer et al., 2019). MNO dramatically reduces the computational cost by modeling the large scale explicitly and learning the effect of fine- onto large-scale dynamics, such as turbulence slowing down a river stream. We embed a grid-independent neural operator in the large-scale physical simulation as a "parametrization", conceptually similar to stacking dolls (Snagglebit, 2022).
Surrogate models are fast, reduced-order, and lightweight copies of numerical simulations (Quarteroni & Rozza, 2014) and of significant interest in physics-informed machine learning (Kashinath et al., 2021; Reichstein et al., 2019; Karpatne et al., 2019; Ganguly et al., 2014). Machine learning (ML)-based surrogates have simulated PDEs one to three orders of magnitude faster than traditional numerical solvers and are more flexible and accurate than traditional surrogate models (Karniadakis et al., 2021). However, pure ML-based surrogates are too data-hungry (Rasp et al., 2020); hybrid ML-physics models address this, for example, by incorporating known symmetries (Bronstein et al., 2021; Batzner et al., 2022) or equations (Willard et al., 2022). Most hybrid models represent the solution at the highest possible resolution, which becomes computationally infeasible in multiscale or very high-resolution physics, even at optimal runtime (Pavliotis & Stuart, 2008; Peng et al., 2021). As depicted in Figs. 1 and 2, we simulate multiscale physics by running easy-to-access large-scale models and focusing learning on the challenging task: how can we model the influence of fine- onto large-scale dynamics, i.e., what is the subgrid parametrization term? The lack of accuracy in current subgrid parametrizations, also called closure or residual terms, is one of the major sources of uncertainty in multiscale systems, such as turbulence or climate (Palmer et al., 2019; Gentine et al., 2018). Learning subgrid parametrizations can be combined with incorporating equations as soft (Raissi et al., 2019) or hard (Beucler et al., 2021a) constraints. Various works learn subgrid parametrizations, but are inaccurate, hard to share, or inflexible because they are local (Gentine et al., 2018), grid-dependent (Lapeyre et al., 2019), or domain-specific (Behler J, 2007), respectively, as detailed in Section 2.
We are the first to formulate the parametrization problem as learning neural operators (Anandkumar et al., 2020) to represent non-local, flexible, and grid-independent parametrizations. We propose multiscale neural operator (MNO), a novel learning-based PDE surrogate for multiscale physics, with the key contributions:
• A learning-based multiscale PDE surrogate that has quasilinear runtime complexity, leverages known large-scale physics, is grid-independent and flexible, and does not require autodifferentiable solvers.
• The first surrogate to approximate grid-independent, non-local parametrizations via neural operators.
• Demonstration of the surrogate on a chaotic, coupled, multiscale PDE: multiscale Lorenz96.

2. RELATED WORKS

We embed our work in the broader field of physics-informed machine learning and surrogate modeling. We propose the first surrogate that corrects a coarse-grained simulation via learned, grid-independent, non-local parametrizations.

Direct numerical simulation. Despite significant progress in simulating physics numerically, it remains prohibitively expensive to repeatedly solve high-dimensional partial differential equations (PDEs) (Karniadakis et al., 2021). For example, finite difference, element, volume, and (pseudo-)spectral methods have to be re-run for every choice of initial or boundary condition, grid, or parameters (Farlow, 1993; Boyd, 2013). The issue arises if the chosen method does not have optimal runtime, i.e., does not scale linearly with the number of grid points, which renders it infeasibly expensive for calculating ensembles (Boyd, 2013). Select methods have optimal or close-to-optimal runtime, e.g., quasilinear O(N log N), and outperform machine-learning-based methods in runtime and accuracy, but their implementation often requires significant problem-specific adaptations; for example, multigrid (Briggs et al., 2000) or spectral methods (Boyd, 2013). We acknowledge impressive research directions towards optimal and flexible non-ML solvers, such as the spectral solver Dedalus (Burns et al., 2020), but advocate to simultaneously explore easy-to-adapt ML methods to create fast, accurate, and flexible surrogate models.

Surrogate modeling. Surrogate models are approximations, lightweight copies, or reduced-order models of PDE solutions, often fit to data, and used for parameter exploration or uncertainty quantification (Smith, 2013; Quarteroni & Rozza, 2014).
Surrogate models via SVD/POD (Chatterjee, 2000), Eigendecompositions/KLE (Fukunaga & Koontz, 1970), or Koopman operators/DMD (Williams et al., 2015) take simplifying assumptions on the dynamics, e.g., linearizing the equations, which can break down in high-dimensional or nonlinear regimes (Quarteroni & Rozza, 2014). Instead, our work leverages the expressiveness of neural operators as universal approximators (Chen & Chen, 1995) to learn fast high-dimensional surrogates that are accurate in nonlinear regimes.

The field of physics-informed machine learning is very broad, as reviewed most recently in (Willard et al., 2022) and (Karniadakis et al., 2021; Carleo et al., 2019; Karpatne et al., 2017). We focus on the task of learning fast and accurate surrogate models of fine-scale models when a fast and approximate coarse-grained simulation is available. This task differs from other interesting research areas in equation discovery or symbolic regression (Brunton et al., 2016; Long et al., 2018b; 2019; Liu et al., 2021; Qian et al., 2022), downscaling or superresolution (Xie et al., 2018; Bode et al., 2021; Kurinchi-Vendhan et al., 2021; Stengel et al., 2020; Vandal et al., 2017; Groenke et al., 2020), design space exploration or data synthesis (Chen & Ahmed; Chan & Elsheikh, 2019), controls (Bieker et al., 2020), or interpretability (Toms et al., 2020; McGraw & Barnes, 2018). Our work is complementary to data assimilation or parameter calibration (Jia et al., 2019; 2021; Karpatne et al., 2017; Zhang et al., 2019; Bonavita & Laloyaux, 2020), which fit to observational data instead of models, and differs from inverse modeling and parameter estimation (Parish & Duraisamy, 2016; Hamilton et al., 2017; Yin et al., 2021; Long et al., 2018a), which usually fit parametrizations that are independent of the previous state.

Correcting coarse-grid simulations via parametrizations. Problems with large domains are often solved via multiscale methods (Pavliotis & Stuart, 2008). Multiscale methods simulate the dynamics on a coarse grid and capture the effects of small-scale dynamics that occur within a grid cell via additive terms, called subgrid parametrizations, closures, or residuals (Pavliotis & Stuart, 2008; McGuffie & Henderson-Sellers, 2005).
Existing subgrid parametrizations for many equations are still inaccurate (Webb et al., 2015) and ML has outperformed them by learning parametrizations directly from high-resolution simulations; for example in turbulence (Duraisamy et al., 2019), climate (Gentine et al., 2018), chemistry (Hansen et al., 2013), biology (Peng et al., 2021), materials (Liu et al., 2022), or hydrology (Bennett & Nijssen, 2020). The majority of ML-based parametrizations, however, are local (Gentine et al., 2018; O'Gorman & Dwyer, 2018; Brenowitz & Bretherton, 2018; Brenowitz et al., 2020; Bretherton et al., 2022; Yuval et al., 2021; Cachay et al., 2021b; Bennett & Nijssen, 2020; Hansen et al., 2013; Liu et al., 2022; Prakash et al., 2021; Ling et al., 2016; Parish & Duraisamy, 2016; Wu et al., 2018; Rasp, 2020), i.e., the in- and outputs are variables of single grid points, which assumes perfect scale separation, for example, in isotropic homogeneous turbulent flows (P., 2006). However, local parametrizations are inaccurate, for example, in the case of anisotropic nonhomogeneous dynamics (P., 2006; Wang et al., 2022), for correcting the global error of coarse spectral discretizations (Boyd, 2013), or for chaotic dynamics (Pathak et al., 2018).

Figure 2: Left: A physics-based model, N, can quickly propagate the state, ū_t, at a large scale, but will accumulate the error, h = N(u) − N(ū). A neural operator, K_θ, wraps the computational and implementation complexities of unmodeled fine-scale dynamics into a non-local and grid-independent term, ĥ, that iteratively corrects the large-scale model. Right: Multiscale Lorenz96. We demonstrate multiscale neural operator (MNO) on the multiscale Lorenz96 equation, a model for chaotic atmospheric dynamics. Image: (Rasp, 2020).
More recent works propose non-local parametrizations, but their formulations either rely on a fixed-resolution grid (Wang et al., 2022; Blakseth et al., 2022; Lapeyre et al., 2019; Chattopadhyay et al., 2020b), an autodifferentiable solver (Um et al., 2020; Sirignano et al., 2020; Frezat et al., 2022), or are formulated for a specific domain (Behler J, 2007). A single work proposes non-local and grid-independent parametrizations (Pathak et al., 2020), but requires the explicit representation of a high-resolution state, which is computationally infeasible for large domains, such as in climate modeling. We are the first to propose grid-independent and non-local parametrizations via neural operators to create fast and accurate surrogate models of fine-scale simulations.

Neural operators for grid-independent, non-local parametrizations. Most current learning-based non-local parametrizations rely on FCNNs, CNNs (Lapeyre et al., 2019), or RNNs (Chattopadhyay et al., 2020b), which are mappings between finite-dimensional spaces and thus grid-dependent. In comparison, neural operators learn mappings between infinite-dimensional function spaces (Kovachki et al., 2021), such as the Laplacian, Hessian, gradient, or Jacobian. Typically, neural operators lift the input into a grid-independent state such as Fourier (Li et al., 2021a), Eigen- (Bhattacharya et al., 2020), graph kernel (Li et al., 2020; Anandkumar et al., 2020), or other latent (Lu et al., 2021) modes and learn weights in the lifted domain. We are the first to formulate neural operators for learning parametrizations.
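To make the grid-independence of Fourier-mode lifting concrete, the following is a minimal numpy sketch of a single Fourier-type operator layer: the learned weights act on a fixed number of retained modes, so the same weights evaluate on inputs of any resolution. The function name, toy weights, and single-layer setup are illustrative, not the paper's implementation.

```python
import numpy as np

def spectral_layer(u, weights):
    """Act on the lowest Fourier modes with learned complex weights.
    The weight count is fixed, so the same layer evaluates on any grid."""
    u_hat = np.fft.rfft(u)                       # grid -> Fourier modes
    out_hat = np.zeros_like(u_hat)
    k = weights.shape[0]
    out_hat[:k] = weights * u_hat[:k]            # learned action per mode
    return np.fft.irfft(out_hat, n=u.shape[0])   # Fourier modes -> grid

# The same (untrained) weights apply unchanged at two resolutions:
w = np.ones(4, dtype=complex)
x64 = np.linspace(0, 2 * np.pi, 64, endpoint=False)
x256 = np.linspace(0, 2 * np.pi, 256, endpoint=False)
coarse = spectral_layer(np.sin(x64), w)
fine = spectral_layer(np.sin(x256), w)
```

With identity weights on the retained modes, a wavenumber-1 sine passes through unchanged at both resolutions, which is exactly the resolution-invariance that grid-dependent CNN or RNN parametrizations lack.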

3. APPROACH

We propose multiscale neural operator (MNO): a surrogate model with quasilinear runtime complexity that exploits known coarse-grained simulations and learns a grid-independent, non-local parametrization. As detailed in the following, MNO propagates the dynamics according to:

$$\underbrace{\frac{\partial \bar u}{\partial t}}_{\text{Corrected large-scale dyn.}} = \underbrace{\mathcal N(\bar u)}_{\text{Large-scale dyn.}} + \underbrace{K_\theta(\bar u)}_{\text{Parametrization}} \quad (1)$$
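The update in (1) can be sketched as a one-line time step: advance the known large-scale tendency and add the learned correction. The integrator choice and the stand-in callables for N and K_θ below are illustrative assumptions; any time integrator and any trained operator could be substituted.

```python
import numpy as np

def mno_step(u_bar, dt, large_scale_tendency, parametrization):
    """One MNO update (Eq. 1): explicit Euler over the known large-scale
    tendency N(u_bar), corrected by the learned term K_theta(u_bar).
    Both callables are stand-ins; MNO itself is agnostic to them."""
    return u_bar + dt * (large_scale_tendency(u_bar) + parametrization(u_bar))

# Toy example: linear decay dynamics with a zero (untrained) parametrization.
u0 = np.ones(8)
u1 = mno_step(u0, 0.1, lambda u: -u, lambda u: np.zeros_like(u))
```

Because the parametrization enters additively, a trained K_θ can be coupled to an existing large-scale solver without modifying the solver internals.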

3.1. MULTISCALE NEURAL OPERATOR

Partial differential equations. We focus on partial differential equations (PDEs) that can be written as an initial value problem (IVP) via the method of lines (William, 1991). The PDEs in focus have one temporal dimension, $t \in [0, T] =: D_t$, and (multiple) spatial dimensions, $x = [x_1, \dots, x_d]^T \in D_x$, and can be written in the iterative, explicit, symbolic form (Farlow, 1993):

$$\frac{\partial u}{\partial t} - \mathcal N(u) = 0 \quad \text{with } (t, x) \in [0, T] \times D_x,$$
$$u(0, x) = u_0(x) \quad \text{with } x \in D_x, \qquad \mathcal B[u](t, x) = 0 \quad \text{with } (t, x) \in [0, T] \times \partial D_x. \quad (2)$$

In our case, the (non-)linear operator, $\mathcal N$, encodes the known physical equations; for example, a combination of Laplacian, integral, differential, etc. operators. Further, $u : D_t \times D_x \rightarrow D_u$ is the solution to the initial values, $u_0 : D_x \rightarrow D_u$, and Dirichlet, $\mathcal B_D[u] = u - b_D$, or Neumann boundary conditions, $\mathcal B_N[u] = n^T \partial_x u - b_N$, with outward-facing normal on the boundary, $n \perp \partial B$.

Scale separation. We transfer a concept from the rich mathematical literature in multiscale modeling (Pavliotis & Stuart, 2008) and consider a filter kernel operator, $G*$, that creates the large-scale solution, $\bar u$, such that $u(x) = \bar u(x) + u'(x)$, where $u'$ are the small-scale deviations and $\bar\cdot$ denotes the filtered variable,

$$\bar\varphi(x) = G * \varphi = \int_{D_x} G(x, x')\,\varphi(x')\,dx'.$$

Assuming the kernel, $G$, 1) preserves constant fields, $\bar a = a$, 2) commutes with differentiation, $[G*, \tfrac{\partial}{\partial s}] = 0$ with $s = x, t$, and 3) is linear, $\overline{\varphi + \psi} = \bar\varphi + \bar\psi$ (P., 2006), we can rewrite (2) as:

$$G * \frac{\partial u}{\partial t} = \frac{\partial \bar u}{\partial t} = G * \mathcal N(u) = \mathcal N(\bar u) + [G*, \mathcal N](u),$$

where $[G*, \mathcal N](u) = G * \mathcal N(u) - \mathcal N(G * u)$ is the filter subgrid parametrization, closure term, or commutation error, i.e., the error introduced through propagating the coarse-grained solution. Approximations of the subgrid parametrization as an operator that acts on $\bar u$ require significant domain expertise and are derived on a problem-specific basis.
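The filter and its commutation error can be made concrete with a short numpy sketch. The periodic box filter below is one possible choice of kernel G (the paper leaves the filter generic), and the operators passed in are toy stand-ins: for a linear N the commutation error vanishes, while a nonlinear N leaves a nonzero subgrid term to be learned.

```python
import numpy as np

def box_filter(u, width):
    """Coarse-graining G*u as a periodic moving average, computed as a
    circular convolution via the FFT (one possible kernel choice)."""
    kernel = np.ones(width) / width
    return np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(kernel, n=u.size)))

def commutation_error(u, N, G):
    """Subgrid parametrization h = [G*, N](u) = G*N(u) - N(G*u)."""
    return G(N(u)) - N(G(u))

# Toy operators: the commutation error vanishes for linear N, as the
# kernel assumptions imply, but not for nonlinear N.
u = np.random.default_rng(0).standard_normal(64)
G = lambda v: box_filter(v, 4)
h_lin = commutation_error(u, lambda v: 2.0 * v, G)   # linear N -> zero
h_nonlin = commutation_error(u, lambda v: v ** 2, G) # nonlinear N -> nonzero
```

This is exactly the quantity MNO learns: for linear dynamics nothing needs to be parametrized, and all learning capacity is spent on the nonlinear residual.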
In the case of isotropic homogeneous turbulence, for example, the subgrid parametrization can be approximated as the spatial derivative of the subgrid stress tensor (P., 2006):

$$[G*, \mathcal N](\bar u) \overset{\text{turbulence}}{\approx} \frac{\partial \tau_{ij}}{\partial x_j} = \frac{\partial \overline{u'_i u'_j}}{\partial x_j}. \quad (3)$$

Many works approximate the subgrid stress tensor with physics-informed ML (Prakash et al., 2021; Ling et al., 2016; Parish & Duraisamy, 2016; Wu et al., 2018), but are domain-specific, local, or require a differentiable solver or fixed grid. We propose a general-purpose method to approximate the subgrid parametrization, independent of the grid, domain, isotropy, and underlying solver.

Multiscale neural operator. We aim to approximate the parametrization / filter commutation error, $[G*, \mathcal N] \approx: h$, via learning a neural operator on high-resolution training data. Let $K_\theta$ be a neural operator with the mapping:

$$[G*, \mathcal N] \approx K_\theta : \bar U(D_x; \mathbb R^{d_u}) \rightarrow H(D_x; \mathbb R^{d_u}) \quad (4)$$

where $\theta$ are the learned parameters and $\bar U, H$ are separable Banach spaces of all continuous functions taking values in $\mathbb R^{d_u}$, defined on the bounded, open set, $D_x \subset \mathbb R^{d_x}$, with norm $\|f\|_{\bar U} = \|f\|_H = \max_{x \in D_x} |f(x)|$. We embed the neural operator as an autoregressive model with fixed time discretization, $\Delta t$, such that the final multiscale neural operator (MNO) model is:

$$\bar u(t + \Delta t) = f\Big(t, \bar u, \frac{\partial \bar u}{\partial x}, \frac{\partial^2 \bar u}{\partial x^2}, \dots\Big) + K_\theta(\bar u) \quad (5)$$

where $f(t, \bar u, \tfrac{\partial \bar u}{\partial x}, \tfrac{\partial^2 \bar u}{\partial x^2}, \dots) = \int_t^{t+\Delta t} \mathcal N(\bar u)\,d\tau$ is the known large-scale tendency, i.e., the one-step solution. MNO is fit via MSE with the loss function:

$$L = \mathbb E_t\, \mathbb E_{\bar u \mid u(t) \sim p(t)}\Big[\mathcal L\big(K_\theta(\bar u(t)),\, [G*, \mathcal N](u(t))\big)\Big] \quad (6)$$

where the ground-truth data, $u(t) \sim p(t)$, is generated by integrating a high-resolution simulation with varying parameters, initial or boundary conditions, and uniformly sampling time snippets according to the distribution $p(t)$.
Similar to problems in superresolution, there exist multiple realizations of the learned commutation error, [G*, N](ū), for a given ground truth, [G*, N](u); using MSE will learn a smooth average, and future work will explore adversarial losses (Goodfellow et al., 2014) or an intersection between neural operators and normalizing flows (Rezende & Mohamed, 2015) or diffusion-based models (Sohl-Dickstein et al., 2015) to account for the stochasticity (Wilks, 2005). During training, the model input is generated via ū(t) = G*(u(t)) and the target via h_target = N(u) − N(ū). During inference, MNO is initialized with a large-scale state and integrates the dynamics in time by coupling the neural operator and a large-scale simulation. Our approach does not need access to the high-resolution simulator or equations; it only requires a precomputed high-resolution dataset, which is increasingly available (Hersbach et al., 2020; Burns et al., 2022), and allows the user to incorporate existing easy-to-access solvers of large-scale equations. There is no requirement for the large-scale solver to be autodifferentiable, which significantly simplifies the implementation for large-scale models, such as in climate. If desired, our loss function can easily be augmented with a physics-informed loss (Raissi et al., 2019) on the large-scale dynamics or parametrization term.

Choice of neural operator. Our formulation is general enough to allow the use of many operators, such as Fourier (Li et al., 2021a), PCA-based (Bhattacharya et al., 2020), low-rank (Khoo & Ying, 2019), or graph (Li et al., 2020) operators, or DeepONet (Wang et al., 2021; Lu et al., 2021). Because DeepONet (Lu et al., 2021) focuses on interpolation and assumes fixed-grid sensor data, we decided to modify the Fourier neural operator (FNO) (Li et al., 2021a) for our purpose.
FNO is a universal approximator of nonlinear operators (Kovachki et al., 2021; Chen & Chen, 1995), is grid-independent, and can be formulated as an autoregressive model (Li et al., 2021a). As there exists significant knowledge on symmetries and conservation properties of the commutation error (P., 2006), MNO's explicit formulation increases interpretability and eases the incorporation of symmetries and constraints. With FNO, we exploit approximate translational symmetries in the data and leave novel opportunities for neural operators that exploit the full range of known equi- and invariances of the subgrid parametrization term, such as Galilean invariance (Prakash et al., 2021), for future work.

3.2. ILLUSTRATION OF MNO VIA MULTISCALE LORENZ96

We illustrate the idea of MNO on a canonical model of atmospheric dynamics, the multiscale Lorenz96 equation (Lorenz, 2006; Thornes et al., 2017). This PDE is multiscale, chaotic, time-continuous, space-discretized, 2D (space+time), and nonlinear; it is displayed in Fig. 2-right and detailed in Appendix A.3. Most importantly, the large- and small-scale solutions, $X_k \in \mathbb R$, $Y_{j,k} \in \mathbb R\ \forall j \in \{0, \dots, J\}, k \in \{0, \dots, K\}$, demonstrate the curse of dimensionality: the number of small-scale states grows with scale and explicit modeling becomes computationally expensive; for example, quadratic for two scales: $O(N^2) = O(JK)$. The PDE writes:

$$\frac{\partial X_k}{\partial t} = \underbrace{X_{k-1}(X_{k+1} - X_{k-2}) - X_k + F}_{\text{Large-scale dyn.: } \partial \bar X_k/\partial t}\ \underbrace{-\ \frac{h_s c}{b} \sum_{j=0}^{J-1} Y_{j,k}(X_k)}_{\text{Parametrization: } h},$$
$$\frac{\partial Y_{j,k}}{\partial t} = -c b\, Y_{j+1,k}(Y_{j+2,k} - Y_{j-1,k}) - c\, Y_{j,k} + \frac{h_s c}{b} X_k, \quad (8)$$

where $F$ is the forcing, $h_s$ the coupling strength, $b$ the relative magnitude of scales, and $c$ the evolution speed. With the multiscale framework from Section 3.1, we define:

$$u(x) = [X_0, Y_{0,0}, Y_{1,0}, \dots, Y_{J,0}, X_1, Y_{0,1}, \dots, X_K, \dots, Y_{J,K}]_x \quad \forall x \in D_x = \{0, \dots, K(J+1)\},$$
$$\mathcal N(u)(x) = \begin{cases} \partial X_k/\partial t & \text{if } x = k(J+1)\ \forall k \in \{0, \dots, K\} \\ \partial Y_{j,k}/\partial t & \text{otherwise,} \end{cases} \qquad G(x, x') = \begin{cases} 1 & \text{if } x' = k(J+1)\ \forall k \in \{0, \dots, K\} \\ 0 & \text{otherwise,} \end{cases}$$

with the solution, $u$, operator, $\mathcal N$, and kernel, $G$. MNO learns the parametrization term via a neural operator, $K_\theta = \hat h \approx h$, and then models:

$$\frac{\partial \hat X_k}{\partial t} = \frac{\partial \bar X_k}{\partial t} + K_\theta(\bar X_{0:K})(k) \quad (9)$$

where the known large-scale dynamics are approximated with $\partial \bar X_k/\partial t \approx \partial X_k/\partial t$ and the ground-truth parametrization is

$$h(x) = \begin{cases} -\frac{h_s c}{b} \sum_{j=0}^{J-1} Y_{j,k}(X_k) & \text{if } x = k(J+1)\ \forall k \in \{0, \dots, K\} \\ 0 & \text{otherwise.} \end{cases}$$

The parametrization, $K_\theta$, accepts inputs that are sampled anywhere inside the spatial domain, which differs from previous local (Rasp, 2020) or grid-dependent (Chattopadhyay et al., 2020b) Lorenz96 parametrizations.
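A minimal numpy sketch of the two-scale Lorenz96 tendencies follows. The parameter defaults are illustrative rather than the paper's exact configuration, and, as a simplification, the small-scale index j is treated as periodic within each k instead of the usual flattened indexing that couples neighboring k sectors.

```python
import numpy as np

def lorenz96_tendencies(X, Y, F=20.0, h_s=1.0, b=10.0, c=10.0):
    """Tendencies of the two-scale Lorenz96 system.
    X: (K,) large-scale states; Y: (J, K) small-scale states.
    Periodicity in j within each k is a simplifying assumption."""
    # Large scale: advection, damping, forcing, and small-scale coupling.
    dX = (np.roll(X, 1) * (np.roll(X, -1) - np.roll(X, 2))
          - X + F - (h_s * c / b) * Y.sum(axis=0))
    # Small scale: fast advection, damping, and large-scale coupling.
    dY = (-c * b * np.roll(Y, -1, axis=0) * (np.roll(Y, -2, axis=0)
                                             - np.roll(Y, 1, axis=0))
          - c * Y + (h_s * c / b) * X[None, :])
    return dX, dY
```

With the small scales switched off (Y = 0), the large-scale tendency reduces to the classic single-scale Lorenz96 system, which is exactly the term MNO's coarse solver integrates while K_θ supplies the missing coupling sum.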
We create the ground-truth data via randomly sampled initial conditions, periodic boundary conditions, and integrating the coupled equation with a 4th-order Runge-Kutta solver. After a Lyapunov timescale, the state is independent of the initial conditions and we extract 4K snippets of T/∆t = 400 steps length, corresponding to 10 Earth days, for 1-step training. During testing, the model is run autoregressively on 1K samples from a different initial condition, as detailed in Appendix A.3.
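The ground-truth integration can be sketched with a generic classical Runge-Kutta step; this is the textbook RK4 scheme the text names, written for any tendency function f, not the paper's solver code.

```python
def rk4_step(f, u, dt):
    """Classical 4th-order Runge-Kutta step: advance u by dt given the
    tendency f(u) = du/dt (generic sketch, works on floats or arrays)."""
    k1 = f(u)
    k2 = f(u + 0.5 * dt * k1)
    k3 = f(u + 0.5 * dt * k2)
    k4 = f(u + dt * k3)
    return u + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
```

Iterating `rk4_step` over the coupled tendencies, then subsampling snippets after a Lyapunov-timescale warm-up, yields the (input, target) pairs described above.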

4. RESULTS

Our results demonstrate that multiscale neural operator (MNO) is faster than direct numerical simulation, generates stable solutions, and is more accurate than current parametrizations. We now proceed to discuss each of these in more detail.

4.1. RUNTIME COMPLEXITY: MNO IS FASTER THAN TRADITIONAL PDE SOLVERS

MNO (orange in Fig. 3) has quasilinear, O(N log N), runtime complexity in the number of large-scale grid points, N = K, in the multiscale Lorenz96 equation. The runtime is dominated by a lifting operation, here a fast Fourier transform (FFT), which is necessary to learn spatial correlations in a grid-independent space. In comparison, the direct numerical simulation (black) has quadratic runtime complexity, O(N²), because of the explicit representation of N² = JK small-scale states. Both models are linear in time, O(T). Local parametrizations can achieve optimal runtime, O(N), but it is an open question if there exists a decomposition that replaces the FFT to yield an optimal, non-local, grid-independent model. We ran MNO up to a resolution of K = 2^24, which would equal 75 cm/px in a global 1D (space) climate model and only took ≈ 2 s on a single CPU. MNO is three orders of magnitude (1000×) faster than DNS at a resolution of K = 2^15 or 200 m/px. For 2D or 3D simulations the gains of using MNO vs. DNS are even higher, with O(N² log N) vs. O(N⁴) and O(N³ log N) vs. O(N⁶), respectively (Khairoutdinov et al., 2005). The runtimes have been calculated by choosing the best of 1–100k runs, depending on grid size, on a single-threaded Intel Xeon Gold 6248 CPU @ 2.50 GHz with 164 GB RAM. We time a one-step update which, for DNS, is the calculation of (8) and, for MNO, the calculation of (9), i.e., the sum of a large-scale step and a pass through the neural operator. In Fig. 3, the runtime of MNO and DNS plateaus at low resolution (K < 2^9), because the runtime measurement is dominated by grid-independent operations.
DNS plateaus at a lower runtime, because MNO contains several fixed-cost matrix transformations. The runtime of DNS has a slight discontinuity at K ≈ 2^9 due to extending from cache to RAM memory. We focus on a runtime comparison, but MNO also has significant savings in memory: representing the state at K = 2^17 in double precision occupies 64 GB RAM for DNS and 0.5 MB for MNO.

Table 1: RMSE comparison of forecast methods.
Method                                          RMSE
Climatology                                     6.902
Traditional parametrizations                    2.326
ML-based parametrization (Rasp et al., 2018)    2.053
MNO (ours)                                      0.5067

4.2. MNO IS MORE ACCURATE THAN TRADITIONAL PARAMETRIZATIONS

Figure 4-left shows a forecasted trajectory of a sample at the left boundary, k = 0, where MNO (orange-dotted) accurately forecasts the large-scale dynamics, X_0(t) (black-solid), while current ML-based (blue-dotted) (Gentine et al., 2018) and traditional parametrizations (red-dotted) quickly diverge. The quantitative comparison of the RMSE and a mean/std plot (Fig. 4) over 1K samples and 200 steps or 10 days (∆t = 0.005 = 36 min) confirms that MNO is the most accurate in comparison to ML-based parametrizations, traditional parametrizations, and a mean forecast (climatology). Note the difficulty of the task: when forecasting chaotic dynamics, even numerical errors rapidly amplify (P., 2006). The ML-based parametrization is a state-of-the-art (SoA) model in learning parametrizations and trains a ResNet to forecast a local, grid-independent parametrization, h_k = NN(X_k), similar to (Gentine et al., 2018). The traditional parametrizations (trad. param.) are often used in practice and use linear regression to learn a local, grid-independent parametrization (McGuffie & Henderson-Sellers, 2005). It was suggested that multiscale Lorenz96 is too easy as a test case for comparing offline models because traditional parametrizations already perform well (Rasp, 2019), but the significant difference between MNO and trad. param. shows that online evaluation is still interesting. The climatology forecasts the mean of the training dataset, $\bar X_k(t) = \frac{1}{T}\sum_{t=0}^{T}\frac{1}{N}\sum_{i=0}^{N} X_{k,i}$. The full list of hyperparameters and model parameters can be found in Appendix A.5.2. For fairness, we only compare against grid-independent methods that do not require an autodifferentiable solver. Models with soft or hard constraints, e.g., PINNs (Raissi et al., 2019) or DC3 (Donti et al., 2021), are complementary to MNO.
Further, note that our implementation of MNO uses an a priori loss function and could likely be improved by implementing an a posteriori loss function, i.e., a loss function that propagates the loss over multiple time steps, similar to (Frezat et al., 2022), which requires an autodifferentiable solver, or (Brandstetter et al., 2022), which does not.

4.3. MNO IS STABLE

Figure 5 shows that predicting large-scale dynamics with MNO is stable. We first plot a randomly selected sample of the first large-scale state, X_{k=0}(t) (left-black), to illustrate that the prediction is bounded. The MNO prediction (left-yellow) follows the ground truth up to an approximate horizon of t = 1.8, or 9 days, then diverges from the ground-truth solution, but stays within the bounds of the ground-truth trajectory and does not diverge to infinity. The RMSE over time in Figure 5 shows that MNO (yellow) is more accurate than current ML-based (blue) and traditional (red) parametrizations for an ≈ 100%-longer time, measuring the time to intersect with climatology. Despite the difficulty in predicting chaotic dynamics, the RMSE of MNO reaches a plateau, which is slightly above the optimal plateau given by the climatology (black). The RMSE over time is calculated as:

$$\text{RMSE}(t) = \sqrt{\frac{1}{K}\sum_{k=0}^{K}\frac{1}{N}\sum_{i=0}^{N}\big(\hat X_{k,i}(t) - X_{k,i}(t)\big)^2}.$$
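The RMSE-over-time metric can be sketched in vectorized form; the (N, T, K) array layout for samples, time steps, and grid points is an assumption for the sketch.

```python
import numpy as np

def rmse_over_time(pred, truth):
    """RMSE(t): root of the mean squared error over samples (axis 0) and
    grid points (axis 2), leaving a curve over time (axis 1).
    pred/truth have the assumed shape (N, T, K)."""
    return np.sqrt(np.mean((pred - truth) ** 2, axis=(0, 2)))
```

Plotting this curve against the constant climatology RMSE gives the intersection time used above to measure the usable forecast horizon.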

5. LIMITATIONS AND FUTURE WORK

We demonstrated the accuracy, speed, and stability of MNO on the chaotic multiscale Lorenz96 equation. Future work can extend MNO towards higher-dimensional or time-irregular systems and further integrate symmetries or constraints. The results show promise to extend MNO to higher-dimensional, chaotic, multiscale, multiphysics problems. We demonstrate the first steps towards quasi-geostrophic turbulence and Rayleigh-Bénard convection in Appendix A.1 and aim towards integration in large-scale global atmospheric models, e.g., as an approximation of cloud processes (Wang et al., 2022; Palmer et al., 2019). Reducing the cost of climate models could dramatically improve uncertainty quantification (Lütjens et al., 2021) or decision exploration (Rooney-Varga et al., 2020). MNO is grid-independent in space but not in time, which could be alleviated via integrations with neural ODEs (Chen et al., 2018). MNO is a myopic model, which might suffice for chaotic dynamics (Li et al., 2021b), but could be combined with LSTMs (Mohan et al., 2019) or reservoir computing (Pathak et al., 2018) to contain a memory. Further, we leveraged global Fourier decompositions to exploit grid-independent periodic spatial correlations, but future work could also capture local discontinuities, e.g., along coastlines (Jiang et al., 2021) with multiwavelets (Gupta et al., 2021), or incorporate non-periodic boundaries via Chebyshev polynomials. Lastly, MNO can be combined with geometric deep learning, PINNs, or hard-constraint models. This avenue of research is particularly exciting with MNO as there exist many known symmetries for various parametrization terms (Prakash et al., 2021).

6. CONCLUSION

We proposed a hybrid physics-ML surrogate of multiscale PDEs that is quasilinear, accurate, and stable. The surrogate limits learning to the influence of fine- onto large-scale dynamics and is the first to use neural operators for a grid-independent, non-local corrective term of large-scale simulations. We demonstrated that multiscale neural operator (MNO) is faster than direct numerical simulation (O(N log N) vs. O(N²)) and more accurate (≈ 100% longer prediction horizon) than state-of-the-art parametrizations on the chaotic multiscale Lorenz96 equation. With the dramatic reduction in runtime, MNO could enable rapid parameter exploration and robust uncertainty quantification in complex climate models.

Climate change is a defining challenge of our time and environmental disasters will become more frequent: from storms, floods, wildfires and heat waves to biodiversity loss and air pollution (IPCC, 2018). The impacts of these environmental disasters will likely be unjustly distributed: island states, minority populations, and the Global South are already facing the most severe consequences of climate change, while the Global North is responsible for most emissions since the industrial revolution (Althor et al., 2016). Decision-makers require more accurate, accessible, and local tools to understand and limit the economic and human impact of a changing climate (Palmer et al., 2019). We propose multiscale neural operator (MNO) to improve the parametrizations in climate models, thus leading to more accurate predictions. Related techniques to MNO, specifically neural-operator-based surrogate models, could help reduce the computational complexity of large-scale weather and climate models. The reduced computational complexity would make them more accessible to low-resource countries or allow for higher-resolution predictions.
Unfortunately, discoveries for faster differential equation solvers can and likely will be leveraged in ethically questionable fields, such as missile development or oil discovery. We acknowledge the possible negative impacts and hope that our targeted discussion and application to equations from climate modeling can steer our work towards a positive impact.

8. REPRODUCIBILITY STATEMENT

The code will be submitted in a comment directed to the reviewers and area chairs with a link to an anonymous repository once the discussion forum is opened. The code will be made publicly available upon acceptance along with an open-source license, data, and instructions to reproduce the main results. The list of hyperparameters per model is detailed in Appendix A.6.2; data splits are explained in the results; all simulation details are in Appendix A.4; and background on the neural operator model is in Appendix A.3. Conducting the study from ideation to publication used a total of approximately 10K CPU hours on an internal cluster.

A APPENDIX

A.1 QUASI-GEOSTROPHIC TURBULENCE

Figure 6: Quasi-Geostrophic Turbulence (QGT). We demonstrate the first steps to extend MNO to 2D QGT. This plot shows our ground-truth training data with the direct numerical simulation, $\omega$, and subgrid parametrization, $\omega'$.

We are planning to demonstrate multiscale neural operator on a high-dimensional system, specifically the one-layer quasi-geostrophic (QG) turbulence depicted in Fig. 6. QG turbulence is a derivative of the Navier-Stokes equations and a good model for atmospheric turbulence at the equator. The equations are derived from the incompressible, i.e., $\nabla \cdot u = 0$, Navier-Stokes equations by 1) taking the curl of the velocity field, $\omega = \nabla \times u$, and 2) assuming the beta-plane approximation, $f = f_0 + \beta y$, and the hydrostatic, $\partial p / \partial z = -\rho g$, and geostrophic, $f v = \frac{1}{\rho} \partial p / \partial x$, $f u = -\frac{1}{\rho} \partial p / \partial y$, balances (Majda & Wang, 2006). The resulting equations, called quasi-geostrophic turbulence, are given by:

$\partial_t \omega + J(\psi, \omega) = \nu \nabla^2 \omega - \mu \omega - \beta \partial_x \psi + F$
$\omega = \nabla^2 \psi$

where $\omega$ is the vorticity, $u = [u, v]^T = [-\partial_y \psi, \partial_x \psi]^T$ is the velocity vector, $\psi$ is the streamfunction, and $J(\psi, \omega) = \partial_x \psi \, \partial_y \omega - \partial_y \psi \, \partial_x \omega$ is the nonlinear Jacobian operator. Further, the parameters are the turbulent viscosity, $\nu$, linear drag coefficient, $\mu$, Rossby parameter, $\beta$, and source term, $F$. The vorticity can be computed as $\omega = \hat z \cdot \nabla \times u = \partial_x v - \partial_y u$. Filtering the equation with a kernel results in the parametrized large-scale equation:

$\partial_t \bar\omega + J(\bar\psi, \bar\omega) = \nu \nabla^2 \bar\omega - \mu \bar\omega - \beta \partial_x \bar\psi + \bar F + \underbrace{J(\bar\psi, \bar\omega) - \overline{J(\psi, \omega)}}_{\text{Parametrization: } h(\psi, \omega)}$

We then aim to approximate the subgrid-scale (SGS) parametrization with the neural operator, $K_\theta \approx h$, such that the final model is:

$\frac{\partial \bar\omega(x, y)}{\partial t} = \frac{\partial \tilde\omega(x, y)}{\partial t} + K_\theta(\bar\psi, \bar\omega)(x, y)_\omega$
$\frac{\partial \bar\psi(x, y)}{\partial t} = \frac{\partial \tilde\psi(x, y)}{\partial t} + K_\theta(\bar\psi, \bar\omega)(x, y)_\psi$

The QG turbulence equation is solved with a pseudospectral solver in space and RK4 explicit time integration.
We choose the parameters $N_x = N_y = 512$, $\Delta t = 480\,$s, $\mu = 1.25 \times 10^{-8}\,$s$^{-1}$, $\nu = 352\,$m$^2$/s, $\beta = 0$, and Reynolds number, $Re = 22 \times 10^4$. The variables are non-dimensionalized with $T_d = 1.2 \times 10^6\,$s, i.e., $\Delta t_{solver} = \Delta t / T_d$, and $L_d = 504 \times 10^4 / \pi\,$m, i.e., $\Delta x_{solver} = 2\pi / N_x$. The reduced system is run with scale $\delta = 4$, such that $\bar N_x = \bar N_y = 128$. The forcing initiates turbulent mixing and simulates slowly varying wind stress according to $F = C_F(t)[\cos(4y + \pi \sin(1.4t)) - \cos(4x + \pi \sin(1.5t))]$ and $0.5\|F\|_2 = 3$, with enstrophy injection rate, $C_F(t)$. To generate turbulent chaotic dynamics that are decoupled from the initial state, the simulation is initialized with some large-scale Fourier states and warmed up for 1300 days. After warm-up we generate 18000 iterations ($\delta \cdot 18000$ at the fine scale) from which we independently sample training and validation snippets.
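The nonlinear Jacobian $J(\psi, \omega)$ above is the costly term in a pseudospectral QG solver. The following is a minimal illustrative sketch, not our solver: the $[0, 2\pi)^2$ periodic grid, the normalization, and the omission of dealiasing are all simplifying assumptions.

```python
import numpy as np

def spectral_jacobian(psi, omega):
    """Pseudospectral J(psi, omega) = psi_x * omega_y - psi_y * omega_x
    on a periodic [0, 2*pi)^2 grid (sketch; dealiasing omitted)."""
    n = psi.shape[0]
    k = np.fft.fftfreq(n, d=1.0 / n)           # integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")

    def ddx(f):  # spectral derivative along axis 0
        return np.real(np.fft.ifft2(1j * kx * np.fft.fft2(f)))

    def ddy(f):  # spectral derivative along axis 1
        return np.real(np.fft.ifft2(1j * ky * np.fft.fft2(f)))

    return ddx(psi) * ddy(omega) - ddy(psi) * ddx(omega)
```

For smooth periodic fields the spectral derivatives are exact to machine precision, e.g., $J(\sin x, \sin y) = \cos x \cos y$.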

A.2 RAYLEIGH-B ÉNARD CONVECTION

The proposed multiscale neural operator can also be leveraged for systems without an explicit multiscale formulation. We demonstrate this by formulating the MNO equations for the Rayleigh-Bénard convection equations, as displayed in Fig. 7.

A.2.1 DETAILS AND INTERPRETATION

Rayleigh-Bénard convection (RBC) is a challenging set of equations for turbulent, chaotic, and convection-dominated flows. The equation finds applications in fluid dynamics, atmospheric dynamics, radiation, phase changes, magnetic fields, and more (Pandey et al., 2018). So far, we have generated a ground-truth dataset of the 2D turbulent Rayleigh-Bénard convection equations, implemented with the Dedalus spectral solver (Burns et al., 2020) similar to (Pandey et al., 2018):

$\frac{\partial u}{\partial t} + u \cdot \nabla u = \sqrt{\frac{Pr}{Ra}} \nabla^2 u - \nabla p + b$
$\frac{\partial T}{\partial t} + u \cdot \nabla T = \frac{1}{\sqrt{Ra\,Pr}} \nabla^2 T$
$\nabla \cdot u = 0$

with temperature/buoyancy, $T$, Rayleigh number, $Ra = g \alpha \Delta T H^3 / (\nu \kappa)$, thermal expansion coefficient, $\alpha$, Prandtl number, $Pr = \nu / \kappa$, momentum diffusivity or kinematic viscosity, $\nu$, thermal diffusivity, $\kappa = \frac{1}{\sqrt{Ra\,Pr}}$ in non-dimensional units, acceleration due to gravity, $g$, temperature difference, $\Delta T$, unit vector, $e$, pressure, $p$, Nusselt number, $Nu$, Reynolds number, $Re = \sqrt{Ra / Pr}\,\sqrt{\langle u^2 \rangle_{V,t}}$, full volume-time average, $\langle \cdot \rangle_{V,t}$, and cell length, $L_x$. The equations have been non-dimensionalized with the free-fall velocity, $U_f = \sqrt{g \alpha \Delta T H}$, and cell height, $H$. In the horizontal direction, $x$, we use periodic boundary conditions; in the vertical direction, $z$, we use no-slip boundary conditions for the velocity, $u(z = 0) = u(z = L_z) = 0$, and fixed temperatures, $T(z = 0) = L_z$, $T(z = L_z) = 0$. The initial conditions are sampled randomly, $b(z, t = 0) = L_z - z + z(L_z - z)\omega$, with $\omega \sim N(0, 1 \times 10^{-3})$. We chose: $Ra = 2 \times 10^6$, $Pr = 1$, $L_x = 4$, $H = 1$.
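In these non-dimensional units the two diffusion coefficients follow directly from $Ra$ and $Pr$ ($\nu = \sqrt{Pr/Ra}$, $\kappa = 1/\sqrt{Ra\,Pr}$). A minimal sketch, with a hypothetical helper name:

```python
import math

def rbc_coefficients(Ra, Pr):
    """Non-dimensional diffusivities for the RBC equations above (sketch):
    momentum diffusivity nu = sqrt(Pr/Ra), thermal diffusivity kappa = 1/sqrt(Ra*Pr)."""
    nu = math.sqrt(Pr / Ra)
    kappa = 1.0 / math.sqrt(Ra * Pr)
    return nu, kappa

# parameters used in our dataset
nu, kappa = rbc_coefficients(Ra=2e6, Pr=1.0)
```

Note that for $Pr = 1$ the two coefficients coincide, so momentum and heat diffuse at the same non-dimensional rate.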

A.3 FOURIER NEURAL OPERATOR

Our neural operator for learning subgrid parametrizations is based on Fourier neural operators (Li et al., 2021a). Intuitively, the neural operator learns a parameter-to-solution mapping by learning a global convolution kernel. In detail, it learns the operator that transforms the current large-scale state, $X(x_{0:K}, t) =: X_{0:K} \in \mathbb{R}^{K \times d_X}$, into the subgrid parametrization, $\hat f_x(x_{0:K}, t) \in \mathbb{R}^{K \times d_X}$, with number of grid points, $K$, and input dimensionality, $d_X$, according to the following equations:

$v_0 = X_{0:K} P^T + 1_{K \times 1} b_P$
$v_{i+1} = \sigma\big(v_i W^T + \int_{D_x} \kappa_\phi(x, x') v_i(x')\,dx'\big) \approx \sigma\big(v_i W^T + 1_{K \times 1} b_W + \mathcal{F}^{-1}(R_\phi \cdot \mathcal{F} v_i)\big)$
$\hat f_{x,0:K} = v_{n_d} Q^T + 1_{K \times 1} b_Q$ (15)

First, MNO lifts the input via a linear transform with matrix, $P \in \mathbb{R}^{n_v \times d_X}$, bias, $b_P \in \mathbb{R}^{1 \times n_v}$, vector of ones, $1_{K \times 1}$, and number of channels, $n_v$. The linear transform is local in space, i.e., the same transform is applied to each grid point. Second, multiple nonlinear "Fourier layers" are applied to the encoded/lifted state. The encoded/lifted state's, $v_i \in \mathbb{R}^{K \times n_v}$, spatial dimension is transformed into the Fourier domain via a fast Fourier transform. We implement the FFT as a multiplication with the pre-built forward and inverse DFT matrices, $\mathcal{F} \in \mathbb{C}^{k_{max} \times K}$ and $\mathcal{F}^{-1} \in \mathbb{C}^{K \times k_{max}}$, respectively, returning the vector, $\mathcal{F} v_i \in \mathbb{C}^{k_{max} \times n_v}$. The dynamics are learned by convolving the encoded state with a weight matrix. In Fourier space, convolution is a multiplication; hence each frequency is multiplied with a complex weight matrix across the channels, such that $R \in \mathbb{C}^{k_{max} \times n_v \times n_v}$. In parallel to the convolution with $R$, the encoded state is multiplied with the linear transform, $W \in \mathbb{R}^{n_v \times n_v}$, and bias, $b_W \in \mathbb{R}^{1 \times n_v}$. From a representation learning perspective, the Fourier decomposition can be seen as a fast and interpretable feature extraction method that extracts smooth, periodic, and global features. The linear transform can be interpreted as a residual term concisely capturing nonlinear residuals.
So far, we have only applied linear transformations. To introduce nonlinearities, we apply a nonlinear activation function, $\sigma$, at the end of each Fourier layer. While the non-smoothness of the activation function ReLU, $\sigma(z) = \max(0, z)$, could introduce unwanted discontinuities in the solution, we chose it because it resulted in more accurate models than smoother activation functions such as tanh or sigmoid. Finally, the transformed state, $v_{n_d}$, is projected back onto the solution space via another linear transform, $Q \in \mathbb{R}^{d_X \times n_v}$, and bias, $b_Q$. The values of all trainable parameters, $P, R, W, Q, b_*$, are found with a nonlinear optimization algorithm, such as stochastic gradient descent or, here, Adam (Kingma & Ba, 2015). We use the MSE between the predicted, $\hat f_x$, and ground-truth, $f_x$, subgrid parametrizations as the loss. The neural operator is implemented in PyTorch, but does not require an autodifferentiable PDE solver to generate training data. During implementation, we used the DFT, which assumes a uniformly spaced grid, but it can be exchanged with non-uniform DFTs (NUDFT) to transform non-uniform grids (Dutt & Rokhlin, 1993).
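A single Fourier layer from Eq. (15) can be sketched in a few lines of NumPy. This is a simplified stand-in for the PyTorch implementation: the shapes, the real-valued FFT, and the random weights are illustrative assumptions, not our trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_layer(v, R, W, b_W):
    """One Fourier layer (sketch): sigma(v W^T + b_W + IFFT(R * FFT(v))).
    v: (K, n_v) real state; R: (k_max, n_v, n_v) complex spectral weights;
    W: (n_v, n_v) linear transform; b_W: (n_v,) bias. ReLU nonlinearity."""
    K, n_v = v.shape
    k_max = R.shape[0]
    v_hat = np.fft.rfft(v, axis=0)                 # (K//2+1, n_v), complex
    out_hat = np.zeros_like(v_hat)
    # truncate the spectrum and multiply each kept mode channel-wise with R
    out_hat[:k_max] = np.einsum("kij,kj->ki", R, v_hat[:k_max])
    spectral = np.fft.irfft(out_hat, n=K, axis=0)  # back to physical space
    return np.maximum(0.0, v @ W.T + b_W + spectral)  # ReLU activation

K, n_v, k_max = 32, 4, 8
v = rng.standard_normal((K, n_v))
R = rng.standard_normal((k_max, n_v, n_v)) + 1j * rng.standard_normal((k_max, n_v, n_v))
W = rng.standard_normal((n_v, n_v))
b_W = rng.standard_normal(n_v)
out = fourier_layer(v, R, W, b_W)
```

Truncating to the lowest $k_{max}$ modes is what keeps the layer both cheap and independent of the grid resolution $K$.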

A.4.1 DETAILS AND INTERPRETATION

The equation contains $K$ large-scale variables, $X_k \in \mathbb{R}$, and $JK$ small-scale variables, $Y_{j,k} \in \mathbb{R}$, that represent large-scale or small-scale atmospheric dynamics, such as the movement of storms or the formation of clouds, respectively. At every time step each large-scale variable, $X_k$, influences and is influenced by $J$ small-scale variables, $Y_{0:J,k}$. The coupling could be interpreted as $X_k$ causing static instability and $Y_{j,k}$ causing drag from turbulence or latent heat fluxes from cloud formation. The indices $k, j$ are both interpreted as latitude, where $k \in \{0, ..., K-1\}$ indexes boxes of latitude and $j \in \{0, ..., J-1\}$ indexes elements inside the box. Illustrated on a 1D Earth with a circumference of $360°$ that is discretized with $K = 36$, $J = 10$, a spatial step in $k$ or $j$ would equal $10°$ or $1°$, respectively (Lorenz, 2006); we choose $K = J = 4$. A time step with $\Delta t = 0.005$ would equal 36 minutes (Lorenz, 2006). We choose a large forcing, $F > 10$, for which the equation becomes chaotic. The last terms in each equation capture the interaction between small and large scale, $f_{x,k} = -\frac{hc}{b} \sum_{j=0}^{J-1} Y_{j,k}$ and $f_{y,j,k} = \frac{hc}{b} X_k$. The scale interaction is defined by the parameters $h = 0.5$, the coupling strength between spatial scales (with no coupling if $h$ were zero), $b = 10$, the relative magnitude, and $c = 8$, the evolution speed of $Y$ relative to $X$. The linear, $-X_k$, and quadratic, $X^2_*$, terms model dissipative and advective (e.g., moving) dynamics, respectively. The equation assumes perfect "scale separation", which means that small-scale variables of different grid boxes, $k$, are independent of each other at a given time step, $Y_{j_1,k_1}(t) \perp Y_{j_2,k_2}(t)\ \forall t, j_1, j_2, k_1 \neq k_2$. The separation of small- and large-scale variables can be along the same or different domains, and the discretized variables would then be $y \in [0, \Delta x]$ or $y \in [y_0, y_{end}]$, respectively.
The equation wraps around the full large- or small-scale domain by using periodic boundaries, $X_{-k} := X_{K-k}$, $X_{K+k} := X_k$, $Y_{-j,k} := Y_{J-j,k}$, $Y_{J+j,k} := Y_{j,k}$. Note that having periodic boundary conditions in the small-scale domain allows for superparametrization, i.e., independent simulation of the small-scale dynamics (Campin et al., 2011), and differs from the three-tier Lorenz96, where variables at the borders of the small-scale domain depend on small-scale variables of the neighbouring $k$ (Thornes et al., 2017).
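With the periodic boundaries above, the two-scale tendencies can be evaluated with array rolls. This is a sketch following the standard two-scale Lorenz96 convention with the coupling terms $f_{x,k} = -\frac{hc}{b}\sum_j Y_{j,k}$ and $f_{y,j,k} = \frac{hc}{b} X_k$; the exact sign convention of the small-scale advection term is an assumption.

```python
import numpy as np

def lorenz96_two_scale(X, Y, F=20.0, h=0.5, b=10.0, c=8.0):
    """Tendencies of the two-scale Lorenz96 system (sketch).
    X: (K,) large-scale state; Y: (J, K) small-scale state.
    np.roll implements the periodic boundaries in both k and j."""
    # large scale: advection, dissipation, forcing, coupling f_x
    dX = (np.roll(X, 1) * (np.roll(X, -1) - np.roll(X, 2))
          - X + F - (h * c / b) * Y.sum(axis=0))
    # small scale: fast advection, dissipation, coupling f_y back from X
    dY = (-c * b * np.roll(Y, -1, axis=0) * (np.roll(Y, -2, axis=0) - np.roll(Y, 1, axis=0))
          - c * Y + (h * c / b) * X[None, :])
    return dX, dY
```

At the rest state $X = Y = 0$ only the forcing acts, so $dX_k = F$ and $dY_{j,k} = 0$, which is a quick sanity check of the implementation.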

A.4.2 SIMULATION

The initial conditions are sampled uniformly from a set of integers, $X(t_0) \sim U(-5, -4, ..., 5, 6)$, as a mean-zero unit-variance Gaussian, $Y(t_0) \sim N(0, 1)$, and a lower-scale Gaussian, $Z(t_0) \sim 0.05\,N(0, 1)$. The train and test sets contain 4k and 1k samples, respectively. Each sample is $T = 1$ model time unit (MTU) or 200 ($= T/\Delta t$) time steps long, which corresponds to 5 Earth days ($= T/\Delta t \cdot 36\,$min with $\Delta t = 0.005$) (Lorenz, 2006). Hence, our results test the generalization towards different initial conditions, but not robustness to extrapolation or different choices of the parameters, $c, b, h, F$. The sampling starts after $T = 10$ warm-up time. The dataset uses double precision. We solve the equation with fourth-order Runge-Kutta in time with step size $\Delta t = 0.005$, similar to (Lorenz & Emanuel, 1998). For a PDE that is discretized with a fixed time step, $\Delta t$, the ground-truth train and test data, $h_{x,0:K}(t)$, is constructed by integrating the coupled large- and small-scale dynamics. Note that the neural operator only takes in the current state of the large-scale dynamics, i.e., it uses the full large-scale spatial domain as input, which exploits spatial correlations and learns parametrizations that are independent of the large-scale spatial discretization. Our method can be queried for arbitrarily many time steps into the future, as it does not use time as input. We incorporate the prior knowledge from physics by calculating the large-scale dynamics, $dX_{LS,0:K}$. Note that the small-scale physics do not need to be known; hence, MNO could be applied to any fixed time-step dataset for which an approximate model is known. The time-series modeling task uses a history of only one time step to learn chaotic dynamics (Li et al., 2021b). We use the Adam optimizer with learning rate, $\lambda = 0.001$, scheduler step size, 20, number of epochs, $n_e = 2$, and an exponential learning rate scheduler with $\gamma = 0.9$ (Kingma & Ba, 2015).
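The classical fourth-order Runge-Kutta step used to generate the ground-truth trajectories can be sketched generically; the function below is an illustrative helper, not our solver code.

```python
def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step (sketch).
    f maps a state to its tendency d(state)/dt."""
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```

With $\Delta t = 0.005$, integrating $\dot x = x$ from $x(0) = 1$ for 200 steps (one MTU) recovers $e$ to roughly $10^{-11}$, matching RK4's fourth-order global accuracy.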
Training took 1:50 min on a single-core Intel i7-7500U CPU @ 2.70GHz.

Multiscale Lorenz96: ML-based parametrization. The ML-based parametrization uses a ResNet with $n_{layers} = 2$ residual layers that contain a fully connected network with $n_{units} = 32$ units. The model is optimized with Adam (Kingma & Ba, 2015) with learning rate 0.01, $\beta = (0.9, 0.999)$, $\epsilon = 1 \times 10^{-8}$, trained for $n_{epochs} = 20$ epochs.

Multiscale Lorenz96: Traditional parametrization. The traditional parametrization uses least-squares to find the best linear fit. The weight matrix is computed as $A = (X^T X)^{-1} X^T Y$, where $X$ and $Y$ are the concatenations of input large-scale features and target parametrizations, respectively. Inference is conducted with $\hat y = A x$.
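The traditional baseline's closed-form fit can be sketched as follows; the data here is synthetic, and `np.linalg.lstsq` replaces the explicit normal equations $(X^T X)^{-1} X^T Y$ for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear parametrization baseline (sketch): fit A minimizing ||X A^T - Y||^2.
# X: (n_samples, K) large-scale inputs; Y: (n_samples, K) target parametrizations.
X = rng.standard_normal((500, 4))
A_true = rng.standard_normal((4, 4))      # synthetic ground-truth linear map
Y = X @ A_true.T

# least-squares fit; equivalent to the normal equations when X has full rank
A = np.linalg.lstsq(X, Y, rcond=None)[0].T

# inference: y_hat = A x
y_hat = A @ X[0]
```

On noise-free linear data the fit recovers the generating matrix exactly, which is the degenerate best case; for the actual Lorenz96 targets the linear model is only a coarse approximation.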

A.7 NEURAL NETWORKS VS. NEURAL OPERATORS

Most work in physics-informed machine learning relies on fully connected neural networks (FCNNs) or convolutional neural networks (Karniadakis et al., 2021). FCNNs, however, are mappings between finite-dimensional spaces and learn mappings for single equation instances rather than learning the PDE solver. In our case, FCNNs only learn mappings on fixed spatial grids. We leverage the recently formulated neural operators to extend the formulation to arbitrary grids. The key distinction is that the FCNN learns a parameter-dependent set of weights, $\Phi_{a_y}$, that has to be retrained for every new parameter setting. The neural operator is a learned function mapping with parameter-independent weights, $\Theta$, that takes parameter settings as input and returns a function over the spatial domain, $G_\Theta(a_y)$. In comparison, the forcing term is approximated by an FCNN as $\hat f_{x,\Phi}(x_k; a_y) = g_{\Phi_{a_y}}(x_k)$ and by a neural operator as $\hat f_{x,\Theta}(x_k; a_y) = G_\Theta(a_y)(x_k)$. The mappings are given by: FCNN: $g_{\Phi_{a_y}}: D_x \to \mathbb{R}^{d_X}$, NO: $G_\Theta: H_{a_y}(D_x; \mathbb{R}^{d_{a_y}}) \to H_X(D_x; \mathbb{R}^{d_X})$. $H_{a_y}$ is a Banach space of PDE parameter functions, $a_y$, that map the spatial domain, $D_y$, onto $d_{a_y}$-dimensional parameters, such as ICs, BCs, parameters, or forcing terms. $H_X$ is the function space of residuals that map the spatial domain, $D_x$, onto the space of $d_X$-dimensional residuals, $\mathbb{R}^{d_X}$.
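The practical consequence of the operator view is grid independence: one fixed set of spectral weights defines a function-to-function map that can be evaluated on any uniform grid. The sketch below uses hypothetical hand-picked weights to make this concrete; it is an illustration of the principle, not our trained operator.

```python
import numpy as np

def spectral_operator(a, R):
    """Grid-independent linear operator (sketch): weight the lowest Fourier
    modes of a periodic function by fixed complex coefficients R, then
    transform back on the same grid. Works for any input length."""
    a_hat = np.fft.rfft(a)
    out_hat = np.zeros_like(a_hat)
    k = min(len(R), len(a_hat))
    out_hat[:k] = R[:k] * a_hat[:k]       # truncated spectral multiplication
    return np.fft.irfft(out_hat, n=len(a))

R = np.array([0.0, 1.0, 0.5])             # hypothetical learned weights, k_max = 3

def sample(K):                            # same continuous function, two grids
    x = np.linspace(0, 2 * np.pi, K, endpoint=False)
    return x, np.sin(x) + np.cos(2 * x)

x64, a64 = sample(64)
x256, a256 = sample(256)
out64 = spectral_operator(a64, R)
out256 = spectral_operator(a256, R)
```

Because the DFT normalization cancels between the forward and inverse transforms, the coarse-grid output agrees with the fine-grid output at the shared points; an FCNN trained on the 64-point grid could not even ingest the 256-point input.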



Figure 1: Multiscale neural operator (MNO). Explicitly modeling all scales of Earth's weather is too expensive for traditional and learning-based solvers (Palmer et al., 2019). MNO dramatically reduces the computational cost by modeling the large scale explicitly and learning the effect of fine- onto large-scale dynamics, such as turbulence slowing down a river stream. We embed a grid-independent neural operator in the large-scale physical simulations as a "parametrization", conceptually similar to stacking dolls (Snagglebit, 2022).

Figure 2: Left: Model Architecture. A physics-based model, $\bar N$, can quickly propagate the state, $\bar u_t$, at a large scale, but will accumulate the error, $h = N(u) - \bar N(\bar u)$. A neural operator, $K_\theta$, wraps the computational and implementation complexities of unmodeled fine-scale dynamics into a non-local and grid-independent term, $\hat h$, that iteratively corrects the large-scale model. Right: Multiscale Lorenz96. We demonstrate multiscale neural operator (MNO) on the multiscale Lorenz96 equation, a model for chaotic atmospheric dynamics. Image: (Rasp, 2020).

See Appendix A.4 for all terms.

Figure 3: MNO is faster than direct numerical simulation. Our proposed multiscale neural operator (orange) can propagate multiscale PDE dynamics in quasilinear complexity, $O(N \log N)$. For a grid with $K = 2^{15}$, MNO is $\sim$1000 times faster than direct numerical simulation (black), which has quadratic complexity, $O(N^2)$.

Figure 4: Left: MNO is more accurate than traditional parametrizations. A sample plot shows that our proposed multiscale neural operator (yellow/orange-dotted) can accurately forecast the large-scale physics (black-solid), $X_{k=0}(t)$. In comparison, ML-based (blue-dotted) and traditional (red-dotted) parametrizations quickly start to diverge. Note that the system is chaotic and small deviations are rapidly amplified; even inserting the exact parametrizations in float32 instead of float64 quickly diverges. Right: Accuracy. MNO is more accurate than traditional parametrizations as measured by the root mean-square error (RMSE).

Figure 5: MNO is stable. MNO can propagate a sample state, X k=0 (t), over a long time horizon without diverging to infinity (left). The right plot shows that the RMSE of MNO plateaus for long-term forecasts, further confirming stability. Further, MNO (yellow) maintains accuracy longer than ML-based parametrizations (blue) and a climatology (black).

Figure 7: Rayleigh-Bénard Convection. We depict a sample plot of the ground-truth training data of the 2D RBC.

Figure 8: Mean accuracy. MNO (orange) forecasts the mean (solid) of the ground-truth DNS (blue) more accurately than ML-based parametrizations (green) and climatology (red). The standard deviation is plotted as dotted lines.


