COUPLED MULTIWAVELET NEURAL OPERATOR LEARNING FOR COUPLED PARTIAL DIFFERENTIAL EQUATIONS

Abstract

Coupled partial differential equations (PDEs) are central to modeling the complex dynamics of many physical processes. Recently, neural operators have shown the ability to solve PDEs by learning the integral kernel directly in Fourier/wavelet space, so the difficulty of solving coupled PDEs lies in handling the coupled mappings between the functions. Towards this end, we propose a coupled multiwavelet neural operator (CMWNO) learning scheme that decouples the coupled integral kernels during the multiwavelet decomposition and reconstruction procedures in the wavelet space. The proposed model achieves significantly higher accuracy than previous learning-based solvers on coupled PDEs, including the Gray-Scott (GS) equations and the non-local mean field game (MFG) problem. According to our experimental results, the proposed model exhibits a 2×–4× improvement in relative L2 error compared to the best results from the state-of-the-art models.

1. INTRODUCTION

Human perception relies on detecting and processing waves. While our eyes detect waves of electromagnetic radiation, our ears detect waves of compression in the surrounding air. Going beyond waves, tasks ranging from the complex dynamics of blood flow that sustain tissue growth and life, to navigating underwater, ground, and aerial vehicles at high speeds, require discovering, learning, and controlling the partial differential equations (PDEs) governing individual or webs of biological, physical, and chemical phenomena (Lacasse et al., 2007; Henriquez, 1993; Laval & Leclercq, 2013; Ghanavati et al., 2017; Radmanesh et al., 2020). Within this context, neural operators have been successfully used to learn and solve various PDEs. By representing the integral kernel, termed the Green's function, in the Fourier or wavelet space, the Fourier neural operator (Li et al., 2020b) and the multiwavelet-based neural operator (Gupta et al., 2021b; a) exhibit significant improvements in solving PDEs compared with previous work. However, when it comes to coupled systems characterized by coupled differential equations, such as mean field games (MFGs) (Lasry & Lions, 2007; Achdou & Capuzzo-Dolcetta, 2010), the analysis of coupled cyber-physical systems (Xue & Bogdan, 2017), or the analysis of surface currents in the tropical Pacific Ocean (Bonjean & Lagerloef, 2002), the interactions between the variables/functions need to be considered to decouple the system. Without knowledge of the underlying PDEs, these complex interactions can hardly be represented in a data-driven model. To build a data-driven model that gives a general representation of the interactions and efficiently solves coupled differential equations, we propose the coupled multiwavelet neural operator (CMWNO).

Neural Operators. Neural operators (Li et al., 2020b; c; a; Gupta et al., 2021b; Bhattacharya et al., 2020; Patel et al., 2021) focus on learning mappings between infinite-dimensional spaces of functions.
The critical feature of neural operators is to model the integral operator, namely the Green's function, through various neural network architectures. The graph neural operators (Li et al., 2020b; c) use a graph kernel to model the integral operator, inspired by graph neural networks; the Fourier neural operator (Li et al., 2020b) uses an iterative architecture to learn the integral operator in Fourier space. The multiwavelet neural operators (Gupta et al., 2021b; a) utilize the non-standard form of the multiwavelets to represent the integral operator through four neural networks in the wavelet space. Neural operators are completely data-driven and resolution-independent, since they learn the mapping between functions directly, and achieve state-of-the-art performance on solving PDEs and initial value problems (IVPs). To deal with coupled PDEs in a coupled system while remaining data-efficient, we aim to decode the various interaction scenarios inside the neural operators.

Multiwavelet Transform. In contrast to wavelets, multiwavelets (refer to Appendix C) use more than one orthogonal scaling function. Multiwavelets combine the advantages of wavelets, such as (i) vanishing moments, (ii) orthogonality, and (iii) compact support. Following the essence of the wavelet transform, a series of wavelet bases with scaled/shifted versions is introduced in multiwavelets to construct the basis of the coarsest-scale polynomial subspace. The multiwavelet bases have proven successful for representing integral operators, as shown in (Alpert et al., 1993) (the discrete version of multiwavelets) and (Alpert, 1993b). In our proposed model, to develop compactly supported multiwavelets, we use the Legendre polynomials (Appendix D), which are non-zero only over a finite interval, as the basis.
The differential ($\partial/\partial t$) and integral ($\int_\Omega$) operators can be represented by the first-order multiwavelet coefficients ($s$ and $d$) of orthogonal bases via decomposition in the wavelet space.

Mean Field Games (MFGs). As a representative problem for coupled systems in the real world, MFGs have gained increasing attention in various areas, including economics (Achdou et al., 2014; 2022), finance (Guéant et al., 2011; Huang et al., 2019), and engineering (De Paola et al., 2019; Gomes et al., 2021). Building on statistical mechanics principles and infusing them into the study of strategic decision making, MFGs investigate the dynamics of a large population of interacting agents, seen as particles in a thermodynamic gas. Simply speaking, MFGs consist of (i) a Fokker-Planck equation (or related PDE) that describes the dynamics of the aggregate distribution of agents, coupled to (ii) a Hamilton-Jacobi-Bellman equation (another PDE) prescribing the optimal control of an individual (Lasry & Lions, 2006; 2007; Huang et al., 2006; 2007). Among different types of MFGs, the class of non-potential MFG systems with mixed couplings is particularly important, as it is more reflective of the real world with a continuum of agents in a differential game.

Solutions to MFGs.

Previous works either are restricted to systems without non-local coupling, such as the alternating direction method of multipliers (ADMM) (Benamou & Carlier, 2015; Benamou et al., 2017) and the primal-dual hybrid gradient (PDHG) algorithm (Briceno-Arias et al., 2019; 2018), or use general-purpose numerical methods for solving MFG problems (Achdou et al., 2013a; b; Achdou & Capuzzo-Dolcetta, 2010), which miss specific information about the target structure. In addition, the aforementioned works are not parallelizable with linear computational cost under the coupled MFG setting. Recently, (Liu & Nurbekyan, 2020) considered dual variables of non-local couplings in Fourier or feature space. Furthermore, (Liu et al., 2021) expanded the feature space in kernel-based representations of machine learning methods and used expansion coefficients to decouple the mean field interactions. However, both dual variables and expansion coefficients need to bound the interactions of the coupled system in a reasonable interval using prior knowledge. In our work, we are the first to introduce neural operators into the coupled MFG field, decoupling the various interactions inside the multiwavelet domain.

Novel Contributions. The main novel contributions of our work are summarized as follows:
• For coupled differential equations, we propose a coupled neural operator learning scheme, named CMWNO. To the best of our knowledge, CMWNO is the first neural operator work using a purely data-driven method to decouple and then solve coupled differential equations.
• Utilizing the multiwavelet transform, CMWNO can deal with the interactions between the kernels of coupled differential equations in the wavelet space. Specifically, we first obtain a representation of the coupled information during the decomposition process of the multiwavelet transform; the decoupled representations then interact separately to aid the operators' reconstruction process. In addition, we propose a dice strategy to mimic the information interaction during the training process.
• The proposed model successfully learns the interaction between the coupled variables as the coupling degree increases, and could thus open new directions for studying complex coupled systems via data-driven methods. Experimentally, the proposed CMWNO framework achieves state-of-the-art performance on both the Gray-Scott (GS) equations and non-local MFGs. Specifically, CMWNO outperforms the best baseline (MWT$_c$) by 54.0% on the GS equations with various resolutions and outperforms the best baseline (FNO$_c$) by 61.4% on non-local MFGs with different time steps.

2. COUPLED MULTIWAVELET NEURAL OPERATORS LEARNING

To solve a coupled control system characterized by coupled state equations in control theory, a popular approach is to use the Laplace transform, whose variable $s$ represents differential and integral operators (Gilbarg et al., 1977). The coupled high-order differential equations can thus be transformed into first-order differential equations in the Laplace space, which reduces the decoupling difficulty. Inspired by this use of the Laplace transform and by the properties of multiwavelets, we assume that the interactions between kernels can be used to approximate the coupled information by reducing the degree of high-order operators in multiwavelet bases. Under this assumption, we build the coupled multiwavelet neural operator (CMWNO) learning scheme, which utilizes the decomposition representation of each operator and mimics the interaction via a dice strategy.

2.1. COUPLED DIFFERENTIAL EQUATIONS

To provide a simple example of the coupled kernels $\kappa_1$ and $\kappa_2$, let us consider a general coupled system with two coupled variables $u(x,t)$ and $v(x,t)$ with given initial conditions $u_0(x)$ and $v_0(x)$. Given $\mathcal{A}$ and $\mathcal{U}$ as two Sobolev spaces $H^{s,p}$ with $s > 0$, $p = 2$, let $T$ denote a generic operator such that $T : \mathcal{A} \to \mathcal{U}$. Without knowledge of how these two variables are coupled, to solve for $u(x,\tau)$ and $v(x,\tau)$ we need two operators $T_1$ and $T_2$ such that $T_1 u_0(x) = u(x,\tau)$ and $T_2 v_0(x) = v(x,\tau)$. The coupled kernels, termed Green's functions, can be written as follows:
$$T_1 u_0(x) = \int_D \kappa_1\big(x, y, u_0(x), u_0(y), v_0(x), v_0(y), \kappa_2\big)\, u_0(y)\, dy,$$
$$T_2 v_0(x) = \int_D \kappa_2\big(x, y, u_0(x), u_0(y), v_0(x), v_0(y), \kappa_1\big)\, v_0(y)\, dy,$$
$$u(x,0) = u_0(x); \quad v(x,0) = v_0(x), \quad x \in D,$$
where $D \subset \mathbb{R}^d$ is a bounded domain. The coupled kernels cannot be solved directly without accounting for the interference from the other kernel; our idea is to simplify the kernels first and handle the interactions in the multiwavelet domain.
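For intuition, on a discretized grid each integral operator above reduces to a weighted matrix-vector product. A minimal pure-Python sketch, where the kernel `kappa` is an illustrative stand-in (not the learned Green's function) and $D = [0,1]$:

```python
# Approximate T u0(x_i) = \int_D kappa(x_i, y) u0(y) dy by a midpoint
# Riemann sum on an s-point grid over D = [0, 1].

def apply_integral_operator(kappa, u0, s=64):
    """Return [sum_j kappa(x_i, y_j) * u0(y_j) * dy] for each grid point x_i."""
    dy = 1.0 / s
    grid = [(j + 0.5) * dy for j in range(s)]      # midpoint grid
    u_vals = [u0(y) for y in grid]
    return [sum(kappa(x, y) * uy * dy for y, uy in zip(grid, u_vals))
            for x in grid]

# Example: kappa = 1 simply integrates u0 over D, so with u0(y) = 2y every
# output entry approximates \int_0^1 2y dy = 1.
out = apply_integral_operator(lambda x, y: 1.0, lambda y: 2.0 * y)
```

The learned neural operator replaces the explicit `kappa` with a parameterized kernel acting in the multiwavelet domain, but the underlying computation it approximates is this quadrature.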

2.2. MULTIWAVELET OPERATOR

To briefly introduce the multiwavelet operator, this section explains how neural networks are used to represent the kernel. The basic concepts of multiresolution analysis (MRA) and multiwavelets (Alpert et al., 1993; Alpert, 1993a; b) are provided in Appendix C.

Notation. For $k \in \mathbb{Z}$ and $n \in \mathbb{N}$, the space of piecewise polynomial functions is defined as $V_n^k = \{f \mid \text{the restriction of } f \text{ to the interval } (2^{-n}l,\, 2^{-n}(l+1)) \text{ is a polynomial of degree} < k,\ \text{for all } l = 0, 1, \ldots, 2^n - 1, \text{ and } f \text{ vanishes elsewhere}\}$. $V_0^k$ consists of the orthogonal scaling functions $\varphi_i$ with $i = 0, \ldots, k-1$, and $V_n^k$ can be spanned by shifting and scaling these functions as $\varphi_{jl}^n(x) = 2^{n/2}\varphi_j(2^n x - l)$, where $j = 0, \ldots, k-1$ and $l = 0, \ldots, 2^n - 1$. The coefficients of $\varphi_{jl}^n(x)$ are called scaling coefficients, written $s_{jl}^n$. The multiwavelet subspace $W_n^k$ is defined as the orthogonal complement of $V_n^k$ in $V_{n+1}^k$, such that $V_n^k \oplus W_n^k = V_{n+1}^k$, $V_n^k \perp W_n^k$. $W_0^k$ consists of the orthogonal wavelet functions $\psi_i$ with $i = 0, \ldots, k-1$. Similar to $V_n^k$, $W_n^k$ is composed of the wavelet functions $\psi_{jl}^n(x)$ with wavelet coefficients $d_{jl}^n$.

To represent the functions and learn the mapping in multiwavelet space, the non-standard form is used to represent the integral operator. Following (Beylkin et al., 1991; Alpert et al., 2002b), with orthogonal projection operators $P_n^k : H^{s,2} \to V_n^k$ and $Q_n^k : H^{s,2} \to W_n^k$, where $Q_n^k = P_{n+1}^k - P_n^k$, a single operator $T$ in our coupled system can be represented as
$$T = T_0^k + \sum_{n=0}^{\infty} \big(A_n^k + B_n^k + C_n^k\big),$$
where $T_0^k = P_0^k T P_0^k$, $A_n^k = Q_n^k T Q_n^k$, $B_n^k = Q_n^k T P_n^k$, $C_n^k = P_n^k T Q_n^k$, and $Q_n^k$ is the multiwavelet operator. Therefore, the non-standard form of the operator is a collection of triplets $\{T_0^k, (A_i^k, B_i^k, C_i^k)_{i=0,1,\ldots}\}$.
For a given operator $T$ with $T u_0(x) = u_\tau(x)$, the map in the wavelet space can be written as
$$T_i d_l = A_i^k d_l^i + B_i^k s_l^i, \quad T_i \hat{s}_l = C_i^k d_l^i, \quad T_0 s_l = \bar{T} s_l^0, \quad i = 0, 1, \ldots, n,$$
where $(T_i s_l, T_i d_l)$ and $(s_l^i, d_l^i)$ are the scaling/wavelet coefficients of $u_\tau(x)$ and $u_0(x)$, respectively, in the subspace $V_{i+1}^k$. In our model, one kernel is approximated using four simple neural networks $A$, $B$, $C$, and $\bar{T}$ such that $T_i d_l \approx A_{\theta_A}(d_l^i) + B_{\theta_B}(s_l^i)$, $T_i \hat{s}_l \approx C_{\theta_C}(d_l^i)$, and $T_0 s_l \approx \bar{T}_{\theta_{\bar T}}(s_l^L)$.
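A minimal sketch of this four-map structure, with the neural networks $A$, $B$, $C$, $\bar T$ replaced by hypothetical fixed scalar linear maps (the real model uses single-layer CNNs and a linear layer, as described in Section 3):

```python
# Stand-ins for the four learned maps; the weights 0.5, -0.3, 0.2, 1.1 are
# arbitrary placeholders, not trained parameters.

def linear_map(w):
    return lambda coeffs: [w * c for c in coeffs]

A, B, C, T_bar = linear_map(0.5), linear_map(-0.3), linear_map(0.2), linear_map(1.1)

def apply_kernel(s_by_scale, d_by_scale):
    """T_i d_l ~ A(d_i) + B(s_i);  T_i s-hat_l ~ C(d_i);
       coarsest scale only:  T_0 s_l ~ T-bar(s_L)."""
    out_d = [[a + b for a, b in zip(A(d_i), B(s_i))]
             for s_i, d_i in zip(s_by_scale, d_by_scale)]
    out_s_hat = [C(d_i) for d_i in d_by_scale]
    coarsest = T_bar(s_by_scale[-1])   # T-bar acts only on the coarsest s_L
    return out_d, out_s_hat, coarsest

# Two scales of toy scaling/wavelet coefficients.
out_d, out_s_hat, coarsest = apply_kernel([[1.0, 2.0], [0.5]], [[0.1, 0.2], [0.3]])
```

The key structural point is that $A$, $B$, $C$ act per scale on the detail/scaling coefficients, while $\bar T$ touches only the coarsest scale.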

2.3. COUPLED MULTIWAVELETS MODEL

This section introduces a coupled multiwavelet model that provides a general solution to coupled differential equations. First, we make a mild assumption to decouple the two coupled operators of Section 2.1. To simplify eq. 1, without loss of generality, we assume that we can build two operators $T_u$ and $T_v$ to approximate $u(x,\tau)$ and $v(x,\tau)$, where $T_u$ and $T_v$ are decoupled and do not carry any interference from each other. In other words, we can write
$$T_u u_0(x) = u_1(x, \tau); \quad T_v v_0(x) = v_1(x, \tau),$$
where $u_1(x,\tau)$ and $v_1(x,\tau)$ are the approximations of $u(x,\tau)$ and $v(x,\tau)$ without coupling. The assumption is mild and easily satisfied in the wavelet space, since the operators can be represented by first-order multiwavelet coefficients. With this assumption, we can derive the following relations:
$$u(x,\tau) = T_u u(x,0) + \epsilon_1(T_v), \quad x \in D,$$
$$v(x,\tau) = T_v v(x,0) + \epsilon_2(T_u), \quad x \in D,$$
where $\epsilon_1(T_v)$ quantifies the interference from operator $T_v$ in solving $u(x,\tau)$ and $\epsilon_2(T_u)$ represents the measurable interaction from operator $T_u$. Therefore, the integral operators can be written as
$$T_u u_0(x) = \int_D \kappa_u(x,y)\, u_0(y)\, dy; \quad T_v v_0(x) = \int_D \kappa_v(x,y)\, v_0(y)\, dy.$$
The kernels $\kappa_u$ and $\kappa_v$, termed Green's functions, can be learned through neural operators, where $\kappa_u$ is learned from the data of $u$ while $\kappa_v$ is learned from $v$. To model $\epsilon_1(T_v)$ and $\epsilon_2(T_u)$, we transform the operators into multiwavelet coefficients in the wavelet space and embed them through a simple linear combination after the decomposition steps. Based on the concept of multiwavelets (Appendix C), we now briefly explain the decomposition and reconstruction steps of multiwavelets in our coupled system. Since $V_n^k = V_{n-1}^k \oplus W_{n-1}^k$ according to Section 2.2, the bases of $V_n^k$ can be written as a linear combination of the scaling functions $\varphi_i^{n-1}$ and the wavelet functions $\psi_i^{n-1}$.
The linear coefficients $(H^{(0)}, H^{(1)}, G^{(0)}, G^{(1)})$ are termed multiwavelet decomposition filters, transforming representations between the subspaces $V_{n-1}^k$, $W_{n-1}^k$, and $V_n^k$. For a given function $f(x)$, the scaling/wavelet coefficients $s_{jl}^n$ / $d_{jl}^n$ of the scaling/wavelet functions $\varphi_{jl}^n$ / $\psi_{jl}^n$ are computed as
$$s_{jl}^n = \int_{2^{-n}l}^{2^{-n}(l+1)} f(x)\,\varphi_{jl}^n(x)\, dx; \quad d_{jl}^n = \int_{2^{-n}l}^{2^{-n}(l+1)} f(x)\,\psi_{jl}^n(x)\, dx.$$
Using the multiwavelet decomposition filters, the relations between the coefficients on two consecutive levels $n$ and $n+1$ are computed as (decomposition step):
$$s_l^n = H^{(0)} s_{2l}^{n+1} + H^{(1)} s_{2l+1}^{n+1}; \quad d_l^n = G^{(0)} s_{2l}^{n+1} + G^{(1)} s_{2l+1}^{n+1}. \quad (7)$$
Therefore, starting with the coefficients $s_l^n$, we repeatedly apply the decomposition step in eq. 7 to compute the scaling/wavelet coefficients on coarser levels. Similarly, the reconstruction step can be represented as
$$s_{2l}^{n+1} = H^{(0)T} s_l^n + G^{(0)T} d_l^n; \quad s_{2l+1}^{n+1} = H^{(1)T} s_l^n + G^{(1)T} d_l^n. \quad (8)$$

Figure 1: Architecture of CMWNO. Note that there are two coupled operators, $T_u$ and $T_v$, in our system, matching the number of coupled variables. The network $\bar T$ is applied only at the coarsest scale $L$ (0 in this system). The dashed arrows correspond to the auxiliary information from the unused operator, passed without gradient during the training process. For the interaction between operators, when we update the operator $T_u$, the decomposed ingredients from $T_v$ are fed into the reconstruction module of $T_u$ in the wavelet domain, and vice versa.
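For the piecewise-constant case $k = 1$ (the Haar system), the filters reduce to the scalars $H^{(0)} = H^{(1)} = G^{(0)} = 1/\sqrt{2}$ and $G^{(1)} = -1/\sqrt{2}$, and the decomposition/reconstruction pair of eqs. 7-8 can be sketched and verified directly. This is a minimal illustration only, not the $k > 1$ Legendre filters used in the paper:

```python
import math

# Haar (k = 1) multiwavelet filters.
H0 = H1 = G0 = 1.0 / math.sqrt(2.0)
G1 = -1.0 / math.sqrt(2.0)

def decompose(s_fine):
    """Eq. 7: one decomposition step from level n+1 to level n."""
    half = len(s_fine) // 2
    s = [H0 * s_fine[2*l] + H1 * s_fine[2*l + 1] for l in range(half)]
    d = [G0 * s_fine[2*l] + G1 * s_fine[2*l + 1] for l in range(half)]
    return s, d

def reconstruct(s, d):
    """Eq. 8: invert the decomposition step exactly (orthogonal filters)."""
    s_fine = []
    for sl, dl in zip(s, d):
        s_fine.append(H0 * sl + G0 * dl)   # s_{2l}^{n+1}
        s_fine.append(H1 * sl + G1 * dl)   # s_{2l+1}^{n+1}
    return s_fine

# Perfect reconstruction: decompose then reconstruct recovers the input.
s1 = [4.0, 2.0, 5.0, 7.0]
s0, d0 = decompose(s1)
assert all(abs(a - b) < 1e-12 for a, b in zip(reconstruct(s0, d0), s1))
```

Orthogonality of the filters is what makes the round trip exact, and it is also what lets CMWNO add coefficients from the other operator in the same subspace without distorting the representation.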
Repeatedly applying the reconstruction step, we can compute the coefficients $s_l^n$ from $s_l^0$ and $d_l^i$, $i = 0, \ldots, n$. In general, a function can be parameterized by its scaling/wavelet coefficients in the wavelet space after the decomposition steps, and the coefficients can be mapped back to the function through the reconstruction steps. In our work, to model the interferences $\epsilon_1(T_v)$ and $\epsilon_2(T_u)$, we obtain the multiwavelet coefficients of each kernel during the decomposition steps and embed them into the other kernel in the reconstruction step. We elaborate the detailed training strategy for how interactions are mimicked inside our system in Section 2.4. Our idea is to represent the functions and operators in the wavelet space so that the system can be decoupled using simple linear combinations. Considering the example in Section 2.1, according to eqs. 4 and 5, we first build two operators $T_u$ and $T_v$ such that $T_u u_0(x) = u_\tau^1(x)$ and $T_v v_0(x) = v_\tau^1(x)$. For the operators $T_u$ and $T_v$, we denote their scaling/wavelet coefficients in the wavelet domain as $T_{u,s_l}^i; T_{u,d_l}^i$ and $T_{v,s_l}^i; T_{v,d_l}^i$, respectively. For the inputs $u_0(x); v_0(x)$ and the outputs $u_\tau(x); v_\tau(x)$, we denote their coefficients as $U_{0,s(d)_l}^i; V_{0,s(d)_l}^i$ and $U_{\tau,s(d)_l}^i; V_{\tau,s(d)_l}^i$. According to eqs. 3 and 5, the multiwavelet coefficients of $T_u$ and $T_v$ can be calculated as
$$T_{u,d_l}^i = A_{u,i}^k U_{0,d_l}^i + B_{u,i}^k U_{0,s_l}^i, \quad T_{u,\hat s_l}^i = C_{u,i}^k U_{0,d_l}^i, \quad T_{u,s_l}^0 = \bar T U_{0,s_l}^0;$$
$$T_{v,d_l}^i = A_{v,i}^k V_{0,d_l}^i + B_{v,i}^k V_{0,s_l}^i, \quad T_{v,\hat s_l}^i = C_{v,i}^k V_{0,d_l}^i, \quad T_{v,s_l}^0 = \bar T V_{0,s_l}^0,$$
where $i = 0, 1, \ldots, n$. Considering the interference from the other operator, the coefficients of the solutions $u_\tau(x)$ and $v_\tau(x)$ in the wavelet space can be written as
$$U_{\tau,d_l}^i = T_{u,d_l}^i + \bar T_{v,d_l}^i, \quad U_{\tau,\hat s_l}^i = T_{u,\hat s_l}^i + \bar T_{v,\hat s_l}^i, \quad U_{\tau,s_l}^0 = T_{u,s_l}^0 + \bar T_{v,s_l}^0;$$
$$V_{\tau,d_l}^i = T_{v,d_l}^i + \bar T_{u,d_l}^i, \quad V_{\tau,\hat s_l}^i = T_{v,\hat s_l}^i + \bar T_{u,\hat s_l}^i, \quad V_{\tau,s_l}^0 = T_{v,s_l}^0 + \bar T_{u,s_l}^0,$$
where $i = 0, 1, \ldots, n$. In the training process, the inputs of the neural networks $\{A_{[u,v]}, B_{[u,v]}, C_{[u,v]}, \bar T_{[u,v]}\}$ are the multiwavelet coefficients of $u_0(x); v_0(x)$, and the outputs are the multiwavelet coefficients of $T_u; T_v$. When the neural networks $\{A_u, B_u, C_u, \bar T_u\}$ are trained for $T_u$, the neural networks $\{A_v, B_v, C_v, \bar T_v\}$ output $\bar T_{v,[s_l,d_l]}^i$ without backpropagation; we use $\bar T_{[u,v],[s_l,d_l]}^i$ to mark the coefficients without gradient. Utilizing the orthogonality of the multiwavelets, the coefficients embedding the information of the operators $T_u; T_v$ can be directly added to $T_v; T_u$ in the same wavelet space $V_n^k$, so the networks with backpropagation can learn the information from the other operator. In this way, the complex coupled equations can be solved by reducing the order of the functions and directly approximating the decoupled functions at each iteration. The architecture of CMWNO is shown in Fig. 1, which illustrates the mapping process inside the wavelet space at layer $n$. The operations inside the wavelet space are matched by the order of layers in the model, which means the decomposition operations for different resolutions are done independently. After decomposing $s^n$ via eq. 7, we obtain the transferred information of the input, where each component is used to reconstruct the original input at layer $n$. Inspired by scheduled sampling (Bengio et al., 2015), which is designed to gently bridge the discrepancy between training and inference samples, we propose rolling a dice to randomly decide the interaction order between the neural operators, which we name the dice strategy. Specifically, we roll the dice for every sample to decide which path to use, which effectively mitigates the imbalanced-update problem for each kernel caused by a fixed training order. As illustrated in Fig. 2, when the dice tells the model to use path 1 (the upper path), we update operator $T_u$ by equipping it with the coupled information from the other operator $T_v$ first. Note that $T_v$ was learned from previous samples and has not yet been updated. Then we use the updated operator $T_u$ to decompose the initial state $u_0$, which is in turn used to update $T_v$. Inside the wavelet space with well-defined bases, we are able to jointly utilize various orthogonal information from each initial state. Note that this strategy is scalable to more operators (see Fig. 6 in the Supplementary material); we leave the design of this extension for future work.
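The dice strategy's training loop can be sketched in pure Python with placeholder operators. The `Op` interface here (`decompose_no_grad`, `update`) is hypothetical; in the real model each operator carries its own $\{A, B, C, \bar T\}$ networks and "no grad" means the coefficients are detached from backpropagation:

```python
import random

class Op:
    """Minimal stand-in for a multiwavelet operator (hypothetical interface)."""
    def __init__(self, name):
        self.name, self.updates = name, 0
    def decompose_no_grad(self, sample):
        return [0.0]            # placeholder detached coefficients
    def update(self, sample, aux):
        self.updates += 1       # placeholder gradient step using aux coeffs

def train_step(op_u, op_v, sample, rng=random):
    """Roll a dice per sample to pick the update order; each operator is
    updated while the other only supplies detached (no-grad) coefficients."""
    first, second = (op_u, op_v) if rng.random() < 0.5 else (op_v, op_u)
    for active, frozen in [(first, second), (second, first)]:
        aux = frozen.decompose_no_grad(sample)
        active.update(sample, aux)

Tu, Tv = Op("Tu"), Op("Tv")
for sample in range(10):
    train_step(Tu, Tv, sample)
```

Because the order is re-rolled per sample, neither operator is systematically updated first, which is the imbalance the strategy is designed to avoid.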

3. EXPERIMENTS

In this section, we empirically evaluate the proposed model on well-known coupled PDEs: the Gray-Scott (GS) equations and the non-local mean field game (MFG) problem. Note that we compare against state-of-the-art data-driven models, which fits our research goal of building efficient coupled operators for general downstream data-driven applications without sufficient expert knowledge. The experiments show that CMWNO not only achieves the lowest relative L2 errors when solving coupled PDEs, but also performs consistently well under different input conditions. For the data structure, since our datasets are functions, we apply point-wise evaluations to the input and output data. For example, for a function $f(x)$, $x \in D$, we discretize the domain as $x_1, \ldots, x_s \in D$, where the $x_i$ form an $s$-point discretization of the domain. Unless stated otherwise, we train on 1000 samples and test on 200 samples.

Model architecture. In our proposed model, for each operator, the neural networks $A$, $B$, and $C$ are single-layer convolutional neural networks, while $\bar T$ is a single linear layer. Our model is extensible: each kernel is constructed from four neural networks $\{A, B, C, \bar T\}$ learning the mapping in wavelet space. The number of kernels can be chosen based on the number of coupled variables or the number of explicit operators.
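The metric reported throughout is the relative L2 error over the point-wise discretization. A small sketch (the three-point grid and values are illustrative):

```python
import math

def relative_l2_error(pred, true):
    """||pred - true||_2 / ||true||_2 over an s-point discretization."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    return num / den

# Toy example: a prediction close to the true discretized function.
err = relative_l2_error([1.1, 1.9, 3.2], [1.0, 2.0, 3.0])
```

Because both norms are taken over the same grid, the metric is comparable across the different resolutions used in the experiments.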

Benchmark models.

We compare our model with state-of-the-art neural operators, including the Fourier neural operator (FNO), the multiwavelet-based neural operator (MWT), and the Padé exponential model (Padé), which show the best performance on solving PDEs according to the experimental results in (Li et al., 2020b; a; Gupta et al., 2021b; a). For the benchmark neural operator models, since we have coupled functions as input and output (e.g., $u$ and $v$), we concatenate $u$ and $v$ and mark the models as FNO$_c$, MWT$_c$, and Padé$_c$. We also use two single multiwavelet-based neural operators to learn $u_\tau(x)$ and $v_\tau(x)$ from $u_0(x)$ and $v_0(x)$ independently, and mark this model as MWT$_s$. Similar to the coupling structure of our CMWNO, by creating multiple kernels learned in Fourier space and applying the dice strategy during the Fourier transform, we build a coupled Fourier neural operator, marked CFNO.

Training parameters. The neural operators are trained using the Adam optimizer with a learning rate of 0.001 and a decay of 0.95 every 100 steps. The models are trained for a total of 500 epochs, the same as CMWNO, for a fair comparison. All experiments are done on an Nvidia A100 40GB GPU.

3.1. GRAY-SCOTT (GS) EQUATIONS

The GS equations are coupled differential equations which model the underlying reaction and diffusion patterns of chemical species. They can generate a wide range of patterns that exist in nature, such as bacterial, spiral, and coral patterns. Each variable (i.e., $u$ and $v$) diffuses independently with a linear growth or decay term, while the two are coupled together by $\pm uv^2$ (Trefethen & Embree, 2001; Driscoll et al., 2014). For a given field $u(x,t); v(x,t)$, the GS equations take the form
$$\partial_t u(x,t) = \epsilon_1 \partial_{xx} u(x,t) + F\big(1 - u(x,t)\big) - \lambda u(x,t)v^2(x,t), \quad x \in (0,10),\ t \in (0,1],$$
$$\partial_t v(x,t) = \epsilon_2 \partial_{xx} v(x,t) - (K + F)\, v(x,t) + \lambda u(x,t)v^2(x,t), \quad x \in (0,10),\ t \in (0,1],$$
$$u(x,0) = u_0(x); \quad v(x,0) = v_0(x), \quad x \in (0,10),$$
where $\epsilon_1 = 1$, $\epsilon_2 = 10^{-2}$, $K = 6.62 \times 10^{-2}$, and $F = 2 \times 10^{-2}$. We use the coupling coefficient $\lambda \in (0,1]$ to control the degree of coupling between $u$ and $v$. We aim to learn the operators (i) mapping the initial condition $u(x,0)$ to the solution $u(x, t=1)$ under the interference of $v(x,t)$, and (ii) mapping the initial condition $v(x,0)$ to the solution $v(x, t=1)$ under the interference of $u(x,t)$. The initial conditions are generated from a Gaussian random field (GRF) according to $u_0(x), v_0(x) \sim \mathcal{N}\big(0,\ 7^4(-\Delta + 7^2 I)^{-2.5}\big)$ with periodic boundary conditions. We also use a different scheme to generate $u_0(x)$: the smooth random functions (Rand) in the chebfun package (Driscoll et al., 2014), which return a band-limited function defined by a Fourier series with independent random coefficients; the parameter $\gamma$ specifies the minimal wavelength, and here we choose $\gamma = 0.5$. With the initial conditions $u_0(x)$ and $v_0(x)$ generated by these different schemes, we obtain two combinations, marked (U-GRF, V-GRF) and (U-Rand, V-GRF) according to the generating schemes.
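For concreteness, the right-hand side of the GS system on a periodic 1-D grid can be sketched with second-order finite differences. This is a toy discretization for illustration, not the ETDRK4 solver used to generate the datasets:

```python
def gs_rhs(u, v, dx, eps1=1.0, eps2=1e-2, K=6.62e-2, F=2e-2, lam=1.0):
    """du/dt and dv/dt of the 1-D Gray-Scott system on a periodic grid."""
    s = len(u)

    def lap(w, i):  # periodic second-difference Laplacian
        return (w[(i - 1) % s] - 2 * w[i] + w[(i + 1) % s]) / dx**2

    du = [eps1 * lap(u, i) + F * (1 - u[i]) - lam * u[i] * v[i]**2
          for i in range(s)]
    dv = [eps2 * lap(v, i) - (K + F) * v[i] + lam * u[i] * v[i]**2
          for i in range(s)]
    return du, dv

# Sanity check: u = 1, v = 0 is a steady state (both time derivatives vanish).
du, dv = gs_rhs([1.0] * 8, [0.0] * 8, dx=0.1)
```

Note how the only coupling between the two fields is the $\pm\lambda uv^2$ reaction term, which is exactly the interaction that CMWNO must capture across its two operators.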
Given the initial conditions, we solve the equations using a fourth-order stiff time-stepping scheme, ETDRK4 (Cox & Matthews, 2002), with a resolution of $2^{10}$, and sub-sample this data to obtain datasets with lower resolutions. Compared with the improvement of CMWNO over MWT, the improvement of CFNO over FNO is not significant, indicating that decoupling in the Fourier space is not as efficient as decoupling in the multiwavelet domain. The learning curves of the neural operators solving $u_\tau(x)$ at resolution $s = 1024$ are shown in Fig. 3.


Varying coupling coefficient. By varying the coupling coefficient $\lambda$ in the GS equations, we obtain different degrees of coupling between $u$ and $v$ according to eq. 11; a higher value of $\lambda$ means a higher degree of coupling. Given the same initial conditions $u_0(x)$ and $v_0(x)$, the outputs for different $\lambda$ (i.e., $\lambda = 0.2, 0.4, 0.6, 0.8, 1$) are shown in Fig. 4. It shows that as $\lambda$ increases, all the models perform worse. For solving $u_\tau(x)$, compared with $\lambda = 0.2$, the relative L2 errors at $\lambda = 1$ increase by 22.0% (CMWNO), 459.6% (MWT$_s$), 105.9% (MWT$_c$), 107.4% (FNO$_c$), and 146.7% (Padé$_s$). In terms of $v(x,\tau)$, the numbers are 11.6% (CMWNO), 326.8% (MWT$_s$), 34.5% (MWT$_c$), 44.8% (FNO$_c$), and 44.3% (Padé$_s$). As we can see, MWT$_s$ performs the worst, since the model cannot learn the interaction between $u$ and $v$. The models learning coupled operators through concatenated data work better than the single model but still do not perform well on highly coupled data. In contrast, our CMWNO performs consistently well with both low and high coupling coefficients, which indicates that our architecture is able to decouple the coupled kernels.

Varying initial conditions. In addition to experimenting with both initial conditions $u_0(x)$ and $v_0(x)$ generated from the GRF, marked (U-GRF, V-GRF), we also evaluate the models on (U-Rand, V-GRF). The numerical results are shown in Table 3 (see Appendix F). Our CMWNO achieves the lowest relative L2 error on both $u$ and $v$, with 3× and 2× improvements, respectively. We provide a sample of initial conditions in Fig. 7 (see Appendix E), and Fig. 5 shows the predicted outputs from CMWNO, MWT$_s$, and MWT$_c$. It shows that our proposed CMWNO gives a precise, smooth prediction, while MWT$_s$ and MWT$_c$ can only fit the true curve roughly.

3.2. MEAN FIELD GAME PROBLEM

For local interactions, directly discretizing the interaction terms is economical. However, a non-local MFG requires each player to take global rather than local information into account when making decisions, which increases the computational cost. In other words, we need matrix multiplication on a full grid to calculate the interaction terms by evaluating the expression $\int_\Omega K(x,y)\rho(y,t)\,dy$. In this work, we propose a more general framework, CMWNO, to model the interactions in the wavelet space, and the results show that our model can handle such coupled systems. Here we solve the non-local MFG, which can be characterized as
$$\partial_t \rho(x,t) + \nabla \cdot \big(\rho(x,t)\nabla\varphi(x,t)\big) = 0, \quad x \in [0,1],\ t \in (0,1),$$
$$\partial_t \varphi(x,t) - \tfrac{1}{2}\,\|\nabla\varphi(x,t)\|^2 + \int_D K(x,y)\rho(y,t)\,dy = 0, \quad x \in [0,1],\ t \in (0,1),$$
where $\rho(x,t)$ is the density distribution of the players and $\varphi(x,t)$ is the cost function. In a forward-forward MFG setting (Gomes & Sedjro, 2017), we can obtain the values of $\rho(x,0)$ and $\varphi(x,0)$. We aim to learn the operators (i) mapping the initial condition $\rho(x,0)$ to the solution $\rho(x, t=\tau)$ under the interference of $\varphi(x,t)$, and (ii) mapping the initial condition $\varphi(x,0)$ to the solution $\varphi(x, t=\tau)$ under the interference of $\rho(x,t)$. To obtain the datasets, we generate $\rho(x,0); \rho(x, t=1)$ using the random functions in the chebfun package with wavelength parameters $\gamma = 0.3; 0.1$, respectively. The coupled equations are numerically solved by the primal-dual hybrid gradient (PDHG) algorithm (Briceno-Arias et al., 2019; 2018) with resolution $s = 256$. The initial conditions $\rho(x,0)$ and $\varphi(x,0)$ are used as the input, while $\rho(x,t)$ and $\varphi(x,t)$ ($t = 0.2, 0.4, 0.6, 0.8$) are taken as the output. We run all the models on the coupled datasets described above to solve this coupled MFG PDE system, and the results for different $t$ are shown in Table 2.
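The non-local coupling term $\int_\Omega K(x,y)\rho(y,t)\,dy$ is what forces full-grid matrix multiplication. A direct quadrature sketch, where the Gaussian interaction kernel is illustrative (not the paper's $K$):

```python
import math

def nonlocal_coupling(rho, sigma=0.1):
    """Evaluate \int_0^1 K(x_i, y) rho(y) dy on the grid by a Riemann sum;
    K is a hypothetical Gaussian interaction kernel for illustration."""
    s = len(rho)
    dy = 1.0 / s
    grid = [(j + 0.5) * dy for j in range(s)]
    # Dense s x s kernel matrix: the source of the quadratic cost.
    K = [[math.exp(-((x - y) ** 2) / (2 * sigma ** 2)) for y in grid]
         for x in grid]
    return [sum(K[i][j] * rho[j] * dy for j in range(s)) for i in range(s)]

out = nonlocal_coupling([1.0] * 64)   # uniform player density
```

Every output point needs the entire density $\rho(\cdot, t)$, i.e. an $O(s^2)$ matrix-vector product per evaluation; this global dependence is precisely what CMWNO moves into the wavelet space.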
Compared to the existing model with the best results, our proposed CMWNO yields 34.2%–67.4% improvements in terms of $\rho$ and 57.4%–73.7% in terms of $\varphi$ with respect to the relative L2 error. It is worth noting that MWT$_c$ performs the worst in most cases, which indicates that the interactions between $\rho$ and $\varphi$ cannot be learned through a single multiwavelet kernel. By letting two kernels interact in the wavelet space after the decomposition steps, our proposed CMWNO can better decouple the interactions between $\rho$ and $\varphi$ to solve the MFG PDEs.

4. CONCLUSION

In this work, we propose a coupled multiwavelet neural operator using a multiwavelet discretization of the spatial domain. Solving coupled equations requires information entanglement across the operators for the individual processes. We found that combining operators in the projected multiwavelet domain is effective. Numerical experiments on representative coupled PDEs, including the Gray-Scott equations and the mean field game problem, show that our coupling mechanism effectively learns the two processes in comparison with standalone operators.


Figure 6: The scalable dice strategy with multiple coupled kernels. For each sample of coupled variables, one only needs to go through a specific path (rounded rectangle). For example, the dice equals 1 in this case, so this specific sample goes through Path 1. Inside each path, the order of updating is from left to right, where the darker block indicates the operator to be updated and the lighter blocks provide decomposition information from the fixed operators. Note that an updated operator can help to update the other operators at the same stage that occur after it. Thus, for $n$ kernels, we have $n$ operators to be updated and $m$ selectable paths, where $m = A_n^n$ and $A$ denotes the number of permutations.

A RELATED WORK

The operator network can be made up of several neural networks, which aim to approximate any function defined as an input in one network and evaluated at the locations specified in the target network (Lu et al., 2022). Chen & Chen proposed a universal approximation theorem of operators for a single layer, which has inspired several recent research works. Among them, DeepONet (Lu et al., 2021) first applies a deep neural network to the universal approximation theorem to learn nonlinear operators. After that, the Fourier Neural Operator (FNO) formulates operator regression by parameterizing the integral kernel directly in Fourier space (Li et al., 2020a). Several works take advantage of FNO's ability to efficiently solve PDEs and design models for practical chaotic systems such as turbulence simulation (Stachenfeld et al., 2021), multiphase flow simulation (Wen et al., 2022), and weather forecasting (Pathak et al., 2022). In addition, we would like to highlight that neural operators are able to tackle more fundamental problems in time series analysis (Cao et al., 2021; Zhang et al., 2022; Cao et al., 2020) and readily find many applications in transportation, healthcare, manufacturing, finance (Cao et al., 2022), etc. Gupta et al. introduce a multiwavelet-based neural operator that compresses the associated operator's kernel using fine-grained wavelets, and the same group further proposes a non-linear operator approximation for initial value problems. Besides, physics-informed neural networks and machine learning methods provide a new research direction for equipping neural operators with specific physics information (Goswami et al., 2022; Meng et al., 2022). For more information on neural operators, please refer to the survey paper by Kovachki et al. (2021). Another related research line of our work is coupled PDEs Tang et al.
(2009), which are usually discussed in the form of MFGs (Benamou & Carlier, 2015; Benamou et al., 2017; Briceno-Arias et al., 2019; 2018; Liu & Nurbekyan, 2020; Liu et al., 2021). In addition, the interactions between terms in complex systems Xue & Bogdan (2017)

B REPRODUCIBILITY & CODE AVAILABILITY

Architecture description in detail: The CMWNO model in Figure 1 is presented in the form of a recurrent cell. In the decomposition stage, the input at each iteration is $U^{(i+1)}_{0,s}$ ($V^{(i+1)}_{0,s}$), which is transformed into $U^{i}_{0,s}, U^{i}_{0,d}$ ($V^{i}_{0,s}, V^{i}_{0,d}$) using the filters $H, G$. At the same iteration, we also obtain the corresponding outputs $T^{i}_{u,d}$ and $T^{i}_{u,\hat{s}}$ ($T^{i}_{v,d}$ and $T^{i}_{v,\hat{s}}$). The same process is repeated in the next iteration, now using $U^{i}_{0,s}$ ($V^{i}_{0,s}$) obtained from the previous step as the input; this yields a recurrent chain of operations and is a kind of ladder-down operation. The loop is repeated until we reach the $L$-th (coarsest) scale, at which the final operation $\bar{T}$ is applied according to Eq. 9. The trainable neural network layers in this stage are $\{A_u, B_u, C_u, \bar{T}_u\}$ and $\{A_v, B_v, C_v, \bar{T}_v\}$. We use two cascading neural network layers with Xavier normal initialization to handle 1D coupled equations in our experiments. Moreover, $A_*, B_*, C_*$ are all one-layer CNNs with the ReLU (Nair & Hinton, 2010) activation function followed by a linear layer, where the CNN's kernel size equals 3, stride equals 1, and padding size equals 1. The input channel of the CNN equals the feature number, and the output channel is set to 128 in all the experiments. In addition, $\bar{T}_*$ is a single $k \times k$ linear layer with $k = 4$ as suggested by Gupta et al. (2021b). In the reconstruction stage (which is ladder-up), iteratively, the outputs of the decomposition part $T^{i}_{u,d}$ and $T^{i}_{u,\hat{s}}$ ($T^{i}_{v,d}$ and $T^{i}_{v,\hat{s}}$) are first combined by the dice strategy in Section 2.4, and then the reconstruction filters $H, G$ are used to obtain the finer scales $U^{(i+1)}_{\tau,s}$ ($V^{(i+1)}_{\tau,s}$); finally, $U^{n}_{\tau,s}$ ($V^{n}_{\tau,s}$) is the finest scale of the output. Our code to run the experiments can be found at https://github.com/joshuaxiao98/CMWNO/.
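The ladder-down/ladder-up recursion can be illustrated with a minimal stand-in, using Haar filters ($k = 1$) in place of the multiwavelet filter pairs $H, G$; the actual CMWNO interleaves the coupled kernel evaluations $\{A, B, C, \bar{T}\}$ between these two stages:

```python
import numpy as np

def decompose(x, L):
    """Ladder-down: repeatedly split a signal into coarse (s) and
    detail (d) coefficients with Haar filters H = [1, 1]/sqrt(2),
    G = [1, -1]/sqrt(2) (a k = 1 stand-in for the multiwavelet
    filters H, G in the paper). Returns details per scale plus the
    coarsest approximation."""
    details = []
    for _ in range(L):
        s = (x[0::2] + x[1::2]) / np.sqrt(2)  # scaling coefficients (s)
        d = (x[0::2] - x[1::2]) / np.sqrt(2)  # wavelet coefficients (d)
        details.append(d)
        x = s
    return details, x

def reconstruct(details, coarse):
    """Ladder-up: invert the decomposition scale by scale."""
    x = coarse
    for d in reversed(details):
        up = np.empty(2 * len(x))
        up[0::2] = (x + d) / np.sqrt(2)
        up[1::2] = (x - d) / np.sqrt(2)
        x = up
    return x
```

Perfect reconstruction of the pair of stages (without the learned operators in between) is what lets the model act on coefficients scale by scale without losing information.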

C MULTIWAVELET BASES C.1 MULTIRESOLUTION ANALYSIS

The basic idea of MRA is to establish a preliminary basis in a subspace $V_0$ of $L^2(\mathbb{R})$, and then use simple scaling and translation transformations to expand the basis of the subspace $V_0$ into $L^2(\mathbb{R})$ for analysis on multiple scales. Multiwavelets further this operation by using a class of orthogonal polynomials (OPs); in our case, we use Legendre polynomials for an efficient representation over a finite interval (Alpert et al., 2002a). For $k \in \mathbb{Z}$ and $n \in \mathbb{N}$, the space of piecewise polynomial functions is defined as $V^k_n = \{f \mid \text{the restriction of } f \text{ to the interval } (2^{-n}l,\, 2^{-n}(l+1)) \text{ is a polynomial of degree} < k \text{ for all } l = 0, 1, \dots, 2^n - 1, \text{ and } f \text{ vanishes elsewhere}\}$. Therefore, the space $V^k_n$ has dimension $2^n k$, and each subspace $V^k_i$ is contained in $V^k_{i+1}$: $V^k_0 \subset V^k_1 \subset \dots \subset V^k_n \subset \dots$. Given a basis $\phi_0, \phi_1, \dots, \phi_{k-1}$ of $V^k_0$, the space $V^k_n$ is spanned by $2^n k$ functions obtained from $\phi_0, \phi_1, \dots, \phi_{k-1}$ by shifts and scales: $\phi^n_{jl}(x) = 2^{n/2}\phi_j(2^n x - l)$, $j = 0, 1, \dots, k-1$, $l = 0, 1, \dots, 2^n - 1$. The functions $\phi_0, \phi_1, \dots, \phi_{k-1}$ are also called scaling functions, which project a function onto the approximation space $V^k_0$.
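A small numerical check of the shift-and-scale construction (a sketch; the three basis functions below are the $k = 3$ shifted Legendre scaling functions listed in Appendix D, and the function names are illustrative):

```python
import numpy as np

def phi(j, x):
    """First three normalized shifted Legendre scaling functions on [0, 1)."""
    x = np.asarray(x)
    inside = (x >= 0) & (x < 1)
    vals = [np.ones_like(x),
            np.sqrt(3.0) * (2 * x - 1),
            np.sqrt(5.0) * (6 * x**2 - 6 * x + 1)][j]
    return np.where(inside, vals, 0.0)

def phi_nl(j, n, l, x):
    """phi^n_{jl}(x) = 2^{n/2} phi_j(2^n x - l): the shift-and-scale
    basis functions spanning V^k_n."""
    return 2 ** (n / 2) * phi(j, 2.0 ** n * x - l)
```

Because translates at the same scale have disjoint supports, their inner products vanish, and the $2^{n/2}$ factor keeps each function unit-norm.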

C.2 MULTIWAVELETS

The multiwavelet subspace $W^k_n$ is defined as the orthogonal complement of $V^k_n$ in $V^k_{n+1}$, such that $V^k_n \oplus W^k_n = V^k_{n+1}$ with $V^k_n \perp W^k_n$; and $W^k_n$ has dimension $2^n k$. Therefore, the decomposition $V^k_n = V^k_0 \oplus W^k_0 \oplus W^k_1 \oplus \dots \oplus W^k_{n-1}$ is obtained. To form orthogonal bases for $W^k_n$, a class of bases is constructed for $L^2(\mathbb{R})$. Each basis consists of translates and dilates of a finite set of functions $\psi_1, \dots, \psi_k$: $\psi^n_{jl}(x) = 2^{n/2}\psi_j(2^n x - l)$, $j = 0, 1, \dots, k-1$, $l = 0, 1, \dots, 2^n - 1$, where the wavelet functions $\psi_1, \dots, \psi_k$ are piecewise polynomial and orthogonal to low-order polynomials (vanishing moments): $\int_0^1 x^i \psi_j(x)\, dx = 0$, $i = 0, 1, \dots, k-1$. Here we restrict our attention to the interval $[0, 1] \subset \mathbb{R}$; however, the transformation to any finite interval $[p, q]$ can be obtained directly by appropriate translates and dilates.

D LEGENDRE POLYNOMIALS

The Legendre polynomials are defined with respect to (w.r.t.) a uniform weight function $w_L(x) = 1$ for $-1 \le x \le 1$, i.e., $w_L(x) = \mathbf{1}_{[-1,1]}(x)$, such that $\int_{-1}^{1} P_i(x) P_j(x)\, dx = \frac{2}{2i+1}$ if $i = j$, and $0$ if $i \ne j$. For our work, we shift and scale the Legendre polynomials so they are defined over $[0, 1]$ as $P_i(2x - 1)$, with the corresponding weight function $w_L(2x - 1)$. The Legendre polynomials satisfy the recurrence relationships $i P_i(x) = (2i - 1) x P_{i-1}(x) - (i - 1) P_{i-2}(x)$ and $(2i + 1) P_i(x) = P'_{i+1}(x) - P'_{i-1}(x)$, which allow the expression of derivatives as a linear combination of lower-degree polynomials: $P'_i(x) = (2i - 1) P_{i-1}(x) + (2i - 5) P_{i-3}(x) + \dots$, where the summation ends at either $P_0(x)$ or $P_1(x)$, with $P_0(x) = 1$ and $P_1(x) = x$. A set of orthonormal bases of the space of polynomials of degree $< d$ over the interval $[0, 1]$ is obtained using the shifted Legendre polynomials $\varphi_i = \sqrt{2i + 1}\, P_i(2x - 1)$ w.r.t. the weight function $w(x) = w_L(2x - 1)$, such that $\langle \varphi_i, \varphi_j \rangle_\mu = \int_0^1 \varphi_i(x) \varphi_j(x)\, dx = \delta_{ij}$. The bases for $V^k_0$ are chosen as normalized shifted Legendre polynomials of degree up to $k$ w.r.t. the weight function $w_L(2x - 1) = \mathbf{1}_{[0,1]}(x)$. For example, the first three bases are $\varphi_0(x) = 1$, $\varphi_1(x) = \sqrt{3}(2x - 1)$, $\varphi_2(x) = \sqrt{5}(6x^2 - 6x + 1)$ for $0 \le x \le 1$. For deriving a set of bases $\psi_i$ of $W^k_0$ using Gram-Schmidt orthogonalization (GSO), we need to evaluate the integrals efficiently, which can be achieved using Gaussian quadrature. In the BZ equations, $\epsilon_1 = 5 \times 10^{-2}$, $\epsilon_2 = 5 \times 10^{-2}$, $\epsilon_3 = 2 \times 10^{-2}$. Our goal is to learn the operators mapping the initial condition of each variable to the solution under the interference of the other variables. The initial conditions are generated using the smooth random functions (Rand) in the chebfun package (Driscoll et al., 2014). For the initial conditions $u(x, 0)$, $v(x, 0)$, $w(x, 0)$, we set the parameter $\gamma = 0.3, 0.2, 0.1$, respectively.
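The three-term recurrence and the normalization above can be checked numerically with Gaussian quadrature (a sketch; `legendre` and `varphi` are illustrative helper names):

```python
import numpy as np

def legendre(i, x):
    """Legendre P_i via the three-term recurrence
    i P_i(x) = (2i - 1) x P_{i-1}(x) - (i - 1) P_{i-2}(x)."""
    p_prev, p = np.ones_like(x), np.asarray(x, dtype=float)
    if i == 0:
        return p_prev
    for n in range(2, i + 1):
        p_prev, p = p, ((2 * n - 1) * x * p - (n - 1) * p_prev) / n
    return p

def varphi(i, x):
    """Normalized shifted Legendre basis of V^k_0 on [0, 1]."""
    return np.sqrt(2 * i + 1) * legendre(i, 2 * x - 1)
```

Gauss-Legendre quadrature integrates polynomials of degree up to $2m - 1$ exactly with $m$ nodes, so the orthogonality relations hold to machine precision.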
Given the initial conditions, we solve the equations using the fourth-order stiff time-stepping scheme ETDRK4 (Cox & Matthews, 2002). To handle multiple coupled variables, we apply the dice strategy (see Fig. 6) to mimic the interactions among u, v, and w. As the results show, our CMWNO still achieves a new state of the art.

G MODEL COMPARISON

Table 5 compares the parameter counts of CMWNO and the baselines in detail. Although our model has about twice as many parameters as the second-best model (FNO), its performance improves by 57.68% and 61.41%, respectively. In solving coupled PDE problems, our model is the optimal choice in terms of the balance between performance and model size. Improving model efficiency is one of our future efforts, which we leave as future work.

H DISCUSSION ON PINN

We note that neural operators and PINNs are two recent deep learning approaches that have gathered traction in the community. The PINN combines the advantages of data-driven machine learning and physical modeling to train a model that automatically satisfies physical constraints even with insufficient training data, and it offers comparable generalization performance for predicting important physical parameters of the model while ensuring accuracy. One can incorporate the differential-form constraints from PDEs into the design of the neural network's loss function using automatic differentiation techniques in deep neural networks. However, a PINN cannot be used directly in a completely data-driven scenario without an exact PDE structure, and the governing PDE is hard to determine in in-the-wild applications. One can take a compromise approach by relying on a specific PDE (such as (Connors et al., 2009)) to design the loss function and using it on different input functions, though this is undesirable. $\{f \mid f$ are polynomials of degree $< k$ defined over the interval $(2^{-n}l, 2^{-n}(l+1))$ for all $l = 0, 1, \dots, 2^n - 1$, and $f$ is $0$ elsewhere$\}$
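To make the loss-design point concrete, here is a toy sketch of a PDE-residual loss in the spirit of a PINN for Burgers' equation, with finite differences standing in for the automatic differentiation a real PINN would apply to the network output (all names are illustrative, not the method of any cited work):

```python
import numpy as np

def pinn_style_loss(u, x, t, nu=0.01):
    """Residual loss penalizing violation of Burgers' equation
    u_t + u u_x - nu u_xx = 0 on a (t, x) grid. A real PINN would
    compute u_t, u_x, u_xx by autodiff through the network; here we
    use finite differences on a sampled field u for illustration."""
    dx, dt = x[1] - x[0], t[1] - t[0]
    u_t = np.gradient(u, dt, axis=0)
    u_x = np.gradient(u, dx, axis=1)
    u_xx = np.gradient(u_x, dx, axis=1)
    residual = u_t + u * u_x - nu * u_xx
    return np.mean(residual ** 2)  # zero iff the field satisfies the PDE
```

This is exactly the component that requires the PDE structure to be known in advance, which is why a fully data-driven setting rules the approach out.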

I DEFAULT NOTATION

$W^k_n$ — orthogonal complement of $V^k_n$ such that $W^k_n \oplus V^k_n = V^k_{n+1}$



Figure 2: Dice strategy. For each sample, one only needs to go through a specific path (rounded rectangle). Inside each path, the order of updating is from left to right, where the darker block indicates the operator we want to update and the lighter blocks provide decomposition information from the fixed operator.

Figure 4: Comparing the models by varying the coupling coefficient λ at the initial condition (U-GRF, V-GRF) with resolution s = 1024.

Figure 3: Learning curve -Relative L2 error vs epochs for neural operators.

Figure 5: The output of the GS coupled equations at the initial condition (U-Rand, V-GRF). (Left) The predicted outputs of the models for $u(x, \tau = 1)$. (Right) The predicted outputs of the models for $v(x, \tau = 1)$.

Figure 9: Sample input/output for ρ and ϕ. The top two figures are the initial conditions of ρ and ϕ as the input. The other figures are the outputs of ρ and ϕ at different times t.

The kernel in the operator
$[a, b]$ — the real interval including both a and b
$(a, b]$ — the real interval excluding a but including b
$V^k_n$

Gray-Scott (GS) equation benchmarks for different input resolutions s at initial condition (U-GRF, V-GRF). The relative L2 errors are shown for each model. Bolded values are the best results among all models, and underlined values are the best results among existing models. The same applies below.



; Xiao et al. (2021); Yin et al. (2020) can be characterized by coupled PDEs. Combining these two research lines, this is the first work to propose a decoupled multiwavelet-based neural operator learning scheme for solving coupled PDE problems.

Gray-Scott (GS) equation benchmarks for different input resolutions s at initial condition (U-Rand, V-GRF). The relative L2 errors are shown for each model. Given the fields $u(x,t)$, $v(x,t)$, $w(x,t)$, the BZ coupled equations take the form:
$\partial_t u(x,t) = \epsilon_1 \partial_{xx} u(x,t) + u + v - uv - u^2, \quad x \in (0,1),\ t \in (0, 0.2]$
$\partial_t v(x,t) = \epsilon_2 \partial_{xx} v(x,t) + w - v - uv, \quad x \in (0,1),\ t \in (0, 0.2]$
$\partial_t w(x,t) = \epsilon_3 \partial_{xx} w(x,t) + u - 2, \quad x \in (0,1),\ t \in (0, 0.2]$

ETDRK4 (Cox & Matthews, 2002), with a resolution of $2^{10}$; we then sub-sample this data to obtain datasets with a resolution of $2^8$. The results of the experiments on the BZ coupled equations with different resolutions (i.e., s = 256, 1024) are shown in Table 4. Belousov-Zhabotinsky (BZ) equation benchmarks for different input resolutions s. The relative L2 errors are shown for each model.
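The sub-sampling from resolution $2^{10}$ to $2^8$ amounts to strided slicing along the spatial axis (a sketch with a hypothetical random array standing in for the ETDRK4 solutions):

```python
import numpy as np

# Generate at the finest resolution, then sub-sample by strided slicing:
# 2^10 = 1024 points down to 2^8 = 256 points keeps every 4th sample.
fine = np.random.rand(50, 2 ** 10)  # hypothetical dataset: 50 trajectories
stride = 2 ** 10 // 2 ** 8          # = 4
coarse = fine[:, ::stride]
```

Because the coarse grid is a subset of the fine grid, the same solver run serves every resolution in the benchmark.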

Comparison of the models' parameters.

$T^l_{*,*}$ — coefficients corresponding to the operator; $U^l_{*,*}$, $V^l_{*,*}$ — coefficients corresponding to the inputs $u_*$ and $v_*$

ACKNOWLEDGEMENT

We are thankful to the anonymous reviewers for providing their valuable feedback which improved our manuscript. We would also like to thank Dr. Justinian Rosca and Lisang Ding for their valuable feedback. We gratefully acknowledge the support by the National Science Foundation Career award under Grant No. Cyber-Physical Systems / CNS-1453860, the NSF award under Grant CCF-1837131, MCB-1936775, CNS-1932620, the U.S. Army Research Office (ARO), the Defense Advanced Research Projects Agency (DARPA) Young Faculty Award and DARPA Director Award under Grant No. N66001-17-1-4044, an Intel faculty award, the Okawa Foundation award, a Northrop Grumman grant, and Google cloud program. A part of this work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. R.B. has been supported in part by a NSF award under grant DMS-2108900 and by the Simons Foundation. The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied by the Defense Advanced Research Projects Agency, the Army Research Office, the Department of Defense, or the National Science Foundation.

E INITIAL STATES SAMPLES

In this section, we use Fig. 7 to illustrate the initial states (U-Rand and V-GRF) for the Gray-Scott equations; Fig. 8 to exhibit different initial states (U-GRF and V-GRF) and solutions for the Gray-Scott equations; and Fig. 9 to show samples of ρ(x, t) and ϕ(x, t) at different times t in our non-local MFG case.

F.1 ADDITIONAL RESULTS FOR GRAY-SCOTT (GS) EQUATIONS

In this section, we provide more results on the relative L2 errors with different initial conditions in Table 3.

F.2 BELOUSOV-ZHABOTINSKY (BZ) EQUATIONS

Adapted from the Belousov-Zhabotinsky dynamical system, the coupled Belousov-Zhabotinsky (BZ) equations in (21) describe a reaction-diffusion process with three species (Driscoll et al., 2014). For a

