NON-EQUISPACED FOURIER NEURAL SOLVERS FOR PDES

Abstract

Solving partial differential equations is difficult. Recently proposed neural resolution-invariant models, despite their effectiveness and efficiency, usually require equispaced spatial points of data. However, sampling in the spatial domain is sometimes inevitably non-equispaced in real-world systems, limiting their applicability. In this paper, we propose a Non-equispaced Fourier PDE Solver (NFS) with adaptive interpolation on resampled equispaced points and a variant of Fourier Neural Operators as its components. Experimental results on complex PDEs demonstrate its advantages in accuracy and efficiency. Compared with the spatially-equispaced benchmark methods, it achieves superior performance with a 42.85% improvement on MAE, and is able to handle non-equispaced data with a tiny loss of accuracy. Besides, to the best of our knowledge, NFS is the first ML-based method with mesh-invariant inference ability to successfully model turbulent flows in non-equispaced scenarios, with only a minor deviation of the error on unseen spatial points.

1. INTRODUCTION

Solving partial differential equations (PDEs) holds the key to revealing the underlying mechanisms and forecasting the future evolution of physical systems. However, classical numerical PDE solvers require fine discretization in the spatial domain to capture the patterns and assure convergence, and they also suffer from computational inefficiency. Recently, data-driven neural PDE solvers have revolutionized this field by providing fast and accurate solutions for PDEs. Unlike approaches designed to model one specific instance of a PDE (E & Yu, 2017; Bar & Sochen, 2019; Smith et al., 2020; Pan & Duraisamy, 2020; Raissi et al., 2020), neural operators (Guo et al., 2016; Sirignano & Spiliopoulos, 2018; Bhatnagar et al., 2019; KHOO et al., 2020; Li et al., 2020b;c; Bhattacharya et al., 2021; Brandstetter et al., 2022; Lin et al., 2022) directly learn the mapping between infinite-dimensional spaces of functions. They remedy the mesh-dependent nature of finite-dimensional operators by producing a single set of network parameters that may be used with different discretizations. However, two problems still exist: discretization-invariant modeling for non-equispaced data, and computational inefficiency compared with convolutional neural networks in the finite-dimensional setting. To alleviate the first problem, MPPDE (Brandstetter et al., 2022) borrows basic modules from MPNN (Gilmer et al., 2017) to model the dynamics for spatially non-equispaced data, but it even increases the time complexity due to the pushforward trick, and suffers from unsatisfactory accuracy in complex systems (see Fig. 2(a)). FNO (Li et al., 2020c) has achieved success in tackling the second problem of inefficiency and inaccuracy, but its spatial points must be equispaced because it harnesses the fast Fourier transform (FFT). To sum up, two properties are desirable in neural PDE solvers: (1) discretization-invariance and (2) equispace-unnecessity.
Property (1) is shared by infinite-dimensional neural operators, in which the learned pattern can be generalized to unseen meshes. By contrast, classical vision models and graph spatio-temporal models are not discretization-invariant. Property (2) means that the model can handle irregularly-sampled spatial points. For example, graph spatio-temporal models do not require the data to be equispaced, but vision models are equispace-necessary and limited to handling images as 2-d regular grids. Recently proposed methods can be classified into four types according to the two properties, as shown in Fig. 1.

Figure 1: Four types of methods with or without the two concluded limitations.

As discussed, although the equispace-necessary methods enjoy fast parallel computation and low prediction error, they lack the ability to handle spatially non-equispaced data. For these reasons, this paper aims to design a mesh-invariant model (defined in Fig. 1) called the Non-equispaced Fourier neural Solver (NFS) with comparably low computational cost and high accuracy, by lending the powerful expressivity of FNO and vision models to efficiently solve complex PDE systems. Our paper, including its leading contributions, is organized as follows:

• In Sec. 2, we first give some preliminaries on neural operators as related work, with a brief introduction to Vision Mixers, to build a bridge between the Fourier Neural Operator and Vision Mixers. We then state our motivation for this work: to establish a mesh-invariant neural operator by harnessing the network structure of Vision Mixers.

• In Sec. 3, we propose a Non-equispaced Fourier Solver (NFS), with adaptive interpolation operators and a variant of Fourier Neural Operators as its components. Approximation theorems that guarantee the expressiveness of the proposed interpolation operators are developed. Further discussion gives insights into the relation between NFS, patchwise embedding, and multipole graph models.

• In Sec.
4, extensive experiments on different types of PDEs are conducted to demonstrate the superiority of our methods. Detailed ablation studies show that both the proposed interpolation kernel and the architecture of Vision Mixers contribute to the improvements in performance.

2. BACKGROUND AND RELATED WORK

2.1. PROBLEM FORMULATION

Suppose spatial points $X = \{x_i : 1 \le i \le n_s\}$ are sampled from the domain $D$. The observations of the input function $a \in A(D; \mathbb{R}^{d_a})$ and the output $u \in U(D; \mathbb{R}^{d_u})$ on the $n_s$ points are denoted by $\{a(x_i), u(x_i)\}_{i=1}^{n_s}$, where $A(D; \mathbb{R}^{d_a})$ and $U(D; \mathbb{R}^{d_u})$ are separable Banach spaces of functions taking values in $\mathbb{R}^{d_a}$ and $\mathbb{R}^{d_u}$ respectively. Suppose $x \sim \mu$ is i.i.d. sampled from the probability measure $\mu$ supported on $D$. An infinite-dimensional neural operator $G_\theta: A(D; \mathbb{R}^{d_a}) \to U(D; \mathbb{R}^{d_u})$, parameterized by $\theta \in \Theta$, aims to build an approximation so that $G_\theta(a) \approx u$. A cost functional $C: U(D; \mathbb{R}^{d_u}) \times U(D; \mathbb{R}^{d_u}) \to \mathbb{R}$ is defined to optimize the parameter $\theta$ of the operator via the objective

$$\min_{\theta \in \Theta} \mathbb{E}_{x \sim \mu}\left[C(G_\theta(a), u)(x)\right] \approx \frac{1}{n_s}\sum_{i=1}^{n_s} C(G_\theta(a), u)(x_i). \tag{1}$$

To establish a mesh-invariant operator, $X$ can be non-equispaced, and the learned $G_\theta$ should transfer to an arbitrary discretization $X' \subset D$, where $x' \in X'$ is not necessarily contained in $X$. Because we focus on spatially non-equispaced points, when the PDE system is time-dependent we assume that the timestamps $\{t_j\}$ are uniformly sampled; that is, we do not address temporally irregular sampling or continuous-time problems (Rubanova et al., 2019; Chen et al., 2019; Çagatay Yıldız et al., 2019; Iakovlev et al., 2020).
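For the squared-error cost used later in Sec. 4.1, the objective in Eq. (1) reduces to an average over the sampled points. A minimal NumPy sketch of this empirical objective (the function name `empirical_objective` is ours, not from the paper):

```python
import numpy as np

def empirical_objective(G_theta, a, u, X):
    """Monte-Carlo approximation of the operator-learning objective in Eq. (1):
    min_theta E_{x~mu}[C(G_theta(a), u)(x)] ~ (1/n_s) sum_i C(G_theta(a), u)(x_i),
    here with squared-error cost C(u, v)(x) = ||u(x) - v(x)||^2.
    """
    pred = G_theta(a)                           # predicted output function on X
    return np.mean(np.sum((pred - u) ** 2, axis=-1))

# Toy check: with an identity "operator" and matching target, the cost is zero.
a = np.ones((10, 3))                            # a(x_i) in R^{d_a}, n_s = 10
u = np.ones((10, 3))
assert empirical_objective(lambda f: f, a, u, X=np.linspace(0, 1, 10)) == 0.0
```

In practice `G_theta` would be a trained network and the average would be taken over a dataset of (a, u) pairs as well as over the spatial points.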

2.2. DISCRETE FOURIER TRANSFORM

Let $k_l = (k_l^{(1)}, \ldots, k_l^{(d)})$ denote the $l$-th frequency corresponding to $X$, with $k_l \in \mathbb{Z}^d$. The discrete Fourier transform of $f: D \to \mathbb{R}^{d_f}$ is denoted by $\mathcal{F}(f)(k) \in \mathbb{C}^{d_f}$, with $\mathcal{F}^{-1}$ as its inverse:

$$\mathcal{F}(f)^{(j)}(k_l) = \sum_{i=1}^{n_s} f^{(j)}(x_i)\, e^{-2i\pi \langle x_i, k_l \rangle}, \qquad \mathcal{F}^{-1}(f)^{(j)}(x_i) = \sum_{l=1}^{n_s} f^{(j)}(k_l)\, e^{2i\pi \langle x_i, k_l \rangle}, \tag{2}$$

where $j$ denotes the $j$-th dimension of $f$. General Fourier transforms have complexity $O(n_s^2)$. When the spatial points are distributed uniformly on equispaced grids, the fast Fourier transform (FFT) and its inverse (IFFT) (Rader & Brenner, 1976) can be used to reduce the complexity to $O(n_s \log n_s)$.
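The two regimes can be made concrete with a few lines of NumPy: a naive transform evaluating the sum in Eq. (2) directly works for arbitrary point locations at $O(n_s^2)$ cost, while `np.fft.fft` gives the same result on an equispaced grid at $O(n_s \log n_s)$ cost. The function name `ndft` is ours:

```python
import numpy as np

def ndft(f, x, n_modes):
    """Naive discrete Fourier transform for arbitrary sample locations x in [0, 1):
    F(k_l) = sum_i f(x_i) * exp(-2i*pi*<x_i, k_l>), i.e. Eq. (2) evaluated directly.
    Cost is O(n_s * n_modes) with no requirement that x be equispaced."""
    k = np.arange(n_modes)                      # frequencies k_0 .. k_{n_modes-1}
    return np.exp(-2j * np.pi * np.outer(k, x)) @ f

# On an equispaced grid the naive transform coincides with the FFT.
n = 64
x = np.arange(n) / n                            # equispaced grid on [0, 1)
f = np.sin(2 * np.pi * 3 * x) + 0.5 * np.cos(2 * np.pi * 5 * x)
assert np.allclose(ndft(f, x, n), np.fft.fft(f))
```

Off the equispaced grid, only the naive transform applies directly; making the FFT usable there is exactly the role of the interpolation schemes reviewed in Sec. 3.1.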

2.3. FOURIER NEURAL OPERATOR

Neural Operators. To model one specific instance of a PDE, a line of neural solvers has been designed, with prior physical knowledge as constraints. Different from these methods, neural operators (Lu et al., 2021; Nelsen & Stuart, 2021) require no knowledge of the underlying PDEs, only data. Finite-dimensional operator methods (Guo et al., 2016; Sirignano & Spiliopoulos, 2018; Bhatnagar et al., 2019; KHOO et al., 2020) are discretization-variant, meaning that the model can only learn the patterns of the spatial points which have been fed to it in the training process. By contrast, infinite-dimensional operator methods (Li et al., 2020b;c; Bhattacharya et al., 2021; Brandstetter et al., 2022) are discretization-invariant, enabling the learned models to generalize to unseen meshes zero-shot. The kernel integral operator method (Li et al., 2020a) is a family of infinite-dimensional operators formulated as an iterative architecture $(G_\theta(a))(x) = Q \circ v_T \circ \cdots \circ v_1 \circ P(a)(x)$. A higher-dimensional representation function is first obtained as $v_0 = P(a) \in U(D; \mathbb{R}^{d_v})$, where $P$ is a shallow fully-connected network. It is updated by

$$v_{t+1}(x) := \sigma\big(W v_t(x) + K_\phi(a)v_t(x)\big), \quad \forall x \in D, \tag{3}$$

where $K_\phi: A \to L(U)$ is a kernel integral operator mapping, which maps $a$ to bounded linear operators, with parameters $\phi$; $W$ is a linear transform and $\sigma$ is a non-linear activation function. After the final iteration, $Q$ projects $v_T(x)$ back to $U(D; \mathbb{R}^{d_u})$. The Fourier Neural Operator (FNO) (Li et al., 2020c), a member of the kernel integral operator family, updates the representation by applying the convolution theorem:

$$K_\phi(a)v(x) = \mathcal{F}^{-1}\big(\mathcal{F}(\kappa_\phi) \cdot \mathcal{F}(v)\big)(x) = \mathcal{F}^{-1}\big(R_\phi \cdot \mathcal{F}(v)\big)(x), \tag{4}$$

where $R_\phi$, the Fourier transform of a periodic kernel function $\kappa_\phi$, is directly learned as the parameters of the updating process.
To be resolution-invariant, FNO picks a finite-dimensional parameterization by truncating the Fourier series of both $\mathcal{F}(v)$ and $R_\phi$ at a maximal number of modes $k_{\max}^{(l)}$ for $1 \le l \le d$. Because the sampled spatial points are equispaced in FNO, it can conduct FFT and IFFT to obtain the Fourier series, which is very efficient.
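The linear part of one FNO update in Eq. (4), with mode truncation, can be sketched in a few lines of NumPy for the 1-d case (a simplified illustration under our own naming, not the authors' released code):

```python
import numpy as np

rng = np.random.default_rng(0)

def fourier_layer_1d(v, R, k_max):
    """Linear part of one FNO spectral layer in 1-d: F^{-1}(R . F(v)).

    v: (n_s, d_v) real representation on an equispaced grid.
    R: (k_max, d_v, d_v) complex learned weights, one matrix per kept mode.
    Only the lowest k_max modes are kept (truncation), which is what makes
    the parameterization independent of the grid resolution n_s.
    """
    n_s, d_v = v.shape
    v_hat = np.fft.rfft(v, axis=0)              # Fourier coefficients, (n_s//2+1, d_v)
    out_hat = np.zeros_like(v_hat)
    out_hat[:k_max] = np.einsum('kij,kj->ki', R, v_hat[:k_max])  # R_phi . F(v)
    return np.fft.irfft(out_hat, n=n_s, axis=0) # back to physical space

d_v, k_max = 4, 8
R = rng.standard_normal((k_max, d_v, d_v)) + 1j * rng.standard_normal((k_max, d_v, d_v))
v = rng.standard_normal((64, d_v))
assert fourier_layer_1d(v, R, k_max).shape == (64, 4)
```

Because `R` has shape `(k_max, d_v, d_v)` regardless of `n_s`, the same weights can be applied to inputs sampled at a different (equispaced) resolution, which is the resolution-invariance the paragraph above describes.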

2.4. VISION MIXER AND GRAPH SPATIO-TEMPORAL MODEL

Vision Mixers (Tolstikhin et al., 2021; Rao et al., 2021; Guibas et al., 2021) are a line of models with a stack of (token mixing)-(channel mixing)-(token mixing) as their network structure for vision tasks. They are based on the assumption that the key to the effectiveness of Vision Transformers (VIT) (Dosovitskiy et al., 2020) is the proper mixing of tokens. The defined tokens are equivalent to equispaced spatial points in the former definition, and research on mixing them is an analogy to modeling the proper interaction or message-passing patterns among spatial points. Specifically, VIT uses a non-Mercer kernel function (Wright & Gonzalez, 2021) $\kappa_\phi$ to adaptively learn the pattern of message-passing through the iterative updating process

$$v_{t+1}(x) = \sigma\big(\text{ChannelMix} \circ \text{TokenMix}(v_t(x))\big); \quad \text{TokenMix}(v(x)) = \sum_i \kappa_\phi(x, x_i, v(x), v(x_i)) \cdot v(x_i); \quad \text{ChannelMix}(v(x)) = W v(x), \tag{5}$$

where $W$ is a linear transform called the channel mixing layer because it transforms the input along the channel of an image, whose dimension is equivalent to the function dimension $d_f$. Note that we omit the residual connection in Eq. (3) for simplicity.

Remark. FNO can be regarded as a member of the family of Vision Mixers. The reason is that a component of an iteration in Eq. (4) can be written as $(R_\phi \cdot \mathcal{F}(v))(x) = R_\phi \cdot \sum_i e^{-2\pi i \langle x, x_i \rangle} v(x_i)$, because in the equispaced scenario $x_i$ can be regarded as lying on the same grids as $k$ after scaling. The kernel $\kappa_\phi$ is parameterized by $\kappa_\phi(x, x_i, v(x), v(x_i)) = e^{-2\pi i \langle x, x_i \rangle}$, and the matrix multiplication by $R_\phi$ also performs mixing on channels. Besides, the inverse Fourier transform can also be regarded as a token mixing layer, or a so-called token demixing layer (Guibas et al., 2021). However, the powerful fitting ability and efficiency of Vision Mixers cannot be applied to non-equispaced spatial points.
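The shared structure of Eq. (5) can be sketched generically: one matrix mixes across tokens (playing the role of the kernel values $\kappa_\phi(x, x_i)$, learned directly in MLPMIXER), another mixes across channels. A minimal NumPy sketch with names of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixer_block(v, A, W):
    """One iteration v_{t+1} = sigma(ChannelMix(TokenMix(v_t))) as in Eq. (5).

    v: (n_tokens, d_f) token representations.
    A: (n_tokens, n_tokens) token-mixing weights, standing in for the kernel
       values kappa_phi(x, x_i); MLP-Mixer-style models learn these directly.
    W: (d_f, d_f) channel-mixing weights.
    Residual connections and normalization are omitted, as in Eq. (5).
    """
    tokens_mixed = A @ v                        # sum_i kappa(x, x_i) v(x_i)
    channels_mixed = tokens_mixed @ W.T         # W v(x), applied per token
    return np.maximum(channels_mixed, 0.0)      # sigma = ReLU (one common choice)

v = rng.standard_normal((16, 8))
A = rng.standard_normal((16, 16)) / 4
W = rng.standard_normal((8, 8)) / 4
out = mixer_block(v, A, W)
assert out.shape == (16, 8)
```

Under this view, FNO's `A` is the (inverse) Fourier matrix composed with the truncated multiplication by $R_\phi$, which only exists as a fast transform when the tokens lie on an equispaced grid.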
Another option for non-equispaced data is graph spatio-temporal models, in which interaction patterns among spatial points are modeled in a graph message-passing way (Gilmer et al., 2017; Atwood & Towsley, 2016; Defferrard et al., 2017). The mechanism is similar to the token mixing in Vision Mixers, with the summation in Eq. (5) conducted in the predefined neighborhood of each point. Unfortunately, graph spatio-temporal models (Seo et al., 2016; Li et al., 2018; Bai et al., 2020; Lin et al., 2021) suffer from high computational complexity and unsatisfactory accuracy in solving complex dynamical systems (such as turbulent flows).

2.5. MOTIVATION

Since FNO belongs to the Vision Mixers, a first question arises: do models employing the Vision Mixer architecture have the potential to model complex PDE systems? Experiments are therefore conducted to give an intuitive explanation of our motivation, as shown in Fig. 2. The data are generated by the Navier-Stokes equations; note that graph spatio-temporal methods can also handle equispaced data. The detailed setup is given in Sec. 4. In Fig. 2(c), 'Eq' and 'Neq' mean the methods are trained in equispaced and non-equispaced scenarios, respectively. We find that (1) all of the evaluated Vision Mixers are able to model the dynamical systems effectively, despite FNO being the only discretization-invariant model among them; (2) the complex dynamics of the systems are hardly captured by graph spatio-temporal models, whose accuracy and computational efficiency are unsatisfactory in both equispaced and non-equispaced scenarios. Fig. 2(c) shows the infeasibility of graph spatio-temporal models through the loss curves on the validation set, compared with Vision Mixers. However, Vision Mixers fail to handle non-equispaced data. Therefore, we aim to (1) establish a mesh-invariant model by harnessing the network structure of Vision Mixers, to achieve competitive efficiency and effectiveness in equispaced scenarios, as shown in Fig. 2(a); and (2) retain applicability and comparable accuracy in non-equispaced scenarios for solving PDEs, as shown in Fig. 2(b).

3.1. NON-EQUISPACED FOURIER TRANSFORM

Nonuniformly sampled signals are unavoidable in certain real-world physics scenarios, such as signals obtained by meteorological stations on the Earth's surface, which urges the fast Fourier transform (FFT) to be extended to non-equispaced data with efficient implementations. Non-equispaced FFTs usually rely on a mixture of interpolation and the judicious use of the FFT, where the interpolation costs no more than $O(n_s \log n_s)$ operations (Kalamkar et al., 2012; Cheema et al., 2017). For example, Lagrange interpolation can be used to approximate the signal values on $m_s$ resampled equispaced points $\{\bar{x}_j\}_{1 \le j \le m_s}$, followed by an FFT on the interpolated points; a low-rank approximation with complexity $O(n_s \log(1/\epsilon))$ can then replace the $O(n_s^2)$ interpolation, with $\epsilon$ the precision of the computations (Dutt & Rokhlin, 1995). Another example is the commonly used Gaussian-based interpolation (Kestur et al., 2010). Denoting by $\mathcal{F}$ the equispaced FFT in particular and by $H$ the interpolation operator, the proposed non-equispaced FFT can be written as

$$(\mathcal{F} \circ H(f))(k) \approx \frac{\pi}{\tau}\, e^{\tau \langle k, k \rangle} \sum_{j=1}^{m_s} e^{-2i\pi \langle k, \bar{x}_j \rangle} \sum_{i=1}^{n_s} f(x_i)\, h_\tau(x_i - \bar{x}_j).$$

Here $H(f)(\bar{x}_j) = \sum_{i=1}^{n_s} f(x_i)\, h_\tau(x_i - \bar{x}_j)$ interpolates values on the resampled points via convolution with the periodic heat kernel $h_\tau(x - y) = \sum_{l \in \mathbb{Z}^d} e^{-(x - y - l)^2 / 4\tau}$, with $\tau$ a constant. Multiplication of the interpolation matrix $(H_\tau)_{i,j} = h_\tau(x_i - \bar{x}_j)$ with the signal vector $(f)_i = f(x_i)$ takes $O(n_s m_s) \approx O(n_s^2)$ operations. Since the kernel $h_\tau$ is a summation of Gaussian kernels, replacing it with a single Gaussian in each point's neighborhood $N(x_i)$ yields only a tiny error depending on $\tau$, so the interpolation operator can be approximated via $H(f)(\bar{x}_j) = \sum_{x_i \in N(\bar{x}_j)} f(x_i)\, h_\tau(x_i - \bar{x}_j)$. Restricting the neighbor number to $|N(x_i)| \le \log n_s$ reduces the complexity to $O(n_s \log n_s)$.
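The Gaussian-gridding pipeline above (spread onto an equispaced grid, then FFT) can be sketched in 1-d as follows. This is our own illustration: it uses a single truncated Gaussian per point as described above, and omits the diagonal deconvolution factor $\frac{\pi}{\tau}e^{\tau\langle k,k\rangle}$ that corrects the smearing in the full formula:

```python
import numpy as np

def gaussian_gridding_nufft(f, x, m_s, tau=1e-3):
    """Sketch of a type-1 non-equispaced FFT in 1-d: spread non-equispaced
    samples onto an equispaced grid with a Gaussian kernel, then apply the FFT.

    f:   (n_s,) signal values at locations x in [0, 1).
    m_s: number of resampled equispaced grid points x_bar_j = j / m_s.
    Note: the spectral deconvolution step is omitted for brevity.
    """
    x_bar = np.arange(m_s) / m_s
    # Interpolation matrix (H_tau)_{j,i} = h_tau(x_bar_j - x_i), using one
    # Gaussian per point instead of the full periodic heat kernel.
    d = x_bar[:, None] - x[None, :]
    d = (d + 0.5) % 1.0 - 0.5                   # wrap differences onto the torus
    H = np.exp(-d ** 2 / (4 * tau))
    f_grid = H @ f                              # values on the equispaced grid
    return np.fft.fft(f_grid)                   # equispaced FFT of gridded signal

rng = np.random.default_rng(0)
x = rng.uniform(size=50)                        # non-equispaced sample locations
f = np.sin(2 * np.pi * 4 * x)
F = gaussian_gridding_nufft(f, x, m_s=32)
assert F.shape == (32,) and np.iscomplexobj(F)
```

A production implementation would additionally truncate the kernel to $O(\log n_s)$ neighbors per grid point and apply the deconvolution factor mode-by-mode, as in the formula above.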

3.2. NON-EQUISPACED FOURIER NEURAL PDE SOLVER

Non-equispaced interpolation. To harness the effectiveness of FNO, we use non-equispaced Fourier token mixing instead of the equispaced one. It generalizes the equispaced FFT in Eq. (4) as $\mathcal{F}(v) = (\mathcal{F} \circ H_\eta(a))(v)$. We denote by $H_\eta: A \to L(U)$ the interpolation operator mapping, which maps a parametric function to a bounded interpolation operator. $H_\eta(a)$ obtains the interpolated values on $m_s$ resampled equispaced points via convolution with a kernel $h_\eta$:

$$(H_\eta(a)v)(\bar{x}_j) = \frac{1}{n_s}\sum_{i=1}^{n_s} v(x_i)\, h_\eta(\bar{x}_j - x_i, x_i, a(x_i)),$$

where $\bar{x}_j$ lies on the resampled equispaced grid. Another operator $H_\zeta$ interpolates back onto the $n_s$ non-equispaced points in the same way, via convolution with a kernel $h_\zeta$. To reduce the cost to no more than $O(n_s \log n_s)$ operations, the summation is restricted to the neighborhoods of $x_i$ and $\bar{x}_j$, such that $|N(x_i)| \approx |N(\bar{x}_j)| \le c \log n_s$, with $c$ a predefined constant determining the neighborhood size of spatial points. We formulate the kernel with a shallow feed-forward neural network. Thanks to the universal approximation property of neural networks, the following theorem assures that the interpolation operator can approximate the representation function $v$ arbitrarily well (for the detailed proof, see Appendix A.3). Empirical observations on the convergence of the interpolation operators are given in Appendix C.

Theorem 3.1 (Approximation Theorem of the Adaptive Interpolation). Assume the setting of Theorem A2 in Appendix A.3 is satisfied, and let $\mu$ be the probability measure supported on $D$. For $v \in U$, suppose $U = L^p(D; \mathbb{R}^{d_v})$ for some $1 < p < \infty$. Then, given $\epsilon > 0$, there exists a neural network $h_\eta: \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^{d_a} \to \mathbb{R}^{d_v}$ such that $\|v - \tilde{v}\|_U \le \epsilon$, where $\tilde{v}(x) = \int_D h_\eta(x - y, x, a(y))\, v(y)\, d\mu(y)$.

Applicability of Layer-Norm. As shown in Fig.
3, besides the comparison of the proposed interpolation operator with traditional ones, a notable difference between the original FNO and the FNO layers in our Vision Mixer architecture is the applicability of normalization layers (Layer-Norm), which are commonly used in Vision Mixer architectures. FNO cannot adopt Layer-Norm layers, because a change of resolution makes the trained normalization parameters and the spatial points disagree with each other. In comparison, the resampled equispaced points are fixed in our architecture, no matter how the discretization of the input changes. Therefore, normalization layers can be added in a similar way to Vision Mixers, bringing considerable improvements (see Sec. 4.3).

Mesh invariance. In the intermediate layers, which adopt equispaced FNO, the resampled points are fixed in both the training and inference processes, invariant to the input meshes. In the interpolation layers, the operator $H_\eta(a)$ is discretization-invariant because the kernel can be inductively evaluated from the newly observed signals $a(x)$, their coordinates $x$, and the resampled spatial points' relative coordinates $\bar{x}_j - x$. In the same way, $H_\zeta(a)$ is also mesh-invariant. This allows NFS to achieve zero-shot mesh-invariant inference, as demonstrated in Sec. 4.2.

Complexity analysis. The complexity of FNO is $O(n_s \log n_s + n_s k_{\max})$. In the interpolation layers, because the interpolated values of the resampled points are determined by their neighbors, we set the size of each resampled point's neighborhood and each observed non-equispaced point's neighborhood to $|N(x_i)| \approx |N(\bar{x}_j)| \le c \log n_s$, for $1 \le i \le n_s$, $1 \le j \le m_s$. In this way, the sparsity of the interpolation matrix reduces the complexity of the two interpolation layers to $O(c \cdot n_s \log n_s + c \cdot m_s \log n_s)$. If we set the number of resampled points to $n_s$, the overall complexity is $O(2c \cdot n_s \log n_s + n_s \log n_s + n_s k_{\max}) \sim O(n_s \log n_s + n_s k_{\max})$.
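The adaptive interpolation and its neighborhood restriction can be sketched in 1-d as follows. This is a simplified NumPy illustration under our own naming: the MLP weights are random here (in NFS they are trained end-to-end), scalar $a$, and the kernel output is assumed to weight $v$ elementwise:

```python
import numpy as np

rng = np.random.default_rng(0)

class LearnedInterp:
    """Sketch of the adaptive interpolation operator H_eta(a) in 1-d.

    Each resampled grid point x_bar_j gathers its <= c*log(n_s) nearest
    observed points and accumulates v(x_i) weighted elementwise by a kernel
    h_eta(x_bar_j - x_i, x_i, a(x_i)) given by a tiny MLP.
    """

    def __init__(self, d_v, hidden=16):
        self.W1 = 0.1 * rng.standard_normal((hidden, 3))   # inputs: dx, x, a(x)
        self.W2 = 0.1 * rng.standard_normal((d_v, hidden))

    def kernel(self, dx, x, ax):                # h_eta: R x R x R -> R^{d_v}
        return self.W2 @ np.tanh(self.W1 @ np.array([dx, x, ax]))

    def __call__(self, v, a, x, x_bar, c=2.0):
        n_s, d_v = v.shape
        k = max(1, int(c * np.log(n_s)))        # |N(x_bar_j)| <= c log n_s
        out = np.zeros((len(x_bar), d_v))
        for j, xb in enumerate(x_bar):
            for i in np.argsort(np.abs(x - xb))[:k]:
                out[j] += self.kernel(xb - x[i], x[i], a[i]) * v[i]
        return out / n_s                        # (1/n_s) sum_i v(x_i) h_eta(...)

interp = LearnedInterp(d_v=2)
x = rng.uniform(size=16)                        # non-equispaced observations
v = rng.standard_normal((16, 2))
a = rng.standard_normal(16)
x_bar = np.arange(8) / 8                        # resampled equispaced grid
assert interp(v, a, x, x_bar).shape == (8, 2)
```

Because the kernel depends only on relative coordinates, absolute coordinates, and observed signal values, the same trained weights apply to any input mesh, which is the mesh-invariance argument above.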

3.3. FURTHER DISCUSSION

Relation to Vision Mixers. The interpolation can be compared to patchwise embedding in Vision Mixers. For example, MLPMIXER learns the token mixing patterns adaptively with a feed-forward network, but the high resolution of input images does not permit global mixing of tokens due to the $O(n_s^2)$ complexity. Therefore, the input images are first rearranged into patches, each containing $n_p$ pixels. In this way, the complexity is reduced to $O(n_s^2 / n_p^2)$, enabling feasible token mixing. Patchwise embedding is very similar to interpolating values on resampled points: it first chooses the patch centers as $n_s^2 / n_p^2$ resampled points, and 'interpolates' the resampled points by lifting the embedding dimension and rearranging their neighbors' values as the interpolated values, rather than using a kernel.

Relation to multipole graph models. The adaptively learned interpolation layer in NFS has a formulation similar to that of multipole graph models (Li et al., 2020b). In multipole graph models, the high-level nodes aggregate messages from their low-level neighbors as

$$v_{\text{High}}(x_j) = \frac{1}{|N(x_j)|}\sum_{x_i \in N(x_j)} v_{\text{Low}}(x_i)\, h_\eta(x_j, x_i, a(x_j), a(x_i)).$$

Compared to multipole graph models, in NFS the values of the high-level resampled equispaced nodes are approximated from the low-level non-equispaced nodes' values, whereas in multipole graphs the low-level nodes' values are given. This causes a difference between multipole graph models' message-passing and NFS's interpolation: in the former, messages flow circularly among different levels of nodes, while in NFS, messages are exchanged only twice between the nodes of the two levels, once from the low-level non-equispaced nodes to the high-level resampled equispaced nodes, and once in the opposite direction.
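The patchwise rearrangement that the comparison above refers to is a fixed reshaping rather than a learned kernel; a minimal NumPy sketch (the helper name `patchify` is ours):

```python
import numpy as np

def patchify(img, p):
    """Rearrange an (H, W) image into (H*W / p^2, p*p) patch tokens, so that
    token mixing acts on the patch centers (n_s^2 / n_p^2 tokens) instead of
    all n_s^2 pixels. No kernel is involved: neighbors' values are simply
    concatenated along the embedding dimension."""
    H, W = img.shape
    t = img.reshape(H // p, p, W // p, p).swapaxes(1, 2)
    return t.reshape(-1, p * p)

img = np.arange(16.0).reshape(4, 4)
tokens = patchify(img, 2)
assert tokens.shape == (4, 4)                   # 4 patches of 2x2 = 4 pixels each
assert np.allclose(tokens[0], [0, 1, 4, 5])     # top-left patch, row-major
```

NFS replaces this fixed rearrangement with the learned kernel interpolation, which is what allows the "patch centers" to be an equispaced grid even when the input points are not.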

4.1. EXPERIMENTAL SETUP

Benchmarks for comparison. For finite-dimensional operators, we choose Vision Mixers including VIT (Dosovitskiy et al., 2020), GFN (Rao et al., 2021) and MLPMIXER (Tolstikhin et al., 2021) as equispaced problem solvers, DEEPONET-V and DEEPONET-U as two variants of DeepONet (Lu et al., 2021), and graph spatio-temporal models including DCRNN (Li et al., 2018), AGCRN (Bai et al., 2020) and GCGRU (Seo et al., 2016) as non-equispaced problem solvers. For infinite-dimensional operators, the state-of-the-art FNO (Li et al., 2020c) for equispaced problems and MPPDE (Brandstetter et al., 2022) for non-equispaced problems are chosen. A brief introduction to these models is given in Appendix B.1. In the Vision Mixers, the different timestamps on the temporal axis are also regarded as 'tokens', since timestamps are uniformly sampled.

Protocol. The widely-used metrics Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are deployed to measure performance. The reported means and standard deviations of the metrics are obtained through 5 independent experiments. All the models for comparison are trained with the MSE objective, i.e. $C(u, v)(x) = \|u(x) - v(x)\|^2$, corresponding to Eq. (1), and optimized by the Adam optimizer for 500 epochs. The hyper-parameters are chosen through careful tuning on the validation set. Every trial is run on a single Nvidia V100 (32510 MB).

4.2. NUMERICAL EXPERIMENTS

Data. We choose four equations for the numerical experiments, three of which are time-dependent (KdV, Burgers' and NS), while the other is not (Darcy Flow). For 1-d problems, we consider the Korteweg-de Vries (KdV) and Burgers' equations (given in Appendix B.2). For 2-d PDEs, we consider Darcy Flow (given in Appendix B.2) and the Navier-Stokes (NS) equation for a viscous, incompressible fluid in vorticity form on the unit torus:

$$\partial_t w(x, t) + u(x, t) \cdot \nabla w(x, t) = \nu \Delta w(x, t) + f(x), \qquad \nabla \cdot u(x, t) = 0, \qquad w(x, 0) = w_0(x),$$

where $x \in [0, 1]^2$, $t \in [0, 1]$, $u$ is the velocity field, $w = \nabla \times u$ is the vorticity, $w_0$ is the initial vorticity, and $\nu \in \mathbb{R}^+$ is the viscosity.

Performance comparison. For time-dependent PDEs, our target is to map the observed physical quantities from the initial condition $u(X, T) \in \mathbb{R}^{n_s \times n_t}$, where $T = \{t_i : t_i < T\}_{1 \le i \le n_t}$, to the quantities at some later time $u(X, T') \in \mathbb{R}^{n_s \times n_t'}$, where $T' = \{t_i : T < t_i < T'\}_{1 \le i \le n_t'}$. We set the input timestamp number $n_t$ to 1 (initial state to future dynamics) and 10 (sequence to sequence), and the prediction horizon $n_t'$ to 10, 20 and 40 as short-, mid- and long-term settings. For Darcy Flow, which is independent of time, we directly build an operator to map $a$ to $u$. In equispaced scenarios, the resolution is denoted by $r^d = n_s$, where $d$ is the spatial dimension. In non-equispaced scenarios, the number of spatial points is denoted by $n_s$. Comparing the benchmarks with and without equispace-unnecessity, we find that: (2) in non-equispaced scenarios, the evaluated graph spatio-temporal models' performance is unsatisfactory, especially on the NS equations; in comparison, NFS achieves accuracy comparably high to the equispaced scenarios, for instance, according to the columns of NS ($r = 64$, $n_t = 10$, $n_t' = 40$) versus ($n_s = 4096$, $n_t = 10$, $n_t' = 40$). (3) In some trials, such as Burgers' ($n_t = 1$) in Table B4 in Appendix
B.3, Vision Mixers including FNO also suffer from non-convergence of the loss, while NFS can still generate accurate predictions; the explanation of this phenomenon is left for future work.

Mesh-invariance evaluation. We use $(u(X, T), u(X, T'))$ as the training set, and evaluate the model's mesh-invariant inference ability on $X'$, where $|X'| = n_s'$ and $X'$ is a different mesh with $X \subseteq X'$. The visualization results of NS ($n_s = 4096$, $n_t' = 40$) are shown in Fig. 4. For a fixed $n_s'$, we randomly sample different $X'$ 100 times to obtain the mean errors and standard deviations (given in Appendix B.5) over different spatial meshes. We can conclude from Table 3 that (1) the errors on unseen meshes are larger than the errors on seen meshes, showing overfitting effects; however, the errors on unseen meshes are acceptable, since they are even lower than other models' prediction errors on seen meshes. (2) Larger $n_s'$ leads to higher prediction error, because a large number of unseen points is likely to disturb the learned token mixing patterns. On the other hand, NS ($n_s = 1024$, $n_t' = 10$) implies that a small number of spatial points in the training meshes ($n_s$) hinders the model's ability to generalize to unseen meshes, due to excessive loss of spatial information.

4.3. ARCHITECTURE ANALYSIS

Two modules in NFS differ from FNO. The first is the interpolation layers at the beginning and end of the architecture. The second is the extra Layer-Norm in the FNO layers, which is applicable in NFS thanks to its fixed resampled equispaced points, but inapplicable in FNO if its resolution-invariance is to be preserved. We aim to figure out what makes NFS outperform FNO.

Effects of neighborhood sizes. It is widely believed that modeling long-range dependencies among tokens brings improvements (Naseer et al., 2021; Tuli et al., 2021; Mao et al., 2021). By contrast, some local kernel methods demonstrate their superiority (Yang et al., 2019; Liu et al., 2021; Chu et al., 2021; Park & Kim, 2022). For this reason, we first conjecture that large neighborhood sizes in the interpolation layer are conducive to predictive performance. Besides, as demonstrated in Sec. 3.3, the patchwise embedding in Vision Mixers is analogous to resampling and interpolating, so we further establish a patchwise FNO (PFNO), with patch sizes of 4 and [4, 4] in the 1-d and 2-d PDE problems, equivalent to each resampled point aggregating 4 and 16 spatial points in the 1-d and 2-d settings respectively. Layer-Norm is stacked in the FNO layers of PFNO for a fair comparison. The results in Fig. 5 show that long-range dependency may even compromise performance, as larger mean neighborhood sizes often cause higher errors. However, no matter how large the neighborhood size is, NFS outperforms PFNO. More details are given in Appendix B.6. Therefore, we rule out the possibility that the performance gains come from large neighborhood sizes, and conclude that the proposed kernel interpolation layers are key, and are superior to simple patchwise embedding methods.

Benefits from the learned interpolation kernel.
Since the kernel interpolation is likely to hold the key to the improvements, we investigate the performance gains brought by adaptively learned interpolation kernels over a predefined one (see Fig. 3). As the predefined kernel, we use an inflexible Gaussian kernel, as discussed in Sec. 3.1:

$$h(\bar{x}_j - x_i) = \beta \exp\big(-(\bar{x}_j - x_i - \mu)^T \Gamma^{-1} (\bar{x}_j - x_i - \mu)\big),$$

where $\Gamma = \mathrm{diag}(\gamma^{(1)}, \ldots, \gamma^{(d)})$, and $\mu \in \mathbb{R}^d$, $\gamma^{(1)}, \ldots, \gamma^{(d)}, \beta \in \mathbb{R}^+$ are learnable parameters. Keeping all other modules and the interpolation neighborhood sizes the same, we compare the performance of the two interpolation kernels on different meshes in Table 3 (Gaus + LN), where the adaptively learned kernels achieve better accuracy.

Benefits from normalization layers. Previous works have demonstrated that normalization is necessary in network architectures for fast convergence and stable training (Dong et al., 2021; Ba et al., 2016). A notable difference between NFS and FNO is that Layer-Norm can be implemented in NFS's layers without disabling its discretization-invariance. The improvements brought by the normalization layers are given in Table 3 (Flex + LN), where the performance gap is obvious on unseen meshes.
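The predefined Gaussian kernel above is straightforward to express directly; a small NumPy sketch (our own illustration, with `dx` standing for the displacement $\bar{x}_j - x_i$):

```python
import numpy as np

def gaussian_kernel(dx, mu, gamma, beta):
    """Predefined (inflexible) interpolation kernel from Sec. 4.3:
    h(dx) = beta * exp(-(dx - mu)^T Gamma^{-1} (dx - mu)),
    with Gamma = diag(gamma). Only mu, gamma and beta are learnable, versus
    a full MLP for the adaptive kernel h_eta.

    dx: (..., d) displacements x_bar_j - x_i.
    """
    z = dx - mu
    return beta * np.exp(-np.sum(z * z / gamma, axis=-1))

# At dx = mu the exponent vanishes, so the kernel attains its peak value beta.
dx = np.zeros((3, 2))                           # three zero displacements in 2-d
h = gaussian_kernel(dx, mu=np.zeros(2), gamma=np.ones(2), beta=2.0)
assert np.allclose(h, 2.0)
```

The contrast with the adaptive kernel is then just parameter count and input dependence: this kernel sees only the displacement, while $h_\eta$ additionally conditions on the absolute coordinate and the observed signal $a(x_i)$.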

4.4. NON-EQUISPACED VISION MIXERS

Since NFS can be regarded as a combination of our interpolation layers with the revised FNO, our interpolation layers can also be implemented in the other Vision Mixers, equipping those methods with the ability to handle non-equispaced data. Details are given in Appendix B.7.

Table A1: Glossary of notations used in this paper (e.g., N(x) denotes the neighborhood of spatial point x).

A.2 GRAPH CONSTRUCTION

Neighborhood construction. Instead of using the K-nearest-neighbor method, the neighborhood system in the interpolation layer is constructed by an $\epsilon$-ball, because in equispaced scenarios there may be multiple points that are the K-th nearest neighbor at the same time. For a point $x$, its neighborhood is defined according to

$$d(x, x_i) \le \epsilon \Rightarrow x_i \in N(x); \qquad d(x, x_i) > \epsilon \Rightarrow x_i \notin N(x).$$

For the constant $c$ defined in Sec. 3.2, we can restrict $\epsilon$ so that $\mathbb{E}_{x \sim \mu}[|N(x)|] < c \log n_s$.

A.3 PROOF OF THEOREM 3.1

Our proof is mostly based on Chen & Chen (1995) and Kovachki et al. (2021). For notational simplicity, in the proof we directly write $H_\eta(a)$ as $H_\eta$, the linear operator.

Lemma A1. Let $X$ be a Banach space, $U \subseteq X$ a compact set, and $K \subset X$ a dense set. Then, for any $\epsilon > 0$, there exist a number $n \in \mathbb{N}$, a series of continuous linear functionals $G_1, G_2, \ldots, G_n \in C(U; \mathbb{R})$, and elements $\varphi_1, \ldots, \varphi_n \in K$, such that

$$\sup_{v \in U} \Big\|v - \sum_{j=1}^n G_j(v)\,\varphi_j\Big\|_X \le \epsilon.$$

The proof is given in Lemma 7 of Kovachki et al. (2021); see also their Theorems 3 and 4 for reference.

Theorem A2. Let $D \subseteq \mathbb{R}^d$ be a compact domain. Let $U$ be a separable Banach space of real-valued functions on $D$, such that $C(D; \mathbb{R}) \subseteq U$ is dense. Suppose $U = L^p(D; \mathbb{R})$ for some $1 < p < \infty$. Let $\nu$ be a probability measure supported on $U$, and assume that $\mathbb{E}_{v \sim \nu}\|v\|_U < \infty$. Let $\mu$ be a probability measure supported on $D$, which defines the inner product of the Hilbert space $U$ as $\langle f, g \rangle_U = \int_D f(x)\, g(x)\, d\mu(x)$. Then, there exists a neural network $h_\eta: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ whose activation functions are of the Tauber-Wiener class, such that $\|v - H(v)\|_U \le \epsilon$, where $H(v)(x) = \int_D h_\eta(x, y)\, v(y)\, d\mu(y)$.

Proof. Since $U$ is a Polish space, we can find a compact set $K$ such that $\nu(U \setminus K) \le \epsilon$. Therefore, Lemma A1 can be applied to find a number $n \in \mathbb{N}$, a series of continuous linear functionals $G_j \in C(U; \mathbb{R})$ and functions $\varphi_j \in C(D; \mathbb{R})$ such that $\sup_{v \in K}\|v - \sum_{j=1}^n G_j(v)\varphi_j\|_U \le \epsilon$. Denote $\hat{H}_n(v) = \sum_{j=1}^n G_j(v)\varphi_j$, and let $1 < q < \infty$ be the Hölder conjugate of $p$.
Since $U = L^p(D; \mathbb{R})$, by the Riesz Representation Theorem there exist functions $g_j \in L^q(D; \mathbb{R})$ such that $G_j(v) = \int_D v(x)\, g_j(x)\, d\mu(x)$ for $j = 1, \ldots, n$ and $v \in L^p(D; \mathbb{R})$. By the density of $C(D; \mathbb{R})$ in $L^q(D; \mathbb{R})$, we can find functions $\psi_1, \ldots, \psi_n \in C(D; \mathbb{R})$ such that $\sup_{j \in \{1,\ldots,n\}}\|\psi_j - g_j\|_{L^q(D;\mathbb{R})} \le \epsilon/n$. Then, we define $\bar{H}_n: L^p(D; \mathbb{R}) \to C(D; \mathbb{R})$ by $\bar{H}_n(v) = \sum_{j=1}^n \int_D \psi_j(y)\, v(y)\, d\mu(y)\, \varphi_j(x)$. By the universal approximation (density) property of neural networks (Hornik et al., 1989), we can find a multi-layer feed-forward network $h_\eta: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ whose activation functions are of the Tauber-Wiener class, such that $\sup_{x,y \in D}\big|h_\eta(x, y) - \sum_{j=1}^n \psi_j(y)\varphi_j(x)\big| \le \epsilon$. Let $H(v)(x) = \int_D h_\eta(x, y)\, v(y)\, d\mu(y)$. Then, there exists a constant $C_1 > 0$ such that

$$\|\hat{H}_n(v) - H(v)\|_{L^p(D;\mathbb{R})} \le C_1\big(\|\hat{H}_n(v) - \bar{H}_n(v)\|_{L^p(D;\mathbb{R})} + \|\bar{H}_n(v) - H(v)\|_{L^p(D;\mathbb{R})}\big).$$

For the first term, there is a constant $C_2 > 0$ such that

$$\|\hat{H}_n(v) - \bar{H}_n(v)\|_{L^p(D;\mathbb{R})} \le C_2 \sum_{j=1}^n \Big\|\int_D v(y)\big(g_j(y) - \psi_j(y)\big)\, d\mu(y)\, \varphi_j\Big\|_{L^p(D;\mathbb{R})} \le C_2 \sum_{j=1}^n \|v\|_{L^p(D;\mathbb{R})}\, \|g_j - \psi_j\|_{L^q(D;\mathbb{R})}\, \|\varphi_j\|_{L^p(D;\mathbb{R})} \le C_3\, \epsilon\, \|v\|_{L^p(D;\mathbb{R})},$$

for some $C_3 > 0$. For the second term,

$$\|\bar{H}_n(v) - H(v)\|_{L^p(D;\mathbb{R})} = \Big\|\int_D v(y)\Big(\sum_{j=1}^n \psi_j(y)\varphi_j(\cdot) - h_\eta(\cdot, y)\Big)\, d\mu(y)\Big\|_{L^p(D;\mathbb{R})} \le |D|\, \epsilon\, \|v\|_{L^p(D;\mathbb{R})}.$$

Therefore, there is a constant $C > 0$ such that $\int_U \|\hat{H}_n(v) - H(v)\|_U\, d\nu(v) \le C\, \epsilon\, \mathbb{E}_{v \sim \nu}\|v\|_U$. Because $\mathbb{E}_{v \sim \nu}\|v\|_U < \infty$ and $\epsilon$ is arbitrary, and since $\|v - H(v)\|_U \le \|v - \hat{H}_n(v)\|_U + \|\hat{H}_n(v) - H(v)\|_U$, the proof is complete.

Corollary A3. Define $H_\eta(v)(x) = \int_D h_\eta(x - y, x, a(y))\, v(y)\, d\mu(y)$. Then the interpolation operator can also approximate $v$ to any precision $\epsilon$.

Proof. We use a one-layer neural network $h_\eta: D \times D \to \mathbb{R}$ as an example, defined as $h_\eta(x, y, a(y)) = \sigma\big(\sum_{i=1}^d w_{x,i} x^{(i)} + w_{y,i} y^{(i)} + b\big)$. We can rewrite it as $h_\eta = \sigma\big(\sum_{i=1}^d w_{x,i}(x^{(i)} - y^{(i)}) + (w_{y,i} + w_{x,i}) y^{(i)} + \sum_{j=1}^{d_a} w_{a,j}\, a^{(j)}(y) + b\big)$, where $w_{a,j} = 0$.

Corollary A4.
Theorem A2 and Corollary A3 can be extended to v : D → R^{d_v} with d_v > 1.

Proof. As v = (v^{(1)}, v^{(2)}, ..., v^{(d_v)}), a single neural network can be used to approximate each v^{(j)}. Moreover, in implementation we make h_η fully connected, to improve the expressivity.

Remark. As Σ_{x_i∈X} v(x_i) h_η(x − x_i, x_i, a(x_i)) is an unbiased estimate of E_{y∼µ}[h_η(x, y) v(y)], we use Equation (8) for the approximation.
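As a concrete illustration of the remark, the discrete estimate restricted to an ε-ball neighborhood can be sketched in NumPy. This is a minimal sketch, not the paper's implementation: `kernel` is a hypothetical stand-in for the learned network h_η (here taking only the offsets x − x_i), and the averaging convention is an assumption.

```python
import numpy as np

def eps_ball_interpolate(x_query, X, V, kernel, eps):
    """Sketch of the interpolation layer: for each query point x, estimate
    E_{y~mu}[h_eta(x - y, ...) v(y)] by a weighted average over the eps-ball
    neighborhood N(x) = {x_i : d(x, x_i) <= eps}.  `kernel` stands in for
    the learned network h_eta (hypothetical signature: offsets -> weights)."""
    out = np.zeros(len(x_query))
    for k, x in enumerate(x_query):
        d = np.linalg.norm(X - x, axis=-1)           # distances d(x, x_i)
        nbr = d <= eps                               # eps-ball neighborhood N(x)
        if nbr.any():
            w = kernel(x - X[nbr])                   # weights from the kernel net
            out[k] = (w * V[nbr]).sum() / nbr.sum()  # Monte-Carlo estimate
    return out
```

With a constant kernel this reduces to local averaging; swapping in a learned MLP gives the adaptive kernel described in the remark.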

B EXPERIMENTS

B.1 BENCHMARK METHOD DESCRIPTION

Vision Mixers. We provide a unified framework for vision mixers as PDE solvers, including VIT, MLP-MIXER, FNET, GFN, FNO, PFNO and our NFS. The intermediate architecture of the mixing layers is shown in Fig. B1. The code of our framework will be released soon. The resampling and back-sampling methods are stacked before the 'Equispaced Input' and after the 'Equispaced Output'. In this way, each Vision Mixer in our framework can be described module by module, as shown in Table B1. All trials on Vision Mixers set the embedding size to 32, the batch size to 4, and the number of intermediate equispaced mixing layers to 2. In FNO and PFNO, the truncation K_max is set to 16. The patch sizes of Vision Mixers with patchwise embedding are set to [4, 2] in 1-d PDEs and [4, 4, 2] in 2-d PDEs. The interpolation layers in NFS are composed of one feed-forward layer whose hidden width equals 4× the embedding size of the model.

DeepONet Variants. Since vanilla DeepONet uses an MLP as the Branch Net, it cannot be applied to such high-resolution datasets: for a resolution like the NS trial (n_s = 4096, n_t = 10), DeepONet assigns each data point a weight parameter in a single MLP, so the MLP's parameter count reaches O((n_s n_t)^2) ≈ 40960^2 in a single Branch Net, which is infeasible in practice. In the original paper, the number of spatial points in the experiments is 40, far fewer than in recent Neural Operator evaluation protocols. One feasible alternative is to replace the original MLP with another architecture, such as a CNN or ViT, allowing DeepONet to handle high-resolution data. We therefore conduct further experiments on the three equations in the main text to evaluate DeepONet-U (using UNet as the Branch Net) and DeepONet-V (using ViT as the Branch Net) as two variants of vanilla DeepONet. Note that these DeepONet variants are all limited to equispaced data.
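The parameter blow-up can be checked with quick arithmetic; the sketch below just reproduces the O((n_s n_t)^2) count stated above for one square fully connected layer.

```python
# Back-of-the-envelope count for a vanilla DeepONet Branch Net on the NS trial:
# flattening n_s spatial points x n_t time steps gives the MLP input width,
# and a single square fully connected layer alone needs width^2 weights.
n_s, n_t = 4096, 10
width = n_s * n_t                   # 40960 flattened inputs
single_layer_params = width ** 2    # O((n_s * n_t)^2)
print(width, single_layer_params)   # 40960 1677721600  (~1.7e9 weights)
```

Roughly 1.7 billion weights in one layer, before counting the rest of the network, which is why the MLP Branch Net is infeasible at this resolution.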
Graph Spatio-Temporal Models. The evaluated graph spatio-temporal neural networks are based on recurrent neural networks for dynamics modeling, where the spatial dependency is modeled by graph neural networks. The spatial and temporal modules for AGCRN, DCRNN and GCGRU are shown in Table B2. MPPDE uses a different architecture, trained with the pushforward trick, with rolling equal to 1 and a time window equal to 10. All trials on these graph spatio-temporal models set the embedding size to 64, except MPPDE with 128. The batch size is set to 4. When the graph convolution needs multi-hop message passing, we set the hop number to 2. For MPPDE, the number of GNN layers is 6. The node-embedding dimension in AGCRN is set to 2.

B.2 DATA GENERATION

Burgers' Equation. The initial condition u_0(x) is generated according to u_0 ∼ N(0, 0.625(−∆ + 25I)^{−2}) with periodic boundary conditions. ν is set to 0.01, x ∈ [0, 1] and t ∈ [0, 1]. The spatial resolution is 1024, and the time resolution is 200. The dataset generation follows FNO's protocol; the data can be downloaded from the source code on FNO's official GitHub repository.
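The Gaussian-random-field sampling for u_0 can be sketched with the usual Fourier-space Karhunen–Loève construction for periodic boundary conditions. This is a sketch under assumptions: the exact FFT normalization convention of the dataset generator may differ, and the covariance constants simply follow the text above.

```python
import numpy as np

def sample_burgers_ic(n=1024, tau=25.0, scale=0.625, seed=0):
    """Draw u0 ~ N(0, scale * (-Laplacian + tau I)^(-2)) on [0, 1] with
    periodic BCs by sampling independent Fourier modes (a sketch; the
    normalization convention is an assumption)."""
    rng = np.random.default_rng(seed)
    k = np.fft.fftfreq(n, d=1.0 / n)                    # integer wavenumbers
    lam = scale * (4.0 * np.pi**2 * k**2 + tau) ** -2   # covariance eigenvalues
    xi = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    coeffs = n * np.sqrt(lam) * xi                      # FFT-convention scaling
    return np.fft.ifft(coeffs).real                     # real part => real field

u0 = sample_burgers_ic()
```

The eigenvalues of −∆ on the periodic unit interval are (2πk)^2, which gives the (4π^2 k^2 + τ)^{−2} spectrum above; taking the real part of the inverse FFT enforces a real-valued field.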

Table B2: Description of different graph spatio-temporal models

KdV Equation. The equation is written as

∂_t u(x, t) + 3∂_x u^2(x, t) + ∂_x^3 u(x, t) = 0,

where x ∈ [0, 1]. The initial condition is u(x, 0) = Σ_{i=1}^K 0.5 c_i cos(0.5 c_i + b_i x − a_i), where c_i ∼ N(0, σ_i) and a_i, b_i > 0. The spatial resolution is 1024. The dataset is generated with the scipy package, using fftpack.diff as the pseudo-differential method and odeint for time stepping.

Darcy Flow. The equation is written as

−∇·(a(x)∇u(x)) = f(x),  x ∈ (0, 1)^2;    u(x) = 0,  x ∈ ∂[0, 1]^2.

The original resolution is 256 × 256. a(x) is generated by a Gaussian random field, and we directly learn the operator mapping a to u.

NS Equation. Our generation of the NS equation is based on FNO's Appendix A.3.3, with the forcing kept fixed. The original spatial resolution is 128 × 128, and the time resolution is 200.
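The KdV generation pipeline described above can be sketched with the named scipy tools: periodic pseudo-spectral derivatives via fftpack.diff and time stepping via odeint. This is an illustrative sketch only; the resolution, time horizon, and single-mode initial condition are reduced stand-ins for the dataset's settings.

```python
import numpy as np
from scipy.fftpack import diff as psdiff
from scipy.integrate import odeint

def kdv_rhs(u, t, L=1.0):
    # d_t u = -3 d_x(u^2) - d_x^3 u, with periodic pseudo-spectral derivatives
    return -3.0 * psdiff(u**2, period=L) - psdiff(u, order=3, period=L)

n = 64                                       # reduced from 1024 for illustration
x = np.linspace(0.0, 1.0, n, endpoint=False)
u0 = 0.5 * np.cos(2.0 * np.pi * x)           # single-mode stand-in for the
                                             # random cosine-sum initial condition
t = np.linspace(0.0, 0.0005, 6)              # short horizon; the dispersive term
sol = odeint(kdv_rhs, u0, t, mxstep=20000)   # forces tiny steps at period 1
```

Note that the third-derivative term makes the system extremely stiff on a unit-period domain, which is why the sketch uses a short horizon and a generous `mxstep`.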

B.3 COMPLETE RESULTS ON MODEL COMPARISON

Here we give complete results on the four equations. Table B3 gives the performance comparison on Darcy Flow in both equispaced and non-equispaced scenarios. Tables B4 and B5 give the performance comparison in equispaced scenarios on the other three time-dependent problems. Tables B6 and B7 give the performance comparison in non-equispaced scenarios on the other three time-dependent problems. In all tasks except Darcy Flow, the depth is set to 2 and k_max = 16 in both NFS and FNO. However, we find that in Darcy Flow k_max must be set much larger, or the loss will not decrease; in the reported results, k_max = 32, 64, 128 in Darcy Flow.

Table B3: Performance comparison on Darcy Flow. NFS fails to model the non-equispaced Burgers' Equation when n_t is set to 1, where the performance is far from what it achieves in equispaced scenarios. This problem will be our future work.

Table B4: Performance comparison with Vision Mixer benchmarks on different equations (n_t = 1). The validation loss on Burgers' (n_t = 1) of VIT, GFN, and FNO does not converge; the results show that early stopping occurs at the beginning of training.

Table B6: Performance comparison with graph spatio-temporal benchmarks (n_t = 1).

Table B7: Performance comparison with graph spatio-temporal benchmarks (n_t = 10).

(The numerical entries of Tables B3–B7, reported as MAE (×10⁻³) and RMSE (×10⁻³), are omitted here.)

The mesh-invariant evaluation of NFS on the Burgers' and KdV equations is given in Table B8. In Table B8, when the spatial resolution is just 512, the inference performance on unseen meshes deteriorates. This result also validates our conclusion (2) in the third paragraph of Sec. 4. Besides, we give a full evaluation of the mesh-invariance of NFS on the NS equation, with its variants, as detailed results corresponding to Table B9.

B.6 NEIGHBORHOOD SIZE'S EFFECTS

The effects of mean neighborhood size on the predictive performance on Burgers' (n_s = 512, n_t = 10, n_t = 40) and KdV (n_s = 512, n_t = 10, n_t = 40) are shown in Fig. B5.

We first give Table B11 to show the time and memory complexity of all evaluated methods on NS (r = 64, n_t = 10, n_t = 40). It demonstrates that our method has efficiency comparable to the Vision Mixers. The graph spatio-temporal models suffer from their recurrent network structures and are thus extremely time-consuming, while their parameter number is small, limiting their flexibility.

Time. Once we compare the time used by PFNO and NFS, we find that the interpolation layers are considerably time-consuming. Another time-consuming module is the normalization layer: the original FNO does not include Layer-Norm in its architecture, but it is stacked in PFNO. Theoretically, PFNO handles down-sampled grids at a lower resolution because of its patchwise embedding; however, it takes more time than FNO. We therefore conclude that the time cost brought by Layer-Norm is significant, but affordable given the performance improvements.

Memory. Besides, the operations of searching each spatial point's neighborhood and calculating the weighted summation in Eq. (9) and Eq. (10) are very memory-consuming. We test this on the same experiment and give the memory usage of different models in the forward process in Table B12. The memory cost in the backward process is 6902 MB.

C EMPIRICAL OBSERVATION FOR THEOREM 3.1

In Sec. 3.2, Theorem 3.1 is proved to assure the expressivity of NFS. However, no further evidence assures the convergence of the kernel interpolation, so we conduct an empirical study to give some clues. We conduct experiments on the NS equation with n_s = 4096, n_t = 10, n_t = 40. In a single trial, NFS is trained with fixed meshes. We repeat the trials 10 times with different meshes, and then give the



PROBLEM STATEMENT

Let D ⊂ R^d be the bounded and open spatial domain, with an n_s-point discretization of the domain D written as X = {x_i = (x

Figure 2: Intuitive explanation of our motivation: In (a) and (b), ' ' represents Vision Mixers, ' ' represents graph spatio-temporal models and ' ' is the proposed NFS. In (c), 'Eq' and 'Neq' mean the methods are trained in equispaced and non-equispaced scenarios respectively.

Figure 3: The architecture of NFS: In non-equispaced interpolation (NEI) layers, the interpolation kernels are adaptively learned rather than predefined, and the interpolated equispaced signals are processed through a stack of FNO layers with the same structure as the Vision Mixers.

Figure 4: Visualization on the non-equispaced NS equation: The training mesh (n_s = 4096) differs from the mesh in the inference process (n_s = 8192). Appendix B.4 gives more visualizations.

Table 3: Results on the NS equation: MAE (×10⁻³) of NFS and different variants of NFS on seen and unseen meshes. 'Flex + LN' is the proposed NFS. 'Flex' represents the flexible interpolation layer defined in Eq. (8), 'LN' is the Layer-Norm, and 'Gaus' is the predefined Gaussian interpolation. Appendix B.5 gives details and results on other equations.

Figure 5: Effects of neighborhood sizes on NS (r = 64, n t = 10, n t = 40).

Model                     Spatial module                        Temporal module
GCGRU (Seo et al., 2016)  Cheb Conv (Defferrard et al., 2017)   GRU
DCRNN (Li et al., 2018)   Diff Conv (Atwood & Towsley, 2016)    GRU
AGCRN (Bai et al., 2020)  Node Similarity (Bai et al., 2020)    GRU

Figure B2: Visualization on equispaced Burgers' equation.

Figure B3: Visualization on equispaced KdV equation.

Figure B4: Visualization on the non-equispaced NS equation: The training mesh (n_s = 4096, upper-left) is different from the meshes in the inference process (n_s = 8192, upper-right; n_s = 12288, lower-left; n_s = 16384, lower-right).


Figure B5: The change of MAE and RMSE of NFS with increasing neighborhood size on Burgers' (n_s = 512, n_t = 10, n_t = 40) and KdV (n_s = 512, n_t = 10, n_t = 40). PFNO is the baseline.

B.7 INTERPOLATION WITH OTHER VISION MIXERS

We conduct experiments on non-equispaced NS equations with the combination of our interpolation layers and other Vision Mixers, to figure out whether they can achieve comparable performance.

one-vs.-all deviations of the representation states, calculated by

Diff = (1/n_t) || |H_i − H_j| / (|H_i| + |H_j|) ||_1,

where H_i is the representation state of shape [√m_s, √m_s, n_t], |·| is the element-wise absolute value, and ||·||_1 is the 1-norm of the matrix. If the Diff is small at the beginning and at the end, it can be inferred that the interpolation kernel function converges to a similar mapping, since the final predictions are close to the ground truth in these experiments and the inputs are sampled from the same instance of the PDEs. We give the Diff before the first FNO layer and after the final FNO layer in Table C1. The small values indicate that the trained models usually have similar representation states. Figures C1 and C2 give visualizations of the representation states obtained from one instance of the NS equation in two different trials. They indicate that the differences get smaller during training.
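The deviation measure above can be written out directly in NumPy. This is a sketch of a garbled formula: the 1/n_t averaging and the small epsilon guarding the division are assumptions.

```python
import numpy as np

def diff_metric(Hi, Hj, eps=1e-8):
    """Relative one-vs.-all deviation between two representation states of
    shape [sqrt(m_s), sqrt(m_s), n_t] (a sketch; the 1/n_t normalization
    and the eps guard against zero denominators are assumptions)."""
    n_t = Hi.shape[-1]
    rel = np.abs(Hi - Hj) / (np.abs(Hi) + np.abs(Hj) + eps)
    return rel.sum() / n_t   # entry-wise 1-norm, averaged over time steps
```

Identical representation states give Diff = 0, and the measure is bounded above since each entry of `rel` lies in [0, 1).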

Figure C1: Visualization on different representation states at the beginning of FNO layers.

Figure C2: Visualization on different representation states at the end of the FNO layers.

MAE (×10⁻³) comparison with graph spatio-temporal benchmarks.

(1) According to Table B10 and Table 1, the degeneration of performance is obvious in the other Vision Mixers. In comparison, FNO as the intermediate equispaced layers truncates the high frequencies in its channel mixing and retains the low frequencies shared by both the resampled and the original signals, so the loss of accuracy in non-equispaced scenarios is tiny in our NFS. (2) Although the performance on unseen meshes is more stable in these non-equispaced Vision Mixers, the performance gap is still large, according to Table B10 and Table 3.

a ∈ C(D; R^{d_a}) means that for x ∈ D, a(x) ∈ R^{d_a}. In time-dependent PDEs, a = u(·, T), where T = {t_i}_{i=1}^{n_t}.

Description of Vision Mixers in the unifying framework, module by module.


Performance comparison with Vision Mixer benchmarks on different equations (n t = 10).

Mesh-invariant performance of NFS on Burgers' and KdV equations (n t = 10).

Performance of NFS and its variants on the NS equation (n_t = 10) on unseen meshes.

Comparison of the complexity of the evaluated methods.

The defined Diff calculated at different epochs (columns: Epoch, Diff at the beginning, Diff at the end).

