FACTORIZED FOURIER NEURAL OPERATORS

Abstract

We propose the Factorized Fourier Neural Operator (F-FNO), a learning-based approach for simulating partial differential equations (PDEs). Starting from a recently proposed Fourier representation of flow fields, the F-FNO bridges the performance gap between pure machine learning approaches and the best numerical or hybrid solvers. This is achieved with new representations (separable spectral layers and improved residual connections) and a combination of training strategies such as the Markov assumption, Gaussian noise, and cosine learning rate decay. On several challenging benchmark PDEs on regular grids, structured meshes, and point clouds, the F-FNO can scale to deeper networks and outperform both the FNO and the geo-FNO, reducing the error by 83% on the Navier-Stokes problem, 31% on the elasticity problem, 57% on the airfoil flow problem, and 60% on the plastic forging problem. Compared to the state-of-the-art pseudo-spectral method, the F-FNO can take a step size that is an order of magnitude larger in time and achieve an order of magnitude speedup to produce the same solution quality.

1. INTRODUCTION

From modeling population dynamics to understanding the formation of stars, partial differential equations (PDEs) permeate the world of science and engineering. For most real-world problems, the lack of a closed-form solution requires using computationally expensive numerical solvers, sometimes consuming millions of core hours and terabytes of storage (Hosseini et al., 2016). Recently, machine learning methods have been proposed to replace part (Kochkov et al., 2021) or all (Li et al., 2021a) of a numerical solver. Of particular interest are Fourier Neural Operators (FNOs) (Li et al., 2021a), which are neural networks that can be trained end-to-end to learn a mapping between infinite-dimensional function spaces. The FNO can take a step size much bigger than is allowed in numerical methods, can perform super-resolution, and can be trained on many PDEs with the same underlying architecture. A more recent variant, dubbed geo-FNO (Li et al., 2022), can handle irregular geometries such as structured meshes and point clouds.

However, this first generation of neural operators suffers from stability issues. Lu et al. (2022) find that the performance of the FNO deteriorates significantly on complex geometries and noisy data. In our own experiments, we observe that both the FNO and the geo-FNO perform worse as we increase the network depth, eventually failing to converge at 24 layers. Even at 4 layers, the error between the FNO and a numerical solver remains large (14% error on the Kolmogorov flow).

In this paper, we propose the Factorized Fourier Neural Operator (F-FNO), which contains an improved representation layer for the operator and a better set of training approaches. By learning features in the Fourier space in each dimension independently, a process called Fourier factorization, we are able to reduce the model complexity by an order of magnitude and learn higher-dimensional problems such as the 3D plastic forging problem.
The F-FNO places residual connections after activation, enabling our neural operator to benefit from a deeply stacked network. Coupled with training techniques such as teacher forcing, enforcing the Markov constraints, adding Gaussian noise to inputs, and using a cosine learning rate scheduler, we are able to outperform the state of the art by a large margin on three different PDE systems and four different geometries. On the Navier-Stokes (Kolmogorov flow) simulations on the torus, the F-FNO reduces the error by 83% compared to the FNO, while still achieving an order of magnitude speedup over the state-of-the-art pseudo-spectral method (Figs. 3 and 4). On point clouds and structured meshes, the F-FNO outperforms the geo-FNO on both structural mechanics and fluid dynamics PDEs, reducing the error by up to 60% (Table 2).

We make three contributions:

1. We propose a new representation, the F-FNO, which consists of a separable Fourier representation and improved residual connections, reducing the model complexity and allowing it to scale to deeper networks (Fig. 2 and Eqs. (7) and (8)).
2. We show the importance of incorporating training techniques from the existing literature, such as the Markov assumption, Gaussian noise, and cosine learning rate decay (Fig. 3); and investigate how well the operator can handle different input representations (Fig. 5).
3. We demonstrate the F-FNO's strong performance in a variety of geometries and PDEs (Fig. 3 and Table 2).

Code, datasets, and pre-trained models are available.

2. RELATED WORK

Classical methods to solve PDE systems include finite element methods, finite difference methods, finite volume methods, and pseudo-spectral methods such as Crank-Nicolson and Carpenter-Kennedy. In these methods, space is discretized, and a more accurate simulation requires a finer discretization, which increases the computational cost. Traditionally, we would use simplified models for specific PDEs, such as Reynolds-averaged Navier-Stokes (Alfonsi, 2009) and large eddy simulation (Lesieur & Métais, 1996), to reduce this cost. More recently, machine learning has offered an alternative approach to accelerate the simulations. There are two main clusters of work: hybrid approaches and pure machine learning approaches. Hybrid approaches replace parts of traditional numerical solvers with learned alternatives but keep the components that impose physical constraints such as conservation laws, while pure machine learning approaches learn the time evolution of PDEs from data only.

Hybrid methods typically aim to speed up traditional numerical solvers by using lower-resolution grids (Bar-Sinai et al., 2019; Um et al., 2020; Kochkov et al., 2021), or by replacing computationally expensive parts of the solver with learned alternatives (Tompson et al., 2017; Obiols-Sales et al., 2020). Bar-Sinai et al. (2019) develop a data-driven method for discretizing PDE solutions, allowing coarser grids to be used without sacrificing detail. Kochkov et al. (2021) design a technique specifically for the Navier-Stokes equations that uses neural network-based interpolation to calculate velocities between grid points rather than using the more traditional polynomial interpolation. Their method leads to more accurate simulations while at the same time achieving an 86-fold speed improvement over Direct Numerical Simulation (DNS). Similarly, Tompson et al.
(2017) employ a numerical solver and a decomposition specific to the Navier-Stokes equations, but introduce a convolutional neural network to infer the pressure map at each time step. While these hybrid methods are effective when designed for specific equations, they are not easily adaptable to other PDE tasks.

An alternative approach, less specialized than most hybrid methods but also less general than pure machine learning methods, is learned correction (Um et al., 2020; Kochkov et al., 2021), which involves learning a residual term to the output of a numerical step. That is, the time derivative is now $u_t = u^*_t + LC(u^*_t)$, where $u^*_t$ is the velocity field provided by a standard numerical solver on a coarse grid, and $LC(u^*_t)$ is a neural network that plays the role of super-resolution of missing details.

Pure machine learning approaches eschew the numerical solver altogether and learn the field directly, i.e., $u_t = G(u_{t-1})$, where $G$ is dubbed a neural operator. The operator can include graph neural networks (Li et al., 2020a; b), low-rank decomposition (Kovachki et al., 2021b), or Fourier transforms (Li et al., 2021a; b). Pure machine learning models can also incorporate physical constraints, for example, by carefully designing loss functions based on conservation laws (Wandel et al., 2020). They can even be based on existing simulation methods, such as the operator designed by Wang et al. (2020) that uses learned filters in both Reynolds-averaged Navier-Stokes and Large Eddy Simulation before combining the predictions using U-Net. However, machine learning methods need not incorporate such constraints. For example, Kim et al. (2019) use a generative CNN model to represent velocity fields in a low-dimensional latent space and a feedforward neural network to advance the latent space to the next time point. Similarly, Bhattacharya et al.
(2020) use PCA to map from an infinite-dimensional input space into a latent space, on which a neural network operates before being transformed to the output space.

Our work is most closely related to the Fourier transform-based approaches (Li et al., 2021a; 2022) for flow simulation. In learning mappings between function spaces, the FNO outperforms graph-based neural operators and other finite-dimensional operators such as U-Net. In modeling chaotic systems, the FNO has been shown to capture invariant properties of chaotic systems (Li et al., 2021b). More generally, Kovachki et al. (2021a) prove that the FNO can approximate any continuous operator.

Figure 1: [...] 1 for details. On the torus datasets (a), the operator learns to evolve the vorticity over time. On Elasticity (b), the operator learns to predict the stress value on each point on a point cloud. On Airfoil (c), the operator learns to predict the flow velocity on each mesh point. On Plasticity (d), the operator learns the displacement of each mesh point given an initial boundary condition.

3. THE FACTORIZED FOURIER NEURAL OPERATOR

Solving PDEs with neural operators

An operator $G : \mathcal{A} \to \mathcal{U}$ is a mapping between two infinite-dimensional function spaces $\mathcal{A}$ and $\mathcal{U}$. Exactly what these function spaces represent depends on the problem. In general, solving a PDE involves finding a solution $u \in \mathcal{U}$ given some input parameter $a \in \mathcal{A}$, and we would train a neural operator to learn the mapping $a \to u$. Consider the vorticity formulation of the 2D Navier-Stokes equations,

$$\frac{\partial \omega}{\partial t} + u \cdot \nabla \omega = \nu \nabla^2 \omega + f, \qquad \nabla \cdot u = 0, \tag{1}$$

where $u$ is the velocity field, $\omega$ is the vorticity, $\nu$ is the viscosity, and $f$ is the external forcing function. These are the governing equations for the torus datasets (Fig. 1a). The neural operator would learn to evolve this field from one time step to the next: $\omega_t \to \omega_{t+1}$. Or consider the equation for a solid body in structural mechanics,

$$\rho \frac{\partial^2 u}{\partial t^2} + \nabla \cdot \sigma = 0, \tag{2}$$

where $\rho$ is the mass density, $u$ is the displacement vector, and $\sigma$ is the stress tensor. Elasticity (Fig. 1b) and Plasticity (Fig. 1d) are both governed by this equation. In Plasticity, we would learn to map the initial boundary condition $s_d : [0, L] \to \mathbb{R}$ to the grid position $x$ and displacement of each grid point over time: $s_d \to (x, u, t)$. In Elasticity, we are instead interested in predicting the stress value for each point: $x \to \sigma$.

Figure 2: The architecture of the Factorized Fourier Neural Operator (F-FNO) for a 2D problem. The iterative process (Eq. (4)) is shown at the top, in which the input function $a(i, j)$ is first deformed from an irregular space into a uniform space $a(x, y)$, and is then fed through a series of operator layers $\mathcal{L}$ in order to produce the output function $u(i, j)$. A zoomed-in operator layer (Eq. (7)) is shown at the bottom, which shows how we process each spatial dimension independently in the Fourier space, before merging them together again in the physical space.

Finally, consider the Euler equations to model the airflow around an aircraft wing (Fig. 1c):

$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho u) = 0, \qquad \frac{\partial (\rho u)}{\partial t} + \nabla \cdot (\rho u \otimes u + p I) = 0, \qquad \frac{\partial E}{\partial t} + \nabla \cdot ((E + p) u) = 0, \tag{3}$$
where $\rho$ is the fluid mass density, $p$ is the pressure, $u$ is the velocity vector, and $E$ is the energy. Here the operator would learn to map each grid point to the velocity field at equilibrium: $x \to u$.

Original FNO and geo-FNO architectures

Motivated by the kernel formulation of the solution to linear PDEs using Green's functions, Li et al. (2020b; 2022) propose an iterative approach to map input function $a$ to output function $u$,

$$u = G(a) = \left(\phi \circ Q \circ \mathcal{L}^{(L)} \circ \cdots \circ \mathcal{L}^{(1)} \circ P \circ \phi^{-1}\right)(a), \tag{4}$$

where $\circ$ indicates function composition, $L$ is the number of layers/iterations, $P$ is the lifting operator that maps the input to the first latent representation $z^{(0)}$, $\mathcal{L}^{(\ell)}$ is the $\ell$'th non-linear operator layer, and $Q$ is the projection operator that maps the last latent representation $z^{(L)}$ to the output. On irregular geometries such as point clouds, we additionally define a coordinate map $\phi$, parameterized by a small neural network and learned end-to-end, that deforms the physical space of irregular geometry into a regular computational space. The architecture without this coordinate map is called FNO, while the one with the coordinate map is called geo-FNO. Fig. 2 (top) contains a schematic diagram of this iterative process.

Originally, Li et al. (2021a) formulate each operator layer as

$$\mathcal{L}^{(\ell)}\left(z^{(\ell)}\right) = \sigma\left(W^{(\ell)} z^{(\ell)} + b^{(\ell)} + \mathcal{K}^{(\ell)}\left(z^{(\ell)}\right)\right), \tag{5}$$

where $\sigma : \mathbb{R} \to \mathbb{R}$ is a point-wise non-linear activation function, $W^{(\ell)} z^{(\ell)} + b^{(\ell)}$ is an affine point-wise map in the physical space, and $\mathcal{K}^{(\ell)}$ is a kernel integral operator using the Fourier transform,

$$\mathcal{K}^{(\ell)}\left(z^{(\ell)}\right) = \mathrm{IFFT}\left(R^{(\ell)} \cdot \mathrm{FFT}\left(z^{(\ell)}\right)\right). \tag{6}$$

The Fourier-domain weight matrices $\{R^{(\ell)} \mid \ell \in \{1, 2, \ldots, L\}\}$ take up most of the model size, requiring $O(L H^2 M^D)$ parameters, where $H$ is the hidden size, $M$ is the number of top Fourier modes being kept, and $D$ is the problem dimension. Furthermore, the constant value for $M$ and the affine point-wise map allow the FNO to be resolution-independent.
Our improved F-FNO architecture

We propose changing the operator layer in Eq. (5) to:

$$\mathcal{L}^{(\ell)}\left(z^{(\ell)}\right) = z^{(\ell)} + \sigma\left[W_2^{(\ell)} \sigma\left(W_1^{(\ell)} \mathcal{K}^{(\ell)}\left(z^{(\ell)}\right) + b_1^{(\ell)}\right) + b_2^{(\ell)}\right]. \tag{7}$$

Note that we apply the residual connection (the $z^{(\ell)}$ term) after the non-linearity to preserve more of the layer input. We also use a two-layer feedforward, inspired by the feedforward design used in transformers (Vaswani et al., 2017). More importantly, we factorize the Fourier transforms over the problem dimensions, modifying Eq. (6) to

$$\mathcal{K}^{(\ell)}\left(z^{(\ell)}\right) = \sum_{d \in D} \left[\mathrm{IFFT}\left(R_d^{(\ell)} \cdot \mathrm{FFT}_d\left(z^{(\ell)}\right)\right)\right]. \tag{8}$$

The seemingly small change from $R^{(\ell)}$ to $R_d^{(\ell)}$ in the Fourier operator reduces the number of parameters to $O(L H^2 M D)$. This is particularly useful when solving higher-dimensional problems such as 3D plastic forging (Fig. 1d). The combination of the factorized transforms and residual connections allows the operator to converge in deep networks while continuing to improve performance (Fig. 3). It is also possible to share the weight matrices $R_d$ between the layers, which further reduces the parameters to $O(H^2 M D)$. Fig. 2 (bottom) provides an overview of an F-FNO operator layer.

Furthermore, the F-FNO is highly flexible in its input representation, which means anything that is relevant to the evolution of the field can be an input, such as viscosity or external forcing functions for the torus. This flexibility also allows the F-FNO to be easily generalized to different PDEs.

Training techniques to learn neural operators

We find that a combination of deep learning techniques is very important for the FNO to perform well, most of which were overlooked in Li et al. (2021a)'s original implementation. The first is enforcing the first-order Markov property. We find Li et al. (2021a)'s use of the last 10 time steps as inputs to the neural operator to be unnecessary. Instead, it is sufficient to feed information only from the current step, just like a numerical solver.
Unlike prior works (Li et al., 2021a; Kochkov et al., 2021) , we do not unroll the model during training but instead use the teacher forcing technique which is often seen in time series and language modeling. In teacher forcing, we use the ground truth as the input to the neural operator. Finally during training, we find it useful to normalize the inputs and add a small amount of Gaussian noise, similar to how Sanchez-Gonzalez et al. (2020) train their graph networks. Coupled with cosine learning rate decay, we are able to make the training process of neural operators more stable. Ablation studies for the new representation and training techniques can be found in Fig. 3 .
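To make the factorization in Eq. (8) concrete, the following is a minimal sketch in plain Python, not the released PyTorch implementation. It assumes a single channel, so the $H \times H$ weight matrix per mode collapses to one scalar per mode, and it uses a naive DFT for readability. The function names (`spectral_1d`, `factorized_kernel_2d`) are hypothetical.

```python
import cmath
import math


def dft(xs):
    """Naive O(n^2) discrete Fourier transform of a 1D sequence."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * math.pi * k * j / n) for j, x in enumerate(xs))
            for k in range(n)]


def idft(coeffs):
    """Naive inverse DFT; returns complex samples."""
    n = len(coeffs)
    return [sum(c * cmath.exp(2j * math.pi * k * j / n) for k, c in enumerate(coeffs)) / n
            for j in range(n)]


def spectral_1d(signal, weights):
    """Scale the lowest len(weights) modes by `weights`, drop the rest, transform back."""
    n = len(signal)
    if 2 * (len(weights) - 1) >= n:
        raise ValueError("too many modes for this signal length")
    coeffs = dft(signal)
    kept = [0j] * n
    for k, w in enumerate(weights):
        kept[k] = w * coeffs[k]
        if k > 0:
            # Mirror onto the negative frequency so the output stays real.
            kept[n - k] = kept[k].conjugate()
    return [v.real for v in idft(kept)]


def factorized_kernel_2d(z, weights_x, weights_y):
    """Single-channel Eq. (8): sum of independent 1D spectral maps along x and y."""
    nx, ny = len(z), len(z[0])
    out = [[0.0] * ny for _ in range(nx)]
    for j in range(ny):  # transform along x, one column at a time
        col = spectral_1d([z[i][j] for i in range(nx)], weights_x)
        for i in range(nx):
            out[i][j] += col[i]
    for i in range(nx):  # transform along y, one row at a time
        row = spectral_1d(z[i], weights_y)
        for j in range(ny):
            out[i][j] += row[j]
    return out


# A field that only contains low-frequency content in each direction.
n, modes = 8, 3
z = [[math.cos(2 * math.pi * i / n) + math.sin(2 * math.pi * j / n) for j in range(n)]
     for i in range(n)]
identity = [1.0] * modes
out = factorized_kernel_2d(z, identity, identity)
```

With identity weights and a field whose content lies entirely in the kept modes, each 1D pass reproduces the input, so the sum over the two dimensions returns twice the input. In the full model the scalar weights would be replaced by learned complex $H \times H$ matrices per mode and per dimension.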

4. DATASETS AND EVALUATION SETTINGS

PDEs on regular grids

The four Torus datasets on regular grids (TorusLi, TorusKochkov, TorusVis, and TorusVisForce, summarized in Table 1) are simulations based on Kolmogorov flows, which have been extensively studied in the literature (Chandler & Kerswell, 2013). In particular, they model turbulent flows on the surface of a 3D torus (i.e., a 2D grid with periodic boundary conditions). TorusLi is publicly released by Li et al. (2021a) and is used to benchmark our model against the original FNO. The ground truths are assumed to be simulations generated by the pseudo-spectral Crank-Nicolson second-order method on 64x64 grids. All trajectories have a constant viscosity $\nu = 10^{-5}$ (Re = 2000), use the same constant forcing function, $f(x, y) = 0.1[\sin(2\pi(x + y)) + \cos(2\pi(x + y))]$, and differ only in the initial field.

Using the same Crank-Nicolson numerical solver, we generate two further datasets, called TorusVis and TorusVisForce, to test the generalization of the F-FNO across Navier-Stokes tasks with different viscosities and forcing functions. In particular, for each trajectory, we vary the viscosity between $10^{-4}$ and $10^{-5}$, and set the forcing function to

$$f(t, x, y) = 0.1 \sum_{p=1}^{2} \sum_{i=0}^{1} \sum_{j=0}^{1} \left[\alpha_{pij} \sin\left(2\pi p (ix + jy) + \delta t\right) + \beta_{pij} \cos\left(2\pi p (ix + jy) + \delta t\right)\right], \tag{9}$$

where the amplitudes $\alpha_{pij}$ and $\beta_{pij}$ are sampled from the standard uniform distribution. Furthermore, $\delta$ is set to 0 in TorusVis, making the forcing function constant across time, while it is set to 0.2 in TorusVisForce, giving us a time-varying force.

Finally, we regenerate TorusKochkov (Fig. 1a) using the same settings provided by Kochkov et al. (2021) but with different initial conditions from the original paper (since the authors did not release the full dataset). Here the ground truths are obtained from simulations on 2048x2048 grids using the pseudo-spectral Carpenter-Kennedy fourth-order method.
The full-scale simulations are then downsampled to smaller grid sizes, allowing us to study the Pareto frontier of the speed vs accuracy space (see Fig. 4a). TorusKochkov uses a fixed viscosity of 0.001 and a constant forcing function $f = 4\cos(4y)\hat{\mathbf{x}} - 0.1u$, but on the bigger domain of $[0, 2\pi]$. Furthermore, we generate only 32 training trajectories to test how well the F-FNO can learn in a low-data regime.

PDEs on irregular geometries

The Elasticity, Airfoil, and Plasticity datasets (final three rows in Table 1) are taken from Li et al. (2022). Elasticity is a point cloud dataset modeling the incompressible Rivlin-Saunders material (Pascon, 2019). Each sample is a unit cell with a void in the center of arbitrary shape (Fig. 1b). The task is to map each cloud point to its stress value. Airfoil models the transonic flow over an airfoil, shown as the white center in Fig. 1c. The neural operator would then learn to map each mesh location to its Mach number. Finally, Plasticity models the plastic forging problem, in which a die, parameterized by an arbitrary function and traveling at a constant speed, hits a block material from above (Fig. 1d). Here the task is to map the shape of the die to the 101 × 31 structured mesh over 20 time steps. Note that Plasticity expects a 3D output, with two spatial dimensions and one time dimension.

Training details

For experiments involving the original FNO, FNO-TF (with teacher forcing), FNO-M (with the Markov assumption), and FNO-R (with improved residuals), we use the same training procedure as Li et al. (2021a). For our own models, we train for 100,000 steps on the regular grid datasets and for 200 epochs on the irregular geometry datasets, warming up the learning rate to $2.5 \times 10^{-3}$ for the first 500 steps and then decaying it using the cosine function (Loshchilov & Hutter, 2017).
We use ReLU as our non-linear activation function, clip the gradient value at 0.1, and use the Adam optimizer (Kingma & Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. The weight decay factor is set to $10^{-4}$ and is decoupled from the learning rate (Loshchilov & Hutter, 2019). In each operator layer on the torus datasets, we always throw away half of the Fourier modes (e.g., on a 64x64 grid, we keep only the top 16 modes). Models are implemented in PyTorch (Paszke et al., 2017) and trained on a single Titan V GPU.
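The learning rate schedule and gradient clipping described above can be sketched as follows. This is an illustration only: it assumes the warmup is linear, that the cosine decay ends at zero exactly at the final training step, and that clipping is applied element-wise. The function names are hypothetical and are not taken from the released code.

```python
import math


def learning_rate(step, total_steps=100_000, warmup_steps=500, peak_lr=2.5e-3):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))


def clip_gradient_values(grads, limit=0.1):
    """Clamp every gradient entry to the interval [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


# Sample the schedule at the start, the end of warmup, the midpoint of decay, and the end.
schedule = [learning_rate(s) for s in (0, 500, 50_250, 100_000)]
clipped = clip_gradient_values([0.5, -0.3, 0.05])
```

In a PyTorch training loop the same shape could be obtained with a scheduler wrapped around the optimizer; the standalone function here only makes the arithmetic explicit.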

Evaluation metrics

We use the normalized mean squared error as the loss function, defined as

$$\text{N-MSE} = \frac{1}{B} \sum_{i=1}^{B} \frac{\|\hat{\omega}_i - \omega_i\|_2}{\|\omega_i\|_2},$$

where $\|\cdot\|_2$ is the 2-norm, $B$ is the batch size, and $\hat{\omega}_i$ is the prediction of the ground truth $\omega_i$. In addition to comparing the N-MSE directly, for TorusKochkov, we also compute the vorticity correlation, defined as

$$\rho(\omega, \hat{\omega}) = \sum_i \sum_j \frac{\omega_{ij}}{\|\omega\|_2} \frac{\hat{\omega}_{ij}}{\|\hat{\omega}\|_2},$$

and from which we measure the time until this correlation drops below 95%. To be consistent with prior work, we use the N-MSE to compare the F-FNO against the FNO and geo-FNO (Li et al., 2021a; 2022), and the vorticity correlation to compare against Kochkov et al. (2021)'s work.
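As a small worked example, the two metrics can be computed on nested lists as below. This is a sketch that treats each field as a 2D list of floats and takes the 2-norm over all grid points; the function names are hypothetical.

```python
import math


def l2_norm(field):
    """2-norm over all grid points of a 2D field."""
    return math.sqrt(sum(v * v for row in field for v in row))


def n_mse(predictions, targets):
    """Batch mean of ||prediction - target||_2 / ||target||_2."""
    total = 0.0
    for pred, target in zip(predictions, targets):
        diff = [[p - t for p, t in zip(pr, tr)] for pr, tr in zip(pred, target)]
        total += l2_norm(diff) / l2_norm(target)
    return total / len(predictions)


def vorticity_correlation(pred, target):
    """Inner product of the two fields after scaling each to unit 2-norm."""
    dot = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return dot / (l2_norm(pred) * l2_norm(target))


truth = [[3.0, 4.0]]
doubled = [[6.0, 8.0]]      # same pattern, twice the amplitude
orthogonal = [[4.0, -3.0]]  # no overlap with the truth

error_doubled = n_mse([doubled], [truth])
corr_doubled = vorticity_correlation(doubled, truth)
corr_orthogonal = vorticity_correlation(orthogonal, truth)
```

The example illustrates why the two metrics are complementary: a prediction with the right pattern but the wrong amplitude has a perfect correlation yet a large N-MSE.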

5. RESULTS FOR NAVIER-STOKES ON A TORUS

Comparison against FNO

The performance on TorusLi is plotted in Fig. 3, with the raw numbers shown in Table A.3. We note that our method F-FNO is substantially more accurate than the FNO regardless of network depth, when judged by N-MSE. The F-FNO uses fewer parameters than the FNO, has a similar training time, but generally has a longer inference time. Even so, the inference time for the F-FNO is still up to two orders of magnitude shorter than for the Crank-Nicolson numerical solver.

Figure 4: Performance of F-FNO on TorusKochkov. In (a), we plot the time until the correlation with the ground truths in the test set drops below 95% on the y-axis, against the time it takes to run one second of simulation on the x-axis. In (b), we show how, on the validation set of TorusKochkov, given a fixed spatial resolution of 64x64, changing the step size has no effect on the numerical solver; however, there is an optimal step size for the F-FNO at around 0.2.

In contrast to our method, Li et al. (2021a) do not use teacher forcing during training. Instead they use the previous 10 steps as input to predict the next 10 steps incrementally (by using each predicted value as the input to the next step). We find that the teacher forcing strategy (FNO-TF, orange line), in which we always use the ground truth from the previous time step as input during training, leads to a smaller N-MSE when the number of layers is less than 24. Furthermore, enforcing the first-order Markov property (FNO-M, dotted green line), where only one step of history is used, further improves the performance over FNO and FNO-TF. Including two or more steps of history does not improve the results. The models FNO, FNO-TF, and FNO-M do not scale with network depth, as seen by the increase in the N-MSE with network depth. These models even diverge during training when 24 layers are used.
FNO-R, with the residual connections placed after the non-linearity, does not suffer from this problem and can finally converge at 24 layers. FNO++ further improves the performance, as a result of a careful combination of normalizing the inputs, adding Gaussian noise to the training inputs, and using cosine learning rate decay. In particular, we find that adding a small amount of Gaussian noise to the normalized inputs helps to stabilize training. Without the noise, the validation loss at the early stage of training can explode. Finally, if we use Fourier factorization (F-FNO, yellow dashed line), the error drops by an additional 35% (3.73% → 2.41%) at 24 layers (Fig. 3b), while the parameter count is reduced by an order of magnitude. Sharing the weights in the Fourier domain (F-FNO-WS, red line) makes little difference to the performance, especially at deep layers, but it does reduce the parameter count by another order of magnitude to 1M (see Fig. A.1 and Table A.3).
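The training setup compared above (one step of history, ground-truth inputs, and noisy normalized inputs) can be sketched as a routine that builds one-step training pairs from a trajectory. This is an illustration under stated assumptions: each state is a flat list of floats, normalization uses a given mean and standard deviation, targets are left unnormalized, and the noise level `noise_std=0.01` is a placeholder rather than a value from the experiments.

```python
import random


def normalize(state, mean, std):
    """Shift and scale every entry of a state."""
    return [(v - mean) / std for v in state]


def teacher_forcing_pairs(trajectory, mean, std, noise_std=0.01, rng=None):
    """Build one-step (input, target) pairs from a ground-truth trajectory.

    Markov assumption: the input uses only the state at step t.
    Teacher forcing: the input is always ground truth, never a model prediction.
    """
    rng = rng or random.Random(0)
    pairs = []
    for t in range(len(trajectory) - 1):
        clean = normalize(trajectory[t], mean, std)
        noisy = [v + rng.gauss(0.0, noise_std) for v in clean]
        pairs.append((noisy, trajectory[t + 1]))
    return pairs


# A toy trajectory with 5 time steps and 3 grid values per step.
trajectory = [[float(t + k) for k in range(3)] for t in range(5)]
pairs = teacher_forcing_pairs(trajectory, mean=2.0, std=2.0)
noiseless = teacher_forcing_pairs(trajectory, mean=2.0, std=2.0, noise_std=0.0)
```

Because every pair depends only on ground-truth data, the pairs can be shuffled and batched freely, in contrast to unrolled training where each input depends on the previous prediction.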

Trade-off between speed and accuracy

From Fig. 4a, we observe that our method F-FNO only needs 64x64 input grids to reach a similar performance to a 128x128 grid solved with DNS. At the same time, the F-FNO also achieves an order of magnitude speedup. While the highly specialized hybrid method introduced by Kochkov et al. (2021) can achieve a speedup closer to two orders of magnitude over DNS, the F-FNO takes a much more flexible approach and thus can be more easily adapted to other PDEs and geometries.

The improved accuracy of the F-FNO over DNS when both methods are using the same spatial resolution can be seen graphically in Fig. A.4. In this example, the F-FNO on a 128x128 grid produces a vorticity field that is visually closer to the ground truth than DNS running on the same grid size. This is also supported by comparing the time until correlation falls below 95% in Fig. 4a.

Optimal step size

The F-FNO includes a step size parameter which specifies how many seconds of simulation time one application of the operator will advance. A large step size sees the model predicting far into the future, possibly leading to large errors, while a small step size means small errors have many more steps to compound. We thus try different step sizes in Fig. 4b.

In numerical solvers, there is a close relationship between the step size in the time dimension and the spatial resolution. Specifically, the Courant-Friedrichs-Lewy (CFL) condition provides an optimal step size given a space discretization: $\Delta t = C_{\max} \Delta x / \|u\|_{\max}$. This means that if we double the grid size, the solver should take a step that is twice as small (we follow this approach to obtain DNS's Pareto frontier in Fig. 4a). Furthermore, having step sizes smaller than what is specified by the CFL condition would not provide any further benefit unless we also reduce the distance between two grid points (see purple line in Fig. 4b).
On the other hand, a step size that is too big (e.g., bigger than 0.05 on a 64x64 grid) will lead to stability issues in the numerical solver. For the F-FNO, we find that we can take a step size that is at least an order of magnitude bigger than the stable step size for a numerical solver. This is the key contributor to the efficiency of neural methods. Furthermore, there is a sweet spot for the step size (around 0.2 on TorusKochkov), and unlike its numerical counterpart, we find that there is no need to reduce the step size as we train the F-FNO on a higher spatial resolution.
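The relationship between grid size and step size under the CFL condition can be illustrated with a short calculation. The Courant number `courant_max=0.5` and the maximum speed of 7.0 used below are placeholder values chosen for illustration, not settings reported in the experiments, and `cfl_step` is a hypothetical helper.

```python
import math


def cfl_step(domain_length, grid_points, max_speed, courant_max=0.5):
    """Step size from the CFL condition: dt = C_max * dx / ||u||_max."""
    dx = domain_length / grid_points
    return courant_max * dx / max_speed


# Doubling the number of grid points per side halves the grid spacing and the step.
dt_64 = cfl_step(2 * math.pi, 64, max_speed=7.0)
dt_128 = cfl_step(2 * math.pi, 128, max_speed=7.0)
ratio = dt_64 / dt_128
```

A learned operator is not bound by this relation, which is why its step size can be kept fixed as the resolution grows.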

Flexible input representations

The F-FNO can be trained to handle Navier-Stokes equations with viscosities (in TorusVis) and time-varying forcing functions (in TorusVisForce) provided at inference time. Our model, when given both the force and viscosity, in addition to the vorticity, is able to achieve an error of 2% (Fig. 5a). If we remove the viscosity information, the error doubles. Removing the forcing function from the input further increases the error by an order of magnitude. This shows that the force has a substantial impact on the future vorticity field, and that the F-FNO can use information about the forcing function to make accurate predictions. More generally, different datasets benefit from having different input features; Table A.7 shows the minimum set of features to reach optimal performance on each of them. We also find that having redundant features does not significantly hurt the model, so there is no need to do aggressive feature pruning in practice.

Table 2: Performance (N-MSE, expressed as percentage, where lower is better) on point clouds (Elasticity) and structured meshes (Airfoil and Plasticity) between our F-FNO and the previous state-of-the-art geo-FNO (Li et al., 2022). Cells with a dash correspond to models which do not converge. The N-MSE is accompanied by the standard deviation from three trials. More detailed results are shown in Tables A.4 to A.6.

Our experiments with different input representations also reveal an interesting performance gain from the double encoding of information (Fig. 5b). All datasets benefit from the coordinate encoding, i.e., having the (x, y) coordinates as two additional input channels, even if the positional information is already contained in the absolute position of grid points (indices of the input array). We hypothesize that these two positional representations are used by different parts of the F-FNO.
The Fourier transform uses the absolute position of the grid points and thus the Fourier layer should have no need for the (x, y) positional features. However, the feedforward layer in the physical space is a pointwise operator and thus needs to rely on the raw coordinate values, since it would otherwise be independent of the absolute position of grid points.
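The input representations discussed in this section amount to stacking extra channels onto each grid point. The sketch below shows one way to do this in plain Python. The channel order, the channels-last layout, and the normalization of coordinates to [0, 1) are assumptions made for illustration, and `build_inputs` is a hypothetical helper.

```python
def build_inputs(vorticity, viscosity=None, forcing=None, with_coords=True):
    """Stack per-point features as [vorticity, x, y, viscosity, forcing] (channels last)."""
    nx, ny = len(vorticity), len(vorticity[0])
    grid = []
    for i in range(nx):
        row = []
        for j in range(ny):
            features = [vorticity[i][j]]
            if with_coords:
                # Explicit coordinates, so point-wise layers can see the position.
                features += [i / nx, j / ny]
            if viscosity is not None:
                # A scalar parameter is broadcast to every grid point.
                features.append(viscosity)
            if forcing is not None:
                features.append(forcing[i][j])
            row.append(features)
        grid.append(row)
    return grid


vorticity = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
forcing = [[1.0 for _ in range(4)] for _ in range(4)]
full = build_inputs(vorticity, viscosity=1e-4, forcing=forcing)
minimal = build_inputs(vorticity, with_coords=False)
```

The lifting operator would then map each per-point feature vector to the hidden size, regardless of how many channels were stacked.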

6. RESULTS FOR PDES ON POINT CLOUDS AND MESHES

As shown in Table 2, the geo-FNO (Li et al., 2022), similar to the original FNO, also scales poorly with network depth. It appears to be stuck in a local minimum beyond 8 layers in the Elasticity problem, and it completely fails to converge beyond 12 layers in Airfoil and Plasticity. Plasticity is the only task in which the geo-FNO gets better as we go from 4 to 12 layers (0.74% → 0.45%). In addition to the poor scaling with network depth, we also find during our experiments that the geo-FNO can perform worse as we increase the hidden size H. This indicates that there might not be enough regularization in the model as we increase the model complexity.

Our F-FNO, on the other hand, continues to gain performance with deeper networks and bigger hidden size, reducing the prediction error by 31% on the Elasticity point clouds (2.51% → 1.74%) and by 57% on the 2D transonic flow over airfoil problem (1.35% → 0.58%). Our Fourier factorization particularly shines in the plastic forging problem, in which the neural operator needs to output a 3D array, i.e., the displacement of each point on a 2D mesh over 20 time steps. As shown in Table A.6, our 24-layer F-FNO with 11M parameters outperforms the 12-layer geo-FNO with 57M parameters by 60% (0.45% → 0.18%).
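To see why factorization matters most for 3D outputs, the scaling of the Fourier-domain weights can be evaluated directly. The sketch below counts weight entries according to the stated orders $L H^2 M^D$ (unfactorized), $L H^2 M D$ (factorized), and $H^2 M D$ (factorized with weight sharing). The values H = 64, M = 16, L = 24 are hypothetical and the counts ignore the feedforward and lifting/projection weights, so they do not reproduce the parameter totals reported in the tables.

```python
def spectral_weight_count(layers, hidden, modes, dims, factorized, shared=False):
    """Number of Fourier-domain weight entries under the stated scaling laws."""
    per_layer = hidden ** 2 * (modes * dims if factorized else modes ** dims)
    return per_layer if shared else layers * per_layer


# Hypothetical configuration for a problem with three dimensions.
config = dict(layers=24, hidden=64, modes=16, dims=3)
dense = spectral_weight_count(factorized=False, **config)
factorized = spectral_weight_count(factorized=True, **config)
shared = spectral_weight_count(factorized=True, shared=True, **config)
reduction = dense / factorized
```

For this configuration the reduction factor equals $M^{D-1} / D$, which grows quickly with the number of dimensions.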

7. CONCLUSION

The Fourier transform is a powerful tool to learn neural operators that can handle long-range spatial dependencies. By factorizing the transform, using better residual connections, and improving the training setup, our proposed F-FNO outperforms the state of the art on PDEs on a variety of geometries and domains. For future work, we are interested in examining equilibrium properties of generalized Fourier operators with an infinite number of layers and checking if the universal approximation property (Kovachki et al., 2021a) still holds under Fourier factorization.

A. APPENDIX

Table A.1: An overview of the four fluid dynamics datasets on regular grids. Our newly generated datasets, TorusVis and TorusVisForce, contain simulation data with a wider variety of viscosities and forces than TorusLi (Li et al., 2021a) and TorusKochkov (Kochkov et al., 2021).

[...] inference time (the time it takes to run one second of simulation). Error bars, when applicable, show the min and max values over three trials. In (b), as we move along a line, we increase the number of layers. We observe that only our model variants (F-FNO) have the desired slope, that is, as we use more resources (increasing the inference time), we obtain better predictions.

(a) Zero-shot super-resolution performance of F-FNO. We train the model on 32x32 and 64x64 grids of TorusKochkov, and evaluate on the larger 128x128 and 256x256 grids. We observe some degradation in the correlation with the ground truths on unseen grid sizes.

Zero-shot super-resolution

In Fig. A.2a, we train the F-FNO once on 32x32 and 64x64 grids from TorusKochkov, and then perform inference and evaluation on 128x128 and 256x256 grids. This extends the super-resolution setting presented by Li et al. (2021a), as they only worked on simple PDEs such as the 1D Burgers' equation and the 2D Darcy flow.
We find that although the F-FNO can do zero-shot super-resolution (unlike a traditional CNN, which by design cannot even accept inputs of variable size), its performance does degrade on grid sizes not seen during training. This is seen by the lower vorticity correlation of the super-resolution F-FNO settings in Fig. A.2a. We posit that the super-resolution performance could be improved by training on a variety of grid sizes (e.g., by downsampling each training example to a random size). We leave such exploration for future work.

Capturing the energy spectrum

In addition to having a high vorticity correlation, a good model should also produce predictions with an energy spectrum similar to the most accurate numerical methods. Given the Fourier transform of a velocity field $\hat{u} = \mathrm{FFT}(u)$, we can compute, for each wavenumber $k$, the corresponding kinetic energy as $E(k) = \frac{1}{2} \|\hat{u}_k\|^2$. Fig. A.2b shows the energy spectrum of both the F-FNO and DNS at different resolutions. These multiple DNS resolutions are included both as a reference solution in the case of DNS 2048x2048, and to demonstrate that increasing the resolution of DNS further is not likely to substantially change the energy spectrum.

We observe that compared to DNS on 2048x2048, the F-FNO trained on 64x64 grids produces an energy spectrum that has substantially lower energy at high wavenumbers. This is expected, as at this spatial resolution we only select the top 16 modes in each Fourier layer. Even so, the F-FNO can still capture the long-term trend much better than running DNS on a grid four times its size (see Fig. 4a). As we select more Fourier modes on bigger grids (top 32 modes on 128x128 grids and top 64 modes on 256x256 grids), the energy spectrum produced converges towards that of the reference solution (DNS on 2048x2048). This gives some indication that the F-FNO is able to accurately predict both high and low frequency details.
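The energy per wavenumber, $E(k) = \frac{1}{2}\|\hat{u}_k\|^2$, can be illustrated on a 1D signal. The sketch below is a simplification: it uses a 1D velocity sample rather than a 2D field (where energies would additionally be binned by wavenumber magnitude), a naive unnormalized DFT, and only the non-negative wavenumbers. The normalization convention is an assumption, and `energy_spectrum` is a hypothetical helper.

```python
import cmath
import math


def energy_spectrum(u):
    """Kinetic energy 0.5 * |u_hat_k|^2 for wavenumbers k = 0 .. n // 2."""
    n = len(u)
    u_hat = [sum(x * cmath.exp(-2j * math.pi * k * j / n) for j, x in enumerate(u))
             for k in range(n)]
    return [0.5 * abs(c) ** 2 for c in u_hat[: n // 2 + 1]]


# A single cosine at wavenumber 3 puts all of its energy into that mode.
n = 16
u = [math.cos(2 * math.pi * 3 * j / n) for j in range(n)]
spectrum = energy_spectrum(u)
peak = max(range(len(spectrum)), key=lambda k: spectrum[k])
```

A model that discards high modes in every layer would be expected to show reduced energy in the tail of such a spectrum.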
Effect of using cosine transforms As an alternative to the Fourier transform, Poli et al. (2022) proposed using the cosine transform, which has the advantage of being real-valued, thus halving the number of parameters. Let the Factorized Cosine Neural Operator (F-CNO) be the operator where the Fourier transform is replaced with the cosine transform. In Fig. A.3, we observe that on Airfoil, the F-CNO outperforms the F-FNO, especially at deeper layers. On Plasticity, the F-CNO performs comparably to the F-FNO at the same depth, while using fewer parameters. We have not had much success in training the F-CNO on torus datasets such as TorusKochkov. We leave the investigation of how stable the cosine transform is on different domains to future work.

Table A.3: Detailed performance on TorusLi. These results are used to generate Fig. 3 in the main paper. We run three trials for each experiment, each with a different random seed. We report the mean N-MSE from the three trials, along with the min and max value. A dash indicates that the data is not available.

Note that for a given layer, our F-FNO (whether with weight sharing or without) has slightly more parameters than the geo-FNO. This is due to the F-FNO using a bigger hidden size H. We find that on the geo-FNO, increasing its hidden size does not necessarily translate to a better performance.



https://github.com/alasdairtran/fourierflow



Figure 1: An illustration of the input and output of different PDE problems. See the accompanying Table 1 for details. On the torus datasets (a), the operator learns to evolve the vorticity over time. On Elasticity (b), the operator learns to predict the stress value on each point on a point cloud. On Airfoil (c), the operator learns to predict the flow velocity on each mesh point. On Plasticity (d), the operator learns the displacement of each mesh point given an initial boundary condition.

Figure 3: Performance (lower is better) on TorusLi, with error bars showing the min and max values over three trials. We show the original FNO (Li et al., 2021a), along with variants that use: teacher forcing, Markov assumption, improved residuals, a bag of tricks, Fourier factorization, and weight sharing. Note that F-FNO and F-FNO-WS are presented on a separate plot (b) to make visualizing the improvement easier (if shown in (a), F-FNO and F-FNO-WS would just be a straight line).


Performance of F-FNO on different input features: having only vorticity as an input with no further context (first group); having vorticity and the force field as inputs (second group); and having vorticity, the force field, and viscosity as inputs (third group). The error bars are the standard deviation from three trials.

Effect of having the coordinates and velocity as additional input channels on TorusKochkov. A higher line corresponds to a model that can correlate with the ground-truth vorticity for longer. Error bands correspond to min and max values from three trials.

Figure 5: Performance of F-FNO on different contexts and input representations.

Figure A.1: The resource usage of four model variants, in terms of (a) the parameter count and (b) inference time (the time it takes to run one second of simulation). Error bars, when applicable, show the min and max values over three trials. In (b), as we move along a line, we increase the number of layers. We observe that only our model variants (F-FNO) have the desired slope, that is, as we use more resources (increasing the inference time), we obtain better predictions.


Energy spectra of F-FNO and DNS on various grid sizes. The spectra are computed by averaging the kinetic energy for each wavenumber between t = 12 and t = 34, when the predictions from all methods have decorrelated with the ground truths.

Figure A.2: Performance of F-FNO on zero-shot super-resolution and its ability to capture the energy spectrum of DNS on TorusKochkov.

Figure A.3: Effect of the cosine transform on Airfoil and Plasticity. We plot the test loss (y-axis) against the model parameter count (x-axis). Error bars show the min-max values from three trials. As we move along each line, we make the network deeper, which increases the number of parameters. On Airfoil (a), the F-CNO outperforms the F-FNO at deeper layers. On Plasticity (b), the performance of the two is mostly similar for the same depth. Since cosine transforms are real-valued, the F-CNO requires only half as many parameters as the F-FNO.

Figure A.4: Similar to Kochkov et al. (2021), we visualize how the correlation with the ground truths varies between different models. The heatmaps represent the surface of a torus mapped onto a 2D grid, with color representing the vorticity (the spinning motion) of the fluid. We observe that the vorticity fields predicted by the F-FNO trained on 128x128 grids (middle row) correlate with the ground truths (top row) for longer than if we run DNS on the same spatial resolution (bottom row). This is especially evident after 6 seconds of simulation time (compare the green boxes). In other words, for the same desired accuracy, the F-FNO requires a smaller grid input than a numerical solver. This observation is also backed up by Fig. 4a.

which can efficiently model PDEs with zero-shot super-resolution but is not specific to the Navier-Stokes equations.

An overview of the datasets and the corresponding task.

Note that Li et al. (2021a) did not generate a validation set.

Table A.2: An overview of the three PDE datasets on irregular geometries. These datasets were generated by Li et al. (2022).

Table A.4: Detailed performance on Airfoil. These results are a more detailed version of Table 2 in the main paper. We run three trials for each experiment, each with a different random seed. We report the mean N-MSE from the three trials, along with the min and max value.



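Table A.6: Detailed performance on Plasticity. These results are a more detailed version of Table 2 in the main paper. We run three trials for each experiment, each with a different random seed. We report the mean N-MSE from the three trials, along with the min and max value.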

Table A.7: The F-FNO is flexible in its input representation. We find that different datasets benefit from having different features. Shown here is the optimal input combination for each dataset on the torus.

