NONLINEAR RECONSTRUCTION FOR OPERATOR LEARNING OF PDES WITH DISCONTINUITIES

Abstract

A large class of hyperbolic and advection-dominated PDEs can have solutions with discontinuities. This paper investigates, both theoretically and empirically, the operator learning of PDEs with discontinuous solutions. We rigorously prove, in terms of lower approximation bounds, that methods which entail a linear reconstruction step (e.g. DeepONet or PCA-Net) fail to efficiently approximate the solution operators of such PDEs. In contrast, we show that certain methods employing a nonlinear reconstruction mechanism can overcome these fundamental lower bounds and approximate the underlying operator efficiently. The latter class includes Fourier Neural Operators and a novel extension of DeepONet termed shift-DeepONet. Our theoretical findings are confirmed by empirical results for the linear advection equation, the inviscid Burgers' equation and the compressible Euler equations of aerodynamics.

1. INTRODUCTION

Many interesting phenomena in physics and engineering are described by partial differential equations (PDEs) with discontinuous solutions. The most common types of such PDEs are nonlinear hyperbolic systems of conservation laws (Dafermos, 2005), such as the Euler equations of aerodynamics, the shallow-water equations of oceanography and the MHD equations of plasma physics. It is well known that solutions of these PDEs develop finite-time discontinuities, such as shock waves, even when the initial and boundary data are smooth. Other examples include the propagation of waves with jumps in linear transport and wave equations, crack and fracture propagation in materials (Sun & Jin, 2012), moving interfaces in multiphase flows (Drew & Passman, 1998) and the motion of very sharp gradients, such as propagating fronts and traveling wave solutions, in reaction-diffusion equations (Smoller, 2012). Approximating such (propagating) discontinuities is considered to be extremely challenging for traditional numerical methods (Hesthaven, 2018), as resolving them can require very small grid sizes. Although bespoke numerical methods such as high-resolution finite-volume, discontinuous Galerkin finite-element and spectral viscosity methods (Hesthaven, 2018) have been used successfully in this context, their very high computational cost prohibits their extensive use, particularly for many-query problems such as uncertainty quantification (UQ), optimal control and (Bayesian) inverse problems (Lye et al., 2020), necessitating the design of fast machine learning-based surrogates. As the task at hand is to learn the underlying solution operator, which maps input functions (initial and boundary data) to output functions (the solution at a given time), recently developed operator learning methods can be employed in this infinite-dimensional setting (Higgins, 2021).
These methods include operator networks (Chen & Chen, 1995) and their deep version, DeepONet (Lu et al., 2019; 2021), where two sets of neural networks (branch and trunk nets) are combined in a linear reconstruction procedure to obtain an infinite-dimensional output. DeepONets have been used very successfully for different PDEs (Lu et al., 2021; Mao et al., 2020b; Cai et al., 2021; Lin et al., 2021). An alternative framework is provided by neural operators (Kovachki et al., 2021a), wherein the affine functions within DNN hidden layers are generalized to infinite dimensions by replacing them with kernel integral operators, as in Li et al. (2020a); Kovachki et al. (2021a); Li et al. (2020b). A computationally efficient form of neural operators is the Fourier Neural Operator (FNO) (Li et al., 2021a), where a translation-invariant kernel is evaluated in Fourier space, leading to many successful applications for PDEs (Li et al., 2021a; b; Pathak et al., 2022). Currently available theoretical results for operator learning (e.g. Lanthaler et al. (2022); Kovachki et al. (2021a; b); De Ryck & Mishra (2022b); Deng et al. (2022)) leverage the regularity (or smoothness) of solutions of the PDE to prove that frameworks such as DeepONet, FNO and their variants approximate the underlying operator efficiently. Although such regularity holds for many elliptic and parabolic PDEs, it is obviously destroyed when discontinuities appear in the solutions, as in the hyperbolic PDEs mentioned above. Thus, a priori, it is unclear whether existing operator learning frameworks can efficiently approximate PDEs with discontinuous solutions.
This explains the paucity of theoretical and (to a lesser extent) empirical work on operator learning of PDEs with discontinuous solutions, and provides the rationale for the current paper, in which:

• Using a lower bound, we rigorously prove approximation error estimates to show that operator learning architectures such as DeepONet (Lu et al., 2021) and PCA-Net (Bhattacharya et al., 2021), which entail a linear reconstruction step, fail to efficiently approximate solution operators of prototypical PDEs with discontinuities. In particular, the approximation error decays, at best, linearly in the network size.

• We rigorously prove that using a nonlinear reconstruction procedure within an operator learning architecture can lead to the efficient approximation of prototypical PDEs with discontinuities. In particular, the approximation error can decay exponentially in the network size, even after discontinuity formation. This result is shown for two types of architectures with nonlinear reconstruction, namely the widely used Fourier Neural Operator (FNO) of Li et al. (2021a) and a novel variant of DeepONet that we term shift-DeepONet.

• We supplement the theoretical results with extensive experiments in which FNO and shift-DeepONet consistently outperform DeepONet and other baselines for PDEs with discontinuous solutions, such as linear advection, the inviscid Burgers' equation, and both the one- and two-dimensional compressible Euler equations of gas dynamics.

2. METHODS

Setting. Given compact domains D ⊂ R^d and U ⊂ R^{d'}, we consider the approximation of operators G : X → Y, where X ⊂ L^2(D) and Y ⊂ L^2(U) are the input and output function spaces. In the following, we focus on the case where ū ↦ G(ū) maps initial data ū to the solution at some time t > 0 of an underlying time-dependent PDE. We assume the input ū to be sampled from a probability measure µ ∈ Prob(X).

DeepONet. DeepONet (Lu et al., 2021) will be our prototype for operator learning frameworks with linear reconstruction. To define it, let x := (x_1, ..., x_m) ∈ D^m be a fixed set of sensor points. Given an input function ū ∈ X, we encode it by its point values E(ū) = (ū(x_1), ..., ū(x_m)) ∈ R^m. DeepONet is formulated in terms of two neural networks. The first is the branch net β, which maps the point values E(ū) to coefficients,

β : R^m → R^p, E(ū) ↦ (β_1(E(ū)), ..., β_p(E(ū))). (2.1)

The second is the trunk net τ(y) = (τ_1(y), ..., τ_p(y)), which defines a mapping

τ : U → R^p, y ↦ (τ_1(y), ..., τ_p(y)). (2.2)

While the branch net provides the coefficients, the trunk net provides the "basis" functions in an expansion of the output function of the form

N_DON(ū)(y) = Σ_{k=1}^p β_k(ū) τ_k(y), ū ∈ X, y ∈ U, (2.3)

with β_k(ū) := β_k(E(ū)). The resulting mapping N_DON : X → Y, ū ↦ N_DON(ū), is a DeepONet. Although DeepONets were shown to be universal in the class of measurable operators (Lanthaler et al., 2022), the following fundamental lower bound on the approximation error was also established.

Proposition 2.1 (Lanthaler et al. (2022, Thm. 3.4)). Let X be a separable Banach space, Y a separable Hilbert space, and let µ be a probability measure on X. Let G : X → Y be a Borel measurable operator with E_{ū∼µ}[ ‖G(ū)‖²_Y ] < ∞.
Then the following lower approximation bound holds for any DeepONet N_DON with trunk-/branch-net dimension p:

E(N_DON) = ( E_{ū∼µ}[ ‖N_DON(ū) − G(ū)‖²_Y ] )^{1/2} ≥ E_opt := ( Σ_{j>p} λ_j )^{1/2}, (2.4)

where the optimal error E_opt is written in terms of the eigenvalues λ_1 ≥ λ_2 ≥ ... of the covariance operator Γ_{G#µ} := E_{u∼G#µ}[u ⊗ u] of the push-forward measure G#µ.

We refer to SM A for relevant background on the underlying principal component analysis (PCA) and covariance operators, as well as an example illustrating the connection between sharpness of gradients and the decay of the PCA eigenvalues λ_j (SM A.1). The same lower bound (2.4) in fact holds for any operator approximation of the form N(ū) = Σ_{k=1}^p β_k(ū) τ_k, where the β_k : X → R are arbitrary functionals. In particular, this bound continues to hold, e.g., for the PCA-Net architecture of Hesthaven & Ubbiali (2018); Bhattacharya et al. (2021). We will refer to any operator learning architecture of this form as a method with "linear reconstruction", since the output function N(ū) is restricted to the linear p-dimensional space spanned by τ_1, ..., τ_p ∈ Y. In particular, DeepONets are based on linear reconstruction. To overcome the lower bound (2.4), which sets fundamental limitations on DeepONets, the basis τ therefore needs to depend on the input ū as well.

shift-DeepONet. The lower bound (2.4) shows that there are fundamental barriers to the expressive power of operator learning methods based on linear reconstruction. This is of particular relevance for problems in which the optimal lower bound E_opt in (2.4) exhibits a slow decay in terms of the number of basis functions p, due to a slow decay of the eigenvalues λ_j of the covariance operator.
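This slow eigenvalue decay can be checked numerically. The following sketch is our own illustration (not taken from the paper or its SM, and all names in it are hypothetical): it estimates the spectrum of the uncentered covariance operator for randomly shifted box waves, the kind of discontinuous push-forward measure appearing in Section 3, and exhibits the O(1/p) tail sum that drives the lower bound (2.4).

```python
# Hypothetical numerical check: PCA eigenvalue decay for randomly shifted
# box waves on the 2*pi-periodic domain, mirroring the inputs (3.2).
import numpy as np

rng = np.random.default_rng(0)
n_grid, n_samples = 512, 2000
x = np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False)

# Draw box waves u(x) = h * 1_{[-w/2, w/2]}(x - xi) with random h, w, xi
# (widths rescaled from [0.05, 0.3] on [0, 1] to the 2*pi domain).
h = rng.uniform(0.2, 0.8, n_samples)
w = rng.uniform(0.05, 0.3, n_samples) * 2 * np.pi
xi = rng.uniform(0.0, 2 * np.pi, n_samples)
d = np.abs((x[None, :] - xi[:, None] + np.pi) % (2 * np.pi) - np.pi)
U = h[:, None] * (d <= w[:, None] / 2)

# Empirical eigenvalues of the uncentered covariance E[u (x) u], matching
# the definition of the covariance operator in Proposition 2.1.
lam = np.linalg.svd(U / np.sqrt(n_samples), compute_uv=False) ** 2

# For discontinuous outputs the tail sum decays only like 1/p, so E_opt
# in (2.4) decays like p^{-1/2}: halving the tail needs doubling p.
tail = lambda p: lam[p:].sum()
print(tail(16) / tail(32))  # roughly 2 for an O(1/p) tail
```

Smooth output ensembles, by contrast, would show a rapidly decaying tail here, which is why linear reconstruction is unproblematic for many elliptic and parabolic problems.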
It is well known that even linear advection- or transport-dominated problems can suffer from such a slow decay of the eigenvalues (Ohlberger & Rave, 2013; Dahmen et al., 2014; Taddei et al., 2015; Peherstorfer, 2020), which could hinder the application of operator learning methods based on linear reconstruction to this very important class of problems. In view of these observations, it is desirable to develop a nonlinear variant of DeepONet which can overcome this lower bound in the context of transport-dominated problems. We propose such an extension below. A shift-DeepONet N_sDON : X → Y is an operator of the form

N_sDON(ū)(y) = Σ_{k=1}^p β_k(ū) τ_k( A_k(ū) y + γ_k(ū) ), (2.5)

where the input function ū is again encoded by its values E(ū) ∈ R^m at the sensor points. We retain the DeepONet branch and trunk nets β, τ defined in (2.1), (2.2), respectively, and introduce a scale net A = (A_k)_{k=1}^p, consisting of matrix-valued functions

A_k : R^m → R^{d'×d'}, E(ū) ↦ A_k(ū) := A_k(E(ū)),

and a shift net γ = (γ_k)_{k=1}^p, with

γ_k : R^m → R^{d'}, E(ū) ↦ γ_k(ū) := γ_k(E(ū)).

All components of a shift-DeepONet are represented by deep neural networks, potentially with different activation functions.

Remark 2.2. The form (2.5) of shift-DeepONet is very natural from a theoretical perspective. Practical experimentation indicates that an extended architecture, based on a trunk net τ* : R^{d'×p} → R^p depending jointly on all the values A_1(ū)y + γ_1(ū), ..., A_p(ū)y + γ_p(ū) and defining the mapping

N_sDON*(ū)(y) := Σ_{k=1}^p β_k(ū) τ*_k( A(ū)y + γ(ū) ), (2.6)

with concatenated input A(ū)y + γ(ū) := (A_1(ū)y + γ_1(ū), ..., A_p(ū)y + γ_p(ū)), achieves better accuracy. Our numerical results will be reported for (2.6). We emphasize that all theoretical results in this work apply to both architectures, (2.5) and (2.6). Since shift-DeepONets reduce to DeepONets for the particular choice A ≡ 1 and γ ≡ 0, the universality of DeepONets (Theorem 3.1 of Lanthaler et al.
(2022)) is clearly inherited by shift-DeepONets. However, as shift-DeepONets do not use a linear reconstruction (the trunk nets in (2.5) depend on the input through the scale and shift nets), the lower bound (2.4) does not directly apply, leaving room for shift-DeepONet to efficiently approximate transport-dominated problems, especially in the presence of discontinuities.

Fourier neural operators (FNO). An FNO (Li et al., 2021a) is a composition

N_FNO : X → Y, N_FNO = Q ∘ L_L ∘ ... ∘ L_1 ∘ R, (2.7)

consisting of a "lifting operator" ū(x) ↦ R(ū(x), x), where R is represented by a (shallow) neural network R : R^{d_u} × R^d → R^{d_v}, with d_u the number of components of the input function, d the dimension of the domain and d_v the "lifting dimension" (a hyperparameter); followed by L hidden layers L_ℓ : v_ℓ(x) ↦ v_{ℓ+1}(x) of the form

v_{ℓ+1}(x) = σ( W_ℓ v_ℓ(x) + b_ℓ(x) + (K_ℓ v_ℓ)(x) ),

with W_ℓ ∈ R^{d_v×d_v} a weight matrix (residual connection), x ↦ b_ℓ(x) ∈ R^{d_v} a bias function, and with a convolution operator

(K_ℓ v_ℓ)(x) = ∫_{T^d} κ_ℓ(x − y) v_ℓ(y) dy,

expressed in terms of a (learnable) integral kernel x ↦ κ_ℓ(x) ∈ R^{d_v×d_v}. The output function is finally obtained by a linear projection layer v_{L+1}(x) ↦ N_FNO(ū)(x) = Q(v_{L+1}(x)). The convolution operators K_ℓ add the indispensable non-local dependence of the output on the input function. Given values on an equidistant Cartesian grid, the evaluation of K_ℓ v_ℓ can be carried out efficiently in Fourier space based on the discrete Fourier transform (DFT), leading to the representation

K_ℓ v_ℓ = F_N^{-1}( P_ℓ(k) · (F_N v_ℓ)(k) ),

where (F_N v_ℓ)(k) denotes the Fourier coefficients of the DFT of v_ℓ(x), computed from the given N grid values in each direction, P_ℓ(k) ∈ C^{d_v×d_v} is a complex Fourier multiplication matrix indexed by k ∈ Z^d, and F_N^{-1} denotes the inverse DFT.
In practice, only a finite number of Fourier modes can be computed, and hence we introduce a hyperparameter k_max ∈ N such that the Fourier coefficients of the bias functions as well as the Fourier multipliers vanish for high wavenumbers, i.e. b̂_ℓ(k) ≡ 0 and P_ℓ(k) ≡ 0 whenever |k|_∞ > k_max. In particular, for fixed k_max the truncated DFT and its inverse can be computed in O((2 k_max + 1)^d N^d) operations, i.e. linearly in the total number N^d of grid points. The output space of FNO (2.7) is manifestly nonlinear, as it is not spanned by a fixed number of basis functions. Hence, FNOs constitute a nonlinear reconstruction method.
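As a concrete illustration, a single FNO hidden layer can be sketched in a few lines. This is a minimal NumPy mock-up, with random stand-ins for the trained parameters W_ℓ, b_ℓ and P_ℓ(k), and is not the implementation used in the experiments:

```python
# Minimal sketch of one FNO hidden layer on a 1d periodic grid; W, b and
# the Fourier multipliers P are random stand-ins for trained parameters.
import numpy as np

rng = np.random.default_rng(1)
d_v, N, k_max = 8, 64, 4          # lifting dim, grid size, Fourier cut-off

W = rng.normal(size=(d_v, d_v)) / d_v                 # residual weight matrix
b = rng.normal(size=(d_v, 1)) * 0.1                   # (constant) bias function
P = rng.normal(size=(2 * k_max + 1, d_v, d_v)) / d_v  # multipliers P_l(k)

def fno_layer(v):
    """v: (d_v, N) grid values -> (d_v, N); one layer L_l of (2.7)."""
    v_hat = np.fft.fft(v, axis=1)             # F_N v
    Kv_hat = np.zeros_like(v_hat)
    for i, k in enumerate(range(-k_max, k_max + 1)):
        Kv_hat[:, k] = P[i] @ v_hat[:, k]     # act on retained modes only
    # Real part keeps the sketch real-valued; trained multipliers are
    # complex and conjugate-symmetric in practice.
    Kv = np.fft.ifft(Kv_hat, axis=1).real
    return np.maximum(W @ v + b + Kv, 0.0)    # sigma = ReLU

v = rng.normal(size=(d_v, N))
out = fno_layer(v)
print(out.shape)  # (8, 64)
```

Note that negative wavenumbers are addressed via NumPy's FFT ordering, in which index -k stores the mode with wavenumber -k.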

3. THEORETICAL RESULTS

Context. Our aim in this section is to rigorously prove that the nonlinear reconstruction methods (shift-DeepONet, FNO) efficiently approximate operators stemming from discontinuous solutions of PDEs, whereas linear reconstruction methods (DeepONet, PCA-Net) fail to do so. To this end, we follow standard practice in the numerical analysis of PDEs (Hesthaven, 2018) and choose two prototypical PDEs that are widely used to analyze numerical methods for transport-dominated problems: the linear transport (advection) equation and the nonlinear inviscid Burgers' equation, the prototypical example of a hyperbolic conservation law. The exact operators and the corresponding approximation results with both linear and nonlinear reconstruction methods are described below. The computational complexity of the models is expressed in terms of hyperparameters, such as the model size, which are described in detail in SM B.

Linear Advection Equation. We consider the one-dimensional linear advection equation

∂_t u + a ∂_x u = 0, u(·, t = 0) = ū, (3.1)

on the 2π-periodic domain D = T, with constant speed a ∈ R. The underlying operator is

G_adv : L^1(T) ∩ L^∞(T) → L^1(T) ∩ L^∞(T), ū ↦ G_adv(ū) := u(·, T),

obtained by solving the PDE (3.1) with initial data ū up to any final time t = T. We note that X = L^1(T) ∩ L^∞(T) ⊂ L^2(T). As input measure µ ∈ Prob(X), we consider random input functions ū ∼ µ given by square (box) waves of height h and width w, centered at ξ,

ū(x) = h 1_{[−w/2, +w/2]}(x − ξ), (3.2)

where h ∈ [h_min, h_max], w ∈ [w_min, w_max] and ξ ∈ [0, 2π] are independent and uniformly distributed, with fixed constants 0 < h_min ≤ h_max and 0 < w_min ≤ w_max.

DeepONet fails at approximating G_adv efficiently. Our first rigorous result is the following lower bound on the error incurred by DeepONets (2.3) in approximating G_adv.

Theorem 3.1. Let p, m ∈ N.
There exists a constant C > 0, independent of m, p and T, such that for any DeepONet N_DON (2.3) with sup_{ū ∈ supp(µ)} ‖N_DON(ū)‖_{L^∞} ≤ M < ∞, we have the lower bound

E(N_DON) = E_{ū∼µ}[ ‖G_adv(ū) − N_DON(ū)‖_{L^1} ] ≥ C / min(m, p).

Consequently, to achieve E(N_DON) ≤ ϵ with DeepONet, we need p, m ≳ ϵ^{-1} trunk and branch net basis functions and sensor points, respectively, entailing that size(N_DON) ≳ pm ≳ ϵ^{-2} (cp. SM B). The detailed proof is presented in SM D.2. It relies on two facts. First, following Lanthaler et al. (2022), one observes that the translation invariance of the problem implies that the Fourier basis is optimal for spanning the output space. As the underlying functions are discontinuous, the corresponding eigenvalues of the covariance operator of the push-forward measure decay, at most, quadratically; consequently, the lower bound (2.4) leads to a linear decay of the error in terms of the number p of trunk net basis functions. Second, roughly speaking, the linear decay of the error in terms of the number of sensor points is a consequence of the fact that sufficiently many sensor points are needed to resolve the underlying discontinuous inputs.

shift-DeepONet approximates G_adv efficiently. In contrast to the previous result on DeepONet, we have the following efficient approximation result for shift-DeepONet (2.5).

Theorem 3.2. There exists a constant C > 0, independent of T, such that for any ϵ > 0 there exists a shift-DeepONet N^sDON_ϵ (2.5) such that

E = E_{ū∼µ}[ ‖G_adv(ū) − N^sDON_ϵ(ū)‖_{L^1} ] ≤ ϵ,

with uniformly bounded p ≤ C and with the number of sensor points m ≤ C ϵ^{-1}. Furthermore, we have

width(N^sDON_ϵ) ≤ C, depth(N^sDON_ϵ) ≤ C log(ϵ^{-1})², size(N^sDON_ϵ) ≤ C ϵ^{-1}.

The detailed proof, presented in SM D.3, is based on the fact that for each input, the exact solution is completely determined by three variables: the height h, width w and shift ξ of the box wave (3.2).
Given an input ū, we explicitly construct neural networks that infer each of these variables with high accuracy. These networks are then combined to yield a shift-DeepONet that approximates G_adv with the desired complexity. The nonlinear dependence of the trunk net in shift-DeepONet (2.5) on the input is the key to encoding the shift of the box wave (3.2), which demonstrates the necessity of nonlinear reconstruction in this context.

FNO approximates G_adv efficiently. Finally, we state an efficient approximation result for G_adv with FNO (2.7), where the constant C > 0 is again independent of the final time T.

Theorem 3.3. There exists C > 0 such that for any ϵ > 0, there exists an FNO N^FNO_ϵ (2.7) with

E_{ū∼µ}[ ‖G_adv(ū) − N^FNO_ϵ(ū)‖_{L^1} ] ≤ ϵ,

with grid size N ≤ C ϵ^{-1}, and with Fourier cut-off k_max, lifting dimension d_v, depth and size satisfying

k_max = 1, d_v ≤ C, depth(N^FNO_ϵ) ≤ C log(ϵ^{-1})², size(N^FNO_ϵ) ≤ C log(ϵ^{-1})².

A priori, one recognizes that G_adv can be represented by Fourier multipliers (see SM D.4). Consequently, a single linear FNO layer would in principle suffice to approximate G_adv. However, the size of this FNO would be exponentially larger than the bound in Theorem 3.3. To obtain a more efficient approximation, one needs to leverage the nonlinear reconstruction within the FNO layers. This is done in the proof, presented in SM D.4, where the underlying height, width and shift of the box-wave inputs (3.2) are approximated with high accuracy by FNO layers; these are then combined with a novel representation formula for the solution to yield the desired FNO.

Comparison. Observing the complexity bounds in Theorems 3.1, 3.2 and 3.3, we note that the DeepONet size scales at least quadratically, size ≳ ϵ^{-2}, in terms of the error in approximating G_adv, whereas for shift-DeepONet and FNO this scaling is only linear and logarithmic, respectively.
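The mechanism behind these results can be made concrete with a toy computation (our own illustration, not the construction used in the proofs): once an input-dependent scale and shift are available, a single fixed trunk profile, the unit box on [−1/2, 1/2], reproduces every advected box wave exactly, i.e. p = 1 suffices, whereas any fixed linear basis needs p ≳ ϵ^{-1} functions.

```python
# Toy illustration: a p = 1 "shift-DeepONet" of the form (2.5), with
# beta = h, A = 1/w and gamma encoding the advected center, represents
# G_adv of a box wave exactly. Names and parameter values are hypothetical.
import numpy as np

def tau(y):
    """Fixed reference trunk function: the unit box on [-1/2, 1/2]."""
    return ((-0.5 <= y) & (y <= 0.5)).astype(float)

def solution_operator(h, w, xi, a, T, y):
    """Exact G_adv: box of height h, width w, center advected to xi + aT."""
    d = np.abs((y - xi - a * T + np.pi) % (2 * np.pi) - np.pi)
    return h * (d <= w / 2)

def shift_deeponet_p1(h, w, xi, a, T, y):
    """beta_1 = h, A_1 = 1/w, shift by the advected center, as in (2.5)."""
    d = (y - xi - a * T + np.pi) % (2 * np.pi) - np.pi  # periodic coordinate
    return h * tau(d / w)

y = np.linspace(0.0, 2 * np.pi, 1024, endpoint=False)
exact = solution_operator(0.6, 0.9, 1.7, 0.5, 2.0, y)
approx = shift_deeponet_p1(0.6, 0.9, 1.7, 0.5, 2.0, y)
print(np.abs(exact - approx).max() < 1e-12)  # True
```

The scale and shift here are known in closed form; the proofs of Theorems 3.2 and 3.3 show that neural networks can infer them from the sensor values with the stated complexity.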
Thus, we rigorously prove that for this problem, the nonlinear reconstruction methods (FNO and shift-DeepONet) can be more efficient than DeepONet and other methods based on linear reconstruction. Moreover, FNO is shown to have a smaller approximation error than even shift-DeepONet for a similar model size. We provide two remarks on extensions of these results, to the approximation of the time-evolution and to higher dimensions, in SM C.

Inviscid Burgers' Equation. Next, we consider the inviscid Burgers' equation in one space dimension, the prototypical example of a nonlinear hyperbolic conservation law (Dafermos, 2005):

∂_t u + ∂_x( u²/2 ) = 0, u(·, t = 0) = ū, (3.4)

on the 2π-periodic domain D = T. It is well known that discontinuities in the form of shock waves can appear in finite time, even for smooth ū. Consequently, solutions of (3.4) are interpreted in the sense of distributions, and entropy conditions are imposed to ensure uniqueness (Dafermos, 2005). Thus, the underlying solution operator is

G_Burg : L^1(T) ∩ L^∞(T) → L^1(T) ∩ L^∞(T), ū ↦ G_Burg(ū) := u(·, T),

with u the entropy solution of (3.4) at final time T. Given ξ ∼ Unif([0, 2π]), we define the random field ū(x) := −sin(x − ξ), and we define the input measure µ ∈ Prob(L^1(T) ∩ L^∞(T)) as the law of ū. We emphasize that the difficulty in approximating the underlying operator G_Burg arises even though the input functions are smooth, in fact analytic; this is in contrast to the linear advection equation.

DeepONet fails at approximating G_Burg efficiently. First, we recall the following result, which follows directly from Lanthaler et al. (2022, Theorem 4.19) and the lower bound (2.4).

Theorem 3.4. Assume that G_Burg(ū) = u(·, T) for T > π, where u is the entropy solution of (3.4) with initial data ū ∼ µ.
There exists a constant C > 0, such that the error of any DeepONet N_DON with p trunk-/branch-net output functions is lower-bounded by

E(N_DON) = E_{ū∼µ}[ ‖G_Burg(ū) − N_DON(ū)‖_{L^1} ] ≥ C / p. (3.6)

Consequently, achieving an error E(N^DON_ϵ) ≲ ϵ requires at least size(N^DON_ϵ) ≥ p ≳ ϵ^{-1}.

shift-DeepONet approximates G_Burg efficiently. In contrast to DeepONet, we have the following result on the efficient approximation of G_Burg with shift-DeepONet.

Theorem 3.5. Assume that T > π. There is a constant C > 0 such that for any ϵ > 0, there exists a shift-DeepONet N^sDON_ϵ such that

E(N^sDON_ϵ) = E_{ū∼µ}[ ‖G_Burg(ū) − N^sDON_ϵ(ū)‖_{L^1} ] ≤ ϵ, (3.7)

with a uniformly bounded number p ≤ C of trunk/branch net functions; the number of sensor points can be chosen as m = 3, and we have

width(N^sDON_ϵ) ≤ C, depth(N^sDON_ϵ) ≤ C log(ϵ^{-1})², size(N^sDON_ϵ) ≤ C log(ϵ^{-1})².

The proof, presented in SM D.5, relies on an explicit representation formula for G_Burg, obtained using the method of characteristics (even after shock formation). We then leverage the analyticity of the underlying solutions away from the shock and use the nonlinear shift map in (2.5) to encode the shock locations. Careful inspection of the proof shows that the constant C is independent of T > π.

FNO approximates G_Burg efficiently. Finally, we prove (in SM D.6) the following theorem.

Theorem 3.6. Assume that T > π. Then there exists a constant C (again independent of T), such that for any ϵ > 0 and grid size N ≥ 3, there exists an FNO N^FNO_ϵ (2.7) such that

E(N^FNO_ϵ) = E_{ū∼µ}[ ‖G_Burg(ū) − N^FNO_ϵ(ū)‖_{L^1} ] ≤ ϵ,

with Fourier cut-off k_max, lifting dimension d_v, depth and size satisfying

k_max = 1, d_v ≤ C, depth(N^FNO_ϵ) ≤ C log(ϵ^{-1})², size(N^FNO_ϵ) ≤ C log(ϵ^{-1})².

Comparison.
A perusal of the bounds in Theorems 3.4, 3.5 and 3.6 reveals that after shock formation, the accuracy ϵ of the DeepONet approximation of G_Burg scales at best as ϵ ∼ n^{-1} in terms of the total number of degrees of freedom n = size(N_DON) of the DeepONet. In contrast, shift-DeepONet and FNO, based on a nonlinear reconstruction, can achieve an exponential convergence rate ϵ ≲ exp(−c n^{1/2}) in the total number of degrees of freedom n = size(N_sDON), size(N_FNO), even after the formation of shocks. This again highlights the expressive power of nonlinear reconstruction methods in approximating operators of PDEs with discontinuities.
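The entropy solutions underlying G_Burg can be probed numerically with a basic finite volume method. The following first-order Godunov sketch is a simplification for illustration only (the paper's reference data are generated with the high-resolution ALSVINN solver); it evolves the sine initial data of Theorem 3.4 past shock formation.

```python
# Minimal first-order Godunov finite-volume sketch for the inviscid
# Burgers' equation (3.4) on the 2*pi-periodic domain.
import numpy as np

def godunov_flux(ul, ur):
    """Exact (entropy-satisfying) Riemann flux for f(u) = u^2 / 2."""
    fl, fr = 0.5 * ul**2, 0.5 * ur**2
    flux = np.where(ul <= ur,                       # rarefaction branch
                    np.minimum(fl, fr), np.maximum(fl, fr))
    return np.where((ul < 0) & (0 < ur), 0.0, flux)  # sonic point: f(0) = 0

def burgers_godunov(u0, T, cfl=0.45):
    u = u0.copy()
    dx = 2 * np.pi / u.size
    t = 0.0
    while t < T:
        dt = min(cfl * dx / max(np.abs(u).max(), 1e-12), T - t)
        F = godunov_flux(u, np.roll(u, -1))         # flux at right cell faces
        u = u - dt / dx * (F - np.roll(F, 1))       # conservative update
        t += dt
    return u

x = np.linspace(0.0, 2 * np.pi, 512, endpoint=False)
u = burgers_godunov(-np.sin(x - 1.0), T=1.0)        # shock forms at t = 1
print(float(u.max()) < 1.0)  # True: extrema are dissipated at the shock
```

Resolving the shock sharply with such a scheme requires fine grids, which is precisely the computational cost that the learned surrogates in Section 4 are meant to avoid.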

4. EXPERIMENTS

In this section, we illustrate how different operator learning frameworks approximate solution operators of PDEs with discontinuities. To this end, we compare DeepONet (2.3) (a prototypical operator learning method with linear reconstruction) with the extended shift-DeepONet (2.6) and FNO (2.7) (as nonlinear reconstruction methods). Moreover, two additional baselines (described in detail in SM E.1) are also used, namely the well-known ResNet architecture of He et al. (2016) and a fully convolutional neural network (FCNN) of Long et al. (2015). Below, we present results for the best performing hyperparameter configuration of each model, obtained after a grid search, while postponing the details of the training procedures, hyperparameter configurations and model parameters to SM E.1.

Linear Advection. We start with the linear advection equation (3.1) in the domain [0, 1], with wave speed a = 0.5 and periodic boundary conditions. The initial data is given by (3.2), corresponding to square waves with initial heights uniformly distributed between 0.2 and 0.8, widths between 0.05 and 0.3, and shifts between 0 and 0.5. We seek to approximate the solution operator G_adv at final time T = 0.25.

Inviscid Burgers' Equation. Next, we consider the inviscid Burgers' equation (3.4), with initial data sampled from a Gaussian random field with covariance kernel

k(x, x') = exp( −|x − x'|² / (2ℓ²) ),

with correlation length ℓ = 0.06. The solution operator G_Burg corresponds to evaluating the entropy solution at time T = 0.1. We generate the output data with a high-resolution finite volume scheme, implemented within the ALSVINN code (Lye, 2020), at a spatial mesh resolution of 1024 points. Examples of input and output functions, shown in SM Figure 11, illustrate how the smooth yet oscillatory initial datum evolves into many discontinuities in the form of shock waves, separated by Lipschitz continuous rarefactions. Given this complex structure of the entropy solution, the underlying solution operator is hard to learn.
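One possible way to draw such random input functions is to factorize the covariance matrix induced by the kernel on the sampling grid. This is a generic sketch: the periodization of the kernel and the choice of factorization are our assumptions, not the paper's stated procedure.

```python
# Sampling a Gaussian random field on a periodic unit grid with
# squared-exponential covariance and correlation length l = 0.06.
import numpy as np

rng = np.random.default_rng(2)
n, ell = 1024, 0.06
x = np.linspace(0.0, 1.0, n, endpoint=False)

# Covariance matrix from k(x, x') using the periodic distance on [0, 1).
d = np.abs(x[None, :] - x[:, None])
d = np.minimum(d, 1.0 - d)
K = np.exp(-d**2 / (2 * ell**2))

# Sample u0 = V sqrt(Lambda) z with z ~ N(0, I); eigenvalues are clipped
# at zero to absorb round-off in the nearly singular covariance matrix.
lam, V = np.linalg.eigh(K)
u0 = V @ (np.sqrt(np.clip(lam, 0.0, None)) * rng.standard_normal(n))
print(u0.shape)  # (1024,)
```

For large grids, a spectral (FFT-based) sampler exploiting the stationarity of the kernel would be the cheaper alternative to a dense eigendecomposition.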
The relative median test errors for all models are presented in Table 1 and show that DeepONet (and the baselines ResNet and FCNN) have an unacceptably high error, between 20 and 30%. In fact, DeepONet performs worse than the two baselines. However, consistent with the theory of the previous section, this error is reduced more than three-fold by the nonlinear shift-DeepONet. The error is reduced even further by FNO; in this case, FNO outperforms DeepONet by a factor of almost 20 and learns the very complicated solution operator with an error of only 1.5%.

Compressible Euler Equations. The motion of an inviscid gas is described by the Euler equations of aerodynamics. For definiteness, the Euler equations in two space dimensions are

U_t + F(U)_x + G(U)_y = 0, U = (ρ, ρu, ρv, E)^⊤, F(U) = (ρu, ρu² + p, ρuv, (E + p)u)^⊤, G(U) = (ρv, ρuv, ρv² + p, (E + p)v)^⊤, (4.1)

with ρ, u, v and p denoting the fluid density, the velocities along the x- and y-axes, and the pressure. E represents the total energy per unit volume,

E = (1/2) ρ(u² + v²) + p/(γ − 1),

where γ = c_p/c_v is the ratio of specific heats, equal to 1.4 for the diatomic gas considered here.

Shock Tube. We start by restricting the Euler equations (4.1) to the one-dimensional domain D = [−5, 5] by setting v = 0 in (4.1). The initial data corresponds to a shock tube of the form

ρ_0(x) = { ρ_L, x ≤ x_0; ρ_R, x > x_0 }, u_0(x) = { u_L, x ≤ x_0; u_R, x > x_0 }, p_0(x) = { p_L, x ≤ x_0; p_R, x > x_0 }, (4.2)

parameterized by the left and right states (ρ_L, u_L, p_L), (ρ_R, u_R, p_R) and the location x_0 of the initial discontinuity. As proposed in Lye et al. (2020), these parameters are in turn drawn from the measure

ρ_L = 0.75 + 0.45 G(z_1), ρ_R = 0.4 + 0.3 G(z_2), u_L = 0.5 + 0.5 G(z_3), u_R = 0, p_L = 2.5 + 1.6 G(z_4), p_R = 0.375 + 0.325 G(z_5), x_0 = 0.5 G(z_6),

with z = (z_1, z_2, ..., z_6) ∼ U([0, 1]⁶) and G(z) := 2z − 1. We seek to approximate the operator G : (ρ_0, ρ_0 u_0, E_0) ↦ E(·, t = 1.5).
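The sampling of the shock-tube initial data just described can be sketched directly; the variable names are ours, with γ = 1.4 as stated above.

```python
# Sampling the shock-tube initial data (4.2) from the stated measure,
# with G(z) = 2z - 1 mapping U[0, 1] to U[-1, 1].
import numpy as np

def sample_shock_tube(rng):
    G = 2 * rng.uniform(size=6) - 1
    left = (0.75 + 0.45 * G[0],       # rho_L
            0.50 + 0.50 * G[2],       # u_L
            2.50 + 1.60 * G[3])       # p_L
    right = (0.40 + 0.30 * G[1],      # rho_R
             0.0,                     # u_R
             0.375 + 0.325 * G[4])    # p_R
    x0 = 0.5 * G[5]
    return left, right, x0

def initial_data(left, right, x0, x, gamma=1.4):
    """Piecewise-constant conserved variables (rho, rho*u, E) per (4.2)."""
    (rho_L, u_L, p_L), (rho_R, u_R, p_R) = left, right
    mask = x <= x0
    rho = np.where(mask, rho_L, rho_R)
    u = np.where(mask, u_L, u_R)
    p = np.where(mask, p_L, p_R)
    E = 0.5 * rho * u**2 + p / (gamma - 1)
    return rho, rho * u, E

rng = np.random.default_rng(3)
x = np.linspace(-5.0, 5.0, 2048)
left, right, x0 = sample_shock_tube(rng)
rho, m, E = initial_data(left, right, x0, x)
print(rho.shape, m.shape, E.shape)
```

Each draw yields one input triple (ρ_0, ρ_0 u_0, E_0) on the 2048-point grid; the corresponding output E(·, t = 1.5) is computed with the finite volume solver.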
The training (and test) outputs are generated with the ALSVINN code (Lye, 2020), using a finite volume scheme at a spatial mesh resolution of 2048 points. Examples of input-output pairs, presented in SM Figure 12, show that the initial jump discontinuities in density, velocity and pressure evolve into a complex pattern of (continuous) rarefactions, contact discontinuities and shock waves. The (relative) median test errors, presented in Table 1, reveal that shift-DeepONet and FNO significantly outperform DeepONet (and the other two baselines). FNO is also better than shift-DeepONet and approximates this complicated solution operator with a median error of ≈ 1.5%.

Four-Quadrant Riemann Problem. For the final numerical experiment, we consider the two-dimensional Euler equations (4.1) with initial data corresponding to a well-known four-quadrant Riemann problem (Mishra & Tadmor, 2011):

U_0(x, y) = U_sw if x, y < 0; U_0(x, y) = U_se if x > 0, y < 0; U_0(x, y) = U_nw if x < 0, y > 0; U_0(x, y) = U_ne if x, y > 0,

with states given by

ρ_0,ne = ρ_0,sw = p_0,ne = p_0,sw = 1.1, ρ_0,nw = ρ_0,se = 0.5065, p_0,nw = p_0,se = 0.35,
[u_0,ne, u_0,nw, v_0,ne, v_0,se] = 0.35 [G(z_1), G(z_2), G(z_3), G(z_4)],
[u_0,se, u_0,sw, v_0,nw, v_0,sw] = 0.8939 + 0.35 [G(z_5), G(z_6), G(z_7), G(z_8)],

with z = (z_1, z_2, ..., z_8) ∼ U([0, 1]⁸) and G(z) = 2z − 1. We seek to approximate the operator G : (ρ_0, ρ_0 u_0, ρ_0 v_0, E_0) ↦ E(·, t = 1.5). The training (and test) outputs are generated with the ALSVINN code on a spatial mesh resolution of 256² points. Examples of input-output pairs, presented in SM Figure 13, show that the initial planar discontinuities in the state variables evolve into a very complex structure of the total energy at the final time, with a mixture of curved and planar discontinuities separated by smooth regions. The (relative) median test errors are presented in Table 1.
We observe from this table that the errors of all models are significantly lower in this test case, possibly on account of the lower initial variance and the coarser mesh resolution at which the reference solution is sampled. However, the same trend in model performance is observed, i.e., DeepONet is significantly (more than seven-fold) worse than both shift-DeepONet and FNO. On the other hand, these two models approximate the underlying solution operator with a very low error of approximately 0.1%.

Related Work. Operator learning has recently been applied empirically to transport-dominated problems, including free-surface waves based on an "operator learning manifold hypothesis" (2022). However, with the notable exception of Lanthaler et al. (2022), where the approximation of scalar conservation laws with DeepONets is analyzed, theoretical results for the operator approximation of such PDEs are not available. Hence, this paper can be considered the first in which a rigorous analysis of approximating operators arising in PDEs with discontinuous solutions has been presented, particularly for FNOs. On the other hand, there is considerably more work on the neural network approximation of parametric nonlinear hyperbolic PDEs, such as the theoretical results of De Ryck & Mishra (2022a) and the empirical results of Lye et al. (2020; 2021). Also related are results with physics-informed neural networks (PINNs) for nonlinear hyperbolic conservation laws, such as De Ryck et al. (2022); Jagtap et al. (2022); Mao et al. (2020a). However, in this setting, the input measure is assumed to be supported on a finite-dimensional subset of the underlying infinite-dimensional input function space, making these approaches too restrictive for operator learning as described in this paper.

Conclusions.

A priori, it could be difficult to approximate operators that arise in PDEs with discontinuities. Given this context, we have proved a rigorous lower bound to show that any operator learning architecture based on linear reconstruction may fail at approximating the underlying operator efficiently. In particular, this result holds for the popular DeepONet architecture. On the other hand, we rigorously prove that the incorporation of nonlinear reconstruction mechanisms can break this lower bound and pave the way for efficient learning of operators arising from PDEs with discontinuities. We prove this result for an existing, widely used architecture, FNO, and for a novel variant of DeepONet that we term shift-DeepONet. For instance, we show that while the approximation error for DeepONets can decay, at best, linearly in terms of model size, the corresponding approximation errors for shift-DeepONet and FNO decay exponentially in terms of model size, even in the presence, or after spontaneous formation, of discontinuities. These theoretical results are backed by experimental results, where we show that FNO and shift-DeepONet consistently beat DeepONet and other ML baselines by a wide margin, for a variety of PDEs with discontinuities. Moreover, we also find theoretically (compare Theorems 3.2 and 3.3) that FNO is more efficient than even shift-DeepONet. This fact is also empirically confirmed in our experiments. The non-local as well as nonlinear structure of FNO is instrumental in ensuring its excellent performance in this context; see Theorem 3.3 and SM E.2 for further demonstration of the role of nonlinear reconstruction for FNOs.

Supplementary Material for:

Nonlinear Reconstruction for Operator Learning of PDEs with Discontinuities

A PRINCIPAL COMPONENT ANALYSIS

Principal component analysis (PCA) provides a complete answer to the following problem (see e.g. Bhattacharya et al. (2021); Lanthaler et al. (2022) and references therein for relevant results in the infinite-dimensional context): given a probability measure $\nu \in \mathrm{Prob}(\mathcal{Y})$ on a Hilbert space $\mathcal{Y}$ and given $p \in \mathbb{N}$, characterize the optimal linear subspace $V_p \subset \mathcal{Y}$ of dimension $p$ which minimizes the average projection error,
$$
\mathbb{E}_{w\sim\nu}\,\|w - \Pi_{V_p} w\|_{\mathcal{Y}}^2 \;=\; \min_{\dim(V_p)=p}\ \mathbb{E}_{w\sim\nu}\,\|w - \Pi_{V_p} w\|_{\mathcal{Y}}^2, \qquad \mathrm{(A.1)}
$$
where $\Pi_{V_p}$ denotes the orthogonal projection onto $V_p$, and the minimum is taken over all $p$-dimensional linear subspaces $V_p \subset \mathcal{Y}$.

Remark A.1. A characterization of the minimum in (A.1) is of relevance to the present work, since the outputs of DeepONet, and of other operator learning frameworks based on linear reconstruction $\mathcal{N}(u) = \sum_{k=1}^p \beta_k(u)\,\tau_k$, are restricted to the linear subspace $V_p := \mathrm{span}\{\tau_1,\dots,\tau_p\}$. From this, it follows that (Lanthaler et al. (2022)):
$$
\mathbb{E}_{u\sim\mu}\,\|\mathcal{G}(u) - \mathcal{N}(u)\|_{\mathcal{Y}}^2 \;\ge\; \mathbb{E}_{u\sim\mu}\,\|\mathcal{G}(u) - \Pi_{V_p}\mathcal{G}(u)\|_{\mathcal{Y}}^2 \;=\; \mathbb{E}_{w\sim\mathcal{G}_{\#}\mu}\,\|w - \Pi_{V_p} w\|_{\mathcal{Y}}^2,
$$
which is in turn lower bounded by the minimum in (A.1), with $\nu = \mathcal{G}_{\#}\mu$ the push-forward measure of $\mu$ under $\mathcal{G}$.

To characterize minimizers of (A.1), one introduces the covariance operator $\Gamma_\nu : \mathcal{Y} \to \mathcal{Y}$ by $\Gamma_\nu := \mathbb{E}_{w\sim\nu}[w \otimes w]$, where $\otimes$ denotes the tensor product. By definition, $\Gamma_\nu$ satisfies
$$
\langle v', \Gamma_\nu v\rangle_{\mathcal{Y}} = \mathbb{E}_{w\sim\nu}\left[\langle v', w\rangle_{\mathcal{Y}}\,\langle w, v\rangle_{\mathcal{Y}}\right], \qquad \forall\, v, v' \in \mathcal{Y}.
$$
It is well known that $\Gamma_\nu$ possesses a complete set of orthonormal eigenfunctions $\phi_1, \phi_2, \dots$ with corresponding eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge 0$. We then have the following result (see e.g. (Lanthaler et al., 2022, Thm. 3.8)):

Theorem A.2. A subspace $V_p \subset \mathcal{Y}$ is a minimizer of (A.1) if, and only if, $V_p = \mathrm{span}\{\phi_1,\dots,\phi_p\}$ can be written as the span of the first $p$ eigenfunctions of an orthonormal eigenbasis $\phi_1, \phi_2, \dots$ of the covariance operator $\Gamma_\nu$, with decreasing eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots$. Furthermore, the minimum in (A.1) is given by
$$
E_{\mathrm{opt}}^2 = \min_{\dim(V_p)=p}\ \mathbb{E}_{w\sim\nu}\,\|w - \Pi_{V_p} w\|_{\mathcal{Y}}^2 = \sum_{j>p}\lambda_j,
$$
in terms of the decay of the eigenvalues of $\Gamma_\nu$.
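In finite dimensions, Theorem A.2 can be checked directly with a few lines of NumPy. The sketch below is purely illustrative: a synthetic Gaussian measure with prescribed spectral decay (an assumption, standing in for the push-forward measure $\mathcal{G}_{\#}\mu$) is sampled, and the mean squared projection error onto the span of the leading $p$ covariance eigenvectors is compared with the sum of the trailing eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples w ~ nu in R^n: a Gaussian with rapidly decaying spectrum,
# standing in for the push-forward measure G_# mu (illustrative choice).
n, n_samples = 32, 5000
decay = 1.0 / (1 + np.arange(n)) ** 2            # prescribed eigenvalue decay
W = rng.standard_normal((n_samples, n)) * np.sqrt(decay)

# Empirical covariance operator Gamma_nu = E[w (x) w].
Gamma = W.T @ W / n_samples
eigvals, eigvecs = np.linalg.eigh(Gamma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort decreasingly

# Mean squared projection error onto the span of the first p eigenvectors...
p = 4
Vp = eigvecs[:, :p]
proj_err = np.mean(np.sum((W - W @ Vp @ Vp.T) ** 2, axis=1))

# ...equals the sum of the trailing eigenvalues (Theorem A.2).
tail = eigvals[p:].sum()
print(proj_err, tail)
```

The identity holds exactly for the empirical covariance, up to floating-point round-off, since $\mathbb{E}\|w - \Pi_{V_p} w\|^2 = \mathrm{tr}(\Gamma) - \mathrm{tr}(\Pi_{V_p}\Gamma) = \sum_{j>p}\lambda_j$.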

A.1 ILLUSTRATIVE EXAMPLE

To illustrate the close connection between the decay of the PCA eigenvalues and the sharpness of gradients in the output distribution, we consider the solution operator $\mathcal{G} : L^2(\mathbb{T}) \to L^2(\mathbb{T})$ of the advection equation $\partial_t u + a\partial_x u = 0$ with $a = 1$, mapping the initial data $\bar{u} \mapsto \mathcal{G}(\bar{u}) = u(t=1)$ to the solution at time $t = 1$. We consider the input probability measure $\mu^{(0)} \in \mathcal{P}(L^2)$, defined as the law of random indicator functions $\bar{u} = 1_{[-w/2,w/2]}(x-\xi)$ on the periodic torus $\mathbb{T} = [0,2\pi]$, where $w \in [\pi/4, 3\pi/4]$ and $\xi \in [0,2\pi]$ are drawn uniformly at random and independently. For $\delta > 0$, we define a "smoothened" probability measure $\mu^{(\delta)}$ on input functions, obtained by mollifying random draws $\bar{u}$ from $\mu^{(0)}$ against a Gaussian mollifier $g_\delta(x)$ with Fourier coefficients $\hat{g}_\delta(k) = \exp(-\delta^2 k^2)$, i.e. $\mu^{(\delta)} = \mathrm{law}\left(\bar{u} * g_\delta \mid \bar{u}\sim\mu^{(0)}\right)$. We denote by $\nu^{(\delta)} = \mathcal{G}_{\#}\mu^{(\delta)}$ the push-forward measure under the solution operator $\mathcal{G}$. Observing that $\nu^{(\delta)}$ is a translation-invariant probability measure, and following (Lanthaler et al., 2022, Proof of Lemma 4.14), we note that the PCA eigenbasis is given by the standard Fourier basis, and the eigenvalue associated with the eigenfunction $e^{ikx}$ is given by
$$
\lambda^{(\delta)}_k = \frac{2\pi}{\Delta w}\int_{\pi/4}^{3\pi/4} |\hat{\psi}^{(\delta)}_w(k)|^2\,\mathrm{d}w, \qquad \psi^{(\delta)}_w(x) = 1_{[-w/2,w/2]} * g_\delta(x), \qquad \mathrm{(A.2)}
$$
where $\hat{\psi}^{(\delta)}_w(k) = (2\pi)^{-1}\int_0^{2\pi}\psi^{(\delta)}_w(x)e^{-ikx}\,\mathrm{d}x$ denotes the $k$-th Fourier coefficient of $\psi^{(\delta)}_w$, and $\Delta w = \pi/2$ is a normalizing factor. We approximate $\lambda^{(\delta)}_k$ numerically by a (trapezoidal) quadrature in $w$ with $N_w = 4001$ quadrature points. For each value of $w$, we approximate the Fourier coefficients of $\psi^{(\delta)}_w$ on a fine grid of $N = 10^4$ equidistant points. The approximate PCA eigenvalues are finally obtained by sorting the so-computed eigenvalues in decreasing order and retaining only the first 1000, which yields a decreasing sequence $\lambda^{(\delta)}_1 \ge \lambda^{(\delta)}_2 \ge \dots \ge \lambda^{(\delta)}_{1000}$.
Figure 1 compares and contrasts the decay of the PCA eigenvalues for the probability measures ν (δ) with gradients at different length scales δ ∈ {0, 0.05, 0.2, 0.5}. We note that the theoretically predicted asymptotic λ k ∼ Ck -2 decay is recovered for δ = 0.
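The $\delta = 0$ asymptotics can be reproduced with a very small computation. The sketch below (illustrative grid sizes; not the exact setup used for Figure 1) uses the closed form $\hat{\psi}_w(k) = \sin(kw/2)/(\pi k)$ of the box Fourier coefficients and evaluates (A.2) by a simple quadrature in $w$, confirming the $\lambda_k \sim C k^{-2}$ decay.

```python
import numpy as np

# Fourier coefficient of the (unshifted) box psi_w = 1_{[-w/2, w/2]} on [0, 2*pi):
#   psi_hat_w(k) = (1/2pi) * int psi_w(x) e^{-ikx} dx = sin(k*w/2) / (pi*k)
def lam(k, n_w=2001):
    w = np.linspace(np.pi / 4, 3 * np.pi / 4, n_w)
    psi_hat_sq = (np.sin(k * w / 2) / (np.pi * k)) ** 2
    # lambda_k = (2*pi / dw) * int |psi_hat_w(k)|^2 dw, via a simple mean
    return 2 * np.pi * psi_hat_sq.mean()

ks = np.arange(10, 200)
lams = np.array([lam(k) for k in ks])
# For delta = 0, lambda_k * k^2 stays bounded between positive constants,
# consistent with the predicted lambda_k ~ C * k^{-2} decay.
print(lams[0] * ks[0] ** 2)
```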

B MEASURES OF COMPLEXITY FOR (SHIFT-)DEEPONET AND FNO

As pointed out in the main text, there are several hyperparameters which determine the complexity of DeepONet/shift-DeepONet and FNO, respectively. Table 2 summarizes quantities of major importance for (shift-)DeepONet and their (rough) analogues for FNO. These quantities are directly relevant to the expressive power and trainability of these operator learning architectures, and are described in further detail below.

Table 2:

                                      (shift-)DeepONet     FNO
    spatial resolution                m                    ∼ N^d
    "intrinsic" function space dim.   p                    ∼ k_max^d · d_v
    # trainable parameters            size(N)              size(N)
    depth                             depth(N)             depth(N)
    width                             width(N)             ∼ k_max^d · d_v

FNO: Quantities of interest for FNO include the number of grid points in each direction $N$ (for a total of $O(N^d)$ grid points), the Fourier cut-off $k_{\max}$ (we retain a total of $O(k_{\max}^d)$ Fourier coefficients in the convolution operator and bias), and the lifting dimension $d_v$. We recall that the lifting dimension determines the number of components of the input/output functions of the hidden layers, and hence the "intrinsic" dimensionality of the corresponding function space in the hidden layers is proportional to $d_v$. The essential information content of each of these $d_v$ components is encoded in their Fourier modes with wave numbers $|k| \le k_{\max}$ (a total of $O(k_{\max}^d)$ Fourier modes per component), and hence the total intrinsic function space dimension of the hidden layers is arguably of order $\sim k_{\max}^d \cdot d_v$. The width of an FNO layer is defined in analogy with conventional neural networks as the maximal width of the weight matrices and Fourier multiplier matrices, which is of order $\sim k_{\max}^d \cdot d_v$. The depth is defined as the number of hidden layers $L$. Finally, the size is by definition the total number of tunable parameters in the architecture.
By definition, the Fourier modes of the bias function $b_\ell(x)$ are restricted to wave numbers $|k| \le k_{\max}$ (giving a total of $O(k_{\max}^d d_v)$ parameters), and the Fourier multiplier matrix is restricted to wave numbers $|k| \le k_{\max}$ (giving $O(k_{\max}^d d_v^2)$ parameters). A priori, it is easily seen that if the lifting dimension $d_v$ is larger than the number of components of the input/output functions, then (Kovachki et al., 2021b)
$$
\mathrm{size}(\mathcal{N}_{\mathrm{FNO}}) \lesssim \left(d_v^2 + d_v^2 k_{\max}^d + d_v N^d\right)\mathrm{depth}(\mathcal{N}_{\mathrm{FNO}}), \qquad \mathrm{(B.1)}
$$
where the term $d_v N^d$ accounts for bias functions stored on the full spatial grid. However, if the biases are instead represented by their Fourier coefficients $\hat{b}_\ell(k)$ with $|k| \le k_{\max}$, each bias contributes only $O(d_v k_{\max}^d)$ degrees of freedom. This is of relevance in the regime $k_{\max} \ll N$, reducing the total FNO size from $O(d_v^2 N^d L)$ to
$$
\mathrm{size}(\mathcal{N}_{\mathrm{FNO}}) \lesssim d_v^2 k_{\max}^d\,\mathrm{depth}(\mathcal{N}_{\mathrm{FNO}}). \qquad \mathrm{(B.2)}
$$
Practically, this amounts to adding the bias in the hidden layers in Fourier $k$-space, rather than in physical $x$-space.
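The accounting behind (B.1) and (B.2) can be written out as a tiny parameter-count sketch. All numbers below are hypothetical, chosen only to illustrate how storing biases by their Fourier modes removes the grid-dependent $O(d_v N^d)$ term; this is rough bookkeeping, not an exact count for any specific FNO implementation.

```python
# Rough per-layer parameter count for an FNO, following (B.1)-(B.2).
def fno_size(d_v, k_max, N, d, depth, bias_in_fourier_space):
    per_layer = d_v * d_v                  # pointwise weight matrix
    per_layer += d_v * d_v * k_max ** d    # Fourier multiplier, |k| <= k_max
    if bias_in_fourier_space:
        per_layer += d_v * k_max ** d      # bias encoded by its Fourier modes
    else:
        per_layer += d_v * N ** d          # bias stored on the full grid
    return per_layer * depth

# Hypothetical hyperparameters, for illustration only.
big = fno_size(d_v=8, k_max=4, N=1024, d=1, depth=4, bias_in_fourier_space=False)
small = fno_size(d_v=8, k_max=4, N=1024, d=1, depth=4, bias_in_fourier_space=True)
print(big, small)  # the k-space bias variant drops the O(d_v * N^d) term
```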

C EXTENSIONS OF THEORETICAL RESULTS

In this section, we remark on two straight-forward extensions of our theoretical results in Section 3.

C.1 TIME-EVOLUTION

We first consider the approximation of the time-evolution $t \mapsto u(\cdot,t)$ for solutions of the linear advection equation
$$
\partial_t u + a\partial_x u = 0, \qquad u(\cdot, t=0) = \bar{u}, \qquad \mathrm{(C.1)}
$$
with $\bar{u}$ drawn from the probability measure $\mu$ specified in (3.2) in the main text. One way to apply neural operators in this time-evolution setting is by recursive application: one fixes $\Delta t > 0$ and learns an approximation $\mathcal{N}(\bar{u}) \approx \mathcal{S}_{\Delta t}(\bar{u})$ for the given time-step, with $\bar{u} \mapsto \mathcal{S}_{\Delta t}(\bar{u})$ the data-to-solution mapping $\bar{u} \mapsto u(\cdot,\Delta t)$ of the PDE (C.1). Given $\bar{u}$, an approximation of the time-evolution $u(\cdot,t_j)$ at the discrete time-steps $t_j = j\Delta t$ for $j = 1,\dots,N_T$, with $N_T\Delta t = T$, is then obtained by iterative evaluation, $u(\cdot,t_j) \approx \mathcal{N}^j(\bar{u}) := (\mathcal{N}\circ\cdots\circ\mathcal{N})(\bar{u})$ ($j$-fold composition). In this context, Theorems 3.1, 3.2 and 3.3 can be extended to show the following:

• For DeepONets, we have a lower bound
$$
\mathcal{E} = \sup_{j=1,\dots,N_T} \mathbb{E}_{\bar{u}\sim\mu}\,\|\mathcal{S}_{t_j}(\bar{u}) - \mathcal{N}_{\mathrm{DON}}^j(\bar{u})\|_{L^1} \ge \frac{C}{\min(m,p)},
$$
with $C > 0$ independent of $T, \Delta t$. Consequently, achieving an error $\mathcal{E} < \epsilon$ requires $p, m \gtrsim \epsilon^{-1}$ trunk-net basis functions and sensor points, entailing a lower size bound $\mathrm{size}(\mathcal{N}_{\mathrm{DON}}) \gtrsim mp \gtrsim \epsilon^{-2}$.

• For shift-DeepONets, there exists a constant $C > 0$, independent of $\Delta t$ and $N_T$, such that for any $\epsilon > 0$ there exists a shift-DeepONet $\mathcal{N}_{\mathrm{sDON}}$ with
$$
\mathcal{E} = \sup_{j=1,\dots,N_T} \mathbb{E}_{\bar{u}\sim\mu}\,\|\mathcal{S}_{t_j}(\bar{u}) - \mathcal{N}_{\mathrm{sDON}}^j(\bar{u})\|_{L^1} \le \epsilon,
$$
with uniformly bounded $p \le C$, number of sensor points $m \le CN_T/\epsilon$, and $\mathrm{size}(\mathcal{N}_{\mathrm{sDON}}) \le CN_T/\epsilon$.

• For FNO, there exists $C > 0$, independent of $\Delta t$ and $N_T$, such that for any $\epsilon > 0$ there exists an FNO $\mathcal{N}_{\mathrm{FNO}}$ with
$$
\mathcal{E} = \sup_{j=1,\dots,N_T} \mathbb{E}_{\bar{u}\sim\mu}\,\|\mathcal{S}_{t_j}(\bar{u}) - \mathcal{N}_{\mathrm{FNO}}^j(\bar{u})\|_{L^1} \le \epsilon,
$$
with grid size $N \le CN_T/\epsilon$, Fourier cut-off $k_{\max} = 1$, lifting dimension $d_v \le C$, and $\mathrm{size}(\mathcal{N}_{\mathrm{FNO}}) \le C\log(N_T/\epsilon)^2$.

The additional factors of $N_T$ in these estimates stem from the fact that the errors over iterative time-steps accumulate, requiring an accuracy of order $\epsilon/N_T$ per time-step in order to achieve a cumulative error of at most $\epsilon$.
The above results imply that for any fixed choice of ∆t and T = N T ∆t, shift-DeepONets and FNO can approximate the time-evolution of (C.1) more efficiently than DeepONets. As an avenue for future work, it would be interesting to consider the approximation of the solution operator S : ū → u, where the output function u = u(x, t) depends on both position x, as well as on time t. But this is outside of the scope of the present work.
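The error accumulation behind the $N_T$ factors can be made concrete in a toy computation. In the sketch below (an illustrative stand-in, not an actual neural operator), the one-step map is exact periodic transport via `np.roll` plus a deliberate per-step bias `eps`; iterating $N_T$ times produces a final error of exactly $N_T \cdot \epsilon$, which is why the bounds above require per-step accuracy $\epsilon/N_T$.

```python
import numpy as np

# Recursive time-stepping u(t_j) ~ N^j(u0) for u_t + u_x = 0 on a periodic
# grid; the surrogate N is exact transport plus a constant per-step bias.
N_grid, N_T, eps = 250, 50, 1e-3
x = np.linspace(0, 2 * np.pi, N_grid, endpoint=False)
shift = N_grid // N_T                       # grid-aligned shift a * dt
u0 = np.where((x > 1.0) & (x < 2.0), 1.0, 0.0)

u_exact, u_approx, errs = u0.copy(), u0.copy(), []
for j in range(N_T):
    u_exact = np.roll(u_exact, shift)
    u_approx = np.roll(u_approx, shift) + eps   # one-step error of size eps
    errs.append(np.abs(u_exact - u_approx).mean())

# Worst-case per-step errors add up linearly: final error = N_T * eps.
print(errs[-1])
```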

C.2 EXTENSION TO HIGHER DIMENSIONS

While the analysis becomes considerably more cumbersome in higher dimensions, the main insights of this work also apply to problems on higher-dimensional domains, as we indicate for the linear advection example in the following. We consider the PDE
$$
\partial_t u + \sum_{j=1}^d a_j \partial_{x_j} u = 0, \qquad u(\cdot, t=0) = \bar{u}, \qquad \mathrm{(C.2)}
$$
where $\bar{u} = \bar{u}(x_1,\dots,x_d)$ is a function defined on the $d$-dimensional torus $\mathbb{T}^d$, which we assume to be given by a random box wave of the form
$$
\bar{u}(x_1,\dots,x_d) = h\prod_{j=1}^d 1_{[-w_j/2,\,w_j/2]}(x_j - \xi_j). \qquad \mathrm{(C.3)}
$$
Here $h \in [\underline{h}, \overline{h}]$, $w_j \in [\underline{w}, \overline{w}]$, and $\xi_j \in [0,2\pi]$ are independent, uniform random variables. As in the one-dimensional case, we consider the input probability measure $\mu$ defined as the law of this random box wave in $d$ dimensions. Fixing $T > 0$, and by a slight abuse of notation, we write the solution operator of (C.2) as $\mathcal{G}_{\mathrm{adv}}(\bar{u}) = u(\cdot,T)$.

DeepONet: Following an analysis analogous to the one-dimensional case, it can be shown that the $d$-dimensional Fourier basis provides an optimal PCA basis for the push-forward measure $\mathcal{G}_{\mathrm{adv},\#}\mu$. Furthermore, an argument based on the decay of the Fourier coefficients of the box wave (C.3) implies that the PCA eigenvalues $\lambda_k$ satisfy the lower bound $\lambda_k \gtrsim k^{-2}$. In particular, this provides a similar lower bound on the number of required basis functions $p$ for the DeepONet approximation also in the $d$-dimensional case. Furthermore, an extension of the argument in the proof of Proposition D.10 also provides a similar lower bound in terms of $m$: if $\operatorname{ess\,sup}_{\bar{u}\sim\mu}\|\mathcal{G}_{\mathrm{adv}}(\bar{u})\|_{L^\infty} \le M$, then
$$
\mathcal{E} = \mathbb{E}_{\bar{u}\sim\mu}\,\|\mathcal{G}_{\mathrm{adv}}(\bar{u}) - \mathcal{N}_{\mathrm{DON}}(\bar{u})\|_{L^1} \ge \frac{C}{\min(m,p)},
$$
for a constant $C = C(M) > 0$ that is independent of $p$ and $T$. Again, this shows that a large number of basis functions is necessary to approximate $\mathcal{G}_{\mathrm{adv}}$ by a DeepONet. We note that we can only establish a lower bound $\gtrsim 1/m$, rather than the bound $\gtrsim 1/m^{1/d}$ one might have expected from the "curse of dimensionality" associated with higher-dimensional problems.
shift-DeepONet: In contrast to the case of DeepONet, for shift-DeepONet it can be shown that there exists a constant $C = C(d) > 0$, such that for any $\epsilon > 0$ there exists a shift-DeepONet $\mathcal{N}_{\mathrm{sDON}}$ with
$$
\mathcal{E} = \mathbb{E}_{\bar{u}\sim\mu}\,\|\mathcal{G}_{\mathrm{adv}}(\bar{u}) - \mathcal{N}_{\mathrm{sDON}}(\bar{u})\|_{L^1} \le \epsilon,
$$
and with a bounded number of basis functions $p \le C$, a number of sensor points $m \le C\epsilon^{-1}$, and $\mathrm{size}(\mathcal{N}_{\mathrm{sDON}}) \le C\epsilon^{-1}$.

Sketch of proof:

The idea underlying this construction is to first reduce the multi-dimensional problem to $d$ one-dimensional problems, to apply the known result in the one-dimensional case (along each coordinate), and to finally reconstruct the output wave-form from the solutions of the one-dimensional problems. To reduce to the one-dimensional case, we fix a coarse uniform grid $x^{\mathrm{coarse}}_\ell$, $\ell = 1,\dots,N_{\mathrm{coarse}}$, on $\mathbb{T} = [0,2\pi)$ with step size $< \underline{w}$ (smaller than the smallest possible box width in (C.3)), as well as a fine uniform grid $x^{\mathrm{fine}}_\ell$, $\ell = 1,\dots,N_{\mathrm{fine}}$, on $\mathbb{T} = [0,2\pi)$ with step size $\sim \epsilon$. In terms of these one-dimensional grids, and for any given coordinate direction $j \in \{1,\dots,d\}$, we fix sensor points
$$
x^{(j)}_{\ell_1,\dots,\ell_d} = \left(x^{\mathrm{coarse}}_{\ell_1},\dots,x^{\mathrm{coarse}}_{\ell_{j-1}},\ x^{\mathrm{fine}}_{\ell_j},\ x^{\mathrm{coarse}}_{\ell_{j+1}},\dots,x^{\mathrm{coarse}}_{\ell_d}\right) \in \mathbb{T}^d,
$$
where $\ell_j = 1,\dots,N_{\mathrm{fine}}$, and where the other $\ell_k$ ($k \ne j$) run over the coarse index set $\ell_k = 1,\dots,N_{\mathrm{coarse}}$. Since $N_{\mathrm{coarse}}$ is independent of $\epsilon$, while $N_{\mathrm{fine}} \sim \epsilon^{-1}$, the total number $m$ of sensor points
$$
\left\{x^{(j)}_{\ell_1,\dots,\ell_d} \;\middle|\; j \in \{1,\dots,d\},\ \ell_j \in \{1,\dots,N_{\mathrm{fine}}\},\ \ell_k \in \{1,\dots,N_{\mathrm{coarse}}\} \text{ for } k \ne j\right\}
$$
can be bounded by $m \le C\epsilon^{-1}$, where $C = C(d,\underline{w})$ is independent of $\epsilon$. Importantly for the reduction to the one-dimensional case:

• the height $h$ of the $d$-dimensional box wave can be obtained by taking the maximum over the sensor values $\bar{u}(x^{(j)}_{\ell_1,\dots,\ell_d})$;

• for each coordinate direction $j$, the sensor values along the fine grid encode (after rescaling by $h$) the one-dimensional box profile,
$$
\bar{u} = h\prod_{k=1}^d 1_{[-w_k/2,\,w_k/2]}(x_k - \xi_k) \;\mapsto\; 1_{[-w_j/2,\,w_j/2]}(x_j - \xi_j),
$$
where the output function is encoded by evaluation at the fine grid points $x^{\mathrm{fine}}_\ell$, $\ell = 1,\dots,N_{\mathrm{fine}}$.

Given this reduction to the one-dimensional case, we can then apply our one-dimensional results to find suitable approximations of $h, w_j, \xi_j$ ($j = 1,\dots,d$), as in the one-dimensional case.
Based on this, we construct an approximation of the box wave solution $u(\cdot,T) = h\prod_{j=1}^d 1_{[-w_j/2,w_j/2]}(x_j - \xi_j - a_jT)$ by defining the trunk net as a suitable approximation of the $d$-dimensional unit box, $\tau(x) \approx \prod_{j=1}^d 1_{[-1,1]}(x_j)$, defining the scale-net to scale each coordinate direction by $A(\bar{u}) \approx \mathrm{diag}(1/w_1,\dots,1/w_d)$, setting the shift-net to be a shift by $\gamma(\bar{u}) \approx (\xi_j + a_jT)_{j=1}^d$, and finally defining the branch net to be equal to the box height, $\beta(\bar{u}) = h$, so that
$$
u(\cdot,T) = h\prod_{j=1}^d 1_{[-w_j/2,\,w_j/2]}(x_j - \xi_j - a_jT) \approx \beta(\bar{u})\,\tau\!\left(A(\bar{u})\cdot x + \gamma(\bar{u})\right)
$$
provides the desired approximation of the box wave solution. We will not provide the precise details and required estimates here.
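The reduction to per-axis one-dimensional problems can be illustrated numerically. The sketch below is a toy, idealized (non-network) version for $d = 2$ with arbitrary hypothetical parameters: the height is recovered by a maximum over samples, and each width by a one-dimensional slice through the location of the maximum, mirroring the fine-grid-along-one-coordinate sensor layout.

```python
import numpy as np

n = 512
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
dx = 2 * np.pi / n
h, w, xi = 1.7, (1.1, 2.0), (2.5, 4.0)      # hypothetical box parameters

def box(z, wj, xij):
    # periodic indicator 1_{[-w/2, w/2]}(z - xi) on the torus [0, 2*pi)
    return np.abs((z - xij + np.pi) % (2 * np.pi) - np.pi) <= wj / 2

X, Y = np.meshgrid(x, x, indexing="ij")
u = h * box(X, w[0], xi[0]) * box(Y, w[1], xi[1])

h_rec = u.max()                              # height via max over sensor values
i0, j0 = np.unravel_index(np.argmax(u), u.shape)
w1_rec = (u[:, j0] > 0).sum() * dx           # 1-d slice along x_1
w2_rec = (u[i0, :] > 0).sum() * dx           # 1-d slice along x_2
print(h_rec, w1_rec, w2_rec)
```

The widths are recovered up to the grid spacing $O(dx)$, consistent with the $m \sim \epsilon^{-1}$ sensor-count scaling in the sketch above.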

FNO:

Similarly, for FNO it can be shown that there exists a constant $C = C(d) > 0$, such that for any $\epsilon > 0$ there exists an FNO $\mathcal{N}_{\mathrm{FNO}}$ with
$$
\mathcal{E} = \mathbb{E}_{\bar{u}\sim\mu}\,\|\mathcal{G}_{\mathrm{adv}}(\bar{u}) - \mathcal{N}_{\mathrm{FNO}}(\bar{u})\|_{L^1} \le \epsilon,
$$
and with a bounded truncation parameter $k_{\max} = 1$, lifting dimension $d_v \le C$, a number of grid points $N \le C\epsilon^{-d}$, and $\mathrm{size}(\mathcal{N}_{\mathrm{FNO}}) \le C\log(\epsilon^{-1})^2$.

Sketch of proof:

The idea is very similar to the case of shift-DeepONet, and follows by a reduction to the one-dimensional case. Note that in this case, a fine grid needs to be chosen in each coordinate direction (requiring many grid points, of order $\sim \epsilon^{-d}$), but crucially the number of weights and biases of the FNO architecture is independent of this discretization parameter; hence this does not affect the overall size. In the case of FNOs, we furthermore note that the box wave $\bar{u}(x) = h\prod_{j=1}^d 1_{[-w_j/2,w_j/2]}(x_j - \xi_j)$ can be uniquely reconstructed (by a nonlinear reconstruction procedure) from knowledge of its Fourier coefficients $\mathcal{F}\bar{u}(k_1,\dots,k_d)$, for $k_1,\dots,k_{j-1},k_{j+1},\dots,k_d = 0$, $k_j \in \{-1,0,1\}$, and $j \in \{1,\dots,d\}$, following the same approach detailed in the one-dimensional case in SM D.4 below. Therefore, given the Fourier coefficients $\mathcal{F}\bar{u}(k) = \mathcal{F}\bar{u}(k_1,\dots,k_d)$ for $k = (k_1,\dots,k_d)$ with $|k| \le 1$, we can first compute the corresponding Fourier coefficients of the solution $u(\cdot,T) = \bar{u}(\cdot - aT)$ by a simple phase shift, $\mathcal{F}[u(\cdot,T)](k) = \mathcal{F}\bar{u}(k)\,e^{-i(k\cdot a)T}$, and then reconstruct $\bar{u}(\cdot - aT)$ from knowledge of these Fourier coefficients. Thus $k_{\max} = 1$ also suffices in this case (corresponding to the $\sim 3d$ retained Fourier coefficients). The lifting dimension $d_v$ of the detailed construction outlined above scales linearly in the dimension $d$ of the domain, but is independent of $\epsilon$.
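The nonlinear reconstruction from $|k| \le 1$ Fourier modes can be demonstrated in one dimension (a minimal sketch under assumed toy parameters, not the FNO construction itself): for $u(x) = h\,1_{[-w/2,w/2]}(x-\xi)$ one has $\mathcal{F}u(1) = (h/\pi)\sin(w/2)\,e^{-i\xi}$, so the phase of a single low mode encodes the shift and, given $h$, its modulus encodes the width.

```python
import numpy as np

n = 4096
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
h, w, xi = 1.3, 1.5, 2.2                     # hypothetical box parameters
u = h * (np.abs((x - xi + np.pi) % (2 * np.pi) - np.pi) <= w / 2)

# F_u(1) = (1/2pi) * int_0^{2pi} u(x) e^{-ix} dx, via a Riemann sum.
Fu1 = (u * np.exp(-1j * x)).mean()

xi_rec = (-np.angle(Fu1)) % (2 * np.pi)      # shift from the phase
w_rec = 2 * np.arcsin(np.pi * np.abs(Fu1) / h)   # width from the modulus
print(xi_rec, w_rec)
```

Both parameters are recovered up to the quadrature error $O(1/n)$; no amount of *linear* processing of these two coefficients could reproduce the sharp box profile, which is precisely the point of the nonlinear reconstruction step.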

D MATHEMATICAL DETAILS

In this section, we provide detailed proofs of the theorems in Section 3. We start with some preliminary results.

D.1 RELU DNN BUILDING BLOCKS

Published as a conference paper at ICLR 2023

In the present section, we collect several basic constructions for ReLU neural networks, which will be used as building blocks in the subsequent analysis. For the first result, we note that for any $\delta > 0$, the approximate step function
$$
\zeta_\delta(x) := \begin{cases} 0, & x < 0, \\ x/\delta, & 0 \le x \le \delta, \\ 1, & x > \delta, \end{cases}
$$
can be represented by a neural network,
$$
\zeta_\delta(x) = \sigma\!\left(\frac{x}{\delta}\right) - \sigma\!\left(\frac{x-\delta}{\delta}\right),
$$
where $\sigma(x) = \max(x,0)$ denotes the ReLU activation function. Introducing an additional shift $\xi$, multiplying the output by $h$, and choosing $\delta > 0$ sufficiently small, we obtain the following result:

Proposition D.1 (Step function). Fix an interval $[a,b] \subset \mathbb{R}$, $\xi \in [a,b]$, $h \in \mathbb{R}$, and let $h\,1_{[x>\xi]}$ be a step function of height $h$. For any $\epsilon > 0$ and $p \in [1,\infty)$, there exists a ReLU neural network $\Phi_\epsilon : \mathbb{R} \to \mathbb{R}$ such that $\mathrm{depth}(\Phi_\epsilon) = 1$, $\mathrm{width}(\Phi_\epsilon) = 2$, and $\|\Phi_\epsilon - h\,1_{[x>\xi]}\|_{L^p([a,b])} \le \epsilon$.

The following proposition, an approximation of the indicator function of an interval, is an immediate consequence of the previous one, obtained by considering the linear combination $\Phi_\delta(x-a) - \Phi_\delta(x-b)$ with a suitable choice of $\delta > 0$.

A useful mathematical technique to glue together local approximations of a given function rests on the use of a "partition of unity". In the following proposition, we recall that partitions of unity can be constructed with ReLU neural networks (this construction has previously been used by Yarotsky (2017)): given grid points $a = x_0 < x_1 < \dots < x_J = b$ and $\epsilon > 0$ sufficiently small, there exists a ReLU network $\Lambda = (\Lambda_1,\dots,\Lambda_J) : \mathbb{R} \to \mathbb{R}^J$ with $\mathrm{width}(\Lambda) = 4J$, $\mathrm{depth}(\Lambda) = 1$, such that each $\Lambda_j$ is piecewise linear, satisfies
$$
\Lambda_j(x) = \begin{cases} 0, & x \le x_{j-1} - \epsilon, \\ 1, & x_{j-1} + \epsilon \le x \le x_j - \epsilon, \\ 0, & x \ge x_j + \epsilon, \end{cases}
$$
and interpolates linearly between the values 0 and 1 on the intervals $[x_{j-1}-\epsilon, x_{j-1}+\epsilon]$ and $[x_j-\epsilon, x_j+\epsilon]$. In particular, this implies that

• $\mathrm{supp}(\Lambda_j) \subset [x_{j-1}-\epsilon,\, x_j+\epsilon]$ for all $j = 1,\dots,J$,
• $\Lambda_j(x) \ge 0$ for all $x \in \mathbb{R}$,
• the $\{\Lambda_j\}_{j=1,\dots,J}$ form a partition of unity, i.e. $\sum_{j=1}^J \Lambda_j(x) = 1$ for all $x \in [a,b]$.

We also recall the well-known fact that the multiplication operation $(x,y) \mapsto xy$ can be efficiently approximated by ReLU neural networks (cp.
Yarotsky (2017)):

Proposition D.4 (Multiplication, (Yarotsky, 2017, Prop. 3)). There exists a constant $C > 0$ such that for any $\epsilon \in (0,\tfrac12]$ and $M \ge 2$, there exists a neural network $\widehat{\times}_{\epsilon,M} : [-M,M]\times[-M,M] \to \mathbb{R}$, such that $\mathrm{width}(\widehat{\times}_{\epsilon,M}) \le C$, $\mathrm{depth}(\widehat{\times}_{\epsilon,M}) \le C\log(M\epsilon^{-1})$, $\mathrm{size}(\widehat{\times}_{\epsilon,M}) \le C\log(M\epsilon^{-1})$, and
$$
\sup_{x,y\in[-M,M]} \left|\widehat{\times}_{\epsilon,M}(x,y) - xy\right| \le \epsilon.
$$

We next state a general approximation result for the approximation of analytic functions by ReLU neural networks. To this end, we first recall:

Definition D.5 (Analytic function and extension). A function $F : (\alpha,\beta) \to \mathbb{R}$ is analytic if for any $x_0 \in (\alpha,\beta)$ there exist a radius $r > 0$ and a sequence $(a_k)_{k\in\mathbb{N}_0}$ such that
$$
\sum_{k=0}^\infty |a_k| r^k < \infty, \qquad F(x) = \sum_{k=0}^\infty a_k (x-x_0)^k, \quad \forall\, |x - x_0| < r.
$$
If $f : [a,b] \to \mathbb{R}$ is a function, then we say that $f$ has an analytic extension if there exists an analytic $F : (\alpha,\beta) \to \mathbb{R}$ with $[a,b] \subset (\alpha,\beta)$ and $F(x) = f(x)$ for all $x \in [a,b]$.

We then have the following approximation bound, which extends the main result of Wang et al. (2018). In contrast to Wang et al. (2018), the following theorem applies to analytic functions without a globally convergent series expansion.

Theorem D.6. Assume that $f : [a,b] \to \mathbb{R}$ has an analytic extension. Then there exist constants $C, \gamma > 0$, depending only on $f$, such that for any $L \in \mathbb{N}$ there exists a ReLU neural network $\Phi_L : \mathbb{R} \to \mathbb{R}$ with
$$
\sup_{x\in[a,b]} |f(x) - \Phi_L(x)| \le C\exp(-\gamma L^{1/2}),
$$
and such that $\mathrm{depth}(\Phi_L) \le CL$, $\mathrm{width}(\Phi_L) \le C$.

Proof. Since $f$ has an analytic extension, for any $x \in [a,b]$ there exist a radius $r_x > 0$ and an analytic function $F_x : [x-r_x, x+r_x] \to \mathbb{R}$ which extends $f$ locally. By the main result of (Wang et al., 2018, Thm. 6), there are constants $C_x, \gamma_x > 0$, depending only on $x \in [a,b]$, such that for any $L \in \mathbb{N}$ there exists a ReLU neural network $\Phi_{x,L} : \mathbb{R} \to \mathbb{R}$ such that
$$
\sup_{|\xi - x| \le r_x} |F_x(\xi) - \Phi_{x,L}(\xi)| \le C_x\exp(-\gamma_x L^{1/2}),
$$
and $\mathrm{depth}(\Phi_{x,L}) \le C_x L$, $\mathrm{width}(\Phi_{x,L}) \le C_x$.
For $J \in \mathbb{N}$, set $\Delta x = (b-a)/J$ and consider the equidistant partition $x_j := a + j\Delta x$ of $[a,b]$. Since the compact interval $[a,b]$ can be covered by finitely many of the intervals $(x - r_x, x + r_x)$, by choosing $\Delta x$ sufficiently small we can find $x^{(j)} \in [a,b]$ such that $[x_{j-1}, x_j] \subset (x^{(j)} - r_{x^{(j)}},\, x^{(j)} + r_{x^{(j)}})$ for each $j = 1,\dots,J$. By construction, this implies that for $\overline{C} := \max_{j=1,\dots,J} C_{x^{(j)}}$ and $\overline{\gamma} := \min_{j=1,\dots,J} \gamma_{x^{(j)}}$, we have: for any $L \in \mathbb{N}$, there exist neural networks $\Phi_{j,L}\,(= \Phi_{x^{(j)},L}) : \mathbb{R} \to \mathbb{R}$ such that
$$
\sup_{x\in[x_{j-1},x_j]} |f(x) - \Phi_{j,L}(x)| \le \overline{C}\exp(-\overline{\gamma}L^{1/2}),
$$
and such that $\mathrm{depth}(\Phi_{j,L}) \le \overline{C}L$, $\mathrm{width}(\Phi_{j,L}) \le \overline{C}$. Let now $\Lambda : \mathbb{R} \to \mathbb{R}^J$ be the partition-of-unity network from Proposition D.3, and define
$$
\Phi_L(x) := \sum_{j=1}^J \widehat{\times}_{M,\epsilon}\left(\Lambda_j(x),\, \Phi_{j,L}(x)\right),
$$
where $\widehat{\times}_{M,\epsilon}$ denotes the multiplication network from Proposition D.4, with $M := 1 + \overline{C}\exp(-\overline{\gamma}) + \sup_{x\in[a,b]}|f(x)|$ and $\epsilon := J^{-1}\overline{C}\exp(-\overline{\gamma}L^{1/2})$. Then we have
$$
\mathrm{depth}(\Phi_L) \le \mathrm{depth}(\widehat{\times}_{M,\epsilon}) + \mathrm{depth}(\Lambda) + \max_{j=1,\dots,J}\mathrm{depth}(\Phi_{j,L}) \le C'\log(M\epsilon^{-1}) + 1 + \overline{C}L \le C(1+L),
$$
where the constant $C > 0$ in the last bound depends on $\sup_{x\in[a,b]}|f(x)|$, $\overline{\gamma}$ and $\overline{C}$, but is independent of $L$. Similarly, we find that
$$
\mathrm{width}(\Phi_L) \le \mathrm{width}(\Lambda) + \max_{j=1,\dots,J}\mathrm{width}(\Phi_{j,L}) \le 4J + \overline{C}
$$
is bounded independently of $L$. After potentially enlarging the constant $C > 0$, we can thus ensure that $\mathrm{depth}(\Phi_L) \le CL$, $\mathrm{width}(\Phi_L) \le C$, with a constant $C > 0$ that depends only on $f$ but is independent of $L$. Finally, we note that
$$
|\Phi_L(x) - f(x)| \le \sum_{j=1}^J \left|\widehat{\times}_{M,\epsilon}(\Lambda_j(x), \Phi_{j,L}(x)) - \Lambda_j(x)\Phi_{j,L}(x)\right| + \sum_{j=1}^J \Lambda_j(x)\,|\Phi_{j,L}(x) - f(x)|.
$$
By construction of $\widehat{\times}_{M,\epsilon}$, and since $|\Phi_{j,L}(x)| \le M$ and $\Lambda_j \le M$, the first sum can be bounded by $J\epsilon = \overline{C}\exp(-\overline{\gamma}L^{1/2})$. Furthermore, each term in the second sum is bounded by $\Lambda_j(x)\,\overline{C}\exp(-\overline{\gamma}L^{1/2})$, and hence
$$
\sum_{j=1}^J \Lambda_j(x)|\Phi_{j,L}(x) - f(x)| \le \left(\sum_{j=1}^J \Lambda_j(x)\right)\overline{C}\exp(-\overline{\gamma}L^{1/2}) = \overline{C}\exp(-\overline{\gamma}L^{1/2}),
$$
for all $x \in [a,b]$.
We conclude that $\sup_{x\in[a,b]}|\Phi_L(x) - f(x)| \le 2\overline{C}\exp(-\overline{\gamma}L^{1/2})$, with constants $\overline{C}, \overline{\gamma} > 0$ independent of $L$. Setting $\gamma := \overline{\gamma}$, and after potentially enlarging the constant $C > 0$ further, we thus conclude: there exist $C, \gamma > 0$ such that for any $L \in \mathbb{N}$, there exists a neural network $\Phi_L : \mathbb{R} \to \mathbb{R}$ with $\mathrm{depth}(\Phi_L) \le CL$, $\mathrm{width}(\Phi_L) \le C$, such that $\sup_{x\in[a,b]}|\Phi_L(x) - f(x)| \le C\exp(-\gamma L^{1/2})$. This concludes the proof of Theorem D.6.

By combining a suitable ReLU neural network approximation of division $a \mapsto 1/a$ based on Theorem D.6 (division is an analytic function away from 0) with the approximation of multiplication by Yarotsky (2017) (cp. Proposition D.4 above), we can also state the following result:

Proposition D.7 (Division). Let $0 < a \le b$ be given. Then there exists $C = C(a,b) > 0$ such that for any $\epsilon \in (0,\tfrac12]$, there exists a ReLU network $\widehat{\div}_{a,b,\epsilon} : \mathbb{R}\times\mathbb{R} \to \mathbb{R}$ with
$$
\mathrm{depth}(\widehat{\div}_{a,b,\epsilon}) \le C\log(\epsilon^{-1})^2, \qquad \mathrm{width}(\widehat{\div}_{a,b,\epsilon}) \le C, \qquad \mathrm{size}(\widehat{\div}_{a,b,\epsilon}) \le C\log(\epsilon^{-1})^2,
$$
satisfying
$$
\sup_{x,y\in[a,b]} \left|\widehat{\div}_{a,b,\epsilon}(x;y) - \frac{x}{y}\right| \le \epsilon.
$$

We end this section with the following result.

Lemma D.8. There exists a constant $C > 0$ such that for any $\epsilon > 0$, there exists a neural network $\Xi_\epsilon : \mathbb{R}^2 \to \mathbb{R}$, such that
$$
\sup_{\xi\in[0,2\pi-\epsilon]} |\xi - \Xi_\epsilon(\cos(\xi),\sin(\xi))| \le \epsilon,
$$
with $\Xi_\epsilon(\cos(\xi),\sin(\xi)) \in [0,2\pi]$ for all $\xi \in [0,2\pi]$, and such that $\mathrm{depth}(\Xi_\epsilon) \le C\log(\epsilon^{-1})^2$, $\mathrm{width}(\Xi_\epsilon) \le C$.

Sketch of proof. We can divide up the unit circle $\{(\cos(\xi),\sin(\xi)) \mid \xi \in [0,2\pi]\}$ into 5 subsets, where
$$
\begin{cases}
\xi \in [0,\pi/4] &\iff x \ge 1/\sqrt{2},\ y \ge 0, \\
\xi \in (\pi/4, 3\pi/4] &\iff y > 1/\sqrt{2}, \\
\xi \in [3\pi/4, 5\pi/4] &\iff x \le -1/\sqrt{2}, \\
\xi \in (5\pi/4, 7\pi/4) &\iff y < -1/\sqrt{2}, \\
\xi \in [7\pi/4, 2\pi) &\iff x \ge 1/\sqrt{2},\ y < 0.
\end{cases}
$$
On each of these subsets, one of the mappings
$$
x \in \left[-\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right] \to \mathbb{R},\quad x = \cos(\xi) \mapsto \xi, \qquad \text{or} \qquad y \in \left[-\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right] \to \mathbb{R},\quad y = \sin(\xi) \mapsto \xi,
$$
is analytic, and can hence be approximated efficiently by ReLU networks via Theorem D.6; the local approximations are then glued together by a partition of unity (cp. Proposition D.3).

D.2 PROOF OF THEOREM 3.1

The proof of this theorem follows from the following two propositions.

Proposition D.9 (Lower bound in p). Consider the solution operator
$$
\mathcal{G}_{\mathrm{adv}} : L^1 \cap L^\infty(\mathbb{T}) \to L^1 \cap L^\infty(\mathbb{T})
$$
of the advection equation, with input measure $\mu$, and let $\mathcal{N}(\bar{u}) = \sum_{k=1}^p \beta_k(\bar{u})\tau_k$ be an operator approximation based on linear reconstruction with $p$ basis functions. If $\sup_{\bar{u}\sim\mu}\|\mathcal{N}(\bar{u})\|_{L^\infty} \le M < \infty$, then
$$
\mathbb{E}_{\bar{u}\sim\mu}\left[\|\mathcal{N}(\bar{u}) - \mathcal{G}_{\mathrm{adv}}(\bar{u})\|_{L^1}\right] \ge \frac{C}{p}.
$$

Sketch of proof. The argument is an almost exact repetition of the lower bound derived in Lanthaler et al. (2022), and therefore we only outline the main steps here. Since the measure $\mu$ is translation-invariant, it can be shown that the optimal PCA eigenbasis with respect to the $L^2(\mathbb{T})$-norm (cp. SM A) is the Fourier basis. Consider the (complex) Fourier basis $\{e^{ikx}\}_{k\in\mathbb{Z}}$, and denote the corresponding eigenvalues $\{\widehat{\lambda}_k\}_{k\in\mathbb{Z}}$: the $k$-th eigenvalue $\widehat{\lambda}_k$ of the covariance operator $\Gamma_\mu = \mathbb{E}_{u\sim\mu}[u\otimes u]$ satisfies $\Gamma_\mu e^{-ikx} = \widehat{\lambda}_k e^{-ikx}$. A short calculation, as in (Lanthaler et al., 2022, Proof of Lemma 4.14), then shows that
$$
\widehat{\lambda}_k = \int_{\underline{h}}^{\overline{h}}\int_{\underline{w}}^{\overline{w}} h^2\,|\widehat{\psi}_w(k)|^2\,\frac{\mathrm{d}w}{\Delta w}\,\frac{\mathrm{d}h}{\Delta h} \ge \underline{h}^2\int_{\underline{w}}^{\overline{w}} |\widehat{\psi}_w(k)|^2\,\frac{\mathrm{d}w}{\Delta w},
$$
where $\widehat{\psi}_w(k)$ denotes the $k$-th Fourier coefficient of $\psi_w(x) := 1_{[-w/2,w/2]}(x)$. Since $\psi_w(x)$ has a jump discontinuity of size 1 for any $w > 0$, it follows from basic Fourier analysis that the asymptotic decay is $|\widehat{\psi}_w(k)| \sim C/|k|$, and hence there exists a constant $C = C(\underline{h},\underline{w},\overline{w}) > 0$ such that $\widehat{\lambda}_k \ge C|k|^{-2}$. Re-ordering these eigenvalues in descending order (and renaming), $\lambda_1 \ge \lambda_2 \ge \dots$, it follows that for some constant $C = C(\underline{h},\underline{w},\overline{w}) > 0$ we have
$$
\sum_{j>p}\lambda_j \ge C\sum_{j>p} j^{-2} \ge Cp^{-1}.
$$
By Theorem 2.1, this implies that (independently of the choice of the functionals $\beta_k(\bar{u})$!), in the Hilbert space $L^2(\mathbb{T})$, we have
$$
\mathbb{E}_{\bar{u}\sim\mu}\,\|\mathcal{N}(\bar{u}) - \mathcal{G}(\bar{u})\|_{L^2}^2 \ge \sum_{j>p}\lambda_j \ge \frac{C}{p},
$$
for a constant $C > 0$ that depends only on $\mu$, but is independent of $p$.
To obtain a corresponding estimate with respect to the $L^1$-norm, we simply observe that the above lower bound on the $L^2$-norm, together with the a priori bound $\sup_{\bar{u}\sim\mu}\|\mathcal{G}_{\mathrm{adv}}(\bar{u})\|_{L^\infty} \le \overline{h}$ on the underlying operator and the assumed bound $\sup_{\bar{u}\sim\mu}\|\mathcal{N}(\bar{u})\|_{L^\infty} \le M$, implies
$$
\frac{C}{p} \le \mathbb{E}_{\bar{u}\sim\mu}\,\|\mathcal{N}(\bar{u}) - \mathcal{G}(\bar{u})\|_{L^2}^2 \le \mathbb{E}_{\bar{u}\sim\mu}\left[\|\mathcal{N}(\bar{u}) - \mathcal{G}(\bar{u})\|_{L^\infty}\,\|\mathcal{N}(\bar{u}) - \mathcal{G}(\bar{u})\|_{L^1}\right] \le (M + \overline{h})\,\mathbb{E}_{\bar{u}\sim\mu}\left[\|\mathcal{N}(\bar{u}) - \mathcal{G}(\bar{u})\|_{L^1}\right].
$$
This immediately implies the claimed lower bound.

Proposition D.10 (Lower bound in m). Consider the solution operator
$$
\mathcal{G}_{\mathrm{adv}} : L^1(\mathbb{T}) \cap L^\infty(\mathbb{T}) \to L^1(\mathbb{T}) \cap L^\infty(\mathbb{T})
$$
of the advection equation, with input measure $\mu$, and let $\mathcal{N}$ be a DeepONet whose branch net accesses the input only through its values at $m$ sensor points. Then there exists a constant $C > 0$ such that
$$
\mathbb{E}_{\bar{u}\sim\mu}\left[\|\mathcal{N}(\bar{u}) - \mathcal{G}_{\mathrm{adv}}(\bar{u})\|_{L^1}\right] \ge \frac{C}{m}.
$$

Proof. We recall that the initial data $\bar{u}$ is a randomly shifted box function of the form $\bar{u}(x) = h\,1_{[-w/2,w/2]}(x-\xi)$, where $\xi \in [0,2\pi]$, $h \in [\underline{h},\overline{h}]$ and $w \in [\underline{w},\overline{w}]$ are independent, uniformly distributed random variables. Let $x_1,\dots,x_m \in (0,2\pi]$ be an arbitrary choice of $m$ sensor points. In the following, we denote $\bar{u}(X) := (\bar{u}(x_1),\dots,\bar{u}(x_m)) \in \mathbb{R}^m$. Let now $(x,\bar{u}(X)) \mapsto \Phi(x;\bar{u}(X))$ be any mapping such that $x \mapsto \Phi(x;\bar{u}(X)) \in L^1(\mathbb{T})$ for all possible random choices of $\bar{u}$ (i.e. for all $\bar{u}\sim\mu$). Then we claim that
$$
\mathbb{E}_{\bar{u}\sim\mu}\left[\|\mathcal{G}_{\mathrm{adv}}(\bar{u}) - \Phi(\,\cdot\,;\bar{u}(X))\|_{L^1}\right] \ge \frac{C}{m}, \qquad \mathrm{(D.1)}
$$
for a constant $C = C(\underline{h},\underline{w}) > 0$, for all $m \in \mathbb{N}$. Clearly, the lower bound (D.1) immediately implies the statement of Proposition D.10, upon making the particular choice
$$
\Phi(x;\bar{u}(X)) = \mathcal{N}(\bar{u})(x) \equiv \sum_{k=1}^p \beta_k(\bar{u}(x_1),\dots,\bar{u}(x_m))\,\tau_k(x).
$$
To prove (D.1), we first recall that $\bar{u}(x) = \bar{u}(x;h,w,\xi)$ depends on the three parameters $h$, $w$ and $\xi$, and the expectation over $\bar{u}\sim\mu$ in (D.1) amounts to averaging over $h \in [\underline{h},\overline{h}]$, $w \in [\underline{w},\overline{w}]$ and $\xi \in [0,2\pi)$. In the following, we fix $w$ and $h$, and only consider the average over $\xi$. Suppressing the dependence on the fixed parameters, we will prove that
$$
\frac{1}{2\pi}\int_0^{2\pi} \|\bar{u}(x;\xi) - \Phi(x;\bar{u}(X;\xi))\|_{L^1}\,\mathrm{d}\xi \ge \frac{C}{m}, \qquad \mathrm{(D.2)}
$$
with a constant that only depends on $\underline{h}, \underline{w}$. This clearly implies (D.1).
To prove (D.2), we first introduce two mappings $\xi \mapsto I(\xi)$ and $\xi \mapsto J(\xi)$ by
$$
I(\xi) = i \iff \xi - \frac{w}{2} \in [x_i, x_{i+1}), \qquad J(\xi) = j \iff \xi + \frac{w}{2} \in [x_j, x_{j+1}),
$$
where we make the natural identifications on the periodic torus (e.g. $x_{m+1}$ is identified with $x_1$, and $\xi \pm w/2$ is evaluated modulo $2\pi$). We observe that both mappings $\xi \mapsto I(\xi), J(\xi)$ cycle exactly once through the entire index set $\{1,\dots,m\}$ as $\xi$ varies from 0 to $2\pi$. Next, we introduce
$$
A_{ij} := \{\xi \in [0,2\pi) \mid I(\xi) = i,\ J(\xi) = j\}, \qquad \forall\, i,j \in \{1,\dots,m\}.
$$
Clearly, each $\xi \in [0,2\pi)$ belongs to exactly one of these sets $A_{ij}$. Since $\xi \mapsto I(\xi)$ and $\xi \mapsto J(\xi)$ each have $m$ jumps on $[0,2\pi)$, the mapping $\xi \mapsto (I(\xi),J(\xi))$ can have at most $2m$ jumps. In particular, this implies that there are at most $2m$ non-empty sets $A_{ij} \ne \emptyset$ (these are all sets of the form $A_{I(\xi),J(\xi)}$, $\xi \in [0,2\pi)$), i.e.
$$
\#\{A_{ij} \ne \emptyset \mid i,j \in \{1,\dots,m\}\} \le 2m. \qquad \mathrm{(D.3)}
$$
Since $\bar{u}(x;\xi) = h\,1_{[-w/2,w/2]}(x-\xi)$, one readily sees that when $\xi$ varies in the interior of $A_{ij}$, all sensor point values $\bar{u}(X;\xi)$ remain constant, i.e. the mapping $\mathrm{interior}(A_{ij}) \to \mathbb{R}^m$, $\xi \mapsto \bar{u}(X;\xi)$, is constant. We also note that $A_{ij} = [x_i + w/2,\, x_{i+1} + w/2) \cap [x_j - w/2,\, x_{j+1} - w/2)$ is in fact an interval. Fix $i,j$ such that $A_{ij} \ne \emptyset$ for the moment. We can write $A_{ij} = [a-\Delta, a+\Delta)$ for some $a, \Delta \in \mathbb{T}$, and there exists a constant vector $\bar{U} \in \mathbb{R}^m$ such that $\bar{U} \equiv \bar{u}(X;\xi)$ for $\xi \in [a-\Delta, a+\Delta)$. It follows from the triangle inequality that
$$
\int_{A_{ij}} \|\bar{u}(x;\xi) - \Phi(x;\bar{u}(X;\xi))\|_{L^1}\,\mathrm{d}\xi = \int_{a-\Delta}^{a+\Delta} \|\bar{u}(x;\xi) - \Phi(x;\bar{U})\|_{L^1}\,\mathrm{d}\xi
$$
$$
= \int_0^\Delta \|\bar{u}(x;a-\xi') - \Phi(x;\bar{U})\|_{L^1}\,\mathrm{d}\xi' + \int_0^\Delta \|\bar{u}(x;a+\xi') - \Phi(x;\bar{U})\|_{L^1}\,\mathrm{d}\xi' \ge \int_0^\Delta \|\bar{u}(x;a+\xi') - \bar{u}(x;a-\xi')\|_{L^1}\,\mathrm{d}\xi'.
$$
Since $\bar{u}(x;\xi) = h\,1_{[-w/2,w/2]}(x-\xi)$, we have, by a simple change of variables,
$$
\|\bar{u}(x;a+\xi') - \bar{u}(x;a-\xi')\|_{L^1} = h\int_{\mathbb{T}} \left|1_{[-w/2,w/2]}(x) - 1_{[-w/2,w/2]}(x+2\xi')\right|\mathrm{d}x.
$$
The last expression is of order $\xi'$, provided that $\xi'$ is small enough to avoid overlap with a periodic shift (recall that we are working on the torus, and $1_{[-w/2,w/2]}(x)$ is identified with its periodic extension). To avoid such issues related to periodicity, we first note that $\xi' \le \Delta \le \pi$, and then choose a (large) constant $C_0 = C_0(\overline{w})$ such that for any $\xi' \le \pi/C_0$ and $w \le \overline{w}$, we have
$$
\int_{\mathbb{T}} \left|1_{[-w/2,w/2]}(x) - 1_{[-w/2,w/2]}(x+2\xi')\right|\mathrm{d}x = \int_{-w/2-2\xi'}^{-w/2} 1\,\mathrm{d}x + \int_{w/2-2\xi'}^{w/2} 1\,\mathrm{d}x = 4\xi'.
$$
From the above, we can now estimate
$$
\int_{A_{ij}} \|\bar{u}(x;\xi) - \Phi(x;\bar{u}(X;\xi))\|_{L^1}\,\mathrm{d}\xi \ge \int_0^\Delta \|\bar{u}(x;a+\xi') - \bar{u}(x;a-\xi')\|_{L^1}\,\mathrm{d}\xi' \ge \int_0^{\Delta/C_0} \|\bar{u}(x;a+\xi') - \bar{u}(x;a-\xi')\|_{L^1}\,\mathrm{d}\xi'
$$
$$
\ge \underline{h}\int_0^{\Delta/C_0} 4\xi'\,\mathrm{d}\xi' = 2\underline{h}\,\frac{\Delta^2}{C_0^2} \ge C|A_{ij}|^2,
$$
where $C = C(\underline{h},\overline{w})$ is a constant only depending on the fixed parameters $\underline{h}, \overline{w}$. Summing over all $A_{ij} \ne \emptyset$, we obtain the lower bound
$$
\int_0^{2\pi} \|\bar{u}(x;\xi) - \Phi(x;\bar{u}(X;\xi))\|_{L^1}\,\mathrm{d}\xi = \sum_{A_{ij}\ne\emptyset}\int_{A_{ij}} \|\bar{u}(x;\xi) - \Phi(x;\bar{u}(X;\xi))\|_{L^1}\,\mathrm{d}\xi \ge C\sum_{A_{ij}\ne\emptyset}|A_{ij}|^2.
$$
We observe that $[0,2\pi) = \bigcup A_{ij}$ is a disjoint union, and hence $\sum_{A_{ij}\ne\emptyset}|A_{ij}| = 2\pi$. Furthermore, as observed above, there are at most $2m$ non-zero summands $|A_{ij}| \ne 0$. To finish the proof, we claim that the functional $(\alpha_1,\dots,\alpha_{2m}) \mapsto \sum_k |\alpha_k|^2$, subject to the constraint $\sum_k |\alpha_k| = 2\pi$, is minimized if, and only if, $|\alpha_1| = \dots = |\alpha_{2m}| = \pi/m$. Given this fact, it then immediately follows from the above estimate that
$$
\frac{1}{2\pi}\int_0^{2\pi} \|\bar{u}(x;\xi) - \Phi(x;\bar{u}(X;\xi))\|_{L^1}\,\mathrm{d}\xi \ge \frac{C}{2\pi}\sum_{A_{ij}\ne\emptyset}|A_{ij}|^2 \ge \frac{C\pi}{m},
$$
where $C = C(\underline{h},\overline{w}) > 0$ is independent of the values of $w \in [\underline{w},\overline{w}]$ and $h \in [\underline{h},\overline{h}]$. This suffices to conclude the claim of Proposition D.10. It remains to prove the claim; we argue by contradiction. Let $\alpha_1,\dots,\alpha_{2m}$ be a minimizer of $\sum_k |\alpha_k|^2$ under the constraint $\sum_k |\alpha_k| = 2\pi$. Clearly, we can wlog assume that $0 \le \alpha_1 \le \dots \le \alpha_{2m}$ are non-negative numbers. If the claim does not hold, then there exists a minimizer such that $\alpha_1 < \alpha_{2m}$. Given $\delta > 0$ to be determined below, we define $\beta_k$ by $\beta_1 = \alpha_1 + \delta$, $\beta_{2m} = \alpha_{2m} - \delta$, and $\beta_k = \alpha_k$ for all other indices.
Then, by a simple computation, we observe that
$$
\sum_k \alpha_k^2 - \sum_k \beta_k^2 = 2\delta(\alpha_{2m} - \alpha_1 - \delta).
$$
Choosing $\delta > 0$ sufficiently small, we can ensure that the last quantity is $> 0$, while keeping $\beta_k \ge 0$ for all $k$. In particular, it follows that $\sum_k |\beta_k| = \sum_k |\alpha_k| = 2\pi$, but $\sum_k \alpha_k^2 > \sum_k \beta_k^2$, in contradiction to the assumption that $\alpha_1,\dots,\alpha_{2m}$ minimizes $\sum_k |\alpha_k|^2$. Hence, any minimizer must satisfy $|\alpha_1| = \dots = |\alpha_{2m}| = \pi/m$.
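The equal-partition claim admits a quick numerical sanity check: no random admissible partition attains a smaller sum of squares than the equal one.

```python
import numpy as np

# Among nonnegative alpha_1, ..., alpha_{2m} with sum 2*pi, the sum of
# squares is minimized by the equal partition alpha_k = pi/m, with minimal
# value 2m * (pi/m)^2 = 2*pi^2/m.
m = 8
best = 2 * np.pi ** 2 / m                  # value at the equal partition
rng = np.random.default_rng(0)
worst_gap = np.inf
for _ in range(1000):
    alpha = rng.random(2 * m)
    alpha *= 2 * np.pi / alpha.sum()       # enforce the constraint sum = 2*pi
    worst_gap = min(worst_gap, (alpha ** 2).sum() - best)
print(worst_gap)  # remains nonnegative: no sample beats the equal partition
```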

D.3 PROOF OF THEOREM 3.2

Proof. We choose equidistant grid points $x_1, \dots, x_m$ for the construction of a shift-DeepONet approximation to $\mathcal{G}_{\mathrm{adv}}$. We may wlog assume that the grid spacing satisfies $\Delta x = x_2 - x_1 < \underline w$, as the statement is asymptotic in $m \to \infty$. We proceed in several steps.

Step 1: We show that the height $h$ can be efficiently determined by max-pooling. First, we observe that for any two numbers $a, b$, the mapping
$$(a,b) \mapsto \big(\max(0,a-b),\ \max(0,b),\ \max(0,-b)\big) \mapsto \max(0,a-b) + \max(0,b) - \max(0,-b) = \max(a,b)$$
is exactly represented by a ReLU neural network $\widehat{\max}$ of width 3, with a single hidden layer. Given $k$ inputs $a_1, \dots, a_k$, we can parallelize $O(k/2)$ copies of $\widehat{\max}$ to obtain a ReLU network of width $\le 3k$ with a single hidden layer, which maps $(a_1, a_2, \dots, a_{k-1}, a_k) \mapsto (\max(a_1,a_2), \dots, \max(a_{k-1},a_k))$. Concatenation of $O(\log_2(k))$ such ReLU layers, with decreasing input sizes $k, \lceil k/2\rceil, \lceil k/4\rceil, \dots, 1$, provides a ReLU representation of max-pooling,
$$(a_1,\dots,a_k) \mapsto \big(\max(a_1,a_2),\dots,\max(a_{k-1},a_k)\big) \mapsto \big(\max(a_1,\dots,a_4),\dots,\max(a_{k-3},\dots,a_k)\big) \mapsto \dots \mapsto \max(a_1,\dots,a_k).$$
This concatenated ReLU network $\mathrm{maxpool}: \mathbb{R}^k \to \mathbb{R}$ has width $\le 3k$, depth $O(\log(k))$, and size $O(k\log(k))$. Our goal is to apply the network $\mathrm{maxpool}$ to the shift-DeepONet input $\bar u(x_1), \dots, \bar u(x_m)$ to determine the height $h$. To this end, we first choose indices $\ell_1, \dots, \ell_k \in \{1,\dots,m\}$, such that $x_{\ell_{j+1}} - x_{\ell_j} \le \underline w$, with $k \in \mathbb{N}$ minimal; this guarantees that at least one of the selected grid points lies in the support of $\bar u$. Note that $k$ is uniformly bounded, with a bound that depends only on $\underline w$ (not on $m$). Applying the maxpool construction above to $\bar u(x_{\ell_1}), \dots, \bar u(x_{\ell_k})$, we obtain a mapping
$$\big(\bar u(x_1), \dots, \bar u(x_m)\big) \mapsto \big(\bar u(x_{\ell_1}), \dots, \bar u(x_{\ell_k})\big) \mapsto \mathrm{maxpool}\big(\bar u(x_{\ell_1}),\dots,\bar u(x_{\ell_k})\big) = h.$$
This mapping can be represented by $O(\log(k))$ ReLU layers, with width $\le 3k$ and total (fully connected) size $O(k^2\log(k))$.
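The ReLU identity $\max(a,b)=\max(0,a-b)+\max(0,b)-\max(0,-b)$ and the binary-tree reduction of Step 1 can be written out directly. The following Python sketch (function names are ours, for illustration only) mirrors the construction:

```python
def relu(z):
    return max(0.0, z)

def relu_max(a, b):
    # max(a, b) = relu(a - b) + relu(b) - relu(-b): one hidden ReLU layer, width 3
    return relu(a - b) + relu(b) - relu(-b)

def relu_maxpool(values):
    # O(log k) layers of pairwise relu_max, as in the tree construction of Step 1
    vals = list(values)
    while len(vals) > 1:
        nxt = [relu_max(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:               # odd leftover is passed through unchanged
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]
```

Applied to the sampled values $\bar u(x_{\ell_1}),\dots,\bar u(x_{\ell_k})$, this returns the height $h$ exactly, since at least one sample lies inside the support of the box.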
In particular, since $k$ only depends on $\underline w$, we conclude that there exists $C = C(\underline w) > 0$ and a neural network $\widehat h : \mathbb{R}^m \to \mathbb{R}$ with
$$\mathrm{depth}(\widehat h) \le C, \quad \mathrm{width}(\widehat h) \le C, \quad \mathrm{size}(\widehat h) \le C, \tag{D.4}$$
such that
$$\widehat h(\bar u(X)) = h, \tag{D.5}$$
for any initial data of the form $\bar u(x) = h\,1_{[-w/2,w/2]}(x - \xi)$, where $h \in [\underline h, \overline h]$, $w \in [\underline w, \overline w]$, and $\xi \in [0, 2\pi]$.

Step 2: To determine the width $w$, we consider a linear layer (of size $m$), followed by an approximation of division, $\widehat{\div}(a;b) \approx a/b$ (cp. Proposition D.7):
$$\bar u(X) \mapsto \Delta x \sum_{j=1}^m \bar u(x_j) \mapsto \widehat{\div}\Big(\Delta x \sum_{j=1}^m \bar u(x_j);\ \widehat h(\bar u(X))\Big) =: \widehat w(\bar u(X)).$$
Then, by the triangle inequality,
$$|w - \widehat w| \le \Big|\frac1h\int_0^{2\pi} \bar u(x)\,dx - \frac{\Delta x}{h}\sum_j \bar u(x_j)\Big| + \Big|\frac{\Delta x}{h}\sum_j \bar u(x_j) - \widehat{\div}\Big(\Delta x\sum_{j=1}^m \bar u(x_j);\ h\Big)\Big| \le \frac{2\pi}{m} + \epsilon,$$
and we have $\mathrm{depth}(\widehat w) \le C\log(\epsilon^{-1})^2$, $\mathrm{width}(\widehat w) \le C$, $\mathrm{size}(\widehat w) \le C(m + \log(\epsilon^{-1})^2)$, by the complexity estimate of Proposition D.7.

Step 3: To determine the shift $\xi \in [0,2\pi]$, we note that
$$\Delta x \sum_{j=1}^m \bar u(x_j) e^{-i x_j} = \int_0^{2\pi} \bar u(x)e^{-ix}\,dx + O\Big(\frac1m\Big) = 2h\sin(w/2)\,e^{-i\xi} + O\Big(\frac1m\Big).$$
Using the result of Lemma D.8, combined with the approximation of division of Proposition D.7, and the observation that $w \in (\underline w, \pi)$ implies that $\sin(w/2) \ge \sin(\underline w/2) > 0$ is uniformly bounded from below for all $w \in [\underline w, \overline w]$, it follows that for all $\epsilon \in (0,\frac12]$, there exists a neural network $\widehat\xi : \mathbb{R}^m \to [0,2\pi]$, of the form
$$\widehat\xi(\bar u(X)) = \Xi_\epsilon\Big(\widehat{\div}\Big(\Delta x \sum_{j=1}^m \bar u(x_j)e^{-ix_j};\ \widehat\times_{M,\epsilon}\big(\widehat h,\ 2\sin(\widehat w/2)\big)\Big)\Big),$$
such that $|\xi - \widehat\xi(\bar u(X))| \le \epsilon$ for all $\xi \in [0, 2\pi - \epsilon)$, and $\mathrm{depth}(\widehat\xi) \le C\log(\epsilon^{-1})^2$, $\mathrm{width}(\widehat\xi) \le C$, $\mathrm{size}(\widehat\xi) \le C(m + \log(\epsilon^{-1})^2)$.

Step 4: Combining the above three ingredients (Steps 1-3), and given the fixed advection velocity $a \in \mathbb{R}$ and fixed time $t$, we define a shift-DeepONet with $p = 6$, scale-net $\mathcal{A}_k \equiv 1$, and shift-net $\gamma$ with output $\gamma_k(\bar u) \equiv \widehat\xi + at$, as follows:
$$\mathcal{N}_{\mathrm{sDON}}(\bar u) = \sum_{k=1}^p \beta_k(\bar u)\,\tau_k\big(x - \gamma_k(\bar u)\big) := \sum_{j=-1}^{1} \widehat h\, 1^\epsilon_{[0,\infty)}\Big(x - \widehat\xi - at + \frac{\widehat w}{2} + 2\pi j\Big) - \sum_{j=-1}^{1} \widehat h\, 1^\epsilon_{[0,\infty)}\Big(x - \widehat\xi - at - \frac{\widehat w}{2} + 2\pi j\Big),$$
where $\widehat h = \widehat h(\bar u(X))$, $\widehat w = \widehat w(\bar u(X))$, and $\widehat\xi = \widehat\xi(\bar u(X))$, and where $1^\epsilon_{[0,\infty)}$ is a sufficiently accurate $L^1$-approximation of the indicator function $1_{[0,\infty)}(x)$ (cp.
Proposition D.2). To estimate the approximation error, we denote $1^\epsilon_{[-\widehat w/2, \widehat w/2]}(x) := 1^\epsilon_{[0,\infty)}(x + \widehat w/2) - 1^\epsilon_{[0,\infty)}(x - \widehat w/2)$ and identify it with its periodic extension to $\mathbb{T}$, so that we can more simply write
$$\mathcal{N}_{\mathrm{sDON}}(\bar u)(x) = \widehat h\, 1^\epsilon_{[-\widehat w/2,\widehat w/2]}\big(x - \widehat\xi - at\big).$$
We also recall that the solution $u(x,t)$ of the linear advection equation $\partial_t u + a\partial_x u = 0$, with initial data $u(x,0) = \bar u(x)$, is given by $u(x,t) = \bar u(x - at)$, where $at$ is a fixed constant, independent of the input $\bar u$. Thus, we have
$$\mathcal{G}_{\mathrm{adv}}(\bar u)(x) = \bar u(x - at) = h\,1_{[-w/2,w/2]}(x - \xi - at).$$
We can now write
$$\big|\mathcal{G}_{\mathrm{adv}}(\bar u)(x) - \mathcal{N}_{\mathrm{sDON}}(\bar u)(x)\big| = \big|h\,1_{[-w/2,w/2]}(x - \xi - at) - \widehat h\,1^\epsilon_{[-\widehat w/2,\widehat w/2]}(x - \widehat\xi - at)\big|.$$
We next recall that by the construction of Step 1, we have $\widehat h(\bar u) \equiv h$ for all inputs $\bar u$. Furthermore, upon integration over $x$, we can clearly get rid of the constant shift $at$ by a change of variables. Hence, we can estimate
$$\|\mathcal{G}_{\mathrm{adv}}(\bar u) - \mathcal{N}_{\mathrm{sDON}}(\bar u)\|_{L^1} \le h \int_{\mathbb{T}} \big|1_{[-w/2,w/2]}(x - \xi) - 1^\epsilon_{[-\widehat w/2,\widehat w/2]}(x - \widehat\xi)\big|\,dx. \tag{D.6}$$
Using the straightforward bound
$$\big|1_{[-w/2,w/2]}(x-\xi) - 1^\epsilon_{[-\widehat w/2,\widehat w/2]}(x-\widehat\xi)\big| \le \underbrace{\big|1_{[-w/2,w/2]}(x-\xi) - 1_{[-w/2,w/2]}(x-\widehat\xi)\big|}_{(I)} + \underbrace{\big|1_{[-w/2,w/2]}(x-\widehat\xi) - 1_{[-\widehat w/2,\widehat w/2]}(x-\widehat\xi)\big|}_{(II)} + \underbrace{\big|1_{[-\widehat w/2,\widehat w/2]}(x-\widehat\xi) - 1^\epsilon_{[-\widehat w/2,\widehat w/2]}(x-\widehat\xi)\big|}_{(III)},$$
one readily checks that, by Step 3, the integral over the first term is bounded by
$$\|(I)\|_{L^1} \le C\int_0^{2\pi-\epsilon} |\xi - \widehat\xi|\,d\xi + \int_{2\pi-\epsilon}^{2\pi} 2\,d\xi \le (C+2)\epsilon,$$
where $C = C(\underline w, \overline w) > 0$. By Step 2, the integral over the second term can be bounded by
$$\|(II)\|_{L^1} \le C|w - \widehat w| \le C(1/m + \epsilon).$$
Finally, by Proposition D.1, by choosing $\epsilon$ sufficiently small (recall also that the size of $1^\epsilon$ is independent of $\epsilon$), we can ensure that
$$\|(III)\|_{L^1} = \big\|1^\epsilon_{[-\widehat w/2,\widehat w/2]} - 1_{[-\widehat w/2,\widehat w/2]}\big\|_{L^1} \le \epsilon$$
holds uniformly for any $\widehat w$. Hence, the right-hand side of (D.6) obeys an upper bound of the form
$$\mathbb{E}_{\bar u\sim\mu}\|\mathcal{G}_{\mathrm{adv}}(\bar u) - \mathcal{N}_{\mathrm{sDON}}(\bar u)\|_{L^1} = \frac{1}{\overline h - \underline h}\int_{\underline h}^{\overline h}\!dh\ \frac{1}{\overline w - \underline w}\int_{\underline w}^{\overline w}\!dw\ \frac{1}{2\pi}\int_{\mathbb{T}}\!d\xi\ \|\mathcal{G}_{\mathrm{adv}}(\bar u) - \mathcal{N}_{\mathrm{sDON}}(\bar u)\|_{L^1} \le \frac{\overline h}{2\pi}\,\frac{1}{\overline w - \underline w}\int_{\underline w}^{\overline w}\!dw \int_0^{2\pi}\!d\xi\ \big\{\|(I)\|_{L^1} + \|(II)\|_{L^1} + \|(III)\|_{L^1}\big\} \le C\Big(\epsilon + \frac1m\Big),$$
for a constant $C = C(\underline w, \overline w, \overline h) > 0$.
We also recall that by our construction,
$$\mathrm{depth}(\mathcal{N}_{\mathrm{sDON}}) \le C\log(\epsilon^{-1})^2, \quad \mathrm{width}(\mathcal{N}_{\mathrm{sDON}}) \le C, \quad \mathrm{size}(\mathcal{N}_{\mathrm{sDON}}) \le C\big(m + \log(\epsilon^{-1})^2\big).$$
Replacing $\epsilon$ by $\epsilon/2C$ and choosing $m \sim \epsilon^{-1}$, we obtain $\mathbb{E}_{\bar u\sim\mu}\|\mathcal{G}_{\mathrm{adv}}(\bar u) - \mathcal{N}_{\mathrm{sDON}}(\bar u)\|_{L^1} \le \epsilon$, with
$$\mathrm{depth}(\mathcal{N}_{\mathrm{sDON}}) \le C\log(\epsilon^{-1})^2, \quad \mathrm{width}(\mathcal{N}_{\mathrm{sDON}}) \le C, \quad \mathrm{size}(\mathcal{N}_{\mathrm{sDON}}) \le C\epsilon^{-1},$$
where $C$ depends only on $\mu$, and is independent of $\epsilon$. This implies the claim of Theorem 3.2.
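Steps 1-3 of the proof amount to a simple signal-processing pipeline: max-pooling for the height $h$, the mean for the width $w$, and the phase of the first Fourier coefficient for the shift $\xi$. The following plain-Python sketch (function name, grid size and tolerances are our own illustrative assumptions) implements this pipeline directly:

```python
import cmath
import math

def recover_box_parameters(h, w, xi, m=2048):
    """Sketch of Steps 1-3: sample u(x) = h * 1_{[-w/2, w/2]}(x - xi) on an
    equispaced grid and recover (h, w, xi) via the max, the mean, and the
    phase of the k = 1 Fourier coefficient."""
    xs = [2 * math.pi * j / m for j in range(m)]

    def box(y):  # periodic indicator of [-w/2, w/2] on the torus
        y = (y + math.pi) % (2 * math.pi) - math.pi
        return 1.0 if abs(y) <= w / 2 else 0.0

    u = [h * box(x - xi) for x in xs]
    h_hat = max(u)                                    # Step 1: height (max-pool)
    w_hat = (2 * math.pi / m) * sum(u) / h_hat        # Step 2: width (mean / height)
    c1 = sum(u[j] * cmath.exp(-1j * xs[j]) for j in range(m)) * (2 * math.pi / m)
    # Step 3: c1 ~ 2 h sin(w/2) e^{-i xi}, so the phase of c1 gives -xi
    xi_hat = (-cmath.phase(c1)) % (2 * math.pi)
    return h_hat, w_hat, xi_hat
```

The recovered width and shift carry the $O(1/m)$ quadrature error appearing in Steps 2-3, while the height is exact whenever the grid resolves the support.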

D.4 PROOF OF THEOREM 3.3

For the proof of Theorem 3.3, we will need a few intermediate results.

Lemma D.11. Let $\bar u = h\,1_{[-w/2,w/2]}(x-\xi)$ and fix a constant $at \in \mathbb{R}$. There exists a constant $C > 0$, such that given $N$ grid points, there exists an FNO with $k_{\max} = 1$, $d_v \le C$, depth $\le C$, size $\le C$, such that
$$\sup_{h,w,\xi}\big|\mathcal{N}_{\mathrm{FNO}}(\bar u)(x) - \sin(w/2)\cos(x - \xi - at)\big| \le \frac{C}{N}.$$

Proof. We first note that there is a ReLU neural network $\Phi$ consisting of two hidden layers, such that
$$\Phi(\bar u(x)) = \min\big(1, \underline h^{-1}\bar u(x)\big) = \min\big(1, \underline h^{-1}h\,1_{[-w/2,w/2]}(x)\big) = 1_{[-w/2,w/2]}(x),$$
for all $h \in [\underline h, \overline h]$. Clearly, $\Phi$ can be represented by FNO layers in which the convolution operator $K_\ell \equiv 0$. Next, we note that the $k = \pm 1$ Fourier coefficients of $u := 1_{[-w/2,w/2]}(x-\xi)$ are given by
$$\mathcal{F}_N u(k = \pm1) = \frac1N\sum_{j=1}^N 1_{[-w/2,w/2]}(x_j - \xi)e^{\mp ix_j} = \frac{1}{2\pi}\int_0^{2\pi} 1_{[-w/2,w/2]}(x-\xi)e^{\mp ix}\,dx + O(N^{-1}) = \frac{\sin(w/2)\,e^{\mp i\xi}}{\pi} + O(N^{-1}),$$
where the $O(N^{-1})$ error is bounded uniformly in $\xi \in [0,2\pi]$ and $w \in [\underline w, \overline w]$. It follows that the FNO $\mathcal{N}_{\mathrm{FNO}}$ defined by
$$\bar u \mapsto \Phi(\bar u) \mapsto \sigma\big(\mathcal{F}_N^{-1} P\, \mathcal{F}_N \Phi(\bar u)\big) - \sigma\big(\mathcal{F}_N^{-1}(-P)\mathcal{F}_N\Phi(\bar u)\big),$$
where $P$ implements a projection onto the modes $|k| = 1$ and multiplication by $e^{\pm iat}\pi/2$ (the complex exponential introduces a phase shift by $at$), satisfies
$$\sup_{x\in\mathbb{T}}\big|\mathcal{N}_{\mathrm{FNO}}(\bar u)(x) - \sin(w/2)\cos(x-\xi-at)\big| \le \frac{C}{N},$$
where $C$ is independent of $N$, $w$, $h$ and $\xi$; here we used the ReLU identity $\sigma(z) - \sigma(-z) = z$.

Lemma D.12. Fix $0 < \underline h < \overline h$ and $0 < \underline w < \overline w$. There exists a constant $C = C(\underline h, \overline h, \underline w, \overline w) > 0$ with the following property: for any input function $\bar u(x) = h\,1_{[-w/2,w/2]}(x-\xi)$ with $h \in [\underline h, \overline h]$ and $w \in [\underline w, \overline w]$, and given $N$ grid points, there exists an FNO with constant output function, such that
$$\sup_{h,w,\xi}\big|\mathcal{N}_{\mathrm{FNO}}(\bar u)(x) - w\big| \le \frac{C}{N},$$
and with uniformly bounded size: $k_{\max} = 0$, $d_v \le C$, depth $\le C$, size $\le C$.

Proof. We can define an FNO mapping
$$\bar u \mapsto 1_{[-w/2,w/2]}(x) \mapsto \frac{2\pi}{N}\sum_{j=1}^N 1_{[-w/2,w/2]}(x_j) = w + O(N^{-1}),$$
where we observe that the first mapping is just $\bar u \mapsto \min(1, \underline h^{-1}\bar u(x))$, which is easily represented by an ordinary ReLU NN of bounded size.
The second mapping above is just the projection onto the 0-th Fourier mode under the discrete Fourier transform (up to the factor $2\pi$). In particular, both of these mappings can be represented exactly by an FNO with $k_{\max} = 0$ and uniformly bounded $d_v$, depth and size. To conclude the argument, we observe that the error $O(N^{-1})$ depends only on the grid size and is independent of $w \in [\underline w, \overline w]$.

Lemma D.13. Fix $0 < \underline w < \overline w$. There exists a constant $C = C(\underline w, \overline w) > 0$, such that for any $\epsilon > 0$, there exists an FNO such that for any constant input function $\bar u(x) \equiv w \in [\underline w, \overline w]$, we have
$$\Big|\mathcal{N}_{\mathrm{FNO}}(\bar u)(x) - \frac12\sin(w)\Big| \le \epsilon, \quad \forall x \in [0,2\pi],$$
and $k_{\max} = 0$, $d_v \le C$, depth $\le C\log(\epsilon^{-1})^2$, size $\le C\log(\epsilon^{-1})^2$.

Proof. It follows e.g. from (Elbrächter et al., 2021, Thm. III.9) (or also Theorem D.6 above) that there exists a constant $C = C(\underline w, \overline w) > 0$, such that for any $\epsilon > 0$, there exists a ReLU neural network $S_\epsilon$ with $\mathrm{size}(S_\epsilon) \le C\log(\epsilon^{-1})^2$, $\mathrm{depth}(S_\epsilon) \le C\log(\epsilon^{-1})^2$ and $\mathrm{width}(S_\epsilon) \le C$, such that
$$\sup_{w\in[\underline w,\overline w]}\Big|S_\epsilon(w) - \frac12\sin(w)\Big| \le \epsilon.$$
To finish the proof, we simply note that this ReLU neural network $S_\epsilon$ can easily be represented by an FNO $\widehat S_\epsilon$ with $k_{\max} = 0$, $d_v \le C$, $\mathrm{depth}(\widehat S_\epsilon) \le C\log(\epsilon^{-1})^2$ and $\mathrm{size}(\widehat S_\epsilon) \le C\log(\epsilon^{-1})^2$; it suffices to copy the weight matrices $W_\ell$ of $S_\epsilon$, set the entries of the Fourier multiplier matrices $P_\ell(k) \equiv 0$, and choose constant bias functions $b_\ell(x) = \mathrm{const.}$ (with values given by the corresponding biases in the hidden layers of $S_\epsilon$).

Lemma D.14. Let $\bar u = h\,1_{[-w/2,w/2]}(x-\xi)$. Assume that $2\pi/N \le \underline w$. For any $\epsilon > 0$, there exists an FNO with constant output function, such that
$$\sup_{h,w,\xi}\big|\mathcal{N}_{\mathrm{FNO}}(\bar u)(x) - h\big| \le \epsilon,$$
and $k_{\max} = 0$, $d_v \le C$, depth $\le C\log(\epsilon^{-1})^2$.

Proof. The proof follows along similar lines as the proofs of the previous lemmas. In this case, we can define an FNO mapping
$$\bar u \mapsto \begin{pmatrix} h\,1_{[-w/2,w/2]}(x) \\ 1_{[-w/2,w/2]}(x)\end{pmatrix} \mapsto \begin{pmatrix} h\sum_{j=1}^N 1_{[-w/2,w/2]}(x_j) \\ \sum_{j=1}^N 1_{[-w/2,w/2]}(x_j)\end{pmatrix} \mapsto \widehat{\div}_\epsilon\Big(h\sum_{j=1}^N 1_{[-w/2,w/2]}(x_j),\ \sum_{j=1}^N 1_{[-w/2,w/2]}(x_j)\Big).$$
The estimates on $k_{\max}$, $d_v$ and depth follow from the construction of $\widehat{\div}$ in Proposition D.7.

Proof of Theorem 3.3. We first note that (the $2\pi$-periodization of) $1_{[-w/2,w/2]}(x-\xi-at)$ equals 1 if, and only if,
$$\cos(x-\xi-at) \ge \cos(w/2) \iff \sin(w/2)\cos(x-\xi-at) \ge \frac12\sin(w). \tag{D.7}$$
The strategy of proof is as follows: given the input function $\bar u(x) = h\,1_{[-w/2,w/2]}(x-\xi)$ with unknown $w \in [\underline w, \overline w]$, $\xi \in [0,2\pi]$ and $h \in [\underline h, \overline h]$, and for given $a, t \in \mathbb{R}$ (these are fixed for this problem), we first construct an FNO which approximates the sequence of mappings
$$\bar u \mapsto \begin{pmatrix} h \\ w \\ \sin(w/2)\cos(x-\xi-at)\end{pmatrix} \mapsto \begin{pmatrix} h \\ \frac12\sin(w) \\ \sin(w/2)\cos(x-\xi-at)\end{pmatrix} \mapsto \begin{pmatrix} h \\ \sin(w/2)\cos(x-\xi-at) - \frac12\sin(w)\end{pmatrix}.$$
Then, according to (D.7), we can approximately reconstruct $1_{[-w/2,w/2]}(x-\xi-at)$ by approximating the identity
$$1_{[-w/2,w/2]}(x-\xi-at) = 1_{[0,\infty)}\Big(\sin(w/2)\cos(x-\xi-at) - \frac12\sin(w)\Big),$$
where $1_{[0,\infty)}$ is the indicator function of $[0,\infty)$. Finally, we obtain $\mathcal{G}_{\mathrm{adv}}(\bar u) = h\,1_{[-w/2,w/2]}(x-\xi-at)$ by approximately multiplying this output by $h$. We fill in the details of this construction below.

Step 1: The first step is to construct approximations of the mappings above. We note that we can choose a (common) constant $C_0 = C_0(\underline h, \overline h, \underline w, \overline w) > 0$, depending only on the parameters $\underline h, \overline h, \underline w, \overline w$, such that for any grid size $N \in \mathbb{N}$ all of the following hold:

1. There exists an FNO $H_N$ with constant output (cp. Lemma D.14), such that for $\bar u(x) = h\,1_{[-w/2,w/2]}(x-\xi)$,
$$\sup_{w,h}|H_N(\bar u) - h| \le \frac1N, \tag{D.8}$$
and with $k_{\max} \le 1$, $d_v \le C_0$, depth $\le C_0\log(N)^2$, size $\le C_0\log(N)^2$.

2. Combining Lemmas D.12 and D.13, we conclude that there exists an FNO $S_N$ with constant output, such that for $\bar u(x) = h\,1_{[-w/2,w/2]}(x-\xi)$, we have
$$\sup_{w\in[\underline w-1,\overline w+1]}\Big|S_N(\bar u) - \frac12\sin(w)\Big| \le \frac{C_0}{N}, \tag{D.9}$$
and with $k_{\max} = 0$, $d_v \le C_0$, depth $\le C_0\log(N)^2$, size $\le C_0\log(N)^2$.

3. There exists an FNO $C_N$ (cp.
Lemma D.11), such that for $\bar u = h\,1_{[-w/2,w/2]}(x-\xi)$,
$$\sup_{x,\xi,w}\big|C_N(\bar u)(x) - \sin(w/2)\cos(x-\xi-at)\big| \le \frac{C_0}{N}, \tag{D.10}$$
where the supremum is over $x, \xi \in [0,2\pi]$ and $w \in [\underline w, \overline w]$, and such that $k_{\max} = 1$, $d_v \le C_0$, depth $\le C_0$, size $\le C_0$.

4. There exists a ReLU neural network $1^N_{[0,\infty)}$ (cp. Proposition D.1), such that $\|1^N_{[0,\infty)}\|_{L^\infty} \le 1$ and
$$1^N_{[0,\infty)}(z) = \begin{cases} 0, & (z < 0), \\ 1, & (z \ge 1/N), \end{cases} \tag{D.11}$$
with $\mathrm{width}(1^N_{[0,\infty)}) \le C_0$, $\mathrm{depth}(1^N_{[0,\infty)}) \le C_0$.

5. There exists a ReLU neural network $\widehat\times_N$ (cp. Proposition D.4), such that
$$\sup_{a,b}\big|\widehat\times_N(a,b) - ab\big| \le \frac1N, \tag{D.12}$$
where the supremum is over all $|a|, |b| \le \overline h + 1$, and $\mathrm{width}(\widehat\times_N) \le C_0$, $\mathrm{depth}(\widehat\times_N) \le C_0\log(N)$.

Based on the above FNO constructions, we define
$$\mathcal{N}_{\mathrm{FNO}}(\bar u) := \widehat\times_N\Big(H_N(\bar u),\ 1^N_{[0,\infty)}\big(C_N(\bar u) - S_N(\bar u)\big)\Big). \tag{D.13}$$
Taking into account the size estimates from points 1-5 above, as well as the general FNO size estimate (B.2), it follows that $\mathcal{N}_{\mathrm{FNO}}$ can be represented by an FNO with
$$k_{\max} = 1, \quad d_v \le C, \quad \mathrm{depth} \le C\log(N)^2, \quad \mathrm{size} \le C\log(N)^2. \tag{D.14}$$
To finish the proof of Theorem 3.3, it suffices to show that $\mathcal{N}_{\mathrm{FNO}}$ satisfies an estimate
$$\sup_{\bar u\sim\mu}\|\mathcal{N}_{\mathrm{FNO}}(\bar u) - \mathcal{G}_{\mathrm{adv}}(\bar u)\|_{L^1} \le \frac{C}{N},$$
with $C > 0$ independent of $N$.

Step 2: We claim that if $x \in [0,2\pi]$ is such that
$$\Big|\sin(w/2)\cos(x-\xi) - \frac12\sin(w)\Big| \ge \frac{2C_0+1}{N},$$
with $C_0$ the constant of Step 1, then $1^N_{[0,\infty)}(C_N(\bar u)(x) - S_N(\bar u)) = 1_{[-w/2,w/2]}(x-\xi)$. To see this, we first assume that $\sin(w/2)\cos(x-\xi) - \frac12\sin(w) \ge \frac{2C_0+1}{N} > 0$. Then
$$C_N(\bar u)(x) - S_N(\bar u) \ge \Big(\sin(w/2)\cos(x-\xi) - \frac12\sin(w)\Big) - \big|C_N(\bar u)(x) - \sin(w/2)\cos(x-\xi)\big| - \Big|S_N(\bar u) - \frac12\sin(w)\Big| \ge \frac{2C_0+1}{N} - \frac{C_0}{N} - \frac{C_0}{N} = \frac1N > 0.$$
Hence, it follows from (D.11) that
$$1^N_{[0,\infty)}\big(C_N(\bar u)(x) - S_N(\bar u)\big) = 1 = 1_{[0,\infty)}\Big(\sin(w/2)\cos(x-\xi) - \frac12\sin(w)\Big) = 1_{[-w/2,w/2]}(x-\xi).$$
The other case, $\sin(w/2)\cos(x-\xi) - \frac12\sin(w) \le -\frac{2C_0+1}{N}$, is shown similarly.

Step 3: We note that there exists $C = C(\underline w, \overline w) > 0$, such that for any $\delta > 0$, the Lebesgue measure satisfies
$$\mathrm{meas}\Big\{x \in [0,2\pi] \ \Big|\ \Big|\sin(w/2)\cos(x-\xi) - \frac12\sin(w)\Big| < \delta\Big\} \le C\delta.$$
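The identity (D.7) underlying this construction, namely that the box indicator equals a thresholding of the smooth quantity $\sin(w/2)\cos(y)$ against $\frac12\sin(w)$, can be checked pointwise. A small Python sketch (function names are ours, for illustration):

```python
import math

def indicator_via_threshold(y, w):
    # 1_{[-w/2, w/2]}(y) reconstructed from the smooth quantities
    # sin(w/2) * cos(y) and (1/2) * sin(w), as in identity (D.7)
    return 1.0 if math.sin(w / 2) * math.cos(y) >= 0.5 * math.sin(w) else 0.0

def max_mismatch(w, n=10001):
    """Count mismatches against the exact indicator on a grid over [-pi, pi]."""
    bad = 0
    for j in range(n):
        y = -math.pi + 2 * math.pi * j / (n - 1)
        exact = 1.0 if abs(y) <= w / 2 else 0.0
        if indicator_via_threshold(y, w) != exact:
            bad += 1
    return bad
```

For widths $w \in (0, \pi)$ the two sides agree at every grid point, which is exactly what allows the FNO to recover a discontinuous output from two smooth intermediate quantities.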
Step 4: Given the previous steps, we now write
$$\mathcal{G}_{\mathrm{adv}}(\bar u) - \mathcal{N}_{\mathrm{FNO}}(\bar u) = h\,1_{[-w/2,w/2]}(x) - \widehat\times_N\Big(H_N,\ 1^N_{[0,\infty)}(C_N - S_N)(\bar u)\Big) = \underbrace{h\,1_{[-w/2,w/2]}(x) - h\,1^N_{[0,\infty)}(C_N - S_N)}_{(I)} + \underbrace{(h - H_N)\,1^N_{[0,\infty)}(C_N - S_N)}_{(II)} + \underbrace{H_N\,1^N_{[0,\infty)}(C_N - S_N)(\bar u) - \widehat\times_N\Big(H_N,\ 1^N_{[0,\infty)}(C_N - S_N)(\bar u)\Big)}_{(III)}.$$
The second term $(II)$ and third term $(III)$ are uniformly bounded by $N^{-1}$, by the construction of $\widehat\times_N$ and $H_N$. By Steps 2 and 3 (with $\delta = (2C_0+1)/N$), we can estimate the $L^1$-norm of the first term as
$$\|(I)\|_{L^1} \le 2\overline h\,\mathrm{meas}\Big\{\Big|\sin(w/2)\cos(x-\xi) - 2^{-1}\sin(w)\Big| < \delta\Big\} \le C/N,$$
where the constant $C$ is independent of $N$, and only depends on the parameters $\underline h, \overline h, \underline w, \overline w$. Hence, $\mathcal{N}_{\mathrm{FNO}}$ satisfies
$$\sup_{\bar u\sim\mu}\|\mathcal{N}_{\mathrm{FNO}}(\bar u) - \mathcal{G}_{\mathrm{adv}}(\bar u)\|_{L^1} \le \frac{C}{N},$$
for a constant $C > 0$ independent of $N$, and where we recall (cp. (D.14) above): $k_{\max} = 1$, $d_v \le C$, $\mathrm{depth}(\mathcal{N}_{\mathrm{FNO}}) \le C\log(N)^2$, $\mathrm{size}(\mathcal{N}_{\mathrm{FNO}}) \le C\log(N)^2$. The claimed error and complexity bounds of Theorem 3.3 are now immediate upon choosing $N \sim \epsilon^{-1}$.

D.5 PROOF OF THEOREM 3.5

To motivate the proof, we first consider the Burgers' equation with the particular initial data $\bar u(x) = -\sin(x)$, with periodic boundary conditions on the interval $x \in [0,2\pi]$. The solution for this initial datum can be constructed via the well-known method of characteristics; we observe that the solution $u(x,t)$ with initial data $\bar u(x)$ is smooth for time $t \in [0,1)$, develops a shock discontinuity at $x = 0$ (and $x = 2\pi$) for $t \ge 1$, but remains otherwise smooth on the interval $x \in (0,2\pi)$ for all times. In fact, fixing a time $t \ge 0$, the solution $u(x,t)$ can be written down explicitly in terms of the bijective mapping (cp. Figure 4)
$$\Psi_t : [x_t, 2\pi - x_t] \to [0,2\pi], \quad \Psi_t(x_0) = x_0 - t\sin(x_0),$$
where
$$x_t = 0 \text{ for } t \le 1; \qquad x_t > 0 \text{ is the unique solution of } x_t = t\sin(x_t), \text{ for } t > 1. \tag{D.15}$$
We note that for given $x_0$, the curve $t \mapsto \Psi_t(x_0)$ traces out the characteristic curve for the Burgers' equation, starting at $x_0$ (and until it collides with the shock).
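The construction of $\Psi_t$ can be illustrated numerically: since $\Psi_t'(x_0) = 1 - t\cos(x_0) > 0$ for $t < 1$, the characteristic foot point can be found by bisection, and the resulting solution satisfies the implicit relation $u = -\sin(x - tu)$. A Python sketch, restricted to the pre-shock regime $t < 1$ (the function name is ours):

```python
import math

def burgers_sine_solution(x, t):
    """Solve for the (pre-shock, t < 1) solution of Burgers' equation with
    initial data u0(x) = -sin(x): find the characteristic foot point x0 with
    Psi_t(x0) = x0 - t*sin(x0) = x by bisection, then return u = -sin(x0)."""
    lo, hi = 0.0, 2 * math.pi   # Psi_t is strictly increasing on [0, 2*pi] for t < 1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid - t * math.sin(mid) < x:
            lo = mid
        else:
            hi = mid
    x0 = 0.5 * (lo + hi)
    return -math.sin(x0)
```

The returned value satisfies $u = -\sin(x - tu)$ up to bisection tolerance, consistent with $u(x,t) = -\sin(\Psi_t^{-1}(x))$ in (D.16) below.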
Following the method of characteristics, the solution $u(x,t)$ is then given in terms of $\Psi_t$ by
$$u(x,t) = -\sin\big(\Psi_t^{-1}(x)\big), \quad \text{for } x \in [0,2\pi]. \tag{D.16}$$
We are ultimately interested in solutions for more general periodic initial data of the form $\bar u(x) = -\sin(x-\xi)$; these can easily be obtained from the particular solution (D.16) via a shift. We summarize this observation in the following lemma:

Lemma D.15. Let $\xi \in [0,2\pi)$ be given, and fix a time $t \ge 0$. Consider the initial data $\bar u(x) = -\sin(x-\xi)$. Then the entropy solution $u(x,t)$ of the Burgers' equation with initial data $\bar u$ is given by
$$u(x,t) = \begin{cases} -\sin\big(\Psi_t^{-1}(x-\xi+2\pi)\big), & (x < \xi), \\ -\sin\big(\Psi_t^{-1}(x-\xi)\big), & (x \ge \xi), \end{cases} \tag{D.17}$$
for $x \in [0,2\pi]$, $t \ge 0$.

Lemma D.16. Let $t > 1$, and define $U : [0,2\pi] \to \mathbb{R}$ by $U(x) := -\sin(\Psi_t^{-1}(x))$. There exists $\Delta_t > 0$ (depending on $t$), such that $x \mapsto U(x)$ can be extended to an analytic function $\bar U : (-\Delta_t, 2\pi + \Delta_t) \to \mathbb{R}$, $x \mapsto \bar U(x)$; i.e., such that $\bar U(x) = U(x)$ for all $x \in [0,2\pi]$.

Given $\epsilon > 0$, we can thus choose $L \ge \gamma^{-2}\log(\epsilon^{-1})^2$, to obtain a neural network $\Phi_\epsilon := \Phi_L$, such that
$$\sup_{x\in[0,2\pi]}|U(x) - \Phi_\epsilon(x)| \le \epsilon, \quad \mathrm{depth}(\Phi_\epsilon) \le C\log(\epsilon^{-1})^2, \quad \mathrm{width}(\Phi_\epsilon) \le C,$$
for a constant $C > 0$, independent of $\epsilon$.

Lemma D.18. Let $t > 1$, and let $U : [-2\pi, 2\pi] \to \mathbb{R}$ be given by
$$U(x) := \begin{cases} -\sin\big(\Psi_t^{-1}(x+2\pi)\big), & (x < 0), \\ -\sin\big(\Psi_t^{-1}(x)\big), & (x \ge 0). \end{cases}$$
There exists a constant $C = C(t) > 0$, depending only on $t$, such that for any $\epsilon \in (0,\frac12]$, there exists a neural network $\Phi_\epsilon : \mathbb{R} \to \mathbb{R}$, such that $\mathrm{depth}(\Phi_\epsilon) \le C\log(\epsilon^{-1})^2$, $\mathrm{width}(\Phi_\epsilon) \le C$, and such that $\|\Phi_\epsilon - U\|_{L^1([-2\pi,2\pi])} \le \epsilon$.

Proof. By Corollary D.17, there exists a constant $C > 0$, such that for any $\epsilon > 0$, there exist neural networks $\Phi^-, \Phi^+ : \mathbb{R} \to \mathbb{R}$ with
$$\mathrm{depth}(\Phi^\pm) \le C\log(\epsilon^{-1})^2, \quad \mathrm{width}(\Phi^\pm) \le C,$$
such that
$$\|\Phi^-(x) - U(x)\|_{L^\infty([-2\pi,0])} \le \epsilon, \quad \|\Phi^+(x) - U(x)\|_{L^\infty([0,2\pi])} \le \epsilon.$$
This implies that
$$\big\|U - \big(1_{[-2\pi,0)}\Phi^- + 1_{[0,2\pi]}\Phi^+\big)\big\|_{L^\infty([-2\pi,2\pi])} \le \epsilon.$$
By Proposition D.2 (approximation of indicator functions), there exist neural networks $\chi^\pm_\epsilon : \mathbb{R} \to \mathbb{R}$ with uniformly bounded width and depth, such that
$$\|\chi^-_\epsilon - 1_{[-2\pi,0]}\|_{L^1} \le \epsilon, \quad \|\chi^+_\epsilon - 1_{[0,2\pi]}\|_{L^1} \le \epsilon,$$
and $\|\chi^\pm_\epsilon\|_{L^\infty} \le 1$. Combining this with Proposition D.4 (approximation of multiplication), it follows that there exists a neural network $\Phi_\epsilon : \mathbb{R} \to \mathbb{R}$,
$$\Phi_\epsilon(x) = \widehat\times_\epsilon(\chi^+_\epsilon, \Phi^+) + \widehat\times_\epsilon(\chi^-_\epsilon, \Phi^-),$$
such that
$$\big\|\Phi_\epsilon - \big(1_{[-2\pi,0)}\Phi^- + 1_{[0,2\pi]}\Phi^+\big)\big\|_{L^1([-2\pi,2\pi])} \le \big\|\widehat\times_\epsilon(\chi^+_\epsilon, \Phi^+) - 1_{[0,2\pi]}\Phi^+\big\|_{L^1([-2\pi,2\pi])} + \big\|\widehat\times_\epsilon(\chi^-_\epsilon, \Phi^-) - 1_{[-2\pi,0)}\Phi^-\big\|_{L^1([-2\pi,2\pi])}.$$
By the construction of $\widehat\times_\epsilon$, we have
$$\big\|\widehat\times_\epsilon(\chi^+_\epsilon, \Phi^+) - 1_{[0,2\pi]}\Phi^+\big\|_{L^1} \le \big\|\widehat\times_\epsilon(\chi^+_\epsilon, \Phi^+) - \chi^+_\epsilon\Phi^+\big\|_{L^1} + \big\|\big(\chi^+_\epsilon - 1_{[0,2\pi]}\big)\Phi^+\big\|_{L^1} \le 4\pi\epsilon + \|\chi^+_\epsilon - 1_{[0,2\pi]}\|_{L^1}\|\Phi^+\|_{L^\infty} \le (4\pi+2)\epsilon,$$
and similarly for the other term. Thus, it follows that
$$\big\|\Phi_\epsilon - \big(1_{[-2\pi,0)}\Phi^- + 1_{[0,2\pi]}\Phi^+\big)\big\|_{L^1([-2\pi,2\pi])} \le 2(4\pi+2)\epsilon,$$
and finally,
$$\|U - \Phi_\epsilon\|_{L^1([-2\pi,2\pi])} \le 4\pi\big\|U - \big(1_{[-2\pi,0)}\Phi^- + 1_{[0,2\pi]}\Phi^+\big)\big\|_{L^\infty([-2\pi,2\pi])} + \big\|\big(1_{[-2\pi,0)}\Phi^- + 1_{[0,2\pi]}\Phi^+\big) - \Phi_\epsilon\big\|_{L^1([-2\pi,2\pi])} \le 4\pi\epsilon + 2(4\pi+2)\epsilon = (12\pi+4)\epsilon,$$
for a neural network $\Phi_\epsilon : \mathbb{R} \to \mathbb{R}$ of size $\mathrm{depth}(\Phi_\epsilon) \le C\log(\epsilon^{-1})^2$, $\mathrm{width}(\Phi_\epsilon) \le C$. Replacing $\epsilon$ with $\tilde\epsilon = \epsilon/(12\pi+4)$ yields the claimed estimate.

Based on the above results, we can now prove the claimed error and complexity estimate for the shift-DeepONet approximation of the Burgers' equation, Theorem 3.5.

Proof of Theorem 3.5. Fix $t > 1$. Let $U : [-2\pi, 2\pi] \to \mathbb{R}$ be the function from Lemma D.18. By Lemma D.15, the exact solution of the Burgers' equation with initial data $\bar u(x) = -\sin(x-\xi)$ at time $t$ is given by
$$u(x,t) = U(x-\xi), \quad \forall x \in [0,2\pi].$$
From Lemma D.18 (note that $x - \xi \in [-2\pi, 2\pi]$), it follows that there exists a constant $C > 0$, such that for any $\epsilon > 0$, there exists a neural network $\Phi_\epsilon : \mathbb{R} \to \mathbb{R}$, such that
$$\|u(\,\cdot\,, t) - \Phi_\epsilon(\,\cdot\, - \xi)\|_{L^1([0,2\pi])} \le \epsilon,$$
and $\mathrm{depth}(\Phi_\epsilon) \le C\log(\epsilon^{-1})^2$, $\mathrm{width}(\Phi_\epsilon) \le C$.
We finally observe that for equidistant sensor points $x_0, x_1, x_2 \in [0,2\pi)$, $x_j = 2\pi j/3$, there exists a matrix $A \in \mathbb{R}^{2\times 3}$ which, for any function of the form $g(x) = \alpha_0 + \alpha_1\sin(x) + \alpha_2\cos(x)$, maps
$$[g(x_0), g(x_1), g(x_2)]^T \mapsto A\cdot[g(x_0), g(x_1), g(x_2)]^T = [-\alpha_1, \alpha_2]^T.$$
Clearly, the considered initial data $\bar u(x) = -\sin(x-\xi)$ is of this form, for any $\xi \in [0,2\pi)$; more precisely, we have
$$\bar u(x) = -\sin(x-\xi) = -\cos(\xi)\sin(x) + \sin(\xi)\cos(x),$$
so that $A\cdot[\bar u(x_0), \bar u(x_1), \bar u(x_2)]^T = [\cos(\xi), \sin(\xi)]^T$. As a next step, we recall that there exists $C > 0$, such that for any $\epsilon > 0$, there exists a neural network $\Xi_\epsilon : \mathbb{R}^2 \to [0,2\pi]$ (cp. Lemma D.8), such that
$$\sup_{\xi\in[0,2\pi-\epsilon)}\big|\xi - \Xi_\epsilon(\cos(\xi), \sin(\xi))\big| < \epsilon,$$
such that $\Xi_\epsilon(\cos(\xi), \sin(\xi)) \in [0,2\pi]$ for all $\xi$, and $\mathrm{depth}(\Xi_\epsilon) \le C\log(\epsilon^{-1})^2$, $\mathrm{width}(\Xi_\epsilon) \le C$. Based on this network $\Xi_\epsilon$, we can now define a shift-DeepONet approximation $\mathcal{N}_{\mathrm{sDON}}(\bar u) \approx \mathcal{G}_{\mathrm{Burg}}(\bar u)$ of size $\mathrm{depth}(\mathcal{N}_{\mathrm{sDON}}) \le C\log(\epsilon^{-1})^2$, $\mathrm{width}(\mathcal{N}_{\mathrm{sDON}}) \le C$, by the composition
$$\mathcal{N}_{\mathrm{sDON}}(\bar u)(x) := \Phi_\epsilon\big(x - \Xi_\epsilon(A\cdot\bar u(X))\big), \tag{D.18}$$
where $\bar u(X) := [\bar u(x_0), \bar u(x_1), \bar u(x_2)]^T$. Denoting $\widehat\Xi_\epsilon := \Xi_\epsilon(A\cdot\bar u(X))$, we have for $\xi \in [0, 2\pi-\epsilon]$:
$$\|\mathcal{G}_{\mathrm{Burg}}(\bar u) - \mathcal{N}_{\mathrm{sDON}}(\bar u)\|_{L^1} = \|U(\,\cdot\,-\xi) - \Phi_\epsilon(\,\cdot\,-\widehat\Xi_\epsilon)\|_{L^1} \le \|U(\,\cdot\,-\xi) - U(\,\cdot\,-\widehat\Xi_\epsilon)\|_{L^1} + \|U(\,\cdot\,-\widehat\Xi_\epsilon) - \Phi_\epsilon(\,\cdot\,-\widehat\Xi_\epsilon)\|_{L^1} \le C|\xi - \widehat\Xi_\epsilon| + \epsilon \le (C+1)\epsilon,$$
where $C$ only depends on $U$, and is independent of $\epsilon > 0$. On the other hand, for $\xi > 2\pi - \epsilon$, we have
$$\|\mathcal{G}_{\mathrm{Burg}}(\bar u) - \mathcal{N}_{\mathrm{sDON}}(\bar u)\|_{L^1} \le \|U(\,\cdot\,-\xi)\|_{L^1} + \|\Phi_\epsilon(\,\cdot\,-\widehat\Xi_\epsilon)\|_{L^1} \le 2\pi\big(\|U(\,\cdot\,-\xi)\|_{L^\infty} + \|\Phi_\epsilon(\,\cdot\,-\widehat\Xi_\epsilon)\|_{L^\infty}\big) \le 6\pi,$$
which is uniformly bounded. It follows that
$$\mathbb{E}_{\bar u\sim\mu}\|\mathcal{G}_{\mathrm{Burg}}(\bar u) - \mathcal{N}_{\mathrm{sDON}}(\bar u)\|_{L^1} = \Big(\int_0^{2\pi-\epsilon} + \int_{2\pi-\epsilon}^{2\pi}\Big)\|\mathcal{G}_{\mathrm{Burg}}(\bar u) - \mathcal{N}_{\mathrm{sDON}}(\bar u)\|_{L^1}\,d\xi \le 2\pi(C+1)\epsilon + 6\pi\epsilon,$$
with a constant $C > 0$, independent of $\epsilon > 0$.
Replacing $\epsilon$ by $\tilde\epsilon = \epsilon/C'$ for a sufficiently large constant $C' > 0$ (depending only on the constants in the last estimate above), one readily sees that there exists a shift-DeepONet $\mathcal{N}_{\mathrm{sDON}}$, such that
$$\widehat{\mathcal{E}}(\mathcal{N}_{\mathrm{sDON}}) = \mathbb{E}_{\bar u\sim\mu}\|\mathcal{G}_{\mathrm{Burg}}(\bar u) - \mathcal{N}_{\mathrm{sDON}}(\bar u)\|_{L^1} \le \epsilon,$$
and such that $\mathrm{width}(\mathcal{N}_{\mathrm{sDON}}) \le C$, $\mathrm{depth}(\mathcal{N}_{\mathrm{sDON}}) \le C\log(\epsilon^{-1})^2$, and $\mathrm{size}(\mathcal{N}_{\mathrm{sDON}}) \le C\log(\epsilon^{-1})^2$, for a constant $C > 0$, independent of $\epsilon > 0$. This concludes our proof.

D.6 PROOF OF THEOREM 3.6

Proof. Step 1: Assume that the grid size is $N \ge 3$. Then there exists an FNO $\mathcal{N}_1$, such that $\mathcal{N}_1(\bar u) = [\cos(\xi), \sin(\xi)]$, with $\mathrm{depth}(\mathcal{N}_1) \le C$, $d_v \le C$, $k_{\max} = 1$. To see this, we note that for any $\xi \in [0,2\pi]$, the input function $\bar u(x) = -\sin(x-\xi) = -\cos(\xi)\sin(x) + \sin(\xi)\cos(x)$ can be written in terms of a sine/cosine expansion with coefficients determined by $\cos(\xi)$ and $\sin(\xi)$. For $N \ge 3$ grid points, these coefficients can be retrieved exactly by a discrete Fourier transform. Therefore, combining a suitable lifting to $d_v = 2$ with a Fourier multiplier matrix $P$, we can exactly represent the mapping
$$\bar u \mapsto \mathcal{F}_N^{-1}\big(P\cdot\mathcal{F}_N(R\cdot\bar u)\big) = \begin{pmatrix} \cos(\xi)\sin(x) \\ \sin(\xi)\cos(x) \end{pmatrix}$$
by a linear Fourier layer. Adding a suitable bias function $b(x) = [\sin(x), \cos(x)]^T$, and composing with additional nonlinear layers, it is then straightforward to check that there exists a (ReLU-)FNO representing
$$\bar u \mapsto \begin{pmatrix} \cos(\xi)\sin(x) + \sin(x) \\ \sin(\xi)\cos(x) + \cos(x) \end{pmatrix} = \begin{pmatrix} (\cos(\xi)+1)\sin(x) \\ (\sin(\xi)+1)\cos(x) \end{pmatrix} \mapsto \begin{pmatrix} |\cos(\xi)+1|\,|\sin(x)| \\ |\sin(\xi)+1|\,|\cos(x)| \end{pmatrix} \mapsto \begin{pmatrix} |\cos(\xi)+1|\sum_{j=1}^N|\sin(x_j)| \\ |\sin(\xi)+1|\sum_{j=1}^N|\cos(x_j)| \end{pmatrix} \mapsto \begin{pmatrix} |\cos(\xi)+1| - 1 \\ |\sin(\xi)+1| - 1 \end{pmatrix} = \begin{pmatrix} \cos(\xi) \\ \sin(\xi) \end{pmatrix},$$
where in the last step the known grid constants $\sum_j|\sin(x_j)|$ and $\sum_j|\cos(x_j)|$ are divided out by a fixed linear layer, and we use that $\cos(\xi)+1, \sin(\xi)+1 \ge 0$.

Step 2: Given this construction of $\mathcal{N}_1$, the remainder of the proof follows essentially the same argument as the proof of Theorem 3.5 in SM D.5: we again observe that the solution $u(x,t)$ with initial data $\bar u(x) = -\sin(x-\xi)$ is well approximated by the composition
$$\mathcal{N}_{\mathrm{FNO}}(\bar u)(x) := \Phi_\epsilon\big(x - \Xi_\epsilon(\mathcal{N}_1(\bar u))\big),$$
such that (by verbatim repetition of the calculations after (D.18) for shift-DeepONets)
$$\widehat{\mathcal{E}}(\mathcal{N}_{\mathrm{FNO}}) = \mathbb{E}_{\bar u\sim\mu}\|\mathcal{G}_{\mathrm{Burg}}(\bar u) - \mathcal{N}_{\mathrm{FNO}}(\bar u)\|_{L^1} \le \epsilon,$$
where $\Phi_\epsilon : \mathbb{R} \to \mathbb{R}$ is a ReLU neural network with $\mathrm{depth}(\Phi_\epsilon) \le C\log(\epsilon^{-1})^2$, $\mathrm{width}(\Phi_\epsilon) \le C$, and $\Xi_\epsilon : \mathbb{R}^2 \to [0,2\pi]$ is a ReLU network with $\mathrm{depth}(\Xi_\epsilon) \le C\log(\epsilon^{-1})^2$, $\mathrm{width}(\Xi_\epsilon) \le C$.
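The exact recovery of $(\cos\xi, \sin\xi)$ in Step 1 boils down to the fact that the $k = 1$ discrete Fourier coefficient is exact for the band-limited input $-\sin(x-\xi)$ whenever $N \ge 3$. A Python sketch (function names are ours, for illustration):

```python
import cmath
import math

def recover_phase(xi, N=8):
    """Step-1 sketch: for u(x) = -sin(x - xi) sampled at x_j = 2*pi*j/N (N >= 3),
    the k = 1 discrete Fourier coefficient recovers (cos(xi), sin(xi)) exactly."""
    xs = [2 * math.pi * j / N for j in range(N)]
    u = [-math.sin(x - xi) for x in xs]
    # c1 = (2/N) * sum_j u_j * e^{-i x_j} = sin(xi) + i*cos(xi), for N >= 3
    c1 = sum(uj * cmath.exp(-1j * xj) for uj, xj in zip(u, xs)) * (2.0 / N)
    return c1.imag, c1.real           # (cos(xi), sin(xi))
```

The recovery is exact (up to floating-point roundoff), with no dependence on the grid size beyond $N \ge 3$, which is why $\mathcal{N}_1$ has uniformly bounded complexity.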
Being the composition of an FNO $\mathcal{N}_1$ satisfying $k_{\max} = 1$, $d_v \le C$, $\mathrm{depth}(\mathcal{N}_1) \le C$ with the ReLU networks $\Xi_\epsilon$ and $\Phi_\epsilon$ above, the resulting FNO $\mathcal{N}_{\mathrm{FNO}}$ satisfies the claimed complexity bounds. This concludes the proof.

E.1.1 FEEDFORWARD NEURAL NETWORKS

Given an input $y \in \mathbb{R}^m$, a feedforward neural network (also termed a multi-layer perceptron) transforms it to an output through layers of units (neurons), which consist of either affine-linear maps between units (in successive layers) or scalar nonlinear activation functions within units (Goodfellow et al., 2016), resulting in the representation
$$u_\theta(y) = C_L \circ \sigma \circ C_{L-1} \dots \circ \sigma \circ C_2 \circ \sigma \circ C_1(y). \tag{E.1}$$
Here, $\circ$ refers to the composition of functions and $\sigma$ is a scalar (nonlinear) activation function. For any $1 \le \ell \le L$, we define
$$C_\ell z_\ell = W_\ell z_\ell + b_\ell, \quad \text{for } W_\ell \in \mathbb{R}^{d_{\ell+1}\times d_\ell},\ z_\ell \in \mathbb{R}^{d_\ell},\ b_\ell \in \mathbb{R}^{d_{\ell+1}}, \tag{E.2}$$
and denote
$$\theta = \{W_\ell, b_\ell\}_{\ell=1}^L \tag{E.3}$$
the concatenated set of (tunable) weights for the network. Thus, in the terminology of machine learning, a feedforward neural network (E.1) consists of an input layer, an output layer, and $L$ hidden layers with $d_\ell$ neurons, $1 < \ell < L$. In all numerical experiments, we consider a uniform number of neurons across all hidden layers of the network, $d_\ell = d_{\ell-1} = d$, $1 < \ell < L$.

E.1.2 RESIDUAL NEURAL NETWORK

In a residual neural network, the plain layer composition is replaced by residual blocks with skip connections,
$$r(z_\ell, z_{\ell-k}) = \sigma(W_\ell z_\ell + b_\ell) + z_{\ell-k}. \tag{E.4}$$
In all numerical experiments we set $k = 2$. The residual network takes as input a sample function $\bar u \in \mathcal{X}$, encoded at the Cartesian grid points $(x_1, \dots, x_m)$, $\mathcal{E}(\bar u) = (\bar u(x_1), \dots, \bar u(x_m)) \in \mathbb{R}^m$, and outputs the sample $\mathcal{G}(\bar u) \in \mathcal{Y}$ encoded at the same set of points, $\mathcal{E}(\mathcal{G}(\bar u)) = (\mathcal{G}(\bar u)(x_1), \dots, \mathcal{G}(\bar u)(x_m)) \in \mathbb{R}^m$. For the compressible Euler equations, the encoded input is defined as
$$\mathcal{E}(\bar u) = \big(\rho_0(x_1), \dots, \rho_0(x_m),\ \rho_0(x_1)u_0(x_1), \dots, \rho_0(x_m)u_0(x_m),\ E_0(x_1), \dots, E_0(x_m)\big) \in \mathbb{R}^{3m}$$
for the 1d problem, and
$$\mathcal{E}(\bar u) = \big(\rho_0(x_1), \dots, \rho_0(x_{m^2}),\ \rho_0(x_1)u_0(x_1), \dots, \rho_0(x_{m^2})u_0(x_{m^2}),\ \rho_0(x_1)v_0(x_1), \dots, \rho_0(x_{m^2})v_0(x_{m^2}),\ E_0(x_1), \dots, E_0(x_{m^2})\big) \in \mathbb{R}^{4m^2} \tag{E.5}$$
for the 2d problem.

E.1.3 FULLY CONVOLUTIONAL NEURAL NETWORK

Fully convolutional neural networks are a special class of convolutional networks which are independent of the input resolution. The networks consist of an encoder and a decoder, both defined by a composition of linear and nonlinear transformations:
$$E_{\theta_e}(y) = C^e_L \circ \sigma \circ C^e_{L-1} \dots \circ \sigma \circ C^e_2 \circ \sigma \circ C^e_1(y), \quad D_{\theta_d}(z) = C^d_L \circ \sigma \circ C^d_{L-1} \dots \circ \sigma \circ C^d_2 \circ \sigma \circ C^d_1(z), \quad u_\theta(y) = D_{\theta_d} \circ E_{\theta_e}(y). \tag{E.6}$$
The affine transformation $C_\ell$ commonly corresponds to a convolution operation in the encoder, and a transposed convolution (also known as deconvolution) in the decoder. The latter can also be performed with a simple linear (or bilinear) upsampling followed by a convolution operation, similar to the encoder. The (de)convolution is performed with a kernel $W_\ell \in \mathbb{R}^{k_\ell}$ (for 1d problems, and $W_\ell \in \mathbb{R}^{k_\ell\times k_\ell}$ for 2d problems), stride $s$ and padding $p$. It takes as input a tensor $z_\ell \in \mathbb{R}^{w_\ell\times c_\ell}$ (for 1d problems, and $z_\ell \in \mathbb{R}^{w_\ell\times h_\ell\times c_\ell}$ for 2d problems), with $c_\ell$ being the number of input channels, and computes the corresponding (transposed) convolution output.

[Table: learning-rate schedules used for each benchmark (Advection Equation, Burgers' Equation, Shocktube Problem, 2D Riemann Problem) and each compared model; the entries per benchmark are: step-wise decay every 100 steps with $\gamma = 0.999$; step-wise decay every 50 steps with $\gamma = 0.99$; step-wise decay every 100 steps with $\gamma = 0.999$; exponential decay with $\gamma = 0.999$; none.]

The results presented in Table 9 further demonstrate that FNO is the best-performing model on all the benchmarks. In order to further understand the superior performance of FNOs, we consider the linear advection equation. As stated in the main text, given the linear nature of the underlying operator, a single FNO Fourier layer suffices to represent this operator. However, the layer width will grow linearly with decreasing error, requiring $k_{\max} \sim \epsilon^{-1}$ Fourier modes. Hence, it is imperative to use the nonlinear reconstruction, as seen in the proof of Theorem 3.3, to obtain good performance. To empirically illustrate this, we compare FNO with different $k_{\max}$ (number of Fourier modes) with the corresponding error obtained by simply projecting the outputs of the operator onto the linear space spanned by the corresponding number of Fourier modes. This projection onto Fourier space amounts to the action of a linear version of FNO. The errors, presented in Table 11 (keeping the total number of degrees of freedom roughly constant) and Table 12 (keeping the lifting dimension $d_v = 64$ fixed), clearly show that, as predicted by the theory, very few Fourier modes, even $k_{\max} = 1$, are enough to obtain an error of approximately 1%. On the other hand, the corresponding error with just the linear projection is two orders of magnitude higher. In fact, one needs to project onto approximately 500 Fourier modes to obtain an error of approximately 1% with this linear method. This stark contrast is further illustrated in Figure 9, which shows that FNOs relying on a nonlinear reconstruction mechanism can outperform a competing method based on linear Fourier reconstruction (Fourier linear projection) by one to two orders of magnitude in terms of approximation accuracy. This experiment clearly brings out the role of the nonlinearities in FNO in enhancing its expressive power on advection-dominated problems.
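The slow decay of the linear Fourier projection can be reproduced from Parseval's identity: the Fourier coefficients of a box function decay like $1/k$, so the $L^2$ projection error onto $|k| \le K$ modes decays only like $K^{-1/2}$. A back-of-the-envelope Python check (not the paper's experiment; function name and parameters are ours):

```python
import math

def box_truncation_error(K, w=1.0, n=20000):
    """L2 error of projecting the box 1_{[-w/2, w/2]} on [0, 2*pi) onto Fourier
    modes |k| <= K. Coefficients c_k = sin(k*w/2)/(pi*k) decay like 1/k, so by
    Parseval the error decays only like K^{-1/2}: the slow linear-reconstruction
    rate. The tail sum is truncated at |k| = n."""
    err2 = 2 * math.pi * 2 * sum((math.sin(k * w / 2) / (math.pi * k)) ** 2
                                 for k in range(K + 1, n))
    return math.sqrt(err2)
```

Increasing $K$ by a factor of 16 only shrinks the error by roughly a factor of 4, in qualitative agreement with the gap between $k_{\max} = 1$ FNOs and the ~500 modes needed by the linear projection.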
For the linear advection problem, we first recall that the fundamental limitation of DeepONets is due to the lower bound of Theorem 3.1, which shows that DeepONets can achieve at best a linear error decay $\sim 1/p$ in terms of the number of DeepONet basis functions $p$. To compare this theoretical prediction with empirical observation, we carry out a scan over different choices of $p = 1, 2, 4, \dots, 64$ in Table 13. The table clearly shows that the DeepONet training/testing errors both decrease with increasing $p$. Furthermore, inspection of the corresponding Figure 8 reveals that this decrease follows an approximately linear decay, which is only slightly worse than the (optimal) lower bound $\sim 1/p$, consistent with the approximation theory. In contrast, our approximation-theoretic result shows that shift-DeepONets can overcome the lower bound, and achieve high accuracy even with a bounded number of basis functions $p \le C$ (cp. Theorem 3.2). This is reflected by the almost immediate saturation of the error as a function of $p$ in Figure 8 (cf. also Table 13). Consistent with the theoretical insight, our numerical experiment demonstrates that (a) the error for shift-DeepONets (based on nonlinear reconstruction) is considerably smaller than the corresponding error for DeepONets (based on linear reconstruction), and (b) increasing the number of basis functions beyond very moderate values of $p \sim 4, 8$ does not improve the accuracy of shift-DeepONets, consistent with the fact that a uniformly bounded number $p \le C$ is sufficient.



Similar to the one-dimensional case, additional shifted copies of these basis functions are needed in practice to account for the $2\pi$-periodicity of the output function. This step is where the $\epsilon$-gap at the right boundary $2\pi - \epsilon$ is needed, as the points at angles $\xi = 0$ and $\xi = 2\pi$ are identical on the circle.



Although the learning of operators arising from PDEs has attracted great interest in the recent literature, there are very few attempts to extend the proposed architectures to PDEs with discontinuous solutions. Empirical results for some examples of discontinuities or sharp gradients were presented in Mao et al. (2020b) (compressible Navier-Stokes equations with DeepONets), Kissas et al. (2022) (shallow-water equations with an attention-based framework), and in Seidman et al. (

Figure 1: Sharper gradients (at length scale $\delta$) can cause a slow decay of the corresponding PCA eigenvalues $\lambda_k^{(\delta)}$.


Figure 2: Approximate correspondence between measures of the complexity (d=dimension of the domain D ⊂ R d of input/output functions).

(shift-)DeepONet: Quantities of interest include the number of sensor points $m$, the number of trunk-/branch-net functions $p$, and the width, depth and size of the operator network. We first recall the definition of the width and depth for DeepONet,
$$\mathrm{width}(\mathcal{N}_{\mathrm{DON}}) := \mathrm{width}(\beta) + \mathrm{width}(\tau), \quad \mathrm{depth}(\mathcal{N}_{\mathrm{DON}}) := \max\{\mathrm{depth}(\beta), \mathrm{depth}(\tau)\},$$
where the width and depth of the conventional neural networks on the right-hand side are defined in terms of the maximum hidden layer width (number of neurons) and the number of hidden layers, respectively. To ensure a fair comparison between DeepONet, shift-DeepONet and FNO, we define the size of a DeepONet, assuming a fully connected (non-sparse) architecture, as
$$\mathrm{size}(\mathcal{N}_{\mathrm{DON}}) := (m+p)\,\mathrm{width}(\mathcal{N}_{\mathrm{DON}}) + \mathrm{width}(\mathcal{N}_{\mathrm{DON}})^2\,\mathrm{depth}(\mathcal{N}_{\mathrm{DON}}),$$
where the second term measures the complexity of the hidden layers, and the first term takes into account the input and output layers. Furthermore, all architectures we consider have a width which scales at least as $\mathrm{width}(\mathcal{N}_{\mathrm{DON}}) \gtrsim \min(p,m)$, implying the following natural lower size bound:
$$\mathrm{size}(\mathcal{N}_{\mathrm{DON}}) \gtrsim (m+p)\min(p,m) + \min(p,m)^2\,\mathrm{depth}(\mathcal{N}_{\mathrm{DON}}). \tag{B.1}$$
We also introduce the analogous notions for shift-DeepONet:
$$\mathrm{width}(\mathcal{N}_{\mathrm{sDON}}) := \mathrm{width}(\beta) + \mathrm{width}(\tau) + \mathrm{width}(\mathcal{A}) + \mathrm{width}(\gamma),$$
$$\mathrm{depth}(\mathcal{N}_{\mathrm{sDON}}) := \max\{\mathrm{depth}(\beta), \mathrm{depth}(\tau), \mathrm{depth}(\mathcal{A}), \mathrm{depth}(\gamma)\},$$
$$\mathrm{size}(\mathcal{N}_{\mathrm{sDON}}) := (m+p)\,\mathrm{width}(\mathcal{N}_{\mathrm{sDON}}) + \mathrm{width}(\mathcal{N}_{\mathrm{sDON}})^2\,\mathrm{depth}(\mathcal{N}_{\mathrm{sDON}}).$$

where the first term in parentheses corresponds to size(W_ℓ) = d_v^2, the second term accounts for size(P_ℓ) = O(d_v^2 k_max^d), and the third term counts the degrees of freedom of the bias, size(b_ℓ(x_j)) = O(d_v N^d). The additional factor depth(N_FNO) takes into account that there are L = depth(N_FNO) such layers. If the bias b_ℓ is constrained to have Fourier coefficients b̂_ℓ(k) ≡ 0 for |k| > k_max (as we assumed in the main text), then the representation of b_ℓ only requires size(b̂_ℓ

…, ℓ_d), with ℓ_1, …, ℓ_d belonging to a coarse subset of indices (cp. Step 1 of the proof in SM D.3; this requires a neural network of size O(ϵ^{-1})),
• for any box wave of the form (C.3), and for any direction j = 1, …, d, a summation over the coarse indices (i.e., the sum over ℓ_k for k ≠ j) of the encoded values ū(x_{ℓ_1,…,ℓ_d}), combined with a ReLU truncation, allows us to construct a DNN mapping

Indicator function). Fix an interval [a, b] ⊂ R and let 1_{[a,b]}(x) be the indicator function of [a, b]. For any ϵ > 0 and p ∈ [1, ∞), there exists a ReLU neural network Φ_ϵ : R → R, such that depth(Φ_ϵ) = 1, width(Φ_ϵ) = 4, and ‖Φ_ϵ − 1_{[a,b]}‖_{L^p([a,b])} ≤ ϵ.
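One explicit construction achieving this (our own sketch, under the assumption that a trapezoidal profile is used; the proof's construction may differ in details) uses four hidden ReLU neurons to ramp from 0 to 1 over a window of size δ around each endpoint:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def indicator_net(x, a, b, delta):
    # Depth-1, width-4 ReLU network: equals 1 on [a, b], 0 outside
    # [a - delta, b + delta], and is linear on the two transition windows,
    # so the L^p error is controlled by delta.
    return (relu(x - (a - delta)) - relu(x - a)
            - relu(x - b) + relu(x - (b + delta))) / delta
```

Choosing δ small enough as a function of ϵ and p yields the stated approximation bound.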

Figure 3: Illustration of partition of unity network for J = 5, [a, b] = [0, 1].

Figure 3): Proposition D.3 (Partition of unity). Fix an interval [a, b] ⊂ R. For J ∈ N, let ∆x := (b − a)/J, and let x_j := a + j∆x, j = 0, …, J, be an equidistant grid on [a, b]. Then, for any ϵ ∈ (0, ∆x/2], there exists a ReLU neural network Λ : R → R^J, x ↦ (Λ_1(x), …, Λ_J(x)), such that
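The partition-of-unity idea can be sketched with standard piecewise-linear "hat" functions, each built from three ReLUs (this is our own illustrative construction, not necessarily the one used in the proof; the full construction would also clamp the two boundary hats):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x, left, center, right):
    # Piecewise-linear bump: 0 at `left` and `right`, 1 at `center`.
    return (relu(x - left) / (center - left)
            - relu(x - center) * (1.0 / (center - left) + 1.0 / (right - center))
            + relu(x - right) / (right - center))

def partition_of_unity(x, a, b, J):
    # One hat per grid cell, centered at the cell midpoints; away from the
    # boundary cells the hats sum exactly to one.
    dx = (b - a) / J
    centers = a + dx * (np.arange(J) + 0.5)
    return np.array([hat(x, c - dx, c, c + dx) for c in centers])
```

On [a + ∆x/2, b − ∆x/2] exactly two neighboring hats are active at any point and their values sum to one.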

is well-defined and possesses an analytic, invertible extension to the open interval (−1, 1) (with analytic inverse). By Theorem D.6, it follows that for any ϵ > 0, we can find neural networks Φ_1, …, Φ_5, such that |Φ_j(cos(ξ), sin(ξ)) − ξ| ≤ ϵ on an open set containing the corresponding domain, with depth(Φ_j) ≤ C log(ϵ^{-1})^2 and width(Φ_j) ≤ C. By a straightforward partition-of-unity argument based on Proposition D.3, we can combine these mappings into a global map, represented by a neural network Ξ_ϵ : R^2 → [0, 2π], such that sup_{ξ∈[0,2π−ϵ]} |Ξ_ϵ(cos(ξ), sin(ξ)) − ξ| ≤ ϵ, and such that depth(Ξ_ϵ) ≤ C log(ϵ^{-1})^2, width(Ξ_ϵ) ≤ C.

D.2 PROOF OF THEOREM 3.1

of the linear advection equation, with input measure µ given as the law of random box functions of height h ∈ [h̲, h̄], width w ∈ [w̲, w̄] and shift ξ ∈ [0, 2π]. Let M > 0. There exists a constant C = C(M, µ) > 0, depending only on µ and M, with the following property: if N(ū) = ∑_{k=1}^p β_k(ū) τ_k is any operator approximation with linear reconstruction dimension p, such that sup_{ū∼µ} ‖N

∑_{k=1}^{2m} |α_k|^2 is minimized among all α_1, …, α_{2m} satisfying the constraint ∑_{k=1}^{2m} |α_k| = 2π if, and only if, |α_1

Figure 4: Illustration of Ψ_t(x_0): (a) characteristics traced out by t ↦ Ψ_t(x_0) (until collision with the shock), (b) Ψ_t(x_0) before shock formation, (c) Ψ_t(x_0) after shock formation, including the interval [x_t, 2π − x_t] (red limits) and the larger domain (∆_t, 2π − ∆_t) (blue limits) allowing for bijective analytic continuation.

Figure 8: Testing error obtained by training DON and sDON with different numbers of basis functions p for the linear advection problem.

Figure 9: Testing error obtained by training FNO with different numbers of modes for the linear advection problem, and the corresponding error obtained with a linear Fourier projection.

Figure 10: Illustration of two input (blue) and output (orange) samples for the advection equation.

Figure 11: Illustration of two input (blue) and output (orange) samples for the Burgers' equation.

Relative median L^1 error computed over 128 testing samples for different benchmarks and models.

of the linear advection equation, with input measure µ given as the law of random box functions of height h ∈ [h̲, h̄], width w ∈ [w̲, w̄] and shift ξ ∈ [0, 2π]. There exists an absolute constant C > 0 with the following property: if N_DON is a DeepONet approximation with m sensor points, then

with the two ordinary neural networks Φ_ϵ and Ξ_ϵ, it follows that N_FNO can itself be represented by an FNO with k_max = 1, d_v ≤ C and depth(N_FNO) ≤ C log(ϵ^{-1})^2. By the general complexity estimate (B.2), size(N_FNO) ≲ d_v^2 k_max^d depth(N_FNO), we also obtain the claimed upper complexity bound size(N_FNO) ≤ C log(ϵ^{-1})^2.

The number of layers L, the number of neurons d and the activation function σ are chosen through cross-validation.

Learning rate schedulers for the different benchmarks and models; γ denotes the learning rate decay factor.

Minimum (top sub-row) and maximum (bottom sub-row) number of trainable parameters among the grid-search hyperparameter configurations.

Best-performing ResNet hyperparameter configuration for the different benchmark problems.

Best-performing fully convolutional neural network hyperparameter configuration for the different benchmark problems.

Training and testing errors obtained by training DON and sDON for the linear advection problem with different numbers of basis functions.


G is a convolution defined by (1, 1, 0, c, 1), producing z_{ℓ+1} ∈ R^{w_{ℓ+1}×c_{ℓ+1}} (for 1d problems, and z_{ℓ+1} ∈ R^{w_{ℓ+1}×h_{ℓ+1}×c_{ℓ+1}} for 2d problems). Therefore, a (de)convolutional affine transformation can be uniquely identified with the tuple (k_ℓ, s, p, c_ℓ, c_{ℓ+1}). The main difference between the encoder's and the decoder's transformations is that for the encoder h_{ℓ+1} < h_ℓ, w_{ℓ+1} < w_ℓ, c_{ℓ+1} > c_ℓ, while for the decoder h_{ℓ+1} > h_ℓ, w_{ℓ+1} > w_ℓ, c_{ℓ+1} < c_ℓ. For the linear advection equation and the Burgers' equation we employ the same variable encoding of the input and output samples as for ResNet. For the compressible Euler equations, on the other hand, each input variable is embedded in an individual channel; more precisely, E(ū) ∈ R^{m×3} for the shock-tube problem, and E(ū) ∈ R^{m×m×4} for the 2d Riemann problem. The architectures used in the benchmark examples are shown in Figures 5, 6 and 7. In the experiments, the number of channels c (see Figures 5, 6 and 7 for an explanation of its meaning) and the activation function σ are selected with cross-validation.
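Given the tuple identification (k, s, p, c_in, c_out), the encoder/decoder spatial sizes follow the usual (de)convolution arithmetic. A small sketch (hypothetical helper functions, not from the paper's code; the transposed-convolution formula assumes no output padding):

```python
def conv_out_size(w_in, k, s, p):
    # Standard convolution arithmetic: floor((w_in + 2p - k) / s) + 1.
    return (w_in + 2 * p - k) // s + 1

def deconv_out_size(w_in, k, s, p):
    # Transposed convolution inverts the formula above: s(w_in - 1) + k - 2p.
    return s * (w_in - 1) + k - 2 * p
```

For example, a stride-2 convolution with k = 4, p = 1 halves the spatial size (encoder), while the matching transposed convolution doubles it back (decoder).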

E.1.4 DEEPONET AND SHIFT-DEEPONET

The architectures of the branch and trunk nets are chosen according to the benchmark addressed. In particular, for the first two numerical experiments, we employ standard feed-forward neural networks for both branch and trunk net, with a skip connection between the first and the last hidden layer of the branch. For the compressible Euler equations, on the other hand, we use a convolutional branch network obtained as a composition of L blocks, each consisting of a convolution followed by a batch normalization (BN) and an activation. The output is then flattened and forwarded through a multilayer perceptron with 2 layers of 256 neurons and activation function σ. For the shift- and scale-nets of shift-DeepONet, we use the same architecture as for the branch. In contrast to the rest of the models, the training samples for DeepONet and shift-DeepONet are encoded at m and n uniformly distributed random points, respectively. Specifically, the encoding points are a randomly picked subset of the grid points used for the other models. The number of encoding points m and n, together with the number of layers L, the number of units d and the activation functions of the trunk and branch nets, are chosen through cross-validation.
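The DeepONet evaluation described above can be sketched as follows (a minimal illustration with plain fully connected branch/trunk nets; the random weights, sizes and ReLU activation are placeholders, not the cross-validated configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights, biases):
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(W @ x + b, 0.0)          # hidden ReLU layers
    return weights[-1] @ x + biases[-1]          # linear output layer

m, p, d_hidden = 32, 8, 16                       # sensors, basis functions, width
branch_W = [rng.normal(size=(d_hidden, m)), rng.normal(size=(p, d_hidden))]
branch_b = [np.zeros(d_hidden), np.zeros(p)]
trunk_W = [rng.normal(size=(d_hidden, 1)), rng.normal(size=(p, d_hidden))]
trunk_b = [np.zeros(d_hidden), np.zeros(p)]

def deeponet(u_sensors, y):
    beta = mlp(u_sensors, branch_W, branch_b)    # coefficients from the input
    tau = mlp(np.array([y]), trunk_W, trunk_b)   # basis evaluated at point y
    return beta @ tau                            # linear reconstruction step
```

Shift-DeepONet would additionally pass y through input-dependent shift and scale nets before the trunk, which is what makes its reconstruction nonlinear in ū.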

E.1.5 FOURIER NEURAL OPERATOR

We use the implementation of the FNO model provided by the authors of Li et al. (2021a). Specifically, the lifting R is defined by a linear transformation from R^{d_u×m} to R^{d_v×m}, where d_u is the number of inputs, and the projection Q to the target space is performed by a neural network with a single hidden layer of 128 neurons and GeLU activation function. The same activation function is used for all the Fourier layers as well. Moreover, the weight matrix W_ℓ used in the residual connection derives from a convolutional layer, for all 1 < ℓ < L − 1. We use the same sample encoding employed for the fully convolutional models. The lifting dimension d_v, the number of Fourier layers L and k_max, defined in (2), are the only hyperparameters selected by cross-validation.
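A single 1d Fourier layer can be sketched as follows (our simplified reading, not the exact implementation of Li et al. (2021a); ReLU replaces GeLU for brevity, and the mode-truncation convention is schematic):

```python
import numpy as np

def fourier_layer(z, R, W, k_max):
    # z: (d_v, n) real signal on a uniform grid,
    # R: (d_v, d_v, k_max + 1) complex mode-wise multipliers,
    # W: (d_v, d_v) pointwise residual weight matrix.
    z_hat = np.fft.rfft(z, axis=1)               # half-spectrum, (d_v, n//2 + 1)
    out_hat = np.zeros_like(z_hat)
    for k in range(min(k_max + 1, z_hat.shape[1])):
        out_hat[:, k] = R[:, :, k] @ z_hat[:, k] # learned linear map per mode
    conv = np.fft.irfft(out_hat, n=z.shape[1], axis=1)
    return np.maximum(W @ z + conv, 0.0)         # residual branch + activation
```

The truncation to |k| ≤ k_max is what bounds the layer's parameter count by O(d_v^2 k_max^d), as used in the complexity estimates.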

E.1.6 TRAINING DETAILS

For all benchmarks, a training set with 1024 samples and validation and test sets with 128 samples each are used. Training is performed with the ADAM optimizer, with learning rate 5·10^{-4}, for 10000 epochs, minimizing the L^1 loss function. We use the learning rate schedulers defined in Table 2 and train the models in mini-batches of size 10. A weight decay of 10^{-6} is used for ResNet (all numerical experiments) and for DON and sDON (linear advection equation, Burgers' equation and shock-tube problem); no weight decay is employed for the remaining experiments and models. At every epoch, the relative L^1 error is computed on the validation set, and the set of trainable parameters realizing the lowest error over the entire process is saved for testing; hence, no early stopping is used. The model hyperparameters are selected by running grid searches over a range of hyperparameter values and picking the configuration realizing the lowest relative L^1 error on the validation set. The model sizes (minimum and maximum numbers of trainable parameters) covered in this grid search are reported in Table 3. The results of the grid search, i.e., the best-performing hyperparameter configurations for each model and benchmark, are reported in Tables 4, 5, 6, 7 and 8.
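The checkpointing rule described above (keep the parameters with the lowest validation error seen so far, without stopping early) can be sketched as follows; the model, training step and error callbacks are placeholders:

```python
def train_with_best_checkpoint(epochs, train_step, validation_error, get_params):
    # Run all epochs; never stop early, but remember the best parameters.
    best_err, best_params = float("inf"), None
    for _ in range(epochs):
        train_step()                        # one pass over the training set
        err = validation_error()            # relative L^1 error on validation set
        if err < best_err:
            best_err, best_params = err, get_params()
    return best_params, best_err
```

This differs from early stopping in that training always runs for the full budget of epochs; only the reported parameters are selected by validation error.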

E.2 FURTHER EXPERIMENTAL RESULTS.

In this section, we present some further experimental results which supplement the results presented in 

