HYPERDEEPONET: LEARNING OPERATOR WITH COMPLEX TARGET FUNCTION SPACE USING THE LIMITED RESOURCES VIA HYPERNETWORK

Abstract

Fast and accurate predictions for complex physical dynamics are a significant challenge across various applications. Real-time prediction on resource-constrained hardware is even more crucial in real-world problems. The deep operator network (DeepONet) has recently been proposed as a framework for learning nonlinear mappings between function spaces. However, the DeepONet requires many parameters and has a high computational cost when learning operators, particularly those with complex (discontinuous or non-smooth) target functions. This study proposes HyperDeepONet, which uses the expressive power of the hypernetwork to enable the learning of a complex operator with a smaller set of parameters. The DeepONet and its variant models can be thought of as a method of injecting the input function information into the target function. From this perspective, these models can be viewed as a particular case of HyperDeepONet. We analyze the complexity of DeepONet and conclude that HyperDeepONet needs relatively lower complexity to obtain the desired accuracy for operator learning. HyperDeepONet successfully learned various operators with fewer computational resources compared to other benchmarks.

1. INTRODUCTION

Operator learning for mapping between infinite-dimensional function spaces is a challenging problem. It has been used in many applications, such as climate prediction (Kurth et al., 2022) and fluid dynamics (Guo et al., 2016). The computational efficiency of learning the mapping remains important in real-world problems. The target function of the operator can be discontinuous or sharp for complicated dynamical systems. In this case, balancing model complexity against computational cost is a core problem for real-time prediction on resource-constrained hardware (Choudhary et al., 2020; Murshed et al., 2021).

Many machine learning methods and deep learning-based architectures have been successfully developed to learn a nonlinear mapping from one infinite-dimensional Banach space to another. They focus on learning the solution operator of partial differential equations (PDEs), e.g., the mapping from the initial or boundary condition of a PDE to the corresponding solution. Anandkumar et al. (2019) proposed an iterative neural operator scheme to learn the solution operator of PDEs. Simultaneously, Lu et al. (2019; 2021) proposed the deep operator network (DeepONet) architecture based on the universal operator approximation theorem of Chen & Chen (1995). The DeepONet consists of two networks: a branch net taking an input function evaluated at fixed finite locations, and a trunk net taking a query location in the output function domain. Each network produces p outputs, which are combined by an inner product to approximate the underlying operator: the branch net produces the coefficients (p-coefficients) and the trunk net produces the basis functions (p-basis) of the target function.

While variant models of DeepONet have been developed to improve the vanilla DeepONet, they still have difficulty approximating operators with complicated target functions under limited computational resources. Lanthaler et al. (2022) and Kovachki et al. (2021b) pointed out the limitation of the linear approximation in DeepONet. Some operators have a slow spectral decay rate of the Kolmogorov n-width, which measures the error of the best possible linear approximation using an n-dimensional space. A large n is required to learn such operators accurately, which implies that the DeepONet requires a large number of basis functions p and network parameters for them. Hadorn (2022) investigated the behavior of DeepONet to find what makes it challenging to detect sharp features in the target function when the number of basis functions p is small, and proposed Shift-DeepONet, which adds two neural networks to shift and scale the input function. Venturi & Casey (2023) also analyzed the limitation of DeepONet via singular value decomposition (SVD) and proposed a flexible DeepONet (flexDeepONet), adding a pre-net and an additional output in the branch net. Recently, to overcome the limitation of the linear approximation, Seidman et al. (2022) proposed a nonlinear manifold decoder (NOMAD) framework, using a neural network that takes the output of the branch net as input along with the query location. Even though these methods reduce the number of basis functions, the total number of parameters in the model cannot be decreased: the trunk net still requires many parameters to learn complex operators, especially with complicated (discontinuous or non-smooth) target functions.

In this study, we propose a new architecture, HyperDeepONet, which enables operator learning with a complex target function space even with limited resources. The HyperDeepONet uses a hypernetwork, as proposed by Ha et al. (2017), which produces the parameters of a target network. Wang et al. (2022) pointed out that the final inner product in DeepONet may be inefficient if the information of the input function fails to propagate through the branch net.
The hypernetwork in HyperDeepONet transmits the information of the input function to all the parameters of the target network. Furthermore, the expressivity of the hypernetwork reduces the neural network complexity by sharing the parameters (Galanti & Wolf, 2020). Our main contributions are as follows.

• We propose a novel HyperDeepONet using a hypernetwork to overcome the limitations of DeepONet and learn operators with a complicated target function space. The DeepONet and its variant models are analyzed primarily in terms of expressing the target function as a neural network (Figure 4). These models can be seen as simplified versions of our general HyperDeepONet model (Figure 5).

• We analyze the complexity of DeepONet (Theorem 2) and prove that the complexity of the HyperDeepONet is lower than that of the DeepONet. We identify that the DeepONet must employ a large number of basis functions to obtain the desired accuracy, so it requires numerous parameters. For variants of DeepONet combined with nonlinear reconstructors, we also present a lower bound on the number of parameters in the target network.

• The experiments show that the HyperDeepONet facilitates learning an operator with a small number of parameters in the target network even when the target function space is complicated by discontinuity and sharpness, with which the DeepONet and its variants struggle. The HyperDeepONet learns the operator more accurately even when the total number of parameters in the overall model is the same.

2. RELATED WORK

Many machine learning methods and deep learning-based architectures have been successfully developed to solve PDEs. One research direction is to use a neural network directly to represent the solution of a PDE (E & Yu, 2018; Sirignano & Spiliopoulos, 2018). The physics-informed neural network (PINN), introduced by Raissi et al. (2019), minimizes the residual of PDEs by using automatic differentiation instead of numerical approximations.

Operator learning is another approach to solving PDEs. It aims to learn a nonlinear mapping from one infinite-dimensional Banach space to another. Many studies utilize convolutional neural networks to parameterize the solution operator of PDEs in various applications (Guo et al., 2016; Bhatnagar et al., 2019; Khoo et al., 2021; Zhu et al., 2019; Hwang et al., 2021). The neural operator (Kovachki et al., 2021b) was proposed to approximate the nonlinear operator inspired by Green's function. Li et al. (2021) extended the neural operator structure to the Fourier neural operator (FNO) to approximate the integral operator effectively using the fast Fourier transform (FFT). Kovachki et al. (2021a) proved the universality of FNO and identified the size of the network. The DeepONet (Lu et al., 2019; 2021) has also been proposed as another framework for operator learning and has been applied to various problems, such as bubble growth dynamics (Lin et al., 2021), hypersonic flows (Mao et al., 2021), and fluid flow (Cai et al., 2021). von Oswald et al. (2020) devised a chunk embedding method that partitions the parameters of the target network, since the output dimension of the hypernetwork can be large.

3.1. PROBLEM SETTING

The goal of operator learning is to learn a mapping from one infinite-dimensional function space to another using a finite number of function pairs. Let G : U → S be a nonlinear operator, where U and S are compact subsets of infinite-dimensional function spaces U ⊂ C(X; R^{d_u}) and S ⊂ C(Y; R^{d_s}) with compact domains X ⊂ R^{d_x} and Y ⊂ R^{d_y}. For simplicity, we focus on the case d_u = d_s = 1; all results extend to the general case of arbitrary d_u and d_s. Suppose we have observations {u_i, G(u_i)}_{i=1}^N, where u_i ∈ U and G(u_i) ∈ S. We aim to find an approximation G_θ : U → S with parameters θ using the N observations so that G_θ ≈ G. For example, in a dam break scenario, it is important to predict the fluid flow over time given a random initial height of the fluid. To this end, we want to find an operator G_θ that takes an initial fluid height as the input function and produces the fluid height over time at any location as the output function (Figure 1).

As explained in Lanthaler et al. (2022), the approximator G_θ can be decomposed into three components (Figure 2):

G_θ := R ∘ A ∘ E.    (1)

First, the encoder E takes an input function u from U and generates finite-dimensional encoded data in R^m. Then, the approximator A maps the encoded data in the finite-dimensional space R^m to another finite-dimensional space R^p. Finally, the reconstructor R reconstructs the output function s(y) = G(u)(y), y ∈ Y, from the approximated data in R^p.

3.2. DEEPONET AND ITS LIMITATION

DeepONet can be analyzed using the above three-component decomposition. Assume that all input functions u are evaluated at fixed locations {x_j}_{j=1}^m ⊂ X, called "sensor points." DeepONet uses as encoder the pointwise projection E(u) = (u(x_1), u(x_2), ..., u(x_m)) of the continuous function u, the so-called "sensor values" of the input function u. An intuitive idea is to employ a neural network that simply concatenates these m sensor values and a query point y as an input to approximate the target function G(u)(y). DeepONet, in contrast, handles the m sensor values and a query point y separately in two subnetworks, based on the universal approximation theorem for operators (Lu et al., 2021). See Appendix B for more details.

Lu et al. (2021) use a fully connected neural network for the approximator A : R^m → R^p. They refer to the composition of these two maps as the branch net β:

β : U → R^p,  β(u) := A ∘ E(u)    (2)

for any u ∈ U. The role of the branch net can be interpreted as learning the coefficients of the target function G(u)(y). They use one additional neural network, called the trunk net τ:

τ : Y → R^{p+1},  τ(y) := {τ_k(y)}_{k=0}^p    (3)

for any y ∈ Y. The role of the trunk net can be interpreted as learning an affine space V that can efficiently approximate the output function space C(Y; R^{d_s}). The functions τ_1(y), ..., τ_p(y) become the p-basis of the vector space associated with V, and τ_0(y) becomes a point of V. Using the trunk net τ, the τ-induced reconstructor R_τ is defined as

R_τ : R^p → C(Y; R^{d_s}),  R_τ(β)(y) := τ_0(y) + Σ_{k=1}^p β_k τ_k(y),

where β = (β_1, β_2, ..., β_p) ∈ R^p. In DeepONet, τ_0(y) is restricted to a constant τ_0 ∈ R contained in the reconstructor R. The architecture of DeepONet is described in Figure 4(a). Here, the τ-induced reconstructor R_τ is a linear approximation of the output function space.
Because the linear approximation R cannot capture components in its orthogonal complement, an a priori limitation on the best error of DeepONet is given in Lanthaler et al. (2022):

( ∫_U ∫_Y |G(u)(y) − R_τ ∘ A ∘ E(u)(y)|² dy dµ(u) )^{1/2} ≥ ( Σ_{k>p} λ_k )^{1/2},

where λ_1 ≥ λ_2 ≥ ... are the eigenvalues of the covariance operator Γ_{G_#µ} of the push-forward measure G_#µ. Several studies point out that a slow decay rate of this lower bound leads to inaccurate operator learning using DeepONet (Kovachki et al., 2021b; Hadorn, 2022; Lanthaler et al., 2022). For example, the solution operators of the advection PDEs (Seidman et al., 2022; Venturi & Casey, 2023) and of the Burgers' equation (Hadorn, 2022) are difficult to approximate using the DeepONet with a small number of basis functions p.

Figure 3: The target network parametrization perspective for operator learning.
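To make the branch–trunk decomposition concrete, the following is a minimal numpy sketch of a DeepONet forward pass with random, untrained weights. The layer sizes, the sensor count, and the omission of the constant τ_0 are illustrative assumptions, not the authors' configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_params(sizes, rng):
    """Random fully connected network parameters (illustrative only)."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), rng.standard_normal(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

m, p = 100, 10                    # number of sensors, number of basis functions
sensors = np.linspace(0, 1, m)    # fixed sensor points x_1, ..., x_m
branch = mlp_params([m, 40, p], rng)   # branch net: sensor values -> p coefficients
trunk  = mlp_params([1, 40, p], rng)   # trunk net:  query y -> p basis values

def deeponet(u_vals, y):
    """G_theta(u)(y) = <branch(E(u)), trunk(y)>  (constant tau_0 omitted here)."""
    beta = mlp_forward(branch, u_vals)           # coefficients, shape (p,)
    tau = mlp_forward(trunk, np.atleast_2d(y))   # basis values, shape (k, p)
    return tau @ beta                            # linear reconstruction R_tau

u = np.sin(2 * np.pi * sensors)   # encoder E: pointwise evaluation at sensors
ys = np.linspace(0, 1, 5)[:, None]
print(deeponet(u, ys).shape)      # one scalar output per query point -> (5,)
```

In a real implementation the branch and trunk weights would be trained by regressing G_θ(u)(y) onto the observed triplets; the sketch only shows the data flow of the linear reconstruction.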

3.3. VARIANT MODELS OF DEEPONET

Several variants of DeepONet have been developed to overcome its limitations. All these models can be viewed from the perspective of parametrizing the target function as a neural network. If we think of the target network as receiving y as an input and generating the output G_θ(u)(y), the DeepONet and its variant models can be distinguished by how the information from the input function u is injected into this target network G_θ(u), as described in Figure 3. From this perspective, the trunk net in the DeepONet can be considered as the target network except for the final output, as shown in Figure 4(a): the output of the branch net gives the weights between the last hidden layer and the final output.

Hadorn (2022) proposed Shift-DeepONet. The main idea is that a scale net and a shift net are used to shift and scale the input query position y. Therefore, the information of the input function u can be considered to determine the weights and bias between the input layer and the first hidden layer of the target network (Figure 4(b)). Venturi & Casey (2023) proposed flexDeepONet, explained in Figure 4(c). They used an additional network, the pre-net, to give the bias between the input layer and the first hidden layer. Additionally, the branch net admits an additional output τ_0 to provide more information on the input function u at the last inner-product layer. NOMAD was recently developed by Seidman et al. (2022) to overcome the limitation of DeepONet. They devise a nonlinear output manifold using a neural network that takes the outputs of the branch net {β_i}_{i=1}^p and the query location y. As explained in Figure 4(d), the target network receives information about the function u as an additional input, similar to other conventional neural embedding methods (Park et al., 2019; Chen & Zhang, 2019; Mescheder et al., 2019).

These methods provide information on the input function u to only a part of the target network. It is a natural idea to use a hypernetwork to share the information of the input function u with all parameters of the target network.
We propose a general model, HyperDeepONet (Figure 5), which contains the vanilla DeepONet, flexDeepONet, and Shift-DeepONet as special cases. The HyperDeepONet structure is described in Figure 5. The encoder E and the approximator A are used as in the vanilla DeepONet. The proposed structure replaces the branch net with a hypernetwork, which generates all the parameters of the target network. More precisely, we define the hypernetwork h_θ as

h_θ : U → R^p,  h_θ(u) := A ∘ E(u)    (6)

for any u ∈ U. Then, h_θ(u) = Θ ∈ R^p is the network parameter vector of the target network, which is used in the reconstructor for the HyperDeepONet. We define the reconstructor R as the map that sends Θ to the target network equipped with parameters Θ, so that R(Θ)(y) is the output of that target network evaluated at the query point y.
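As an illustration of how the hypernetwork replaces the branch net, the following is a minimal numpy sketch with random, untrained weights; the layer sizes and sensor count are illustrative assumptions, not the authors' configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target network architecture: y (1-D query) -> 16 -> 16 -> scalar output
sizes = [1, 16, 16, 1]
shapes = [(m, n) for m, n in zip(sizes[:-1], sizes[1:])]
n_params = sum(m * n + n for m, n in shapes)   # all weights and biases: p = 321

m_sensors = 100
# Hypernetwork h_theta: sensor values E(u) in R^m -> all target parameters in R^p
W1 = rng.standard_normal((m_sensors, 64)) / np.sqrt(m_sensors)
b1 = rng.standard_normal(64)
W2 = rng.standard_normal((64, n_params)) / np.sqrt(64)
b2 = np.zeros(n_params)

def hypernet(u_vals):
    return np.tanh(u_vals @ W1 + b1) @ W2 + b2   # Theta = h(u) in R^p

def target_net(theta, y):
    """Run the target network with parameters unpacked from the flat vector theta."""
    x, i = np.atleast_2d(y), 0
    for k, (m, n) in enumerate(shapes):
        W = theta[i:i + m * n].reshape(m, n); i += m * n
        b = theta[i:i + n]; i += n
        x = x @ W + b
        if k < len(shapes) - 1:
            x = np.tanh(x)
    return x

sensors = np.linspace(0, 1, m_sensors)
u = np.sin(2 * np.pi * sensors)        # encoder: pointwise sensor values
theta = hypernet(u)                    # input function -> target-network weights
ys = np.linspace(0, 1, 5)[:, None]
print(target_net(theta, ys).shape)     # s(y) at each query point -> (5, 1)
```

Unlike DeepONet, where u only sets the last-layer coefficients, here every weight and bias of the target network depends on the input function.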

4.2. COMPARISON ON COMPLEXITY OF DEEPONET AND HYPERDEEPONET

In this section, we would like to clarify the complexity of the DeepONet required for the approximation A and reconstruction R based on the theory in Galanti & Wolf (2020) . Furthermore, we will show that the HyperDeepONet entails a relatively lower complexity than the DeepONet using the results on the upper bound for the complexity of hypernetwork (Galanti & Wolf, 2020) .

4.2.1. NOTATIONS AND DEFINITIONS

Suppose that the pointwise projection values (sensor values) of the input function u are given as E(u) = (u(x_1), u(x_2), ..., u(x_m)). For simplicity, we consider the case Y = [−1, 1]^{d_y} and E(u) ∈ [−1, 1]^m. For the composition R ∘ A : R^m → C(Y; R), we focus on approximating the mapping O : R^{m+d_y} → R, defined as

O(E(u), y) := (R ∘ A(E(u)))(y),  for y ∈ [−1, 1]^{d_y}, E(u) ∈ [−1, 1]^m.    (8)

The supremum norm ‖h‖_∞ is defined as max_{y∈Y} ‖h(y)‖. We now introduce the Sobolev space W_{r,n}, which is a subset of C^r([−1, 1]^n; R). For r, n ∈ N,

W_{r,n} := { h : [−1, 1]^n → R | ‖h‖_{s_r} := ‖h‖_∞ + Σ_{1≤|k|≤r} ‖D^k h‖_∞ ≤ 1 },

where D^k h denotes the partial derivative of h with respect to the multi-index k ∈ (N ∪ {0})^n. We assume that the mapping O lies in the Sobolev space W_{r,m+d_y}. For a nonlinear activation σ, the class of neural networks F represents the fully connected neural networks with depth k and widths (h_1 = n, h_2, ..., h_{k+1}), where W_i ∈ R^{h_{i+1} × h_i} and b_i ∈ R^{h_{i+1}} denote the weight and bias of the i-th layer, respectively:

F := { f : [−1, 1]^n → R | f(y; [W, b]) = W_k σ(W_{k−1} ··· σ(W_1 y + b_1) ··· + b_{k−1}) + b_k }.

Some activation functions facilitate approximation in the Sobolev space and curtail the complexity; we refer to these as universal activation functions. The formal definition is given below, where the distance between the class of neural networks F and the Sobolev space W_{r,n} is defined by d(F; W_{r,n}) := sup_{ψ∈W_{r,n}} inf_{f∈F} ‖f − ψ‖_∞. Most well-known activation functions that are infinitely differentiable and non-polynomial on any interval are universal (Mhaskar, 1996). Furthermore, Hanin & Sellke (2017) show that the ReLU activation is also universal.

Definition 1. (Galanti & Wolf, 2020) (Universal activation).
The activation function σ is called universal if there exists a class of neural networks F with activation function σ such that the number of parameters of F is O(ε^{−n/r}) with d(F; W_{r,n}) ≤ ε for all r, n ∈ N.

We now introduce the theorem, which offers a guideline on the neural network architecture for operator learning. It suggests that if the entire architecture can be replaced with a fully connected neural network, a large complexity is required for approximating the target function. It also verifies that the lower bound for a universal activation function is a sharp bound on the number of parameters. First, we give an assumption needed for the theorem.

Assumption 1. Suppose that F and W_{r,n} represent the class of neural networks and the target function space to approximate, respectively. Let F′ be a neural network class representing a structure in which one neuron is added to F. Then, the following holds for all ψ ∈ W_{r,n} not contained in F:

inf_{f∈F} ‖f − ψ‖_∞ > inf_{f∈F′} ‖f − ψ‖_∞.

For r = 0, Galanti & Wolf (2020) remark that the assumption is valid for 2-layered neural networks with respect to the L² norm when the activation function σ is either the hyperbolic tangent or the sigmoid function. With Assumption 1, the following theorem holds, which is the fundamental tool for identifying the complexity of DeepONet and its variants. Note that a real-valued function g ∈ L¹(R) is of bounded variation if its total variation sup_{φ∈C¹_c(R), ‖φ‖_∞≤1} ∫_R g(x)φ′(x) dx is finite.

Theorem 1. (Galanti & Wolf, 2020). Suppose that F is a class of neural networks with a piecewise C¹(R) activation function σ : R → R whose derivative σ′ is of bounded variation. If no non-constant ψ ∈ W_{r,n} belongs to F, then d(F; W_{r,n}) ≤ ε implies that the number of parameters in F is Ω(ε^{−n/r}).

4.2.2. LOWER BOUND FOR THE COMPLEXITY OF DEEPONET

Theorem 2. Let σ : R → R be a universal activation function in C^r(R) such that σ and σ′ are bounded. Suppose that the class of branch net B has a bounded Sobolev norm (i.e., ‖β‖_{s_r} ≤ l_1 for all β ∈ B) and that any non-constant ψ ∈ W_{r,n} does not belong to any class of neural network. Then, the number of parameters in the class of trunk net T is Ω(ε^{−d_y/r}) when d(F_DeepONet(B, T); W_{r,d_y+m}) ≤ ε.

The core of the proof is showing that the inner product between the branch net and the trunk net can be replaced with a neural network of low complexity (Lemma 1). Therefore, the entire structure of DeepONet can be replaced with a neural network that receives [E(u), y] ∈ R^{d_y+m} as input, which gives the lower bound on the number of parameters in DeepONet based on Theorem 1. The proof can be found in Appendix C.1.

Analogous results hold for the variant models of DeepONet. Models such as Shift-DeepONet and flexDeepONet can achieve the desired accuracy with a small number of basis functions; still, there is a trade-off in which the first hidden layer of the target network requires numerous units. There is no restriction on the dimension of the last hidden layer of the target network for NOMAD, which uses a fully nonlinear reconstruction; however, the first hidden layer of the target network also has to be wide enough, increasing the number of parameters. Details can be found in Appendix C.2.

For the proposed HyperDeepONet, the sensor values E(u) determine the weights and biases of all layers of the target network, not only the weights of the last layer. Due to the nonlinear activation functions between linear matrix multiplications, it is difficult to replace the HyperDeepONet with a single neural network that receives [E(u), y] ∈ R^{d_y+m} as input. Galanti & Wolf (2020) state that there exists a hypernetwork structure (HyperDeepONet) such that the number of parameters in the target network is O(ε^{−d_y/r}). This implies that the HyperDeepONet reduces the complexity compared to all the variants of DeepONet.

5. EXPERIMENTS

In this section, we verify the effectiveness of the proposed HyperDeepONet for learning operators with a complicated target function space. To be more specific, we focus on operator learning problems in which the output function space is complicated. Each input function u_i generates multiple triplet data points (u_i, y, G(u_i)(y)) for different values of y. Except for the shallow water problem, which uses 100 training function pairs and 20 test pairs, we use 1,000 training input-output function pairs and 200 test pairs for all experiments.

As a toy example, we first consider the identity operator G : u_i ↦ u_i. Chebyshev polynomials are used as the input (= output) for the identity operator problem. The input functions are random combinations of the Chebyshev polynomials of the first kind T_l of degree less than 20, i.e., u_i ∈ { Σ_{l=0}^{19} c_l T_l(x) | c_l ∈ [−1/4, 1/4] }, with the c_l sampled from the uniform distribution U[−1/4, 1/4]. The differentiation operator G : u_i ↦ (d/dx) u_i is considered as the second problem. Previous works handled the anti-derivative operator, which makes the output function smoother by averaging (Lu et al., 2019; 2022). Here, we choose the differentiation operator instead to focus on operator learning when the operator's output function space is complicated. We first sample the output function G(u) from the above Chebyshev polynomials; the input function is then generated numerically by integrating the output function.

Finally, the solution operators of PDEs are considered. We deal with two problems with complex target functions from previous works (Lu et al., 2022; Hadorn, 2022). The solution operator of the advection equation is considered as a mapping from a rectangle-shaped initial input function to the solution w(t, x) at t = 0.5, i.e., G : w(0, x) → w(0.5, x).
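The Chebyshev data generation for the identity and differentiation operators described above can be sketched as follows. This is a minimal sketch, not the authors' code: the 100 sensor points are an illustrative assumption, while the coefficient range and the integrate-the-output construction follow the text.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(0)
xs = np.linspace(-1, 1, 100)   # sensor points (the count 100 is an assumption)

def sample_pair(rng):
    """One (input, output) function pair for the differentiation operator
    G : u -> du/dx. The output is a random Chebyshev series with
    c_l ~ U[-1/4, 1/4]; the input is obtained by integrating it, so that
    G(u) recovers the sampled series (the identity-operator data reuses the
    same random series as both input and output)."""
    c_out = rng.uniform(-0.25, 0.25, size=20)   # coefficients of G(u)
    c_in = C.chebint(c_out)                     # antiderivative coefficients of u
    return C.chebval(xs, c_in), C.chebval(xs, c_out)

pairs = [sample_pair(rng) for _ in range(1000)]   # 1,000 training pairs
u0, s0 = pairs[0]
print(u0.shape, s0.shape)   # (100,) (100,)
```

Each pair gives the sensor values of u on one side and the values of the complicated target function G(u) on the other, from which the (u_i, y, G(u_i)(y)) triplets are drawn.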
We also consider the solution operator of Burgers' equation, which maps a random initial condition to the solution w(t, x) at t = 1, i.e., G : w(0, x) → w(1, x). The solution of the Burgers' equation develops a discontinuity in a short time, although the initial input function is smooth. For a challenging benchmark, we consider the solution operator of the shallow water equation, which aims to predict the fluid height h(t, x_1, x_2) from the initial condition h(0, x_1, x_2), i.e., G : h(0, x_1, x_2) → h(t, x_1, x_2) (Figure 1). In this case, the input of the target network is three-dimensional (t, x_1, x_2), which makes the solution operator complex. A detailed explanation is provided in Appendix E.

Expressivity of target network. We compare the expressivity of a small target network across the different models. We focus on the identity and differentiation operators in this experiment. All models employ the small target network d_y-20-20-10-1 with the hyperbolic tangent activation function. The branch net and the additional networks (scale net, shift net, pre-net, and hypernetwork) also use the same network size as the target network for all five models. Table 1 shows that the DeepONet and its variant models have high errors when learning complex operators with the small target network. In contrast, the HyperDeepONet has lower errors than the other models. This is consistent with the theorem in the previous section, which states that HyperDeepONet can achieve better approximations than the DeepONet when the complexity of the target network is the same. Figure 6 shows a prediction on the differentiation operator, which has a highly complex target function. The same trends are observed when the activation function or the number of sensor points changes (Table 5) and when the number of layers in the branch net and the hypernetwork varies (Figure 11).

Same number of learnable parameters. The previous experiments compare the models using the same target network structure.
In this section, the comparison between the DeepONet and the HyperDeepONet is considered when the two models use the same number of learnable parameters. We focus on the solution operators of the PDEs. Table 2 shows that the HyperDeepONet achieves similar or better performance than the DeepONet when the two models use the same number of learnable parameters. The HyperDeepONet has a slightly higher error for the advection equation problem, but this error is close to perfect operator prediction. This shows that the complexity of the target network and the number of learnable parameters can be reduced while obtaining the desired accuracy using the HyperDeepONet. The fourth row of Table 2 shows that HyperDeepONet is much more effective than DeepONet in approximating the solution operator of the shallow water equation when the number of parameters is limited. Figure 7 and Figure 12 show that the HyperDeepONet reaches the desired accuracy on the complex target functions in fewer epochs than the DeepONet, although the HyperDeepONet requires more time to train for one epoch (Table 8).

Scalability. When the target network of the HyperDeepONet is large, the output of the hypernetwork becomes high-dimensional (Ha et al., 2017; Pawlowski et al., 2017), so its complexity increases. In this case, the chunked HyperDeepONet (c-HyperDeepONet) can be used, with a trade-off between accuracy and memory, based on the chunk embedding method developed by von Oswald et al. (2020). It generates subsets of the target network parameters multiple times by iteratively reusing a smaller chunked hypernetwork. The c-HyperDeepONet shows better accuracy than the DeepONet and the HyperDeepONet with an almost equal number of parameters, as shown in Table 2. However, it takes almost 2x the training time and 2∼30x the memory usage of the HyperDeepONet. More details on the chunked hypernetwork are in Appendix D.
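The chunk embedding idea behind the c-HyperDeepONet can be sketched as follows. This is a minimal numpy sketch with random weights; all sizes are illustrative assumptions, and in practice the chunk embeddings and the shared network are learnable parameters trained jointly:

```python
import numpy as np

rng = np.random.default_rng(0)

n_target = 321   # total number of target-network parameters to generate
chunk = 64       # parameters produced per call of the small hypernetwork
n_chunks = -(-n_target // chunk)          # ceil(321 / 64) = 6 calls
emb_dim, m_sensors, hidden = 8, 100, 32   # illustrative sizes

# One embedding per chunk; each call conditions the shared small
# hypernetwork on [sensor values, chunk embedding].
chunk_emb = rng.standard_normal((n_chunks, emb_dim))
W1 = rng.standard_normal((m_sensors + emb_dim, hidden)) / np.sqrt(m_sensors + emb_dim)
b1 = np.zeros(hidden)
W2 = rng.standard_normal((hidden, chunk)) / np.sqrt(hidden)
b2 = np.zeros(chunk)

def chunked_hypernet(u_vals):
    """Generate all target parameters chunk by chunk with one shared small network."""
    pieces = [np.tanh(np.concatenate([u_vals, chunk_emb[i]]) @ W1 + b1) @ W2 + b2
              for i in range(n_chunks)]
    return np.concatenate(pieces)[:n_target]   # trim padding in the last chunk

theta = chunked_hypernet(rng.standard_normal(m_sensors))
print(theta.shape)   # (321,)
```

The output dimension of the shared network is the chunk size rather than the full parameter count, which is what trades extra forward passes (time) for a smaller hypernetwork (memory).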

6. CONCLUSION AND DISCUSSION

In this work, the HyperDeepONet is developed to overcome the expressivity limitation of DeepONet. Methods that incorporate an additional network or a nonlinear reconstructor could not thoroughly resolve this limitation. The hypernetwork, which produces multiple weights simultaneously, has the desired complexity-reducing structure based on theory and experiments.

We focused on the case in which the hypernetwork and the target network are fully connected neural networks. In the future, the structure of the two networks can be replaced with a CNN or ResNet, just as the branch net and trunk net of DeepONet can be changed to other architectures (Lu et al., 2022). Additionally, it seems interesting to investigate the simplified modulation network proposed by Mehta et al. (2021), which still has the same expressivity as the HyperDeepONet. Some techniques from implicit neural representation can improve the expressivity of the target network (Sitzmann et al., 2020); using a sine activation function with preprocessing may promote the expressivity of the target network. We also leave the study of the class of activation functions satisfying the assumption, beyond the hyperbolic tangent and sigmoid functions, as future work.

A NOTATIONS

We list the main notations in Table 3 that are not concretely described in this paper:

n = O(ε) : there exists a constant C such that n ≤ C · ε, ∀ε > 0
n = Ω(ε) : there exists a constant C such that n ≥ C · ε, ∀ε > 0
n = o(ε) : n/ε converges to 0 as ε approaches 0

For m fixed observation points (x_1, ..., x_m) ∈ X^m, the DeepONet is written as

R_τ ∘ A_β(u(x_1), ..., u(x_m))(y) := ⟨β(u(x_1), ..., u(x_m); θ_β), τ(y; θ_τ)⟩,

where τ and β are referred to as the trunk net and the branch net, respectively. Note that R_τ and A_β denote the reconstructor and the approximator in Section 3.1. It was revealed that the stacked DeepONet, the simplified version of the unstacked DeepONet, is a universal approximator in the set of continuous functions. Therefore, the general structure also becomes a universal approximator, which enables close approximation using a sufficient number of parameters. Motivated by this property, we focus on how large the complexity should be for DeepONet and its variants to achieve the desired error.

C ON COMPLEXITY OF DEEPONET AND ITS VARIANTS

C.1 PROOF OF THEOREM 2 ON DEEPONET COMPLEXITY

The following lemma implies that the class of neural networks is sufficiently efficient to approximate the inner product.

Lemma 1. For the number of basis p ∈ N, consider the inner product function π_p : [−1, 1]^{2p} → R defined by

π_p(a_1, ..., a_p, b_1, ..., b_p) := Σ_{i=1}^p a_i b_i = ⟨(a_1, ..., a_p), (b_1, ..., b_p)⟩.

For an arbitrary positive t, there exists a class of neural networks F with universal activation σ : R → R such that the number of parameters of F is O(p^{1+1/t} ε^{−1/t}) with inf_{f∈F} ‖f − π_p‖_∞ ≤ ε.

Proof. Suppose that t is a positive integer, so the Sobolev space W_{2t,2} is well defined. First, we approximate the product function π_1 : [−1, 1]² → R defined by π_1(a, b) = ab. Note that the partial derivatives D^k π_1 = 0 for all multi-indices k ∈ (N ∪ {0})² with |k| ≥ 2. For a multi-index k with |k| = 1, D^k π_1 contains only one term, which is either a or b. In this case, we observe that Σ_{|k|=1} ‖D^k π_1‖_∞ ≤ 2 · 1 = 2 by the construction of the domain [−1, 1]² of π_1. Finally, ‖π_1‖_∞ ≤ ‖a‖_∞ ‖b‖_∞ ≤ 1 · 1 = 1, so the function π_1/3 is contained in W_{r,2} for any r ∈ N. In particular, π_1/3 lies in W_{2t,2}, so there exists a neural network approximation f_nn in some class of neural networks F* with a universal activation function σ such that the number of parameters of F* is O((ε/3p)^{−2/2t}) = O(p^{1/t} ε^{−1/t}) and ‖π_1/3 − f_nn‖_∞ ≤ ε/3p, by Definition 1. Then the neural network 3f_nn approximates the function π_1 within an error ε/p; it can be constructed by adjusting the last weight values directly involved in the output layer of the neural network f_nn. Finally, we construct a neural network approximation for the inner product function π_p.
Decompose the 2p-dimensional inner product function π_p into p product functions {Proj_i(π_p)}_{i=1}^p defined as

Proj_i(π_p) : R^{2p} → R,  Proj_i(π_p)(a_1, ..., a_p, b_1, ..., b_p) := π_1(a_i, b_i) = a_i b_i,  for i ∈ {1, ..., p}.

Then each function Proj_i(π_p) can be approximated within an error ε/p by a neural network NN_i with O(p^{1/t} ε^{−1/t}) parameters, by the above discussion. Finally, by adding a last weight [1, 1, ..., 1] ∈ R^{1×p} that takes as input the outputs of the p neural networks {NN_i}_{i=1}^p, we construct the neural network approximation NN of π_p = Σ_{i=1}^p Proj_i(π_p) such that the number of parameters is O(1 + p + p · p^{1/t} ε^{−1/t}) = O(p^{1+1/t} ε^{−1/t}). The class of neural networks F representing the structure of NN satisfies the desired property. Obviously, the statement holds for an arbitrary real t that is not an integer. □

Now we assume that O (defined in Eq. (8)) lies in the Sobolev space W_{r,d_y+m}. Then we obtain the following lemma, which presents the lower bound on the number of basis p in the DeepONet structure. Note that we apply the L^∞-norm to the outputs of the branch net and trunk net, which are multidimensional vectors.

Lemma 2. Let σ : R → R be a universal activation function in C^r(R) such that σ′ is of bounded variation. Suppose that the class of branch net B has a bounded Sobolev norm (i.e., ‖β‖_{s_r} ≤ l_1, ∀β ∈ B). Assume that the supremum norm ‖·‖_∞ of the class of trunk net T is bounded by l_2 and the number of parameters in T is o(ε^{−(d_y+m)/r}). If any non-constant ψ ∈ W_{r,d_y+m} does not belong to any class of neural network, then the number of basis p in T is Ω(ε^{−d_y/r}) when d(F_DeepONet(B, T); W_{r,d_y+m}) ≤ ε.

Proof. To prove the lemma by contradiction, we assume the opposite of the conclusion: suppose that there is no constant C that satisfies the inequality p ≥ C ε^{−d_y/r} for all ε > 0.
In other words, there exists a sequence of DeepONets with numbers of basis $\{p_n\}_{n=1}^\infty$ and a sequence of errors $\{\epsilon_n\}_{n=1}^\infty$ in $\mathbb{R}$ such that $\epsilon_n \to 0$,

$$p_n \le \tfrac{1}{n}\epsilon_n^{-d_y/r}, \quad \text{i.e., } p_n = o(\epsilon_n^{-d_y/r}) \text{ with respect to } n, \tag{9}$$

and $d(\mathcal{F}_{\mathrm{DeepONet}}(\mathcal{B}_n, \mathcal{T}_n); W^{r,d_y+m}) \le \epsilon_n$, where $\mathcal{B}_n$ and $\mathcal{T}_n$ denote the corresponding class sequences of branch nets and trunk nets, respectively. Then, by the definition of $d(\mathcal{F}_{\mathrm{DeepONet}}(\mathcal{B}_n, \mathcal{T}_n); W^{r,d_y+m})$, we can choose a sequence of branch nets $\{\beta_n : \mathbb{R}^m \to \mathbb{R}^{p_n}\}_{n=1}^\infty$ and trunk nets $\{\tau_n : \mathbb{R}^{d_y} \to \mathbb{R}^{p_n}\}_{n=1}^\infty$ satisfying $\|\mathcal{O}(E(u), y) - \pi_{p_n}(\beta_n(E(u)), \tau_n(y))\|_\infty \le 2\epsilon_n$ for all $[E(u), y] \in [-1,1]^{d_y+m}$ and $\mathcal{O} \in W^{r,d_y+m}$.

Now we construct neural network approximations $f_n$ of the branch nets $\beta_n$. By the boundedness assumption on $\mathcal{B}$, the $i$-th component $[\beta_n]_i$ of $\beta_n$ has a Sobolev norm bounded by $l_1$. In other words, $\|[\beta_n]_i/l_1\|_{s,r} \le 1$, and therefore $[\beta_n]_i/l_1$ is contained in $W^{r,m}$. Since $\sigma$ is a universal activation function, we can choose a neural network approximation $[f_n]_i$ of $[\beta_n]_i/l_1$ such that the number of parameters is $O((\epsilon_n/l_1)^{-m/r}) = O(\epsilon_n^{-m/r})$ and $\|[f_n]_i - [\beta_n]_i/l_1\|_\infty \le \epsilon_n/l_1$. Then $f_n = (l_1[f_n]_1, l_1[f_n]_2, \cdots, l_1[f_n]_{p_n})$ becomes a neural network approximation of $\beta_n$ with $O(p_n\epsilon_n^{-m/r})$ parameters within an error $\epsilon_n$.

Recall the target function corresponding to the $m$ observations $E(u) \in [-1,1]^m$, namely $\mathcal{O}(E(u), \cdot) : \mathbb{R}^{d_y} \to \mathbb{R}$, defined in Eq. (8). Then, for every $E(u)$, we can observe the following inequalities:
$$\|\mathcal{O}(E(u), y) - \pi_{p_n}(f_n(E(u)), \tau_n(y))\|_\infty \le \|\mathcal{O}(E(u), y) - \pi_{p_n}(\beta_n(E(u)), \tau_n(y))\|_\infty + \|\pi_{p_n}(\beta_n(E(u)) - f_n(E(u)), \tau_n(y))\|_\infty \le 2\epsilon_n + \epsilon_n\|\tau_n(y)\|_\infty \le \epsilon_n(2 + l_2),$$
by the boundedness assumption on $\mathcal{T}$. Next, we consider a sequence of neural networks approximating the inner product of $p_n$-dimensional vectors in $[-1,1]^{p_n}$.
Note the following inequality:
$$\|f_n(E(u))\|_\infty \le \|f_n(E(u)) - \beta_n(E(u))\|_\infty + \|\beta_n(E(u))\|_\infty \le \epsilon_n + \|\beta_n(E(u))\|_{s,r} \le \epsilon_n + l_1 \le 2l_1,$$
together with $\|\tau_n\|_\infty \le l_2$, for large $n$. It implies that $f_n(E(u))/2l_1$ and $\tau_n(y)/l_2$ lie in $[-1,1]^{p_n}$. By Lemma 1 (applied with $t = 2d_yr$), there exists a class of neural networks $\mathcal{H}_n$ such that $\inf_{h \in \mathcal{H}_n} \|h - \pi_{p_n}\|_\infty \le \epsilon_n$, where $\pi_{p_n}$ is the inner product of $p_n$-dimensional vectors. Choose a neural network $h_n : [-1,1]^{2p_n} \to \mathbb{R}$ such that $\|h_n - \pi_{p_n}\|_\infty \le 2\epsilon_n$. Then, by the triangle inequality,
$$\|\mathcal{O}(E(u), y) - 2l_1l_2 h_n(f_n(E(u))/2l_1, \tau_n(y)/l_2)\|_\infty \le \|\mathcal{O}(E(u), y) - \pi_{p_n}(f_n(E(u)), \tau_n(y))\|_\infty + 2l_1l_2\|\pi_{p_n}(f_n(E(u))/2l_1, \tau_n(y)/l_2) - h_n(f_n(E(u))/2l_1, \tau_n(y)/l_2)\|_\infty \le \epsilon_n(2 + l_2) + 2l_1l_2(2\epsilon_n) = \epsilon_n(2 + l_2 + 4l_1l_2). \tag{10}$$

Finally, we count the number of parameters required to implement the function $2l_1l_2 h_n(f_n(E(u))/2l_1, \tau_n(y)/l_2)$. The only part that needs further consideration is scalar multiplication. Since one weight suffices to multiply one real value by a constant, the three scalar multiplications $h_n(f_n(E(u))/2l_1, \tau_n(y)/l_2) \mapsto 2l_1l_2 h_n(f_n(E(u))/2l_1, \tau_n(y)/l_2)$, $f_n(E(u)) \mapsto f_n(E(u))/2l_1$, and $\tau_n(y) \mapsto \tau_n(y)/l_2$ require $1$, $p_n$, and $p_n$ parameters, respectively. Combining all the previous results with the size of the trunk net, the total number of parameters is
$$O(1 + 2p_n + p_n^{1+1/2d_yr}\epsilon_n^{-1/2d_yr} + p_n\epsilon_n^{-m/r}) + o(\epsilon_n^{-(d_y+m)/r}) = o(\epsilon_n^{-(d_y+m)/r}),$$
since the initial assumption (9) on the number of basis gives the following inequality:
$$p_n^{1+1/2d_yr}\epsilon_n^{-1/2d_yr} + p_n\epsilon_n^{-m/r} \le p_n(p_n^{1/2d_yr}\epsilon_n^{-1/2d_yr} + \epsilon_n^{-m/r}) \le \tfrac{1}{n}\epsilon_n^{-d_y/r}(\epsilon_n^{-1/2r^2 - 1/2d_yr} + \epsilon_n^{-m/r}) \le \tfrac{1}{n}\epsilon_n^{-d_y/r} \cdot 2\epsilon_n^{-m/r} = \tfrac{2}{n}\epsilon_n^{-(d_y+m)/r}.$$
On the other hand, the sequence of functions $\{2l_1l_2 h_n(f_n(E(u))/2l_1, \tau_n(y)/l_2)\}_{n=1}^\infty$ approximates $\mathcal{O}(E(u), y)$ within the corresponding sequence of errors $\{\epsilon_n(2 + l_2 + 4l_1l_2)\}_{n=1}^\infty$.
Denote by $\{\mathcal{F}_n\}_{n=1}^\infty$ the sequence of classes of neural networks corresponding to the sequence of functions $\{2l_1l_2 h_n(f_n(E(u))/2l_1, \tau_n(y)/l_2)\}_{n=1}^\infty$. By the assumption, Theorem 1 implies that the number of parameters in $\{\mathcal{F}_n\}_{n=1}^\infty$ is $\Omega((\epsilon_n(2 + l_2 + 4l_1l_2))^{-(d_y+m)/r}) = \Omega(\epsilon_n^{-(d_y+m)/r})$. Therefore, the initial assumption (9) results in a contradiction, so the desired property is valid. Note that the boundedness assumption on the trunk net is valid if we use a bounded universal activation function $\sigma$.

Using the above results, we can prove our main theorem, Theorem 2.

Proof of Theorem 2. Denote the number of parameters in $\mathcal{T}$ by $N_\mathcal{T}$. Suppose that there is no constant $C$ satisfying the inequality $N_\mathcal{T} \ge C\epsilon^{-d_y/r}$ for all $\epsilon > 0$. That is, there exists a sequence of DeepONets with a corresponding sequence of trunk net classes $\{\mathcal{T}_n\}_{n=1}^\infty$ and a sequence of errors $\{\epsilon_n\}_{n=1}^\infty$ such that $\epsilon_n \to 0$ and $N_{\mathcal{T}_n} < \tfrac{1}{n}\epsilon_n^{-d_y/r}$, i.e., $N_{\mathcal{T}_n} = o(\epsilon_n^{-d_y/r})$ with respect to $n$ (which in turn implies $N_{\mathcal{T}_n} = o(\epsilon_n^{-(d_y+m)/r})$). On the other hand, the following inequality holds, where $\mathcal{B}_n$ denotes the corresponding class sequence of branch nets: $d(\mathcal{F}_{\mathrm{DeepONet}}(\mathcal{B}_n, \mathcal{T}_n); W^{r,d_y+m}) \le \epsilon_n$. Since $\sigma$ is bounded, $\mathcal{T}_n$ consists of functions bounded with respect to the supremum norm. Therefore, applying Lemma 2 with respect to $n$, the number of basis $p_n$ must be $\Omega(\epsilon_n^{-d_y/r})$. Since $p_n$ is also the output dimension of the class of trunk nets $\mathcal{T}_n$, the number of parameters in $\mathcal{T}_n$ must be larger than $p_n = \Omega(\epsilon_n^{-d_y/r})$. This leads to a contradiction.

Finally, we present a lower bound on the total number of parameters of the DeepONet, taking the size of the branch net into account. Keep in mind that the proof of this theorem can be applied to other variants of the DeepONet. norm induced by vector 1-norms. We can impose constraints on the upper bound of the weights, which consequently enforces the affine transformations $W_i$ to be bounded with respect to the $L^1$ norm. Therefore, we can guarantee the Lipschitz continuity of the entire neural network in this way. We remark on the validity of the weight assumptions in the theorem, since the bounded weight assumptions are a possible reason for an increased number of parameters. However, the definition of the Sobolev space forces all elements to have a supremum norm $\|\cdot\|_\infty$ less than 1.
It may be somewhat inefficient to insist on large weights for approximating functions with a limited range.

Theorem 4. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a universal activation function in $C^r(\mathbb{R})$ such that $\sigma$ and $\sigma'$ are bounded. Suppose that the class of branch nets $\mathcal{B}$ and pre-nets $\mathcal{P}$ have bounded Sobolev norms (i.e., $\|\beta\|_{s,r} \le l_1$ for all $\beta \in \mathcal{B}$, and $\|\rho\|_{s,r} \le l_3$ for all $\rho \in \mathcal{P}$), and that any neural network in the class of trunk nets $\mathcal{T}$ is Lipschitz continuous with constant $l_2$. If no non-constant $\psi \in W^{r,n}$ belongs to any class of neural networks, then the number of parameters in $\mathcal{T}$ is $\Omega(\epsilon^{-d_y/r})$ when $d(\mathcal{F}_{\text{shift-DeepONet}}(\mathcal{P}, \mathcal{B}, \mathcal{T}); W^{r,d_y+m}) \le \epsilon$.

Proof. Denote the number of parameters in $\mathcal{T}$ by $N_\mathcal{T}$. Suppose that there exists a sequence of pre-nets $\{\rho_n\}_{n=1}^\infty$, branch nets $\{\beta_n\}_{n=1}^\infty$, and trunk nets $\{\tau_n\}_{n=1}^\infty$ with a corresponding sequence of errors $\{\epsilon_n\}_{n=1}^\infty$ such that $\epsilon_n \to 0$, $N_\mathcal{T} = o(\epsilon_n^{-d_y/r})$, and $\sup_{\psi \in W^{r,d_y+m}} \|f_{\text{shift-DeepONet}}(\rho_n, \beta_n, \tau_n) - \psi\|_\infty \le \epsilon_n$. The proof can be divided into three parts. First, we construct a neural network approximation $\rho_n^{NN}$ of $\rho_n$ whose size is $O(w \cdot \epsilon_n^{-m/r})$ within an error $\epsilon_n$. Next, we construct a neural network approximation of $\Phi$ using Lemma 3. Finally, the inner product $\pi_{p_n}(\beta_n, \tau_n)$ is replaced with a neural network as in (10) of Lemma 2. Since all techniques, such as the triangle inequality, are consistent with the previous discussion, we only briefly explain why the additional Lipschitz continuity is required for the trunk net and omit the details. Approximating the pre-net of Shift-DeepONet, which is absent in the DeepONet, inevitably introduces an error into the input of the trunk net. This error must not change the output of the trunk net significantly, and the Lipschitz continuity provides exactly this guarantee. For $d_y = 1$, the additional rotation amounts to multiplying by $1$ or $-1$.
Since the weight and bias of the first layer alone can cover this scalar multiplication, flexDeepONet has the same properties as Shift-DeepONet in the above theorem.

Theorem 5. Consider the case $d_y = 1$. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a universal activation function in $C^r(\mathbb{R})$ such that $\sigma$ and $\sigma'$ are bounded. Suppose that the class of branch nets $\mathcal{B}$ and pre-nets $\mathcal{P}$ have bounded Sobolev norms (i.e., $\|\beta\|_{s,r} \le l_1$ for all $\beta \in \mathcal{B}$, and $\|\rho\|_{s,r} \le l_3$ for all $\rho \in \mathcal{P}$), and that any neural network in the class of trunk nets $\mathcal{T}$ is Lipschitz continuous with constant $l_2$. If no non-constant $\psi \in W^{r,n}$ belongs to any class of neural networks, then the number of parameters in $\mathcal{T}$ is $\Omega(\epsilon^{-d_y/r})$ when $d(\mathcal{F}_{\text{flexDeepONet}}(\mathcal{B}, \mathcal{T}); W^{r,d_y+m}) \le \epsilon$.

Proof. The main difference between flexDeepONet and Shift-DeepONet, not mentioned earlier, is that the branch net affects the bias of the output layer. However, adding the values of two neurons can be implemented in a neural network by adding only one weight of value 1 for each neuron, so all of the previous discussion remains valid.

In fact, NOMAD can be analyzed with the embedding method treated by Galanti & Wolf (2020). Suppose that the branch net of NOMAD is continuously differentiable, and assume that the Lipschitz constants of the branch net and trunk net are bounded. We briefly cite the relevant theorem here.

Theorem 6 (Galanti & Wolf (2020)). Suppose that $\sigma$ is a universal activation in $C^1(\mathbb{R})$ such that $\sigma'$ is of bounded variation on $\mathbb{R}$. Additionally, suppose that no class of neural networks can represent any function in $W^{1,d_y+m}$ other than a constant function. If the weights of the first layer of the target network in NOMAD are bounded with respect to the $L^1$-norm, then $d(\mathcal{N}; W^{1,d_y+m}) \le \epsilon$ implies that the number of parameters in $\mathcal{N}$ is $\Omega(\epsilon^{-\min(d_y+m,\, 2 m_y)})$, where $\mathcal{N}$ denotes the class of functions contained as the target network of NOMAD.

D CHUNKED EMBEDDING METHOD

The HyperDeepONet may suffer from the large complexity of the hypernetwork when the size of the target network increases. Although even a small target network can learn various operators with proper performance, a larger target network will be required for more accurate training. To handle this case, we employ the chunked embedding method developed by von Oswald et al. (2020). The original hypernetwork was designed to generate all of the target network's weights, so the complexity of the hypernetwork could be larger than that of the target network. This problem can be overcome by using a hypernetwork with smaller outputs: the parameters of the target network are partitioned into $N_c$ groups, and the hypernetwork generates the parameters of the $j$-th group from the input function together with a latent vector $z_j$. All groups share the hypernetwork, so the complexity decreases by a factor of the number of groups. Since the latent vectors $\{z_j\}_{j=1}^{N_c}$ learn the characteristics of each group during the training period, the chunked embedding method preserves the expressivity of the hypernetwork. The chunked architecture is a universal approximator for the set of continuous functions given the existence of proper partitions (Proposition 1 in von Oswald et al. (2020)). We remark that the method can also generate additional weights and discard the unnecessary ones when the number of the target network's parameters is not a multiple of $N_c$, the number of groups.

In most experiments, we follow the hyperparameter settings in Lu et al. (2019; 2021; 2022). We use ADAM (Kingma & Ba, 2015) as the optimizer with a learning rate of 1e-3 and zero weight decay.
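The chunking scheme above can be sketched in a few lines of numpy. This is a structural sketch only: the layer sizes, the two-layer hypernetwork, and the random placeholder weights are assumptions for illustration, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# sizes below are illustrative, not the paper's actual configuration
n_target = 350                          # total number of target-network parameters
n_chunks = 4                            # N_c parameter groups
chunk_size = -(-n_target // n_chunks)   # ceil(350 / 4) = 88
latent_dim, m, hidden = 8, 100, 32      # m = number of sensor points

# one hypernetwork shared by all chunks (random placeholder weights)
W1 = rng.normal(size=(hidden, m + latent_dim)) * 0.1
W2 = rng.normal(size=(chunk_size, hidden)) * 0.1
z = rng.normal(size=(n_chunks, latent_dim))   # learnable chunk embeddings z_j

def chunked_hypernet(u_sensors):
    """Generate all target-network parameters chunk by chunk."""
    chunks = [W2 @ np.tanh(W1 @ np.concatenate([u_sensors, z[j]]))
              for j in range(n_chunks)]
    # 4 * 88 = 352 values are generated; the surplus 2 are discarded
    return np.concatenate(chunks)[:n_target]

theta = chunked_hypernet(rng.normal(size=m))
print(theta.shape)  # (350,)
```

Note that the output layer of the shared hypernetwork has `chunk_size` units instead of `n_target` units, which is where the complexity reduction by a factor of $N_c$ comes from.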

E EXPERIMENTAL DETAILS

In all experiments, an InverseTimeDecay scheduler was used with a step size fixed to 1. In the experiments on the identity and differentiation operators, a grid search was performed over the decay rates {0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005}. The selected decay rate for each model can be found in Table 4.

E.1 IDENTITY

As in the main text, we designed an experiment to learn the identity operator for 20th-order Chebyshev polynomials (Figure 10). Note that the absolute values of the coefficients of all orders are less than or equal to 1/4. We discretize the domain [-1, 1] with a spatial resolution of 50. For the experiments described in the text, we construct all of the neural networks with the tanh activation. We use 1,000 training pairs and 200 pairs for validation. The batch size during training is set to 5,000, which is one-tenth the size of the entire dataset.
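A data-generation sketch for this setup is below. The uniform sampling of the coefficients is an assumption here; the text only states the bound of 1/4 on their absolute values.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)                  # spatial resolution of 50

# 20th-order Chebyshev polynomial; the coefficient distribution is an
# assumption (only the bound 1/4 is stated in the text)
coeffs = rng.uniform(-0.25, 0.25, size=21)  # orders 0 through 20
u = cheb.chebval(x, coeffs)

# for the identity operator, the input and the target samples coincide
pair = (u, u)
print(u.shape)  # (50,)
```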

E.2 DIFFERENTIATION

In this experiment, we take as input functions those whose derivatives are 20th-order Chebyshev polynomials. As mentioned above, all coefficients of the Chebyshev polynomial lie between -1/4 and 1/4. We use a 100-point uniform grid on the domain [-1, 1]. The numbers of training and test samples are 1,000 and 200, respectively. We use a batch size of 10,000, which is one-tenth the size (100,000 = 100 · 1,000) of the entire dataset.
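One way to realize "input functions whose derivatives are Chebyshev polynomials" is to integrate a sampled Chebyshev polynomial; the sketch below does this with `numpy.polynomial.chebyshev`. The uniform coefficient distribution is again an assumption.

```python
import numpy as np
from numpy.polynomial import chebyshev as cheb

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)                 # 100-point uniform grid

# target output: a 20th-order Chebyshev polynomial (coefficient
# distribution assumed uniform; only the bound 1/4 is stated)
dcoef = rng.uniform(-0.25, 0.25, size=21)
icoef = cheb.chebint(dcoef)                 # antiderivative coefficients

u = cheb.chebval(x, icoef)                  # input function sample
du = cheb.chebval(x, dcoef)                 # its exact derivative (the target)

# differentiating the antiderivative recovers the target exactly
assert np.allclose(cheb.chebder(icoef), dcoef)
print(u.shape, du.shape)  # (100,) (100,)
```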

E.3 ADVECTION EQUATION

We consider the linear advection equation on the torus $\mathbb{T} := \mathbb{R}/\mathbb{Z}$:
$$\frac{\partial w}{\partial t} + c\,\frac{\partial w}{\partial x} = 0, \qquad w(x, 0) = w_0(x), \quad x \in \mathbb{T},$$
where $c$ is a constant denoting the propagation speed of $w$. By taking the domain to be $\mathbb{T}$, we implicitly assume the periodic boundary condition. In this paper, we consider the case $c = 1$. Our goal is to learn the operator that maps $w_0(x)\,(= w(x, 0))$ to $w(x, 0.5)$. We use the same data as in Lu et al. (2022). We discretize the domain [0, 1] with a spatial resolution of 40. The numbers of training and test samples are 1,000 and 200, respectively. We use the full batch for training, so the batch size is 40 · 1,000 = 40,000.
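With $c = 1$ and periodic boundaries, the exact solution operator is a circular shift: $w(t, x) = w_0((x - t) \bmod 1)$. The sketch below illustrates this on the 40-point grid; the particular rectangle interval is a hypothetical example, not the dataset's.

```python
import numpy as np

nx = 40
x = np.linspace(0, 1, nx, endpoint=False)        # periodic grid on T

# rectangle-shaped initial condition (the interval [0.2, 0.4) is illustrative)
w0 = np.where((x >= 0.2) & (x < 0.4), 1.0, 0.0)

# exact solution of w_t + w_x = 0 (c = 1): w(t, x) = w0((x - t) mod 1);
# t = 0.5 corresponds to a shift by exactly 0.5 * nx = 20 grid points
w_half = np.roll(w0, int(0.5 * nx))

assert w_half.sum() == w0.sum()                  # the profile is only translated
print(np.nonzero(w_half)[0])                     # indices of the shifted rectangle
```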

E.4 BURGERS' EQUATION

We consider the 1D Burgers' equation, which describes the motion of a viscous fluid:
$$\frac{\partial w}{\partial t} = -w\,\frac{\partial w}{\partial x} + \nu\,\frac{\partial^2 w}{\partial x^2}, \quad (x, t) \in (0, 1) \times (0, 1], \qquad w(x, 0) = w_0(x), \quad x \in (0, 1),$$
where $w_0$ is the initial state and $\nu$ is the viscosity. Our goal is to learn the nonlinear solution operator of the Burgers' equation, which maps the initial state $w_0(x)\,(= w(x, 0))$ to the solution $w(x, 1)$ at $t = 1$. We use the Burgers' equation data provided in Li et al. (2021). The initial state $w_0(x)$ is generated from the Gaussian random field $\mathcal{N}(0, 5^4(-\Delta + 25I)^{-2})$ with periodic boundary conditions. The split-step and fine forward Euler methods were employed to generate the solution at $t = 1$. We set the viscosity $\nu$ to 0.1 and the spatial resolution to $2^7 = 128$. The numbers of training and test samples are 1,000 and 200, respectively. We take the ReLU activation function and the InverseTimeDecay scheduler to match the setting in Lu et al. (2022). For a fair comparison, all experiments on the DeepONet retain the hyperparameter values used in Lu et al. (2022). We use the full batch, so the batch size is 128 · 1,000 = 128,000.
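A Fourier-space sketch of sampling the Gaussian random field $\mathcal{N}(0, 5^4(-\Delta + 25I)^{-2})$ is shown below. The discrete normalization is one common convention and may differ from the reference implementation in Li et al. (2021); treat it as a sketch of the spectral decay, not a drop-in replacement.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128                                   # spatial resolution 2**7

# on the periodic unit interval, -Δ has eigenvalues (2πk)^2, so the
# covariance 5^4 (-Δ + 25 I)^(-2) gives mode k a standard deviation
# of 5^2 / ((2πk)^2 + 25)
k = np.arange(n // 2 + 1)
std = 25.0 / ((2 * np.pi * k) ** 2 + 25.0)

# complex Gaussian Fourier coefficients, shaped by the spectrum
xi = rng.normal(size=k.size) + 1j * rng.normal(size=k.size)
w0 = np.fft.irfft(std * xi * n, n=n)      # real-valued sample path

print(w0.shape, w0.dtype)  # (128,) float64
```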

E.5 SHALLOW WATER EQUATION

The shallow water equations are hyperbolic PDEs that describe free-surface fluid flow problems. They are derived from the compressible Navier-Stokes equations. The physical conservation laws of mass and momentum hold at the shocks of the solution. The specific form of the equations can be written as
$$\begin{cases} \dfrac{\partial h}{\partial t} + \dfrac{\partial}{\partial x_1}(hu) + \dfrac{\partial}{\partial x_2}(hv) = 0,\\[4pt] \dfrac{\partial (hu)}{\partial t} + \dfrac{\partial}{\partial x_1}\left(u^2h + \tfrac{1}{2}gh^2\right) + \dfrac{\partial}{\partial x_2}(huv) = 0,\\[4pt] \dfrac{\partial (hv)}{\partial t} + \dfrac{\partial}{\partial x_2}\left(v^2h + \tfrac{1}{2}gh^2\right) + \dfrac{\partial}{\partial x_1}(huv) = 0,\\[4pt] h(0, x_1, x_2) = h_0(x_1, x_2), \end{cases}$$
for $t \in [0, 1]$ and $x_1, x_2 \in [-2.5, 2.5]$, where $h(t, x_1, x_2)$ denotes the height of the water with horizontal and vertical velocities $(u, v)$, and $g$ denotes the gravitational acceleration. In this paper, we aim to learn the operator $h_0(x_1, x_2) \mapsto \{h(t, x_1, x_2)\}_{t \in [1/4, 1]}$ without the information of $(u, v)$. For the sampling of initial conditions and the corresponding solutions, we directly follow the setting of Takamoto et al. (2022). The 2D radial dam break scenario is considered, so the initial water height is generated as a circular bump in the center of the domain. The initial condition is generated by
$$h(0, x_1, x_2) = \begin{cases} 2.0 & \text{for } \sqrt{x_1^2 + x_2^2} < r,\\ 1.0 & \text{for } \sqrt{x_1^2 + x_2^2} \ge r, \end{cases} \tag{15}$$
with the radius $r$ randomly sampled from $U[0.3, 0.7]$. The spatial domain is the 2-dimensional rectangle $[-2.5, 2.5] \times [-2.5, 2.5]$, on which we use a $256 = 16^2$ grid. We train the models with three snapshots at $t = 0.25, 0.5, 0.75$, and predict the solution $h(t, x_1, x_2)$ at four snapshots $t = 0.25, 0.5, 0.75, 1$ on the same grid. We use 100 training samples, and the batch size is set to 25,600.
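Generating one initial condition of Eq. (15) on the 16 x 16 grid can be sketched as follows (a data-preparation sketch only; the solver itself is the one from Takamoto et al. (2022)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16                                       # 16 x 16 = 256 grid points
xs = np.linspace(-2.5, 2.5, n)
x1, x2 = np.meshgrid(xs, xs, indexing="ij")

r = rng.uniform(0.3, 0.7)                    # dam radius, r ~ U[0.3, 0.7]
# circular bump of Eq. (15): height 2.0 inside the radius, 1.0 outside
h0 = np.where(np.sqrt(x1**2 + x2**2) < r, 2.0, 1.0)

print(h0.shape, h0.min(), h0.max())  # (16, 16) 1.0 2.0
```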

F ADDITIONAL EXPERIMENTS F.1 COMPARISON UNDER VARIOUS CONDITIONS

The experimental results under various conditions, obtained by modifying the network structure, activation function, and number of sensor points, are included in Table 5. Although the DeepONet shows good performance in certain settings, the proposed HyperDeepONet performs well without dependency on these conditions.

The training and test errors for the DeepONet are not reduced even as the depth of the branch net becomes larger. This is a limitation of the DeepONet's linear approximation. The DeepONet approximates the operator by the dot product of the trunk net's output, which approximates the basis of the target function, and the branch net's output, which approximates the target function's coefficients. Even if more accurate coefficients are predicted by increasing the depth of the branch net, the error does not decrease because there is a limit to approximating the operator with a linear combination over the already fixed trunk net. The HyperDeepONet approximates the operator with a low test error in all cases with different numbers of layers. Figure 11 shows that the training error of the HyperDeepONet remains small as the depth of the hypernetwork increases, while the test error increases. The increasing gap between the training and test errors indicates overfitting: the HyperDeepONet overfits the training data because the model has more learnable parameters than necessary to approximate the target operator.

F.3 COMPARISON OF HYPERDEEPONET WITH FOURIER NEURAL OPERATOR

The Fourier Neural Operator (FNO) (Li et al., 2021) is a well-known method for operator learning. Lu et al. (2022) consider 16 different tasks to compare the relative performance of the DeepONet and the FNO, showing that each method has its own advantages and limitations. In particular, the DeepONet has a great advantage over the FNO when the input function domain is complicated or the positions of the sensor points are not uniform. Moreover, the DeepONet and the HyperDeepONet enable inference of the solution of a time-dependent PDE even on a finer time grid than the one used for training, e.g., the continuous-in-time solution operator of the shallow water equation in our experiment. Since the FNO is an image-to-image operator learning model, it cannot provide a solution operator that is continuous in the time $t$ and the positions $x_1, x_2$. In this paper, while retaining these advantages of DeepONets, we focus on overcoming the difficulty DeepONets have in learning complex target functions because of their linear approximation. Therefore, we mainly compare the vanilla DeepONet and its variant models on learning complex target functions, without reporting FNO results. For three different PDEs with complicated target functions, we compare all the baseline methods in Table 7 to evaluate their performance. We analyze each model's computational efficiency based on the number of parameters and fix the model complexity for each equation. All five models demonstrate their prediction ability on the advection equation. The DeepONet shows the greatest performance in this case, and the variants can no longer improve on it. For the Burgers' equation, NOMAD and the HyperDeepONet are the two outstanding algorithms in terms of relative test error. NOMAD seems slightly dominant over our architecture, but the two models compete within the margin of error.
Furthermore, the HyperDeepONet improves its accuracy using the chunked embedding method, which enlarges the target network's size while maintaining the complexity. Finally, the HyperDeepONet and NOMAD outperform the other models on the 2-dimensional shallow water equations. The HyperDeepONet still succeeds in accurate prediction even with few parameters. It can be observed from Table 7 that NOMAD is slightly more sensitive in the extreme case of a low-complexity model. Because of the limitation in computing 3-dimensional rotations, FlexDeepONet cannot be applied to this problem. Figure 13 shows the predictions of the shallow water equations' solution operator using the DeepONet and the HyperDeepONet. The overall performance of the DeepONet is inferior to that of the HyperDeepONet, which is consistent with the result in Figure 12. In particular, the DeepONet has difficulty matching the overall circular shape of the solution when the number of parameters is small. This demonstrates the advantage of the HyperDeepONet when computational resources are limited.

F.5 COMPARISON OF TRAINING TIME AND INFERENCE TIME

Table 8 shows the training time and the inference time of the DeepONet and the HyperDeepONet on two different operator problems. When the same small target network is employed for the DeepONet and the HyperDeepONet, the training and inference times of the HyperDeepONet are larger than those of the DeepONet. However, in this case, the comparison is not meaningful because the DeepONet does not learn the operator to the desired accuracy at all (Table 1 and Figure 6). Even when both models use the same number of training parameters, the HyperDeepONet takes slightly longer to train for one epoch than the DeepONet. However, training a complex operator with the HyperDeepONet takes fewer epochs to reach the desired accuracy than with the DeepONet, as seen in Figure 7. This phenomenon can also be observed for the shallow water problem in Figure 12, which shows that the HyperDeepONet converges to the desired accuracy faster than any other variant of the DeepONet. The HyperDeepONet also requires a larger inference time because it can evaluate the target network only after the hypernetwork has generated the target network's parameters. However, when the input function's sensor values are already fixed, the inference time to predict the output of the target function at various query points is faster than that of the DeepONet. This is because the target network of the HyperDeepONet is smaller than that of the DeepONet, although the total number of parameters is the same.
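The amortization argument above can be made concrete: the hypernetwork runs once per input function, and every subsequent query point reuses the generated parameters with a cheap forward pass through the small target network. The sketch below uses hypothetical sizes (a dy-10-10-1 target network with dy = 3 for (t, x1, x2)) and random placeholder weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical target network dy-10-10-1 with dy = 3 (t, x1, x2)
sizes = [(10, 3), (10,), (10, 10), (10,), (1, 10), (1,)]
n_theta = sum(int(np.prod(s)) for s in sizes)   # 161 parameters

def target_net(y, theta):
    """Evaluate the small target network with externally supplied weights."""
    params, i = [], 0
    for s in sizes:
        n = int(np.prod(s))
        params.append(theta[i:i + n].reshape(s))
        i += n
    W1, b1, W2, b2, W3, b3 = params
    h = np.tanh(W1 @ y + b1)
    h = np.tanh(W2 @ h + b2)
    return W3 @ h + b3

# theta is produced ONCE per input function by the hypernetwork;
# afterwards every query point reuses it with a cheap forward pass
theta = rng.normal(size=n_theta)
queries = rng.uniform(size=(1000, 3))
outputs = np.array([target_net(y, theta) for y in queries])
print(outputs.shape)  # (1000, 1)
```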



Figure 1: Example of operator learning: the input function and the output function for the solution operator of the shallow water equation.

Lanthaler et al. (2022) provided the universal approximation property of the DeepONet. Wang et al. (2021) proposed the physics-informed DeepONet by adding the residual of the PDE as a loss function, and Ryck & Mishra (2022) demonstrated generic bounds on its approximation error. Prasthofer et al. (2022) considered the case where the discretization grid of the input function in the DeepONet changes by employing a coordinate encoder. Lu et al. (2022) compared the FNO with the DeepONet on different benchmarks to demonstrate their relative performance. The FNO can only infer the output function of an operator on the same grid as the input function, as it needs to discretize the output function to use the Fast Fourier Transform (FFT). In contrast, the DeepONet can predict at any location. Ha et al. (2017) first proposed the hypernetwork, a network that creates the weights of a primary network. Because the hypernetwork can achieve weight sharing and model compression, it requires a relatively small number of parameters even as the dataset grows. Galanti & Wolf (2020) proved that a hypernetwork provides higher expressivity with low-complexity target networks. Sitzmann et al. (2020) and Klocek et al. (2019) employed this approach to restore images with insufficient pixel observations or resolutions. de Avila Belbute-Peres et al. (2021) investigated the relationship between the coefficients of PDEs and the corresponding solutions, combining the hypernetwork with the PINN's residual loss. For time-dependent PDEs, Pan et al. (2022) designated the time $t$ as the input of the hypernetwork so that the target network represents the solution at $t$. von Oswald et al. (2020) devised a chunked embedding method that partitions the parameters of the target network, since the output dimension of the hypernetwork can be large.

Figure 2: Diagram for the three components for operator learning.

Figure 4: DeepONet and its variant models for operator learning.

Figure 5: The proposed HyperDeep-ONet structure

$\mathcal{R} : \mathbb{R}^p \to C(\mathcal{Y}; \mathbb{R}^{d_s})$, $\mathcal{R}(\Theta)(y) := NN(y; \Theta)$ (7), where $\Theta = [W, b] \in \mathbb{R}^p$ and $NN$ denotes the target network. Two fully connected neural networks are employed for the hypernetwork and the target network. The main idea is therefore to use the hypernetwork, which takes an input function $u$ and produces the weights of the target network; it can be thought of as a weight generator for the target network. The hypernetwork determines all parameters of the target network, including the weights between the final hidden layer and the output layer. This implies that the structure of the HyperDeepONet contains the entire structure of the DeepONet. As shown in Figure 4 (b) and (c), Shift-DeepONet and FlexDeepONet can also be viewed as special cases of the HyperDeepONet, where the output of the hypernetwork determines the weights or biases of some layers of the target network. For NOMAD in Figure 4 (d), the outputs of the hypernetwork determine the biases of the first hidden layer of the target network.
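A minimal forward-pass sketch of this architecture is below. The layer sizes and the two-layer hypernetwork are illustrative assumptions with random placeholder weights; in practice both networks are trained jointly.

```python
import numpy as np

rng = np.random.default_rng(0)
m, dy = 100, 1                     # sensor count and query dimension (illustrative)
sizes = [(10, dy), (10,), (1, 10), (1,)]        # tiny target network dy-10-1
p = sum(int(np.prod(s)) for s in sizes)         # p = dim of Θ = [W, b]

# hypernetwork: sensor values (u(x_1), ..., u(x_m)) -> all parameters Θ
H1 = rng.normal(size=(32, m)) * 0.1
H2 = rng.normal(size=(p, 32)) * 0.1

def hyperdeeponet(u_sensors, y):
    theta = H2 @ np.tanh(H1 @ u_sensors)        # Θ depends on the input function u
    i, params = 0, []
    for s in sizes:
        n = int(np.prod(s))
        params.append(theta[i:i + n].reshape(s))
        i += n
    W1, b1, W2, b2 = params
    return W2 @ np.tanh(W1 @ y + b1) + b2       # NN(y; Θ) = R(Θ)(y)

u = np.sin(np.linspace(0, 2 * np.pi, m))        # sampled input function
out = hyperdeeponet(u, np.array([0.3]))
print(out.shape)  # (1,)
```

Note that every entry of Θ, including the output-layer weights, is produced by the hypernetwork, which is exactly the property distinguishing the HyperDeepONet from Shift-DeepONet, FlexDeepONet, and NOMAD.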

BOUND FOR THE COMPLEXITY OF THE DEEPONET

Now, we provide the minimum number of parameters in the DeepONet. The following theorem presents a criterion on the DeepONet's complexity needed to reach a desired error. It states that the number of required parameters increases when the target functions are irregular, corresponding to a small $r$. $\mathcal{F}_{\mathrm{DeepONet}}(\mathcal{B}, \mathcal{T})$ denotes the class of functions in the DeepONet induced by the class of branch nets $\mathcal{B}$ and the class of trunk nets $\mathcal{T}$.

Theorem 2 (Complexity of DeepONet). Let $\sigma : \mathbb{R} \to \mathbb{R}$ be a universal activation function in $C^r(\mathbb{R})$

Table 1: The mean relative $L^2$ test error with standard deviation for the identity operator and the differentiation operator. The DeepONet, its variants, and the HyperDeepONet use the target network $d_y$-20-20-10-1 with the tanh activation function. Five training trials are performed independently.

Figure 6: One test data example of the differentiation operator problem.

Finally, the solution operators of PDEs are considered. We deal with two problems with complex target functions from previous works (Lu et al., 2022; Hadorn, 2022). The solution operator of the advection equation is considered as a mapping from a rectangle-shaped initial input function to the solution $w(t, x)$ at $t = 0.5$, i.e., $\mathcal{G} : w(0, x) \mapsto w(0.5, x)$. We also consider the solution operator of Burgers' equation, which maps a random initial condition to the solution $w(t, x)$ at $t = 1$, i.e., $\mathcal{G} : w(0, x) \mapsto w(1, x)$. The solution of Burgers' equation develops a discontinuity in a short time, although the initial input function is smooth. As a challenging benchmark, we consider the solution operator of the shallow water equation, which aims to predict the fluid height $h(t, x_1, x_2)$ from the initial condition $h(0, x_1, x_2)$, i.e., $\mathcal{G} : h(0, x_1, x_2) \mapsto h(t, x_1, x_2)$ (Figure 1). In this case, the input of the target network is three-dimensional, $(t, x_1, x_2)$, which makes the solution operator complex. A detailed explanation is provided in Appendix E.

Table 2 (excerpt), shallow water with small parameters: DeepONet with branch net m-20-20-10 and trunk net dy-20-20-10-1 (6.5K parameters), error 0.0391 ± 0.0066; HyperDeepONet (ours) with hypernetwork m-10-10-10-Nθ and target net dy-10-10-10-1 (5.7K parameters), error 0.0209 ± 0.0013.

Figure 7: One test data example of prediction on the advection equation (first row) and Burgers' equation (second row) using the DeepONet and the HyperDeepONet.

For the three solution operator learning problems, we use the same hyperparameters proposed in Lu et al. (2022) and Seidman et al. (2022) for the DeepONet. First, we use a smaller target network with a larger hypernetwork for the HyperDeepONet to compare with the DeepONet. Note that the vanilla DeepONet is used without the output normalization or the boundary condition enforcing techniques explained in Lu et al. (2022) in order to focus on the primary limitation of the DeepONet. More details are in Appendix E. Table 2 shows that the HyperDeepONet achieves a similar or better performance than the DeepONet when the two models use the same number of learnable parameters. The HyperDeepONet has a slightly higher error for the advection equation problem, but this error is close to perfect operator prediction. It shows that the complexity of the target network and the number of learnable parameters can be reduced to obtain the desired accuracy using the HyperDeepONet. The fourth row of Table 2 shows that the HyperDeepONet is much more effective than the DeepONet in approximating the solution operator of the shallow water equation when the number of parameters is limited. Figure 7 and Figure 12 show that the HyperDeepONet learns the complex target functions in fewer epochs than the DeepONet for the desired accuracy, although the HyperDeepONet requires more time to train for one epoch (Table 8).

Notation (excerpt):
Dimension of the codomain of the input function
$d_s$: dimension of the codomain of the output function
$\{x_1, \cdots, x_m\}$: sensor points
$m$: number of sensor points
$\mathbb{R}^d$: Euclidean space of dimension $d$
$C^r(\mathbb{R})$: set of functions with a continuous $r$-th derivative
$C(\mathcal{Y}; \mathbb{R}^{d_s})$: set of continuous functions from $\mathcal{Y}$ to $\mathbb{R}^{d_s}$
$C^r([-1,1]^n; \mathbb{R})$: set of functions from $[-1,1]^n$ to $\mathbb{R}$ whose $r$-th partial derivatives are continuous

Figure 8: The structure of DeepONet

the unstacked DeepONet consists of an inner product of the branch net and the trunk net, which are fully connected neural networks. For a function $u$, the branch net receives the pointwise evaluations $(u(x_1), \cdots, u(x_m))$ as inputs to detect which function needs to be transformed. The trunk net takes a query location $y \in \mathcal{Y}$ of interest, where $\mathcal{Y}$ denotes the domain of the output functions.
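This branch/trunk structure can be sketched as a minimal forward pass. The widths, the number of basis $p$, and the random placeholder weights are illustrative assumptions, not the paper's trained configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, dy, p = 100, 1, 10       # sensors, query dimension, number of basis (illustrative)

# branch net: (u(x_1), ..., u(x_m)) -> p coefficients (random placeholder weights)
Wb1 = rng.normal(size=(32, m)) * 0.1
Wb2 = rng.normal(size=(p, 32)) * 0.1
# trunk net: query location y -> p basis-function values
Wt1 = rng.normal(size=(32, dy))
Wt2 = rng.normal(size=(p, 32)) * 0.1

def deeponet(u_sensors, y):
    coeff = Wb2 @ np.tanh(Wb1 @ u_sensors)   # p coefficients from the branch net
    basis = Wt2 @ np.tanh(Wt1 @ y)           # p basis values from the trunk net
    return coeff @ basis                     # inner product approximates G(u)(y)

u = np.sin(np.linspace(0, 2 * np.pi, m))     # sampled input function
val = deeponet(u, np.array([0.5]))
print(float(val))
```

The final inner product is the linear approximation discussed throughout the paper: once the trunk net (the basis) is fixed, the prediction is constrained to lie in its span regardless of how expressive the branch net is.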


Figure 9: Chunk Embedding Method

Figure 10: One test data example of differentiation operator problem.

Figure 11: Varying the number of layers of branch net and hypernetwork in DeepONet and Hyper-DeepONet for identity operator problem (left) and differentiation operator problem (right).

Figure 12: The test $L^2$ relative errors of four methods during training for the solution operator of the shallow water equations.

Table 8: Training time (s, per epoch) and inference time (ms).
Same target (Differentiation): DeepONet 1.018 / 0.883; HyperDeepONet 1.097 / 1.389.
Same #param (Advection): DeepONet 0.466 / 0.921; HyperDeepONet 0.500 / 1.912.

Figure 13: Examples of predictions of the solution operator of the shallow water equations using the DeepONet and the HyperDeepONet. The first column represents the exact solution generated in Takamoto et al. (2022), and the other four columns denote the predicted solutions using the corresponding methods. The four rows show the predictions $h(t, x_1, x_2)$ at the four snapshots $t = 0.25, 0.5, 0.75, 1$.

Table 2: The mean relative $L^2$ test error with standard deviation for the solution operator learning problems.


Notation

Table 4: Setting of the decay rate for each operator problem.

Table 5: The relative $L^2$ test errors for experiments on training the identity operator under various conditions.

The table below shows a simple comparison of the HyperDeepONet with the FNO for the identity operator and differentiation operator problems. Although the standard FNO structure has four Fourier layers, we use only one Fourier layer with 2, 4, 8, and 16 modes for a fair comparison using a similar number of parameters. The FNO shows better performance than the HyperDeepONet for the identity operator problem: because the FNO has a linear transform structure within a Fourier layer, the identity operator is easily approximated even with 2 modes. In contrast, the differentiation operator is hard to approximate using the FNO with 2, 4, and 8 modes. Although the FNO with 16 modes can approximate the differentiation operator with better performance than the HyperDeepONet, it requires approximately 4.7 times as many parameters as the HyperDeepONet.

The relative L2 test errors and the number of parameters for the identity and differentiation operator problems using HyperDeepONet and FNO with different numbers of modes. #Param denotes the number of learnable parameters.
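To make the mode-count comparison with the FNO concrete, here is a minimal numpy sketch of a single 1D Fourier (spectral convolution) layer that keeps only the lowest `modes` frequencies. The grid size and single-channel weights are illustrative assumptions; a real FNO learns one complex weight per retained mode and per (in-channel, out-channel) pair, so its parameter count grows linearly in the number of retained modes.

```python
import numpy as np

def fourier_layer(v, R):
    """One spectral convolution: FFT, multiply the lowest `modes` frequencies
    by the learned complex weights R, zero the rest, inverse FFT."""
    modes = R.shape[0]
    v_hat = np.fft.rfft(v)
    out_hat = np.zeros_like(v_hat)
    out_hat[:modes] = R * v_hat[:modes]
    return np.fft.irfft(out_hat, n=v.shape[0])

rng = np.random.default_rng(0)
n = 64                       # grid resolution (illustrative)
v = rng.normal(size=n)       # one channel of the lifted input function

for modes in (2, 4, 8, 16):
    R = rng.normal(size=modes) + 1j * rng.normal(size=modes)
    y = fourier_layer(v, R)
    # Each retained mode costs one complex weight (2 real parameters) here,
    # so doubling the modes doubles the spectral-weight count.
    print(modes, 2 * modes, y.shape)
```

With only 2 modes the layer can still represent very smooth maps (hence the easy identity operator), while sharper targets like differentiation need many more retained modes.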

The relative L2 test errors and the number of parameters for the solution operators of PDEs experiments. #Param denotes the number of learnable parameters. Note that all five models use a similar number of parameters for each problem.

F.4 PERFORMANCE OF OTHER BASELINES WITH THE SAME NUMBER OF LEARNABLE PARAMETERS


ACKNOWLEDGMENTS

J. Y. Lee was supported by a KIAS Individual Grant (AP086901) via the Center for AI and Natural Sciences at Korea Institute for Advanced Study and by the Center for Advanced Computation at Korea Institute for Advanced Study. H. J. Hwang and S. W. Cho were supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2022-00165268) and by the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. 2019-0-01906, Artificial Intelligence Graduate School Program (POSTECH)).


Published as a conference paper at ICLR 2023

Theorem 3 (Total Complexity of DeepONet). Let σ : R → R be a universal activation function in C^r(R) such that σ and σ′ are bounded. Suppose that the class of branch nets B has a bounded Sobolev norm (i.e., ∥β∥_s^r ≤ l_1 for all β ∈ B). If any non-constant ψ ∈ W^{r,n} does not belong to any class of neural network, then the number of parameters in DeepONet is Ω(ϵ^{-(d_y+m)/R}) for any R > r when d(F_DeepONet(B, T); W^{r,d_y+m}) ≤ ϵ.

Proof. For a positive ϵ < l_1, suppose that there exists a class of branch nets B and trunk nets T such that d(F_DeepONet(B, T); W^{r,d_y+m}) ≤ ϵ. By the boundedness of σ, there exists a constant l_2 that is an upper bound on the supremum norm ∥·∥_∞ of the trunk net class T. Denote the number of parameters in F_DeepONet by N_{F_DeepONet}. Using Lemma 1 to replace the DeepONet's inner products with neural networks as in inequality (10), we can construct a class of neural networks F whose number of parameters is N_{F_DeepONet} + O(p^{1+1/t} ϵ^{-1/t}). Suppose that N_{F_DeepONet} = o(ϵ^{-(d_y+m)/r}). Then, by Theorem 1, p^{1+1/t} ϵ^{-1/t} should be Ω(ϵ^{-(d_y+m)/r}). Since t can be arbitrarily large, the number of basis functions p should be Ω(ϵ^{-(d_y+m)/R}) for any R > r.
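As a quick numeric illustration of how the lower bound in Theorem 3 behaves, the following snippet evaluates ϵ^(-(d_y+m)/R) for a few tolerances. The particular values of d_y, m, and R are hypothetical examples, not taken from the paper's experiments.

```python
# Illustrative evaluation of the Theorem 3 lower bound eps**(-(d_y + m)/R).
# d_y: query dimension, m: number of sensor points, R > r: smoothness exponent.
d_y, m, R = 1, 10, 2.0  # hypothetical values for illustration

for eps in (1e-1, 1e-2, 1e-3):
    lower_bound = eps ** (-(d_y + m) / R)
    print(f"eps={eps:g}: Omega({lower_bound:.3g}) parameters")
```

Even for modest m, each order of magnitude of accuracy multiplies the required parameter count by 10^((d_y+m)/R), which is why a DeepONet with a fixed budget struggles on complex targets.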

C.2 COMPLEXITY ANALYSIS ON VARIANTS OF DEEPONET.

We would like to verify that variant models of DeepONet require numerous units in the first hidden layer of the target network. We denote the class of pre-nets in Shift-DeepONet and flexDeepONet by P. The classes of Shift-DeepONet and flexDeepONet will be written as F_Shift-DeepONet(P, B, T) and F_flexDeepONet(P, B, T), respectively.

The structure of Shift-DeepONet can be summarized as follows. Denote the width of the first hidden layer of the target network by w. We define the pre-net as ρ = [ρ_1, ρ_2] : R^m → R^{w×(d_y+1)}, where ρ_1 : R^m → R^{w×d_y} and ρ_2 : R^m → R^w, the branch net as β : R^m → R^p, and the trunk net as τ : R^w → R^p. The Shift-DeepONet f_Shift-DeepONet(ρ, β, τ) is then defined accordingly, where Φ is defined in Eq. (11).

We claim that it does not improve performance for the branch net to additionally output the weights of the first layer of a target network. The following lemma shows that this procedure can be replaced by a small neural network structure.

Lemma 3. Consider a function Φ : R^{d_y(w+1)} → R^w defined as below. For any arbitrary positive t, there exists a class of neural networks F with universal activation σ : R → R such that the number of parameters of F is O(w d_y^{1+1/t} ϵ^{-1/t}).

Proof. Using Lemma 1, we can construct a sequence of neural networks {f_i}_{i=1}^w, each an ϵ-approximation of the inner product with O(d_y^{1+1/t} ϵ^{-1/t}) parameters. Combining all w approximations yields the desired neural network.

Now we present the lower bound on the number of parameters for Shift-DeepONet. We derive the following theorem with the additional assumption that the class of trunk nets is Lipschitz continuous. A function τ : R^{d_y} → R^p is called Lipschitz continuous if there exists a constant C such that ∥τ(y_1) - τ(y_2)∥_1 ≤ C∥y_1 - y_2∥_1. For a neural network f, an upper bound on the Lipschitz constant of f is given by L^{k-1} Π_{i=1}^{k} ∥W_i∥_1, where L is the Lipschitz constant of σ and ∥·∥_1 denotes the matrix
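The Shift-DeepONet forward pass described above can be sketched in a few lines of numpy, taking Φ to be the affine map ρ_1(u)y + ρ_2(u) applied before the trunk net. This choice of Φ, and the single-layer stand-ins for the pre-net, branch net, and trunk net, are assumptions for illustration only; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_y, w, p = 20, 2, 8, 16   # sensors, query dim, first-layer width, basis size

# Toy one-layer stand-ins for the pre-net (rho1, rho2), branch net, trunk net.
W_rho1 = rng.normal(size=(m, w * d_y)) * 0.1
W_rho2 = rng.normal(size=(m, w)) * 0.1
W_beta = rng.normal(size=(m, p)) * 0.1
W_tau = rng.normal(size=(w, p)) * 0.1

def shift_deeponet(u, y):
    """f(u)(y) = <beta(u), tau(rho1(u) @ y + rho2(u))> (assumed affine Phi)."""
    A = (u @ W_rho1).reshape(w, d_y)     # rho1: input-dependent first-layer weights
    b = u @ W_rho2                       # rho2: input-dependent shift
    basis = np.tanh((A @ y + b) @ W_tau) # trunk net applied to the shifted query
    coeffs = u @ W_beta                  # branch net: p coefficients
    return coeffs @ basis                # linear combination, as in DeepONet

u = rng.normal(size=m)     # input function sampled at m sensor points
y = rng.normal(size=d_y)   # query location in the output domain
print(shift_deeponet(u, y))
```

The sketch makes the claim in the text visible: the pre-net's only role is to produce the first affine layer of the target network from u, which Lemma 3 shows can instead be absorbed into a modestly sized fixed network.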

