HT-NET: HIERARCHICAL TRANSFORMER BASED OPERATOR LEARNING MODEL FOR MULTISCALE PDES

Abstract

Complex nonlinear interplays of multiple scales give rise to many interesting physical phenomena and pose significant difficulties for the computer simulation of multiscale PDE models in areas such as reservoir simulation, high-frequency scattering, and turbulence modeling. In this paper, we introduce a hierarchical transformer (HT) scheme to efficiently learn the solution operator for multiscale PDEs. We construct a hierarchical architecture with a scale-adaptive interaction range, such that the features can be computed in a nested manner and with a controllable linear cost. Self-attentions over a hierarchy of levels can be used to encode and decode the multiscale solution space across all scales. In addition, we adopt an empirical H^1 loss function to counteract the spectral bias of the neural network approximation for multiscale functions. In the numerical experiments, we demonstrate the superior performance of the HT scheme compared with state-of-the-art (SOTA) methods on representative multiscale problems.

1. INTRODUCTION

Partial differential equation (PDE) models with multiple temporal/spatial scales are ubiquitous in physics, engineering, and other disciplines. They are of tremendous importance in making predictions for challenging practical problems such as reservoir modeling, high-frequency scattering, and atmosphere circulation, to name a few. The complex nonlinear interplays of characteristic scales cause major difficulties in the computer simulation of multiscale PDEs. Since resolving all characteristic scales is prohibitively expensive, sophisticated multiscale methods have been developed to efficiently and accurately solve multiscale PDEs by incorporating microscopic information. However, most of them are designed for problems with fixed input parameters. Recently, several novel methods such as the Fourier neural operator (FNO) (Li et al., 2021), the Galerkin transformer (GT) (Cao, 2021), and the deep operator network (DeepONet) (Lu et al., 2021) have been developed to directly learn the operator (mapping) between infinite-dimensional spaces for PDE problems, by taking advantage of the enhanced expressibility of deep neural networks and advanced architectures such as feature embedding, channel mixing, and self-attention. Such methods can deal with an ensemble of input parameters and have great potential as efficient forward and inverse solvers of PDE problems. However, for multiscale problems, most existing operator learning schemes essentially capture only the smooth part of the solution space, and how to resolve the intrinsic multiscale features remains a major challenge. In this paper, we design a hierarchical transformer based operator learning method, so that the accurate, efficient, and robust computer simulation of multiscale PDE problems with an ensemble of input parameters becomes feasible.
Our main contributions can be summarized as follows:
• we develop a novel transformer architecture that decomposes the input-output mapping into a hierarchy of levels, so that the features can be updated in a nested manner based on the hierarchical local aggregation of self-attentions, with linear computational cost;
• we adopt the empirical H^1 loss, which counteracts the spectral bias and enhances the ability to capture the oscillatory features of the multiscale solution space;
• the resulting scheme achieves significantly better accuracy and generalization for multiscale input parameters compared with state-of-the-art (SOTA) models.

2. BACKGROUND AND RELATED WORK

We briefly introduce multiscale PDEs in § 2.1, summarize relevant multiscale numerical methods in § 2.2, and review neural solvers, in particular neural operators, in § 2.3.

2.1. MULTISCALE PDES

Multiscale PDEs, in a narrower sense, refer to PDEs with rapidly varying coefficients, which arise from a wide range of applications in heterogeneous and random media. In a broader sense, they may form a hierarchy of models at different scales obtained by a systematic derivation, starting from fundamental laws of physics such as quantum mechanics (E & Engquist, 2003). Some outstanding multiscale PDE models include the following.

Multiscale elliptic equations are an important class of prototypical examples, such as the following second-order elliptic equation in divergence form,

−∇ · (a(x)∇u(x)) = f(x),  x ∈ D,
u(x) = 0,  x ∈ ∂D,    (2.1)

where 0 < a_min ≤ a(x) ≤ a_max for all x ∈ D, and the forcing term f ∈ H^{−1}(D; R). The coefficient-to-solution map is S : L^∞(D; R_+) → H^1_0(D; R), such that u = S(a). In Li et al. (2021), smooth coefficients a(x) are considered and S can be well resolved by the FNO parameterization. The setup a(x) ∈ L^∞ allows rough coefficients with fast oscillation (e.g. a(x) = a(x/ε) with ε ≪ 1), high contrast ratio a_max/a_min ≫ 1, and even a continuum of non-separable scales. The rough-coefficient case is much harder from both the scientific computing (Branets et al., 2009) and operator learning perspectives.

The Navier-Stokes equation models the flow of incompressible fluids, which becomes turbulent due to the simultaneous interaction of a wide range of temporal and spatial scales of motion.

The Helmholtz equation models time-harmonic acoustic waves. Its numerical solution exhibits severe difficulties in the high-wave-number regime due to the interaction of high-frequency waves with the numerical mesh.
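As a concrete illustration of the rough-coefficient regime described above, the following sketch (ours, not from the paper) builds a 1D slice of a coefficient a(x) = a(x/ε) with a prescribed oscillation scale ε and contrast ratio a_max/a_min; all parameter values are illustrative assumptions:

```python
import numpy as np

def multiscale_coefficient(x, eps=1 / 64, contrast=100.0):
    """Coefficient a(x) = a_per(x / eps) with a smooth 1-periodic profile
    a_per taking values in [1, contrast] (hypothetical choice)."""
    a_per = lambda y: 1.0 + (contrast - 1.0) * (0.5 + 0.5 * np.sin(2 * np.pi * y))
    return a_per(x / eps)

x = np.linspace(0.0, 1.0, 4097)        # fine grid resolving the eps-scale
a = multiscale_coefficient(x)
print(a.min(), a.max(), a.max() / a.min())   # contrast ratio a_max / a_min
```

The coefficient oscillates on the ε = 1/64 scale, so any grid coarser than ε cannot resolve it; this is the regime where the FNO-style smooth parameterization struggles.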

2.2. MULTISCALE SOLVERS FOR MULTISCALE PDES

For multiscale PDEs, the computational cost of classical numerical methods, such as finite element and finite difference methods, usually scales as 1/ε ≫ 1. Multiscale solvers have been developed whose computational costs are independent of ε, by incorporating microscopic information.

Asymptotic and numerical homogenization: Asymptotic homogenization (Bensoussan et al., 1978) is an elegant analytical approach for multiscale PDEs with scale separation, e.g., a(x) = a(x/ε) with ε ≪ 1 in equation 2.1. For general multiscale PDEs, possibly with a continuum of scales, numerical homogenization (Engquist & Souganidis, 2008) offers an effective numerical approach that aims to identify low-dimensional approximation spaces whose bases are adapted to the corresponding multiscale operator and can be efficiently constructed (e.g. with localized bases). See Appendix A for details.

Multilevel and multiresolution methods: Multilevel/multigrid methods (Hackbusch, 1985; Xu & Zikatanov, 2017) and multiresolution methods such as wavelets (Brewster & Beylkin, 1995; Beylkin & Coult, 1998) have been successfully applied to PDEs. However, the convergence of these methods for multiscale problems can be severely affected by the regularity of the coefficients (Branets et al., 2009). The recently introduced gamblets (Owhadi, 2017) can be seen as a multilevel extension of numerical homogenization, opening an avenue to automatically discover scalable multilevel algorithms and operator-adapted wavelets for linear PDEs with rough coefficients. See also Appendix A.2.
Low-rank decomposition based methods: It is well known that the elliptic Green's function admits a low-rank approximation (Bebendorf, 2005), which lays the theoretical foundation for (near-)linear complexity methods such as the fast multipole method (Greengard & Rokhlin, 1987; Ying et al., 2004), hierarchical matrices (H and H^2 matrices) (Hackbusch et al., 2002; Bebendorf, 2008), and the hierarchical interpolative factorization (Ho & Ying, 2016). See Appendix B for the connection with the method presented in this paper.

Tensor numerical methods: Analytical approaches such as periodic unfolding (Cioranescu et al., 2008) suggest that a low-dimensional multiscale PDE can be transformed into a high-dimensional PDE. Tensor numerical methods, e.g., the sparse tensor finite element method (Harbrecht & Schwab, 2011) and the quantized tensor train (QTT) method (Kazeev et al., 2022), therefore provide efficient numerical procedures to find low-rank tensor representations of multiscale solutions.

2.3. NEURAL OPERATOR FOR PDES

Neural PDE solvers and spectral bias: Various neural solvers have been proposed in E et al. (2017); Sirignano & Spiliopoulos (2018); Raissi et al. (2019) to solve PDEs with fixed parameters. The difficulty of solving multiscale PDEs is exhibited by the so-called spectral bias or frequency principle (Rahaman et al., 2019; Ronen et al., 2019), which shows that DNN-based algorithms are often inefficient at learning the high-frequency components of multiscale functions. A series of algorithms have been developed to overcome this high-frequency curse of DNNs and to solve multiscale PDEs (Cai et al., 2020; Wang et al., 2021).

Operator learning for PDEs: DNN algorithms demonstrate even more potential when learning the input-output mapping of parametric PDEs. Finite-dimensional operator learning methods such as Zhu & Zabaras (2018); Fan et al. (2019a;b); Khoo et al. (2020) can be applied to problems with a fixed discretization. In particular, Fan et al. (2019a;b) combine H- or H^2-matrix linear operations with nonlinear activation functions, though the nonlinear geometrical interaction/aggregation is absent, which limits the expressivity of the resulting neural operator. Infinite-dimensional operator learning methods (Li et al., 2021; Gupta et al., 2021) aim to learn the mapping between infinite-dimensional Banach spaces, with the convolutions in the neural operator parametrized by Fourier or wavelet transforms. Nevertheless, those methods do not always work even for multiscale PDEs with fixed parameters. On the other hand, while universal approximation can be rigorously proved for FNO (Kovachki et al., 2021), "extra smoothness" is required to achieve a meaningful decay rate, and this smoothness either is absent or gives rise to large constants for multiscale PDEs. This motivates us to construct a new architecture for multiscale operator learning.

Efficient Attention

The vanilla multi-head self-attention of Vaswani et al. (2017) scales quadratically with the number of tokens, which is prohibitive for high-resolution problems. Many efficient attention methods have been proposed that use the kernel trick (low-rank projection) (Choromanski et al., 2020; Wang et al., 2020; Peng et al., 2021; Nguyen et al., 2021) to reduce the computational cost of the dense matrix operations. In the context of operator learning, the Galerkin transformer (Cao, 2021) removes the softmax normalization in self-attention and introduces a linearized self-attention variant with Petrov-Galerkin projection normalization. Furthermore, hierarchical transformers using local window aggregation have been proposed in Liu et al. (2021b); Zhang et al. (2022) for NLP and vision applications.
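The linearization trick underlying softmax-free attention can be sketched as follows (a simplified illustration; the Galerkin transformer additionally normalizes the keys and values, which we omit here). Once softmax is removed, associativity lets one compute Q(K^T V) in O(N d^2) instead of (Q K^T)V in O(N^2 d):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1024, 16                       # tokens, head dimension (illustrative)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

quadratic = (Q @ K.T) @ V / N         # O(N^2 d): forms the N x N kernel
linearized = Q @ (K.T @ V) / N        # O(N d^2): never forms it
print(np.allclose(quadratic, linearized))   # identical results
```

This is why dropping the (non-associative) softmax is the key step for linear-complexity attention.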

3. METHODS

We introduce the hierarchical attention model in this section, motivated by hierarchical matrices, in particular the H^2 matrix variant. We follow the setup in Li et al. (2021); Lu et al. (2021) to approximate the operator S : a → u := S(a), with the input a ∈ A drawn from a distribution μ and the output u ∈ U, where A and U are infinite-dimensional Banach spaces. We aim to learn the operator S from a finite collection of finitely observed input-output pairs through a parametric map N : A × Θ → U and a loss functional L : U × U → R, such that the optimal parameter is θ* = arg min_{θ∈Θ} E_{a∼μ}[L(N(a, θ), S(a))]. The input a is first mapped to features f through a patch embedding (Dosovitskiy et al., 2020).
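A toy instance of this operator-learning formulation, with a fixed linear map standing in for both S and the parametric model N(·, θ) (our deliberate simplification, not the paper's architecture), makes the empirical-risk-minimization setup concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
S_true = rng.standard_normal((d, d)) / np.sqrt(d)  # "ground-truth" operator S
A = rng.standard_normal((256, d))                  # sampled inputs a ~ mu
U = A @ S_true.T                                   # observed outputs u = S(a)

theta = np.zeros((d, d))                           # N(a, theta) := theta @ a
for _ in range(500):                               # gradient descent on the
    grad = (A @ theta.T - U).T @ A / len(A)        # empirical squared loss
    theta -= 0.05 * grad
print(np.abs(theta - S_true).max())                # small: operator recovered
```

In the paper, N(·, θ) is instead the hierarchical transformer, A and U are function spaces observed on a grid, and the loss is the empirical H^1 loss defined in Section 4.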

Hierarchical Discretization

We introduce the hierarchical discretization of the spatial domain D, which can be used directly for time-independent PDEs; time-dependent PDEs can be treated by taking time-sliced data as feature channels. Let I^(r) be the finest-level index set, such that each index i = (i_1, ..., i_r) ∈ I^(r) denotes a finest-level spatial object such as an image pixel, a finite difference point, etc. For any i = (i_1, ..., i_r) ∈ I^(r) and 1 ≤ m ≤ r, i^(m) = (i_1, ..., i_m) represents i's m-th level parent node, which is the aggregate of finer-level objects. I^(m) := {i^(m) : i ∈ I^(r)} is the m-th level index set. The natural parent-child relationship between coarser and finer nodes induces the index tree I together with the index sets I^(m) (1 ≤ m ≤ r). Each index i ∈ I^(m) corresponds to a token, characterized e.g. by its position x_i^(m) and a feature vector f_i^(m) with C^(m) channels. In the following, we restrict our presentation to the quadtree setting as illustrated in Fig. 3.1.
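A minimal sketch of the quadtree indexing described above, with helper names and a bit-interleaving convention of our own choosing (the paper does not prescribe one):

```python
# Each pixel of an n x n grid (n = 2^r) gets a path i = (i_1, ..., i_r) of
# quadrant choices in {0, 1, 2, 3}; the m-th level parent i^(m) keeps the
# first m entries of the path.
def quadtree_path(row, col, r):
    return tuple(2 * ((row >> lvl) & 1) + ((col >> lvl) & 1)
                 for lvl in range(r - 1, -1, -1))

def parent(path, m):
    """The m-th level ancestor i^(m) of a finest-level index."""
    return path[:m]

r = 3                               # depth 3 -> an 8 x 8 grid of pixels
p = quadtree_path(5, 3, r)
print(p, parent(p, 1))              # finest-level path and its level-1 parent
```

Level m then holds 4^m tokens, and every coarse token aggregates exactly four children, which is what the reduce and decompose operations below exploit.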

Reduce Operation

The reduce operation defines the map from finer-level features to coarser-level features. For i ∈ I^(m), we denote by i^(m,m+1) the set of (m+1)-th level child nodes of i. In the quadtree setting, i^(m,m+1) = {(i,0), (i,1), (i,2), (i,3)}. The reduce map can be abstractly defined as

f_i^(m),t = R^(m)({f_j^(m+1),t}_{j ∈ i^(m,m+1)}),

which maps the (m+1)-th level features with indices in i^(m,m+1) to the m-th level feature with index i. For simplicity, we use a simple linear layer for the operator R^(m) in our current implementation, namely,

f_i^(m),t = R_0^(m) f_{(i,0)}^(m+1),t + R_1^(m) f_{(i,1)}^(m+1),t + R_2^(m) f_{(i,2)}^(m+1),t + R_3^(m) f_{(i,3)}^(m+1),t,

where R_0^(m), R_1^(m), R_2^(m), R_3^(m) ∈ R^{C^(m) × C^(m+1)} are parametrized by linear layers. From the perspective of H matrices, these parametrized matrices R_s^(m) embed a low-rank approximation of the attention kernel. Please refer to Appendix B for more details.
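The linear reduce operation above can be sketched as follows (illustrative channel counts and random weights; in the paper the R_s^(m) are trained linear layers):

```python
import numpy as np

rng = np.random.default_rng(0)
C_coarse, C_fine = 16, 8    # channel counts C^(m), C^(m+1) (illustrative)

# One coarse token is a learned linear combination of its four quadtree
# children: f_i^(m) = sum_s R_s^(m) f_(i,s)^(m+1).
R = [rng.standard_normal((C_coarse, C_fine)) for _ in range(4)]

def reduce_op(children):
    """children: four child feature vectors -> one parent feature vector."""
    return sum(R[s] @ children[s] for s in range(4))

children = [rng.standard_normal(C_fine) for _ in range(4)]
parent_feature = reduce_op(children)
print(parent_feature.shape)   # one C^(m)-channel coarse token
```

Because the map is linear in the children, it matches the block-matrix view of Appendix B, where a level's reduce step is a sparse block matrix with blocks R_s^(m).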

Multilevel Local Aggregation: The finest-level features f_i^(r),t ∈ R^{C^(r)} at evolution step t, with index i ∈ I^(r), can be updated by the following token aggregation formula through vanilla attention,

atten: f_i^(r),t+1 = Σ_{j=1}^{N^(r)} G(f_i^(r),t, f_j^(r),t) v(f_j^(r),t),  for i ∈ I^(r),    (3.1)

with N^(r) := |I^(r)|. For simplicity, we ignore the softmax normalizing factor, consider only single-head attention, and assume that the interaction kernel G is of the form G(f_i, f_j) = exp(W_Q f_i · W_K f_j) = exp(q_i · k_j), where q_i := W_Q f_i, k_i := W_K f_i, v_i := v(f_i) = W_V f_i, and W_Q, W_K, W_V ∈ R^{C^(r) × C^(r)} are learnable matrices. Instead of computing equation 3.1 explicitly with O(N^2) cost, we propose a self-attention based local aggregation scheme, inspired by H^2 matrices (Hackbusch, 2015); see also Appendix B. The local aggregation at the r-th level writes

atten_loc^(r): f_i^(r),t+1 = Σ_{j ∈ N^(r)(i)} exp(q_i^(r),t · k_j^(r),t) v_j^(r),t,  for i ∈ I^(r),    (3.2)

where N^(r)(i) is the set of r-th level neighbors of i ∈ I^(r), and q_i^(r),t := Ŵ_Q f_i^(r),t, k_i^(r),t := Ŵ_K f_i^(r),t, v_i^(r),t := Ŵ_V f_i^(r),t with learnable matrices Ŵ_Q, Ŵ_K, Ŵ_V ∈ R^{C^(r) × C^(r)}. Then, for m = r−1, ..., 1, we calculate the local attention at each level with nested q_i^(m),t, k_j^(m),t, v_j^(m),t,

atten_loc^(m): f_i^(m),t+1 = Σ_{j ∈ N^(m)(i)} exp(q_i^(m),t · k_j^(m),t) v_j^(m),t,  for i ∈ I^(m),    (3.3)

where q_i^(m),t = R^(m)({q_j^(m+1),t}_{j ∈ i^(m,m+1)}), k_i^(m),t = R^(m)({k_j^(m+1),t}_{j ∈ i^(m,m+1)}), v_i^(m),t = R^(m)({v_j^(m+1),t}_{j ∈ i^(m,m+1)}), and N^(m)(i) is the set of m-th level neighbors of i ∈ I^(m).

Decompose Operation: To mix the multilevel features f_i^(m),t+1, m = 1, ..., r, we propose a decompose operation that reverses the reduce operation from level 1 to level r−1. The decompose operator D^(m) : f_i^(m),t+1 → {f_j^(m+1),t+1/2}_{j ∈ i^(m,m+1)} maps the m-th level feature with index i to the (m+1)-th level features associated with its child set i^(m,m+1). In the current implementation, we use a simple linear layer,

f_{(i,s)}^(m+1),t+1/2 = D_s^(m),T f_i^(m),t+1,  for s = 0, 1, 2, 3,

with parameter matrices D_s^(m) ∈ R^{C^(m) × C^(m+1)}. Then f_i^(m+1),t+1/2 is further aggregated into f_i^(m+1),t+1 via f_i^(m+1),t+1 += f_i^(m+1),t+1/2 for i ∈ I^(m+1).

Remark 1: In our current implementation, the operators R^(m) and D^(m) consist only of linear layers. In general, these operators may contain nonlinear activation functions.

Remark 2: The nested learnable operators R^(m) and D^(m) induce the channel mixing and a structured parameterization of the W_Q, W_K, W_V matrices for the coarse-level tokens. See Appendix B for details.

We summarize the hierarchical attention algorithm in Algorithm 1.
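The window-restricted aggregation of equation 3.2 can be sketched as follows; for brevity we use a 1D neighborhood and, as in the text, omit the softmax normalization and use a single head (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, C, half_w = 64, 8, 3     # tokens, channels, neighborhood half-width

W_Q, W_K, W_V = (0.1 * rng.standard_normal((C, C)) for _ in range(3))
f = rng.standard_normal((N, C))
q, k, v = f @ W_Q.T, f @ W_K.T, f @ W_V.T

def local_attention(q, k, v, half_w):
    """Each token attends only to its neighborhood N(i): O(N) total cost."""
    out = np.zeros_like(v)
    for i in range(len(q)):
        nbrs = range(max(0, i - half_w), min(len(q), i + half_w + 1))
        out[i] = sum(np.exp(q[i] @ k[j]) * v[j] for j in nbrs)
    return out

f_new = local_attention(q, k, v, half_w)
print(f_new.shape)
```

Each output token touches at most 2·half_w + 1 neighbors, so perturbing a far-away token leaves it unchanged; the long-range interactions are instead recovered through the coarser levels of the hierarchy.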

Algorithm 1 One V-cycle of Hierarchical Attention

Input: index tree I with index sets I^(m), m = 1, ..., r; finest-level features f_i^(r),t for i ∈ I^(r).
STEP 0: Compute q_i^(r),t, k_i^(r),t, v_i^(r),t for i ∈ I^(r).
STEP 1: For m = r−1, ..., 1, do the reduce operations q_i^(m),t = R^(m)({q_j^(m+1),t}_{j ∈ i^(m,m+1)}), and similarly for k_i^(m),t and v_i^(m),t, for any i ∈ I^(m).
STEP 2: For m = r, ..., 1, do the local aggregations atten_loc^(m) of equations 3.2 and 3.3 to obtain f_i^(m),t+1 for any i ∈ I^(m).
STEP 3: For m = 1, ..., r−1, do the decompose operations {f_j^(m+1),t+1/2}_{j ∈ i^(m,m+1)} = D^(m)(f_i^(m),t+1) for any i ∈ I^(m); then f_i^(m+1),t+1 += f_i^(m+1),t+1/2 for any i ∈ I^(m+1).
Output: f_i^(r),t+1 for any i ∈ I^(r).

Proposition 3.1 (Complexity of Algorithm 1): The reduce operation, multilevel aggregation, and decompose operation together form a V-cycle for the update of features, as illustrated in Figure 3.2. The cost of one V-cycle is O(N) if I is a quadtree, as implemented in this paper. See Appendix C for the proof.

Remark 3 (Technical contribution): In Liu et al. (2021b); Zhang et al. (2022), spatial resolutions are sequentially reduced and attention is performed at each level separately. However, attention is used only within a fixed level, and there is no multilevel attention-based aggregation; the coarse latent variables are used for classification and generation tasks, and fine-scale information may be lost in the reduce (coarsening) process. The HT-Net architecture, in contrast, is motivated by multiscale numerical methods such as numerical homogenization and the hierarchical matrix method: attention-based local aggregations are constructed at each level, and features from all levels are summed up to form the updated fine-scale features, which enables the recovery of fine details with linear cost.

Decoder: The decoder maps the features f (at the last update step) to the solution u, and is often chosen with prior knowledge of the PDE. A simple feedforward neural network (FFN) is used in Lu et al. (2021) to learn a basis set as the decoder. A good decoder can also be learned from the data using SVD (Bhattacharya et al., 2021).
In this paper, we use the spectral convolution layers of Li et al. (2021).

Loss functions: Loss functions are crucial for the efficient training and robust generalization of neural network models. For the multiscale problems considered in this paper, we prefer the H^1 loss to the usual L^2 loss function, as it puts more "weight" on high-frequency components. Yu et al. (2022) adopted a Sobolev-norm based loss function for function approximation, where the H^1 loss is defined on the neural network function itself. In contrast, we define the H^1 loss on the target solution space, measuring the distance between the prediction û := N(a) and the ground truth u. We refer the readers to Adams & Fournier (2003) for the definition of Sobolev spaces and the H^1 norm, and to the next section for the implementation.
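The linear-cost claim of Proposition 3.1 can be sanity-checked with a back-of-envelope count of attention pairs across the levels of a quadtree (our simplified accounting, assuming fixed-size attention windows of W neighbors per token):

```python
# Level m of a quadtree holds N / 4^(r-m) tokens; with W neighbors per token,
# the per-level attention costs form a geometric series bounded by (4/3) W N.
def vcycle_attention_pairs(N, r, W):
    return sum((N // 4 ** (r - m)) * W for m in range(1, r + 1))

N, r, W = 4 ** 8, 8, 9       # 256 x 256 finest grid, 3 x 3 windows
pairs = vcycle_attention_pairs(N, r, W)
print(pairs, W * N)          # all levels combined vs. finest level alone
assert pairs <= 4 / 3 * W * N
```

The coarser levels add only about a third to the finest level's cost, so the whole V-cycle stays O(N), in contrast with the O(N^2) of equation 3.1.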

4. EXPERIMENTS

We demonstrate the effectiveness of the hierarchical transformer model (HT-Net) for multiscale operator learning. All our examples are in 2D. We first study the model on a multiscale elliptic equation with two-phase coefficients, a widely used benchmark problem for operator learning (Li et al., 2021). As the "multiscaleness" of the coefficients, such as the oscillation, contrast, and roughness of the two-phase interface, can be fine-tuned, we showcase the accuracy and robustness of the HT-Net model for both smooth and rough coefficients. We also show that HT-Net provides better generalization errors for out-of-distribution input parameters. Finally, we test on the Navier-Stokes equation with a large Reynolds number as an example with nonlinearity and time dependence.

4.1. SETUP

In our experiments, the spatial domain is D := [0,1]^2 and is discretized uniformly with mesh size h = 1/n. Let the uniform grid be G^2 := {(x_i, x_j) = (ih, jh) | i, j = 0, ..., n−1}. {(a_j, u_j)}_{j=1}^N are function pairs such that u_j = S(a_j), and a_j is drawn from some probability measure μ. The actual data pairs for training and testing are pointwise evaluations of a_j and u_j on the grid G^2, denoted by a_j and u_j, respectively. The comparison with all the baselines is consistent with (and most of the time better than) the references. The hierarchical index tree I is generated by the corresponding quadtree representation of the nodes with depth r, such that the finest-level objects are pixels or patches aggregated from pixels. See Appendix D for more details.

Empirical H^1 loss function: The empirical L^2 loss function is defined as

L_L({(a_j, u_j)}_{j=1}^N; θ) := (1/N) Σ_{j=1}^N ∥u_j − N(a_j; θ)∥_{l^2} / ∥u_j∥_{l^2},

where ∥·∥_{l^2} is the canonical l^2 vector norm. For any ξ ∈ Z^2_n := {ξ ∈ Z^2 | −n/2 + 1 ≤ ξ_j ≤ n/2, j = 1, 2}, the normalized discrete Fourier transform (DFT) coefficient of f writes F(f)(ξ) := (1/√n) Σ_{x ∈ G^2} f(x) e^{−2iπ x·ξ}. The empirical H^1 loss function is given by

L_H({(a_j, u_j)}_{j=1}^N; θ) := (1/N) Σ_{j=1}^N ∥u_j − N(a_j; θ)∥_h / ∥u_j∥_h,  where ∥u∥_h := ( Σ_{ξ ∈ Z^2_n} |ξ|^2 |F(u)(ξ)|^2 )^{1/2}.

L_H can be viewed as a weighted L_L loss with weights |ξ|^2, which forces the operator to capture the high-frequency components of the solution. In this work, we use the equivalent frequency-space representation of the discrete H^1 norm; in general, the real-space discrete H^1 norm can also be adopted.

In previous works, the coefficients do not oscillate much and the solutions look smooth. We change the parameters that control the smoothness and contrast of the coefficients, such that the solutions contain more roughness than the Darcy benchmark in Li et al. (2021); see Figure 4.1. We refer the readers to Appendix E for data generation details.
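A sketch of the empirical H^1-type error via the 2D DFT (our implementation choices: np.fft integer frequencies stand in for ξ, and the |ξ|^2 weighting makes this a seminorm that ignores the mean mode):

```python
import numpy as np

def h1_relative_error(u_pred, u_true):
    n = u_true.shape[0]
    k = np.fft.fftfreq(n, d=1.0 / n)            # integer wave numbers xi_j
    xi2 = k[:, None] ** 2 + k[None, :] ** 2     # |xi|^2 weights
    def h_norm(u):
        return np.sqrt(np.sum(xi2 * np.abs(np.fft.fft2(u) / n) ** 2))
    return h_norm(u_pred - u_true) / h_norm(u_true)

n = 64
x = np.arange(n) / n
u = np.sin(2 * np.pi * x)[:, None] * np.ones(n)  # smooth reference field
print(h1_relative_error(1.1 * u, u))             # ~0.1 for a 10% rescaling
```

Relative to an unweighted l^2 error, the |ξ|^2 weights magnify discrepancies at high frequencies, which is exactly where the oscillatory part of a multiscale solution lives.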
We also include experiments for multiscale trigonometric coefficients with higher contrast; a detailed description is given in Appendix F.1.

Table 2: Relative L^2 error (×10^-2) for out-of-distribution data, with a_max = 12, a_min = 3 and c = 18. The models are trained with the rough two-phase dataset in Table 1 and tested on out-of-distribution data. FNO2D-H^1 is the FNO model trained with our H^1 loss. HT-Net outperforms the other models by an order of magnitude.

4.3. NAVIER-STOKES EQUATION

We consider the 2D Navier-Stokes equation in vorticity form on the unit torus, also benchmarked in Li et al. (2021),

∂_t w(x,t) + u(x,t) · ∇w(x,t) = ν∆w(x,t) + f(x),  x ∈ (0,1)^2, t ∈ (0,T],
∇ · u(x,t) = 0,  x ∈ (0,1)^2, t ∈ [0,T],
w(x,0) = w_0(x),  x ∈ (0,1)^2,

with velocity u, vorticity w = ∇ × u, initial vorticity w_0, viscosity ν > 0 (∼ Re^{−1}, with Re the Reynolds number), and forcing term f. We learn the operator S : w(·, 0 ≤ t ≤ 10) → w(·, 10 ≤ t ≤ T), mapping the vorticity up to time 10 to the vorticity up to some later time T > 10. We experiment with viscosities ν = 1e−3, 1e−4, 1e−5, and decrease the final time T accordingly as the Reynolds number increases and the dynamics become chaotic.

Time-dependent neural operator: Following the setup in Li et al. (2021), we fix the resolution to 64 × 64 for both training and testing. The ten time-slices of solutions w(·,t) at t = 0, ..., 9 are taken as the input to the neural operator N, which maps the solutions at the previous 10 time steps to the next time step (2D functions to 2D functions). This procedure, often referred to as rolled-out prediction, is repeated recurrently until the final time T. We list the results of HT-Net, FNO-3D (convolution in space-time), FNO-2D, and U-Net (Ronneberger et al., 2015) in Table 3; HT-Net is significantly better than the other methods.

[Table 3 settings: columns (ν, T, N) = (1e−3, 50, 1000), (1e−4, 30, 1000), (1e−4, 30, 10000), (1e−5, 20, 1000); rows list #Parameters and the per-model errors, truncated here.]
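The rolled-out prediction described above can be sketched as follows, with a trivial stand-in model (a mean over the input slices) in place of HT-Net:

```python
import numpy as np

def rollout(model, w_history, n_steps):
    """Apply a 10-slice -> next-slice model recurrently (rolled-out prediction)."""
    w = list(w_history)                       # slices w(., t), each n x n
    for _ in range(n_steps):
        w.append(model(np.stack(w[-10:])))    # feed the latest 10 slices
    return np.stack(w)

toy_model = lambda slices: slices.mean(axis=0)   # stand-in, NOT HT-Net
w0 = [np.full((4, 4), float(t)) for t in range(10)]
traj = rollout(toy_model, w0, n_steps=5)
print(traj.shape)   # 10 input slices + 5 predicted slices
```

Because each predicted slice is fed back as input, errors can compound over the rollout, which is why the chaotic high-Reynolds-number regimes use a shorter final time T.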

5. CONCLUSION

We have built HT-Net, a hierarchical transformer based operator learning model for multiscale PDEs, which allows nested computation of features and self-attentions and provides a hierarchical representation of the multiscale solution space. The reduce, multilevel local aggregation, and decompose operations form a fine-coarse-fine V-cycle for the feature update. We also introduced the empirical H^1 loss to reduce the spectral bias in multiscale operator learning. HT-Net provides much better accuracy and robustness than state-of-the-art (SOTA) neural operators, as demonstrated on several multiscale benchmarks.

Limitations and outlook: (1) The current implementation of HT-Net relies on a regular grid. The extension to point clouds and graph neural networks will offer more opportunities to take advantage of the hierarchical representation. (2) The current implementation of the attention-based operator lacks the flexibility of FNO (Li et al., 2021) to generalize to a different resolution. However, for multiscale problems, the discretization invariance of the FNO and MWT models may be hampered by aliasing error in the frequency domain, which is validated by our experimental results. Recent convergence analyses of operator learning methods such as Kovachki et al. (2021); Lanthaler et al. (2022) leverage the smoothness of the solutions, while multiscale PDEs usually have lower regularity and do not fall into such categories. For HT-Net, it might be possible to achieve (approximate) discretization invariance by using proper sampling and interpolation modules; we expect that a full treatment will be operator-adaptive.

ETHICS STATEMENT

This work proposes a hierarchical transformer operator learning model for multiscale PDEs. As stated in the introduction, solving multiscale PDEs is associated with some of the most challenging practical problems, such as reservoir modeling, fracture and fatigue prediction, high frequency scattering, weather forecasting, etc. Potentially, HT-Net solvers could help to reduce the prohibitively expensive computational cost of those simulations. The negative consequences are not obvious. Though, in theory, any technique can be misused, it is not likely to happen at the current stage.

REPRODUCIBILITY STATEMENT

The datasets for Section 4 are either downloaded (for smooth two-phase coefficients and Navier-Stokes) from https://github.com/zongyi-li/fourier_neural_operator, or generated (for rough Darcy coefficients) using the code from the same repository. We implemented the P1 finite element method in MATLAB to solve equation 2.1 with multiscale trigonometric coefficients in Appendix F.1, and used FreeFEM to solve the Helmholtz equation in Appendix F.2. Introductions to the relevant mathematical and data generation concepts are included in the appendix.

A.1 ASYMPTOTIC HOMOGENIZATION

We introduce the basic formulation of asymptotic homogenization by assuming a(x) = a(x/ε) in equation 2.1, with a 1-periodic function a(·), namely,

−div(a(x/ε) ∇u_ε) = f,  x ∈ D ⊂ R^d,
u_ε = 0,  x ∈ ∂D.

From the asymptotic expansion of the two-scale function u_ε = u(x, x/ε) in ε, it can be derived, and justified rigorously, that u_ε ⇀ u_0 in H^1, with u_0(x) the solution of the homogenized problem

−div_x(a^0 ∇_x u_0) = f,  x ∈ D,
u_0 = 0,  x ∈ ∂D,

which contains only coarse-scale information. The homogenized coefficient a^0 can be computed by the formula

a^0_{ij} = ∫_Y (e_i + ∇χ_i)^T a(y) (e_j + ∇χ_j) dy,

where χ_i solves the cell problems

−div_y(a(∇_y χ_i + e_i)) = 0,  y ∈ Y,  1 ≤ i ≤ d,  χ_i ∈ H^1_per(Y),

with Y the d-dimensional torus. Asymptotic homogenization provides an approximate solution û_ε = u_0 − ε χ_i(x/ε) ∂u_0/∂x_i such that ∥u_ε − û_ε∥_{H^1} ≤ c ε^{1/2}. We note that finding the homogenized approximation û_ε only requires the coarse-scale solution u_0 and the precomputation of d cell problems for χ_i, which do not depend on f and D.
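In one dimension the cell problem is solvable in closed form and the homogenized coefficient reduces to the harmonic mean, a^0 = (∫_Y a(y)^{-1} dy)^{-1}, which differs from the naive arithmetic average. A short numeric check (ours) for a(y) = 2 + sin(2πy), where a^0 = √3 exactly:

```python
import numpy as np

# Midpoint rule on the unit cell Y = [0, 1): harmonic vs. arithmetic mean of
# a(y) = 2 + sin(2*pi*y). The exact homogenized value is sqrt(3) ~ 1.732.
y = (np.arange(200_000) + 0.5) / 200_000
a = 2.0 + np.sin(2 * np.pi * y)
a0 = 1.0 / np.mean(1.0 / a)        # 1D homogenized coefficient
print(a0, np.sqrt(3.0), a.mean())
```

The gap between a^0 ≈ 1.732 and the arithmetic mean 2 illustrates why naive coarse averaging of a multiscale coefficient gives the wrong effective equation.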

A.2 NUMERICAL HOMOGENIZATION AND MULTILEVEL/MULTIGRID METHODS

Given the smallest scale ε of the multiscale problem and a coarse computational scale H determined by the available computational power and the desired precision, with ε ≪ H ≪ 1, the goal of numerical homogenization is to construct a finite-dimensional approximation space V_H and to seek an approximate solution u_H ∈ V_H such that the accuracy estimate ∥u − u_H∥ ≤ C H^α holds for optimal choices of the norm ∥·∥ and the exponent α, at optimal computational cost, with V_H constructed via pre-computed subproblems that are localized, independent, and do not depend on the right-hand side and boundary condition of the problem. In the past two decades, great progress has been made in this area (Hou et al., 1999; Målqvist & Peterseim, 2014; Owhadi & Zhang, 2007): approximation spaces with optimal accuracy (in the sense of the Kolmogorov N-width (Berlyand & Owhadi, 2010)) and cost can be constructed for elliptic equations with fixed rough coefficients. For multiscale PDEs, operator learning methods can be seen as a step forward from numerical homogenization, since they can be applied to an ensemble of coefficients, and the decoder can be interpreted as a basis for the underlying problem. Multilevel/multigrid methods can be seen as the multilevel generalization of numerical homogenization, which provides coarse spaces with optimal approximation and localization properties. Recently, operator-adapted wavelets (gamblets) (Owhadi, 2017; Xie et al., 2019) have been developed; they enjoy three properties that are ideal for the construction of efficient direct methods: scale orthogonality, well-conditioned multiresolution decomposition, and localization. Gamblets have been generalized to efficiently solve the Navier-Stokes equation (Budninskiy et al., 2019) and the Helmholtz equation (Hauck & Peterseim, 2022).

B HIERARCHICAL MATRIX PERSPECTIVE

The hierarchically nested attention in Algorithm 1 resembles the celebrated hierarchical matrix method (Hackbusch, 2015), in particular the H^2 matrix, from the perspective of matrix operations.

[Figure B.1 caption: During the reduce process, the computation is performed bottom-up to obtain coarser-level tokens; for example, the (0,0) token is obtained by learnable reduce operations R^(2) and R^(1) acting on tokens (0,0,0) and (0,0,1). The high-resolution tokens/features are then generated top-down by learnable decompose operations D^(1) and D^(2). The red frames show examples of attention windows at each level.]

In the following, we take the binary-tree hierarchical discretization shown in Fig. B.1 as an example to express the reduce operation, decompose operation, and multilevel token aggregation of Algorithm 1 in matrix form.

STEP 0: Given the features f^(r),t, compute the queries q_i^(r),t, keys k_i^(r),t, and values v_i^(r),t for i ∈ I^(r). Stacking the finest-level features into a vector (f_i^(r),t)_{i ∈ I^(r)}, the stacked queries are obtained by the block-diagonal product

(q_i^(r),t)_{i ∈ I^(r)} = diag(W^{Q,(r)}, ..., W^{Q,(r)}) (f_i^(r),t)_{i ∈ I^(r)},

with |I^(r)| identical diagonal blocks W^{Q,(r)}; the keys k^(r),t and values v^(r),t follow the same procedure.

STEP 1: For m = r−1, ..., 1, do the reduce operations q_i^(m),t = R^(m)({q_j^(m+1),t}_{j ∈ i^(m,m+1)}). In the binary-tree case, R^(m) is the |I^(m)| × |I^(m+1)| block matrix whose i-th block row places R_0^(m) and R_1^(m) in the columns of i's two children and zeros elsewhere,

R^(m) := diag( [R_0^(m)  R_1^(m)], ..., [R_0^(m)  R_1^(m)] ),  R_0^(m), R_1^(m) ∈ R^{C^(m) × C^(m+1)},

parametrized by linear layers in this paper (in practice, the queries, keys, and values use different R_0^(m), R_1^(m) to enhance the expressivity). In general, the operators R^(m) are not limited to linear operators; composing with nonlinear activation functions would further increase the expressivity.
The nested learnable operators R^(m) also induce the channel mixing and are equivalent to a structured parameterization of the W_Q, W_K, W_V matrices for the coarse-level tokens, in the sense that, inductively,

(q_i^(m),t)_{i ∈ I^(m)} = R^(m) ··· R^(r−1) (q_i^(r),t)_{i ∈ I^(r)} = R^(m) ··· R^(r−1) diag(W^{Q,(r)}, ..., W^{Q,(r)}) (f_i^(r),t)_{i ∈ I^(r)}.

STEP 2: With the m-th level queries and keys, calculate the local attention matrix G_loc^(m),t at the m-th level, with entries G_loc,i,j^(m),t := exp(q_i^(m),t · k_j^(m),t) for i ∈ N^(m)(j), written i ∼ j.

STEP 3: The decompose operations, opposite to the reduce operations, correspond in the linear case to the transpose of the analogous block matrix

D^(m) := diag( [D_0^(m)  D_1^(m)], ..., [D_0^(m)  D_1^(m)] ).

The m-th level aggregation in Figure 3.2 contributes to the final output f^(r),t+1 in the form

D^(r−1),T ··· D^(m),T G_loc^(m) R^(m) ··· R^(r−1) (v_i^(r),t)_{i ∈ I^(r)}.

Eventually, the aggregations at all r levels in one V-cycle sum up as

(f_i^(r),t+1)_{i ∈ I^(r)} = ( Σ_{m=1}^{r−1} D^(r−1),T ··· D^(m),T G_loc^(m) R^(m) ··· R^(r−1) + G_loc^(r) ) (v_i^(r),t)_{i ∈ I^(r)},    (B.1)

so the hierarchical attention matrix is G_h := Σ_{m=1}^{r−1} D^(r−1),T ··· D^(m),T G_loc^(m) R^(m) ··· R^(r−1) + G_loc^(r).

D EXPERIMENT SETUP

Training setup: We use the Adam optimizer with learning rate 1e−3, weight decay 1e−4, and the 1-cycle schedule as in Cao (2021). We use batch size 8 for the experiments in Sections 4.2 and 4.3, and batch size 4 for the experiments in Appendices F.1 and F.2.

Evaluation setup: For the Darcy rough case in Table 1, we use a train-val-test split of the dataset with sizes 1280, 112, and 112, respectively. Each run is carried out with a different seed.
For the Darcy smooth and multiscale trigonometric cases, we use a train-val-test split of the dataset with sizes 1000, 100, and 100, respectively. For the Navier-Stokes experiment in Section 4.3, we have tried N = 1000 and N = 10000 training samples as shown in Table 3, with 100 and 1000 testing samples, respectively. The baselines are based on their official implementations when publicly available, and the results are comparable with those reported in the references. All experiments are run on an NVIDIA A100 GPU.
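The multilevel aggregation in equation B.1 can be sketched in a few lines of NumPy for a toy 1D binary tree. Everything here is a stand-in, not the trained model: random matrices replace the learned layers, the channel width is kept fixed across levels (the actual model widens channels when coarsening), and the neighbor set is simply $|i - j| \le 1$ at every level. The sketch only illustrates the reduce, local-attention, and decompose-and-sum structure of one V-cycle.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_attention(q, k, v, window=1):
    """Local attention with G[i, j] = exp(q_i . k_j) restricted to |i - j| <= window,
    as in STEP 2 (no softmax normalization, matching the definition in the text)."""
    n = q.shape[0]
    out = np.zeros_like(v)
    for i in range(n):
        js = range(max(0, i - window), min(n, i + window + 1))
        out[i] = sum(np.exp(q[i] @ k[j]) * v[j] for j in js)
    return out

def v_cycle(f, Wq, Wk, Wv, R0, R1, D0, D1, r=3):
    """One V-cycle of hierarchical attention (equation B.1) on a 1D binary tree."""
    # STEP 0: finest-level queries, keys, values.
    qs, ks, vs = [f @ Wq.T], [f @ Wk.T], [f @ Wv.T]
    # STEP 1: reduce -- each parent token mixes its two children.
    for _ in range(r - 1):
        qs.append(qs[-1][0::2] @ R0.T + qs[-1][1::2] @ R1.T)
        ks.append(ks[-1][0::2] @ R0.T + ks[-1][1::2] @ R1.T)
        vs.append(vs[-1][0::2] @ R0.T + vs[-1][1::2] @ R1.T)
    # Finest-level term G_loc^{(r)} v.
    out = local_attention(qs[0], ks[0], vs[0])
    # STEP 3: coarse-level terms, decomposed back down and summed (equation B.1).
    for m in range(1, r):
        y = local_attention(qs[m], ks[m], vs[m])
        for _ in range(m):  # apply transposed decompose blocks back to the finest level
            up = np.empty((2 * y.shape[0], y.shape[1]))
            up[0::2], up[1::2] = y @ D0, y @ D1
            y = up
        out += y
    return out

C, N = 4, 8
mats = [rng.standard_normal((C, C)) * 0.1 for _ in range(7)]
f = rng.standard_normal((N, C))
f_next = v_cycle(f, *mats)  # updated finest-level features, shape (8, 4)
```

Note that every coarse level touches a constant number of neighbors per token, and the token count shrinks geometrically, which is the source of the linear cost.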

E DATA GENERATION FOR THE TWO-PHASE COEFFICIENT IN SECTION 4.2

The two-phase coefficients and solutions are generated according to https://github.com/zongyi-li/fourier_neural_operator/tree/master/data_generation, and used as operator learning benchmarks in Li et al. (2021); Gupta et al. (2021); Cao (2021). The coefficients $a(x)$ are generated according to $a \sim \mu := \psi_{\#} N(0, (-\Delta + cI)^{-2})$ with zero Neumann boundary conditions on the Laplacian. The mapping $\psi : \mathbb{R} \to \mathbb{R}$ takes the value $a_{\max}$ on the positive part of the real line and $a_{\min}$ on the negative part, and the push-forward is defined in a pointwise manner. The forcing term is fixed as $f(x) \equiv 1$. Solutions $u$ are obtained by using a second-order finite difference scheme on a suitable grid. The parameters $a_{\max}$ and $a_{\min}$ control the contrast of the coefficient; the parameter $c$ controls the roughness (oscillation) of the coefficient, and a coefficient with a larger $c$ has rougher two-phase interfaces, as shown in Figure 4.1.

For the multiscale trigonometric case (Appendix F.2), the coefficient is $a(x) = \prod_{k=1}^{6} \left(1 + \tfrac{1}{2}\cos(a_k \pi (x_1 + x_2))\right)\left(1 + \tfrac{1}{2}\sin(a_k \pi (x_2 - 3x_1))\right)$, with $a_k = \mathrm{uniform}(2^{k-1}, 1.5 \times 2^{k-1})$ and fixed $f(x) \equiv 1$. The reference solutions are obtained using $P_1$ FEM on a $1023 \times 1023$ grid; datasets of lower resolution are sampled from the higher-resolution dataset by linear interpolation. The experiment results for the multiscale trigonometric case with different resolutions are shown in Table 4. HT-Net obtains the best relative $L^2$ error compared to other neural operators at various resolutions at 600 epochs. The multiwavelet neural operator MWT has the second best performance, as it also possesses a multiresolution structure, but it does not adapt to the PDE. It is not surprising that FNO, an excellent smoother which filters out higher frequency modes, fails to capture the high-frequency oscillations of the solution; in contrast, our method performs better in this respect.
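The push-forward construction above can be sketched with a Karhunen-Loeve expansion: the Neumann eigenfunctions of the Laplacian on $[0,1]^2$ are products of cosines, so a Gaussian field with covariance $(-\Delta + cI)^{-2}$ can be sampled by weighting i.i.d. normals with the corresponding eigenvalues and then thresholding pointwise. This is a simplified sketch (eigenfunction normalization constants and the exact truncation used by the reference data-generation script are omitted), not the benchmark generator itself.

```python
import numpy as np

def sample_two_phase(n=64, c=9.0, a_min=3.0, a_max=12.0, kmax=32, seed=0):
    """Draw a ~ psi_# N(0, (-Delta + c I)^{-2}) with Neumann BCs on [0, 1]^2.

    Truncated KL expansion in the cosine basis, then pointwise thresholding
    psi: positive part of the field -> a_max, negative part -> a_min.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    k = np.arange(kmax)
    # Covariance eigenvalues: ((k^2 + l^2) pi^2 + c)^(-2) for eigenfunction
    # cos(k pi x) cos(l pi y) of the Neumann Laplacian.
    lam = ((k[:, None] ** 2 + k[None, :] ** 2) * np.pi ** 2 + c) ** -2.0
    xi = rng.standard_normal((kmax, kmax))       # i.i.d. N(0, 1) KL coefficients
    cosx = np.cos(np.outer(k, np.pi * x))        # (kmax, n) evaluation matrix
    field = cosx.T @ (xi * np.sqrt(lam)) @ cosx  # Gaussian field on the grid
    return np.where(field >= 0, a_max, a_min)    # two-phase coefficient
```

The rapid decay $((k^2 + l^2)\pi^2 + c)^{-2}$ of the eigenvalues is what makes the interfaces smoother for small $c$ and rougher as $c$ grows.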



(i, j) is understood as the concatenation of entries of i and the scalar 0 ≤ j ≤ 3.



Figure 3.1: Hierarchical discretization and index tree. In this example, the 2D unit square is hierarchically discretized into three levels, which are indexed by $I^{(1)}$, $I^{(2)}$ and $I^{(3)}$, respectively. For example, we denote by $I^{(2)}_{(1)} = \{(1, 0), (1, 1), (1, 2), (1, 3)\}$ the set of the second-level child nodes of the node $(1)$.

Figure 3.2: One V-cycle of the feature update.


STEP 1: For $m = r-1, \dots, 1$, do the reduce operations $q_i^{(m),t} = R^{(m)}(\{ q_j^{(m+1),t} : j \in I^{(m+1)}_{(i)} \})$, and similarly for $k_i^{(m),t}$ and $v_i^{(m),t}$, for any $i \in I^{(m)}$. STEP 2: For $m = r, \dots, 1$, do the local aggregation by equation 3.2 and equation 3.3 to get $f_i^{(m),t+1}$.

Figure 4.1: Top: (a) smooth coefficient in Li et al. (2021), with a_max = 12, a_min = 3 and c = 9, (b) reference solution, (c) HT-Net solution, (d) absolute error of HT-Net, (e) absolute error of FNO2D. Bottom: (a) rough coefficient with a_max = 12, a_min = 2 and c = 20, (b) reference solution, (c) HT-Net solution, (d) absolute error of HT-Net, (e) absolute error of FNO2D. The maximal absolute error in Bottom (e) is around 9e-4.

Figure 4.2: Evolution of errors for (a) HT-Net trained with the H¹ loss and (b) HT-Net trained with the L² loss. The x-axis is frequency, the y-axis is the training epoch, and the colorbar shows the magnitude of the L² error at each frequency in log₁₀ scale; the error at each frequency is normalized frequency-wise by the error at epoch 0. The loss curves with training, testing, and generalization errors are shown in (c) for HT-Net trained with the H¹ loss and in (d) for HT-Net trained with the L² loss.
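The empirical H¹ loss compared in Figure 4.2 penalizes errors frequency by frequency, weighting mode $k$ by $1 + |k|^2$. A Fourier-based sketch of a relative version is below; this assumes a periodic spectral definition of the norm, since the paper does not spell out its discretization (a finite-difference gradient variant would serve the same purpose).

```python
import numpy as np

def relative_h1_loss(pred, true):
    """Relative empirical H^1 loss on an n x n periodic grid via the FFT:
    ||e||_{H1}^2 = sum_k (1 + |k|^2) |e_hat(k)|^2, normalized by ||true||_{H1}."""
    n = pred.shape[-1]
    kx = np.fft.fftfreq(n, d=1.0 / n)              # integer wavenumbers
    w = 1.0 + kx[:, None] ** 2 + kx[None, :] ** 2  # H^1 weight per mode
    def h1(u):
        return np.sqrt(np.sum(w * np.abs(np.fft.fft2(u)) ** 2))
    return h1(pred - true) / h1(true)
```

Because the weight grows like $|k|^2$, high-frequency components of the error dominate this loss, which is what counteracts the spectral bias toward smooth outputs.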



Figure B.1: Hierarchical discretization of a 1D domain. The coarsest level partition is plotted as the top four segments in pink. The segment (0, 0) is further partitioned into two child segments (0, 0, 0) and (0, 0, 1). During the reducing process, the computation is performed from bottom to top to obtain coarser-level tokens; for example, the (0, 0) tokens are obtained by learnable reduce operations R^{(2)} and R^{(1)} on tokens (0, 0, 0) and (0, 0, 1). When generating the high-resolution tokens/features, the computation proceeds from top to bottom by learnable decompose operations D^{(1)} and D^{(2)}. The red frames show examples of attention windows at each level.

The hierarchical attention matrix $G^{(r)}_h$ in equation B.1 resembles the three-level $H^2$ matrix decomposition illustrated in Figure B.2.

F.2 MULTISCALE TRIGONOMETRIC COEFFICIENT

In this experiment, we consider equation 2.1 with a multiscale trigonometric coefficient adapted from Owhadi (2017), such that $D = [-1, 1]^2$ and $a(x) = \prod_{k=1}^{6} \left(1 + \tfrac{1}{2}\cos(a_k \pi (x_1 + x_2))\right)\left(1 + \tfrac{1}{2}\sin(a_k \pi (x_2 - 3x_1))\right)$, with $a_k = \mathrm{uniform}(2^{k-1}, 1.5 \times 2^{k-1})$.

See Figure F.1 for illustrations of the coefficient and a comparison of the solutions at the slice x = 0.
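The multiscale trigonometric coefficient is straightforward to evaluate on a grid; a short NumPy sketch following the formula above (the random seed and grid size are arbitrary choices for illustration):

```python
import numpy as np

def trig_coefficient(n=256, seed=0):
    """Multiscale trigonometric coefficient on D = [-1, 1]^2:
    a(x) = prod_{k=1}^{6} (1 + 0.5 cos(a_k pi (x1 + x2))) (1 + 0.5 sin(a_k pi (x2 - 3 x1))),
    with a_k ~ Uniform(2^{k-1}, 1.5 * 2^{k-1})."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n)
    x1, x2 = np.meshgrid(x, x, indexing="ij")
    a = np.ones((n, n))
    for k in range(1, 7):
        ak = rng.uniform(2 ** (k - 1), 1.5 * 2 ** (k - 1))  # scale-k oscillation
        a *= (1 + 0.5 * np.cos(ak * np.pi * (x1 + x2))) \
           * (1 + 0.5 * np.sin(ak * np.pi * (x2 - 3 * x1)))
    return a
```

Each factor is bounded between 0.5 and 1.5, so the coefficient stays strictly positive (hence elliptic) while oscillating on six dyadic scales up to frequency $a_6 \approx 2^5$.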


Figure F.1: (a) multiscale trigonometric coefficient, (b) slice of the solutions at x = 0.

Table 1: Performance on the multiscale elliptic equation. For the Darcy rough case, we run each experiment 3 times to calculate the mean and the standard deviation (after ±) of the relative L² errors (×10⁻²) and relative H¹ errors (×10⁻²). All experiments use a fixed train-val-test split; see Appendix D for details. SWIN is constructed by adding the encoder-decoder architecture of HT-Net to the original SWIN code, and serves as a baseline for multiscale vision transformers. FNO2D-48 and FNO2D-96 are adapted from FNO2D (by default, with 12 modes) with 48 and 96 modes, respectively, in order to learn high-frequency outputs. HT-Net outperforms the other neural operators by a considerable margin in all three cases.

Table 3: Benchmarks for the Navier-Stokes equation. The resolution is 64 × 64 for both training and testing, and all models are trained for 500 epochs. N is the number of training samples.

Table 4: Relative error (×10⁻²) of the multiscale trigonometric example.

ACKNOWLEDGEMENT

We provide the code and a link to the datasets at the anonymous GitHub page https://github.com/Shengren-Kato/VFMM-ICLR2023.git. Supplementary descriptions of the code are also provided on that page.


We also refer to Hackbusch (2015) for a detailed description. The sparsity of the matrix lies in the fact that the attention matrix is only computed for pairs of tokens within the neighbor set. The $H^2$ matrix-vector multiplication in equation B.1 also implies the $O(N)$ complexity of Algorithm 1. Note that the local attention matrices at level $I^{(1)}$ (pink), level $I^{(2)}$ (blue), and level $I^{(3)}$ (green) are $G^{(1)}_{\mathrm{loc}}$, $G^{(2)}_{\mathrm{loc}}$, and $G^{(3)}_{\mathrm{loc}}$, respectively; when considering their contributions to the finest level, they are equivalent to the attention matrices $D^{(2),T} D^{(1),T} G^{(1)}_{\mathrm{loc}} R^{(1)} R^{(2)}$, $D^{(2),T} G^{(2)}_{\mathrm{loc}} R^{(2)}$, and $G^{(3)}_{\mathrm{loc}}$.

C PROOF OF PROPOSITION 3.1

Proof. For each level $m$, the cost to compute equation 3.2 and equation 3.3 is $c(|I^{(m)}| C^{(m)})$ flops; the reduce operation that maps level-$m$ tokens to their parents costs $2 |I^{(m)}| C^{(m)} C^{(m-1)}$ flops, and so does the corresponding decompose operation. Therefore, for each level, the operation cost is $c(|I^{(m)}| C^{(m)}) + 2 |I^{(m)}| C^{(m)} C^{(m-1)}$. When $I$ is a quadtree, $|I^{(r)}| = N$, $|I^{(r-1)}| = N/4, \dots, |I^{(1)}| = 4$, and summing this geometric series gives a total computational cost of $O(N)$. □
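The geometric decay of the token counts can be checked numerically. The sketch below counts flops for one V-cycle on a quadtree with the per-level costs from the proof; for simplicity the channel width $C$ is held fixed across levels and the window size is a constant, so the count is only proportional to the true one.

```python
def ht_cost(N, r, C, window=8):
    """Flop-count sketch for one V-cycle on a quadtree with |I^(r)| = N.

    Per level: local attention ~ window * |I^(m)| * C, and the reduce and
    decompose operations each ~ 2 * |I^(m)| * C * C (constant channel width
    assumed here for simplicity)."""
    cost = 0
    for m in range(1, r + 1):
        tokens = N // 4 ** (r - m)          # |I^(m)| for a quadtree
        cost += window * tokens * C         # local attention at level m
        if m < r:
            cost += 2 * 2 * tokens * C * C  # reduce + decompose between levels
    return cost

c1 = ht_cost(4 ** 6, r=3, C=8)
c2 = ht_cost(4 * 4 ** 6, r=3, C=8)  # quadruple N
```

Since every term is linear in the token count and the token counts shrink by a factor of 4 per level, quadrupling $N$ quadruples the total, i.e., the cost is $O(N)$.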

D IMPLEMENTATION DETAILS

In the HT-Net implementation, we follow the window attention scheme in Liu et al. (2021b) for the definition of the neighborhood $N^{(\cdot)}(\cdot)$ in equation 3.2 and equation 3.3. In this paper, we choose r = 3 as the depth of the HT-Net, GeLU as the activation function, and a CNN-based patch embedding module to transfer the input data into features. For a dataset with resolution n_f × n_f, such as the multiscale elliptic equation benchmark in Section 4.2, the input feature f^{(3)} is represented as a tensor of size n × n × C via patch embedding. The self-attention is first computed within a local window on level 3. Then the reduce layer concatenates the features of each group of 2 × 2 neighboring tokens and applies a linear transformation to the 4C-dimensional concatenated features on the n/2 × n/2 level 2 tokens, to obtain the level 2 features f^{(2)} as a tensor of size n/2 × n/2 × 2C. The procedure is repeated from level 2 to level 1, with f^{(1)} of size n/4 × n/4 × 4C. For the decompose process, starting at level 1, a linear layer transforms the 4C-dimensional features f^{(1)} into 8C-dimensional features; each level 1 token with 8C-dimensional features is decomposed into four level 2 tokens with 2C-dimensional features and added to the level 2 features f^{(2)}, with output size n/2 × n/2 × 2C. The procedure is repeated from level 2 to level 3, with output f^{(3)} of size n × n × C.

The Helmholtz equation is notoriously difficult to solve; one reason is the resonance phenomenon when the frequency ω is close to an eigenfrequency of the Helmholtz operator for some particular c. We found it necessary to use a large training dataset of size 8000. We also compare the evaluation times of the trained models in Table 5. Compared to HT-Net, FNO has both a larger error and a longer evaluation time. UNet, as a CNN-based method, has a much faster evaluation time (30 times faster than HT-Net) but the worst error (60 times that of HT-Net).
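The reduce layer described above (concatenate each 2 × 2 group of tokens, then apply a linear map from 4C to 2C channels) can be sketched in NumPy; the weight matrix here is a random stand-in for the learned linear layer.

```python
import numpy as np

def patch_reduce(f, W):
    """Reduce layer: concatenate each 2x2 group of tokens and apply a linear map.

    f : (n, n, C) level-m features; W : (2C, 4C) weight (random stand-in here).
    Returns level-(m-1) features of shape (n/2, n/2, 2C)."""
    n, _, C = f.shape
    # Group tokens into 2x2 windows: (n/2, n/2, 2, 2, C) -> (n/2, n/2, 4C).
    g = f.reshape(n // 2, 2, n // 2, 2, C).transpose(0, 2, 1, 3, 4)
    g = g.reshape(n // 2, n // 2, 4 * C)
    return g @ W.T  # linear map 4C -> 2C

rng = np.random.default_rng(0)
f3 = rng.standard_normal((8, 8, 4))       # level-3 features, n = 8, C = 4
W = rng.standard_normal((8, 16)) * 0.1    # maps 4C = 16 to 2C = 8 channels
f2 = patch_reduce(f3, W)                  # level-2 features, shape (4, 4, 8)
```

The decompose step is the mirror image: a linear map widens the channels and each coarse token is split back into its 2 × 2 (or four) children, whose features are added to the finer level.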

F.3 COMPARISON WITH CLASSICAL METHODS (FDM) AND MULTISCALE METHODS (GRPS)

We compare the accuracy and efficiency of HT-Net with two conventional solvers: the finite difference method (FDM) as a typical classical method, and generalized rough polyharmonic splines (GRPS) (Liu et al., 2021a) as a typical multiscale method, on the Darcy rough benchmark. The results are listed in Table 6.

Table 6: Relative L² error and evaluation time on the Darcy rough benchmark.

Model   | Evaluation time (s) | Relative L² error (×10⁻²)
FDM     | 0.34                | 0.84
GRPS    | 18.9                | 0.02
HT-Net  | 0.003               | 0.58

We implement FDM and GRPS in MATLAB and measure the solution time on an Intel(R) Core(TM) i7-10510U CPU @ 2.30 GHz; the evaluation time of HT-Net is measured on an NVIDIA A100 GPU. The reference solution is defined on a 512 × 512 grid, sampled from the test dataset. FDM solves the problem on a 256 × 256 grid, GRPS uses coarse bases of dimension 256², and HT-Net learns the solution operator at the same resolution. Compared with the classical methods, HT-Net has comparable accuracy but needs much less time for evaluation.
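For reference, a generic FDM baseline of the kind compared above can be written in a few lines. The paper does not spell out its FDM variant, so this is a textbook second-order sketch for $-\nabla \cdot (a \nabla u) = f$ with zero Dirichlet boundary conditions and arithmetic averaging of the coefficient at cell faces; a dense solve is used to keep the sketch dependency-free (a sparse solver would be used in practice).

```python
import numpy as np

def solve_darcy_fdm(a, f, h):
    """Second-order FDM for -div(a grad u) = f, zero Dirichlet boundary.

    a, f : (n, n) arrays at the interior grid nodes with spacing h.
    Face coefficients are arithmetic averages of neighboring values."""
    n = a.shape[0]
    pad = np.pad(a, 1, mode="edge")
    aE = 0.5 * (pad[1:-1, 1:-1] + pad[1:-1, 2:])   # east-face coefficients
    aW = 0.5 * (pad[1:-1, 1:-1] + pad[1:-1, :-2])  # west
    aN = 0.5 * (pad[1:-1, 1:-1] + pad[2:, 1:-1])   # north
    aS = 0.5 * (pad[1:-1, 1:-1] + pad[:-2, 1:-1])  # south
    idx = np.arange(n * n).reshape(n, n)
    A = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            p = idx[i, j]
            A[p, p] = (aE[i, j] + aW[i, j] + aN[i, j] + aS[i, j]) / h ** 2
            if j + 1 < n: A[p, idx[i, j + 1]] = -aE[i, j] / h ** 2
            if j > 0:     A[p, idx[i, j - 1]] = -aW[i, j] / h ** 2
            if i + 1 < n: A[p, idx[i + 1, j]] = -aN[i, j] / h ** 2
            if i > 0:     A[p, idx[i - 1, j]] = -aS[i, j] / h ** 2
    return np.linalg.solve(A, f.ravel()).reshape(n, n)

n = 15
h = 1.0 / (n + 1)
u = solve_darcy_fdm(np.ones((n, n)), np.ones((n, n)), h)  # -Laplace u = 1 sanity case
```

For rough coefficients, such a scheme must resolve the finest oscillation scale of $a$, which is why its grid (and hence cost) grows with the coefficient roughness while a trained operator evaluates at fixed cost.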

F.4 STUDY FOR HYPERPARAMETERS

We conduct studies for hyperparameters such as the number of hierarchical levels (depth), window size, and feature dimension. We list the results for the Darcy rough benchmark in Table 7, where the notation [3, 8, 80] represents HT-Net with 3 levels, window size 8, and feature dimension 80. The experiments are run for 100 epochs.

Table 7: Hyperparameter study.

Model               | H¹ relative error (×10⁻²) | L² relative error (×10⁻²)
HT-Net [3, 8, 80]   | 1.843                     | 0.648
HT-Net [·, 8, 128]  | 1.761                     | 0.688
HT-Net [·, 4, 80]   | 1.898                     | 0.710
HT-Net [·, 4, 64]   | 1.909                     | 0.695
HT-Net [·, 2, 80]   | 2.030                     | 0.707
HT-Net [·, 8, 80]   | 1.903                     | 0.701

The results show that larger values of the hierarchical level, window size, and feature dimension can help reduce the errors, though at a higher computational cost. To balance model size, computational cost, and performance, we choose the hyperparameters [3, 8, 80]. The CUDA mem (GB) is the sum of the self CUDA memory usage reported by the PyTorch autograd profiler for one backpropagation; the mem (GB) is recorded from nvidia-smi as the memory allocated to the active Python process during profiling.

F.5 MEMORY USAGE

We report the memory usage of different models for the Darcy smooth (resolution 211 × 211) and Darcy rough (resolution 256 × 256) benchmarks in Table F.5. The table shows that the memory usage of HT-Net is stable. At resolution 256 × 256, both MWT and GT consume larger or comparable CUDA memory compared with HT-Net, while HT-Net achieves much better accuracy.

