BI-STRIDE MULTI-SCALE GRAPH NEURAL NETWORK FOR MESH-BASED PHYSICAL SIMULATION

Abstract

Learning physical systems on unstructured meshes with flat graph neural networks (GNNs) faces the challenge of modeling long-range interactions: the complexity scales with the number of nodes, limiting generalization under mesh refinement. On regular grids, convolutional neural networks (CNNs) with a U-net structure resolve this challenge through efficient striding, pooling, and upsampling operations. Nonetheless, these tools are much less developed for GNNs, especially when GNNs are employed for learning large-scale mesh-based physics. The challenges arise from highly irregular meshes and the lack of effective ways to construct a multi-level structure without losing connectivity. Inspired by the bipartite-graph determination algorithm, we introduce the Bi-Stride Multi-Scale Graph Neural Network (BSMS-GNN), proposing bi-stride as a simple pooling strategy for building multi-level GNNs. Bi-stride pools nodes on every other breadth-first-search (BFS) frontier; it 1) works robustly on any challenging mesh in the wild, 2) avoids using a mesh generator at coarser levels, 3) avoids spatial proximity for building coarser levels, and 4) uses non-parameterized aggregation/returning instead of MLPs during pooling and unpooling. Experiments show that our framework significantly outperforms state-of-the-art methods in computational efficiency on representative physics-based simulation cases.

1. INTRODUCTION

Simulating physical systems through numerically solving partial differential equations (PDEs) plays a key role in various science and engineering applications, ranging from particle-based (Jiang et al., 2016) and mesh-based (Li et al., 2020a) solid mechanics to grid-based fluid (Bridson, 2015) and aero (Cao et al., 2022) dynamics. Despite extensive successes in improving their stability, accuracy, and efficiency, numerical solvers are often computationally expensive for time-sensitive applications, especially iterative design optimization, where fast online inference is desired. Recently, machine learning approaches have demonstrated impressive potential for improving the efficiency of inferring physical states with competitive accuracy. Representative methods include end-to-end frameworks (Obiols-Sales et al., 2020) and those with physics-informed neural networks (PINNs) (Raissi et al., 2019; Karniadakis et al., 2021; Sun et al., 2020; Gao et al., 2021). Many existing works apply convolutional neural networks (CNNs) (Fukushima & Miyake, 1982) to learn physical systems on two- or three-dimensional structured grids (Kim et al., 2019; Fotiadis et al., 2020; Gao et al., 2021; Guo et al., 2016; Tompson et al., 2017). It is generally recognized that CNNs handle local information well with convolution and global information with pooling/upsampling. However, the strict dependency on regular domain shapes makes it non-trivial to apply them on unstructured meshes. Although it is possible to deform the domains into rectangular shapes to apply CNNs (Gao et al., 2021) or other models, such as neural operator networks (Li et al., 2022), the challenge remains for domains with complex topologies, which are common in practice.
On the other hand, graph neural networks (GNNs) are a natural choice for physics-based simulation on unstructured meshes (Battaglia et al., 2018; Belbute-Peres et al., 2020; Gao et al., 2022; Harsch & Riedelbauch, 2021; Pfaff et al., 2020; Sanchez-Gonzalez et al., 2018; 2020). However, all the above methods use flat GNNs, which face two challenges as the graph size increases. (1) Oversmoothing: graph convolution can be seen as a low-pass filter that suppresses signals above a certain frequency (Chen et al., 2020; Li et al., 2020b). Multiple passes of graph convolution then become an iterative projection onto the eigenspace of the graph where all higher-frequency signals are smoothed out, which also makes training harder. (2) Complexity: under mesh refinement, not only are there more nodes to process, but the number of message-passing (MP) iterations must also grow linearly to propagate information over the same physical distance (Fortunato et al., 2022). As a result, quadratic complexity becomes inevitable for both the running time and the memory needed to store the computational graph. To mitigate these limitations, researchers have recently started investigating multi-scale GNNs (MS-GNNs) for physics-based simulation (Fortunato et al., 2022; Li et al., 2020b; Lino et al., 2021; Liu et al., 2021; Lino et al., 2022a;b). The multi-scale approach is appealing as it tackles the oversmoothing issue by building sub-level graphs at coarser resolutions, which leads to longer-range interactions and naturally fewer MP steps. However, pooling and adjacency building must be conducted carefully to avoid introducing partitions into the coarser levels (Gao & Ji, 2019), which would stop information exchange across the separated clusters.
Existing solutions include utilizing spatial proximity to build the connections at the coarser levels (Lino et al., 2021; Liu et al., 2021; Lino et al., 2022a;b), generating coarser meshes for the original geometry (Fortunato et al., 2022; Liu et al., 2021), and randomly pooling nodes then applying a Nyström approximation to the original adjacency matrix (Li et al., 2020b). However, all of them suffer from limitations: spatial proximity can produce wrong connections across geometry boundaries; mesh generation is laborious and often unavailable for unseen meshes; and random pooling may introduce partitions at the coarser levels. We observe that all the aforementioned limitations originate from pooling and building connections at coarser levels. To the best of our knowledge, no existing work can systematically generate multi-scale GNNs with arbitrary levels for an arbitrary geometry in the wild while completely avoiding cutting or wrong connections across the boundaries. To this end, we introduce a simple yet robust and effective pooling strategy, bi-stride. Bi-stride is inspired by the bi-partition determination algorithm for DAGs (directed acyclic graphs). It pools all nodes on every other breadth-first-search (BFS) frontier, such that a 2nd-powered adjacency enhancement conserves all the connectivity. We also accompany bi-stride with a non-parameterized aggregating/returning method to handle the transition between adjacent levels and decrease the model complexity. Our framework, namely the Bi-Stride Multi-Scale Graph Neural Network (BSMS-GNN), is tested on three benchmarks (CYLINDERFLOW, AIRFOIL, and DEFORMINGPLATE) from GraphMeshNets and on INFLATINGFONT, a new dataset of inflating elastic surfaces with many self-contacts. In all cases, BSMS-GNN shows a dominant advantage in memory footprint and required training and inference time compared to alternatives.

2. BACKGROUND AND RELATED WORKS

GNNs for Physics-Based Simulation GNNs were first applied to physical simulation to learn the behaviors of particle systems, deformable solids, and Lagrangian fluids (Battaglia et al., 2016; Chang et al., 2016; Mrowca et al., 2018; Sanchez-Gonzalez et al., 2020). Notably, the generalized Message Passing (Sanchez-Gonzalez et al., 2018) is broadly adopted for information propagation. Based on that, GraphMeshNets (Pfaff et al., 2020) set a milestone for learning mesh-based simulation. Following GraphMeshNets, which predicts a single forward timestep, there have been several variants, including 1) solving forward and inverse problems by combining GNNs with PINNs (Gao et al., 2022), 2) predicting long-term system states with a GraphAutoEncoder (GAE) and Transformer (Han et al., 2022), 3) predicting steady states with multi-layer readouts (Harsch & Riedelbauch, 2021), and 4) up-sampling from coarser meshes with differentiable simulation (Belbute-Peres et al., 2020). Still, with flat GNNs, the quadratic computational complexity on finer meshes poses great challenges. We claim that adopting a multi-level structure is an effective solution.

Multi-Scale GNNs It is common to apply GNNs with multi-level structures in various graph-related tasks, such as graph classification (Wu et al., 2020; Mesquita et al., 2020; Zhang et al., 2019). GraphUNet (GUN) (Gao & Ji, 2019) first introduced U-net structures into GNNs with a trainable scoring module for pooling; it also uses a 2nd-powered adjacency enhancement to reduce the chance of losing connectivity. A few works have investigated multi-scale GNNs (MS-GNNs) for physics-based simulation. Specifically, Fortunato et al. (2022) and Liu et al. (2021) define two- and multi-level GNNs, respectively, for physics-based simulation, but both of them rely on pre-generated coarse meshes. Lino et al. (2021; 2022a;b) use the original mesh at the first level and project it onto regular grids (MS-GNN-GRID) at the coarser levels. Li et al. (2020b) adopt multi-level matrix factorization to generate the kernels at arbitrary levels without requiring mesh generators or K-nearest-neighbor (K-NN) interpolations. Concerning building connections and hierarchies on point clouds with radius samplers, there are representative works such as GNS (Sanchez-Gonzalez et al., 2020), PointNet (Qi et al., 2017a), PointNet++ (Qi et al., 2017b), and GeodesicConv (Masci et al., 2015). Still, these methods by construction better suit cases without meshes, such as particle fluid simulations.

Motivations of Our Method

We present an overview of representative GNN architectures with a U-net structure in Fig. 1. The two major disadvantages we observe are: 1) easy loss of connectivity from pooling, even with a 2nd-powered adjacency enhancement; and 2) a lack of direct connections between pooled and unpooled nodes, leading to additional edges built by spatial proximity for the transition between levels. For a clearer illustration, we start with a few definitions. We first define the adjacency enhancement by the $K$th-order matrix power as $A \leftarrow A^K$, where $A$ is the adjacency matrix of the graph. Geometrically, $A(i, j) = 1$ means the edge $(i, j)$ exists, and $A^K(i, j) = 1$ means that node $j$ is connected to node $i$ via at most $K$ hops. Given a pooling strategy $P$ and the selected pooled nodes $S_P$, we define the $K$th-order outlier set $O_K$ as the set of nodes that are not connected to any pooled node even after the $K$th-order adjacency enhancement: $A^K(i, j) = 0, \forall i \in S_P, \forall j \in O_K$. We further say that a pooling strategy $P$ is $K$th-order connection conservative ($K$-CC) if $O_K$ is empty. We argue that a larger $K$ in the $K$th-order adjacency enhancement is harmful for distinguishing node features: as $K$ increases, $A^K$ approaches a matrix with all entries equal to 1, representing a fully connected graph, where a single step of convolution averages all node features and makes them indistinguishable. The most favorable feasible, i.e. the smallest, $K$ we should seek is 2. Gao & Ji (2019) use the 2nd-order enhancement to help conserve the connectivity. Nonetheless, there is no theoretical guarantee that a learnable pooling module is consistently 2-CC for any graph (a counterexample is shown in Fig. 1(a)). There are two alternative solutions to the matrix-power enhancement that ensure conservation of the connectivity at coarser layers: 1) Lino et al. (2021; 2022a;b) build the coarser graph by projecting the finer nodes onto nearby background grids (Fig. 1(b)); 2) Liu et al. (2021) and Fortunato et al. (2022) create coarser meshes for the same domain (Fig. 1(c)).
However, both methods need spatial proximity to build additional connections for the transition between levels, which may produce wrong connections across the boundary. These limitations motivate us to create a consistently 2-CC pooling strategy, as described in Sec. 3.2. An additional overhead is the learnable transition modules, which have the same network architecture as the message passing. This overhead in model size and computational complexity grows linearly with the number of levels of the U-net; as a result, these methods often end up with a relatively shallow depth of 2 or 3 levels. We claim that a non-parameterized transition is performance-wise crucial for deeper multi-level GNNs, and propose the first non-parameterized transition method in Sec. 3.3. Overall, our method adopts a similar message-passing layer as GraphMeshNets (Pfaff et al., 2020). Compared to Liu et al. (2021) and Fortunato et al. (2022), our advantage is that no mesh generator is needed for the coarser-level graphs. Compared to Lino et al. (2021; 2022a;b), our advantage is that no spatial proximity is necessary. Together, we eliminate the need for building connections via spatial proximity or using learnable MLPs for aggregation and returning. Note that the work of Li et al. (2020b) shares similar advantages to some extent, but it focuses on generalization over PDE parameters, while ours focuses on a systematic pooling strategy for arbitrary complex geometries.
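The K-CC condition above can be checked mechanically from the adjacency matrix: accumulate boolean reachability up to K hops, then verify that every unpooled node reaches at least one pooled node. The following is a small self-contained sketch (our own helper, not code from the paper):

```python
import numpy as np

def is_k_cc(A, pooled, K):
    """Return True if the pooled set is K-th order connection conservative
    (K-CC): every non-pooled node reaches some pooled node within K hops,
    i.e. the outlier set O_K is empty."""
    n = A.shape[0]
    Ab = (A > 0).astype(int)
    reach = np.eye(n, dtype=int)                 # reachable in 0 hops
    hop = np.eye(n, dtype=int)
    for _ in range(K):
        hop = ((hop @ Ab) > 0).astype(int)       # extend by one more hop
        reach = ((reach + hop) > 0).astype(int)  # accumulate I + A + ... + A^K
    pooled = np.asarray(pooled)
    unpooled = np.setdiff1d(np.arange(n), pooled)
    # O_K is empty iff each unpooled node reaches at least one pooled node
    return bool(reach[np.ix_(unpooled, pooled)].any(axis=1).all())

# Path graph 0-1-2-3-4: pooling every other node is 2-CC; pooling only node 0 is not.
A = np.zeros((5, 5), dtype=int)
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1
```

On the path graph, `is_k_cc(A, [0, 2, 4], 2)` holds because every unpooled node is one hop from a pooled one, while `is_k_cc(A, [0], 2)` fails since nodes 3 and 4 are more than two hops from node 0.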

3. METHODOLOGY

3.1 DEFINITIONS

Fig. 2 presents the overall structure of BSMS-GNN. We consider the evolution of a physics-based system discretized on a mesh, which is converted to an undirected graph $G_1 = (V_1, E_1)$. Here, the subscript 1 indicates the finest level (the input mesh): $V_1$ and $E_1$ label the nodal fields and the connectivities, respectively. Specifically for edges, we define $E_1 = \{E_1^1, \cdots, E_1^S\}$, where $E_1^1$ is the edge set directly copied from the input mesh, and $E_1^k, k = 2, \cdots, S$, are optional additional problem-dependent edge sets. For example, both the DEFORMINGPLATE (Fig. 5(c)) and INFLATINGFONT (Fig. 5(d)) benchmarks have a second edge set $E_1^2$ for the nearby colliding vertices. We use $\{p, q\}$, stacked vectors of $\{p_i, q_i\}$ over all nodes $i \in V_1$, to denote the input and output nodal fields, respectively. Given an input field $p^j$ at a previous time $t_j$, one pass of our BSMS-GNN returns the output field $q^{j+1}$ at time $t_{j+1} = t_j + \Delta t$, where $\Delta t$ is the fixed time step size. The output $q$ can contain more physical fields than the input $p$ and must be able to derive the input for the next pass. The rollout refers to iteratively conducting BSMS-GNN from the initial state, $p^0 \rightarrow q^1 \rightarrow p^1 \rightarrow \cdots \rightarrow q^n$, producing the temporal sequence $\{q^1, q^2, \cdots, q^n\}$ within the time range $(t_0, t_0 + n\Delta t]$, where $n$ is the total number of evaluations.
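The rollout described above is a simple loop over one-step predictions. A minimal sketch, where `model` and `derive_input` are hypothetical stand-ins for one BSMS-GNN pass and for the mapping from the output $q$ back to the next input $p$:

```python
def rollout(model, p0, derive_input, n):
    """Iteratively apply a one-step model:
    p_0 -> q_1 -> p_1 -> ... -> q_n, collecting the outputs q_j."""
    outputs, p = [], p0
    for _ in range(n):
        q = model(p)            # one pass: state at t_j advanced to t_j + dt
        outputs.append(q)
        p = derive_input(q)     # the output must determine the next input
    return outputs
```

For instance, with a toy scalar "model" `lambda p: p + 1` and the identity as `derive_input`, `rollout(..., 0, ..., 3)` yields the sequence `[1, 2, 3]`.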

Message Passing

In general, we follow the encode-process-decode fashion of GraphMeshNets, where encoding and decoding only appear at the beginning and the end of the finest level $G_1$, mapping the nodal input $p$ and output $q$ to/from the latent feature $v$ (see Table A.1 for the domain-specific information). As for the process part, unlike GraphMeshNets, where multiple message passings (MPs) are needed, we observe that a single MP at each level is sufficient for all experiments. Therefore, it becomes unnecessary to keep updating the latent edge information across multiple MPs. To include the directional information of an edge $(x_i, x_j)$, we simply prepend its positional offset $\Delta x_{ij} = x_i - x_j$ to the stacked sender/receiver latents as input when calculating the information flow. For a problem involving $S$ edge sets, an MP pass at level $l$ is formulated as
$$e^s_{l,ij} \leftarrow f^s_l\big(\Delta x_{l,ij}, v_{l,i}, v_{l,j}\big), \quad s = 1, \cdots, S,$$
$$v'_{l,i} \leftarrow v_{l,i} + f^V_l\Big(v_{l,i}, \textstyle\sum_j e^1_{l,ij}, \cdots, \sum_j e^S_{l,ij}\Big), \qquad (1)$$
where $f$ is an MLP, $e$ is the latent information flow through an edge, and $v$ is the latent node feature. Please refer to Sec. A.2 for the detailed architecture of the model.
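As a non-authoritative sketch of the MP pass above, the following NumPy snippet computes the per-edge flows, scatter-adds them per receiver, and applies the residual node update. Simple `tanh` layers stand in for the MLPs $f^s_l$ and $f^V_l$, and all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
F = 4                                          # latent width (illustrative)

def f_edge(z, W):                              # stand-in for the edge MLP f_l^s
    return np.tanh(z @ W)

def f_node(z, W):                              # stand-in for the node MLP f_l^V
    return np.tanh(z @ W)

def mp_pass(x, v, edge_sets, W_edge, W_node):
    """One message-passing pass at a single level with S edge sets.
    x: (N, d) positions; v: (N, F) latent node features;
    edge_sets: list of (senders, receivers) index arrays, one per edge set s."""
    msgs = []
    for s, (snd, rcv) in enumerate(edge_sets):
        dx = x[rcv] - x[snd]                   # positional offset, prepended
        e = f_edge(np.concatenate([dx, v[rcv], v[snd]], axis=-1), W_edge[s])
        agg = np.zeros((v.shape[0], F))
        np.add.at(agg, rcv, e)                 # sum incoming flow per receiver
        msgs.append(agg)
    # residual node update on [v, sum e^1, ..., sum e^S]
    return v + f_node(np.concatenate([v] + msgs, axis=-1), W_node)

# Tiny example: 3 nodes in 2-D, one edge set (0->1, 1->2).
x = rng.normal(size=(3, 2))
v = rng.normal(size=(3, F))
edges = [(np.array([0, 1]), np.array([1, 2]))]
W_edge = [rng.normal(size=(2 + 2 * F, F))]
W_node = rng.normal(size=(2 * F, F))
v_new = mp_pass(x, v, edges, W_edge, W_node)
```

The scatter-add via `np.add.at` mirrors the summation over senders $j$ in Eq. (1); in the actual PyG implementation this would be a learned MLP plus a `scatter` aggregation.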

Cross-Level Transition

We handle the information transition between two adjacent levels with downsampling and upsampling modules. Here we define downsampling as the sequence of pooling (selecting the pooled nodes) and then aggregating the information from the neighbors to the coarser level, and we define upsampling as the sequence of unpooling and then returning the information of the pooled nodes to their neighbors at the finer level. Please refer to Sec. 3.3 for details. As aforementioned, two challenges of building a multi-level GNN for learning physical simulation, especially on wild geometries, are 1) not introducing partitions that break the connectivity, and 2) not introducing wrong edges via spatial proximity. We tackle these challenges by improving the pooling phase. Specifically, is there a pooling strategy that is consistently 2nd-order connection conservative (2-CC) for any input graph, so that an efficient 2nd-order enhancement suffices to conserve the connectivity? We draw our initial inspiration from the bi-partition determination algorithm (Asratian et al., 1998) for a directed acyclic graph (DAG). As shown in the inset figure (a), after topological sorting, pooling on every other depth (yellow and green) generates a bi-partition. To resemble the bi-partition determination on a mesh, which is not bipartite due to cycles, we conduct a breadth-first search (BFS) to compute the geodesic distances from an initial seed to all other nodes, and then stride and pool all nodes on every other BFS frontier (bi-stride). A bi-stride example is shown in the inset figure (b), where the number in each vertex represents its BFS distance to the seed (node 1 in the red circle). This pooling is 2-CC by construction and conserves direct connections between pooled and unpooled nodes. As a result, we avoid building edges by spatial proximity or handling cumbersome corner cases such as cross-boundary connections.
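The bi-stride selection itself reduces to a single BFS. A minimal sketch (the function and graph representation are ours): run BFS from the seed and keep the nodes at even geodesic distance, i.e. every other frontier.

```python
from collections import deque

def bi_stride_pool(adj, seed):
    """Pool the nodes on every other BFS frontier (bi-stride):
    BFS from the seed, keep nodes at even geodesic distance.
    adj: dict mapping node -> iterable of neighbours."""
    dist = {seed: 0}
    frontier = deque([seed])
    while frontier:
        u = frontier.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                frontier.append(w)
    return {u for u, d in dist.items() if d % 2 == 0}

# 6-cycle: BFS frontiers from node 0 are {0}, {1, 5}, {2, 4}, {3}.
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
pooled = bi_stride_pool(cycle, 0)
```

On the 6-cycle, the pooled set is {0, 2, 4}: every unpooled node (1, 3, 5) is exactly one hop from a pooled node, which is what makes the strategy 2-CC by construction.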

3.2. BI-STRIDE POOLING AND ADJACENCY ENHANCEMENT

Seeding Heuristics We claim that there exists considerable freedom in seeding as long as it is balanced to a certain degree. The time complexity of searching for seeds is tolerable because of the one-pass preprocess. For training datasets, we choose two deterministic seeding heuristics: 1) closest to the center of a cluster (CloseCenter) for INFLATINGFONT, and 2) the minimum average distance (MinAve) for all other cases, and we preprocess the multi-level building in one pass. One can fall back on the cheaper CloseCenter heuristic during the online inference phase if an unseen geometry is encountered. The details of the algorithms can be found in Sec. A.6.

Auxiliary Edges For multi-physics problems, such as DEFORMINGPLATE (Fig. 5(c)) and INFLATINGFONT (Fig. 5(d)), auxiliary edges (such as contact edges $A^C$) should be built dynamically by spatial proximity to exchange the interfacial information between different systems. The enhancement of these edges must be handled properly for a multi-level GNN, which, to the best of our knowledge, has not been addressed yet. At level $l$, given the two adjacency matrices $A_l$ and $A^C_l$ for the mesh edges and the contact edges, respectively, we apply the enhancement followed by per-cluster bi-stride pooling with the selected node indices $I$:
$$A'_{l+1} \leftarrow A_l A_l, \quad A_{l+1} \leftarrow A'_{l+1}[I, I], \quad A'^{C}_{l+1} \leftarrow A_l A^C_l A_l, \quad A^C_{l+1} \leftarrow A'^{C}_{l+1}[I, I]. \qquad (2)$$
This enhancement can be interpreted geometrically as follows: an auxiliary edge $(i, j)$ should exist at the coarser level if $j$ is reachable from $i$ within two hops, one of which is an auxiliary edge at the finer level. We prove in Sec. A.5 that our pooling conserves all the contact edges under this enhancement.
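Eq. (2) can be sketched directly with dense boolean matrix products. In this sketch we assume $A_l$ includes self-loops (our reading, so that the products cover paths of at most two mesh hops rather than exactly two); the function and example names are ours:

```python
import numpy as np

def enhance_and_pool(A, Ac, idx):
    """Sketch of Eq. (2): square the mesh adjacency, sandwich the contact
    adjacency between mesh adjacencies, then keep only the bi-stride-pooled
    rows/columns idx. A is assumed to contain self-loops."""
    A2 = ((A @ A) > 0).astype(int)           # A'_{l+1} = A_l A_l
    Ac2 = ((A @ Ac @ A) > 0).astype(int)     # A'^C_{l+1} = A_l A^C_l A_l
    sub = np.ix_(idx, idx)
    return A2[sub], Ac2[sub]

# Two 2-node sticks 0-1 and 2-3 with a contact edge (1, 2); pool nodes {0, 3}.
A = np.eye(4, dtype=int)
A[0, 1] = A[1, 0] = A[2, 3] = A[3, 2] = 1    # mesh edges plus self-loops
Ac = np.zeros((4, 4), dtype=int)
Ac[1, 2] = Ac[2, 1] = 1                      # dynamically built contact edge
A_next, Ac_next = enhance_and_pool(A, Ac, [0, 3])
```

In the example, the coarse contact matrix connects the pooled nodes 0 and 3 (via the path 0-1, contact 1-2, 2-3), while the coarse mesh adjacency correctly keeps the two sticks disconnected.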

3.3. TRANSITION BETWEEN LEVELS

We propose a unified non-parameterized method to remove the overhead of learnable transition modules between every pair of adjacent levels.

Downsampling We treat the latent information as a conserved variable and project it onto the pooled nodes. We define $A$ as the unweighted adjacency matrix whose row and column indices represent the sender and the receiver, respectively. We further represent the nodal mass or importance as a nodal weight field $w$, initialized on the finest level to ones for near-uniform meshes or to the volume/mass field for highly irregular meshes. With a receiver vertex $j$ and its sender vertices $i$, the formal procedure is (Fig. 3):
• normalize by row as in a standard graph convolution, $\hat A_{ij} \leftarrow A_{ij} / \sum_j A_{ij}$, and then convolve the weight once, $\hat w_{ij} \leftarrow w_i \hat A_{ij}$ (Fig. 3(a));
• calculate the edge weights $C_{ij} \leftarrow \hat w_{ij} / \sum_i \hat w_{ij}$, where $C$ can be viewed as a contribution table with $C_{ij}$ the share of the weight at receiver $j$ contributed by sender $i$ (Fig. 3(b));
• convolve the latent information with the contribution table, $v_j \leftarrow \sum_i v_i C_{ij}$, which is equivalent to equally splitting and sending the weighted information to the neighbors and then taking the weighted average (Fig. 3(c)).

Upsampling After unpooling, all nodes except the pooled ones carry zero information. A returning process, resembling the transposed convolution in CNNs, helps distinguish the receivers. With the contribution table $C$ recording the edge weights, a natural choice is $v_i \leftarrow \sum_j C_{ij} v_j$, i.e. applying $C^\top$ in reverse (Fig. 3(d)).
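The three downsampling steps and the returning step can be sketched as below. We assume the adjacency includes self-loops so that each pooled node keeps a share of its own weight; the function and variable names are ours, not from the released code:

```python
import numpy as np

def contribution_table(A, w, idx):
    """Build the contribution table C restricted to pooled receivers idx.
    A: (n, n) adjacency (self-loops assumed); w: (n,) nodal weights."""
    A_hat = A / A.sum(axis=1, keepdims=True)        # row-normalize per sender
    w_hat = w[:, None] * A_hat                      # convolve the weight once
    C = w_hat / w_hat.sum(axis=0, keepdims=True)    # shares per receiver j
    return C[:, idx]                                # only pooled nodes receive

def downsample(C, v):
    return C.T @ v        # v_j <- sum_i v_i C_ij: weighted average onto pooled nodes

def upsample(C, v_coarse):
    return C @ v_coarse   # returning: the transpose of the downsampling map

# Path 0-1-2 with self-loops, uniform weights, pooled node {1}.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)
C = contribution_table(A, np.ones(3), [1])
v_coarse = downsample(C, np.ones((3, 1)))           # a constant field stays constant
```

Because each column of `C` sums to one, downsampling is a convex combination: a constant latent field maps to the same constant on the coarser level, which is the conservation property the text describes.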

4. EXPERIMENTS

4.1. EXPERIMENT SETUP

Datasets We adopt three representative public datasets from GraphMeshNets (Pfaff et al., 2020): 1) CYLINDERFLOW: incompressible fluid around a cylinder, where mass conservation has to be enforced globally; 2) AIRFOIL: compressible flow around an airfoil, where an auxiliary prediction, the pressure, is included; and 3) DEFORMINGPLATE: deforming an elastic plate with an actuator, where simple contact is included. In addition, to illustrate the ease of extending our method to multi-edge-set problems, we further create a new dataset, INFLATINGFONT, featuring the inflation of enclosed elastic surfaces with massive self-contacts (Fang et al., 2021).

Baselines On all datasets, we compare the computational complexity, training/inference time, and memory footprint of BSMS-GNN to the following baselines: 1) GRAPHMESHNETS (Pfaff et al., 2020): the single-level GNN architecture of GraphMeshNets; 2) MS-GNN-GRID (Lino et al., 2021; 2022a;b): a representative work among those building the hierarchy with spatial proximity (i.e. using the distance between nodes); and 3) GRAPHUNET (Gao & Ji, 2019): a representative work among those using learnable modules for pooling. The detailed re-implementation of these works can be found in Sec. A.2.

Implementation We implement our BSMS-GNN framework with PyTorch (Paszke et al., 2019) and PyG (PyTorch Geometric) (Fey & Lenssen, 2019). We train the entire model by supervising the single-step L2 loss between the ground truth and the nodal field output of the decoding module. For more detailed information, such as the statistics of the meshes, the number of layers, the multi-edge sets, and the hyperparameters of the MLP network, please refer to Sec. A.1 and A.2. Our datasets and code are publicly available at https://anonymous.4open.science/r/BSMS-GNN-ICLR-2023/.

MISCs We also conduct an ablation study on the specific choice of our transition method in Sec. A.3, and include a scaling test on INFLATINGFONT in Sec. A.4.
Another ablation study could examine whether to use a learnable pooling module, but we already cover this aspect by comparing to GRAPHUNET in the full experiments (details in Sec. 4.2).

4.2. RESULTS AND DISCUSSIONS

We evaluate BSMS-GNN on all the described benchmarks and compare it with the baselines (Sec. 4.1). In general, our method builds multi-level graphs without loss of connectivity; it is free from spatial proximity and therefore avoids wrong edges across the boundary for complex geometries (the generated multi-level graphs of each example are plotted in Fig. 5), leading to high-quality rollouts on all tasks. Compared to all baselines, our method shows dominant advantages in significantly lower memory footprint and training time to reach the desired accuracy. Compared to GRAPHMESHNETS, BSMS-GNN takes only 51% and 31% ∼ 39% of the unit training time for Eulerian systems (CYLINDERFLOW and AIRFOIL) and Lagrangian systems with contacts (DEFORMINGPLATE and INFLATINGFONT), respectively. The main source of the speedup is the reduction of the total number of MPs: in GRAPHMESHNETS, 15 MP passes are conducted on the finest level of the mesh, while in our method, 2 × levels + 3 MPs are conducted, only 4 of which happen on the finest level. As for MS-GNN-GRID, it shares the similar advantage of performing more MPs on smaller subsets at coarser levels, but 4 × levels + 6 MPs are required in total, 8 of which happen at the finest level; it also has the overhead of learnable aggregation/returning modules. When applied to Eulerian systems, its unit training/inference time lies between our method and GRAPHMESHNETS. For Lagrangian systems with contacts, the contact edge sets bring in additional overhead and degrade the unit training time to the same level as GRAPHMESHNETS. Regarding inference time, the performances for DEFORMINGPLATE, with the smallest mesh size (∼1K), are very similar. Our method and MS-GNN-GRID also have similar performance on CYLINDERFLOW (mesh size ∼1.5K), and both outperform GRAPHMESHNETS. As the mesh size grows (5K ∼ 15K), BSMS-GNN improves the inference time gradually, up to 2.5× compared to MS-GNN-GRID and 2.9× compared to GRAPHMESHNETS.
Training time to reach desired rollout accuracy Since rollout is the ultimate purpose of predicting physical systems, we define the training cost (in time) as the earliest wall time at which the converged global rollout RMSE is obtained. The global rollout error is reduced by feeding the model noisy inputs but correct outputs at each epoch, so that it learns to correct the noise generated during inference (Pfaff et al., 2020). The essential factor is the number of epochs, i.e., the number of random noise patterns seen. In our observation, all methods reach the desired global rollout RMSE within a similar number of epochs, so our much faster unit training time translates directly into lower total training cost.

Accuracy We plot the detailed RMSEs at different rollout steps (1, 50, or until the end) for the different methods. Our method has the smallest global rollout RMSE in all cases except DEFORMINGPLATE, where the error is slightly higher than the alternatives. For INFLATINGFONT, with the most complicated contact connectivities, our method cuts about 55% of the training time while also reducing the global rollout error by 40%.
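The noisy-input training signal can be sketched as follows, with `model` a hypothetical one-step predictor and `sigma` a noise-scale hyperparameter not specified here; the point is only that the supervised target stays clean while the input is perturbed:

```python
import numpy as np

def noisy_single_step_loss(model, p, q_true, sigma, rng):
    """Single-step L2 loss with Gaussian input noise, so the model learns
    to correct the drift it will see during rollout (Pfaff et al., 2020 style)."""
    p_noisy = p + rng.normal(0.0, sigma, size=p.shape)  # perturb the input only
    q_pred = model(p_noisy)
    return float(np.mean((q_pred - q_true) ** 2))       # target stays clean

# Toy check with an identity "model": the loss is just the mean squared noise.
rng = np.random.default_rng(0)
p = np.zeros((8, 2))
loss = noisy_single_step_loss(lambda x: x, p, p, sigma=0.1, rng=rng)
```

With the identity model the loss equals the mean squared injected noise (about sigma squared), which is exactly the residual a trained model must learn to absorb.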

Memory Footprint

The memory footprint affects both the training and the inference stages. In the training stage, higher RAM consumption lowers the cap on batch size and results in more CPU-to-GPU data transfer and a larger overhead to finish an epoch. In our observation, we can achieve up to ∼3× acceleration simply by increasing the batch size. In the inference stage, RAM consumption is closely related to deployment in production. We measure the memory footprint of all methods under varying batch sizes (Table 3). Compared to GRAPHMESHNETS, BSMS-GNN consistently reduces memory consumption by approximately half in all cases. As for MS-GNN-GRID, we observe a phenomenon similar to the unit training time: their advantage only holds for the Eulerian systems. For the Lagrangian systems with additional contact edge sets, they consume similar or even more memory than GRAPHMESHNETS. Overall, our method consumes 17% ∼ 57% less memory than MS-GNN-GRID. Our method also has the smallest inference RAM, except for DEFORMINGPLATE, where ours is slightly higher (20 MB) than GRAPHMESHNETS.

Figure 4: Failure cases for MS-GNN-GRID. Left: the configuration of the simplest failure case for multi-level GNNs built by spatial proximity: steady-state 1-D heat transfer. Right, leading two columns: two tests showing that even a model trained to convergence can infer wrong results because of the erroneous edge across the boundary. Right, last two columns: the erroneous edge coincidentally does not affect the results due to the symmetry of the solution and the fact that no heat diffuses between two nodes with the same temperature.

The Failure Case for Spatial Proximity To illustrate the adversarial impact of wrong edges built by spatial proximity, we design a simple 1-D steady-state heat transfer on sticks (Fig. 4, left). The training set contains two mirrored instances, where one end of the stick is held at a fixed temperature and the other has a fixed heat flux.
Such a configuration results in a linear temperature distribution. In the test set, we simply align two sticks head-to-tail but leave some space between them so that no heat diffuses across the boundary. We choose MS-GNN-GRID as an example of the methods utilizing spatial proximity. The training of both BSMS-GNN and MS-GNN-GRID converges quickly within a few hundred iterations. However, in the test phase, the erroneous connection by proximity transfers information between the two isolated sticks and can yield wrong results (Fig. 4, right, leading two columns). We also note that although preprocessing (separately inferring for the two sticks) can resolve the issue in this simple example, it is not doable for a single but wild geometry. The simplest counterexample is the fluid dynamics in a U-shaped channel, where the two ends of the channel are close spatially but far apart geodesically.

5. CONCLUSION, LIMITATIONS, AND FUTURE WORK

The Bi-Stride Multi-Scale GNN features a simple and robust pooling strategy that systematically generates an arbitrary-depth, multi-level graph neural network given a geometry in the wild as the sole input. It does not rely on mesh generators or projections to regular grids. Bi-stride guarantees direct connections between the pooled and unpooled nodes while staying free of any redundant connections by spatial proximity. This further allows us to replace the MLPs for the transition between adjacent levels with a unified non-parameterized transition scheme. BSMS-GNN also eliminates the necessity of multiple MPs and of the latent edge embedding. Combined, these changes significantly reduce computational costs. With moderate tailoring, BSMS-GNN can be easily extended to multi-edge-set problems involving different dynamical behaviors. In summary, we believe that the non-parameterized bi-stride strategy conceptually completes the methodology path created by GraphMeshNets, just as striding and upsampling by interpolation do for CNNs. Following our non-parameterized strategy, there are interesting ideas to explore. First, although any general multi-level GNN can reduce the time complexity to linear, it still needs to load the whole graph initially; combining multi-level GNNs with batched training is crucial for huge-scale graphs. Second, as stated in Li et al. (2020b), the transition from fine to coarse levels is equivalent to the transition from sparse, high-rank kernels to dense, low-rank kernels. Although dense or fully connected graphs only appear near the bottom layers with a minimal number of nodes in practice, there is no theoretical guarantee; whether strategies like edge pruning are needed to avoid dense graphs at coarser levels becomes an interesting question. In addition, since all the nodal features will be smoothed without the skip-layers, how to migrate our strategy to GAE+Transformer (Han et al., 2022) is also a meaningful direction.
A APPENDIX

Below we list the model configurations: 1) the offset inputs prepended before the material-space and world-space edge processors $e^M_{ij}$ and $e^W_{ij}$, and 2) the nodal inputs $p_i$, as well as the nodal outputs $q_i$ from the decoder for each experiment case, where $X$ and $x$ stand for the material-space and world-space positions, $v$ is the velocity, $\rho$ is the density, $P$ is the absolute pressure, and the dot $\dot a = a^{t+1} - a^t$ stands for the temporal change of a variable $a$. All the variables involved are normalized to zero mean and unit variance via pre-processing. As for time integration, CYLINDERFLOW, AIRFOIL, and DEFORMINGPLATE inherit the first-order integration from GRAPHMESHNETS. For INFLATINGFONT, the first-order quasi-static integration (Fang et al., 2021) is used in the solver; hence, we also adopt first-order integration for INFLATINGFONT. While exploring the non-parametric transition solutions, we started with no transition because our method is adapted directly from GUN (Gao & Ji, 2019). The no-transition strategy produces a low enough 1-step RMSE and visually correct rollouts for INFLATINGFONT. However, in the global

A.6 ALGORITHMS FOR THE SEEDING HEURISTICS

Here we elaborate on our two seeding heuristics for the bi-stride pooling at every level: picking the seed that 1) is closest to the center of a cluster (CloseCenter), or 2) has the minimum average geodesic distance to the other nodes (MinAve). The complexity of MinAve is O(N²), as we need to conduct a BFS from every node to find the one with the minimum average distance. In our experiments, the quadratic cost of MinAve is tolerable for all cases but INFLATINGFONT. For both heuristics, we conduct the search in a per-cluster fashion so that information from other clusters does not pollute the search result; for example, when determining the center of an isolated part of the input geometry, the positions of nodes from other clusters would otherwise pollute the process. The determination of clusters given a graph is elaborated below.
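A direct reading of MinAve, one BFS per node (hence the O(N²) cost), might look like the sketch below; the graph representation and function names are ours:

```python
from collections import deque

def bfs_distances(adj, src):
    """Geodesic (hop) distances from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def min_ave_seed(adj):
    """MinAve heuristic: run one BFS per node (O(N^2) overall) and pick
    the node with the smallest average geodesic distance to the rest."""
    def avg_dist(src):
        d = bfs_distances(adj, src)
        return sum(d.values()) / len(d)
    return min(adj, key=avg_dist)

# Path 0-1-2-3-4: the geodesic center (node 2) wins.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

On the 5-node path, `min_ave_seed(path)` picks node 2 (average distance 1.2 versus 2.0 from an endpoint), matching the intuition that a central seed produces balanced BFS frontiers.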



Figure 1: Issues of existing multi-level GNNs. (a) A learnable pooling (Gao & Ji, 2019) may lead to loss of connectivity even after 1st-order enhancement. (b) A pooling by rasterization (Lino et al., 2021; 2022a;b) and (c) by spatial proximity (Liu et al., 2021; Fortunato et al., 2022) can lead to wrong connections across the boundaries at the coarser level.

generated coarse meshes. Lino et al. (2021; 2022a;b) use the original mesh at the first level and project it to regular grids (MS-GNN-GRID) at the coarser levels. Li et al. (2020b) adopt multi-level matrix factorization to generate the kernels at arbitrary levels without requiring mesh generators or K-nearest-neighbor (K-NN) interpolations. Concerning building connections and hierarchies on point clouds with radius samplers, representative works include GNS (Sanchez-Gonzalez et al., 2020), PointNet (Qi et al., 2017a), PointNet++ (Qi et al., 2017b), and GeodesicConv (Masci et al., 2015). Still, these methods by construction better suit cases without meshes, such as particle-based fluid simulations.

1) Lino et al. (2021; 2022a;b) build the coarser graph by projecting the finer nodes to nearby background grids (Fig. 1(b)); 2) Liu et al. (2021) and Fortunato et al. (2022) create coarser meshes for the same domain (Fig. 1(c)).

Figure 2: The BSMS-GNN pipeline uses encode-process-decode trained with one-step supervision. G_1, G_2, ..., G_d represent the graphs at different levels (finest to coarsest). The encoder/decoder only connects the input/output fields with the latent fields at G_1. The latent nodal fields are updated by one MP (message passing) at each level. The bi-stride pooling selects the pooled nodes for the adjacent coarser level, and the transition is conducted in a non-parameterized way.

Bi-stride of a mesh

Figure 3: Schematic plot of the transition steps between adjacent levels.

Figure 5: Example plots of the multi-level graphs produced by our bi-stride pooling. Our dataset contains both Eulerian and Lagrangian systems. Many meshes are highly irregular and contain massive self-contact, which poses a strong challenge for building the coarser-level connections by spatial proximity. The bi-stride strategy relies only on topological information and has proven robust and reliable on arbitrary kinds of geometry.

Figure 6: (a) All three transition methods reach the target training RMSE within 200 iterations. (b) However, our weighted graph aggregation + returning has the strongest resistance to noise during the rollout. (c) The visual comparisons show that no transition produces mosaic-like patterns, while the graph-convolution transition smears out the information, which ceases to propagate downstream. (d) The global rollout error distribution of no transition (left) shows that the edges of the mosaic patterns resemble the simulation mesh; the error of our transition (right) travels with the generated vortices downstream and leaves the domain after step 200, which explains the RMSE drop in (b).

Algorithm 1: MinAve: seeding by minimum average geodesic distance to neighbors
Input: Unweighted, bi-directional graph G = (N, E)
Output: List of seeds, one per cluster, L_s
1   L_c ← DetermineCluster(G)
2   L_s ← ∅
    /* BFS(s) returns the list of geodesic distances from s to all other nodes; */
    /* if a node is unreachable, its distance is set to infinity */
3   D ← {BFS(s) for s in N}
4   for idx in L_c do
5       D_c ← D[idx, idx]
6       D̄_c ← average(D_c, dim = 1)
7       s ← idx[argmin(D̄_c)]
8       L_s.append(s)
9   return L_s

For INFLATINGFONT, the largest mesh has around 47K nodes, and the pre-processing time with MinAve becomes intolerable. We therefore switch to CloseCenter, which has linear complexity.

Algorithm 2: CloseCenter: seeding by minimum distance to the center of the cluster
Input: Unweighted, bi-directional graph G = (N, E); positions of the nodes X
Output: List of seeds, one per cluster, L_s
1   L_c ← DetermineCluster(G)
2   L_s ← ∅
3   for idx in L_c do
4       X̄ ← average(X[idx], dim = 0)
5       s ← idx[argmin(‖X[idx] − X̄‖)]   /* node closest to the cluster center */
6       L_s.append(s)
7   return L_s

Algorithm 3: DetermineCluster
Input: Unweighted, bi-directional graph G = (N, E)
Output: List of clusters L_c
/* R stands for the remaining nodes that are not yet inside any cluster */
DetermineCluster partitions the graph into connected components: starting from R = N, it repeatedly picks a node from R, collects its connected component via BFS, appends the component to L_c, and removes those nodes from R until R is empty.
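The listing above is truncated in this copy; below is a plausible reconstruction as a BFS flood fill over connected components (our own sketch under that assumption, not the authors' code):

```python
def determine_cluster(adj):
    """Partition the node set into connected components ("clusters").
    `adj` maps each node to its list of neighbors; returns a list of
    sorted node lists. R tracks the remaining nodes not yet inside
    any cluster, as in the pseudocode comment above."""
    R = set(adj)                       # remaining nodes
    clusters = []
    while R:
        seed = next(iter(R))
        component, frontier = {seed}, [seed]
        while frontier:                # BFS flood fill from the seed
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in component:
                        component.add(v)
                        nxt.append(v)
            frontier = nxt
        clusters.append(sorted(component))
        R -= component
    return sorted(clusters)
```

For a graph with two disjoint edges {0-1, 2-3} and an isolated node 4, this returns three clusters, matching the per-cluster seeding described above.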

Detailed measurements of our method, MS-GNN-GRID, GRAPHMESHNETS, and GRAPHUNET. All measurements are conducted on a single Nvidia RTX 3090. BSMS-GNN consistently generates stable and competitive global rollouts with the smallest training cost. BSMS-GNN is also lightweight and has the fastest inference time. It is also free from the large RMSE caused by poor pooling on unseen geometries, from which the learnable pooling module of GRAPHUNET suffers.

Memory footprint As shown in Table A.1, under multi-batch training BSMS-GNN consistently cuts RAM consumption by approximately half in all cases, and it also has the smallest inference RAM (except for DEFORMINGPLATE).

Unit training/inference time We evaluate the time complexity with the unit training time per step.

The Plate case includes hyperelastic plates squeezed by moving obstacles. In addition to these three cases, our Font (INFLATINGFONT) case involves the quasi-static inflation of enclosed elastic surfaces (3D surface meshes), possibly with self-contact. We create the INFLATINGFONT cases using the open-source simulator of Fang et al. (2021), with the same material properties and inflation speed. The input geometries for INFLATINGFONT are 1,400 2×2-character matrices of Chinese characters. All datasets are split into 1000 training, 200 validation, and 200 testing instances. In the following table, the second entries with superscript * in the average-edge-number column count the contact edges:

Case     | Type     | Offset inputs e^M_ij | Offset inputs e^W_ij | Inputs p_i | Outputs q_i
Cylinder | Eulerian |                      |                      |            |

A.2.1 BASIC MODULES AND ARCHITECTURES

The MLPs for the nodal encoder, the processor, and the nodal decoder are ReLU-activated two-hidden-layer MLPs with hidden-layer and output sizes of 128, except for the nodal decoder, whose output size matches the prediction q. All MLPs have a residual connection. A LayerNorm normalizes all MLP outputs except those of the nodal decoder.

A.2.2 BASELINE: GRAPHMESHNETS

Our GRAPHMESHNETS implementation uses the same MLPs as above but with one additional module: the edge encoder. Also, the edge latents are updated and carried over through the end of the multiple MPs. We use 15 MPs for all cases to keep consistency with GRAPHMESHNETS.
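As a minimal sketch of the MLP block described in A.2.1 (two ReLU hidden layers of width 128, a residual connection, then LayerNorm), the following uses numpy with random placeholder weights; the function and variable names are ours, and the decoder variant would simply skip the LayerNorm:

```python
import numpy as np

HIDDEN = 128  # hidden-layer and output width used throughout

def mlp_block(x, Ws, bs, layer_norm=True):
    """Two-hidden-layer ReLU MLP with a residual connection and LayerNorm.
    All widths are 128 here so the residual addition is well defined."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = np.maximum(h @ W + b, 0.0)           # hidden layers: ReLU
    h = h @ Ws[-1] + bs[-1]                      # linear output layer
    h = h + x                                    # residual connection
    if layer_norm:                               # normalize each node's features
        h = (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + 1e-8)
    return h

rng = np.random.default_rng(0)
Ws = [rng.normal(0, 0.05, (HIDDEN, HIDDEN)) for _ in range(3)]
bs = [np.zeros(HIDDEN) for _ in range(3)]
x = rng.normal(size=(10, HIDDEN))                # e.g. 10 nodes, 128 features
y = mlp_block(x, Ws, bs)
```

After LayerNorm, each node's feature vector has (approximately) zero mean and unit standard deviation, which is what the per-output normalization in A.2.1 provides.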

A.2.3 BASELINE: MS-GNN-GRID

Our re-implementation of MS-GNN-GRID uses the same MLPs as above but with four additional modules: the edge encoder at the finest level, the aggregation modules for nodes and edges at every level for the transitions, and the returning modules for nodes at every level. This method also requires assigning the regular grid nodes at each level. We assign these grid nodes by defining an initial grid resolution and an inflation rate between levels. For the number of MPs at each level, we follow Lino et al. (2022a) and use four at the top and bottom levels and two for the others.

A.2.4 BASELINE: GRAPHUNET

Our re-implementation of GRAPHUNET uses the same number of levels as BSMS-GNN. Likewise, we make the following modifications to the original GRAPHUNET: (1) We change the information passing from GCN to our message-passing module for consistency and translational invariance.


(2) GRAPHUNET was intended for tiny graphs (∼100 nodes) and used dense matrix multiplications. This design is not scalable: it can exceed the memory limit and slow training to more than 30 days per epoch at our graph sizes (1500 to 15000 nodes). We thus optimize operations such as matrix multiplication and aggregation with sparse implementations.
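To illustrate the difference, here is a small sketch (our own, not the paper's code) showing that an edge-list scatter-add reproduces the dense adjacency matmul while its cost scales with the number of edges rather than N²:

```python
import numpy as np

def aggregate_dense(A, x):
    """Neighbor aggregation as a dense adjacency matmul: O(N^2) memory/time."""
    return A @ x

def aggregate_sparse(edges, x, n):
    """The same aggregation as a scatter-add over an edge list:
    memory and time scale with the number of edges instead."""
    out = np.zeros((n, x.shape[1]))
    for i, j in edges:        # edge i <- j contributes x[j] to node i
        out[i] += x[j]
    return out

edges = [(0, 1), (1, 0), (1, 2), (2, 1)]     # path 0-1-2, both directions
n = 3
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = 1.0
x = np.arange(6, dtype=float).reshape(n, 2)  # toy nodal features
assert np.allclose(aggregate_dense(A, x), aggregate_sparse(edges, x, n))
```

In practice the scatter-add would be a vectorized sparse primitive (e.g. a sparse-dense matmul or an index-add), but the equivalence is the point here.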

A.2.5 NOISE AND BATCH NUMBER

For all three methods, we augment the datasets by injecting noise so that the model can resist the noise produced by single-step predictions. Each method's batch number has been tuned to achieve a good convergence rate on smaller subsets.

In the global rollouts of the CYLINDERFLOW and AIRFOIL cases, we observed stripe patterns (Figure 6(c), column None) whose stripes align with the edges at the coarser levels (Figure 6(d)). We suspect that this error results from the fact that the unpooled nodes all carry zero information before MP, making them indistinguishable to the processor modules and exaggerating the difference between pooled and unpooled nodes over rollouts. The no-transition strategy resembles skipping interpolation during the super-resolution phase of a CNN U-net. Naturally, we then tried a single step of graph convolution (without activation) to resemble the interpolation on regular grids. However, this turns out to over-smooth the features (Figure 6(c), column Graph Conv), and the information propagation is smeared out except for the area near the generator (in this case, near the cylinder). We believe the over-smoothing issue arises from ignoring the irregularity of the mesh. Unlike CNNs, where the fine nodes regularly lie at the centers of coarser grids, irregular meshes have varying topology and element sizes. The element sizes are almost always smaller near the interface for higher precision in simulations; hence an unweighted graph convolution can smear the finer information near the cylinder and its adjacent neighbors during returning. The natural choice to account for the irregularity is to include reasonable nodal weights (such as the size). In the end, we arrive at the solution proposed in Sec. 3.3: utilizing the nodal weights during aggregation and recording the shares of contribution for later returning. Our transition method works consistently for all experiment cases and produces the lowest RMSE for global rollouts (Figure 6(b)).

Comparing to alternative transition methods Additionally, we compare our transition method to two alternatives extracted from previous works: (1) calculating the edge weights (kernel) for the information flow using the inverse of the edge length (node position offset), which we refer to as Pos-Kernel (Liu et al., 2021); and (2) level-wise learnable transition modules implemented by additional MPs, which we refer to as Learnable (Fortunato et al., 2022). In addition to the high RMSE of None and Graph Conv shown in Figure 6, we observe that: (1) the training/inference time and RAM consumption of all non-parametric transitions (including None) are similar, which supports the statement that our transition method is lightweight.
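The weighted aggregation + returning idea can be sketched as follows. This is a hypothetical realization of the description above: nodal weights (e.g. local element sizes) drive the aggregation, and each fine node's share of contribution is recorded for the returning step. The function names, the mean-based aggregation, and the share-scaled returning rule are our assumptions, not the paper's exact scheme:

```python
import numpy as np

def transition_down(x, w, assign, n_coarse):
    """Non-parametric pooling: weighted aggregation of fine features x (N, F)
    into coarse nodes, recording each fine node's share of contribution.
    `assign[i]` is the coarse node that fine node i is aggregated into."""
    n, F = x.shape
    coarse = np.zeros((n_coarse, F))
    total = np.zeros(n_coarse)
    for i, c in enumerate(assign):
        coarse[c] += w[i] * x[i]
        total[c] += w[i]
    coarse /= total[:, None]            # weighted mean per coarse node
    share = w / total[assign]           # recorded contribution shares
    return coarse, share

def transition_up(coarse, assign, share):
    """Returning: each fine node receives the coarse feature, reweighted by
    its recorded share so that nodes that contributed more get more back
    (a hypothetical returning rule consistent with the text above)."""
    k = np.bincount(assign).astype(float)[assign]  # fine nodes per coarse node
    return coarse[assign] * (share * k)[:, None]
```

With uniform weights the shares cancel and the round trip returns each cluster's mean, matching the intuition that the transition reduces to plain averaging on a regular mesh.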


(2) The Learnable transition can reach slightly higher accuracy, but at the price of ∼70% more training/inference time and RAM. As mentioned in Sec. 4.2, higher training RAM can limit the batch number and increase the frequency of data communication between CPU and GPU, slowing down training even further as the scale grows. (3) Pos-Kernel results in only a slightly higher RMSE than our method, making it a competitive alternative choice in production.

A.4 SCALING ANALYSIS

We train and evaluate the three methods on INFLATINGFONT with varying resolutions (5K, 15K, 30K, and 45K nodes) for the scaling analysis.

Adjustments for datasets and models We generate downscaled and upscaled versions of INFLATINGFONT with different average node numbers for the initial geometry, and then use the same settings to simulate the sequences. As reported in Fortunato et al. (2022), the low-resolution model struggles to converge to a very small RMSE; hence we loosen the termination criterion by enlarging the target RMSE relative to the average edge length to prevent convergence failures. Similarly, the noise injection is also adjusted to be relative to the average edge length. Moreover, with a smaller number of nodes, the number of levels required to achieve the same bottom resolution also decreases. We make the corresponding adjustments to the number of levels of our model d_1 and that of MS-GNN-GRID d_2. The adjustments are plotted below.

Results

The results in Figure 7 show that both BSMS-GNN and MS-GNN-GRID scale up well, preserving a near-linear scale-up rate, in contrast to GRAPHMESHNETS. Still, our method is lighter-weight and more efficient than MS-GNN-GRID because of the non-parametric transitions and fewer level-wise MPs.

Our bi-stride pooling conserves all the contact edges under the enhancement in Eq. 2. We assume the graph is undirected and unweighted, so that the adjacency matrix is boolean, and that A_l contains self-loops (A_l[i, i] = 1). Formally, given any contact edge (i, j) at level l (i.e., A^C_l[i, j] = 1) and a bi-stride pooling that pools the node set I, there exists a contact edge (i', j') with i', j' ∈ I that remains at the coarser level, i.e., A^C_{l+1}[i', j'] = 1. There are only four scenarios concerning the pooled nodes I and the contact edge nodes i, j, and the assertion holds in all of them:

1. Both i, j are pooled, i.e., i, j ∈ I. Let i' = i and j' = j; then (A_l A^C_l A_l)[i', j'] ≥ A_l[i, i] * A^C_l[i, j] * A_l[j, j] = 1.

2. Only i is pooled, i ∈ I, j ∉ I. Since we use bi-stride pooling, j either is the seed (bi-stride can select either the even or the odd frontiers) and directly connects to pooled nodes at depth 1, or has at least one direct connection to the adjacent pooled frontier. That is, at least one neighbor of j is pooled; let it be j': A_l[j, j'] = 1, j' ∈ I. Let i' = i; then (A_l A^C_l A_l)[i', j'] ≥ A_l[i, i] * A^C_l[i, j] * A_l[j, j'] = 1, hence A^C_{l+1}[i', j'] = 1.

3. Only j is pooled, i ∉ I, j ∈ I. Similarly, there is at least one pooled neighbor i' of i with A_l[i', i] = 1, i' ∈ I. Then (A_l A^C_l)[i', j] ≥ A_l[i', i] * A^C_l[i, j] = 1, and (A_l A^C_l A_l)[i', j] ≥ (A_l A^C_l)[i', j] * A_l[j, j] = 1. Let j' = j; then A^C_{l+1}[i', j'] = 1.

4. Neither i nor j is pooled, i, j ∉ I. Then we select one pooled direct neighbor for each of i and j, such that A_l[i', i] = A_l[j, j'] = 1 with i', j' ∈ I, and (A_l A^C_l A_l)[i', j'] ≥ A_l[i', i] * A^C_l[i, j] * A_l[j, j'] = 1.
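The four-case argument can be checked numerically on a toy graph. Here we assume, following the inequalities above, that the coarser contact adjacency is A_l A^C_l A_l restricted to the pooled nodes; this is our own illustration, not the paper's code:

```python
import numpy as np

# A: mesh adjacency at level l with self-loops (A[i, i] = 1);
# C: contact adjacency at level l; I: pooled node indices.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=int)         # path 0-1-2-3 plus self-loops
C = np.zeros((4, 4), dtype=int)
C[0, 3] = C[3, 0] = 1                           # one contact edge (0, 3)
I = [0, 2]                                      # bi-stride pools even BFS depths from seed 0

# Enhanced coarse contact adjacency: restrict A C A to the pooled nodes.
enhanced = (A @ C @ A)[np.ix_(I, I)] > 0
```

Here neither endpoint of the contact edge pair (0, 3) survives intact (0 is pooled, 3 is not), yet the enhancement carries the contact over to the pooled pair (0, 2), as case 2 of the argument predicts.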

