METAP: HOW TO TRANSFER YOUR KNOWLEDGE ON LEARNING HIDDEN PHYSICS

Abstract

Gradient-based meta-learning methods have primarily focused on classical machine learning tasks such as image classification and function regression, where they were found to perform well by recovering the underlying common representation among a set of given tasks. Recently, PDE-solving deep learning methods, such as neural operators, have started to make an important impact on learning and predicting the response of a complex physical system directly from observational data. Since data acquisition in this context is commonly challenging and costly, the need to utilize and transfer existing knowledge to new and unseen physical systems is even more acute. Herein, we propose a novel meta-learnt approach for transfer-learning knowledge between neural operators, which can be seen as transferring the knowledge of solution operators between governing (unknown) PDEs with varying parameter fields. With the key theoretical observation that the underlying parameter field can be captured in the first layer of the neural operator model, in contrast to typical final-layer transfer in existing meta-learning methods, our approach is a provably universal solution operator for multiple PDE solving tasks. As applications, we demonstrate the efficacy of our proposed approach on PDE-based datasets and a real-world material modeling problem, showing that our method can handle complex and nonlinear physical response learning tasks while greatly improving the sampling efficiency in new and unseen tasks.

1. INTRODUCTION

Few-shot learning is an important problem in machine learning, where new tasks are learned with a very limited number of labelled datapoints (Wang et al., 2020). In recent years, significant progress has been made on few-shot learning using meta-learning approaches (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Finn et al., 2017; Santoro et al., 2016; Antoniou et al., 2018; Ravi & Larochelle, 2016; Nichol & Schulman, 2018; Raghu et al., 2019; Tripuraneni et al., 2021; Collins et al., 2022). Broadly speaking, given a family of tasks, some of which are used for training and others for testing, meta-learning approaches aim to learn a shared multi-task representation that generalizes across the different training tasks and adapts quickly to new and unseen testing tasks. Although most meta-learning developments focus on conventional machine learning problems such as image classification, function regression, and reinforcement learning, studies on few-shot learning approaches for complex physical system modeling problems have been limited. The need for a few-shot learning approach in complex physical system modeling is just as acute, while the understanding of how multi-task learning should be applied in this scenario is still nascent. As a motivating example, we consider the scenario of new material discovery in the lab environment, where the material model is built based on experimental measurements of its responses subject to different loadings. Since the physical properties (such as the mechanical and structural parameters) in different material specimens vary, the model learnt from experimental measurements on one specimen would have a large generalization error on future specimens. That is, the data-driven model has to be trained repeatedly with a large number of material specimens, which makes the learning process inefficient.
Further, experimental measurement acquisition of these specimens is often challenging and expensive. In some problems, a large number of measurements is not even feasible. For example, in the design and testing of biosynthetic tissues, repeated loading can induce cross-linking and permanent set, which notoriously alter the tissue durability (Zhang & Sacks, 2017). As a result, it is critical to learn the physical response model of a new specimen with sample sizes as small as possible. Furthermore, since many characterization methods for the underlying mechanistic and structural material properties are destructive (Misfeld & Sievers, 2007; Rieppo et al., 2008), in practice many physical properties are not measured and can only be treated as hidden and unknown variables. We likely only have limited access to measurements of the complex system responses caused by changes in these physical properties. Supervised operator learning methods are typically used to address this class of problems. They take a number of observations of the loading field as input and try to predict the corresponding physical system response field as output, corresponding to one underlying PDE (as one task). Herein, we consider the meta-learning of multiple complex physical systems (as tasks), such that all these tasks are governed by a common PDE with different (hidden) physical property or parameter fields. Formally, assume that we have a distribution $p(\mathcal{T})$ over tasks, where each task $\mathcal{T}^\eta \sim p(\mathcal{T})$ corresponds to a hidden physical property field $b^\eta(x) \in \mathcal{B}(\mathbb{R}^{d_b})$ that contains the task-specific mechanistic and structural information in our material modeling example. On task $\mathcal{T}^\eta$, we have a number of observations of the loading field $g_i^\eta(x) \in \mathcal{A}(\mathbb{R}^{d_g})$ and the corresponding physical system response field $u_i^\eta(x) \in \mathcal{U}(\mathbb{R}^{d_u})$ according to the hidden parameter field $b^\eta(x)$.
Here, $i$ is the sample index, and $\mathcal{B}$, $\mathcal{A}$ and $\mathcal{U}$ are Banach spaces of functions taking values in $\mathbb{R}^{d_b}$, $\mathbb{R}^{d_g}$ and $\mathbb{R}^{d_u}$, respectively. For task $\mathcal{T}^\eta$, our modeling goal is to learn the solution operator $\mathcal{G}^\eta : \mathcal{A} \to \mathcal{U}$, such that the learnt model can predict the corresponding physical response field $u(x)$ for any loading field $g(x)$. Without transfer learning, one needs to learn a surrogate solution operator for each task based only on the data pairs of this task, and repeat the training for every task. The learning procedure would require a relatively large number of observation pairs and a long training time for each task. Therefore, this physics-based modeling scenario raises a key question: Given knowledge on a number of parametric PDE solving tasks with different unknown parameters, how can one efficiently learn the best surrogate solution operator for a new and unknown parameter, with only a small set of training data pairs? To address this question, we introduce MetaP, a novel meta-learnt approach for transfer-learning knowledge between neural operators, which can be seen as transferring the knowledge of solution operators between governing (unknown) PDEs with varying hidden parameter fields. Our main contributions are:
• MetaP is the first neural-operator-based approach for multi-task learning, which not only preserves the generalizability to different resolutions and input functions from its integral neural operator architecture, but also improves sampling efficiency on new tasks: for comparable accuracy, MetaP reduces the required number of measurements by ∼90%.
• With rigorous operator approximation analysis, we make the key observation that the hidden parameter field can be captured by adapting the first layer of the neural operator model, in contrast to typical final-layer transfer in existing meta-learning methods. By construction, MetaP serves as a provably universal solution operator for multiple PDE solving tasks.
• On synthetic, benchmark, and real-world biological tissue datasets, the proposed method consistently outperforms existing baseline gradient-based meta-learning methods.

2. BACKGROUND AND RELATED WORK

In this section we introduce the relevant background on hidden physics learning, neural operators, and gradient-based meta-learning methods, which will later complement the definition of our method.

2.1. HIDDEN PHYSICS LEARNING AND NEURAL OPERATORS

For many decades, physics-based PDEs have been commonly employed for predicting and monitoring complex system responses, with traditional numerical methods used to solve these PDEs and provide predictions for the desired system responses. However, three fundamental challenges remain. First, the choice of governing PDE laws is often determined a priori, and free parameters are often tuned to obtain agreement with experimental data. This makes rigorous calibration and validation challenging. Second, traditional numerical solvers are set up for specific boundary and initial conditions, as well as loading or source terms. Therefore, they do not generalize to other operating conditions and hence are not effective for real-time prediction. Third, complex PDE systems such as turbulent flows and heterogeneous materials modeling problems usually require a very fine discretization, and are therefore very time-consuming for traditional solvers. To provide an efficient surrogate model for physical responses, machine learning methods may hold the key. Recently, there has been significant progress in the development of deep neural networks (NNs), focusing on learning the hidden physics of a complex system (Ghaboussi et al., 1998; 1991; Carleo et al., 2019; Karniadakis et al., 2021; Zhang et al., 2018; Cai et al., 2022; Pfau et al., 2020; He et al., 2021; Besnard et al., 2006). Among these methods, neural operators show particular promise in resolving the above challenges. Neural operators aim to learn maps between inputs of a dynamical system and its state, so that the network can serve as a surrogate for a solution operator (Li et al., 2020a; b; c; You et al., 2022a; Ong et al., 2022; Gupta et al., 2021; Lu et al., 2019; 2021b; Goswami et al., 2022a). Compared with classical NNs, the most notable advantages of neural operators are resolution independence and generalizability to different input instances.
Moreover, compared with classical PDE modeling approaches, neural operators require only data, with no knowledge of the underlying PDE. All these advantages make neural operators promising tools for PDE learning tasks. Examples include modeling the unknown physics law of real-world problems (Yin et al., 2022a; Goswami et al., 2022a; Yin et al., 2022b), and providing efficient solution operators for PDEs (Li et al., 2020a; b; c; Lu et al., 2021c; a). On the other hand, data in scientific applications are often scarce and incomplete. Utilizing other relevant data sources could alleviate this problem, yet no existing work has addressed the transferability of neural operators. Through meta-learning techniques, our work addresses such a transfer setting, with the same type of PDE system but different (hidden) physical properties.

2.2. GRADIENT-BASED META-LEARNING METHODS

One highly successful meta-learning algorithm has been Model Agnostic Meta-Learning (MAML) (Finn et al., 2017), which led to the development of a series of related gradient-based meta-learning (GBML) methods (Raghu et al., 2019; Nichol & Schulman, 2018; Antoniou et al., 2018; Hospedales et al., 2020). The Almost-No-Inner-Loop algorithm (ANIL) (Raghu et al., 2019) modifies MAML by freezing the representation and adapting only the final layer during local adaptation. Recently, theoretical analysis (Collins et al., 2022) found that the driving force causing MAML and ANIL to recover the general representation is the adaptation of the final layer of their models, which harnesses the underlying task diversity to improve the representation in all directions of interest. Although MAML and the general meta-learning approaches have achieved impressive performance in some machine-learning applications such as image classification and reinforcement learning, few works have studied hidden physics learning under a meta-learning (Mai et al., 2021; Zhang et al., 2022; Yin et al., 2021; Wang et al., 2021) or even a transfer setting (Kailkhura et al., 2019; Goswami et al., 2022b). Among these meta-learning works, (Mai et al., 2021; Zhang et al., 2022) are designed for specific physical applications, while (Yin et al., 2021; Wang et al., 2021) focus on dynamics forecasting by learning the temporal evolution information directly (Yin et al., 2021) or learning time-invariant features (Wang et al., 2021). Hence, none of these works has provided a generic approach or a theoretical understanding of how to transfer multi-task knowledge between a series of complex physical systems, such that all these tasks are governed by a common parametric PDE with different physical parameters.

3. META-LEARNT NEURAL OPERATOR

3.1. INTEGRAL NEURAL OPERATORS

Here, we first state the base model of this work without the meta aspect. The integral neural operators, first proposed in (Li et al., 2020a) and further developed in (Li et al., 2020b; c; You et al., 2022a; c), comprise three building blocks. First, the input function, $g(x) \in \mathcal{A}$, is lifted to a higher dimensional representation via $h(x, 0) = \mathcal{P}[g](x) := P(x)[x, g(x)]^T + p(x)$. Here, $P(x) \in \mathbb{R}^{(s+d_g) \times d_h}$ and $p(x) \in \mathbb{R}^{d_h}$ define an affine pointwise mapping, and are often taken as constant parameters, i.e., $P(x) \equiv P$ and $p(x) \equiv p$. Then, the feature vector function $h(x, 0)$ goes through an iterative layer block where the layer update is defined via the action of the sum of a local linear operator, a nonlocal integral kernel operator, and a bias function: $h(\cdot, (l+1)\Delta t) = \mathcal{J}_{l+1}[h(\cdot, l\Delta t)]$, for $l = 0, \cdots, L-1$. Here, $h(\cdot, j\Delta t)$, $j = 0, \cdots, L := T/\Delta t$, is a sequence of functions representing the values of the network at each hidden layer, taking values in $\mathbb{R}^{d_h}$. $\mathcal{J}_1, \cdots, \mathcal{J}_L$ are the nonlinear operator layers, defined by the particular choice of networks. In this work, we employ the implicit Fourier neural operator (IFNO) as the base model, because of its theoretical universal approximation property in PDE solving tasks (You et al., 2022c) and robustness in complex physical response modeling tasks (You et al., 2022b). In this case, the iterative layers are taken as $\mathcal{J}_1 = \cdots = \mathcal{J}_L = \mathcal{J}$, where $h(x, (l+1)\Delta t) = \mathcal{J}[h(x, l\Delta t)] := h(x, l\Delta t) + \Delta t\,\sigma\left(W h(x, l\Delta t) + \mathcal{F}^{-1}[\mathcal{F}[\kappa(\cdot; v)] \cdot \mathcal{F}[h(\cdot, l\Delta t)]](x) + c(x)\right)$. Here, $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the Fourier transform and its inverse, respectively. In practice, $\mathcal{F}$ and $\mathcal{F}^{-1}$ are computed by applying the FFT and its inverse to each component of $h$ separately. Also, $c \in \mathbb{R}^{d_h}$ defines a constant bias, $W \in \mathbb{R}^{d_h \times d_h}$ is the weight matrix, and $\mathcal{F}[\kappa(\cdot; v)] := R$ is a circulant matrix that depends on the convolution kernel $\kappa$. $\sigma$ is an activation function, which is often taken to be the popular rectified linear unit (ReLU) function.
Finally, the output $u(\cdot) \in \mathcal{U}$ is obtained through a projection layer. In particular, we project the last hidden layer representation $h(\cdot, T)$ onto $\mathcal{U}$ as: $u(x) = \mathcal{Q}[h(\cdot, T)](x) := Q_2(x)\sigma(Q_1(x) h(x, T) + q_1(x)) + q_2(x)$. Here, $Q_1(x) \in \mathbb{R}^{d_Q \times d_h}$, $Q_2(x) \in \mathbb{R}^{d_u \times d_Q}$, $q_1(x) \in \mathbb{R}^{d_Q}$ and $q_2(x) \in \mathbb{R}^{d_u}$ are the appropriately sized matrices and vectors that are part of the parameter set that we aim to learn. As for the lifting layer, $Q_1(x)$, $Q_2(x)$, $q_1(x)$ and $q_2(x)$ are also often taken as constant parameters, which will be denoted as $Q_1$, $Q_2$, $q_1$ and $q_2$, respectively. In the following, we denote the set of trainable parameters in the lifting layer as $\theta_P$, the set from the iterative layer block as $\theta_I$, and the set in the projection layer as $\theta_Q$.

The neural operator can be employed to learn an approximation for the solution operator, $\mathcal{G}$. Let $\mathcal{D} := \{(g_i, u_i)\}_{i=1}^{N}$ be a labelled (context) set of observations, where the inputs $\{g_i\} \subset \mathcal{A}$ are independent and identically distributed (i.i.d.) random fields from a known probability distribution $\mu$ on $\mathcal{A}$, and $u_i(x) \in \mathcal{U}$, possibly noisy, is the observed corresponding solution. Let $\Omega \subset \mathbb{R}^s$ be the domain of interest. We assume that all observations can be modeled with a parametric PDE form $\mathcal{K}_{b(x)}[u_i](x) = g_i(x)$, $x \in \Omega$. (2) Here, $\mathcal{K}_b$ is the operator representing the possibly unknown governing law, e.g., balance laws. Then, the system response can be learnt by constructing a surrogate solution operator of equation 2: $\mathcal{G}[g; \theta](x) := \mathcal{Q}_{\theta_Q} \circ (\mathcal{J}_{\theta_I})^L \circ \mathcal{P}_{\theta_P}[g](x) \approx u(x)$, where the parameter set $\theta = [\theta_P, \theta_I, \theta_Q]$ is obtained by solving the optimization problem: $\min_{\theta \in \Theta} \mathcal{L}_{\mathcal{D}}(\theta) = \min_{\theta \in \Theta} \mathbb{E}_{g \sim \mu}[C(\mathcal{G}[g; \theta], \mathcal{G}[g])] \approx \min_{\theta \in \Theta} \sum_{i=1}^{N} [C(\mathcal{G}[g_i; \theta], u_i)]$. Here $C$ denotes a properly defined cost functional, which is often taken as the mean square error.
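
The three building blocks above (lifting, shared iterative Fourier layer, projection) can be sketched as a forward pass. This is a minimal NumPy illustration on a 1D periodic grid, not the authors' implementation: the function names `init_params` and `ifno_forward`, the row-vector convention (`h @ W` rather than $Wh$), the choice $T = 1$, and the truncation of the learnt Fourier multiplier `R` to a fixed number of low-frequency modes are all assumptions made for the sketch.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_params(rng, s, d_g, d_h, d_q, d_u, modes):
    """Random theta = [theta_P, theta_I, theta_Q]; scales are illustrative only."""
    scale = 0.1
    return {
        # lifting layer (theta_P)
        "P": scale * rng.standard_normal((s + d_g, d_h)),
        "p": np.zeros(d_h),
        # shared iterative layer (theta_I); R acts on the lowest `modes` frequencies
        "W": scale * rng.standard_normal((d_h, d_h)),
        "R": scale * (rng.standard_normal((modes, d_h, d_h))
                      + 1j * rng.standard_normal((modes, d_h, d_h))),
        "c": np.zeros(d_h),
        # projection layer (theta_Q)
        "Q1": scale * rng.standard_normal((d_h, d_q)),
        "q1": np.zeros(d_q),
        "Q2": scale * rng.standard_normal((d_q, d_u)),
        "q2": np.zeros(d_u),
    }

def ifno_forward(x, g, params, num_layers, T=1.0):
    """x: (M, s) grid coordinates, g: (M, d_g) input function values."""
    dt = T / num_layers
    # lifting: h(x, 0) = P [x, g(x)]^T + p, applied pointwise
    h = np.concatenate([x, g], axis=1) @ params["P"] + params["p"]
    modes = params["R"].shape[0]
    for _ in range(num_layers):            # the same layer J is applied L times
        h_hat = np.fft.rfft(h, axis=0)     # FFT of each feature component
        out_hat = np.zeros_like(h_hat)
        out_hat[:modes] = np.einsum("kio,ki->ko", params["R"], h_hat[:modes])
        conv = np.fft.irfft(out_hat, n=h.shape[0], axis=0)
        h = h + dt * relu(h @ params["W"] + conv + params["c"])
    # projection: u = Q2 sigma(Q1 h + q1) + q2
    return relu(h @ params["Q1"] + params["q1"]) @ params["Q2"] + params["q2"]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(rng, s=1, d_g=1, d_h=8, d_q=16, d_u=1, modes=4)
    for M in (32, 64):                     # same parameters, two resolutions
        x = np.linspace(0.0, 1.0, M, endpoint=False).reshape(M, 1)
        g = np.sin(2.0 * np.pi * x)
        print(M, ifno_forward(x, g, params, num_layers=4).shape)
```

Because no parameter shape depends on the number of grid points `M`, the same parameter set can be evaluated on grids of different resolution, which is the property the text refers to as resolution independence.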

3.2. BASE META MODEL WITH MAML AND ANIL

To transfer multi-task knowledge between a series of complex systems governed by different hidden physical parameters, we propose to combine the integral neural operator with a meta-learning setting. Herein, assume that for each training task $\mathcal{T}^\eta \sim p(\mathcal{T})$ we have a set of observations of loading field/response field data pairs $\mathcal{D}^\eta := \{(g_i^\eta(x), u_i^\eta(x))\}_{i=1}^{N^\eta}$, and each task can be modeled with a parametric PDE form $\mathcal{K}_{b^\eta(x)}[u_i^\eta](x) = g_i^\eta(x)$, $x \in \Omega$, (4) where $b^\eta(x)$ is the hidden task-specific physical parameter field for the common governing law. Given a new and unseen test task, $\mathcal{T}^{test}$, and a context set of labelled samples $\mathcal{D}^{test} := \{(g_i^{test}(x), u_i^{test}(x))\}_{i=1}^{N^{test}}$ on it, our goal is to obtain the approximated solution operator model on the test task as $\mathcal{G}[g; \theta^{test}]$. A straightforward approach would be to simply apply MAML and ANIL to a neural operator architecture; we treat these as the baselines of our studies. Here we formally state our implementation of ANIL and MAML for the problem described above.

MAML. The MAML algorithm proposed in (Finn et al., 2017) aims to find an initialization, $\theta$, across all tasks, so that new tasks can be learnt with very few examples. First, we draw a batch $\{\mathcal{T}^\eta\}_{\eta=1}^{H}$ of $H$ tasks from $p(\mathcal{T})$. For each task $\mathcal{T}^\eta$, we split the available set of loading field/response field data pairs $\mathcal{D}^\eta$ into a support set of samples, $\mathcal{S}^\eta$, which will be used for inner loop updates, and a target set of samples, $\mathcal{Z}^\eta$, for outer loop updates. Then, for the inner loop we let $\theta^{\eta,0} := \theta$ and let $\theta^{\eta,i}$ be the task-wise parameter after $i$ gradient updates. During each inner loop update, we compute $\theta^{\eta,i} = \theta^{\eta,i-1} - \alpha \nabla_{\theta^{\eta,i-1}} \mathcal{L}_{\mathcal{S}^\eta}(\theta^{\eta,i-1})$, for $\eta = 1, \cdots, H$, (5) where $\mathcal{L}_{\mathcal{S}^\eta}(\theta^{\eta,i-1})$ is the loss on the support set of the $\eta$-th task, and $\alpha$ is the step size. After $m$ inner loop updates, we update the initial parameter $\theta$ with a fixed step size $\beta$: $\theta \leftarrow \theta - \beta \nabla_\theta \mathcal{L}_{meta}(\theta)$, (6) where the meta-loss is $\mathcal{L}_{meta}(\theta) := \sum_{\eta=1}^{H} \mathcal{L}_{\mathcal{Z}^\eta}(\theta^{\eta,m})$.
Then, on the test task, $\mathcal{T}^{test}$, we perform inner loop adaptation based on the few labelled samples $\mathcal{D}^{test}$ until convergence, and obtain the approximated solution operator model on the test task as $\mathcal{G}[g; \theta^{test}]$.

ANIL. In (Raghu et al., 2019), ANIL was proposed as a modified version of MAML with inner loop updates only for the final layer. The inner loop update formulation equation 5 is modified as $\theta_Q^{\eta,i} = \theta_Q^{\eta,i-1} - \alpha \nabla_{\theta_Q^{\eta,i-1}} \mathcal{L}_{\mathcal{S}^\eta}(\theta_Q^{\eta,i-1})$, for $\eta = 1, \cdots, H$, where $\theta_Q^{\eta,i}$ is the task-wise parameter on the projection layer after $i$ gradient updates. Then, we perform the same outer loop updates following equation 6.
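
The inner and outer loop updates of MAML can be illustrated on a deliberately tiny problem. The sketch below is an assumption-laden toy, not the neural-operator baseline used in the paper: the model is a single scalar weight `theta` predicting `theta * x`, each hypothetical task is a line with its own slope, and the gradients are written out by hand so that the dependence of the adapted parameter on the initialization (the factor `jac`) is explicit. ANIL would differ only in restricting the inner update to the projection-layer parameters, which has no analogue in a one-parameter model.

```python
def mean_square(xs):
    return sum(x * x for x in xs) / len(xs)

def adapt(theta, slope, xs_support, alpha, m):
    """Run m inner-loop steps on one task.

    Returns the adapted parameter theta^{eta,m} and its derivative with
    respect to the initialization theta (needed by the outer loop).
    """
    s = mean_square(xs_support)
    jac = 1.0
    for _ in range(m):
        # gradient of the support loss mean((theta*x - slope*x)^2)
        grad = 2.0 * (theta - slope) * s
        theta = theta - alpha * grad
        # derivative of one inner step with respect to its input
        jac *= 1.0 - 2.0 * alpha * s
    return theta, jac

def maml_train(slopes, xs_support, xs_target,
               alpha=0.1, beta=0.05, m=1, steps=200, theta=0.0):
    """Outer loop: theta <- theta - beta * grad of the summed target losses."""
    z = mean_square(xs_target)
    for _ in range(steps):
        meta_grad = 0.0
        for slope in slopes:
            theta_m, jac = adapt(theta, slope, xs_support, alpha, m)
            # chain rule through the inner loop
            meta_grad += 2.0 * (theta_m - slope) * z * jac
        theta -= beta * meta_grad
    return theta

if __name__ == "__main__":
    theta = maml_train([1.0, 2.0, 3.0], xs_support=[0.5, 1.0], xs_target=[1.0])
    print(theta)
```

In this symmetric toy setting the learnt initialization settles at the mean of the task slopes, from which every task is reachable in a few inner steps.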

3.3. METAP: A NOVEL META-LEARNT NEURAL OPERATOR ARCHITECTURE

We now propose MetaP, which applies task-wise adaptation only to the first layer, i.e., the lifting layer, with the full algorithm outlined in Algorithm 1. As in other meta-learning approaches (Yoon et al., 2018; Vanschoren, 2018; Yang & Kwok, 2022; Kalais & Chatzis, 2022), the algorithm consists of two phases: (1) a meta-train phase, which learns the shared iterative layer parameters $\theta_I$ and projection layer parameters $\theta_Q$ from existing tasks; (2) a meta-test phase, which transfers the learned knowledge and rapidly learns surrogate solution operators for unseen tasks with unknown physical parameter fields, where only a few test samples are required. To see the inspiration for the proposed architecture, without loss of generality, we assume that the underlying task parameter field $b^\eta(x)$, modeling the physical property field, is normalized and satisfies $\|b^\eta(x) - \bar{b}(x)\|_{L^2(\Omega)} \leq 1$ for all $\eta \in \{1, \cdots, H\}$, where $\bar{b} := \mathbb{E}_{\mathcal{T}^\eta \sim p(\mathcal{T})}[b^\eta]$. Denoting $F_u[b] := \mathcal{K}_b[u]$ as a function from physical parameter fields $\mathcal{B}$ to loading fields $\mathcal{A}$, we can take the Fréchet derivative of $F_u$ with respect to $b$ at $\bar{b}$ and obtain: $\mathcal{K}_{b^\eta}[u] = F_u[\bar{b}] + DF_u[\bar{b}](b^\eta - \bar{b}) + o(\|b^\eta - \bar{b}\|_{L^2(\Omega)})$.
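
As a hedged illustration of this expansion, consider a simple 1D governing law in which the hidden field multiplies a nonlinear reaction term. The specific operator below is an assumption chosen for clarity and is not one of the problems studied in this work.

```latex
\begin{align*}
\mathcal{K}_b[u](x) &:= -u''(x) + b(x)\,u(x)^3, \\
F_u[b] &= -u'' + b\,u^3, \qquad DF_u[\bar{b}](\delta b) = \delta b\; u^3, \\
\mathcal{K}_{b^\eta}[u] &= \underbrace{-u'' + \bar{b}\,u^3}_{F_u[\bar{b}]}
  + \underbrace{(b^\eta - \bar{b})\,u^3}_{DF_u[\bar{b}](b^\eta - \bar{b})}.
\end{align*}
```

Since $F_u$ is affine in $b$ for this operator, the remainder term vanishes and the first-order expansion is exact; for laws that depend nonlinearly on $b$, the $o(\cdot)$ term remains.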

Algorithm 1: MetaP for Few-Shot Learning of New PDE Solver with Hidden Physical Parameters

Meta-Train Phase:
Require: a batch $\{\mathcal{T}^\eta\}_{\eta=1}^{H}$ of known tasks and available data pairs $\mathcal{D}^\eta := \{(g_i^\eta(x), u_i^\eta(x))\}_{i=1}^{N^\eta}$ on each task
Output: common parameters $\theta_I^*$ and $\theta_Q^*$ across all tasks
1. randomly initialize $\theta_I$, $\theta_Q$, and $\{\theta_P^\eta\}_{\eta=1}^{H}$
2. solve the optimization problem: $\{\theta_I^*, \theta_Q^*, \{\theta_P^{\eta,*}\}_{\eta=1}^{H}\} = \operatorname{argmin}_{\{\theta_I, \theta_Q, \{\theta_P^\eta\}_{\eta=1}^{H}\}} \sum_{\eta=1}^{H} \mathcal{L}_{\mathcal{D}^\eta}([\theta_P^\eta, \theta_I, \theta_Q])$

Meta-Test Phase:
Require: a test task $\mathcal{T}^{test}$ and few labelled data pairs $\mathcal{D}^{test} := \{(g_i^{test}(x), u_i^{test}(x))\}_{i=1}^{N^{test}}$
Output: the task-wise parameter $\theta_P^{test,*}$ and the corresponding surrogate PDE solution operator $\mathcal{G}[g; [\theta_P^{test,*}, \theta_I^*, \theta_Q^*]](x)$ for the test task
3. solve for the lifting layer parameter from the optimization problem: $\theta_P^{test,*} = \operatorname{argmin}_{\theta_P^{test}} \mathcal{L}_{\mathcal{D}^{test}}([\theta_P^{test}, \theta_I^*, \theta_Q^*])$

Substituting the above formulation into equation 4, we obtain $F_{u_i^\eta}[\bar{b}] + DF_{u_i^\eta}[\bar{b}](b^\eta - \bar{b}) \approx g_i^\eta$. Denoting $F_1[b^\eta] := [1, b^\eta - \bar{b}]$ and $F_2[u_i^\eta] := [F_{u_i^\eta}[\bar{b}], DF_{u_i^\eta}[\bar{b}]]$, we can reformulate equation 4 into a more generic form: $F_1[b^\eta](x) \cdot F_2[u_i^\eta](x) = g_i^\eta(x)$, $x \in \Omega$. (8) We point out that this parametric PDE form is very general and arises in many science and engineering applications: besides our motivating example on material modeling, examples include the monitoring of tissue degeneration problems (Zhang & Sacks, 2017), the detection of subsurface flows (Dejam et al., 2017), nondestructive inspection in aviation (Fallah et al., 2019), and the prediction of concrete structure deterioration (Wei et al., 2021). In the following, we show that MetaP is a universal solution-finding operator for the multi-task PDE solving problem in equation 8, in the sense that it can approximate a fixed point method to a desired accuracy. For simplicity, we consider a 1D domain $\Omega$ and scalar functions $F_1[b^\eta]$, $F_2[u_i^\eta]$.
These functions are assumed to be sufficiently smooth and measured at uniformly distributed nodes $\chi := \{x_1, x_2, \ldots, x_M\}$, with $F_1[b^\eta](x_j) \neq 0$ for all $\eta$ and $j$. Then, equation 8 can be formulated as an implicit system of equations: $H(\mathbf{U}_i^{\eta,*}; \tilde{\mathbf{G}}_i^\eta) := \begin{bmatrix} F_2[u_i^\eta](x_1) - g_i^\eta(x_1)/F_1[b^\eta](x_1) \\ \vdots \\ F_2[u_i^\eta](x_M) - g_i^\eta(x_M)/F_1[b^\eta](x_M) \end{bmatrix} = \mathbf{0}$, (9) where $\mathbf{U}_i^{\eta,*} := [u_i^\eta(x_1), u_i^\eta(x_2), \ldots, u_i^\eta(x_M)]$ is the solution we seek, $\tilde{\mathbf{G}}_i^\eta := [g_i^\eta(x_1)/F_1[b^\eta](x_1), g_i^\eta(x_2)/F_1[b^\eta](x_2), \ldots, g_i^\eta(x_M)/F_1[b^\eta](x_M)]$ is the reparameterized loading vector, and $\mathbf{G}_i^\eta := [g_i^\eta(x_1), g_i^\eta(x_2), \ldots, g_i^\eta(x_M)]$ is the original loading vector. Here, we notice that all task-specific information is encoded in $\tilde{\mathbf{G}}_i^\eta$ and can be captured in the lifting layer parameter. Therefore, when equation 9 is seen as an implicit problem in $\mathbf{U}_i^{\eta,*}$ and $\tilde{\mathbf{G}}_i^\eta$, it is independent of the task parameter field $b^\eta$, i.e., this problem is task-independent. In what follows we refer to equation 9 without the task index, as $H(\mathbf{U}^*; \mathbf{G})$, for notational simplicity, with $\mathbf{G}$ standing for the reparameterized loading vector. To solve for $\mathbf{U}^*$ from the nonlinear system $H(\mathbf{U}^*; \mathbf{G}) = \mathbf{0}$, a popular approach is to use fixed-point iteration methods such as the Newton-Raphson method. In particular, starting from an initial guess of the solution (denoted as $\mathbf{U}^0$), the process is repeated to produce successively better approximations to the roots of equation 9, from the solution of iteration $l$ (denoted as $\mathbf{U}^l$) to the solution of iteration $l+1$ (denoted as $\mathbf{U}^{l+1}$) as: $\mathbf{U}^{l+1} = \mathbf{U}^l - (\nabla H(\mathbf{U}^l; \mathbf{G}))^{-1} H(\mathbf{U}^l; \mathbf{G}) := \mathbf{U}^l + \mathcal{R}(\mathbf{U}^l, \mathbf{G})$, (10) until a sufficiently precise value is reached. In the following, we show that as long as Assumptions 1 and 2 hold, i.e., there exists a converging fixed point method, MetaP can be seen as resembling the fixed point method in equation 10 and hence acts as a universal approximator of the solution operator for equation 8.
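
The task-independence argument can be checked numerically on a toy instance. In the sketch below, the pointwise nonlinearity $F_2[u](x_j) = u(x_j) + u(x_j)^3$ is a hypothetical stand-in chosen so that the Jacobian is diagonal (a real PDE would couple neighbouring nodes through derivatives), and the functions `reparameterize` and `newton_solve` are illustrative names. Two tasks with different $F_1[b^\eta]$ but the same reparameterized loading are solved by the same Newton-Raphson iteration, starting from the node coordinates as the initial guess.

```python
def reparameterize(g, f1):
    """Reparameterized loading: g(x_j) / F_1[b](x_j) at every node."""
    return [gj / fj for gj, fj in zip(g, f1)]

def newton_solve(g_tilde, u0, tol=1e-12, max_iter=50):
    """Node-wise Newton-Raphson for H(u; g_tilde) = u + u**3 - g_tilde = 0.

    F_2[u] = u + u**3 is an assumed toy nonlinearity, so the Jacobian of H
    is diagonal with entries 1 + 3 u**2.
    """
    u = list(u0)
    for _ in range(max_iter):
        residual = [uj + uj ** 3 - gj for uj, gj in zip(u, g_tilde)]
        if max(abs(r) for r in residual) < tol:
            break
        u = [uj - r / (1.0 + 3.0 * uj ** 2) for uj, r in zip(u, residual)]
    return u

if __name__ == "__main__":
    nodes = [0.0, 0.25, 0.5, 0.75, 1.0]          # initial guess U^0 = node coordinates
    # Task A: F_1 = 1 everywhere; Task B: F_1 = 2 everywhere, loading doubled.
    g_a, f1_a = [0.5, 1.0, 1.5, 2.0, 2.5], [1.0] * 5
    g_b, f1_b = [1.0, 2.0, 3.0, 4.0, 5.0], [2.0] * 5
    u_a = newton_solve(reparameterize(g_a, f1_a), nodes)
    u_b = newton_solve(reparameterize(g_b, f1_b), nodes)
    print(u_a)
    print(u_b)
```

Once the loading has been divided by $F_1[b^\eta]$, the solver never sees the task parameter, which is the role the text assigns to the task-wise lifting layer.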
Assumptions 1 and 2 ensure that the hidden PDEs are numerically solvable with a converging iterative solver, which is a required condition in most numerical PDE solving problems. Then, taking $\mathbf{U}^0 := [x_1, \cdots, x_M]$ as the initial guess, we aim to show that for any desired accuracy $\varepsilon > 0$, one can find a sufficiently large $L > 0$ and sets of parameters $\theta^\eta = \{\theta_P^\eta, \theta_I, \theta_Q\}$, such that the resultant MetaP model acts as a fixed point method and its prediction satisfies $\|\mathcal{Q}_{\theta_Q} \circ (\mathcal{J}_{\theta_I})^L \circ \mathcal{P}_{\theta_P^\eta}([\mathbf{U}^0, \mathbf{G}^\eta]^T) - \mathbf{U}^{\eta,*}\|_{l^2(\mathbb{R}^M)} \leq \varepsilon$ for all tasks and samples.

Assumption 1. There exists a fixed point equation, $\mathbf{U} = \mathbf{U} + \mathcal{R}(\mathbf{U}, \mathbf{G})$, for the implicit problem equation 9, such that $\mathcal{R} : \mathbb{R}^{2M} \to \mathbb{R}^M$ is a continuous function satisfying $\mathcal{R}(\mathbf{U}^*, \mathbf{G}) = \mathbf{0}$ and $\|\mathcal{R}(\hat{\mathbf{U}}, \mathbf{G}) - \mathcal{R}(\tilde{\mathbf{U}}, \mathbf{G})\|_{l^2(\mathbb{R}^M)} \leq m \|\hat{\mathbf{U}} - \tilde{\mathbf{U}}\|_{l^2(\mathbb{R}^M)}$ for any two vectors $\hat{\mathbf{U}}, \tilde{\mathbf{U}} \in \mathbb{R}^M$. Here, $m > 0$ is a constant independent of $\mathbf{G}$.

Assumption 2. With the initial guess $\mathbf{U}^0 := [x_1, \cdots, x_M]$, the fixed-point iteration $\mathbf{U}^{l+1} = \mathbf{U}^l + \mathcal{R}(\mathbf{U}^l, \mathbf{G})$ ($l = 0, 1, \ldots$) converges, i.e., for any given $\varepsilon > 0$, there exists an integer $L$ such that $\|\mathbf{U}^l - \mathbf{U}^*\|_{l^2(\mathbb{R}^M)} \leq \varepsilon$, $\forall l > L$, for all possible input instances $\mathbf{G} \in \mathbb{R}^M$ and their corresponding solutions $\mathbf{U}^*$.

Then, we have our universal approximation theorem as below, with the proof provided in Appendix A:

Theorem 1 (Universal approximation). Given Assumptions 1-2, let the activation function $\sigma$ for all iterative kernel integration layers be the ReLU function, and the activation function in the projection layer be the identity function. Then for any $\varepsilon > 0$, there exist a sufficiently large layer number $L > 0$ and feature dimension number $d_h > 0$, such that one can find a parameter set for the multi-task problem, $\theta^\eta = [\theta_P^\eta, \theta_I, \theta_Q]$, for which the corresponding MetaP model satisfies $\|\mathcal{Q}_{\theta_Q} \circ (\mathcal{J}_{\theta_I})^L \circ \mathcal{P}_{\theta_P^\eta}([\mathbf{U}^0, \mathbf{G}^\eta]^T) - \mathbf{U}^{\eta,*}\| \leq \varepsilon$, $\forall \mathbf{G}^\eta \in \mathbb{R}^M$, for all tasks.
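
The two phases of Algorithm 1 can be sketched on a scalar toy problem. Everything below is an illustrative assumption rather than the paper's setup: the hypothetical tasks have responses $u = g / b^\eta$ (mirroring the reparameterized loading $g / F_1[b^\eta]$), the "network" is the product `q * p * g` with a task-wise lifting weight `p` standing in for $\theta_P^\eta$ and a single shared weight `q` standing in for $\theta_I, \theta_Q$, and plain gradient descent replaces the Adam optimizer. In the meta-test phase only `p` is fitted, here in closed form by least squares.

```python
def meta_train(tasks, lr=0.02, steps=3000):
    """Meta-train phase: jointly fit the shared q and the task-wise p[eta].

    tasks: list over training tasks; each entry is a list of (g, u) pairs.
    """
    q = 1.0                      # shared across all tasks
    p = [1.0] * len(tasks)       # one lifting parameter per task
    for _ in range(steps):
        grad_q = 0.0
        grad_p = [0.0] * len(tasks)
        for eta, data in enumerate(tasks):
            for g, u in data:
                r = q * p[eta] * g - u                     # prediction residual
                grad_p[eta] += 2.0 * r * q * g / len(data)
                grad_q += 2.0 * r * p[eta] * g / len(data)
        q -= lr * grad_q
        p = [pe - lr * ge for pe, ge in zip(p, grad_p)]
    return q, p

def meta_test(q, context):
    """Meta-test phase: fit only the lifting parameter, with q frozen."""
    num = sum(u * q * g for g, u in context)
    den = sum((q * g) ** 2 for g, u in context)
    return num / den

def predict(q, p, g):
    return q * p * g

if __name__ == "__main__":
    loads = [0.5, 1.0, 1.5]
    hidden_b = [2.0, 1.0, 0.5]                             # hidden task parameters
    tasks = [[(g, g / b) for g in loads] for b in hidden_b]
    q, p = meta_train(tasks)
    context = [(g, g / 4.0) for g in (1.0, 2.0)]           # 2 samples from a new task
    p_test = meta_test(q, context)
    print(predict(q, p_test, 3.0))                         # response to an unseen loading
```

The hidden parameter of the new task is never observed; it is absorbed entirely by the fitted lifting parameter, while the shared part of the model is reused unchanged.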

4. EXPERIMENTS

In this section, we demonstrate the empirical effectiveness of the proposed MetaP approach. Specifically, we conduct experiments on a synthetic dataset from a nonlinear PDE solving problem, a benchmark dataset of heterogeneous materials subject to large deformation, and a real-world dataset from biological tissue mechanical testing, and compare the proposed method against competitive GBML baselines. All experiments are run using PyTorch with the Adam optimizer, with detailed settings provided in Appendix B. In all tests we use the averaged relative error, $\|u_{i,pred} - u_i\|_{L^2(\Omega)} / \|u_i\|_{L^2(\Omega)}$, as the error metric (lower is better). We repeat each experiment 5 times, and report the averaged errors and their standard errors.
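
The error metric and its aggregation over repeated runs can be sketched as follows. This is a minimal illustration under the assumption that fields are sampled on a uniform grid, so the quadrature weights of the discrete $L^2$ norms cancel in the ratio; the function names are hypothetical.

```python
import math
import statistics

def relative_l2_error(pred, true):
    """Discrete ||pred - true|| / ||true|| for one sample on a uniform grid."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    return num / den

def averaged_relative_error(preds, trues):
    """Average of the per-sample relative errors over a test set."""
    errors = [relative_l2_error(p, t) for p, t in zip(preds, trues)]
    return sum(errors) / len(errors)

def mean_and_standard_error(values):
    """Mean and standard error of a metric over repeated experiments."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

if __name__ == "__main__":
    print(relative_l2_error([1.1, 2.2], [1.0, 2.0]))
    print(mean_and_standard_error([0.040, 0.050, 0.045, 0.055, 0.060]))
```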

4.1. SYNTHETIC DATA SETS AND ABLATION STUDY

We first consider the PDE solution finding problem of the Holzapfel-Gasser-Ogden (HGO) model (Holzapfel et al., 2000), which describes the deformation of hyperelastic, anisotropic, and fiber-reinforced materials. Different tasks correspond to different material parameter sets, $\{k_1, k_2, E, \nu, \alpha\}$, where $k_1$ and $k_2$ are the fiber modulus and the exponential coefficient, respectively, $E$ is the Young's modulus, $\nu$ is the Poisson ratio, and $\alpha$ is the fiber angle direction from the reference direction. In this example the physical response of interest is the displacement field $u : [0, 1]^2 \to \mathbb{R}^2$ subject to different traction loadings applied on the top edge of this material. Therefore, we take the input function $g(x)$ as the padded traction loading field, and the output function as the corresponding displacement field. To investigate the performance of MetaP in few-shot learning, we generate 60 tasks for training, validation, and 1 in-distribution (ID) test task by sampling the physical parameters: $k_1, k_2 \sim \mathcal{U}[0.1, 1]$, $E \sim \mathcal{U}[0.5, 1.5]$, ν ∼ U[0.1, 0. [...] 5 of Appendix B, the first OOD task (denoted as "OOD Task1") corresponds to a stiffer material sample and smaller deformation for each given loading, while the second OOD task (denoted as "OOD Task2") generates a softer material sample and larger deformation. For each training task, we generate 500 data pairs $\mathcal{D}^\eta := \{(g_i^\eta, u_i^\eta)\}_{i=1}^{500}$, by sampling the vertical traction loading from a Gaussian random field. Then, the corresponding ground-truth displacement field is obtained using the finite element method implemented in FEniCS (Alnaes et al., 2015). For the test tasks, we train with $N^{test} \in \{2, 4, 8, 12, 20, 100, 300\}$ labelled data pairs (the context set), and evaluate the resultant model on an additional dataset with 200 data pairs (the target set). An 8-layer IFNO is employed as the base model.

Ablation Study. We first conduct an ablation study with three settings.
1) Follow the meta-train and meta-test phases as in Algorithm 1, with task-wise adaptation only to the lifting layer in both phases (denoted as "MetaP"). 2) After MetaP, perform an additional fine-tuning step on all parameters in the meta-test phase (denoted as "MetaP+"). With this test, we aim to investigate whether our algorithm has successfully identified all the common features in the iterative and projection layers. 3) Apply task-wise adaptation only to the projection layer in both meta-train and meta-test phases (denoted as "MetaLast"). With this test, we study whether the successful "adapting last layers" strategy in image classification problems also applies to our PDE solving problem. Besides these three settings, we also report the few-shot learning results with four baseline methods: 1) Learn a neural operator model based only on the context data set of the test task (denoted as "Single"), 2) Pretrain a single neural operator model on all training task data sets, then fine-tune it on the context test task data set (denoted as "Single+"), 3) MAML, and 4) ANIL. As shown in the left plot of Figure 2, MetaP and MetaP+ are both able to adapt quickly with few data pairs: to achieve a test error below 5%, "Single" and "Single+" require 100 data pairs, while MetaP and MetaP+ require only 4 data pairs. On the other hand, MetaLast, MAML and ANIL have similar performance. They all require 100 data pairs to achieve a < 5% test error. This observation supports our theoretical analysis: on the multi-task parametric PDE solution operator learning problem, one should adapt the first layer, not the last ones. Moreover, when comparing MetaP and MetaP+ we can see that the additional fine-tuning step barely improves the performance, especially in the few-sample regime. This supports the efficacy of MetaP, and indicates that our method has successfully captured the underlying task diversity by adapting the first layer, so no further fine-tuning is required.
In-Distribution and Out-Of-Distribution Tests. On the right plot of Figure 2, we show the relative test error of MetaP on both ID and OOD tasks. We can see that all three test errors are on a similar scale to the error on the training tasks. The error from OOD Task1 is slightly smaller than the ID test task error, while the error from OOD Task2 is much larger, probably because the solutions in OOD Task1 generally have smaller magnitude and hence its solution operator lies more in a linear regime, which makes the solution operator learning task easier. These results validate the good generalization performance of MetaP.

4.2. BENCHMARK MECHANICAL MNIST DATASETS

To further test the capability of MetaP on benchmark datasets, we test MetaP and four baseline methods on Mechanical MNIST (Lejeune, 2020). Mechanical MNIST is a dataset of heterogeneous material undergoing large deformation. It contains 70,000 heterogeneous material specimens, and each specimen is governed by the Neo-Hookean material with a varying modulus converted from [...] We present the results in the left plot of Figure 3. The neural operator model learned by MetaP again outperforms the state-of-the-art GBML models. Our MetaP model achieves a 1% error when using only 2 labelled data pairs on the test task, while the "Single" model has around 100% error even using 8 labelled data pairs, due to overfitting. This highlights the importance of learning across multiple tasks in engineering applications: when the total number of measurements on each specimen is limited, it is necessary to transfer knowledge across specimens. Moreover, we notice that in this example ANIL is the least effective GBML method, and is even less efficient than the pretrained model ("Single+"), probably due to the inefficacy of the last-layer adaptation strategy.

4.3. APPLICATION ON REAL-WORLD DATA SETS

We now take a step further to demonstrate the performance of our method on a real-world physical response dataset which is NOT generated by solving PDEs. We consider the problem of learning the mechanical response of multiple biological tissue specimens from DIC displacement tracking measurements. As demonstrated in Figure 1, we measure the biaxial loading of tricuspid valve anterior leaflet (TVAL) specimens from a porcine heart, such that each specimen (as a task) corresponds to a different region of the leaflet. Due to the material heterogeneity of biological tissues, these specimens have different mechanical and structural properties. In this task, we aim to model the tissue response by learning a neural operator that maps the boundary displacement loading to the interior displacement field on each tissue specimen. On each specimen, we have 500 available data pairs. Due to the challenges of obtaining the experimental tissue, only 14 specimens are available in total. This example therefore represents a common challenging setting in real-world applications: we not only face the few-shot learning challenge, but also suffer from the limited number of available training tasks. With a 4-layer IFNO as the base model, we train each model on $N_{\text{test}} \in [2, 300]$ samples, then evaluate the performance on another 200 samples. The results are provided in the right plot of Figure 3. MetaP performs the best among all the methods in the low-data regime, and still beats our MAML and ANIL variants when $N_{\text{test}} = 300$. Interestingly, MAML and ANIL do not even beat the "Single+" method, possibly due to the low efficacy of the adapting-last-layers strategy and the small number of training tasks.

5. CONCLUSION

In this paper we propose MetaP, the first neural-operator-based meta-learning approach that is designed to achieve good transferability in learning complex physical system responses, with a significant improvement in sample efficiency. The first-layer adaptation used by our method is theoretically motivated and shown to yield a universal solution operator for multiple parametric PDE solving tasks. We demonstrate the effectiveness of our proposed MetaP algorithm on various synthetic and real-world datasets, showing promise over baseline methods. For future work, we will investigate the applicability of the proposed approach to other neural operators.

A PROOF OF THEOREM 1

In this section we provide the detailed proof for Theorem 1, based on Assumptions 1 and 2. Intuitively, that means the underlying implicit problem is solved with a converging fixed point method. This condition is a basic requirement of numerical PDEs, and it generally holds true in many applications governed by nonlinear and complex PDEs, such as in our three experiments. Here, we prove that MetaP is universal, i.e., given a fixed point method satisfying Assumptions 1-2, one can find parameter sets $\theta^\eta$ whose output approximates $U^{\eta,*}$ to a desired accuracy, $\varepsilon > 0$, for all $\eta = 1, \cdots, H$ tasks.

For the task-wise parameters, with a slight abuse of notation, we denote $P^\eta \in \mathbb{R}^{d_h M \times (d_g+s)M}$ as the collection of the pointwise weight matrices at each discretization point in $\chi$ for the $\eta$-th task, and $p^\eta \in \mathbb{R}^{d_h M}$ for the bias in the lifting layer. Then, for the parameters shared among all tasks, in the iterative layer we denote $C = [c(x_1), \cdots, c(x_M)] \in \mathbb{R}^{d_h M}$ as the collection of pointwise bias vectors $c(x_i)$, $W \in \mathbb{R}^{d_h \times d_h}$ for the local linear transformation, and $R = \mathcal{F}[\kappa(\cdot; v)] \in \mathbb{C}^{d_h \times d_h \times M}$ for the Fourier coefficients of the kernel $\kappa$. For simplicity, here we have assumed that the Fourier coefficient is not truncated, and all available frequencies are used. Then, for the projection layer we seek $Q_1 \in \mathbb{R}^{d_Q M \times d_h M}$, $Q_2 \in \mathbb{R}^{d_u M \times d_Q M}$, $q_1 \in \mathbb{R}^{d_Q M}$ and $q_2 \in \mathbb{R}^{d_u M}$. For the simplicity of notation, in this section we organize the feature vector $H \in \mathbb{R}^{d_h M}$ in a way such that the components corresponding to each discretization point are adjacent, i.e., $H = [H(x_1), \cdots, H(x_M)]$ and $H(x_i) \in \mathbb{R}^{d_h}$. We point out that under this circumstance, the (discretized) iterative layer can be written as

$$\mathcal{J}[H(l\Delta t)] = H(l\Delta t) + \Delta t\,\sigma\left(\mathbf{W} H(l\Delta t) + \mathrm{Re}\left(\mathcal{F}^{-1}_{\Delta x}\left(R \cdot \mathcal{F}_{\Delta x}(H(l\Delta t))\right)\right) + C\right) = H(l\Delta t) + \Delta t\,\sigma\left(V H(l\Delta t) + C\right),$$

with

$$V := \mathrm{Re}\begin{bmatrix}
\sum_{n=0}^{M-1} R_{n+1} + W & \sum_{n=0}^{M-1} R_{n+1}\exp\left(\frac{2i\pi\Delta x\,n}{M}\right) & \cdots & \sum_{n=0}^{M-1} R_{n+1}\exp\left(\frac{2i\pi(M-1)\Delta x\,n}{M}\right) \\
\sum_{n=0}^{M-1} R_{n+1}\exp\left(\frac{2i\pi\Delta x\,n}{M}\right) & \sum_{n=0}^{M-1} R_{n+1} + W & \cdots & \sum_{n=0}^{M-1} R_{n+1}\exp\left(\frac{2i\pi(M-2)\Delta x\,n}{M}\right) \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{n=0}^{M-1} R_{n+1}\exp\left(\frac{2i\pi(M-1)\Delta x\,n}{M}\right) & \sum_{n=0}^{M-1} R_{n+1}\exp\left(\frac{2i\pi(M-2)\Delta x\,n}{M}\right) & \cdots & \sum_{n=0}^{M-1} R_{n+1} + W
\end{bmatrix}.$$

Here, $R \in \mathbb{C}^{M \times d_h \times d_h}$ with $R_i \in \mathbb{C}^{d_h \times d_h}$ being the component associated with each discretization point $x_i \in \chi$, $V \in \mathbb{R}^{d_h M \times d_h M}$, $C \in \mathbb{R}^{d_h M}$, $\mathbf{W} := W \oplus W \oplus \cdots \oplus W$ is a $d_h M \times d_h M$ block diagonal matrix formed by $W \in \mathbb{R}^{d_h \times d_h}$, and $\mathcal{F}_{\Delta x}$ and $\mathcal{F}^{-1}_{\Delta x}$ denote the discrete Fourier transform and its inverse, respectively. By further taking $R_2 = \cdots = R_M = W = 0$, a $d_h \times d_h$ matrix with all its elements being zero, it suffices to show the universal approximation property for an iterative layer as follows:

$$\mathcal{J}(H(l\Delta t)) := H(l\Delta t) + \Delta t\,\sigma\left(\tilde{V} H(l\Delta t) + C\right),$$

where $\tilde{V} := 1_{[M,M]} \otimes V$ with $V \in \mathbb{R}^{d_h \times d_h}$ and $1_{[m,n]}$ being an $m$ by $n$ all-ones matrix. To be more precise, we will prove the following theorem:

Theorem 1 (Universal approximation). Let $U^{\eta,*} = [u^\eta(x_1), u^\eta(x_2), \ldots, u^\eta(x_M)]$ be the ground-truth solution of the $\eta$-th task that satisfies Assumptions 1-2, the activation function $\sigma$ for all iterative kernel integration layers be the ReLU function, and the activation function in the projection layer be the identity function. Then for any $\varepsilon > 0$, there exist a sufficiently large layer number $L > 0$ and feature dimension number $d_h > 0$, such that one can find a parameter set for the multi-task problem, $\theta^\eta = [\theta^\eta_P, \theta_I, \theta_Q]$, such that the corresponding MetaP model satisfies

$$\left\|\mathcal{Q}_{\theta_Q} \circ (\mathcal{J}_{\theta_I})^L \circ \mathcal{P}_{\theta^\eta_P}([U_0, G^\eta]^T) - U^{\eta,*}\right\| \le \varepsilon, \quad \forall G^\eta \in \mathbb{R}^M.$$

For the proof of this main theorem, we need the following approximation property of a shallow neural network, with its detailed proof provided in You et al. (2022c):

Lemma 1. Given a continuous function $T : \mathbb{R}^{2M} \to \mathbb{R}^M$, and a non-polynomial and continuous activation function $\sigma$, for any constant $\hat{\varepsilon} > 0$ there exists a shallow neural network model $\hat{T} := S\sigma(BX + A)$ such that

$$\|T(X) - \hat{T}(X)\|_{l^2(\mathbb{R}^M)} \le \hat{\varepsilon}, \quad \forall X \in \mathbb{R}^{2M},$$

for sufficiently large feature dimension $\hat{d} > 0$. Here, $S \in \mathbb{R}^{M \times \hat{d}M}$, $B \in \mathbb{R}^{\hat{d}M \times 2M}$, and $A \in \mathbb{R}^{\hat{d}M}$ are matrices/vectors which are independent of $X$.

We now proceed to the proof of Theorem 1:

Proof. Since all $U^{\eta,*}$ satisfy Assumptions 1-2, for any $\varepsilon > 0$, we first pick a sufficiently large integer $L$ such that the $L$-th layer iteration result of this fixed point formulation satisfies $\|U^L - U^{\eta,*}\|_{l^2(\mathbb{R}^M)} \le \frac{\varepsilon}{2}$ for all tasks. By taking $\hat{\varepsilon} := \frac{m\varepsilon}{2(1+m)^L}$ in Lemma 1, there exists a sufficiently large feature dimension $\hat{d}$ and one can find $S \in \mathbb{R}^{M \times \hat{d}M}$, $B \in \mathbb{R}^{\hat{d}M \times 2M}$, and $A \in \mathbb{R}^{\hat{d}M}$, such that $\hat{\mathcal{R}}(U^\eta, \tilde{G}^\eta) := S\sigma(B[U^\eta, \tilde{G}^\eta]^T + A)$ satisfies

$$\|\mathcal{R}(U^\eta, \tilde{G}^\eta) - \hat{\mathcal{R}}(U^\eta, \tilde{G}^\eta)\|_{l^2(\mathbb{R}^M)} = \|\mathcal{R}(U^\eta, \tilde{G}^\eta) - S\sigma(B[U^\eta, \tilde{G}^\eta]^T + A)\|_{l^2(\mathbb{R}^M)} \le \hat{\varepsilon} = \frac{m\varepsilon}{2(1+m)^L},$$

where $m$ is the contraction parameter of $\mathcal{R}$, as defined in Assumption 1. By this construction, we know that $S$ has independent rows. Denoting $d := \hat{d} + 1 > 0$, there exists the right inverse of $S$, which we denote as $S^+ \in \mathbb{R}^{(d-1)M \times M}$, such that

$$S S^+ = I_M, \quad S^+ S := \tilde{I}_{(d-1)M},$$

where $I_M$ is the $M$ by $M$ identity matrix, and $\tilde{I}_{(d-1)M}$ is a $(d-1)M$ by $(d-1)M$ block matrix with each of its elements being either 1 or 0. Hence, for any vector $Z \in \mathbb{R}^{(d-1)M}$, we have $\sigma(\tilde{I}_{(d-1)M} Z) = \tilde{I}_{(d-1)M}\sigma(Z)$. Moreover, we note that $S$ has a very special structure: from the $((i-1)(d-1)+1)$-th to the $(i(d-1))$-th column of $S$, all nonzero elements are on its $i$-th row. Correspondingly, we can also choose $S^+$ to have a special structure: from the $((i-1)(d-1)+1)$-th to the $(i(d-1))$-th row of $S^+$, all nonzero elements are on its $i$-th column. Hence, when multiplying $S^+$ with $U$, there will be no entanglement between different components of $U$. That means, $S^+$ can be seen as a pointwise weight function.

We now construct the MetaP as follows. In this construction, we choose the feature dimension as $d_h := dM$. With the input $[U_0, G^\eta] \in \mathbb{R}^{2M}$, for the lift layer we set

$$P^\eta := 1_{[M,1]} \otimes \begin{bmatrix} S^+ & 0 \\ 0 & D^\eta \end{bmatrix} = \begin{bmatrix} S^+ & 0 \\ 0 & D^\eta \\ S^+ & 0 \\ 0 & D^\eta \\ \vdots & \vdots \\ S^+ & 0 \\ 0 & D^\eta \end{bmatrix} \in \mathbb{R}^{d_h M \times 2M} \quad \text{(the block is repeated } M \text{ times)}, \qquad p^\eta := 0 \in \mathbb{R}^{d_h M}.$$

Here, $D^\eta := \mathrm{diag}[1/F_1[b^\eta](x_1), \cdots, 1/F_1[b^\eta](x_M)]$. As such, the initial layer of feature is then given by

$$H(0) = P^\eta([U_0, G^\eta]^T) = 1_{[M,1]} \otimes [S^+ U_0, D^\eta G^\eta]^T = 1_{[M,1]} \otimes [S^+ U_0, \tilde{G}^\eta]^T \in \mathbb{R}^{d_h M}.$$

Here, we point out that $P^\eta$ and $p^\eta$ can be seen as pointwise weight and bias functions, respectively. Next we construct the shared iterative layer $\mathcal{J}$, by setting

$$V := \begin{bmatrix} \tilde{I}_{(d-1)M} B/M \\ 0 \end{bmatrix} \begin{bmatrix} S/\Delta t & 0 \\ 0 & I_M/\Delta t \end{bmatrix}, \quad \tilde{V} := 1_{[M,M]} \otimes V, \quad \text{and} \quad C := 1_{[M,1]} \otimes \begin{bmatrix} \tilde{I}_{(d-1)M} A/\Delta t \\ 0 \end{bmatrix}.$$

Note that $\tilde{V}$ is independent of $\eta$, and falls into the formulation of $V$, by letting $R_1 = V$ and $R_2 = \cdots = R_M = W = 0$. For the $(l+1)$-th layer of feature vector, we then arrive at

$$\begin{aligned}
H((l+1)\Delta t) &= H(l\Delta t) + \Delta t\,\sigma\left(\tilde{V} H(l\Delta t) + C\right) \\
&= H(l\Delta t) + \left(I_M \otimes \begin{bmatrix} S^+ S & 0 \\ 0 & I_M \end{bmatrix}\right) \sigma\left(\left(1_{[M,1]} \otimes \begin{bmatrix} B/M \\ 0 \end{bmatrix}\right)\left(1_{[1,M]} \otimes \begin{bmatrix} S & 0 \\ 0 & I_M \end{bmatrix}\right) H(l\Delta t) + 1_{[M,1]} \otimes \begin{bmatrix} A \\ 0 \end{bmatrix}\right),
\end{aligned}$$

where $H(l\Delta t) = [\hat{h}^{l\Delta t}_1, \hat{h}^{l\Delta t}_2, \ldots, \hat{h}^{l\Delta t}_{2M-1}, \hat{h}^{l\Delta t}_{2M}]^T$ denotes the (spatially discretized) hidden layer feature at the $l$-th iterative layer of the IFNO. Subsequently, we note that the second part of the feature vector, $\hat{h}^{l\Delta t}_{2j} \in \mathbb{R}^M$, satisfies

$$\hat{h}^{(l+1)\Delta t}_{2j} = \hat{h}^{l\Delta t}_{2j} = \cdots = \hat{h}^{0}_{2j} = \tilde{G}^\eta, \quad \forall l = 0, \cdots, L-1, \ \forall j = 1, \cdots, M.$$

Hence, the first part of the feature vector, $\hat{h}^{l\Delta t}_{2j-1} \in \mathbb{R}^{(d-1)M}$, satisfies the following iterative rule:

$$\hat{h}^{(l+1)\Delta t}_{2j-1} = \hat{h}^{l\Delta t}_{2j-1} + S^+ S\sigma\left(B[S\hat{h}^{l\Delta t}_{2j-1}, \tilde{G}^\eta]^T + A\right), \quad \forall l = 0, \cdots, L-1, \ \forall j = 1, \cdots, M,$$

and $\hat{h}^{(l+1)\Delta t}_1 = \hat{h}^{(l+1)\Delta t}_3 = \cdots = \hat{h}^{(l+1)\Delta t}_{2M-1}$. Finally, for the projection layer $\mathcal{Q}$, we set the activation function in the projection layer as the identity function, and

$$Q_1 := I_{d_h M} \ \text{(the identity matrix of size } d_h M\text{)}, \quad Q_2 := [S, 0] \in \mathbb{R}^{M \times d_h M}, \quad q_1 := 0 \in \mathbb{R}^{d_h M}, \quad q_2 := 0 \in \mathbb{R}^M.$$

Denoting the output $U^\eta := \mathcal{Q}_{\theta_Q} \circ (\mathcal{J}_{\theta_I})^L \circ \mathcal{P}_{\theta^\eta_P}([U_0, G^\eta]^T)$, we now show that $U^\eta$ can approximate $U^{\eta,*}$ with a desired accuracy $\varepsilon$:

$$\begin{aligned}
\|U^\eta - U^{\eta,*}\| &\le \|U^\eta - U^L\|_{l^2(\mathbb{R}^M)} + \|U^L - U^{\eta,*}\|_{l^2(\mathbb{R}^M)} \\
&\le \|S\hat{h}^{L\Delta t}_1 - U^L\|_{l^2(\mathbb{R}^M)} + \frac{\varepsilon}{2} \quad \text{(by Assumption 2)} \\
&\le \|S\hat{h}^{(L-1)\Delta t}_1 - U^{L-1}\|_{l^2(\mathbb{R}^M)} + \|\hat{\mathcal{R}}(S\hat{h}^{(L-1)\Delta t}_1, \tilde{G}^\eta) - \mathcal{R}(U^{L-1}, \tilde{G}^\eta)\|_{l^2(\mathbb{R}^M)} + \frac{\varepsilon}{2} \\
&\le \|S\hat{h}^{(L-1)\Delta t}_1 - U^{L-1}\|_{l^2(\mathbb{R}^M)} + \|\hat{\mathcal{R}}(S\hat{h}^{(L-1)\Delta t}_1, \tilde{G}^\eta) - \mathcal{R}(S\hat{h}^{(L-1)\Delta t}_1, \tilde{G}^\eta)\|_{l^2(\mathbb{R}^M)} \\
&\quad + \|\mathcal{R}(S\hat{h}^{(L-1)\Delta t}_1, \tilde{G}^\eta) - \mathcal{R}(U^{L-1}, \tilde{G}^\eta)\|_{l^2(\mathbb{R}^M)} + \frac{\varepsilon}{2} \\
&\le (1+m)\|S\hat{h}^{(L-1)\Delta t}_1 - U^{L-1}\|_{l^2(\mathbb{R}^M)} + \frac{m\varepsilon}{2(1+m)^L} + \frac{\varepsilon}{2} \quad \text{(by Lemma 1 and Assumption 1)} \\
&\le \frac{m\varepsilon}{2(1+m)^L}\left(1 + (1+m) + (1+m)^2 + \cdots + (1+m)^{L-1}\right) + \frac{\varepsilon}{2} \\
&\le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.
\end{aligned}$$
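The last two steps of the estimate unroll the recursion $e_{l+1} \le (1+m)e_l + \hat{\varepsilon}$ with $e_0 = 0$ and $\hat{\varepsilon} = \frac{m\varepsilon}{2(1+m)^L}$, and bound the resulting geometric sum by $\varepsilon/2$. As a quick numerical sanity check of that arithmetic (not part of the proof), the sketch below iterates the recursion directly; the function name and the sample values of $\varepsilon$, $m$, and $L$ are arbitrary choices made for illustration.

```python
def accumulated_error_bound(eps, m, num_layers):
    """Iterate e_{l+1} = (1 + m) * e_l + eps_hat, starting from e_0 = 0.

    eps_hat = m * eps / (2 * (1 + m) ** L) is the per-layer accuracy
    requested from Lemma 1 in the proof.
    """
    eps_hat = m * eps / (2.0 * (1.0 + m) ** num_layers)
    err = 0.0  # S S^+ U_0 = U_0, so the initial mismatch is zero
    for _ in range(num_layers):
        err = (1.0 + m) * err + eps_hat
    return err


# Closed form of the geometric sum: eps/2 * (1 - (1 + m)^(-L)) <= eps/2.
bound = accumulated_error_bound(eps=0.1, m=0.5, num_layers=10)
```

The closed form shows that the accumulated approximation error stays strictly below $\varepsilon/2$ for any $m > 0$ and $L \ge 1$, which is what the final line of the estimate uses.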

B DATA GENERATION AND TRAINING DETAILS

In the following we briefly describe the process of generating the datasets, and the settings employed in running each algorithm. For a fair comparison, for each algorithm we tune the hyperparameters, including the learning rate from {0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001}, the decay rate from {0.5, 0.7, 0.9}, the weight decay parameter from {0.01, 0.001, 0.0001, 0.00001, 0.000001}, and the inner loop learning rate for MAML and ANIL from {0.01, 0.001, 0.0001, 0.00001, 0.000001}, to minimize the error on a separate validation dataset. In all experiments we multiply the learning rate by the decay rate every 100 epochs. The code and the processed datasets will be publicly released on GitHub for readers to reproduce the experimental results.

B.1 EXAMPLE 1: SYNTHETIC DATA SETS

B.1.1 DATA GENERATION

In the synthetic data example, we consider the modeling problem of a hyperelastic, anisotropic, fiber-reinforced material, and seek to find its displacement field $u : [0, 1]^2 \to \mathbb{R}^2$ under different boundary loadings. In this problem, the specimen is assumed to be subject to a uniaxial tension $T_y(x)$ on the top edge (see Figure 4(a)). To generate training and test samples, the Holzapfel-Gasser-Ogden (HGO) model (Holzapfel et al., 2000) was employed to describe the constitutive behavior of the material in this example, with its strain energy density function given as:

$$\eta = \frac{E}{4(1+\nu)}(I_1 - 2) - \frac{E}{2(1+\nu)}\ln(J) + \frac{k_1}{2k_2}\left(\exp\left(k_2\langle S(\alpha)\rangle^2\right) + \exp\left(k_2\langle S(-\alpha)\rangle^2\right) - 2\right) + \frac{E}{6(1-2\nu)}\left(\frac{J^2 - 1}{2} - \ln J\right).$$

Here, $\langle\cdot\rangle$ denotes the Macaulay bracket, and the fiber strain of the two fiber groups is defined as:

$$S(\alpha) = \frac{I_4(\alpha) - 1 + |I_4(\alpha) - 1|}{2},$$

where $k_1$ and $k_2$ are the fiber modulus and the exponential coefficient, respectively, $c_{10}$ is the modulus of the non-fibrous ground matrix, $E$ is the Young's modulus, and $\nu$ is the Poisson ratio.
Moreover, $I_1 = \mathrm{tr}(C)$ is the first invariant of the right Cauchy-Green tensor $C = F^T F$, $F$ is the deformation gradient, and $J$ is related to $F$ such that $J = \det F$. For the fiber group with angle direction $\alpha$ from the reference direction, $I_4(\alpha) = n^T(\alpha) C n(\alpha)$ is the fourth invariant of the right Cauchy-Green tensor $C$, where $n(\alpha) = [\cos(\alpha), \sin(\alpha)]^T$. Different specimens (tasks) correspond to different material parameter sets, $\{k_1, k_2, E, \nu, \alpha\}$. For the training tasks, the validation task, and the in-distribution (ID) test task, their physical parameters are sampled from: $k_1, k_2 \sim \mathcal{U}[0.1, 1]$, $E \sim \mathcal{U}[0.55, 1.5]$, $\nu \sim \mathcal{U}[0.01, 0.49]$, and $\alpha \sim \mathcal{U}[\pi/10, \pi/2]$. For the two out-of-distribution (OOD) test tasks, we sample their parameters following $k_1, k_2 \sim \mathcal{U}[1, 1.9]$, $E \sim \mathcal{U}[1.5, 2] \cup \mathcal{U}[0.5, 0.55]$, $\nu \sim \mathcal{U}[0.01, 0.49]$⁴, and $\alpha \sim \mathcal{U}[\pi/2, 3\pi/4] \cup [0, \pi/10]$. To generate the high-fidelity (ground-truth) dataset, we sampled 500 different vertical traction conditions $T_y(x)$ on the top edge from a random field, following the algorithm in Lang & Potthoff (2011); Yin et al. (2022b). In particular, $T_y(x)$ is taken as the restriction of a 2D random field, $\phi(x) = \mathcal{F}^{-1}(\gamma^{1/2}\mathcal{F}(\Gamma))(x)$, on the top edge. Here, $\Gamma(x)$ is a Gaussian white noise random field on $\mathbb{R}^2$, $\gamma = (w_1^2 + w_2^2)^{-5/4}$ represents a correlation function, and $w_1$, $w_2$ are the wave numbers in the $x$ and $y$ directions, respectively. Then, for each sampled traction loading, we solved for the displacement field on the entire domain by minimizing the potential energy using the finite element method implemented in FEniCS (Alnaes et al., 2015). In particular, the displacement field was approximated by continuous piecewise linear finite elements on a triangular mesh, and the grid size was taken as 0.025. Then, the finite element solution was interpolated onto $\chi$, a structured 41 × 41 grid which will be employed as the discretization in our neural operators. To visualize the domain characteristics of the tasks, the distribution of each parameter for the training, validation, and test tasks is shown in Figure 5, and the corresponding solution fields are plotted in Figure 4(c), showing the diversity across different tasks due to the change of the underlying hidden material parameter set, $\{k_1, k_2, E, \nu, \alpha\}$.
From Figures 5 and 4(c), one can see that OOD Task1 corresponds to a stiffer material (with a large Young's modulus $E$) and hence a smaller deformation subject to the same loading $T_y(x)$. On the other hand, OOD Task2 corresponds to a softer material (with a small Young's modulus $E$) and a larger deformation. Therefore, the material response of the OOD Task1 specimen is more likely to lie in a linear region, which is easier to learn and explains the relatively small test error on this task. On the other hand, the material response of OOD Task2 is more nonlinear and hence more complex due to the larger deformation, as shown in Figure 4(c), which results in the relatively larger test error in Figure 2.
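To make the constitutive model concrete, the sketch below evaluates the HGO strain energy density of this section for a given 2D deformation gradient, and draws one in-distribution parameter set from the uniform ranges stated in the text. It is an illustrative pointwise evaluation only, not the FEniCS implementation used to generate the data; the function names and the nested-list layout of `F` are assumptions.

```python
import math
import random


def hgo_energy(F, k1, k2, E, nu, alpha):
    """Strain energy density for a 2x2 deformation gradient F = [[F11, F12], [F21, F22]]."""
    (f11, f12), (f21, f22) = F
    # Right Cauchy-Green tensor C = F^T F (symmetric, so only 3 entries are needed).
    c11 = f11 * f11 + f21 * f21
    c12 = f11 * f12 + f21 * f22
    c22 = f12 * f12 + f22 * f22
    i1 = c11 + c22                 # first invariant, tr(C)
    j = f11 * f22 - f12 * f21      # J = det F

    def fiber_strain(angle):
        n1, n2 = math.cos(angle), math.sin(angle)
        i4 = n1 * (c11 * n1 + c12 * n2) + n2 * (c12 * n1 + c22 * n2)
        return (i4 - 1.0 + abs(i4 - 1.0)) / 2.0

    def macaulay(x):
        return max(x, 0.0)

    matrix = E / (4.0 * (1.0 + nu)) * (i1 - 2.0) - E / (2.0 * (1.0 + nu)) * math.log(j)
    fibers = k1 / (2.0 * k2) * (
        math.exp(k2 * macaulay(fiber_strain(alpha)) ** 2)
        + math.exp(k2 * macaulay(fiber_strain(-alpha)) ** 2)
        - 2.0
    )
    volumetric = E / (6.0 * (1.0 - 2.0 * nu)) * ((j * j - 1.0) / 2.0 - math.log(j))
    return matrix + fibers + volumetric


def sample_id_parameters(rng):
    """Draw one in-distribution task from the uniform ranges given in the text."""
    return {
        "k1": rng.uniform(0.1, 1.0),
        "k2": rng.uniform(0.1, 1.0),
        "E": rng.uniform(0.55, 1.5),
        "nu": rng.uniform(0.01, 0.49),
        "alpha": rng.uniform(math.pi / 10, math.pi / 2),
    }


params = sample_id_parameters(random.Random(0))
undeformed = hgo_energy([[1.0, 0.0], [0.0, 1.0]], **params)
stretched = hgo_energy([[1.2, 0.0], [0.0, 1.0]], **params)
```

For the identity deformation gradient every term vanishes, so the energy is zero, which is a useful check on any implementation of this formula.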

B.1.2 ALGORITHM SETTINGS

Base model: As the base model for all algorithms, we construct an architecture for IFNO (You et al., 2022c) as follows. First, the input loading field instance g(x) ∈ A is lifted to a higher dimensional representation via lift layer P[g](x), which is parameterized as a 1-layer feed forward linear layer with width (3,32). Then for the iterative layer in equation 1, we implement F -1 [F[κ(•; v)] • F[h(•, l∆t)]](x) with 2D fast Fourier transform (FFT) with input channel and output channel widths both set as 32 and the truncated Fourier modes set as 8. The local linear transformation parameter, W , is parameterized as a 1-layer feed forward network with width (32,32). In the projection layer, a 2-layer feed forward network with width (32,128,2) is employed. To accelerate the training procedure, we apply the shallow-to-deep training technique to initialize the optimization problem. In particular, we start from the NN model with depth L = 1, train until the loss function reaches a plateau, then use the resultant parameters to initialize the parameters for the next depth, with L = 2, L = 4, and L = 8. In the synthetic experiments, we set the layer depth as L = 8.
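The shallow-to-deep technique can be illustrated on a toy problem. The sketch below is not the IFNO training code: it replaces the network by a scalar iterative layer $h \leftarrow h + \Delta t\, w\, h$ with one shared weight $w$, and it assumes $\Delta t = 1/L$, so that the weight trained at one depth is a sensible initialization at the next depth. All names, the target value, and the optimizer settings are hypothetical.

```python
import math


def forward(w, depth, h0=1.0):
    """Apply `depth` identical layers h <- h + dt * w * h, with dt = 1 / depth (assumed)."""
    dt = 1.0 / depth
    h = h0
    for _ in range(depth):
        h = h + dt * w * h
    return h


def train_at_depth(w, depth, target, lr=0.05, steps=500):
    """Plain gradient descent on the squared error of the toy model at a fixed depth."""
    for _ in range(steps):
        out = forward(w, depth)
        # For h0 = 1, out = (1 + w/depth)^depth, so d(out)/dw = (1 + w/depth)^(depth - 1).
        grad = 2.0 * (out - target) * (1.0 + w / depth) ** (depth - 1)
        w -= lr * grad
    return w, (forward(w, depth) - target) ** 2


def shallow_to_deep(target, depths=(1, 2, 4, 8), w_init=0.0):
    """Train at each depth in turn, reusing the previous result as initialization."""
    w = w_init
    history = []
    for depth in depths:
        w, loss = train_at_depth(w, depth, target)
        history.append((depth, loss))
    return w, history


w_final, history = shallow_to_deep(target=math.exp(0.5))
```

In the actual setting, each call to the training routine would run until the loss plateaus rather than for a fixed number of steps, as described above.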

MetaP:

We split the total of 60 training tasks into two groups: 59 tasks for training and 1 task for validation. During the meta-train phase, we train the task-wise parameters $\theta^\eta_P$ and the common parameters $\theta_I$ and $\theta_Q$ on all 59 tasks, with a context set of 500 samples on each task. After the meta-train phase, we load $\theta_I$, $\theta_Q$, and the average of $\theta^\eta_P$ over all 59 tasks as initialization, then tune the hyperparameters on the validation task. In particular, the 500 samples on the validation task are split into two parts: 300 samples are reserved for training (as the context set) and the remaining 200 samples are used for evaluation (as the target set). Then we train the lift layer on the validation task, and tune the learning rate, the decay rate, and the weight decay parameter for different context set sizes ($N_{\text{test}}$), to minimize the loss on the target set. Based on the chosen hyperparameters, we perform the test on the test task by training the lift layer on different numbers of samples from its context set, then evaluate and report the performance on its target set. We repeat the procedure on the test task with the selected hyperparameters and 5 different random seeds, and calculate the means and standard errors of the resultant test errors on the target set.

MAML&ANIL: For MAML and ANIL, we use the same architecture as the base model, and also split the training tasks for training and validation as in MetaP. During the meta-train phase, for each task we randomly split the available 500 samples into two sets: 250 samples in the support set used for inner loop updates, and the rest in the target set for outer loop updates. During the inner loop update, we train the task-wise parameters for one epoch, following the standard settings of MAML and ANIL (Finn et al., 2017; Raghu et al., 2019). Then, the model hyperparameters, including the learning rate, weight decay, decay rate, and inner loop learning rate, are tuned. In the meta-test phase, we load the initial parameters and train all parameters (in MAML) or the last-layer parameters (in ANIL) until the optimization algorithm converges. Similarly to MetaP, we first tune the hyperparameters on the validation task, then evaluate the performance on the test task.
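The MetaP procedure described above (task-wise first-layer parameters and shared remaining parameters in the meta-train phase, then adapting only the first layer from the averaged initialization in the meta-test phase) can be sketched on a toy problem. In the sketch below, each task has ground truth $u = c_\eta g$ and the model is $u = q\,(p_\eta g)$, with a scalar task-wise "lifting" weight $p_\eta$ and a scalar shared weight $q$. This is a deliberately simplified stand-in for the neural operator; every name and number in it is hypothetical.

```python
def grad_p(p, q, pairs):
    """Gradient of the mean squared error with respect to the task-wise weight p."""
    return sum(2.0 * (q * p * g - u) * q * g for g, u in pairs) / len(pairs)


def grad_q(p, q, pairs):
    """Gradient of the mean squared error with respect to the shared weight q."""
    return sum(2.0 * (q * p * g - u) * p * g for g, u in pairs) / len(pairs)


def meta_train(tasks, lr=0.05, steps=5000):
    """Jointly fit one p per task and a single shared q by gradient descent."""
    p = [1.0] * len(tasks)
    q = 1.0
    for _ in range(steps):
        # Evaluate all gradients at the current point before updating anything.
        gq = sum(grad_q(p_eta, q, pairs) for p_eta, pairs in zip(p, tasks))
        gp = [grad_p(p_eta, q, pairs) for p_eta, pairs in zip(p, tasks)]
        p = [p_eta - lr * g for p_eta, g in zip(p, gp)]
        q -= lr * gq
    return p, q


def meta_test(p_init, q, context, lr=0.05, steps=500):
    """Adapt only the task-wise weight on the context set; q stays frozen."""
    p = p_init
    for _ in range(steps):
        p -= lr * grad_p(p, q, context)
    return p


loads = [-1.0, -0.5, 0.5, 1.0]
train_tasks = [[(g, c * g) for g in loads] for c in (0.5, 1.0, 1.5, 2.0)]
p_tasks, q_shared = meta_train(train_tasks)
p_start = sum(p_tasks) / len(p_tasks)      # averaged task-wise parameter as initialization
context = [(0.5, 0.6), (1.0, 1.2)]         # two labelled pairs from an unseen task (c = 1.2)
p_new = meta_test(p_start, q_shared, context)
```

In this toy setting two context pairs are enough to recover the new task, because all task-specific information sits in the single adapted weight.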

B.2 EXAMPLE 2: MECHANICAL MNIST

B.2.1 DATA SETTINGS

Mechanical MNIST is a benchmark dataset of heterogeneous material undergoing large deformation, modeled by the Neo-Hookean material with a varying modulus converted from the MNIST bitmap images (Lejeune, 2020). In this example, we randomly select 102 specimens corresponding to the hand-written number "1". On each specimen, we have 32 loading/response data pairs on a structured 27 by 27 grid, under the uniaxial extension, shear, equibiaxial extension, and confined compression load scenarios, respectively. All 102 specimens are split into three groups: 100 specimens for training in the meta-train stage, 1 specimen for validation, and 1 specimen for test. On the validation and test tasks, we reserve a target set consisting of 20 data pairs for evaluation, then use the rest as the context set.
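The bookkeeping of this split can be sketched as follows. The specimen and data-pair indices, the seeds, and the function names are hypothetical; the few-shot sizes {2, 4, 8, 12} are the context set sizes used for this benchmark in the main text.

```python
import random


def split_specimens(num_specimens=102, num_train=100, seed=0):
    """Shuffle specimen indices and split them into train / validation / test tasks."""
    rng = random.Random(seed)
    ids = list(range(num_specimens))
    rng.shuffle(ids)
    return ids[:num_train], ids[num_train], ids[num_train + 1]


def split_context_target(num_pairs=32, num_target=20, seed=0):
    """Reserve `num_target` data pairs for evaluation; the rest form the context set."""
    rng = random.Random(seed)
    ids = list(range(num_pairs))
    rng.shuffle(ids)
    return ids[num_target:], ids[:num_target]


def few_shot_subsets(context, sizes=(2, 4, 8, 12)):
    """Nested context subsets, one per few-shot size."""
    return {n: context[:n] for n in sizes}


train_ids, val_id, test_id = split_specimens()
context, target = split_context_target()
subsets = few_shot_subsets(context)
```

Using nested subsets means that each larger few-shot setting only adds data pairs to the smaller one, which keeps the comparison across context sizes consistent; this nesting is a design choice of the sketch, not something stated in the text.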

B.2.2 ALGORITHM SETTINGS

Base model: As the base model for all algorithms, we construct two IFNO architectures, for the prediction of $u_x$ and $u_y$, the displacement fields in the x- and y-directions, respectively. In each architecture, the input loading field instance $g(x) \in \mathcal{A}$ is mapped to a higher dimensional representation via a lifting layer $\mathcal{P}[g](x)$ parameterized as a 1-layer feed forward linear layer with width (4,64). Then for the iterative layer in equation 1, we set the number of truncated Fourier modes as 13, and parameterize the local linear transformation parameter, $W$, as a 1-layer feed forward network with width (64,64). In the projection layer, a 2-layer feed forward network with width (64,128,1) is employed. In this example we also apply the shallow-to-deep technique to accelerate the training, and set the layer depth as $L = 8$.

MetaP: During the meta-train phase, we train the task-wise parameters $\theta^\eta_P$ and the common parameters $\theta_I$ and $\theta_Q$ on all 100 training tasks, with a context set of 32 samples on each task. After the meta-train phase, we load $\theta_I$, $\theta_Q$, and the average of $\theta^\eta_P$ over all 100 tasks as initialization, then train $\theta_P$ on the validation task. In particular, the 32 samples on the validation task are split into two parts: 12 samples are reserved for training (as the context set) and the remaining 20 samples are used for evaluation (as the target set). Then we train the lift layer on the validation task, and tune the learning rate, the decay rate, and the weight decay parameter for different context set sizes ($N_{\text{test}}$), to minimize the loss on the target set. Based on the chosen hyperparameters, we perform the meta-test phase on the test task by training the lift layer on different numbers of samples from its context set, then evaluate and report the performance on its target set.
MAML&ANIL: For MAML and ANIL, we use the same architecture as the base model, and also split the training tasks for training and validation as in MetaP. During the meta-train phase, for each task we randomly split the available 32 samples into two sets: 16 samples in the support set used for inner loop updates, and the rest in the target set for outer loop updates. During the inner loop update, we also follow the standard settings of MAML and ANIL (Finn et al., 2017; Raghu et al., 2019), and tune the hyperparameters following the same procedure as elaborated above for Example 1.

We now briefly describe the data generation procedure for the tricuspid valve anterior leaflet (TVAL) response modeling example. In this problem, the constitutive equations and material microstructure are both unknown, and the dataset has unavoidable measurement noise. To generate the data, we first followed the established biaxial testing procedure, including acquisition of a healthy porcine heart and retrieval of the TVAL (Ross et al., 2019; Laurence et al., 2019). Then, we sectioned the leaflet tissue and applied a speckling pattern to the tissue surface using an airbrush and black paint (Zhang & Arola, 2004; Lionello & Cristofolini, 2014; Palanca et al., 2016). The painted specimen was then mounted to a biaxial testing device (BioTester, CellScale, Waterloo, ON, Canada). To generate samples for each specimen, we performed 7 protocols of displacement-controlled testing to target various biaxial stresses: $P_{11} : P_{22} = \{1 : 1, 1 : 0.66, 1 : 0.33, 0.66 : 1, 0.33 : 1, 0.05 : 1, 1 : 0.1\}$. Here, $P_{11}$ and $P_{22}$ denote the first Piola-Kirchhoff stresses in the x- and y-directions, respectively. Each stress ratio was performed for three loading/unloading cycles. Throughout the test, images of the specimen were captured by a CCD camera, and the load cell readings and actuator displacements were recorded at 5 Hz.
After testing, the acquired images were analyzed using the digital image correlation (DIC) module of the BioTester's software. The pixel coordinate locations of the DIC-tracked grid were then exported and extrapolated to a 21 by 21 uniform grid. In this example, we have the DIC measurements on 14 specimens, with 500 data pairs of loadings and material responses from the 7 protocols on each specimen. These specimens are divided into three groups: 12 for meta-training, 1 for validation, and 1 for test. To demonstrate the diversity of these specimens due to the material heterogeneity in biological tissues, in Figure 6 we plot the processed displacement field of two exemplar training specimens and the validation and test specimens.



1 In some meta-learning literature (e.g., Xu et al., 2020), these small sets of labelled data pairs on a new task (or any task) are also called the context, and the learnt model will be evaluated on an additional set of unlabelled data pairs, i.e., the target.

2 We also point out that the proposed multi-task strategy is generic and hence also applicable to any other integral neural operators (Li et al., 2020a;b;c; You et al., 2022a).

3 We have excluded small deformation samples with the maximum displacement magnitude ≤ 0.1.

4 Here we sample both ID and OOD tasks from the same range of ν, due to the fact that [0.01, 0.49] is the range of Poisson ratio for common materials (Bischofs & Schwarz, 2005).



Figure 1: The architecture of MetaP based on an integral neural operator model.

49], and α ∼ U[π/10, π/2]. Here U stands for uniform distribution. To further evaluate the generalizability when the physical parameters in test tasks are outside the

Figure 2: Results on a synthetic data set. Left: The ablation study comparison of test errors in the in-distribution test. Right: The relative error of MetaP in in-distribution and out-of-distribution tests.

Figure 3: Comparison of MetaP and four baseline methods on the benchmark dataset (Mechanical MNIST, left plot) and the real-world dataset (heart valve tissue, right plot).

the MNIST bitmap images. On each specimen, 32 loading/response data pairs are provided³. Herein, we randomly select 101 specimens for training and validation, and a new and unseen specimen as the test task. In the meta-test phase, we reserve 20 data pairs on the test task as the target set for evaluation, then train each model under the few-shot learning setting with $N_{\text{test}} \in \{2, 4, 8, 12\}$ labelled data pairs as the context set. All approaches are developed based on an 8-layer IFNO model.

Figure 4: Problem setup of example 1: the synthetic data sets. (a) A unit square specimen subject to uniaxial tension with Neumann-type boundary condition. (b) & (c) Visualization of an instance of the loading field $T_y(x)$, and the corresponding ground-truth solutions $u^\eta(x)$ from the in-distribution and out-of-distribution tasks, showing the solution diversity across different tasks due to the change of the underlying hidden material parameter set.

Figure 5: Distribution of physical parameters of different tasks, and the resultant magnitude of material response, ||u η (x)|| L 2 (Ω) , on an exemplar loading instance shown in Figure 4(b).

Figure 6: Visualization of the processed dataset in example 3: learning the biological tissue responses. Subject to the same loading instance, different columns show the corresponding ground-truth solutions $u^\eta(x)$ from different tasks, showing the solution diversity across different tasks due to the change of the underlying hidden material parameter field.

B.3 EXAMPLE 3: EXPERIMENTAL MEASUREMENTS ON BIOLOGICAL TISSUES

B.3.1 DATA GENERATION

55, 1.5], ν ∼ U[0.01, 0.49], and α ∼ U[π/10, π/2]. For the two out-of-distribution (OOD) test tasks, we sample their parameters following k 1 , k 2 ∼ U[1, 1.9], E ∼ U[1.5, 2]∪U[0.5, 0.55], ν ∼ U[0.01, 0.49] 4 , and α ∼ U[π/2, 3π/4]∪[0, π/10]. To generate the highfidelity (ground-truth) dataset, we sampled 500 different vertical traction conditions T y (x) on the top edge from a random field, following the algorithm in Lang & Potthoff (2011);


diversity of these specimens due to the material heterogeneity in biological tissues, in Figure 6 we plot the processed displacement field of two exemplar training specimens and the validation and test specimens.

B.3.2 ALGORITHM SETTINGS

Base model: As the base model, we first construct the lifting layer as a 1-layer feed forward linear layer with width (4,16). Then for the iterative layer we keep 8 truncated Fourier modes and parameterize the local linear transformation parameter, $W$, as a 1-layer feed forward network with width (16,16). In the projection layer, a 2-layer feed forward network with width (16,64,1) is employed. We construct two 4-layer IFNO architectures, for the prediction of $u_x$ and $u_y$, the displacement fields in the x- and y-directions, respectively.

MetaP: During the meta-train phase, we train the task-wise parameters $\theta^\eta_P$ and the common parameters $\theta_I$ and $\theta_Q$ on all 12 tasks, with a context set of 500 samples on each task. After the meta-train phase, we load $\theta_I$, $\theta_Q$, and the average of $\theta^\eta_P$ over all 12 tasks as initialization, then tune the hyperparameters on the validation task. In particular, the 500 samples on the validation task are split into two parts: 300 samples are reserved for training (as the context set) and the remaining 200 samples are used for evaluation (as the target set). Based on the chosen hyperparameters, we perform the test on the test task by training the lift layer on different numbers of samples from its context set, then evaluate and report the performance on its target set.

MAML&ANIL:

For MAML and ANIL, we use the same architecture as the base model, and also split the training tasks for training and validation as in MetaP. During the meta-train phase, for each task we randomly split the available 500 samples into two sets: 250 samples in the support set used for inner loop updates, and the rest in the target set for outer loop updates. During the inner loop update, we train the task-wise parameters for one epoch, following the standard settings of MAML and ANIL (Finn et al., 2017; Raghu et al., 2019).

