TRANSFORMER MEETS BOUNDARY VALUE INVERSE PROBLEMS

Abstract

A Transformer-based deep direct sampling method is proposed for electrical impedance tomography, a well-known severely ill-posed nonlinear boundary value inverse problem. A real-time reconstruction is achieved by evaluating the learned inverse operator between carefully designed data and the reconstructed images. An effort is made to give a specific example to answer a fundamental question: whether and how one can benefit from the theoretical structure of a mathematical problem to develop task-oriented and structure-conforming deep neural networks. Specifically, inspired by direct sampling methods for inverse problems, the 1D boundary data in different frequencies are preprocessed by a partial differential equation-based feature map to yield 2D harmonic extensions as different input channels. Then, by introducing learnable non-local kernels, the direct sampling is recast as a modified attention mechanism. The new method achieves superior accuracy over its predecessors and contemporary operator learners and shows robustness to noise in benchmarks. This research shall strengthen the insight that, despite being invented for natural language processing tasks, the attention mechanism offers great flexibility to be modified in conformity with a priori mathematical knowledge, which ultimately leads to the design of more physics-compatible neural architectures.

1. INTRODUCTION

Boundary value inverse problems aim to recover the internal structure or distribution of multiple media inside an object (a 2D reconstruction) based only on data available on the boundary (1D signal input). They arise in many imaging techniques, e.g., electrical impedance tomography (EIT) (Holder, 2004), diffuse optical tomography (DOT) (Culver et al., 2003), and magnetic induction tomography (MIT) (Griffiths et al., 1999). Not needing any internal data renders these techniques generally non-invasive, safe, and cheap, and thus quite suitable for monitoring applications. In this work, we take EIT as an example to illustrate how a more structure-conforming neural network architecture leads to better results in certain physics-based tasks. Given a 2D bounded domain Ω and an inclusion D, the forward model is the following partial differential equation (PDE)

∇ · (σ∇u) = 0 in Ω, where σ = σ_1 in D, and σ = σ_0 in Ω\D, (1)

where σ is a piecewise constant function defined on Ω with known function values σ_0 and σ_1, but the shape of the inclusion D buried in Ω is unknown. The goal is to recover the shape of D using only the boundary data on ∂Ω (Figure 1). Specifically, by exerting a current g on the boundary, one solves (1) with the Neumann boundary condition σ∇u · n|_∂Ω = g, where n is the outward unit normal of ∂Ω, to obtain a unique u on the whole domain Ω. In practice, only the Dirichlet boundary value representing the voltages f = u|_∂Ω can be measured. This procedure defines the Neumann-to-Dirichlet (NtD) map:

Λ_σ : H^{-1/2}(∂Ω) → H^{1/2}(∂Ω), with g = σ∇u · n|_∂Ω ↦ f = u|_∂Ω. (2)

For notation and the Sobolev space formalism, we refer readers to Appendix A; for a brief review of the theoretical background of EIT, we refer readers to Appendix B. The NtD map in (2) can be expressed as

f = A_σ g, (3)

where g and f are (infinite-dimensional) vector representations of the functions g and f relative to a chosen basis, and A_σ is the matrix representation of Λ_σ (see Appendix B for an example).

The original mathematical setup of EIT is to use the NtD map Λ_σ in (2) to recover σ, referred to as the case of full measurement (Calderón, 2006). In this case, the forward and inverse operators associated with EIT can be formulated as

F : σ ↦ Λ_σ, and F^{-1} : Λ_σ ↦ σ. (4)

Fix a basis {g_l}_{l=1}^∞ of the corresponding Hilbert space containing all admissible currents. Then, mathematically speaking, "knowing the operator Λ_σ" means that one can measure all the current-to-voltage pairs {g_l, f_l := Λ_σ g_l}_{l=1}^∞ and construct the infinite-dimensional matrix A_σ. However, as infinitely many boundary data pairs are not attainable in practice, the problem of more practical interest is to use only a few data pairs {(g_l, f_l)}_{l=1}^L for reconstruction. In this case, the forward and inverse problems can be formulated as

F_L : σ ↦ {(g_1, Λ_σ g_1), ..., (g_L, Λ_σ g_L)} and F_L^{-1} : {(g_1, Λ_σ g_1), ..., (g_L, Λ_σ g_L)} ↦ σ. (5)

For limited data pairs, the inverse operator F_L^{-1} is extremely ill-posed or even not well-defined (Isakov & Powell, 1990; Barceló et al., 1994; Kang & Seo, 2001; Lionheart, 2004); namely, the same boundary measurements may correspond to different σ. In view of the matrix representation A_σ, for g_l = e_l, l = 1, ..., L, with e_l being unit vectors of a chosen basis, (f_1, ..., f_L) only gives the first L columns of A_σ. It is possible that two matrices A_σ and A_σ̃ have similar first L columns while ∥σ − σ̃∥ is large. How to deal with this ill-posedness is a central theme in boundary value inverse problem theories. The operator learning approach has the potential to tame the ill-posedness by restricting F_L^{-1} to a set of sampled data D := {σ^(k)}_{k=1}^N, with different shapes and locations following certain distributions.
Then the problem becomes to approximate F_L^{-1} restricted to this sampled data set. The fundamental assumption here is that this map is "well-defined" enough to be regarded as a high-dimensional interpolation (learning) problem on a compact data submanifold (Seo et al., 2019; Ghattas & Willcox, 2021), and the learned approximate mapping can be evaluated at newly incoming σ's. The incomplete information of Λ_σ due to a small L for one single σ is compensated by a large N ≫ 1 sampling of different σ's.

Classical iterative methods. There are in general two types of methodologies to solve inverse problems. The first one is a large family of iterative or optimization-based methods (Dobson & Santosa, 1994; Martin & Idier, 1997; Chan & Tai, 2003; Vauhkonen et al., 1999; Guo et al., 2019; Rondi & Santosa, 2001; Chen et al., 2020; Bao et al., 2020; Gu et al., 2021). One usually looks for an approximate σ by solving a minimization problem with a regularization term R(σ) to alleviate the ill-posedness, say

inf_σ { Σ_{l=1}^L ∥Λ_σ g_l − f_l∥²_{∂Ω} + R(σ) }.

The design of the regularization R(σ) plays a critical role in a successful reconstruction (Tarvainen et al., 2008; Tehrani et al., 2012; Wang et al., 2012). Due to the ill-posedness, almost all iterative methods take numerous iterations to converge, and the reconstruction is highly sensitive to noise. Besides, the forward operator F(·) needs to be evaluated at each iteration, which is itself expensive as it requires solving forward PDE models.

Classical direct methods. The second methodology is to develop a well-defined mapping G_θ parametrized by θ, empirically constructed to approximate the inverse map itself, say G_θ ≈ F^{-1}. These methods are referred to as non-iterative or direct methods in the literature. Distinguished from iterative approaches, direct methods are typically highly problem-specific, as they are designed based on the specific mathematical structures of their respective inverse operators. For instance, methods in EIT and DOT include factorization methods (Kirsch & Grinberg, 2007; Azzouz et al., 2007; Brühl, 2001; Hanke & Brühl, 2003), MUSIC-type algorithms (Cheney, 2001; Ammari & Kang, 2004; 2007; Lee et al., 2011), and the D-bar methods (Knudsen et al., 2007; 2009) based on a Fredholm integral equation (Nachman, 1996), among which are the direct sampling methods (DSM), the focus of this work (Chow et al., 2014; 2015; Kang et al., 2018; Ito et al., 2013; Ji et al., 2019; Harris & Kleefeld, 2019; Ahn et al., 2020; Chow et al., 2021; Harris et al., 2022).
These methods generally have a closed-form G_θ for the approximation, and the parameters θ represent model-specific mathematical objects. For each fixed θ, this procedure is usually much more stable than iterative approaches with respect to the input data. Furthermore, the evaluation for each boundary data pair is distinctly fast, as no optimization is needed. However, a simple closed-form G_θ admitting efficient execution may not be available in practice, since some mathematical assumptions and derivations may not hold. For instance, MUSIC-type and D-bar methods generally require an accurate approximation to Λ_σ, while DSM poses restrictions on the boundary data, domain geometry, etc.; see Appendix D for details.

Boundary value inverse problems. For most boundary value inverse problems in 2D, the major difference from, e.g., an inverse problem in computer vision (Marroquin et al., 1987), is that data are only available on 1D manifolds, which are used to reconstruct 2D targets. When comparing (4) with a linear inverse problem in signal processing, y = Ax + ϵ, where a signal x is recovered from a measurement y with noise ϵ, the difference is more fundamental in that F(·) itself is highly nonlinear and involves boundary value PDEs. Moreover, the boundary data themselves generally carry certain input-output structures (NtD maps), which adds more complexity. In Adler & Guardo (1994); Fernández-Fuentes et al. (2018); Feng et al. (2018), boundary measurements are collected and directly input into feedforward fully connected networks. As the data reside on different manifolds, special treatments are applied to the input data, such as employing pre-reconstruction stages to generate rough 2D input to CNNs (Ben Yedder et al., 2018; Ren et al., 2020; Pakravan et al., 2021).

Deep neural networks and inverse problems. Solving an inverse problem essentially amounts to giving a satisfactory approximation to F^{-1} based on only finitely many measurements.
A prominent recent approach is operator learning, with the Fourier Neural Operator (FNO) as a representative example (Li et al., 2021a; Kovachki et al., 2021; Guibas et al., 2022; Zhao et al., 2022; Wen et al., 2022; Li et al., 2022b). FNO takes advantage of the low-rank nature of certain problems and learns a kernel that is local in the frequency domain yet global in the spatial-temporal domain, mimicking the kernel integral form of the solution. Concurrent studies include DeepONets (Lu et al., 2021; Wang et al., 2021b; Jin et al., 2022b), Transformers (Cao, 2021; Kissas et al., 2022; Li et al., 2022a; Liu et al., 2022; Fonseca et al., 2023), the Integral Autoencoder (Ong et al., 2022), Multiwavelet Neural Operators (Gupta et al., 2021; 2022), and others (Lütjens et al., 2022; Hu et al., 2022; Boussif et al., 2022; de Hoop et al., 2022a;b; Ryck & Mishra, 2022; Seidman et al., 2022; Zhang et al., 2023; Lee, 2023; Zhu et al., 2023a).

Related studies on Transformers. Attention mechanism-based models have become the state of the art in many areas since Vaswani et al. (2017). One of the most important and attractive aspects of the attention mechanism is its unparalleled capability to efficiently model non-local long-range interactions (Katharopoulos et al., 2020; Choromanski et al., 2021; Nguyen et al., 2021). The relation of attention to kernel learning was first studied in Tsai et al. (2019) and later connected with random features (Peng et al., 2021); see also Han et al. (2022). Among inverse problems, Transformers have been applied in medical imaging applications, including segmentation (Zhou et al., 2021; Hatamizadeh et al., 2022; Petit et al., 2021), X-ray (Tanzi et al., 2022), magnetic resonance imaging (MRI) (He et al., 2022), ultrasound (Perera et al., 2021), and optical coherence tomography (OCT) (Song et al., 2021). To the best of our knowledge, no work in the literature establishes an architectural connection between the attention mechanism in the Transformer and the mathematical structure of PDE-based inverse problems.

2.1. CONTRIBUTIONS

• A structure-conforming network architecture. Inspired by the EIT theory and classic DSM, we decompose the approximation of the inverse operator into a harmonic extension and an integral operator with learnable non-local kernels that has an attention-like structure. Additionally, the attention architecture is reinterpreted through a Fredholm integral operator to rationalize the application of the Transformer to the boundary value inverse problem.

• Theoretical and experimental justification for the advantage of the Transformer. We prove that, in Transformers, the modified attention can represent target functions exhibiting higher-frequency natures from lower-frequency input features. A comparative study in the experiments demonstrates a favorable match between the Transformer and the benchmark problem.

3. INTERPLAY BETWEEN MATHEMATICS AND NEURAL ARCHITECTURES

In this section, we articulate that the triple tensor product in the attention mechanism matches exceptionally well with representing a solution in the inverse operator theory of EIT. In pursuing this end goal, this study tries to answer the following motivating questions:

(Q1) What is an appropriate finite-dimensional data format as input to the neural network?
(Q2) Is there a suitable neural network matching the mathematical structure?

3.1. FROM EIT TO OPERATOR LEARNING

In the case of full measurement, the operator F^{-1} can be well approximated through a large number of (σ, A_σ) data pairs. This mechanism essentially results in a tensor2tensor mapping/operator from A_σ to the imagery data representing σ. In particular, the BCR-Net (Fan & Ying, 2020) is a DNN approximation falling into this category. However, when there are very limited boundary data pairs accessible, the task of learning the full matrix A_σ becomes obscure, which complicates the development of a tensor2tensor pipeline for operator learning.

Operator learning problems for EIT. We first introduce several attainable approximations of infinite-dimensional spaces by finite-dimensional counterparts for the proposed method.
(1) Spatial discretization. Let Ω_h be a mesh of Ω with mesh spacing h, and let {z_j}_{j=1}^M =: M be the set of grid points representing the 2D discretization of continuous signals. Then a function u defined almost everywhere in Ω can be approximated by a vector u_h ∈ R^M.
(2) Sampling of D. We generate N samples of D with different shapes and locations following certain distributions. For example, elliptical inclusions with random semi-axes and centers are generated as a benchmark (see Appendix C.1 for details). With the known σ_0 and σ_1, set the corresponding data set D = {σ^(1), σ^(2), ..., σ^(N)}. N is usually large enough to represent the field applications of interest.
(3) Sampling of NtD maps.
For the k-th sample of D, we generate L pairs of boundary data {(g_l^(k), f_l^(k))}_{l=1}^L by solving PDE (1), which can be thought of as sampling columns of the infinite matrix A_σ representing the NtD map. With the proposed method, L can be chosen to be very small (≤ 3) and still yield satisfactory results. Our task is to find a parameterized mapping G_θ to approximate F_{L,D}^{-1} in (6) by minimizing

J(θ) := (1/N) Σ_{k=1}^N ∥G_θ({(g_l^(k), f_l^(k))}_{l=1}^L) − σ^(k)∥², (8)

for a suitable norm ∥·∥. The hyperparameters N, h, L affect the finite-dimensional approximation of the infinite-dimensional problem in the following way: h determines the resolution to approximate D; N affects the representativity of the training data set; L decides how much of the infinite spectral information of Λ_σ can be accessed.
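As a concrete sketch, the empirical risk above can be written in a few lines of NumPy. Here `G_theta` is a hypothetical linear stand-in for the learned operator, and both the preprocessed features and the targets are random placeholders rather than actual EIT data:

```python
import numpy as np

# Sketch of the empirical risk J(theta) in (8). G_theta is a hypothetical
# linear stand-in for the operator learner; features and targets are random
# placeholders, not actual boundary data or conductivities.
rng = np.random.default_rng(0)
N, M = 8, 64                                        # N samples, M grid points
features = rng.standard_normal((N, M))              # stand-in for H({(g_l, f_l)})
targets = (rng.random((N, M)) > 0.5).astype(float)  # sigma^(k) on the grid

W = 0.01 * rng.standard_normal((M, M))              # "theta": toy trainable map

def G_theta(x, W):
    return x @ W                                    # placeholder operator learner

def J(W):
    recon = G_theta(features, W)
    # mean over samples of the squared discrete L2 reconstruction error
    return np.mean(np.sum((recon - targets) ** 2, axis=1))

loss = J(W)
```

In practice G_θ is the Transformer described below and the norm is a discrete L² norm on the grid; minimizing J(θ) over θ is done with stochastic gradient methods.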

3.2. FROM HARMONIC EXTENSION TO TENSOR-TO-TENSOR

To establish the connection between the problem of interest and the attention used in Transformers, we first construct higher-dimensional tensors from the 1D boundary data. The key is a harmonic extension of the boundary data, which can be viewed as a PDE-based feature map. We begin with a theorem to motivate it. Let I_D be the characteristic function of D, named the index function, i.e., I_D(x) = 1 if x ∈ D and I_D(x) = 0 if x ∉ D. Thus, σ can be directly identified by the shape of D through the formula σ = σ_1 I_D + σ_0 (1 − I_D). In this setup, reconstructing σ is equivalent to reconstructing I_D. Without loss of generality, we let σ_1 > σ_0. Λ_{σ_0} is understood as the NtD map with σ = σ_0 on the whole domain, i.e., it is taken as the known background conductivity (no inclusion), and thus Λ_{σ_0} g can be readily computed. Then f − Λ_{σ_0} g = (Λ_σ − Λ_{σ_0}) g measures the difference between the NtD maps and encodes the information of σ. The operator Λ_σ − Λ_{σ_0} is positive definite, and it has eigenvalues {λ_l}_{l=1}^∞ with λ_1 > λ_2 > ⋯ > 0 (Cheng et al., 1989).

Theorem 1. Suppose that the 1D boundary data g_l is the eigenfunction of Λ_σ − Λ_{σ_0} corresponding to the l-th eigenvalue λ_l, and let the 2D data functions ϕ_l be obtained by solving

−Δϕ_l = 0 in Ω,  n · ∇ϕ_l = f_l − Λ_{σ_0} g_l on ∂Ω,  ∫_{∂Ω} ϕ_l ds = 0,  for l = 1, 2, .... (9)

Define the space S_L = Span{∂_{x_1} ϕ_l ∂_{x_2} ϕ_l : l = 1, ..., L} and the dictionary 𝒮_L = {a_1 + a_2 arctan(a_3 v) : v ∈ S_L, a_1, a_2, a_3 ∈ R}. Then, for any ϵ > 0, one can construct an index function I_D^L ∈ 𝒮_L such that sup_{x∈Ω} |I_D(x) − I_D^L(x)| ≤ ϵ provided L is large enough.

The full proof of Theorem 1 can be found in Appendix E. This theorem gives a constructive approach for approximating F^{-1} and justifies the practice of approximating I_D when L is large enough. The function ϕ_l is called the harmonic extension of f_l − Λ_{σ_0} g_l. On the other hand, it relies on "knowing" the entire NtD map Λ_σ to construct I_D^L explicitly.
Namely, the coefficients of ∂_{x_1} ϕ_l ∂_{x_2} ϕ_l depend on a large amount of spectral information (eigenvalues and eigenfunctions) of Λ_σ, which may not be available in practice. Thus, the mathematics in this theorem alone does not provide an architectural hint for building a structure-conforming DNN. To further dig out the hidden structure, we focus on the case of a single measurement, i.e., L = 1. In this setting, it is possible to derive an explicit and simple formula to approximate I_D, which is achieved by the classical direct sampling methods (DSM) (Chow et al., 2014; 2015; Kang et al., 2018; Ito et al., 2013; Ji et al., 2019; Harris & Kleefeld, 2019; Ahn et al., 2020). For EIT, the approximation

I_D(x) ≈ I_D^1(x) := R(x) (d(x) · ∇ϕ(x)), x ∈ Ω, d(x) ∈ R², (10)

is derived in Chow et al. (2014), where (see a much more detailed formulation in Appendix D)
• ϕ is the harmonic extension of f − Λ_{σ_0} g with certain noise, i.e., of f − Λ_{σ_0} g + ξ;
• d(x) is called a probing direction and can be chosen empirically as d(x) = ∇ϕ(x)/∥∇ϕ(x)∥;
• R(x) = (∥f − Λ_{σ_0} g∥_{∂Ω} |η_x|_Y)^{-1}, where η_x is a function of d(x) and is measured in the |·|_Y semi-norm on the boundary ∂Ω.

Both ϕ and η_x can be computed efficiently by traditional fast PDE solvers, such as finite difference or finite element methods based on Ω_h in Section 3.1. However, the reconstruction accuracy is much limited by the single measurement, the nonparametric ansatz, and the empirical choices of d(x) and |·|_Y. These restrictions leave room for the DL methodology; see Appendix D for a detailed discussion. Constructing the harmonic extension (2D features) from boundary data (1D signal input with limited depth) contributes to the desired high-quality reconstruction.
First, harmonic functions are highly smooth away from the boundary: by PDE theory, the solution automatically smooths out the noise on the boundary (Gilbarg & Trudinger, 2001, Chapter 8), which makes the reconstruction highly robust with respect to noise (e.g., see Figure 3 in Appendix C.1). Second, in terms of using certain backbone networks to generate features for downstream tasks, harmonic extensions can be understood as a problem-specific way to design higher-dimensional feature maps (Álvarez et al., 2012), which renders samples more separable on a higher-dimensional data manifold than the one with merely boundary data. Figure 1 illustrates this procedure. The information of σ is deeply hidden in ϕ. As shown in Figure 1 (see also Appendix C), one cannot observe any pattern of σ directly from ϕ. This is different from, and more challenging than, the inverse problems studied in Bhattacharya et al. (2021); Khoo et al. (2021), which aim to reconstruct 2D targets from the much more informative 2D internal data of u. In summary, both Theorem 1 and the DSM formula (10) offer inspiration for a potential answer to (Q1): the harmonic extension ϕ (2D features) of f − Λ_{σ_0} g (1D measurements) naturally encodes the information of the true characteristic function I_D (2D targets). As there is a pointwise correspondence between the harmonic extensions and the targets at the 2D grid points, a tensor representation of ∇ϕ_l at these grid points can then be used as the input to a tensor2tensor-type DNN to learn I_D. Naturally, the grids serve explicitly as the positional embedding. In comparison, the positional information is buried more deeply in the 1D measurements. As shown in Figure 1, G_θ can be nicely decoupled into a composition of a learnable neural network operator T_θ and a non-learnable PDE-based feature map H, i.e., G_θ = T_θ ∘ H. The architecture of T_θ shall be our interest henceforth.
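To make the feature map H concrete, here is a minimal spectral sketch of the harmonic extension, under the simplifying assumption that Ω is the unit disk (the paper's domain and data are more general, and the boundary data below are illustrative). For mean-zero Neumann data ψ on the circle, the extension is ϕ(r, t) = Σ_{n≠0} (ψ̂_n/|n|) r^{|n|} e^{int}:

```python
import numpy as np

# Spectral sketch of the harmonic extension on the unit disk (assumption:
# Omega is the unit disk). For mean-zero Neumann data psi, the solution of
#   -Laplace(phi) = 0,  d(phi)/dn = psi on the boundary,  boundary mean = 0,
# is phi(r, t) = sum_{n != 0} (psi_hat_n / |n|) r^{|n|} e^{i n t}.
K = 256
t = 2 * np.pi * np.arange(K) / K
psi = np.cos(3 * t) - 0.5 * np.sin(5 * t)     # toy boundary data, mean zero

psi_hat = np.fft.fft(psi) / K                 # Fourier coefficients of psi
n = np.fft.fftfreq(K, d=1.0 / K)              # integer mode numbers
coef = np.zeros(K, dtype=complex)
nonzero = n != 0
coef[nonzero] = psi_hat[nonzero] / np.abs(n[nonzero])

def harmonic_extension(r, theta):
    """Evaluate phi at the polar point (r, theta), 0 <= r <= 1."""
    modes = coef * (r ** np.abs(n)) * np.exp(1j * n * theta)
    return modes.sum().real

val_center = harmonic_extension(0.0, 0.0)     # mean value property: equals 0
```

Evaluating `harmonic_extension` on a polar grid yields the 2D input channel; note how the r^{|n|} factor damps high-frequency boundary noise in the interior, which is the smoothing effect discussed above.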

3.3. FROM CHANNELS IN ATTENTION TO BASIS IN INTEGRAL TRANSFORM

In this subsection, a modified attention mechanism is proposed as the basic block of the tensor2tensor-type mapping introduced in the next two subsections. Its reformulation conforms with one of the most used tools in applied mathematics: the integral transform. In many applications, such as inverse problems, the interaction (kernel) does not have any explicit form, which meshes well with the DL methodology philosophically. In fact, this is precisely the situation of the EIT problem considered. Let the input of an encoder attention block be x_h ∈ R^{M×c} with c channels; then the query Q, key K, and value V are generated by three learnable projection matrices θ := {W^Q, W^K, W^V} ⊂ R^{c×c}:

⋄ = x_h W^⋄, ⋄ ∈ {Q, K, V}.

Here c ≫ L is the number of expanded channels for the latent representations. A modified dot-product attention is proposed as follows:

U = Attn(x_h) := α nl_Q(Q) nl_K(K)^⊤ V = α Q̃ K̃^⊤ V ∈ R^{M×c}, (11)

where nl_Q(·) and nl_K(·) are two learnable normalizations. Different from Nguyen & Salazar (2019); Xiong et al. (2020), this pre-inner-product normalization is applied right before the matrix multiplication of query and key. This practice takes inspiration from the normalization in the index function kernel integrals (10) and (20); see also Boyd (2001), where the normalization for orthogonal bases essentially uses the (pseudo)inverse of the Gram matrices. In practice, layer normalization (Ba et al., 2016) or batch normalization (Ioffe & Szegedy, 2015) is used as a cheap alternative. The constant α = h² is a mesh-based weight such that the summation becomes an approximation to an integral. To elaborate these rationales, the j-th entry of the i-th row U_i of U is

(U_i)_j = α Ã_{i•} · Ṽ^j,

in which Ã_{i•} is the i-th row of Ã = Q̃ K̃^⊤ and Ṽ^j := V_{•j} is the j-th column of V. Thus, applying this to every column 1 ≤ j ≤ c, attention (11) becomes a basis expansion representation for the i-th row U_i:

U_i = α Ã_{i•} [V_1; ...; V_M] = Σ_{m=1}^M α Ã_{im} V_m =: Σ_{m=1}^M Ã(Q_i, K_m) V_m. (12)

Here, α Ã_{i•} contains the coefficients for the linear combination of the rows {V_m}_{m=1}^M. This set {V_m}_{m=1}^M forms V's row space, and it further forms each row of the output U through multiplication with Ã. Ã(·, ·) in (12) stands for the attention kernel, which aggregates the pixel-wise feature maps to measure how the projected latent representations interact. Moreover, the latent representation U_i in an encoder layer is spanned by the row space of V and is nonlinearly updated from layer to layer. For x_h, U, Q, K, V, a set of feature maps is assumed to exist: for example, u(·) maps R² → R^{1×c}, i.e., U_i = u(z_i) = [u_1(z_i), ..., u_c(z_i)]; see, e.g., Choromanski et al. (2021). Then an instance-dependent kernel κ_θ(·, ·) : R² × R² → R can be defined by

Ã(Q_i, K_j) := α ⟨Q̃_i, K̃_j⟩ = α ⟨q(z_i), k(z_j)⟩ =: α κ_θ(z_i, z_j). (13)

Now the discrete kernel Ã(·, ·) with tensorial input is rewritten as this kernel κ_θ(·, ·); thus, the dot-product attention is expressed as a nonlinear integral transform for the l-th channel:

u_l(z) = α Σ_{x∈M} q(z) · k(x) v_l(x) δ_x ≈ ∫_Ω κ_θ(z, x) v_l(x) dμ(x), 1 ≤ l ≤ c. (14)

Through certain minimization such as (8), the backpropagation updates θ, which further leads to a new set of latent representations. This procedure can be viewed as an iterative method to update the basis residing in each channel by solving the Fredholm integral equation of the first kind in (14). To connect attention with inverse problems, the multiplicative structure of the kernel integral form of attention (14) is particularly useful: (14) is of the type of Pincherle-Goursat (degenerate) kernels (Kress, 1999, Chapter 11) and approximates the full kernel using only a finite number of bases. The number of learned basis functions in expansion (12) depends on the number of channels c.
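A minimal NumPy sketch of the modified attention (11), assuming layer normalization as the cheap stand-in for nl_Q and nl_K and a uniform grid so that α = h²; all weights below are random placeholders:

```python
import numpy as np

# Sketch of the modified dot-product attention (11): normalizations are
# applied to Q and K BEFORE their product (no softmax), and the quadrature
# weight alpha = h^2 turns the row sums into an approximate kernel integral.
# Shapes and the layer-norm stand-in are illustrative assumptions.
rng = np.random.default_rng(0)
M, c = 64, 16                        # M grid points, c latent channels
h = 1.0 / np.sqrt(M)                 # mesh size of a uniform M-point 2D grid
x = rng.standard_normal((M, c))      # input latent representation x_h

W_Q, W_K, W_V = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))

def layer_norm(z, eps=1e-5):         # cheap stand-in for nl_Q, nl_K
    mu = z.mean(axis=-1, keepdims=True)
    sd = z.std(axis=-1, keepdims=True)
    return (z - mu) / (sd + eps)

Q, K, V = x @ W_Q, x @ W_K, x @ W_V
alpha = h ** 2                       # mesh weight: summation -> integral
U = alpha * layer_norm(Q) @ layer_norm(K).T @ V   # modified attention output
```

Replacing the softmax with a mesh-weighted plain product is what makes each output row a quadrature of the kernel integral in (14).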
Here we show the following theorem; heuristically, it says that, given enough but finitely many channels of latent representations, the attention kernel integral can "bootstrap" in the frequency domain, that is, generate an output representation with higher frequencies than the input. Similar approximation results are impossible for layer-wise propagation in CNNs if one opts for the usual framelet/wavelet interpretation (Ye et al., 2018). For example, if there are no edge-like local features in the input (see Figure 9 and Figure 10 for empirical evidence), a single layer of CNN filters without nonlinearity cannot learn weights to extract edges. The full proof with a more rigorous setting is in Appendix F.

Theorem 2 (Frequency bootstrapping). Suppose there exists a channel l in V such that (V_i)_l = sin(a z_i) for some a ∈ Z^+, and the current finite-channel sum kernel Ã(·, ·) approximates a non-separable kernel to an error of O(ϵ) under a certain norm ∥·∥_X. Then, there exists a set of weights such that a certain channel k′ in the output of (12) approximates sin(a′ z), with Z^+ ∋ a′ > a, to an error of O(ϵ) under the same norm.

The considered inverse problem is essentially to recover higher-frequency eigenpairs of Λ_σ based on lower-frequency data; see, e.g., Figure 1. Λ_σ, together with all its spectral information, can be determined by the recovered inclusion shape. Thus, the existence result in Theorem 2 partially justifies the advantages of adopting the attention mechanism for the considered problem.
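The frequency-bootstrapping heuristic can be checked numerically: the pointwise product of two low-frequency channels, the kind of multiplicative interaction the attention kernel performs in (12), contains a strictly higher frequency, which a single linear filter acting on one channel cannot create. A small FFT demonstration with illustrative frequencies:

```python
import numpy as np

# Numerical illustration of the idea behind Theorem 2: the product of two
# low-frequency channels contains a strictly higher frequency (a + 1 here),
# since sin(a z) sin(z) = (cos((a-1)z) - cos((a+1)z)) / 2.
N = 256
z = 2 * np.pi * np.arange(N) / N
a = 3
v = np.sin(a * z)        # one input channel, frequency a
w = np.sin(z)            # another low-frequency channel

prod = v * w             # multiplicative (attention-like) interaction
spec = np.abs(np.fft.rfft(prod)) / N

peak_freqs = np.flatnonzero(spec > 0.1)   # frequencies carrying energy
```

The spectrum of the product peaks at frequencies a − 1 = 2 and a + 1 = 4; the latter exceeds every frequency present in either input channel, which no single linear convolution of `v` alone could produce.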

3.4. FROM INDEX FUNCTION INTEGRAL TO TRANSFORMER

In (10), the probing direction d(x), the inner product d(x) · ∇ϕ(x), and the norm |·|_Y are used as ingredients to form a certain non-local, instance-based, learnable kernel integration. This non-localness is a fundamental trait of many inverse problems, in that I_D(x) depends on the entire data function. The discretization of the modified index function is then shown to match the multiplicative structure of the modified attention mechanism in (11). In the forthcoming derivations, K(x, y), Q(x, y), and a self-adjoint positive definite linear operator V : L²(∂Ω) → L²(∂Ω) are shown to yield the emblematic Q-K-V structure of attention. To this end, we make the following modifications and assumptions to the original index function in (10).

• The reformulation of the index function is motivated by the heuristic that the agglomerated global information of ϕ could be used as "keys" to locate a point x:

Î_D^1(x) := R(x) ∫_Ω d(x) · K(x, y) ∇ϕ(y) dy. (15)

If the ansatz K(x, y) = δ_x(y) is adopted, then (15) reverts to the original formula in (10).

• The probing direction d(x), as the "query", is reasonably assumed to have a global dependence on ϕ:

d(x) := ∫_Ω Q(x, y) ∇ϕ(y) dy. (16)

If Q(x, y) = δ_x(y)/∥∇ϕ(x)∥, then d(x) = ∇ϕ(x)/∥∇ϕ(x)∥, which is the choice of the probing direction in Ikehata (2000); Ikehata & Siltanen (2000); Ikehata (2007).

• In the quantity R(x) in (10), the key ingredient is |·|_Y, which is assumed to have the following form:

|η_x|²_Y := (V η_x, η_x)_{L²(∂Ω)}. (17)

Based on the assumptions (15) to (17), we derive a matrix representation approximating the new index function on a grid, which accords well with an attention-like architecture. Denote by ϕ_n the vector that interpolates ∂_{x_n} ϕ at the grid points {z_j}, n = 1, 2. Here, we sketch the outline of the derivation and present the details in Appendix D.
We discretize the variable x by the grid points z_i in (15) and obtain an approximation to the integral:

∫_Ω K(z_i, y) ∂_{x_n} ϕ(y) dy ≈ Σ_j ω_j K(z_i, z_j) ∂_{x_n} ϕ(z_j) =: k_i^T ϕ_n, (18)

where {ω_j} are some integration quadrature weights. We then consider (16) and focus on one component d_n(x) of d(x). With a suitably approximated integral, it can be rewritten as

d_n(z_i) ≈ Σ_j ω_j Q(z_i, z_j) ∂_{x_n} ϕ(z_j) =: q_i^T ϕ_n. (19)

Note that the self-adjoint positive definite operator V in (17) can be parameterized by a symmetric positive definite (SPD) matrix denoted by V. There exist vectors v_{n,i} such that |η_{z_i}|²_Y ≈ Σ_n ϕ_n^T v_{n,i} v_{n,i}^T ϕ_n. Then, the modified index function can be written as

Î_D^1(z_i) ≈ ∥f − Λ_{σ_0} g∥^{-1}_{∂Ω} (Σ_n ϕ_n^T v_{n,i} v_{n,i}^T ϕ_n)^{-1/2} Σ_n ϕ_n^T q_i k_i^T ϕ_n. (20)

Now, using the notation from Section 3.3, we denote the learnable kernel matrices and the input vector by: for ⋄ ∈ {Q, K, V} and u ∈ {q, k, v},

W^⋄ = [ u_{1,1} ⋯ u_{1,M} ; u_{2,1} ⋯ u_{2,M} ] ∈ R^{2M×M}, x_h = [ϕ_1 ϕ_2] ∈ R^{1×2M}. (21)

Then, we can rewrite (20) as

[Î_D^1(z_i)]_{i=1}^M ≈ C_{f,g} (x_h W^Q ∗ x_h W^K)/(x_h W^V ∗ x_h W^V)^{1/2}, (22)

where C_{f,g} = ∥f − Λ_{σ_0} g∥^{-1}_{∂Ω} is a normalization weight, and both ∗ and / are element-wise. Here, we may define Q = x_h W^Q, K = x_h W^K, and V = x_h W^V as the query, keys, and values. The right matrix multiplications in the attention mechanism (11) are low-rank approximations of the ones above. Hence, based on (22), we essentially need to find a function I yielding a vector approximation to the true characteristic function {I_D(z_j)}:

I(Q, K, V) ≈ [I_D(z_i)]_{i=1}^M. (23)

Moreover, when there are L data pairs, the data functions ϕ_l are generated by computing their harmonic extensions as in (9), and each ϕ_l is then treated as a channel of the input data x_h.
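A small NumPy sketch of the elementwise formula (22); the kernel matrices are random placeholders for the learnable W^Q, W^K, W^V, and C_{f,g} is set to 1:

```python
import numpy as np

# Sketch of the discrete index-function formula (22): the query/key product
# is normalized elementwise by the value-based seminorm, mirroring
#   I_D(z_i) ~ C_fg * (x W_Q * x W_K) / (x W_V * x W_V)^{1/2}.
# All matrices are random placeholders for the learned kernels; C_fg = 1.
rng = np.random.default_rng(0)
M = 32
x_h = rng.standard_normal((1, 2 * M))        # [phi_1, phi_2]: interpolated
                                             # partial derivatives on the grid

W_Q, W_K, W_V = (rng.standard_normal((2 * M, M)) for _ in range(3))

q, k, v = x_h @ W_Q, x_h @ W_K, x_h @ W_V    # each of shape (1, M)
eps = 1e-8                                   # guard against division by zero
indicator = (q * k) / np.sqrt(v * v + eps)   # one indicator value per z_i
```

Replacing these elementwise products with the full matrix products of (11) is exactly the generalization the UIT attention block performs.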
In summary, the expressions (22) and (23) reveal that a Transformer may be able to further generalize the classical non-parametrized DSM formula (10) to non-local learnable kernels. Thus, it may have an intrinsic architectural advantage in handling multiple data pairs. In the subsequent EIT benchmarks, we provide a potential answer to question (Q2): the attention architecture is better suited for the reconstruction tasks, as it conforms better with the underlying mathematical structure. The ability of attention to learn global interactions, supported by the non-local kernel interpretation, matches the long-range-dependence nature of inverse problems.

4. EXPERIMENTS

In this section, we present experimental results to show the quality of the reconstruction. The benchmark contains samples of random elliptical inclusions (targets), and the input data have a single channel (L = 1) of the 2D harmonic extension feature computed from the 1D boundary measurements. The training uses the 1cycle schedule and mini-batch Adam for 50 epochs. The evaluated model is taken from the epoch with the best validation metric on a reserved subset. Several baseline models are compared: the CNN-based U-Nets (Ronneberger et al., 2015; Guo & Jiang, 2020); the state-of-the-art operator learner Fourier Neural Operator (FNO) (Li et al., 2021a) and its variant with a token-mixing layer (Guibas et al., 2022); and the MultiWavelet Neural Operator (MWO) (Gupta et al., 2021). The Transformer model of interest is a drop-in replacement of the baseline U-Net and is named the U-Integral Transformer (UIT). UIT uses the kernel-integral-inspired attention (11). We also compare UIT with the linear attention-based Hybrid U-Transformer in Gao et al. (2021), as well as a Hadamard product-based cross-attention U-Transformer in Wang et al. (2022). An ablation study is also conducted by replacing the convolution layers in the U-Net with the attention (11) on the coarsest level. For more details of the hyperparameter setup in the data generation, training, evaluation, and network architectures, please refer to Section 3.1, Appendix C.1, and Appendix C.2. The comparison results can be found in Table 1. Because FNO (AFNO, MWO) keeps only the lower modes in the spectra, it performs relatively poorly in this EIT benchmark, where one needs to recover traits consisting of higher modes (sharp boundary edges of the inclusion) from lower modes (smooth harmonic extensions). Attention-based models are capable of recovering "high-frequency targets from low-frequency data" and generally outperform the CNN-based U-Nets despite having only 1/3 of the parameters.
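For reference, the three reported metrics can be computed in a few lines; the toy masks below are illustrative, and the exact conventions of the benchmark (e.g., thresholds) may differ:

```python
import numpy as np

# Minimal versions of the three metrics reported in Table 1: relative L2
# error, pixelwise binary cross entropy, and the Dice coefficient between
# the thresholded reconstruction and the true inclusion mask.
def rel_l2(pred, true):
    return np.linalg.norm(pred - true) / np.linalg.norm(true)

def cross_entropy(prob, mask, eps=1e-7):
    p = np.clip(prob, eps, 1 - eps)
    return -np.mean(mask * np.log(p) + (1 - mask) * np.log(1 - p))

def dice(pred_mask, true_mask):
    inter = np.sum(pred_mask * true_mask)
    return 2 * inter / (np.sum(pred_mask) + np.sum(true_mask))

true = np.zeros((8, 8)); true[2:5, 2:5] = 1.0   # toy inclusion mask
prob = np.where(true > 0, 0.9, 0.1)             # a good soft reconstruction
d = dice((prob > 0.5).astype(float), true)      # 1.0 for a perfect mask
```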
Another highlight is that the proposed models are highly robust to noise, thanks to the unique PDE-based feature map through harmonic extension. The proposed models can recover the buried inclusion under moderately large noise (5%) and even an extreme amount of noise (20%), which can be disastrous for many classical methods.

Table 1: Evaluation metrics of the EIT benchmark tests. τ: the normalized relative strength of the noise added to the boundary data before the harmonic extension; see Appendix C for details. L²-error and cross entropy: the closer to 0 the better; Dice coefficient: the closer to 1 the better.
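The evaluation metrics reported in Table 1 can be computed as in the following minimal sketch; the thresholding of the network output at 0.5 for the Dice coefficient is our own assumption, not a choice stated in the paper.

```python
import numpy as np

def dice_coefficient(pred, truth, thresh=0.5):
    """Dice coefficient between predicted and ground-truth inclusion maps
    (closer to 1 is better); thresholding at 0.5 is our assumption."""
    p, t = pred >= thresh, truth >= thresh
    denom = p.sum() + t.sum()
    return 2.0 * np.logical_and(p, t).sum() / denom if denom else 1.0

def relative_l2_error(pred, truth):
    """Relative L2 error (closer to 0 is better)."""
    return np.linalg.norm(pred - truth) / np.linalg.norm(truth)
```

A perfect reconstruction yields a Dice coefficient of 1 and a relative L² error of 0, matching the conventions in the table caption.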

5. CONCLUSION

For a boundary value inverse problem, we propose a novel operator learner based on the mathematical structure of the inverse operator and the Transformer. The proposed architecture consists of two components: the first is a harmonic extension of the boundary data (a PDE-based feature map), and the second is a modified attention mechanism derived from the classical DSM by introducing learnable non-local integral kernels. The evaluation accuracy on the benchmark problems surpasses the widely used CNN-based U-Net and the best operator learner FNO. This research strengthens the insight that attention is an adaptable neural architecture that can incorporate a priori mathematical knowledge in the design of more physics-compatible DNN architectures. However, we acknowledge some limitations: in this study, the σ to be recovered relies on a piecewise constant assumption. For many EIT applications in medical imaging and industrial monitoring, σ may involve non-sharp transitions or even contain highly anisotropic/multiscale behaviors; see Appendix G for more discussion of limitations and possible approaches.

A TABLE OF NOTATIONS

Table 2: Notations used in this work, in approximate chronological order, and their meanings.

Notation: Meaning

Ω: an underlying spatial domain in R²
D: a subdomain of Ω (not necessarily topologically connected)
∂D, ∂Ω: the boundaries of D and Ω, 1-dimensional manifolds
∇u: the gradient of a function u, ∇u(x) = (∂_{x₁}u(x), ∂_{x₂}u(x))
∥·∥_ω: the L²-norm on a region ω
∥·∥ = ∥·∥_Ω: the L²-norm on the whole domain Ω

For EIT, an immediate question is whether F⁻¹ and F⁻¹_L in (4) and (5) are well-defined, namely whether σ can be uniquely determined. In fact, for the case of full measurements (L = ∞), the uniqueness for F⁻¹ has been well established (Brühl, 2001; Hanke & Brühl, 2003; Astala & Päivärinta, 2006; Nachman, 1996; Kohn & Vogelius, 1984; Sylvester & Uhlmann, 1987). It is worthwhile to point out that, in this case, σ is not necessarily a piecewise constant function. In (1), we present a simplified case for the purposes of illustration as well as benchmarking. In general, with the full spectral information of the NtD map, σ can be uniquely determined as a general positive function.

If infinitely many eigenpairs are known, then the operator itself can be precisely characterized using infinitely many feature channels by Reproducing Kernel Hilbert Space (RKHS) theory, e.g., Mercer (1909); Aronszajn (1950); Minh et al. (2006); Morris (2015); Kadri et al. (2016); Lu et al. (2022). In the context of EIT, this is known as the "full measurement". A more challenging and practical problem is to recover σ from only finitely many boundary data pairs. A common practice in the theoretical study of reconstruction from finite measurements is the assumption that σ is a piecewise constant function; the task is then usually set to recover the shape and location of the inclusion D. Otherwise, the problem is too ill-posed.
With finite measurements, the uniqueness of the inclusion remains a long-standing theoretical open problem; it has been established only for several special classes of inclusion shapes, such as the convex cylinders in Isakov & Powell (1990) or the convex polyhedra in Barceló et al. (1994). We refer readers to the counterexamples in Kang & Seo (2001), where a two- or three-dimensional ball may not be identified uniquely by one single measurement if the values of σ₀ and σ₁ are unknown. Furthermore, here we provide one example to illustrate the difficulty of the reconstruction procedure (Pidcock et al., 1995). Let Ω be the unit disk, let D be a concentric disk with radius ρ < 1, and define σ(x) = 1 if ∥x∥ ≥ ρ and σ(x) = σ₁ if ∥x∥ < ρ, with σ₁ < 1 an arbitrary constant. In this case, the eigenpairs of Λ_σ can be calculated explicitly:

$$\lambda_l = \frac{1}{l}\,\frac{1-\rho^{2l}\mu}{1+\rho^{2l}\mu}, \qquad \nu_l = \frac{1}{\sqrt{2\pi}}\cos(l\theta) \ \text{ or } \ \frac{1}{\sqrt{2\pi}}\sin(l\theta), \qquad l = 1, 2, \dots,$$

with $\mu = (1-\sqrt{\sigma_1})/(1+\sqrt{\sigma_1})$; these are exactly the Fourier modes on a circle. In this case, if the set of basis functions $\{g_l\}_{l=1}^\infty$ is chosen as $\{\cos\theta, \sin\theta, \cos 2\theta, \sin 2\theta, \dots\}$, the matrix representation $A_\sigma$ in (3) can be written as an infinite diagonal matrix

$$A_\sigma = \operatorname{diag}\!\left(\frac{1-\rho^{2}\mu}{1+\rho^{2}\mu},\ \frac{1-\rho^{2}\mu}{1+\rho^{2}\mu},\ \frac{1-\rho^{4}\mu}{1+\rho^{4}\mu},\ \frac{1-\rho^{4}\mu}{1+\rho^{4}\mu},\ \dots\right). \tag{25}$$

Thanks to this special geometry, the eigenvalues in (25) clearly determine ρ and σ₁ as follows:

$$\rho^2 = \frac{(1-\lambda_2)(1+\lambda_1)}{(1+\lambda_2)(1-\lambda_1)} \quad\text{and}\quad \mu = \frac{(1-\lambda_1)^2(1+\lambda_2)}{(1+\lambda_1)^2(1-\lambda_2)}. \tag{26}$$

However, in practice, $A_\sigma$ does not have such a simple structure. Approximating $A_\sigma$ itself requires a large number of data pairs that are not available in the considered case. Besides, an accurate approximation of the eigenvalues of $A_\sigma$ is also very expensive. Furthermore, for complex inclusion shapes, two eigenvalues are not sufficient to recover the shape and conductivity values exactly.
Figure 2: Randomly selected samples of elliptic inclusions representing the coefficient σ (left 1-4); a Cartesian mesh Ω_h with grid points {z_j} (right). In computation, the discretization of I_D takes the value 1 at the mesh points of Ω_h inside D and 0 elsewhere.

The noise ξ = ξ(x) below (10) is assumed to be

$$\xi(x) = \big(f(x) - \Lambda_{\sigma_0} g(x)\big)\, \tau\, G(x),$$

where τ specifies the relative strength of the noise and G(x) is drawn from a standard Gaussian distribution, independently with respect to x. As ξ(x) is merely pointwise imposed, the boundary data can be highly rough, even if the ground truth f(·) − Λ_{σ₀}g(·) is chosen to be smooth. Nevertheless, the harmonic extension makes the noise from the boundary data have a minimal impact on the overall reconstruction, thanks to the smoothing property of the inverse Laplacian (−∆)⁻¹; for example, please refer to Figure 3. In the data generation, the harmonic extension feature map is approximated by finite element methods incorporating stencil modifications near the inclusion interfaces (Guo & Lin, 2019; Guo et al., 2019). Similar data augmentation practices using internal data can be found in Nachman et al. (2007); Bal (2013); Jin et al. (2022a). Thanks to the position-wise binary nature of I_D, another choice of loss function during training is the binary cross entropy L(·,·), applied to functions in P, measuring the distance between the ground truth and the network's prediction:

$$L(p_h, u_h) := -\sum_{z\in\mathcal{M}} \Big( p_h(z)\ln\big(u_h(z)\big) + \big(1 - p_h(z)\big)\ln\big(1 - u_h(z)\big) \Big).$$
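The pointwise relative noise model and the binary cross entropy loss above can be sketched as follows (a minimal illustration; the clipping constant `eps` is our own numerical safeguard, not part of the paper's definition):

```python
import numpy as np

def add_boundary_noise(data, tau, rng=None):
    """Pointwise relative Gaussian noise xi(x) = (f - Lambda g)(x) * tau * G(x),
    with G i.i.d. standard normal, following the noise model stated above."""
    rng = np.random.default_rng(rng)
    return data + data * tau * rng.standard_normal(data.shape)

def binary_cross_entropy(p, u, eps=1e-12):
    """Position-wise binary cross entropy between the ground-truth indicator
    p in {0,1} and a prediction u in (0,1), summed over mesh points."""
    u = np.clip(u, eps, 1.0 - eps)
    return -np.sum(p * np.log(u) + (1.0 - p) * np.log(1.0 - u))
```

With τ = 0.05 or τ = 0.2 this reproduces the 5% and 20% noise levels reported in the benchmarks.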

C.2 NETWORK ARCHITECTURE

The architectural hyperparameters, together with a comparison of training and evaluation costs for all the models compared in the EIT reconstruction task, can be found in Table 3.

U-Integral Transformer

• Overall architecture. The U-Integral-Transformer architecture is a drop-in replacement of the standard CNN-based U-Net baseline model (7.7m) in Table 1. The CNN-based U-Net is used in DL-based approaches for boundary value inverse problems in Guo & Jiang (2020); Guo et al. (2021); Le et al. (2022). One novelty is that the input is a tensor concatenating different measurement matrices as different channels; a similar practice can be found in Brandstetter et al. (2023). As in the baseline U-Net, the UIT has three downsampling layers as the encoder (feature extractor). The downsampling layers map m × m latent representations to m/2 × m/2 and expand the number of channels from C to 2C. To leverage the "basis ⇔ channel" interpretation and the basis-update nature of the attention mechanism in Section 3.3, the proposed block is first added on the coarsest grid, which has the largest number of channels. The UIT upsampling layers act as the decoder (feature selector); they map m/2 × m/2 latent representations to m × m and shrink the number of channels from 2C to C. In these upsampling layers, attention blocks are applied on each cross-layer propagation to compute the interaction between the latent representations on the coarse and fine grids (see below). Please refer to Figure 8 for a high-level encoder-decoder schematic.

Figure 5: The harmonic extension feature maps ϕ_l (left 1-3, the different input channels to the neural network) corresponding to a randomly chosen sample's inclusion map (right). No visible relevance to the ground truth is shown. The layered heatmap appearance is adopted in the plotly contour plot for aesthetic purposes only; no edge-like or layer-like features can be observed in the actual harmonic extension feature map input.

• Double convolution block. The double convolution block is modified from the one commonly seen in Computer Vision (CV) models, such as ResNet (He et al., 2016).
We modify this block such that, when used in an attention block, the batch normalization (Ioffe & Szegedy, 2015) can be replaced by layer normalization (Ba et al., 2016), which can be understood as a learnable diagonal approximation to the inverses of the Gram matrices.

• Positional embedding. At each resolution, the 2D Euclidean coordinates of an m × m regular Cartesian grid are passed through a channel expansion by a learnable linear layer and then added to each latent representation. This choice of positional embedding enables bilinear interpolation between the coarse and fine grids or vice versa (see below).

• Mesh-normalized attention. The scaled dot-product attention in the network is chosen to be the integral kernel attention in (11) with a mesh-based normalization. Please refer to Figure 7 for a diagram of a single attention head.

• Interpolation. Instead of the max pooling used in the standard CNN-based U-Net, we opt for a bilinear interpolation on the Cartesian grid to map a latent representation from the fine grid to the coarse grid or vice versa. Note that, in the upsampling layers, the interpolation's outputs are fed directly into an attention block that computes the interaction of latent representations between the coarse and fine grids (see below).

• Coarse-fine attention in up blocks. A modified attention from Section 3.3 with a pre-inner-product normalization replaces the convolution layer on the coarsest level. The skip connections from the encoder latent representations to those in the decoder are generated using an architecture similar to the cross attention used in Petit et al. (2021). Q and K are generated from the latent representation functions on the same coarser grid; as such, the attention kernel measuring the interaction between different channels is built from the coarse grid, while V is associated with a finer grid. Compared with the one in Petit et al. (2021), the modified attention in our method is inspired by the kernel integral for a PDE problem; thus, it has neither (1) softmax normalization nor (2) a Hadamard product-type skip connection.

Other operator learners compared

• Fourier Neural Operator (FNO) and variants. The Fourier Neural Operator (FNO2d) learns convolutional filters in the frequency domain for some pre-selected modes, efficiently capturing globally supported spatial interactions for those modes. The weight filter in the frequency domain multiplies the lowest modes in the latent representations (the four corners in the FFT). The Adaptive Fourier Neural Operator (AFNO2d) adds a token-mixing layer, as shown in Figure 2 of Guibas et al. (2022), appended to every spectral convolution layer of the baseline FNO2d model.
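The softmax-free, mesh-normalized attention described above can be sketched in a few lines; this is an illustrative reading of (11) under our own naming, not the paper's released implementation.

```python
import numpy as np

def integral_kernel_attention(phi, WQ, WK, WV, h):
    """Softmax-free attention as a discretized kernel integral (a sketch in
    the spirit of Eq. (11); the weight names are illustrative assumptions).
    phi: (n, C) latent representation sampled at n grid points;
    h:   quadrature weight (mesh size), replacing the softmax normalization."""
    Q, K, V = phi @ WQ, phi @ WK, phi @ WV
    # (Q @ K.T)[i, j] plays the role of a learnable non-local kernel
    # kappa(z_i, z_j); multiplying by the mesh weight h turns the row sum
    # into a quadrature approximation of  integral kappa(z_i, y) v(y) dy.
    return h * (Q @ K.T) @ V
```

Replacing softmax with the mesh weight h keeps the operation a plain quadrature, so the learned kernel is not constrained to be a probability distribution over grid points.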

D FROM DSM TO TRANSFORMER

This section gives a more detailed presentation of how the attention-like operator is derived from the DSM ansatz (15) with learnable kernels. We begin by recalling the original index function from (10):

$$I_D(x) := C_{f,g}\,\frac{d(x)\cdot\nabla\phi(x)}{|\eta_x|_{H^s(\partial\Omega)}},$$

where $C_{f,g} = \|f - \Lambda_{\sigma_0} g\|^{-1}_{L^2(\partial\Omega)}$ is a constant, and the functions $\phi$ and $\eta_x$ satisfy

$$-\Delta\phi = 0 \ \text{ in } \Omega, \qquad n\cdot\nabla\phi = (f - \Lambda_{\sigma_0} g) + \xi \ \text{ on } \partial\Omega, \qquad \int_{\partial\Omega}\phi\,ds = 0, \qquad |\eta_x|^2_Y := (\mathcal{V}\eta_x, \eta_x)_{L^2(\partial\Omega)}.$$

Applying a quadrature rule to (33) with quadrature points $z_j$, i.e., the grid points of $\Omega_h$, and weights $\{\omega_j\}$, we obtain an approximation to the integral

$$\int_\Omega K(z_i, y)\,\partial_{x_n}\phi(y)\,dy \approx \sum_j \omega_j K(z_i, z_j)\,\partial_{x_n}\phi(z_j) =: k_i^T\phi_n,$$

i.e., $k_i^T$ is the vector $[\omega_j K(z_i, z_j)]_j$. For (34), we consider one component $d_n(x)$ of $d(x)$. With the same rule to compute the integral, (16) can be written as

$$d_n(z_i) \approx \sum_j \omega_j Q(z_i, z_j)\,\partial_{x_n}\phi(z_j) =: q_i^T\phi_n,$$

i.e., $q_i^T$ is the vector $[\omega_j Q(z_i, z_j)]_j$. Next, we proceed to express $|\eta_{z_i}|_Y$ by discretizing the variational form in (32) using a linear finite element method (FEM). Applying integration by parts, the weak form of (32) at $x = z_i$ is

$$\int_\Omega \nabla\eta_{z_i}\cdot\nabla\psi\,dy = \int_\Omega -d(z_i)\cdot\nabla\delta_{z_i}(y)\,\psi(y)\,dy = d(z_i)\cdot\nabla\psi(z_i),$$

for any test function $\psi \in H^1_0(\Omega)$. Here, we let $\{\psi^{(j)}\}_{j=1}^M$ be the collection of finite element basis functions, and for fixed $z_i$ we further let $\psi_{n,i}$ be the vector approximating $[\partial_{x_n}\psi^{(j)}(z_i)]_{j=1}^M$. Denote by $\eta_i$ the vector approximating $\{\eta_{z_i}(z_j)\}_{j=1}^M$. Introduce the matrices:

• $B$: the finite element/finite difference discretization of $-\Delta$ on $\Omega_h$ (a discrete Laplacian), coupled with the Neumann boundary condition and the zero-integral normalization condition in (9).
• $R$: the matrix that restricts a vector defined at interior grid points to one defined on $\partial\Omega$.

Then, the finite element discretization of (38) yields the linear system

$$B\eta_i = \sum_n d_n(z_i)\,\psi_{n,i} \approx \sum_n \psi_{n,i}\,q_i^T\phi_n,$$

where we have used (37).

The trace of $\eta_{z_i}$ on $\partial\Omega$ then admits the approximation

$$\tilde\eta_i := \eta_{z_i}|_{\partial\Omega} \approx \sum_n R B^{-1}\psi_{n,i}\,q_i^T\phi_n.$$

Now we can discretize (35). Note that the trace of a linear finite element space on $\partial\Omega$ is still a continuous piecewise linear space, denoted $S_h(\partial\Omega)$. The self-adjoint positive definite operator $\mathcal{V}$ can then be parameterized by a symmetric positive definite (SPD) matrix $V$ acting on $S_h(\partial\Omega)$. We approximate $|\eta_{z_i}|^2_Y$ as

$$|\eta_{z_i}|^2_Y \approx \tilde\eta_i^T V \tilde\eta_i \approx \sum_n \big(\psi_{n,i}^T B^{-1} R^T V R B^{-1}\psi_{n,i}\big)\,\phi_n^T q_i\,q_i^T\phi_n, \tag{41}$$

where $\psi_{n,i}^T B^{-1} R^T V R B^{-1}\psi_{n,i} \ge 0$ as $V$ is SPD. Define

$$v_{n,i} = \big(\psi_{n,i}^T B^{-1} R^T V R B^{-1}\psi_{n,i}\big)^{1/2} q_i.$$

This can be considered another learnable vector, since the coefficient of $q_i$ comes from the learnable matrix $V$. Then (41) reduces to

$$|\eta_{z_i}|^2_Y \approx \sum_n \phi_n^T v_{n,i}\,v_{n,i}^T\phi_n. \tag{43}$$

Putting (36), (37), and (43) into (15), we have

$$\hat I_D(z_i) \approx \|f - \Lambda_{\sigma_0} g\|^{-1}_{L^2(\partial\Omega)} \Big(\sum_n \phi_n^T v_{n,i}\,v_{n,i}^T\phi_n\Big)^{-1/2} \sum_n \phi_n^T q_i\,k_i^T\phi_n.$$

Now, using the notation in (21), we arrive at the desired representation (22).
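The final discrete index function can be assembled directly from the quadrature vectors above. The sketch below is our own illustration of those quantities (the names `phi`, `Q`, `K`, `Vn` are ours); in the actual method the kernel matrices are learned rather than given.

```python
import numpy as np

def dsm_index(phi, Q, K, Vn, c_fg=1.0):
    """Discrete modified DSM index function assembled from learnable kernels
    (a sketch of the quantities derived in this appendix).
    phi: (2, M) the derivatives  d/dx_n phi  at the M quadrature points;
    Q, K: (M, M) matrices whose i-th rows are q_i^T and k_i^T;
    Vn:  (2, M, M) rows v_{n,i}^T coming from the learnable norm operator V."""
    M = phi.shape[1]
    idx = np.empty(M)
    for i in range(M):
        # numerator:  sum_n (phi_n^T q_i)(k_i^T phi_n)
        num = sum((phi[n] @ Q[i]) * (K[i] @ phi[n]) for n in range(2))
        # |eta_{z_i}|_Y^2  ~  sum_n (v_{n,i}^T phi_n)^2
        den = sum((Vn[n, i] @ phi[n]) ** 2 for n in range(2))
        idx[i] = c_fg * num / np.sqrt(den)
    return idx
```

Reading `phi^T q_i k_i^T phi` as a query-key-value product is exactly what recasts this formula as the modified attention of the main text.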

E PROOF OF THEOREM 1

Lemma 1. Suppose the boundary data $g_l$ is the eigenfunction of $\Lambda_\sigma - \Lambda_{\sigma_0}$ corresponding to the $l$-th eigenvalue $\lambda_l$, and let $\phi_l$ be the data functions generated by the harmonic extensions

$$-\Delta\phi_l = 0 \ \text{ in } \Omega, \qquad n\cdot\nabla\phi_l = (f_l - \Lambda_{\sigma_0} g_l) = (\Lambda_\sigma - \Lambda_{\sigma_0})g_l \ \text{ on } \partial\Omega, \qquad \int_{\partial\Omega}\phi_l\,ds = 0,$$

where $l = 1, 2, \dots$. Let $d$ be an arbitrary unit vector in $\mathbb{R}^2$, and define

$$\Theta_L(x) = \sum_{l=1}^{L} \frac{\big(d\cdot\nabla\phi_l(x)\big)^2}{\lambda_l^3}.$$

Then there holds

$$\lim_{L\to\infty}\Theta_L(x) = \begin{cases} \infty, & x \notin D, \\ \text{a finite constant}, & x \in D. \end{cases}$$

Proof. See Theorem 4.1 in Guo & Jiang (2020); see also Brühl (2001); Hanke & Brühl (2003).

Theorem 1 (A finite-dimensional approximation of the index function). Suppose the boundary data $g_l$ is the eigenfunction of $\Lambda_\sigma - \Lambda_{\sigma_0}$ corresponding to the $l$-th eigenvalue $\lambda_l$, and let $\phi_l$ be the data functions generated by the harmonic extensions given in (9). Define the space

$$S_L = \operatorname{Span}\{\partial_{x_1}\phi_l\,\partial_{x_2}\phi_l : l = 1, \dots, L\},$$

and the dictionary

$$\mathcal{S}_L = \{a_1 + a_2\arctan(a_3 v) : v \in S_L,\ a_1, a_2, a_3 \in \mathbb{R}\}.$$

Then, for any $\epsilon > 0$, we can construct an index function $I_D^L \in \mathcal{S}_L$ such that

$$\sup_{x\in\Omega}\big|I_D(x) - I_D^L(x)\big| \le \epsilon \tag{50}$$

provided $L$ is large enough.

Proof. Consider the function $\Theta_L(x)$ from Lemma 1. As $\Theta_L(x) > 0$, it is increasing with respect to $L$. Then, there is a constant $\rho$ such that $\rho > \Theta_L(x)$ for all $x \in D$. Given any $\epsilon > 0$, there is an integer $L$ such that

$$\Theta_L(x) > \frac{4\rho\,\epsilon^{-2}}{\pi^2}, \qquad \forall x \notin D.$$

Define

$$I_D^L(x) = 1 - \frac{2}{\pi}\arctan\!\Big(\frac{\pi\epsilon}{2\rho}\,\Theta_L(x)\Big). \tag{51}$$

Note the fundamental inequality $z > \arctan(z) \ge \frac{\pi}{2} - z^{-1}$ for all $z > 0$. Then, if $x \in D$, there holds

$$\big|I_D(x) - I_D^L(x)\big| = \frac{2}{\pi}\arctan\!\Big(\frac{\pi\epsilon}{2\rho}\,\Theta_L(x)\Big) < \frac{\epsilon}{\rho}\,\Theta_L(x) < \epsilon;$$

if $x \notin D$, there holds

$$\big|I_D(x) - I_D^L(x)\big| = 1 - \frac{2}{\pi}\arctan\!\Big(\frac{\pi\epsilon}{2\rho}\,\Theta_L(x)\Big) \le \frac{4\rho}{\pi^2\epsilon\,\Theta_L(x)} < \epsilon.$$

Therefore, the function in (51) fulfills (50).
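The two branches of the estimate in the proof can be checked numerically with a synthetic $\Theta_L$; the particular values of `eps`, `rho`, and the test points are our own choices for illustration.

```python
import numpy as np

# Numerical illustration of the construction (51) with a synthetic Theta_L:
# Theta_L is bounded by rho inside D and exceeds 4*rho/(pi^2 * eps^2)
# outside D, so I_L below is eps-close to the indicator function of D.
eps, rho = 1e-2, 1.0

def I_L(theta):
    """The dictionary element 1 - (2/pi) * arctan(pi*eps/(2*rho) * Theta_L)."""
    return 1.0 - (2.0 / np.pi) * np.arctan(np.pi * eps / (2.0 * rho) * theta)
```

Evaluating `I_L` at a value of $\Theta_L$ below $\rho$ (inside $D$) gives a result within $\epsilon$ of 1, and at a value above the threshold $4\rho\epsilon^{-2}/\pi^2$ (outside $D$) a result within $\epsilon$ of 0, as the arctan bounds predict.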

F PROOF OF THEOREM 2

In presenting Theorem 2 in Section 3.3, we use the term "multiplicative" to describe the fact that two latent representations are multiplied in the attention mechanism; no such operation exists in, e.g., a pointwise FFN or a convolution layer. Heuristically speaking, the main result in Theorem 2 states that the output latent representations can be of a higher "frequency" than the input if the neural network architecture has "multiplicative" layers in it. The input latent representations are discretizations of certain functions, and they are combined using the matrix dot product as in attention. Provided that this discretization can represent functions of such a frequency with a certain approximation error, the resulting matrix/tensor can approximate a function of a higher frequency than the existing latent representations under the same discretization. Please see Figure 9 and Figure 10 for empirical evidence of this phenomenon in the latent representations: with completely smooth input (harmonic extensions, e.g., Figure 5), the attention-based learner can generate latent representations with multiple peaks and valleys.

Theorem 2 (Frequency bootstrapping for multiplicative neural architectures). Consider $\Omega = (0, \pi)$ with a uniform discretization $\{z_i\}_{i=1}^M$ of size $h$, and $v(x) = \sin(ax)$ for some $a \in \mathbb{Z}^+$. Let $N := a - 1 \ge 1$ be the number of channels in the attention layer of interest, and assume that (i) the current latent representation $p_h \in \mathbb{R}^{M\times N}$ consists of the discretizations of the first $N$ Legendre polynomials $\{p_j(\cdot)\}_{j=1}^N$, i.e., $(p_h)_{ij} = p_j(z_i)$; (ii) $p_h$ is normalized and the normalization weights $\alpha \equiv 1$ in (14); (iii) the discretization satisfies $\big|\sum_{i=1}^M h f(z_i) - \int_\Omega f(x)\,dx\big| \le Ch$.

Then, there exists a set of attention weights $\{W^Q, W^K, W^V\}$ such that for $u(x) = \sin(a'x)$ with $\mathbb{Z}^+ \ni a' > a$,

$$\|\tilde u - u\|_{L^2(\Omega)} \le C\max\{h, \|\varepsilon\|_{L^\infty(\Omega)}\},$$

where $\tilde u$ and $\kappa(\cdot,\cdot)$ are defined as the output and the kernel of the attention formulation in (14), respectively, and $\varepsilon(x) := \|\kappa(x,\cdot) - \tilde\kappa_N(x,\cdot)\|_{L^2(\Omega)}$ is the error function of the kernel approximation.

Proof. Without loss of generality, it is assumed that $a' = a + 1$; the essential technical tools used suggest validity for any $a' > a > 0$ ($a', a \in \mathbb{Z}^+$). Consider the simple non-separable smooth kernel function

$$\kappa(x, z) := \sin\big((a+1)xz\big). \tag{53}$$

It is straightforward to verify that for $v(x) := \sin(ax)$,

$$\int_\Omega \kappa(x, z)\,v(x)\,dx = \int_\Omega \sin\big((a+1)xz\big)\sin(ax)\,dx = c_1\sin\big((a+1)z\big) =: c_1 u(z), \tag{54}$$

with $c_1 = 2a/(2a+1)$. As is shown, it suffices to show that the matrix multiplication (a separable kernel) in the attention mechanism approximates this non-separable kernel with an error related to the number of channels. To this end, taking the Taylor expansion of $\kappa(x,\cdot)$ with respect to the second variable, centered at some $z_0 \in \Omega$, we have for each $x \in \Omega$

$$\kappa_N(x, z) := \sum_{l=1}^{N} \frac{(z - z_0)^{l-1}}{(l-1)!}\,\frac{\partial^{l-1}\kappa}{\partial z^{l-1}}(x, z_0).$$

It is straightforward to check that

$$\|\kappa(x,\cdot) - \kappa_N(x,\cdot)\|_{L^2(\Omega)} \le c_1\,\frac{\sqrt{(\pi - z_0)^{2N+1} + z_0^{2N+1}}}{N!\,\sqrt{2N+1}}\,\Big\|\frac{\partial^N\kappa}{\partial z^N}(x,\cdot)\Big\|_{L^\infty(\Omega)}.$$

By the assumptions on $\kappa(\cdot,\cdot)$ and a straightforward computation, we have

$$\|\kappa(x,\cdot) - \kappa_N(x,\cdot)\|_{L^2(\Omega)} \le c_2\,\frac{\pi^N}{N!\,\sqrt{N}}\,\Big\|\frac{\partial^N\kappa}{\partial z^N}(x,\cdot)\Big\|_{L^\infty(\Omega)}.$$

Next, let $q_l(z) := (z - z_0)^{l-1}/(l-1)!$ and $k_l(x) := \partial^{l-1}\kappa/\partial z^{l-1}(x, z_0)$ for $1 \le l \le N$, i.e., they form a Pincherle-Goursat (degenerate) kernel (Kress, 1999, Chapter 11):

$$\kappa_N(x, z) = \sum_{l=1}^{N} q_l(z)\,k_l(x).$$

Notice this is of the same order as the estimate of $\|\kappa(x,\cdot) - \kappa_N(x,\cdot)\|_{L^2(\Omega)}$.

For the term $\|\tilde\kappa_N(x,\cdot)\|_{L^2(\Omega)}$, a simple triangle inequality trick can be used:

$$\|\tilde\kappa_N(x,\cdot)\|_{L^2(\Omega)} \le \|\kappa_N(x,\cdot)\|_{L^2(\Omega)} + \|\kappa_N(x,\cdot) - \tilde\kappa_N(x,\cdot)\|_{L^2(\Omega)} \le \|\kappa(x,\cdot)\|_{L^2(\Omega)} + \|\kappa_N(x,\cdot) - \tilde\kappa_N(x,\cdot)\|_{L^2(\Omega)},$$

which can be further estimated by reusing the argument in (67). Lastly, using the following argument and the estimate for $\|u - \tilde u\|_{L^\infty(\Omega)}$ yields the desired result:

$$\|u - \tilde u\|^2_{L^2(\Omega)} \le \|u - \tilde u\|_{L^1(\Omega)}\,\|u - \tilde u\|_{L^\infty(\Omega)} \le 2\max|u|\;\|u - \tilde u\|^2_{L^\infty(\Omega)}.$$



https://github.com/scaomath/eit-transformer



Figure 1: Schematics of the pipeline: the approximation of the inverse operator is decomposed into (i) a PDE-based feature map H: the 1D boundary data is extended to 2D features by a harmonic extension; (ii) a tensor2tensor neural network T θ that outputs the reconstruction.

In Chow et al. (2014), it is shown that if V induces a kernel with a sharply peaked Gaussian-like distribution, the index function in (10) can achieve its maximum values at points inside D.

[Table 1 body: for each model (U-Net baseline and the others compared), the columns report the L² error, the position-wise cross entropy, and the Dice coefficient, each at noise levels τ = 0, 0.05, and 0.2, together with the number of parameters.]

δ_x: the delta function such that ∫_Ω f(y) δ_x(y) dy = f(x) for all f
n: the unit outer normal vector on the boundary ∂Ω
∇u · n: the normal derivative of u, measuring the rate of change of u along the direction of n
Λ_σ: the NtD map from Neumann data g := ∇u · n (how fast the solution changes in the outward normal direction) to Dirichlet data f := u|_∂Ω (the solution's value on the boundary)
H^s(Ω), s ≥ 0: the Sobolev space of functions
H^s(Ω), s < 0: the bounded linear functionals defined on H^{−s}(Ω)
H^s_0(Ω): all u ∈ H^s(Ω) such that the integral of u on Ω vanishes
|·|_Y: the seminorm defined for functions in Y

B BACKGROUND OF EIT

C EXPERIMENT SET-UP

C.1 DATA GENERATION AND TRAINING

In the numerical examples, the data generation mainly follows the standard practice in theoretical prototyping for solving EIT problems; see, e.g., Chow et al. (2014); Michalikova et al. (2014); Hamilton & Hauptmann (2018); Guo & Jiang (2020); Fan & Ying (2020). For examples, please refer to Figure 2. The computational domain is set to Ω := (−1, 1)², and the two media have different conductivities, σ₁ = 10 (inclusion) and σ₀ = 1 (background). The inclusions are four random ellipses. The lengths of the semi-major and semi-minor axes of these ellipses are sampled from U(0.1, 0.2) and U(0.2, 0.4), respectively. The rotation angles are sampled from U(0, 2π). There are 10800 samples in the training set, of which 20% are reserved for validation, and 2000 samples in the testing set for evaluation.
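A minimal sketch of the inclusion sampling described above is given below; the center range U(−0.5, 0.5)² is our own assumption, since the text does not state how the ellipse centers are drawn.

```python
import numpy as np

def random_ellipse_inclusions(m=128, n_ellipses=4, rng=None):
    """Sample a binary inclusion map I_D on an m x m Cartesian grid over
    Omega = (-1, 1)^2: axis lengths from U(0.1, 0.2) and U(0.2, 0.4),
    rotation angles from U(0, 2*pi), as stated in the set-up above.
    The center range is an assumption for illustration."""
    rng = np.random.default_rng(rng)
    xs = np.linspace(-1.0, 1.0, m)
    X, Y = np.meshgrid(xs, xs)
    mask = np.zeros((m, m), dtype=bool)
    for _ in range(n_ellipses):
        sa = rng.uniform(0.1, 0.2)               # semi-axis ~ U(0.1, 0.2)
        sb = rng.uniform(0.2, 0.4)               # semi-axis ~ U(0.2, 0.4)
        t = rng.uniform(0.0, 2.0 * np.pi)        # rotation angle
        cx, cy = rng.uniform(-0.5, 0.5, size=2)  # assumed center range
        Xr = (X - cx) * np.cos(t) + (Y - cy) * np.sin(t)
        Yr = -(X - cx) * np.sin(t) + (Y - cy) * np.cos(t)
        mask |= (Xr / sa) ** 2 + (Yr / sb) ** 2 <= 1.0
    return mask

# Piecewise constant conductivity: sigma_1 = 10 inside, sigma_0 = 1 outside.
sigma = np.where(random_ellipse_inclusions(rng=0), 10.0, 1.0)
```

Each training sample pairs such a σ (through its boundary measurements) with the binary map itself as the reconstruction target.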

Figure 3: The harmonic extension ϕ_l with zero noise on the boundary (left); the harmonic extension ϕ̃_l with 20% Gaussian noise on the boundary (middle); their pointwise difference (right). ∥ϕ_l − ϕ̃_l∥/∥ϕ_l∥ = 0.0203.
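The damping effect shown in Figure 3 can be seen in a small spectral sketch. We use the unit disk here because the harmonic extension has a closed form there (mode l of the Neumann data enters with weight r^l/l); the paper's actual domain is a square solved by FEM, so this geometry is purely illustrative.

```python
import numpy as np

def harmonic_extension_disk(g, r):
    """Evaluate the harmonic extension at radius r < 1 on the unit disk from
    Neumann boundary data g sampled uniformly in theta (a spectral sketch).
    Mode l of the Neumann data is weighted by r**l / l, so high-frequency
    boundary noise is strongly damped in the interior."""
    n = len(g)
    ghat = np.fft.rfft(g) / n
    ls = np.arange(len(ghat))
    phi_hat = np.zeros_like(ghat)
    # the zero-mean normalization condition drops the l = 0 mode
    phi_hat[1:] = ghat[1:] * (r ** ls[1:]) / ls[1:]
    return np.fft.irfft(phi_hat * n, n=n)
```

For smooth data the interior values track the exact extension, while pointwise boundary noise, whose energy sits mostly in high modes, is attenuated by orders of magnitude.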

Figure 4: Left: the training/testing pixel-wise binary cross entropy convergence for the CNN-based U-Net with 31 million parameters; a clear overfitting pattern is shown. Right: the training/testing convergence for the attention-based U-Transformer with 11.4 million parameters.

Figure 6: The neural network evaluation results for the inclusion in Figure 5 using various models. A model's input is the leftmost function in Figure 5 if the model uses a single channel, or all three left functions in Figure 5 if the model uses a 3-channel input. (a) Ground truth inclusion in Figure 5; (b) U-Net baseline (7.7m) prediction; (c) U-Net big (31m) prediction with 3 channels; (d) Fourier Neural Operator (10.4m) prediction with 1 channel; (e) Fourier Neural Operator big (33m) prediction with 1 channel; (f) Adaptive Fourier Neural Operator (10.7m) prediction with 1 channel; (g) Multiwavelet Neural Operator (9.8m) prediction with 1 channel; (h) Hybrid UT with a linear attention (10.13m) prediction with 1 channel; (i) UIT (11.4m) prediction with 1 channel; (j) UIT (11.4m) prediction with 3 channels.

Figure 7: Detailed flow of the modified 2D attention-based encoder layer using (11). C: the number of channels in the input, N : the number of expanded channels (for the basis expansion interpretation in Theorem 2).

Figure 9: The latent representations from the CNN-based U-Net when evaluating the sample in Figure 5. In their respective layers, 4 latent representations are extracted from 4 randomly selected channels: (a)-(d) are from the feature-extracting layer (layer 1, 128 × 128 grid), (e)-(h) are from the middle layer acting on the coarsest level (32 × 32 grid), and (i)-(l) are from the next level (64 × 64 grid).

Figure 10: The latent representations from the UIT when evaluating the sample in Figure 5, in the layers corresponding to those in Figure 9. In each layer, 4 latent representations are extracted from 4 randomly selected channels: (a)-(d) are from the feature-extracting layer (layer 1, 128 × 128 grid), (e)-(h) are from the middle layer acting on the coarsest level (32 × 32 grid), and (i)-(l) are from the next level (64 × 64 grid).

The emerging deep learning (DL) approach based on Deep Neural Networks (DNNs) to directly emulate operators significantly resembles the classical direct methods mentioned above. However, operator learners by DNNs are commonly considered black boxes. A natural question is how a priori mathematical knowledge can be exploited to design more physics-compatible DNN architectures. In pursuing the answer to this question, we aim to provide a supportive example that bridges deep learning techniques and classical direct methods, and improves the reconstruction of EIT. A deep direct sampling method is proposed in Guo & Jiang (2020); Guo et al. (2021) that learns local convolutional kernels mimicking the gradient operator of DSM. Another example is the radial basis function neural networks seen in Hrabuska et al. (2018); Michalikova et al. (2014); Wang et al. (2021a). Nevertheless, convolutions in CNNs use kernels whose receptive fields involve only a small neighborhood of a pixel; thus, layer-wise, a CNN does not align well with the non-local nature of inverse problems. More recently, the learning of PDE-related forward problems using global kernels has gained traction, most notably the Fourier Neural Operator (FNO) (Nelsen & Stuart, 2021; Li

Multi-Wavelet Neural Operator (MWO). The MultiWavelet Neural Operator (MWO) is proposed in Gupta et al. (2021), which introduces a multilevel structure into the FNO architecture. MWO still follows FNO's practice on each level by pre-selecting the lowest modes.


ACKNOWLEDGMENTS

L. Chen is supported in part by National Science Foundation grants DMS-1913080, DMS-2012465, and DMS-2132710. S. Cao is supported in part by National Science Foundation grants DMS-1913080 and DMS-2136075. The hardware to perform the experiments is sponsored by NSF grant DMS-2136075 and the UMKC School of Science and Engineering computing facilities. No additional revenues are related to this work. The authors would like to thank Ms. Jinrong Wei (University of California Irvine) for the proofreading and various suggestions on the manuscript. The authors would like to thank Dr. Jun Zou (The Chinese University of Hong Kong) and Dr. Bangti Jin (University College London & The Chinese University of Hong Kong) for their comments on inverse problems. The authors also greatly appreciate the valuable suggestions and comments by the anonymous reviewers.

REPRODUCIBILITY STATEMENT

This paper is reproducible. Experimental details about all empirical results described in this paper are provided in Appendix C. Additionally, we provide the PyTorch (Paszke et al., 2019) code for reproducing our results at https://github.com/scaomath/eit-transformer. The dataset used in this paper is available at https://www.kaggle.com/datasets/scaomath/eletrical-impedance-tomography-dataset. Formal proofs under a rigorous setting of all our theoretical results are provided in Appendices E-F.


Published as a conference paper at ICLR 2023

Table 3: The detailed comparison of the networks used in this study. For U-Net-based neural networks, the channel/width is the number of base channels on the finest grid after the initial channel expansion. A torch.cfloat-type parameter entry counts as two parameters. GFLOPs: Giga FLOPs for one backpropagation (BP) performed for a batch of 8 samples, recorded by the PyTorch autograd profiler and averaged over 100 BPs. Eval: number of instances per second.


Figure 8: A simplified schematic of the U-Integral-Transformer, which follows the standard U-Net. The input is a tensor concatenating the discretizations of ϕ and ∇ϕ. The output is the approximation to the index function I_D. Legend: 3 × 3 convolution + ReLU; layer normalization or batch normalization; bilinear interpolations from the fine grid to the coarse grid; cross attention A_c that uses the latent representations on a coarse grid to compute interactions producing the latent representations on a finer grid; input and output discretized functions in certain Hilbert spaces. The TikZ source code to produce this figure is modified from the examples in Iqbal (2018).

Henceforth, these "empirically chosen quantities" are made learnable from data: by introducing two undetermined kernels K(x, y) and Q(x, y) and a self-adjoint positive definite linear operator V, the modified index function is written as in (15).

By this choice of the latent representation space as the first N Legendre polynomials, $q_l \in Y := \operatorname{span}\{p_j\}$; thus there exists a set of weights $\{w_l^Q \in \mathbb{R}^l\}_{l=1}^N$, one corresponding to each channel, representing each $q_l$ in this basis. For the key matrix, by standard polynomial approximation, since $Y \simeq P_N(\Omega)$, there exists a set of weights such that $K = p_h W^K \in \mathbb{R}^{M\times N}$, with the $l$-th column of $K$ being the discretization of $\tilde k_l(\cdot)$; moreover, $\tilde k_l(\cdot)$ approximates $k_l(\cdot)$ with an estimate of the same order as above. Similarly, without loss of generality, we choose $v(\cdot) := v_1(\cdot)$, which is concatenated into $V$ such that it occupies the first channel of the $V$ defined earlier. Combining these, we obtain, for any $z \in \Omega$, the approximation in (66). Now, by the triangle inequality, $q_l = \tilde q_l$, the definitions above (53), and the estimate in (61), we arrive at (67).

G LIMITATIONS, EXTENSIONS, AND FUTURE WORK

In this study, the σ to be recovered relies on a piecewise constant assumption, which is commonly seen in the theoretical study of the original DSM. For many EIT applications in medical imaging and industrial monitoring, σ may involve non-sharp transitions or even contain highly anisotropic/multiscale behaviors, making it merely an L∞ function. If the boundary data pairs are still quite limited, i.e., only a few electric modes are placed on the boundary ∂Ω, the proposed model alone is not expected to perform as well as in the benchmark problems. Nevertheless, it can still contribute to a reconstruction with satisfactory accuracy if certain a priori knowledge of the problem is accessible. From an end-to-end perspective, our proposed method shares the limitations of other operator learners: the data manifold on which the operator is learned is assumed to exhibit low-dimensional/low-rank attributes, and the behavior of the operator of interest on a compact subset is assumed to be reasonably well approximated by a finite number of bases. Therefore, for non-piecewise-constant conductivities, one modification is to employ a suitable data set in which the sampling of {σ^(k)} represents the true distribution of σ to a certain degree. However, to reconstruct non-piecewise-constant conductivities, more boundary data pairs, or even the entire NtD map, are required from a theoretical perspective (Astala & Päivärinta, 2006; Nachman, 1996; Kohn & Vogelius, 1984; Sylvester & Uhlmann, 1987). For fewer data pairs and more complicated conductivity set-ups, there have been efforts in this direction using hierarchical matrix completion (Bui-Thanh et al., 2022) to recover Λ_σ. When Λ_σ is indeed available, σ can be described by a Fredholm integral equation, see Nachman (1996, Theorem 4.1), which itself is strongly related to the modified attention mechanism (14) of the proposed Transformer. This architectural resemblance may lead to future explorations in this direction.
From the perspective of improving the reconstruction of a single instance, optimization with regularization can be applied to the instance of interest (fine-tuning). This approach dates back to classical iterative methods involving adaptivity (Jin & Xu, 2019); recent DL-inspired adaptations (Li et al., 2021b; Benitez et al., 2023) re-introduce this type of method. In fine-tuning, the initial guess is the reconstruction by the operator learner trained in the end2end pipeline (pre-training).

