STABLE, EFFICIENT, AND FLEXIBLE MONOTONE OPERATOR IMPLICIT GRAPH NEURAL NETWORKS

Abstract

Implicit graph neural networks (IGNNs) that solve a fixed-point equilibrium equation for representation learning can learn the long-range dependencies (LRD) in the underlying graphs and show remarkable performance for various graph learning tasks. However, the expressivity of IGNNs is limited by the constraints imposed for their well-posedness guarantee. Moreover, when IGNNs become effective for learning LRD, the eigenvalues of their weight matrices approach values that slow down the convergence of the fixed-point iteration, and their performance is unstable across different tasks. In this paper, we provide a new well-posedness condition for IGNNs leveraging monotone operator theory. The new well-posedness characterization guides the design of effective parameterizations that improve the accuracy, efficiency, and stability of IGNNs. Leveraging accelerated operator splitting schemes and graph diffusion convolution, we design efficient and flexible implementations of monotone operator IGNNs that are significantly faster and more accurate than existing IGNNs.

1. INTRODUCTION

Implicit graph neural networks (IGNNs) that solve a fixed-point equilibrium equation for graph representation learning can learn long-range dependencies (LRD) in the underlying graphs, showing remarkable performance for various tasks [69; 39; 58; 63; 22]. Let $G = (V, E)$ represent a graph, where $V$ is the set of nodes and $E \subseteq V \times V$ is the set of edges. The connectivity of $G$ can be represented by the adjacency matrix $A \in \mathbb{R}^{n\times n}$ with $A_{ij} = 1$ if there is an edge connecting nodes $i, j \in V$; otherwise $A_{ij} = 0$. Let $X \in \mathbb{R}^{d\times n}$ be the initial node features whose $i$-th column $x_i \in \mathbb{R}^d$ is the initial feature of the $i$-th node. IGNN [39] learns the node representation by finding the fixed point, denoted as $Z^*$, of the Picard iteration

$$Z^{(k+1)} = \sigma\big(W Z^{(k)} G + g_B(X)\big), \qquad (1)$$

where $\sigma$ is the nonlinearity (e.g. ReLU), $g_B$ is a function parameterized by $B$ (e.g. $g_B(X) = BXG$), the matrices $W, B \in \mathbb{R}^{d\times d}$ are learnable weights, and $G$ is a graph-related matrix. In IGNN, $G$ is chosen as $\hat{A} := \hat{D}^{-1/2}(I + A)\hat{D}^{-1/2}$, with $I$ the identity matrix and $\hat{D}$ the degree matrix with $\hat{D}_{ii} = 1 + \sum_{j=1}^n A_{ij}$. IGNN constrains $W$ using a tractable projected gradient descent method to ensure the well-posedness of the Picard iteration, at the cost of limiting the expressivity of IGNNs. The prediction of IGNN is given by $f_\Theta(Z^*)$, a function parameterized by $\Theta$.

IGNNs have several merits: 1) The depth of IGNN is adaptive to particular data and tasks rather than fixed. 2) Training IGNNs requires constant memory independent of their depth by leveraging implicit differentiation [66; 2; 51; 13]. 3) IGNNs have better potential to capture LRD of the underlying graph than existing GNNs, including GCN [75], GAT [73], SSE [23], and SGC [79]. The latter GNNs lack the capability to learn LRD as they suffer from over-smoothing [56; 84; 62; 20]. Several methods have been proposed to alleviate over-smoothing, and hence improve learning LRD, by adding residual connections [37; 21; 55], by geometric aggregation [65], by adding a fully-adjacent layer [3], by improving breadth-wise backpropagation [59], and by adding oscillatory layers [27; 67].

Issue 1: Well-posedness of IGNN limits its expressivity. One bottleneck of IGNN is that the magnitude of $W$'s eigenvalues has to be less than one for its well-posedness guarantee; see Sec. 2 for details. This limits the selection of $W$ and thereby the expressivity of IGNNs.

Issue 2: When can IGNNs learn LRD? To understand when IGNN can learn LRD, we run IGNN using the settings in [39] to classify directed chains. Directed chains is a synthetic dataset designed to test the effectiveness of GNNs in learning LRD for node classification [71; 39]. Fig. 1 plots epoch vs. accuracy of IGNN for the chain classification; here, each epoch means iterating Equation (1) until convergence and then updating $W$ and $B$ at the end. IGNN classifies the binary chain task perfectly at length 100 but performs near random guessing when the length is 250, as illustrated in Fig. 1. For the three-class chains, IGNN's performance is very poor at chain length 100 but quite good at length 80. We investigate these results by studying the dynamics of the eigenvalues of the matrix $|W|$. For illustrative purposes, we consider $\lambda_1(|W|)$ and $\lambda_2(|W|)$, the largest and second largest eigenvalues of $|W|$ in magnitude. Fig. 2 (left) contrasts the evolution of the magnitudes of $\lambda_1(|W|)$ and $\lambda_2(|W|)$ of IGNN when classifying nodes on chains of different lengths.
We see that the magnitude of both eigenvalues goes to 1 when IGNN becomes accurate. However, Fig. 2 (right) shows that IGNN takes many more iterations per epoch when the magnitude of the eigenvalues gets close to 1. Indeed, when $\lambda_1(|W|) \to 1$, the Lipschitz constant of the linear map $Z \mapsto WZG + g_B(X)$ is close to 1, slowing down the convergence of the Picard iteration. The results in Fig. 2 echo our intuition: the representation of a given node aggregates one more hop of information after each Picard iteration; when the magnitude of the eigenvalues gets close to 1, Equation (1) converges slowly, so that IGNN can capture LRD before the fixed point is reached. We report the classification results for different lengths in Appendix I; these results show prevalently that IGNNs suffer from two bottlenecks: 1) an inherent tradeoff between computational efficiency and the capability to learn LRD, and 2) instability, in the sense that the performance of IGNNs based on Picard iteration varies substantially across tasks. In particular, starting from a random Gaussian initialization of $W$ (the default initialization of $W$), IGNN cannot learn LRD if none of the eigenvalues of $W$ gets close to 1 in magnitude.

We develop accurate, stable, and efficient monotone operator IGNNs (MIGNNs). In particular, we derive a new well-posedness condition for MIGNN leveraging monotone operator theory; see Sec. 2. The new well-posedness condition informs us to design 1) a monotone parameterization of $W$, whose eigenvalues can take a much wider range than those of IGNNs, to boost the expressivity of MIGNNs, addressing Issue 1, and 2) a Cayley transform-based orthogonal parameterization of $W$ to improve the stability and efficiency of MIGNN for learning LRD, addressing Issue 2; see Sec. 3. Picard iteration is inefficient at or incapable of finding the fixed point of MIGNN with monotone or orthogonal parameterization. As such, we implement MIGNNs leveraging Anderson-accelerated operator splitting schemes; see Sec. 4. We verify the efficacy of MIGNN on various benchmark tasks; see Sec. 5.

1.2. ADDITIONAL RELATED WORK

We briefly review some representative related works in three directions: deep equilibrium models (DEQs), GNNs, and orthogonal parameterizations for recurrent neural networks (RNNs).

DEQ. IGNN is related to DEQs [7; 26; 8], but the equilibrium equation of IGNN differs from DEQs in that IGNN encodes graph structure. DEQs are a class of infinite-depth weight-tied feedforward neural networks with forward propagation using root-finding and backpropagation using implicit differentiation. As a result, training DEQs only requires constant memory independent of the network's depth. Monotone operator theory has been used to guarantee the convergence of DEQs [77] and to improve the robustness of implicit neural networks [44]. The convergence of DEQs has also been addressed by constraining the network's weights [49]. Linearized DEQs are studied in [46]. Jacobian regularization has been used to stabilize the training of DEQs [9]. Anderson-accelerated DEQs with learned acceleration-related hyperparameters have also been proposed [10].

Graph neural networks. Classical GNNs are defined by stacking explicitly defined graph filtering layers. Examples include graph convolutional networks (GCNs) [17; 24; 48], recurrent GNNs [38; 30; 57; 21], GraphSAGE [40], neural graph fingerprints [25], the graph isomorphism network (GIN) [80], message passing neural networks [36], graph attention networks (GATs) [73], GCNs with convolution kernels learned based on paths (PAN [60] and pathGCN [28]), and higher-order message passing networks [15; 14]. There are some recent advances in IGNNs: EIGNN removes the nonlinearity in each intermediate iteration and derives a closed form of the infinite iterations [58]; the convergent graph solver (CGS) is an IGNN model with convergence guarantees obtained by constructing input-dependent linear contracting iterative maps [63]; GIND leverages implicit nonlinear diffusion to access infinite hops of neighbors [22]. In addition to Picard iteration, implicit GNNs have also been defined by parametrizing the diffusion equation on graphs.

1.3. NOTATION

We denote scalars by lower- or upper-case letters and vectors/matrices by lower-/upper-case boldface letters. For a vector $a$, we use $\|a\|/\|a\|_\infty$ to denote its $\ell_2$-/$\ell_\infty$-norm. We use $I$ to denote the identity matrix, whose dimension can be inferred from the context. For a matrix $A$, we denote its transpose as $A^\top$, its inverse as $A^{-1}$, its Frobenius norm/2-norm/$\infty$-norm as $\|A\|_F/\|A\|/\|A\|_\infty$, and its $i$-th largest eigenvalue in magnitude as $\lambda_i(A)$. Given two matrices $A$ and $B$, we denote their Kronecker/entry-wise product as $A \otimes B$/$A \odot B$, and write $A \succ B$ ($A \succeq B$) if $A - B$ is positive definite (positive semi-definite). We use $\mathrm{vec}(A)$ to denote the vectorization of the matrix $A$ in column-major order. The meaning of other notations can be inferred from the context.

2. WELL-POSEDNESS OF MIGNN: A MONOTONE OPERATOR PERSPECTIVE

In this section, we characterize the well-posedness of MIGNN leveraging monotone operator theory; see Appendix B for a brief review of monotone operator theory. Using the Kronecker product (see Appendix D for a review of some properties of the Kronecker product) and the vectorization of a matrix, we can rewrite Equation (1) in the following equivalent vectorized form:

$$\mathrm{vec}(Z^{(k+1)}) = \sigma\big((G^\top \otimes W)\,\mathrm{vec}(Z^{(k)}) + \mathrm{vec}(g_B(X))\big). \qquad (2)$$

Gu et al. propose the well-posedness condition of IGNN as $\lambda_1(|G^\top \otimes W|) < 1$, guaranteeing that the unique fixed point of Equation (2) can be found by Picard iteration. Selecting $G = \hat{A}$, all eigenvalues of $G$ are in $[-1, 1]$ with $\lambda_1(G) = 1$. Therefore, the well-posedness of IGNN is equivalent to $\lambda_1(|W|) < 1$, as $\lambda_1(|G^\top \otimes W|) = \lambda_1(G)\lambda_1(|W|) = \lambda_1(|W|)$. IGNN then parameterizes $W$ by relaxing the well-posedness condition $\lambda_1(|W|) < 1$ to $\|W\|_\infty < 1$, which constrains the magnitudes of the eigenvalues of $W$ to be less than 1.

We seek to apply monotone operator theory to improve the expressivity and efficiency of existing IGNNs. According to monotone operator theory [68; 77], finding the fixed point of Equation (2) is equivalent to solving the monotone inclusion problem: find $0 \in (\mathcal{F} + \mathcal{G})(\mathrm{vec}(Z))$, with $\mathcal{F}$ and $\mathcal{G}$ being the two set-valued functions

$$\mathcal{F}(\mathrm{vec}(Z)) = (I - G^\top \otimes W)\,\mathrm{vec}(Z) - \mathrm{vec}(g_B(X)) \quad \text{and} \quad \mathcal{G} = \partial f, \qquad (3)$$

where $\partial f$ denotes the subgradient of a convex closed proper function $f$ that satisfies $\sigma = \mathrm{prox}^f_1$ with $\mathrm{prox}^f_\alpha(x) \equiv \arg\min_z \frac{1}{2}\|x - z\|^2 + \alpha f(z)$. When $\sigma$ is ReLU, then $\sigma = \mathrm{prox}^f_\alpha$ for all $\alpha > 0$ with $f$ being the indicator of the positive octant, i.e. $f(x) = \mathbb{I}\{x \ge 0\}$. The above monotone inclusion problem admits a unique solution if the operator $\mathcal{F}$ is strongly monotone, i.e. $I - G^\top \otimes W \succeq mI$, or equivalently $\frac{1}{2}(G^\top \otimes W + G \otimes W^\top) \preceq (1 - m)I$. Therefore, we obtain the following well-posedness condition for MIGNN:

Proposition 1 (Well-posedness condition for MIGNN). Let the nonlinearity $\sigma$ be ReLU and $K = \frac{1}{2}(G^\top \otimes W + G \otimes W^\top)$. Then the MIGNN model in Equation (2) is well-posed as long as $K \preceq (1 - m)I$ for some $m > 0$.

As $K$ is symmetric, $K \preceq (1 - m)I$ is equivalent to requiring that each eigenvalue of $K$ be no more than $1 - m$. We provide the proof of Proposition 1 in the appendix; the proofs of all subsequent theoretical results are likewise provided in the appendix. The well-posedness condition in Proposition 1 allows for more flexible parametrizations than [39] by enabling the real parts of the eigenvalues of $W$ to lie in $(-\infty, 1)$ and their imaginary parts to be arbitrary. Along with providing a more flexible well-posedness condition for MIGNN, monotone operator theory guides us in designing efficient algorithms for implementing MIGNN; see Sec. 4.
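To make the condition concrete, the following minimal NumPy sketch (our own; not from the paper's released code) checks Proposition 1 numerically for small graphs by forming $K$ explicitly:

```python
import numpy as np

def is_well_posed(W, G, m=0.1):
    """Numerically check the MIGNN well-posedness condition of Proposition 1:
    K = (G^T kron W + G kron W^T) / 2 must satisfy K <= (1 - m) I, i.e. the
    largest eigenvalue of the symmetric matrix K is at most 1 - m."""
    K = 0.5 * (np.kron(G.T, W) + np.kron(G, W.T))
    return np.linalg.eigvalsh(K).max() <= 1.0 - m
```

Forming the Kronecker product is only feasible for small $n$ and $d$; the point of the sketch is to make the spectral statement of Proposition 1 checkable, not to serve as a training-time routine.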

3. FLEXIBLE PARAMETERIZATION OF MIGNN

This section presents the monotone and orthogonal parameterizations of $W$ for MIGNN in Equation (2). The monotone parameterization can enhance IGNN's expressivity, and the orthogonal parameterization can stabilize and accelerate the training of MIGNNs.

3.1. MONOTONE PARAMETERIZATION

Proposition 1 informs us to design a more expressive parameterization of $W$ for MIGNN than that used for IGNN, leveraging monotone operator theory.

Proposition 2 (Monotone parameterization). Let $G = (V, E)$ be a graph and let the graph-related matrix be $L/2$, with $L := D^{-1/2}(D - A)D^{-1/2}$ the normalized Laplacian, where $A$ is the adjacency matrix and $D$ is the degree matrix with $D_{ii} = \sum_{j=1}^n A_{ij}$. Then the MIGNN model $Z^{(k+1)} = \sigma\big(WZ^{(k)}G + g_B(X)\big)$ is well-posed when the weight matrix $W$ is parameterized as

$$W = (1 - m)I - CC^\top + F - F^\top, \qquad (4)$$

where $C, F \in \mathbb{R}^{d\times d}$ are arbitrary matrices and $m > 0$.
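A minimal PyTorch sketch of this parameterization (module name and initialization scale are ours):

```python
import torch
import torch.nn as nn

class MonotoneWeight(nn.Module):
    """W = (1 - m) I - C C^T + F - F^T (Equation (4)); its symmetric part is
    (1 - m) I - C C^T <= (1 - m) I for any C and F, so Proposition 2 applies."""
    def __init__(self, d, m=0.1):
        super().__init__()
        self.C = nn.Parameter(torch.randn(d, d) / d ** 0.5)
        self.F = nn.Parameter(torch.randn(d, d) / d ** 0.5)
        self.m = m

    def forward(self):
        I = torch.eye(self.C.shape[0], device=self.C.device)
        return (1 - self.m) * I - self.C @ self.C.T + self.F - self.F.T
```

Note that $C$ and $F$ are unconstrained, so standard gradient descent can be used without any projection step, in contrast to IGNN's projected updates.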

3.2. ORTHOGONAL PARAMETERIZATION

As discussed in Sec. 1, IGNN learns LRD when $\lambda_1(|W|)$ approaches 1 in magnitude. This is often not the case when starting from a Gaussian random initialization, making IGNN unstable for learning LRD. Inspired by the unitary RNN [5], we propose to use an orthogonal parameterization [41; 54; 53] with a learnable scaling factor to stabilize MIGNN in learning LRD. In particular, we parameterize $W$ by the scaled Cayley map

$$W = \phi(\gamma)(I - S)(I + S)^{-1}, \qquad (5)$$

where $\phi(\cdot)$ is the sigmoid function and $\gamma \in \mathbb{R}$ is a learnable parameter ensuring $\phi(\gamma) \in (0, 1)$; $S = C - C^\top$ is a skew-symmetric matrix with $C \in \mathbb{R}^{d\times d}$ an arbitrary parameterized matrix. MIGNN with the parameterization in Equation (5) is well-posed with $G$ being $\hat{A}$ as defined in Sec. 1. Moreover, all eigenvalues of $(I - S)(I + S)^{-1}$ have magnitude 1; see Appendix E.3 for a derivation. To learn LRD effectively, MIGNN only requires the scalar $\phi(\gamma)$ to converge to 1.
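A minimal PyTorch sketch of the scaled Cayley parameterization (module name and initialization are ours):

```python
import torch
import torch.nn as nn

class CayleyWeight(nn.Module):
    """W = sigmoid(gamma) (I - S)(I + S)^{-1} with S = C - C^T (Equation (5)).
    All eigenvalues of the Cayley factor have magnitude 1, so all eigenvalues
    of W have magnitude sigmoid(gamma) < 1."""
    def __init__(self, d):
        super().__init__()
        self.C = nn.Parameter(torch.randn(d, d) / d ** 0.5)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self):
        S = self.C - self.C.T
        I = torch.eye(S.shape[0], device=S.device)
        # (I - S)(I + S)^{-1} = (I + S)^{-1}(I - S): the factors commute,
        # so a single linear solve avoids forming the inverse explicitly
        return torch.sigmoid(self.gamma) * torch.linalg.solve(I + S, I - S)
```

Because the spectrum of $W$ is controlled by the single scalar $\phi(\gamma)$, learning LRD reduces to driving one parameter toward 1 rather than shaping the whole spectrum of $W$.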

4. ACCELERATED OPERATOR SPLITTING FOR IMPLEMENTING IGNNS

It is worth noting that the monotone and orthogonal parameterizations are beyond the efficient convergence regime of the Picard iteration. Thus, we leverage operator splitting schemes to find the fixed point of the equilibrium equation with monotone or orthogonal parameterization. Operator splitting schemes often converge faster than Picard iteration and can guarantee convergence of IGNNs even when Picard iteration fails [68]. In particular, for small graphs and tasks where learning LRD is not crucial, we use Anderson-accelerated forward-backward splitting (FB) to implement MIGNN with monotone parameterization. For tasks that require learning LRD, we employ Anderson-accelerated Peaceman-Rachford splitting (PR), with the Neumann series approximation accompanied by diffusion convolution, to implement MIGNN with orthogonal parameterization. We structure this section as follows: in Sec. 4.1, we present FB (Sec. 4.1.1) and PR (Sec. 4.1.2) for finding the fixed point of MIGNNs with monotone/orthogonal parameterization; in Sec. 4.2, we present backward propagation algorithms for updating the parameters of MIGNN.

4.1.1. FB SPLITTING

We can find the fixed point of MIGNN in Equation (2) via FB splitting with the iterative scheme

$$Z^{(k+1)} := F^{FB}_\alpha(Z^{(k)}) := \mathrm{prox}^f_\alpha\Big(Z^{(k)} - \alpha\big(Z^{(k)} - WZ^{(k)}G - g_B(X)\big)\Big), \qquad (6)$$

where $\alpha > 0$ is a constant. We provide a detailed implementation of FB splitting in Appendix F.1. Note that the Lipschitz constant of the FB iteration is $L_{FB} := \sqrt{1 - 2\alpha m + \alpha^2\|I - G^\top \otimes W\|^2}$ [68, Section 5]. Therefore, FB splitting converges to the fixed point if $\alpha < 2m/\|I - G^\top \otimes W\|^2$. By choosing a proper $\alpha$, FB splitting can converge in regimes where Picard iteration does not. However, when the monotone parameterization is used, $\|W\|$ can be arbitrarily large. Thus $\alpha$ needs to be small to guarantee the convergence of FB splitting, in which case the Lipschitz constant is close to 1 and the convergence of FB splitting is significantly slowed. FB splitting is appealing for learning with small graphs and tasks where learning LRD is not crucial. In this case, we use the monotone parameterization to improve the expressivity of the model, and we denote the MIGNN with monotone parameterization using FB splitting as MIGNN-Mon. For large graphs and tasks that require learning LRD, FB splitting suffers from slow convergence. Next, we present PR splitting, which is better suited to learning on large-scale graphs and learning LRD. Furthermore, we argue that PR splitting is not suitable for implementing MIGNN with monotone parameterization.
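A minimal sketch of iterating Equation (6) to a fixed point (function names and stopping rule are ours; the prox of the ReLU indicator is ReLU itself):

```python
import torch
import torch.nn.functional as F

def fb_step(Z, W, G, bias, alpha):
    """One FB step of Equation (6): prox_f^alpha(Z - alpha(Z - W Z G - g_B(X)));
    for f the indicator of the positive octant, prox_f^alpha is ReLU."""
    return F.relu(Z - alpha * (Z - W @ Z @ G - bias))

def fb_solve(W, G, bias, alpha=0.1, max_iter=500, tol=1e-6):
    """Iterate Equation (6) until the update is (relatively) small."""
    Z = torch.zeros_like(bias)
    for _ in range(max_iter):
        Z_next = fb_step(Z, W, G, bias, alpha)
        if (Z_next - Z).norm() <= tol * (1 + Z.norm()):
            break
        Z = Z_next
    return Z_next
```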

4.1.2. PR SPLITTING

PR splitting, as used in [77], is guaranteed to converge for a much broader choice of $\alpha$ and requires fewer iterations than FB splitting. However, each iteration of PR splitting requires inverting large matrices, which is computationally much more expensive and less scalable than FB splitting. PR splitting finds the solution $Z^*$ of the MIGNN by letting $Z^* = \mathrm{prox}^f_\alpha(U^*)$, where $U^* \in \mathbb{R}^{d\times n}$ is obtained from the fixed-point iteration $\mathrm{vec}(U^{(k+1)}) = F^{PR}_\alpha(\mathrm{vec}(U^{(k)})) := C_\mathcal{F} C_\mathcal{G}(\mathrm{vec}(U^{(k)}))$, with $C_\mathcal{F}$ and $C_\mathcal{G}$ the Cayley operators (see Appendix B for details) of $\mathcal{F}$ and $\mathcal{G}$, respectively. Let $u^{(k)}$ be shorthand for $\mathrm{vec}(U^{(k)})$. Then we can formulate the PR splitting as

$$u^{(k+1)} := F^{PR}_\alpha(u^{(k)}) = 2V\big(2\,\mathrm{prox}^f_\alpha(u^{(k)}) - u^{(k)} + \alpha\,\mathrm{vec}(g_B(X))\big) - 2\,\mathrm{prox}^f_\alpha(u^{(k)}) + u^{(k)}, \qquad (7)$$

where $V := (I + \alpha(I - G^\top \otimes W))^{-1}$ and $u^{(0)}$ is the zero vector. With the parametrizations discussed in Sec. 3, the linear operator $\mathcal{F}$ in Equation (3) is strongly monotone and $L$-Lipschitz with $L = \|I - G^\top \otimes W\|$. Therefore, its Cayley operator $C_\mathcal{F}$, and hence $F^{PR}_\alpha$, is contractive, with the optimal choice of $\alpha$ being $1/L$; see [68, Section 6]. In particular, it is suggested to choose $\alpha = 1/(1 + \phi(\gamma))$ when using the orthogonal parametrization $W = \phi(\gamma)(I - S)(I + S)^{-1}$. The pseudocode for the detailed implementation of PR splitting in Equation (7) can be found in Appendix F.1.

Remark 2. Douglas-Rachford (DR) splitting is another option for solving MIGNN and is often faster than PR. However, in our case PR is contractive, making it faster than DR for the same $\alpha$.

PR splitting also benefits MIGNNs in learning LRD when the orthogonal parameterization is used. To see this, we have the following Neumann series expansion of $V(u^{(k)})$:

$$V(u^{(k)}) = \big(I + \alpha(I - G^\top \otimes W)\big)^{-1}(u^{(k)}) = \frac{1}{1+\alpha}\Big(I - \frac{G^\top \otimes W}{1 + 1/\alpha}\Big)^{-1}(u^{(k)}) = \frac{1}{1+\alpha}\sum_{i=0}^{\infty}\frac{\mathrm{vec}\big(W^i U^{(k)} G^i\big)}{(1 + 1/\alpha)^i}, \qquad (8)$$

where the last equality follows from $(A \otimes B)^k = A^k \otimes B^k$ and $(A \otimes B)\mathrm{vec}(C) = \mathrm{vec}(BCA^\top)$ for all $A$, $B$, and $C$ satisfying dimensional consistency. Equation (8) indicates that each node can access information from its $\infty$-hop neighbors in a single PR iteration for MIGNN with orthogonal parameterization. This cannot be said of the monotone parameterization with large $\|W\|$, as the Neumann series expansion in the last equality of Equation (8) no longer applies. Evaluating $\frac{1}{1+\alpha}\big(I - \frac{G^\top \otimes W}{1+1/\alpha}\big)^{-1}(u^{(k)})$ can be carried out using the Bartels-Stewart algorithm [11], which converts computing $V$ into diagonalizing the matrices $G^\top$ and $W$. From Equation (8), we have

$$V(\mathrm{vec}(U^{(k)})) = \frac{1}{1+\alpha}\,\mathrm{vec}\Big(Q_W\big(H \odot (Q_W^{-1} U^{(k)} Q_{G^\top})\big) Q_{G^\top}^\top\Big), \qquad (9)$$

where $Q_{G^\top}\Lambda_{G^\top}Q_{G^\top}^\top$ and $Q_W\Lambda_W Q_W^{-1}$ are the eigen-decompositions of $G^\top$ and of $W$, respectively, and $H \in \mathbb{R}^{d\times n}$ has $(i,j)$-th entry $H_{ij} = 1/\big(1 - \frac{(\Lambda_W)_{ii}(\Lambda_{G^\top})_{jj}}{1 + 1/\alpha}\big)$. We provide a proof of Equation (9) in Appendix E.4. According to Equation (9), one only needs to calculate the eigen-decomposition of $G$ once prior to training and the eigen-decomposition of $W$ once per epoch. The above matrix inversion procedure echoes the idea of EIGNN [58]. MIGNN has multiple layers, with each fixed-point iteration representing one layer; in contrast, EIGNN is reducible to a one-layer model, see Appendix A.2 for details on EIGNN. Although PR splitting can capture LRD in a single iteration, computing $V$ in Equation (7) requires computationally prohibitive matrix inversion.
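For concreteness, a sketch of applying $V$ via Equation (9) with precomputed eigen-decompositions (names are ours; for simplicity this sketch assumes real eigen-decompositions, whereas an orthogonal $W$ generally has complex eigenvalues and would require complex arithmetic):

```python
import torch

def apply_V(U, QW, LW, QG, LG, alpha):
    """Equation (9): V(vec(U)) = vec(Q_W (H * (Q_W^{-1} U Q_G)) Q_G^T) / (1+alpha),
    where H_ij = 1 / (1 - (Lambda_W)_ii (Lambda_G)_jj / (1 + 1/alpha)).
    QG, LG come from G^T = QG diag(LG) QG^T (computed once before training);
    QW, LW come from W = QW diag(LW) QW^{-1} (computed once per epoch)."""
    # note 1 / (1 + 1/alpha) = alpha / (1 + alpha)
    H = 1.0 / (1.0 - (alpha / (1.0 + alpha)) * LW[:, None] * LG[None, :])
    M = torch.linalg.inv(QW) @ U @ QG      # Q_W^{-1} U Q_{G^T}
    return (QW @ (H * M) @ QG.T) / (1.0 + alpha)
```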
We provide two remedies to address the cost of computing $V$ for MIGNN with orthogonal parameterization: 1) we use a Neumann series expansion to approximate the matrix inversion when the orthogonal parameterization is used, and 2) we replace the graph-related matrix $G$ with a generalized graph diffusion convolution matrix, e.g. the heat kernel or personalized PageRank [34; 33]. Notice that these two remedies do not work for MIGNN with monotone parameterization, since we can no longer use the Neumann series approximation. Therefore, MIGNN with monotone parameterization using PR splitting is not scalable to learning large graphs.

Neumann series approximation. With the orthogonal parameterization of $W$ we have $\big\|\frac{G^\top \otimes W}{1 + 1/\alpha}\big\| < 1$, ensuring efficient approximation of $V$ in Equation (7) using only a few terms of its Neumann series expansion. The $K$-th order Neumann series expansion of $V(\mathrm{vec}(U^{(k)}))$ is given by

$$N_K(\mathrm{vec}(U^{(k)})) := \frac{1}{1+\alpha}\sum_{i=0}^{K}\frac{\mathrm{vec}\big(W^i U^{(k)} G^i\big)}{(1 + 1/\alpha)^i}. \qquad (10)$$

According to Equation (7), the $K$-th order Neumann series approximated PR iteration function, denoted as $F^{PR,K}_\alpha$, can be written as

$$u^{(k+1)} := F^{PR,K}_\alpha(u^{(k)}) = 2N_K\big(2\,\mathrm{prox}^f_\alpha(u^{(k)}) - u^{(k)} + \alpha\,\mathrm{vec}(g_B(X))\big) - 2\,\mathrm{prox}^f_\alpha(u^{(k)}) + u^{(k)}. \qquad (11)$$

Each node can access information from its $K$-hop neighbors in one $K$-th order Neumann series approximated PR iteration, which is more efficient than existing IGNNs. Such a treatment can also significantly accelerate forward propagation. Intuitively, each iteration of MIGNN with the $K$-th order Neumann series approximated PR iteration aggregates information from $K$-hop neighbors, enabling the use of far fewer iterations than IGNN, which aggregates one hop per iteration. MIGNN can use a much smaller $\lambda_1(|W|)$ than IGNN to reach the same number of hops, meaning MIGNN converges much faster than IGNN.

MIGNN with diffusion convolution. We can also improve MIGNNs for learning LRD using graph diffusion convolution [34; 1]; i.e., instead of using $\hat{A}$ or $L$ as defined above, we can set $G$ to be a combination of higher powers of $\hat{A}$ or $L$, so that each node aggregates features from multi-hop neighbors at each iteration. In particular, we let $G = \tilde{D}^{-1/2}(A + \cdots + A^P)\tilde{D}^{-1/2}$ for any positive integer $P$, where $\tilde{D}$ is the degree matrix with $\tilde{D}_{ii} = \sum_{j=1}^n\sum_{k=1}^P (A^k)_{ij}$; other choices of $G$ can be found in [34]. We can show that the eigenvalues of $\tilde{D}^{-1/2}(A + \cdots + A^P)\tilde{D}^{-1/2}$ all lie within $[-1, 1]$; see Appendix E.4 for a proof. As such, the orthogonal parameterization of $W$ still ensures the well-posedness of MIGNN. We write the MIGNN with $P$-th order diffusion matrix $G$ as

$$Z = \sigma\big(WZ\tilde{D}^{-1/2}(A + A^2 + \cdots + A^P)\tilde{D}^{-1/2} + g_B(X)\big). \qquad (12)$$

We can further apply the operator splitting schemes to Equation (12). In particular, we denote the model as MIGNN-NKDP when $W$ is orthogonal and Equation (12) is implemented using $P$-th order diffusion and the $K$-th order Neumann series approximated PR iteration.
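A minimal sketch of the $K$-th order Neumann approximation $N_K$ in Equation (10), in matrix rather than vectorized form (names are ours):

```python
import torch

def neumann_NK(U, W, G, alpha, K):
    """N_K in matrix form: (1/(1+alpha)) * sum_{i=0}^K W^i U G^i / (1+1/alpha)^i.
    Each added term reaches one more hop of neighbors via the extra factor of G."""
    out = torch.zeros_like(U)
    term = U                        # W^0 U G^0
    for i in range(K + 1):
        out = out + term / (1.0 + 1.0 / alpha) ** i
        term = W @ term @ G         # W^{i+1} U G^{i+1}
    return out / (1.0 + alpha)
```

Only sparse multiplications by $G$ and dense multiplications by the small $d \times d$ matrix $W$ are needed, which is what makes the approximation scalable to large graphs.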

4.1.3. ANDERSON ACCELERATION

We have already seen that the main steps in both the FB and PR splitting schemes involve iterating fixed-point equations, e.g. Equations (6) and (7), and we can utilize Anderson acceleration [4] to accelerate the convergence of these iterations. We provide the detailed formulation and pseudocode for Anderson-accelerated operator-splitting-based MIGNNs in Appendix F.3.

4.2. BACKWARD PROPAGATION FOR UPDATING MIGNNS

We derive backpropagation for MIGNN based on implicit differentiation [35; 7; 26]. Recall that the vectorized MIGNN $\mathrm{vec}(Z) = \sigma\big((G^\top \otimes W)\mathrm{vec}(Z) + \mathrm{vec}(g_B(X))\big)$ has equilibrium point $\mathrm{vec}(Z^*)$. For any loss function $\ell$ and any parameter $\theta$ ($W$ or $B$), we have

$$\frac{\partial \ell}{\partial \theta} = \frac{\partial \ell}{\partial\,\mathrm{vec}(Z^*)}\Big(I - J\,(G^\top \otimes W)\Big)^{-1}\frac{\partial\, \sigma\big((G^\top \otimes W)\mathrm{vec}(Z^*) + \mathrm{vec}(g_B(X))\big)}{\partial \theta}, \qquad (13)$$

where $J$ is the Jacobian of $\sigma$ evaluated at $(G^\top \otimes W)\mathrm{vec}(Z^*) + \mathrm{vec}(g_B(X))$. The values of the first and last terms in Equation (13) can be found through automatic differentiation by running one more iteration in the forward pass. Note that the product of the first two terms is the same for any $\theta$; hence one only needs to compute it once in each backward pass. However, it can still be expensive to compute $\frac{\partial \ell}{\partial\,\mathrm{vec}(Z^*)}\big(I - J(G^\top \otimes W)\big)^{-1}$. Following [77, Theorem 2], the operator splitting methods can be used in the backward pass, so that computing $\big(I - J(G^\top \otimes W)\big)^{-1}$ is converted into computing $V = \big(I + \alpha(I - G^\top \otimes W)\big)^{-1}$, which is already calculated in the forward pass; see Appendix F.2. As in the forward propagation, the backpropagation can also benefit from Anderson acceleration via an iterative formulation; we provide more details in Appendix F.2.
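As an illustration of how the linear system in Equation (13) can be solved iteratively rather than by explicit inversion, consider the sketch below (names are ours; forming $G^\top \otimes W$ explicitly is only for small illustrative problems, and $J$ is the 0/1 ReLU mask stored as a vector):

```python
import torch

def implicit_grad(grad_z, J_mask, A, max_iter=200, tol=1e-6):
    """Solve v = grad_z + (v * J) A by fixed-point iteration, so that
    v = grad_z (I - J A)^{-1} as required in Equation (13). Here grad_z is
    the row vector dl/dvec(Z*), A stands for G^T kron W, and J_mask is the
    diagonal of the ReLU Jacobian at the equilibrium."""
    v = grad_z.clone()
    for _ in range(max_iter):
        v_next = grad_z + (v * J_mask) @ A
        if (v_next - v).norm() <= tol * (1 + v.norm()):
            break
        v = v_next
    return v_next
```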

5. EXPERIMENTAL RESULTS

In this section, we compare the performance of MIGNN-Mon (MIGNN with monotone parameterization implemented via FB splitting) and MIGNN-NKDP (MIGNN with orthogonal parameterization implemented via PR splitting, accompanied by $K$-th order Neumann series approximation and $P$-th order graph diffusion convolution) with IGNN and several other popular GNNs on various graph classification tasks at both node and graph levels. We aim to show that 1) MIGNN-Mon is significantly more expressive than IGNN for both node and graph classification, and 2) MIGNN-NKDP can learn LRD effectively, efficiently, and stably. The hyperparameters used for each model are provided in Appendix K. We conduct all experiments on NVIDIA RTX 3090 graphics cards.

5.1. DIRECTED CHAIN CLASSIFICATION

To show that MIGNNs can capture LRD in the underlying graphs, we test them on the synthetic chain task using the experimental setup from [58]. The chain task dataset comprises $c$ classes and $n_c$ single-linked directed chains per class, each containing $l$ nodes. For each chain, only the feature on the first node encodes the label information. The data is partitioned into training, validation, and test sets of 5%, 10%, and 85%, respectively. We consider binary ($c = 2$) and three-class ($c = 3$) classification problems over several different chain lengths. For IGNN, we use the experimental settings of [71]. For MIGNN, we consider MIGNN-NKDP for this task. Fig. 3 shows the test accuracy, averaged over 5 random seeds, of different models for classifying directed chains of lengths ranging from 50 to 300 in increments of 50 for the binary case, and from 40 to 200 in increments of 20 for the three-class case. For binary classification, MIGNN-N3D3 and MIGNN-N3D5 both score perfectly for all random initializations and all considered chain lengths. For the three-class task, both MIGNN models achieve consistently high accuracy, and the higher-order diffusion model outperforms the lower-order one on longer chains. In contrast, the accuracy of IGNN is much lower and less stable than that of MIGNNs, and IGNN's performance generally worsens as the chain length increases. We provide ablation studies of the impact of the order of the Neumann series approximation and of the graph diffusion convolution on chain classification accuracy and computational time in Appendices G and H, respectively.

We can also set $G$ to be the diffusion matrix in Equation (12) to enhance IGNN's capability to learn LRD; e.g., we can equip IGNN with a diffusion matrix of order 5, and we denote the resulting model as IGNN-D5. Fig. 3 further contrasts the performance of the diffusion-enhanced models, and we observe that MIGNN is more consistent and more accurate as the chain length increases. Based on operator splitting theory, we expect MIGNNs to be more computationally efficient than IGNNs when both models can accurately classify the chain nodes. Fig. 4 compares the accuracy and computational efficiency of MIGNN-N2D5 and IGNN for three-class chain classification. We see that MIGNN-N2D5 stably approaches perfect accuracy, whereas IGNN changes abruptly around epoch 400. When both models accurately classify the chains, MIGNN-N2D5 requires fewer iterations and less computational time than IGNN.

5.2. GRAPH NODE CLASSIFICATION

In this subsection, we contrast MIGNN-Mon with existing implicit GNN benchmarks on three graph node classification tasks. We follow the training procedure outlined in [22] and report the mean accuracy of 10-fold cross-validation in Table 1. MIGNN-Mon outperforms the implicit model benchmarks IGNN and EIGNN on all three tasks. We provide ablation studies of the impact of the order of the Neumann series and of the graph diffusion convolution for graph node classification in Appendices G and H, respectively.

5.3. GRAPH CLASSIFICATION

In this subsection, we verify that MIGNN-Mon can be more expressive than IGNN for graph classification, since the eigenvalues of the monotone parameterization are more flexible than those of IGNN. We consider five bioinformatics graph classification benchmarks: MUTAG, PTC, COX2, PROTEINS, and NCI1 [81]; details of these datasets are provided in Appendix J. Training is performed using 10-fold cross-validation with the experimental setup of [71]. The averaged test accuracy and standard deviation across the 10 folds are shown in Table 2 (graph classification mean accuracy (%) ± standard deviation; results of the baseline models are taken from [22] and are consistent with our reproduced results). For both IGNN and MIGNN-Mon, we use the hyperparameters outlined in [71]. Clearly, MIGNN-Mon outperforms IGNN on all tasks. To verify our theory, we report the evolution of $\lambda_1(|W|)$ for three of the ten folds of MUTAG in Fig. 5; for all of these folds $\lambda_1(|W|)$ exceeds one. Table 2 also reports the accuracy of MIGNN-N3D1 against several baseline models: it performs better than IGNN and GIND on all tasks and achieves the best accuracy on the COX2 and PROTEINS tasks among all studied models. These results show that learning LRD effectively is vital for classifying these graphs. We provide ablation studies of the impact of the order of the Neumann series and of the graph diffusion convolution on classification accuracy and computational time in Appendices G and H, respectively.

5.4. LARGER SCALE GRAPH NODE CLASSIFICATION

We further show the advantages of MIGNN-NKDP over IGNN and other GNNs on a larger-scale graph node classification task: the Amazon co-purchasing dataset, which contains 334863 nodes and 925872 edges, and whose graph diameter is 44 [82]. We provide more details on the Amazon co-purchasing dataset in Appendix J. As in [23], we train on portions of the graph ranging from 5% to 9% and test on sets representing 10% of the total graph. We report both Macro-F1 and Micro-F1, consistent with [71]. Fig. 6 compares MIGNN-N1D1 with IGNN using 5% of the graph for training. $\lambda_1(|W|)$ of MIGNN-N1D1 is much smaller than that of IGNN, implying faster convergence of MIGNN-N1D1, as confirmed by the fact that MIGNN-N1D1 saves significantly on the number of iterations and computational time over IGNN.

We further consider a physical problem of fluid flow in porous media, following [63]. The model is a 3D graph whose nodes and edges correspond to pore chambers and throats. We sample training graphs of different sizes between 100 and 500, which are generated to fit into 0.1 m³ cubes. We aim to predict the equilibrium pressures $Z^*$ inside the pore networks $G$. We train MIGNN to minimize the mean-squared error (MSE) between the prediction and $Z^*$. We use the experimental setup of [63] and include their reported results for IGNN. Both IGNN and MIGNN use the same encoder and decoder architecture. Graphs of 50-200 nodes are sampled in training, and 1000 test graphs are generated with pore counts from 200 to 500. Fig. 8 shows the MSE on the test graphs as the number of nodes (pores) varies from 200 to 500. MIGNN with both monotone and orthogonal parameterizations outperforms IGNN by a significant margin. For this task of learning physical diffusion in networks, CGS [63] achieves better accuracy than MIGNN and IGNN. As a future direction, we plan to integrate the idea of the learnable graph-related matrix $G$ used in CGS with our proposed MIGNN to further improve its performance for learning physical diffusion in networks.

6. CONCLUDING REMARKS

We propose MIGNN based on a monotone operator viewpoint of IGNN. In particular, MIGNN can be parameterized more flexibly than the benchmark IGNN.

A RELATED IMPLICIT GRAPH NEURAL NETWORK MODELS

EIGNN. EIGNN [58] finds the equilibrium of the fixed-point iteration

$$Z^{(k+1)} = \gamma\, g(F) Z^{(k)} G + X, \qquad (14)$$

where $Z^{(\cdot)}$ denotes the hidden feature, $G$ is the normalized augmented adjacency matrix $\hat{A}$ (see Section 1), $X$ is the input feature, $g(F)$ is the weight matrix, parameterized to guarantee convergence, and $\gamma$ is a constant scalar in $(0, 1)$. Note that there is no nonlinearity in the fixed-point Equation (14), which allows EIGNN to find the equilibrium by the closed formula

$$\lim_{k\to\infty}\mathrm{vec}\big(Z^{(k+1)}\big) = \big(I - \gamma(G^\top \otimes g(F))\big)^{-1}\mathrm{vec}(X). \qquad (15)$$

For computational efficiency, the matrix inversion is reduced to eigenvalue decompositions of $G^\top$ and $g(F)$, where the eigenvalue decomposition of $G^\top$ is pre-calculated before training.

CGS. The convergent graph solver (CGS) is an implicit graph neural network proposed by Park et al. [63], whose fixed-point equation can be described as

$$Z^{(k+1)} = \gamma Z^{(k)} G_\theta + g_B(X), \qquad (16)$$

where $Z^{(\cdot)}$ is the hidden feature, $\gamma$ is the contraction factor, $G_\theta \in \mathbb{R}^{n\times n}$ is a learnable graph-related matrix, and $g_B(X)$ is an input-dependent bias term. As in the EIGNN case, the linearity of Equation (16) allows the fixed point to be found in closed form.

GIND. The optimization-induced graph implicit nonlinear diffusion (GIND) model is an implicit graph neural network proposed by Chen et al. [22]. GIND involves a fixed-point iteration of the form

$$Z^{(k+1)} = -W^\top \sigma\big(W(Z^{(k)} + g_B(X))G\big)G^\top, \qquad (17)$$

where $Z^{(\cdot)}$ is the hidden feature, $W$ is the weight matrix, $g_B(X)$ is an input-dependent bias term, and $G$ is a normalization of the adjacency matrix $A$, precisely $G := \hat{D}^{-1/2}A/\sqrt{2}$, where $\hat{D}$ is the degree matrix of the augmented adjacency matrix $A + I$, given by $\hat{D}_{ii} := 1 + \sum_j A_{ij}$. The weight matrix $W$ is parameterized so that $\|W\|\|G\| < 1$. As in IGNN, Picard iteration is employed to find the fixed point. The authors claim that this fixed-point equation (Equation (17)) represents a nonlinear diffusion process with anisotropic properties, while IGNN only represents a linear isotropic diffusion. However, we observe that GIND is closely related to the following simple variant of IGNN:

$$\tilde{Z}^{(k+1)} = \sigma\big(W(-W^\top)\tilde{Z}^{(k)}G^\top G + Wg_B(X)G\big), \qquad (18)$$

where the notations are the same as in Equation (17). In fact, once $\|W\|\|G\| < 1$ and $\sigma$ is a non-expansive activation function (for example, tanh, ReLU, ELU), Equation (18) is contractive and hence its fixed point exists. Let $\tilde{Z}^*$ be the fixed point of Equation (18); then we claim that $\bar{Z} = -W^\top \tilde{Z}^* G^\top$ is a fixed point of Equation (17) with the same $W$, $G$, and $g_B(X)$ used in both equations. This can be seen from the direct calculation

$$\bar{Z} = -W^\top \tilde{Z}^* G^\top = -W^\top \sigma\big(W(-W^\top)\tilde{Z}^* G^\top G + Wg_B(X)G\big)G^\top = -W^\top \sigma\big(W\bar{Z}G + Wg_B(X)G\big)G^\top = -W^\top \sigma\big(W(\bar{Z} + g_B(X))G\big)G^\top.$$

B A BRIEF REVIEW OF MONOTONE OPERATOR THEORY

B.1 OPERATORS

In this section, we briefly review the definition and basic theory of monotone operators; more details can be found in [68]. We say $T$ is a (set-valued) operator if $T$ maps a point in $\mathbb{R}^d$ to a subset of $\mathbb{R}^d$, and we denote this as $T : \mathbb{R}^d \rightrightarrows \mathbb{R}^d$. We define the graph of an operator as $\mathrm{Gra}\, T = \{(x, u) \mid u \in T(x)\}$. Mathematically, an operator and its graph are equivalent; in other words, we can view $T : \mathbb{R}^d \rightrightarrows \mathbb{R}^d$ both as a point-to-set mapping and as a subset of $\mathbb{R}^d \times \mathbb{R}^d$. Many notions for functions extend to operators. For example, the domain and range of an operator $T$ are defined as $\mathrm{dom}\, T = \{x \mid T(x) \ne \emptyset\}$ and $\mathrm{range}\, T = \{y \mid y = T(x),\ x \in \mathbb{R}^d\}$. If $T$ and $S$ are two operators, we define their composition as $T \circ S(x) = TS(x) = T(S(x))$ and their sum as $(T + S)(x) = T(x) + S(x)$. Alternatively, we can define the operator composition and sum using their graphs: $TS = \{(x, z) \mid \exists y,\ (x, y) \in S,\ (y, z) \in T\}$ and $T + S = \{(x, y + z) \mid (x, y) \in T,\ (x, z) \in S\}$. The identity ($I$) and zero ($0$) operators are defined as $I = \{(x, x) \mid x \in \mathbb{R}^d\}$ and $0 = \{(x, 0) \mid x \in \mathbb{R}^d\}$. We say an operator $T$ is $L$-Lipschitz ($L > 0$) if $\|T(x) - T(y)\| \le L\|x - y\|$ for all $x, y \in \mathrm{dom}\, T$, i.e. $\|u - v\| \le L\|x - y\|$ for all $(x, u), (y, v) \in T$. The inverse operator of $T$ is defined as $T^{-1} = \{(y, x) \mid (x, y) \in T\}$. When $0 \in T(x)$, we say that $x$ is a zero of $T$, and we write the zero set of an operator $T$ as $\mathrm{Zer}\, T = \{x \mid 0 \in T(x)\} = T^{-1}(0)$.

B.2 MONOTONE OPERATORS

An operator $T$ on $\mathbb{R}^d$ is said to be monotone if $\langle u - v, x - y\rangle \ge 0$ for all $(x, u), (y, v) \in T$, where $\langle\cdot,\cdot\rangle$ denotes the inner product between two vectors. Equivalently, we can express monotonicity as $\langle T(x) - T(y), x - y\rangle \ge 0$ for all $x, y \in \mathbb{R}^d$. Furthermore, we say the operator $T$ is maximal monotone if there is no other monotone operator $S$ such that $\mathrm{Gra}\, T \subset \mathrm{Gra}\, S$ properly; in other words, if the monotone operator $T$ is not maximal, then there exists $(x, u) \notin T$ such that $T \cup \{(x, u)\}$ is still monotone. A continuous monotone function $F : \mathbb{R}^d \to \mathbb{R}^d$ is maximal monotone. An operator $T : \mathbb{R}^d \rightrightarrows \mathbb{R}^d$ is $B$-strongly monotone or $B$-coercive if $B > 0$ and $\langle u - v, x - y\rangle \ge B\|x - y\|^2$ for all $(x, u), (y, v) \in T$; we say $T$ is strongly monotone if it is $B$-strongly monotone for some unspecified constant $B \in (0, \infty)$. In particular, a linear operator $\mathcal{F}(x) = Gx + h$, with $G \in \mathbb{R}^{d\times d}$ and $h \in \mathbb{R}^d$, is maximal monotone if and only if $G + G^\top \succeq 0$ (where $0$ stands for the all-zero matrix), and $B$-strongly monotone if $\frac{1}{2}(G + G^\top) \succeq BI$. Similarly, a subdifferential operator $\partial f$ is maximal monotone if and only if $f$ is a convex closed proper (CCP) function. An operator $T$ is $\beta$-cocoercive or $\beta$-inverse strongly monotone if $\beta > 0$ and $\langle u - v, x - y\rangle \ge \beta\|u - v\|^2$ for all $(x, u), (y, v) \in T$; we say $T$ is cocoercive if it is $\beta$-cocoercive for some unspecified constant $\beta \in (0, \infty)$. In particular, if the linear operator $\mathcal{F}(x) = Gx + h$ is $B$-strongly monotone and $L$-Lipschitz, then $\mathcal{F}$ is $\frac{B}{L^2}$-cocoercive.

C A BRIEF REVIEW OF OPERATOR SPLITTING SCHEMES

In this section, we provide a brief review of a few celebrated operator splitting schemes for solving fixed-point equilibrium equations.

C.1 RESOLVENT AND CAYLEY OPERATORS

The resolvent and Cayley operators of an operator $T$ are defined, respectively, as $R_T = (I + \alpha T)^{-1}$ and $C_T = 2R_T - I$, where $\alpha > 0$ is a constant. The resolvent and Cayley operators are both non-expansive for any maximal monotone operator $T$, i.e. they both have Lipschitz constant $L \le 1$; the resolvent operator $R_T$ is contractive (i.e. $L < 1$) for strongly monotone $T$, and the Cayley operator $C_T$ is contractive for strongly monotone and Lipschitz $T$. There are two well-known properties associated with resolvent operators:
• First, when $\mathcal{F}(x) = Gx + h$ is a linear operator, then $R_\mathcal{F}(x) = (I + \alpha G)^{-1}(x - \alpha h)$.
• Second, when $\mathcal{F} = \partial f$ for some CCP function $f$, the resolvent is given by the proximal operator $R_\mathcal{F}(x) = \mathrm{prox}^f_\alpha(x) := \arg\min_z \frac{1}{2}\|x - z\|^2 + \alpha f(z)$.
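A small numerical sketch of the linear-operator case (function names are ours):

```python
import numpy as np

def resolvent_linear(G, h, x, alpha):
    """Resolvent of the linear operator F(x) = G x + h:
    R_F(x) = (I + alpha G)^{-1} (x - alpha h), computed via a linear solve."""
    return np.linalg.solve(np.eye(G.shape[0]) + alpha * G, x - alpha * h)

def cayley_linear(G, h, x, alpha):
    """Cayley operator C_F = 2 R_F - I applied to x."""
    return 2.0 * resolvent_linear(G, h, x, alpha) - x
```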

C.2 OPERATOR SPLITTING SCHEMES

Operator splitting schemes refer to methods for finding a zero of a sum of operators (assumed here to be maximal monotone), i.e. finding $x$ such that $0 \in (\mathcal{F} + \mathcal{G})(x)$. We present a few popular operator splitting schemes for solving this monotone inclusion problem.

• Forward-backward splitting (FB): Consider the monotone inclusion problem $\mathrm{find}_{x\in\mathbb{R}^d}\ 0 \in (\mathcal{F} + \mathcal{G})(x)$, where $\mathcal{F}$ and $\mathcal{G}$ are maximal monotone and $\mathcal{F}$ is single-valued. Then for any $\alpha > 0$, we have

$$0 \in (\mathcal{F} + \mathcal{G})(x) \iff 0 \in (I + \alpha\mathcal{G})(x) - (I - \alpha\mathcal{F})(x) \iff (I + \alpha\mathcal{G})(x) \ni (I - \alpha\mathcal{F})(x) \iff x = R_\mathcal{G}(I - \alpha\mathcal{F})(x).$$

Therefore, $x$ is a solution if and only if it is a fixed point of $R_\mathcal{G}(I - \alpha\mathcal{F})$. Moreover, if $\mathcal{F}$ is $\beta$-cocoercive, the Picard iteration using forward-backward splitting, $x^{(k+1)} = R_\mathcal{G}(x^{(k)} - \alpha\mathcal{F}x^{(k)})$, converges if $\alpha \in (0, 2\beta)$ and $\mathrm{Zer}(\mathcal{F} + \mathcal{G}) \ne \emptyset$.

• Peaceman-Rachford splitting (PR): Consider the monotone inclusion problem $\mathrm{find}_{x\in\mathbb{R}^d}\ 0 \in (\mathcal{F} + \mathcal{G})(x)$, where $\mathcal{F}$ and $\mathcal{G}$ are maximal monotone. For any $\alpha > 0$, we have

$$0 \in (\mathcal{F} + \mathcal{G})(x) \iff 0 \in (I + \alpha\mathcal{F})(x) - (I - \alpha\mathcal{G})(x) \iff 0 \in (I + \alpha\mathcal{F})(x) - C_\mathcal{G}(I + \alpha\mathcal{G})(x) \iff 0 \in (I + \alpha\mathcal{F})(x) - C_\mathcal{G}(z),\ z \in (I + \alpha\mathcal{G})(x) \iff C_\mathcal{G}(z) \in (I + \alpha\mathcal{F})R_\mathcal{G}(z),\ x = R_\mathcal{G}(z) \iff R_\mathcal{F}C_\mathcal{G}(z) = R_\mathcal{G}(z),\ x = R_\mathcal{G}(z) \iff C_\mathcal{F}C_\mathcal{G}(z) = z,\ x = R_\mathcal{G}(z).$$

Therefore, $x$ is a solution if and only if there is a solution $z$ of the fixed-point equilibrium equation $z = C_\mathcal{F}C_\mathcal{G}(z)$ with $x = R_\mathcal{G}(z)$; this is called Peaceman-Rachford splitting.

• Douglas-Rachford splitting (DR): Sometimes the operator $C_\mathcal{F}C_\mathcal{G}$ is merely non-expansive, and the Picard iteration with PR, $z^{(k+1)} = C_\mathcal{F}C_\mathcal{G}(z^{(k)})$, is not guaranteed to converge. To guarantee convergence, note that for any $\alpha > 0$ we have

$$0 \in (\mathcal{F} + \mathcal{G})(x) \iff \Big(\tfrac{1}{2}I + \tfrac{1}{2}C_\mathcal{F}C_\mathcal{G}\Big)(z) = z, \quad x = R_\mathcal{G}(z),$$

which is called Douglas-Rachford splitting. The Picard iteration with DR can be written as

$$x^{(k+1/2)} = R_\mathcal{G}(z^{(k)}), \quad x^{(k+1)} = R_\mathcal{F}(2x^{(k+1/2)} - z^{(k)}), \quad z^{(k+1)} = z^{(k)} + x^{(k+1)} - x^{(k+1/2)},$$

which converges for any $\alpha > 0$ if $\mathrm{Zer}(\mathcal{F} + \mathcal{G}) \ne \emptyset$.
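As a concrete instance of FB splitting, take $\mathcal{F} = \nabla\frac{1}{2}\|Ax - b\|^2$ (single-valued and cocoercive) and $\mathcal{G} = \partial(\lambda\|x\|_1)$, whose resolvent is soft-thresholding; the resulting iteration is the classical ISTA algorithm (a sketch, names ours):

```python
import numpy as np

def ista(A, b, lam, alpha, iters=500):
    """FB splitting for 0 in (F + G)(x) with F = grad of 0.5||Ax - b||^2 and
    G = subgradient of lam * ||x||_1; the backward step is the soft-thresholding
    proximal operator, i.e. the resolvent of G."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - alpha * (A.T @ (A @ x - b))                        # forward step
        x = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)  # backward step
    return x
```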

D PROPERTIES OF KRONECKER PRODUCT

In this section, we collect some Kronecker product results used in this paper.

Definition 1. Let $A \in \mathbb{R}^{p\times q}$ and $B \in \mathbb{R}^{r\times s}$ be two matrices. Their Kronecker product $A \otimes B \in \mathbb{R}^{pr\times qs}$ is defined as

$$A \otimes B = \begin{pmatrix} A_{11}B & \cdots & A_{1q}B \\ \vdots & \ddots & \vdots \\ A_{p1}B & \cdots & A_{pq}B \end{pmatrix}.$$

The following identities about the Kronecker product hold:
• $(A \otimes B)^\top = A^\top \otimes B^\top$ for all $A \in \mathbb{R}^{p\times q}$, $B \in \mathbb{R}^{r\times s}$;
• $\|A \otimes B\| = \|A\|\|B\|$ for all $A \in \mathbb{R}^{p\times q}$, $B \in \mathbb{R}^{r\times s}$;
• $\|A \otimes B\|_\infty = \|A\|_\infty\|B\|_\infty$ for all $A \in \mathbb{R}^{p\times q}$, $B \in \mathbb{R}^{r\times s}$;
• $(A \otimes B)\mathrm{vec}(C) = \mathrm{vec}(BCA^\top)$ for all $A \in \mathbb{R}^{s\times r}$, $B \in \mathbb{R}^{p\times q}$, $C \in \mathbb{R}^{q\times r}$;
• $(A \otimes B) \otimes C = A \otimes (B \otimes C)$ for all $A \in \mathbb{R}^{m\times n}$, $B \in \mathbb{R}^{p\times q}$, $C \in \mathbb{R}^{r\times s}$;
• $A \otimes (B + C) = A \otimes B + A \otimes C$ for all $A \in \mathbb{R}^{p\times q}$, $B, C \in \mathbb{R}^{r\times s}$;
• $(A + B) \otimes C = A \otimes C + B \otimes C$ for all $A, B \in \mathbb{R}^{p\times q}$, $C \in \mathbb{R}^{r\times s}$;
• $(A \otimes B)(C \otimes D) = AC \otimes BD$ for all $A \in \mathbb{R}^{p\times q}$, $B \in \mathbb{R}^{r\times s}$, $C \in \mathbb{R}^{q\times k}$, $D \in \mathbb{R}^{s\times l}$.

Proposition 3 ([43, Theorem 4.2.12]). Let $A \in \mathbb{R}^{n\times n}$ and $B \in \mathbb{R}^{m\times m}$. If we denote the eigenvalue sets of $A$ and $B$ as $\Lambda(A) = \{\lambda_1(A), \ldots, \lambda_n(A)\}$ and $\Lambda(B) = \{\lambda_1(B), \ldots, \lambda_m(B)\}$, then the eigenvalue set of $A \otimes B$ is $\Lambda(A \otimes B) = \{\lambda_i(A)\lambda_j(B),\ i = 1, \ldots, n,\ j = 1, \ldots, m\}$.
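A quick numerical check of the vectorization identity, which is the one used most heavily in this paper (column-major vec, matching the paper's convention):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))   # A in R^{s x r}
B = rng.normal(size=(5, 2))   # B in R^{p x q}
C = rng.normal(size=(2, 4))   # C in R^{q x r}
lhs = np.kron(A, B) @ C.flatten(order="F")   # (A kron B) vec(C)
rhs = (B @ C @ A.T).flatten(order="F")       # vec(B C A^T)
assert np.allclose(lhs, rhs)
```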

E TECHNICAL PROOFS

E.1 PROOFS FOR SECTION 1

$\|W\|_\infty = \||W|\|_\infty$ by definition. Hence one has $\mathrm{Lip}_\infty(f) = \|G^\top\|_\infty\|W\|_\infty \ge \lambda_1(G^\top)\lambda_1(|W|)$. Note that when $G$ is the normalized adjacency matrix $\hat{A}$ of an undirected graph, we have $\lambda_1(G^\top) = \lambda_1(G) = 1$ and hence $\mathrm{Lip}_\infty(f) \ge \lambda_1(|W|)$.

E.2 PROOFS FOR SECTION 2

Proof of Proposition 1. First recall the operator splitting problem in Equation (3) of Sec. 2: find $0 \in (\mathcal{F} + \mathcal{G})(\mathrm{vec}(Z))$, where $\mathcal{F}(\mathrm{vec}(Z)) = (I - G^\top \otimes W)\mathrm{vec}(Z) - \mathrm{vec}(g_B(X))$ and $\mathcal{G} = \partial f$; here $f$ is the indicator of the positive octant, i.e. $f(x) = \mathbb{I}\{x \ge 0\}$, for which $\mathrm{prox}^f_\alpha$ equals $\sigma$, the ReLU activation function, for all $\alpha > 0$. Note that from the condition $K = \frac{1}{2}(G^\top \otimes W + G \otimes W^\top) \preceq (1 - m)I$, one has $G^\top \otimes W \preceq (1 - m)I$ and hence $I - G^\top \otimes W \succeq mI$, which says $\mathcal{F}$ is $m$-strongly monotone for some $m > 0$. As the function $\mathcal{F}$ is linear, and hence continuous on the whole space, it is automatically maximal monotone once it is monotone. Since $f$ is a CCP function, its subdifferential operator $\mathcal{G} = \partial f$ is maximal monotone. In particular, as the linear map $\mathcal{F}$ is single-valued, we can apply the FB splitting scheme of Appendix C.2: for any $\alpha > 0$,

$$0 \in (\mathcal{F} + \mathcal{G})(\mathrm{vec}(Z)) \iff \mathrm{vec}(Z) = R_\mathcal{G}(I - \alpha\mathcal{F})(\mathrm{vec}(Z)) \iff \mathrm{vec}(Z) = \mathrm{prox}^f_\alpha\Big(\mathrm{vec}(Z) - \alpha\big(\mathrm{vec}(Z) - (G^\top \otimes W)\mathrm{vec}(Z) - \mathrm{vec}(g_B(X))\big)\Big) \iff \mathrm{vec}(Z) = \sigma\Big(\mathrm{vec}(Z) - \alpha\big(\mathrm{vec}(Z) - (G^\top \otimes W)\mathrm{vec}(Z) - \mathrm{vec}(g_B(X))\big)\Big).$$

Taking $\alpha = 1$ in the last line recovers the MIGNN model of Equation (2): $\mathrm{vec}(Z) = \sigma\big((G^\top \otimes W)\mathrm{vec}(Z) + \mathrm{vec}(g_B(X))\big)$. This shows the equivalence between finding a fixed point of the MIGNN model in Equation (2) and finding a zero of the operator splitting problem in Equation (3). Therefore, when $K \preceq (1 - m)I$, the linear map $\mathcal{F}$ is strongly monotone and Lipschitz, so the monotone splitting problem, and hence the MIGNN model, is well-posed; see Appendix C.2.

E.3 PROOFS FOR SECTION 3

The following properties of the Cayley map are used in this paper.

Proposition 4. Let $S$ be a skew-symmetric matrix. Then its image under the Cayley map $\mathrm{Cay}(S) := (I - S)(I + S)^{-1}$ is an orthogonal matrix, and hence the magnitude of all its eigenvalues is 1.

Proof. To verify that the Cayley map is well-defined, it suffices to show that $-1$ is not an eigenvalue of $S$. This follows from the general fact that every eigenvalue of a skew-symmetric matrix is purely imaginary. To see this, let $\lambda$ be an eigenvalue of $S$ with corresponding eigenvector $v$, where both $\lambda$ and $v$ may be complex. Let $v^H$ and $S^H$ denote the conjugate transposes of the vector $v$ and the matrix $S$, respectively. We then have $v^H S v = v^H(\lambda v) = \lambda|v|^2_{\mathbb{C}}$, where $|\cdot|_{\mathbb{C}}$ denotes the Euclidean norm of a complex vector. At the same time, $v^H S v = (S^H v)^H v = (-Sv)^H v = -\bar{\lambda}|v|^2_{\mathbb{C}}$, where $\bar{\lambda}$ denotes the complex conjugate of $\lambda$. Hence $\lambda = -\bar{\lambda}$, that is, $\lambda$ is purely imaginary. This concludes the proof that $(I - S)(I + S)^{-1}$ is well-defined. Note that

$$\big((I - S)(I + S)^{-1}\big)\big((I - S)(I + S)^{-1}\big)^\top = (I - S)(I + S)^{-1}(I + S)(I - S)^{-1} = I,$$

using $S^\top = -S$ and the fact that the factors, being rational functions of $S$, commute. Therefore, $(I - S)(I + S)^{-1}$ is (real) orthogonal. Finally, we present a short proof that the magnitude of every eigenvalue of a (real) orthogonal matrix $O$ equals 1. Let $\lambda_O$ be an eigenvalue of $O$ with eigenvector $w$. Then $|\lambda_O|^2|w|^2_{\mathbb{C}} = (Ow)^H(Ow) = w^H O^\top O w = |w|^2_{\mathbb{C}}$. Hence $|\lambda_O| = 1$.

Proof of Proposition 2. Since the normalized Laplacian $L$ is symmetric, we have

$$K = \frac{1}{2}\Big(\frac{1}{2}L^\top \otimes W + \frac{1}{2}L \otimes W^\top\Big) = \frac{1}{2}L \otimes \frac{1}{2}\big(W + W^\top\big).$$

The property of the Kronecker product (Proposition 3) tells us that the eigenvalues of $K$ are the products of the eigenvalues of $\frac{1}{2}L$ and of $\frac{1}{2}(W + W^\top)$. Therefore, the MIGNN model satisfies the well-posedness condition in Proposition 1 once $\lambda_i\big(\frac{1}{2}L\big)\lambda_j\big(\frac{1}{2}(W + W^\top)\big) \le 1 - m$ for all pairs of eigenvalues of $\frac{1}{2}L$ and $\frac{1}{2}(W + W^\top)$. Notice that $\frac{1}{2}L$ is positive semi-definite and all its eigenvalues are within $[0, 1]$. Therefore, $W$ guarantees the well-posedness of MIGNN as long as all eigenvalues satisfy $\lambda_i\big(\frac{1}{2}(W + W^\top)\big) \le 1 - m$. When $W = (1 - m)I - CC^\top + F - F^\top$, we have $\frac{1}{2}(W + W^\top) = (1 - m)I - CC^\top$. As $CC^\top$ is positive semi-definite, all eigenvalues of $\frac{1}{2}(W + W^\top)$ are no more than $1 - m$.

E.4 PROOFS FOR SECTION 4

The following result about the Kronecker product is adapted from [58]; we include it here for completeness.

Proof of Equation (9) in Section 4. Since $G^\top$ is symmetric, it admits an eigen-decomposition $G^\top = Q_{G^\top}\Lambda_{G^\top}Q_{G^\top}^\top$, where $Q_{G^\top}$ is orthogonal and hence satisfies $Q_{G^\top}^{-1} = Q_{G^\top}^\top$. As $W$ is diagonalizable, it admits an eigen-decomposition $W = Q_W\Lambda_W Q_W^{-1}$. Then we can write

$$G^\top \otimes W = [Q_{G^\top}\Lambda_{G^\top}Q_{G^\top}^\top] \otimes [Q_W\Lambda_W Q_W^{-1}] = [Q_{G^\top} \otimes Q_W][\Lambda_{G^\top} \otimes \Lambda_W][Q_{G^\top}^\top \otimes Q_W^{-1}].$$

Let $n = \dim(G)$ and $d = \dim(W)$; we have

$$I_{nd} = I_n \otimes I_d = [Q_{G^\top}I_n Q_{G^\top}^\top] \otimes [Q_W I_d Q_W^{-1}] = [Q_{G^\top} \otimes Q_W][I_n \otimes I_d][Q_{G^\top}^\top \otimes Q_W^{-1}].$$

Therefore, for any matrix $U \in \mathbb{R}^{d\times n}$,

$$V(\mathrm{vec}(U)) = \frac{1}{1+\alpha}\Big(I_{nd} - \frac{\alpha}{1+\alpha}(G^\top \otimes W)\Big)^{-1}(\mathrm{vec}(U)) = \frac{1}{1+\alpha}[Q_{G^\top} \otimes Q_W]\Big(I_{nd} - \frac{\alpha}{1+\alpha}\Lambda_{G^\top} \otimes \Lambda_W\Big)^{-1}[Q_{G^\top}^\top \otimes Q_W^{-1}](\mathrm{vec}(U)).$$

Note that $I_{nd} - \frac{\alpha}{1+\alpha}\Lambda_{G^\top} \otimes \Lambda_W$ is a diagonal matrix whose inverse is the diagonal matrix $\mathrm{Diag}(\mathrm{vec}(H))$, where the entries of $H$ are given by $H_{ij} := 1/\big(1 - \frac{\alpha}{1+\alpha}(\Lambda_W)_{ii}(\Lambda_{G^\top})_{jj}\big)$; here $\mathrm{Diag}(v)$ denotes the diagonal matrix with the vector $v$ on its diagonal. From this we have

$$V(\mathrm{vec}(U)) = \frac{1}{1+\alpha}[Q_{G^\top} \otimes Q_W]\,\mathrm{Diag}(\mathrm{vec}(H))\,[Q_{G^\top}^\top \otimes Q_W^{-1}](\mathrm{vec}(U)) = \frac{1}{1+\alpha}[Q_{G^\top} \otimes Q_W]\,\mathrm{Diag}(\mathrm{vec}(H))\,\mathrm{vec}\big(Q_W^{-1}UQ_{G^\top}\big) = \frac{1}{1+\alpha}[Q_{G^\top} \otimes Q_W]\,\mathrm{vec}\big(H \odot [Q_W^{-1}UQ_{G^\top}]\big) = \frac{1}{1+\alpha}\,\mathrm{vec}\big(Q_W[H \odot [Q_W^{-1}UQ_{G^\top}]]Q_{G^\top}^\top\big),$$

where $\odot$ denotes entry-wise multiplication.

For the reader's convenience, we also present the following fact, which implies that $\tilde{D}^{-1/2}(A + A^2 + \cdots + A^P)\tilde{D}^{-1/2}$ has its eigenvalues within $[-1, 1]$, as used in MIGNN with diffusion convolution (Equation (12)).

Proposition 5. Let $S \in \mathbb{R}^{n\times n}$ be a non-singular symmetric matrix and let $D$ be the degree matrix defined as the diagonal matrix with $D_{ii} = \sum_{j=1}^n |S_{ij}|$. Since $S$ is non-singular, $D^{-1/2}$ is well-defined. Then the normalization $\bar{S} := D^{-1/2}SD^{-1/2}$ of $S$ has its eigenvalues within $[-1, 1]$.

F.2 OPERATOR SPLITTING FOR BACKWARD PROPAGATION

Let $y^{(k)}$ be the intermediate variable. The procedure for applying PR splitting to Equation (19) can be summarized as first finding the fixed point $y^*$ of the iteration function

$$y^{(k+1)} := B^{PR}_\alpha(y^{(k)}) = 2V^\top\big(2(I + \alpha D)^{-1}(y^{(k)} + \alpha v) - y^{(k)}\big) - 2(I + \alpha D)^{-1}(y^{(k)} + \alpha v) + y^{(k)},$$

and then the final solution of the operator splitting problem is $\tilde{u} = (I + \alpha D)^{-1}(y^* + \alpha v)$.

F.3 ANDERSON ACCELERATION

We first introduce the general Anderson acceleration scheme. Let $f : \mathbb{R}^n \to \mathbb{R}^n$ be a function whose Lipschitz constant satisfies $L(f) < 1$, so that $f$ admits a unique fixed point, obtainable through Picard iteration. Let $h(x) = f(x) - x$ be the residual function. Let $x^{(0)}$ be the initial guess, $\beta \in (0, 1]$ a relaxation parameter, and $m \ge 1$ an integer parameter. Then Anderson acceleration updates $x^{(k)}$ as

$$x^{(k+1)} = (1 - \beta)\sum_{i=0}^{m}\gamma_i^{(k)}x^{(k-m+i)} + \beta\sum_{i=0}^{m}\gamma_i^{(k)}f\big(x^{(k-m+i)}\big),$$

where the coefficients $\gamma^{(k)} = (\gamma_0^{(k)}, \ldots, \gamma_m^{(k)})^\top$ are determined by the least-squares problem

$$\min_{\gamma=(\gamma_0,\ldots,\gamma_m)^\top}\Big\|\sum_{i=0}^{m}h\big(x^{(k-m+i)}\big)\gamma_i\Big\| \quad \text{s.t.} \quad \sum_{i=0}^{m}\gamma_i = 1.$$

Note that when $\beta = 1$, the trivial weight $\gamma^{(k)} = (0, \ldots, 0, 1)^\top$ recovers Picard iteration. Therefore, whenever Picard iteration converges, Anderson acceleration also converges, and typically faster.

In Algorithm 5, we present the FB MIGNN forward propagation with Anderson acceleration applied to the FB iteration function $F^{FB}_\alpha$, introduced in Section 4 and recalled here:

$$Z^{(k+1)} := F^{FB}_\alpha(Z^{(k)}) := \mathrm{prox}^f_\alpha\Big(Z^{(k)} - \alpha\big(Z^{(k)} - WZ^{(k)}G - g_B(X)\big)\Big).$$

Algorithm 5 MIGNN-FB-Forward: FB MIGNN forward propagation
Input: initial point $Z^{(0)} := 0$, FB damping parameter $\alpha$, AA relaxation parameter $\beta$, max storage size $m \ge 1$.
Compute $F^{(0)} = F^{FB}_\alpha(Z^{(0)})$, $H^{(0)} = F^{(0)} - Z^{(0)}$.
for $k = 1, \ldots, K$ do
  Set $m_k = \min(m, k)$
  Compute $F^{(k)} = F^{FB}_\alpha(Z^{(k)})$, $H^{(k)} = F^{(k)} - Z^{(k)}$
  Update $H := (H^{(k-m_k)}, \ldots, H^{(k)})$
  Determine $\gamma^{(k)} = (\gamma_0^{(k)}, \ldots, \gamma_{m_k}^{(k)})^\top$ that solves $\min_{\gamma}\|H\gamma\|$ s.t. $\sum_{i=0}^{m_k}\gamma_i = 1$
  Set $Z^{(k+1)} := \beta\sum_{i=0}^{m_k}\gamma_i^{(k)}F^{FB}_\alpha\big(Z^{((k-m_k)+i)}\big) + (1 - \beta)\sum_{i=0}^{m_k}\gamma_i^{(k)}Z^{((k-m_k)+i)}$
end for
return $Z^{(k+1)}$

In Algorithm 6, we present the PR MIGNN forward propagation with Anderson acceleration applied to the PR iteration function $F^{PR}_\alpha$, introduced in Section 4 and recalled here:

$$u^{(k+1)} := F^{PR}_\alpha(u^{(k)}) = 2V\big(2\,\mathrm{prox}^f_\alpha(u^{(k)}) - u^{(k)} + \alpha\,\mathrm{vec}(g_B(X))\big) - 2\,\mathrm{prox}^f_\alpha(u^{(k)}) + u^{(k)}.$$

Algorithm 6 MIGNN-PR-Forward: PR MIGNN forward propagation
Input: initial point $u^{(0)} = \mathrm{vec}(U^{(0)}) := 0$, PR damping parameter $\alpha$, AA relaxation parameter $\beta$, max storage size $m \ge 1$.
Compute $f^{(0)} := F^{PR}_\alpha(u^{(0)})$, $h^{(0)} := f^{(0)} - u^{(0)}$.
for $k = 1, \ldots, K$ do
  Set $m_k := \min(m, k)$
  Compute $f^{(k)} := F^{PR}_\alpha(u^{(k)})$, $h^{(k)} := f^{(k)} - u^{(k)}$
  Update $H := (h^{(k-m_k)}, \ldots, h^{(k)})$
  Determine $\gamma^{(k)} = (\gamma_0^{(k)}, \ldots, \gamma_{m_k}^{(k)})^\top$ that solves $\min_{\gamma}\|H\gamma\|$ s.t. $\sum_{i=0}^{m_k}\gamma_i = 1$
  Set $u^{(k+1)} := \beta\sum_{i=0}^{m_k}\gamma_i^{(k)}F^{PR}_\alpha\big(u^{((k-m_k)+i)}\big) + (1 - \beta)\sum_{i=0}^{m_k}\gamma_i^{(k)}u^{((k-m_k)+i)}$
end for
Set $U^{(k+1)} := \mathrm{vec}^{-1}(u^{(k+1)})$
return $\mathrm{prox}^f_\alpha(U^{(k+1)})$

The FB iteration function for the backpropagation, $B^{FB}_\alpha$, is introduced in Appendix F.2 and recalled here:

$$u^{(k+1)} := B^{FB}_\alpha(u^{(k)}) = (I + \alpha D)^{-1}\big((1 - \alpha)u^{(k)} + \alpha W^\top v\big).$$

We now present the Anderson-accelerated FB MIGNN backward propagation as Algorithm 7.

Algorithm 7 MIGNN-FB-Backward: FB MIGNN backward propagation
Input: initial point $u^{(0)} := 0$, FB damping parameter $\alpha$, AA relaxation parameter $\beta$, max storage size $m \ge 1$.
Compute $f^{(0)} := B^{FB}_\alpha(u^{(0)})$, $h^{(0)} := f^{(0)} - u^{(0)}$.
for $k = 1, \ldots, K$ do
  Set $m_k := \min(m, k)$
  Compute $f^{(k)} := B^{FB}_\alpha(u^{(k)})$, $h^{(k)} := f^{(k)} - u^{(k)}$
  Update $H := (h^{(k-m_k)}, \ldots, h^{(k)})$
  Determine $\gamma^{(k)} = (\gamma_0^{(k)}, \ldots, \gamma_{m_k}^{(k)})^\top$ that solves $\min_{\gamma}\|H\gamma\|$ s.t. $\sum_{i=0}^{m_k}\gamma_i = 1$
  Set $u^{(k+1)} := \beta\sum_{i=0}^{m_k}\gamma_i^{(k)}B^{FB}_\alpha\big(u^{((k-m_k)+i)}\big) + (1 - \beta)\sum_{i=0}^{m_k}\gamma_i^{(k)}u^{((k-m_k)+i)}$
end for
Set $U^{(k+1)} := \mathrm{vec}^{-1}(u^{(k+1)})$
return $v + \mathrm{vec}(W^\top U^{(k+1)} G^\top)$

The PR iteration function for the backpropagation, $B^{PR}_\alpha$, is introduced in Appendix F.2 and recalled here: with $y^{(k)}$ the intermediate variable,

$$y^{(k+1)} := B^{PR}_\alpha(y^{(k)}) = 2V^\top\big(2(I + \alpha D)^{-1}(y^{(k)} + \alpha v) - y^{(k)}\big) - 2(I + \alpha D)^{-1}(y^{(k)} + \alpha v) + y^{(k)},$$

and the final solution of the operator splitting problem is $\tilde{u} = (I + \alpha D)^{-1}(y^* + \alpha v)$. We now present the Anderson-accelerated PR MIGNN backward propagation as Algorithm 8.

Algorithm 8 MIGNN-PR-Backward: PR MIGNN backward propagation
Input: initial point $y^{(0)} := 0$, PR damping parameter $\alpha$, AA relaxation parameter $\beta$, max storage size $m \ge 1$.
Compute $f^{(0)} := B^{PR}_\alpha(y^{(0)})$, $h^{(0)} := f^{(0)} - y^{(0)}$.
for $k = 1, \ldots, K$ do
  Set $m_k := \min(m, k)$
  Compute $f^{(k)} := B^{PR}_\alpha(y^{(k)})$, $h^{(k)} := f^{(k)} - y^{(k)}$
  Update $H := (h^{(k-m_k)}, \ldots, h^{(k)})$
  Determine $\gamma^{(k)} = (\gamma_0^{(k)}, \ldots, \gamma_{m_k}^{(k)})^\top$ that solves $\min_{\gamma}\|H\gamma\|$ s.t. $\sum_{i=0}^{m_k}\gamma_i = 1$
  Set $y^{(k+1)} := \beta\sum_{i=0}^{m_k}\gamma_i^{(k)}B^{PR}_\alpha\big(y^{((k-m_k)+i)}\big) + (1 - \beta)\sum_{i=0}^{m_k}\gamma_i^{(k)}y^{((k-m_k)+i)}$
end for
Compute $u^{(k+1)}$ with $u_i^{(k+1)} := \frac{y_i^{(k+1)} + \alpha v_i}{1 + \alpha(1 + D_{ii})}$ if $D_{ii} < \infty$, and $u_i^{(k+1)} := 0$ if $D_{ii} = \infty$
Set $U^{(k+1)} := \mathrm{vec}^{-1}(u^{(k+1)})$
return $v + \mathrm{vec}(W^\top U^{(k+1)} G^\top)$
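A compact Python sketch of the Anderson scheme above for a generic fixed-point map $f$ (names are ours; the sum-to-one constraint is eliminated by substitution before the least-squares solve):

```python
import numpy as np

def anderson(f, x0, m=5, beta=1.0, iters=100, tol=1e-8):
    """Anderson acceleration for x = f(x). Keeps the last m+1 iterates and
    residuals h = f(x) - x, solves min ||H gamma|| s.t. sum(gamma) = 1,
    and mixes (1 - beta) * x_i + beta * f(x_i) with weights gamma."""
    X, FX = [x0], [f(x0)]
    x = x0
    for _ in range(iters):
        H = np.stack([fi - xi for xi, fi in zip(X, FX)], axis=1)
        if H.shape[1] == 1:
            gamma = np.array([1.0])
        else:
            # substitute gamma_last = 1 - sum(rest) to remove the constraint
            D = H[:, :-1] - H[:, -1:]
            g, *_ = np.linalg.lstsq(D, -H[:, -1], rcond=None)
            gamma = np.append(g, 1.0 - g.sum())
        x = sum(gi * ((1 - beta) * xi + beta * fi)
                for gi, xi, fi in zip(gamma, X, FX))
        fx = f(x)
        if np.linalg.norm(fx - x) < tol:
            return x
        X.append(x)
        FX.append(fx)
        X, FX = X[-(m + 1):], FX[-(m + 1):]
    return x
```

With `beta=1.0` and a single stored iterate this reduces to plain Picard iteration, matching the remark above that Anderson acceleration converges whenever Picard iteration does.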

G EFFECTS OF THE ORDER OF NEUMANN SERIES EXPANSION

In this section, we perform ablation studies on the effects of the order of the Neumann series used to approximate the matrix $(I + \alpha(I - G^\top \otimes W))^{-1}$ in MIGNN-NKDP with fixed $P = 1$. We study the performance of MIGNN-NKDP on synthetic directed chain classification and on benchmark graph node classification and graph classification.

G.1 DIRECTED CHAIN CLASSIFICATION

Examining the Neumann series expansion on the synthetic chain classification task demonstrates the trade-off between accuracy and time complexity. We train MIGNN-NKD1 for three-class classification, with the order $K$ ranging from 1 to 5 in increments of 1. Fig. 9 plots the resulting test accuracy, number of iterations, and time elapsed for each training epoch. We make three observations as the order of the Neumann series increases. First, the accuracy increases with the order, with diminishing returns. Second, the number of iterations increases with the order up to order 3. Finally, the time elapsed also increases with the order, up to orders 4 and 5, which are similar. These observations underscore the trade-off between accuracy and time complexity as the order increases.

G.2 NODE CLASSIFICATION

The graph node classification tasks also highlight the trade-off between accuracy and time complexity. We train MIGNN-NKD1 using 10-fold cross-validation on Cora, Citeseer, and Pubmed, with $K$ ranging from 1 to 5 in increments of 1. The mean test accuracy and time elapsed, along with their standard deviations, are reported in Table 3. For node classification we see a very clear trend across all datasets: both the accuracy and the time elapsed increase with the order of the Neumann expansion. However, the accuracy scales with diminishing returns; notice that N4 and N5 have the same accuracy on both Citeseer and Pubmed.

G.3 GRAPH CLASSIFICATION

In this subsection, we apply MIGNN-NKD1 to classify the MUTAG dataset, with $K$ ranging from 1 to 5 in increments of 1. Fig. 10 plots the test accuracy, the number of iterations, and the time elapsed for training one fold of the 10-fold cross-validation. Unlike the directed chain and node classification tasks, graph classification does not show significant improvements from higher-order Neumann expansion on this fold. However, from Table 2 we observe that, over the 10-fold cross-validation, diffusion improves the results. Although the accuracy and iteration count remain similar across all orders, the time elapsed still scales with the order.

H EFFECTS OF THE ORDER OF GRAPH DIFFUSION CONVOLUTION

In this section, we use MIGNN-NKDP with fixed $K = 1$ and varying diffusion order $P$ to study the effects of the order of graph diffusion convolution. We report the performance of MIGNN on synthetic directed chain classification and on benchmark graph node classification and graph classification tasks.

H.1 DIRECTED CHAIN CLASSIFICATION

The three-class chain classification task benefits tremendously from high orders of diffusion. We train MIGNN-N1DP on chains of length 140, with $P$ ranging from 1 to 5 in increments of 1. Fig. 11 plots the test accuracy, number of iterations, and time elapsed for each training epoch. For diffusion convolution we make two observations. First, the accuracy scales with the order of diffusion, with a remarkable gap between D3 and D4. Second, the iteration count and time elapsed remain relatively constant across all orders, with D1 standing out as the smallest. Our theory informs us of the following: 1) accuracy scaling occurs when the introduced $P$-hop edges contain relevant information for the task; 2) time elapsed scales with the number of edges in the higher-order graph diffusion matrix. Our observations support our theory and strongly suggest using diffusion as an inexpensive improvement for simple learning tasks.

H.2 NODE CLASSIFICATION

In this subsection, we study the effects of the order of diffusion convolution on node classification on the citation datasets (Cora, Citeseer, Pubmed). We consider MIGNN-N1DP with P ranging from 1 to 3 in increments of 1. We observe that higher-order diffusion convolution has little impact on the time complexity when each connected subgraph is small relative to the underlying graph.

I MORE DISCUSSION ON WHEN IGNNS BECOME EXPRESSIVE FOR LEARNING LRD

In this section, we further confirm the interconnection between the accuracy of IGNN for classifying directed chains and the eigenvalues of |W|. The accuracy and number of iterations of IGNN and the dynamics of the two leading eigenvalues are plotted in Figs. 13 and 14, respectively, for the binary and three-class cases. These results confirm the phenomena discussed in Sec. 1.

Figure 14: The first column shows the training, test, and validation accuracies of IGNN for several chain lengths with three classes. The second column plots the corresponding top two eigenvalues. The third column plots the number of Picard iterations for each chain length. As the maximum eigenvalue of the system approaches 1, IGNN becomes more accurate for chain classification at the cost of a significantly increased number of training iterations.

J DETAILS ABOUT DATASETS

Synthetic chains dataset. To evaluate the LRD learning ability of models, we construct a synthetic chains dataset as in Gu et al. [39]. Both binary and multiclass classification are considered. Let c be the number of classes, i.e., there are c types of chains. The label information is encoded only as a one-hot vector in the first c dimensions of the node feature of the starting node of each chain. With c classes, n_c chains per class, and l nodes per chain, the chain dataset has c × n_c × l nodes in total.

Bioinformatics datasets. MUTAG is a dataset of 188 mutagenic aromatic and heteroaromatic nitro compounds. PTC is a dataset of 344 chemical compounds that report carcinogenicity for male and female rats. COX2 is a dataset of 467 cyclooxygenase-2 (COX-2) inhibitors. PROTEINS is a dataset of 1113 protein graphs whose nodes represent secondary structure elements (SSEs). NCI1 is a public dataset from the National Cancer Institute (NCI) and is a subset of balanced datasets of chemical compounds screened for the ability to suppress or inhibit the growth of a panel of human tumor cell lines.

Amazon product co-purchasing network. This dataset contains 334863 nodes (representing goods), 925872 edges, and 58 label types. An edge is formed between two nodes if the represented goods have been purchased together [52].

Pore networks. The pore network is a simulated dataset that models fluid flow in porous media. Each pore network is randomly generated inside a cubic domain of width 0.1m by Delaunay or Voronoi tessellation. The prediction of equilibrium pressure in a pore network under physical diffusion is introduced as a GNN task in [63]. The GNN model prediction accuracy is compared with the ground truth obtained by solving the diffusion equation directly; see [63, Appendix C] for more details.

Citation datasets. Cora and Citeseer are large citation datasets that describe the presence of specific words in publications. Pubmed is a large citation dataset whose papers are classified according to which of three types of diabetes they study. The following table, adapted from [33], describes the statistics of the three datasets.
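A minimal sketch (ours) of the chains construction described above; the function make_chains and its conventions (edge direction along the chain, feature width d ≥ c) are illustrative assumptions rather than the released generator.

```python
import numpy as np

def make_chains(c, n_c, l, d):
    """Synthetic directed-chains dataset: c classes, n_c chains per class,
    l nodes per chain; the class label is one-hot encoded in the first c
    feature dimensions of each chain's starting node only. Assumes d >= c.
    """
    num_nodes = c * n_c * l
    X = np.zeros((d, num_nodes))            # node features, columns = nodes
    A = np.zeros((num_nodes, num_nodes))    # directed adjacency
    y = np.zeros(num_nodes, dtype=int)      # per-node class labels
    node = 0
    for label in range(c):
        for _ in range(n_c):
            X[label, node] = 1.0            # label info only at chain start
            for i in range(l):
                y[node + i] = label
                if i + 1 < l:
                    A[node + i, node + i + 1] = 1.0  # edge along the chain
            node += l
    return X, A, y
```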

K DETAILS ABOUT HYPERPARAMETERS

The default parameter settings for MIGNN are the following. For the fixed-point schemes, α = 0.9 and β = 0.9; the default maximum number of iterations is 300; the tolerance is 1e-6; and convergence is measured in the ℓ∞-norm of the difference between two consecutive fixed-point iterates. The learnable parameter is initialized to γ = 1.0.

Synthetic chains dataset. For both binary and three-class classification we use the parameters outlined by IGNN [71]. In both classification tasks we make the same modifications: we set the clipping and dropout to 0.

Citation datasets. On the citation datasets we follow the training procedure used by GIND [22]. For the Cora dataset we set the weight decay to 1e-4 and the fixed-point tolerance to 1e-3. For all three models we set the fixed-point α = 0.5 and the hidden dimension to 64.

Bioinformatics datasets. On the bioinformatics datasets we follow the training procedure used by IGNN [71]. On the COX2 dataset for MIGNN-Mon, we use α = 0.5. On all other datasets we extend the number of training epochs to 500.

Amazon product co-purchasing network. On the Amazon product co-purchasing dataset we follow the training procedure used by IGNN [71].

Pore networks. On the physical diffusion pore networks we follow the training procedure used by CGS [63] and the default parameter values for MIGNN.



Footnotes. The matrix |W| is obtained by taking the entry-wise absolute value of the matrix W. Starting from here, we use MIGNN to stress that the model is based on monotone operator theory. For the sake of presentation, we denote Anderson-accelerated FB and PR splitting simply as FB and PR.



Figure 1: Epoch vs. training, validation, and test accuracy of IGNN for classifying directed chains. First row: binary chains of length 100 (left) and 250 (right). Second row: three-class chains of length 80 (left) and 100 (right).

Figure 2: Epoch vs. the magnitude of λ1(|W|) and λ2(|W|) and the iterations required for each epoch. First row: binary chains; second row: three-class chains.

In the monotone parameterization, we first set the graph-related matrix G to be L/2, whose eigenvalues are in [0, 1]. In contrast, the range of the eigenvalues of Â used in IGNN (see Sec. 1) is [-1, 1]. Next, we parameterize W as in Equation (4), whose eigenvalues have real part in (-∞, 1-m]. Thus, (1/2)(G^⊤ ⊗ W + G ⊗ W^⊤) ⪯ (1-m)I, guaranteeing the well-posedness of MIGNN. Moreover, W = (1-m)I - CC^⊤ + F - F^⊤ describes all possible W that satisfy W ⪯ (1-m)I.
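A minimal PyTorch sketch (ours) of this parameterization: W is built from free matrices C and F, so its symmetric part (1/2)(W + W^⊤) = (1-m)I - CC^⊤ ⪯ (1-m)I holds by construction, with no projection step needed during training. The class name and initialization are illustrative.

```python
import torch

class MonotoneW(torch.nn.Module):
    """Parameterize W = (1 - m) I - C C^T + F - F^T.

    F - F^T is skew-symmetric, so the symmetric part of W is
    (1 - m) I - C C^T, which is always <= (1 - m) I in the PSD order.
    """
    def __init__(self, d, m=0.1):
        super().__init__()
        self.m = m
        self.C = torch.nn.Parameter(torch.randn(d, d) / d ** 0.5)
        self.F = torch.nn.Parameter(torch.randn(d, d) / d ** 0.5)

    def forward(self):
        d = self.C.shape[0]
        I = torch.eye(d, device=self.C.device)
        return (1 - self.m) * I - self.C @ self.C.T + self.F - self.F.T
```

Because the constraint holds for every value of C and F, gradient updates can never leave the well-posed region, in contrast to constraining W by projected gradient descent.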

PROPAGATION FOR FINDING THE FIXED POINT

4.1.1 FB SPLITTING

Now we discuss the time complexity of MIGNN-NKDP. The P-th order diffusion matrix only needs to be pre-computed once in preprocessing, with time complexity O(nP|E_P|), where n is the number of nodes and |E_P| denotes the number of non-zero entries in A^P. In each epoch, the parameter K of the K-th order Neumann series affects the training time complexity linearly, as O(KMd|E_P|).

Figure 3: The accuracy of IGNN and MIGNN under different configurations for classifying directed chains of different lengths. Left: binary classification (c = 2). Right: three-class classification (c = 3).

Figure 4: The accuracy and efficiency of MIGNN-N2D5 over IGNN for three-class chain classification with chains of length 140.

Figure 5: λ1(|W|) of MIGNN-Mon vs. epoch on MUTAG.

Figure 6: Epoch vs. λ1(|W|), the time required for each epoch, and the iterations required for each epoch of IGNN and MIGNN-N1D1 on the Amazon dataset with 5% training portion.

Fig. 7 contrasts MIGNN-N1D1 with baseline models when trained on portions of the graph ranging from 5% to 9%. We see that MIGNN-N1D1 outperforms almost all baseline models over all training portions. Though MIGNN-N1D1 does not outperform IGNN significantly, it enjoys significant computational advantages over IGNN.

5.5 PHYSICAL DIFFUSION IN NETWORKS

LIPSCHITZ CONSTANT VS. LARGEST MAGNITUDE OF EIGENVALUE

Let f(Z) = WZG + B be a linear map. With slight abuse of notation, we still denote the vectorized version of f as f, which reads f(vec(Z)) = (G^⊤ ⊗ W)vec(Z) + vec(B) (see Appendix D for properties of the Kronecker product). The Lipschitz constant Lip∞(f) of the linear map f with respect to the ℓ∞ vector norm is exactly the ∞-norm ∥G^⊤ ⊗ W∥∞ = ∥G^⊤∥∞ ∥W∥∞. Recall the following general result relating matrix norms and the largest magnitude of an eigenvalue.

Theorem 1 ([47, Theorem 4 in Section 4.6]). The largest magnitude of eigenvalue λ1(A) of a matrix A satisfies λ1(A) = inf_{∥·∥_M} ∥A∥_M, where the infimum is taken over all subordinate matrix norms ∥·∥_M, including the 2-norm and the ∞-norm.
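A short numeric illustration (ours, hedged) of both facts stated above: λ1 is bounded by every subordinate norm, in particular the ∞-norm, and the ∞-norm factors exactly over the Kronecker product since it is a maximum absolute row sum.

```python
import numpy as np

# Check: lambda_1(G^T kron W) <= ||G^T kron W||_inf = ||G^T||_inf * ||W||_inf
rng = np.random.default_rng(1)
W = rng.standard_normal((4, 4))
G = rng.standard_normal((5, 5))
M = np.kron(G.T, W)

spec_rad = np.max(np.abs(np.linalg.eigvals(M)))
inf_norm = np.linalg.norm(M, ord=np.inf)                 # max abs row sum
factored = np.linalg.norm(G.T, ord=np.inf) * np.linalg.norm(W, ord=np.inf)

print(spec_rad <= inf_norm + 1e-12)   # True: lambda_1 bounded by the norm
print(np.isclose(inf_norm, factored)) # True: inf-norm factors over kron
```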

MIGNN-FB-Backward: FB MIGNN backward propagation. Input: initial point u^{(0)} := vec(U) := 0, v := ∂ℓ/∂vec(Z^*), damping parameter α, AA relaxation parameter β, max storage size m ≥ 1. Compute …

MIGNN-PR-Backward: PR MIGNN backward propagation. Input: initial point y^{(0)} := 0, v := ∂ℓ/∂vec(Z^*), PR damping parameter α, AA relaxation parameter β, max storage size m ≥ 1. Compute …

Figure 9: Comparison of Neumann expansion for accuracy, number of iterations, and elapsed time using three-class chain classification with chain length 140.

Figure 10: Comparison of Neumann expansion for accuracy, number of iterations, and elapsed time using the first fold of the MUTAG dataset.

Figure 11: Comparison of graph diffusion convolution for accuracy, number of iterations, and elapsed time using three-class chains of length 140.

Figure 12: Comparison of diffusion convolution for accuracy, number of iterations, and elapsed time using the first fold of the MUTAG dataset.

Figure 13: In the first column, the training, test, and validation accuracies of IGNN are depicted for several chain lengths. In the second column, the corresponding top two eigenvalues are plotted. The third column depicts the number of Picard iterations for each chain length. When IGNN becomes accurate for chain classification, the corresponding λ1(|W|) approaches 1, and the Picard iteration requires substantially more iterations to converge.

… e.g. [18; 72; 19]. Orthogonal parameterization for deep learning. The fixed-point iteration Equation (1) is related to the hidden-state updates of RNNs [66; 29; 2; 50]. Learning LRD is challenging for RNNs due …


… contrasts the computational cost of MIGNN-N1D1 with …

A BRIEF REVIEW OF IGNN AND RELATED MODELS

A.1 IGNN: FORWARD AND BACKWARD PROPAGATION

IGNN employs a projected gradient descent method in the training phase to ensure their proposed well-posedness condition is satisfied. In forward propagation, IGNN finds the equilibrium through direct Picard iteration. During backward propagation, IGNN uses the implicit function theorem at the equilibrium to compute the gradient. The computationally expensive terms related to …

Table 3: Graph node classification mean accuracy (%) ± standard deviation for 10-fold cross-validation.

Table 4 reports the test accuracy and the time elapsed per epoch for the different MIGNN models. We observe that diffusion does not provide any benefit for graph node classification.

Table 4: Graph node classification mean accuracy (%) ± standard deviation for 10-fold cross-validation.

Dataset statistics. The average shortest-path length is denoted by Avg. SP.


Proof. Note that the normalized matrix Ŝ := D^{-1/2}SD^{-1/2} satisfies Ŝ^⊤ = D^{-1/2}S^⊤D^{-1/2} = D^{-1/2}SD^{-1/2} = Ŝ; that is, Ŝ is symmetric. To complete the proof, it then suffices to show that both I + Ŝ and I - Ŝ are positive semi-definite. Indeed, by construction, both symmetric matrices D - S and D + S are diagonally dominant with positive diagonal entries, hence positive semi-definite by Gershgorin's circle theorem. Meanwhile, for any vector v ∈ R^n, we have v^⊤(I + Ŝ)v = (D^{-1/2}v)^⊤(D + S)(D^{-1/2}v) ≥ 0. This shows that I + Ŝ is positive semi-definite. Similarly, one can derive that I - Ŝ is positive semi-definite from the positive semi-definiteness of D - S.
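The following NumPy snippet (ours, illustrative) checks the conclusion of this proof numerically on a random graph, taking S = I + A as an example; it confirms that the eigenvalues of D^{-1/2}SD^{-1/2} lie in [-1, 1].

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T                              # symmetric 0/1 adjacency
S = np.eye(n) + A                        # example choice S = I + A
D = np.diag(S.sum(axis=1))               # degree matrix of S
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
S_hat = D_inv_sqrt @ S @ D_inv_sqrt      # normalized matrix from the proof
eig = np.linalg.eigvalsh(S_hat)
print(eig.min() >= -1 - 1e-10, eig.max() <= 1 + 1e-10)   # both True
```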

F MIGNN VIA ANDERSON-ACCELERATED OPERATOR SPLITTING SCHEMES

In this section, we present the pseudocodes of Anderson-accelerated MIGNN operator splitting schemes discussed in Section 4.
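Since every scheme below is wrapped in Anderson acceleration (AA), we first give a generic, hedged NumPy sketch of an AA driver with window size m and relaxation β. The function anderson_fp is our illustrative name; the paper's exact variant (regularization, safeguards) may differ.

```python
import numpy as np

def anderson_fp(T, u0, m=5, beta=0.9, tol=1e-6, max_iter=300):
    """Anderson-accelerated fixed-point solver for u = T(u).

    Mixes the last (up to m) iterates with weights alpha that minimize
    the combined residual ||sum_i alpha_i (T(u_i) - u_i)|| subject to
    sum_i alpha_i = 1, then relaxes with parameter beta.
    """
    us, fs = [u0], [T(u0)]                      # past iterates / map values
    for _ in range(max_iter):
        w = min(m, len(us))
        U = np.stack(us[-w:], axis=1)
        F = np.stack(fs[-w:], axis=1)
        G = F - U                               # residuals g_i = T(u_i) - u_i
        H = G.T @ G + 1e-10 * np.eye(w)         # regularized normal matrix
        gamma = np.linalg.solve(H, np.ones(w))
        alpha = gamma / gamma.sum()             # affine-constrained weights
        u_new = (1 - beta) * (U @ alpha) + beta * (F @ alpha)
        if np.linalg.norm(u_new - us[-1], np.inf) < tol:   # l_inf criterion
            return u_new
        us.append(u_new)
        fs.append(T(u_new))
    return us[-1]
```

Any of the FB or PR iteration maps below can be passed in as T, e.g. anderson_fp(lambda u: fb_step(u), np.zeros(d * n)).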

F.1 PSEUDOCODE FOR MIGNN WITH OPERATOR SPLITTING SCHEMES

FB splitting. The details of the FB splitting scheme iteration function, Equation (6), for solving MIGNN are presented in Algorithm 1.

Algorithm 1 FB-forward-MIGNN:
Z := 0; err := 1
while err > ϵ do … end while
return Z

PR splitting. The details of the PR splitting scheme encoded in the iteration function Equation (7) …
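To make the FB loop above concrete, here is a minimal sketch assuming the standard damped forward-backward update for the equilibrium Z^* = σ(WZ^*G + b) with σ = ReLU, whose proximal operator is ReLU itself; Algorithm 1 in the paper may organize the iteration differently, so treat this as a sketch under that assumption.

```python
import numpy as np

def fb_forward(W, G, b, alpha=0.5, tol=1e-6, max_iter=300):
    """Damped FB splitting for Z* = relu(W Z* G + b).

    One FB step: a damped step on the monotone linear part, followed by
    the ReLU proximal operator. b plays the role of g_B(X), shaped like Z.
    """
    Z = np.zeros_like(b)
    for _ in range(max_iter):
        Z_new = np.maximum(0.0, (1 - alpha) * Z + alpha * (W @ Z @ G + b))
        err = np.max(np.abs(Z_new - Z))     # l_inf convergence criterion
        Z = Z_new
        if err < tol:
            break
    return Z
```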

F.2 MORE DETAILS ON BACKWARD PROPAGATION

In the backward propagation, the following result from [77] allows us to convert the computation of the inverse Jacobian term (I - J(G^⊤ ⊗ W))^{-⊤} to the (transpose of the) matrix inverse term V = (I - G^⊤ ⊗ W)^{-1}, which is already calculated in the forward pass.

Proposition 6 (Adapted from [77, Theorem 3]). Let vec(Z^*) be the fixed point of the MIGNN model (2) and let J be the Jacobian of the non-linearity σ at G^⊤ ⊗ W vec(Z^*) + vec(g_B(X)). For any v ∈ R^n, the solution u^* of the equation … where ũ is a solution of the operator splitting problem 0 ∈ (F̃ + G̃)(ũ), with operators defined as … where D is the diagonal matrix defined by J = (I + D)^{-1} (where …

Note that, since the non-linearity σ is applied entry-wise, the Jacobian J is a diagonal matrix, and its diagonal entries consist of the vectorization of the Jacobian ∂σ(WZG^⊤)/∂Z evaluated at Z^*. Therefore, the Jacobian J, and hence D, can be efficiently computed. We provide the pseudo-code of the FB and PR splitting schemes for the backward propagation described in the above proposition as Algorithm 3 and Algorithm 4, respectively; their Anderson-accelerated versions can be found in Algorithm 7 and Algorithm 8.
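As a reference for unit-testing the splitting-based backward passes, on small graphs one can solve the transposed Jacobian system directly; the helper below is ours and purely illustrative, not the paper's method (which avoids forming the dense system).

```python
import numpy as np

def backward_direct(W, G, J_diag, v):
    """Reference backward solve for small problems only: compute
    u* = (I - J (G^T kron W))^{-T} v by a dense linear solve.
    J_diag holds the diagonal of the nonlinearity's Jacobian at Z*.
    """
    M = np.kron(G.T, W)
    n = M.shape[0]
    Jac = np.eye(n) - np.diag(J_diag) @ M     # I - J (G^T kron W)
    return np.linalg.solve(Jac.T, v)          # solve the transposed system
```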

FB backward propagation

We now present the pseudo-code of the FB splitting method (Algorithm 3) for the backward propagation, following the procedure described in Proposition 6. Let u^{(k)} be the intermediate variable; applying FB splitting to the monotone splitting problem (19) amounts to finding the fixed point u^* of the following iteration function: …

PR backward propagation. We now present the pseudo-code of the PR splitting method (Algorithm 4) for the backward propagation, following the procedure described in Proposition 6.

y^{(1/2)} := 2u^{(1/2)} - y
u^{(+)} := V^⊤ y^{(1/2)}
y^{(+)} := 2u^{(+)} - y^{(1/2)}
err := ∥y^{(+)} - y∥ / ∥y^{(+)}∥
y, u := y^{(+)}, u^{(+)}
end while
Compute u^* where u_i := …

