STABLE, EFFICIENT, AND FLEXIBLE MONOTONE OPERATOR IMPLICIT GRAPH NEURAL NETWORKS

Abstract

Implicit graph neural networks (IGNNs), which solve a fixed-point equilibrium equation for representation learning, can learn long-range dependencies (LRD) in the underlying graphs and show remarkable performance on various graph learning tasks. However, the expressivity of IGNNs is limited by the constraints imposed to guarantee their well-posedness. Moreover, when IGNNs become effective at learning LRD, the magnitudes of the eigenvalues of their weight matrices approach 1, which slows down convergence, and their performance is unstable across different tasks. In this paper, we provide a new well-posedness condition for IGNNs by leveraging monotone operator theory. The new well-posedness characterization guides the design of effective parameterizations that improve the accuracy, efficiency, and stability of IGNNs. Leveraging accelerated operator splitting schemes and graph diffusion convolution, we design efficient and flexible implementations of monotone operator IGNNs that are significantly faster and more accurate than existing IGNNs.

1. INTRODUCTION

Implicit graph neural networks (IGNNs), which solve a fixed-point equilibrium equation for graph representation learning, can learn long-range dependencies (LRD) in the underlying graphs and show remarkable performance on various tasks [69; 39; 58; 63; 22]. Let G = (V, E) be a graph, where V is the set of nodes and E ⊆ V × V is the set of edges. The connectivity of G can be represented by the adjacency matrix A ∈ R^{n×n}, with A_ij = 1 if there is an edge connecting nodes i, j ∈ V and A_ij = 0 otherwise. Let X ∈ R^{d×n} be the initial node features, whose i-th column x_i ∈ R^d is the initial feature of the i-th node. IGNN [39] learns the node representation by finding the fixed point, denoted Z*, of the Picard iteration in Equation (1) below, where σ is a nonlinearity (e.g., ReLU), g_B is a function parameterized by B (e.g., g_B(X) = BXG), W, B ∈ R^{d×d} are learnable weight matrices, and G is a graph-related matrix. In IGNN, G is chosen as Â := D^{-1/2}(I + A)D^{-1/2}, where I is the identity matrix and D is the degree matrix with D_ii = 1 + ∑_{j=1}^{n} A_ij. IGNN constrains W using a tractable projected gradient descent method to ensure the well-posedness of the Picard iteration, at the cost of limiting the expressivity of IGNNs. The prediction of IGNN is given by f_Θ(Z*), a function parameterized by Θ. IGNNs have several merits: 1) the depth of an IGNN is adaptive to the particular data and task rather than fixed; 2) training IGNNs requires constant memory independent of their depth, leveraging implicit differentiation [66; 2; 51; 13]; 3) IGNNs have better potential to capture LRD of the underlying graph than existing GNNs, including GCN [75], GAT [73], SSE [23], and SGC [79]. The latter GNNs lack the capability to learn LRD because they suffer from over-smoothing [56; 84; 62; 20].
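The normalized matrix Â := D^{-1/2}(I + A)D^{-1/2} above can be computed directly from a binary adjacency matrix; the sketch below is illustrative (the function name is ours, not from the paper):

```python
import numpy as np

def normalized_adjacency(A):
    """Compute A_hat = D^{-1/2} (I + A) D^{-1/2}, where D_ii = 1 + sum_j A_ij.

    A: (n, n) binary adjacency matrix of an undirected graph.
    """
    n = A.shape[0]
    A_tilde = np.eye(n) + A            # add self-loops: I + A
    deg = A_tilde.sum(axis=1)          # D_ii = 1 + sum_j A_ij
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # D^{-1/2} (I + A) D^{-1/2} via row and column scaling
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
```

For a symmetric A, the resulting Â is symmetric with spectral radius 1 (the vector D^{1/2}·1 is an eigenvector with eigenvalue 1), which is why well-posedness hinges on the eigenvalues of W rather than of Â.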
Several methods have been proposed to alleviate over-smoothing, and hence improve learning LRD, by adding residual connections [37; 21; 55], by geometric aggregation [65], by adding a fully-adjacent layer [3], by improving breadth-wise backpropagation [59], and by adding oscillatory layers [27; 67].

Z^(k+1) = σ(W Z^(k) G + g_B(X)), for k = 0, 1, 2, …,   (1)

Issue 1: Well-posedness of IGNN Limits Its Expressivity. One bottleneck of IGNN is that the magnitudes of W's eigenvalues must be less than one for its well-posedness guarantee; see Sec. 2 for details. This limits the choice of W and thereby the expressivity of IGNNs. We observe that the magnitudes of both eigenvalues approach 1 when IGNN becomes accurate. However, Fig. 2 (right) shows that IGNN takes many more iterations per epoch as the eigenvalue magnitudes get close to 1. Indeed, when λ_1(|W|) → 1, the Lipschitz constant of the map Z ↦ W Z G + g_B(X) approaches 1, slowing down the convergence of the Picard iteration. The results in Fig. 2 echo our intuition: the representation of a given node aggregates one more hop of information after each Picard iteration, so when the eigenvalue magnitudes are close to 1, Equation (1) converges slowly enough that IGNN can capture LRD before reaching the fixed point. We report classification results for different chain lengths in Appendix I; these results consistently show that IGNNs suffer from two bottlenecks: 1) an inherent tradeoff between computational efficiency and the capability to learn LRD; 2) the performance of IGNNs based on Picard iteration is unstable, in the sense that it varies substantially across tasks. In particular, starting from a random Gaussian initialization of W (the default initialization), IGNN cannot learn LRD if none of the eigenvalues of |W| gets close to 1 in magnitude.
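The efficiency–LRD tradeoff above can be seen in a minimal sketch of the Picard iteration (1), assuming g_B(X) = BXG and σ = ReLU as in the text (the function name, shapes, and tolerance are illustrative assumptions):

```python
import numpy as np

def picard_iteration(W, B, X, G, tol=1e-6, max_iter=10_000):
    """Iterate Z <- relu(W Z G + B X G) until a fixed point (Equation (1)).

    Converges when the map is a contraction, e.g. ||W||_2 * ||G||_2 < 1.
    Returns the approximate fixed point and the number of iterations used.
    """
    Z = np.zeros_like(X)
    bias = B @ X @ G                                 # g_B(X) = B X G, computed once
    for k in range(max_iter):
        Z_next = np.maximum(W @ Z @ G + bias, 0.0)   # sigma = ReLU
        if np.linalg.norm(Z_next - Z) < tol:
            return Z_next, k + 1
        Z = Z_next
    return Z, max_iter
```

With G = Â (spectral norm 1), the iteration contracts at rate roughly ‖W‖_2 per step, so the iteration count grows like 1/(1 − ‖W‖_2) as the eigenvalue magnitudes approach 1 — the slowdown visible in Fig. 2 (right).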



¹The matrix |W| is obtained by taking the entry-wise absolute value of the matrix W. Starting from here, we use MIGNN to stress that the model is based on monotone operator theory.



Figure 1: Epoch vs. training, validation, and test accuracy of IGNN for classifying directed chains. First row: binary chains of length 100 (left) and 250 (right). Second row: three-class chains of length 80 (left) and 100 (right).

When can IGNNs learn LRD? To understand when IGNN can learn LRD, we run IGNN, using the settings in [39], to classify directed chains. Directed chains form a synthetic dataset designed to test the effectiveness of GNNs in learning LRD for node classification [71; 39]. Fig. 1 plots epoch vs. accuracy of IGNN for chain classification; here, each epoch means iterating Equation (1) until convergence and then updating W and B. As illustrated in Fig. 1, IGNN classifies binary chains perfectly at length 100 but performs near random guessing at length 250. For three-class chains, IGNN performs quite well at chain length 80 but very poorly at length 100. We investigate these results by studying the dynamics of the eigenvalues of the matrix |W|¹. For illustrative purposes, we consider λ_1(|W|) and λ_2(|W|), the largest and second-largest eigenvalues of |W| in magnitude. Fig. 2 (left) contrasts the evolution of the magnitudes of λ_1(|W|) and λ_2(|W|) when IGNN classifies nodes on chains of different lengths.
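The quantities tracked in this experiment can be monitored with a few lines of NumPy; this is a sketch of the diagnostic, not code from the paper (the function name is ours):

```python
import numpy as np

def top_two_eig_magnitudes(W):
    """Return |lambda_1| and |lambda_2|: the two largest-magnitude eigenvalues
    of |W|, the entry-wise absolute value of W (the quantity plotted in Fig. 2)."""
    eigvals = np.linalg.eigvals(np.abs(W))   # |W| is generally non-symmetric
    mags = np.sort(np.abs(eigvals))[::-1]    # sort magnitudes, descending
    return mags[0], mags[1]
```

Logging these two values once per epoch reproduces the curves in Fig. 2 (left) for any trained weight matrix W.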

Figure 2: Epoch vs. the magnitudes of λ_1(|W|) and λ_2(|W|), and the number of iterations required per epoch. First row: binary chains; second row: three-class chains.

GNNs, and orthogonal parameterizations for recurrent neural networks (RNNs). DEQ. IGNN is related to DEQs [7; 26; 8], but the equilibrium equation of IGNN differs from those of DEQs in that IGNN encodes graph structure. DEQs are a class of infinite-depth weight-tied feedforward neural networks whose forward propagation uses root-finding and whose backpropagation uses implicit differentiation. As a result, training DEQs requires only constant memory, independent of the network's depth. Monotone operator theory has been used to guarantee the convergence of DEQs [77] and to improve the robustness of implicit neural networks [44]. The convergence of DEQs has also

