ITERATED GRAPH NEURAL NETWORK SYSTEM

Abstract

We present the Iterated Graph Neural Network System (IGNNS), a new framework of Graph Neural Networks (GNNs) that handles undirected and directed graphs in a unified way. The core component of IGNNS is the Iterated Function System (IFS), an important research field in fractal geometry. The key idea of IGNNS is to use a pair of affine transformations to characterize the process of message passing between graph nodes and to assign them an adjoint probability vector, forming an IFS layer with probabilities. After being embedded in the latent space, the node features are sent to the IFS layer for iteration, yielding a high-level representation of the graph nodes. We also analyze the geometric properties of IGNNS from the perspective of dynamical systems. We prove that if the IFS induced by IGNNS is contractive, then the fractal representation of the graph nodes converges to the fractal set of the IFS in Hausdorff distance, and the ergodic representation converges to a constant matrix in Frobenius norm. We carry out a series of semi-supervised node classification experiments on citation network datasets (Citeseer, Cora and PubMed). The experimental results show that our method clearly outperforms related methods.

1. INTRODUCTION

GNNs (Scarselli et al., 2009) have proven effective in processing graph-structured data and have been widely used in natural language processing, computer vision, data mining, social networks and biochemistry. In recent years, a variety of GNN architectures have been developed, such as GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), DGI (Veličković et al., 2019), GIN (Xu et al., 2019), GCNII (Ming Chen et al., 2020) and GEN (Li et al., 2020). These architectures share a common feature: the representation of each node is updated using messages from its neighbors, without distinguishing the direction (or angle) of message passing between two nodes. Recent studies have shown that considering directed message passing between nodes can improve the performance of GNNs and has led to success in related fields. For example, DimeNet (Klicpera et al., 2020) considers the spatial direction from one atom to another and can learn both molecular properties and atomic forces. R-GCN (Schlichtkrull et al., 2018) and Bi-GCN (Marcheggiani & Titov, 2017; Fu et al., 2019) are models for directed graphs, applied in natural language processing. We note that the direction-based models above do not consider the bidirectional mixed passing of messages. In real life, however, message passing is interactive across directions. For example, node A obtains a message from node B; after processing the message, node A not only passes it on to the next node C but also feeds back to node B. Suppose there are only two directions for message passing, forward and backward, represented by 0 and 1, respectively. The symbol space of first-generation message passing paths is {0, 1} = {0, 1}^1, and that of second-generation paths is {00, 01, 10, 11} = {0, 1}^2. In general, the symbol space of n-th generation message passing paths is {0, 1}^n, whose size is 2^n.
This means that the scope of message passing spreads exponentially with base 2. However, in the Bi-GCN (similar to Bi-LSTM) and R-GCN architectures, the symbol space is {{0}^n, {1}^n}, of size 2, which indicates that a lot of information is lost in the process of message passing (see Appendix A). How can the above message passing patterns be characterized? We use two mappings to represent the message passing process in the two directions; the interactive passing of messages across directions is then equivalent to composing the corresponding mappings. In addition, message passing not only occurs in the same direction but also occurs interactively in different directions, which is more in line with the actual situation. For example, in layer 1, node 2 passes the processed message f_1(m_2) to node 1; then, in layer 2, node 1 processes the received message f_1(m_2) and returns the processed message f_0(f_1(m_2)) to node 2. Moreover, message passing is often random, so we endow the two mappings with an adjoint probability vector to reflect this randomness. Because the symbol space of the iterative paths of an Iterated Function System (IFS) with two mappings is also {0, 1}^n, and each mapping is selected with a certain probability, the iterative process of an IFS mirrors the message passing process. In other words, the above message passing pattern can be described perfectly by an IFS with probabilities. We therefore present the Iterated Graph Neural Network System (IGNNS), whose core layer is constructed from an IFS. Figure 1 illustrates the differences in message passing patterns among GCN, Bi-GCN and IGNNS. At the same time, we regard an undirected graph as a directed graph with equal probability of bidirectional message passing (see Figure 1(a)), so the IGNNS architecture can handle directed and undirected graphs in a unified way.

2. PRELIMINARIES

A graph G = (V, E) is defined by its node set V = {v_1, v_2, ..., v_N} and edge set E = {(v_i, v_j) | v_i, v_j ∈ V}. Let A ∈ R^{N×N} denote the adjacency matrix of G, providing the relational information between nodes. A[i, j] denotes the (i, j)-th element of A, A[i, :] the i-th row, and A[:, j] the j-th column. In this paper, we assume that all nodes of G are self-adjacent, that is, A[i, i] = 1 for i = 1, 2, ..., N. Let D = diag(d_1, d_2, ..., d_N) be the degree matrix of A, where d_i = Σ_{j=1}^N A[i, j].

Neighborhood Normalization. There are two ways to normalize A. One is the mean-pooling employed by Hamilton et al. (2017) and Veličković et al. (2019) for inductive learning: A_mp = D^{-1} A. The other is the symmetric normalization employed by Kipf & Welling (2017): A_sym = D^{-1/2} A D^{-1/2}.

Iterated Function System. A mapping f : R^N → R^N is said to be contractive on R^N if there exists a constant 0 < c < 1 such that ‖f(x_1) - f(x_2)‖_2 < c ‖x_1 - x_2‖_2 for all x_1, x_2 ∈ R^N. An iterated function system (Hutchinson, 1981) is defined by IFS = {R^N; f_1, f_2, ..., f_m; p}, where each f_i : R^N → R^N is a contractive mapping and p = (p_1, p_2, ..., p_m) is an adjoint probability vector, meaning that f_i is selected with probability p_i at each iteration. Hutchinson (1981) showed that there exists a unique nonempty compact set F such that F = ⋃_{i=1}^m f_i(F). (For the running example of Figure 2, where we use the mean-pooling method to normalize A, with p_0 = 0.6 and p_1 = 0.4:

A = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 1, 1, 1], [0, 1, 1, 1]],
A_0 = [[1/2, 1/2, 0, 0], [0, 1/2, 0, 1/2], [0, 0, 1/2, 1/2], [0, 0, 0, 1]],
A_1 = [[1, 0, 0, 0], [1/2, 1/2, 0, 0], [0, 1/2, 1/2, 0], [0, 1/3, 1/3, 1/3]].)

We call F the fractal set or invariant set of the IFS. More conclusions on IFS can be found in Appendix D. It is well known that there exists a unique probability measure µ with support F satisfying µ = Σ_{i=1}^m p_i µ ∘ f_i^{-1}. (1)
The probability measure µ in (1) is called the self-similar measure of IFS with probability vector p.

3. IGNNS ARCHITECTURE

In this section, we introduce the architecture of IGNNS in terms of its input layer, IFS layer, representation layer and output layer, as depicted in Figure 2.

3.1. INPUT LAYER

Let X ∈ R^{N×F} be the graph-structured data of G = (V, E), called the feature matrix of the node set V. A row of X is the F-dimensional feature vector of a node in V. Let W_int ∈ R^{F×H} be a learnable parameter matrix, where H is the dimension of the latent space; then X W_int ∈ R^{N×H}. The output of the input layer is defined by X_int = σ(X W_int) ∈ R^{N×H}, where σ(·) is the activation function; generally, ReLU(x) = max(0, x) is used as the nonlinear activation. Here, each column of X_int is regarded as a point in R^N, so X_int is a set of H points in R^N, arranged in a fixed order. The vector composed of the i-th components of these points (the i-th row of X_int) is a feature representation of the i-th node of graph G.
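The input layer above is a single affine embedding followed by an activation. The following is a minimal NumPy sketch of it (illustrative only; the function name `input_layer` and all shapes are ours, not from the paper):

```python
import numpy as np

def input_layer(X, W_int, activation=lambda x: np.maximum(0.0, x)):
    """Embed node features into the H-dimensional latent space.

    X: (N, F) node feature matrix; W_int: (F, H) learnable weights.
    Returns X_int = sigma(X @ W_int) of shape (N, H); each of the H
    columns is then viewed as a point in R^N.
    """
    return activation(X @ W_int)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))      # N=4 nodes, F=6 raw features
W_int = rng.normal(size=(6, 3))  # H=3 latent dimensions
X_int = input_layer(X, W_int)
assert X_int.shape == (4, 3) and (X_int >= 0).all()  # ReLU output
```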

3.2. IFS LAYER

Let A be the adjacency matrix of G. Let triu(A) denote the upper triangular part of A and tril(A) the lower triangular part. The symmetric normalizations of triu(A) and tril(A) are A_0 = D_0^{-1/2} triu(A) D_0^{-1/2} and A_1 = D_1^{-1/2} tril(A) D_1^{-1/2}, where D_0 and D_1 are the degree matrices of triu(A) and tril(A), respectively. Sometimes we use the mean-pooling of triu(A) and tril(A) instead, i.e. A_0 = D_0^{-1} triu(A), A_1 = D_1^{-1} tril(A). Let f_0, f_1 be the two affine transformations on R^N induced by A_0 and A_1, respectively, defined as f_0 : x ↦ A_0 x + b_0 and f_1 : x ↦ A_1 x + b_1, x ∈ R^N, b_0, b_1 ∈ R, where b_0 and b_1 are learnable biases, i.e. the constants b_0 and b_1 are added to each component of A_0 x and A_1 x, respectively. We construct the iterated function system IFS = {R^N; f_0, f_1; p}, where p = (p_0, p_1) is a learnable adjoint probability vector satisfying p_0 > 0, p_1 > 0 and p_0 + p_1 = 1. Using the symbol space Ω_m = {0, 1}^m, each i = (i_1, i_2, ..., i_m) ∈ Ω_m has length |i| = m, and we define p_i = p_{i_1} p_{i_2} ··· p_{i_m} and f_i = f_{i_1} ∘ f_{i_2} ∘ ··· ∘ f_{i_m}. Let n be the number of iterations of the IFS; for IGNNS, n is a preset parameter. The iterative process of the IFS is described as follows. The first iteration (|i| = 1). The result of the first iteration is H^(1) = {f_0(X_int), f_1(X_int)} = {H_i}_{|i|=1}, where H_i = f_i(X_int), ∀i ∈ Ω_1. Since the IFS selects the iteration branch f_i with probability p_i, the mathematical expectation of H^(1) is E_1 = p_0 f_0(X_int) + p_1 f_1(X_int) = p_0 H_0 + p_1 H_1 = Σ_{|i|=1} p_i H_i.

If biases are used in the iterations, then H_0 = A_0 X_int + b_0 and H_1 = A_1 X_int + b_1, where b_0 and b_1 are learnable H-dimensional vectors.

The second iteration (|i| = 2). Using the results of the first iteration as input, the result of the second iteration is H^(2) = {f_0(f_0(X_int)), f_0(f_1(X_int)), f_1(f_0(X_int)), f_1(f_1(X_int))} = {f_00(X_int), f_01(X_int), f_10(X_int), f_11(X_int)} = {H_i}_{|i|=2}, where H_i = f_i(X_int), ∀i ∈ Ω_2. Since the IFS selects the iteration path f_i with probability p_i, the mathematical expectation of H^(2) is E_2 = Σ_{|i|=2} p_i f_i(X_int) = Σ_{|i|=2} p_i H_i. Expanding E_2 reveals its powerful feature representation ability. First, H_00 = f_0(H_0) = A_0(A_0 X_int + b_0) + b_0, H_01 = f_0(H_1) = A_0(A_1 X_int + b_1) + b_0, H_10 = f_1(H_0) = A_1(A_0 X_int + b_0) + b_1, H_11 = f_1(H_1) = A_1(A_1 X_int + b_1) + b_1. Then E_2 = p_00 H_00 + p_01 H_01 + p_10 H_10 + p_11 H_11 = (p_00 A_00 + p_01 A_01 + p_10 A_10 + p_11 A_11) X_int + (p_00 A_0 b_0 + p_01 A_0 b_1 + p_10 A_1 b_0 + p_11 A_1 b_1) + (p_00 b_0 + p_01 b_0 + p_10 b_1 + p_11 b_1), where A_i = A_{i_1} A_{i_2}, ∀i = (i_1, i_2) ∈ Ω_2.

The n-th iteration (|i| = n). Inductively, we have H^(n) = {H_i}_{|i|=n}, H_i = f_i(X_int), E_n = Σ_{|i|=n} p_i H_i. (3) Note that H_i ∈ R^{N×H} and each column of H_i is a point in R^N, so we regard H_i as a subset of R^N with H elements. Thus H^(n) is a subset of R^N with H × 2^n elements (counting duplicates). Because of Theorem 4.1, we call H^(n) the fractal representation of depth n of the nodes.
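Since f_0 and f_1 are affine and p_0 + p_1 = 1, the expectation satisfies the recursion E_m = p_0 f_0(E_{m-1}) + p_1 f_1(E_{m-1}), so E_n can be computed without enumerating all 2^n paths. Below is a NumPy sketch of this observation (ours, not necessarily the paper's implementation; all names are illustrative), cross-checked against the brute-force expansion of E_2:

```python
import numpy as np

def ifs_expectations(X_int, A0, A1, p, b0=0.0, b1=0.0, n=4):
    """Return [E_1, ..., E_n] for the IFS layer.

    Because f_0, f_1 are affine and p_0 + p_1 = 1, the expectation obeys
    E_m = p_0 * f_0(E_{m-1}) + p_1 * f_1(E_{m-1}), so E_n costs
    O(n N^2 H) instead of enumerating all 2^n iteration paths.
    """
    p0, p1 = p
    E, out = X_int, []
    for _ in range(n):
        E = p0 * (A0 @ E + b0) + p1 * (A1 @ E + b1)
        out.append(E)
    return out

# Cross-check E_2 against the brute-force sum over all paths in {0,1}^2
# (biases set to zero for brevity).
rng = np.random.default_rng(1)
A0, A1 = rng.random((3, 3)), rng.random((3, 3))
X = rng.normal(size=(3, 2))
p = (0.6, 0.4)
E2 = ifs_expectations(X, A0, A1, p, n=2)[1]
brute = sum(p[i] * p[j] * (Ai @ (Aj @ X))
            for i, Ai in enumerate((A0, A1))
            for j, Aj in enumerate((A0, A1)))
assert np.allclose(E2, brute)
```

The recursion is exact because a convex combination passes through an affine map: f(p_0 Y_0 + p_1 Y_1) = p_0 f(Y_0) + p_1 f(Y_1) whenever p_0 + p_1 = 1.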

3.3. REPRESENTATION LAYER

After n iterations of the IFS layer, the dynamic trajectory of the IFS is obtained: O = {E_1, E_2, ..., E_n}. In general, the global representation R of the nodes is obtained by a time average or a concatenation over O: R = (1/n) Σ_{i=1}^n E_i ∈ R^{N×H} or R = ‖_{i=1}^n E_i ∈ R^{N×nH}, where ‖ is the concatenation operator. Because of Theorem 4.2, we call R the ergodic representation of the nodes. In practice, we adopt a weighted time average or weighted concatenation. Following Theorem C.1, we use heuristic weights, interpreted here as the average expansion factor of the distance between two points under the affine transformation. Let the scalar r = ln(N) + γ, where γ ≈ 0.577215664 is the Euler-Mascheroni constant, and let r = (r_1, r_2, ..., r_n) be a learnable n-dimensional vector with initial value r_i = 1/r^{i-1} (using the scalar r). Then the ergodic representation of the nodes is R = Σ_{i=1}^n r_i E_i ∈ R^{N×H} or R = ‖_{i=1}^n r_i E_i ∈ R^{N×nH}.
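The weighted pooling above can be sketched in a few lines of NumPy (the function name `ergodic_representation` is ours; shapes are illustrative):

```python
import numpy as np

def ergodic_representation(E_list, r=None, concat=False):
    """Weighted time average (or weighted concatenation) over the
    trajectory O = [E_1, ..., E_n] produced by the IFS layer.

    Default weights follow the heuristic of Section 3.3:
    r_i = 1 / r**(i-1) with the scalar r = ln(N) + Euler gamma.
    """
    n = len(E_list)
    N = E_list[0].shape[0]
    if r is None:
        base = np.log(N) + 0.5772156649  # Euler-Mascheroni constant
        r = [1.0 / base ** (i - 1) for i in range(1, n + 1)]
    if concat:
        return np.concatenate([ri * E for ri, E in zip(r, E_list)], axis=1)
    return sum(ri * E for ri, E in zip(r, E_list))

E_list = [np.full((4, 3), float(i + 1)) for i in range(3)]  # n=3, N=4, H=3
R_avg = ergodic_representation(E_list)
R_cat = ergodic_representation(E_list, concat=True)
assert R_avg.shape == (4, 3) and R_cat.shape == (4, 9)
```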

3.4. OUTPUT LAYER

Let W_out ∈ R^{H×P} be a learnable parameter matrix, where P is the dimension of the output layer (e.g. the number of class labels); if R is generated by concatenation over O, then W_out ∈ R^{nH×P}. There are two ways to construct the output layer. One is a Single-Layer Perceptron (SLP): O = R W_out + b_out. The other is Mixed Propagation (MP) using f_0, f_1: let R_0 = f_0(R W_out) and R_1 = f_1(R W_out), where the biases of f_0, f_1 are removed; then the output is O = p_0 R_0 + p_1 R_1 + b_out. Here the bias b_out ∈ R^P is an optional learnable parameter vector.
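Both output variants can be expressed compactly; the following sketch (ours, with illustrative names) switches between SLP and MP depending on whether A_0, A_1 are supplied:

```python
import numpy as np

def output_layer(R, W_out, A0=None, A1=None, p=(0.5, 0.5), b_out=0.0):
    """SLP output, or mixed propagation (MP) when A0 and A1 are given.

    MP propagates R @ W_out once more through the bias-free affine maps
    f_0, f_1 and mixes the results with the adjoint probabilities.
    """
    H = R @ W_out
    if A0 is None:
        return H + b_out                              # SLP
    return p[0] * (A0 @ H) + p[1] * (A1 @ H) + b_out  # MP

rng = np.random.default_rng(2)
R = rng.normal(size=(4, 3))        # N=4 nodes, H=3
W_out = rng.normal(size=(3, 2))    # P=2 classes
A0, A1 = rng.random((4, 4)), rng.random((4, 4))
O = output_layer(R, W_out, A0, A1, p=(0.6, 0.4))
assert O.shape == (4, 2)
```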

3.5. INITIALIZATION OF LEARNABLE VARIABLES

The learnable parameters of IGNNS include the input layer matrix W_int ∈ R^{F×H}, the adjoint probability vector p = (p_0, p_1) ∈ R^2 of the IFS, the biases b_0, b_1 ∈ R^H of the IFS layer, the weight coefficients r = (r_1, r_2, ..., r_n) of the representation layer, the matrix W_out ∈ R^{H×P} of the output layer and the bias b_out ∈ R^P of the output layer. Among them, W_int and W_out are required learnable parameters, initialized as described in Glorot & Bengio (2010); b_0, b_1 and b_out are optional learnable parameters with initial value 0; p is an optional learnable parameter: for undirected graphs we set p_0 ∈ [0.5 - 0.1, 0.5 + 0.1], and for directed graphs we set (for the reasons, see Appendix G) p_0 = det D_1 / (det D_0 + det D_1), p_1 = det D_0 / (det D_0 + det D_1); r is an optional learnable parameter, with initial value as defined in Section 3.3. Let n be the number of iterations of the IFS; we regard n as the depth of IGNNS. Thus, IGNNS is denoted as O = IGNNS(X, A; W_int, n, p, b_0, b_1, r, W_out, b_out), or simply O = IGNNS(X, IFS), where the IFS is induced by the relational matrix A. The output of IGNNS can be used as the input of downstream tasks and can also be connected to other network architectures.
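Because D_0 and D_1 are diagonal, their determinants are simply products of degrees, so the directed-graph initialization of p is easy to compute. A sketch (the helper name `init_probabilities` is ours; note that for large graphs the raw product overflows and a log-domain variant would be needed). On the 4-node example of Figure 2 it recovers p_0 = 0.6:

```python
import numpy as np

def init_probabilities(A):
    """Initial (p_0, p_1) for a directed graph, per Section 3.5.

    D_0, D_1 are diagonal degree matrices of triu(A) and tril(A), so
    det D_0 and det D_1 are products of the respective degrees.
    """
    d0 = np.triu(A).sum(axis=1)   # degrees of triu(A)
    d1 = np.tril(A).sum(axis=1)   # degrees of tril(A)
    det0, det1 = np.prod(d0), np.prod(d1)
    p0 = det1 / (det0 + det1)
    return p0, 1.0 - p0

A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 1],
              [0, 1, 1, 1],
              [0, 1, 1, 1]], dtype=float)
p0, p1 = init_probabilities(A)   # det D_0 = 8, det D_1 = 12
assert abs(p0 - 0.6) < 1e-12 and abs(p0 + p1 - 1.0) < 1e-12
```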

3.6. THEORETICAL TIME COMPLEXITY OF IGNNS

Let n, N, H, P be defined as above, and let T(·) denote the number of calculations of an object. For the input layer, T(input layer) = NFH + NH = O(NFH). For the IFS layer, the previous results are stored during the iterative calculation, so T(H^(1)) = 2N²H and T(H^(i)) = 2 × T(H^(i-1)), i = 2, 3, ..., n; it follows that T(H^(i)) = 2^i N²H, i = 1, 2, ..., n. Similarly, T({p_i : |i| = i}) = 2^i, i = 1, 2, ..., n. Given the above, it is easy to see that T(E_i) = 2^i NH. Thus T(IFS layer) = Σ_{i=1}^n [T(H^(i)) + T({p_i : |i| = i}) + T(E_i)] = O(2^n N²H). It is easy to verify that T(representation layer) = O(nNH). Assuming the output layer uses mixed propagation, T(output layer) = O(N²P) if W_out ∈ R^{H×P}, and T(output layer) = O(N²P + nNHP) if W_out ∈ R^{nH×P}. Then T(IGNNS) = O(2^n N²H + N²P) or O(2^n N²H + N²P + nNHP). In practice, for large graphs, 2^n N²H ≫ N²P ≫ nNHP, so 2^n N²H is the main factor in the time complexity of IGNNS; furthermore, for large graphs of the same size, n is the most important factor. For citation network datasets such as Citeseer, Cora and PubMed, we suggest n ≤ 8 (see Appendix B).

4. GEOMETRIC PROPERTIES OF IGNNS

The discussion here assumes that the affine maps f_0, f_1 are contractive; otherwise, let f_0 : x ↦ (1/(‖A_0‖_F + 1)) A_0 x + b_0 and f_1 : x ↦ (1/(‖A_1‖_F + 1)) A_1 x + b_1. In practice, IGNNS does not use contractive affine maps. If contractive maps were used in IGNNS, the following theorems show that the representation ability of IGNNS would decrease as the IFS iterations increase, similar to the behavior of Graph Convolutional Networks (GCN). This phenomenon is called over-smoothing (Li et al., 2018b; Xu et al., 2019; Chen et al., 2020): as the number of layers increases, the representations of the nodes in GCN tend to converge to a certain value and thus become indistinguishable. To overcome the over-smoothing problem of deep GNNs, various methods have been proposed, such as skip connections (Xu et al., 2018), DropEdge (Rong et al., 2020), residual connections (Klicpera et al., 2019a), identity mappings (Chen et al., 2020) and generalized message aggregation functions (Li et al., 2020). Generally speaking, deep networks may suffer decreased generalization performance; to analyze which type of deep GNN achieves better generalization, Xu et al. (2020) propose a guiding theoretical framework.

Theorem 4.1 (Fractal generation) Let H^(n) = {H_i}_{|i|=n}, a subset of R^N with H × 2^n elements (counting duplicates). Then d_H(H^(n), F) → 0 as n → ∞, where d_H is the Hausdorff distance on H(R^N), the set of all nonempty compact subsets of R^N, and F is the fractal set of the IFS in IGNNS. In other words, as the number of iterations increases, H^(n) becomes independent of the node features X and depends only on the graph structure described by A. Let T be the Hutchinson operator on H(R^N), defined as T(B) = f_0(B) ∪ f_1(B), ∀B ∈ H(R^N). Then the update rule of H^(n) satisfies H^(n) = T(H^(n-1)) = ··· = T^n(H^(0)), where H^(0) = {X_int} is a subset of R^N with H elements.
In fractal geometry, H^(n) is used to draw fractal images in the plane: take the initial value H^(0) = {x_0}, where x_0 is a point in the plane, and for large enough n, print all the points of H^(n) on the screen to obtain an approximate fractal image.

Theorem 4.2 (Ergodic property) Let E_n = Σ_{|i|=n} p_i H_i be the mathematical expectation of H^(n) = {H_i}_{|i|=n}. Then E_n converges to a constant matrix E ∈ R^{N×H} in Frobenius norm, i.e. lim_{n→∞} E_n = E, where E[i, :] = (e_i, e_i, ..., e_i) ∈ R^H and e_i ∈ R is a constant, i = 1, 2, ..., N. Furthermore, the time average of the dynamic trajectory O of the IFS satisfies lim_{n→∞} (1/n) Σ_{i=1}^n E_i = lim_{n→∞} E_n = E, and the series Σ_{i=1}^∞ r_i E_i ∈ R^{N×H} converges in Frobenius norm.

Theorem 4.2 shows that as long as the number of iterations is large enough, the node embeddings become close to linearly correlated, and the representation ability of IGNNS declines. However, in the IGNNS framework, because the spectral radii satisfy ρ(A_0) = ρ(A_1) = 1, the IFS is not contractive in general, and IGNNS still has the ability of deep feature representation.
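Theorem 4.2 can be illustrated numerically (this demo is ours, not from the paper): scale random A_0, A_1 by 1/(‖A‖_F + 1) as above so the IFS is contractive, iterate the expectation recursion, and observe that all H columns of E_n collapse onto one vector, i.e. every row of the limit is constant.

```python
import numpy as np

# Contractive-IFS demo of Theorem 4.2: the expectation E_n converges to
# a matrix E whose rows are (e_i, ..., e_i), i.e. all columns coincide.
rng = np.random.default_rng(3)
N, H = 5, 4
A0, A1 = rng.random((N, N)), rng.random((N, N))
A0 /= np.linalg.norm(A0) + 1.0   # np.linalg.norm is Frobenius for matrices
A1 /= np.linalg.norm(A1) + 1.0
b0, b1 = rng.normal(size=(N, 1)), rng.normal(size=(N, 1))
p0, p1 = 0.6, 0.4

E = rng.normal(size=(N, H))      # plays the role of X_int
for _ in range(200):
    E = p0 * (A0 @ E + b0) + p1 * (A1 @ E + b1)

# Every column converged to the same fixed point of the affine mean map.
assert np.allclose(E, E[:, :1] @ np.ones((1, H)), atol=1e-8)
```

Each column evolves under the same contractive affine map m ↦ (p_0 A_0 + p_1 A_1) m + (p_0 b_0 + p_1 b_1), whose unique fixed point is independent of the initial column, which is exactly why the rows of E become constant.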

5. EXPERIMENTS

5.1. EXPERIMENTAL TASK: SEMI-SUPERVISED NODE CLASSIFICATION

Let Z = softmax(O), where Z ∈ R^{N×P} and softmax(·), applied row-wise, is defined as softmax(x_i) = exp(x_i) / Σ_j exp(x_j). For semi-supervised multi-class classification, we employ the following cross-entropy over all labeled examples: L = -Σ_{l∈Y_L} Σ_{i=1}^P Y[l, i] ln Z[l, i], where Y_L is the set of node indices that have labels among P classes, Y[l, :] is a one-hot vector of size P representing the class of node l, and Z[l, :] is row l of the matrix Z.
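A NumPy sketch of the row-wise softmax and the masked cross-entropy above (function names are ours; the max-subtraction is a standard numerical-stability device, not part of the paper's formula):

```python
import numpy as np

def softmax(O):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    e = np.exp(O - O.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def masked_cross_entropy(Z, Y, labeled_idx):
    """L = -sum over labeled rows l of sum_i Y[l, i] * ln Z[l, i]."""
    return -np.sum(Y[labeled_idx] * np.log(Z[labeled_idx]))

O = np.array([[2.0, 0.5, 0.1],
              [0.2, 1.5, 0.3]])
Y = np.array([[1, 0, 0],
              [0, 1, 0]], dtype=float)  # one-hot labels, P = 3
Z = softmax(O)
loss = masked_cross_entropy(Z, Y, labeled_idx=[0, 1])
assert np.allclose(Z.sum(axis=1), 1.0) and loss > 0
```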

5.2. EXPERIMENTAL SETUP

Datasets. In our experiments, we use three standard citation network benchmark datasets for evaluation (Cora, Citeseer and Pubmed) and apply the standard fixed training/validation/testing split (Yang et al., 2016; Kipf & Welling, 2017; Veličković et al., 2018), with 20 nodes per class for training, 500 nodes for validation and 1,000 nodes for testing. In these citation networks, papers are represented as nodes and citations of one paper by another as edges. Node features are the bag-of-words vectors of papers, and each node label is the single academic topic of a paper. See Table 1 for more details. Parameter Setting. The random seed for TensorFlow and NumPy is set to 1234. ReLU (Nair & Hinton, 2010) is used as the activation function in the input and output layers. Dropout (Srivastava et al., 2014) is applied to the input layer, IFS layer and output layer. In the representation layer, we adopt the weighted time average to obtain the global representation of the nodes. In the output layer, we adopt mixed propagation to obtain the output of IGNNS. We use the Adam optimizer (Kingma & Ba, 2015) during training. More details of the hyper-parameters are given in Table 2. During the training stage, we select the best model by maximizing the accuracy on the validation set and use early stopping with a patience of 100 epochs.

5.3. EXPERIMENTAL RESULT

We compare with models that strictly follow the standard experimental setup of semi-supervised node classification, i.e. the standard fixed training/validation/testing split (Yang et al., 2016; Kipf & Welling, 2017). For baselines, we include recent deep GNN models such as JKNet (Xu et al., 2018) and APPNP (Klicpera et al., 2019a), attention-based models such as GAT (Veličković et al., 2018), AGNN (Thekumparampil et al., 2018) and H-GAT (Gulcehre et al., 2019), and other models such as TAGCN (Du et al., 2017) and N-GCN (Abu-El-Haija et al., 2018). We also include three state-of-the-art shallow GNN models: Planetoid (Yang et al., 2016), GCN (Kipf & Welling, 2017) and DGCN (Zhuang & Ma, 2018). The detailed results are shown in Table 3.

Table 3: Accuracy (%) on Cora, Citeseer and Pubmed (the IGNNS entries also report total and per-epoch training time).

Model | Cora | Citeseer | Pubmed
GCN (Kipf & Welling, 2017) | 81.5 | 70.3 | 79.0
GAT (Veličković et al., 2018) | 83.0 | 72.5 | 79.0
TAGCN (Du et al., 2017) | 83.3 | 71.4 | 81.1
JKNet (Xu et al., 2018) | 81.1 | 69.8 | 78.1
AGNN (Thekumparampil et al., 2018) | 83.1 | 71.7 | 79.9
N-GCN (Abu-El-Haija et al., 2018) | 83.0 | 72.2 | 79.5
DGCN (Zhuang & Ma, 2018) | 83.5 | 72.6 | 80.0
APPNP (Klicpera et al., 2019a) | 83.3 | 71.8 | 81.1
H-GAT (Gulcehre et al., 2019) | 83.5 | 72.9 | -
IGNNS (ours) | 86.3 (44s, 0.17s) | 75.1 (65s, 0.16s) | 80.5 (221s, 1.47s)

We can see from Table 3 that the improvement of IGNNS on Cora and Citeseer is much larger than on Pubmed. To understand why, we analyze the characteristics of these citation networks via two statistical properties. One is the network density d(G), defined as d(G) = 2L / (N(N - 1)), where N is the number of nodes and L is the number of edges. The other is the average clustering coefficient C, defined as C = (1/N) Σ_{i∈V} C_i, where V is the set of nodes, C_i = 2 e_i / (k_i (k_i - 1)), k_i is the number of neighbors of node v_i and e_i is the number of undirected edges among those k_i neighbors.
A small network density means strong global sparsity of the network, and a small average clustering coefficient means strong sparsity among the neighbors of nodes. The computed statistics are shown in Table 4, from which we see that Pubmed is sparser than Cora and Citeseer. The performance of IGNNS benefits from the bidirectional mixed propagation of information between nodes, and this sparsity weakens the gain of IGNNS.
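Both statistics follow directly from the adjacency matrix. A NumPy sketch (function names ours; a triangle graph, where both quantities equal 1, serves as a sanity check):

```python
import numpy as np

def network_density(A):
    """d(G) = 2L / (N (N - 1)) for an undirected graph without self-loops."""
    N = A.shape[0]
    L = np.triu(A, k=1).sum()          # each undirected edge counted once
    return 2.0 * L / (N * (N - 1))

def average_clustering(A):
    """C = (1/N) sum_i 2 e_i / (k_i (k_i - 1)), with C_i = 0 when k_i < 2."""
    A = A - np.diag(np.diag(A))        # drop self-loops
    N = A.shape[0]
    C = 0.0
    for i in range(N):
        nbrs = np.flatnonzero(A[i])
        k = len(nbrs)
        if k < 2:
            continue
        e = np.triu(A[np.ix_(nbrs, nbrs)], k=1).sum()  # edges among neighbors
        C += 2.0 * e / (k * (k - 1))
    return C / N

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)  # triangle
assert network_density(A) == 1.0
assert average_clustering(A) == 1.0
```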

5.4. PERFORMANCE OF COMPLETELY LINEAR IGNNS

In nonlinear IGNNS, we use the nonlinear activation function ReLU(x), learn the adjoint probability vector p = (p_0, p_1) via p_0 ← (ReLU(p_0) + 0.1) / (ReLU(p_0) + ReLU(p_1) + 0.2), p_1 ← (ReLU(p_1) + 0.1) / (ReLU(p_0) + ReLU(p_1) + 0.2), and learn the representation layer coefficients r = (r_1, r_2, ..., r_n) via r_i ← ReLU(r_i) with initial value r_i = 1/r^{i-1}, where r = ln(N) + 0.577215664. In this experiment, to obtain a completely linear IGNNS, we let all activation functions be the identity σ(x) = x and treat the adjoint probability vector p = (p_0, p_1) and the representation layer coefficients r = (r_1, r_2, ..., r_n) as fixed hyperparameters without learning. For Citeseer we set p_0 = 0.6, and for Pubmed we use biases in the IFS layer. Apart from these changes, the experimental task and the other settings remain as described in Sections 5.1 and 5.2, respectively. We can see from Table 5 that the completely linear IGNNS (83.9, 72.4 and 79.9 on Cora, Citeseer and Pubmed) outperforms the baseline GCN and remains competitive with the other models. This is because the IFS can extract more features than spectral filters. For further discussion, let A be the normalized adjacency matrix of graph G, and let A_0 and A_1 be defined as in Section 3.2. We further assume that the dimension of the hidden space equals 1, so the input of the GNN (GCN or IGNNS) is a point x_0 = X_int = X W_int ∈ R^{N×1}. Let n be the depth of the GNN; for IGNNS, this equals the number of IFS iterations. For convenience, we ignore the activation function and parameter matrices. Let f(x) = Ax + b, f_0(x) = A_0 x + b_0 and f_1(x) = A_1 x + b_1. For GCN, the message passing results are {f(x_0)}, {f²(x_0)}, ..., {f^n(x_0)}; each iteration yields only one value, i.e. |{f^n(x_0)}| = 1.
For IGNNS, the message passing results are {f_0(x_0), f_1(x_0)}, {f_0(f_0(x_0)), f_0(f_1(x_0)), f_1(f_0(x_0)), f_1(f_1(x_0))}, ..., {f_i(x_0)}_{|i|=n}. If f_0 and f_1 satisfy the separation condition, i.e. f_0(x) = f_1(y) implies x = y, then |{f_i(x_0)}_{|i|=n}| = 2^n. This means that IGNNS can extract more information than GCN. Even if f_0 and f_1 are contractive mappings, by Theorem 4.1 we have {f_i(x_0)}_{|i|=n} → F, where F is the fractal set of the IFS induced by f_0 and f_1. Generally speaking, F is an uncountable compact set, which means that even when n is large, the features may still be distinguishable.
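Using the Appendix A maps f_0(x) = x/3 and f_1(x) = x/3 + 2/3 (the Cantor-set IFS, whose branch images are disjoint), one can check numerically that the n-th fractal representation contains 2^n distinct values, while single-direction passing in the Bi-GCN style keeps only the two boundary orbits f_0^n(x_0) and f_1^n(x_0). A sketch (ours):

```python
import numpy as np
from itertools import product

f0 = lambda x: x / 3.0
f1 = lambda x: x / 3.0 + 2.0 / 3.0

def fractal_representation(x0, n):
    """H^(n) = {f_i(x0) : i in {0,1}^n}; enumerating all 2^n path
    sequences yields the full set regardless of composition order."""
    vals = []
    for path in product((f0, f1), repeat=n):
        x = x0
        for f in path:
            x = f(x)
        vals.append(x)
    return vals

n, x0 = 6, 0.5
H_n = fractal_representation(x0, n)
# Disjoint branch images => all 2^n path values are distinct ...
assert len(set(np.round(H_n, 12))) == 2 ** n
# ... whereas single-direction passing keeps only the two boundary
# orbits f_0^n(x0) and f_1^n(x0).
boundary = {H_n[0], H_n[-1]}
assert len(boundary) == 2
```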

6. CONCLUSION

In this paper, we propose a new framework of graph neural networks, IGNNS, which establishes a connection between Graph Neural Networks and Iterated Function Systems. We use an IFS to simulate the bidirectional message passing process of a graph neural network and obtain the fractal representation and ergodic representation of graph nodes, which are very helpful for downstream tasks. The experiments show that we achieve good results on the semi-supervised node classification task. Interesting directions for future work include pruning the iterative path space {0, 1}^n to reduce the computational complexity, coding graph-structured data with IFS, and establishing more connections between IFS and graph neural networks.

A THE FRACTAL REPRESENTATION OF A GRAPH G WITH ONLY ONE SELF-ADJACENT NODE v

For the sake of discussion, we assume that the dimension of the hidden space equals 1. Messages are sent from node v, propagate in two directions (clockwise and anticlockwise), and are finally received by node v. A message received in the clockwise direction becomes one third of the original, and a message received in the anticlockwise direction becomes one third of the original plus the constant 2/3. In formulas: f_0(x) = x/3, f_1(x) = x/3 + 2/3, x ∈ R. For Bi-GCN, messages are delivered independently in the two directions; in other words, there are two independent channels, and message passing (transmitting or receiving) can only occur within each channel. Let x_0 ∈ R be the initial message. In the clockwise direction, after n passes the received messages are x_0/3, x_0/3², ..., x_0/3^n → 0. In the anticlockwise direction, they are x_0/3 + 2/3, x_0/3² + 2/3² + 2/3, ..., x_0/3^n + Σ_{i=1}^n 2/3^i → 1. For IGNNS, the two channels share a connection point at node v.
First, node v sends the message x_0 in both directions, and the connection point at node v receives the two messages {f_0(x_0), f_1(x_0)}. In the second round, either message (f_0(x_0) or f_1(x_0)) can be sent in both directions, so the received messages are {f_0(f_0(x_0)), f_0(f_1(x_0)), f_1(f_0(x_0)), f_1(f_1(x_0))}. In summary, after n passes the received messages are H^(n) = {f_i(x_0)}_{|i|=n} → C, where C is the famous Cantor set. Since f_0 and f_1 satisfy the separation condition, we receive not just one message but 2^n. We can see from Figure 3 that Bi-GCN obtains only the boundary messages while IGNNS obtains all messages. Let p = (p_0, p_1) be the adjoint probability vector; then the mathematical expectation E_n of H^(n) is Σ_{|i|=n} p_i f_i(x_0), which we interpret as the average of all received messages. A question remains: the fractal representation gathers plenty of messages, but is there redundancy among them? How to select the valid messages from the fractal representation is the focus of our research in the next stage.

B ANALYSIS OF TIME COMPLEXITY ON CORA, CITESEER AND PUBMED

From Section 3.6, the time complexity in Experiment 5.1 is O(2^n N²H + N²P), where n is the number of IFS iterations (the depth of IGNNS), N is the number of nodes, H is the dimension of the latent space and P is the dimension of the output layer. In this section, we compare the real running time (100 epochs) of IGNNS on Cora, Citeseer and Pubmed. Let H = 8; then 2^n N²H is the main factor affecting the time complexity of IGNNS. Let the depth of IGNNS

C FROBENIUS NORM OF MATRIX

Theorem C.1 Let A ∈ R^{N×N} be the adjacency matrix of an unweighted graph, i.e. A_{i,j} = 1 if there exists an edge i → j in the graph and A_{i,j} = 0 otherwise, and let A_1 = D_1^{-1} tril(A) (or A_1 = D_1^{-1/2} tril(A) D_1^{-1/2}) as defined in IGNNS. Then ‖A_1‖²_F ≥ Σ_{i=1}^N 1/i ≈ ln(N) + γ, where γ ≈ 0.577215664 is the Euler-Mascheroni constant.

Proof. Case 1:

A_1 = D_1^{-1} tril(A). Let tril(A) = (a_{ij})_{N×N}, let D_1 = diag(d_1, d_2, ..., d_N) be the degree matrix of tril(A), and let A_1 = (b_{ij})_{N×N}. Note that tril(A) is a lower triangular matrix, so d_i = Σ_{j=1}^N a_{ij} = Σ_{j=1}^i a_{ij}. Since a_{ij} ∈ {0, 1}, we have i ≥ d_i. Computing the Frobenius norm of A_1:

‖A_1‖²_F = Σ_{i=1}^N Σ_{j=1}^N b²_{ij} = Σ_{i=1}^N Σ_{j=1}^i b²_{ij} = Σ_{i=1}^N Σ_{j=1}^i (a_{ij}/d_i)². (4)

For each i, Σ_{j=1}^i (a_{ij}/d_i)² = (1/d_i²) Σ_{j=1}^i a²_{ij} = (1/d_i²) × d_i = 1/d_i ≥ 1/i. (5)

It follows from (4) and (5) that ‖A_1‖²_F ≥ Σ_{i=1}^N 1/i ≈ ln(N) + γ, where γ ≈ 0.577215664 is the Euler-Mascheroni constant.

Case 2: A_1 = D_1^{-1/2} tril(A) D_1^{-1/2}. Computing the Frobenius norm of A_1:

‖A_1‖²_F = Σ_{i=1}^N Σ_{j=1}^N b²_{ij} = Σ_{i=1}^N Σ_{j=1}^i b²_{ij} = Σ_{i=1}^N Σ_{j=1}^i a²_{ij}/(d_i d_j). (6)

For each i ∈ {1, 2, ..., N}, it follows from j ≥ d_j that Σ_{j=1}^i a²_{ij}/(d_i d_j) ≥ (1/d_i) Σ_{j=1}^i (1/j) a²_{ij}. (7)

Note that d_i of the elements {a_{ij}}_{j=1}^i equal 1 and the rest are 0. It follows from the rearrangement inequality that

Σ_{j=1}^i (1/j) a²_{ij} ≥ (1/(i - d_i + 1)) a²_{i(i-d_i+1)} + (1/(i - d_i + 2)) a²_{i(i-d_i+2)} + ··· + (1/i) a²_{ii}, (8)

where in the extremal arrangement a_{i(i-d_i+1)} = a_{i(i-d_i+2)} = ··· = a_{ii} = 1. Thus

Σ_{j=1}^i (1/j) a²_{ij} ≥ (1/i) + (1/i) + ··· + (1/i) = d_i × (1/i). (9)

It follows from (6), (7) and (9) that ‖A_1‖²_F ≥ Σ_{i=1}^N 1/i, which completes the proof.

D INTRODUCTION TO ITERATED FUNCTION SYSTEM

In order to prove Theorem 4.1 and Theorem 4.2, we will briefly introduce the relevant conclusions on IFS in this section, and we will not give the proof here. More details of IFS Theory can be found in Hutchinson (1981) ; Elton (1987) ; Barnsley (1988) ; Falconer (1990) ; Massopust (2017) . We call (H(X); d H ) a Fractal space. Let {f i } n i=1 be a set of mappings on (X; d). Hutchinson operator T : (H(X); d H ) → (H(X); d H ) defined as T (B) = n i=1 f i (B), ∀B ∈ H(X). ( ) Theorem D.2 If {f i } n i=1 is a set of contractive mappings on (X; d), then Hutchinson operator T is a contractive mapping on (H(X); d H ).

D.2 MARKOV OPERATOR OF IFS

Let (X; d) be a complete metric space. Let M(X) be the set of all probability measures on X. Let C(X) be the set of all continuous functions mapping X to R. We say that f ∈ Lip1, if |f (x) -f (y)| ≤ d(x, y), ∀x, y ∈ X. It is easy to see that if f ∈ Lip1 then f ∈ C(X). Hutchinson metric d M on M defined as d M (µ, ν) = sup X f dµ - X f dν |f ∈ Lip1 , ∀µ, ν ∈ M(X). ( ) Theorem D.3 (M(X); d M ) is a complete metric space. Let IFS = {X; f 1 , f 2 , ..., f n ; p}, the Markov operator M : M(X) → M(X) of IFS defined as M µ = n i=1 p i µ • f -1 i , µ ∈ M(X). Theorem D.4 Markov operator M of IFS is a contractive mapping on space (M(X); d M ). Let measure sequence {µ i } ⊂ M and µ ∈ M, we call {µ i } weakly convergent to µ if the following equation holds: --→ F. The above convergence is independent of the choice of initial value. Thus, let H (0) = {X int } = {x 1 , x 2 , ..., x H }, x i ∈ R N , we have lim i→∞ X f dµ i = X f dµ, ∀f ∈ C(X), H (n) = T (H (n-1) ) = • • • = T n (H (0) ) d H --→ F. The above result indicates that when n is large enough, H (n) is close to the fractal set F of IFS in the sense of Hausdorff distance, and has nothing to do with the choice of initial value H (0) . F THE PROOF OF THEOREM 4.2 . For this purpose, let x j be the j column of X int as defined in (2), then x j is a point in R N . Define a Dirac measure as follows: δ x (B) = 1 x ∈ B, 0 other. ( ) It is easy to see that δ x ∈ M(R N ). The Markov operator M : M(R N ) → M(R N ) of IFS, defined as M µ = 1 i=0 p i µ • f -1 i , µ ∈ M(R N ). Now take µ 0 = δ xj , and the results of iterative calculation are as follows:  µ 1 = M µ 0 = 1 i=0 p i µ 0 • f -1 i = 1 i=0 p i δ xj • f -1 i = 1 i=0 p i δ fi(xj ) = |i|=1 p i δ f i (xj ) . µ 2 = M 2 µ 0 = M µ 1 = 1 i=0 p i µ 1 • f -1 i = 1 i=0 p i ( |i|=1 p i δ f i (xj ) ) • f -1 i = p In ( 18), ∀i ∈ {1, 2, ..., N }, take the continuous function F i to satisfy  F i (t) = t i , ∀t = (



Figure 1: Message passing patterns. The symbol H denotes the representations of all the nodes. (a) An undirected graph is transformed into a directed graph in a natural way. (b) Regardless of direction, each node simply gathers information from its neighbors. (c) Messages are passed in a single direction (forward or backward), and two hidden representations are obtained independently. (d) Message passing occurs not only in the same direction but also interactively in different directions, which is more in line with the actual situation. For example, in layer 1, node 2 passes the processed message f_1(m_2) to node 1; then, in layer 2, node 1 processes the received message f_1(m_2) and returns the processed message f_0(f_1(m_2)) to node 2.

Figure 2: An overview of IGNNS. The upper part of the figure describes how to generate two affine transformations on R^4, where we use the mean-pooling method to normalize A, with p_0 = 0.6 and p_1 = 0.4.

Figure 3: Comparison of feature extraction ability between Bi-GCN and IGNNS. Bi-GCN captures only boundary messages, whereas IGNNS captures all messages.

Figure 4: Real training time on Cora, Citeseer and PubMed.

D.1 FRACTAL SPACE

Let (X; d) be a complete metric space. Let H(X) denote the set of all nonempty compact subsets of X. The Hausdorff distance d_H on H(X) is defined by

d_H(A, B) = max{ sup_{a∈A} inf_{b∈B} d(a, b), sup_{b∈B} inf_{a∈A} d(a, b) }, ∀A, B ∈ H(X).

Theorem D.1 (H(X); d_H) is a complete metric space.
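For finite point sets the two suprema and infima in the definition above become maxima and minima, which gives a direct implementation. The following computes d_H between two finite subsets of R:

```python
# Hausdorff distance between two finite point sets A, B ⊂ R:
# d_H(A, B) = max( max_{a∈A} min_{b∈B} |a-b|, max_{b∈B} min_{a∈A} |a-b| ).
def hausdorff(A, B):
    d_AB = max(min(abs(a - b) for b in B) for a in A)  # how far A strays from B
    d_BA = max(min(abs(a - b) for a in A) for b in B)  # how far B strays from A
    return max(d_AB, d_BA)

print(hausdorff({0.0, 1.0}, {0.0, 0.5}))  # 0.5: the point 1.0 is 0.5 away from B
```

Note that the distance is zero exactly when the two compact sets coincide, which is what makes (H(X); d_H) a metric space.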

Theorem D.5 If µ_i →_{d_M} µ as i → ∞, then µ_i →_w µ as i → ∞.

E THE PROOF OF THEOREM 4.1

Proof. By Theorem D.1, (H(R^N); d_H) is a complete metric space. Let T be the Hutchinson operator on H(R^N), defined as T(B) = f_0(B) ∪ f_1(B), ∀B ∈ H(R^N). By Theorem D.2, T is a contractive mapping on (H(R^N); d_H). It follows from the Banach fixed point theorem that there exists a unique compact set F ∈ H(R^N) such that

F = T(F) = f_0(F) ∪ f_1(F),

which implies that F is the fractal set of the IFS. Furthermore, ∀B ∈ H(R^N), we have T^n(B) →_{d_H} F, and this convergence is independent of the choice of the initial value. Thus, letting H^(0) = {X_int} = {x_1, x_2, ..., x_H}, x_i ∈ R^N, we have

H^(n) = T(H^(n-1)) = ··· = T^n(H^(0)) →_{d_H} F.

This result indicates that when n is large enough, H^(n) is close to the fractal set F of the IFS in the sense of the Hausdorff distance, regardless of the choice of the initial value H^(0).
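The independence from the initial value can be checked numerically: starting the Hutchinson iteration from two different seed sets, the Hausdorff distance between the iterates shrinks geometrically. The contraction ratio 1/2 below belongs to the illustrative maps, not to any IFS from the paper.

```python
# Stand-in contractive maps with ratio 1/2 (illustrative only).
def f0(x): return 0.5 * x
def f1(x): return 0.5 * x + 0.5

def hutchinson(B):
    return {f0(x) for x in B} | {f1(x) for x in B}

def hausdorff(A, B):
    return max(max(min(abs(a - b) for b in B) for a in A),
               max(min(abs(a - b) for a in A) for b in B))

# Two different initial sets; their iterates converge to the same fractal set.
B1, B2 = {0.0}, {1.0}
dists = []
for _ in range(6):
    B1, B2 = hutchinson(B1), hutchinson(B2)
    dists.append(hausdorff(B1, B2))

print(dists)  # halves at every step: 0.5, 0.25, 0.125, ...
```

The geometric decay rate equals the contraction ratio of the maps, which is the quantitative content of the Banach fixed point argument.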

F THE PROOF OF THEOREM 4.2

Proof. It suffices to prove that ∀j ∈ {1, 2, ..., H}, E_n[:, j], the j-th column of E_n, converges as n → ∞. For this purpose, let x_j be the j-th column of X_int as defined in (2); then x_j is a point in R^N. Define the Dirac measure

δ_x(B) = 1 if x ∈ B, 0 otherwise.

It is easy to see that δ_x ∈ M(R^N). The Markov operator M : M(R^N) → M(R^N) of the IFS is defined as Mµ = Σ_{i=0}^1 p_i µ ∘ f_i^{-1}, µ ∈ M(R^N). Now take µ_0 = δ_{x_j}; the results of the iterative calculation are as follows:

µ_1 = Mµ_0 = Σ_{i=0}^1 p_i µ_0 ∘ f_i^{-1} = Σ_{i=0}^1 p_i δ_{x_j} ∘ f_i^{-1} = Σ_{i=0}^1 p_i δ_{f_i(x_j)} = Σ_{|i|=1} p_i δ_{f_i(x_j)}.

Similarly,

µ_2 = M^2 µ_0 = Mµ_1 = Σ_{i=0}^1 p_i µ_1 ∘ f_i^{-1} = p_0 p_0 δ_{f_0(f_0(x_j))} + p_0 p_1 δ_{f_0(f_1(x_j))} + p_1 p_0 δ_{f_1(f_0(x_j))} + p_1 p_1 δ_{f_1(f_1(x_j))} = Σ_{|i|=2} p_i δ_{f_i(x_j)}.

Inductively, we have

µ_n = M^n µ_0 = Σ_{|i|=n} p_i δ_{f_i(x_j)}. (17)

By Theorem D.4, it follows from the Banach fixed point theorem that there exists a unique probability measure µ* such that µ_n →_{d_M} µ*. This µ* is in fact the self-similar measure of the IFS. By Theorem D.5, we have µ_n →_w µ*, i.e.,

lim_{n→∞} ∫ F dµ_n = ∫ F dµ*, ∀F ∈ C(R^N).

It follows from (17) and (3) that

∫ F dµ* = lim_{n→∞} ∫ F dµ_n = lim_{n→∞} ∫ F d(Σ_{|i|=n} p_i δ_{f_i(x_j)}) = lim_{n→∞} Σ_{|i|=n} p_i F(f_i(x_j)) = lim_{n→∞} Σ_{|i|=n} p_i F(H_i[:, j]), ∀F ∈ C(R^N). (18)

In (18), ∀i ∈ {1, 2, ..., N}, take the continuous function F_i to satisfy F_i(t) = t_i, ∀t = (t_1, t_2, ..., t_N) ∈ R^N.
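Equation (17) can be verified directly on a toy example: enumerating all binary words i of length n, composing the corresponding maps, and multiplying the branch probabilities reproduces the atoms and weights of M^n δ_x. The maps and weights below are illustrative stand-ins, not the paper's IFS.

```python
from itertools import product

# Stand-in maps and adjoint probability vector (illustrative assumptions).
def f0(x): return 0.5 * x
def f1(x): return 0.5 * x + 0.5

maps, p = [f0, f1], [0.6, 0.4]

def mu_n(x, n):
    """Atoms and weights of µ_n = Σ_{|i|=n} p_i δ_{f_i(x)}, eq. (17)."""
    out = {}
    for word in product([0, 1], repeat=n):
        y, w = x, 1.0
        for i in word:              # compose the maps along the word,
            y, w = maps[i](y), w * p[i]  # accumulating the product p_i
        out[y] = out.get(y, 0.0) + w
    return out

m = mu_n(0.0, 2)
print(m)                            # 4 atoms with weights p_i p_j
print(sum(m.values()))              # total mass stays 1
```

The number of atoms is 2^n, matching the size of the symbol space {0, 1}^n of message passing paths.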

Summary statistics of the benchmark datasets used in the experiment.

Hyper-parameters used in the experiments.

Summary of classification accuracy (%) results on Cora, Citeseer and PubMed. The results are taken from the corresponding papers. The first value in brackets indicates the total training time in seconds and the second value in brackets indicates the average training time in seconds per epoch.

Statistical characteristics of the networks. Bold indicates the minimum.

Performance of completely linear IGNNS on Cora, Citeseer and PubMed.


G HOW TO SET THE INITIAL VALUE OF ADJOINT PROBABILITY VECTOR?

The geometric meaning of the determinant det A is that it is the factor by which volume is scaled under the linear transformation A. Let F be the fractal set of the IFS defined in IGNNS. Then F = f_0(F) ∪ f_1(F). It can be seen that if det A_i is large, then f_i(F) has a large share in F; therefore, when selecting the iterated function, f_i should be chosen with a larger probability, and we set the initial value of the adjoint probability vector in proportion to det A_i. Note that A_0 and A_1 are triangular matrices and that the diagonals of triu(A) and tril(A) are equal to 1, so the determinants are easily obtained from D_0 and D_1, where D_0 and D_1 are the degree matrices of triu(A) and tril(A), respectively.
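Since A_0 and A_1 are triangular, each determinant reduces to a product of diagonal entries, and a proportional rule p_i = det A_i / (det A_0 + det A_1) is one natural way to realize the initialization described above. The matrices below are illustrative stand-ins (not derived from an actual adjacency matrix), and the proportional rule is our reading of the text rather than a verbatim formula from the paper.

```python
def det_triangular(M):
    """Determinant of a triangular matrix: the product of its diagonal."""
    d = 1.0
    for i in range(len(M)):
        d *= M[i][i]
    return d

# Illustrative triangular matrices standing in for A_0 (lower) and A_1 (upper).
A0 = [[0.5, 0.0],
      [0.3, 0.8]]   # det = 0.5 * 0.8 = 0.4
A1 = [[1.0, 0.1],
      [0.0, 0.6]]   # det = 1.0 * 0.6 = 0.6

d0, d1 = det_triangular(A0), det_triangular(A1)
p0, p1 = d0 / (d0 + d1), d1 / (d0 + d1)  # hypothetical proportional rule
print(p0, p1)
```

The larger-determinant branch receives the larger selection probability, in line with the volume-share intuition above.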

