TRANSFER LEARNING OF GRAPH NEURAL NETWORKS WITH EGO-GRAPH INFORMATION MAXIMIZATION

Abstract

Graph neural networks (GNNs) have shown superior performance in various applications, but training dedicated GNNs can be costly for large-scale graphs. Some recent work has started to study the pre-training of GNNs. However, none of it provides theoretical insights into the design of the frameworks, or clear requirements and guarantees on the transferability of GNNs. In this work, we establish a theoretically grounded and practically useful framework for the transfer learning of GNNs. Firstly, we propose a novel view of the essential graph information and advocate capturing it as the goal of transferable GNN training, which motivates the design of EGI (ego-graph information maximization) to analytically achieve this goal. Secondly, we specify the requirement of structure-respecting node features as the GNN input, and conduct a rigorous analysis of GNN transferability based on the difference between the local graph Laplacians of the source and target graphs. Finally, we conduct controlled synthetic experiments to directly justify our theoretical conclusions. Extensive experiments on real-world networks for role identification show consistent results in the rigorously analyzed setting of direct-transferring (freezing parameters), while those for large-scale relation prediction show promising results in the more generalized and practical setting of transferring with fine-tuning.

1. INTRODUCTION

Graph neural networks (GNNs) have been intensively studied recently (Kipf & Welling, 2017; Keriven & Peyré, 2019; Chen et al., 2019; Oono & Suzuki, 2020; Huang et al., 2018), due to their established performance on various real-world tasks (Hamilton et al., 2017; Ying et al., 2018b; Velickovic et al., 2018), as well as their close connections to spectral graph theory (Defferrard et al., 2016; Bruna et al., 2014; Hammond et al., 2011). While most GNN architectures are not very complicated, the training of GNNs can still be costly in both memory and computation on real-world large-scale graphs (Chen et al., 2018; Ying et al., 2018a). Moreover, it is intriguing to transfer learned structural information across different graphs and even domains in settings like few-shot learning (Vinyals et al., 2016; Finn et al., 2017; Ravi & Larochelle, 2017). Therefore, several very recent studies have been conducted on the transferability of GNNs, which focus on the setting of pre-training plus fine-tuning (Hu et al., 2019a,b, 2020; Wu et al., 2020). However, it is unclear in what situations the models will excel or fail, especially when the pre-training and fine-tuning tasks are different. To provide rigorous analysis and guarantees on the transferability of GNNs, we focus on the setting of direct-transferring between the source and target graphs, under a setting analogous to "domain adaptation" (Ben-David et al., 2007). In this work, we establish a theoretically grounded framework for the transfer learning of GNNs, and leverage it to design a practically transferable GNN model. Figure 1 gives an overview of our framework. It is based on a novel view of a graph as samples from the joint distribution of its k-hop ego-graph structures and node features, which allows us to define graph information and similarity, so as to analyze GNN transferability (§2).
This view motivates us to design EGI, a novel GNN model based on ego-graph information maximization, which is effective in capturing the graph information as we define it (§2.1). We then specify the requirement on transferable node features and analyze the transferability of EGI, which depends on the local graph Laplacians of the source and target graphs (§2.2). All of our theoretical conclusions are directly validated through controlled synthetic experiments (Table 1), where we use structural-equivalent role identification in a direct-transferring setting to analyze the impacts of different model designs, node features and source-target structure similarities on GNN transferability. In §3, we conduct real-world experiments on multiple publicly available network datasets. On the Airport and Gene graphs (§3.1), we closely follow the settings of our synthetic experiments and observe consistent but more detailed results supporting the design of EGI and the utility of our theoretical analysis. On the YAGO graphs (§3.2), we further evaluate EGI in the more generalized and practical setting of transfer learning with task-specific fine-tuning. We find our theoretical insights still indicative in such scenarios, where EGI consistently outperforms state-of-the-art GNN models and transfer learning frameworks by significant margins.

2. TRANSFERABLE GRAPH NEURAL NETWORKS

Based on the connection between GNN and spectral graph theory (Kipf & Welling, 2017) , we describe the output of a GNN as a combination of its input node features, fixed graph Laplacian and learnable graph filters. The goal of training a GNN is then to improve its utility by learning the graph filters that are compatible with the other two components towards specific tasks. In the graph transfer learning setting where downstream tasks are often unknown during pre-training, we argue that the general utility of a GNN should be optimized and quantified w.r.t. its ability of capturing the essential graph information in terms of the joint distribution of its link structures and node features, which motivates us to design a novel ego-graph information maximization model (EGI) ( §2.1). The general transferability of a GNN is then quantified by the gap between its abilities to model the source and target graphs. Under reasonable requirements such as using structure-respecting node features as the GNN input, we analyze this gap for EGI based on the structural difference between two graphs w.r.t. their local graph Laplacians ( §2.2).
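To make the three components concrete, the following is a minimal NumPy sketch of a single GNN layer, not the authors' implementation. It assumes a symmetric normalized Laplacian and a first-order polynomial filter of the form θ0·I + θ1·L (the 1-hop filter family considered later in the analysis). The names `normalized_laplacian`, `gnn_layer`, `theta` and `W` are hypothetical.

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.

    Assumption: isolated nodes get a zero row/column in the normalized
    adjacency, so their diagonal entry in L is 1.
    """
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return np.eye(A.shape[0]) - inv_sqrt[:, None] * A * inv_sqrt[None, :]

def gnn_layer(L, X, theta, W):
    """One layer combining the three components named in the text.

    L     : fixed graph Laplacian (n x n)
    X     : input node features (n x d_in)
    theta : learnable filter coefficients (theta0, theta1), hypothetical
    W     : learnable feature transform (d_in x d_out), hypothetical
    """
    phi = theta[0] * np.eye(L.shape[0]) + theta[1] * L  # 1-hop polynomial filter
    return np.maximum(phi @ X @ W, 0.0)                 # ReLU activation
```

Only `theta` and `W` would be learned; `L` and `X` are given by the graph, which is why transfer requires the learned filters to remain compatible with the Laplacian and the features of the target graph.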

2.1. TRANSFERABLE GNN VIA EGO-GRAPH INFORMATION MAXIMIZATION

In this work, we focus on the direct-transferring setting where a GNN is pre-trained on a source graph $G_a$ in an unsupervised fashion and applied on a target graph $G_b$ without fine-tuning. Consider a graph $G = \{V, E\}$, where the set of nodes $V$ is associated with certain features and the set of links $E$ forms certain structures. Intuitively, the transfer learning will be successful only if both the features and structures of $G_a$ and $G_b$ are similar in some ways, so that the graph filters of a GNN learned on $G_a$ are compatible with the features and structures of $G_b$. Motivated by the concept of k-layer expansion sub-graph in (Bai & Hancock, 2016), we introduce a novel view of a graph as samples from the joint distribution of its k-hop ego-graph structures and node features. This view allows us to give concrete definitions of the structural information of graphs in the transfer learning setting, which facilitates measuring the similarity (difference) among graphs.

Definition 2.1 (k-hop ego-graph). We call a graph $g_i = \{V(g_i), E(g_i)\}$ a k-hop ego-graph centered at $v_i$ if it has a k-layer centroid expansion (Bai & Hancock, 2016) such that the greatest shortest path rooted from $v_i$ has length $k$, i.e., $k = \max_{v_j \in V(g_i)} |S(v_i, v_j)|$, where $S(v_i, v_j)$ is the shortest path between $v_i$ and $v_j$. For an ordered k-hop ego-graph, we denote $v_{p,q}$ as the $q$-th node in the $p$-th layer of the ego-graph (i.e., $|S(v_i, v_{p,q})| = p$), where $p = 0, \dots, k$, and $e_{\tilde{v}v}$ as the edge between a node $\tilde{v}$ in layer $p$ and a node $v$ in layer $p+1$.

Definition 2.2 (Structural information). Let $\mathcal{G}$ be a topological space of sub-graphs (Verma & Zhang, 2019). We view a graph $G$ as samples of k-hop ego-graphs $G = \{g_i\}_{i=1}^{n}$ drawn i.i.d. from $\mathcal{G}$ with probability $\mu$, i.e., $g_i \overset{\text{i.i.d.}}{\sim} \mu \ \forall i = 1, \dots, n$. The structural information of $G$ is then defined to be the combination of the distribution $\mu$ and the set of spectra of $\{g_i\}_{i=1}^{n}$.
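As an illustration of Definition 2.1, the sketch below extracts a k-hop ego-graph by breadth-first search and groups its nodes into layers by shortest-path distance from the center. It is only a sketch under stated assumptions: the graph is undirected and given as an adjacency dictionary with orderable node ids, the ego-graph keeps every edge of G whose endpoints both lie within k hops, and the function name `k_hop_ego_graph` is hypothetical.

```python
from collections import deque

def k_hop_ego_graph(adj, center, k):
    """Return (layers, edges) of the k-hop ego-graph centered at `center`.

    adj    : dict mapping each node to an iterable of its neighbors
    layers : layers[p] lists the nodes at shortest-path distance p
    edges  : induced edges among all nodes within k hops, as (u, v) with u < v
    """
    dist = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond hop k
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    layers = [[] for _ in range(k + 1)]
    for node, d in dist.items():
        layers[d].append(node)
    layers = [sorted(layer) for layer in layers]
    edges = {(u, v) for u in dist for v in adj[u] if v in dist and u < v}
    return layers, edges
```

For a path graph 0-1-2-3-4 and center 2 with k = 1, layer 0 is the center itself and layer 1 holds its two neighbors, matching the ordered ego-graph notation $v_{p,q}$.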
The structural information of a graph $G$ can be characterized by $\{g_i\}_{v_i \in V}$ and its empirical distribution, where each $g_i$ is a k-hop ego-graph of $G$ centered at node $v_i$ with nodes $V(g_i) = \{u \in V(G) : |S(u, v_i)| \le k\}$ and edges $E(g_i) = \{e_{uv} \in E(G) : u, v \in V(g_i)\}$. As shown in Figure 1, three graphs $G_0$, $G_1$ and $G_2$ are characterized by a set of 1-hop ego-graphs and their empirical distributions, which allows us to quantify the structural similarity among graphs as shown in §2.2 (i.e., $G_0$ is more similar to $G_1$ than to $G_2$ under such characterization). In practice, the nodes in a graph $G$ are characterized not only by their k-hop ego-graph structures but also by their associated node features. Therefore, $G$ should be regarded as samples $\{(g_i, x_i)\}_{i=1}^{n}$ from $\mathcal{G} \times \mathcal{X}$, drawn with the joint distribution $p$ on the product space of $\mathcal{G}$ and a node feature space $\mathcal{X}$. To capture such joint distributions of structural information and node features, we design ego-graph information maximization (EGI), which recursively reconstructs the k-hop ego-graph of each node based on its features in an unsupervised fashion.

Ego-Graph Information Maximization. Assume we are given a set of ego-graphs $\{(g_i, x_i)\}_i$ with empirical joint distribution $\mathbb{P}$. Similarly to the "local" version of DIM (Hjelm et al., 2019), we define $\mathbb{U}_{\Psi(g_i, x_i)}$ as the empirical distribution of the embedding produced by the GNN encoder $\Psi$ for the center node $v_i$ of ego-graph $g_i$. Unlike DGI (Velickovic et al., 2019), which models the local-global mutual information (MI), EGI optimizes $\Psi$ to maximize the MI $I(g_i, \Psi(g_i, x_i))$, which is directly between the structural input and the output of the GNN, with a focus on the structural information $g_i$.
Specifically, we use the Jensen-Shannon MI estimator in (Hjelm et al., 2019),

$$\mathcal{L}_{\mathrm{EGI}} = -I^{(\mathrm{JSD})}(\mathcal{G}, \Psi) = \mathbb{E}_{\mathbb{P} \times \tilde{\mathbb{U}}}\left[\mathrm{sp}\left(T_{D,\Psi}(g_i, \Psi(g_i', x_i'))\right)\right] - \mathbb{E}_{\mathbb{P}}\left[-\mathrm{sp}\left(-T_{D,\Psi}(g_i, \Psi(g_i, x_i))\right)\right], \tag{1}$$

where $\mathrm{sp}(t) = \log(1 + e^{t})$ is the softplus function, $T_{D,\Psi} = D \circ (g_i, \Psi(g_i, x_i))$, and $D$ is a discriminator $D : g_i \times \Psi(g_i, x_i) \to \mathbb{R}^{+}$. In Eq. 1, during the training of $D$, the input space of $D$ is at least as large as the number of graph permutations $|V(g_i)|!$. Instead of enumerating all possible graphs $g_i$, we fix $g_i$ and sample the GNN's output $\Psi(g_i', x_i')$ from the marginal distribution $\tilde{\mathbb{U}}$ by uniformly sampling $(g_i', x_i') \sim \tilde{\mathbb{P}}$, $\tilde{\mathbb{P}} = \mathbb{P}$. The correspondence between sampling $(g_i', x_i') \sim \mathbb{P}$ and $g_i' \sim \mathcal{G}$ is discussed in Remark 2 for the case where node features are structure-respecting (Def. 2.3).

Formally, we characterize the decision process of $D$ with a fixed graph ordering, i.e., a BFS-ordering $\pi$ over the edges $E(g_i)$. $D$ is a GNN scoring function over an edge sequence $E^{\pi} : \{e_1, e_2, \dots, e_n\}$, which makes predictions on BFS-ordered edges. Let $z_i = \Psi(g_i, x_i)$; then we have

$$D(g_i, z_i) = \sum_{p=0}^{k} \sum_{q=1}^{|V_p(g_i)|} \log D(e_{\tilde{v}v} \mid h^{q}_{p,q}, x^{i}_{p,q}, z_i), \tag{2}$$

where $h$ is the hidden representation output by $D$, and $e_{\tilde{v}v} \in E(g_i)$ is an edge between node $\tilde{v}$ in layer $p$ and $v$ in layer $p+1$, following the notation defined below Def. 2.1. More specifically, we have

$$D(e_{\tilde{v}v} \mid h^{q}_{p,q}, x^{i}_{p,q}, z_i) = \sigma\left(U^{T} \cdot \tau\left(W^{T} [h^{q}_{p,q} \,\|\, x^{i}_{p,q} \,\|\, z_i]\right)\right), \tag{3}$$

where $\sigma$ and $\tau$ are the Sigmoid and ReLU activation functions, respectively. Thus, the discriminator is asked to distinguish the positive pair $(e_{\tilde{v}v}, \Psi(g_i, x_i))$ from the negative pair $(e_{\tilde{v}v}, \Psi(g_i', x_i'))$, each consisting of an observed edge and a positive or negative center node embedding $\Psi(\cdot)$. Because the output of a k-layer GNN only depends on a k-hop ego-graph, EGI can be trained in parallel by sampling batches of $g_i$'s. Besides, the training objective of EGI is transferable as long as $(g_i, x_i)$ across the source graph $G_a$ and the target graph $G_b$ satisfies the conditions given in §2.2.
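The estimator in Eq. 1 can be sketched numerically as follows. This is a hedged illustration rather than the authors' code: it assumes the discriminator scores for positive pairs (an ego-graph with its own center embedding) and negative pairs (an ego-graph with an embedding sampled from the marginal) have already been computed, and the function names `softplus` and `egi_jsd_loss` are hypothetical.

```python
import numpy as np

def softplus(t):
    """Numerically stable sp(t) = log(1 + exp(t))."""
    return np.logaddexp(0.0, t)

def egi_jsd_loss(pos_scores, neg_scores):
    """Negative Jensen-Shannon MI estimate, following the form of Eq. 1.

    pos_scores : scores T(g_i, Psi(g_i, x_i)) for matched pairs
    neg_scores : scores T(g_i, Psi(g_i', x_i')) for mismatched pairs
    """
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    # E_{P x U~}[sp(T_neg)] - E_P[-sp(-T_pos)]
    return softplus(neg).mean() - (-softplus(-pos)).mean()
```

The loss decreases when positive pairs receive high scores and negative pairs receive low scores, which is the behavior the discriminator is trained toward.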
More details about the model are in Appendix §B and source code in the Supplementary Materials.

Connection with existing work. To provide more insights into the EGI objective, we also present it as a dual problem of ego-graph reconstruction. Recall our definition of ego-graph mutual information $I(g_i, \Psi(g_i, x_i))$. It can be related to an ego-graph reconstruction loss $R(g_i \mid \Psi(g_i, x_i))$ as

$$\max I(g_i, \Psi(g_i, x_i)) = H(g_i) - H(g_i \mid \Psi(g_i, x_i)) \le H(g_i) - R(g_i \mid \Psi(g_i, x_i)). \tag{4}$$

When EGI is maximizing the mutual information, it simultaneously minimizes the upper error bound of reconstructing an ego-graph $g_i$. In this view, the key difference between EGI and GVAE (Kipf & Welling, 2016) is that GVAE assumes each edge in a graph to be observed independently during the reconstruction, while EGI assumes the edges in an ego-graph to be observed jointly. Moreover, existing mutual information based GNNs such as DGI (Velickovic et al., 2019) and GMI (Peng et al., 2020) explicitly measure the mutual information between node features $x$ and GNN output $\Psi$. In this way, they tend to capture node features instead of graph structures, which we deem more essential in graph transfer learning as discussed in §2.2.

Supportive observations. In the first three columns of Table 1, in both cases of transferring GNNs between similar graphs (F-F) and dissimilar graphs (B-F), EGI significantly outperforms all competitors when using node degree one-hot encoding as transferable node features. In particular, the performance gains over the untrained GIN and GCN show the effectiveness of training and transferring, and our gains are always larger than those of the two state-of-the-art unsupervised GNNs. Such results clearly indicate the advantageous structure-preserving capability and transferability of EGI.

2.2. TRANSFERABILITY ANALYSIS BASED ON LOCAL GRAPH LAPLACIANS

We now study the transferability of a GNN (in particular, EGI) between the source graph $G_a$ and target graph $G_b$ based on the graph similarity between $G_a$ and $G_b$. We first establish the requirement on node features, under which we then focus on analyzing the transferability of EGI w.r.t. the structural information of $G_a$ and $G_b$. Recall our view of the GNN output as a combination of its input node features, fixed graph Laplacian and learnable graph filters. The utility of a GNN is determined by the compatibility among the three. In order to fulfill such compatibility, we require the node features to be structure-respecting:

Definition 2.3 (Structure-respecting node features). Let $g_i$ be an ordered ego-graph centered on node $v_i$ with a set of node features $\{x^{i}_{p,q}\}_{p=0,q=1}^{k,|V_p(g_i)|}$, where $V_p(g_i)$ is the set of nodes in the $p$-th hop of $g_i$. Then we say the node features on $g_i$ are structure-respecting if $x^{i}_{p,q} = [f(g_i)]_{p,q} \in \mathbb{R}^{d}$ for any node $v_q \in V_p(g_i)$, where $f : \mathcal{G} \to \mathbb{R}^{d \times |V(g_i)|}$ is a function. In the strict case, $f$ should be injective.

In essence, Def. 2.3 requires the node features to be a function of the graph structures, which is sensitive to changes in the graph structures and, in an ideal case, injective with respect to them. In this way, when the learned graph filters of a transferred GNN are compatible with the structure of $G$, they are also compatible with the node features of $G$. As we will explain in Remark 2 of Theorem 2.1, this requirement is also essential for the analysis of our GNN transferability, which eventually only depends on the structural difference between two graphs. In practice, commonly used node features like node degrees, PageRank scores (Page et al., 1999), spectral embeddings (Chung & Graham, 1997), and many pre-computed unsupervised network embeddings (Perozzi et al., 2014; Tang et al., 2015; Grover & Leskovec, 2016) are all structure-respecting in nature.
However, other commonly used node features like random vectors (Yang et al., 2019) or uniform vectors (Xu et al., 2019) are not, and are thus non-transferable. When organic node attributes are available, they are transferable as long as the concept of homophily (McPherson et al., 2001) applies, which also implies Def. 2.3, but we do not have a rigorous analysis of this yet.

Supportive observations. In the fifth and sixth columns in Table 1, where we use uniform embedding as non-transferable node features to contrast with the first three columns, there is almost no or even negative transferability for all compared methods when non-transferable features are used, as the performance of trained GNNs is similar to or worse than that of their untrained baselines.

With our view of graphs and requirement on node features both established, we now derive the following theorem by characterizing the performance difference of EGI on two graphs based on Eq. 1.

Theorem 2.1 (GNN transferability). Let $G_a = \{(g_i, x_i)\}_{i=1}^{n}$ and $G_b = \{(g_{i'}, x_{i'})\}_{i'=1}^{m}$ be two graphs. Denote by $L_{g_i}$ the (normalised) graph Laplacian of $g_i$ $\forall i = 1, \dots, n$, and let the node features of $g_i$ be structure-respecting and normalized (similarly for $g_{i'}$). Consider a GNN $\Psi_\theta$ with $k$ layers and a 1-hop polynomial filter $\phi_\theta$. With reasonable assumptions on the local spectrum of $G_a$ and $G_b$, the empirical performance difference of $\Psi_\theta$ with $\phi_\theta$ evaluated on $\mathcal{L}_{\mathrm{EGI}}$ satisfies

$$|\mathcal{L}_{\mathrm{EGI}}(G_a) - \mathcal{L}_{\mathrm{EGI}}(G_b)| \le \mathcal{O}\left(M + \frac{1}{nm} \sum_{i=1}^{n} \sum_{i'=1}^{m} \left\|\lambda(L_{g_i}) - \lambda(L_{g_{i'}})\right\|_2\right), \tag{5}$$

where $M$ is a constant dependent on $k$, $\phi_\theta$, $\{L_{g_i}\}$, $\{L_{g_{i'}}\}$, $\{x_i\}$, $\{x_{i'}\}$, and $\lambda(L_{g_i})$ denotes the ordered eigenvalues of the graph Laplacian of $g_i \in G_a$ (similarly for $g_{i'}$).

Proof. The full proof is detailed in Appendix §A.

Remark 1. Our view of a graph $G$ as samples of k-hop ego-graphs is important, as it allows us to make a node-wise characterization of GNNs similarly to (Verma & Zhang, 2019).
It also allows us to set the depth of ego-graphs in the analysis to be the same as the number of GNN layers ($k$), since the GNN embedding of each node mostly depends on its k-hop ego-graph instead of the whole graph.

Remark 2. For Eq. 1, Def. 2.3 ensures that sampling a GNN embedding at a node always corresponds to sampling an ego-graph from $\mathcal{G}$, which reduces to uniformly sampling from $G = \{g_i\}_{i=1}^{n}$ under the setting of Theorem 2.1. Therefore, the requirement of Def. 2.3 in the context of Theorem 2.1 guarantees that the analysis depends only on the structural information of the graph.

In practice, the computation of eigenvalues on the small ego-graphs can be rather efficient (Arora et al., 2005), and we do not need to enumerate all pairs of ego-graphs. Suppose we need to sample $M$ pairs of k-hop ego-graphs to compare two large graphs, and the average size of the ego-graphs is $L$; then the overall complexity of computing Eq. 5 is $O(ML^2)$, where $M$ is often less than 1K and $L$ less than 50.
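The structural-difference term on the right-hand side of Eq. 5 can be approximated with a few lines of NumPy. The sketch below is illustrative only and makes assumptions the text does not specify: ego-graphs are passed as dense adjacency matrices, the symmetric normalized Laplacian is used, and spectra of different lengths are aligned by zero-padding the shorter one at the front so that both stay in ascending order. The names `laplacian_spectrum` and `structural_difference` are hypothetical.

```python
import numpy as np

def laplacian_spectrum(A):
    """Ascending eigenvalues of the normalized Laplacian of one ego-graph."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    L = np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    return np.sort(np.linalg.eigvalsh(L))

def structural_difference(egos_a, egos_b):
    """Average L2 distance between ego-graph spectra of two graphs.

    egos_a, egos_b : lists of adjacency matrices of (sampled) ego-graphs.
    Zero-padding to a common length is an assumption of this sketch.
    """
    spec_a = [laplacian_spectrum(A) for A in egos_a]
    spec_b = [laplacian_spectrum(A) for A in egos_b]
    total = 0.0
    for sa in spec_a:
        for sb in spec_b:
            size = max(len(sa), len(sb))
            pa = np.pad(sa, (size - len(sa), 0))  # pad zeros at the front
            pb = np.pad(sb, (size - len(sb), 0))
            total += np.linalg.norm(pa - pb)
    return total / (len(spec_a) * len(spec_b))
```

Passing only a sampled subset of ego-graphs from each graph corresponds to the sampling of pairs described above, and each eigendecomposition is cheap because the ego-graphs are small.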


Supportive observations. In Table 1, in the $d$ columns, we compute the average structural difference between two Forest-fire graphs ($d(F, F)$) and between Barabasi and Forest-fire graphs ($d(B, F)$), based on the RHS of Eq. 5. The results validate our usage of the two graph models to generate structurally different graphs, while also verifying our novel view of graphs and the way we characterize the structural information of graphs based on it. We further highlight in the ∆ columns the performance difference between the GNNs transferred from Forest-fire graphs and from Barabasi graphs to Forest-fire graphs. Since Forest-fire graphs are more similar to Forest-fire graphs than to Barabasi graphs (as verified in the $d$ columns), we expect ∆ to be positive and large, indicating more positive transfer between the more similar graphs. Indeed, the behaviors of EGI align well with the expectation, which indicates its well-understood transferability and the utility of our theoretical analysis.

Table 1: Synthetic experiments of identifying structural equivalent nodes. We randomly generate 40 graphs with the Forest-fire model (F) (Leskovec et al., 2005) and 40 graphs with the Barabasi model (B) (Albert & Barabási, 2002). The GNN models we use include the untrained encoders of GCN (Kipf & Welling, 2017) and GIN (Xu et al., 2019) with random parameters (baselines with only the neighborhood aggregation function), GVAE with GCN encoder (Kipf & Welling, 2016), DGI with GIN encoder (Velickovic et al., 2019), and EGI with GIN encoder. We train GVAE, DGI and EGI on one graph from either set (F and B), and test them on the rest of the Forest-fire graphs (F). More details about the results and dataset can be found in Appendix §C.1.

Table 1 column headers: Method; transferable features (F-F, B-F, ∆); non-transferable feature (F-F, B-F, ∆); structural difference (d(F,F), d(B,F)).

Protocols. By default, we use node degree one-hot encoding as the transferable feature across all different graphs. As stated before, other transferable features like spectral and other pre-computed node embeddings are also applicable. We focus on the setting where the downstream tasks on target graphs are unspecified but assumed to be structure-relevant, and thus pre-train the GNNs on source graphs in an unsupervised fashion. In terms of evaluation, we design two realistic experimental settings: (1) Direct-transferring on the more structure-relevant task of role identification without given node features, to directly evaluate the utility and transferability of EGI. (2) Few-shot learning on relation prediction with task-specific node features, to evaluate the generalization ability of EGI.
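The default transferable feature, node degree one-hot encoding, can be sketched as below. This is a minimal illustration, not the authors' preprocessing: the cap `max_degree`, which keeps the feature dimension identical across source and target graphs, and the function name `degree_one_hot` are assumptions.

```python
import numpy as np

def degree_one_hot(A, max_degree):
    """One-hot encode node degrees from a dense adjacency matrix.

    Degrees above `max_degree` share the last bucket (an assumption made
    here so that every graph yields features of the same dimension).
    """
    A = np.asarray(A)
    deg = np.minimum(A.sum(axis=1).astype(int), max_degree)
    X = np.zeros((A.shape[0], max_degree + 1))
    X[np.arange(A.shape[0]), deg] = 1.0
    return X
```

Because the degree is a function of the graph structure alone, features built this way are structure-respecting in the sense of Def. 2.3.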

3.1. DIRECT-TRANSFERRING ON ROLE IDENTIFICATION

First, we use role identification without node features in a direct-transferring setting as a reliable proxy to evaluate transfer learning performance regarding different pre-training objectives. A role in a network is defined as a set of nodes with similar structural behaviors, such as clique members, hubs and bridges (Henderson et al., 2012). Across graphs in the same domain, we assume the definition of role to be consistent, and the task of role identification is highly structure-relevant, which can directly reflect the transferability of different methods and allows us to conduct the analysis according to Theorem 2.1. Upon convergence of pre-training each model on the source graphs, we directly apply them on the target graphs and further train a multi-layer perceptron (MLP) upon their outputs. The GNN parameters are frozen during the MLP training. We refer to this strategy as direct-transferring since there is no fine-tuning of the models after transferring to the target graphs. We use two real-world network datasets with role-based node labels: (1) Airport (Ribeiro et al., 2017) contains three networks from different regions: Brazil, USA and Europe. Each node is an airport and each link is a flight between airports. The airports are assigned external labels based on their level of popularity. (2) Gene (Yang et al., 2019) contains the gene interactions regarding 50 different cancers. Each gene has a binary label indicating whether it is a transcription factor. The experimental setup on the Airport dataset closely resembles that of our synthetic experiments in Table 1, but with real data and more detailed comparisons. We train all models (except for the untrained ones) on the Europe network, and test them on all three networks. The results are presented in Table 2.
We notice that the node degree features themselves (with MLP) show reasonable performance in all three networks, which is not surprising since the popularity-based airport role labels are highly relevant to node degrees. The untrained GIN encoder yields a significant margin over both node degrees and the untrained vanilla GCN encoder, indicating the importance of proper aggregation mechanisms. While training of the GCN (through GVAE) and GIN (through DGI) can further improve the performance on the source graph, EGI shows the best performance there with the structure-respecting node degree features (59.15), corroborating the claimed effectiveness of EGI in capturing the essential graph information as we stress in §2. When transferring the models to the USA and Brazil networks, EGI further achieves the best performance compared with all baselines when node degree features are used (64.55 and 73.15), which reflects the most significant positive transfer. Interestingly, direct application of GVAE and DGI, without the consideration of essential graph information that we stress, leads to rather limited and even negative transferability (through comparison against the untrained GCN and GIN encoders). The recently proposed transfer learning frameworks for GNNs like Mask-GIN and Structural Pre-train are able to mitigate negative transfer to some extent, but their performance is still inferior to that of EGI. We believe this is because their models do not aim to capture the underlying ego-graph distributions that we deem important, so they are prone to learning graph-specific information that is less transferable across different graphs. Similarly to Table 1, we also compute the structural difference among the three networks w.r.t. the RHS of Eq. 5. The structural difference is 12.03 between the Europe and USA networks, and 12.14 between the Europe and Brazil networks, which are very close.
Consequently, the transferability of EGI, measured as its performance gain over the untrained GIN baseline, is 4.8% on the USA network and 4.4% on the Brazil network, which are also very close. Such observations once again align well with our conclusion in Theorem 2.1 that the transferability of EGI is closely related to the structural difference between source and target graphs. On the Gene dataset, with more graphs available, we focus on EGI to further analyze the utility of Eq. 5 in Theorem 2.1, regarding the connection between the structural difference of two graphs and the performance gap of EGI on them. As shown in Figure 2, we train EGI on one graph and test it on six different graphs. The x-axis shows the structural difference measured w.r.t. the RHS of Eq. 5, and the y-axis shows the performance loss compared with an untrained GIN. The two quantities are clearly positively correlated. Specifically, when the structural difference is small, positive transfer is observed, as the performance of the transferred EGI is better than that of the untrained GIN; when the structural difference becomes large, negative transfer is observed. Note that, at its current stage, Eq. 5 in Theorem 2.1 mainly gives a relative indication of the transferability of EGI, because the absolute values of structural difference may vary a lot across different datasets.

3.2. FEW-SHOT LEARNING ON RELATION PREDICTION

Here we evaluate EGI in the more generalized and practical setting of few-shot learning on the less structure-relevant task of relation prediction, with task-specific node features and fine-tuning. The source graph contains a cleaned full dump of 579K entities from YAGO (Suchanek et al., 2007) , and we investigate 20-shot relation prediction on a target graph with 24 relation types, which is a sub-graph of 115K entities sampled from the same dump. In post-fine-tuning, the models are pre-trained with an unsupervised loss on the source graph and fine-tuned with the task-specific loss on the target graph. In joint-fine-tuning, the same pre-trained models are jointly optimized w.r.t. the unsupervised pre-training loss and task-specific fine-tuning loss on the target graph. In Table 3 , we observe most of the existing models fail to transfer across pre-training and fine-tuning tasks, especially in the joint-fine-tuning setting. In particular, both Mask-GIN and ContextPred-GIN rely a lot on task-specific fine-tuning, while EGI focuses on the capturing of similar ego-graph structures that are transferable across graphs. As a consequence, EGI significantly outperforms all compared methods in both settings. 

4. RELATED WORK

Representation learning on graphs has been studied for decades, with earlier spectral-based methods (Belkin & Niyogi, 2002; Roweis & Saul, 2000; Tenenbaum et al., 2000) theoretically grounded but hardly scaling up to graphs with over a thousand nodes. With the emergence of neural networks, unsupervised network embedding methods based on the Skip-gram objective (Mikolov et al., 2013) have reinvigorated the field (Tang et al., 2015; Grover & Leskovec, 2016; Perozzi et al., 2014; Ribeiro et al., 2017). Equipped with efficient structural sampling (random walk, neighborhood, etc.) and negative sampling schemes, these methods are easily parallelizable and scalable to graphs with thousands to millions of nodes. However, these models are essentially transductive, as they compute fully parameterized embeddings only for nodes seen during training, and therefore cannot be transferred to unseen graphs. More recently, researchers have introduced the family of graph neural networks (GNNs), which are capable of inductive learning and of generalizing to unseen nodes given meaningful node features (Kipf & Welling, 2017; Defferrard et al., 2016; Hamilton et al., 2017). Yet, most existing GNNs require task-specific labels for training in a semi-supervised fashion to achieve satisfactory performance (Kipf & Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Chen et al., 2018), and their usage is limited to single graphs where the downstream task is fixed. To this end, several unsupervised GNNs have been presented, such as the auto-encoder-based ones like GVAE (Kipf & Welling, 2016) and GNFs (Liu et al., 2019), as well as the deep-infomax-based ones like DGI (Velickovic et al., 2019) and InfoGraph (Sun et al., 2019). Their potential in the transfer learning of GNNs remains unclear when the node features and link structures vary across different graphs.
Although the architectures of GNNs are not very complicated, training a dedicated model for each graph can still be cumbersome (Chen et al., 2018; Ying et al., 2018a). Moreover, as pre-training neural networks has proven successful in other domains (Devlin et al., 2019; He et al., 2016), it is intriguing to transfer well-trained GNNs from relevant source graphs to improve the modeling of target graphs or to enable few-shot learning (Vinyals et al., 2016; Finn et al., 2017; Ravi & Larochelle, 2017) when labeled data are scarce. In light of this, pioneering works have studied both generative (Hu et al., 2020) and discriminative (Hu et al., 2019a,b) GNN pre-training schemes. Among these works, though Graph Contrastive Coding (Qiu et al., 2020) shares a structural view similar to ours, it utilizes contrastive learning in the embedding space instead of the structural space as EGI does. Unsupervised domain adaptive GCNs (Wu et al., 2020) study the domain adaptation problem where the source and target tasks are homogeneous. Previous pre-training and self-supervised GNNs lack a rigorous analysis of their transferability and thus have unpredictable effectiveness.

5. CONCLUSION

To the best of our knowledge, this is the first research effort towards establishing a theoretically grounded framework to analyze GNN transferability, which we also demonstrate to be practically useful for guiding the design and conduct of transfer learning with GNNs. For future work, it is intriguing to further strengthen the bound with relaxed assumptions, rigorously extend it to the more complicated and less restricted settings regarding node features and downstream tasks, as well as analyze and improve the proposed framework over more transfer learning scenarios and datasets.

A THEORY DETAILS

From the $\mathcal{L}_{\mathrm{EGI}}$ objective, we have assumed $g_i \overset{\text{i.i.d.}}{\sim} \mu$, $x_i \overset{\text{i.i.d.}}{\sim} \nu$, and $(g_i, x_i) \overset{\text{i.i.d.}}{\sim} p$. Then with graph $G$, we have access to the empirical distributions of the three. So the sampling reduces to bootstrapping in the procedure of evaluating the objective. Note that, in Eq. 2 of the main paper, we used a $d$-dimensional hidden state $h^{q}_{p,q}$, specified in Eq. 13, to denote an edge encoding derived from the structure of the ego-graph and the associated source node feature from the $(p-1)$-th layer. For simplicity, we consider the concatenated vector $[f(x_i) \,\|\, z_i]$, where $f(x_i) = [h^{q}_{p,q} \,\|\, x^{i}_{p,q}]$ and $h^{q}_{p,q}$, $x^{i}_{p,q}$ are as defined in the EGI model and in Eq. 13. Additionally, since both $h^{q}_{p,q}$ and $x^{i}_{p,q}$ are normalised, $f$ is bounded. Finally, as we are considering a GNN with $k$ layers, its computation only depends on the k-hop ego-graphs of $G$, which is an important consideration when unfolding the embedding of the GNN at a centre node.

A.1 PROOF FOR THEOREM 2.1

Lemma A.1. For any $A \in \mathbb{R}^{m \times n}$, where $m \ge n$, and $A$ is a submatrix of $B \in \mathbb{R}^{m' \times n}$, where $m < m'$, we have $\|A\|_2 \le \|B\|_2$.

Proof. Note that $AA^{T}$ is a principal submatrix of $BB^{T}$, i.e., $AA^{T}$ is obtained by removing the same set of rows and columns from $BB^{T}$. Then, by the Eigenvalue Interlacing Theorem (Hwang, 2004) and the fact that $A^{T}A$ and $AA^{T}$ have the same set of non-zero eigenvalues, the matrix operator norm satisfies $\|A\|_2 = \sqrt{\lambda_{\max}(A^{T}A)} = \sqrt{\lambda_{\max}(AA^{T})} \le \sqrt{\lambda_{\max}(BB^{T})} = \|B\|_2$.

We restate Theorem 2.1 from the main paper below.

Theorem A.2 (GNN transferability). Let $G_a = \{(g_i, x_i)\}_{i=1}^{n}$ and $G_b = \{(g_{i'}, x_{i'})\}_{i'=1}^{m}$ be two graphs. Denote by $L_{g_i}$ the (normalised) graph Laplacian of $g_i$ $\forall i = 1, \dots, n$, and let the node features of $g_i$ be structure-respecting and normalized (similarly for $g_{i'}$).
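Consider a GNN $\Psi_\theta$ with $k$ layers and a 1-hop polynomial filter $\phi_\theta$. The empirical performance difference of $\Psi_\theta$ with $\phi_\theta$ evaluated on $\mathcal{L}_{EGI}$ satisfies

$$|\mathcal{L}_{EGI}(G_a) - \mathcal{L}_{EGI}(G_b)| \le O\Big(M + \frac{1}{nm}\sum_{i=1}^{n}\sum_{i'=1}^{m} \lambda_{\max}(L_{g_i} - L_{g_{i'}})^{1/2}\Big),$$

where $M$ is a constant dependent on $k$, $\phi_\theta$, $\{L_{g_i}\}$, $\{L_{g_{i'}}\}$, $\{x_i\}$, $\{x_{i'}\}$. In addition, if $\exists U \in O(n \vee m)$, the orthogonal group of order $n \vee m$, s.t. $U L_{g_i} U^T = \mathrm{Diag}(\lambda(L_{g_i}))$ and $U L_{g_{i'}} U^T = \mathrm{Diag}(\lambda(L_{g_{i'}}))$, we have

$$O\Big(M + \frac{1}{nm}\sum_{i=1}^{n}\sum_{i'=1}^{m} \|\lambda(L_{g_i}) - \lambda(L_{g_{i'}})\|_2\Big),$$

where $\lambda(L_{g_i})$ denotes the ordered eigenvalues of the graph Laplacian of $g_i \in G_a$ (similarly for $g_{i'}$).

Proof. We denote by $\sigma_s(t) = \log(1 + e^t)$ the softplus activation function, which is 1-Lipschitz continuous. Now,

$$\begin{aligned}
|\mathcal{L}_{EGI}(G) - \mathcal{L}_{EGI}(G')| &= \Big| \frac{1}{n^2}\sum_{i,j=1}^{n} \sigma_s(D(g_i, z_j)) - \frac{1}{n}\sum_{i=1}^{n} \big(-\sigma_s(-D(g_i, z_i))\big) \\
&\qquad - \Big( \frac{1}{m^2}\sum_{i',j'=1}^{m} \sigma_s(D(g_{i'}, z_{j'})) - \frac{1}{m}\sum_{i'=1}^{m} \big(-\sigma_s(-D(g_{i'}, z_{i'}))\big) \Big) \Big| \\
&\le \frac{1}{n^2m^2}\sum_{i,j=1}^{n}\sum_{i',j'=1}^{m} |D(g_i, z_j) - D(g_{i'}, z_{j'})| + \frac{1}{nm}\sum_{i=1}^{n}\sum_{i'=1}^{m} |D(g_i, z_i) - D(g_{i'}, z_{i'})| \\
&= \frac{1}{n^2m^2}\sum_{i,j=1}^{n}\sum_{i',j'=1}^{m} A + \frac{1}{nm}\sum_{i=1}^{n}\sum_{i'=1}^{m} B.
\end{aligned}$$

First we consider $B$. Recall that $V_p(g_i)$ is the set of nodes in layer $p$ of $g_i$, and

$$D(g_i, z_i) = \sum_{p=1}^{k}\sum_{q=1}^{|V_p(g_i)|} \log\Big(\sigma_{sig}\big(U^T \tau(W^T[f(x_i) \,\|\, z_i])\big)\Big),$$

where $\sigma_{sig}(t) = \frac{1}{1+e^{-t}}$ is the sigmoid function, $\tau$ is some $\gamma_\tau$-Lipschitz activation function and $[\cdot \,\|\, \cdot]$ denotes the concatenation of two vectors. Then we have $U^T \tau(W^T[f(x_i) \,\|\, z_i]) = U^T \tau(W_1^T f(x_i) + W_2^T z_i)$. WLOG, assume $d_p = |V_p(g_i)| = |V_p(g_{i'})|$ for all $p = 1, \dots, k$. In addition, since $\log(\sigma_{sig}(t)) = -\log(1 + e^{-t}) = -\sigma_s(-t)$, which is 1-Lipschitz, it gives

$$\begin{aligned}
B &\le \sum_{p=1}^{k}\sum_{q=1}^{d_p} \big| U^T \tau(W_1^T f(x_i) + W_2^T z_i) - U^T \tau(W_1^T f(x_{i'}) + W_2^T z_{i'}) \big| \\
&\le \gamma_\tau s_U \sum_{p=1}^{k}\sum_{q=1}^{d_p} \big( \|W_1^T f(x_i) - W_1^T f(x_{i'})\|_2 + \|W_2^T z_i - W_2^T z_{i'}\|_2 \big) \\
&\le \gamma_\tau s_U s_W \sum_{p=1}^{k}\sum_{q=1}^{d_p} \big( \|f(x_i) - f(x_{i'})\|_2 + \|z_i - z_{i'}\|_2 \big), \qquad (7)
\end{aligned}$$

where $s_U$ is the largest singular value of $U$, and similarly $s_W = s_{W_1} \vee s_{W_2}$. Since we assumed the node features are normalised, $\|f(x_i) - f(x_{i'})\|_2 \le c_D$. From Eq.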
7, we only care about $x_i$'s embedding obtained from a $k$-layer GNN with a 1-hop polynomial (linear in $L$) filter. Inspired by the characterization of GNNs from a node-wise view in Verma & Zhang (2019), we similarly denote the embedding of node $x_i$, for all $i = 1, \dots, n$, in the final layer of the GNN as

$$z_i^k = z_i = \Psi_\theta(x_i) = \sigma\Big(\sum_{j \in N(x_i)} e_{\cdot j}\, z_j^{k-1}\Big) \in \mathbb{R}^d,$$

where $e_{\cdot j} = [\phi_\theta(L)]_{\cdot j} \in \mathbb{R}$. We may denote $z_i^{\ell} \in \mathbb{R}^d$ similarly for $\ell = 1, \dots, k-1$, and $z_i^0 = x_i \in S^{d-1}$ the node feature of node $x_i$. With the assumption on the GNN in the statement, it is clear that only the $k$-hop ego-graph $g_i$ centered at $x_i$ is needed to compute $z_i^k$ for any $i = 1, \dots, n$, instead of the whole of $G$.

With this observation in mind, let us denote the matrix of node embeddings of $g_i$ at the $\ell$-th layer as $(z^{i(\ell)}_{p,q}) \in \mathbb{R}^{|V(g_i)| \times d}$, for $\ell = 1, \dots, k$; and let $(z^{i(0)}_{p,q}) \equiv (x^i_{p,q}) \in (S^{d-1})^{|V(g_i)|}$ denote the matrix of node features in the $k$-hop ego-graph $g_i$. In addition, we denote by $(z^{i(\ell)}_{p,q})_{p \le t}$ the submatrix obtained by selecting the rows that correspond to $v \in V_p(g_i)$ for $p = 0, \dots, t \le k$. Similarly for $g_{i'}$. Moreover, let us denote $\phi_\theta(L_{g_i}) \equiv [\phi_\theta(L)]_{g_i}$, i.e., the filtered full graph Laplacian of $G$ restricted to the $k$-hop ego-graph $g_i$. Then let $\phi_\theta(L_{g_i})_{p \le t}$ denote the submatrix obtained by selecting the rows and columns that correspond to $v \in V_p(g_i)$ for $p = 0, \dots, t \le k$. Similarly for $g_{i'}$. Therefore, by Lemma A.1, for any $\ell = 1, \dots, k$, the following holds:

$$\big\|(z^{i(\ell)}_{p,q})_{p \le t} - (z^{i'(\ell)}_{p,q})_{p \le t}\big\|_2 \le \big\|(z^{i(\ell)}_{p,q})_{p \le t+1} - (z^{i'(\ell)}_{p,q})_{p \le t+1}\big\|_2.$$

Assume $\|(z^{i'(\ell-1)}_{p,q})\|_2 \le c_z < \infty$ for all $\ell$.
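Now, at the final layer,

$$\begin{aligned}
\|z_i - z_{i'}\|_2 &= \big\|(z^{i(k)}_{p,q})_{p=0} - (z^{i'(k)}_{p,q})_{p=0}\big\|_2 \\
&\le \big\|\big[\sigma\big(\phi_\theta(L_{g_i})_{p \le 1}(z^{i(k-1)}_{p,q})_{p \le 1}\big) - \sigma\big(\phi_\theta(L_{g_{i'}})_{p \le 1}(z^{i'(k-1)}_{p,q})_{p \le 1}\big)\big]_{p=0}\big\|_2 \\
&\le \gamma_\sigma \big\|\phi_\theta(L_{g_i})_{p \le 1}(z^{i(k-1)}_{p,q})_{p \le 1} - \phi_\theta(L_{g_{i'}})_{p \le 1}(z^{i'(k-1)}_{p,q})_{p \le 1}\big\|_2 \\
&\le \gamma_\sigma \|\phi_\theta(L_{g_i})_{p \le 1}\|_2 \big\|(z^{i(k-1)}_{p,q})_{p \le 1} - (z^{i'(k-1)}_{p,q})_{p \le 1}\big\|_2 \\
&\quad + \gamma_\sigma \|(z^{i'(k-1)}_{p,q})_{p \le 1}\|_2 \big\|\phi_\theta(L_{g_i})_{p \le 1} - \phi_\theta(L_{g_{i'}})_{p \le 1}\big\|_2 \\
&\le \gamma_\sigma \|\phi_\theta(L_{g_i})\|_2 \big\|(z^{i(k-1)}_{p,q})_{p \le 1} - (z^{i'(k-1)}_{p,q})_{p \le 1}\big\|_2 + \gamma_\sigma c_z \|\phi_\theta(L_{g_i}) - \phi_\theta(L_{g_{i'}})\|_2. \qquad (8)
\end{aligned}$$

In general, for $\ell = 1, \dots, k-1$, the following holds with $t = k - \ell$:

$$\begin{aligned}
\big\|(z^{i(\ell)}_{p,q})_{p \le t} - (z^{i'(\ell)}_{p,q})_{p \le t}\big\|_2 &\le \gamma_\sigma \big\|\phi_\theta(L_{g_i})_{p \le t+1}(z^{i(\ell-1)}_{p,q})_{p \le t+1} - \phi_\theta(L_{g_{i'}})_{p \le t+1}(z^{i'(\ell-1)}_{p,q})_{p \le t+1}\big\|_2 \\
&\le \gamma_\sigma \|\phi_\theta(L_{g_i})\|_2 \big\|(z^{i(\ell-1)}_{p,q})_{p \le t+1} - (z^{i'(\ell-1)}_{p,q})_{p \le t+1}\big\|_2 + \gamma_\sigma c_z \|\phi_\theta(L_{g_i}) - \phi_\theta(L_{g_{i'}})\|_2. \qquad (9)
\end{aligned}$$

Then we equivalently write Eq. 9 as $E_\ell \le b E_{\ell-1} + a$, which gives $E_\ell \le b^{\ell} E_1 + \frac{b^{\ell} + 1}{b - 1} a$. Then, with $(x^i_{p,q}) = (z^{i(0)}_{p,q})$, we see that the following depends only on the structure of $g_i$ and $g_{i'}$:

$$\big\|(z^{i(\ell)}_{p,q}) - (z^{i'(\ell)}_{p,q})\big\|_2 \le \gamma_\sigma^{\ell} \|\phi_\theta(L_{g_i})\|_2^{\ell} \big\|(x^i_{p,q}) - (x^{i'}_{p,q})\big\|_2 + \frac{\gamma_\sigma^{\ell} \|\phi_\theta(L_{g_i})\|_2^{\ell} + 1}{\gamma_\sigma \|\phi_\theta(L_{g_i})\|_2 - 1}\, \gamma_\sigma c_z \|\phi_\theta(L_{g_i}) - \phi_\theta(L_{g_{i'}})\|_2.$$

Since the features are normalised, and so are the graph Laplacians, we have $\|\phi_\theta(L_{g_i})\|_2 \le c_L$ and $\|(x^i_{p,q}) - (x^{i'}_{p,q})\|_2 \le c_x$. Then with Eq. 8, we have

$$\begin{aligned}
\|z_i - z_{i'}\|_2 &\le \gamma_\sigma^{k} c_L^{k} c_x + \frac{\gamma_\sigma^{k} c_L^{k} + 1}{\gamma_\sigma c_L - 1}\, \gamma_\sigma \gamma_\theta c_z \|L_{g_i} - L_{g_{i'}}\|_2 \\
&\le c_{\gamma,\Psi}\big(M + \|L_{g_i} - L_{g_{i'}}\|_2\big) = c_{\gamma,\Psi}\big(M + \lambda_{\max}(L_{g_i} - L_{g_{i'}})^{1/2}\big).
\end{aligned}$$

Now, by Eq. 7, we have

$$B \le k\, d_{\max} \gamma_\tau s \big(c_D + c_{\gamma,\Psi}(M + \lambda_{\max}(L_{g_i} - L_{g_{i'}})^{1/2})\big),$$

where $d_{\max} = \max_p d_p$. Similarly, the above holds for $A$, since from Eq. 8, the node features and embedded features are bounded by separate terms. We therefore arrive at

$$\begin{aligned}
|\mathcal{L}_{EGI}(G) - \mathcal{L}_{EGI}(G')| &\le 2k\, d_{\max} \gamma_\tau c_{\gamma,\Psi}\, s \Big(M + \frac{1}{nm}\sum_{i=1}^{n}\sum_{i'=1}^{m} \lambda_{\max}(L_{g_i} - L_{g_{i'}})^{1/2}\Big) \\
&\le 2k\, d_{\max} \gamma_\tau c_{\gamma,\Psi}\, s \Big(M + \frac{1}{nm}\sum_{i=1}^{n}\sum_{i'=1}^{m} \|L_{g_i} - L_{g_{i'}}\|_F\Big).
\end{aligned}$$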
Moreover, by Von Neumann's trace inequality (Grigorieff, 1991), if $\exists U \in O(\beta)$ (the orthogonal group), where $\beta = \sum_{p=0}^{k} d_p$, s.t. $U L_{g_i} U^T = \mathrm{Diag}(\lambda(L_{g_i}))$ and $U L_{g_{i'}} U^T = \mathrm{Diag}(\lambda(L_{g_{i'}}))$, we have $\|L_{g_i} - L_{g_{i'}}\|_F = \|\lambda(L_{g_i}) - \lambda(L_{g_{i'}})\|_2$, and then Eq. 11 $\le c_{\gamma,\Psi}\big(M + \|\lambda(L_{g_i}) - \lambda(L_{g_{i'}})\|_2\big)$. Therefore Eq. 11 becomes

$$|\mathcal{L}_{EGI}(G) - \mathcal{L}_{EGI}(G')| \le 2k\, d_{\max} \gamma_\tau c_{\gamma,\Psi}\, s \Big(M + \frac{1}{nm}\sum_{i=1}^{n}\sum_{i'=1}^{m} \|\lambda(L_{g_i}) - \lambda(L_{g_{i'}})\|_2\Big).$$

Note that our view of structural information is closely related to graph kernels (Bai & Hancock, 2016) and graph perturbation (Verma & Zhang, 2019). Specifically, our Def 2.1 is motivated by the concept of the $k$-layer expansion sub-graph in Bai & Hancock (2016). However, Bai & Hancock (2016) used the Jensen-Shannon divergence between pairwise representations of sub-graphs to define a depth-based sub-graph kernel, while we depict $G$ as samples of its ego-graphs. In this sense, our view is related to the setup in Verma & Zhang (2019), which derived a uniform algorithmic stability bound of a 1-layer GNN under 1-hop structure perturbation of $G$. In the setting of domain adaptation, Ben-David et al. (2007) draw a connection between the difference in the distributions of source and target domains and the model transferability, and learn a transferable model by minimizing such distribution differences. This coincides with our approach of connecting the structure difference of two graphs, in terms of $k$-hop subgraph distributions, to the transferability of GNNs in the above theory.
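To make the structural-difference term in the bound more concrete, the following minimal sketch computes the average pairwise Frobenius distance $\frac{1}{nm}\sum_{i}\sum_{i'} \|L_{g_i} - L_{g_{i'}}\|_F$ between the normalised Laplacians of $k$-hop ego-graphs, which is the quantity appearing in the last inequality of the proof. Several choices here are assumptions made only for illustration and are not taken from the paper: graphs are given as adjacency dictionaries, ego-graph nodes are ordered by breadth-first search, Laplacians of different sizes are aligned by zero padding, and all function names are hypothetical.

```python
import math
from collections import deque


def ego_nodes(adj, center, k):
    """Nodes within k hops of `center`, in BFS order (center first)."""
    dist = {center: 0}
    order = [center]
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue
        for v in sorted(adj[u]):
            if v not in dist:
                dist[v] = dist[u] + 1
                order.append(v)
                queue.append(v)
    return order


def normalized_laplacian(adj, nodes):
    """L = I - D^{-1/2} A D^{-1/2} of the subgraph induced by `nodes`."""
    idx = {v: a for a, v in enumerate(nodes)}
    n = len(nodes)
    deg = [sum(1 for w in adj[v] if w in idx) for v in nodes]
    lap = [[0.0] * n for _ in range(n)]
    for a, v in enumerate(nodes):
        if deg[a] > 0:
            lap[a][a] = 1.0
        for w in adj[v]:
            if w in idx:
                b = idx[w]
                lap[a][b] = -1.0 / math.sqrt(deg[a] * deg[b])
    return lap


def frobenius_distance(lap_a, lap_b):
    """Frobenius norm of the difference, zero-padding the smaller matrix (assumption)."""
    n = max(len(lap_a), len(lap_b))
    total = 0.0
    for r in range(n):
        for c in range(n):
            x = lap_a[r][c] if r < len(lap_a) and c < len(lap_a) else 0.0
            y = lap_b[r][c] if r < len(lap_b) and c < len(lap_b) else 0.0
            total += (x - y) ** 2
    return math.sqrt(total)


def structural_difference(adj_a, adj_b, k=2):
    """Average pairwise Laplacian distance between the ego-graphs of two graphs."""
    laps_a = [normalized_laplacian(adj_a, ego_nodes(adj_a, v, k)) for v in adj_a]
    laps_b = [normalized_laplacian(adj_b, ego_nodes(adj_b, v, k)) for v in adj_b]
    total = sum(frobenius_distance(la, lb) for la in laps_a for lb in laps_b)
    return total / (len(laps_a) * len(laps_b))


# Toy graphs: a 6-cycle and a star with 5 leaves.
cycle = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
star = {0: {1, 2, 3, 4, 5}, **{i: {0} for i in range(1, 6)}}
```

Because the Frobenius distance depends on how nodes are aligned, this sketch is only a rough proxy; under the simultaneous-diagonalisation condition stated above, it coincides with the eigenvalue distance $\|\lambda(L_{g_i}) - \lambda(L_{g_{i'}})\|_2$.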

B MODEL DETAILS

Following the notation used in the paper, EGI consists of a GNN encoder $\Psi$ and a GNN discriminator $D$. In general, the encoder $\Psi$ and the discriminator $D$ can be any existing GNN models. For each ego-graph and its node features $\{g_i, x_i\}$, the GNN encoder returns the node embedding $z_i$ of the center node $v_i$. As mentioned in Eq. 2 in the main paper, the GNN discriminator $D$ makes edge-level predictions as follows,

$$D(e_{\tilde v v} \mid h^{\tilde q}_{p,q}, x^i_{p,q}, z_i) = \sigma\Big(U^T \cdot \tau\big(W^T [h^{\tilde q}_{p,q} \,\|\, x^i_{p,q} \,\|\, z_i]\big)\Big), \qquad (12)$$

where $e_{\tilde v v} \in E(g_i)$ and $h^{\tilde q}_{p,q} \in \mathbb{R}^d$ is the representation of the edge $e_{\tilde v v}$ between node $v_{p-1,\tilde q}$ in hop $p-1$ and node $v_{p,q}$ in hop $p$. Specifically, we denote the source node at hop $p-1$ as $\tilde q \in \tilde Q_{p,q}$, with $\tilde Q_{p,q} = \{\tilde q : v_{p-1,\tilde q} \in V_{p-1}(g_i),\ e_{(p-1,\tilde q)(p,q)} \in E(g_i)\}$. Hence, the edge prediction relies on the combination of the center node embedding $z_i$, the destination node feature $x^i_{p,q}$ and the edge message $h^{\tilde q}_{p,q}$.

The discriminator computes a hidden representation $m_{p-1,\tilde q}$ for node $v_{p-1,\tilde q}$ at each hop. The edge message $h^{\tilde q}_{p,q}$ is calculated between the source node's hidden representation $m_{p-1,\tilde q}$ and the destination node features $x^i_{p,q}$:

$$h^{\tilde q}_{p,q} = \mathrm{ReLU}\big(W_p^T (m_{p-1,\tilde q} + x^i_{p,q})\big), \qquad m_{p-1,\tilde q} = \frac{1}{|\tilde Q_{p-1,\tilde q}|} \sum_{q' \in \tilde Q_{p-1,\tilde q}} h^{q'}_{p-1,\tilde q}. \qquad (13)$$

When $p = 1$, every edge originates from the center node $v_i$, and $m_{0,\tilde q}$ is the center node feature $x_{v_i}$. In every batch, we sample a set of ego-graphs and their node features $\{g_i, x_i\}$. During the forward pass of the encoder $\Psi$, it aggregates from neighbor nodes to the center node $v_i$. Then, the discriminator calculates the edge embedding in Eq. 12 from the center node $v_i$ to its neighbors and makes edge-level predictions (fake or true). The training framework of EGI is depicted in Figure 3 and Algorithm 1.

We implement our method and all of the baselines using the same encoders $\Psi$: 2-layer GIN (Xu et al., 2019) for the synthetic and role identification experiments, and 2-layer GraphSAGE (Hamilton et al., 2017) for the relation prediction experiments. We set the hidden dimension to 32 for both the synthetic and role identification experiments. For the relation prediction fine-tuning task, we set the hidden dimension to 256. We train EGI in a mini-batch fashion, since all the information needed by the encoder and the discriminator is within the $k$-hop ego-graph $g_i$ and its features $x_i$. Further, we conduct neighborhood sampling and set the maximum number of neighbors to 10 to speed up the parallel training. The space and time complexity of EGI is $O(BN^k)$, where $B$ is the batch size, $N$ is the number of neighbors and $k$ is the number of hops of the ego-graphs. Notice that both the encoder $\Psi$ and the discriminator $D$ propagate messages on the $k$-hop ego-graphs, so the extra computation cost of $D$ compared with a common GNN module is a constant multiplier over the original one. The scalability of EGI on the million-scale YAGO network is reported in Section C.3.
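The mini-batch procedure above relies on extracting a $k$-hop ego-graph around each center node with a cap on the number of sampled neighbors. The following is a minimal sketch of one way such sampling could look, not the authors' implementation: the adjacency-dictionary input, the function name `sample_ego_graph`, and the seeded random generator are assumptions made for illustration. It returns the hop layers $V_0(g_i), \dots, V_k(g_i)$ and the hop-wise edges from hop $p-1$ to hop $p$, i.e., the edges on which an edge-level discriminator would make predictions.

```python
import random


def sample_ego_graph(adj, center, k, max_neighbors=10, rng=None):
    """Sample a k-hop ego-graph around `center`.

    Returns (hops, edges): hops[p] lists the nodes first reached at hop p,
    and edges holds triples (p, source at hop p-1, destination at hop p).
    """
    rng = rng or random.Random(0)
    hops = [[center]]
    edges = []
    visited = {center}
    for p in range(1, k + 1):
        layer, layer_set = [], set()
        for u in hops[p - 1]:
            # Only expand to nodes not seen in earlier hops.
            candidates = [v for v in sorted(adj[u]) if v not in visited]
            if len(candidates) > max_neighbors:
                candidates = rng.sample(candidates, max_neighbors)
            for v in candidates:
                if v not in layer_set:
                    layer_set.add(v)
                    layer.append(v)
                # A destination may have several sources at hop p-1.
                edges.append((p, u, v))
        visited.update(layer)
        hops.append(layer)
    return hops, edges


# Toy graphs: a path 0-1-2-3 and a star with 20 leaves.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
big_star = {0: set(range(1, 21)), **{i: {0} for i in range(1, 21)}}
```

With at most $N$ sampled neighbors per node, each ego-graph contains on the order of $N^k$ nodes, which is consistent with the $O(BN^k)$ cost stated above for a batch of $B$ ego-graphs.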

B.1 TRANSFER LEARNING SETTINGS

The goal of transfer learning is to train a model on one dataset or task and use it on another. In our graph learning setting, we focus on training the model on one graph and using it on another. In particular, we focus our study on the setting of direct-transfering, where the model learned on the source graph is directly applied on the target graph without fine-tuning. We study this setting because it allows us to directly measure the transferability of GNNs, which is not affected by the fine-tuning process on the target graph. In other words, the fine-tuning process introduces significant uncertainty to the analysis, because there is no guarantee on how much the fine-tuned GNN differs from the pre-trained one. Depending on the specific tasks and label distributions on the two graphs, the fine-tuned GNN might be quite similar to the pre-trained one, or it can be significantly different. It is then very hard to analyze how much the pre-trained GNN itself is able to help. Another reason is efficiency. The fine-tuning of GNNs requires the same environment set-up and computation resources as training GNNs from scratch, although it may eventually take less training time if pre-training is effective. It would be appealing if this whole process could be eliminated when performance can be guaranteed with direct-transfering.

Algorithm 1: Pseudo code for training EGI
Input: the GNN encoder $\Psi$ and the GNN discriminator $D$; $k$-hop ego-graphs and features $\{g_i, x_i\}$
/* EGI training starts */
while $\mathcal{L}_{EGI}$ has not converged do
    Sample $M$ ego-graphs $\{(g_1, x_1), \dots, (g_M, x_M)\}$ from the empirical distribution $P$ without replacement, and obtain their positive and negative node embeddings $z_i$, $z'_i$ through $\Psi$:
        $z_i = \Psi(g_i, x_i)$, $z'_i = \Psi(g_i, x'_i)$
    /* Initialize positive and negative expectation in Eq. 1 in the main paper */
    $E_{pos} = 0$, $E_{neg} = 0$
    for $p = 1$ to $k$ do
        /* Compute JSD on edges at each hop */
        for $e_{(p-1,\tilde q)(p,q)} \in E(g_i)$ do
            generate edge embedding $h^{\tilde q}_{p,q}$ in Eq. (13)
            $E_{pos} = E_{pos} + \sigma\big(U^T \cdot \tau(W^T [h^{\tilde q}_{p,q} \,\|\, x^i_{p,q} \,\|\, z_i])\big)$
            $E_{neg} = E_{neg} + \sigma\big(U^T \cdot \tau(W^T [h^{\tilde q}_{p,q} \,\|\, x^i_{p,q} \,\|\, z'_i])\big)$
        end
    end
    /* $\mathcal{L}_{EGI}$ is formed from $E_{pos}$ and $E_{neg}$ as in Eq. 1 in the main paper */
    $\theta_\Psi \xleftarrow{+} -\nabla_\Psi \mathcal{L}_{EGI}$, $\theta_D \xleftarrow{+} -\nabla_D \mathcal{L}_{EGI}$
end
In our experiments, we also study the setting of transfer learning with fine-tuning, particularly on the real-world large-scale YAGO graphs. Since we aim to study the general transferability of GNNs not bounded to specific tasks, we always pre-train GNNs with the unsupervised pre-training objective on source graphs. Then we enable two types of fine-tuning. The first one is post-fine-tuning ($\mathcal{L} = \mathcal{L}_s$), where the pre-trained GNNs are fine-tuned with the supervised task-specific objective $\mathcal{L}_s$ on the target graphs. The second one is joint-fine-tuning ($\mathcal{L} = \mathcal{L}_s + \mathcal{L}_u$), where pre-training is the same, but fine-tuning is done w.r.t. both the pre-training objective $\mathcal{L}_u$ and the task-specific objective $\mathcal{L}_s$ on the target graphs in a semi-supervised learning fashion. The unsupervised pre-training objective $\mathcal{L}_u$ of EGI is given in Algorithm 1, while those of the compared algorithms are as defined in their papers. The supervised fine-tuning objective $\mathcal{L}_s$ is the same as in the DistMult paper (Yang et al., 2014) for all algorithms.

C EXPERIMENT DETAILS

C.1 SYNTHETIC EXPERIMENTS

Data. As mentioned in the main paper, we use two traditional graph generation models for synthetic data generation: (1) the Barabási-Albert graph (Barabási & Albert, 1999) and (2) the forest-fire graph (Leskovec et al., 2005). We generate 40 graphs, each with 100 nodes, with each model. We control the parameters of the two models to generate graphs with different ego-graph distributions. Specifically, we set the number of attached edges to 2 for the Barabási-Albert model and set $p_{\text{forward}} = 0.4$, $p_{\text{backward}} = 0.3$ for the forest-fire model. In Figure 4a and 4b, we show example graphs from the two families in our datasets. They have the same size but different appearance, which motivates our study of the transferability gap in Table 1 in the main paper. The accuracy of this task is defined as the percentage of nearest neighbors of the target node in the embedding space that are structure-equivalent, i.e., #correct k-NN neighbors / #ground-truth equivalent nodes.

Results. The structural equivalence label is obtained by a 2-hop WL-test (Weisfeiler & Lehman, 1968) on the ego-graphs. If two nodes have the same 2-hop ego-graphs, they are assigned the same label. In the example of Figure 4c, the nodes labeled with the same number (e.g., 2, 4) have isomorphic 2-hop ego-graphs. Note that this task is exactly solvable when the node features and GNN architectures are powerful enough, as with GIN (Xu et al., 2019). In order to show the performance difference among different methods, we set the length of the one-hot node degree encoding to 3 (all nodes with degrees higher than 3 have the same encoding). Here, we present the performance comparison with different lengths of degree encodings ($d$) in Table 4. When the capacity of the initial node features is high ($d = 10$), the transfer learning gap diminishes between different methods and different graphs, because the structural equivalence problem can be exactly solved by neighborhood aggregations.
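The two ingredients of this synthetic task, the capped one-hot degree encoding and the structural-equivalence labels, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the experimental code: graphs are adjacency dictionaries, the label is approximated by two rounds of WL color refinement on the whole graph starting from uniform colors (instead of an explicit isomorphism test on extracted 2-hop ego-graphs), degree-0 nodes share the encoding of degree-1 nodes, and the function names are hypothetical.

```python
def degree_one_hot(adj, d=3):
    """One-hot degree encoding of length d; degrees >= d share the last slot."""
    encoding = {}
    for v, neighbors in adj.items():
        # Degrees 1, 2, ..., d-1 get their own slot; larger degrees are capped.
        slot = min(max(len(neighbors), 1), d) - 1
        vec = [0] * d
        vec[slot] = 1
        encoding[v] = vec
    return encoding


def wl_labels(adj, rounds=2):
    """Color refinement: nodes with equal colors after `rounds` share a label."""
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj
        }
        palette = {sig: c for c, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return colors


# Toy graphs: a path 0-1-2-3-4 and a star with 5 leaves.
path5 = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
star5 = {0: {1, 2, 3, 4, 5}, **{i: {0} for i in range(1, 6)}}
```

On the path, the two end nodes receive one label, their two neighbors another, and the middle node a third, which mirrors the idea that nodes with the same 2-hop neighborhood structure are treated as equivalent.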
However, when the information in the initial node features is limited, the advantage of EGI in learning and transfering the graph structural information is obvious. In Table 5, we also show the performance of different transferable and non-transferable features, i.e., node embedding (Perozzi et al., 2014) and random feature vectors. The observation is similar to that of Table 1 in the main paper: the transferable features can reflect the performance gap between similar and dissimilar graphs, while the non-transferable features cannot. In both Table 4 and 7 here as well as Table 1 in the main paper, we report the structural difference among graphs in the two sets (d), calculated w.r.t. the term $\frac{1}{nm}\sum_{i=1}^{n}\sum_{i'=1}^{m} \|\lambda(L_{g_i}) - \lambda(L_{g_{i'}})\|_2$ on the RHS of Theorem 2.1 in the main paper. This indicates that the Forest fire graphs are structurally similar to the other Forest fire graphs, while less similar to the Barabasi graphs, as can be verified from Figure 4a and 4b. Our bound in Theorem 3.1 then tells us that the GNNs (in particular, EGI) should be more transferable in the F-F case than in the B-F case. This is verified in Table 4 and 5 when using the transferable node features of degree encoding with limited dimension ($d = 3$) as well as DeepWalk embedding, as EGI trained on Forest fire graphs performs significantly better on Forest fire graphs than on Barabasi graphs (with +0.094 and +0.057 differences, respectively).

Data. We report the number of nodes, edges and classes for both the airport and gene datasets. The numbers for the Gene dataset are aggregated over the total of 52 gene networks in the dataset. For the three airport networks, Figure 5 shows the power-law degree distribution on a log-log scale. The class labels range from 0 to 3, reflecting the level of the airport activities (Ribeiro et al., 2017). For the Gene dataset, we matched the gene names in the TCGA dataset (Yang et al., 2019) to the list of transcription factors on Wikipedia (https://en.wikipedia.org/wiki/Transcription_factor).
75% of the genes are marked as 1 (transcription factors), and some gene graphs have extremely imbalanced class distributions. So we conduct experiments on the relatively balanced gene graphs of brain cancers (Figure 2 in the main paper). Neither dataset has organic node attributes. The role-based node labels are highly relevant to their local graph structures, but are not trivially computable, for example from node degrees.

Results. As we can observe from Figure 5, the three airport graphs have quite different sizes and structures (e.g., regarding edge density and connectivity pattern). Thus, the absolute classification accuracy in both Table 2 in the main paper and Table 7 here varies across different graphs. However, as we mention in the main paper, the structural difference we compute based on Eq. 5 in Theorem 3.1 is close among the Europe-USA and Europe-Brazil graph pairs (12.03 and 12.14), which leads to close transferability of EGI from Europe to USA and Brazil. This indicates the effectiveness of our view of essential structural information. Note that the results presented in Table 7 are the accuracy of GNNs directly trained and evaluated on each network without transfering. Therefore, only the Europe column has the same results as in Table 2 in the main paper, while the USA and Brazil columns can be regarded as providing an upper-bound performance of GNNs transfered from other graphs. As we can see, EGI gives the closest results between Table 2 in the main paper and Table 7 here, demonstrating its plausible transferability. The scores are close enough to suggest that fine-tuning may be skipped when the source and target graphs are similar enough. Also note that, although the variances are pretty large (which is also observed



1. In the experiments, we show our model to be generalizable to the more practical settings with task-specific pre-training and fine-tuning, while the study of rigorous bounds in such scenarios is left as future work.

2. We are not exploring graph-level tasks but focusing on transferring knowledge between two graphs. Thus, we drop the graph-level pre-training tasks in the paper, since they are not applicable to our setting.

3. The downstream tasks are unspecified because we aim to study the general transferability of GNNs that is not bounded to specific tasks. Nevertheless, we assume the tasks to be relevant to graph structures.

4. $O(n \vee m)$ is the orthogonal group of order $n \vee m$. So we have $L_{g_i}$ and $L_{g_{i'}}$ admitting a simultaneous ordered spectral decomposition.

5. $O(\beta)$ is the orthogonal group of order $\beta$. So we have $L_{g_i}$ and $L_{g_{i'}}$ admitting a simultaneous ordered spectral decomposition.

6. https://en.wikipedia.org/wiki/Transcription_factor



Figure 1: Overview of our GNN transfer learning framework: (1) we represent a graph as a combination of its 1-hop ego-graph and node feature distributions; (2) we design a transferable GNN regarding the capturing of such essential graph information; (3) we establish a rigorous guarantee of GNN transferability based on the requirement on node features and the difference between graph structures.

Theorem 2.1 naturally instantiates our insight about the correspondence between structural similarity and GNN transferability. It tells us how well a GNN trained on G a can work on G b by only checking the local graph Laplacians of G a and G b without actually training the model.

Figure 2: Role identification on the Gene dataset. Because severe label imbalance makes the performance gaps vanish, we only use the 7 brain cancer networks that have a more consistent balance of labels. We visualize the source graph G0 and two example target graphs that are relatively more similar to (G5) and more different from (G6) G0.

Figure 3: The overall EGI training framework. In Figure 3, $\{g_i, x_i\}$ and $\{g_i, x'_i\}$ are the positive and negative training samples w.r.t. the ego-graph topology $g_i$. The discriminator $D$ operates on a reversed version of the ego-graph $g_i$, in contrast to the encoder's forward propagation on $g_i$. It starts from the center node $v_i$ and computes the hidden representation $m_{p-1,\tilde q}$ for node $v_{p-1,\tilde q}$ at each hop. The edge message $h^{\tilde q}_{p,q}$ is calculated between the source node's hidden representation $m_{p-1,\tilde q}$ and the destination node features $x_{p,q}$.

Figure 4: Visualizations of the graphs and labels we use in the synthetic experiments.

Figure 5: Visualizations of the power-law degree distributions of the three airport datasets.

We compare the proposed model with existing unsupervised GNNs and pre-training GNN frameworks. The unsupervised GNNs are the same as used in our synthetic experiments, i.e., GVAE with GCN encoder (Kipf & Welling, 2016) and DGI with GIN encoder (Velickovic et al., 2019). The pre-training GNN frameworks include Mask-GIN and ContextPred-GIN, two node-level pre-training models proposed in (Hu et al., 2019a)². Besides, Structural Pre-train (Hu et al., 2019b) also conducts unsupervised node-level pre-training with structural features like node degrees and clustering coefficients.

Results of role identification with direct-transfering on the Airport dataset. The performances reported (%) are averages over 100 runs. The scores marked with ** passed a t-test with p < 0.01 over the second best results. More details about the results and dataset can be found in Appendix §C.2.

Performance of few-shot relation prediction on YAGO. Structural Pre-train (Hu et al., 2019b) cannot scale to the YAGO graphs with 100K+ nodes. More details can be found in Appendix §C.3.


Synthetic experiments of identifying structural-equivalent nodes with different degree encoding dimensions.

Synthetic experiments of identifying structural-equivalent nodes with different transferable and nontransferable features.


Algorithm 1: Pseudo code for training EGI. Input: the GNN encoder $\Psi$ and the GNN discriminator $D$; $k$-hop ego-graphs and features $\{g_i, x_i\}$. Training repeats while $\mathcal{L}_{EGI}$ has not converged.

in other works like Ribeiro et al. (2017) since the networks are small), our t-tests have shown the improvements of EGI to be significant.

55.56% ± 6.83%
DGI (GIN) (Velickovic et al., 2019)       57.75% ± 4.47%   62.44% ± 4.46%   68.15% ± 6.24%
Mask-GIN (Hu et al., 2019a)               56.37% ± 5.07%   63.78% ± 2.79%   61.85% ± 10.74%
ContextPred-GIN (Hu et al., 2019a)        52.69% ± 6.12%   56.22% ± 4.05%   58.52% ± 10.18%
Structural Pre-train (Hu et al., 2019b)   56.00% ± 4.58%   62.29% ± 3.51%   71.48% ± 9.38%
EGI (GIN)                                 59.15% ± 4.44%   65.88% ± 3.65%   74.07% ± 5.49%

C.3 REAL-WORLD LARGE-SCALE RELATION PREDICTION EXPERIMENTS

Data. As shown in Table 8, the source graph we use to pre-train GNNs is the full graph cleaned from the YAGO dump (Suchanek et al., 2007), where we assume the relations among entities are unknown. The target graph we use is a subgraph uniformly sampled from the same YAGO dump (we sample the nodes and then include all edges among the sampled nodes). A similar ratio between the number of nodes and edges can be observed in Table 8. On the target graph, we also have access to 24 different relations (Shi et al., 2018) such as isAdvisedBy, isMarriedTo and so on. Such relation labels are still relevant to the graph structures, but the relevance is lower compared with the structural role labels. We use the 256-dim degree encoding as node features for pre-training on the source graph, then we use the 128-dim positional embedding generated by LINE (Tang et al., 2015) for fine-tuning on the target graph, to explicitly make the features differ across source and target graphs.

Results. In Section B.1, we introduced two different types of fine-tuning, i.e., post-fine-tuning and joint-fine-tuning. For both types of fine-tuning, we add one feature encoder $E$ that processes the node features before they are fed into the GNNs, for two purposes. First, the fine-tuning features on the target graph usually have different dimensions from the pre-training features, such as the node degree encoding we use. Second, the semantics and distributions of fine-tuning features can be different from those of pre-training features. The feature encoder aims to bridge this gap between features in practice. The supervised loss used in this experiment is the same as in DistMult (Yang et al., 2014). In particular, the bilinear score function is calculated as $s(h, r, t) = z_h^T M_r z_t$, where $M_r$ is a diagonal matrix for each relation $r$, and $z_h$ and $z_t$ are the embeddings from the GNN encoder $\Psi$ for the head and tail entities. The experiments were run on a GTX1080 with 12G of memory.
We also report the average per-epoch training time of our algorithm in the pre-training and fine-tuning stages in Table 8. Pre-training and fine-tuning take about 40 epochs and 10 epochs to converge, respectively. EGI takes about 338 seconds per epoch for optimizing the ego-graph information maximization objective on YAGO-source. As we can see, fine-tuning also takes significant time compared to pre-training, which strengthens our argument for avoiding or reducing fine-tuning through structural analysis. We implement all baselines within the same pipeline, and the runtimes are all at the same scale.
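As a small illustration of the bilinear DistMult score $s(h, r, t) = z_h^T M_r z_t$ used in the supervised fine-tuning loss, the sketch below evaluates the score with the diagonal of $M_r$ stored as a plain vector and picks the highest-scoring relation for an entity pair. The function names, the toy embeddings and the relation vectors are hypothetical values chosen for illustration; only the score formula and the relation names come from the text above.

```python
def distmult_score(z_h, m_r, z_t):
    """s(h, r, t) = z_h^T diag(m_r) z_t, with m_r the diagonal of M_r."""
    return sum(h * r * t for h, r, t in zip(z_h, m_r, z_t))


def best_relation(z_h, z_t, relations):
    """Return the relation name with the highest DistMult score."""
    return max(relations, key=lambda name: distmult_score(z_h, relations[name], z_t))


# Toy 2-dimensional embeddings (hypothetical values).
z_head = [1.0, 2.0]
z_tail = [5.0, 6.0]
relation_diagonals = {
    "isMarriedTo": [1.0, 0.0],
    "isAdvisedBy": [0.0, 1.0],
}
```

Because $M_r$ is diagonal, the score reduces to an element-wise product summed over dimensions, so scoring a candidate triple costs time linear in the embedding dimension.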

