TOWARDS FEDERATED LEARNING OF DEEP GRAPH NEURAL NETWORKS

Abstract

Graph neural networks (GNNs) learn node representations by recursively aggregating neighborhood information on graph data. In the federated setting, however, data samples (nodes) located in different clients may be connected to each other, leading to substantial information loss during training. Existing federated graph learning frameworks address this problem by generating missing neighbors or by sending information across clients directly. None are suitable for training deep GNNs, which require a more expansive receptive field and incur higher communication costs. In this work, we introduce a novel framework named Fed2GNN for federated graph learning of deep GNNs via reconstructing the neighborhood information of nodes. Specifically, we design a graph structure named the rooted tree. The node embedding obtained by encoding on the rooted tree is the same as that obtained by encoding on the induced subgraph surrounding the node, which allows us to reconstruct the neighborhood information by building the rooted tree of the node. An encoder-decoder framework is then proposed, wherein we first encode missing neighbor information and then decode it to build the rooted tree. Extensive experiments on real-world network datasets show the effectiveness of our framework for training deep GNNs while also achieving better performance for training shallow GNN models.

1. INTRODUCTION

Recently, Graph Neural Networks (GNNs) have attracted significant attention due to their powerful ability for representation learning of graph-structured data (Hamilton et al., 2017a; Kipf & Welling, 2017; Hamilton et al., 2017b). Generally speaking, a GNN adopts a recursive neighborhood aggregation (or message passing) scheme to learn node representations by considering node features and graph topology information together (Xu et al., 2018). After k iterations of aggregation, a node captures the information within its k-hop neighborhood. As in learning tasks of other domains, training a well-performing GNN model requires its training data to be not only sufficiently large but also heterogeneous for better generalization. However, in reality, heterogeneous data are often stored separately in different clients and cannot be shared due to policies and privacy concerns. To that end, recent works have proposed federated training of GNNs (Zhang et al., 2021; Peng et al., 2021; Yao & Joe-Wong, 2022; Chen et al., 2022). They typically consider a framework wherein each client iteratively updates node representations with a semi-supervised model on its local graph; the models are then aggregated at a central server. The main challenge is that data samples (nodes) located in different clients may be connected to each other. Hence, it is non-trivial to consider the connected nodes (i.e., neighbor nodes) located in other clients when applying node updates. Although existing works focus on recovering missing neighborhood information for nodes, they either only consider immediate neighbors (Zhang et al., 2021; Peng et al., 2021) or require communication costs that increase exponentially as the neighbors' distance increases (Yao & Joe-Wong, 2022; Chen et al., 2022).
None of them are suitable for training deeper GNN models, which require a more expansive receptive field and have been shown to be beneficial for representation learning on graph-structured data (Li et al., 2019; Liu et al., 2020; Zhou et al., 2020a). For GNNs, the receptive field of a node representation is its entire neighborhood. Moreover, (Yao & Joe-Wong, 2022) also requires the weight matrix to be calculated in advance, which is not available in practice. In this work, we aim to fundamentally address the above limitations of existing federated graph learning methods by proposing a novel framework named Fed2GNN. The key idea lies in designing a principled approach to reconstructing the neighborhood information of nodes that considers both structure-based (i.e., graph topology) and feature-based (i.e., node features) information. For the structure-based information, we propose a novel graph structure named the rooted tree, which has a more regular structure than the original structure of the node neighborhood. More importantly, the node embedding obtained by encoding on the rooted tree is the same as that obtained by encoding on the node's ego-graph (i.e., the induced subgraph surrounding the node). Such a property allows us to easily reconstruct the structure-based information by building the rooted tree of the node. For the feature-based information, since the structure of the node neighborhood changes, we aim to generate features of the nodes in the rooted tree. Inspired by the structure of the rooted tree, we design a protocol wherein clients recursively transmit information to each other; the data transmitted in the k-th round correspond to nodes in the (k+1)-th layer of the rooted tree. Furthermore, we utilize an encoder-decoder framework to reduce the communication costs such that they grow only linearly with the number of iterations. In more detail, each client first encodes the information and sends the output to other clients.
Other clients build the rooted tree by decoding the received information. By merging all trees into the local graph (with the rooted node as an anchor), each client obtains a complete graph on which applying graph representation learning incurs limited information loss. In summary, we make the following contributions:
• We introduce Fed2GNN, a framework for federated training of GNNs to solve node-level prediction tasks. We achieve this goal by devising a principled approach to reconstructing missing neighborhood information that considers both structure-based and feature-based information.
• To reconstruct the structure-based information, we propose a novel graph structure named the rooted tree, which is easier to construct than the original irregular structures of the node neighborhood. More importantly, the node embedding obtained by encoding on the rooted tree is the same as that obtained by encoding on the node's ego-graph.
• To reconstruct the feature-based information, we propose an encoder-decoder framework that reduces communication costs while incurring limited information loss.
• We conduct extensive experiments to verify the utility of Fed2GNN. The results show that it is effective for training deep GNNs while achieving better performance for training shallow GNN models.
We outline related works in Section 2 before introducing the problem statement of federated graph learning in Section 3. We then introduce Fed2GNN in Section 4, wherein we first introduce the structure of the rooted tree and then present the neighborhood reconstruction process. We analyze its performance experimentally in Section 5 and conclude in Section 6.

2. RELATED WORKS

Graph Neural Networks (GNNs) learn a representation for each node in the graph using a set of stacked graph convolution layers. Each layer takes an initial vector for each node and outputs a new embedding vector by aggregating the vectors of neighbor nodes followed by a non-linear transform. After k aggregations, the information encoded in a node's representation essentially comes from its k-hop neighborhood. Following the above framework, usually called message passing, several GNN models have been proposed, such as GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017b), GAT (Velickovic et al., 2018), and so on. However, unlike learning tasks in other domains, simply stacking graph convolution layers usually suffers from an over-smoothing issue, leading to even worse performance. With further research on this issue, several works (Liu et al., 2020; Li et al., 2019; Zhou et al., 2020b) have proposed effective deep GNNs that obtain better performance on graph learning tasks. Their excellent performance suggests great potential for federated learning on distributed subgraph data. Personalized federated learning (Finn et al., 2017; Li et al., 2020), which aims to learn heterogeneous models for different local tasks, has also attracted attention from the community. However, our paper focuses on learning a global model from multiple clients for a common task. Hence, we mainly borrow the idea of FedAvg to train GNNs collaboratively. Federated Graph Learning aims to solve the graph learning problem on distributed subgraph data. Researchers have recently made progress on federated graph learning, and several frameworks have been proposed (Fu et al., 2022; Liu & Yu, 2022). (Zhang et al., 2021) proposed a framework named FedSage+, which utilizes a neighbor generator to recover cross-client neighbors of subgraphs. However, it only considers immediate neighbors and cannot fully recover the cross-client information.
FedGraph (Chen et al., 2022) develops a cross-client graph convolution operation to enable embedding sharing among clients, which requires sending updated embeddings in every iteration during model training. FedGCN (Yao & Joe-Wong, 2022) avoids this by proposing to transmit the aggregated features of the cross-subgraph neighbor nodes directly. However, it assumes that the weight matrix is calculated in advance and requires communication costs that increase exponentially as the number of layers increases. Baek et al. (2022) propose a personalized federated subgraph learning method, mainly focusing on heterogeneous subgraphs; in more detail, it develops an aggregation method based on similarity among clients. However, it does not consider the missing neighbor information. In summary, none of the existing works are suitable for federated learning of deep GNNs; we are the first to achieve such a goal.

3. PROBLEM STATEMENT

Given an attributed undirected graph G = (V, E, X), where V is the vertex set, E is the edge set, and X contains the node features. Each node v has a feature x_v ∈ X, with x_v ∈ R^{d_x}, and a corresponding label y_v for the downstream task, e.g., node classification. In the FL system, we assume there exist a central server S and M clients P_p, p ∈ [M], where [M] denotes {1, ..., M} for any M ∈ N^+. The server only maintains a graph learning model and stores no graph data, while each client holds a subgraph G_p = (V_p, E_p, X_p), where V_p, E_p, X_p are subsets of V, E, X, respectively. We assume no overlapping nodes are shared across clients, namely V_p ∩ V_q = ∅ for all p, q ∈ [M], p ≠ q, and that the subsets constitute the full set, i.e., ∪_{p∈[M]} V_p = V. Given a node v ∈ V, we define its k-hop neighborhood as the set of nodes at shortest-path distance k from v. Let c(v) denote the index of the client that contains node v, and let N_v denote the 1-hop neighborhood of v. For any node v ∈ V, we assume P_{c(v)} knows all of v's 1-hop neighbors even if they are distributed across different clients. The set of neighbors located in the same client as v is denoted N_v^p, and the set of remaining neighbors is denoted N_v^{cp} := N_v \ N_v^p. For ease of exposition, we call P_{c(v)} the active client and the other clients {P_{c(j)} | j ∈ N_v^{cp}} the passive clients of v.
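To make the notation concrete, the split of a node's 1-hop neighbors into N_v^p and N_v^cp can be sketched as follows (a toy, centralized view for illustration; the function name and data layout are ours, not from the paper):

```python
from collections import defaultdict

def split_neighborhood(edges, client_of, v):
    """Split v's 1-hop neighbors into the in-client set N_v^p and the
    cross-client set N_v^cp, given an edge list and a node -> client map."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    same = {u for u in adj[v] if client_of[u] == client_of[v]}   # N_v^p
    cross = adj[v] - same                                        # N_v^cp
    return same, cross
```

For example, with nodes 0 and 1 held by client 0 and node 2 held by client 1, node 0's cross-client neighbor set is {2}.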

4. METHODS

In this section, we propose a novel framework for federated learning of deep GNNs (Fed2GNN). The framework relies on a principled approach to reconstructing neighborhood information, with the core idea of building rooted trees for all nodes. After constructing all trees, we obtain complete local graphs on which applying graph representation learning incurs no neighborhood information loss. The FedAvg (McMahan et al., 2017) algorithm is then adopted to train a GNN on the complete local graphs. Next, we start by designing the rooted tree of a node and then show how to build the tree with an encoder-decoder framework.
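The server-side step of FedAvg referenced above can be sketched as a weighted parameter average (plain dictionaries stand in for model state; this is a simplification of McMahan et al.'s algorithm, not the paper's exact implementation):

```python
def fedavg(client_weights, client_sizes):
    """FedAvg aggregation: average client parameters weighted by the
    number of local training samples each client holds."""
    total = sum(client_sizes)
    return {
        name: sum(w[name] * n for w, n in zip(client_weights, client_sizes)) / total
        for name in client_weights[0]
    }
```

In each round, the server broadcasts the averaged parameters back to the clients, which then continue local training on their (reconstructed) complete local graphs.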

4.1. DESIGN OF THE ROOTED TREE

Given a K-layer GNN model, it learns node representations by recursively aggregating neighborhood information around the node. After K aggregation steps, it encodes information from the K-hop subgraph of the node, often called the ego-graph. In order to reconstruct the missing neighborhood information for each node, the rooted tree is designed to follow two principles: (1) it fully preserves the neighborhood information, i.e., the node embedding obtained by encoding on the rooted tree is the same as that obtained by encoding on the node's ego-graph, and (2) it has an easy-to-build structure. To that end, we design the rooted tree by unfolding the ego-graph of each node.

[Figure 1: rooted trees of node i built from its (a) 1-hop, (b) 2-hop, and (c) 3-hop neighborhoods.]

An example of constructing rooted trees with 1-, 2-, and 3-hop neighborhoods is depicted in Figure 1. Specifically, for a node i ∈ G, we first set i as the rooted node. We then set i's 1-hop neighbors m ∈ N_i as nodes in the second layer of the rooted tree (Figure 1(a)). The third layer, intuitively, can be constructed using m's neighbors for each m ∈ N_i. However, note that i is also included in N_m and connected to m as its father node. Hence, we exclude i and connect the remaining nodes in N_m to m to further construct the rooted tree (Figure 1(b)). Following the above procedure, we construct the (k+1)-th (k > 1) layer by connecting the neighbors of each node in the k-th layer to it, except for its father node. Finally, for a K-layer GNN model, we construct a rooted tree with K + 1 layers. Proposition 1.
Given a K-layer GNN model, for any node i and its corresponding ego-graph, we can construct a (K+1)-layer rooted tree following the above procedure such that i's embedding obtained by encoding on the rooted tree is the same as that obtained by encoding on its ego-graph. It is worth noting that the rooted tree is actually an undirected graph. Any node in the tree treats its father node and children nodes as one-hop neighbors and recursively aggregates their information to update its embedding. Meanwhile, the nodes in the k-th layer of the rooted tree have the complete neighborhood information of K+1-k hops, and the rooted node in the first layer has the complete K-hop neighborhood information. Thus, for a K-layer GNN model, a rooted node has the same embedding as obtained by encoding on its K-hop ego-graph. Meanwhile, although the degrees of nodes in the last layer change, which affects the convolution results of nodes in the K-th layer for models like GCN (Kipf & Welling, 2017), we can manipulate the edge weights to guarantee the correctness of the convolution process. Details are presented in Appendix A. In practice, we aggregate the leaf nodes into a single node; this does not affect the encoding process when the sum function is applied to aggregate the features of neighbor nodes. Meanwhile, we only construct the rooted tree for the missing neighborhood information. For instance, assume that nodes i, m_1, m_3, j in Figure 1 are located in the same client while node m_2 is located in a different client. We can construct the rooted tree of i by considering only the m_2 branch in the second layer. Although m_2 also acts as i's 2-hop neighbor, affecting i's encoding result through the path m_2-m_1-i, which does not exist in i's rooted tree, we solve this problem by merging the rooted trees of i and m_1 into the original subgraph, wherein the edge m_2-m_1 exists in m_1's rooted tree.
Finally, by merging rooted trees of all nodes into the local graph (with the rooted node as an anchor), we obtain a complete graph.
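The unfolding procedure above can be sketched as a breadth-first expansion that copies graph nodes into fresh tree nodes and skips the father node at every step (the helper name and data structures are ours; the toy example below uses a triangle graph):

```python
from collections import defaultdict

def build_rooted_tree(adj, root, depth):
    """Unfold the ego-graph of `root` into a rooted tree with `depth`+1 layers.
    Each tree node is a fresh id; children of a tree node copying graph node v
    are copies of v's graph neighbors, excluding the father node."""
    tree_nodes = {0: root}          # tree id -> graph node it copies
    children = defaultdict(list)    # tree id -> child tree ids
    frontier = [(0, None)]          # (tree id, graph node of its father)
    next_id = 1
    for _ in range(depth):
        nxt = []
        for tid, father in frontier:
            v = tree_nodes[tid]
            for u in adj[v]:
                if u == father:
                    continue        # exclude the father node
                tree_nodes[next_id] = u
                children[tid].append(next_id)
                nxt.append((next_id, v))
                next_id += 1
        frontier = nxt
    return tree_nodes, children
```

On a triangle graph {0-1, 0-2, 1-2} unfolded from node 0 to depth 2, the second layer copies nodes 1 and 2, and the third layer copies node 2 (under 1) and node 1 (under 2), exactly as the exclusion rule prescribes.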

4.2. NEIGHBORHOOD RECONSTRUCTION

To construct the rooted tree of K + 1 layers, we have clients recursively transmit information to each other for K iterations. Meanwhile, an encoder-decoder framework is further used to reduce communication costs. A visual illustration of the information transmission process across clients is presented in Figure 2. For ease of presentation, we start with 1-hop and 2-hop neighborhood reconstruction and then generalize to K-hop neighborhood reconstruction. One-hop neighborhood reconstruction. Constructing a rooted tree with 1-hop neighborhood information follows from the expression of a 1-layer GNN:

ŷ_i = σ( ( Σ_{m∈N_i^p∪{i}} x_m + Σ_{m∈N_i^{cp}} x_m ) W^{(1)} ).    (1)

It is sufficient for a passive client P_z to send Σ_{m∈N_i^{cp}} 1_z[c(m)] x_m to the active client, where the indicator 1_z[c(m)] is 1 if z = c(m) and zero otherwise. The active client then sums all received information to get f_i^1 := Σ_{m∈N_i^{cp}} x_m and treats it as a pseudo node to construct the rooted tree. We treat the sum function as an encoder and do not require a decoder here. The same idea is also presented in (Yao & Joe-Wong, 2022). However, it assumes that the weight matrix is calculated in advance, which is unavailable when training GCN (Kipf & Welling, 2017) models in practice, since the degrees of neighbor nodes in other clients are usually unknown. Two-hop neighborhood reconstruction. Constructing the rooted tree with 2-hop neighborhood information continues the process of 1-hop neighborhood reconstruction. Specifically, for all i ∈ G, we assume P_{c(i)} already knows f_i^1. Our intuition lies in the expression of a 2-layer GNN.
ŷ_i = σ( Σ_{m∈N_i∪{i}} σ( Σ_{j∈N_m∪{m}} x_j W^{(1)} ) W^{(2)} )
    = σ( ( Σ_{m∈N_i^p∪{i}} σ( ( Σ_{j∈N_m^p∪{m}} x_j + Σ_{j∈N_m^{cp}} x_j ) W^{(1)} ) + Σ_{m∈N_i^{cp}} σ( Σ_{j∈N_m∪{m}} x_j W^{(1)} ) ) W^{(2)} )    (2)

Three items are missing to get ŷ_i: the missing one-hop information Σ_{j∈N_i^{cp}} x_j, as well as the missing two-hop information Σ_{j∈N_m^{cp}} x_j for all m ∈ N_i^p and Σ_{j∈N_m∪{m}} x_j for all m ∈ N_i^{cp}. Only the third item remains unknown to P_{c(i)} after constructing the 1-hop neighborhood for all nodes. To that end, we have the passive clients encode the missing information first and send the result to the active client. Specifically, a passive client P_z, which owns a subset N_i^{cp-z} ⊂ N_i^{cp}, calculates q_z^2 := Σ_{m∈N_i^{cp-z}} 1_z[c(m)] Σ_{j∈N_m} x_j and sends q_z^2 to the active client P_{c(i)}. P_{c(i)} then sums all received information to get z_i^2 := Σ_{m∈N_i^{cp}} Σ_{j∈N_m} x_j. By subtracting i's own information, P_{c(i)} gets f_i^2 := z_i^2 - |N_i^{cp}| x_i. Given f_i^1 and f_i^2, we design a decoder ϕ to extract the features of nodes in the second and third layers of the rooted tree simultaneously. Specifically, the decoder is designed to achieve the following capability:

ϕ(CONCAT(f_i^1, f_i^2)) = { CONCAT(x̂_m, Ĥ_{N_m\{i}}) | m ∈ N_i^{cp} },    (3)

where Ĥ_{N_m\{i}} = Σ_{j∈N_m\{i}} x̂_j is the neighbor information of m except for i. According to the dimension of the features, we can split the outputs into two sets G_i = { x̂_m | m ∈ N_i^{cp} } and H_i = { Ĥ_{N_m\{i}} | m ∈ N_i^{cp} }, corresponding to the features of nodes in the second and third layers of the rooted tree. The rooted tree can then be easily constructed by connecting nodes in G_i to the rooted node and nodes in H_i to the corresponding nodes in G_i.
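On a toy centralized view of the graph (scalar features for brevity), the quantities f_i^1 and f_i^2 = z_i^2 - |N_i^cp|·x_i can be checked directly; in the actual protocol each passive client only contributes its own share q_z^2, so the helper below (our names) is for intuition only:

```python
def one_and_two_hop_sums(adj, x, client_of, i):
    """Compute f1_i (sum of cross-client 1-hop features) and
    f2_i = z2_i - |N_i^cp| * x_i, where z2_i is what the passive
    clients' messages q2_z add up to."""
    cross = [m for m in adj[i] if client_of[m] != client_of[i]]  # N_i^cp
    f1 = sum(x[m] for m in cross)
    z2 = sum(x[j] for m in cross for j in adj[m])  # each m's full 1-hop sum
    f2 = z2 - len(cross) * x[i]                    # remove the copies of x_i
    return f1, f2
```

Subtracting len(cross) * x[i] works because every cross-client neighbor m has i in N_m, so x_i appears exactly |N_i^cp| times inside z_i^2.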

Generalizing to K-hop neighborhood reconstruction

Following the derivation for the 1-hop and 2-hop cases, the method generalizes to reconstructing the K-hop neighborhood. For every node i ∈ V, denote by R_i^k, k ∈ [K], the set of neighbor nodes in the (k+1)-th layer of the rooted tree for the missing neighborhood. To construct the rooted tree, we first get the sum of the features of the nodes in each layer, f_i^k = Σ_{m∈R_i^k} x_m, k ∈ [K], and then input them to the decoder to reconstruct the entire neighborhood information (Figure 2). Meanwhile, denote by N_i^k, k ∈ [K], the set of neighbor nodes in the (k+1)-th layer of the rooted tree for the entire neighborhood, so that R_i^k ⊂ N_i^k; we further denote µ_i^k = Σ_{m∈N_i^k} x_m, k ∈ [K]. Protocol 1 in Appendix F depicts the process of getting f_i^k. The core idea is that the k-hop neighbor information of any node i is contained in the (k-1)-hop neighbor information of i's neighbor nodes. To that end, we only require passive participants to send information to the active participant. Specifically, for every node i ∈ V, denote by C_i = { P_p | ∃ m ∈ N_i^{cp} s.t. p = c(m) } the set of passive participants for node i. We first get f_i^1 (lines 3-8) and f_i^2 (lines 13-20) following the process described above. Meanwhile, we also get µ_i^1 and η_i^1 := µ_i^1 - f_i^1 (lines 9-10), as well as µ_i^2 and η_i^2 := µ_i^2 - f_i^2 (lines 19-20), where η_i^k denotes neighbor information without loss for node i. To construct f_i^3, we have each passive client P_p ∈ C_i send q_p^3 := Σ_{m∈N_i^{cp}} 1_p[c(m)] µ_m^2 to the active client. After summing all received information, the active client subtracts (|N_i^{cp}| - 1) · f_i^1, where f_i^1 is the missing 1-hop neighbor information that already exists in the second layer of the rooted tree. Meanwhile, as each q_p^3 contains i's information without loss, it is also required to subtract |N_i^{cp}| · η_i^1, finally giving f_i^3 = Σ_{P_p∈C_i} q_p^3 - (|N_i^{cp}| - 1) · f_i^1 - |N_i^{cp}| · η_i^1 (line 22).
Similarly, µ_i^3 = Σ_{P_p∈C_i∪{P_{c(i)}}} q_p^3 - (|N_i^{cp}| + |N_i^p| - 1) · µ_i^1 (line 23) and η_i^3 = µ_i^3 - f_i^3 (line 24). Repeating the above procedure, we can get f_i^k for any k > 3. The decoding process is challenging, as we need to decode the feature information of the nodes in the rooted tree and reconstruct the edges between them. To achieve this goal, we propose an algorithm that reconstructs the neighborhood with the help of multiple decoders ϕ_k, k ∈ [K-1], where ϕ_k extracts the features of the k-hop neighborhood information. In more detail, ϕ_k is designed to achieve the following capability:

ϕ_k(CONCAT(f_i^k, ..., f_i^K)) = { CONCAT( x̂_{l_k}, Σ_{l_{k+1}∈N_{l_k}\{l_{k-1}}} x̂_{l_{k+1}}, ..., Σ_{l_{k+1}∈N_{l_k}\{l_{k-1}}} ... Σ_{l_K∈N_{l_{K-1}}\{l_{K-2}}} x̂_{l_K} ) | l_k ∈ N_{l_{k-1}}^{cp} },    (4)

where l_k, k ∈ [K], corresponds to k-hop neighbor nodes. Algorithm 2 in Appendix F presents the pseudo-code of the construction process. We first concatenate f_i^k, k ∈ [K] (line 4), and input the result to ϕ_1 (line 9). We split the outputs into two sets G_i^1 and H_i^1, where G_i^1 = { x̂_{l_1} | l_1 ∈ N_i^{cp} } and H_i^1 = { CONCAT( Σ_{l_2∈N_{l_1}\{i}} x̂_{l_2}, ..., Σ_{l_2∈N_{l_1}\{i}} ... Σ_{l_K∈N_{l_{K-1}}\{l_{K-2}}} x̂_{l_K} ) | l_1 ∈ N_i^{cp} }. The second layer of the rooted tree for i is constructed by generating nodes with features corresponding to the elements in G_i^1. We then input each vector in H_i^1 to ϕ_2 to decode the features of the nodes in the next layer (lines 8-12). Specifically, for every l_1 ∈ N_i^{cp}, we input the corresponding vector to ϕ_2 and get two sets G_i^2 = { x̂_{l_2} | l_2 ∈ N_{l_1}\{i} } and H_i^2 = { CONCAT( Σ_{l_3∈N_{l_2}\{l_1}} x̂_{l_3}, ..., Σ_{l_3∈N_{l_2}\{l_1}} ... Σ_{l_K∈N_{l_{K-1}}\{l_{K-2}}} x̂_{l_K} ) | l_2 ∈ N_{l_1}\{i} }. The third layer is constructed by generating nodes with features corresponding to the elements in G_i^2 and connecting them to l_1. Repeating the above procedure, we can fully construct the first K layers of the rooted tree.
The last layer is constructed by directly generating nodes with features corresponding to the elements in H_i^{K-1} and connecting them to the corresponding l_{K-1}.
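The layer-by-layer decoding can be sketched generically as below; `dec` stands in for a learned ϕ_k and is treated as a black box that returns one row per decoded neighbor, whose first `dx` entries are the recovered feature x̂ and whose remainder is the summary fed to the next decoder. The interface and the fixed per-layer fanout are our simplifications, not the paper's exact Algorithm 2:

```python
def decode_rooted_tree(decoders, f_cat, fanouts, dx):
    """Layer-by-layer decoding sketch. decoders[k] plays the role of the
    learned decoder for layer k+2 of the tree; fanouts[k] is the (predicted)
    number of neighbors to decode at that layer."""
    layers = []                    # layers[k]: list of (feature, remainder) pairs
    frontier = [f_cat]             # vectors still to be decoded
    for dec, n in zip(decoders, fanouts):
        out = []
        for vec in frontier:
            for row in dec(vec, n):
                # first dx entries: recovered node feature; rest: subtree summary
                out.append((row[:dx], row[dx:]))
        layers.append(out)
        frontier = [rem for _, rem in out if rem]
    return layers
```

The split by feature dimension mirrors how the paper separates each decoder's output into the sets G_i^k (node features for the current layer) and H_i^k (summaries passed down to the next decoder).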

4.3. DESIGN OF ENCODER-DECODER

In the design of our encoder-decoder framework, we use the sum function as the encoder, which has the advantage of requiring no learning. The active client can therefore learn an end-to-end decoder with a training dataset generated from its local graph. Although the decoders ϕ_k, k ∈ [K-1], extract the features of nodes in different layers separately, they are designed by the same principle, i.e., extracting node features and the corresponding neighbor information simultaneously. The main difference is the scope of the neighbor information: ϕ_1 extracts (K-1)-hop neighbor information, while ϕ_{K-1} extracts 1-hop neighbor information. It therefore suffices to train ϕ_1 and apply it to extract the features of nodes in all layers. For ease of presentation, we show the training process of ϕ_1 when K = 2. Specifically, we consider the following objective to reconstruct the 1-hop and 2-hop neighborhood information simultaneously:

min_{ϕ_1} Σ_{i∈V} M(Y_i, ϕ_1(X_i)),  s.t.  X_i = Σ_{h∈Y_i} h,  ∀i ∈ V,    (5)

where X_i = CONCAT(f_i^1, f_i^2), Y_i = { CONCAT(x_m, Ĥ_{N_m\{i}}) | m ∈ N_i^{cp} }, and M(·,·) is the loss function that measures the loss of reconstructing the set Y_i. The task of learning ϕ_1 is hard to characterize, and there are two fundamental challenges. First, as the distribution of node degrees is often long-tailed in real-world networks, the sizes of the sets may vary widely. Second, a matching problem has to be solved to compare two equal-sized sets, which is costly if the set size is large. The first problem is easy to solve: as each node knows its neighbor nodes, the active client can sample at most d neighbor nodes and ask the passive clients to send the information of the sampled nodes only. For the second problem, we transform the decoding task into learning the probability distribution of the neighborhood information from a source distribution.
Specifically, for node i, the neighborhood information is represented as an empirical realization of i.i.d. sampling of d_i elements from P_i, where P_i ≜ (1/d_i) Σ_{m∈N_i} δ_{CONCAT(x_m, H_{N_m\{i}})} and the source distribution is Q_i ≜ ∏_{m∈N_i} P_i^{(m)}. Therefore, we adopt M(Y_i, ϕ_1(X_i)) = W_2^2(P_i, ϕ_1(X_i)), where W_2 is the 2-Wasserstein distance (Villani, 2009). We make use of the architecture of a U-Net network (Ronneberger et al., 2015) to construct the decoder, which has the advantage of generating the features of all neighbors in one forward pass. A theoretical analysis of the capability of the decoder is stated in Theorem 4.1 (see the proof in Appendix B, reproduced from Theorem 2.1 in (Lu & Lu, 2020)).

Theorem 4.1. For any ε > 0, if the support of the distribution P_i lies in a bounded space of R^d and Q_i is absolutely continuous with respect to the Lebesgue measure, then there exists a feed-forward neural network u(·) : R^d → R (and thus its gradient ∇u(·) : R^d → R^d) with large enough width and depth (depending on ε) such that W_2^2(P_i, (∇u)_# Q_i) < ε.

The decoder ϕ_1 requires knowing the number of missing neighbors in advance, which is unavailable to the active client when applying it to extract the features of k-hop (k > 1) neighbors. To that end, we use a predictor to predict the number of neighbors. Specifically, the predictor ϕ_d takes x̂_m and Ĥ_{N_m\{i}} as inputs to predict |N_m\{i}|. Denoting the loss function by L, ϕ_d is jointly trained with ϕ_1 by optimizing the following loss: M(Y_i, ϕ_1(X_i)) + Σ_{m∈N_i^{cp}} L(ϕ_d(x̂_m, Ĥ_{N_m\{i}}), |N_m\{i}|). In practice, we adopt the empirical Wasserstein distance, which provably approximates the population one. For node i, each forward pass of the model produces |Y_i| outputs, denoted ξ̂_1, ..., ξ̂_{|Y_i|}.
Inspired by the work of (Tang et al., 2022), which proposes an auto-encoder framework for graph data, we adopt a surrogate loss defined as

min_{π∈Π} Σ_{j=1}^{|Y_i|} ||ξ_j^i - ξ̂_{π(j)}^i||^2,  s.t. π is a bijective mapping [|Y_i|] → [|Y_i|].

Meanwhile, the original node features could be high-dimensional, and reconstructing them directly may introduce a lot of noise. Instead, we may first map the node features into a latent space. In more detail, all clients could collectively learn a model and map the features to a lower dimension with the linear layers of the model.
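The surrogate loss can be implemented by brute-force search over bijections π, which stays cheap because the neighbor set is capped at the small sampling bound d; a sketch with plain lists (in practice a Hungarian-algorithm solver would replace the permutation scan, and the rows would be decoder outputs):

```python
from itertools import permutations

def set_matching_loss(targets, outputs):
    """min over bijections pi of sum_j ||targets[j] - outputs[pi(j)]||^2."""
    assert len(targets) == len(outputs)

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return min(
        sum(sqdist(t, outputs[j]) for t, j in zip(targets, perm))
        for perm in permutations(range(len(outputs)))
    )
```

Because the loss minimizes over all orderings, the decoder is penalized only for reconstructing the wrong multiset of neighbor vectors, not for emitting them in a different order.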

5. EXPERIMENTS

In this section, we conduct extensive experiments to evaluate Fed2GNN, focusing on the following research questions. RQ1: How does Fed2GNN perform in comparison to state-of-the-art federated graph learning methods? RQ2: How does Fed2GNN perform when training deep GNNs? RQ3: How do the design choices and hyperparameters of Fed2GNN affect its performance?

5.1. EXPERIMENTAL SETUP

We conduct experiments on three citation network datasets: Cora (Sen et al., 2008), MSAcademic (Shchur et al., 2018), and DBLP (Fu et al., 2020). We randomly split all datasets into a 40% training set, 30% validation set, and 30% testing set. To synthesize the distributed subgraph system, we generate hierarchical graph clusters on each dataset with the Louvain algorithm (Blondel et al., 2008), following the work of FedSage+ (Zhang et al., 2021). We consider three scenarios with M = 5, 10, 15. The statistics of the datasets are presented in Appendix C. We compare Fed2GNN with four baselines to demonstrate its effectiveness. All experiments were repeated ten times with different random seeds.
• Central learning: models are trained in a centralized manner on the global graph dataset.
• FedAvg: models are trained with the FedAvg algorithm on distributed subgraph data.
• FedAvg-Full: models are trained with the FedAvg algorithm, but the node representations have no neighborhood information loss. This baseline is in line with FedGCN (Yao & Joe-Wong, 2022)'s goal in terms of accuracy and represents the target of our framework.
• FedSage+ (Zhang et al., 2021): a federated graph learning framework built on the idea of generating cross-subgraph neighbor nodes.

5.2. PERFORMANCE FOR 2-LAYER GNNS (RQ1)

We conduct experiments on two GNN models, GCN (Kipf & Welling, 2017) and GraphSAGE (Hamilton et al., 2017b), and compare the performance of Fed2GNN with all baselines. The results are depicted in Table 1. Fed2GNN consistently outperforms FedAvg and is better than FedSage+ in most cases. Meanwhile, Fed2GNN has performance comparable to FedAvg-Full, corroborating the effectiveness of our method for reconstructing neighborhood information.

5.3. PERFORMANCE FOR DEEP GNNS (RQ2)

We conduct experiments to evaluate the performance of our framework for deep GNNs. We adopt the DAGNN model proposed in (Liu et al., 2020), which performs better than GCN (Kipf & Welling, 2017) for node representation learning on many datasets. Table 2 presents the results when the number of layers is 5. It shows that Fed2GNN still performs well even for deep GNNs, demonstrating the effectiveness of our method for reconstructing neighborhood information. Another important observation emerging from the results is that learning with DAGNN significantly bridges the gap between central learning and federated learning on the Cora dataset. Specifically, as can be observed in Table 1, the gaps between central learning and Fed2GNN are 4.58% when M = 10 and 5.68% when M = 15 for the GCN model, while the gaps are only 1.37% and 1.43% for deep GNNs. This observation attests to the benefits brought by DAGNN in the federated setting. We further conduct experiments for DAGNN with different depths. The results are illustrated in Figure 3. They show that node representation learning on the Cora and DBLP datasets benefits from training the DAGNN model. Meanwhile, applying DAGNN to the MSAcademic dataset degrades performance in central learning. Both observations are aligned with the results in (Liu et al., 2020). However, federated graph learning on the MSAcademic dataset can benefit from applying a deeper DAGNN, probably because deep GNNs have better generalization ability, which compensates for the federated setting. Moreover, Fed2GNN still performs better than FedAvg and has performance comparable to FedAvg-Full, showing that our method is effective for federated learning of deep GNNs.

5.4. IN-DEPTH ANALYSIS FOR FED2GNN (RQ3)

In Figure 4 , we present results on studying the impact of max node degree d for Fed 2 GNN. The size of d not only affects the efficiency of reconstructing neighborhood information but also affects of performance of downstream tasks. We experiment on models with a different number of layers with the Core dataset and try q varies from 3 to 20. It shows that the performance of Fed 2 GNN is robust to the max node degree, with large d only leading to slightly better performance them small d. Finally, to further understand the effectiveness of our proposed framework for federated graph learning, we perform convergence analysis in Appendix E. The results show that the method has a similar converge rate as FedAvg-full, indicating that we can indeed recover the missing neighborhood. Meanwhile, we conduct experiments on training the decoder locally or collectively; the results show that the decoder has a good generalization ability.

6. CONCLUSION

In this work, we address the limitations of existing works on federated learning over graph-structured data and propose a new framework that better tackles the issue of missing cross-client neighborhood information during training. Our framework builds on the idea of reconstructing neighborhood information while considering both structure-based and feature-based information. This allows us to train deep GNNs in the federated setting, leading to better performance for node representation learning. Extensive experiments have been conducted to verify the effectiveness of our framework, which is consistent with our theoretical analysis. Although our framework manifests good performance, it confronts potential privacy concerns, as do other works in federated learning. Solving this problem could be a promising direction for future work.

A. FED2GNN FOR THE GCN MODEL

The rooted tree recovers the structure-based information of a node. In practice, however, the last layer of the rooted tree is not fully split, which changes the degrees of nodes in the last two layers. Specifically, the nodes in the last layer have a constant degree of 2 (including the self-loop) and the nodes in the second-to-last layer have a constant degree of 3 (including the self-loop), which affects the encoding process of models such as GCN (Kipf & Welling, 2017). Fortunately, we can slightly change the encoding process and manipulate the edge weights to solve this problem. Recall that the representation of a GCN model is formulated as

H_i^{(k)} = σ( Σ_{j∈N_i∪{i}} 1/√((|N_i|+1)·(|N_j|+1)) · H_j^{(k-1)} W^{(k)} ).

In the 1-hop neighborhood reconstruction phase, instead of sending Σ_{m∈N_i^{cp}} 1_z[c(m)] x_m directly, the passive client P_z sends Σ_{m∈N_i^{cp}} 1_z[c(m)] x_m/√(|N_m|+1) to the active client, and sends Σ_{m∈N_i^{cp}} 1_z[c(m)] Σ_{j∈N_m} x_j/√(|N_j|+1) in the 2-hop neighborhood reconstruction phase. With a decoder, we get the two sets G_i = { x_m/√(|N_m|+1) | m ∈ N_i^{cp} } and H_i = { Σ_{j∈N_m\{i}} x_j/√(|N_j|+1) | m ∈ N_i^{cp} }.
We represent the 1-hop neighbors with features $\frac{\sqrt{3}\, x_m}{\sqrt{|N_m|+1}}$ and the 2-hop neighbors with features $\sum_{j \in N_m \setminus \{i\}} \frac{\sqrt{2}\, x_j}{\sqrt{|N_j|+1}}$, $\forall m \in N_i^{cp}$. We further manipulate the edge weights between node $i$ and the generated neighbor nodes, including the 1-hop neighbors $m \in N_i^{cp}$ and the corresponding 2-hop neighbor $j_m$ connected to $m$. For a pair of nodes $A, B$, we denote the message passing from $A$ to $B$ as $A \to B$. In undirected graphs, messages pass between two nodes bidirectionally, hence we present the weights of $A \to B$ and $B \to A$ separately. We have the following results:

Direction        Weight
$i \to m$        $\sqrt{3} \,/\, \big(\sqrt{|N_i|+1}\,(|N_m|+1)\big)$
$m \to i$        $1 \,/\, \big(\sqrt{3}\,\sqrt{|N_i|+1}\big)$
$m \to m$        $1 \,/\, (|N_m|+1)$
$j_m \to m$      $\sqrt{3} \,/\, \big(\sqrt{2}\,(|N_m|+1)\big)$

Note that the message passing $m \to j_m$ does not affect the encoding process for node $i$; we omit it here.

Proof. For ease of presentation, for every node $v \in \mathcal{V}$, we denote by $D_v$ the degree of $v$ including the self-loop. For a two-layer GCN model, we have:

$\hat{y}_i = \sigma_1\Big( \sum_{m \in N_i \cup \{i\}} \tfrac{1}{\sqrt{D_i D_m}}\, \sigma_2\Big( \sum_{j \in N_m \cup \{m\}} \tfrac{1}{\sqrt{D_m D_j}} x_j W^{(1)} \Big) W^{(2)} \Big)$
$= \sigma_1\Big( \Big( \tfrac{1}{D_i}\, \sigma_2\Big( \big( \tfrac{1}{D_i} x_i + \sum_{j \in N_i} \tfrac{1}{\sqrt{D_i D_j}} x_j \big) W^{(1)} \Big) + \sum_{m \in N_i} \tfrac{1}{\sqrt{D_i D_m}}\, \sigma_2\Big( \big( \tfrac{1}{D_m} x_m + \sum_{j \in N_m} \tfrac{1}{\sqrt{D_m D_j}} x_j \big) W^{(1)} \Big) \Big) W^{(2)} \Big)$

In our method, we manipulate the edge weights and the features of the generated neighbors such that the encoding result is the same as Equation 9:

$\hat{y}_i = \sigma_1\Big( \Big( \tfrac{1}{D_i}\, \sigma_2\Big( \big( \tfrac{1}{D_i} x_i + \sum_{j \in N_i} \tfrac{1}{\sqrt{3 D_i}} \cdot \tfrac{\sqrt{3}\, x_j}{\sqrt{D_j}} \big) W^{(1)} \Big) + \sum_{m \in N_i} \tfrac{1}{\sqrt{3 D_i}}\, \sigma_2\Big( \big( \tfrac{1}{D_m} \cdot \tfrac{\sqrt{3}\, x_m}{\sqrt{D_m}} + \tfrac{\sqrt{3}\, x_i}{\sqrt{D_i}\, D_m} + \tfrac{\sqrt{3}}{\sqrt{2}\, D_m} \sum_{j \in N_m \setminus \{i\}} \tfrac{\sqrt{2}\, x_j}{\sqrt{D_j}} \big) W^{(1)} \Big) \Big) W^{(2)} \Big)$

Since $\sigma_2$ is the ReLU function and therefore positively homogeneous, rearranging the above formula yields Equation 9.

B PROOF OF THEOREM 4.1

Theorem 4.1 is reproduced from Theorem 2.1 in Lu & Lu (2020), which is stated as below:

Theorem B.1 (Theorem 2.1 in Lu & Lu (2020)). Let $P$ and $Q$ be the target and the source distributions respectively, both defined on $\mathbb{R}^d$.
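The weight table above can be verified numerically: on a toy graph, the two-layer GCN encoding of the induced subgraph around node $i$ matches the encoding of the reconstructed rooted tree with scaled features and manipulated edge weights. The following NumPy sketch uses an illustrative graph, random features, and random weight matrices of our own choosing; it is a sanity check, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy undirected graph around root i = 0; adjacency lists (no self-loops).
adj = {0: [1, 2], 1: [0, 3, 5], 2: [0, 4], 3: [1], 4: [2], 5: [1]}
n, d_in, d_hid, d_out = 6, 6, 4, 3
X = rng.normal(size=(n, d_in))
W1 = rng.normal(size=(d_in, d_hid))
W2 = rng.normal(size=(d_hid, d_out))
D = {v: len(nb) + 1 for v, nb in adj.items()}     # degree incl. self-loop
relu = lambda z: np.maximum(z, 0.0)

# Ground truth: two-layer GCN on the full graph, prediction for node 0.
def gcn_layer(H, W):
    out = np.zeros((n, W.shape[1]))
    for v, nb in adj.items():
        out[v] = sum(H[j] / np.sqrt(D[v] * D[j]) for j in nb + [v]) @ W
    return out

H1 = relu(gcn_layer(X, W1))
y_true = gcn_layer(H1, W2)[0]

# Rooted-tree encoding for node 0: scaled neighbour features plus the
# manipulated edge weights (i->m, m->i, m->m self-loop, j_m->m).
i = 0
z_root = X[i] / D[i] + sum(                       # m -> i: 1/sqrt(3*D_i)
    (np.sqrt(3) * X[m] / np.sqrt(D[m])) / np.sqrt(3 * D[i]) for m in adj[i])
y_tree = relu(z_root @ W1) @ W2 / D[i]            # root self-loop: 1/D_i
for m in adj[i]:
    x_m = np.sqrt(3) * X[m] / np.sqrt(D[m])       # generated 1-hop feature
    x_jm = sum(np.sqrt(2) * X[j] / np.sqrt(D[j])  # merged 2-hop node j_m
               for j in adj[m] if j != i)
    z_m = (x_m / D[m]                                      # m -> m
           + X[i] * np.sqrt(3) / (np.sqrt(D[i]) * D[m])    # i -> m
           + x_jm * np.sqrt(3) / (np.sqrt(2) * D[m]))      # j_m -> m
    y_tree = y_tree + relu(z_m @ W1) @ W2 / np.sqrt(3 * D[i])  # m -> i

assert np.allclose(y_true, y_tree)
```

Because ReLU is positively homogeneous, the scaling factors pass through $\sigma_2$, which is what makes the two encodings agree exactly.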
Assume that $Q$ is absolutely continuous with respect to the Lebesgue measure and $\mathbb{E}_{X \sim P}|X|^3 < \infty$. Then for any given approximation error $\varepsilon$, there exists a positive integer $n = O(\frac{1}{\varepsilon^d})$ and a fully connected, feed-forward deep neural network $u(\cdot)$ of depth $L = \lceil \log_2 n \rceil$ and width $N = 2^L = 2^{\lceil \log_2 n \rceil}$, with $d$ inputs, a single output, and ReLU activation, such that $W_1((\nabla u)_{\#} Q, P) \le \varepsilon$.

To prove Theorem 4.1, we only need to verify the conditions stated in the above theorem. We need to show that $P = P_i$ satisfies $\mathbb{E}_{X \sim P}|X|^3 < \infty$ and that $Q = Q_i$ is absolutely continuous with respect to the Lebesgue measure. Furthermore, we also need to establish the connection between $W_1(\cdot, \cdot)$ and $W_2(\cdot, \cdot)$.

Proof. $P = P_i$ has a bounded third moment because the support of $P$ lies in a bounded subset of $\mathbb{R}^d$. $Q = Q_i$ is absolutely continuous with respect to the Lebesgue measure because $P_i$ is absolutely continuous with respect to the Lebesgue measure. Last, we show the connection between $W_1(\cdot, \cdot)$ and $W_2(\cdot, \cdot)$. Note that the support of $P = P_i$ is bounded, i.e., there exists $C < \infty$ such that $\|\delta\| < C$ for all $\delta \in \mathrm{supp}(P)$. According to Lu & Lu (2020), $\tilde{Q} = (\nabla u)_{\#} Q$ also has bounded support. Without loss of generality, we have $\|\delta\| < C$ for all $\delta \in \mathrm{supp}(\tilde{Q})$. Then, we show that

$W_2^2(P, \tilde{Q}) = \inf_{\gamma \in \Gamma(P, \tilde{Q})} \int_{\mathcal{Z} \times \mathcal{Z}'} \|Z - Z'\|_2^2 \, d\gamma(Z, Z') \le 2C \inf_{\gamma \in \Gamma(P, \tilde{Q})} \int_{\mathcal{Z} \times \mathcal{Z}'} \|Z - Z'\|_2 \, d\gamma(Z, Z') \le 2C \sqrt{d_x} \inf_{\gamma \in \Gamma(P, \tilde{Q})} \int_{\mathcal{Z} \times \mathcal{Z}'} \|Z - Z'\|_1 \, d\gamma(Z, Z') = 2C \sqrt{d_x}\, W_1(P, \tilde{Q}) < 2C \sqrt{d_x}\, \varepsilon.$

As $C$ and $\sqrt{d_x}$ are constants, we have $W_2^2(P, \tilde{Q}) = O(\varepsilon)$. The first inequality holds because $\|Z - Z'\|_2 \le \|Z\|_2 + \|Z'\|_2 \le 2C$. The second inequality holds because

$\sqrt{d_x} \inf_{\gamma \in \Gamma(P, \tilde{Q})} \int_{\mathcal{Z} \times \mathcal{Z}'} \|Z - Z'\|_1 \, d\gamma(Z, Z') = \lim_{i \to \infty} \int_{\mathcal{Z} \times \mathcal{Z}'} \sqrt{d_x}\, \|Z - Z'\|_1 \, d\gamma_i(Z, Z')$ (there exists a sequence of measures $\{\gamma_i\}_{i=1}^{\infty}$ achieving the infimum) $\ge \lim_{i \to \infty} \int_{\mathcal{Z} \times \mathcal{Z}'} \|Z - Z'\|_2 \, d\gamma_i(Z, Z') \ge \inf_{\gamma \in \Gamma(P, \tilde{Q})} \int_{\mathcal{Z} \times \mathcal{Z}'} \|Z - Z'\|_2 \, d\gamma(Z, Z').$
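The two elementary inequalities used in the chain above can be checked numerically on samples drawn from a ball of radius $C$, matching the bounded-support assumption; the dimensions and constants below are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, C = 16, 5.0  # illustrative dimension and support radius

# Sample points uniformly in direction from a ball of radius C.
def sample_ball(k):
    v = rng.normal(size=(k, d_x))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return v * rng.uniform(0.0, C, size=(k, 1))

Z, Zp = sample_ball(1000), sample_ball(1000)
diff = Z - Zp
l1 = np.abs(diff).sum(axis=1)            # ||Z - Z'||_1
l2 = np.linalg.norm(diff, axis=1)        # ||Z - Z'||_2

# First inequality: ||Z - Z'||_2 <= 2C, hence ||.||_2^2 <= 2C * ||.||_2.
assert np.all(l2 <= 2 * C + 1e-9)
assert np.all(l2 ** 2 <= 2 * C * l2 + 1e-9)

# Second inequality: ||.||_2 <= sqrt(d_x) * ||.||_1 (in fact ||.||_2 <= ||.||_1).
assert np.all(l2 <= np.sqrt(d_x) * l1 + 1e-9)
```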

C DATASET STATISTICS

In this section, we present details of the datasets used in our experiments. We conduct experiments on three citation network datasets: Cora (Sen et al., 2008), MSAcademic (Shchur et al., 2018), and DBLP (Fu et al., 2020). We summarize the statistics of the datasets in Table 3. Meanwhile, we also summarize the statistics of the subgraphs obtained by the Louvain algorithm (Blondel et al., 2008) in Table 4.

E.1 CONVERGENCE ANALYSIS

In this subsection, we repeat each experiment five times and show the average (a) training accuracy curves and (b) loss curves in Figure 5 to demonstrate the convergence of our model. In addition to the Louvain split method applied in previous work (Zhang et al., 2021), we also apply a random split method, which randomly assigns nodes to subgraphs and keeps only the edges between the selected nodes within each subgraph. The results show that the Fed 2 GNN model achieves almost the same convergence as the FedAvg-Full baseline. Since nodes are randomly assigned to subgraphs, there are more cross-client edges under the random split method; therefore, the advantage of our method is more evident under the random split.
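The random split described above can be sketched in a few lines: nodes are assigned to clients uniformly at random, each client keeps only the edges with both endpoints in its own partition, and every cross-client edge is dropped from the local subgraphs. The function and container shapes here are illustrative, not the paper's code.

```python
import random

def random_split(nodes, edges, num_clients, seed=0):
    """Randomly assign nodes to clients; a client keeps an edge only if
    both endpoints landed in its partition. Returns the per-client
    (nodes, edges) pairs and the number of dropped cross-client edges."""
    rng = random.Random(seed)
    owner = {v: rng.randrange(num_clients) for v in nodes}
    parts = [([], []) for _ in range(num_clients)]
    for v, c in owner.items():
        parts[c][0].append(v)
    num_cross = 0
    for u, v in edges:
        if owner[u] == owner[v]:
            parts[owner[u]][1].append((u, v))
        else:
            num_cross += 1          # cross-client edge: lost locally
    return parts, num_cross

# Toy usage: a 6-node cycle split over 3 clients.
nodes = list(range(6))
edges = [(i, (i + 1) % 6) for i in range(6)]
parts, num_cross = random_split(nodes, edges, num_clients=3)
assert sum(len(p[0]) for p in parts) == len(nodes)
assert sum(len(p[1]) for p in parts) + num_cross == len(edges)
```

With uniform assignment most edges end up crossing clients, which is why the missing-neighborhood problem is more pronounced under this split than under Louvain.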

E.2 DECODER TRAINING METHODS STUDY

In this subsection, we study the training methods of the decoder by comparing the test accuracy obtained under different decoder training methods. We try both locally trained and federated trained decoders, and use FedAvg as the baseline for comparison. We conduct experiments on the Cora and DBLP datasets using the hyper-parameter settings in Table 6. As shown in Figure 6, the federated trained decoder is only slightly better than the locally trained decoder, showing that the decoder generalizes well when decoding information from unseen data.

E.3 COMMUNICATION COST

Communication cost is widely recognized as a major bottleneck for federated learning. Particularly in federated graph learning, additional communication across clients may be required to recover missing neighborhood information. Recall that each client recursively aggregates information from its neighbor nodes during the information transmission process in our method; the communication cost is O(ndK), where n is the number of nodes, d is the average node degree, and K is the number of layers of the GNN model. We conduct experiments to evaluate the communication cost on real-world datasets against the baselines FedGCN (Yao & Joe-Wong, 2022) and FedGraph (Chen et al., 2022), whose communication costs are O(nd^K) and O(ndT) respectively, where T ≫ K is the number of training iterations. We ignore the influence of the feature dimension since all methods can reduce the feature dimension in advance. The results are presented in Figure 7, where we vary the number of GNN layers and the number of clients. Our method has the lowest cost in all settings. Meanwhile, partitioning the graph into more clients incurs more communication cost, since more nodes have missing neighborhood information.
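The asymptotic costs above can be compared directly. The sketch below plugs in illustrative numbers (Cora-scale n, a small average degree, and an assumed T = 100 training iterations); it only demonstrates the scaling behaviour, not measured traffic.

```python
# Asymptotic communication-cost models from the discussion above:
# ours O(n*d*K), FedGCN O(n*d^K), FedGraph O(n*d*T) with T >> K.
def cost_ours(n, d, K):
    return n * d * K

def cost_fedgcn(n, d, K):
    return n * d ** K          # grows exponentially in depth K

def cost_fedgraph(n, d, T):
    return n * d * T           # paid every training iteration

n, d, T = 2708, 4, 100         # illustrative: Cora-scale n, assumed T
for K in (2, 3, 5):
    assert cost_ours(n, d, K) <= cost_fedgcn(n, d, K)
    assert cost_ours(n, d, K) < cost_fedgraph(n, d, T)
```

The gap to FedGCN widens rapidly with depth, which is the regime where deep GNNs live.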

E.4 EFFICIENCY OF CONSTRUCTING ROOTED TREES

Constructing the rooted tree for a node is simple. Algorithm 2 presents the pseudo-code of the process for one tree. It mainly consists of a sequence of prediction steps of the decoder ϕ. For each vector h ∈ H_k, the decoder takes it as input and outputs G and H (line 9 in Algorithm 2). The (K+1)-th layer of the tree is constructed by generating nodes with features corresponding to the elements in G and connecting them to the node g corresponding to h. In practice, the vectors in the set H_k can be fed to the decoder simultaneously. Moreover, we can even input all H_1's (of different root nodes) into the decoder to construct all rooted trees simultaneously. To that end, building rooted trees of K layers for all nodes requires the decoder to predict only K − 1 times. We conduct experiments to evaluate the efficiency of the tree-building process and report the average tree-building time and training time over all clients. All experiments are run on a server equipped with an A100 GPU. The results are presented in Figure 8. They show that the tree-building time increases as the number of layers grows. Indeed, as the number of layers increases, the number of generated nodes grows exponentially, hence requiring more time. However, building rooted trees for all nodes remains efficient, requiring only a few seconds.

F ADDITIONAL PROTOCOL AND ALGORITHM

µ⁰_i, f⁰_i ← x_i
for P_p ∈ C_i do
    P_p sends q¹_p ← Σ_{m ∈ N_i^{cp}} 1_p[c(m)] µ⁰_m to P_{c(i)}
end
f¹_i ← Σ_{P_p ∈ C_i} q¹_p
µ¹_i ← Σ_{P_p ∈ C_i ∪ {P_{c(i)}}} q¹_p
η¹_i ← µ¹_i − f¹_i
⋮
    f^k_i ← Σ_{P_p ∈ C_i} q^k_p − |N_i^{cp}| · f^{k−2}_i
    µ^k_i ← Σ_{P_p ∈ C_i ∪ {P_{c(i)}}} q^k_p − (|N_i^{cp}| + |N_i^p|) · µ^{k−2}_i
    η^k_i ← µ^k_i − f^k_i
else
    f^k_i ← Σ_{P_p ∈ C_i} q^k_p − (|N_i^{cp}| − 1) · f^{k−2}_i − |N_i^{cp}| · η^{k−2}_i
    µ^k_i ← Σ_{P_p ∈ C_i ∪ {P_{c(i)}}} q^k_p − (|N_i^{cp}| + |N_i^p| − 1) · µ^{k−2}_i
    η^k_i ← µ^k_i − f^k_i
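The batched, level-by-level construction described above can be sketched as follows. The decoder here is a stub standing in for the trained ϕ (its fan-out and outputs are illustrative); what the sketch demonstrates is that feeding the whole frontier as one batch keeps the number of decoder invocations at K − 1 regardless of how many trees are built.

```python
import numpy as np

class StubDecoder:
    """Placeholder for the trained decoder phi: maps a batch of
    aggregate vectors to (child features G, child aggregates H).
    The splitting rule here is a dummy; only the shapes matter."""
    def __init__(self, fanout):
        self.fanout = fanout
        self.calls = 0
    def __call__(self, batch):                     # batch: (B, dim)
        self.calls += 1
        G = np.repeat(batch[:, None, :] / self.fanout, self.fanout, axis=1)
        H = G.copy()                               # dummy next aggregates
        return G, H

def build_rooted_trees(H1, decoder, K):
    """Build depth-K rooted trees for all roots at once: one decoder
    call per tree level, i.e. K - 1 calls in total."""
    levels = [H1]                                  # level k: (nodes_k, dim)
    frontier = H1
    for _ in range(K - 1):
        G, H = decoder(frontier)
        levels.append(G.reshape(-1, G.shape[-1]))
        frontier = H.reshape(-1, H.shape[-1])
    return levels

dec = StubDecoder(fanout=3)
levels = build_rooted_trees(np.ones((4, 8)), dec, K=3)
assert dec.calls == 2                              # K - 1 invocations
assert [l.shape[0] for l in levels] == [4, 12, 36]
```

The exponential growth of the level sizes (4 → 12 → 36 here) is exactly why tree-building time grows with the number of layers in Figure 8.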



Code available at https://www.dropbox.com/s/unizcyixsmip0je/Fed%5E2GNN.zip? dl=0



Figure 1: Examples of constructing rooted trees with different neighborhoods for node i.

Figure 2: Visual illustration of the information transmitting process across clients and the approach to get the complete graph by constructing a rooted tree. The information transmitted between clients is encoded and represented by the bold arrow.


Figure 3: Results of Fed 2 GNN for training deep GNNs with different number of layers.

Figure 5: Average accuracy curves and loss curves on DBLP, M = 10

Figure 7: Communication cost of Fed 2 GNN in comparison with FedGCN (Yao & Joe-Wong, 2022) and FedGraph (Chen et al., 2022) for training models with different numbers of layers and different numbers of clients.

Federated Learning (FL) is a privacy-preserving computing technology that enables collaborative machine learning without exposing private data. It was first realized by FedAvg (McMahan et al., 2017), which allows clients to collaboratively train a common model while each of them only possesses a local dataset. During training, each client periodically uploads its local update to the server. The server then aggregates the updates into a global model and distributes the model back to the clients for further training. Recently, personalized federated learning
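The FedAvg loop described above can be sketched with a toy scalar model; the local objective (mean estimation via gradient steps) and all names below are illustrative, not the method of McMahan et al. (2017) verbatim.

```python
import numpy as np

def local_update(w, data, lr=0.5, steps=5):
    """Toy local training: gradient steps on a squared loss, pulling
    the scalar model w toward the client's local sample mean."""
    for _ in range(steps):
        w = w - lr * (w - float(np.mean(data)))
    return w

def fedavg_round(w_global, client_data):
    """One FedAvg round: each client starts from the global model and
    trains locally; the server averages the resulting local models,
    weighted by the clients' sample counts."""
    local_ws = [local_update(float(w_global), d) for d in client_data]
    sizes = np.array([len(d) for d in client_data], dtype=float)
    return float(np.dot(sizes / sizes.sum(), local_ws))

# Two clients with different local distributions and sizes.
clients = [np.array([1.0, 1.0]), np.array([3.0, 3.0, 3.0, 3.0])]
w = 0.0
for _ in range(20):
    w = fedavg_round(w, clients)

# The global model converges toward the pooled mean (1*2 + 3*4)/6 = 7/3.
assert abs(w - 7.0 / 3.0) < 0.05
```

The sample-count weighting is what makes the fixed point the pooled-data optimum rather than the unweighted average of client optima.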

Summary of node classification accuracy results in percent (GCN, GraphSAGE).

Summary of node classification accuracy results in percent (5-layer deep GNN).

Statistics of the datasets, where |V| represents the number of nodes, |E| the number of edges, |X| the number of features, and Classes the number of label classes.

Statistics of subgraphs, where |V| represents the average number of nodes, |E| represents the average number of edges and ∆|E| is the total number of the cross-client edges.


Protocol 1: Neighbor information transmission

Algorithm 2: Rooted tree construction. Input: node i and its neighbor information; the decoder ϕ trained in advance; an operator ZIP which iterates over two sets in parallel, producing a set of tuples with one item from each. Output: the rooted tree for i.

