A NEW PARADIGM FOR FEDERATED STRUCTURE NON-IID SUBGRAPH LEARNING
Anonymous authors
Paper under double-blind review

Abstract

Federated graph learning (FGL), a distributed training framework for graph neural networks (GNNs), has attracted much attention for relaxing the assumptions of centralized machine learning. Despite its effectiveness, differences in data-collection perspectives and quality lead to heterogeneity challenges, especially when a domain-specific graph is partitioned into subgraphs held by different institutions. However, existing FGL methods implement graph data augmentation or personalization under the community split setting, which follows the cluster homogeneity assumption. We investigate this issue and suggest that subgraph heterogeneity is essentially a matter of structure variation. Based on observations of FGL, we first define the structure non-independent identically distributed (Non-IID) problem, which presents unique challenges among client-wise subgraphs. We then propose a new paradigm for general federated data settings called Adaptive Federated Graph Learning (AdaFGL). Its motivation is to implement adaptive propagation mechanisms based on federated global knowledge and non-parametric label propagation. We conduct extensive experiments under both community split and structure Non-IID settings; our approach achieves state-of-the-art performance on five benchmark datasets.

1. INTRODUCTION

The graph, as a relational data structure, is widely used to model real-world entity relations such as citation networks Yang et al. (2016a), recommender systems Wu et al. (2022), drug discovery Gaudelet et al. (2021), and particle physics Shlomi et al. (2021). However, due to collection agents and privacy concerns, the global domain-specific graph generally consists of many subgraphs collected by multiple institutions. To analyze its local subgraph, each client maintains a powerful graph mining model such as a graph neural network (GNN), which has achieved state-of-the-art performance in many graph learning tasks Zhang et al. (2022b); Hu et al. (2021); Zhang & Chen (2018). Despite its effectiveness, limited data yields sub-optimal performance in most cases. Motivated by the success of federated learning (FL), a natural idea is to combine GNNs with FL to utilize the distributed subgraphs. Recently, federated graph learning (FGL) He et al. (2021); Wang et al. (2022b) has been proposed to achieve collaborative training without directly sharing data, yet an essential concern is the heterogeneity of the distributed subgraphs. Notably, graph heterogeneity differs from the label or feature heterogeneity studied in computer vision or natural language processing; we suggest that it depends on the graph structure. However, existing FGL methods simulate federated subgraph distributions through community split, which follows the cluster homogeneity assumption, as shown in Fig. 1(a). Specifically, community split leaves the subgraph structure consistent with the original graph, e.g., connected nodes are more likely to have the same labels. This is overly idealized and hard to satisfy in reality, so we consider a more reasonable setting shown in Fig. 1(c). We refer to this problem as structure non-independent identically distributed (Non-IID).
The motivation is that the graph structure is directly related to node labels and feature distributions, and the challenges of structure heterogeneity are ubiquitous in the real world Zheng et al. (2022b). For instance, in citation networks, we can consider research teams focused on computer science and on intersectional fields (e.g., AI in Science) Shlomi et al. (2021); Gaudelet et al. (2021) as clients. In online transaction networks, fraudsters are more likely to build connections with customers than with other fraudsters Pandit et al. (2007); we can consider different regions as clients that detect financial fraudsters by analyzing online transaction subgraphs. Specifically, graph structure can be divided into two types: homogeneity means that connected nodes are more likely to have the same label and similar feature distributions, and heterogeneity is the opposite. To explain this intuitively, we visualize the result of partitioning Cora into 3 clients in Table 1 and Table 2, where Homo represents the homogeneity degree of the local subgraph, computed by a popular metric Pei et al. (2020). Obviously, compared with community split, which follows the cluster homogeneity assumption and the uniform distribution principle, structure Non-IID brings challenges to existing FGL methods. Based on this, we investigate these issues through the empirical analysis shown in Fig. 2. According to the results, we observe that when the original graph satisfies the homogeneity assumption, the label distributions are Non-IID, and the opposite holds when the original graph is heterogeneous. This is because the nodes partitioned into the same client form communities and follow the uniform distribution principle. In addition, the local accuracy indicates that the subgraph structure plays a more important role in FGL than the label distributions, which also supports our motivation.
In terms of model performance, we observe that GGCN alleviates the structure Non-IID problem, and FedSage+ trains NeighGen to implement local subgraph augmentation by sharing node embeddings. However, these methods fail to achieve results as competitive as SGC on homogeneous subgraphs while considering heterogeneity. To efficiently analyze distributed subgraphs with both homogeneity and heterogeneity, we propose a simple pipeline called Adaptive Federated Graph Learning (AdaFGL) for more general federated data settings, which consists of three main parts. Specifically, it starts by analyzing the subgraph structure through non-parametric label propagation and selects the appropriate base model: (i) the federated global knowledge extractor (e.g., MLP, powerful GNNs, or any reasonable embedding model), which does not rely on any learning over the subgraph. Then, the base predictor is trained on the global data, which can be done offline or in parallel with local training, benefiting from the flexibility of our approach. Finally, the local client applies one of two adaptive propagation mechanisms: (ii) the homogeneity propagation module or (iii) the heterogeneity propagation module, depending on the local subgraph. Notably, with non-parametric label propagation, the above selection is adaptive. To summarize, this paper defines the structure Non-IID problem, proposes the AdaFGL paradigm, and validates it extensively: compared with the baselines, our method achieves performance gains of 4.67% and 2.65% in the structure Non-IID and community split data settings, respectively.

2. PRELIMINARIES

In this section, we first introduce the semi-supervised node classification task. Then, we review prior diverse GNNs and very recent FGL methods. Consider a graph G = (V, E) with |V| = n nodes and |E| = m edges. The adjacency matrix (including self-loops) is denoted as Â ∈ R^{n×n}, the feature matrix is denoted as X = {x_1, x_2, ..., x_n}, where x_v ∈ R^f represents the feature vector of node v and f is the dimension of the node attributes. Besides, Y = {y_1, y_2, ..., y_n} is the label matrix, where y_v ∈ R^{|Y|} is a one-hot vector and |Y| represents the number of node classes. The semi-supervised node classification task is based on the topology of the labeled set V_L and the unlabeled set V_U, and the nodes in V_U are predicted by a model supervised on V_L.
GNNs. As the most popular GNN method, the forward propagation of the l-th GCN layer Kipf & Welling (2017) is formulated as X^(l) = σ(Ã X^(l-1) W^(l)), Ã = D̂^{r-1} Â D̂^{-r}, where D̂ represents the degree matrix of Â, r ∈ [0, 1] denotes the convolution kernel coefficient, W^(l) represents the trainable weight matrix, and σ(·) represents a non-linear activation function. In GCN, we set r = 1/2, and D̂^{-1/2} Â D̂^{-1/2} is called the symmetric normalized adjacency matrix. Despite their effectiveness, such models have limitations on real-world graphs, which exhibit complex heterogeneous relationship patterns. Some recent studies generalize message passing to multi-hop neighborhoods: m_v^(l) = Aggregate^(l)({h_u^(l-1) | u ∈ N*(v)}), h_v^(l) = Update^(l)(h_v^(l-1), m_v^(l)), where h_u denotes the information of the multi-hop neighbors N*(v), m_v^(l) represents the messages of node v from the previous layers, and Aggregate(·) and Update(·) denote the message aggregation function and update function, respectively. However, these methods suffer from high computational complexity and fail to achieve competitive performance on homogeneous graphs.
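As a concrete reference for the GCN forward propagation above, the symmetric normalized adjacency and a single layer can be sketched in NumPy as follows. This is a minimal illustration; the toy graph, shapes, and ReLU activation are our own assumptions, not the paper's implementation.

```python
import numpy as np

def sym_norm_adj(adj_hat: np.ndarray) -> np.ndarray:
    """Symmetric normalization D^{-1/2} A_hat D^{-1/2} (i.e., r = 1/2)."""
    deg = adj_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return adj_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Toy 3-node graph with self-loops already added (A_hat = A + I).
A_hat = np.array([[1., 1., 0.],
                  [1., 1., 1.],
                  [0., 1., 1.]])
A_tilde = sym_norm_adj(A_hat)

# One GCN layer: X' = sigma(A_tilde @ X @ W), here with ReLU.
X = np.random.randn(3, 4)
W = np.random.randn(4, 2)
X_out = np.maximum(A_tilde @ X @ W, 0.0)
```

Stacking such layers recovers the multi-layer propagation X^(l) = σ(Ã X^(l-1) W^(l)) described in the text.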
FGL has received growing attention for breaking centralized graph machine-learning assumptions. FedGraphNN He et al. (2021) and FS-G Wang et al. (2022b) propose general FGL packages that cover a wide range of graph learning tasks. GCFL Xie et al. (2021) and FED-PUB Baek et al. (2022) investigate personalization techniques at the graph level and node level, respectively. Furthermore, some recent studies improve performance with local subgraph augmentation, including FedGNN Wu et al. (2021), FedGL Chen et al. (2021), and FedSage Zhang et al. (2021). Inspired by FS-G Wang et al. (2022b), we can view the collaborative training process in FGL as modules. Specifically, we model the information uploaded by the clients, such as gradients and node embeddings, as messages, and we treat the server-side processing and broadcasting of results as the various message-handling mechanisms. Here we illustrate GNNs combined with collaborative training. The generic form with N clients is defined as
FGL-Clients (Local Update) → min (1/N) Σ_{i=1}^N E_{(A_i, X_i, Y_i) ∼ D_i} [L_ce(f_{θ_i}(A_i, X_i), Y_i)],
L_ce(f_{θ_i}(A_i, X_i), Y_i) = − Σ_{i ∈ V_L} Σ_j [Y_ij log(Softmax(Ỹ_ij)) + (1 − Y_ij) log(1 − Softmax(Ỹ_ij))],
where f_{θ_i} and L_ce are the i-th local GNN with parameters θ_i and the cross-entropy loss function, respectively; the loss can be replaced by any other appropriate loss function depending on the task, and (A_i, X_i, Y_i) ∼ D_i represents the local subgraph (A_i, X_i, Y_i) sampled from the distribution D_i. FedAvg McMahan et al. (2017) is an efficient FL algorithm, whose aggregation can be defined as
FGL-Server (Aggregate) → ∀i, W_i^{t+1} ← W_i^t − η g_i, W^{t+1} ← Σ_{i=1}^N (n_i/n) W_i^{t+1},
where t represents the round number of the FL process, W represents the model weights, η represents the learning rate, g_i represents the gradient calculated from the local loss, and n_i and n represent the i-th local client's data size and the global data size, respectively.
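The FedAvg server aggregation above, W^{t+1} ← Σ_i (n_i/n) W_i^{t+1}, can be sketched as follows. This is a minimal illustration with hypothetical client weight dictionaries, not the paper's training code.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Server aggregation: W <- sum_i (n_i / n) * W_i (FedAvg).

    client_weights: list of {param_name: ndarray} dicts, one per client.
    client_sizes:   list of local data sizes n_i.
    """
    n = sum(client_sizes)
    agg = {}
    for key in client_weights[0]:
        agg[key] = sum((n_i / n) * w[key]
                       for w, n_i in zip(client_weights, client_sizes))
    return agg

# Two hypothetical clients holding 30 and 70 nodes respectively.
w1 = {"W": np.ones((2, 2))}
w2 = {"W": 3 * np.ones((2, 2))}
global_w = fedavg([w1, w2], [30, 70])  # 0.3 * 1 + 0.7 * 3 = 2.4
```

The weighting by n_i/n is exactly the size-proportional averaging in the FGL-Server step.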

3. ADAFGL PIPELINE

The basic idea of AdaFGL is to perform adaptive propagation based on federated global knowledge and non-parametric label propagation. The pipeline consists of three main parts, as shown in Fig. 3, which combine the global knowledge embeddings and local structure properties. This decoupling utilizes the computational capacity of the local system while minimizing communication costs and the risk of privacy leakage. AdaFGL can benefit from the evolution of both FL and GNNs through the base predictor and adaptive propagation. Notably, the base predictor obtained by federated training and the personalized propagation are viewed as two decoupled modules that are executed sequentially, and both accomplish training without sharing local private data.

3.1. FEDERATED GLOBAL KNOWLEDGE EXTRACTOR

In FGL, limited data yields sub-optimal performance in most cases. Therefore, AdaFGL starts by performing non-parametric label propagation as an adaptive process; note that this process does not rely on any learning over the subgraph. Specifically, the labeled nodes are initialized as y_v^0 = y_v, ∀v ∈ V_L, and the unlabeled nodes are initialized as y_u^0 = (1/|Y|, ..., 1/|Y|), ∀u ∈ V_U. Then, the k-step non-parametric label propagation is expressed as
y_u^k = graph-aggregator({y_v^{k-1} | v ∈ N_u}) = α y_u^0 + (1 − α) Σ_{v ∈ N_u} (1/√(d_v d_u)) y_v^{k-1}.
We follow the approximate calculation of the personalized PageRank matrix Klicpera et al. (2019), where N_u represents the one-hop neighbors of u, and we set α = 0.5 by default. Then, we design the homogeneity confidence score (HCS), computed from the number of correct predictions, with a default boolean-mask ratio of 0.5. Finally, we set a threshold λ for the adaptive binary selection between the homogeneity propagation module and the heterogeneity propagation module in each client. In the experiments, we set λ = 0.6 by default. To demonstrate that AdaFGL is a simple yet effective framework, we choose simple models (e.g., MLP or SGC) and FedAvg for federated training. Due to the flexibility of AdaFGL, they can be replaced by any other powerful GNNs and federated methods. From the perspective of FL on Non-IID data, we choose MLP as the default base predictor, which is independent of the graph structure. Then we quote the convergence theorem Li et al. (2020): after T rounds with E local epochs, the federated global knowledge extractor error bound ϵ_fed is expressed as
ϵ_fed ≤ (2L / (μ² (γ + T − 1))) [ Σ_{i=1}^N (n_i/n) φ_i² + 6Lϕ + 8(E − 1)² ω² + (γ/4) ||W^1 − W*||² ].
It assumes that the mapping function is L-smooth and μ-strongly convex, where φ_i and ϕ represent the local stochastic gradient and the degree of model heterogeneity, respectively, γ = max{8L/μ, E}, ω denotes the divergence of the local models, and W* represents the global optimum.
We observe that the base predictor's error bound is mainly determined by the differences in node feature distributions, and model performance would be further hurt if the graph structure were naively incorporated. Therefore, we are motivated to propose adaptive propagation mechanisms. Specifically, we implement the binary selection between the homogeneity propagation module and the heterogeneity propagation module in each client by comparing the HCS value with the threshold λ. We describe the technical details of the personalized propagation strategies below.
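The non-parametric label propagation and the HCS computation described in this subsection can be sketched as follows. This is a simplified illustration; the exact masking and initialization details of AdaFGL may differ.

```python
import numpy as np

def label_propagation(A_hat, y0, k=10, alpha=0.5):
    """k-step personalized-PageRank-style label propagation:
    y^k = alpha * y^0 + (1 - alpha) * D^{-1/2} A_hat D^{-1/2} y^{k-1}.

    A_hat: adjacency with self-loops; y0: initial soft labels
    (one-hot for labeled nodes, uniform for unlabeled nodes).
    """
    deg = A_hat.sum(axis=1)
    d = 1.0 / np.sqrt(deg)
    A_tilde = A_hat * d[:, None] * d[None, :]
    y = y0.copy()
    for _ in range(k):
        y = alpha * y0 + (1 - alpha) * (A_tilde @ y)
    return y

def homogeneity_confidence_score(y_prop, labels, mask):
    """HCS: fraction of masked labeled nodes whose propagated
    prediction matches the ground truth. If HCS > lambda, the client
    selects the homogeneity propagation module."""
    pred = y_prop[mask].argmax(axis=1)
    return (pred == labels[mask]).mean()
```

On a perfectly homogeneous toy graph (two disconnected same-label cliques), propagation preserves the labels and the HCS reaches 1.0, so the client would choose the homogeneity module.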

3.2. ADAPTIVE HOMOGENEITY PROPAGATION

After that, we use the base predictor to embed the local subgraph nodes into the global knowledge space X_global and improve accuracy with the local homogeneous structure. The motivation is that feature propagation under homogeneity has a significant positive impact on prediction performance, which has been confirmed in many recent works Zhang et al. (2022a); Wang & Leskovec (2020). Hence, we expect to utilize locally smoothed features to correct the predictions. We first define the homogeneous feature propagation
X_smooth^(k) = graph-operator(A)^(k) X^(0), ∀k = 1, ..., K, H_homo = message-updater(X_smooth^(K)) = f_θ(X_smooth^(K)),
where graph-operator(·) represents the graph operator for feature propagation (by default the symmetric normalized adjacency shown in Eq. 1), X_smooth^(K) represents the locally smoothed features after K propagation steps, message-updater(·) denotes the model training process, and f_θ is a linear regression or MLP with parameters θ. To align the global embeddings with local information, we use a local message-update mechanism and online distillation to effectively combine the local smooth structure prior with the global embeddings, written as
H_local = W_local X_global, L_kd = ||H_homo − H_local||_F.
Based on this, the local smoothing information and the global embeddings supervise each other and are trained end-to-end by gradient updates, exploiting local structure information to reduce the error bound. Notably, this adaptive process is accomplished on the local client and incurs no additional communication costs or privacy concerns.
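A minimal sketch of the K-step feature smoothing and the Frobenius-norm distillation loss L_kd follows; the function names and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def smooth_features(A_tilde, X, k=3):
    """K-step homogeneous feature propagation: X_smooth = A_tilde^K X."""
    X_s = X
    for _ in range(k):
        X_s = A_tilde @ X_s
    return X_s

def distillation_loss(H_homo, H_local):
    """Online distillation: L_kd = ||H_homo - H_local||_F aligns the
    locally smoothed prediction with the global-embedding branch."""
    return np.linalg.norm(H_homo - H_local, ord="fro")
```

In training, H_homo comes from f_θ applied to the smoothed features and H_local from the linear map W_local applied to the global embeddings; minimizing L_kd realizes the mutual supervision described above.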

3.3. ADAPTIVE HETEROGENEITY PROPAGATION

In contrast, to break the heterogeneous structure limitations, we optimize the message-passing framework with the embeddings X_global to detect heterogeneous subgraph patterns. Specifically, we propose an adaptive propagation mechanism that discovers the global dependencies of the current node and models the positive or negative impact of the messages. Intuitively, we first optimize the propagation probability matrix and align the local structure with the global embeddings:
A_prop^(0) = X_global X_global^T, X_align = graph-operator(Â_prop^(0))^(k) X^(0).
Since the original propagation probability matrix introduces high error, we improve it by scaling the aggregated messages and making it trainable. Formally, let p_ij ∈ A_prop correspond to the i-th row and j-th column of A_prop; we define the scaling operator d_ij = dis(p_ii, p_ij) for j ≠ i, where dis(·) is a distance function, or any function positively related to the difference, which can be implemented as the identity distance. The corrected propagation matrix is then expressed as
Â_prop^(l) = A_prop^(l) / (d d^T) − diag(A_prop^(l)).
Its purpose is to measure the global dependency of the current node through the probability differences. We then further model the positive and negative impacts of the messages to implement effective aggregation, formally represented as
H^(l) = W H^(l−1), A_prop^(l) = Â_prop^(l−1) + β H^(l) H^(l)T,
H_pos^(l) = PoSign(Â_prop^(l)) H^(l), H_neg^(l) = NeSign(Â_prop^(l)) H^(l),
H^(l+1) = H^(l) + H_pos^(l) + H_neg^(l),
where H^(0) = X_align, and PoSign(·) and NeSign(·) represent the trainable adaptive propagation probabilities, which can be replaced by any reasonable non-linear activation functions. Here we analyze the error bound of the above adaptive heterogeneous propagation mechanism. The proof of the following theorem and the corresponding assumptions are given in Appendix A.1.
Theorem 3.1 Suppose that the latent ground-truth mapping Φ : x → y from node features to node labels is differentiable and satisfies an L-Lipschitz constraint. Then the following approximation error holds:
|| Σ_{j≠i} P*_ij Φ(H^(l)) − [ H_i^(l) + Σ_{j≠i} (Pos_ij^(l) + Neg_ij^(l)) H_j^(l) ] || ≤ [ L ||ϵ_i||_2 + Σ_{j≠i} P*_ij O(||H_j^(l) − H_i^(l)||²) ] + || H* − ϕ(κ + P) H^(l) ||_2,
where * denotes the global optimum, ϵ_i denotes the immediate-neighbor error, O(·) denotes a higher-order infinitesimal, and ϕ and κ represent the model difference and the propagation-matrix difference, respectively. The core of the above propagation mechanisms is to generate embeddings based on other nodes in the embedding space. In other words, any node representation can be mapped to a linear combination of existing node representations, which has been applied in many studies Zheng et al. (2022a); Yang et al. (2022). However, most of these methods use ranking mechanisms for representation and fail to model the propagation process, which limits them.
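The adaptive heterogeneity propagation step defined in Section 3.3 can be sketched as below. Note that PoSign(·)/NeSign(·) are trainable in the paper; implementing them here as the positive/negative parts of the propagation matrix is purely our assumption for illustration, and the dis(·)-based scaling correction is omitted.

```python
import numpy as np

def init_prop_matrix(X_global):
    """Initial propagation probabilities from the global embeddings:
    A_prop^(0) = X_global @ X_global^T."""
    return X_global @ X_global.T

def hetero_layer(H, A_prop_prev, W, beta=0.1):
    """One adaptive heterogeneity propagation step (simplified sketch):
    update the propagation matrix with current embeddings, split
    messages into positive and negative parts, and recombine."""
    H_new = H @ W                                   # H^(l) = W H^(l-1)
    A_prop = A_prop_prev + beta * (H_new @ H_new.T)  # add beta * H H^T
    pos = np.maximum(A_prop, 0.0) @ H_new            # PoSign(.) as positive part (assumption)
    neg = np.minimum(A_prop, 0.0) @ H_new            # NeSign(.) as negative part (assumption)
    return H_new + pos + neg, A_prop                 # H^(l+1) = H^(l) + H_pos + H_neg
```

Because the positive and negative parts sum back to A_prop, this sketch makes explicit how beneficial and harmful messages are modeled separately before being recombined with the residual term.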

4. EXPERIMENTS

In this section, we conduct an experimental analysis on five benchmark datasets under the community split and structure Non-IID settings to validate the effectiveness of AdaFGL. We aim to answer five questions (Q1–Q5).

4.2. OVERALL PERFORMANCE

We first present the complete results on Cora and Chameleon in Table 4 and Table 3, two representative homogeneous and heterogeneous datasets. Due to space limitations, the details of the experimental environment and the results on the other datasets can be found in Appendix A.6. Notably, since we randomly inject homogeneous or heterogeneous information into the structure Non-IID data partitioning process, model performance does not directly relate to the number of clients. Meanwhile, in the community split setting, the aggregation of multiple clients' models in federated learning can be considered a form of ensemble learning, so prediction performance improves with an increasing number of clients in some cases. To answer Q1: AdaFGL exceeds the best of all considered baselines on the heterogeneous datasets by a margin of 6.37% to 8.52%; in the community split setting, we improve prediction accuracy by utilizing the local smoothing prior and the adaptive propagation mechanisms. To answer Q2: we demonstrate the behavior of existing methods under the structure Non-IID challenge in Table 4. Although FedGGCN performs well in general, it does not obtain competitive performance. Despite FedSage+ achieving effective local graph augmentation by sharing global data, structure Non-IID remains a natural challenge, and this weakness is amplified when heterogeneity is high. In contrast, our method achieves performance gains of 1.45%, 3.77%, and 1.87% over the highest baseline prediction accuracy. Impressively, AdaFGL improves performance by 9.82%, 13.06%, and 13.29% in the structure Non-IID setting on the heterogeneous dataset Chameleon. From the comparison with the baselines, our method shows significant advantages, especially in robustness and overall performance.

4.3. ABLATION EXPERIMENTS

To answer Q3, we present the ablation results in Table 3 and Table 4, where HomoKD represents the online distillation in the homogeneity propagation module and HeteTA represents the trainable propagation probability matrix in the heterogeneity propagation module. We observe that online distillation enhances homogeneity propagation by combining local smoothing features and local embeddings; it effectively improves model performance without adding additional computation costs. In essence, it achieves mutually supervised end-to-end learning of global and local information. Furthermore, the trainable propagation probability matrix optimizes the heterogeneity propagation module: it learns the globally optimal propagation mechanism and detects positive and negative messages to generate embeddings. HeteTA can discover the global dependencies of the current node and achieve effective message aggregation, as proved by Theorem 3.1.

4.4. VISUALIZATION AND EXPLAINABILITY ANALYSIS

To answer Q4, we present the local prediction accuracy trends alongside the competitive baseline methods in Fig. 4. We can see that our method achieves the best performance in most cases under both the community split and structure Non-IID data settings, and the overall trend is favorable. Due to space limitations, the hyperparameter sensitivity analysis of AdaFGL and the corresponding conclusions can be found in Appendix A.5. To illustrate the effectiveness of the federated global knowledge extractor and the adaptive propagation mechanisms, we also analyze explainability through the heat maps shown in Fig. 5. We perform structure Non-IID partitioning over 10 clients on PubMed, then select the clients with the highest numbers of homogeneous and heterogeneous nodes, respectively. Based on this, we randomly sample 20 nodes and obtain similarity scores by multiplying the embeddings with their transpose. We observe that the federated global knowledge extractor alone only obtains fuzzy results and cannot be optimized for the local subgraphs. Fortunately, by effectively combining global knowledge with the local subgraph structure prior, we obtain explicit node embeddings, as demonstrated by the final output in Fig. 5.

4.5. METHODS COMPARISON

To answer Q5, we review three recent FGL methods and compare our approach with them in three aspects: method type, exchanged messages, and the ability to solve the structure Non-IID problem, as shown in Table 5. Although FedSage+ can achieve competitive results, it introduces significant communication costs and privacy concerns: it trains two models, incurring extra communication, and implements cross-client information sharing to improve predictive performance, which undoubtedly increases privacy concerns. GCFL+'s limitations in model selection lead to its failure to handle the structure Non-IID problem in subgraph learning. In our experiments, FedGL is essentially a local graph structure learning process. In contrast, our approach utilizes the computational capabilities of the local system while minimizing communication costs and privacy concerns. More experimental details can be found in Appendix A.4. We then compare the effectiveness of existing GNNs and our approach in handling heterogeneous graphs, focusing on two points, Neighbor Discovery and Message Combination, as shown in Fig. 6. We observe that MLP ignores the graph structure prior, which leads to its failure on heterogeneous graphs. Although FedGL and FedSage+ alleviate this problem by utilizing global information for local graph augmentation, the limitations of their propagation mean they are still not the best solutions; notably, they cannot handle the structure Non-IID problem in FGL. Although NLGNN and GGCN attempt to solve the heterogeneous structure problem, they cannot be directly applied in FGL. Therefore, motivated by these methods, we propose adaptive propagation mechanisms to improve performance, which have been validated to be effective.

5. CONCLUSION

In this paper, we discover and define the structure Non-IID problem in FGL, which is a new challenge for existing methods. Based on this, we propose a new paradigm, AdaFGL, for more general federated data settings. Specifically, we investigate the structure Non-IID problem in FGL to supplement the existing community split data partitioning approach, yielding a more practical federated data setting. To implement effective FGL on heterogeneous distributed subgraphs, we propose AdaFGL, which consists of the federated global knowledge extractor and the adaptive propagation modules. It combines FL and GNNs tightly and benefits from their evolution. Extensive experiments based on the community split and structure Non-IID data settings demonstrate the effectiveness of AdaFGL. We believe that the ability to fully utilize graph structure information is the key to achieving efficient FGL; thus, research on graph structure in FGL is a promising direction.

References (excerpt)
Wenqing Zheng, Edward W. Huang, Nikhil Rao, Sumeet Katariya, Zhangyang Wang, and Karthik Subbian. Cold Brew: Distilling graph node representations with incomplete or missing neighborhoods. In International Conference on Learning Representations (ICLR), 2022a.
Xin Zheng, Yixin Liu, Shirui Pan, Miao Zhang, Di Jin, and Philip S. Yu. Graph neural networks for graphs with heterophily: A survey. arXiv preprint arXiv:2202.07082, 2022b.

A APPENDIX OUTLINE

The appendix is organized as follows:
A.1 Theoretical error bounds for the adaptive heterogeneous propagation module.
A.2 More details about the compared baselines.
A.3 Datasets description and the structure Non-IID data setting.
A.4 Communication cost analysis.
A.5 Hyperparameter sensitivity analysis.
A.6 Experimental environment and additional results.

A.1 THEORY ERROR BOUNDS FOR ADAPTIVE HETEROGENEOUS PROPAGATION

To demonstrate the effectiveness of the adaptive heterogeneous propagation module, we prove its error bound. We first make the following reasonable assumption and definitions.
Assumption A.1 Φ is L-smooth: ∀x_1, x_2 ∈ dom(Φ),
Φ(x_1) ≤ Φ(x_2) + (x_1 − x_2)^T ∇Φ(x_2) + (L/2) ||x_1 − x_2||_2².
Then we quote the embedding theorem of Linial et al. (1995).
Definition A.1 Given two metric spaces (V, d) and (Z, d′) and a mapping function Φ : V → Z, the distortion ϵ_distor is defined by
∀u, v ∈ V, (1/ϵ_distor) d(u, v) ≤ d′(Φ(u), Φ(v)) ≤ d(u, v).
Theorem A.1 (Bourgain theorem) Given any finite metric space (V, d) with |V| = n, there exists an embedding of (V, d) into R^k under any ℓ_p metric, where k = O(log² n), and the distortion of the embedding is O(log n).
This bounds the distortion by O(log n) in the embedding space for mapping methods satisfying the above conditions. Based on this, we consider a graph G with a fixed structure represented by Ã = D̂^{−1/2} Â D̂^{−1/2}, embeddings H in the forward propagation, and a node mapping function Φ(H) satisfying Theorem A.1, which can be expressed as
ϕ(H) = ( d(H, S_{1,1})/k, d(H, S_{1,2})/k, ..., d(H, S_{log n, c log n})/k ),
where d(H, S_{i,j}) = min_{u ∈ S_{i,j}} d(H, u), and S_{i,j} ⊂ V, i = 1, ..., log n, j = 1, ..., c log n, are c log² n random sets (c a constant), each chosen by including every point of V independently with probability 1/2^i. Then, motivated by Xie et al. (2021) and the above conclusions, we obtain the following model-weight-difference proposition.
Proposition A.1 Assume the propagation probability matrix, hidden embeddings, and label differences between the global optimum f*_θ and the local model f_θ are bounded, with SGC Wu et al. (2019) as the forward propagation:
||P* − P||_2² = ||E_P||_2² ≤ ϵ_P, ||H* − H||_2² = ||E_H||_2² ≤ ϵ_H, ||Ŷ* − Ŷ||_2² = ||E_Ŷ||_2² ≤ ϵ_Ŷ.
Based on this, given that ||H ∘ H*||_2² = ||H ∘ (H + E_H)||_2² ≥ ||H E_H||_2², let ||H E_H||_2² = δ_H; then ||H*^{−1} − H^{−1}||_2² = ||E_{H^{−1}}|| ≤ ϵ_H / δ_H. Considering the model-weight difference under the influence of the feature difference, we have
ϕ = ||f*_θ − f_θ||_2 = ||(P H*)^{−1} Ŷ* − (P H)^{−1} Ŷ||_2²
= ||H*^{−1} P^{−1} (Ŷ + E_Ŷ) − H^{−1} P^{−1} Ŷ||_2²
= ||(H*^{−1} − H^{−1}) P^{−1} Ŷ + H*^{−1} P^{−1} E_Ŷ||_2²
≤ (ϵ_H / δ_H) ||P^{−1} Ŷ||_2² + (ϵ_H² ϵ_Ŷ / δ_H) ||(P H)^{−1}||_2² + ϵ_H ϵ_Ŷ ||(P H)^{−1}||_2⁴.
Similarly, there exist ||P ∘ P*||_2² = ||P ∘ (P + E_P)||_2² ≥ ||P E_P||_2², ||P E_P||_2² = δ_P, and ||P*^{−1} − P^{−1}||_2² = ||E_{P^{−1}}|| ≤ ϵ_P / δ_P, from which we obtain the model-weight difference under the influence of the structure difference:
ϕ = ||f*_θ − f_θ||_2 = ||(P* H)^{−1} Ŷ* − (P H)^{−1} Ŷ||_2²
= ||H^{−1}||_2² ||(P^{−1} + E_{P^{−1}})(Ŷ + E_Ŷ) − P^{−1} Ŷ||_2²
= ||H^{−1}||_2² ||P^{−1} E_Ŷ + E_{P^{−1}} Ŷ + E_{P^{−1}} E_Ŷ||
≤ ||H^{−1}||_2² ( ϵ_Ŷ ||P^{−1}||_2² + (ϵ_P / δ_P) ||Ŷ||_2² + ϵ_P ϵ_Ŷ / δ_P ).
Proof A.1 Based on Eq. 11, we consider the adaptive heterogeneous propagation process
H^(l+1) = H^(l) + H_pos^(l) + H_neg^(l)
= H^(l) + PoSign(Â_prop^(l)) H^(l) + NeSign(Â_prop^(l)) H^(l)
= H^(l) + PoSign(Â_prop^(l−1) + β W H^(l−1) (W H^(l−1))^T) H^(l) + NeSign(Â_prop^(l−1) + β W H^(l−1) (W H^(l−1))^T) H^(l).
Take node i as an example. Given that Φ(·) is differentiable and contains the gradient update of the model difference ϕ, and in order to quantify the difference between our trainable propagation probability matrix and the global optimum, we define
κ_i = P*[i, :] − ( Â_prop^(0)[i, :] + Σ_l β W H^(l) (W H^(l))^T [i, :] ),
where P* represents the optimal propagation probability matrix. Using Pos and Neg to denote the positive and negative message propagation weights of P = Â_prop^(l), there exists
H_i^(l+1) = Σ_{j≠i} P*_ij Φ(H^(l)) = H_i^(l) + Σ_{j≠i} (Pos_ij^(l) + Neg_ij^(l)) H_j^(l) = ( κ_i + P_ii + Σ_{j≠i} (Pos_ij^(l) + Neg_ij^(l)) ) ϕ H^(l),
where Pos_ij^(l) + Neg_ij^(l) = Â_prop^(l)[i, :].
Then, we perform a first-order Taylor expansion with Peano's form of the remainder at H_i^(l) and consider the model differences:
Σ_{j≠i} P*_ij Φ(H_j^(l)) = Σ_{j≠i} P*_ij [ Φ(H_i^(l)) + (∂Φ(H_i^(l)) / ∂(H^(l))^T)(H_j^(l) − H_i^(l)) + O(||H_j^(l) − H_i^(l)||²) ]
= Σ_{j≠i} P*_ij Φ(H_i^(l)) + Σ_{j≠i} P*_ij (∂Φ(H_i^(l)) / ∂(H^(l))^T)(H_j^(l) − H_i^(l)) + Σ_{j≠i} P*_ij O(||H_j^(l) − H_i^(l)||²).
Now, letting Σ_{j≠i} P*_ij (H_j^(l) − H_i^(l)) = −ϵ_i, there exists
Σ_{j≠i} P*_ij Φ(H^(l)) − ( κ_i + P_ii + Σ_{j≠i} (Pos_ij^(l) + Neg_ij^(l)) ) ϕ H^(l) = (∂Φ(H_i^(l)) / ∂(H^(l))^T) ϵ_i − Σ_{j≠i} P*_ij O(||H_j^(l) − H_i^(l)||²).
According to the Cauchy–Schwarz inequality and the L-Lipschitz property, we have
|| (∂Φ(H_i^(l)) / ∂(H^(l))^T) ϵ_i || ≤ || ∂Φ(H_i^(l)) / ∂(H^(l))^T || ||ϵ_i||_2 ≤ L ||ϵ_i||_2.
Therefore, the approximation of H_i^(l) + Σ_{j≠i} (Pos_ij^(l) + Neg_ij^(l)) H_j^(l) is bounded by
|| Σ_{j≠i} P*_ij Φ(H^(l)) − [ H_i^(l) + Σ_{j≠i} (Pos_ij^(l) + Neg_ij^(l)) H_j^(l) ] ||
≤ || (∂Φ(H_i^(l)) / ∂(H^(l))^T) ϵ_i || + || Σ_{j≠i} P*_ij O(||H_j^(l) − H_i^(l)||²) || + || H* − P* H^(l) ||_2
≤ [ L ||ϵ_i||_2 + Σ_{j≠i} P*_ij O(||H_j^(l) − H_i^(l)||²) ] + || H* − ϕ(κ + P) H^(l) ||_2,
where H* represents the global optimal embeddings. Based on this, we obtain the theoretical error bound for heterogeneous propagation. From this bound, we see that, in theory, the adaptive heterogeneous propagation process can minimize the immediate-neighbor error ϵ_i, the model difference ϕ, and the propagation-probability-matrix difference κ, scaling down the error and improving predictive performance.

A.2 COMPARED BASELINES

The main characteristics of all baselines are listed below. FedMLP: The combination of FedAvg and MLP; we employ a two-layer MLP with a hidden dimension of 64. It generates node embeddings from the original features while ignoring graph structure information in the forward propagation process. For fairness, we follow the experimental setups of the baseline papers as much as possible; in other cases, we report the best prediction accuracy. In addition, the number of communication rounds for the above baseline methods is 50, and the number of local epochs is 20.
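For clarity, the aggregation step shared by all FedAvg-based baselines above can be sketched as follows. This is a minimal stdlib illustration with hypothetical helper names (a real run would train the two-layer MLP in a deep-learning framework over 50 rounds with 20 local epochs, as stated above):

```python
# Minimal FedAvg aggregation sketch: the server averages client
# parameter vectors weighted by local sample counts, w = sum_k (n_k/n) w_k.

def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of flat client parameter vectors."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]

# Toy example: three clients, parameter vectors of length 3.
weights = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
sizes = [10, 10, 20]  # local training-set sizes n_k

global_w = fedavg(weights, sizes)
print(global_w)  # -> [2.0, 2.0, 2.0]
```

In every round, the server broadcasts `global_w` back to the clients, which then run their local epochs before the next aggregation.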

A.3 DATASETS DESCRIPTION AND STRUCTURE NON-IID DATA SETTING

The statistics of the datasets are summarized in Table 7. Based on this, we illustrate the structure Non-IID data partitioning process in detail. Its core is the Dirichlet process He et al. (2020), whose basic analysis is as follows. The pdf of the Dirichlet distribution is defined as
$$p(P = \{p_i\} \mid \alpha_i) = \frac{\Gamma\big(\sum_i \alpha_i\big)}{\prod_i \Gamma(\alpha_i)} \prod_i p_i^{\alpha_i - 1},$$
where $\alpha_i \in \{\alpha_1, \ldots, \alpha_k\}$, $\alpha_i > 0$ are the dimensionless distribution parameters, the scale (or concentration) is $\vartheta = \sum_i \alpha_i$, the base measure is $(\alpha^{\star}_1, \ldots, \alpha^{\star}_k)$ with $\alpha^{\star}_i = \alpha_i / \vartheta$, and $\Gamma(n) = (n-1)!$. The Dirichlet is a distribution over multinomials, thus $\sum_i p_i = 1$ and $p_i \geq 0$, where $p_i$ represents a probability. The base measure determines the mean distribution, and the scale affects the variance; we obtain
$$\mathbb{E}(p_i) = \frac{\alpha_i}{\vartheta} = \alpha^{\star}_i, \quad \mathrm{Var}(p_i) = \frac{\alpha_i(\vartheta - \alpha_i)}{\vartheta^2(\vartheta + 1)} = \frac{\alpha^{\star}_i(1 - \alpha^{\star}_i)}{\vartheta + 1}, \quad \mathrm{Cov}(p_i, p_j) = \frac{-\alpha_i \alpha_j}{\vartheta^2(\vartheta + 1)},$$
which means that a Dirichlet with a small scale $\vartheta$ favors extreme distributions, but this prior belief is weak and is easily overwritten by data; as $\vartheta \to \infty$, the covariance tends to 0 and the samples tend to the base measure. Based on this, we sample the edges to determine the attribution of each pair of nodes. If a conflicting set of nodes exists, it is sampled again, and finally the induced subgraphs are generated. Then, we randomly inject homogeneous or heterogeneous information based on the label prior, which mitigates unrealistic structure loss and enhances structure identity. We set three probabilities $p_{iso}$, $p_{homo}$, and $p_{hete}$ for each client individually to represent the probability of avoiding isolated nodes, increasing homogeneous edges, and increasing heterogeneous edges in the subgraph, respectively. Specifically, $p_{iso}$ is the probability that an isolated node generates edges with other nodes, which effectively prevents the creation of isolated nodes.
$p_{homo}$ applies to the subgraphs of clients selected for homogeneity enhancement and represents the probability of connecting two nodes with the same label based on the label prior information. Correspondingly, $p_{hete}$ is the probability used to inject structure information into the subgraphs of clients that perform heterogeneity enhancement.
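The partitioning procedure above can be sketched with the standard library alone. The function names, the toy graph, and the concrete parameter values are our own illustration under stated assumptions, not the paper's implementation:

```python
import random

random.seed(0)

def dirichlet(alphas):
    """Sample p ~ Dirichlet(alphas) via normalized Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def partition_edges(edges, num_clients, alpha):
    """Assign each edge (and hence its endpoints) to a client with
    Dirichlet-distributed probabilities; a small alpha (small scale)
    yields highly skewed client distributions."""
    p = dirichlet([alpha] * num_clients)
    clients = [[] for _ in range(num_clients)]
    for e in edges:
        k = random.choices(range(num_clients), weights=p)[0]
        clients[k].append(e)
    return clients

def inject_edges(nodes, labels, edges, p_iso, p_homo, p_hete):
    """Structure injection for one client's induced subgraph:
    connect isolated nodes w.p. p_iso, add same-label (homogeneous)
    edges w.p. p_homo and cross-label (heterogeneous) edges w.p. p_hete."""
    connected = {u for e in edges for u in e}
    out = list(edges)
    for v in nodes:
        if v not in connected and random.random() < p_iso:
            pick = random.choice([u for u in nodes if u != v])
            out.append(tuple(sorted((v, pick))))
    for u in nodes:
        for v in nodes:
            if u < v and (u, v) not in out:
                same = labels[u] == labels[v]
                if random.random() < (p_homo if same else p_hete):
                    out.append((u, v))
    return out

# Toy usage: a 21-node path graph split across 4 clients,
# then homogeneity enhancement on one client's subgraph.
edges = [(i, i + 1) for i in range(20)]
parts = partition_edges(edges, num_clients=4, alpha=0.5)

sub_nodes = list(range(6))
labels = {v: v % 2 for v in sub_nodes}  # binary label prior
sub_edges = inject_edges(sub_nodes, labels, [(0, 1), (2, 3)],
                         p_iso=1.0, p_homo=0.3, p_hete=0.0)
```

With `p_iso=1.0`, the two initially isolated nodes (4 and 5) are guaranteed to receive an edge, matching the role of $p_{iso}$ described above.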

A.4 COMMUNICATION COSTS ANALYSIS

The advantage of our approach is that it exploits the local structure prior while making full use of the global information, which matches the characteristics of GNNs. This reduces communication costs and privacy concerns during the federated training process. Meanwhile, thanks to the utilization of local structure information, we obtain models with better representational power and thus improved performance. To demonstrate the effectiveness of our method, we compare AdaFGL with the two most competitive methods, FedSage and FedGGCN, as shown in the two tables below. According to the experimental results, we observe that AdaFGL maintains low communication costs while achieving satisfying results, which mainly benefits from the utilization of local structure information by the adaptive propagation modules. In contrast, FedSage, currently the most competitive FGL approach, suffers from a dilemma between performance improvement and communication costs, which also raises more privacy concerns.
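One common way to account for costs of the magnitudes reported in the tables is to count the parameters transmitted per round; this is a hypothetical accounting sketch (the paper's exact bookkeeping may differ), meant only to illustrate why parameter-heavy methods pay a much larger communication cost:

```python
# Hypothetical FedAvg-style communication-cost estimate:
# every round, each client uploads and downloads the model parameters,
# so cost = rounds * clients * 2 * num_params (in transmitted scalars).

def comm_cost(rounds, clients, num_params, both_directions=True):
    factor = 2 if both_directions else 1
    return rounds * clients * factor * num_params

# Toy comparison: a compact model vs. one with 50x more parameters,
# using the 50 rounds and 10 clients from the experimental setup.
small = comm_cost(rounds=50, clients=10, num_params=10_000)
large = comm_cost(rounds=50, clients=10, num_params=500_000)
print(small, large)  # -> 10000000 500000000
```

Under this accounting, a method that transmits extra generated-node information each round (as graph-augmentation approaches do) scales its cost accordingly, which mirrors the gap between the $10^5$- and $10^7$-scale figures in the tables.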



Figure 1: We utilize black circles for the base class and gray circles for the other classes. (a): Limitations of existing FGL methods. (b): The general collaborative training pipeline. (c): The new challenges of graph structure Non-IID for FGL.

Figure 2: FedSGC (FedAvg+SGC) and FedGGCN (FedAvg+GGCN) are the representative methods. The upper and lower parts correspond to CiteSeer and Squirrel, respectively; the left and right parts correspond to community split and structure Non-IID, respectively. The x-axis represents the client ID, and the numbers are the per-class sample counts in each client. Darker colors indicate larger sample counts.

Figure 3: Overview of our model-free pipeline with a toy example. The federated global knowledge extractor represents a wide variety of node embedding models. In the middle part, black circles represent labeled nodes and gray circles represent unlabeled nodes. Based on this, we obtain graph structure properties and implement the adaptive propagation mechanisms.

Figure 5: The heat maps of node similarity on PubMed for 10 clients. Homogeneity and heterogeneity represent the two local structure assumptions, respectively. Darker colors indicate higher similarity.

Community split in Cora.

Structure Non-IID in Cora.

The results of test accuracy on Cora and Chameleon under the community split: mean accuracy ± standard deviation. The best results are shown in bold.

The results of test accuracy on Cora and Chameleon under structure Non-IID: mean accuracy ± standard deviation. The best results are shown in bold. FedGL aims to optimize local model performance using global information; in our setting it is essentially graph structure learning without overlapping nodes. FedSage+ performs local graph augmentation to improve prediction performance. GCFL+ implements a clustering process to perform personalized update mechanisms. More details about the baseline methods can be found in Appendix A.2.

The results based on two data partitioning methods on 10 clients, where the upper part represents structure Non-IID and the lower part represents community split.

A summary of very recent FGL methods and our approach.

A summary of powerful GNNs on heterogeneous graphs and our approach.

Statistics of five benchmark datasets.

FedSGC: The combination of FedAvg and SGC Wu et al. (2019); we default to a 3-layer feature propagation process, which follows the homogeneity assumption and thus fails to deal with heterogeneous graphs. FedNLGNN: Implementation of NLGNN (NLMLP or NLGCN) Liu et al. (2021) based on FedAvg; we select the more effective version to present the model performance. It depends on the embedding model and suffers from representational limitations. FedGGCN: The combination of FedAvg and GGCN Yan et al. (2021); we follow the experimental setup of the original paper as much as possible. It can handle heterogeneous graphs effectively but cannot achieve competitive results on homogeneous graphs. FedGL Chen et al. (2021): As an FGL training framework, it strongly relies on the overlapping-nodes assumption, which in our data setting reduces to local graph structure learning.
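The 3-layer feature propagation used by the FedSGC baseline can be made concrete with a short sketch. This is our own pure-Python illustration of SGC-style propagation, $X' = S^{3}X$ with $S = D^{-1/2}(A+I)D^{-1/2}$, not the baseline's actual implementation:

```python
import math

# SGC-style propagation: add self-loops, symmetrically normalize the
# adjacency matrix, then apply k linear propagation steps (no
# nonlinearity between steps).

def sgc_propagate(adj, feats, k=3):
    n = len(adj)
    a_tilde = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
               for i in range(n)]  # A + I
    deg = [sum(row) for row in a_tilde]
    s = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
         for i in range(n)]  # D^-1/2 (A+I) D^-1/2
    x = [row[:] for row in feats]
    for _ in range(k):  # k propagation steps
        x = [[sum(s[i][m] * x[m][j] for m in range(n))
              for j in range(len(feats[0]))] for i in range(n)]
    return x

adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # path graph 0 - 1 - 2
feats = [[1.0], [0.0], [0.0]]             # a one-hot feature on node 0
out = sgc_propagate(adj, feats, k=3)
```

After three steps, the signal originally concentrated on node 0 has diffused to its neighbors, which is exactly the homogeneity assumption that makes this baseline struggle on heterogeneous graphs.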

Table 7 contains datasets with both homogeneity and heterogeneity. In our experiments, we use five benchmark datasets covering homogeneity and heterogeneity, whose details are given below. Cora, CiteSeer, and PubMed Yang et al. (2016b) are three popular citation network datasets. In these three networks, papers from different topics are considered as nodes, and the edges are citations among the papers. The node attributes are binary word vectors, and the class labels are the topics the papers belong to. Chameleon and Squirrel are two web page datasets collected from Wikipedia Rozemberczki et al. (2021), where nodes are web pages on specific topics and edges are hyperlinks between them.

The results of accuracy on Cora and CiteSeer under the community split of 10 clients: mean accuracy ± standard deviation. The best results are shown in bold.

Method    Cora (Acc. / Acc. / Comm. cost)              CiteSeer (Acc. / Acc. / Comm. cost)
FedGGCN   82.34±0.84 / 82.80±0.97 / 5.52×10^5          75.32±0.95 / 75.78±1.12 / 1.42×10^6
Ours      87.97±0.98 / 86.89±1.02 / 2.88×10^5          79.63±1.48 / 77.05±1.72 / 7.26×10^5

The results of accuracy on Chameleon and Squirrel under the community split of 10 clients: mean accuracy ± standard deviation. The best results are shown in bold.

Method    Chameleon (Acc. / Acc. / Comm. cost)         Squirrel (Acc. / Acc. / Comm. cost)
FedSage   49±2.06 / 50.11±1.95 / 2.44×10^7             31.44±2.17 / 31.27±1.93 / 2.19×10^7
FedGGCN   43.81±1.37 / 41.52±1.25 / 8.95×10^5          35.26±1.10 / 34.10±1.25 / 8.04×10^5
Ours      58.06±1.23 / 54.43±1.52 / 8.95×10^5          39.16±1.88 / 35.43±2.53 / 8.04×10^5

A.5 HYPERPARAMETER SENSITIVITY ANALYSIS

Here we conduct the hyperparameter sensitivity analysis of AdaFGL; the experimental results are shown in Fig. 6. In our experiments, we analyze the ratio of the online distillation loss in the homogeneous propagation module and the smoothing coefficient of the trainable propagation matrix in the heterogeneous propagation module. According to the experimental results, we observe that AdaFGL performs robustly except in extreme cases. Furthermore, the results under extreme knowledge distillation loss ratios show that low-confidence base predictor outputs can adversely affect the homogeneous propagation module. Motivated by this, in order to prevent global embeddings with low confidence from influencing the propagation module, we measure the confidence of the global model according to the characteristics of the base predictor.
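The search over the two hyperparameters above can be sketched as a simple random search; this stdlib stand-in uses a synthetic surrogate objective of our own invention (real runs would use validation accuracy, e.g. via an Optuna study as noted in A.6):

```python
import random

# Random search over the two hyperparameters analyzed in A.5:
# the online-distillation loss ratio and the smoothing coefficient
# beta of the trainable propagation matrix.

random.seed(42)

def objective(distill_ratio, beta):
    # Synthetic surrogate for validation accuracy: peaks at moderate
    # values and degrades at the extremes (hypothetical shape, chosen
    # only to mimic the qualitative sensitivity behavior).
    return 1.0 - (distill_ratio - 0.5) ** 2 - (beta - 0.3) ** 2

best_score, best_cfg = float("-inf"), None
for _ in range(200):
    cfg = {"distill_ratio": random.uniform(0.0, 1.0),
           "beta": random.uniform(0.0, 1.0)}
    score = objective(**cfg)
    if score > best_score:
        best_score, best_cfg = score, cfg

assert best_score <= 1.0
```

Swapping the surrogate for the actual federated validation accuracy, and the loop for an Optuna study, recovers the search procedure used in the experiments.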

A.6 EXPERIMENTAL ENVIRONMENT AND ADDITIONAL BASE RESULTS

The experiments are conducted on a machine with an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz and a single NVIDIA GeForce RTX 3090 with 24GB memory. The operating system is Ubuntu 18.04.6. As for software versions, we use Python 3.8, PyTorch 1.11.0, and CUDA 11.4. To alleviate the influence of randomness, we repeat each method 10 times and report the statistical characteristics. The hyperparameters of the baselines are set according to the original papers where available. We use Optuna Akiba et al. (2019) to implement the hyperparameter search. Following the above principles, we present the results of the two data partitionings as follows.

