A NEW PARADIGM FOR FEDERATED STRUCTURE NON-IID SUBGRAPH LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Federated graph learning (FGL), a distributed training framework for graph neural networks (GNNs), has attracted much attention for breaking the centralized machine learning assumption. Despite its effectiveness, differences in data collection perspectives and quality lead to heterogeneity challenges, especially when a domain-specific graph is partitioned into subgraphs held by different institutions. However, existing FGL methods implement graph data augmentation or personalization under community split, which follows the cluster homogeneity assumption. We investigate these issues and suggest that subgraph heterogeneity is essentially a matter of structure variation. From our observations on FGL, we first define the structure non-independent identical distribution (Non-IID) problem, which presents unique challenges among client-wise subgraphs. Meanwhile, we propose a new paradigm for general federated data settings called Adaptive Federated Graph Learning (AdaFGL). The motivation behind it is to implement adaptive propagation mechanisms based on federated global knowledge and non-parametric label propagation. We conduct extensive experiments under both community split and structure Non-IID settings, in which our approach achieves state-of-the-art performance on five benchmark datasets.

1. INTRODUCTION

The graph, as a relational data structure, is widely used to model real-world entity relations such as citation networks Yang et al. (2016a), recommender systems Wu et al. (2022), drug discovery Gaudelet et al. (2021), and particle physics Shlomi et al. (2021). However, because of differing collection agents and privacy concerns, a global domain-specific graph generally consists of many subgraphs collected by multiple institutions. To analyze its local subgraph, each client maintains a powerful graph mining model such as a graph neural network (GNN); GNNs have achieved state-of-the-art performance in many graph learning tasks Zhang et al. (2022b); Hu et al. (2021); Zhang & Chen (2018). Despite their effectiveness, the limited local data yield sub-optimal performance in most cases. Motivated by the success of federated learning (FL), a natural idea is to combine GNNs with FL to utilize the distributed subgraphs. Recently, federated graph learning (FGL) He et al. (2021); Wang et al. (2022b) has been proposed to achieve collaborative training without directly sharing data, yet an essential concern is the heterogeneity of the distributed subgraphs.

Notably, graph heterogeneity differs from the heterogeneity of labels or features studied in computer vision or natural language processing; we suggest that it depends on the graph structure. However, existing FGL methods simulate the federated subgraph distributions through community split, which follows the cluster homogeneity assumption shown in Fig. 1(a). Specifically, community split makes each subgraph's structure consistent with that of the original graph, e.g., connected nodes are more likely to have the same labels. This assumption is overly idealized and hard to satisfy in reality, hence we consider a more reasonable setting shown in Fig. 1(c). We term this problem structure non-independent identical distribution (Non-IID). The motivation is that graph structure is directly related to node labels and feature distributions. Meanwhile, the challenges of structure heterogeneity are ubiquitous in the real world Zheng et al. (2022b). For instance, in citation networks, we consider research teams focused on computers and intersectional fields (e.g., AI in Science) Shlomi et al. (2021 2007). We consider different regions as clients to detect financial fraudsters by analyzing online transaction subgraphs. Specifically, graph structure can be divided into two types: homogeneity means that connected nodes are more likely to have the same label and similar feature distributions, and heterogeneity is the opposite. To explain this intuitively, we visualize the 3-client partitioning results on Cora in Table 1 and Table 2, where Homo represents the homogeneity degree of the local subgraph, computed by a popular metric Pei et al. (2020).

Compared to community split, which follows the cluster homogeneity assumption and uniform distribution principle, structure Non-IID poses challenges for existing FGL methods. We therefore investigate these issues through the empirical analysis shown in Fig. 2. According to the results, when the original graph satisfies the homogeneity assumption, the label distributions are Non-IID; the opposite holds when the original graph is heterogeneous. This is because the nodes partitioned into the same client form communities and follow the uniform distribution principle. In addition, the local accuracy indicates that the subgraph structure plays a more important role in FGL than the label distributions, which also supports our motivation. In terms of model performance, we observe that GGCN alleviates the structure Non-IID problem, and FedSage+ trains NeighGen to implement local subgraph augmentation by sharing node embeddings. However, these methods, while accounting for heterogeneity, fail to achieve results competitive with SGC on the homogeneous subgraphs.

To efficiently analyze distributed subgraphs with both homogeneity and heterogeneity, we propose a simple pipeline called Adaptive Federated Graph Learning (AdaFGL) for more general federated data settings, which consists of three main parts. Specifically, it starts by analyzing the subgraph structure through non-parametric label propagation and selects the appropriate base model: (i) the federated global knowledge extractor (e.g., MLP, powerful GNNs, or any reasonable embedding model), which does not rely on any learning over the subgraph. Then, the base predictor is trained based on the global data, which can be done offline or in parallel with local training, benefiting from the flexibility of our approach. Finally, the local client applies one of two adaptive propagation mechanisms based on the local subgraph: (ii) the homogeneity propagation module or (iii) the heterogeneity propagation module. Notably, with non-parametric label propagation, the above process is adaptive.

To summarize, the contributions of this paper are as follows:
(1) To the best of our knowledge, we are the first to analyze the structure Non-IID problem in FGL, which is a more general federated data setting and brings new challenges.
(2) We propose AdaFGL, a new paradigm for structure Non-IID subgraph learning, which shows its flexibility in FGL with impressive performance.
(3) Extensive experiments demonstrate the effectiveness of AdaFGL. Specifically, our approach achieves state-of-the-art performance in the above two data settings. Compared to the best prediction accuracy in the

Figure 1: Black circles denote the base class and gray circles denote the other classes. (a): Limitations of existing FGL methods. (b): The general collaborative training pipeline. (c): The new challenges of graph structure Non-IID for FGL.
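The introduction describes AdaFGL as first probing the local subgraph with non-parametric label propagation and then choosing between a homogeneity and a heterogeneity propagation module. The sketch below illustrates that idea only; it is not the authors' procedure. The function names, the dense adjacency input, the damping factor `alpha`, the validation split, and the accuracy `threshold` used to pick a module are all assumptions made for illustration.

```python
import numpy as np

def dense_adjacency(num_nodes, edges):
    """Build a symmetric dense adjacency matrix from an undirected edge list."""
    adj = np.zeros((num_nodes, num_nodes))
    for u, v in edges:
        adj[u, v] = 1.0
        adj[v, u] = 1.0
    return adj

def label_propagation(adj, labels, train_mask, num_classes, steps=10, alpha=0.5):
    """Parameter-free label propagation: repeatedly average neighbor scores
    and mix them with the one-hot labels of the training nodes."""
    adj = np.asarray(adj, dtype=float)
    labels = np.asarray(labels)
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # avoid division by zero for isolated nodes
    p = adj / deg                            # row-normalized transition matrix
    y0 = np.zeros((adj.shape[0], num_classes))
    train_idx = np.flatnonzero(train_mask)
    y0[train_idx, labels[train_idx]] = 1.0   # one-hot seeds on labeled nodes only
    y = y0.copy()
    for _ in range(steps):
        y = alpha * (p @ y) + (1.0 - alpha) * y0
    return y

def choose_propagation(adj, labels, train_mask, val_mask, num_classes, threshold=0.5):
    """Hypothetical selection rule: if label propagation predicts held-out
    labels well, treat the subgraph as homogeneous; otherwise heterogeneous."""
    scores = label_propagation(adj, labels, train_mask, num_classes)
    val_idx = np.flatnonzero(val_mask)
    pred = scores[val_idx].argmax(axis=1)
    acc = float(np.mean(pred == np.asarray(labels)[val_idx]))
    module = "homogeneity" if acc >= threshold else "heterogeneity"
    return module, acc
```

Label propagation works well exactly when connected nodes tend to share labels, so its held-out accuracy is a cheap, training-free signal of local structure. The threshold value here is arbitrary and would need tuning in practice.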

Community split in Cora.
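The Homo value reported for each client subgraph can be illustrated with a short computation. The sketch below assumes the metric is the node-level homophily ratio commonly attributed to Pei et al. (2020): for each node, the fraction of its neighbors that share its label, averaged over nodes. The function name `node_homophily`, the undirected edge-list input, and the choice to skip isolated nodes are assumptions, not details taken from the paper.

```python
import numpy as np

def node_homophily(edges, labels):
    """Mean, over nodes with at least one neighbor, of the fraction of
    neighbors that carry the same label as the node itself."""
    labels = np.asarray(labels)
    edges = np.asarray(edges, dtype=int).reshape(-1, 2)
    # Treat the graph as undirected by counting every edge in both directions.
    src = np.concatenate([edges[:, 0], edges[:, 1]])
    dst = np.concatenate([edges[:, 1], edges[:, 0]])
    n = labels.shape[0]
    degree = np.bincount(src, minlength=n)
    same = np.bincount(src, weights=(labels[src] == labels[dst]).astype(float), minlength=n)
    has_neighbors = degree > 0
    if not has_neighbors.any():
        return float("nan")  # homophily is undefined for a graph without edges
    return float(np.mean(same[has_neighbors] / degree[has_neighbors]))
```

A value near 1 corresponds to what the text calls a homogeneous subgraph, and a value near 0 to a heterogeneous one. Computing it per client makes the difference between community split and structure Non-IID partitions directly comparable.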

