PERSONALIZED SUBGRAPH FEDERATED LEARNING

Abstract

In real-world scenarios, subgraphs of a larger global graph may be distributed across multiple devices or institutions, and be only locally accessible due to privacy restrictions, although there may be links between them. Recently proposed subgraph Federated Learning (FL) methods deal with those missing links across private local subgraphs while distributively training Graph Neural Networks (GNNs) on them. However, they have overlooked the inevitable heterogeneity among subgraphs, which arises because subgraphs comprise different communities of a global graph, and consequently collapse the incompatible knowledge from local GNN models trained on heterogeneous graph distributions. To overcome such a limitation, we introduce a new subgraph FL problem, personalized subgraph FL, which focuses on the joint improvement of the interrelated local GNN models rather than learning a single global GNN model, and propose a novel framework, FEDerated Personalized sUBgraph learning (FED-PUB), to tackle it. A crucial challenge in personalized subgraph FL is that the server does not know which subgraph each client has. FED-PUB thus utilizes functional embeddings of the local GNNs, computed using random graphs as inputs, to measure similarities between them, and uses those similarities to perform weighted averaging for server-side aggregation. Further, it learns a personalized sparse mask at each client to select and update only the subgraph-relevant subset of the aggregated parameters. We validate FED-PUB on six datasets, considering both non-overlapping and overlapping subgraphs, on which it largely outperforms relevant baselines.

1. INTRODUCTION

Most of the previous Graph Neural Networks (GNNs) (Hamilton, 2020) focus on a single graph, whose nodes and edges collected from multiple sources are stored in a central server. For instance, in a social network platform, every user, with his/her social networks, contributes to creating a giant network consisting of all users and their connections. However, in some practical scenarios, each user/institution collects its own private graph, which is only locally accessible due to privacy restrictions. For instance, as described in Zhang et al. (2021), each hospital may have its own patient interaction network to track their physical contacts or co-diagnosis of a disease; however, such a graph may not be shared with others. How can we then collaboratively train a neural network over a graph whose subgraphs are distributed across multiple participants (i.e., clients), without sharing the actual data? The most straightforward way is to perform Federated Learning (FL) with GNNs. Specifically, each client individually trains a local GNN on its private local data, while a central server aggregates the locally updated GNN weights from multiple clients into one, and then transmits it back to the clients. However, an important challenge in such a subgraph FL scenario is how to deal with potentially missing edges between subgraphs, which are not captured by individual data owners but may carry important information (see Figure 1 (A)). Recent subgraph FL methods (Wu et al., 2021a; Zhang et al., 2021) additionally tackle this problem by expanding the local subgraph with information from other subgraphs, as illustrated in Figure 1 (B). In particular, they expand the local subgraph either by exactly augmenting the relevant nodes from the other subgraphs at the other clients (Wu et al., 2021a), or by estimating those nodes using the node information in the other subgraphs (Zhang et al., 2021). However, such sharing of node information may compromise data privacy and can incur high communication costs.
Figure 1: (B) Existing subgraph FL methods (Wu et al., 2021a; Zhang et al., 2021) expand the local subgraphs to tackle the missing edge problem, but collapse incompatible knowledge from heterogeneous subgraphs. (C) Our personalized subgraph FL focuses on the joint improvement of local models working on interrelated subgraphs, such as ones within the same community, by selectively sharing the knowledge across them. (Right:) Knowledge collapse results, where local models belonging to two small communities (Communities 1 and 2) suffer from large performance degeneration under existing subgraph FL, such as FedGNN (Wu et al., 2021a; 2022) and FedSage+ (Zhang et al., 2021). A personalized FL method, FedPer (Arivazhagan et al., 2019), also underperforms ours since it only focuses on individual models' improvement without sharing local personalization layers between similar subgraphs.

Also, there exists a more important challenge that has been overlooked by existing subgraph FL. We observe that they suffer from large performance degeneration (see Figure 1, right) due to the heterogeneity among subgraphs, which is natural since subgraphs comprise different parts of a global graph. Specifically, two individual subgraphs, for example, the User 1 and 3 subgraphs in Communities A and B respectively in Figure 1 (A), are sometimes completely disjoint with opposite properties. Meanwhile, two densely connected subgraphs form a community (e.g., the User 1 and 2 subgraphs within Community A of Figure 1 (A)), in which they share similar characteristics. However, it is challenging to consider such heterogeneity arising from community structures of a graph. Motivated by this challenge, we introduce a novel problem of personalized subgraph FL, whose goal is to jointly improve the interrelated local models trained on the interconnected local subgraphs, for instance, subgraphs belonging to the same community, by sharing weights among them (see Figure 1 (C)).
However, implementing such selective weight sharing is challenging, since we do not know which subgraph each client has, due to its local accessibility. To resolve this issue, we use functional embeddings of GNNs on random graphs to obtain similarity scores between two local GNNs, and then use them to perform weighted averaging of the model weights at the server. However, the similarity scores only tell how relevant each local model from the other clients is, but not which of the parameters are relevant. Thus we further learn and apply personalized sparse masks on the local GNN at each client to obtain only the subnetwork relevant to the local subgraph. We refer to this subgraph FL framework as FEDerated Personalized sUBgraph learning (FED-PUB). We extensively validate FED-PUB on six different datasets with varying numbers of clients, under both overlapping and disjoint subgraph FL scenarios. The experimental results show that ours significantly outperforms relevant baselines. Further analyses show that our method can discover community structures among subgraphs, and that the masking scheme localizes the knowledge with respect to the subgraph of each client. Our main contributions are as follows:
• We introduce a novel problem of personalized subgraph FL, which aims at collaborative improvement of related local models (e.g., models for subgraphs belonging to the same community), and which has been relatively overlooked by previous approaches to graph and subgraph FL.
• We propose a novel framework for personalized subgraph FL, which performs weighted averaging of the local model parameters based on their functional similarities obtained without accessing the data, and learns sparse masks to select only the relevant subnetworks for the given subgraphs.
• We validate our framework on six real-world datasets under both overlapping and non-overlapping node scenarios, demonstrating its effectiveness over existing subgraph FL baselines.

2. RELATED WORK

Graph Neural Networks  Graph Neural Networks (GNNs) (Hamilton, 2020; Zhou et al., 2020; Wu et al., 2021b; Jo et al., 2021; Baek et al., 2021), which aim to learn the representations of nodes, edges, and entire graphs, are an extensively studied topic. Most existing GNNs follow a message passing scheme (Gilmer et al., 2017), iteratively representing a node by aggregating features from its neighboring nodes as well as itself. For example, Graph Convolutional Network (GCN) (Kipf & Welling, 2017) approximates the spectral graph convolutions (Hammond et al., 2011), yielding a mean aggregation over neighboring nodes. Similarly, for each node, GraphSAGE (Hamilton et al., 2017) aggregates the features from its neighbors to update the node representation. While they lead to successes on node classification and link prediction tasks for single graphs, they are not directly applicable to real-world systems with locally distributed graphs, where graphs from different sources are not shared across participants, which gives rise to federated learning approaches to train GNNs.

Federated Learning  Federated Learning (FL) (Li et al., 2021) is an essential approach for our distributed subgraph learning problem. To mention a few, FedAvg (McMahan et al., 2017) locally trains a model for each client and then transmits the trained model to a server, while the server aggregates the model weights from local clients and then sends the aggregated model back to them. However, since the locally collected data from different clients may largely vary, heterogeneity is a crucial issue. To tackle this, FedProx (Li et al., 2020) proposes a regularization term that minimizes the weight differences between local and global models, which prevents the model from diverging to the local training data. However, when the local data is extremely heterogeneous, it is more appropriate to collaboratively train a personalized model for each client rather than learning a single global model.
FedPer (Arivazhagan et al., 2019) is such a method, which shares only the base layers across clients while keeping local personalized layers at each client, to preserve local knowledge. Furthermore, to share the learned knowledge between heterogeneous clients, recent studies propose to distill the outputs from clients (Lin et al., 2020; Sattler et al., 2021; Zhu et al., 2021), or to directly minimize the differences between their model outputs (Makhija et al., 2022). However, unlike the commonly studied image and text data, graph-structured data is defined by connections between instances, which introduces additional challenges: missing edges and shared nodes between private subgraphs.

Graph Federated Learning

A few recent studies propose to use the FL framework to collaboratively train GNNs without sharing graph data (He et al., 2021), and they can be broadly classified into subgraph- and graph-level methods. Graph-level FL methods assume that different clients have completely disjoint graphs (e.g., molecular graphs), and recent work (Xie et al., 2021; He et al., 2022) focuses on the heterogeneity among non-IID graphs (i.e., differences in graph labels across clients). In contrast to graph-level FL methods, whose challenges are similar to those of general FL scenarios, the subgraph-level FL problem we target has a unique graph-structural challenge: there exist missing yet probable links between subgraphs, since a subgraph is a part of a larger global graph. To deal with such missing links among subgraphs, existing methods (Wu et al., 2021a; Zhang et al., 2021) augment the nodes by requesting the node information in the other subgraphs, and then connecting the existing nodes with the augmented ones. However, this scheme may compromise data privacy constraints and also increases communication overhead across clients. Unlike existing subgraph FL that focuses on the problem of missing links, our subgraph FL method tackles the problem from a completely different perspective, focusing on exploring subgraph communities (Girvan & Newman, 2002; Radicchi et al., 2004), which are groups of densely connected subgraphs.

3. PROBLEM STATEMENT

We provide general descriptions of Graph Neural Networks (GNNs) and Federated Learning (FL), and then define our novel problem of personalized subgraph FL lying at the intersection of them.

Graph Neural Networks  A graph G = (V, E) consists of a set of n nodes V and a set of m edges E, along with its node feature matrix X ∈ R^{n×d}, where each row represents a d-dimensional node feature. (u, v) ∈ E represents an edge from a node u to a node v. Then, GNNs (Hamilton, 2020) generally represent each node based on features from its neighbors as well as itself, as follows:

H^{l+1}_v = UPDATE^l( H^l_v, AGGREGATE^l({ H^l_u : ∀u ∈ N(v) }) ),

where H^l_v is the representation of node v at the l-th layer, N(v) denotes the set of nodes adjacent to v: N(v) = {u ∈ V | (u, v) ∈ E}, AGGREGATE aggregates the features of v's neighbors, and UPDATE updates node v's representation given its previous representation and the aggregated representation from the neighbors. H^1 is initialized as the input node features X.
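The UPDATE/AGGREGATE scheme above can be sketched in plain Python. Here UPDATE is deliberately simplified to the mean of the self and aggregated features, whereas a real GNN layer would apply learned weight matrices and a nonlinearity; all names are illustrative.

```python
# A minimal sketch of one message-passing layer with mean aggregation.
# Node features are plain lists of floats; the graph is an adjacency dict.

def aggregate_mean(features, neighbors):
    """AGGREGATE: average the feature vectors of the neighboring nodes."""
    if not neighbors:
        return [0.0] * len(next(iter(features.values())))
    d = len(features[neighbors[0]])
    out = [0.0] * d
    for u in neighbors:
        for i in range(d):
            out[i] += features[u][i]
    return [x / len(neighbors) for x in out]

def update(h_v, h_agg):
    """UPDATE: here simply the mean of self and aggregated features;
    a real GNN would apply a learned linear map and a nonlinearity."""
    return [(a + b) / 2.0 for a, b in zip(h_v, h_agg)]

def message_passing_layer(features, adj):
    """One layer: H^{l+1}_v = UPDATE(H^l_v, AGGREGATE({H^l_u : u in N(v)}))."""
    return {v: update(h, aggregate_mean(features, adj[v]))
            for v, h in features.items()}

# Toy path graph: node 0 -- node 1 -- node 2.
adj = {0: [1], 1: [0, 2], 2: [1]}
H0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
H1 = message_passing_layer(H0, adj)
```

Stacking L such layers lets each node's representation depend on its L-hop neighborhood, which is what makes distributed subgraphs interdependent in the first place.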

Federated Learning

The goal of FL is to collaboratively train a model with local data. Let us assume we have K clients with locally collected data that is inaccessible from others: D_k = {X_i, y_i}_{i=1}^{N_k}, where X_i is a data instance, y_i is its corresponding class label, and N_k is the number of data instances at the k-th client. Then, a popular FL algorithm, FedAvg (McMahan et al., 2017), proceeds in rounds as follows: (1) the server transmits the global model parameters θ to the participating clients; (2) each client k locally updates its copy θ_k on its data D_k; (3) the server aggregates the local parameters into the global ones:

θ ← Σ_{k=1}^{K} (N_k / N) θ_k with N = Σ_k N_k, (1)

and distributes the updated global parameters θ to the local clients selected at the next round. This FL algorithm iterates between Steps 2 and 3 until reaching the final round R.

Challenges in Subgraph FL  While the above FL works well on image and text data, due to the unique structure of graphs there exist nontrivial challenges in applying this FL scheme to graph-structured data. In particular, unlike in the image domain where each instance X_i is independent of the other images, each node v in a graph is always influenced by its relationships to the adjacent nodes N(v). Moreover, a local graph G_i could be a subgraph of a larger global graph G: G_i ⊆ G. In such a case, there could be missing edges between subgraphs in two different clients: (u, v) with u ∈ V_i and v ∈ V_j for clients i and j, respectively. To tackle this problem, existing methods (Wu et al., 2021a; Zhang et al., 2021) estimate the nodes of a local subgraph G_k based on the node information from subgraphs at other clients G_i for all i ≠ k, and then extend the existing nodes with the estimated ones. However, this augmentation scheme incurs high communication costs as it requires sharing node information across clients, which may also violate data privacy constraints (Abadi et al., 2016). Yet, there exists a more challenging issue. Assume that we have a global graph consisting of all subgraphs.
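The server-side FedAvg aggregation of Step 3 (equation 1) can be sketched as follows, with parameters flattened into plain Python lists; names and values are illustrative.

```python
def fedavg(local_params, num_examples):
    """Server-side FedAvg: theta = sum_k (N_k / N) * theta_k,
    where N = sum_k N_k and parameters are flat lists of floats."""
    total = sum(num_examples)
    dim = len(local_params[0])
    global_params = [0.0] * dim
    for theta_k, n_k in zip(local_params, num_examples):
        for i in range(dim):
            global_params[i] += (n_k / total) * theta_k[i]
    return global_params

# Two clients: one holding 10 examples, the other 30, so the second
# client's parameters receive three times the weight.
theta = fedavg([[1.0, 2.0], [5.0, 6.0]], [10, 30])
```

Note how the averaging is purely data-size weighted: every client contributes to one global model regardless of how heterogeneous its data is, which is exactly the behavior the rest of this section argues against for subgraphs.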
Then, there are communities of such subgraphs (Radicchi et al., 2004; Girvan & Newman, 2002; Porter et al., 2009), where subgraphs within the same community are more densely connected to each other than to subgraphs outside the community. Formally, a global graph G can be decomposed into T different communities: C_i ⊆ G for i = 1, ..., T, where the i-th community C_i = (V_i, E_i) consists of densely connected nodes. Then, in a subgraph FL problem, every local subgraph G_j belongs to at least one community, i.e., each community is covered by its member subgraphs: C_i = ∪_{j=1}^{J} G_j. Note that, based on the theory of network homophily (McPherson et al., 2001), connected subgraphs within the same community have similar properties, while subgraphs in two opposite communities do not. Such distributional heterogeneity across communities may lead a naive FL algorithm to collapse incompatible knowledge from different communities.

Personalized Subgraph FL  To prevent the above knowledge collapse issue, we aim to personalize the subgraph FL algorithm by performing personalized weight averaging of local model parameters, thereby capturing the community structure among interrelated subgraphs. To be formal, the objective of existing subgraph FL (Wu et al., 2021a; Zhang et al., 2021; Liu et al., 2021) is as follows: min_θ Σ_{G_i ⊆ G} L(G_i; θ). However, finding a universal set of parameters θ that works on all tasks will result in a suboptimal parameter set, since subgraphs in two different communities with sparse connections between them are extremely heterogeneous due to network homophily. To address this limitation, we formulate a novel problem of personalized subgraph FL, formalized as follows:

min_{{θ_i, μ_i}_{i=1}^{K}} Σ_{G_i ⊆ G} L(G_i; θ_i, μ_i), θ_i ← μ_i ⊙ ( Σ_{j=1}^{K} α_ij θ_j ), with α_ik ≫ α_il for G_k ⊆ C and G_l ⊄ C, (2)

where θ_i is the weight for subgraph G_i belonging to community C.
α_ij is the coefficient for weight aggregation between clients i and j, which can promote collaborative learning across multiple local models working on interrelated subgraphs belonging to the same community, by assigning larger weights to them. However, this scalar coefficient α_ij cannot tell us which elements of the aggregated weight are relevant to subgraph G_i. Therefore, we further multiply the aggregated weight, element-wise (⊙), by a trainable sparse vector μ_i, to shift and filter out irrelevant weights coming from subgraphs of heterogeneous communities. We specify how to obtain α and μ in Section 4.
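A minimal sketch of the personalized update in equation 2, assuming the aggregation coefficients α_i and a binary mask μ_i are already given (how they are actually obtained is the subject of Section 4); all names and values are illustrative.

```python
def personalized_update(thetas, alpha_i, mask_i):
    """theta_i <- mu_i (x) sum_j alpha_ij * theta_j.

    thetas  : list of flat parameter vectors, one per client
    alpha_i : aggregation coefficients for client i (sum to 1)
    mask_i  : client i's sparse mask, applied element-wise
    """
    dim = len(thetas[0])
    agg = [0.0] * dim
    for a, theta_j in zip(alpha_i, thetas):
        for d in range(dim):
            agg[d] += a * theta_j[d]
    return [m * x for m, x in zip(mask_i, agg)]

# Client i weights its own community's model (alpha = 0.75) more than a
# distant one (alpha = 0.25), then masks out the second parameter entirely.
theta_i = personalized_update([[2.0, 4.0], [0.0, 2.0]], [0.75, 0.25], [1.0, 0.0])
```

The scalar α handles "how much of each client", while the element-wise μ handles "which parameters", which is the division of labor the two components of FED-PUB implement.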

4. FEDERATED PERSONALIZED SUBGRAPH LEARNING FRAMEWORK

To realize our goal of personalized subgraph FL (equation 2), we propose to compute subgraph similarities for detecting communities, and to mask weights from subgraphs in unrelated communities.

4.1. SUBGRAPH SIMILARITY ESTIMATION FOR DETECTING SUBGRAPH COMMUNITY

We aim to capture the community structure consisting of a group of densely connected subgraphs. Note that, due to network homophily, where similar instances in the graph are more associated with each other (McPherson et al., 2001), subgraphs within the same community should be similar. In other words, if one can measure subgraph similarities, one can group similar subgraphs into a community. However, measuring similarity between local subgraphs is challenging since we do not know which subgraph each client has, due to local accessibility. How can we then compute subgraph similarities without accessing them? To this end, we propose to approximate the similarity at local clients using auxiliary information obtained from the local GNN models working on the subgraphs.

Subgraph Similarity Estimation with Model Parameters  For measuring the similarity between local subgraphs without accessing them, we may use the model parameters as proxies, as follows: S(i, j) = (θ_i · θ_j) / (‖θ_i‖‖θ_j‖), where θ is the parameters flattened into a vector and S is a similarity measure. This may sound reasonable, since the GNN model trained on a subgraph embeds its knowledge into its parameters. However, this scheme has a notable drawback: similarity measured in the high-dimensional parameter space is not meaningful due to the curse of dimensionality (Bellman, 1966), and the cost of calculating the similarity between parameters grows rapidly as the model size increases (see Figure 3).
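The parameter-similarity proxy above is a plain cosine similarity over flattened parameter vectors, sketched below; as the text notes, its cost grows with the parameter dimension, which becomes expensive for large models.

```python
import math

def cosine_similarity(theta_i, theta_j):
    """S(i, j) = (theta_i . theta_j) / (||theta_i|| ||theta_j||)."""
    dot = sum(a * b for a, b in zip(theta_i, theta_j))
    norm_i = math.sqrt(sum(a * a for a in theta_i))
    norm_j = math.sqrt(sum(b * b for b in theta_j))
    return dot / (norm_i * norm_j)

# Identical parameter vectors give similarity ~1; orthogonal ones give 0.
same = cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```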

Subgraph Similarity Estimation with Functional Embedding

To tackle the limitation of using parameter distance, we propose to measure the functional similarity of neural networks by feeding the same input to every local model and then calculating the similarities between their outputs, inspired by neural network search (Jeong et al., 2021). The main intuition is that the transformation defined by a neural network can be considered as a function, and we can measure the functional similarity of two networks by the distance between their outputs for the same input. However, unlike the previous work, which uses Gaussian noise as inputs for image classification, we use random graphs as inputs since we work with GNNs. Formally, let G̃ = (Ṽ, Ẽ) be a random community graph generated by a stochastic block model (Holland et al., 1983), in which nodes within a community have more edges between them than edges across communities (see Appendix B.3 for initialization details). Then, the similarity between the two functions defined by the GNNs at clients i and j is defined as follows:

S(i, j) = (h̃_i · h̃_j) / (‖h̃_i‖‖h̃_j‖), h̃_i = AVG(f(G̃; θ_i)) and h̃_j = AVG(f(G̃; θ_j)), (3)

where h̃ is the average, computed with the operation AVG, of all node embeddings output for the input G̃. We provide additional discussions with results on similarity estimation in Appendix C.6 and C.7.

Personalized Weight Aggregation with Subgraph Similarity  With equation 3, the remaining step is to share the model weights between models working on similar subgraphs belonging to the same community. However, entirely ignoring model parameters from different communities may result in exploiting only the local objective while ignoring globally useful weights, which leads to suboptimal performance (see Appendix C.8 for details).
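The functional embedding of equation 3 can be illustrated with the toy sketch below: a stochastic-block-model random graph is built once, fed through each client's model, and the node outputs are averaged. The one-parameter "GNN" here is a deliberate simplification for illustration, not the paper's actual architecture, and all names are assumptions.

```python
import random

def sbm_graph(block_sizes, p_in, p_out, seed=0):
    """Random community graph from a stochastic block model: node pairs in
    the same block are connected with probability p_in, across blocks p_out."""
    rng = random.Random(seed)
    n = sum(block_sizes)
    block = []
    for b, size in enumerate(block_sizes):
        block += [b] * size
    adj = {v: [] for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if block[u] == block[v] else p_out
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj

def functional_embedding(theta, adj, features):
    """h~ = AVG(f(G~; theta)): run a toy one-parameter 'GNN' (scaled mean
    aggregation) over the random graph, then average all node outputs."""
    outs = []
    for v, nbrs in adj.items():
        agg = sum(features[u] for u in nbrs) / max(len(nbrs), 1)
        outs.append(theta * (features[v] + agg))
    return sum(outs) / len(outs)

# The SAME random graph and input features are shared by all clients, so
# only the model parameters can make the embeddings differ.
adj = sbm_graph([5, 5], p_in=0.9, p_out=0.1)
feats = {v: 1.0 for v in adj}
h_a = functional_embedding(2.0, adj, feats)
h_b = functional_embedding(2.0, adj, feats)
```

Because the probe input is fixed, models that implement similar functions produce similar embeddings, and the cosine similarity of equation 3 can then compare low-dimensional outputs instead of full parameter vectors.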
Therefore, we perform weighted averaging of the local models from all clients based on their functional similarities, as follows (Figure 2 (B)):

θ̃_i ← Σ_{j=1}^{K} α_ij · θ_j, α_ij = exp(τ · S(i, j)) / Σ_k exp(τ · S(i, k)), (4)

where α_ij is the normalized similarity between clients i and j, and τ is a hyperparameter for scaling the unnormalized similarity score. Note that increasing the value of τ (e.g., to 10) makes the model averaging concentrate almost exclusively on subgraphs detected as belonging to the same community. This personalized scheme handles two challenges in subgraph FL. First, in contrast to global weight aggregation, which collapses the knowledge from heterogeneous communities, our subgraph FL allows the models belonging to different communities to obtain individual parameters that are beneficial for each community. Also, missing edges (i.e., a lack of information sharing) between interconnected subgraphs, which are explicitly handled by expanding local subgraphs in existing works (Wu et al., 2021a; Zhang et al., 2021), can be implicitly handled by substantially sharing knowledge among models of probably-linked subgraphs within the same community (see Figures 6 and 9). This enhances data privacy while minimizing communication costs between subgraphs.
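The similarity-scaled aggregation of equation 4 can be sketched as below; note how a large τ concentrates the averaging on the most similar client, approximating per-community averaging. Names and values are illustrative.

```python
import math

def aggregation_weights(sims_i, tau):
    """alpha_ij = exp(tau * S(i, j)) / sum_k exp(tau * S(i, k)):
    a softmax over client i's similarity scores, sharpened by tau."""
    exps = [math.exp(tau * s) for s in sims_i]
    z = sum(exps)
    return [e / z for e in exps]

def weighted_average(thetas, alphas):
    """theta~_i <- sum_j alpha_ij * theta_j over flat parameter lists."""
    dim = len(thetas[0])
    return [sum(a * th[d] for a, th in zip(alphas, thetas))
            for d in range(dim)]

# Client i is very similar to client 0 (S=1.0) and dissimilar to the rest.
# With tau = 10, nearly all aggregation weight falls on client 0.
alphas_sharp = aggregation_weights([1.0, 0.2, 0.1], tau=10.0)
theta_i = weighted_average([[1.0], [0.0], [0.0]], alphas_sharp)
```

With a small τ the weights flatten toward uniform FedAvg, so τ interpolates between global averaging and strictly community-local averaging.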

4.2. ADAPTIVE WEIGHT MASKING FOR SELECTING SUBGRAPH-RELEVANT PARAMETERS

Based on the previous similarity matching scheme, we can effectively group GNNs that belong to the same community, thus preventing the collapse of irrelevant knowledge from other communities. However, the scalar weighting scheme only considers how much each local model from other clients is relevant to the subgraph task, but not which parameters are relevant. Thus we propose a scheme to select only the relevant parameters from the aggregated model weights transmitted from the server.

Personalized Parameter Masking  We perform selective training and updating of the aggregated parameters by modulating and masking them with sparse local masks (Figure 2 (C)). Formally, let μ_i be the local mask for client i. Then, our local model weight is obtained by modulating the weights given by the server, as follows: θ_i = θ̃_i ⊙ μ_i, where ⊙ is an element-wise multiplication between the globally given weight θ̃_i and the local mask μ_i. Note that μ_i is a free variable and is not shared across clients. Also, we initialize μ_i with ones, in order to start training from the globally initialized model parameters without modification. We then further promote sparsity on the mask, which has two advantages. First, we can transmit to the server only the partial parameters that have not been sparsified out at the client, rather than sending all parameters, thus reducing the communication costs. Also, if local masks are sufficiently sparse, local models can be trained faster when zero-skipping operations are supported. To obtain these benefits of sparsity, we apply an L1 regularizer to μ_i when performing local optimization (see Appendix B.3 for details on sparsification), as shown in equation 5.

Preventing Local Divergence with Proximal Term  As masks are trained only with limited local data and without parameter sharing, they may easily overfit to the training instances in each client. To alleviate this issue, we adopt the proximal term proposed in Li et al. (2020) that regularizes the locally updated model θ_i to stay close to the globally given model θ̃_i, thereby preventing the model from drifting too far toward the local training distribution. To sum up, at the i-th client, our objective function including the sparsity and proximal terms with L1 and L2 losses is as follows:

min_{(θ_i, μ_i)} L(G_i; θ_i, μ_i) + λ_1 ‖μ_i‖_1 + λ_2 ‖θ_i - θ̃_i‖²_2, (5)

where L is a conventional cross-entropy loss function, and λ_1 and λ_2 are scaling hyperparameters.
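The full client objective of equation 5 can be sketched as a plain function of the already-computed task loss; the λ_1 and λ_2 values below are illustrative, not the paper's tuned hyperparameters.

```python
def local_objective(task_loss, theta, theta_global, mask, lam1, lam2):
    """Client objective (equation 5 sketch): task loss, plus an L1 sparsity
    penalty on the mask, plus an L2 proximal term keeping the local weights
    theta close to the globally aggregated weights theta_global."""
    l1_sparsity = sum(abs(m) for m in mask)
    l2_proximal = sum((a - b) ** 2 for a, b in zip(theta, theta_global))
    return task_loss + lam1 * l1_sparsity + lam2 * l2_proximal

# Toy numbers: cross-entropy 0.5, a half-sparse mask, and a local model
# that has drifted by 1.0 in its second parameter.
loss = local_objective(0.5, [1.0, 2.0], [1.0, 1.0], [1.0, 0.0],
                       lam1=0.1, lam2=0.01)
```

The two penalties pull in complementary directions: λ_1 drives mask entries toward zero for sparsity, while λ_2 keeps the unmasked parameters anchored to the server aggregate.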

5. EXPERIMENTS

We now experimentally validate our FED-PUB on six different datasets under both the overlapping and disjoint subgraph scenarios with varying client numbers, on node classification tasks.

5.1. EXPERIMENTAL SETUPS

Datasets  Following the experimental setup of Zhang et al. (2021), we construct distributed subgraphs from each benchmark dataset by dividing it among the participants: each FL participant holds a subgraph that is a part of the original graph. In particular, we use six datasets: Cora, CiteSeer, PubMed, and ogbn-arxiv for citation graphs (Sen et al., 2008; Hu et al., 2020); Computer and Photo for product graphs (McAuley et al., 2015; Shchur et al., 2018). We then divide each original graph into multiple subgraphs using the METIS graph partitioning algorithm (Karypis, 1997). Note that, unlike the Louvain algorithm (Blondel et al., 2008) used in Zhang et al. (2021), which requires further merging of the partitioned subgraphs into a particular number of subgraphs since it cannot specify the number of subsets (i.e., FL clients), the METIS algorithm can directly specify the number of subsets.

Main Results

Table 1 shows the node classification performance under the overlapping subgraph scenario, in which our FED-PUB statistically significantly (p < 0.05) outperforms all the baselines. In particular, while FedGNN and FedSage+ are two pioneering works on the subgraph FL problem, they significantly underperform personalized FL methods including ours, especially with larger numbers of clients. This is somewhat surprising, as they share node information between clients to handle the missing edge problem; we suppose such inferior performance comes from the naive averaging of local weights without consideration of community structures. While personalized FL baselines including FedPer and GCFL show decent performance by alleviating the knowledge collapse issue between subgraphs with local parameterization or clustering, they still largely underperform ours, as they do not consider aggregation between similar subgraphs that form a community (i.e., GCFL uses a bi-partitioning scheme, which iteratively divides a group of subgraphs within the same community into two disjoint sets). We then further conduct experiments on the disjoint subgraph scenario (i.e., the non-overlapping scenario), where nodes are not shared between subgraphs, which makes the subgraph FL problem more heterogeneous. As shown in Table 2, FED-PUB consistently outperforms all existing baselines in this challenging scenario, demonstrating its efficacy.

Fast Local Convergence  As shown in Figures 4 and 5, our FED-PUB converges rapidly compared against baselines including personalized FL models. We conjecture that this is because ours not only accurately identifies the subgraphs forming a community and then substantially shares weights across them to promote joint improvement, but also masks out subgraph-irrelevant weights received from the server for localization to the local subgraph, as demonstrated in the next two paragraphs.

Community Detection

We aim to show whether the proposed FED-PUB can group subgraphs comprising a community during personalized weight aggregation. Note that if two different subgraphs have many missing edges between them or have similar label distributions, we usually consider the two as within the same community (Radicchi et al., 2004; Girvan & Newman, 2002; Porter et al., 2009). As shown in Figure 6 (a) and (b), there are four different communities, each consisting of five clients, and the last two communities further comprise a larger community. Then, as shown in Figure 6 (c) and (d), our FED-PUB clearly detects the four communities within the first few rounds, and then captures the larger yet somewhat less obvious community consisting of the two smaller communities.

Ablation Study

To analyze the contribution of each component, we conduct ablation studies. As shown in Figure 7, each of our subgraph similarity matching and weight masking schemes significantly improves the performance over naive FedAvg, while the performance improves further when using both together. However, the benefit of each component differs across the overlapping and non-overlapping scenarios. In the former scenario, where a group of densely overlapped subgraphs usually comprises a community, similarity matching for community detection is more beneficial, since capturing the community promotes the joint improvement of subgraphs belonging to the same community. In the non-overlapping scenario, however, two individual subgraphs become more heterogeneous, thus selectively using the aggregated model weights from the server with personalized weight masks yields a larger improvement (see additional results and discussions on heterogeneity with sparse weight masks in Appendix C.4).

Communication Efficiency  Another notable advantage of using sparse masks is that we can reduce the communication costs at every FL round, as well as the model size for faster training. In particular, as demonstrated in Table 3, existing subgraph FL methods require more than two times larger communication costs, measured by adding both the client-to-server and server-to-client costs, compared against naive FedAvg, since they need to transfer additional node information between clients to estimate the probable nodes on the subgraphs. In contrast, our FED-PUB has significantly lower communication costs and smaller model sizes by using sparse masks on the model weights: transmitting and training only the partial parameters not sparsified out at the client. Further, we can manage the trade-off between model sparsity and performance by controlling the hyperparameter for sparsity regularization, λ_1 (see Appendix C.1 for more hyperparameter analyses).
Varying Local Epochs  As shown in Figure 8, when we increase the number of communication rounds and local steps, the model drifts toward the local subgraph (i.e., overfitting), due to the small number of training instances and the direct connection between training and test nodes, and thus struggles to generalize to the test instances. However, our model with the proximal term in equation 5 alleviates this issue, thereby maintaining the highest local performance. Notably, the performance with five local epochs is inferior to that with one epoch, which indicates that increasing the local epochs does not always bring advantages and that properly tuning them is important for subgraph FL.

Handling Missing Edges

The missing edge problem, where two interconnected subgraphs cannot share information due to missing edges between them, is a unique challenge in subgraph FL (see Appendix C.9 for more discussions). To tackle it, existing subgraph FL methods explicitly augment nodes and edges to capture the information flow over missing edges between interconnected subgraphs, while ours implicitly shares weights substantially across similar subgraphs within the same community. To measure their efficacy, we evaluate the performance on the neighboring subgraph, which has the most missing edges to the local subgraph, using the local model's weights. Specifically, in Figure 9, (Neighbor) denotes the subgraph performance evaluated with its neighbor's model, while (Local) denotes the subgraph performance with its own local model. High performance on the (Neighbor) measure means two associated subgraphs share meaningful knowledge without having actual edges between them, thereby alleviating the missing edge problem. Figure 9 shows that ours achieves superior performance on the neighboring subgraphs against subgraph FL baselines, verifying that ours has an advantage on the missing edge problem by sharing meaningful knowledge between subgraphs with potentially missing edges, without explicitly estimating them.

6. CONCLUSION

We introduced a novel problem of personalized subgraph FL, which focuses on the joint improvement of local GNNs working on interrelated subgraphs (e.g. subgraphs belonging to the same community), by selectively utilizing knowledge from other models. The proposed personalized subgraph FL is highly challenging due to 1) difficulty of computing similarities between local subgraphs that are only locally accessible, and 2) knowledge collapse among local models that work on heterogeneous subgraphs during weight aggregation. To this end, we proposed a novel personalized subgraph FL framework, called FEDerated Personalized sUBgraph learning (FED-PUB), which computes the similarities across subgraphs using functional embeddings of their local GNNs on random graphs, and uses them to perform a weighted average of the local models for each client. Further, we mask out globally given weights to focus on only the relevant subnetwork for each community and client. We extensively validated our framework on multiple benchmark datasets with both overlapping and non-overlapping subgraphs, on which our FED-PUB significantly outperforms relevant baselines. Further analyses show the effectiveness of the subgraph similarity matching for detecting the community structures, as well as the weight masking for tackling the subgraph heterogeneity.

REPRODUCIBILITY STATEMENT

We attach the source code of our FED-PUB framework in the supplementary file. Also, we provide every detail of experimental setups including datasets, models, and implementations in Appendix B. 

A ALGORITHMS

In this section, we provide the algorithms for the proposed subgraph similarity estimation and adaptive weight masking in our FED-PUB framework. In particular, weight masking, performed on the client side, is shown in Algorithm 1, and similarity matching, performed on the server side, is shown in Algorithm 2. The core loop of these algorithms reads as follows: at the first round, each client is simply run from the shared initialization, θ_i^(r+1) ← RunClient(θ^(r)); at every subsequent round, client i first receives a similarity-weighted average over the K client models, θ_i^(r) ← Σ_{j=1}^{K} [ exp(τ · S(i, j)) / Σ_{k=1}^{K} exp(τ · S(i, k)) ] θ_j, and then performs local training, θ_i^(r+1) ← RunClient(θ_i^(r)).
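The similarity-weighted aggregation step used in the server-side similarity matching can be sketched as below. This is a minimal pure-Python illustration, not the released implementation: the function name and list-based parameter vectors are ours, and `S` stands for the K x K client similarity matrix described in the text.

```python
import math

def aggregate_for_client(i, weights, S, tau):
    """Personalized server-side aggregation (sketch): client i receives a
    softmax(tau * S(i, .))-weighted average of all K clients' parameters.
    `weights` is a list of K parameter vectors (lists of floats)."""
    K = len(weights)
    logits = [tau * S[i][j] for j in range(K)]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    Z = sum(exps)
    coeffs = [e / Z for e in exps]  # softmax weights, sum to 1
    dim = len(weights[0])
    return [sum(coeffs[j] * weights[j][d] for j in range(K)) for d in range(dim)]
```

With tau = 0 every client contributes equally (plain averaging); as tau grows, the aggregate concentrates on the most similar clients, which is how the scaling hyperparameter controls community grouping.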

B EXPERIMENTAL SETUPS

In this section, we first provide the descriptions of six different benchmark datasets that we use, along with their preprocessing setups and statistics for FL in Subsection B.1. Then, we explain the baselines and our proposed FED-PUB in detail in Subsection B.2. Lastly, we further describe the implementation details of experiments on synthetic and real-world graphs, as well as additional experimental details on functional similarities and sparse masks in Subsection B.3.

B.1 DATASETS

We report the statistics of six benchmark datasets (Sen et al., 2008; Hu et al., 2020; McAuley et al., 2015; Shchur et al., 2018) in Table 4: Cora, CiteSeer, PubMed, and ogbn-arxiv as citation graphs, and Computer and Photo as Amazon product graphs, which we use in both the overlapping and non-overlapping node scenarios. Specifically, in Table 4, we report not only the number of nodes, edges, and classes and the clustering coefficient for each subgraph, but also the heterogeneity between subgraphs. To measure the clustering coefficient, which indicates how strongly nodes cluster together, for an individual subgraph, we first calculate the clustering coefficient (Watts & Strogatz, 1998) for all nodes and then average them. To measure the heterogeneity, which indicates how dissimilar the disjoint subgraphs are, we calculate the median Jensen-Shannon divergence of the label distributions between all pairs of subgraphs.

For dataset splits, we randomly sample 20% of the nodes for training, 35% for validation, and 35% for testing, for all datasets except arxiv. Since the arxiv dataset has a relatively larger number of nodes, as shown in Table 4, we instead randomly sample 5% of the nodes for training, half of the remaining nodes for validation, and the rest for testing.

We then describe how we partition the original graph into multiple subgraphs, whose number equals the number of clients (i.e., FL participants). In general, we use the METIS graph partitioning algorithm (Karypis, 1997), which takes the number of disjoint subgraphs as a parameter, to divide the original graph. Consequently, in the non-overlapping node scenario, the disjoint subgraph for each client is directly obtained as the output of METIS (i.e., if we set the METIS parameter to 10, we obtain 10 disjoint subgraphs, one per client).
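The heterogeneity measure described above can be illustrated with a small sketch; `js_divergence` is a hypothetical helper, not part of our codebase, computing the base-2 Jensen-Shannon divergence between two label distributions (the paper reports the median of this quantity over all subgraph pairs).

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two label distributions,
    used here to quantify heterogeneity between client subgraphs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logarithms the divergence lies in [0, 1]: identical distributions give 0, and fully disjoint label supports give 1, matching the scale of the values reported in Appendix C.4 (e.g., 0.384 vs. 0.639).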
On the other hand, in the overlapping node scenario, where nodes are duplicated across different subgraphs, we first divide the original graph into 2, 6, and 10 disjoint subgraphs for 10, 30, and 50 clients, respectively, with the METIS algorithm. Then, from each split subgraph, we randomly sample half of the nodes and their associated edges, and use them as the subgraph for one particular client. This procedure is repeated five times to generate five different yet overlapping subgraphs per split subgraph from METIS.
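The overlapping split procedure above can be sketched as follows. This is a simplified illustration: `make_overlapping_clients` and its arguments are hypothetical names, and the real pipeline operates on METIS partition outputs rather than plain node sets.

```python
import random

def make_overlapping_clients(partition_nodes, edges, n_samples=5, frac=0.5, seed=0):
    """From one METIS partition, sample `frac` of its nodes (plus the induced
    edges) `n_samples` times, yielding overlapping client subgraphs."""
    rng = random.Random(seed)
    clients = []
    k = int(len(partition_nodes) * frac)
    for _ in range(n_samples):
        nodes = set(rng.sample(sorted(partition_nodes), k))
        # keep only edges whose both endpoints were sampled
        induced = [(u, v) for u, v in edges if u in nodes and v in nodes]
        clients.append((nodes, induced))
    return clients
```

Since each client keeps half of the partition's nodes, any two clients from the same partition overlap in roughly a quarter of the partition's nodes in expectation, which is the source of the node duplication in this scenario.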

B.2 BASELINES AND OUR MODEL

1. FedAvg: This method (McMahan et al., 2017) is the standard FL baseline, in which each client locally updates a model and sends it to the server, while the server averages the locally updated models in proportion to their numbers of training samples and transmits the aggregated model back to the clients.
2. FedProx: This method (Li et al., 2020) is an FL baseline that prevents the local model from drifting toward the local data by minimizing the weight difference between the local and global models.
3. FedPer: This method (Arivazhagan et al., 2019) is a personalized FL baseline that shares only the base layers, while keeping the personalized classification layers on the local side.
4. FedGNN: This method (Wu et al., 2021a) is a subgraph FL baseline that expands the local subgraph by exactly augmenting the relevant nodes from other clients. In the original paper, the authors consider nodes that have shared neighboring nodes across two clients as relevant, and augment them. However, in our non-overlapping node scenario, nodes are unique across clients, so we instead measure the similarities between nodes in different clients and augment those whose similarity is above a threshold (e.g., 0.5).
5. FedSage+: This method (Zhang et al., 2021) is a subgraph FL baseline that expands the local subgraph by generating additional nodes with a local graph generator. To train the graph generator, it first transmits the local node representations to other clients, which calculate the gradient of the distances between the transmitted representations and their own node features. The gradient is then sent back to the local client and used to train the graph generator.
6. GCFL: This method (Xie et al., 2021) is a graph FL baseline that targets completely disjoint graphs (e.g., molecular graphs), as in image tasks. In particular, it uses a bi-partitioning scheme that divides a set of clients into two disjoint groups based on their gradient similarities; after partitioning, model weights are shared only between grouped clients with similar gradients. Note that this bi-partitioning scheme is similar to the one proposed in clustered FL (Sattler et al., 2020) for image classification, and we adopt it for our subgraph FL.
7. Local: This method is the non-FL baseline that only trains the model locally at each client, without sharing any weights between clients.
8. FED-PUB: This is our FEDerated Personalized sUBgraph learning (FED-PUB) framework, which not only estimates the similarity between client subgraphs via their models' functional embeddings to detect subgraph community structures, but also adaptively masks the weights received from the server to filter out subgraph-irrelevant weights from heterogeneous communities.

B.3 IMPLEMENTATION DETAILS

Implementation Details on Functional Embeddings The functional embeddings are a key ingredient in the proposed FED-PUB framework: they capture the community structures of interconnected subgraphs, which are leveraged in personalized weight aggregation (see Section 4.1). Obtaining functional embeddings requires a graph input to the GNNs, which we randomly generate via a stochastic block model (Holland et al., 1983). Specifically, we sample five individual subgraphs of 100 nodes each, where the probability of an edge within a single subgraph is 0.1, while the probability of an edge between different subgraphs is 0.01. We initialize the node features from a normal distribution with mean 1.0 and variance 1.0. Note that this random graph is generated once at the server side, and the server distributes it to all clients. Each client then calculates its model's functional embedding on it and transmits only the output embedding to the server.

Implementation Details on Sparse Masks As described in Section 4.2, we sparsify the local personalized mask µ_i for each client i to reduce communication and prediction costs. In this paragraph, we provide the detailed implementation of the sparse masks during the training and test phases of our FED-PUB. First, during training, we regularize the local mask µ_i to be sparse by adding its L1 norm, scaled by λ_1, to the local loss L in equation 5. However, this regularization alone might not make a subset of the local mask exactly zero. Therefore, in the test phase, we apply a thresholding scheme in which elements (neurons) of µ_i below a certain threshold (i.e., λ_2) are set to zero. By doing so, we can not only transmit just the partial parameters to the server but also predict with only the partial parameters, effectively reducing both communication and prediction costs.
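The random input graph for the functional embeddings can be sketched roughly as below. This is an illustrative stand-alone generator, not our actual implementation, following the stated configuration: five blocks of 100 nodes, intra-/inter-block edge probabilities 0.1/0.01, and node features drawn from N(1, 1).

```python
import random

def sbm_random_graph(n_blocks=5, block_size=100, p_in=0.1, p_out=0.01, seed=0):
    """Stochastic-block-model random graph (sketch) used as the shared GNN
    input for computing functional embeddings. Returns an edge list and a
    list of scalar node features drawn from a normal distribution."""
    rng = random.Random(seed)
    n = n_blocks * block_size
    block = [i // block_size for i in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if block[u] == block[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    # mean 1.0, standard deviation 1.0 (variance 1.0)
    feats = [rng.gauss(1.0, 1.0) for _ in range(n)]
    return edges, feats
```

The key design point is that the same fixed random graph (same seed) is broadcast to all clients, so differences between the resulting output embeddings reflect differences between the client models' functions, not their inputs.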
Common Implementation Details for Experiments For all experiments, we stack two Graph Convolutional Network (GCN) layers (Kipf & Welling, 2017) and one linear classifier layer. Regarding hyperparameters, the number of hidden dimensions is set to 128 and the learning rate to 0.001. All models are optimized with the Adam optimizer (Kingma & Ba, 2015), and all clients participate in FL at every round. For all experiments with our FED-PUB, we set the λ_1 and λ_2 values, which scale the L1 sparsity and L2 proximal terms in equation 5, to 0.001. While these two scaling hyperparameters could be tuned, we observe that the default values yield satisfactory performance across all datasets without dataset-specific tuning (see Appendix C.1 for more analyses).
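Under the assumption that equation 5 combines the task loss with an L1 term on the mask and an L2 proximal term on the weights, as the text describes, the local objective can be sketched as follows. This is a simplified scalar illustration with hypothetical argument names, not the training code.

```python
def local_loss(task_loss, mask, local_w, global_w, lam1=0.001, lam2=0.001):
    """Sketch of the assumed client objective: task loss
    + lam1 * ||mask||_1 (sparsity) + lam2 * ||local_w - global_w||_2^2
    (proximal term keeping local weights near the aggregated ones).
    Both scales default to 0.001, as in the paper."""
    l1 = sum(abs(m) for m in mask)
    l2 = sum((w - g) ** 2 for w, g in zip(local_w, global_w))
    return task_loss + lam1 * l1 + lam2 * l2
```

The two penalties act on different quantities (the mask vs. the weight drift), which is consistent with the observation in Appendix C.1 that λ_1 and λ_2 have orthogonal and complementary effects.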

Implementation Details on Synthetic Graph Experiments

We perform two experiments on synthetic graphs, shown in Figure 1 and Figure 3 . In the experiment of Figure 1 , there are three communities with different label distributions (e.g., nodes in the first community have label 0, whereas nodes in the last community have label 2), consisting of 5/5/40 non-overlapping subgraphs across 50 clients. Each subgraph consists of 30 nodes, and edges between two nodes are sampled with probability 0.5. In the experiment of Figure 3 , there are two communities with different label distributions, consisting of 5/15 non-overlapping subgraphs across 20 clients. Each subgraph consists of 30 nodes; edges between two subgraphs within the same community are sampled with probability 0.7, whereas edges between two subgraphs from different communities are sampled with probability 0.01. For all experiments, the number of local epochs is set to 3 and the number of total FL rounds to 100. In our FED-PUB, including its variants that use parameters or gradients for subgraph similarity estimation, the scaling hyperparameter τ for the similarity in equation 4 is set to 10.

Implementation Details on Real-World Graph Experiments For the relatively small datasets, namely Cora, CiteSeer, and PubMed, we set the number of local training epochs to 1 and the number of total rounds to 100. For the larger datasets, namely Computer, Photo, and arxiv, we set the number of total rounds to 200, with 2 local epochs for Photo and arxiv and 3 for Computer. In the overlapping node scenario, we set the similarity scaling hyperparameter τ to 5 for all our models, whereas in the non-overlapping node scenario we set it to 3.
We observe that larger τ values perform better in the overlapping node scenario, in which different subgraphs are more easily grouped together, than in the disjoint node scenario. Finally, we report the test performance of all models at the best validation epoch, measured by node classification accuracy.

Computing Resources For all experiments, we use PyTorch (Paszke et al., 2019) and PyTorch Geometric (Fey & Lenssen, 2019) as deep learning libraries. We use two types of GPUs: GeForce

C ADDITIONAL EXPERIMENTAL RESULTS

In this section, we provide additional experimental results: sensitivity analyses of hyperparameters in Section C.1; varying graph partitioning schemes in Sections C.2 and C.3; varying random graph inputs in Section C.6; and varying similarity estimation schemes in Section C.7. In addition, we analyze the subgraph heterogeneity itself in Section C.4 and its relationship to the graph size in Section C.5, as well as the impact of missing edges on task performance in Section C.9.

C.1 RESULTS ON VARYING THE SCALING HYPERPARAMETERS IN LOSS FUNCTION

In Figure 10 , we explore the effects of the hyperparameters λ_1 and λ_2 on the Cora dataset under the overlapping node scenario, where the number of local epochs is set to 2 and the number of clients to 10. The λ_1 value controls the degree of model sparsity; to see its effect, we fix λ_2 while varying λ_1, and measure both model sparsity and performance. As shown in Figure 10 (left), higher λ_1 values significantly increase model sparsity while slightly decreasing performance. These results indicate a trade-off between sparsity and performance when selecting λ_1. On the other hand, λ_2 is designed to prevent excessive knowledge drift toward the local subgraph distribution; to verify its effectiveness, we fix λ_1 while varying λ_2. As shown in Figure 10 (right), small λ_2 values lead to performance degeneration, so choosing a sufficiently large λ_2 (e.g., 1e-1) yields high performance. Further, we observe that the sparsity does not depend on λ_2, so the effects of λ_1 and λ_2 are orthogonal and complementary.

C.2 RESULTS ON LOUVAIN GRAPH PARTITIONING ALGORITHM

We further compare the Louvain graph partitioning algorithm (Blondel et al., 2008), used in Zhang et al. (2021), against METIS (Karypis, 1997), which we use, for subgraph FL scenarios. Specifically, since the Louvain algorithm cannot specify the number of graph partitions, the number of subgraphs on the CiteSeer dataset is 38, three of which have fewer than ten nodes. Then, based on those 38 disjoint subgraphs, to generate a particular number of clients (e.g., 10), Zhang et al. (2021) randomly merge different subgraphs without considering their graph properties. Therefore, even though each partitioned subgraph has its own structural role and characteristics, the 10 subgraphs reconstructed from the original 38 have mixed properties (i.e., two incompatible subgraphs could be merged), which is suboptimal.
However, as described in the Datasets paragraph of Section 5.1, the METIS algorithm that we use can specify the number of partitions, and is thus more appropriate for constructing subgraph FL experimental settings. As shown in Table 5 , we conduct experiments with the Louvain graph partitioning algorithm (Blondel et al., 2008; Zhang et al., 2021) on the Cora, CiteSeer, and PubMed datasets with 10 clients. The results show that our FED-PUB consistently outperforms all the other baselines under this different graph partitioning setting, confirming its effectiveness.

C.3 RESULTS ON RANDOM GRAPH PARTITIONING

One might be curious about experiments on uniform random partitions of graphs, instead of splitting the graph with sophisticated partitioning algorithms (e.g., METIS and Louvain). In this subsection, we explain why this random partitioning setting is unrealistic, and then report performances in this setting. Specifically, if we partition the entire graph of the CiteSeer dataset into different subgraphs uniformly at random, the number of nodes in each subgraph becomes larger than the number of edges (e.g., 211 nodes yet 72 edges per subgraph, so some nodes have no edges at all), which is uncommon in practice. Nevertheless, we perform experiments on this random split setting with 10 clients on the CiteSeer dataset and report the results in Table 6 . As shown in Table 6 , the gap between the baselines and our model is reduced compared to the non-overlapping and overlapping scenarios in Table 1 and Table 2, since there is no particular community structure in this random setting; however, our FED-PUB still consistently outperforms all baselines.
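To see why uniform random partitioning destroys most local edges, note that an edge survives inside a client only if both of its endpoints land in the same partition, which happens with probability about 1/k for k equal parts. A small simulation sketch (the helper is hypothetical, not from our experiments):

```python
import random

def intra_partition_edge_fraction(n_nodes, edges, n_parts, seed=0):
    """Assign nodes to `n_parts` clients uniformly at random and measure the
    fraction of edges whose endpoints fall into the same partition. For a
    random split this is roughly 1/n_parts, which is why uniform partitioning
    leaves many nodes with few or no local edges."""
    rng = random.Random(seed)
    part = {v: rng.randrange(n_parts) for v in range(n_nodes)}
    kept = sum(1 for u, v in edges if part[u] == part[v])
    return kept / len(edges)
```

With 10 clients, roughly 90% of the edges are lost under a uniform split, which is consistent with the extreme node-to-edge ratios (211 nodes vs. 72 edges) observed on CiteSeer above.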

C.4 ANALYSES ON DISTRIBUTION SHIFTS BETWEEN SUBGRAPHS WITH SPARSE MASKS

To examine the distributional shifts between subgraphs in our subgraph FL, we measure the label differences between subgraphs with the Jensen-Shannon divergence on the Cora dataset with 20 clients over the overlapping and non-overlapping scenarios. In the non-overlapping node scenario, the distance (i.e., divergence value) among subgraphs within the same community is 0.384, while the distance between subgraphs belonging to different communities is 0.639. In the overlapping node scenario, the distance among subgraphs within the same community is 0.047, while the distance between subgraphs belonging to different communities is 0.528. These results confirm that the heterogeneity of subgraphs, even within the same community, is far larger in the non-overlapping setup (0.384) than in the overlapping setup (0.047). From this result, we can further argue that personalized weight aggregation from similarity matching alone is not sufficient for disjoint subgraph FL problems, since model weights received from completely heterogeneous subgraphs might not be meaningful for the local subgraph task, especially in the non-overlapping setting. In such an extremely heterogeneous case, however, a personalized weight masking scheme is clearly helpful, since it can filter out irrelevant information transmitted from other heterogeneous subgraphs, while allowing the model to maintain locally helpful information in its parameters. This is also aligned with the ablation results in Figure 7 : the personalized weight masking scheme brings large performance improvements in the non-overlapping setting with high heterogeneity between subgraphs, whereas the personalized weight aggregation scheme is more beneficial in the overlapping setting with low heterogeneity.
Lastly, to directly see the efficacy of sparse masks in subgraph FL, we empirically analyze whether they can indeed filter out irrelevant weights received from heterogeneous communities and subgraphs. To do so, we measure how many parameters are shared between the two most dissimilar (i.e., heterogeneous) subgraphs, as well as between the two most similar subgraphs, on the Cora dataset with 20 clients in the non-overlapping node setting. For the two most similar subgraphs within the same community, 75% of the parameters are shared. Meanwhile, for the two heterogeneous subgraphs from opposite communities, 73% of the parameters are filtered out by the sparse masks. These results demonstrate that sparse masks can prevent knowledge collapse from subgraphs of heterogeneous communities. To see how severe the heterogeneity issue becomes with respect to the number of clients, we further analyze the amount of heterogeneity as a function of the client number. In particular, following the statistics reported in Table 4, when we increase the number of clients in both the overlapping and non-overlapping node scenarios, the heterogeneity across subgraphs becomes more severe, and thus an increasingly important issue for personalized subgraph FL to tackle.
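The shared-parameter measurement above can be sketched as follows. This is an illustrative helper with a hypothetical name and a scalar threshold; the actual masks are model-sized tensors thresholded per element.

```python
def shared_parameter_fraction(mask_a, mask_b, threshold=0.001):
    """Fraction of parameter positions kept (mask magnitude above the
    threshold) by BOTH clients' sparse masks. Similar subgraphs should share
    most positions, while heterogeneous ones share few."""
    keep_a = [abs(m) >= threshold for m in mask_a]
    keep_b = [abs(m) >= threshold for m in mask_b]
    both = sum(1 for a, b in zip(keep_a, keep_b) if a and b)
    return both / len(mask_a)
```

A high value between similar clients and a low value between dissimilar clients (as in the 75% shared vs. 73% filtered numbers above) indicates that the masks specialize the aggregated weights per community.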

C.5 ANALYSES ON LOCAL GRAPH SIZE VS HETEROGENEITY

Note that one might wonder whether our FED-PUB is still effective when the heterogeneity issue is less significant. We thus further conduct experiments with 3 clients on the Cora, CiteSeer, and PubMed datasets in the non-overlapping node scenario. As shown in Table 7 , compared to the results in Table 2 with 5, 10, and 20 clients, the performance gaps between our FED-PUB and the baselines are much reduced. However, our FED-PUB still consistently outperforms all baselines with large margins even when the number of clients is small, since incompatible knowledge across clients remains, which our FED-PUB effectively handles with its personalized weight aggregation and local weight masking schemes.

As described in Section B.3, to obtain the functional embedding, we use the same random graph for all client models, initialized by a stochastic block model (Holland et al., 1983) with node features drawn from a normal distribution. The underlying assumption in using a random graph is that such randomness does not introduce any bias into the functional space, unlike the existing node features of a particular subgraph.

C.6 RESULTS ON VARYING THE GRAPH INPUTS FOR FUNCTIONAL EMBEDDINGS

In other words, we expect our random graphs to be helpful for effectively capturing the similarities among subgraphs. To experimentally validate this statement, we compare various graph inputs used for calculating the functional embeddings: 1) SBM denotes the random graph generated by the Stochastic Block Model (SBM), as in ours; 2) ER denotes the random graph generated by the Erdos-Renyi (ER) model (Erdős & Rényi, 1960); 3) One denotes the random graph having only one node; 4) Feature denotes the graph whose node features are initialized by the existing ones in the client. We measure the performance of these four schemes by calculating the correlation coefficient between the label distributions and the estimated similarities of subgraphs (i.e., a high correlation coefficient means that the estimated similarities from functional embeddings closely match the actual label distributions) on the Cora dataset under the non-overlapping and overlapping node scenarios with 20 clients, reported in Table 8 . As shown in Table 8 , compared to the One scheme that uses only a single node for calculating the functional embeddings, the SBM and ER schemes, which use larger numbers of randomly initialized nodes, accurately capture the similarities between subgraphs. This result demonstrates that a sufficient amount of randomness is required to capture the model's functional space. Also, compared to the Feature scheme that uses existing node representations to calculate the functional embeddings, the SBM and ER random models are superior in capturing similarities among subgraphs, which verifies that randomness indeed helps obtain accurate functional embeddings of models without incurring bias. As shown in Figure 3 , our functional embeddings are not only effective but also efficient in capturing similarities between subgraphs, compared against using parameter and gradient similarities.
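The correlation evaluation in Table 8 can be sketched as follows, assuming a standard Pearson correlation over the flattened pairwise values (estimated similarities vs. label-distribution similarities); the helper below is illustrative, not the actual evaluation script.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences,
    e.g. flattened estimated subgraph similarities vs. flattened
    label-distribution similarities over all client pairs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy)
```

A coefficient near 1 means the functional embeddings recover the same pairwise structure as the ground-truth label distributions, which is the criterion used to compare the SBM, ER, One, and Feature input schemes.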
Additionally, while one might consider using the label distributions as a proxy for similarity estimation between clients, since labels are private local data stored in the client, this scheme may violate the privacy constraints of FL. Nevertheless, to see their actual performance on a real-world dataset, we additionally conduct experiments with parameter, gradient, and label similarities on the Cora dataset under the overlapping node scenario with 30 clients, and compare the results with our functional similarities at 20, 40, 60, and 80 rounds.
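A minimal sketch of how such pairwise similarities S(i, j) could be computed, assuming cosine similarity over the compared vectors (functional embeddings, flattened parameters, or gradients); the function name is ours and the paper's exact similarity formula should be taken from Section 4.1.

```python
import math

def cosine_similarity(h_i, h_j):
    """Cosine similarity between two clients' comparison vectors, e.g. the
    functional embeddings h_i, h_j produced by running each client's GNN on
    the shared random graph."""
    dot = sum(a * b for a, b in zip(h_i, h_j))
    ni = math.sqrt(sum(a * a for a in h_i))
    nj = math.sqrt(sum(b * b for b in h_j))
    return dot / (ni * nj)
```

The same formula applies unchanged to the parameter- and gradient-based variants; only the vectors being compared differ, which is what the experiments in this subsection vary.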

C.7 RESULTS ON VARYING THE SIMILARITY ESTIMATION SCHEMES

As reported in Table 9 , the models that use parameters or gradients to calculate the similarities between subgraphs show performance comparable to the naive FedAvg model, and inferior to our functional and label similarity schemes. Moreover, even though the label similarity model uses privacy-sensitive local information (i.e., the label distributions of every client), the performance of our FED-PUB, which utilizes the functional embeddings from the privacy-free random graphs, is comparable to it.

D DISCUSSION ON LIMITATIONS AND POTENTIAL SOCIETAL IMPACTS

In this section, we discuss the limitations and potential societal impacts of our work.

Limitations While our personalized subgraph FL framework, FED-PUB, is generally applicable regardless of subgraph type (e.g., unipartite or bipartite graphs), our experiments are mainly done with unipartite graphs, which are the most popular setting. The efficacy of our FED-PUB on other types of graphs, such as bipartite graphs, has not been explored so far and would be interesting to investigate, which we leave as future work.

Potential Societal Impacts The FL mechanism is important for preserving users' privacy, and, while it is actively studied in the image and language domains, it has received little attention for graphs. We believe our work comprehensively investigates and tackles unique challenges in subgraph FL, such as missing nodes and edges and their community structures. A potential positive impact of our work on society is that it can contribute to various domains that utilize graph-structured data, such as social, recommendation, and patient networks. We would like to emphasize the importance of our subgraph FL scheme especially for social and recommendation networks.
In current real-world applications, all of a user's interactions with other users in social networks, and with products in recommendation networks, may be stored on the server. However, this not only may fail to preserve the user's privacy, but also carries the risk of user data leakage from the server, such that storing user data on the server is discouraged by existing data protection regulations such as the GDPR. By applying our subgraph FL framework to this domain, we expect such problems to be alleviated, since the user's interaction data is not stored on the server; only the locally trained machine learning models from client subgraphs are shared. However, the model parameters transmitted from the client to the server may still hold privacy-sensitive information. While eliminating it is not the main focus of our work (i.e., we assume that model parameters are transmittable without compromising privacy, as in many FL works (McMahan et al., 2017; Li et al., 2020; Arivazhagan et al., 2019)), the research community may need to put further effort into whether the model parameters are safe, and how to make them safer if they are not.



Footnotes:
FedGNN is extended to FedPerGNN, where the core algorithm of averaging all client gradients is the same.
We found communication rounds and local epochs are important factors to prevent overfitting of all models.
GDPR: https://gdpr-info.eu/



Figure 1: (A) An illustration of local subgraphs distributed across multiple participants, with overlapping nodes, missing edges, and community structures between subgraphs. (B) Existing subgraph FL methods (Wu et al., 2021a; Zhang et al., 2021) expand the local subgraphs to tackle the missing edge problem, but collapse incompatible knowledge from heterogeneous subgraphs. (C) Our personalized subgraph FL focuses on the joint improvement of local models working on interrelated subgraphs, such as ones within the same community, by selectively sharing knowledge across them. (Right) Knowledge collapse results, where local models belonging to two small communities (Communities 1 and 2) suffer large performance degeneration under existing subgraph FL methods such as FedGNN (Wu et al., 2021a; 2022) and FedSage+ (Zhang et al., 2021). A personalized FL method, FedPer (Arivazhagan et al., 2019), also underperforms ours, since it only focuses on each individual model's improvement without sharing local personalization layers between similar subgraphs.

The standard FL procedure, FedAvg (McMahan et al., 2017), works as follows: 1. (Initialization) At the initial communication round r = 0, the central server first selects K clients that are available for training, and initializes their local model parameters with the global parameter θ: θ_k^(0) ← θ^(0) for all k, where θ_k^(0) denotes the parameters of the k-th client. 2. (Local Updates) Each active local model performs training on its private local data D_k to minimize the task loss L(D_k; θ_k^(0)). 3. (Global Aggregation) After local training, the server aggregates the locally learned knowledge in proportion to the number of training instances, i.e., θ^(1) ← Σ_{k=1}^{K} (N_k / N) θ_k^(1), where N_k is the number of training samples at client k and N = Σ_k N_k.
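The global aggregation step above can be sketched as follows; a simplified illustration with list-based weight vectors, taking N as the sum of the N_k.

```python
def fedavg_aggregate(client_weights, n_samples):
    """FedAvg-style global aggregation (sketch): the server averages client
    parameter vectors weighted by each client's number of training
    instances N_k, i.e. theta <- sum_k (N_k / N) * theta_k."""
    total = sum(n_samples)  # N = sum of N_k
    K = len(client_weights)
    dim = len(client_weights[0])
    return [sum(n_samples[k] * client_weights[k][d] for k in range(K)) / total
            for d in range(dim)]
```

This sample-count weighting is exactly what FED-PUB replaces with similarity-based weighting: instead of N_k / N, each client receives a softmax over its similarities to the other clients.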

Figure 2: (A) Two communities, where Community A and B consist of one and two subgraphs, respectively. (B) Client Similarity Matching: we first forward randomly generated graphs to models f(Ḡ; θ_i) and obtain functional embeddings h_i, which are then used to estimate subgraph similarities. The similarities are used in weight aggregation, resulting in personalized model weights θ_i. (C) Weight Masking: the weights θ_i transmitted from the server to clients are masked and shifted by local masks µ_i for localization to the local subgraph.

Figure 3: Effectiveness (top) and efficiencies (bottom) of different similarity measurements.

Figure 5: Convergence plots for the non-overlapping node scenario. We visualize the test accuracy curves for all six datasets corresponding to Table 2, over 100 communication rounds with 10 clients.

Figure 6: The heatmaps of the community structure in the overlapping node scenario on Cora (20 clients). Dark color indicates many missing edges between subgraphs (a) or high label similarity (b). (c) and (d) are functional similarities captured by our FED-PUB.

Figure 8: Varying the local epochs with accuracy curves.

Figure 10: Analysis on hyperparameters λ1 and λ2, with corresponding model sparsity and performance.

Table 1: Results on the overlapping node scenario. The reported results are mean and standard deviation over three different runs. The statistically significant performances (p > 0.05) are emphasized in bold. [Table body garbled in extraction: node classification accuracies of Local, FedAvg, FedProx, FedPer, GCFL, FedGNN, FedSage+, and FED-PUB (Ours) across the six datasets; FED-PUB attains the highest reported average accuracy of 79.53.]

Convergence plots for the overlapping node scenario. We visualize the test accuracy curves for all six datasets corresponding to Table 1, over 100 communication rounds with 30 clients.

Table 2: Results on the non-overlapping node scenario. The reported results are the mean and standard deviation over three runs. The best results and those not statistically different from the best (p > 0.05) are emphasized in bold. Each cell lists the results for the three client-count settings of the corresponding dataset.

| Method | Cora | CiteSeer | PubMed |
|---|---|---|---|
| Local | 81.30 ± 0.21 / 79.94 ± 0.24 / 80.30 ± 0.25 | 69.02 ± 0.05 / 67.82 ± 0.13 / 65.98 ± 0.17 | 84.04 ± 0.18 / 82.81 ± 0.39 / 82.65 ± 0.03 |
| FedAvg | 74.45 ± 5.64 / 69.19 ± 0.67 / 69.50 ± 3.58 | 71.06 ± 0.60 / 63.61 ± 3.59 / 64.68 ± 1.83 | 79.40 ± 0.11 / 82.71 ± 0.29 / 80.97 ± 0.26 |
| FedProx | 72.03 ± 4.56 / 60.18 ± 7.04 / 48.22 ± 6.81 | 71.73 ± 1.11 / 63.33 ± 3.25 / 64.85 ± 1.35 | 79.45 ± 0.25 / 82.55 ± 0.24 / 80.50 ± 0.25 |
| FED-PUB (Ours) | 83.70 ± 0.19 / 81.54 ± 0.12 / 81.75 ± 0.56 | 72.68 ± 0.44 / 72.35 ± 0.53 / 67.62 ± 0.12 | 86.79 ± 0.09 / 86.28 ± 0.18 / 85.53 ± 0.30 |

| Method | Computer | Photo | ogbn-arxiv | Avg. |
|---|---|---|---|---|
| Local | 89.22 ± 0.13 / 88.91 ± 0.17 / 89.52 ± 0.20 | 91.67 ± 0.09 / 91.80 ± 0.02 / 90.47 ± 0.15 | 66.76 ± 0.07 / 64.92 ± 0.09 / 65.06 ± 0.05 | 79.57 |
| FedAvg | 84.88 ± 1.96 / 79.54 ± 0.23 / 74.79 ± 0.24 | 89.89 ± 0.83 / 83.15 ± 3.71 / 81.35 ± 1.04 | 65.54 ± 0.07 / 64.44 ± 0.10 / 63.24 ± 0.13 | 74.58 |
| FedProx | 85.25 ± 1.27 / 83.81 ± 1.09 / 73.05 ± 1.30 | 90.38 ± 0.48 / 80.92 ± 4.64 / 82.32 ± 0.29 | 65.21 ± 0.20 / 64.37 ± 0.18 / 63.03 ± 0.04 | 72.84 |
| FedPer | 89.67 ± 0.34 / 89.73 ± 0.04 / 87.86 ± 0.43 | 91.44 ± 0.37 / 91.76 ± 0.23 / 90.59 ± 0.06 | 66.87 ± 0.05 / 64.99 ± 0.18 / 64.66 ± 0.11 | 79.94 |
| GCFL | 89.07 ± 0.91 / 90.03 ± 0.16 / 89.08 ± 0.25 | 91.99 ± 0.29 / 92.06 ± 0.25 / 90.79 ± 0.17 | 66.80 ± 0.12 / 65.09 ± 0.08 / 65.08 ± 0.04 | 79.90 |
| FedGNN | 88.08 ± 0.15 / 88.18 ± 0.41 / 83.16 ± 0.13 | 90.25 ± 0.70 / 87.12 ± 2.01 / 81.00 ± 4.48 | 65.47 ± 0.22 / 64.21 ± 0.32 / 63.80 ± 0.05 | 75.23 |
| FedSage+ | 85.04 ± 0.61 / 80.50 ± 1.30 / 70.42 ± 0.85 | 90.77 ± 0.44 / 76.81 ± 8.24 / 80.58 ± 1.15 | 65.69 ± 0.09 / 64.52 ± 0.14 / 63.31 ± 0.20 | 73.47 |
| FED-PUB (Ours) | 90.74 ± 0.05 / 90.55 ± 0.13 / 90.12 ± 0.09 | 93.29 ± 0.19 / 92.73 ± 0.18 / 91.92 ± 0.12 | 67.77 ± 0.09 / 66.58 ± 0.08 / 66.64 ± 0.12 | 81.59 |

Analysis of the efficiency in communication costs and model sizes with sparse masks.
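As a rough illustration of the sparse-mask mechanism this analysis refers to (each client learns a personalized sparse mask to select and update only the subgraph-relevant subset of the aggregated parameters, which also reduces what must be communicated), the following is a minimal sketch. The score-thresholding rule and all names below are our own simplifying assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of a personalized sparse mask: a client keeps a
# trainable score per parameter, thresholds it into a binary mask, and
# only the masked (subgraph-relevant) entries of the server-aggregated
# weights are kept and exchanged.

def binary_mask(scores, threshold=0.0):
    # Illustrative hard threshold; a real system might use a learned one.
    return [1.0 if s > threshold else 0.0 for s in scores]

def apply_mask(params, mask):
    # Zero out the parameters the mask deems irrelevant to this subgraph.
    return [p * m for p, m in zip(params, mask)]

def communicated_fraction(mask):
    # A sparser mask means fewer parameters transmitted per round.
    return sum(mask) / len(mask)

aggregated = [0.5, -1.2, 0.3, 0.9, -0.1, 0.7]   # server-aggregated weights
scores = [1.3, -0.4, 0.2, 2.1, -0.9, -0.2]      # locally learned scores

mask = binary_mask(scores)
personalized = apply_mask(aggregated, mask)
frac = communicated_fraction(mask)  # 3 of 6 entries kept -> 0.5
```

In this toy setup, only half the model would need to be exchanged, which is the kind of communication saving the table above quantifies.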

Dataset statistics. We report the number of nodes, edges, and classes, the clustering coefficient, and the heterogeneity for the original graph and its split subgraphs in both the overlapping and non-overlapping node scenarios. Note that Ori denotes the original graph, and Cli denotes the number of clients.

Results under the Louvain graph partitioning algorithm, following Zhang et al. (2021).

Results under random graph partitioning.

Results on the Cora, CiteSeer, and PubMed datasets in the non-overlapping scenario with 3 clients.

Results when varying the graph inputs for functional embeddings, on the overlapping and non-overlapping node scenarios with 20 clients on Cora.

Results when varying the similarity calculation scheme: parameter, gradient, label, and our functional embedding, on the overlapping node scenario with 30 clients on the Cora dataset.


Under review as a conference paper at ICLR 2023

graph is similar to the performance of the label model. Therefore, along with the results in Figure 6, these comparisons of similarity schemes further verify the effectiveness of our functional embedding scheme in capturing the similarities among subgraphs and identifying their communities.
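The functional embedding scheme compared above can be sketched as follows: every client's local GNN is evaluated on the same shared random graph, and the cosine similarity between the flattened outputs serves as the client-to-client similarity, without the server ever seeing the private subgraphs. The one-layer linear GNN, the graph sizes, and the example weights below are simplifying assumptions for illustration, not the paper's architecture.

```python
import math
import random

def matmul(A, B):
    # Plain-list matrix multiply; zip(*B) iterates the columns of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def one_layer_gnn(adj, feats, weight):
    # Simplest possible GNN layer: neighborhood aggregation, then a linear map.
    return matmul(matmul(adj, feats), weight)

def functional_embedding(weight, adj, feats):
    # Flatten the model's output on the SHARED random graph into one vector.
    out = one_layer_gnn(adj, feats, weight)
    return [v for row in out for v in row]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# A fixed random graph the server sends to (or shares with) every client.
random.seed(0)
n, d = 4, 3
adj = [[random.choice([0.0, 1.0]) for _ in range(n)] for _ in range(n)]
feats = [[random.random() for _ in range(d)] for _ in range(n)]

# Three "clients": two with similar local weights, one very different.
w_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_b = [[1.1, 0.0], [0.0, 0.9], [1.0, 1.1]]     # close to client A
w_c = [[-1.0, 2.0], [2.0, -1.0], [0.0, -2.0]]  # dissimilar

e_a = functional_embedding(w_a, adj, feats)
e_b = functional_embedding(w_b, adj, feats)
e_c = functional_embedding(w_c, adj, feats)

sim_ab, sim_ac = cosine(e_a, e_b), cosine(e_a, e_c)
# Functionally similar clients should score higher than dissimilar ones.
print(f"sim(A,B)={sim_ab:.3f}  sim(A,C)={sim_ac:.3f}")
```

Because only model outputs on a synthetic graph are compared, this style of similarity avoids exchanging raw parameters, gradients, or label statistics.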

C.8 ANALYSES ON IMPLICIT AND EXPLICIT COMMUNITIES FOR WEIGHT AGGREGATION

As formalized in Equation 4 and described in Section 4.1, for personalized weight aggregation based on the functional similarities between clients, we implicitly model community structures by performing weight aggregation over all available clients. Alternatively, one can perform explicit weight aggregation by first grouping similar subgraphs into a community and then aggregating weights only among the clients within it. To see which strategy is superior, we compare these two variants: implicit and explicit community detection for weight aggregation. For the implicit setup, we use the formulation in Equation 4 without any modification. For the explicit setup, we aggregate weights exclusively between clients whose functional similarity score is above 0.5, which we regard as forming a community, applying the same normalization as in Equation 4 after identifying the communities.

As shown in Table 10, the model that implicitly captures community structures during weight aggregation consistently outperforms the explicit one, except in a single case: Cora under the overlapping node scenario. We believe this exception arises because the information in the other communities is especially unhelpful for this particular setup; therefore, completely ignoring them improves performance. Apart from this case, the results in Table 10 confirm that implicit modeling of community structures is generally better for personalized weight aggregation in subgraph FL.
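The two strategies above can be contrasted in a minimal sketch, assuming a simple similarity-proportional normalization (the exact normalization in Equation 4 may differ): the implicit variant weights all clients by their similarity to the target client, while the explicit variant first thresholds similarities at 0.5 into a hard community and renormalizes within it. The similarity values and parameter vectors are made up for illustration.

```python
def normalize(weights):
    # Make the aggregation weights sum to one.
    total = sum(weights)
    return [w / total for w in weights]

def implicit_weights(sims):
    # Implicit: every client contributes, proportionally to its similarity.
    return normalize(sims)

def explicit_weights(sims, threshold=0.5):
    # Explicit: drop clients below the threshold, renormalize the rest.
    kept = [s if s >= threshold else 0.0 for s in sims]
    return normalize(kept)

def aggregate(weights, client_params):
    # Weighted average of (flattened) client parameter vectors.
    dim = len(client_params[0])
    return [sum(w * p[i] for w, p in zip(weights, client_params)) for i in range(dim)]

# Similarities of client 0 to clients 0..3 (self-similarity = 1.0).
sims = [1.0, 0.8, 0.4, 0.1]
params = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.0, 2.0]]

theta_implicit = aggregate(implicit_weights(sims), params)
theta_explicit = aggregate(explicit_weights(sims), params)
```

In this toy example the explicit average ignores clients 2 and 3 entirely, which mirrors why it can help when the other communities carry no useful information (the Cora overlapping case) but generally discards signal the implicit variant still exploits.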

