GLASU: A COMMUNICATION-EFFICIENT ALGORITHM FOR FEDERATED LEARNING WITH VERTICALLY DISTRIBUTED GRAPH DATA
Anonymous authors
Paper under double-blind review

Abstract

Vertical federated learning (VFL) is a distributed learning paradigm, where computing clients collectively train a model based on the partial features of the same set of samples they possess. Current research on VFL focuses on the case when samples are independent, but it rarely addresses an emerging scenario when samples are interrelated through a graph. For graph-structured data, graph neural networks (GNNs) are competitive machine learning models, but a naive implementation in the VFL setting causes a significant communication overhead. Moreover, the analysis of the training is faced with a challenge caused by the biased stochastic gradients. In this paper, we propose a model splitting method that splits a backbone GNN across the clients and the server and a communication-efficient algorithm, GLASU, to train such a model. GLASU adopts lazy aggregation and stale updates to skip aggregation when evaluating the model and skip feature exchanges during training, greatly reducing communication. We offer a theoretical analysis and conduct extensive numerical experiments on real-world datasets, showing that the proposed algorithm effectively trains a GNN model, whose performance matches that of the backbone GNN when trained in a centralized manner.

1. INTRODUCTION

Vertical federated learning (VFL) is a newly developed machine learning scenario in distributed optimization, where clients share data with the same sample identity but each client possesses only a subset of the features for each sample. The goal is for the clients to collaboratively learn a model based on all features. Such a scenario appears in many applications, including healthcare, finance, and recommendation systems (Chen et al., 2020b; Liu et al., 2022). For example, in healthcare, each hospital may collect partial clinical data of a patient such that their conditions and treatments are best predicted through learning from the data collectively; in finance, banks or e-commerce providers may jointly analyze a customer's credit with their trade histories and personal information; and in recommendation systems, online social/review platforms may collect a user's comments and reviews left at different websites to predict suitable products for the user.
Most of the current VFL solutions (Chen et al., 2020b; Liu et al., 2022) treat the case where samples are independent and omit their relational structure. However, pairwise relationships between samples arise on many occasions and can be crucial in several learning scenarios, including the low-labeling-rate scenario in semi-supervised learning and the no-labeling scenario in self-supervised learning. Take the financial application as an example: customers and institutions are related through transactions. Such relations can be used to trace financial crimes such as money laundering, to assess the credit risk of a customer, and even to recommend products to them. Each bank and e-commerce provider can infer the relations of the financial individuals registered to them and create a relational graph, in addition to the individual customer information they possess.
One of the most effective machine learning models to handle relational data is graph neural networks (GNNs) (Kipf & Welling, 2016; Hamilton et al., 2017; Chen et al., 2018; Velickovic et al., 2018; Chen et al., 2020a). This model performs neighborhood aggregation in every feature transformation layer, such that the prediction of a graph node is based on not only the information of this node but also that of its neighbors. Although GNNs have been used in federated learning, a majority of the cases therein are horizontal: each client possesses a local dataset of graphs and all clients collaborate to train a unified model to predict graph properties, rather than node properties (He et al., 2021; Bayram & Rekik, 2021; Xie et al., 2021). Our case is different. We are concerned with subgraph-level, vertical federated learning (Zhou et al., 2020; Ni et al., 2021): each client holds a subgraph of the global graph, part of the features for nodes in this subgraph, and part of the whole model; all clients collaboratively predict node properties. Our vertical setting is exemplified by not only the partitioning of node features, but also the (sub)graphs among the nodes.
The setting under our consideration is fundamentally challenging, because fully leveraging features within neighborhoods causes an enormous amount of communication. One method to design and train a GNN is that each client uses a local GNN to extract node representations from its own subgraph and the server aggregates these representations to make predictions (Zhou et al., 2020). The drawback of this method is that the partial features of a node outside one client's neighborhood are not used, even if this node appears in another client's neighborhood. Another method to train a GNN is to simulate centralized training: transformed features of each node are aggregated by the server, from where neighborhood aggregation is performed (Ni et al., 2021). This method suffers from the communication overhead incurred in each layer computation.
In this work, we propose a federated GNN model and a communication-efficient training algorithm, named GLASU, for federated learning with vertically distributed graph data. The model is split across the clients and the server, such that the clients can use a majority of existing GNNs as the backbone, while the server contains no model parameters. The server only aggregates and disseminates computed data with the clients. The communication frequency between the clients and the server is reduced through lazy aggregation and stale updates (hence the name of the method), with convergence guarantees. Moreover, GLASU can be considered as a framework that encompasses many well-known models and algorithms as special cases, including the work of Liu et al. (2022) when the subgraphs are absent, the work of Zhou et al. (2020) when all aggregations but the final one are skipped, the work of Ni et al. (2021) when no aggregations are skipped, and centralized training when only a single client exists.
We summarize the main contributions of this work below:
• Model design: We propose a flexible, federated GNN architecture that is compatible with a majority of existing GNN models.
• Algorithm design: We propose the communication-efficient GLASU algorithm to train the model. Therein, lazy aggregation saves communication for each joint inference round by skipping some aggregation layers in the GNN, while stale updates further save communication by allowing the clients to use stale global information for multiple local model updates.
• Theoretical analysis: We provide a theoretical convergence analysis for GLASU by addressing the challenges of biased stochastic gradient estimation caused by neighborhood sampling and correlated update steps caused by using stale global information.
• Numerical results: We conduct extensive experiments, together with ablation studies, to demonstrate that GLASU achieves performance comparable to that of the centralized model on multiple datasets and multiple GNN backbones, and that GLASU effectively saves communication.

1.1. PROBLEM SETUP

Consider $M$ clients, indexed by $m = 1, \ldots, M$, each of which holds a part of a graph with the node feature matrix $X \in \mathbb{R}^{N \times d}$ and the edge set $E$. Here, $N$ is the number of nodes in the graph and $d$ is the feature dimension. We assume that each client has the same node set and the same set of training labels, $y$, but a different private edge set $E_m$ and a non-overlapping node feature matrix $X_m \in \mathbb{R}^{N \times d_m}$, such that $E = \bigcup_{m=1}^{M} E_m$, $X = [X_1, \ldots, X_M]$, and $d = \sum_{m=1}^{M} d_m$. We denote the client dataset as $D_m = \{X_m, E_m, y\}$ and the full dataset as $D = \{X, E, y\}$. The task is for the clients to collaboratively infer the labels of nodes in the test set. See Figure 1 for an illustration.
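As a minimal sketch of this data layout, the following Python snippet splits a feature matrix column-wise across clients and checks that the private edge sets and feature blocks recover the full dataset. The toy graph, the helper name `vertical_split`, and the block widths are illustrative assumptions and do not come from the paper.

```python
def vertical_split(X, widths):
    """Split each row of X into consecutive, non-overlapping column blocks."""
    assert sum(widths) == len(X[0]), "block widths must sum to d"
    blocks, start = [], 0
    for w in widths:
        blocks.append([row[start:start + w] for row in X])
        start += w
    return blocks

# Toy data (hypothetical): N = 4 nodes, d = 5 features, M = 2 clients.
X = [[float(10 * i + j) for j in range(5)] for i in range(4)]
client_edges = [{(0, 1), (1, 2)}, {(1, 2), (2, 3)}]   # private edge sets E_m
client_feats = vertical_split(X, widths=[2, 3])        # blocks X_m, d_1 + d_2 = d

# The global edge set is the union of the clients' edge sets.
E = set().union(*client_edges)
# Concatenating the blocks row by row recovers X = [X_1, ..., X_M].
X_rebuilt = [sum((blk[i] for blk in client_feats), []) for i in range(len(X))]
```

Note that the edge sets may overlap across clients, whereas the feature blocks may not.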

1.2. GRAPH CONVOLUTIONAL NETWORK

The graph convolutional network (GCN) (Kipf & Welling, 2016) is a typical example of the family of GNNs. Inside GCN, a graph convolution layer reads
$$H[l+1] = \sigma\big(A(E) \cdot H[l] \cdot W[l]\big), \qquad (1)$$
where $\sigma(\cdot)$ denotes the point-wise nonlinear activation function, $A(E) \in \mathbb{R}^{N \times N}$ denotes the adjacency matrix defined by the edge set $E$ with proper normalization, $H[l] \in \mathbb{R}^{N \times d[l]}$ denotes the node representation matrix at layer $l$, and $W[l] \in \mathbb{R}^{d[l] \times d[l+1]}$ denotes the weight matrix at the same layer. The initial node representation matrix is $H[0] = X$. The classifier is denoted as $\hat{y} = f(H[L], W[L])$ with weight matrix $W[L]$, and the loss function is denoted as $\ell(y, \hat{y})$. Therefore, the overall model parameter is $W = \{W[0], \ldots, W[L-1], W[L]\}$.
Mini-batch training of GCN (and GNNs in general) faces a scalability challenge, because computing one or a few rows of $H[L]$ (i.e., the representations of a mini-batch) requires more and more rows of $H[L-1], H[L-2], \ldots$ With neighborhood sampling (e.g., FastGCN), a layer instead reads
$$H[l+1][S[l+1]] = \sigma\big(A(E(S[l+1], S[l])) \cdot H[l][S[l]] \cdot W[l]\big), \qquad (2)$$
where $S[l]$ denotes the set of nodes sampled at layer $l$ and $A(E(S[l+1], S[l])) \in \mathbb{R}^{|S[l+1]| \times |S[l]|}$ is the corresponding block of the normalized adjacency matrix, with rows indexed by $S[l+1]$ and columns by $S[l]$.
... (Zhang et al., 2021; He et al., 2021; Bayram & Rekik, 2021; Xie et al., 2021). Applications include predicting molecular properties (Xie et al., 2021) and learning connectional brain templates (Bayram & Rekik, 2021). On the other hand, the vertical case can be considered subgraph-level, where each client holds a subgraph of the global graph, a part of the node features, and a part of the whole model (Zhou et al., 2020; Ni et al., 2021). The clients aim to collaboratively train a global model with the partial features and subgraphs to predict node properties (see Figure 1). Existing methods either fail to fully leverage the neighborhood information (Zhou et al., 2020) or incur expensive communication (Ni et al., 2021). Our approach addresses these shortcomings.
An additional scenario that does not fit into the above common categories is node-level federated learning: the clients are connected by a graph and thus each of them is treated as a node. In other words, the clients, rather than the data, are graph-structured. For example, in Lalitha et al. (2019) and Meng et al. (2021), each client performs learning with its own data and the clients exchange data through the communication graph; whereas in Caldarola et al. (2021) and Rizk & Sayed (2021), the server maintains the graph structure and uses a GNN to aggregate information (either models or data) collected from the clients.

2. PROPOSED APPROACH

In this section, we present the proposed model and the training algorithm GLASU for federated learning on vertically distributed graph data. The neighborhood aggregation in GNNs poses communication challenges distinct from conventional VFL. To mitigate these challenges, we propose lazy aggregation and stale updates to reduce the communication between the clients and the server, while maintaining prediction performance comparable to that of centralized models. For notational simplicity, we present the approach by using the full-graph notation (1), but note that the implementation involves neighborhood sampling, for which the more precise notation follows (2), and that one can easily change the backbone from GCN to other GNN architectures.

2.1. GNN MODEL SPLITTING

We split the GNN model among the clients and the server, approximating a centralized model. Specifically, each GNN layer contains two sub-layers: the client GNN sub-layer and the server aggregation sub-layer. At the $l$-th layer, each client computes the local feature matrix
$$H_m^{+}[l] = \sigma\big(A(E_m) \cdot H_m[l] \cdot W_m[l]\big)$$
with the local weight matrix $W_m[l]$ and the local graph $E_m$, where we use the superscript $+$ to denote local representations before aggregation. Then, the server aggregates the clients' representations and outputs $H[l+1]$ as
$$H[l+1] = \mathrm{Agg}\big(H_1^{+}[l], \ldots, H_M^{+}[l]\big),$$
where $\mathrm{Agg}(\cdot)$ is an aggregation function. In this paper, we only consider parameter-free aggregations, examples of which include averaging
$$\mathrm{Avg}\big(H_1^{+}[l], \ldots, H_M^{+}[l]\big) = \frac{1}{M}\sum_{m=1}^{M} H_m^{+}[l]$$
and concatenation
$$\mathrm{Cat}\big(H_1^{+}[l], \ldots, H_M^{+}[l]\big) = \big[H_1^{+}[l], \ldots, H_M^{+}[l]\big].$$
The server broadcasts the aggregated $H[l+1]$ to the clients so that computation proceeds to the next layer. In the final layer, each client computes a prediction. This layer is the same among clients because they receive the same $H[L]$.
The two aggregation operations of our choice have a few advantages.
• Parameter-free: Since the operations do not contain any learnable parameters, the server does not need to perform gradient computations.
• Memory-less: In the backward pass, these operations do not require data from the forward pass to back-propagate the gradients. For averaging, the server back-propagates $\frac{1}{M}\partial H[l+1]$ to each client, while for concatenation, the server back-propagates the corresponding block of $\partial H[l+1]$.
• Easy-to-implement: The server implementation is simple because of the parameter-free and memory-less nature. Moreover, these operations enable parallelization and pipelining.
We illustrate in Figure 2 the split of one GNN layer among the clients and the server. Although our approach resembles federated split learning (SplitFed) (Thapa et al., 2022), there is a fine distinction. In SplitFed, each client can collaborate with the server to perform inference or model updates without accessing information from other clients, while in our case, all clients collectively perform the job. Our approach also differs from conventional VFL that splits the local feature processing and the final classifier among the clients and the server respectively, such that each model update requires a single U-shape communication (Chen et al., 2020b). In our case, due to the graph structure, each GNN layer contains one client-server interaction and the number of interactions is equal to the number of GNN layers (we will relax this in the following subsection).
Note that there are two types of aggregations in our model. One is the neighborhood aggregation (multiplying with matrix $A(E_m)$), a signature of GNNs, which occurs in each client locally and incurs no communication between the clients and the server. The other relates to the communication that happens when the server gathers the clients' partial representations and broadcasts back the aggregated representation.
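The split of one layer into a client sub-layer and a parameter-free server sub-layer can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: matrices are lists of lists, ReLU stands in for $\sigma$, and the function names and the toy values are hypothetical.

```python
def matmul(A, B):
    """Dense matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def client_sublayer(A_m, H_m, W_m):
    """Client side: H_m^+ = sigma(A(E_m) H_m W_m), with ReLU as sigma."""
    Z = matmul(matmul(A_m, H_m), W_m)
    return [[max(0.0, v) for v in row] for row in Z]

def agg_avg(parts):
    """Server side, averaging: (1/M) * sum_m H_m^+ (all parts share one shape)."""
    M = len(parts)
    return [[sum(vals) / M for vals in zip(*rows)] for rows in zip(*parts)]

def agg_cat(parts):
    """Server side, concatenation of the client blocks along the feature axis."""
    return [sum(rows, []) for rows in zip(*parts)]

def backward_avg(grad_out, M):
    """Backward pass of averaging: each client receives grad_out / M.
    No data from the forward pass is needed (the memory-less property)."""
    return [[g / M for g in row] for row in grad_out]

# Toy example with M = 2 clients and N = 2 nodes (all values hypothetical).
A1 = [[0.5, 0.5], [0.5, 0.5]]          # client 1: normalized adjacency of E_1
A2 = [[1.0, 0.0], [0.0, 1.0]]          # client 2: self-loops only
H1_plus = client_sublayer(A1, [[1.0], [3.0]], [[1.0, -1.0]])
H2_plus = client_sublayer(A2, [[2.0], [4.0]], [[0.0, 1.0]])
H_avg = agg_avg([H1_plus, H2_plus])
H_cat = agg_cat([H1_plus, H2_plus])
grad_to_each_client = backward_avg([[2.0, 4.0]], M=2)
```

Averaging keeps the representation width fixed regardless of the number of clients, whereas concatenation grows it by a factor of $M$, which affects the shape of the next layer's weight matrix.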

2.2. LAZY AGGREGATION

The development in the preceding subsection approximates a centralized model, but it is not communication friendly because each layer requires one round of client-server communication. We propose two communication-saving strategies in this subsection and the next.
We first consider lazy aggregation, which skips aggregation in certain layers. Instead of performing server aggregation at each layer, we specify a subset of $K$ indices, $I = \{l_1, \ldots, l_K\} \subset [L]$, such that aggregation is performed only at these layers. That is, at a layer $l \in I$, the server performs aggregation and broadcasts the aggregated representations to the clients, serving as the input to the next layer: $H_m[l+1] = H[l+1]$; while at a layer $l \notin I$, each client uses the local representations as the input to the next layer: $H_m[l+1] = H_m^{+}[l]$. By doing so, the model skips the server aggregation sub-layer between $l_k$ and $l_{k+1}$, such that the amount of communication is reduced from $O(L)$ to $O(K)$.
There is a subtlety caused by neighborhood sampling: it requires additional rounds of communication. Neighborhood sampling is done similarly to FastGCN (see Section 1.2), but note that whenever server aggregation is performed, it must be done on the same set of sampled nodes across clients. Hence, the server takes the union of the clients' index sets $S_m[l_k]$ and broadcasts $S[l_k] = \bigcup_{m=1}^{M} S_m[l_k]$ to the clients. On the other hand, when server aggregation is skipped at a layer $l \notin I$, each client can use its own set of sampled nodes, $S_m[l]$, which may differ across clients. Such a procedure is more flexible than conventional VFL, where strict sample synchronization is enforced. The sampling procedure is summarized in Algorithm 2 in the appendix.
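The control flow of lazy aggregation can be illustrated with a deliberately simplified forward pass. In the sketch below, each client's layer is reduced to a scalar weight followed by ReLU (an assumption made only to keep the example short; the graph multiplication is omitted), layers are indexed from 0, and `lazy_forward` is a hypothetical name.

```python
def lazy_forward(inputs, weights, agg_layers):
    """Run L layers over M clients, averaging only at layers in agg_layers.

    inputs[m]  : list of floats, client m's input (one value per node)
    weights[m] : list of L floats, scalar stand-ins for client m's layer weights
    Returns the per-client outputs and the number of client-server rounds.
    """
    M, L = len(inputs), len(weights[0])
    H = [list(h) for h in inputs]
    rounds = 0
    for l in range(L):
        # Client sub-layer (stand-in for sigma(A(E_m) H_m[l] W_m[l])).
        H_plus = [[max(0.0, weights[m][l] * v) for v in H[m]] for m in range(M)]
        if l in agg_layers:
            # Server sub-layer: average and broadcast, so clients share H[l+1].
            avg = [sum(vals) / M for vals in zip(*H_plus)]
            H = [list(avg) for _ in range(M)]
            rounds += 1
        else:
            # Aggregation skipped: each client continues with its local H_m^+[l].
            H = H_plus
    return H, rounds

inputs = [[1.0, 2.0], [3.0, 4.0]]
weights = [[1.0] * 4, [0.5] * 4]                      # L = 4 layers, M = 2 clients
out_lazy, rounds_lazy = lazy_forward(inputs, weights, agg_layers={1, 3})       # K = 2
out_full, rounds_full = lazy_forward(inputs, weights, agg_layers={0, 1, 2, 3}) # K = L
```

With $K = 2$ the sketch uses two communication rounds per forward pass instead of four. Because the last layer is in the aggregation set, all clients still end with the same final representation.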

2.3. STALE UPDATES

To further reduce communication, we consider stale updates, which skip aggregation in certain iterations and use stale node representations to perform model updates. The key idea is to fix the mini-batch, including the sampled neighbors at each layer, for $Q$ training iterations. Once every $Q$ iterations, the clients store the aggregated representations at the server aggregation layers. Then, in the subsequent iterations, every server aggregation is replaced by a local aggregation between a client's up-to-date node representations and other clients' stale node representations. By doing so, the clients and the server only need to communicate once every $Q$ iterations.
Specifically, let a round of training contain $Q$ iterations and use $t$ to index the rounds. At the beginning of each round, the clients and the server jointly decide the set of nodes used for training at each layer. Then, they perform a joint inference on the representations $H_m^{t,+}[l]$ at every layer $l \in I$. Each client $m$ will store the "all but $m$" representation $H_{-m}^{t}[l+1]$, which is extracted from the aggregated representations $H_m^{t}[l+1]$:
$$H_{-m}^{t}[l+1] = \mathrm{Extract}\big(H_m^{t}[l+1], H_m^{t,+}[l]\big).$$
For example, when the server aggregation is averaging, the extraction is
$$\mathrm{Extract}\big(H_m^{t}[l+1], H_m^{t,+}[l]\big) = H_m^{t}[l+1] - \frac{1}{M} H_m^{t,+}[l].$$
Afterward, the clients perform $Q$ iterations of model updates, indexed by $q = 0, \ldots, Q-1$, on the local parameters $W_m^{t,q}$ in parallel, using the stored aggregated information $H_{-m}^{t}[l+1]$ whenever a server aggregation is supposed to happen. In other words, the server aggregation is replaced by computation done locally, thus removing a significant amount of communication. Because $H_{-m}^{t}[l+1]$ is computed by using stale model parameters $\{W_{m'}^{t,0}\}_{m' \neq m}$ at all iterations $q \neq 0$, this approach is called "stale updates." The details are summarized in Algorithm 1, with subroutines given in Appendix A.
Algorithm 1
...
3: Client: $W_m^{t,0} = W_m^{t-1,Q}$ if $t > 0$; $W_m^{t,0} = W_m^{0}$ if $t = 0$.
4: Server/Client: $\{H_{-m}^{t}[l+1]\}_{l \in I} = \mathrm{JointInference}\big(W_m^{t,0}, D_m, \{S_m^{t}[l]\}_{l=0}^{L}\big)$. (Algorithm 3)
5: for $q = 0, \ldots, Q-1$ do
6:     for clients in parallel do
7:         $W_m^{t,q+1} = \mathrm{LocalUpdate}\big(W_m^{t,q}, D_m, \{S_m^{t}[l]\}_{l=0}^{L}, \{H_{-m}^{t}[l+1]\}_{l \in I}\big)$. (Algorithm 4)
8:     end for
9: end for
10: end for
11: Output: $\{W_m^{t,q}\}_{m=1}^{M}$
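For the averaging case, the extraction step and the local replacement of the server aggregation can be sketched as follows. The function names and toy values are hypothetical; the arithmetic follows the averaging formula given in the text.

```python
def extract_avg(H_agg, H_plus_m, M):
    """Extract for averaging: H_{-m} = H - (1/M) * H_m^+ (the other clients' share)."""
    return [[h - hp / M for h, hp in zip(row, row_p)]
            for row, row_p in zip(H_agg, H_plus_m)]

def local_agg_avg(H_minus_m, H_plus_m_new, M):
    """Local stand-in for server averaging: stale H_{-m} plus fresh (1/M) * H_m^+."""
    return [[s + hp / M for s, hp in zip(row, row_p)]
            for row, row_p in zip(H_minus_m, H_plus_m_new)]

M = 2
H_plus = [[[2.0, 0.0]], [[0.0, 4.0]]]   # H_m^+ of both clients at joint inference
H_agg = [[1.0, 2.0]]                     # their average, broadcast by the server

# Client 0 stores the "all but 0" part once per round.
H_minus_0 = extract_avg(H_agg, H_plus[0], M)

# At q = 0 the local aggregation reproduces the server output exactly.
H_q0 = local_agg_avg(H_minus_0, H_plus[0], M)

# After a local update, client 0's own part changes; the other part stays stale.
H_q1 = local_agg_avg(H_minus_0, [[4.0, 2.0]], M)
```

The first iteration of a round is therefore exact, and staleness only enters for $q \geq 1$, when other clients' parameters have moved but their stored contribution has not.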

2.4. SPECIAL CASES

It is interesting to note that our model and the training algorithm encompass several well-known models and algorithms as special cases.
Conventional VFL. VFL algorithms can be viewed as a special case of the proposed algorithm, where $A(E_m) = I$ for all $m$. In this case, no neighborhood sampling is needed and GLASU reduces to Liu et al. (2022).
Existing VFL algorithms for graphs. The model of Zhou et al. (2020) is a special case of our model, with $K = 1$; i.e., no communication is performed between the server and the clients except at the final prediction layer. In this case, the clients omit the connections absent in their own subgraph but present in other clients' subgraphs. The model of Ni et al. (2021) is also a special case of our model, with $K = L$. This case requires communication at all layers and is less efficient.
Centralized GNNs. When there is a single client ($M = 1$), our setting is the same as centralized GNN training. Specifically, by letting $K = L$ and properly choosing the server aggregation function $\mathrm{Agg}(\cdot)$, our split model can achieve the same performance as a centralized GNN model. Of course, using lazy aggregation ($K \neq L$) and choosing the server aggregation function as concatenation or averaging will make the split model different from a centralized GNN.

2.5. PRIVACY

Our training algorithm GLASU enables privacy protection because it is compatible with existing privacy-preserving approaches, including secure aggregation (SA) and differential privacy (DP). SA (Bonawitz et al., 2017; Hardy et al., 2017) is a form of secure multi-party computation used for aggregating information from a group of clients without revealing the information of any individual. This can be achieved by homomorphic encryption (Li et al., 2010; Hardy et al., 2017). In our case, when the server aggregation is averaging, homomorphic encryption can be directly applied. DP (Wei et al., 2020) is a probabilistic protection approach. By injecting stochasticity into the local outputs, this approach guarantees that, up to a certain probability, no attacker can distinguish an individual sample in the dataset. DP can be applied to the server-client communication in our algorithm, either alone or in combination with SA, to offer privacy protection on the client data.
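To give some intuition for why averaging is compatible with secure aggregation, the sketch below uses pairwise additive masking rather than homomorphic encryption; this is a swapped-in, simplified technique and omits key agreement and dropout handling entirely. The integer encoding of client outputs, the modulus, and the function name are assumptions for illustration.

```python
import random

def masked_updates(values, modulus, seed=0):
    """Each pair (i, j) with i < j shares a random mask; i adds it, j subtracts it."""
    rng = random.Random(seed)
    masked = list(values)
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            r = rng.randrange(modulus)
            masked[i] = (masked[i] + r) % modulus
            masked[j] = (masked[j] - r) % modulus
    return masked

values = [12, 7, 30]            # hypothetical integer-encoded client outputs
modulus = 2**31 - 1
masked = masked_updates(values, modulus)

# The server only sees the masked values; the masks cancel in the sum.
total = sum(masked) % modulus
average = total / len(values)
```

The server recovers the sum (and hence the average) without seeing any individual client's value, which is the property that the averaging aggregation needs.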

3. CONVERGENCE ANALYSIS

With lazy aggregation and stale updates, GLASU is guaranteed to converge. To start the analysis, denote by $S^t = \{S_m^t[l]\}_{l=1, m=1}^{L, M}$ the collection of node sets sampled in round $t$. For any round $t$ and iteration $q$ in the round, GLASU admits the following convergence guarantee.
Theorem 1. Under Assumptions 1-3, by running Algorithm 1 with $\eta \le \frac{1}{C_0 \cdot (1 + 2Q^2 M)}$, with probability at least $p = 1 - \delta$, the averaged gradient norm is bounded by:
$$\frac{1}{TQ} \sum_{t=0}^{T-1} \sum_{q=0}^{Q-1} \mathbb{E}\,\big\|\nabla L(W^{t,q}; D)\big\|^2 \le \frac{2\big(L(W^{0,0}) - L^\star\big)}{\eta T Q} + 28 \eta M \cdot \big(C_0 + \sqrt{M+1}\, Q^3\big)\, \sigma,$$
where $C_0 = G_\ell L_f + L_\ell G_f^2$ is a constant and $\sigma > 0$ is a function of $\log(TQ/\delta)$, $L_f$, $L_\ell$, $G_f$, and $G_\ell$.
The detailed proofs are presented in Appendix B. Two key challenges in the analysis are: 1) the stochastic gradient estimation of the network is biased (i.e., $\mathbb{E}_S \nabla L(W; S) \neq \nabla L(W; D)$), even in centralized models; and 2) the stale updates in one communication round are correlated, as they are computed with the same samples. Hence, the usual unbiasedness and independence assumptions on the stochastic gradients in the analysis of SGD-type algorithms do not apply. Instead, we follow the analysis in Ramezani et al. (2020) to bound the variance of the stochastic gradient in centralized GCN training, and extend the analysis in Liu et al. (2022) for VFL with correlated updates to our case with biased gradients.
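As a rough reading of Theorem 1, the two terms of the bound can be balanced by the choice of step size. The calculation below is only a sketch: it assumes the right-hand side has the form $\frac{2\Delta}{\eta T Q} + \eta B$, with the shorthand $\Delta$ and $B$ introduced here for illustration, and it treats $\sigma$ as a constant even though $\sigma$ depends on $\log(TQ/\delta)$.

```latex
% Shorthand (assumed here, not in the paper):
%   \Delta = L(W^{0,0}) - L^\star, \qquad
%   B = 28 M \left( C_0 + \sqrt{M+1}\, Q^3 \right) \sigma .
\[
  g(\eta) \;=\; \frac{2\Delta}{\eta T Q} + \eta B ,
  \qquad
  g'(\eta) = 0
  \;\Longleftrightarrow\;
  \eta^\star = \sqrt{\frac{2\Delta}{B\,T Q}} ,
  \qquad
  g(\eta^\star) \;=\; 2\sqrt{\frac{2\Delta B}{T Q}} .
\]
% The choice \eta^\star is admissible only if it also satisfies the
% step-size condition \eta^\star \le 1 / \big( C_0 (1 + 2 Q^2 M) \big),
% which holds once T Q is large enough.
```

Under these assumptions the averaged squared gradient norm decays at the order of $1/\sqrt{TQ}$, up to the logarithmic dependence hidden in $\sigma$.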

4. NUMERICAL EXPERIMENTS

In this section, we conduct numerical experiments on a variety of datasets and demonstrate the effectiveness of GLASU in training with distributed graph data. We first compare the performance of GLASU with that of methods tackling related settings under different assumptions on data distribution and communication. Then, we conduct ablation studies to show that the three components of GLASU (GNN backbone, lazy aggregation, and stale updates) are all important. The experiments are conducted on a distributed cluster with three Tesla V100 GPUs communicating through Ethernet.

4.1. DATASETS

We use seven datasets (in three groups) with varying sizes and data distributions: the Planetoid collection (Yang et al., 2016), the HeriGraph collection (Bai et al., 2022), and the Reddit dataset (Hamilton et al., 2017). Each dataset in the HeriGraph collection (Suzhou, Venice, and Amsterdam) contains data readily distributed: three subgraphs and more than three feature blocks for each node. Hence, we use three clients, each of which handles one subgraph and one feature block. For the other four datasets (Cora, PubMed, and CiteSeer in the Planetoid collection; and Reddit), each contains a single graph, and thus we manually construct subgraphs by randomly sampling the edges and split the input features into non-overlapping blocks, so that each client handles one subgraph and one feature block. The dataset statistics are summarized in Table 1 and more details are given in Appendix C.1.
[Table caption fragment: ... GLASU with no stale updates, i.e., Q = 1 (GLASU-1), and GLASU with stale updates, Q = 4 (GLASU-4).]

4.2. RESULTS

We compare GLASU with three training methods: a) centralized training, where there is only a single client ($M = 1$), which holds the whole dataset without any data distribution and communication; b) standalone training, where each client trains a model with its local data only and the clients do not communicate; c) simulated centralized training (Ni et al., 2021), where each client possesses the full graph but only the partial features, so that it simulates centralized training through server aggregation in each GNN layer. None of these compared methods fits VFL, but they offer good references for understanding the performance of VFL on graph data. Except for centralized training, the number of clients is $M = 3$. The number of training rounds, $T$, and the learning rate $\eta$ are optimized through grid search. See Appendix C.2 for details.
We use GCNII (Chen et al., 2020a) as the backbone GNN. One layer of GCNII reads
$$H[l+1] = \sigma\Big(\big((1 - \alpha[l])\, A(E)\, H[l] + \alpha[l]\, H[0]\big)\big((1 - \beta[l])\, I + \beta[l]\, W[l]\big)\Big),$$
which effectively includes two residual connections. This backbone reduces over-smoothing and results in better prediction accuracy than GCN. We set the number of layers $L = 4$ and the mini-batch size $S = 16$. For neighborhood sampling, we sample three neighbors per node in $S[l+1]$ and take the union of the sampled neighbors to form $S[l]$. For lazy aggregation, we set $K = 2$.
Table 2 reports the classification accuracy of GLASU and the compared training methods, after five runs. As expected, standalone training produces the worst results, because each client uses only local information and misses edges and node features present in other clients. Centralized training and its simulated version lead to similar performance, also as expected. Our method GLASU is comparable with these two methods. GLASU with stale updates ($Q = 4$) is generally outperformed by GLASU without stale updates, but occasionally it is better (see PubMed and Amsterdam). The gain from using stale updates lies in the running time, as will be demonstrated in the ablation study next.
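One GCNII layer, with its initial-residual and identity-mapping terms, can be written out in a few lines. The sketch below uses plain Python lists, ReLU as $\sigma$, and toy values chosen so that the arithmetic is exact; the function names and inputs are illustrative assumptions rather than the configuration used in the experiments.

```python
def matmul(A, B):
    """Dense matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def gcnii_layer(A, H, H0, W, alpha, beta):
    """sigma(((1 - alpha) A H + alpha H0) ((1 - beta) I + beta W)), sigma = ReLU."""
    AH = matmul(A, H)
    # Initial residual connection: mix propagated features with the input H[0].
    mixed = [[(1 - alpha) * a + alpha * h0 for a, h0 in zip(row_a, row_0)]
             for row_a, row_0 in zip(AH, H0)]
    # Identity mapping: blend the weight matrix with the identity.
    d = len(W)
    T = [[(1 - beta) * (1.0 if i == j else 0.0) + beta * W[i][j] for j in range(d)]
         for i in range(d)]
    return [[max(0.0, v) for v in row] for row in matmul(mixed, T)]

A = [[0.5, 0.5], [0.5, 0.5]]            # toy normalized adjacency
H = [[2.0, 0.0], [4.0, 2.0]]            # current representations H[l]
H0 = [[1.0, 1.0], [1.0, 1.0]]           # initial representations H[0]
W = [[1.0, 0.0], [0.0, -1.0]]           # square weight matrix W[l]
out = gcnii_layer(A, H, H0, W, alpha=0.5, beta=0.5)
```

Because the identity-mapping term requires $W[l]$ to be square, the hidden width stays constant across GCNII layers.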

4.3. ABLATION STUDY

To further investigate how each component of the proposed approach affects the performance, we conduct an ablation study on a) the backbone GNN model, b) the lazy aggregation parameter $K$, and c) the stale update parameter $Q$. The experiments use PubMed and Amsterdam for illustration.
Backbone model: We compare the performance of GLASU on three backbone models: GCN, GAT (Velickovic et al., 2018), and GCNII. The learning rate for each backbone is tuned for its best performance. The test accuracy on PubMed is shown in Figure 3. We see that GLASU can take different GNNs as the backbone and reach a similar prediction performance.
Lazy aggregation: We investigate the performance of GLASU with different numbers of aggregation layers. We use a 4-layer GCNII as the backbone and set $K = 1, 2, 4$. The test accuracy and the runtime are listed in Table 3. We observe that the runtime decreases drastically when using fewer aggregation layers; from $K = 4$ to $K = 1$, the reduction in runtime is 37.4% for PubMed and 58.2% for Amsterdam. Meanwhile, there appears to be a sweet spot in terms of accuracy: $K = 2$ performs the best.
Stale updates: To investigate the time saving due to the use of stale updates, we experiment with a few choices of $Q$: 2, 4, 8, and 16. We report the time to reach the same test accuracy in Table 4. We see that stale updates help speed up training by using fewer communication rounds; this trend occurs on the Amsterdam dataset even when taking $Q$ as large as 16. The trend is also noticeable on PubMed, but at some point ($Q = 8$) it reverses, likely because it gets harder and harder to reach the desired prediction accuracy. We speculate that the target 82% can never be achieved at $Q = 16$.
[Table 4 fragment: columns "# Stale Update" Q = 2, Q = 4, Q = 8, Q = 16; first row: PubMed, Accuracy (%) 82. ...]

5. CONCLUSION

We have presented a flexible model splitting approach for VFL with vertically distributed graph data and proposed a communication-efficient algorithm, GLASU, to train the resulting GNN. Due to the graph structure among the samples, VFL on GNNs incurs heavy communication and poses an extra challenge in the convergence analysis, as the stochastic gradients are no longer unbiased. To overcome these challenges, our approach uses lazy aggregation to skip server-client communication and stale global information to update local models, leading to significant communication reduction. Moreover, our analysis makes no assumptions on unbiased gradients. We provide extensive experiments to show the flexibility of the model and the communication saving in the training.

A SUBROUTINES IN ALGORITHM 1

Algorithm 2 Sampling Procedure
Client:
    for $k = K, \ldots, 2$ do
        Receive($S[l_k]$). Set $S_m[l_k] = S[l_k]$.
        for $l = l_k - 1, \ldots, l_{k-1}$ do
            Uniformly randomly sample indices $S_m[l]$ from neighbors of $S_m[l+1]$.
        end for
        Send($S_m[l_{k-1}]$) if $k > 2$.
    end for
    Output: $\{S_m[l]\}_{l=0}^{L}$
Server:
    for $k = K-1, \ldots, 2$ do
        Aggregate($S_m[l_k]$). Compute $S[l_k] = \bigcup_{m=1}^{M} S_m[l_k]$. Broadcast($S[l_k]$).
    end for

Algorithm 3 JointInference
Client:
    Input: $W_m$, $D_m$, $\{S_m[l]\}_{l=0}^{L}$
    Set $H_m[0] = X_m[S_m[0]]$
    for $l = 0, \ldots, L-1$ do
        $H_m^{+}[l] = \sigma\big(A(E(S[l+1], S[l]))\, H_m[l]\, W_m[l]\big)$
        if $l \in I$ then
            Send $H_m^{+}[l]$ to server
            Receive $H_m[l+1]$
            $H_{-m}[l+1] = \mathrm{Extract}(H_m[l+1], H_m^{+}[l])$
        else
            Set $H_m[l+1] = H_m^{+}[l]$
        end if
    end for
    Output: $\{H_{-m}[l+1]\}_{l \in I}$
Server:
    for $l \in I$ do
        $H[l+1] = \mathrm{Agg}(H_1^{+}[l], \ldots, H_M^{+}[l])$. Broadcast $H[l+1]$.
    end for

Algorithm 4 LocalUpdate
    Input: $W_m^{t,q}$, $D_m$, $\{S_m^{t}[l]\}_{l=0}^{L}$, $\{H_{-m}^{t}[l+1]\}_{l \in I}$
    Set $H_m^{t,q}[0] = X_m[S_m^{t}[0]]$
    for $l = 0, \ldots, L-1$ do
        $H_m^{t,q,+}[l] = \sigma\big(A(E(S_m^{t}[l+1], S_m^{t}[l]))\, H_m^{t,q}[l]\, W_m^{t,q}[l]\big)$
        if $l \in I$ then
            Set $H_m^{t,q}[l+1] = \mathrm{Agg}(H_{-m}^{t}[l+1], H_m^{t,q,+}[l])$
        else
            Set $H_m^{t,q}[l+1] = H_m^{t,q,+}[l]$
        end if
    end for
    Compute loss $L_m^{t,q} = \ell\big(y[S_m^{t}[L]], f_m(H_m^{t,q}[L], W_m[L])\big)$
    Output: $W_m^{t,q+1} = W_m^{t,q} - \eta^{t,q}\, \nabla_{W_m^{t,q}} L_m^{t,q}$

B PROOFS FOR SECTION 3

B.1 ASSUMPTIONS

Assumption 1 (Smooth function and Lipschitz gradient). The loss function $\ell$ is $G_\ell$-smooth with $L_\ell$-Lipschitz gradient, i.e.,
$$\|\ell(y, S, W) - \ell(y, S, W')\| \le G_\ell \|W - W'\|, \qquad \|\nabla_W \ell(y, S, W) - \nabla_{W'} \ell(y, S, W')\| \le L_\ell \|W - W'\|, \qquad \forall\, W, W',$$
and each client's prediction function $f_m$ is $G_f$-smooth with $L_f$-Lipschitz gradient, i.e.,
$$\|f_m(S, W_m) - f_m(S, W_m')\| \le G_f \|W_m - W_m'\|, \qquad \|\nabla_{W_m} f_m(S, W_m) - \nabla_{W_m'} f_m(S, W_m')\| \le L_f \|W_m - W_m'\|, \qquad \forall\, W_m, W_m', \ \forall\, m.$$
Assumption 2 (Lower-bounded function). The training objective is bounded below; that is, there exists a constant $L^\star > -\infty$ such that $L(\{W_m\}) \ge L^\star$ for all $\{W_m\}$.
Assumption 3 (Uniform sampling). At each iteration $t$, the server and the clients uniformly sample nodes $\{S_m[l]\}_{l=0}^{L}$, with $|S[L]| = S$, according to Algorithm 2.
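The layer-wise sampling with synchronization at aggregation layers can be sketched as follows. This is a simplified reading of the sampling procedure, not a line-by-line transcription of Algorithm 2: every client samples top-down from a shared mini-batch, and at the layers listed in `sync_layers` (a hypothetical parameter) the server replaces the clients' sets by their union. The adjacency lists and names are illustrative.

```python
import random

def sample_neighbors(targets, neighbors, fanout, rng):
    """Uniformly sample up to `fanout` neighbors per target; return their union."""
    out = set()
    for v in sorted(targets):
        nbrs = neighbors.get(v, [])
        out.update(rng.sample(nbrs, min(fanout, len(nbrs))))
    return out

def synchronized_sampling(batch, client_neighbors, L, sync_layers, fanout, seed=0):
    """Return S[m][l] for l = 0..L; sets at layers in sync_layers are shared."""
    rng = random.Random(seed)
    M = len(client_neighbors)
    S = [{L: set(batch)} for _ in range(M)]      # shared mini-batch at the top layer
    for l in range(L - 1, -1, -1):
        for m in range(M):
            S[m][l] = sample_neighbors(S[m][l + 1], client_neighbors[m], fanout, rng)
        if l in sync_layers:
            # Server: take the union of the clients' sets and broadcast it.
            union = set().union(*(S[m][l] for m in range(M)))
            for m in range(M):
                S[m][l] = set(union)
    return S

# Two clients with different private adjacency lists over the same 3 nodes.
client_neighbors = [
    {0: [1], 1: [0, 2], 2: [1]},
    {0: [2], 1: [0], 2: [0, 1]},
]
S = synchronized_sampling([0], client_neighbors, L=2, sync_layers={1}, fanout=3)
```

In this toy run the fanout exceeds every node degree, so the result is deterministic: both clients share $\{1, 2\}$ at the synchronized layer, while their sets at layer 0 differ.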

B.2 PROOF OF THEOREM 1

We first note the following useful relation:
$$\|a - b\|^2 = \|a - c + c - b\|^2 \le (1 + \alpha)\, \|a - c\|^2 + \Big(1 + \frac{1}{\alpha}\Big) \|c - b\|^2, \qquad \forall\, \alpha > 0. \qquad (4)$$
For notational simplicity, let us denote the expectation conditioned on all the information before iteration $t$ as $\mathbb{E}_t[\,\cdot\,] = \mathbb{E}_{S^t}[\,\cdot \mid W^{t-1,Q}, \ldots, W^{0,0}, S^{t-1}, \ldots, S^{0}]$; denote the "all-but-$m$" vector as $(\cdot)_{-m}$ (e.g., the collection of all client parameters except for client $m$ is $W_{-m} = \{W_{m'}\}_{m' \neq m}$); denote the client model updated with data $S$ as $W_m(S)$; denote the gradient evaluated with data $S$ on parameter $W_m$ as $\nabla L(W_m(S), S)$; and denote the stacked gradient of all clients as $G = [\nabla L(W_1(S), S), \ldots, \nabla L(W_M(S), S)]$. Then, the update rule can be rewritten as:
$$W^{t,q+1}(S^t) = W^{t,q}(S^t) - \eta\, G^{t,q}. \qquad (5)$$
In addition, let us define a virtual model sequence updated with full data as $W(D)$, i.e.,
$$W^{t,q+1}(D) = W^{t,q}(D) - \eta\, \nabla L(W^{t,q}(D), D). \qquad (6)$$
We can bound the variance of the stochastic gradient at any round $t$ and iteration $q = 0$ with the following lemma:
Lemma 1 (Bounded variance). Under Assumptions 1-3, with probability at least $p = 1 - \delta$, the variance of the stochastic gradient is bounded by:
$$\mathbb{E}_t \big\|\nabla L(W; S^t) - \nabla L(W; D)\big\|^2 \le \sigma, \qquad \forall\, W \text{ independent of } S^t,$$
where
$$\sigma = 64\, G_\ell^2 L_f^2 \log\frac{2d}{\delta} + 128\, L_\ell^2 G_f^4 + \frac{1}{S} \log\frac{2d}{\delta} + \frac{1}{4}.$$
The main technique for proving this lemma is to use the matrix Bernstein inequality (Tropp, 2015) to bound the variance of the stochastic gradients and the variance of the expectation for each client. The proof of Lemma 1 follows the same steps as the proofs of Lemmas 5 and 6 of Ramezani et al. (2020), so we omit them here. Further, we bound the Lipschitz constant of the total loss function in the following lemma:
Lemma 2 (Lipschitz gradient).
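Under Assumptions 1-3, the full gradient and each partial gradient of the objective $\mathcal{L}(W, \mathcal{S})$ are Lipschitz continuous with uniform constant $C_0 = G_\ell L_f + G_f^2 L_\ell$:
$$\|\nabla_W \mathcal{L}(W, \mathcal{S}) - \nabla_{W'} \mathcal{L}(W', \mathcal{S})\| \le C_0 \|W - W'\|, \quad \forall\, W, W',$$
$$\|\nabla_{W_m} \mathcal{L}(W, \mathcal{S}) - \nabla_{W'_m} \mathcal{L}(W', \mathcal{S})\| \le C_0 \|W - W'\|, \quad \forall\, W, W',\ \forall\, m.$$

The proof of Lemma 2 is given below in Section B.3. With the above results, we begin our proof of Theorem 1. First, applying Lemma 2, we have:
$$\begin{aligned}
\mathcal{L}(W^{t,q+1}, \mathcal{D}) - \mathcal{L}(W^{t,q}, \mathcal{D})
&\le \big\langle \nabla\mathcal{L}(W^{t,q}, \mathcal{D}),\, W^{t,q+1} - W^{t,q} \big\rangle + \frac{C_0}{2}\big\|W^{t,q+1} - W^{t,q}\big\|^2 \\
&\overset{(a)}{=} -\eta \big\langle \nabla\mathcal{L}(W^{t,q}, \mathcal{D}),\, G^{t,q} \big\rangle + \frac{C_0 \eta^2}{2}\big\|G^{t,q}\big\|^2 \\
&\overset{(b)}{=} -\frac{\eta}{2}\Big( \big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D})\big\|^2 + \big\|G^{t,q}\big\|^2 - \big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D}) - G^{t,q}\big\|^2 \Big) + \frac{C_0 \eta^2}{2}\big\|G^{t,q}\big\|^2 \\
&= -\frac{\eta}{2}\big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D})\big\|^2 - \frac{\eta}{2}(1 - \eta C_0)\big\|G^{t,q}\big\|^2 + \frac{\eta}{2}\big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D}) - G^{t,q}\big\|^2,
\end{aligned}$$
where step (a) applies the update rule of Algorithm 4 and step (b) uses the fact that $\langle a, b \rangle = \frac{1}{2}\big(\|a\|^2 + \|b\|^2 - \|a - b\|^2\big)$. Taking expectation, we have:
$$\begin{aligned}
\mathbb{E}_t\big[\mathcal{L}(W^{t,q+1}, \mathcal{D}) - \mathcal{L}(W^{t,q}, \mathcal{D})\big]
&\le -\frac{\eta}{2}\mathbb{E}_t\big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D})\big\|^2 - \frac{\eta}{2}(1 - \eta C_0)\,\mathbb{E}_t\big\|G^{t,q}\big\|^2 + \frac{\eta}{2}\mathbb{E}_t\big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D}) - G^{t,q}\big\|^2 \\
&\overset{(a)}{=} -\frac{\eta}{2}\mathbb{E}_t\big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D})\big\|^2 + \frac{\eta}{2}\mathbb{E}_t\big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D}) - G^{t,q}\big\|^2 \\
&\qquad - \frac{\eta}{2}(1 - \eta C_0)\Big( \big\|\mathbb{E}_t G^{t,q}\big\|^2 + \mathbb{E}_t\big\|G^{t,q} - \mathbb{E}_t G^{t,q}\big\|^2 \Big) \\
&\overset{(b)}{\le} -\frac{\eta}{2}\mathbb{E}_t\big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D})\big\|^2 - \frac{\eta}{2}(1 - \eta C_0)\Big( \big\|\mathbb{E}_t G^{t,q}\big\|^2 + \mathbb{E}_t\big\|G^{t,q} - \mathbb{E}_t G^{t,q}\big\|^2 \Big) \\
&\qquad + \frac{\eta}{2}\Big[ \Big(1 + \frac{1}{\eta C_0}\Big)\mathbb{E}_t\big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D}) - \mathbb{E}_t G^{t,q}\big\|^2 + (1 + \eta C_0)\,\mathbb{E}_t\big\|\mathbb{E}_t G^{t,q} - G^{t,q}\big\|^2 \Big] \\
&= -\frac{\eta}{2}\mathbb{E}_t\big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D})\big\|^2 - \frac{\eta}{2}(1 - \eta C_0)\big\|\mathbb{E}_t G^{t,q}\big\|^2 \\
&\qquad + \eta^2 C_0 \underbrace{\mathbb{E}_t\big\|G^{t,q} - \mathbb{E}_t G^{t,q}\big\|^2}_{\text{Term 1}} + \frac{1 + \eta C_0}{2 C_0} \underbrace{\mathbb{E}_t\big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D}) - \mathbb{E}_t G^{t,q}\big\|^2}_{\text{Term 2}},
\end{aligned} \tag{9}$$
where step (a) uses the fact that $\mathbb{E}\|X\|^2 = \|\mathbb{E} X\|^2 + \mathbb{E}\|X - \mathbb{E} X\|^2$ and step (b) uses (4) with $\alpha = \eta C_0$. Next, we bound Term 1 and Term 2 in the above inequality separately.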
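As a quick numerical sanity check of the two elementary facts used above, namely the bound on $\|a - c + c - b\|^2$ with a free parameter $\alpha > 0$ and the inner-product identity in step (b), the sketch below evaluates both on random vectors. The helper names and the test ranges are illustrative assumptions; the check does not replace the algebraic argument.

```python
import random


def sq_norm(v):
    """Squared Euclidean norm of a list of floats."""
    return sum(x * x for x in v)


def sub(u, v):
    """Elementwise difference u - v."""
    return [x - y for x, y in zip(u, v)]


def dot(u, v):
    """Euclidean inner product."""
    return sum(x * y for x, y in zip(u, v))


def young_gap(a, b, c, alpha):
    """Right-hand side minus left-hand side of
    |a - b|^2 <= (1 + alpha)|a - c|^2 + (1 + 1/alpha)|c - b|^2."""
    lhs = sq_norm(sub(a, b))
    rhs = (1 + alpha) * sq_norm(sub(a, c)) + (1 + 1 / alpha) * sq_norm(sub(c, b))
    return rhs - lhs


def polarization_gap(a, b):
    """<a, b> - (|a|^2 + |b|^2 - |a - b|^2) / 2, which should vanish."""
    return dot(a, b) - 0.5 * (sq_norm(a) + sq_norm(b) - sq_norm(sub(a, b)))


def run_checks(trials=1000, dim=5, seed=0):
    """Return the smallest inequality gap and the largest identity error seen."""
    rng = random.Random(seed)
    min_young, max_polar = float("inf"), 0.0
    for _ in range(trials):
        a, b, c = ([rng.gauss(0, 1) for _ in range(dim)] for _ in range(3))
        alpha = rng.uniform(0.01, 10.0)
        min_young = min(min_young, young_gap(a, b, c, alpha))
        max_polar = max(max_polar, abs(polarization_gap(a, b)))
    return min_young, max_polar
```

A nonnegative first value and a second value at rounding-error level are consistent with the two facts holding for every sampled triple.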

B.2.1 BOUND OF TERM 1

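First, we can rewrite $\mathbb{E}_t\big[\|G^{t,q} - \mathbb{E}_t G^{t,q}\|^2\big]$ as:
$$\begin{aligned}
\mathbb{E}_t\big[\|G^{t,q} - \mathbb{E}_t G^{t,q}\|^2\big]
&= \sum_{m=1}^{M} \mathbb{E}_t\big\|\nabla\mathcal{L}(W_m^{t,q}(\mathcal{S}^t), \mathcal{S}^t) - \mathbb{E}_{\mathcal{S}} \nabla\mathcal{L}(W_m^{t,q}(\mathcal{S}), \mathcal{S})\big\|^2 \\
&\overset{(a)}{\le} \sum_{m=1}^{M} \underbrace{\mathbb{E}_t\big\|\nabla\mathcal{L}(W_m^{t,q}(\mathcal{S}^t), \mathcal{S}^t) - \nabla\mathcal{L}(W_m^{t,q}(\mathcal{D}), \mathcal{D})\big\|^2}_{\triangleq A_m^{t,q}},
\end{aligned}$$
where step (a) uses the fact that $\mathbb{E}\|X - \mathbb{E}X\|^2 \le \mathbb{E}\|X - Y\|^2$ for every constant $Y$. Then, we can bound $A_m^{t,q}$ as follows. When $q = 0$, by Lemma 1, $A_m^{t,0} \le \sigma$ holds with probability $1 - \delta$. In general, when $q \ge 1$, we have:
$$\begin{aligned}
A_m^{t,q}
&\overset{(4)}{\le} 2\,\mathbb{E}_t\big\|\nabla\mathcal{L}(W_m^{t,q}(\mathcal{S}^t), \mathcal{S}^t) - \nabla\mathcal{L}(W_m^{t,q}(\mathcal{D}), \mathcal{S}^t)\big\|^2 + 2\,\mathbb{E}_t\big\|\nabla\mathcal{L}(W_m^{t,q}(\mathcal{D}), \mathcal{S}^t) - \nabla\mathcal{L}(W_m^{t,q}(\mathcal{D}), \mathcal{D})\big\|^2 \\
&\overset{(a)}{\le} 2 C_0^2\, \mathbb{E}_t\big\|W_m^{t,q}(\mathcal{S}^t) - W_m^{t,q}(\mathcal{D})\big\|^2 + 2\,\mathbb{E}_t\big\|\nabla\mathcal{L}(W_m^{t,q}(\mathcal{D}), \mathcal{S}^t) - \nabla\mathcal{L}(W_m^{t,q}(\mathcal{D}), \mathcal{D})\big\|^2 \\
&\overset{(b)}{\le} 2 C_0^2\, \mathbb{E}_t\big\|W_m^{t,q}(\mathcal{S}^t) - W_m^{t,q}(\mathcal{D})\big\|^2 + 2\sigma,
\end{aligned}$$
which holds with probability $1 - \delta$. Here, step (a) applies Lemma 2 to the first term and step (b) applies Lemma 1 to the second term. Then, we bound $\mathbb{E}_t\|W_m^{t,q}(\mathcal{S}^t) - W_m^{t,q}(\mathcal{D})\|^2$ in the above inequality as:
$$\begin{aligned}
\mathbb{E}_t\big\|W_m^{t,q}(\mathcal{S}^t) - W_m^{t,q}(\mathcal{D})\big\|^2
&\overset{(a)}{=} \mathbb{E}_t\Big\|W_m^{t,0} - \eta\sum_{q'=0}^{q-1} \nabla\mathcal{L}(W_m^{t,q'}(\mathcal{S}^t), \mathcal{S}^t) - \Big(W_m^{t,0} - \eta\sum_{q'=0}^{q-1} \nabla\mathcal{L}(W_m^{t,q'}(\mathcal{D}), \mathcal{D})\Big)\Big\|^2 \\
&\overset{(b)}{=} \eta^2\, \mathbb{E}_t\Big\|\sum_{q'=0}^{q-1} \Big(\nabla\mathcal{L}(W_m^{t,q'}(\mathcal{S}^t), \mathcal{S}^t) - \nabla\mathcal{L}(W_m^{t,q'}(\mathcal{D}), \mathcal{D})\Big)\Big\|^2 \\
&\overset{(c)}{\le} \eta^2 q \sum_{q'=0}^{q-1} \mathbb{E}_t\big\|\nabla\mathcal{L}(W_m^{t,q'}(\mathcal{S}^t), \mathcal{S}^t) - \nabla\mathcal{L}(W_m^{t,q'}(\mathcal{D}), \mathcal{D})\big\|^2
= \eta^2 q \sum_{q'=0}^{q-1} A_m^{t,q'},
\end{aligned}$$
where in step (a) we expand the updates back to $W_m^{t,0}$ with (5) and (6); step (b) cancels $W_m^{t,0}$ and rearranges the terms; and step (c) applies the Cauchy-Schwarz inequality. At this point, we have the following relations:
$$\mathbb{E}_t\big[\|G^{t,q} - \mathbb{E}_t G^{t,q}\|^2\big] \le \sum_{m=1}^{M} A_m^{t,q}, \qquad A_m^{t,0} \le \sigma, \qquad A_m^{t,q} \le 2 C_0^2 \eta^2 q \sum_{q'=0}^{q-1} A_m^{t,q'} + 2\sigma, \quad \forall\, q \ge 1.$$
Note that $q \le Q$. By choosing $2\eta^2 C_0^2 Q^2 \le 1$, i.e., $\eta \le \frac{1}{\sqrt{2}\, Q C_0}$, and by recursively substituting the terms, we have the following bounds:
$$A_m^{t,q} \le \Big(2 + 4 q^2 \eta^2 C_0^2 + \frac{8}{3} q^3 \eta^4 C_0^4\Big)\cdot\sigma \le \frac{14}{3}\sigma, \qquad
\mathbb{E}_t\big[\|G^{t,q} - \mathbb{E}_t G^{t,q}\|^2\big] \le M \cdot \Big(2 + 4 q^2 \eta^2 C_0^2 + \frac{8}{3} q^3 \eta^4 C_0^4\Big)\cdot\sigma \le \frac{14 M \sigma}{3}. \tag{13}$$
This completes bounding the term $\mathbb{E}_t\big[\|G^{t,q} - \mathbb{E}_t G^{t,q}\|^2\big]$.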

B.2.2 BOUND OF TERM 2

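We have the following series of relations:
$$\begin{aligned}
\mathbb{E}_t\big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D}) - \mathbb{E}_t G^{t,q}\big\|^2
&= \sum_{m=1}^{M} \mathbb{E}_t\big\|\nabla_{W_m}\mathcal{L}(W^{t,q}(\mathcal{S}^t), \mathcal{D}) - \mathbb{E}_{\mathcal{S}}\nabla\mathcal{L}(W_m^{t,q}(\mathcal{S}), \mathcal{S})\big\|^2 \\
&\overset{(a)}{\le} \sum_{m=1}^{M} \mathbb{E}_t \mathbb{E}_{\mathcal{S}}\big\|\nabla_{W_m}\mathcal{L}(W^{t,q}(\mathcal{S}^t), \mathcal{S}) - \nabla\mathcal{L}(W_m^{t,q}(\mathcal{S}), \mathcal{S})\big\|^2 \\
&\overset{(b)}{\le} \sum_{m=1}^{M} C_0^2\, \mathbb{E}_t \mathbb{E}_{\mathcal{S}}\big\|W^{t,q}(\mathcal{S}^t) - [W_m^{t,q}(\mathcal{S}), W_{-m}^{t,0}]\big\|^2 \\
&= \sum_{m=1}^{M} C_0^2\, \mathbb{E}_t \mathbb{E}_{\mathcal{S}}\Big( \big\|W_m^{t,q}(\mathcal{S}^t) - W_m^{t,q}(\mathcal{S})\big\|^2 + \sum_{m' \neq m}\big\|W_{m'}^{t,q}(\mathcal{S}^t) - W_{m'}^{t,0}\big\|^2 \Big) \\
&\overset{(c)}{=} \eta^2 \sum_{m=1}^{M} C_0^2\, \mathbb{E}_t \mathbb{E}_{\mathcal{S}}\Big( \Big\|\sum_{q'=0}^{q-1}\big(\nabla\mathcal{L}(W_m^{t,q'}(\mathcal{S}^t), \mathcal{S}^t) - \nabla\mathcal{L}(W_m^{t,q'}(\mathcal{S}), \mathcal{S})\big)\Big\|^2 + \sum_{m' \neq m}\Big\|\sum_{q'=0}^{q-1}\nabla\mathcal{L}(W_{m'}^{t,q'}(\mathcal{S}); \mathcal{S})\Big\|^2 \Big) \\
&\overset{(d)}{\le} \eta^2 C_0^2 q \sum_{m=1}^{M}\sum_{q'=0}^{q-1} \mathbb{E}_t \mathbb{E}_{\mathcal{S}}\Big( \big\|\nabla\mathcal{L}(W_m^{t,q'}(\mathcal{S}^t), \mathcal{S}^t) - \nabla\mathcal{L}(W_m^{t,q'}(\mathcal{S}), \mathcal{S})\big\|^2 + \sum_{m' \neq m}\big\|\nabla\mathcal{L}(W_{m'}^{t,q'}(\mathcal{S}); \mathcal{S})\big\|^2 \Big) \\
&\overset{(e)}{=} \eta^2 (M + 1) C_0^2 q \sum_{m=1}^{M}\sum_{q'=0}^{q-1} \mathbb{E}_t \mathbb{E}_{\mathcal{S}}\big\|\nabla\mathcal{L}(W_m^{t,q'}(\mathcal{S}^t), \mathcal{S}^t)\big\|^2 \\
&\overset{(f)}{=} \eta^2 (M + 1) C_0^2 q \sum_{q'=0}^{q-1} \mathbb{E}_t\big\|G^{t,q'}\big\|^2 \\
&= \eta^2 (M + 1) C_0^2 q \sum_{q'=0}^{q-1} \Big( \mathbb{E}_t\big\|G^{t,q'} - \mathbb{E}_t G^{t,q'}\big\|^2 + \big\|\mathbb{E}_t G^{t,q'}\big\|^2 \Big),
\end{aligned} \tag{14}$$
where step (a) uses Assumption 3, which states that $\mathcal{S}$ is uniformly sampled from $\mathcal{D}$, and applies Jensen's inequality, that is,
$$\big\|\mathbb{E}_{\mathcal{S}}\nabla_{W_m}\mathcal{L}(W^{t,q}(\mathcal{S}^t); \mathcal{S}) - \mathbb{E}_{\mathcal{S}}\nabla\mathcal{L}(W_m^{t,q}(\mathcal{S}); \mathcal{S})\big\|^2 \le \mathbb{E}_{\mathcal{S}}\big\|\nabla_{W_m}\mathcal{L}(W^{t,q}(\mathcal{S}^t); \mathcal{S}) - \nabla\mathcal{L}(W_m^{t,q}(\mathcal{S}); \mathcal{S})\big\|^2;$$
step (b) applies Lemma 2 and uses the fact that $\nabla\mathcal{L}(W_m^{t,q}(\mathcal{S}^t), \mathcal{S}^t)$ is evaluated on $W_m^{t,q}(\mathcal{S}^t)$ and $W_{-m}^{t,0}$; in step (c) we expand the update steps back to $(t, 0)$ with (5); step (d) applies the Cauchy-Schwarz inequality; in step (e) we reorder the sum and apply the i.i.d. Assumption 3 to $\mathcal{S}$ and $\mathcal{S}^t$; and in step (f) we plug in the definition of $G$. This completes bounding the term $\mathbb{E}_t\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D}) - \mathbb{E}_t G^{t,q}\|^2$.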

B.2.3 PROOF OF THE MAIN RESULT

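Substituting the last term in (9) with (14), we obtain that the following holds with probability $(1 - \delta)^{Q}$:
$$\begin{aligned}
\mathbb{E}_t\big[\mathcal{L}(W^{t,q+1}, \mathcal{D}) - \mathcal{L}(W^{t,q}, \mathcal{D})\big]
&\le -\frac{\eta}{2}\mathbb{E}_t\big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D})\big\|^2 - \frac{\eta}{2}(1 - \eta C_0)\big\|\mathbb{E}_t G^{t,q}\big\|^2 + \eta^2 C_0\, \mathbb{E}_t\big\|G^{t,q} - \mathbb{E}_t G^{t,q}\big\|^2 \\
&\qquad + \frac{1 + \eta C_0}{2 C_0}\, \eta^2 (M + 1) C_0^2 q \sum_{q'=0}^{q-1} \Big( \mathbb{E}_t\big\|G^{t,q'} - \mathbb{E}_t G^{t,q'}\big\|^2 + \big\|\mathbb{E}_t G^{t,q'}\big\|^2 \Big) \\
&\le -\frac{\eta}{2}\mathbb{E}_t\big\|\nabla\mathcal{L}(W^{t,q}, \mathcal{D})\big\|^2 - \frac{\eta}{2}(1 - \eta C_0)\big\|\mathbb{E}_t G^{t,q}\big\|^2 + \frac{1 + \eta C_0}{2 C_0}\, \eta^2 (M + 1) C_0^2 q \sum_{q'=0}^{q-1} \mathbb{E}_t\big\|\mathbb{E}_t G^{t,q'}\big\|^2 \\
&\qquad + \eta^2 C_0 \cdot \Big(1 + \frac{(1 + \eta C_0)\cdot(M + 1)\cdot \eta Q^2}{2}\Big)\cdot \frac{14 M \sigma}{3},
\end{aligned}$$
where in the second inequality, we set $\eta \le \frac{1}{\sqrt{2}\, Q C_0}$, plug in (13), and use the fact that $q \le Q$. Averaging over $t = 0, \ldots, T - 1$ and $q = 0, \ldots, Q - 1$ and reorganizing the terms, we obtain:
$$\begin{aligned}
\frac{1}{TQ}\sum_{t=0}^{T-1}\sum_{q=0}^{Q-1} \mathbb{E}\big\|\nabla\mathcal{L}(W^{t,q}; \mathcal{D})\big\|^2
&\le \frac{2}{\eta T Q}\,\mathbb{E}\big[\mathcal{L}(W^{0}) - \mathcal{L}(W^{T,Q})\big] - \frac{1 - \eta C_0\big(1 + (1 + \eta C_0)\cdot(M + 1)\cdot Q^2\big)}{TQ}\sum_{t=0}^{T-1}\sum_{q=0}^{Q-1} \mathbb{E}\big\|\mathbb{E}_t G^{t,q}\big\|^2 \\
&\qquad + 2\eta C_0 \cdot \Big(1 + \frac{(1 + \eta C_0)\cdot(M + 1)\cdot \eta Q^2}{2}\Big)\cdot \frac{14 M \sigma}{3},
\end{aligned}$$
which holds with probability at least $(1 - \delta)^{TQ}$. Let $\delta = \delta'/(TQ) \in (0, 1)$; then, the above inequality holds with probability at least $\big(1 - \delta'/(TQ)\big)^{TQ} \ge 1 - \frac{\delta'}{TQ} \times TQ = 1 - \delta'$.

Under review as a conference paper at ICLR 2023

Let $1 - \eta C_0\big(1 + (1 + \eta C_0)\cdot(M + 1)\cdot Q^2\big) \ge 0$ (with $\eta \le \frac{1}{\sqrt{M + 1}\, C_0 Q}$) and apply Assumption 2. Then, we have
$$\frac{1}{TQ}\sum_{t=0}^{T-1}\sum_{q=0}^{Q-1} \mathbb{E}\big\|\nabla\mathcal{L}(W^{t,q}; \mathcal{D})\big\|^2 \le \frac{2\big(\mathcal{L}(W^{0}) - \mathcal{L}^\star\big)}{\eta T Q} + \frac{28\, \eta M \cdot \big(C_0 + \sqrt{M + 1}\, Q\big)}{3}\,\sigma,$$
which holds with probability at least $1 - \delta$, where
$$\sigma = 64\, G_\ell^2 L_f^2 \log\frac{2 d T Q}{\delta} + 128\, L_\ell^2 G_f^4 \left(\frac{1}{S}\log\frac{2 d T Q}{\delta} + \frac{1}{4}\right).$$
This completes the proof of Theorem 1.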
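To make the final bound easier to explore, the sketch below evaluates its two parts numerically. It assumes the final inequality is read as $2(\mathcal{L}(W^0) - \mathcal{L}^\star)/(\eta T Q) + \frac{28 \eta M (C_0 + \sqrt{M+1}\,Q)}{3}\sigma$ and takes $\sigma$ as an input rather than recomputing it. The function names are hypothetical, and the step size is simply the largest value allowed by the two conditions stated in the proof.

```python
import math


def max_step_size(C0, Q, M):
    """Largest eta satisfying both conditions used in the proof:
    eta <= 1 / (sqrt(2) * Q * C0) and eta <= 1 / (sqrt(M + 1) * C0 * Q)."""
    return min(1.0 / (math.sqrt(2.0) * Q * C0),
               1.0 / (math.sqrt(M + 1.0) * C0 * Q))


def theorem1_rhs(loss_gap, eta, T, Q, M, C0, sigma):
    """Right-hand side of the final bound under the reading stated above.

    loss_gap stands for L(W^0) - L_star; sigma is supplied by the caller.
    """
    optimization_term = 2.0 * loss_gap / (eta * T * Q)
    variance_term = 28.0 * eta * M * (C0 + math.sqrt(M + 1.0) * Q) / 3.0 * sigma
    return optimization_term + variance_term


def success_probability(delta_prime, T, Q):
    """(1 - delta'/(TQ))^(TQ); Bernoulli's inequality gives >= 1 - delta'."""
    n = T * Q
    return (1.0 - delta_prime / n) ** n
```

For a fixed step size, the first term shrinks as $T$ grows while the second does not, which is the usual trade-off for a constant-step-size analysis with a variance floor.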

B.3 PROOF FOR LEMMA 2

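In this subsection, we prove $\|\nabla_W \mathcal{L}(W) - \nabla_{W'} \mathcal{L}(W')\| \le C_0 \|W - W'\|$ and $\|\nabla_{W_m} \mathcal{L}(W) - \nabla_{W'_m} \mathcal{L}(W')\| \le C_0 \|W - W'\|$. Note that $\nabla_{W_m}\mathcal{L}(W)$ is a sub-vector of $\nabla\mathcal{L}(W)$, so $\|\nabla_{W_m} \mathcal{L}(W) - \nabla_{W'_m} \mathcal{L}(W')\| \le \|\nabla_W \mathcal{L}(W) - \nabla_{W'} \mathcal{L}(W')\|$. Therefore, we only need to prove the first inequality. The gradient $\nabla\mathcal{L}(W)$ can be expanded as
$$\nabla\mathcal{L}(W) = \nabla\ell(y, f_m(\mathcal{S}, W)) = \nabla\ell(f_m(\mathcal{S}, W)) \cdot \nabla_W f_m(\mathcal{S}, W) = \nabla\ell(f_m) \cdot \nabla f_m(W),$$
where in the last equation we omit the irrelevant variables. Then, writing $f'_m$ for $f_m(W')$, we have
$$\begin{aligned}
\|\nabla\mathcal{L}(W) - \nabla\mathcal{L}(W')\|
&= \|\nabla\ell(f_m) \cdot \nabla f_m(W) - \nabla\ell(f'_m) \cdot \nabla f_m(W')\| \\
&= \|\nabla\ell(f_m) \cdot (\nabla f_m(W) - \nabla f_m(W')) + (\nabla\ell(f_m) - \nabla\ell(f'_m)) \cdot \nabla f_m(W')\| \\
&\overset{(a)}{\le} \|\nabla\ell(f_m) \cdot (\nabla f_m(W) - \nabla f_m(W'))\| + \|(\nabla\ell(f_m) - \nabla\ell(f'_m)) \cdot \nabla f_m(W')\| \\
&\overset{(b)}{\le} \|\nabla\ell(f_m)\|\, \|\nabla f_m(W) - \nabla f_m(W')\| + \|\nabla\ell(f_m) - \nabla\ell(f'_m)\|\, \|\nabla f_m(W')\| \\
&\overset{(c)}{\le} G_\ell L_f \|W - W'\| + L_\ell \|f_m(W) - f_m(W')\| \cdot G_f \\
&\overset{\text{A1}}{\le} (G_\ell L_f + L_\ell G_f^2) \cdot \|W - W'\|,
\end{aligned}$$
where step (a) uses the fact that $\|a + b\| \le \|a\| + \|b\|$; step (b) uses the fact that $\|ab\| \le \|a\|\,\|b\|$; and in step (c) we apply Assumption 1 (that is, for any $G$-smooth function $g$, its gradient is bounded as $\|\nabla g\| \le G$) to the first and the fourth terms and the Lipschitz-gradient property to the second and the third terms. This completes the proof of Lemma 2.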
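The constant $C_0 = G_\ell L_f + L_\ell G_f^2$ can be illustrated on a toy scalar composition. The example below is an assumption chosen for convenience, not a model from the paper: both the loss and the prediction function are taken to be $\sin$, so $G_\ell = L_\ell = G_f = L_f = 1$ and $C_0 = 2$. The code estimates the largest observed Lipschitz ratio of the composite gradient over random pairs of points.

```python
import math
import random

# Toy constants: sin is 1-Lipschitz and has a 1-Lipschitz derivative.
G_l, L_l = 1.0, 1.0   # loss ell(u) = sin(u)
G_f, L_f = 1.0, 1.0   # prediction f(w) = sin(w)
C0 = G_l * L_f + L_l * G_f ** 2


def grad_composite(w):
    """Derivative of sin(sin(w)) by the chain rule: ell'(f(w)) * f'(w)."""
    return math.cos(math.sin(w)) * math.cos(w)


def max_lipschitz_ratio(num_pairs=10000, seed=0):
    """Largest observed |g(w) - g(v)| / |w - v| over random pairs (w, v)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(num_pairs):
        w, v = rng.uniform(-10.0, 10.0), rng.uniform(-10.0, 10.0)
        if w != v:
            ratio = abs(grad_composite(w) - grad_composite(v)) / abs(w - v)
            worst = max(worst, ratio)
    return worst
```

In this toy case the observed ratio stays below $C_0$, as the chain-rule argument predicts; the bound is not expected to be tight.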

C EXPERIMENT DETAILS

C.1 DETAILS OF THE DATASETS

Planetoid (Yang et al., 2016): This collection contains three citation datasets: Cora, PubMed, and CiteSeer. Each dataset contains one citation graph, where the nodes represent papers and edges represent citations. The node features are a bag of words and the classification target is the paper category. In the experiment, each client holds a non-overlapping block of node features and a subgraph that results from uniformly sampling 80% of the edges.

HeriGraph (Bai et al., 2022): This collection contains three multi-modal graph datasets, each of which is constructed from heritage data posted on social media for a particular city (Suzhou, Amsterdam, and Venice). Each post contains user information, timestamp, geolocation, image, and text annotation. The posts are connected to form three subgraphs: a social subgraph, a spatial subgraph,



Figure 1: Data isolation of vertically distributed graph-structured data over three clients.

Figure 2: Illustration of the split model on M = 3 clients with lazy aggregation. In the model, the second server aggregation layer is skipped due to lazy aggregation, and the graph size used by each layer gradually decreases due to neighborhood sampling.

m=1 the samples used at round $t$ (which include all sampled nodes at different layers and clients); by $S = |\mathcal{S}_m^{t}[L]|$ the batch size; and by $\mathcal{L}(W; \mathcal{S})$ the training objective, which is evaluated at the overall set of model parameters across clients, $W = \{W_m\}_{m=1}^{M}$, and a batch of samples, $\mathcal{S}$. A few assumptions are needed (see Appendix B.1 for formal statements). A1: The loss function $\ell$ is $G_\ell$-smooth with $L_\ell$-Lipschitz gradient; and a client's prediction function $f_m$ is $G_f$-smooth with $L_f$-Lipschitz gradient. A2: The training objective $\mathcal{L}(W; \mathcal{D})$ is bounded below by a finite constant $\mathcal{L}^\star$. A3: The samples $\mathcal{S}^t$ are sampled from $\mathcal{D}$ following Algorithm 2 in Appendix A.

Figure 3: Test accuracy with different backbone GNNs on PubMed.

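$$\begin{aligned}
\|\nabla\mathcal{L}(W) - \nabla\mathcal{L}(W')\|
&= \|\nabla\ell(f_m) \cdot \nabla f_m(W) - \nabla\ell(f'_m) \cdot \nabla f_m(W')\| \\
&= \|\nabla\ell(f_m) \cdot (\nabla f_m(W) - \nabla f_m(W')) + (\nabla\ell(f_m) - \nabla\ell(f'_m)) \cdot \nabla f_m(W')\| \\
&\overset{(a)}{\le} \|\nabla\ell(f_m) \cdot (\nabla f_m(W) - \nabla f_m(W'))\| + \|(\nabla\ell(f_m) - \nabla\ell(f'_m)) \cdot \nabla f_m(W')\| \\
&\overset{(b)}{\le} \|\nabla\ell(f_m)\|\, \|\nabla f_m(W) - \nabla f_m(W')\| + \|\nabla\ell(f_m) - \nabla\ell(f'_m)\|\, \|\nabla f_m(W')\| \\
&\overset{(c)}{\le}
\end{aligned}$$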

Training Procedure. All referenced algorithms are detailed in the appendix.

Dataset summary. For a dataset that contains a single graph, each of the M clients holds a sampled subgraph from it. For the HeriGraph datasets, there are M = 3 clients, each of which holds a given subgraph.



Runtime of GLASU with different number of stale updates Q = 2, 4, 6, 16, when reaching 82% test accuracy on PubMed and 89% on Amsterdam.


and a temporal subgraph. The social subgraph is formed based on friendship and common-interest relations of the users. The spatial subgraph is formed based on the spatial proximity of the geolocations. The temporal subgraph is formed based on the temporal proximity of the posts. Each post has three blocks of image features and possibly text features; for classification, it belongs to one of nine heritage attributes. In the experiment, each client holds one of the three subgraphs and one of the three image feature blocks.

Reddit (Hamilton et al., 2017): Reddit is a large online community where users post and comment on different topics. Each node represents a post and the features are the text of the post. Two posts are connected if the same user comments on both. The classification target is the community (subreddit) that a post belongs to. Similar to Planetoid, in the experiment, each client holds a non-overlapping block of node features and a subgraph that results from uniformly sampling 80% of the edges.
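The vertical split used for Planetoid and Reddit (a non-overlapping feature block plus an 80% edge sample per client) can be sketched as follows. The helper names are hypothetical, and two details are assumptions of this sketch rather than statements from the text: feature blocks are taken to be contiguous column ranges, and each client draws its edge sample independently.

```python
import random


def split_features(num_features, num_clients):
    """Contiguous, non-overlapping blocks of feature-column indices."""
    base, extra = divmod(num_features, num_clients)
    blocks, start = [], 0
    for m in range(num_clients):
        size = base + (1 if m < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def sample_edges(edges, keep_ratio, rng):
    """Uniformly sample a fixed fraction of the edges without replacement."""
    return rng.sample(edges, round(keep_ratio * len(edges)))


def make_vertical_clients(edges, num_features, num_clients, keep_ratio=0.8, seed=0):
    """Build one record per client: its feature columns and its local edge list."""
    rng = random.Random(seed)
    blocks = split_features(num_features, num_clients)
    return [
        {"feature_cols": blocks[m], "edges": sample_edges(edges, keep_ratio, rng)}
        for m in range(num_clients)
    ]
```

All clients still share the same node set, which is what makes the setting vertical: only the feature columns and the observed edges differ across clients.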

C.2 DETAILS OF THE HYPERPARAMETERS

Here we provide a list of hyperparameters for grid search in Table 5. The optimal set of hyperparameters for each setting is tuned according to the range listed in the 

