LMC: FAST TRAINING OF GNNS VIA SUBGRAPH-WISE SAMPLING WITH PROVABLE CONVERGENCE

Abstract

Message passing-based graph neural networks (GNNs) have achieved great success in many real-world applications. However, training GNNs on large-scale graphs suffers from the well-known neighbor explosion problem, i.e., the exponentially increasing dependencies of nodes with the number of message passing layers. Subgraph-wise sampling methods, a promising class of mini-batch training techniques, discard messages outside the mini-batches in backward passes to avoid the neighbor explosion problem, at the expense of gradient estimation accuracy. This poses significant challenges to their convergence analysis and convergence speeds, which seriously limits their reliable real-world applications. To address this challenge, we propose a novel subgraph-wise sampling method with a convergence guarantee, namely Local Message Compensation (LMC). To the best of our knowledge, LMC is the first subgraph-wise sampling method with provable convergence. The key idea of LMC is to retrieve the discarded messages in backward passes based on a message passing formulation of backward passes. By efficient and effective compensations for the discarded messages in both forward and backward passes, LMC computes accurate mini-batch gradients and thus accelerates convergence. We further show that LMC converges to first-order stationary points of GNNs. Experiments on large-scale benchmark tasks demonstrate that LMC significantly outperforms state-of-the-art subgraph-wise sampling methods in terms of efficiency.

1. INTRODUCTION

Graph neural networks (GNNs) are powerful frameworks that generate node embeddings for graphs via the iterative message passing (MP) scheme (Hamilton, 2020). At each MP layer, GNNs aggregate messages from each node's neighborhood and then update node embeddings based on the aggregation results. Such a scheme has achieved great success in many real-world applications involving graph-structured data, such as search engines (Brin & Page, 1998), recommendation systems (Fan et al., 2019), materials engineering (Gostick et al., 2016), molecular property prediction (Moloi & Ali, 2005; Kearnes et al., 2016), and combinatorial optimization (Wang et al., 2023). However, the iterative MP scheme poses challenges to training GNNs on large-scale graphs. A common approach to scaling deep models to arbitrarily large data with limited GPU memory is to approximate full-batch gradients by mini-batch gradients. Nevertheless, for graph-structured data, computing the loss across a mini-batch of nodes and the corresponding mini-batch gradients is expensive due to the well-known neighbor explosion problem. Specifically, the embedding of a node at the $k$-th MP layer recursively depends on the embeddings of its neighbors at the $(k-1)$-th MP layer, so the complexity grows exponentially with the number of MP layers. To deal with the neighbor explosion problem, recent works propose various sampling techniques to reduce the number of nodes involved in message passing (Ma & Tang, 2021). For example, node-wise (Hamilton et al., 2017; Chen et al., 2018a) and layer-wise (Chen et al., 2018b; Zou et al., 2019; Huang et al., 2018) sampling methods recursively sample neighbors over MP layers to estimate node embeddings and the corresponding mini-batch gradients.
Unlike the recursive fashion, subgraph-wise sampling methods (Chiang et al., 2019; Zeng et al., 2020; Fey et al., 2021; Zeng et al., 2021) adopt a cheap and simple one-shot sampling fashion, i.e., sampling the same subgraph constructed from a mini-batch for all MP layers. By discarding messages outside the mini-batches, subgraph-wise sampling methods restrict message passing to the mini-batches, so that the complexity grows linearly with the number of MP layers. Moreover, subgraph-wise sampling methods are applicable to a wide range of GNN architectures, as they directly run GNNs on the subgraphs constructed from the sampled mini-batches (Fey et al., 2021). Because of these advantages, subgraph-wise sampling methods have recently drawn increasing attention. Despite their empirical success, discarding messages outside the mini-batch sacrifices gradient estimation accuracy, which poses significant challenges to their convergence behavior. First, recent works (Chen et al., 2018a; Cong et al., 2020) demonstrate that inaccurate mini-batch gradients seriously hurt the convergence speeds of GNNs. Second, in Section 7.3, we demonstrate that many subgraph-wise sampling methods struggle to match full-batch performance under the small batch sizes that we usually use to avoid running out of GPU memory in practice. These issues seriously limit the real-world applications of GNNs. In this paper, we propose a novel subgraph-wise sampling method with a convergence guarantee, namely Local Message Compensation (LMC), which uses efficient and effective compensations to correct the biases of mini-batch gradients and thus accelerates convergence. To the best of our knowledge, LMC is the first subgraph-wise sampling method with provable convergence.
Specifically, we first propose unbiased mini-batch gradients for the one-shot sampling fashion, which helps decompose the gradient computation error into two components: the bias from the discarded messages and the variance of the unbiased mini-batch gradients. Second, based on a message passing formulation of backward passes, we retrieve the messages discarded by existing subgraph-wise sampling methods during the approximation of the unbiased mini-batch gradients. Finally, we propose efficient and effective compensations for the discarded messages, using a combination of incomplete up-to-date messages and messages generated from historical information in previous iterations, thereby avoiding exponentially growing time and memory consumption. An appealing feature of the resulting mechanism is that it effectively corrects the biases of mini-batch gradients, leading to accurate gradient estimation and faster convergence. We further show that LMC converges to first-order stationary points of GNNs. Notably, the convergence of LMC is based on the interactions between mini-batch nodes and their 1-hop neighbors, without the recursive expansion of neighborhoods to aggregate information far away from the mini-batches. Experiments on large-scale benchmark tasks demonstrate that LMC significantly outperforms state-of-the-art subgraph-wise sampling methods in terms of efficiency. Moreover, under small batch sizes, LMC outperforms the baselines and matches the prediction performance of full-batch methods.

2. RELATED WORK

In this section, we discuss works related to our proposed method.

Subgraph-wise Sampling Methods. Subgraph-wise sampling methods sample a mini-batch and then construct the same subgraph based on it for all MP layers (Ma & Tang, 2021). For example, Cluster-GCN (Chiang et al., 2019) and GraphSAINT (Zeng et al., 2020) construct the subgraph induced by a sampled mini-batch. They encourage connections between the sampled nodes via graph clustering methods (e.g., METIS (Karypis & Kumar, 1998) and Graclus (Dhillon et al., 2007)) or edge-, node-, or random-walk-based samplers. GNNAutoScale (GAS) (Fey et al., 2021) and MVS-GNN (Cong et al., 2020) use historical embeddings to generate messages outside a sampled subgraph, maintaining the expressiveness of the original GNNs. Precomputing methods such as SIGN (Rossi et al., 2020) instead aggregate features in a preprocessing step; while they are efficient in training and inference, they are not applicable to powerful GNNs with a trainable aggregation process.

Historical Values as an Affordable Approximation. Historical values are affordable approximations of the exact values in practice. However, they suffer from frequent data transfers to/from the GPU and from the staleness problem. For example, in node-wise sampling, VR-GCN (Chen et al., 2018a) uses historical embeddings to reduce the variance introduced by neighbor sampling (Hamilton et al., 2017). GAS (Fey et al., 2021) proposes a concurrent mini-batch execution to transfer the active historical embeddings to and from the GPU, leading to runtime comparable with the standard full-batch approach. GraphFM-IB and GraphFM-OB (Yu et al., 2022) apply a momentum step on historical embeddings for node-wise and subgraph-wise sampling methods with historical embeddings, respectively, to alleviate the staleness problem. Both LMC and GraphFM-OB use the node embeddings in the mini-batch to alleviate the staleness of the node embeddings outside the mini-batch. We discuss the main differences between LMC and GraphFM-OB in Appendix C.1.

3. PRELIMINARIES

We introduce notations and graph neural networks in Sections 3.1 and 3.2, respectively.

3.1. NOTATIONS

A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is defined by a set of nodes $\mathcal{V} = \{v_1, v_2, \dots, v_n\}$ and a set of edges $\mathcal{E}$ among these nodes. The set of nodes consists of labeled nodes $\mathcal{V}_L$ and unlabeled nodes $\mathcal{V}_U := \mathcal{V} \setminus \mathcal{V}_L$. Let $(v_i, v_j) \in \mathcal{E}$ denote an edge going from node $v_i \in \mathcal{V}$ to node $v_j \in \mathcal{V}$, $\mathcal{N}(v_i) = \{v_j \in \mathcal{V} \mid (v_i, v_j) \in \mathcal{E}\}$ denote the neighborhood of node $v_i$, and $\widetilde{\mathcal{N}}(v_i)$ denote $\mathcal{N}(v_i) \cup \{v_i\}$. We assume that $\mathcal{G}$ is undirected, i.e., $v_j \in \mathcal{N}(v_i) \Leftrightarrow v_i \in \mathcal{N}(v_j)$. Let $\mathcal{N}(\mathcal{S}) = \{v_j \in \mathcal{V} \mid (v_i, v_j) \in \mathcal{E},\, v_i \in \mathcal{S}\}$ denote the neighborhood of a set of nodes $\mathcal{S}$ and $\widetilde{\mathcal{N}}(\mathcal{S})$ denote $\mathcal{N}(\mathcal{S}) \cup \mathcal{S}$. For a positive integer $L$, $[L]$ denotes $\{1, \dots, L\}$. Let the boldface character $\mathbf{x}_i \in \mathbb{R}^{d_x}$ denote the feature of node $v_i$ with dimension $d_x$, and $\mathbf{h}_i \in \mathbb{R}^{d}$ the $d$-dimensional embedding of node $v_i$. Let $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n) \in \mathbb{R}^{d_x \times n}$ and $\mathbf{H} = (\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_n) \in \mathbb{R}^{d \times n}$. We also denote the embeddings of a set of nodes $\mathcal{S} = \{v_{i_k}\}_{k=1}^{|\mathcal{S}|}$ by $\mathbf{H}_{\mathcal{S}} = (\mathbf{h}_{i_k})_{k=1}^{|\mathcal{S}|} \in \mathbb{R}^{d \times |\mathcal{S}|}$. For a $p \times q$ matrix $\mathbf{A} \in \mathbb{R}^{p \times q}$, $\vec{\mathbf{A}} \in \mathbb{R}^{pq}$ denotes the vectorization of $\mathbf{A}$, i.e., $\mathbf{A}_{ij} = \vec{\mathbf{A}}_{i+(j-1)p}$. We denote the $j$-th column of $\mathbf{A}$ by $\mathbf{A}_j$.

3.2. GRAPH NEURAL NETWORKS

For semi-supervised node-level prediction, graph neural networks (GNNs) aim to learn node embeddings $\mathbf{H}$ with parameters $\Theta$ by minimizing the objective function $\mathcal{L} = \frac{1}{|\mathcal{V}_L|}\sum_{i \in \mathcal{V}_L} \ell_w(\mathbf{h}_i, y_i)$ such that $\mathbf{H} = \mathrm{GNN}(\mathbf{X}, \mathcal{E}; \Theta)$, where $\ell_w$ is the composition of an output layer with parameters $w$ and a loss function. GNNs follow the message passing framework, in which vector messages are exchanged between nodes and updated using neural networks. An $L$-layer GNN performs $L$ message passing iterations with different parameters $\Theta = (\theta^l)_{l=1}^L$ to generate the final node embeddings $\mathbf{H} = \mathbf{H}^L$ as
$$\mathbf{H}^l = f_{\theta^l}(\mathbf{H}^{l-1}; \mathbf{X}), \quad l \in [L], \quad (1)$$
where $\mathbf{H}^0 = \mathbf{X}$ and $f_{\theta^l}$ is the message passing function of the $l$-th layer with parameters $\theta^l$. The message passing function $f_{\theta^l}$ follows an aggregation and update scheme, i.e.,
$$\mathbf{h}^l_i = u_{\theta^l}\big(\mathbf{h}^{l-1}_i, \mathbf{m}^{l-1}_{\mathcal{N}(v_i)}, \mathbf{x}_i\big), \qquad \mathbf{m}^{l-1}_{\mathcal{N}(v_i)} = \oplus_{\theta^l}\big(\{g_{\theta^l}(\mathbf{h}^{l-1}_j) \mid v_j \in \mathcal{N}(v_i)\}\big), \quad l \in [L], \quad (2)$$
where $g_{\theta^l}$ is the function generating individual messages for each neighbor of $v_i$ in the $l$-th message passing iteration, $\oplus_{\theta^l}$ is the aggregation function mapping a set of messages to the final message $\mathbf{m}^{l-1}_{\mathcal{N}(v_i)}$, and $u_{\theta^l}$ is the update function that combines the previous node embedding $\mathbf{h}^{l-1}_i$, the message $\mathbf{m}^{l-1}_{\mathcal{N}(v_i)}$, and the features $\mathbf{x}_i$ to update node embeddings.
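To make the aggregation and update scheme concrete, below is a minimal pure-Python sketch of one message passing layer on scalar embeddings, assuming sum aggregation and a linear generation/update; the names `mp_layer`, `W_self`, and `W_neigh` are illustrative and not from the paper.

```python
def mp_layer(h, adj, W_self, W_neigh):
    """One message passing iteration on scalar embeddings.

    h        : dict node -> float, embeddings h^{l-1}
    adj      : dict node -> list of neighbors N(v)
    W_self, W_neigh : scalar weights playing the role of theta^l
    """
    h_new = {}
    for v, neighbors in adj.items():
        # generate + aggregate: sum of neighbor messages g(h_j) = W_neigh * h_j
        m = sum(W_neigh * h[u] for u in neighbors)
        # update: combine the previous embedding with the aggregated message
        h_new[v] = W_self * h[v] + m
    return h_new

adj = {0: [1], 1: [0, 2], 2: [1]}        # undirected path graph 0-1-2
h0 = {0: 1.0, 1: 2.0, 2: 3.0}
h1 = mp_layer(h0, adj, W_self=1.0, W_neigh=0.5)
# node 1 aggregates from nodes 0 and 2: 2.0 + 0.5*(1.0 + 3.0) = 4.0
```

Stacking `mp_layer` $L$ times mirrors Equation (1); with one dict entry per node, the cost of one layer is linear in the number of edges.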

4. MESSAGE PASSING IN BACKWARD PASSES

In Section 4.1, we introduce the gradients of GNNs and formulate the backward passes as message passing. Then we propose backward SGD, which is an SGD variant, in Section 4.2.

4.1. BACKWARD PASSES AND MESSAGE PASSING FORMULATION

The gradient $\nabla_w \mathcal{L}$ is easy to compute, so in this section we focus on computing $\nabla_\Theta \mathcal{L}$ with the chain rule, where $\Theta = (\theta^l)_{l=1}^L$. Let $V^l \triangleq \nabla_{\mathbf{H}^l} \mathcal{L}$ for $l \in [L]$ be auxiliary variables. It is easy to compute $\vec{V}^L = \nabla_{\vec{\mathbf{H}}^L} \mathcal{L}$. By the chain rule, we iteratively compute $V^l$ based on $V^{l+1}$ as
$$\vec{V}^l = \vec{\phi}_{\theta^{l+1}}(V^{l+1}) \triangleq \big(\nabla_{\vec{\mathbf{H}}^l} \vec{f}_{\theta^{l+1}}\big)\, \vec{V}^{l+1}, \quad (3)$$
and hence $V^l = \phi_{\theta^{l+1}} \circ \cdots \circ \phi_{\theta^L}(V^L)$. Then, we compute the gradients
$$\nabla_{\theta^l} \mathcal{L} = \big(\nabla_{\theta^l} \vec{f}_{\theta^l}\big)\, \vec{V}^l, \quad l \in [L], \quad (4)$$
using autograd packages for vector-Jacobian products.

We formulate backward passes, i.e., the processes of iterating Equation (3), as message passing. To see this, notice that Equation (3) is equivalent to
$$V^l_i = \sum_{v_j \in \mathcal{N}(v_i) \cup \{v_i\}} \nabla_{\mathbf{h}^l_i} u_{\theta^{l+1}}\big(\mathbf{h}^l_j, \mathbf{m}^l_{\mathcal{N}(v_j)}, \mathbf{x}_j\big)\, V^{l+1}_j, \quad i \in [n], \quad (5)$$
where $V^l_k$ is the $k$-th column of $V^l$ and $\mathbf{m}^l_{\mathcal{N}(v_j)}$ is a function of $\mathbf{h}^l_i$ defined in Equation (2). Equation (5) uses $\nabla_{\mathbf{h}^l_i} u_{\theta^{l+1}}(\mathbf{h}^l_j, \mathbf{m}^l_{\mathcal{N}(v_j)}, \mathbf{x}_j)\, V^{l+1}_j$, sum aggregation, and the identity mapping as the generation function, the aggregation function, and the update function, respectively.
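As a concrete instance of the message passing view of backward passes, consider a toy linear layer $h^{l+1}_j = W_{\text{self}}\, h^l_j + W_{\text{neigh}} \sum_{u \in \mathcal{N}(j)} h^l_u$ on an undirected graph: its backward pass is itself one message passing step over the same graph, with the auxiliary variables playing the role of embeddings. A minimal sketch (the names `backward_mp`, `W_self`, and `W_neigh` are illustrative, not the paper's implementation):

```python
def backward_mp(V_next, adj, W_self, W_neigh):
    """One backward message passing step for the linear layer
    h^{l+1}_j = W_self * h^l_j + W_neigh * sum_{u in N(j)} h^l_u.

    By the chain rule (and symmetry of the undirected graph):
    V^l_i = W_self * V^{l+1}_i + W_neigh * sum_{j in N(i)} V^{l+1}_j,
    i.e., gradients flow along the same edges as the forward messages.
    """
    return {i: W_self * V_next[i] + sum(W_neigh * V_next[j] for j in nbrs)
            for i, nbrs in adj.items()}

adj = {0: [1], 1: [0, 2], 2: [1]}     # undirected path graph 0-1-2
V2 = {0: 1.0, 1: 2.0, 2: 3.0}         # auxiliary variables V^{l+1}
V1 = backward_mp(V2, adj, W_self=1.0, W_neigh=0.5)
# node 1 gathers gradient messages from itself and its neighbors 0 and 2
```

The structural point of this sketch is the one Equation (5) makes in general: the backward pass uses sum aggregation over the (closed) neighborhood, so it can be restricted to a subgraph in exactly the same way as the forward pass.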

4.2. BACKWARD SGD

In this section, we develop backward SGD, an SGD variant that provides unbiased gradient estimates based on the message passing formulation of backward passes. Backward SGD is the basis of our proposed subgraph-wise sampling method, LMC, in Section 5.

Given a sampled mini-batch $\mathcal{V}_B$, suppose that we have computed the exact node embeddings $(\mathbf{H}^l_{\mathcal{V}_B})_{l=1}^L$ and auxiliary variables $(V^l_{\mathcal{V}_B})_{l=1}^L$ of nodes in $\mathcal{V}_B$. To simplify the analysis, we assume that $\mathcal{V}_B$ is uniformly sampled from $\mathcal{V}$ and the corresponding set of labeled nodes $\mathcal{V}_{L_B} := \mathcal{V}_B \cap \mathcal{V}_L$ is uniformly sampled from $\mathcal{V}_L$. When the sampling is not uniform, we use the normalization technique (Zeng et al., 2020) to enforce the assumption (please see Appendix A.3.1).

First, backward SGD computes the mini-batch gradient $g_w(\mathcal{V}_B)$ for parameters $w$ as the derivative of the mini-batch loss $\mathcal{L}_{\mathcal{V}_B} = \frac{1}{|\mathcal{V}_{L_B}|}\sum_{v_j \in \mathcal{V}_{L_B}} \ell_w(\mathbf{h}_j, y_j)$, i.e.,
$$g_w(\mathcal{V}_B) = \frac{1}{|\mathcal{V}_{L_B}|}\sum_{v_j \in \mathcal{V}_{L_B}} \nabla_w \ell_w(\mathbf{h}_j, y_j). \quad (6)$$
Then, backward SGD computes the mini-batch gradient $g_{\theta^l}(\mathcal{V}_B)$ for parameters $\theta^l$ as
$$g_{\theta^l}(\mathcal{V}_B) = \frac{|\mathcal{V}|}{|\mathcal{V}_B|}\sum_{v_j \in \mathcal{V}_B} \nabla_{\theta^l} u_{\theta^l}\big(\mathbf{h}^{l-1}_j, \mathbf{m}^{l-1}_{\mathcal{N}(v_j)}, \mathbf{x}_j\big)\, V^l_j, \quad l \in [L]. \quad (7)$$
Note that the mini-batch gradients $g_{\theta^l}(\mathcal{V}_B)$ for different $l \in [L]$ are based on the same mini-batch $\mathcal{V}_B$, which facilitates designing subgraph-wise sampling methods on top of backward SGD. Another appealing feature of backward SGD is that the mini-batch gradients $g_w(\mathcal{V}_B)$ and $g_{\theta^l}(\mathcal{V}_B)$, $l \in [L]$, are unbiased, as shown in the following theorem. Please see Appendix D.1 for the detailed proof.

Theorem 1. Suppose that a mini-batch $\mathcal{V}_B$ is uniformly sampled from $\mathcal{V}$ and the corresponding labeled node set $\mathcal{V}_{L_B} = \mathcal{V}_B \cap \mathcal{V}_L$ is uniformly sampled from $\mathcal{V}_L$. Then the mini-batch gradients $g_w(\mathcal{V}_B)$ and $g_{\theta^l}(\mathcal{V}_B)$, $l \in [L]$, in Equations (6) and (7) are unbiased.
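The unbiasedness in Theorem 1 comes from the rescaling factor $|\mathcal{V}|/|\mathcal{V}_B|$ in Equation (7). A minimal numerical sketch of this mechanism, assuming toy per-node gradient terms and uniform mini-batches (the variable names are illustrative): averaging the rescaled estimator over all equally likely mini-batches recovers the full-batch sum.

```python
import itertools

# toy per-node gradient contributions (stand-ins for the summands of Eq. (7))
per_node_grad = {0: 0.3, 1: -0.1, 2: 0.5, 3: 0.7}
full = sum(per_node_grad.values())          # full-batch gradient

n, b = len(per_node_grad), 2                # |V| = 4 nodes, batch size |V_B| = 2
batches = list(itertools.combinations(per_node_grad, b))

# backward-SGD-style estimator: (|V| / |V_B|) * sum over the mini-batch
estimates = [(n / b) * sum(per_node_grad[v] for v in B) for B in batches]
mean_estimate = sum(estimates) / len(estimates)
# the mean over all uniform mini-batches equals the full-batch gradient
```

Each node appears in the same fraction of mini-batches, so the expectation of the rescaled sum is exactly the full-batch sum; this is the elementary fact behind the theorem.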

5. LOCAL MESSAGE COMPENSATION

The exact mini-batch gradients $g_w(\mathcal{V}_B)$ and $g_{\theta^l}(\mathcal{V}_B)$, $l \in [L]$, computed by backward SGD depend on the exact embeddings and auxiliary variables of nodes in the mini-batch $\mathcal{V}_B$ rather than the whole graph. However, backward SGD is not scalable, as the exact $(\mathbf{H}^l_{\mathcal{V}_B})_{l=1}^L$ and $(V^l_{\mathcal{V}_B})_{l=1}^L$ are expensive to compute due to the neighbor explosion problem. In this section, to deal with the neighbor explosion problem, we develop a novel and scalable subgraph-wise sampling method for GNNs, namely Local Message Compensation (LMC). LMC first efficiently estimates $(\mathbf{H}^l_{\mathcal{V}_B})_{l=1}^L$ and $(V^l_{\mathcal{V}_B})_{l=1}^L$ by convex combinations of the incomplete up-to-date values and the historical values, and then computes the mini-batch gradients as shown in Equations (6) and (7). We show that LMC converges to first-order stationary points of GNNs in Section 6.

Figure 1: Comparison of LMC with GNNAutoScale (GAS) (Fey et al., 2021). (a) shows the original graph with in-batch nodes, 1-hop out-of-batch nodes, and other out-of-batch nodes in orange, blue, and grey, respectively. (b) and (d) show the computation graphs of forward passes and backward passes of GAS, respectively. (c) and (e) show the computation graphs of forward passes and backward passes of LMC, respectively.

In Algorithm 1 and Section 6, we denote a value in the $l$-th layer at the $k$-th iteration by a superscript $l,k$ (e.g., $\mathbf{H}^{l,k}$). In forward passes, we initialize the temporary embeddings for $l = 0$ as $\tilde{\mathbf{H}}^0 = \mathbf{X}$ and update the historical embeddings of nodes in $\mathcal{V}_B$, i.e., $\mathbf{H}^l_{\mathcal{V}_B}$, in the order $l = 1, 2, \dots, L$.
Specifically, in the $l$-th layer, we first update the historical embedding of each node $v_i \in \mathcal{V}_B$ as
$$\mathbf{h}^l_i = u_{\theta^l}\big(\mathbf{h}^{l-1}_i, \mathbf{m}^{l-1}_{\mathcal{N}(v_i)}, \mathbf{x}_i\big), \quad \mathbf{m}^{l-1}_{\mathcal{N}(v_i)} = \oplus_{\theta^l}\big(\{g_{\theta^l}(\mathbf{h}^{l-1}_j) \mid v_j \in \mathcal{N}(v_i) \cap \mathcal{V}_B\} \cup \{g_{\theta^l}(\tilde{\mathbf{h}}^{l-1}_j) \mid v_j \in \mathcal{N}(v_i) \setminus \mathcal{V}_B\}\big). \quad (8)$$
Then, we compute the temporary embedding of each neighbor $v_i \in \mathcal{N}(\mathcal{V}_B) \setminus \mathcal{V}_B$ as
$$\tilde{\mathbf{h}}^l_i = (1 - \beta_i)\,\bar{\mathbf{h}}^l_i + \beta_i\, \hat{\mathbf{h}}^l_i, \quad (9)$$
where $\beta_i \in [0, 1]$ is the convex combination coefficient for node $v_i$, $\bar{\mathbf{h}}^l_i$ is the historical embedding, and the incomplete up-to-date embedding $\hat{\mathbf{h}}^l_i$ is
$$\hat{\mathbf{h}}^l_i = u_{\theta^l}\big(\tilde{\mathbf{h}}^{l-1}_i, \mathbf{m}^{l-1}_{\mathcal{N}(v_i)}, \mathbf{x}_i\big), \quad \mathbf{m}^{l-1}_{\mathcal{N}(v_i)} = \oplus_{\theta^l}\big(\{g_{\theta^l}(\mathbf{h}^{l-1}_j) \mid v_j \in \mathcal{N}(v_i) \cap \mathcal{V}_B\} \cup \{g_{\theta^l}(\tilde{\mathbf{h}}^{l-1}_j) \mid v_j \in \mathcal{N}(\mathcal{V}_B) \cap \mathcal{N}(v_i) \setminus \mathcal{V}_B\}\big). \quad (10)$$
We call $C^l_f \triangleq \oplus_{\theta^l}\big(\{g_{\theta^l}(\tilde{\mathbf{h}}^{l-1}_j) \mid v_j \in \mathcal{N}(v_i) \setminus \mathcal{V}_B\}\big)$ the local message compensation in the $l$-th layer in forward passes. For $l \in [L]$, $\tilde{\mathbf{h}}^l_i$ is an approximation to $\mathbf{h}^l_i$ computed by Equation (2). Notice that the total cost of Equations (8)-(10) is linear in $|\mathcal{N}(\mathcal{V}_B)|$ rather than the size of the whole graph. Suppose that the maximum neighborhood size is $n_{\max}$ and the number of layers is $L$; then the time complexity of forward passes is $O(L(n_{\max}|\mathcal{V}_B|d + |\mathcal{V}_B|d^2))$.

In backward passes, we initialize the temporary auxiliary variables for $l = L$ as $\tilde{V}^L = \nabla_{\mathbf{H}^L}\mathcal{L}$ and update the historical auxiliary variables of nodes in $\mathcal{V}_B$, i.e., $V^l_{\mathcal{V}_B}$, in the order $l = L-1, \dots, 1$. Specifically, in the $l$-th layer, we first update the historical auxiliary variable of each $v_i \in \mathcal{V}_B$ as
$$V^l_i = \sum_{v_j \in (\mathcal{N}(v_i) \cup \{v_i\}) \cap \mathcal{V}_B} \nabla_{\mathbf{h}^l_i} u_{\theta^{l+1}}\big(\mathbf{h}^l_j, \mathbf{m}^l_{\mathcal{N}(v_j)}, \mathbf{x}_j\big)\, V^{l+1}_j + \sum_{v_j \in \mathcal{N}(v_i) \setminus \mathcal{V}_B} \nabla_{\mathbf{h}^l_i} u_{\theta^{l+1}}\big(\tilde{\mathbf{h}}^l_j, \mathbf{m}^l_{\mathcal{N}(v_j)}, \mathbf{x}_j\big)\, \tilde{V}^{l+1}_j, \quad (11)$$
where $\mathbf{h}^l_j$, $\mathbf{m}^l_{\mathcal{N}(v_j)}$, and $\tilde{\mathbf{h}}^l_j$ are computed as shown in Equations (8)-(10). Then, we compute the temporary auxiliary variable of each neighbor $v_i \in \mathcal{N}(\mathcal{V}_B) \setminus \mathcal{V}_B$ as
$$\tilde{V}^l_i = (1 - \beta_i)\,\bar{V}^l_i + \beta_i\, \hat{V}^l_i, \quad (12)$$
where $\beta_i$ is the convex combination coefficient used in Equation (9), $\bar{V}^l_i$ is the historical auxiliary variable, and
$$\hat{V}^l_i = \sum_{v_j \in \mathcal{N}(v_i) \cap \mathcal{V}_B} \nabla_{\mathbf{h}^l_i} u_{\theta^{l+1}}\big(\mathbf{h}^l_j, \mathbf{m}^l_{\mathcal{N}(v_j)}, \mathbf{x}_j\big)\, V^{l+1}_j + \sum_{v_j \in \mathcal{N}(\mathcal{V}_B) \cap \mathcal{N}(v_i) \setminus \mathcal{V}_B} \nabla_{\mathbf{h}^l_i} u_{\theta^{l+1}}\big(\tilde{\mathbf{h}}^l_j, \mathbf{m}^l_{\mathcal{N}(v_j)}, \mathbf{x}_j\big)\, \tilde{V}^{l+1}_j. \quad (13)$$
We call $C^l_b \triangleq \sum_{v_j \in \mathcal{N}(v_i) \setminus \mathcal{V}_B} \nabla_{\mathbf{h}^l_i} u_{\theta^{l+1}}\big(\tilde{\mathbf{h}}^l_j, \mathbf{m}^l_{\mathcal{N}(v_j)}, \mathbf{x}_j\big)\, \tilde{V}^{l+1}_j$ the local message compensation in the $l$-th layer in backward passes. For $l \in [L]$, $\tilde{V}^l_i$ is an approximation to $V^l_i$ computed by Equation (3). Similar to forward passes, the time complexity of backward passes is $O(L(n_{\max}|\mathcal{V}_B|d + |\mathcal{V}_B|d^2))$, where $n_{\max}$ is the maximum neighborhood size and $L$ is the number of layers.
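The convex combination at the heart of both Equations (9) and (12) can be sketched in a few lines; `temporary_embedding`, `h_hist`, and `h_incomplete` are illustrative names for the historical value, the incomplete up-to-date value, and their combination (shown here for scalars rather than the paper's vector-valued embeddings).

```python
def temporary_embedding(h_hist, h_incomplete, beta):
    """Convex combination of a (stale) historical value and an incomplete
    up-to-date value. beta = 0 recovers the pure historical value, as used
    by GAS; beta = 1 uses only the incomplete up-to-date value."""
    assert 0.0 <= beta <= 1.0
    return (1.0 - beta) * h_hist + beta * h_incomplete

# the historical value comes from a previous iteration; the incomplete value
# only aggregates the messages visible inside the sampled subgraph
t = temporary_embedding(h_hist=0.8, h_incomplete=1.2, beta=0.5)
```

The same combination is applied to the auxiliary variables in backward passes, which is what distinguishes LMC from purely forward-compensated methods.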

Algorithm 1 Local Message Compensation

1: Input: the learning rate $\eta$ and the convex combination coefficients $(\beta_i)_{i=1}^n$.
2: Partition $\mathcal{V}$ into $B$ parts $(\mathcal{V}_b)_{b=1}^B$
3: for $k = 1, \dots, N$ do
4:   Randomly sample $\mathcal{V}_{b_k}$ from $(\mathcal{V}_b)_{b=1}^B$
5:   Initialize $\mathbf{H}^{0,k} = \tilde{\mathbf{H}}^{0,k} = \mathbf{X}$
6:   for $l = 1, \dots, L$ do
7:     Update $\mathbf{H}^{l,k}_{\mathcal{V}_{b_k}}$  ▷ Equation (8)
8:     Compute $\tilde{\mathbf{H}}^{l,k}_{\mathcal{N}(\mathcal{V}_{b_k}) \setminus \mathcal{V}_{b_k}}$  ▷ Equations (9) and (10)
9:   end for
10:  Initialize $V^{L,k} = \tilde{V}^{L,k} = \nabla_{\mathbf{H}^L}\mathcal{L}$
11:  for $l = L-1, \dots, 1$ do
12:    Update $V^{l,k}_{\mathcal{V}_{b_k}}$  ▷ Equation (11)
13:    Compute $\tilde{V}^{l,k}_{\mathcal{N}(\mathcal{V}_{b_k}) \setminus \mathcal{V}_{b_k}}$  ▷ Equations (12) and (13)
14:  end for
15:  Compute $g^k_w$ and $g^k_{\theta^l}$, $l \in [L]$  ▷ Equations (6) and (7)
16:  Update parameters by
17:    $w^k = w^{k-1} - \eta\, g^k_w$
18:    $\theta^{l,k} = \theta^{l,k-1} - \eta\, g^k_{\theta^l}$, $l \in [L]$
19: end for

LMC additionally stores the historical node embeddings $\bar{\mathbf{H}}^l$ and auxiliary variables $\bar{V}^l$ for $l \in [L]$. As pointed out in (Fey et al., 2021), we can store the majority of historical values in RAM or hard drive storage rather than GPU memory. Thus, the active historical values in forward and backward passes each occupy $O(n_{\max} L |\mathcal{V}_B| d)$ GPU memory (see Appendix B). As the time and memory complexity are independent of the size of the whole graph, i.e., $|\mathcal{V}|$, LMC is scalable. We summarize the computational complexity in Appendix B.

Figure 1 shows the message passing mechanisms of GAS (Fey et al., 2021) and LMC. Compared with GAS, LMC adds compensation messages between in-batch nodes and their 1-hop neighbors in both forward and backward passes. This corrects the biases of mini-batch gradients and thus accelerates convergence.

Algorithm 1 summarizes LMC. Unlike above, we add a superscript $k$ to each value to indicate that it is the value at the $k$-th iteration. In a preprocessing step, we partition $\mathcal{V}$ into $B$ parts $(\mathcal{V}_b)_{b=1}^B$. At the $k$-th training step, LMC first randomly samples a subgraph constructed from $\mathcal{V}_{b_k}$. Notice that in experiments we sample several subgraphs to build a larger one; the convergence analysis is consistent with that of sampling a single subgraph.
Then, LMC updates the stored historical node embeddings $\mathbf{H}^{l,k}_{\mathcal{V}_{b_k}}$ in the order $l = 1, \dots, L$ by Equations (8)-(10), and the stored historical auxiliary variables $V^{l,k}_{\mathcal{V}_{b_k}}$ in the order $l = L-1, \dots, 1$ by Equations (11)-(13). Through these random updates, the historical values stay close to the exact up-to-date values. Finally, for $l \in [L]$ and $v_j \in \mathcal{V}_{b_k}$, by replacing the exact $\mathbf{h}^{l,k}_j$, $\mathbf{m}^{l,k}_{\mathcal{N}(v_j)}$, and $V^{l,k}_j$ in Equations (6) and (7) with the estimates computed above, LMC computes the mini-batch gradients $\tilde{g}_w, \tilde{g}_{\theta^1}, \dots, \tilde{g}_{\theta^L}$ to update the parameters $w, \theta^1, \dots, \theta^L$.
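The overall control flow of Algorithm 1 can be sketched as the following training loop. This is a structural skeleton only: the helpers `update_embeddings`, `update_auxiliaries`, and `minibatch_grads` are placeholders standing in for Equations (8)-(10), (11)-(13), and (6)-(7), respectively, and are not the paper's implementation.

```python
import random

def train_lmc(partitions, params, lr, num_iters,
              update_embeddings, update_auxiliaries, minibatch_grads):
    """High-level loop mirroring lines 3-19 of Algorithm 1."""
    for _ in range(num_iters):
        batch = random.choice(partitions)       # line 4: sample V_{b_k}
        update_embeddings(batch, params)        # lines 5-9: forward pass with compensations
        update_auxiliaries(batch, params)       # lines 10-14: backward pass with compensations
        grads = minibatch_grads(batch, params)  # line 15: mini-batch gradients
        for name, g in grads.items():           # lines 16-18: SGD step
            params[name] -= lr * g
    return params
```

Plugging in trivial helpers (no-op updates and the gradient of a quadratic) drives the parameter toward its minimizer, which is all this skeleton is meant to show; the real work of LMC lives inside the three helpers.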

6. THEORETICAL ANALYSIS

In this section, we provide the theoretical analysis of LMC. Theorem 2 shows that the biases of the mini-batch gradients computed by LMC can be made arbitrarily small by setting a proper learning rate and proper convex combination coefficients. Theorem 3 then shows that LMC converges to first-order stationary points of GNNs. We provide detailed proofs of the theorems in Appendix D. Throughout the analysis, we suppose that the following assumptions hold.

Assumption 1. Assume that (1) at the $k$-th iteration, a batch of nodes $\mathcal{V}^k_B$ is uniformly sampled from $\mathcal{V}$ and the corresponding labeled node set $\mathcal{V}^k_{L_B} = \mathcal{V}^k_B \cap \mathcal{V}_L$ is uniformly sampled from $\mathcal{V}_L$; (2) the functions $f_{\theta^l}$, $\phi_{\theta^l}$, $\nabla_w \mathcal{L}$, $\nabla_{\theta^l} \mathcal{L}$, $\nabla_w \ell_w$, and $\nabla_{\theta^l} u_{\theta^l}$ are $\gamma$-Lipschitz with $\gamma > 1$, $\forall\, l \in [L]$; (3) the norms $\|\mathbf{H}^{l,k}\|_F$, $\|\bar{\mathbf{H}}^{l,k}\|_F$, $\|\hat{\mathbf{H}}^{l,k}\|_F$, $\|\tilde{\mathbf{H}}^{l,k}\|_F$, $\|V^{l,k}\|_F$, $\|\bar{V}^{l,k}\|_F$, $\|\hat{V}^{l,k}\|_F$, $\|\tilde{V}^{l,k}\|_F$, $\|\nabla_w \mathcal{L}\|_2$, $\|\nabla_{\theta^l} \mathcal{L}\|_2$, $\|\tilde{g}_{\theta^l}\|_2$, and $\|\tilde{g}_w\|_2$ are bounded by $G > 1$, $\forall\, l \in [L]$, $k \in \mathbb{N}^*$.

Theorem 2. Suppose that Assumption 1 holds. Then, with $\eta = O(\varepsilon^2)$ and $\beta_i = O(\varepsilon^2)$, $i \in [n]$, there exist $C > 0$ and $\rho \in (0, 1)$ such that
$$\mathbb{E}\big[\|\tilde{g}_w(w^k) - \nabla_w \mathcal{L}(w^k)\|_2\big] \le C\varepsilon + C\rho^{\frac{k-1}{2}} + \mathrm{Var}(g_w(w^k))^{\frac{1}{2}}, \quad \forall\, k \in \mathbb{N}^*,$$
$$\mathbb{E}\big[\|\tilde{g}_{\theta^l}(\theta^{l,k}) - \nabla_{\theta^l} \mathcal{L}(\theta^{l,k})\|_2\big] \le C\varepsilon + C\rho^{\frac{k-1}{2}} + \mathrm{Var}(g_{\theta^l}(\theta^{l,k}))^{\frac{1}{2}}, \quad \forall\, l \in [L],\ k \in \mathbb{N}^*.$$

Theorem 3. Suppose that Assumption 1 holds. Besides, assume that the optimal value $\mathcal{L}^* = \inf_{w,\Theta} \mathcal{L}(w, \Theta)$ is bounded by $G$. Then, with $\eta = O(\varepsilon^4)$, $\beta_i = O(\varepsilon^4)$, $i \in [n]$, and $N = O(\varepsilon^{-6})$, LMC ensures to find an $\varepsilon$-stationary solution such that $\mathbb{E}[\|\nabla_{w,\Theta} \mathcal{L}(w^R, \Theta^R)\|_2] \le \varepsilon$ after running for $N$ iterations, where $R$ is uniformly selected from $[N]$ and $\Theta^R = (\theta^{l,R})_{l=1}^L$.

7. EXPERIMENTS

We introduce experimental settings in Section 7.1. We then evaluate the convergence and efficiency of LMC in Sections 7.2 and 7.3. Finally, we conduct ablation studies about the proposed compensations in Section 7.4. We run all experiments on a single GeForce RTX 2080 Ti (11 GB).

7.1. EXPERIMENTAL SETTINGS

Datasets. Some recent works (Hu et al., 2020) have indicated that many frequently-used graph datasets are too small compared with graphs in real-world applications. Therefore, we evaluate LMC on four large datasets: PPI, REDDIT, FLICKR (Hamilton et al., 2017), and Ogbn-arxiv (Hu et al., 2020). These datasets contain thousands or millions of nodes/edges and have been widely used in previous works (Fey et al., 2021; Zeng et al., 2020; Hamilton et al., 2017; Chiang et al., 2019; Chen et al., 2018a;b). For more details, please refer to Appendix A.1.

Baselines and Implementation Details. In terms of prediction performance, our baselines include node-wise sampling methods (GraphSAGE (Hamilton et al., 2017) and VR-GCN (Chen et al., 2018a)), layer-wise sampling methods (FASTGCN (Chen et al., 2018b) and LADIES (Zou et al., 2019)), subgraph-wise sampling methods (CLUSTER-GCN (Chiang et al., 2019), GRAPHSAINT (Zeng et al., 2020), FM (Yu et al., 2022), and GAS (Fey et al., 2021)), and a precomputing method (SIGN (Rossi et al., 2020)). Since GAS and FM achieve the state-of-the-art prediction performance among the baselines (Table 1), we further compare the efficiency of LMC with GAS, FM, and CLUSTER-GCN, another subgraph-wise sampling method using the METIS partition. We implement LMC, FM, and CLUSTER-GCN based on the codes and toolkits of GAS (Fey et al., 2021) to ensure a fair comparison. For other implementation details, please refer to Appendix A.3.

Hyperparameters. To ensure a fair comparison, we follow the data splits, training pipeline, and most hyperparameters in (Fey et al., 2021), except for the additional hyperparameters in LMC such as $\beta_i$. We use grid search to find the best $\beta_i$ (see Appendix A.4 for more details).

7.2. LMC IS FAST WITHOUT SACRIFICING ACCURACY

Table 1 reports the prediction performance of LMC and the baselines. We report the mean and the standard deviation over five runs for GAS, FM, and LMC. LMC, FM, and GAS all match full-batch performance on all datasets, while other baselines may fail, especially on the FLICKR dataset. Moreover, with deep GNNs, i.e., GCNII (Chen et al., 2020), LMC, FM, and GAS outperform the other baselines on all datasets. As LMC, FM, and GAS share similar prediction performance, we additionally compare the convergence speed of LMC, FM, GAS, and CLUSTER-GCN, another subgraph-wise sampling method using the METIS partition, in Figure 2 and Table 2. We use a sliding window to smooth the convergence curves in Figure 2, as the accuracy on test data is unstable. The solid curves correspond to the mean, and the shaded regions correspond to values within plus or minus one standard deviation of the mean.

To further illustrate the convergence of LMC, we compare the errors of the mini-batch gradients computed by CLUSTER-GCN, GAS, and LMC. At each training step, we record the relative error $\|\tilde{g}_{\theta^l} - \nabla_{\theta^l}\mathcal{L}\|_2 / \|\nabla_{\theta^l}\mathcal{L}\|_2$, where $\nabla_{\theta^l}\mathcal{L}$ is the full-batch gradient for the parameters $\theta^l$ at the $l$-th MP layer and $\tilde{g}_{\theta^l}$ is a mini-batch gradient. To avoid randomness in the full-batch gradient $\nabla_{\theta^l}\mathcal{L}$, we set the dropout rate to zero. We report the average relative errors during training in Figure 3. LMC enjoys the smallest estimation errors in the experiments.

7.3. PERFORMANCE UNDER SMALL BATCH SIZES

An appealing feature of mini-batch training methods is that they can avoid out-of-memory issues by decreasing the batch size. Thus, we evaluate the prediction performance of LMC on the Ogbn-arxiv dataset with different batch sizes (numbers of sampled clusters per mini-batch). We run each experiment for the same number of epochs and search learning rates in the same set. We report the best prediction accuracy in Table 3. LMC outperforms GAS under small batch sizes (batch size = 1 or 2) and achieves comparable performance with GAS under larger ones (batch size = 5 or 10).

7.4. ABLATION

The improvement of LMC comes from two parts: the compensation in forward passes $C^l_f$ and the compensation in backward passes $C^l_b$. Compared with GAS, the compensation in forward passes $C^l_f$ additionally incorporates the incomplete up-to-date messages. Figure 4 shows the convergence curves of LMC using both $C^l_f$ and $C^l_b$ (denoted by $C_f \& C_b$), LMC using only $C^l_f$ (denoted by $C_f$), and GAS on the Ogbn-arxiv dataset. Under small batch sizes, the improvement is mainly due to $C^l_b$, and the incomplete up-to-date messages in forward passes may hurt the performance. This is because, when the batch size is small, the union of the mini-batch and its 1-hop neighbors is unlikely to contain most neighbors of the out-of-batch nodes. Thus, the compensation in backward passes $C^l_b$ is the most important component, as it corrects the bias of the mini-batch gradients. Under large batch sizes, the improvement is due to $C^l_f$, as large batch sizes reduce the number of discarded messages and improve the accuracy of the mini-batch gradients (see Table 7 in the Appendix). Notably, $C^l_b$ still slightly improves the performance. We provide more ablation studies on $\beta_i$ in Appendix E.4.

8. CONCLUSION

In this paper, we propose a novel subgraph-wise sampling method with a convergence guarantee, namely Local Message Compensation (LMC). LMC uses efficient and effective compensations to correct the biases of mini-batch gradients and thus accelerates convergence. We show that LMC converges to first-order stationary points of GNNs. To the best of our knowledge, LMC is the first subgraph-wise sampling method for GNNs with provable convergence. Experiments on large-scale benchmark tasks demonstrate that LMC significantly outperforms state-of-the-art subgraph-wise sampling methods in terms of efficiency.

A MORE DETAILS ABOUT EXPERIMENTS

In this section, we introduce more details about our experiments, including datasets, training and evaluation protocols, and implementations.

A.1 DATASETS

We evaluate LMC on four large datasets: PPI, REDDIT, FLICKR (Hamilton et al., 2017), and Ogbn-arxiv (Hu et al., 2020). None of the datasets contains personally identifiable information or offensive content. Table 4 shows the summary statistics of the datasets. Details about the datasets are as follows.

• PPI contains 24 protein-protein interaction graphs, each corresponding to a human tissue. Each node represents a protein, with positional gene sets, motif gene sets, and immunological signatures as node features. Edges represent interactions between proteins. The task is to classify protein functions.

• REDDIT is a post-to-post graph constructed from Reddit. Each node represents a post, and an edge between two posts indicates that the same user commented on both. The task is to classify posts into communities based on (1) the GloVe CommonCrawl word vectors (Pennington et al., 2014) of the post titles and comments, (2) the posts' scores, and (3) the number of comments made on the posts.

• Ogbn-arxiv is a directed citation network between all Computer Science (CS) arXiv papers indexed by MAG (Wang et al., 2020). Each node is an arXiv paper, and each directed edge indicates that one paper cites another. The task is to classify unlabeled arXiv papers into primary categories based on the labeled papers and node features, which are computed by averaging word2vec (Mikolov et al., 2013) embeddings of the words in each paper's title and abstract.

• FLICKR categorizes types of images based on their descriptions and properties (Fey et al., 2021; Zeng et al., 2020).

Data Splitting. We use the data splitting strategies of previous works (Fey et al., 2021; Gu et al., 2020).

A.3.1 NORMALIZATION TECHNIQUE

In Section 4.2 of the main text, we assume that the mini-batch $\mathcal{V}_B$ is uniformly sampled from $\mathcal{V}$ and the corresponding set of labeled nodes $\mathcal{V}_{L_B} = \mathcal{V}_B \cap \mathcal{V}_L$ is uniformly sampled from $\mathcal{V}_L$. To enforce the assumption, we use the normalization technique to reweight Equations (6) and (7) in the main text. Suppose we partition the whole node set $\mathcal{V}$ into $b$ parts $\{\mathcal{V}_{B_i}\}_{i=1}^b$ and then uniformly sample $c$ clusters without replacement to construct the mini-batch $\mathcal{V}_B$. With the normalization technique, Equation (6) becomes
$$g_w(\mathcal{V}_B) = \frac{b|\mathcal{V}_{L_B}|}{c|\mathcal{V}_L|} \cdot \frac{1}{|\mathcal{V}_{L_B}|}\sum_{v_j \in \mathcal{V}_{L_B}} \nabla_w \ell_w(\mathbf{h}_j, y_j),$$
where $\frac{b|\mathcal{V}_{L_B}|}{c|\mathcal{V}_L|}$ is the corresponding weight. Similarly, Equation (7) becomes
$$g_{\theta}(\mathcal{V}_B) = \frac{b|\mathcal{V}_B|}{c|\mathcal{V}|} \cdot \frac{|\mathcal{V}|}{|\mathcal{V}_B|}\sum_{v_j \in \mathcal{V}_B} \nabla_{\theta} u\big(\mathbf{h}_j, \mathbf{m}_{\mathcal{N}(v_j)}, \mathbf{x}_j\big)\, V_j,$$
where $\frac{b|\mathcal{V}_B|}{c|\mathcal{V}|}$ is the corresponding weight.
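A small numerical sketch of why this reweighting restores unbiasedness, assuming singleton clusters and toy per-node gradient terms (all names are illustrative): with $b$ parts and $c$ sampled clusters, the reweighted estimator averages to the full-batch gradient over all possible samples.

```python
import itertools

def normalized_grad(per_node_grad, sampled, b, c, n_labeled):
    """Reweighted estimator in the style of the normalized Eq. (6):
    weight b*|V_LB| / (c*|V_L|), then average over the sampled labeled nodes."""
    weight = (b * len(sampled)) / (c * n_labeled)
    return weight * sum(per_node_grad[v] for v in sampled) / len(sampled)

per_node_grad = {0: 0.2, 1: -0.4, 2: 0.6, 3: 0.8}   # toy gradient terms
full_grad = sum(per_node_grad.values()) / len(per_node_grad)

b, c = 4, 2  # 4 singleton parts, 2 sampled without replacement per step
samples = list(itertools.combinations(per_node_grad, c))
mean_est = sum(normalized_grad(per_node_grad, s, b, c, len(per_node_grad))
               for s in samples) / len(samples)
# the estimator averages to the full-batch gradient over all samples
```

With unequal cluster sizes, $|\mathcal{V}_{L_B}|$ varies across samples and the weight adapts accordingly, which is exactly what the $b|\mathcal{V}_{L_B}|/(c|\mathcal{V}_L|)$ factor encodes.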

A.3.2 INCORPORATING BATCH NORMALIZATION

We uniformly sample a mini-batch of nodes $\mathcal{V}_B$ and generate the subgraph induced by $\mathcal{N}(\mathcal{V}_B)$. If we directly fed $\mathbf{H}^{(l)}_{\mathcal{N}(\mathcal{V}_B)}$ to a batch normalization layer, the learned mean and standard deviation of the batch normalization layer might be biased. Thus, LMC first feeds the embeddings of the mini-batch $\mathbf{H}^{(l)}_{\mathcal{V}_B}$ to a batch normalization layer and then feeds the embeddings outside the mini-batch $\mathbf{H}^{(l)}_{\mathcal{N}(\mathcal{V}_B) \setminus \mathcal{V}_B}$ to another batch normalization layer.
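The idea of normalizing the two groups separately can be sketched as follows, with simple standardization standing in for a learned BatchNorm layer (the names `standardize`, `h_in`, and `h_out` are illustrative): each group uses its own statistics, so the in-batch statistics are not polluted by out-of-batch embeddings.

```python
def standardize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a list of scalars,
    a stand-in for one batch normalization layer's per-batch statistics."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return [(v - m) / (var + eps) ** 0.5 for v in values]

h_in = [1.0, 3.0]         # embeddings of mini-batch nodes
h_out = [10.0, 30.0]      # embeddings of out-of-batch neighbors
# normalize each group with its own mean/std, as LMC does with two BN layers
normalized = standardize(h_in) + standardize(h_out)
```

Had the two groups been normalized jointly, the much larger out-of-batch values would shift the shared mean and inflate the shared variance, which is the bias the two-layer design avoids.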

A.4 SELECTION OF β_i

We select $\beta_i = \mathrm{score}(i)\,\alpha$ for each node $v_i$, where $\alpha \in [0, 1]$ is a hyperparameter and score is a function measuring the quality of the incomplete up-to-date messages. We search score over
$$\{f(x) = x^2,\ f(x) = 2x - x^2,\ f(x) = x,\ f(x) = 1;\quad x = \mathrm{deg}_{\mathrm{local}}(i)/\mathrm{deg}_{\mathrm{global}}(i)\},$$
where $\mathrm{deg}_{\mathrm{global}}(i)$ is the degree of node $v_i$ in the whole graph and $\mathrm{deg}_{\mathrm{local}}(i)$ is its degree in the subgraph induced by $N(\mathcal{V}_B)$.
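A minimal sketch of this selection (the function and variable names are ours, not from the released code):

```python
# Candidate score functions over x = deg_local(i) / deg_global(i), x in [0, 1].
SCORES = {
    "x^2":    lambda x: x * x,
    "2x-x^2": lambda x: 2 * x - x * x,
    "x":      lambda x: x,
    "1":      lambda x: 1.0,
}

def beta(deg_local, deg_global, alpha, score="2x-x^2"):
    """Convex combination coefficient beta_i = score(x) * alpha."""
    x = deg_local / deg_global
    return SCORES[score](x) * alpha

# A node with all neighbors inside the sampled subgraph (x = 1) gets beta = alpha.
assert beta(4, 4, alpha=0.4) == 0.4
# A node with half of its neighbors missing is trusted less under score "x".
assert beta(2, 4, alpha=1.0, score="x") == 0.5
```

Intuitively, the closer $x$ is to 1, the more complete the up-to-date messages are, and the larger the weight they receive.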

B COMPUTATIONAL COMPLEXITY

We summarize the computational complexity in Table 5, where $n_{\max}$ is the maximum neighborhood size, $L$ is the number of message passing layers, $\mathcal{V}_B$ is the set of nodes in a sampled mini-batch, $d$ is the embedding dimension, $\mathcal{V}$ is the set of nodes in the whole graph, and $\mathcal{E}$ is the set of edges in the whole graph. As GD, backward SGD, CLUSTER, GAS, and LMC share the same memory complexity for the parameters $\theta^{(l)}$, we omit it in Table 5.

Table 5: Time and memory complexity per gradient update of message passing-based GNNs (e.g., GCN (Kipf & Welling, 2017) and GCNII (Chen et al., 2020)).

| Method | Time | Memory |
| --- | --- | --- |
| GD and backward SGD | $O(L(|\mathcal{E}|d + |\mathcal{V}|d^2))$ | $O(L|\mathcal{V}|d)$ |
| CLUSTER (Chiang et al., 2019) | $O(L(n_{\max}|\mathcal{V}_B|d + |\mathcal{V}_B|d^2))$ | $O(L|\mathcal{V}_B|d)$ |
| GAS (Fey et al., 2021) | $O(L(n_{\max}|\mathcal{V}_B|d + |\mathcal{V}_B|d^2))$ | $O(n_{\max}L|\mathcal{V}_B|d)$ |
| LMC | $O(L(n_{\max}|\mathcal{V}_B|d + |\mathcal{V}_B|d^2))$ | $O(n_{\max}L|\mathcal{V}_B|d)$ |

C ADDITIONAL RELATED WORK

C.1 MAIN DIFFERENCES BETWEEN LMC AND GRAPHFM

• First, LMC focuses on the convergence of subgraph-wise sampling methods, which is orthogonal to the idea of GraphFM-OB of alleviating the staleness problem of historical values. Advanced approaches to alleviating the staleness of historical values can further improve the performance of LMC, and it is easy to establish provable convergence by extending LMC.

• Second, LMC uses nodes in both the mini-batches and their 1-hop neighbors to compute incomplete up-to-date messages, whereas GraphFM-OB only uses nodes in the mini-batches. For nodes whose neighbors are all contained in the union of the mini-batch nodes and their 1-hop neighbors, the aggregation results of LMC are exact, while those of GraphFM-OB are not.

• Third, noticing that the aggregation results are biased and that history access is an I/O bottleneck, LMC does not update the historical values in the storage for nodes outside the mini-batches, whereas GraphFM-OB updates them based on the aggregation results.

D DETAILED PROOFS

Notations. Unless otherwise specified, $C$ and $C'$ with any superscript or subscript denote constants. We denote the learning rate by $\eta$. Throughout this section, we suppose that Assumption 1 holds.

D.1 PROOF OF THEOREM 1: UNBIASED MINI-BATCH GRADIENTS OF BACKWARD SGD

In this subsection, we give the proof of Theorem 1, which shows that the mini-batch gradients computed by backward SGD are unbiased.

Proof. As $\mathcal{V}_L^B = \mathcal{V}_B \cap \mathcal{V}_L$ is uniformly sampled from $\mathcal{V}_L$, the expectation of $g_w(\mathcal{V}_B)$ is
$$\mathbb{E}[g_w(\mathcal{V}_B)] = \mathbb{E}\Big[\frac{1}{|\mathcal{V}_L^B|} \sum_{v_j \in \mathcal{V}_L^B} \nabla_w \ell_w(h_j, y_j)\Big] = \nabla_w \mathbb{E}[\ell_w(h_j, y_j)] = \nabla_w \mathcal{L}.$$
As the subgraph $\mathcal{V}_B$ is uniformly sampled from $\mathcal{V}$, the expectation of $g_{\theta^l}(\mathcal{V}_B)$ is
$$\begin{aligned}
\mathbb{E}[g_{\theta^l}(\mathcal{V}_B)] &= \mathbb{E}\Big[\frac{|\mathcal{V}|}{|\mathcal{V}_B|} \sum_{v_j \in \mathcal{V}_B} \nabla_{\theta^l} u_{\theta^l}(h_j^{l-1}, m_{N(v_j)}^{l-1}, x_j)\, V_j^l\Big] = |\mathcal{V}|\, \mathbb{E}[\nabla_{\theta^l} u_{\theta^l}(h_j^{l-1}, m_{N(v_j)}^{l-1}, x_j)\, V_j^l] \\
&= |\mathcal{V}| \cdot \frac{1}{|\mathcal{V}|} \sum_{v_j \in \mathcal{V}} \nabla_{\theta^l} u_{\theta^l}(h_j^{l-1}, m_{N(v_j)}^{l-1}, x_j)\, V_j^l = \sum_{v_j \in \mathcal{V}} \nabla_{\theta^l} u_{\theta^l}(h_j^{l-1}, m_{N(v_j)}^{l-1}, x_j)\, V_j^l = \nabla_{\theta^l} \mathcal{L}, \quad \forall\, l \in [L].
\end{aligned}$$

D.2 DIFFERENCES BETWEEN EXACT VALUES AT ADJACENT ITERATIONS

We first show that the differences between the exact values of the same layer at two adjacent iterations can be bounded by choosing a proper learning rate.

Lemma 1. Suppose that Assumption 1 holds. Given an $L$-layer GNN, for any $\varepsilon > 0$, by letting $\eta \le \frac{\varepsilon}{(2\gamma)^L G} < \varepsilon$, we have $\|H^{l,k+1} - H^{l,k}\|_F < \varepsilon$, $\forall\, l \in [L], k \in \mathbb{N}^*$.

Proof. Since $\eta \le \frac{\varepsilon}{(2\gamma)^L G} < \frac{\varepsilon}{\gamma(2\gamma)^{L-1} G}$, we have
$$\|H^{1,k+1} - H^{1,k}\|_F = \|f_{\theta^{1,k+1}}(X) - f_{\theta^{1,k}}(X)\|_F \le \gamma\|\theta^{1,k+1} - \theta^{1,k}\| \le \gamma\|\widetilde{g}_{\theta^1}\|\eta < \frac{\gamma G\varepsilon}{\gamma(2\gamma)^{L-1} G} = \frac{\varepsilon}{(2\gamma)^{L-1}}.$$
Then, because $\eta \le \frac{\varepsilon}{(2\gamma)^L G} < \frac{\varepsilon}{(2\gamma)^{L-1} G}$, we have
$$\begin{aligned}
\|H^{2,k+1} - H^{2,k}\|_F &= \|f_{\theta^{2,k+1}}(H^{1,k+1}) - f_{\theta^{2,k}}(H^{1,k})\|_F \\
&\le \|f_{\theta^{2,k+1}}(H^{1,k+1}) - f_{\theta^{2,k}}(H^{1,k+1})\|_F + \|f_{\theta^{2,k}}(H^{1,k+1}) - f_{\theta^{2,k}}(H^{1,k})\|_F \\
&\le \gamma\|\theta^{2,k+1} - \theta^{2,k}\| + \gamma\|H^{1,k+1} - H^{1,k}\|_F \le \gamma G\eta + \frac{\varepsilon}{2(2\gamma)^{L-2}} < \frac{\varepsilon}{2(2\gamma)^{L-2}} + \frac{\varepsilon}{2(2\gamma)^{L-2}} = \frac{\varepsilon}{(2\gamma)^{L-2}}.
\end{aligned}$$
By induction, $\|H^{l,k+1} - H^{l,k}\|_F < \frac{\varepsilon}{(2\gamma)^{L-l}}$, $\forall\, l \in [L], k \in \mathbb{N}^*$. Since $(2\gamma)^{L-l} > 1$, we have $\|H^{l,k+1} - H^{l,k}\|_F < \varepsilon$, $\forall\, l \in [L], k \in \mathbb{N}^*$.

Lemma 2. Suppose that Assumption 1 holds. Given an $L$-layer GNN, for any $\varepsilon > 0$, by letting $\eta \le \frac{\varepsilon}{(2\gamma)^{L-1} G} < \varepsilon$, we have $\|V^{l,k+1} - V^{l,k}\|_F < \varepsilon$, $\forall\, l \in [L], k \in \mathbb{N}^*$.

Proof. Since $\eta \le \frac{\varepsilon}{(2\gamma)^{L-1} G} < \frac{\varepsilon}{\gamma(2\gamma)^{L-2} G}$, we have
$$\|V^{L-1,k+1} - V^{L-1,k}\|_F = \|\phi_{\theta^{L,k+1}}(\nabla_H \mathcal{L}) - \phi_{\theta^{L,k}}(\nabla_H \mathcal{L})\|_F \le \gamma\|\theta^{L,k+1} - \theta^{L,k}\| \le \gamma\|\widetilde{g}_{\theta^L}\|\eta < \frac{\gamma G\varepsilon}{\gamma(2\gamma)^{L-2} G} = \frac{\varepsilon}{(2\gamma)^{L-2}}.$$
Then, because $\eta \le \frac{\varepsilon}{(2\gamma)^{L-1} G} < \frac{\varepsilon}{(2\gamma)^{L-2} G}$, we have
$$\begin{aligned}
\|V^{L-2,k+1} - V^{L-2,k}\|_F &= \|\phi_{\theta^{L-1,k+1}}(V^{L-1,k+1}) - \phi_{\theta^{L-1,k}}(V^{L-1,k})\|_F \\
&\le \|\phi_{\theta^{L-1,k+1}}(V^{L-1,k+1}) - \phi_{\theta^{L-1,k}}(V^{L-1,k+1})\|_F + \|\phi_{\theta^{L-1,k}}(V^{L-1,k+1}) - \phi_{\theta^{L-1,k}}(V^{L-1,k})\|_F \\
&\le \gamma\|\theta^{L-1,k+1} - \theta^{L-1,k}\| + \gamma\|V^{L-1,k+1} - V^{L-1,k}\|_F \le \gamma G\eta + \frac{\varepsilon}{2(2\gamma)^{L-3}} < \frac{\varepsilon}{2(2\gamma)^{L-3}} + \frac{\varepsilon}{2(2\gamma)^{L-3}} = \frac{\varepsilon}{(2\gamma)^{L-3}}.
\end{aligned}$$
By induction, $\|V^{l,k+1} - V^{l,k}\|_F < \frac{\varepsilon}{(2\gamma)^{l-1}}$, $\forall\, l \in [L], k \in \mathbb{N}^*$. Since $(2\gamma)^{l-1} > 1$, we have $\|V^{l,k+1} - V^{l,k}\|_F < \varepsilon$, $\forall\, l \in [L], k \in \mathbb{N}^*$.
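The induction above can be checked numerically. The sketch below (illustrative constants, with $2\gamma > 1$ as the lemmas implicitly assume) runs the worst case of the recursion $e_l \le \gamma G\eta + \gamma e_{l-1}$ and confirms that $\eta = \varepsilon/((2\gamma)^L G)$ keeps every layer's drift below $\varepsilon/(2\gamma)^{L-l}$:

```python
# Worst case of the recursion in Lemma 1: a parameter drift of at most G * eta
# per step propagates through gamma-Lipschitz layers as
#     e_l <= gamma * G * eta + gamma * e_{l-1},
# and eta = eps / ((2 * gamma)^L * G) keeps e_l below eps / (2 * gamma)^(L - l).
gamma, G, L, eps = 1.5, 2.0, 4, 0.1   # illustrative constants with 2 * gamma > 1
eta = eps / ((2 * gamma) ** L * G)

e = 0.0
for l in range(1, L + 1):
    e = gamma * G * eta + gamma * e   # drift after layer l
    assert e <= eps / (2 * gamma) ** (L - l)
assert e <= eps                       # the claim of Lemma 1
```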

D.3 HISTORICAL VALUES AND TEMPORARY VALUES

Suppose that we uniformly sample a mini-batch $\mathcal{V}_B^k \subset \mathcal{V}$ at the $k$-th iteration, with $|\mathcal{V}_B^k| = S$. For simplicity of notation, we denote the temporary node embeddings and auxiliary variables in the $l$-th layer by $\widetilde{H}^{l,k}$ and $\widetilde{V}^{l,k}$, respectively, where
$$\widetilde{H}^{l,k}_i = \begin{cases} \bar{h}^{l,k}_i, & v_i \in N(\mathcal{V}_B^k) \setminus \mathcal{V}_B^k, \\ h^{l,k}_i, & \text{otherwise}, \end{cases} \qquad \widetilde{V}^{l,k}_i = \begin{cases} \bar{v}^{l,k}_i, & v_i \in N(\mathcal{V}_B^k) \setminus \mathcal{V}_B^k, \\ v^{l,k}_i, & \text{otherwise}. \end{cases}$$
We abbreviate the process by which LMC updates the node embeddings and auxiliary variables of $\mathcal{V}_B^k$ in the $l$-th layer at the $k$-th iteration as
$$H^{l,k}_{\mathcal{V}_B^k} = [f_{\theta^{l,k}}(\widetilde{H}^{l-1,k})]_{\mathcal{V}_B^k}, \qquad V^{l,k}_{\mathcal{V}_B^k} = [\phi_{\theta^{l+1,k}}(\widetilde{V}^{l+1,k})]_{\mathcal{V}_B^k}.$$
For each $v_i \in \mathcal{V}_B^k$, the update of $v_i$ in the $l$-th layer at the $k$-th iteration can be expressed as
$$h^{l,k}_i = f_{\theta^{l,k},i}(\widetilde{H}^{l-1,k}), \qquad v^{l,k}_i = \phi_{\theta^{l+1,k},i}(\widetilde{V}^{l+1,k}),$$
where $f_{\theta^{l,k},i}$ and $\phi_{\theta^{l+1,k},i}$ are the components for node $v_i$ of $f_{\theta^{l,k}}$ and $\phi_{\theta^{l+1,k}}$, respectively.

D.3.1 CONVEX COMBINATION COEFFICIENTS

We first focus on the convex combination coefficients $\beta_i$, $i \in [n]$. For simplicity of analysis, we assume $\beta_i = \beta$ for all $i \in [n]$; the analysis for the case where the $(\beta_i)_{i=1}^n$ differ from each other is the same.

Lemma 3. Suppose that Assumption 1 holds. For any $\varepsilon > 0$, by letting $\beta \le \frac{\varepsilon}{2G}$, we have $\|\bar{H}^{l,k} - H^{l,k}\|_F \le \|\hat{H}^{l,k} - H^{l,k}\|_F + \varepsilon$, $\forall\, l \in [L], k \in \mathbb{N}^*$.

Proof. Since $\bar{H}^{l,k} = (1-\beta)\hat{H}^{l,k} + \beta\widetilde{H}^{l,k}$, we have
$$\begin{aligned}
\|\bar{H}^{l,k} - H^{l,k}\|_F &= \|(1-\beta)\hat{H}^{l,k} + \beta\widetilde{H}^{l,k} - ((1-\beta)H^{l,k} + \beta H^{l,k})\|_F \\
&\le (1-\beta)\|\hat{H}^{l,k} - H^{l,k}\|_F + \beta\|\widetilde{H}^{l,k} - H^{l,k}\|_F \le \|\hat{H}^{l,k} - H^{l,k}\|_F + 2\beta G.
\end{aligned}$$
Hence letting $\beta \le \frac{\varepsilon}{2G}$ leads to $\|\bar{H}^{l,k} - H^{l,k}\|_F \le \|\hat{H}^{l,k} - H^{l,k}\|_F + \varepsilon$.

Lemma 4. Suppose that Assumption 1 holds. For any $\varepsilon > 0$, by letting $\beta \le \frac{\varepsilon}{2G}$, we have $\|\bar{V}^{l,k} - V^{l,k}\|_F \le \|\hat{V}^{l,k} - V^{l,k}\|_F + \varepsilon$.

Proof. Since $\bar{V}^{l,k} = (1-\beta)\hat{V}^{l,k} + \beta\widetilde{V}^{l,k}$, we have
$$\begin{aligned}
\|\bar{V}^{l,k} - V^{l,k}\|_F &= \|(1-\beta)\hat{V}^{l,k} + \beta\widetilde{V}^{l,k} - ((1-\beta)V^{l,k} + \beta V^{l,k})\|_F \\
&\le (1-\beta)\|\hat{V}^{l,k} - V^{l,k}\|_F + \beta\|\widetilde{V}^{l,k} - V^{l,k}\|_F \le \|\hat{V}^{l,k} - V^{l,k}\|_F + 2\beta G.
\end{aligned}$$
Hence letting $\beta \le \frac{\varepsilon}{2G}$ leads to $\|\bar{V}^{l,k} - V^{l,k}\|_F \le \|\hat{V}^{l,k} - V^{l,k}\|_F + \varepsilon$.
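The bound in Lemmas 3 and 4 is a deterministic consequence of the triangle inequality; a quick numpy check on random matrices (with $G$ set to a valid norm bound for the sampled values, as Assumption 1 requires):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, beta = 50, 8, 0.3

H     = rng.uniform(-1, 1, size=(n, d))   # exact embeddings H^{l,k}
H_hat = rng.uniform(-1, 1, size=(n, d))   # historical embeddings
H_til = rng.uniform(-1, 1, size=(n, d))   # incomplete up-to-date embeddings
G = max(np.linalg.norm(H, "fro"), np.linalg.norm(H_til, "fro"))  # norm bound

# Convex combination used by LMC.
H_bar = (1 - beta) * H_hat + beta * H_til

lhs = np.linalg.norm(H_bar - H, "fro")
rhs = np.linalg.norm(H_hat - H, "fro") + 2 * beta * G
assert lhs <= rhs   # Lemma 3: ||H_bar - H||_F <= ||H_hat - H||_F + 2*beta*G
```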

D.3.2 APPROXIMATION ERRORS OF HISTORICAL VALUES

Next, we focus on the approximation errors of the historical node embeddings and auxiliary variables,
$$d^{l,k}_h := \mathbb{E}[\|\hat{H}^{l,k} - H^{l,k}\|_F^2]^{\frac12},\ l \in [L], \qquad d^{l,k}_v := \mathbb{E}[\|\hat{V}^{l,k} - V^{l,k}\|_F^2]^{\frac12},\ l \in [L-1].$$

Lemma 5. For an $L$-layer GNN, suppose that Assumption 1 holds. Besides, suppose that
1. $(d^{l,1}_h)^2$ is bounded by $G > 1$, $\forall\, l \in [L]$;
2. there exists $N \in \mathbb{N}^*$ such that $\|\bar{H}^{l,k} - H^{l,k}\|_F \le \|\hat{H}^{l,k} - H^{l,k}\|_F + \frac{1}{N^{2/3}}$, $\forall\, l \in [L], k \in \mathbb{N}^*$, and $\|H^{l,k} - H^{l,k-1}\|_F \le \frac{1}{N^{2/3}}$, $\forall\, k \in \mathbb{N}^*$.

Then there exist constants $C'_{*,1}$, $C'_{*,2}$, and $C'_{*,3}$ that do not depend on $k$, $l$, $N$, and $\eta$ such that
$$(d^{l,k+1}_h)^2 \le C'_{*,1}\eta + C'_{*,2}\rho^k + \frac{C'_{*,3}}{N^{2/3}}, \quad \forall\, l \in [L], k \in \mathbb{N}^*,$$
where $\rho = \frac{n-S}{n} < 1$, $n = |\mathcal{V}|$, and $S$ is the number of sampled nodes at each iteration.

Proof. We have
$$\begin{aligned}
(d^{l+1,k+1}_h)^2 &= \mathbb{E}[\|\hat{H}^{l+1,k+1} - H^{l+1,k+1}\|_F^2] = \mathbb{E}\Big[\sum_{i=1}^n \|\hat{h}^{l+1,k+1}_i - h^{l+1,k+1}_i\|_F^2\Big] \\
&= \mathbb{E}\Big[\sum_{v_i \in \mathcal{V}_B^k} \|f_{\theta^{l+1,k+1},i}(\bar{H}^{l,k+1}) - f_{\theta^{l+1,k+1},i}(H^{l,k+1})\|_F^2 + \sum_{v_i \notin \mathcal{V}_B^k} \|\hat{h}^{l+1,k}_i - f_{\theta^{l+1,k+1},i}(H^{l,k+1})\|_F^2\Big] \\
&= \mathbb{E}\Big[\frac{S}{n}\sum_{i=1}^n \|f_{\theta^{l+1,k+1},i}(\bar{H}^{l,k+1}) - f_{\theta^{l+1,k+1},i}(H^{l,k+1})\|_F^2 + \frac{n-S}{n}\sum_{i=1}^n \|\hat{h}^{l+1,k}_i - f_{\theta^{l+1,k+1},i}(H^{l,k+1})\|_F^2\Big] \\
&\le \frac{S}{n}\sum_{i=1}^n \mathbb{E}[\|f_{\theta^{l+1,k+1},i}(\bar{H}^{l,k+1}) - f_{\theta^{l+1,k+1},i}(H^{l,k+1})\|_F^2] + \frac{n-S}{n}\sum_{i=1}^n \mathbb{E}[\|\hat{h}^{l+1,k}_i - f_{\theta^{l+1,k+1},i}(H^{l,k+1})\|_F^2].
\end{aligned}$$
For the first term, for $l \ge 1$ we have
$$\begin{aligned}
\mathbb{E}[\|f_{\theta^{l+1,k+1},i}(\bar{H}^{l,k+1}) - f_{\theta^{l+1,k+1},i}(H^{l,k+1})\|_F^2] &\le \gamma^2\,\mathbb{E}[\|\bar{H}^{l,k+1} - H^{l,k+1}\|_F^2] \le \gamma^2\,\mathbb{E}\Big[\Big(\|\hat{H}^{l,k+1} - H^{l,k+1}\|_F + \frac{1}{N^{2/3}}\Big)^2\Big] \\
&\le 2\gamma^2\,\mathbb{E}[\|\hat{H}^{l,k+1} - H^{l,k+1}\|_F^2] + \frac{2\gamma^2}{N^{4/3}} = 2\gamma^2 (d^{l,k+1}_h)^2 + \frac{2\gamma^2}{N^{4/3}}.
\end{aligned}$$
For $l = 0$, we have
$$\mathbb{E}[\|f_{\theta^{1,k+1},i}(\bar{H}^{0,k+1}) - f_{\theta^{1,k+1},i}(H^{0,k+1})\|_F^2] = \mathbb{E}[\|f_{\theta^{1,k+1},i}(X) - f_{\theta^{1,k+1},i}(X)\|_F^2] = 0.$$
For the second term, for $l \ge 1$ we have
$$\begin{aligned}
\mathbb{E}[\|\hat{h}^{l+1,k}_i - f_{\theta^{l+1,k+1},i}(H^{l,k+1})\|_F^2] &= \mathbb{E}[\|\hat{h}^{l+1,k}_i - h^{l+1,k}_i + h^{l+1,k}_i - f_{\theta^{l+1,k+1},i}(H^{l,k+1})\|_F^2] \\
&\le \mathbb{E}[\|\hat{h}^{l+1,k}_i - h^{l+1,k}_i\|_F^2] + \mathbb{E}[\|h^{l+1,k}_i - f_{\theta^{l+1,k+1},i}(H^{l,k+1})\|_F^2] \\
&\quad + 2\,\mathbb{E}[\langle \hat{h}^{l+1,k}_i - h^{l+1,k}_i,\ h^{l+1,k}_i - f_{\theta^{l+1,k+1},i}(H^{l,k+1})\rangle] \\
&\le \mathbb{E}[\|\hat{h}^{l+1,k}_i - h^{l+1,k}_i\|_F^2] + 2\,\mathbb{E}[\|h^{l+1,k}_i - f_{\theta^{l+1,k+1},i}(H^{l,k})\|_F^2] \\
&\quad + 2\,\mathbb{E}[\|f_{\theta^{l+1,k+1},i}(H^{l,k}) - f_{\theta^{l+1,k+1},i}(H^{l,k+1})\|_F^2] + 4G\,\mathbb{E}[\|h^{l+1,k}_i - f_{\theta^{l+1,k+1},i}(H^{l,k+1})\|_F] \\
&\le \mathbb{E}[\|\hat{h}^{l+1,k}_i - h^{l+1,k}_i\|_F^2] + 2\gamma^2\,\mathbb{E}[\|\theta^{l+1,k} - \theta^{l+1,k+1}\|^2] + 2\gamma^2\,\mathbb{E}[\|H^{l,k} - H^{l,k+1}\|_F^2] \\
&\quad + 4G\gamma\,\mathbb{E}[\|\theta^{l+1,k} - \theta^{l+1,k+1}\| + \|H^{l,k} - H^{l,k+1}\|_F] \\
&\le \mathbb{E}[\|\hat{h}^{l+1,k}_i - h^{l+1,k}_i\|_F^2] + 2\gamma^2 G^2\eta^2 + 4G^2\gamma\eta + \frac{2\gamma^2}{N^{4/3}} + \frac{4G\gamma}{N^{2/3}} \\
&\le \mathbb{E}[\|\hat{h}^{l+1,k}_i - h^{l+1,k}_i\|_F^2] + 2G^2\gamma(\gamma+2)\eta + \frac{2\gamma(\gamma+2G)}{N^{2/3}}.
\end{aligned}$$
For $l = 0$, the same steps apply with the $H$-difference terms vanishing (since $H^{0,k} = H^{0,k+1} = X$), which gives
$$\mathbb{E}[\|\hat{h}^{1,k}_i - f_{\theta^{1,k+1},i}(H^{0,k+1})\|_F^2] \le \mathbb{E}[\|\hat{h}^{1,k}_i - h^{1,k}_i\|_F^2] + 2\gamma^2 G^2\eta^2 + 4G^2\gamma\eta \le \mathbb{E}[\|\hat{h}^{1,k}_i - h^{1,k}_i\|_F^2] + 2G^2\gamma(\gamma+2)\eta.$$
Hence we have
$$(d^{l+1,k+1}_h)^2 \le \frac{n-S}{n}(d^{l+1,k}_h)^2 + 2(n-S)\gamma(\gamma+2)G^2\eta + \begin{cases} 0, & l = 0, \\ 2\gamma^2 S(d^{l,k+1}_h)^2 + \frac{4n\gamma(\gamma+G)}{N^{2/3}}, & l \ge 1. \end{cases}$$
Let $\rho = \frac{n-S}{n} < 1$. For $l = 0$, we have
$$(d^{1,k+1}_h)^2 - \frac{2(n-S)\gamma(\gamma+2)G^2\eta}{1-\rho} \le \rho\Big((d^{1,k}_h)^2 - \frac{2(n-S)\gamma(\gamma+2)G^2\eta}{1-\rho}\Big) \le \cdots \le \rho^k\Big((d^{1,1}_h)^2 - \frac{2(n-S)\gamma(\gamma+2)G^2\eta}{1-\rho}\Big) \le \rho^k G,$$
which leads to
$$(d^{1,k+1}_h)^2 \le \frac{2(n-S)\gamma(\gamma+2)G^2}{1-\rho}\eta + \rho^k G = C'_{1,1}\eta + \rho^k G.$$
Then, for $l = 1$, we have
$$(d^{2,k+1}_h)^2 \le \rho(d^{2,k}_h)^2 + C_{2,1}\eta + C_{2,2}\rho^k + \frac{C_{2,3}}{N^{2/3}},$$
where $C_{2,1}$, $C_{2,2}$, and $C_{2,3}$ are all constants. Hence we have
$$(d^{2,k+1}_h)^2 - \frac{C_{2,1}\eta + C_{2,2}\rho^k + C_{2,3}N^{-2/3}}{1-\rho} \le \rho\Big((d^{2,k}_h)^2 - \frac{C_{2,1}\eta + C_{2,2}\rho^k + C_{2,3}N^{-2/3}}{1-\rho}\Big) \le \cdots \le \rho^k G,$$
which leads to $(d^{2,k+1}_h)^2 \le C'_{2,1}\eta + C'_{2,2}\rho^k + C'_{2,3}N^{-2/3}$. By induction, there exist constants $C'_{*,1}$, $C'_{*,2}$, and $C'_{*,3}$ that are independent of $\eta$, $k$, $l$, and $N$ such that
$$(d^{l,k+1}_h)^2 \le C'_{*,1}\eta + C'_{*,2}\rho^k + \frac{C'_{*,3}}{N^{2/3}}, \quad \forall\, l \in [L], k \in \mathbb{N}^*.$$

Lemma 6. For an $L$-layer GNN, suppose that Assumption 1 holds. Besides, suppose that
1. $(d^{l,1}_v)^2$ is bounded by $G > 1$, $\forall\, l \in [L]$;
2. there exists $N \in \mathbb{N}^*$ such that $\|\bar{V}^{l,k} - V^{l,k}\|_F \le \|\hat{V}^{l,k} - V^{l,k}\|_F + \frac{1}{N^{2/3}}$, $\forall\, l \in [L], k \in \mathbb{N}^*$, and $\|V^{l,k} - V^{l,k-1}\|_F \le \frac{1}{N^{2/3}}$, $\forall\, k \in \mathbb{N}^*$.

Then there exist constants $C'_{*,1}$, $C'_{*,2}$, and $C'_{*,3}$ that are independent of $k$, $l$, $N$, and $\eta$ such that
$$(d^{l,k+1}_v)^2 \le C'_{*,1}\eta + C'_{*,2}\rho^k + \frac{C'_{*,3}}{N^{2/3}}, \quad \forall\, l \in [L], k \in \mathbb{N}^*,$$
where $\rho = \frac{n-S}{n} < 1$, $n = |\mathcal{V}|$, and $S$ is the number of sampled nodes at each iteration.

Proof. Similar to the proof of Lemma 5.
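Every step of the induction above has the contracting form $a_{k+1} \le \rho a_k + c$ with $\rho < 1$, whose unrolled bound $a_{k+1} \le c/(1-\rho) + \rho^k a_1$ is exactly what produces the $C'_{*,1}\eta + C'_{*,2}\rho^k + C'_{*,3}N^{-2/3}$ shape. A sketch with illustrative constants:

```python
# Every step of the induction in Lemma 5 has the contracting form
#     a_{k+1} <= rho * a_k + c,   rho = (n - S) / n < 1,
# whose unrolled bound is a_{k+1} <= c / (1 - rho) + rho^k * a_1.
rho, c, a1 = 0.8, 0.05, 10.0      # illustrative constants

a = a1
for k in range(1, 201):
    a = rho * a + c               # run the worst case of the recursion
    assert a <= c / (1 - rho) + rho ** k * a1 + 1e-9

# The error contracts geometrically to the floor c / (1 - rho).
assert abs(a - c / (1 - rho)) < 1e-9
```

The floor $c/(1-\rho)$ is the irreducible part of the bound, which is why making $c = O(\eta + N^{-2/3})$ small via the learning rate and $\beta_i$ is essential.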
Lemma 7. Suppose that Assumption 1 holds. For any $k \in \mathbb{N}^*$, the difference between $\widetilde{g}_w(w^k)$ and $g_w(w^k)$ can be bounded as $\|\widetilde{g}_w(w^k) - g_w(w^k)\|_2 \le \gamma\|\hat{H}^{L,k} - H^{L,k}\|_F$.

Proof. We have
$$\begin{aligned}
\|\widetilde{g}_w(w^k) - g_w(w^k)\|_2 &= \frac{1}{|\mathcal{V}_L^k|}\Big\|\sum_{v_j \in \mathcal{V}_L^k} \nabla_w \ell_{w^k}(\hat{h}^{L,k}_j, y_j) - \nabla_w \ell_{w^k}(h^{L,k}_j, y_j)\Big\|_2 \\
&\le \frac{1}{|\mathcal{V}_L^k|}\sum_{v_j \in \mathcal{V}_L^k} \|\nabla_w \ell_{w^k}(\hat{h}^{L,k}_j, y_j) - \nabla_w \ell_{w^k}(h^{L,k}_j, y_j)\|_2 \le \frac{\gamma}{|\mathcal{V}_L^k|}\sum_{v_j \in \mathcal{V}_L^k} \|\hat{h}^{L,k}_j - h^{L,k}_j\|_2 \\
&\le \frac{\gamma}{|\mathcal{V}_L^k|}\sum_{v_j \in \mathcal{V}_L^k} \|\hat{H}^{L,k} - H^{L,k}\|_F = \frac{\gamma}{|\mathcal{V}_L^k|} \cdot |\mathcal{V}_L^k| \cdot \|\hat{H}^{L,k} - H^{L,k}\|_F = \gamma\|\hat{H}^{L,k} - H^{L,k}\|_F.
\end{aligned}$$

Lemma 8. Suppose that Assumption 1 holds. For any $k \in \mathbb{N}^*$ and $l \in [L]$, the difference between $\widetilde{g}_{\theta^l}(\theta^{l,k})$ and $g_{\theta^l}(\theta^{l,k})$ can be bounded as
$$\|\widetilde{g}_{\theta^l}(\theta^{l,k}) - g_{\theta^l}(\theta^{l,k})\|_2 \le |\mathcal{V}|G\|\hat{V}^{l,k} - V^{l,k}\|_F + |\mathcal{V}|G\gamma\|\hat{H}^{l,k} - H^{l,k}\|_F.$$

Proof. As $\|Aa - Bb\|_2 \le \|A\|_F\|a - b\|_2 + \|A - B\|_F\|b\|_2$, we can bound $\|\widetilde{g}_{\theta^l}(\theta^{l,k}) - g_{\theta^l}(\theta^{l,k})\|_2$ by
$$\begin{aligned}
\|\widetilde{g}_{\theta^l}(\theta^{l,k}) - g_{\theta^l}(\theta^{l,k})\|_2 &\le \frac{|\mathcal{V}|}{|\mathcal{V}_B^k|}\sum_{v_j \in \mathcal{V}_B^k} \|\nabla_{\theta^l} u_{\theta^{l,k}}(\hat{h}^{l-1,k}_j, \hat{m}^{l-1,k}_{N(v_j)}, x_j)\hat{V}^{l,k}_j - \nabla_{\theta^l} u_{\theta^{l,k}}(h^{l-1,k}_j, m^{l-1,k}_{N(v_j)}, x_j)V^{l,k}_j\|_2 \\
&\le |\mathcal{V}|\max_{v_j \in \mathcal{V}_B^k} \|\nabla_{\theta^l} u_{\theta^{l,k}}(\hat{h}^{l-1,k}_j, \hat{m}^{l-1,k}_{N(v_j)}, x_j)\hat{V}^{l,k}_j - \nabla_{\theta^l} u_{\theta^{l,k}}(h^{l-1,k}_j, m^{l-1,k}_{N(v_j)}, x_j)V^{l,k}_j\|_2 \\
&\le |\mathcal{V}|\max_{v_j \in \mathcal{V}_B^k}\big\{\|\nabla_{\theta^l} u_{\theta^{l,k}}(\hat{h}^{l-1,k}_j, \hat{m}^{l-1,k}_{N(v_j)}, x_j)\|_F\|\hat{V}^{l,k}_j - V^{l,k}_j\|_2 \\
&\qquad\qquad + \|\nabla_{\theta^l} u_{\theta^{l,k}}(\hat{h}^{l-1,k}_j, \hat{m}^{l-1,k}_{N(v_j)}, x_j) - \nabla_{\theta^l} u_{\theta^{l,k}}(h^{l-1,k}_j, m^{l-1,k}_{N(v_j)}, x_j)\|_F\|V^{l,k}_j\|_2\big\} \\
&\le |\mathcal{V}|G\|\hat{V}^{l,k} - V^{l,k}\|_F + |\mathcal{V}|G\gamma\|\hat{H}^{l,k} - H^{l,k}\|_F.
\end{aligned}$$

Lemma 9. For an $L$-layer ConvGNN, suppose that Assumption 1 holds. For any $N \in \mathbb{N}^*$, by letting $\eta \le \frac{1}{(2\gamma)^L G}\frac{1}{N^{2/3}} = O(\frac{1}{N^{2/3}})$ and $\beta_i \le \frac{1}{2G}\frac{1}{N^{2/3}} = O(\frac{1}{N^{2/3}})$, $i \in [n]$, there exist $G_{2,*} > 0$ and $\rho \in (0,1)$ such that for any $k \in \mathbb{N}^*$ we have
$$\mathbb{E}[\|\Delta^k_w\|_2^2] = (\mathrm{Bias}(\widetilde{g}_w(w^k)))^2 + \mathrm{Var}(g_w(w^k)), \qquad \mathbb{E}[\|\Delta^k_{\theta^l}\|_2^2] = (\mathrm{Bias}(\widetilde{g}_{\theta^l}(\theta^{l,k})))^2 + \mathrm{Var}(g_{\theta^l}(\theta^{l,k})),$$
with the bias terms bounded by
$$\mathrm{Bias}(\widetilde{g}_w(w^k)) \le G_{2,*}\Big(\eta^{\frac12} + \rho^{\frac{k-1}{2}} + \frac{1}{N^{1/3}}\Big), \qquad \mathrm{Bias}(\widetilde{g}_{\theta^l}(\theta^{l,k})) \le G_{2,*}\Big(\eta^{\frac12} + \rho^{\frac{k-1}{2}} + \frac{1}{N^{1/3}}\Big),$$
where $\mathrm{Var}(g_w(w^k)) = \mathbb{E}[\|g_w(w^k) - \nabla_w\mathcal{L}(w^k)\|_2^2]$, $\mathrm{Bias}(\widetilde{g}_w(w^k)) = \mathbb{E}[\|\widetilde{g}_w(w^k) - g_w(w^k)\|_2^2]^{\frac12}$, $\mathrm{Var}(g_{\theta^l}(\theta^{l,k})) = \mathbb{E}[\|g_{\theta^l}(\theta^{l,k}) - \nabla_{\theta^l}\mathcal{L}(\theta^{l,k})\|_2^2]$, and $\mathrm{Bias}(\widetilde{g}_{\theta^l}(\theta^{l,k})) = \mathbb{E}[\|\widetilde{g}_{\theta^l}(\theta^{l,k}) - g_{\theta^l}(\theta^{l,k})\|_2^2]^{\frac12}$.

Proof.
We can decompose $\|\Delta^k_w\|_2^2$ as
$$\begin{aligned}
\|\Delta^k_w\|_2^2 &= \|\widetilde{g}_w(w^k) - \nabla_w\mathcal{L}(w^k)\|_2^2 = \|\widetilde{g}_w(w^k) - g_w(w^k) + g_w(w^k) - \nabla_w\mathcal{L}(w^k)\|_2^2 \\
&= \|\widetilde{g}_w(w^k) - g_w(w^k)\|_2^2 + \|g_w(w^k) - \nabla_w\mathcal{L}(w^k)\|_2^2 + 2\langle \widetilde{g}_w(w^k) - g_w(w^k),\ g_w(w^k) - \nabla_w\mathcal{L}(w^k)\rangle.
\end{aligned}$$
Taking the expectation of both sides leads to
$$\mathbb{E}[\|\Delta^k_w\|_2^2] = (\mathrm{Bias}(\widetilde{g}_w(w^k)))^2 + \mathrm{Var}(g_w(w^k)),$$
since $\mathbb{E}[\langle \widetilde{g}_w(w^k) - g_w(w^k), g_w(w^k) - \nabla_w\mathcal{L}(w^k)\rangle] = 0$. By Lemma 7 and Lemma 5, we can bound the bias term as
$$\mathrm{Bias}(\widetilde{g}_w(w^k)) = \mathbb{E}[\|\widetilde{g}_w(w^k) - g_w(w^k)\|_2^2]^{\frac12} \le \gamma\,\mathbb{E}[\|\hat{H}^{L,k} - H^{L,k}\|_F^2]^{\frac12} = \gamma\, d^{L,k}_h \le \gamma\Big(\sqrt{C'_{*,1}}\,\eta^{\frac12} + \sqrt{C'_{*,2}}\,\rho^{\frac{k-1}{2}} + \sqrt{C'_{*,3}}\,\frac{1}{N^{1/3}}\Big) \le G_{2,1}\Big(\eta^{\frac12} + \rho^{\frac{k-1}{2}} + \frac{1}{N^{1/3}}\Big),$$
where $G_{2,1} = \gamma\max\{\sqrt{C'_{*,1}}, \sqrt{C'_{*,2}}, \sqrt{C'_{*,3}}\}$. Similarly, we can decompose $\mathbb{E}[\|\Delta^k_{\theta^l}\|_2^2]$ as
$$\mathbb{E}[\|\Delta^k_{\theta^l}\|_2^2] = (\mathrm{Bias}(\widetilde{g}_{\theta^l}(\theta^{l,k})))^2 + \mathrm{Var}(g_{\theta^l}(\theta^{l,k})),$$
with $\mathrm{Bias}(\widetilde{g}_{\theta^l}(\theta^{l,k})) = \mathbb{E}[\|\widetilde{g}_{\theta^l}(\theta^{l,k}) - g_{\theta^l}(\theta^{l,k})\|_2^2]^{\frac12}$ and $\mathrm{Var}(g_{\theta^l}(\theta^{l,k})) = \mathbb{E}[\|g_{\theta^l}(\theta^{l,k}) - \nabla_{\theta^l}\mathcal{L}(\theta^{l,k})\|_2^2]$. By Lemma 8, we can bound the bias term as
$$\begin{aligned}
\mathrm{Bias}(\widetilde{g}_{\theta^l}(\theta^{l,k})) &= \mathbb{E}[\|\widetilde{g}_{\theta^l}(\theta^{l,k}) - g_{\theta^l}(\theta^{l,k})\|_2^2]^{\frac12} \le \big(2|\mathcal{V}|^2 G^2\,\mathbb{E}[\|\hat{V}^{l,k} - V^{l,k}\|_F^2] + 2|\mathcal{V}|^2 G^2\gamma^2\,\mathbb{E}[\|\hat{H}^{l,k} - H^{l,k}\|_F^2]\big)^{\frac12} \\
&\le \sqrt{2}|\mathcal{V}|G\, d^{l,k}_v + \sqrt{2}|\mathcal{V}|G\gamma\, d^{l,k}_h \le G_{2,2}\Big(\eta^{\frac12} + \rho^{\frac{k-1}{2}} + \frac{1}{N^{1/3}}\Big),
\end{aligned}$$
where $G_{2,2} = \sqrt{2}|\mathcal{V}|G(1+\gamma)\max\{\sqrt{C'_{*,1}}, \sqrt{C'_{*,2}}, \sqrt{C'_{*,3}}\}$. Let $G_{2,*} = \max\{G_{2,1}, G_{2,2}\}$; then we have
$$\mathrm{Bias}(\widetilde{g}_w(w^k)) \le G_{2,*}\Big(\eta^{\frac12} + \rho^{\frac{k-1}{2}} + \frac{1}{N^{1/3}}\Big), \qquad \mathrm{Bias}(\widetilde{g}_{\theta^l}(\theta^{l,k})) \le G_{2,*}\Big(\eta^{\frac12} + \rho^{\frac{k-1}{2}} + \frac{1}{N^{1/3}}\Big).$$

By letting $\varepsilon = \frac{1}{N^{1/3}}$ and $C = 2G_{2,*}$, we have
$$\mathrm{Bias}(\widetilde{g}_w(w^k)) \le C\varepsilon + C\rho^{\frac{k-1}{2}}, \qquad \mathrm{Bias}(\widetilde{g}_{\theta^l}(\theta^{l,k})) \le C\varepsilon + C\rho^{\frac{k-1}{2}},$$
which leads to
$$\mathbb{E}[\|\Delta^k_w\|_2] \le \mathbb{E}[\|\Delta^k_w\|_2^2]^{\frac12} \le \big((\mathrm{Bias}(\widetilde{g}_w(w^k)))^2 + \mathrm{Var}(g_w(w^k))\big)^{\frac12} \le \mathrm{Bias}(\widetilde{g}_w(w^k)) + \mathrm{Var}(g_w(w^k))^{\frac12} \le C\varepsilon + C\rho^{\frac{k-1}{2}} + \mathrm{Var}(g_w(w^k))^{\frac12}$$
and
$$\mathbb{E}[\|\Delta^k_{\theta^l}\|_2] \le \mathbb{E}[\|\Delta^k_{\theta^l}\|_2^2]^{\frac12} \le \big((\mathrm{Bias}(\widetilde{g}_{\theta^l}(\theta^{l,k})))^2 + \mathrm{Var}(g_{\theta^l}(\theta^{l,k}))\big)^{\frac12} \le C\varepsilon + C\rho^{\frac{k-1}{2}} + \mathrm{Var}(g_{\theta^l}(\theta^{l,k}))^{\frac12}.$$
Theorems 2 and 4 follow immediately.

D.5 PROOF OF THEOREM 3: CONVERGENCE GUARANTEES

In this subsection, we give the convergence guarantees of LMC. We first give sufficient conditions for convergence.

Lemma 10. Suppose that the function $f: \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable. Consider an optimization algorithm with any bounded initialization $x^1$ and an update rule of the form $x^{k+1} = x^k - \eta d(x^k)$, where $\eta > 0$ is the learning rate and $d(x^k)$ is the estimated gradient, which can be seen as a stochastic vector depending on $x^k$. Let the estimation error of the gradient be $\Delta^k = d(x^k) - \nabla f(x^k)$. Suppose that
1. the optimal value $f^* = \inf_x f(x)$ is bounded;
2. the gradient of $f$ is $\gamma$-Lipschitz, i.e., $\|\nabla f(y) - \nabla f(x)\|_2 \le \gamma\|y - x\|_2$, $\forall\, x, y \in \mathbb{R}^n$;
3. there exists $G_0 > 0$ that does not depend on $\eta$ such that $\mathbb{E}[\|\Delta^k\|_2^2] \le G_0$, $\forall\, k \in \mathbb{N}^*$;
4. there exist $N \in \mathbb{N}^*$ and $\rho \in (0,1)$ that do not depend on $\eta$ such that $|\mathbb{E}[\langle \nabla f(x^k), \Delta^k\rangle]| \le G_0(\eta^{\frac12} + \rho^{\frac{k-1}{2}} + \frac{1}{N^{1/3}})$, $\forall\, k \in \mathbb{N}^*$, where $G_0$ is the same constant as in Condition 3.

Then, by letting $\eta = \min\{\frac{1}{\gamma}, \frac{1}{N^{2/3}}\}$, we have
$$\mathbb{E}[\|\nabla f(x^R)\|_2^2] \le \frac{2(f(x^1) - f^* + G_0)}{N^{1/3}} + \frac{\gamma G_0}{N^{2/3}} + \frac{G_0}{N(1 - \sqrt{\rho})} = O\Big(\frac{1}{N^{1/3}}\Big),$$
where $R$ is chosen uniformly from $[N]$.

Proof. As the gradient of $f$ is $\gamma$-Lipschitz, we have $f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{\gamma}{2}\|y - x\|_2^2$ for all $x, y \in \mathbb{R}^n$. Then, we have
$$\begin{aligned}
f(x^{k+1}) &\le f(x^k) + \langle \nabla f(x^k), x^{k+1} - x^k\rangle + \frac{\gamma}{2}\|x^{k+1} - x^k\|_2^2 \\
&= f(x^k) - \eta\langle \nabla f(x^k), d(x^k)\rangle + \frac{\eta^2\gamma}{2}\|d(x^k)\|_2^2 \\
&= f(x^k) - \eta\langle \nabla f(x^k), \Delta^k\rangle - \eta\|\nabla f(x^k)\|_2^2 + \frac{\eta^2\gamma}{2}\big(\|\Delta^k\|_2^2 + \|\nabla f(x^k)\|_2^2 + 2\langle \Delta^k, \nabla f(x^k)\rangle\big) \\
&= f(x^k) - \eta(1 - \eta\gamma)\langle \nabla f(x^k), \Delta^k\rangle - \eta\Big(1 - \frac{\eta\gamma}{2}\Big)\|\nabla f(x^k)\|_2^2 + \frac{\eta^2\gamma}{2}\|\Delta^k\|_2^2.
\end{aligned}$$
Taking the expectation of both sides yields
$$\mathbb{E}[f(x^{k+1})] \le \mathbb{E}[f(x^k)] - \eta(1 - \eta\gamma)\,\mathbb{E}[\langle \nabla f(x^k), \Delta^k\rangle] - \eta\Big(1 - \frac{\eta\gamma}{2}\Big)\mathbb{E}[\|\nabla f(x^k)\|_2^2] + \frac{\eta^2\gamma}{2}\mathbb{E}[\|\Delta^k\|_2^2].$$
By summing up the above inequalities over $k \in [N]$ and dividing both sides by $N\eta(1 - \frac{\eta\gamma}{2})$, we have
$$\begin{aligned}
\frac{\sum_{k=1}^N \mathbb{E}[\|\nabla f(x^k)\|_2^2]}{N} &\le \frac{f(x^1) - \mathbb{E}[f(x^{N+1})]}{N\eta(1 - \frac{\eta\gamma}{2})} + \frac{\eta\gamma}{2 - \eta\gamma}\frac{\sum_{k=1}^N \mathbb{E}[\|\Delta^k\|_2^2]}{N} - \frac{1 - \eta\gamma}{1 - \frac{\eta\gamma}{2}}\frac{\sum_{k=1}^N \mathbb{E}[\langle \nabla f(x^k), \Delta^k\rangle]}{N} \\
&\le \frac{f(x^1) - f^*}{N\eta(1 - \frac{\eta\gamma}{2})} + \frac{\eta\gamma}{2 - \eta\gamma}\frac{\sum_{k=1}^N \mathbb{E}[\|\Delta^k\|_2^2]}{N} + \frac{\sum_{k=1}^N |\mathbb{E}[\langle \nabla f(x^k), \Delta^k\rangle]|}{N},
\end{aligned}$$
where the second inequality comes from $\eta\gamma > 0$ and $f(x^k) \ge f^*$. According to the conditions above, we have
$$\begin{aligned}
\frac{\sum_{k=1}^N \mathbb{E}[\|\nabla f(x^k)\|_2^2]}{N} &\le \frac{f(x^1) - f^*}{N\eta(1 - \frac{\eta\gamma}{2})} + \frac{\eta\gamma}{2 - \eta\gamma}G_0 + G_0\frac{\sum_{k=1}^N (\eta^{\frac12} + \rho^{\frac{k-1}{2}})}{N} + \frac{G_0}{N^{1/3}} \\
&\le \frac{f(x^1) - f^*}{N\eta(1 - \frac{\eta\gamma}{2})} + \frac{\eta\gamma}{2 - \eta\gamma}G_0 + \eta^{\frac12}G_0 + \frac{G_0}{N}\sum_{k=1}^\infty \rho^{\frac{k-1}{2}} + \frac{G_0}{N^{1/3}} \\
&= \frac{f(x^1) - f^*}{N\eta(1 - \frac{\eta\gamma}{2})} + \frac{\eta\gamma}{2 - \eta\gamma}G_0 + \eta^{\frac12}G_0 + \frac{G_0}{N(1 - \sqrt{\rho})} + \frac{G_0}{N^{1/3}}.
\end{aligned}$$
Noticing that
$$\mathbb{E}[\|\nabla f(x^R)\|_2^2] = \mathbb{E}_R[\mathbb{E}[\|\nabla f(x^R)\|_2^2 \mid R]] = \frac{\sum_{k=1}^N \mathbb{E}[\|\nabla f(x^k)\|_2^2]}{N},$$
where $R$ is uniformly chosen from $[N]$, we have
$$\mathbb{E}[\|\nabla f(x^R)\|_2^2] \le \frac{f(x^1) - f^*}{N\eta(1 - \frac{\eta\gamma}{2})} + \frac{\eta\gamma}{2 - \eta\gamma}G_0 + \eta^{\frac12}G_0 + \frac{G_0}{N(1 - \sqrt{\rho})} + \frac{G_0}{N^{1/3}}.$$
By letting $\eta = \min\{\frac{1}{\gamma}, \frac{1}{N^{2/3}}\}$, we have
$$\mathbb{E}[\|\nabla f(x^R)\|_2^2] \le \frac{2(f(x^1) - f^*)}{N^{1/3}} + \frac{\gamma G_0}{N^{2/3}} + \frac{G_0}{N^{1/3}} + \frac{G_0}{N(1 - \sqrt{\rho})} + \frac{G_0}{N^{1/3}} \le \frac{2(f(x^1) - f^* + G_0)}{N^{1/3}} + \frac{\gamma G_0}{N^{2/3}} + \frac{G_0}{N(1 - \sqrt{\rho})} = O\Big(\frac{1}{N^{1/3}}\Big).$$

We then verify that the mini-batch gradients of LMC satisfy the conditions in Lemma 10 by letting $\eta \le \frac{1}{(2\gamma)^L G}\frac{1}{N^{2/3}} = O(\frac{1}{N^{2/3}})$ and $\beta_i \le \frac{1}{2G}\frac{1}{N^{2/3}} = O(\frac{1}{N^{2/3}})$, $i \in [n]$.

Proof. By the bias bounds in the proof of Lemma 9, there exists $G_{2,*}$ such that for any $k \in \mathbb{N}^*$ we have
$$\mathbb{E}[\|\widetilde{g}_w(w^k) - g_w(w^k)\|_2] \le \mathbb{E}[\|\widetilde{g}_w(w^k) - g_w(w^k)\|_2^2]^{\frac12} \le G_{2,*}\Big(\eta^{\frac12} + \rho^{\frac{k-1}{2}} + \frac{1}{N^{1/3}}\Big)$$
and
$$\mathbb{E}[\|\widetilde{g}_{\theta^l}(\theta^{l,k}) - g_{\theta^l}(\theta^{l,k})\|_2] \le \mathbb{E}[\|\widetilde{g}_{\theta^l}(\theta^{l,k}) - g_{\theta^l}(\theta^{l,k})\|_2^2]^{\frac12} \le G_{2,*}\Big(\eta^{\frac12} + \rho^{\frac{k-1}{2}} + \frac{1}{N^{1/3}}\Big),$$
where $\rho = \frac{n-S}{n} < 1$ is a constant.
Hence
$$\begin{aligned}
|\mathbb{E}[\langle \nabla_w \mathcal{L}, \Delta^k_w\rangle]| &= |\mathbb{E}[\langle \nabla_w \mathcal{L}, \widetilde{g}_w(w^k) - \nabla_w\mathcal{L}(w^k)\rangle]| = |\mathbb{E}[\langle \nabla_w \mathcal{L}, \widetilde{g}_w(w^k) - g_w(w^k)\rangle]| \\
&\le \mathbb{E}[\|\nabla_w \mathcal{L}\|_2\|\widetilde{g}_w(w^k) - g_w(w^k)\|_2] \le G\,\mathbb{E}[\|\widetilde{g}_w(w^k) - g_w(w^k)\|_2] \le G_2\Big(\eta^{\frac12} + \rho^{\frac{k-1}{2}} + \frac{1}{N^{1/3}}\Big)
\end{aligned}$$
and
$$\begin{aligned}
|\mathbb{E}[\langle \nabla_{\theta^l} \mathcal{L}, \Delta^k_{\theta^l}\rangle]| &= |\mathbb{E}[\langle \nabla_{\theta^l} \mathcal{L}, \widetilde{g}_{\theta^l}(\theta^{l,k}) - \nabla_{\theta^l}\mathcal{L}(\theta^{l,k})\rangle]| = |\mathbb{E}[\langle \nabla_{\theta^l} \mathcal{L}, \widetilde{g}_{\theta^l}(\theta^{l,k}) - g_{\theta^l}(\theta^{l,k})\rangle]| \\
&\le \mathbb{E}[\|\nabla_{\theta^l}\mathcal{L}\|_2\|\widetilde{g}_{\theta^l}(\theta^{l,k}) - g_{\theta^l}(\theta^{l,k})\|_2] \le G\,\mathbb{E}[\|\widetilde{g}_{\theta^l}(\theta^{l,k}) - g_{\theta^l}(\theta^{l,k})\|_2] \le G_2\Big(\eta^{\frac12} + \rho^{\frac{k-1}{2}} + \frac{1}{N^{1/3}}\Big),
\end{aligned}$$
where $G_2 = GG_{2,*}$. According to Lemmas 11 and 12, the conditions in Lemma 10 hold. By letting
$$\varepsilon = \Big(\frac{2(f(x^1) - f^* + G_0)}{N^{1/3}} + \frac{\gamma G_0}{N^{2/3}} + \frac{G_0}{N(1 - \sqrt{\rho})}\Big)^{\frac12} = O\Big(\frac{1}{N^{1/6}}\Big),$$
Theorem 3 follows immediately.
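The arithmetic of the final $O(N^{-1/3})$ bound in Lemma 10 can be sanity-checked numerically: with illustrative constants (all values below are ours, chosen only for demonstration), multiplying the bound by $N^{1/3}$ decreases monotonically to the coefficient $2(f(x^1) - f^* + G_0)$ of the dominant term:

```python
# Final bound of Lemma 10 for growing N, with illustrative constants.
f1, fstar, G0, gamma, rho = 3.0, 0.5, 2.0, 1.5, 0.64

def bound(N):
    return (2 * (f1 - fstar + G0) / N ** (1 / 3)
            + gamma * G0 / N ** (2 / 3)
            + G0 / (N * (1 - rho ** 0.5)))

limit = 2 * (f1 - fstar + G0)   # coefficient of the dominant N^(-1/3) term
scaled = [bound(10 ** p) * (10 ** p) ** (1 / 3) for p in range(2, 10)]

assert all(a > b for a, b in zip(scaled, scaled[1:]))  # monotone decrease
assert abs(scaled[-1] - limit) < 1e-2                  # converges to the coefficient
```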

E MORE EXPERIMENTS

E.1 PERFORMANCE ON SMALL DATASETS

Figure 5 reports the convergence curves of GD, GAS, and LMC for GCN on three small datasets, i.e., Cora, CiteSeer, and PubMed from Planetoid (Yang et al., 2016). LMC is faster than GAS, especially on the CiteSeer and PubMed datasets. Notably, the key bottleneck on the small datasets is graph sampling rather than forward and backward passes. Thus, GD is faster than GAS and LMC, as it avoids graph sampling by directly using the whole graph.



https://github.com/rusty1s/pyg_autoscale. The owner does not mention the license.



Figure 2: Testing accuracy and training loss w.r.t. runtimes (s).

Figure 3: The average relative estimation errors of mini-batch gradients computed by CLUSTER, GAS, and LMC for GCN models.

7.3 LMC IS ROBUST IN TERMS OF BATCH SIZES

Figure 4: The improvement of the compensations on the Ogbn-arxiv dataset.



We denote the value in the $l$-th layer at the $k$-th iteration by $(\cdot)^{l,k}$, but elsewhere we omit the superscript $k$ and denote it by $(\cdot)^l$.

Table 1: Prediction performance on large graph datasets. OOM denotes the out-of-memory issue. Bold font indicates the best result and underline indicates the second best result.

Table 2 reports the number of epochs and the runtime to reach the full-batch accuracy in Table 1, as well as the GPU memory. As shown in Table 2 and Figure 2a, LMC is significantly faster than GAS, with a speed-up of 2x on REDDIT. Notably, the test accuracy of LMC is more stable than that of GAS, and thus the smoothed test accuracy of LMC outperforms GAS in Figure 2b. Although GAS finally resembles the full-batch performance in Table 1 by selecting the best performance on the validation data, it may fail to do so under small batch sizes due to its unstable training process (see Section 7.3). Another appealing feature of LMC is that it shares comparable GPU memory costs with GAS, and thus LMC avoids the neighbor explosion problem. FM is slower than the other methods, as it additionally updates the historical embeddings in the storage for the nodes outside the mini-batches. Please see Appendix E.2 for the comparison in terms of training time per epoch.

Table 2: Efficiency of CLUSTER-GCN, GAS, FM, and LMC.

Performance under different batch sizes on the Ogbn-arxiv dataset.

Table 4: Statistics of the datasets used in our experiments.

ACKNOWLEDGEMENT

The authors would like to thank all the anonymous reviewers for their insightful comments. This work was supported in part by National Natural Science Foundation of China grants U19B2026, U19B2044, 61836011, 62021001, and 61836006, and by the Fundamental Research Funds for the Central Universities grant WK3490000004.


E.2 COMPARISON IN TERMS OF TRAINING TIME PER EPOCH

We evaluate the training time per epoch of CLUSTER, GAS, FM, and LMC in Table 6. Compared with GAS, LMC additionally accesses the historical auxiliary variables. Inspired by GAS (Fey et al., 2021), we use concurrent mini-batch execution to asynchronously access the historical auxiliary variables. Moreover, based on the convergence analysis of LMC, we can sample clusters to construct fixed subgraphs at the preprocessing step (Line 2 in Algorithm 1) rather than sample clusters to construct varying subgraphs at each training step.² This further avoids sampling costs. As a result, the training time per epoch of LMC is comparable with that of GAS. CLUSTER is slower than GAS and LMC, as it prunes edges in forward passes, which introduces an additional normalization operation for the adjacency matrix of the sampled subgraph based on $\mathrm{deg}_{\mathcal{V}_B}(i)$, the degree of node $v_i$ in the sampled subgraph rather than in the whole graph. The normalized adjacency matrix is difficult to store and reuse, as the sampled subgraphs may differ across steps. FM is slower than the other methods, as it additionally updates the historical embeddings in the storage for the nodes outside the mini-batches.

² CLUSTER-GCN proposes to sample clusters to construct varying subgraphs at each training step, and LMC follows it. If a subgraph-wise sampling method prunes an edge at the current step, the GNN may observe the pruned edge at the next step by resampling subgraphs. This prevents the GNN from overfitting to a graph in which some important edges are dropped, as shown in Section 3.2 in (Chiang et al., 2019) (we also observe that GAS achieves accuracies of 71.5% and 71.1% under stochastic subgraph partition and fixed subgraph partition, respectively, on the Ogbn-arxiv dataset).

E.3 COMPARISON IN TERMS OF MEMORY UNDER DIFFERENT BATCH SIZES

In Table 7, we report the GPU memory consumption and the proportion of reserved messages in forward and backward passes, computed from the adjacency matrix used by each subgraph-wise method alg (e.g., CLUSTER, GAS, and LMC), where $\|\cdot\|_0$ denotes the $\ell_0$-norm. As shown in Table 7, LMC makes full use of all sampled nodes in both forward and backward passes, which is the same as full-batch GD.

Table 7: GPU memory consumption (MB) and the proportion of reserved messages (%) in forward and backward passes of GD, CLUSTER, GAS, and LMC for training GCN. Default indicates the default batch size used in the codes and toolkits of GAS (Fey et al., 2021).


As shown in Section A.4, $\beta_i = \mathrm{score}(i)\,\alpha$ in LMC. We report the prediction performance under $\alpha \in \{0.0, 0.2, 0.4, 0.6, 0.8, 1.0\}$ and the candidate score functions in Tables 8 and 9, respectively. When exploring the effect of a specific hyperparameter, we fix the other hyperparameters at their best values. Notably, $\alpha = 0$ implies that LMC directly uses the historical values without alleviating their staleness, which is the same as GAS. Under large batch sizes, LMC achieves the best performance with large $\beta_i = 1$, as large batch sizes improve the quality of the incomplete up-to-date messages. Under small batch sizes, LMC achieves the best performance with small $\beta_i = 0.4\,\mathrm{score}_{2x-x^2}(i)$, as small learning rates alleviate the staleness of the historical values.

F POTENTIAL SOCIETAL IMPACTS

In this paper, we propose a novel and efficient subgraph-wise sampling method for the training of GNNs, namely LMC. This work is promising in many practical and important scenarios, such as search engines, recommendation systems, biological networks, and molecular property prediction.

Published as a conference paper at ICLR 2023

Nonetheless, this work may have some potential risks. For example, using this work in search engines and recommendation systems to over-mine the behavior of users may cause undesirable privacy disclosure.

