ON THE IMPORTANCE OF SAMPLING IN TRAINING GCNS: CONVERGENCE ANALYSIS AND VARIANCE REDUCTION

Anonymous

Abstract

Graph Convolutional Networks (GCNs) have achieved impressive empirical advancement across a wide variety of graph-related applications. Despite their great success, training GCNs on large graphs suffers from computational and memory issues. A potential path to circumvent these obstacles is sampling-based methods, where at each layer a subset of nodes is sampled. Although recent studies have empirically demonstrated the effectiveness of sampling-based methods, these works lack theoretical convergence guarantees under realistic settings and cannot fully leverage the information of evolving parameters during optimization. In this paper, we describe and analyze a general doubly variance reduction schema that can accelerate any sampling method under a memory budget. The motivating impetus for the proposed schema is a careful analysis of the variance of sampling methods, where it is shown that the induced variance can be decomposed into node embedding approximation variance (zeroth-order variance) during forward propagation and layerwise-gradient variance (first-order variance) during backward propagation. We theoretically analyze the convergence of the proposed schema and show that it enjoys an O(1/T) convergence rate. We complement our theoretical results by integrating the proposed schema into different sampling methods and applying them to different large real-world graphs.

Training GCNs via sampling. The full-batch training of a typical GCN, as employed in Kipf & Welling (2016), necessitates keeping the whole graph data and the intermediate representations of all nodes in memory. This is the key bottleneck that hinders the scalability of full-batch GCN training. To overcome this issue, sampling-based GCN training methods (Hamilton et al., 2017;

1. INTRODUCTION

In the past few years, graph convolutional networks (GCNs) have achieved great success in many graph-related applications, such as semi-supervised node classification (Kipf & Welling, 2016), supervised graph classification (Xu et al., 2018), protein interface prediction (Fout et al., 2017), and knowledge graphs (Schlichtkrull et al., 2018; Wang et al., 2017). However, most works on GCNs focus on relatively small graphs, and scaling GCNs to large-scale graphs is not straightforward. Due to the dependency among the nodes in the graph, we need to consider a large receptive field to calculate the representation of each node in the mini-batch, and the receptive field grows exponentially with respect to the number of layers. To alleviate this issue, sampling-based methods, such as node-wise sampling (Hamilton et al., 2017; Ying et al., 2018; Chen et al., 2017), layer-wise sampling (Chen et al., 2018; Zou et al., 2019), and subgraph sampling (Chiang et al., 2019; Zeng et al., 2019), have been proposed for mini-batch GCN training. Although empirical results show that sampling-based methods can scale GCN training to large graphs, these methods suffer from a few key issues. First, the theoretical understanding of sampling-based methods is still lacking. Second, the aforementioned sampling strategies are based only on the structure of the graph. Although recent works (Huang et al., 2018; Cong et al., 2020) propose to utilize adaptive importance sampling strategies to constantly re-evaluate the relative importance of nodes during training (e.g., based on the current gradient or representation of nodes), finding the optimal adaptive sampling distribution is computationally prohibitive, as it requires calculating the full gradient or node representations in each iteration. This necessitates developing alternative solutions that can be computed efficiently and that come with theoretical guarantees.
In this paper, we develop a novel variance reduction schema that can be applied to any sampling strategy to significantly reduce the induced variance. The key idea is to use the historical node embeddings and the historical layerwise gradient of each graph convolution layer as control variates. The main motivation behind the proposed schema stems from our theoretical analysis of the variance of sampling methods in training GCNs. Specifically, we show that due to the composite structure of the training objective, any sampling strategy introduces two types of variance in estimating the stochastic gradients: node embedding approximation variance (zeroth-order variance), which results from embedding approximation during forward propagation, and layerwise-gradient variance (first-order variance), which results from gradient estimation during backward propagation.

Figure 1: The effect of doubly variance reduction on training loss, validation loss, and mean-square error (MSE) of the gradient on the Flickr dataset, using LADIES proposed in Zou et al. (2019).

In Figure 1, we exhibit the performance of the proposed schema when applied to the sampling strategy introduced in Zou et al. (2019). The plots show that applying our proposal leads to a significant reduction in variance, and hence a faster convergence rate and better test accuracy. We can also see that the zeroth-order and first-order reductions are equally important and yield the largest improvement when applied jointly (i.e., doubly variance reduction).

Contributions. We summarize the contributions of this paper as follows:

• We provide a theoretical analysis for sampling-based GCN training (SGCN) with a non-asymptotic convergence rate. We show that due to the node embedding approximation variance, SGCNs suffer from a residual error that hinders their convergence.

• We mathematically show that the aforementioned residual error can be resolved by applying zeroth-order variance reduction to the node embedding approximation (dubbed SGCN+), which explains why VRGCN (Chen et al., 2017) enjoys better convergence than GraphSAGE (Hamilton et al., 2017), even with fewer sampled neighbors.
• We extend the algorithm from node embedding approximation to stochastic gradient approximation, and propose a generic and efficient doubly variance reduction schema (SGCN++). SGCN++ can be integrated with different sampling-based methods to significantly reduce both the zeroth- and first-order variance, resulting in a faster convergence rate and better generalization.

• We theoretically analyze the convergence of SGCN++ and obtain an O(1/T) rate, which significantly improves the best known bound of O(1/√T). We empirically verify SGCN++ through various experiments on several real-world datasets and different sampling methods, where it demonstrates significant improvements over the original sampling methods.

2. RELATED WORK

LADIES (Zou et al., 2019) further restricts the candidate nodes to the union of the neighborhoods of the nodes sampled in the upper layer. However, significant overhead may be incurred due to the expensive sampling algorithm. In addition, subgraph sampling methods such as GraphSAINT (Zeng et al., 2019) construct mini-batches by importance sampling and apply normalization techniques to eliminate bias and reduce variance. However, the sampled subgraphs are usually sparse and require a large sampling size to guarantee performance.

Theoretical analysis. Despite much algorithmic progress over the years, the theoretical understanding of the convergence of SGCN training methods is still limited. VRGCN provides a convergence analysis under the strong assumption that the stochastic gradient due to sampling is unbiased, and achieves a convergence rate of O(1/√T). However, the convergence analysis is limited to VRGCN, and the assumption does not hold due to the composite structure of the training objective, as will be elaborated.
Chen & Luss (2018) provide another convergence analysis for FastGCN under the strong assumption that the stochastic gradient of GCN converges to the consistent gradient exponentially fast with respect to the sample size, which yields the same convergence rate as unbiased methods, i.e., O(1/√T). Most recently, Sato et al. (2020) provide PAC learning-style bounds on the node embedding and gradient estimation for SGCN training. Another direction of theoretical research focuses on analyzing the expressive power of GCNs (Garg et al., 2020; Chen et al., 2019; Zhang et al., 2020), which is not the focus of this paper and is omitted for brevity.

Connection to composite optimization. The proposed doubly variance reduction algorithm shares the same spirit as the variance-reduced composite optimization problems considered in Zhang & Xiao (2019a); Hu et al. (2020); Tran-Dinh et al. (2020); Zhang & Xiao (2019c;b), but with two main differences. First, the objective function is different. In composite optimization, only the innermost composite layer has trainable parameters, and the output of the lower-level function acts as the parameter of the higher-level function. In the GCN model, however, the output of one graph convolutional layer is the input node embedding matrix of the next layer. As a result, when analyzing the convergence of the variance-reduced neural network model, we have to explicitly handle both the evolving node embedding matrices and the trainable parameters at all layers. Second, the data points in these works are sampled independently, whereas the data points (nodes) in SGCN are sampled in a node- or layer-dependent manner according to the graph structure. In our analysis, we provide a sampled-graph-structure-dependent convergence rate by relating the convergence rate of GCN training to the graph Laplacian matrices.

3. SGCN: A TIGHT ANALYSIS OF SGD FOR GCN TRAINING

Full-batch GCN training. We begin by introducing the basic mathematical formulation of training GCNs. In this paper, we consider training GCNs in the semi-supervised multi-class classification setting. Given an undirected graph G = (V, E) with N = |V| nodes and |E| edges and adjacency matrix A ∈ {0, 1}^{N×N}, we assume that each node is associated with a feature vector x_i ∈ R^d and a label y_i. We use X = [x_1, ..., x_N]^⊤ ∈ R^{N×d} and y = [y_1, ..., y_N] ∈ R^N to denote the node feature matrix and label vector, respectively. The Laplacian matrix is calculated as L = D^{-1/2} A D^{-1/2} or L = D^{-1} A, where D ∈ R^{N×N} is the degree matrix. We use θ = {W^{(1)}, ..., W^{(L)}} to denote the stacked weight parameters of an L-layer GCN. The training of a full-batch GCN (FullGCN), cast as an empirical risk minimization problem, aims at minimizing the loss L(θ) over all training data:

L(θ) = (1/N) Σ_{i=1}^{N} Loss(h_i^{(L)}, y_i),  H^{(L)} = σ(L ⋯ σ(L σ(L X W^{(1)}) W^{(2)}) ⋯ W^{(L)}),

where h_i^{(ℓ)} is the ith row of the node embedding matrix H^{(ℓ)} = σ(Z^{(ℓ)}), Z^{(ℓ)} = L H^{(ℓ-1)} W^{(ℓ)} (in particular, Z^{(1)} = L X W^{(1)}), corresponding to the embedding of the ith node at the ℓth layer (hop); Loss(·,·) is the loss function (e.g., cross-entropy loss) that measures the discrepancy between the prediction of the GCN and the ground-truth label; and σ(·) is the activation function (e.g., the ReLU function).

Sampling-based GCN training. When the graph is large, the computational complexity of forward and backward propagation can be very high. One practical solution to alleviate this issue is to sample a subset of nodes and construct a sparser normalized Laplacian matrix L̃^{(ℓ)} for each layer, with supp(L̃^{(ℓ)}) ⊆ supp(L), and perform forward and backward propagation based only on the sampled Laplacian matrices. The sparse Laplacian matrix construction algorithms can be roughly classified into nodewise sampling, layerwise sampling, and subgraph sampling. A detailed discussion of different sampling strategies can be found in Appendix D.
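To make the formulation concrete, a minimal full-batch forward pass can be sketched in NumPy as follows. This is an illustrative sketch only: the toy graph, feature dimensions, and weight shapes are made up, and self-loops (often added in practice) are omitted.

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalization L = D^{-1/2} A D^{-1/2} (sketch; self-loops omitted)."""
    d = np.maximum(A.sum(axis=1), 1e-12)      # degree of each node
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def full_batch_forward(L, X, weights, sigma=lambda z: np.maximum(z, 0.0)):
    """Compute H^{(l)} = sigma(L H^{(l-1)} W^{(l)}) layer by layer, with H^{(0)} = X."""
    H = X
    for W in weights:
        H = sigma(L @ H @ W)
    return H

# Toy 4-node undirected graph and a 2-layer GCN with hypothetical dimensions.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
weights = [rng.normal(size=(3, 8)), rng.normal(size=(8, 2))]
H_L = full_batch_forward(normalized_laplacian(A), X, weights)  # final embeddings, N x 2
```

Note that every layer multiplies by the full L, which is exactly the memory bottleneck motivating the sampling-based methods below.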
To train a GCN with Stochastic Gradient Descent (SGD) (Bottou et al., 2018) (SGCN), we sample a mini-batch of nodes V_B ⊆ V with size B = |V_B|, construct the set of sparser Laplacian matrices {L̃^{(ℓ)}}_{ℓ=1}^{L} based on the nodes sampled at each layer, and compute the stochastic gradient to update the parameters:

∇̃L(θ) = (1/B) Σ_{i∈V_B} ∇Loss(h̃_i^{(L)}, y_i),  H̃^{(L)} = σ(L̃^{(L)} ⋯ σ(L̃^{(2)} σ(L̃^{(1)} X W^{(1)}) W^{(2)}) ⋯ W^{(L)}).

Key challenges. Compared to vanilla SGD, the key challenge in the theoretical understanding of SGCN training is the biasedness of the stochastic gradient due to the sampling of nodes at the inner layers. Denote FullGCN's full-batch gradient by ∇L(θ) = {G^{(ℓ)} = ∂L(θ)/∂W^{(ℓ)}}_{ℓ=1}^{L} and SGCN's stochastic gradient by ∇̃L(θ) = {G̃^{(ℓ)} = ∂L̃(θ)/∂W^{(ℓ)}}_{ℓ=1}^{L}. By the chain rule, the full-batch gradient G_t^{(ℓ)} w.r.t. the ℓth-layer weight matrix W^{(ℓ)} is

G_t^{(ℓ)} = [L H_t^{(ℓ-1)}]^⊤ (D_t^{(ℓ+1)} ∘ σ′(Z_t^{(ℓ)})),  D_t^{(ℓ)} = L^⊤ (D_t^{(ℓ+1)} ∘ σ′(Z_t^{(ℓ)})) [W_t^{(ℓ)}]^⊤,  D_t^{(L+1)} = ∂L(θ_t)/∂H^{(L)},  (1)

and the stochastic gradient G̃_t^{(ℓ)} utilized in SGCN for the ℓth layer w.r.t. W^{(ℓ)} is computed analogously with the sampled Laplacian matrices, as given in Eq. (2). VRGCN (Chen et al., 2017) established a convergence rate under the strong assumption that the stochastic gradient of SGCN is unbiased, and Chen & Luss (2018) provided another analysis under the strong assumption that the stochastic gradient converges to the consistent gradient exponentially fast as the number of sampled nodes increases. While both studies establish the same convergence rate of O(1/√T), these assumptions do not hold in reality due to the composite structure of the training objective and the sampling of nodes at the inner layers. Motivated by this, we aim at providing a tight analysis without the aforementioned strong assumptions on the stochastic gradient. Our analysis is inspired by the bias and variance decomposition of the mean-square error of the stochastic gradient, which has been previously used in Cong et al. (2020) to analyze the stochastic gradient in GCNs.
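The sampled forward pass can be sketched as follows. The uniform row-sampling scheme, the rescaling by the inverse keep probability, and all dimensions are illustrative assumptions, not the specific samplers (GraphSAGE, LADIES, GraphSAINT) discussed in the paper; the point is only that each layer uses its own sparser matrix L̃^{(ℓ)}.

```python
import numpy as np

def sampled_laplacian(L, kept_rows, keep_prob):
    """A sketch of a sparser Laplacian L~: keep only the rows of the sampled
    nodes and rescale by 1/keep_prob so that E[L~] = L under uniform sampling."""
    L_tilde = np.zeros_like(L)
    L_tilde[kept_rows] = L[kept_rows] / keep_prob
    return L_tilde

def sampled_forward(L_tildes, X, weights, sigma=lambda z: np.maximum(z, 0.0)):
    """H~^{(l)} = sigma(L~^{(l)} H~^{(l-1)} W^{(l)}), one sampled Laplacian per layer."""
    H = X
    for L_tilde, W in zip(L_tildes, weights):
        H = sigma(L_tilde @ H @ W)
    return H

rng = np.random.default_rng(1)
N, d = 6, 3
A = np.triu((rng.random((N, N)) < 0.5).astype(float), 1)
A = A + A.T                                   # undirected toy graph, no self-loops
deg = np.maximum(A.sum(axis=1), 1e-12)
L = A / np.sqrt(deg)[:, None] / np.sqrt(deg)[None, :]
X = rng.normal(size=(N, d))
weights = [rng.normal(size=(d, 4)), rng.normal(size=(4, 2))]

keep_prob = 0.5                               # sample half of the nodes per layer
L_tildes = [sampled_laplacian(L, rng.choice(N, N // 2, replace=False), keep_prob)
            for _ in weights]
H_tilde = sampled_forward(L_tildes, X, weights)
```

Even though each L̃^{(ℓ)} is unbiased here, the output H̃^{(L)} is a biased estimate of H^{(L)} because the sampled matrices pass through the nonlinearity σ, which is precisely the composite-structure bias analyzed below.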
Formally, we can decompose the mean-square error of the stochastic gradient, E[‖∇̃L(θ) − ∇L(θ)‖_F^2], into a bias term and a variance term. The stochastic counterpart of Eq. (1) is

G̃_t^{(ℓ)} = [L̃^{(ℓ)} H̃_t^{(ℓ-1)}]^⊤ (D̃_t^{(ℓ+1)} ∘ σ′(Z̃_t^{(ℓ)})),  D̃_t^{(ℓ)} = [L̃^{(ℓ)}]^⊤ (D̃_t^{(ℓ+1)} ∘ σ′(Z̃_t^{(ℓ)})) [W_t^{(ℓ)}]^⊤,  D̃_t^{(L+1)} = ∂L̃(θ_t)/∂H̃^{(L)}.  (2)

Proposition 1. For any ℓ ∈ [L], there exist constants B_H and B_D such that the norms of the node embedding matrices and of the gradients with respect to the input node embedding matrices satisfy

‖H^{(ℓ)}‖_F ≤ B_H, ‖H̃^{(ℓ)}‖_F ≤ B_H, ‖∂σ(L H^{(ℓ-1)} W^{(ℓ)})/∂H^{(ℓ-1)}‖_F ≤ B_D, and ‖∂σ(L̃^{(ℓ)} H̃^{(ℓ-1)} W^{(ℓ)})/∂H̃^{(ℓ-1)}‖_F ≤ B_D.

Before presenting the convergence of SGCN, we introduce the notion of propagation matrices {P^{(ℓ)}}_{ℓ=1}^{L}, defined as the column-wise expectation of the sparser Laplacian matrices. Note that this notion is only used for presenting the theoretical results and does not appear in the practical training algorithms. With it, we can decompose the difference between L̃^{(ℓ)} and L into the column-wise difference ‖L̃^{(ℓ)} − P^{(ℓ)}‖_F^2 and the row-wise difference ‖P^{(ℓ)} − L‖_F^2. In the following theorem, we show that the upper bounds on the bias and variance of the stochastic gradient are closely related to the expected column-wise difference E[‖L̃^{(ℓ)} − P^{(ℓ)}‖_F^2] and row-wise difference E[‖P^{(ℓ)} − L‖_F^2], which can significantly impact the convergence of SGCN.

Theorem 1 (Convergence of SGCN). Suppose Assumptions 1, 2, 3 hold and apply SGCN with learning rate η = min{1/L_F, 1/√T}, where L_F is the smoothness constant. Let Δ_n and Δ_b denote the upper bounds on the variance and bias of the stochastic gradients:

Δ_n = Σ_{ℓ=1}^{L} O(E[‖L̃^{(ℓ)} − P^{(ℓ)}‖_F^2]) + O(E[‖P^{(ℓ)} − L‖_F^2]),  Δ_b = Σ_{ℓ=1}^{L} O(E[‖P^{(ℓ)} − L‖_F^2]).

Then, the output of SGCN satisfies

min_{t∈[T]} E[‖∇L(θ_t)‖_F^2] ≤ 2(L(θ_1) − L(θ*))/√T + L_F Δ_n/√T + Δ_b.  (5)

The exact values of the key parameters L_F, Δ_n, and Δ_b are computed in Lemma 1, Lemma 2, and Lemma 3, respectively, and can be found in Appendix G.
Theorem 1 implies that after T iterations the gradient norm of SGCN is at most O(Δ_n/√T) + Δ_b, i.e., SGCN suffers from a constant residual error Δ_b that does not decrease as the number of iterations T grows. Without the bias term, we recover the convergence of vanilla SGD. Of course, this type of convergence result is only useful if Δ_b and Δ_n are small enough. We note that existing SGCN algorithms propose to reduce Δ_b by increasing the number of neighbors sampled at each layer (e.g., GraphSAGE) or by applying importance sampling (e.g., FastGCN, LADIES, and GraphSAINT).
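The bias-variance split behind Δ_n and Δ_b can be verified exactly on a toy discrete estimator; all values and probabilities below are made up for illustration. The identity is E‖g̃ − g‖² = ‖E[g̃] − g‖² + E‖g̃ − E[g̃]‖², i.e., mean-square error = squared bias + variance.

```python
import numpy as np

# Exact check of the decomposition used in the analysis, on a toy estimator
# taking three possible values with known probabilities (all numbers made up).
g = np.array([1.0, -2.0])                     # "full gradient"
values = np.array([[1.5, -1.0],               # possible stochastic gradients
                   [0.0, -2.5],
                   [2.0, -3.0]])
probs = np.array([0.5, 0.3, 0.2])             # their probabilities (sum to 1)

mean  = probs @ values                                    # E[g_tilde]
mse   = probs @ np.sum((values - g) ** 2, axis=1)         # E||g_tilde - g||^2
bias2 = np.sum((mean - g) ** 2)                           # ||E[g_tilde] - g||^2
var   = probs @ np.sum((values - mean) ** 2, axis=1)      # E||g_tilde - E[g_tilde]||^2
```

The identity holds exactly because the cross term E[(g̃ − E[g̃])]·(E[g̃] − g) vanishes. In the paper's setting, the bias part is driven by the node embedding approximation and the variance part by the layerwise gradient estimation.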

4. SGCN+: ZEROTH-ORDER VARIANCE REDUCTION

An important question to answer is: can we eliminate the residual error without using all neighbors during forward propagation? A remarkable attempt to answer this question was recently made in VRGCN (Chen et al., 2017), which proposes to use historical node embeddings as an approximation of the true node embeddings. More specifically, the graph convolution in VRGCN is defined as

H̃_t^{(ℓ)} = σ(L H̄_{t-1}^{(ℓ-1)} W^{(ℓ)} + L̃^{(ℓ)} (H̃_t^{(ℓ-1)} − H̄_{t-1}^{(ℓ-1)}) W^{(ℓ)}).

Taking advantage of historical node embeddings, VRGCN requires fewer sampled neighbors and incurs significantly less computation overhead during gradient computation. Although VRGCN achieves a significant speedup and better performance compared to other SGCNs, it involves using the full Laplacian matrix at each iteration, which can be computationally prohibitive. Moreover, since both SGCNs and VRGCN approximate the exact node embeddings calculated using all neighbors, it is still not clear why VRGCN achieves a better convergence result than SGCNs by using historical node embeddings.

To fill in these gaps, we introduce a zeroth-order variance reduced sampling-based GCN training method, dubbed SGCN+. As shown in Algorithm 1, SGCN+ has two types of forward propagation: at the snapshot steps and at the regular steps. At a snapshot step (t mod K = 0), the full Laplacian matrix is utilized:

Z_t^{(ℓ)} = L H_t^{(ℓ-1)} W_t^{(ℓ)},  H_t^{(ℓ)} = σ(Z_t^{(ℓ)}),  Z̄_t^{(ℓ)} ← Z_t^{(ℓ)}.  (6)

During the regular steps (t mod K ≠ 0), the sampled Laplacian matrix is utilized:

Z̃_t^{(ℓ)} = Z̄_{t-1}^{(ℓ)} + L̃^{(ℓ)} H̃_t^{(ℓ-1)} W_t^{(ℓ)} − L̃^{(ℓ)} H̄_{t-1}^{(ℓ-1)} W_{t-1}^{(ℓ)},  H̃_t^{(ℓ)} = σ(Z̃_t^{(ℓ)}).  (7)

Algorithm 1 SGCN+: Zeroth-order variance reduction (detailed version in Algorithm 4)
1: Input: learning rate η > 0, snapshot gap K > 0
2: for t = 1, ..., T do
3:   if t mod K = 0 then
4:     Calculate node embeddings using Eq. 6
5:     Calculate the full-batch gradient ∇L(θ_t) as in Eq. 1 and update θ_{t+1} = θ_t − η∇L(θ_t)
6:   else
7:     Calculate node embeddings using Eq. 7
8:     Calculate the stochastic gradient ∇̃L(θ_t) as in Eq. 2 and update θ_{t+1} = θ_t − η∇̃L(θ_t)
9:   end if
10: end for
11: Output: model with parameters θ_{T+1}

Algorithm 2 SGCN++: Doubly variance reduction (detailed version in Algorithm 5)
1: Input: learning rate η > 0, snapshot gap K > 0
2: for t = 1, ..., T do
3:   if t mod K = 0 then
4:     Calculate node embeddings using Eq. 6
5:     Calculate the full-batch gradient ∇L(θ_t) using Eq. 1 and update θ_{t+1} = θ_t − η∇L(θ_t)
6:     Save the layerwise gradients Ḡ_t^{(ℓ)} ← G_t^{(ℓ)}, D̄_t^{(ℓ)} ← D_t^{(ℓ)}, ∀ℓ ∈ [L]
7:   else
8:     Calculate node embeddings using Eq. 7
9:     Calculate the stochastic gradient ∇̃L(θ_t) using Eq. 10 and update θ_{t+1} = θ_t − η∇̃L(θ_t)
10:  end if
11: end for
12: Output: model with parameters θ_{T+1}

Compared with VRGCN, the proposed SGCN+ requires only one full-Laplacian graph convolution operation every K iterations, where K > 0 is an additional parameter to be tuned. In the following theorem, we present the convergence result of SGCN+. Recall that the node embedding approximation variance (zeroth-order variance) determines the bias E[‖b‖_F^2] of the stochastic gradient. Applying SGCN+ significantly reduces this bias, to the point that it no longer deteriorates the convergence.

Theorem 2 (Convergence of SGCN+). Suppose Assumptions 1, 2, 3 hold and apply SGCN+ with learning rate η = min{1/L_F, 1/√T}, where L_F is the smoothness constant. Let Δ_n and Δ_b^+ denote the upper bounds on the variance and bias of the stochastic gradient:

Δ_n = Σ_{ℓ=1}^{L} O(E[‖L̃^{(ℓ)} − P^{(ℓ)}‖_F^2]) + O(E[‖P^{(ℓ)} − L‖_F^2]),  Δ_b^+ = η^2 Δ̄_b^+, where Δ̄_b^+ = O(K Σ_{ℓ=1}^{L} |E[‖P^{(ℓ)}‖_F^2] − ‖L‖_F^2|).  (8)

Then, the output of SGCN+ satisfies

min_{t∈[T]} E[‖∇L(θ_t)‖_F^2] ≤ 2(L(θ_1) − L(θ*))/√T + L_F Δ_n/√T + Δ̄_b^+/T.
The exact values of the key parameters L_F, Δ_n, and Δ̄_b^+ are computed in Lemma 1, Lemma 2, and Lemma 5, respectively, and can be found in Appendices G and H. Theorem 2 implies that after T iterations the gradient norm of SGCN+ is at most O(Δ_n/√T) + O(Δ̄_b^+/T). When all neighbors are used to calculate the exact node embeddings, we have P^{(ℓ)} = L, so that Δ̄_b^+ = 0, which recovers the convergence rate of SGD. Compared with vanilla SGCN, the bias of SGCN+ is scaled by the learning rate (through the factor η^2); therefore, we can reduce the negative effect of the bias by choosing the learning rate as η = O(1/√T). This also explains why SGCN+ achieves a significantly better convergence rate than SGCN.
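The two step types of Eqs. (6)-(7) can be sketched for a single layer as follows; all dimensions and matrices are hypothetical, and the mock weight update and entrywise sampling stand in for the real optimizer and sampler. A useful sanity property (checked below) is that when nothing has changed since the snapshot, the regular-step correction cancels and the cached snapshot pre-activation is recovered exactly.

```python
import numpy as np

def snapshot_step(L, H_prev, W):
    """Snapshot step (t mod K == 0), Eq. (6): exact propagation with the full
    Laplacian; the pre-activation Z is cached as the control variate."""
    Z = L @ H_prev @ W
    return Z, np.maximum(Z, 0.0)

def regular_step(Z_snap, L_tilde, H_prev, W, H_prev_snap, W_snap):
    """Regular step (t mod K != 0), Eq. (7): correct the cached Z_snap by the
    sampled Laplacian applied to the change in embeddings/weights. The two
    terms of Eq. (7) are factored as L~ (H W - H_snap W_snap)."""
    Z = Z_snap + L_tilde @ (H_prev @ W - H_prev_snap @ W_snap)
    return Z, np.maximum(Z, 0.0)

rng = np.random.default_rng(4)
N, d_in, d_out = 5, 3, 2
L = rng.random((N, N))
H_snap = rng.normal(size=(N, d_in))
W_snap = rng.normal(size=(d_in, d_out))
Z_snap, _ = snapshot_step(L, H_snap, W_snap)

# After the snapshot: a regular step with updated weights and a sampled Laplacian.
W_new = W_snap - 0.01 * rng.normal(size=W_snap.shape)   # one mock SGD update
L_tilde = L * (rng.random((N, N)) < 0.5) / 0.5          # crude entrywise sampling
Z_new, H_new = regular_step(Z_snap, L_tilde, H_snap, W_new, H_snap, W_snap)
```

Because the sampled Laplacian only multiplies the *difference* since the snapshot, the zeroth-order error scales with how much the embeddings and weights have drifted, which is what the η² factor in Theorem 2 captures.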

5. SGCN++: DOUBLY VARIANCE REDUCTION

Algorithm 1 applies zeroth-order variance reduction to the node embedding matrices and results in faster convergence. However, both SGCN and SGCN+ suffer from the same stochastic gradient variance Δ_n, which can be reduced only by increasing the mini-batch size of SGCN or by applying variance reduction to the stochastic gradient. An interesting question that arises is: can we further accelerate the convergence by simultaneously employing zeroth-order variance reduction on the node embeddings and first-order variance reduction on the layerwise gradients?

To answer this question, we propose the doubly variance reduction algorithm SGCN++, which extends the variance reduction algorithm from node embedding approximation to layerwise gradient estimation. As shown in Algorithm 2, the main idea of SGCN++ is to use the historical gradients as control variates for the current layerwise gradient estimation. More specifically, similar to SGCN+, which has two types of forward propagation steps, SGCN++ also has two types of backward propagation: at the snapshot steps and at the regular steps. At the snapshot steps (t mod K = 0), the backward propagation is the full-batch gradient computation defined in Eq. 1, and the computed full-batch gradients are saved as control variates for the following regular steps. The backward propagation at the regular steps (t mod K ≠ 0) is defined as

G̃_t^{(ℓ)} = Ḡ_{t-1}^{(ℓ)} + [L̃^{(ℓ)} H̃_t^{(ℓ-1)}]^⊤ (D̃_t^{(ℓ+1)} ∘ σ′(Z̃_t^{(ℓ)})) − [L̃^{(ℓ)} H̄_{t-1}^{(ℓ-1)}]^⊤ (D̄_{t-1}^{(ℓ+1)} ∘ σ′(Z̄_{t-1}^{(ℓ)})),
D̃_t^{(ℓ)} = D̄_{t-1}^{(ℓ)} + [L̃^{(ℓ)}]^⊤ (D̃_t^{(ℓ+1)} ∘ σ′(Z̃_t^{(ℓ)})) [W_t^{(ℓ)}]^⊤ − [L̃^{(ℓ)}]^⊤ (D̄_{t-1}^{(ℓ+1)} ∘ σ′(Z̄_{t-1}^{(ℓ)})) [W_{t-1}^{(ℓ)}]^⊤.  (10)

Next, in the following theorem, we establish the convergence rate of SGCN++. Recall that the mean-square error of the stochastic gradient can be decomposed into the bias E[‖b‖_F^2], which is due to node embedding approximation, and the variance E[‖n‖_F^2], which is due to layerwise gradient estimation.
Applying variance reduction to the node embeddings and the layerwise gradients simultaneously can significantly reduce the mean-square error of the stochastic gradient and speed up convergence.

Theorem 3 (Convergence of SGCN++). Suppose Assumptions 1, 2, 3 hold, and denote by L_F the smoothness constant and by Δ_{n+b}^{++} the upper bound on the mean-square error of the stochastic gradient,

Δ_{n+b}^{++} = η^2 Δ̄_{n+b}^{++} = η^2 O(K Σ_{ℓ=1}^{L} |E[‖L̃^{(ℓ)}‖_F^2] − ‖L‖_F^2|).  (11)

Apply SGCN++ in Algorithm 2 with learning rate η = 2 / (L_F + √(L_F^2 + 4Δ̄_{n+b}^{++})). Then it holds that

(1/T) Σ_{t=1}^{T} E[‖∇L(θ_t)‖^2] ≤ (1/T) (L_F + √(L_F^2 + 4Δ̄_{n+b}^{++})) (L(θ_1) − L(θ*)).

The exact values of the key parameters L_F and Δ_{n+b}^{++} are computed in Lemma 1 and Lemma 12, respectively, and can be found in Appendices G and I. Theorem 3 implies that applying doubly variance reduction makes the mean-square error O(η^2 K) times smaller. As a result, after T iterations the gradient norm of the solution obtained by SGCN++ is at most O(Δ_{n+b}^{++}/T), which enjoys the same rate as vanilla variance-reduced SGD (Reddi et al., 2016; Fang et al., 2018).

Scalability of SGCN++. One might wonder whether the full-batch gradient computation at the snapshot steps hinders the scalability of SGCN++ on extremely large graphs. Heuristically, we can approximate the full-batch gradient by the gradient calculated on a large batch using all neighbors. The intuition stems from the matrix Bernstein inequality (Gross, 2011): the probability of the approximation error violating the desired accuracy decreases exponentially as the number of samples increases. Please refer to Algorithm 6 for the full-batch-free SGCN++ and an explanation, using tools from matrix concentration, of why the large-batch approximation is feasible. Moreover, we provide an empirical evaluation with large batches instead of full batches in Appendix C. We remark that the large-batch approximation can also be utilized in SGCN+ to further reduce the memory requirement for the historical node embeddings.
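The first-order correction in Eq. (10) follows the classic control-variate pattern ḡ + ∇f_i(θ_t) − ∇f_i(θ̄). The sketch below demonstrates this pattern on a plain finite-sum quadratic, which is only a stand-in for the layerwise GCN gradients; all data is made up.

```python
import numpy as np

# First-order variance reduction in the spirit of Eq. (10), sketched on the
# finite-sum objective f(theta) = (1/N) sum_i 0.5*(a_i . theta - b_i)^2.
rng = np.random.default_rng(2)
N, d = 20, 5
A = rng.normal(size=(N, d))
b = rng.normal(size=N)

def grad_i(theta, i):
    """Per-sample gradient a_i (a_i . theta - b_i)."""
    return A[i] * (A[i] @ theta - b[i])

def full_grad(theta):
    """Full gradient (1/N) A^T (A theta - b), i.e., the mean of grad_i."""
    return A.T @ (A @ theta - b) / N

def vr_grad(theta, theta_snap, g_snap, i):
    """Control-variate estimator: snapshot full gradient plus the per-sample
    gradient difference between current and snapshot parameters."""
    return g_snap + grad_i(theta, i) - grad_i(theta_snap, i)

theta_snap = rng.normal(size=d)
g_snap = full_grad(theta_snap)    # saved at the snapshot step
```

At the snapshot point the estimator is exact for every sample i, and near the snapshot its variance scales with ‖θ − θ̄‖², which is the mechanism behind the η²K factor in Eq. (11).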
Connection to composite optimization. Although we formulate sampling-based GCN training as a special case of the composite optimization problem, it is worth noting that, compared to classical composite optimization, a few key differences make the utilization of variance reduction methods for composite optimization non-trivial: (a) a different objective function, which makes the GCN analysis challenging; (b) a different gradient computation, analysis, and algorithm, which make both the algorithm and the analysis different from their composite optimization counterparts (see Appendix A).

6. EXPERIMENTS

Experiment results. In Table 1 and Figure 2, we show the accuracy and convergence comparison of SGCN, SGCN+, and SGCN++. We remark that multi-class classification tasks prefer more stable node embeddings and gradients than single-class classification tasks. Therefore, even the vanilla Exact, GraphSAGE, and VRGCN already outperform the other baseline methods on PPI, PPI-large, and Yelp; applying variance reduction further improves their performance. In addition, we observe that the effect of variance reduction depends on the base sampling algorithm. Even though the performance of the base sampling algorithms varies significantly, doubly variance reduction brings their performance to a similar level. Moreover, we can observe from the loss curves that SGCNs suffer a residual error, as discussed in Theorem 1, and that the residual error is proportional to the node embedding approximation variance (zeroth-order variance): VRGCN has less variance than GraphSAGE because of its zeroth-order variance reduction, and GraphSAGE has less variance than LADIES because more nodes are sampled for node embedding approximation.

7. CONCLUSION

In this work, we develop a theoretical framework for analyzing the convergence of sampling-based mini-batch GCN training. We show that the node embedding approximation variance and the layerwise gradient variance are two key factors that slow down the convergence of these methods. Furthermore, we propose a doubly variance reduction schema and theoretically analyze its convergence. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed schema in significantly reducing the variance of different sampling strategies and achieving better generalization.

A CONNECTION TO COMPOSITE OPTIMIZATION

In this section, we formally compare the optimization problem in training GCNs to standard composite optimization and highlight the key differences that necessitate a completely different variance reduction schema and convergence analysis from the composite optimization counterparts (e.g., see Fang et al. (2018)).

Different objective function. In composite optimization, the output of the lower-level function is treated as the parameter of the outer-level function. In GCNs, however, the output of the lower-level function is used as the input of the outer-level function, and the parameters of the outer-level function are independent of the output of the inner layers. More specifically, a two-level composite optimization problem can be formulated as

F(θ) = (1/N) Σ_{i=1}^{N} f_i((1/M) Σ_{j=1}^{M} g_j(w)),  θ = {w},  (13)

where f_i(·) is the outer-level function computed on the ith data point, g_j(·) is the inner-level function computed on the jth data point, and w is the parameter. We denote by ∇f_i(·) and ∇g_j(·) the gradients. Then, the gradient of Eq. 13 is computed as

∇F(θ) = [(1/N) Σ_{i=1}^{N} ∇f_i((1/M) Σ_{j=1}^{M} g_j(w))]^⊤ (1/M) Σ_{j=1}^{M} ∇g_j(w),  θ = {w},  (14)

where the dependency between inner- and outer-level sampling is not considered. One can independently sample inner-level data to estimate g̃ ≈ (1/M) Σ_j g_j(w) and ∇g̃ ≈ (1/M) Σ_j ∇g_j(w), sample outer-level data to estimate ∇f̃ ≈ (1/N) Σ_i ∇f_i(g̃), and then estimate ∇F(θ) by [∇f̃]^⊤ ∇g̃. Casting the optimization problem in GCN training as the composite optimization problem in Eq. 13, we have

L(θ) = (1/B) Σ_{i∈V_B} Loss(h_i^{(L)}, y_i),  θ = {W^{(1)}},
H^{(L)} = σ(L̃^{(L)} X̃ W̃^{(L)}),  W̃^{(L)} = σ(L̃^{(L-1)} X̃ σ(L̃^{(L-2)} X̃ ⋯ σ(L̃^{(1)} X W^{(1)}) W^{(2)} ⋯)),  (15)

which is different from the vanilla GCN model. To see this, note that in vanilla GCNs, since the nodes sampled at the ℓth layer depend on the nodes sampled at the (ℓ+1)th layer, we have E[L̃^{(ℓ)}] = P^{(ℓ)} ≠ L. However, in Eq.
15, since the sampled nodes have no dependency on the weight matrices or on the nodes sampled at other layers, we can easily obtain E[L̃^{(ℓ)}] = L. These key differences make the analysis more involved; they are reflected in all three theorems and give different results.

Different gradient computation and algorithm. The stochastic gradients used to update the parameters in Eq. 15 are computed as

∂L(θ)/∂W^{(1)} = ∂L(θ)/∂W̃^{(L)} Π_{j=2}^{L} ∂W̃^{(j)}/∂W̃^{(j-1)}.

In GCNs, however, there are two types of gradients at each layer (i.e., D^{(ℓ)} and G^{(ℓ)}) that are fused with each other (i.e., D^{(ℓ)} is part of G^{(ℓ-1)} and D^{(ℓ)} is part of D^{(ℓ-1)}) but serve different purposes: D^{(ℓ)} passes the gradient between layers, while G^{(ℓ)} passes the gradient to the weight matrices. These two types of gradients and their coupled relation make both the algorithm and the analysis different from Zhang & Xiao (2019b). For example, in Zhang & Xiao (2019b) the zeroth-order variance reduction is applied to W̃_t^{(ℓ)} in Eq. 13 (please refer to Algorithm 3 in Zhang & Xiao (2019b)), where W̃_{t-1}^{(ℓ)} is used as a control variate to reduce the variance of W̃_t^{(ℓ)}, i.e.,

W̃_t^{(ℓ+1)} = W̃_{t-1}^{(ℓ+1)} + σ(L̃_t^{(ℓ)} X̃ W̃_t^{(ℓ)}) − σ(L̃_t^{(ℓ)} X̃ W̃_{t-1}^{(ℓ)}).

In SGCN++, however, the zeroth-order variance reduction is applied to H̃_t^{(ℓ)}. Because the nodes sampled at the tth and (t−1)th iterations are unlikely to be the same, we cannot directly use H̃_{t-1}^{(ℓ)} to reduce the variance of H̃_t^{(ℓ)}. Instead, the control variate in SGCN++ is computed by applying the historical weights W_{t-1}^{(ℓ)} to the historical node embeddings from the previous layer H̄_{t-1}^{(ℓ-1)}, i.e.,

H̃_t^{(ℓ)} = H̄_{t-1}^{(ℓ)} + σ(L̃_t^{(ℓ)} H̃_t^{(ℓ-1)} W_t^{(ℓ)}) − σ(L̃_t^{(ℓ)} H̄_{t-1}^{(ℓ-1)} W_{t-1}^{(ℓ)}).

These changes are not simply heuristic modifications; they are all reflected in the analysis and the results. Different theoretical results and intuition.
The aforementioned differences further result in the novel analysis of Theorem 1, where we show that vanilla sampling-based GCNs suffer a residual error Δ_b that does not decrease as the number of iterations T increases, and that this residual error is strongly connected to the difference between the sampled and full Laplacian matrices. This is one of our novel observations for GCNs, when compared to (1) multi-level composite optimization with layerwise changing learning rates (Yang et al., 2019; Chen et al., 2020), (2) variance reduction based methods (Zhang & Xiao, 2019b), and (3) previous analyses of the convergence of GCNs (Chen et al., 2018; Chen & Luss, 2018). Our observation serves as a theoretical motivation for using first-order and doubly variance reduction, and mathematically explains why VRGCN outperforms GraphSAGE, even with fewer sampled nodes during training. Furthermore, as the algorithm and gradient computation are different, the theoretical results in Theorems 2 and 3 are also different.
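To make the composite structure of Eqs. (13)-(14) concrete, here is a hypothetical instantiation with linear inner functions g_j(w) = B_j w and quadratic outer functions f_i(u) = ½‖u − c_i‖², with the chain-rule gradient verified against finite differences; all matrices and dimensions are made up.

```python
import numpy as np

# Hypothetical instantiation of Eq. (13): g_j(w) = B_j w, f_i(u) = 0.5*||u - c_i||^2.
rng = np.random.default_rng(3)
N, M, d, k = 4, 3, 5, 2
Bs = rng.normal(size=(M, k, d))          # inner-level data (one matrix per j)
cs = rng.normal(size=(N, k))             # outer-level data (one target per i)

def F(w):
    """F(w) = (1/N) sum_i f_i((1/M) sum_j g_j(w))."""
    u = Bs.mean(axis=0) @ w              # inner average (1/M) sum_j g_j(w)
    return 0.5 * np.mean(np.sum((u - cs) ** 2, axis=1))

def grad_F(w):
    """Chain rule of Eq. (14): [Jacobian of inner average]^T mean_i grad f_i(u)."""
    B_bar = Bs.mean(axis=0)              # (1/M) sum_j grad g_j, constant here
    u = B_bar @ w
    return B_bar.T @ np.mean(u - cs, axis=0)

w = rng.normal(size=d)
g = grad_F(w)
```

Note that the only trainable parameter is the innermost w, matching the composite-optimization setting; in the GCN objective, by contrast, every layer carries its own trainable W^{(ℓ)}, which is exactly the structural difference discussed above.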

B EXPERIMENT CONFIGURATIONS

Hardware specification and environment. We run our experiments on a single machine with an Intel i5-7500 CPU, an NVIDIA GTX 1080 GPU (8GB memory), and 32GB of RAM. The code is written in Python 3. By default, we train 2-layer GCNs with a hidden dimension of 256, element-wise ELU as the activation function, and the symmetric normalized Laplacian matrix L = D^{-1/2} A D^{-1/2}. We use mean-aggregation for single-class classification tasks and concatenation-aggregation for multi-class classification. The default mini-batch size and sampled node size are summarized in Table 2. We update the model using the Adam optimizer with a learning rate of 0.01. For SGCN++, historical node embeddings are first calculated on the GPU and then transferred to CPU memory using the PyTorch command Tensor.to(device); therefore, no extra GPU memory is required when training with SGCN++. To balance the staleness of the snapshot model against computational efficiency, by default we choose a snapshot gap of K = 10 and early-stop the inner loop if the Euclidean distance between the current step's gradient and the snapshot gradient is larger than 0.002 times the norm of the snapshot gradient. During training, for each epoch we construct 10 mini-batches in parallel using the Python package multiprocessing and perform training on the 10 sampled mini-batches. To achieve a fair comparison of different sampling strategies in terms of sampling complexity, we implement all sampling algorithms using the numpy.random and scipy.sparse packages. We emphasize that, in order to better observe the impact of sampling on convergence, we have not used any augmentation methods (e.g., layer normalization, skip-connections, or attention), which have been shown to impact GCN performance in Cai et al. (2020); Dwivedi et al. (2020). Note that we are not criticizing the usage of these augmentations; instead, we use the most primitive network structure to better explore the impact of sampling and variance reduction on convergence. Comparison of SGD and Adam.
It is worth noting that the Adam optimizer is used as the default optimizer during training. We choose Adam over SGD for the following reasons: (a) Baseline methods trained with SGD cannot converge when using a constant learning rate, due to the bias and variance of the stochastic gradient (Adam has an implicit variance-reduction effect that can alleviate this issue). SGD-trained baseline models exhibit a huge performance gap relative to those trained with Adam, which makes the comparison meaningless. For example, in Figure 3 we compare the Adam and SGD optimizers on the PPI dataset. For Adam we use PyTorch's default learning rate of 0.01, and for SGD we choose a learning rate of 0.1, selected as the most stable learning rate in the range [0.01, 1] for this dataset. Although SGD uses a learning rate 10 times larger than Adam's, it requires 100 times more iterations than Adam to reach the early-stopping point (validation loss does not decrease for 200 iterations), and still suffers a large performance gap compared to Adam. (b) Most public implementations of GCNs, including all implementations in the PyTorch Geometric and DGL packages, use the Adam optimizer instead of SGD. (c) In this paper, we mainly focus on how to estimate a stabilized stochastic gradient, rather than on how to use the resulting gradient for the weight update. We employ the Adam optimizer for all algorithms in our experiments, which leads to a fair comparison. Dataset statistics. We summarize the dataset statistics in Table 3. Because Exact sampling uses all neighbors for the node embedding approximation, it is only affected by layerwise gradient variance (first-order variance). Therefore, employing first-order variance reduction on Exact sampling can significantly reduce the mean-square error of the stochastic gradient with respect to the full gradient, and speed up convergence.
Unlike Exact sampling, where the exact node embeddings are available during training, the layer-wise sampling algorithm LADIES and the node-wise sampling algorithm GraphSAGE approximate the true node embeddings using a subset of nodes (neighbors). Therefore, these methods suffer from both node embedding approximation variance (zeroth-order variance) and layerwise gradient variance (first-order variance). As a result, applying zeroth-order and first-order variance reduction simultaneously is necessary to reduce the mean-square error of the stochastic gradient and speed up convergence. GPU memory usage. In Figure 5, we compare the GPU memory usage of SGCN and SGCN++. We measure the allocated memory with torch.cuda.memory_allocated, which reports the current GPU memory occupied by tensors (in bytes) for a given device, and the maximum allocated memory with torch.cuda.max_memory_allocated, which reports the maximum GPU memory occupied by tensors (in bytes) for a given device. From Figure 5, we observe that neither running full-batch GCN nor saving historical node embeddings and gradients significantly increases the memory overhead during training. Besides, since all historical activations are stored outside the GPU, SGCN++ only requires several megabytes to transfer data between GPU memory and the host, which is negligible compared to the memory usage of the computation itself. Evaluation of total time. In Table 4 and Table 5, we report the total time per iteration, because the vanilla sampling-based methods cannot reach the same accuracy as the doubly variance reduced algorithm (due to the residual error shown in Theorem 1). From Table 4 and Table 5, we observe that the most time-consuming processes in sampling-based GCN training are data sampling and data transfer. The extra computation time introduced by the snapshot step is negligible compared to the mini-batch sampling time of each regular step.
Therefore, a promising future direction for large-scale graph training is developing a provable sampling algorithm with low sampling complexity. Evaluation of snapshot gap for SGCN+ and SGCN++. Doubly variance reduced SGCN++ requires performing full-batch (large-batch) computation periodically to calculate the snapshot node embeddings and gradients. A larger snapshot gap K can make training faster, but may also make the snapshot node embeddings and gradients too stale for variance reduction. In this experiment, we evaluate the effect of the snapshot gap on training by choosing the mini-batch size as B = 512 and varying the inner-loop interval from K = 5 mini-batches to K = 20 mini-batches; the results are shown in Figure 6 and Figure 7. Evaluation of large-batch size for SGCN+ and SGCN++. The full-batch gradient calculation at each snapshot step is computationally expensive. Heuristically, we can approximate the full-batch gradient by the gradient computed on a large batch of nodes. Moreover, large-batch approximation can also be used for the node embedding approximation in zeroth-order variance reduction. In SGCN+, saving the historical node embeddings for all nodes of an extremely large graph can be computationally prohibitive. An alternative strategy is to sample a large batch during the snapshot step, compute the node embeddings for all nodes in the large batch, and save the freshly computed node embeddings to storage. After that, mini-batch nodes are sampled from the large batch during the regular steps. Let B̃ denote the snapshot-step large-batch size and B denote the regular-step mini-batch size. The effect of mini-batch size. In Figure 10, we show the comparison of training loss and validation loss with different regular-step mini-batch sizes. By default, we choose the snapshot gap as K = 10, fix the snapshot-step batch size as B̃ = 80,000, and change the regular-step mini-batch size B from 256 to 2,048.
Besides, we note that the subgraph sampling algorithm GraphSAINT requires an extremely large mini-batch size at every iteration. In Figure 11, we explicitly compare the effect of mini-batch size on doubly variance reduced GraphSAINT++ and vanilla GraphSAINT, and show that GraphSAINT++ requires a smaller mini-batch. Evaluation of increasing snapshot gap. The snapshot gap K serves as a budget hyper-parameter that balances training speed against the quality of variance reduction. During training, as the number of iterations increases, the GCN model converges toward a saddle point. Therefore, it is interesting to explore whether increasing the snapshot gap K during training can yield a speed boost. In Figure 12, we compare the validation loss of a fixed snapshot gap K = 10 with a gradually increasing snapshot gap K = 10 + 0.1 × s, s = 1, 2, . . ., where s is the number of snapshot steps that have been computed. Recall that the key bottleneck for SGCN++ is memory budget and sampling complexity, rather than snapshot computation. Dynamically increasing the snapshot gap reduces the number of snapshot steps, but it does not significantly reduce the training time and may lead to a performance drop.
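The snapshot-scheduling logic described above (the fixed versus increasing gap schedules, plus the drift-based inner-loop early-stopping rule from the experimental setup) can be sketched as follows. This is our illustration, not the paper's implementation; the helper names and the rounding of the increasing gap to whole mini-batches are our assumptions.

```python
import numpy as np

def snapshot_gap(s, increasing=False):
    # Fixed schedule: K = 10; increasing schedule: K_s = 10 + 0.1 * s,
    # where s is the number of snapshot steps computed so far
    # (rounding to whole mini-batches is our assumption).
    return round(10 + 0.1 * s) if increasing else 10

def should_refresh_snapshot(steps_since_snapshot, K, grad, snap_grad, tol=0.002):
    # Take a new snapshot either when the gap K is exhausted, or when the
    # current step's gradient drifts from the snapshot gradient by more than
    # tol * ||snapshot gradient|| (the inner-loop early-stopping rule).
    drift = np.linalg.norm(grad - snap_grad) > tol * np.linalg.norm(snap_grad)
    return steps_since_snapshot >= K or drift
```

The drift test triggers a fresh snapshot before the gap is exhausted whenever the stored snapshot has become too stale to be useful for variance reduction.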

D DIFFERENT SAMPLING STRATEGIES

In this section, we highlight the differences between node-wise sampling, layer-wise sampling, and subgraph sampling algorithms.

Node-wise sampling. The main idea of node-wise sampling is to first sample all the nodes needed for the computation using neighbor sampling (NS), then train the GCN based on the sampled nodes. For each node in the ℓth GCN layer, NS randomly samples s of its neighbors at the (ℓ−1)th GCN layer and formulates L̃^{(ℓ)} by

L̃^{(ℓ)}_{i,j} = (|N(i)|/s) × L_{i,j} if j ∈ Ñ^{(ℓ)}(i), and 0 otherwise,

where N(i) is the full neighbor set of the ith node and Ñ^{(ℓ)}(i) is the set of sampled neighbors of node i for the ℓth GCN layer. GraphSAGE (Hamilton et al., 2017) follows the spirit of node-wise sampling: it performs uniform node sampling on the previous layer's neighbors, with a fixed number of sampled nodes, to bound the mini-batch computation complexity.

Layer-wise sampling. To avoid the neighbor explosion issue, layer-wise sampling is introduced to control the size of the sampled neighborhood in each layer. For the ℓth GCN layer, layer-wise sampling methods sample a set of nodes B^{(ℓ)} ⊆ V of size s under the distribution p to approximate the Laplacian by

L̃^{(ℓ)}_{i,j} = (1/(s × p_j)) × L_{i,j} if j ∈ B^{(ℓ)}, and 0 otherwise.

FastGCN (Chen et al., 2018) and LADIES (Zou et al., 2019) follow the spirit of layer-wise sampling. FastGCN performs independent node sampling for each layer and applies importance sampling to reduce the variance, resulting in a constant sample size in all layers. However, the mini-batches potentially become too sparse to achieve high accuracy. LADIES improves on FastGCN via layer-dependent sampling: based on the sampled nodes in the upper layer, it selects their neighborhood nodes, constructs a bipartite subgraph, and computes the importance probabilities accordingly. Then, it samples a fixed number of nodes according to the calculated probabilities, and recursively conducts this procedure per layer to construct the whole computation graph.

Subgraph sampling.
Subgraph sampling is similar to layer-wise sampling, but it restricts the sampled Laplacian matrices at each layer to be identical, i.e.,

L̃^{(1)}_{i,j} = . . . = L̃^{(L)}_{i,j} = (1/(s × p_j)) × L_{i,j} if j ∈ B, and 0 otherwise.

For example, GraphSAINT (Zeng et al., 2019) can be viewed as a special case of the layer-wise sampling algorithm FastGCN obtained by restricting the nodes sampled at the 1st to (L−1)th layers to be the same as the nodes sampled at the Lth layer. However, GraphSAINT requires a significantly larger mini-batch size than other layer-wise sampling methods. We leave this as a potential future direction to explore.
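To make the rescaling conventions above concrete, the following sketch (ours, not the paper's code; dense numpy arrays are used for clarity instead of scipy.sparse) builds a node-wise sampled row with the |N(i)|/s scaling and a layer-wise sampled Laplacian with the 1/(s·p_j) scaling. The layer-wise estimator is unbiased: averaged over many draws it recovers L.

```python
import numpy as np

def nodewise_sampled_row(L_row, neighbors, s, rng):
    # Node-wise (GraphSAGE-style): keep s uniformly sampled neighbors of node i
    # and rescale the kept entries by |N(i)|/s, as in the node-wise L~ above.
    sampled = rng.choice(neighbors, size=min(s, len(neighbors)), replace=False)
    row = np.zeros_like(L_row)
    row[sampled] = (len(neighbors) / len(sampled)) * L_row[sampled]
    return row

def layerwise_sampled_laplacian(L, p, s, rng):
    # Layer-wise (FastGCN/LADIES-style): draw s columns with replacement under
    # the distribution p and rescale column j by 1/(s * p_j), so E[L~] = L.
    cols = rng.choice(L.shape[1], size=s, p=p)
    Lt = np.zeros_like(L)
    for j in cols:
        Lt[:, j] += L[:, j] / (s * p[j])
    return Lt
```

The rescaling is what keeps the sampled aggregation unbiased; without it, sampling fewer neighbors would systematically shrink the aggregated embeddings.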

E DETAILED ALGORITHMS

E.1 DESCRIPTION

In order to help readers better compare the different algorithms, we summarize the vanilla sampling-based GCN training algorithm SGCN in Algorithm 3, the zeroth-order variance reduced algorithm SGCN+ in Algorithm 4, and the doubly variance reduced algorithm SGCN++ in Algorithm 5. In addition, we illustrate in Figure 13 the relationship of the node embedding approximation variance and the layerwise gradient variance to the forward- and backward-propagation. We remark that the zeroth-order variance reduction of SGCN+ is applied only during forward-propagation, while the doubly (zeroth- and first-order) variance reduction of SGCN++ is applied during forward- and backward-propagation simultaneously.

(Figure 13: the forward pass computes H̃^{(ℓ)} = σ(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)}) layer by layer, with H̃^{(0)} = X, which induces the node embedding approximation variance; the backward pass propagates (1/B) Σ_{i∈V_B} ∇Loss(h̃^{(L)}_i, y_i) through the layers, which induces the layerwise gradient variance.)
Algorithm 3 SGCN: vanilla sampling-based GCN training
1: Input: learning rate η > 0
2: for t = 1, . . . , T do
3:   Sample a mini-batch V_B ⊂ V and construct the sampled Laplacian matrices {L̃^{(ℓ)}}^L_{ℓ=1}
4:   Calculate node embeddings Z̃^{(ℓ)}_t = L̃^{(ℓ)} H̃^{(ℓ−1)}_t W^{(ℓ)}_t, H̃^{(ℓ)}_t = σ(Z̃^{(ℓ)}_t), with H̃^{(0)} = X
5:   Calculate the loss L̃(θ_t) = (1/B) Σ_{i∈V_B} Loss(h̃^{(L)}_i, y_i)
6:   Calculate the stochastic gradient ∇L̃(θ_t) = {G̃^{(ℓ)}}^L_{ℓ=1} as
     G̃^{(ℓ)}_t := [L̃^{(ℓ)} H̃^{(ℓ−1)}_t]^⊤ (D̃^{(ℓ+1)}_t ∘ ∇σ(Z̃^{(ℓ)}_t)),
     D̃^{(ℓ)}_t := [L̃^{(ℓ)}]^⊤ (D̃^{(ℓ+1)}_t ∘ ∇σ(Z̃^{(ℓ)}_t)) [W^{(ℓ)}_t]^⊤,  D̃^{(L+1)}_t = ∂L̃(θ_t)/∂H̃^{(L)}
7:   Update the parameters θ_{t+1} = θ_t − η ∇L̃(θ_t)
8: end for

Algorithm 4 SGCN+: zeroth-order variance reduction
1: Input: learning rate η > 0, snapshot gap K > 0
2: for t = 1, . . . , T do
3:   if t mod K = 0 then  % Snapshot steps
4:     Calculate node embeddings on the full batch and update the historical embeddings:
       Z^{(ℓ)}_t = L H^{(ℓ−1)}_t W^{(ℓ)}_t, H^{(ℓ)}_t = σ(Z^{(ℓ)}_t), Z̄^{(ℓ)}_t ← Z^{(ℓ)}_t
5:     Calculate the loss L(θ_t) = (1/N) Σ^N_{i=1} Loss(h^{(L)}_i, y_i)
6:     Calculate the full-batch gradient ∇L(θ_t) = {G^{(ℓ)}}^L_{ℓ=1} as
       G^{(ℓ)}_t := [L H^{(ℓ−1)}_t]^⊤ (D^{(ℓ+1)}_t ∘ ∇σ(Z^{(ℓ)}_t)),
       D^{(ℓ)}_t := L^⊤ (D^{(ℓ+1)}_t ∘ ∇σ(Z^{(ℓ)}_t)) [W^{(ℓ)}_t]^⊤,  D^{(L+1)}_t = ∂L(θ_t)/∂H^{(L)}
7:     Update the parameters θ_{t+1} = θ_t − η ∇L(θ_t)
8:   else  % Regular steps
9:     Sample a mini-batch V_B ⊂ V
10:    Calculate node embeddings using the historical embeddings:
       Z̃^{(ℓ)}_t = Z̃^{(ℓ)}_{t−1} + L̃^{(ℓ)} H̃^{(ℓ−1)}_t W^{(ℓ)}_t − L̃^{(ℓ)} H̃^{(ℓ−1)}_{t−1} W^{(ℓ)}_{t−1},  H̃^{(ℓ)}_t = σ(Z̃^{(ℓ)}_t)
11:    Calculate the loss L̃(θ_t) = (1/B) Σ_{i∈V_B} Loss(h̃^{(L)}_i, y_i), the stochastic gradient ∇L̃(θ_t) as in Algorithm 3, and update θ_{t+1} = θ_t − η ∇L̃(θ_t)
12:  end if
13: end for
14: Output: model with parameters θ_{T+1}

Algorithm 5 SGCN++: doubly variance reduction
1: Input: learning rate η > 0, snapshot gap K > 0
2: for t = 1, . . . , T do
3:   if t mod K = 0 then  % Snapshot steps
4:     Perform the snapshot step of Algorithm 4 and, in addition, save the per-layer gradients G̃^{(ℓ)}_t ← G^{(ℓ)}_t, D̃^{(ℓ)}_t ← D^{(ℓ)}_t for all ℓ ∈ [L]
5:   else  % Regular steps
6:     Sample a mini-batch V_B ⊂ V
7:     Calculate node embeddings as in the regular step of Algorithm 4, and the loss L̃(θ_t) = (1/B) Σ_{i∈V_B} Loss(h̃^{(L)}_i, y_i)
8:     Calculate the doubly variance reduced stochastic gradient ∇L̃(θ_t) = {G̃^{(ℓ)}}^L_{ℓ=1} as
       G̃^{(ℓ)}_t = G̃^{(ℓ)}_{t−1} + [L̃^{(ℓ)} H̃^{(ℓ−1)}_t]^⊤ (D̃^{(ℓ+1)}_t ∘ ∇σ(Z̃^{(ℓ)}_t)) − [L̃^{(ℓ)} H̃^{(ℓ−1)}_{t−1}]^⊤ (D̃^{(ℓ+1)}_{t−1} ∘ ∇σ(Z̃^{(ℓ)}_{t−1})),
       D̃^{(ℓ)}_t = D̃^{(ℓ)}_{t−1} + [L̃^{(ℓ)}]^⊤ (D̃^{(ℓ+1)}_t ∘ ∇σ(Z̃^{(ℓ)}_t)) [W^{(ℓ)}_t]^⊤ − [L̃^{(ℓ)}]^⊤ (D̃^{(ℓ+1)}_{t−1} ∘ ∇σ(Z̃^{(ℓ)}_{t−1})) [W^{(ℓ)}_{t−1}]^⊤,
       D̃^{(L+1)}_t = ∂L̃(θ_t)/∂H̃^{(L)}_t
9:     Update the parameters θ_{t+1} = θ_t − η ∇L̃(θ_t)
10:  end if
11: end for
12: Output: model with parameters θ_{T+1}

E.5 SGCN++ WITHOUT FULL-BATCH

Furthermore, in Algorithm 6 we provide an alternative version of SGCN++ that does not require full-batch forward- and backward-propagation at the snapshot step. The basic idea is to approximate the full-batch gradient by sampling a large mini-batch Ṽ_B̃ of size B̃ = |Ṽ_B̃| using Exact sampling, then computing the node embedding matrices and stochastic gradients on the sampled large batch Ṽ_B̃.

Algorithm 6 SGCN++ (without full-batch): doubly variance reduction
1: Input: learning rate η > 0, snapshot gap K > 0
2: for t = 1, . . . , T do
3:   if t mod K = 0 then  % Snapshot steps
4:     Sample a large batch Ṽ_B̃ of size B̃ and construct the Laplacian matrices L̃^{(ℓ)} for each layer using all neighbors, i.e., L̃^{(ℓ)}_{i,j} = L_{i,j} if j ∈ N^{(ℓ)}(i), and 0 otherwise
5:     Calculate node embeddings and update the historical embeddings: Z^{(ℓ)}_t = L̃^{(ℓ)} H^{(ℓ−1)}_t W^{(ℓ)}_t, H^{(ℓ)}_t = σ(Z^{(ℓ)}_t), Z̄^{(ℓ)}_t ← Z^{(ℓ)}_t
6:     Calculate the loss L(θ_t) = (1/B̃) Σ_{i∈Ṽ_B̃} Loss(h^{(L)}_i, y_i)
7:     Calculate the approximated snapshot gradient ∇L(θ_t) = {G^{(ℓ)}}^L_{ℓ=1} as
       G^{(ℓ)}_t := [L̃^{(ℓ)} H^{(ℓ−1)}_t]^⊤ (D^{(ℓ+1)}_t ∘ ∇σ(Z^{(ℓ)}_t)),
       D^{(ℓ)}_t := [L̃^{(ℓ)}]^⊤ (D^{(ℓ+1)}_t ∘ ∇σ(Z^{(ℓ)}_t)) [W^{(ℓ)}_t]^⊤,  D^{(L+1)}_t = ∂L(θ_t)/∂H^{(L)}
8:     Save the per-layer gradients G̃^{(ℓ)}_t ← G^{(ℓ)}_t, D̃^{(ℓ)}_t ← D^{(ℓ)}_t, ∀ℓ ∈ [L]
9:     Update the parameters θ_{t+1} = θ_t − η ∇L(θ_t)
10:  else  % Regular steps
11:    Sample a mini-batch V_B ⊂ Ṽ_B̃
12:    Calculate node embeddings: Z̃^{(ℓ)}_t = Z̃^{(ℓ)}_{t−1} + L̃^{(ℓ)} H̃^{(ℓ−1)}_t W^{(ℓ)}_t − L̃^{(ℓ)} H̃^{(ℓ−1)}_{t−1} W^{(ℓ)}_{t−1},  H̃^{(ℓ)}_t = σ(Z̃^{(ℓ)}_t)
13:    Calculate the loss L̃(θ_t) = (1/B) Σ_{i∈V_B} Loss(h̃^{(L)}_i, y_i)
14:    Calculate the stochastic gradient ∇L̃(θ_t) = {G̃^{(ℓ)}}^L_{ℓ=1} as
       G̃^{(ℓ)}_t = G̃^{(ℓ)}_{t−1} + [L̃^{(ℓ)} H̃^{(ℓ−1)}_t]^⊤ (D̃^{(ℓ+1)}_t ∘ ∇σ(Z̃^{(ℓ)}_t)) − [L̃^{(ℓ)} H̃^{(ℓ−1)}_{t−1}]^⊤ (D̃^{(ℓ+1)}_{t−1} ∘ ∇σ(Z̃^{(ℓ)}_{t−1})),
       D̃^{(ℓ)}_t = D̃^{(ℓ)}_{t−1} + [L̃^{(ℓ)}]^⊤ (D̃^{(ℓ+1)}_t ∘ ∇σ(Z̃^{(ℓ)}_t)) [W^{(ℓ)}_t]^⊤ − [L̃^{(ℓ)}]^⊤ (D̃^{(ℓ+1)}_{t−1} ∘ ∇σ(Z̃^{(ℓ)}_{t−1})) [W^{(ℓ)}_{t−1}]^⊤,
       D̃^{(L+1)}_t = ∂L̃(θ_t)/∂H̃^{(L)}_t
15:    Update the parameters θ_{t+1} = θ_t − η ∇L̃(θ_t)
16:  end if
17: end for
18: Output: model with parameters θ_{T+1}

The intuition for the snapshot-step large-batch approximation stems from the matrix Bernstein inequality (Gross, 2011). More specifically, let G̃_i ∈ R^{d×d} be the stochastic gradient computed using the ith node with Exact sampling (all neighbors are used to calculate the exact node embeddings). Suppose the difference between G̃_i and the full gradient E[G̃_i] is uniformly bounded and the variance is bounded:

‖G̃_i − E[G̃_i]‖_F ≤ μ,  E[‖G̃_i − E[G̃_i]‖²_F] ≤ σ².

Let G̃ be the snapshot-step gradient computed on the sampled large batch, G̃ = (1/B̃) Σ_{i∈Ṽ_B̃} G̃_i. By the matrix Bernstein inequality, the probability that ‖G̃ − E[G̃_i]‖_F exceeds a constant ε decreases exponentially as the sampled large-batch size B̃ increases, i.e.,

Pr( ‖G̃ − E[G̃_i]‖_F ≥ ε ) ≤ 2d exp( −B̃ · min( ε²/(4σ²), ε/(2μ) ) ).

Therefore, by choosing a large enough snapshot-step batch size B̃, we can obtain a good approximation of the full gradient.
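To make the regular-step recursion concrete, the following minimal sketch (our illustration, not the paper's implementation) applies the same snapshot/regular-step pattern to a toy quadratic objective: at snapshot steps the gradient estimate is reset to the full (or large-batch) gradient, and at regular steps it is corrected by the difference of mini-batch gradients at consecutive iterates. The oracles `full_grad` and `stoch_grad` are hypothetical stand-ins for the snapshot and mini-batch gradient computations.

```python
import numpy as np

def doubly_vr_loop(theta0, full_grad, stoch_grad, n, eta=0.1, K=10, T=60, seed=0):
    # Snapshot step (t mod K == 0): reset the estimate to the full gradient.
    # Regular step: g_t = g_{t-1} + grad_B(theta_t) - grad_B(theta_{t-1}),
    # with both terms evaluated on the same mini-batch B.
    rng = np.random.default_rng(seed)
    theta, theta_prev, g = theta0, theta0, None
    for t in range(T):
        if t % K == 0:
            g = full_grad(theta)
        else:
            batch = rng.integers(0, n, size=8)
            g = g + stoch_grad(theta, batch) - stoch_grad(theta_prev, batch)
        theta_prev, theta = theta, theta - eta * g
    return theta
```

On a toy quadratic f(θ) = (1/2N) Σ_i ‖θ − c_i‖², where full_grad(θ) = θ − mean(c) and stoch_grad(θ, B) = θ − mean(c_B), the correction term cancels the mini-batch noise exactly and the iterates converge to the mean of the c_i.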

F NOTATIONS, IMPORTANT PROPOSITIONS AND LEMMAS

F.1 NOTATIONS FOR GRADIENT COMPUTATION

We introduce the following notation to simplify the presentation and make it easier for readers to follow. We formulate each GCN layer in FullGCN as a function

H^{(ℓ)} = f^{(ℓ)}(H^{(ℓ−1)}, W^{(ℓ)}) := σ( L H^{(ℓ−1)} W^{(ℓ)} ) ∈ R^{N×d_ℓ},  Z^{(ℓ)} := L H^{(ℓ−1)} W^{(ℓ)},  (40)

its gradient w.r.t. the input node embedding matrix D^{(ℓ)} ∈ R^{N×d_{ℓ−1}} is computed as

D^{(ℓ)} = ∇_H f^{(ℓ)}(D^{(ℓ+1)}, H^{(ℓ−1)}, W^{(ℓ)}) := [L]^⊤ ( D^{(ℓ+1)} ∘ σ′(L H^{(ℓ−1)} W^{(ℓ)}) ) [W^{(ℓ)}]^⊤,  (41)

and its gradient w.r.t. the weight matrix G^{(ℓ)} ∈ R^{d_{ℓ−1}×d_ℓ} is computed as

G^{(ℓ)} = ∇_W f^{(ℓ)}(D^{(ℓ+1)}, H^{(ℓ−1)}, W^{(ℓ)}) := [L H^{(ℓ−1)}]^⊤ ( D^{(ℓ+1)} ∘ σ′(L H^{(ℓ−1)} W^{(ℓ)}) ).  (42)

Similarly, we can formulate the calculation of the node embedding matrix H̃^{(ℓ)} ∈ R^{N×d_ℓ} at each GCN layer in SGCN as

H̃^{(ℓ)} = f̃^{(ℓ)}(H̃^{(ℓ−1)}, W^{(ℓ)}) := σ( L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)} ),  Z̃^{(ℓ)} := L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)},  (43)

its gradient w.r.t. the input node embedding matrix D̃^{(ℓ)} ∈ R^{N×d_{ℓ−1}} is computed as

D̃^{(ℓ)} = ∇_H f̃^{(ℓ)}(D̃^{(ℓ+1)}, H̃^{(ℓ−1)}, W^{(ℓ)}) := [L̃^{(ℓ)}]^⊤ ( D̃^{(ℓ+1)} ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)}) ) [W^{(ℓ)}]^⊤,  (44)

and its gradient w.r.t. the weight matrix G̃^{(ℓ)} ∈ R^{d_{ℓ−1}×d_ℓ} is computed as

G̃^{(ℓ)} = ∇_W f̃^{(ℓ)}(D̃^{(ℓ+1)}, H̃^{(ℓ−1)}, W^{(ℓ)}) := [L̃^{(ℓ)} H̃^{(ℓ−1)}]^⊤ ( D̃^{(ℓ+1)} ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)}) ).  (45)

Let us denote the gradient of the loss w.r.t. the final node embedding matrix as

D^{(L+1)} = ∂Loss(H^{(L)}, y)/∂H^{(L)} ∈ R^{N×d_L},  [D^{(L+1)}]_i = (1/N) ∂Loss(h^{(L)}_i, y_i)/∂h^{(L)}_i ∈ R^{d_L},
D̃^{(L+1)} = ∂Loss(H̃^{(L)}, y)/∂H̃^{(L)} ∈ R^{N×d_L},  [D̃^{(L+1)}]_i = (1/B) 1{i ∈ V_B} ∂Loss(h̃^{(L)}_i, y_i)/∂h̃^{(L)}_i ∈ R^{d_L}.  (46)

Notice that D̃^{(L+1)} is an N × d_L matrix in which only the rows with indices i ∈ V_B are non-zero vectors. Then we can write the gradient for the ℓth weight matrix in FullGCN and SGCN as

G^{(ℓ)} = ∇_W f^{(ℓ)}( ∇_H f^{(ℓ+1)}( . . . ∇_H f^{(L)}(D^{(L+1)}, H^{(L−1)}, W^{(L)}) . . . , H^{(ℓ)}, W^{(ℓ+1)} ), H^{(ℓ−1)}, W^{(ℓ)} ),
G̃^{(ℓ)} = ∇_W f̃^{(ℓ)}( ∇_H f̃^{(ℓ+1)}( . . . ∇_H f̃^{(L)}(D̃^{(L+1)}, H̃^{(L−1)}, W^{(L)}) . . . , H̃^{(ℓ)}, W^{(ℓ+1)} ), H̃^{(ℓ−1)}, W^{(ℓ)} ).  (47)
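As a sanity check of the notation above, the closed-form weight gradient G^{(ℓ)} = [L H^{(ℓ−1)}]^⊤ (D^{(ℓ+1)} ∘ σ′(Z^{(ℓ)})) can be verified numerically for a single layer. The sketch below is our illustration (not part of the paper's code): tanh stands in for the activation σ, and a simple squared-norm loss (1/2)‖σ(LXW)‖²_F stands in for Loss, so that D^{(L+1)} = H^{(L)}.

```python
import numpy as np

def sigma(z):
    return np.tanh(z)

def dsigma(z):
    return 1.0 - np.tanh(z) ** 2

def layer_grad(L, H_prev, W, D_next):
    # G^(l) = [L H^(l-1)]^T (D^(l+1) ∘ sigma'(Z^(l))), with Z^(l) = L H^(l-1) W^(l).
    Z = L @ H_prev @ W
    return (L @ H_prev).T @ (D_next * dsigma(Z))
```

A central finite-difference approximation of the loss gradient w.r.t. W matches this closed form to numerical precision, which is a quick way to confirm the transpose and Hadamard-product placement in Eq. (42).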

F.2 UPPER BOUNDS ON THE NODE EMBEDDING MATRICES AND LAYERWISE GRADIENTS

Based on Assumption 3, we first derive upper bounds on the node embedding matrices and on the gradient passed from the ℓth layer node embedding matrix to the (ℓ−1)th layer node embedding matrix.

Proposition 2 (Detailed version of Proposition 1). For any ℓ ∈ [L], the Frobenius norms of the node embedding matrices and of the gradients passed from the ℓth layer node embeddings to the (ℓ−1)th layer are bounded as

‖H^{(ℓ)}‖_F ≤ B_H,  ‖H̃^{(ℓ)}‖_F ≤ B_H,  ‖∂σ(L H^{(ℓ−1)} W^{(ℓ)})/∂H^{(ℓ−1)}‖_F ≤ B_D,  ‖∂σ(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)})/∂H̃^{(ℓ−1)}‖_F ≤ B_D,  (48)

where B_H = max_{ℓ∈[L]} (C_σ B_LA B_W)^ℓ ‖X‖_F and B_D = (B_LA C_σ B_W)^{L−ℓ} C_loss.  (49)

Proof. For the node embedding matrices we have

‖H^{(ℓ)}‖_F = ‖σ(L H^{(ℓ−1)} W^{(ℓ)})‖_F ≤ C_σ B_LA B_W ‖H^{(ℓ−1)}‖_F ≤ · · · ≤ (C_σ B_LA B_W)^ℓ ‖X‖_F ≤ B_H,  (50)
‖H̃^{(ℓ)}‖_F = ‖σ(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)})‖_F ≤ C_σ B_LA B_W ‖H̃^{(ℓ−1)}‖_F ≤ · · · ≤ (C_σ B_LA B_W)^ℓ ‖X‖_F ≤ B_H,  (51)

and for the gradients

‖D^{(ℓ)}‖_F = ‖[L]^⊤ ( D^{(ℓ+1)} ∘ σ′(L H^{(ℓ−1)} W^{(ℓ)}) ) [W^{(ℓ)}]^⊤‖_F ≤ B_LA C_σ B_W ‖D^{(ℓ+1)}‖_F ≤ (B_LA C_σ B_W)^{L−ℓ} ‖D^{(L+1)}‖_F ≤ (B_LA C_σ B_W)^{L−ℓ} C_loss,  (52)
‖D̃^{(ℓ)}‖_F = ‖[L̃^{(ℓ)}]^⊤ ( D̃^{(ℓ+1)} ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)}) ) [W^{(ℓ)}]^⊤‖_F ≤ B_LA C_σ B_W ‖D̃^{(ℓ+1)}‖_F ≤ (B_LA C_σ B_W)^{L−ℓ} ‖D̃^{(L+1)}‖_F ≤ (B_LA C_σ B_W)^{L−ℓ} C_loss.  (53)

Proposition 3. f̃^{(ℓ)}(·, ·) is C_H-Lipschitz continuous w.r.t. the input node embedding matrix, where C_H = C_σ B_LA B_W.

Proof.
‖f̃^{(ℓ)}(H^{(ℓ−1)}_1, W^{(ℓ)}) − f̃^{(ℓ)}(H^{(ℓ−1)}_2, W^{(ℓ)})‖_F = ‖σ(L̃^{(ℓ)} H^{(ℓ−1)}_1 W^{(ℓ)}) − σ(L̃^{(ℓ)} H^{(ℓ−1)}_2 W^{(ℓ)})‖_F ≤ C_σ ‖L̃^{(ℓ)}‖_F ‖H^{(ℓ−1)}_1 − H^{(ℓ−1)}_2‖_F ‖W^{(ℓ)}‖_F ≤ C_σ B_LA B_W ‖H^{(ℓ−1)}_1 − H^{(ℓ−1)}_2‖_F.  (54)

Proposition 4. f̃^{(ℓ)}(·, ·) is C_W-Lipschitz continuous w.r.t. the weight matrix, where C_W = C_σ B_LA B_H.

Proof.
‖f̃^{(ℓ)}(H^{(ℓ−1)}, W^{(ℓ)}_1) − f̃^{(ℓ)}(H^{(ℓ−1)}, W^{(ℓ)}_2)‖_F = ‖σ(L̃^{(ℓ)} H^{(ℓ−1)} W^{(ℓ)}_1) − σ(L̃^{(ℓ)} H^{(ℓ−1)} W^{(ℓ)}_2)‖_F ≤ C_σ ‖L̃^{(ℓ)}‖_F ‖H^{(ℓ−1)}‖_F ‖W^{(ℓ)}_1 − W^{(ℓ)}_2‖_F ≤ C_σ B_LA B_H ‖W^{(ℓ)}_1 − W^{(ℓ)}_2‖_F.  (55)

Proposition 5. ∇_H f̃^{(ℓ)}(·, ·, ·) is L_H-Lipschitz continuous, where

L_H = max{ B_LA C_σ B_W, B²_LA B_D B²_W L_σ, B_LA B_D C_σ + B²_LA B_D B_W L_σ B_H }.  (56)

Proof.
‖∇_H f̃^{(ℓ)}(D̃^{(ℓ+1)}_1, H̃^{(ℓ−1)}, W^{(ℓ)}) − ∇_H f̃^{(ℓ)}(D̃^{(ℓ+1)}_2, H̃^{(ℓ−1)}, W^{(ℓ)})‖_F
= ‖[L̃^{(ℓ)}]^⊤ ( D̃^{(ℓ+1)}_1 ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)}) ) [W^{(ℓ)}]^⊤ − [L̃^{(ℓ)}]^⊤ ( D̃^{(ℓ+1)}_2 ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)}) ) [W^{(ℓ)}]^⊤‖_F
≤ B_LA C_σ B_W ‖D̃^{(ℓ+1)}_1 − D̃^{(ℓ+1)}_2‖_F,  (57)

‖∇_H f̃^{(ℓ)}(D̃^{(ℓ+1)}, H̃^{(ℓ−1)}_1, W^{(ℓ)}) − ∇_H f̃^{(ℓ)}(D̃^{(ℓ+1)}, H̃^{(ℓ−1)}_2, W^{(ℓ)})‖_F
= ‖[L̃^{(ℓ)}]^⊤ ( D̃^{(ℓ+1)} ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)}_1 W^{(ℓ)}) ) [W^{(ℓ)}]^⊤ − [L̃^{(ℓ)}]^⊤ ( D̃^{(ℓ+1)} ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)}_2 W^{(ℓ)}) ) [W^{(ℓ)}]^⊤‖_F
≤ B²_LA B_D B²_W L_σ ‖H̃^{(ℓ−1)}_1 − H̃^{(ℓ−1)}_2‖_F,  (58)

‖∇_H f̃^{(ℓ)}(D̃^{(ℓ+1)}, H̃^{(ℓ−1)}, W^{(ℓ)}_1) − ∇_H f̃^{(ℓ)}(D̃^{(ℓ+1)}, H̃^{(ℓ−1)}, W^{(ℓ)}_2)‖_F
= ‖[L̃^{(ℓ)}]^⊤ ( D̃^{(ℓ+1)} ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)}_1) ) [W^{(ℓ)}_1]^⊤ − [L̃^{(ℓ)}]^⊤ ( D̃^{(ℓ+1)} ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)}_2) ) [W^{(ℓ)}_2]^⊤‖_F
≤ (B_LA B_D C_σ + B²_LA B_D B_W L_σ B_H) ‖W^{(ℓ)}_1 − W^{(ℓ)}_2‖_F.  (59)

Proposition 6. ∇_W f̃^{(ℓ)}(·, ·, ·) is L_W-Lipschitz continuous, where

L_W = max{ B_LA B_H C_σ, B²_LA B²_H B_D L_σ, B_LA B_D C_σ + B²_LA B²_H B_D L_σ }.  (60)

Proof.
‖∇_W f̃^{(ℓ)}(D̃^{(ℓ+1)}_1, H̃^{(ℓ−1)}, W^{(ℓ)}) − ∇_W f̃^{(ℓ)}(D̃^{(ℓ+1)}_2, H̃^{(ℓ−1)}, W^{(ℓ)})‖_F
= ‖[L̃^{(ℓ)} H̃^{(ℓ−1)}]^⊤ ( D̃^{(ℓ+1)}_1 ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)}) ) − [L̃^{(ℓ)} H̃^{(ℓ−1)}]^⊤ ( D̃^{(ℓ+1)}_2 ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)}) )‖_F
≤ B_LA B_H C_σ ‖D̃^{(ℓ+1)}_1 − D̃^{(ℓ+1)}_2‖_F,  (61)

‖∇_W f̃^{(ℓ)}(D̃^{(ℓ+1)}, H̃^{(ℓ−1)}_1, W^{(ℓ)}) − ∇_W f̃^{(ℓ)}(D̃^{(ℓ+1)}, H̃^{(ℓ−1)}_2, W^{(ℓ)})‖_F
= ‖[L̃^{(ℓ)} H̃^{(ℓ−1)}_1]^⊤ ( D̃^{(ℓ+1)} ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)}_1 W^{(ℓ)}) ) − [L̃^{(ℓ)} H̃^{(ℓ−1)}_2]^⊤ ( D̃^{(ℓ+1)} ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)}_2 W^{(ℓ)}) )‖_F
≤ (B²_LA B²_D C_σ + B²_LA B_H B_D L_σ B_W) ‖H̃^{(ℓ−1)}_1 − H̃^{(ℓ−1)}_2‖_F,  (62)

‖∇_W f̃^{(ℓ)}(D̃^{(ℓ+1)}, H̃^{(ℓ−1)}, W^{(ℓ)}_1) − ∇_W f̃^{(ℓ)}(D̃^{(ℓ+1)}, H̃^{(ℓ−1)}, W^{(ℓ)}_2)‖_F
= ‖[L̃^{(ℓ)} H̃^{(ℓ−1)}]^⊤ ( D̃^{(ℓ+1)} ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)}_1) ) − [L̃^{(ℓ)} H̃^{(ℓ−1)}]^⊤ ( D̃^{(ℓ+1)} ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)}_2) )‖_F
≤ B²_LA B²_H B_D L_σ ‖W^{(ℓ)}_1 − W^{(ℓ)}_2‖_F.  (63)

F.4 LIPSCHITZ CONTINUITY OF THE GRADIENT OF THE GRAPH CONVOLUTIONAL NETWORK

Let us first recall that the parameters and gradients of an L-layer GCN are defined as

θ = {W^{(1)}, . . . , W^{(L)}},  ∇L(θ) = {G^{(1)}, . . . , G^{(L)}},  (64)

where G^{(ℓ)} is the gradient w.r.t. the ℓth layer weight matrix. With a slight abuse of notation, we define the distance between two sets of parameters θ₁, θ₂ and between their gradients as

‖θ₁ − θ₂‖_F = Σ^L_{ℓ=1} ‖W^{(ℓ)}_1 − W^{(ℓ)}_2‖_F,  ‖∇L(θ₁) − ∇L(θ₂)‖_F = Σ^L_{ℓ=1} ‖G^{(ℓ)}_1 − G^{(ℓ)}_2‖_F.  (65)

Then, we derive the Lipschitz constant of the gradient of an L-layer graph convolutional network. Notice that the result below also holds for sampling-based GCN training.

Lemma 1. The gradient of an L-layer GCN is L_F-Lipschitz continuous with L_F = L(L U^max_L U^max_C + U^max_L), i.e.,

‖∇L(θ₁) − ∇L(θ₂)‖_F ≤ L_F ‖θ₁ − θ₂‖_F,  (66)

where U^max_C = max_{j∈{0,...,L−1}} C^j_H C_W.

Proof (final steps). By the Lipschitz continuity of f^{(ℓ)}(·, ·) we have

‖H^{(ℓ)}_1 − H^{(ℓ)}_2‖_F ≤ C^{ℓ−1}_H C_W ‖W^{(1)}_1 − W^{(1)}_2‖_F + · · · + C_W ‖W^{(ℓ)}_1 − W^{(ℓ)}_2‖_F ≤ max_{j∈{0,...,L−1}} C^j_H C_W Σ^L_{j=1} ‖W^{(j)}_1 − W^{(j)}_2‖_F.  (73)

Defining U^max_C = max_{j∈{0,...,L−1}} C^j_H C_W (74) and plugging it back, we have

‖G^{(ℓ)}_1 − G^{(ℓ)}_2‖_F ≤ (L U^max_L U^max_C + U^max_L) Σ^L_{j=1} ‖W^{(j)}_1 − W^{(j)}_2‖_F.  (75)

Summing both sides from ℓ = 1 to ℓ = L, we have

‖∇L(θ₁) − ∇L(θ₂)‖_F = Σ^L_{ℓ=1} ‖G^{(ℓ)}_1 − G^{(ℓ)}_2‖_F ≤ L(L U^max_L U^max_C + U^max_L) Σ^L_{j=1} ‖W^{(j)}_1 − W^{(j)}_2‖_F = L_F ‖θ₁ − θ₂‖_F.  (76)

G PROOF OF THEOREM 1

By the bias-variance decomposition, we can decompose the mean-square error of the stochastic gradient as

Σ^L_{ℓ=1} E[‖G̃^{(ℓ)} − G^{(ℓ)}‖²_F] = Σ^L_{ℓ=1} ( E[‖E[G̃^{(ℓ)}] − G^{(ℓ)}‖²_F] + E[‖G̃^{(ℓ)} − E[G̃^{(ℓ)}]‖²_F] ),

where the first term is the (squared) bias E[‖b‖²_F] and the second term is the variance E[‖n‖²_F]. Therefore, we have to explicitly define the computation of E[G̃^{(ℓ)}], which requires computing D̄^{(L+1)} = E[D̃^{(L+1)}], D̄^{(ℓ)} = E[D̃^{(ℓ)}], and Ḡ^{(ℓ)} = E[G̃^{(ℓ)}].
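The bias-variance decomposition above is an exact algebraic identity, and it also holds verbatim for an empirical sample when the expectation is replaced by an empirical mean. The quick numerical check below is our illustration, with random matrices standing in for stochastic layerwise gradients.

```python
import numpy as np

def mse_decomposition(samples, G):
    # mean_i ||G_i - G||_F^2  ==  ||mean(G_i) - G||_F^2   (squared bias)
    #                           + mean_i ||G_i - mean(G_i)||_F^2  (variance)
    mean = samples.mean(axis=0)
    bias2 = np.linalg.norm(mean - G) ** 2
    var = ((samples - mean) ** 2).sum(axis=(1, 2)).mean()
    mse = ((samples - G) ** 2).sum(axis=(1, 2)).mean()
    return mse, bias2 + var
```

Because the identity is exact (not an inequality), the two returned values agree up to floating-point error for any sample, which is why the proof can bound the bias and variance terms separately and then add the bounds.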
Let us define a general form of the sampled Laplacian matrix L̃^{(ℓ)} ∈ R^{N×N} as

L̃^{(ℓ)}_{i,j} = L_{i,j} / α_{i,j} if i ∈ B^{(ℓ)} and j ∈ B^{(ℓ−1)}, and 0 otherwise,

where α_{i,j} is a weighting constant that depends on the sampling algorithm. The expectation of L̃^{(ℓ)}_{i,j} is computed as

E[L̃^{(ℓ)}_{i,j}] = E_{i∈B^{(ℓ)}} [ E_{j∈B^{(ℓ−1)}} [ L̃^{(ℓ)}_{i,j} | i ∈ B^{(ℓ)} ] ].

In order to compute the expectation of SGCN's node embedding matrices, we define the propagation matrix P^{(ℓ)} ∈ R^{N×N} as

P^{(ℓ)}_{i,j} = E_{j∈B^{(ℓ−1)}} [ L̃^{(ℓ)}_{i,j} | i ∈ B^{(ℓ)} ],

i.e., the conditional expectation given that the ith node is in B^{(ℓ)}. Let us consider the mean-aggregation for the ith node,

x̃^{(ℓ)}_i = σ( Σ^N_{j=1} L̃^{(ℓ)}_{i,j} x̃^{(ℓ−1)}_j ).  (81)

Then, conditioned on the ith node being in B^{(ℓ)}, we can replace L̃^{(ℓ)}_{i,j} by P^{(ℓ)}_{i,j}, which gives

x̄^{(ℓ)}_i = σ( Σ^N_{j=1} P^{(ℓ)}_{i,j} x̄^{(ℓ−1)}_j ).  (82)

As a result, we can write the conditional expectation of x̃^{(ℓ)}_i as

E_{i∈B^{(ℓ)}} [ x̃^{(ℓ)}_i | i ∈ B^{(ℓ)} ] = σ( Σ^N_{j=1} P^{(ℓ)}_{i,j} x̄^{(ℓ−1)}_j ).  (83)

Then we define H̄^{(ℓ)} ∈ R^{N×d_ℓ} as the node embedding obtained by using the full batch but only a subset of neighbors for neighbor aggregation, i.e., H̄^{(ℓ)} = σ(P^{(ℓ)} H̄^{(ℓ−1)} W^{(ℓ)}), where all rows of H̄^{(ℓ)} are non-zero vectors. Using the notation defined above, we can compute D̄^{(L+1)} ∈ R^{N×d_L}, Ḡ^{(ℓ)} ∈ R^{d_{ℓ−1}×d_ℓ}, and D̄^{(ℓ)} ∈ R^{N×d_{ℓ−1}} as

D̄^{(L+1)} = E[ ∂Loss(H̄^{(L)}) / ∂H̄^{(L)} ] ∈ R^{N×d_L},  [D̄^{(L+1)}]_i = (1/N) ∂Loss(h̄^{(L)}_i, y_i) / ∂h̄^{(L)}_i ∈ R^{d_L},

D̄^{(ℓ)} = ∇_H f̄^{(ℓ)}(D̄^{(ℓ+1)}, H̄^{(ℓ−1)}, W^{(ℓ)}) := [L]^⊤ ( D̄^{(ℓ+1)} ∘ σ′(P^{(ℓ)} H̄^{(ℓ−1)} W^{(ℓ)}) ) [W^{(ℓ)}]^⊤,

Ḡ^{(ℓ)} = ∇_W f̄^{(ℓ)}(D̄^{(ℓ+1)}, H̄^{(ℓ−1)}, W^{(ℓ)}) := [L H̄^{(ℓ−1)}]^⊤ ( D̄^{(ℓ+1)} ∘ σ′(P^{(ℓ)} H̄^{(ℓ−1)} W^{(ℓ)}) ).

As a result, we can represent Ḡ^{(ℓ)} = E[G̃^{(ℓ)}] as

Ḡ^{(ℓ)} = ∇_W f̄^{(ℓ)}( ∇_H f̄^{(ℓ+1)}( . . . ∇_H f̄^{(L)}(D̄^{(L+1)}, H̄^{(L−1)}, W^{(L)}) . . . , H̄^{(ℓ)}, W^{(ℓ+1)} ), H̄^{(ℓ−1)}, W^{(ℓ)} ).

G.1 SUPPORTING LEMMAS

We derive upper bounds on the bias and variance of the stochastic gradient in the following lemmas.

Lemma 2 (Upper bound on the variance). The variance of the stochastic gradient in SGCN satisfies

Σ^L_{ℓ=1} E[‖G̃^{(ℓ)} − E[G̃^{(ℓ)}]‖²_F] ≤ Σ^L_{ℓ=1} ( O(E[‖L̃^{(ℓ)} − P^{(ℓ)}‖²_F]) + O(‖P^{(ℓ)} − L‖²_F) ).

Proof. By definition, G̃^{(ℓ)} and its expectation Ḡ^{(ℓ)} are the same composition of layer functions, evaluated at (D̃^{(L+1)}, H̃^{(L−1)}, . . . , H̃^{(ℓ−1)}) and at (D̄^{(L+1)}, H̄^{(L−1)}, . . . , H̄^{(ℓ−1)}), respectively. Replacing these arguments one at a time, applying ‖Σ^{L+1}_{k=1} a_k‖² ≤ (L+1) Σ_k ‖a_k‖², and using the Lipschitz continuity of ∇_H f̃^{(ℓ)} and ∇_W f̃^{(ℓ)} (Propositions 5 and 6), we obtain

E[‖G̃^{(ℓ)} − E[G̃^{(ℓ)}]‖²_F] ≤ (L+1) L²_W L^{2(L−ℓ−1)}_H E[‖D̃^{(L+1)} − D̄^{(L+1)}‖²_F]
+ (L+1) L²_W L^{2(L−ℓ−2)}_H E[‖∇_H f̃^{(L)}(D̄^{(L+1)}, H̃^{(L−1)}, W^{(L)}) − ∇_H f̄^{(L)}(D̄^{(L+1)}, H̄^{(L−1)}, W^{(L)})‖²_F] + · · ·
+ (L+1) L²_W E[‖∇_H f̃^{(ℓ+1)}(D̄^{(ℓ+2)}, H̃^{(ℓ)}, W^{(ℓ+1)}) − ∇_H f̄^{(ℓ+1)}(D̄^{(ℓ+2)}, H̄^{(ℓ)}, W^{(ℓ+1)})‖²_F]
+ (L+1) E[‖∇_W f̃^{(ℓ)}(D̄^{(ℓ+1)}, H̃^{(ℓ−1)}, W^{(ℓ)}) − ∇_W f̄^{(ℓ)}(D̄^{(ℓ+1)}, H̄^{(ℓ−1)}, W^{(ℓ)})‖²_F].

From this decomposition, three key factors affect the variance:

• the difference of the gradients w.r.t. the last-layer node representations, E[‖D̃^{(L+1)} − D̄^{(L+1)}‖²_F];  (91)
• the difference of the gradients w.r.t. the input node embedding matrix at each graph convolutional layer, E[‖∇_H f̃^{(ℓ+1)}(D̄^{(ℓ+2)}, H̃^{(ℓ)}, W^{(ℓ+1)}) − ∇_H f̄^{(ℓ+1)}(D̄^{(ℓ+2)}, H̄^{(ℓ)}, W^{(ℓ+1)})‖²_F];  (92)
• the difference of the gradients w.r.t. the weight matrix at each graph convolutional layer, E[‖∇_W f̃^{(ℓ)}(D̄^{(ℓ+1)}, H̃^{(ℓ−1)}, W^{(ℓ)}) − ∇_W f̄^{(ℓ)}(D̄^{(ℓ+1)}, H̄^{(ℓ−1)}, W^{(ℓ)})‖²_F].  (93)

First, let us consider the upper bound of Eq. (91):

E[‖D̃^{(L+1)} − D̄^{(L+1)}‖²_F] = E[‖∂Loss(H̃^{(L)}, y)/∂H̃^{(L)} − ∂Loss(H̄^{(L)}, y)/∂H̄^{(L)}‖²_F]
≤ L²_loss E[‖H̃^{(L)} − H̄^{(L)}‖²_F]
≤ L²_loss E[‖σ(L̃^{(L)} H̃^{(L−1)} W^{(L)}) − σ(P^{(L)} H̄^{(L−1)} W^{(L)})‖²_F]
≤ L²_loss C²_σ B²_W E[‖L̃^{(L)} H̃^{(L−1)} − P^{(L)} H̄^{(L−1)}‖²_F]
≤ L²_loss C²_σ B²_W B²_H E[‖L̃^{(L)} − P^{(L)}‖²_F].

Then, let us consider the upper bound of Eq. (92):

E[‖∇_H f̃^{(ℓ)}(D̄^{(ℓ+1)}, H̃^{(ℓ−1)}, W^{(ℓ)}) − ∇_H f̄^{(ℓ)}(D̄^{(ℓ+1)}, H̄^{(ℓ−1)}, W^{(ℓ)})‖²_F]
= E[‖[L̃^{(ℓ)}]^⊤ ( D̄^{(ℓ+1)} ∘ σ′(L̃^{(ℓ)} H̃^{(ℓ−1)} W^{(ℓ)}) ) [W^{(ℓ)}]^⊤ − [L]^⊤ ( D̄^{(ℓ+1)} ∘ σ′(P^{(ℓ)} H̄^{(ℓ−1)} W^{(ℓ)}) ) [W^{(ℓ)}]^⊤‖²_F]
≤ 2 B²_LA B²_D B⁴_W L²_σ E[‖L̃^{(ℓ)} H̃^{(ℓ−1)} − P^{(ℓ)} H̄^{(ℓ−1)}‖²_F] + 2 B²_D C²_σ B²_W E[‖L̃^{(ℓ)} − L‖²_F]
≤ 2 ( B²_LA B²_D B²_H B⁴_W L²_σ + 2 B²_D C²_σ B²_W ) E[‖L̃^{(ℓ)} − P^{(ℓ)}‖²_F] + 4 B²_D C²_σ B²_W E[‖P^{(ℓ)} − L‖²_F]
≤ O(E[‖L̃^{(ℓ)} − P^{(ℓ)}‖²_F]) + O(E[‖P^{(ℓ)} − L‖²_F]).

Finally, let us consider the upper bound of Eq. (93):

E[‖∇_W f̃^{(ℓ)}(D̄^{(ℓ+1)}, H̃^{(ℓ−1)}, W^{(ℓ)}) − ∇_W f̄^{(ℓ)}(D̄^{(ℓ+1)}, H̄^{(ℓ−1)}, W^{(ℓ)})‖²_F]
≤ 2 B²_LA B²_H B²_D B²_W L²_σ E[‖L̃^{(ℓ)} H̃^{(ℓ−1)} − P^{(ℓ)} H̄^{(ℓ−1)}‖²_F] + 2 B²_D C²_σ E[‖L̃^{(ℓ)} H̃^{(ℓ−1)} − L H̄^{(ℓ−1)}‖²_F]
≤ 2 ( B²_LA B⁴_H B²_D B²_W L²_σ + B²_H B²_D C²_σ ) E[‖L̃^{(ℓ)} − P^{(ℓ)}‖²_F] + 2 B²_D B²_H C²_σ E[‖P^{(ℓ)} − L‖²_F]
≤ O(E[‖L̃^{(ℓ)} − P^{(ℓ)}‖²_F]) + O(E[‖P^{(ℓ)} − L‖²_F]).

Combining the results from Eqs. (91), (92), and (93), we have

E[‖G̃^{(ℓ)} − E[G̃^{(ℓ)}]‖²_F] ≤ O(E[‖L̃^{(ℓ)} − P^{(ℓ)}‖²_F]) + · · · + O(E[‖L̃^{(L)} − P^{(L)}‖²_F]) + O(E[‖P^{(ℓ)} − L‖²_F]) + · · · + O(E[‖P^{(L)} − L‖²_F]).

Lemma 3 (Upper bound on the bias).
We can upper-bound the bias of the stochastic gradient in SGCN as

Σ^L_{ℓ=1} E[‖E[G̃^{(ℓ)}] − G^{(ℓ)}‖²_F] ≤ Σ^L_{ℓ=1} O(‖P^{(ℓ)} − L‖²_F).

Proof. Analogously to Lemma 2, Ḡ^{(ℓ)} = E[G̃^{(ℓ)}] and the full gradient G^{(ℓ)} are the same composition of layer functions, evaluated at (D̄^{(L+1)}, H̄^{(L−1)}, . . . , H̄^{(ℓ−1)}) and at (D^{(L+1)}, H^{(L−1)}, . . . , H^{(ℓ−1)}), respectively. Replacing the arguments one at a time and applying Propositions 5 and 6 gives

E[‖E[G̃^{(ℓ)}] − G^{(ℓ)}‖²_F] ≤ (L+1) L²_W L^{2(L−ℓ−1)}_H E[‖D̄^{(L+1)} − D^{(L+1)}‖²_F]
+ (L+1) L²_W L^{2(L−ℓ−2)}_H E[‖∇_H f^{(L)}(D^{(L+1)}, H̄^{(L−1)}, W^{(L)}) − ∇_H f^{(L)}(D^{(L+1)}, H^{(L−1)}, W^{(L)})‖²_F] + · · ·
+ (L+1) L²_W E[‖∇_H f^{(ℓ+1)}(D^{(ℓ+2)}, H̄^{(ℓ)}, W^{(ℓ+1)}) − ∇_H f^{(ℓ+1)}(D^{(ℓ+2)}, H^{(ℓ)}, W^{(ℓ+1)})‖²_F]
+ (L+1) E[‖∇_W f^{(ℓ)}(D^{(ℓ+1)}, H̄^{(ℓ−1)}, W^{(ℓ)}) − ∇_W f^{(ℓ)}(D^{(ℓ+1)}, H^{(ℓ−1)}, W^{(ℓ)})‖²_F].

From this decomposition, three key factors affect the bias:

• the difference of the gradients w.r.t. the last-layer node representations, E[‖D̄^{(L+1)} − D^{(L+1)}‖²_F];  (100)
• the difference of the gradients w.r.t. the input node embedding matrix at each graph convolutional layer, E[‖∇_H f^{(ℓ+1)}(D^{(ℓ+2)}, H̄^{(ℓ)}, W^{(ℓ+1)}) − ∇_H f^{(ℓ+1)}(D^{(ℓ+2)}, H^{(ℓ)}, W^{(ℓ+1)})‖²_F];  (101)
• the difference of the gradients w.r.t. the weight matrix at each graph convolutional layer, E[‖∇_W f^{(ℓ)}(D^{(ℓ+1)}, H̄^{(ℓ−1)}, W^{(ℓ)}) − ∇_W f^{(ℓ)}(D^{(ℓ+1)}, H^{(ℓ−1)}, W^{(ℓ)})‖²_F].  (102)

First, let us consider the upper bound of Eq. (100). We have

E[‖D̄^{(L+1)} − D^{(L+1)}‖²_F] = E[‖∂Loss(H̄^{(L)}, y)/∂H̄^{(L)} − ∂Loss(H^{(L)}, y)/∂H^{(L)}‖²_F] ≤ L²_loss E[‖H̄^{(L)} − H^{(L)}‖²_F],

and the embedding error satisfies the recursion

‖H̄^{(ℓ)} − H^{(ℓ)}‖²_F = ‖σ(P^{(ℓ)} H̄^{(ℓ−1)} W^{(ℓ)}) − σ(L H^{(ℓ−1)} W^{(ℓ)})‖²_F
≤ C²_σ B²_W ‖P^{(ℓ)} H̄^{(ℓ−1)} − L H̄^{(ℓ−1)} + L H̄^{(ℓ−1)} − L H^{(ℓ−1)}‖²_F
≤ 2 C²_σ B²_W B²_H ‖P^{(ℓ)} − L‖²_F + 2 C²_σ B²_W B²_LA ‖H̄^{(ℓ−1)} − H^{(ℓ−1)}‖²_F
≤ O(‖P^{(1)} − L‖²_F) + · · · + O(‖P^{(ℓ)} − L‖²_F),

so that E[‖D̄^{(L+1)} − D^{(L+1)}‖²_F] ≤ O(‖P^{(1)} − L‖²_F) + · · · + O(‖P^{(L)} − L‖²_F).

Then, let us consider the upper bound of Eq. (101):

E[‖∇_H f̄^{(ℓ)}(D^{(ℓ+1)}, H̄^{(ℓ−1)}, W^{(ℓ)}) − ∇_H f^{(ℓ)}(D^{(ℓ+1)}, H^{(ℓ−1)}, W^{(ℓ)})‖²_F]
= E[‖[L]^⊤ ( D^{(ℓ+1)} ∘ σ′(P^{(ℓ)} H̄^{(ℓ−1)} W^{(ℓ)}) ) [W^{(ℓ)}]^⊤ − [L]^⊤ ( D^{(ℓ+1)} ∘ σ′(L H^{(ℓ−1)} W^{(ℓ)}) ) [W^{(ℓ)}]^⊤‖²_F]
≤ 2 B²_LA B²_D B⁴_W L²_σ B²_H E[‖P^{(ℓ)} − L‖²_F] + 2 B⁴_LA B²_D B⁴_W L²_σ E[‖H̄^{(ℓ−1)} − H^{(ℓ−1)}‖²_F]
≤ O(‖P^{(1)} − L‖²_F) + · · · + O(‖P^{(ℓ)} − L‖²_F).

Finally, let us consider the upper bound of Eq. (102).
E[ ∇ W f ( ) (D ( +1) , H( -1) , W ( ) ) -∇ W f ( ) (D ( +1) , H ( -1) , W ( ) ) 2 F ] = E[ [L H( -1) ] D ( +1) • σ (P ( ) H( -1) W ( ) ) -[LH ( -1) ] D ( +1) • σ (L ( ) H ( -1) W ( ) ) 2 F ] ≤ 2E[ [L H( -1) ] D ( +1) • σ (P ( ) H( -1) W ( ) ) -[LH ( -1) ] D ( +1) • σ (P ( ) H( -1) W ( ) ) 2 F ] + 2E[ [LH ( -1) ] D ( +1) • σ (P ( ) H( -1) W ( ) ) -[LH ( -1) ] D ( +1) • σ (L ( ) H ( -1) W ( ) ) 2 F ] ≤ 2B 2 D C 2 σ B 2 LA E[ H( -1) -H ( -1) 2 F ] + 2B 2 LA B 2 H B 2 D L 2 σ B 2 W E[ P ( ) H( -1) -L H( -1) + L H( -1) -LH ( -1) 2 F ] ≤ 2 B 2 D C 2 σ B 2 LA + B 4 LA B 2 H B 2 D L 2 σ B 2 W E[ H( -1) -H ( -1) 2 F ] + 2B 2 LA B 4 H B 2 D L 2 σ B 2 W E[ P ( ) -L 2 F ] ≤ O( P (1) -L 2 F ) + . . . + O( P ( ) -L 2 F ) Combining the result from Eq. 100, 101, 102 we have E[ E[ G ( ) ] -G ( ) 2 F ] ≤ O(E[ P (1) -L 2 F ]) + . . . + O(E[ P (L) -L 2 F ]) G.2 REMAINING STEPS TOWARD THEOREM 1 By the smoothness of L(θ t ), we have L(θ t+1 ) ≤ L(θ t ) + ∇L(θ t ), θ t+1 -θ t + L f 2 θ t+1 -θ t 2 F = L(θ t ) -η ∇L(θ t ), ∇ L(θ t ) + η 2 L f 2 ∇ L(θ t ) 2 F (109) Let F t = {{B ( ) 1 } L =1 , . . . , {B ( ) t-1 } L =1 }. Note that the weight parameters θ t is a function of history of the generated random process and hence is random. Taking expectation on both sides condition on F t and using η < 1/L f we have E[L(θ t+1 )|F t ] ≤ L(θ t ) -η ∇L(θ t ), E[∇ L(θ t )|F t ] + η 2 L f 2 E[ ∇ L(θ t ) -E[∇ L(θ t )|F t ] 2 F |F t ] + E[ E[∇ L(θ t )|F t ] 2 F |F t ] = L(θ t ) -η ∇L(θ t ), ∇L(θ t ) + E[b t |F t ] + η 2 L f 2 E[ n t 2 F |F t ] + ∇L(θ t ) + E[b t |F t ] 2 F ≤ L(θ t ) + η 2 -2 ∇L(θ t ), ∇L(θ t ) + E[b t |F t ] + ∇L(θ t ) + E[b t |F t ] 2 F + η 2 L f 2 E[ n t 2 F |F t ] = L(θ t ) + η 2 -∇L(θ t ) 2 F + E[ b t 2 F |F t ] + η 2 L f 2 E[ n t 2 F |F t ] Denote ∆ b as the upper bound of bias of stochasitc gradient as shown in Lemma 3 and Denote ∆ n as the upper bound of bias of stochastic gradient as shown in Lemma 2. 
Plugging in the upper bounds on the bias and variance, taking expectation over $\mathcal{F}_t$, and rearranging terms, we have
$$\mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t)\|_F^2\big] \le \frac{2}{\eta}\Big(\mathbb{E}[\mathcal{L}(\theta_t)] - \mathbb{E}[\mathcal{L}(\theta_{t+1})]\Big) + \eta L_f \Delta_n + \Delta_b.$$
Summing from $t = 1$ to $T$ and rearranging, we have
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t)\|_F^2\big] \le \frac{2}{\eta T}\sum_{t=1}^{T}\Big(\mathbb{E}[\mathcal{L}(\theta_t)] - \mathbb{E}[\mathcal{L}(\theta_{t+1})]\Big) + \eta L_f \Delta_n + \Delta_b \overset{(a)}{\le} \frac{2}{\eta T}\big(\mathcal{L}(\theta_1) - \mathcal{L}(\theta^\star)\big) + \eta L_f \Delta_n + \Delta_b,$$
where inequality (a) is due to $\mathcal{L}(\theta^\star) \le \mathbb{E}[\mathcal{L}(\theta_{T+1})]$. Selecting the learning rate as $\eta = 1/\sqrt{T}$, we have
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t)\|_F^2\big] \le \frac{2\big(\mathcal{L}(\theta_1) - \mathcal{L}(\theta^\star)\big)}{\sqrt{T}} + \frac{L_f \Delta_n}{\sqrt{T}} + \Delta_b.$$

H PROOF OF THEOREM 2

H.1 SUPPORTING LEMMAS

In the following lemma, we derive the upper bound on the node embedding approximation error of each GCN layer in SGCN+. This upper bound plays an important role in the analysis of the upper bound on the bias of the stochastic gradient. Given the input node embedding matrix $\widetilde H^{(\ell-1)}_t$, the forward propagation for the $\ell$th layer in SGCN+ is defined as
$$\tilde f^{(\ell)}(\widetilde H^{(\ell-1)}_t, W^{(\ell)}) = \widetilde Z^{(\ell)}_t := \widetilde P^{(\ell)} \widetilde H^{(\ell-1)}_t W^{(\ell)},$$
and the forward propagation for the $\ell$th layer in FullGCN is defined as
$$f^{(\ell)}(\widetilde H^{(\ell-1)}_t, W^{(\ell)}) = L \widetilde H^{(\ell-1)}_t W^{(\ell)}.$$
In the following, we derive the upper bound of
$$\mathbb{E}\big[\|f^{(\ell)}(\widetilde H^{(\ell-1)}_t, W^{(\ell)}) - \tilde f^{(\ell)}(\widetilde H^{(\ell-1)}_t, W^{(\ell)})\|_F^2\big] = \mathbb{E}\big[\|L \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde Z^{(\ell)}_t\|_F^2\big].$$
Lemma 4. Let $E \in [0, T/K - 1]$ denote the current epoch and let $t$ be the current step. Then, for any $t \in \{EK+1, \ldots, EK+K\}$, we have
$$\mathbb{E}\big[\|L \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde Z^{(\ell)}_t\|_F^2\big] \le \eta^2 K \times O\Big(\big|\mathbb{E}[\|\widetilde P^{(\ell)}\|_F^2] - \|L\|_F^2\big| + \ldots + \big|\mathbb{E}[\|\widetilde P^{(1)}\|_F^2] - \|L\|_F^2\big|\Big).$$
Proof.
$$\|L \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde Z^{(\ell)}_t\|_F^2 = \big\|[L \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - L \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1}] + [L \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1} - \widetilde Z^{(\ell)}_{t-1}] - [\widetilde Z^{(\ell)}_t - \widetilde Z^{(\ell)}_{t-1}]\big\|_F^2,$$
which expands into the three squared norms plus the three corresponding cross terms. Recall that by the update rule, we have
$$\widetilde Z^{(\ell)}_t - \widetilde Z^{(\ell)}_{t-1} = \widetilde P^{(\ell)} \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde P^{(\ell)} \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1}, \qquad \mathbb{E}\big[\widetilde Z^{(\ell)}_t - \widetilde Z^{(\ell)}_{t-1}\,\big|\,\mathcal{F}_t\big] = L \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - L \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1}.$$
Taking expectation of the expansion conditioned on $\mathcal{F}_t$, we have
$$\mathbb{E}\big[\|L \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde Z^{(\ell)}_t\|_F^2\,\big|\,\mathcal{F}_t\big] \le \|L \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1} - \widetilde Z^{(\ell)}_{t-1}\|_F^2 + \mathbb{E}\big[\|\widetilde Z^{(\ell)}_t - \widetilde Z^{(\ell)}_{t-1}\|_F^2\,\big|\,\mathcal{F}_t\big] - \|L \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - L \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1}\|_F^2, \quad (121)$$
and then, taking expectation over $\mathcal{F}_t$,
$$\mathbb{E}\big[\|L \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde Z^{(\ell)}_t\|_F^2\big] \le \mathbb{E}\big[\|L \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1} - \widetilde Z^{(\ell)}_{t-1}\|_F^2\big] + \mathbb{E}\big[\|\widetilde Z^{(\ell)}_t - \widetilde Z^{(\ell)}_{t-1}\|_F^2\big] - \mathbb{E}\big[\|L \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - L \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1}\|_F^2\big]. \quad (122)$$
Since $t \in \{EK+1, \ldots, EK+K\}$, we can write $t = EK + k$ with $k \le K$ and unroll the recursion:
$$\mathbb{E}\big[\|L \widetilde H^{(\ell-1)}_{EK+k} W^{(\ell)}_{EK+k} - \widetilde Z^{(\ell)}_{EK+k}\|_F^2\big] = \underbrace{\mathbb{E}\big[\|L \widetilde H^{(\ell-1)}_{EK} W^{(\ell)}_{EK} - \widetilde Z^{(\ell)}_{EK}\|_F^2\big]}_{(A)} + \sum_{t=EK+1}^{EK+K}\Big(\mathbb{E}\big[\|\widetilde Z^{(\ell)}_t - \widetilde Z^{(\ell)}_{t-1}\|_F^2\big] - \mathbb{E}\big[\|L \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - L \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1}\|_F^2\big]\Big). \quad (123)$$
Knowing that all neighbors are used at the snapshot step $(t \bmod K) = 0$, we have $(A) = 0$; the error is thus bounded by the summand, denoted term (B). For term (B),
$$\mathbb{E}\big[\|\widetilde Z^{(\ell)}_t - \widetilde Z^{(\ell)}_{t-1}\|_F^2\big] - \mathbb{E}\big[\|L \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - L \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1}\|_F^2\big] = \mathbb{E}\big[\|\widetilde P^{(\ell)}(\widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1})\|_F^2\big] - \|L(\widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1})\|_F^2 \le \big(\mathbb{E}[\|\widetilde P^{(\ell)}\|_F^2] - \|L\|_F^2\big)\, \underbrace{\mathbb{E}\big[\|\widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1}\|_F^2\big]}_{(C)}.$$
For term (C),
$$\mathbb{E}\big[\|\widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1}\|_F^2\big] \le 2 B_H^2\, \mathbb{E}\big[\|W^{(\ell)}_t - W^{(\ell)}_{t-1}\|_F^2\big] + 2\, \mathbb{E}\big[\|\widetilde H^{(\ell-1)}_t W^{(\ell)}_{t-1} - \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1}\|_F^2\big],$$
and by induction over layers,
$$\mathbb{E}\big[\|\widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1}\|_F^2\big] \le 2 B_H^2\, \mathbb{E}\big[\|W^{(\ell)}_t - W^{(\ell)}_{t-1}\|_F^2\big] + 2^2 B_H^4\, \mathbb{E}\big[\|W^{(\ell-1)}_t - W^{(\ell-1)}_{t-1}\|_F^2\big] + \ldots + 2^{\ell} B_H^{2\ell}\, \mathbb{E}\big[\|W^{(1)}_t - W^{(1)}_{t-1}\|_F^2\big].$$
By the update rule of the weight matrices, we know $\mathbb{E}[\|W^{(\ell)}_t - W^{(\ell)}_{t-1}\|_F^2] = \eta^2\, \mathbb{E}[\|\bar G^{(\ell)}_{t-1}\|_F^2]$. Therefore,
$$\mathbb{E}\big[\|L \widetilde H^{(\ell-1)}_{EK+k} W^{(\ell)}_{EK+k} - \widetilde Z^{(\ell)}_{EK+k}\|_F^2\big] \le \sum_{t=EK+1}^{EK+K} \eta^2\, O\Big(\big|\mathbb{E}[\|\widetilde P^{(\ell)}\|_F^2] - \|L\|_F^2\big| \times \mathbb{E}[\|\bar G^{(\ell)}_{t-1}\|_F^2] + \ldots + \big|\mathbb{E}[\|\widetilde P^{(1)}\|_F^2] - \|L\|_F^2\big| \times \mathbb{E}[\|\bar G^{(1)}_{t-1}\|_F^2]\Big). \quad (129)$$
By the definition of $\bar G^{(\ell)}_{t-1}$, we have $\mathbb{E}[\|\bar G^{(\ell)}_{t-1}\|_F^2] \le B_{LA}^2 B_H^2 B_D^2 C_\sigma^2$. Plugging this back, we have
$$\mathbb{E}\big[\|L \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde Z^{(\ell)}_t\|_F^2\big] \le \eta^2 K \times O\Big(\big|\mathbb{E}[\|\widetilde P^{(\ell)}\|_F^2] - \|L\|_F^2\big| + \ldots + \big|\mathbb{E}[\|\widetilde P^{(1)}\|_F^2] - \|L\|_F^2\big|\Big).$$
Based on this upper bound on the node embedding approximation error of each graph convolutional layer, we derive the upper bound on the bias of the stochastic gradient in SGCN+.

Lemma 5 (Upper bound on bias). The bias of the stochastic gradient in SGCN+ can be upper-bounded as
$$\sum_{\ell=1}^{L} \mathbb{E}\big[\|\mathbb{E}[\widetilde G^{(\ell)}] - G^{(\ell)}\|_F^2\big] \le \eta^2 K \sum_{\ell=1}^{L} O\Big(\big|\mathbb{E}[\|\widetilde P^{(\ell)}\|_F^2] - \|L\|_F^2\big|\Big).$$
Proof. From the decomposition of the bias shown previously in Eq. 108, we have the same $(L+1)$-term bound as in the proof of Lemma 3, namely
$$\mathbb{E}\big[\|\mathbb{E}[\widetilde G^{(\ell)}] - G^{(\ell)}\|_F^2\big] \le (L+1)L_W^2 L_H^{2(L-\ell-1)}\, \mathbb{E}\big[\|\widetilde D^{(L+1)} - D^{(L+1)}\|_F^2\big] + (L+1)L_W^2 L_H^{2(L-\ell-2)}\, \mathbb{E}\big[\|\nabla_H f^{(L)}(D^{(L+1)}, \widetilde H^{(L-1)}, W^{(L)}) - \nabla_H f^{(L)}(D^{(L+1)}, H^{(L-1)}, W^{(L)})\|_F^2\big] + \ldots + (L+1)L_W^2\, \mathbb{E}\big[\|\nabla_H f^{(\ell+1)}(D^{(\ell+2)}, \widetilde H^{(\ell)}, W^{(\ell+1)}) - \nabla_H f^{(\ell+1)}(D^{(\ell+2)}, H^{(\ell)}, W^{(\ell+1)})\|_F^2\big] + (L+1)\, \mathbb{E}\big[\|\nabla_W f^{(\ell)}(D^{(\ell+1)}, \widetilde H^{(\ell-1)}, W^{(\ell)}) - \nabla_W f^{(\ell)}(D^{(\ell+1)}, H^{(\ell-1)}, W^{(\ell)})\|_F^2\big],$$
so the same three key factors affect the bias: the difference of the gradient with respect to the last-layer node representations (Eq. 134), the difference of the gradient with respect to the input node embedding matrix at each graph convolutional layer (Eq. 135), and the difference of the gradient with respect to the weight matrix at each graph convolutional layer (Eq. 136).

First, let us consider the upper bound of Eq. 134:
$$\mathbb{E}\big[\|\widetilde D^{(L+1)} - D^{(L+1)}\|_F^2\big] = \mathbb{E}\Big[\Big\|\frac{\partial\, \mathrm{Loss}(\widetilde H^{(L)}, y)}{\partial \widetilde H^{(L)}} - \frac{\partial\, \mathrm{Loss}(H^{(L)}, y)}{\partial H^{(L)}}\Big\|_F^2\Big] \le L_{\mathrm{loss}}^2\, \mathbb{E}\big[\|\widetilde H^{(L)} - H^{(L)}\|_F^2\big] \le L_{\mathrm{loss}}^2 C_\sigma^2\, \mathbb{E}\big[\|\widetilde Z^{(L)} - Z^{(L)}\|_F^2\big].$$
We can decompose $\mathbb{E}[\|\widetilde Z^{(L)} - Z^{(L)}\|_F^2]$ as
$$\mathbb{E}\big[\|\widetilde Z^{(L)} - Z^{(L)}\|_F^2\big] = \mathbb{E}\big[\|\widetilde Z^{(L)} - L H^{(L-1)} W^{(L)}\|_F^2\big] \le 2\, \mathbb{E}\big[\|\widetilde Z^{(L)} - L \widetilde H^{(L-1)} W^{(L)}\|_F^2\big] + 2\, \mathbb{E}\big[\|L \widetilde H^{(L-1)} W^{(L)} - L H^{(L-1)} W^{(L)}\|_F^2\big] \le 2\, \mathbb{E}\big[\|\widetilde Z^{(L)} - L \widetilde H^{(L-1)} W^{(L)}\|_F^2\big] + 2 B_{LA}^2 B_W^2 C_\sigma^2\, \mathbb{E}\big[\|\widetilde Z^{(L-1)} - Z^{(L-1)}\|_F^2\big] \le \sum_{\ell=1}^{L} O\big(\mathbb{E}[\|\widetilde Z^{(\ell)} - L \widetilde H^{(\ell-1)} W^{(\ell)}\|_F^2]\big).$$
Using the result from Lemma 4, we have
$$\mathbb{E}\big[\|\widetilde D^{(L+1)} - D^{(L+1)}\|_F^2\big] \le \sum_{\ell=1}^{L} \eta^2 K \times O\Big(\big|\mathbb{E}[\|\widetilde P^{(\ell)}\|_F^2] - \|L\|_F^2\big|\Big).$$
Next, let us consider the upper bound of Eq. 135.
$$\mathbb{E}\big[\|\nabla_H f^{(\ell)}(D^{(\ell+1)}, \widetilde H^{(\ell-1)}, W^{(\ell)}) - \nabla_H f^{(\ell)}(D^{(\ell+1)}, H^{(\ell-1)}, W^{(\ell)})\|_F^2\big] = \mathbb{E}\big[\|[L]^\top \big(D^{(\ell+1)} \circ \sigma'(\widetilde P^{(\ell)} \widetilde H^{(\ell-1)} W^{(\ell)})\big)[W^{(\ell)}]^\top - [L]^\top \big(D^{(\ell+1)} \circ \sigma'(L H^{(\ell-1)} W^{(\ell)})\big)[W^{(\ell)}]^\top\|_F^2\big] \le B_{LA}^2 B_D^2 B_W^2 L_\sigma^2\, \mathbb{E}\big[\|\widetilde Z^{(\ell)}_t - L H^{(\ell-1)} W^{(\ell)}\|_F^2\big] \le 2 B_{LA}^2 B_D^2 B_W^2 L_\sigma^2\, \mathbb{E}\big[\|\widetilde Z^{(\ell)}_t - L \widetilde H^{(\ell-1)} W^{(\ell)}\|_F^2\big] + 2 B_{LA}^2 B_D^2 B_W^2 L_\sigma^2\, \underbrace{\mathbb{E}\big[\|L \widetilde H^{(\ell-1)} W^{(\ell)} - L H^{(\ell-1)} W^{(\ell)}\|_F^2\big]}_{(A)}, \quad (140)$$
where $\widetilde Z^{(\ell)}_t = \widetilde Z^{(\ell)}_{t-1} + \widetilde P^{(\ell)} \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde P^{(\ell)} \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1}$. Taking a closer look at term (A), we have
$$\mathbb{E}\big[\|L \widetilde H^{(\ell-1)} W^{(\ell)} - L H^{(\ell-1)} W^{(\ell)}\|_F^2\big] \le B_{LA}^2 B_W^2 C_\sigma^2\, \mathbb{E}\big[\|\widetilde Z^{(\ell-1)} - L H^{(\ell-2)} W^{(\ell-1)}\|_F^2\big] \le 2 B_{LA}^2 B_W^2 C_\sigma^2\, \mathbb{E}\big[\|\widetilde Z^{(\ell-1)} - L \widetilde H^{(\ell-2)} W^{(\ell-1)}\|_F^2\big] + 2 B_{LA}^2 B_W^2 C_\sigma^2\, \mathbb{E}\big[\|L \widetilde H^{(\ell-2)} W^{(\ell-1)} - L H^{(\ell-2)} W^{(\ell-1)}\|_F^2\big].$$
Therefore, by induction we have
$$\mathbb{E}\big[\|\nabla_H f^{(\ell)}(D^{(\ell+1)}, \widetilde H^{(\ell-1)}, W^{(\ell)}) - \nabla_H f^{(\ell)}(D^{(\ell+1)}, H^{(\ell-1)}, W^{(\ell)})\|_F^2\big] \le O\big(\mathbb{E}[\|\widetilde Z^{(\ell)} - L \widetilde H^{(\ell-1)} W^{(\ell)}\|_F^2]\big) + O\big(\mathbb{E}[\|\widetilde Z^{(\ell-1)} - L \widetilde H^{(\ell-2)} W^{(\ell-1)}\|_F^2]\big) + \ldots + O\big(\mathbb{E}[\|\widetilde Z^{(2)} - L \widetilde H^{(1)} W^{(2)}\|_F^2]\big) + O\big(\mathbb{E}[\|\widetilde Z^{(1)} - L X W^{(1)}\|_F^2]\big).$$
Using the result from Lemma 4, we have
$$\mathbb{E}\big[\|\nabla_H f^{(\ell)}(D^{(\ell+1)}, \widetilde H^{(\ell-1)}, W^{(\ell)}) - \nabla_H f^{(\ell)}(D^{(\ell+1)}, H^{(\ell-1)}, W^{(\ell)})\|_F^2\big] \le \eta^2 K \times O\Big(\big|\mathbb{E}[\|\widetilde P^{(\ell)}\|_F^2] - \|L\|_F^2\big|\Big) + \ldots + \eta^2 K \times O\Big(\big|\mathbb{E}[\|\widetilde P^{(1)}\|_F^2] - \|L\|_F^2\big|\Big).$$
Finally, let us consider the upper bound of Eq. 136:
$$\mathbb{E}\big[\|\nabla_W f^{(\ell)}(D^{(\ell+1)}, \widetilde H^{(\ell-1)}, W^{(\ell)}) - \nabla_W f^{(\ell)}(D^{(\ell+1)}, H^{(\ell-1)}, W^{(\ell)})\|_F^2\big] = \mathbb{E}\big[\|[L \widetilde H^{(\ell-1)}]^\top \big(D^{(\ell+1)} \circ \sigma'(\widetilde Z^{(\ell)})\big) - [L H^{(\ell-1)}]^\top \big(D^{(\ell+1)} \circ \sigma'(L H^{(\ell-1)} W^{(\ell)})\big)\|_F^2\big] \le 2\, \mathbb{E}\big[\|[L \widetilde H^{(\ell-1)}]^\top \big(D^{(\ell+1)} \circ \sigma'(\widetilde Z^{(\ell)})\big) - [L H^{(\ell-1)}]^\top \big(D^{(\ell+1)} \circ \sigma'(\widetilde Z^{(\ell)})\big)\|_F^2\big] + 2\, \mathbb{E}\big[\|[L H^{(\ell-1)}]^\top \big(D^{(\ell+1)} \circ \sigma'(\widetilde Z^{(\ell)})\big) - [L H^{(\ell-1)}]^\top \big(D^{(\ell+1)} \circ \sigma'(L H^{(\ell-1)} W^{(\ell)})\big)\|_F^2\big] \le 2 B_D^2 C_\sigma^2 B_{LA}^2\, \underbrace{\mathbb{E}\big[\|\widetilde H^{(\ell-1)} - H^{(\ell-1)}\|_F^2\big]}_{(B)} + 4 B_{LA}^2 B_H^2 B_D^2 L_\sigma^2\, \mathbb{E}\big[\|\widetilde Z^{(\ell)} - L \widetilde H^{(\ell-1)} W^{(\ell)}\|_F^2\big] + 4 B_{LA}^2 B_H^2 B_D^2 L_\sigma^2\, \mathbb{E}\big[\|L \widetilde H^{(\ell-1)} W^{(\ell)} - L H^{(\ell-1)} W^{(\ell)}\|_F^2\big].$$
By definition, we can bound term (B) as
$$\mathbb{E}\big[\|\widetilde H^{(\ell-1)} - H^{(\ell-1)}\|_F^2\big] \le C_\sigma^2\, \mathbb{E}\big[\|\widetilde Z^{(\ell-1)} - L H^{(\ell-2)} W^{(\ell-1)}\|_F^2\big] \le 2 C_\sigma^2\, \mathbb{E}\big[\|\widetilde Z^{(\ell-1)} - L \widetilde H^{(\ell-2)} W^{(\ell-1)}\|_F^2\big] + 2 C_\sigma^2\, \mathbb{E}\big[\|L \widetilde H^{(\ell-2)} W^{(\ell-1)} - L H^{(\ell-2)} W^{(\ell-1)}\|_F^2\big].$$
Plugging term (B) back and using Eq. 141 and Lemma 4, we have
$$\mathbb{E}\big[\|\nabla_W f^{(\ell)}(D^{(\ell+1)}, \widetilde H^{(\ell-1)}, W^{(\ell)}) - \nabla_W f^{(\ell)}(D^{(\ell+1)}, H^{(\ell-1)}, W^{(\ell)})\|_F^2\big] \le \eta^2 K \times O\Big(\big|\mathbb{E}[\|\widetilde P^{(\ell)}\|_F^2] - \|L\|_F^2\big|\Big) + \ldots + \eta^2 K \times O\Big(\big|\mathbb{E}[\|\widetilde P^{(1)}\|_F^2] - \|L\|_F^2\big|\Big).$$
Combining the results for Eq. 134, 135, and 136, we have
$$\mathbb{E}\big[\|\mathbb{E}[\widetilde G^{(\ell)}] - G^{(\ell)}\|_F^2\big] \le \eta^2 K \sum_{\ell=1}^{L} O\Big(\big|\mathbb{E}[\|\widetilde P^{(\ell)}\|_F^2] - \|L\|_F^2\big|\Big).$$

H.2 REMAINING STEPS TOWARD THEOREM 2

Now we are ready to prove Theorem 2. By the smoothness of $\mathcal{L}(\theta)$, we have
$$\mathcal{L}(\theta_{t+1}) \le \mathcal{L}(\theta_t) + \langle \nabla\mathcal{L}(\theta_t), \theta_{t+1} - \theta_t\rangle + \frac{L_F}{2}\|\theta_{t+1} - \theta_t\|^2 = \mathcal{L}(\theta_t) - \eta\langle \nabla\mathcal{L}(\theta_t), \widetilde\nabla\mathcal{L}(\theta_t)\rangle + \frac{\eta^2 L_F}{2}\|\widetilde\nabla\mathcal{L}(\theta_t)\|^2. \quad (148)$$
Let $\mathcal{F}_t = \{\{\mathcal{B}^{(\ell)}_1\}_{\ell=1}^L, \ldots, \{\mathcal{B}^{(\ell)}_{t-1}\}_{\ell=1}^L\}$. Note that the weight parameters $\theta_t$ are a function of the history of the generated random process, and hence random.
Taking expectation on both sides conditioned on $\mathcal{F}_t$ and using $\eta < 1/L_F$, we have
$$\mathbb{E}[\mathcal{L}(\theta_{t+1})\,|\,\mathcal{F}_t] \le \mathcal{L}(\theta_t) - \eta\big\langle \nabla\mathcal{L}(\theta_t),\, \mathbb{E}[\widetilde\nabla\mathcal{L}(\theta_t)\,|\,\mathcal{F}_t]\big\rangle + \frac{\eta^2 L_F}{2}\Big(\mathbb{E}\big[\|\widetilde\nabla\mathcal{L}(\theta_t) - \mathbb{E}[\widetilde\nabla\mathcal{L}(\theta_t)\,|\,\mathcal{F}_t]\|^2\,\big|\,\mathcal{F}_t\big] + \big\|\mathbb{E}[\widetilde\nabla\mathcal{L}(\theta_t)\,|\,\mathcal{F}_t]\big\|^2\Big)$$
$$= \mathcal{L}(\theta_t) - \eta\big\langle \nabla\mathcal{L}(\theta_t),\, \nabla\mathcal{L}(\theta_t) + \mathbb{E}[b_t\,|\,\mathcal{F}_t]\big\rangle + \frac{\eta^2 L_F}{2}\Big(\mathbb{E}[\|n_t\|^2\,|\,\mathcal{F}_t] + \|\nabla\mathcal{L}(\theta_t) + \mathbb{E}[b_t\,|\,\mathcal{F}_t]\|^2\Big)$$
$$\le \mathcal{L}(\theta_t) + \frac{\eta}{2}\Big(-2\big\langle \nabla\mathcal{L}(\theta_t),\, \nabla\mathcal{L}(\theta_t) + \mathbb{E}[b_t\,|\,\mathcal{F}_t]\big\rangle + \|\nabla\mathcal{L}(\theta_t) + \mathbb{E}[b_t\,|\,\mathcal{F}_t]\|^2\Big) + \frac{\eta^2 L_F}{2}\,\mathbb{E}[\|n_t\|^2\,|\,\mathcal{F}_t]$$
$$\le \mathcal{L}(\theta_t) + \frac{\eta}{2}\Big(-\|\nabla\mathcal{L}(\theta_t)\|^2 + \mathbb{E}[\|b_t\|^2\,|\,\mathcal{F}_t]\Big) + \frac{\eta^2 L_F}{2}\,\mathbb{E}[\|n_t\|^2\,|\,\mathcal{F}_t].$$
Plugging in the upper bounds on the bias and variance, taking expectation over $\mathcal{F}_t$, and rearranging terms, we have
$$\mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t)\|^2\big] \le \frac{2}{\eta}\Big(\mathbb{E}[\mathcal{L}(\theta_t)] - \mathbb{E}[\mathcal{L}(\theta_{t+1})]\Big) + \eta L_F \Delta_n + \eta^2 \Delta_b^{+}.$$
Summing from $t = 1$ to $T$ and rearranging, we have
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t)\|^2\big] \le \frac{2}{\eta T}\sum_{t=1}^{T}\Big(\mathbb{E}[\mathcal{L}(\theta_t)] - \mathbb{E}[\mathcal{L}(\theta_{t+1})]\Big) + \eta L_F \Delta_n + \eta^2 \Delta_b^{+} \overset{(a)}{\le} \frac{2}{\eta T}\big(\mathcal{L}(\theta_1) - \mathcal{L}(\theta^\star)\big) + \eta L_F \Delta_n + \eta^2 \Delta_b^{+},$$
where inequality (a) is due to $\mathcal{L}(\theta^\star) \le \mathbb{E}[\mathcal{L}(\theta_{T+1})]$. Selecting the learning rate as $\eta = 1/\sqrt{T}$, we have
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t)\|^2\big] \le \frac{2\big(\mathcal{L}(\theta_1) - \mathcal{L}(\theta^\star)\big)}{\sqrt{T}} + \frac{L_F \Delta_n}{\sqrt{T}} + \frac{\Delta_b^{+}}{T}.$$

I PROOF OF THEOREM 3

I.1 SUPPORTING LEMMAS

In the following lemma, we decompose the mean-square error of the stochastic gradient at the $\ell$th layer, $\mathbb{E}[\|\widetilde G^{(\ell)} - G^{(\ell)}\|_F^2]$, into the summation of:
• the difference between the gradients with respect to the last-layer node embedding matrix, $\mathbb{E}\big[\|\widetilde D^{(L+1)} - D^{(L+1)}\|_F^2\big]$;
• the difference of the gradient passing from the $(\ell+1)$th-layer node embedding to the $\ell$th-layer node embedding, $\mathbb{E}\big[\|\nabla_H \tilde f^{(\ell+1)}(\widetilde D^{(\ell+2)}, \widetilde H^{(\ell)}, W^{(\ell+1)}) - \nabla_H f^{(\ell+1)}(D^{(\ell+2)}, H^{(\ell)}, W^{(\ell+1)})\|_F^2\big]$;
• the difference of the gradient passing from the $\ell$th-layer node embedding to the $\ell$th-layer weight matrix, $\mathbb{E}\big[\|\nabla_W \tilde f^{(\ell)}(\widetilde D^{(\ell+1)}, \widetilde H^{(\ell-1)}, W^{(\ell)}) - \nabla_W f^{(\ell)}(D^{(\ell+1)}, H^{(\ell-1)}, W^{(\ell)})\|_F^2\big]$.
Lemma 6. The mean-square error of the stochastic gradient at the $\ell$th layer can be decomposed as
$$\mathbb{E}\big[\|\widetilde G^{(\ell)} - G^{(\ell)}\|_F^2\big] \le O\big(\mathbb{E}[\|\widetilde D^{(L+1)} - D^{(L+1)}\|_F^2]\big) + O\big(\mathbb{E}[\|\nabla_H \tilde f^{(L)}(\widetilde D^{(L+1)}, \widetilde H^{(L-1)}, W^{(L)}) - \nabla_H f^{(L)}(D^{(L+1)}, H^{(L-1)}, W^{(L)})\|_F^2]\big) + \ldots + O\big(\mathbb{E}[\|\nabla_H \tilde f^{(\ell+1)}(\widetilde D^{(\ell+2)}, \widetilde H^{(\ell)}, W^{(\ell+1)}) - \nabla_H f^{(\ell+1)}(D^{(\ell+2)}, H^{(\ell)}, W^{(\ell+1)})\|_F^2]\big) + O\big(\mathbb{E}[\|\nabla_W \tilde f^{(\ell)}(\widetilde D^{(\ell+1)}, \widetilde H^{(\ell-1)}, W^{(\ell)}) - \nabla_W f^{(\ell)}(D^{(\ell+1)}, H^{(\ell-1)}, W^{(\ell)})\|_F^2]\big).$$
Proof. By definition, the mean-square error of the stochastic gradient is the squared Frobenius norm of the difference between the full chain of backward propagations evaluated with the SGCN++ quantities and with the FullGCN quantities. Inserting intermediate terms that replace one approximate quantity at a time and applying $\|\sum_i a_i\|_F^2 \le (L+1)\sum_i \|a_i\|_F^2$, each of the $(L+1)$ resulting terms is bounded, via Lipschitz continuity, by one of the three factors above, which yields the stated decomposition.

For term $(C_2)$, by definition we have
$$\mathbb{E}\big[\|\widetilde Z^{(\ell)}_t - \widetilde Z^{(\ell)}_{t-1}\|_F^2\big] = \mathbb{E}\big[\|\widetilde L^{(\ell)} \widetilde H^{(\ell-1)}_t W^{(\ell)}_t - \widetilde L^{(\ell)} \widetilde H^{(\ell-1)}_{t-1} W^{(\ell)}_{t-1}\|_F^2\big] \le 2 B_{LA}^2 B_W^2 C_\sigma^2\, \mathbb{E}\big[\|\widetilde Z^{(\ell-1)}_t - \widetilde Z^{(\ell-1)}_{t-1}\|_F^2\big] + 2 B_{LA}^2 B_H^2\, \mathbb{E}\big[\|W^{(\ell)}_t - W^{(\ell)}_{t-1}\|_F^2\big]. \quad (175)$$
By induction, we have
$$\mathbb{E}\big[\|\widetilde Z^{(\ell)}_t - \widetilde Z^{(\ell)}_{t-1}\|_F^2\big] \le O\big(\mathbb{E}[\|W^{(\ell)}_t - W^{(\ell)}_{t-1}\|_F^2]\big) + \ldots + O\big(\mathbb{E}[\|W^{(1)}_t - W^{(1)}_{t-1}\|_F^2]\big).$$
Then, plugging term (B) back into Eq. 169, we conclude the proof.

Using the previous lemmas, we provide the upper bound of Eq. 154, one of the three key factors that affect the mean-square error of the stochastic gradient at the $\ell$th layer (Eq. 204). Plugging the results from the supporting lemmas, i.e., Lemmas 8, 10, and 11, into the decomposition of Lemma 6, and using the definition of the stochastic gradient for all model parameters, $\widetilde\nabla\mathcal{L}(\theta_t) = \{\widetilde G^{(\ell)}_t\}_{\ell=1}^L$, we have
$$\mathbb{E}\big[\|\widetilde\nabla\mathcal{L}(\theta_t) - \nabla\mathcal{L}(\theta_t)\|_F^2\big] \le \sum_{t=EK+1}^{EK+K} \eta^2\, O\Big(\sum_{\ell=1}^{L}\big|\mathbb{E}[\|\widetilde L^{(\ell)}\|_F^2] - \|L\|_F^2\big|\Big) \times \mathbb{E}\big[\|\widetilde\nabla\mathcal{L}(\theta_{t-1})\|_F^2\big].$$

I.2 REMAINING STEPS TOWARD THEOREM 3

By the smoothness of $\mathcal{L}(\theta)$, we have
$$\mathcal{L}(\theta_{t+1}) \le \mathcal{L}(\theta_t) + \langle \nabla\mathcal{L}(\theta_t), \theta_{t+1} - \theta_t\rangle + \frac{L_f}{2}\|\theta_{t+1} - \theta_t\|_F^2 = \mathcal{L}(\theta_t) - \eta\langle \nabla\mathcal{L}(\theta_t), \widetilde\nabla\mathcal{L}(\theta_t)\rangle + \frac{\eta^2 L_f}{2}\|\widetilde\nabla\mathcal{L}(\theta_t)\|_F^2 \overset{(a)}{=} \mathcal{L}(\theta_t) - \frac{\eta}{2}\|\nabla\mathcal{L}(\theta_t)\|_F^2 + \frac{\eta}{2}\|\nabla\mathcal{L}(\theta_t) - \widetilde\nabla\mathcal{L}(\theta_t)\|_F^2 - \Big(\frac{\eta}{2} - \frac{L_f \eta^2}{2}\Big)\|\widetilde\nabla\mathcal{L}(\theta_t)\|_F^2, \quad (225)$$
where equality (a) is due to the fact that $2\langle x, y\rangle = \|x\|_F^2 + \|y\|_F^2 - \|x - y\|_F^2$ for any $x, y$. Taking expectation on both sides, we have
$$\mathbb{E}[\mathcal{L}(\theta_{t+1})] \le \mathbb{E}[\mathcal{L}(\theta_t)] - \frac{\eta}{2}\,\mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t)\|_F^2\big] + \frac{\eta}{2}\,\mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t) - \widetilde\nabla\mathcal{L}(\theta_t)\|_F^2\big] - \Big(\frac{\eta}{2} - \frac{L_f \eta^2}{2}\Big)\,\mathbb{E}\big[\|\widetilde\nabla\mathcal{L}(\theta_t)\|_F^2\big].$$
Summing over $t = 1, \ldots, T$, where $T$ is the total number of steps, we have
$$\sum_{t=1}^{T} \mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t)\|_F^2\big] \le \frac{2}{\eta}\Big(\mathbb{E}[\mathcal{L}(\theta_1)] - \mathbb{E}[\mathcal{L}(\theta_{T+1})]\Big) + \sum_{t=1}^{T}\Big[\mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t) - \widetilde\nabla\mathcal{L}(\theta_t)\|_F^2\big] - (1 - L_f \eta)\,\mathbb{E}\big[\|\widetilde\nabla\mathcal{L}(\theta_t)\|_F^2\big]\Big] \le \frac{2}{\eta}\Big(\mathbb{E}[\mathcal{L}(\theta_1)] - \mathbb{E}[\mathcal{L}(\theta^\star)]\Big) + \sum_{t=1}^{T}\Big[\mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t) - \widetilde\nabla\mathcal{L}(\theta_t)\|_F^2\big] - (1 - L_f \eta)\,\mathbb{E}\big[\|\widetilde\nabla\mathcal{L}(\theta_t)\|_F^2\big]\Big],$$
where $\mathcal{L}(\theta^\star)$ is the global optimal value. Let us consider each inner loop with $E \in [0, T/K - 1]$ and $t \in \{EK+1, \ldots, EK+K\}$.
Using Lemma 12, we have
$$\sum_{E=0}^{T/K-1}\sum_{t=EK+1}^{EK+K} \mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t) - \widetilde\nabla\mathcal{L}(\theta_t)\|_F^2\big] - (1 - L_f \eta)\sum_{E=0}^{T/K-1}\sum_{t=EK+1}^{EK+K} \mathbb{E}\big[\|\widetilde\nabla\mathcal{L}(\theta_t)\|_F^2\big]$$
$$\le O\Big(\sum_{\ell=1}^{L}\big|\mathbb{E}[\|\widetilde L^{(\ell)}\|_F^2] - \|L\|_F^2\big|\Big)\sum_{E=0}^{T/K-1}\sum_{t=EK+1}^{EK+K}\sum_{j=2}^{t-EK} \eta^2\, \mathbb{E}\big[\|\widetilde\nabla\mathcal{L}(\theta_{j-1})\|_F^2\big] - (1 - L_f \eta)\sum_{E=0}^{T/K-1}\sum_{t=EK+1}^{EK+K} \mathbb{E}\big[\|\widetilde\nabla\mathcal{L}(\theta_t)\|_F^2\big]$$
$$\le \Big[\eta^2 K \times O\Big(\sum_{\ell=1}^{L}\big|\mathbb{E}[\|\widetilde L^{(\ell)}\|_F^2] - \|L\|_F^2\big|\Big) - (1 - L_f \eta)\Big]\sum_{E=0}^{T/K-1}\sum_{t=EK+1}^{EK+K} \mathbb{E}\big[\|\widetilde\nabla\mathcal{L}(\theta_t)\|_F^2\big] = \big[\eta^2 \Delta^{++}_{b+n} - (1 - L_f \eta)\big]\sum_{E=0}^{T/K-1}\sum_{t=EK+1}^{EK+K} \mathbb{E}\big[\|\widetilde\nabla\mathcal{L}(\theta_t)\|_F^2\big].$$
Notice that $\eta = \dfrac{2}{L_f + \sqrt{L_f^2 + 4\Delta^{++}_{b+n}}}$ is a root of the equation $\eta^2 \Delta^{++}_{b+n} - (1 - L_f \eta) = 0$. Therefore, we have
$$\sum_{t=1}^{T} \mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t)\|_F^2\big] \le \frac{2}{\eta}\Big(\mathbb{E}[\mathcal{L}(\theta_1)] - \mathbb{E}[\mathcal{L}(\theta^\star)]\Big),$$
which implies
$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\big[\|\nabla\mathcal{L}(\theta_t)\|_F^2\big] \le \frac{1}{T}\Big(L_f + \sqrt{L_f^2 + 4\Delta^{++}_{b+n}}\Big)\Big(\mathbb{E}[\mathcal{L}(\theta_1)] - \mathbb{E}[\mathcal{L}(\theta^\star)]\Big).$$
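The choice of learning rate above can be checked numerically: it is the positive root of the quadratic that makes the bracketed factor in the telescoped sum vanish. A minimal sketch (the constants are arbitrary placeholders, not values from the paper):

```python
import math

def eta_star(L_f, Delta):
    """Step size eta = 2 / (L_f + sqrt(L_f^2 + 4*Delta)), the positive
    root of eta^2 * Delta + L_f * eta - 1 = 0 (equivalently, the value
    that makes eta^2 * Delta - (1 - L_f * eta) equal to zero)."""
    return 2.0 / (L_f + math.sqrt(L_f ** 2 + 4.0 * Delta))

# For several smoothness/variance constants, the residual vanishes,
# so the last bracket in the telescoped sum is exactly zero.
for L_f, Delta in [(1.0, 0.5), (10.0, 3.0), (0.1, 100.0)]:
    eta = eta_star(L_f, Delta)
    residual = eta ** 2 * Delta - (1.0 - L_f * eta)
    assert abs(residual) < 1e-12
```

The equivalence follows by rationalizing: $(\sqrt{L_f^2+4\Delta}-L_f)(\sqrt{L_f^2+4\Delta}+L_f) = 4\Delta$, so the quadratic-formula root $(\sqrt{L_f^2+4\Delta}-L_f)/(2\Delta)$ equals $2/(L_f+\sqrt{L_f^2+4\Delta})$.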

J REPRODUCING EXPERIMENT RESULTS

To reproduce the results reported in the paper, we provide links for dataset download, bash scripts to reproduce the experiment results, and a Jupyter notebook for quick visualization and GPU utilization calculation. Note that, due to randomness, the obtained results (e.g., loss curves) may differ slightly between runs; however, the overall trend of the loss curves and the conclusions remain the same. This implementation is based on PyTorch using Python 3. We note that Python 2 might result in wrong gradient updates, even for vanilla SGCNs.

Install dependencies:

Install dependencies:

# create virtual environment
$ virtualenv env
$ source env/bin/activate
# install dependencies
$ pip install -r requirements.txt

Experiments are conducted on the PPI, PPI-Large, Flickr, Reddit, and Yelp datasets. The datasets can be downloaded from Google Drive.

# create folders that save experiment results and datasets
$ mkdir ./results
$ mkdir ./data  # please download the datasets and put them inside this folder

To reproduce the results, please run the following commands:

$ python train.py --sample_method 'ladies' --dataset 'reddit'
$ python train.py --sample_method 'fastgcn' --dataset 'reddit'
$ python train.py --sample_method 'graphsage' --dataset 'reddit'
$ python train.py --sample_method 'vrgcn' --dataset 'reddit'
$ python train.py --sample_method 'graphsaint' --dataset 'reddit'
$ python train.py --sample_method 'exact' --dataset 'reddit'
$ python train.py --sample_method 'ladies' --dataset 'ppi'
$ python train.py --sample_method 'fastgcn' --dataset 'ppi'
$ python train.py --sample_method 'graphsage' --dataset 'ppi'
$ python train.py --sample_method 'vrgcn' --dataset 'ppi'
$ python train.py --sample_method 'graphsaint' --dataset 'ppi'
$ python train.py --sample_method 'exact' --dataset 'ppi'
$ python train.py --sample_method 'ladies' --dataset 'flickr'
$ python train.py --sample_method 'fastgcn' --dataset 'flickr'
$ python train.py --sample_method 'graphsage' --dataset 'flickr'
$ python train.py --sample_method 'vrgcn' --dataset 'flickr'
$ python train.py --sample_method 'graphsaint' --dataset 'flickr'
$ python train.py --sample_method 'exact' --dataset 'flickr'
$ python train.py --sample_method 'ladies' --dataset 'ppi-large'
$ python train.py --sample_method 'fastgcn' --dataset 'ppi-large'
$ python train.py --sample_method 'graphsage' --dataset 'ppi-large'
$ python train.py --sample_method 'vrgcn' --dataset 'ppi-large'
$ python train.py --sample_method 'graphsaint' --dataset 'ppi-large'
$ python train.py --sample_method 'exact' --dataset 'ppi-large'
$ python train.py --sample_method 'ladies' --dataset 'yelp'
$ python train.py --sample_method 'fastgcn' --dataset 'yelp'
$ python train.py --sample_method 'graphsage' --dataset 'yelp'
$ python train.py --sample_method 'vrgcn' --dataset 'yelp'
$ python train.py --sample_method 'graphsaint' --dataset 'yelp'
$ python train.py --sample_method 'exact' --dataset 'yelp'
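The full sampler-by-dataset grid above can equivalently be generated with a nested loop. The snippet below only prints the commands (a hypothetical convenience script, not part of the released code); replace `echo` with `eval` to actually run them:

```shell
#!/bin/sh
# Print the full 6-sampler x 5-dataset grid of training commands.
for dataset in reddit ppi flickr ppi-large yelp; do
  for method in ladies fastgcn graphsage vrgcn graphsaint exact; do
    echo "python train.py --sample_method '$method' --dataset '$dataset'"
  done
done
```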



Footnotes:
1. We use a tilde symbol for their stochastic form.
2. We have $\Delta_b = 0$ if all neighbors are used to calculate the exact node embeddings, i.e., $\widetilde P^{(\ell)} = L$ for all $\ell \in [L]$.
3. https://github.com/tkipf/pygcn
4. https://pytorch.org/
5. https://drive.google.com/drive/folders/15eP7OHiHQUnDrHKYh1YPxXkiqGoJhbis?usp=sharing



Figure 2: Comparing the validation loss of SGCN and SGCN++ on real world datasets.

Implementation details. To demonstrate the effectiveness of doubly variance reduction, we modified the PyTorch implementation of GCN (Kipf & Welling, 2016) to add the LADIES (Zou et al., 2019), FastGCN (Chen et al., 2018), GraphSAGE (Hamilton et al., 2017), GraphSAINT (Zeng et al., 2019), VRGCN (Chen et al., 2017), and Exact sampling mechanisms. We then implemented SGCN+ and SGCN++ on top of each sampling method to illustrate how zeroth-order variance reduction and doubly variance reduction help GCN training. We use PyTorch 1.4 on CUDA 10.1 to train the model on GPU. During each epoch, we randomly construct 10 mini-batches in parallel.

Figure 4: Comparing the mean-square error of stochastic gradient to full gradient and training loss of SGCN, SGCN+, SGCN++ in the first 200 iterations of training process on Reddit dataset.

Figure 5: Comparison of GPU memory usage of SGCN and SGCN++ on Flickr and PPI dataset.

We report the average time of doubly variance reduced LADIES++ and vanilla LADIES. We classify the wall-clock time during the training process into five categories:
• Snapshot step sampling time: The time used to construct the snapshot full-batch or the snapshot large-batch. In practice, we directly use full-batch training for the smaller datasets (e.g., PPI, PPI-Large, and Flickr) and use a sampled snapshot large-batch for the large datasets (e.g., Reddit and Yelp). When constructing the snapshot large-batch, the Exact sampler has to traverse all neighbors of each node using for-loops over the graph structure, which makes it time-consuming.
• Snapshot step transfer time: The time required to transfer the sampled snapshot batch nodes and Laplacian matrices to the GPUs.
• Regular step sampling time: The time used to construct the mini-batches using the layer-wise LADIES sampler.
• Regular step transfer time: The time required to transfer the sampled mini-batch nodes and Laplacian matrices to the GPUs, plus the time to transfer the historical node embeddings and the stochastic gradients between GPUs and CPUs.
• Computation time: The time used for forward- and backward-propagation.
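The five-way breakdown above can be collected with a simple context-manager timer; a minimal sketch (bucket names and the `time.sleep` stand-ins are illustrative, not from the released code):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)  # bucket name -> accumulated seconds

@contextmanager
def timed(bucket):
    """Accumulate the wall-clock time of the enclosed block into a bucket."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[bucket] += time.perf_counter() - start

# One snapshot step followed by 10 regular steps (work shown as sleeps):
with timed("snapshot_sampling"):
    time.sleep(0.001)          # construct snapshot full/large batch
with timed("snapshot_transfer"):
    time.sleep(0.001)          # move snapshot batch + Laplacians to GPU
for _ in range(10):
    with timed("regular_sampling"):
        time.sleep(0.001)      # layer-wise mini-batch sampling
    with timed("regular_transfer"):
        time.sleep(0.001)      # mini-batch + history transfer
    with timed("compute"):
        time.sleep(0.001)      # forward- and backward-propagation
```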

Figure 6: Comparison of training loss, validation loss, and F1-score of SGCN++ with different snapshot gap on Reddit dataset.

Figure 7: Comparison of training loss, validation loss, and F1-score of SGCN+ with different snapshot gap on Reddit dataset.

We fix the snapshot gap K = 10 and the regular step batch size B = 512, and change the snapshot step batch size from 20,000 (20K) to 80,000 (80K). In Figure 8 and Figure 9, we show the comparison of training loss and validation loss with different snapshot step large-batch sizes for SGCN++ and SGCN+, respectively.

Figure 8: Comparison of training loss, validation loss, and F1-score of SGCN++ with different snapshot large-batch size on Reddit dataset.

Figure 9: Comparison of training loss, validation loss, and F1-score of SGCN+ with different snapshot large-batch size on Reddit dataset.

Figure 10: Comparison of training loss, validation loss, and F1-score of SGCN++ with different mini-batch size on Reddit dataset.

Figure 11: Comparing the validation loss and F1-score of GraphSAINT and GraphSAINT++ with different mini-batch size on Reddit dataset

Figure13: Relationship between the two types of variance with the training process, where embedding approximation variance (zeroth-order variance) happens during forward-propagation and layerwise gradient variance (first-order variance) happens during backward-propagation.

Output: Model with parameter θ_{T+1}

E.3 SGCN+

Algorithm 4 SGCN+: Zeroth-order variance reduction (detailed version of Algorithm 4)
1: Input: learning rate η > 0, snapshot gap K > 0
2: for t = 1, . . . , T do
3:   if t mod K = 0 then
4:

end for
18: Output: Model with parameter θ_{T+1}

E.4 SGCN++

Algorithm 5 SGCN++: Doubly variance reduction (detailed version of Algorithm 5)
1: Input: learning rate η > 0, snapshot gap K > 0
2: for t = 1, . . . , T do
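The snapshot/regular-step pattern of Algorithm 5 can be sketched for a single linear layer with a stand-in quadratic loss. This is a minimal NumPy-only illustration of the two recursions (zeroth-order on the embeddings, first-order on the gradients), not the paper's released implementation; the entrywise Bernoulli Laplacian estimator `sample_P` and all constants are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K, T, eta = 8, 4, 5, 25, 0.02   # nodes, features, snapshot gap, steps, lr

L = rng.normal(size=(n, n)) / n       # stand-in normalized Laplacian
X = rng.normal(size=(n, d))           # node features H^(0)
Y = rng.normal(size=(n, d))           # regression targets
W = rng.normal(size=(d, d))           # layer weight

def sample_P(p=0.5):
    """Unbiased sparse estimator of L: keep each entry w.p. p, rescale by 1/p."""
    mask = rng.random(L.shape) < p
    return np.where(mask, L / p, 0.0)

def dloss(Z):                         # dLoss/dZ for Loss = 0.5 * ||Z - Y||_F^2
    return Z - Y

Z_t, G_t, W_prev = None, None, None
for t in range(T):
    if t % K == 0:                    # snapshot step: exact full-batch quantities
        Z_t = L @ X @ W
        G_t = (L @ X).T @ dloss(Z_t)
    else:                             # regular step: doubly variance reduced
        P = sample_P()
        Z_new = Z_t + P @ X @ W - P @ X @ W_prev          # zeroth-order recursion
        G_new = G_t + (P @ X).T @ (dloss(Z_new) - dloss(Z_t))  # first-order recursion
        Z_t, G_t = Z_new, G_new
    W_prev = W.copy()
    W = W - eta * G_t                 # SGD update with the corrected gradient
```

At each snapshot step the approximation error is reset to zero (term (A) = 0 in Lemma 4), and between snapshots the recursions only accumulate differences of consecutive iterates, which is the source of the $\eta^2 K$ factor in the bounds.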

Plugging $(D_1)$, $(D_2)$ back into $(C_1)$, and $(C_1)$, $(C_2)$, $(C_3)$ back into (B), we obtain a bound in terms of $\mathbb{E}[\|\widetilde\nabla\mathcal{L}(\theta_{t-1})\|_F^2]$.



For any layer $\ell \in [L]$, the stochastic gradient $\widetilde G^{(\ell)}$

Comparison of the accuracy (F1-score) of SGCN, SGCN+, and SGCN++. We train SGCN+ and SGCN++ by randomly selecting 50% of nodes for the Reddit dataset and 15% of nodes for the Yelp dataset. We update the model with a mini-batch size of B = 512 and the Adam optimizer with a learning rate of η = 0.01. We conduct training 3 times for 200 epochs and report the average results. We choose the model with the lowest validation error as the convergence point. A summary of experiment configurations and dataset statistics can be found in Table 2 and Table 3 in Appendix B. Due to the space limit, more experiments can be found in Appendix C.


Configuration of different sampling algorithms during training

Summary of dataset statistics. m stands for multi-class classification, and s stands for single-class.

Effectiveness of variance reduction. In Figure 4, we empirically evaluate the effectiveness of variance reduction by comparing the mean-square error of the stochastic gradient and the training loss curves of different sampling strategies on the Reddit dataset.
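The metric plotted in Figure 4 is the mean-square error between stochastic and full gradients; computing it amounts to averaging squared Frobenius distances. A small illustrative helper (the function name and toy data are ours, not from the released code):

```python
import numpy as np

def grad_mse(stoch_grads, full_grad):
    """Mean-square error ||g_tilde - g||_F^2 of a list of sampled
    stochastic gradients against the full-batch gradient."""
    return float(np.mean([np.linalg.norm(g - full_grad) ** 2
                          for g in stoch_grads]))

full = np.ones((3, 2))                    # toy full-batch gradient
samples = [full + 0.1, full - 0.1]        # two toy stochastic gradients
# Each sample is off by 0.1 in all 6 entries: 6 * 0.01 = 0.06 per sample.
assert abs(grad_mse(samples, full) - 0.06) < 1e-12
```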

Comparison of average time (1 snapshot step and 10 regular steps) of doubly variance reduced LADIES++ with regular step batch size as 512. Full-batch is used for snapshot step on PPI, PPI-Large, and Flickr. 50% training set nodes are sampled for the snapshot step on Reddit, and 15% training set nodes are sampled for the snapshot step on Yelp.

Comparison of average time (10 regular steps) of LADIES with regular step batch size as

is the gradient for the $\ell$th weight matrix, i.e., W

8. Suppose $t \in \{EK + 1, \ldots, EK + K\}$. The upper bound on the difference of the gradient with respect to the input node embedding matrix at the $\ell$th graph convolutional layer, given the same input D


Proof. We first consider the gradient w.r.t. the $\ell$th graph convolutional layer weight matrix. By the Lipschitz continuity of $\nabla_W f^{(\ell)}(\cdot)$, we can rewrite the corresponding difference and then bound the embedding error term $\|\widetilde H^{(\ell-1)} - H^{(\ell-1)}\|_F^2$.

In the following lemma, we derive the upper bound on the difference of the gradient passing from the $\ell$th to the $(\ell-1)$th layer given the same inputs $D^{(\ell+1)}_t$, where the backward propagation for the $\ell$th layer is defined recursively in SGCN++ and exactly in FullGCN.

Lemma 7. Suppose $t \in \{EK+1, \ldots, EK+K\}$. The upper bound on the difference of the gradient with respect to the input node embedding matrix at the $\ell$th graph convolutional layer, given the same inputs $D^{(\ell+1)}_t$ and $H^{(\ell-1)}_t$, is derived as follows.

Proof. To simplify the presentation, let us denote the SGCN++ and FullGCN backward quantities by $\widetilde D$ and $D$, respectively. By definition, we expand the difference, take expectation conditioned on $\mathcal{F}_t$, and then take expectation over $\mathcal{F}_t$. Knowing that we take a full-batch gradient descent step when $(t \bmod K) = 0$, the snapshot term vanishes. We then take a closer look at term (B): term $(C_1)$ follows by definition, and the remaining terms follow by induction.

Proof. For the gradient w.r.t. the node embedding matrices, we first take a closer look at term (A). Suppose $t \in \{EK+1, \ldots, EK+K\}$, where $E = \lfloor t/K \rfloor$ is the current epoch number and $K$ is the inner-loop size. By the previous lemma, term (A) can be bounded. We then take a closer look at term (B), decompose term (C), and conclude by induction. The upper bound for term (D) is similar to the one in the proof of Lemma 4.


Taking expectation conditioned on $\mathcal{F}_t$, then taking expectation over $\mathcal{F}_t$, and combining with term (A), we obtain the stated bound.

In the following lemma, we derive the upper bound on the difference of the gradient with respect to the weight matrix at each graph convolutional layer. Suppose the input node embedding matrix for the $\ell$th GCN layer is $\widetilde H^{(\ell-1)}_t$; the gradient calculated for the $\ell$th weight matrix is defined recursively in SGCN++ and exactly in FullGCN.

Lemma 9. Suppose $t \in \{EK+1, \ldots, EK+K\}$. The upper bound on the difference of the gradient with respect to the $\ell$th graph convolutional layer, given the same inputs $D^{(\ell+1)}_t$ and $\widetilde H^{(\ell-1)}_t$, is derived as follows.

Proof. To simplify the presentation, let us denote the two gradients by $\widetilde G$ and $G$. By definition, taking expectation conditioned on $\mathcal{F}_t$ (196) and then over $\mathcal{F}_t$, and writing $t = EK + k$ for some $k \le K$, the snapshot term vanishes because we take a full-batch gradient descent step when $(t \bmod K) = 0$. We then take a closer look at term (B): term $(C_1)$ follows by definition and induction; terms $(D_2)$ and $(C_2)$ follow by definition and induction as well, and we plug $(D_1)$, $(D_2)$ into $(C_2)$, and $(C_1)$, $(C_2)$ into (B).

Using the previous lemma, we provide the upper bound of Eq. 155, which is one of the three key factors that affect the mean-square error of the stochastic gradient at the $\ell$th layer.

Lemma 10. Suppose $t \in \{EK+1, \ldots, EK+K\}$. The upper bound on the difference of the gradient with respect to the weight of the $\ell$th graph convolutional layer, given the same input $D^{(\ell+1)}_t$, is derived as follows. (211)

Proof. For the gradient w.r.t. the weight matrices, we first take a closer look at term (A). Suppose $t \in \{EK+1, \ldots, EK+K\}$, where $E = \lfloor t/K \rfloor$ is the current epoch number and $K$ is the inner-loop size.
By the previous lemma, term (A) can be bounded. We then take a closer look at term (B). Writing $t = EK + k$ for some $k \le K$ and using Eq. 220, we plug (C) back into (B) and combine with (A).

In the following lemma, we provide the upper bound of Eq. 153, which is one of the three key factors that affect the mean-square error of the stochastic gradient at the $\ell$th layer. Taking a closer look at term (A), proceeding by induction, writing $t = EK + k$ for some $k \le K$, and using Eq. 220, we conclude the proof.

Combining the upper bounds of Eq. 153, 154, and 155, we obtain the upper bound on the mean-square error of the stochastic gradient in SGCN++.

Lemma 12. Suppose $t \in \{EK+1, \ldots, EK+K\}$, so that $t = EK + k$ for some $k \le K$.

