ON THE IMPORTANCE OF SAMPLING IN TRAINING GCNS: CONVERGENCE ANALYSIS AND VARIANCE REDUCTION

Anonymous

Abstract

Graph Convolutional Networks (GCNs) have achieved impressive empirical advances across a wide variety of graph-related applications. Despite this success, training GCNs on large graphs suffers from computational and memory issues. A potential path around these obstacles is sampling-based methods, where at each layer a subset of nodes is sampled. Although recent studies have empirically demonstrated the effectiveness of sampling-based methods, these works lack theoretical convergence guarantees under realistic settings and cannot fully leverage the information of evolving parameters during optimization. In this paper, we describe and analyze a general doubly variance reduction schema that can accelerate any sampling method under a memory budget. The motivation for the proposed schema is a careful analysis of the variance of sampling methods, which shows that the induced variance can be decomposed into node embedding approximation variance (zeroth-order variance) during forward propagation and layerwise-gradient variance (first-order variance) during backward propagation. We theoretically analyze the convergence of the proposed schema and show that it enjoys an O(1/T) convergence rate. We complement our theoretical results by integrating the proposed schema into different sampling methods and applying them to several large real-world graphs.

1. INTRODUCTION

In the past few years, graph convolutional networks (GCNs) have achieved great success in many graph-related applications, such as semi-supervised node classification (Kipf & Welling, 2016), supervised graph classification (Xu et al., 2018), protein interface prediction (Fout et al., 2017), and knowledge graphs (Schlichtkrull et al., 2018; Wang et al., 2017). However, most works on GCNs focus on relatively small graphs, and scaling GCNs to large graphs is not straightforward. Due to the dependency between nodes in the graph, calculating the representation of each node in a mini-batch requires a large receptive field, which grows exponentially with the number of layers. To alleviate this issue, sampling-based methods, such as node-wise sampling (Hamilton et al., 2017; Ying et al., 2018; Chen et al., 2017), layer-wise sampling (Chen et al., 2018; Zou et al., 2019), and subgraph sampling (Chiang et al., 2019; Zeng et al., 2019), have been proposed for mini-batch GCN training. Although empirical results show that sampling-based methods can scale GCN training to large graphs, these methods suffer from a few key issues. First, the theoretical understanding of sampling-based methods is still lacking. Second, the aforementioned sampling strategies are based only on the structure of the graph. Although recent works (Huang et al., 2018; Cong et al., 2020) propose adaptive importance sampling strategies that constantly re-evaluate the relative importance of nodes during training (e.g., based on the current gradient or representation of nodes), finding the optimal adaptive sampling distribution is computationally prohibitive, as it requires calculating the full gradient or all node representations in each iteration. This necessitates developing alternative solutions that can be computed efficiently and that come with theoretical guarantees.
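To make the exponential receptive-field growth and the node-wise sampling remedy concrete, the following is a minimal sketch of GraphSAGE-style uniform neighbor sampling. It is our own illustration, not any paper's implementation; the adjacency-dict format and the function name are assumptions for the example.

```python
import random

def sample_receptive_field(adj, batch, fanout, num_layers, seed=0):
    """Node-wise neighbor sampling: starting from the mini-batch, keep at most
    `fanout` uniformly sampled neighbors per node at each of the `num_layers`
    hops. The sampled receptive field then grows as O(fanout^L) instead of
    covering the full (potentially exponential) multi-hop neighborhood.

    adj: dict mapping node id -> list of neighbor ids.
    Returns a list of node sets, one per hop; the last set is the full
    sampled receptive field needed to compute the mini-batch embeddings."""
    rng = random.Random(seed)
    layers = [set(batch)]
    for _ in range(num_layers):
        frontier = set()
        for v in layers[-1]:
            nbrs = adj.get(v, [])
            k = min(fanout, len(nbrs))
            frontier.update(rng.sample(nbrs, k))
        # keep the previous layer's nodes (self-connections) for aggregation
        layers.append(frontier | layers[-1])
    return layers
```

With `fanout=2` and two layers, a mini-batch of one node touches at most 1 + 2 + 4 nodes, regardless of the true neighborhood sizes.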
In this paper, we develop a novel variance reduction schema that can be applied to any sampling strategy to significantly reduce the induced variance. The key idea is to use the historical node embeddings and the historical layerwise gradients of each graph convolution layer as control variates. The main motivation behind the proposed schema stems from our theoretical analysis of the variance of sampling methods in training GCNs. Specifically, we show that due to the composite structure of the training objective, any sampling strategy introduces two types of variance in estimating the stochastic gradients: node embedding approximation variance (zeroth-order variance), which results from embedding approximation during forward propagation, and layerwise-gradient variance (first-order variance), which results from gradient estimation during backward propagation. In Figure 1, we exhibit the performance of the proposed schema when applied to the sampling strategy introduced in Zou et al. (2019). The plots show that applying our proposal leads to a significant reduction in variance, and hence a faster convergence rate and better test accuracy. We can also see that the zeroth-order and first-order components are equally important and yield significant improvement when applied jointly (i.e., doubly variance reduction).

Contributions. We summarize the contributions of this paper as follows:

• We provide a theoretical analysis of sampling-based GCN training (SGCN) with a non-asymptotic convergence rate. We show that, due to the node embedding approximation variance, SGCNs suffer from a residual error that hinders their convergence.

• We mathematically show that this residual error can be eliminated by applying zeroth-order variance reduction to the node embedding approximation (dubbed SGCN+), which explains why VRGCN (Chen et al., 2017) enjoys better convergence than GraphSAGE (Hamilton et al., 2017), even with fewer sampled neighbors.

• We extend the algorithm from node embedding approximation to stochastic gradient approximation, and propose a generic and efficient doubly variance reduction schema (SGCN++). SGCN++ can be integrated with different sampling-based methods to significantly reduce both the zeroth- and first-order variance, resulting in a faster convergence rate and better generalization.

• We theoretically analyze the convergence of SGCN++ and obtain an O(1/T) rate, which significantly improves upon the best known bound of O(1/√T). We empirically verify SGCN++ through various experiments on several real-world datasets and different sampling methods, where it demonstrates significant improvements over the original sampling methods.
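To make the control-variate idea behind the zeroth-order part of the schema concrete, the sketch below contrasts a plain sampled estimate of a graph aggregation A·X with one that samples only the *change* relative to cached historical embeddings. The numpy setup, function names, and uniform column sampling are our own illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sampled_estimate(A, X, cols, n):
    """Plain unbiased estimate of A @ X from a uniform sample of `cols`
    columns (neighbors), rescaled by n/len(cols) so its expectation is A @ X.
    Its variance scales with the magnitude of X itself."""
    return (n / len(cols)) * (A[:, cols] @ X[cols])

def vr_estimate(A, X, X_hist, AX_hist, cols, n):
    """Zeroth-order variance reduction with historical embeddings X_hist as a
    control variate: only the drift X - X_hist is sampled, and the cached full
    aggregation AX_hist = A @ X_hist is added back exactly. The estimator is
    still unbiased, but its variance now scales with ||X - X_hist||, which is
    small when embeddings change slowly between iterations."""
    return (n / len(cols)) * (A[:, cols] @ (X[cols] - X_hist[cols])) + AX_hist
```

In the full doubly variance reduction schema, the same control-variate construction is applied a second time to the layerwise gradients during backward propagation (the first-order part).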

2. RELATED WORKS

Training GCNs via sampling. Full-batch training of a typical GCN, as employed in Kipf & Welling (2016), necessitates keeping the whole graph data and the intermediate representations of all nodes in memory. This is the key bottleneck that hinders the scalability of full-batch GCN training. To overcome this issue, sampling-based GCN training methods (Hamilton et al., 2017; Chen et al., 2017; Chiang et al., 2019; Chen et al., 2018; Huang et al., 2018) have been proposed that train GCNs on mini-batches of nodes and only aggregate the embeddings of a sampled subset of the neighbors of the nodes in the mini-batch. For example, GraphSAGE (Hamilton et al., 2017) restricts the computation complexity by uniformly sampling a fixed number of neighbors from the previous layer's nodes. However, a significant computational overhead is introduced as the GCN goes deeper. VRGCN (Chen et al., 2017) further reduces the neighborhood size and uses the historical activations of the previous layer to reduce variance. However, it requires performing a full-batch graph convolution operation on the historical activations during each forward propagation, which is computationally expensive. Another direction applies layer-wise importance sampling to reduce variance. For example, FastGCN (Chen et al., 2018) independently samples a constant number of nodes in each layer using importance sampling. However, the sampled nodes are often too sparse to achieve high accuracy.
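As a sketch of the layer-wise importance-sampling idea (in the spirit of FastGCN, but our own simplified numpy rendering rather than the authors' code): each layer samples a fixed number of nodes with probability proportional to the squared column norms of the adjacency matrix, and rescales so that the layer aggregation A @ X is estimated without bias.

```python
import numpy as np

def layerwise_importance_sample(A, k, rng):
    """Sample k node indices for the next layer with probability p_j
    proportional to ||A[:, j]||^2 (columns that contribute more to the
    aggregation are sampled more often). Returns the indices and their
    sampling probabilities, needed for the unbiased rescaling."""
    p = np.linalg.norm(A, axis=0) ** 2
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=k, replace=True, p=p)
    return idx, p[idx]

def estimate_AX(A, X, idx, p):
    """Importance-weighted estimate of A @ X using only the sampled nodes:
    E[(1/k) * sum_j A[:, j] x_j / p_j] = A @ X, so the estimator is unbiased
    while touching only k columns per layer."""
    return (A[:, idx] / (len(idx) * p)) @ X[idx]
```

Because the number of sampled nodes per layer is constant and the layers are sampled independently, the cost per mini-batch no longer grows with depth, at the price of possibly sparse connectivity between consecutive sampled layers.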



Figure 1: The effect of doubly variance reduction on training loss, validation loss, and mean-square error (MSE) of gradient on Flickr dataset using LADIES proposed in Zou et al. (2019).

