ON THE IMPORTANCE OF SAMPLING IN TRAINING GCNS: CONVERGENCE ANALYSIS AND VARIANCE REDUCTION

Anonymous

Abstract

Graph Convolutional Networks (GCNs) have achieved impressive empirical advances across a wide variety of graph-related applications. Despite this success, training GCNs on large graphs suffers from computational and memory issues. A potential path around these obstacles is sampling-based methods, where a subset of nodes is sampled at each layer. Although recent studies have empirically demonstrated the effectiveness of sampling-based methods, these works lack theoretical convergence guarantees under realistic settings and cannot fully leverage the information of evolving parameters during optimization. In this paper, we describe and analyze a general doubly variance reduction schema that can accelerate any sampling method under a memory budget. The motivation for the proposed schema is a careful analysis of the variance of sampling methods, which shows that the induced variance can be decomposed into node embedding approximation variance (zeroth-order variance) during forward propagation and layerwise-gradient variance (first-order variance) during backward propagation. We theoretically analyze the convergence of the proposed schema and show that it enjoys an O(1/T) convergence rate. We complement our theoretical results by integrating the proposed schema into different sampling methods and applying them to several large real-world graphs.

1. INTRODUCTION

In the past few years, graph convolutional networks (GCNs) have achieved great success in many graph-related applications, such as semi-supervised node classification (Kipf & Welling, 2016), supervised graph classification (Xu et al., 2018), protein interface prediction (Fout et al., 2017), and knowledge graphs (Schlichtkrull et al., 2018; Wang et al., 2017). However, most works on GCNs focus on relatively small graphs, and scaling GCNs to large graphs is not straightforward. Due to the dependencies between nodes in the graph, computing the representation of each node in a mini-batch requires a large receptive field, which grows exponentially with the number of layers. To alleviate this issue, sampling-based methods, such as node-wise sampling (Hamilton et al., 2017; Ying et al., 2018; Chen et al., 2017), layer-wise sampling (Chen et al., 2018; Zou et al., 2019), and subgraph sampling (Chiang et al., 2019; Zeng et al., 2019), have been proposed for mini-batch GCN training. Although empirical results show that sampling-based methods can scale GCN training to large graphs, these methods suffer from a few key issues. First, the theoretical understanding of sampling-based methods is still lacking. Second, the aforementioned sampling strategies are based only on the structure of the graph. Although recent works (Huang et al., 2018; Cong et al., 2020) propose adaptive importance sampling strategies that constantly re-evaluate the relative importance of nodes during training (e.g., based on the current gradient or representation of nodes), finding the optimal adaptive sampling distribution is computationally prohibitive, as it requires calculating the full gradient or all node representations in each iteration. This necessitates developing alternative solutions that can be computed efficiently and that come with theoretical guarantees.
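To make the receptive-field explosion and the effect of node-wise sampling concrete, the following is a minimal sketch (not from the paper; the function names and the toy random graph are illustrative). It counts how many nodes a full-neighborhood L-layer GCN must touch for a single target node, versus a GraphSAGE-style node-wise sampler that keeps only a small fixed fanout per node.

```python
import random

def receptive_field(adj, batch, num_layers, fanout=None):
    """Return the set of nodes needed to compute num_layers-layer GCN
    embeddings for the nodes in `batch`. If `fanout` is given, sample
    at most that many neighbors per node (node-wise sampling)."""
    field = set(batch)
    frontier = set(batch)
    for _ in range(num_layers):
        nxt = set()
        for v in frontier:
            nbrs = adj[v]
            if fanout is not None and len(nbrs) > fanout:
                nbrs = random.sample(nbrs, fanout)  # node-wise sampling step
            nxt.update(nbrs)
        frontier = nxt - field  # only expand newly reached nodes
        field |= nxt
    return field

# Toy graph: 10,000 nodes, each with 8 random neighbors.
random.seed(0)
n = 10_000
adj = {v: random.sample(range(n), 8) for v in range(n)}

full = receptive_field(adj, batch=[0], num_layers=3)
sampled = receptive_field(adj, batch=[0], num_layers=3, fanout=2)
print(len(full), len(sampled))  # full receptive field is far larger
```

With fanout 2 the sampled field is bounded by 1 + 2 + 4 + 8 = 15 nodes, while the full field grows roughly as 8^L; the sampled sums are unbiased estimates of the full aggregation but introduce the variance that the proposed schema targets.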
In this paper, we develop a novel variance reduction schema that can be applied to any sampling strategy to significantly reduce the induced variance. The key idea is to use the historical node

