BDS-GCN: EFFICIENT FULL-GRAPH TRAINING OF GRAPH CONVOLUTIONAL NETS WITH PARTITION-PARALLELISM AND BOUNDARY SAMPLING

Anonymous

Abstract

Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art model for graph-based learning tasks. However, it is still challenging to train GCNs at scale, limiting their application to real-world large graphs and hindering the exploration of deeper and more sophisticated GCN architectures. While it is natural to leverage graph partitioning and distributed training to tackle this challenge, this direction has only been lightly explored, due to a unique difficulty posed by GCN structures: the excessive number of boundary nodes in each partitioned subgraph, which can easily explode the memory and communication required for distributed GCN training. To this end, we propose BDS-GCN, a method that adopts an unbiased boundary-sampling strategy to enable efficient and scalable distributed GCN training while maintaining full-graph accuracy. Empirical evaluations and ablation studies validate the effectiveness of the proposed BDS-GCN, e.g., boosting the throughput by up to 500% and reducing the memory usage by up to 58% for distributed GCN training, while achieving the same accuracy as the state-of-the-art methods. We believe BDS-GCN opens up a new paradigm for enabling GCN training at scale. All code will be released publicly upon acceptance.

1. INTRODUCTION

Graph convolutional networks (GCNs) (Kipf & Welling, 2016) have gained increasing attention as they have recently demonstrated state-of-the-art performance in a number of graph-based learning tasks, including node classification (Kipf & Welling, 2016), link prediction (Zhang & Chen, 2018), graph classification (Xu et al., 2018), and recommendation systems (Ying et al., 2018). The excellent performance of GCNs is attributed to their unrestricted and irregular neighborhood connectivity, which gives them greater applicability to graph-structured data than convolutional neural networks (CNNs), which adopt a fixed, regular neighborhood structure. Specifically, given a node in a graph, a GCN first aggregates the features of the node's neighbors, and then transforms the aggregated feature through (hierarchical) feed-forward propagation to update the node's own feature. These two major operations, i.e., neighbor aggregation and node-feature update, enable GCNs to take advantage of the graph structure and outperform their structure-unaware alternatives. Despite their promising performance, training GCNs at scale remains very challenging, limiting their application to large real-world graphs and hindering the exploration of deeper and more sophisticated GCN architectures. This is because, as the graph size grows, the sheer number of node features and the large adjacency matrix can easily explode the required memory and communication. To tackle this challenge, several sampling-based methods have been developed that reduce the memory requirement at the cost of approximation errors. For example, GraphSAGE (Hamilton et al., 2017) and others (Chen et al., 2017; Huang et al., 2018) reduce the full batch of a large graph to a mini-batch via neighbor sampling; alternative methods (Chiang et al., 2019; Zeng et al., 2019) use subgraph sampling to extract induced subgraphs as training samples.
In parallel with sampling-based methods, a recently emerged and promising direction for handling large-graph training is the distributed training of GCNs, which aims to train large full graphs over multiple GPUs without degrading accuracy. The key idea is to partition a giant graph into small subgraphs such that each fits into a GPU, and to train them in parallel with the necessary communication. Following this paradigm, pioneering efforts, including NeuGraph (Ma et al., 2019), ROC (Jia et al., 2020), and CAGNET (Tripathy et al., 2020), demonstrate the great potential of distributed GCN training, but with different trade-offs. NeuGraph and ROC store entire (sub)graphs in CPU memory to overcome the still-severe GPU-memory requirement, which relies on heavy GPU-CPU communication. CAGNET splits each node's feature vector into small sub-vectors to reduce the granularity of computation for potential memory savings, which however requires repeated and redundant broadcasts of all node features across all subgraphs. As a result, these methods not only require extra CPU resources or communication traffic, but also hurt training performance. To enable efficient full-graph training of GCNs without the aforementioned issues, this work sets out to understand the underlying cause of the memory and communication explosion in distributed GCN training by carefully analyzing the training paradigm: partition parallelism. We find that even with partition parallelism, GCN training can still be ineffective if not designed properly, which motivates us to make the following contributions:

• We identify and formalize two main challenges in partition-parallel training of GCNs: prohibitive memory requirements and communication volume. We further trace both challenges to the excessive number of boundary nodes in each partitioned subgraph, a cause unique to GCN architectures due to neighbor aggregation. These findings give researchers a better understanding of distributed GCN training and may inspire further ideas in this direction.

• We propose BDS-GCN, a simple yet effective boundary-sampling method that overcomes both challenges, enabling more scalable and performant large-graph training of GCNs while maintaining full-graph accuracy. BDS-GCN randomly samples the features of boundary nodes during each training epoch, aggressively shrinking the required memory and communication volume without compromising accuracy.

• Experiments and ablation studies consistently validate the effectiveness of the proposed BDS-GCN in terms of both training performance and achieved accuracy, e.g., boosting the throughput by up to 500% and reducing the memory usage by up to 58% while achieving the same or even better accuracy than the state-of-the-art methods on the Reddit and ogbn-products datasets.
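The boundary-sampling idea above can be illustrated with a minimal NumPy sketch. This is our own illustration, not the paper's implementation: the function name `sample_boundary`, the fixed keep fraction, and the inverse-probability rescaling (a standard way to keep a sampled mean aggregation unbiased) are all assumptions.

```python
import numpy as np

def sample_boundary(boundary_ids, keep_frac, rng):
    """Uniformly sample a fraction of a partition's boundary nodes.

    boundary_ids: ids of nodes owned by other partitions but needed
                  locally for neighbor aggregation.
    keep_frac:    fraction of boundary nodes to keep this epoch.

    Returns (sampled_ids, scale). Only the sampled nodes' features are
    fetched from remote partitions, so per-epoch memory/communication
    shrink roughly by keep_frac. Multiplying each sampled neighbor's
    contribution by `scale` (inverse of the keep probability) keeps a
    sum/mean aggregation unbiased in expectation.
    """
    boundary_ids = np.asarray(boundary_ids)
    k = max(1, int(round(len(boundary_ids) * keep_frac)))
    sampled = rng.choice(boundary_ids, size=k, replace=False)
    scale = len(boundary_ids) / k
    return sampled, scale
```

A new sample is drawn every epoch, so over training each boundary node's features are still seen with equal probability.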

2. BACKGROUND AND RELATED WORKS

Graph Convolutional Networks. A GCN takes graph-structured data as input and learns a feature (embedding) vector representing each node in the graph. To learn these feature vectors, a GCN performs two major steps in each layer, i.e., neighbor aggregation and update, which can be represented as:

$$a_v^{(l)} = \zeta^{(l)}\big(\{\, h_u^{(l-1)} \mid u \in \mathcal{N}(v) \,\}\big) \quad (1)$$

$$h_v^{(l)} = \phi^{(l)}\big(a_v^{(l)}, h_v^{(l-1)}\big) \quad (2)$$

where $\mathcal{N}(v)$ denotes the neighbor set of node $v$ in the graph and $h_u^{(l)}$ denotes the learned feature vector of node $u$ at the $l$-th layer. $\zeta^{(l)}$ is the aggregation function that takes neighbor features and generates the aggregation result $a_v^{(l)}$ for node $v$. Then $\phi^{(l)}$ updates the feature of node $v$. A well-known instance of GCN is GraphSAGE with the mean aggregator (Hamilton et al., 2017), in which $\zeta^{(l)}$ is the mean function and $\phi^{(l)}$ is $\sigma\big(W \cdot \mathrm{CONCAT}(a_v^{(l)}, h_v^{(l-1)})\big)$, where $W$ is the weight matrix and $\sigma$ is a non-linear activation. This instance is the focus of our work, but our approach can also be extended easily to other popular aggregators and update functions. Large Graph Training. Real-world graphs consist of millions of nodes and billions of edges (Hu et al., 2020), which are beyond the capability of vanilla GCNs (Hamilton et al., 2017; Jia et al., 2020), especially due to the constraint of GPU memory capacity. To tackle this issue, several sampling-based methods were proposed, such as neighbor sampling (Hamilton et al., 2017; Chen et al., 2017), layer sampling (Chen et al., 2018; Huang et al., 2018; Zou et al., 2019), and subgraph sampling (Chiang et al., 2019; Zeng et al., 2019; Wang et al., 2019). However, these methods suffer three major drawbacks:
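Eqs. (1)-(2) with the GraphSAGE mean aggregator can be sketched in a few lines of NumPy. This is a minimal single-layer illustration (the function name and the dense adjacency-list representation are our own choices, not GraphSAGE's actual implementation):

```python
import numpy as np

def sage_mean_layer(h, adj_list, W, activation=np.tanh):
    """One GraphSAGE layer with a mean aggregator.

    For each node v:
      Eq. (1): a_v = mean of neighbor features {h_u : u in N(v)}
      Eq. (2): h_v' = sigma(W @ concat(a_v, h_v))

    h:        (num_nodes, in_dim) features from the previous layer
    adj_list: adj_list[v] is the list of neighbor indices N(v)
    W:        (out_dim, 2 * in_dim) weight matrix
    """
    num_nodes, in_dim = h.shape
    out = np.empty((num_nodes, W.shape[0]))
    for v in range(num_nodes):
        nbrs = adj_list[v]
        # Eq. (1): mean aggregation (zero vector if v has no neighbors)
        a_v = h[nbrs].mean(axis=0) if nbrs else np.zeros(in_dim)
        # Eq. (2): update via sigma(W . CONCAT(a_v, h_v))
        out[v] = activation(W @ np.concatenate([a_v, h[v]]))
    return out
```

Stacking L such layers lets each node's embedding depend on its L-hop neighborhood, which is exactly why partitioned training must exchange boundary-node features.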

