DISTRIBUTED TRAINING OF GRAPH CONVOLUTIONAL NETWORKS USING SUBGRAPH APPROXIMATION

Abstract

Modern machine learning techniques are successfully being adapted to data modeled as graphs. However, many real-world graphs are very large and do not fit in memory, often making it intractable to train machine learning models on them. Distributed training has been successfully employed to alleviate memory problems and speed up training in machine learning domains in which the input data is assumed to be independent and identically distributed (i.i.d.). However, distributing the training over non-i.i.d. data, such as the graphs used as training inputs in Graph Convolutional Networks (GCNs), causes accuracy problems, since information is lost at the graph partitioning boundaries. In this paper, we propose a training strategy that mitigates the information lost across multiple partitions of a graph through a subgraph approximation scheme. Our proposed approach augments each subgraph with a small amount of edge and vertex information that is approximated from all other subgraphs. The subgraph approximation approach helps the distributed training system converge at single-machine accuracy, while keeping the memory footprint low and minimizing synchronization overhead between the machines.

1. INTRODUCTION

Graphs are used to model data in a diverse range of applications such as social networks (Hamilton et al., 2017a), biological networks (Fout et al., 2017) and e-commerce interactions (Yang, 2019). Recently, there has been an increased interest in applying machine learning techniques to graph data. A breakthrough was the development of Graph Convolutional Networks (GCNs), which generalize Convolutional Neural Networks (CNNs) to operate on graphs (Kipf & Welling, 2017). Training GCNs, much like training CNNs, is a memory- and computationally demanding task and may take days or weeks on large graphs. GCNs are particularly memory demanding since processing a single vertex may require accessing a large number of neighbors. If the GCN algorithm accesses neighbors that are multiple hops away, it may need to traverse disparate sections of the graph with no locality. This requirement poses challenges when the graph does not fit in memory, which is often the case for large graph representations. Previous approaches to alleviate these memory and performance issues seek to constrain the number of dependent vertices by employing sampling-based methods (Hamilton et al., 2017b; Chen et al., 2018; Huang et al., 2018; Chiang et al., 2019). However, sampling the graph may limit the achievable accuracy (Jia et al., 2020). CNN training has turned to data-parallel distributed training to overcome the memory size constraints and the computing capacity of a single machine. Here, a number of machines independently train the same neural network model, but each on an assigned chunk of the input data. To construct a single global model, the gradients or parameters are aggregated at regular intervals, such as at the completion of a batch (Dean et al., 2012). While this approach works well for CNNs, it is more challenging for GCNs due to the nature of the input data. If the training input is a set of images, each image is independent.
Hence, when the dataset is split into chunks of images, no information is directly carried between the chunks. In contrast, if the input is a graph, the equivalent of splitting the dataset into chunks is to divide the large graph into a number of subgraphs. However, GCN training relies not only on information inherent to each vertex but also on information from its neighbors. When a GCN trains on one subgraph and needs information from a neighbor in another subgraph, the edges that span across machines create a communication bottleneck. For
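The neighbor dependence described above can be seen in a minimal sketch of a single GCN propagation layer, following the rule of Kipf & Welling (2017). The toy graph, feature matrix, and function name below are illustrative assumptions, not artifacts of this paper's system:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W).
    Each output row mixes a vertex's features with its neighbors' features,
    so a vertex on a partition boundary needs feature rows of H that may be
    stored on another machine."""
    A_hat = A + np.eye(A.shape[0])          # adjacency with self-loops
    d = A_hat.sum(axis=1)                   # vertex degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D^{-1/2}
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Toy 4-vertex path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.eye(4)        # one-hot input features, one row per vertex
W = np.ones((4, 2))  # toy weight matrix
out = gcn_layer(A, H, W)
```

If the path graph were cut between vertices 2 and 3 and placed on two machines, computing vertex 2's output would require fetching vertex 3's feature row remotely every layer; this is precisely the cross-partition traffic the subgraph approximation scheme aims to reduce.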

