DISTRIBUTED TRAINING OF GRAPH CONVOLUTIONAL NETWORKS USING SUBGRAPH APPROXIMATION

Abstract

Modern machine learning techniques are successfully being adapted to data modeled as graphs. However, many real-world graphs are very large and do not fit in memory, often making the problem of training machine learning models on them intractable. Distributed training has been successfully employed to alleviate memory problems and speed up training in machine learning domains in which the input data is assumed to be independently and identically distributed (i.i.d.). However, distributing the training of non-i.i.d. data, such as the graphs used as training inputs in Graph Convolutional Networks (GCNs), causes accuracy problems, since information is lost at the graph partitioning boundaries. In this paper, we propose a training strategy that mitigates the information lost across multiple partitions of a graph through a subgraph approximation scheme. Our proposed approach augments each subgraph with a small amount of edge and vertex information that is approximated from all other subgraphs. The subgraph approximation approach helps the distributed training system converge at single-machine accuracy, while keeping the memory footprint low and minimizing synchronization overhead between the machines.

1. INTRODUCTION

Graphs are used to model data in a diverse range of applications such as social networks (Hamilton et al., 2017a), biological networks (Fout et al., 2017), and e-commerce interactions (Yang, 2019). Recently, there has been an increased interest in applying machine learning techniques to graph data. A breakthrough was the development of Graph Convolutional Networks (GCNs), which generalize the function of Convolutional Neural Networks (CNNs) to operate on graphs (Kipf & Welling, 2017). Training GCNs, much like training CNNs, is a memory- and compute-intensive task that may take days or weeks on large graphs. GCNs are particularly memory demanding since processing a single vertex may require accessing a large number of neighbors. If the GCN algorithm accesses neighbors that are multiple hops away, it may need to traverse disparate sections of the graph with no locality. This requirement poses challenges when the graph does not fit in memory, which is often the case for large graph representations. Previous approaches to alleviate memory and performance issues seek to constrain the number of dependent vertices by employing sampling-based methods (Hamilton et al., 2017b; Chen et al., 2018; Huang et al., 2018; Chiang et al., 2019). However, sampling the graph may limit the achievable accuracy (Jia et al., 2020). CNN training turned to data-parallel distributed training to overcome the memory and compute capacity constraints of a single machine. Here, a number of machines independently train the same neural network model, each on an assigned chunk of the input data. To construct a single global model, the gradients or parameters are aggregated at regular intervals, such as at the completion of a batch (Dean et al., 2012). While this approach works well for CNNs, it is more challenging for GCNs due to the nature of the input data. If the training input is a set of images, each image is independent.
Hence, when the dataset is split into chunks of images, no information is directly carried between the chunks. In contrast, if the input is a graph, the equivalent of splitting the dataset into chunks is to divide the large graph into a number of subgraphs. However, GCNs train on graphs, which rely not only on information inherent to a vertex but also on information from its neighbors. When a GCN trains on one subgraph and needs information from a neighbor in another subgraph, the information flowing over the edges that span across machines creates a communication bottleneck. For example, in our experimental evaluations we note that when the Reddit dataset (Hamilton et al., 2017b) is partitioned into five subgraphs, there are about 1.3 million (bidirectional) edges that span across machines. If each edge is followed, 2.6 million messages would be sent, each containing a feature vector of 602 single-precision floats, summing up to about 6.3 GB of data transferred each epoch, where one epoch traverses the entire graph. Training typically requires many epochs. In our experiments running distributed training on the GCN proposed by Kipf & Welling (2017), one epoch takes about one second; thus the required bandwidth is enormous. This is a serious scalability issue: since the time it takes to finish one epoch decreases as training is scaled out on more machines, the required bandwidth increases. Even with optimizations such as caching (e.g., vertex caching (Yang, 2019)), the amount of communication remains significant. Alternatively, instead of communicating between subgraphs, a GCN may ignore any information that spans multiple machines (Jia et al., 2020). However, we observe that this approach results in accuracy loss due to suboptimal training.
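The 6.3 GB figure above follows directly from the quoted edge and feature counts. As a sanity check, the arithmetic can be reproduced in a few lines (the counts are those stated for Reddit in the text; float32 size is 4 bytes):

```python
# Back-of-the-envelope estimate of per-epoch communication volume for
# the Reddit example: 1.3M cross-partition bidirectional edges, each
# direction carrying a 602-element single-precision feature vector.

cross_edges = 1_300_000          # edges spanning machine boundaries
messages = 2 * cross_edges       # one message per edge direction
feature_floats = 602             # feature vector length per vertex
bytes_per_float = 4              # single precision (float32)

bytes_per_epoch = messages * feature_floats * bytes_per_float
print(f"{bytes_per_epoch / 1e9:.1f} GB per epoch")  # → 6.3 GB per epoch
```

At roughly one second per epoch, this corresponds to a sustained aggregate bandwidth on the order of 50 Gbit/s, which illustrates why following every cross-partition edge is impractical.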
For example, when distributed on five compute nodes and ignoring edges that span across machines, we note in our evaluations that the accuracy of GraphSAGE trained on the Reddit dataset drops from 94.94% to 89.95%, a substantial loss in accuracy. To mitigate this problem, we present a distributed training approach for GCNs which reaches an accuracy comparable with training on a single-machine system, while keeping the communication overhead low. This allows GCNs to train on very large graphs using a distributed system where each machine has limited memory, since the memory usage on each machine scales with the size of its allotted (local) subgraph. We achieve this by approximating the non-local subgraphs and making the approximations available on the local machine. We show that, by incurring a small vertex overhead on each local machine, a high accuracy can be achieved. Our contributions in this paper are:

• A novel subgraph approximation scheme to enable distributed training of GCNs on large graphs using memory-constrained machines, while reaching an accuracy comparable to that achieved on a single machine.

• An evaluation of the approach on two GCNs and two datasets, showing that it considerably increases convergence accuracy, and allows for faster training while keeping the memory overhead low.

The rest of the paper is organized as follows. Section 2 walks through the necessary background and presents experimental data that motivates our approach. Section 3 describes our approach to efficiently distribute GCNs. Section 4 describes our evaluation methodology, while Section 5 presents the results of our evaluation. Section 6 discusses related work before we conclude in Section 7.

2. BACKGROUND AND MOTIVATION

In this section we first describe GCNs. Then, we show that GCNs are resilient to approximations in the input graph, and describe how that motivates our approach in Section 3.

2.1. GRAPH CONVOLUTIONAL NETWORKS

GCNs are neural networks that operate directly on graphs as the input data structure. A GCN aims to define an operation similar to a convolutional layer in CNNs, where a parameterized filter is applied to a pixel and its neighbors in an image (Fig. 1). In the graph setting, the pixel corresponds to a vertex, and the edges from the vertex point out its neighbors (Fig. 2). Early approaches to define this operation used graph convolutions from spectral graph theory (Bruna et al., 2014; Henaff et al., 2015). While theoretically sound, these approaches are difficult to scale to large graphs (Wu et al., 2020). Recent research (e.g., Hamilton et al., 2017b; Chen et al., 2018) instead takes a spatial-based approach, which enables computation in batches of vertices. The goal of a GCN is to learn the parameters of a neural network given a graph G = (V, E), where V is a set of vertices and E is a set of edges. The input is a feature matrix X, where each row represents the feature vector x_v for a vertex v ∈ V, and an adjacency matrix A, which represents the edges e ∈ E. Each layer can be described as H_i = f(H_{i-1}, A), with H_0 = X, where i is the layer index and f is a propagation rule (Kipf & Welling, 2017). The number of layers and the propagation rule differ between GCNs. While our proposal is general to all GCNs of this type, we evaluate it on the GCN proposed by Kipf & Welling (2017), which we call KW-GCN, and GraphSAGE, proposed by Hamilton et al. (2017b). Both GCNs are described in more detail in Section 4.1.
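As a concrete illustration of the layer-wise computation H_i = f(H_{i-1}, A), the sketch below implements the propagation rule of Kipf & Welling (2017): the adjacency matrix is augmented with self-loops, symmetrically normalized, and used to aggregate neighbor features before a learned linear transform and a nonlinearity. The toy graph, features, and weights are placeholders, not data from our evaluation:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One KW-GCN layer: H_i = ReLU(D^-1/2 (A + I) D^-1/2 H_{i-1} W).

    H: (num_vertices x num_features) activation matrix,
    A: (num_vertices x num_vertices) adjacency matrix,
    W: learned weight matrix of this layer.
    """
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    deg = A_hat.sum(axis=1)                    # per-vertex degree
    D_inv_sqrt = np.diag(deg ** -0.5)          # D^-1/2
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)     # ReLU activation

# Two-layer forward pass on a toy 3-vertex path graph: H_0 = X.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = rng.standard_normal((3, 4))                        # input features
W1, W2 = rng.standard_normal((4, 8)), rng.standard_normal((8, 2))
H1 = gcn_layer(X, A, W1)
H2 = gcn_layer(H1, A, W2)
print(H2.shape)  # (3, 2)
```

Note how every row of A_norm @ H mixes a vertex's features with those of its neighbors; it is exactly this neighbor dependence that makes naive graph partitioning lossy at subgraph boundaries.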

