HIERBATCHING: LOCALITY-AWARE OUT-OF-CORE TRAINING OF GRAPH NEURAL NETWORKS

Abstract

Graph neural networks (GNNs) have become increasingly popular for analyzing data organized as massive graphs. Efficient training of GNN models under limited computing resources is critical for their widespread adoption. We consider the use of a single commodity machine with limited memory (e.g., 128GB) but ample external storage (e.g., 1TB). On such a platform, the feature data or even the graph may not fit in memory. When data resides on external storage, gathering features and constructing neighborhood subgraphs in typical mini-batch training incurs random storage accesses and thus expensive data movement. To overcome this bottleneck, we propose a locality-aware training scheme, coined HierBatching, which significantly increases training speed while retaining training quality. The key idea is to exploit the memory hierarchy of a modern GPU machine by constructing batches in an analogously hierarchical manner. HierBatching groups nodes into partitions, each of which is laid out contiguously on disk for maximal spatial locality. Meanwhile, the main memory is treated as a cache that holds a mega-batch, which is a random collection of partitions. Mini-batches are sampled from the mega-batch in main memory for GPU training. Each mega-batch is reused multiple times to improve temporal locality. Our experiments show that on a machine with 128GB main memory, HierBatching is 3× to 20× faster than a straightforward out-of-core training approach based on mmap, while maintaining prediction accuracy.

1. INTRODUCTION

Graph neural networks (GNNs) have emerged as effective machine learning models for many practical applications, such as social network analysis, financial forensics, recommendation, and traffic forecasting (Hamilton et al., 2017b). Training GNNs has become increasingly challenging as graph sizes have grown rapidly. Even benchmark datasets that mimic real-life applications have grown to sizes that forbid the use of commodity laptops or servers for experimentation and model development. For example, the MAG240M dataset (Hu et al., 2020) has 407GB of raw data (of which 349GB is node features). The storage requirement easily exceeds the memory capacity of a single machine; hence, one considers either exploiting external storage or using a distributed-memory cluster. The latter imposes a natural cost barrier for many individuals and organizations, and has also been shown to be often heavily communication-bound (Ramezani et al., 2022). In this work, we exploit external storage (e.g., SSD) and perform out-of-core GNN training on a single machine with one or a few GPUs. Fig. 1(a) illustrates the typical memory hierarchy of such a machine. The ample external storage is assumed to be able to store the entire dataset in a format that facilitates processing and training, while the main memory cannot. Each GPU, as is common in today's machine architectures, has an even smaller memory capacity than the main memory. We use one or multiple GPUs to perform mini-batch training. Conventional wisdom holds that the key to exploiting the memory hierarchy of a machine is locality: spatial locality, which takes advantage of sequential data accesses and thus reads data in blocks instead of single words, and temporal locality, which takes advantage of the fact that recently accessed data is likely to be accessed again in the near future, and thus places it closer to the processors to reduce access time.
In disk-based computing, spatial locality is exploited by the operating system, which retrieves data from disk to memory in fixed-size 4KB to 8KB blocks, called pages. This avoids the read amplification problem (which dramatically reduces the effective disk bandwidth) that would arise if the disk were read in cache-line-sized blocks. For temporal locality, it is desirable that data loaded from disk be reused as often as possible, given the significant latency of moving data from disk to memory. Following this intuition, we propose a hierarchical batching scheme, abbreviated HierBatching, to exploit locality in the modern commodity memory hierarchy (external storage, main memory, and GPU memory). HierBatching batches data in an analogously hierarchical fashion: the entire set of graph nodes (with features and edge lists) is stored in external storage, while a mega-batch of nodes is sampled and copied to the main memory, which serves as a cache; mini-batches are then sampled from the cache for gradient-based training. We make this scheme locality-aware by partitioning the graph nodes into many small partitions that are stored consecutively on disk (spatial locality), and by performing multiple rounds of mini-batch training with the same cache (temporal locality). HierBatching preserves the random nature of stochastic training, as it forms each mega-batch from a random combination of partitions and samples mini-batches from the mega-batch randomly. Realizing HierBatching poses a subtle challenge: some neighbors of a node in a mega-batch may not be included in the cache, which degrades training quality and prediction accuracy. To preserve as many neighbors as possible within a small memory budget, we propose to permanently store the highest-degree nodes in memory, because they are the most likely to be connected to nodes in the mega-batch.
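The two-level sampling described above can be sketched as follows. This is a minimal illustration of the batching logic only, with illustrative names; the actual system additionally manages disk I/O, feature gathering, and neighborhood sampling.

```python
import random

def hier_batches(partitions, parts_per_mega, batch_size, reuse_rounds, seed=0):
    """Two-level sampling sketch for HierBatching (all names illustrative).

    `partitions` is a list of node-id lists, each assumed to be laid out
    contiguously on disk. A mega-batch (a random group of partitions) is
    "loaded" into the in-memory cache, then reused for several rounds of
    mini-batch sampling before the next mega-batch is fetched.
    """
    rng = random.Random(seed)
    order = list(range(len(partitions)))
    rng.shuffle(order)  # random combination of partitions per epoch
    for start in range(0, len(order), parts_per_mega):
        # One sequential read per partition: spatial locality.
        cache = [v for p in order[start:start + parts_per_mega]
                 for v in partitions[p]]
        for _ in range(reuse_rounds):  # temporal locality: reuse the cache
            rng.shuffle(cache)
            for i in range(0, len(cache), batch_size):
                yield cache[i:i + batch_size]  # mini-batch of seed nodes
```

Note that randomness is preserved at both levels: the partition order is shuffled per epoch, and the cached nodes are reshuffled for every reuse round.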
We thus divide the main memory into a static cache and a dynamic cache and use the former to accommodate these nodes, such that their features are always reachable without repeated transfers from external storage. See Fig. 1(b) for a preview. Our work makes the following contributions:
• We study a practical but under-explored scenario for training GNNs on massive graphs using external storage and propose HierBatching, a locality-aware batching scheme that fully leverages the memory hierarchy of a machine, particularly disk, to improve training efficiency.
• We introduce a static cache to compensate for the loss of node degrees in the formation of the batching hierarchy, retaining the model accuracy obtained by standard in-memory mini-batch training.
• We demonstrate empirically that, on a GPU-equipped machine with 128GB main memory, HierBatching is 3× to 20× faster than DGL with mmap support, while retaining prediction accuracy. HierBatching is also competitive with in-memory DGL, which requires 3 times more memory.
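Populating the static cache amounts to selecting the highest-degree nodes that fit within a fixed memory budget. A minimal sketch (the function name and byte-budget interface are illustrative, not from the paper):

```python
import numpy as np

def pick_static_cache(degrees, feature_bytes_per_node, static_cache_bytes):
    """Choose the node ids to pin in the static in-memory cache.

    Following the paper's heuristic, the highest-degree nodes are pinned,
    since they are the most likely out-of-cache neighbors of a mega-batch.
    """
    budget = static_cache_bytes // feature_bytes_per_node  # nodes that fit
    order = np.argsort(degrees)[::-1]                      # highest degree first
    return set(order[:budget].tolist())
```

The remaining memory then acts as the dynamic cache that holds the current mega-batch.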

2. PRELIMINARIES AND RELATED WORK

In this work, we consider message-passing GNNs that act on a given graph $G(V, E)$, where $V$ is the node set and $E$ is the edge set. For each node $v \in V$, let $h_v^0$ be the initial feature vector. A $K$-layer GNN uses message passing to iteratively update the feature vector and produces an output vector $h_v^K$. Specifically, the update at the $k$-th layer ($1 \le k \le K$) reads
$$h_v^k = \mathrm{update}^k\left(h_v^{k-1},\ \mathrm{aggregate}^k\left(\{h_u^{k-1} \mid u \in N(v)\}\right)\right),$$
where $N(v)$ denotes the 1-hop neighborhood of $v$, and $\mathrm{update}^k$ and $\mathrm{aggregate}^k$ are operators on the feature vectors, generally layer-dependent. Common GNNs, such as GCN (Kipf & Welling,



Figure 1: Overview of hierarchical batching. The disk stores the entire graph, which is divided into partitions. Multiple (four in this example) partitions are randomly selected and loaded into the main memory to form a mega-batch, from which mini-batches are sampled and moved to the GPUs for training.
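The layer update rule from Section 2 can be illustrated with a small dense sketch. Here we instantiate the generic operators with one common choice: a mean over 1-hop neighbors for aggregate and a linear map plus ReLU for update; the paper's framework leaves both operators generic and layer-dependent.

```python
import numpy as np

def gnn_layer(h, neighbors, W_self, W_nbr):
    """One message-passing layer: h_v^k = update(h_v^{k-1}, aggregate({h_u^{k-1}})).

    h:         (num_nodes, d_in) feature matrix from the previous layer.
    neighbors: list where neighbors[v] is the list of 1-hop neighbors of v.
    """
    out = np.empty((h.shape[0], W_self.shape[1]))
    for v, nbrs in enumerate(neighbors):
        # aggregate: mean of neighbor features (zeros if v has no neighbors)
        agg = h[list(nbrs)].mean(axis=0) if nbrs else np.zeros(h.shape[1])
        # update: combine self feature with the aggregated message
        out[v] = h[v] @ W_self + agg @ W_nbr
    return np.maximum(out, 0.0)  # ReLU nonlinearity

```

Stacking $K$ such layers produces $h_v^K$; in mini-batch training, only the $K$-hop neighborhood of the seed nodes is needed to compute their outputs, which is what makes neighbor gathering the dominant data-movement cost out of core.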

