GCINT: DYNAMIC QUANTIZATION ALGORITHM FOR TRAINING GRAPH CONVOLUTION NEURAL NETWORKS USING ONLY INTEGERS

Abstract

Quantization approaches can reduce storage costs while decreasing the computational complexity of a model, yet there has been little study of quantized networks in the GNN field. We identify four primary reasons why existing quantization approaches cannot be employed widely with GNNs: (1) differences in data sources; (2) differences in data streams; (3) differences in model scale; (4) the limitations of QAT. Based on this analysis, we propose GCINT, an efficient quantization framework for GNN training. The entire forward pass, backward pass, optimizer, and loss function are computed with integer data. We achieve a training speedup of nearly 10× on RTX 2080 Ti INT8 Tensor Cores compared to FP32 CUDA Cores. Our quantization is independent of the dataset and weight distribution: more than 2,000 randomized trials on 8 popular GNN benchmark datasets all achieve accuracy within 1% of the FP32 baseline.

1. INTRODUCTION

There is an abundance of graph-structured data in the natural and social sciences. In fields such as social networks (Fan et al., 2019), recommender systems (Wu et al., 2020), traffic networks (Jiang & Luo, 2022), molecular prediction (Mansimov et al., 2019), and drug discovery (Zhang et al., 2022), Graph Neural Networks (GNNs), the representative deep learning systems for graph data learning, inference, and generalization, have produced superior outcomes. As graph learning applications multiply and graph data expands, training GNNs becomes inefficient due to two significant obstacles: (1) Storage costs. Since training requires recording the outputs of several layers during forward propagation for use in the backward-propagation calculation, extremely large-scale graph data is frequently stored in distributed CPU-centric memory and trained by distributed GPU clusters with a mini-batch technique. Common acceleration devices such as GPUs and FPGAs, with limited on-chip storage and bandwidth, can no longer meet the demands of training large GNNs, and they depend too heavily on sampling techniques to train with a limited batch size per session on a single device (Yang, 2019). (2) Computation costs. Training a single epoch on the Reddit dataset generally requires tens of TFLOPs, even for KB-sized GNN models. Quantization (Yang et al., 2019) can lower storage costs while decreasing a model's computational complexity (Nagel et al., 2021). Although quantization is widely used for CNNs, research on quantized networks for GNNs is scarce. We believe the following factors primarily restrict the applicability of quantization approaches to GNNs: (1) Differences in data sources. During CNN training, UINT8 RGB images are normalized and fed to the network.
In contrast, in GNN models the node features are frequently not the result of normalization, and their distribution shifts as the graph changes and as embedding methods are applied. The information in an image can already be represented in UINT8, whereas the embedding vectors of graph nodes are typically in FP32 format and carry significantly more information than UINT8 can express. A GNN quantizer must therefore represent a large amount of dataset information with a limited number of bits. (2) Differences in data streams. The per-layer computation of a CNN maps to the GPU as General Matrix Multiplication (GEMM), whereas the activation distribution in each GNN layer is strongly tied to the graph topology: when the average degree of the graph is high, integer-domain aggregation is more susceptible to data overflow; conversely, when the average degree is low, activations concentrate in the low-bit data range. This brings uncertainty into quantized training, so typical CNN quantization approaches cannot be employed directly in GNN models. (3) Differences in model scale. Because DNN models generally contain millions of parameters, it pays to reduce storage and computation by quantizing and compressing the weights; for example, BNN (Tang et al., 2017) and XNOR-Net (Rastegari et al., 2016) quantize weights to binary values. GNN models, however, are typically on the order of KBs, so the gains from compressing weights are not substantial. (4) Limitations of QAT. Most CNN quantization builds on the study of QAT (Wang et al., 2019) quantization operators, a design strategy that has evolved into a robust, low-error quantization methodology.
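To make the dataset-quantization problem concrete, the sketch below shows a generic per-tensor symmetric INT8 quantizer applied to unnormalized FP32 node features. This is our own minimal illustration, not the GCINT scheme; the function name and the random features are assumptions for the example.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Per-tensor symmetric quantization: scale = max|x| / 127,
    q = clip(round(x / scale), -127, 127)."""
    scale = float(np.abs(x).max()) / 127.0
    if scale == 0.0:
        return np.zeros_like(x, dtype=np.int8), 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Hypothetical stand-in for unnormalized FP32 node embeddings.
rng = np.random.default_rng(0)
feats = rng.normal(0.0, 2.5, size=(4, 8)).astype(np.float32)

q, s = quantize_int8(feats)
dequant = q.astype(np.float32) * s
# Rounding error is bounded by half a quantization step, but the step size
# is tied to the current dynamic range, which shifts as the graph evolves.
assert np.abs(feats - dequant).max() <= s / 2 + 1e-6
```

Because the scale is derived from the current maximum, a shifting feature distribution changes the quantization step, which is one reason static CNN-style calibration transfers poorly to graph data.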
QAT applies low-bit quantization to weights and activations during forward propagation, reduces the noise and loss induced by forward quantization using FP32 backward propagation, and finally dequantizes the model to integers for inference acceleration at deployment (Krishnamoorthi, 2018). The restriction of QAT is that it cannot be utilized for accelerated training, since the data format during training is still FP32. Once a CNN model has been trained on a significant quantity of data for a certain class of tasks, it seldom needs retraining after deployment, so QAT provides CNN models with extraordinarily high benefits. Real-world GNNs, however, tend to operate on dynamic graphs and require fine-tuning or retraining of the model; accelerating the GNN training process is therefore more relevant than accelerating inference, and accelerated training is what QAT cannot provide. This work considers the motivations and problems associated with quantizing graph architectures, and makes the following contributions:
• We employ a top-down quantization methodology. The vast majority of prior quantization studies are bottom-up: they start from an FP32 tensor, quantize it to an FP32 tensor whose values map one-to-one to integers, and then dequantize to integers, which we consider a superfluous detour. We instead investigate quantization directly as integer tensor computation, with the advantage that an integer model is obtained without dequantization and can be used directly for inference and training speedup on fixed-point hardware such as GPU INT Tensor Cores.
• We propose a novel quantized training algorithm for graphs as an alternative to the traditional QAT method, to accelerate the GNN training process.
This algorithm adaptively adjusts the quantization range according to the sparsity of the graph data and accommodates unevenly distributed data during training. The entire forward pass, backward pass, optimizer, and loss function are computed with integer data, which the INT Tensor Cores of a GPU can accelerate directly for GNN training. We achieve a training speedup of nearly 10× on RTX 2080 Ti INT8 Tensor Cores compared to FP32 CUDA Cores, and can train a larger subgraph than before within limited memory.
• Our quantization is independent of the dataset and weight distribution: more than 2,000 randomized trials on the 8 popular GNN benchmark datasets all achieve accuracy within 1% of the FP32 baseline, without fine-tuning hyperparameters.

2. RELATED WORK

A number of quantized-training studies have been published in recent years. WAGE (Wu et al., 2018) provides a low-bit quantization approach for the weights, activations, gradients, and errors in CNN training. Data is linearly mapped between integers and FP32, and all training data must be dequantized to the integer domain before it can be utilized for training acceleration. The authors assume that the matrix product of a bit_a tensor with a bit_w tensor yields a bit_(a+w-1) tensor, which is then quantized back to bit_a and sent to the next network layer. We consider basing a quantization approach on this assumption insufficiently rigorous, since it disregards the dimensionality of the tensor: if the reduction dimension is n, the bit-width of the output tensor can reach bit_(a+w-1+⌈log2 n⌉). Because the GEMM and SPMM (sparse-dense matrix multiplication) dimensions in GNNs are typically large, the output bit-width overflows to varying degrees, rendering the method inapplicable to the training speedup of GNNs.
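The bit-width growth can be checked numerically. The sketch below is our own illustration (`accum_bits` is a hypothetical helper, not from WAGE or GCINT): it evaluates a+w-1+⌈log2 n⌉ and confirms it bounds a worst-case INT8 dot product, which is consistent with INT8 Tensor Core GEMMs accumulating in INT32.

```python
import math
import numpy as np

def accum_bits(a_bits: int, w_bits: int, n: int) -> int:
    """Worst-case signed bit-width of an n-term dot product between
    a_bits-wide activations and w_bits-wide weights."""
    return a_bits + w_bits - 1 + math.ceil(math.log2(n))

n = 4096  # a plausible GEMM/SPMM reduction dimension in GNN training
assert accum_bits(8, 8, n) == 27  # 8 + 8 - 1 + 12: exceeds INT16, fits INT32

# Adversarial check: every product at the symmetric-INT8 extreme (127 * 127).
a = np.full(n, 127, dtype=np.int64)
w = np.full(n, 127, dtype=np.int64)
dot = int((a * w).sum())
# bit_length() counts magnitude bits; +1 accounts for the sign bit.
assert dot.bit_length() + 1 <= accum_bits(8, 8, n)
```

For reductions beyond n = 2^17, the bound a+w-1+⌈log2 n⌉ exceeds 32 bits even for INT8 operands, so very large reductions must be tiled or rescaled to avoid overflowing an INT32 accumulator.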

