A²Q: AGGREGATION-AWARE QUANTIZATION FOR GRAPH NEURAL NETWORKS

Abstract

As graph data grow in size, the high latency and memory consumption of inference pose a significant challenge to the real-world deployment of Graph Neural Networks (GNNs). While quantization is a powerful approach to reducing GNN complexity, most previous works on GNN quantization fail to exploit the unique characteristics of GNNs and suffer from severe accuracy degradation. Through an in-depth analysis of graph topology, we observe that the topology leads to significant differences between nodes, and that most nodes in a graph have small aggregation values. Motivated by this, we propose Aggregation-Aware mixed-precision Quantization (A²Q) for GNNs, where an appropriate bitwidth is automatically learned and assigned to each node in the graph. To mitigate the vanishing-gradient problem caused by sparse connections between nodes, we propose a Local Gradient method that uses the quantization error of the node features as supervision during training. We also develop a Nearest Neighbor Strategy to handle generalization to unseen graphs. Extensive experiments on eight public node-level and graph-level datasets demonstrate the generality and robustness of our method. Compared to FP32 models, our method achieves up to an 18.6× compression ratio (i.e., 1.70 bits) with negligible accuracy degradation. Moreover, compared to the state-of-the-art quantization method, our method achieves up to 11.4% and 9.5% accuracy improvements on node-level and graph-level tasks, respectively, and up to a 2× speedup on a dedicated hardware accelerator.
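To make the abstract's core idea concrete before the full method is presented, the following is a minimal PyTorch sketch of per-node learnable quantization with the quantization error exposed as an auxiliary training signal; it is an illustration of the general technique, not the authors' implementation, and the names (`NodeQuantizer`, `q_loss`, `init_bits`) and the exact loss form are assumptions.

```python
# Minimal sketch: each node gets a learnable step size and a learnable
# (continuous) bitwidth; rounding uses a straight-through estimator (STE),
# and the feature quantization error is stored for use as a local loss.
import torch
import torch.nn as nn

class NodeQuantizer(nn.Module):
    def __init__(self, num_nodes, init_bits=4.0):
        super().__init__()
        self.log_step = nn.Parameter(torch.zeros(num_nodes))        # per-node step size
        self.bits = nn.Parameter(torch.full((num_nodes,), init_bits))  # per-node bitwidth

    def forward(self, x):                       # x: [num_nodes, feat_dim]
        step = self.log_step.exp().unsqueeze(1)                     # [N, 1]
        # round the continuous bitwidth, STE keeps it trainable
        bits = self.bits + (self.bits.round() - self.bits).detach()
        qmax = (2 ** (bits.clamp(min=1.0) - 1) - 1).unsqueeze(1)    # symmetric range
        # uniform symmetric quantization, STE on the rounding too
        q = (x / step).clamp(-qmax, qmax)
        q = q + (q.round() - q).detach()
        x_hat = q * step
        # quantization error of the node features, usable as supervision
        self.q_loss = (x_hat - x).pow(2).mean()
        return x_hat

# usage: quantize features before a GNN layer, add q_loss to the objective
quant = NodeQuantizer(num_nodes=5)
x = torch.randn(5, 8, requires_grad=True)
x_hat = quant(x)
loss = x_hat.sum() + 0.1 * quant.q_loss
loss.backward()
```

The auxiliary `q_loss` term mirrors the paper's motivation for the Local Gradient method: when graph connectivity is sparse, the task loss alone provides weak gradients to per-node quantization parameters, so a local reconstruction-error signal can supplement them.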

1. INTRODUCTION

Recently, Graph Neural Networks (GNNs) have attracted much attention due to their superior ability to learn and represent non-Euclidean geometric data. GNNs are widely used in real-world applications such as recommendation systems (Jin et al., 2020) and social network analysis (Lerer et al., 2019), many of which demand low-latency inference. However, real-world graphs are often extremely large and irregular. For example, Reddit contains 232,965 nodes and requires 19G floating-point operations (FLOPs) to be processed by a 2-layer Graph Convolutional Network (GCN) with only 81KB of parameters (Tailor et al., 2020), whereas ResNet-50, a 50-layer DNN, takes only 8G FLOPs to process an image (Canziani et al., 2016). Worse still, GNN inference requires a huge amount of memory access, e.g., the node features of Reddit occupy up to 534MB, leading to high latency. These problems make efficient GNN inference challenging. Neural network quantization reduces model size and accelerates inference without modifying the model architecture, and has therefore become a promising way to address this problem in recent years.
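The cost claims above can be sanity-checked with a back-of-envelope model of a GCN layer computed as Â(XW). In the sketch below, the Reddit statistics (≈233K nodes, ≈115M directed edges, 602-dim features, 41 classes) follow common GraphSAGE/DGL reporting, while the hidden width of 64 and the operation-counting convention are assumptions, so the result illustrates the order of magnitude rather than reproducing the exact 19G and 81KB figures.

```python
# Back-of-envelope cost model for the 2-layer GCN-on-Reddit example above.
def gcn_layer_macs(num_nodes, num_edges, d_in, d_out):
    """Multiply-accumulates for one GCN layer computed as A_hat @ (X @ W)."""
    combine = num_nodes * d_in * d_out      # dense transform X @ W
    aggregate = num_edges * d_out           # sparse aggregation A_hat @ (XW)
    return combine + aggregate

N, E = 232_965, 114_615_892                 # assumed Reddit node/edge counts
dims = [602, 64, 41]                        # input -> hidden -> classes (hidden assumed)

macs = sum(gcn_layer_macs(N, E, dims[i], dims[i + 1]) for i in range(2))
params = sum(dims[i] * dims[i + 1] for i in range(2))

print(f"~{macs / 1e9:.1f} G multiply-accumulates")     # ~21.6 G
print(f"~{params * 4 / 1024:.0f} KB of FP32 weights")  # ~161 KB
```

Even under these rough assumptions, the per-inference cost lands in the tens of giga-operations for a model whose weights fit in well under a megabyte, which is the imbalance between compute/memory traffic and parameter size that motivates quantizing activations (node features), not just weights.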

