DEGREE-QUANT: QUANTIZATION-AWARE TRAINING FOR GRAPH NEURAL NETWORKS

Abstract

Graph neural networks (GNNs) have demonstrated strong performance on a wide variety of tasks due to their ability to model non-uniform structured data. Despite their promise, there exists little research exploring methods to make them more efficient at inference time. In this work, we explore the viability of training quantized GNNs, enabling the usage of low precision integer arithmetic during inference. For GNNs, seemingly unimportant choices in quantization implementation cause dramatic changes in performance. We identify the sources of error that uniquely arise when attempting to quantize GNNs, and propose an architecturally-agnostic and stable method, Degree-Quant, to improve performance over existing quantization-aware training baselines commonly used on other architectures, such as CNNs. We validate our method on six datasets and show, unlike previous attempts, that models generalize to unseen graphs. Models trained with Degree-Quant for INT8 quantization perform as well as FP32 models in most cases; for INT4 models, we obtain up to 26% gains over the baselines. Our work enables up to 4.7× speedups on CPU when using INT8 arithmetic.

1. INTRODUCTION

GNNs have received substantial attention in recent years due to their ability to model irregularly structured data. As a result, they are extensively used for applications as diverse as molecular interactions (Duvenaud et al., 2015; Wu et al., 2017), social networks (Hamilton et al., 2017), recommendation systems (van den Berg et al., 2017) and program understanding (Allamanis et al., 2018). Recent advancements have centered around building more sophisticated models, including new types of layers (Kipf & Welling, 2017; Velickovic et al., 2018; Xu et al., 2019) and better aggregation functions (Corso et al., 2020). However, despite GNNs having few model parameters, the compute required for each application remains tightly coupled to the input graph size. A 2-layer Graph Convolutional Network (GCN) model with 32 hidden units would result in a model size of just 81KB, yet it requires 19 GigaOPs to process the entire Reddit graph. We illustrate this growth in fig. 1. One major challenge with graph architectures is therefore performing inference efficiently, which limits the applications they can be deployed for. For example, GNNs have been combined with CNNs for SLAM feature matching (Sarlin et al., 2019); however, it is not trivial to deploy this technique on smartphones, or even smaller devices, whose neural network accelerators often do not implement floating point arithmetic and instead favour more efficient integer arithmetic. Integer quantization is one way to lower the compute, memory and energy budget required to perform inference without requiring modifications to the model architecture; this is also useful for model serving in data centers. Although quantization has been well studied for CNNs and language models (Jacob et al., 2017; Wang et al., 2018; Zafrir et al., 2019; Prato et al., 2019), GNNs, despite having model sizes 100× smaller than popular CNNs, require many more OPs to process large graphs.
Figure 2: While CNNs operate on regular grids, GNNs operate on graphs with varying topology. A node's neighborhood size and ordering varies for GNNs. Both architectures use weight sharing.

Several works have considered GNN efficiency (Mukkara et al., 2018; Jia et al., 2020; Zeng & Prasanna, 2020; Yan et al., 2020). To the best of our knowledge, there is no work explicitly characterising the issues that arise when quantizing GNNs or showing latency benefits of using low-precision arithmetic. The recent work of Wang et al. (2020) explores only binarized embeddings of a single graph type (citation networks). In Feng et al. (2020) a heterogeneous quantization framework assigns different bits to embeddings and attention coefficients in each layer while maintaining the weights at full precision (FP32). Due to the mismatch in operands' bit-widths, the majority of the operations are performed at FP32 after data casting, making it impractical to use on general purpose hardware such as CPUs or GPUs. In addition, they do not demonstrate how to train networks which generalize to unseen input graphs. Our framework relies upon uniform quantization applied to all elements in the network and uses bit-widths (8-bit and 4-bit) that are supported by off-the-shelf hardware such as CPUs and GPUs, for which efficient low-level operators for common GNN operations exist. This work considers the motivations and problems associated with quantization of graph architectures, and provides the following contributions: • An explanation of the sources of degradation in GNNs when using lower precision arithmetic. We show how the choice of straight-through estimator (STE) implementation, node degree, and method for tracking quantization statistics significantly impacts performance.
• An architecture-agnostic method for quantization-aware training on graphs, Degree-Quant (DQ), which results in INT8 models often performing as well as their FP32 counterparts. At INT4, models trained with DQ typically outperform quantized baselines by over 20%. We show, unlike previous work, that models trained with DQ generalize to unseen graphs. We provide code at this URL: https://github.com/camlsys/degree-quant. • We show that quantized networks achieve up to 4.7× speedups on CPU with INT8 arithmetic, relative to full precision floating point, with 4-8× reductions in runtime memory usage.

2.1. MESSAGE PASSING NEURAL NETWORKS (MPNNS)

Many popular GNN architectures may be viewed as generalizations of CNN architectures to an irregular domain: at a high level, graph architectures attempt to build representations based on a node's neighborhood (see fig. 2). Unlike CNNs, however, this neighborhood does not have a fixed ordering or size. This work considers GNN architectures conforming to the MPNN paradigm (Gilmer et al., 2017). A graph $G = (V, E)$ has node features $\mathbf{X} \in \mathbb{R}^{N \times F}$, an incidence matrix $\mathbf{I} \in \mathbb{N}^{2 \times E}$, and optionally $D$-dimensional edge features $\mathbf{E} \in \mathbb{R}^{E \times D}$. The forward pass through an MPNN layer consists of message passing, aggregation and update phases: $h^{(i)}_{l+1} = \gamma\big(h^{(i)}_l, \square_{j \in \mathcal{N}(i)}[\phi(h^{(j)}_l, h^{(i)}_l, e_{ij})]\big)$. Messages from node $u$ to node $v$ are calculated using function $\phi$, and are aggregated using a permutation-invariant function $\square$. The features at $v$ are subsequently updated using $\gamma$. We focus on three architectures with corresponding update rules:

1. Graph Convolutional Network (GCN): $h^{(i)}_{l+1} = \sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{1}{\sqrt{d_i d_j}} \mathbf{W} h^{(j)}_l$ (Kipf & Welling, 2017), where $d_i$ refers to the degree of node $i$.

2. Graph Attention Network (GAT): $h^{(i)}_{l+1} = \alpha_{i,i}\mathbf{W}h^{(i)}_l + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j}\mathbf{W}h^{(j)}_l$, where $\alpha_{i,j}$ are attention coefficients (Velickovic et al., 2018).

3. Graph Isomorphism Network (GIN): $h^{(i)}_{l+1} = f_{\Theta}\big[(1+\epsilon)h^{(i)}_l + \sum_{j \in \mathcal{N}(i)} h^{(j)}_l\big]$, where $f_{\Theta}$ is a learnable function (e.g. an MLP) and $\epsilon$ is a learnable constant (Xu et al., 2019).
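The three update rules share the same message-aggregate-update structure. As a concrete illustration, here is a minimal NumPy sketch of a GIN-style layer with sum aggregation over in-neighbors; the function name, signature, and edge-array convention are ours for illustration, not taken from any particular library.

```python
import numpy as np

def gin_layer(x, edges, eps, mlp):
    """Sketch of a GIN update: h_i' = MLP((1 + eps) * h_i + sum_{j in N(i)} h_j).

    x:     (N, F) node feature matrix
    edges: (2, E) integer array; edges[0] = sources, edges[1] = targets
           (messages flow source -> target)
    eps:   the learnable scalar (here just a float)
    mlp:   callable applied row-wise to the aggregated features
    """
    agg = np.zeros_like(x)
    # Sum aggregation over in-neighbors: permutation invariant, handles
    # repeated target indices correctly (unlike agg[edges[1]] += ...).
    np.add.at(agg, edges[1], x[edges[0]])
    return mlp((1.0 + eps) * x + agg)
```

For example, with `eps=0`, an identity `mlp`, and edges 0→2 and 1→2, node 2's output becomes its own feature plus the sum of its two incoming messages.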

2.2. QUANTIZATION FOR NON-GRAPH NEURAL NETWORKS

Quantization allows for model size reduction and inference speedup without changing the model architecture. While there exist extensive studies of the impact of quantization at different bitwidths (Courbariaux et al., 2015; Han et al., 2015; Louizos et al., 2017) and data formats (Micikevicius et al., 2017; Carmichael et al., 2018; Kalamkar et al., 2019), it is 8-bit integer (INT8) quantization that has attracted the most attention. This is because INT8 models reach accuracy levels comparable to FP32 models (Krishnamoorthi, 2018; Jacob et al., 2017), offer 4× model compression, and result in inference speedups on off-the-shelf hardware, as 8-bit arithmetic is widely supported. Quantization-aware training (QAT) has become the de facto approach to designing robust quantized models with low error (Wang et al., 2018; Zafrir et al., 2019). In their simplest forms, QAT schemes expose the numerical errors introduced by quantization by simulating it on the forward pass (Jacob et al., 2017) and make use of the STE (Bengio et al., 2013) to compute the gradients as if no quantization had been applied. For integer QAT, the quantization of a tensor $x$ during the forward pass is often implemented as: $x_q = \min(q_{max}, \max(q_{min}, \lfloor x/s \rceil + z))$, where $q_{min}$ and $q_{max}$ are the minimum and maximum representable values at a given bit-width and signedness, $s$ is the scaling factor making $x$ span the $[q_{min}, q_{max}]$ range, and $z$ is the zero-point, which allows the real value 0 to be exactly representable in $x_q$. Both $s$ and $z$ are scalars obtained at training time. Then, the tensor is dequantized as: $\tilde{x} = (x_q - z)s$, where the resulting tensor $\tilde{x} \approx x$ for a high enough bit-width. This similarity degrades at lower bit-widths. Other variants of integer QAT are presented in Jacob et al. (2017) and Krishnamoorthi (2018).
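The quantize-dequantize round trip described above can be sketched in a few lines. The following is an illustrative NumPy implementation of the simulated-quantization ("fake quant") forward pass; the function names and the unsigned-integer convention are our own, not the paper's exact code.

```python
import numpy as np

def qparams(x_min, x_max, bits=8):
    """Compute scale s and zero-point z for asymmetric uniform quantization
    over an unsigned range [0, 2^bits - 1]."""
    qmin, qmax = 0, 2 ** bits - 1
    # Extend the range to include 0 so the real value 0 is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    s = (x_max - x_min) / (qmax - qmin) or 1.0  # avoid s == 0 for constant tensors
    z = int(round(qmin - x_min / s))
    return s, z, qmin, qmax

def fake_quant(x, bits=8):
    """Simulated quantization for the QAT forward pass: quantize, then dequantize.
    The output is a float tensor carrying the rounding/truncation error."""
    s, z, qmin, qmax = qparams(x.min(), x.max(), bits)
    xq = np.clip(np.round(x / s + z), qmin, qmax)
    return (xq - z) * s
```

At 8 bits the round trip is nearly lossless for well-scaled tensors; at 4 bits the rounding error grows visibly, which is exactly the regime the paper studies.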
To reach performance comparable to FP32 models, QAT schemes often rely on other techniques such as gradient clipping, to mask gradient updates based on the largest representable value at a given bit-width; stochastic, or noisy, QAT, which stochastically applies QAT to a portion of the weights at each training step (Fan et al., 2020; Dong et al., 2017) ; or the re-ordering of layers (Sheng et al., 2018; Alizadeh et al., 2019) .

3. QUANTIZATION FOR GNNS

In this section, we build an intuition for why GNNs would fail with low precision arithmetic by identifying the sources of error that will disproportionately affect the accuracy of a low precision model. Using this insight, we propose our technique for QAT with GNNs, Degree-Quant. Our analysis focuses on three models: GCN, GAT and GIN. This choice was made as we believe that these are among the most popular graph architectures, with strong performance on a variety of tasks (Dwivedi et al., 2020) , while also being representative of different trends in the literature.

3.1. SOURCES OF ERROR

QAT relies upon the STE to estimate the gradient despite the non-differentiable rounding operation in the forward pass. If this approximation is inaccurate, however, then poor performance will be obtained. In GNN layers, we identify the aggregation phase, where nodes combine messages from a varying number of neighbors in a permutation-invariant fashion, as a source of substantial numerical error, especially at nodes with high in-degree. Outputs from aggregation have magnitudes that vary significantly depending on a node's in-degree: as it increases, the variance of aggregation values will increase.[1] Over the course of training, $q_{min}$ and $q_{max}$, the quantization range statistics, become severely distorted by infrequent outliers, reducing the resolution for the vast majority of values observed. This results in increased rounding error for nodes with smaller in-degrees. Controlling $q_{min}$ and $q_{max}$ hence becomes a trade-off balancing truncation error and rounding error. We can derive how the mean and variance of the aggregation output values vary as node in-degree, $n$, increases for each of the three GNN layers. Suppose we model incoming message values for a single output dimension with random variables $X_i$, without making assumptions on their exact distribution or independence. Further, we use $Y_n$ as the random variable representing a node's output value after the aggregation step. With GIN layers, we have $Y_n = (1+\epsilon)X_0 + \sum_{i=1}^{n} X_i$. It is trivial to prove that $\mathbb{E}(Y_n) = O(n)$. The variance of the aggregation output is also $O(n)$ in the case that $\sum_{i \neq j} \mathrm{Cov}(X_i, X_j) \ll \sum_i \mathrm{Var}(X_i)$. We note that if $\sum_{i \neq j} \mathrm{Cov}(X_i, X_j)$ is large then the network has learned highly redundant features, which may be a sign of over-fitting. Similar arguments can be made for GCN and GAT layers; we would expect GCN aggregation values to grow like $O(\sqrt{n})$, and GAT aggregation values to remain constant ($O(1)$) due to the attention coefficients.
We empirically validate these predictions on GNNs trained on Cora; results are plotted in fig. 3. We see that the aggregation values do follow the trends predicted, and that for the values of in-degree in the plot (up to 168) the covariance terms can be neglected. As expected, the variance and mean of the aggregated output grow fastest for GIN, and are roughly constant for GAT as in-degree increases. From this empirical evidence, it would be expected that GIN layers are most affected by quantization. Using GIN and GCN as examples, we can see how aggregation error causes error in weight updates. Suppose we consider a GIN layer incorporating one weight matrix in the update function, i.e. $h^{(i)}_{l+1} = f(\mathbf{W}y^{(i)}_{\mathrm{GIN}})$, where $f$ is an activation function, $y^{(i)}_{\mathrm{GIN}} = (1+\epsilon)h^{(i)}_l + \sum_{j \in \mathcal{N}(i)} h^{(j)}_l$, and $\mathcal{N}(i)$ denotes the in-neighbors of node $i$. Writing $y^{(i)}_{\mathrm{GCN}} = \sum_{k \in \mathcal{N}(i)} \frac{1}{\sqrt{d_i d_k}} \mathbf{W}h^{(k)}_l$, we see that the derivatives of the loss with respect to the weights for GIN and GCN are:

$$\text{GIN:}\quad \frac{\partial L}{\partial \mathbf{W}} = \sum_{i=1}^{|V|} \left[\frac{\partial L}{\partial h^{(i)}_{l+1}} \circ f'(\mathbf{W}y^{(i)}_{\mathrm{GIN}})\right] y^{(i)\top}_{\mathrm{GIN}}$$

$$\text{GCN:}\quad \frac{\partial L}{\partial \mathbf{W}} = \sum_{i=1}^{|V|} \sum_{j \in \mathcal{N}(i)} \frac{1}{\sqrt{d_i d_j}} \left[\frac{\partial L}{\partial h^{(i)}_{l+1}} \circ f'(y^{(i)}_{\mathrm{GCN}})\right] h^{(j)\top}_l$$

The larger the error in $y^{(i)}_{\mathrm{GIN}}$ (caused by aggregation error), the greater the error in the weight gradients for GIN, which results in poorly performing models. The same argument applies to GCN, with the $h^{(j)}_l$ and $y^{(i)}_{\mathrm{GCN}}$ terms introducing aggregation error into the weight updates.
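The $O(n)$ growth of the sum-aggregation statistics is easy to check numerically. The following sketch is our own illustration (assuming i.i.d. Gaussian messages rather than real trained embeddings); it contrasts GIN-like sum aggregation with a GAT-like aggregation whose coefficients sum to one.

```python
import numpy as np

rng = np.random.default_rng(0)

def agg_stats(n, trials=20000):
    """Mean/variance of sum- and mean-aggregated messages from n i.i.d. neighbors.
    Illustrates the O(n) growth argued in the text (i.i.d. is an assumption here)."""
    msgs = rng.normal(loc=0.5, scale=1.0, size=(trials, n))
    s = msgs.sum(axis=1)   # GIN-like sum aggregation: mean and variance grow with n
    m = msgs.mean(axis=1)  # GAT-like (uniform attention) aggregation: O(1) mean
    return s.mean(), s.var(), m.mean(), m.var()
```

Running this for increasing `n` reproduces the qualitative trend in fig. 3: the sum-aggregation statistics grow roughly linearly with in-degree, while the normalized aggregation stays flat.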

3.2. OUR METHOD: DEGREE-QUANT

To address these sources of error we propose Degree-Quant (DQ), a method for QAT with GNNs. We consider both inaccurate weight updates and unrepresentative quantization ranges. The training-time forward pass of a DQ layer is summarized below:

Algorithm 1: Degree-Quant layer forward pass at training time.
1: procedure FORWARD(G, Θ, p)
2:     ▷ Calculate mask and quantized weights, Θ', which all operations share
3:     m ← BERNOULLI(p)
4:     Θ' ← QUANTIZE(Θ)
5:     ▷ Messages with masked sources are at full precision (excluding weights)
6:     M ← MESSAGECALCULATE(G, Θ', m)
7:     X ← QUANTIZE(AGGREGATE(M, Θ', m), m)    ▷ No quantization for masked nodes
8:     return UPDATE(X, Θ', m)    ▷ Quantized weights always used
9: end procedure

Stochastic Protection from Quantization to Improve Weight Update Accuracy. DQ aims to encourage more accurate weight updates by stochastically protecting nodes in the network from quantization. At each layer a protective node mask is generated; for all masked nodes, the message passing, aggregation and update phases are performed at full precision. This includes messages sent by protected nodes to other nodes, as shown in fig. 4 (a detailed diagram is shown in fig. 11). It is also important to note that the weights used at all nodes are the same quantized weights; this is motivated by the fact that our method is designed to encourage more accurate gradients to flow back to the weights through high in-degree nodes. At test time, protection is disabled: all nodes operate at low precision. To generate the mask, we pre-process each graph before training and create a vector of probabilities p with length equal to the number of nodes. At training time, the mask m is generated by sampling from the Bernoulli distribution: m ∼ Bernoulli(p). In our scheme, p_i is higher if the in-degree of node i is large, as we find empirically that high in-degree nodes contribute most towards error in weight updates.
We use a scheme with two hyperparameters, p_min and p_max: nodes with the maximum in-degree are assigned p_max as their masking probability, with all other nodes assigned a probability calculated by interpolating between p_min and p_max based on their in-degree ranking in the graph.

Percentile Tracking of Quantization Ranges. Figure 3 demonstrates large fluctuations in the variance of the aggregation output as in-degree increases. Since these can disproportionately affect the ranges found by min-max or momentum-based quantization, we propose using percentiles. While percentiles have been used for post-training quantization (Wu et al., 2020), we are the first (to the best of our knowledge) to propose making them a core part of QAT; we find this to be a key contributor to achieving consistent results with graphs. Using percentiles involves ordering the values in the tensor and clipping a fraction of the values at both ends of the distribution. The fraction to clip is a hyperparameter. We are more aggressive than the existing literature on the quantity we discard: we clip the top and bottom 0.1%, rather than 0.01%, as we observe the fluctuations to be a larger issue with GNNs than with CNNs or DNNs. Quantization ranges are more representative of the vast majority of values in this scheme, resulting in less rounding error.

We emphasize that a core contribution of DQ is that it is architecture-agnostic. Our method enables a wide variety of architectures to use low precision arithmetic at inference time. It is also orthogonal, and complementary, to other techniques for decreasing GNN computation requirements, such as sampling-based methods used to reduce memory consumption (Zeng et al., 2020) or weight pruning (Blalock et al., 2020) approaches to achieve further model compression.
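Putting the two ingredients together, a minimal sketch of the degree-based protection probabilities and the percentile range observer might look as follows. This is our own illustration; the released code may interpolate between p_min and p_max differently.

```python
import numpy as np

def mask_probabilities(in_degree, p_min=0.0, p_max=0.2):
    """Interpolate between p_min and p_max by in-degree ranking:
    the highest in-degree node gets p_max, the lowest gets p_min."""
    ranks = np.argsort(np.argsort(in_degree))  # rank 0 = smallest in-degree
    frac = ranks / max(len(in_degree) - 1, 1)  # 0 for min-degree, 1 for max
    return p_min + frac * (p_max - p_min)

def sample_mask(p, rng):
    """Per-node Bernoulli protection mask, redrawn at every training step."""
    return rng.random(p.shape) < p

def percentile_range(x, clip_pct=0.1):
    """Quantization range clipping the top and bottom clip_pct% of values,
    so rare outliers do not stretch [q_min, q_max]."""
    return np.percentile(x, clip_pct), np.percentile(x, 100.0 - clip_pct)
```

High in-degree nodes are thus protected from quantization more often during training, while the percentile observer keeps the quantization range representative of the bulk of the distribution.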

4. EXPERIMENTS

In this section we first analyse how the choice of quantization implementation affects the performance of GNNs. We subsequently evaluate Degree-Quant against strong baselines: FP32, INT8 QAT, and INT8 QAT with stochastic masking of weights (Fan et al., 2020); we refer to this last approach as noisy QAT, or nQAT. To make explicit that we quantize both weights and activations, we use the notation W8A8. We repeat the experiments at INT4. Our study evaluates performance on six datasets and includes both node-level and graph-level tasks. The datasets used were Cora, CiteSeer, ZINC, MNIST and CIFAR-10 superpixels, and REDDIT-BINARY. Across all datasets, INT8 models trained with Degree-Quant recover most of the accuracy lost as a result of quantization. In some instances, DQ-INT8 outperforms the extensively tuned FP32 baselines. For INT4, DQ outperforms all QAT baselines and yields double-digit improvements over QAT-INT4 in some settings. Details about each dataset and our experimental setup can be found in appendix A.1.

4.1. IMPACT OF QUANTIZATION GRADIENT ESTIMATOR ON CONVERGENCE

The STE is a widely adopted workaround for forward passes containing non-differentiable operations (e.g. rounding in QAT). While the choice of STE implementation generally results in marginal differences for CNNs, even for binary networks (Alizadeh et al., 2019), it is unclear whether the differences will also be marginal for GNNs. Motivated by this, we study the impact of four off-the-shelf quantization procedures on the three architectures evaluated for each type of dataset; the implementation details of each are described in appendix A.3. We perform this experiment to ensure that we have the strongest possible QAT baselines. Results are shown in table 1. We found the best choice of quantization implementation to be highly dependent on the model architecture and the type of problem to be solved: we see a much larger variance than is observed with CNNs; this is an important discovery for future work building on our study. We observe a general trend of all INT4 experiments benefiting from momentum, as it helps smooth out the quantization statistics for the inherently noisy training stage at low bitwidths. This trend also applies to the majority of INT8 experiments, while exhibiting little impact on MNIST. For INT8 Cora-GCN, large gradient norm values in the early stages of training (see fig. 5) mean that these models do not benefit from momentum, as quantization ranges fail to keep up with the rate of change of tensor values; higher momentum can help but also leads to instability. In contrast, GAT has stable initial training dynamics, and hence obtains better results with momentum. For the molecules dataset ZINC, we consistently obtained lower regression loss when using momentum. We note that GIN models often suffer from higher performance degradation (as was first noted in fig. 3), especially at W4A4. This is not the case, however, for the image datasets using superpixels.
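To make the two estimator variants concrete, here is a NumPy sketch of the fake-quant forward pass together with the vanilla-STE and gradient-clipped (GC) backward rules. This is our own illustration of the variants described in appendix A.3; a real implementation would live inside an autograd function, and the fixed scale used here is an assumption.

```python
import numpy as np

def fake_quant(x, s=0.1, z=0, qmin=-128, qmax=127):
    """Forward pass: round-and-clip, then dequantize (signed INT8 by default)."""
    xq = np.clip(np.round(x / s) + z, qmin, qmax)
    return (xq - z) * s

def ste_grad(x, grad_out, clip=False, s=0.1, z=0, qmin=-128, qmax=127):
    """Backward pass under the STE.
    vanilla (clip=False): gradients pass through unchanged, ignoring rounding.
    gradient clipping (clip=True): zero the gradient wherever the input fell
    outside the representable range, i.e. where the forward pass saturated."""
    if not clip:
        return grad_out
    q = x / s + z
    return grad_out * ((q >= qmin) & (q <= qmax))
```

Under GC, saturated values stop receiving gradient, which changes the training dynamics relative to the vanilla estimator; table 1 shows that which variant wins is architecture- and dataset-dependent for GNNs.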
We believe that datasets with Gaussian-like node degree distributions (see fig. 9 ) are more tolerant of the imprecision introduced by quantization, compared to datasets with tailed distributions. We leave more in-depth analysis of how graph topology affects quantization as future work. models achieve over 20% lower regression loss. Among the top performing models using DQ, ratios of p min and p max in [0.0, 0.2] were the most common. Figure 10 in the appendix shows validation loss curves for GIN models trained using different DQ probabilities on the REDDIT-BINARY dataset.

5. DISCUSSION

Latency and Memory Implications. In addition to offering significantly lower memory usage (4× with INT8), quantization can reduce latency, especially on CPUs. We found that with INT8 arithmetic we could accelerate inference by up to 4.7×. We note that the latency benefit depends on the graph topology and feature dimension; we therefore ran benchmarks on a variety of graph datasets, including Reddit[2], ZINC, Cora, Citeseer, and CIFAR-10. ZINC and Reddit results are shown in table 4, with further results given in the appendix. For a GCN layer with in- and out-dimension of 128, we get speed-ups of: 4.3× on Reddit, 2.5× on ZINC, 1.3× on Cora, 1.3× on Citeseer and 2.1× on CIFAR-10. It is also worth emphasizing that quantized networks are necessary to efficiently use the accelerators deployed in smartphones and smaller devices, as these primarily accelerate integer arithmetic, and that CPUs remain a common choice for model serving on servers. The decrease in latency on CPUs is due to improved cache performance for the sparse operations; GPUs, however, see less benefit due to their massively parallel nature, which relies on mechanisms other than caching to hide the slow random memory accesses that are unavoidable in this application.

Ablation Study: Benefits of Percentile Ranges. Figure 5 shows the value of percentiles during training. We see that when using absolute min/max the upper range grows to over double the range required for 99.9% of values, effectively halving the resolution of the quantized values. DQ is more stable, and we obtained strong results with an order of magnitude less tuning relative to the baselines.

Ablation Study: Effect of the Stochastic Element in Degree-Quant. We observe that the stochastic protective masking in DQ alone often achieves most of the performance gain over the QAT baseline; results are given in table 9 in the appendix. The primary benefit of the percentile-based quantization ranges is stability, although it can also yield some performance gains.
The full DQ method provides consistently good results on all architectures and datasets, without requiring an extensive analysis as in section 4.1.

6. CONCLUSION

This work has presented Degree-Quant, an architecture-agnostic and stable method for training quantized GNN models that can be accelerated using off-the-shelf hardware. With 4-bit weights and activations we achieve 8× compression while surpassing strong baselines by margins regularly exceeding 20%. At 8 bits, models trained with DQ perform on par with or better than the baselines while achieving up to 4.7× lower latency than FP32 models. Our work offers a comprehensive foundation for future work in this area and is a first step towards enabling GNNs to be deployed more widely, including to resource-constrained devices such as smartphones.

A APPENDIX

Readers seeking advice on implementation will find appendix A.5 especially useful. We provide significant advice surrounding best practices on quantization for GNNs, along with techniques which we believe can boost our methods beyond the performance described in this paper, but for which we did not have time to fully evaluate.

A.1 EXPERIMENTAL SETUP

As baselines we use the architectures and results reported by Fey & Lenssen (2019) for citation networks, Dwivedi et al. (2020) for MNIST, CIFAR-10 and ZINC, and Xu et al. (2019) for REDDIT-BINARY. We re-implemented the architectures and datasets used in these publications and replicated the results reported at FP32. Models using GIN layers learn the parameter ε; these models are often referred to as GIN-ε. A high-level description of these architectures is shown in table 5. The number of parameters for each architecture-dataset pair in this work is shown in table 6.

Our infrastructure was implemented using PyTorch Geometric (PyG) (Fey & Lenssen, 2019). We generate candidate hyperparameters using random search, and prune trials using the asynchronous hyperband algorithm (Li et al., 2020). The hyperparameters searched over were learning rate, weight decay, and dropout (Srivastava et al., 2014) and drop-edge (Rong et al., 2020) probabilities. The search ranges were initialized centered at the values used in the reference implementations of the baselines. Degree-Quant requires searching for two additional hyperparameters, p_min and p_max; these were tuned in a grid-search fashion. We report our results using the hyperparameters which achieved the best validation loss over 100 runs on the Cora and Citeseer datasets, 10 runs for MNIST, CIFAR-10 and ZINC, and 10-fold cross-validation for REDDIT-BINARY. We generally used fewer hyperparameter runs for our DQ experiments than for the baselines, even ignoring the searches over the various STE configurations: as our method is more stable, finding a reasonable set of parameters was easier than before. As is usual with quantization experiments, we found it useful to decrease the learning rate relative to the FP32 baseline. Our experiments ran on several machines in our SLURM cluster using Intel CPUs and NVIDIA GPUs, each running Ubuntu 18.04. The GPU models in our cluster were: V100, RTX 2080Ti and GTX 1080Ti.
For QAT experiments, all elements of each network are quantized: the inputs to each layer, the weights, the messages sent between nodes, the inputs to the aggregation stage and its outputs, and the outputs of the update stage (which are the outputs of the GNN layer before activation). In this way, all intermediate tensors in the GNNs are quantized, with the exception of the attention mechanism in GAT: we do not quantize after the softmax calculation, due to the numerical precision required at this stage.

With the exception of Cora and Citeseer, the models evaluated in this work make use of Batch Normalization (Ioffe & Szegedy, 2015). For deployments of quantized models, Batch Normalization layers are often folded with the weights (Krishnamoorthi, 2018). This ensures the input to the next layer is within the expected [q_min, q_max] range. In this work, for both the QAT baselines and QAT+DQ, we left BN layers unfolded but ensured the inputs and outputs were quantized to the appropriate number of bits (i.e. INT8 or INT4) before being multiplied with the layer weights. We leave proposing a BN folding mechanism applicable to GNNs, and studying its impact on deployments of quantized GNNs, as future work.

The GIN models evaluated on REDDIT-BINARY used QAT for all layers with the exception of the input layer of the MLP in the first GIN layer. This compromise was needed to overcome the severe degradation introduced by quantization when operating on nodes with a single scalar as feature.
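For reference, the standard BN folding transformation mentioned above (the practice we deferred for GNNs) rescales a linear layer's weights and bias so the BN affine transform is absorbed. A minimal sketch, assuming frozen BN statistics (mu, var) and affine parameters (gamma, beta):

```python
import numpy as np

def fold_bn(W, b, gamma, beta, mu, var, eps=1e-5):
    """Fold y = gamma * ((W @ x + b) - mu) / sqrt(var + eps) + beta
    into a single linear layer y = W' @ x + b'."""
    scale = gamma / np.sqrt(var + eps)
    return W * scale[:, None], (b - mu) * scale + beta
```

After folding, only one weight tensor needs to be quantized per layer, which is why folding is the common convention for deploying quantized CNNs.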

A.2 DATASETS

We show in table 7 the statistics for each dataset either used or referred to in this work. For the Cora and Citeseer datasets, nodes correspond to documents and edges to citations between them. Node features are a bag-of-words representation of the document. The task is to classify each node in the graph (i.e. each document) correctly. The MNIST and CIFAR-10 datasets (commonly used for image classification) are transformed using SLIC (Achanta et al., 2012) into graphs where each node represents a cluster of perceptually similar pixels, or superpixels. The task is to classify each image using its superpixel graph representation. The ZINC dataset contains graphs representing molecules, where each node is an atom. The task is to regress a molecular property (constrained solubility (Jin et al., 2018)) given the graph representation of the molecule. Nodes in graphs of the REDDIT-BINARY dataset represent users of a Reddit thread, with edges drawn between a pair of nodes if they interacted. This dataset contains graphs of two types of communities: question-answer threads and discussion threads. The task is to determine whether a given graph is from a question-answer thread or a discussion thread. We use standard splits for MNIST, CIFAR-10 and ZINC. For the citation datasets (Cora and Citeseer), we use the splits of Kipf & Welling (2017). For REDDIT-BINARY we use 10-fold cross-validation.

Table 7: Statistics for each dataset used in the paper. Some datasets are only referred to in fig. 1.

A.3 QUANTIZATION IMPLEMENTATIONS

In section 4.1 we analyse different readily available quantization implementations and how they impact QAT results. First, vanilla STE, which is the reference STE (Bengio et al., 2013) that lets the gradients pass unchanged; and second, gradient clipping (GC), which masks the gradients based on the maximum representable value for a given quantization level. In other words, GC zeroes gradients where the tensor's magnitudes fall outside the [q_min, q_max] range.
The quantization modules keep track of the input tensor's min and max values, $x_{min}$ and $x_{max}$, which are then used to compute the $q_{min}$, $q_{max}$, zero-point and scale parameters. For both vanilla STE and GC, we study two popular ways of keeping track of these statistics: min/max, which tracks the min/max tensor values observed over the course of training; and momentum, which computes moving averages of those statistics during training. The update rules for $x_{min}$ for STE min/max and STE momentum are, respectively:

$$x_{min} \leftarrow \begin{cases} \min(X) & \text{if step} = 0 \\ \min(x_{min}, \min(X)) & \text{otherwise} \end{cases} \qquad (1)$$

$$x_{min} \leftarrow \begin{cases} \min(X) & \text{if step} = 0 \\ (1-c)\,x_{min} + c\min(X) & \text{otherwise} \end{cases} \qquad (2)$$

where $X$ is the tensor to be quantized and $c$ is the momentum hyperparameter, set to its default 0.01 in all our experiments. Equivalent rules apply when updating $x_{max}$ (omitted). For stochastic QAT we followed the implementation described in Fan et al. (2020), where at each training step a binary mask sampled from a Bernoulli distribution specifies which elements of the weight tensor will be quantized and which will be left at full precision. We experimented with block sizes larger than one (i.e. larger than a single scalar), but this often resulted in a severe drop in performance. All reported results use a block size of one.
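In code, the two observer update rules in eqs. (1) and (2) amount to the following sketch (the function name and `mode` flag are ours):

```python
def update_x_min(x_min, batch_min, step, mode="minmax", c=0.01):
    """Running-minimum update for a quantization range observer.
    mode="minmax":   keep the most extreme value ever observed (eq. 1).
    mode="momentum": exponential moving average of per-batch minima (eq. 2)."""
    if step == 0:
        return batch_min
    if mode == "momentum":
        return (1.0 - c) * x_min + c * batch_min
    return min(x_min, batch_min)
```

The min/max rule never forgets an outlier, which is why a single extreme aggregation output can permanently halve the effective resolution; the momentum rule forgets outliers but lags behind fast-moving statistics early in training.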

A.4 DEGREE-QUANT AND GRAPH LEVEL SUMMARIZATION

The percentile operation in our quantization scheme remains important for summarizing the graph in graph-level tasks, such as graph regression (ZINC) or graph classification (MNIST, CIFAR-10 and REDDIT-BINARY). Since the number of nodes in each input graph is not constant, the summarized representation produced from the final graph layer can have a more tailed distribution than would be seen with other types of architectures (e.g. CNNs). Adding the percentile operation reduces the impact of these extreme tails in the fully connected graph-summarization layers, thereby increasing overall performance. The arguments regarding weight update accuracy also still apply, as the $\frac{\partial L}{\partial h^{(i)}_{l+1}}$ term in the equations for GCN and GIN should be more accurate compared to when the activations are always quantized before summarization. This phenomenon is also noted by Fan et al. (2020).

A.5 IMPLEMENTATION ADVICE

We provide details that will be useful for others working in the area, including suggestions that should boost the performance of our results and accelerate training. We release code on GitHub; this code is a clean implementation of the paper, suitable for use in downstream works.

A.5.1 QUANTIZATION SETUP

As our work studies the pitfalls of quantization for GNNs, we were more aggressive in our implementation than is absolutely necessary: everything (where reasonably possible) in our networks is quantized. In practice, this leaves low-hanging fruit for improvements in accuracy:

• Not quantizing the final layer (as is common practice for CNNs and Transformers) helps with accuracy, especially at INT4. A similar practice at the first layer will also be useful.
• Using higher precision for the "summarization" stages of the model, which contribute little towards the runtime in most cases.
• Taking advantage of mixed precision: since the benefits of quantization are primarily in the message passing phase (discussed below), one technique to boost accuracy is to make only the messages low precision.

We advise choosing a more realistic (less aggressive) convention than used in this work. The first two items would be appropriate.

A.5.2 RELATIVE VALUE OF PERCENTILES COMPARED TO PROTECTIVE MASKING

There are two components to our proposed technique: stochastic, topology-aware masking, and percentile-based range observers for the quantizers. We believe percentiles provide more immediate value, especially at INT4; they are useful purely from the perspective of stabilizing optimization and reducing sensitivity to hyperparameters.
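A minimal sketch of the masking component: nodes are protected (left at full precision for that training step) with probability increasing in their in-degree. The linear interpolation between p_min and p_max used here is an illustrative assumption rather than our exact scheme.

```python
import numpy as np

def protection_mask(in_degree, p_min=0.0, p_max=0.5, rng=None):
    """Sample a Degree-Quant style protective mask (illustrative sketch).

    The highest-degree node is protected with probability p_max and the
    lowest with p_min; probabilities for the rest are linearly
    interpolated from their in-degrees.
    """
    if rng is None:
        rng = np.random.default_rng()
    d = np.asarray(in_degree, dtype=float)
    # normalize degrees to [0, 1], then interpolate probabilities
    p = p_min + (p_max - p_min) * (d - d.min()) / max(d.max() - d.min(), 1e-12)
    return rng.random(d.shape) < p   # True => node stays at FP32 this step
```

A fresh mask would be sampled at every training step and shared across the quantization layers of the message-passing pipeline.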



The reader should note that we are not referring to the concept of estimator variance, which is the subject of sampling-based approaches; we are exclusively discussing the variance of values immediately after aggregation.



Figure 3: Analysis of values collected immediately after aggregation at the final layer of FP32 GNNs trained on Cora. Generated using channel data collected from 100 runs for each architecture. As in-degree grows, so do the mean and variance of channel values after aggregation.

Figure 4: High-level view of the stochastic element of Degree-Quant. Protected (high in-degree) nodes, in blue, operate at full precision, while unprotected nodes (red) operate at reduced precision. High in-degree nodes contribute most to poor gradient estimates, hence they are stochastically protected from quantization more often.

Figure 5: qmax with absolute min/max and percentile ranges, applied to INT8 GCN training on Cora. We observe that the percentile max is half that of the absolute, doubling resolution for the majority of values.

Figure 6: Analysis of how INT8 GAT performance degrades on Cora as individual elements are reduced to 4-bit precision without DQ. For GAT the message elements are crucial to classification performance.

Figure 7: Degradation of INT8 GCN on Cora as individual elements are converted to INT4 without Degree-Quant.

Figure 8: Degradation of INT8 GIN on Cora as individual elements are converted to INT4 without Degree-Quant.

Figure 9: In-degree distribution for each of the six datasets assessed. Note that a log y-axis is used for all datasets except for MNIST and CIFAR-10.

Figure 11: Diagram representing how DQ makes use of a topology-aware quantization strategy better suited to GNNs, illustrated for a GCN layer. At every training step, a degree-based mask is generated and used in all quantization layers located after each stage of the message-passing pipeline. By retaining higher-degree nodes at FP32 more often, noisy updates during training have less impact and models therefore perform better, even at INT4.

Figure 1: Despite GNN model sizes rarely exceeding 1MB, the OPs needed for inference grow at least linearly with the size of the dataset and node features.

Impact on performance of four typical quantization implementations for INT8 and INT4. The configuration that resulted in the best-performing models for each dataset-model pair is bolded. Hyperparameters for each experiment were tuned independently. As expected, adding clipping does not change performance with min/max but does with momentum. A major contribution of this work is identifying that seemingly unimportant choices in quantization implementation cause dramatic changes in performance.

The QAT-INT4 results in table 2, with the exception of MNIST (an easy-to-classify dataset), corroborate our hypothesis that GIN layers are less resilient to quantization; this was first observed in fig. 3. In the case of ZINC, while all models suffer noticeable degradation, GIN sees a more severe 16% increase in regression loss compared to our FP32 baseline. For QAT-W4A4, accuracy drops of over 35% and 47% are observed for Cora and Citeseer respectively. The stochasticity induced by nQAT helped recover some of the accuracy lost to quantization for citation networks (both INT8 and INT4), but had little impact on other datasets and harmed performance in some cases.

Results with DQ: INT8 GIN models perform nearly as well as FP32 models. For INT4, DQ offers a significant increase in accuracy.

INT8 latency results run on a 22-core 2.1 GHz Intel Xeon Gold 6152 CPU and on a GTX 1080Ti GPU. Quantization provides large speedups on a variety of graphs for CPU, and non-negligible speedups with unoptimized INT8 GPU kernels.

Ablation Study: Source of Degradation at INT4. Figure 6 assesses how INT8 GAT (without DQ) degrades as single elements are converted to INT4, in order to understand the precipitous drop in accuracy of the INT4 baselines; further plots for GCN and GIN are included in the appendix. We observe that most elements cause only modest performance losses relative to a full INT8 model. DQ is most important to apply to elements constrained by numerical precision, such as the aggregation and message elements in GAT. Weight elements, however, are consistently unaffected.

High-level description of the architectures evaluated for citation networks (Cit), MNIST (M), CIFAR-10 (C), ZINC (Z) and REDDIT-BINARY (R). We relied on the Adam optimizer for all experiments. All batched experiments used a batch size of 128. All GAT models used 8 attention heads. All GIN architectures used 2-layer MLPs, except those for citation networks, which used a single linear layer.

Number of parameters for each of the evaluated architectures

Final test accuracies for FP32 and DQ-INT8 models whose validation loss curves are shown in fig. 10

Ablation study of the two elements of Degree-Quant (DQ). The first two rows of results are obtained with only the stochastic element of Degree-Quant enabled, for INT8 and INT4; percentile-based quantization ranges are disabled in these experiments. The bottom row of results was obtained with noisy quantization (nQAT) at INT4 with the use of percentiles. DQ masking alone is often sufficient to achieve excellent results, but the addition of percentile-based range tracking can be beneficial for stability. Using nQAT with percentiles is not sufficient to achieve results of the quality DQ provides.

INT8 latency results run on a 22-core 2.1 GHz Intel Xeon Gold 6152 CPU and on a GTX 1080Ti GPU. All layers have 128 in/out features. For CIFAR-10 we used a batch size of 1K graphs.

ACKNOWLEDGEMENTS

This work was supported by Samsung AI and by the UK's Engineering and Physical Sciences Research Council (EPSRC) with grants EP/M50659X/1 and EP/S001530/1 (the MOA project) and the European Research Council via the REDIAL project.


However, adding the masking does improve performance further, as is evident from table 9. In fact, performance may be degraded slightly when percentiles are also applied: this can be observed by comparing table 9 to the main results in the paper, table 2.

A.5.3 PERCENTILES

The key downside of applying percentiles for range observers is that the operation can take significant time: training with DQ is slower than standard QAT. However, since there is less sensitivity to hyperparameters, fewer runs end up being needed. We are confident that an effective way to speed up this operation is to use sampling: we expect 10% of the data to be adequate, and even 1% may be sufficient (dataset and model dependent). However, we have not evaluated this setup in the paper; it is provided in the code release for experimentation.
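A sketch of the sampling idea (unevaluated in the paper, as noted above): estimate the percentile from a random fraction of the entries. The function below is illustrative, not taken from our code release.

```python
import numpy as np

def sampled_percentile(x, pct=99.9, frac=0.1, rng=None):
    """Estimate a percentile from a random subsample of `x`.

    Sketch of the suggested speed-up; the 10% default mirrors the
    guess in the text and is an assumption, not a validated setting.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = max(1, int(len(x) * frac))
    idx = rng.integers(0, len(x), size=n)  # sample with replacement
    return np.percentile(x[idx], pct)
```

This reduces the sort/selection cost of the observer by roughly the sampling fraction, at the price of a noisier range estimate.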

A.5.4 IMPROVING ON PERCENTILES

We believe it is possible to significantly boost the performance of GNN quantization by employing a learned step size approach. Although we used percentiles in this paper to illustrate the range-precision trade-off for GNNs, we expect that learning the ranges will lead to better results. This approach, pioneered by works such as Esser et al. (2020), has been highly effective on CNNs, even down to 2-bit quantization. Another approach would be to use robust quantization: the idea in these works is to reduce the impact of changing quantization ranges, i.e. making the architecture more robust to quantization. Works in this area include Alizadeh et al. (2020) and Shkolnik et al. (2020).
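For concreteness, a sketch of the learned-step-size idea in the spirit of Esser et al. (2020): the quantizer's scale becomes a trainable parameter, with a straight-through gradient that mixes the clip values and the rounding residual. The INT4 levels and exact gradient form below follow our reading of that work and are illustrative only.

```python
import numpy as np

def lsq_forward(x, scale, qmin=-8, qmax=7):
    """Forward pass of a learned-step-size quantizer (INT4 levels here).
    `scale` would be a trainable parameter updated by gradient descent."""
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q * scale

def lsq_scale_grad(x, scale, qmin=-8, qmax=7):
    """Straight-through gradient of the output w.r.t. the step size:
    clipped values contribute qmin/qmax, in-range values contribute
    their rounding residual."""
    v = x / scale
    return np.where(v <= qmin, qmin,
           np.where(v >= qmax, qmax, np.round(v) - v))
```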

A.5.5 IMPROVING LATENCY

The slowest step of GNN inference is typically the sparse operations. It is therefore best to minimize the size of the messages between nodes, i.e. quantize the message phase most aggressively. This makes the biggest impact on CPUs, which depend on caches to obtain good performance. We evaluated our code on CPU using NumPy and SciPy routines; for the GPU, we used implementations from PyTorch and PyTorch Geometric, lightly modified to support INT8 where necessary. These results, while useful for illustrating the benefits of quantization, are by no means optimal: we did not devote significant time to improving latency. We believe better results can be obtained with techniques such as cache blocking or kernel fusion.
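A sketch of the kind of CPU routine involved: an INT8 aggregation step with INT32 accumulation using SciPy's sparse matmul. This mirrors the spirit of our NumPy/SciPy evaluation but is not the benchmarked code; names and sizes are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def quantize_int8(x, qmax=127):
    """Symmetric INT8 quantization of a dense feature matrix."""
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def int8_spmm(A, H):
    """Aggregation A @ H with INT8 features, INT32 accumulation,
    rescaled back to floating point afterwards."""
    Hq, s = quantize_int8(H)
    out = A.astype(np.int32) @ Hq.astype(np.int32)  # integer arithmetic
    return out * s

rng = np.random.default_rng(0)
rows, cols = rng.integers(0, 50, 100), rng.integers(0, 50, 100)
A = sp.csr_matrix((np.ones(100), (rows, cols)), shape=(50, 50))  # toy adjacency
H = rng.standard_normal((50, 16))
out = int8_spmm(A, H)
```

Halving message width (e.g. INT8 instead of FP32 features) directly shrinks the working set streamed through the sparse gather, which is why the CPU benefits most.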

A.5.6 PITFALLS

Training these models can be highly unstable: some experiments in the paper had standard deviations as large as 18%. We observed this to affect citation network experiments to the extent that they would not converge on GPUs: all these experiments had to be run on CPUs.

A.6 DEGRADATION STUDIES

Figures 7 and 8 show the results of the ablation study conducted in section 5 for GCN and GIN. We observe that GCN is more tolerant of INT4 quantization than the other architectures. GIN, however, requires accurate representations after the update stage, and suffers heavily from further quantization, as GAT does. The idea of performing different stages of inference at different precisions has been proposed before, although it is uncommon (Wang et al., 2018).

