LEARNING GRAPH NORMALIZATION FOR GRAPH NEURAL NETWORKS

Abstract

Graph Neural Networks (GNNs) have emerged as a useful paradigm for processing graph-structured data. Usually, GNNs are stacked into multiple layers, and the node representations in each layer are computed by propagating and aggregating neighboring node features with respect to the graph. To effectively train a GNN with multiple layers, normalization techniques are necessary. Although existing normalization techniques have been shown to accelerate the training of GNNs, they ignore the structural information of the graph. In this paper, we propose two graph-aware normalization methods to effectively train GNNs. Then, taking into account that normalization methods for GNNs are highly task-relevant and that it is hard to know in advance which normalization method is best, we propose to learn attentive graph normalization by optimizing a weighted combination of multiple graph normalization methods at different scales. By optimizing the combination weights, we can automatically select the best normalization method, or the best combination of multiple methods, for a specific task. We conduct extensive experiments on benchmark datasets for different tasks and confirm that the graph-aware normalization methods lead to promising results and that the learned weights suggest the more appropriate normalization methods for a specific task.

1. INTRODUCTION

Graph Neural Networks (GNNs) have gained great popularity due to their effectiveness in learning on graphs across various application areas, such as natural language processing (Yao et al., 2019; Zhang et al., 2018), computer vision (Li et al., 2020; Cheng et al., 2020), point clouds (Shi & Rajkumar, 2020), drug discovery (Lim et al., 2019), citation networks (Kipf & Welling, 2016), and social networks (Chen et al., 2018). A graph consists of nodes and edges, where nodes represent individual objects and edges represent relationships among those objects. In the GNN framework, node or edge representations are alternately updated by propagating information along the edges of a graph via non-linear transformation and aggregation functions (Wu et al., 2020; Zhang et al., 2018). GNNs capture long-range node dependencies by stacking multiple message-passing layers, allowing information to propagate over multiple hops (Xu et al., 2018). In essence, a GNN is a neural network that applies neural operations over a graph structure. Among the numerous kinds of GNNs (Bruna et al., 2014; Defferrard et al., 2016; Maron et al., 2019; Xu et al., 2019), message-passing GNNs (Scarselli et al., 2009; Li et al., 2016; Kipf & Welling, 2016; Velickovic et al., 2018; Bresson & Laurent, 2017) have been the most widely used due to their ability to leverage the basic building blocks of deep learning, such as batching, normalization, and residual connections. Many approaches have been designed to update the feature representation of a node.
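To make the message-passing update concrete, the following is a minimal sketch of one propagation layer with mean aggregation over neighbors (GCN-style); the function and variable names are illustrative, not from the paper, and learnable parameters are stand-ins for a trained model:

```python
import numpy as np

def message_passing_layer(X, adj, W, activation=np.tanh):
    """One GCN-style message-passing layer (illustrative sketch).

    X:   (n, d_in) node feature matrix
    adj: (n, n) binary adjacency matrix
    W:   (d_in, d_out) learnable weight matrix
    """
    A = adj + np.eye(adj.shape[0])      # add self-loops so a node keeps its own features
    deg = A.sum(axis=1, keepdims=True)  # node degrees (number of aggregated nodes)
    H = (A @ X) / deg                   # mean-aggregate features over each neighborhood
    return activation(H @ W)            # non-linear transformation

# Stacking L such layers lets information propagate over L hops.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
W = rng.normal(size=(3, 2))
H1 = message_passing_layer(X, adj, W)
print(H1.shape)  # (4, 2)
```

Each node's new representation depends only on itself and its one-hop neighbors, which is why depth controls the receptive field of the network.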
For example, Graph ConvNet (GCN) (Kipf & Welling, 2016) employs an averaging operation over the neighborhood, assigning the same weight to each neighbor; GraphSage (Hamilton et al., 2017) samples a fixed-size neighborhood of each node and applies a mean or LSTM-based aggregator over the neighbors; Graph Attention Network (GAT) (Velickovic et al., 2018) incorporates an attention mechanism into the propagation step, updating the feature representation of each node via a weighted sum of adjacent node representations; MoNet (Monti et al., 2017) designs a Gaussian kernel with learnable parameters to assign different weights to neighbors; GatedGCN (Bresson & Laurent, 2017) explicitly introduces edge features at each layer, updating each edge feature from the representations of its two connected nodes, and has achieved state-of-the-art results on several datasets (Dwivedi et al., 2020). A more detailed overview of GNNs is provided in Appendix A.

It is well known that normalization is one of the critical ingredients for effectively training deep neural networks; e.g., Batch Normalization (BN) (Ioffe & Szegedy, 2015) is widely used to accelerate training. Beyond BN, several normalization methods have been developed from different perspectives, e.g., Layer Normalization (LN) (Ba et al., 2016) and Group Normalization (Wu & He, 2018), which operate along the channel dimension; Instance Normalization (Ulyanov et al., 2016), which performs a BN-like normalization for each sample; and Switchable Normalization (Luo et al., 2019), which computes first-order and second-order statistics over three distinct scopes: channel, layer, and minibatch. Each normalization method has its advantages and is suitable for particular tasks. For instance, BN has achieved excellent performance in computer vision, whereas LN outperforms BN in natural language processing (Vaswani et al., 2017).

Analogously, in Dwivedi et al. (2020), BN is applied in each graph propagation layer when training GNNs. In Zhao & Akoglu (2020), a novel normalization layer, PAIRNORM, is introduced to mitigate the over-smoothing problem and prevent node representations from homogenizing by differentiating the distances between different node pairs. Although these methods have been demonstrated to be useful for training GNNs, they ignore both the local and the global structure of the graph. Moreover, in previous work, a single normalization method is selected and used for all normalization layers. This may limit the potential performance improvement, and it is also hard to decide which normalization method is suitable for a specific task.

Graph data contains rich structural information. By exploiting this structure, we propose in this paper two graph-aware normalization methods at different scales: a) adjacency-wise normalization and b) graph-wise normalization. Unlike BN and LN, adjacency-wise normalization takes into account the local structure of the graph, whereas graph-wise normalization takes into account its global structure. On the other hand, while multiple normalization methods are available for training GNNs, it is still hard to know in advance which normalization method is most suitable for a specific task. To tackle this deficiency, we further propose to learn attentive graph normalization by optimizing a weighted combination of multiple normalization methods. By optimizing the combination weights, we can automatically select the best method, or the best combination of multiple methods, for training GNNs on a specific task. The contributions of this paper are highlighted as follows.

• We propose two graph-aware normalization methods: adjacency-wise normalization and graph-wise normalization. To the best of our knowledge, this is the first time that graph-aware normalization methods have been proposed for training GNNs.

• We propose to learn attentive graph normalization by optimizing a weighted combination of different normalization methods. By learning the combination weights, we can automatically select the best or the best combination of multiple normalization methods for a specific task.

Figure 1: Illustration of normalization methods on a graph. Node features are normalized at four levels: (a) node-wise; (b) adjacency-wise; (c) graph-wise; and (d) batch-wise. The four normalization methods can similarly be extended to normalize edge features, as shown in (e), (f), (g), and (h).
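The normalizers discussed above differ mainly in the set of features over which the mean and standard deviation are computed. The sketch below illustrates that idea, plus a softmax-weighted combination standing in for the attentive selection; it is a minimal NumPy illustration, not the paper's exact formulation (learnable scale/shift parameters and batch-wise statistics are omitted, and all names are hypothetical):

```python
import numpy as np

def _standardize(x, axis):
    """Subtract the mean and divide by the std along the given axis."""
    mu = x.mean(axis=axis, keepdims=True)
    sigma = x.std(axis=axis, keepdims=True) + 1e-5  # epsilon for stability
    return (x - mu) / sigma

def node_norm(X, adj):
    """Per node, statistics over its own feature channels (LN-like)."""
    return _standardize(X, axis=1)

def adjacency_norm(X, adj):
    """Per node, statistics over the node and its neighbors (local structure)."""
    A = adj + np.eye(adj.shape[0])
    out = np.empty_like(X)
    for i in range(X.shape[0]):
        nbrs = X[A[i] > 0]                        # features of node i and its neighbors
        out[i] = (X[i] - nbrs.mean(0)) / (nbrs.std(0) + 1e-5)
    return out

def graph_norm(X, adj):
    """Statistics over all nodes of one graph (global structure)."""
    return _standardize(X, axis=0)

def attentive_norm(X, adj, logits):
    """Softmax-weighted combination of candidate normalizers; the logits
    would be learned jointly with the GNN, so training can pick the best
    normalizer, or the best mixture, for the task at hand."""
    norms = [node_norm, adjacency_norm, graph_norm]
    w = np.exp(logits) / np.exp(logits).sum()
    return sum(wi * f(X, adj) for wi, f in zip(w, norms))

X = np.array([[1., 2.], [3., 4.], [5., 6.]])
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
out = attentive_norm(X, adj, np.zeros(3))  # uniform weights before training
print(out.shape)  # (3, 2)
```

With zero logits all normalizers contribute equally; gradient descent on the logits then shifts the weight toward whichever normalizer (or mixture) lowers the task loss.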

