LEARNING GRAPH NORMALIZATION FOR GRAPH NEURAL NETWORKS

Abstract

Graph Neural Networks (GNNs) have emerged as a useful paradigm for processing graph-structured data. Usually, GNNs are stacked into multiple layers, and the node representations in each layer are computed by propagating and aggregating the neighboring node features with respect to the graph. To effectively train a GNN with multiple layers, some normalization techniques are necessary. Though existing normalization techniques have been shown to accelerate the training of GNNs, they ignore the structure information of the graph. In this paper, we propose two graph-aware normalization methods to effectively train GNNs. Then, considering that normalization methods for GNNs are highly task-relevant and it is hard to know in advance which normalization method is the best, we propose to learn attentive graph normalization by optimizing a weighted combination of multiple graph normalization methods at different scales. By optimizing the combination weights, we can automatically select the best normalization method, or the best combination of multiple normalization methods, for a specific task. We conduct extensive experiments on benchmark datasets for different tasks and confirm that the graph-aware normalization methods lead to promising results and that the learned weights suggest the more appropriate normalization methods for a specific task.

1. INTRODUCTION

Graph Neural Networks (GNNs) have shown great popularity due to their efficiency in learning on graphs for various application areas, such as natural language processing (Yao et al., 2019; Zhang et al., 2018), computer vision (Li et al., 2020; Cheng et al., 2020), point clouds (Shi & Rajkumar, 2020), drug discovery (Lim et al., 2019), citation networks (Kipf & Welling, 2016), and social networks (Chen et al., 2018). A graph consists of nodes and edges, where nodes represent individual objects and edges represent relationships among those objects. In the GNN framework, the node or edge representations are alternately updated by propagating information along the edges of a graph via non-linear transformation and aggregation functions (Wu et al., 2020; Zhang et al., 2018). A GNN captures long-range node dependencies by stacking multiple message-passing layers, allowing the information to propagate over multiple hops (Xu et al., 2018). In essence, a GNN is a new kind of neural network that applies neural network operations over a graph structure. Among the numerous kinds of GNNs (Bruna et al., 2014; Defferrard et al., 2016; Maron et al., 2019; Xu et al., 2019), message-passing GNNs (Scarselli et al., 2009; Li et al., 2016; Kipf & Welling, 2016; Velickovic et al., 2018; Bresson & Laurent, 2017) have been the most widely used due to their ability to leverage the basic building blocks of deep learning such as batching, normalization, and residual connections. Many approaches have been designed to update the feature representation of a node.
For example, Graph ConvNet (GCN) (Kipf & Welling, 2016) employs an averaging operation over the neighborhood, with the same weight value for each of its neighbors; GraphSage (Hamilton et al., 2017) samples a fixed-size neighborhood of each node and performs a mean or LSTM-based aggregation over the neighbors; Graph Attention Network (GAT) (Velickovic et al., 2018) incorporates an attention mechanism into the propagation step, which updates the feature representation of each node via a weighted sum of adjacent node representations; MoNet (Monti et al., 2017) designs a Gaussian kernel with learnable parameters to assign different weights to neighbors; GatedGCN (Bresson & Laurent, 2017) explicitly introduces edge features at each layer, updates edge features by considering the feature representations of the two connected nodes of the edge, and has achieved state-of-the-art results on several datasets (Dwivedi et al., 2020). A more detailed overview of GNNs is provided in Appendix A. It is well known that one of the critical ingredients for effectively training deep neural networks is the normalization technique; e.g., Batch Normalization (BN) (Ioffe & Szegedy, 2015) is widely used to accelerate deep neural network training. Other than BN, several normalization methods have been developed from different perspectives, e.g., Layer Normalization (LN) (Ba et al., 2016) and Group Normalization (Wu & He, 2018), which operate along the channel dimension; Instance Normalization (Ulyanov et al., 2016), which performs a BN-like normalization for each sample; and Switchable Normalization (Luo et al., 2019), which utilizes three distinct scopes, including channel, layer, and minibatch, to compute the first-order and second-order statistics. Each normalization method has its advantages and is suitable for particular tasks. For instance, BN has achieved excellent performance in computer vision, whereas LN outperforms BN in natural language processing (Vaswani et al., 2017). Analogously, in Dwivedi et al. (2020), BN is utilized in each graph propagation layer during the training of GNNs. In Zhao & Akoglu (2020), a novel normalization layer, denoted PAIRNORM, is introduced to mitigate the over-smoothing problem and prevent all node representations from homogenizing by differentiating the distances between different node pairs. Although the methods mentioned above have been demonstrated to be useful in training GNNs, they ignore the local and global structure of the graph.
Moreover, in previous work, only one of the mentioned normalization methods is selected and used for all normalization layers. This may limit the potential performance improvement of the normalization method, and it is also hard to decide which normalization method is suitable for a specific task. Graph data contains rich structural information. The contributions of the paper are highlighted as follows.

• We propose two graph-aware normalization methods: adjacency-wise normalization and graph-wise normalization. To the best of our knowledge, this is the first time that graph-aware normalization methods are proposed for training GNNs.

• We propose to learn attentive graph normalization by optimizing a weighted combination of different normalization methods. By learning the combination weights, we can automatically select the best normalization method, or the best combination of multiple normalization methods, for training GNNs on a specific task.

• We conduct extensive experiments on benchmark datasets for different tasks and confirm that the graph-aware normalization methods lead to promising results and that the learned weights suggest the more appropriate normalization methods for a specific task.

2. GRAPH-AWARE NORMALIZATION AT DIFFERENT SCALES

Suppose that we have $N$ graphs $G_1, G_2, \dots, G_N$ in a mini-batch. Let $G_k = (V_k, E_k)$ be the $k$-th graph, where $V_k$ is the set of nodes and $E_k$ is the set of edges. We use $v_{k,i}$ to denote the $i$-th node of graph $G_k$ and $e_{k,i,j}$ to denote the edge between nodes $v_{k,i}$ and $v_{k,j}$ of graph $G_k$. Moreover, we use $h_{v_{k,i}} \in \mathbb{R}^d$ to represent the feature of node $v_{k,i}$ and $h^j_{v_{k,i}}$ to represent the $j$-th element of $h_{v_{k,i}}$. We use $N(v_{k,i})$ to represent the neighbors of node $v_{k,i}$ (including node $v_{k,i}$ itself).
For clarity, we formulate the normalization methods for training GNNs at different scales, as illustrated in Figure 1(a)-(d): node-wise normalization, adjacency-wise normalization, graph-wise normalization, and batch-wise normalization.

Node-wise Normalization. Node-wise normalization on a graph, denoted $\mathrm{GN}_n$, normalizes the feature vector $h_{v_{k,i}}$ of each node $v_{k,i}$, computing the first- and second-order statistics over the $d$ entries of the feature vector:

$$\hat{h}^{(n)}_{v_{k,i}} = \frac{h_{v_{k,i}} - \mu^{(n)}_{k,i}\mathbf{1}}{\sigma^{(n)}_{k,i}}, \quad \mu^{(n)}_{k,i} = \frac{1}{d}\sum_{j=1}^{d} h^j_{v_{k,i}}, \quad \sigma^{(n)}_{k,i} = \sqrt{\frac{1}{d}\sum_{j=1}^{d}\big(h^j_{v_{k,i}} - \mu^{(n)}_{k,i}\big)^2}, \tag{1}$$

where $\mu^{(n)}_{k,i}$ and $\sigma^{(n)}_{k,i}$ are the mean and the standard deviation along the feature dimension for node $v_{k,i}$, and $\mathbf{1} \in \mathbb{R}^d$ is a $d$-dimensional vector of all ones. Note that node-wise normalization is equivalent to applying LN to each node of the graph to reduce the "covariate shift" problem.

Adjacency-wise Normalization. Each node in a graph has its neighbors. However, node-wise normalization normalizes each node individually and ignores the local structure of the graph. Here, we propose to take the adjacency structure of the graph into account and normalize the node features over the adjacent neighbors. We term this adjacency-wise normalization on a graph, denoted $\mathrm{GN}_a$. For each node $v_{k,i}$ in graph $G_k$, we consider its adjacent nodes $N(v_{k,i})$, as illustrated in Figure 1(b). Specifically, the adjacency-wise normalization for node $v_{k,i}$ is defined as follows:

$$\hat{h}^{(a)}_{v_{k,i}} = \frac{h_{v_{k,i}} - \mu^{(a)}_{k,i}\mathbf{1}}{\sigma^{(a)}_{k,i}}, \quad \mu^{(a)}_{k,i} = \frac{1}{|N(v_{k,i})| \cdot d}\sum_{v_{k,i'} \in N(v_{k,i})}\sum_{j=1}^{d} h^j_{v_{k,i'}}, \quad \sigma^{(a)}_{k,i} = \sqrt{\frac{1}{|N(v_{k,i})| \cdot d}\sum_{v_{k,i'} \in N(v_{k,i})}\sum_{j=1}^{d}\big(h^j_{v_{k,i'}} - \mu^{(a)}_{k,i}\big)^2},$$

where $\mu^{(a)}_{k,i}$ and $\sigma^{(a)}_{k,i}$ are the first-order and second-order statistics over the adjacent nodes.

Graph-wise Normalization. Note that the nodes belonging to graph $G_k$ naturally form a group.
To preserve the global structure of a graph, we propose to normalize the node features based on the first- and second-order statistics computed over graph $G_k$. Specifically, we define graph-wise normalization on a graph, denoted $\mathrm{GN}_g$, for node $v_{k,i}$ as follows:

$$\hat{h}^{(g)}_{v_{k,i}} = \big(h_{v_{k,i}} - \mu^{(g)}_{k}\big)\Lambda_k^{-1}, \quad \mu^{(g)}_{k} = \frac{1}{|G_k|}\sum_{v_{k,i} \in G_k} h_{v_{k,i}},$$

where $\mu^{(g)}_k$ and $\Lambda_k$ are the first-order and second-order statistics of graph $G_k$, and $\Lambda_k$ is a diagonal matrix whose diagonal entries are

$$\Lambda^{jj}_k = \sqrt{\frac{1}{|G_k|}\sum_{v_{k,i} \in G_k}\big(h^j_{v_{k,i}} - \mu^{(g),j}_{k}\big)^2}.$$

If the task has only a single graph, then graph-wise normalization is similar to BN. However, unlike BN, graph-wise normalization does not use a smoothed (running) average of the statistics.

Batch-wise Normalization. To keep training stable, BN is one of the most critical components. For a mini-batch of $N$ graphs, we compute the mean and standard deviation over all nodes of all graphs in the mini-batch, and each node feature $h_{v_{k,i}}$ is normalized as follows:

$$\hat{h}^{(b)}_{v_{k,i}} = \big(h_{v_{k,i}} - \mu^{(b)}\big)\Lambda^{-1}, \quad \mu^{(b)} = \frac{1}{T}\sum_{k=1}^{N}\sum_{i=1}^{|G_k|} h_{v_{k,i}},$$

where $T = \sum_{k=1}^{N}|G_k|$ is the total number of nodes in the $N$ graphs and $\Lambda$ is a diagonal matrix holding the standard deviation of the node features over the $N$ graphs, with diagonal entries

$$\Lambda^{jj} = \sqrt{\frac{1}{T}\sum_{k=1}^{N}\sum_{i=1}^{|G_k|}\big(h^j_{v_{k,i}} - \mu^{(b),j}\big)^2}.$$

Note that batch-wise normalization on a graph, denoted $\mathrm{GN}_b$, is effectively BN (Ioffe & Szegedy, 2015), which performs normalization over all nodes of the $N$ graphs in a mini-batch. The normalization methods applied to node features $h_{v_{k,i}}$ can also be extended to edge features $h_{e_{k,i,j}}$, where $h_{e_{k,i,j}}$ denotes the feature of edge $e_{i,j}$ in graph $G_k$, as illustrated in Figure 1(e)-(h).

Remark. The properties of the four normalization methods are summarized as follows.
• Node-wise normalization normalizes the feature of each node individually, ignoring both the adjacency structure and the whole graph structure. It is equivalent to LN (Ba et al., 2016) in operation.

• Adjacency-wise normalization takes the adjacent nodes into account, whereas graph-wise normalization takes into account the features of all nodes in a graph.

• Batch-wise normalization is the same as standard batch normalization (Ioffe & Szegedy, 2015). If the task involves only a single graph, then batch-wise normalization is similar to graph-wise normalization, except that the momentum average used in batch-wise normalization is not used in graph-wise normalization.
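The four scales above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: node features are assumed to be the rows of a matrix `H`, neighborhoods are given as index lists that include the node itself, and the trainable scale/shift parameters are omitted.

```python
import numpy as np

def node_norm(H, eps=1e-5):
    """Node-wise (GN_n): statistics over each node's own d features (per-row LN)."""
    mu = H.mean(axis=1, keepdims=True)
    sigma = H.std(axis=1, keepdims=True)
    return (H - mu) / (sigma + eps)

def adj_norm(H, neighbors, eps=1e-5):
    """Adjacency-wise (GN_a): statistics over all entries of the closed neighborhood."""
    out = np.empty_like(H)
    for i, nbrs in enumerate(neighbors):
        block = H[nbrs]                       # |N(v_i)| x d slice of features
        out[i] = (H[i] - block.mean()) / (block.std() + eps)
    return out

def graph_norm(H, eps=1e-5):
    """Graph-wise (GN_g): per-feature statistics over all nodes of one graph."""
    mu = H.mean(axis=0, keepdims=True)
    sigma = H.std(axis=0, keepdims=True)
    return (H - mu) / (sigma + eps)

def batch_norm(H_list, eps=1e-5):
    """Batch-wise (GN_b): per-feature statistics over all nodes of all graphs."""
    Hb = np.concatenate(H_list, axis=0)
    mu = Hb.mean(axis=0, keepdims=True)
    sigma = Hb.std(axis=0, keepdims=True)
    return [(H - mu) / (sigma + eps) for H in H_list]
```

Note that only `adj_norm` and `graph_norm` depend on the graph structure; `node_norm` and `batch_norm` coincide with LN and BN applied to the node features.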

3. LEARNING ATTENTIVE GRAPH NORMALIZATION

Although we have defined several normalization methods for graph-structured data, different tasks prefer different normalization methods, and for a specific task it is hard to decide which normalization method should be used. Moreover, a single normalization approach is typically utilized in all normalization layers of a GNN, which may sacrifice performance. To remedy these issues, we propose to learn attentive graph normalization for training GNNs by optimizing a weighted combination of the normalization methods. Specifically, we combine the normalized node features as follows:

$$\hat{h}_{v_{k,i}} = \gamma\big(\alpha^{(n)} \odot \hat{h}^{(n)}_{v_{k,i}} + \alpha^{(a)} \odot \hat{h}^{(a)}_{v_{k,i}} + \alpha^{(g)} \odot \hat{h}^{(g)}_{v_{k,i}} + \alpha^{(b)} \odot \hat{h}^{(b)}_{v_{k,i}}\big) + \beta, \tag{11}$$

where $\alpha^{(n)}, \alpha^{(a)}, \alpha^{(g)}, \alpha^{(b)} \in \mathbb{R}^d$ are trainable gate parameters with the same dimension as $h_{v_{k,i}}$, $\odot$ is the element-wise product, and $\gamma \in \mathbb{R}$ and $\beta \in \mathbb{R}^d$ are trainable scale and shift parameters, respectively. We use the learned $\alpha^{(n)}, \alpha^{(a)}, \alpha^{(g)}$, and $\alpha^{(b)}$ to indicate the contribution of the corresponding normalized feature to $\hat{h}_{v_{k,i}}$. Thus, we impose normalization constraints on each dimension: $\alpha^{(u)}_j \in [0, 1]$ for $u \in \{n, a, g, b\}$ and $j = 1, \dots, d$, and $\sum_{u \in \{n,a,g,b\}} \alpha^{(u)}_j = 1$ for each $j$. In this way, if a normalization method is better suited to a specific task, its learned weights will be higher than the others. We term the learned attentive graph normalization method in Equation (11) Automatic Graph Normalization (AGN). In AGN, multiple normalization methods collaborate and compete with each other to improve the performance of GNNs. Different normalization methods are suitable for different tasks; in AGN, the attention weights $\alpha^{(n)}, \alpha^{(a)}, \alpha^{(g)}$, and $\alpha^{(b)}$ are optimized for a specific task, and thus the best-performing normalization method will receive significant weights.
Therefore, AGN can serve as an effective strategy to select the best-performing normalization method, or the best combination of multiple normalization methods, for a specific task.
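The simplex constraint on the per-dimension weights can be enforced in several ways; a softmax over the four methods is one natural parameterization. The sketch below is an assumption, not the paper's stated implementation: it combines four already-normalized feature vectors with softmax-constrained gates.

```python
import numpy as np

def softmax(z, axis=0):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def agn_combine(h_norms, logits, gamma=1.0, beta=0.0):
    """h_norms: (4, d) array stacking GN_n, GN_a, GN_g, GN_b outputs for one node.
    logits: (4, d) unconstrained trainable parameters; the softmax over the
    4 methods yields alpha_j^(u) in [0, 1] with sum_u alpha_j^(u) = 1 per dim j."""
    h_norms, logits = np.asarray(h_norms), np.asarray(logits)
    alpha = softmax(logits, axis=0)                       # (4, d) gate weights
    combined = gamma * (alpha * h_norms).sum(axis=0) + beta
    return combined, alpha
```

With all logits initialized to the same value, every gate starts at 0.25, matching the initialization used in Section 4.4; training then shifts mass toward the methods that help the task.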

4. EXPERIMENTS

We evaluate $\mathrm{GN}_n$, $\mathrm{GN}_a$, $\mathrm{GN}_g$, $\mathrm{GN}_b$, and AGN under three GNN frameworks: Graph Convolution Network (GCN), Graph Attention Network (GAT), and GatedGCN. We also assess the performance of GNNs without a normalization layer, denoted "No Norm". The benchmark datasets cover three types of tasks: node classification, link prediction, and graph classification/regression. We use all seven datasets from Dwivedi et al. (2020): PATTERN, CLUSTER, TSP, COLLAB, MNIST, CIFAR10, and ZINC. In addition, we apply GatedGCN to a key information extraction problem and evaluate the effect of different normalization methods on SROIE (Huang et al., 2019), which is used for extracting key information from receipts in the ICDAR 2019 Challenge (task 3). The detailed statistics of the datasets are presented in Appendix C.1. The implementations of GCN, GAT, and GatedGCN are from the GNN benchmarking framework. The hyper-parameters and optimizers of the models and the details of the experimental settings are kept the same as in Dwivedi et al. (2020). We run experiments on the CLUSTER and PATTERN datasets with GNNs of depth L ∈ {4, 16}. For the other datasets, we fix the number of GCN layers to L = 4.

4.1. NODE CLASSIFICATION

For the CLUSTER and PATTERN datasets, the average node-level accuracy, weighted with respect to the class sizes, is used to evaluate the performance of all models. For each model, we conduct 4 trials with different seeds {41, 95, 35, 12} to compute the average accuracy; the results are shown in Table 1. Graph-wise normalization ($\mathrm{GN}_g$) clearly outperforms batch-wise normalization ($\mathrm{GN}_b$) in most situations. For instance, when the depth of the GNN is 4, GatedGCN with $\mathrm{GN}_g$ achieves a 9% improvement over $\mathrm{GN}_b$ on CLUSTER. Batch-wise normalization computes statistics over a batch of data and ignores the differences between graphs. Different from $\mathrm{GN}_b$, $\mathrm{GN}_g$ performs normalization for each graph separately; thus, $\mathrm{GN}_g$ can learn the dedicated information of each graph and normalize its features into a reasonable range. The performance of adjacency-wise normalization ($\mathrm{GN}_a$) is similar to that of node-wise normalization ($\mathrm{GN}_n$); compared with $\mathrm{GN}_n$, $\mathrm{GN}_a$ considers the neighbors of each node and obtains higher accuracies. AGN obtains comparable results for different GNNs, and its results are close to the best results in most cases due to its flexibility and adaptability: AGN can adaptively learn the optimal combination of normalization methods for the node classification task. Moreover, we apply node classification to key information extraction on SROIE, which consists of 626 receipts for training and 347 receipts for testing. Each image is annotated with text bounding boxes (bboxes) and the transcript of each bbox. There are four entities to extract from a receipt (Company, Date, Address, and Total), as shown in Appendix C.2. For a receipt image, each text bounding box is labeled with one of five classes (Total, Date, Address, Company, and Other). The key information extraction is then treated as node classification, where each text bounding box in a receipt image is a node; the feature representation of each node is described in Appendix C.2. "Company" and "Address" usually consist of multiple text bounding boxes (nodes). An entity is recorded as "extracted successfully" if and only if all nodes of the entity are classified correctly. In this experiment, we use GatedGCN. We compute the mean accuracy for each text field and the average accuracy for each receipt and show the results in Table 2. We can observe that $\mathrm{GN}_g$ achieves the best performance among all compared normalization methods. In a receipt, there are many nodes with only numeric text, and it is hard to differentiate the "Total" field from other nodes with numeric text. $\mathrm{GN}_g$ performs well on this field and outperforms the second best by 2.0%.
We believe that graph-wise normalization can make the "Total" field stand out from the other bounding boxes with numeric text by aggregating the relevant anchor-point information from its neighbors and removing the mean numeric information. Similarly, graph-wise normalization promotes the extraction of the other three key fields. Interestingly, the graph of each receipt is special in that neighboring nodes usually belong to different classes; thus, the performance of adjacency-wise normalization is worse than that of node-wise normalization.

4.2. LINK PREDICTION

Link prediction is to predict whether there is a link between two nodes $v_i$ and $v_j$ in a graph. The node features of $v_i$ and $v_j$, at both ends of edge $e_{i,j}$, are concatenated to make a prediction. Experimental results are shown in Table 3. All five normalization methods achieve similar performance on the TSP dataset. Compared with the others, the results of AGN are very stable: for each GNN, the result of AGN is comparable with the best result.

4.4. ANALYSIS AND FURTHER INVESTIGATIONS

The above experimental results indicate that $\mathrm{GN}_g$ outperforms batch normalization on most node classification tasks. Each single normalization method performs very well on some datasets, while its performance may decrease sharply on others. Meanwhile, our proposed AGN, which integrates several normalization methods into a unified optimization framework, achieves competitive results compared with the best single normalization method on various datasets.

Behaviours of Learned Weights in AGN.

To gain more insight into AGN, we conduct a set of experiments to analyze the effect of each normalization method on different datasets. Note that AGN combines the results of several normalization methods, and $\{\alpha^{(u)}\}_{u \in \{n,a,g,b\}}$ in Equation (11) indicate the importance of the corresponding normalization methods. We initialize the weights $\{\alpha^{(u)}\}$ in each layer with equal values, i.e., $\alpha^{(u)}_j = 0.25$ for $j = 1, \dots, d$ and $u \in \{n, a, g, b\}$. In the training phase, each component of $\{\alpha^{(u)}\}$ changes between 0 and 1. On each dataset, we investigate the learned weights at different layers of GatedGCN. In particular, we collect the learned weights of each normalization method in each layer and average them over all $d$ entries of $\alpha^{(u)}$. We show the learned average weights in Figure 2. As can be observed, the learned average weights of each normalization method not only change across datasets but also vary across layers. This implies that different layers prefer different normalization methods in order to yield good performance. We can also observe that the weights on $\mathrm{GN}_g$ are larger than the others on node classification tasks, while $\mathrm{GN}_b$ is more important on the other tasks. Our proposed AGN has the ability to automatically choose the suitable normalization method for a specific task.

Evaluation on Selected Normalization Methods via AGN. To further evaluate the performance of the selected normalization methods, we select the two best-performing normalization methods, combine them into a new normalization method as in Equation (11), and conduct experiments on each dataset. The experimental results are listed in Table 5. The combined normalization method obtains results comparable with the best single normalization method. Therefore, these results show that the learned weights indicate whether the corresponding normalization method is suitable for the current task.

5. CONCLUSIONS

We formulated four normalization methods for training GNNs at different scales: node-wise normalization, adjacency-wise normalization, graph-wise normalization, and batch-wise normalization. In particular, adjacency-wise normalization and graph-wise normalization are graph-aware normalization methods, designed with respect to the local and global structure of the graph, respectively. Moreover, we proposed a novel optimization framework, called Automatic Graph Normalization, to learn attentive graph normalization by optimizing an attentively weighted combination of multiple graph normalization methods. We conducted extensive experiments on seven benchmark datasets over different tasks and confirmed that the graph-aware normalization methods and the automatically learned graph normalization lead to promising results and that the learned optimal weights suggest more appropriate normalization methods for specific tasks.

A OVERVIEW OF GRAPH NEURAL NETWORKS

In a message-passing GNN, the representation of node $v$ at layer $\ell + 1$ is updated as

$$h^{\ell+1}_v = C\Big(\phi(h^{\ell}_v),\ M\big(\{\psi(h^{\ell}_u) : u \in N(v)\}\big)\Big),$$

where $h^{\ell}_u$ denotes the feature vector at the $\ell$-th layer of node $u \in N(v)$, $\psi$ and $\phi$ are learnable functions, $M$ is the aggregation function over the nodes in $N(v)$, and $C$ combines the feature of node $v$ with those of its neighbors. The initial node representation $h^0_v = x_v$ is the original input feature vector of node $v$. Graph ConvNets (Kipf & Welling, 2016) treat each neighbor node $u$ equally to update the representation of a node $v$:

$$h^{\ell+1}_v = \mathrm{ReLU}\Big(\frac{1}{\deg_v}\sum_{u \in N(v)} W^{\ell} h^{\ell}_u\Big),$$

where $W^{\ell} \in \mathbb{R}^{d \times d}$ and $\deg_v$ is the in-degree of node $v$. One graph convolutional layer only considers immediate neighbors; to use neighbors within $k$ hops, multiple GCN layers are stacked in practice. All neighbors contribute equally in the information passing of GCN. One key issue of GCN is the over-smoothing problem, which can be partially eased by residual shortcuts across layers. Another effective approach is to use spatial GNNs, such as GAT (Velickovic et al., 2018) and GatedGCN (Bresson & Laurent, 2017).
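The GCN update above can be written as one dense matrix operation. The following NumPy sketch is illustrative only: it assumes an adjacency matrix with self-loops, row-vector features (so the weight is applied on the right), and omits the bias.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer: h_v <- ReLU( (1/deg_v) * sum_{u in N(v)} h_u W ).
    H: (n, d) node features, A: (n, n) adjacency with self-loops, W: (d, d)."""
    deg = A.sum(axis=1, keepdims=True)        # per-node degree incl. self-loop
    return np.maximum(0.0, (A @ H @ W) / deg) # mean-aggregate, transform, ReLU
```

Stacking this layer $k$ times lets information from $k$-hop neighbors reach each node, which is exactly the mechanism the over-smoothing discussion refers to.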
GAT (Velickovic et al., 2018) learns to assign different weights to adjacent nodes by adopting an attention mechanism. In GAT, the feature representation of $v$ is updated by

$$h^{\ell+1}_v = \sigma\Big(\sum_{u \in N(v)} a_{u,v} W^{\ell} h^{\ell}_u\Big),$$

where $a_{u,v}$ measures the contribution of node $u$'s feature to node $v$:

$$a_{u,v} = \frac{\exp\big(g(\alpha^{T}[W^{\ell} h^{\ell}_u \,\|\, W^{\ell} h^{\ell}_v])\big)}{\sum_{k \in N(v)} \exp\big(g(\alpha^{T}[W^{\ell} h^{\ell}_k \,\|\, W^{\ell} h^{\ell}_v])\big)},$$

where $g(\cdot)$ is a LeakyReLU activation function, $\alpha$ is a weight vector, and $\|$ is the concatenation operation. Similar to Vaswani et al. (2017), multi-head attention is employed in GAT to expand its expressive capability and stabilize the learning process. GAT has achieved an impressive improvement over GCN on node classification tasks. However, as the number of graph convolutional layers increases, node representations converge to the same value; the over-smoothing problem still exists. To mitigate it, GatedGCN (Bresson & Laurent, 2017) integrates a gating mechanism (Hochreiter & Schmidhuber, 1997), batch normalization (Ioffe & Szegedy, 2015), and residual connections (He et al., 2016) into the network design. Unlike GCN, which treats all edges equally, GatedGCN uses an edge-gating mechanism to give different weights to different nodes. For node $v$, the feature update is

$$h^{\ell+1}_v = h^{\ell}_v + \mathrm{ReLU}\Big(\mathrm{BN}\Big(W^{\ell} h^{\ell}_v + \sum_{u \in N(v)} e^{\ell}_{v,u} \odot U^{\ell} h^{\ell}_u\Big)\Big),$$

where $W^{\ell}, U^{\ell} \in \mathbb{R}^{d \times d}$, $\odot$ is the Hadamard product, and the edge gates $e^{\ell}_{v,u}$ are defined as

$$e^{\ell}_{v,u} = \frac{\sigma(\hat{e}^{\ell}_{v,u})}{\sum_{u' \in N(v)} \sigma(\hat{e}^{\ell}_{v,u'}) + c}, \quad \hat{e}^{\ell}_{v,u} = \hat{e}^{\ell-1}_{v,u} + \mathrm{ReLU}\big(\mathrm{BN}(A^{\ell} h^{\ell-1}_v + B^{\ell} h^{\ell-1}_u + C^{\ell} \hat{e}^{\ell-1}_{v,u})\big),$$

where $\sigma(\cdot)$ is the sigmoid function, $c$ is a small fixed constant, and $A^{\ell}, B^{\ell}, C^{\ell} \in \mathbb{R}^{d \times d}$. Different from traditional GNNs, GatedGCN explicitly considers the edge feature $\hat{e}_{v,u}$ at each layer.
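The GAT attention coefficients can be computed directly from the formula above. This NumPy sketch is a single-head illustration under simplifying assumptions (no trainable bias, neighborhoods given as index lists including the node itself); it is not the reference GAT implementation.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """g(.) in the GAT attention formula."""
    return np.where(x > 0, x, slope * x)

def gat_attention(H, neighbors, W, a):
    """Compute a_{u,v} = softmax_{u in N(v)} LeakyReLU(a^T [W h_u || W h_v]).
    H: (n, d) features, W: (d', d) projection, a: (2*d',) attention vector.
    Returns {v: {u: a_{u,v}}} with weights summing to 1 over each N(v)."""
    Z = H @ W.T                                   # projected features W h
    att = {}
    for v, nbrs in enumerate(neighbors):
        scores = np.array([leaky_relu(a @ np.concatenate([Z[u], Z[v]]))
                           for u in nbrs])
        e = np.exp(scores - scores.max())         # stable softmax over N(v)
        att[v] = dict(zip(nbrs, e / e.sum()))
    return att
```

A multi-head version would run this with several independent $(W, a)$ pairs and concatenate or average the resulting aggregated features.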

B.1 BATCH NORMALIZATION

Batch normalization (BN) (Ioffe & Szegedy, 2015) has become one of the critical components in training deep neural networks; it normalizes features using first- and second-order statistics computed within a mini-batch. BN can reduce the internal covariate shift problem and accelerate training. We briefly introduce the formulation of BN. Let $H = \{h_1, h_2, \dots, h_m\} \in \mathbb{R}^{d \times m}$ be the input of a normalization layer, where $m$ is the batch size and $h_i$ represents a sample, and let $\mu^{(m)} \in \mathbb{R}^d$ and $\sigma^{(m)} \in \mathbb{R}^d$ denote the mean and standard deviation vectors of the $m$ samples in $H$. BN normalizes each feature dimension as

$$\hat{h} = \gamma\,(h - \mu^{(m)})\,./\,\sigma^{(m)} + \beta, \quad \mu^{(m)} = \frac{1}{m}\sum_{j=1}^{m} h_j, \quad \sigma^{(m)}_i = \sqrt{\frac{1}{m}\sum_{j=1}^{m}\big(h_{ij} - \mu^{(m)}_i\big)^2},$$

$$\mu \leftarrow \alpha\mu + (1-\alpha)\mu^{(m)}, \quad \sigma^2 \leftarrow \alpha\sigma^2 + (1-\alpha)\big(\sigma^{(m)}\big)^2,$$

where $./$ denotes element-wise division, and $\gamma$ and $\beta$ are trainable scale and shift parameters, respectively. The running mean $\mu$ and running variance $\sigma^2$ approximate the statistics of the whole dataset and are used for normalization during testing.
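The train/test asymmetry of BN (batch statistics during training, running statistics at test time) is the part that graph-wise normalization deliberately drops. A minimal NumPy sketch, with samples as rows rather than columns for convenience; the momentum value is an assumed default, not one taken from the paper:

```python
import numpy as np

class SimpleBN:
    """Per-feature batch normalization with momentum-averaged running statistics."""
    def __init__(self, d, alpha=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(d), np.zeros(d)   # trainable scale/shift
        self.mu, self.var = np.zeros(d), np.ones(d)       # running statistics
        self.alpha, self.eps = alpha, eps

    def __call__(self, H, training=True):
        if training:
            mu, var = H.mean(axis=0), H.var(axis=0)       # batch statistics
            # momentum update of the running estimates
            self.mu = self.alpha * self.mu + (1 - self.alpha) * mu
            self.var = self.alpha * self.var + (1 - self.alpha) * var
        else:
            mu, var = self.mu, self.var                   # dataset-level estimates
        return self.gamma * (H - mu) / np.sqrt(var + self.eps) + self.beta
```

Removing the two momentum-update lines and always using the per-graph statistics recovers the behavior of graph-wise normalization on a single graph.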

B.2 LAYER NORMALIZATION

Layer Normalization (LN) (Ba et al., 2016) is widely adopted in natural language processing; in particular, the Transformer (Vaswani et al., 2017) incorporates LN as a standard normalization scheme. BN computes the mean and variance over a mini-batch, and the stability of training is highly dependent on these two statistics; Shen et al. (2020) showed that a Transformer with BN performs poorly because of large fluctuations of the batch statistics throughout training. Different from BN, LN computes the mean and variance along the feature dimension for each training case: for each sample $h_j \in \mathbb{R}^d$, LN computes the mean $\mu^{(L)}_j$ and standard deviation $\sigma^{(L)}_j$ across the feature dimension. The normalization equations of LN are as follows:

$$\hat{h}_j = \gamma \odot \frac{h_j - \mu^{(L)}_j \mathbf{1}}{\sigma^{(L)}_j} + \beta, \quad \mu^{(L)}_j = \frac{1}{d}\sum_{i=1}^{d} h_{ij}, \quad \sigma^{(L)}_j = \sqrt{\frac{1}{d}\sum_{i=1}^{d}\big(h_{ij} - \mu^{(L)}_j\big)^2},$$

where $\hat{h}_j \in \mathbb{R}^d$ is the normalized feature vector, $\mathbf{1} \in \mathbb{R}^d$ is a $d$-dimensional vector of ones, and $\gamma, \beta \in \mathbb{R}^d$ are scale and shift parameters. Overall, there are many normalization approaches (Ulyanov et al., 2016; Wu & He, 2018; Shen et al., 2020; Dimitriou & Arandjelovic, 2020). Shen et al. (2020) indicated that BN is suitable for computer vision tasks, while LN achieves better results in NLP. The performance of a normalization approach may vary a lot across tasks, so it is important to investigate the performance of normalization approaches in GNNs.

C DATASETS AND EXPERIMENTAL DETAILS

C.1 DATASET STATISTICS

Table C.1 summarizes the statistics of the datasets used in our experiments.

C.2 SROIE

For a receipt, each text bounding box (bbox) is viewed as a node of a graph. The position and attributes of the bounding box and the corresponding text are used as the node feature. To describe the relationships among the text bounding boxes of a receipt, we consider the distance between two nodes $v_i$ and $v_j$: if it is less than a threshold $\theta$, we connect $v_i$ and $v_j$ by an edge $e_{i,j}$. Since the relative positions of two text bounding boxes are important for node classification, we encode the relative coordinates of $v_i$ and $v_j$ to represent the edge $e_{i,j}$. In this way, the information extraction task for a receipt can be treated as a node classification task on a graph. Our goal is to label each node (i.e., text bounding box) with one of five classes: "Company", "Date", "Address", "Total", and "Other". Since GatedGCN explicitly exploits edge features and has achieved state-of-the-art performance on various tasks, we use GatedGCN with 8 GCN layers for this task.
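The distance-thresholded graph construction can be sketched directly. This is an illustrative reduction, not the paper's pipeline: bounding boxes are represented only by their center coordinates, and the edge feature is simply the relative offset between two centers.

```python
import numpy as np

def build_receipt_graph(centers, theta):
    """centers: (n, 2) array of bbox center coordinates (an assumed simplification).
    Connect v_i and v_j when their Euclidean distance is below theta; the edge
    feature e_{i,j} encodes the relative coordinates of the two centers."""
    centers = np.asarray(centers, dtype=float)
    edges, edge_feats = [], []
    for i in range(len(centers)):
        for j in range(len(centers)):
            if i != j and np.linalg.norm(centers[i] - centers[j]) < theta:
                edges.append((i, j))
                edge_feats.append(centers[j] - centers[i])  # relative position
    return edges, edge_feats
```

In practice the node feature would also include the bbox geometry and an embedding of the transcript, and $\theta$ would be tuned to the receipt layout; both are outside the scope of this sketch.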

D ACKNOWLEDGEMENT

We would like to thank Vijay et al. for releasing their benchmarking code for our research. We also want to thank the DGL team for their excellent toolbox.



The node-wise normalization method in Equation (1) can also be used to normalize the feature of each edge, as illustrated in Figure 1(e). For edge $e_{k,i,j}$, as in Figure 1(f), the adjacent edges can be considered in a similar way. For the edges $E_k$ of graph $G_k$ (Figure 1(g)), the same normalization can also be defined. The GNN benchmarking framework is available at https://github.com/graphdeeplearning/benchmarking-gnns



Figure 1: Illustration of normalization methods on a graph. Node features are normalized at four levels: (a) node-wise; (b) adjacency-wise; (c) graph-wise; and (d) batch-wise. Similarly, the four normalization methods can be extended to normalize edge features, as shown in (e), (f), (g), and (h).

Figure 2: Learnt weight distributions of normalization methods along with layers on different tasks.


Figure 3: Sample images of the SROIE dataset. Four entities are highlighted in different colors. "Company", "Address", "Date", and "Total" are marked with Red, Blue, Yellow, and Purple individually. The "Company" and the "Address" entities usually consist of several text lines.

Figure 5: Training loss and test result of GatedGCN on CIFAR, MNIST and ZINC vs. the number of steps, with different normalization methods.

By considering the structural information in the graph, in this paper we propose two graph-aware normalization methods at different scales: a) adjacency-wise normalization and b) graph-wise normalization. Unlike BN and LN, adjacency-wise normalization takes into account the local structure of the graph, whereas graph-wise normalization takes into account the global structure of the graph. On the other hand, multiple normalization methods are available for training GNNs, and it is hard to know in advance which normalization method is the most suitable for a specific task. To address this, we further propose to learn attentive graph normalization by optimizing a weighted combination of multiple normalization methods. By optimizing the combination weights, we can automatically select the best normalization method, or the best combination of multiple normalization methods, for training GNNs on a specific task.

Results on CLUSTER and PATTERN. Red: the best model; Violet: good models.

Performance (accuracy) comparison of different normalization approaches.

Link prediction results on TSP and COLLAB. Red: the best model; Violet: good models.

Average MAE is also reported in Table 4. We can see that $\mathrm{GN}_b$ outperforms the others in most cases. $\mathrm{GN}_g$ does not work well on graph classification and regression, which also affects the performance of AGN: in AGN, the normalized features of $\mathrm{GN}_n$, $\mathrm{GN}_a$, $\mathrm{GN}_g$, and $\mathrm{GN}_b$ are integrated, and more attention is automatically paid to $\mathrm{GN}_b$ due to its outstanding performance. Therefore, the performance of AGN is comparable with that of $\mathrm{GN}_b$.

Results on MNIST, CIFAR10 and ZINC. Red: the best model,Violet: good models.

Performance of different normalization methods on seven benchmark datasets. For each dataset, we give the performance of the two best normalization methods and of a new normalization method that combines the two best-performing methods as in Equation (11).

Summary statistics of datasets used in our experiments

Training loss and test result of GatedGCN on CLUSTER, PATTERN and TSP vs. the number of steps, with different normalization methods.

