DISTRIBUTIONAL SIGNALS FOR NODE CLASSIFICATION IN GRAPH NEURAL NETWORKS

Abstract

In graph neural networks (GNNs), both node features and labels are examples of graph signals, a key notion in graph signal processing (GSP). While it is common in GSP to impose signal smoothness constraints in learning and estimation tasks, it is unclear how this can be done for discrete node labels. We bridge this gap by introducing the concept of distributional graph signals. In our framework, we work with the distributions of node labels instead of their values and propose notions of smoothness and non-uniformity of such distributional graph signals. We then propose a general regularization method for GNNs that allows us to encode distributional smoothness and non-uniformity of the model output in semi-supervised node classification tasks. Numerical experiments demonstrate that our method can significantly improve the performance of most base GNN models in different problem settings.

1. INTRODUCTION

We consider the semi-supervised node classification problem (Kipf & Welling, 2017), which determines the class labels of nodes in graphs given sample observations and possibly node features. Numerous graph neural network (GNN) models have been proposed to tackle this problem. One of the first models is the graph convolutional network (GCN) (Defferrard et al., 2016). Interpreted geometrically, a GCN aggregates information such as node features from the neighborhood of each node of the graph. Algebraically, this process is equivalent to applying a graph convolution filter to node feature vectors. Subsequently, many GNN models with different considerations have been introduced. Popular models include the graph attention network (GAT) (Veličković et al., 2018), which learns weights between pairs of nodes during aggregation, and the hyperbolic graph convolutional neural network (HGCN) (Chami et al., 2019), which embeds the nodes of a graph in a hyperbolic space instead of a Euclidean space. For inductive learning, GraphSAGE (Hamilton et al., 2017) was proposed to generate low-dimensional vector representations for nodes, which is useful for graphs with rich node attribute information.

While new models draw inspiration from GCN, GCN itself is built upon the foundation of graph signal processing (GSP). GSP is a signal processing framework that handles graph-structured data (Shuman et al., 2013; Ortega et al., 2018; Ji & Tay, 2019). A graph signal is a vector with each component corresponding to a node of a graph. Examples include node features and node labels. Moreover, convolutions used in models such as GCN are special cases of convolution filters in GSP (Shuman et al., 2013). These observations show the close connection between GSP theory and GNNs.

In GSP, signal smoothness (over the graph) is widely used to regularize inference tasks. Intuitively, a signal is smooth if its values are similar at each pair of nodes connected by an edge. One popular way to formally define signal smoothness is to use the Laplacian quadratic form. There are numerous GSP tools that leverage a smooth prior on the graph signals. For example, Laplacian (Tikhonov) regularization is proposed for noise removal in Shuman et al. (2013) and for signal interpolation (Narang et al., 2013). In Chen et al. (2015), it is used in graph signal in-painting and anomaly detection. In Kalofolias (2016), the same technique is used for graph topology inference.

However, for GNNs, it is remarked in Yang et al. (2021, Section 4.1.2) that "graph Laplacian regularization can hardly provide extra information that existing GNNs cannot capture". Therefore, a regularization scheme based on feature propagation is proposed there. It is demonstrated to be effective by comparison with other methods, such as Feng et al. (2021) and Deng & Zhu (2019), which are based on adversarial learning, and Stretcu et al. (2019), which co-trains GNN models with an additional agreement model that gives the probability that two nodes have the same label.

We partially agree with the above assertion regarding graph Laplacian regularization, while having reservations about its full correctness. In this paper, we propose a method that is inspired by Laplacian regularization. As our main contribution, we introduce the notion of distributional graph signals, which replaces ordinary graph signals as the object of study. Analogous to the graph signal smoothness defined using the graph Laplacian, we define the smoothness of distributional graph signals. Together with another property known as non-uniformity, we devise a regularization scheme for GNNs in node classification tasks. This approach is easy to implement and can be used as a plug-in regularization term together with any given base GNN model. Its effectiveness is demonstrated with numerical results.

2. DISTRIBUTIONAL GRAPH SIGNALS

In this section, we motivate and introduce distributional graph signals based on GSP theory.

2.1. GSP PRELIMINARIES AND SIGNAL SMOOTHNESS

In this subsection, we give a brief overview of GSP theory (Shuman et al., 2013). The focus is on graph signal smoothness.

Let $G = (V, E)$ be an undirected graph with $V$ the vertex set and $E$ the edge set. Suppose the size of the graph is $n = |V|$. Fix an ordering of $V$. Then the space of graph signals can be identified with the vector space $\mathbb{R}^n$: a graph signal $x \in \mathbb{R}^n$ assigns its $i$-th component to the $i$-th vertex of $G$. By convention, signals are in column form, and $x(i)$ is the $i$-th component of $x$.

In GSP, the key notion is the graph shift operator. Though there are several choices for the graph shift operator, in this paper we consider a common choice: $L_G$, the Laplacian of $G$, defined by $L_G = D_G - A_G$, where $D_G$ and $A_G$ are the degree matrix and adjacency matrix of $G$, respectively. The Laplacian $L_G$ is positive semi-definite and symmetric. By the spectral theorem, it has an eigendecomposition $L_G = U_G \Lambda_G U_G^\top$. In the decomposition, $\Lambda_G$ is a diagonal matrix whose diagonal entries $\{\lambda_1, \dots, \lambda_n\}$ are the eigenvalues of $L_G$. They are non-negative and we assume $\lambda_1 \le \dots \le \lambda_n$. The associated eigenvectors $\{u_1, \dots, u_n\}$ are the columns of $U_G$. In GSP, an eigenvector with a small eigenvalue (and hence a small index) is considered to be smooth. The signal values of such a vector have small fluctuations across the edges of $G$.

Given a graph signal $x$, its graph Fourier transform is $\hat{x} = U_G^\top x$, or equivalently, $\hat{x}(i) = \langle x, u_i \rangle$ for $1 \le i \le n$. The components $\hat{x}(i)$ of $\hat{x}$ are called the frequency components of $x$. As above, the signal $x$ is smooth if $\hat{x}(i)$ has a small absolute value for large $i$. Quantitatively, we can define its total variation by

$$T(x) = \sum_{(v_i, v_j) \in E} \big(x(i) - x(j)\big)^2 = x^\top L_G x. \tag{1}$$

It is straightforward to compute that $T(u_i) = \lambda_i$. This observation indicates that it is reasonable to use total variation as a measure of smoothness.
Minimizing the total variation of graph signals has many applications in GSP as we have pointed out in Section 1.
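As a quick numerical check of (1), the following sketch evaluates the total variation in both forms, as a sum over edges and as the quadratic form with the Laplacian. The helper names and the small 5-edge example graph are illustrative assumptions and do not come from the paper's code; each undirected edge is assumed to be listed once.

```python
def laplacian(n, edges):
    """Combinatorial Laplacian L_G = D_G - A_G of an undirected graph."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L


def total_variation_edges(x, edges):
    """T(x) as the sum of squared differences across edges."""
    return sum((x[i] - x[j]) ** 2 for i, j in edges)


def total_variation_quadratic(x, L):
    """T(x) as the quadratic form x^T L_G x."""
    n = len(x)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


# Hypothetical example: a 4-cycle with one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
L = laplacian(4, edges)
constant_signal = [1.0, 1.0, 1.0, 1.0]      # no variation across any edge
alternating_signal = [1.0, -1.0, 1.0, -1.0]  # flips sign along the cycle

tv_constant = total_variation_edges(constant_signal, edges)
tv_alternating = total_variation_edges(alternating_signal, edges)
```

The constant signal has zero total variation, while the alternating signal differs by 2 across each of the four cycle edges and agrees across the chord.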

2.2. STEP GRAPH SIGNALS

Let $S$ be a finite set of numbers. A step graph signal with respect to (w.r.t.) $S$ is a graph signal $x$ such that all its components take values in $S$, i.e., $x \in S^n$.

Example 1. For the simplest example, consider the classical Heaviside function $H$ on $\mathbb{R}$ defined by $H(x) = 1$ for $x > 0$ and $H(x) = 0$ for $x \le 0$. It is a non-smooth function, as it is not even continuous at $x = 0$. On the other hand, let $G$ be the path graph with $2m + 1$ nodes embedded on the real line by identifying the nodes of $G$ with the integers in the interval $[-m, m]$. Then $H$ induces a step graph signal $h$ on $G$. Like the Heaviside function $H$, the signal $h$ should be considered a non-smooth graph signal.

Step graph signals occur naturally in semi-supervised node classification tasks. In particular, if $S$ is the set of all possible class labels, then the labels of the nodes of $G$ form a step graph signal $c$ w.r.t. $S$ on $G$. We expect that, analogous to Example 1, $c$ can possibly be non-smooth. To demonstrate, we analyze $c$ using its Fourier transform $\hat{c}$ (cf. Section 2.1). More specifically, we take $G_0$ to be the main connected component of the Cora graph (Sen et al., 2008) and $c$ to be the ground truth labels. We also generate a random signal $r$ following the same empirical distribution (on $S$) estimated using $c$. We show plots of both Fourier transforms $\hat{c}$ and $\hat{r}$ in Fig. 1. We see that the high-frequency components of the ground truth labels $c$ can be large, and it is even possible that its spectrum, i.e., its frequency components, resembles that of a random signal.

The observations support our speculation about the non-smoothness of the step signals. Therefore, in order to leverage signal smoothness to enhance model performance, we need to find an alternative to the step (label) signals. This also supports the remark of Yang et al. (2021) regarding Laplacian regularization from a different point of view (see also the experiments in Section 4.2).

Figure 1: Plots of $\hat{c}$ and $\hat{r}$ for Cora.
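To make Example 1 concrete, the sketch below computes the graph Fourier transform of the Heaviside step signal on a path graph. It relies on a standard fact that is not stated in the paper: the Laplacian eigenvectors of the $n$-node path graph are the DCT-II basis vectors, with eigenvalues $2 - 2\cos(\pi k / n)$. The function names and the choice $m = 8$ are assumptions made for illustration.

```python
import math


def path_fourier_basis(n):
    """Orthonormal Laplacian eigenvectors of the n-node path graph (DCT-II basis)."""
    basis = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        basis.append([scale * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)])
    return basis


def path_eigenvalue(n, k):
    """k-th Laplacian eigenvalue of the n-node path graph, in increasing order."""
    return 2.0 - 2.0 * math.cos(math.pi * k / n)


def graph_fourier_transform(x, basis):
    """Frequency components: inner products of x with each eigenvector."""
    return [sum(u_i * x_i for u_i, x_i in zip(u, x)) for u in basis]


m = 8                      # hypothetical half-width of the path
n = 2 * m + 1              # nodes are identified with the integers -m, ..., m
h = [1.0 if t > 0 else 0.0 for t in range(-m, m + 1)]   # Heaviside step signal
h_hat = graph_fourier_transform(h, path_fourier_basis(n))

# Total variation recovered from the spectrum: sum_k lambda_k * h_hat(k)^2.
spectral_tv = sum(path_eigenvalue(n, k) * h_hat[k] ** 2 for k in range(n))
```

Printing `h_hat` lets one inspect how the energy of the step signal spreads over the frequency index; `spectral_tv` equals the edge-sum form of (1), since the step crosses exactly one edge.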

2.3. DISTRIBUTIONAL GRAPH SIGNALS AND MARGINALS

We want to use probability theory to introduce the notion of "smoothness" for step signals in the next section. In preparation, in this subsection we formally introduce distributional graph signals and discuss how they arise naturally in GNNs.

Our theory (e.g., in Definition 1 and equation (3)) is based on probability measures defined on metric spaces. To this end, we endow any discrete space $S$ with the discrete metric $d(s_1, s_2) = 1$ if $s_1 \neq s_2 \in S$ and $0$ otherwise. For $\mathbb{R}^n$, one can use the usual Euclidean metric or any other norm.

Definition 1. For a metric space $M = S^n$ or $\mathbb{R}^n$, let $\mathcal{P}(M)$ be the space of probability measures on $M$ (w.r.t. the Borel $\sigma$-algebra) having finite second moments. An element $\mu \in \mathcal{P}(M)$ is called a distributional graph signal. The marginals of a distributional graph signal $\mu$ are the marginal distributions $\mathcal{N} = \{\mu_i : 1 \le i \le n\}$ w.r.t. the $n$ coordinates of either $S^n$ or $\mathbb{R}^n$.

To understand why a distribution $\mu$ is related to GSP, consider a step graph signal (resp. ordinary graph signal) $x$ in $S^n$ (resp. $\mathbb{R}^n$). It induces the delta distribution $\delta_x$ in $\mathcal{P}(S^n)$ (resp. $\mathcal{P}(\mathbb{R}^n)$) supported at $x$. Its marginals are the delta distributions $\{\delta_{x(i)} : 1 \le i \le n\}$. Therefore, Definition 1 subsumes ordinary graph signals as special cases.

Figure 2: We give an example of the marginals of a distributional graph signal. We notice that for $X_{1,:}$, the probability weights are equal and it is hard to determine the arg max of the 3 classes. Moreover, the weights of the 2nd class have large differences along a few edges, e.g., $(v_2, v_3)$, $(v_3, v_4)$. These features make prediction unreliable. We propose solutions to such issues in this paper.

Distributional graph signals also occur naturally in GNN models (illustrated in Fig. 2). As in the previous subsection, let $S$ be the set of all possible class labels, of size $m$, and let $M_{n,m}(\mathbb{R})$ be the space of $n \times m$ matrices. For a typical model $\mathcal{M}$ such as the GCN, the last stage of the pipeline usually consists of the following steps:

$$\text{Logits: } O \in M_{n,m}(\mathbb{R}) \xrightarrow{\ \text{Softmax: } \phi\ } X = \phi(O) \in M_{n,m}(\mathbb{R}) \xrightarrow{\ \arg\max\ } \text{Output labels: } c \in S^n. \tag{2}$$

The $i$-th row $X_{i,:}$ of $X$ can be viewed as the weights of a probability distribution $\mu_{X,i}$ on $S$. They do not directly give a distributional graph signal in $\mathcal{P}(S^n)$. However, we shall interpret $\mathcal{N}_X = \{\mu_{X,i} : 1 \le i \le n\}$ as the marginals of some unknown distributional graph signal, and $\mathcal{N}_X$ is the main subject of study in this paper. In order to mimic the ways smoothness of graph signals is used in GSP, we introduce an appropriate notion of the total variation of distributional graph signals in the next section.
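The pipeline in (2) can be sketched in a few lines. The code below is a minimal illustration with made-up logits for four nodes and three classes; the function names are assumptions, and the numbers are not the values shown in Fig. 2.

```python
import math


def softmax(row):
    """Numerically stable softmax of one row of logits."""
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def marginals(logits):
    """X = phi(O): row i holds the probability weights of mu_{X,i} on S."""
    return [softmax(row) for row in logits]


def predict(X):
    """arg max per row; ties are broken in favor of the lowest class index."""
    return [max(range(len(row)), key=row.__getitem__) for row in X]


# Hypothetical logits O for 4 nodes and 3 classes.
logits = [
    [0.0, 0.0, 0.0],    # uniform marginal: the arg max is ambiguous
    [2.0, 0.5, -1.0],
    [-1.0, 3.0, 0.0],
    [0.2, 0.1, 1.5],
]
X = marginals(logits)
labels = predict(X)
```

The first row illustrates the ambiguity discussed for $X_{1,:}$: all three weights are equal, so the predicted label depends only on the tie-breaking rule.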

3. REGULARIZED GRAPH NEURAL NETWORKS

In this section, we study smoothness and non-uniformity properties of distributional graph signals in Section 3.1 and Section 3.2 respectively. Each of these subsections yields an expression that we want to minimize. They are combined in Section 3.3 to give the proposed regularization term.

3.1. TOTAL VARIATIONS AND LAPLACIAN REGULARIZATION

The goal of this subsection is to introduce and compare different notions of total variation associated with distributional graph signals and their marginals that leverage signal smoothness. First of all, given $\mu \in \mathcal{P}(M)$ for $M = S^n$ or $\mathbb{R}^n$, its total variation can be modified directly from (1) as follows:

$$T(\mu) = \mathbb{E}_{x \sim \mu}\, T(x) = \int \sum_{(v_i, v_j) \in E} d\big(x(i), x(j)\big)^2 \, d\mu(x), \tag{3}$$

where $d$ is the metric on $S$ or $\mathbb{R}$. However, in many cases of interest such as GNNs, only marginals of some $\mu$ are observed (cf. Section 2.3). We want to define total variation in such a situation as well. We borrow ideas from the prototype of the Wasserstein metric (Villani, 2009), which we recall now. The notations and assumptions follow those of Definition 1.

Definition 2. For $\mu_1, \mu_2 \in \mathcal{P}(M)$, the Wasserstein metric $W(\mu_1, \mu_2)$ between $\mu_1$ and $\mu_2$ is defined by

$$W(\mu_1, \mu_2)^2 = \inf_{\gamma \in \Gamma(\mu_1, \mu_2)} \int d(x, y)^2 \, d\gamma(x, y),$$

where $\Gamma(\mu_1, \mu_2)$ is the set of couplings of $\mu_1, \mu_2$, i.e., the collection of probability measures on $M \times M$ whose marginals are $\mu_1$ and $\mu_2$, respectively.

It can be verified that $W(\cdot, \cdot)$ indeed defines a metric on $\mathcal{P}(M)$ (see Villani (2009)). As a special case, if $\delta_x$ and $\delta_y$ are delta distributions supported on $x, y \in M$, then $W(\delta_x, \delta_y) = d(x, y)$.

The key insight is that we want to take the infimum over all possible distributions given the prescribed marginals; and whatever we define, it should subsume (1) as a special case for a collection of delta distributions.

Definition 3. Given $\mathcal{N} = \{\mu_i : 1 \le i \le n\}$ with $\mu_i \in \mathcal{P}(S)$ (resp. $\mu_i \in \mathcal{P}(\mathbb{R})$) for $1 \le i \le n$, the total variation of $\mathcal{N}$ is defined as

$$T(\mathcal{N}) = \inf_{\mu \in \Gamma(\mathcal{N})} T(\mu),$$

where $\Gamma(\mathcal{N})$ is the collection of all distributional graph signals in $\mathcal{P}(S^n)$ (resp. $\mathcal{P}(\mathbb{R}^n)$) whose marginals agree with $\mathcal{N}$.

Though an important theoretical tool, the Wasserstein metric is usually difficult to compute explicitly. On the other hand, if $G$ is the graph with 2 nodes connected by an edge and $\mathcal{N} = \{\mu_1, \mu_2\}$, then $T(\mathcal{N}) = W(\mu_1, \mu_2)^2$. As a consequence, finding the exact value of $T(\mathcal{N})$ can be challenging. Therefore, we next introduce approximations that can be more readily computed.

For $\mathcal{N} = \{\mu_i : 1 \le i \le n\}$ with $\mu_i \in \mathcal{P}(S)$, let $\mu_s \in \mathbb{R}^n$ be the graph signal of probability weights of $s \in S$, i.e., $\mu_s(i)$ is the probability weight of $\mu_i$ at $s$. We can also stack these signals as a matrix $X_{\mathcal{N}}$ whose columns are the $\mu_s$. Based on the GSP version of total variation (cf. (1)), we introduce two more versions of total variation that are easy to compute.

Definition 4. Given $\mathcal{N} = \{\mu_i : 1 \le i \le n\}$ with $\mu_i \in \mathcal{P}(S)$, we define the $\ell_1$ and $\ell_2$ versions of total variation as:

• $T_1(\mathcal{N}) = \sum_{s \in S} \sum_{(v_i, v_j) \in E} |\mu_s(i) - \mu_s(j)|$,

• $T_2(\mathcal{N}) = \sum_{s \in S} \sum_{(v_i, v_j) \in E} \big(\mu_s(i) - \mu_s(j)\big)^2 = \mathrm{Tr}(X_{\mathcal{N}}^\top L_G X_{\mathcal{N}})$, where $\mathrm{Tr}$ is the matrix trace.

The complexity of computing either $T_1$ or $T_2$ is at most $O(|S||E|)$, and it involves only matrix multiplication for $T_2$. In addition, $T$, $T_1$ and $T_2$ satisfy the following relation.

Theorem 1. Given $\mathcal{N} = \{\mu_i : 1 \le i \le n\}$ with $\mu_i \in \mathcal{P}(S)$, we have $T_2(\mathcal{N}) \le T_1(\mathcal{N}) \le 2T(\mathcal{N})$. Moreover, $T_1(\mathcal{N}) = 2T(\mathcal{N})$ if $G$ is a tree.

The discussion and proof of a more general result can be found in Appendix D. The upshot is that if we expect $T(\mathcal{N})$ to be small for some $\mathcal{N}$, then so are necessarily $T_1(\mathcal{N})$ and $T_2(\mathcal{N})$. In view of computation cost, we mainly use $T_2$ in the design of the regularization model in Section 3.3. Note that this is analogous to Laplacian regularization in GSP.
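The relation in Theorem 1 can be checked by hand on the simplest tree, a single edge, where $T(\mathcal{N}) = W(\mu_1, \mu_2)^2$. The sketch below computes $T_1$ and $T_2$ from Definition 4 and builds a coupling whose diagonal carries $\min(\mu_1(s), \mu_2(s))$; by Lemma 2 in Appendix D such a coupling attains the Wasserstein infimum under the discrete metric. The function names, the greedy way of filling the off-diagonal mass, and the example marginals are illustrative assumptions. Exact rational arithmetic is used so that the equalities hold without rounding.

```python
from fractions import Fraction as F


def t1(X, edges):
    """l1 total variation of Definition 4; X[i][s] is the weight of mu_i at class s."""
    return sum(abs(X[i][s] - X[j][s]) for i, j in edges for s in range(len(X[0])))


def t2(X, edges):
    """l2 total variation of Definition 4, equal to Tr(X^T L_G X)."""
    return sum((X[i][s] - X[j][s]) ** 2 for i, j in edges for s in range(len(X[0])))


def diagonal_coupling(mu, nu):
    """Coupling of mu and nu with gamma[s][s] = min(mu[s], nu[s])."""
    m = len(mu)
    gamma = [[F(0)] * m for _ in range(m)]
    for s in range(m):
        gamma[s][s] = min(mu[s], nu[s])
    row_left = [mu[s] - gamma[s][s] for s in range(m)]
    col_left = [nu[s] - gamma[s][s] for s in range(m)]
    # The leftover row and column masses have disjoint supports, so the
    # greedy fill below only ever touches off-diagonal entries.
    i = j = 0
    while i < m and j < m:
        if row_left[i] == 0:
            i += 1
        elif col_left[j] == 0:
            j += 1
        else:
            moved = min(row_left[i], col_left[j])
            gamma[i][j] += moved
            row_left[i] -= moved
            col_left[j] -= moved
    return gamma


def coupling_cost(gamma):
    """Expected squared discrete distance: the total off-diagonal mass."""
    m = len(gamma)
    return sum(gamma[a][b] for a in range(m) for b in range(m) if a != b)


# Hypothetical marginals on 3 classes for a graph with a single edge.
X = [[F(1, 2), F(1, 2), F(0)], [F(1, 6), F(1, 3), F(1, 2)]]
edges = [(0, 1)]
gamma = diagonal_coupling(X[0], X[1])
T = coupling_cost(gamma)   # equals W(mu_1, mu_2)^2 for this coupling
```

For these marginals, $T_2 = 7/18$, $T_1 = 1$ and $T = 1/2$, so $T_2 \le T_1 = 2T$, consistent with the tree case of Theorem 1.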

3.2. NON-UNIFORMITY

As discussed in Section 2.3, a base GNN model may output a matrix $X$ (in (2)), with associated $\mathcal{N}_X = \{\mu_{X,i} : 1 \le i \le n\}$. For node classification problems, it is desirable that there is less ambiguity in the decision for each node so that one can pinpoint the correct class label. Mathematically, this requires that each $\mu_{X,i}$ deviates from the uniform distribution $U(S)$ on the finite set of label classes $S$, measured by the Wasserstein metric $W(\mu_{X,i}, U(S))$. The following result is proved in Appendix D.

Lemma 1. For a fixed sequence of non-positive numbers $(a_i)_{1 \le i \le n}$, there is a constant $C$ independent of $X$ such that

$$\mathrm{Tr}(X^\top D X) + C \ge 2 \sum_{1 \le i \le n} a_i \, W(\mu_{X,i}, U(S))^2, \tag{5}$$

where $D$ is the diagonal matrix with diagonal entries $(a_i)_{1 \le i \le n}$. Moreover,

$$\mathrm{Tr}(X_o^\top D X_o) \le \mathrm{Tr}(X^\top D X) \le \mathrm{Tr}(X_u^\top D X_u),$$

where $X_o$ is a matrix with each row a one-hot vector and $X_u$ is the matrix with each entry $1/|S|$.

As we want each $\mu_{X,i}$ to deviate from the uniform distribution, the right-hand side of (5) should be made small (as negative as possible). This is ensured if the proxy $\mathrm{Tr}(X^\top D X)$, which is easy to compute, is small. Moreover, we notice that $X_u$ (resp. $X_o$) corresponds to uniform (resp. $\delta$) marginal distributions. The second half of the statement suggests that minimizing $\mathrm{Tr}(X^\top D X)$ may drive the marginals, i.e., the rows of $X$, to comply with non-uniformity. Numerical experiments in Section 4.2 support that this proxy works well. We use this term in conjunction with $T_2$ introduced in Section 3.1 in our proposed regularized model.
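The two statements of Lemma 1 can be illustrated numerically. The sketch below evaluates the proxy $\mathrm{Tr}(X^\top D X) = \sum_i a_i \|X_{i,:}\|^2$ for one-hot, mixed and uniform rows, and compares it with the Wasserstein term computed through the closed form of Lemma 2 (Appendix D). The constant $C = -\sum_i a_i / |S|$ used here is an assumption: it is one value for which inequality (5) can be verified directly, and it is not necessarily the constant used in the paper's proof. All matrices and weights are made up for illustration.

```python
def nonuniformity_proxy(X, a):
    """Tr(X^T D X) with D = diag(a), i.e., sum_i a_i * ||X_i||^2."""
    return sum(a_i * sum(p * p for p in row) for a_i, row in zip(a, X))


def wasserstein_sq_to_uniform(row):
    """W(mu, U(S))^2 under the discrete metric: half the l1 distance (Lemma 2)."""
    m = len(row)
    return 0.5 * sum(abs(p - 1.0 / m) for p in row)


a = [-1.0, -2.0, -1.0]                       # hypothetical non-positive weights
X_onehot = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
X_mixed = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [1 / 3, 1 / 3, 1 / 3]]
X_uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in a]

proxy_onehot = nonuniformity_proxy(X_onehot, a)
proxy_mixed = nonuniformity_proxy(X_mixed, a)
proxy_uniform = nonuniformity_proxy(X_uniform, a)

# Assumed constant C = -sum(a) / |S| (see the lead-in above).
C = -sum(a) / 3
lhs = proxy_mixed + C
rhs = 2 * sum(a_i * wasserstein_sq_to_uniform(row) for a_i, row in zip(a, X_mixed))
```

The proxy is smallest for one-hot rows and largest for uniform rows, which is the ordering stated in the second half of Lemma 1.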

3.3. THE LOSS FUNCTION WITH REGULARIZATION

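Suppose we are given a base GNN model, denoted by $\mathcal{M}$. Let $F$ be the matrix of input feature vectors and $\Theta$ be the parameter space for $\mathcal{M}$. Assume $\mathcal{M}$ has a loss function $\mathcal{L}_{\mathcal{M}}$, and the model is set to solve the optimization problem $\min_{\theta \in \Theta} \mathcal{L}_{\mathcal{M}}(F, \theta)$.

Consider the steps given in (2). We have $X = \phi(O)$ viewed as a matrix of probability weights. For regularization, we introduce another loss $\mathcal{L}_0$ to supplement $\mathcal{L}_{\mathcal{M}}$. Sections 3.1 and 3.2 suggest that $\mathcal{L}_0$ should consist of two parts, $\mathcal{L}_0 = \mathcal{L}_1 + \mathcal{L}_2$. The loss $\mathcal{L}_1$ (cf. Definition 4) is related to the smoothness of the distributional graph signal $X$ and takes the form

$$\mathcal{L}_1(X) = \mathrm{Tr}(X^\top L_G X) = \mathrm{Tr}\big(X^\top (D_G - A_G) X\big).$$

On the other hand, $\mathcal{L}_2$ prevents the distributional signal from being uniform and can be explicitly expressed as $\mathcal{L}_2(X) = \mathrm{Tr}(X^\top D X)$ for a suitably chosen negative semi-definite diagonal matrix $D$ (cf. Lemma 1). Summing up $\mathcal{L}_1$ and $\mathcal{L}_2$, we have

$$\mathcal{L}_0(X) = \mathrm{Tr}\big(X^\top (D_G - A_G + D) X\big).$$

In our experiments, we take $D = I_n - D_G$ and obtain the easily computable loss

$$\mathcal{L}_0(X) = \mathrm{Tr}\big(X^\top (I_n - A_G) X\big).$$

In summary, if we express the output of the second-to-last stage of $\mathcal{M}$ as a function $O = \psi(F, \theta)$ of the input $F$ and model parameters $\theta$, then the regularized model R-$\mathcal{M}$ of $\mathcal{M}$ solves the optimization problem

$$\min_{\theta \in \Theta} \mathcal{L}_{\text{R-}\mathcal{M}}(F, \theta) = \min_{\theta \in \Theta} \Big( \mathcal{L}_{\mathcal{M}}(F, \theta) + \eta \cdot \mathcal{L}_0\big(\phi \circ \psi(F, \theta)\big) \Big),$$

where the coefficient $\eta$ is a tunable hyperparameter and $\circ$ denotes function composition. In the regularized model R-$\mathcal{M}$, we do not make any other changes to the base model $\mathcal{M}$ apart from using the new loss function $\mathcal{L}_{\text{R-}\mathcal{M}}$ during training. A schematic illustration is shown in Fig. 3(a).

[Figure 3: (a) the regularized model R-$\mathcal{M}$, in which $\eta \cdot \mathcal{L}_0(\phi(O))$ is added to the base loss $\mathcal{L}_{\mathcal{M}}$; (b) an R-module, which applies $\phi$ to a feature $X$ and contributes the loss $\eta_X \cdot \mathcal{L}_0(\phi(X))$.]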
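The decomposition of the regularization loss into a smoothness part and a non-uniformity part, with the choice $D = I_n - D_G$, can be verified on a small example. The sketch below is dependency-free and uses a hypothetical 3-node path graph, made-up probability weights and a made-up base loss value; the function names are assumptions, and the paper's actual implementation (Appendix B) is not reproduced here.

```python
def smoothness_loss(X, adj):
    """L_1(X) = Tr(X^T (D_G - A_G) X): squared differences summed over edges."""
    n = len(X)
    return sum(
        adj[i][j] * sum((p - q) ** 2 for p, q in zip(X[i], X[j]))
        for i in range(n) for j in range(i + 1, n)
    )


def nonuniformity_loss(X, adj):
    """L_2(X) = Tr(X^T D X) with the choice D = I_n - D_G."""
    return sum((1 - sum(adj[i])) * sum(p * p for p in X[i]) for i in range(len(X)))


def l0_loss(X, adj):
    """L_0(X) = Tr(X^T (I_n - A_G) X), computed without forming D_G."""
    n = len(X)
    self_term = sum(p * p for row in X for p in row)
    neighbor_term = sum(
        adj[i][j] * sum(p * q for p, q in zip(X[i], X[j]))
        for i in range(n) for j in range(n)
    )
    return self_term - neighbor_term


def regularized_loss(base_loss, X, adj, eta):
    """Loss of the regularized model: base loss plus eta times L_0."""
    return base_loss + eta * l0_loss(X, adj)


# Hypothetical 3-node path graph and softmax outputs for 2 classes.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

l1_value = smoothness_loss(X, adj)
l2_value = nonuniformity_loss(X, adj)
l0_value = l0_loss(X, adj)
total = regularized_loss(base_loss=0.7, X=X, adj=adj, eta=0.01)
```

Here the smoothness part is positive (the last two nodes disagree) and the non-uniformity part is negative, so the two terms pull in different directions, as intended by the design of $\mathcal{L}_0$.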

3.4. FURTHER DISCUSSIONS

Most regularization methods introduce a penalty to supplement the base model loss. For example, the recent work P-reg (Yang et al., 2021) proposes to apply a propagation matrix (e.g., the normalized adjacency matrix) to the output features. The model penalizes large discrepancies, measured by a metric such as squared error distance, between the original and transformed features. In LEReg (Ma et al., 2021), intra-energy and inter-energy losses are introduced. Both are variants of the total variation (1). The novelty lies in introducing a "merged graph" with each node representing a whole label class. All these works are related to the smoothness prior used in GSP theory (Section 2.1).

In this paper, we take a fundamentally different view of graph signals by treating a distribution as a signal. This allows a principled and straightforward adaptation of existing GSP approaches, as long as we have suitable notions of the signal's total variation. Moreover, non-uniformity does not have a counterpart for ordinary graph signals, which are essentially delta distributions.

Compared to GSP as well as other regularization methods such as P-reg and LEReg, a salient feature of our method is that the matrix $I_n - A_G$ in $\mathcal{L}_0$ is in general not positive semi-definite. Therefore, the infimum of $\mathcal{L}_0(X)$ can be $-\infty$ if the domain of the entries of $X$ is unbounded. This is another reason why restricting to distributional graph signals is essential.

The proposed regularization is primarily for node classification, as distributional graph signals can be interpreted as the likelihoods of class labels. However, the method can be extended to other tasks through an R-module (Fig. 3(b)) that consists of the following steps:

• Apply $\phi$ (e.g., softmax), which turns a feature $X$ into a matrix of probability weights $\phi(X)$.

• Plug $\phi(X)$ into the loss $\eta_X \mathcal{L}_0$ with tunable coefficient $\eta_X$.
Given a base model, multiple R-modules can be inserted at different places of the model pipeline, and all the losses are combined with the original loss of the model. The insight is that useful node features for graph learning tasks should be inherently associated with smooth and non-uniform distributional graph signals. We demonstrate this approach with link prediction and graph classification tasks in Appendix C.
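The way several R-modules contribute to the training loss can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature matrices, coefficients, graph and function names are all hypothetical. The last lines also check the boundedness remark above: because the rows of $\phi(X)$ are probability vectors, $\mathcal{L}_0(\phi(X))$ cannot fall below minus the sum of the adjacency entries.

```python
import math


def softmax_rows(Z):
    """phi: turn each row of a feature matrix into probability weights."""
    out = []
    for row in Z:
        shift = max(row)
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def l0(X, adj):
    """L_0(X) = Tr(X^T (I_n - A_G) X)."""
    n = len(X)
    self_term = sum(p * p for row in X for p in row)
    neighbor_term = sum(
        adj[i][j] * sum(p * q for p, q in zip(X[i], X[j]))
        for i in range(n) for j in range(n)
    )
    return self_term - neighbor_term


def r_module_loss(features, adj, eta):
    """One R-module: apply phi, then scale L_0 by the module's coefficient."""
    return eta * l0(softmax_rows(features), adj)


def total_loss(base_loss, modules, adj):
    """Base loss plus the losses of all inserted R-modules.

    modules is a list of (feature_matrix, eta) pairs, one per R-module.
    """
    return base_loss + sum(r_module_loss(Z, adj, eta) for Z, eta in modules)


# Hypothetical triangle graph and two intermediate feature matrices.
adj = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
hidden = [[50.0, -50.0], [40.0, -10.0], [-5.0, 30.0]]
output = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5], [3.0, 1.0, 0.0]]
modules = [(hidden, 0.001), (output, 0.01)]

loss = total_loss(base_loss=1.0, modules=modules, adj=adj)
lower_bound = -sum(sum(row) for row in adj)
```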

4. EXPERIMENTS

In this section, we verify the empirical performance of our proposed regularization method based on both Euclidean and hyperbolic GNN models. We consider node classification under both transductive and inductive settings. The datasets used include Cora, Citeseer, Pubmed, (Amazon) Photo, CS, Airport, Disease, and PPI (Sen et al., 2008; Namata et al., 2012; Zhang & Chen, 2018; Shchur et al., 2018; Fey & Lenssen, 2019; Chami et al., 2019; Szklarczyk et al., 2016). Their statistics are given in Appendix A. For our regularization method, the source code is provided in Appendix B. No further tuning of the base model is needed. We compare with base models and benchmarks in Section 4.1, and with variants of our model in Section 4.2. Other graph learning tasks are discussed in Appendix C.

4.1.1. TRANSDUCTIVE LEARNING MODELS

In this subsection, we consider transductive tasks, in which there is a single graph containing both labeled training nodes and test nodes. The base models include GCN (Defferrard et al., 2016), GAT (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017) (abbreviated as SAGE) and GraphCON (Rusch et al., 2022) (abbreviated as CON). Implementation details are given in Appendix B.

In addition to the comparison with base models, we use models having a similar structure to our approach as benchmarks. More specifically, we implement different versions of P-reg (Yang et al., 2021), denoted by P-GCN and P-GAT (based on the popular GNN models GCN and GAT), and different versions of LEReg (Ma et al., 2021), denoted by L-GCN and L-GAT. The parameters are tuned as suggested by the respective papers. We also include the Laplacian method, denoted by LAP, in which the final class label $c$ is used in the loss $\mathcal{L}_0$. For a fair comparison, tests are performed under the same hardware and software environment. Similar considerations apply in Sections 4.1.2 and 4.1.3. We also compare with BVAT (Deng & Zhu, 2019), GAM (Stretcu et al., 2019), and GraphAT (Feng et al., 2021); these results are taken from the literature and are unreported for Photo and CS.

The test accuracies (in %) are shown in Table 1. In each row, the best performers are highlighted in blue and red for our approach and the benchmarks, respectively. An underlined entry means that no noticeable performance improvement over the base model is observed. In general, we see that our proposed regularized models improve upon their respective base models, with significant performance gain in many cases (cf. Appendix E). Moreover, our method can match or even outperform many benchmarks.

The base models are GCN, GAT, as well as GraphSAGE, which is primarily designed for inductive learning tasks. The performance comparison between the base models and their regularized versions is shown in Table 3. The conclusion agrees with those drawn in the previous subsections.

4.2. ANALYSIS AND ABLATION STUDIES

We analyze our model with R-GCN, which has significant gain as compared with its base model (cf. Table 1). The graph $G_0 = (V_0, E_0)$ is the main connected component of Cora. Specifically, we want to study whether R-GCN indeed generates distributional graph signals with the desired properties.

For smoothness (cf. Section 3.1), recall that we interpret $\phi(O)$, computed from the output features $O$, as the weights of marginal distributions. For the column $\phi(O)_{:,1}$, we take the subvector indexed by $V_0$, then normalize it and denote the result by $x$ (the analysis of other columns is in Appendix G). We compute its Fourier transform $\hat{x}$ and inspect its high-frequency components. We show $\hat{x}$ for different epochs in Fig. 4. We also show the spectral plots for the last epoch (epoch 200) of GCN and of the signal of (normalized) ground truth labels. We see a clear shrinkage of high-frequency components for R-GCN (epoch 200).

For non-uniformity (cf. Section 3.2), we collect in the set $K_{\text{R-GCN}}$ (resp. $K_{\text{GCN}}$) the probability weights for all the label classes and nodes, i.e., the 18956 entries of $\phi(O)$ for epoch 200 of R-GCN (resp. GCN). Non-uniformity suggests that $K_{\text{R-GCN}}$ contains fewer values near the average $1/7$ and more values near $1$, as compared with $K_{\text{GCN}}$. To verify, we compare $|K_{\text{R-GCN}} \cap [1/7 - \epsilon_1, 1/7 + \epsilon_1]|$ with $|K_{\text{GCN}} \cap [1/7 - \epsilon_1, 1/7 + \epsilon_1]|$, and $|K_{\text{R-GCN}} \cap [1 - \epsilon_2, 1]|$ with $|K_{\text{GCN}} \cap [1 - \epsilon_2, 1]|$. The plots for different choices of (small) $\epsilon_1, \epsilon_2$ are shown in Fig. 5. The results agree with our speculation.

The ablation results are shown in Table 4. We see that R-GCN remains the most effective as compared with its variants. This suggests that each component of R-GCN plays a useful role.
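The interval counts used in the non-uniformity check above are straightforward to compute from a matrix of probability weights. The sketch below is a minimal illustration: the two small matrices stand in for the outputs of a base and a regularized model and are entirely made up, and the probability weights are kept as a list (with repetitions) when counting.

```python
def count_in_interval(values, low, high):
    """Number of values v with low <= v <= high."""
    return sum(1 for v in values if low <= v <= high)


def nonuniformity_counts(X, eps_uniform, eps_peak):
    """Counts of weights near the average 1/|S| and near 1."""
    m = len(X[0])
    weights = [p for row in X for p in row]
    near_uniform = count_in_interval(weights, 1.0 / m - eps_uniform, 1.0 / m + eps_uniform)
    near_one = count_in_interval(weights, 1.0 - eps_peak, 1.0)
    return near_uniform, near_one


# Hypothetical outputs for 2 nodes and 3 classes.
X_ambiguous = [[0.34, 0.33, 0.33], [0.5, 0.3, 0.2]]
X_confident = [[0.98, 0.01, 0.01], [0.02, 0.97, 0.01]]

counts_ambiguous = nonuniformity_counts(X_ambiguous, eps_uniform=0.05, eps_peak=0.05)
counts_confident = nonuniformity_counts(X_confident, eps_uniform=0.05, eps_peak=0.05)
```

A model that complies with non-uniformity is expected to behave like `X_confident`: few weights near $1/|S|$ and more weights near $1$.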

5. CONCLUSION

In this paper, we introduce the notion of distributional graph signals and total variations that measure the smoothness of such signals. Based on this and the concept of non-uniformity, we propose a regularization scheme that can be applied directly to enhance the performance of many GNN models. The method is analogous to regularization with a smooth signal prior in GSP.

Detailed model setups are contained in the respective GitHub links. For example, according to https://github.com/dmlc/dgl/blob/master/examples/pytorch/gcn/train.py, for GCN and the datasets Cora, Citeseer and Pubmed, two convolution layers with 16 hidden units are used. The dropout rate is set to 0.5. The Adam optimizer is used with learning rate 1e-2 and weight decay 5e-4. On the other hand, according to https://github.com/dmlc/dgl/blob/master/examples/pytorch/gat/train.py, for GAT and the datasets Cora, Citeseer and Pubmed, two graph attention layers with 8 hidden units and 8 heads are used. The dropout rate is set to 0.6. The Adam optimizer is used with learning rate 5e-3 and weight decay 5e-4. As the regularization does not change the base model, the exact same setups are used.

In Table 6, we provide the values of the coefficient $\eta$ used in Section 4 (irrelevant fields are filled with "-"). We briefly describe the strategy for choosing $\eta$. We fix a lower bound ($= 10^{-5}$) and an upper bound ($= 1$) for $\eta$, both of which are loose. We perform a search analogous to binary search within the range, based on validation performance. The scaling factor for the search can be different from 2: we use a large scaling factor to identify an interval with significant performance improvement and then perform a fine-scale search within the interval. If no performance improvement is observed within the initial range, then we declare that the regularization does not show improvement for the given base model.

Table 6 (fragment; column headers not recoverable):
R-HGCN: - - - - - 0.01 0.001 - - - -
R-HGAT: - - - - - 0.01 0.001 - - - -
R-GIL: - - - - - 0.01 - - - - -
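The coarse-to-fine search for $\eta$ described above can be sketched as follows. This is one possible reading of the strategy, not the authors' tuning script: the scaling factor of 10, the number of fine steps, the function names and the synthetic validation curve are all assumptions. In practice, `validate` would train the regularized model with the given $\eta$ and return a validation score.

```python
import math


def search_eta(validate, baseline, low=1e-5, high=1.0, coarse_factor=10.0, fine_steps=8):
    """Coarse-to-fine search for eta; returns None if nothing beats the baseline."""
    # Coarse pass: geometric grid low, low * factor, ..., up to high.
    scores = {}
    eta = low
    while eta <= high * (1 + 1e-12):
        scores[eta] = validate(eta)
        eta *= coarse_factor
    best = max(scores, key=scores.get)
    if scores[best] <= baseline:
        return None  # no improvement observed within the initial range
    # Fine pass: geometric grid inside the coarse interval around the best value.
    lo = max(low, best / coarse_factor)
    hi = min(high, best * coarse_factor)
    ratio = (hi / lo) ** (1.0 / fine_steps)
    for k in range(fine_steps + 1):
        candidate = lo * ratio ** k
        if candidate not in scores:
            scores[candidate] = validate(candidate)
    return max(scores, key=scores.get)


def toy_validation(eta):
    """Synthetic validation score peaking near eta = 10 ** -2.3 (illustration only)."""
    return -(math.log10(eta) + 2.3) ** 2


chosen_eta = search_eta(toy_validation, baseline=-1.0)
no_gain = search_eta(lambda eta: -5.0, baseline=-1.0)
```

Scores are cached in a dictionary so that each candidate is evaluated once, which matters when every evaluation is a full training run.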

C APPENDIX: LINK PREDICTION AND GRAPH CLASSIFICATION

We follow Section 3.4 to apply the proposed regularization method to link prediction and graph classification.

For link prediction (Liben-Nowell & Kleinberg, 2007; Zhang & Chen, 2018), we want to predict whether two nodes in a network are likely to have a link. Suppose a base GNN model for link prediction is given. We insert an R-module (Fig. 3(b)) at the last node feature matrix in the model pipeline. The loss of the R-module is added directly to the original loss. We test on the Cora and Airport datasets with the base models GCN, GAT, HGCN, and GIL. The results are shown in Table 7. Except for HGCN, we see that the regularization can enhance the performance of the base models.

We next consider graph classification. For such a task, we want to determine the class label of each graph in a dataset containing multiple graphs. Graph classification is closely related to the theoretical problem of graph isomorphism testing, and GIN (Xu et al., 2019) is a GNN model that explores such a connection. We use GIN and the variant GIN2, which has a learnable importance of the target node compared to its neighbors, as base models. As in link prediction, we insert an R-module (Fig. 3(b)) at the last node feature matrix in the model pipeline. We test with the bioinformatics datasets MUTAG and PTC (Yanardag & Vishwanathan, 2015), following the protocol described in Xu et al. (2019), and report the 10-fold cross-validation accuracy. Comparison results are shown in Table 8. The regularization indeed works for both models.

D APPENDIX: PROOFS OF THEORETICAL RESULTS

In this appendix, we discuss and prove a general result that implies Theorem 1. In addition, we also prove Lemma 1.

We start with a computation of the Wasserstein distance. Suppose $S = \{s_1, \dots, s_m\}$ is a finite discrete set and $d$ is the discrete metric on $S$. For $\mu, \nu \in \mathcal{P}(S)$, let $(\mu(s_i))_{1 \le i \le m}$ and $(\nu(s_i))_{1 \le i \le m}$ be their respective probability weights.

Lemma 2. $W(\mu, \nu)^2 = \frac{1}{2} \sum_{1 \le i \le m} |\mu(s_i) - \nu(s_i)|$.

Proof. Let $\gamma = (\gamma(s_i, s_j))_{1 \le i, j \le m}$ be in $\Gamma(\mu, \nu)$. We have

$$\sum_{1 \le i \le m} \sum_{1 \le j \le m} \gamma(s_i, s_j)\, d(s_i, s_j)^2 = \sum_{1 \le i \le m} \sum_{1 \le j \ne i \le m} \gamma(s_i, s_j) = \sum_{1 \le i \le m} \Big( \sum_{1 \le j \le m} \gamma(s_i, s_j) - \gamma(s_i, s_i) \Big) = \sum_{1 \le i \le m} \big( \mu(s_i) - \gamma(s_i, s_i) \big) \ge \sum_{1 \le i \le m} \big( \mu(s_i) - \min(\mu(s_i), \nu(s_i)) \big),$$

where the last step uses $\gamma(s_i, s_i) \le \min(\mu(s_i), \nu(s_i))$, which holds because the marginals of $\gamma$ are $\mu$ and $\nu$. As $W(\mu, \nu)^2$ is defined by taking the infimum of the left-hand side over all $\gamma \in \Gamma(\mu, \nu)$, we have

$$W(\mu, \nu)^2 \ge \sum_{1 \le i \le m} \big( \mu(s_i) - \min(\mu(s_i), \nu(s_i)) \big).$$

By the same argument, we also have

$$W(\mu, \nu)^2 \ge \sum_{1 \le i \le m} \big( \nu(s_i) - \min(\mu(s_i), \nu(s_i)) \big).$$

Summing up these two inequalities, we have

$$2 W(\mu, \nu)^2 \ge \sum_{1 \le i \le m} \big( \mu(s_i) + \nu(s_i) - 2 \min(\mu(s_i), \nu(s_i)) \big) = \sum_{1 \le i \le m} |\mu(s_i) - \nu(s_i)|.$$

Therefore, to prove the lemma, it suffices to show that there is a $\gamma$ such that $\gamma(s_i, s_i) = \min(\mu(s_i), \nu(s_i))$ for all $i$. For this, we prove a slightly more general claim: if non-negative numbers $(x_i)_{1 \le i \le m}$ and $(y_i)_{1 \le i \le m}$ satisfy $\sum_{1 \le i \le m} x_i = \sum_{1 \le j \le m} y_j = a$, then there are non-negative $(z_{i,j})_{1 \le i, j \le m}$ such that $\sum_{1 \le i \le m} z_{i,j} = x_j$ for $1 \le j \le m$, $\sum_{1 \le j \le m} z_{i,j} = y_i$ for $1 \le i \le m$, and $z_{i,i} = \min(x_i, y_i)$ for $1 \le i \le m$. Since the claim is symmetric in $(x_i)$ and $(y_i)$ after transposing $(z_{i,j})$, it yields the required $\gamma$.

We prove this by induction on $m$. The case $m = 1$ is trivially true by taking $z_{1,1} = x_1 = y_1$. For $m \ge 2$, without loss of generality, we assume that $x_1 \ge y_1$ and $x_2 \le y_2$. Then we choose $z_{1,1} = y_1$, $z_{2,2} = x_2$, $z_{1,j} = 0$ for $1 < j \le m$, and $z_{i,2} = 0$ for $1 \le i \ne 2 \le m$. As a result, we form another two sequences of non-negative numbers $x_1 - y_1, x_3, \dots, x_m$ and $y_2 - x_2, y_3, \dots, y_m$, both summing to $a - x_2 - y_1$. By the induction hypothesis, we are able to find non-negative $(z'_{i,j})_{1 \le i, j \le m-1}$ for the two new sequences of length $m - 1$ each. It suffices to let $z_{i,j} = z'_{i-1,j-1}$ for $i > 1$ and $j > 2$, and $z_{i,1} = z'_{i-1,1}$ for $i > 1$ (illustrated in Fig. 9). This proves the claim and hence the lemma.

Figure 9: The relations between $z_{i,j}$ and $z'_{i,j}$.

To state and prove a general form of Theorem 1, we need to introduce a few more notions. We fix marginal distributions $\mathcal{N} = \{\mu_i : 1 \le i \le n\}$ with $\mu_i \in \mathcal{P}(S)$. For any pair of nodes $v_i$ and $v_j$ and $s \in S$, define

$$\rho_{i,j}(s) = \begin{cases} \mu_j(s) / \mu_i(s) & \text{if } \mu_j(s) \le \mu_i(s), \\ 1 & \text{otherwise.} \end{cases} \tag{8}$$

More generally, if $P = (v_{i_0}, \dots, v_{i_l})$ is a directed path on $G$ from $v_{i_0}$ to $v_{i_l}$, then $\rho_P(s) = \prod_{0 \le j < l} \rho_{i_j, i_{j+1}}(s)$. It is important to point out that $\rho_P$ can be computed directly as long as $\mathcal{N}$ is given.

In the graph $G$, suppose $H$ is a spanning tree and $v_0$ is a fixed (root) node. Let $E_H$ be the edge set of $H$ and $E' = E \setminus E_H$. For each edge $e = (v_i, v_j)$, let $P_i$ (resp. $P_j$) be the unique path on $H$ connecting $v_0$ and $v_i$ (resp. $v_j$). Moreover, $v_0$ is an endpoint of $P_i \cap P_j$, and we let $v_k$ be the other endpoint of $P_i \cap P_j$. Denote by $Q_i$ (resp. $Q_j$) the directed path (on $H$) from $v_k$ to $v_i$ (resp. $v_j$) (see Fig. 10). We introduce

$$t_{i,j}(s) = \mu_i(s) + \mu_j(s) - 2 \mu_k(s) \rho_{Q_i}(s) \rho_{Q_j}(s). \tag{9}$$

Definition 5. Define $T_{H,v_0}(\mathcal{N}) = \sum_{s \in S} \sum_{(v_i, v_j) \in E} t_{i,j}(s)$.

Figure 10: An example of paths $Q_i$ and $Q_j$.

We can compute the special case where $G = H$ is a tree. Notice that for an edge $e = (v_i, v_j) \in E_H$ directed from $v_i$ to $v_j$, we have $v_k = v_i$, $Q_i = \{v_i\}$, $Q_j = e$ and $\rho_{Q_i}(s) = 1$. If $\mu_i(s) \ge \mu_j(s)$, then $2 \mu_k(s) \rho_{Q_i}(s) \rho_{Q_j}(s) = 2 \mu_i(s) \cdot \mu_j(s) / \mu_i(s) = 2 \mu_j(s)$. Hence, we have $\mu_i(s) + \mu_j(s) - 2 \mu_k(s) \rho_{Q_i}(s) \rho_{Q_j}(s) = \mu_i(s) - \mu_j(s) = |\mu_i(s) - \mu_j(s)|$.
The case $\mu_i(s) < \mu_j(s)$ is similar, and in summary

$$t_{i,j}(s) = \mu_i(s) + \mu_j(s) - 2\mu_k(s)\rho_{Q_i}(s)\rho_{Q_j}(s) = |\mu_i(s) - \mu_j(s)|. \tag{10}$$

Therefore, for any $v_0$, we have

$$T_{H,v_0}(\mathcal{N}) = \sum_{s\in S}\sum_{(v_i,v_j)\in E} |\mu_i(s) - \mu_j(s)| = \sum_{s\in S}\sum_{(v_i,v_j)\in E} |\mu_s(i) - \mu_s(j)| = T_1(\mathcal{N}), \tag{11}$$

where $\mu_s(i)$ is the same as $\mu_i(s)$ (cf. Section 3.1). Following the notations in Section 3.1, we have the following generalization of Theorem 1.

Theorem 2. For any spanning tree $H$ of $G$ and root node $v_0$, we have $T_2(\mathcal{N}) \le T_1(\mathcal{N}) \le 2T(\mathcal{N}) \le T_{H,v_0}(\mathcal{N})$.

Before proving the result, we remark that although $T_{H,v_0}(\mathcal{N})$ is defined in a convoluted way, it can be computed directly given $\mathcal{N}$, $H$ and $v_0$. Therefore, the result gives computable upper and lower bounds of $T(\mathcal{N})$.

Proof. As $|\mu_s(i) - \mu_s(j)| \le 1$, it is trivially true that $T_2(\mathcal{N}) \le T_1(\mathcal{N})$.

To show $T_1(\mathcal{N}) \le 2T(\mathcal{N})$, we first claim that $T(\mathcal{N}) = \inf_{\mu\in\Gamma(\mathcal{N})} T(\mu)$ is achieved for some $\mu_0 \in \Gamma(\mathcal{N})$. The map $\alpha : \Gamma(\mathcal{N}) \to \mathbb{R}$, $\mu \mapsto T(\mu)$ is continuous. On the other hand, $\Gamma(\mathcal{N})$ is a compact subset of a Euclidean space. This is because $\mathcal{P}(S^n)$ is a bounded subset of $\mathbb{R}^{m^n}$ with the components corresponding to weights of the joint distribution. Moreover, $\Gamma(\mathcal{N})$ is closed because the condition to have the prescribed marginal distributions is a set of linear conditions. By the extreme value theorem, $\inf \alpha$ is achieved for some $\mu_0 \in \Gamma(\mathcal{N})$.

For each edge $(v_i, v_j) \in E$, let $\mu_{0,i,j}$ be the marginal distribution of the pair $(v_i, v_j)$. The marginals of $\mu_{0,i,j}$ are $\mu_i$ and $\mu_j$ at $v_i$ and $v_j$ respectively. We have

$$
\begin{aligned}
T(\mathcal{N}) = T(\mu_0)
&= \int \sum_{(v_i,v_j)\in E} d(x(i), x(j))^2 \, \mathrm{d}\mu_0(x) \\
&= \sum_{(v_i,v_j)\in E} \int d(x(i), x(j))^2 \, \mathrm{d}\mu_0(x) \\
&= \sum_{(v_i,v_j)\in E} \int d(y(i), y(j))^2 \, \mathrm{d}\mu_{0,i,j}(y) \\
&\overset{\text{Def. 2}}{\ge} \sum_{(v_i,v_j)\in E} W(\mu_i, \mu_j)^2 \\
&\overset{\text{Lem. 2}}{=} \frac{1}{2} \sum_{(v_i,v_j)\in E} \sum_{s\in S} |\mu_s(i) - \mu_s(j)| = \frac{1}{2} T_1(\mathcal{N}).
\end{aligned}
$$

We now prove $2T(\mathcal{N}) \le T_{H,v_0}(\mathcal{N})$. Given a spanning tree $H$ and node $v_0$, we construct $\mu_{H,v_0} \in \Gamma(\mathcal{N})$ using ideas from the theory of Bayesian networks (Bishop, 2006) as follows.
We make $H$ directed by requiring that each edge is pointed away from the (root) node $v_0$. As a consequence, each node has at most 1 incoming edge. Let $x = (x_i)_{1\le i\le n}$ be the random vector with $x_i$ the (random) label at $v_i$. For each directed edge $(v_i, v_j)$ in $H$, let $\mu_{i,j} \in \Gamma(\mu_i, \mu_j)$ be a distribution that realizes $W(\mu_i, \mu_j)$, which gives conditional probability weights $p_{i,j} = \{p_{i,j}(s, s') : s, s' \in S\} = \{p(x_j = s' \mid x_i = s) : s, s' \in S\}$. By Bishop (2006) Section 8.1 (8.5), there is a distribution $\mu_{H,v_0} \in \Gamma(\mathcal{N})$ such that its marginal for each edge $(v_i, v_j) \in E_H$ is $\mu_{i,j}$. For each edge $(v_i, v_j) \in E'$, let $\mu'_{i,j}$ be the marginal of $\mu_{H,v_0}$ to the pair $(v_i, v_j)$.

As $\mu_{i,j}$ realizes $W(\mu_i, \mu_j)$, by Lemma 2, $\mu_{i,j}(s, s) = \min(\mu_i(s), \mu_j(s))$ for each $s \in S$. In particular, $p_{i,j}(s, s) = \rho_{i,j}(s)$ (cf. (8)). More generally, if $P$ is a directed path in $H$ from $v_i$ to $v_j$, then the following inequality holds:

$$p_{i,j}(s, s) \ge \rho_P(s). \tag{12}$$

According to the definition, we have $T(\mathcal{N}) \le T(\mu_{H,v_0}) = S_{E_H} + S_{E'}$. The summand

$$S_{E_H} = \sum_{(v_i,v_j)\in E_H} \int d(y(i), y(j))^2 \, \mathrm{d}\mu_{i,j}$$

is the summation over the edges of $E_H$, while

$$S_{E'} = \sum_{(v_i,v_j)\in E'} \int d(y(i), y(j))^2 \, \mathrm{d}\mu'_{i,j}$$

is the summation over $E'$. For $S_{E_H}$, we have seen that for each edge $(v_i, v_j) \in E_H$,

$$
\int d(y(i), y(j))^2 \, \mathrm{d}\mu_{i,j}(y)
\overset{\text{Lem. 2}}{=} \frac{1}{2} \sum_{s\in S} |\mu_i(s) - \mu_j(s)|
\overset{(10)}{=} \frac{1}{2} \sum_{s\in S} \big( \mu_i(s) + \mu_j(s) - 2\mu_k(s)\rho_{Q_i}(s)\rho_{Q_j}(s) \big),
$$

where the right-hand side is the sum over $s \in S$ of the terms $\frac{1}{2} t_{i,j}(s)$ (cf. (9)) that correspond to $(v_i, v_j)$ in $\frac{1}{2} T_{H,v_0}(\mathcal{N})$.

Consider $(v_i, v_j) \in E'$ and we first notice the following identity:

$$2 \int d(y(i), y(j))^2 \, \mathrm{d}\mu'_{i,j}(y) \overset{(7)}{=} \sum_{s\in S} \big( \mu_i(s) + \mu_j(s) - 2\mu'_{i,j}(s, s) \big).$$

To show the summand is bounded by $t_{i,j}(s)$, it suffices to show $\mu'_{i,j}(s, s)$ is bounded below by $\rho_{Q_i}(s)\rho_{Q_j}(s)\mu_k(s)$, with $v_k$ and paths $Q_i$, $Q_j$ as in (9).
We estimate using the construction of $\mu'_{i,j}$ based on the method of Bayesian networks (on directed $H$) as follows:

$$
\begin{aligned}
\mu'_{i,j}(s, s) &\ge p(x_i = s, x_j = s, x_k = s) \\
&= p(x_i = s, x_j = s \mid x_k = s)\,\mu_k(s) \\
&= p_{k,i}(s, s)\, p_{k,j}(s, s)\, \mu_k(s) \\
&\ge \rho_{Q_i}(s)\rho_{Q_j}(s)\mu_k(s).
\end{aligned}
$$

For the last equality, we use the fact that $Q_i \cap Q_j = \{v_k\}$, and hence $x_i$, $x_j$ are independent given $x_k$ by the Bayesian network construction. The last line follows from (12). Consequently, the inequality $2T(\mathcal{N}) \le T_{H,v_0}(\mathcal{N})$ follows.

If we examine the formula of $T_{H,v_0}(\mathcal{N})$ (by changing summation order), it can be decomposed into two parts $\sum_{(v_i,v_j)\in E_H}$ and $\sum_{(v_i,v_j)\in E'}$. The former is essentially $T_1$ of $\mathcal{N}$ on the tree $H$. This is the reason that Theorem 2 implies Theorem 1. Formally we have the following.

Proof of Theorem 1. It suffices to prove the last statement for $G$ being a tree. As we have seen in (11), if $G$ is a tree, then $T_{H,v_0}(\mathcal{N}) = T_1(\mathcal{N})$. Therefore, Theorem 2 implies $2T(\mathcal{N}) = T_1(\mathcal{N})$, and hence Theorem 1.

We end this section by proving Lemma 1.

Proof of Lemma 1. Using Lemma 2, we estimate

$$
\begin{aligned}
2 \sum_{1\le i\le n} a_i W(\mu_{X,i}, \mathcal{U}(S))^2
&= \sum_{1\le i\le n} a_i \sum_{1\le j\le m} \Big| X_{i,j} - \frac{1}{|S|} \Big| \\
&\le \sum_{1\le i\le n} a_i \sum_{1\le j\le m} \Big( X_{i,j} - \frac{1}{|S|} \Big)^2 \\
&= \sum_{1\le i\le n} a_i \sum_{1\le j\le m} X_{i,j}^2 - 2 \sum_{1\le i\le n} a_i \sum_{1\le j\le m} \frac{X_{i,j}}{|S|} + \sum_{1\le i\le n} a_i \sum_{1\le j\le m} \frac{1}{|S|^2} \\
&= \mathrm{Tr}(X^\top D X) - 2 \sum_{1\le i\le n} a_i \frac{1}{|S|} + \sum_{1\le i\le n} a_i \frac{1}{|S|} \\
&= \mathrm{Tr}(X^\top D X) - \frac{1}{|S|} \mathrm{Tr}(D).
\end{aligned}
$$

Therefore, $2 \sum_{1\le i\le n} a_i W(\mu_{X,i}, \mathcal{U}(S))^2 \le \mathrm{Tr}(X^\top D X) + C$ with $C = -\frac{1}{|S|} \mathrm{Tr}(D)$, which is independent of $X$. The last statement of the lemma follows from the simple fact that for each $1 \le i \le n$, we have $\sum_{1\le j\le m} \frac{1}{|S|^2} \le \sum_{1\le j\le m} X_{i,j}^2 \le 1$.

On the other hand, though graph signal smoothness has been fundamental in both GSP and GNNs, the negative effect of over-smoothing has also been examined (NT & Maehara, 2019; Oono & Suzuki, 2020), and models have been proposed to alleviate it. For example, apart from the more recent works such as Yang et al. (2021); Ma et al. (2021) that have already been discussed in detail, PairNorm proposed in Zhao & Akoglu (2020) encourages the similarity between connected nodes, and at the same time adds a negative term based on distances between disconnected pairs. MADReg in Chen et al. (2020) proposes to use step size limits to make the graph nodes receive less interference noise. In Feng et al. (2020), randomly dropping nodes is proposed to reduce the convergence speed of over-smoothing. Adding skip connections is also introduced in Li et al. (2019); Luan et al. (2019). In our paper, we take a different point of view by not considering the smoothness of ordinary graph signals. Instead, we speculate that properties such as smoothness and non-uniformity of distributional graph signals may play important roles. Moreover, requiring a distributional graph signal to satisfy non-uniformity partially prevents the unfavorable situation that many connected nodes have similar marginal distributions that are approximately uniform. This can be viewed as a countermeasure to over-smoothing intrinsically contained in our approach.
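To make the two quantities discussed above concrete, the following minimal sketch evaluates distributional smoothness and non-uniformity on a small graph using the closed form of Lemma 2. The function names, the edge-list representation, and the omission of the node weights $a_i$ from Lemma 1 are assumptions made for illustration; this is not the implementation used in the experiments.

```python
from fractions import Fraction


def wasserstein_sq_discrete(mu, nu):
    """Squared Wasserstein distance under the discrete metric (Lemma 2):
    W(mu, nu)^2 = 1/2 * sum_i |mu_i - nu_i|."""
    if len(mu) != len(nu):
        raise ValueError("distributions must have the same support size")
    return sum(abs(a - b) for a, b in zip(mu, nu)) / 2


def smoothness_t1(marginals, edges):
    """T_1: sum over edges and labels of |mu_i(s) - mu_j(s)|,
    which equals 2 * W(mu_i, mu_j)^2 per edge by Lemma 2."""
    return sum(
        2 * wasserstein_sq_discrete(marginals[i], marginals[j])
        for i, j in edges
    )


def non_uniformity(marginals):
    """Unweighted sum over nodes of W(mu_i, U(S))^2, where U(S) is uniform.
    (Hypothetical helper; the weights a_i of Lemma 1 are taken to be 1.)"""
    m = len(marginals[0])
    uniform = [Fraction(1, m)] * m
    return sum(wasserstein_sq_discrete(mu, uniform) for mu in marginals)
```

For a path on three nodes with two labels, the signal whose marginals are all uniform has smoothness 0 and non-uniformity 0, whereas the signal whose marginals are all concentrated on the first label has smoothness 0 and non-uniformity 3/2. Both are perfectly smooth, and only the non-uniformity term separates them.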



To simplify the presentation, we assume that all models considered have the intermediate softmax step, though some implementations omit this part using the fact that the exponential function is increasing.



Figure 3: This figure illustrates in (a) how R-M is constructed upon the base model M, and in (b) an R-module that can be used at different places in GNN models.

Figure 4: Spectral plots of (normalized) signals of probability weights and ground truth labels.

Figure 5: Number of instances of output probability weights within a given range.

Figure 8: The regularization term.

APPENDIX: ADDITIONAL PLOTS FOR MODEL ANALYSIS

We supplement Fig. 4 by showing spectral plots of signals of probability weights $\phi(O)_{:,i}$, $2 \le i \le 7$, for the Cora dataset. The index $i$ corresponds to the $i$-th label class. From the plots, we observe that during the training of R-GCN, the high-frequency components indeed shrink for all the label classes. Compared with GCN, the last epoch of R-GCN has smaller high-frequency components for the 3rd and 5th label classes.

Figure 12: Spectral plots of (normalized) signals of probability weights for the 2nd label class.

Figure 13: Spectral plots of (normalized) signals of probability weights for the 3rd label class.

Figure 14: Spectral plots of (normalized) signals of probability weights for the 4th label class.

Figure 15: Spectral plots of (normalized) signals of probability weights for the 5th label class.

Figure 16: Spectral plots of (normalized) signals of probability weights for the 6th label class.

Figure 17: Spectral plots of (normalized) signals of probability weights for the 7th label class.

Transductive learning models. We also consider the interactive model GIL that combines both Euclidean and hyperbolic approaches (Zhu et al., 2020). The comparison results are shown in Table 2. Again, we see a general improvement by using the proposed regularization, which yields performance comparable with benchmarks.

Hyperbolic models. In contrast with transductive learning, inductive learning requires one to deal with unseen data outside the training set. For example, in the PPI dataset, different graphs correspond to different human tissues, and we test on two of the 24 graphs, which are unseen during training. Though the citation datasets Cora, Citeseer, and Pubmed are for transductive learning, we modify the datasets following Mishra et al. (2021), who use the induced subgraph of training nodes during training. During validation and testing, we use the induced subgraphs of validation nodes together with training nodes and of testing nodes together with training nodes, respectively. The nodes that are being predicted are unseen during training.
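The inductive split described above can be sketched as follows. The edge-list representation and the function names are assumptions made for illustration and are not taken from Mishra et al. (2021) or from our code.

```python
def induced_subgraph(edges, keep_nodes):
    """Return the edges whose two endpoints both lie in keep_nodes."""
    keep = set(keep_nodes)
    return [(u, v) for u, v in edges if u in keep and v in keep]


def inductive_splits(edges, train_nodes, val_nodes, test_nodes):
    """Build the three graphs used in the inductive setting:
    training nodes only, validation + training nodes, testing + training nodes."""
    train = set(train_nodes)
    return {
        "train": induced_subgraph(edges, train),
        "val": induced_subgraph(edges, train | set(val_nodes)),
        "test": induced_subgraph(edges, train | set(test_nodes)),
    }
```

With this construction, any edge touching a validation or testing node is absent from the training graph, so the predicted nodes are unseen during training.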

Inductive learning models

Ablation study

Choices of η

Link prediction. The best performer is highlighted in blue.

Graph classification. The best performer is highlighted in blue.


A APPENDIX: DATA STATISTICS

In Table 5, we provide statistics of datasets used in Section 4.

In this appendix, we provide the source code for the loss function $L_0$ and the implementation of the regularization method. In addition, we give details of the base models used. For the source code, we assume PyTorch and the Deep Graph Library (dgl) are used.

In Fig. 6, we first import modules and packages (though not all packages are used for the other code segments below):

Figure 6: Import modules and packages.

In the "my_loss" function (Fig. 7), we implement the loss $L_0$ (cf. Section 3.3). For the inputs of "my_loss", "g" is a dgl graph, and "x" corresponds to $O$ in Section 3.3.

Figure 7: "my_loss" function.

Suppose "loss1" (computed using "loss_fcn") is the loss of the base model; we compute "loss2" for $L_0$ (Fig. 8). They are combined as in (6), using the tunable coefficient "eta".
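Since the code figures are not reproduced in this text, the following is only a rough sketch of how a base loss and a regularization term might be combined. It is written in plain Python rather than PyTorch and dgl so that it is self-contained. Two things are assumptions and not the definitions used in the paper: the stand-in regularizer `smoothness_penalty` (a sum over edges of squared differences of softmax outputs) takes the place of $L_0$, and the additive form `loss1 + eta * loss2` is assumed for the combination in (6). All function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax of one row of logits."""
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def smoothness_penalty(edges, logits):
    """Stand-in for L_0 (an assumption, not the paper's definition):
    sum over edges of the squared distance between softmax outputs."""
    probs = [softmax(row) for row in logits]
    return sum(
        sum((a - b) ** 2 for a, b in zip(probs[i], probs[j]))
        for i, j in edges
    )


def total_loss(loss1, edges, logits, eta):
    """Combine the base loss with the regularizer.
    The additive form loss1 + eta * loss2 is assumed here."""
    loss2 = smoothness_penalty(edges, logits)
    return loss1 + eta * loss2
```

In an actual training loop, `loss1` would come from the base model's loss function and `logits` would be the model output $O$; setting `eta` to 0 recovers the base model.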

E APPENDIX: SIGNIFICANCE OF THE PERFORMANCE

In this appendix, we analyze the significance of the performance (in Section 4.1) of the proposed regularization method with the following setup. Suppose we have two models A and B with A having higher average accuracy. To determine if the difference is significant, we perform a hypothesis test with the null hypothesis that the difference in accuracy between models A and B has 0 mean. We compute the p-value. The smaller the p-value, the higher the confidence to reject the null hypothesis and to conclude that model A has a better performance. A p-value less than 0.05 is typically considered statistically significant.

In Section 4.1, we have 38 comparisons between regularized models and base models. We perform the tests described in the previous paragraph and record the p-values. The boxplot of the p-values is shown in Fig. 11. It indicates that in most cases, our proposed regularization significantly improves the respective base models. On the other hand, there are 11 comparisons between our approach and the best benchmarks. We again perform the hypothesis test for each comparison and record the p-value. The boxplot of the p-values is shown in Fig. 11. We see that in many cases, our method significantly outperforms the benchmarks.

2021) and supported by our evidence in Section 2.2 that Laplacian regularization may have its drawbacks, it has achieved a certain degree of success in earlier works (Zhu et al., 2003; Zhou et al., 2004; Ando & Zhang, 2007). These methods are based on the insight that neighboring nodes are likely to have the same labels. In the language of GSP, they intend to leverage the smoothness of the class label graph signal.
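Returning to the hypothesis test set up in this appendix, the following sketch shows one way to obtain a p-value from paired accuracy differences between models A and B. The text does not state which test statistic is used, so the sketch substitutes an exact one-sided sign-flip (paired permutation) test; this choice and the function name are assumptions made for illustration.

```python
from itertools import product


def sign_flip_p_value(differences):
    """One-sided exact sign-flip test for H0: paired differences have mean 0.

    Returns the fraction of sign assignments whose sum is at least the
    observed sum. Enumerates 2**n assignments, so n should be small.
    """
    diffs = list(differences)
    if not diffs:
        raise ValueError("need at least one paired difference")
    observed = sum(diffs)
    tol = 1e-12  # guards against floating-point ties
    hits = 0
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = sum(s * abs(d) for s, d in zip(signs, diffs))
        if flipped >= observed - tol:
            hits += 1
    return hits / 2 ** len(diffs)
```

With $n$ paired runs, the smallest attainable p-value of this test is $2^{-n}$, so at least five runs are needed before a result can fall below the 0.05 threshold.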

