LEARNING PARAMETRISED GRAPH SHIFT OPERATORS

Abstract

In many domains data is currently represented as graphs and therefore, the graph representation of this data becomes increasingly important in machine learning. Network data is, implicitly or explicitly, always represented using a graph shift operator (GSO) with the most common choices being the adjacency, Laplacian matrices and their normalisations. In this paper, a novel parametrised GSO (PGSO) is proposed, where specific parameter values result in the most commonly used GSOs and message-passing operators in graph neural network (GNN) frameworks. The PGSO is suggested as a replacement of the standard GSOs that are used in state-of-the-art GNN architectures and the optimisation of the PGSO parameters is seamlessly included in the model training. It is proved that the PGSO has real eigenvalues and a set of real eigenvectors independent of the parameter values and spectral bounds on the PGSO are derived. PGSO parameters are shown to adapt to the sparsity of the graph structure in a study on stochastic blockmodel networks, where they are found to automatically replicate the GSO regularisation found in the literature. On several real-world datasets the accuracy of state-of-theart GNN architectures is improved by the inclusion of the PGSO in both nodeand graph-classification tasks.

1. INTRODUCTION

Real-world data and applications often involve significant structural complexity and as a consequence graph representation learning attracts great research interest (Hamilton et al., 2017b; Wu et al., 2020) . The topology of the observations plays a central role when performing machine learning tasks on graph structured data. A variety of supervised, semi-supervised or unsupervised graph learning algorithms employ different forms of operators that encode the topology of these observations. The most commonly used operators are the adjacency matrix, the Laplacian matrix and their normalised variants. All of these matrices belong to a general set of linear operators, the Graph Shift Operators (GSOs) (Sandryhaila & Moura, 2013; Mateos et al., 2019) . Graph Neural Networks (GNNs), the main application domain in this paper, are representative cases of algorithms that use chosen GSOs to encode the graph structure, i.e., to encode neighbourhoods used in the aggregation operators. Several GNN models (Kipf & Welling, 2017; Hamilton et al., 2017a; Xu et al., 2019) choose different variants of normalised adjacency matrices as GSOs. Interestingly, in a variety of tasks and datasets, the incorporation of explicit structural information of neighbourhoods into the model is found to improve results (Pei et al., 2020; Zhang & Chen, 2018; You et al., 2019) , leading us to conclude that the chosen GSO is not entirely capturing the information of the data topology. In most of these approaches, the GSO is chosen without an analysis of the impact of this choice of representation. From this observation arise our two research questions. Question 1: Is there a single optimal representation to encode graph structures or is the optimal representation task-and data-dependent? On different tasks and datasets, the choice between the different representations encoded by the different graph shift operator matrices has shown to be a consequential decision. Due to the past successful approaches that use different GSOs for different tasks and datasets, it is natural to assume that there is no single optimal representation for all scenarios. Finding an optimal representation of network data could contribute positively to a range of learning tasks such as node and graph classification or community detection. Fundamental to this search is an answer to Question 1. In addition, we pose the following second research question. Question 2: Can we learn such an optimal representation to encode graph structure in a numerically stable and computationally efficient way? The utilisation of a GSO as a topology representation is currently a hand-engineered choice of normalised variants of the adjacency matrix. Thus, the learnable representation of node interactions is transferred into either convolutional filters (Kipf & Welling, 2017; Hamilton et al., 2017a) or attention weights (Veličković et al., 2018) , keeping the used GSO constant. In this work, we suggest a parametrisation of the GSO. Specific parameter values in our proposed parametrised (and differentiable) GSO result in the most commonly used GSOs, namely the adjacency, unnormalised Laplacian and both normalised Laplacian matrices, and GNN aggregation functions, e.g., the averaging and summation message passing operations. The beauty of this innovation is that it can be seamlessly included in both message passing and convolutional GNN architectures. Optimising the operator parameters will allow us to find answers to our two research questions. The remainder of this paper is organised as follows. In Section 2, we give an overview of related work in the literature. Then in Section 3, we define our parametrised graph shift operator (PGSO) and discuss how it can be incorporated into many state-of-the-art GNN architectures. This is followed by a spectral analysis of our PGSO in Section 4, where we observe good numerical stability in practice. In Section 5, we analyse the performance of GNN architectures augmented by the PGSO in a node classification task on a set of stochastic blockmodel graphs with varying sparsity and on learning tasks performed on several real-world datasets.

2. RELATED WORK

GSOs emerge in different research fields such as in physics, network science, computer science and mathematics, taking usually the form of either graph Laplacian normalisations or variants of the adjacency matrix. In an abundant number of machine learning applications the expressivity of GSOs is exploited, e.g., in unsupervised learning (von Luxburg, 2007; Kim et al., 2008) , semisupervised node classification on graph-structured data (Kipf & Welling, 2017; Schlichtkrull et al., 2018) and supervised learning on computer vision tasks (Chang & Yeung, 2006) . The majority of these works assumes a specified normalised version of the Laplacian that encodes the structural information of the problem and usually these versions differ depending on the analysed dataset and the end-user task. Recently, new findings on the impact of the chosen Laplacian representation have emerged that highlight the contribution of Laplacian regularisation (Dall'Amico et al., 2020; Saade et al., 2014; Dall'Amico et al., 2019) . The different GSO choices in different tasks indicate a data-dependent relation between the structure of the data and its optimal GSO representation. This observation motivates us to investigate how beneficial a well-chosen GSO can be for a learning task on structured data. GNNs use a variety of GSOs to encode neighbourhood topologies, either normalisations of the adjacency matrix (Xu et al., 2019; Hamilton et al., 2017a) or normalisations of the graph Laplacian (Kipf & Welling, 2017; Wu et al., 2019) . Due to the efficiency and the predictive performance of GNNs, a research interest has recently emerged in their expressive power. One of the examined aspects is that of the equivalence of the GNNs' expressive power with that of the Weisfeiler-Lehman graph isomorphism test (Dasoulas et al., 2020; Maron et al., 2019; Morris et al., 2019; Xu et al., 2019) . Another research direction is that of analysing the depth and the width of GNNs, moving one step forward to the design of deep GNNs (Loukas, 2020; Li et al., 2018; Liu et al., 2020; Alon & Yahav, 2020) . In this analysis, the authors study phenomena of Laplacian oversmoothing and combinatorial oversquashing, that harm the expressiveness of GNNs. In most of these approaches, however, the used GSO is fixed without a motivation of the choice. We hope that the parametrised GSO that is presented in this work can contribute positively to the expressivity analysis of GNNs. We will now delineate our approach and that of a closely related work by Klicpera et al. (2019) . Klicpera et al. (2019) demonstrate that varying the choice of the GSO in the message passing step of GNNs can lead to significant performance gains. In Klicpera et al. (2019) two fixed diffusion operators with a much larger receptive field than the 1-hop neighbourhood convolutions, are inserted into the architectures, leading to a significant improvement of the GNNs' performance. In our work here we replace the GSOs in GNN frameworks with the PGSO, which has a receptive field equal to the 1-hop neighbourhood of the nodes. We find that parameter values of our PGSO can be trained in a numerically stable fashion, which allows us to chose a parametric form unifying the most common GSOs and aggregation functions. As with standard GNN architectures the receptive field of the convolutions is increased in our architectures by stacking additional layers. Klicpera et al. (2019) increase the size of the receptive field and keep the neighbourhood representation fixed, while we keep the size of the receptive field fixed and learn the neighbourhood representation.

3. PARAMETRISED GRAPH SHIFT OPERATORS

We define notation and fundamental concepts in Section 3.1 and introduce our proposed parametrised graph shift operator γ(A, S) in Section 3.2. In Section 3.3, we provide a detailed discussion of how γ(A, S) can be applied to the message-passing operation in GNNs and we demonstrate use cases of the incorporation of γ(A, S) in different GNN architectures.

3.1. PRELIMINARIES

Let a graph G be a tuple, G = (V, E), where V and E are the sets of nodes and edges and let |V | = n. We assume the graph G to be attributed with attribute matrix X ∈ R n×d , where the i th row of X contains the d-dimensional attribute vector corresponding to node v i . We denote the n × n identity matrix by I n and the n-dimensional column vector of all ones by 1 n . Given the node and edge sets, one can define the adjacency matrix, denoted A ∈ [0, 1] n×n , where A ij = 0 if and only if (i, j) ∈ E, and the degree matrix of A as D = Diag(A1 n ). Recently the notion of a GSO has been defined as a general family of operators which enable the propagation of signals over graph structures (Sandryhaila & Moura, 2013; Shuman et al., 2013) . Definition 1. Graph Shift Operator A matrix S ∈ R n×n is called a Graph Shift Operator (GSO) if it satisfies S ij = 0 for i = j and (i, j) / ∈ E (Mateos et al., 2019; Gama et al., 2020) . This general definition includes the adjacency and Laplacian matrices as instances of its class. Remark 1. According to Definition 1, the existence of an edge (i, j) ∈ E does not imply a nonzero entry in the GSO, S ij = 0. Hence, the correspondence between a GSO and a graph is not bijective in general.

3.2. PARAMETRISED GSO

We begin by defining our parametrised graph shift operator. Definition 2. We define the parametrised graph shift operator (PGSO), denoted by γ(A, S) , as γ(A, S) = m 1 D e1 a + m 2 D e2 a A a D e3 a + m 3 I n , where A a = A + aI n and D a = Diag(A a 1 n ) is the degree matrix of A a . We denote the parameter tuple corresponding the γ(A, S) by S = (m 1 , m 2 , m 3 , e 1 , e 2 , e 3 , a) consisting of scalar multiplicative parameters m 1 , m 2 , m 3 , scalar exponential parameters e 1 , e 2 , e 3 and an additive parameter a. The main motivation of the parametrised form in Equation ( 1) is to span the space of commonly used GSOs and indeed we are able to generate a wide range of different graph shift operators by choosing different values for the parameter set S. In Table 1 , we give examples of parameter values in γ(A, S) which result in the most commonly used GSOs and message-passing operators in GNNs. Unlike the GSO, the PGSO uniquely identifies the graph it corresponds to, i.e., the GSO and PGSO do not share the property discussed in Remark 1. Although we base the definition of γ(A, S) on the adjacency matrix, we can define the PGSO using other graph representation matrices, such as the non-backtracking operator B, γ(B, S), (Krzakala et al., 2013; Bordenave et al., 2015) or the diffusion matrix S, γ(S, S) (Klicpera et al., 2019) . Table 1 : Known Graph Shift Operators as parameter choices S in γ(A, S). S=(m 1 , m 2 , m 3 , e 1 , e 2 , e 3 , a) Operator Description (0, 1, 0, 0, 0, 0, 0) A Adjacency matrix and Summation Aggregation Operator of GNNs (1, 0, 1, 0, 0, 0 Simić, 2010)  (0, 1, 0, 0, 0  ) D -A Unnormalised Laplacian matrix L (1, 1, 0, 1, 0, 0, 0) D + A Signless Laplacian matrix Q (Cvetković & ) I n -D -1 A Random-walk Normalised Laplacian L rw (0, -1, 1, 0, -1 2 , -1 2 , 0) I n -D -1 2 AD -1 2 Symmetric Normalised Laplacian L sym (0, 1, 0, 0, -1 2 , -1 2 , 1) D -1 2 1 A 1 D -1 2 1 Normalised Adjacency matrix of GCNs (Kipf & Welling, 2017 ) (0, 1, 0, 0, -1, 0, 0) D -1 A Mean Aggregation Operator of GNNs (Xu et al., 2019) 3.3 SUGGESTED METHOD: GNN-PGSO AND GNN-mPGSO Next, we formally discuss how γ(A, S) is incorporated in GNN models. Let a GNN model be denoted by M(φ(A), X), taking as input a non-parametrised function of the adjacency matrix φ(A) : [0, 1] n×n → R n×n and an attribute matrix (in case of an attributed graph) X ∈ R n×d . Further, let K denote the number of aggregation layers that M consists of. The Parametrised Graph Shift Operator (PGSO) formulation transforms the GNN model M(φ(A), X) into the GNN-PGSO model M (γ(A, S), X). Moreover, we define the GNN-mPGSO model M (γ [K] (A, S [K] ), X), where γ [K] (A, S [K] ) = [γ(A, S 1 ), ..., γ(A, S K )], i.e., we assign each GNN layer a different parameter tuple S l for l ∈ {1, . . . , K}. Message-passing steps and convolutions In a spectral-based GNN (Wu et al., 2020) , where the GSO is explicitly multiplied by the model parameters, it is straightforward to replace the GSO with γ(A, S). However, with some further analysis γ(A, S) can also be incorporated in spatial-based GNNs, where the node update equation is defined as a message-passing step. Here we illustrate the required analysis on the sum-based aggregation operator, where we sum the feature vectors h j ∈ R d in the neighborhood of a given node v i , denoted N (v i ). The sum operator of the neighborhood representations can be reexpressed as: j:vj ∈N (vi) h j = n j=1 A ij h j . Using this observation we can derive the application of γ(A, S) in a message-passing step to be, (γ(A, S)h) i = m 1 (D a ) e1 i h i + m 2 n j=1 (D a ) e2 i (A a ) ij (D a ) e3 j h j + m 3 h i . (2)

Examples

The following examples highlight the usage of the γ(A, S) operator: 1. In the standard GCN (Kipf & Welling, 2017) the propagation rule of the node representation H (l) ∈ R n×d in a computation layer l is, H (l+1) = σ D -1 2 1 A 1 D -1 2 1 H (l) W (l) , where W (l) is the trainable weight matrix and σ denotes a non-linear activation function. The GCN-PGSO and GCN-mPGSO models, respectively, perform the following propagation rules, H (l+1) = σ γ(A, S)H (l) W (l) and H (l+1) = σ γ(A, S l )H (l) W (l) . 2. The Graph Isomorphism Network (GIN) (Xu et al., 2019) consists of the following propagation rule for a node representation h (l) i ∈ R d of node v i in the computation layer l, h (l+1) i = σ h (l) i W (l) + j:vj ∈N (vi) h (l) j W (l) . Using the Equation (2) the propagation rule is transformed into the GIN-PGSO formulation as, h (l+1) i = σ m 1 (D a ) e1 i + m 3 h (l) i W (l) + m 2 j:vj ∈N (vi) ij h (l) j W (l) , where uv are edge weights defined as ij = (D a ) e2 i (D a ) e3 j . Analogously, we can construct the GIN-mPGSO formulation by superscripting every parameter in S by (l). Computational Cost Since in (1) the exponential parameters are applied only to diagonal matrices the PGSO and mPGSO are efficiently computable and optimisable. γ(A, S) can be extended by using vector instead of scalar parameters. Although this extension leads to better expressivity, the computational cost is increased, as the number of parameters then depends on the size of the graph.

4. SPECTRAL ANALYSIS OF γ(A, S)

In this section, we study spectral properties of γ(A, S) in theoretical analysis in Section 4.1 and through empirical observation in Section 4.2. The obtained theoretical results provide a foundation for further analysis of methodology involving the PGSO and allow an efficient observation of spectral support bounds of commonly used GSOs, that are instances of γ(A, S).

4.1. THEORETICAL ANALYSIS

Here we investigate the spectral properties of our γ(A, S). Throughout this section we assume that we work on undirected graphs. In Theorem 1 we show that γ(A, S) has a real spectrum and eigenvectors independent of the parameters S. In Theorem 2 we study the spectral support of γ(A, S). Theorem 1. γ(A, S) has real eigenvalues and a set of real eigenvectors. The proof of Theorem 1 follows directly from noticing that γ(A, S), which is not symmetric in general, is similar to a symmetric matrix and therefore shares eigenvalues with this symmetric matrix (Horn & Johnson, 1985, pp. 45,60) ; the full proof can be found in Appendix A.1. Being able to guarantee real eigenvalues and eigenvectors for all parameter values of γ(A, S) enables practitioners to deploy our PGSO in spectral network analysis without having to replace elements of the algorithms which assume a real spectrum. As a result of Theorem 1 eigenvalue computations can be stabilised by working with the symmetric, similar PGSO used in the proof of Theorem 1. Especially the theoretical analysis of the PGSO and algorithms involving the PGSO is aided by Theorem 1. There are several publications, where the complications and lack of results for complex valued graph spectra are discussed in the case of directed graphs (Guo & Mohar, 2017; Brualdi, 2010; Li et al., 2015) . To remain within the real domain independent of the parameter choice (for undirected graphs) enables analysts to access a wide variety of spectral results, which only hold for real symmetric matrices. An example of such a theorem is Cauchy's interlacing theorem (Bernstein, 2009, p. 709) , which can be used to relate the adjacency matrix spectra of a graph and its subgraphs. Hall et al. (2009) and Porto & Allem (2017) have been able to prove interlacing theorems for unnormalised, signless and normalised Laplacians. The fact that the PGSO has a real spectrum presents a first step to extend their work to potentially apply to the PGSO independent of the parameter values. However, this is only one example of a powerful theoretical result, which relies on a real spectrum and is accessible to our PGSO formulation due to the result shown in Theorem 1. Now that the spectrum of γ(A, S) has been shown to be real we study the spectral support of γ(A, S). A common way to prove bounds on the spectral support of a matrix is via direct application of the Gershgorin Theorem (Bernstein, 2009, p. 293)  as done in Theorem 2. Theorem 2. Let C i = m 1 (d i + a) e1 + m 2 (d i + a) e2+e3 a + m 3 and R i = |m 2 |(d i + a) e2+e3 d i , where d i denotes the degree of node v i . Furthermore, we denote eigenvalues of γ(A, S) by λ 1 ≤ λ 2 ≤ . . . ≤ λ n . Then, for all j ∈ {1, . . . , n}, λ j ∈ min i∈{1,...,n} (C i -R i ) , max i∈{1,...,n} (C i + R i ) . The proof of Theorem 2 is shown in Appendix A.2. Examples For the parametrisation of γ(A, S) corresponding to the adjacency matrix, we obtain C i = 0 and R i = d i . R i is clearly maximised by the maximum degree and therefore, from (3) the spectral support of A is equal to [-d max , d max ], as required. Similarly, the spectral supports of L, L sym and L rw , can be deduced by plugging in the corresponding parameters into the result in (3). For the operator used in the message passing of the GCN (Kipf & Welling, 2017) a Gershgorin bound such as the one in Theorem 2 has not been calculated yet as far as we are aware. We obtain C i = 1/(d i + 1) and R i = d i /(d i + 1) . Therefore, from (3) the spectral support of the Kipf and Welling operator is restricted to lie within [-(d max -1)/(d max + 1), 1], the lower bound of this interval tends to -1 as d max → ∞. So only in the limit is their spectral support symmetric around 0. The bounds proven in Theorem 2 allow the observation of the spectral bounds for the many GSOs which can be obtained via specific parameter choices made in our PGSO formulation. In the context of GNNs such bounds are of value since they allow statements about numerical stability and vanishing/exploding gradients to be made. For example in Kipf & Welling (2017) the observation that the spectrum of the symmetric normalised Laplacian is contained in the interval [0, 2] motivated them to make use of the "renormalisation trick" to stabilise their computations. Since the PGSO is learned its spectral support varies throughout training. The bounds in Theorem 2 enable us to monitor bounds on the spectral support in a numerically efficient manner, avoiding the computation of eigenvalues at each iteration, as is showcased in the empirical observation in Section 4.2. Since the majority of the commonly used graph shift operators correspond to specific parametrisations of γ(A, S) the spectral properties of γ(A, S) are general themselves, in the sense that we cannot establish spectral results for γ(A, S), which have already been shown to not be present for one of its parametrisations. This generality and lack of specific spectral features, which hold for all parametrisations is precisely one of the strengths of utilising γ(A, S), since it allows the graph shift operator to manifest different spectral properties in different tasks, when they are of benefit.

4.2. EMPIRICAL OBSERVATION

In this section, we observe bounds on the spectral support, proven in Theorem 2 (Figure 1(a) ) and optimal parameters (Figure 1 (b)) of the PGSO incorporated in the GCN during training on a nodeclassification task on the Cora dataset. More details on the task will be provided in Section 5.3. Surprisingly, the spectral support of the PGSO remains centered at 0 throughout training in Figure 1 (a) without this being enforced by the design of the PGSO. Recall that the centre of the spectral support intervals has the following parametric form C i = m 1 (d i +a) e1 +m 2 (d i +a) e2+e3 a+m 3 . It is nice to observe that this desirable property of the "renormalised" operator used by Kipf & Welling (2017) is preserved throughout training. We also observe that the spectral bounds smoothly increase throughout training. The increasing support is a direct result of the learned PGSO parameters. As we expected from our analysis of Figure 1 

5. EXPERIMENTS

We evaluate the performance of the PGSO in a simulation study and on 8 real-world datasets. In Section 5.1, we show the ability of γ(A, S) to adapt to varying sparsity scenarios through a stochastic blockmodel study, in Section 5.2 we study the sensitivity of the GNN-PGSO's performance to different γ(A, S) initialisations and in Section 5.3 we evaluate the contribution of PGSO and mPGSO to the GNN performance in real-world datasets.

5.1. SPARSITY INTERPRETATION OF γ(A, S)

The choice of a graph shift operator depends on the structural information of a dataset and on the end task. For example, in graph classification tasks the datasets usually contain small and sparse graphs (Wu et al., 2018) , while in node classification tasks the graphs are larger and denser. In this section, we highlight the ability of γ(A, S) to adapt to different sparsity regimes.

Stochastic Blockmodels

In order to simulate graphs with varying sparsity levels, we utilise the parametrisation of stochastic blockmodel generation. Stochastic blockmodels (SBMs) are wellstudied generative models for random graphs, that tend to contain community, i.e., block, structure (Holland et al., 1983; Karrer & Newman, 2011) . Let k be the number of communities, {C 1 , ..., C k } be the k disjoint communities and p ij be the probability of an edge occuring between a node u ∈ C i and a node v ∈ C j . SBMs offer a flexible tool for random graph generation with specific properties, such as the detectability of the community structure (Decelle et al., 2011) and the sparsity level of the graph by choosing appropriate edge probabilities p ij . Here, we focus on a restricted stochastic blockmodel parametrisation where the probabilities of an edge occurring between nodes within the same community are all equal to p, i.e., p ii = p ∀i ∈ {1, ..., k}, and the probabilities of an edge between nodes in different communities are all equal to q, i.e., p ij = q ∀i = j, i, j ∈ {1, ..., k}. Dataset and experimentation setup In this experiment, we consider 15 p, q parameter combinations (p, q) ∈ {(0.5, 0.25), (0.48, 0.24), ..., (0.22, 0.11)}, where by decreasing the parameters p and q we increase the sparsity of the sampled networks. For each parameter set we sample 25 graphs with 3 communities containing 200 nodes each. Figure 5 in Appendix B.1 contains plots of adjacency matrices sampled from these 15 parameter combinations. The learning task performed by a GCN will be to assign community membership labels to each node. For all parameter choices the ratio p/q, i.e., the community detectability, is equal to 2, so that the theoretical difficulty the task remains constant across the different sparsity levels. As in Dwivedi et al. ( 2020), all of the nodes have uninformative attributes, except for one node per community that carries its community membership as its node attribute. We set the train/validation/test splits equal to 80%, 10%, 10% of nodes, respectively, for each community. We use a 3-layer Graph Convolutional Network (Kipf & Welling, 2017) with hidden size 64 and with our PGSO incorporated as the message passing operator and initialised to the GCN configuration. We, then, observe the learned parameter values of γ(A, S) after 200 epochs. Parameter Values Figure 2 shows the average optimal parameter values of γ(A, S) found for each one of the 15 node classification datasets. We observe that all parameters remain close to constant as the sparsity of the SBM samples increases, except for the additive parameter a, which can be observed to clearly increase with increasing sparsity levels. The parameter a in the PGSO parametrisation plays a very similar role to the regularisation parameter of the normalised adjacency matrix, which is observed to significantly improve the performance of the spectral clustering algorithm in the task of community detection in the challenging sparse case in Dall'Amico et al. and Qin & Rohe (2013). It is very nice to see that the PGSO automatically varies this regularisation parameter, replicating the beneficial regularisation observed in the literature, without this behaviour being incentivised in any way by the model design or the parametrisation. In Figure 2 we can furthermore observe that the e 2 and e 3 values are closely aligned for all sparsity levels indicating that a symmetric normalisation of the adjacency matrix seems to be beneficial in this task.

5.2. SENSITIVITY ANALYSIS OF γ(A, S) TO DIFFERENT INITIALISATIONS

In Section 5.1, we initialised a 3-layer GCN-PGSO model with the original GCN parameter configuration, as shown in Table 1 . In this section, we observe how sensitive the optimal parameters of γ(A, S) and the final model accuracy are to different GSO initialisations. Experimentation setup: We use the GCN-PGSO model, described in Section 3.3, for the node classification task on the Cora dataset (McCallum et al., 2000) . We consider 5 different initialisations of γ(A, S) that correspond to 1) the GCN operator D -1/2 1 A 1 D -1/2 1 , 2) the adjacency matrix A, 3) the random-walk normalised Laplacian L rw = I -D -1 A, 4) the symmetric normalised Laplacian L sym = I -D -1/2 AD -1/2 and 5) a naive all-zeros initialisation, where all γ(A, S) parameters are set to 0. In Figure 3 , we observe the parameter evolution of γ(A, S) over 150 epochs for the different initialisations. The parameter values obtained from the GCN, adjacency matrix and all-zeros initialisations exhibit a great amount of similarity; while the normalised Laplacian initilisations lead to similar optimal values for five of the seven parameters. For all initialisations we observe that parameters m 1 , m 3 , e 1 , e 2 , e 3 monotonically increase until they converge. Parameters m 2 and a initially increase for the GCN, adjacency and all-zeros initialisations, while they initially decrease for the two normalised Laplacians. The achieved accuracy of the five initialisations is plotted in Appendix B.4, where it can be observed that the resulting accuracy from the two normalised Laplacian initialisations is slightly lower than the one achieved by the remaining three initialisations. Overall, the accuracy is not very sensitive to the different initialisations.

5.3. REAL-WORLD SCENARIOS

In this section, we evaluate the contribution of the parametrised GSO, when we apply it to a variety of graph learning tasks. In order to highlight the flexibility of γ(A, S), we perform both node classification and graph classification tasks. Datasets: For node classification, we have used the well-examined datasets Cora and Cite-Seer (McCallum et al., 2000; Giles et al., 1998) and ogbn-arxiv, that is a citation network from a recently popular collection of graph benchmarks, the Open Graph Benchmark (Hu et al., 2020) . For graph classification, we have used the extensively-studied TU datasets MUTAG, PTC-MR, IMDB-BINARY and IMDB-MULTI (Kersting et al., 2016) and OGBG-MOLHIV dataset from OGB (Hu et al., 2020) . Details and statistics of the datasets can be found in Appendix B.2.1. Experimentation Setup: For the node classification datasets Cora and CiteSeer, we performed cross validation with the train/validation/test splits being the same as in Kipf & Welling (2017) , while for Ogbn-arxiv, we used the same splitting method used in Hu et al. (2020) , according to the publication dates. For the TU datasets, we performed 10-fold cross validation with grid search hyper-parameter optimisation and for OGBG-MOLHIV, following Hu et al. (2020) we used a scaffold splitting approach and measured the validation ROC-AUC. We compared the contribution of PGSO and mPGSO on 4 standard GNN baselines: 1) GCN: Graph Convolutional Network (Kipf & Welling, 2017) , 2) GAT: Graph Attention Network (Veličković et al., 2018) , 3) SGC: Simplified Graph Convolution (Wu et al., 2019) and 4) GIN: Graph Isomorphism Network (Xu et al., 2019) . A full description of the experimentation details for each task can be found in Appendix B.2.2. In Figure 4 , we show the contribution of the PGSO and the mPGSO methods, when applied to standard GNNs. For all datasets and GNN architectures, the inclusion of the PGSO or the mPGSO improves the model performance. In the graph classification tasks the performance improvement is higher than the node classification tasks. Specifically, on MUTAG, PTC-MR and IMDB-BINARY, we observe a significant improvement of the classification accuracy for all GNNs. In the comparison between PGSO and mPGSO we do not find a clear winner. For this reason, we prefer the PGSO variant, as it is more efficient and never harms the model's performance. In Appendix B.5 we study the convergence of optimal accuracy and loss values of the GNN-PGSO model in a nodeclassification and a graph classifcation dataset. In both experiments we observe the PGSO model to converge slightly faster and to better accuracy and loss values than its conventional counterpart.

6. CONCLUSION

In this work, a parametrised graph shift operator (PGSO) is proposed, that encodes graph structures and can be included in any graph representation learning scenario. Focusing on graph neural networks (GNNs), we demonstrate that the PGSO can be integrated in the GNN model training. We proved that the PGSO has real eigenvalues and a set of real eigenvectors and derived its spectral bounds, which when observed in practice show that our learned PGSO leads to numerical stable computations. A study on stochastic blockmodel graphs demonstrated the ability of the PGSO to automatically adapt to networks with varying sparsity, independently confirming the positive impact of GSO regularisation which was found in the literature. Experiments on 8 real-world datasets, where both node and graph classification was performed, demonstrate that the accuracy of a representative sample of the current state-of-the-art GNNs can be improved by the incorporation of our PGSO. In answer to our two research questions posed in Section 1, our experimental results have shown that the optimal representation of graph structures is task and data dependent. We have furthermore found that PGSO parameters can be incorporated in the training of GNNs and lead to numerically stable learning and message passing operators. 

A.2 OF 2

In this proof will be utilising the property similar share eigenvalues by considering the matrix D e3 a γ(A, S)D -e3 a , which is similar to γ(A, S) . The Gershgorin Theorem states that all eigenvalues of a matrix are contained in circles centred at the matrix's diagonal elements with radii equal to the corresponding off-diagonal row-sums of the matrix elements in absolute value (Bernstein, 2009, p. 293) . In Theorem 1 we showed that γ(A, S) has a real spectrum and therefore the circles in the Gershgorin Theorem are in fact intervals on the real line in the case of γ(A, S) . The parametrised form of D e3 a γ(A, S)D -e3 a is as follows, D e3 a γ(A, S)D -e3 a = m 1 D e1 a + m 2 D e2+e3 a A a + m 3 I n . From (4) we can simply read off that the diagonal elements, denoted C i , and off-diagonal row-sums of the matrix elements in absolute value, denoted R i , of D e3 a γ(A, S)D -e3 a take the following form, R i = m 1 (d i + a) e1 + m 2 (d i + a) e2+e3 a + m 3 , C i = |m 2 |(d i + a) e2+e3 d i . Hence, via the similarity relationship of D e3 a γ(A, S)D -e3 a and γ(A, S) and a direct application of the Gershgorin Theorem applied to D e3 a γ(A, S)D -e3 a we obtain the required result.

B EXPERIMENTS B.1 STOCHASTIC BLOCK MODELS

In this section, we present a visualisation of graphs generated from stochastic blockmodels (SBMs) with varying sparsity. Specifically, in Figure 5 we display 15 adjacency matrices of graphs generated from SBMs with edge probabilities (p, q) = {(0.50, 0.25), ..., (0.22, 0.11)}. The format of this adjacency matrix visualisation is taken from Dwivedi et al. (2020) . The ordering of the node labels in Figure 5 corresponds to their block membership in order to highlight the block structure in the adjacency matrix plots.

B.2 SETUP REAL-WORLD DATASETS

In section, we the dataset and and experimentation details for the experiments 5.3.

DATASETS

The datasets used in the setup both node and graph classification tasks. • Cora and CiteSeer are citation networks (Kipf & Welling, 2017) , where nodes correspond to documents and edges encode citation links. Both datasets contain node attributes, that are sparse bag-of-words representations of the documents. • Ogbn-arxiv is a citation network with directed edges, where each node corresponds to an arXiv paper and the edges denote citations from one paper to another (Hu et al., 2020) . The dataset contains node attributes, that are averaged word embeddings of the titles and the abstracts of dimensionality 128. The label of each node is the subject area of the paper and can take 40 values. • MUTAG and PTC-MR are a collection of bio-informatics networks (Kersting et al., 2016) , where the label of each graph correspond to a chemical property of the compound. • IMDB-BINARY and IMDB-MULTI are datasets containing collaborations between actors/actresses, where each graph is an ego-graph an actor/actress and edges occur when two actors/actresses are playing in the same movie (Kersting et al., 2016) . These datasets do not contain node attributes, thus following the literature, we create node features by using one-hot encodings of the node degrees. • OGBG-MOLHIV is a collection of graphs, that represent molecules (Wu et al., 2018) . Nodes are atoms and the edges correspond to chemical bonds between atoms. The graphs contain node features, that are processed as in Hu et al. (2020) . In Table 2 the dataset statistics of both node and graph classification tasks are displayed. In this section, we initially report the experimental setting that is common to all experiments and, then, separately provide the details which are specific to the different tasks (node classification and graph classification). In our experiments, we use the Adam optimizer (Kingma & Ba, 2015) with a weight decay on the parameters of 5 * 10 -4 and an initial learning rate of 0.005 for the exponential parameters and an initial learning rate of 0.01 for all other model parameters. The differentiation between the two learning rates is crucial for the model training, as the fluctuation of the exponential parameters has a higher impact on the model behavior than the rest of the model parameters. We, also, used a learning rate scheduler that decayed both the learning rates by 0.5 every 50 epochs.



Figure 1: (a) bounds on the spectral support and (b) parameter values of γ(A, S) plotted against the training epochs of a GCN-PGSO applied to a node classification task on the Cora graph. Note that the optimal values of e 2 and e 3 lie close together and hence appear as a single line in (b).

(a), we observe the parameters of the PGSO to be smoothly varying throughout training in Figure 1(b). The parameters in Figure 1(b) can be seen to have been initialised at the values corresponding to the chosen GSO in the GCN and from there they smoothly vary towards new optimal values within the first 40 training epochs, which are then stable throughout the remainder of the training, ruling out exploding gradients. It is nice to note that in Section 5 Figure 4(b) we observe the accuracy of the GCN using the trained PGSO parameters, displayed in Figure 1(b), to slightly outperform the standard GCN.

Figure 2: Mean and standard deviation of the PGSO parameters on SBM

Figure 3: Parameter evolution of γ(A, S) for 150 epochs, when applied on GCN-PGSO model for the node classification task on Cora.

Figure 4: Classification accuracy results for both node and graph classification tasks (Validation ROC-AUC for OGBG-MOLHIV). Lower case letters denote a node classification task, while capital letters a graph classification task.

Figure 5: List adjacency generated stochastic for varying by decreasing probability tuple (p, q).

Datasets statistics

APPENDIX A PROOFS

A.1 PROOF OF THEOREM 1In this appendix we prove that γ(A, S) has real eigenvalues and a set of real eigenvectors. For this proof the definition of similar matrices is fundamental. Definition 3. Two matrices Φ, Ψ ∈ C n×n are similar via a nonsingular similarity matrix S ∈ C n×n if Ψ = S -1 ΦS.Next we note that γ(A, S) is similar to the symmetric matrix D -(e2-e3)/2 a γ(A, S)D (e2-e3)/2 a .Real, symmetric matrices have a set of real eigenvalues and furthermore, possess a set of real eigenvectors. It is a known property of similar matrices that they share eigenvalues and that for a given eigenvector w of Φ, Sw is an eigenvector of Ψ (Horn & Johnson, 1985, pp. 45,60) . Therefore, by the similarity relationship to a real symmetric matrix via a real similarity matrix, γ(A, S) has real eigenvalues and a set of real eigenvectors.Node Classification Tasks report the details in the classification regarding model selection, metrics and design. Cora and CiteSeer datasets formulated binary class classification problems, the is formulated as a 40class classification problem.• Model Selection: We perform cross-validation for all GNN, GNN-PGSO and GNN-mPGSO models with predefined dataset splits. For Cora and CiteSeer, we use the same train/validation/test splits as in Kipf & Welling (2017) for the sake of comparison with other methods that use the same splits. For Ogbn-arxiv, we use the splitting method, described in Hu et al. (2020) , that splits the data into the train split, containing all papers until 2017, the validation split, containing all papers published in 2018 and test split with the published papers since 2019. • Hyper-parameter Tuning: We use grid search to tune the hyper-parameters of the models (dimensionality of hidden units, number of GNN layers, number of epochs and the batch size). • Evaluation Metric: For all node classification datasets (Cora, CiteSeer, Ogbn-arxiv), we use as evaluation metric the validation accuracy. • Model Design: The number of hidden units that was grid-searched was within {16, 32, 64}.The number of GNN layers that were used were within {1, 2, 3, 4}. • Experiment Design: The number of epochs for the model training was {50, 100, 200}, batch size within {16, 32, 64} and a dropout ratio of 0.5.

Graph Classification Tasks

In the same fashion we report below the details of the experimentation setup of the graph classification tasks. The MUTAG, PTC-MR, IMDB-BINARY and OGBG-MOLHIV are formulated as binary class classification problems, while IMDB-MULTI is formulated as a 3-class classification problem.• Model Selection: We perform 10-fold cross-validation for all GNN,GNN-PGSO, GNN-mPGSO models. For MUTAG, PTC-MR, IMDB-BINARY and IMDB-MULTI, we have used k random and stratified splits to provide balanced train/validation/test sets. For OGBG-MOLHIV we use the scaffold splitting as in Hu et al. (2020) , that seperates the graphs based on their 2D structural representations. • Hyper-parameter Tuning: We use grid search to tune the hyper-parameters of the models (dimensionality of hidden units, number of GNN layers, number of epochs and the batch size). • Evaluation Metric: For the MUTAG, PTC-MR, IMDB-BINARY, IMDB-MULTI datasets, we use as evaluation metric the standard classification accuracy, while for the OGBG-MOLHIV we use as Hu et al. (2020) the validation ROC-AUC. • Model Design: The number of hidden units that was grid-searched was within {16, 32, 64}.The number of GNN layers that were used were {1, 2, 3, 4, 5}. • Experiment Design: The number of epochs for the model training was {100, 200, 300}, the batch size within {16, 32, 64} and the dropout ratio of 0.5.

B.3 TABLES

Next, we present Tables 3 and 4 containing the detailed experimentation results. In Table 3 , the average accuracies with their corresponding standard deviations for the node classification tasks are presented, in particular for the datasets Cora, CiteSeer and Ogbn-arxiv. In Table 4 , the results for the graph classification tasks are presented, specifically MUTAG, PTC-MR, IMDB-BINARY, IMDB-MULTI and OGBG-MOLHIV. We note that for the first 4 graph classification datasets we compute the classification accuracy, while for OGBG-MOLHIV we use the validation ROC-AUC, following Hu et al. (2020) .

B.4 MODEL PERFORMANCE FOR DIFFERENT INITIALISATIONS

In Figure 6 , we show the train and validation accuracy achieved by the PGSO model discussed in Section 5.2 using the 5 different initialisation configurations: GCN operator, adjacency matrix, all-zeros, symmetric normalised Laplacian and random-walk Normalised Laplacian. For all experiments, we used the same configuration of the hyper-parameters for a fair comparison.

B.5 CONVERGENCE STUDY

In this appendix, we empirically analyse the contribution of the PGSO in a node classification and a graph classification task regarding the accuracy and loss convergence. For the node classification task, we have used the Cora dataset and a GCN model. Both for the standard GCN and the GCN-PGSO models, we used the same experimentation configuration for a fair For graph classification task, have used PTC-MR and GIN model. the standard GIN and the GIN-PGSO models, we used the same configuration for a fair comparison. In Figures 7 and 8 , we show the accuracy and loss convergence using the standard model and the PGSO model. As we can see, for both the node classification and graph classification task, the PGSO incorporation into the model has a positive impact on the achieved accuracy and the loss minimization throughout training. Specifically, we observe a slightly faster convergence of accuracy and loss in better values using the GCN-PGSO and GIN-PGSO models. 

