LOW-RANK GRAPH NEURAL NETWORKS INSPIRED BY THE WEAK-BALANCE THEORY IN SOCIAL NETWORKS Anonymous

Abstract

Graph Neural Networks (GNNs) have achieved state-of-the-art performance on node classification tasks by exploiting both graph structure and node features. Most existing GNNs rely on the implicit homophily assumption that nodes belonging to the same class are more likely to be connected. However, recent studies have shown that GNNs may fail to model heterophilous graphs, where nodes with different labels tend to be linked. To address this issue, we propose a generic GNN applicable to both homophilous and heterophilous graphs, namely the Low-Rank Graph Neural Network (LRGNN). Specifically, we aim to compute a coefficient matrix in which the sign of each coefficient reveals whether the corresponding two nodes belong to the same class, a goal similar to the sign inference problem in signed social networks. In this paper, we show that signed graphs are naturally generalized weakly balanced for node classification tasks. Motivated by this observation, we propose to leverage low-rank matrix factorization (LRMF) to recover a coefficient matrix from a partially observed signed adjacency matrix. To effectively capture node similarity, we further incorporate the low-rank representation (LRR) method. Our theoretical result shows that, under the update rule of node representations, the LRR obtained by solving a subspace clustering problem can recover the subspace structure of the node representations. To solve the corresponding optimization problem, we utilize an iterative optimization algorithm with a convergence guarantee and develop a neural-style initialization that enables fast convergence. Finally, extensive experiments on both real-world and synthetic graphs validate the superior performance of LRGNN over various state-of-the-art GNNs. In particular, LRGNN offers clear performance gains when the node features are not informative enough.

1. INTRODUCTION

Graphs (or networks) are ubiquitous in a variety of fields, such as social networks, biology, and chemistry. Many real-world networks follow the homophily assumption, i.e., linked nodes tend to share the same label or have similar features, while in graphs with heterophily, nodes with different labels are more likely to form a link. For example, in dating graphs many people tend to connect with people of the opposite sex. For graphs with homophily, Graph Neural Network (GNN) variants (Kipf & Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018) have achieved remarkable success on various graph mining tasks. Among them, the Graph Convolutional Network (GCN) (Kipf & Welling, 2017) and the Graph Attention Network (GAT) (Velickovic et al., 2018) are representative methods. However, the performance of GNNs deteriorates when learning on graphs with heterophily, because the smoothing operation used in traditional GNNs tends to make the representations of neighboring nodes similar even when they have different labels. Several designs (Zhu et al., 2020; Chien et al., 2021; Lim et al., 2021) have been proposed to enhance the representational power of GNNs under heterophilous scenarios (see Zheng et al. (2022) for a survey). Among them, high-pass filters are the most frequently used components, since they push a node away from its neighbors in the embedding space, which conforms to the characteristic of heterophily that nodes are generally dissimilar to their neighbors. High-pass filters are usually realized by negating the normalized adjacency matrix. In the spatial graph convolution domain, signed message passing (Yan et al., 2021; Bo et al., 2021) allows negative aggregation coefficients so as to push away heterophilous neighbors. However, most existing methods have weaknesses that restrict their representational power.
Spectral-based methods (Chien et al., 2021; Luan et al., 2021) combine high-pass filters with low-pass ones by linearly combining the outputs of intermediate layers. These methods fail to capture the node-level homophily ratio because they utilize only one type of convolutional filter in each layer. Spatial-based methods (Bo et al., 2021; Yang et al., 2021) update the representation of each node by computing a learnable weighted combination of the representations of its neighbors, with aggregation coefficients produced by the attention function of the Graph Attention Network (GAT) (Velickovic et al., 2018). Since GAT computes a form of static attention whose ranking of attention scores is unconditioned on the query node (Brody et al., 2022), this attention function is prone to produce uniform attention scores and cannot distinguish nodes of different classes when their feature distributions differ only slightly (Fountoulakis et al., 2022). In this paper, we address the challenge of generalizing GNNs to heterophilous graphs with the help of social theory developed for signed social networks (SSNs), whose positive links represent friendship between two users and whose negative links represent enmity. By analogy, we refer to graphs with negative edges as signed graphs. In SSNs, a practical theory called the weak balance theory (Davis, 1967) modifies the structural balance theory (Cartwright & Harary, 1956) by eliminating the pattern "an enemy of my enemy is my friend" while keeping the patterns "an enemy of my friend is my enemy", "a friend of my friend is my friend", and "a friend of my enemy is my enemy". In the context of node classification, we can view homophily and heterophily as friendship and enmity, respectively.
It is easy to verify that signed graphs are weakly balanced when considering node classification tasks: a homophilous neighbor of my homophilous neighbor is also homophilous to me, but a heterophilous neighbor of my heterophilous neighbor is not necessarily homophilous. Weak balance naturally induces a global low-rank structure on the network, based on which the sign inference problem can be formulated as a low-rank matrix completion (LRMC) problem (Hsieh et al., 2012). The weak balance theory thus motivates us to apply low-rank approximation approaches to node classification on heterophilous graphs. Specifically, given a partially observed signed adjacency matrix (where negative edges are allowed), we aim to recover a coefficient matrix Z via low-rank approximation methods, such that a positive Z_{i,j} implies that nodes v_i and v_j have the same label, and the magnitude of Z_{i,j} represents the importance of node v_j to v_i. We can then use Z to update node representations by performing feature propagation in GNNs. Because solving LRMC takes polynomial time, which is practically infeasible for large networks, we resort to the low-rank matrix factorization (LRMF) technique, which is scalable to large graphs. Low-rank approximation of signed networks can achieve satisfactory or even exact recovery under certain assumptions, as stated in Davenport & Romberg (2016). Furthermore, to better capture the similarity between node representations, we leverage low-rank representation (LRR) (Liu et al., 2010) learning to recover the underlying subspace structure on heterophilous graphs. We name the resulting GNN the Low-Rank Graph Neural Network (LRGNN). Note that although the low-rank assumption has been used to improve defenses against adversarial examples (Jin et al., 2020), that work is neither designed for heterophilous graph modeling nor implemented for signed graphs.
To solve the corresponding non-convex optimization problem, we utilize the softImpute-ALS algorithm (Hastie et al., 2015), which minimizes the objective function via a surrogate function. Although LRGNN characterizes each node's representation as a linear combination of all node representations, we can reduce the time complexity to linear by leveraging some tricks of matrix multiplication. Extensive experimental results on both real-world and synthetic datasets show the superior performance and efficiency of LRGNN over state-of-the-art methods. We defer all proofs and the discussion of related work to the Appendix due to space limitations.

2. PRELIMINARIES

Notations. Denote by G = (V, E) an undirected graph, where V and E denote the node set and edge set, respectively. The nodes are described by a feature matrix X ∈ ℝ^(n×f), where n and f are the number of nodes and the number of features per node, respectively. Y ∈ ℝ^(n×c) is the node label matrix. The neighbor set of node v_i is denoted by N_i. We denote the node representation matrix in the l-th layer by H^(l). We let A denote the adjacency matrix, where A_{i,j} = 1 if (i, j) ∈ E and 0 otherwise. In signed graphs (networks), we extend the values of the adjacency matrix to {−1, 0, 1}, where 1 corresponds to homophily (e.g., friendship), −1 corresponds to heterophily (e.g., enmity), and 0 stands for an unknown relationship. We call this generalized adjacency matrix the signed adjacency matrix and denote it by Ã. The Frobenius norm of a matrix is denoted by ∥·∥_F. A_{i,:} represents the i-th row of matrix A. We define the underlying complete graph and the generalized c-weakly balanced graph as follows.

Definition 1. For a signed graph, its underlying complete graph is obtained by adding all the missing edges to the graph, with appropriate signs, so that the resulting complete graph contains no positive links connecting two enemies (heterophilous nodes).

Definition 2. A signed graph is said to be generalized c-weakly balanced if the nodes can be divided into c groups such that, in its underlying complete graph, all within-group edges are positive and all between-group edges are negative.

Signed graphs are naturally generalized c-weakly balanced when considering node classification tasks with c classes. Figure 1 shows an example of a 3-weakly balanced graph. The following theorem reveals the low-rank structure of generalized weakly balanced graphs.

Theorem 1 (Hsieh et al., 2012). The signed adjacency matrix of the underlying complete graph of a generalized c-weakly balanced graph is exactly of rank c if c > 2, and of rank 1 if c ≤ 2.
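Theorem 1 is easy to check numerically. The sketch below (our own illustration using NumPy; the class sizes are arbitrary) builds the signed adjacency matrix of an underlying complete graph and verifies its rank:

```python
import numpy as np

def complete_signed_adjacency(labels):
    """Signed adjacency matrix of the underlying complete graph:
    +1 for within-class pairs (including the diagonal), -1 otherwise."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return np.where(same, 1.0, -1.0)

# c = 3 classes -> rank exactly 3 (Theorem 1, c > 2).
A3 = complete_signed_adjacency([0, 0, 0, 1, 1, 2, 2])
print(np.linalg.matrix_rank(A3))  # 3

# c = 2 classes -> rank 1 (Theorem 1, c <= 2): the two distinct rows
# are negatives of each other.
A2 = complete_signed_adjacency([0, 0, 1, 1])
print(np.linalg.matrix_rank(A2))  # 1
```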
Considering these low-rank structures, it is natural to model the edge sign inference problem as an LRMC problem, since LRMC can provide theoretical guarantees for exact recovery under some conditions. Given a partially observed signed adjacency matrix Ã, the task of LRMC is to find the lowest-rank solution among all feasible solutions:

min_Z rank(Z),  s.t.  P_Ω(Z) = P_Ω(Ã),    (1)

where P_Ω(·) is an element-wise function defined as P_Ω(M)_{i,j} = M_{i,j} if (i, j) ∈ E and 0 otherwise, for an arbitrary matrix M ∈ ℝ^(n×n). Unfortunately, (1) is an NP-hard problem. An approximate optimization problem of (1) can be obtained using the convex relaxation

min_Z ∥Z∥_*,  s.t.  P_Ω(Z) = P_Ω(Ã),    (2)

where ∥Z∥_* denotes the nuclear norm of Z, the tightest convex relaxation of the rank. Note that the entries of Z can be continuous. This optimization problem can be solved to global optimality in polynomial time in many cases (Candès & Recht, 2012). We next discuss the conditions for perfect recovery. Let τ be the class imbalance, defined as τ ≡ max_i {n/n_i}, where n is the number of nodes and n_i the number of nodes of the i-th class. A surprising result for LRMC is that, by solving (2), perfect recovery from the observations is possible if certain assumptions hold.

Theorem 2 (Hsieh et al., 2012) (Recovery Condition for Signed Networks). Suppose we observe edges Ã_{i,j}, (i, j) ∈ Ω, from a k-weakly balanced signed network A*. Also suppose that the following assumptions hold: (1) k is bounded (k = O(1)); (2) the set of observed entries Ω is uniformly sampled; and (3) the number of samples is sufficiently large, i.e., |Ω| ≥ Cτ⁴ n log² n, where C is a constant. Then A* can be perfectly recovered by solving (2) with probability at least 1 − n⁻³.

Although LRMC can conditionally provide exact recovery, solving (2) requires performing singular value decomposition on a potentially large matrix, which is computationally expensive.
In practice, real-world graphs may have a large number of nodes, so a fast algorithm is needed to make this idea practical. To this end, we further relax the objective as

min_Z ∥P_Ω(Z − Ã)∥_F²,  s.t.  rank(Z) ≤ q,    (3)

where c ≤ q ≪ n is a parameter. The constraint on the rank of Z can be guaranteed by decomposing Z into the product of two low-rank matrices, which gives an LRMF problem:

min_{U,V ∈ ℝ^(n×q)} ∥P_Ω(UVᵀ − Ã)∥_F² + λΘ(U, V),    (4)

where λ is a parameter and Θ(·,·) a regularizer that encourages desired properties in U and V, such as sparsity, non-negativity, or orthogonality. Although (4) is a difficult non-convex problem with no known globally convergent algorithm, alternating-minimization-based algorithms can obtain approximate yet good solutions. We note that matrix factorization (MF) algorithms predict the sign and magnitude of the aggregation coefficient (UVᵀ)_{i,j} using only the information from the observed entries of Ã.
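To illustrate (4), the following sketch (our own minimal alternating-least-squares loop with a small ridge term playing the role of Θ; all hyper-parameters are arbitrary) recovers the signs of a weakly balanced matrix from a subset of its entries:

```python
import numpy as np

def als_sign_recovery(A_obs, mask, q, lam=1e-3, iters=50, seed=0):
    """Approximately minimise ||P_Omega(U V^T - A)||_F^2 + ridge penalty
    by alternating least-squares solves over the rows of U and V."""
    rng = np.random.default_rng(seed)
    n = A_obs.shape[0]
    U = 0.1 * rng.standard_normal((n, q))
    V = 0.1 * rng.standard_normal((n, q))
    for _ in range(iters):
        for i in range(n):                       # update rows of U
            j = np.flatnonzero(mask[i])
            G = V[j].T @ V[j] + lam * np.eye(q)
            U[i] = np.linalg.solve(G, V[j].T @ A_obs[i, j])
        for i in range(n):                       # update rows of V
            j = np.flatnonzero(mask[:, i])
            G = U[j].T @ U[j] + lam * np.eye(q)
            V[i] = np.linalg.solve(G, U[j].T @ A_obs[j, i])
    return U, V

# Ground truth: a 3-weakly balanced complete signed matrix (rank 3).
rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], 10)
A_true = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)

# Observe roughly 60% of the entries (symmetric mask), then factorize.
mask = np.triu(rng.random(A_true.shape) < 0.6)
mask = mask | mask.T
U, V = als_sign_recovery(A_true * mask, mask, q=3)
agreement = np.mean(np.sign(U @ V.T) == A_true)
print(f"sign agreement: {agreement:.2f}")   # close to 1.0
```

Even with 40% of the entries missing, the rank-3 factorization fills in the unobserved signs correctly, which is exactly the behavior Theorem 2 predicts for sufficiently many uniform samples.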

3. LOW RANK GRAPH NEURAL NETWORKS

In this section, we present the overall framework of Low Rank Graph Neural Networks and the corresponding optimization algorithm.

3.1. OVERALL FRAMEWORK

Here, we detail our model design. Inspired by Lim et al. (2021), we first apply MLPs to fuse the feature matrix and the adjacency matrix into a lower-dimensional matrix H^(0) ∈ ℝ^(n×c):

H^(0) = (1 − µ) MLP_X(X) + µ MLP_A(A),    (5)

where 0 < µ < 1 is a balance term. We use the adjacency matrix A here to exploit the graph topology, since MF-based methods recover a complete graph from the observations, in which the original topology is lost. For the updates in deeper layers, motivated by APPNP (Klicpera et al., 2019), we introduce a hyper-parameter β and keep the term βH^(0), which is known to be especially useful for learning on heterophilous graphs since it is not restricted by the homophily assumption. Once we have a coefficient matrix U_*^(l) V_*^(l)ᵀ for the l-th layer, the node representations for the (l+1)-th layer are updated as

H^(l+1) = (1 − β) U_*^(l) V_*^(l)ᵀ H^(l) + βH^(0),    (6)

where 0 < β < 1, and U_*^(l), V_*^(l) ∈ ℝ^(n×q) are derived by minimizing the objective function

F(U^(l), V^(l)) = ∥H^(l) − (1 − β) U^(l) V^(l)ᵀ H^(l) − βH^(0)∥_F² + γ∥P_Ω(U^(l) V^(l)ᵀ − Ã)∥_F²,    (7)

where γ weights the importance of the MF term. The first term is the propagation term, which enables the aggregation coefficients to capture the similarity of node representations by involving the matrices H^(l) and H^(0). It is important to note that the first term can be derived from low-rank representation learning of the coefficient matrix (as will be described in Section 4). This term is also closely related to GloGNN (Li et al., 2022); the difference is discussed in Section 4. The second term is the MF term, which aims to recover the missing edges from the observed ones. Our method therefore captures node correlations and derives aggregation coefficients from both the observed edges and the node representations. Several problems remain to be addressed. First, the signed adjacency matrix Ã is not available.
Second, optimizing (7) is non-trivial due to the element-wise function P_Ω(·) and the propagation term. Finally, it is important to find a good starting point for U and V so as to converge within a small number of iterations. To address the first problem, we can use any off-the-shelf neural network classifier to generate pseudo labels. We also exploit the known node labels in the training set T_V and the label matrix Y. Similar to Zhu et al. (2021), we generate the pseudo labels as

Ȳ = O ⊙ Y + (1 − O) ⊙ Ŷ,  Ŷ = softmax(P),  P = f_NN(X),    (8)

where f_NN(·) denotes a trained neural network, ⊙ the Hadamard product, and O_{i,:} = 1 if i ∈ T_V and 0 otherwise. The signed adjacency matrix is then defined as Ã_{i,j} = ⟨Ȳ_{i,:}, Ȳ_{j,:}⟩ − δ if (i, j) ∈ E and 0 otherwise. Here, 0 < δ < 1 is a parameter that controls the ratio of negative edge weights, and ⟨·,·⟩ denotes the inner product of two vectors. We note that one can use stored pseudo labels generated by any model; in our experiments, we use simple models, namely GCN and MLP, for fairness.
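A minimal sketch of this construction (the function name and toy inputs are ours; any trained classifier could supply the soft predictions) looks as follows:

```python
import numpy as np

def signed_adjacency(edges, Y_soft, train_mask, Y_true, delta, n):
    """A_tilde[i, j] = <Ybar_i, Ybar_j> - delta on observed edges, 0 elsewhere;
    known training labels override the classifier's soft predictions."""
    Ybar = np.where(train_mask[:, None], Y_true, Y_soft)
    A_tilde = np.zeros((n, n))
    for i, j in edges:
        w = Ybar[i] @ Ybar[j] - delta
        A_tilde[i, j] = A_tilde[j, i] = w
    return A_tilde

# Toy example: 4 nodes, 2 classes; nodes 0 and 3 are in the training set.
Y_soft = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
Y_true = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
train_mask = np.array([True, False, False, True])
A = signed_adjacency([(0, 1), (0, 2), (2, 3)], Y_soft, train_mask, Y_true,
                     delta=0.5, n=4)
print(A[0, 1], A[0, 2])  # positive (same class), negative (different class)
```

Confident same-class predictions yield inner products near 1 and hence positive weights, while cross-class pairs fall below δ and become negative edges, which is what the MF term then completes.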

3.2. SOFTIMPUTE ALTERNATING LEAST SQUARES ALGORITHM

Practical solutions to LRMF fall into two main camps: Alternating Least Squares (ALS) algorithms (Wiberg, 1976; Shum et al., 1995; Huynh et al., 2003) and Newton methods (Buchanan & Fitzgibbon, 2005; Okatani & Deguchi, 2007; Chen, 2008); we refer the interested reader to Davenport & Romberg (2016) for a survey. Unfortunately, both families have difficulty handling additional constraints, and the propagation term prevents us from optimizing (7) directly by ALS or Newton methods. To tackle this issue, we make use of the softImpute-ALS algorithm (Hastie et al., 2015). Suppose we have current estimates Ū^(l) and V̄^(l), and we wish to derive a new Ũ^(l) that decreases the objective. We first introduce the following surrogate function for deriving Ũ^(l):

S_U(Z_U | Ū^(l), V̄^(l)) = ∥H^(l) − (1 − β) Z_U V̄^(l)ᵀ H^(l) − βH^(0)∥_F² + γ∥P_Ω(Z_U V̄^(l)ᵀ − Ã) + P_φ(Z_U V̄^(l)ᵀ − Ū^(l) V̄^(l)ᵀ)∥_F²,    (9)

where P_φ(·) is an element-wise function defined as P_φ(M)_{i,j} = M_{i,j} if (i, j) ∉ E and 0 otherwise. Then Ũ^(l) is obtained by minimizing the surrogate function over Z_U, i.e., Ũ^(l) = argmin_{Z_U ∈ ℝ^(n×q)} S_U(Z_U | Ū^(l), V̄^(l)). The closed-form solution is (see Appendix A.1)

Ũ^(l) = [γ Â V̄^(l) + (1 − β) H^(l) H^(l)ᵀ V̄^(l) − β(1 − β) H^(0) H^(l)ᵀ V̄^(l)] · [γ V̄^(l)ᵀ V̄^(l) + (1 − β)² V̄^(l)ᵀ H^(l) H^(l)ᵀ V̄^(l)]⁻¹,    (10)

where Â = P_Ω(Ã) + P_φ(Ū^(l) V̄^(l)ᵀ). Similarly, we define a surrogate function for deriving Ṽ^(l):

S_V(Z_V | Ũ^(l), V̄^(l)) = ∥H^(l) − (1 − β) Ũ^(l) Z_Vᵀ H^(l) − βH^(0)∥_F² + γ∥P_Ω(Ũ^(l) Z_Vᵀ − Ã) + P_φ(Ũ^(l) Z_Vᵀ − Ũ^(l) V̄^(l)ᵀ)∥_F².    (11)

We obtain Ṽ^(l) by minimizing this surrogate function over Z_V. The closed-form solution is

Ṽ^(l) = [γI_n + (1 − β)² H^(l) H^(l)ᵀ]⁻¹ · [γ Âᵀ Ũ^(l) + (1 − β) H^(l) H^(l)ᵀ Ũ^(l) − β(1 − β) H^(l) H^(0)ᵀ Ũ^(l)] · [Ũ^(l)ᵀ Ũ^(l)]⁻¹,    (12)

where now Â = P_Ω(Ã) + P_φ(Ũ^(l) V̄^(l)ᵀ). We establish the following result to justify the design of the surrogate functions. Theorem 3.
(Correctness) The objective function (7) is non-increasing under the update rules (10) and (12):

F(Ũ^(l), Ṽ^(l)) ≤ F(Ũ^(l), V̄^(l)) ≤ F(Ū^(l), V̄^(l)).    (13)

Remark 1. One can calculate H^(l+1) with time complexity linear in the number of edges using some tricks of matrix multiplication; see Appendix A.3 for a detailed discussion.

Most traditional approaches randomly initialize U and V. In this paper, we employ a more neural-style initialization:

U_init^(l) = argmin_U ∥Ã − U V_init^(l)ᵀ∥_F² = (Ã V_init^(l))(V_init^(l)ᵀ V_init^(l))⁻¹,  V_init^(l) = f_init(H^(0)),    (14)

where f_init(·) denotes a fully-connected layer or a graph convolution layer, depending on the homophily ratio. We empirically found that this initialization provides better results within a few update iterations. The pseudocode of LRGNN can be found in Appendix A.2.
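The initialization of U in (14) is an ordinary least-squares solve. A sketch (our own, in NumPy, with a random matrix standing in for f_init(H^(0))):

```python
import numpy as np

def init_U(A_tilde, V_init):
    """U_init = argmin_U ||A_tilde - U V_init^T||_F^2
             = (A_tilde V_init) (V_init^T V_init)^{-1}."""
    return (A_tilde @ V_init) @ np.linalg.inv(V_init.T @ V_init)

# Sanity check: if A_tilde = U0 V0^T exactly and V_init = V0,
# the least-squares fit recovers U0.
rng = np.random.default_rng(0)
U0 = rng.standard_normal((20, 4))
V0 = rng.standard_normal((20, 4))
U_fit = init_U(U0 @ V0.T, V0)
print(np.allclose(U_fit, U0))  # True
```

This costs only one q×q inversion plus sparse-dense products with Ã, which is why the initialization adds negligible overhead per layer.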

4. PLACING LRGNN IN THE CONTEXT OF SUBSPACE CLUSTERING

In this section, we show that the propagation term can be derived by performing subspace clustering using low-rank representation (LRR) (Liu et al., 2010). Consider the subspace clustering problem: given a set of data samples (each associated with a vector) drawn from a union of linear subspaces, the goal is to group the samples into their respective subspaces. Spectral-type methods first learn a coefficient matrix from the given data; the clustering result is then obtained by applying spectral clustering to the coefficient matrix. Suppose we have a data matrix X ∈ ℝ^(n×d). We want to find a coefficient matrix Z such that each data sample can be represented by a linear combination of the basis in a "dictionary" D ∈ ℝ^(n×d), i.e., X = ZD, where Z_{i,j} describes the affinity between the i-th and j-th samples and Z_{i,:} is the representation of the i-th sample. LRR aims at finding the lowest-rank representation among all candidates and is robust to noise and outliers in the subspace clustering problem (Liu et al., 2010). Latent Low-Rank Representation (LatLRR) (Liu & Yan, 2011) further improves LRR with hidden data. We consider a variant of LatLRR:

min_Z ∥Z∥_* + λ∥Z∥_F²,  s.t.  X = [Z ∥ I_n][(1 − β)Xᵀ ∥ βX_Hᵀ]ᵀ = (1 − β)ZX + βX_H,    (15)

where X_H ∈ ℝ^(n×d) is the hidden data matrix and β a parameter weighting the importance of the hidden data. Back to our problem, we now wish to perform subspace clustering on H^(l). For graphs with heterophily, it is advisable to use the initial node representations H^(0) as the hidden data, since they are not restricted by the homophily assumption. By appropriately replacing the symbols, we obtain our final optimization problem

min_Z ∥Z∥_* + λ∥Z∥_F²,  s.t.  H^(l) = (1 − β)ZH^(l) + βH^(0).    (16)

The low-rank structure induced by the nuclear norm regularization on Z is of great significance. We have the following result. Theorem 4.
Assume that the row vectors (node representations) of H^(0) are drawn from a union of independent subspaces {S_i}_{i=1}^c. Assume also that the update rule of the node representation matrix is

H^(l+1) = (1 − β)Z^(l)H^(l) + βH^(0),    (17)

where Z^(l) is an optimal solution (assumed to exist) to the optimization problem

min_Z ∥Z∥_* + λ∥Z∥_F²,  s.t.  H^(l) = (1 − β)ZH^(l) + βH^(0),    (18)

with λ > 0, β > 0. Then for any node pair (v_i, v_j) belonging to different subspaces, we have Z^(l)_{i,j} = 0 for all l ≥ 0.

When the number of clusters c is known and the number of columns q in U is sufficiently larger than c, the nuclear norm minimization problem can be equivalently replaced by a low-rank decomposition (Cabral et al., 2013). We therefore obtain the optimization problem

min_{U^(l),V^(l) ∈ ℝ^(n×q)} ∥H^(l) − (1 − β)U^(l)V^(l)ᵀH^(l) − βH^(0)∥_F² + λ′∥U^(l)V^(l)ᵀ∥_F²,  s.t.  q ≥ c.    (19)

The first term is exactly the propagation term in (7). Our numerical simulation experiments show that the LRR obtained by solving (19) can reveal the membership of the samples: within-subspace elements are dense, while between-subspace elements are sparse. We include λ′∥U^(l)V^(l)ᵀ∥_F² for theoretical completeness; in our simulations, as λ′ increases, within-subspace elements also get closer to zero, as shown in Figure 2. Therefore, our objective function excludes this regularization term. The detailed settings and more simulation results are provided in Appendix A.6. In summary, our objective function contains two parts: the propagation term, motivated by the LRR method for subspace clustering, and the MF term, inspired by the low-rank structure of weakly balanced graphs.

Discussion. A closely related work is GloGNN (Li et al., 2022), which defines the coefficient matrix as

Z_*^(l) = argmin_{Z^(l) ∈ ℝ^(n×n)} ∥H^(l) − (1 − β)Z^(l)H^(l) − βH^(0)∥_F² + γ∥Z^(l) − A_GCN∥_F² + λ∥Z^(l)∥_F².    (20)

Notice that although the first term in problem (19) has a similar form to the first term in (20), they are derived from different perspectives. We point out several major differences between LRGNN and GloGNN. First, and most importantly, the coefficient matrix in LRGNN is explicitly modeled as a low-rank matrix. The propagation term in GloGNN is only vanilla subspace clustering augmented with hidden data; neither sparsity nor low rank is ensured, whereas LRR has been shown to offer both theoretical and empirical benefits for subspace clustering, and the weak balance theory developed from signed social networks reveals the low-rank structure of the underlying complete graphs. Second, LRGNN aims at recovering the missing edge weights via matrix factorization, while GloGNN encourages the missing edges to be near zero by including the term ∥Z^(l) − A_GCN∥_F²; in contrast, the MF term in LRGNN imposes no restriction on the unobserved entries, thanks to the element-wise function P_Ω(·). Third, the adjacency matrix A_GCN used in GloGNN is the symmetric normalized adjacency matrix (or its powers) used in vanilla GCN, whose edge weights are uniform and positive; we instead use pseudo labels to generate a similarity matrix that allows negative weights.
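The block structure asserted by Theorem 4 can be illustrated with classic LRR (without the hidden-data term): for min ∥Z∥_* s.t. H = ZH with row-vector samples, the optimal solution is the shape interaction matrix Z* = U_r U_rᵀ from the skinny SVD H = U_r Σ V_rᵀ (Liu et al., 2010). A small sketch (NumPy; the two toy subspaces are our own choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 samples from each of two independent 2-D subspaces of R^4.
B1 = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0]])   # basis of subspace 1
B2 = np.array([[0, 0, 1.0, 0], [0, 0, 0, 1.0]])   # basis of subspace 2
H = np.vstack([rng.standard_normal((5, 2)) @ B1,
               rng.standard_normal((5, 2)) @ B2])

# Shape interaction matrix: closed-form LRR solution for H = Z H.
r = np.linalg.matrix_rank(H)
Ur = np.linalg.svd(H, full_matrices=False)[0][:, :r]
Z = Ur @ Ur.T

# Between-subspace coefficients vanish (cf. Theorem 4);
# within-subspace coefficients do not.
print(np.abs(Z[:5, 5:]).max())  # ~0
```

The off-diagonal blocks of Z are numerically zero while the diagonal blocks are dense, matching the within-dense/between-sparse pattern reported for (19).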

5. EXPERIMENT

In this section, we evaluate the performance of LRGNN. Due to space limitations, we defer to Appendix A.9 the experimental results on the convergence rate of the softImpute-ALS algorithm, the ablation study, robustness to random noise added to the signed adjacency matrix, and a large-scale graph.

Datasets. We use three homophilous datasets, namely Cora, Citeseer, and Pubmed (Yang et al., 2016), as well as six heterophilous datasets released in Pei et al. (2020) and Rozemberczki et al. (2021). The training/validation/testing splits used in this paper are the same as in Pei et al. (2020). The datasets and splits are all available from the PyTorch Geometric library (Fey & Lenssen, 2019). Details of these datasets are provided in Appendix A.5.

Baselines. We compare LRGNN with 11 baselines: (1) classic GNN models: vanilla GCN (Kipf & Welling, 2017), GAT (Velickovic et al., 2018), and MixHop (Abu-El-Haija et al., 2019); (2) GNN models dedicated to tackling heterophily: H2GCN (Zhu et al., 2020), GPR-GNN (Chien et al., 2021), WRGAT (Suresh et al., 2021), LINKX (Lim et al., 2021), GGCN (Yan et al., 2021), ACM-GCN (Luan et al., 2021), and GloGNN++ (Li et al., 2022); and (3) a 2-layer MLP. We choose GloGNN++ and ACM-GCN because they generally perform better than the other variants proposed in the corresponding papers.

Node classification results. Table 1 summarizes the test accuracy of the methods on the node classification task over datasets with diverse homophily ratios. We can make several observations. (1) MLP is a strong baseline for heterophilous datasets: it significantly outperforms GCN, GAT, and MixHop on Texas, Wisconsin, and Cornell. For example, the average classification accuracy of MLP on Texas is 80.81%, while that of GCN is 55.14%. (2) We observe two different patterns among the heterophilous datasets: GCN, GAT, MixHop, and LINKX perform worse than MLP on Texas, Wisconsin, and Cornell, but outperform MLP on Squirrel and Chameleon. (3) Generally, methods dedicated to heterophilous graphs perform better than MLP and traditional GNNs, with GPR-GNN being the exception.
(4) H2GCN (6.22), WRGAT (5.78), GGCN (4.00), ACM-GCN (3.56), GloGNN++ (2.67), and LRGNN (1.56) are the top performers (the number in parentheses is the average rank across all datasets). LRGNN performs best in terms of average rank, showing that it can consistently offer superior performance on both homophilous and heterophilous graphs. Notably, LRGNN achieves the best result on Squirrel, with around a 20.5% improvement over the runner-up score achieved by LINKX. The superior performance of LRGNN on Squirrel and Chameleon may be attributed to the relatively large average node degrees of these two datasets (38.16 and 13.8, respectively), which means we have more observations from which to recover the underlying complete graph.

Choice of operating rank q. We investigate how the operating rank q affects the performance. Specifically, we are interested in the recovery loss of MF:

err = (1 / (2n²)) Σ_{i,j} |sign((U_*^(L) V_*^(L)ᵀ)_{i,j}) − sign(A*_{i,j})|,

where L denotes the last layer. From Figure 3 we observe that as q increases, the error first gradually decreases and then rises after the lowest point (marked with a vertical line). For the three datasets, the lowest points are at q = 6, 4, and 5, respectively. Note that these datasets all have 5 classes, while for Texas, one class contains only a single node. This result empirically verifies the low-rank structure of real-world graphs. We conclude that the best choice of the operating rank q is exactly the number of classes or slightly larger. This also explains the effectiveness of low-rank modeling: we can explicitly choose a proper rank for the coefficient matrix according to the number of classes, since in most cases the factors U and V are full-rank matrices, so the rank of the coefficient matrix is exactly q. In the general LRMF problem, we have no exact information about the rank of the matrix we want to recover.
However, we do know the rank when considering node classification tasks, which makes choosing a proper q easier.

Results on synthetic graphs. To comprehensively evaluate the performance of LRGNN, we use random partition graphs (Kim & Oh, 2021) generated by a stochastic block model. The node features are sampled from Gaussian distributions whose cluster centers are vertices of a hypercube. Note that the distance between the means of the Gaussian distributions is small compared to the standard deviation; as a result, node features of different classes are hard to distinguish, which is verified by the poor performance of MLP on these synthetic graphs. We use 15 synthetic random graphs with varying node-level homophily ratios and average degrees; more details can be found in Appendix A.5. We make the following observations from Figure 4. When the homophily ratio or the degree is low (average degree = 0.5, homophily ratio = 0.1, or homophily ratio = 0.3), LRGNN is the only method that achieves higher accuracy than MLP. Moreover, LRGNN outperforms the other methods by a large margin when the homophily ratio and average degree are not too high. We note that LRGNN also significantly outperforms all baselines on Squirrel and Chameleon. These graphs share a common characteristic: the quality of the node features is not good enough for MLPs to achieve an acceptable result. We conjecture that LRGNN performs particularly well on such graphs, i.e., that LRGNN can offer superior performance even if the node features are not very informative. We next validate this conjecture using a real-world dataset. Specifically, we degrade the quality of the features of the Texas dataset and then test the performance of models trained on the corrupted dataset. We add Gaussian noise to the features, obtaining a degraded feature matrix X′_{i,j} = X_{i,j} + ε_j, with ε_j i.i.d. sampled from a Gaussian distribution N(0, σ²).
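A sketch of such a generator (our own parametrization; the generator of Kim & Oh (2021) may differ in details) samples intra- and inter-class edge probabilities so that the expected average degree and edge homophily ratio hit the targets:

```python
import numpy as np

def random_partition_graph(n_per_class, n_classes, avg_deg, h, seed=0):
    """SBM-style random partition graph with expected average degree
    avg_deg and expected edge homophily ratio h."""
    rng = np.random.default_rng(seed)
    n = n_per_class * n_classes
    labels = np.repeat(np.arange(n_classes), n_per_class)
    # Per node: h * avg_deg expected same-class neighbours, rest cross-class.
    p_in = h * avg_deg / (n_per_class - 1)
    p_out = (1 - h) * avg_deg / (n - n_per_class)
    P = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    upper = np.triu(rng.random((n, n)) < P, k=1)
    A = upper | upper.T
    return A, labels

A, labels = random_partition_graph(100, 3, avg_deg=10, h=0.3, seed=0)
i, j = np.nonzero(np.triu(A, 1))
print(A.sum() / len(labels))              # average degree, roughly 10
print(np.mean(labels[i] == labels[j]))    # edge homophily, roughly 0.3
```

Sweeping h and avg_deg over a grid yields the family of synthetic graphs used in Figure 4.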
Note that the original features will be overwhelmed by the Gaussian random variables if a large σ is applied, making the features less informative. We take GloGNN++, H2GCN, and MLP as representative models for comparison. As Table 4 shows, as the features become less informative, the performance of these three methods deteriorates dramatically. For example, when σ = 0.8, their accuracies are around 60%, while the accuracy of LRGNN is above 80%. In addition, a large σ makes their training process erratic, as reflected by significant error bars. These results empirically confirm our earlier analysis that LRGNN is particularly effective when the node features are not very informative, which we attribute to deriving the coefficient matrix from both the node representations and the observed signed edges.

Efficiency study. We also evaluate the efficiency of LRGNN against other baseline models. We select GGCN, ACM-GCN, H2GCN, and GloGNN++ for comparison, as they are the top performers in Table 1. We exclude WRGAT, since it takes a long time to precompute the multi-relational graphs. For GGCN, GloGNN++, and H2GCN, we use the code and hyper-parameters provided by their authors. Since no code is available for ACM-GCN, we implement it using the PyTorch library and tune the hyper-parameters on the validation set. We can observe from Table 2 that LRGNN has the shortest running time on 6 out of 9 datasets. In particular, LRGNN converges within 10 seconds on all datasets, whereas GGCN fails to converge within 10 minutes on Pubmed. Moreover, LRGNN achieves around 8.5× and 4.8× speedups over GloGNN++ on Texas and Citeseer, respectively. In conclusion, LRGNN is efficient and converges very fast.
It is worth noting that, to achieve the reported results of LRGNN in Table 1, the number of layers and the number of update iterations for the softImpute-ALS algorithm are almost always 1. This explains the superior efficiency of LRGNN and confirms the fast convergence of the softImpute-ALS algorithm under the neural-style initialization.

6. CONCLUSION

In this paper, we address the challenge of generalizing GNNs to heterophilous graphs with the help of the weak balance theory. Inspired by the low-rank structure of weakly balanced graphs, we propose to explicitly model the coefficient matrix as a low-rank matrix. The coefficient matrix is derived by solving a difficult non-convex optimization problem whose objective function consists of an LRR term and an LRMF term. Extensive experimental results demonstrate the effectiveness of the proposed LRGNN, especially when the node features are not informative enough.



Figure 1: An illustrative example of a 3-weakly balanced signed graph and its adjacency matrix of rank 3. Color corresponds to label.

Figure 2: Numerical simulation result. The low-rank representation is obtained by solving (19). The shaded region indicates a 95% confidence interval.

Figure 3: Recovery error of matrix factorization on three datasets. The lowest point is associated with a vertical line. The shaded region corresponds to a 95% confidence interval.

Figure 4: The first three subfigures are results on synthetic graphs and the last one is the result on corrupted Texas. Error bars indicate 95% confidence interval.

Table 1: Node classification results on 9 real-world benchmark datasets. The results we report are averages with standard deviations over 10 trials. We highlight the best results in bold and the runner-up results with underlines.

Table 2: Empirical running time comparison. Average running time per epoch (in ms) / average total running time (in s). "-" indicates that the algorithm fails to converge within 10 minutes.

