BIGCN: A BI-DIRECTIONAL LOW-PASS FILTERING GRAPH NEURAL NETWORK

Abstract

Graph convolutional networks have achieved great success on graph-structured data. Many graph convolutional networks can be regarded as low-pass filters for graph signals. In this paper, we propose a new model, BiGCN, which represents a graph neural network as a bi-directional low-pass filter. Specifically, we not only consider the original graph structure information but also the latent correlation between features, thus BiGCN can filter the signals along with both the original graph and a latent feature-connection graph. Our model outperforms previous graph neural networks in the tasks of node classification and link prediction on most of the benchmark datasets, especially when we add noise to the node features.

1. INTRODUCTION

Graphs are important research objects in the field of machine learning as they are good carriers for structured data such as social networks and citation networks. Recently, graph neural networks (GNNs) have received extensive attention due to their great performance in graph representation learning. A graph neural network takes node features and graph structure (e.g., the adjacency matrix) as input and embeds the graph into a lower-dimensional space. With the success of GNNs (Kipf & Welling, 2017; Veličković et al., 2017; Hamilton et al., 2017; Chen et al., 2018) in various domains, more and more effort has been focused on the reasons why GNNs are so powerful (Xu et al., 2019). Li et al. (2018) re-examined graph convolutional networks (GCNs) and connected them with Laplacian smoothing. NT & Maehara (2019) revisited GCNs in terms of graph signal processing and explained that many graph convolutions can be considered as low-pass filters (e.g., Kipf & Welling, 2017; Wu et al., 2019) which capture low-frequency components and remove some feature noise by making connected nodes more similar. In fact, these findings are not new. Since its first appearance in Bruna et al. (2014), the spectral GCN has been closely related to graph signal processing and denoising. The spectral graph convolutional operation is derived from the Graph Fourier Transform, and the filter can be formulated as a function of the graph Laplacian matrix, denoted g(L). In general spectral GCNs, the forward function is H^(l+1) = σ(g(L) H^(l)). Kipf and Welling (2017) approximated g(L) using first-order Chebyshev polynomials, which can be simplified to multiplying the augmented normalized adjacency matrix with the feature matrix. Despite its efficiency, this first-order graph filter has been found sensitive to changes in the graph signals and the underlying graph structure (Isufi et al., 2016; Bianchi et al., 2019).
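As a toy illustration (our own sketch, not the authors' code; the tanh nonlinearity and variable names are arbitrary choices), the simplified first-order graph convolution of Kipf & Welling can be written in a few lines of NumPy:

```python
import numpy as np

def gcn_layer(A, H, W, act=np.tanh):
    """One first-order spectral GCN layer: H' = act(A_hat @ H @ W),
    where A_hat is the augmented normalized adjacency matrix."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    d = A_tilde.sum(axis=1)                    # augmented degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # D^{-1/2} (A + I) D^{-1/2}
    return act(A_hat @ H @ W)

# toy graph: 3 nodes on a path, 2-d features, 2 output channels
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.random.randn(3, 2)
W = np.random.randn(2, 2)
out = gcn_layer(A, H, W)
print(out.shape)  # (3, 2)
```

Multiplying by A_hat averages each node with its neighbors, which is exactly the low-pass behavior discussed above.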
For instance, on isolated nodes or small connected components of the graph, the denoising effect is quite limited due to the lack of reliable neighbors. Potentially incorrect structure information will also constrain the power of GCNs and cause more negative impact as layers get deeper. As noisy or incorrect information is inevitable in real-world graph data, more powerful and robust GCNs are needed to solve this problem. In this work, we propose a new graph neural network with a stronger denoising effect, from the perspective of graph signal processing, and higher fault tolerance to the graph structure. Different from image data, graph data usually has high-dimensional features, and there may be latent connections/correlations between the feature dimensions. Noting this, we take such connection information into account to offset the effects of unreliable structure information, and remove extra noise by applying a smoothness assumption on this "feature graph". Derived from the additional Laplacian smoothing regularization on the feature graph, we obtain a novel variant of spectral GCNs, named BiGCN, which contains low-pass graph filters for both the original graph and a latent feature-connection graph in each convolution layer. Our model extracts low-frequency components from both graphs, so it is more expressive than the original spectral GCN; and it removes noise from two directions, so it is also more robust. We evaluate our model on two tasks: node classification and link prediction. In addition to the original graph data, in order to demonstrate the effectiveness of our model with respect to graph signal denoising and fault tolerance, we design three cases with noise/structure mistakes: randomly adding Gaussian noise with different variances to a certain percentage of nodes; adding different levels of Gaussian noise to the whole graph feature matrix; and changing a certain percentage of connections.
The remarkable performance of our model in these experiments verifies its power and robustness on both clean and noisy data. The main contributions of this work are summarized below.
• We propose a new framework for the representation learning of graphs with node features. Instead of only considering the signals in the original graph, we take the feature correlations into account, making the model more robust.
• We formulate our graph neural network based on Laplacian smoothing and derive a bidirectional low-pass graph filter using the Alternating Direction Method of Multipliers (ADMM) algorithm.
• We design three cases to demonstrate the powerful denoising capacity and high fault tolerance of our model on node classification and link prediction tasks.

2. RELATED WORK

We summarize the related work in the field of graph signal processing and denoising and recent work on spectral graph convolutional networks as follows.

2.1. GRAPH SIGNAL PROCESSING AND DENOISING

Graph-structured data is ubiquitous in the world. Graph signal processing (GSP) (Ortega et al., 2018) is intended for analyzing and processing the graph signals whose values are defined on the set of graph vertices. It can be seen as a bridge between classical signal processing and spectral graph theory. One line of the research in this area is the generalization of the Fourier transform to the graph domain and the development of powerful graph filters (Zhu & Rabbat, 2012; Isufi et al., 2016) . It can be applied to various tasks, such as representation learning and denoising (Chen et al., 2014) . More recently, the tools of GSP have been successfully used for the definition of spectral graph neural networks, making a strong connection between GSP and deep learning. In this work, we restart with the concepts from graph signal processing and define a new smoothing model for deep graph learning and graph denoising. It is worth mentioning that the concept of denoising/robustness in GSP is different from the defense/robustness against adversarial attacks (e.g. (Zügner & Günnemann, 2019) ), so we do not make comparisons with those models.

2.2. SPECTRAL GRAPH CONVOLUTIONAL NETWORKS

Inspired by the success of convolutional neural networks in images and other Euclidean domains, researchers also started to extend the power of deep learning to graphs. One of the earliest approaches for defining the convolutional operation on graphs is the use of the Graph Fourier Transform, defining the operation in the spectral domain instead of the original spatial domain (Bruna et al., 2014). Defferrard et al. (2016) proposed ChebyNet, which defines filters as Chebyshev polynomials of the diagonal matrix of eigenvalues; such filters are exactly localized in the k-hop neighborhood. Later on, Kipf and Welling (2017) simplified the Chebyshev filters to first-order polynomial filters, which led to the well-known graph convolutional network. Recently, many new spectral graph filters have been developed. For example, rational auto-regressive moving average (ARMA) graph filters (Isufi et al., 2016; Bianchi et al., 2019) were proposed to enhance the modeling capacity of GNNs. Compared to polynomial filters, ARMA filters are more robust and provide a more flexible graph frequency response. Feedback-looped filters (Wijesinghe & Wang, 2019) further improved localization and computational efficiency. There is also another family of graph convolutional networks that defines convolutional operations in the spatial domain by aggregating information from neighbors. The spatial types are not closely related to our work, so they are beyond the scope of this discussion. As we will discuss later, our model is closely related to spectral graph convolutional networks. We define our graph filter from the perspective of Laplacian smoothing, and then extend it not only to the original graph but also to a latent feature graph in order to improve the capacity and robustness of the model.

3. BACKGROUND: GRAPH SIGNAL PROCESSING

In this section, we briefly introduce some concepts of graph signal processing (GSP), including graph smoothness, the Graph Fourier Transform, and graph filters, which will be used in later sections.

Graph Laplacian and Smoothness. A graph can be represented as G = (V, E), consisting of a set of n nodes V = {1, . . . , n} and a set of edges E ⊆ V × V. In this paper, we only consider undirected attributed graphs. We denote the adjacency matrix of G as A = (a_ij) ∈ R^{n×n} and the degree matrix of G as D = diag(d(1), . . . , d(n)) ∈ R^{n×n}, where d(i) is the degree of vertex i ∈ V. Each vertex i ∈ V is associated with a scalar x(i) ∈ R, also called a graph signal; all graph signals together form a vector x ∈ R^n. Several variants of the graph Laplacian can be defined on G; we denote the graph Laplacian of G as L = D - A ∈ R^{n×n}. Note that each row of L sums to zero. The smoothness of a graph signal x can be measured through the Laplacian quadratic form: Δ(x) = x^T L x = Σ_{i,j} (1/2) a_ij (x(i) - x(j))^2. Since x^T L x ≥ 0 for all x, L is a positive semi-definite symmetric matrix.

Graph Fourier Transform and Graph Filters. Decomposing the Laplacian matrix as L = U Λ U^T, we obtain the orthogonal eigenvectors U as the Fourier basis and the eigenvalues Λ as the graph frequencies. The Graph Fourier Transform F: R^n → R^n is defined by F(x) = x̂ := U^T x, and the inverse transform by F^{-1}(x̂) = x := U x̂. It allows us to transfer a graph signal to the spectral domain and then define a graph filter g in the spectral domain for filtering the graph signal x: g(L) x = U g(Λ) U^T x = U g(Λ) F(x), where g(Λ) = diag(g(λ_1), . . . , g(λ_n)) controls how the graph frequencies are altered.
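The definitions above can be checked numerically. The sketch below (our own toy example, not from the paper) builds the Laplacian of a 3-node path graph, verifies the two equivalent forms of the smoothness measure, and round-trips a signal through the Graph Fourier Transform:

```python
import numpy as np

# toy undirected graph: path 0-1-2
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
D = np.diag(A.sum(axis=1))
L = D - A                                  # combinatorial graph Laplacian L = D - A

x = np.array([1.0, 2.0, 4.0])              # a graph signal

# smoothness via the Laplacian quadratic form vs. the pairwise-difference form
smooth = x @ L @ x
pairwise = 0.5 * sum(A[i, j] * (x[i] - x[j]) ** 2
                     for i in range(3) for j in range(3))
assert np.isclose(smooth, pairwise)        # the two definitions agree

# Graph Fourier Transform: project onto the Laplacian eigenvectors
lam, U = np.linalg.eigh(L)                 # graph frequencies and Fourier basis
x_hat = U.T @ x                            # forward GFT
x_rec = U @ x_hat                          # inverse GFT
assert np.allclose(x, x_rec)               # the transform is invertible
print(smooth)  # 5.0
```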

4. BIGCN

The Graph Fourier Transform has been successfully used to define various low-pass filters on graph signals (column vectors of the feature matrix) and to derive spectral graph convolutional networks (Defferrard et al., 2016; Bianchi et al., 2019; Wijesinghe & Wang, 2019). A spectral graph convolutional operation can be formulated as a function g of the Laplacian matrix L. Although it can smooth the graph and remove certain feature-wise noise by assimilating neighboring nodes, it is sensitive to node-wise noise and unreliable structure information. Notice that when the node features carry rich information, there may exist correlations between different feature dimensions which can be used to mitigate this low-tolerance problem. It is therefore natural to define filters on "feature signals" (row vectors of the graph feature matrix) based on the feature correlations. Inspired by this, we propose a bi-directional spectral GCN, named BiGCN, with column filters and row filters derived from the Laplacian smoothness assumption, as shown in Fig. 1. In this way, we enhance both the denoising capacity and the fault tolerance to graph structure of spectral graph convolutions. To explain it better, we start with the following simple case.

4.1. FROM LAPLACIAN SMOOTHING TO GRAPH CONVOLUTION

[Figure 1 caption: In the feature graph, d_i denotes a feature dimension, with the corresponding row vector of the input feature matrix as its "feature vector". We use a learnable matrix to capture feature correlations.]

Assume f = y_0 + η is an observation of a true graph signal y_0 corrupted by noise η. To recover y_0, a natural optimization problem is

min_y ||y - f||_2^2 + λ y^T L y,

where λ is a hyper-parameter and L is the (normalized) Laplacian matrix. The optimal solution of this problem is

y = (I + λL)^{-1} f.    (1)

If we generalize the noisy graph signal f to a noisy feature matrix F = Y_0 + N, then the true graph feature matrix Y_0 can be estimated as

Y_0 = argmin_Y ||Y - F||_F^2 + λ trace(Y^T L Y) = (I + λL)^{-1} F.    (2)

The Laplacian regularization term trace(Y^T L Y) imposes a smoothness assumption on the feature matrix. The matrix (I + λL)^{-1} is equivalent to a low-pass filter in the graph spectral domain, which removes feature-wise/column-wise noise and can be used to define a new graph convolutional operation. Specifically, by multiplying with a learnable matrix W (i.e., adding a linear layer for node feature transformation beforehand, similar to Wu et al. (2019); NT & Maehara (2019)), we obtain a new graph convolutional layer:

H^(l+1) = σ((I + λL)^{-1} H^(l) W^(l)).

To reduce the computational complexity, we can simplify the propagation rule by approximating (I + λL)^{-1} with its first-order Taylor expansion I - λL.
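As a numerical sketch (our own toy example; the 3-node path graph, λ = 0.3, and the signal values are arbitrary choices), the exact filter (I + λL)^{-1} and its first-order Taylor approximation I - λL can be compared directly:

```python
import numpy as np

# path graph 0-1-2 and its normalized Laplacian
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
d_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
L = np.eye(3) - d_inv_sqrt @ A @ d_inv_sqrt

lam = 0.3                                  # smoothing strength λ
F = np.array([[1.0], [5.0], [2.0]])        # a noisy single-channel signal

exact = np.linalg.solve(np.eye(3) + lam * L, F)   # (I + λL)^{-1} F
approx = (np.eye(3) - lam * L) @ F                # first-order Taylor: (I - λL) F

# the exact filter is low-pass: the Laplacian quadratic form (smoothness) drops
print((exact.T @ L @ exact).item() < (F.T @ L @ F).item())  # True
print(np.abs(exact - approx).max())        # small first-order approximation error
```

The Taylor form avoids the matrix inverse, which is the motivation for the simplification used in the propagation rule above.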

4.2. BI-DIRECTIONAL SMOOTHING AND FILTERING

Considering the latent correlation between different dimensions of the features, we can define a "feature adjacency matrix" Ā, analogous to the graph adjacency matrix, to encode such feature connections. For instance, if the i-th, j-th, and k-th feature dimensions refer to "height", "weight", and "age" respectively, then "weight" may have a strong correlation with "height" but a weak correlation with "age", so it is reasonable to assign Ā_ji = 1 and Ā_jk = 0 (if we assume Ā is a 0-1 matrix). With a given feature adjacency matrix, we can construct a corresponding "feature graph" in which nodes represent feature dimensions and edges represent correlations. Moreover, if Y ∈ R^{n×d} is the feature matrix of graph G, then Y^T ∈ R^{d×n} is the "feature matrix of the feature graph": the column vectors of Y are the feature vectors of the original nodes, while its row vectors are exactly the feature vectors of the "feature nodes". Analogously, we can derive the Laplacian matrix of this feature graph. When noise is not only feature-wise but also node-wise, or when the graph structure information is not completely reliable, it is beneficial to consider feature correlation information in order to recover the clean feature matrix better. Thus we add a Laplacian smoothness regularization on the feature graph to the optimization problem above:

L = min_Y ||Y - F||_F^2 + λ_1 trace(Y^T L_1 Y) + λ_2 trace(Y L_2 Y^T),

where L_1 and L_2 are the normalized Laplacian matrices of the original graph and the feature graph, and λ_1 and λ_2 are the hyper-parameters of the two Laplacian regularizations; trace(Y L_2 Y^T) is the Laplacian regularization on the feature graph, i.e., on the row vectors of the original feature matrix. The solution of this optimization problem satisfies

∂L/∂Y = 2Y - 2F + 2λ_1 L_1 Y + 2λ_2 Y L_2 = 0.    (5)

This equation, equivalent to λ_1 L_1 Y + λ_2 Y L_2 = F - Y, is a Sylvester equation.
The numerical solution of a Sylvester equation can be computed with classical algorithms such as the Bartels-Stewart algorithm (Bartels, 1972), the Hessenberg-Schur method (Golub et al., 1979), and the LAPACK routines (Anderson et al., 1999). However, all of them require a Schur decomposition, which involves Householder transformations and QR iterations with O(n^3) computational cost. Consequently, instead of solving the Sylvester equation directly, we transform the original problem into a bi-criteria optimization problem with an equality constraint:

L = min_{Y_1} f(Y_1) + min_{Y_2} g(Y_2)  s.t.  Y_2 - Y_1 = 0,
f(Y_1) = (1/2)||Y_1 - F||_F^2 + λ_1 trace(Y_1^T L_1 Y_1),
g(Y_2) = (1/2)||Y_2 - F||_F^2 + λ_2 trace(Y_2 L_2 Y_2^T).

We adopt the ADMM algorithm (Boyd et al., 2011) to solve this constrained convex optimization problem. The augmented Lagrangian of L is

L_p(Y_1, Y_2, Z) = f(Y_1) + g(Y_2) + trace(Z^T (Y_2 - Y_1)) + (p/2)||Y_2 - Y_1||_F^2.

The ADMM update iterations are

Y_1^(k+1) := argmin_{Y_1} L_p(Y_1, Y_2^(k), Z^(k)) = argmin_{Y_1} (1/2)||Y_1 - F||_F^2 + λ_1 trace(Y_1^T L_1 Y_1) + trace(Z^(k)T (Y_2^(k) - Y_1)) + (p/2)||Y_2^(k) - Y_1||_F^2,
Y_2^(k+1) := argmin_{Y_2} L_p(Y_1^(k+1), Y_2, Z^(k)) = argmin_{Y_2} (1/2)||Y_2 - F||_F^2 + λ_2 trace(Y_2 L_2 Y_2^T) + trace(Z^(k)T (Y_2 - Y_1^(k+1))) + (p/2)||Y_2 - Y_1^(k+1)||_F^2,
Z^(k+1) := Z^(k) + p(Y_2^(k+1) - Y_1^(k+1)).

Computing the stationary points of L_p(Y_1, Y_2^(k), Z^(k)) and L_p(Y_1^(k+1), Y_2, Z^(k)) yields the closed-form iterations

Y_1^(k+1) = (1/(1+p)) (I + (2λ_1/(1+p)) L_1)^{-1} (F + p Y_2^(k) + Z^(k)),
Y_2^(k+1) = (1/(1+p)) (F + p Y_1^(k+1) - Z^(k)) (I + (2λ_2/(1+p)) L_2)^{-1}.    (9)

To decrease the computational complexity, we can use a first-order Taylor approximation to simplify the iterations, choosing the hyper-parameters p, λ_1, λ_2 such that the eigenvalues of (2λ_1/(1+p)) L_1 and (2λ_2/(1+p)) L_2 all fall into [-1, 1]:

Y_1^(k+1) = (1/(1+p)) (I - (2λ_1/(1+p)) L_1) (F + p Y_2^(k) + Z^(k)),
Y_2^(k+1) = (1/(1+p)) (F + p Y_1^(k+1) - Z^(k)) (I - (2λ_2/(1+p)) L_2),
Z^(k+1) = Z^(k) + p(Y_2^(k+1) - Y_1^(k+1)).    (10)

In each iteration, as shown in Fig. 1, we update Y_1 by applying the column low-pass filter I - (2λ_1/(1+p)) L_1 to the previous Y_2, then update Y_2 by applying the row low-pass filter I - (2λ_2/(1+p)) L_2 to the new Y_1. To some extent, the new Y_1 consists of the low-frequency column components of the previous Y_2, and the new Y_2 of the low-frequency row components of the new Y_1. After k iterations (k = 2 in our experiments), we take the mean of Y_1 and Y_2 as the approximate solution Y, denoted Y = ADMM(F, L_1, L_2). In this way, the output of ADMM contains both kinds of low-frequency components. Moreover, since it is hard to give a quantitative description of feature correlations, we generalize L_2 to a learnable symmetric matrix based on the original feature matrix F (or some prior knowledge). In the (l+1)-th propagation layer, F = H^(l) is the output of the l-th layer and L_2 is a learnable symmetric matrix depending on H^(l), which we denote L_2^(l). The entire layer is

H^(l+1) = σ(ADMM(H^(l), L_1, L_2^(l)) W^(l)).

Discussion of over-smoothing. Since our algorithm is derived from bidirectional smoothing, one may worry about the over-smoothing problem. The over-smoothing issue of GCN is explored in (Li et al., 2018; Oono & Suzuki, 2020), whose main claim is that a very deep GCN suffers from over-smoothing and loses its expressive power. From this perspective, our model will face the same problem when many layers are stacked.
However, a single BiGCN layer is simply a more expressive and robust filter than a normal GCN layer. Compared with a single-direction low-pass filtering GCN with the general forward function H^(l+1) = σ(g(L_1) H^(l) W^(l)), the term ADMM(H^(l), L_1, L_2), which combines the low-frequency components of both the column and row vectors of H^(l), is more informative than g(L_1) H^(l), since the latter can be regarded as one part of the former to some extent. This also shows that BiGCN is more expressive than single-direction low-pass filtering GCNs. Furthermore, when we take L_2 to be the identity matrix (in equation 5), BiGCN degenerates to a single-directional GCN with low-pass filter ((1 + λ_2)I + λ_1 L_1)^{-1}, which illustrates that BiGCN has more general model capacity. More technical details are given in the Appendix. In practice, we can also mix BiGCN layers with original GCN layers or use jumping knowledge (Xu et al., 2018) to alleviate over-smoothing: for example, we can use BiGCN at the bottom and stack other GCN layers above. As we will show in the experiments, the added smoothing term in the BiGCN layers does not lead to over-smoothing; instead, it improves performance on various datasets.
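The simplified ADMM iterations of Section 4.2 can be sketched numerically as follows (our own toy NumPy sketch; the graphs, λ_1 = λ_2 = 0.5, and p = 1 are arbitrary choices, and the real layer learns L_2 and applies a weight matrix W afterwards):

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for an undirected adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, d ** -0.5, 0.0)
    return np.eye(len(d)) - np.diag(d_inv_sqrt) @ A @ np.diag(d_inv_sqrt)

def bigcn_admm(F, L1, L2, lam1=0.5, lam2=0.5, p=1.0, k=2):
    """First-order-Taylor ADMM iterations for the bidirectional filter.
    Returns the mean of Y1 and Y2 after k iterations."""
    n, d = F.shape
    col_filt = np.eye(n) - (2 * lam1 / (1 + p)) * L1   # column (node-side) filter
    row_filt = np.eye(d) - (2 * lam2 / (1 + p)) * L2   # row (feature-side) filter
    Y1, Y2, Z = F.copy(), F.copy(), np.zeros_like(F)
    for _ in range(k):
        Y1 = col_filt @ (F + p * Y2 + Z) / (1 + p)     # update Y1 from Y2, Z
        Y2 = (F + p * Y1 - Z) @ row_filt / (1 + p)     # update Y2 from new Y1
        Z = Z + p * (Y2 - Y1)                          # dual update
    return 0.5 * (Y1 + Y2)

# toy data: a 4-node cycle graph and a 3-feature chain correlation structure
A1 = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
A2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
F = np.random.randn(4, 3)
Y = bigcn_admm(F, normalized_laplacian(A1), normalized_laplacian(A2))
print(Y.shape)  # (4, 3)
```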

5. EXPERIMENT

We test BiGCN on two graph-based tasks, semi-supervised node classification and link prediction, on several benchmarks. As these datasets are usually carefully collected through rigid screening, their noise is negligible. However, in much real-world data, noise is everywhere and cannot be ignored. To highlight the denoising capacity of the bi-directional filters, we design three cases and conduct extensive experiments on artificial noisy data. In the noise-level case, we add different levels of noise to the whole graph. In the noise-rate case, we randomly add noise to a subset of nodes. Considering potential unreliable connections in the graph, we also design a structure-mistakes case, in which we perturb the graph structure, to fully verify the fault tolerance to structure information. We compare our performance with several baselines including the original GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2017), GIN (Xu et al., 2019), and GDC (Klicpera et al., 2019).

5.1. BENCHMARK DATASETS

We conduct link prediction experiments on Citation networks and node classification experiments both on Citation networks and Co-purchase networks. Citation. A citation network dataset consists of documents as nodes and citation links as directed edges. We use three undirected citation graph datasets: Cora (Sen et al., 2008) , CiteSeer (Rossi & Ahmed, 2015) , and PubMed (Namata et al., 2012) for both node classification and link prediction tasks as they are common in all baseline approaches. In addition, we add another citation network DBLP (Pang et al., 2015) to link prediction tasks. Co-purchase. We also use two Co-purchase networks Amazon Computers (McAuley et al., 2015) and Amazon Photos (Shchur et al., 2018) , which take goods as nodes, to predict the respective product category of goods. The features are bag-of-words node features and the edges represent that two goods are frequently bought together.

5.2. EXPERIMENTAL SETUP

We train a two-layer BiGCN, the same as the other baselines. Details of the hyper-parameter settings and noise-case settings are given in the appendix. Learnable L_2. We introduce a fully learnable L_2 in our experiments. In detail, we define L_2 = I - D_2^{-1/2} A_2 D_2^{-1/2} with A_2 = W_2 + W_2^T, where W_2 = sigmoid(W) and W is an upper-triangular parameter matrix to be optimized. To make it sparse, we also add L1 regularization to L_2. L_2 is defined separately for each layer. Note that our framework is general, and in practice there may be other reasonable choices for L_2 (e.g., as discussed in the Appendix).
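A minimal sketch of this construction (our own version in plain NumPy, with a fixed random array standing in for W; in training, W would be a trainable parameter, e.g. a torch.nn.Parameter optimized by gradient descent, with L1 regularization added to the loss):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_L2(W, eps=1e-8):
    """L2 = I - D2^{-1/2} A2 D2^{-1/2}, with A2 = W2 + W2^T and
    W2 = sigmoid(W) restricted to the strict upper triangle."""
    d = W.shape[0]
    W2 = np.triu(sigmoid(W), k=1)      # upper-triangular, entries in (0, 1)
    A2 = W2 + W2.T                     # symmetric feature adjacency
    deg = A2.sum(axis=1) + eps         # eps guards against zero degrees
    D_inv_sqrt = np.diag(deg ** -0.5)
    return np.eye(d) - D_inv_sqrt @ A2 @ D_inv_sqrt

L2 = build_L2(np.random.randn(4, 4))
assert np.allclose(L2, L2.T)           # symmetric by construction
```

The symmetrization A_2 = W_2 + W_2^T guarantees that L_2 is a valid symmetric Laplacian regardless of the parameter values.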

5.3. BASELINE MODELS

We compare our BiGCN with several state-of-the-art GNN models: GCN (Kipf & Welling, 2017); GraphSAGE (Hamilton et al., 2017); GAT (Veličković et al., 2017); GIN (Xu et al., 2019), the Graph Isomorphism Network; and GDC (Klicpera et al., 2019), graph diffusion convolution based on generalized graph diffusion. For GDC, we compare with the variant that leverages personalized PageRank graph diffusion to improve the original GCN. The approach we use to adapt GCN to link prediction tasks is consistent with the implementation in P-GNN.

5.4. RESULTS

We design three types of noise cases, in terms of noise level, noise rate, and structure mistakes, to evaluate each model on node classification and link prediction tasks (excluding structure mistakes for link prediction). "Noise level" and "noise rate" add different types of noise to the node features; "structure mistakes" means we randomly remove or add edges in the original graph. For noise on node features, we expect BiGCN to show its ability as a graph filter. For structural errors, we expect the latent feature graph to help correct structural errors in the original graph. The detailed settings of these cases as well as some additional experimental results can be found in the Appendix. Noise level case. In this case, we add Gaussian noise with a fixed variance (from 0.1 to 0.9, called the noise level) to the feature matrix. Structure mistakes case. Structure mistakes refer to incorrect interaction relationships among nodes. In this setting, we artificially remove or add a certain percentage of graph edges at random and conduct experiments on node classification. Fig. 4 illustrates the outstanding robustness of BiGCN, superior to all baselines, demonstrating that our bi-directional filters can effectively utilize information from the latent feature graph and drastically reduce the negative impact of incorrect structural information. Finally, we would like to mention that our model also outperforms other models in most cases on clean data without noise. This can be attributed to BiGCN's ability to efficiently extract graph features through its bidirectional filters. The detailed values in the figures are listed in the Appendix.

6. CONCLUSION

We proposed bidirectional low-pass filtering GCN, a more powerful and robust network than general spectral GCNs. The bidirectional filter of BiGCN can capture more informative graph signal components than the single-directional one. With the help of latent feature correlation, BiGCN also enhances the network's tolerance to noisy graph signals and unreliable edge connections. Extensive experiments show that our model achieves remarkable performance improvement on noisy graphs.

A MODEL EXPRESSIVENESS

In this section, we give more details on our discussion of over-smoothing in Section 4. As a bi-directional low-pass filter, our model can extract more informative features from the spectral domain. To simplify the analysis, let us take just one step of ADMM (k = 1). Since Z^(0) = 0 and Y_1^(0) = Y_2^(0) = F, the solution from Equation (10) is

Y_1 = (I - (2λ_1/(1+p)) L_1) F,
Y_2 = (I - (2pλ_1/(1+p)^2) L_1) F (I - (2λ_2/(1+p)) L_2).

From this solution, we can see that Y_1 is a low-pass filtered signal that extracts low-frequency features from the original graph via L_1, while Y_2 extracts low-frequency features from the feature graph via L_2 after a transformation involving L_1. Since we take the average of Y_1 and Y_2 as the output of ADMM(H, L_1, L_2), the BiGCN layer extracts low-frequency features from both graphs. That is, our model adds new information from the latent feature graph without losing the features of the original graph. Compared to the original single-directional GCN, our model therefore has more informative features and more representational power. When we take more than one step of ADMM, Equation (10) shows that the additive component (I - (2λ_1/(1+p)) L_1) F always appears in Y_1 (with a scaling coefficient) and the component F (I - (2λ_2/(1+p)) L_2) always appears in Y_2. Hence the output of the BiGCN layer always contains the low-frequency features of the original graph and of the feature graph, together with some additional transformed features, which leads to the same conclusion as in the one-step case.

B SENSITIVITY ANALYSIS

To demonstrate how the hyper-parameters (the number of ADMM iterations, λ_2, p, and λ) influence BiGCN, we take Cora as an example and present node classification results under certain settings of artificial noise. First, we investigate the influence of the number of iterations and λ_2 on clean data and on three noise cases with 0.2 noise rate, 0.2 noise level, and 0.1% structure mistakes respectively. Fig. 5(a) shows that ADMM with 2 iterations is good enough and that the choice of λ_2 has very little impact on the results, since it can be absorbed into the learnable L_2. Then we take a particular case with noise rate equal to 0.2 as an example to illustrate how much the performance of BiGCN depends on p and λ. Fig. 5(b) shows that p guarantees relatively stable performance over a wide range of values and only λ has a comparatively larger impact.

C FLEXIBLE SELECTION OF L 2

In our paper, we treat the latent feature graph Laplacian L_2 as a learnable matrix and optimize it automatically. However, in practice it can also take other fixed forms. For example, a common way to deal with latent correlation is to use a correlation graph (Li et al., 2017). Another special case: if we define L_2 as an identity matrix, our model degenerates to a normal (single-directional) low-pass filtering GCN. When we take L_2 = I in Equation (5), the solution becomes Y = ((1 + λ_2)I + λ_1 L_1)^{-1} F, which is similar to the single-directional low-pass filter (Equation (2)). The BiGCN layer then degenerates to the GCN layer H^(l+1) = σ(((1 + λ_2)I + λ_1 L_1)^{-1} H^(l) W^(l)). To show the difference between different definitions of L_2, we design a simple approach using a thresholded correlation matrix for L_2 and compare it with the method used in the main paper. In particular, we define edge weights A_ij as follows:

(P_ij)_{j ∈ N(i) ∪ {i}} = softmax([x_i^T x_j / (||x_i|| ||x_j||)]_{j ∈ N(i) ∪ {i}}),
A_ij = 1 if P_ij > mean(P), and A_ij = 0 if P_ij ≤ mean(P).

Then we compute L_2 as the normalized Laplacian obtained from A, i.e., L_2 = D̃^{-1/2} Ã D̃^{-1/2}. For a simple demonstration, we only compare the two models on Cora with node feature noise. From Table 1 and Table 2, we can see that our learnable L_2 is better overall. However, a fixed L_2 can still give decent results, and when the node feature dimension is large, fixing L_2 may be more efficient.

[Figure 5 caption: Sensitivity analysis of iteration, λ_2, λ, and p on node classification. For iteration and λ_2, we conduct experiments on clean data and three noise cases with 0.2 noise rate, 0.2 noise level, and 0.1% structure mistakes respectively. For p and λ, we report the performance of BiGCN on Cora with 0.2 noise rate.]
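The thresholded-correlation construction can be sketched as follows (our own simplified version: the softmax is taken over all feature pairs rather than a neighborhood N(i) ∪ {i}, and the thresholded matrix is symmetrized explicitly):

```python
import numpy as np

def correlation_L2(X, eps=1e-8):
    """Thresholded-correlation feature-graph matrix. X is the n x d
    node-feature matrix; feature i's vector is column i of X."""
    Xf = X.T                                           # d feature vectors
    norms = np.linalg.norm(Xf, axis=1, keepdims=True) + eps
    cos = (Xf / norms) @ (Xf / norms).T                # cosine similarities
    P = np.exp(cos) / np.exp(cos).sum(axis=1, keepdims=True)  # row-wise softmax
    A = (P > P.mean()).astype(float)                   # threshold at mean(P)
    A = np.maximum(A, A.T)                             # symmetrize
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    return np.diag(d_inv_sqrt) @ A @ np.diag(d_inv_sqrt)   # D^{-1/2} A D^{-1/2}

X = np.random.randn(10, 4)                             # 10 nodes, 4 features
L2 = correlation_L2(X)
assert L2.shape == (4, 4) and np.allclose(L2, L2.T)
```

Unlike the learnable variant, this L_2 is computed once from the input features and stays fixed during training.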

D EXPERIMENTAL DETAILS

We train a two-layer BiGCN, the same as the other baselines, using Adam as the optimizer with learning rate 0.01, weight decay 5 × 10^-4, and dropout rate 0.5 for all benchmarks and baselines. In the node classification task, we use early stopping with patience 100 and select the best-performing models based on validation set accuracy. In the link prediction task, we train each classifier for a maximum of 100 epochs and report the test ROC AUC selected based on the best validation ROC AUC every 10 epochs. In addition, we follow the experimental setting of P-GNN (position-aware GNN), and the approach we use to adapt GCN to link prediction tasks is consistent with the implementation in P-GNN. We set the random seed for each run and report the mean test results over 10 runs. All the experimental datasets are taken from PyTorch Geometric, and we test BiGCN and the other baselines on the whole graph, whereas GDC selects only the largest connected component of the graph; thus the experimental results we report for GDC may not be completely consistent with those reported in the GDC paper. We found that the Citation datasets in PyTorch Geometric are slightly different from those used in GCN, GraphSAGE, and GAT, which may be why their accuracy on Citeseer and Pubmed in node classification is slightly lower than reported in the original papers. To highlight the denoising capacity of the bi-directional filters, we design the following three cases and conduct extensive experiments on artificial noisy data. The noise-level and noise-rate cases add noise to node features, while the structure-mistakes case adds noise to the graph structure. Noise level case. In this case, we add Gaussian noise with zero mean to all the node features in the graph, i.e., to the feature matrix, and use the variance of the Gaussian (from 0.1 to 0.9) as the quantitative index of the noise level.
Noise rate case. In this case, we add Gaussian noise of the same distribution to different proportions of nodes, i.e., to some rows of the feature matrix, at random, and quantitatively study how the percentage (from 10% to 100%) of nodes with noisy features impacts model performance.
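The two feature-noise cases can be sketched as follows (our own toy NumPy sketch; the matrix size, noise level 0.3, and noise rate 0.2 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 16))       # a clean feature matrix (100 nodes)

# noise-level case: zero-mean Gaussian noise of variance `level` on all nodes
level = 0.3
X_level = X + rng.normal(0.0, np.sqrt(level), size=X.shape)

# noise-rate case: the same noise, but only on a random fraction of the nodes
rate = 0.2
idx = rng.choice(X.shape[0], size=int(rate * X.shape[0]), replace=False)
X_rate = X.copy()
X_rate[idx] += rng.normal(0.0, np.sqrt(level), size=(len(idx), X.shape[1]))

untouched = np.setdiff1d(np.arange(X.shape[0]), idx)
print(len(idx), np.allclose(X_rate[untouched], X[untouched]))  # 20 True
```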

Structure mistakes case.

In practice, it is common and inevitable to observe wrong or spurious links in real-world data, especially in large-scale networks such as social networks. Therefore, we artificially introduce random changes into the graph structure, removing edges or adding false edges by symmetrically reversing values of the original adjacency matrix (from 0 to 1 or from 1 to 0) to obtain a corrupted adjacency matrix. We choose different error scales to decide how many values are reversed at random. For example, assigning a 0.01% error rate to a graph consisting of 300 vertices means that 0.01 × 10⁻² × 300² = 9 values, symmetrically distributed in the adjacency matrix, will be changed. We conduct all of the above cases on five benchmarks in node classification tasks and the first two cases on four benchmarks in link prediction tasks. For more experimental details please refer to our code: https://anonymous.4open.science/r/4fefefed-4d59-4214-a324-832ac0ef1e96/.
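The three noise-injection schemes above can be sketched as follows. This is a minimal NumPy illustration under our own reading of the setup, not the authors' released code; the function names and the default noise variance are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise_level(X, var):
    # Noise level case: zero-mean Gaussian noise on every entry of the
    # feature matrix; the noise level is the Gaussian variance (0.1-0.9).
    return X + rng.normal(0.0, np.sqrt(var), size=X.shape)

def add_noise_rate(X, rate, var=0.1):
    # Noise rate case: perturb a randomly chosen fraction `rate` of the
    # rows (nodes) of the feature matrix with the same Gaussian noise.
    X = X.copy()
    idx = rng.choice(X.shape[0], size=int(rate * X.shape[0]), replace=False)
    X[idx] += rng.normal(0.0, np.sqrt(var), size=(len(idx), X.shape[1]))
    return X

def corrupt_adjacency(A, error_rate):
    # Structure mistakes case: symmetrically flip entries of the adjacency
    # matrix (0 -> 1 or 1 -> 0); e.g. a 0.01% error rate on a 300-node
    # graph changes about 0.01 * 10**-2 * 300**2 = 9 values.
    A = A.copy()
    n = A.shape[0]
    n_pairs = max(1, int(error_rate * n * n) // 2)
    for _ in range(n_pairs):
        i, j = rng.integers(0, n, size=2)
        if i != j:  # keep the diagonal untouched
            A[i, j] = A[j, i] = 1 - A[i, j]
    return A
```

Flipping entries in symmetric pairs keeps the corrupted adjacency matrix valid for an undirected graph.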

D.1 DATASETS

We use three Citation networks (Cora, Citeseer, and Pubmed) and two Co-purchase networks for node classification tasks, and all the Citation datasets for link prediction. The performances of the models on the clean benchmarks in node classification and link prediction are shown in Tables 4 and 5, respectively. These results correspond to the values at noise level 0 in the figures of Section 5.

E NUMERICAL RESULTS AND HYPERPARAMETERS

To facilitate comparison with our results in future research, we report the exact numerical results here in addition to the curves shown in the figures of the experimental section. We also describe the experimental environment and the optimal hyper-parameters used to obtain these results in B.2.

E.1.1 NOISE RATE (NR)

Node Classification (NC). All implementations for both node classification and link prediction are based on PyTorch 1.2.0 and PyTorch Geometric. All experiments run on one NVIDIA GeForce RTX 2080 Ti GPU using CUDA. The experimental datasets are taken from the PyTorch Geometric platform. We tune the hyper-parameters for each model using validation data and list the final optimal settings in the following tables. To accelerate the tedious process of hyper-parameter tuning, we set 2λ1/(1 + p) = 2λ2/(1 + p) = λ and choose a different hyper-parameter p for each dataset.
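Assuming the tying constraint reads 2λ1/(1 + p) = 2λ2/(1 + p) = λ, both filter weights collapse to the single hyper-parameter λ once p is fixed, via λ1 = λ2 = λ(1 + p)/2. A minimal sketch (the function name is ours):

```python
def tied_lambdas(lam, p):
    # Solve 2*l1 / (1 + p) = 2*l2 / (1 + p) = lam for l1 and l2:
    # both reduce to one value once the dataset-specific p is chosen.
    l1 = l2 = lam * (1 + p) / 2.0
    return l1, l2
```

This is why only λ and p appear in the hyper-parameter tables below, rather than λ1 and λ2 separately.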



https://github.com/rusty1s/pytorch_geometric



Figure1: Illustration of one BiGCN layer. In the feature graph, d i indicates each dimension of features with a row vector of the input feature matrix as its "feature vector". We use a learnable matrix to capture feature correlations.

[Table fragment (Photo dataset; row labels lost in extraction): 88.5 ± 1.1, 91.6 ± 0.7, 92.6 ± 0.4; BiGCN: 91.5 ± 0.5, 90.5 ± 0.7, 91.6 ± 0.3, 93.1 ± 0.3]

D.3 EXPERIMENTAL RESULTS ON AMZ COMP

The node classification performances of models on the AMZ Comp dataset are shown in Fig 6.

Figure 6: Node classification accuracy of models on AMZ Comp dataset.

As Fig 2 shows, BiGCN outperforms the other baselines and exhibits flatter declines with increasing noise levels, demonstrating better robustness in both node classification and link prediction tasks.

Noise rate case. Here, we randomly choose a fraction of nodes at a fixed percentage (from 0.1 to 0.9, called the noise rate) and add Gaussian noise to them. From Fig 3 we can see that, on the two tasks, BiGCN performs much better than the baselines on all benchmarks apart from Cora. Especially on the Pubmed dataset, BiGCN improves node classification accuracy by more than 10%.

Node classification accuracy in noise rate case on Cora dataset of two types of L2.

Node classification accuracy in noise level case on Cora dataset of two types of L2.

Benchmark Dataset.

BiGCN compared to GNNs on node classification tasks, measured in accuracy (%). Standard deviation errors are given.

BiGCN compared to GNNs on link prediction tasks, measured in ROC AUC (%). Standard deviation errors are given.

[Table fragments: Cora - NR - NC; Photos - NR - NC; Citeseer - NR - LP; Pubmed - NR - LP. Surviving data row: BiGCN 0.720 0.682 0.659 0.645 0.643 0.637 0.631 0.635]

Citeseer - SM - NC (accuracy vs. structure error rate)

Error rate  0.001  0.003  0.005  0.007  0.009  0.011  0.013  0.015
GCN         0.521  0.446  0.409  0.378  0.362  0.351  0.347  0.329
SAGE        0.520  0.446  0.411  0.375  0.369  0.363  0.336  0.337
GAT         0.498  0.397  0.344  0.299  0.282  0.272  0.254  0.231
GIN         0.397  0.221  0.211  0.210  0.201  0.204  0.206  0.205
GDC         0.529  0.515  0.515  0.511  0.516  0.514  0.518  0.526
BiGCN       0.601  0.577  0.565  0.567  0.565  0.565  0.562  0.567

Photos - SM - NC (accuracy vs. structure error rate)

Error rate  0.001  0.003  0.005  0.007  0.009  0.011  0.013  0.015
BiGCN       0.822  0.795  0.775  0.764  0.756  0.756  0.749  0.749

Hyper-parameters of BiGCN in Node Classification

Hyper-parameters of BiGCN in Link Prediction

