Wasserstein diffusion on graphs with missing attributes

Abstract

Many real-world graphs are attributed graphs, where nodes are associated with non-topological features. While attributes can be missing anywhere in an attributed graph, most existing node representation learning approaches do not consider such incomplete information. In this paper, we propose a general non-parametric framework to mitigate this problem. Starting from a decomposition of the attribute matrix, we transform node features into discrete distributions in a lower-dimensional space equipped with the Wasserstein metric. On this Wasserstein space, we propose Wasserstein graph diffusion to smooth the distribution representations of nodes with information from their local neighborhoods. This allows us to reduce the distortion caused by missing attributes and obtain integrated representations expressing information of both topology and attributes. We then pull the nodes back to the original space and produce corresponding point representations to facilitate various downstream tasks. To show the power of our representation method, we design two algorithms based on it, one for node classification with missing attributes and one for matrix completion, and demonstrate their effectiveness in experiments.

1. Introduction

Many real-world networks are attributed networks, where nodes are not only connected with other nodes but also associated with features, e.g., social network users with profiles or keywords showing interests, Internet Web pages with content information, etc. Learning node representations underlies various downstream graph-based learning tasks and has attracted much attention (Perozzi et al., 2014; Grover & Leskovec, 2016; Pimentel et al., 2017; Duarte et al., 2019). A high-quality node representation expresses both node attribute and graph structure information and can better capture meaningful latent information. Random walk based graph embedding approaches (Perozzi et al., 2014; Grover & Leskovec, 2016) exploit graph structure to preserve pre-specified node similarities in the embedding space and have proven successful in various applications on plain graphs. In addition, graph neural networks, many of which are based on the message passing scheme (Gilmer et al., 2017), aggregate information from neighborhoods and allow us to incorporate attribute and structure information effectively. However, most of these methods, which embed nodes into a lower-dimensional Euclidean space, suffer from a common limitation: they fail to model complex patterns or capture complicated latent information, owing to the limited representation capacity of the embedding space. There has recently been a tendency to embed nodes into more complex target spaces in an attempt to increase the ability to express composite information. A prominent example is Wasserstein embedding, which represents nodes as probability distributions (Bojchevski & Günnemann, 2018; Muzellec & Cuturi, 2018; Frogner et al., 2019) equipped with the Wasserstein metric. A common practice is to learn a mapping from the original space to the Wasserstein space by minimizing distortion, but the resulting objective functions are usually difficult to optimize and computationally expensive.
On the other hand, most representation learning methods depend heavily on the completeness of observed node attributes, which are often partially absent or even entirely inaccessible in real-life graph data. For instance, in social networks like Facebook and Twitter, personal information is incomplete because users are unwilling to provide it out of privacy concerns. Consequently, representation learning models that require fully observed attributes may not be able to cope with these types of real-world networks. In this paper, we propose a novel non-parametric framework to mitigate this problem. Starting from a decomposition of the attribute matrix, we transform node features into discrete distributions in a lower-dimensional space equipped with the Wasserstein metric, implicitly implementing a dimension reduction that greatly reduces computational complexity. Preserving node similarity is a common precondition for incorporating structural information into representation learning. Based on this, we develop a Wasserstein graph diffusion process to effectively propagate a node distribution over its neighborhood and preserve node similarity in the Wasserstein space. To some extent, this diffusion operation implicitly compensates for the loss of information by aggregating information from neighbors. We thereby reduce the distortion caused by missing attributes and obtain integrated node representations capturing both node attributes and graph structure. In addition to producing distribution representations, our framework can leverage the inverse mapping to transform the node distributions back to node features (point representations). Experimentally, we show that these node features are efficient node representations and well-suited to various downstream learning tasks.
More precisely, to comprehensively investigate the representation ability, we examine our framework on node classification under two missing settings: partially missing attributes and entirely missing node attributes. Moreover, we adapt our framework for matrix completion to show its ability to recover absent values. Contributions. We develop a novel non-parametric framework for node representation learning that utilizes incomplete node-attributed information. The contributions of our framework are: 1. embedding nodes into a low-dimensional discrete Wasserstein space through matrix decomposition; 2. reducing the distortion caused by incomplete information and producing effective distribution representations expressing both attribute and structure information through the Wasserstein graph diffusion process; 3. reconstructing node features which can be used for various downstream tasks as well as for matrix completion.

Graph representation learning

In this paper, we focus on learning node representations on attributed graphs. There are many effective graph embedding approaches, such as DeepWalk (Perozzi et al., 2014), node2vec (Grover & Leskovec, 2016), and GenVector (Duarte et al., 2019), which embed nodes into a lower-dimensional Euclidean space and preserve graph structure, but most of them disregard informative node attributes. So far, little attention has been paid to attribute information (Yang et al., 2015; Gao & Huang, 2018; Hong et al., 2019). The advent of graph neural networks (Bruna et al., 2014; Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2017; Gilmer et al., 2017; Klicpera et al., 2019a; b) fills this gap to some extent, by defining graph convolutional operations in the spectral domain or aggregating neighborhood information in the spatial domain. Their learned node representations integrate both node attributes and graph structure information. Due to the expressive limitations of Euclidean space, embedding into more complex target spaces has been explored. Several works leverage distributions equipped with the Wasserstein distance to model complex data, since a probability distribution is well-suited for modeling the uncertainty and flexibility of complex networks. For instance, Graph2Gauss (Bojchevski & Günnemann, 2018) represents nodes as Gaussian distributions such that uncertainty and complex interactions across nodes are reflected in the embeddings. Similarly, Muzellec & Cuturi (2018) presented a framework in which point embedding can be thought of as a particular case of elliptical distribution embedding. Frogner et al. (2019) learn to compute minimum-distortion embeddings in a discrete distribution space such that the underlying distances of the input space are approximately preserved.

Missing attributes

The completeness and adequacy of attribute information are a precondition for learning high-quality node embeddings of an attributed graph. To represent incomplete attributes in the embeddings, a fundamental approach is to provide plausible imputations for missing values. There is a variety of missing value imputation (MVI) techniques, such as mean imputation, KNN imputation (Troyanskaya et al., 2001), softimpute (Hastie et al., 2015), and so on. As a consequence, the representation capacity of the generated embeddings is inherently bounded by the reconstruction ability of the imputation methods. As the number of missing attributes increases, this can lead to distortion problems as well as unstable learning. The method proposed herein is the first to compute embeddings of a graph with incomplete attributes directly.

3.1. Preliminary: Wasserstein distance

The Wasserstein distance is an optimal transport metric, which measures the cost of transporting the mass in one distribution to match another. The p-Wasserstein distance between two distributions µ and ν over a metric space X is defined as W_p(µ, ν) = ( inf_{π ∈ Π(µ,ν)} ∫_{X×X} d(x, y)^p dπ(x, y) )^{1/p}, where Π(µ, ν) is the set of probabilistic couplings π on (µ, ν) and d(x, y) is a ground metric on X. In this paper, we take p = 2. A Wasserstein space is a metric space that endows probability distributions with the Wasserstein distance.

3.2. The space transformation

Space transformation is the first step of our WGD framework, which transforms node features into discrete distributions endowed with the Wasserstein metric. A common assumption for matrix completion is that the matrix is low-rank, i.e., the features lie in a smaller subspace, and the missing features can be recovered from this space. Inspired by the Alternating Least Squares (ALS) algorithm, a well-known missing value imputation method which follows this assumption and uses SVD to factorize the matrix into low-rank submatrices, we first decompose the feature matrix X ∈ R^{n×m} into a principal component matrix U, a singular value matrix Λ, and an orthogonal basis matrix V, i.e., X = UΛV^⊤. For dimensionality reduction, we only retain the first k singular values and vectors: U_k, Λ_k, V_k = SVD(X, k), (1) where U_k ∈ R^{n×k}, V_k ∈ R^{m×k} and Λ_k ∈ R^{k×k}. To impute missing entries, ALS alternately optimizes these submatrices. Our method, however, is not aimed at matrix completion and does not need to optimize U_k and V_k. We aim to generate expressive node representations in the principal component space, where U_k is the initial node embedding matrix. It is worth noting that such node representations have strong semantic information: each feature dimension corresponds to a basis vector from V_k. Moreover, in a broad sense (allowing the existence of negative frequencies), we can express nodes as general histograms, with principal components (the rows of U_k, denoted row(U_k)) acting as frequencies and basis vectors (the columns of V_k, denoted col(V_k)) acting as bins. Therefore, to capture the underlying semantic information, we transform U_k from a Euclidean space into a discrete distribution space. More precisely, this space transformation involves a reversible positive function (here we use the exponential function in our implementation) and a normalization operation: Ũ_k := φ(U_k) = Normalize(exp(row(U_k))). (2)
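The decomposition (1) and the positive mapping (2) can be sketched in a few lines of NumPy. This is a minimal illustration with our own function and variable names, not the authors' implementation:

```python
import numpy as np

def space_transformation(X, k):
    # k-rank SVD: X ≈ U_k Λ_k V_k^T
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k].T
    # reversible positive mapping + normalization: each row of U_k
    # becomes a discrete distribution over the k shared support points
    # (the columns of V_k)
    E = np.exp(U_k)
    U_tilde = E / E.sum(axis=1, keepdims=True)
    return U_tilde, U_k, s_k, V_k

rng = np.random.default_rng(0)
X = rng.random((20, 8))   # toy attribute matrix: n=20 nodes, m=8 features
U_tilde, U_k, s_k, V_k = space_transformation(X, k=4)
```

Each row of `U_tilde` sums to 1, and the columns of `V_k` are orthonormal, so they can serve as the shared support points.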
Here, Ũ_k can be seen as the discrete distributions of nodes, which share the common support points col(V_k), denoted supp(U_k), i.e., supp(U_k) = col(V_k). Each row of Ũ_k is a discrete distribution of the form ũ_i = Σ_j a_ij δ_{v_j}, where the weights a_ij sum to 1 and v_j is a column of V_k as well as a support point. To measure distances between support points, note that Λ contains the square roots of the eigenvalues of X^⊤X; we define the ground metric d as follows: d(v_i, v_j) = | ‖Xv_i‖² − ‖Xv_j‖² | = |Λ_ii^2 − Λ_jj^2|. (3) Here, v_i and v_j refer to the i-th and j-th support points, which are mutually orthogonal unit vectors. In the meantime, we equip the node discrete distributions Ũ_k with the Wasserstein metric: W_2^2(ũ_i, ũ_j | D) = min_{T ≥ 0} tr(DT) subject to T1 = ũ_i, T^⊤1 = ũ_j. (4) Here D ∈ R^{k×k} refers to the underlying distance matrix with D_ij = d(v_i, v_j).
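For a concrete sense of the metric, the squared 2-Wasserstein distance (4) between two histograms on the shared support can be computed by solving the transport linear program directly. This is a sketch using SciPy's generic LP solver; the function names are ours and the paper does not prescribe a particular solver:

```python
import numpy as np
from scipy.optimize import linprog

def ground_metric(s_k):
    # D_ij = |Λ_ii^2 − Λ_jj^2|, the ground metric of Eq. (3)
    lam2 = s_k ** 2
    return np.abs(lam2[:, None] - lam2[None, :])

def wasserstein2_sq(a, b, D):
    # min_{T>=0} tr(DT)  subject to  T 1 = a,  T^T 1 = b
    k = len(a)
    A_eq = np.zeros((2 * k, k * k))
    for i in range(k):
        A_eq[i, i * k:(i + 1) * k] = 1.0   # row sums of T equal a
        A_eq[k + i, i::k] = 1.0            # column sums of T equal b
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun

D = ground_metric(np.array([2.0, 1.0, 0.5]))   # toy singular values
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
```

Moving all mass from the first support point to the second costs D[0, 1] = |4 − 1| = 3, so `wasserstein2_sq(a, b, D)` returns 3.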

3.3. The Wasserstein graph diffusion process

Although we produce node distribution representations from the space transformation, the information extracted from such representations is limited and distorted by missing attributes. To reduce the distortion, we carry out the Wasserstein graph diffusion process to smooth each node distribution over its local neighborhood such that valid information extracted from different node attributes can be shared, i.e., informational complementarity. Wasserstein graph diffusion boils down to an aggregation operation realized by computing the Wasserstein barycenter, which is exactly the updated node representation. In this way, both topology structure information and the aforementioned integrated attribute information are incorporated into the node representations. Note that this is similar to message aggregation in graph neural networks, except that we do it in a Wasserstein space and introduce no parameters. We denote the node distribution update process as Barycenter_Update: ũ = Barycenter_Update(u, D) := arg inf_p (1/|Ñ(u)|) Σ_{u' ∈ Ñ(u)} W_2^2(p, ũ' | D), (5) where Ñ(u) = N(u) ∪ {u}, N(u) refers to the neighbors of u, |Ñ(u)| equals the degree of u (with self-loop), and ũ' refers to the discrete distribution of node u'. Similar to common message aggregation on graphs, a Wasserstein diffusion process comprising l Barycenter_Updates can reach l-hop neighbors. Recall that the support set of the barycenter, denoted S_u, contains all possible combinations of the common support points supp(U_k) shared by all initial node distributions: S_u = { (1/|Ñ(u)|) Σ_{i=1}^{|Ñ(u)|} x_i | x_i ∈ supp(U_k) }. (6) Note that, as the Barycenter_Update process goes on, nodes no longer share the same support points, and their support sets grow larger and larger, which dramatically increases the computation. On the other hand, such a free-support problem is notoriously difficult to solve.
Therefore, we leverage the fixed-support Wasserstein barycenter for the update, which preserves the initial support supp(U_k). Thus, no matter how many times we update the node distributions, they always share the common support points. In practice, we use the Iterative Bregman Projection (IBP) algorithm (Benamou et al., 2015) to obtain the fixed-support barycenter (see Algorithm 1). The Wasserstein diffusion process is formulated as follows: Ũ_k^{(0)} = φ(U_k); Ũ_k^{(l+1)} = Barycenter_Update(Ũ_k^{(l)}, D) with fixed supp(U_k). (7) Through the Wasserstein diffusion process, we obtain updated node distributions Ũ_k that effectively represent node attributes and graph structure.

Algorithm 1 Iterative Bregman Projection for Barycenter_Update
input: discrete distribution matrix P ∈ R^{d×n}, distance matrix D ∈ R^{d×d}, weight vector w, regularization ε.
1: initialize: K = exp(−D/ε), V_0 = 1_{d×n}
2: for i = 1 . . . M do
3:   U_i = P / (K V_{i−1})
4:   p_i = exp(log(K^⊤ U_i) w)
5:   V_i = p_i / (K^⊤ U_i)
6: end for
7: output: p_M
(divisions are elementwise, and p_i is broadcast across the n columns in line 5)
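Algorithm 1 translates almost line-for-line into NumPy. This is a sketch under our own naming; the regularization ε and iteration count M are assumptions, and all divisions are elementwise:

```python
import numpy as np

def ibp_barycenter(P, D, w, eps=0.1, iters=200):
    # P: (d, n) matrix whose n columns are distributions on d support points
    # w: barycenter weights summing to 1
    K = np.exp(-D / eps)              # Gibbs kernel (line 1)
    V = np.ones_like(P)
    for _ in range(iters):
        U = P / (K @ V)               # line 3: match the input marginals
        KtU = K.T @ U
        p = np.exp(np.log(KtU) @ w)   # line 4: weighted geometric mean
        V = p[:, None] / KtU          # line 5: rescale toward the barycenter
    return p

# barycenter of two copies of the same histogram on a line of support points
D = np.abs(np.subtract.outer(np.arange(4.0), np.arange(4.0)))
q = np.array([0.1, 0.4, 0.4, 0.1])
bary = ibp_barycenter(np.stack([q, q], axis=1), D, w=np.array([0.5, 0.5]))
```

With identical inputs the result is a (slightly entropy-smoothed) copy of q, and it remains a valid distribution.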

3.4. The inverse transformation

We first derive an approximate inverse transformation of the prespecified mapping (2) to convert the updated distribution representations Ũ_k back to the principal component space: Ū_k = Gram_Schmidt_Ortho(col(log(Ũ_k))). (8) Here Gram_Schmidt_Ortho refers to Gram-Schmidt orthogonalization, used to preserve the semantic structure of the principal component space. In addition, as a side effect, empirical results show that orthogonalizing node embeddings can efficiently alleviate the over-smoothing problem. On the other hand, SVD separates the observed attribute information into two parts: one is extracted by the updated node distributions, while the other is retained in the fixed support points. To generate more expressive node representations incorporating both parts of the information, we pull the nodes back to the feature space: X̄ = Ū_k Λ_k V_k^⊤. (9) Note that this X̄ is not a matrix completion of the original X, since the observed elements do not remain the same. Instead, it is a transformation of the representation U_k after our Wasserstein diffusion process; combined with neural networks, it can then be used for various downstream tasks, as we show in the next section.
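A minimal sketch of this inverse step follows. We use NumPy's QR factorization as a stand-in for explicit Gram-Schmidt orthogonalization; both orthonormalize the columns, though QR may flip signs:

```python
import numpy as np

def inverse_transformation(U_tilde):
    L = np.log(U_tilde)        # undo the exponential, up to normalization
    Q, _ = np.linalg.qr(L)     # orthonormalize the columns (Gram-Schmidt-like)
    return Q

rng = np.random.default_rng(1)
U_tilde = rng.random((10, 3))
U_tilde /= U_tilde.sum(axis=1, keepdims=True)   # rows are distributions
U_bar = inverse_transformation(U_tilde)
# node features would then be recovered as X_bar = U_bar @ np.diag(s_k) @ V_k.T
```

The returned matrix has orthonormal columns, mirroring the role of the principal components.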

4. Empirical study

As explained in the previous section, our WGD framework can incorporate node attributes and graph structure to reconstruct node representations in the original space, i.e., the feature matrix. Therefore, it is well-suited for various downstream learning tasks as well as matrix completion. In this section, we adapt WGD for node classification and matrix completion, and then evaluate the quality of the reconstructed node features (representations) in each case.

4.1. Node classification on graphs with missing attributes

In this section, we apply our framework to node classification tasks on attributed graphs with missing features. Algorithm 2 summarizes the architecture. Each WGD layer (the outer loop) contains three main stages: space transformation, Wasserstein diffusion, and inverse transformation. The explicit formulas of the space transformation and inverse transformation are given by (2) and (8). In the diffusion process, we conduct Barycenter_Update (7) h times to aggregate probability information from h-hop neighbors, and we take the mean of the outputs of all layers as the updated principal component matrix to avoid over-smoothing. We reconstruct the feature matrix (9) (acting as the updated node representations) in the original space and feed it to a two-layer MLP for node classification, termed WGD-MLP. To confirm the importance of the Wasserstein diffusion process, we propose an ablation framework, called SVD-GCN, in which we skip the diffusion process and keep the other stages unchanged. The forward formula of the (l+1)-th SVD-GCN layer is: X^{(l+1)} = U_k^{(l)} Λ_k V_k^{(l)⊤}, with U_k^{(l)}, Λ_k, V_k^{(l)} = SVD(X^{(l)}, k). Precisely, we leverage low-rank SVD to factorize the matrix and then feed the reconstructed feature matrix to a two-layer GCN (Kipf & Welling, 2017). Furthermore, to show that matrix decomposition is necessary, we provide an additional ablation framework, termed WE-MLP. In this baseline, we directly transform node features into discrete distributions, skipping the matrix decomposition stage. The forward formula of the (l+1)-th WE-MLP layer is: X^{(l+1)} = log(Barycenter_Update(Normalize(exp(X^{(l)})))). After that, we feed the output, which serves as the updated node representations, to a two-layer MLP for node classification.

Algorithm 2 WGD adapted for node classification on graphs with missing attributes
input: attribute matrix X ∈ R^{n×m} containing initial values, rank k.
1: Apply k-rank SVD to X: U_k, Λ_k, V_k = SVD(X, k)
2: U_k^{(0)} ← U_k
3: for l = 1 → L do
4:   space transformation: Ũ_k^{(l−1)} = Normalize(exp(row(U_k^{(l−1)})))
5:   Û_k^{(0)} ← Ũ_k^{(l−1)}
6:   for i = 1 → h do
7:     Wasserstein diffusion: Û_k^{(i)} = Barycenter_Update(Û_k^{(i−1)}, D)
8:   end for
9:   inverse transformation: U_k^{(l)} = Gram_Schmidt_Ortho(col(log(Û_k^{(h)})))
10: end for
11: Ū_k = mean(U_k^{(1)}, ..., U_k^{(L)})
12: X̄ = Ū_k Λ_k V_k^⊤
13: Apply MLP for node classification: Ŷ = MLP(X̄)

Experimental Settings. In the node classification experiments, we implicitly evaluate the ability of WGD to produce high-quality node representations from incomplete attribute information. Since the node attributes of common node classification benchmarks, such as citation networks, are usually fully collected, we artificially remove some attributes at random. To thoroughly and quantitatively evaluate representation capacity, we conduct experiments on three common node classification benchmarks, Cora, Citeseer and PubMed, in two settings: a. Partially Missing: some entries of the feature matrix are missing. Precisely, we randomly remove the values of the attribute matrix in a given proportion (from 10% to 90%). b. Entirely Missing: some nodes have complete attributes while the others have no features at all. In this case, we remove some rows of the feature matrix in a given proportion (from 10% to 90%) at random. We apply various traditional imputation approaches, including zero-impute, mean-impute, soft-impute (Mazumder et al., 2010), and KNN-impute (Batista et al., 2002), to complete the missing values and use a two-layer GCN to classify nodes based on the imputed feature matrices. We call these ZERO-GCN, MEAN-GCN, SOFT-GCN and KNN-GCN respectively. Here, zero-impute and mean-impute mean that we replace missing values with zero or with the mean of the observed features, respectively.
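The two missing-data settings and the mean-impute baseline can be reproduced with a few lines of NumPy. This is our sketch of the setup; the masking conventions are assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 20))            # toy fully observed attribute matrix

def partially_missing(X, rate, rng):
    # hide a given fraction of individual entries of the attribute matrix
    M = rng.random(X.shape) < rate   # True = missing
    return np.where(M, np.nan, X), M

def entirely_missing(X, rate, rng):
    # hide all attributes of a fraction of the nodes (rows)
    rows = rng.random(X.shape[0]) < rate
    Xm = X.copy()
    Xm[rows] = np.nan
    return Xm, rows

def mean_impute(Xm):
    # replace NaNs with the column-wise mean of the observed entries
    col_mean = np.nanmean(Xm, axis=0)
    return np.where(np.isnan(Xm), col_mean, Xm)

Xm, M = partially_missing(X, 0.5, rng)
X_imputed = mean_impute(Xm)
```

The imputed matrix keeps every observed entry unchanged and fills only the hidden ones.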
Moreover, we take the performance of GCN leveraging only graph structure, i.e., with the identity matrix as the feature matrix, as a lower bound, named GCN NOFEAT. In addition, we use the Label Propagation (LP) algorithm as a stronger lower bound. If the performance of a model falls below these lower bounds, it indicates that the model is unable to utilize incomplete attributes effectively. In all benchmarks and experimental settings, we fix most of the hyperparameters of WGD-MLP: 7 WGD layers with two Barycenter_Updates per layer, and a two-layer MLP with 128 hidden units and 0.5 dropout. We apply rank-64 SVD to the incomplete attribute matrix, in which we replace the missing values with the mean of the observed features. The settings of the GCN used in all baselines and in SVD-GCN are the same, i.e., 2 layers with 32 hidden units. For all models, we use the Adam optimizer with a 0.01 learning rate and weight decay depending on the missing setting. We early-stop the training process with patience 100, select the best-performing models based on validation set accuracy, and report the mean accuracy over 10 runs as the final results. For more experimental details, please refer to our code: https://anonymous.4open.science/r/3507bfe0-b3b1-4d18-a7b2-eb3643ceedb1

4.2. Multi-Graph Matrix completion

An interesting aspect of the WGD framework is that it allows us to reconstruct the feature matrix with the introduction of reconstruction constraints and neural networks. In this section, we test the ability of WGD to reconstruct the missing values of a multi-graph matrix, or more precisely, of recommendation systems with additional information about the similarity of users and items, represented as a user graph and an item graph, respectively. Algorithm 3 summarizes the WGD framework adapted for matrix completion tasks. Assume that X is the incomplete feature matrix of the user graph; then X^⊤ is that of the item graph. Leveraging k-rank SVD, we have U_k, Λ_k, V_k = SVD(X, k). Through the space transformation (2), we obtain φ(U_k), the discrete distributions of users with supp(U_k) = col(V_k). In the same way, we have V_k, Λ_k, U_k = SVD(X^⊤, k) and φ(V_k) with supp(V_k) = col(U_k). Typically, updating the node distributions on one graph would lead to unexpected changes in the support points of the other graph. However, this is not the case in WGD, as the predefined distance matrix D (3) depends only on the fixed Λ_k. Therefore, supp(U_k) and supp(V_k) share a common distance matrix. This implies that the two Wasserstein diffusion processes, formulated by (5), can run simultaneously. Informally, this generalized WGD framework, called multi-graph WGD, can be thought of as an overlay of two original WGDs. The difference appears in the last step: in multi-graph WGD, we concatenate the outputs of each layer, feed them to a simple MLP, and normalize the columns of the learned U_k and V_k to unit vectors. Benchmarks. We conduct experiments on two popular multi-graph matrix completion datasets: Flixster and MovieLens-100K, processed by Monti et al. (2017). Baselines.
We compare our multi-graph WGD framework with several advanced matrix completion methods, including GRALS (Rao et al., 2015), sRMGCNN (Monti et al., 2017), GC-MC (Berg et al., 2017), F-EAE (Hartford et al., 2018), and IGMC (Zhang & Chen, 2020). GRALS is a graph regularization method and sRMGCNN is a factorized matrix model. GC-MC directly applies a GCN for link prediction on the user-item bipartite graph. F-EAE and IGMC are inductive matrix completion methods that do not use side information. The former leverages exchangeable matrix layers, while the latter focuses on local subgraphs around each rating and trains a GNN to map the subgraphs to ratings.

Algorithm 3 multi-graph WGD adapted for matrix completion
input: attribute matrix X ∈ R^{n×m} containing initial values, rank k.
1: Apply k-rank SVD to X: U_k, Λ_k, V_k = SVD(X, k)
2: U_k^{(0)} ← U_k, V_k^{(0)} ← V_k
3: for l = 1 → L do
4:   space transformation: Ũ_k^{(l−1)} = Normalize(exp(row(U_k^{(l−1)})))
5:   space transformation: Ṽ_k^{(l−1)} = Normalize(exp(row(V_k^{(l−1)})))
6:   Û_k^{(0)} ← Ũ_k^{(l−1)}, V̂_k^{(0)} ← Ṽ_k^{(l−1)}
7:   for i = 1 → h do
8:     Wasserstein diffusion: Û_k^{(i)} = Barycenter_Update(Û_k^{(i−1)}, D)
9:     Wasserstein diffusion: V̂_k^{(i)} = Barycenter_Update(V̂_k^{(i−1)}, D)
10:   end for
11:   inverse transformation: U_k^{(l)} = Gram_Schmidt_Ortho(col(log(Û_k^{(h)})))
12:   inverse transformation: V_k^{(l)} = Gram_Schmidt_Ortho(col(log(V̂_k^{(h)})))
13: end for

Experimental Settings and Results. We follow the experimental setup of Monti et al. (2017) and use the common metric Root Mean Square Error (RMSE) to evaluate the accuracy of matrix completion. We set rank = 10, the same as sRMGCNN, for both Flixster and MovieLens-100K. In addition, we use a 4-layer MLP with 160 hidden units in all experiments. We set L = 5, h = 1 for Flixster and L = 7, h = 1 for MovieLens-100K. We train the model using the Adam optimizer with a 0.001 learning rate.
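The evaluation metric is standard; for completeness, here is a short sketch of RMSE over the observed test ratings (the function name and toy values are ours):

```python
import numpy as np

def rmse(pred, target, mask):
    # root mean square error over the rated (observed) entries only
    diff = (pred - target)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

target = np.array([[5.0, 3.0], [1.0, 4.0]])   # toy rating matrix
pred = target + 1.0                           # constant error of one star
mask = np.ones_like(target, dtype=bool)
```

Here every prediction is off by exactly one star, so `rmse(pred, target, mask)` evaluates to 1.0.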

5. Conclusion

Graphs with missing node attributes are ubiquitous in the real world, yet most node representation learning approaches have limited ability to handle such incomplete information. To mitigate this problem, we introduced matrix decomposition and Wasserstein graph diffusion for representation learning, such that observed node features can be transformed into discrete distributions and diffused along the graph. We developed a general framework that produces high-quality node representations with a powerful ability to represent attribute and structure information, and adapted the framework for two applications: node classification and matrix completion. Extensive experiments on node classification under two missing settings verified the powerful representation capacity and superior robustness of our framework. Our model also proved effective at recovering missing features in matrix completion.



Figure 1: The WGD framework involves transformations among three spaces: the feature space, the principal component space, and the discrete Wasserstein space. SVD is leveraged to generate initial node representations U_k^{(0)} in the principal component space, which are transformed into discrete distributions endowed with the Wasserstein distance through a particular reversible positive mapping. In this discrete Wasserstein space, we utilize the Wasserstein barycenter to aggregate distributional information from h-hop neighbors. In each WGD layer, the updated node distributions are transformed back to the principal component space to update the corresponding node representations, which are also the input of the next layer. The support points are shared over layers. After L updates, we pull U_k^{(L)} back to the original feature space through the inverse mapping to generate new node representations.

Figure 3: Sensitivity analysis for the number of WGD layers L and the number h of Wasserstein diffusion steps in each layer. The results (node classification accuracy on the Cora dataset) show that our method prevents over-smoothing.

Algorithm 3 (continued)
15: Reconstruct U_k: Ū_k = L2_Normalize(MLP_u(Û_k))
16: Reconstruct V_k: V̄_k = L2_Normalize(MLP_v(V̂_k))
17: return X̄ = Ū_k Λ_k V̄_k^⊤

Experimental Results. As shown in Fig. 2, WGD-MLP outperforms all baselines, especially when the missing rate is over 50%. The baselines show a clear downward trend in both types of missing attributes. In contrast, the flat curve of WGD-MLP reflects its remarkable robustness. On Cora and Citeseer, comparing the lower bound provided by GCN NOFEAT with the baselines confirms that imputation strategies are ineffective and even counterproductive when observed attributes are grossly inadequate. By contrast, WGD consistently shows satisfactory performance, even when 90% of attributes are unavailable. For instance, in both types of missing data, the performance of WGD falls by only 10% and 5%, respectively, on the Cora and Pubmed datasets. This illustrates that WGD can significantly reduce the information distortion caused by missing data and learn effective latent node representations. On the other hand, the inferior performance of SVD-GCN convincingly demonstrates the effectiveness of Wasserstein diffusion. For WE-MLP, we conduct experiments on Cora and Pubmed with a fixed missing rate. Sensitivity Analysis. To show how the number of WGD layers L and the number of Wasserstein diffusion steps h per layer influence node classification performance, we take Cora as an example and present the results in Fig. 3. As we know, many GNN models encounter the over-smoothing issue when they go deep; however, Figure 3 shows contrary results and demonstrates that our method can efficiently handle over-smoothing. This is due to two strategies, self-connection (similar to the strategy in APPNP (Klicpera et al., 2019a)) and orthogonalization, which keep node representations distinct.

Table 1: RMSE test results on Flixster and MovieLens-100K.

Table 1 presents the experimental results. As we can see, our WGD-MLP model outperforms most methods and achieves performance comparable to the state-of-the-art model IGMC. However, the number of parameters in our model is much smaller than in the GNN-based IGMC.

