CONFIDENCE-BASED FEATURE IMPUTATION FOR GRAPHS WITH PARTIALLY KNOWN FEATURES

Abstract

This paper investigates a missing feature imputation problem for graph learning tasks. Several methods have previously addressed learning tasks on graphs with missing features. However, in cases of high rates of missing features, they were unable to avoid significant performance degradation. To overcome this limitation, we introduce a novel concept of channel-wise confidence in a node feature, which is assigned to each imputed channel feature of a node to reflect the certainty of the imputation. We then design pseudo-confidence using the channel-wise shortest path distance between a missing-feature node and its nearest known-feature node to replace unavailable true confidence in an actual learning process. Based on the pseudo-confidence, we propose a novel feature imputation scheme that performs channel-wise inter-node diffusion and node-wise inter-channel propagation. The scheme remains effective even at an exceedingly high missing rate (e.g., 99.5%) and achieves state-of-the-art accuracy for both semi-supervised node classification and link prediction on various datasets containing a high rate of missing features. Code is available at https://github.com/daehoum1/pcfi.

1. INTRODUCTION

In recent years, graph neural networks (GNNs) have received considerable attention and have performed outstandingly on numerous problems across multiple fields (Zhou et al., 2020; Wu et al., 2020). While various GNNs handling attributed graphs are designed for node representation (Defferrard et al., 2016; Kipf & Welling, 2016a; Veličković et al., 2017; Xu et al., 2018) and graph representation learning (Kipf & Welling, 2016b; Sun et al., 2019; Veličković et al., 2019), GNN models typically assume that features of all nodes are fully observed. In real-world situations, however, features in graph-structured data are often partially observed, as illustrated in the following cases. First, collecting complete data for a large graph is prohibitively expensive or even impossible. Second, measurement failure is common. Third, in social networks, most users desire to protect their personal information selectively. As data security regulation continues to tighten around the world (e.g., GDPR), access to full data is expected to become increasingly difficult. Under these circumstances, most GNNs cannot be applied directly due to incomplete features. Several methods have been proposed to solve learning tasks with graphs containing missing features (Jiang & Zhang, 2020; Chen et al., 2020; Taguchi et al., 2021), but they suffer from significant performance degradation at high rates of missing features. A recent work (Rossi et al., 2021) demonstrated improved performance by introducing feature propagation (FP), which iteratively propagates known features among the nodes along edges. However, even FP cannot avoid a considerable accuracy drop at an extremely high missing rate (e.g., 99.5%). We conjecture that this is because FP performs graph diffusion through undirected edges. Consequently, in FP, message passing between two nodes occurs with the same strength regardless of the direction.
Moreover, FP only diffuses observed features channel-wise, which means that it does not consider any relationship between channels. Therefore, to better impute missing features in a graph, we propose to consider both inter-channel and inter-node relationships so that we can effectively exploit the sparsely known features. To this end, we design an elaborate feature imputation scheme that includes two processes. The first process is the feature recovery via channel-wise inter-node diffusion, and the second is the feature refinement via node-wise inter-channel propagation. The first process diffuses features by assigning different importance to each recovered channel feature, in contrast to usual diffusion. To this end, we introduce a novel concept of channel-wise confidence, which reflects the quality of channel feature recovery. This confidence is also used in the second process, where channel features are refined based on highly confident features by utilizing the inter-channel correlation. The true confidence in a missing channel feature is inaccessible without every actual feature. Thus, we define pseudo-confidence for use in our scheme instead of true confidence. Using channel-wise confidence, a less confident channel feature is further refined by aggregating the highly confident channel features in each node or the highly confident channel features diffused from neighboring nodes. The key contributions of our work are summarized as follows: (1) We propose a new concept of channel-wise confidence that represents the quality of a recovered channel feature. (2) We design a method to provide pseudo-confidence that can be used in place of unavailable true confidence in a missing channel feature. (3) Based on the pseudo-confidence, we propose a novel feature imputation scheme that achieves state-of-the-art performance for node classification and link prediction even at an extremely high rate (e.g., 99.5%) of missing features.

2. RELATED WORK

2.1. LEARNING ON GRAPHS WITH MISSING NODE FEATURES

The problem of missing data has been widely investigated in the literature (Allison, 2001; Loh & Wainwright, 2011; Little & Rubin, 2019; You et al., 2020). Recently, focusing on graph-structured data with pre-defined connectivity, there have been several attempts to learn on graphs with missing node features. Monti et al. (2017) proposed recurrent multi-graph convolutional neural networks (RMGCNN) and separable RMGCNN (sRMGCNN), a scalable version of RMGCNN. Structure-attribute transformer (SAT) (Chen et al., 2020) models the joint distribution of graph structures and node attributes through distribution techniques, then completes missing node attributes. GCN for missing features (GCNMF) (Taguchi et al., 2021) adapts graph convolutional networks (GCN) (Kipf & Welling, 2016a) to graphs that contain missing node features by representing the missing features using a Gaussian mixture model. Meanwhile, a partial graph neural network (PaGNN) (Jiang & Zhang, 2020) leverages a partial message-propagation scheme that considers only known features during propagation. However, these methods experience large performance degradation when the feature missing rate is high. Feature propagation (FP) (Rossi et al., 2021) reconstructs missing features by diffusing known features. However, in the diffusion of FP, a missing feature is formed by aggregating features from neighboring nodes regardless of whether a feature is known or inferred. Moreover, FP does not consider any interdependency among feature channels. To utilize relationships among channels, we construct a correlation matrix of recovered features and additionally refine the features.

2.2. DISTANCE ENCODING

Distance encoding (DE) on graphs defines extra features using the distance from a node to the node set where the prediction is made. Zhang & Chen (2018) extract a local enclosing subgraph around each target node pair and use a GNN to learn graph structure features for link prediction. Li et al. (2020) exploit structure-related features called DE that encode the distance between a node and its neighboring node set with graph-distance measures (e.g., shortest path distance or generalized PageRank scores (Li et al., 2019)). Zhang et al. (2021) unify the aforementioned techniques into a labeling trick. Heterogeneous graph neural network (HGNN) (Ji et al., 2021) proposes a heterogeneous distance encoding in consideration of multiple types of paths in enclosing subgraphs of heterogeneous graphs. Distance encoding in existing methods improves the representation power of GNNs. We use distance encoding to distinguish missing features based on the shortest path distance from a missing feature to known features in the same channel.

[...] While these matrices work well for each target task, from a node's perspective, the sum of edge weights for aggregating features is not one in general. Therefore, since features are not updated at the same scale as the original features, these matrices are not suitable for missing feature recovery.

3.1. OVERVIEW

We address graph learning tasks in which node features are partially missing. To demonstrate the effectiveness of our feature imputation, we target two main graph learning tasks. The first target task, semi-supervised node classification, is to infer the labels of the unlabeled nodes from the partially known features/labels and the fully known graph structure. The second target task, link prediction, is to predict whether two nodes are likely to share a link. Figure 1 depicts the overall scheme of the proposed feature imputation. Our key idea is to assign a different pseudo-confidence to each imputed channel feature. To this end, the proposed imputation scheme includes two processes. The first process is the feature recovery via channel-wise inter-node diffusion, and the second is the feature refinement via node-wise inter-channel propagation. The imputed features obtained from the two processes are used for downstream tasks via off-the-shelf GNNs. In Sec. 3.2, we begin by introducing the notations used in this paper. In Sec. 3.3, we outline the proposed PC (pseudo-confidence)-based feature imputation (PCFI) scheme that imputes missing node features. We then propose a method to determine the pseudo-confidence in Sec. 3.4. In Sec. 3.5, we present channel-wise inter-node diffusion that iteratively propagates known features with consideration of PC. In Sec. 3.6, we present node-wise inter-channel propagation that adjusts features based on correlation coefficients between channels.

3.2. NOTATIONS

Basic notation on graphs. An undirected connected graph is represented as $G = (V, E, A)$, where $V = \{v_i\}_{i=1}^{N}$ is the set of $N$ nodes, $E$ is the edge set with $(v_i, v_j) \in E$, and $A \in \{0, 1\}^{N \times N}$ denotes an adjacency matrix. $X = [x_{i,d}] \in \mathbb{R}^{N \times F}$ is a node feature matrix with $N$ nodes and $F$ channels, i.e., $x_{i,d}$ is the $d$-th channel feature value of the node $v_i$. $\mathcal{N}(v_i)$ denotes the set of neighbors of $v_i$. Given an arbitrary matrix $M \in \mathbb{R}^{n \times m}$, we let $M_{i,:}$ denote the $i$-th row vector of $M$. Similarly, we let $M_{:,j}$ denote the $j$-th column vector of $M$.

Notation for graphs with missing node features. As we assume that partial or even very few node features are known, we define $V_k^{(d)}$ as the set of nodes whose $d$-th channel feature value is known, and $V_u^{(d)} = V \setminus V_k^{(d)}$ as the set of nodes whose $d$-th channel feature value is unknown. $V_k^{(d)}$ and $V_u^{(d)}$ are referred to as source nodes and missing nodes, respectively. By reordering the nodes according to whether a feature value is known or not for the $d$-th channel, we can write the graph signal for the $d$-th channel features and the adjacency matrix as

$$x^{(d)} = \begin{bmatrix} x_k^{(d)} \\ x_u^{(d)} \end{bmatrix}, \qquad A^{(d)} = \begin{bmatrix} A_{kk}^{(d)} & A_{ku}^{(d)} \\ A_{uk}^{(d)} & A_{uu}^{(d)} \end{bmatrix}.$$

Here, $x^{(d)}$, $x_k^{(d)}$, and $x_u^{(d)}$ are column vectors that represent the corresponding graph signals. Since the graph is undirected, $A^{(d)}$ is symmetric and thus $(A_{ku}^{(d)})^\top = A_{uk}^{(d)}$. Note that $A^{(d)}$ is different from $A$ due to reordering, while they represent the same graph structure. $\hat{X} = [\hat{x}_{i,d}]$ denotes recovered features for $X$ from $\{x_k^{(d)}\}_{d=1}^{F}$ and $\{A^{(d)}\}_{d=1}^{F}$.
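The per-channel reordering can be illustrated with a short sketch. It is a minimal example under stated assumptions: the helper name `reorder_for_channel`, the list-of-lists adjacency format, and the toy path graph are hypothetical and are not taken from the released code.

```python
def reorder_for_channel(adj, known_mask):
    """Permute nodes so that source (known) nodes come first for one channel.

    adj: N x N list of lists with 0/1 entries (symmetric adjacency matrix A).
    known_mask: list of N booleans, True where the channel value is observed.
    Returns (perm, reordered), where position p of the new ordering holds
    original node perm[p], and reordered plays the role of A^(d).
    """
    n = len(adj)
    known = [i for i in range(n) if known_mask[i]]
    unknown = [i for i in range(n) if not known_mask[i]]
    perm = known + unknown
    reordered = [[adj[a][b] for b in perm] for a in perm]
    return perm, reordered


# Toy path graph 0 - 1 - 2 - 3; nodes 1 and 3 are source nodes for this channel.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
perm, adj_d = reorder_for_channel(adj, [False, True, False, True])

# Block structure: the top-right block is A_ku, the bottom-left block is A_uk.
n_known = 2
a_ku = [row[n_known:] for row in adj_d[:n_known]]
a_uk = [row[:n_known] for row in adj_d[n_known:]]
```

In this toy case the bottom-left block equals the transpose of the top-right block, as expected for an undirected graph.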

3.3. PC-BASED FEATURE IMPUTATION

The proposed PC-based feature imputation (PCFI) scheme leverages the shortest path distance between nodes to compute pseudo-confidence. PCFI consists of two stages: channel-wise inter-node diffusion and node-wise inter-channel propagation. The first stage, channel-wise inter-node diffusion, finds $\hat{X}$ (recovered features for $X$) through PC-based feature diffusion on a given graph $G$. Then, the second stage, node-wise inter-channel propagation, refines $\hat{X}$ to the final imputed features $\tilde{X}$ by considering PC and correlation between channels. To perform node classification and link prediction, a GNN is trained with the imputed node features $\tilde{X}$. In this work, PCFI is designed to perform the downstream tasks well. However, since PCFI is independent of the type of learning task, it is not limited to these two tasks. Therefore, it can be applied to various graph learning tasks with missing node features. Formally, the proposed framework can be expressed as

$$\hat{X} = f_1\big(\{x_k^{(d)}\}_{d=1}^{F}, \{A^{(d)}\}_{d=1}^{F}\big), \tag{1a}$$
$$\tilde{X} = f_2(\hat{X}), \tag{1b}$$
$$\hat{Y} = g_\theta(\tilde{X}, A), \tag{1c}$$

where $f_1$ is channel-wise inter-node diffusion, $f_2$ is node-wise inter-channel propagation, and $\hat{Y}$ is a prediction for the desired output of a given task. Here, PCFI is expressed as $f_2 \circ f_1$, and any GNN architecture can be adopted as $g_\theta$ according to the type of task.

3.4. PSEUDO-CONFIDENCE

We begin by defining the concept of confidence in the recovered feature $\hat{x}_{i,d}$ of a node $v_i$ for channel $d$ in the first process.

Definition 1. Confidence in the recovered channel feature $\hat{x}_{i,d}$ is defined by the similarity between $\hat{x}_{i,d}$ and the true one $x_{i,d}$, which is a value between 0 and 1.

Note that the feature $x_{i,d}$ of a source node is observed and thus its confidence becomes 1. When the recovered $\hat{x}_{i,d}$ is far from the true $x_{i,d}$, the confidence in $\hat{x}_{i,d}$ decreases towards 0. However, determining $\hat{x}_{i,d}$ and its confidence is a chicken-and-egg problem. That is, the confidence in $\hat{x}_{i,d}$ is unavailable before attaining $\hat{x}_{i,d}$ according to Definition 1, whereas the proposed scheme cannot yield $\hat{x}_{i,d}$ without the confidence. To navigate this issue, instead of true confidence, we design a pseudo-confidence using the shortest path distance between a node and its nearest source node for a specific channel (SPD-S). For instance, SPD-S of the $i$-th node for the $d$-th channel feature is denoted by $S_{i,d}$, which is calculated via

$$S_{i,d} = s(v_i \mid V_k^{(d)}, A^{(d)}),$$

where $s(\cdot)$ yields the shortest path distance between the $i$-th node and its nearest source node in $V_k^{(d)}$ on $A^{(d)}$. [...] (..., 2012). Due to feature homophily, the feature similarity between any two nodes tends to increase as the shortest path distance between the two nodes decreases. Based on feature homophily, we assume that the recovered feature $\hat{x}_{i,d}$ of a node $v_i$ is more likely to be similar to the given feature of its nearest source node as SPD-S of $v_i$ ($S_{i,d}$) decreases. According to this assumption, we define pseudo-confidence using SPD-S in Definition 2.

Definition 2. Pseudo-confidence (PC) in $\hat{x}_{i,d}$ is defined by a function $\xi_{i,d} = \alpha^{S_{i,d}}$, where $\alpha \in (0, 1)$ is a hyper-parameter.

By Definition 2, PC becomes 1 for $\hat{x}_{i,d} = x_{i,d}$ on source nodes. Moreover, PC decreases exponentially for a missing node feature as $S_{i,d}$ increases. In this way, PC reflects the tendency of the confidence in Definition 1. We verified that this tendency exists regardless of imputation methods via experiments on real datasets (see Figure 7 in the Appendix). Therefore, pseudo-confidence using SPD-S is properly designed to replace confidence. To the best of our knowledge, ours is the first model that leverages a distance for graph imputation.
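As an illustration of Definition 2, SPD-S for one channel can be obtained with a multi-source breadth-first search, after which PC follows as $\alpha^{S_{i,d}}$. The sketch below is a minimal example under stated assumptions: the function names `spd_to_nearest_source` and `pseudo_confidence`, the adjacency-list input, and the toy path graph are hypothetical and are not taken from the released code. It also assumes a connected graph with at least one source node, as in Sec. 3.2.

```python
from collections import deque


def spd_to_nearest_source(neighbors, sources):
    """Multi-source BFS: hop distance from every node to its nearest source.

    neighbors: dict mapping each node to a list of adjacent nodes.
    sources: iterable of nodes whose channel value is known (S = 0).
    Assumes every node is reachable from some source (connected graph).
    """
    dist = {v: None for v in neighbors}
    queue = deque()
    for s in sources:
        dist[s] = 0
        queue.append(s)
    while queue:
        v = queue.popleft()
        for w in neighbors[v]:
            if dist[w] is None:  # first visit gives the shortest distance
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def pseudo_confidence(dist, alpha):
    """Definition 2: xi = alpha ** S, with alpha in (0, 1)."""
    return {v: alpha ** s for v, s in dist.items()}


# Toy path graph 0 - 1 - 2 - 3 - 4; node 0 is the only source for this channel.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
S = spd_to_nearest_source(neighbors, sources=[0])
xi = pseudo_confidence(S, alpha=0.5)
```

Here the source node receives PC 1 and the PC halves with every additional hop, which is the exponential decay described above.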

3.5. CHANNEL-WISE INTER-NODE DIFFUSION

To recover missing node features in a channel-wise manner via graph diffusion, source nodes independently propagate their features to their neighbors for each channel. Instead of simply aggregating all neighborhood features with the same weights, our scheme aggregates features with different importance according to their confidences. As a result, the recovered features of missing nodes are aggregated with low confidence and the given features of source nodes are aggregated with high confidence, which is our design objective. To this end, we design a novel diffusion matrix based on the pseudo-confidence. For the design, Definition 3 first defines 'Relative PC', which represents the amount of PC in a particular node feature relative to another node feature.

Definition 3. Relative PC of $\hat{x}_{j,d}$ relative to $\hat{x}_{i,d}$ is defined by $\xi_{j/i,d} = \xi_{j,d} / \xi_{i,d} = \alpha^{S_{j,d} - S_{i,d}}$.

Then, suppose that a missing node feature $x_{i,d}$ of $v_i$ aggregates features from $v_j \in \mathcal{N}(v_i)$. If $v_i$ and $v_j$ are neighbors, the difference between SPD-S of $v_i$ and SPD-S of $v_j$ cannot exceed 1. Hence, the relative PC of a node to its neighbor can be determined using Proposition 1.

Proposition 1. If $S_{i,d} = m \geq 1$, i.e., $v_i$ is a missing node, then $\xi_{j/i,d}$ for $v_j \in \mathcal{N}(v_i)$ is given by

$$\xi_{j/i,d} = \begin{cases} \alpha^{-1} & \text{if } S_{i,d} > S_{j,d}, \\ 1 & \text{if } S_{i,d} = S_{j,d}, \\ \alpha & \text{if } S_{i,d} < S_{j,d}. \end{cases}$$

Otherwise, if $v_i$ is a source node ($S_{i,d} = 0$), then $\xi_{j/i,d}$ for $v_j \in \mathcal{N}(v_i)$ is given by

$$\xi_{j/i,d} = \begin{cases} 1 & \text{if } v_j \text{ is a source node } (S_{j,d} = 0), \\ \alpha & \text{if } v_j \text{ is a missing node } (S_{j,d} = 1). \end{cases}$$

The proof of Proposition 1 is given in Appendix A.1. Before defining a transition matrix, we temporarily reorder nodes according to whether a feature value is known for the $d$-th channel, i.e., $x^{(d)}$ and $A^{(d)}$ are reordered for each channel as Section 3.2 describes. After the feature diffusion stage, we restore the original node ordering. Built on Proposition 1, we construct a weighted adjacency matrix $W^{(d)}$ for the $d$-th channel.
$W^{(d)} \in \mathbb{R}^{N \times N}$ is defined as follows:

$$W^{(d)}_{i,j} = \begin{cases} \xi_{j/i,d} & \text{if } i \neq j,\ A^{(d)}_{i,j} = 1, \\ 0 & \text{if } i \neq j,\ A^{(d)}_{i,j} = 0, \\ 1 & \text{if } i = j. \end{cases}$$

Note that self-loops are added to $W^{(d)}$ with a weight of 1 so that each node can keep some of its own feature. $W^{(d)}_{i,j}$ is an edge weight corresponding to message passing from $v_j$ to $v_i$. Proposition 1 implies that $\alpha^{-1}$ is assigned to high-PC neighbors, 1 to same-PC neighbors, and $\alpha$ to low-PC neighbors. That is, $W^{(d)}$ allows a node to aggregate high-PC channel features more than low-PC channel features from its neighbors. Furthermore, consider message passing between two connected nodes $v_i$ and $v_j$ such that $W^{(d)}_{i,j} = \xi_{j/i,d} = \alpha$. By Definition 3, $\xi_{i/j,d} = \xi_{j/i,d}^{-1}$, so that $W^{(d)}_{j,i} = (W^{(d)}_{i,j})^{-1} = \alpha^{-1}$. This means that message passing from a highly confident node to a less confident node occurs in a large amount, while message passing in the opposite direction occurs in a small amount. The hyper-parameter $\alpha$ tunes the strength of message passing depending on the confidence.

To ensure convergence of the diffusion process, we normalize $W^{(d)}$ to $\overline{W}^{(d)} = (D^{(d)})^{-1} W^{(d)}$ through row-stochastic normalization with $D^{(d)}_{ii} = \sum_j W^{(d)}_{i,j}$. Since $x_k^{(d)}$ with true feature values should be preserved, we replace the first $|V_k^{(d)}|$ rows of $\overline{W}^{(d)}$ with one-hot vectors indicating $V_k^{(d)}$. Finally, the channel-wise inter-node diffusion matrix $\widehat{W}^{(d)}$ for the $d$-th channel is expressed as

$$\widehat{W}^{(d)} = \begin{bmatrix} I & 0_{ku} \\ \overline{W}^{(d)}_{uk} & \overline{W}^{(d)}_{uu} \end{bmatrix},$$

where $I \in \mathbb{R}^{|V_k^{(d)}| \times |V_k^{(d)}|}$ is an identity matrix and $0_{ku} \in \{0\}^{|V_k^{(d)}| \times |V_u^{(d)}|}$ is a zero matrix. Note that $\widehat{W}^{(d)}$ remains row-stochastic despite the replacement. An aggregation in a specific node can be regarded as a weighted sum of features on neighboring nodes. A row-stochastic transition matrix means that when a node aggregates features from its neighbors, the sum of the weights is 1.
The diffusion for the $d$-th channel is defined by the recursion

$$\hat{x}^{(d)}(0) = \begin{bmatrix} x_k^{(d)} \\ 0_u \end{bmatrix}, \qquad \hat{x}^{(d)}(t) = \widehat{W}^{(d)} \hat{x}^{(d)}(t-1),$$

where $\hat{x}^{(d)}(t)$ is a recovered feature vector for $x^{(d)}$ after $t$ propagation steps and $0_u$ is a zero column vector of size $|V_u^{(d)}|$; that is, the missing feature values are initialized to zero. As the number of propagation steps $K \to \infty$, this recursion converges (the proof is provided in Appendix A.2). We approximate the steady state by $\hat{x}^{(d)}(K)$, which is calculated as $(\widehat{W}^{(d)})^K \hat{x}^{(d)}(0)$ with large enough $K$. The diffusion is performed for each channel and outputs $\{\hat{x}^{(d)}(K)\}_{d=1}^{F}$. Due to the reordering of nodes for each channel before the diffusion, node indices in $\hat{x}^{(d)}(K)$ for $d \in \{1, ..., F\}$ differ. Therefore, after restoring the original node order of $X$ in each $\hat{x}^{(d)}(K)$, we concatenate all $\hat{x}^{(d)}(K)$ along the channels into $\hat{X}$, which is the final output of this stage.
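The construction of the diffusion matrix and the recursion can be sketched for a single channel as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names `pc_diffusion_matrix` and `diffuse` are hypothetical, dense Python lists are used instead of sparse tensors, nodes are kept in their original order (source rows are identified by SPD-S equal to 0 instead of by reordering), and the toy graph and values are made up.

```python
def pc_diffusion_matrix(adj, spd, alpha):
    """Build the row-stochastic PC-weighted diffusion matrix for one channel.

    adj: N x N 0/1 adjacency matrix (list of lists).
    spd: list of SPD-S values S_i for this channel (0 marks a source node).
    Rows of source nodes are one-hot so that known values are preserved.
    """
    n = len(adj)
    rows = []
    for i in range(n):
        if spd[i] == 0:
            rows.append([1.0 if j == i else 0.0 for j in range(n)])
            continue
        raw = [0.0] * n
        raw[i] = 1.0  # self-loop with weight 1
        for j in range(n):
            if j != i and adj[i][j] == 1:
                raw[j] = alpha ** (spd[j] - spd[i])  # relative PC of j w.r.t. i
        total = sum(raw)
        rows.append([w / total for w in raw])  # row-stochastic normalization
    return rows


def diffuse(w_hat, x_known, spd, num_steps):
    """Iterate x(t) = W_hat x(t-1), starting with zeros at missing nodes."""
    x = [x_known[i] if spd[i] == 0 else 0.0 for i in range(len(spd))]
    for _ in range(num_steps):
        x = [sum(w * v for w, v in zip(row, x)) for row in w_hat]
    return x


# Toy path graph 0 - 1 - 2 - 3 with sources 0 (value 0.0) and 3 (value 1.0).
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
spd = [0, 1, 1, 0]
w_hat = pc_diffusion_matrix(adj, spd, alpha=0.5)
x_rec = diffuse(w_hat, x_known=[0.0, None, None, 1.0], spd=spd, num_steps=100)
```

In this toy case each missing node weights its source neighbor twice as much as its missing neighbor, so the recovered values settle near 0.25 and 0.75 while the source values stay fixed.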

3.6. NODE-WISE INTER-CHANNEL PROPAGATION

In the previous stage, we obtained $\hat{X} = [\hat{x}_{i,d}]$ (recovered features for $X$) via channel-wise inter-node diffusion performed separately for each channel. The proposed feature diffusion is enacted based on the graph structure and pseudo-confidence, but it does not consider dependency between channels. Since the dependency between channels can be another important factor for imputing missing node features, we develop an additional scheme to refine $\hat{X}$ to improve the performance of downstream tasks by considering both channel correlation and pseudo-confidence. At this stage, within a node, a low-PC channel feature is refined by reflecting a high-PC channel feature according to the degree of correlation between the two channels. We first prepare a correlation coefficient matrix $R = [R_{a,b}] \in \mathbb{R}^{F \times F}$, giving the correlation coefficient between each pair of channels for the proposed scheme. $R_{a,b}$, the correlation coefficient between $\hat{X}_{:,a}$ and $\hat{X}_{:,b}$, is calculated by

$$R_{a,b} = \frac{\frac{1}{N-1} \sum_{i=1}^{N} (\hat{x}_{i,a} - m_a)(\hat{x}_{i,b} - m_b)}{\sigma_a \sigma_b},$$

where $m_d = \frac{1}{N} \sum_{i=1}^{N} \hat{x}_{i,d}$ and $\sigma_d = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (\hat{x}_{i,d} - m_d)^2}$.

In this stage, unlike looking across the nodes for each channel in the previous stage, we look across the channels for each node. As the right-hand graph of Figure 1 illustrates, we define fully connected directed graphs $\{H^{(i)}\}_{i=1}^{N}$, called node-wise inter-channel propagation graphs, from the given graph $G$. $H^{(i)}$ for the $i$-th node in $G$ is defined by $H^{(i)} = (V^{(i)}, E^{(i)}, B^{(i)})$, where $V^{(i)} = \{v^{(i)}_d\}_{d=1}^{F}$ is a set of nodes in $H^{(i)}$, $E^{(i)}$ is a set of directed edges in $H^{(i)}$, and $B^{(i)} \in \mathbb{R}^{F \times F}$ is a weighted adjacency matrix for refining $\hat{X}_{i,:}$. To refine $\hat{x}_{i,d}$ of the $i$-th node via inter-channel propagation, we assign $\hat{x}_{i,d}$ to each $v^{(i)}_d$ as a scalar node feature for the $d$-th channel ($d \in \{1, ..., F\}$). The weights in $E^{(i)}$ are given by $B^{(i)}$ in (8).
We design $B^{(i)}$ for inter-channel propagation in each node to achieve three goals: (1) highly correlated channels should exchange more information with each other than less correlated channels, (2) a low-PC channel feature should receive more information from other channels for refinement than a high-PC channel feature, and (3) a high-PC channel feature should propagate more information to other channels than a low-PC channel feature. Based on these design goals, the weight of the directed edge from the $b$-th channel to the $a$-th channel ($B^{(i)}_{a,b}$) in $B^{(i)}$ is designed as

$$B^{(i)}_{a,b} = \begin{cases} \beta (1 - \alpha^{S_{i,a}}) \alpha^{S_{i,b}} R_{a,b} & \text{if } a \neq b, \\ 0 & \text{if } a = b, \end{cases}$$

where $R_{a,b}$, $(1 - \alpha^{S_{i,a}})$, and $\alpha^{S_{i,b}}$ are the terms for meeting design goals (1), (2), and (3), respectively. $\alpha$ is the hyper-parameter for pseudo-confidence in Definition 2, and $\beta$ is the scaling hyper-parameter. Node-wise inter-channel propagation on $H^{(i)}$ outputs the final imputed features for $X_{i,:}$. We define node-wise inter-channel propagation as

$$\tilde{X}_{i,:}^\top = \hat{X}_{i,:}^\top + B^{(i)} (\hat{X}_{i,:} - [m_1, m_2, \cdots, m_F])^\top,$$

where $\tilde{X}_{i,:}$ and $\hat{X}_{i,:}$ are row vectors. While the pre-recovered channel feature values are preserved (as self-loops), message passing among different channel features is conducted along the directed edges of $B^{(i)}$. After calculating $\tilde{X}_{i,:}$ for $i \in \{1, ..., N\}$, we obtain the final recovered features by concatenating them, i.e., $\tilde{X} = [\tilde{X}_{1,:}^\top\ \tilde{X}_{2,:}^\top\ \cdots\ \tilde{X}_{N,:}^\top]^\top$. Moreover, since $R$ is calculated via the recovered features $\hat{X}$ for all nodes in $G$, channel correlation propagation injects global information into the recovered features for $X$. In turn, $\tilde{X}$ is the final output of PC-based feature imputation and is fed to a GNN to solve a downstream task.
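The refinement stage can be sketched directly from the formulas for $R_{a,b}$ and $B^{(i)}_{a,b}$. The following is a minimal illustration under stated assumptions: the function names `channel_correlation` and `inter_channel_propagation`, the list-of-lists layout, and the toy values are hypothetical and not taken from the released code, and every channel is assumed to have non-zero standard deviation.

```python
import math


def channel_correlation(x_hat):
    """Return channel means m_d and the F x F correlation matrix R of x_hat."""
    n, f = len(x_hat), len(x_hat[0])
    mean = [sum(row[d] for row in x_hat) / n for d in range(f)]
    std = [
        math.sqrt(sum((row[d] - mean[d]) ** 2 for row in x_hat) / (n - 1))
        for d in range(f)
    ]  # assumed non-zero for every channel
    corr = [[0.0] * f for _ in range(f)]
    for a in range(f):
        for b in range(f):
            cov = sum(
                (row[a] - mean[a]) * (row[b] - mean[b]) for row in x_hat
            ) / (n - 1)
            corr[a][b] = cov / (std[a] * std[b])
    return mean, corr


def inter_channel_propagation(x_hat, spd, alpha, beta):
    """Refine each row: x_tilde = x_hat + B (x_hat - m), with B built per node.

    spd[i][d] is the SPD-S of node i for channel d.
    """
    mean, corr = channel_correlation(x_hat)
    f = len(mean)
    x_tilde = []
    for x_row, s_row in zip(x_hat, spd):
        centered = [x_row[b] - mean[b] for b in range(f)]
        new_row = []
        for a in range(f):
            receive = 1.0 - alpha ** s_row[a]  # low-PC channels receive more
            message = sum(
                beta * receive * alpha ** s_row[b] * corr[a][b] * centered[b]
                for b in range(f)
                if b != a  # B has a zero diagonal
            )
            new_row.append(x_row[a] + message)
        x_tilde.append(new_row)
    return x_tilde


# Toy input: 3 nodes, 2 perfectly correlated channels (R = 1 off the diagonal).
x_hat = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
spd = [[0, 1], [0, 0], [1, 0]]  # node 0 lacks channel 1, node 2 lacks channel 0
x_tilde = inter_channel_propagation(x_hat, spd, alpha=0.5, beta=1.0)
```

Entries with SPD-S equal to 0 are left unchanged because the factor $(1 - \alpha^{S_{i,a}})$ vanishes, so only the two imputed entries of the toy input are adjusted.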

4. EXPERIMENTS

To validate our method, we conducted experiments for two main graph learning tasks: semi-supervised node classification and link prediction.

4.1. EXPERIMENTAL SETUP

Datasets. We experimented with six benchmark datasets from two different domains: citation networks (Cora, CiteSeer, PubMed (Sen et al., 2008), and OGBN-Arxiv (Hu et al., 2020)) and recommendation networks (Amazon-Computers and Amazon-Photo (Shchur et al., 2018)). For link prediction, we evaluated all methods on the five benchmark datasets other than OGBN-Arxiv, which caused out-of-memory (OOM) errors. The datasets are described in Appendix A.4.1.

Figure 2: Average accuracy (%) on the six datasets with $r_m \in \{0, 0.5, 0.9, 0.995\}$. sRMGCNN and GCNMF are excluded due to OOM results on certain datasets and significantly poor performance on all the available datasets, as Table 1 shows.

Compared Methods. For semi-supervised node classification, we compared our method to two baselines and four state-of-the-art methods. We set Baseline 1 to a simple scheme that directly feeds the graph data with missing features to a GNN without recovery, where all missing values in the feature matrix are set to zero. We set Baseline 2 to label propagation (LP). For link prediction, we compared our method with sRMGCNN and FP, which are feature imputation approaches. To perform link prediction on the features imputed by each method, graph auto-encoder (GAE) (Kipf & Welling, 2016b) models were adopted. We used the features inferred by each method as input to the GAE models. We further compared against GCNMF (Taguchi et al., 2021) for link prediction. We report the detailed implementation in Appendix A.3.

Data Settings. Regardless of task type, we removed features according to the missing rate $r_m$ ($0 < r_m < 1$). Missing features were selected in two ways.
• Structural missing. We first randomly selected nodes in a ratio of $r_m$ among all nodes. Then, we set all features of the selected nodes to missing (unknown) values (zero).
• Uniform missing. We randomly selected features in a ratio of $r_m$ from the node feature matrix $X$, and we set the selected features to missing (unknown) values (zero).
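The two missing-feature settings above can be sketched as boolean masks over the feature matrix. This is a minimal illustration under stated assumptions: the function names, the use of Python's `random.Random` for sampling, and the rounding of the number of removed items are hypothetical choices and may differ from the released code.

```python
import random


def structural_missing_mask(n_nodes, n_channels, rate, rng):
    """True marks a missing entry; whole rows (nodes) are removed together."""
    n_missing = round(rate * n_nodes)  # assumption: round to nearest integer
    missing_nodes = set(rng.sample(range(n_nodes), n_missing))
    return [[i in missing_nodes] * n_channels for i in range(n_nodes)]


def uniform_missing_mask(n_nodes, n_channels, rate, rng):
    """True marks a missing entry; entries are removed independently of rows."""
    total = n_nodes * n_channels
    missing = set(rng.sample(range(total), round(rate * total)))
    return [
        [(i * n_channels + d) in missing for d in range(n_channels)]
        for i in range(n_nodes)
    ]


def apply_mask(features, mask):
    """Set missing entries to zero, as in the data settings above."""
    return [
        [0.0 if m else x for x, m in zip(x_row, m_row)]
        for x_row, m_row in zip(features, mask)
    ]


rng = random.Random(0)
s_mask = structural_missing_mask(n_nodes=10, n_channels=4, rate=0.5, rng=rng)
u_mask = uniform_missing_mask(n_nodes=10, n_channels=4, rate=0.5, rng=rng)
features = [[1.0] * 4 for _ in range(10)]
masked = apply_mask(features, u_mask)
```

With structural missing, a node has either all or none of its channels observed, whereas uniform missing leaves most nodes with a partial set of observed channels.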
For semi-supervised node classification, we randomly generated 10 different training/validation/test splits, except for OGBN-Arxiv, where the split was fixed according to the specified criteria. For link prediction, we also randomly generated 10 different training/validation/test splits for each dataset. We describe the generated splits in detail in Appendix A.4.2.

Hyper-parameters. Across all the compared methods, we tuned hyper-parameters based on the validation set. For PCFI, we analyzed the influence of $\alpha$ and $\beta$ in Appendix A.3.2. We used grid search to find the two hyper-parameters in the range of $0 < \alpha < 1$ and $0 < \beta \leq 1$ on validation sets. For node classification, $(\alpha, \beta)$ was determined by the best pair from $\{(\alpha, \beta) \mid \alpha \in \{0.1, 0.2, \cdots, 0.9\}, \beta \in \{10^{-6}, \ldots\}\}$.

Ablation Study. We present the ablation study to show the effectiveness of each component (row-stochastic transition matrix, channel-wise inter-node diffusion, and node-wise inter-channel propagation) of PCFI in Appendix A.4.4.

4.2. SEMI-SUPERVISED NODE CLASSIFICATION RESULTS

Figure 2 shows the trend of the average accuracy of the compared methods for node classification on six datasets with different $r_m$. The performance gain of PCFI is remarkable at $r_m = 0.995$. In contrast, the average accuracy of existing methods rapidly decreases as $r_m$ increases and is overtaken by LP, which does not utilize features. In the case of uniform missing features, FP exhibits better resistance than LP, but the gap from ours increases as $r_m$ increases. Table 1 presents the detailed results of node classification with $r_m = 0.995$. sRMGCNN and GCNMF show significantly low performance for all experiments in this extremely challenging environment. Baseline 2 (LP) outperforms PaGNN in general, and even FP shows worse accuracy than Baseline 2 (LP) in certain settings. For all the datasets, PCFI outperformed the other methods at $r_m = 0.995$.

4.3. LINK PREDICTION RESULTS

Table 2 presents the results for the link prediction task at $r_m = 0.995$. PCFI achieves state-of-the-art performance across all settings except PubMed with structural missing. Based on the results on semi-supervised node classification and link prediction, which are representative graph learning tasks, PCFI is effective at a very high rate of missing features.

5. CONCLUSION

We introduced a novel concept of channel-wise confidence to impute features in a graph that are missing at a high rate. To replace the unavailable true confidence, we designed a pseudo-confidence obtainable from the shortest path distance of each channel feature on a node. Using the pseudo-confidence, we developed a new framework for missing feature imputation that consists of channel-wise inter-node diffusion and node-wise inter-channel propagation. As validated in experiments, the proposed method outperforms the compared methods on both node classification and link prediction. The channel-wise confidence approach for missing feature imputation can be straightforwardly applied to various graph-related downstream tasks with missing node features.

ETHICS STATEMENT

Private or confidential information that was intentionally removed could be recovered using the proposed method, and the recovered information could be misused. Therefore, we suggest that this work be used for positive impacts on society in areas such as health care (Wang et al., 2020; Deng et al., 2020), crime prediction (Wang et al., 2021), and weather forecasting (Han et al., 2022).

REPRODUCIBILITY STATEMENT

For theoretical results, we explained the assumptions and the complete proofs of all theoretical results in Sections 3.4 and 3.5 and the Appendix. In addition, we include the data and implementation details to reproduce the experimental results in Section 4 and Appendix A.3. The code is available at https://github.com/daehoum1/pcfi.

A APPENDIX

A.1 PROOF OF PROPOSITION 1

Proposition 1. If $S_{i,d} = m \geq 1$, i.e., $v_i$ is a missing node, then $\xi_{j/i,d}$ for $v_j \in \mathcal{N}(v_i)$ is given by

$$\xi_{j/i,d} = \begin{cases} \alpha^{-1} & \text{if } S_{i,d} > S_{j,d}, \\ 1 & \text{if } S_{i,d} = S_{j,d}, \\ \alpha & \text{if } S_{i,d} < S_{j,d}. \end{cases}$$

Otherwise, if $v_i$ is a source node ($S_{i,d} = 0$), then $\xi_{j/i,d}$ for $v_j \in \mathcal{N}(v_i)$ is given by

$$\xi_{j/i,d} = \begin{cases} 1 & \text{if } v_j \text{ is a source node } (S_{j,d} = 0), \\ \alpha & \text{if } v_j \text{ is a missing node } (S_{j,d} = 1). \end{cases}$$

Proof. Let $v_a$ and $v_b$ be arbitrary nodes, and let $\delta(v_a, v_b)$ denote the number of edges in the shortest path between $v_a$ and $v_b$. The shortest path distance from $v_i$ to its nearest source node for the $d$-th feature channel, $S_{i,d}$, is given by $S_{i,d} = \min\{\delta(v_i, v_s) \mid v_s \in V_k^{(d)}\}$.

Claim 1: $S_{i,d} = 0 \Leftrightarrow v_i \in V_k^{(d)}$. Proof: If $v_i$ is a source node, then $\delta(v_i, v_i) = 0$ and thus $S_{i,d} = 0$. Conversely, $S_{i,d} = 0$ means $\delta(v_i, v_s) = 0$ for some source node $v_s$, i.e., $v_i = v_s$.

Claim 2: $S_{i,d} \geq 1 \Leftrightarrow v_i \notin V_k^{(d)} \Leftrightarrow v_i \in V_u^{(d)}$. Proof: $v_i$ is not a known node ($v_i \notin V_k^{(d)}$) if and only if $v_i$ is an unknown node ($v_i \in V_u^{(d)}$), and $S_{i,d} \geq 1 \Leftrightarrow v_i \notin V_k^{(d)}$ follows from Claim 1.

Now let $v_j \in \mathcal{N}(v_i)$ and $S_{i,d} = m$. Let $v_s$ be a known node such that $\delta(v_s, v_i) = m$, which exists because $S_{i,d} = m$. Then $\delta(v_s, v_j) \leq \delta(v_s, v_i) + \delta(v_i, v_j) = m + 1$ holds by the triangle inequality, since shortest path distance is a metric on the graph. This proves that $S_{j,d} \leq m + 1$, and it also includes the case of $S_{i,d} = 0$ as a special case. Next, for $m \geq 1$, assume that there is some known node $v_{s'}$ such that $\delta(v_j, v_{s'}) \leq m - 2$. Then $\delta(v_i, v_{s'}) \leq \delta(v_i, v_j) + \delta(v_j, v_{s'}) \leq 1 + m - 2 = m - 1$ by the triangle inequality. However, this contradicts $S_{i,d} = m$. Therefore, for every source node $v_{s'}$, $\delta(v_j, v_{s'}) \geq m - 1$, which implies $S_{j,d} \geq m - 1$. Then, the following Claim 3 also holds.

Claim 3: If $S_{i,d} = m \geq 1$, then $S_{j,d} - S_{i,d} \in \{-1, 0, 1\}$ for $v_j \in \mathcal{N}(v_i)$. Otherwise, if $S_{i,d} = 0$, then $S_{j,d} - S_{i,d} \in \{0, 1\}$ for $v_j \in \mathcal{N}(v_i)$.

According to Claim 3 and $\xi_{j/i,d} = \alpha^{S_{j,d} - S_{i,d}}$ in Definition 3 of the main text, Proposition 1 follows immediately.

A.2 CONVERGENCE OF CHANNEL-WISE INTER-NODE DIFFUSION

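The convergence of the proposed channel-wise inter-node diffusion is stated in the following proposition.

Proposition A.1. The channel-wise inter-node diffusion matrix for the d-th channel, $\widetilde{W}^{(d)}$, is expressed by

$$\widetilde{W}^{(d)} = \begin{bmatrix} I & 0_{ku} \\ \overline{W}^{(d)}_{uk} & \overline{W}^{(d)}_{uu} \end{bmatrix},$$

where $\overline{W}^{(d)}$ is the row-stochastic matrix calculated by normalizing $W^{(d)}$. The recursion in channel-wise inter-node diffusion for the d-th channel is defined by

$$\hat{x}^{(d)}(0) = \begin{bmatrix} x^{(d)}_k \\ 0_u \end{bmatrix}, \qquad \hat{x}^{(d)}(t) = \widetilde{W}^{(d)} \hat{x}^{(d)}(t-1).$$

Then, as $K \to \infty$, $\hat{x}^{(d)}_u(K)$ converges to $(I - \overline{W}^{(d)}_{uu})^{-1} \overline{W}^{(d)}_{uk} x^{(d)}_k$, where $x^{(d)}_k$ is the known feature of the d-th channel.

The proof of this proposition follows that of (Rossi et al., 2021), which proves the case of a symmetrically-normalized diffusion matrix. In our proof, the diffusion matrix is not symmetric. To prove Proposition A.1, we first give Lemmas A.1 and A.2.

Lemma A.1. $\overline{W}^{(d)}$ is the row-stochastic matrix calculated by normalizing $W^{(d)}$, which is the weighted adjacency matrix of the connected graph G. That is, $\overline{W}^{(d)} = (D^{(d)})^{-1} W^{(d)}$, where $D^{(d)}_{ii} = \sum_j W^{(d)}_{i,j}$. Let $\overline{W}^{(d)}_{uu}$ be the $|\hat{x}^{(d)}_u| \times |\hat{x}^{(d)}_u|$ bottom-right submatrix of $\overline{W}^{(d)}$, and let $\rho(\cdot)$ denote the spectral radius. Then, $\rho(\overline{W}^{(d)}_{uu}) < 1$.

Proof. Let $\overline{W}^{(d)}_{uu0} \in \mathbb{R}^{N \times N}$ be the matrix whose bottom-right submatrix is $\overline{W}^{(d)}_{uu}$ and whose other elements are all zero. That is,

$$\overline{W}^{(d)}_{uu0} = \begin{bmatrix} 0_{kk} & 0_{ku} \\ 0_{uk} & \overline{W}^{(d)}_{uu} \end{bmatrix},$$

where $0_{kk} \in \{0\}^{|\hat{x}^{(d)}_k| \times |\hat{x}^{(d)}_k|}$, $0_{ku} \in \{0\}^{|\hat{x}^{(d)}_k| \times |\hat{x}^{(d)}_u|}$, and $0_{uk} \in \{0\}^{|\hat{x}^{(d)}_u| \times |\hat{x}^{(d)}_k|}$. Since $\overline{W}^{(d)}$ is the weighted adjacency matrix of the connected graph G, $\overline{W}^{(d)}_{uu0} \le \overline{W}^{(d)}$ element-wise and $\overline{W}^{(d)}_{uu0} \ne \overline{W}^{(d)}$. Moreover, since $\overline{W}^{(d)}_{uu0} + \overline{W}^{(d)}$ is a weighted adjacency matrix of a strongly connected graph, $\overline{W}^{(d)}_{uu0} + \overline{W}^{(d)}$ is irreducible due to Theorem 2.2.7 of (Berman & Plemmons, 1994). Then, by Corollary 2.1.5 of (Berman & Plemmons, 1994), $\rho(\overline{W}^{(d)}_{uu0}) < \rho(\overline{W}^{(d)})$. Since the spectral radius of a stochastic matrix is one (Theorem 2.5.3 in (Berman & Plemmons, 1994)), $\rho(\overline{W}^{(d)}) = 1$. Furthermore, since $\overline{W}^{(d)}_{uu0}$ and $\overline{W}^{(d)}_{uu}$ share the same non-zero eigenvalues, $\rho(\overline{W}^{(d)}_{uu0}) = \rho(\overline{W}^{(d)}_{uu})$. Finally, $\rho(\overline{W}^{(d)}_{uu}) = \rho(\overline{W}^{(d)}_{uu0}) < \rho(\overline{W}^{(d)}) = 1$. □

Lemma A.2. $I - \overline{W}^{(d)}_{uu}$ is invertible, where I is the $|\hat{x}^{(d)}_u| \times |\hat{x}^{(d)}_u|$ identity matrix.

Proof. Since 1 is not an eigenvalue of $\overline{W}^{(d)}_{uu}$ by Lemma A.1, 0 is not an eigenvalue of $I - \overline{W}^{(d)}_{uu}$. Hence $I - \overline{W}^{(d)}_{uu}$ is invertible. □

In the following, we give the proof of Proposition A.1.

Proof of Proposition A.1. Unfolding the recurrence relation gives

$$\hat{x}^{(d)}(t) = \begin{bmatrix} \hat{x}^{(d)}_k(t) \\ \hat{x}^{(d)}_u(t) \end{bmatrix} = \begin{bmatrix} I & 0_{ku} \\ \overline{W}^{(d)}_{uk} & \overline{W}^{(d)}_{uu} \end{bmatrix} \begin{bmatrix} \hat{x}^{(d)}_k(t-1) \\ \hat{x}^{(d)}_u(t-1) \end{bmatrix} = \begin{bmatrix} \hat{x}^{(d)}_k(t-1) \\ \overline{W}^{(d)}_{uk} \hat{x}^{(d)}_k(t-1) + \overline{W}^{(d)}_{uu} \hat{x}^{(d)}_u(t-1) \end{bmatrix}.$$

Since $\hat{x}^{(d)}_k(t) = \hat{x}^{(d)}_k(t-1)$ in the first $|\hat{x}^{(d)}_k|$ rows, $\hat{x}^{(d)}_k(K) = \dots = x^{(d)}_k$. That is, $\hat{x}^{(d)}_k(K)$ remains $x^{(d)}_k$, and hence $\lim_{K \to \infty} \hat{x}^{(d)}_k(K) = x^{(d)}_k$. It remains to consider the convergence of $\hat{x}^{(d)}_u(K)$. Unrolling the recursion of the last $|\hat{x}^{(d)}_u|$ rows gives

$$\begin{aligned} \hat{x}^{(d)}_u(K) &= \overline{W}^{(d)}_{uk} x^{(d)}_k + \overline{W}^{(d)}_{uu} \hat{x}^{(d)}_u(K-1) \\ &= \overline{W}^{(d)}_{uk} x^{(d)}_k + \overline{W}^{(d)}_{uu} \left( \overline{W}^{(d)}_{uk} x^{(d)}_k + \overline{W}^{(d)}_{uu} \hat{x}^{(d)}_u(K-2) \right) \\ &= \dots \\ &= \left( \sum_{t=0}^{K-1} (\overline{W}^{(d)}_{uu})^t \right) \overline{W}^{(d)}_{uk} x^{(d)}_k + (\overline{W}^{(d)}_{uu})^K \hat{x}^{(d)}_u(0). \end{aligned}$$

Since $\lim_{K \to \infty} (\overline{W}^{(d)}_{uu})^K = 0$ by Lemma A.1, $\lim_{K \to \infty} (\overline{W}^{(d)}_{uu})^K \hat{x}^{(d)}_u(0) = 0$ regardless of the initial state $\hat{x}^{(d)}_u(0)$. (We replace $\hat{x}^{(d)}_u(0)$ with a zero column vector for simplicity.) Thus, it remains to consider $\lim_{K \to \infty} \left( \sum_{t=0}^{K-1} (\overline{W}^{(d)}_{uu})^t \right) \overline{W}^{(d)}_{uk} x^{(d)}_k$. Since $\rho(\overline{W}^{(d)}_{uu}) < 1$ by Lemma A.1 and $I - \overline{W}^{(d)}_{uu}$ is invertible by Lemma A.2, the geometric series converges as follows:

$$\lim_{K \to \infty} \hat{x}^{(d)}_u(K) = \lim_{K \to \infty} \left( \sum_{t=0}^{K-1} (\overline{W}^{(d)}_{uu})^t \right) \overline{W}^{(d)}_{uk} x^{(d)}_k = (I - \overline{W}^{(d)}_{uu})^{-1} \overline{W}^{(d)}_{uk} x^{(d)}_k.$$

Thus, the recursion in channel-wise inter-node diffusion converges. □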
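The statement of Proposition A.1 can be checked numerically on a toy example. The following is a minimal sketch, not the authors' implementation: the 4-node path graph, its asymmetric edge weights, and the helper names (`row_normalize`, `diffuse`, `solve`, `closed_form`) are all assumptions made for illustration. It iterates the recursion with the known rows held fixed and compares the result with the closed-form limit.

```python
def row_normalize(W):
    """Return D^{-1} W, so that every row sums to one."""
    return [[w / sum(row) for w in row] for row in W]


def diffuse(W, known, K=200):
    """Run K diffusion steps; `known` maps node index -> known feature value."""
    n = len(W)
    W_bar = row_normalize(W)
    x = [known.get(i, 0.0) for i in range(n)]  # unknown entries start at zero
    for _ in range(K):
        x = [
            known[i] if i in known  # identity rows keep known values fixed
            else sum(W_bar[i][j] * x[j] for j in range(n))
            for i in range(n)
        ]
    return x


def solve(A, b):
    """Solve A y = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * m for a, m in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def closed_form(W, known):
    """Evaluate (I - W_uu)^{-1} W_uk x_k for the unknown nodes."""
    n = len(W)
    W_bar = row_normalize(W)
    u = [i for i in range(n) if i not in known]
    A = [[(1.0 if i == j else 0.0) - W_bar[i][j] for j in u] for i in u]
    b = [sum(W_bar[i][j] * known[j] for j in known) for i in u]
    return dict(zip(u, solve(A, b)))


# Path graph 0-1-2-3 with hypothetical asymmetric weights; nodes 0 and 3 are known.
W = [[0, 1, 0, 0],
     [2, 0, 1, 0],
     [0, 1, 0, 3],
     [0, 0, 1, 0]]
known = {0: 0.0, 3: 3.0}
x_iter = diffuse(W, known)
x_closed = closed_form(W, known)
```

In this toy case the spectral radius of the unknown-to-unknown block is below one, so the iterate and the closed form agree to numerical precision after a few dozen steps.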

Values in each row are listed by missing type and missing rate: structural missing with $r_m$ = 0.995, 0.9, 0.5, followed by uniform missing with $r_m$ = 0.995, 0.9, 0.5.

Cora: α = 0.9, 0.9, 0.6, 0.7, 0.5, 0.5; β = $1$, $10^{-1}$, $1$, $10^{-2}$, $10^{-1}$
CiteSeer: α = 0.8, 0.6, 0.5, 0.8, 0.5, 0.1; β = $10^{-1.5}$, $10^{-0.5}$, $1$, $10^{-0.5}$, $1$, $1$
PubMed: α = 0.8, 0.9, 0.9, 0.7, 0.8, 0.2; β = $10^{-1}$, $10^{-0.5}$, $10^{-0.5}$, $10^{-2.5}$, $1$, $1$
Photo: α = 0.2, 0.4, 0.6, 0.2, 0.2, 0.5; β = $10^{-4}$, $10^{-6}$, $10^{-2.5}$, $10^{-1.5}$, $10^{-6}$, $10^{4.5}$
Computers: α = 0.1, 0.1, 0.3, 0.1, 0.2, 0.4; β = $10^{-3.5}$, $10^{-4}$, $10^{-5.5}$, $10^{-2.5}$, $10^{-4}$, $10^{-5.5}$
OGBN-Arxiv: α = 0.1, 0.4, 0.2, 0.2, 0.8, 0.8; β = $10^{-6}$, $10^{-6}$, $10^{-6}$, $10^{-4}$, $10^{-6}$, $10^{-2.5}$

β = $1$, $10^{-1}$, $1$, $10^{-1}$, $10^{-1}$
Uniform: α = 0.9, 0.6, 0.3, 0.1, 0.1; β = $1$, $1$, $1$, $1$, $1$

For link prediction, we selected α from {0.1, 0.2, ..., 0.9} and β from $\{10^{-6}, 10^{-5}, 10^{-4}, \dots, 1\}$. To find the best hyper-parameters, we used grid search on a validation set. The detailed hyper-parameter settings for all datasets used in our paper are listed in Table 3 and Table 4. To train the downstream GCNs added to PCFI for node classification, we fixed the learning rate to 0.005. To train the downstream GAE added to PCFI for link prediction, we set the learning rate to 0.01 for {Cora, CiteSeer, PubMed} and to 0.001 for {Photo, Computers}.

Downstream GCN for node classification. We set the number of layers to 3 and fixed the dropout rate to p = 0.5. The hidden dimension was set to 64 for all datasets except OGBN-Arxiv, where 256 was used. For OGBN-Arxiv, since the Jumping Knowledge scheme (Xu et al., 2018) with max aggregation was applied to FP, we also utilized the scheme.
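The link-prediction search over (α, β) described above can be enumerated as a plain grid. This is a minimal sketch under stated assumptions: `grid_search` and `toy_validation_score` are hypothetical names, and the toy score merely stands in for training a downstream model on PCFI-imputed features and scoring it on the validation set.

```python
from itertools import product

# Search ranges stated in the text for link prediction.
alphas = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9
betas = [10.0 ** e for e in range(-6, 1)]           # 1e-6, 1e-5, ..., 1
grid = list(product(alphas, betas))


def grid_search(evaluate, grid):
    """Return the (alpha, beta) pair with the highest validation score."""
    return max(grid, key=lambda ab: evaluate(*ab))


# Hypothetical stand-in for a real validation run; peaks at (0.7, 0.1).
def toy_validation_score(alpha, beta):
    return -((alpha - 0.7) ** 2) - (beta - 0.1) ** 2


best_alpha, best_beta = grid_search(toy_validation_score, grid)
```

With 9 values of α and 7 values of β, the grid contains 63 candidate pairs per dataset and missing setting.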

A.3.3 IMPLEMENTATION OF BASELINES

GCNMF (Taguchi et al., 2021). We used the code publicly released by the authors (https://github.com/marblet/GCNmf), which is MIT licensed.

FP (Rossi et al., 2021). We used the code publicly released by the authors (https://github.com/twitter-research/feature-propagation), which is Apache-2.0 licensed.

PaGNN (Jiang & Zhang, 2020). Since we could not find code officially released by the authors of PaGNN, we used a re-implementation (https://github.com/twitter-research/feature-propagation), which is Apache-2.0 licensed.

sRMGCNN (Monti et al., 2017). Because the publicly released code (https://github.com/fmonti/mgcnn) relies on an old version of TensorFlow (Abadi et al., 2016), we updated it to TensorFlow 2.3.0 for compatibility and made no other changes. The code is GPL-3.0 licensed.

Label propagation (Zhu & Ghahramani, 2002). We employed the re-implementation included in PyTorch Geometric, which is MIT licensed. We tuned the hyper-parameter α of LP over {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95} by grid search.

A.4.1 DATASETS

All the datasets used in our experiments are publicly available from the MIT-licensed PyTorch Geometric package. We conducted all the experiments on the largest connected component of each given graph. For a disconnected graph, PCFI can simply be applied to each connected component independently. The datasets are summarized in Table 5.

Link prediction. We randomly generated 10 different training/validation/test splits for each dataset, applying the same split ratio to every dataset. Following the setting in (Kipf & Welling, 2016b), we used 85% of the edges for training, 5% for validation, and 10% for testing.
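The 85%/5%/10% edge split over 10 random seeds can be sketched as follows. This is an illustrative assumption rather than the authors' pipeline: `split_edges` is a hypothetical helper, the edge list is a toy example, and negative-edge sampling for evaluation is omitted.

```python
import random


def split_edges(edges, seed=0, val_ratio=0.05, test_ratio=0.10):
    """Shuffle the edges with a fixed seed and cut them into train/val/test."""
    edges = list(edges)
    random.Random(seed).shuffle(edges)
    n_val = int(len(edges) * val_ratio)
    n_test = int(len(edges) * test_ratio)
    val = edges[:n_val]
    test = edges[n_val:n_val + n_test]
    train = edges[n_val + n_test:]  # the remaining 85% of the edges
    return train, val, test


# Toy edge list with 200 edges; one split per seed, 10 splits in total.
edges = [(i, i + 1) for i in range(200)]
splits = [split_edges(edges, seed=s) for s in range(10)]
```

Each seed yields a different but reproducible partition, so the reported mean and standard deviation can be recomputed from the same 10 splits.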

A.4.3 GAIN OF PCFI ACCORDING TO FEATURE HOMOPHILY

Since pseudo-confidence is based on the assumption that linked nodes have similar features, in this section we explore how the feature homophily of a graph affects the performance of PCFI. We further analyze the gain of PCFI over FP in terms of feature homophily. For the experiments, we generated features with different levels of feature homophily on graphs from the synthetic dataset, where each graph contains 5000 nodes that belong to one of 10 classes. We selected two graphs (with class homophily of 0.3 and 0.5) from the dataset. Preserving the graph structure and class distribution, we generated multiple new sets of node features for each graph so that each generated graph has a different feature homophily.

The node features were sampled from 10 overlapping 5-dimensional Gaussians, one per class, whose means are set differently so as to be the same distance from each other. For the covariance matrix of each Gaussian, the diagonal elements were set to the same value and the other elements were set to 0.1 times the diagonal element. That is, the covariance between different channels was set to 0.1 times the variance of a channel. For each feature generation of a graph, we changed the scale of the covariance matrix so that features are created with a different feature homophily. As the scale of the covariance matrix decreases, the overlap between the Gaussians decreases. As a result, more similar features are generated within the same class and the graph has higher feature homophily.

Then, for semi-supervised node classification, we randomly generated 10 different training/validation/test splits. For each split, we set the numbers of nodes in the training, validation, and test sets to be equal. Figure 5 shows the accuracy of PCFI and FP under a structural-missing setting with $r_m = 0.995$. The gain of PCFI over FP exceeds 10% on both graphs with high feature homophily. Furthermore, PCFI shows superior performance regardless of the level of feature homophily. This is because confidence based on feature homophily is a valid concept on graphs with missing features, and pseudo-confidence is designed properly to replace confidence.
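The covariance structure described above (equal variances, with each off-diagonal entry equal to 0.1 times the variance) can be sampled without a matrix factorization by adding one shared Gaussian factor to independent per-channel noise. The sketch below is an assumption-laden illustration: `covariance`, `sample_features`, the zero class mean, and the seed are hypothetical, and the placement of the class means is not modeled.

```python
import math
import random


def covariance(dim, scale, rho=0.1):
    """Covariance with `scale` on the diagonal and rho * scale elsewhere."""
    return [[scale if i == j else rho * scale for j in range(dim)]
            for i in range(dim)]


def sample_features(mean, scale, rng, rho=0.1):
    """Draw one feature vector from N(mean, covariance(len(mean), scale, rho))."""
    shared = rng.gauss(0.0, 1.0)  # common factor gives covariance rho * scale
    return [
        m
        + math.sqrt((1.0 - rho) * scale) * rng.gauss(0.0, 1.0)
        + math.sqrt(rho * scale) * shared
        for m in mean
    ]


# Hypothetical usage: three 5-dimensional samples around a zero class mean.
rng = random.Random(0)
class_mean = [0.0] * 5
features = [sample_features(class_mean, scale=1.0, rng=rng) for _ in range(3)]
```

Shrinking `scale` tightens every class cluster at once, which is the knob the text uses to raise feature homophily.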

A.4.4 ABLATION STUDY

To verify the effectiveness of each element of PCFI, we carried out an ablation study. We started by measuring the performance of FP, which performs simple graph diffusion with a symmetrically-normalized transition matrix. First, we changed the normalization type of the transition matrix, replacing the symmetrically-normalized transition matrix with a row-stochastic transition matrix (row-ST in Table 6 and Table 7). The row-stochastic transition matrix recovers features at the same scale as the actual features. Second, we introduced PC into the diffusion process, which means that PC-based channel-wise inter-node diffusion was performed (CID in Table 6 and Table 7). Lastly, we performed node-wise inter-channel propagation on the recovered features obtained by channel-wise inter-node diffusion (NIP in Table 6 and Table 7). We compared the performance of these four cases. In this experiment, for CiteSeer, we performed node classification under a structural-missing setting with $r_m = 0.995$. For Cora and PubMed, we performed link prediction with $r_m = 0.995$, applying structural missing to Cora and uniform missing to PubMed. As shown in Table 6 and Table 7, each component of PCFI contributes to the performance improvement across the various settings.
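The scale-preservation claim for the row-stochastic transition matrix can be illustrated on a small star graph. This is a sketch under assumptions: the graph and the helper names (`row_stochastic`, `symmetric_normalized`, `matvec`) are hypothetical. A row-stochastic matrix maps a constant feature vector to itself, whereas the symmetric normalization rescales entries according to node degree.

```python
import math


def row_stochastic(A):
    """D^{-1} A: each row sums to one."""
    deg = [sum(row) for row in A]
    return [[a / deg[i] for a in row] for i, row in enumerate(A)]


def symmetric_normalized(A):
    """D^{-1/2} A D^{-1/2}: symmetric, but rows need not sum to one."""
    deg = [sum(row) for row in A]
    return [[a / math.sqrt(deg[i] * deg[j]) for j, a in enumerate(row)]
            for i, row in enumerate(A)]


def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]


# Star graph: node 0 is the center, nodes 1-3 are leaves.
A = [[0, 1, 1, 1],
     [1, 0, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 0]]
ones = [1.0] * 4
row_out = matvec(row_stochastic(A), ones)        # stays at the input scale
sym_out = matvec(symmetric_normalized(A), ones)  # center inflated, leaves shrunk
```

Here the symmetric normalization sends the center to sqrt(3) and each leaf to 1/sqrt(3), while the row-stochastic version leaves every entry at 1.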

A.4.5 RESISTANCE OF PCFI TO MISSING FEATURES

To verify the resistance of PCFI to missing features, we compared the node classification accuracy averaged over the 6 datasets while increasing $r_m$, as shown in Table 8. We compared the average accuracy at different $r_m$ from $r_m = 0$ to $r_m = 0.95$. For structural missing, PCFI loses only 2.23% of relative average accuracy with $r_m = 0.9$, and 4.82% with $r_m = 0.995$. In the case of uniform missing, PCFI loses only 1.81% of relative average accuracy with $r_m = 0.9$, and 2.66% with $r_m = 0.995$. Note that classification with $r_m = 0.995$ is an extreme case in which only 0.5% of features are known. This result demonstrates that PCFI is robust to missing features.

Even at the same $r_m$, we observed that the average accuracy for structural missing is lower than that for uniform missing. We analyzed this observation in terms of confidence. Since features are missing in units of nodes for structural missing, all channel features of a node have the same SPD-S, i.e., $S_{i,1} = \dots = S_{i,F}$. Hence, in a node far from its nearest source node, every channel feature has low confidence. Then, no missing feature of the node can be improved via node-wise inter-channel propagation, because the node has no highly confident channel features. Thus, the node is likely to be misclassified. In contrast, for uniform missing, the channel features of a node have various PC values: known channel features have high confidence and propagate their feature information to the unknown channel features in the node. Hence, most nodes can be classified well owing to the features recovered from highly confident channel features. We claim that this observation shows the validity of the concept of channel-wise confidence. The tables containing accuracy at different $r_m$ for the two missing types are in Section A.

Under a structural missing setting, the node features in a node have the same SPD-S, i.e., $S_{i,1} = \dots = S_{i,F}$ for the i-th node.
Since the pseudo-confidence $\xi_{i,d}$ is calculated by $\xi_{i,d} = \alpha^{S_{i,d}}$, the node features within a node also have the same pseudo-confidence (PC) for structural missing, i.e., $\xi_{i,1} = \dots = \xi_{i,F}$. We refer to the SPD-S of the node features within a node as the SPD-S of the node. Similarly, we refer to the PC of the node features within a node as the PC of the node. The test nodes are divided into groups according to the SPD-S of the nodes. Then, to observe the relationship between PC and classification accuracy, we calculated the classification accuracy of the nodes in each group. We conducted experiments on Cora and CiteSeer under a structural missing setting with $r_m = 0.995$. We compared PCFI with sRMGCNN and FP. Figure 6 shows the node classification accuracy according to the SPD-S of the test nodes. For both datasets, as the SPD-S of nodes increases, the accuracy of FP tends to decrease. However, for large-SPD-S nodes, PCFI gains a noticeable performance improvement over FP. Furthermore, the results on Cora show that PCFI outperforms PCFI without node-wise inter-channel propagation on large-SPD-S nodes. Since large SPD-S means low PC, we can observe that PCFI imputes low-PC missing features effectively.

We further conducted experiments to observe the degree of feature recovery according to the SPD-S of nodes. To evaluate the degree of feature recovery for each node, we measured the cosine similarity between the recovered node features and the original node features. The experimental setting is the same as in Section A.4.6. Figure 7 shows the results on Cora and CiteSeer. As the SPD-S of nodes increases from zero, which means that the PC of nodes decreases, the cosine similarity between the recovered and original features tends to decrease. In other words, as the PC of nodes increases, the cosine similarity tends to increase. This shows that the pseudo-confidence, designed on the basis of SPD-S, properly reflects the confidence. Unlike the tendency in node classification accuracy, PCFI shows almost the same degree of feature recovery as PCFI without node-wise inter-channel propagation. This means that node-wise inter-channel propagation improves performance with very little refinement of the features. That is, higher classification accuracy on nodes does not necessarily mean a higher degree of feature recovery of the nodes. We leave a detailed analysis of this observation for future work.
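The quantities used in this analysis, SPD-S, pseudo-confidence $\xi = \alpha^{S}$, and cosine similarity, can be sketched in a few lines. The code below is a minimal illustration under assumptions: the path graph, the source set, and the function names are hypothetical, and a single breadth-first search is used because, under structural missing, all channels of a node share the same source nodes (under uniform missing the search would be repeated per channel).

```python
import math
from collections import deque


def spd_to_nearest_source(adj, sources):
    """Multi-source BFS: hop distance from each node to its nearest source node."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def pseudo_confidence(dist, alpha):
    """xi = alpha ** S with 0 < alpha < 1, so source nodes (S = 0) get xi = 1."""
    return {v: alpha ** s for v, s in dist.items()}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Path graph 0-1-2-3-4; nodes 0 and 4 keep their features (source nodes).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
spd_s = spd_to_nearest_source(adj, sources=[0, 4])
pc = pseudo_confidence(spd_s, alpha=0.5)
```

Grouping test nodes by their value in `spd_s` and averaging accuracy or cosine similarity within each group gives the kind of per-distance curves discussed above.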



https://github.com/marblet/GCNmf https://github.com/twitter-research/feature-propagation https://github.com/twitter-research/feature-propagation https://github.com/fmonti/mgcnn



Figure 1: Overall scheme of the proposed Pseudo-Confidence-based Feature Imputation (PCFI) method. Based on the graph structure and partially known features, we calculate the channel-wise shortest path distance between a node with a missing feature and its nearest source node (SPD-S). Based on SPD-S, we determine the pseudo-confidence in the recovered feature, using a predetermined hyper-parameter α (0 < α < 1). Pseudo-confidence plays an important role in the two stages: channel-wise inter-node diffusion and node-wise inter-channel propagation.

as a set of nodes where the d-th channel feature values are known (k in $V^{(d)}_k$ means 'known'). The set of nodes with the unknown d-th channel feature values is denoted by V

Therefore, unlike a symmetric transition matrix (Kipf & Welling, 2016a; Klicpera et al., 2019; Rossi et al., 2021) or a column-stochastic (random walk) transition matrix (Page et al., 1999; Chung, 2007; Perozzi et al., 2014; Grover & Leskovec, 2016; Atwood & Towsley, 2016; Klicpera et al., 2018; Lim et al., 2021), the features of missing nodes are formed at the same scale as the known features. Preserving the original scale allows the recovered features to stay close to the actual features. Now, we define channel-wise inter-node diffusion for the d-th channel as

(d) u |, and t ∈ [1, K]. Here we initialize missing feature values x (d)

Zhu & Ghahramani, 2002) which does not use node features and propagates only partially-known labels for inferring the remaining labels. That is, LP corresponds to the case of 100% feature missing. The four state-of-the-art methods can be categorized into two approaches: GCN-variant models = {GCNMF (Taguchi et al., 2021), PaGNN (Jiang & Zhang, 2020)} and feature imputation methods = {sRMGCNN (Monti et al., 2017), FP (Rossi et al., 2021)}. While GCN-variant models were designed to perform node classification directly with partially known features, feature imputation methods are combined with GNN models for downstream tasks. For Baseline 1, sRMGCNN, FP, and our method, we used vanilla GCN (Kipf & Welling, 2016a) as the common downstream model.

$10^{-5.5}, \dots, 1\}\}$. For link prediction, the best (α, β) was searched from $\{(\alpha, \beta) \mid \alpha \in \{0.1, 0.2, \dots, 0.9\}, \beta \in \{10^{-6}, 10^{-5}, \dots, 1\}\}$, as shown in Figures 3 and 4 of the Appendix.

Since 1 is not an eigenvalue of $\overline{W}^{(d)}_{uu}$ by Lemma A.1, 0 is not an eigenvalue of $I - \overline{W}^{(d)}_{uu}$.

Figure 4: Link prediction results on Cora with different α and β. The experiments are conducted under a structural-missing setting with $r_m = 0.995$.

Figure 5: Node classification accuracy (%) on the synthetic graphs according to $-\log(E_d)$, where $E_d$ is the Dirichlet energy (an increase in $-\log(E_d)$, i.e., a decrease in $E_d$, means an increase in homophily). The experiments are conducted under a structural-missing setting with $r_m = 0.995$. The proposed PCFI consistently outperforms FP, and the performance gap widens as homophily increases.

Figure 6: Node classification accuracy (%) according to SPD-S of test nodes. For both Cora and CiteSeer, structural missing with $r_m = 0.995$ is applied. PCFI* denotes PCFI without node-wise inter-channel propagation. PCFI shows a noticeable performance gain, especially for nodes with low-PC missing features (large-SPD-S nodes). Also, node-wise inter-channel propagation shows its effectiveness on nodes with low-PC features.

.7 DEGREE OF FEATURE RECOVERY ACCORDING TO PSEUDO-CONFIDENCE

Figure 7: Cosine similarity between $X_{i,:}$ (original features in a node) and $\hat{X}_{i,:}$ (its recovered features) according to SPD-S of features within the node. For both datasets, we randomly selected 99.5% of the nodes and removed all features of the selected nodes (structural missing) so that all channel features within a node have the same SPD-S. The imputed feature similarity between two nodes tends to decrease as the shortest path distance between the two nodes increases.

. It is notable that, if the i-th node is a source node, its nearest source node is itself, meaning $S_{i,d}$ becomes zero. We construct the SPD-S matrix $S \in \mathbb{R}^{N \times F}$ whose elements are $S_{i,d}$. Consider $\hat{X} = [\hat{x}_{i,d}]$ that represents the recovered features of X with consideration of feature homophily (McPherson et al., 2001) that represents a local property on a graph (Bisgin et al., 2010; Lauw et al., 2010; Bisgin et al.

Node classification accuracy (%) at missing rate $r_m = 0.995$. OOM denotes out of memory. * denotes incalculable average for six datasets due to OOM results.

Link prediction results (%) at missing rate $r_m = 0.995$. OOM denotes out of memory.

± 0.75 | 66.34 ± 5.78 | 68.26 ± 1.07 | 83.74 ± 1.05 | 86.45 ± 1.15 | 66.46 ± 5.63 | 67.25 ± 1.10 | 86.31 ± 1.40 | 87.30 ± 1.33
AUC | 92.58 ± 0.86 | 68.80 ± 6.44 | 71.09 ± 0.87 | 86.12 ± 1.04 | 88.26 ± 0.97 | 68.87 ± 6.36 | 70.78 ± 0.86 | 88.73 ± 1.16 | 89.24 ± 1.08
CiteSeer AP | 90.50 ± 0.92 | 67.75 ± 1.95 | 67.75 ± 1.98 | 79.74 ± 1.71 | 80.12 ± 1.59 | 64.35 ± 5.19 | 65.71 ± 1.80 | 82.02 ± 1.95 | 82.98 ± 2.30
AUC | 91.65 ± 0.99 | 69.08 ± 1.88 | 69.10 ± 1.95 | 83.24 ± 1.43 | 83.88 ± 1.30 | 66.30 ± 5.65 | 68.55 ± 1.72 | 85.81 ± 1.47 | 86.28 ± 1.77
± 0.38 | 81.48 ± 0.29 | 81.45 ± 0.30 | 94.05 ± 1.18 | 96.40 ± 0.42 | 81.53 ± 0.27 | 81.48 ± 0.30 | 95.97 ± 0.21 | 97.07 ± 0.21
AUC | 95.34 ± 0.42 | 81.07 ± 0.33 | 81.03 ± 0.34 | 93.57 ± 1.06 | 96.01 ± 0.49 | 81.14 ± 0.29 | 81.07 ± 0.33 | 95.54 ± 0.24 | 96.89 ± 0.23
AUC | 93.79 ± 1.09 | 83.66 ± 0.24 | 83.62 ± 0.24 | 90.92 ± 1.05 | 94.67 ± 0.43 | 83.68 ± 0.26 | 83.65 ± 0.25 | 93.90 ± 0.24 | 96.03 ± 0.22

Hyper-parameters of PCFI used in node classification.

Hyper-parameters of PCFI used in link prediction.

Dataset statistics.

Ablation study of PCFI. row-ST, CID, and NIP denote a row-stochastic transition matrix, channel-wise inter-node diffusion, and node-wise inter-channel propagation, respectively.

Ablation study of PCFI. row-ST, CID, and NIP denote a row-stochastic transition matrix, channel-wise inter-node diffusion, and node-wise inter-channel propagation, respectively.

Node classification accuracy (%) of PCFI at different missing rates for the two missing types. For each experiment, we report the mean with standard deviation (mean ± std). For each missing type, we report the average accuracy with the relative drop (%p) compared to a full-feature setting (average (drop)).

ACKNOWLEDGMENTS

This work was supported by an IITP grant funded by the Korea government (MSIT) [No. B0101-15-0266, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis; No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)].

A.3 IMPLEMENTATION

A.3.1 IMPLEMENTATION DETAILS

We used PyTorch (Paszke et al., 2017) and PyTorch Geometric (Fey & Lenssen, 2019) for the experiments on an NVIDIA GeForce RTX 2080 Ti GPU with 11GB of memory.

Node classification. We trained the GCN-variant models (GCNMF, PaGNN) and the GCN models for the feature imputation methods (Baseline 1, sRMGCNN, FP, PCFI) as follows. We used the Adam optimizer (Kingma & Ba, 2014) and set the maximum number of epochs to 10000. We used early stopping with a patience of 200 epochs. By grid search on each validation set, the learning rate was chosen from {0.01, 0.005, 0.001, 0.0001}, and dropout (Srivastava et al., 2014) was applied with p selected from {0.0, 0.25, 0.5}.

Link prediction. For GCNMF and for GAE, which is used as the common downstream model for the feature imputation methods, we trained the models with the Adam optimizer for 200 iterations. By grid search on the validation set, the learning rate of each method was searched from {0.1, 0.01, 0.005, 0.001, 0.0001} for each dataset, and dropout was applied to each layer with p searched from {0.0, 0.25, 0.5}. As specified in (Kipf & Welling, 2016b) and (Taguchi et al., 2021), we used a 32-dimensional hidden layer and 16-dimensional latent variables for all the auto-encoder models.

For all the compared methods, we followed the hyper-parameters in the original papers or code where feasible. If the hyper-parameters (the number of layers and the hidden dimension) of a model for certain datasets are not specified in the papers, we searched them by grid search, choosing the number of layers from {2, 3} and the hidden dimension from {16, 32, 64, 128, 256}.

We present the pseudo-code of our PCFI in Sec. A.6. Our code will be available upon publication. For node classification, we set the search ranges of α and β to those in Figure 3: we chose α from {0.1, 0.2, ..., 0.9} and β from $\{10^{-6}, 10^{-5.5}, 10^{-5}, \dots, 1\}$ using a validation set. Then, for link prediction, we set the search ranges of α and β to those in Figure 4.
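The early-stopping rule used for node classification (at most 10000 epochs, patience of 200 epochs) can be sketched as a small loop. This is an illustrative assumption, not the authors' training code: `train_with_early_stopping` and the `step` callback are hypothetical, and the exact stopping criterion in the released code may differ, for example by monitoring validation loss instead of accuracy.

```python
def train_with_early_stopping(step, max_epochs=10000, patience=200):
    """Call step(epoch) -> validation score; stop after `patience` epochs without improvement."""
    best_score, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        score = step(epoch)  # one training epoch followed by validation
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score, epoch


# Hypothetical validation curve that peaks at epoch 50 and then degrades.
best_epoch, best_score, last_epoch = train_with_early_stopping(
    lambda epoch: -abs(epoch - 50)
)
```

In practice the model weights from `best_epoch` would be restored before evaluating on the test set.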

