CONFIDENCE-BASED FEATURE IMPUTATION FOR GRAPHS WITH PARTIALLY KNOWN FEATURES

Abstract

This paper investigates the problem of missing feature imputation for graph learning tasks. Several methods have previously addressed learning tasks on graphs with missing features. However, at high rates of missing features, they could not avoid significant performance degradation. To overcome this limitation, we introduce a novel concept of channel-wise confidence in a node feature, which is assigned to each imputed channel feature of a node to reflect the certainty of the imputation. Because true confidence is unavailable in an actual learning process, we design a pseudo-confidence to replace it, based on the channel-wise shortest path distance between a missing-feature node and its nearest known-feature node. Based on the pseudo-confidence, we propose a novel feature imputation scheme that performs channel-wise inter-node diffusion and node-wise inter-channel propagation. The scheme remains robust even at an exceedingly high missing rate (e.g., 99.5%) and achieves state-of-the-art accuracy for both semi-supervised node classification and link prediction on various datasets containing a high rate of missing features. Code is available at https://github.com/daehoum1/pcfi.

1. INTRODUCTION

In recent years, graph neural networks (GNNs) have received considerable attention and have performed outstandingly on numerous problems across multiple fields (Zhou et al., 2020; Wu et al., 2020). While various GNNs handling attributed graphs have been designed for node representation learning (Defferrard et al., 2016; Kipf & Welling, 2016a; Veličković et al., 2017; Xu et al., 2018) and graph representation learning (Kipf & Welling, 2016b; Sun et al., 2019; Veličković et al., 2019), GNN models typically assume that the features of all nodes are fully observed. In real-world situations, however, features in graph-structured data are often only partially observed, as illustrated by the following cases. First, collecting complete data for a large graph is prohibitively expensive or even impossible. Second, measurement failure is common. Third, in social networks, most users desire to protect their personal information selectively. As data security regulation continues to tighten around the world (GDPR), access to full data is expected to become increasingly difficult. Under these circumstances, most GNNs cannot be applied directly because the features are incomplete.

Several methods have been proposed to solve learning tasks on graphs containing missing features (Jiang & Zhang, 2020; Chen et al., 2020; Taguchi et al., 2021), but they suffer from significant performance degradation at high rates of missing features. Recent work by Rossi et al. (2021) demonstrated improved performance by introducing feature propagation (FP), which iteratively propagates known features among the nodes along edges. However, even FP cannot avoid a considerable accuracy drop at an extremely high missing rate (e.g., 99.5%). We hypothesize that this is because FP performs graph diffusion through undirected edges. Consequently, in FP, message passing between two nodes occurs with the same strength regardless of the direction.
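
As a rough illustration of the FP-style diffusion described above, the following NumPy sketch propagates known features along undirected edges and restores the known entries after every step. The symmetric degree normalization, the reset step, the function name `feature_propagation`, and the `num_iters` default are assumptions made for illustration; they are not specified in this section.

```python
import numpy as np


def feature_propagation(adj, x, known_mask, num_iters=40):
    """Sketch of FP-style diffusion (illustrative, not the reference code).

    adj:        (N, N) symmetric adjacency matrix of an undirected graph
    x:          (N, F) feature matrix; entries where known_mask is False are ignored
    known_mask: (N, F) boolean array, True where a feature is observed
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)

    # Assumed normalization: D^{-1/2} A D^{-1/2}; isolated nodes get weight 0.
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    norm_adj = inv_sqrt[:, None] * adj * inv_sqrt[None, :]

    # Missing entries start at zero; known entries keep their observed values.
    out = np.where(known_mask, x, 0.0)
    for _ in range(num_iters):
        out = norm_adj @ out                # diffuse each channel along edges
        out = np.where(known_mask, x, out)  # restore the observed entries
    return out


# Path graph 0 - 1 - 2 with two channels; NaN marks a missing feature.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
x = np.array([[1.0, np.nan], [np.nan, 2.0], [1.0, np.nan]])
imputed = feature_propagation(adj, x, ~np.isnan(x))
```

Because the normalized matrix in this sketch is symmetric, the weight applied from node i to node j equals the weight from j to i, and each feature column is diffused independently of the others.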
Moreover, FP diffuses observed features only channel-wise, which means that it does not consider any relationship between channels. Therefore, to better impute missing features in a graph, we propose to consider both inter-channel and inter-node relationships so that we can effectively exploit the sparsely known features. To this end, we design an elaborate feature imputation scheme that includes two processes. The first process is the feature recovery via channel-wise inter-node diffusion, and the second is the feature

