ACTIVE SAMPLING FOR NODE ATTRIBUTE COMPLETION ON GRAPHS

Abstract

Node attributes are a crucial kind of information on graphs, but real-world graphs often suffer from the attribute-missing problem, where the attributes of some nodes are missing while those of the other nodes are available. Restoring the missing attributes is valuable because it benefits downstream graph learning tasks. The popular GNN is not designed for this node attribute completion problem and is not capable of solving it. The recently proposed Structure-attribute Transformer (SAT) framework decouples the input of graph structures and node attributes through a distribution matching technique and can handle the problem properly. However, SAT treats all nodes with observed attributes equally and neglects the different contributions of different nodes to learning. In this paper, we propose a novel active sampling algorithm (ATS) to utilize the nodes with observed attributes more efficiently and better restore the missing node attributes. Specifically, ATS contains two metrics that measure the representativeness and uncertainty of each node's information by considering the graph structures, representation similarity and learning bias. These two metrics are then linearly combined by a Beta distribution controlled weighting scheme to determine which nodes are selected into the train set in the next optimization step. The ATS algorithm can be combined with the SAT framework, and the two are learned in an iterative manner. Through extensive experiments on 4 public benchmark datasets and two downstream tasks, we show the superiority of ATS in node attribute completion.



Node attributes, known as an important kind of information on graphs, play a vital role in many graph learning tasks. They boost the performance of Graph Neural Networks (GNN) Defferrard et al. (2016); Kipf & Welling (2017); Xu et al. (2019b); Veličković et al. (2018) in various domains, e.g. node classification Jin et al. (2021); Xu et al. (2019a) and community detection Sun et al. (2021); Chen et al. (2017). Meanwhile, node attributes provide human-perceptive demonstrations for non-Euclidean structured data Zhang et al. (2019); Li et al. (2021). Despite their indispensability, real-world graphs may have missing node attributes for various reasons Chen et al. (2022). For example, in citation graphs, key terms or detailed content of some papers may be inaccessible because of copyright protection. In social networks, profiles of some users may be unavailable due to privacy protection. When the attributes of only some nodes are observed, it is important to restore the missing attributes of the other nodes so as to benefit downstream graph learning tasks. This is the goal of the node attribute completion task.

2019) can potentially deal with this problem, but they rely on high-quality random walks and carefully designed sampling strategies, which are hard to guarantee Yang et al. (2019). The popular GNN framework takes graph structures and node attributes as a coupled input and can work on the node attribute completion problem through some attribute-filling tricks, but these tricks introduce noise in learning and lead to worse performance. In the last few years, researchers have begun to concentrate on the learning problem on attribute-missing graphs. Chen et al. (2022) propose a novel structure-attribute transformer (SAT) framework that can handle the node attribute completion case. SAT leverages structures and attributes in a decoupled scheme and achieves joint distribution modeling by matching the latent codes of structures and attributes.

Although SAT has shown great promise on the node attribute completion problem, it treats the nodes with observed attributes equally and ignores the different contributions of nodes in the learning schedule. Given limited nodes with observed attributes, it is all the more important to notice that different nodes carry different information (e.g. degrees, neighbours, etc.) and should have different importance in the learning process. Importance re-weighting Wang et al. (2017); Fang et al. (2020); Byrd & Lipton (2019) on the optimization objective may come to mind as a potential solution. However, the information of nodes is mutually influenced and follows more complex patterns. The importance distribution is implicit, intractable and rather complicated, which makes it very difficult to formulate. It is challenging to find a more practical way to exploit the different importance of the partial nodes with observed attributes at different learning stages.

In this paper, we propose an active sampling algorithm named ATS to better leverage the partial nodes with observed attributes and help the SAT model converge to a more desirable state. In particular, ATS measures the representativeness and uncertainty of node information on graphs to adaptively and gradually move nodes from the candidate set to the train set after each training epoch, and thus encourages the model to consider each node's importance in learning. The representativeness and uncertainty are designed by considering the graph structures, representation similarity and learning bias. Furthermore, it is interesting to find that learning prefers nodes of high representativeness and low uncertainty at the early stage, but nodes of low representativeness and high uncertainty at the late stage. Therefore, we propose a Beta distribution controlled weighting scheme to exert adaptive learning weights on representativeness and uncertainty. In this way, the two metrics are linearly combined into the final score that determines which nodes are selected into the train set in the next optimization epoch. The active sampling algorithm (ATS) and the SAT model are learned in an iterative manner until the model converges.

Our contributions are summarized as follows:
• In node attribute completion, to better leverage the partial nodes with observed attributes, we advocate using an active sampling algorithm to adaptively and gradually select samples into the train set in each optimization epoch and help the model converge to a better state.
• We propose a novel ATS algorithm to measure the importance of nodes by the designed representativeness and uncertainty metrics. Furthermore, when combining these two metrics into the final score function, we propose a Beta distribution controlled weighting scheme to better exert the power of representativeness and uncertainty in learning.
• We combine ATS with SAT, a recent node attribute completion model, and conduct extensive experiments on 4 public benchmarks. The experimental results show that our ATS algorithm can help SAT reach a better optimum and restore higher-quality node attributes that benefit downstream node classification and profiling tasks.

(2017). GNN can infer the distribution of nodes based on node attributes and edges and achieves impressive results on graph-related tasks. There are also numerous creative modifications of GNN.
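The selection step described in the introduction (a linear combination of representativeness and uncertainty, with a weight drawn from a Beta distribution) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the percentile normalization, the Beta(1, n_t) parameterization and the linear schedule for n_t are all assumptions made for illustration. The two metric vectors are taken as given inputs, since their exact definitions are not stated here.

```python
import random


def percentile_rank(values):
    """Map raw scores to ranks in (0, 1] so both metrics share one scale."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = (position + 1) / len(values)
    return ranks


def uncertainty_weight(epoch, total_epochs, rng, n_start=10.0, n_end=0.1):
    """Sample the uncertainty weight gamma from Beta(1, n_t).

    Assumed schedule: n_t shrinks linearly from n_start to n_end, so the
    expected weight 1 / (1 + n_t) grows as training proceeds.
    """
    progress = epoch / max(total_epochs - 1, 1)
    n_t = n_start + (n_end - n_start) * progress
    return rng.betavariate(1.0, n_t)


def select_nodes(representativeness, uncertainty, candidates, k, gamma):
    """Return the k candidate node ids with the highest combined score."""
    rep = percentile_rank(representativeness)
    unc = percentile_rank(uncertainty)
    # Linear combination: gamma weights uncertainty, (1 - gamma) weights
    # representativeness.
    scores = [(1.0 - gamma) * r + gamma * u for r, u in zip(rep, unc)]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in ranked[:k]]


# Toy example with hypothetical metric values for four candidate nodes.
rng = random.Random(0)
rep = [0.9, 0.1, 0.5, 0.7]
unc = [0.2, 0.8, 0.4, 0.6]
candidates = [10, 11, 12, 13]

early = select_nodes(rep, unc, candidates, k=2, gamma=0.0)  # [10, 13]
late = select_nodes(rep, unc, candidates, k=2, gamma=1.0)   # [11, 13]

# In a training loop, gamma would be resampled every epoch:
gamma = uncertainty_weight(epoch=5, total_epochs=100, rng=rng)
chosen = select_nodes(rep, unc, candidates, k=2, gamma=gamma)
```

With gamma near 0 the selection follows representativeness, matching the stated early-stage preference, and with gamma near 1 it follows uncertainty. Sampling gamma rather than fixing it keeps some randomness in which nodes enter the train set at each epoch.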

