SEEDGNN: GRAPH NEURAL NETWORK FOR SUPERVISED SEEDED GRAPH MATCHING

Abstract

There has been significant interest in designing Graph Neural Networks (GNNs) for seeded graph matching, which aims to match two (unlabeled) graphs using only topological information and a small set of seeds. However, most previous GNNs for seeded graph matching employ a semi-supervised approach, which requires a large number of seeds and cannot learn knowledge that transfers to unseen graphs. In contrast, this paper proposes a new supervised approach that learns from a training set how to match unseen graphs with only a few seeds. At the core of our SeedGNN architecture are two novel modules: 1) a convolution module that can easily learn to count and use witnesses at different hops; 2) a percolation module that can use easily matched pairs as new seeds to percolate and match other nodes. We evaluate SeedGNN on both synthetic and real graphs, and demonstrate significant performance improvements over both non-learning and learning algorithms in the existing literature. Further, our experiments confirm that the knowledge learned by SeedGNN from training graphs generalizes to test graphs of different sizes and categories.

1. INTRODUCTION

Graph matching, also known as network alignment, aims to find the node correspondence between two graphs that maximally aligns their edge sets. As a ubiquitous but challenging problem, graph matching has numerous applications, including social network analysis (Narayanan et al., 2008; 2009; Zafarani et al., 2015; Zhang et al., 2015a; b; Chiasserini et al., 2016), computer vision (Conte et al., 2004; Schellewald et al., 2005; Vento et al., 2013), natural language processing (Haghighi et al., 2005), and computational biology (Singh et al., 2008; Kazemi et al., 2016; Kriege et al., 2019). This paper focuses on seeded graph matching, where a small portion of the node correspondence between the two graphs is revealed as seeds, and we seek to complete the correspondence by growing from the few seeded node pairs. Seeded graph matching is motivated by the fact that, in many real applications, the correspondence between a small portion of the two node sets is naturally available. For example, in social network de-anonymization, users who explicitly link their accounts across different social networks can serve as seeds (Narayanan et al., 2008; 2009). Knowledge of even a few seeds has been shown to significantly improve the matching results for many real-world graphs (Kazemi et al., 2015; Fishkind et al., 2019). Recently, the Graph Neural Network (GNN) approach to graph matching has attracted much research attention. Although such a machine-learning-based approach usually does not come with provable theoretical guarantees, it has the potential to learn valuable features from a large set of training data. Unfortunately, GNNs have not yet been applied successfully to seeded graph matching.
Most previous GNNs for seeded graph matching are limited to a semi-supervised learning paradigm, which operates on only a single pair of graphs (Zhang et al., 2019; Li et al., 2019a; b; c; Zhou et al., 2019; Chen et al., 2020; Derr et al., 2021) and treats the seed set as the labelled training data. The goal is to learn useful features from the seed set, and then to generalize that knowledge to the remaining unseeded nodes. This semi-supervised learning, however, suffers from two major limitations. First, in order to obtain high matching accuracy, the set of seeds needs to be sufficiently large, which is often unrealistic in practice. Second, as this semi-supervised setting only learns within a given pair of graphs, it makes no attempt to transfer knowledge from one pair of graphs to other pairs of unseen graphs, which severely limits the potential of GNNs to distill common knowledge from a large set of training graphs. A natural but fundamental question is: Can we learn to match two graphs from only a few seeds while generalizing to unseen graphs? This paper provides an affirmative answer to this question. Specifically, we design a novel GNN architecture with a supervised approach, namely SeedGNN, that can learn from many examples of matched graph pairs, distill the knowledge into the trained model automatically, and then apply such knowledge to match unseen graph pairs with only a few seeds. In contrast to prior GNN approaches for seeded graph matching, which apply GNNs separately to each graph and learn a node embedding for each node (by aggregating neighborhood information within each individual graph), a key departure of our SeedGNN architecture is to apply the GNN jointly over the two graphs and to learn a pair-wise similarity for each pair of nodes directly. As we will discuss further below, this pair-wise GNN architecture is crucial for learning both useful features (from seeds) and the best way to synthesize them in different types of graphs.
(We note that this type of pair-wise GNN has been used in a supervised learning approach for seedless graph matching in Rolínek et al. (2020); Wang et al. (2021). However, it has not been used for seeded graph matching. See Section 2 for further discussion.) Numerical experiments on both synthetic and real-world graphs show that our SeedGNN significantly outperforms state-of-the-art algorithms, including both non-learning and learning-based ones, in terms of seed size requirement and matching accuracy. Moreover, our SeedGNN can generalize to match unseen graphs of sizes and types different from those in the training set. At the core of our SeedGNN are two innovative designs. One is the convolution module that learns to count "witnesses" at different hops, a notion that plays a pivotal role in seeded graph matching (Mossel et al., 2019). Here, the ℓ-hop witnesses of a node-pair are seeded pairs that lie in the neighborhood at ℓ hops. Naturally, a true pair is expected to have more witnesses than a fake pair. As we will further discuss in Section 4.1 and Section 4.2, our pair-wise SeedGNN architecture is much more effective than existing node-based GNNs in learning how to count witnesses, in a manner that can be easily generalized to unseen graphs. The second innovation is the percolation module that matches high-confidence node-pairs at one layer and propagates the matched node-pairs as new seeds to the subsequent layers, triggering a percolation process that matches a large number of node-pairs. However, we emphasize that it remains highly non-trivial how best to utilize either the witness or the percolation idea to achieve high matching accuracy. Indeed, when graphs are very sparse, even true node-pairs may not have enough witnesses if the number of hops ℓ is small; when graphs are very dense, a fake pair may also have many witnesses if ℓ is large. Similarly, a fake pair may be incorrectly propagated as a new seed, which can lead to many cascading errors.
The pair-wise architecture of SeedGNN is also crucial to facilitate learning how to best synthesize these two modules. As a result, our SeedGNN can potentially figure out which hops of witnesses are more reliable and what "cleaner" new seeds should be used to trigger the percolation process.
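
To make the notion of witnesses concrete, the following is a minimal illustrative sketch in Python; it is not the SeedGNN convolution module. It assumes unweighted, undirected graphs given as dense NumPy adjacency matrices, and it interprets "at ℓ hops" as shortest-path distance exactly ℓ, which is one possible reading of the definition above. The function names hop_distance and count_witnesses are hypothetical.

```python
import numpy as np

def hop_distance(adj):
    """Return the matrix of shortest-path hop counts (-1 if unreachable)."""
    n = adj.shape[0]
    dist = np.full((n, n), -1, dtype=int)
    np.fill_diagonal(dist, 0)
    reached = np.eye(n, dtype=bool)    # reached[i, k]: k already seen from i
    frontier = np.eye(n, dtype=bool)   # nodes at the current hop from i
    hop = 0
    while frontier.any():
        hop += 1
        # Nodes adjacent to the current frontier that were not seen before.
        nxt = (frontier.astype(int) @ adj > 0) & ~reached
        dist[nxt] = hop
        reached |= nxt
        frontier = nxt
    return dist

def count_witnesses(adj1, adj2, seeds, hops):
    """W[i, j] = number of seeds (u, v) with u exactly `hops` away from i
    in graph 1 and v exactly `hops` away from j in graph 2 (assumed reading)."""
    near1 = (hop_distance(adj1) == hops).astype(int)
    near2 = (hop_distance(adj2) == hops).astype(int)
    seed_mat = np.zeros((adj1.shape[0], adj2.shape[0]), dtype=int)
    for u, v in seeds:
        seed_mat[u, v] = 1
    return near1 @ seed_mat @ near2.T

# Toy example: the path graph 0-1-2-3 matched against itself, one seed (1, 1).
path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]])
w1 = count_witnesses(path, path, [(1, 1)], hops=1)
w2 = count_witnesses(path, path, [(1, 1)], hops=2)
```

In this toy example the true pair (3, 3) has no 1-hop witness and only gains one at 2 hops, while the fake pair (0, 2) already has a 1-hop witness. This small case reflects the difficulty noted above: in sparse graphs small hop counts may miss true pairs, and witnesses alone do not rule out fake pairs.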

2. FURTHER RELATED WORK

Theoretical Algorithms

Various seeded matching algorithms have been proposed based on hand-designed similarity metrics computed from local topological structures (Pedarsani et al., 2011; Yartseva et al., 2013; Korula et al., 2014; Chiasserini et al., 2016; Shirani et al., 2017; Mossel et al., 2019; Yu et al., 2021b). The theoretical analysis of these algorithms explains why particular features (e.g., witnesses (Korula et al., 2014) and percolation (Yartseva et al., 2013)) are valuable for graph matching. However, these theoretical algorithms may not synthesize different features in the most effective way. See the detailed discussion in Appendix A. In contrast, our SeedGNN can potentially figure out which combinations of features are most useful, and therefore it can potentially outperform known theoretical algorithms (see our experiments in Section 5 and Appendix C.3).
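
As an illustration of the percolation idea, the sketch below shows a simplified threshold-based procedure in Python, written in the spirit of percolation-style matching rather than as a faithful reproduction of any cited algorithm. Every matched pair gives one mark to each neighboring candidate pair, and a candidate pair that collects enough marks is matched and becomes a new seed. The function name percolate, the parameter name threshold, and the adjacency-dictionary input format are assumptions made for this example.

```python
from collections import defaultdict, deque

def percolate(nbrs1, nbrs2, seeds, threshold=2):
    """Grow a matching from seeds (simplified sketch).

    nbrs1, nbrs2: dicts mapping each node to the set of its neighbors.
    seeds: list of (u, v) pairs assumed to be correct matches.
    threshold: marks a candidate pair needs before it is matched (assumed).
    """
    matched1 = {u: v for u, v in seeds}   # graph-1 node -> graph-2 node
    matched2 = {v: u for u, v in seeds}   # graph-2 node -> graph-1 node
    marks = defaultdict(int)              # candidate pair -> mark count
    queue = deque(seeds)                  # matched pairs not yet spread
    while queue:
        u, v = queue.popleft()
        # Sorted iteration keeps the toy example deterministic.
        for i in sorted(nbrs1[u]):
            if i in matched1:
                continue
            for j in sorted(nbrs2[v]):
                if j in matched2:
                    continue
                marks[(i, j)] += 1
                if marks[(i, j)] >= threshold:
                    # Accept the pair and let it act as a new seed.
                    matched1[i] = j
                    matched2[j] = i
                    queue.append((i, j))
                    break  # i is matched; stop pairing it with other nodes
    return matched1

# Toy example: a five-node graph matched against an identical copy.
nbrs = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
full = percolate(nbrs, nbrs, [(0, 0), (1, 1)], threshold=2)
stuck = percolate(nbrs, nbrs, [(0, 0)], threshold=2)
```

With two seeds the process recovers all five correspondences in this toy graph, whereas with a single seed no candidate pair ever reaches two marks and the process stops immediately. The fixed threshold here is exactly the kind of hand-designed rule that a learned model could, in principle, replace with a data-driven criterion.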

GNN for Seedless Graph Matching

As we discussed earlier, most existing GNNs for seeded graph matching take a semi-supervised learning approach. In contrast, our SeedGNN takes a supervised learning approach, which aims to transfer knowledge from training graphs to unseen graphs. In the literature, such a supervised learning approach has been applied to seedless versions of the graph matching problem (Zanfir et al., 2018; Wang et al., 2019; 2021; 2020a; 2021; Jiang et al., 2022; Wang et al., 2020b; Fey et al., 2020; Rolínek et al., 2020; Gao et al., 2021; Yu et al., 2021c). For such seedless matching problems, non-topological node features are often assumed to be available. Thus, a node-based GNN is effective in learning how to extract useful node representations from high-quality non-topological node features. Unfortunately, in our own experience, it is not easy to design a node-based GNN that effectively utilizes seed information (see

