DETECTING SMALL QUERY GRAPHS IN A LARGE GRAPH VIA NEURAL SUBGRAPH SEARCH

Anonymous

Abstract

Recent advances have shown the success of using reinforcement learning and search to solve NP-hard graph-related tasks, such as Traveling Salesman Optimization, Graph Edit Distance computation, etc. However, it remains unclear how one can efficiently and accurately detect the occurrences of a small query graph in a large target graph, a core operation in graph database search, biomedical analysis, social group finding, etc. This task is called Subgraph Matching, which essentially performs a subgraph isomorphism check between a query graph and a large target graph. One promising approach to this classical problem is the "learning-to-search" paradigm, in which a reinforcement learning (RL) agent with a learned policy guides a search algorithm to quickly find the solution without any solved instances for supervision. However, for the specific task of Subgraph Matching, though the query graph given by the user as input is usually small, the target graph is often orders of magnitude larger. This poses challenges to the neural network design and can lead to solution and reward sparsity. In this paper, we propose NSUBS with two innovations to tackle these challenges: (1) a novel encoder-decoder neural network architecture to dynamically compute the matching information between the query and the target graphs at each search state; (2) a novel look-ahead loss function for training the policy network. Experiments on six large real-world target graphs show that NSUBS significantly improves subgraph matching performance.

1. INTRODUCTION

With the growing amount of graph data that naturally arises in many domains, solving graph-related tasks via machine learning has gained increasing attention. Many NP-hard tasks, e.g., Traveling Salesman Optimization (Xing & Tu, 2020), Graph Edit Distance computation (Wang et al., 2021), and Maximum Common Subgraph detection (Bai et al., 2021), have recently been tackled via learning-based methods. These works on the one hand rely on search to enumerate the large solution space, and on the other hand use reinforcement learning (RL) to learn a good search policy from training data, thus obviating the need for hand-crafted heuristics adopted by traditional solvers. Such a learning-to-search paradigm (Bai et al., 2021) also allows training the RL agent without any solved instances for supervision. However, how to design a neural network architecture under the RL-guided search framework remains unclear for the task of Subgraph Matching, which requires the detection of all occurrences of a small query graph in an orders-of-magnitude larger target graph. Subgraph Matching has wide applications in graph database search (Lee et al., 2012), knowledge graph query (Kim et al., 2015), biomedical analysis (Zhang et al., 2009), social group finding (Ma et al., 2018), quantum circuit design (Jiang et al., 2021), etc. As a concrete example, Subgraph Matching is used for protein complex search in a protein-protein interaction network to test whether the interactions within a protein complex in one species are also present in other species (Bonnici et al., 2013). Due to its NP-hard nature, the state-of-the-art Subgraph Matching algorithms rely on backtracking search with various techniques proposed to reduce the large search space (Sun & Luo, 2020; Kim et al., 2021; Wang et al., 2022).
However, these techniques are mostly driven by heuristics, and as a result, we observe that such solvers often fail to find any solution on large target graphs under a reasonable time limit, although they tend to work well on small graph pairs. We denote this phenomenon as solution sparsity. Solution sparsity requires the designed model to not only have enough capacity but also run efficiently under a limited computational budget. Another consequence of solution sparsity is that there can be little to no reward signal for the RL agent under an RL training framework (Silver et al., 2017), which we denote as reward sparsity. In this paper, we propose NSUBS with two means to address these challenges. First, we propose a novel graph encoder-decoder neural network to dynamically match the query graph with the target graph, performing the aggregation operation only on the query graph to reduce information loss. The novel encoder decouples the intra-graph message passing module (the "propagation" module), which yields state-independent node embeddings, from the inter-graph message passing module (the "matching" module), which refines the node embeddings via subgraph-to-graph matching. Thus, the intra-graph embeddings can be computed only once at the beginning of search for efficient inference. We further advance the inter-graph message passing by propagating only between nodes that either are already matched or can be matched in the future, by running a local candidate search space computation algorithm at each search state. This algorithm leverages the key requirement of Subgraph Matching that every node and edge in the query graph must be matched to the target graph, and therefore reduces the candidate set from all the nodes in the target graph to a much smaller set.
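The local candidate computation described above can be sketched as follows. This is an illustrative simplification, not the paper's exact algorithm: the function names, the adjacency-list representation (`adj_q`, `adj_g`), and the label dictionaries are all assumptions. It shows the core pruning idea: a target node is a candidate for a query node u only if it has the right label, is unused, and is consistent with every already-matched neighbor of u.

```python
def local_candidates(u, matching, adj_q, adj_g, label_q, label_g):
    """Target nodes that could still be matched to query node u,
    given a partial match `matching` (query node -> target node)."""
    # Start from all target nodes carrying the same label as u ...
    cands = {v for v in adj_g if label_g[v] == label_q[u]}
    # ... then require consistency with every already-matched neighbor of u:
    # each query edge (u, u') must map to an edge in the target graph.
    for u_prime, v_prime in matching.items():
        if u_prime in adj_q[u]:
            cands &= set(adj_g[v_prime])
    # Injectivity: exclude target nodes that are already used.
    cands -= set(matching.values())
    return cands
```

Because every candidate must simultaneously be a neighbor of all matched partners, the set typically shrinks far below |V_G| after only one or two matched pairs, which is what makes restricting inter-graph message passing to these pairs cheap.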
Compared with a Graph Matching Network (Li et al., 2019), which computes all the pairwise node-to-node message passing between two input graphs, our matching module is able to focus on only the node pairs that can contribute to the solution, and is thus both more effective and more efficient. In addition, we propose sampling subgraphs to obtain ground-truth subgraph-to-graph node-node mappings to alleviate the reward sparsity issue during training. We design a novel look-ahead loss function where the positive node-node pairs of a state are augmented with the positive node-node pairs of future states, boosting the amount of training signal at each search state. Experiments on synthetic and real graph datasets demonstrate that NSUBS outperforms baseline solvers in terms of effectiveness by a large margin. Our contributions can be summarized as follows:
• We address the challenging yet important task of Subgraph Matching, which has a vast amount of practical applications, and propose NSUBS as the solution.
• One key novelty is a proposed encoder layer consisting of a propagation module and a matching module that dynamically passes information between the input graphs.
• We conduct extensive experiments on real-world graphs to demonstrate the effectiveness of the proposed approach compared against a series of strong baselines in Subgraph Matching.
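To make the look-ahead idea concrete, the following is a minimal sketch of how augmenting positives with future-state pairs increases per-state training signal. The exact loss in the paper is not reproduced here; the softmax cross-entropy form, the `scores` dictionary of policy logits over candidate pairs, and all names are assumptions for illustration only.

```python
import math

def look_ahead_loss(scores, current_positive, future_positives):
    """Sketch of a look-ahead policy loss: the positive (query, target)
    pair of the current state is augmented with the ground-truth pairs
    of future states along a sampled solution, so one state contributes
    several positive terms instead of one."""
    positives = {current_positive} | set(future_positives)
    z = sum(math.exp(s) for s in scores.values())  # softmax partition
    # Negative log-likelihood summed over all (current + future) positives
    # that appear among this state's candidate pairs.
    return -sum(math.log(math.exp(scores[p]) / z)
                for p in positives if p in scores)
```

Under this sketch, a state whose candidate set contains k future ground-truth pairs yields k + 1 positive terms rather than one, which is the intended mitigation of reward sparsity.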

2.1. PROBLEM DEFINITION

We denote a query graph as q = (V_q, E_q) and a target graph as G = (V_G, E_G), where V and E denote the node and edge sets. q and G are associated with a node labeling function L_g which maps every node to a label l in a label set Σ. Subgraph: For a subset of nodes S of V_q, q[S] denotes the subgraph of q with node set S and an edge set consisting of all the edges in E_q that have both endpoints in S. In this paper, we adopt the definition of non-induced subgraph. Subgraph isomorphism: q is subgraph isomorphic to G if there exists an injective node-to-node mapping M : V_q → V_G such that (1) ∀u ∈ V_q, L_g(u) = L_g(M(u)); and (2) ∀e_(u,u′) ∈ E_q, e_(M(u),M(u′)) ∈ E_G. Subgraph Matching: The task of Subgraph Matching aims to find all subgraphs of G that are isomorphic to q. We call M a solution, or a match, of q to G. A pair (q, G) is solved if an algorithm can find any match under a given time limit, which we find challenging for existing solvers in our experiments, especially on large target graphs. For solved pairs, the number of subgraphs found by an algorithm is reported.
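The two conditions of the subgraph isomorphism definition, together with injectivity of M, translate directly into a verification routine. The sketch below assumes edge sets of node-pair tuples and per-graph label dictionaries (all names are illustrative); note how the non-induced definition allows extra target edges between mapped nodes.

```python
def is_match(M, q_edges, g_edges, label_q, label_g):
    """Check that mapping M (query node -> target node) is a valid
    non-induced subgraph-isomorphism match per the definition above."""
    # M must be injective.
    if len(set(M.values())) != len(M):
        return False
    # Condition (1): labels must agree under M.
    if any(label_q[u] != label_g[M[u]] for u in M):
        return False
    # Condition (2): every query edge maps to a target edge. Non-induced:
    # target edges between mapped nodes with no query counterpart are fine.
    return all((M[u], M[v]) in g_edges or (M[v], M[u]) in g_edges
               for u, v in q_edges)
```

Verification is cheap (linear in |V_q| + |E_q|); the hardness of Subgraph Matching lies entirely in searching the space of candidate mappings M.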

2.2. RELATED WORK

Non-learning methods on Subgraph Matching Existing methods on Subgraph Matching can be broadly categorized into backtracking search algorithms (Shang et al., 2008; He & Singh, 2008; Han et al., 2013; 2019; Kim et al., 2021; Wang et al., 2022) and multi-way join approaches (Lai et al., 2015; 2016; 2019; Kankanamge et al., 2017). The former category employs branch-and-bound search to grow the solution from an empty subgraph, seeking one matching node pair at a time in a strategic order until the entire search space is explored. The multi-way join approaches rely on decomposing the query graph into nodes and edges and performing join
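The backtracking-search paradigm described above can be sketched in a few lines. This is a deliberately naive skeleton under assumed data structures (adjacency lists, label dictionaries); real solvers differ precisely in the matching order and in much stronger candidate filtering than the label-and-neighbor check shown here.

```python
def backtrack(matching, q_nodes, adj_q, adj_g, label_q, label_g, solutions):
    """Minimal backtracking skeleton for Subgraph Matching: grow the
    partial match one (query, target) pair at a time, pruning pairs that
    violate label or edge constraints, and backtrack on dead ends."""
    if len(matching) == len(q_nodes):
        solutions.append(dict(matching))  # a complete match found
        return
    u = next(n for n in q_nodes if n not in matching)  # naive order
    for v in adj_g:
        if v in matching.values() or label_g[v] != label_q[u]:
            continue  # injectivity / label pruning
        # Every already-matched neighbor of u must map to a neighbor of v.
        if all(matching[w] in adj_g[v] for w in adj_q[u] if w in matching):
            matching[u] = v
            backtrack(matching, q_nodes, adj_q, adj_g,
                      label_q, label_g, solutions)
            del matching[u]  # undo and try the next candidate
```

Every heuristic solver cited above instantiates this loop with a carefully chosen matching order and candidate filters; the learning-to-search framing replaces those hand-crafted choices with a learned policy.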

