LEARNING TO SEARCH FOR FAST MAXIMUM COMMON SUBGRAPH DETECTION

Abstract

Detecting the Maximum Common Subgraph (MCS) between two input graphs is fundamental for applications in biomedical analysis, malware detection, cloud computing, etc. This is especially important in the task of drug design, where the successful extraction of common substructures in compounds can reduce the number of experiments that humans need to conduct. However, MCS computation is NP-hard, and state-of-the-art MCS solvers rely on search heuristics that in practice cannot find good solutions for large graph pairs under a limited search budget. Here we propose GLSEARCH, a Graph Neural Network based model for MCS detection, which learns to search. Our model uses a state-of-the-art branch and bound algorithm as the backbone search algorithm to extract subgraphs by selecting one node pair at a time. In order to make a better node selection decision at each step, we replace the node selection heuristics with a novel task-specific Deep Q-Network (DQN), allowing the search process to find larger common subgraphs faster. To enhance the training of the DQN, we leverage the search process to provide supervision in a pre-training stage and guide our agent during an imitation learning stage. Our framework thus allows search and reinforcement learning to mutually benefit each other. Experiments on synthetic and real-world large graph pairs demonstrate that our model outperforms state-of-the-art MCS solvers and neural graph matching network models.

1. INTRODUCTION

Due to the flexible and expressive nature of graphs, designing machine learning approaches to solve graph tasks is gaining increasing attention from researchers. Among various graph tasks, detecting the largest subgraph that is commonly present in both input graphs, known as Maximum Common Subgraph (MCS) (Bunke & Shearer, 1998) (as shown in Figure 1), is an important yet particularly hard task. MCS naturally encodes the degree of similarity between two graphs, is domain-agnostic, and thus arises in many domains such as software analysis (Park et al., 2013), graph database systems (Yan et al., 2005) and cloud computing platforms (Cao et al., 2011). In drug design, the manual testing of the effects of a new drug is known to be a major bottleneck, and the identification of compounds that share common or similar subgraphs, which tend to have similar properties, can effectively reduce the manual labor (Ehrlich & Rarey, 2011).

MCS detection is NP-hard and is thus a very challenging task. On one hand, the state-of-the-art exact MCS detection algorithms based on branch and bound run in exponential time in the worst case (Liu et al., 2019). What is worse, they rely on several heuristics on how to explore the search space. For example, MCSP (McCreesh et al., 2017) uses node degree as its heuristic by choosing high-degree nodes to visit first, but in many cases the true MCS contains small-degree nodes. On the other hand, existing machine learning approaches to graph matching such as Wang et al. (2019) and Bai et al. (2020b) either do not address the MCS detection task directly or rely on labeled data requiring the pre-computation of MCS results by running exact solvers.

In this paper, we present GLSEARCH (Graph Learning to Search), a general framework for MCS detection combining the advantages of search and reinforcement learning.
GLSEARCH learns to search by adopting a Deep Q-Network (DQN) (Mnih et al., 2015) to replace the node selection heuristics required by state-of-the-art MCS solvers, leading to faster arrival at the optimal solution for an input graph pair, which is particularly useful when the simpler heuristics fail and graphs are large with a limited search budget. Thanks to the learning capacity of Graph Neural Networks (GNN), our DQN is specially designed for the MCS detection task with a novel reformulation of DQN to better capture the effect of different node selections. Given the large action space incurred by large graph pairs, to enhance the training of the DQN, we leverage the search algorithm to not only provide supervised signals in a pre-training stage but also offer guidance during an imitation learning stage. Experiments on real graph datasets that are significantly larger than the existing datasets adopted by state-of-the-art MCS solvers demonstrate that GLSEARCH outperforms baseline solvers and machine learning models for graph matching in terms of effectiveness by a large margin.

Our contributions can be summarized as follows:

• We address the challenging yet important task of Maximum Common Subgraph detection for general-domain input graph pairs and propose GLSEARCH as the solution.
• The key novelty is the DQN which learns to search. Specifically, it is trained under the reinforcement learning framework to make the best decision at each search step in order to quickly find the best MCS solution during search. The search in turn helps the training of the DQN in a pre-training stage and an imitation learning stage.
• We conduct extensive experiments on medium-size synthetic graphs and very large real-world graphs to demonstrate the effectiveness of the proposed approach compared against a series of strong baselines in MCS detection and graph matching.
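To make the contrast between a hand-crafted heuristic and a learned policy concrete, the following is a minimal sketch of growing a common subgraph one node pair at a time with a pluggable scoring function. It is an illustration under stated assumptions, not the search used by MCSP or GLSEARCH: it is a greedy loop without backtracking (so it has no optimality guarantee), it ignores node labels, and all names (`degree_score`, `candidate_pairs`, `greedy_common_subgraph`) are hypothetical. A learned Q-function could be passed in place of `degree_score`.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]   # undirected graph as adjacency sets
Pair = Tuple[Hashable, Hashable]        # (node in G1, node in G2)
Mapping = Dict[Hashable, Hashable]      # selected pairs: G1 node -> G2 node
Scorer = Callable[[Graph, Graph, Mapping, Pair], float]


def degree_score(g1: Graph, g2: Graph, mapping: Mapping, pair: Pair) -> float:
    """Hand-crafted stand-in: prefer pairs of high-degree nodes."""
    u, v = pair
    return float(len(g1[u]) + len(g2[v]))


def candidate_pairs(g1: Graph, g2: Graph, mapping: Mapping) -> List[Pair]:
    """Pairs that keep the selection a connected common induced subgraph."""
    used_in_g2 = set(mapping.values())
    pairs: List[Pair] = []
    for u in g1:
        if u in mapping:
            continue
        # Connectivity: once a solution exists, u must touch it in G1.
        if mapping and not any(a in g1[u] for a in mapping):
            continue
        for v in g2:
            if v in used_in_g2:
                continue
            # Induced-subgraph consistency: u and v must agree on adjacency
            # to every already selected pair (a, b).
            if all((a in g1[u]) == (b in g2[v]) for a, b in mapping.items()):
                pairs.append((u, v))
    return pairs


def greedy_common_subgraph(
    g1: Graph, g2: Graph, score: Scorer = degree_score
) -> Mapping:
    """Greedy (no backtracking) growth of a common induced subgraph."""
    mapping: Mapping = {}
    while True:
        pairs = candidate_pairs(g1, g2, mapping)
        if not pairs:
            return mapping
        # The scorer decides which node pair to select at this step.
        u, v = max(pairs, key=lambda p: score(g1, g2, mapping, p))
        mapping[u] = v


# Toy input: a triangle with a pendant node versus a plain triangle.
g1 = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
g2 = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
result = greedy_common_subgraph(g1, g2)
```

On this toy pair the degree-based scorer recovers the shared triangle, while a scorer that prefers low-degree nodes starts from the pendant node and stalls at two nodes. Without backtracking, the selection order alone determines the result, which is the decision a learned policy is meant to improve.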

2.1. PROBLEM DEFINITION

We denote a graph as $G = (V, E)$, where $V$ and $E$ denote the vertex set and edge set. An induced subgraph is defined as $G_s = (V_s, E_s)$, where $E_s$ preserves all the edges between nodes in $V_s$, i.e. $\forall i, j \in V_s$, $(i, j) \in E_s$ if and only if $(i, j) \in E$. In this paper, we aim at detecting the Maximum Common induced Subgraph (MCS) between an input graph pair, denoted as $\mathrm{MCS}(G_1, G_2)$, which is the largest induced subgraph that is contained in both $G_1$ and $G_2$. In addition, we require $\mathrm{MCS}(G_1, G_2)$ to be a connected subgraph. We allow the nodes of input graphs to be labeled, in which case the labels of nodes in the MCS must match, as in Figure 1. Graph isomorphism and subgraph isomorphism can be regarded as two special tasks of MCS: $|\mathrm{MCS}(G_1, G_2)| = |V_1| = |V_2|$ if $G_1$ is isomorphic to $G_2$, and $|\mathrm{MCS}(G_1, G_2)| = \min(|V_1|, |V_2|)$ when $G_1$ (or $G_2$) is subgraph isomorphic to $G_2$ (or $G_1$).

Figure 1: Left: For graph pair $(G_1, G_2)$ with node labels, the induced connected MCS is the five-member ring structure highlighted in the circle. Right: At this step, two nodes are currently selected. According to whether each node is connected to the two selected nodes or not, the nodes not in the current solution are split into three bidomains (Section 2.2), denoted as "00", "01", and "10", where each digit refers to one of the two selected nodes: "0" indicates that the node is not connected to that selected node, and "1" indicates that it is connected. For example, each node in the "10" bidomain is connected to the top "C" node in the subgraph and disconnected from the bottom "C" node in the subgraph.

2.2. SEARCH ALGORITHM FOR MCS

Among various algorithms for MCS, we adopt the state-of-the-art search-based algorithm in our framework. The basic version, MCSP, is presented in McCreesh et al. (2017), and the more advanced version, MCSP+RL, is proposed in Liu et al. (2019). The whole search algorithm, outlined in

