DESCO: TOWARDS SCALABLE DEEP SUBGRAPH COUNTING

Abstract

Subgraph counting is the problem of determining the number of occurrences of a given query graph in a large target graph. Despite being a #P-complete problem, subgraph counting is a crucial graph analysis method in domains ranging from biology and social science to risk management and software analysis. However, the runtime of existing exact counting methods grows combinatorially as target and query sizes increase. Existing approximate heuristic methods and neural approaches fall short in accuracy due to high label dynamic range, limited model expressive power, and inability to predict the distribution of subgraph counts in the target graph. Here we propose DeSCo, a neural deep subgraph counting framework, which aims to accurately predict the count and distribution of query graphs on any given target graph. DeSCo uses canonical partition to divide the large target graph into small neighborhood graphs and predicts the canonical count objective on each neighborhood. The proposed partition method avoids missing or double-counting any patterns of the target graph. A novel subgraph-based heterogeneous graph neural network is then used to improve the expressive power. Finally, gossip correction improves counting accuracy via prediction propagation with learnable weights. Compared with state-of-the-art approximate heuristic and neural methods, DeSCo achieves a 437× improvement in the mean squared error of count prediction and benefits from polynomial runtime complexity.

1. INTRODUCTION

Given a query graph and a target graph, the problem of subgraph counting is to count the number of patterns, defined as subgraphs of the target graph, that are graph-isomorphic to the query graph Ribeiro et al. (2021). Subgraph counting is crucial for domains including biology Takigawa & Mamitsuka (2013); Solé & Valverde (2008); Adamcsek et al. (2006); Bascompte & Melián (2005); Bader & Hogue (2003), social science Uddin et al. (2013); Prell & Skvoretz (2008); Kalish & Robins (2006); Wasserman et al. (1994), risk management Ribeiro et al. (2017); Akoglu & Faloutsos (2013), and software analysis Valverde & Solé (2005); Wu et al. (2018). While being an essential method in graph and network analysis, subgraph counting is a #P-complete problem Valiant (1979). Due to the computational complexity, existing exact counting algorithms are restricted to small query graphs with no more than 5 vertices Pinar et al. (2017); Ortmann & Brandes (2017); Ahmed et al. (2015). The commonly used VF2 algorithm Cordella et al. (2004) fails to count even a single 5-node chain query within a one-week time budget on the large target graph Astro Leskovec et al. (2007), which has nineteen thousand nodes.

Luckily, approximate counting of query graphs is sufficient in many real-world use cases Iyer et al. (2018); Kashtan et al. (2004); Ribeiro & Silva (2010). Approximation methods can scale to large targets by substructure sampling, random walk, and color-based sampling, allowing estimation of the frequency of query graph occurrences. Very recently, Graph Neural Networks (GNNs) have been employed as a deep learning-based approach to subgraph counting Zhao et al. (2021); Liu et al. (2020); Chen et al. (2020). The target graph and the query graph are embedded via a GNN, which predicts the motif count through a regression task.

Proposed work. To resolve the above challenges, we propose DeSCo, a GNN-based model that learns to predict both pattern counts and distribution on any target graph. The main idea of DeSCo is to leverage and organize local information of neighborhood patterns to predict query count and distribution in the entire target graph. DeSCo first uses canonical partition to decompose the target graph into small neighborhoods without missing or double-counting any patterns. The local information is then encoded using a GNN with subgraph-based heterogeneous message passing. Finally, we perform gossip correction to improve counting accuracy. Our contributions are three-fold.

Canonical partition. Firstly, we propose a novel divide-and-conquer scheme called canonical partition to decompose the problem into subgraph counting for individual neighborhoods. The canonical partition ensures that no pattern will be double-counted or missed over all neighborhoods. The algorithm allows the model to make accurate predictions even with the high dynamic range of labels and enables subgraph count distribution prediction for the first time.

Subgraph-based heterogeneous message passing. Secondly, we propose a general approach to enhance the expressive power of any MPGNN by encoding the subgraph structure through heterogeneous message passing. The message type is determined by whether the edge is present in a certain subgraph, e.g., a triangle. We theoretically prove that its expressive power can exceed the upper bound of that of MPGNNs. We show that this architecture outperforms expressive GNNs, including GIN Xu et al. (2018) and ID-GNN You et al. (2021).

Gossip correction. We overcome the challenge of accurate count prediction by utilizing two inductive biases of the counting problem: homophily and antisymmetry. Real-world graphs share similar patterns among adjacent nodes, as shown in Figure 1. Furthermore, since canonical count depends
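
The edge-typing idea behind the subgraph-based heterogeneous message passing contribution can be illustrated with a minimal sketch. It assumes a simple undirected graph and uses the triangle as the reference subgraph, as in the example given in the text. The function name `triangle_edge_types` and the labels "triangle" and "plain" are hypothetical names chosen for illustration; how DeSCo consumes these types inside the GNN is not specified in this section.

```python
def triangle_edge_types(edges):
    """Label each undirected edge by whether it lies in at least one triangle.

    Assumption: the graph is simple (no self-loops, no repeated edges).
    The labels 'triangle' and 'plain' are illustrative names only.
    """
    # Build an adjacency map: node -> set of neighbors.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    types = {}
    for u, v in edges:
        # Edge (u, v) lies in a triangle iff u and v share a common neighbor.
        in_triangle = bool(adj[u] & adj[v])
        types[frozenset((u, v))] = "triangle" if in_triangle else "plain"
    return types


# Toy graph: a triangle 0-1-2 with a pendant edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
edge_types = triangle_edge_types(edges)
```

In a heterogeneous message-passing layer, each edge type would typically get its own message function, so that messages sent along triangle edges can be distinguished from messages sent along other edges.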

Figure 1: The ground truth and predicted count distributions of different query graphs over the target graph CiteSeer, a citation network. The hotspots are where the patterns appear most often in the target graph. The hotspots of k-chains represent overlapping linear citation chains, indicating original publications that motivate multiple future directions of incremental contributions. The hotspots of k-cliques indicate research focuses, containing publications from a small subdivision that build upon all prior publications.

Figure 1 demonstrates DeSCo's predictions on the query graph count distribution of a citation network. The count hotspots of different queries can indicate citation patterns of different scientific communities Gao & Lafferty (2017); Yang et al. (2015), which shed light on the research impact of works in this network.
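
A per-node count distribution of the kind discussed here can be made concrete with a small brute-force sketch. It fixes the query to a triangle for brevity and attributes each pattern to the highest-numbered node it contains. That attribution rule is an assumption made for illustration, since this section only states that the canonical partition avoids missing or double-counting patterns. All function names are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Return a node -> neighbor-set map for a simple undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def find_triangles(adj):
    """Enumerate all triangles by brute force over sorted node triples."""
    return [
        (a, b, c)
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    ]


def canonical_triangle_counts(adj):
    """Attribute each triangle to exactly one node (assumed rule: max node id)."""
    counts = {node: 0 for node in adj}
    for tri in find_triangles(adj):
        counts[max(tri)] += 1
    return counts


# Toy target graph: a 4-clique on nodes 0..3 plus a pendant edge 3-4.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
adj = build_adjacency(edges)

total = len(find_triangles(adj))            # global triangle count
canonical = canonical_triangle_counts(adj)  # per-node distribution
hotspot = max(canonical, key=canonical.get) # node with the highest count
```

Because every triangle is assigned to exactly one node, the per-node counts sum to the global count, and the node with the largest count plays the role of a hotspot in the distribution. The brute-force enumeration is only practical for toy graphs.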

